Article

Distinctive Image Features from Scale-Invariant Keypoints

Authors:
David G. Lowe

Abstract

This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
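As a rough illustration of the first stage described above (keypoint detection and descriptor extraction), the following minimal sketch uses OpenCV's SIFT implementation rather than the author's original code; the image path is a placeholder and cv2.SIFT_create is assumed to be available (OpenCV 4.4 or later).

```python
# Minimal SIFT extraction sketch using OpenCV (not the paper's original code).
# "object.png" is a placeholder image path.
import cv2

img = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                         # DoG detector + 128-D descriptors
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), descriptors.shape)         # N keypoints -> (N, 128) array
```

The matching, Hough-transform clustering, and least-squares verification stages of the recognition pipeline then operate on descriptors of this form.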


... Multiple papers in feature based methods [10], [68] emphasized the incorporation of both local (e.g., SIFT features [72]) and global features (e.g., color histograms, GIST [73]) for matching. Global features provide broader contextual information, helping mitigate issues like illumination variation by focusing on the overall structure or color distribution rather than just local details. ...
... Similarly, Workman et al. [59] demonstrated the superiority of deep learning approaches by using pre-trained CNNs, such as those trained on ImageNet and Places datasets [80], to extract discriminative features from both aerial and ground-level images. Their experiments revealed that deep learning models outperformed traditional feature descriptors such as SIFT [72] and GIST [73] in tasks such as region classification and cross-view image matching. Notably, the Places-CNN [80] features yielded a localization accuracy of 85.1%, significantly higher than the 81.7% achieved by GIST features [59]. ...
... GIST, a handcrafted feature descriptor, focuses on capturing the rough spatial distribution of image gradients [73]. While its ability to describe distinctive local features invariant to scale, rotation, and illumination changes is noteworthy [72], its performance has been eclipsed by deep learning models. Handcrafted features such as GIST and SIFT were useful in constrained settings but are sensitive to viewpoint changes, lighting, and occlusion. ...
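As a loose illustration of the local-plus-global pairing the excerpts above describe, the sketch below computes a coarse color-histogram global descriptor alongside SIFT local features with OpenCV; the file name and bin counts are arbitrary choices, not taken from any of the surveyed methods.

```python
# Hedged sketch: a global color histogram paired with local SIFT features.
import cv2

img = cv2.imread("street_view.jpg")              # placeholder query image

# Global feature: coarse 3-D color histogram over the whole image.
hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                    [0, 256, 0, 256, 0, 256])
global_desc = cv2.normalize(hist, hist).flatten()    # 512-D global descriptor

# Local features: SIFT keypoints and descriptors for fine-grained matching.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
keypoints, local_desc = cv2.SIFT_create().detectAndCompute(gray, None)
```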
Article
Full-text available
Cross-view image geo-localization seeks to identify the geospatial location where a query image (i.e., street view image) was captured by matching it to a database of geo-tagged reference images such as satellite or aerial images. This problem has garnered notable attention in the realm of computer vision, spurred by the widespread availability of copious geo-tagged datasets and the advancements in machine learning techniques. This paper provides a thorough survey of cutting-edge methodologies, techniques, and associated challenges that are integral to this domain, with a focus on feature-based and deep learning strategies. Feature-based methods capitalize on unique features to establish correspondences across disparate viewpoints, whereas deep learning-based methodologies deploy neural networks (convolutional or transformer-based) to embed view-invariant attributes. This work also investigates the multifaceted challenges encountered in cross-view geo-localization (CVGL), such as variations in viewpoints and illumination, and the occurrence of occlusions, and it elucidates innovative solutions that have been formulated to tackle these issues. Furthermore, we document the benchmark datasets and relevant evaluation metrics and also perform a comparative analysis of state-of-the-art techniques. Finally, we conclude the paper with a discussion on prospective avenues for future research and the burgeoning applications of CVGL in an intricately interconnected global landscape.
... These image features incorporate expert knowledge, and can guide the model's learning process when introduced to MAE. To demonstrate that, we conduct a study on popular features for multispectral and SAR imagery: 1) CannyEdge [17], 2) histograms of oriented gradients (HOG) [18], 3) scale-invariant feature transform (SIFT) [19], and 4) normalized difference indices (NDI) [20,21,22]. Examples are shown in Figure 1. ...
... Taking into account both the representativeness and the computing efficiency, we consider two categories and four types of RS image features: spatially, 1) CannyEdge [17], 2) HOG [18], and 3) SIFT [19]; spectrally, 4) NDI, including vegetation index [20], water index [21] and built-up index [22]. ...
... SIFT Scale-invariant feature transform (SIFT) [19] is a feature descriptor that is used to extract distinctive and invariant local features from images. It works by detecting key points in an image that are invariant to scale, rotation, and illumination changes. ...
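For concreteness, the following sketch computes HOG features of the kind used as reconstruction targets above, via scikit-image; the parameter values are generic defaults rather than those of the cited paper.

```python
# Illustrative HOG feature computation with scikit-image (generic parameters).
from skimage import data
from skimage.feature import hog

image = data.camera()                            # stand-in for one image band
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)
print(features.shape)                            # flattened HOG descriptor
```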
Article
Full-text available
Self-supervised learning guided by masked image modeling, such as Masked AutoEncoder (MAE), has attracted wide attention for pretraining vision transformers in remote sensing. However, MAE tends to excessively focus on pixel details, limiting the model's capacity for semantic understanding, particularly for noisy Synthetic Aperture Radar (SAR) images. In this paper, we explore spectral and spatial remote sensing image features as improved MAE-reconstruction targets. We first conduct a study on reconstructing various image features, all performing comparably well or better than raw pixels. Based on such observations, we propose Feature Guided Masked Autoencoder (FG-MAE): reconstructing a combination of Histograms of Oriented Gradients (HOG) and Normalized Difference Indices (NDI) for multispectral images, and reconstructing HOG for SAR images. Experimental results on three downstream tasks illustrate the effectiveness of FG-MAE with a particular boost for SAR imagery (e.g. up to 5% better than MAE on EuroSAT-SAR). Furthermore, we demonstrate the well-inherited scalability of FG-MAE and release a first series of pretrained vision transformers for medium-resolution SAR and multispectral images.
... For estimating poses from image sets or continuous video streams, traditional methods typically rely on identifying correspondences between hand-craft local features [3,32,38,60], which then underwent validation and refinement with RANSAC and bundle adjustment. Recently, researchers have attempted to obtain correspondences via deep learning techniques, greatly improving the performance [10,53,56]. ...
... Baselines: To assess the effectiveness of our proposed framework, we compare our method against three kinds of baseline methods. The first category is the matching-based pose estimation methods, including LoFTR [56] and SIFT [38]. Matching-based methods first build dense [56] or sparse [38] correspondences and then solve the relative pose from correspondences by a RANSAC algorithm. ...
... The first category is the matching-based pose estimation methods, including LoFTR [56] and SIFT [38]. Matching-based methods first build dense [56] or sparse [38] correspondences and then solve the relative pose from correspondences by a RANSAC algorithm. For the SIFT-based method, we also consider incorporating single-view depth estimation methods like Zoe-Depth [4] to solve the pose directly by a 2D-to-3D PnP algorithm or a 3D-to-3D Procrustes algorithm. ...
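A minimal sketch of the SIFT-plus-RANSAC baseline style described above is given below; it assumes a known intrinsic matrix K and placeholder image paths, recovers a relative pose from ratio-tested SIFT correspondences via an essential matrix, and is not the paper's exact pipeline.

```python
# Hedged sketch: relative pose from SIFT correspondences with RANSAC.
import cv2
import numpy as np

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[700.0, 0.0, 320.0],                      # assumed intrinsics
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Sparse correspondences via ratio-tested nearest-neighbour matching.
knn = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.8 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Essential matrix with RANSAC, then decomposition into R, t (up to scale).
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
```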
Preprint
In this paper, we present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and necessitate extensive training data, our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by utilizing a diffusion model to generate novel-view images and conducting a two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques across both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be released at https://github.com/scy639/Gen2SM.
... with the similarity of feature descriptors such as SIFT (Lowe, 2004) and SuperPoint (DeTone et al., 2018) ...
... (iv) The CPC dataset (Wilson & Snavely, 2014) includes unstructured images of landmarks collected from Flickr, where the image pairs are wide-baseline with different resolutions. The benchmark randomly selects 1000 image pairs from each dataset and establishes correspondences with SIFT (Lowe, 2004) and the SNN ratio test (Lowe, 1999). The classic kusvod (Lebeda et al., 2012) dataset contains 16 image pairs of both weak and strong perspectives, with ground-truth fundamental matrices and manually selected inliers available. ...
... The OpenCV SIFT (Lowe, 2004) detector and descriptor are adopted with nearest neighbor (NN) matching to establish putative correspondences. In each image, the maximum number of keypoints is limited to 4000. ...
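A small sketch of that setup is shown below, assuming the same 4000-keypoint cap and plain nearest-neighbour matching; the file names are placeholders and the cross-check option is one common way to form putative matches, not necessarily the cited configuration.

```python
# Hedged sketch: OpenCV SIFT capped at 4000 keypoints with NN matching.
import cv2

sift = cv2.SIFT_create(nfeatures=4000)           # limit keypoints per image
img_a = cv2.imread("pair_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("pair_b.png", cv2.IMREAD_GRAYSCALE)
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

# Mutual (cross-checked) nearest neighbours on L2 descriptor distance.
matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
putative = matcher.match(des_a, des_b)
```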
Article
Full-text available
This paper studies graph clustering with application to feature matching and proposes an effective method, termed as GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges where a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometric meaningful graph, based on the best practices, we adopt a local affine strategy. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachable on the graph by replicator dynamics optimization. Extensive experiments focusing on both the local and the whole of our GC-LAC with various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multimodel fitting, demonstrate that our GC-LAC is more competitive than current state-of-the-art methods, in terms of generality, efficiency, and effectiveness. The source code for this work is publicly available at: https://github.com/YifanLu2000/GCLAC.
... The visual localization methods use image matching between descent images (image-image matching) and image matching between the descent image and the Digital Orthophoto Map (DOM) of the landing site (image-map matching) to calculate the absolute position of the landing point. Sequence image matching primarily employs feature point matching methods [7]-[12] such as SIFT [16]/ASIFT [17], which are relatively mature and have been successfully applied in previous missions, mainly for image stitching and relative pose estimation. The descent image and DOM matching methods include manual matching [13] and crater matching [7]-[9], [12], [14]. ...
... For instance, the difference in solar incidence and azimuth angle will affect the overall light and dark changes of images. Traditional methods such as SIFT, ORB [27], and SURF [28] cannot achieve accurate matching between the descent image and the reference map under illumination variations [16], [29]. Additionally, the crater-based image matching methods [7]-[9], [12], [14] that extract lunar craters as prominent features are insensitive to illumination variations. ...
... The descent image and the corresponding reference DOM from Chang'e-3 to Chang'e-6 were used as the experimental dataset, as shown in Fig. 13. Five common feature matching methods, i.e., WSSF (2024) [29], HOWP (2023) [42], SIFT (2004) [16], RIFT (2019) [50] and HAPCG (2021) [41], were adopted for comparison. TABLE II shows the comparative results on NCM, RCM, RMSE and success rate (SR). ...
Article
Full-text available
Landing point localization is of great significance to lunar exploration engineering and scientific missions. Vision-based landing point localization methods have been successfully utilized in the Chang'e series missions. Issues in the landing point visual localization task, including low-resolution reference maps, illumination changes between descent images and maps, and low automation of the localization workflow, still need to be solved. In this paper, a high-precision and automatic landing point visual localization method with high-resolution map generation is proposed, including initial localization of the first frame, hybrid fine matching, landing point propagation in descent sequence images, and absolute position estimation for the landing point. A high-resolution digital elevation model and Digital Orthophoto Map (DOM) are generated from LROC NAC images and SLDEM2015 data. A phase-based image matching method is adopted for initial localization and matching between the descent image and the reference map to enhance illumination robustness. The performance of our method is validated using the descent sequence images from the Chang'e series missions. For the Chang'e-6 lander, the estimated landing point coordinates are (−153.9870° ± 0.00002°, −41.6378° ± 0.00001°, −5256.962 m ± 0.0041 m). Compared with the manually measured lander coordinates, the deviation of the landing point position is less than 1 pixel in the DOM.
... This representation is being neglected in modern deep end-to-end image matching methods due to their intrinsic design, which guarantees a better global optimization at the expense of the interpretability of the system. Notwithstanding that end-to-end deep networks represent in many respects the State-Of-The-Art (SOTA) in image matching, such as in terms of the robustness and accuracy of the estimated poses in complex scenes [12][13][14][15][16][17][18], handcrafted image matching pipelines such as the popular Scale Invariant Feature Transform (SIFT) [19] are still employed today in practical applications [2,4] due to their scalability, adaptability, and understandability. Furthermore, handcrafted or deep modules remain present to postprocess modern monolithic deep architectures, further filtering the final output matches as done notably by the RAndom SAmpling Consensus (RANSAC) [20] based on geometric constraints. ...
... Keypoints are usually recognized as corner-like and blob-like ones, even if there is no clear separation between them outside ad-hoc ideal images. Corners are predominantly extracted by the Harris detector [62], while blobs by SIFT [19]. Several other handcrafted detectors exist besides the Harris and SIFT ones [63][64][65], but excluding deep image matching, nowadays the former and their extensions are the most commonly adopted. ...
... Actually, patch normalization takes place between the detection and description of the keypoints, and it is critical for the effective computation of the descriptor to be associated with the keypoint. Patch normalization is implicitly hidden inside both keypoint extraction and description, and its task roughly consists of aligning and registering patches before descriptor computation [19]. ...
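As a hedged sketch of the RANSAC-based geometric filtering mentioned above, the helper below prunes putative matches with a fundamental-matrix fit; pts1 and pts2 are assumed Nx2 arrays of matched keypoint coordinates, and the threshold is illustrative.

```python
# Sketch: RANSAC filtering of putative matches via epipolar geometry.
import cv2
import numpy as np

def ransac_filter(pts1: np.ndarray, pts2: np.ndarray, thresh: float = 1.0):
    """Keep only correspondences consistent with a RANSAC-fitted fundamental matrix."""
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, thresh, 0.999)
    keep = mask.ravel() == 1
    return pts1[keep], pts2[keep], F
```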
Preprint
Full-text available
This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, with non-conforming correspondences discarded. Moreover, the underlying planar structural design provides an explicit map between local patches associated with the matches, enabling optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piece-wise planar approximation assumption, a further strategy is designed for minimizing relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical cases where camera intrinsics are unavailable. Experimental results demonstrate that our proposed non-deep learning, geometry-based approach achieves performances that are either superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there is still development potential in current image matching solutions in the considered research direction, which could in the future be incorporated into novel deep image matching architectures.
... The ratio of the minimum feature distance to the second-smallest feature distance in PHOW feature matching is defined in Equation (18). In the experiments, τ_d of Equation (18) is used to select the line segment points that are correctly matched between the left and right views, and the determination of this value is inspired by Lowe et al. [34], who utilized a similar threshold to obtain correctly matched points. Equation (24) is used to obtain other possible corresponding line segments as candidates within the range of the maximum matching rate γ. ...
... The smaller the γ value, the more candidate line segments need to be screened, and the larger the computation amount. The value of γ is set to 0.8 with reference to Lowe et al. [34]. ...
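For reference, below is a minimal version of the Lowe-style nearest-to-second-nearest distance ratio test that motivates the 0.8 threshold above; the descriptor arrays are assumed inputs and this is an illustration, not the cited paper's line-segment matching code.

```python
# Sketch of Lowe's ratio test with the 0.8 threshold discussed above.
import cv2

def ratio_test_matches(des_query, des_train, ratio=0.8):
    """Keep a match only if its best neighbour is clearly closer than the second-best."""
    knn = cv2.BFMatcher().knnMatch(des_query, des_train, k=2)
    return [m for m, n in knn if m.distance < ratio * n.distance]
```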
Article
Full-text available
Matched line segments are crucial geometric elements for reconstructing the desired 3D structure in stereo satellite imagery, owing to their advantages in spatial representation, complex shape description, and geometric computation. However, existing line segment matching (LSM) methods face significant challenges in effectively addressing co-linear interference and the misdirection of parallel line segments. To address these issues, this study proposes a “continuous–discrete–continuous” cyclic LSM method, based on the Voronoi diagram, for stereo satellite images. Initially, to compute the discrete line-point matching rate, line segments are discretized using the Bresenham algorithm, and the pyramid histogram of visual words (PHOW) feature is assigned to the line segment points which are detected using the line segment detector (LSD). Next, to obtain continuous matched line segments, the method combines the line segment crossing angle rate with the line-point matching rate, utilizing a soft voting classifier. Finally, local point-line homography models are constructed based on the Voronoi diagram, filtering out misdirected parallel line segments and yielding the final matched line segments. Extensive experiments on the challenging benchmark, WorldView-2 and WorldView-3 satellite image datasets, demonstrate that the proposed method outperforms several state-of-the-art LSM methods. Specifically, the proposed method achieves F1-scores that are 6.22%, 12.60%, and 18.35% higher than those of the best-performing existing LSM method on the three datasets, respectively.
... Based on how features are obtained, they are categorized into handcrafted and deep learning methods. Local feature-based handcrafted methods such as SIFT [4], KAZE [5], and ORB [6] are designed using prior knowledge, which limits their capability to capture deep features. With the rise of deep learning, Convolutional Neural Network (CNN) based local feature methods have continuously emerged, such as Superpoint [7], DISK [8], and XFeat [9]. ...
... To validate the proposed method's effectiveness and generalization, eleven representative methods are selected for comparison. These include two handcrafted local feature extraction methods, SIFT [4] and ORB [6], and seven deep learning-based local feature extraction methods, including SuperPoint [7]. (Table note: ▲ represents handcrafted local feature extraction methods, ⋆ represents deep learning-based local feature extraction methods, ♦ represents end-to-end dense matching methods, and ϵ represents the allowed pixel error threshold; bold data indicates the best value among deep learning-based local feature extraction methods.) ...
Preprint
Full-text available
Local feature extraction and matching has lately attracted increasing attention due to its wide application, especially in real-time automated systems. However, existing image matching methods struggle to balance the global receptive field and the efficient computation, which limits the practical applications. Recently, the State Space Model (SSM) has shown great potential in linear complexity and long-range dependency modeling. Therefore, in this paper, a local feature extraction and matching method using the SSM is proposed, which aims to achieve the tradeoff between global information extraction and model complexity. Firstly, a Local and Global Information Fusion (LGIF) block is developed to integrate local and global information and reduce model parameters through parallel SSM. Secondly, a backbone based on Euclidean group E(2) equivariant steerable Convolution (E2Conv) is designed to improve the model's robustness against geometric transformations. Finally, a self-supervised learning framework is constructed, which optimizes the ability of the network in local feature detection and description by combining four loss functions: keypoint localization loss, keypoint confidence score loss, descriptor triplet loss, and keypoint correspondence loss. Experimental results on public benchmark datasets Hpatches and RDNIM demonstrate that the proposed method has a significant advantage over existing methods in homography estimation tasks. Notably, our method outperforms the end-to-end dense matching method LoFTR by 6.11% under the 1-pixel error threshold on the Hpatches dataset, simultaneously with a smaller number of parameters and less average matching time.
... As low-level features we use SIFT, the Scale-Invariant Feature Transform (Lowe, 2004). SIFT features are good at capturing parts of objects and are designed to be invariant to image transformations. ...
... Specific object recognition is a sub-task of the general object detection problem. It involves detecting specifically defined objects in an image by detecting feature points using the scale-invariant feature transform (SIFT) [115] or the learned invariant feature transform (LIFT) [116]. An overview of computer vision applications in poultry monitoring is presented in Figure 1. ...
Chapter
Full-text available
This chapter investigates climate change’s impact on broiler chicken production and reproduction. With climate patterns shifting, poultry farming faces challenges in managing heat stress, ensuring reproductive success, and maintaining overall yield. The physiological responses of broiler chickens to changing environmental conditions, including temperature fluctuations and extreme events, will be explored. Additionally, adaptation strategies and management practices to mitigate these impacts will be discussed. By synthesizing existing literature and empirical evidence, this chapter aims to provide insights into understanding and addressing the complexities of climate change in the broiler chicken industry, offering pathways for sustainable poultry farming in a changing climate.
... The key point detection method aims to calculate the number of key points in the image. We used the scale-invariant feature transform (SIFT) [70] to calculate the number of feature points in the image. As shown in Fig. 10, the number of feature points in the original images in the first and second rows is 110 and 498. ...
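A tiny sketch of that keypoint-counting measure, assuming placeholder file names for the original and enhanced images:

```python
# Count SIFT keypoints before and after enhancement (placeholder file names).
import cv2

sift = cv2.SIFT_create()
for name in ("original.png", "enhanced.png"):
    img = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
    print(name, len(sift.detect(img, None)))     # number of detected feature points
```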
Article
Underwater images frequently experience issues, such as color casts, loss of contrast, and overall blurring due to the impact of light attenuation and scattering. To tackle these degradation issues, we present a highly efficient and robust method for enhancing underwater images, called DAPNet. Specifically, we integrate the extended information block into the encoder to minimize information loss during the downsampling stage. Afterward, we incorporate the dual attention module to enhance the network's sensitivity to critical location information and essential channels while utilizing codecs for feature reconstruction. Simultaneously, we employ adaptive instance normalization to transform the output features and generate multiple samples. Lastly, we utilize Monte Carlo likelihood estimation to obtain stable enhancement results from this sample space, ensuring the consistency and reliability of the final enhanced image. Experiments are conducted on three underwater image data sets to validate our method's effectiveness. Moreover, our method demonstrates strong performance in underwater image enhancement and exhibits excellent generalization and effectiveness in tasks, such as low-light image enhancement and image dehazing.
... Each frame is then aligned to the sharpest frame. Alignment is performed in a sequential fashion, beginning with translation-only registration based on local key features derived from the scale-invariant feature transform (SIFT) 12 , followed by rigid registration, and finally affine registration (Fig. 1C). Post alignment, the frame size is cropped to the bounding box of the optic disc (typically 200 × 300 pixels). ...
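Below is a hedged sketch of that coarse-to-fine alignment order (translation, then rigid, then affine) driven by matched SIFT keypoints; pts_moving and pts_fixed are assumed Nx2 arrays of corresponding coordinates, and OpenCV's affine estimators stand in for whatever registration routines the cited work used.

```python
# Sketch: sequential translation -> rigid -> affine registration from matches.
import cv2
import numpy as np

def align_coarse_to_fine(pts_moving: np.ndarray, pts_fixed: np.ndarray):
    # 1) Translation only: robust median displacement between matched points.
    shift = np.median(pts_fixed - pts_moving, axis=0)

    # 2) Rigid/similarity transform (rotation + translation + uniform scale).
    rigid, _ = cv2.estimateAffinePartial2D(pts_moving, pts_fixed, method=cv2.RANSAC)

    # 3) Full affine refinement.
    affine, _ = cv2.estimateAffine2D(pts_moving, pts_fixed, method=cv2.RANSAC)
    return shift, rigid, affine
```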
Article
Full-text available
The carotid-femoral pulse wave velocity (PWV) method is used clinically to determine degrees of stiffness and other indices of disease. It is believed PWV measurement in retinal vessels may allow early detection of diseases. In this paper we present a new non-invasive method for estimating PWVs in retinal vein segments close to the optic disc centre, based on the measurement of blood column pulsation in retinal veins (reflective of vessel wall pulsation), using modified photoplethysmography (PPG). An optic disc (OD) PPG video is acquired spanning three cardiac cycles for a fixed ophthalmodynamometric force. The green colour channel frames are extracted, cropped and aligned. A harmonic regression model is fitted to each pixel intensity time series along the vein centreline from the centre to the periphery of the OD. The phase of the first harmonic is plotted against centreline distance. A least squares line is fitted between the first local maximum phase and first local minimum phase and its slope used to compute PWV. Five left eye inferior hemi-retinal veins from five healthy subjects were analysed. Velocities were calculated for several induced intraocular pressures ranging from a mean baseline of 14 mmHg (SD 5) to 56 mmHg in steps of approximately 5 mmHg. The median PWV over all pressure steps and subjects was 20.77 mm/s (IQR 29.27). The experimental results show that pulse wave propagation direction was opposite to flow in this initial venous segment.
... The most famous multiscale feature detection and description algorithms include SIFT (Lowe, 2004) and SURF (Bay et al., 2006), both of which utilize Gaussian scale space to achieve scale invariance. However, the limitations of the SIFT and SURF algorithms are their relatively low matching precision and lack of real-time performance (Agrawal et al., 2008). ...
Article
Full-text available
With the widespread application of image feature matching in orchard automation, enhancing the precision and computational efficiency of current fruit tree feature matching algorithms is crucial. In this paper, we propose a fruit tree feature matching algorithm based on an improved AKAZE method. First, the AKAZE algorithm is employed to perform feature detection and extract feature points from the left and right fruit tree images. The FREAK descriptor is then constructed for the AKAZE feature points detected on the fruit tree images. Subsequently, an improved KNN algorithm is used for preliminary feature matching, followed by the RANSAC algorithm to refine the matching results. Ultimately, the optimal feature matching results for the left and right fruit tree images are obtained. Compared to the SURF, KAZE, AKAZE, and BRISK algorithms, the proposed algorithm demonstrates significant improvements in RMSE, precision, runtime, and RAM usage across experiments involving scale, blur, and rotation matching.
... Recent work [31] proposes a sequence descriptor that incorporates spatio-temporal information, using spatial attention within and across frames to capture feature persistence or change, making it more robust to moving objects and occlusions. After identifying common areas to build relationships between current point cloud scans and local maps from different vehicles, it can match similar local maps using extracted local features generated by methods such as SIFT [32], ORB [33], and SuperPoint [34], solving relative pose estimation to augment ego-vehicle state. ...
Preprint
Full-text available
SLAMMOT, i.e., simultaneous localization, mapping, and moving object (detection and) tracking, represents an emerging technology for autonomous vehicles in dynamic environments. Such single-vehicle systems still have inherent limitations, such as occlusion issues. Inspired by SLAMMOT and rapidly evolving cooperative technologies, it is natural to explore cooperative simultaneous localization, mapping, moving object (detection and) tracking (C-SLAMMOT) to enhance state estimation for ego-vehicles and moving objects. C-SLAMMOT could significantly upgrade the single-vehicle performance by utilizing and integrating the shared information through communication among multiple vehicles. This inevitably leads to a fundamental trade-off between performance and communication cost, especially in a scalable manner as the number of collaboration vehicles increases. To address this challenge, we propose a LiDAR-based communication-efficient C-SLAMMOT (CE C-SLAMMOT) method by determining the number of collaboration vehicles. In CE C-SLAMMOT, we adopt descriptor-based methods for enhancing ego-vehicle pose estimation and spatial confidence map-based methods for cooperative object perception, allowing for the continuous and dynamic selection of the corresponding critical collaboration vehicles and interaction content. This approach avoids the waste of precious communication costs by preventing the sharing of information from certain collaborative vehicles that may contribute little or no performance gain, compared to the baseline method of exchanging raw observation information among all vehicles. Comparative experiments in various aspects have confirmed that the proposed method achieves a good trade-off between performance and communication costs, while also outperforming previous state-of-the-art methods in cooperative perception performance.
... Since the MaskFeat checkpoint we used was trained exclusively on the ImageNet-1K dataset and followed a training process similar to MAE, we attribute its mid-level vision strength to its choice of target features. Unlike models trained on raw pixels, tokens, or high-level features, MaskFeat is optimized on HOG (Histograms of Oriented Gradients) features, which are widely used for keypoint detection in methods such as SIFT[39]. Importantly, HOG does not rely on any external models, making it a robust feature for capturing mid-level vision attributes. ...
Preprint
Full-text available
Mid-level vision capabilities - such as generic object localization and 3D geometric understanding - are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.
... Owing to occlusion, low resolution, and scale variation, small-object detection has been a persistent challenge, particularly in traffic surveillance. Early methods relied on handcrafted features, such as histogram of oriented gradients (HOG) and scale-invariant feature transform (SIFT), combined with classifiers, such as support vector machines (SVMs) (Dalal and Triggs, 2005;Lowe, 2004). Although these approaches perform well in controlled environments, they often fail in complex real-world scenarios where lighting and motion introduce variability (Viola and Jones, 2001). ...
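Below is a minimal sketch of the classical HOG-plus-SVM pairing referenced above (in the Dalal-Triggs style, not the cited paper's detector); the image windows here are synthetic stand-ins.

```python
# Sketch: HOG features fed to a linear SVM (synthetic stand-in data).
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(windows):
    return np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for w in windows])

rng = np.random.default_rng(0)
X_windows = rng.random((40, 64, 64))             # 40 fake 64x64 grayscale crops
y = rng.integers(0, 2, 40)                       # fake 0/1 labels

clf = LinearSVC().fit(hog_features(X_windows), y)
scores = clf.decision_function(hog_features(X_windows))
```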
Preprint
Full-text available
Accurate detection and tracking of small objects such as pedestrians, cyclists, and motorbikes are critical for traffic surveillance systems, which are crucial in improving road safety and decision-making in intelligent transportation systems. However, traditional methods struggle with challenges such as occlusion, low resolution, and dynamic traffic conditions, necessitating innovative approaches to address these limitations. This paper introduces DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN) with YOLO11 to enhance small object detection and tracking in traffic surveillance systems. The framework leverages YOLO11's advanced spatial feature extraction capabilities for precise object detection and incorporates DGNN to model spatial-temporal relationships for robust real-time tracking dynamically. By constructing and updating graph structures, DGNN-YOLO effectively represents objects as nodes and their interactions as edges, ensuring adaptive and accurate tracking in complex and dynamic environments. Extensive experiments demonstrate that DGNN-YOLO consistently outperforms state-of-the-art methods in detecting and tracking small objects under diverse traffic conditions, achieving the highest precision (0.8382), recall (0.6875), and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability, particularly in challenging scenarios involving small and occluded objects. This work provides a scalable, real-time traffic surveillance and analysis solution, significantly contributing to intelligent transportation systems.
... Many traditional biometric systems use handcrafted features designed by computer vision experts. A good amount of handcrafted features for biometric systems were based on distributions of edges from biometric samples obtained using edge filtering methods such as scale-invariant feature transformation [37] and histograms of oriented gradients [38]. Other features were derived using domain transformation techniques, such as Gabor filtering to extract iris features [39], Fourier transforms for facial recognition [40], or wavelet transformations [41]. ...
Preprint
Full-text available
Biometric systems, while offering convenient authentication, often fall short in providing rigorous security assurances. A primary reason is the ad-hoc design of protocols and components, which hinders the establishment of comprehensive security proofs. This paper introduces a formal framework for constructing secure and privacy-preserving biometric systems. By leveraging the principles of universal composability, we enable the modular analysis and verification of individual system components. This approach allows us to derive strong security and privacy properties for the entire system, grounded in well-defined computational assumptions.
... Image stitching is a technique that combines several overlapping images (which may have been acquired at different times, from different viewpoints, or by different sensors) into one large, seamless, high-resolution image. When acquiring images of a wide-field-of-view scene with an ordinary camera, the larger the scene, the lower the resolution of the resulting image, owing to the camera's limited resolution. In the medical imaging field, accurate panoramic images can help doctors conduct more detailed diagnosis and surgery planning, and improve the quality of medical care [1]. In driverless technology, image stitching technology enables vehicles to fully perceive the surrounding environment and enhance driving safety. ...
Article
Full-text available
Image stitching technology plays a significant role in the fields of computer vision and image processing, with applications ranging from panoramic photography to virtual reality (VR), augmented reality (AR), medical diagnostics, and autonomous vehicle technology. As technology advances, the demand for high-quality, real-time panoramic images provided by image stitching technology continues to grow. This study aims to implement an image stitching method based on the Scale-Invariant Feature Transform (SIFT) feature point detection algorithm, combined with the Random Sample Consensus (RANSAC) algorithm and the calculation of the homography matrix to automatically stitch two images. This paper elaborates on the entire process of image feature extraction, feature matching, homography matrix calculation, and image fusion, and compares different fusion modes. The experimental results show that while the method can achieve seamless image stitching in some cases, its performance in complex scenes such as crowds or traffic flows is average. This study provides new perspectives and methods for the application of image stitching technology.
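Below is a condensed sketch of the pipeline this abstract describes (SIFT matching, RANSAC homography, warp-and-paste fusion); the file names are placeholders, the 0.75 ratio is a common default, and the simple overwrite blend stands in for the fusion modes the paper compares.

```python
# Hedged two-image stitching sketch: SIFT + ratio test + RANSAC homography.
import cv2
import numpy as np

left = cv2.imread("left.jpg")                    # placeholder image paths
right = cv2.imread("right.jpg")

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(cv2.cvtColor(left, cv2.COLOR_BGR2GRAY), None)
kp2, des2 = sift.detectAndCompute(cv2.cvtColor(right, cv2.COLOR_BGR2GRAY), None)

good = [m for m, n in cv2.BFMatcher().knnMatch(des2, des1, k=2)
        if m.distance < 0.75 * n.distance]
src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)      # map right -> left frame
pano = cv2.warpPerspective(right, H,
                           (left.shape[1] + right.shape[1], left.shape[0]))
pano[:left.shape[0], :left.shape[1]] = left               # naive overwrite fusion
```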
... CNNs have contributed significantly to progress in the field [9]. Deep learning has also benefited from transfer learning, where models achieve good performance on small datasets by learning from larger datasets, thus overcoming data scarcity problems [10]. Attention mechanisms have also been introduced to enhance CNN performance by focusing on important regions of the face [11]. ...
Article
Full-text available
As a key area in artificial intelligence and computer vision, facial expression analysis technology has been advancing rapidly. This technology extracts even the subtlest facial movements to identify when a human is sad, happy, or angry, and it has extensive application value in areas like mental health diagnosis, security monitoring, and intelligent interaction. The paper provides a brief overview of the core technologies of facial expression recognition: geometric features, appearance features, and deep learning. It also compares experimental results across studies, evaluating several methods in a multi-dataset context and concentrating on the primary problems in these tasks, such as dataset bias, real-time requirements, and varying lighting levels and partial occlusions in different environments. Finally, the paper forecasts the coming growth in data fusion and deep learning improvements, indicating that emerging technologies, especially in terms of applications, will have the upper hand.
... • ORB [45] is a binary descriptor using FAST detector and intensity centroids for keypoint orientation. • SIFT [46], along with its GPU-based variant CUDASIFT, extracts keypoints using the difference of Gaussians and local gradient histograms. • SURF [47], a faster alternative to SIFT, approximates the Laplacian of the Gaussian with a box filter. ...
Article
Full-text available
In this paper, we present EC-WAMI, the first successful application of neuromorphic event cameras (ECs) for Wide-Area Motion Imagery (WAMI) and Remote Sensing (RS), showcasing their potential for advancing Structure-from-Motion (SfM) and 3D reconstruction across diverse imaging scenarios. ECs, which detect asynchronous pixel-level brightness changes, offer key advantages over traditional frame-based sensors such as high temporal resolution, low power consumption, and resilience to dynamic lighting. These capabilities allow ECs to overcome challenges such as glare, uneven lighting, and low-light conditions that are common in aerial imaging and remote sensing, while also extending UAV flight endurance. To evaluate the effectiveness of ECs in WAMI, we simulate event data from RGB WAMI imagery and integrate them into SfM pipelines for camera pose optimization and 3D point cloud generation. Using two state-of-the-art SfM methods, namely, COLMAP and Bundle Adjustment for Sequential Imagery (BA4S), we show that although ECs do not capture scene content like traditional cameras, their spike-based events, which only measure illumination changes, allow for accurate camera pose recovery in WAMI scenarios even in low-framerate(5 fps) simulations. Our results indicate that while BA4S and COLMAP provide comparable accuracy, BA4S significantly outperforms COLMAP in terms of speed. Moreover, we evaluate different feature extraction methods, showing that the deep learning-based LIGHTGLUE descriptor consistently outperforms traditional handcrafted descriptors by providing improved reliability and accuracy of event-based SfM. These results highlight the broader potential of ECs in remote sensing, aerial imaging, and 3D reconstruction beyond conventional WAMI applications. Our dataset will be made available for public use.
... The Bag-of-Visual Words (BoVW) is a popular method for modeling images via discrete representations. In [62], k-means clustering is applied to keypoint descriptors from SIFT [34] to learn a fixed set of centroids. These centroids represent the vocabulary to which multiple descriptors are mapped in a process termed Vector Quantization (VQ). ...
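The sketch below illustrates that Bag-of-Visual-Words construction: k-means over pooled SIFT descriptors to learn a vocabulary, then vector quantization of each image into a normalized word histogram. The vocabulary size and the list of training images are assumed inputs, not values from the cited work.

```python
# Hedged BoVW sketch: k-means vocabulary over SIFT descriptors + VQ histograms.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(images, k=256):
    sift = cv2.SIFT_create()
    descs = [sift.detectAndCompute(img, None)[1] for img in images]
    pool = np.vstack([d for d in descs if d is not None])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pool)

def bovw_histogram(img, vocab):
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    words = vocab.predict(desc)                            # vector quantization
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)                      # normalized word histogram
```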
Preprint
The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression requires additional end-to-end fine-tuning or incurs a significant drawback to runtime, thus making them ill-suited for online inference. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to 2× or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.
... Traditional UAV geo-localization methods typically rely on hand-crafted feature matching [16] and rough GPS data for pose estimation [17]. Recent approaches aim to estimate global geographical coordinates over larger regions by matching query images to geo-tagged satellite patches, framing the task as an image retrieval problem. ...
Preprint
Full-text available
Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods heavily rely on the supervision of labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods have expensive training costs and tend to overfit the region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation to 3d space from UAV observations for satellite image generation, providing robust representation against view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image's perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves competitive performance with recent supervised methods.
... FPFH [52] and ESF [53] are two widely used features from the pre-deep-learning era. Other feature extractors, such as 3DSIFT [54] and PFH [55], can be found in this survey [56]. After deep learning became prevalent, the performance of point features improved substantially in terms of distinguishability and recall. ...
Article
Full-text available
Recent years have witnessed a significant breakthrough in the 3D domain. To track the most recent advances in the 3D field, in this paper, we provide a comprehensive survey of recent advances in the 3D field, which encompasses a wide collection of topics, including diverse pre-training strategies, backbone designs and downstream tasks. Compared to the previous literature review on point cloud, our survey is more comprehensive. Our survey consists of the 3D pre-training methods, various downstream tasks, popular benchmarks, evaluation metrics as well as several promising future directions. We hope the survey can serve as the cornerstone for both academia and industry.
... where the coordinates of each keypoint $j$ in the $i$-th frame are denoted $[x_{j,i}, y_{j,i}]$, with $i = 1, 2$ and $j = a, b, c$; $i = 1$ represents the coordinates of PF and $i = 2$ represents the coordinates of CF. The least-squares solution of the vector $\mathbf{E}$, which includes the six entities of $\mathbf{A}$ and $\mathbf{b}$, can be determined as [134]: ...
... Besides, it employed region of interest (RoI) and excluded undesirable regions. They obtained the features of all the humans in a frame through the Scale-invariant feature transform (SIFT) [43]. Then, they matched these stored features from consecutive frames in order to track the person by applying the Fast Library for Approximate Nearest Neighbors (FLANN) [44]. ...
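A small sketch of the SIFT-plus-FLANN matching step described above is given below, using OpenCV's FLANN wrapper with a KD-tree index; the descriptor arrays are assumed to come from consecutive frames and the ratio threshold is illustrative.

```python
# Sketch: approximate NN matching of SIFT descriptors with FLANN.
import cv2

def flann_match(des_prev, des_curr, ratio=0.7):
    index_params = dict(algorithm=1, trees=5)      # 1 = FLANN_INDEX_KDTREE
    search_params = dict(checks=50)
    flann = cv2.FlannBasedMatcher(index_params, search_params)
    knn = flann.knnMatch(des_prev, des_curr, k=2)
    return [m for m, n in knn if m.distance < ratio * n.distance]
```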
Article
Full-text available
This research aims to improve the automatic identification of armed people in surveillance videos. We focus on people armed with pistols and revolvers. Furthermore, we use YOLOv4 to detect people and weapons in each video frame. We developed a series of algorithms to create a dataset with the information extracted from the bounding boxes generated by YOLOv4 in real-time. Thereby, we initially developed six armed-people detectors (APDs) based on six machine learning models: Random Forest Classifier (RFC-APD), Multilayer Perceptron (MLP-APD), Support Vector Machine (SVM-APD), Logistic Regression (LR-APD), Naive Bayes (NB-APD), and Gradient Boosting Classifier (GBC-APD). These models use 20 predictors to make their predictions. These predictors are computed from the bounding box coordinates of the detected people and weapons, their distances, and areas of intersection. Based on our results, the RFC-APD was the best-performing detector, with an accuracy of 95.59%, a recall of 94.51%, and an F1-score of 95.65%. In this work, we propose to create selectors for deciding which APD to use in each video frame (APD4F) to improve the detection results. Besides, we implemented two types of APD4Fs, one based on a Random Forest Classifier (RFC-APD4F) and another on a Multilayer Perceptron (MLP-APD4F). We developed 44 APD4Fs combining subsets of the six APDs. Both APD4F types outperformed most of the independent use of all six APDs. A multilayer perceptron-based APD4F, which combines an MLP-APD, a NB-APD, and a LR-APD, presented the best performance, achieving an accuracy of 95.84%, a recall of 99.28% and an F1-score of 96.07%.
... An essential byproduct of this process is the scene graph, which captures information about matching pairs. However, current advanced SfM frameworks primarily depend on keypoint detection [14,16,35,46,58] and matching [32,48,54] techniques, which can be less effective in textureless or repetitive environments. The task of (re)localization [4,15,47,49] is also closely related to SfM. ...
Preprint
Neural surface reconstruction relies heavily on accurate camera poses as input. Despite utilizing advanced pose estimators like COLMAP or ARKit, camera poses can still be noisy. Existing pose-NeRF joint optimization methods handle poses with small noise (inliers) effectively but struggle with large noise (outliers), such as mirrored poses. In this work, we focus on mitigating the impact of outlier poses. Our method integrates an inlier-outlier confidence estimation scheme, leveraging scene graph information gathered during the data preparation phase. Unlike previous works directly using rendering metrics as the reference, we employ a detached color network that omits the viewing direction as input to minimize the impact caused by shape-radiance ambiguities. This enhanced confidence updating strategy effectively differentiates between inlier and outlier poses, allowing us to sample more rays from inlier poses to construct more reliable radiance fields. Additionally, we introduce a re-projection loss based on the current Signed Distance Function (SDF) and pose estimations, strengthening the constraints between matching image pairs. For outlier poses, we adopt a Monte Carlo re-localization method to find better solutions. We also devise a scene graph updating strategy to provide more accurate information throughout the training process. We validate our approach on the SG-NeRF and DTU datasets. Experimental results on various datasets demonstrate that our methods can consistently improve the reconstruction qualities and pose accuracies.
... Traditional image captioning techniques primarily relied on manually designed feature extraction and template matching strategies. These methods commonly used algorithms such as Scale-Invariant Feature Transform (SIFT) [4] and Histogram of Oriented Gradients (HOG) [5] to extract features from images, and then converted the image content into fixed-length feature vectors using the Bag-of-Visual-Words model. Template-based captioning methods employed pre-defined sentence frameworks, combining key objects and actions from the image to generate structured descriptive text. ...
Article
Full-text available
Thangka image captioning aims to automatically generate accurate and complete sentences that describe the main content of Thangka images. However, existing methods fall short in capturing the features of the core deity regions and the surrounding background details of Thangka images, and they significantly lack an understanding of local actions and interactions within the images. To address these issues, this paper proposes a Thangka image captioning model based on Salient Attention and Local Interaction Aggregator (SALIA). The model is designed with a Dual-Branch Salient Attention Module (DBSA) to accurately capture the expressions, decorations of the deity, and descriptive background elements, and it introduces a Local Interaction Aggregator (LIA) to achieve detailed analysis of the characters’ actions, facial expressions, and the complex interactions with surrounding elements in Thangka images. Experimental results show that SALIA outperforms other state-of-the-art methods in both qualitative and quantitative evaluations of Thangka image captioning, achieving BLEU4: 94.0%, ROUGE_L: 95.0%, and CIDEr: 909.8% on the D-Thangka dataset, and BLEU4: 22.2% and ROUGE_L: 47.2% on the Flickr8k dataset.
... This paper provides a comprehensive review of the SURF feature descriptor, starting with an overview of the underlying principles and its key components. Subsequently, we discuss its applications in object recognition, image matching, and 3D reconstruction [2]. Moreover, we explore recent advancements and variations of the SURF algorithm and compare it with other popular feature descriptors. ...
Article
Full-text available
This paper provides a comprehensive review of the SURF (speeded up robust features) feature descriptor, a commonly used technique for image feature extraction. The SURF algorithm has obtained significant popularity because of its robustness, efficiency, and invariance to various image transformations. In this paper, an in-depth analysis of the underlying principles of SURF, its key components, and its use in computer vision tasks such as object recognition, image matching, and 3D reconstruction is presented. Furthermore, we discuss recent advancements and variations of the SURF algorithm and compare it with other popular feature descriptors. Through this review, the aim is to provide a clear understanding of the SURF feature descriptor and its significance in the area of computer vision.
... Feature detection and matching were performed using the scale-invariant feature transform (SIFT) algorithm, and the random sample consensus (RANSAC) algorithm was employed to estimate the homography matrix, ensuring precise alignment and seamless image merging. The SIFT and RANSAC algorithms were chosen for their exceptional robustness and accuracy in image matching and panorama creation (Fischler, 1981; Lowe, 2004). Street view features were then statistically examined, with categorized results filtered to construct a dataset for evaluating perceptions of safety and health. ...
Article
Full-text available
With a growing interest in liveable cities, scholars and urban planners are increasingly studying the characteristics of child-friendly cities, including the ability to walk and move freely in public spaces. While machine learning techniques and street view imagery analysis have enabled the systematic analysis of streets, they have not yet been applied to assess street environments from a child's perspective. This study explores the use of deep learning models to address this gap by developing a machine-simulated human scoring model to assess health and safety indicators in urban streets. Using a high-density, old urban district in Hong Kong SAR, China, as a case, the study used semantic segmentation to analyze street environmental features and extract elements related to safety, such as greenery, vehicles, and fences. Subsequently, the model generated safety ratings, which were compared with scores provided by volunteer caregivers. The results indicate that natural elements and fences enhance safety, whereas an excess of buildings diminishes it. In contrast to European cities, where high visibility and larger sky proportions are considered beneficial for health, these factors were less relevant in the high-density, tropical context of Hong Kong. This analysis highlights the robustness and efficiency of the model, which can assist researchers in other cities in collecting empirical user rating data and informing strategies for more child-friendly urban planning.
... Images were processed (outliers removed and Z-stacked) with Fiji [3], and colocalization was determined with the JaCoP plug-in [41]. Where needed, images were aligned using the Linear Stack Alignment with SIFT plugin [42]. ...
Article
Full-text available
A complete understanding of RNA biology requires methods for tracking transcripts in vivo. Common strategies rely on fluorogenic probes that are limited in sensitivity, dynamic range, and depth of interrogation, owing to their need for excitation light and tissue autofluorescence. To overcome these challenges, we report a bioluminescent platform for serial imaging of RNAs. The RNA tags are engineered to recruit light-emitting luciferase fragments (termed RNA lanterns) upon transcription. Robust photon production is observed for RNA targets both in cells and in live animals. Importantly, only a single copy of the tag is necessary for sensitive detection, in sharp contrast to fluorescent platforms requiring multiple repeats. Overall, this work provides a foundational platform for visualizing RNA dynamics from the micro to the macro scale.
... The main challenges in efficient facial expression recognition are the selection of effective feature extraction and classification techniques. Handcrafted features, such as local binary patterns (LBP) [9,10], scale-invariant feature transforms (SIFT) [11], the gray level co-occurrence matrix (GLCM) [12], speeded-up robust features (SURF) [13], and histograms of oriented gradients (HOG) [14], achieved a breakthrough in various fields. Although using different descriptors to extract handcrafted features is possible, it is now common to learn the representation instead, removing the task of defining the features from the developer. ...
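For readers unfamiliar with these handcrafted descriptors, here is a small sketch of extracting HOG and uniform-LBP features with scikit-image; the sample image and all parameter choices are illustrative assumptions rather than the cited papers' settings.

```python
import numpy as np
from skimage import data
from skimage.feature import hog, local_binary_pattern

face = data.camera()  # stand-in grayscale image

# HOG: gradient-orientation histograms over a dense grid of cells.
hog_vec = hog(face, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), feature_vector=True)

# LBP: per-pixel code comparing each pixel with its circular neighborhood,
# summarized as a histogram over the image (uniform patterns, P=8 gives 10 bins).
lbp = local_binary_pattern(face, P=8, R=1.0, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)

print(hog_vec.shape, lbp_hist.shape)
```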
Article
Full-text available
This work introduces the representation ensemble learning algorithm, a novel approach for generating diverse unsupervised representations rooted in the principles of self-taught learning. The ensemble comprises convolutional autoencoders (CAEs) learned in an unsupervised manner, fostering diversity via a loss function designed to penalize similar CAEs’ latent representations. We employ support vector machines, bagging, and random forest as primary classification methods for the final classification step. Additionally, we incorporate KnoraU, a well-established technique used to dynamically select competent classifiers based on a test sample. We evaluate various fusion strategies, including sum, product, and stacking, to comprehensively assess the ensemble’s performance. A robust experimental protocol considering the facial expression recognition problem shows that the proposed approach based on self-taught learning surpasses the accuracy of fine-tuned convolutional neural network (CNN) models. In terms of accuracy, the proposed method is up to 9.9 and 6.3 percentage points better than the CNN-based models fine-tuned for JAFFE and CK+ datasets, respectively.
... Feature detectors in the first layers resemble Gabor-like and color-blob filters, while those in the later layers take the shape of convolutional neural networks. In contrast to earlier techniques such as SIFT [3] and HOG [4], the algorithm designer is not required to develop feature detectors when using CNNs. Over time, the network trains itself to recognize specific traits and improves its ability to do so. ...
... As a downstream task in the field of image retrieval [1], visual place recognition systems initially gather the visual information of the query image into a compact descriptor (image representation) and subsequently match it with a reference database of known geographical locations to retrieve the location of the query image. In previous research, VPR techniques employed hand-designed local features, such as SIFT [2] and SURF [3], which could be further aggregated into global descriptors representing the entire image, such as Fisher vector [4,5] and Vector of Locally Aggregated Descriptors (VLAD) [6,7]. These algorithms are designed for specific scenarios or tasks, and their generalization ability and adaptability are limited, making it challenging for them to cope with the needs of diverse scenarios and tasks. ...
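To make the descriptor-aggregation step mentioned above concrete, here is a hedged sketch of VLAD encoding over local descriptors using a k-means codebook. The random arrays stand in for real SIFT descriptors, and the cluster count and normalization choices are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(descriptors, kmeans):
    """Aggregate local descriptors into a VLAD vector given a learned codebook."""
    k, d = kmeans.n_clusters, descriptors.shape[1]
    assign = kmeans.predict(descriptors)
    v = np.zeros((k, d), dtype=np.float64)
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            # Sum of residuals between descriptors and their cluster center.
            v[i] = (members - kmeans.cluster_centers_[i]).sum(axis=0)
    v = v.flatten()
    v = np.sign(v) * np.sqrt(np.abs(v))        # signed square-root normalization
    return v / (np.linalg.norm(v) + 1e-12)     # L2 normalization

# Usage: fit the codebook on descriptors pooled from a training set, then encode
# each image's local descriptors as one global vector for retrieval/matching.
train_desc = np.random.rand(5000, 128)          # placeholder for pooled SIFT descriptors
codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(train_desc)
image_desc = np.random.rand(300, 128)           # placeholder for one image's descriptors
global_vec = vlad(image_desc, codebook)
```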
Article
Full-text available
Visual Place Recognition (VPR) technology aims to use visual information to judge an agent’s location, which plays an increasingly crucial role in mobile robot localization and automatic driving, among others. The appearance of outdoor scenes can change dramatically over time due to challenges such as weather, season, and lighting. To obtain robust descriptors that can adapt to complex environmental changes, we propose a Global Information Capture Network (GICNet), which can effectively mine invariant feature expressions under different instances of the same scene. GICNet consists of two carefully designed modules called shuffle channel attention (SCA) and global information aggregator (GIA), which play roles respectively in the processes of “feature extraction” and “feature aggregation” of the model. Specifically, SCA uses shuffle operation to enhance the information interaction between the channels of the feature map and utilizes a self-learning attention mask to recalibrate the feature relationship on the channel dimension. As a novel holistic feature aggregation technique, GIA regards the feature maps of the pre-trained backbone as a group of global features and comprehensively considers the global relationships among the elements in each feature map in a cascaded feature mixture manner. We demonstrate the effectiveness of our technique by conducting extensive experiments on multiple challenging large-scale benchmarks. In particular, the proposed method performs best on the Pitts30k dataset.
... Bottom row presents our results, demonstrating our method's capability for dense matching refinement. Research in this period focused on creating more efficient feature detectors, leading to the development of methods like SIFT (Lowe 2004), ORB (Rublee et al. 2011), and other learning-based techniques (DeTone, Malisiewicz, and Rabinovich 2018). However, the dependence on detectors significantly reduces robustness, resulting in failures in scenarios with textureless regions or large viewpoint changes. ...
Preprint
Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.
... A. POINT-BASED IMAGE MATCHING The Scale Invariant Feature Transform (SIFT) [13] method extracts distinctive features that are invariant to image scale and rotation. First, using the Difference of Gaussians (DoG) module, scale-space extrema are detected. ...
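A rough illustration of DoG extrema detection is sketched below. It is a deliberate simplification (single octave, no edge-response rejection, no sub-pixel refinement), and the sigma values and contrast threshold are assumed rather than SIFT's exact settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigmas=(1.0, 1.4, 2.0, 2.8, 4.0), threshold=0.02):
    """Rough DoG keypoint candidates: extrema across space and adjacent scales."""
    image = image.astype(np.float64) / image.max()
    blurred = np.stack([gaussian_filter(image, s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]                      # difference-of-Gaussians stack
    # A point is a candidate if it is the max or min of its 3x3x3 neighborhood
    # (two spatial dimensions plus the scale dimension).
    is_max = dog == maximum_filter(dog, size=(3, 3, 3))
    is_min = dog == minimum_filter(dog, size=(3, 3, 3))
    strong = np.abs(dog) > threshold                      # discard low-contrast responses
    scale_idx, rows, cols = np.nonzero((is_max | is_min) & strong)
    return list(zip(rows, cols, scale_idx))

# keypoints = dog_extrema(gray_image)   # gray_image: 2-D numpy array
```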
Article
Full-text available
Infrared template matching is an essential technology that enables reliable and accurate object detection, recognition, and tracking in complex environments. Perceptible Lightweight Zero-mean normalized cross-correlation (ZNCC) Template Matching (PLZ-TM) has been proposed as a tool for matching infrared images obtained from cameras with different fields of view. Aligning such images is challenging because of the involved differences in thermal distributions, focus discrepancies, background elements, and distortions. The first stage of PLZ-TM involves extracting feature maps from the search and template images using a deep learning network. This deep learning network is designed with a Convolutional Neural Network (CNN) architecture that omits pooling layers, thereby minimizing information loss during extraction. The subsequent stage involves matching the feature maps. The matching method utilizes a lightweight ZNCC module that employs average pooling for training. The deep learning network is trained to optimize the distribution of the output heatmap and the probability at the correct location of the template image. PLZ-TM delivers excellent performance, achieving a processing time of only 3.3 ms in matching a 640 × 480 search image with a 192 × 144 template image. Moreover, it attains a matching accuracy of 96% on a dataset obtained from infrared cameras with different fields of view.
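Plain ZNCC template matching, the building block that PLZ-TM extends, can be reproduced with OpenCV's normalized correlation mode. The sketch below uses placeholder file names and is not the PLZ-TM network itself.

```python
import cv2

# Placeholder file names for an infrared search image and a smaller template.
search = cv2.imread("search_ir.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template_ir.png", cv2.IMREAD_GRAYSCALE)

# TM_CCOEFF_NORMED is OpenCV's zero-mean normalized cross-correlation:
# the template and each window have their means subtracted before correlating,
# which gives some robustness to global brightness offsets.
heatmap = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(heatmap)

h, w = template.shape
top_left = best_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
print(f"best match at {top_left}, score {best_score:.3f}")
```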
... Transfer learning has proven especially valuable in autonomous driving, where models initially trained for bus automation can be adapted for use in self-driving cars [8, 54,55]. This approach enables the efficient detection of road signs, vehicles, and other road users, enhancing both safety and operational effectiveness for diverse autonomous systems [56,57]. ...
Preprint
Ensuring high-quality medical images across diverse clinical environments is essential for accurate diagnosis and treatment planning. However, domain shifts, caused by variations in imaging devices, protocols, and patient demographics, can severely impact model performance when applied to new clinical settings. This paper presents a source-free unsupervised domain adaptation approach designed to enhance medical image quality without requiring access to the original source data, which is often restricted due to privacy concerns. Our method leverages advanced feature alignment techniques and self-supervised learning to adapt pre-trained models directly to the target domain, optimizing image quality while maintaining clinical relevance. Extensive experiments on multiple imaging modalities demonstrate that our approach achieves significant improvements in image clarity and consistency across target domains, outperforming traditional domain adaptation methods that require source data access. This work advances the potential for robust, privacy-preserving medical image enhancement, enabling more reliable AI applications in diverse clinical environments.
... Scale-Invariant Feature Transform (SIFT) [14] was proposed by David Lowe in 1999. It is a classic feature detection and description algorithm used in image processing, characterized by its scale invariance. ...
Preprint
Full-text available
Keypoint detection and descriptor matching are essential in tasks like feature matching, object tracking, and 3D reconstruction. While CNN-based methods have advanced these areas, most focus on perspective projection cameras, with limited consideration of fisheye cameras, which introduce significant distortion. Conventional keypoint methods have failed on fisheye images, causing camera models to underperform in hybrid setups. This paper proposes a robust keypoint detection and description method under a hybrid camera model, addressing challenges in mixed camera systems. Given the lack of fisheye datasets, we employed viewpoint and projection transformations for data augmentation. Inspired by SuperPoint, we introduced a fisheye distortion model to generate training data. Using nearest neighbor (NN) matching, our method outperforms SuperPoint on the Hpatches dataset and excels in hybrid image-matching tasks.
... This is used in many vision-based approaches, including Pose Estimation, Visual Odometry, Visual Simultaneous Localization and Mapping (vSLAM), Object Detection, Object Tracking, Augmented Reality, Image Mosaicking, and Panorama Stitching. There are many keypoint detectors and feature descriptors, such as SIFT (Scale-Invariant Feature Transform) (Lowe, 2004), SURF (Speeded-Up Robust Features) (Bay et al., 2008), KAZE (Alcantarilla et al., 2012), and ORB (Oriented FAST and Rotated BRIEF) (Rublee et al., 2011), which identify features in images that have the following properties: repeatability, distinctiveness, efficiency, and locality (Gao and Zhang, 2021; Tareen and Saleem, 2018). Despite the robustness of those methods, some detected keypoints in the scene do not contribute to the solution and can even introduce outliers or gross errors into the processes. ...
Article
Full-text available
Keypoint detectors and descriptors are essential for identifying points and their correspondences in overlapping images, being fundamental inputs for many subsequent processes, including Pose Estimation, Visual Odometry, vSLAM, Object Detection, Object Tracking, Augmented Reality, Image Mosaicking, and Panorama Stitching. Techniques like SIFT, SURF, KAZE, and ORB aim to identify repeatable, distinctive, efficient, and local features. Despite their robustness, some keypoints, especially those detected in fisheye cameras, do not contribute to the solution, and may introduce outliers or errors. Fisheye cameras capture a broader view, leading to more keypoints at infinity and potential errors. Filtering these keypoints is important to maintain consistent input observations. Various methods, including gradient-based sky region detection, adaptive algorithms, and K-means clustering, addressed this issue. Semantic segmentation could be an effective alternative, but it requires extensive computational resources. Machine learning provides a more flexible alternative, processing large data volumes with moderate computational power and enhancing solution robustness by filtering non-contributing keypoints already detected in these vision-based approaches. In this paper we present and assess a machine learning model to classify keypoints as sky or non-sky, achieving an accuracy of 82.1%.
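A hedged sketch of such a sky/non-sky keypoint filter is shown below. The feature files, the per-keypoint attribute set, and the random-forest choice are assumptions for illustration rather than the authors' exact model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical training data: one row per detected keypoint, with simple attributes
# (e.g. normalized image coordinates, detector response, scale, local intensity or
# gradient statistics); labels are 1 = sky, 0 = non-sky. Paths are placeholders.
X = np.load("keypoint_features.npy")   # shape (n_keypoints, n_features)
y = np.load("keypoint_labels.npy")     # shape (n_keypoints,)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# At run time, keypoints predicted as sky would simply be dropped before feeding
# the remaining observations to pose estimation or vSLAM.
```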
... Current feature point extraction and matching frameworks can be categorized into traditional handcrafted frameworks and deep learning-based frameworks. Traditional handcrafted feature point extraction methods such as SIFT [23], SURF [24], KAZE [25], and ORB [26] rely on the local structure and pattern of the image, such as gradient and intensity changes. Their reliance on local information leads to limitations in capturing and utilizing the rich global MS information. ...
Article
Full-text available
As the use of unmanned aerial vehicles (UAVs) for waterfront monitoring increases, combining multiple UAV multispectral (MS) images into a single, seamless panoramic image has become crucial. This process ensures the accuracy and effectiveness of waterfront monitoring. However, the varying reflective properties of different wavelengths bring challenges for existing single-band MS image stitching frameworks, especially in complex waterfront areas. To address this challenge, we developed the Individual Band Enhanced Waterfront Multispectral Stitching (IBEWMS) framework. Central to this framework is the Individual Band Spectral Feature Enhancement (IBSFE) module, which enhances each spectral band based on the varying reflectance of different land covers, yielding clearer and more reliable features. Using IBSFE, we designed a detector-free framework to effectively extract and match feature points in waterfront MS images. Additionally, we implemented an image fusion technique to address issues like ghosting and global reflectance inconsistency in panoramic images. To support this work, we provide the Wuhan UAV Waterfront Environment Multispectral Dataset (WHUWEMS), comprising 12,315 high-resolution 5-band MS images. Experiments show that IBEWMS outperforms both deep learning and traditional stitching frameworks, offering valuable support for downstream applications. The dataset will be made available online in the near future.
... These networks can handle uncertainty and provide confidence measures in their outputs, making them suitable for real-world face recognition applications [3,4]. ...
Article
Full-text available
Face recognition is a widely researched field in computer vision, with various approaches being developed to improve accuracy and efficiency. One such approach is the use of probabilistic decision-based neural networks (PDBNN). In this article, the authors present a novel method for face recognition using PDBNNs. They explain that PDBNNs are a type of neural network that can model complex relationships between inputs and outputs, making them suitable for tasks like face recognition. The authors describe the architecture of their PDBNN, which consists of multiple layers of neurons. Each neuron in the network makes probabilistic decisions based on the inputs it receives, which are then propagated through the network to make a final decision. To train their PDBNN, the authors use a dataset of facial images with known labels. They describe a two-step training process, where the network is first trained using a standard backpropagation algorithm and then fine-tuned using a probabilistic decision-based learning algorithm. This allows the network to learn both discriminative features of the faces and the associated uncertainties. The authors evaluate the performance of their PDBNN on several benchmark face recognition datasets.
Article
Full-text available
Understanding the behaviour of traffic participants within the geo-spatial context of road/intersection topology is a vital prerequisite for any smart ITS application. This article presents a video-based traffic analysis and anomaly detection system covering the complete data processing pipeline, including sensor data acquisition, analysis, and digital twin reconstruction. The system solves the challenge of geo-spatial mapping of captured visual data onto the road/intersection topology by semantic analysis of aerial data. Additionally, the automated camera calibration component enables instant camera pose estimation to map traffic agents onto the road/intersection surface accurately. A novel aspect is approaching the anomaly detection problem by AI analysis of both the spatio-temporal visual clues and the geo-spatial trajectories for all types of traffic participants, such as pedestrians, bicyclists, and vehicles. This enables recognition of anomalies related to either traffic-rule violations, for example, jaywalking, improper turns, zig-zag driving, and unlawful stops, or behavioural anomalies: littering, accidents, falling, vandalism, violence, infrastructure collapse, etc. The method achieves leading anomaly detection results on benchmark datasets World Cup 2014, UCF-Crime, XD-Violence, and ShanghaiTech. All the obtained results are streamed and rendered in real-time by the developed TGX digital twin visualizer. The complete system has been deployed and validated on the roads of Helmond town in The Netherlands.
Preprint
Purpose: Augmented reality (AR) may allow vitreoretinal surgeons to leverage microscope-integrated digital imaging systems to analyze and highlight key retinal anatomic features in real-time, possibly improving safety and precision during surgery. By employing convolutional neural networks (CNNs) for retina vessel segmentation, a retinal coordinate system can be created that allows pre-operative images of capillary non-perfusion or retinal breaks to be digitally aligned and overlayed upon the surgical field in real-time. Such technology may be useful in assuring thorough laser treatment of capillary non-perfusion or in using pre-operative optical coherence tomography (OCT) to guide macular surgery when microscope-integrated OCT (MIOCT) is not available. Methods: This study is a retrospective analysis involving the development and testing of a novel image registration algorithm for vitreoretinal surgery. Fifteen anonymized cases of pars plana vitrectomy with epiretinal membrane peeling, along with corresponding preoperative fundus photographs and optical coherence tomography (OCT) images, were retrospectively collected from the Mayo Clinic database. We developed a TPU (Tensor-Processing Unit)-accelerated CNN for semantic segmentation of retinal vessels from fundus photographs and subsequent real-time image registration in surgical video streams. An iterative patch-wise cross-correlation (IPCC) algorithm was developed for image registration, with a focus on optimizing processing speeds and maintaining high spatial accuracy. The primary outcomes measured were processing speed in frames per second (FPS) and the spatial accuracy of image registration, quantified by the Dice coefficient between registered and manually aligned images. Results: When deployed on an Edge TPU, the CNN model combined with our image registration algorithm processed video streams at a rate of 14 FPS, which is superior to processing rates achieved on other standard hardware configurations. The IPCC algorithm efficiently aligned pre-operative and intraoperative images, showing high accuracy in comparison to manual registration. Conclusion: This study demonstrates the feasibility of using TPU-accelerated CNNs for enhanced AR in vitreoretinal surgery.
Article
Full-text available
The local binary pattern (LBP) is an effective feature, describing the size relationship between the neighboring pixels and the current pixel. While individual LBP-based methods yield good results, co-occurrence LBP-based methods exhibit a better ability to extract structural information. However, most of the co-occurrence LBP-based methods excel mainly in dealing with rotated images, exhibiting limitations in preserving performance for scaled images. To address the issue, a cross-scale co-occurrence LBP (CS-CoLBP) is proposed. Initially, we construct an LBP co-occurrence space to capture robust structural features by simulating scale transformation. Subsequently, we use Cross-Scale Co-occurrence pairs (CS-Co pairs) to extract the structural features, keeping robust descriptions even in the presence of scaling. Finally, we refine these CS-Co pairs through Rotation Consistency Adjustment (RCA) to bolster their rotation invariance, thereby making the proposed CS-CoLBP as powerful as existing co-occurrence LBP-based methods for rotated image description. While keeping the desired geometric invariance, the proposed CS-CoLBP maintains a modest feature dimension. Empirical evaluations across several datasets demonstrate that CS-CoLBP outperforms the existing state-of-the-art LBP-based methods even in the presence of geometric transformations and image manipulations.
Article
Resin canals produce and transport oleoresins that are important for tree defenses within the Pinaceae family. Rapid measurement techniques are needed to better understand how resin canal characteristics vary due to genetic and environmental effects. Here we describe a semi-automated microscopy imaging system that was built for quantifying longitudinal resin canals. Tree increment cores from 210 loblolly pine ( Pinus taeda L.) trees were prepared into radial strips and the transverse surface of the samples polished with 400 and then 600 grit sandpaper. Each sample was imaged along its entire length (pith to bark) with the images collected from a monochrome camera connected to a Plan Fluorite 4× objective lens. The samples were imaged on the transverse surface via transmitted 850 nm near-infrared light directed at the radial surfaces of the samples. A total of 24 153 images were collected and then processed offline in Python using the Open Computer Vision Library (OpenCV) using a series of algorithms including contrast correction, noise removal, thresholding, contour identification, erosion, and dilation. A total of 24 491 resin canals were identified and their size quantified. The resin canals were assigned into annual rings and positioned within the earlywood or latewood of a ring using cross-correlation whereby a pseudo-density value was derived from the images and matched with density values measured by X-ray densitometry. Of the total resin canals identified, 51.5% were in the earlywood and 48.5% in the latewood, with the majority being detected in the earlywood in the first six years and the latewood in years 7 and above. This study represents the most information collected on the resin canals of loblolly pine. The detailed description of the hardware and image analysis methods should serve as a useful guide to others interested in imaging resin canals as well as other anatomical features.
Preprint
Full-text available
Multimodal image matching has been a critical challenge within the field of computer vision over an extended period. In recent years, detector-free methods have received widespread attention for achieving high matching accuracy. However, these methods often fail to fully exploit the semantic information within images, which is crucial for achieving accurate matching. To overcome this limitation, we propose SemFE, a multimodal image matching framework that leverages learnable local semantic feature enhancement. In SemFE, we designed a dynamic semantic feature extraction module to capture semantic information, along with a semantic information enhancement module to refine semantically related features. Additionally, we developed a multi-scale feature fusion backbone integrated with a transformer to further enhance feature extraction. Comprehensive experimental results demonstrate that SemFE consistently outperforms competing methods across various multimodal image datasets.
Article
Full-text available
In this paper we describe an approach to recognizing poorly textured objects, that may contain holes and tubular parts, in cluttered scenes under arbitrary viewing conditions. To this end we develop a number of novel components. First, we introduce a new edge-based local feature detector that is invariant to similarity transformations. The features are localized on edges and a neighbourhood is estimated in a scale invariant manner. Second, the neighbourhood descriptor computed for foreground features is not affected by background clutter, even if the feature is on an object boundary. Third, the descriptor generalizes Lowe's SIFT method (12) to edges. An object model is learnt from a single training image. The object is then recognized in new images in a series of steps which apply progressively tighter geometric restrictions. A final contribution of this work is to allow sufficient flexibility in the geometric representation that objects in the same visual class can be recognized. Results are demonstrated for various object classes including bikes and rackets.
Article
Full-text available
In this paper we analyze in some detail the geometry of a pair of cameras, i.e., a stereo rig. Contrary to what has been done in the past and is still done currently, for example in stereo or motion analysis, we do not assume that the intrinsic parameters of the cameras are known (coordinates of the principal points, pixel aspect ratio and focal lengths). This is important for two reasons. First, it is more realistic in applications where these parameters may vary according to the task (active vision). Second, the general case considered here captures all the relevant information that is necessary for establishing correspondences between two pairs of images. This information is fundamentally projective and is hidden in a confusing manner in the commonly used formalism of the Essential matrix introduced by Longuet-Higgins (1981). This paper clarifies the projective nature of the correspondence problem in stereo and shows that the epipolar geometry can be summarized in one 3×3 matrix of rank 2 which we propose to call the Fundamental matrix. After this theoretical analysis, we embark on the task of estimating the Fundamental matrix from point correspondences, a task which is of practical importance. We analyze theoretically, and compare experimentally using synthetic and real data, several methods of estimation. The problem of the stability of the estimation is studied from two complementary viewpoints. First we show that there is an interesting relationship between the Fundamental matrix and three-dimensional planes which induce homographies between the images and create instabilities in the estimation procedures. Second, we point to a deep relation between the instability of the estimation procedure and the presence in the scene of so-called critical surfaces which have been studied in the context of motion analysis. Finally we conclude by stressing the fact that we believe that the Fundamental matrix will play a crucial role in future applications of three-dimensional Computer Vision by greatly increasing its versatility, robustness and hence applicability to real difficult problems.
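As a concrete companion to the rank-2, 3×3 Fundamental matrix discussed above, here is a hedged NumPy sketch of the classical normalized 8-point estimate from point correspondences. This is the textbook method, not the specific estimators compared in the paper.

```python
import numpy as np

def normalize(pts):
    """Translate/scale points so the centroid is at the origin, mean distance sqrt(2)."""
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
    ph = np.column_stack([pts, np.ones(len(pts))]) @ T.T
    return ph, T

def fundamental_8point(x1, x2):
    """Normalized 8-point estimate of F from >= 8 correspondences (x1[i] <-> x2[i])."""
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence x2^T F x1 = 0 gives one row of the linear system A f = 0.
    A = np.column_stack([
        p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
        p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
        p1[:, 0], p1[:, 1], np.ones(len(p1)),
    ])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1          # undo the normalization
    return F / F[2, 2]
```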
Conference Paper
Full-text available
We introduce a new type of local feature based on the phase and amplitude responses of complex-valued steerable filters. The design of this local feature is motivated by a desire to obtain feature vectors which are semi-invariant under common image deformations, yet distinctive enough to provide useful identity information. A recent proposal for such local features involves combining differential invariants to particular image deformations, such as rotation. Our approach differs in that we consider a wider class of image deformations, including the addition of noise, along with both global and local brightness variations. We use steerable filters to make the feature robust to rotation. And we exploit the fact that phase data is often locally stable with respect to scale changes, noise, and common brightness changes. We provide empirical results comparing our local feature with one based on differential invariants. The results show that our phase-based local feature leads to better performance when dealing with common illumination changes and 2-D rotation, while giving comparable effects in terms of scale changes.
Article
Full-text available
This paper defines a multiple resolution representation for the two-dimensional gray-scale shapes in an image. This representation is constructed by detecting peaks and ridges in the difference of lowpass (DOLP) transform. Descriptions of shapes which are encoded in this representation may be matched efficiently despite changes in size, orientation, or position. Motivations for a multiple resolution representation are presented first, followed by the definition of the DOLP transform. Techniques are then presented for encoding a symbolic structural description of forms from the DOLP transform. This process involves detecting local peaks and ridges in each bandpass image and in the entire three-dimensional space defined by the DOLP transform. Linking adjacent peaks in different bandpass images gives a multiple resolution tree which describes shape. Peaks which are local maxima in this tree provide landmarks for aligning, manipulating, and matching shapes. Detecting and linking the ridges in each DOLP bandpass image provides a graph which links peaks within a shape in a bandpass image and describes the positions of the boundaries of the shape at multiple resolutions. Detecting and linking the ridges in the DOLP three-space describes elongated forms and links the largest peaks in the tree. The principles for determining the correspondence between symbols in pairs of such descriptions are then described. Such correspondence matching is shown to be simplified by using the correspondence at lower resolutions to constrain the possible correspondence at higher resolutions.
Conference Paper
Full-text available
A key component of a mobile robot system is the ability to localize itself accurately and build a map of the environment simultaneously. In this paper, a vision-based mobile robot localization and mapping algorithm is described which uses scale-invariant image features as landmarks in unmodified dynamic environments. These 3D landmarks are localized and robot ego-motion is estimated by matching them, taking into account the feature viewpoint variation. With our Triclops stereo vision system, experiments show that these features are robustly matched between views, 3D landmarks are tracked, robot pose is estimated and a 3D map is built.
Article
Full-text available
We introduce a novel view-based object representation, called the saliency map graph (SMG), which captures the salient regions of an object view at multiple scales using a wavelet transform. This compact representation is highly invariant to translation, rotation (image and depth), and scaling, and offers the locality of representation required for occluded object recognition. To compare two saliency map graphs, we introduce two graph similarity algorithms. The first computes the topological similarity between two SMGs, providing a coarse-level matching of two graphs. The second computes the geometrical similarity between two SMGs, providing a fine-level matching of two graphs. We test and compare these two algorithms on a large database of model object views.
Conference Paper
Full-text available
There has been considerable success in automated reconstruction for image sequences where small baseline algorithms can be used to establish matches across a number of images. In contrast, in the case of widely separated views, methods have generally been restricted to two or three views. In this paper we investigate the problem of establishing relative viewpoints given a large number of images where no ordering information is provided. A typical application would be where images are obtained from different sources or at different times: both the viewpoint (position, orientation, scale) and lighting conditions may vary significantly over the data set. Such a problem is not fundamentally amenable to exhaustive pairwise and triplet wide baseline matching because this would be prohibitively expensive as the number of views increases. Instead, we investigate how a combination of image invariants, covariants, and multiple view relations can be used in concord to enable efficient multiple view matching. The result is a matching algorithm which is linear in the number of views. The methods are illustrated on several real image data sets. The output enables an image based technique for navigating in a 3D scene, moving from one image to whichever image is the next most appropriate.
Article
Full-text available
We describe and analyze an appearance-based 3-D object recognition system that avoids some of the problems of previous appearance-based schemes. We describe various large-scale performance tests and report good performance for full-sphere/hemisphere recognition of up to 24 complex, curved objects, robustness against clutter and occlusion, and some intriguing generic recognition behavior. We also establish a protocol that permits performance in the presence of quantifiable amounts of clutter and occlusion to be predicted on the basis of simple score statistics derived from clean test images and pure clutter images.
Conference Paper
Full-text available
We present a robust method for automatically matching features in images corresponding to the same physical point on an object seen from two arbitrary viewpoints. Unlike conventional stereo matching approaches, we assume no prior knowledge about the relative camera positions and orientations. In fact, in our application this is the information we wish to determine from the image feature matches. Features are detected in two or more images and characterised using affine texture invariants. The problem of window effects is explicitly addressed by our method: our feature characterisation is invariant to linear transformations of the image data including rotation, stretch and skew. The feature matching process is optimised for a structure-from-motion application where we wish to ignore unreliable matches at the expense of reducing the number of feature matches.
Article
Full-text available
We have previously developed a mobile robot system which uses scale invariant visual landmarks to localize and simultaneously build a 3D map of the environment. In this paper, we look at global localization, also known as the kidnapped robot problem, where the robot localizes itself globally, without any prior location estimate. This is achieved by matching distinctive landmarks in the current frame to a database map. A Hough Transform approach and a RANSAC approach for global localization are compared, showing that RANSAC is much more efficient. Moreover, robust global localization can be achieved by matching a small sub-map of the local region built from multiple frames.
Article
Full-text available
Recognition systems attempt to recover information about the identity of observed objects and their location in the environment. A fundamental problem in recognition is pose estimation. This is the problem of using a correspondence between some portions of an object model and some portions of an image to determine whether the image contains an instance of the object, and, in case it does, to determine the transformation that relates the model to the image. The current approaches to this problem are divided into methods that use “global” properties of the object (e.g., centroid and moments of inertia) and methods that use “local” properties of the object (e.g., corners and line segments). Global properties are sensitive to occlusion and, specifically, to self occlusion. Local properties are difficult to locate reliably, and their matching involves intensive computation. We present a novel method for recognition that uses region information. In our approach the model and the image are divided into regions. Given a match between subsets of regions (without any explicit correspondence between different pieces of the regions) the alignment transformation is computed. The method applies to planar objects under similarity, affine, and projective transformations and to projections of 3-D objects undergoing affine and projective transformations. The new approach combines many of the advantages of the previous two approaches, while avoiding some of their pitfalls. Like the global methods, our approach makes use of region information that reflects the true shape of the object. But like local methods, our approach can handle occlusion.
Article
Full-text available
Nearest-neighbor correlation-based similarity computation in the space of outputs of complex-type receptive fields can support robust recognition of 3D objects. Our experiments with four collections of objects resulted in mean recognition rates between 84% (for subordinate-level discrimination among 15 quadruped animal shapes) and 94% (for basic-level recognition of 20 everyday objects), over a 40° × 40° range of viewpoints, centered on a stored canonical view and related to it by rotations in depth. This result has interesting implications for the design of a front end to an artificial object recognition system, and for the understanding of the faculty of object recognition in primate vision. 1 INTRODUCTION Orientation-selective receptive fields (RFs) patterned after those found in the mammalian primary visual cortex (V1) are employed by a growing number of connectionist approaches to machine vision (for a review, see Edelman, 1997). Despite the success of RF-based sy...
Article
Full-text available
This article presents: (i) a multiscale representation of grey-level shape called the scale-space primal sketch, which makes explicit both features in scale-space and the relations between structures at different scales, (ii) a methodology for extracting significant blob-like image structures from this representation, and (iii) applications to edge detection, histogram analysis, and junction classification demonstrating how the proposed method can be used for guiding later-stage visual processes. The representation gives a qualitative description of image structure, which allows for detection of stable scales and associated regions of interest in a solely bottom-up data-driven way. In other words, it generates coarse segmentation cues, and can hence be seen as preceding further processing, which can then be properly tuned. It is argued that once such information is available, many other processing tasks can become much simpler. Experiments on real imagery demonstrate that the proposed theory gives intuitive results.
Article
Full-text available
The appearance of an object is composed of local structure. This local structure can be described and characterized by a vector of local features measured by local operators such as Gaussian derivatives or Gabor filters. This article presents a technique where appearances of objects are represented by the joint statistics of such local neighborhood operators. As such, this represents a new class of appearance based techniques for computer vision. Based on joint statistics, the paper develops techniques for the identification of multiple objects at arbitrary positions and orientations in a cluttered scene. Experiments show that these techniques can identify over 100 objects in the presence of major occlusions. Most remarkably, the techniques have low complexity and therefore run in real-time. 1. Introduction The paper proposes a framework for the statistical representation of the appearance of arbitrary 3D objects. This representation consists of a probability density function or jo...
Article
Full-text available
An inherent property of objects in the world is that they only exist as meaningful entities over certain ranges of scale. If one aims at describing the structure of unknown real-world signals, then a multi-scale representation of data is of crucial importance. This chapter gives a tutorial review of a special type of multi-scale representation, linear scale-space representation, which has been developed by the computer vision community in order to handle image structures at different scales in a consistent manner. The basic idea is to embed the original signal into a one-parameter family of gradually smoothed signals, in which the fine scale details are successively suppressed. Under rather general conditions on the type of computations that are to be performed at the first stages of visual processing, in what can be termed the visual front end, it can be shown that the Gaussian kernel and its derivatives are singled out as the only possible smoothing kernels. The conditions that specify ...
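In standard notation (not quoted from the chapter), the linear scale-space family and the Gaussian kernel it singles out can be written as:

```latex
L(\cdot;\, t) = g(\cdot;\, t) * f, \qquad
g(x, y;\, t) = \frac{1}{2\pi t}\exp\!\left(-\frac{x^{2}+y^{2}}{2t}\right),
\qquad
\partial_t L = \tfrac{1}{2}\,\nabla^{2} L, \quad L(\cdot;\, 0) = f,
```

so smoothing with the Gaussian at increasing scale t is equivalent to evolving the image under the heat (diffusion) equation.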
Conference Paper
Shape indexing is a way of making rapid associations between features detected in an image and object models that could have produced them. When model databases are large, the use of high-dimensional features is critical, due to the improved level of discrimination they can provide. Unfortunately, finding the nearest neighbour to a query point rapidly becomes inefficient as the dimensionality of the feature space increases. Past indexing methods have used hash tables for hypothesis recovery, but only in low-dimensional situations. In this paper we show that a new variant of the k-d tree search algorithm makes indexing in higher-dimensional spaces practical. This Best Bin First, or BBF search is an approximate algorithm which finds the nearest neighbour for a large fraction of the queries, and a very close neighbour in the remaining cases. The technique has been integrated into a fully developed recognition system, which is able to detect complex objects in real, cluttered scenes in just a few seconds
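OpenCV's FLANN kd-tree matcher exposes the same idea of bounding the number of leaf nodes examined per query. A hedged sketch follows, with placeholder image names and an assumed ratio-test threshold.

```python
import cv2

# Placeholder image paths; SIFT descriptors are 128-D floats, exactly the regime
# where exhaustive nearest-neighbor search becomes the bottleneck.
img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("model.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Randomized kd-trees with a bounded number of leaf "checks" per query: like BBF,
# the search is approximate, trading a small loss in accuracy for a large speedup.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),   # FLANN_INDEX_KDTREE
                              dict(checks=50))              # max leaves examined
knn = flann.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if it clearly beats the second-best candidate.
good = [m for m, n in (p for p in knn if len(p) == 2) if m.distance < 0.8 * n.distance]
print(len(good), "confident matches")
```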
Article
The wide-baseline stereo problem, i.e. the problem of establishing correspondences between a pair of images taken from different viewpoints, is studied. A new set of image elements that are put into correspondence, the so-called extremal regions, is introduced. Extremal regions possess highly desirable properties: the set is closed under (1) continuous (and thus projective) transformation of image coordinates and (2) monotonic transformation of image intensities. An efficient (near linear complexity) and practically fast detection algorithm (near frame rate) is presented for an affinely invariant stable subset of extremal regions, the maximally stable extremal regions (MSER). A new robust similarity measure for establishing tentative correspondences is proposed. The robustness ensures that invariants from multiple measurement regions (regions obtained by invariant constructions from extremal regions), some that are significantly larger (and hence discriminative) than the MSERs, may be used to establish tentative correspondences. The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes. Significant change of scale (3.5×), illumination conditions, out-of-plane rotation, occlusion, locally anisotropic scale change and 3D translation of the viewpoint are all present in the test problems. Good estimates of epipolar geometry (average distance from corresponding points to the epipolar line below 0.09 of the inter-pixel distance) are obtained.
Article
This paper proposes a robust approach to image matching by exploiting the only available geometric constraint, namely, the epipolar constraint. The images are uncalibrated, namely the motion between them and the camera parameters are not known. Thus, the images can be taken by different cameras or a single camera at different time instants. If we make an exhaustive search for the epipolar geometry, the complexity is prohibitively high. The idea underlying our approach is to use classical techniques (correlation and relaxation methods in our particular implementation) to find an initial set of matches, and then use a robust technique—the Least Median of Squares (LMedS)—to discard false matches in this set. The epipolar geometry can then be accurately estimated using a meaningful image criterion. More matches are eventually found, as in stereo matching, by using the recovered epipolar geometry. A large number of experiments have been carried out, and very good results have been obtained.Regarding the relaxation technique, we define a new measure of matching support, which allows a higher tolerance to deformation with respect to rigid transformations in the image plane and a smaller contribution for distant matches than for nearby ones. A new strategy for updating matches is developed, which only selects those matches having both high matching support and low matching ambiguity. The update strategy is different from the classical “winner-take-all”, which is easily stuck at a local minimum, and also from “loser-take-nothing”, which is usually very slow. The proposed algorithm has been widely tested and works remarkably well in a scene with many repetitive patterns.
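A hedged sketch of the same two-stage idea, tentative matches followed by robust rejection via least median of squares, using OpenCV's LMedS option for the fundamental matrix. ORB descriptor matching stands in for the correlation/relaxation stage described above, and the file names are placeholders.

```python
import cv2
import numpy as np

g1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
g2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# Stage 1: cheap tentative correspondences (here: ORB + cross-checked matching).
orb = cv2.ORB_create(2000)
k1, d1 = orb.detectAndCompute(g1, None)
k2, d2 = orb.detectAndCompute(g2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

# Stage 2: LMedS fits F to the subset with the smallest median squared residual,
# so a large fraction of false tentative matches can be tolerated and discarded.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_LMEDS)
inliers = int(mask.sum()) if mask is not None else 0
print(f"F estimated from {inliers} / {len(matches)} tentative matches")
```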
Conference Paper
We present a method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition. We focus on a particular type of model where objects are represented as flexible constellations of rigid parts (features). The variability within a class is represented by a joint probability density function (pdf) on the shape of the constellation and the output of part detectors. In a first stage, the method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator. It then learns the statistical shape model using expectation maximization. The method achieves very good classification results on human faces and rear views of cars.
Conference Paper
This paper approaches the problem of finding correspondences between images in which there are large changes in viewpoint, scale and illumination. Recent work has shown that scale-space 'interest points' may be found with good repeatability in spite of such changes. Furthermore, the high entropy of the surrounding image regions means that local descriptors are highly discriminative for matching. For descriptors at interest points to be robustly matched between images, they must be as far as possible invariant to the imaging process. In this work we introduce a family of features which use groups of interest points to form geometrically invariant descriptors of image regions. Feature descriptors are formed by resampling the image relative to canonical frames defined by the points. In addition to robust matching, a key advantage of this approach is that each match implies a hypothesis of the local 2D (projective) transformation. This allows us to immediately reject most of the false matches using a Hough transform. We reject remaining outliers using RANSAC and the epipolar constraint. Results show that dense feature matching can be achieved in a few seconds of computation on 1GHz Pentium III machines.
Conference Paper
Let S denote a set of n points in d-dimensional space ℝ^d, and let dist(p, q) denote the distance between two points in any Minkowski metric. For any real ε > 0 and q ∈ ℝ^d, a point p ∈ S is a (1+ε)-approximate nearest neighbor of q if, for all p′ ∈ S, we have dist(p, q)/dist(p′, q) ≤ (1+ε). We show how to preprocess a set of n points in ℝ^d in O(n log n) time and O(n) space, so that given a query point q ∈ ℝ^d and ε > 0, a (1+ε)-approximate nearest neighbor of q can be computed in O(log n) time. Constant factors depend on d and ε. We show that given an integer k ≥ 1, (1+ε)-approximations to the k nearest neighbors of q can be computed in O(k log n) time.
Article
In practice the relevant details of images exist only over a restricted range of scale. Hence it is important to study the dependence of image structure on the level of resolution. It seems clear enough that visual perception treats images on several levels of resolution simultaneously and that this fact must be important for the study of perception. However, no applicable mathematically formulated theory to deal with such problems appears to exist. In this paper it is shown that any image can be embedded in a one-parameter family of derived images (with resolution as the parameter) in essentially only one unique way if the constraint that no spurious detail should be generated when the resolution is diminished, is applied. The structure of this family is governed by the well known diffusion equation (a parabolic, linear, partial differential equation of the second order). As such the structure fits into existing theories that treat the front end of the visual system as a continuous stack of homogeneous layers, characterized by iterated local processing schemes. When resolution is decreased the image becomes less articulated because the extrema ("light and dark blobs") disappear one after the other. This erosion of structure is a simple process that is similar in every case. As a result any image can be described as a juxtaposed and nested set of light and dark blobs, wherein each blob has a limited range of resolution in which it manifests itself. The structure of the family of derived images permits a derivation of the sampling density required to sample the image at multiple scales of resolution. (ABSTRACT TRUNCATED AT 250 WORDS)
Conference Paper
We present a method to learn and recognize object class models from unlabeled and unsegmented cluttered scenes in a scale invariant manner. Objects are modeled as flexible constellations of parts. A probabilistic representation is used for all aspects of the object: shape, appearance, occlusion and relative scale. An entropy-based feature detector is used to select regions and their scale within the image. In learning the parameters of the scale-invariant object model are estimated. This is done using expectation-maximization in a maximum-likelihood setting. In recognition, this model is used in a Bayesian manner to classify images. The flexible nature of the model is demonstrated by excellent results over a range of datasets including geometrically constrained classes (e.g. faces, cars) and flexible objects (such as animals).
Conference Paper
There have been important recent advances in object recognition through the matching of invariant local image features. However, the existing approaches are based on matching to individual training images. This paper presents a method for combining multiple images of a 3D object into a single model representation. This provides for recognition of 3D objects from any viewpoint, the generalization of models to non-rigid changes, and improved robustness through the combination of features acquired under a range of imaging conditions. The decision of whether to cluster a training image into an existing view representation or to treat it as a new view is based on the geometric accuracy of the match to previous model views. A new probabilistic model is developed to reduce the false positive matches that would otherwise arise due to loosened geometric constraints on matching 3D and non-rigid models. A system has been developed based on these approaches that is able to robustly recognize 3D objects in cluttered natural images in sub-second times.
Article
This paper addresses the problem of retrieving images from large image databases. The method is based on local grayvalue invariants which are computed at automatically detected interest points. A voting algorithm and semilocal constraints make retrieval possible. Indexing allows for efficient retrieval from a database of more than 1,000 images. Experimental results show correct retrieval in the case of partial visibility, similarity transformations, extraneous features, and small perspective deformations
Article
Objects can be recognized on the basis of their color alone by color indexing, a technique developed by Swain-Ballard (1991) which involves matching color-space histograms. Color indexing fails, however, when the incident illumination varies either spatially or spectrally. Although this limitation might be overcome by preprocessing with a color constancy algorithm, we instead propose histogramming color ratios. Since the ratios of color RGB triples from neighboring locations are relatively insensitive to changes in the incident illumination, this circumvents the need for color constancy preprocessing. Results of tests with the new color-constant-color-indexing algorithm on synthetic and real images show that it works very well even when the illumination varies spatially in its intensity and color
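A rough sketch of the ratio-histogram idea is given below; the log-ratio formulation, bin count, and clipping range are illustrative assumptions rather than the published algorithm's exact construction.

```python
import numpy as np

def color_ratio_histogram(image, bins=16, eps=1e-6):
    """Histogram of RGB ratios between horizontally/vertically adjacent pixels.

    Ratios of neighboring RGB values are largely insensitive to slowly varying
    illumination, which is the intuition behind color-constant color indexing.
    (Sketch only; not the published method's exact formulation.)
    """
    img = image.astype(np.float64) + eps
    ratios = []
    for axis in (0, 1):                                    # vertical and horizontal neighbors
        shifted = np.roll(img, 1, axis=axis)
        ratios.append(np.log(img / shifted).reshape(-1, 3))
    r = np.clip(np.vstack(ratios), -2, 2)                  # log-ratios, clipped for binning
    hist, _ = np.histogramdd(r, bins=bins, range=[(-2, 2)] * 3)
    return hist / hist.sum()

# Matching: compare a query histogram with each database histogram, e.g. by intersection.
def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()
```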
Article
Model-based recognition and motion tracking depend upon the ability to solve for projection and model parameters that will best fit a 3-D model to matching 2-D image features. The author extends current methods of parameter solving to handle objects with arbitrary curved surfaces and with any number of internal parameters representing articulation, variable dimensions, or surface deformations. Numerical stabilization methods are developed that take account of inherent inaccuracies in the image measurements and allow useful solutions to be determined even when there are fewer matches than unknown parameters. The Levenberg-Marquardt method is used to always ensure convergence of the solution. These techniques allow model-based vision to be used for a much wider class of problems than was possible with previous methods. Their application is demonstrated for tracking the motion of curved, parameterized objects
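A hedged sketch of this kind of parameter solving with Levenberg-Marquardt, using SciPy's least_squares on synthetic data; the chosen parameterization (axis-angle rotation via OpenCV's Rodrigues, translation, focal length), the synthetic points, and the noise level are assumptions for illustration only.

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def project(params, X):
    """Project 3-D model points with an axis-angle rotation, translation, and focal length."""
    rvec, t, f = params[:3], params[3:6], params[6]
    R, _ = cv2.Rodrigues(np.asarray(rvec, dtype=np.float64).reshape(3, 1))
    Xc = X @ R.T + t
    return f * Xc[:, :2] / Xc[:, 2:3]        # simple pinhole projection

def residuals(params, X, x_obs):
    return (project(params, X) - x_obs).ravel()

# Synthetic example: model points in front of the camera and noisy 2-D observations.
rng = np.random.default_rng(0)
X_model = rng.random((20, 3)) + [0.0, 0.0, 4.0]
true_params = np.r_[0.1, -0.05, 0.02, 0.1, -0.2, 0.3, 800.0]
x_image = project(true_params, X_model) + rng.normal(scale=0.5, size=(20, 2))

# Levenberg-Marquardt refinement from a rough initial guess.
init = np.r_[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 700.0]
fit = least_squares(residuals, init, args=(X_model, x_image), method="lm")
print("estimated focal length:", fit.x[6])
```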
Article
Given a set S of n points in d-dimensional Euclidean space E^d, and a query point q ∈ E^d, we wish to determine the nearest neighbor of q, that is, the point of S whose Euclidean distance to q is minimum. The goal is to preprocess the point set S, such that queries can be answered as efficiently as possible. We assume that the dimension d is a constant independent of n. Although reasonably good solutions to this problem exist when d is small, as d increases the performance of these algorithms degrades rapidly. We present a randomized algorithm for approximate nearest neighbor searching. Given any set of n points S ⊆ E^d, and a constant ε > 0, we produce a data structure, such that given any query point, a point of S will be reported whose distance from the query point is at most a factor of (1 + ε) from that of the true nearest neighbor. Our algorithm runs in O(log n) expected time and requires O(n log n) space. The constant factors depend on d and ε. Because of the practical importance of nearest neighbor searching in higher dimensions, we have implemented a practical variant of this algorithm, and show empirically that for many point distributions this variant of the algorithm finds the nearest neighbor in moderately large dimension significantly faster than existing practical approaches.
Article
In this paper we examine methods for the detection of outliers to a least squares fit that would have been previously computationally infeasible. The fitting of linear regression models by least squares is undoubtedly the most widely used modelling procedure. A major drawback, however, is that outliers which are inevitably included in the initial fit can so distort the fitting process that the resulting fit can be arbitrary. A common practice is to search for outliers using the raw residuals. However, the use of these on their own can be misleading.
Article
We describe how to model the appearance of a 3-D object using multiple views, learn such a model from training images, and use the model for object recognition. The model uses probability distributions to describe the range of possible variation in the object's appearance. These distributions are organized on two levels. Large variations are handled by partitioning training images into clusters corresponding to distinctly different views of the object. Within each cluster, smaller variations are represented by distributions characterizing uncertainty in the presence, position, and measurements of various discrete features of appearance. Many types of features are used, ranging in abstraction from edge segments to perceptual groupings and regions. A matching procedure uses the feature uncertainty information to guide the search for a match between model and image. Hypothesized feature pairings are used to estimate a viewpoint transformation taking account of feature uncertainty. These methods have been implemented in an object recognition system, OLIVER. Experiments show that OLIVER is capable of learning to recognize complex objects in cluttered images, while acquiring models that represent those objects using relatively few views.
Article
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999). An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.