Conference Paper

Object Recognition from Local Scale-Invariant Features

Authors: David G. Lowe

Abstract

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
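
As a rough, single-octave illustration of the staged filtering described above (not the paper's actual implementation, which also resamples between octaves, assigns orientations, and builds multi-orientation keys), the sketch below builds a difference-of-Gaussian stack with NumPy/SciPy and keeps pixels that are extrema over their 3x3x3 scale-space neighbourhood; the blur schedule and contrast threshold are assumed values.

```python
import numpy as np
from scipy import ndimage

def dog_extrema(gray, num_levels=5, sigma0=1.6, k=2 ** 0.5, thresh=0.03):
    """Single-octave sketch of scale-space extrema detection.

    Builds a stack of Gaussian-blurred images, takes differences of adjacent
    levels (difference-of-Gaussian), and keeps pixels that are maxima or
    minima over their 3x3x3 neighbourhood in (scale, y, x).
    """
    img = gray.astype(np.float32) / 255.0
    blurred = [ndimage.gaussian_filter(img, sigma0 * k ** i) for i in range(num_levels)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(num_levels - 1)])

    candidates = []
    for s in range(1, dog.shape[0] - 1):
        cube = dog[s - 1:s + 2]                        # three adjacent DoG levels
        local_max = ndimage.maximum_filter(cube, size=3)[1]
        local_min = ndimage.minimum_filter(cube, size=3)[1]
        stable = ((dog[s] == local_max) & (dog[s] > thresh)) | \
                 ((dog[s] == local_min) & (dog[s] < -thresh))
        ys, xs = np.nonzero(stable)
        candidates += [(int(x), int(y), s) for x, y in zip(xs, ys)]
    return candidates

# Example use on a placeholder image:
# import cv2; pts = dog_extrema(cv2.imread("test.png", cv2.IMREAD_GRAYSCALE))
```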


... These algorithms are being improved at a fast pace and are now used in daily life in many sectors such as logistics, security access, healthcare, traffic management and so on [21]. Scale-Invariant Feature Transform (SIFT) [22,23] represents a valid and versatile method for detecting and extracting distinctive local features from an image [24,25]. A key characteristic is the ability to recognize unique features of the image under analysis even under transformations such as scaling, rotation and illumination variations. ...
... The latter, identified as SIFT keys, are invariant to translation, scaling, and rotation, and partially invariant to illumination changes in the image [23], making them more robust even in the case of fluctuations in the intensity of the laser beam. Four main calculation steps are required to generate all the image features: (a) scale-space extrema detection, done using a difference-of-Gaussian function to identify potential points of interest that are invariant to scale and orientation; (b) key point localization, where key points are selected based on measurements of their stability; (c) orientation assignment, where one or more orientations are assigned to each key point location based on the gradient directions of the local image; and (d) key point descriptor, which describes key points by measuring the local gradients in the image at the selected scale [22]. To analyze and compare the images, a Python script implementing the SIFT algorithm from the OpenCV library was used [30]. ...
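
As a minimal sketch of how those four stages are exposed by the OpenCV implementation mentioned above (the file name is a placeholder; this is illustrative, not the authors' script), SIFT_create()/detectAndCompute() returns keypoints whose location, scale and orientation come from steps (a)-(c), plus the 128-dimensional descriptors of step (d):

```python
import cv2

# Minimal sketch: OpenCV's SIFT wraps the four stages described above.
img = cv2.imread("speckle_pattern.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

for kp in keypoints[:5]:
    # kp.pt    -> sub-pixel location from key point localization (step b)
    # kp.size  -> detection scale from the scale-space search (step a)
    # kp.angle -> dominant gradient orientation (step c)
    print(kp.pt, kp.size, kp.angle)

print(descriptors.shape)  # (num_keypoints, 128) descriptor vectors (step d)
```
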
Preprint
Nowadays, due to the growing phenomenon of forgery in many fields, interest in developing new anti-counterfeiting devices and cryptography keys, based on the Physical Unclonable Functions (PUFs) paradigm, has widely increased. PUFs are physical hardware with an intrinsic, irreproducible disorder that allows for on-demand cryptographic key extraction. Among them, optical PUFs are characterized by a large number of degrees of freedom, resulting in higher security and higher sensitivity to environmental conditions. While these promising features led to the growth of advanced fabrication strategies and materials for new PUF devices, their combination with robust recognition algorithms remains largely unexplored. In this work, we present a metric-independent authentication approach that leverages the Scale Invariant Feature Transform (SIFT) algorithm to extract unique and invariant features from the speckle patterns generated by optical Physical Unclonable Functions (PUFs). The application of SIFT to the challenge response pairs (CRPs) protocol allows us to correctly authenticate a client while denying any other fraudulent access. In this way, the authentication process is highly reliable even in the presence of response rotation, zooming, and cropping that may occur in consecutive PUF interrogations and to which other postprocessing algorithms are highly sensitive. These characteristics, together with the speed of the method (tens of microseconds for each operation), broaden the applicability and reliability of PUFs for practical high-security authentication or merchandise anti-counterfeiting.
... These marks can be easily localized in the 2D image using edge-based, region-based or other feature recognition methods. Another approach avoids the need for 3D marker priors by employing feature descriptors like SIFT [21], SURF [22], ORB [23], and BRISK [24] to search and match 2D-2D point pairs across multi-view images. The 3D points are then reconstructed using epipolar geometry. ...
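
A minimal sketch of the match-then-reconstruct idea described in that excerpt, using OpenCV SIFT matching followed by essential-matrix estimation and triangulation; the intrinsic matrix, file names, and ratio-test threshold are assumptions, and SURF/ORB/BRISK descriptors would slot into the same pipeline:

```python
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],      # assumed camera intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

img1 = cv2.imread("view_a.png", cv2.IMREAD_GRAYSCALE)   # placeholder images
img2 = cv2.imread("view_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2D-2D matching, filtered with a ratio test
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Epipolar geometry: essential matrix, relative pose, triangulation
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points_3d = (pts4d[:3] / pts4d[3]).T          # homogeneous -> Euclidean 3D points
```
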
Article
Full-text available
With the increasing integration of industrial products, critical components are increasingly being encapsulated within sealed enclosures, making it difficult to measure their actual positions during assembly using contact measurement techniques, often leading to substandard product quality. X-ray imaging provides a non-destructive solution for inspecting internal structures and accurately positioning internal components. However, traditional pose estimation methods based on X-ray imaging rely on projection optimization, which is time-consuming and cannot meet the timely feedback requirements of assembly processes. In this study, we propose a hybrid pose estimation method for industrial X-ray radiography that combines neural networks for initial pose estimation with local template matching for pose refinement. This approach achieves both high accuracy and efficiency in positioning internal targets. We conducted real X-ray imaging experiments on several objects, including a terahertz anode tube model. The mean alignment error was approximately 0.2 mm, lower than the spatial resolution (about 0.25 mm) of the CT images constructed from the same X-ray projections. The computation time for pose estimation of a single object was about 10 s, significantly faster than conventional methods that typically require several minutes, making it suitable for timely feedback in industrial assembly processes.
... Subsequently, a theory of multiscale, curvature-based shape representation was proposed for images of varying quality and resolution [20], which has been successfully applied alongside methods for symmetry analysis. In 1999, the SIFT algorithm was introduced [18], becoming a standard in the detection of symmetric structures in images. A revolution in image analysis arrived with the advent of Convolutional Neural Networks (CNN), which enabled the automatic learning of complex features from large datasets. ...
Article
Full-text available
The ornaments from the Serbian medieval frescoes belong to the religious decorative art with a restricted set of possible motifs, frequently related to a cross, and thus the basic motifs can fit in a very limited number of symmetry groups. The main criterion for the quality of such ornamental art could be the richness and variety of patterns obtained from a very small number of symmetry groups, proving the creativity of their authors: their ability to create variety with a very restricted number of initial symmetry groups. Here we analyze these ornaments and their symmetry groups and give an introduction to further work on automated symmetry group recognition and pattern reconstruction.
... • Image filters: Techniques like SIFT (Lowe 1999b) extract invariant features under various transformations. • Rule-based methods: These methods (Rudolph et al. 1998;Grau et al. 2001) detect edges and contours using geometric features and prior knowledge. ...
Article
Full-text available
Anatomical landmark detection is crucial in medical image analysis, facilitating accurate diagnosis, surgical planning, and treatment evaluation. However, existing methods often struggle to simultaneously capture global context and local details while exhibiting limited generalization across diverse datasets and imaging modalities. To address this, we propose a hybrid model that leverages convolutional operations to capture local information and a Swin Transformer to enhance global context. Specifically, we introduce a novel U-shaped architecture, termed Convolutional Attention Swin Enhanced Landmark Detection Network (CASEMark). CASEMark integrates three key innovations: (1) a Convolutional Attention Swin Transformer module (CAST) that integrates transformer-based global context modeling with convolutional operations for local feature extraction, (2) an Enhanced Skip Attention Module (ESAM) enabling adaptive feature fusion between encoder and decoder pathways, and (3) a multi-resolution heatmap learning strategy that aggregates information across scales. This approach effectively balances global-local feature extraction with robust cross-modality generalization. Extensive experiments on four public datasets demonstrate the superiority of CASEMark. The code and datasets will be made publicly available.
... informative and trackable, we extract keypoints using SIFT [11], which provides robust features under appearance changes and viewpoint variation. Sequences with an insufficient number of detected keypoints are excluded to maintain supervision quality. ...
Preprint
Full-text available
Synthetic datasets have enabled significant progress in point tracking by providing large-scale, densely annotated supervision. However, deploying these models in real-world domains remains challenging due to domain shift and a lack of labeled data, issues that are especially severe in surgical videos, where scenes exhibit complex tissue deformation, occlusion, and lighting variation. While recent approaches adapt synthetic-trained trackers to natural videos using teacher ensembles or augmentation-heavy pseudo-labeling pipelines, their effectiveness in high-shift domains like surgery remains unexplored. This work presents SurgTracker, a semi-supervised framework for adapting synthetic-trained point trackers to surgical video using filtered self-distillation. Pseudo-labels are generated online by a fixed teacher (identical in architecture and initialization to the student) and are filtered using a cycle consistency constraint to discard temporally inconsistent trajectories. This simple yet effective design enforces geometric consistency and provides stable supervision throughout training, without the computational overhead of maintaining multiple teachers. Experiments on the STIR benchmark show that SurgTracker improves tracking performance using only 80 unlabeled videos, demonstrating its potential for robust adaptation in high-shift, data-scarce domains.
... Multi-scale approaches, such as pyramid-based stitching, improve robustness by refining alignment progressively, making them suitable for larger images with significant variations. Feature-based methods like scale-invariant feature transform (SIFT) [60] and speeded-up robust features (SURF) [60] detect and match key points across overlapping images, with SIFT excelling in robustness and SURF optimized for speed. For real-time applications, Oriented FAST and rotated BRIEF (ORB) offer computational efficiency but handle scale variations less effectively than SIFT [61]. ...
Article
Full-text available
Biofilms are complex microbial communities critical in medical, industrial, and environmental contexts. Understanding their assembly, structure, genetic regulation, interspecies interactions, and environmental responses is key to developing effective control and mitigation strategies. While atomic force microscopy (AFM) offers critically important high-resolution insights on structural and functional properties at the cellular and even sub-cellular level, its limited scan range and labor-intensive nature restrict the ability to link these smaller scale features to the functional macroscale organization of the films. We begin to address this limitation by introducing an automated large area AFM approach capable of capturing high-resolution images over millimeter-scale areas, aided by machine learning for seamless image stitching, cell detection, and classification. Large area AFM is shown to provide a very detailed view of spatial heterogeneity and cellular morphology during the early stages of biofilm formation which were previously obscured. Using this approach, we examined the organization of Pantoea sp. YR343 on PFOTS-treated glass surfaces. Our findings reveal a preferred cellular orientation among surface-attached cells, forming a distinctive honeycomb pattern. Detailed mapping of flagella interactions suggests that flagellar coordination plays a role in biofilm assembly beyond initial attachment. Additionally, we use large-area AFM to characterize surface modifications on silicon substrates, observing a significant reduction in bacterial density. This highlights the potential of this method for studying surface modifications to better understand and control bacterial adhesion and biofilm formation.
... However, Gauss-Newton solvers scale poorly with the number of images and 3D points, and GPU implementations that can effectively exploit the sparsity in the Jacobians for improved scalability are less mature. As a result, the current software ecosystem remains largely CPU-bound: while COLMAP can use GPUs for SIFT [29] feature extraction and matching, camera registration remains restricted to CPUs. ...
Preprint
Full-text available
We propose FastMap, a new global structure from motion method focused on speed and simplicity. Previous methods like COLMAP and GLOMAP are able to estimate high-precision camera poses, but suffer from poor scalability when the number of matched keypoint pairs becomes large. We identify two key factors leading to this problem: poor parallelization and computationally expensive optimization steps. To overcome these issues, we design an SfM framework that relies entirely on GPU-friendly operations, making it easily parallelizable. Moreover, each optimization step runs in time linear to the number of image pairs, independent of keypoint pairs or 3D points. Through extensive experiments, we show that FastMap is one to two orders of magnitude faster than COLMAP and GLOMAP on large-scale scenes with comparable pose accuracy.
... Deformable registration of lung CT scans plays a pivotal role in real-world clinical workflows, including respiratory motion compensation [1,2], longitudinal disease monitoring [3][4][5], and radiation therapy planning [6,7]. Unlike rigid registration methods [8][9][10][11] that assume static global transformations, deformable lung CT registration accounts for the complex non-linear anatomical changes induced by respiration, enabling precise spatial alignment across different breathing phases. ...
Article
Full-text available
Deformable lung CT registration plays a crucial role in clinical applications such as respiratory motion tracking, disease progression analysis, and radiotherapy planning. While voxel-based registration has traditionally dominated this domain, it often suffers from high computational costs and sensitivity to intensity variations. In this work, we propose a novel point-based deformable registration framework tailored to the unique challenges of lung CT alignment. Our approach combines geometric keypoint attention at coarse resolutions to enhance the global correspondence with attention-based refinement modules at finer scales to accurately model subtle anatomical deformations. Furthermore, we adopt a bi-directional training strategy that enforces forward and backward consistency through cycle supervision, promoting anatomically coherent transformations. We evaluate our method on the large-scale Lung250M benchmark and achieve state-of-the-art results, significantly surpassing the existing voxel-based and point-based baselines in the target registration accuracy. These findings highlight the potential of sparse geometric modeling for complex respiratory motion and establish a strong foundation for future point-based deformable registration in thoracic imaging.
... Traditional vehicle detection algorithms require a significant amount of computation time and effort because they rely on hand-crafted features, and these techniques are not suitable for real-time detection. Traditional methods involve three steps: in the first step, regions of interest (ROIs) are proposed using techniques such as selective search, sliding windows, and so on; in the second step, features are extracted from the ROIs using hand-crafted feature extraction methods such as HoG [2], SIFT-like [3], Haar-like [4], and so on; and in the third step, various classifiers such as SVM [5], AdaBoost [6], kNN [7], and so on are used to detect and classify vehicles. Deep learning-based vehicle detectors are classified into two types: 2-stage and 1-stage detectors. ...
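
A toy sketch of the three-step hand-crafted pipeline described in that excerpt, using HOG features and a linear SVM (scikit-image and scikit-learn); the ROI size, random training data, and labels are placeholders, and SIFT-like or Haar-like features with AdaBoost or kNN classifiers would fit the same structure:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def describe_roi(roi):
    """Step 2: hand-crafted feature extraction (HOG) for one ROI."""
    return hog(roi, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Placeholder training data: 64x64 grayscale ROIs with vehicle / background labels
train_rois = [np.random.rand(64, 64) for _ in range(20)]
train_labels = [1] * 10 + [0] * 10

# Step 3: train a classifier on the hand-crafted features
clf = LinearSVC()
clf.fit([describe_roi(r) for r in train_rois], train_labels)

# At test time, each candidate ROI from step 1 (e.g. a sliding window)
# is described and classified in the same way.
candidate = np.random.rand(64, 64)
print(clf.predict([describe_roi(candidate)]))  # 1 = vehicle, 0 = background
```
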
... Feature-based methods mainly involve four steps: control point detection, feature description, feature matching, and image transformation. Scale-Invariant Feature Transform (SIFT) and its variants are the most common artificial descriptors [16]. Dellinger et al. [17] proposed a gradient computation method specifically designed for SAR images, which is robust to speckle noise. ...
Article
Full-text available
Variations in terrain elevation cause images acquired under different imaging modalities to deviate from a linear mapping relationship. This effect is particularly pronounced between optical and SAR images, where the range-based imaging mechanism of SAR sensors leads to significant local geometric distortions, such as perspective shrinkage and occlusion. As a result, it becomes difficult to represent the spatial correspondence between optical and SAR images using a single geometric model. To address this challenge, we propose a global optical-SAR image registration method that leverages local distortion characteristics. Specifically, we introduce a Superpixel-based Local Distortion Division (SLDD) method, which defines superpixel region features and segments the image into local distortion and normal regions by computing the Mahalanobis distance between superpixel features. We further design a Multi-Feature Fusion Capsule Network (MFFCN) that integrates shallow salient features with deep structural details, reconstructing the dimensions of digital capsules to generate feature descriptors encompassing texture, phase, structure, and amplitude information. This design effectively mitigates the information loss and feature degradation problems caused by pooling operations in conventional convolutional neural networks (CNNs). Additionally, a hard negative mining loss is incorporated to further enhance feature discriminability. Feature descriptors are extracted separately from regions with different distortion levels, and corresponding transformation models are built for local registration. Finally, the local registration results are fused to generate a globally aligned image. Experimental results on public datasets demonstrate that the proposed method achieves superior performance over state-of-the-art (SOTA) approaches in terms of Root Mean Squared Error (RMSE), Correct Match Number (CMN), Distribution of Matched Points (Scat), Edge Fidelity (EF), and overall visual quality.
... While human pose estimation is well-studied, pose estimation or keypoint detection for generic objects needs more development. Related concepts, such as SIFT [25], focus on identifying interest points based on low-level image statistics. Other methods include heatmap representation for feature matching [10] and the multi-peak heatmap approach used by StarMap [45], which provides key points with associated features and 3D locations. ...
Preprint
Full-text available
Reposing objects in images has a myriad of applications, especially for e-commerce where several variants of product images need to be produced quickly. In this work, we leverage the recent advances in unsupervised keypoint correspondence detection between different object images of the same class to propose an end-to-end framework for generic object reposing. Our method, EOPose, takes a target pose-guidance image as input and uses its keypoint correspondence with the source object image to warp and re-render the latter into the target pose using a novel three-step approach. Unlike generative approaches, our method also preserves the fine-grained details of the object such as its exact colors, textures, and brand marks. We also prepare a new dataset of paired objects based on the Objaverse dataset to train and test our network. EOPose produces high-quality reposing output as evidenced by different image quality metrics (PSNR, SSIM and FID). Besides a description of the method and the dataset, the paper also includes detailed ablation and user studies to indicate the efficacy of the proposed method.
... Conventional pose estimation techniques typically utilize either local feature extraction or template matching to estimate the 6D pose of objects. In feature-based methods [3]–[7], the process begins by extracting local features from the image, which are then aligned with 3D models to establish correspondences between 2D image points and 3D world coordinates, facilitating reliable 6D pose estimation. Notable feature descriptors such as SIFT, SURF, and ORB [7]–[9] are well-known for their robustness to variations in lighting, scale, and rotation. ...
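
A minimal sketch of how such 2D-3D correspondences yield a 6D pose with OpenCV's RANSAC PnP solver; here the correspondences are synthesized from an assumed ground-truth pose rather than obtained from SIFT/SURF/ORB matches against a 3D model, and the intrinsics are placeholders:

```python
import cv2
import numpy as np

# Placeholder 3D model points (in practice: points matched via feature descriptors)
object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.1, 0.1, 0.0],
                          [0.0, 0.1, 0.0], [0.05, 0.05, 0.1], [0.02, 0.08, 0.05]])

K = np.array([[800.0, 0.0, 320.0],   # assumed pinhole intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                    # assume no lens distortion

# Synthesize consistent 2D observations by projecting an assumed ground-truth pose
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.01, 0.5])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, dist)
image_points = image_points.reshape(-1, 2)

# Robust pose from 2D-3D correspondences
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)            # rotation vector -> 3x3 rotation matrix
print(ok, R, tvec)                    # recovered 6D pose: rotation R, translation tvec
```
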
Article
Full-text available
Visual information plays a crucial role during the final approach and landing phases of an aircraft, serving as a complementary source for the navigation system. It provides additional guidance when radio navigation aids such as the Instrument Landing System are unserviceable or compromised, and it enables a fully vision-based landing. The relative position and orientation of an aircraft can be derived from the features of the runway captured in images. However, conventional methods for detecting runways often suffer from high latency and low accuracy, making them inadequate and uncertain for ensuring a safe landing. In response to these limitations, this paper introduces a real-time runway detection model named Vision-based Landing System for Fixed-Winged Aircraft (VLS-FWA), which leverages deep convolutional neural networks in cascade. Moreover, the predicted variance is calculated and displayed in the pilot’s Head-Up Display (HUD), employing methods that help reduce aleatoric and epistemic uncertainty in both real-world and synthetic scenes.
... Before the widespread application of convolutional neural networks (CNNs) in computer vision and image processing, traditional image processing and pattern recognition methods relied primarily on manually designed feature extraction algorithms. Lowe [23] developed an object recognition system using local image features. Bay et al. [24] proposed an approach that involves extracting local features from an image invariant to various transformations such as scaling and rotation. ...
Article
Full-text available
Compared to single-modal methods, multimodal semantic segmentation methods leverage the rich complementary information between modalities to improve segmentation accuracy, attracting increasing attention. However, differences in imaging principles between modalities lead to incompatibilities that increase the difficulty of fusion. Efficiently fusing multiscale features across modalities and effectively exploiting their complementary information remains a challenging task. In this article, we propose a multiscale gated fusion network (MGFNet) for effectively preserving the discriminative features of different modalities at different scales and utilizing complementary information. Specifically, to preserve the discriminative features of different modalities, we design a multiscale gated fusion module to selectively fuse useful features from different modalities by extracting their complementary features at different scales. In addition, we propose a cross-modal interaction module to adaptively capture long-range dependencies and facilitate the exchange of complementary features between modalities. Finally, the cross-modal multiscale extraction module effectively extracts multiscale features from the fused features and integrates complementary information across modalities. Extensive experiments on the Vaihingen and Potsdam datasets demonstrate that our proposed MGFNet achieves superior performance compared to currently popular methods. The code of MGFNet is available at https://github.com/DrWuHonglin/MGFNet.
... Since Lowe [5] proposed the Scale Invariant Feature Transform (SIFT) feature extraction method in 1999, feature-based image stitching methods have developed rapidly. In 2003, Brown et al. [6] utilized the scale invariance of SIFT to achieve full panoramic image stitching. ...
Article
Full-text available
Due to the limited imaging range of underwater cameras, this paper proposes an underwater image stitching algorithm based on point-line dual features. Specifically, it proposes an underwater image enhancement algorithm based on dynamic thresholding and the dark channel prior, which preprocesses the degraded input images to improve subjective perception and feature matching results. It also analyzes stitching methods based on optimized alignment and seam optimization, and proposes an image matching strategy with point-line dual features for pre-alignment of the target image. Considering the impact of different constraint terms on image alignment, the paper constructs local constraints, global constraints, smoothing constraints, and line feature constraints as the objective energy function based on the idea of grid optimization, in order to further optimize alignment. Finally, seamless underwater image stitching is achieved through preliminary fusion and optimal seam fusion. Our algorithm is significantly ahead of other comparative algorithms in terms of overall performance, both subjectively and objectively.
... As a result, steep vertical terrain was imaged at a more perpendicular angle, which helped minimise perspective distortions between overlapping images and allowed for more reliable extraction and consistent identification of distinct surface features in stereo matching. In Metashape, feature matching is based on the scale-invariant feature transformation (SIFT) algorithm (Lowe, 1999) which, while efficient at matching features despite variations in scale and orientation, is only partially invariant to illumination and affine distortions (Lowe, 2004). In general, extreme affine distortions from widely different viewpoints such as those common in low-oblique aerial imagery pose a significant challenge to robust feature matching. ...
Article
Full-text available
The use of structure-from-motion (SfM) photogrammetry coupled with multiview stereo (MVS) techniques is widespread as a tool for generating topographic data for monitoring change in surface elevation. However, study sites on remote glaciers and ice caps often offer suboptimal conditions, including large survey areas, complex topography, changing weather and light conditions, poor contrast over ice and snow, and reduced satellite positioning performance. Here, we provide a review of methodological considerations for conducting aerial photography surveys under challenging field conditions. We generate topographic reconstructions, outlining the entire workflow, from data acquisition to SfM–MVS processing, using case studies focused around two small glaciers in Arctic Canada. We provide recommendations for the selection of photographic and positioning hardware and guidelines for flexible survey design using direct measurements of camera positions, thereby removing the need for ground control points. The focus is on maximising hardware performance despite inherent limitations, with the aim of optimising the quality and quantity of the source data, including image information and control measurements, despite suboptimal conditions.
... The sliding window (of different sizes) method was the popular approach employed by the region proposal stage to extract regions of interest (RoIs) from the input image; this ensures that candidate regions are obtained over multiple iterations irrespective of the different sizes that each target object may have. In the second stage, in its incipient form, object detection relied on features extracted by traditional hand-crafted methods, such as Local Binary Pattern (LBP) [2], Scale-Invariant Feature Transform (SIFT) [3,4], and Histogram of Oriented Gradients (HOG) [5]. The features extracted by these hand-crafted methods are then used in the last stage for object classification and bounding box regression. ...
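
A minimal sketch of the multi-size sliding-window region proposal stage described in that excerpt; the window sizes and stride are illustrative choices, and each yielded RoI would then be described (LBP/SIFT/HOG) and classified in the later stages:

```python
import numpy as np

def sliding_windows(image, window_sizes=((64, 64), (96, 96), (128, 128)), stride=16):
    """Yield candidate RoIs (x, y, w, h, patch) at several window sizes,
    so that targets of different sizes are still covered."""
    h_img, w_img = image.shape[:2]
    for w, h in window_sizes:
        for y in range(0, h_img - h + 1, stride):
            for x in range(0, w_img - w + 1, stride):
                yield x, y, w, h, image[y:y + h, x:x + w]

image = np.zeros((240, 320), dtype=np.uint8)   # placeholder grayscale image
rois = list(sliding_windows(image))
print(len(rois))   # number of candidate regions handed to the feature/classifier stages
```
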
Chapter
Full-text available
Object detection is a major branch and fundamental task in computer vision, aiming to localize, identify and classify even the smallest objects of interest in images. Features can be extracted more efficiently by deep convolutional neural networks (CNNs) used as the backbone than by traditional hand-crafted methods, enabling real-time or near real-time object detection performance. In the past few years, the advent of transformer-based models with robust self-attention mechanisms has not only raised object detection performance to a higher level but has also enabled it to produce excellent results. Many object detection tasks in the real world require that 3D information about the object be obtained, thus strengthening active research in 3D object detection. However, 3D object detection algorithms are not easy to deploy in real-world applications due to many factors, making the extension of 2D object detection algorithms to 3D a suitable alternative. Therefore, we review the evolution of 2D object detection algorithms for digital imaging applications, focusing on their developments, models, applications, datasets, evaluation metrics, strengths and weaknesses, for a better understanding of their landmarks and contributions to the advancement of the field.
... To leverage this capability, we adopt ResNet-50 as the backbone network, which has been extensively used for feature extraction in various computer vision tasks. ResNet-50 is pre-trained on ImageNet [49], providing a strong initialisation for extracting meaningful pedestrian features. ...
Article
Full-text available
Person re‐identification (Re‐ID) has gained popularity in computer vision, enabling cross‐camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re‐ID research, most existing person Re‐ID methods overlook the potential relationships among local person features, failing to adequately address the impact of pedestrian pose variations and local body parts occlusion. Therefore, we propose a transformer‐enhanced graph convolutional network (Tran‐GCN) model to improve person re‐identification performance in monitoring videos. The model comprises four key components: (1) a pose estimation learning branch is utilised to estimate pedestrian pose information and inherent skeletal structure data, extracting pedestrian key point information; (2) a transformer learning branch learns the global dependencies between fine‐grained and semantically meaningful local person features; (3) a convolution learning branch uses the basic ResNet architecture to extract the person's fine‐grained local features; and (4) a Graph convolutional module (GCM) integrates local feature information, global feature information and body information for more effective person identification after fusion. Quantitative and qualitative analysis experiments conducted on three different datasets (Market‐1501, DukeMTMC‐ReID and MSMT17) demonstrate that the Tran‐GCN model can more accurately capture discriminative person features in monitoring videos, significantly improving identification accuracy.
... On the other hand, previous psychophysical studies have demonstrated that humans are capable of perceiving the gist of scenes and objects at a glance (Thorpe et al., 1996; VanRullen & Thorpe, 2001; Greene & Oliva, 2009). Such ultra-rapid recognition of scenes and objects may rely more on lower-level, relatively global, and instantaneously processed features such as those computed by the spatial envelope and the scale-invariant feature transform (Oliva & Torralba, 2001; Lowe, 1999, 2004). Additionally, modern deep neural network (DNN) models that achieve remarkable object recognition accuracy often rely more heavily on texture-based information than on local shape features (Geirhos et al., 2018; Baker et al., 2018; Hermann et al., 2020; Li et al., 2020). ...
Preprint
Recent studies have suggested the importance of statistical image features in both natural scene and object recognition, while the spatial layout or shape information is still important. In the present study, to investigate the roles of low- and high-level statistical image features in natural scene and object recognition, we conducted categorization tasks using a wide variety of natural scene (250 images) and object (200 images) images, along with two types of synthesized images: Portilla-Simoncelli (PS) synthesized images, which preserve low-level statistical features, and style-synthesized (SS) images, which retain higher-level statistical features. Behavioral experiments revealed that observers could categorize style-synthesized versions of natural scene and object images with high accuracy. Furthermore, we recorded visual evoked potentials (VEPs) for the original, SS, and PS images and decoded natural scene and object categories using a support vector machine (SVM). Consistent with the behavioral results, natural scene categories were decoded with high accuracy within 200 ms after the stimulus onset. In contrast, object categories were successfully decoded only from VEPs for original images at later latencies. Finally, we examined whether style features could classify natural scene and object categories. The classification accuracy for natural scene categories showed a similar trend to the behavioral data, whereas that for object categories did not align with the behavioral results. Taken together, these findings suggest that although natural scene and object categories can be recognized relatively easily even when layout information is disrupted, the extent to which statistical features contribute to categorization differs between natural scenes and objects.
... review of re-ranking methods in [7]). Here a very simple approach was used, based on SIFT features [25] (Scale Invariant Feature Transform), which belongs to the class of geometric verification, according to [7]. These features are computed only on the images to be re-ranked (for instance the 10 best ones), which is very fast. ...
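
A minimal sketch of SIFT-based geometric verification used as a re-ranking step, in the spirit of the approach described in that excerpt (not the article's exact code): each of the top retrieved images is re-scored by the number of RANSAC-consistent matches with the query; the ratio-test threshold and file names are assumptions.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
bf = cv2.BFMatcher(cv2.NORM_L2)

def inlier_count(query_img, candidate_img):
    """Number of SIFT matches consistent with a single homography (RANSAC)."""
    kq, dq = sift.detectAndCompute(query_img, None)
    kc, dc = sift.detectAndCompute(candidate_img, None)
    if dq is None or dc is None:
        return 0
    good = []
    for pair in bf.knnMatch(dq, dc, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:
        return 0
    src = np.float32([kq[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kc[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return 0 if mask is None else int(mask.sum())

# Re-rank the top candidates returned by the first-stage (visual/textual) retrieval
query = cv2.imread("query_halftone.png", cv2.IMREAD_GRAYSCALE)        # placeholder
candidates = [cv2.imread(f"hit_{i}.png", cv2.IMREAD_GRAYSCALE) for i in range(10)]
reranked = sorted(candidates, key=lambda c: inlier_count(query, c), reverse=True)
```
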
Article
Full-text available
This article advances a method to analyze a large corpus of historical photographs using artificial intelligence tools and data modeling. This research was conducted within the framework of the EyCon (Early Conflict Photography 1890-1918 and Visual AI) and HighVision projects, which aim at leveraging the power of digital tools, exploiting both visual and textual information, to investigate the development of war photography at the turn of the 20th century. To do so, one of the objectives of the project was to develop a method to extract robust features and to overcome the challenges posed by the halftone printing techniques, the most common way to reproduce photographs in daily newspapers, periodicals and books at the time. By combining visual and textual similarity measures, the proposed approach enables the identification of significant subsets of similarity within the dataset. The findings from this research hold important implications for the broader field of image analysis and provide insights into the unique characteristics and complexities of historical visual data. This work contributes to the advancement of computer vision techniques in the analysis of historical photographic collections, opening up new avenues for research in visual AI and archival studies.
... Pose estimation typically starts with a 2D detector, followed by pose hypothesis generation. Classic methods extract features from 3D models and match them to image points [37,38], while RANSAC-based solvers [39,40] perform pose refinement. Two-stage approaches [1,4,5,41] remain popular, especially with the advent of more robust descriptors [42][43][44][45][46][47][48]. ...
Article
Full-text available
Recent advances in the field of 6D pose estimation of unseen objects not present during training are promising; however, the performance gap between these general methods and object-specific methods remains significant. This paper introduces an innovative unsupervised test-time adaptation method, termed TTAPose, capable of adapting a pose estimator to any unseen object. TTAPose initially undergoes pre-training using a large synthetic dataset and thereafter refines the weights using unsupervised loss conducted on unseen real-world target objects. The network, based on a teacher-student architecture, leverages an RGB-D pose refinement pipeline to incrementally improve pseudo labels. Notably, TTAPose operates with no requirement for target data annotation, thus minimizing time and data expenditure. Experimental results show performance levels comparable to supervised methods, effectively narrowing the gap to object-specific baselines.
... Increasing either of these parameters has negative side effects; increasing the exposure duration leads to blurry images if the camera is moving, while increasing the gain leads to a noisier image. These side effects are both potentially catastrophic for many types of vision processing techniques, especially those that rely on the now standard gradient-based feature detection algorithms such as Scale-Invariant Feature Transforms (SIFT) [1] and Speeded Up Robust Features (SURF) [2]. A range of solutions have been proposed including high dynamic range techniques, high sensitivity and thermal cameras, active lighting/strobing of the environment, or simply using alternative sensors such as laser rangefinders. ...
Preprint
In this paper we evaluate performance of the SeqSLAM algorithm for passive vision-based localization in very dark environments with low-cost cameras that result in massively blurred images. We evaluate the effect of motion blur from exposure times up to 10,000 ms from a moving car, and the performance of localization in day time from routes learned at night in two different environments. Finally we perform a statistical analysis that compares the baseline performance of matching unprocessed grayscale images to using patch normalization and local neighborhood normalization - the two key SeqSLAM components. Our results and analysis show for the first time why the SeqSLAM algorithm is effective, and demonstrate the potential for cheap camera-based localization systems that function despite extreme appearance change.
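
For context, a minimal sketch of SeqSLAM-style patch normalization as commonly described (not the authors' implementation): the image is downsampled and each patch is normalized to zero mean and unit variance, which suppresses global exposure and illumination differences; the target resolution and patch size are assumed values.

```python
import cv2
import numpy as np

def patch_normalize(gray, resize_to=(64, 32), patch=8):
    """Downsample, then normalize each patch to zero mean / unit variance,
    suppressing global illumination and exposure differences between images."""
    small = cv2.resize(gray, resize_to).astype(np.float32)
    out = np.zeros_like(small)
    for y in range(0, small.shape[0], patch):
        for x in range(0, small.shape[1], patch):
            block = small[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] = (block - block.mean()) / (block.std() + 1e-6)
    return out
```
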
... The goal of image registration is to find robust correspondences between the two images. We use SIFT [56] to find keypoints with distinctive features, then we establish correspondences between the keypoints [57] and filter them [58]. With the filtered correspondences, we calculate a homography matrix using RANSAC [59], and the matrix warps I_w to I_m. ...
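
A minimal sketch of the registration steps listed in that excerpt, using OpenCV SIFT matching, a ratio-test filter, RANSAC homography estimation, and perspective warping; file names and thresholds are placeholders, and I_w/I_m follow the excerpt's notation for the image being warped and the reference image.

```python
import cv2
import numpy as np

I_w = cv2.imread("ultra_wide.png")    # placeholder: image to be warped
I_m = cv2.imread("main.png")          # placeholder: reference image

sift = cv2.SIFT_create()
kp_w, des_w = sift.detectAndCompute(cv2.cvtColor(I_w, cv2.COLOR_BGR2GRAY), None)
kp_m, des_m = sift.detectAndCompute(cv2.cvtColor(I_m, cv2.COLOR_BGR2GRAY), None)

# Establish correspondences between the keypoints, then filter them (ratio test)
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_w, des_m, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts_w = np.float32([kp_w[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts_m = np.float32([kp_m[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# Homography from RANSAC warps I_w onto I_m's frame
H, mask = cv2.findHomography(pts_w, pts_m, cv2.RANSAC, 5.0)
warped = cv2.warpPerspective(I_w, H, (I_m.shape[1], I_m.shape[0]))
```
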
Preprint
We present the first framework capable of synthesizing the all-in-focus neural radiance field (NeRF) from inputs without manual refocusing. Without refocusing, the camera will automatically focus on the fixed object for all views, and current NeRF methods, which typically use one camera, fail due to the consistent defocus blur and a lack of a sharp reference. To restore the all-in-focus NeRF, we introduce the dual-camera setup from smartphones, where the ultra-wide camera has a wider depth-of-field (DoF) and the main camera possesses a higher resolution. The dual camera pair saves the high-fidelity details from the main camera and uses the ultra-wide camera's deep DoF as a reference for all-in-focus restoration. To this end, we first implement spatial warping and color matching to align the dual camera, followed by a defocus-aware fusion module with learnable defocus parameters to predict a defocus map and fuse the aligned camera pair. We also build a multi-view dataset that includes image pairs of the main and ultra-wide cameras in a smartphone. Extensive experiments on this dataset verify that our solution, termed DC-NeRF, can produce high-quality all-in-focus novel views and compares favorably against strong baselines quantitatively and qualitatively. We further show DoF applications of DC-NeRF with adjustable blur intensity and focal plane, including refocusing and split diopter.
Article
Comprehensive reviews of continuously vegetated areas to determine dispersed locations of invasive species require intensive use of computational resources. Furthermore, effective mechanisms aiding identification of locations of specific invasive species require approaches relying on geospatial indicators and ancillary images. This study develops a two-stage data workflow for the invasive species Kudzu vine (Pueraria montana), often found in small areas along roadsides. The INHABIT database from the United States Geological Survey (USGS) provided geospatial data of Kudzu vines and Google Street View (GSV) a set of images. Stage one built up a set of Kudzu images to be implemented in an object detection technique, You Only Look Once (YOLO v8s), for training, validating, and testing. Stage two defined a dataset of confirmed Kudzu locations, which was then used to retrieve images from GSV for analysis with YOLO v8s. The effectiveness of the YOLO v8s model was assessed to determine the locations of Kudzu identified from georeferenced GSV images. This data workflow demonstrated that field observations can be virtually conducted by integrating geospatial data and GSV images; however, its potential is confined by the update periodicity of GSV images or similar services.
Article
Defect detection is crucial for controlling product quality in the process of textile production. However, for existing detection techniques, there are still challenges in identifying different forms of defect and small defects within the same category. To address this issue, we propose a fabric defect detection model called MAS-YOLO. This model is based on YOLOv8n and incorporates several key innovations. First, we designed a multi-branch coordinate attention module to capture direction and position information. Second, we designed an adaptive weighted downsampling module based on grouped convolution, which emphasizes defective features and reduces background interference using weighting features. Finally, we introduced sliding loss to address the imbalance between easy and difficult samples. The experimental results show that the mean average precisions for a customized fabric defect dataset and the AliCloud Tianchi dataset were 96.3% and 51.6%, respectively, that is, 6.9% and 7.8% higher, respectively, than the original YOLOv8n. The detection speeds using the GTX1050ti graphics card and RTX3070ti graphics card are 57.3 frames per second (fps) and 154.3 fps, respectively; this can meet the real-time requirements of defect detection in most industrial sites and provide technical support for the application of lightweight network models in the industry.
Article
Aerial object detection plays a critical role in numerous fields, utilizing the flexibility of airborne platforms to achieve real-time tasks. Combining visible and infrared sensors can overcome limitations under low-light conditions, enabling full-time tasks. While feature-level fusion methods exhibit comparable performances in visible–infrared multispectral object detection, they suffer from heavy model size, inadequate inference speed and visible light preferences caused by inherent modality imbalance, limiting their applications in airborne platform deployment. To address these challenges, this paper proposes a YOLO-based real-time multispectral fusion framework combining pixel-level fusion with dynamic modality-balanced augmentation called Full-time Multispectral Pixel-wise Fusion Network (FMPFNet). Firstly, we introduce the Multispectral Luminance Weighted Fusion (MLWF) module consisting of attention-based modality reconstruction and feature fusion. By leveraging YUV color space transformation, this module efficiently fuses RGB and IR modalities while minimizing computational overhead. We also propose the Dynamic Modality Dropout and Threshold Masking (DMDTM) strategy, which balances modality attention and improves detection performance in low-light scenarios. Additionally, we refine our model to enhance the detection of small rotated objects, a requirement commonly encountered in aerial detection applications. Experimental results on the DroneVehicle dataset demonstrate that our FMPFNet achieves 76.80% mAP50 and 132 FPS, outperforming state-of-the-art feature-level fusion methods in both accuracy and inference speed.
Article
This study aimed to compare the relationship between the quantitative values and the visual score of images acquired using the CS-SENSE method. T1-weighted images (T1WI) and T2-weighted images (T2WI) were acquired using a phantom created by a 3D printer. The quantitative values (signal-to-noise ratio [SNR], contrast-to-noise ratio [CNR], structural similarity [SSIM], and scale-invariant feature transform [SIFT]) and the visual evaluation score (VES) were calculated from the acquired images. The correlation coefficients among the calculated quantitative values and the VES were computed. The difference in methods for evaluating the image quality of T1WI and T2WI images using CS-SENSE was clarified. Variations in image quality, as reflected by VES in T1WI and T2WI images obtained via the CS-SENSE method, can be quantitatively assessed. Specifically, CNR is effective for evaluating changes in T1WI, while SNR, CNR, and SIFT are suitable for assessing variations in T2WI.
Article
The properties of light propagation underwater typically cause color distortion and reduced contrast in underwater images. In addition, complex underwater lighting conditions can result in issues such as non-uniform illumination, spotting, and noise. To address these challenges, we propose an innovative underwater-image enhancement (UIE) approach based on maximum information-channel compensation and edge-preserving filtering techniques. Specifically, we first develop a channel information transmission strategy grounded in maximum information preservation principles, utilizing the maximum information channel to improve the color fidelity of the input image. Next, we locally enhance the color-corrected image using guided filtering and generate a series of globally contrast-enhanced images by applying gamma transformations with varying parameter values. In the final stage, the enhanced image sequence is decomposed into low-frequency (LF) and high-frequency (HF) components via side-window filtering. For the HF component, a weight map is constructed by calculating the difference between the current exposedness and the optimum exposure. For the LF component, we derive a comprehensive feature map by integrating the brightness map, saturation map, and saliency map, thereby accurately assessing the quality of degraded regions in a manner that aligns with the symmetry principle inherent in human vision. Ultimately, we combine the LF and HF components through a weighted summation process, resulting in a high-quality underwater image. Experimental results demonstrate that our method effectively achieves both color restoration and contrast enhancement, outperforming several State-of-the-Art UIE techniques across multiple datasets.
Article
Object detection in remote sensing imagery plays a pivotal role in various applications, including aerial surveillance and urban planning. Despite its significance, the task remains challenging due to cluttered backgrounds, the arbitrary orientations of objects, and substantial scale variations across targets. To address these issues, we propose RFE-FCOS, a novel framework that synergizes rotation-invariant feature extraction with adaptive multi-scale fusion. Specifically, we introduce a rotation-invariant learning (RIL) module, which employs adaptive rotation transformations to enhance shallow feature representations, thereby effectively mitigating interference from complex backgrounds and boosting geometric robustness. Furthermore, a rotation feature fusion (RFF) module propagates these rotation-aware features across hierarchical levels through an attention-guided fusion strategy, resulting in richer, more discriminative representations at multiple scales. Finally, we propose a novel dual-aspect RIoU loss (DARIoU) that simultaneously optimizes horizontal and angular regression tasks, facilitating stable training and the precise alignment of arbitrarily oriented bounding boxes. Evaluated on the DIOR-R and HRSC2016 benchmarks, our method demonstrates robust detection capabilities for arbitrarily oriented objects, achieving competitive performance in both accuracy and efficiency. This work provides a versatile solution for advancing object detection in real-world remote sensing scenarios.
Article
The creation of virtual slides, i.e., high-resolution digital images of biological samples, is expensive, and existing manual methods often suffer from stitching errors and additional reimaging costs. To address these issues, we propose a real-time mosaicing manager for manual microscopy (RT-4M) that performs real-time stitching and allows users to preview slides during imaging using existing manual microscopy systems, thereby reducing the need for reimaging. We install it on two different microscopy systems, successfully creating virtual slides of hematoxylin and eosin- and fluorescent-stained tissues obtained from humans and mice. The fluorescent-stained tissues consist of two colors, requiring the manual switching of the filter and an exposure time of 1.6 s per color. Even in the case of the largest dataset in this study (over 900 images), the entire sample is captured without any omissions. Moreover, RT-4M exhibits a processing time of less than one second per registration, indicating that it does not hinder the user’s imaging workflow. Additionally, the composition process reduces the misalignment rate by a factor of 20 compared to existing software. We believe that the proposed software will prove useful in the fields of pathology and bio-research, particularly for facilities with relatively limited budgets.
Article
Full-text available
High-precision localization is critical for intelligent robotics in autonomous driving, smart agriculture, and military operations. While Global Navigation Satellite System (GNSS) provides global positioning, its reliability deteriorates severely in signal degraded environments like urban canyons. Cross-view pose estimation using aerial-ground sensor fusion offers an economical alternative, yet current datasets lack field scenarios and high-resolution LiDAR support. This work introduces a multimodal cross-view dataset addressing these gaps. It contains 29,940 synchronized frames across 11 operational environments (6 field environments, 5 urban roads), featuring: 1) 144-channel LiDAR point clouds, 2) ground-view RGB images, and 3) aerial orthophotos. Centimeter-accurate georeferencing is ensured through GNSS fusion and post-processed kinematic positioning. The dataset uniquely integrates field environments and high-resolution LiDAR-aerial-ground data triplets, enabling rigorous evaluation of 3-DoF pose estimation algorithms for orientation alignment and coordinate transformation between perspectives. This resource supports development of robust localization systems for field robots in GNSS-denied conditions, emphasizing cross-view feature matching and multisensor fusion. Light Detection And Ranging (LiDAR)-enhanced ground truth further distinguishes its utility for complex outdoor navigation research.
Preprint
Full-text available
Microscopy is an essential tool in scientific research, enabling the visualization of structures at micro- and nanoscale resolutions. However, the field of microscopy often encounters limitations in field-of-view (FOV), restricting the amount of sample that can be imaged in a single capture. To overcome this limitation, image stitching techniques have been developed to seamlessly merge multiple overlapping images into a single, high-resolution composite. The images collected from the microscope need to be optimally stitched before accurate physical information can be extracted in post-analysis. However, the existing stitching tools either struggle to stitch images together when the microscopy images are feature-sparse or cannot address all the transformations of images when performing image stitching. To address these issues, we propose a bi-channel aided feature-based image stitching method and demonstrate its use on Atomic Force Microscopy (AFM) generated biofilm images as experimental data. The topographical channel image of AFM data captures the morphological details of the sample, and a stitched topographical image is desired by researchers. We utilize the amplitude channel of AFM data to maximize the matching features and to estimate the position of the original topographical images, and we show that the proposed bi-channel aided stitching method outperforms the traditional direct stitching approach in the AFM topographical image stitching task. Furthermore, we found that the differentiation of the topographical images along the x-axis provides similar feature information to the amplitude channel image, which generalizes our approach when the amplitude images are not available. Here we demonstrated the application on AFM, but similar approaches could be employed for optical microscopy with brightfield and fluorescence channels. We believe this proposed workflow can serve as a valuable augmentation strategy for microscopy image stitching tasks and will help the experimentalist avoid erroneous analysis and discovery due to incorrect stitching.
Article
To address the challenge of precise picking point localization in morphologically diverse safflower plants, this study proposes PointSafNet—a novel three-stage 3D point cloud analysis framework with distinct architectural and methodological innovations. In Stage I, we introduce a multi-view reconstruction pipeline integrating Structure from Motion (SfM) and Multi-View Stereo (MVS) to generate high-fidelity 3D plant point clouds. Stage II develops a dual-branch architecture employing Star modules for multi-scale hierarchical geometric feature extraction at the organ level (filaments and fruit balls), complemented by a Context-Anchored Attention (CAA) mechanism to capture long-range contextual information. This synergistic feature learning approach addresses morphological variations, achieving 86.83% segmentation accuracy (surpassing PointNet++ by 7.37%) and outperforming conventional point cloud models. Stage III proposes an optimized geometric analysis pipeline combining dual-centroid spatial vectorization with Oriented Bounding Box (OBB)-based proximity analysis, resolving picking coordinate localization across diverse plants with 90% positioning accuracy and 68.82% mean IoU (13.71% improvement). The experiments demonstrate that PointSafNet systematically integrates 3D reconstruction, hierarchical feature learning, and geometric reasoning to provide visual guidance for robotic harvesting systems in complex plant canopies. The framework’s dual emphasis on architectural innovation and geometric modeling offers a generalizable solution for precision agriculture tasks involving morphologically diverse safflowers.
Article
This study proposes a multi-source unsupervised domain adaptation framework for person re-identification (ReID), addressing cross-domain feature discrepancies and label scarcity in electric power field operations. Inspired by symmetry principles in feature space optimization, the framework integrates (1) a Reverse Attention-based Feature Fusion (RAFF) module aligning cross-domain features using symmetry-guided prototype interactions that enforce bidirectional style-invariant representations and (2) a Self-Correcting Pseudo-Label Loss (SCPL) dynamically adjusting confidence thresholds using entropy symmetry constraints to balance source-target domain knowledge transfer. Experiments demonstrate 92.1% rank-1 accuracy on power industry benchmarks, outperforming DDAG and MTL by 9.5%, with validation confirming robustness in operational deployments. The symmetric design principles significantly enhance model adaptability to inherent symmetry breaking caused by heterogeneous power grid environments.
Article
Understanding curbside parking rules is crucial for drivers to quickly find legal curbside parking. Traditional data providers rely on manual methods to collect information about curbside parking signs, either by individually noting down sign details or downloading street-level imagery from digital maps. However, this process is slow and inadequate for building an accurate and up-to-date curbside parking rule database, considering the frequent updates to parking signs. In this paper, we propose a novel deep-learning-based Inventory Management System (IMS) that automates the creation and management of a curbside parking rule database using videos captured by off-the-shelf dashcams installed on vehicles. To the best of our knowledge, our system is the first of its kind to detect and interpret real-world curbside parking signs from videos and generate parking rules. By utilizing a serverless cloud architecture on AWS, IMS combines secure data retrieval, robust user authentication, and responsive map-based visualization, providing users with up-to-date parking information filtered by location, date, and time. Through real-world evaluations, we demonstrate that our system efficiently constructs an accurate curbside parking rule database, enhancing urban mobility through efficient and informed parking management.
Article
Despite significant advances in deep learning-based image captioning, many state-of-the-art models still struggle to balance visual grounding (i.e., accurate object and scene descriptions) with linguistic coherence (i.e., grammatical fluency and appropriate use of non-visual tokens such as articles and prepositions). To address these limitations, we propose a hybrid image captioning framework that integrates handcrafted and deep visual features. Specifically, we combine local descriptors—Scale-Invariant Feature Transform (SIFT) and Bag of Features (BoF)—with high-level semantic features extracted using ResNet50. This dual representation captures both fine-grained spatial details and contextual semantics. The decoder employs Bahdanau attention refined with an Attention-on-Attention (AoA) mechanism to optimize visual-textual alignment, while GloVe embeddings and a GRU-based sequence model ensure fluent language generation. The proposed system is trained on 200,000 image-caption pairs from the MS COCO train2014 dataset and evaluated on 50,000 held-out MS COCO pairs plus the Flickr8K benchmark. Our model achieves a CIDEr score of 128.3 and a SPICE score of 29.24, reflecting clear improvements over baselines in both semantic precision—particularly for spatial relationships—and grammatical fluency. These results validate that combining classical computer vision techniques with modern attention mechanisms yields more interpretable and linguistically precise captions, addressing key limitations in neural caption generation.
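As a hedged sketch of the handcrafted branch only (SIFT descriptors quantized into a Bag-of-Features histogram), the snippet below uses OpenCV and scikit-learn; the 256-word vocabulary and L1 normalization are assumptions, and the ResNet50 branch, AoA decoder, and GloVe/GRU components are not reproduced here.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def bof_histograms(image_paths, n_words=256):
    # Quantize SIFT descriptors into one Bag-of-Features histogram per image
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc)
    # Build the visual vocabulary by clustering all descriptors
    vocab = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(np.vstack(per_image))
    hists = []
    for desc in per_image:
        words = vocab.predict(desc)
        h, _ = np.histogram(words, bins=n_words, range=(0, n_words))
        hists.append(h / max(h.sum(), 1))  # L1-normalized BoF vector
    return np.array(hists)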
Article
Full-text available
The identification of star clusters holds significant importance in studying galaxy formation and evolution history. However, the task of swiftly and accurately identifying star clusters from vast amounts of photometric images presents an immense challenge. To address these difficulties, we employ deep learning models for image classification to identify young disk star clusters in M31 from the Pan-Andromeda Archaeological Survey (PAndAS) images. For training, validation, and testing, we utilize the Panchromatic Hubble Andromeda Treasury survey catalogs. We evaluate the performance of various deep learning models, using different classification thresholds and limiting magnitudes. Our findings indicate that the ResNet-50 model exhibits the highest overall accuracy. Moreover, using brighter limiting magnitudes and increasing the classification thresholds can effectively enhance the accuracy and precision of cluster identification. Through our experiments, we found that the model achieves optimal performance when the limiting magnitude is set to brighter than 21 mag. Based on this, we constructed a training data set with magnitudes less than 21 mag and trained a second ResNet-50 model. This model achieved a purity of 89.30%, a recall of 73.55%, and an F1 score of 80.66% when the classification threshold was set to 0.669. Applying the second model to all sources in the PAndAS fields within a projected radius of 30 kpc from the center of M31, we identified 2228 new unique star cluster candidates. We conducted visual inspections to validate the results produced by our automated methods, and we ultimately obtained 1057 star cluster candidates, of which 745 are newly identified.
Chapter
Signal processing plays a crucial role across various disciplines, from telecommunications to biomedical engineering and beyond. It involves manipulating and analyzing signals to extract useful information, enhance quality, or achieve specific goals. The importance of signal processing in extracting meaningful information from signals makes it foundational in modern technology and scientific research.
Chapter
Full-text available
This paper presents a technique to determine the identity of objects in a scene using histograms of the responses of a vector of local linear neighborhood operators (receptive fields). This technique can be used to determine the most probable objects in a scene, independent of the object's position, image-plane orientation and scale. In this paper we describe the mathematical foundations of the technique and present the results of experiments which compare robustness and recognition rates for different local neighborhood operators and histogram similarity measurements.
Article
Full-text available
This paper defines a multiple resolution representation for the two-dimensional gray-scale shapes in an image. This representation is constructed by detecting peaks and ridges in the difference of lowpass (DOLP) transform. Descriptions of shapes which are encoded in this representation may be matched efficiently despite changes in size, orientation, or position. Motivations for a multiple resolution representation are presented first, followed by the definition of the DOLP transform. Techniques are then presented for encoding a symbolic structural description of forms from the DOLP transform. This process involves detecting local peaks and ridges in each bandpass image and in the entire three-dimensional space defined by the DOLP transform. Linking adjacent peaks in different bandpass images gives a multiple resolution tree which describes shape. Peaks which are local maxima in this tree provide landmarks for aligning, manipulating, and matching shapes. Detecting and linking the ridges in each DOLP bandpass image provides a graph which links peaks within a shape in a bandpass image and describes the positions of the boundaries of the shape at multiple resolutions. Detecting and linking the ridges in the DOLP three-space describes elongated forms and links the largest peaks in the tree. The principles for determining the correspondence between symbols in pairs of such descriptions are then described. Such correspondence matching is shown to be simplified by using the correspondence at lower resolutions to constrain the possible correspondence at higher resolutions.
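For intuition, a difference-of-lowpass stack can be approximated with Gaussian low-pass filters: each band-pass layer is the difference of two adjacent smoothing levels, and peaks and ridges would then be detected in that stack. The sketch below is a minimal illustration; the Gaussian kernel, the √2 scale ratio, and six levels are assumptions rather than the paper's exact DOLP filters.

import numpy as np
from scipy.ndimage import gaussian_filter

def dolp_stack(image, n_levels=6, sigma0=1.0, k=2 ** 0.5):
    # Low-pass the image at a geometric sequence of scales
    levels = [gaussian_filter(image.astype(float), sigma0 * k ** i) for i in range(n_levels)]
    # Each band-pass layer is the difference of two adjacent low-pass layers
    return np.stack([levels[i] - levels[i + 1] for i in range(n_levels - 1)])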
Article
Full-text available
1. Object vision is largely invariant to changes of retinal images of objects in size and position. To reveal neuronal mechanisms of this invariance, we recorded activities from single cells in the anterior part of the inferotemporal cortex (anterior IT), determined the critical features for the activation of individual cells, and examined the effects of changes in stimulus size and position on the responses. 2. Twenty-one percent of the anterior IT cells studied here responded to ranges of size > 4 octaves, whereas 43% responded to size ranges < 2 octaves. The optimal stimulus size, measured by the distance between the outer edges along the longest axis of the stimulus, ranged from 1.7 to 30 degrees. 3. The selectivity for shape was mostly preserved over the entire range of effective size and over the receptive field, whereas some subtle but statistically significant changes were observed in one half of the cells studied here. 4. The size-specific responses observed in 43% of the cells are consistent with recent psychophysical data that suggest that images of objects are stored in a size-specific manner in the long-term memory. Both size-dependent and -independent processing of images may occur in anterior IT.
Article
Full-text available
A model of recognition is described based on cell properties in the ventral cortical stream of visual processing in the primate brain. At a critical intermediate stage in this system, ‘Elaborate’ feature sensitive cells respond selectively to visual features in a way that depends on size (±1 octave) and orientation (±45°) but does not depend on position within central vision (±5°). These features are simple conjunctions of 2-D elements (e.g. a horizontal dark area above a dark smoothly convex area). They can arise either as elements of an object’s surface pattern or as a 3-D component bounded by an object’s external contour. By requiring a combination of several such features without regard to their position within the central region of the visual image, ‘Pattern’ sensitive cells at higher levels can exhibit selectivity for complex configurations that typify objects seen under particular viewing conditions. Given that input features to such Pattern sensitive cells are specified in approximate size and orientation, initial cellular ‘representations’ of the visual appearance of object type (or object example) are also selective for orientation and size. At this level, sensitivity to object view (±60°) arises because visual features disappear as objects are rotated in perspective. Processing is thus viewer-centred and the neurones only respond to objects seen from particular viewing conditions or ‘object instances’. Combined sensitivity to multiple features (conjunctions of elements), independent of their position, establishes selectivity for the configurations of object parts (from one view), because rearranged configurations of the same parts yield images lacking some of the 2-D visual features present in the normal configuration.
Article
Full-text available
We describe and analyze an appearance-based 3-D object recognition system that avoids some of the problems of previous appearance-based schemes. We describe various large-scale performance tests and report good performance for full-sphere/hemisphere recognition of up to 24 complex, curved objects, robustness against clutter and occlusion, and some intriguing generic recognition behavior. We also establish a protocol that permits performance in the presence of quantifiable amounts of clutter and occlusion to be predicted on the basis of simple score statistics derived from clean test images and pure clutter images.
Conference Paper
Full-text available
This paper describes a probabilistic object recognition technique which does not require correspondence matching of images. This technique is an extension of our earlier work (1996) on object recognition using matching of multi-dimensional receptive field histograms. In the earlier paper we showed that multi-dimensional receptive field histograms can be matched to provide object recognition which is robust in the face of changes in viewing position and independent of image plane rotation and scale. In this paper we extend this method to compute the probability of the presence of an object in an image. The paper begins with a review of the method and previously presented experimental results. We then extend the method for histogram matching to obtain a genuine probability of the presence of an object. We present experimental results on a database of 100 objects showing that the approach is capable of recognizing all objects correctly by using only a small portion of the image. Our results show that receptive field histograms provide a technique for object recognition which is robust, has low computational cost, and has a computational complexity that is linear in the number of pixels.
Article
Full-text available
This paper describes a method for recognizing partially occluded objects for bin-picking tasks using eigenspace analysis, referred to as the “eigen window” method, that stores multiple partial appearances of an object in an eigenspace. Such partial appearances require a large amount of memory space. Three measurements, detectability, uniqueness, and reliability, on windows are developed to eliminate redundant windows and thereby reduce memory requirements. Using a pose clustering technique, the method determines the pose of an object and the object type itself. We have implemented the method and verified its validity
Article
Full-text available
Nearest-neighbor correlation-based similarity computation in the space of outputs of complex-type receptive fields can support robust recognition of 3D objects. Our experiments with four collections of objects resulted in mean recognition rates between 84% (for subordinate-level discrimination among 15 quadruped animal shapes) and 94% (for basic-level recognition of 20 everyday objects), over a 40° × 40° range of viewpoints, centered on a stored canonical view and related to it by rotations in depth. This result has interesting implications for the design of a front end to an artificial object recognition system, and for the understanding of the faculty of object recognition in primate vision. 1 INTRODUCTION Orientation-selective receptive fields (RFs) patterned after those found in the mammalian primary visual cortex (V1) are employed by a growing number of connectionist approaches to machine vision (for a review, see Edelman, 1997). Despite the success of RF-based sy...
Article
Full-text available
This article presents: (i) a multiscale representation of grey-level shape called the scale-space primal sketch, which makes explicit both features in scale-space and the relations between structures at different scales, (ii) a methodology for extracting significant blob-like image structures from this representation, and (iii) applications to edge detection, histogram analysis, and junction classification demonstrating how the proposed method can be used for guiding later-stage visual processes. The representation gives a qualitative description of image structure, which allows for detection of stable scales and associated regions of interest in a solely bottom-up data-driven way. In other words, it generates coarse segmentation cues, and can hence be seen as preceding further processing, which can then be properly tuned. It is argued that once such information is available, many other processing tasks can become much simpler. Experiments on real imagery demonstrate that the proposed theory gives intuitive results.
Article
Full-text available
An inherent property of objects in the world is that they only exist as meaningful entities over certain ranges of scale. If one aims at describing the structure of unknown real-world signals, then a multi-scale representation of data is of crucial importance. This chapter gives a tutorial review of a special type of multi-scale representation, linear scale-space representation, which has been developed by the computer vision community in order to handle image structures at different scales in a consistent manner. The basic idea is to embed the original signal into a one-parameter family of gradually smoothed signals, in which the fine-scale details are successively suppressed. Under rather general conditions on the type of computations that are to be performed at the first stages of visual processing, in what can be termed the visual front end, it can be shown that the Gaussian kernel and its derivatives are singled out as the only possible smoothing kernels. The conditions that specify ...
Conference Paper
Shape indexing is a way of making rapid associations between features detected in an image and object models that could have produced them. When model databases are large, the use of high-dimensional features is critical, due to the improved level of discrimination they can provide. Unfortunately, finding the nearest neighbour to a query point rapidly becomes inefficient as the dimensionality of the feature space increases. Past indexing methods have used hash tables for hypothesis recovery, but only in low-dimensional situations. In this paper we show that a new variant of the k-d tree search algorithm makes indexing in higher-dimensional spaces practical. This Best Bin First, or BBF, search is an approximate algorithm which finds the nearest neighbour for a large fraction of the queries, and a very close neighbour in the remaining cases. The technique has been integrated into a fully developed recognition system, which is able to detect complex objects in real, cluttered scenes in just a few seconds.
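For a sense of what approximate nearest-neighbour indexing looks like in practice, the sketch below queries a k-d tree over high-dimensional descriptors; SciPy's eps-approximate query is related to, but not the same as, Best Bin First, which instead bounds the number of leaf bins examined, and the 128-D random data merely stands in for real image keys.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
database = rng.normal(size=(10000, 128))   # placeholder 128-D descriptors
queries = rng.normal(size=(50, 128))

tree = cKDTree(database)
# eps > 0 permits (1 + eps)-approximate neighbours, trading accuracy for speed
dist, idx = tree.query(queries, k=1, eps=0.5)
print(idx[:5], dist[:5])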
Article
Computer vision is embracing a new research focus in which the aim is to develop visual skills for robots that allow them to interact with a dynamic, realistic environment. To achieve this aim, new kinds of vision algorithms need to be developed which run in real time and subserve the robot's goals. Two fundamental goals are determining the identity of an object with a known location, and determining the location of a known object. Color can be successfully used for both tasks. This article demonstrates that color histograms of multicolored objects provide a robust, efficient cue for indexing into a large database of models. It shows that color histograms are stable object representations in the presence of occlusion and over change in view, and that they can differentiate among a large number of objects. For solving the identification problem, it introduces a technique called Histogram Intersection, which matches model and image histograms, and a fast incremental version of Histogram Intersection, which allows real-time indexing into a large database of stored models. For solving the location problem it introduces an algorithm called Histogram Backprojection, which performs this task efficiently in crowded scenes.
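The two operations named above can be written down compactly; in the sketch below the bin counts, HSV channels, and file names are illustrative assumptions, not the article's exact configuration.

import cv2
import numpy as np

def histogram_intersection(h_model, h_image):
    # Sum of bin-wise minima, normalized by the model histogram
    return np.minimum(h_model, h_image).sum() / h_model.sum()

scene = cv2.imread("scene.png")          # hypothetical file names
model = cv2.imread("model_object.png")
hsv_scene = cv2.cvtColor(scene, cv2.COLOR_BGR2HSV)
hsv_model = cv2.cvtColor(model, cv2.COLOR_BGR2HSV)

# 2-D hue/saturation histogram of the model object
h_model = cv2.calcHist([hsv_model], [0, 1], None, [30, 32], [0, 180, 0, 256])
cv2.normalize(h_model, h_model, 0, 255, cv2.NORM_MINMAX)

# Histogram Backprojection: per-pixel likelihood of belonging to the model's colours
backproj = cv2.calcBackProject([hsv_scene], [0, 1], h_model, [0, 180, 0, 256], 1)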
Article
This paper discusses how local measurements of positions and surface normals may be used to identify and locate overlapping objects. The objects are modeled as polyhedra (or polygons) having up to six degrees of positional freedom relative to the sensors. The approach operates by examining all hypotheses about pairings between sensed data and object surfaces and efficiently discarding inconsistent ones by using local constraints on: distances between faces, angles between face normals, and angles (relative to the surface normals) of vectors between sensed points. The method described here is an extension of a method for recognition and localization of nonoverlapping parts previously described in [18] and [15].
Article
This paper proposes a robust approach to image matching by exploiting the only available geometric constraint, namely, the epipolar constraint. The images are uncalibrated, namely the motion between them and the camera parameters are not known. Thus, the images can be taken by different cameras or a single camera at different time instants. If we make an exhaustive search for the epipolar geometry, the complexity is prohibitively high. The idea underlying our approach is to use classical techniques (correlation and relaxation methods in our particular implementation) to find an initial set of matches, and then use a robust technique—the Least Median of Squares (LMedS)—to discard false matches in this set. The epipolar geometry can then be accurately estimated using a meaningful image criterion. More matches are eventually found, as in stereo matching, by using the recovered epipolar geometry. A large number of experiments have been carried out, and very good results have been obtained. Regarding the relaxation technique, we define a new measure of matching support, which allows a higher tolerance to deformation with respect to rigid transformations in the image plane and a smaller contribution for distant matches than for nearby ones. A new strategy for updating matches is developed, which only selects those matches having both high matching support and low matching ambiguity. The update strategy is different from the classical “winner-take-all”, which is easily stuck at a local minimum, and also from “loser-take-nothing”, which is usually very slow. The proposed algorithm has been widely tested and works remarkably well in a scene with many repetitive patterns.
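The robust rejection step can be illustrated with OpenCV's Least Median of Squares estimator for the fundamental matrix; the synthetic point arrays below merely stand in for the initial correlation/relaxation matches described in the abstract, which are not reproduced here.

import cv2
import numpy as np

# Placeholder candidate matches; in practice these come from the initial matcher
rng = np.random.default_rng(1)
pts1 = rng.uniform(0, 640, size=(200, 2)).astype(np.float32)
pts2 = (pts1 + [30.0, 0.0] + rng.normal(0, 1.0, size=(200, 2))).astype(np.float32)

# LMedS estimation of the fundamental matrix; the mask flags matches kept as inliers
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_LMEDS)
if F is not None:
    inliers1 = pts1[mask.ravel() == 1]
    inliers2 = pts2[mask.ravel() == 1]
    print("kept", len(inliers1), "of", len(pts1), "matches")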
Article
A computer vision system has been implemented that can recognize three-dimensional objects from unknown viewpoints in single gray-scale images. Unlike most other approaches, the recognition is accomplished without any attempt to reconstruct depth information bottom-up from the visual input. Instead, three other mechanisms are used that can bridge the gap between the two-dimensional image and knowledge of three-dimensional objects. First, a process of perceptual organization is used to form groupings and structures in the image that are likely to be invariant over a wide range of viewpoints. Second, a probabilistic ranking method is used to reduce the size of the search space during model-based matching. Finally, a process of spatial correspondence brings the projections of three-dimensional models into direct correspondence with the image by solving for unknown viewpoint and model parameters. A high level of robustness in the presence of occlusion and missing data can be achieved through full application of a viewpoint consistency constraint. It is argued that similar mechanisms and constraints form the basis for recognition in human vision.
Article
The Hough transform is a method for detecting curves by exploiting the duality between points on a curve and parameters of that curve. The initial work showed how to detect both analytic curves(1,2) and non-analytic curves,(3) but these methods were restricted to binary edge images. This work was generalized to the detection of some analytic curves in grey level images, specifically lines,(4) circles(5) and parabolas.(6) The line detection case is the best known of these and has been ingeniously exploited in several applications.(7,8,9) We show how the boundaries of an arbitrary non-analytic shape can be used to construct a mapping between image space and Hough transform space. Such a mapping can be exploited to detect instances of that particular shape in an image. Furthermore, variations in the shape such as rotations, scale changes or figure ground reversals correspond to straightforward transformations of this mapping. However, the most remarkable property is that such mappings can be composed to build mappings for complex shapes from the mappings of simpler component shapes. This makes the generalized Hough transform a kind of universal transform which can be used to find arbitrarily complex shapes.
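A compact version of the generalized Hough transform for a fixed scale and orientation can be sketched as follows; the Sobel gradients, 36 orientation bins, and centre-of-template reference point are simplifying assumptions.

import numpy as np
from scipy import ndimage

def edges_and_orientations(img, thresh=0.1):
    gx = ndimage.sobel(img.astype(float), axis=1)
    gy = ndimage.sobel(img.astype(float), axis=0)
    mag = np.hypot(gx, gy)
    pts = np.argwhere(mag > thresh * mag.max())          # strong edge points (row, col)
    ang = np.arctan2(gy, gx)[pts[:, 0], pts[:, 1]]
    return pts, ang

def build_r_table(template, n_bins=36):
    ref = np.array(template.shape) / 2.0                 # reference point = template centre
    pts, ang = edges_and_orientations(template)
    table = [[] for _ in range(n_bins)]
    for (y, x), a in zip(pts, ang):
        b = int((a + np.pi) / (2 * np.pi) * n_bins) % n_bins
        table[b].append(ref - (y, x))                    # displacement vector indexed by gradient angle
    return table

def ght_accumulate(image, table, n_bins=36):
    acc = np.zeros(image.shape)
    pts, ang = edges_and_orientations(image)
    for (y, x), a in zip(pts, ang):
        b = int((a + np.pi) / (2 * np.pi) * n_bins) % n_bins
        for r in table[b]:
            vy, vx = int(round(y + r[0])), int(round(x + r[1]))
            if 0 <= vy < acc.shape[0] and 0 <= vx < acc.shape[1]:
                acc[vy, vx] += 1                         # vote for a reference-point location
    return acc                                           # peaks indicate likely shape instances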
Article
The problem of automatically learning object models for recognition and pose estimation is addressed. In contrast to the traditional approach, the recognition problem is formulated as one of matching appearance rather than shape. The appearance of an object in a two-dimensional image depends on its shape, reflectance properties, pose in the scene, and the illumination conditions. While shape and reflectance are intrinsic properties and constant for a rigid object, pose and illumination vary from scene to scene. A compact representation of object appearance is proposed that is parametrized by pose and illumination. For each object of interest, a large set of images is obtained by automatically varying pose and illumination. This image set is compressed to obtain a low-dimensional subspace, called the eigenspace, in which the object is represented as a manifold. Given an unknown input image, the recognition system projects the image to eigenspace. The object is recognized based on the manifold it lies on. The exact position of the projection on the manifold determines the object's pose in the image. A variety of experiments are conducted using objects with complex appearance characteristics. The performance of the recognition and pose estimation algorithms is studied using over a thousand input images of sample objects. Sensitivity of recognition to the number of eigenspace dimensions and the number of learning samples is analyzed. For the objects used, appearance representation in eigenspaces with less than 20 dimensions produces accurate recognition results with an average pose estimation error of about 1.0 degree. A near real-time recognition system with 20 complex objects in the database has been developed. The paper is concluded with a discussion on various issues related to the proposed learning and recognition methodology.
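A minimal eigenspace sketch (not the full parametric-manifold system) is shown below: training appearances are projected with PCA and a query is recognized by its nearest projected training sample; the image size, 20-dimensional subspace, and random placeholder data are assumptions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_images = rng.random((300, 64 * 64))   # placeholder flattened appearance samples
train_labels = rng.integers(0, 20, 300)     # placeholder object identities
query = rng.random((1, 64 * 64))

pca = PCA(n_components=20)                  # low-dimensional eigenspace
train_proj = pca.fit_transform(train_images)
query_proj = pca.transform(query)

# Recognize by nearest neighbour in the eigenspace
nearest = np.argmin(np.linalg.norm(train_proj - query_proj, axis=1))
print("predicted object:", train_labels[nearest])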
Article
Background: The inferior temporal cortex (IT) of the monkey has long been known to play an essential role in visual object recognition. Damage to this area results in severe deficits in perceptual learning and object recognition, without significantly affecting basic visual capacities. Consistent with these ablation studies is the discovery of IT neurons that respond to complex two-dimensional visual patterns, or objects such as faces or body parts. What is the role of these neurons in object recognition? Is such a complex configurational selectivity specific to biologically meaningful objects, or does it develop as a result of extensive exposure to any objects whose identification relies on subtle shape differences? If so, would IT neurons respond selectively to recently learned views of features of novel objects? The present study addresses this question by using combined psychophysical and electrophysiological experiments, in which monkeys learned to classify and recognize computer-generated three-dimensional objects. Results: A population of IT neurons was found that responded selectively to views of previously unfamiliar objects. The cells discharged maximally to one view of an object, and their response declined gradually as the object was rotated away from this preferred view. No selective responses were ever encountered for views that the animal systematically failed to recognize. Most neurons also exhibited orientation-dependent responses during view-plane rotations. Some neurons were found to be tuned around two views of the same object, and a very small number of cells responded in a view-invariant manner. For the five different objects that were used extensively during the training of the animals, and for which behavioral performance became view-independent, multiple cells were found that were tuned around different views of the same object. A number of view-selective units showed response invariance for changes in the size of the object or the position of its image within the parafovea. Conclusion: Our results suggest that IT neurons can develop a complex receptive field organization as a consequence of extensive training in the discrimination and recognition of objects. None of these objects had any prior meaning for the animal, nor did they resemble anything familiar in the monkey's environment. Simple geometric features did not appear to account for the neurons' selective responses. These findings support the idea that a population of neurons--each tuned to a different object aspect, and each showing a certain degree of invariance to image transformations--may, as an ensemble, encode at least some types of complex three-dimensional objects. In such a system, several neurons may be active for any given vantage point, with a single unit acting like a blurred template for a limited neighborhood of a single view.
Article
The feature-based representations of object images in the inferotemporal cortex of macaque monkeys have been further characterized by optical imaging experiments. Recently, the close correlation between the activity of inferotemporal cells and the perception of object images has been revealed by single-unit recordings from behaving monkeys. The human homologue of the monkey inferotemporal cortex has been identified through use of new non-invasive techniques.
Article
Object perception may involve seeing, recognition, preparation of actions, and emotional responses--functions that human brain imaging and neuropsychology suggest are localized separately. Perhaps because of this specialization, object perception is remarkably rapid and efficient. Representations of componential structure and interpolation from view-dependent images both play a part in object recognition. Unattended objects may be implicitly registered, but recent experiments suggest that attention is required to bind features, to represent three-dimensional structure, and to mediate awareness.
Article
This paper addresses the problem of retrieving images from large image databases. The method is based on local grayvalue invariants which are computed at automatically detected interest points. A voting algorithm and semilocal constraints make retrieval possible. Indexing allows for efficient retrieval from a database of more than 1,000 images. Experimental results show correct retrieval in the case of partial visibility, similarity transformations, extraneous features, and small perspective deformations
Article
Model-based recognition and motion tracking depend upon the ability to solve for projection and model parameters that will best fit a 3-D model to matching 2-D image features. The author extends current methods of parameter solving to handle objects with arbitrary curved surfaces and with any number of internal parameters representing articulation, variable dimensions, or surface deformations. Numerical stabilization methods are developed that take account of inherent inaccuracies in the image measurements and allow useful solutions to be determined even when there are fewer matches than unknown parameters. The Levenberg-Marquardt method is used to always ensure convergence of the solution. These techniques allow model-based vision to be used for a much wider class of problems than was possible with previous methods. Their application is demonstrated for tracking the motion of curved, parameterized objects
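The core numerical step (solving for model parameters that minimize 2-D reprojection residuals with Levenberg-Marquardt) can be sketched with SciPy; the toy planar similarity transform and the four hand-picked correspondences below are assumptions, not the article's full articulated 3-D formulation.

import numpy as np
from scipy.optimize import least_squares

model_pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # model features
image_pts = np.array([[2.1, 3.0], [2.9, 3.2], [2.7, 4.1], [1.9, 3.9]])   # matched image features

def residuals(p):
    s, theta, tx, ty = p
    R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    pred = s * model_pts @ R.T + [tx, ty]    # project model points with current parameters
    return (pred - image_pts).ravel()        # stacked 2-D residuals

# Levenberg-Marquardt iteratively refines the parameters toward a least-squares fit
sol = least_squares(residuals, x0=[1.0, 0.0, 0.0, 0.0], method="lm")
print("scale, rotation, translation:", sol.x)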
Ballard, D.H., "Generalizing the Hough transform to detect arbitrary patterns," Pattern Recognition, 13, 2 (1981), pp. 111–122.
Basri, Ronen, and David W. Jacobs, "Recognition using region correspondences," International Journal of Computer Vision, 25, 2 (1996), pp. 141–162.
Beis, Jeff, and David G. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," Conference on Computer Vision and Pattern Recognition, Puerto Rico (1997), pp. 1000–1006.
Crowley, James L., and Alice C. Parker, "A representation for shape based on peaks and ridges in the difference of lowpass transform," IEEE Trans. on Pattern Analysis and Machine Intelligence, 6, 2 (1984), pp. 156–170.
Edelman, Shimon, Nathan Intrator, and Tomaso Poggio, "Complex cells and object recognition," Unpublished Manuscript, preprint at http://www.ai.mit.edu/
Grimson, Eric, and Tomás Lozano-Pérez, "Localizing overlapping parts by searching the interpretation tree," IEEE Trans. on Pattern Analysis and Machine Intelligence, 9 (1987), pp. 469–482.
Ito, Minami, Hiroshi Tamura, Ichiro Fujita, and Keiji Tanaka, "Size and position invariance of neuronal responses in monkey inferotemporal cortex," Journal of Neurophysiology, 73, 1 (1995), pp. 218–226.
Pope, Arthur R., and David G. Lowe, "Learning probabilistic appearance models for object recognition," in Early Visual Learning, eds. Shree Nayar and Tomaso Poggio (Oxford University Press, 1996), pp. 67–97.