Conference Paper

Unsupervised Learning of Invariant Features Using Video

DOI: 10.1109/CVPR.2010.5539773
Conference: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Source: IEEE Xplore


We present an algorithm that learns invariant features from real data in an entirely unsupervised fashion. The principal benefit of our method is that it can be applied without human intervention to a particular application or data set, learning the specific invariances necessary for excellent feature performance on that data. Our algorithm relies on the ability to track image patches over time using optical flow. With the wide availability of high frame rate video (e.g., on the web or from a robot), good tracking is straightforward to achieve. The algorithm then optimizes feature parameters such that patches corresponding to the same physical location have feature descriptors that are as similar as possible, while simultaneously maximizing the distinctness of descriptors for different locations. Thus, our method captures data- or application-specific invariances yet does not require any manual supervision. We apply our algorithm to learn domain-optimized versions of SIFT and HOG. SIFT and HOG features are excellent and widely used. However, they are general and by definition not tailored to a specific domain. Our domain-optimized versions offer a substantial performance increase for the classification and correspondence tasks we consider. Furthermore, we show that the features our method learns are close to the optimum that would be achieved by directly optimizing the test set performance of a classifier. Finally, we demonstrate that the learning often allows fewer features to be used for some tasks, which has the potential to dramatically reduce computational cost on very large data sets.
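To make the optimization concrete, here is a minimal sketch in Python of how such an objective could be scored. The patch pairing, the Euclidean distance, and the simple mean-difference loss are illustrative assumptions rather than the paper's exact formulation, and the hypothetical describe function stands in for any parameterized descriptor such as SIFT or HOG.

    import numpy as np

    def invariance_objective(describe, params, same_pairs, diff_pairs):
        # Illustrative sketch, not the authors' exact formulation.
        # describe(patch, params): returns a 1-D descriptor vector (e.g., a
        #   SIFT- or HOG-like descriptor whose parameters are being tuned).
        # same_pairs: patch pairs linked by optical-flow tracking (same physical point).
        # diff_pairs: patch pairs drawn from different physical locations.
        d_same = [np.linalg.norm(describe(a, params) - describe(b, params))
                  for a, b in same_pairs]
        d_diff = [np.linalg.norm(describe(a, params) - describe(b, params))
                  for a, b in diff_pairs]
        # Lower is better: tracked pairs should map to nearby descriptors,
        # unrelated pairs to distant ones.
        return float(np.mean(d_same) - np.mean(d_diff))

Descriptor parameters would then be tuned to minimize this score, for example by grid or black-box search, yielding features whose invariances are matched to the video domain they were trained on.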

  • "Furthermore, the order of blocks is optimized as well. Parameters of SIFT and HOG features are optimized by the method of [18]. A set of patches are automatically extracted from street level imagery similarly to [6]."
    ABSTRACT: Place recognition for loop closure detection lies at the heart of every Simultaneous Localization and Mapping (SLAM) method. Recently, methods that use cameras and describe the entire image by one holistic feature vector have experienced a resurgence. Despite the success of these methods, it remains unclear how a descriptor should be constructed for this particular purpose. The problem of choosing the right descriptor becomes even more pronounced in the context of lifelong mapping. The appearance of a place may vary considerably under different illumination conditions and over the course of a day. None of the handcrafted descriptors published in the literature are particularly designed for this purpose. Herein, we propose to use a set of elementary building blocks from which millions of different descriptors can be constructed automatically. Moreover, we present an evaluation function which evaluates the performance of a given image descriptor for place recognition under severe lighting changes. Finally, we present an algorithm to efficiently search the space of descriptors to find the best-suited one. Evaluating the trained descriptor on a test set shows a clear superiority over its handcrafted counterparts such as BRIEF and U-SURF. Finally, we show how loop closures can be reliably detected using the automatically learned descriptor. Two overlapping image sequences from two different days and times are merged into one pose graph. The resulting merged pose graph is optimized and does not contain a single false link, while at the same time all true loop closures were detected correctly. The descriptor and the place recognizer source code are published with datasets on
    IEEE Intelligent Vehicles Symposium; 01/2013
  • "This redundancy provides useful information for deriving the relations between the features extracted in different frames. Stavens et al. [8] used such information to learn the parameters for feature description."
    ABSTRACT: Image retrieval based on the Query-by-Example (QBE) principle is still not reliable enough, largely because of the likely variations in the capture conditions (e.g., light, blur, scale, occlusion) and in the viewpoint between the query image and the images in the collection. In this paper, we propose a novel QBE-based image retrieval system in which users are allowed to submit a short video clip as a query to improve retrieval reliability. The improvement is achieved by integrating the information about the different viewpoints and conditions under which object and scene appearances can be captured across different video frames. The rich information extracted from a video can be exploited to generate a more complete query representation than in the case of a single-image query and to improve the relevance of the retrieved results. Our experimental results show that video-based image retrieval (VBIR) is significantly more reliable than retrieval using a single image as a query.
    Proceedings of the 19th International Conference on Multimedea 2011, Scottsdale, AZ, USA, November 28 - December 1, 2011; 01/2011
  • ABSTRACT: Image retrieval based on the query-by-example (QBE) principle is still not reliable enough, largely because of the likely variations in the capture conditions (e.g., light, blur, scale, occlusion) and viewpoint between the query image and the images in the collection. In this paper, we propose a framework in which this problem is explicitly addressed to improve the reliability of QBE-based image retrieval. We aim at the use scenario in which the user captures the query object with his/her mobile device and requests information augmenting the query from the database. Reliability improvement is achieved by allowing the user to submit not a single image but a short video clip as a query. Since a video clip may combine object or scene appearances captured from different viewpoints and under different conditions, the rich information contained therein can be exploited to discover the proper query representation and to improve the relevance of the retrieved results. The experimental results show that video-based image retrieval (VBIR) is significantly more reliable than retrieval using a single image as a query. Furthermore, to make the proposed framework deployable in a practical mobile image retrieval system, where real-time query response is required, we also propose a priority-queue-based feature description scheme and a cache-based bi-quantization algorithm for an efficient parallel implementation of the VBIR concept.
    09/2012; 2(3). DOI:10.1007/s13735-012-0023-3