Conference Paper

Unsupervised Learning of Invariant Features Using Video

DOI: 10.1109/CVPR.2010.5539773
Conference: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Source: IEEE Xplore


We present an algorithm that learns invariant features from real data in an entirely unsupervised fashion. The principal benefit of our method is that it can be applied to a particular application or data set without human intervention, learning the specific invariances needed for excellent feature performance on that data. Our algorithm relies on the ability to track image patches over time using optical flow. With the wide availability of high-frame-rate video (e.g., on the web or from a robot), good tracking is straightforward to achieve. The algorithm then optimizes feature parameters so that patches corresponding to the same physical location have feature descriptors that are as similar as possible, while simultaneously maximizing the distinctness of descriptors for different locations. Thus, our method captures data- or application-specific invariances yet requires no manual supervision. We apply our algorithm to learn domain-optimized versions of SIFT and HOG. SIFT and HOG features are excellent and widely used, but they are general and by definition not tailored to a specific domain. Our domain-optimized versions offer a substantial performance increase on the classification and correspondence tasks we consider. Furthermore, we show that the features our method learns are near the optimum that would be achieved by directly optimizing the test set performance of a classifier. Finally, we demonstrate that the learning often allows fewer features to be used for some tasks, which has the potential to dramatically reduce computational cost for very large data sets.
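
The optimization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the descriptor below is a toy intensity histogram standing in for SIFT or HOG, the objective (mean same-track distance minus mean different-track distance) is one assumed way to combine the two goals, the grid search is a swapped-in substitute for whatever optimizer the paper uses, and `make_synthetic_tracks` is a hypothetical replacement for real optical-flow tracks.

```python
import numpy as np


def make_synthetic_tracks(n_tracks, track_len, patch_size=8, noise=0.05, seed=0):
    """Hypothetical stand-in for optical-flow tracking: each track is one
    random base patch observed track_len times with additive noise."""
    rng = np.random.default_rng(seed)
    tracks = []
    for _ in range(n_tracks):
        base = rng.random((patch_size, patch_size))
        obs = [np.clip(base + noise * rng.standard_normal(base.shape), 0.0, 1.0)
               for _ in range(track_len)]
        tracks.append(obs)
    return tracks


def descriptor(patch, n_bins):
    """Toy tunable descriptor (assumption, not SIFT/HOG): an L2-normalized
    intensity histogram whose only parameter is the number of bins."""
    hist, _ = np.histogram(patch, bins=n_bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


def tracking_objective(tracks, n_bins):
    """Mean distance between descriptors of the same track minus mean distance
    between descriptors of different tracks. Lower is better.
    Needs at least two tracks, each with at least two patches."""
    descs = [np.stack([descriptor(p, n_bins) for p in track]) for track in tracks]
    same, diff = [], []
    for i, a in enumerate(descs):
        # Pairwise distances inside one track (upper triangle, no self-pairs).
        d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
        same.extend(d[np.triu_indices(len(a), k=1)])
        # Distances to every patch of every later track.
        for b in descs[i + 1:]:
            diff.extend(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).ravel())
    return float(np.mean(same) - np.mean(diff))


def select_parameter(tracks, candidates):
    """Grid search (assumed optimizer): score each candidate, keep the lowest."""
    scores = {c: tracking_objective(tracks, c) for c in candidates}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    tracks = make_synthetic_tracks(n_tracks=10, track_len=5)
    best, scores = select_parameter(tracks, [2, 4, 8, 16, 32])
    for n_bins, score in scores.items():
        print(f"n_bins={n_bins:2d}  objective={score:+.4f}")
    print("selected n_bins:", best)
```

No labels appear anywhere in the sketch: the only supervision signal is which patches belong to the same track, which is the property the abstract relies on.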

    • "Furthermore, the order of blocks is optimized as well. Parameters of SIFT and HOG features are optimized by the method of [18]. A set of patches are automatically extracted from street level imagery similarly to [6]. "
    ABSTRACT: Place recognition for loop closure detection lies at the heart of every Simultaneous Localization and Mapping (SLAM) method. Recently, methods that use cameras and describe the entire image by one holistic feature vector have experienced a resurgence. Despite the success of these methods, it remains unclear how a descriptor should be constructed for this particular purpose. The problem of choosing the right descriptor becomes even more pronounced in the context of lifelong mapping. The appearance of a place may vary considerably under different illumination conditions and over the course of a day. None of the handcrafted descriptors published in the literature is designed specifically for this purpose. Herein, we propose to use a set of elementary building blocks from which millions of different descriptors can be constructed automatically. Moreover, we present an evaluation function that measures the performance of a given image descriptor for place recognition under severe lighting changes. We also present an algorithm to efficiently search the space of descriptors for the best-suited one. Evaluating the trained descriptor on a test set shows a clear superiority over its handcrafted counterparts such as BRIEF and U-SURF. Finally, we show how loop closures can be reliably detected using the automatically learned descriptor. Two overlapping image sequences from two different days and times are merged into one pose graph. The resulting merged pose graph is optimized and does not contain a single false link, while at the same time all true loop closures were detected correctly. The descriptor and the place recognizer source code is published with datasets on
    Conference Paper · Jan 2013
    • "This redundancy provides useful information for deriving the relations between the features extracted in different frames. Stavens et al. [8] used such information to learn the parameters for feature description. "
    ABSTRACT: Image retrieval based on the Query-by-Example (QBE) principle is still not reliable enough, owing to likely variations in the capture conditions (e.g. light, blur, scale, occlusion) and in the viewpoint between the query image and the images in the collection. In this paper, we propose a novel QBE-based image retrieval system in which users can submit a short video clip as a query to improve retrieval reliability. The improvement is achieved by integrating information about the different viewpoints and conditions under which object and scene appearances are captured across video frames. The rich information extracted from a video can be exploited to generate a more complete query representation than a single-image query allows and to improve the relevance of the retrieved results. Our experimental results show that video-based image retrieval (VBIR) is significantly more reliable than retrieval using a single image as a query.
    Conference Paper · Jan 2011
    • "Continuous Transformation Learning [34, 26] and Slow Feature Analysis [40] are also approaches that try to learn invariant representations by enforcing spatio-temporal continuities. Recently, Stavens and Thrun [33] showed that it is possible to tune the parameters of keypoint descriptors via observing unlabeled tracked keypoints for specific applications. For text classification, Yang et al. [41] presented an approach based on SVMs that is able to improve a text classifier by additionally using weakly-related unlabeled data. "
    ABSTRACT: Current state-of-the-art object classification systems are trained using large amounts of hand-labeled images. In this paper, we present an approach that uses unlabeled video sequences, containing object categories only weakly related to the target class, to learn better classifiers for tracking and detection. The underlying idea is to exploit the space-time consistency of moving objects to learn classifiers that are robust to local transformations. In particular, we use dense optical flow to find moving objects in videos in order to train part-based random forests that are insensitive to natural transformations. Our method, called Video Forests, can be used in two settings. First, labeled training data can be regularized to force the trained classifier to generalize better to small local transformations. Second, as part of a tracking-by-detection approach, it can be used to train a general codebook solely on pair-wise data, which can then be applied to tracking instances of a priori unknown object categories. In the experimental part, we show on benchmark datasets for both tracking and detection that incorporating unlabeled videos into the learning of visual classifiers leads to improved results.
    Conference Paper · Jan 2011