Conference Paper

Unsupervised learning of invariant features using video

DOI: 10.1109/CVPR.2010.5539773 In proceeding of: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on
Source: IEEE Xplore

ABSTRACT We present an algorithm that learns invariant features from real data in an entirely unsupervised fashion. The principal benefit of our method is that it can be applied without human intervention to a particular application or data set, learning the specific invariances necessary for excellent feature performance on that data. Our algorithm relies on the ability to track image patches over time using optical flow. With the wide availability of high frame rate video (eg: on the web, from a robot), good tracking is straightforward to achieve. The algorithm then optimizes feature parameters such that patches corresponding to the same physical location have feature descriptors that are as similar as possible while simultaneously maximizing the distinctness of descriptors for different locations. Thus, our method captures data or application specific invariances yet does not require any manual supervision. We apply our algorithm to learn domain-optimized versions of SIFT and HOG. SIFT and HOG features are excellent and widely used. However, they are general and by definition not tailored to a specific domain. Our domain-optimized versions offer a substantial performance increase for classification and correspondence tasks we consider. Furthermore, we show that the features our method learns are near the optimal that would be achieved by directly optimizing the test set performance of a classifier. Finally, we demonstrate that the learning often allows fewer features to be used for some tasks, which has the potential to dramatically improve computational concerns for very large data sets.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Locating the center of the eyes allows for valuable information to be captured and used in a wide range of applications. Accurate eye center location can be determined using commercial eye-gaze trackers, but additional constraints and expensive hardware make these existing solutions unattractive and impossible to use on standard (i.e., visible wavelength), low-resolution images of eyes. Systems based solely on appearance are proposed in the literature, but their accuracy does not allow us to accurately locate and distinguish eye centers movements in these low-resolution settings. Our aim is to bridge this gap by locating the center of the eye within the area of the pupil on low-resolution images taken from a webcam or a similar device. The proposed method makes use of isophote properties to gain invariance to linear lighting changes (contrast and brightness), to achieve in-plane rotational invariance, and to keep low-computational costs. To further gain scale invariance, the approach is applied to a scale space pyramid. In this paper, we extensively test our approach for its robustness to changes in illumination, head pose, scale, occlusion, and eye rotation. We demonstrate that our system can achieve a significant improvement in accuracy over state-of-the-art techniques for eye center location in standard low-resolution imagery.
    IEEE Transactions on Software Engineering 09/2012; 34(9):1785-98. · 2.59 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Place recognition for loop closure detection lies at the heart of every Simultaneous Localization and Mapping (SLAM) method. Recently methods that use cameras and describe the entire image by one holistic feature vector have experienced a resurgence. Despite the success of these methods, it remains unclear how a descriptor should be constructed for this particular purpose. The problem of choosing the right descriptor becomes even more pronounced in the context of life long mapping. The appearance of a place may vary considerably under different illumination conditions and over the course of a day. None of the handcrafted descriptors published in literature are particularly designed for this purpose. Herein, we propose to use a set of elementary building blocks from which millions of different descriptors can be constructed automatically. Moreover, we present an evaluation function which evaluates the performance of a given image descriptor for place recognition under severe lighting changes. Finally we present an algorithm to efficiently search the space of descriptors to find the best suited one. Evaluating the trained descriptor on a test set shows a clear superiority over its hand crafted counter parts like BRIEF and U-SURF. Finally we show how loop closures can be reliably detected using the automatically learned descriptor. Two overlapping image sequences from two different days and times are merged into one pose graph. The resulting merged pose graph is optimized and does not contain a single false link while at the same time all true loop closures were detected correctly. The descriptor and the place recognizer source code is published with datasets on
    IEEE Intelligent Vehicles Symposium; 01/2013
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper gives a review of the recent developments in deep learning and un-supervised feature learning for time-series problems. While these techniques have shown promise for modeling static data, such as computer vision, ap-plying them to time-series data is gaining increasing attention. This paper overviews the particular challenges present in time-series data and provides a review of the works that have either applied time-series data to unsupervised feature learning algorithms or alternatively have contributed to modications of feature learning algorithms to take into account the challenges present in time-series data.
    Pattern Recognition Letters 01/2014; · 1.27 Impact Factor