Conference Paper

Ask the image: supervised pooling to preserve feature locality

Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

ABSTRACT In this paper we propose a weighted supervised pooling method for visual recognition systems. We combine a standard Spatial Pyramid Representation, commonly adopted to encode spatial information, with a Feature Space Representation favoring semantic information. For the latter, we propose a weighted pooling strategy that exploits data supervision to weight each local descriptor according to its likelihood of belonging to a given object class. The two representations are then combined adaptively with Multiple Kernel Learning. Experiments on common benchmarks (Caltech-256 and PASCAL VOC-2007) show that our image representation improves the current visual recognition pipeline and is competitive with similar state-of-the-art pooling methods. We also evaluate our method in a real Human-Robot Interaction setting, where the pure Spatial Pyramid Representation does not provide sufficient discriminative power, and obtain a remarkable improvement.
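The supervised weighted pooling described in the abstract can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the `class_likelihoods` array is a hypothetical input standing in for the per-descriptor class-membership scores that the paper obtains from data supervision.

```python
import numpy as np

def weighted_pooling(descriptors, class_likelihoods):
    """Weighted sum-pooling: each local descriptor contributes in
    proportion to its estimated likelihood of belonging to the
    target object class (illustrative sketch of the idea).

    descriptors       : (N, D) array of local feature vectors
    class_likelihoods : (N,) array of non-negative per-descriptor scores
    """
    w = np.asarray(class_likelihoods, dtype=float)
    w = w / (w.sum() + 1e-12)            # normalize weights to sum to 1
    return w @ np.asarray(descriptors)   # (N,) @ (N, D) -> (D,) pooled vector
```

Descriptors likely belonging to the object dominate the pooled vector, while background descriptors are suppressed; the resulting Feature Space Representation can then be combined with a spatial pyramid via Multiple Kernel Learning, as the abstract describes.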



Available from: Sean Ryan Fanello, May 27, 2014
  • Source
    ABSTRACT: This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF’s application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF’s usefulness in a broad range of topics in computer vision.
    Computer Vision - ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part I; 01/2006
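Integral images, which the SURF abstract above credits for its speed, can be sketched briefly: once the integral image is built, any box-filter response costs four array lookups regardless of filter size. This is an illustrative sketch of the general technique, not the SURF implementation itself.

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero-padded first row/column:
    ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in four lookups, independent of box size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```

Because every box sum is O(1), the Hessian-based detector can evaluate box approximations of Gaussian second derivatives at any scale for the same cost, which is what makes the multi-scale search fast.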
  • Source
    ABSTRACT: Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
    Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'10); 01/2010
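The two-step coding/pooling pipeline in the abstract above can be illustrated with hard vector quantization followed by average or max pooling. A minimal sketch under assumed shapes: `codebook` is a K x D dictionary of codewords, not part of the original text.

```python
import numpy as np

def hard_vq(descriptors, codebook):
    """Coding step: assign each descriptor to its nearest codeword,
    producing one-hot codes (hard vector quantization)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = np.zeros((len(descriptors), len(codebook)))
    codes[np.arange(len(descriptors)), d2.argmin(1)] = 1.0
    return codes

def pool(codes, mode="max"):
    """Pooling step: summarize the coded features over a neighborhood.
    Average pooling yields codeword frequencies; max pooling records
    only presence/absence of each codeword."""
    return codes.max(axis=0) if mode == "max" else codes.mean(axis=0)
```

With hard codes, average pooling produces a normalized histogram while max pooling saturates at the first occurrence of a codeword, one intuition behind max pooling's robustness that the paper analyzes.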
  • Source
    ABSTRACT: Invariant representations in object recognition systems are generally obtained by pooling feature vectors over spatially local neighborhoods. But pooling is not local in the feature vector space, so that widely dissimilar features may be pooled together if they are in nearby locations. Recent approaches rely on sophisticated encoding methods and more specialized codebooks (or dictionaries), e.g., learned on subsets of descriptors which are close in feature space, to circumvent this problem. In this work, we argue that a common trait found in much recent work in image recognition or retrieval is that it leverages locality in feature space on top of purely spatial locality. We propose to apply this idea in its simplest form to an object recognition system based on the spatial pyramid framework, to increase the performance of small dictionaries with very little added engineering. State-of-the-art results on several object recognition benchmarks show the promise of this approach.
    Proc. International Conference on Computer Vision (ICCV'11); 11/2011
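The feature-space locality argued for in the abstract above can be illustrated in its simplest form: group descriptors by nearest codeword and pool each group separately, so that dissimilar features at nearby image locations are no longer mixed. This is a hypothetical sketch of the general idea, not the paper's exact method.

```python
import numpy as np

def feature_local_pool(descriptors, codebook):
    """Pool descriptors separately per nearest codeword, then
    concatenate the per-codeword pooled vectors: pooling becomes
    local in feature space, not only in image space."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)                      # nearest-codeword index
    dim = descriptors.shape[1]
    pooled = []
    for k in range(len(codebook)):
        members = descriptors[assign == k]
        # Average within each feature-space cell; empty cells pool to zero.
        pooled.append(members.mean(axis=0) if len(members) else np.zeros(dim))
    return np.concatenate(pooled)
```

The output dimension is K x D, so even a small dictionary yields a richer representation than globally pooling all descriptors together, consistent with the abstract's claim about small dictionaries.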