TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation

Department of Engineering, University of Cambridge
DOI: 10.1007/11744023_1. In: Computer Vision – ECCV 2006, pp. 1–15.


This paper proposes a new approach to learning a discriminative model of object classes that efficiently incorporates appearance, shape and context information. The learned model is used for automatic visual recognition and semantic segmentation of photographs. Our discriminative model exploits novel features, based on textons, which jointly model shape and texture. Unary classification and feature selection are achieved using shared boosting, giving an efficient classifier that can be applied to a large number of classes. Accurate image segmentation is achieved by incorporating these classifiers in a conditional random field. Efficient training of the model on very large datasets is achieved by exploiting both random feature selection and piecewise training methods.
High classification and segmentation accuracy are demonstrated on three different databases: i) our own 21-object-class database of photographs of real objects viewed under general lighting conditions, poses and viewpoints, ii) the 7-class Corel subset, and iii) the 7-class Sowerby database used in [1]. The proposed algorithm gives competitive results for highly textured (e.g. grass, trees), highly structured (e.g. cars, faces, bikes, aeroplanes), and articulated (e.g. body, cow) objects.
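The texton-based features described above can be illustrated with a minimal sketch. In TextonBoost-style shape filters, a feature for pixel i is the count of a given texton t inside a rectangle placed at a fixed offset relative to i, which an integral image makes O(1) per query. The function name, the rectangle convention (half-open offsets), and the clipping behaviour below are illustrative assumptions, not the paper's exact notation:

```python
import numpy as np

def texton_layout_feature(texton_map, t, rect):
    """For each pixel (i, j), count occurrences of texton `t` inside the
    rectangle `rect = (r0, c0, r1, c1)` of half-open row/column offsets
    relative to (i, j), clipped to the image. Illustrative sketch only."""
    H, W = texton_map.shape
    mask = (texton_map == t).astype(np.int64)

    # Integral image with a zero first row/column, so any axis-aligned
    # rectangle sum is four lookups.
    ii = np.zeros((H + 1, W + 1), dtype=np.int64)
    ii[1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)

    r0, c0, r1, c1 = rect
    out = np.empty((H, W), dtype=np.int64)
    for i in range(H):
        for j in range(W):
            # Clip the offset rectangle [i+r0, i+r1) x [j+c0, j+c1).
            a = min(max(i + r0, 0), H); b = min(max(i + r1, 0), H)
            c = min(max(j + c0, 0), W); d = min(max(j + c1, 0), W)
            out[i, j] = ii[b, d] - ii[a, d] - ii[b, c] + ii[a, c]
    return out
```

In the paper's setting, a boosting round would select discriminative (rectangle, texton) pairs of this form, and the resulting strong classifier supplies the unary potentials of the conditional random field.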



Available from: John M. Winn
    • "First, we believe the pixel-level model must be married to a secondary process that captures instance-level or video-level global information, such as action recognition, in order to properly model the actions. Lessons learned from images strongly support this argument—the performance of semantic image segmentation on the MSRC dataset seems to hit a plateau [31] until information from secondary processes, such as context [16] [25], object detectors [17] and a holistic scene model [43], is added. However, to the best of our knowledge, there is no method in video semantic segmentation that directly leverages the recent success in action recognition."
    ABSTRACT: Actor-action semantic segmentation made an important step toward advanced video understanding problems: what action is happening; who is performing the action; and where is the action in space-time. Current models for this problem are local, based on layered CRFs, and are unable to capture long-ranging interaction of video parts. We propose a new model that combines these local labeling CRFs with a hierarchical supervoxel decomposition. The supervoxels provide cues for possible groupings of nodes, at various scales, in the CRFs to encourage adaptive, high-order groups for more effective labeling. Our model is dynamic and continuously exchanges information during inference: the local CRFs influence what supervoxels in the hierarchy are active, and these active nodes influence the connectivity in the CRF; we hence call it a grouping process model. The experimental results on a recent large-scale video dataset show a large margin of 60% relative improvement over the state of the art, which demonstrates the effectiveness of the dynamic, bidirectional flow between labeling and grouping.
    Preview · Article · Dec 2015
    • "A popular trend among these methods consists of learning a pixel classifier and using it as a unary potential in a Markov Random Field (MRF), which models the dependencies of the class labels of two or more pixels [23] [13] [11] [9] [14] [12] [15]. When it comes to the classifier itself, several directions have been proposed, such as boosting-based classifiers [23] [31] [8], or exemplar-based object detectors [26] [15]. With the recent advent of deep learning, several works have focused on developing CNNs to perform semantic segmentation [6] [22] [18] [24]."
    ABSTRACT: Scene parsing has attracted a lot of attention in computer vision. While parametric models have proven effective for this task, they cannot easily incorporate new training data. By contrast, nonparametric approaches, which bypass any learning phase and directly transfer the labels from the training data to the query images, can readily exploit new labeled samples as they become available. Unfortunately, because of the computational cost of their label transfer procedures, state-of-the-art nonparametric methods typically filter out most training images to only keep a few relevant ones to label the query. As such, these methods throw away many images that still contain valuable information and generally obtain an unbalanced set of labeled samples. In this paper, we introduce a nonparametric approach to scene parsing that follows a sample-and-filter strategy. More specifically, we propose to sample labeled superpixels according to an image similarity score, which allows us to obtain a balanced set of samples. We then formulate label transfer as an efficient filtering procedure, which lets us exploit more labeled samples than existing techniques. Our experiments evidence the benefits of our approach over state-of-the-art nonparametric methods on two benchmark datasets.
    No preview · Article · Nov 2015
    • "There are various outdoor and indoor scenes (e.g., beach, highway, city street and airport) that image parsing algorithms try to label. Several systems [3] [7] [6] [9] [11] [15] [18] [19] [20] [24] [27] [28] [33] [36] have been designed to semantically classify each pixel in an image. Among the main challenges facing image parsing methods is that their recognition rate varies significantly among different types of classes."
    ABSTRACT: This paper presents a nonparametric scene parsing approach that improves the overall accuracy, as well as the coverage of foreground classes in scene images. We first improve the label likelihood estimates at superpixels by merging likelihood scores from different probabilistic classifiers. This boosts the classification performance and enriches the representation of less-represented classes. Our second contribution consists of incorporating semantic context in the parsing process through global label costs. Our method does not rely on image retrieval sets but rather assigns a global likelihood estimate to each label, which is plugged into the overall energy function. We evaluate our system on two large-scale datasets, SIFTflow and LMSun. We achieve state-of-the-art performance on the SIFTflow dataset and near-record results on LMSun.
    Preview · Article · Oct 2015