TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation

Department of Engineering, University of Cambridge
DOI: 10.1007/11744023_1


This paper proposes a new approach to learning a discriminative model of object classes, incorporating appearance, shape and
context information efficiently. The learned model is used for automatic visual recognition and semantic segmentation of photographs.
Our discriminative model exploits novel features, based on textons, which jointly model shape and texture. Unary classification
and feature selection are achieved using shared boosting, giving an efficient classifier that can be applied to a large number
of classes. Accurate image segmentation is achieved by incorporating these classifiers in a conditional random field. Efficient
training of the model on very large datasets is achieved by exploiting both random feature selection and piecewise training.
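As background, the texton-based features above build on a textonization step: each pixel is mapped to a discrete texton index by clustering its filter-bank responses. The sketch below illustrates the idea only; the toy Gaussian/gradient bank, the tiny k-means, and all function names are illustrative assumptions, not the paper's actual 17-dimensional filter bank or clustering setup.

```python
import numpy as np

def smooth(img, sigma):
    """Separable Gaussian smoothing via two 1-D convolutions."""
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(np.convolve, 0, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, out, k, mode="same")

def pixel_features(img, sigmas=(1.0, 2.0, 4.0)):
    """Per-pixel responses: smoothed intensity plus x/y gradients at each scale."""
    feats = []
    for s in sigmas:
        sm = smooth(img, s)
        gy, gx = np.gradient(sm)
        feats += [sm, gx, gy]
    return np.stack(feats, axis=-1).reshape(-1, 3 * len(sigmas))

def kmeans_labels(X, k, iters=10, seed=0):
    """Tiny Lloyd's-algorithm k-means, standing in for a real clusterer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def textonize(img, n_textons=8):
    """Map each pixel of a grayscale image to a texton index."""
    X = pixel_features(img)
    return kmeans_labels(X, n_textons).reshape(img.shape)
```

In the full system, the resulting texton map feeds the shape-texture ("texton layout") features selected by boosting.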

High classification and segmentation accuracy is demonstrated on three different databases: i) our own 21-object class database
of photographs of real objects viewed under general lighting conditions, poses and viewpoints; ii) the 7-class Corel subset;
and iii) the 7-class Sowerby database used in [1]. The proposed algorithm gives competitive results for highly textured
(e.g. grass, trees), highly structured (e.g. cars, faces, bikes, aeroplanes) and articulated (e.g. body, cow) objects.
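The segmentation model described above can be written, in simplified form, as a conditional random field energy over pixel labels. This is a sketch that keeps only the unary and pairwise terms (the full TextonBoost model also includes colour and location potentials):

```latex
E(\mathbf{c} \mid \mathbf{x})
  = \sum_{i} \psi_i(c_i, \mathbf{x})
  + \sum_{(i,j) \in \mathcal{E}} \phi\big(c_i, c_j, g_{ij}(\mathbf{x})\big)
```

Here $\psi_i$ is the (negative log) confidence of the boosted unary classifier for label $c_i$ at pixel $i$, and $\phi$ is a contrast-sensitive Potts potential, e.g. $\phi(c_i, c_j, g_{ij}) = g_{ij}\,[c_i \neq c_j]$ with edge weight $g_{ij}$ decreasing in the colour difference between neighbouring pixels $i$ and $j$, so label changes are encouraged to align with image edges. A labeling is obtained by minimizing $E$, typically with graph-cut-based inference.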

Available from: John M. Winn
    • "There are various outdoor and indoor scenes (e.g., beach, highway, city street and airport) that image parsing algorithms try to label. Several systems [3] [7] [6] [9] [11] [15] [18] [19] [20] [24] [27] [28] [33] [36] have been designed to semantically classify each pixel in an image. Among the main challenges which face image parsing methods is that their recognition rate significantly varies among different types of classes. "
    ABSTRACT: This paper presents a nonparametric scene parsing approach that improves the overall accuracy, as well as the coverage of foreground classes in scene images. We first improve the label likelihood estimates at superpixels by merging likelihood scores from different probabilistic classifiers. This boosts the classification performance and enriches the representation of less-represented classes. Our second contribution consists of incorporating semantic context in the parsing process through global label costs. Our method does not rely on image retrieval sets but rather assigns a global likelihood estimate to each label, which is plugged into the overall energy function. We evaluate our system on two large-scale datasets, SIFTflow and LMSun. We achieve state-of-the-art performance on the SIFTflow dataset and near-record results on LMSun.
    • "It is a very popular high-level vision task with a large number of methods proposed [23] [10] [3] [11] [18]. We follow the footsteps of most previous works on image labeling and choose the standard MSRC-21 [22] dataset for the evaluation. MSRC-21 consists of 591 images of 21 semantic categories. "
    ABSTRACT: Despite the great advances made on image super-resolution (ISR) during the last years, the performance has solely been evaluated perceptually. Thus, it is still unclear how useful ISR is to other vision tasks in practice. In this paper, we present the first comprehensive study and analysis of the usefulness of ISR for other vision applications. In particular, five ISR methods are evaluated on four popular vision tasks, namely edge detection, semantic image labeling, digit recognition, and face detection. We show that applying ISR to input images of other vision systems does improve the performance when the input images are of low-resolution. This is because the features and algorithms of current vision systems are designed and optimized for images of normal resolution. We also demonstrate that the standard perceptual evaluation criteria, such as PSNR and SSIM, correlate quite well with the usefulness of ISR methods to other vision tasks, but cannot measure it very accurately. We hope this work will inspire the community to evaluate ISR methods also in real vision applications, and to deploy ISR as a preprocessing component for systems of other vision tasks if the input data are of relatively low-resolution.
    • "Only for pedestrian detection [6] are objects often annotated amodally (with both visible and amodal bounding boxes). We note that our proposed annotation scheme subsumes modal segmentation [2], semantic segmentation [26], edge detection [2], figure-ground edge labeling [12], and object detection [8]. Specifically, we can reduce our semantic amodal segmentations to annotations suitable for each of these tasks (although our semantic labels come from an unrestricted vocabulary). "
    ABSTRACT: Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition? We offer one possible answer to this question. We propose a detailed image annotation that captures information beyond the visible pixels and requires complex reasoning about full scene structure. Specifically, we create an amodal segmentation of each image: the full extent of each region is marked, not just the visible pixels. Annotators outline and name all salient regions in the image and specify a partial depth order. The result is a rich scene structure, including visible and occluded portions of each region, figure-ground edge information, semantic labels, and object overlap. To date, we have labeled 500 images in the BSDS dataset with at least five annotators per image. Critically, the resulting full scene annotation is surprisingly consistent between annotators. For example, for edge detection our annotations have substantially higher human consistency than the original BSDS edges while providing a greater challenge for existing algorithms. We are currently annotating ~5000 images from the MS COCO dataset.