TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation

Department of Engineering, University of Cambridge
DOI: 10.1007/11744023_1


This paper proposes a new approach to learning a discriminative model of object classes, incorporating appearance, shape and
context information efficiently. The learned model is used for automatic visual recognition and semantic segmentation of photographs.
Our discriminative model exploits novel features, based on textons, which jointly model shape and texture. Unary classification
and feature selection are achieved using shared boosting, giving an efficient classifier which can be applied to a large number
of classes. Accurate image segmentation is achieved by incorporating these classifiers in a conditional random field. Efficient
training of the model on very large datasets is achieved by exploiting both random feature selection and piecewise training.

High classification and segmentation accuracy is demonstrated on three different databases: i) our own 21-object class database
of photographs of real objects viewed under general lighting conditions, poses and viewpoints, ii) the 7-class Corel subset
and iii) the 7-class Sowerby database used in [1]. The proposed algorithm gives competitive results for highly textured
(e.g. grass, trees), highly structured (e.g. cars, faces, bikes, aeroplanes) and articulated objects (e.g. body, cow).
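The texton representation underlying these features can be illustrated with a minimal sketch. The paper's actual pipeline uses a specific filter bank and learned texton-layout features; the code below is only a simplified stand-in (the filter bank, `k`, and the plain k-means loop are all illustrative assumptions, not the authors' settings): filter responses are computed per pixel, then vector-quantized so each pixel receives a discrete texton index.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def texton_map(image, sigmas=(1.0, 2.0, 4.0), k=8, iters=10, seed=0):
    """Assign a discrete texton index to every pixel of a grayscale image.

    Hypothetical simplification: a tiny Gaussian filter bank plus plain
    k-means, standing in for the paper's larger filter bank and
    clustering procedure.
    """
    # Per-pixel response vector: one channel per filter scale.
    responses = np.stack([gaussian_filter(image, s) for s in sigmas], axis=-1)
    pixels = responses.reshape(-1, responses.shape[-1])

    # Initialize cluster centers from random pixels, then run k-means.
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every pixel to every center.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = pixels[members].mean(axis=0)

    # The texton map: an integer label per pixel.
    return labels.reshape(image.shape)
```

In the paper, boosted weak classifiers are then built on counts of each texton index within rectangular regions offset from the pixel being classified, which is what lets the features capture shape and layout as well as texture.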



Available from: John M. Winn, Oct 02, 2015
    • "Only for pedestrian detection [6] are objects often annotated amodally (with both visible and amodal bounding boxes). We note that our proposed annotation scheme subsumes modal segmentation [2], semantic segmentation [26], edge detection [2], figure-ground edge labeling [12], and object detection [8]. Specifically, we can reduce our semantic amodal segmentations to annotations suitable for each of these tasks (although our semantic labels come from an unrestricted vocabulary). "
    ABSTRACT: Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition? We offer one possible answer to this question. We propose a detailed image annotation that captures information beyond the visible pixels and requires complex reasoning about full scene structure. Specifically, we create an amodal segmentation of each image: the full extent of each region is marked, not just the visible pixels. Annotators outline and name all salient regions in the image and specify a partial depth order. The result is a rich scene structure, including visible and occluded portions of each region, figure-ground edge information, semantic labels, and object overlap. To date, we have labeled 500 images in the BSDS dataset with at least five annotators per image. Critically, the resulting full scene annotation is surprisingly consistent between annotators. For example, for edge detection our annotations have substantially higher human consistency than the original BSDS edges while providing a greater challenge for existing algorithms. We are currently annotating ~5000 images from the MS COCO dataset.
    • "Early work on scene labeling focused on outdoor color imagery, and typically used CRF or MRF. The nodes of the graphical models were pixels [13], [35], superpixels [8], [28] or a hierarchy of regions [25]. Local interactions between nodes were captured by pairwise potentials, while unary potentials were used to represent image observations, via features such as SIFT [27] and HOG [5]. "
    ABSTRACT: Most existing approaches for RGB-D indoor scene labeling employ hand-crafted features for each modality independently and combine them in a heuristic manner. There has been some attempt on directly learning features from raw RGB-D data, but the performance is not satisfactory. In this paper, we propose an unsupervised joint feature learning and encoding (JFLE) framework for RGB-D scene labeling. The main novelty of our learning framework lies in the joint optimization of feature learning and feature encoding in a coherent way which significantly boosts the performance. By stacking basic learning structure, higher-level features are derived and combined with lower-level features for better representing RGB-D data. Moreover, to explore the nonlinear intrinsic characteristic of data, we further propose a more general joint deep feature learning and encoding (JDFLE) framework that introduces the nonlinear mapping into JFLE. Experimental results on the benchmark NYU depth dataset show that our approaches achieve competitive performance, compared with state-of-the-art methods, while our methods do not need complex feature handcrafting and feature combination and can be easily applied to other datasets.
    IEEE Transactions on Image Processing 08/2015; 24(11). DOI: 10.1109/TIP.2015.2465133
    • "We describe methods for using such constraints to improve semantic segmentation performance. We follow a standard CRF-based labeling approach, built on the high quality implementation of Gould et al. [13] and an augmented set of image features from [29]. We explore simple ways to enhance this set of features using GIS data and study its influence on semantic labeling accuracy. "
    ABSTRACT: Contextual information can have a substantial impact on the performance of visual tasks such as semantic segmentation, object detection, and geometric estimation. Data stored in Geographic Information Systems (GIS) offers a rich source of contextual information that has been largely untapped by computer vision. We propose to leverage such information for scene understanding by combining GIS resources with large sets of unorganized photographs using Structure from Motion (SfM) techniques. We present a pipeline to quickly generate strong 3D geometric priors from 2D GIS data using SfM models aligned with minimal user input. Given an image resectioned against this model, we generate robust predictions of depth, surface normals, and semantic labels. We show that the precision of the predicted geometry is substantially more accurate than other single-image depth estimation methods. We then demonstrate the utility of these contextual constraints for re-scoring pedestrian detections, and use these GIS contextual features alongside object detection score maps to improve a CRF-based semantic segmentation framework, boosting accuracy over baseline models.