Conference Paper

Panoptic Image Annotation with a Collaborative Assistant

... This is typically done within labeling software by manually selecting the borders of local regions within the image and assigning a class to these regions. Some semi-automated "assisted methods" have been developed to aid in semantic labeling of images (e.g., Uijlings et al., 2020; the "magic wand" tool in BIIGLE, Langenkämper et al., 2017). ...
... The first is the time and cost required to manually generate labels. Whole image classification by humans can be reasonably fast (e.g., 5 seconds per image, Villon et al., 2018); instance segmentation tends to be slower (e.g., 13.5 sec per image, Ditria et al., 2020); and more elaborate labeling such as panoptic is slower still (e.g., up to 20 minutes per image, Uijlings et al., 2020). How much time and money might it cost to create a labeled dataset? ...
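Using the per-image times quoted above, here is a quick back-of-the-envelope sketch of what labeling a dataset might cost; the dataset size and hourly rate are illustrative assumptions, not figures from the cited works.

```python
# Back-of-the-envelope labeling effort, using the per-image times quoted
# above (5 s classification, 13.5 s instance segmentation, up to 20 min
# panoptic). Dataset size and hourly rate are illustrative assumptions.
SECONDS_PER_IMAGE = {
    "classification": 5.0,           # Villon et al. 2018
    "instance_segmentation": 13.5,   # Ditria et al. 2020
    "panoptic": 20 * 60.0,           # Uijlings et al. 2020 (upper bound)
}

N_IMAGES = 10_000       # hypothetical dataset size
HOURLY_RATE = 15.0      # hypothetical annotator cost in USD/hour

for task, sec in SECONDS_PER_IMAGE.items():
    hours = N_IMAGES * sec / 3600.0
    print(f"{task:>22}: {hours:8.1f} h  (~${hours * HOURLY_RATE:,.0f})")
```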
Article
Full-text available
Image-based machine learning methods are becoming among the most widely-used forms of data analysis across science, technology, engineering, and industry. These methods are powerful because they can rapidly and automatically extract rich contextual and spatial information from images, a process that has historically required a large amount of human labor. A wide range of recent scientific applications have demonstrated the potential of these methods to change how researchers study the ocean. However, despite their promise, machine learning tools are still under-exploited in many domains including species and environmental monitoring, biodiversity surveys, fisheries abundance and size estimation, rare event and species detection, the study of animal behavior, and citizen science. Our objective in this article is to provide an approachable, end-to-end guide to help researchers apply image-based machine learning methods effectively to their own research problems. Using a case study, we describe how to prepare data, train and deploy models, and overcome common issues that can cause models to underperform. Importantly, we discuss how to diagnose problems that can cause poor model performance on new imagery to build robust tools that can vastly accelerate data acquisition in the marine realm. Code to perform analyses is provided at https://github.com/heinsense2/AIO_CaseStudy.
... Aggregating Human Inputs Many works, particularly in the crowdsourcing domain, use multiple human inputs to increase accuracy. Though some works (Branson et al. 2010; Russakovsky, Li, and Li 2015) allow the model to choose when to terminate, the most common approach is to allow the human operator to review the model's output directly and provide new information until the result is satisfactory (Gouravajhala et al. 2018; Choi et al. 2019; Agustsson, Uijlings, and Ferrari 2019; Uijlings, Andriluka, and Ferrari 2020). These approaches are sufficient for dataset collection: performing tasks such as answering questions about given bounding boxes (Russakovsky, Li, and Li 2015) or confirming answers (Uijlings et al. 2018) is faster and more accurate than generating the dataset by drawing a bounding box directly on the image. ...
... Evaluation Methods Most works related to deferred inference set deferral conditions prior to their experiments and report the final error. When user burden is specified, it is most often a point metric corresponding to the reported accuracy, such as time-per-annotation (Agustsson, Uijlings, and Ferrari 2019; Uijlings, Andriluka, and Ferrari 2020) or number of annotations per target (Ipeirotis, Provost, and Wang 2010; Hatori et al. 2018). The value of these point measurements is questionable: when a deferral method is evaluated at different thresholds, the best method often changes (Lemmer, Song, and Corso 2021). ...
Article
Many AI systems integrate sensor inputs, world knowledge, and human-provided information to perform inference. While such systems often treat the human input as flawless, humans are better thought of as hazy oracles whose input may be ambiguous or outside of the AI system's understanding. In such situations it makes sense for the AI system to defer its inference while it disambiguates the human-provided information by, for example, asking the human to rephrase the query. Though this approach has been considered in the past, current work is typically limited to application-specific methods and non-standardized human experiments. We instead introduce and formalize a general notion of deferred inference. Using this formulation, we then propose a novel evaluation centered around the Deferred Error Volume (DEV) metric, which explicitly considers the tradeoff between error reduction and the additional human effort required to achieve it. We demonstrate this new formalization and an innovative deferred inference method on the disparate tasks of Single-Target Video Object Tracking and Referring Expression Comprehension, ultimately reducing error by up to 48% without any change to the underlying model or its parameters.
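The abstract frames evaluation as a tradeoff between residual error and the human effort spent resolving deferrals. Below is a minimal sketch of an area-under-the-tradeoff-curve summary in that spirit; it is a simplified stand-in, not necessarily the paper's exact Deferred Error Volume definition.

```python
def error_effort_area(efforts, errors):
    """Trapezoidal area under an error-vs-effort curve.

    efforts: increasing amounts of human effort (e.g. fraction of queries
    deferred to the user); errors: error rate observed at each effort level.
    A smaller area means human effort is converted into error reduction
    more efficiently. Simplified stand-in for the paper's DEV metric.
    """
    area = 0.0
    for i in range(1, len(efforts)):
        area += 0.5 * (errors[i] + errors[i - 1]) * (efforts[i] - efforts[i - 1])
    return area

# Illustrative numbers: error drops as more queries are deferred to a human.
print(error_effort_area([0.0, 0.1, 0.2, 0.5], [0.30, 0.22, 0.18, 0.15]))
```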
... However, current data annotation still relies mainly on manual work, which is time-consuming and expensive. For example, it takes about 26,000 hours to annotate the Microsoft Common Objects in Context (MSCOCO) dataset [31] through crowd-sourcing on Mechanical Turk [43] and efficient annotation tools [11,17,21,45,52]. For each object, it takes about 1 s [10,41] to determine the class tag and 35 s [43] to draw a bounding box. ...
... In addition, several scholars have developed efficient image annotation tools [11,17,21,45,52] to speed up manual bounding box annotation. Anjum et al. [5] explore the use of deep learning in crowdsourcing methods to improve the quality and efficiency of image annotation. ...
Article
Full-text available
Object annotation is essential for computer vision tasks, and more high-quality annotated data can effectively improve the performance of vision models. However, manual annotation is time-consuming (annotating a box takes 35 s). Recent studies have explored faster automated annotation, among which weakly supervised methods stand out. Weakly supervised methods learn to automatically localize objects in images from weakly labeled annotations, e.g., class tags or points, replacing manual bounding box annotations. Although using a single weakly labeled annotation can save a large amount of time, it leads to poor annotation quality, particularly for complex scenes containing multiple objects. To balance annotation time and quality, we propose a weakly semi-supervised automated annotation method. Its main idea is to incorporate point-labeled and fully labeled annotations into a teacher-student framework for training, to jointly localize the object bounding boxes on all point-labeled images. We also propose two effective techniques within this framework to make better use of these mixed annotations. The first is a point-guided sample assignment technique which optimizes the loss calculation. The second is a pseudo-label filtering technique which generates accurate pseudo labels for model training by utilizing the point and box localization confidences. Extensive experiments on MSCOCO demonstrate that our method outperforms existing automated annotation methods. In particular, when using 95% point-labeled and 5% fully labeled data, our approach reduces annotation time by approximately 52% and achieves an annotation quality of 87.4% mIoU.
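A hedged sketch of the kind of confidence-based pseudo-label filtering the abstract describes, combining a point-localization confidence with a box confidence; the data structure, the geometric-mean scoring rule, and the threshold are illustrative choices, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class PseudoBox:
    box: tuple          # (x1, y1, x2, y2) predicted by the teacher
    box_score: float    # teacher's box localization confidence
    point_score: float  # confidence that the annotated point falls in the box

def filter_pseudo_labels(candidates, score_thresh=0.7):
    """Keep pseudo boxes whose combined confidence is high enough.

    Combining the two confidences by their geometric mean is an
    illustrative choice; the paper's filtering rule may differ.
    """
    kept = []
    for c in candidates:
        combined = (c.box_score * c.point_score) ** 0.5
        if combined >= score_thresh:
            kept.append(c)
    return kept

preds = [PseudoBox((10, 10, 50, 80), 0.9, 0.8), PseudoBox((5, 5, 20, 20), 0.4, 0.9)]
print(len(filter_pseudo_labels(preds)))  # -> 1
```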
... This is typically done within labeling software by manually selecting the borders of local regions within the image and assigning a class to these regions. Some semi-automated "assisted methods" have been developed to aid in semantic labeling of images (e.g., Uijlings et al. 2020). ...
... The first is the time and cost required to manually generate labels. Whole image classification by humans can be reasonably fast (e.g., 5 seconds per image, Villon et al. 2018); instance segmentation tends to be slower (e.g., 13.5 sec per image, Ditria et al. 2020); and more elaborate label types like panoptic labeling are slower still (e.g., up to 20 minutes per image, Uijlings et al. 2020). How much time might it cost to create a labeled dataset? ...
Preprint
Full-text available
Image-based machine learning methods are quickly becoming among the most widely-used forms of data analysis across science, technology, and engineering. These methods are powerful because they can rapidly and automatically extract rich contextual and spatial information from images, a process that has historically required a large amount of manual labor. The potential of image-based machine learning methods to change how researchers study the ocean has been demonstrated through a diverse range of recent applications. However, despite their promise, machine learning tools are still under-exploited in many domains including species and environmental monitoring, biodiversity surveys, fisheries abundance and size estimation, rare event and species detection, the study of wild animal behavior, and citizen science. Our objective in this article is to provide an approachable, application-oriented guide to help researchers apply image-based machine learning methods effectively to their own research problems. Using a case study, we describe how to prepare data, train and deploy models, and avoid common pitfalls that can cause models to underperform. Importantly, we discuss how to diagnose problems that can cause poor model performance on new imagery to build robust tools that can vastly accelerate data acquisition in the marine realm. Code to perform our analyses is provided at https://github.com/heinsense2/AIO_CaseStudy .
... Panoptic segmentation can also be employed to perform dataset annotation [145,146]. For example, in [147], panoptic segmentation is used to support image annotation: a human collaborator and an automated assistant (based on panoptic segmentation) work together to annotate the dataset. The human annotator's actions serve as contextual signals to which the intelligent assistant reacts, annotating other parts of the image. ...
... Additionally, PQ^Th, SQ^Th, RQ^Th (averaged over "thing" categories) and PQ^St, SQ^St, RQ^St (averaged over "stuff" categories) are reported to reflect the improvement on instance and semantic segmentation, respectively. Finally, it is worth mentioning that the metrics described above were first introduced by Kirillov et al. [23] and have been adopted by other works as a common ground for comparison ever since, such as [82,69,147,83]. Indeed, to evaluate the panoptic segmentation performance of such frameworks on medical histopathology and fluorescence microscopy images, AJI, object-level F1 score (F1), and PQ have been exploited. ...
Preprint
Full-text available
Image segmentation for video analysis plays an essential role in different research fields such as smart city, healthcare, computer vision and geoscience, and remote sensing applications. In this regard, a significant effort has been devoted recently to developing novel segmentation strategies; one of the latest outstanding achievements is panoptic segmentation. The latter has resulted from the fusion of semantic and instance segmentation. Explicitly, panoptic segmentation is currently under study to help gain a more nuanced knowledge of the image scenes for video surveillance, crowd counting, self-autonomous driving, medical image analysis, and a deeper understanding of the scenes in general. To that end, we present in this paper the first comprehensive review of existing panoptic segmentation methods to the best of the authors' knowledge. Accordingly, a well-defined taxonomy of existing panoptic techniques is performed based on the nature of the adopted algorithms, application scenarios, and primary objectives. Moreover, the use of panoptic segmentation for annotating new datasets by pseudo-labeling is discussed. Moving on, ablation studies are carried out to understand the panoptic methods from different perspectives. Moreover, evaluation metrics suitable for panoptic segmentation are discussed, and a comparison of the performance of existing solutions is provided to inform the state-of-the-art and identify their limitations and strengths. Lastly, the current challenges the subject technology faces and the future trends attracting considerable interest in the near future are elaborated, which can be a starting point for the upcoming research studies. The papers provided with code are available at: https://github.com/elharroussomar/Awesome-Panoptic-Segmentation
Article
Full-text available
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A generic network design called Cascade Segmentation Module is then proposed to enable the segmentation networks to parse a scene into stuff, objects, and object parts in a cascade. We evaluate the proposed module integrated within two existing semantic segmentation networks, yielding significant improvements for scene parsing. We further show that the scene parsing networks trained on ADE20K can be applied to a wide variety of scenes and objects.
Conference Paper
Full-text available
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Article
Full-text available
We propose a unified approach for bottom-up hierarchical image segmentation and object proposal generation for recognition, called Multiscale Combinatorial Grouping (MCG). For this purpose, we first develop a fast normalized cuts algorithm. We then propose a high-performance hierarchical segmenter that makes effective use of multiscale information. Finally, we propose a grouping strategy that combines our multiscale regions into highly-accurate object proposals by exploring efficiently their combinatorial space. We also present Single-scale Combinatorial Grouping (SCG), a faster version of MCG that produces competitive proposals in under five seconds per image. We conduct an extensive and comprehensive empirical validation on the BSDS500, SegVOC12, SBD, and COCO datasets, showing that MCG produces state-of-the-art contours, hierarchical regions, and object proposals.
Conference Paper
Full-text available
Traditional active learning allows a (machine) learner to query the (human) teacher for labels on examples it finds confusing. The teacher then provides a label for only that instance. This is quite restrictive. In this paper, we propose a learning paradigm in which the learner communicates its belief (i.e. predicted label) about the actively chosen example to the teacher. The teacher then confirms or rejects the predicted label. More importantly, if rejected, the teacher communicates an explanation for why the learner's belief was wrong. This explanation allows the learner to propagate the feedback provided by the teacher to many unlabeled images. This allows a classifier to better learn from its mistakes, leading to accelerated discriminative learning of visual concepts even with few labeled images. In order for such communication to be feasible, it is crucial to have a language that both the human supervisor and the machine learner understand. Attributes provide precisely this channel. They are human-interpretable mid-level visual concepts shareable across categories e.g. "furry", "spacious", etc. We advocate the use of attributes for a supervisor to provide feedback to a classifier and directly communicate his knowledge of the world. We employ a straightforward approach to incorporate this feedback in the classifier, and demonstrate its power on a variety of visual recognition scenarios such as image classification and annotation. This application of attributes for providing classifiers feedback is very powerful, and has not been explored in the community. It introduces a new mode of supervision, and opens up several avenues for future research.
Article
Full-text available
This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99 % recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).
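Selective Search is available through OpenCV's contrib modules, which makes the proposal-generation step easy to try; a minimal usage sketch, assuming `opencv-contrib-python` is installed and an `image.jpg` file exists:

```python
import cv2

# Selective Search as shipped in OpenCV's ximgproc contrib module.
img = cv2.imread("image.jpg")                      # hypothetical input file
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()                   # or switchToSelectiveSearchQuality()
rects = ss.process()                               # (x, y, w, h) proposals
print(f"{len(rects)} class-independent object proposals")
```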
Conference Paper
Full-text available
In this paper we introduce a new shape constraint for interactive image segmentation. It is an extension of Veksler's (25) star-convexity prior, in two ways: from a single star to multiple stars and from Euclidean rays to geodesic paths. Global minima of the energy function are obtained subject to these new constraints. We also introduce Geodesic Forests, which exploit the structure of shortest paths in implementing the extended constraints. The star-convexity prior is used here in an interactive setting and this is demonstrated in a practical system. The system is evaluated by means of a "robot user" to measure the amount of interaction required in a precise way. We also introduce a new and harder dataset which augments the existing GrabCut dataset (1) with images and ground truth taken from the PASCAL VOC segmentation challenge (7).
Conference Paper
Full-text available
We present an interactive, hybrid human-computer method for object classification. The method applies to classes of objects that are recognizable by people with appropriate expertise (e.g., animal species or airplane model), but not (in general) by people without such expertise. It can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate our methods on Birds-200, a difficult dataset of 200 tightly-related bird species, and on the Animals With Attributes dataset. Our results demonstrate that incorporating user input drives up recognition accuracy to levels that are good enough for practical applications, while at the same time, computer vision reduces the amount of human interaction required.
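The heart of such a "visual 20 questions" loop is choosing the attribute question that most reduces uncertainty over classes. Below is a minimal sketch of greedy question selection by expected information gain, assuming a known table of answer likelihoods per class; the paper's handling of imperfect user responses and vision scores is more involved.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_question(class_posterior, p_yes_given_class):
    """Pick the attribute whose answer is expected to shrink class entropy most.

    class_posterior: (C,) current belief over classes.
    p_yes_given_class: (A, C) probability each attribute answer is 'yes' per class.
    """
    h_now = entropy(class_posterior)
    gains = []
    for a in range(p_yes_given_class.shape[0]):
        p_yes = p_yes_given_class[a] @ class_posterior
        post_yes = p_yes_given_class[a] * class_posterior
        post_no = (1 - p_yes_given_class[a]) * class_posterior
        post_yes /= post_yes.sum()
        post_no /= post_no.sum()
        h_after = p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)
        gains.append(h_now - h_after)
    return int(np.argmax(gains))

posterior = np.array([0.5, 0.3, 0.2])
p_yes = np.array([[0.9, 0.1, 0.1],   # attribute 0 separates class 0 well
                  [0.5, 0.5, 0.5]])  # attribute 1 is uninformative
print(best_question(posterior, p_yes))  # -> 0
```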
Conference Paper
Full-text available
This paper presents a framework for image parsing with multiple label sets. For example, we may want to simul- taneously label every image region according to its basic- level object category (car, building, road, tree, etc.), super- ordinate category (animal, vehicle, manmade object, nat- ural object, etc.), geometric orientation (horizontal, verti- cal, etc.), and material (metal, glass, wood, etc.). Some ob- ject regions may also be given part names (a car can have wheels, doors, windshield, etc.). We compute co-occurrence statistics between different label types of the same region to capture relationships such as "roads are horizontal," "cars are made of metal," "cars have wheels" but "horses have legs," and so on. By incorporating these constraints into a Markov Random Field inference framework and jointly solving for all the label sets, we are able to improve the classification accuracy for all the label sets at once, achiev- ing a richer form of image understanding.
Conference Paper
Full-text available
In the task of visual object categorization, semantic con- text can play the very important role of reducing ambigu- ity in objects' visual appearance. In this work we propose to incorporate semantic object context as a post-processing step into any off-the-shelf object categorization model. Us- ing a conditional random field (CRF) framework, our ap- proach maximizes object label agreement according to con- textual relevance. We compare two sources of context: one learned from training data and another queried from Google Sets. The overall performance of the proposed framework is evaluated on the PASCAL and MSRC datasets. Our findings conclude that incorporating context into object categorization greatly improves categorization accuracy.
Article
Full-text available
We present an algorithm for Interactive Co-segmentation of a foreground object from a group of related images. While previous works in co-segmentation have focussed on unsupervised co-segmentation, we use successful ideas from the interactive object-cutout literature. We develop an algorithm that allows users to decide what foreground is, and then guide the output of the co-segmentation algorithm towards it via scribbles. Interestingly, keeping a user in the loop leads to simpler and highly parallelizable energy functions, allowing us to work with significantly more images per group. However, unlike the interactive single-image counterpart, a user cannot be expected to exhaustively examine all cutouts (from tens of images) returned by the system to make corrections. Hence, we propose iCoseg, an automatic recommendation system that intelligently recommends where the user should scribble next. We introduce and make publicly available the largest co-segmentation dataset yet, the CMU-Cornell iCoseg dataset, with 38 groups, 643 images, and pixelwise hand-annotated groundtruth. Through machine experiments and real user studies with our developed interface, we show that iCoseg can intelligently recommend regions to scribble on, and users following these recommendations can achieve good quality cutouts with significantly lower time and effort than exhaustively examining all cutouts.
Article
Full-text available
The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
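The iterative graph-cut segmentation described here is exposed in OpenCV as `cv2.grabCut`; a minimal usage sketch with a user-supplied rectangle (the file name and rectangle coordinates are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg")                       # hypothetical input image
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)           # internal colour-model state
fgd_model = np.zeros((1, 65), np.float64)
rect = (50, 50, 300, 400)                           # user-drawn box around the object

# Five iterations of the iterative graph-cut optimisation described above.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked (probable) foreground form the cut-out.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
cutout = img * fg[:, :, None]
```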
Article
Full-text available
This paper presents a new, unified technique to perform general edge-sensitive editing operations on n-dimensional images and videos efficiently. The contribution of the paper is two-fold. First, a novel unified framework is introduced which addresses several, edge-aware editing operations efficiently. Diverse editing tasks such as: segmentation, de-noising, non-photorealistic rendering, colorization and panoramic stitching are all dealt with fundamentally the same, fast algorithm. Second, a new, geodesic, symmetric filter (GSF) is presented which imposes contrast-sensitive spatial smoothness onto the output data. The effect of the filter is controlled by two intuitive, geometric parameters. In contrast to existing techniques, the GSF filter is applied to real-valued pixel likelihoods and thus it can be used for both interactive and automatic editing tasks. Complex object topologies are dealt with effortlessly. Finally, the algorithm's parallelism enables us to exploit modern multi-core CPU architectures as well as powerful new GPUs, thus providing great flexibility of implementation and deployment. Our technique operates on both images and videos, and generalizes naturally to n-dimensional data. The proposed algorithm is validated via rigorous quantitative and qualitative comparisons with existing, state of the art approaches. Numerous results on a variety of image and video editing tasks further demonstrate the effectiveness of our method.
Article
In this paper, we propose a novel fully convolutional two-stream fusion network (FCTSFN) for interactive image segmentation. The proposed network includes two sub-networks: a two-stream late fusion network (TSLFN) that predicts the foreground at a reduced resolution, and a multi-scale refining network (MSRN) that refines the foreground at full resolution. The TSLFN includes two distinct deep streams followed by a fusion network. The intuition is that, since user interactions are more direct information on foreground/background than the image itself, the two-stream structure of the TSLFN reduces the number of layers between the pure user interaction features and the network output, allowing the user interactions to have a more direct impact on the segmentation result. The MSRN fuses the features from different layers of TSLFN with different scales, in order to seek the local to global information on the foreground to refine the segmentation result at full resolution. We conduct comprehensive experiments on four benchmark datasets. The results show that the proposed network achieves competitive performance compared to current state-of-the-art interactive image segmentation methods.
Conference Paper
We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid Annotation is based on three principles: (I) Strong machine-learning aid. We start from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new regions to cover missing objects, and removing incorrect regions. The edit operations are also assisted by the model. (II) Full image annotation in a single pass. As opposed to performing a series of small annotation tasks in isolation [51,68], we propose a unified interface for full image annotation in a single pass. (III) Empower the annotator. We empower the annotator to choose what to annotate and in which order. This enables concentrating on what the machine does not already know, i.e. putting human effort only on the errors it made. This helps use the annotation budget effectively. Through extensive experiments on the COCO+Stuff dataset [11,51], we demonstrate that Fluid Annotation leads to accurate annotations very efficiently, taking 3x less annotation time than the popular LabelMe interface [70].
Article
We propose and study a novel 'Panoptic Segmentation' (PS) task. Panoptic segmentation unifies the traditionally distinct tasks of instance segmentation (detect and segment each object instance) and semantic segmentation (assign a class label to each pixel). The unification is natural and presents novel algorithmic challenges not present in either instance or semantic segmentation when studied in isolation. To measure performance on the task, we introduce a panoptic quality (PQ) measure, and show that it is simple and interpretable. Using PQ, we study human performance on three existing datasets that have the necessary annotations for PS, which helps us better understand the task and metric. We also propose a basic algorithmic approach to combine instance and semantic segmentation outputs into panoptic outputs and compare this to human performance. PS can serve as foundation of future challenges in segmentation and visual recognition. Our goal is to drive research in novel directions by inviting the community to explore the proposed panoptic segmentation task.
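For reference, PQ factors into a segmentation-quality term and a recognition-quality term over segments matched at IoU > 0.5; a minimal sketch of the computation from matched pairs:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ from the IoUs of matched (prediction, ground-truth) segment pairs.

    matched_ious: IoU values of true-positive pairs (matching at IoU > 0.5
    guarantees the matching is unique). PQ = SQ * RQ, where SQ averages IoU
    over true positives and RQ is an F1-style recognition term.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.9, 0.8, 0.75], num_fp=1, num_fn=2)
print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
```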
Article
We introduce Intelligent Annotation Dialogs for bounding box annotation. We train an agent to automatically choose a sequence of actions for a human annotator to produce a bounding box in a minimal amount of time. Specifically, we consider two actions: box verification [37], where the annotator verifies a box generated by an object detector, and manual box drawing. We explore two kinds of agents, one based on predicting the probability that a box will be positively verified, and the other based on reinforcement learning. We demonstrate that (1) our agents are able to learn efficient annotation strategies in several scenarios, automatically adapting to the difficulty of an input image, the desired quality of the boxes, the strength of the detector, and other factors; (2) in all scenarios the resulting annotation dialogs speed up annotation compared to manual box drawing alone and box verification alone, while also outperforming any fixed combination of verification and drawing in most scenarios; (3) in a realistic scenario where the detector is iteratively re-trained, our agents evolve a series of strategies that reflect the shifting trade-off between verification and drawing as the detector grows stronger.
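A simplified expected-time view of the verify-versus-draw choice such an agent faces (the times and the one-step decision rule are illustrative placeholders; the paper's agents predict verification success or learn the policy by reinforcement learning):

```python
def expected_annotation_time(p_accept, t_verify=1.6, t_draw=35.0):
    """Expected seconds per box if the annotator first verifies a detector's
    box and falls back to manual drawing on rejection.

    Default times are illustrative placeholders, not the paper's numbers.
    """
    return t_verify + (1.0 - p_accept) * t_draw

def choose_action(p_accept, t_draw=35.0):
    """Verify when its expected time beats drawing from scratch."""
    return "verify" if expected_annotation_time(p_accept) < t_draw else "draw"

print(choose_action(0.9))   # strong detector box -> verify
print(choose_action(0.02))  # hopeless box -> draw manually
```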
Conference Paper
In this work, we present an adaptation of the sequence-to-sequence model for structured vision tasks. In this model, the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each step. We explore the impact of weight sharing with a recurrent connection matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed sized structure, where different classes of output are predicted at different steps. We show that chain models achieve top performing results on human pose estimation from images and videos.
Article
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
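A minimal PyTorch sketch of atrous (dilated) convolution arranged as an ASPP-style block in the spirit of the abstract; the channel widths and dilation rates are illustrative, and the published DeepLab variants differ in detail.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel atrous convolutions at several dilation rates, concatenated.

    Simplified sketch of atrous spatial pyramid pooling; the published
    DeepLab variants add image-level pooling and other refinements.
    """
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # same spatial size
        return self.project(torch.cat(feats, dim=1))

x = torch.randn(1, 256, 33, 33)
print(ASPP()(x).shape)  # torch.Size([1, 256, 33, 33])
```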
Article
In this paper, we present an adaptation of the sequence-to-sequence model for structured output prediction in vision tasks. In this model the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each time step. We explore the impact of weight sharing with a recurrent connection matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed sized structure, where different classes of output are predicted in different steps. We show that chained predictions achieve top performing results on human pose estimation from single images and videos.
Article
A large number of images with ground truth object bounding boxes are critical for learning object detectors, which is a fundamental task in computer vision. In this paper, we study strategies to crowd-source bounding box annotations. The core challenge of building such a system is to effectively control the data quality with minimal cost. Our key observation is that drawing a bounding box is significantly more difficult and time consuming than giving answers to multiple choice questions. Thus quality control through additional verification tasks is more cost effective than consensus-based algorithms. In particular, we present a system that consists of three simple sub-tasks - a drawing task, a quality verification task and a coverage verification task. Experimental results demonstrate that our system is scalable, accurate, and cost-effective.
Conference Paper
We introduce a general framework for quickly annotating an image dataset when previous annotations exist. The new annotations (e.g. part locations) may be quite different from the old annotations (e.g. segmentations). Human annotators may be thought of as helping translate the old annotations into the new ones. As annotators label images, our algorithm incrementally learns a translator from source to target labels as well as a computer-vision-based structured predictor. These two components are combined to form an improved prediction system which accelerates the annotators' work through a smart GUI. We show how the method can be applied to translate between a wide variety of annotation types, including bounding boxes, segmentations, 2D and 3D part-based systems, and class and attribute labels. The proposed system will be a useful tool toward exploring new types of representations beyond simple bounding boxes, object segmentations, and class labels, and toward finding new ways to exploit existing large datasets with traditional types of annotations like SUN [36], ImageNet [11], and Pascal VOC [12]. Experiments on the CUB-200-2011 and H3D datasets demonstrate 1) our method accelerates collection of part annotations by a factor of 3-20 compared to manual labeling, 2) our system can be used effectively in a scheme where definitions of part, attribute, or action vocabularies are evolved interactively without relabeling the entire dataset, and 3) toward collecting pose annotations, segmentations are more useful than bounding boxes, and part-level annotations are more effective than segmentations.
Article
Figure-ground segmentation from bounding box input, provided either automatically or manually, has been extremely popular in the last decade and influenced various applications. A lot of research has focused on high-quality segmentation, using complex formulations which often lead to slow techniques, and often hamper practical usage. In this paper we demonstrate a very fast segmentation technique which still achieves very high quality results. We propose to replace the time-consuming iterative refinement of global colour models in the traditional GrabCut formulation by a densely connected CRF. To motivate this decision, we show that a dense CRF implicitly models unnormalized global colour models for foreground and background. Such a relationship provides insightful analysis to bridge between the dense CRF and the GrabCut functional. We extensively evaluate our algorithm using two famous benchmarks. Our experimental results demonstrate that the proposed algorithm achieves an order of magnitude (10×) speed-up with respect to the closest competitor, and at the same time achieves considerably higher accuracy.
Conference Paper
As the use of videos is becoming more popular in computer vision, the need for annotated video datasets increases. Such datasets are required either as training data or simply as ground truth for benchmark datasets. A particular challenge in video segmentation is due to disocclusions, which hamper frame-to-frame propagation, in conjunction with non-moving objects. We show that a combination of motion from point trajectories, as known from motion segmentation, along with minimal supervision can largely help solve this problem. Moreover, we integrate a new constraint that enforces consistency of the color distribution in successive frames. We quantify user interaction effort with respect to segmentation quality on challenging ego motion videos. We compare our approach to a diverse set of algorithms in terms of user effort and in terms of performance on common video segmentation benchmarks.
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has modest memory requirements, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
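The update rule is compact enough to restate; a minimal NumPy sketch using the commonly cited default hyper-parameters:

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `state` holds the running moments and step count."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # 1st moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2     # 2nd moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

theta = np.array([1.0, -2.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
for _ in range(100):
    grad = 2 * theta          # gradient of f(theta) = ||theta||^2
    theta = adam_step(theta, grad, state)
print(theta)                  # moves toward the minimum at the origin
```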
Conference Paper
Active learning provides useful tools to reduce annotation costs without compromising classifier performance. However it traditionally views the supervisor simply as a labeling machine. Recently a new interactive learning paradigm was introduced that allows the supervisor to additionally convey useful domain knowledge using attributes. The learner first conveys its belief about an actively chosen image e.g. "I think this is a forest, what do you think?". If the learner is wrong, the supervisor provides an explanation e.g. "No, this is too open to be a forest". With access to a pre-trained set of relative attribute predictors, the learner fetches all unlabeled images more open than the query image, and uses them as negative examples of forests to update its classifier. This rich human-machine communication leads to better classification performance. In this work, we propose three improvements over this set-up. First, we incorporate a weighting scheme that instead of making a hard decision reasons about the likelihood of an image being a negative example. Second, we do away with pre-trained attributes and instead learn the attribute models on the fly, alleviating overhead and restrictions of a pre-determined attribute vocabulary. Finally, we propose an active learning framework that accounts for not just the label-based but also the attribute-based feedback while selecting the next query image. We demonstrate significant improvement in classification accuracy on faces and shoes. We also collect and make available the largest relative attributes dataset containing 29 attributes of faces from 60 categories.
Article
In this paper we describe a new technique for general purpose interactive segmentation of N-dimensional images. The user marks certain pixels as "object" or "background" to provide hard constraints for segmentation. Additional soft constraints incorporate both boundary and region information. Graph cuts are used to find the globally optimal segmentation of the N-dimensional image. The obtained solution gives the best balance of boundary and region properties among all segmentations satisfying the constraints. The topology of our segmentation is unrestricted and both "object" and "background" segments may consist of several isolated parts. Some experimental results are presented in the context of photo/video editing and medical image segmentation. We also demonstrate an interesting Gestalt example. A fast implementation of our segmentation method is possible via a new max-flow algorithm in (2).
Article
Active learning strategies can be useful when manual labeling effort is scarce, as they select the most informative examples to be annotated first. However, for visual category learning, the active selection problem is particularly complex: a single image will typically contain multiple object labels, and an annotator could provide multiple types of annotation (e.g., class labels, bounding boxes, segmentations), any of which would incur a variable amount of manual effort. We present an active learning framework that predicts the tradeoff between the effort and information gain associated with a candidate image annotation, thereby ranking unlabeled and partially labeled images according to their expected "net worth" to an object recognition system. We develop a multi-label multiple-instance approach that accommodates multi-object images and a mixture of strong and weak labels. Since the annotation cost can vary depending on an image's complexity, we show how to improve the active selection by directly predicting the time required to segment an unlabeled image. Given a small initial pool of labeled data, the proposed method actively improves the category models with minimal manual intervention.
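A minimal sketch of the "net worth" style ranking the abstract describes: score each candidate annotation by predicted information gain per unit of predicted annotator effort. The dictionary fields and the ratio-based score are illustrative, not the paper's learned model.

```python
def rank_candidates(candidates):
    """Sort candidate annotations by expected information gain per second
    of predicted annotator effort.

    Each candidate is a dict with 'info_gain' (model's expected benefit)
    and 'effort_sec' (predicted time to provide the annotation). Dividing
    the two is a simple proxy for the paper's learned tradeoff.
    """
    return sorted(candidates,
                  key=lambda c: c["info_gain"] / c["effort_sec"],
                  reverse=True)

candidates = [
    {"id": "img1-segmentation", "info_gain": 0.9, "effort_sec": 50.0},
    {"id": "img2-class-label",  "info_gain": 0.3, "effort_sec": 2.0},
]
print([c["id"] for c in rank_candidates(candidates)])  # cheap class label first
```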
Conference Paper
The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors.
Article
An interactive framework for soft segmentation and matting of natural images and videos is presented in this paper. The proposed technique is based on the optimal, linear time, computation of weighted geodesic distances to the user-provided scribbles, from which the whole data is automatically segmented. The weights are based on spatial and/or temporal gradients, without explicit optical flow or any advanced and often computationally expensive feature detectors. These could be naturally added to the proposed framework as well if desired, in the form of weights in the geodesic distances. A localized refinement step follows this fast segmentation in order to accurately compute the corresponding matte function. Additional constraints into the distance definition permit to efficiently handle occlusions such as people or objects crossing each other in a video sequence. The presentation of the framework is complemented with numerous and diverse examples, including extraction of moving foreground from dynamic background, and comparisons with the recent literature.
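A minimal sketch of scribble-driven geodesic labeling in the spirit of this approach: Dijkstra over a 4-connected pixel grid with intensity-difference edge weights, assigning each pixel the class of its geodesically nearest scribble. This is a simplified illustration, not the paper's linear-time algorithm.

```python
import heapq
import numpy as np

def geodesic_labels(image, scribbles):
    """Label every pixel with the scribble class at smallest geodesic distance.

    image: (H, W) grayscale array; scribbles: (H, W) int array, 0 for
    unlabeled pixels and positive integers for user-scribbled classes.
    Edge weights grow with intensity difference, so geodesic paths avoid
    crossing strong edges.
    """
    h, w = image.shape
    dist = np.full((h, w), np.inf)
    labels = np.zeros((h, w), dtype=int)
    heap = []
    for y, x in zip(*np.nonzero(scribbles)):
        dist[y, x] = 0.0
        labels[y, x] = scribbles[y, x]
        heapq.heappush(heap, (0.0, y, x))
    while heap:                                   # standard Dijkstra expansion
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                step = 1e-3 + abs(float(image[ny, nx]) - float(image[y, x]))
                if d + step < dist[ny, nx]:
                    dist[ny, nx] = d + step
                    labels[ny, nx] = labels[y, x]
                    heapq.heappush(heap, (d + step, ny, nx))
    return labels

img = np.zeros((6, 6))
img[:, 3:] = 1.0                                  # two flat regions with an edge
scr = np.zeros((6, 6), dtype=int)
scr[0, 0] = 1                                     # scribble for class 1 (left)
scr[0, 5] = 2                                     # scribble for class 2 (right)
print(geodesic_labels(img, scr))                  # left half 1s, right half 2s
```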
Article
We present Searn, an algorithm for integrating search and learning to solve complex structured prediction problems such as those that occur in natural language, speech, computational biology, and vision. Searn is a meta-algorithm that transforms these complex problems into simple classification problems to which any binary classifier may be applied. Unlike current algorithms for structured learning that require decomposition of both the loss function and the feature functions over the predicted structure, Searn is able to learn prediction functions for any loss function and any class of features. Moreover, Searn comes with a strong, natural theoretical guarantee: good performance on the derived classification problems implies good performance on the structured prediction problem.
Article
Research in object detection and recognition in cluttered scenes requires large image collections with ground truth labels. The labels should provide information about the object classes present in each image, as well as their shape and locations, and possibly other attributes such as pose. Such data is useful for testing, as well as for supervised learning. This project provides a web-based annotation tool that makes it easy to annotate images, and to instantly share such annotations with the community. This tool, plus an initial set of 10,000 images (3000 of which have been labeled), can be found at http://www.csail.mit.edu/~brussell/research/LabelMe/intro.html