Conference Paper

The truth about cats and dogs

Authors: Omkar M. Parkhi, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman

Abstract

Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part for the class, followed by detecting the rest of the object via segmentation on image-specific information learnt from that part. This approach is motivated by two observations: (i) many object classes contain distinctive parts that can be detected very reliably by template-based detectors, whilst the entire object cannot; (ii) many classes (e.g. animals) have fairly homogeneous coloring and texture that can be used to segment the object once a sample is provided in an image. We show quantitatively that our method substantially outperforms whole-body template-based detectors for these highly deformable object categories, and indeed achieves accuracy comparable to the state-of-the-art on the PASCAL VOC competition, which includes other models such as bag-of-words.
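To make the two-stage idea concrete, here is a minimal Python sketch, not the authors' exact system: it assumes a part detector (e.g. a DPM head detector) has already produced `head_box`, and uses OpenCV's GrabCut to propagate the part's color statistics to the rest of the animal. The band geometry around the head is an illustrative heuristic only.

```python
import cv2
import numpy as np

def segment_from_part(image_bgr, head_box):
    """image_bgr: H x W x 3 uint8 image; head_box: (x, y, w, h) from a part detector."""
    h, w = image_bgr.shape[:2]
    mask = np.full((h, w), cv2.GC_PR_BGD, np.uint8)  # default: probably background
    x, y, bw, bh = head_box
    mask[y:y + bh, x:x + bw] = cv2.GC_FGD  # the detected part is certain foreground
    # A loose band around/below the head is marked "probably foreground" so the
    # body can be recovered from the color model seeded by the head pixels.
    x0, y0 = max(0, x - bw), max(0, y - bh // 2)
    x1, y1 = min(w, x + 2 * bw), min(h, y + 4 * bh)
    band = mask[y0:y1, x0:x1]
    band[band == cv2.GC_PR_BGD] = cv2.GC_PR_FGD
    bgd = np.zeros((1, 65), np.float64)  # GrabCut's internal GMM buffers
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```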


... Xu et al. [23] also used bounding boxes as weak supervision for semantic segmentation. Vicente et al. [11] introduced the idea of using bounding boxes for cosegmentation in a supervised setting. Alternatively, segmentation cues have been used before to help detection [6], [31]. Parkhi et al. [6] use color models from predefined rectangles on cat and dog faces to perform GrabCut [44] and improve the predicted bounding box. ...
... Alternatively, segmentation cues have been used before to help detection [6], [31]. Parkhi et al. [6] use color models from predefined rectangles on cat and dog faces to perform GrabCut [44] and improve the predicted bounding box. Hariharan et al. [22] used CNNs to simultaneously detect and segment by classifying image regions. ...
... Note that the superpixel variables x and y are bounded by the bounding box variable z in Eqs. 6 and 7. If the discriminative colocalization part considers some bounding box $z_i$ to be background and sets it close to 0, this in principle signals to the cosegmentation part that superpixels in this bounding box are more likely to be background (= 0), as enforced by the right-hand side of Equation 6: $\sum_{j \in S_i} x_{ij} \le \delta |S_i| z_i$. Similarly, the segmentation cues influence the final score of the $z_i$ variable if the superpixels inside this bounding box are highly discriminative and more likely to be foreground. ...
Article
This paper presents a novel framework in which image cosegmentation and colocalization are cast into a single optimization problem that integrates information from low level appearance cues with that of high level localization cues in a very weakly supervised manner. In contrast to multi-task learning paradigm that learns similar tasks using a shared representation, the proposed framework leverages two representations at different levels and simultaneously discriminates between foreground and background at the bounding box and superpixel level using discriminative clustering. We show empirically that constraining the two problems at different scales enables the transfer of semantic localization cues to improve cosegmentation output whereas local appearance based segmentation cues help colocalization. The unified framework outperforms strong baseline approaches, of learning the two problems separately, by a large margin on four benchmark datasets. Furthermore, it obtains competitive results compared to the state of the art for cosegmentation on two benchmark datasets and second best result for colocalization on Pascal VOC 2007.
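The coupling constraint quoted in the excerpt above (Eq. 6, sum over j in S_i of x_ij bounded by delta * |S_i| * z_i) is easy to see numerically: when the colocalization variable z_i collapses toward 0, the foreground budget for superpixels inside box i collapses with it. A toy check, with `delta` chosen arbitrarily:

```python
import numpy as np

def foreground_budget(z_i, n_superpixels, delta=0.5):
    """Upper bound on how many superpixels in box i may be labeled foreground."""
    return delta * n_superpixels * z_i

for z_i in (1.0, 0.5, 0.0):  # box score produced by the colocalization part
    budget = foreground_budget(z_i, n_superpixels=40)
    print(f"z_i = {z_i:.1f} -> at most {budget:.0f} foreground superpixels")
```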
... Indeed, authors have often focused on cats and dogs as examples of highly deformable objects for which recognition and detection are particularly challenging [8][9][10]. The authors of [8,9] extend the template-based detector built on the deformable parts model by combining the low-level image features of histograms of oriented gradients (capturing shape) and local binary patterns (capturing texture). ...
... Indeed, authors have often focused on cats and dogs as examples of highly deformable objects for which recognition and detection are particularly challenging [8][9][10]. The authors of [8,9] extend the template-based detector built on the deformable parts model by combining the low-level image features of histograms of oriented gradients (capturing shape) and local binary patterns (capturing texture). This model achieves an average precision of 0.61 for the dog/cat head detector. ...
... This model achieves an average precision of 0.61 for the dog/cat head detector. The paper [10] presents a similar two-step approach to [8] for cat head detection, which gains advantages from the use of different sets of features based on histograms of oriented gradients, and achieves a similar average precision of 0.63. ...
Chapter
The paper deals with an approach for reliable dog face detection in images using convolutional neural networks. Two detectors were trained on a dataset containing 8351 real-world images of different dog breeds. The first detector achieved an average precision of 0.79 while running in real time on a single CPU; the second achieved an average precision of 0.98 but requires more processing time. Subsequently, a facial landmark detector using a cascade of regressors was proposed, based on those commonly used in human face detection. The proposed algorithm is able to detect a dog's eyes, muzzle, top of the head, and inner bases of the ears with a median location error of 0.05, normalized by the inter-ocular distance. The proposed two-step technique (dog face detection followed by facial landmark detection) could be utilized for dog breed identification and consequent auto-tagging and image searches. The paper demonstrates a real-world application of the proposed technique: a successful supporting system for taking pictures of dogs facing the camera.
... To generate the final co-localization result, we have also devised a method for improving the bounding box estimate. Inspired by detection-by-segmentation approaches (e.g., [22]), we use the final detection heat map and color information to define a CRF-based segmentation algorithm, the output of which indicates the instances of the common object. ...
... Given the set of detection heat maps H, we aim to produce a segmentation of the entire object. This approach is inspired by previous work which casts localization as a segmentation problem (e.g., [22]). ...
Article
Full-text available
Given a set of images containing objects from the same category, the task of image co-localization is to identify and localize each instance. This paper shows that this problem can be solved by a simple but intriguing idea, that is, a common object detector can be learnt by making its detection confidence scores distributed like those of a strongly supervised detector. More specifically, we observe that given a set of object proposals extracted from an image that contains the object of interest, an accurate strongly supervised object detector should give high scores to only a small minority of proposals, and low scores to most of them. Thus, we devise an entropy-based objective function to enforce the above property when learning the common object detector. Once the detector is learnt, we resort to a segmentation approach to refine the localization. We show that despite its simplicity, our approach outperforms state-of-the-art methods.
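The entropy idea in this abstract can be sketched in a few lines: score each image's proposals, turn the scores into a distribution, and penalize high entropy so that confidence concentrates on a few proposals, mimicking a strongly supervised detector. A minimal numpy illustration, not the paper's exact objective:

```python
import numpy as np

def proposal_score_entropy(scores):
    """scores: 1-D array of detection scores for one image's object proposals."""
    p = np.exp(scores - scores.max())
    p /= p.sum()  # softmax over proposals
    return -np.sum(p * np.log(p + 1e-12))

peaked = np.array([9.0, 0.1, 0.2, 0.1, 0.0])  # one confident proposal
flat = np.array([1.0, 1.1, 0.9, 1.0, 1.05])   # indecisive scoring
print(proposal_score_entropy(peaked))  # low  -> detector-like behaviour
print(proposal_score_entropy(flat))    # high -> penalized by the objective
```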
... Fine-grained categorization, also known as subcategory recognition, is a rapidly growing subfield in object recognition. Applications include distinguishing different types of flowers [36,37], plants [2,29], insects [30,35], birds [5,10,15,19,31,43,50,51], dogs [27,33,38,39], vehicles [42], shoes [4], or architectural styles [34]. Each of these domains individually is of particular importance to its constituent enthusiasts; moreover, it has been shown that the mistakes of state-of-the-art recognition algorithms on the ImageNet Challenge usually pertain to distinguishing related subcategories [41]. ...
... Work on fine-grained categorization over the past 5 years has been extensive. Areas explored include feature representations that better preserve fine-grained information [35,46,47,48], segmentation-based approaches [1,13,14,15,21,37] that facilitate extraction of purer features, and part/pose normalized feature spaces [5,6,19,33,38,39,43,50,51]. Among this large body of work, it is a goal of our paper to empirically investigate which methods and techniques are most important toward achieving good performance. ...
Conference Paper
We propose an architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species. Our architecture first computes an estimate of the object's pose; this is used to compute local image features which are, in turn, used for classification. The features are computed by applying deep convolutional nets to image patches that are located and normalized by the pose. We perform an empirical study of a number of pose normalization schemes, including an investigation of higher order geometric warping functions. We propose a novel graph-based clustering algorithm for learning a compact pose normalization space. We perform a detailed investigation of state-of-the-art deep convolutional feature implementations and fine-tuning feature learning for fine-grained classification. We observe that a model that integrates lower-level feature layers with pose-normalized extraction routines and higher-level feature layers with unaligned image features works best. Our experiments advance state-of-the-art performance on bird species recognition, with a large improvement of correct classification rates over previous methods (75% vs. 55-65%).
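The pose-normalization step described here amounts to cropping fixed-size patches around predicted keypoints so that a downstream CNN sees aligned parts. A minimal sketch, with hypothetical keypoint names and zero-padding at image borders:

```python
import numpy as np

def pose_normalized_patches(image, keypoints, patch_size=64):
    """image: H x W x 3 array; keypoints: dict of name -> (x, y) pixel coords."""
    h, w = image.shape[:2]
    half = patch_size // 2
    patches = {}
    for name, (x, y) in keypoints.items():
        x0, y0 = int(x) - half, int(y) - half
        x1, y1 = x0 + patch_size, y0 + patch_size
        # pad with zeros when the patch extends past the image border
        patch = np.zeros((patch_size, patch_size, 3), image.dtype)
        sx0, sy0 = max(x0, 0), max(y0, 0)
        sx1, sy1 = min(x1, w), min(y1, h)
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
        patches[name] = patch
    return patches  # each aligned patch is fed to a CNN; features are concatenated

img = np.zeros((200, 300, 3), np.uint8)
parts = pose_normalized_patches(img, {"head": (150, 60), "body": (140, 130)})
```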
... The bottleneck for many pose-normalized representations is indeed accurate part localization. The Poselet [8] and DPM [17] methods have previously been utilized to obtain part localizations with a modest degree of success; methods generally report adequate part localization only when given a known bounding box at test time [11,20,36,37,43]. By developing a novel deep part detection scheme, we propose an end-to-end fine grained categorization system which requires no knowledge of object bounding box at test time, and can achieve performance rivaling previously reported methods requiring the ground truth bounding box at test time to filter false positive detections. ...
... Segmentation-based approaches are also very effective for fine-grained recognition. Approaches such as [11,20,36,37,43] used region-level cues to infer the foreground segmentation mask and to discard the noisy visual information in the background. Chai et al. [10] showed that jointly learning part localization and foreground segmentation together can be beneficial for finegrained categorization. ...
Conference Paper
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.
... PET comprises eye movement recordings compiled for the bird, cat, cow, dog, horse and sheep training+validation (or trainval) sets from the VOC 2012 image set. These six animal-centric classes were chosen from the 20 object classes in VOC2012 owing to the following reasons: (i) Animal classes such as cats, dogs and birds are particularly difficult to detect using traditional supervised learning methods (e.g., deformable parts model) owing to large intrinsic shape and textural variations [13], and (ii) It would be highly beneficial to incorporate human knowledge to train object detectors for these classes as many psychophysical studies [14] have noted our tendency to instantaneously detect animals (which are both predators and prey). ...
Article
Full-text available
We present the Pascal animal classes Eye Tracking (PET) database. Our database comprises eye movement recordings compiled from forty users for the bird, cat, cow, dog, horse and sheep trainval sets from the VOC 2012 image set. Different from recent eye-tracking databases such as [kiwon_cvpr13_gaze, PapadopoulosCKF14], a salient aspect of PET is that it contains eye movements recorded for both the free-viewing and visual search task conditions. While some differences in terms of overall gaze behavior and scanning patterns are observed between the two conditions, a very similar number of fixations is observed on target objects for both conditions. As a utility application, we show how feature pooling around fixated locations enables enhanced (animal) object classification accuracy.
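The utility application mentioned at the end, feature pooling around fixated locations, can be sketched as masking a dense feature map by a disk around each fixation before averaging. A minimal illustration with assumed shapes (C x H x W CNN features, fixations already mapped to feature-map coordinates):

```python
import numpy as np

def pool_around_fixations(feat_map, fixations, radius=2):
    """feat_map: C x H x W array of dense features (e.g. from a CNN);
    fixations: list of (row, col) positions in feature-map coordinates."""
    C, H, W = feat_map.shape
    keep = np.zeros((H, W), bool)
    rows, cols = np.mgrid[0:H, 0:W]
    for r, c in fixations:
        keep |= (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
    if not keep.any():  # no fixations: fall back to global average pooling
        keep[:] = True
    return feat_map[:, keep].mean(axis=1)  # C-dimensional pooled descriptor

fm = np.random.rand(256, 14, 14)
desc = pool_around_fixations(fm, [(3, 4), (10, 9)])
```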
... For example, class cat accumulates most of its most discriminative filters on parts of the head. Interestingly, Parkhi et al. (2011) observed a similar phenomenon with HOG features, where the most discriminative parts of cats and dogs were found to be the heads. On the other hand, class horse tends to prefer parts of the body, such as the legs, devoting very few discriminative filters to the head. ...
Article
Full-text available
Semantic object parts can be useful for several visual recognition tasks. Lately, these tasks have been addressed using Convolutional Neural Networks (CNN), achieving outstanding results. In this work we study whether CNNs learn semantic parts in their internal representation. We investigate the responses of convolutional filters and try to associate their stimuli with semantic parts. While previous efforts [1,2,3,4] studied this matter by visual inspection, we perform an extensive quantitative analysis based on ground-truth part bounding-boxes, exploring different layers, network depths, and supervision levels. Even after assisting the filters with several mechanisms to favor this association, we find that only about 25 percent of the semantic parts in PASCAL Part dataset [5] emerge in the popular AlexNet [6] network finetuned for object detection [7]. Interestingly, both the supervision level and the network depth do not seem to significantly affect the emergence of parts. Finally, we investigate if filters are responding to recurrent discriminative patches as opposed to semantic parts. We discover that the discriminative power of the network can be attributed to a few discriminative filters specialized to each object class. Moreover, about 60 percent of them can be associated with semantic parts. The overlap between discriminative and semantic filters might be the reason why previous studies suggested a stronger emergence of semantic parts, based on visual inspection only.
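The kind of filter-to-part association measured in this study can be approximated by thresholding a filter's activation map, boxing the hot region, and computing IoU against a ground-truth part box. A toy sketch with an ad hoc threshold:

```python
import numpy as np

def activation_box(act, thresh_ratio=0.5):
    """act: H x W activation map; returns (x0, y0, x1, y1) of the hot region."""
    ys, xs = np.where(act >= thresh_ratio * act.max())
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

act = np.zeros((14, 14)); act[2:6, 3:8] = 1.0  # toy filter response
print(iou(activation_box(act), (3, 2, 8, 6)))  # 1.0: filter coincides with the part
```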
... In recent years, several works have proposed to incorporate segmentation techniques to assist object detection in different ways. For example, Parkhi et al. [18] improved the predicted bounding box with color models from predicted rectangles on cat and dog faces. Dai et al. [5] proposed to use segments extracted for each object detection hypothesis to accurately localize detected objects. ...
Article
Most existing detection pipelines treat object proposals independently and predict bounding box locations and classification scores over them separately. However, the important semantic and spatial layout correlations among proposals are often ignored, which are actually useful for more accurate object detection. In this work, we propose a new EM-like group recursive learning approach to iteratively refine object proposals by incorporating such context of surrounding proposals and provide an optimal spatial configuration of object detections. In addition, we propose to incorporate the weakly-supervised object segmentation cues and region-based object detection into a multi-stage architecture in order to fully exploit the learned segmentation features for better object detection in an end-to-end way. The proposed architecture consists of three cascaded networks which respectively learn to perform weakly-supervised object segmentation, object proposal generation and recursive detection refinement. Combining the group recursive learning and the multi-stage architecture provides competitive mAPs of 78.6% and 74.9% on the PASCAL VOC2007 and VOC2012 datasets respectively, which outperforms many well-established baselines [10], [20] significantly.
... Object recognition methods and part-based methods [12] have recently been popular for training discriminative parts of objects and modeling different objects for classification [13,14]. Inspired by the above methods, we propose a new method to recognize characters by using part-based stroke structures, which can not only efficiently detect the boundaries of joined characters but also recognize the detected characters. All of a character's stroke structures are automatically obtained and constructed. ...
Article
Scene text recognition has gained significant attention in the computer vision community. Character detection and recognition are the foundation of text recognition and affect the overall performance to a large extent. We propose a good initialization model for scene character recognition from cropped text regions. We use constrained character body structures with deformable part-based models to detect and recognize characters against various backgrounds. The character body structures are obtained by an unsupervised discriminative clustering approach followed by a statistical model and a self-built minimum spanning tree model. Our method utilizes part appearance and location information, and combines character detection and recognition in cropped text regions. The evaluation results on the benchmark datasets demonstrate that our proposed scheme outperforms state-of-the-art methods on both scene character recognition and word recognition.
... It is widely used especially when the pattern behind complicated data is difficult to deduce explicitly. Successful applications include computer vision [29], cybersecurity [30], ancient abstract strategy games [31], etc. ...
Article
Full-text available
We use the density-matrix renormalization group, applied to a one-dimensional model of continuum Hamiltonians, to accurately solve chains of hydrogen atoms of various separations and numbers of atoms. We train and test a machine-learned approximation to F[n], the universal part of the electronic density functional, to within quantum chemical accuracy. We also develop a data-driven, atom-centered basis set for densities which greatly reduces the computational cost and accurately represents the physical information in the machine-learning calculation. Our calculation (a) bypasses the standard Kohn-Sham approach, avoiding the need to find orbitals, (b) includes the strong correlation of highly stretched bonds without any specific difficulty (unlike all standard DFT approximations), and (c) is so accurate that it can be used to find the energy in the thermodynamic limit to quantum chemical accuracy.
... From the point of view of cognitive science, basic-level categories are mostly defined by their parts, while subordinate-level categories are distinguished by different properties of these parts [23]. Following this hypothesis, methods based on comparisons between corresponding parts of an object have been developed and applied to distinguishing dogs, birds and vehicles [6,21,22,27]. Their success lies in the fact that instances like animals and artifacts share some parts with similar appearances, and these parts can be aligned to a normalized space. But these methods cannot work well in the plant domain, especially for flowers. ...
Article
Full-text available
Existing methods for flower classification usually focus on segmentation of the foreground, followed by extraction of features. After extracting the features from the foreground, global pooling is performed for final classification. Although this pipeline can be applied to many recognition tasks, these approaches have not explored the structural cues of flowers, due to the large variation in their appearances. In this paper, we argue that structural cues are essential for flower recognition. We present a novel approach that explores structural cues to extract features. The proposed method encodes the structure of flowers into the final feature vectors for classification by operating on salient regions, which is robust to appearance variations. In our framework, we first segment the flower accurately by refining the existing segmentation method, and then we generate local features using our approach. We combine our local features with global-pooled features for classification. Evaluations on the Oxford Flower dataset show that by introducing structural cues and locally pooling some off-the-shelf features, our method outperforms the state of the art, which employs specifically designed features and metric learning.
... The object heat map indicates the most discriminative details of an object, and usually focuses on object parts (e.g., the head of dogs) instead of the whole object. Inspired by [30], which casts localization as a segmentation task, we perform GrabCut [34] on the object heat map to generate the segmentation mask. The goal is to propagate the discriminative part details to the whole object using color continuity cues. ...
Article
Part-based representation has been proven effective for a variety of visual applications. However, automatic discovery of discriminative parts without object/part-level annotations is challenging. This paper proposes a discriminative mid-level representation paradigm based on the responses of a collection of part detectors, which requires only image-level labels. Towards this goal, we first develop a detector-based spectral clustering method to mine representative and discriminative mid-level patterns for detector initialization. The advantage of the proposed pattern mining technique is that the detector-based distance metric focuses only on discriminative details, and a set of such grouped detectors offers an effective way of mining consistent patterns. Relying on the discovered patterns, we further formulate the detector learning process as a confidence-loss sparse Multiple Instance Learning (cls-MIL) task, which considers the diversity of the positive samples while avoiding drifting away from the well-localized ones by assigning a confidence value to each positive sample. The responses of the learned detectors can form an effective mid-level image representation for both image classification and object localization. Experiments conducted on benchmark datasets demonstrate the superiority of our method over existing approaches.
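The heat-map-to-mask propagation described in the excerpt above can be sketched with OpenCV's GrabCut: strong heat-map responses become hard foreground seeds, weak ones probable foreground, and the color model fills in the rest of the object. The thresholds below are arbitrary assumptions:

```python
import cv2
import numpy as np

def heatmap_to_mask(image_bgr, heat, hi=0.7, lo=0.2):
    """heat: H x W map in [0, 1], typically peaked on a discriminative part."""
    mask = np.full(heat.shape, cv2.GC_PR_BGD, np.uint8)
    mask[heat > lo] = cv2.GC_PR_FGD  # weak evidence: probable foreground
    mask[heat > hi] = cv2.GC_FGD     # strong evidence: hard foreground seed
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    # Color continuity grows the part seed into a whole-object mask.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```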
... Such hand-engineered features tend to be time-consuming to design and are not scalable. In image recognition, for example, images exhibit many variations: illumination and lighting conditions, viewpoint variation, scale variation, deformation, occlusion, and intra-class variation [13]. Classification must be invariant to all these variations. ...
Conference Paper
Deep learning is a thriving research area with many successful applications in different fields. This article is written with a view to providing a state-of-the-art review of deep learning. To some extent, we present a historical overview necessary to understand the concepts that laid the foundations of today's deep learning. We cover the different methods that made the successful training of deep learning models possible at very high scale in various modern practices.
... the use of machine-learning methods in veterinary medicine has been very limited [73,74,104]. ...
Article
Machine-learning methods can assist with the medical decision-making processes at the both the clinical and diagnostic levels. In this article, we first review historical milestones and specific applications of computer-based medical decision support tools in both veterinary and human medicine. Next, we take a mechanistic look at 3 archetypal learning algorithms-naive Bayes, decision trees, and neural network-commonly used to power these medical decision support tools. Last, we focus our discussion on the data sets used to train these algorithms and examine methods for validation, data representation, transformation, and feature selection. From this review, the reader should gain some appreciation for how these decision support tools have and can be used in medicine along with insight on their inner workings.
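For a flavor of the three archetypal algorithms the review names, here is a small scikit-learn comparison on a bundled toy medical dataset; it illustrates the models, not the review's own experiments:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                                    random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()  # 5-fold validation accuracy
    print(f"{name:>14s}: {acc:.3f} mean CV accuracy")
```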
... A large number of works propose to leverage extra annotations of bounding boxes and parts to localize significant regions in fine-grained recognition [9,16,22,30,33,34]. However, the heavy involvement of human effort makes this approach impractical for large-scale real-world problems. ...
... Experts distinguish subordinate classes based on specific parts of the objects. Therefore, a straightforward approach is to learn features of object parts [31,12,32,33,34,35,36,37,38,39,40,41]. This approach requires heavy part annotations from domain experts, and is therefore difficult to extend to larger-scale datasets. ...
Preprint
Full-text available
Discriminative features play an important role in image and object classification, and also in other fields of research such as semi-supervised learning, fine-grained classification, and out-of-distribution detection. Inspired by Linear Discriminant Analysis (LDA), we propose an optimization called Neural Discriminant Analysis (NDA) for deep convolutional neural networks (DCNNs). NDA transforms deep features to become more discriminative and, therefore, improves performance in various tasks. Our proposed optimization has two primary goals concerning inter- and intra-class variances. The first is to minimize the variance within each individual class. The second is to maximize pairwise distances between features coming from different classes. We evaluate our NDA optimization in different research fields: general supervised classification, fine-grained classification, semi-supervised learning, and out-of-distribution detection. We achieve performance improvements in all these fields compared to baseline methods that do not use NDA. Besides, using NDA, we also surpass the state of the art on the four tasks on various testing datasets.
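The two stated goals, shrinking within-class variance and stretching between-class distances, can be sketched as a single differentiable loss. This is a generic LDA-flavored formulation under assumed shapes, not the paper's exact NDA objective:

```python
import torch

def nda_loss(features, labels, margin=10.0):
    """features: N x D deep features; labels: N integer class labels."""
    classes = labels.unique()
    means, intra = [], features.new_zeros(())
    for c in classes:
        fc = features[labels == c]
        mu = fc.mean(dim=0)
        means.append(mu)
        intra = intra + ((fc - mu) ** 2).sum(dim=1).mean()  # within-class spread
    means = torch.stack(means)
    dists = torch.cdist(means, means)  # pairwise distances between class means
    off_diag = dists[~torch.eye(len(classes), dtype=torch.bool)]
    inter = torch.clamp(margin - off_diag, min=0).mean()  # hinge pushes means apart
    return intra / len(classes) + inter

feats = torch.randn(16, 64, requires_grad=True)  # stand-in for a feature batch
labels = torch.randint(0, 4, (16,))
nda_loss(feats, labels).backward()  # gradients flow back into the feature extractor
```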
... Identifying discriminative object parts is important for fine-grained classification [50,49,58,67]. For example, bounding box or landmark annotations can be used to learn object parts for fine-grained classification [24,34,41,62,64]. To avoid costly annotation of object parts, several recent works focused on unsupervised or weakly-supervised part learning using deep models. ...
Preprint
We present an interpretable deep model for fine-grained visual recognition. At the core of our method lies the integration of region-based part discovery and attribution within a deep neural network. Our model is trained using image-level object labels, and provides an interpretation of its results via the segmentation of object parts and the identification of their contributions towards classification. To facilitate the learning of object parts without direct supervision, we explore a simple prior of the occurrence of object parts. We demonstrate that this prior, when combined with our region-based part discovery and attribution, leads to an interpretable model that remains highly accurate. Our model is evaluated on major fine-grained recognition datasets, including CUB-200, CelebA and iNaturalist. Our results compare favorably to state-of-the-art methods on classification tasks, and our method outperforms previous approaches on the localization of object parts.
... features (spots, stripes, scales, or specific shapes such as fin or ear edges) on images (Bolger et al. 2012), is less efficient with highly flexible and deformable animals (e.g., foxes or cats) without any particular coat pattern (Yu et al. 2013). In general, deformable animals, which may exhibit different poses and shapes depending on the viewing angle, are very difficult to detect against the background (Parkhi, Vedaldi, Jawahar & Zisserman 2011). In addition, automatic species identification algorithms are usually developed starting from selected images where the presence of an animal has already been validated by hand (Yu et al. 2013; Gomez Villa, Salazar & Vargas 2017), but automatic segmentation algorithms that allow the detection of animal presence/absence in images are still rare (Gomez Villa et al. 2017) or case-specific. ...
Preprint
Full-text available
Camera traps now represent a reliable, efficient and cost-effective technique to monitor wildlife and collect biological data in the field. However, efficiently extracting information from the massive amount of images generated is often extremely time-consuming and may now represent the most rate-limiting step in camera trap studies. To help overcome this challenge, we developed FoxMask, a new tool performing the automatic detection of animal presence in short sequences of camera trap images. FoxMask uses background estimation and foreground segmentation algorithms to detect the presence of moving objects (most likely, animals) on images. We analyzed a sample dataset from camera traps used to monitor activity on arctic fox Vulpes lagopus dens to test the parameter settings and the performance of the algorithm. The shape and color of arctic foxes, their background at snowmelt and during the summer growing season were highly variable, thus offering challenging testing conditions. We compared the automated animal detection performed by FoxMask to a manual review of the image series. The performance analysis indicated that the proportion of images correctly classified by FoxMask as containing an animal or not was very high (> 90%). FoxMask is thus highly efficient at reducing the workload by eliminating most false triggers (images without an animal). We provide parameter recommendations to facilitate usage and we present the cases where the algorithm performs less efficiently to stimulate further development. FoxMask is an easy-to-use tool freely available to ecologists performing camera trap data extraction. By minimizing analytical time, computer-assisted image analysis will allow collection of increased sample sizes and testing of new biological questions.
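FoxMask's background-estimation/foreground-segmentation strategy can be illustrated with OpenCV's stock MOG2 background subtractor; the tool itself uses its own algorithms, and the folder path and trigger threshold below are assumptions:

```python
import cv2
import glob

# MOG2 maintains a per-pixel Gaussian mixture model of the background.
subtractor = cv2.createBackgroundSubtractorMOG2(history=50, varThreshold=16)
for path in sorted(glob.glob("camera_trap_sequence/*.jpg")):  # hypothetical folder
    frame = cv2.imread(path)
    if frame is None:
        continue
    fg = subtractor.apply(frame)  # 0 = background, 255 = foreground, 127 = shadow
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN,  # drop small false triggers
                          cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    moving = int((fg == 255).sum())
    print(path, "animal candidate" if moving > 500 else "likely empty")
```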
... Similarly, the part-based approaches normalize the variation present due to poses and viewpoints. Many works [28], [29], [1] assume the availability of bounding boxes at the object level and the part level in all images during both training and testing. To achieve higher accuracy, [22], [30], [31] employed both object-level and part-level annotations. ...
Preprint
Full-text available
To make the best use of minute and subtle differences, fine-grained classifiers collect information about inter-class variations. The task is very challenging due to the small differences between the colors, viewpoints, and structures of entities in the same class. Classification becomes more difficult because differences in viewpoint across classes can resemble the differences within a class. In this work, we investigate the performance of landmark general CNN classifiers, which have presented top-notch results on large-scale classification datasets, on fine-grained datasets, and compare them against state-of-the-art fine-grained classifiers. In this paper, we pose two specific questions: (i) Do general CNN classifiers achieve results comparable to fine-grained classifiers? (ii) Do general CNN classifiers require any specific information to improve upon fine-grained ones? Throughout this work, we train the general CNN classifiers without introducing any aspect that is specific to fine-grained datasets. We show an extensive evaluation on six datasets to determine whether the fine-grained classifier is able to elevate the baseline in their experiments.
... (d) Post-processing: post-process the output to get a better score for each pixel, where the operation "×" means multiplying the regression score by the binary mask. ... used a template-based model to detect objects and obtained the segmentation via a graph-cut-based energy minimization formulation [43]. ...
Article
Full-text available
Many state-of-the-art shape features have been proposed for the shape recognition task. In this paper, to explore whether a shape feature influences object segmentation, we propose a specific shape feature, the Fisher shape (a form of bag of contour fragments), and combine it with the appearance feature via multiple kernel learning to create an object segmentation pipeline. The experimental results on benchmark datasets clearly demonstrate that this object segmentation pipeline is effective and that the Fisher shape can improve object segmentation over the appearance feature alone.
... Discriminative part localization. Discriminative part localization has long been studied by many communities, such as fine-grained recognition [5,10,18,29,30], face recognition [37,16,17,33,23] and person re-identification [26]. After deep learning came to dominate the computer vision community, hand-crafted part features for fine-grained recognition were dropped. ...
... A weighted graph can be used for binary foreground-background segmentation. Each vertex of the graph has a prior probability of being background/foreground, which serves as a unary potential for graph cut (Parkhi et al. 2011). Edge detectors are used to compute binary potentials, which indicate whether two connected vertices will have the same label. ...
Thesis
Full-text available
Problem: Deep learning based vision systems have achieved near-human accuracy in recognizing coarse object categories from visual data. But recognizing fine-grained sub-categories remains an open problem. Tasks like fine-grained species recognition pose further challenges: significant background variation compared to the subtle differences between objects, high class imbalance due to the scarcity of samples for endangered species, the cost of domain expert annotations and labeling, etc. Methodology: The existing approaches to learning small specialized datasets, like transfer learning, are still inadequate in the case of fine-grained sub-categories. The hypothesis of this work is that collaborative filters should be incorporated into present learning frameworks to better address these challenges. The intuition comes from the fact that collaborative representation based classifiers have previously been used for face recognition problems, which present similar challenges. Outcomes: Keeping the above hypothesis in mind, the thesis achieves the following objectives: 1) it demonstrates the suitability of collaborative classifiers for fine-grained recognition; 2) it expands the state of the art by incorporating automated background suppression into the collaborative classification formulation; 3) it incorporates the collaborative cost function into supervised learning (deep convolutional networks) and unsupervised learning (clustering algorithms); 4) lastly, during this work several benchmark fine-grained image datasets have been introduced on NZ and Indian butterfly and bird species recognition.
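The graph-cut formulation in the excerpt above (per-vertex foreground priors as unary potentials, plus pairwise potentials between neighbors) can be sketched with the PyMaxflow package; the Potts weight here is contrast-free for simplicity, whereas practical systems modulate it by edge strength:

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def graph_cut_segment(prob_fg, pairwise_weight=1.0):
    """prob_fg: H x W foreground probabilities in (0, 1); returns a bool mask."""
    eps = 1e-6
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(prob_fg.shape)
    g.add_grid_edges(nodes, pairwise_weight)  # 4-connected Potts smoothness
    # Unary potentials as negative log-likelihoods: the source capacity is
    # paid by pixels labeled foreground, the sink capacity by background ones.
    g.add_grid_tedges(nodes,
                      -np.log(prob_fg + eps),        # cost of labeling foreground
                      -np.log(1.0 - prob_fg + eps))  # cost of labeling background
    g.maxflow()
    return g.get_grid_segments(nodes)  # True where foreground

prob = np.full((60, 80), 0.2)
prob[20:40, 30:60] = 0.9  # a blob of likely foreground
mask = graph_cut_segment(prob)
print(mask.sum(), "foreground pixels")
```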
Conference Paper
Full-text available
Deep neural networks (DNNs) can extract high-dimensional features of images for computer vision tasks, including Optical Coherence Tomography (OCT) image classification. However, OCT images are usually processed by DNNs just like natural images, and thus the performance of the DNN is not satisfactory. We present an end-to-end DNN targeting OCT image classification. Considering the characteristics of OCT images, we introduce an attention mechanism into the classifier to extract more specific features of OCT images. Our network demonstrates its capacity to enhance the features that represent the disease region. Our method achieves state-of-the-art performance with an average accuracy of 99.5% and an F1-score of 0.995 on the OCT image dataset.
Conference Paper
We consider the problem of amodal instance segmentation, the objective of which is to predict the region encompassing both visible and occluded parts of each object. Thus far, the lack of publicly available amodal segmentation annotations has stymied the development of amodal segmentation methods. In this paper, we sidestep this issue by relying solely on standard modal instance segmentation annotations to train our model. The result is a new method for amodal instance segmentation, which represents the first such method to the best of our knowledge. We demonstrate the proposed method’s effectiveness both qualitatively and quantitatively.
Article
The problem of fine-grained object recognition is very challenging due to the subtle visual differences between different object categories. In this paper, we propose a task-driven progressive part localization (TPPL) approach for fine-grained object recognition. Most existing methods follow a two-step approach that first detects salient object parts to suppress the interference from background scenes and then classifies objects based on features extracted from these regions. The part detector and object classifier are often independently designed and trained. In this paper, our major finding is that the part detector should be jointly designed and progressively refined with the object classifier so that the detected regions can provide the most distinctive features for final object recognition. Specifically, we develop a part-based SPP-net (Part-SPP) as our baseline part detector. We then establish a TPPL framework, which takes the predicted boxes of Part-SPP as an initial guess, and then examines new regions in the neighborhood using a particle swarm optimization approach, searching for more discriminative image regions to maximize the objective function and the recognition performance. This procedure is performed in an iterative manner to progressively improve the joint part detection and object classification performance. Experimental results on the Caltech-UCSD Birds-200-2011 dataset demonstrate that our method outperforms state-of-the-art fine-grained categorization methods both in part localization and classification, even without requiring a bounding box during testing.
Article
Machine learning's grand ambition is the mathematical modeling of reality. Recent years have seen major advances using deep-learned techniques that model reality implicitly; however, corresponding advances in explicit mathematical models have been noticeably lacking. We believe this dichotomy is rooted in the limitations of current statistical tools, which struggle to make sense of the high-dimensional generative processes that natural data seem to originate from. This paper proposes a new, distance-based statistical technique which allows us to develop elegant mathematical models of such generative processes. Our model suggests that each semantic concept has an associated distinctive-shell which encapsulates almost all instances of itself and excludes almost all others, creating the first explicit mathematical representation of the constraints which make machine learning possible.
Article
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as a feature representation. However, the information in this layer may be too coarse spatially to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation, where we improve state-of-the-art from 49.7 mean APr to 62.4, keypoint localization, where we get a 3.3 point boost over a strong regression baseline using CNN features, and part labeling, where we show a 6.6 point gain over a strong baseline.
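The hypercolumn definition in this abstract translates directly into code: tap several layers, upsample each to image resolution, and concatenate along the channel axis. A minimal sketch with an untrained VGG-16 (for shape illustration only) and an illustrative choice of tap layers:

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.vgg16(weights=None).features.eval()
taps = (3, 8, 15)  # ReLU outputs at increasing depth (illustrative choice)

def hypercolumns(image):
    """image: 1 x 3 x H x W tensor; returns C_total x H x W per-pixel vectors."""
    h, w = image.shape[2:]
    maps, x = [], image
    with torch.no_grad():
        for i, layer in enumerate(model):
            x = layer(x)
            if i in taps:
                # upsample coarse activations back to image resolution
                maps.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                          align_corners=False))
            if i == max(taps):
                break
    return torch.cat(maps, dim=1)[0]  # stacked activation vector per pixel

cols = hypercolumns(torch.randn(1, 3, 128, 128))
print(cols.shape)  # torch.Size([448, 128, 128]) for this tap choice
```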
Article
Full-text available
It is well accepted that image segmentation can benefit from utilizing multilevel cues. This paper focuses on utilizing FCNN-based dense semantic predictions in bottom-up image segmentation, arguing that semantic cues should be taken into account from the very beginning. By doing so we can avoid, as far as possible, merging regions of similar appearance but distinct semantic categories, thus handling the semantic inefficiency problem. We also propose a straightforward way to use contour cues to suppress the noise in multilevel cues and thereby improve segmentation robustness. The evaluation on BSDS500 shows that we obtain competitive region and boundary performance. Furthermore, since all individual regions can be assigned appropriate semantic labels during the computation, we are capable of extracting the adjusted semantic segmentations. The experiment on Pascal VOC 2012 shows our improvement over the original semantic segmentations derived directly from the dense predictions.
Article
It is a challenging task to recognize fine-grained subcategories due to the highly localized and subtle differences among them. Different from most previous methods that rely on object/part annotations, this paper proposes an automatic fine-grained recognition approach that is free of any object/part annotation at both training and testing stages. The key idea includes two steps of picking neural activations computed from the convolutional neural networks, one for localization and the other for description. The first picking step is to find distinctive neurons that are sensitive to specific patterns significantly and consistently. Based on these picked neurons, we initialize positive samples and formulate the localization as a regularized multiple instance learning task, which aims at refining the detectors via iteratively alternating between new positive sample mining and part model retraining. The second picking step is to pool deep neural activations via a spatially weighted combination of Fisher Vector coding. We conditionally select activations to encode them into the final representation, which considers the importance of each activation. Integrating the above techniques produces a powerful framework, and experiments conducted on several extensive fine-grained benchmarks demonstrate the superiority of our proposed algorithm over existing methods.
Conference Paper
Weakly supervised fine-grained image classification (WFGIC) aims at learning to recognize hundreds of subcategories in each basic-level category with only image-level labels available. It is extremely challenging, and existing methods mainly focus on localizing discriminative semantic parts or regions, as the key differences among different subcategories are subtle and local. However, they localize these regions independently, neglecting the fact that regions are mutually correlated and that region groups can be more discriminative. Meanwhile, most current work tends to derive features directly from the output of the CNN and rarely considers the correlation within the feature vector. To address these issues, we propose an end-to-end Correlation-guided Discriminative Learning (CDL) model to fully mine and exploit the discriminative potential of correlations for WFGIC, both globally and locally. From the global perspective, a discriminative region grouping (DRG) sub-network is proposed which first establishes correlations between regions and then enhances each region by weighted aggregation of the correlations from all other regions to it. By this means, each region's representation encodes the global image-level context and thus is more robust; meanwhile, through learning the correlation between discriminative regions, the network is guided to implicitly discover the discriminative region groups which are more powerful for WFGIC. From the local perspective, a discriminative feature strengthening sub-network (DFS) is proposed to mine and learn the internal spatial correlation among the elements of each patch's feature vector, improving its discriminative power locally by jointly emphasizing informative elements while suppressing useless ones. Extensive experiments demonstrate the effectiveness of the proposed DRG and DFS sub-networks, and show that the CDL model achieves state-of-the-art performance in both accuracy and efficiency.
Article
Fine-grained vehicle recognition is a challenging problem due to high inter-class confusion among vehicle models under the influence of pose and viewpoint. To effectively describe the discriminative characteristics, many approaches try to learn detailed information from an individual image. Inspired by the Siamese network, which addresses the case where two inputs are relatively similar, the semantic interaction learning network (SIL-Net) is designed to discover semantic differences between two fine-grained categories via pairwise comparison. Specifically, SIL-Net first collects contrastive information by learning the mutual features of an input image pair, and then compares them with individual features to generate the corresponding semantic features. These features learn semantic differences from contextual comparison, which gives SIL-Net the ability to distinguish between two confusing images via pairwise interaction. After training, SIL-Net can adaptively learn feature priorities under the supervision of the margin ranking loss and converges quickly. SIL-Net performs well on two public vehicle benchmarks (Stanford Cars and CompCars), showing its suitability for fine-grained vehicle recognition.
Article
This paper aims at learning discriminative part detectors with only image-level labels. To this end, we need to develop effective technologies for both pattern mining and detection learning. Different from previous methods which train part detectors in one step, we divide the detector learning process into two stages, and formulate it as a weak to strong learning framework. In particular, we first learn exemplar detectors from the unaligned patterns, and perform a detector-based spectral clustering to produce weak detectors that are only responsible for a few discriminative patterns. In this way, the weak detectors are able to offer right initial patterns for strong detector learning. Second, we learn strong detectors with patterns discovered from the weak detectors, which we formulate as a confidence-loss sparse Multiple Instance Learning (cls-MIL) task. The cls-MIL considers the diversity of positive samples while avoiding drifting away from the well localized ones by assigning a confidence value to each positive sample. The responses of the learned detectors produce an effective mid-level image representation for both image classification and object localization. Experiments conducted on benchmark datasets well demonstrate the superiority of our method over existing approaches.
Article
The field of multimedia is unique in offering a rich and dynamic forum for researchers from "traditional" fields to collaborate and develop new solutions and knowledge that transcend the boundaries of individual disciplines. Despite the prolific research activities and outcomes, however, few efforts have been made to develop books that serve as an introduction to the rich spectrum of topics covered by this broad field. A few books are available that either focus on specific subfields or provide basic background in multimedia. Tutorial-style materials covering the active topics being pursued by the leading researchers at the frontiers of the field are currently lacking. In 2015, ACM SIGMM, the special interest group on multimedia, launched a new initiative to address this void by selecting and inviting 12 rising-star speakers from different subfields of multimedia research to deliver plenary tutorial-style talks at the ACM Multimedia conference for 2015. Each speaker discussed the challenges and state-of-the-art developments of their respective research areas in a general manner to the broad community. The covered topics were comprehensive, including multimedia content understanding, multimodal human-human and human-computer interaction, multimedia social media, and multimedia system architecture and deployment. Following the very positive responses to these talks, the speakers were invited to expand the content covered in their talks into chapters that can be used as reference material for researchers, students, and practitioners. Each chapter discusses the problems, technical challenges, state-of-the-art approaches and performance, open issues, and promising directions for future work. Collectively, the chapters provide an excellent sampling of the major topics addressed by the community as a whole. This book, capturing some of the outcomes of such efforts, is well positioned to fill the aforementioned needs in providing tutorial-style reference materials for frontier topics in multimedia.
Chapter
Fine-grained classification is challenging since sub-categories have small inter-class variances and large intra-class variations. The task of flower classification can be achieved by highlighting the discriminative parts. Most traditional methods train Convolutional Neural Networks (CNNs) to handle the variations of pose, color and rotation, but utilize only single-level semantic information. In this paper, we propose a fine-grained classification approach with multi-level semantic representation. With the complementary strengths of multi-level semantic representation, we attempt to capture the subtle differences between sub-categories. One object-level model and multiple part-level models are trained as a multi-scale classifier. We test our method on the Oxford Flower dataset with 102 categories, and our result achieves the best performance over other state-of-the-art approaches.
Chapter
In recent years, deep learning has been widely used in various computer vision tasks. Because vegetable images differ mainly in local critical areas, classifying vegetable categories to meet the needs of users has become an urgent problem. In this paper, we propose fine-grained image recognition based on a Vegetable Dataset, and use multi-scale iteration to extract critical-area characteristics, where the learning at each scale consists of a classification subnetwork and critical-area localization. In addition, the multi-scale neural network is optimized by two loss functions to learn accurate critical areas and fine-grained features. Finally, we further prove its scalability and effectiveness by comparing different datasets and different training methods, and we obtain satisfactory results on the Vegetable Dataset.
Chapter
Attention-based learning for fine-grained image recognition remains a challenging task, where most of the existing methods treat each object part in isolation, neglecting the correlations among them. In addition, the multi-stage or multi-scale mechanisms involved make the existing methods less efficient and hard to train end-to-end. In this paper, we propose a novel attention-based convolutional neural network (CNN) which regulates multiple object parts among different input images. Our method first learns multiple attention region features of each input image through the one-squeeze multi-excitation (OSME) module, and then applies the multi-attention multi-class constraint (MAMC) in a metric learning framework. For each anchor feature, the MAMC functions by pulling same-attention same-class features closer, while pushing different-attention or different-class features away. Our method can be easily trained end-to-end, and is highly efficient, requiring only one training stage. Moreover, we introduce Dogs-in-the-Wild, a comprehensive dog species dataset that surpasses similar existing datasets in category coverage, data volume and annotation quality. Extensive experiments are conducted to show the substantial improvements of our method on four benchmark datasets.
Article
We investigate the localization of subtle yet discriminative parts for fine-grained image recognition. Based on the observation that such parts typically exist within a hierarchical structure (e.g., from a coarse-scale "head" to a fine-scale "eye" when recognizing bird species), we propose a novel progressive-attention convolutional neural network (PA-CNN) to progressively localize parts at multiple scales. The PA-CNN localizes parts in two steps, where a part proposal network (PPN) generates multiple local attention maps, and a part rectification network (PRN) learns part-specific features from each proposal and provides the PPN with refined part locations. This coupling of the PPN and PRN allows them to be optimized in a mutually reinforcing manner, leading to improved pinpointing of fine-grained parts. Moreover, the convolutional parameters for a PPN at a finer scale can be inherited from the PRN at a coarser scale, enabling a rich part hierarchy (e.g., eye and beak in a bird's head) to be learned in a stacked fashion. Case studies show that PA-CNN can precisely identify parts without using bounding box/part annotations. In addition, quantitative evaluations demonstrate that PA-CNN yields state-of-the-art performance in three challenging fine-grained recognition tasks, i.e., CUB-200-2011, FGVC-Aircraft, and Stanford Cars.
Conference Paper
Full-text available
This paper proposes a new approach to learning a discriminative model of object classes, incorporating appearance, shape and context information efficiently. The learned model is used for automatic visual recognition and semantic segmentation of photographs. Our discriminative model exploits novel features, based on textons, which jointly model shape and texture. Unary classification and feature selection is achieved using shared boosting to give an efficient classifier which can be applied to a large number of classes. Accurate image segmentation is achieved by incorporating these classifiers in a conditional random field. Efficient training of the model on very large datasets is achieved by exploiting both random feature selection and piecewise training methods. High classification and segmentation accuracy are demonstrated on three different databases: i) our own 21-object class database of photographs of real objects viewed under general lighting conditions, poses and viewpoints, ii) the 7-class Corel subset and iii) the 7-class Sowerby database used in [1]. The proposed algorithm gives competitive results both for highly textured (e.g. grass, trees), highly structured (e.g. cars, faces, bikes, aeroplanes) and articulated objects (e.g. body, cow).
Conference Paper
Full-text available
Hierarchical conditional random fields have been successfully applied to object segmentation. One reason is their ability to incorporate contextual information at different scales. However, these models do not allow multiple labels to be assigned to a single node. At higher scales in the image, this yields an oversimplified model, since multiple classes can reasonably be expected to appear within one region. This simplified model especially limits the impact that observations at larger scales may have on the CRF model. Neglecting the information at larger scales is undesirable, since class-label estimates based on these scales are more reliable than at smaller, noisier scales. To address this problem, we propose a new potential, called the harmony potential, which can encode any possible combination of class labels. We propose an effective sampling strategy that renders the underlying optimization problem tractable. Results show that our approach obtains state-of-the-art results on two challenging datasets: Pascal VOC 2009 and MSRC-21.
Conference Paper
Full-text available
Recent research into recognizing object classes (such as humans, cows and hands) has made use of edge features to hypothesize and localize class instances. However, for the most part, these edge-based methods operate solely on the geometric shape of edges, treating them equally and ignoring the fact that for certain object classes, the appearance of the object on the “inside” of the edge may provide valuable recognition cues. We show how, for such object classes, small regions around edges can be used to classify the edge into object or non-object. This classifier may then be used to prune edges which are not relevant to the object class, and thereby improve the performance of subsequent processing. We demonstrate learning class specific edges for a number of object classes — oranges, bananas and bottles — under challenging scale and illumination variation. Because class-specific edge classification provides a low-level analysis of the image it may be integrated into any edge-based recognition strategy without significant change in the high-level algorithms. We illustrate its application to two algorithms: (i) chamfer matching for object detection, and (ii) modulating contrast terms in MRF based object-specific segmentation. We show that performance of both algorithms (matching and segmentation) is considerably improved by the class-specific edge labelling.
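A minimal sketch of the class-specific edge labelling idea, assuming a patch classifier `clf` trained elsewhere on object-edge versus non-object-edge patches; the Canny detector and patch size are illustrative stand-ins for the paper's choices.

```python
import numpy as np
from skimage.feature import canny

def classify_edges(gray, clf, half=4):
    """Prune edges to those whose local appearance the (pre-trained,
    hypothetical) classifier clf labels as class-relevant."""
    edges = canny(gray, sigma=2.0)          # generic edge map to start from
    H, W = gray.shape
    keep = np.zeros_like(edges)
    for y, x in zip(*np.nonzero(edges)):
        if half <= y < H - half and half <= x < W - half:
            patch = gray[y - half:y + half + 1, x - half:x + half + 1]
            if clf.predict(patch.ravel()[None])[0] == 1:   # 1 = object edge
                keep[y, x] = True
    return keep
```

The pruned edge map can then feed any downstream edge-based matcher or MRF contrast term, as the abstract notes.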
Conference Paper
Full-text available
We introduce an approach to accurately detect and segment partially occluded objects in various viewpoints and scales. Our main contribution is a novel framework for combining object-level descriptions (such as position, shape, and color) with pixel-level appearance, boundary, and occlusion reasoning. In training, we exploit a rough 3D object model to learn physically localized part appearances. To find and segment objects in an image, we generate proposals based on the appearance and layout of local parts. The proposals are then refined after incorporating object-level information, and overlapping objects compete for pixels to produce a final description and segmentation of objects in the scene. A further contribution is a novel instance penalty, which is handled very efficiently during inference. We experimentally validate our approach on the challenging PASCAL'06 car database.
Conference Paper
Full-text available
In this paper, we focus on the problem of detecting the head of cat-like animals, adopting the cat as a test case. We show that the performance depends crucially on how to effectively utilize the shape and texture features jointly. Specifically, we propose a two-step approach for cat head detection. In the first step, we train two individual detectors on two training sets. One training set is normalized to emphasize the shape features and the other is normalized to underscore the texture features. In the second step, we train a joint shape and texture fusion classifier to make the final decision. We demonstrate that a significant improvement can be obtained by our two-step approach. In addition, we also propose a set of novel features based on oriented gradients, which outperforms existing leading features, e.g., Haar, HoG, and EoH. We evaluate our approach on a well-labeled cat head data set with 10,000 images and the PASCAL 2007 cat data.
Conference Paper
Full-text available
We describe a new approach for learning to perform class-based segmentation using only unsegmented training examples. As in previous methods, we first use training images to extract fragments that contain common object parts. We then show how these parts can be segmented into their figure and ground regions in an automatic learning process. This is in contrast with previous approaches, which required complete manual segmentation of the objects in the training examples. The figure-ground learning combines top-down and bottom-up processes and proceeds in two stages, an initial approximation followed by iterative refinement. The initial approximation produces figure-ground labeling of individual image fragments using the unsegmented training images. It is based on the fact that on average, points inside the object are covered by more fragments than points outside it. The initial labeling is then improved by an iterative refinement process, which converges in up to three steps. At each step, the figure-ground labeling of individual fragments produces a segmentation of complete objects in the training images, which in turn induces a refined figure-ground labeling of the individual fragments. In this manner, we obtain a scheme that starts from unsegmented training images, learns the figure-ground labeling of image fragments, and then uses this labeling to segment novel images. Our experiments demonstrate that the learned segmentation achieves the same level of accuracy as methods using manual segmentation of training images, producing an automatic and robust top-down segmentation.
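The initial approximation rests on a simple coverage statistic: pixels covered by more matched fragments are more likely figure. A sketch under simplified assumptions (axis-aligned fragment placements, an adaptive threshold) follows.

```python
import numpy as np

def initial_figure_ground(image_shape, fragment_boxes, threshold=None):
    """fragment_boxes: (y0, y1, x0, x1) placements of matched fragments
    in one training image (axis-aligned boxes as a simplification).
    Returns a boolean figure mask from the coverage statistic."""
    cover = np.zeros(image_shape, dtype=int)
    for y0, y1, x0, x1 in fragment_boxes:
        cover[y0:y1, x0:x1] += 1              # count covering fragments
    if threshold is None:
        threshold = cover[cover > 0].mean()   # simple adaptive cutoff
    return cover >= threshold
```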
Conference Paper
Full-text available
In this paper we present a novel class-based segmentation method, which is guided by a stored representation of the shape of objects within a general class (such as horse images). The approach is different from bottom-up segmentation methods that primarily use the continuity of grey-level, texture, and bounding contours. We show that the method leads to markedly improved segmentation results and can deal with significant variation in shape and varying backgrounds. We discuss the relative merits of class-specific and general image-based segmentation methods and suggest how they can be usefully combined.
Conference Paper
Full-text available
Computer vision algorithms for individual tasks such as object recognition, detection and segmentation have shown impressive results in the recent past. The next challenge is to integrate all these algorithms and address the problem of scene understanding. This paper is a step towards this goal. We present a probabilistic framework for reasoning about regions, objects, and their attributes such as object class, location, and spatial extent. Our model is a Conditional Random Field defined on pixels, segments and objects. We define a global energy function for the model, which combines results from sliding window detectors, and low-level pixel-based unary and pairwise relations. One of our primary contributions is to show that this energy function can be solved efficiently. Experimental results show that our model achieves significant improvement over the baseline methods on the CamVid and PASCAL VOC datasets.
Conference Paper
Full-text available
A new learning strategy for object detection is presented. The proposed scheme forgoes the need to train a collection of detectors dedicated to homogeneous families of poses, and instead learns a single classifier that has the inherent ability to deform based on the signal of interest. Specifically, we train a detector with a standard AdaBoost procedure by using combinations of pose-indexed features and pose estimators instead of the usual image features. This allows the learning process to select and combine various estimates of the pose with features able to implicitly compensate for variations in pose. We demonstrate that a detector built in such a manner provides noticeable gains on two hand video sequences and analyze the performance of our detector as these data sets are synthetically enriched in pose while not increased in size.
Conference Paper
Full-text available
In the task of visual object categorization, semantic context can play the very important role of reducing ambiguity in objects' visual appearance. In this work we propose to incorporate semantic object context as a post-processing step into any off-the-shelf object categorization model. Using a conditional random field (CRF) framework, our approach maximizes object label agreement according to contextual relevance. We compare two sources of context: one learned from training data and another queried from Google Sets. The overall performance of the proposed framework is evaluated on the PASCAL and MSRC datasets. Our findings conclude that incorporating context into object categorization greatly improves categorization accuracy.
Article
Full-text available
The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.
Article
Full-text available
The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
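OpenCV ships an implementation of this iterative graph-cut optimisation; standard cv2.grabCut usage looks like the following, with the image path and user rectangle assumed for illustration.

```python
import cv2
import numpy as np

img = cv2.imread('photo.jpg')                  # assumed input image
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)      # internal GMM state
fgd_model = np.zeros((1, 65), np.float64)
rect = (50, 50, 300, 400)                      # assumed user box: x, y, w, h
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
# Definite/probable foreground pixels form the segmentation.
fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
segmented = img * fg[:, :, None]
```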
Article
Full-text available
Most successful object recognition systems rely on binary classification, deciding only if an object is present or not, but not providing information on the actual object location. To estimate the object's location, one can take a sliding window approach, but this strongly increases the computational cost because the classifier or similarity function has to be evaluated over a large set of candidate subwindows. In this paper, we propose a simple yet powerful branch and bound scheme that allows efficient maximization of a large class of quality functions over all possible subimages. It converges to a globally optimal solution typically in linear or even sublinear time, in contrast to the quadratic scaling of exhaustive or sliding window search. We show how our method is applicable to different object detection and image retrieval scenarios. The achieved speedup allows the use of classifiers for localization that formerly were considered too slow for this task, such as SVMs with a spatial pyramid kernel or nearest-neighbor classifiers based on the χ² distance. We demonstrate state-of-the-art localization performance of the resulting systems on the UIUC Cars data set, the PASCAL VOC 2006 data set, and in the PASCAL VOC 2007 competition.
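A compact sketch of the branch-and-bound idea, specialised to the case where the quality function is a sum of per-pixel scores; the interval parametrisation of rectangle sets and the union/intersection bound follow the scheme described, with details simplified.

```python
import heapq
import numpy as np

def _box_sum(ii, t, b, l, r):
    # Inclusive rectangle sum from a zero-padded integral image.
    return ii[b + 1, r + 1] - ii[t, r + 1] - ii[b + 1, l] + ii[t, l]

def ess(scores):
    """scores: (H, W) per-pixel classifier contributions.
    Returns (best value, (t, b, l, r)) of the best-scoring subwindow."""
    H, W = scores.shape
    pos = np.pad(np.maximum(scores, 0).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    neg = np.pad(np.minimum(scores, 0).cumsum(0).cumsum(1), ((1, 0), (1, 0)))

    def upper_bound(box):
        (t0, t1), (b0, b1), (l0, l1), (r0, r1) = box
        if t0 > b1 or l0 > r1:
            return -np.inf                       # the set holds no rectangle
        ub = _box_sum(pos, t0, b1, l0, r1)       # positives over the union
        if t1 <= b0 and l1 <= r0:                # intersection rectangle exists
            ub += _box_sum(neg, t1, b0, l1, r0)  # negatives over intersection
        return ub

    box = ((0, H - 1), (0, H - 1), (0, W - 1), (0, W - 1))
    heap = [(-upper_bound(box), box)]
    while heap:
        nub, box = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in box]
        if max(widths) == 0:                     # a single rectangle: optimal
            t, b, l, r = (iv[0] for iv in box)
            return -nub, (t, b, l, r)
        i = int(np.argmax(widths))               # split the widest interval
        lo, hi = box[i]
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):
            nb = box[:i] + (half,) + box[i + 1:]
            heapq.heappush(heap, (-upper_bound(nb), nb))
```

Because the bound is exact once a set shrinks to one rectangle, the first singleton popped from the best-first queue is globally optimal, which is what gives the typically sublinear behaviour.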
Conference Paper
Full-text available
The visual tracking of human faces is a basic functionality needed for human-machine interfaces. This paper describes an approach that explores the combined use of adaptive skin color segmentation and face detection for improved face tracking on a mobile robot. To cope with inhomogeneous lighting within a single image, the color of each tracked image region is modeled with an individual, unimodal Gaussian. Face detection is performed locally on all segmented skin-colored regions. If a face is detected, the appropriate color model is updated with the image pixels in an elliptical area around the face position. Updating is restricted to pixels that are contained in a global skin color distribution obtained off-line. The presented method allows us to track faces that undergo changes in lighting conditions while at the same time providing information about the attention of the user, i.e. whether the user looks at the robot. This forms the basis for developing more sophisticated human-machine interfaces capable of dealing with unrestricted environments.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
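HOG descriptors of this kind are available off the shelf; a minimal sketch of the descriptor-plus-linear-SVM pipeline, with the window size and training data assumed.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def describe(window):
    """window: 128x64 grayscale detection window (assumed size)."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Training data (cropped windows + labels) assumed prepared elsewhere:
# X = np.stack([describe(w) for w in train_windows])
# clf = LinearSVC(C=0.01).fit(X, train_labels)
# score = clf.decision_function(describe(test_window)[None])
```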
Article
Current approaches to object category recognition require datasets of training images to be manually prepared, with varying degrees of supervision. We present an approach that can learn an object category from just its name, by utilizing the raw output of image search engines available on the Internet. We develop a new model, TSI-pLSA, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner. Our approach can handle the high intra-class variability and large proportion of unrelated images returned by search engines. We evaluate the models on standard test sets, showing performance competitive with existing methods trained on hand prepared datasets.
This paper presents a unified framework for object detection, segmentation, and classification using regions. Region features are appealing in this context because: (1) they encode shape and scale information of objects naturally; (2) they are only mildly affected by background clutter. Regions have not been popular as features due to their sensitivity to segmentation errors. In this paper, we start by producing a robust bag of overlaid regions for each image using Arbelaez et al., CVPR 2009. Each region is represented by a rich set of image cues (shape, color and texture). We then learn region weights using a max-margin framework. In detection and segmentation, we apply a generalized Hough voting scheme to generate hypotheses of object locations, scales and support, followed by a verification classifier and a constrained segmenter on each hypothesis. The proposed approach significantly outperforms the state of the art on the ETHZ shape database (87.1% average detection rate compared to Ferrari et al.'s 67.2%), and achieves competitive performance on the Caltech 101 database.
Conference Paper
Bottom-up segmentation based only on low-level cues is a notoriously difficult problem. This difficulty has led to recent top-down segmentation algorithms that are based on class-specific image information. Despite the success of top-down algorithms, they often give coarse segmentations that can be significantly refined using low-level cues. This raises the question of how to combine both top-down and bottom-up cues in a principled manner. In this paper we approach this problem using supervised learning. Given a training set of ground truth segmentations we train a fragment-based segmentation algorithm which takes into account both bottom-up and top-down cues simultaneously, in contrast to most existing algorithms which train top-down and bottom-up modules separately. We formulate the problem in the framework of Conditional Random Fields (CRF) and derive a novel feature induction algorithm for CRF, which allows us to efficiently search over thousands of candidate fragments. Whereas pure top-down algorithms often require hundreds of fragments, our simultaneous learning procedure yields algorithms with a handful of fragments that are combined with low-level cues to efficiently compute high quality segmentations.
Article
In this paper we present a computationally efficient framework for part-based modeling and recognition of objects. Our work is motivated by the pictorial structure models introduced by Fischler and Elschlager. The basic idea is to represent an object by a collection of parts arranged in a deformable configuration. The appearance of each part is modeled separately, and the deformable configuration is represented by spring-like connections between pairs of parts. These models allow for qualitative descriptions of visual appearance, and are suitable for generic recognition problems. We address the problem of using pictorial structure models to find instances of an object in an image as well as the problem of learning an object model from training examples, presenting efficient algorithms in both cases. We demonstrate the techniques by learning models that represent faces and human bodies and using the resulting models to locate the corresponding objects in novel images.
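The spring model reduces to maximising appearance scores minus a quadratic deformation cost. A brute-force sketch for a root plus a single part follows; the paper's generalized distance transform makes this step linear-time rather than the quadratic loop shown here.

```python
import numpy as np

def best_root_location(root_score, part_score, offset=(0, 0), k=0.1):
    """root_score, part_score: (H, W) appearance maps; offset is the
    part's rest position relative to the root; k is the spring stiffness.
    Brute force over all part placements for clarity."""
    H, W = root_score.shape
    ys, xs = np.mgrid[0:H, 0:W]             # candidate root locations
    best_part = np.full((H, W), -np.inf)
    for py in range(H):
        for px in range(W):
            dy = py - (ys + offset[0])      # displacement from rest position
            dx = px - (xs + offset[1])
            cand = part_score[py, px] - k * (dy ** 2 + dx ** 2)
            best_part = np.maximum(best_part, cand)
    total = root_score + best_part          # combined two-part score
    return np.unravel_index(np.argmax(total), total.shape)
```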
Conference Paper
High-level, or holistic, scene understanding involves reasoning about objects, regions, and the 3D relationships between them. This requires a representation above the level of pixels that can be endowed with high-level attributes such as class of object/region, its orientation, and (rough 3D) location within the scene. Towards this goal, we propose a region-based model which combines appearance and scene geometry to automatically decompose a scene into semantically meaningful regions. Our model is defined in terms of a unified energy function over scene appearance and structure. We show how this energy function can be learned from data and present an efficient inference technique that makes use of multiple over-segmentations of the image to propose moves in the energy-space. We show, experimentally, that our method achieves state-of-the-art performance on the tasks of both multi-class image segmentation and geometric reasoning. Finally, by understanding region classes and geometry, we show how our model can be used as the basis for 3D reconstruction of the scene.
Conference Paper
Our objective is to obtain a state-of-the-art object category detector by employing a state-of-the-art image classifier to search for the object in all possible image sub-windows. We use the multiple kernel learning of Varma and Ray (ICCV 2007) to learn an optimal combination of exponential χ² kernels, each of which captures a different feature channel. Our features include the distribution of edges, dense and sparse visual words, and feature descriptors at different levels of spatial organization. Such a powerful classifier cannot be tested on all image sub-windows in a reasonable amount of time. Thus we propose a novel three-stage classifier, which combines linear, quasi-linear, and non-linear kernel SVMs. We show that increasing the non-linearity of the kernels increases their discriminative power, at the cost of an increased computational complexity. Our contributions include (i) showing that a linear classifier can be evaluated with a complexity proportional to the number of sub-windows (independent of the sub-window area and descriptor dimension); (ii) a comparison of three efficient methods of proposing candidate regions (including the jumping window classifier of Chum and Zisserman (CVPR 2007) based on proposing windows from scale invariant features); and (iii) introducing overlap-recall curves as a means to compare and optimize the performance of the intermediate pipeline stages. The method is evaluated on the PASCAL Visual Object Detection Challenge, and exceeds the performance of previously published methods for most of the classes.
Article
This paper evaluates the performance both of some texture measures which have been successfully used in various applications and of some new promising approaches proposed recently. For classification a method based on Kullback discrimination of sample and prototype distributions is used. The classification results for single features with one-dimensional feature value distributions and for pairs of complementary features with two-dimensional distributions are presented.
Conference Paper
This paper introduces a new method for semi-supervised learning on high-dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high-dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods.
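A sketch of the coding step: each point is expressed as weights over its nearest anchors. Here an LLE-style constrained least-squares solver stands in for the paper's formulation; the neighbourhood size and regulariser are assumptions.

```python
import numpy as np

def local_coding(x, anchors, k=5):
    """x: (d,) data point; anchors: (m, d) learned anchor points.
    Returns a sparse code over the k nearest anchors."""
    d2 = ((anchors - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                   # local neighbourhood
    D = (anchors[nn] - x).T                   # (d, k) centred anchors
    G = D.T @ D                               # local Gram matrix
    w = np.linalg.solve(G + 1e-8 * np.eye(k), np.ones(k))
    w /= w.sum()                              # weights sum to one
    code = np.zeros(len(anchors))
    code[nn] = w
    return code                               # local coordinate code
```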
This paper proposes a new object representation, called the connected segmentation tree (CST), which captures canonical characteristics of the object in terms of the photometric, geometric, and spatial adjacency and containment properties of its constituent image regions. CST is obtained by augmenting the object's segmentation tree (ST) with inter-region neighbor links, in addition to their recursive embedding structure already present in ST. This makes CST a hierarchy of region adjacency graphs. A region's neighbors are computed using an extension to regions of the Voronoi diagram for point patterns. Unsupervised learning of the CST model of a category is formulated as matching the CST graph representations of unlabeled training images, and fusing their maximally matching subgraphs. A new learning algorithm is proposed that optimizes the model structure by simultaneously searching for both the most salient nodes (regions) and the most salient edges (containment and neighbor relationships of regions) across the image graphs. Matching of the category model to the CST of a new image results in simultaneous detection, segmentation and recognition of all occurrences of the category, and a semantic explanation of these results.
We present an approach to visual object-class recognition and segmentation based on a pipeline that combines multiple, holistic figure-ground hypotheses generated in a bottom-up, object-independent process. Decisions are performed based on continuous estimates of the spatial overlap between image segment hypotheses and each putative class. We differ from existing approaches not only in our seemingly unreasonable assumption that good object-level segments can be obtained in a feed-forward fashion, but also in framing recognition as a regression problem. Instead of focusing on a one-vs-all winning margin that can scramble ordering inside the non-maximum (non-winning) set, learning produces a globally consistent ranking with close ties to segment quality, hence to the extent to which entire object or part hypotheses spatially overlap with the ground truth. We demonstrate results beyond the current state of the art for image classification, object detection and semantic segmentation, in a number of challenging datasets including Caltech-101, ETHZ-Shape and PASCAL VOC 2009.
We present an approach for object recognition that combines detection and segmentation within an efficient hypothesize/test framework. Scanning-window template classifiers are the current state-of-the-art for many object classes such as faces, cars, and pedestrians. Such approaches, though quite successful, can be hindered by their lack of explicit encoding of object shape/structure: one might, for example, find faces in trees. We adopt the following strategy: we first use these systems as attention mechanisms, generating many possible object locations by tuning them for low missed detections and high false positives. At each hypothesized detection, we compute a local figure-ground segmentation using a window of slightly larger extent than that used by the classifier. This segmentation task is guided by top-down knowledge. We learn offline from training data those segmentations that are consistent with true positives. We then prune away those hypotheses with bad segmentations. We show this strategy leads to significant improvements (10-20%) over established approaches such as Viola-Jones and Dalal-Triggs on a variety of benchmark datasets including the PASCAL challenge, LabelMe, and the INRIA Person dataset.
Conference Paper
In this paper we present a principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues. The work draws together two powerful formulations: pictorial structures (PS) and Markov random fields (MRFs), both of which have efficient algorithms for their solution. The resulting combination, which we call the Object Category Specific MRF, suggests a solution to the problem that has long dogged MRFs, namely that they provide a poor prior for specific shapes. In contrast, our model provides a prior that is global across the image plane using the PS. We develop an efficient method, OBJCUT, to obtain segmentations using this model. Novel aspects of this method include an efficient algorithm for sampling the PS model, and the observation that the expected log likelihood of the model can be increased by a single graph cut. Results are presented on two object categories, cows and horses. We compare our methods to the state of the art in object category specific image segmentation and demonstrate significant improvements.
Conference Paper
The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors.
Conference Paper
We present a method for object detection that combines AdaBoost learning with local histogram features. On the side of learning we improve the performance by designing a weak learner for multi-valued features based on the Weighted Fisher Linear Discriminant. Evaluation on the recent benchmark for object detection confirms the superior performance of our method compared to the state-of-the-art. In particular, using a single set of parameters our approach outperforms all methods reported in (5) for 7 out of 8 detection tasks and four object classes.
Conference Paper
We address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet. We postulate two criteria: (1) it should be easy to find a poselet given an input image; (2) it should be easy to localize the 3D configuration of the person conditioned on the detection of a poselet. To permit this we have built a new dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. This enables us to implement a data-driven search procedure for finding poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. The algorithm discovers poselets that correspond to frontal and profile faces, pedestrians, head and shoulder views, among others. Each poselet provides examples for training a linear SVM classifier which can then be run over the image in a multiscale scanning mode. The outputs of these poselet detectors can be thought of as an intermediate layer of nodes, on top of which one can run a second layer of classification or regression. We show how this permits detection and localization of torsos or keypoints such as left shoulder, nose, etc. Experimental results show that we obtain state-of-the-art performance on people detection in the PASCAL VOC 2007 challenge, among other datasets. We are making publicly available both the H3D dataset as well as the poselet parameters for use by other researchers.
Article
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
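The alternation in the last sentence can be sketched with a stock linear SVM standing in for the latent SVM solver; `infer_latent` and `feat` are assumed helpers (latent placement inference and feature extraction), not part of any released code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_latent_svm(positives, neg_features, infer_latent, feat, iters=5):
    """positives: raw positive examples with unknown latent placements.
    neg_features: (N, d) feature matrix for negatives (latents fixed).
    infer_latent(x, clf): assumed helper returning the best placement
    under the current model (a default placement when clf is None).
    feat(x, z): assumed feature extractor for example x at placement z."""
    clf = None
    for _ in range(iters):
        # Fix latent values for the positives using the current model...
        pos_features = np.stack([feat(x, infer_latent(x, clf))
                                 for x in positives])
        X = np.vstack([pos_features, neg_features])
        y = np.concatenate([np.ones(len(positives)),
                            -np.ones(len(neg_features))])
        # ...then the objective is convex: retrain the linear SVM.
        clf = LinearSVC(C=0.002).fit(X, y)
    return clf
```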
Article
Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classifiers. The efficiency of this strategy for a complex pose, i.e., for fine-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: i) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; ii) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efficient, one-shot learning of pose-specific classifiers. To avoid expensive scene processing, we arrange these classifiers in a hierarchy based on nested partitions of the pose as in previous work, which allows for efficient search. The hierarchy is then "folded" for training: all the classifiers at each level are derived from one base predictor learned from all the data. The hierarchy is "unfolded" for testing: parsing a scene amounts to examining increasingly finer object descriptions only when there is sufficient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes.
Article
The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness, color, and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, we train a classifier using human labeled images as ground truth. The output of this classifier provides the posterior probability of a boundary at each image location and orientation. We present precision-recall curves showing that the resulting detector significantly outperforms existing approaches. Our two main results are 1) that cue combination can be performed adequately with a simple linear model and 2) that a proper, explicit treatment of texture is required to detect boundaries in natural images.
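The cue-combination step amounts to a per-pixel linear model over local feature responses. A simplified sketch with stand-in features follows; the real detector uses carefully designed oriented brightness, color, and texture gradients rather than the crude proxies here.

```python
import numpy as np
from scipy import ndimage
from sklearn.linear_model import LogisticRegression

def cue_features(gray):
    """Per-pixel stand-in cues: fine and coarse brightness gradients
    plus a Laplacian-of-Gaussian response (a crude texture proxy)."""
    fine = np.hypot(ndimage.sobel(gray, 0), ndimage.sobel(gray, 1))
    smooth = ndimage.gaussian_filter(gray, 2)
    coarse = np.hypot(ndimage.sobel(smooth, 0), ndimage.sobel(smooth, 1))
    texture = np.abs(ndimage.gaussian_laplace(gray, 1.5))
    return np.stack([fine, coarse, texture], axis=-1).reshape(-1, 3)

# Training against human-labelled boundary masks (assumed available):
# X = np.vstack([cue_features(g) for g in gray_images])
# y = np.concatenate([m.ravel() for m in boundary_masks])
# pb = LogisticRegression(max_iter=1000).fit(X, y)  # P(boundary | cues)
```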
Object models based on bag-of-words representations can achieve state-of-the-art performance for image classification and object localization tasks. However, because they consider objects as loose collections of local patches, they fail to accurately locate object boundaries and are not able to produce accurate object segmentation. On the other hand, Markov random field models used for image segmentation focus on object boundaries but can hardly use the global constraints necessary to deal with object categories whose appearance may vary significantly. In this paper we combine the advantages of both approaches. First, a mechanism based on local regions allows object detection using visual word occurrences and produces a rough image segmentation. Then, a MRF component gives clean boundaries and enforces label consistency, guided by local image cues (color, texture and edge cues) and by long-distance dependencies. Gibbs sampling is used to infer the model. The proposed method successfully segments object categories with highly varying appearances in the presence of cluttered backgrounds and large view point changes. We show that it outperforms published results on the Pascal VOC 2007 dataset.
Edge detection is one of the most studied problems in computer vision, yet it remains a very challenging task. It is difficult since often the decision for an edge cannot be made purely based on low-level cues such as gradient; instead, we need to engage all levels of information, low, middle, and high, in order to decide where to put edges. In this paper we propose a novel supervised learning algorithm for edge and object boundary detection which we refer to as Boosted Edge Learning, or BEL for short. A decision of an edge point is made independently at each location in the image; a very large aperture is used, providing significant context for each decision. In the learning stage, the algorithm selects and combines a large number of features across different scales in order to learn a discriminative model using an extended version of the Probabilistic Boosting Tree classification algorithm. The learning-based framework is highly adaptive and there are no parameters to tune. We show applications for edge detection in a number of specific image domains as well as on natural images. We test on various datasets including the Berkeley dataset and the results obtained are very good.
Given a large dataset of images, we seek to automatically determine the visually similar object and scene classes together with their image segmentation. To achieve this we combine two ideas: (i) that a set of segmented objects can be partitioned into visual object classes using topic discovery models from statistical text analysis; and (ii) that visual object classes can be used to assess the accuracy of a segmentation. To tie these ideas together we compute multiple segmentations of each image and then: (i) learn the object classes; and (ii) choose the correct segmentations. We demonstrate that such an algorithm succeeds in automatically discovering many familiar objects in a variety of image datasets, including those from Caltech, MSRC and LabelMe.
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
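A sketch of the pyramid histogram itself, assuming a precomputed map of visual-word indices; the three-level grid and the standard pyramid-match level weights follow the description above.

```python
import numpy as np

def spatial_pyramid(word_map, vocab_size, levels=3):
    """word_map: (H, W) array of visual-word indices per patch.
    Concatenates word histograms over 1x1, 2x2, 4x4 grids."""
    H, W = word_map.shape
    feats = []
    for level in range(levels):
        cells = 2 ** level
        # Level weights: 1/2^(L-1) for level 0, 1/2^(L-l) for l >= 1.
        weight = 1.0 / 2 ** (levels - 1) if level == 0 \
            else 1.0 / 2 ** (levels - level)
        for i in range(cells):
            for j in range(cells):
                cell = word_map[i * H // cells:(i + 1) * H // cells,
                                j * W // cells:(j + 1) * W // cells]
                hist = np.bincount(cell.ravel(), minlength=vocab_size)
                feats.append(weight * hist / max(cell.size, 1))
    return np.concatenate(feats)
```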
Article
Many tasks in computer vision involve assigning a label (such as disparity) to every pixel. A common constraint is that the labels should vary smoothly almost everywhere while preserving sharp discontinuities that may exist, e.g., at object boundaries. These tasks are naturally stated in terms of energy minimization. The authors consider a wide class of energies with various smoothness constraints. Global minimization of these energy functions is NP-hard even in the simplest discontinuity-preserving case. Therefore, our focus is on efficient approximation algorithms. We present two algorithms based on graph cuts that efficiently find a local minimum with respect to two types of large moves, namely expansion moves and swap moves. These moves can simultaneously change the labels of arbitrarily large sets of pixels. In contrast, many standard algorithms (including simulated annealing) use small moves where only one pixel changes its label at a time. Our expansion algorithm finds a labeling within a known factor of the global minimum, while our swap algorithm handles more general energy functions. Both of these algorithms allow important cases of discontinuity-preserving energies. We experimentally demonstrate the effectiveness of our approach for image restoration, stereo and motion. On real data with ground truth, we achieve 98 percent accuracy.
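Each expansion or swap move reduces to a binary s-t min-cut. A toy sketch of one such binary cut on a 4-connected grid follows, using networkx max-flow with assumed toy energies; a production implementation would use a dedicated maxflow solver.

```python
import networkx as nx
import numpy as np

def binary_graph_cut(unary0, unary1, pairwise=1.0):
    """unary0/unary1: (H, W) per-pixel costs for labels 0 and 1.
    Minimises unaries plus a Potts smoothness term via s-t min-cut."""
    H, W = unary0.shape
    G = nx.DiGraph()
    for y in range(H):
        for x in range(W):
            p = (y, x)
            # Cutting s->p pays unary1 (p lands on the sink side, label 1);
            # cutting p->t pays unary0 (p lands on the source side, label 0).
            G.add_edge('s', p, capacity=float(unary1[y, x]))
            G.add_edge(p, 't', capacity=float(unary0[y, x]))
            for q in ((y + 1, x), (y, x + 1)):       # 4-connected grid
                if q[0] < H and q[1] < W:
                    G.add_edge(p, q, capacity=pairwise)
                    G.add_edge(q, p, capacity=pairwise)
    _, (source_side, _) = nx.minimum_cut(G, 's', 't')
    labels = np.ones((H, W), dtype=int)
    for node in source_side:
        if node != 's':
            labels[node] = 0          # source-side pixels take label 0
    return labels
```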
D. Hoiem, C. Rother, and J. M. Winn. 3D layout CRF for multi-view object class recognition and segmentation. In Proc. CVPR, 2008.