Conference Paper

Cats and dogs


Abstract

We investigate the fine-grained object categorization problem of determining the breed of animal from an image. To this end we introduce a new annotated dataset of pets covering 37 different breeds of cats and dogs. The visual problem is very challenging as these animals, particularly cats, are very deformable and there can be quite subtle differences between the breeds. We make a number of contributions: first, we introduce a model to classify a pet breed automatically from an image. The model combines shape, captured by a deformable part model detecting the pet face, and appearance, captured by a bag-of-words model that describes the pet fur. Fitting the model involves automatically segmenting the animal in the image. Second, we compare two classification approaches: a hierarchical one, in which a pet is first assigned to the cat or dog family and then to a breed, and a flat one, in which the breed is obtained directly. We also investigate a number of animal- and image-oriented spatial layouts. These models are very good: they beat all previously published results on the challenging ASIRRA test (cat vs dog discrimination). When applied to the task of discriminating the 37 different breeds of pets, the models obtain an average accuracy of about 59%, a very encouraging result considering the difficulty of the problem.
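A minimal sketch (not the authors' implementation) of the flat versus hierarchical comparison described above, assuming precomputed image feature vectors and integer breed labels as NumPy arrays, plus a hypothetical `family_of` map from breed label to cat/dog family:

```python
# Illustrative only: contrasts a flat 37-way breed classifier with a
# hierarchical family-then-breed classifier. Feature extraction (shape +
# bag-of-words appearance) is assumed to have been done elsewhere.
import numpy as np
from sklearn.svm import LinearSVC

def flat_breed_classifier(X_train, y_breed, X_test):
    clf = LinearSVC().fit(X_train, y_breed)          # one multi-class linear SVM
    return clf.predict(X_test)

def hierarchical_breed_classifier(X_train, y_breed, family_of, X_test):
    y_breed = np.asarray(y_breed)
    y_family = np.array([family_of[b] for b in y_breed])
    fam_clf = LinearSVC().fit(X_train, y_family)      # stage 1: cat vs dog
    breed_clf = {                                     # stage 2: per-family breed SVMs
        f: LinearSVC().fit(X_train[y_family == f], y_breed[y_family == f])
        for f in np.unique(y_family)
    }
    fam_pred = fam_clf.predict(X_test)
    return np.array([breed_clf[f].predict(x[None, :])[0]
                     for f, x in zip(fam_pred, X_test)])
```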


Article
Full-text available
Recently, convolutional neural networks (CNNs) have shown significant success in image classification tasks. However, studies show that neural networks are susceptible to tiny perturbations: when a perturbed image is input, the network may make a different judgment. At present, most studies exploit the negative side of perturbation, such as adversarial examples, to mislead the neural network. In this paper, considering the positive side of perturbation, we propose Enhanced Samples Generative Adversarial Networks (ESGAN) to generate high-quality enhanced samples with positive perturbation, designed to further improve the performance of the target classifier. Enhanced-sample generation consists of two parts: a super-resolution (SR) network that generates high visual quality images, and a noise network that generates positive perturbations. Our ESGAN is independent of the target classifier, so it can improve performance without retraining the classifier, thus effectively reducing the computing resources and training time required. Experiments show that the enhanced samples generated by our proposed ESGAN can effectively improve the performance of the target classifier without affecting human eye recognition.
Article
Onfocus detection aims at identifying whether the focus of the individual captured by a camera is on the camera or not. Behavioral research shows that the focus of an individual during face-to-camera communication leads to a special type of eye contact, i.e., individual-camera eye contact, which is a powerful signal in social communication and plays a crucial role in recognizing irregular individual status (e.g., lying or suffering from mental disease) and special purposes (e.g., seeking help or attracting fans). Thus, developing effective onfocus detection algorithms is of significance for assisting criminal investigation, disease discovery, and social behavior analysis. However, a review of the literature shows that very few efforts have been made toward the development of onfocus detectors, owing to the lack of large-scale publicly available datasets as well as the challenging nature of this task. To this end, this paper engages in onfocus detection research by addressing the above two issues. Firstly, we build a large-scale onfocus detection dataset, named OnFocus Detection In the Wild (OFDIW). It consists of 20,623 images in unconstrained capture conditions (thus called “in the wild”) and contains individuals with diverse emotions, ages, facial characteristics, and rich interactions with surrounding objects and background scenes. On top of that, we propose a novel end-to-end deep model, the eye-context interaction inferring network (ECIIN), for onfocus detection, which explores eye-context interaction via dynamic capsule routing. Finally, comprehensive experiments are conducted on the proposed OFDIW dataset to benchmark existing learning models and demonstrate the effectiveness of the proposed ECIIN.
Article
Full-text available
We explore using the theory of Capsule Networks (CapsNet) in Generative Adversarial Networks (GAN). Traditional Convolutional Neural Networks (CNNs) cannot model the spatial relationship between parts and the whole, so they lose some of the target’s attribute information such as direction and posture. The Capsule Network, proposed by Hinton in 2017, overcomes this defect of CNNs. In order to utilize the attributes of the target as much as possible, we propose E-CapsGan, which applies CapsNet to encode the input image attribute features and guide the data generation of the GAN. We explore the application of E-CapsGan in two scenarios. For image generation, we propose E-CapsGan1, which uses CapsNet as an additional attribute feature encoder to obtain image attribute features to guide the GAN. For image compression encoding, we explore E-CapsGan2, which employs CapsNet as the encoder to compress images into vectors, and the GAN as the decoder to reconstruct the original images from those vectors. On multiple datasets, qualitative and quantitative experiments demonstrate the superior performance of E-CapsGan1 in image generation and the feasibility of E-CapsGan2 in image compression encoding.
Article
Full-text available
This paper focuses on the face detection problem for three popular animal categories that need control: horses, cats and dogs. Existing detectors are generally based on Convolutional Neural Networks (CNNs) as backbones. CNNs are strong classification tools but present some weak points: a large number of layers and parameters, the need for a huge dataset, and ignorance of the relationships between image parts. To deal with these problems, this paper presents a new Convolutional Neural Network for Animal Face Detection (CNNAFD), a new backbone CNNAFD-MobileNetV2 for animal face detection, and a new Tunisian Horse Detection Database (THDD). CNNAFD uses processed filters based on gradient features, applied in a new way. A new sparse convolutional layer, ANOFS-Conv, is proposed through a sparse feature selection method known as Automated Negotiation-based Online Feature Selection (ANOFS). The ANOFS method is used as a training optimizer for the new ANOFS-Conv layer. CNNAFD ends with stacked fully connected layers, which form a strong classifier. The fusion of CNNAFD and MobileNetV2 constructs the new network CNNAFD-MobileNetV2, which improves the classification results and gives better detection decisions. The proposed detector with the new CNNAFD-MobileNetV2 network provides effective results and proves competitive with the detectors of related works, with an Average Precision of 98.28%, 99.78%, 99.00% and 92.86% on the THDD, Cat Database, Stanford Dogs Dataset and Oxford-IIIT Pet Dataset, respectively.
Article
Full-text available
Recently, the Internet of Things (IoT) has gained a lot of attention, since IoT devices are deployed in various fields. Many of these devices rely on machine learning (ML) models, which render them intelligent and able to make decisions. IoT devices typically have limited resources, which restricts the execution of complex ML models such as deep learning (DL) on them. In addition, connecting IoT devices to the cloud to transfer raw data and perform processing causes delayed system responses, exposes private data and increases communication costs. To tackle these issues, a new technology called Tiny Machine Learning (TinyML) has paved the way to meet the challenges of IoT devices. This technology allows data to be processed locally on the device without the need to send it to the cloud. In addition, TinyML permits the inference of ML models, including DL models, on devices such as microcontrollers with limited resources. The aim of this paper is to provide an overview of the TinyML revolution and a review of TinyML studies; the main contribution is an analysis of the types of ML models used in these studies, together with details of the datasets and the types and characteristics of the devices, with the aim of clarifying the state of the art and envisioning development requirements.
Article
Full-text available
The field of Deep Learning (DL) has seen a remarkable series of developments with increasingly accurate and robust algorithms. However, the increase in performance has been accompanied by an increase in the parameters, complexity, and training time of the models, which means that we are rapidly reaching a point where DL may no longer be feasible. On the other hand, some specific applications need to be carefully considered when developing DL models due to hardware limitations or power requirements. In this context, there is a growing interest in efficient DL algorithms, with Spiking Neural Networks (SNNs) being one of the most promising paradigms. Due to the inherent asynchrony and sparseness of spike trains, these types of networks have the potential to reduce power consumption while maintaining relatively good performance. This is attractive for efficient DL and if successful, could replace traditional Artificial Neural Networks (ANNs) in many applications. However, despite significant progress, the performance of SNNs on benchmark datasets is often lower than that of traditional ANNs. Moreover, due to the non-differentiable nature of their activation functions, it is difficult to train SNNs with direct backpropagation, so appropriate training strategies must be found. Nevertheless, significant efforts have been made to develop competitive models. This survey covers the main ideas behind SNNs and reviews recent trends in learning rules and network architectures, with a particular focus on biologically inspired strategies. It also provides some practical considerations of state-of-the-art SNNs and discusses relevant research opportunities.
Article
Full-text available
Plant disease can diminish a considerable portion of the agricultural products on each farm. The main goal of this work is to provide visual information for farmers to enable them to take the necessary preventive measures. A lightweight deep learning approach is proposed based on the Vision Transformer (ViT) for real-time automated plant disease classification. In addition to the ViT, classical convolutional neural network (CNN) methods and combinations of CNN and ViT have been implemented for plant disease classification. The models have been trained and evaluated on multiple datasets. Comparison of the obtained results shows that although attention blocks increase accuracy, they slow down prediction; combining attention blocks with CNN blocks can compensate for this loss of speed.
Article
The relevance of the tasks of detecting and recognizing objects in images and image sequences has only increased over the years. Over the past few decades, a huge number of approaches and methods have been proposed for detecting both anomalies, that is, image areas whose characteristics differ from the predicted ones, and objects of interest, about whose properties a priori information is available, up to a library of reference templates. In this work, an attempt is made to systematically analyze trends in the development of detection approaches and methods, the reasons behind these developments, and the metrics designed to assess the quality and reliability of object detection. Detection techniques based on mathematical models of images are considered, with special attention paid to approaches based on models of random fields and likelihood ratios. The development of convolutional neural networks intended for solving recognition problems is analyzed, including a number of pre-trained architectures that provide high efficiency on this problem; rather than using mathematical models, such architectures are trained using libraries of real images. Among the characteristics for assessing detection quality, the probabilities of errors of the first and second kind, precision and recall of detection, intersection over union, and interpolated average precision are considered. The paper also presents typical tests that are used to compare various neural network algorithms.
Article
Full-text available
Continued research on the epidermal electronic sensor aims to develop sophisticated platforms that reproduce key multimodal responses in human skin, with the ability to sense various external stimuli, such as pressure, shear, torsion, and touch. The development of such applications relies on algorithmic interpretation to analyze the complex stimulus shape, magnitude, and the various moduli of the epidermis, requiring multiple complex equations for the attached sensor. In this work, we integrate silicon piezoresistors with a customized deep learning data process to facilitate the precise evaluation and assessment of various stimuli without the need for such complexities. Surpassing conventional vanilla deep regression models, the customized regression and classification model predicts the magnitude of the external force, epidermal hardness and object shape with an average mean absolute percentage error below 15% and an accuracy of 96.9%, respectively. The technical ability of the deep learning-aided sensor and the consequent accurate data processing provide an important foundation for future sensory electronic systems.
Article
Full-text available
Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time on word tuning, since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
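A minimal PyTorch sketch of the unified-context idea described above: a handful of learnable context vectors shared across classes are prepended to frozen class-name token embeddings, and only the context vectors receive gradients. The dimensions and the random class-name embeddings are placeholders, not CLIP's actual tokenizer or text encoder.

```python
import torch
import torch.nn as nn

class UnifiedContextPrompt(nn.Module):
    """CoOp-style sketch: learnable context vectors shared across all classes."""
    def __init__(self, n_classes: int, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))          # trained
        # Stand-in for frozen class-name token embeddings from the text encoder.
        self.register_buffer("cls_tokens", torch.randn(n_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        ctx = self.ctx.unsqueeze(0).expand(self.cls_tokens.size(0), -1, -1)
        return torch.cat([ctx, self.cls_tokens], dim=1)   # one prompt per class

prompts = UnifiedContextPrompt(n_classes=37)()             # e.g. 37 pet breeds
print(prompts.shape)                                       # torch.Size([37, 17, 512])
```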
Article
Fine-grained visual classification (FGVC) involves small inter-class variations and large intra-class variations; therefore, recognizing sub-classes belonging to the same meta-class is a difficult task. Recent studies have primarily addressed this problem by locating the most discriminative image regions, and the extracted image regions have been used to improve the ability to capture subtle differences. Most of these studies used regular anchors to extract local features. However, the local features of the target mostly have irregular geometric shapes, so these methods cannot fully extract the features and inevitably include a large amount of irrelevant information, reducing the credibility of the evaluation results. Moreover, the spatial relationship between the features is easily overlooked. This study proposes a novel local feature extraction anchor generator (LFEAG) to simulate the shapes of irregular features, so that discriminative features can be fully included in the extracted features. In addition, an effective symmetrized local feature extraction module (SLFEM) based on an attention mechanism is proposed to fully use the spatial relationship between the extracted local features and highlight discriminative features. Experiments on six popular fine-grained benchmark datasets (CUB-200-2011, Stanford Dogs, Food-101, Oxford-IIIT Pets, Aircraft and NA-Birds) demonstrate the advantages of our proposed method.
Article
Full-text available
3D shape reconstruction from a single-view image is a severely ill-posed and challenging problem, while multi-view methods can reconstruct an object’s shape from raw images alone. However, these raw images must be shot in a static scene, so that corresponding features in the images can be mapped to the same spatial location. Recent single-view methods need only single-view images of static or dynamic objects, turning to prior knowledge to mine the latent multi-view information in single-view images. Some of them utilize prior models (e.g. rendering-based or style-transfer-based) to generate novel-view images to feed their model, but these are not sufficiently accurate. In this paper, we present the Augmented Self-Supervised 3D Reconstruction with Monotonous Material (ASRMM) approach, trained end-to-end in a self-supervised manner, to obtain the 3D reconstruction of a category-specific object without any prior models for novel-view images. Our approach draws inspiration from the observations that (1) high-quality multi-view images are difficult to obtain, and (2) the shape of an object of a single material can be visually inferred more easily than that of an object made of multiple kinds of complex material. To put these motivations into practice, ASRMM makes the material monotonous in its diffuse part by setting the reflectance to an identical value, and applies this idea to both the source and reconstruction images. Experiments show that our model can reasonably reconstruct the 3D models of faces, cats, cars and birds from collections of single-view images, and that our approach can be generalized to different reconstruction tasks, including unsupervised depth-based reconstruction and 2D supervised mesh reconstruction, achieving promising improvement in the quality of the reconstructed shape and texture.
Article
Full-text available
We present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We propose and compare two alternative implementations using different classifiers: Naïve Bayes and SVM. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. We present results for simultaneously classifying seven semantic visual categories. These results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information.
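A rough sketch of the bag-of-keypoints pipeline described above, assuming local descriptors have already been extracted per image (each image as an (n_i, d) array); the vocabulary size and clustering settings are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, n_words=1000, seed=0):
    """Vector-quantize pooled training descriptors into a visual vocabulary."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=n_words, n_init=4, random_state=seed).fit(all_desc)

def bag_of_keypoints(vocabulary, descriptors):
    """Represent one image as an L1-normalised histogram of visual words."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting histograms can then be fed to a Naïve Bayes or SVM classifier, as the abstract describes.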
Conference Paper
Full-text available
We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes, and for only a very few of them have image collections been formed and annotated with suitable class labels. In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, "Animals with Attributes", of over 30,000 animal images that match the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
In this paper, we focus on the problem of detecting the head of cat-like animals, adopting the cat as a test case. We show that the performance depends crucially on how to effectively utilize the shape and texture features jointly. Specifically, we propose a two-step approach for cat head detection. In the first step, we train two individual detectors on two training sets. One training set is normalized to emphasize the shape features and the other is normalized to underscore the texture features. In the second step, we train a joint shape and texture fusion classifier to make the final decision. We demonstrate that a significant improvement can be obtained by our two-step approach. In addition, we also propose a set of novel features based on oriented gradients, which outperforms existing leading features, e.g., Haar, HoG, and EoH. We evaluate our approach on a well labeled cat head data set with 10,000 images and PASCAL 2007 cat data.
Conference Paper
Full-text available
We present an interactive, hybrid human-computer method for object classification. The method applies to classes of objects that are recognizable by people with appropriate expertise (e.g., animal species or airplane model), but not (in general) by people without such expertise. It can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate our methods on Birds-200, a difficult dataset of 200 tightly-related bird species, and on the Animals With Attributes dataset. Our results demonstrate that incorporating user input drives up recognition accuracy to levels that are good enough for practical applications, while at the same time, computer vision reduces the amount of human interaction required.
Conference Paper
Full-text available
A new learning strategy for object detection is presented. The proposed scheme forgoes the need to train a collection of detectors dedicated to homogeneous families of poses, and instead learns a single classifier that has the inherent ability to deform based on the signal of interest. Specifically, we train a detector with a standard AdaBoost procedure by using combinations of pose-indexed features and pose estimators instead of the usual image features. This allows the learning process to select and combine various estimates of the pose with features able to implicitly compensate for variations in pose. We demonstrate that a detector built in such a manner provides noticeable gains on two hand video sequences and analyze the performance of our detector as these data sets are synthetically enriched in pose while not increased in size. 1. Preamble: Machine-learning object detection techniques rely on searching for the presence of the target over all scales and locations of a scene. In order to handle complex cases where latent variables modulate changes in appearance, for instance due to rotation or variation in illumination, two strategies have emerged: either building a collection of pose-dedicated classifiers or explicitly visiting the additional latent variables in the same manner as one explores location and scale. We propose a new approach which consists of designing a family of pose estimators able to compute meaningful values for the additional latent variables directly from the signal. We allow the learning procedure to automatically
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
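LIBLINEAR also backs the linear models in scikit-learn (LinearSVC and LogisticRegression with the "liblinear" solver), so a quick way to try it from Python is through those wrappers. A small sketch on synthetic data; the C value is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

svm = LinearSVC(C=1.0).fit(X, y)                                  # L2-regularised linear SVM
logreg = LogisticRegression(solver="liblinear", C=1.0).fit(X, y)  # logistic regression
print(svm.score(X, y), logreg.score(X, y))
```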
Article
Full-text available
The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
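The graph-cut algorithm described above (GrabCut) is available in OpenCV as cv2.grabCut; the sketch below initialises it from a rough bounding box. The image path and rectangle are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("pet.jpg")                         # placeholder path
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)           # internal GMM state
fgd_model = np.zeros((1, 65), np.float64)
rect = (50, 50, 300, 300)                           # rough user-drawn box (x, y, w, h)

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
segmented = img * fg[:, :, None]                    # keep only foreground pixels
```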
Article
Full-text available
Caltech-UCSD Birds 200 (CUB-200) is a challenging image dataset annotated with 200 bird species. It was created to enable the study of subordinate categorization, which is not possible with other popular datasets that focus on basic level categories (such as PASCAL VOC, Caltech-101, etc). The images were downloaded from the website Flickr and filtered by workers on Amazon Mechanical Turk. Each image is annotated with a bounding box, a rough bird segmentation, and a set of attribute labels.
Article
Full-text available
We introduce a challenging set of 256 object categories containing a total of 30607 images. The original Caltech-101 [1] was collected by choosing a set of object categories, downloading examples from Google Images and then manually screening out all images that did not fit the category. Caltech-256 is collected in a similar manner with several improvements: a) the number of categories is more than doubled, b) the minimum number of images in any category is increased from 31 to 80, c) artifacts due to image rotation are avoided and d) a new and larger clutter category is introduced for testing background rejection. We suggest several testing paradigms to measure classification performance, then benchmark the dataset using two simple metrics as well as a state-of-the-art spatial pyramid matching [2] algorithm. Finally we use the clutter category to train an interest detector which rejects uninformative background regions.
Conference Paper
Full-text available
We investigate the problem of learning optimal descriptors for a given classification task. Many hand-crafted descriptors have been proposed in the literature for measuring visual similarity. Looking past initial differences, what really distinguishes one descriptor from another is the tradeoff that it achieves between discriminative power and invariance. Since this trade-off must vary from task to task, no single descriptor can be optimal in all situations. Our focus, in this paper, is on learning the optimal tradeoff for classification given a particular training set and prior constraints. The problem is posed in the kernel learning framework. We learn the optimal, domain-specific kernel as a combination of base kernels corresponding to base features which achieve different levels of trade-off (such as no invariance, rotation invariance, scale invariance, affine invariance, etc.). This leads to a convex optimisation problem with a unique global optimum which can be solved for efficiently. The method is shown to achieve state-of-the-art performance on the UIUC textures, Oxford flowers and Caltech 101 datasets.
Conference Paper
Full-text available
The Wiener series is one of the standard methods to systematically characterize the nonlinearity of a system. The classical estimation method of the expansion coefficients via cross-correlation suffers from severe problems that prevent its application to high-dimensional and strongly nonlinear systems. We propose an implicit estimation method based on regression in a reproducing kernel Hilbert space that alleviates these problems. Experiments show performance advantages in terms of convergence, interpretability, and system sizes that can be handled.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
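For reference, scikit-image provides a HOG implementation whose parameters map directly onto the design choices mentioned above (orientation binning, cell size, block normalisation). The window and parameter values below are illustrative, not the paper's exact settings.

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)                 # stand-in for a 128x64 detection window
descriptor = hog(window,
                 orientations=9,                 # fine orientation binning
                 pixels_per_cell=(8, 8),         # relatively coarse spatial binning
                 cells_per_block=(2, 2),         # overlapping, locally normalised blocks
                 block_norm="L2-Hys")
print(descriptor.shape)                          # a single fixed-length feature vector
```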
Technical Report
The goal of this challenge is to recognize objects from a number of visual object classes in images of realistic scenes. It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The object classes are: motorbikes, bicycles, people and cars. Twelve participants entered the challenge. A full description of the challenge including software and image sets is available on the web page http://www.pascal-network.org/challenges/VOC/voc/index.html.
Article
Recently, methods based on local image features have shown promise for texture and object recognition tasks. This paper presents a large-scale evaluation of an approach that represents images as distributions (signatures or histograms) of features extracted from a sparse set of keypoint locations and learns a Support Vector Machine classifier with kernels based on two effective measures for comparing distributions, the Earth Mover's Distance and the χ2 distance. We first evaluate the performance of our approach with different keypoint detectors and descriptors, as well as different kernels and classifiers. We then conduct a comparative evaluation with several state-of-the-art recognition methods on four texture and five object databases. On most of these databases, our implementation exceeds the best reported results and achieves comparable performance on the rest. Finally, we investigate the influence of background correlations on recognition performance via extensive tests on the PASCAL database, for which ground-truth object localization information is available. Our experiments demonstrate that image representations based on distributions of local features are surprisingly effective for classification of texture and object images under challenging real-world conditions, including significant intra-class variations and substantial background clutter.
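As a rough illustration of the kind of kernel classifier evaluated above, scikit-learn's chi2_kernel can be plugged into an SVM with a precomputed kernel. The synthetic histograms and the gamma/C values below are arbitrary placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((100, 300))
X_train /= X_train.sum(1, keepdims=True)         # L1-normalised feature histograms
X_test = rng.random((20, 300))
X_test /= X_test.sum(1, keepdims=True)
y_train = rng.integers(0, 5, 100)

K_train = chi2_kernel(X_train, gamma=0.5)        # exp(-gamma * chi2 distance)
clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y_train)
pred = clf.predict(chi2_kernel(X_test, X_train, gamma=0.5))
```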
Conference Paper
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full length feature films.
Conference Paper
Our objective is to obtain a state-of-the art object category detector by employing a state-of-the-art image classifier to search for the object in all possible image sub-windows. We use multiple kernel learning of Varma and Ray (ICCV 2007) to learn an optimal combination of exponential χ2 kernels, each of which captures a different feature channel. Our features include the distribution of edges, dense and sparse visual words, and feature descriptors at different levels of spatial organization. Such a powerful classifier cannot be tested on all image sub-windows in a reasonable amount of time. Thus we propose a novel three-stage classifier, which combines linear, quasi-linear, and non-linear kernel SVMs. We show that increasing the non-linearity of the kernels increases their discriminative power, at the cost of an increased computational complexity. Our contributions include (i) showing that a linear classifier can be evaluated with a complexity proportional to the number of sub-windows (independent of the sub-window area and descriptor dimension); (ii) a comparison of three efficient methods of proposing candidate regions (including the jumping window classifier of Chum and Zisserman (CVPR 2007) based on proposing windows from scale invariant features); and (iii) introducing overlap-recall curves as a means to compare and optimize the performance of the intermediate pipeline stages. The method is evaluated on the PASCAL Visual Object Detection Challenge, and exceeds the performances of previously published methods for most of the classes.
Conference Paper
We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations.
Conference Paper
The ASIRRA CAPTCHA (6), recently proposed at ACM CCS 2007, relies on the problem of distinguishing images of cats and dogs (a task that humans are very good at). The security of ASIRRA is based on the presumed difficulty of classifying these images automatically. In this paper, we describe a classifier which is 82.7% accurate in telling apart the images of cats and dogs used in ASIRRA. This classifier is a combination of support-vector machine classifiers trained on color and texture features extracted from images. Our classifier allows us to solve a 12-image ASIRRA challenge automatically with probability 10.3%. This probability of success is significantly higher than the estimate given in (6) for machine vision attacks. The weakness we expose in the current implementation of ASIRRA does not mean that ASIRRA cannot be deployed securely. With appropriate safeguards, we believe that ASIRRA offers an appealing balance between usability and security. One contribution of this work is to inform the choice of safeguard parameters in ASIRRA deployments.
Conference Paper
We present Asirra (Figure 1), a CAPTCHA that asks users to identify cats out of a set of 12 photographs of both cats and dogs. Asirra is easy for users; user studies indicate it can be solved by humans 99.6% of the time in under 30 seconds. Barring a major advance in machine vision, we expect computers will have no better than a 1/54,000 chance of solving it. Asirra’s image database is provided by a novel, mutually beneficial partnership with Petfinder.com. In exchange for the use of their three million images, we display an “adopt me” link beneath each one, promoting Petfinder’s primary mission of finding homes for homeless animals. We describe the design of Asirra, discuss threats to its security, and report early deployment experiences. We also describe two novel algorithms for amplifying the skill gap between humans and computers that can be used on many existing CAPTCHAs.
Conference Paper
We investigate to what extent combinations of features can improve classification performance on a large dataset of similar classes. To this end we introduce a 103 class flower dataset. We compute four different features for the flowers, each describing different aspects, namely the local shape/texture, the shape of the boundary, the overall spatial distribution of petals, and the colour. We combine the features using a multiple kernel framework with a SVM classifier. The weights for each class are learnt using the method of Varma and Ray, which has achieved state of the art performance on other large datasets, such as Caltech 101/256. Our dataset has a similar challenge in the number of classes, but with the added difficulty of large between class similarity and small within class similarity. Results show that learning the optimum kernel combination of multiple features vastly improves the performance, from 55.1% for the best single feature to 72.8% for the combination of all features.
We propose a generic grouping algorithm that constructs a hierarchy of regions from the output of any contour detector. Our method consists of two steps, an oriented watershed transform (OWT) to form initial regions from contours, followed by construction of an ultra-metric contour map (UCM) defining a hierarchical segmentation. We provide extensive experimental evaluation to demonstrate that, when coupled to a high-performance contour detector, the OWT-UCM algorithm produces state-of-the-art image segmentations. These hierarchical segmentations can optionally be further refined by user-specified annotations.
Conference Paper
We present a method for object detection that combines AdaBoost learning with local histogram features. On the side of learning we improve the performance by designing a weak learner for multi-valued features based on the Weighted Fisher Linear Discriminant. Evaluation on the recent benchmark for object detection confirms the superior performance of our method compared to the state-of-the-art. In particular, using a single set of parameters our approach outperforms all methods reported in (5) for 7 out of 8 detection tasks and four object classes.
Conference Paper
The objective of this paper is the unsupervised segmentation of image training sets into foreground and background in order to improve image classification performance. To this end we introduce a new scalable, alternation-based algorithm for co-segmentation, BiCoS, which is simpler than many of its predecessors, and yet has superior performance on standard benchmark image datasets. We argue that the reason for this success is that the cosegmentation task is represented at the appropriate levels - pixels and color distributions for individual images, and super-pixels with learnable features at the level of sharing across the image set - together with powerful and efficient inference algorithms (GrabCut and SVM) for each level. We assess both the segmentation and classification performance of the algorithm and compare to previous results on Oxford Flowers 17 & 102, Caltech-UCSD Birds-200, the Weizmann Horses, Caltech-4 benchmark datasets.
Conference Paper
Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part for the class, followed by detecting the rest of the object via segmentation on image specific information learnt from that part. This approach is motivated by two observations: (i) many object classes contain distinctive parts that can be detected very reliably by template-based detectors, whilst the entire object cannot; (ii) many classes (e.g. animals) have fairly homogeneous coloring and texture that can be used to segment the object once a sample is provided in an image. We show quantitatively that our method substantially outperforms whole-body template-based detectors for these highly deformable object categories, and indeed achieves accuracy comparable to the state-of-the-art on the PASCAL VOC competition, which includes other models such as bag-of-words.
Conference Paper
Learning visual models of object categories notoriously requires thousands of training examples; this is due to the diversity and richness of object appearance which requires models containing hundreds of parameters. We present a method for learning object categories from just a few images (1 ∼ 5). It is based on incorporating "generic" knowledge which may be obtained from previously learnt models of unrelated categories. We operate in a variational Bayesian framework: object categories are represented by probabilistic models, and "prior" knowledge is represented as a probability density function on the parameters of these models. The "posterior" model for an object category is obtained by updating the prior in the light of one or more observations. Our ideas are demonstrated on four diverse categories (human faces, airplanes, motorcycles, spotted cats). Initially three categories are learnt from hundreds of training examples, and a "prior" is estimated from these. Then the model of the fourth category is learnt from 1 to 5 training examples, and is used for detecting new exemplars in a set of test images.
Article
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
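A simplified sketch of the latent-SVM alternation summarised above (not the authors' code): latent placements for positives are fixed under the current weights, then a standard convex linear SVM is solved, and the two steps repeat. `features` and `latent_choices` are abstract placeholders for the part-placement machinery, and hard-negative mining is omitted for brevity.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_latent_svm(positives, negatives, latent_choices, features, n_iters=5):
    w = None
    for _ in range(n_iters):
        # Step 1: for each positive, fix the latent value scoring highest under w.
        pos_feats = []
        for x in positives:
            candidates = [features(x, z) for z in latent_choices(x)]
            best = candidates[0] if w is None else max(candidates, key=lambda f: float(f @ w))
            pos_feats.append(best)
        # Step 2: with latent values fixed for positives, the problem is a convex SVM.
        neg_feats = [features(x, z) for x in negatives for z in latent_choices(x)]
        X = np.vstack(pos_feats + neg_feats)
        y = np.array([1] * len(pos_feats) + [-1] * len(neg_feats))
        clf = LinearSVC(C=1.0).fit(X, y)
        w = clf.coef_.ravel()
    return w
```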
Article
Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classifiers. The efficiency of this strategy for a complex pose, i.e., for fine-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: i) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; ii) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efficient, one-shot learning of pose-specific classifiers. To avoid expensive scene processing, we arrange these classifiers in a hierarchy based on nested partitions of the pose as in previous work, which allows for efficient search. The hierarchy is then "folded" for training: all the classifiers at each level are derived from one base predictor learned from all the data. The hierarchy is "unfolded" for testing: parsing a scene amounts to examining increasingly finer object descriptions only when there is sufficient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes.
Article
The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness, color, and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, we train a classifier using human labeled images as ground truth. The output of this classifier provides the posterior probability of a boundary at each image location and orientation. We present precision-recall curves showing that the resulting detector significantly outperforms existing approaches. Our two main results are 1) that cue combination can be performed adequately with a simple linear model and 2) that a proper, explicit treatment of texture is required to detect boundaries in natural images.
We investigate to what extent ‘bag of visual words’ models can be used to distinguish categories which have significant visual similarity. To this end we develop and optimize a nearest neighbour classifier architecture, which is evaluated on a very challenging database of flower images. The flower categories are chosen to be indistinguishable on colour alone (for example), and have considerable variation in shape, scale, and viewpoint. We demonstrate that by developing a visual vocabulary that explicitly represents the various aspects (colour, shape, and texture) that distinguish one flower from another, we can overcome the ambiguities that exist between flower categories. The novelty lies in the vocabulary used for each aspect, and how these vocabularies are combined into a final classifier. The various stages of the classifier (vocabulary selection and combination) are each optimized on a validation set. Results are presented on a dataset of 1360 images consisting of 17 flower species. It is shown that excellent performance can be achieved, far surpassing standard baseline algorithms using (for example) colour cues alone.
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
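A small sketch of the pooling step described above: per-cell visual-word histograms are computed at each pyramid level and concatenated with the standard pyramid-match level weights. It assumes keypoint coordinates are normalised to [0, 1]; the vocabulary size and level count are illustrative.

```python
import numpy as np

def spatial_pyramid(points_xy, words, vocab_size, levels=2):
    """points_xy: (n, 2) normalised keypoint coords; words: (n,) visual-word ids."""
    points_xy = np.asarray(points_xy, dtype=float)
    words = np.asarray(words)
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Standard weights: 1/2^L for level 0, 1/2^(L-l+1) for level l >= 1.
        weight = 1.0 / 2 ** levels if level == 0 else 1.0 / 2 ** (levels - level + 1)
        ix = np.clip((points_xy[:, 0] * cells).astype(int), 0, cells - 1)
        iy = np.clip((points_xy[:, 1] * cells).astype(int), 0, cells - 1)
        for cx in range(cells):
            for cy in range(cells):
                in_cell = (ix == cx) & (iy == cy)
                hist = np.bincount(words[in_cell], minlength=vocab_size).astype(float)
                feats.append(weight * hist)
    return np.concatenate(feats)
```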
Conference Paper
Recently, methods based on local image features have shown promise for texture and object recognition tasks. This paper presents a large-scale evaluation of an approach that represents images as distributions (signatures or histograms) of features extracted from a sparse set of keypoint locations and learns a Support Vector Machine classifier with kernels based on two effective measures for comparing distributions, the Earth Mover’s Distance and the χ2 distance. We first evaluate the performance of our approach with different keypoint detectors and descriptors, as well as different kernels and classifiers. We then conduct a comparative evaluation with several state-of-the-art recognition methods on 4 texture and 5 object databases. On most of these databases, our implementation exceeds the best reported results and achieves comparable performance on the rest. Finally, we investigate the influence of background correlations on recognition performance.
Conference Paper
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds
Article
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999) An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.
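For a quick feel of the descriptor in practice, OpenCV exposes SIFT detection and descriptor matching; the sketch below applies Lowe's ratio test to keep distinctive matches. The image paths are placeholders.

```python
import cv2

sift = cv2.SIFT_create()
img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)    # placeholder paths
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)      # 2 nearest neighbours per query
good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe's ratio test
print(len(good), "putative matches")
```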
  • Acknowledgements
Acknowledgements. We are grateful for financial support from EU Project AXES ICT-269980 and ERC grant VisRec no. 228180.

References
Novel dataset for fine-grained image categorization
  • A Khosla
  • N Jayadevaprakash
  • B Yao
  • F F Li
A. Khosla, N. Jayadevaprakash, B. Yao, and F. F. Li. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
Visual categorization with bags of keypoints
  • G Csurka
  • C R Dance
  • L Dan
  • J Willamowski
  • C Bray
G. Csurka, C. R. Dance, L. Dan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Proc. ECCV Workshop on Stat. Learn. in Comp. Vision, 2004.
Object detection with discriminatively trained part based models
  • P F Felzenszwalb
  • R B Girshick
  • D Mcallester
  • D Ramanan
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2009.