ABSTRACT: In this article we investigate the problem of human action recognition in static images. By action recognition we mean a class of problems that includes both action classification and action detection (i.e. simultaneous localization and classification). Bag-of-words image representations yield promising results for action classification, and deformable part models perform very well for object detection. Representations for action recognition typically use only shape cues and ignore color information. Inspired by the recent success of color in image classification and object detection, we investigate the potential of color for action classification and detection in static images. We perform a comprehensive evaluation of color descriptors and fusion approaches for action recognition. Experiments were conducted on the three datasets most commonly used for benchmarking action recognition in still images: Willow, PASCAL VOC 2010 and Stanford-40. Our experiments demonstrate that incorporating color information considerably improves recognition performance, and that a descriptor based on color names outperforms pure color descriptors. They also show that late fusion of color and shape information outperforms other fusion approaches for action recognition. Finally, we show that the different color–shape fusion approaches yield complementary information, and that combining them achieves state-of-the-art performance for action classification.
International Journal of Computer Vision 12/2013; · 3.62 Impact Factor
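The late-fusion strategy evaluated in the abstract above can be illustrated with a minimal sketch: two cue-specific classifiers are trained independently and their per-class scores are combined by a weighted average. The scores below are made-up stand-ins for shape and color SVM outputs, not the paper's implementation.

```python
import numpy as np

# Hypothetical per-class scores from two independently trained classifiers,
# e.g. one SVM on a shape bag-of-words and one on color-name descriptors.
shape_scores = np.array([0.2, 0.7, 0.1])   # one score per action class
color_scores = np.array([0.5, 0.6, 0.4])

def late_fusion(score_lists, weights=None):
    """Late fusion: weighted average of per-cue classifier score vectors."""
    score_lists = np.asarray(score_lists, dtype=float)
    if weights is None:
        # Equal weighting by default; weights could be learned on a
        # validation set instead.
        weights = np.ones(len(score_lists)) / len(score_lists)
    return np.average(score_lists, axis=0, weights=weights)

fused = late_fusion([shape_scores, color_scores])
predicted_class = int(np.argmax(fused))
```

In early fusion the descriptors would be concatenated before training a single classifier; the drop reported for early fusion in detection is one motivation for combining scores only at this final stage.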
ABSTRACT: In this paper we present a novel method to improve the flexibility of descriptor matching for image recognition by using local multiresolution pyramids in feature space. We propose that image patches be represented at multiple levels of descriptor detail, and that these levels be defined in terms of local spatial pooling resolution. Preserving multiple levels of detail in local descriptors is a way of hedging one's bets on which levels will be most relevant for matching during learning and recognition. We introduce the Pyramid SIFT (P-SIFT) descriptor and show that its use in four state-of-the-art image recognition pipelines improves accuracy and yields state-of-the-art results. Our technique is applicable independently of spatial pyramid matching, and we show that spatial pyramids can be combined with local pyramids to obtain further improvement. We achieve state-of-the-art results on Caltech-101 (80.1%) and Caltech-256 (52.6%) compared to other approaches based on SIFT features over intensity images. Our technique is efficient and extremely easy to integrate into image recognition pipelines.
IEEE Transactions on Software Engineering 11/2013; · 2.59 Impact Factor
ABSTRACT: In this paper we present a method for the segmentation of continuous page streams into multipage documents and the simultaneous classification of the resulting documents. We first present an approach to combine the multiple pages of a document into a single feature vector that represents the whole document. Despite its simplicity and low computational cost, the proposed representation yields results comparable to more complex methods in multipage document classification tasks. We then exploit this representation in the context of page stream segmentation. The most plausible segmentation of a page stream into a sequence of multipage documents is obtained by optimizing a statistical model that represents the probability of each segmented multipage document belonging to a particular class. Experimental results are reported on a large sample of real administrative multipage documents.
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on; 01/2013
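A minimal sketch of the single-vector idea described above, under the assumption that each page is represented by a bag-of-words histogram: normalize each page histogram and average them into one document vector. The counts and the pooling choice are illustrative, not taken from the paper.

```python
import numpy as np

def document_vector(page_histograms):
    """Pool per-page bag-of-words histograms into one document vector:
    L1-normalize each page, then average (an assumed pooling scheme)."""
    pages = np.asarray(page_histograms, dtype=float)
    # Guard against empty pages when normalizing.
    pages /= np.maximum(pages.sum(axis=1, keepdims=True), 1e-12)
    return pages.mean(axis=0)

# Hypothetical word counts for a two-page document over a 3-word vocabulary.
pages = [[4, 0, 2],
         [1, 1, 0]]
doc = document_vector(pages)
```

Because each page is normalized before averaging, long and short pages contribute equally, and the resulting vector is itself L1-normalized and directly comparable across documents of different lengths.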
ABSTRACT: This paper describes an approach to human action recognition based on a probabilistic optimization model of body parts using hidden Markov models (HMMs). Our method is able to distinguish between similar actions by considering only the body parts that contribute most to the actions: for example, legs for walking, jogging and running; arms for boxing, waving and clapping. We apply HMMs to model the stochastic movement of the body parts for action recognition. The HMM construction uses an ensemble of body-part detectors, followed by grouping of part detections, to perform human identification. Three example-based body-part detectors are trained to detect three components of the human body: the head, legs and arms. These detectors cope with viewpoint changes and self-occlusions through the use of ten sub-classifiers that detect body parts over a specific range of viewpoints. Each sub-classifier is a support vector machine trained on features selected for their discriminative power for each particular part/viewpoint combination. Grouping of these detections is performed using a simple geometric constraint model that yields a viewpoint-invariant human detector. We test our approach on three publicly available action datasets: the KTH, Weizmann and HumanEva datasets. Our results illustrate that with a simple and compact representation we can achieve robust recognition of human actions comparable to the most complex, state-of-the-art methods.
Expert Systems 01/2013; 30(2). · 0.77 Impact Factor
ABSTRACT: In this paper we consider the problem of face recognition in imagery captured in uncooperative environments using PTZ cameras. For each subject enrolled in the gallery, we acquire a high-resolution 3D model from which we generate a series of rendered face images of varying viewpoint. The result of regularly sampling face pose for all subjects is a redundant basis that over-represents each target. To recognize an unknown probe image, we perform a sparse reconstruction of SIFT features extracted from the probe using a basis of SIFT features from the gallery. While directly collecting images over varying pose for all enrolled subjects is prohibitive, the use of high-speed 3D acquisition systems allows our face recognition system to quickly acquire a single model at enrollment and generate synthetic views offline. Finally, we show, using two publicly available datasets, how our approach performs when using rendered gallery images to recognize 2D rendered probe images and 2D probe images acquired using PTZ cameras.
ABSTRACT: In this article we introduce the problem of identity inference as a generalization of the re-identification problem. Identity inference is applicable in situations where a large number of unknown persons must be identified without knowing a priori that groups of test images represent the same individual. Standard single- and multi-shot person re-identification are special cases of our formulation. We present an approach to solving identity inference problems using a Conditional Random Field (CRF) that models identity inference as a labeling problem. The CRF model ensures that the final labeling assigns similar labels to detections that are similar in feature space, and is flexible enough to incorporate constraints in the temporal and spatial domains. Experimental results are given on the ETHZ dataset. Our approach yields state-of-the-art performance for the multi-shot re-identification task and promising results for more general identity inference problems.
Proceedings of the 12th international conference on Computer Vision - Volume Part I; 10/2012
ABSTRACT: State-of-the-art object detectors typically use shape information as a low-level feature representation to capture the local structure of an object. This paper shows that early fusion of shape and color, as is popular in image classification, leads to a significant drop in performance for object detection. Moreover, such approaches also yield suboptimal results for object categories in which the relative importance of color and shape varies. In this paper we propose the use of color attributes as an explicit color representation for object detection. Color attributes are compact and computationally efficient, and when combined with traditional shape features they provide state-of-the-art results for object detection. Our method is tested on the PASCAL VOC 2007 and 2009 datasets, and the results clearly show that, despite its simplicity, our method improves over state-of-the-art techniques. We also introduce a new dataset consisting of cartoon character images in which color plays a pivotal role. On this dataset, our approach yields a significant gain of 14% in mean AP over conventional state-of-the-art methods.
Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 06/2012
ABSTRACT: A real-time posterity logging system detects and tracks multiple targets in video streams, grabbing face images and retaining only the best-quality image for each detected target.
ABSTRACT: In this paper we present a multipage administrative document image retrieval system based on textual and visual representations of document pages. Individual pages are represented by textual or visual information using a bag-of-words framework. Different fusion strategies are evaluated which allow the system to perform multipage document retrieval on the basis of a single page retrieval system. Results are reported on a large dataset of document images sampled from a banking workflow.
Pattern Recognition (ICPR), 2012 21st International Conference on; 01/2012
ABSTRACT: In this paper we present a technique for real-time face logging in video streams. Our system is capable of detecting faces across a range of poses and of tracking multiple targets in real time, grabbing face images and evaluating their quality in order to store only the best for each detected target. An advantage of our approach is that we qualify every logged face with a quality measure based on both face pose and resolution. Extensive qualitative and quantitative evaluation of the performance of our system is provided on many hours of realistic surveillance footage captured in different environments. Results show that our system can minimize false positives and identity mismatches while balancing this against the need to obtain face images of all people in a scene.
Pattern Recognition (ICPR), 2012 21st International Conference on; 01/2012
ABSTRACT: One of the most critical limitations of Kinect™-based interfaces is the need for persistence in order to interact with virtual objects. Indeed, a user must keep her arm still for a not-so-short span of time while pointing at an object with which she wishes to interact. The most natural way to overcome this limitation and improve interface responsiveness is to employ a vision module able to recognize simple hand poses (e.g. open/closed) in order to add a state to the virtual pointer represented by the user's hand. In this paper we propose a method to robustly predict the status of the user's hand in real time. We jointly exploit depth and RGB imagery to produce a robust feature for hand representation. Finally, we use temporal filtering to reduce spurious prediction errors. We have also prepared a dataset of more than 30K depth-RGB image pairs of hands that is being made publicly available. The proposed method achieves more than 98% accuracy and is highly responsive.
Pattern Recognition (ICPR), 2012 21st International Conference on; 01/2012
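The temporal-filtering step mentioned in the hand-pose abstract above can be sketched as a majority vote over a sliding window of frame-level predictions. The window length and voting rule are assumptions for illustration, not the paper's exact filter.

```python
from collections import Counter, deque

class TemporalFilter:
    """Majority vote over the last `window` frame-level predictions,
    suppressing one-frame open/closed flickers (an assumed filter design)."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, prediction):
        # Record the raw per-frame prediction, then emit the most
        # frequent label in the current window.
        self.history.append(prediction)
        return Counter(self.history).most_common(1)[0][0]

f = TemporalFilter(window=5)
raw = ["open", "open", "closed", "open", "open", "open"]  # one spurious flip
smoothed = [f.update(p) for p in raw]
```

The single "closed" frame is outvoted by its neighbors, so the virtual pointer's state never flickers; the cost is a latency of up to half the window length on genuine state changes.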
ABSTRACT: This article describes a new dataset under construction at the Media Integration and Communication Center and the University of Florence. The dataset consists of high-resolution 3D scans of the face of each subject, along with several video sequences of varying resolution and zoom level. Each subject is recorded in a controlled setting in HD video, then in a less constrained (but still indoor) setting using a standard PTZ surveillance camera, and finally in an unconstrained, outdoor environment with challenging conditions. In each sequence the subject is recorded at three levels of zoom. The dataset is being constructed specifically to support research on techniques that bridge the gap between 2D, appearance-based recognition techniques and fully 3D approaches. It is designed to simulate, in a controlled fashion, realistic surveillance conditions and to probe the efficacy of exploiting 3D models in real scenarios.
Communications Control and Signal Processing (ISCCSP), 2012 5th International Symposium on; 01/2012
ABSTRACT: This paper describes a novel framework for the detection and suppression of shadowed regions in most scenarios occurring in real video sequences. Our approach requires no prior knowledge about the scene, nor is it restricted to specific scene structures. Furthermore, the technique can detect both achromatic and chromatic shadows, even in the presence of camouflage, which occurs when foreground regions are very similar in color to shadowed regions. The method exploits local color constancy properties due to reflectance suppression over shadowed regions. To detect shadowed regions in a scene, the values of the background image are divided by the values of the current frame in the RGB color space. We show how this luminance ratio can be used to identify segments with low gradient constancy, which in turn distinguishes shadows from foreground. Experimental results on a collection of publicly available datasets illustrate the superior performance of our method compared with the most sophisticated, state-of-the-art shadow detection algorithms. These results show that our approach is robust and accurate over a broad range of shadow types and challenging video conditions.
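The ratio test described above can be sketched as follows, assuming `background` and `frame` are aligned RGB images: a shadow darkens the frame roughly uniformly, so the background/frame ratio stays in a narrow band above 1 and varies smoothly. The band and gradient thresholds are illustrative, not the paper's tuned values.

```python
import numpy as np

def shadow_mask(background, frame, low=1.0, high=2.5, grad_thresh=0.05):
    """Sketch of ratio-based shadow detection (illustrative thresholds):
    mark pixels where the frame is uniformly darker than the background
    and the luminance ratio has low local gradient (gradient constancy)."""
    eps = 1e-6
    ratio = background.astype(float) / (frame.astype(float) + eps)
    r = ratio.mean(axis=2)                     # average ratio over R, G, B
    gy, gx = np.gradient(r)
    smooth = np.hypot(gx, gy) < grad_thresh    # low gradient constancy
    darkened = (r > low) & (r < high)          # frame darker than background
    return darkened & smooth

# Toy scene: a flat background and a frame with a uniformly darkened block.
bg = np.full((8, 8, 3), 200, dtype=np.uint8)
fr = bg.copy()
fr[2:6, 2:6] = 120                             # shadow-like region
mask = shadow_mask(bg, fr)
```

The interior of the darkened block is flagged as shadow, while its boundary (where the ratio jumps) and the unchanged background are not; a real foreground object would typically break both the ratio band and the smoothness test.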
ABSTRACT: The Hierarchical Conditional Random Field (HCRF) model has been successfully applied to a number of image labeling problems, including image segmentation. However, existing HCRF models of image segmentation do not allow multiple classes to be assigned to a single region, which limits their ability to incorporate contextual information across multiple scales. At higher scales in the image, this representation yields an oversimplified model, since multiple classes can reasonably be expected to appear within large regions. This simplified model particularly limits the impact of information at higher scales. Since class-label information at these scales is usually more reliable than at lower, noisier scales, neglecting this information is undesirable. To address these issues, we propose a new consistency potential for image labeling problems, which we call the harmony potential. It can encode any possible combination of labels, penalizing only unlikely combinations of classes. We also propose an effective sampling strategy over this expanded label set that renders the underlying optimization problem tractable. Our approach obtains state-of-the-art results on challenging, standard benchmark datasets for semantic image segmentation, including PASCAL VOC 2010.
Keywords: Semantic object segmentation; Hierarchical conditional random fields
International Journal of Computer Vision 01/2011; 96(1):83-102. · 3.62 Impact Factor
ABSTRACT: This article describes an approach to adaptive video coding for video surveillance applications. Using a combination of low-level features with low computational cost, we show how it is possible to control the quality of video compression so that semantically meaningful elements of the scene are encoded with higher fidelity, while background elements are allocated fewer bits in the transmitted representation. Our approach is based on adaptive smoothing of individual video frames so that image features highly correlated with semantically interesting objects are preserved. Because it uses only low-level image features on individual frames, this adaptive smoothing can be seamlessly inserted into a video coding pipeline as a pre-processing stage. Experiments show that our technique is efficient, outperforms standard H.264 encoding at comparable bit rates, and preserves features critical for downstream detection and recognition.
2011 IEEE International Symposium on Multimedia, ISM 2011, Dana Point, CA, USA, December 5-7, 2011; 01/2011
ABSTRACT: Human detection is fundamental in many machine vision applications, such as video surveillance, driving assistance, action recognition and scene understanding. However, most of these applications require real-time performance, which current detection methods do not yet achieve. This paper presents a new method for human detection based on a multiresolution cascade of Histograms of Oriented Gradients (HOG) that greatly reduces the computational cost of the detection search without affecting accuracy. The method consists of a cascade of sliding-window detectors. Each detector is a linear Support Vector Machine (SVM) composed of HOG features at different resolutions, from coarse at the first level to fine at the last. In contrast to previous methods, our approach uses a non-uniform stride of the sliding window that is defined by the feature resolution, allowing the detection to be incrementally refined as the search proceeds from coarse to fine resolution. In this way, the speed-up of the cascade is due not only to the smaller number of features computed at the first levels of the cascade, but also to the reduced number of windows that need to be evaluated at the coarse resolution. Experimental results show that our method reaches a detection rate comparable with state-of-the-art detectors based on HOG features, while the detection search is up to 23 times faster.
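The coarse-to-fine search described above can be sketched as follows. The window size, stride schedule and neighborhood rule are illustrative assumptions, and `score_fn` stands in for each level's HOG+SVM score; the point is that each level halves the stride and only re-evaluates windows near survivors of the coarser level.

```python
def windows(img_w, img_h, win=64, stride=8):
    """All top-left corners of a square sliding window at a given stride."""
    return [(x, y) for y in range(0, img_h - win + 1, stride)
                   for x in range(0, img_w - win + 1, stride)]

def cascade_search(img_w, img_h, score_fn, strides=(32, 16, 8), thresh=0.0):
    """Coarse-to-fine sliding-window cascade (a sketch, not the paper's
    exact method): at each finer level, only windows near a surviving
    coarse window are scored again."""
    survivors = None
    for stride in strides:
        cands = windows(img_w, img_h, stride=stride)
        if survivors is not None:
            # Keep only candidates near a window that passed the
            # previous, coarser level.
            cands = [c for c in cands
                     if any(abs(c[0] - s[0]) <= 2 * stride and
                            abs(c[1] - s[1]) <= 2 * stride
                            for s in survivors)]
        survivors = [c for c in cands if score_fn(c, stride) > thresh]
    return survivors

# Toy scorer: a hypothetical "person" near (40, 40) in a 160x160 image.
target = (40, 40)
def toy_score(corner, stride):
    x, y = corner
    return 1.0 if abs(x - target[0]) < 16 and abs(y - target[1]) < 16 else -1.0

hits = cascade_search(160, 160, toy_score)
```

With this schedule, the coarse level scores only a handful of windows, so the fine-stride evaluation runs on a small neighborhood of the target rather than the full grid, which is the source of the speed-up the abstract describes.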