Andrew D. Bagdanov

University of Florence, Florence, Tuscany, Italy

Publications (49) · 44.13 Total impact

  •
    ABSTRACT: In this paper we introduce a script identification method based on hand-crafted texture features and an artificial neural network. The proposed pipeline achieves near state-of-the-art performance for script identification of video-text and state-of-the-art performance on visual language identification of handwritten text. More than using the deep network as a classifier, the use of its intermediary activations as a learned metric demonstrates remarkable results and allows the use of discriminative models on unknown classes. Comparative experiments in video-text and text in the wild datasets provide insights on the internals of the proposed deep network.
    No preview · Article · Jan 2016
  •
    ABSTRACT: Action recognition in still images is a challenging problem in computer vision. To facilitate comparative evaluation independently of person detection, the standard evaluation protocol for action recognition uses an oracle person detector to obtain perfect bounding box information at both training and test time. The assumption is that, in practice, a general person detector will provide candidate bounding boxes for action recognition. In this paper we argue that this paradigm is sub-optimal and that action class labels should already be considered during the detection stage. Motivated by the observation that body pose is strongly conditioned on action class, we show: (i) that existing, state-of-the-art generic person detectors are not adequate for proposing candidate bounding boxes for action classification; (ii) that, due to limited training examples, direct training of action-specific person detectors is also inadequate; and (iii) that, using only a small number of labeled action examples, transfer learning is able to adapt an existing detector to propose higher-quality bounding boxes for subsequent action classification. To the best of our knowledge, we are the first to investigate transfer learning for the task of action-specific person detection in still images. We perform extensive experiments on two benchmark datasets: Stanford-40 and PASCAL VOC 2012. For the action detection task (i.e. both person localization and classification of the action performed), our approach outperforms methods based on general person detection by 5.7% mean average precision (MAP) on Stanford-40 and 2.1% MAP on PASCAL VOC 2012. Our approach also significantly outperforms the state-of-the-art with a MAP of 45.4% on Stanford-40 and 31.4% on PASCAL VOC 2012. We also evaluate our action detection approach for the task of action classification (i.e. recognizing actions without localizing them). For this task, our approach, without using any ground-truth person localization at test time, outperforms on both datasets state-of-the-art methods which do use person locations.
    Full-text · Article · Aug 2015 · IEEE Transactions on Image Processing

  • No preview · Conference Paper · Aug 2015
  • Giuseppe Lisanti · Iacopo Masi · Andrew D. Bagdanov · Alberto Del Bimbo
    ABSTRACT: In this paper we introduce a method for person re-identification based on discriminative, sparse basis expansions of targets in terms of a labeled gallery of known individuals. We propose an iterative extension to sparse discriminative classifiers capable of ranking many candidate targets. The approach makes use of soft- and hard- re-weighting to redistribute energy among the most relevant contributing elements and to ensure that the best candidates are ranked at each iteration. Our approach also leverages a novel visual descriptor which we show to be discriminative while remaining robust to pose and illumination variations. An extensive comparative evaluation is given demonstrating that our approach achieves state-of-the-art performance on single- and multi-shot person re-identification scenarios on the VIPeR, i-LIDS, ETHZ, and CAVIAR4REID datasets. The combination of our descriptor and iterative sparse basis expansion improves state-of-the-art rank-1 performance by six percentage points on VIPeR and by 20 on CAVIAR4REID compared to other methods with a single gallery image per person. With multiple gallery and probe images per person our approach improves by 17 percentage points the state-of-the-art on i-LIDS and by 72 on CAVIAR4REID at rank-1. The approach is also quite efficient, capable of single-shot person re-identification over galleries containing hundreds of individuals at about 30 re-identifications per second.
    No preview · Article · Aug 2015 · IEEE Transactions on Pattern Analysis and Machine Intelligence
  •
    ABSTRACT: In this paper we present the use of Sparse Radial Sampling Local Binary Patterns, a variant of Local Binary Patterns (LBP) for text-as-texture classification. By adapting and extending the standard LBP operator to the particularities of text we get a generic text-as-texture classification scheme and apply it to writer identification. In experiments on CVL and ICDAR 2013 datasets, the proposed feature-set demonstrates State-Of-the-Art (SOA) performance. Among the SOA, the proposed method is the only one that is based on dense extraction of a single local feature descriptor. This makes it fast and applicable at the earliest stages in a DIA pipeline without the need for segmentation, binarization, or extraction of multiple features.
    No preview · Article · Apr 2015
  •
    ABSTRACT: In this paper we describe a Fisher vector encoding of images over Random Density Forests. Random Density Forests (RDFs) are an unsupervised variation of Random Decision Forests for density estimation. In this work we train RDFs by splitting at each node in order to minimize the Gaussian differential entropy of each split. We use this as a generative model of image patch features and derive the Fisher vector representation using the RDF as the underlying model. Our approach is computationally efficient, reducing the amount of Gaussian derivatives to compute, and allows more flexibility in the feature density modelling. We evaluate our approach on the PASCAL VOC 2007 dataset showing that our approach, which uses only linear classifiers, improves over bag of visual words and is comparable to the traditional Fisher vector encoding over Gaussian Mixture Models for density estimation.
    No preview · Article · Dec 2014
  •
    ABSTRACT: In this paper, we present a page classification application in a banking workflow. The proposed architecture represents administrative document images by merging visual and textual descriptions. The visual description is based on a hierarchical representation of the pixel intensity distribution. The textual description uses latent semantic analysis to represent document content as a mixture of topics. Several off-the-shelf classifiers and different strategies for combining visual and textual cues have been evaluated. A final step uses an \(n\)-gram model of the page stream allowing a finer-grained classification of pages. The proposed method has been tested in a real large-scale environment and we report results on a dataset of 70,000 pages.
    No preview · Article · Dec 2014 · Document Analysis and Recognition
  • Enrico Bondi · Lorenzo Seidenari · Andrew D. Bagdanov · Alberto Del Bimbo
    ABSTRACT: In this paper we describe a system for automatic people counting in crowded environments. The approach we propose is a counting-by-detection method based on depth imagery. It is designed to be deployed as an autonomous appliance for crowd analysis in video surveillance application scenarios. Our system performs foreground/background segmentation on depth image streams in order to coarsely segment persons, then depth information is used to localize head candidates which are then tracked in time on an automatically estimated ground plane. The system runs in real-time, at a frame-rate of about 20 fps. We collected a dataset of RGB-D sequences representing three typical and challenging surveillance scenarios, including crowds, queuing and groups. An extensive comparative evaluation is given between our system and more complex, Latent SVM-based head localization for person counting applications.
    No preview · Conference Paper · Aug 2014
  •
    ABSTRACT: Recognizing human actions in still images is a challenging problem in computer vision due to the significant amount of scale, illumination and pose variation. Given the bounding box of a person both at training and test time, the task is to classify the action associated with each bounding box in an image. Most state-of-the-art methods use the bag-of-words paradigm for action recognition. The bag-of-words framework employing a dense multi-scale grid sampling strategy is the de facto standard for feature detection. This results in a scale invariant image representation where all the features at multiple-scales are binned in a single histogram. We argue that such a scale invariant strategy is sub-optimal since it ignores the multi-scale information available with each bounding box of a person. This paper investigates alternative approaches to scale coding for action recognition in still images. We encode multi-scale information explicitly in three different histograms for small, medium and large scale visual-words. Our first approach exploits multi-scale information with respect to the image size. In our second approach, we encode multi-scale information relative to the size of the bounding box of a person instance. In each approach, the multi-scale histograms are then concatenated into a single representation for action classification. We validate our approaches on the Willow dataset which contains seven action categories: interacting with computer, photography, playing music, riding bike, riding horse, running and walking. Our results clearly suggest that the proposed scale coding approaches outperform the conventional scale invariant technique. Moreover, we show that our approach obtains promising results compared to more complex state-of-the-art methods.
    No preview · Conference Paper · Aug 2014
  • Svebor Karaman · Giuseppe Lisanti · Andrew D. Bagdanov · Alberto Del Bimbo
    ABSTRACT: In this paper we describe a semi-supervised approach to person re-identification that combines discriminative models of person identity with a Conditional Random Field (CRF) to exploit the local manifold approximation induced by the nearest neighbor graph in feature space. The linear discriminative models learned on few gallery images provide coarse separation of probe images into identities, while a graph topology defined by distances between all person images in feature space leverages local support for label propagation in the CRF. We evaluate our approach using multiple scenarios on several publicly available datasets, where the number of identities varies from 28 to 191 and the number of images ranges between 1003 and 36 171. We demonstrate that the discriminative model and the CRF are complementary and that the combination of both leads to significant improvement over state-of-the-art approaches. We further demonstrate how the performance of our approach improves with increasing test data and also with increasing amounts of additional unlabeled data.
    No preview · Article · Jun 2014 · Pattern Recognition
  • Svebor Karaman · Giuseppe Lisanti · Andrew D. Bagdanov · Alberto Del Bimbo

    No preview · Article · Jan 2014
  •
    ABSTRACT: In this article we investigate the problem of human action recognition in static images. By action recognition we mean a class of problems which includes both action classification and action detection (i.e. simultaneous localization and classification). Bag-of-words image representations yield promising results for action classification, and deformable part models perform very well in object detection. The representations for action recognition typically use only shape cues and ignore color information. Inspired by the recent success of color in image classification and object detection, we investigate the potential of color for action classification and detection in static images. We perform a comprehensive evaluation of color descriptors and fusion approaches for action recognition. Experiments were conducted on the three datasets most used for benchmarking action recognition in still images: Willow, PASCAL VOC 2010 and Stanford-40. Our experiments demonstrate that incorporating color information considerably improves recognition performance, and that a descriptor based on color names outperforms pure color descriptors. Our experiments demonstrate that late fusion of color and shape information outperforms other approaches on action recognition. Finally, we show that the different color–shape fusion approaches result in complementary information and combining them yields state-of-the-art performance for action classification.
    Full-text · Article · Dec 2013 · International Journal of Computer Vision
  •
    ABSTRACT: In this paper we present a novel method to improve the flexibility of descriptor matching for image recognition by using local multiresolution pyramids in feature space. We propose that image patches be represented at multiple levels of descriptor detail and that these levels be defined in terms of local spatial pooling resolution. Preserving multiple levels of detail in local descriptors is a way of hedging one's bets on which levels will be most relevant for matching during learning and recognition. We introduce the Pyramid SIFT (P-SIFT) descriptor and show that its use in four state-of-the-art image recognition pipelines improves accuracy and yields state-of-the-art results. Our technique is applicable independently of spatial pyramid matching and we show that spatial pyramids can be combined with local pyramids to obtain further improvement. We achieve state-of-the-art results on Caltech-101 (80.1%) and Caltech-256 (52.6%) when compared to other approaches based on SIFT features over intensity images. Our technique is efficient and is extremely easy to integrate into image recognition pipelines.
    Full-text · Article · Nov 2013 · IEEE Transactions on Pattern Analysis and Machine Intelligence
  •
    ABSTRACT: In this paper we present a method for the segmentation of continuous page streams into multipage documents and the simultaneous classification of the resulting documents. We first present an approach to combine the multiple pages of a document into a single feature vector that represents the whole document. Despite its simplicity and low computational cost, the proposed representation yields results comparable to more complex methods in multipage document classification tasks. We then exploit this representation in the context of page stream segmentation. The most plausible segmentation of a page stream into a sequence of multipage documents is obtained by optimizing a statistical model that represents the probability of each segmented multipage document belonging to a particular class. Experimental results are reported on a large sample of real administrative multipage documents.
    No preview · Conference Paper · Aug 2013
  • Bhaskar Chakraborty · Andrew D. Bagdanov · Jordi Gonzàlez · Xavier Roca
    ABSTRACT: This paper describes an approach to human action recognition based on a probabilistic optimization model of body parts using a hidden Markov model (HMM). Our method is able to distinguish between similar actions by only considering the body parts having a major contribution to the actions, for example, legs for walking, jogging and running; arms for boxing, waving and clapping. We apply HMMs to model the stochastic movement of the body parts for action recognition. The HMM construction uses an ensemble of body‐part detectors, followed by grouping of part detections, to perform human identification. Three example‐based body‐part detectors are trained to detect three components of the human body: the head, legs and arms. These detectors cope with viewpoint changes and self‐occlusions through the use of ten sub‐classifiers that detect body parts over a specific range of viewpoints. Each sub‐classifier is a support vector machine trained on features selected for their discriminative power for each particular part/viewpoint combination. Grouping of these detections is performed using a simple geometric constraint model that yields a viewpoint‐invariant human detector. We test our approach on three publicly available action datasets: the KTH dataset, Weizmann dataset and HumanEva dataset. Our results illustrate that with a simple and compact representation we can achieve robust recognition of human actions comparable to the most complex, state‐of‐the‐art methods.
    No preview · Article · May 2013 · Expert Systems
  •
    ABSTRACT: In this paper we consider the problem of face recognition in imagery captured in uncooperative environments using PTZ cameras. For each subject enrolled in the gallery, we acquire a high-resolution 3D model from which we generate a series of rendered face images of varying viewpoint. The result of regularly sampling face pose for all subjects is a redundant basis that over-represents each target. To recognize an unknown probe image, we perform a sparse reconstruction of SIFT features extracted from the probe using a basis of SIFT features from the gallery. While directly collecting images over varying pose for all enrolled subjects is prohibitive at enrollment, the use of high-speed 3D acquisition systems allows our face recognition system to quickly acquire a single model and generate synthetic views offline. Finally, we show, using two publicly available datasets, how our approach performs when using rendered gallery images to recognize 2D rendered probe images and 2D probe images acquired using PTZ cameras.
    No preview · Conference Paper · Jan 2013
  • Svebor Karaman · Andrew D. Bagdanov
    ABSTRACT: In this article we introduce the problem of identity inference as a generalization of the re-identification problem. Identity inference is applicable in situations where a large number of unknown persons must be identified without knowing a priori that groups of test images represent the same individual. Standard single- and multi-shot person re-identification are special cases of our formulation. We present an approach to solving identity inference problems using a Conditional Random Field (CRF) to model identity inference as a labeling problem in the CRF. The CRF model ensures that the final labeling gives similar labels to detections that are similar in feature space, and is flexible enough to incorporate constraints in the temporal and spatial domains. Experimental results are given on the ETHZ dataset. Our approach yields state-of-the-art performance for the multi-shot re-identification task and promising results for more general identity inference problems.
    Full-text · Conference Paper · Oct 2012
  •
    ABSTRACT: A real-time posterity logging system detects and tracks multiple targets in video streams, grabbing face images and retaining only the best quality for each detected target.
    Preview · Article · Oct 2012 · IEEE Multimedia
  • Andrew D. Bagdanov
    ABSTRACT: State-of-the-art object detectors typically use shape information as a low level feature representation to capture the local structure of an object. This paper shows that early fusion of shape and color, as is popular in image classification, leads to a significant drop in performance for object detection. Moreover, such approaches also yield suboptimal results for object categories with varying importance of color and shape. In this paper we propose the use of color attributes as an explicit color representation for object detection. Color attributes are compact, computationally efficient, and when combined with traditional shape features provide state-of-the-art results for object detection. Our method is tested on the PASCAL VOC 2007 and 2009 datasets and results clearly show that our method improves over state-of-the-art techniques despite its simplicity. We also introduce a new dataset consisting of cartoon character images in which color plays a pivotal role. On this dataset, our approach yields a significant gain of 14% in mean AP over conventional state-of-the-art methods.
    No preview · Conference Paper · Jun 2012
  •
    Full-text · Article · Jun 2012 · Proceedings / CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Publication Stats

608 Citations
44.13 Total Impact Points

Institutions

  • 2005-2015
    • University of Florence
      • Media Integration and Communication Center (MICC)
      Florence, Tuscany, Italy
  • 2009-2014
    • Autonomous University of Barcelona
      • Computer Vision Center
      • Department of Computer Sciences
      Cerdanyola del Vallès, Catalonia, Spain
  • 2012
    • CVC Computer Vision Center
      Barcelona, Catalonia, Spain
  • 2010
    • University of Barcelona
      Barcelona, Catalonia, Spain