IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE T PATTERN ANAL)

Publisher: IEEE Computer Society; Institute of Electrical and Electronics Engineers

Journal description

Theory and application of computers in pattern analysis and machine intelligence. Topics include computer vision and image processing; knowledge representation, inference systems, and probabilistic reasoning. Extensive bibliographies.

Current impact factor: 5.78

Impact Factor Rankings

2015 Impact Factor Available summer 2016
2014 Impact Factor 5.781
2013 Impact Factor 5.694
2012 Impact Factor 4.795
2011 Impact Factor 4.908
2010 Impact Factor 5.027
2009 Impact Factor 4.378
2008 Impact Factor 5.96
2007 Impact Factor 3.579
2006 Impact Factor 4.306
2005 Impact Factor 3.81
2004 Impact Factor 4.352
2003 Impact Factor 3.823
2002 Impact Factor 2.923
2001 Impact Factor 2.289
2000 Impact Factor 2.094
1999 Impact Factor 1.882
1998 Impact Factor 1.417
1997 Impact Factor 1.668
1996 Impact Factor 2.085
1995 Impact Factor 1.94
1994 Impact Factor 2.006
1993 Impact Factor 1.917
1992 Impact Factor 1.906

Additional details

5-year impact 7.76
Cited half-life >10.0
Immediacy index 0.71
Eigenfactor 0.05
Article influence 3.31
Website IEEE Transactions on Pattern Analysis and Machine Intelligence website
Other titles IEEE transactions on pattern analysis and machine intelligence, Institute of Electrical and Electronics Engineers transactions on pattern analysis and machine intelligence
ISSN 0162-8828
OCLC 4253074
Material type Periodical, Internet resource
Document type Journal / Magazine / Newspaper, Internet Resource

Publisher details

Institute of Electrical and Electronics Engineers

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author can archive a post-print version
  • Conditions
    • Author's pre-print on Author's personal website, employer's website, or publicly accessible server
    • Author's post-print on Author's server or Institutional server
    • Author's pre-print must be removed upon publication of the final version and replaced with either a full citation to the IEEE work with a Digital Object Identifier, a link to the article abstract in IEEE Xplore, or the Author's post-print
    • Author's pre-print must be accompanied by the set phrase once submitted to IEEE for publication ("This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible")
    • Author's pre-print must be accompanied by the set phrase when accepted by IEEE for publication ("(c) 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.")
    • IEEE must be informed as to the electronic address of the pre-print
    • If funding rules apply, authors may post the Author's post-print version in the funder's designated repository
    • Author's Post-print - Publisher copyright and source must be acknowledged with citation (see above set statement)
    • Author's Post-print - Must link to publisher version with DOI
    • Publisher's version/PDF cannot be used
    • Publisher copyright and source must be acknowledged
  • Classification
    green

Publications in this journal

  • ABSTRACT: We study the problem of classifying actions of human subjects using depth movies generated by Kinect or other depth sensors. Representing human bodies as dynamical skeletons, we study the evolution of the skeletons' shapes as trajectories on Kendall's shape manifold. The action data is typically corrupted by large variability in execution rates within and across subjects, which causes major problems in statistical analyses. To address that issue, we adapt the recently developed framework of Su et al. [1], [2] to this problem domain. Here, the variable execution rates correspond to re-parameterizations of trajectories, and one uses a parameterization-invariant metric for aligning, comparing, averaging, and modeling trajectories. This metric is based on a combination of transported square-root vector fields (TSRVFs) of trajectories and the standard Euclidean norm, which allows computational efficiency. We develop a comprehensive suite of computational tools for this application domain: smoothing and denoising skeleton trajectories using median filtering, up- and down-sampling actions in the time domain, simultaneous temporal registration of multiple actions, and extracting invertible Euclidean representations of actions. Due to this invertibility, the Euclidean representations allow both discriminative and generative models for statistical analysis. For instance, they can be used in an SVM-based classification of the original actions, as demonstrated here using the MSR Action-3D, MSR Daily Activity and 3D Action Pairs datasets. This approach, using only the skeletal data, achieves state-of-the-art classification results on these datasets.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 12/2015; DOI:10.1109/TPAMI.2015.2439257
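    As a toy illustration of two of the preprocessing steps listed above (median filtering of skeleton trajectories and up-/down-sampling in the time domain), here is a minimal Python sketch; the shapes, frame counts, and filter width are illustrative assumptions, not the authors' implementation.

      import numpy as np
      from scipy.signal import medfilt
      from scipy.interpolate import interp1d

      def preprocess_trajectory(traj, n_out=60, kernel=5):
          """traj: (T, d) array of skeleton frames; returns (n_out, d)."""
          smoothed = medfilt(traj, kernel_size=[kernel, 1])  # denoise each coordinate over time
          t_in = np.linspace(0.0, 1.0, num=traj.shape[0])
          t_out = np.linspace(0.0, 1.0, num=n_out)
          return interp1d(t_in, smoothed, axis=0)(t_out)     # resample to a common length

      noisy = np.cumsum(np.random.randn(47, 45), axis=0)     # stand-in 15-joint clip, 47 frames
      print(preprocess_trajectory(noisy).shape)              # (60, 45)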
  • ABSTRACT: Dictionary-based and part-based methods are among the most popular approaches to visual recognition. In both methods, a mid-level representation is built on top of low-level image descriptors, and high-level classifiers are trained on top of the mid-level representation. While earlier methods built the mid-level representation without supervision, there is currently great interest in learning both representations jointly to make the mid-level representation more discriminative. In this work we propose a new approach to visual recognition that jointly learns a shared, discriminative, and compact mid-level representation and a compact high-level representation. By using a structured output learning framework, our approach directly handles the multiclass case at both levels of abstraction. Moreover, by using a group-sparse prior in the structured output learning framework, our approach encourages sharing of visual words and thus reduces the number of words used to represent each class. We test our proposed method on several popular benchmarks. Our results show that, by jointly learning mid- and high-level representations and fostering the sharing of discriminative visual words among target classes, we are able to achieve state-of-the-art recognition performance using far fewer visual words than previous approaches.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2015; 37(11):1-1. DOI:10.1109/TPAMI.2015.2408349
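    The group-sparse prior mentioned above is typically realized through a proximal operator that zeroes out whole groups at once; below is a hedged sketch in which each row of the weight matrix (one visual word across all classes) is treated as a group, with made-up sizes.

      import numpy as np

      def prox_group_l2(W, lam):
          """Row-wise group soft-thresholding: drops entire visual words."""
          norms = np.linalg.norm(W, axis=1, keepdims=True)
          scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
          return W * scale

      W = np.random.randn(8, 4)                    # 8 visual words x 4 classes (toy)
      W_sparse = prox_group_l2(W, lam=1.5)
      print((np.linalg.norm(W_sparse, axis=1) == 0).sum(), "words zeroed out")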
  • ABSTRACT: We propose a method to detect changes in the geometry of a city using panoramic images captured by a car driving around the city. The proposed method can be used to significantly optimize the process of updating the 3D model of an urban environment that is changing over time, by restricting this process to only those areas where changes are detected. With this application in mind, we designed our algorithm to detect only structural changes in the environment, ignoring any changes in its appearance, and ignoring also all changes which are not relevant for update purposes, such as cars and people. The approach also accounts for the challenges involved in a large-scale application of change detection, such as inaccuracies in the input geometry, errors in the geo-location data of the images, and the limited amount of information due to sparse imagery. We evaluated our approach on a small-scale setup using high resolution, densely captured images and on a large-scale setup covering an entire city using the more realistic scenario of low resolution, sparsely captured images. A quantitative evaluation was also conducted for the large-scale setup.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2015; 37(11):1-1. DOI:10.1109/TPAMI.2015.2404834
  • ABSTRACT: This paper proposes a method for fusing data acquired by a ToF camera and a stereo pair, based on a model for depth measurement by ToF cameras which also accounts for depth discontinuity artifacts due to the mixed-pixel effect. This model is exploited within both ML and MAP-MRF frameworks for ToF and stereo data fusion. The proposed MAP-MRF framework is characterized by site-dependent range values, a rather important feature since it can be used both to improve the accuracy and to decrease the computational complexity of standard MAP-MRF approaches. In order to optimize the site-dependent global cost function characteristic of the proposed MAP-MRF approach, this paper also introduces an extension to Loopy Belief Propagation which can be used in other contexts. Experimental data validate the proposed ToF measurement model and the effectiveness of the proposed fusion techniques.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2015; 37(11):1-1. DOI:10.1109/TPAMI.2015.2408361
  • ABSTRACT: As objects are inherently 3D, they were modeled in 3D in the early days of computer vision. Due to the ambiguities arising from mapping 2D features to 3D models, 3D object representations were later neglected, and 2D feature-based models are the predominant paradigm in object detection nowadays. While such models have achieved outstanding bounding box detection performance, their expressiveness is limited: they have little capability for reasoning about 3D shape or viewpoints. In this work, we bring the worlds of 3D and 2D object representations closer by building an object detector which leverages the expressive power of 3D object representations while at the same time being robustly matchable to image evidence. To that end, we gradually extend the successful deformable part model [1] to include viewpoint information and part-level 3D geometry information, resulting in several different models with different levels of expressiveness. We end up with a 3D object model consisting of multiple object parts represented in 3D and a continuous appearance model. We experimentally verify that our models, while providing richer object hypotheses than 2D object models, provide consistently better joint object localization and viewpoint estimation than the state-of-the-art multi-view and 3D object detectors on various benchmarks (KITTI [2], 3D object classes [3], Pascal3D+ [4], Pascal VOC 2007 [5], EPFL multi-view cars [6]).
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2015; 37(11):1-1. DOI:10.1109/TPAMI.2015.2408347
  • ABSTRACT: Many binary code embedding schemes have been actively studied recently, since they can provide efficient similarity search and compact data representations suitable for handling large-scale image databases. Existing binary code embedding techniques encode high-dimensional data using hyperplane-based hashing functions. In this paper we propose a novel hypersphere-based hashing function, spherical hashing, which maps spatially coherent data points into a binary code more effectively than hyperplane-based hashing functions. We also propose a new binary code distance function, the spherical Hamming distance, tailored to our hypersphere-based binary coding scheme, and design an efficient iterative optimization process to achieve both balanced partitioning for each hash function and independence between hashing functions. Furthermore, we generalize spherical hashing to support various similarity measures defined by kernel functions. Our extensive experiments show that our spherical hashing technique significantly outperforms state-of-the-art techniques based on hyperplanes across various benchmarks, with sizes ranging from one to 75 million GIST, BoW and VLAD descriptors. The performance gains are consistent and large, with up to 100 percent improvement over the second best of the tested methods. These results confirm the unique merits of using hyperspheres to encode proximity regions in high-dimensional spaces. Finally, our method is intuitive and easy to implement.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(11):1-1. DOI:10.1109/TPAMI.2015.2408363
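    A minimal sketch of the hypersphere-based encoding and the spherical Hamming distance described above, assuming the sphere centers and radii have already been learned by the paper's iterative optimization (random placeholders below):

      import numpy as np

      def spherical_encode(X, centers, radii):
          """Bit i of a code is 1 iff the point falls inside hypersphere i."""
          d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
          return (d <= radii).astype(np.uint8)     # (n_points, n_bits)

      def spherical_hamming(a, b):
          """Differing bits divided by common 1-bits (guarded against zero)."""
          return np.logical_xor(a, b).sum() / max(np.logical_and(a, b).sum(), 1)

      rng = np.random.default_rng(0)
      X = rng.standard_normal((4, 32))
      C = rng.standard_normal((16, 32))            # 16 spheres -> 16-bit codes
      r = np.full(16, 7.5)                         # illustrative common radius
      codes = spherical_encode(X, C, r)
      print(spherical_hamming(codes[0], codes[1]))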
  • ABSTRACT: Distributed algorithms have recently gained immense popularity. With regard to computer vision applications, distributed multi-target tracking in a camera network is a fundamental problem. The goal is for all cameras to have accurate state estimates for all targets. Distributed estimation algorithms work by exchanging information between sensors that are communication neighbors. Vision-based distributed multi-target state estimation has at least two characteristics that distinguish it from other applications. First, cameras are directional sensors, and neighboring sensors often may not be sensing the same targets, i.e., they are naïve with respect to those targets. Second, in the presence of clutter and multiple targets, each camera must solve a data association problem. This paper presents an information-weighted, consensus-based, distributed multi-target tracking algorithm, referred to as the Multi-target Information Consensus (MTIC) algorithm, that is designed to address both the naïvety and the data association problems. It converges to the centralized minimum mean square error estimate. The proposed MTIC algorithm and its extensions to non-linear camera models, termed the Extended MTIC (EMTIC), are robust to false measurements and to limited resources such as power, bandwidth, and real-time operational requirements. Simulation and experimental analysis are provided to support the theoretical results.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; DOI:10.1109/TPAMI.2015.2484339
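    This is not the full MTIC algorithm, but a sketch of the average-consensus primitive that information-weighted consensus builds on: each camera repeatedly mixes its estimate with those of its communication neighbors until the network agrees on a common value. The topology and mixing rate are illustrative assumptions.

      import numpy as np

      def consensus_step(est, neighbors, rate=0.4):
          return {i: est[i] + rate * sum(est[j] - est[i] for j in nbrs)
                  for i, nbrs in neighbors.items()}

      est = {0: np.array([1.0, 0.0]), 1: np.array([3.0, 2.0]), 2: np.array([5.0, 4.0])}
      topology = {0: [1], 1: [0, 2], 2: [1]}       # a 3-camera chain
      for _ in range(50):
          est = consensus_step(est, topology)
      print(est[0])                                # ~[3. 2.], the network-wide average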
  • ABSTRACT: Subspace-based methods are known to provide a practical solution for image set-based object recognition. Based on the insight that local shape differences between objects offer a sensitive cue for recognition, this paper addresses the problem of extracting a subspace representing the difference components between class subspaces generated from each set of object images independently of each other. We first introduce the difference subspace (DS), a novel geometric concept between two subspaces that extends the difference vector between two vectors, and describe its effectiveness in analyzing shape differences. We then generalize it to the generalized difference subspace (GDS) for multi-class subspaces, and show the benefit of applying this to subspace and mutual subspace methods in terms of recognition capability. Furthermore, we extend these methods to the kernel DS (KDS) and kernel GDS (KGDS) by a nonlinear kernel mapping, to deal with cases involving larger changes in viewing direction. In summary, the contributions of this paper are as follows: 1) a DS/KDS between two class subspaces characterizes shape differences between the two corresponding objects, 2) the projection of an input vector onto a DS/KDS realizes selective visualization of shape differences between objects, and 3) the projection of an input vector or subspace onto a GDS/KGDS is extremely effective at extracting differences between multiple subspaces, and therefore improves object recognition performance. We demonstrate validity through shape analysis on synthetic and real images of 3D objects, as well as through extensive comparison of performance on classification tests with several related methods; we study the performance in face image classification on the Yale face database B+ and the CMU Multi-PIE database, and in hand shape classification of multi-view images.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(11):1-1. DOI:10.1109/TPAMI.2015.2408358
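    One common formulation of the GDS construction, sketched under toy assumptions: sum the projection matrices of the orthonormal class-subspace bases and keep the eigenvectors with the smallest eigenvalues, which span the directions the classes share least.

      import numpy as np

      def gds(bases, n_dim):
          """bases: list of (d, m_k) orthonormal class-subspace bases."""
          G = sum(B @ B.T for B in bases)          # sum of projection matrices
          _, eigvecs = np.linalg.eigh(G)           # eigenvalues in ascending order
          return eigvecs[:, :n_dim]                # difference directions

      rng = np.random.default_rng(1)
      bases = [np.linalg.qr(rng.standard_normal((20, 3)))[0] for _ in range(8)]
      D = gds(bases, n_dim=5)
      print(D.shape)                               # (20, 5)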
  • ABSTRACT: The bag-of-words (BoW) model treats images as sets of local descriptors and represents them by visual word histograms. The Fisher vector (FV) representation extends BoW by considering the first and second order statistics of local descriptors. In both representations local descriptors are assumed to be independent and identically distributed (iid), which is a poor assumption from a modeling perspective. It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization. In this paper, we introduce non-iid models by treating the model parameters as latent variables which are integrated out, rendering all local regions dependent. Using the Fisher kernel principle we encode an image by the gradient of the data log-likelihood with respect to the model hyper-parameters. Our models naturally generate discounting effects in the representations, suggesting that such transformations have proven successful because they closely correspond to the representations obtained for non-iid models. To enable tractable computation, we rely on variational free-energy bounds to learn the hyper-parameters and to compute approximate Fisher kernels. Our experimental results validate that our models lead to performance improvements comparable to using power normalization, as employed in state-of-the-art feature aggregation methods.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; DOI:10.1109/TPAMI.2015.2484342
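    The power normalization (discounting transformation) the abstract refers to is a one-liner; a sketch with an illustrative exponent follows.

      import numpy as np

      def power_normalize(fv, rho=0.5):
          """sign(z) * |z|^rho per dimension, then L2 normalization."""
          z = np.sign(fv) * np.abs(fv) ** rho
          return z / max(np.linalg.norm(z), 1e-12)

      fv = np.random.randn(128) * np.random.rand(128)  # stand-in Fisher vector
      print(np.round(power_normalize(fv)[:4], 3))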
  • ABSTRACT: Spatiotemporal image descriptors are gaining attention in the image research community as better representations of dynamic textures. In this paper, we introduce a dynamic micro-texture descriptor, the spatiotemporal directional number transitional graph (DNG), which describes both the spatial structure and the motion of each local neighborhood by capturing the direction of natural flow in the temporal domain. We use the structure of the local neighborhood, given by its principal directions, and compute the transition of such directions between frames. Moreover, we present the statistics of the direction transitions in a transitional graph, which acts as a signature for a given spatiotemporal region in the dynamic texture. Furthermore, we create a sequence descriptor by dividing the spatiotemporal volume into several regions, computing a transitional graph for each of them, and representing the sequence as a set of graphs. Our results validate the robustness of the proposed descriptor in different scenarios for expression recognition and dynamic texture analysis.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(10):1-1. DOI:10.1109/TPAMI.2015.2392774
  • ABSTRACT: Binary feature descriptors such as local binary patterns (LBP) and their variants have been widely used in many face recognition systems due to their excellent robustness and strong discriminative power. However, most existing binary face descriptors are hand-crafted, requiring strong prior knowledge to engineer. In this paper, we propose a compact binary face descriptor (CBFD) feature learning method for face representation and recognition. Given each face image, we first extract pixel difference vectors (PDVs) in local patches by computing the difference between each pixel and its neighboring pixels. Then, we learn a feature mapping to project these pixel difference vectors into low-dimensional binary vectors in an unsupervised manner, where 1) the variance of all binary codes in the training set is maximized, 2) the loss between the original real-valued codes and the learned binary codes is minimized, and 3) the binary codes are evenly distributed at each learned bin, so that the redundant information in the PDVs is removed and compact binary codes are obtained. Lastly, we cluster and pool these binary codes into a histogram feature as the final representation for each face image. Moreover, we propose a coupled CBFD (C-CBFD) method that reduces the modality gap of heterogeneous faces at the feature level, making our method applicable to heterogeneous face recognition. Extensive experimental results on five widely used face datasets show that our methods outperform state-of-the-art face descriptors.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(10):1-1. DOI:10.1109/TPAMI.2015.2408359
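    An illustrative sketch of the first CBFD stage, extracting 8-neighbor pixel difference vectors (PDVs) from a patch, followed by a sign-thresholded projection to bits; the projection W is random here, whereas the paper learns it under the three criteria listed above.

      import numpy as np

      def pixel_difference_vectors(img):
          """8-neighbor differences for every interior pixel of a 2D image."""
          shifts = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
          center = img[1:-1, 1:-1]
          pdv = np.stack([np.roll(img, s, axis=(0, 1))[1:-1, 1:-1] - center
                          for s in shifts], axis=-1)
          return pdv.reshape(-1, 8)                # one 8-d PDV per interior pixel

      rng = np.random.default_rng(2)
      patch = rng.random((16, 16))
      W = rng.standard_normal((8, 4))              # maps PDVs to 4-bit codes (toy)
      codes = (pixel_difference_vectors(patch) @ W >= 0).astype(np.uint8)
      print(codes.shape)                           # (196, 4)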
  • ABSTRACT: To uncover an appropriate latent subspace for data representation, in this paper we propose a novel Robust Structured Subspace Learning (RSSL) algorithm that integrates image understanding and feature learning into a joint learning framework. The learned subspace is adopted as an intermediate space to reduce the semantic gap between the low-level visual features and the high-level semantics. To guarantee that the subspace is compact and discriminative, the intrinsic geometric structure of the data, and the local and global structural consistencies over labels, are exploited simultaneously in the proposed algorithm. Besides, we adopt the ℓ2,1-norm for the formulations of the loss function and the regularization to make our algorithm robust to outliers and noise. An efficient algorithm is designed to solve the proposed optimization problem. Notably, the proposed framework is a general one which can leverage several well-known algorithms as special cases and elucidate their intrinsic relationships. To validate the effectiveness of the proposed method, extensive experiments are conducted on diverse datasets for different image understanding tasks, i.e., image tagging, clustering, and classification, and more encouraging results are achieved compared with some state-of-the-art approaches.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(10):1-1. DOI:10.1109/TPAMI.2015.2400461
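    For reference, the ℓ2,1-norm (assumed above to be the norm dropped from the abstract) is simply the sum of row-wise ℓ2 norms, which is what makes whole rows, and hence whole outlier samples, cheap to suppress:

      import numpy as np

      def l21_norm(M):
          return np.linalg.norm(M, axis=1).sum()   # sum of row-wise l2 norms

      M = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
      print(l21_norm(M))                           # 5 + 0 + 1 = 6.0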
  • ABSTRACT: The double-opponent (DO) color-sensitive cells in the primary visual cortex (V1) of the human visual system (HVS) have long been recognized as the physiological basis of color constancy. In this work we propose a new color constancy model that imitates the functional properties of the HVS, from the single-opponent (SO) cells in the retina to the DO cells in V1 and the possible neurons in the higher visual cortices. The idea behind the proposed double-opponency based color constancy (DOCC) model originates from the observation that the color distribution of the responses of DO cells to color-biased images coincides well with the vector denoting the light source color. The illuminant color is then easily estimated by pooling the responses of DO cells in separate channels in LMS space, with a pooling mechanism of sum or max. Extensive evaluations on three commonly used datasets, including tests with dataset-dependent optimal parameters as well as intra- and inter-dataset cross validation, show that our physiologically inspired DOCC model produces quite competitive results in comparison to state-of-the-art approaches, but with a relatively simple implementation and without requiring fine-tuning of the method for each different dataset.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(10):1-1. DOI:10.1109/TPAMI.2015.2396053
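    A toy sketch of the pooling step only: given per-channel response maps (raw pixels stand in here for the omitted SO/DO filtering stages), the illuminant estimate is the max- or sum-pooled value per channel, normalized to unit length.

      import numpy as np

      def estimate_illuminant(responses, pooling="max"):
          """responses: (H, W, 3) maps, one per color channel."""
          flat = responses.reshape(-1, 3)
          e = flat.max(axis=0) if pooling == "max" else flat.sum(axis=0)
          return e / np.linalg.norm(e)             # unit-norm illuminant color

      img = np.random.rand(32, 32, 3) * np.array([1.0, 0.8, 0.6])  # reddish cast
      print(estimate_illuminant(img))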
  • ABSTRACT: In this paper we present a method for constructing a generative prototype for a set of graphs by adopting a minimum description length approach. The method is posed in terms of learning a generative supergraph model from which new samples can be obtained by an appropriate sampling mechanism. We commence by constructing a probability distribution for the occurrence of nodes and edges over the supergraph. We encode the complexity of the supergraph using an approximate von Neumann entropy. A variant of the EM algorithm is developed to minimize the description length criterion, in which the structure of the supergraph and the node correspondences between the sample graphs and the supergraph are treated as missing data. To generate new graphs, we assume that the nodes and edges of graphs arise under independent Bernoulli distributions and sample new graphs according to their node and edge occurrence probabilities. Empirical evaluations on real-world databases demonstrate the practical utility of the proposed algorithm and show the effectiveness of the generative model for the tasks of graph classification, graph clustering and generating new sample graphs.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(10):2013-2027. DOI:10.1109/TPAMI.2015.2400451
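    The sampling mechanism described above is straightforward to sketch: independent Bernoulli draws for nodes and edges using the supergraph's occurrence probabilities (the probabilities below are made up).

      import numpy as np

      def sample_graph(node_p, edge_p, rng):
          nodes = rng.random(len(node_p)) < node_p           # Bernoulli node draws
          edges = np.triu(rng.random(edge_p.shape) < edge_p, k=1)
          edges &= np.outer(nodes, nodes)                    # keep edges between kept nodes
          return nodes, edges | edges.T                      # symmetric adjacency

      rng = np.random.default_rng(3)
      node_p = np.array([0.9, 0.8, 0.7, 0.5])
      edge_p = np.full((4, 4), 0.6)
      nodes, adj = sample_graph(node_p, edge_p, rng)
      print(nodes, adj.sum() // 2, "edges")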
  • ABSTRACT: Human detection in dense crowds is an important problem, as it is a prerequisite to many other visual tasks, such as tracking, counting, recognizing actions, or detecting anomalous behaviors exhibited by individuals in a dense crowd. This problem is challenging due to the large number of individuals, small apparent sizes, severe occlusions, and perspective distortion. However, crowded scenes also offer contextual constraints that can be used to tackle these challenges. In this paper, we explore context for human detection in dense crowds in the form of a locally-consistent scale prior which captures the similarity of scale in local neighborhoods and its smooth variation over the image. Using the scale and confidence of detections obtained from an underlying human detector, we infer scale and confidence priors using a Markov Random Field. In an iterative mechanism, the confidences of detection hypotheses are modified to reflect consistency with the inferred priors, and the priors are updated based on the new detections. The final set of detections is then reasoned about for occlusion using Binary Integer Programming, where overlaps and relations between parts of individuals are encoded as linear constraints. Both human detection and occlusion reasoning in the proposed approach are solved with local neighbor-dependent constraints, thereby respecting the inter-dependence between individuals that is characteristic of dense crowd analysis. In addition, we propose a mechanism to detect different combinations of body parts without requiring annotations for individual combinations. We performed experiments on a new and extremely challenging dataset of dense crowd images, showing marked improvement over the underlying human detector.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2015; 37(10):1-1. DOI:10.1109/TPAMI.2015.2396051