-
ABSTRACT: Boosted one-versus-all (OVA) classifiers are commonly used in multiclass problems, such as generic object recognition, biometrics-based identification, or gesture recognition. JointBoost is a recently proposed method where OVA classifiers are trained jointly and are forced to share features. JointBoost has been demonstrated to lead to both higher accuracy and shorter classification time, compared to using OVA classifiers that were trained independently and without sharing features. However, even with the improved efficiency of JointBoost, the time complexity of OVA-based multiclass recognition is still linear in the number of classes, and can lead to prohibitively long running times in domains with a very large number of classes. In this paper, it is shown that JointBoost-based recognition can be reduced, at classification time, to nearest neighbor search in a vector space. Using this reduction, we propose a simple and easy-to-implement vector indexing scheme based on principal component analysis (PCA). In our experiments, the proposed method achieves a speedup of two orders of magnitude over standard JointBoost classification, in a hand pose recognition system where the number of classes is close to 50,000, with negligible loss in classification accuracy. Our method also yields promising results in experiments on the widely used FRGC-2 face recognition dataset, where the number of classes is 535.
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on; 07/2009
-
ABSTRACT: A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as locality sensitive hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
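The abstract above does not spell out how the binary hash functions are built. As a rough sketch only, one plausible distance-only construction projects each object onto the "line" between two pivot objects and thresholds the result. The pivots, thresholds, helper names, and the toy distance below are all assumptions made for illustration and are not taken from the paper.

```python
def line_projection(distance, pivot_a, pivot_b):
    """Project an object onto the 'line' through two pivots using distances only."""
    d_ab = distance(pivot_a, pivot_b)

    def project(x):
        return (distance(x, pivot_a) ** 2 + d_ab ** 2 - distance(x, pivot_b) ** 2) / (2 * d_ab)

    return project


def binary_hash(project, low, high):
    """Map an object to 1 if its projection falls inside [low, high], else 0."""
    return lambda x: 1 if low <= project(x) <= high else 0


def build_table(objects, hash_functions):
    """Group objects into buckets keyed by their multibit hash code."""
    table = {}
    for obj in objects:
        key = tuple(h(obj) for h in hash_functions)
        table.setdefault(key, []).append(obj)
    return table


def lookup(table, hash_functions, query):
    """Return the bucket the query hashes to (candidates for exact comparison)."""
    return table.get(tuple(h(query) for h in hash_functions), [])


# Toy setup (hypothetical): objects are numbers, distance is absolute difference.
def toy_distance(a, b):
    return abs(a - b)


projection = line_projection(toy_distance, 0.0, 10.0)
hashes = [binary_hash(projection, 0.0, 5.0), binary_hash(projection, 2.5, 7.5)]
table = build_table([1.0, 2.0, 4.0, 6.0, 9.0], hashes)
candidates = lookup(table, hashes, 1.5)
```

Only the distance function is ever called on the objects, which is what would let a scheme of this kind work in non-metric spaces. A full index would use several such tables and then compare the query against the retrieved candidates with the exact distance.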
-
ABSTRACT: Nearest neighbor classifiers are a popular method for multiclass recognition in a wide range of computer vision and pattern recognition domains. At the same time, the accuracy of nearest neighbor classifiers is sensitive to the choice of distance measure. This paper introduces an algorithm that uses boosting to learn a distance measure for multiclass k-nearest neighbor classification. Given a family of distance measures as input, AdaBoost is used to learn a weighted distance measure that is a linear combination of the input measures. The proposed method can be seen both as a novel way to learn a distance measure from data, and as a novel way to apply boosting to multiclass recognition problems that does not require output codes. In our approach, multiclass recognition of objects is reduced to a single binary recognition task, defined on triples of objects. Preliminary experiments with eight UCI datasets yield no clear winner among our method, boosting using output codes, and k-nn classification using an unoptimized distance measure. Our algorithm did achieve lower error rates in some of the datasets, which indicates that it is a method worth considering for nearest neighbor recognition in various pattern recognition domains.
Computer Vision and Pattern Recognition - Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on; 07/2005
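To make the two ideas in the abstract above concrete, here is a minimal sketch of a distance that is a linear combination of input measures, and of a binary classifier defined on triples of objects. The weights are set by hand here, whereas in the paper AdaBoost learns them. The function names, the two input measures, and the sign convention of the triple classifier are assumptions for illustration.

```python
def weighted_distance(measures, weights):
    """Combine several distance measures into one weighted sum."""
    def combined(x, y):
        return sum(w * d(x, y) for w, d in zip(weights, measures))
    return combined


def triple_classifier(distance):
    """Binary classifier on triples (q, a, b): positive output means q is closer to a than to b."""
    def classify(q, a, b):
        return distance(q, b) - distance(q, a)
    return classify


# Two simple input measures on coordinate tuples (chosen only for the example).
def manhattan(x, y):
    return sum(abs(p - r) for p, r in zip(x, y))


def chebyshev(x, y):
    return max(abs(p - r) for p, r in zip(x, y))


# Hand-picked weights; a learned combination would replace these.
combined = weighted_distance([manhattan, chebyshev], [0.7, 0.3])
closer_to_a = triple_classifier(combined)
score = closer_to_a((0.0, 0.0), (1.0, 0.0), (3.0, 4.0))
```

Because the triple classifier is linear in the weights, each input measure can act as a weak classifier on triples, which is what would make a boosting formulation natural.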
-
ABSTRACT: This paper proposes a method for efficient nearest neighbor classification in non-Euclidean spaces with computationally expensive similarity/distance measures. Efficient approximations of such measures are obtained using the BoostMap algorithm, which produces embeddings into a real vector space. A modification to the BoostMap algorithm is proposed, which uses an optimization cost that is more appropriate when our goal is classification accuracy as opposed to nearest neighbor retrieval accuracy. Using the modified algorithm, multiple approximate nearest neighbor classifiers are obtained, that provide a wide range of trade-offs between accuracy and efficiency. The approximations are automatically combined to form a cascade classifier, which applies the slower and more accurate approximations only to the hardest cases. The proposed method is experimentally evaluated in the domain of handwritten digit recognition using shape context matching. Results on the MNIST database indicate that a speed-up of two to three orders of magnitude is gained over brute force search, with minimal losses in classification accuracy.
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on; 07/2005
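The cascade idea in the abstract above can be sketched as follows: cheap classifiers answer the easy queries, and a query moves on to a slower stage only when the current stage is not confident. The confidence rule used here (agreement among the k nearest neighbors), the thresholds, and the toy data are assumptions for illustration; the abstract does not state the acceptance criterion.

```python
from collections import Counter


def knn_stage(train, dims, k):
    """k-NN classifier that looks only at the first `dims` coordinates (cheaper when dims is small)."""
    def classify(x):
        ranked = sorted(
            train,
            key=lambda item: sum(abs(a - b) for a, b in zip(item[0][:dims], x[:dims])),
        )
        votes = Counter(label for _, label in ranked[:k])
        label, count = votes.most_common(1)[0]
        return label, count / k  # confidence = fraction of neighbors that agree
    return classify


def cascade_classify(x, stages):
    """Run stages from cheapest to most accurate; the last stage is always accepted."""
    if not stages:
        raise ValueError("at least one stage is required")
    for index, (classify, threshold) in enumerate(stages):
        label, confidence = classify(x)
        if confidence >= threshold or index == len(stages) - 1:
            return label, index


train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.6, 0.0), "a"),
    ((1.0, 1.0), "b"), ((0.9, 0.8), "b"), ((0.5, 1.0), "b"), ((1.1, 0.9), "b"),
]
stages = [(knn_stage(train, dims=1, k=3), 1.0), (knn_stage(train, dims=2, k=3), 1.0)]

easy = cascade_classify((0.05, 0.1), stages)   # resolved by the cheap first stage
hard = cascade_classify((0.55, 0.9), stages)   # falls through to the second stage
```

Each call returns the predicted label together with the index of the stage that produced it, which makes it easy to measure how many queries the cheap stages absorb.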
-
ABSTRACT: A method is proposed that can generate a ranked list of plausible three-dimensional hand configurations that best match an input image. Hand pose estimation is formulated as an image database indexing problem, where the closest matches for an input hand image are retrieved from a large database of synthetic hand images. In contrast to previous approaches, the system can function in the presence of clutter, thanks to two novel clutter-tolerant indexing methods. First, a computationally efficient approximation of the image-to-model chamfer distance is obtained by embedding binary edge images into a high-dimensional Euclidean space. Second, a general-purpose, probabilistic line matching method identifies those line segment correspondences between model and input images that are the least likely to have occurred by chance. The performance of this clutter tolerant approach is demonstrated in quantitative experiments with hundreds of real hand images.
Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on; 07/2003
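For readers unfamiliar with the chamfer distance mentioned in the abstract above, here is a minimal sketch of one common directed form: the average distance from each point in one edge set to its nearest point in the other. The abstract does not give the exact definition used in the paper, so the averaging and the point-list representation of edge images are assumptions.

```python
import math


def directed_chamfer(source, target):
    """Average distance from each source edge pixel to its nearest target edge pixel."""
    if not source or not target:
        raise ValueError("both edge sets must be non-empty")
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


# Toy edge sets given as (x, y) pixel coordinates (hypothetical data).
image_edges = [(0, 0), (3, 4)]
model_edges = [(0, 0)]

image_to_model = directed_chamfer(image_edges, model_edges)
model_to_image = directed_chamfer(model_edges, image_edges)
```

The two directions differ on this toy input, which shows that the directed distance is not symmetric. This brute-force version costs time proportional to the product of the two set sizes per comparison, which is the kind of expense an embedding-based approximation aims to avoid.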
-
ABSTRACT: An appearance-based framework for 3D hand shape classification and simultaneous camera viewpoint estimation is presented. Given an input image of a segmented hand, the most similar matches from a large database of synthetic hand images are retrieved. The ground-truth labels of those matches, containing hand-shape and camera-viewpoint information, are returned by the system as estimates for the input image. Database retrieval is done hierarchically, by first quickly rejecting the vast majority of all database views, and then ranking the remaining candidates in order of similarity to the input. Four different similarity measures are employed, based on edge location, edge orientation, finger location and geometric moments, respectively.
Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on; 06/2002
-
ABSTRACT: Research on recognition and generation of signed languages and the gestural component of spoken languages has been held back by the unavailability of large-scale linguistically annotated corpora of the kind that led to significant advances in the area of spoken language. A major obstacle has been the lack of computational tools to assist in efficient analysis and transcription of visual language data. Here we describe SignStream, a computer program that we have designed to facilitate transcription and linguistic analysis of visual language. Machine vision methods to assist linguists in detailed annotation of gestures of the head, face, hands, and body are being developed. We have been using SignStream to analyze data from native signers of American Sign Language (ASL) collected in our new video collection facility, equipped with multiple synchronized digital video cameras. The video data and associated linguistic annotations are being made publicly available in multiple formats.
Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc 09/2001; 33(3):311-20.
-
ABSTRACT: A system for recovering 3D hand pose from monocular color sequences is proposed. The system employs a non-linear supervised learning framework, the specialized mappings architecture (SMA), to map image features to likely 3D hand poses. The SMA's fundamental components are a set of specialized forward mapping functions, and a single feedback matching function. The forward functions are estimated directly from training data, which in our case are examples of hand joint configurations and their corresponding visual features. The joint angle data in the training set is obtained via a CyberGlove, a glove with 22 sensors that monitor the angular motions of the palm and fingers. In training, the visual features are generated using a computer graphics module that renders the hand from arbitrary viewpoints given the 22 joint angles. The viewpoint is encoded by two real values, so 24 real values represent a hand pose. We test our system both on synthetic sequences and on sequences taken with a color camera. The system automatically detects and tracks both hands of the user, calculates the appropriate features, and estimates the 3D hand joint angles and viewpoint from those features. Results are encouraging given the complexity of the task.
Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on; 02/2001
-
01/2001;
-
ABSTRACT: A technique for 3D head tracking under varying illumination is proposed. The head is modeled as a texture mapped cylinder. Tracking is formulated as an image registration problem in the cylinder's texture map image. The resulting dynamic texture map provides a stabilized view of the face that can be used as input to many existing 2D techniques for face recognition, facial expression analysis, lip reading, and eye tracking. To solve the registration problem with lighting variation and head motion, the residual registration error is modeled as a linear combination of texture warping templates and orthogonal illumination templates. Fast stable online tracking is achieved via regularized weighted least-squares error minimization. The regularization tends to limit potential ambiguities that arise in the warping and illumination templates. It enables stable tracking over extended sequences. Tracking does not require a precise initial model fit; the system is initialized automatically using a simple 2D face detector. It is assumed that the target is facing the camera in the first frame. The formulation uses texture mapping hardware. The nonoptimized implementation runs at about 15 frames per second on an SGI O2 graphic workstation. Extensive experiments evaluating the effectiveness of the formulation are reported. The sensitivity of the technique to illumination, regularization parameters, errors in the initial positioning, and internal camera parameters is analyzed. Examples and applications of tracking are reported.
IEEE Transactions on Pattern Analysis and Machine Intelligence 05/2000; · 4.91 Impact Factor
-
ABSTRACT: A model approach for real-time skin segmentation in video sequences is described. The approach enables reliable skin segmentation despite wide variation in illumination during tracking. An explicit second order Markov model is used to predict evolution of the skin color (HSV) histogram over time. Histograms are dynamically updated based on feedback from the current segmentation and based on predictions of the Markov model. The evolution of the skin color distribution at each frame is parameterized by translation, scaling and rotation in color space. Consequent changes in geometric parameterization of the distribution are propagated by warping and re-sampling the histogram. The parameters of the discrete time dynamic Markov model are estimated using maximum likelihood estimation, and also evolve over time. Quantitative evaluation of the method was conducted on labeled ground-truth video sequences taken from popular movies.
Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on; 02/2000
-
ABSTRACT: When we search for images in multimedia documents, we often have in mind specific image types that we are interested in; examples are photographs, graphics, maps, cartoons, portraits of people, and so on. This paper describes an automated system that classifies Web images as photographs or graphics. The design of the system is based on statistical observations about the image content of the two types, as well as learning techniques which make use of the vast amount of training data available on the Web. Text associated with the image can be used to further improve the accuracy of the classification. The system is used as a part of Webseer, an image search engine for the Web.
Content-Based Access of Image and Video Libraries, 1997. Proceedings. IEEE Workshop on; 07/1997
-
ABSTRACT: This paper introduces BoostMap, a method that can significantly reduce retrieval time in image and video database systems that employ computationally expensive distance measures, metric or non-metric. Database and query objects are embedded into a Euclidean space, in which similarities can be rapidly measured using a weighted Manhattan distance. Embedding construction is formulated as a machine learning task, where AdaBoost is used to combine many simple, 1D embeddings into a multidimensional embedding that preserves a significant amount of the proximity structure in the original space. Performance is evaluated in a hand pose estimation system, and a dynamic gesture recognition system, where the proposed method is used to retrieve approximate nearest neighbors under expensive image and video similarity measures. In both systems, in quantitative experiments, BoostMap significantly increases efficiency, with minimal losses in accuracy. Moreover, the experiments indicate that BoostMap compares favorably with existing embedding methods that have been employed in computer vision and database applications, i.e., FastMap and Bourgain embeddings.
Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on;
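As a rough illustration of the retrieval scheme in the abstract above, the sketch below embeds objects with a few 1D embeddings and ranks database objects by a weighted Manhattan distance. It assumes each 1D embedding is the distance to a fixed reference object, uses hand-picked weights where the paper uses AdaBoost to select and weight the embeddings, and substitutes string edit distance for the expensive image or video measure. All names and data are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance, standing in for an expensive original-space measure."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def make_embedding(distance, references):
    """Stack 1D embeddings, each defined as the distance to one reference object."""
    def embed(x):
        return [distance(x, r) for r in references]
    return embed


def weighted_manhattan(u, v, weights):
    """Weighted L1 distance between two embedded vectors."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, v))


def retrieve(query, database, embed, weights, k):
    """Rank database objects by embedded distance to the query and keep the top k."""
    q = embed(query)
    embedded = [(embed(obj), obj) for obj in database]  # would be precomputed offline
    ranked = sorted(embedded, key=lambda pair: weighted_manhattan(q, pair[0], weights))
    return [obj for _, obj in ranked[:k]]


database = ["gesture", "gestures", "posture", "texture", "hand", "handle"]
embed = make_embedding(edit_distance, ["gesture", "hand"])
weights = [0.6, 0.4]  # hand-picked for the example, not learned
top = retrieve("gestured", database, embed, weights, k=2)
```

At query time only as many exact distance computations are needed as there are reference objects; everything else is cheap vector arithmetic. A practical system would typically re-rank the retrieved candidates with the exact distance.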