Robust Real-Time Face Detection


This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the Integral Image which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
Paul Viola and Michael Jones
Compaq Cambridge Research Laboratory
One Cambridge Center
Cambridge, MA 02142
We have constructed a frontal face detection system which achieves detection and false positive rates equivalent to the best published results [7, 5, 6, 4, 1]. This face detection system is most clearly distinguished from previous approaches by its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. Other face detection systems have used auxiliary information, such as image differences in video sequences or pixel color in color images, to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single greyscale image; these alternative sources of information can also be integrated with our system to achieve even higher frame rates.
The first contribution of this work is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al. [3], our detection system does not work directly with image intensities. Like these authors, we use a set of features which are reminiscent of Haar basis functions. In order to compute these features very rapidly at many scales, we introduce the integral image representation. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
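A minimal NumPy sketch of this idea (the function names are ours, not from the paper): the integral image is a double cumulative sum, after which the sum of any rectangle of pixels requires only four array references, regardless of the rectangle's size. The Haar-like features are then differences of such rectangle sums.

```python
import numpy as np

def integral_image(img):
    """ii(x, y) = sum of all pixels above and to the left of (x, y), inclusive.
    Computed with a cumulative sum over rows, then over columns."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of pixels inside a rectangle, in constant time from the integral image."""
    total = ii[top + height - 1, left + width - 1]
    if top > 0:
        total -= ii[top - 1, left + width - 1]
    if left > 0:
        total -= ii[top + height - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

A two-rectangle feature, for example, is simply `rect_sum` of one region minus `rect_sum` of its neighbor, so its cost is independent of scale.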
The second contribution of this work is a method for constructing a classifier by selecting a small number of important features using AdaBoost [2]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features and focus on a small set of critical features. Motivated by the work of Tieu and Viola [8], feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature. As a result, each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process.
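The constraint above can be sketched as follows, assuming a feature matrix `X` (one column per feature) and labels in {-1, +1}; this is a generic illustration of single-feature decision stumps inside AdaBoost, not the paper's exact training code:

```python
import numpy as np

def best_stump(X, y, w):
    """Search every (feature, threshold, polarity) stump and return the one with
    the lowest weighted error -- choosing a weak classifier chooses one feature."""
    best = (1.0, None)  # (weighted error, (feature index, threshold, polarity))
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thresh) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, thresh, polarity))
    return best

def adaboost_feature_select(X, y, rounds=3):
    """Each boosting round reweights the examples and selects one stump,
    i.e. one feature, so `rounds` rounds select at most `rounds` features."""
    w = np.full(len(y), 1.0 / len(y))
    chosen = []
    for _ in range(rounds):
        err, (j, t, p) = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak classifier's vote weight
        pred = np.where(p * (X[:, j] - t) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)         # upweight the mistakes
        w /= w.sum()
        chosen.append((j, t, p, alpha))
    return chosen
```

The exhaustive stump search is what makes each round act as a feature selector: only the single most discriminative feature (under the current example weights) survives the round.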
The third major contribution of this work is a method for combining successively more complex classifiers in a cascade structure, which dramatically increases the speed of the detector by focusing attention on promising regions of the image. More complex processing is reserved only for these promising regions. Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman [1].
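The control flow of such a cascade is simple enough to sketch in a few lines (a schematic of the idea, with stages represented as hypothetical scoring functions rather than the paper's trained classifiers):

```python
def cascade_classify(window, stages):
    """Run a window through a sequence of (score_fn, threshold) stages,
    ordered cheapest first. Reject as soon as any stage says no, so most
    background windows exit after the first, simplest test."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # rejected: spend no further computation here
    return True  # survived every stage: report a detection
```

Because the early stages are tuned to reject the easy background while passing nearly all faces, the expensive late stages run on only a tiny fraction of the sub-windows, which is where the average-case speedup comes from.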
The complete face detection cascade has 32 classifiers. Nevertheless, the cascade structure results in extremely rapid average detection times. The face detector runs at about 15 frames per second on 384 by 288 pixel images, which is about 15 times faster than any previous system. On the MIT+CMU dataset, containing 507 faces and 75 million sub-windows, our detection rate is 90% with 78 false detections (which is 1 false positive in about 961,000 queries).
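The per-query false positive figure follows directly from the numbers quoted above:

```python
sub_windows = 75_000_000
false_detections = 78
# One false positive per this many scanned sub-windows:
queries_per_false_positive = sub_windows / false_detections
print(round(queries_per_false_positive))  # roughly 961,000, as stated
```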
[1] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classifiers, 1997.
[2] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Eurocolt '95, pages 23–37. Springer-Verlag, 1995.
[3] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In ICCV, 1998.
[4] D. Roth, M. Yang, and N. Ahuja. A SNoW-based face detector. In NIPS 12, 2000.
[5] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE PAMI, volume 20, 1998.
[6] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In ICCV, 2000.
[7] K. Sung and T. Poggio. Example-based learning for view-based face detection. IEEE PAMI, volume 20, pages 39–51, 1998.
[8] K. Tieu and P. Viola. Boosting image retrieval. In ICCV, 2000.
Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV'01)