MERL A MITSUBISHI ELECTRIC RESEARCH LABORATORY
http://www.merl.com
Rapid Object Detection Using a Boosted
Cascade of Simple Features
Paul Viola and Michael Jones
TR-2004-043 May 2004
Abstract
This paper describes a machine learning approach for visual object detection which is capable
of processing images extremely rapidly and achieving high detection rates. This work is distin-
guished by three key contributions. The first is the introduction of a new image representation
called the Integral Image which allows the features used by our detector to be computed very
quickly. The second is a learning algorithm, based on AdaBoost, which selects a small num-
ber of critical visual features from a larger set and yields extremely efficient classifiers [6]. The
third contribution is a method for combining increasingly more complex classifiers in a cascade
which allows background regions of the image to be quickly discarded while spending more
computation on promising object-like regions. The cascade can be viewed as an object specific
focus-of-attention mechanism which unlike previous approaches provides statistical guarantees
that discarded regions are unlikely to contain the object of interest. In the domain of face detec-
tion the system yields detection rates comparable to the best previous systems. Used in real-time
applications, the detector runs at 15 frames per second without resorting to image differencing
or skin color detection.
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part
without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include
the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of
the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or
republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All
rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, Inc., 2004
201 Broadway, Cambridge, Massachusetts 02139
Publication History:
1. First printing, TR-2004-043, May 2004
ACCEPTED CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION 2001
Rapid Object Detection using a Boosted Cascade of Simple
Features
Paul Viola Michael Jones
viola@merl.com mjones@crl.dec.com
Mitsubishi Electric Research Labs Compaq CRL
201 Broadway, 8th FL One Cambridge Center
Cambridge, MA 02139 Cambridge, MA 02142
Abstract
This paper describes a machine learning approach for vi-
sual object detection which is capable of processing images
extremely rapidly and achieving high detection rates. This
work is distinguished by three key contributions. The first
is the introduction of a new image representation called the
“Integral Image” which allows the features used by our de-
tector to be computed very quickly. The second is a learning
algorithm, based on AdaBoost, which selects a small num-
ber of critical visual features from a larger set and yields
extremely efficient classifiers [6]. The third contribution is
a method for combining increasingly more complex classi-
fiers in a “cascade” which allows background regions of the
image to be quickly discarded while spending more compu-
tation on promising object-like regions. The cascade can be
viewed as an object specific focus-of-attention mechanism
which unlike previous approaches provides statistical guar-
antees that discarded regions are unlikely to contain the ob-
ject of interest. In the domain of face detection the system
yields detection rates comparable to the best previous sys-
tems. Used in real-time applications, the detector runs at
15 frames per second without resorting to image differenc-
ing or skin color detection.
1. Introduction
This paper brings together new algorithms and insights to
construct a framework for robust and extremely rapid object
detection. This framework is demonstrated on, and in part
motivated by, the task of face detection. Toward this end
we have constructed a frontal face detection system which
achieves detection and false positive rates which are equiv-
alent to the best published results [16, 12, 15, 11, 1]. This
face detection system is most clearly distinguished from
previous approaches in its ability to detect faces extremely
rapidly. Operating on 384 by 288 pixel images, faces are de-
tected at 15 frames per second on a conventional 700 MHz
Intel Pentium III. In other face detection systems, auxiliary
information, such as image differences in video sequences,
or pixel color in color images, have been used to achieve
high frame rates. Our system achieves high frame rates
working only with the information present in a single grey
scale image. These alternative sources of information can
also be integrated with our system to achieve even higher
frame rates.
There are three main contributions of our object detec-
tion framework. We will introduce each of these ideas
briefly below and then describe them in detail in subsequent
sections.
The first contribution of this paper is a new image repre-
sentation called an integral image that allows for very fast
feature evaluation. Motivated in part by the work of Papa-
georgiou et al. our detection system does not work directly
with image intensities [10]. Like these authors we use a
set of features which are reminiscent of Haar Basis func-
tions (though we will also use related filters which are more
complex than Haar filters). In order to compute these fea-
tures very rapidly at many scales we introduce the integral
image representation for images. The integral image can be
computed from an image using a few operations per pixel.
Once computed, any one of these Haar-like features can be
computed at any scale or location in constant time.
The second contribution of this paper is a method for
constructing a classifier by selecting a small number of im-
portant features using AdaBoost [6]. Within any image sub-
window the total number of Haar-like features is very large,
far larger than the number of pixels. In order to ensure fast
classification, the learning process must exclude a large ma-
jority of the available features, and focus on a small set of
critical features. Motivated by the work of Tieu and Viola,
feature selection is achieved through a simple modification
of the AdaBoost procedure: the weak learner is constrained
so that each weak classifier returned can depend on only a
single feature [2]. As a result each stage of the boosting
process, which selects a new weak classifier, can be viewed
as a feature selection process. AdaBoost provides an effec-
tive learning algorithm and strong bounds on generalization
performance [13, 9, 10].
The third major contribution of this paper is a method
for combining successively more complex classifiers in a
cascade structure which dramatically increases the speed of
the detector by focusing attention on promising regions of
the image. The notion behind focus of attention approaches
is that it is often possible to rapidly determine where in an
image an object might occur [17, 8, 1]. More complex pro-
cessing is reserved only for these promising regions. The
key measure of such an approach is the “false negative” rate
of the attentional process. It must be the case that all, or
almost all, object instances are selected by the attentional
filter.
We will describe a process for training an extremely sim-
ple and efficient classifier which can be used as a “super-
vised” focus of attention operator. The term supervised
refers to the fact that the attentional operator is trained to
detect examples of a particular class. In the domain of face
detection it is possible to achieve fewer than 1% false neg-
atives and 40% false positives using a classifier constructed
from two Haar-like features. The effect of this filter is to
reduce by over one half the number of locations where the
final detector must be evaluated.
Those sub-windows which are not rejected by the initial
classifier are processed by a sequence of classifiers, each
slightly more complex than the last. If any classifier rejects
the sub-window, no further processing is performed. The
structure of the cascaded detection process is essentially
that of a degenerate decision tree, and as such is related to
the work of Geman and colleagues [1, 4].
An extremely fast face detector will have broad prac-
tical applications. These include user interfaces, image
databases, and teleconferencing. In applications where
rapid frame-rates are not necessary, our system will allow
for significant additional post-processing and analysis. In
addition our system can be implemented on a wide range of
small low power devices, including hand-helds and embed-
ded processors. In our lab we have implemented this face
detector on the Compaq iPaq handheld and have achieved
detection at two frames per second (this device has a low
power 200 mips Strong Arm processor which lacks floating
point hardware).
The remainder of the paper describes our contributions
and a number of experimental results, including a detailed
description of our experimental methodology. Discussion
of closely related work takes place at the end of each sec-
tion.
2. Features
Our object detection procedure classifies images based on
the value of simple features. There are many motivations
for using features rather than the pixels directly. The most
common reason is that features can act to encode ad-hoc
domain knowledge that is difficult to learn using a finite
quantity of training data. For this system there is also a
second critical motivation for features: the feature based
system operates much faster than a pixel-based system.

Figure 1: Example rectangle features shown relative to the
enclosing detection window. The sum of the pixels which
lie within the white rectangles is subtracted from the sum
of pixels in the grey rectangles. Two-rectangle features are
shown in (A) and (B). Figure (C) shows a three-rectangle
feature, and (D) a four-rectangle feature.
The simple features used are reminiscent of Haar basis
functions which have been used by Papageorgiou et al. [10].
More specifically, we use three kinds of features. The value
of a two-rectangle feature is the difference between the sum
of the pixels within two rectangular regions. The regions
have the same size and shape and are horizontally or ver-
tically adjacent (see Figure 1). A three-rectangle feature
computes the sum within two outside rectangles subtracted
from the sum in a center rectangle. Finally a four-rectangle
feature computes the difference between diagonal pairs of
rectangles.
Given that the base resolution of the detector is 24x24,
the exhaustive set of rectangle features is quite large, over
180,000. Note that unlike the Haar basis, the set of rectan-
gle features is overcomplete.¹
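To make the size of this pool concrete, the short sketch below (ours, not the paper's code) enumerates placements of the five standard templates: two-rectangle (horizontal and vertical), three-rectangle (horizontal and vertical), and four-rectangle, inside a 24×24 window. This common parameterization yields 162,336 features; the paper's exact feature set is parameterized somewhat differently and yields over 180,000.

```python
def count_rectangle_features(window=24):
    """Count placements of rectangle-feature templates in a square
    sub-window. A template of dx-by-dy unit cells, each cell scaled
    to w-by-h pixels, fits wherever dx*w <= window and dy*h <= window."""
    def placements(dx, dy):
        total = 0
        for w in range(1, window // dx + 1):
            for h in range(1, window // dy + 1):
                total += (window - dx * w + 1) * (window - dy * h + 1)
        return total

    # two-rectangle (horiz., vert.), three-rectangle (horiz., vert.),
    # and four-rectangle templates
    templates = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
    return sum(placements(dx, dy) for dx, dy in templates)

print(count_rectangle_features())  # 162336 with these five templates
```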
2.1. Integral Image
Rectangle features can be computed very rapidly using an
intermediate representation for the image which we call the
integral image.² The integral image at location (x, y) contains
the sum of the pixels above and to the left of (x, y), inclusive:

    ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y'),

where ii(x, y) is the integral image and i(x, y) is the original
image.

¹A complete basis has no linear dependence between basis elements
and has the same number of elements as the image space, in this case 576.
The full set of 180,000 features is many times over-complete.
²There is a close relation to "summed area tables" as used in graphics
[3]. We choose a different name here in order to emphasize its use for the
analysis of images, rather than for texture mapping.
Figure 2: The sum of the pixels within rectangle D can be
computed with four array references. The value of the inte-
gral image at location 1 is the sum of the pixels in rectangle
A. The value at location 2 is A + B, at location 3 is A + C,
and at location 4 is A + B + C + D. The sum within D can
be computed as 4 + 1 - (2 + 3).
Using the following pair of recurrences:

    s(x, y) = s(x, y - 1) + i(x, y)        (1)
    ii(x, y) = ii(x - 1, y) + s(x, y)      (2)

(where s(x, y) is the cumulative row sum, s(x, -1) = 0, and
ii(-1, y) = 0) the integral image can be computed in one
pass over the original image.
Using the integral image any rectangular sum can be
computed in four array references (see Figure 2). Clearly
the difference between two rectangular sums can be com-
puted in eight references. Since the two-rectangle features
defined above involve adjacent rectangular sums they can
be computed in six array references, eight in the case of
the three-rectangle features, and nine for four-rectangle fea-
tures.
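As a concrete illustration of these computations, here is a minimal NumPy sketch (the function names and the zero-padded border are our choices, not the paper's):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; a leading row and column of
    zeros handles empty prefixes without special cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)  # row sums, then columns
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left corner (x, y),
    using the four array references of Figure 2."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, x, y, w, h):
    """A horizontal two-rectangle feature: left half minus right half.
    Shared corners mean this costs six references, not eight."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```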
2.2. Feature Discussion
Rectangle features are somewhat primitive when compared
with alternatives such as steerable filters [5, 7]. Steerable fil-
ters, and their relatives, are excellent for the detailed analy-
sis of boundaries, image compression, and texture analysis.
In contrast rectangle features, while sensitive to the pres-
ence of edges, bars, and other simple image structure, are
quite coarse. Unlike steerable filters the only orientations
available are vertical, horizontal, and diagonal. The set of
rectangle features do however provide a rich image repre-
sentation which supports effective learning. In conjunction
with the integral image, the efficiency of the rectangle fea-
ture set provides ample compensation for their limited flex-
ibility.
3. Learning Classification Functions
Given a feature set and a training set of positive and neg-
ative images, any number of machine learning approaches
could be used to learn a classification function. In our sys-
tem a variant of AdaBoost is used both to select a small set
of features and train the classifier [6]. In its original form,
the AdaBoost learning algorithm is used to boost the clas-
sification performance of a simple (sometimes called weak)
learning algorithm. There are a number of formal guaran-
tees provided by the AdaBoost learning procedure. Freund
and Schapire proved that the training error of the strong
classifier approaches zero exponentially in the number of
rounds. More importantly a number of results were later
proved about generalization performance [14]. The key
insight is that generalization performance is related to the
margin of the examples, and that AdaBoost achieves large
margins rapidly.
Recall that there are over 180,000 rectangle features as-
sociated with each image sub-window, a number far larger
than the number of pixels. Even though each feature can
be computed very efficiently, computing the complete set is
prohibitively expensive. Our hypothesis, which is borne out
by experiment, is that a very small number of these features
can be combined to form an effective classifier. The main
challenge is to find these features.
In support of this goal, the weak learning algorithm is
designed to select the single rectangle feature which best
separates the positive and negative examples (this is similar
to the approach of [2] in the domain of image database re-
trieval). For each feature, the weak learner determines the
optimal threshold classification function, such that the min-
imum number of examples are misclassified. A weak clas-
sifier h_j(x) thus consists of a feature f_j, a threshold θ_j, and
a parity p_j indicating the direction of the inequality sign:

    h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise.

Here x is a 24×24 pixel sub-window of an image. See Ta-
ble 1 for a summary of the boosting process.
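The optimal threshold and parity for a single feature can be found exactly with one sort and a linear sweep over the weighted examples. The sketch below is our illustration of that step; train_stump and its return convention are assumed names, and tie-handling at the threshold is ignored.

```python
import numpy as np

def train_stump(feature_values, labels, weights):
    """Weighted decision stump over one feature: pick the threshold and
    parity minimizing the weighted error. labels are in {0, 1}."""
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]
    total_pos = np.sum(w * y)
    total_neg = np.sum(w * (1 - y))
    # weight of positives/negatives at or below each candidate split
    pos_below = np.cumsum(w * y)
    neg_below = np.cumsum(w * (1 - y))
    # parity +1 predicts 1 below the split; parity -1 predicts 1 above it
    err_pos = neg_below + (total_pos - pos_below)
    err_neg = pos_below + (total_neg - neg_below)
    i = int(np.argmin(np.minimum(err_pos, err_neg)))
    parity = 1 if err_pos[i] <= err_neg[i] else -1
    return f[i], parity, float(min(err_pos[i], err_neg[i]))
```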
In practice no single feature can perform the classifica-
tion task with low error. Features which are selected in early
rounds of the boosting process had error rates between 0.1
and 0.3. Features selected in later rounds, as the task be-
comes more difficult, yield error rates between 0.4 and 0.5.
3.1. Learning Discussion
Many general feature selection procedures have been pro-
posed (see chapter 8 of [18] for a review). Our final appli-
cation demanded a very aggressive approach which would
discard the vast majority of features. For a similar recogni-
tion problem Papageorgiou et al. proposed a scheme for fea-
ture selection based on feature variance [10]. They demon-
strated good results selecting 37 features out of a total 1734
features.
Roth et al. propose a feature selection process based
on the Winnow exponential perceptron learning rule [11].
The Winnow learning process converges to a solution where
many of these weights are zero. Nevertheless a very large
number of features are retained (perhaps a few hundred or
thousand).

• Given example images (x_1, y_1), ..., (x_n, y_n) where
  y_i = 0, 1 for negative and positive examples respectively.
• Initialize weights w_{1,i} = 1/(2m) for y_i = 0 and
  w_{1,i} = 1/(2l) for y_i = 1, where m and l are the number
  of negatives and positives respectively.
• For t = 1, ..., T:
  1. Normalize the weights, w_{t,i} ← w_{t,i} / Σ_j w_{t,j},
     so that w_t is a probability distribution.
  2. For each feature j, train a classifier h_j which is re-
     stricted to using a single feature. The error is evaluated
     with respect to w_t: ε_j = Σ_i w_i |h_j(x_i) - y_i|.
  3. Choose the classifier h_t with the lowest error ε_t.
  4. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1-e_i}, where
     e_i = 0 if example x_i is classified correctly, e_i = 1
     otherwise, and β_t = ε_t / (1 - ε_t).
• The final strong classifier is:

    h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t,
    and 0 otherwise,

  where α_t = log(1/β_t).

Table 1: The AdaBoost algorithm for classifier learning.
Each round of boosting selects one feature from the 180,000
potential features.
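Read as code, Table 1 might look like the NumPy sketch below, reusing the train_stump helper sketched above. This is an illustration of the procedure, not the authors' implementation; a real system would cache sorted feature columns rather than re-sorting 180,000 features each round.

```python
import numpy as np

def adaboost(features, labels, rounds):
    """features[i, j] = f_j(x_i); labels in {0, 1}. Each round selects
    the single best one-feature stump and reweights the examples."""
    m, l = np.sum(labels == 0), np.sum(labels == 1)
    w = np.where(labels == 0, 1.0 / (2 * m), 1.0 / (2 * l))
    strong = []
    for _ in range(rounds):
        w = w / w.sum()                                 # step 1: normalize
        best = None
        for j in range(features.shape[1]):              # step 2: each feature
            theta, p, err = train_stump(features[:, j], labels, w)
            if best is None or err < best[3]:
                best = (j, theta, p, err)
        j, theta, p, err = best                         # step 3: lowest error
        beta = err / (1.0 - err)                        # assumes 0 < err < 0.5
        correct = ((p * features[:, j] < p * theta).astype(int) == labels)
        w = w * np.where(correct, beta, 1.0)            # step 4: reweight
        strong.append((j, theta, p, np.log(1.0 / beta)))
    return strong

def strong_classify(strong, x):
    """h(x) = 1 iff sum_t alpha_t h_t(x) >= (1/2) sum_t alpha_t."""
    score = sum(a * (p * x[j] < p * theta) for j, theta, p, a in strong)
    return int(score >= 0.5 * sum(a for _, _, _, a in strong))
```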
3.2. Learning Results
While details on the training and performance of the final
system are presented in Section 5, several simple results
merit discussion. Initial experiments demonstrated that a
frontal face classifier constructed from 200 features yields
a detection rate of 95% with a false positive rate of 1 in
14084. These results are compelling, but not sufficient for
many real-world tasks. In terms of computation, this clas-
sifier is probably faster than any other published system,
requiring 0.7 seconds to scan a 384 by 288 pixel image.
Unfortunately, the most straightforward technique for im-
proving detection performance, adding features to the clas-
sifier, directly increases computation time.
For the task of face detection, the initial rectangle fea-
tures selected by AdaBoost are meaningful and easily inter-
preted. The first feature selected seems to focus on the prop-
erty that the region of the eyes is often darker than the region
of the nose and cheeks (see Figure 3). This feature is rel-
atively large in comparison with the detection sub-window,
and should be somewhat insensitive to size and location of
the face. The second feature selected relies on the property
that the eyes are darker than the bridge of the nose.

Figure 3: The first and second features selected by Ad-
aBoost. The two features are shown in the top row and then
overlayed on a typical training face in the bottom row. The
first feature measures the difference in intensity between the
region of the eyes and a region across the upper cheeks. The
feature capitalizes on the observation that the eye region is
often darker than the cheeks. The second feature compares
the intensities in the eye regions to the intensity across the
bridge of the nose.
4. The Attentional Cascade
This section describes an algorithm for constructing a cas-
cade of classifiers which achieves increased detection per-
formance while radically reducing computation time. The
key insight is that smaller, and therefore more efficient,
boosted classifiers can be constructed which reject many of
the negative sub-windows while detecting almost all posi-
tive instances (i.e. the threshold of a boosted classifier can
be adjusted so that the false negative rate is close to zero).
Simpler classifiers are used to reject the majority of sub-
windows before more complex classifiers are called upon
to achieve low false positive rates.
The overall form of the detection process is that of a de-
generate decision tree, what we call a “cascade” (see Fig-
ure 4). A positive result from the first classifier triggers the
evaluation of a second classifier which has also been ad-
justed to achieve very high detection rates. A positive result
from the second classifier triggers a third classifier, and so
on. A negative outcome at any point leads to the immediate
rejection of the sub-window.
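The control flow of the cascade is an early-exit loop over stages; a minimal sketch follows (the (score function, threshold) stage representation is our assumption):

```python
def cascade_classify(stages, window):
    """Evaluate a cascade on one sub-window. Each stage is a
    (strong_classifier_score, threshold) pair; a sub-window must pass
    every stage, so most negatives exit after one or two stages."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False      # rejected: later stages are never evaluated
    return True               # survived all stages: report a detection
```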
Stages in the cascade are constructed by training clas-
sifiers using AdaBoost and then adjusting the threshold to
minimize false negatives. Note that the default AdaBoost
threshold is designed to yield a low error rate on the train-
ing data. In general a lower threshold yields higher detec-
tion rates and higher false positive rates.

Figure 4: Schematic depiction of the detection cascade. A
series of classifiers are applied to every sub-window. The
initial classifier eliminates a large number of negative exam-
ples with very little processing. Subsequent layers eliminate
additional negatives but require additional computation. Af-
ter several stages of processing the number of sub-windows
has been reduced radically. Further processing can take any
form such as additional stages of the cascade (as in our de-
tection system) or an alternative detection system.
For example an excellent first stage classifier can be con-
structed from a two-feature strong classifier by reducing the
threshold to minimize false negatives. Measured against a
validation training set, the threshold can be adjusted to de-
tect 100% of the faces with a false positive rate of 40%. See
Figure 3 for a description of the two features used in this
classifier.
Computation of the two feature classifier amounts to
about 60 microprocessor instructions. It seems hard to
imagine that any simpler filter could achieve higher rejec-
tion rates. By comparison, scanning a simple image tem-
plate, or a single layer perceptron, would require at least 20
times as many operations per sub-window.
The structure of the cascade reflects the fact that
within any single image an overwhelming majority of sub-
windows are negative. As such, the cascade attempts to re-
ject as many negatives as possible at the earliest stage pos-
sible. While a positive instance will trigger the evaluation
of every classifier in the cascade, this is an exceedingly rare
event.
Much like a decision tree, subsequent classifiers are
trained using those examples which pass through all the
previous stages. As a result, the second classifier faces a
more difficult task than the first. The examples which make
it through the first stage are “harder” than typical exam-
ples. The more difficult examples faced by deeper classi-
fiers push the entire receiver operating characteristic (ROC)
curve downward. At a given detection rate, deeper classi-
fiers have correspondingly higher false positive rates.
4.1. Training a Cascade of Classifiers
The cascade training process involves two types of trade-
offs. In most cases classifiers with more features will
achieve higher detection rates and lower false positive rates.
At the same time classifiers with more features require more
time to compute. In principle one could define an optimiza-
tion framework in which: i) the number of classifier stages,
ii) the number of features in each stage, and iii) the thresh-
old of each stage, are traded off in order to minimize the
expected number of evaluated features. Unfortunately find-
ing this optimum is a tremendously difficult problem.
In practice a very simple framework is used to produce
an effective classifier which is highly efficient. Each stage
in the cascade reduces the false positive rate and decreases
the detection rate. A target is selected for the minimum
reduction in false positives and the maximum decrease in
detection. Each stage is trained by adding features until the
target detection and false positive rates are met (these rates
are determined by testing the detector on a validation set).
Stages are added until the overall target for false positive
and detection rate is met.
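In pseudocode-like Python, this simple framework might look as follows. Every helper named here (train_boosted_stage, lower_threshold_for_tpr, false_positive_rate, stage_accepts) is a placeholder we introduce for illustration; the paper specifies the targets, not an implementation.

```python
def train_cascade(pos, neg, stage_max_fpr, stage_min_tpr, target_fpr):
    """Stage-wise cascade training sketch: grow each stage one feature
    at a time until its per-stage targets are met on a validation set,
    then add stages until the overall false positive target is reached."""
    stages, overall_fpr = [], 1.0
    while overall_fpr > target_fpr and neg:
        n_features = 0
        while True:
            n_features += 1
            stage = train_boosted_stage(pos, neg, n_features)   # assumed helper
            lower_threshold_for_tpr(stage, pos, stage_min_tpr)  # assumed helper
            if false_positive_rate(stage, neg) <= stage_max_fpr:
                break
        overall_fpr *= false_positive_rate(stage, neg)
        stages.append(stage)
        # deeper stages train only on negatives the cascade still passes
        neg = [x for x in neg if stage_accepts(stage, x)]       # assumed helper
    return stages
```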
4.2. Detector Cascade Discussion
The complete face detection cascade has 38 stages with over
6000 features. Nevertheless the cascade structure results in
fast average detection times. On a difficult dataset, con-
taining 507 faces and 75 million sub-windows, faces are
detected using an average of 10 feature evaluations per sub-
window. In comparison, this system is about 15 times faster
than an implementation of the detection system constructed
by Rowley et al.³ [12].
A notion similar to the cascade appears in the face de-
tection system described by Rowley et al. in which two de-
tection networks are used [12]. Rowley et al. used a faster
yet less accurate network to prescreen the image in order to
find candidate regions for a slower more accurate network.
Though it is difficult to determine exactly, it appears that
Rowley et al.’s two network face system is the fastest exist-
ing face detector.⁴
The structure of the cascaded detection process is es-
sentially that of a degenerate decision tree, and as such is
related to the work of Amit and Geman [1]. Unlike tech-
niques which use a fixed detector, Amit and Geman propose
an alternative point of view where unusual co-occurrences
of simple image features are used to trigger the evaluation
of a more complex detection process. In this way the full
detection process need not be evaluated at many of the po-
tential image locations and scales. While this basic insight
is very valuable, in their implementation it is necessary to
first evaluate some feature detector at every location. These
features are then grouped to find unusual co-occurrences. In
practice, since the form of our detector and the features that
it uses are extremely efficient, the amortized cost of evalu-
ating our detector at every scale and location is much faster
than finding and grouping edges throughout the image.

³Henry Rowley very graciously supplied us with implementations of
his detection system for direct comparison. Reported results are against
his fastest system. It is difficult to determine from the published literature,
but the Rowley-Baluja-Kanade detector is widely considered the fastest
detection system and has been heavily tested on real-world problems.
⁴Other published detectors have either neglected to discuss perfor-
mance in detail, or have never published detection and false positive rates
on a large and difficult training set.
In recent work Fleuret and Geman have presented a face
detection technique which relies on a “chain” of tests in or-
der to signify the presence of a face at a particular scale and
location [4]. The image properties measured by Fleuret and
Geman, disjunctions of fine scale edges, are quite different
than rectangle features which are simple, exist at all scales,
and are somewhat interpretable. The two approaches also
differ radically in their learning philosophy. The motivation
for Fleuret and Geman’s learning process is density estima-
tion and density discrimination, while our detector is purely
discriminative. Finally the false positive rate of Fleuret and
Geman’s approach appears to be higher than that of previ-
ous approaches like Rowley et al. and this approach. Un-
fortunately the paper does not report quantitative results of
this kind. The included example images each have between
2 and 10 false positives.
5. Results
A 38 layer cascaded classifier was trained to detect frontal
upright faces. To train the detector, a set of face and non-
face training images were used. The face training set con-
sisted of 4916 hand labeled faces scaled and aligned to a
base resolution of 24 by 24 pixels. The faces were ex-
tracted from images downloaded during a random crawl of
the world wide web. Some typical face examples are shown
in Figure 5. The non-face subwindows used to train the
detector come from 9544 images which were manually in-
spected and found to not contain any faces. There are about
350 million subwindows within these non-face images.
The number of features in the first five layers of the de-
tector is 1, 10, 25, 25 and 50 features respectively. The
remaining layers have increasingly more features. The total
number of features in all layers is 6061.
Each classifier in the cascade was trained with the 4916
training faces (plus their vertical mirror images for a total
of 9832 training faces) and 10,000 non-face sub-windows
(also of size 24 by 24 pixels) using the AdaBoost training
procedure. For the initial one feature classifier, the non-
face training examples were collected by selecting random
sub-windows from a set of 9544 images which did not con-
tain faces. The non-face examples used to train subsequent
layers were obtained by scanning the partial cascade across
the non-face images and collecting false positives. A max-
imum of 10000 such non-face sub-windows were collected
for each layer.
Figure 5: Example of frontal upright face images used for
training.

Speed of the Final Detector
The speed of the cascaded detector is directly related to
the number of features evaluated per scanned sub-window.
Evaluated on the MIT+CMU test set [12], an average of 10
features out of a total of 6061 are evaluated per sub-window.
This is possible because a large majority of sub-windows
are rejected by the first or second layer in the cascade. On
a 700 MHz Pentium III processor, the face detector can pro-
cess a 384 by 288 pixel image in about 0.067 seconds (us-
ing a starting scale of 1.25 and a step size of 1.5 described
below). This is roughly 15 times faster than the Rowley-
Baluja-Kanade detector [12] and about 600 times faster than
the Schneiderman-Kanade detector [15].
Image Processing
All example sub-windows used for training were vari-
ance normalized to minimize the effect of different light-
ing conditions. Normalization is therefore necessary during
detection as well. The variance of an image sub-window
can be computed quickly using a pair of integral images.
Recall that σ² = (1/N) Σ x² - m², where σ is the standard
deviation, m is the mean, N is the number of pixels, and x
is the pixel value within the sub-window. The mean of a
sub-window can be com-
puted using the integral image. The sum of squared pixels
is computed using an integral image of the image squared
(i.e. two integral images are used in the scanning process).
During scanning the effect of image normalization can be
achieved by post-multiplying the feature values rather than
pre-multiplying the pixels.
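Reusing the rect_sum helper sketched in Section 2.1, the per-window variance can be read off two integral images as follows (a sketch under our naming; the paper describes the idea, not this code):

```python
def window_variance(ii, ii_sq, x, y, w, h):
    """Variance of a w-by-h sub-window from two integral images: one
    over the image and one over the squared image. Uses the identity
    sigma^2 = (1/N) * sum(x^2) - m^2, so each term costs four
    array references."""
    n = w * h
    mean = rect_sum(ii, x, y, w, h) / n
    return rect_sum(ii_sq, x, y, w, h) / n - mean ** 2
```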
Scanning the Detector
The final detector is scanned across the image at multi-
ple scales and locations. Scaling is achieved by scaling the
detector itself, rather than scaling the image. This process
makes sense because the features can be evaluated at any
scale with the same cost. Good results were obtained using
a set of scales a factor of 1.25 apart.

Detector                  False detections
                          10      31      50      65      78       95      167
Viola-Jones               76.1%   88.4%   91.4%   92.0%   92.1%    92.9%   93.9%
Viola-Jones (voting)      81.1%   89.7%   92.1%   93.1%   93.1%    93.2%   93.7%
Rowley-Baluja-Kanade      83.2%   86.0%   -       -       -        89.2%   90.1%
Schneiderman-Kanade       -       -       -       94.4%   -        -       -
Roth-Yang-Ahuja           -       -       -       -       (94.8%)  -       -

Table 2: Detection rates for various numbers of false positives on the MIT+CMU test set containing 130 images and 507
faces.
The detector is also scanned across location. Subsequent
locations are obtained by shifting the window some number
of pixels Δ. This shifting process is affected by the scale of
the detector: if the current scale is s the window is shifted
by [sΔ], where [·] is the rounding operation.
The choice of Δ affects both the speed of the detector as
well as accuracy. The results we present are for Δ = 1.0.
We can achieve a significant speedup by setting Δ = 1.5
with only a slight decrease in accuracy.
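Combining the scale factor with the scale-dependent shift, the scan order can be generated as in this sketch (ours; boundary and rounding conventions are assumptions):

```python
def scan_positions(width, height, base=24, scale_step=1.25, delta=1.0):
    """Yield (x, y, size) sub-windows: the detector is scaled by factors
    of scale_step and, at scale s, shifted by round(s * delta) pixels."""
    s = 1.0
    while round(base * s) <= min(width, height):
        size = int(round(base * s))
        shift = max(1, int(round(s * delta)))
        for y in range(0, height - size + 1, shift):
            for x in range(0, width - size + 1, shift):
                yield x, y, size
        s *= scale_step
```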
Integration of Multiple Detections
Since the final detector is insensitive to small changes in
translation and scale, multiple detections will usually occur
around each face in a scanned image. The same is often true
of some types of false positives. In practice it often makes
sense to return one final detection per face. Toward this end
it is useful to postprocess the detected sub-windows in order
to combine overlapping detections into a single detection.
In these experiments detections are combined in a very
simple fashion. The set of detections are first partitioned
into disjoint subsets. Two detections are in the same subset
if their bounding regions overlap. Each partition yields a
single final detection. The corners of the final bounding
region are the average of the corners of all detections in the
set.
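A sketch of this merging step (ours; detections as (x1, y1, x2, y2) corner tuples) that groups transitively overlapping boxes and averages their corners:

```python
def merge_detections(boxes):
    """Partition detections into overlap-connected groups, then return
    one box per group whose corners average the group's corners."""
    def overlaps(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    groups = []
    for box in boxes:
        hit = [g for g in groups if any(overlaps(box, m) for m in g)]
        merged = [box] + [m for g in hit for m in g]  # union of touched groups
        groups = [g for g in groups if g not in hit] + [merged]
    return [tuple(sum(b[k] for b in g) / len(g) for k in range(4))
            for g in groups]
```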
Experiments on a Real-World Test Set
We tested our system on the MIT+CMU frontal face test
set [12]. This set consists of 130 images with 507 labeled
frontal faces. A ROC curve showing the performance of our
detector on this test set is shown in Figure 6. To create the
ROC curve the threshold of the final layer classifier is ad-
justed from -∞ to +∞. Adjusting the threshold to +∞ will
yield a detection rate of 0.0 and a false positive rate of 0.0.
Adjusting the threshold to -∞, however, increases both the
detection rate and false positive rate, but only to a certain
point. Neither rate can be higher than the rate of the de-
tection cascade minus the final layer. In effect, a threshold
of -∞ is equivalent to removing that layer. Further
increasing the detection and false positive rates requires de-
creasing the threshold of the next classifier in the cascade.
Thus, in order to construct a complete ROC curve, classifier
layers are removed. We use the number of false positives as
opposed to the rate of false positives for the x-axis of the
ROC curve to facilitate comparison with other systems. To
compute the false positive rate, simply divide by the total
number of sub-windows scanned. In our experiments, the
number of sub-windows scanned is 75,081,800.
Unfortunately, most previous published results on face
detection have only included a single operating regime (i.e.
single point on the ROC curve). To make comparison with
our detector easier we have listed our detection rate for the
false positive rates reported by the other systems. Table 2
lists the detection rate for various numbers of false detec-
tions for our system as well as other published systems. For
the Rowley-Baluja-Kanade results [12], a number of differ-
ent versions of their detector were tested, yielding a number
of different results; they are all listed under the same head-
ing. For the Roth-Yang-Ahuja detector [11], they reported
their result on the MIT+CMU test set with 5 images con-
taining line-drawn faces removed.
Figure 7 shows the output of our face detector on some
test images from the MIT+CMU test set.
A simple voting scheme to further improve results
In Table 2 we also show results from running three de-
tectors (the 38 layer one described above plus two similarly
trained detectors) and outputting the majority vote of the
three detectors. This improves the detection rate as well as
eliminating more false positives. The improvement would
be greater if the detectors were more independent. The cor-
relation of their errors results in a modest improvement over
the best single detector.
6. Conclusions
We have presented an approach for object detection which
minimizes computation time while achieving high detection
accuracy. The approach was used to construct a face de-
tection system which is approximately 15 times faster than any
previous approach.
This paper brings together new algorithms, representa-
tions, and insights which are quite generic and may well
have broader application in computer vision and image pro-
cessing.

Figure 6: ROC curve for our face detector on the
MIT+CMU test set. The detector was run using a step size
of 1.0 and starting scale of 1.0 (75,081,800 sub-windows
scanned).

Figure 7: Output of our face detector on a number of test
images from the MIT+CMU test set.
Finally this paper presents a set of detailed experiments
on a difficult face detection dataset which has been widely
studied. This dataset includes faces under a very wide range
of conditions including: illumination, scale, pose, and cam-
era variation. Experiments on such a large and complex
dataset are difficult and time consuming. Nevertheless sys-
tems which work under these conditions are unlikely to be
brittle or limited to a single set of conditions. More impor-
tantly conclusions drawn from this dataset are unlikely to
be experimental artifacts.
References
[1] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape
features and tree classifiers. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(11), 1997.
[2] K. Tieu and P. Viola. Boosting image retrieval. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2000.
[3] F. Crow. Summed-area tables for texture mapping. In
Proceedings of SIGGRAPH, volume 18(3), pages 207–212,
1984.
[4] F. Fleuret and D. Geman. Coarse-to-fine face detection. Int.
J. Computer Vision, 2001.
[5] William T. Freeman and Edward H. Adelson. The design
and use of steerable filters. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 13(9):891–906, 1991.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic
generalization of on-line learning and an application to
boosting. In Computational Learning Theory: Eurocolt ’95,
pages 23–37. Springer-Verlag, 1995.
[7] H. Greenspan, S. Belongie, R. Gooodman, P. Perona, S. Rak-
shit, and C. Anderson. Overcomplete steerable pyramid fil-
ters and rotation invariance. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 1994.
[8] L. Itti, C. Koch, and E. Niebur. A model of saliency-based
visual attention for rapid scene analysis. IEEE Patt. Anal.
Mach. Intell., 20(11):1254–1259, November 1998.
[9] Edgar Osuna, Robert Freund, and Federico Girosi. Training
support vector machines: an application to face detection.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 1997.
[10] C. Papageorgiou, M. Oren, and T. Poggio. A general frame-
work for object detection. In International Conference on
Computer Vision, 1998.
[11] D. Roth, M. Yang, and N. Ahuja. A snowbased face detector.
In Neural Information Processing 12, 2000.
[12] H. Rowley, S. Baluja, and T. Kanade. Neural network-based
face detection. In IEEE Patt. Anal. Mach. Intell., volume 20,
pages 22–38, 1998.
[13] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boost-
ing the margin: a new explanation for the effectiveness of
voting methods. Ann. Stat., 26(5):1651–1686, 1998.
[14] Robert E. Schapire, Yoav Freund, Peter Bartlett, and
Wee Sun Lee. Boosting the margin: A new explanation for
the effectiveness of voting methods. In Proceedings of the
Fourteenth International Conference on Machine Learning,
1997.
[15] H. Schneiderman and T. Kanade. A statistical method for 3D
object detection applied to faces and cars. In International
Conference on Computer Vision, 2000.
[16] K. Sung and T. Poggio. Example-based learning for view-
based face detection. In IEEE Patt. Anal. Mach. Intell., vol-
ume 20, pages 39–51, 1998.
[17] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y.H. Lai, N. Davis,
and F. Nuflo. Modeling visual-attention via selective tun-
ing. Artificial Intelligence Journal, 78(1-2):507–545, Octo-
ber 1995.
[18] Andrew Webb. Statistical Pattern Recognition. Oxford Uni-
versity Press, New York, 1999.