Exploiting Monge Structures in Optimum Subwindow Search
Senjian An, Patrick Peursum, Wanquan Liu, Svetha Venkatesh and Xiaoming Chen
Dept. of Computing, Curtin University of Technology
GPO Box U1987, Perth, WA 6845, Australia.
s.an, p.peursum, w.liu, s.venkatesh, x.chen@curtin.edu.au
Abstract

Optimum subwindow search for object detection aims to find a subwindow such that the contained subimage is most similar to the query object. This problem can be formulated as a four-dimensional (4D) maximum entry search problem wherein each entry corresponds to the quality score of the subimage contained in a subwindow. For n × n images, a naive exhaustive search requires O(n⁴) sequential computations of the quality scores for all subwindows. To reduce the time complexity, we prove that, for some typical similarity functions such as the Euclidean metric and the χ² metric on image histograms, the associated 4D array carries Monge structures, and we utilise these properties to speed up the optimum subwindow search, reducing the time complexity to O(n³). Furthermore, we propose a locally optimal alternating column and row search method with typical quadratic time complexity O(n²). Experiments on PASCAL VOC 2006 demonstrate that the alternating method is significantly faster than the well-known efficient subwindow search (ESS) method, whilst the performance loss due to the local maxima problem is negligible.
1. Introduction

Object detection and localization aims to find objects in images and has become a hot topic in computer vision in recent years. Since natural objects have complex shapes, it is hard to find the exact position and shape of the objects in images. An easier approach is to find the bounding box of the object; methods of this kind are called sliding window methods [6, 11, 17, 12]. The first step of sliding window methods is to train a quality function based on SIFT [16], SURF [5] or other types of features extracted from training images; the next is to apply the quality function to all possible subimages to find the object by maximizing the quality scores. A naive exhaustive search applies the quality function to all O(n⁴) subwindows of an n × n image. This may be computationally prohibitive in practice. To reduce the computational complexity, many approximate methods were proposed [8, 9, 22, 21]. Recently, a fast globally optimal method, called Efficient Subwindow Search (ESS) [14, 15], was developed to solve this problem. Since ESS is a branch and bound method and the upper bound on the number of its iterations is O(n⁴), its running time has a wide variance across images and its worst-case time complexity is O(n⁴), close to the exhaustive search. In [3], a faster branch and bound method, called improved ESS (I-ESS), is proposed. The worst-case complexity of I-ESS is cubic and it is much faster than ESS in the experiments on PASCAL VOC 2006. To further speed up the subwindow search, [3] also proposed a locally optimal method, called AESS. However, I-ESS and AESS are only applicable when the quality function is linear in the histogram of the extracted features. In this case, the quality score of a subwindow simply sums up the contributions of the extracted features [14, 3]. This allows the subwindow search problem to be reduced to a maximum submatrix problem that finds a region in a matrix with the largest sum of entries. When the quality function is nonlinear, the subwindow search is no longer a maximum submatrix search problem and AESS and I-ESS are not applicable.

Since nonlinear classifiers using kernels usually outperform linear methods [21], it is important to develop efficient algorithms for optimum subwindow search with nonlinear quality functions. Although ESS can be applied to some nonlinear quality functions [14, 15], its performance has large variance across images, as in the linear case. In this paper, we examine techniques to find computationally efficient methods with lesser variance across images, thus leading to stable performance.
the optimal subwindow search problem as a four dimen
sional combinatorial optimisation problem, which finds the
maximum entry in array, wherein each entry corresponds
to the quality score of the subimage contained in a sub
window. Our theoretical contribution is to show that if the
quality function satisfies certain conditions, the four dimen
sional array carries nice Monge properties [7] which has
been widely used to speed up in many optimization prob
lems. We utilise these properties to formulate new algo
To do so, we formulate
9269781424469857/10/$26.00 ©2010 IEEE
Page 2
rithms for optimum subwindow search problem to reduce
the time complexity to O(n3). Furthermore, we propose a
locally optimal method, called ARCS, to further reduce the
complexitytoO(n2). ThoughARCSisbasedonalternating
optimization and may encounter local maxima problems,
our experiments demonstrate that the performance loss due
to local maxima problem is negligible whilst being signifi
cantly faster than ESS. The significance of our approach is
to identify the utility of using Monge properties in the con
text of computer vision tasks, and the proposed paradigm
has potential to be used in more general settings.
The rest of the paper is organized as follows. Section 2 reviews the Monge property of matrices and the efficient algorithm for finding row maxima, which can be computed by visiting only O(n) matrix elements. In Section 3 we formulate the optimum search problem as a 4D array search problem and address the array's Monge properties. Section 4 addresses the alternating search method, and Section 5 the application issues in object detection. The experimental results are provided in Section 6.
2. Monge Matrices
An m × n real matrix A = {a[i,j]} is called a Monge matrix if it satisfies the so-called Monge property

a[i,j] + a[k,l] ≤ a[i,l] + a[k,j], ∀ i < k, j < l.    (1)

A is called an inverse Monge matrix if the above inequalities hold in the reverse direction, i.e.,

a[i,j] + a[k,l] ≥ a[i,l] + a[k,j], ∀ i < k, j < l.    (2)

There are many interesting and useful properties of (inverse) Monge matrices. We present only the following three fundamental properties, which will be used in later sections. The first two are useful for checking whether a matrix is Monge, and the third is related to an efficient algorithm for computing row maxima, which is fundamental to our application.

Proposition 1 [7] The Monge property or inverse Monge property holds if and only if it holds for adjacent rows and adjacent columns; that is, it suffices to require

a[i,j] + a[i+1,j+1] ≤ a[i,j+1] + a[i+1,j], ∀ 1 ≤ i < m, 1 ≤ j < n    (3)

for real Monge matrices, and

a[i,j] + a[i+1,j+1] ≥ a[i,j+1] + a[i+1,j], ∀ 1 ≤ i < m, 1 ≤ j < n    (4)

for real inverse Monge matrices.
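Proposition 1 makes the (inverse) Monge property cheap to test in code, since only adjacent 2 × 2 submatrices need to be checked. A minimal sketch (the function name and tolerance are ours, not from the paper):

```python
def is_inverse_monge(a, tol=1e-9):
    """Test the inverse Monge property via Proposition 1: it suffices that
    a[i][j] + a[i+1][j+1] >= a[i][j+1] + a[i+1][j] for all adjacent pairs."""
    m, n = len(a), len(a[0])
    return all(
        a[i][j] + a[i + 1][j + 1] >= a[i][j + 1] + a[i + 1][j] - tol
        for i in range(m - 1)
        for j in range(n - 1)
    )

# Example: a[i][j] = -(i - j)^2 is inverse Monge, since for d = i - j
# we have -2d^2 >= -(d - 1)^2 - (d + 1)^2.
A = [[-(i - j) ** 2 for j in range(5)] for i in range(4)]
print(is_inverse_monge(A))  # True
```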
For a Monge matrix A, if we reverse the order of its rows, then (4) will be satisfied. So we have

Corollary 1 By reversing the order of its rows, a Monge matrix becomes an inverse Monge matrix, and vice versa.
Proposition 2 [7] The class of m × n (inverse) Monge matrices forms a convex cone in the vector space of all real matrices of the same dimensions. That is, if A and B are real (inverse) Monge matrices, then A + B and λA are also (inverse) Monge matrices for any λ ≥ 0.

Let A be an m × n matrix and let j(i) be the column of A which contains the leftmost maximum entry in row i. If j(i) is nondecreasing, i.e. j(1) ≤ j(2) ≤ ··· ≤ j(m), we call A monotone. If all its submatrices are monotone, we call A totally monotone. For a totally monotone matrix A, all its row maxima can be computed in O(n + m) time by the so-called SMAWK algorithm [1].
Proposition 3 [7] Inverse Monge matrices are totally
monotone matrices, whose row maxima can be computed
in O(n + m) time.
By Corollary 1, a Monge matrix is an inverse Monge
matrix by reversing its row order. Following Proposition 3,
we have
Corollary 2 [7] The row maxima of Monge matrices can
be computed in O(n + m) time.
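For reference, the SMAWK row-maxima computation can be sketched as follows. This is our own compact variant (the paper later points to a public Python recipe); the `lookup(r, c)` interface and all names are assumptions of this sketch, and it returns the leftmost maximum column of each row:

```python
def smawk(rows, cols, lookup):
    """Row maxima of a totally monotone matrix in O(n + m) evaluations.
    `lookup(r, c)` returns entry (r, c); returns {row: argmax column}."""
    if not rows:
        return {}
    # Reduce: discard columns that cannot contain any row maximum.
    stack = []
    for c in cols:
        while stack and (lookup(rows[len(stack) - 1], stack[-1])
                         < lookup(rows[len(stack) - 1], c)):
            stack.pop()
        if len(stack) < len(rows):
            stack.append(c)
    cols = stack
    # Recurse on the odd-indexed rows.
    maxima = smawk(rows[1::2], cols, lookup)
    # Interpolate even-indexed rows between their neighbours' maxima.
    j = 0
    for i in range(0, len(rows), 2):
        hi = maxima[rows[i + 1]] if i + 1 < len(rows) else cols[-1]
        best = cols[j]
        while cols[j] != hi:
            j += 1
            if lookup(rows[i], cols[j]) > lookup(rows[i], best):
                best = cols[j]
        maxima[rows[i]] = best
    return maxima

# Row maxima of the inverse Monge matrix a[i][j] = -(i - j)^2:
argmax = smawk(list(range(4)), list(range(6)), lambda r, c: -(r - c) ** 2)
print(argmax)  # each row's maximum sits on the diagonal
```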
Also, there is an efficient parallel algorithm to compute the row maxima of Monge matrices.

Proposition 4 [4] Inverse Monge matrices are totally monotone matrices whose row maxima can be computed in O(log n) time with O(n) processors in the CREW-PRAM or EREW-PRAM model.

EREW-PRAM is the parallel model in which the processors operate synchronously and share a common memory, but no two processors are allowed simultaneous access to a memory cell (whether the access is for reading or writing). CREW-PRAM differs from EREW-PRAM in that simultaneous reading is allowed, but simultaneous writing is still forbidden.
In the next section, we will show that the optimum subwindow search problem can be formulated as a maximum entry search problem over a 4D array wherein each 2D subarray with fixed row (or column) range is an inverse Monge matrix.
3. Optimum Subwindow Search

In object localization or image part retrieval, sliding window methods have been widely applied. Sliding window methods first extract features from training images and cluster these features into K cluster bins. Then, for any image, one counts the number of features in each bin and collects these counts into a vector h, which is often called a histogram. Based on the histograms of the training images, sliding window methods train a quality function for each object and then apply the quality function to all possible subimages to find the object by maximizing the quality scores. In the following, we use f(h) to denote the quality function.
For an m × n image, let us use a 3-dimensional (3D) array H = {h[i,j,k]} to represent the histogram at each pixel, i.e., h[i,j,k] denotes the count of the k-th bin feature extracted at pixel (i,j). The histogram of the subimage within subwindow w = [t : b, l : r] can be computed as

h_w(k) = Σ_{i=t}^{b} Σ_{j=l}^{r} h[i,j,k]    (5)

where t, b, l, r denote the top, bottom, left and right boundaries of the subwindow w.
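Equation (5) is a per-bin 2D prefix sum, so after an O(mnK) integral-image precomputation each subwindow histogram costs only four lookups per bin. A minimal sketch, with our own function names and 1-based inclusive window coordinates:

```python
def integral_histograms(h):
    """Per-bin 2D prefix sums of a per-pixel histogram h[i][j][k]:
    S[i][j][k] = sum of h over rows < i and columns < j."""
    m, n, K = len(h), len(h[0]), len(h[0][0])
    S = [[[0] * K for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            for k in range(K):
                S[i + 1][j + 1][k] = (h[i][j][k] + S[i][j + 1][k]
                                      + S[i + 1][j][k] - S[i][j][k])
    return S

def subwindow_histogram(S, t, b, l, r):
    """Histogram of subwindow [t : b, l : r] (1-based, inclusive), Eq. (5)."""
    K = len(S[0][0])
    return [S[b][r][k] - S[t - 1][r][k] - S[b][l - 1][k] + S[t - 1][l - 1][k]
            for k in range(K)]

h = [[[1, 0], [0, 1]], [[2, 0], [0, 3]]]   # 2x2 image, K = 2 bins
S = integral_histograms(h)
print(subwindow_histogram(S, 1, 2, 1, 2))  # [3, 4]: whole-image histogram
```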
The sliding window methods aim to find a subwindow w = [t : b, l : r] such that f(h_w) is maximal [14], that is,

max_{w ∈ W} f(h_w)    (6)

where W is the set of all possible subwindows, i.e.,

W = {w = [t : b, l : r] | 1 ≤ t ≤ b ≤ m, 1 ≤ l ≤ r ≤ n}.    (7)

We will call (6) the optimum subwindow search problem hereafter. Now let us define a 4D array A as

A[t,b,l,r] = { f(h_w), if w = [t : b, l : r] ∈ W;  −∞, otherwise.    (8)

Then the optimum subwindow search problem is transformed into a maximum entry search problem in a 4D array, i.e.,

max_{[t:b,l:r] ∈ W} A[t,b,l,r].    (9)
An exhaustive search needs to compute all O(n⁴) entries of A. However, for some typical quality functions f(h), the 4D array A carries Monge structures that can be used to speed up the optimum subwindow search. We now present the main theoretical result of this paper; the proof is delegated to the appendix.
Theorem 1 Assume that the quality function can be decomposed as

f(h) = Σ_k fk{h(k)}    (10)

and define the 4D array A as in (8). If all fk(x) are concave on [0, +∞), i.e.

f''_k(x) ≤ 0, ∀ x ≥ 0, k = 1, 2, ···, K,    (11)

then the 2D subarrays A(·,·,l,r) are inverse Monge matrices for every fixed pair (l,r). Similarly, the A(t,b,·,·) are inverse Monge matrices for every fixed pair (t,b).

Here, we use A(·,·,l,r) (and A(t,b,·,·)) to denote the 2D subarray of A with given (l,r) (and (t,b), respectively). From Theorem 1 and Proposition 3, for each fixed (t,b), the optimal (l,r) can be found in O(n) time using the SMAWK algorithm. Hence, we have the following corollary of Theorem 1:

Corollary 3 If the quality function satisfies the condition in Theorem 1, the maximum entry of A, and thus the optimum subwindow, can be found by visiting O(m²n) of its entries, with time complexity O(m²nK).

This time complexity is still too high for practice, and we propose an alternating optimization method in the next section to further reduce the computation. Before moving on, we give some examples in which the quality function satisfies the conditions of Theorem 1.
3.1. χ² Metric

In computer vision, the similarity between two images is typically measured by the χ² distance of their bag-of-visual-words histograms hQ and h, where hQ is the histogram of the query image and h is the histogram of a subimage contained in a subwindow w = [t : b, l : r]. The χ² distance is defined as

χ²(hQ, h) = Σ_{k=1}^{K} {hQ(k) − h(k)}² / {hQ(k) + h(k)}.    (12)
Let fk(x) = −{hQ(k) − x}²/{hQ(k) + x}. Then we have

f'_k(x) = {3hQ(k)² − 2x hQ(k) − x²} / {x + hQ(k)}²,
f''_k(x) = −8hQ(k)² / {x + hQ(k)}³.    (13)

Note that hQ(k) > 0, so f''_k(x) < 0 for any x ≥ 0, and therefore

f(h) = Σ_k fk{h(k)} = −χ²(hQ, h)    (14)

satisfies the condition in Theorem 1, and the associated 2D subarrays A(·,·,l,r) (and A(t,b,·,·)) are inverse Monge matrices.
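The sign of f''_k in (13) is easy to sanity-check numerically with central finite differences. A small sketch (function names are ours) compares the closed form against the approximation:

```python
def f_k(x, hq):
    # Per-bin negative chi-square term: f_k(x) = -(hq - x)^2 / (hq + x)
    return -((hq - x) ** 2) / (hq + x)

def second_derivative(f, x, eps=1e-4):
    # Central finite-difference approximation of f''(x)
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

hq = 2.0
for x in [0.5, 1.0, 3.0, 10.0]:
    approx = second_derivative(lambda t: f_k(t, hq), x)
    exact = -8 * hq ** 2 / (x + hq) ** 3
    print(round(approx, 4), round(exact, 4))  # they agree; always negative
```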
3.2. Euclidean Metric

The squared Euclidean distance of histograms hQ and h is defined as

L²(hQ, h) = Σ_{k=1}^{K} (hQ(k) − h(k))².    (15)

Let fk(x) = −(hQ(k) − x)². Then we have f''_k(x) = −2 ≤ 0 for any x ≥ 0, and therefore

f(h) = Σ_k fk{h(k)} = −L²(hQ, h)    (16)

satisfies the condition in Theorem 1, and the associated 2D subarrays A(·,·,l,r) (and A(t,b,·,·)) are inverse Monge matrices.
3.3. Voting with Multiple Queries

Suppose we have multiple query images with histograms hQ_1(k), hQ_2(k), ···, hQ_l(k). We can assign a weight αi ≥ 0 to every query image and use the following quality function to detect objects:

f(h) = Σ_{i=1}^{l} −αi D(hQ_i, h)    (17)

where D(hQ_i, h) is either the χ² or the Euclidean distance. As in the case with one query image, f(h) can be decomposed as

f(h) = Σ_k fk{h(k)}    (18)

where

fk(x) = Σ_{i=1}^{l} −αi D(x, hQ_i(k)).    (19)

In Sections 3.1 and 3.2, we have shown that the second derivative of −D(x, hQ_i(k)) with respect to x is nonpositive for any x ≥ 0, where D(x, y) is either the χ² or the Euclidean metric. Since αi ≥ 0, we have f''_k(x) ≤ 0 for any x ≥ 0, and therefore f(h) satisfies the condition in Theorem 1 and the associated 2D subarrays A(·,·,l,r) (and A(t,b,·,·)) are inverse Monge matrices.

4. Alternating Optimization
In [3], an alternating optimization method is presented for efficient subwindow search with a linear decision function. Here, we adopt the same procedure but for nonlinear decision functions which satisfy the condition in Theorem 1, so that the SMAWK algorithm can be applied in the subwindow search; we call this method Alternating Row and Column Search (ARCS). The ARCS method first initializes the row range to the full row range (or the column range to the full column range), and then applies the SMAWK algorithm to alternately optimize the column range and the row range until convergence. More precisely, we start by initializing [t : b] = [1 : m] and optimize the column interval [l : r] by applying the SMAWK algorithm. Then we fix the column interval and optimize the row interval [t : b], again by applying the SMAWK algorithm. This alternating optimization of the row and column intervals is repeated until convergence. Convergence is guaranteed, since the maximum entry of the array A is bounded and the obtained entry is nondecreasing at each iteration. However, the method may converge to a locally optimal subwindow instead of the globally optimal solution.

Similarly, one can also start the algorithm by initializing the column interval [l : r] = [1 : n] rather than starting from the rows. We recommend applying both initializations and choosing the solution that yields the larger score. This can help reduce the problem of encountering a local maximum.
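The alternation above can be sketched as follows. For brevity, this sketch replaces the SMAWK inner step with exhaustive interval scans, so it preserves the ARCS iteration logic but not its per-step complexity; the `score(t, b, l, r)` interface and all names are our own assumptions:

```python
def arcs(score, m, n, max_iter=20):
    """Alternating Row and Column Search (sketch).
    `score(t, b, l, r)` rates subwindow [t : b, l : r] (1-based, inclusive).
    Brute-force scans stand in for the SMAWK inner step here."""
    def best_cols(t, b):
        return max(((l, r) for l in range(1, n + 1) for r in range(l, n + 1)),
                   key=lambda p: score(t, b, p[0], p[1]))
    def best_rows(l, r):
        return max(((t, b) for t in range(1, m + 1) for b in range(t, m + 1)),
                   key=lambda p: score(p[0], p[1], l, r))
    t, b = 1, m                    # initialise with the full row range
    l, r = best_cols(t, b)
    prev = None
    for _ in range(max_iter):
        t, b = best_rows(l, r)
        l, r = best_cols(t, b)
        if (t, b, l, r) == prev:   # converged to a (possibly local) maximum
            break
        prev = (t, b, l, r)
    return t, b, l, r, score(t, b, l, r)

# Toy example: 5x5 grid, cells are +1 inside rows 2-3 and cols 2-4,
# -1 elsewhere; the score of a window is the sum of its cells.
val = lambda i, j: 1 if 2 <= i <= 3 and 2 <= j <= 4 else -1
score = lambda t, b, l, r: sum(val(i, j) for i in range(t, b + 1)
                               for j in range(l, r + 1))
print(arcs(score, 5, 5))  # (2, 3, 2, 4, 6): the planted block is found
```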
4.1. Complexity Analysis

The computational complexities of ESS and ARCS depend on their iteration counts. For an n × n image with O(n²) features in K cluster bins, the time complexity and memory requirement are summarized in Table 1.

Table 1. Time complexity and memory requirement for n × n images with K cluster bins of features.

Methods | Time Complexity    | Memory Requirement
ESS     | O(n²K + L_ess K)   | O(n²K + L_ess)
ARCS    | O(L_alt(n² + nK))  | O(n² + nK)

The iteration number L_ess of ESS is typically of order O(n²), though it can be O(n⁴) in the worst case. The iteration number L_alt of ARCS is typically less than 10 in our experiments, and therefore ARCS is much faster than ESS. Also, since the upper bound of L_ess is O(n⁴), the performance variation of ESS is large across images.

The n²K time and memory complexity of ESS comes from the computation and storage of the integral image for each histogram element. With these integral images, ESS takes O(K) computations per iteration. Hence, the total time and memory complexities are O(n²K + L_ess K) and O(n²K + L_ess), respectively.

Since the iteration number L_alt of ARCS is much less than K, it is not beneficial to use integral images. Instead, we read the extracted features at each iteration, which takes O(L_alt n²). Combining this with the O(nK) computations of the SMAWK algorithm per iteration, the total time complexity of ARCS is O(L_alt(n² + nK)). The memory requirement comprises the O(n²) features and an n × K matrix which stores the number of k-th (1 ≤ k ≤ K) features extracted in each row (or each column) when the column range (l, r) (or row range (t, b), respectively) is fixed. For the implementation of the SMAWK algorithm, one can use the public Python code available at http://code.activestate.com/recipes/117244/.
5. Application to Object Detection: A Two-Stage Learning Framework

In this section, we show an example application of the proposed algorithms to object detection. For convenience, we call an image positive if it contains the object to be detected, and negative otherwise. An object detector has two tasks: first, it needs to find the correct object position if the object is in the image; second, it needs to assign correct scores to the detected candidate objects; that is, if the detected subimage is really the object, the score should be high, and otherwise the score should be low. Based on this observation, we propose to design a separate quality function for each task. The first is called the localization quality function; it aims to localize the object correctly in the positive images, and we train it on the positive training images. The second is called the classification quality function; it aims to classify the subimages detected by the localization quality function.
5.1. Localization Quality Function

We choose the localization quality function as follows. First, we compute the bag-of-words histograms hi, i = 1, 2, ···, l, of the objects contained in the positive training images, based on the ground-truth bounding boxes. Then we use each hi as the query image histogram and apply the optimum subwindow search algorithms to detect the optimal subwindow in each positive training image. By comparing the detected subwindows with the ground-truth bounding boxes, we compute the matching rate, counting a match when the overlap exceeds 50%. The overlap of two windows w1, w2 is defined as the ratio of the common area, area{w1 ∩ w2}, to the union area, area{w1 ∪ w2}. We choose the hi with the maximal matching rate as the final query image histogram and apply the optimum subwindow search algorithms to the test images. In our experiments, we used the negative χ² distance between the test and query images as the quality function.
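The 50% overlap criterion can be written directly. A sketch assuming windows are given as (t, b, l, r) tuples with inclusive pixel coordinates (this representation is our choice):

```python
def overlap(w1, w2):
    """area(w1 ∩ w2) / area(w1 ∪ w2) for windows (t, b, l, r) with
    inclusive pixel coordinates, as used for the 50% matching criterion."""
    t1, b1, l1, r1 = w1
    t2, b2, l2, r2 = w2
    ih = max(0, min(b1, b2) - max(t1, t2) + 1)   # intersection height
    iw = max(0, min(r1, r2) - max(l1, l2) + 1)   # intersection width
    inter = ih * iw
    area = lambda t, b, l, r: (b - t + 1) * (r - l + 1)
    union = area(*w1) + area(*w2) - inter
    return inter / union

print(overlap((1, 10, 1, 10), (1, 10, 1, 10)))  # 1.0
print(overlap((1, 10, 1, 10), (6, 15, 1, 10)))  # 50 / 150 = 1/3
```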
5.2. Classification Quality Function

With the detected subimages in both the positive and negative images, the classification of these subimages is a standard classification problem, and we apply the least squares support vector machine (LS-SVM) [18] to train the classifier. We use LS-SVM because its hyper-parameters can be found easily by efficient cross-validation [2]. For clarity, we summarize LS-SVM briefly. Given a training set {(xi, yi)}_{i=1}^{n} with input data xi ∈ Rⁿ and class labels yi ∈ {−1, 1}, the classifier of LS-SVM [19, 18] takes the form

y(x) = sign{ Σ_{i=1}^{n} αi K(xi, x) + b }    (20)

where K(·,·) is the kernel function, which can typically be a linear, polynomial or Gaussian kernel, and α and b are the solution of the following linear equations

[ 0    1ₙᵀ           ] [ b ]   [ 0 ]
[ 1ₙ   K + (1/γ) Iₙ  ] [ α ] = [ y ]    (21)

with y = [y1, y2, ···, yn]ᵀ, 1ₙ = [1, 1, ···, 1]ᵀ and α = [α1, α2, ···, αn]ᵀ.

In our application, xi is the histogram of the detected subimage and yi is the label: yi = 1 if the detected image is a real object and the detected window is correct, that is, if its overlap with the ground-truth window is more than 50%. The kernel function is based on the χ² distance, and we choose the Gaussian kernel

K(xi, xj) = exp{−χ²(xi, xj)/σ²}.    (22)

For the parameters σ and γ, we applied the efficient cross-validation method [2] to estimate the generalization performance and chose the parameters with minimal cross-validation error. All parameter selection is based on the training images.
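Training amounts to solving the dense linear system (21) and predicting with (20) and (22). A self-contained sketch in pure Python (the Gaussian-elimination helper, the ε guard in the χ² term, and all names are our own; a real system would use a numerical library):

```python
import math

def solve(M, v):
    """Gauss-Jordan elimination with partial pivoting for a dense system."""
    n = len(M)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c and A[r][c]:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]

def chi2(x, z, eps=1e-12):
    # eps guards against empty bins in both histograms
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, z))

def lssvm_train(X, y, sigma=1.0, gamma=10.0):
    """Solve Eq. (21): [[0, 1ᵀ], [1, K + I/γ]] [b; α] = [0; y],
    with the chi-square Gaussian kernel of Eq. (22)."""
    n = len(X)
    K = [[math.exp(-chi2(X[i], X[j]) / sigma ** 2) for j in range(n)]
         for i in range(n)]
    M = [[0.0] + [1.0] * n]
    for i in range(n):
        M.append([1.0] + [K[i][j] + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(M, [0.0] + list(y))
    return sol[0], sol[1:]          # b, alpha

def lssvm_predict(X, b, alpha, x, sigma=1.0):
    """Eq. (20): sign of the kernel expansion."""
    s = b + sum(a * math.exp(-chi2(xi, x) / sigma ** 2)
                for a, xi in zip(alpha, X))
    return 1 if s >= 0 else -1
```

On a tiny symmetric toy set of normalized histograms the trained classifier reproduces the training labels, which is a quick way to check the solve step.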
6. Experiments

Our experiments are based on VLFeat [20], an open-source library for feature extraction, and we test the performance on the PASCAL VOC 2006 database [10]. We used the fast dense feature extraction method [13] to extract SIFT descriptors every 3 pixels along both rows and columns, and used the hierarchical K-means algorithm to cluster the features extracted from the first 100 training images into 1024 bins, which form the bag-of-words vocabulary. The PASCAL VOC 2006 database consists of 5304 images containing 9507 objects from 10 categories. We choose two objects, "cat" and "dog", to test our algorithm. For each object, we use the objects in the training images to find a target histogram, and then apply the proposed algorithms to detect the optimal windows. The quality scores are given by a classifier trained on the detected objects (instead of the ground-truth objects) in the training images. All experiments¹ were conducted on a standard desktop PC (Intel Core 2 Quad 2.33 GHz) running Windows XP, compiled using Visual Studio .NET 2005. No multithreaded processing was utilized.

¹ Code available at http://impca.cs.curtin.edu.au/downloads/software.php
6.1. Search Time Comparison

Tables 2 and 3 report the search times of our proposed algorithm ARCS and of ESS [14]. They show that ARCS is significantly faster. In all cases, the run time of ARCS is within 1 second, while ESS can take as long as 75 seconds. The iteration number of ARCS is typically less than 10, while ESS typically takes thousands of iterations. However, each iteration of ARCS is computationally heavier than one of ESS. Hence, when the iteration number of ESS is very small, say in the hundreds, ESS can be faster than ARCS.

Table 2. Search time comparison for cat detection on the 2686 PASCAL VOC 2006 test images. On average, ARCS is 27 times faster than ESS. ARCS stands for the alternating method we propose and ESS for the branch and bound method developed in [14].

          CPU Time (Seconds)
Methods | Average | Minimum | Maximum
ESS     | 3.628   | 0.092   | 75.49
ARCS    | 0.133   | 0.018   | 0.735

Table 3. Search time comparison for dog detection on the 2686 PASCAL VOC 2006 test images. On average, ARCS is 28 times faster than ESS.

          CPU Time (Seconds)
Methods | Average | Minimum | Maximum
ESS     | 4.27    | 0.088   | 64.9
ARCS    | 0.148   | 0.017   | 0.730
6.2. Detection Performance Comparison

First, we compare the detection performance of ESS and ARCS, both using our proposed two-stage learning framework. The difference is that ESS finds the global optimizer, while ARCS may encounter local maxima. From Figures 1-2 and Table 4, one can see that their performances are almost identical.

Second, we compare the detection performance of the proposed algorithm to reported results on the two objects "cat" and "dog". The results are shown in Table 4. With a simple localization quality function and an LS-SVM classifier on the detected subimages, our two-stage learning object detector achieves better performance than [15], which used a linear SVM classifier. This shows that a nonlinear object detector is better than linear methods, and that it is important to find efficient ways to evaluate the optimum subwindow search with nonlinear quality functions.
Figure 1. Precision-recall comparison of the subimages detected by ARCS and ESS on the test images for cat: AP_arcs = 0.272, AP_ess = 0.273. ESS guarantees global optimization; though ARCS cannot, the performance loss is negligible.

Figure 2. Precision-recall comparison of the subimages detected by ARCS and ESS on the test images for dog: AP_arcs = 0.192, AP_ess = 0.194. ESS guarantees global optimization; though ARCS cannot, the performance loss is negligible.

Table 4. Average precision (AP) scores on the PASCAL VOC 2006 data set. The performance of Shotton et al. 2006 is the best result submitted to the VOC 2006 challenge for these two objects.

Methods                  | cat   | dog
Lampert et al. 2009 [15] | 0.223 | 0.148
Shotton et al. 2006 [10] | 0.151 | 0.118
ARCS                     | 0.272 | 0.192
ESS                      | 0.273 | 0.194

7. Conclusion

By using the Monge properties of the data structure of subwindow search in object detection, we have developed
a fast, locally optimal subwindow search algorithm. We also propose a two-stage learning framework for object detection in order to apply the proposed optimum subwindow search algorithm effectively. Experiments on the PASCAL VOC 2006 database demonstrate that the proposed algorithm is significantly faster than the well-known efficient subwindow search method. However, our fast search algorithm is only applicable to certain types of nonlinear quality functions which, unfortunately, do not include the widely-used nonlinear support vector machines. Further investigation is needed to expand the class of quality functions for which the Monge property holds and our algorithm can be applied.
Appendix: Proof of Theorem 1

Before we prove Theorem 1, we need to consider a simple 1D subarray search problem and introduce a lemma that is key to the proof of Theorem 1. Let {xi}_{i=1}^{n} be a nonnegative 1D array, i.e., xi ≥ 0. Suppose we want to find a contiguous subarray whose sum is closest to y among all possible contiguous subarrays. Let d(x, y) denote the distance from x to y and let f(x) = −d(x, y). Then the problem can be formulated as

max_{1 ≤ l ≤ r ≤ n} f( Σ_{i=l}^{r} xi )    (23)

where l and r denote the left and right ends of the subarray. A naive solution to this problem is to try all n(n + 1)/2 possible subarrays and choose the one with minimal distance. We assume that

f''(x) ≤ 0, ∀ x ≥ 0.    (24)

Let X denote an upper-triangular matrix defined as

X_ij = Σ_{k=i}^{j} x_k,  j ≥ i.    (25)

Then we have the following lemma.

Lemma 1 Let X be defined as in (25) and let f(x) satisfy condition (24). Then, for any i < j < k < l, we have

f(X_jl) − f(X_jk) ≥ f(X_il) − f(X_ik).    (26)

That is, f(X) is an inverse Monge matrix.
Proof. From the definition of X_ij in (25), we have

X_ik − X_jk = X_il − X_jl = Σ_{s=i}^{j−1} x_s    (27)

and therefore

f(X_il) − f(X_ik) = ∫_{X_ik}^{X_il} f'(x) dx
                  = ∫_{X_jk+δ}^{X_jl+δ} f'(x) dx
                  = ∫_{X_jk}^{X_jl} f'(x + δ) dx
                  ≤ ∫_{X_jk}^{X_jl} f'(x) dx
                  = f(X_jl) − f(X_jk)    (28)

where δ = Σ_{s=i}^{j−1} x_s ≥ 0, and the inequality holds since f'(x + δ) ≤ f'(x) (following (24)). This proves (26). □
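Lemma 1 is easy to check numerically: build X from prefix sums of a nonnegative array, apply a concave f, and test inequality (26) over all index quadruples. A sketch with our own names; the concave choice f(s) = −(s − y)² matches the Euclidean case of Section 3.2:

```python
import itertools
import random

random.seed(0)
x = [random.random() for _ in range(8)]   # nonnegative 1D array {x_i}
y = 2.0                                   # target value
f = lambda s: -(s - y) ** 2               # concave: f''(s) = -2 <= 0

# Upper-triangular prefix-sum matrix X[i][j] = x_i + ... + x_j, Eq. (25)
n = len(x)
X = [[sum(x[i:j + 1]) for j in range(n)] for i in range(n)]

# Inequality (26): f(X_jl) - f(X_jk) >= f(X_il) - f(X_ik) for i < j < k < l
ok = all(f(X[j][l]) - f(X[j][k]) >= f(X[i][l]) - f(X[i][k]) - 1e-9
         for i, j, k, l in itertools.combinations(range(n), 4))
print(ok)  # True
```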
Now we are ready to prove Theorem 1.

Proof of Theorem 1. Since f(h) = Σ_{k=1}^{K} fk(h(k)), we have

A[t,b,l,r] = Σ_{k=1}^{K} Ak[t,b,l,r]    (29)

where

Ak[t,b,l,r] = { fk(h_w(k)), if w = [t : b, l : r] ∈ W;  −∞, otherwise.    (30)

Note that f''_k(x) ≤ 0. From Lemma 1, it follows that each 2D subarray Ak(·,·,l,r) (and Ak(t,b,·,·)) is an inverse Monge matrix. Since each such subarray is inverse Monge, the corresponding subarray of the sum A[t,b,l,r] is also inverse Monge by Proposition 2 (in Section 2), and this concludes the proof. □
References

[1] A. Aggarwal, M. M. Klawe, S. Moran, P. Shor, and R. Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2:195-208, 1987.
[2] S. An, W. Liu, and S. Venkatesh. Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression. Pattern Recognition, 40:2154-2162, 2007.
[3] S. An, P. Peursum, W. Liu, and S. Venkatesh. Efficient algorithms for subwindow search in object detection and localization. In Proceedings of CVPR, 2009.
[4] M. J. Atallah and S. R. Kosaraju. An efficient parallel algorithm for the row minima of a totally monotone matrix. In Proc. of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 394-403, 1991.
[5] H. Bay, T. Tuytelaars, and L. J. Gool. SURF: Speeded Up Robust Features. In Proceedings of ECCV, 2006.
[6] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pages 401-408, 2007.
[7] R. E. Burkard, B. Klinz, and R. Rudolf. Perspectives of Monge properties in optimization. Discrete Applied Mathematics, 70:95-161, 1996.
[8] O. Chum and A. Zisserman. An exemplar model for learning object classes. In Proceedings of CVPR, 2007.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of CVPR, 2005.
[10] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.
[11] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments for object detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(1):36-51, 2008.
[12] M. Fritz and B. Schiele. Decomposition, discovery and detection of visual categories using topic models. In Proceedings of CVPR, 2008.
[13] B. Fulkerson, A. Vedaldi, and S. Soatto. Localizing objects with smart dictionaries. In Proceedings of ECCV, 2008.
[14] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Proceedings of CVPR, 2008.
[15] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. IEEE Trans. Pattern Analysis and Machine Intelligence, 2009. To appear.
[16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[17] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scene. In Advances in Neural Information Processing Systems (NIPS 96), pages 875-881, 1996.
[18] J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.
[19] J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9:293-300, 1999.
[20] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[21] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In Proceedings of ICCV, 2009.
[22] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of CVPR, 2001.