Int J Comput Vis
DOI 10.1007/s11263-014-0781-x
Multi-Class Active Learning by Uncertainty Sampling
with Diversity Maximization
Yi Yang · Zhigang Ma · Feiping Nie · Xiaojun Chang · Alexander G. Hauptmann
Received: 11 September 2013 / Accepted: 15 October 2014
© Springer Science+Business Media New York 2014
Abstract As a way to relieve the tedious work of manual annotation, active learning plays an important role in many applications of visual concept recognition. In typical active learning scenarios, the number of labelled data in the seed set is usually small. However, most existing active learning algorithms only exploit the labelled data, and thus often suffer from over-fitting due to the small number of labelled examples. Besides, while much progress has been made in binary-class active learning, little research attention has been focused on multi-class active learning. In this paper, we propose a semi-supervised batch mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the whole active pool to evaluate the uncertainty of the data. Considering that uncertain data are often similar to each other, we propose to make the selected data as diverse as possible, for which we explicitly impose a diversity constraint on the objective function. As a multi-class active learning algorithm, our algorithm is able to exploit uncertainty across multiple classes. An efficient algorithm is used to optimize the objective function. Extensive experiments on action recognition, object classification, scene recognition, and event detection demonstrate its advantages.
Communicated by Kristen Grauman.
Y. Yang · X. Chang
Centre for Quantum Computation and Intelligent Systems,
University of Technology Sydney, Sydney, NSW, Australia
e-mail: yiyang@cs.cmu.edu

Z. Ma · A. G. Hauptmann
School of Computer Science, Carnegie Mellon University,
Pittsburgh, PA, USA

F. Nie (B)
The Center for OPTical IMagery Analysis and Learning,
Northwestern Polytechnical University, Xi'an, China
e-mail: feipingnie@gmail.com
Keywords Active learning · Uncertainty sampling · Diversity maximization
1 Introduction
Typical visual concept recognition methods first train a classifier based on the labelled training data via a statistical approach, and then use the classifier to recognize visual concepts. In real-world applications, it is usually easy to obtain huge volumes of unlabelled data in an automatic way. However, a large number of labels are difficult to get, as labelling requires much human labour. Generally speaking, there are three types of approaches to relieve the tedious work of labelling the training data. The first is semi-supervised learning, which combines both the labelled and unlabelled data to train the classifier for recognition (Zhu 2008). The second is to borrow knowledge from related domain(s), such as transfer learning (Ma et al. 2014; Shen et al. 2014) and multi-task learning (Yang et al. 2013). The third, active learning, makes the most of scarce human labelling resources by selecting the most informative data from a candidate set (usually referred to as the active pool) for labelling. Instead of being a passive recipient of label information, the learning algorithm actively decides which data are more useful and then asks humans to label them for training.

As a different but complementary way to reduce the labelling cost in supervised learning, active learning has received much research attention. In recent years, researchers have proposed several active learning algorithms and applied them to different computer vision applications, e.g., image classification (Jain and Kapoor 2009; Joshi et al. 2009), concept detection (Li et al. 2010), object recognition (Gong et al. 2014), 3D reconstruction (Kowdle et al. 2011), tracking
(Vondrick and Ramanan 2011), correspondences mapping
(Jegelka et al. 2014), etc. The key issue in active learning is how to decide whether a sample point is "useful" or "informative". For example, in Chattopadhyay et al. (2012) it is realized by specifically selecting a set of query samples that minimize the difference in distribution between the labelled and the unlabelled data. In the literature, representativeness sampling and uncertainty sampling are the two widely used criteria for selecting the training data to be labelled from the active pool. An uncertainty sampling active learning algorithm is usually associated with a classifier, which is used to evaluate the uncertainty of each data point in the active pool. Despite the substantial progress made in uncertainty sampling, there are still several aspects to be improved.
First, as discussed in Jain and Kapoor (2009), most of the existing research in active learning is based on binary classifiers. Relatively few approaches have been proposed for multi-class active learning (e.g., Li et al. 2004; Yan et al. 2003), and many of them are direct extensions of binary active learning methods to the multi-class scenario. However, many real-world applications of visual concept recognition are multi-class problems. Decomposing a multi-class problem into several independent binary classification subproblems may degrade the performance of active learning. If we use a series of binary classifiers in active learning, as in Li et al. (2004), Yan et al. (2003), etc., the model is not able to evaluate the uncertainty of a sample across multiple classes. For example, if a sample is uncertain for one class but certain for another, it is tricky for the algorithm to evaluate its overall uncertainty. Besides, given that the multiple binary classifiers are independent of each other, the algorithm cannot identify the classes that need more labelled training data (Jain and Kapoor 2009).
Second, uncertainty sampling algorithms tend to suffer from the problem of insufficient training data. Active learning algorithms usually start with a seed set, which contains only a small number of labelled data. Based on the seed set, a classifier is trained to evaluate the uncertainty of the candidate data in the active pool. The goal of active learning is to select the data to be labelled for training. Thus, at the beginning, the number of labelled data is very small, which is in the nature of active learning. The performance of the classifier can be poor due to the small number of labelled data (Hoi et al. 2008; Yang et al. 2012). Based on SVM active learning (Tong and Chang 2001), Hoi et al. have proposed a min-max optimization algorithm to evaluate the informativeness of data points (Hoi et al. 2008), in which the unlabelled data are employed as complementary information. Compared with SVM active learning (Tong and Chang 2001), the min-max optimization algorithm is able to select training data in batch mode and is more robust to over-fitting. Empirical study shows that the min-max criterion proposed in Hoi et al. (2008) outperforms SVM active learning in Tong and Chang (2001). Hoi's algorithm calls a QP solver to optimize the objective function, resulting in a high computational complexity of O(n^3). In a later work (Hoi et al. 2009), Hoi et al. have proposed a solution to speed up the optimization, which makes the algorithm in Hoi et al. (2009) more applicable. Hoi's algorithm (Hoi et al. 2008, 2009) has improved the performance of active learning because it uses all the data in the active pool to evaluate the importance of each candidate. However, as the algorithm is based on the binary SVM classifier, it may become less effective when the data are multi-class.
Motivated by the state of the art in active learning, particularly the semi-supervised active learning algorithm of Hoi et al. (2008, 2009), we propose a new multi-class active learning algorithm, namely Uncertainty Sampling with Diversity Maximization (USDM), which addresses the small seed set problem by leveraging all the data in the active pool for uncertainty evaluation. Our algorithm is able to globally evaluate the informativeness of the pool data across multiple classes. Different from other multi-class active learning algorithms, e.g., Jain and Kapoor (2009), our algorithm exploits all the active pool data to train the classifier, making the uncertainty evaluation more accurate. Further, most of the existing uncertainty sampling algorithms merely consider the uncertainty score for active learning, i.e., they select the active pool data which are closest to the classification boundaries. However, the data close to classification boundaries may be very similar to each other. If similar data are selected for supervision, the performance of active learning may degrade. In light of this, we propose to select the most uncertain data that are also as diverse as possible, meaning that the data selected for labelling should be sufficiently different from each other. Compared to Jain and Kapoor (2009), USDM simultaneously utilizes both the labelled and unlabelled data in the active pool. While Hoi's algorithm (Hoi et al. 2008, 2009) also exploits the entire active pool, the classifier embedded in USDM is more capable of evaluating uncertainty, partially because it is a multi-class classifier and partially because it explicitly exploits the manifold structure of the active pool. USDM has many merits: it works in batch mode, it is multi-class, semi-supervised, and efficient, and the diversity of the selected data is explicitly guaranteed.
2 Related Work
In this section, we briefly review the related work. This paper
is closely related to active learning and semi-supervised
learning.
Active learning has been shown to be effective in many applications such as 3D reconstruction (Kowdle et al. 2011) and image retrieval (Wang et al. 2003). Existing active learning algorithms can be roughly divided into two categories: representativeness sampling and uncertainty sampling. As it is important to exploit the data distribution when selecting the data to be labelled (Cohn et al. 1996), representativeness sampling tries to select the most representative data points according to the data distribution. A typical approach of this kind is clustering based active learning, which employs a clustering algorithm to exploit the data distribution and evaluate representativeness. The performance of these algorithms directly depends on the clustering algorithm. Clustering algorithms are unsupervised and only converge to local optima, so their results may deviate severely from the true labels. It remains unclear how clustering based algorithms will perform when the clustering is not sufficiently accurate. The other well-known approach to representativeness sampling is optimal experiment design (Yu et al. 2006). Based on optimal experiment design, a variety of active learning algorithms have been proposed, in which different Laplacian matrices have been utilized, e.g., He et al. (2007). A limitation of optimal experiment design is that the optimization of the objective function is usually NP-hard, so a certain relaxation is required. Then semi-definite programming (SDP) or a sequential method (usually a greedy method) is applied to the optimization. However, SDP has high computational complexity, and greedy methods may converge to poor local optima.
Uncertainty sampling, also known as classifier based sampling (Campbell et al. 2000; Li and Sethi 2006), is the most frequently adopted strategy in active learning, which builds upon the notion of uncertainty in classification (Jain and Kapoor 2009). This type of algorithm is usually associated with a particular classification algorithm. A classifier is trained on a seed set consisting of a small number of randomly selected data. Data points in the active pool which are most likely to be misclassified by the classifier are regarded as the most informative ones. For example, support vector machine (SVM) active learning (Tong and Chang 2001) selects the data points which are closest to the classification boundary of the SVM classifier as the training data. In Wang et al. (2003), the transductive SVM classifier is used for active learning. Hoi et al. have proposed to integrate semi-supervised learning and support vector machines for active learning and have achieved promising results on image retrieval (Hoi and Lyu 2005). In Brinker (2003), a diversity constraint is combined with SVM for active learning. The uncertainty sampling strategy has also been combined with other classifiers, such as the Gaussian process (Kapoor et al. 2010), the K-nearest neighbor classifier (Lindenbaum et al. 2004) and the probabilistic K-nearest neighbor classifier (Jain and Kapoor 2009).
Semi-supervised learning has been widely applied to many applications due to the appealing feature that it can use both labelled and unlabelled data (Yang et al. 2012; Zhu 2008). For instance, Zhu et al. have proposed to utilize a Gaussian random field model with a weighted graph representing labelled and unlabelled data for semi-supervised learning (Zhu et al. 2003). Han et al. have proposed to use spline regression for semi-supervised feature selection (Han et al. 2014). In Hoi et al. (2008), the researchers have formulated semi-supervised active learning as a min-max optimization problem for image retrieval. A semi-supervised learning based relevance feedback algorithm is proposed in Yang et al. (2012) for multimedia retrieval. The benefit of utilizing semi-supervised learning is that we can save human labour in labelling a large amount of data, because it can exploit unlabelled data to learn the data structure. Thus, both the human labelling cost and accuracy are considered, which gives semi-supervised learning great potential to boost the learning performance when properly designed.
The rest of this paper is organized as follows. In Sect. 3, we give the objective function of the proposed USDM algorithm. An efficient algorithm is described in Sect. 4 to optimize the objective function, followed by detailed experiments in Sect. 5. Lastly, we conclude this paper in Sect. 6.
3 Uncertainty Sampling with Diversity Maximization
In this section, we present the proposed USDM active learning algorithm. We start by discussing the approach for evaluating the uncertainty of each sample. Let $n$ be the total number of data in the seed set and the active pool. Suppose there are $n_s$ data in the seed set and $n_p$ data in the active pool, so that $n_s + n_p = n$. We are going to select $m$ ($m < n_p$) data for supervision. Denote $x_i \in \mathbb{R}^d$ as a sample which is either in the active pool or the seed set, where $d$ is the dimension of the sample. To better utilize the distribution of the pool data and the seed set, we propose to evaluate the uncertainty via random walks on a graph (Zhu 2008). To begin with, we first construct a graph $G$ consisting of $n$ nodes, one for each sample in the active pool or the seed set. The edge weight between two nodes $x_i$ and $x_j$ is defined as follows:
$$W_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right) & x_i \text{ and } x_j \text{ are } k\text{-nearest neighbors;} \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$
Note that one can also define the unweighted edge between $x_i$ and $x_j$ as:

$$W_{ij} = \begin{cases} 1 & x_i \text{ and } x_j \text{ are } k\text{-nearest neighbors;} \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$
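As a concrete illustration, the graph construction of Eqs. (1) and (2) can be sketched as follows (a minimal NumPy sketch; the function name, the dense distance computation, and the symmetrization of the k-NN relation are our own choices, not prescribed by the paper):

```python
import numpy as np

def knn_graph(X, k=5, sigma=1.0):
    """Weighted k-NN graph of Eq. (1): W_ij = exp(-||xi - xj||^2 / sigma^2)
    if xi and xj are k-nearest neighbors, and 0 otherwise."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)  # clip tiny negative values from round-off
    W = np.zeros((n, n))
    for i in range(n):
        # indices of the k nearest neighbors of x_i (excluding itself)
        nbrs = np.argsort(d2[i])
        nbrs = nbrs[nbrs != i][:k]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma ** 2)
    # symmetrize: connect xi and xj if either is a k-NN of the other
    W = np.maximum(W, W.T)
    return W
```

Setting the exponential weights to 1 instead recovers the unweighted graph of Eq. (2).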
We take each vertex in the graph as a state in a Markov chain, i.e., each state corresponds to one sample in the seed set or the active pool. For ease of representation, we define a diagonal matrix $D \in \mathbb{R}^{n \times n}$ whose elements are $D_{ii} = \sum_j W_{ij}$. Denote
$$Q = D^{-1} W, \quad (3)$$

which is partitioned into $2 \times 2$ blocks:

$$Q = \begin{pmatrix} Q_{ss} & Q_{sp} \\ Q_{sp}^T & Q_{pp} \end{pmatrix}, \quad (4)$$
where $Q_{ss} \in \mathbb{R}^{n_s \times n_s}$ denotes the normalized weight between the data in the seed set, $Q_{sp} \in \mathbb{R}^{n_s \times n_p}$ denotes the normalized weight between the data from the seed set and the active pool, and $Q_{pp} \in \mathbb{R}^{n_p \times n_p}$ denotes the normalized weight between the data in the active pool. For the data in the seed set, we set the corresponding states as absorbing states, which only transit to themselves with probability 1. If a sample $x_i$ is in the active pool, it is a non-absorbing state. The one-step transition probability from $x_i$ to $x_j$ is $T_{ij} = \frac{W_{ij}}{\sum_j W_{ij}}$. Then the transition matrix $T$ of the Markov random walk with absorbing states is defined as

$$T = \begin{pmatrix} I_{n_s} & 0_{n_s \times n_p} \\ Q_{sp}^T & Q_{pp} \end{pmatrix}, \quad (5)$$

in which $I_{n_s} \in \mathbb{R}^{n_s \times n_s}$ is an identity matrix and $0_{n_s \times n_p} \in \mathbb{R}^{n_s \times n_p}$ is a matrix of all zeros. We use calligraphic uppercase letters to represent sets. Denote $\mathcal{P}$ as the active pool and $\mathcal{S}$ as the seed set. As demonstrated in Doyle and Snell (1984), the probabilities that the pool data are absorbed by the seed set data in equilibrium with transition matrix $T$ are

$$P(\mathcal{S}|\mathcal{P}) = (I_{n_p} - Q_{pp})^{-1} Q_{sp}^T, \quad (6)$$

where $I_{n_p} \in \mathbb{R}^{n_p \times n_p}$ is an identity matrix. Define $Y_j = [Y_{1j}, Y_{2j}, \ldots, Y_{n_s j}]^T \in \{0,1\}^{n_s \times 1}$ as the label indicator vector of the seed set for the $j$-th class. If $x_i \in \mathcal{S}$ belongs to the $j$-th class, $Y_{ij} = 1$; otherwise $Y_{ij} = 0$. Given a pool sample $x_t \in \mathcal{P}$, we define the probability that $x_t$ is absorbed by the $j$-th class as the sum of the probabilities that it is absorbed by all the seed set data from the $j$-th class $\mathcal{C}_j$. The probabilities that the pool data are absorbed by the seed set data belonging to the $j$-th class can be formulated as

$$P(\mathcal{C}_j|\mathcal{P}) = (I_{n_p} - Q_{pp})^{-1} Q_{sp}^T Y_j. \quad (7)$$
The above procedure can be interpreted as a Dirichlet problem to obtain harmonic functions (Zhu 2008; Zhu et al. 2003). We define $F \in \mathbb{R}^{n \times c}$, where $c$ is the number of classes, as follows:

$$F_{ij} = \begin{cases} \big[(I_{n_p} - Q_{pp})^{-1} Q_{sp}^T Y_j\big]_i & \text{if } x_i \in \mathcal{P}; \\ Y_{ij} & \text{if } x_i \in \mathcal{S}. \end{cases} \quad (8)$$

It can be verified that $\sum_{j=1}^{c} F_{ij} = 1$. $F_{ij}$ is regarded as the probability that $x_i$ belongs to the $j$-th class, i.e., $P(\mathcal{C}_j|x_i) = F_{ij}$. For a sample $x_i \in \mathcal{P}$, we assume that its label can be estimated by a random variable $\ell_i$. As Shannon entropy is a natural choice to measure the uncertainty of random variables, we adopt the entropy $H(\ell_i)$ to evaluate the uncertainty of $x_i$, which can be estimated by

$$H(\ell_i) = -\sum_{j=1}^{c} P(\mathcal{C}_j|x_i) \log P(\mathcal{C}_j|x_i), \quad (9)$$
where $\log(\cdot)$ is the natural logarithm. A larger $H(\ell_i)$ indicates that $x_i$ is more uncertain. Denote $f_i$ as the ranking score of $x_i$. The pool data with higher ranking scores are selected before the others for supervision. According to the uncertainty criterion, we have the following objective function:

$$\max_{\sum_i f_i = 1,\; f_i \geq 0}\; \sum_{x_i \in \mathcal{P}} f_i \times \Big( -\sum_{j=1}^{c} P(\mathcal{C}_j|x_i) \log P(\mathcal{C}_j|x_i) \Big) - \Omega(f_i), \quad (10)$$

which can be rewritten as

$$\max_{\sum_i f_i = 1,\; f_i \geq 0}\; \sum_{x_i \in \mathcal{P}} f_i \times \Big( -\sum_{j=1}^{c} F_{ij} \log(F_{ij}) \Big) - \Omega(f_i). \quad (11)$$

In (10), the term $\Omega(f_i)$ is a function of $f$ encoding the data distribution information, in other words, the diversity criterion in decision making. The constraint $\sum_{i=1}^{n} f_i = 1$ in the above function is imposed to avoid arbitrary scaling of $f_i$.
Denote

$$b_i = \begin{cases} \sum_{j=1}^{c} F_{ij} \log(F_{ij}) & \text{if } x_i \in \mathcal{P}; \\ 0 & \text{if } x_i \in \mathcal{S}. \end{cases} \quad (12)$$

Then we can rewrite (11) as

$$\min_f\; \sum_{i=1}^{n} \frac{1}{|\log(1/c)|}(f_i \times b_i) + \Omega(f_i), \quad \text{s.t. } \sum_{i=1}^{n} f_i = 1,\; f_i \geq 0. \quad (13)$$
In (13), the first term $\sum_{i=1}^{n} \frac{1}{|\log(1/c)|}(f_i \times b_i)$ is used to evaluate the uncertainty of the pool data. $b_i$ depends on $F_{ij}$, $j \in \{1, \ldots, c\}$. Recall that $\sum_{j=1}^{c} F_{ij} = 1$, i.e., $F_{ij}$ is the probability that $x_i$ is in the $j$-th class. Thus the algorithm is able to estimate the uncertainty across multiple classes. If the uncertainty were measured by a binary classifier, the algorithm would turn into a binary-class active learning algorithm. In this sense, one main difference between our algorithm and the S-SVM active learning (Hoi et al. 2008) is that our algorithm is more capable of estimating the uncertainty of the data in an active pool, where the manifold structure is uncovered.
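The uncertainty evaluation of Eqs. (3)-(9) can be sketched as follows (a minimal NumPy sketch, assuming the seed-set samples occupy the first $n_s$ rows and columns of $W$; the function names are our own):

```python
import numpy as np

def class_probabilities(W, Y_seed):
    """Soft labels F of Eq. (8) via the absorbing random walk.
    W: (n, n) graph weights, with the seed set ordered first.
    Y_seed: (n_s, c) one-hot labels of the seed set.
    Returns F of shape (n, c); each row sums to 1."""
    ns = Y_seed.shape[0]
    Q = W / W.sum(axis=1, keepdims=True)      # Q = D^{-1} W, Eq. (3)
    Qpp = Q[ns:, ns:]                          # pool-to-pool block
    Qps = Q[ns:, :ns]                          # pool-to-seed block (= Q_sp^T)
    n_pool = Qpp.shape[0]
    # Eq. (7): absorption probabilities (I - Qpp)^{-1} Q_sp^T Y
    F_pool = np.linalg.solve(np.eye(n_pool) - Qpp, Qps @ Y_seed)
    return np.vstack([Y_seed, F_pool])         # Eq. (8)

def entropy(F, eps=1e-12):
    """Shannon entropy of each row, Eq. (9); larger means more uncertain."""
    P = np.clip(F, eps, 1.0)
    return -(F * np.log(P)).sum(axis=1)
```

A pool sample surrounded by seeds of a single class gets a near one-hot row of `F` and hence low entropy; samples between classes get high entropy.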
Based on (13), the problem is then how to define $\Omega(f_i)$ to incorporate the diversity maximization criterion. We propose a simple yet effective way of computing a kernel matrix $K \in \mathbb{R}^{n \times n}$. For example, if we use the well-known RBF kernel, the $(i,j)$-th element of $K$ can be computed by $K_{ij} = \exp\big(-\frac{\|x_i - x_j\|^2}{\sigma^2}\big)$, where $\sigma$ is a parameter. Given two data points $x_i$ and $x_j$, if they are similar to each other, $K_{ij}$ will have a large value. In this case, we should not have the two data labelled simultaneously. In other words, if $x_i$ is selected as a training sample, $x_j$ should be excluded in some sense. Therefore, given that $K_{ij}$ has a large value, at least one of $f_i$ and $f_j$ should have a small value. We then propose to minimize the following objective function to make the selected data as diverse as possible:

$$\min_f \Omega(f_i) = \min_f \sum_{i=1}^{n} \sum_{j=1}^{n} f_i f_j K_{ij}. \quad (14)$$
We can see that if $K_{ij}$, $f_i$ and $f_j$ are all large, a heavy penalty is incurred in (14). Minimizing (14) therefore makes the selected training data different from each other. Combining the uncertainty criterion and the diversity criterion, we have the following objective function for active learning:

$$\min_f\; \sum_{i=1}^{n} \left( \frac{r}{|\log(1/c)|} (f_i \times b_i) + \sum_{j=1}^{n} f_i f_j K_{ij} \right), \quad \text{s.t. } \sum_{i=1}^{n} f_i = 1,\; f_i \geq 0, \quad (15)$$
where $r$ is a parameter. The objective function shown in (15) can also be viewed as a regularization framework for active learning, in which we use the diversity constraint as a regularizer added to traditional uncertainty sampling. The diversity regularization term is crucial because some of the uncertain data are potentially similar to each other. It is worth mentioning that the batch mode active learning task is a combinatorial optimization problem, as discussed in Hoi et al. (2009). The solution to (15) does not necessarily give the exact optimal solution to the batch mode active learning problem, where the goal is to exactly select an optimal set of the most informative examples. However, (15) approximates the optimal solution of batch mode active learning in an efficient and effective way (Hoi et al. 2009). If we take a closer look at (15), it can be seen that this objective function is inspired by the algorithm proposed in Hoi et al. (2008, 2009). The major difference is the way the algorithm performs uncertainty estimation.
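To make the role of the diversity regularizer concrete, the penalty of Eq. (14) with the RBF kernel can be sketched as follows (a toy NumPy sketch; the function names are our own):

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """RBF kernel K_ij = exp(-||xi - xj||^2 / sigma^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / sigma ** 2)

def diversity_penalty(f, K):
    """Omega(f) = sum_ij f_i f_j K_ij, Eq. (14): large when samples with
    high ranking scores are similar to each other."""
    return float(f @ K @ f)
```

Assigning high scores to two near-duplicate samples yields a larger penalty than spreading the same scores over dissimilar samples, which is exactly the behaviour the regularizer is meant to discourage.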
4 Efficient Optimization
In this section, we optimize the objective function of USDM.
Let $f = [f_1, f_2, \ldots, f_n]^T$. For ease of representation, we define $a = [a_1, a_2, \ldots, a_n]^T$, where $a_i = \frac{r \times b_i}{|\log(1/c)|}$ and $r$ is a parameter. (15) can be rewritten as

$$\min_f\; f^T a + \frac{1}{2} f^T K f, \quad \text{s.t. } \sum_{i=1}^{n} f_i = 1,\; f_i \geq 0. \quad (16)$$
The objective function shown in (16) is a standard quadratic programming (QP) problem, which can be readily solved by existing convex optimization packages. However, a typical QP solver has a high computational complexity of $O(n^3)$, so it is more practical to make the optimization faster. In this section, we propose a faster algorithm to optimize the objective function (16), based on the augmented Lagrange multiplier (ALM) framework (Bertsekas 1999).
4.1 Brief Review of ALM
The ALM algorithm in Bertsekas (1999) is introduced to solve the following constrained minimization problem:

$$\min g(Z), \quad \text{s.t. } h(Z) = 0, \quad (17)$$

where $g: \mathbb{R}^d \to \mathbb{R}$ and $h: \mathbb{R}^d \to \mathbb{R}^d$. A typical way to define the augmented Lagrangian function of (17) is

$$\mathcal{L}(Z, U) = g(Z) + \langle U, h(Z) \rangle + \frac{\mu}{2} \|h(Z)\|_F^2, \quad (18)$$

where $Z$ is the optimization variable, $U$ is the Lagrangian coefficient and $\mu$ is a scalar. The following procedure can be applied to optimize the problem shown in (17).
Algorithm 1: General ALM method (Bertsekas 1999).
1  Set ρ > 1, t = 1, U_1 = 0, μ_1 > 0;
2  repeat
3    Ẑ = arg min_Z L(Z, U_t, μ_t);
4    U_{t+1} = U_t + μ_t h(Ẑ);
5    μ_{t+1} = ρ μ_t;
6    t = t + 1;
7  until convergence;
8  Output Ẑ.
4.2 Efficient Optimization of USDM
In this subsection, we introduce a fast optimization approach for our algorithm under the ALM framework (Bertsekas 1999; Delbos and Gilbert 2005). First we rewrite (16) as follows:

$$\min_{f, v}\; f^T a + \frac{1}{2} f^T K f, \quad \text{s.t. } f^T 1_n = 1,\; v \geq 0,\; f = v, \quad (19)$$

where $1_n \in \mathbb{R}^n$ is a vector of all ones. The augmented Lagrangian function of (19) is defined as

$$\mathcal{L}(f, v, \lambda_1, \lambda_2) = f^T a + \frac{1}{2} f^T K f + \frac{\mu}{2}\Big(f^T 1_n - 1 + \frac{1}{\mu}\lambda_1\Big)^2 + \frac{\mu}{2}\Big\| f - v + \frac{1}{\mu}\lambda_2 \Big\|_2^2, \quad \text{s.t. } v \geq 0. \quad (20)$$

Note that

$$\min_f \mathcal{L}(f, v, \lambda_1, \lambda_2) \;\Leftrightarrow\; \min_f\; \frac{1}{2} f^T A f - f^T e, \quad (21)$$
Algorithm 2: USDM active learning algorithm.
1   Initialization: set ρ > 1, f_i = 1/n (1 ≤ i ≤ n), v = f, λ_1 = 0, λ_2 ∈ R^n a vector of all zeros, μ > 0;
2   repeat
3     Update A by A = K + μ I_n + μ 1_n 1_n^T;
4     Update e by e = μv + μ 1_n − λ_1 1_n − λ_2 − a;
5     Compute f̂ by solving the linear system A f̂ = e;
6     Compute v by v = pos(f̂ + (1/μ) λ_2);
7     Update λ_1 by λ_1 = λ_1 + μ × (Σ_{i=1}^n f̂_i − 1);
8     Update λ_2 by λ_2 = λ_2 + μ × (f̂ − v);
9     μ = ρμ;
10  until convergence;
11  Output f̂.
where

$$A = K + \mu I_n + \mu 1_n 1_n^T \quad (22)$$

and

$$e = \mu v + \mu 1_n - \lambda_1 1_n - \lambda_2 - a. \quad (23)$$

The objective function shown in (21) can be easily optimized by solving a linear system, and we have

$$\hat{f} = \arg\min_f \mathcal{L}(f, v, \lambda_1, \lambda_2) = A^{-1} e. \quad (24)$$
Meanwhile,

$$\min_{v \geq 0} \mathcal{L}(f, v, \lambda_1, \lambda_2) \;\Leftrightarrow\; \min_{v \geq 0}\; \Big\| v - \Big(f + \frac{1}{\mu}\lambda_2\Big) \Big\|^2. \quad (25)$$

By solving the optimization problem shown above, we have

$$\hat{v} = \arg\min_{v \geq 0} \mathcal{L}(f, v, \lambda_1, \lambda_2) = \mathrm{pos}(q), \quad (26)$$

where $q = f + \frac{1}{\mu}\lambda_2$ and $\mathrm{pos}(q)$ is a function which assigns 0 to each negative element of $q$, i.e., for any element $q_i$ of $q$, $\mathrm{pos}(q_i) = \max(q_i, 0)$.
In summary, the proposed USDM active learning algorithm is listed in Algorithm 2. It can be verified that Algorithm 2 converges to the global optimum. Except for step 5, the computation of all the steps in Algorithm 2 is very fast. It is worth noting that when computing $\hat{f}$ in step 5, we only need to solve a linear system; no matrix inversion is involved. Note that there are several efficient linear system solvers that can be readily used. We may also use a faster algorithm to solve or approximate the linear system, e.g., Spielman and Teng (2004). Because it is out of the scope of this paper, we omit the detailed discussion here.
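Putting Sect. 4.2 together, one possible NumPy rendering of Algorithm 2 is the following (a sketch under our own choices of ρ, μ and a fixed iteration count, which the paper leaves unspecified):

```python
import numpy as np

def usdm_scores(a, K, rho=1.05, mu=0.1, n_iter=200):
    """ALM optimization of Eq. (16): min_f f^T a + 0.5 f^T K f,
    s.t. sum(f) = 1, f >= 0 (the steps of Algorithm 2)."""
    n = len(a)
    f = np.full(n, 1.0 / n)      # f_i = 1/n
    v = f.copy()
    lam1 = 0.0
    lam2 = np.zeros(n)
    one = np.ones(n)
    for _ in range(n_iter):
        A = K + mu * np.eye(n) + mu * np.outer(one, one)   # Eq. (22)
        e = mu * v + mu * one - lam1 * one - lam2 - a      # Eq. (23)
        f = np.linalg.solve(A, e)                          # Eq. (24), linear system
        v = np.maximum(f + lam2 / mu, 0.0)                 # Eq. (26), pos(.)
        lam1 = lam1 + mu * (f.sum() - 1.0)                 # multiplier updates
        lam2 = lam2 + mu * (f - v)
        mu = rho * mu
    return np.maximum(f, 0.0)
```

As in step 5 of Algorithm 2, each iteration solves a linear system rather than inverting $A$; the returned $\hat f$ ranks the pool data, and the top-$m$ scores indicate the batch to be labelled.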
Table 1 Dataset description

Name     Size    # of classes  Application            Data type
KTH      2,387   6             Action recognition     Video
Youtube  1,596   11            Action recognition     Video
Coil     1,440   20            Object classification  Image
Scene15  4,485   15            Scene recognition      Image
MED      2,874   18            Video event detection  Video
5 Experiment
In this section, we test the proposed active learning algorithm
by applying it to a variety of visual concept recognition appli-
cations, including action recognition, object classification,
scene recognition, and video event detection.
5.1 Experiment Setup
Five different public datasets are used in the experiment: KTH (Schüldt et al. 2004), Youtube (Liu et al. 2009), Coil (Nene et al. 1996), Scene15 (Lazebnik et al. 2006) and the MED dataset collected by the National Institute of Standards and Technology (NIST).1 Table 1 summarizes the detailed information of the datasets used in the experiment.
We compare our algorithm to both representativeness sampling and uncertainty sampling active learning algorithms in the experiment. The comparison algorithms include SVM active learning (SVMactive) proposed in Tong and Chang (2001), semi-supervised SVM active learning (S-SVM) proposed in Hoi et al. (2008), Laplacian regularized optimal experiment design (LOED) proposed in He et al. (2007), and the multi-class active learning algorithm pKNN proposed in Jain and Kapoor (2009). The $k$ used in our algorithm for graph construction is set to 5 empirically. For the parameters involved in the different active learning algorithms, we similarly tune them from $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$ and report the best results.
Each dataset is randomly split into two non-overlapping subsets, one as the training candidate set and the other as the testing set. In our experiment we fix the size of the training candidate set at 1,000 for all the datasets. Denote $c$ as the number of classes. First, we randomly select 3 positive samples for each class from the training candidate set, i.e., the size of the seed set is $3 \times c$. The remaining data in the training candidate set
1 http://www.nist.gov/itl/iad/mig/
are regarded as the active pool. Then we run each active learning algorithm to select training data. We set the batch size to $1 \times c$, $2 \times c$, ..., $7 \times c$, respectively. During training we use both the data selected by active learning and the data in the seed set as labelled samples. Therefore, there are $4 \times c$, $5 \times c$, ..., $10 \times c$ labelled data for training. In our experiment, after the training data are selected by the different active learning algorithms, a classifier is trained for visual concept recognition. The SVM classifier is used for SVMactive, S-SVM and LOED. For pKNN (Jain and Kapoor 2009), we use the classifier embedded in the algorithm. For our USDM algorithm, we use the random walk classifier described in Sect. 3 to generate the soft label representation for classification. As the random walk classifier is a transductive algorithm, after the pool data have been selected, we reconstruct a larger graph including all data for testing data classification. Note that the class labels of the selected data could be imbalanced. Based on the soft label representation derived by the random walk, binary class labels for each class are then determined by an SVM.
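The data-selection protocol above can be sketched schematically as follows (a NumPy sketch of the splitting and batching only; `select_batch` and `train_eval` are our own placeholders standing in for an active learner and a classifier):

```python
import numpy as np

def run_protocol(X, y, select_batch, train_eval, c, seed=0):
    """Schematic version of the protocol in Sect. 5.1: a random split with a
    training candidate set of 1,000 samples, a seed set of 3 samples per
    class, then batches of size 1*c, ..., 7*c chosen from the active pool."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    train, test = idx[:1000], idx[1000:]
    # seed set: 3 randomly selected positive samples per class
    seed_idx = np.concatenate(
        [rng.choice(train[y[train] == j], 3, replace=False) for j in range(c)])
    pool = np.setdiff1d(train, seed_idx)   # the remaining data: the active pool
    accuracies = []
    for b in range(1, 8):                  # batch sizes 1*c, ..., 7*c
        chosen = select_batch(X, seed_idx, pool, b * c)
        labelled = np.concatenate([seed_idx, chosen])
        accuracies.append(train_eval(X, y, labelled, test))
    return accuracies
```

Each batch size is evaluated independently, mirroring the paper's reporting of accuracy at $4 \times c$ through $10 \times c$ labelled samples.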
5.2 Action Recognition
We use the KTH action dataset (Schüldt et al. 2004) and the Youtube action dataset (Liu et al. 2009) to compare the performance of the different active learning algorithms on action recognition. In this experiment, each video sequence is represented by a 1,000-dimensional BoW STIP feature (Laptev et al. 2008). The KTH action dataset contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors (Schüldt et al. 2004). There are 2,387 action sequences in this dataset. The Youtube action dataset is a real-world dataset collected from Youtube. It contains intended camera motion and variations in object scale, viewpoint and illumination, as well as cluttered backgrounds. There are 11 actions in this dataset: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog (Liu et al. 2009). Four videos in the Youtube action dataset are too short to be processed by the feature extraction code shared by Laptev et al. (2008), so we use a dataset of 1,596 sequences.
Figure 1 compares the performance of the different active learning algorithms for action recognition on the KTH dataset. We observe that LOED outperforms SVMactive. Meanwhile, S-SVM and pKNN significantly improve the accuracy compared with SVMactive and LOED. A possible explanation is that SVMactive is vulnerable to over-fitting, because it does not consider the data distribution of the active pool during the learning process. S-SVM uses the unlabelled data
Fig. 1 A comparison of different active learning algorithms (USDM, SVMactive, S-SVM, LOED, pKNN) on action recognition using the KTH dataset; the x-axis is the batch size and the y-axis is the average accuracy (%). There are 6 different actions in this dataset
Fig. 2 A comparison of different active learning algorithms (USDM, SVMactive, S-SVM, LOED, pKNN) on action recognition using the Youtube dataset; the x-axis is the batch size and the y-axis is the average accuracy (%). There are 11 different actions in this dataset
and pKNN evaluates the uncertainty across multiple classes, so more information is used by them. Our algorithm dramatically outperforms all the competitors at all batch sizes. Figure 2 shows the experimental results on action recognition using the Youtube dataset. The video sequences in the Youtube dataset were downloaded from Youtube; it is much noisier than the lab-generated KTH dataset. Yet we observe that our algorithm consistently outperforms the other active learning algorithms. The experimental results demonstrate the advantages of our algorithm.
[Line plot: average accuracy (%) vs. batch size (20-140) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 3 A comparison of different active learning algorithms on object
classification. There are 20 different objects in this dataset
5.3 Object Classification
Figure 3 shows the experimental results of object recognition
on the Coil dataset, which consists of 1,440 grey-scale
images (Nene et al. 1996). There are 20 different objects in
total. Each image was resized to 32 × 32. We use the grey
values as the features of the images, with a dimension of 1,024.
Both pKNN and SVMactive train classifiers only on the
seed set for uncertainty evaluation. We can see from Fig. 3
that pKNN generally outperforms SVMactive, indicating that
it is beneficial to evaluate the uncertainty of data across multiple
classes. S-SVM gains the second best performance because it
explores the data distribution of the active pool. Our
algorithm achieves the best performance. Compared with the
second best algorithm, S-SVM, our algorithm has two main
advantages. First, the random walk algorithm is better at
uncovering the manifold structure (Tenenbaum et al.
2000) of the entire active pool when evaluating the uncertainty of the
pool data. Although S-SVM also takes the distribution of the
pool data into consideration, the manifold structure is lost
when training the SVM classifier for uncertainty evaluation.
Thus, it suffers to some extent from the small size of the training
data, especially when the data lie on a manifold. Second,
our algorithm is a multi-class active learning algorithm,
which is able to evaluate the "informativeness" of the pool
data globally.
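The intuition behind these two advantages can be made concrete with a small sketch. The code below is not the exact USDM formulation; it is an illustration assuming a Gaussian-kernel graph over the whole active pool, iterative label propagation in the style of Zhu et al. (2003) as a stand-in for the random-walk step, and the entropy of the propagated class posterior as the multi-class uncertainty score.

```python
import numpy as np

def multiclass_uncertainty(X, seed_idx, seed_labels, n_classes, sigma=1.0):
    """Entropy-based uncertainty over an active pool via graph label propagation.

    X           : (n, d) feature matrix of the whole active pool (seed included)
    seed_idx    : indices of the labelled seed points
    seed_labels : integer labels (0..n_classes-1) of the seed points
    Returns an (n,) array of entropy scores; seed points get score 0.
    """
    n = X.shape[0]
    # Weighted graph: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)   # random-walk transition matrix

    # Iterative label propagation, clamping the seed labels at every step
    Y = np.zeros((n, n_classes))
    Y[seed_idx, seed_labels] = 1.0
    F = Y.copy()
    for _ in range(200):
        F = P @ F
        F[seed_idx] = Y[seed_idx]          # clamp labelled points
    F = F / np.maximum(F.sum(axis=1, keepdims=True), 1e-12)

    # Multi-class uncertainty: entropy of the propagated class posterior
    ent = -(F * np.log(F + 1e-12)).sum(axis=1)
    ent[seed_idx] = 0.0
    return ent
```

A point lying between two labelled clusters receives a near-uniform posterior and thus the highest entropy, which is exactly the kind of sample an uncertainty-based criterion would select.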
5.4 Scene Recognition
To test the performance of USDM in scene recognition, we
use the Scene15 dataset, which contains 4,485 images from
15 different scenes (Lazebnik et al. 2006). In this experiment,
we extract the HOG feature to represent the images;
the dimension of the feature vectors is 6,300.
[Line plot: average accuracy (%) vs. batch size (15-105) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 4 A comparison of different active learning algorithms on scene
recognition. There are 15 different scenes in this dataset
Figure 4 shows the experimental results. Both pKNN
and SVMactive use only the seed set to train the classifiers
for uncertainty evaluation. pKNN generally outperforms
SVMactive, which indicates that multi-class active learning
(e.g., pKNN) is a more powerful approach. If we take the pool
data into consideration for training data selection, the performance
is further improved. As shown in Fig. 4, S-SVM
outperforms SVMactive at all batch sizes, although not as
significantly as in the other applications. We observe that our
algorithm outperforms all the other algorithms. Given that
S-SVM performs well on this dataset and our USDM still
outperforms S-SVM, we conclude that it is better to leverage
the manifold structure of the active pool and the seed set to
evaluate uncertainty for active learning.
5.5 Complex Event Detection
In this subsection, we compare the different active learn-
ing algorithms on complex event detection (Ma et al. 2014;
Yang et al. 2013). We merge the MED10 dataset and the
MED11 dataset into one, which is referred to as MED in this
paper. In the experiment, we use the MoSIFT (Chen and
Hauptmann 2009) descriptor, based on which a 32,768-dimensional
spatial BoW feature is computed to represent each video
sequence (Yang et al. 2013). Principal component analysis is
performed to remove the null space.
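Removing the null space of such high-dimensional, low-sample-count features is a standard preprocessing step. The snippet below is a minimal sketch of one common way to do it (SVD-based PCA that keeps only directions with non-negligible variance); the exact PCA variant used in the paper is not specified.

```python
import numpy as np

def remove_null_space(X, tol=1e-10):
    """Project X onto its principal subspace, dropping zero-variance directions.

    X: (n, d) data matrix with n samples; when n < d (e.g. 32,768-dim BoW
    features for a few thousand videos), at most n-1 directions carry variance.
    """
    Xc = X - X.mean(axis=0)                # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = S > tol * S[0]                  # drop (near-)null directions
    return Xc @ Vt[keep].T                 # reduced representation
```

Because the projection is onto the span of the centred data, pairwise distances between the samples are preserved while the dimension drops to at most n - 1.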
Figure 5 shows keyframes from a video of the "changing
a vehicle tire" event. We can see that the MED dataset is
rather "wild", so the problem is more difficult than on the
other datasets. In the experiment, we used all the videos
labelled as one of the 18 events. The number of positive
samples per event varies from 80 to 170, and there are 2,874
positive samples for the 18 events in total.
Fig. 5 An example video sequence of the "changing a vehicle tire" event from the MED dataset
[Line plot: average accuracy (%) vs. batch size (18-126) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 6 A comparison of different active learning algorithms on complex
event detection. There are 18 different events in this dataset
Figure 6 shows the experimental results on complex event
detection. We can see that our algorithm consistently outperforms
all the competitors, and its advantage over the other
algorithms is quite visible. As the batch size grows, the performance
of pKNN and S-SVM improves but remains worse
than that of our algorithm. This experiment demonstrates that
the proposed algorithm is more robust in dealing with "wild"
data, compared to the state of the art.
5.6 Performance Comparison Using Different Seed Sizes
In this subsection, we examine the impact of the initial training
size, given that it usually plays a key role in semi-supervised
learning tasks. We perform this experiment by
varying the seed size between 1 × c and 5 × c. As LOED is
an unsupervised method that does not use the labelled seed
set, we leave it out in this experiment. Figures 7, 8, 9, 10 and
11 show the experimental results on the different datasets.
The experimental results in Figs. 7, 8, 9, 10 and 11, together
with the results when the seed size is 3 × c,
demonstrate that for different seed sizes our method consistently
yields compelling performance, validating its efficacy
in selecting the most informative data for a variety of vision
tasks. Meanwhile, we notice that S-SVM also obtains good
performance, which further indicates that leveraging the
unlabelled pool data does help improve active learning performance.
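For concreteness, the seed-set construction used in this comparison can be sketched as stratified random sampling: draw the same number of labelled examples from each of the c classes. This is an assumed reading of the 1 × c to 5 × c setting, not code from the paper.

```python
import numpy as np

def sample_seed_set(labels, per_class, rng=None):
    """Randomly draw `per_class` labelled examples from every class,
    giving a seed set of size per_class * c (c = number of classes)."""
    rng = np.random.default_rng(rng)
    classes = np.unique(labels)
    seed = [rng.choice(np.where(labels == k)[0], size=per_class, replace=False)
            for k in classes]
    return np.concatenate(seed)
```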
5.7 Performance Comparison on the Pool Data
From this subsection on, taking the Youtube action dataset as
an example, we report experimental results that test more
characteristics of the proposed algorithm. These experiments
include (1) classification accuracy on the active pool data; (2)
classification accuracy when different classifiers are used; (3)
classification accuracy when a different feature is used; and (4)
classification accuracy when the unweighted graph is used.
First, we evaluate the classification accuracy of different
active learning algorithms on the pool data. To this end, we
exclude the seed data and the selected batch data and
treat the remaining pool data as the testing data. Since LOED
is unsupervised, we leave it out of the comparison. Figure 12
displays the experimental results. We can see that our method
consistently gains the top performance, whereas S-SVM and
pKNN obtain good performance as well. This observation is
consistent with the results when the testing data are outside
the active pool, again demonstrating the effectiveness of the
proposed USDM algorithm.
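The evaluation protocol in this subsection amounts to a simple index split over the active pool; a minimal sketch, with names chosen here purely for illustration:

```python
import numpy as np

def pool_test_split(n_pool, seed_idx, batch_idx):
    """Indices of the remaining pool data used as the test set:
    everything in the active pool except the labelled seed set and
    the batch just selected for annotation."""
    excluded = set(seed_idx) | set(batch_idx)
    return np.array([i for i in range(n_pool) if i not in excluded])
```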
5.8 Performance Comparison Using a Different Feature
In this subsection, we compare the performance of the active
learning algorithms when a different feature is used. In the
[Two line plots: average accuracy (%) vs. batch size (6-42) for USDM, SVMactive, S-SVM and pKNN; (a) seed size: c, (b) seed size: 5 × c]
Fig. 7 Performance comparison on KTH dataset w.r.t. different seed size. Our method is consistently competitive
[Two line plots: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM and pKNN; (a) seed size: c, (b) seed size: 5 × c]
Fig. 8 Performance comparison on Youtube dataset w.r.t. different seed size. Our method is consistently competitive
[Two line plots: average accuracy (%) vs. batch size (20-140) for USDM, SVMactive, S-SVM and pKNN; (a) seed size: c, (b) seed size: 5 × c]
Fig. 9 Performance comparison on Coil dataset w.r.t. different seed size. Our method is consistently competitive
[Two line plots: average accuracy (%) vs. batch size (15-105) for USDM, SVMactive, S-SVM and pKNN; (a) seed size: c, (b) seed size: 5 × c]
Fig. 10 Performance comparison on Scene15 dataset w.r.t. different seed size. Our method is consistently competitive
[Two line plots: average accuracy (%) vs. batch size (18-126) for USDM, SVMactive, S-SVM and pKNN; (a) seed size: c, (b) seed size: 5 × c]
Fig. 11 Performance comparison on MED dataset w.r.t. different seed size. Our method is consistently competitive
previous experiments, we have used the STIP feature for
action recognition. In this experiment, we use the MoSIFT
feature (Chen and Hauptmann 2009) instead. Figure 13 shows
the experimental result, where a 1,000-dimensional BoW
MoSIFT feature is used to represent the videos in the Youtube
dataset.
Comparing Figs. 2 and 13, we can see that the MoSIFT
feature performs better than the STIP feature on the Youtube
dataset. As before, our algorithm dramatically outperforms
the other compared algorithms. In particular, when 11, 22,
and 33 data are selected, our algorithm outperforms the second
best algorithm by about 10 % in relative terms. This experiment
demonstrates that when a better feature is used, the performance
of an active learning algorithm usually improves. Nevertheless,
our algorithm consistently outperforms the other competitors
when a different feature is used.
5.9 Performance Comparison Using Different Classifiers
The function of active learning algorithms is to select the
most informative data for supervision and then use these
labelled data as input to a specific classification algorithm
to train a classifier for recognition. An interesting question is
how the active learning algorithms perform if
we use a different classifier. In this subsection, we again use
the Youtube dataset as a showcase to compare different active
learning algorithms with some other classifiers.
We first use Least Squares Regression (LSR) as the classifier
for action recognition. This time, LSR is used for all the
active learning algorithms, including our USDM, SVMactive,
S-SVM, LOED and pKNN. Each active learning algorithm
is first performed to select the training data, based on which
an LSR classifier is trained for action recognition. Figures 14
[Line plot: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM and pKNN]
Fig. 12 A comparison of different active learning algorithms on classifying
the pool data. The results are based on Youtube dataset
[Line plot: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 13 A comparison of different active learning algorithms on action
recognition using the MoSIFT feature. The results are based on the
Youtube dataset
and 15 show the experimental results when the STIP feature
and the MoSIFT feature are used, respectively. We can see
from the two figures that the proposed algorithm USDM outperforms
all the other algorithms at all batch sizes when LSR
is used as the classifier. This experiment further demonstrates
that our algorithm is more effective than other active learning
algorithms when a different classifier is used.
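Least squares regression as a multi-class classifier is commonly implemented as ridge regression onto one-hot targets, with prediction by arg-max. The paper does not give its exact LSR formulation, so the following is a sketch of that standard construction (the regularization value is an assumption):

```python
import numpy as np

class LeastSquaresClassifier:
    """Multi-class least squares regression: regress one-hot targets with a
    small ridge term, classify by the arg-max of the regressed scores."""
    def __init__(self, reg=1e-3):
        self.reg = reg

    def fit(self, X, y, n_classes):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
        Y = np.eye(n_classes)[y]                        # one-hot targets
        d = Xb.shape[1]
        # Closed-form ridge solution: W = (Xb'Xb + reg I)^-1 Xb'Y
        self.W = np.linalg.solve(Xb.T @ Xb + self.reg * np.eye(d), Xb.T @ Y)
        return self

    def predict(self, X):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return (Xb @ self.W).argmax(axis=1)
```

Any active learning strategy can hand its selected labelled batch to `fit`, which is the experimental setup used in this subsection.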
Next, we additionally use KNN as the classifier to compare
the performance of different active learning algorithms
on the Youtube dataset. In this experiment, we use the same
setting as for LSR. Figures 16 and 17 show the experimental
results when the STIP feature and the MoSIFT feature are used,
respectively. For both features, when using KNN as the classifier,
our algorithm dramatically outperforms the competitors. The
[Line plot: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 14 A comparison of different active learning algorithms on
Youtube dataset using STIP feature. In this experiment, least squares
regression (LSR) is used as the classifier for action recognition
[Line plot: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 15 A comparison of different active learning algorithms on
Youtube dataset using MoSIFT feature. In this experiment, LSR is used
as the classifier for action recognition
visual concept recognition accuracy generally relies on three
factors: the feature, the classifier, and the data selected for
supervision. We observe in the experiments that our USDM
algorithm consistently outperforms the other methods when a
different feature and/or a different classifier is used.
The results with different classifiers reported in this section
also show that our algorithm remains better than the compared
algorithms regardless of the classifier used. We observe similar
performance on all datasets if an SVM classifier is directly
trained instead of reconstructing a larger graph for random
walk. Thus, in real-world applications one may directly
train an inductive classifier, e.g., an SVM, based on the data
[Line plot: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 16 A comparison of different active learning algorithms on
Youtube dataset using STIP feature. In this experiment, KNN is used
as the classifier for action recognition
[Line plot: average accuracy (%) vs. batch size (11-77) for USDM, SVMactive, S-SVM, LOED and pKNN]
Fig. 17 A comparison of different active learning algorithms on
Youtube dataset using MoSIFT feature. In this experiment, KNN is
used as the classifier for action recognition
selected by the USDM algorithm to reduce the computation
cost.
5.10 Performance Variation Using an Unweighted Graph
In the previous experiments, we used the weighted graph
defined in (1) for the random walks. In the following experiment
we compare the weighted graph, with different values of
the parameter σ, against the unweighted graph defined in (2).
Again, we use the Youtube dataset as a showcase, with a
randomly generated seed set of 33 videos. Figure 18 shows
the experimental results.
[Line plot: average accuracy (%) vs. batch size (11-77); curves for σ = 1e-6, 1e-4, 0.01, 1, 100, 1e4, 1e6, and the unweighted graph]
Fig. 18 Performance comparison between unweighted graph and
weighted graph with different σ on Youtube dataset
It can be seen that if σ is appropriately chosen, the
weighted graph usually gives better performance; in other
words, better performance can be expected when σ is close to
optimal. In this experiment, the performance is usually better
when σ is small. However, the optimal σ is data dependent,
and can be determined by cross validation or preliminary
experiments.
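For concreteness, the two graph constructions being compared can be sketched as follows. The weighted graph is assumed to be the usual Gaussian-kernel affinity (matching the role of σ here), and the unweighted graph a symmetrised 0/1 k-nearest-neighbour graph; the exact definitions are those given in Eqs. (1) and (2) of the paper and may differ in detail.

```python
import numpy as np

def weighted_graph(X, sigma):
    """Gaussian-kernel weighted graph: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)               # no self-loops
    return W

def unweighted_knn_graph(X, k):
    """0/1 graph connecting each point to its k nearest neighbours (symmetrised)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)           # a point is not its own neighbour
    nn = np.argsort(sq, axis=1)[:, :k]
    W = np.zeros_like(sq)
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)              # symmetrise
```

The weighted graph's behaviour depends smoothly on σ (which is what Fig. 18 varies), whereas the unweighted graph has no bandwidth parameter at all, trading tuning effort for some loss in accuracy.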
5.11 Computational Efficiency Comparison
Finally, taking the MED dataset as an example, we compare
the computational efficiency of the supervised and semi-supervised
active learning algorithms. The computation time
of the semi-supervised active learning algorithms mainly
depends on the size of the active pool. In this experiment,
the active pool size varies from 250 to 1,250, with an interval
of 250. All experiments are implemented in Matlab R2011a
on a machine with 24 Intel Xeon cores and 64.0 GB RAM.
Figure 19 shows the average time elapsed to select 18
data points for labelling, i.e., the batch size is 1 × c. Note that
only S-SVM and our USDM exploit the data distribution, while
SVMactive and pKNN merely utilize the seed set for active
learning. Thus the size of the active pool does not affect the speed
of pKNN and SVMactive much. Although SVMactive and
pKNN are faster, their performance in the previous experiments
is worse than that of the semi-supervised active learning
algorithms S-SVM and USDM. In our experiments, as a
semi-supervised active learning algorithm, S-SVM generally
achieves the second best accuracy. As shown
in Fig. 19, our algorithm dramatically outperforms S-SVM in
[Line plot: time (seconds) vs. pool size (250-1,250) for USDM, SVMactive, S-SVM and pKNN]
Fig. 19 Running time of different active learning algorithms w.r.t.
different pool sizes. The result shown in this figure is the elapsed time
(seconds) of selecting pool data for training
efficiency. If we increase the pool size, the efficiency advan-
tage of our algorithm over S-SVM will become more visible.
6 Conclusion
Generally speaking, there are three important factors in visual
concept recognition: the features, the classifiers
and the data selected for supervision. In this paper, we have
proposed a new active learning algorithm, USDM, for visual
concept recognition. To address the problem of the small seed
set size in uncertainty sampling, we proposed to exploit the
distribution of all the data in the active pool and the seed
set. Considering that the uncertain data in the active pool are
potentially similar to each other, we proposed to make the
selection as diverse as possible. USDM is able to evaluate
the "informativeness" of a sample across multiple classes,
making the selection more accurate. An efficient algorithm
was used to optimize the objective function of USDM. Extensive
experiments on a variety of applications with different
classifiers and features demonstrate that USDM dramatically
outperforms the state of the art. We have observed that even
if the sizes of the seed and pool sets are the same, the classification
performance differs when the seed and pool sets
themselves are different. In our future research, we will study
how to optimally initialize the seed set and active pool.
Acknowledgments This paper was partially supported by the US
Department of Defense, the U.S. Army Research Office (W911NF-13-1-
0277), partially supported by the ARC DECRA project DE130101311,
and partially supported by the Tianjin Key Laboratory of Cognitive
Computing and Application.
References
Bertsekas, D. (1999). Nonlinear programming (2nd ed.). Belmont, MA:
Athena Scientific.
Brinker, K. (2003). Incorporating diversity in active learning with sup-
port vector machines. In International conference on machine
learning.
Campbell, C., Cristianini, N., & Smola, A. J. (2000). Query learning
with large margin classifiers. In ICML.
Chattopadhyay, R., Wang, Z., Fan, W., Davidson, I., Panchanathan, S.,
& Ye, J. (2012). Batch mode active sampling based on marginal
probability distribution matching. In KDD (pp. 741–749).
Chen, M., & Hauptmann, A. (2009). Mosift: Recognizing human
actions in surveillance videos. In Technical Report CMU-CS-09-
161.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning
with statistical models. Journal of Artificial Intelligence Research
(JAIR),4, 129–145.
Delbos, F., & Gilbert, J. (2005). Global linear convergence of an augmented
Lagrangian algorithm to solve convex quadratic optimization
problems. Journal of Convex Analysis, 12(1), 45–69.
Doyle, P. G., & Snell, J. L. (1984). Random walks and electric networks.
Washington, DC: Mathematical Association of America.
Gong, B., Grauman, K., & Sha, F. (2014). Learning kernels for unsupervised
domain adaptation with applications to visual object recognition.
International Journal of Computer Vision, 109(1–2), 3–27.
Han, Y., Yang, Y., Yan, Y., Ma, Z., Sebe, N., & Zhou, X. (2014). Semi-supervised
feature selection via spline regression for video semantic
recognition. IEEE Transactions on Neural Networks and Learning
Systems. doi:10.1109/TNNLS.2014.2314123.
He, X., Min, W., Cai, D., & Zhou, K. (2007). Laplacian optimal design
for image retrieval. In SIGIR.
Hoi, S., Jin, R., Zhu, J., & Lyu, M. (2008). Semi-supervised SVM batch
mode active learning for image retrieval. In CVPR.
Hoi, S., Jin, R., Zhu, J., & Lyu, M. (2009). Semisupervised svm batch
mode active learning with applications to image retrieval. ACM
Transactions on Information Systems,27(3), 16:1–16:29.
Hoi, S., & Lyu, M. (2005). A semi-supervised active learning framework
for image retrieval. CVPR,2, 302–309.
Jain, P., & Kapoor, A. (2009). Active learning for large multi-class
problems. In CVPR.
Jegelka, S., Kapoor, A., & Horvitz, E. (2014). An interactive approach to
solving correspondence problems. International Journal of Com-
puter Vision,108(1–2), 49–58.
Joshi, A., Porikli, F., & Papanikolopoulos, N. (2009). Multi-class active
learning for image classification. In CVPR.
Kapoor, A., Grauman, K., Urtasun, R., & Darrell, T. (2010). Gaussian
processes for object categorization. International Journal of Com-
puter Vision,88(2), 169–188.
Kowdle, A., Chang, Y., Gallagher, A., & Chen, T. (2011). Active learn-
ing for piecewise planar 3D reconstruction. In CVPR.
Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning
realistic human actions from movies. In CVPR.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories.
In CVPR.
Li, H., Shi, Y., Chen, M., Hauptmann, A., & Xiong, Z. (2010). Hybrid
active learning for cross-domain video concept detection. In ACM
Multimedia.
Li, M., & Sethi, I. K. (2006). Confidence-based active learning. IEEE
Transactions on Pattern Analysis and Machine Intelligence,28(8),
1251–1261.
Li, X., Wang, L., & Sung, E. (2004). Multilabel SVM active learning
for image classification. In ICIP.
Lindenbaum, M., Markovitch, S., & Rusakov, D. (2004). Selective sam-
pling for nearest neighbor classifiers. Machine Learning,54(2),
125–152.
Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from
videos in the wild. In CVPR.
Ma, Z., Yang, Y., Nie, F., Sebe, N., Yan, S., & Hauptmann, A. (2014).
Harnessing lab knowledge for real-world action recognition. Inter-
national Journal of Computer Vision,109(1–2), 60–73.
Ma, Z., Yang, Y., Sebe, N., & Hauptmann, A. (2014). Knowledge adaptation
with partially shared features for event detection using few
exemplars. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 36(9), 1789–1802.
Nene, S., Nayar, S., & Murase, H. (1996). Columbia object image library
(coil-20). Technical Report CUCS-005-96.
Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human
actions: A local SVM approach. In ICPR.
Shen, H., Yu, S.-I., Yang, Y., Meng, D., & Hauptmann, A. (2014). Unsu-
pervised video adaptation for parsing human motion. In ECCV.
Spielman, D., & Teng, S.-H. (2004). Nearly-linear time algorithms for
graph partitioning, graph sparsification, and solving linear sys-
tems. In STOC.
Tenenbaum, J., Silva, V., & Langford, J. C. (2000). A global geomet-
ric framework for nonlinear dimensionality reduction. Science,
290(5500), 2319–2323.
Tong, S., & Chang, E. (2001). Support vector machine active learning
for image retrieval. In ACM Multimedia.
Vondrick, C., & Ramanan, D. (2011). Video annotation and tracking
with active learning. In NIPS.
Wang, L., Chan, K. L., & Zhang, Z. (2003). Bootstrapping SVM active
learning by incorporating unlabelled images for image retrieval.
In CVPR (pp. 629–634).
Yan, R., Yang, J., & Hauptmann, A. (2003). Automatically labeling
video data using multi-class active learning. In ICCV.
Yang, Y., Ma, Z., Hauptmann, A., & Sebe, N. (2013). Feature selection
for multimedia analysis by sharing information among multiple
tasks. IEEE Transactions on Multimedia,15(3), 661–669.
Yang, Y., Ma, Z., Xu, Z., Yan, S., & Hauptmann, A. (2013). How related
exemplars help complex event detection in web videos. In ICCV.
Yang, Y., Nie, F., Xu, D., Luo, J., Zhuang, Y., & Pan, Y. (2012). A
multimedia retrieval framework based on semi-supervised ranking
and relevance feedback. IEEE Transactions on Pattern Analysis
and Machine Intelligence,34(4), 723–742.
Yu, K., Bi, J., & Tresp, V. (2006). Active learning via transductive
experimental design. In ICML (pp. 1081–1088).
Zhu, X. (2008). Semi-supervised learning literature survey. Technical
Report, University of Wisconsin-Madison.
Zhu, X., Ghahramani, Z., & Lafferty, J.D. (2003). Semi-supervised
learning using gaussian fields and harmonic functions. In ICML
(pp. 912–919).
... Diversity-based AL strategies aim to select the most representative samples from the unlabelled dataset to cover the underlying variability of the dataset. These methods prioritize selecting samples that are dissimilar to those already selected for labeling [20]. One common approach is to cluster the unlabelled samples based on their feature representations and then select the most representative samples from each cluster [7]. ...
Conference Paper
Active learning (AL) aims to improve the model performance with minimal data annotation. While recent AL studies have utilized feature mixing to identify unlabeled instances with novel features, applying it to natural language processing (NLP) tasks has been challenged due to the discrete nature of text tokens and the limited contribution of some novel features. To address these issues, we propose a two-stage acquisition method based on feature mixing for NLP tasks. We first create a mixed feature for both labeled and unlabeled instances to identify the features in the unlabeled instances that the model cannot recognize. Next, we evaluate the contribution of these novel features to the model using the entropy of the nearest labeled neighbors. The proposed method enables the model to select the most informative samples in the unlabeled sample pool. Experiments on sentiment analysis, topic classification, and natural language inference validated that our method not only outperforms other AL approaches but improves the efficiency of batch data acquisition.
... We follow the pattern of MS-COCO to generate partial annotations with support from the well-experienced engineers. Enlightened by Yang et al. (2015), active learning method is adopt to avoid the boring hand-crafted annotation. The occluded workflow videos are selected and manually separated into two groups: light occlusion and heavy occlusion with increasing occlusion levels, which contain 2 sub-levels and 3 sub-levels respectively. ...
Article
Full-text available
Workflow recognition relying on deep convolutional neural network has obtained promising performance. Though impressive results have been achieved on standard industrial workflow, the performance on heavily occluded workflow remains far from satisfactory. In this paper, we present an effective context-aware compositional ConvNet (CA-CompNet) for occluded workflow detection with the following contributions. First, we combine compositional model and original ConvNet together to build a unified deep architecture for occluded workflow detection, which has shown innate robustness to address the problem of object classification under occlusion. Second, in order to overcome the variable occlusion limitations, the bounding box annotations are utilized to segment the context from target workflow instance during training. Then, these segmentations are used to learn the proposed CA-CompNet, which enables the network to untangle the feature representation of workflow instance from the context. Third, a robust voting mechanism for candidate bounding box is introduced to improve the detection accuracy, which facilitates the model to precisely detect the bounding box of a specific workflow instance. Comprehensive experiments demonstrate that the proposed context-aware network can robustly detect workflow instance under occlusion in industrial environment, increasing the detection performance on MS COCO dataset by 4.6% (from 45.1 to 49.7%) in absolute performance compared to the advanced CenterNet.
... Researchers also appealed instance selection for establishing a robust framework for imbalanced data sets, active learning and dealing with big data. In [37] and in [38] authors performed instance selection for active learning approaches. Kuncheva et.al. ...
Article
Full-text available
To improve both the efficiency and accuracy of video semantic recognition, we can perform feature selection on the extracted video features to select a subset of features from the high-dimensional feature set for a compact and accurate video data representation. Provided the number of labeled videos is small, supervised feature selection could fail to identify the relevant features that are discriminative to target classes. In many applications, abundant unlabeled videos are easily accessible. This motivates us to develop semisupervised feature selection algorithms to better identify the relevant video features, which are discriminative to target classes by effectively exploiting the information underlying the huge amount of unlabeled video data. In this paper, we propose a framework of video semantic recognition by semisupervised feature selection via spline regression (S(2)FS(2)R) . Two scatter matrices are combined to capture both the discriminative information and the local geometry structure of labeled and unlabeled training videos: A within-class scatter matrix encoding discriminative information of labeled training videos and a spline scatter output from a local spline regression encoding data distribution. An l2,1 -norm is imposed as a regularization term on the transformation matrix to ensure it is sparse in rows, making it particularly suitable for feature selection. To efficiently solve S(2)FS(2)R , we develop an iterative algorithm and prove its convergency. In the experiments, three typical tasks of video semantic recognition, such as video concept detection, video classification, and human action recognition, are used to demonstrate that the proposed S(2)FS(2)R achieves better performance compared with the state-of-the-art methods.
Article
Full-text available
Much research on human action recognition has been oriented toward the performance gain on lab-collected datasets. Yet real-world videos are more diverse, with more complicated actions and often only a few of them are precisely labeled. Thus, recognizing actions from these videos is a tough mission. The paucity of labeled real-world videos motivates us to “borrow” strength from other resources. Specifically, considering that many lab datasets are available, we propose to harness lab datasets to facilitate the action recognition in real-world videos given that the lab and real-world datasets are related. As their action categories are usually inconsistent, we design a multi-task learning framework to jointly optimize the classifiers for both sides. The general Schatten $p$ -norm is exerted on the two classifiers to explore the shared knowledge between them. In this way, our framework is able to mine the shared knowledge between two datasets even if the two have different action categories, which is a major virtue of our method. The shared knowledge is further used to improve the action recognition in the real-world videos. Extensive experiments are performed on real-world datasets with promising results.
Article
For many types of machine learning algorithms, one can compute the statistically `optimal' way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.
Conference Paper
In this paper, we propose a method to parse human motion in unconstrained Internet videos without labeling any videos for training. We use training samples from a public image pose dataset to avoid the tedium of labeling video streams. Two main problems arise: first, the distributions of images and videos are different; second, no temporal information is available in the training images. To smooth the inconsistency between the labeled images and unlabeled videos, our algorithm iteratively incorporates the pose knowledge harvested from the testing videos into the image pose detector via an adjust-and-refine method. During this process, continuity and tracking constraints are imposed to leverage the spatio-temporal information available only in videos. We collected two datasets from YouTube for our experiments, which show that our method achieves good performance for parsing human motions. Furthermore, we found that using unlabeled video yields better performance than adding more labeled pose images to the training set.
Article
Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs—30,000 auditory nerve fibers or 10^6 optic nerve fibers—a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.
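The approach described (Isomap) admits a compact sketch: build a k-nearest-neighbor graph, estimate geodesic distances by shortest paths along the graph, then apply classical MDS to the geodesic distance matrix. The function below is a minimal NumPy rendition, not the authors' implementation; the helix data and parameter choices are illustrative.

```python
import numpy as np

def isomap(X, n_neighbors=5, n_components=2):
    """Minimal Isomap sketch: kNN graph -> geodesic (shortest-path)
    distances -> classical MDS on the geodesic distance matrix."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Keep only each point's k nearest neighbors (symmetrized graph).
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]
    # Floyd-Warshall: geodesic distances along the neighborhood graph.
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Classical MDS: double-center the squared distances, take top eigenpairs.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Illustrative data: a 1-D helix embedded in 3-D; Isomap should
# recover the arc-length parameter as its leading coordinate.
t = np.linspace(0.0, 3 * np.pi, 60)
X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
Y = isomap(X, n_neighbors=6, n_components=2)
```

Unlike PCA on the same data, the leading Isomap coordinate follows the curve's intrinsic parameter rather than any single ambient axis.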
Article
Finding correspondences among objects in different images is a critical problem in computer vision. Even good correspondence procedures can fail, however, when faced with deformations, occlusions, and differences in lighting and zoom levels across images. We present a methodology for augmenting correspondence matching algorithms with a means of triaging where human attention and effort will be most valuable in assisting the automated matching. For guiding the mix of human and automated initiatives, we introduce a measure of the expected value of resolving correspondence uncertainties. We explore the value of the approach with experiments on benchmark data.
Article
Multimedia event detection (MED) is an emerging area of research. Previous work mainly focuses on simple event detection in sports and news videos, or abnormality detection in surveillance videos. In contrast, we focus on detecting more complicated and generic events that attract greater user interest, and we explore an effective solution for MED. Moreover, our solution uses only a few positive examples, since precisely labeled multimedia content is scarce in the real world. As the information from these few positive examples is limited, we propose using knowledge adaptation to facilitate event detection. Different from the state of the art, our algorithm is able to adapt knowledge from another source for MED even if the features of the source and the target are only partially overlapping. Avoiding the requirement that the two domains be consistent in feature types is desirable, as data collection platforms change or augment their capabilities and we should be able to respond with little or no effort. We perform extensive experiments on real-world multimedia archives consisting of several challenging events. The results show that our approach outperforms several other state-of-the-art detection algorithms.
Article
Domain adaptation aims to correct the mismatch in statistical properties between the source domain on which a classifier is trained and the target domain to which the classifier is to be applied. In this paper, we address the challenging scenario of unsupervised domain adaptation, where the target domain does not provide any annotated data to assist in adapting the classifier. Our strategy is to learn robust features which are resilient to the mismatch across domains and then use them to construct classifiers that will perform well on the target domain. To this end, we propose novel kernel learning approaches to infer such features for adaptation. Concretely, we explore two closely related directions. In the first direction, we propose unsupervised learning of a geodesic flow kernel (GFK). The GFK summarizes the inner products in an infinite sequence of feature subspaces that smoothly interpolates between the source and target domains. In the second direction, we propose supervised learning of a kernel that discriminatively combines multiple base GFKs. Those base kernels model the source and the target domains at fine-grained granularities. In particular, each base kernel pivots on a different set of landmarks—the most useful data instances that reveal the similarity between the source and the target domains, thus bridging them to achieve adaptation. Our approaches are computationally convenient, automatically infer important hyper-parameters, and are capable of learning features and classifiers discriminatively without demanding labeled data from the target domain. In extensive empirical studies on standard benchmark recognition datasets, our approaches yield state-of-the-art results compared to a variety of competing methods.
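The "infinite sequence of interpolating subspaces" can be approximated by sampling a finite number of points along the Grassmann geodesic between the source and target PCA subspaces and concatenating the projections of the data onto each. This is a hedged, finite-sample stand-in for the closed-form GFK of the paper, not the authors' implementation; `pca_basis` and `sampled_flow_features` are names invented for this sketch.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of centered data (orthonormal columns)."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc.T @ Xc)
    return U[:, :d]

def sampled_flow_features(X, Ps, Pt, steps=10):
    """Project X onto subspaces sampled along the Grassmann geodesic from
    span(Ps) to span(Pt) and concatenate the projections."""
    # Principal angles between the two subspaces via SVD of Ps^T Pt.
    V1, cos_th, V2t = np.linalg.svd(Ps.T @ Pt)
    cos_th = np.clip(cos_th, -1.0, 1.0)
    theta = np.arccos(cos_th)
    A = Ps @ V1                                  # rotated source basis
    sin_th = np.sin(theta)
    safe = np.where(sin_th > 1e-12, sin_th, 1.0) # guard zero angles
    B = (Pt @ V2t.T - A * cos_th) / safe         # geodesic direction columns
    feats = []
    for s in np.linspace(0.0, 1.0, steps):
        # Phi(s) has orthonormal columns; s=0 spans the source subspace,
        # s=1 spans the target subspace.
        Phi = A * np.cos(s * theta) + B * np.sin(s * theta)
        feats.append(X @ Phi)
    return np.hstack(feats)

# Illustrative source/target data with a shifted target distribution.
rng = np.random.default_rng(3)
Xs = rng.normal(size=(100, 8))
Xt = rng.normal(size=(100, 8)) + 0.5
Ps, Pt = pca_basis(Xs, 3), pca_basis(Xt, 3)
F = sampled_flow_features(np.vstack([Xs, Xt]), Ps, Pt, steps=5)
```

A classifier trained on `F` sees representations from both ends of the geodesic and the subspaces in between, which is the intuition the kernel formalizes in closed form.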