IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 2, FEBRUARY 2010 131
Image Annotation by Graph-Based Inference With
Integrated Multiple/Single Instance Representations
Jinhui Tang, Member, IEEE, Haojie Li, Guo-Jun Qi, and Tat-Seng Chua
Abstract—In most of the learning-based image annotation approaches, images are represented using multiple-instance (local) or single-instance (global) features. Their performances, however, are mixed as for certain concepts, the single-instance representations of images are more suitable, while for others, the multiple-instance representations are better. Thus this paper explores a unified learning framework that combines the multiple-instance and single-instance representations for image annotation. More specifically, we propose an integrated graph-based semi-supervised learning framework to utilize these two types of representations simultaneously. We further explore three strategies to convert from multiple-instance representation into a single-instance one. Experiments conducted on the COREL image dataset demonstrate the effectiveness and efficiency of the proposed integrated framework and the conversion strategies.

Index Terms—Image annotation, multiple/single instance learning.
I. INTRODUCTION
ACCOMPANIED by the decreased costs for multimedia recording and storage devices, high transmission rates, and improved compression techniques, digital image collections have grown rapidly in recent years. How to index and search for these images effectively and efficiently is an increasingly urgent research issue in the multimedia community. Several content-based search models use image samples as queries, but many users have found that a simple set of query images cannot represent their query demands. Most users prefer to search for images by issuing textual queries such as “find me images of tigers in the grass” [11]. To support this, keywords describing the images are required to retrieve and rank images. Manual annotation is a direct way to obtain these keywords. However, it is labor-intensive and error-prone. Thus automatic annotation of images at the semantic concept level has emerged as an important technique for efficient image search.

In recent years, a wide variety of learning methods have been proposed for automatic image annotation. While a few methods employ purely supervised learning [5] or semi-supervised learning [22] with the single-instance (SI) representations of images, most methods use multiple regions to represent each image, and inference models are learned from the multiple-instance (MI) representations [2], [3], [13]. Which representation is more suitable for detecting the semantics in the images is an important problem. Moreover, the suitable representation also depends on the types of concepts to be detected in the images. For example, while many object-oriented concepts such as “car” and “tiger” are more closely related to regions, other scene-oriented concepts such as “garden” and “beach” may relate more to the entire images.

Manuscript received October 21, 2008; revised September 27, 2009. First published November 24, 2009; current version published January 20, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ajay Divakaran.
J. Tang, H. Li, and T.-S. Chua are with the School of Computing, National University of Singapore, 117590 Singapore (e-mail: lihj@comp.nus.edu.sg).
G.-J. Qi is with the Department of Electrical and Computer Engineering and Beckman Institute of the University of Illinois at Urbana-Champaign, Champaign, IL 61820 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2009.2037373
MI representation models each image as a labeled bag with multiple instances, usually comprising the segmented regions of that image. Labels (or concepts) are attached to the bags while the labels of instances are hidden. The bag label is related to the hidden labels of the instances as follows: the bag is labeled as positive if any of the instances in it is positive; otherwise it is labeled as negative. MI learning [7] is a class of learning algorithms that tackle annotation problems with MI representations. Many approaches have been proposed to tackle the MI learning problem, and some of them are based on the well-known diverse-density measure [13]. In this paper, the diverse-density measure is also used in the three proposed strategies for converting the MI representations of images into SI representations.
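As a minimal illustration of the bag-labeling rule described above, the following Python sketch derives a bag label from instance labels. The function name `bag_label` is hypothetical, and in real MI learning the instance labels are hidden; they are spelled out here only to make the rule concrete.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[bool]) -> bool:
    """Return True (positive bag) if at least one instance is positive."""
    return any(instance_labels)


# An image with one positive region among background regions is a positive bag.
print(bag_label([False, True, False]))
# An image whose regions are all negative is a negative bag.
print(bag_label([False, False]))
```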
Since labeled samples for image annotation typically come from the users during an interactive session, it is important to be able to obtain good results speedily using a small amount of labeled data. Semi-supervised learning [1], which aims to learn from both labeled and unlabeled data under certain assumptions, is promising for building more accurate models than those achievable by purely supervised learning methods. As a major family of semi-supervised learning, the graph-based methods have attracted a lot of attention recently. Many works on this topic are reported in the machine learning literature [28], and some of them have been applied to image and video annotation [16], [18], [22].
Recently, some research efforts have been made to combine MI learning and semi-supervised learning. Rahmani et al. [14] proposed a MI semi-supervised learning method that transforms any MI problem into an input for a graph-based SI semi-supervised learning method, which encodes the MI aspects of the problem simultaneously at both the bag and instance levels but still uses only the instance-level features. In [23], the authors decoupled the inferring and training stages by using random walks and SVM, and converted MI learning to a supervised learning problem. Tang et al. [17] proposed a semi-supervised MI learning method to rank natural scene images according to their typicality degrees. Besides, in [27], Zhou et al. analyzed the relation between semi-supervised learning and MI learning.

Fig. 1. Integrated multiple/single instance learning framework.
To the best of our knowledge, existing learning-based image annotation methods, including supervised MI learning, semi-supervised SI learning and semi-supervised MI learning methods such as those mentioned above, use only one type of representation, and no reported method has combined MI and SI representations in a unified framework. Since the SI and MI representations are complementary and have different strengths, we believe that integrating the two types of representations will significantly improve the annotation performance.
To integrate the MI and SI representations into a unified
framework, we propose a two-stage method. First, we devise
efficient strategies to convert MI to SI representation. Second,
we introduce a multi-graph-based label propagation method to
integrate the two types of representations to infer the labels of
unlabeled images. Here multi-graph-based label propagation is
a semi-supervised learning strategy, which utilizes the labeled
and unlabeled data simultaneously to boost the annotation
performance. It has been experimentally shown to achieve
better performance as compared to purely supervised learning
methods.
We expect the integrated framework to offer the following
three advantages: 1) integrating the MI and SI representations
simultaneously for images will improve the annotation perfor-
mance; 2) using the simplified diverse-density style strategies
to search the prototypes for different concepts, which operate
only on existing instances instead of searching in the whole fea-
ture space, will result in good computational efficiency; and, 3)
multi-graph-based semi-supervised learning is used to integrate
the multiple representations and incorporate the labeled and un-
labeled data simultaneously. In [19], we have proposed a MI
to SI conversion method that finds one prototype instance for each concept and then maps the MI representation of every image into the space spanned by the feature vectors of the selected prototypes. In this paper, we extend the work of [19] by proposing two more effective strategies for converting the MI representation to SI representation. Experiments conducted on the COREL dataset show that the proposed integrated method significantly outperforms the normal MI and SI methods, and that the conversions from MI representation to SI representation are effective.
Apart from the MI representation, another popular strategy for converting local features into a global one is to cluster the local features into a global “bag-of-visual-words” representation [4], [9], [12]. However, compared to key-points, the number of regions in an image is much smaller. If we were to convert the region features into a global bag-of-region-features representation, the resultant feature vector would be very sparse and a lot of the information included in the set of region features would be lost. This would significantly degrade the annotation performance.
The rest of this paper is organized as follows. In Section II, we first present an overview of the integrated framework; Section II-A details the three proposed conversion strategies from MI representation to SI representation, Section II-B presents the multi-graph-based label propagation to combine the multiple representations, and Section II-C describes the construction of the propagation matrix for label propagation. In Section III, we present the experimental results and discussions. The conclusion and future work are given in Section IV.
II. INTEGRATED GRAPH-BASED LABEL PROPAGATION
The proposed integrated graph-based multiple/single in-
stance learning framework is shown in Fig. 1. First, images
are segmented into regions and local features are extracted
from the regions, and thus each image is represented with a
MI representation. Second, global features are also extracted from the original (non-segmented) images, and the features of each image form a SI representation. Third, to integrate the MI and SI representations into a unified framework,
we explore three diverse-density based strategies to convert
the MI representation to another SI representation. Finally, a
multi-graph-based label propagation method is introduced to
integrate the two types of representations to infer the labels of
the unlabeled images.
A. Multiple-Instance to Single-Instance Conversion
The conversion from MI representation to SI representation involves finding prototypes of the instances for the given concepts as bases to form a feature space for mapping. Three strategies for the bases construction are introduced in this subsection and will be compared in the experiments. All these strategies apply the diverse-density measure to find the prototypes.

We denote a certain positive bag for concept $c$ as $B_i^+$ and its $j$th instance as $B_{ij}^+$, where $j = 1, \ldots, n_i^+$ and $n_i^+$ is the number of instances in bag $B_i^+$. Similarly, we use $B_i^-$, $B_{ij}^-$, and $n_i^-$, respectively, to represent a negative bag, its $j$th instance and the number of instances in $B_i^-$. In cases where we do not need to differentiate between the positive and negative bags, we simply use $B_i$ and $n_i$ to denote a bag and the number of its instances. All instances are in a $d$-dimensional low-level feature space $\mathbb{R}^d$. We use $N^+$ and $N^-$ to denote the numbers of positive and negative bags for concept $c$, respectively. For convenience, we also use $\{B_{ij}\}$ ($i = 1, \ldots, N^+ + N^-$, $j = 1, \ldots, n_i$) to denote the set of all instances.
Strategy I: The first strategy is similar to the one presented in MILES [3]. It finds one prototype instance for each concept and maps the MI features of every image into the space spanned by the feature vectors of the selected prototypes. The main difference is that we search for the maximal diverse-density point only among the existing instances of the positive bags rather than in the whole feature space, and thus the computational cost is significantly reduced.

Diverse-density was proposed based on the assumption that there exists an exclusive prototype instance representing each semantic concept. Other individual instances can then be annotated according to the prototype. For each concept $c$, the diverse-density method aims to find a point $x_c^*$ in the feature space $\mathbb{R}^d$ that maximizes the probability that the point is the prototype given the training bags [13]:

$$x_c^* = \arg\max_{x \in \mathbb{R}^d} \prod_{i=1}^{N^+} \Pr(x \mid B_i^+) \prod_{i=1}^{N^-} \Pr(x \mid B_i^-). \tag{1}$$

This formulation needs to search the whole feature space to find the prototype for each concept. The computational cost of such a search process will be very high. To achieve computational efficiency, we operate only on the set of likely positive instances instead of searching in the whole space, so the candidate prototype is restricted to the instances $B_{ij}^+$ in the positive training bags:

$$x_c^* = \arg\max_{x \in \{B_{ij}^+\}} \prod_{i=1}^{N^+} \Pr(x \mid B_i^+) \prod_{i=1}^{N^-} \Pr(x \mid B_i^-) \tag{2}$$

where the probabilities $\Pr(x \mid B_i^+)$ and $\Pr(x \mid B_i^-)$ are estimated using the noise-or model [13]:

$$\Pr(x \mid B_i^+) = 1 - \prod_{j=1}^{n_i^+} \left(1 - \exp\left(-\|B_{ij}^+ - x\|_1 / \sigma\right)\right) \tag{3}$$

$$\Pr(x \mid B_i^-) = \prod_{j=1}^{n_i^-} \left(1 - \exp\left(-\|B_{ij}^- - x\|_1 / \sigma\right)\right) \tag{4}$$

where $\sigma$ is the scaling parameter and the metric $\|\cdot\|_1$ is the $L_1$ distance. We employ the $L_1$ distance instead of the widely-used $L_2$ distance since it has been shown that the $L_1$ distance can better approximate the perceptual difference of visual features [15].

For a given concept set $\mathcal{C} = \{c_1, \ldots, c_K\}$, where $K$ is the number of given concepts, we obtain $K$ representation prototypes $x_{c_1}^*, \ldots, x_{c_K}^*$. Using these prototypes, every bag $B_i$ can be mapped to a SI representation as

$$x_i^p = \left[d(x_{c_1}^*, B_i), \ldots, d(x_{c_K}^*, B_i)\right]^T \tag{5}$$

where $d(x, B_i) = \min_{j} \|B_{ij} - x\|_1$, and $\|\cdot\|_1$ is the $L_1$ distance. This mapping is different from the mapping strategy used in the traditional diverse-density framework, since the traditional one involves scaling parameters in the mapping, for which the optimal choice is hard to make.

Strategy II: In the first strategy described above, we assume that each concept has only one prototype and the MI features are mapped to the same SI feature space for all concepts. However, every concept may actually have more than one prototype, and we should select more prototypes for each concept to form the bases of the mapped SI feature space. Thus the second strategy selects one potential prototype instance for each positive bag, so that $N^+$ prototypes are obtained for each concept $c$. The selection process can be described formally as

$$x_{c,i}^* = \arg\max_{x \in \{B_{ij}^+ :\, j = 1, \ldots, n_i^+\}} \prod_{k=1}^{N^+} \Pr(x \mid B_k^+) \prod_{k=1}^{N^-} \Pr(x \mid B_k^-), \quad i = 1, \ldots, N^+. \tag{6}$$

Using these prototypes to form the bases of a feature space for concept $c$, the second conversion strategy maps every bag $B_i$ to a SI representation as

$$x_i^p = \left[d(x_{c,1}^*, B_i), \ldots, d(x_{c,N^+}^*, B_i)\right]^T. \tag{7}$$
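The two strategies above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy bags, and the value `sigma=1.0` are hypothetical, and the instance probability is assumed to take the form exp(-L1 distance / sigma) inside the noise-or model.

```python
import math
from typing import List, Sequence, Tuple

Instance = Tuple[float, ...]
Bag = Sequence[Instance]


def l1(a: Instance, b: Instance) -> float:
    """L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def dd_score(x: Instance, pos_bags: Sequence[Bag], neg_bags: Sequence[Bag],
             sigma: float = 1.0) -> float:
    """Noise-or diverse density of candidate x (assumed instance probability)."""
    score = 1.0
    for bag in pos_bags:
        miss = 1.0
        for inst in bag:
            miss *= 1.0 - math.exp(-l1(inst, x) / sigma)
        score *= 1.0 - miss  # some instance of the positive bag is close to x
    for bag in neg_bags:
        for inst in bag:
            score *= 1.0 - math.exp(-l1(inst, x) / sigma)  # negatives stay far
    return score


def prototype_strategy1(pos_bags: Sequence[Bag], neg_bags: Sequence[Bag],
                        sigma: float = 1.0) -> Instance:
    """Strategy I: best candidate among all instances of the positive bags."""
    candidates = [inst for bag in pos_bags for inst in bag]
    return max(candidates, key=lambda x: dd_score(x, pos_bags, neg_bags, sigma))


def prototypes_strategy2(pos_bags: Sequence[Bag], neg_bags: Sequence[Bag],
                         sigma: float = 1.0) -> List[Instance]:
    """Strategy II: one prototype selected inside each positive bag."""
    return [max(bag, key=lambda x: dd_score(x, pos_bags, neg_bags, sigma))
            for bag in pos_bags]


def map_bag(bag: Bag, prototypes: Sequence[Instance]) -> List[float]:
    """SI representation: minimal L1 distance from the bag to each prototype."""
    return [min(l1(inst, p) for inst in bag) for p in prototypes]


# Hypothetical toy data: each positive bag has one instance near the origin.
pos_bags = [[(0.0, 0.0), (5.0, 5.0)],
            [(0.0, 0.1), (9.0, 1.0)],
            [(0.1, 0.0), (3.0, 8.0)]]
neg_bags = [[(5.0, 5.2), (9.0, 1.3)]]

proto1 = prototype_strategy1(pos_bags, neg_bags)
protos2 = prototypes_strategy2(pos_bags, neg_bags)
print(proto1)
print(protos2)
print(map_bag(pos_bags[1], protos2))
```

Restricting the candidates to existing instances turns the search into a finite scan whose cost grows with the number of positive instances times the total number of instances, which is the efficiency argument made in the text.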
The differences between the first strategy and the second strategy for each concept $c$ are illustrated in Fig. 2(a) and (b). The first strategy attempts to map every positive bag to a point near the exclusive prototype for concept $c$ in the mapped feature space, while mapping every negative bag to near one of the prototypes for other concepts. The second strategy, on the other hand, attempts to map every positive bag to near one of the prototypes for concept $c$ while mapping the negative bags to points far away from these prototypes.

Strategy III: An alternative way of constructing the bases of the new SI feature space is to use all prototypes from all concepts, since all these prototypes are useful for representing images. However, this will result in the well-known curse of dimensionality. Thus in Strategy III, we construct the bases for each concept $c$ using multiple prototypes for concept $c$ and a single prototype for every other concept. For concept $c$, this strategy selects the prototypes $\{x_{c,1}^*, \ldots, x_{c,N^+}^*\} \cup \{x_{c'}^* : c' \in \mathcal{C},\, c' \neq c\}$ to form the bases. Then each bag $B_i$ can be mapped to an SI representation for concept $c$ as

$$x_i^p = \left[d(x_{c,1}^*, B_i), \ldots, d(x_{c,N^+}^*, B_i),\, d(x_{c'_1}^*, B_i), \ldots, d(x_{c'_{K-1}}^*, B_i)\right]^T \tag{8}$$

where $c'_1, \ldots, c'_{K-1}$ are the concepts other than $c$.

Fig. 2. Illustrations for the differences among the three strategies for conversion from MI to SI representation. Here the green points represent the positive samples for concept $c$ and red points represent the negative samples, green solid points represent the prototypes for concept $c$ and red solid points represent the prototypes for other concepts. (a) Strategy I attempts to map every positive bag to a point near the exclusive prototype for concept $c$ in the mapped feature space, while mapping every negative bag to near one of the prototypes for other concepts. (b) Strategy II attempts to map every positive bag to near one of the prototypes for concept $c$ while mapping the negative bags to points far away from these prototypes. (c) Strategy III attempts to map every positive bag to near one of the prototypes for concept $c$ in the mapped feature space while mapping every negative bag to near one of the exclusive prototypes for the other concepts.

The third strategy for conversion attempts to map every positive bag to near one of the prototypes for concept $c$ in the mapped feature space while mapping every negative bag to near one of the exclusive prototypes for other concepts. The differences between this strategy and the other two strategies are illustrated in Fig. 2.
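To make the basis construction of Strategy III concrete, the sketch below assembles the basis for one concept and maps a bag onto it. It assumes the prototypes have already been found; the function names, the concept names, and the toy prototype coordinates are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[float, ...]


def l1(a: Instance, b: Instance) -> float:
    """L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def strategy3_basis(concept: str,
                    multi_prototypes: Dict[str, List[Instance]],
                    single_prototypes: Dict[str, Instance]) -> List[Instance]:
    """All prototypes of `concept`, plus one prototype for every other concept."""
    basis = list(multi_prototypes[concept])
    basis += [p for other, p in single_prototypes.items() if other != concept]
    return basis


def map_bag(bag: Sequence[Instance], basis: Sequence[Instance]) -> List[float]:
    """Minimal L1 distance from the bag to each basis prototype."""
    return [min(l1(inst, p) for inst in bag) for p in basis]


# Hypothetical prototypes: several for "tiger", one each for the other concepts.
multi = {"tiger": [(0.0, 0.0), (0.0, 1.0)], "beach": [(9.0, 9.0)]}
single = {"tiger": (0.0, 0.0), "beach": (9.0, 9.0), "car": (5.0, 0.0)}

basis = strategy3_basis("tiger", multi, single)
bag = [(0.0, 0.5), (8.0, 9.0)]
print(basis)
print(map_bag(bag, basis))
```

The resulting vector has one entry per prototype of the target concept plus one per remaining concept, so its length stays far below what using every prototype of every concept would give.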
It should be noted that the second and third conversion strate-
gies have different prototypes for each concept, so all samples
are mapped to different SI feature spaces for different concepts.
That is to say, we should calculate different propagation ma-
trices for different concepts, unlike the first strategy that only
needs to calculate one propagation matrix for all concepts. This
makes their computational costs much higher than that of the
first strategy. Thus more attention should be paid to the compu-
tational efficiency problem for converting the MI representation
to SI representation for the second and third strategies.
This idea of using prototypes for feature mapping has also been employed in computer vision. In [10], the authors proposed to embed the object images into a small representative set of prototypes for object matching in different cameras. In contrast to the unknown prototypes of model images in their applications, the proposed approaches utilize the prototypes of unknown regions, where the key issue is how to find the most representative ones.
B. Multi-Graph-Based Label Propagation
In this subsection, we introduce multi-graph-based label propagation, which is used to integrate the two types of representations and incorporate the labeled and unlabeled data simultaneously. Before the discussion, we first introduce some basic notations. Let $\mathcal{X} = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\}$ be a set of $n$ image samples. For each concept, the first $l$ image samples are labeled as $y_L = [y_1, \ldots, y_l]^T$ with $y_i \in \{0, 1\}$ and the remaining image samples are unlabeled. The vector of the predicted labels of all samples is represented as $f$, which can be split into two blocks as $f = [f_L^T \; f_U^T]^T$, where $T$ denotes the matrix transpose. Consider a connected undirected graph $G = (V, E)$ with the vertex set $V = L \cup U$ corresponding to the $n$ image samples, where the vertex set $L = \{1, \ldots, l\}$ contains labeled points and the vertices in set $U = \{l+1, \ldots, n\}$ are unlabeled ones. The edges $E$ are weighted by the pairwise similarity matrix.

Besides the multiple regions’ features, we also extract the global features for each image, which form the SI representation for each image. Thus we have two types of representations for the dataset: $\mathcal{X}^g = \{x_i^g\}$ and $\mathcal{X}^p = \{x_i^p\}$, where $x_i^g$ is the global representation and $x_i^p$ is the prototype representation of sample $x_i$ as defined in (5), (7), or (8). We construct two graphs $G^g$ and $G^p$ that, respectively, correspond to the global representations and the region representations of the image sets. We assume that the two graphs are represented by their respective similarity matrices $W^g$ and $W^p$, with $W_{ij}^g$ and $W_{ij}^p$ representing the pairwise similarities between the $i$th and $j$th image samples.

Then according to the theory of graph-based semi-supervised learning [28], the label inference problem becomes the problem of solving the following minimization problem:

$$f^* = \arg\min_f \left\{ \frac{\alpha_g}{2} \sum_{i,j=1}^{n} W_{ij}^g (f_i - f_j)^2 + \frac{\alpha_p}{2} \sum_{i,j=1}^{n} W_{ij}^p (f_i - f_j)^2 \right\} \quad \text{s.t. } f_L = y_L \tag{9}$$

where $\alpha_g + \alpha_p = 1$ and $\alpha_g, \alpha_p \geq 0$. The first term of the right side of (9) indicates that the labels of nearby samples should not change too much according to the structure of graph $G^g$, while the second term indicates that the labels of nearby samples should not change too much according to the structure of graph $G^p$. The constraint requires that the labels of the annotated samples will not change in the label propagation procedure.
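Representing this optimization problem in the matrix form gives rise to

$$f^* = \arg\min_f \left\{ \alpha_g f^T L^g f + \alpha_p f^T L^p f \right\} \quad \text{s.t. } f_L = y_L \tag{10}$$

where $L^g = D^g - W^g$ and $L^p = D^p - W^p$ are the graph Laplacians of $G^g$ and $G^p$, respectively; $D^g$ and $D^p$ are diagonal matrices with diagonal elements $D_{ii}^g = \sum_j W_{ij}^g$ and $D_{ii}^p = \sum_j W_{ij}^p$.

If we regard $\alpha = (\alpha_g, \alpha_p)$ as a variable and solve the optimization problem with respect to both $f$ and $\alpha$, the solution will be trivial since the solution is: $\alpha_g = 1$, $\alpha_p = 0$ for $f^T L^g f < f^T L^p f$; $\alpha_g = 0$, $\alpha_p = 1$ for $f^T L^g f > f^T L^p f$; and $\alpha$ can be any value for $f^T L^g f = f^T L^p f$. That is to say, only the smoothest graph is reserved. Certainly this is not the optimal solution we want. Wang et al. [26] proposed an EM-style iterative method to solve $f$ and $\alpha$. However, this process made a relaxation that changes $\alpha_g$ and $\alpha_p$ to $\alpha_g^r$ and $\alpha_p^r$. The exponential coefficient $r$ is sensitive to noise and is hard to choose. Meanwhile, since we only have two graphs here, as discussed in both [20] and [26], we can regard $\alpha$ as a parameter and determine its value by cross validation.

Let $W = \alpha_g W^g + \alpha_p W^p$ and $D = \alpha_g D^g + \alpha_p D^p$, then the optimization problem (10) can be transformed to

$$f^* = \arg\min_f f^T (D - W) f \quad \text{s.t. } f_L = y_L. \tag{11}$$

Split the propagation matrix $P = D^{-1} W$ after the $l$th row and $l$th column, we have

$$P = \begin{bmatrix} P_{LL} & P_{LU} \\ P_{UL} & P_{UU} \end{bmatrix}. \tag{12}$$

Then similar to the iterative solution in [16], we can iterate $f_U \leftarrow P_{UU} f_U + P_{UL} y_L$ until convergence to obtain the optimal label vector for unlabeled image samples as

$$f_U^* = (I - P_{UU})^{-1} P_{UL}\, y_L \tag{13}$$

where $I$ is the identity matrix. According to (13), each image sample will be assigned a real-valued score indicating the degree that it belongs to a specific concept.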
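The propagation step can be illustrated with a small pure-Python sketch. It is a sketch under assumptions rather than the authors' code: the two similarity matrices are taken to be symmetric with at least one nonzero entry per row, the combined matrix is row-normalized, and the scores of the unlabeled samples are obtained by fixed-point iteration instead of an explicit matrix inverse. The function names, the 4-sample toy graphs, and the weights `alpha_g` are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combined_propagation_matrix(w_g: Matrix, w_p: Matrix, alpha_g: float) -> Matrix:
    """Row-normalize the weighted sum of two similarity matrices."""
    alpha_p = 1.0 - alpha_g
    n = len(w_g)
    w = [[alpha_g * w_g[i][j] + alpha_p * w_p[i][j] for j in range(n)]
         for i in range(n)]
    # Assumes every row has a positive sum (each sample has a neighbor).
    return [[w_ij / sum(row) for w_ij in row] for row in w]


def propagate(p: Matrix, y_l: Sequence[float], iterations: int = 200) -> List[float]:
    """Iterate f_U <- P_UU f_U + P_UL y_L; the first len(y_l) samples are labeled."""
    l, n = len(y_l), len(p)
    f_u = [0.0] * (n - l)
    for _ in range(iterations):
        f_u = [sum(p[i][j] * y_l[j] for j in range(l)) +
               sum(p[i][l + k] * f_u[k] for k in range(n - l))
               for i in range(l, n)]
    return f_u


# Toy graphs: samples 0 (positive) and 1 (negative) are labeled, 2 and 3 are not.
w_g = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]  # chain 0-2-3-1
w_p = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0]]  # extra edge 0-3
y_l = [1.0, 0.0]

scores_g_only = propagate(combined_propagation_matrix(w_g, w_p, 1.0), y_l)
scores_mixed = propagate(combined_propagation_matrix(w_g, w_p, 0.5), y_l)
print(scores_g_only)
print(scores_mixed)
```

In this toy example, giving weight to the second graph raises the score of sample 3 because that graph links it directly to the positive sample, which is the kind of complementary evidence the integrated framework is meant to exploit.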
C. Construction of the Propagation Matrix
Now that we have introduced the entire framework, the remaining important issue is how to construct the propagation matrices $P^g$ and $P^p$. For simplicity, we only introduce the construction of $P^g$ here, while $P^p$ can be constructed in a similar manner. The most widely used strategy is to calculate the pairwise similarity matrix $W^g$ for a certain concept $c$ first, then normalize this similarity matrix to obtain the propagation matrix $P^g = (D^g)^{-1} W^g$. This similarity has an optimal parameter for every concept $c$; that is to say, using this similarity measure in our framework requires estimating optimal parameters separately for all of the $K$ different concepts. The estimation of parameters can be done by cross validation. However, using cross validation to select the parameters has two problems. First, the computational cost is very high, and second, the parameters determined are biased to the training set. To alleviate these issues, we adopt linear neighborhood propagation (LNP) [24] to calculate the propagation matrix. As there is no such parameter in the linear neighborhood propagation algorithm, it can tackle the aforementioned problems adequately.

Here we briefly introduce the process of calculating the propagation matrix $P^g$ using LNP. Roweis and Saul [25] assumed that each sample can be optimally reconstructed by a linear combination of its neighboring samples, and the optimization objective is to minimize

$$\varepsilon_i = \Big\| x_i - \sum_{j:\, x_j \in \mathcal{N}(x_i)} w_{ij}\, x_j \Big\|^2 \tag{14}$$

where $\mathcal{N}(x_i)$ denotes the neighboring samples of $x_i$, and $w_{ij}$ represents the reconstruction contribution of $x_j$ for $x_i$. Constraining $\sum_{j:\, x_j \in \mathcal{N}(x_i)} w_{ij} = 1$ will result in

$$\varepsilon_i = \sum_{j,k:\, x_j, x_k \in \mathcal{N}(x_i)} w_{ij}\, G_{jk}^i\, w_{ik} \tag{15}$$

where $G_{jk}^i = (x_i - x_j)^T (x_i - x_k)$ is the $(j,k)$th element in the local Gram matrix $G^i$ of $x_i$. Adding another constraint that every reconstruction coefficient should be nonnegative, the optimal reconstruction coefficients can thus be obtained by solving the following $n$ standard quadratic programming problems:

$$\min_{w_{ij}} \sum_{j,k:\, x_j, x_k \in \mathcal{N}(x_i)} w_{ij}\, G_{jk}^i\, w_{ik} \quad \text{s.t. } \sum_{j} w_{ij} = 1,\; w_{ij} \geq 0. \tag{16}$$

LNP further assumes that the sample’s label can also be reconstructed by its neighbors’ labels linearly using the same reconstruction coefficients, so we can use these reconstruction coefficients to construct the propagation matrices $P^g$ and $P^p$ in our framework with

$$P_{ij} = \begin{cases} w_{ij}, & x_j \in \mathcal{N}(x_i) \\ 0, & \text{otherwise.} \end{cases} \tag{17}$$
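The per-sample quadratic program of LNP can be sketched as follows. Note the substitution: instead of a dedicated QP solver, this sketch uses projected gradient descent onto the probability simplex, which handles the sum-to-one and nonnegativity constraints together. The function names, the step-size rule, the iteration count, and the 1-D toy sample are assumptions made for illustration.

```python
from typing import List, Sequence


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # last index that satisfies the condition wins
    return [max(x - theta, 0.0) for x in v]


def lnp_weights(x: Sequence[float], neighbors: Sequence[Sequence[float]],
                iterations: int = 500) -> List[float]:
    """Reconstruction weights of sample x from its neighbors (minimize w^T G w)."""
    k = len(neighbors)
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    # Local Gram matrix: G[j][m] = (x - x_j)^T (x - x_m).
    gram = [[sum(a * b for a, b in zip(diffs[j], diffs[m])) for m in range(k)]
            for j in range(k)]
    trace = sum(gram[j][j] for j in range(k))
    # The gradient 2*G*w is Lipschitz with constant at most 2*trace(G).
    step = 1.0 / (2.0 * trace) if trace > 0 else 1.0
    w = [1.0 / k] * k
    for _ in range(iterations):
        grad = [2.0 * sum(gram[j][m] * w[m] for m in range(k)) for j in range(k)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(k)])
    return w


# Toy 1-D sample at 0 with neighbors at -1 and 3.
weights = lnp_weights([0.0], [[-1.0], [3.0]])
print(weights)
```

Each sample's weights fill one row of the propagation matrix at the columns of its neighbors, with zeros elsewhere, so every row sums to one without any kernel parameter to tune.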