
Copyright © 2005 ACM Multimedia Conference, Singapore (ACM MM).

Semantic Manifold Learning for Image Retrieval

Yen-Yu Lin1,2

Tyng-Luh Liu1

Hwann-Tzong Chen1,2

1Institute of Information Science, Academia Sinica, Taipei 115, Taiwan

2Department of CSIE, National Taiwan University, Taipei 106, Taiwan

{yylin, liutyng, pras}@iis.sinica.edu.tw

ABSTRACT

Learning the user’s semantics for CBIR involves two differ-

ent sources of information: the similarity relations entailed

by the content-based features, and the relevance relations

specified in the feedback. Given that, we propose an aug-

mented relation embedding (ARE) to map the image space

into a semantic manifold that faithfully grasps the user’s

preferences. Besides ARE, we also look into the issues of se-

lecting a good feature set for improving the retrieval perfor-

mance. With these two aspects of efforts we have established

a system that yields far better results than those previously

reported. Overall, our approach can be characterized by

three key properties: 1) The framework uses one relational

graph to describe the similarity relations, and the other two

to encode the relevant/irrelevant relations indicated in the

feedback. 2) With the relational graphs so defined, learning

a semantic manifold can be transformed into solving a con-

strained optimization problem, and is reduced to the ARE

algorithm accounting for both the representation and the

classification points of views. 3) An image representation

based on augmented features is introduced to couple with

the ARE learning. The use of these features is significant in

capturing the semantics concerning different scales of image

regions. We conclude with experimental results and com-

parisons to demonstrate the effectiveness of our method.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information

Search and Retrieval—Relevance feedback

General Terms

Algorithms, Performance, Experimentation, Management

Keywords

Image Retrieval, Manifold Learning, Dimensionality Reduc-

tion, Relevance Feedback, Feature Selection

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission and/or a fee.

MM’05, November 6–12, 2005, Singapore.

Copyright 2005 ACM 1-59593-044-2/05/0011 ...$5.00.

1. INTRODUCTION

A key ingredient of designing successful Content-Based

Image Retrieval (CBIR) systems [4, 16, 31] is how to ef-

fectively transform users’ interactions into information that

could help the underlying retrieval engines to better organize

their image data. Unlike in information retrieval, the
features used in image retrieval are often visually characterized,
and therefore do not directly connect to the (semantic)
concepts implied by users as textual features would. This
semantic gap has been the main challenge to overcome
in CBIR research.

Among the various attempts to deal with the foregoing

difficulty, retrieval techniques based on relevance feedback

[25] are generally considered a feasible and promising approach,
e.g., [9, 10, 12, 13, 23, 24, 28, 29]. Still, methods of
this kind can differ considerably in their retrieval outcomes.
This brings up two important issues that have a
significant bearing on the query accuracy: 1) the choice of

features for representing an image, and 2) the way of cap-

turing the implicit semantic concepts imposed through the

few query examples by a user. In this work, we aim to ad-

dress these two issues by proposing a new manifold-learning

scheme with relevance feedback that draws on useful image

features to achieve significantly better retrieval performance

than that yielded by other existing methods.

1.1 Previous Work

As in other classification problems, feature selection,
when properly done, can substantially enhance the retrieval
performance. The features commonly indexed in CBIR
cover a wide range, such as shape, texture [9, 10, 24], color,
wavelet coefficients [12, 23, 29], and color coherence vectors
[19]. The consideration of these features is intuitive,

and reasonable for discriminating among images of different

categories. They are often implemented as global features to

describe the respective overall statistics for a whole image.

Such a practice may cause poor query results when the re-

gion of interest (ROI) by a user pertains to only a sub-image.

The situation could further deteriorate when the area of an

ROI is relatively small, or a query example has complex

background. Alternatively, there are descriptors that are

suitable for encoding the local properties of image patches.

The SIFT algorithm [15], invented for object recognition and

now widely used in vision research, e.g., [8, 14], is one such

example, and should be useful in providing additional query

cues for a retrieval system.

To take account of relevance feedback, several researchers

have explored supervised learning. For example, Tong and


Chang [29] propose SVM_Active for learning a decision boundary
by iteratively adding the most informative (near the

boundary) samples as training data. Hoi and Lyu [12] de-

velop a soft-label SVM by taking the feedback confidence

into consideration in learning the decision boundary. In [28],

Tieu and Viola have used AdaBoost to establish a classifier

for retrieval, by selecting discriminant features from a very
large candidate pool. Despite these efforts, we note that

though SVMs and boosting are effective for classification,

the decision boundaries derived by the two schemes would

be unstable when the feedback contains only a few image

examples for the on-line training. Hence further techniques

to address this difficulty are required.

Another possibility is to use the feedback information for

adjusting a query vector. Specifically, a query vector can

be (dimension-wise) re-weighted [13, 24], moved [13], or ex-

panded [20] to account for users’ feedback. Rui et al. [24]

propose to iteratively adjust the component weights of a

query vector to favor relevant dimensions. Alternatively, in

the work of Ishikawa et al. [13], a query can also be modified

by considering both the locations and the relevance degree

of positive examples. In [20], Porkaew et al. apply cluster-

ing to select relevant examples, and then add them into the

query representation at each feedback iteration.

The latest trend in CBIR research has somewhat
shifted to recovering the intrinsic structure of a proper image
space of reduced dimensionality. Instead of working
with the conventional Euclidean space, the main theme is
to assume that the images (labeled and unlabeled) spread as a
manifold, and the task is to learn the underlying structure.
Consequently, a similarity measure can be conveniently computed
on the learned manifold. He et al. [10] use geodesic
distances to approximate the distances between image pairs
along the manifold, and apply Laplacian eigenmaps [2] to
preserve such distances. The main drawback is that the
mapping is defined only on the set of training data, and thus
needs additional mechanisms, such as radial basis function
networks, to handle test data. In a related work [9], an incremental
learning scheme based on locality preserving projections
(LPP) [11] has been proposed for semantic relation
embedding. Although the mapping derived by LPP is valid
for the entire image space, the mapping itself is limited to a
linear projection. Furthermore, in both works [9, 10], only
one relational graph is used such that the local geometry
and feedback relations are not properly represented. As a
result, the two schemes may not fully utilize the feedback
information in learning the user’s semantics.

1.2 Our Approach

Designing an efficient scheme to understand the user’s

preferences is a nontrivial task in CBIR. We propose an

approach that learns a semantic manifold by taking account

of the multiple aspects of relations among images and the

feedback information. In our framework, a similarity rela-

tional graph is constructed by exploring the neighborhood

of each image, and two feedback relational graphs are cre-

ated to depict the relevant and irrelevant relations in the

feedback. While the similarity property is used as a con-

straint enforcing the preservation of the local geometry, we

make use of the relevant and irrelevant feedback informa-

tion in a discriminant manner [5], i.e., gathering together

relevant pairs and keeping away irrelevant ones after the

embedding. In other words, not only the class of labeled

images but also the intrinsic structure of the unlabeled data

are considered in learning the semantic manifold. We realize

these crucial concepts encoded in the graphs by formulating

a constrained optimization problem, and then by solving an

equivalent generalized eigenvalue problem. As for indexing

images for retrieval, global statistics describe the properties

of the whole image and have proven effective in CBIR,

while the local features characterize images by their salient

and distinctive regions and are often invariant to certain

transformations. Motivated by these observations, we intro-

duce a new image representation to embrace the advantages

of the two types of features, and to further improve the re-

trieval precision by our method.

2. SEMANTIC MANIFOLD LEARNING

The need for dimensionality reduction in analyzing high-dimensional
data is unavoidable. Manifold learning is one
such technique, which aims to find a constructive way to
embed the data from a high-dimensional space into a low-dimensional
manifold. Take, for example, the three important
works, Isomap [27], LLE [21], and Laplacian eigenmaps

[2]. In these methods dimensionality reduction is carried

out nonlinearly by investigating the local geometry entailed

by the data. Still they all lack an explicit mapping func-

tion defined for the entire space, i.e., they cannot directly

handle new test data. Bengio et al. [3] have subsequently

proposed a new scheme to fix the shortcoming via learning

kernel eigenfunctions. Their method can achieve good re-

sults, but is too computationally expensive. The LPP by

He and Niyogi [11] also shares considerable similarity with

Laplacian eigenmaps, except that it has a linear mapping

function learned and defined over the whole input space.

All the works discussed above learn data manifolds in an

unsupervised manner. While these algorithms are appropri-

ate for data representation and visualization, they do not

make the most of the labeled relevance feedback in CBIR.

We instead propose a new framework for learning a seman-

tic manifold that best explains the user feedback, comprising

only a few labeled examples. Notice that since a structure

like this is largely imposed by a user, it implies that the

same image database may reside in very different semantic

manifolds due to users of diverse preferences.

2.1 Augmented Relation Embedding

To learn a semantic manifold for CBIR, we work on two

different sources of information: the similarity relations given

by images in the database, and the relevance relations indi-

cated by examples in the feedback. Since the user-provided

relevance information can be considered as augmented re-

lations to the data, we choose to use the term, augmented

relation embedding (ARE), to emphasize this property in our

manifold-learning algorithm.

Let $\mathcal{X} \subset \mathbb{R}^n$ be an n-dimensional image feature space, and
$\rho : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be some distance function (to be discussed in
the next section). A database with m images can then be
represented by a data matrix $X = [x_1\; x_2\; \cdots\; x_m] \in \mathbb{R}^{n \times m}$
where $x_i \in \mathcal{X}$ for $i = 1, \ldots, m$. For the relevance feedback,
we use $F^+$ to denote the set of images returned by the system
that are relevant to a query, and $F^-$ to include the remaining
irrelevant images. To characterize the process of ARE,
we use three relational graphs (undirected and complete)
whose vertices are over the image samples, and a generalized
eigenvalue problem, detailed in the following steps.


1. Construct the similarity relational graph, $G^S$. Let the
matrix that records the weights over the edges of $G^S$
be $W^S \in \mathbb{R}^{m \times m}$, defined by

$$W^S_{ij} = \begin{cases} e^{-\rho^2(x_i, x_j)/t}, & \text{if } x_i \in k\text{-NN of } x_j \text{ or } x_j \in k\text{-NN of } x_i, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

where t is some positive scalar, and k-NN is the abbreviation
for the k nearest neighbors.

2. Construct the feedback relational graphs, $G^P$ and $G^N$.
The two relational graphs encode pairwise relations
in the feedback. In particular, $G^P$ is for the positively
similar relations, and $G^N$ for the dissimilar ones.
Their respective weight matrices, $W^P, W^N \in \mathbb{R}^{m \times m}$,
can be defined as follows.

$$W^P_{ij} = \begin{cases} 1, & \text{if } x_i \in F^+ \wedge x_j \in F^+, \\ 0, & \text{otherwise;} \end{cases} \tag{2}$$

$$W^N_{ij} = \begin{cases} 1, & \text{if } x_i \in F^+ \wedge x_j \in F^- \text{ or } x_i \in F^- \wedge x_j \in F^+, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

3. Embed image space $\mathcal{X}$ into an $\ell$-dimensional semantic
manifold. For $\ell \ll n$, find the generalized eigenvectors
$v_1, v_2, \ldots, v_\ell$ corresponding to the $\ell$ largest eigenvalues of

$$X [L^N - \gamma L^P] X^T v = \lambda X L^S X^T v, \tag{4}$$

where $L^S = D^S - W^S$, and $D^S$ is a diagonal matrix
with $D^S_{ii} = \sum_j W^S_{ij}$. Analogously, $L^P$, $D^P$, $L^N$, and
$D^N$ can be defined in a similar way. Notice that the
scalar $\gamma$ is added to take care of the possibility of unbalanced
feedback. In practice, we have set

$$\gamma \propto \sum_{i,j} W^N_{ij} \Big/ \sum_{i,j} W^P_{ij}. \tag{5}$$

The parameter $\gamma$ weighs the importance tradeoff between
the positively-similar pairs and the dissimilar
ones in the feedback. ($\gamma \ge 1$ is to emphasize the positive
information.) Finally, after solving (4) and letting
$V = [v_1\; v_2\; \cdots\; v_\ell]$, we have, for each image $x_i$ in the
database, the embedded feature vector $z_i = V^T x_i$.

4. Perform retrieval over the semantic manifold. Once
the embedding of the image database is completed,
given any query image $\bar{x} \in \mathcal{X}$, we map it onto
the manifold by $\bar{z} = V^T \bar{x}$. Find the nearest neighbors
of $\bar{z}$ using the Euclidean distance, and those images
corresponding to the nearest neighbors will be the top-ranking
returns for the query.
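The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, ρ is taken to be the Euclidean distance, the proportionality constant in (5) is set to 1, and a small ridge term `reg` is added to the right-hand matrix of (4) so that it can be Cholesky-factorized.

```python
import numpy as np


def similarity_graph(X, k=5, t=1.0):
    """W^S of eq. (1); X is n x m (one image per column), rho assumed Euclidean."""
    sq = np.sum(X * X, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    m = d2.shape[0]
    masked = d2.copy()
    np.fill_diagonal(masked, np.inf)           # a point is not its own neighbor
    nn = np.argsort(masked, axis=1)[:, :k]     # k nearest neighbors of each point
    mask = np.zeros((m, m), dtype=bool)
    mask[np.repeat(np.arange(m), k), nn.ravel()] = True
    mask = mask | mask.T                       # the "or" condition in eq. (1)
    return np.where(mask, np.exp(-d2 / t), 0.0)


def feedback_graphs(m, pos, neg):
    """W^P and W^N of eqs. (2)-(3) from index lists for F+ and F-."""
    p = np.zeros(m)
    p[list(pos)] = 1.0
    q = np.zeros(m)
    q[list(neg)] = 1.0
    return np.outer(p, p), np.outer(p, q) + np.outer(q, p)


def laplacian(W):
    """L = D - W with D the diagonal matrix of row sums."""
    return np.diag(W.sum(axis=1)) - W


def are_fit(X, pos, neg, dim=2, k=5, t=1.0, reg=1e-6):
    """Top-`dim` generalized eigenvectors of eq. (4), returned as columns of V."""
    m = X.shape[1]
    WS = similarity_graph(X, k, t)
    WP, WN = feedback_graphs(m, pos, neg)
    gamma = WN.sum() / WP.sum()                # eq. (5), constant assumed to be 1
    A = X @ (laplacian(WN) - gamma * laplacian(WP)) @ X.T
    B = X @ laplacian(WS) @ X.T
    B = B + reg * np.trace(B) / B.shape[0] * np.eye(B.shape[0])  # assumed ridge
    # Reduce A v = lambda B v to a symmetric standard problem using B = C C^T.
    C = np.linalg.cholesky(B)
    Ci = np.linalg.inv(C)
    M = Ci @ A @ Ci.T
    w, U = np.linalg.eigh((M + M.T) / 2.0)     # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:dim]
    return Ci.T @ U[:, order]


def retrieve(V, X, query, top=5):
    """Indices of the `top` database images nearest to the query after embedding."""
    Z = V.T @ X
    z = V.T @ query
    return np.argsort(np.linalg.norm(Z - z[:, None], axis=0))[:top]
```

In use, `are_fit` would be re-run after each feedback round with the updated index lists for F+ and F−, and `retrieve` would then rank the database against the embedded query.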

We now explain why the steps of ARE can learn a use-

ful semantic manifold for retrieval, and how the generalized

eigenvalue problem in (4) guarantees a data embedding that

effectively reflects the augmented information. We begin by

considering the following optimization problem:

$$\begin{aligned} \text{Maximize} \quad & J(V) = \sum_{i,j} \|V^T x_i - V^T x_j\|^2 \, (W^N_{ij} - \gamma W^P_{ij}) \\ \text{subject to} \quad & \sum_{i,j} \|V^T x_i - V^T x_j\|^2 \, W^S_{ij} = 1. \end{aligned} \tag{6}$$

The implication of the above formulation is explicit and
reasonable. While the intrinsic structure of the image data
is maintained via enforcing the constraint, a feasible V to
(6) would project data by reducing the Euclidean distances
between each positively-similar pair, and enlarging those
between every dissimilar pair. Thus a manifold-learning
scheme based on (6) connects the user semantics with the
underlying image data in a proper space of reduced dimensionality.
We next describe a theorem to complete our discussion
on the justification of ARE.

Theorem 1. The columns of the optimal $V \in \mathbb{R}^{n \times \ell}$ to
the constrained optimization problem (6) are the generalized
eigenvectors corresponding to the $\ell$ largest eigenvalues of (4).

Proof. Using the notations in (4) and (6), we have

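$$\begin{aligned} J(V) &= \sum_{i,j} \|V^T x_i - V^T x_j\|^2 \, (W^N_{ij} - \gamma W^P_{ij}) \\ &= \sum_{i,j} \mathrm{tr}\{(V^T x_i - V^T x_j)(V^T x_i - V^T x_j)^T\} \, (W^N_{ij} - \gamma W^P_{ij}) \\ &= \sum_{i,j} \mathrm{tr}\{V^T (x_i - x_j)(x_i - x_j)^T V\} \, (W^N_{ij} - \gamma W^P_{ij}). \end{aligned}$$

Since the trace operator is linear, and $(W^N_{ij} - \gamma W^P_{ij})$ is a
scalar, all the terms can be moved inside the trace.

$$\begin{aligned} J(V) &= \mathrm{tr}\Big\{V^T \sum_{i,j} (x_i - x_j)(W^N_{ij} - \gamma W^P_{ij})(x_i - x_j)^T V\Big\} \\ &= 2\,\mathrm{tr}\{V^T [\,(X D^N X^T - X W^N X^T) - \gamma (X D^P X^T - X W^P X^T)\,] V\} \\ &= 2\,\mathrm{tr}\{V^T X (L^N - \gamma L^P) X^T V\}. \end{aligned} \tag{7}$$

After applying a similar analysis to the constraint term,
equation (6) can be reformulated as

$$\begin{aligned} \text{Maximize} \quad & J(V) = 2\,\mathrm{tr}\{V^T X (L^N - \gamma L^P) X^T V\} \\ \text{subject to} \quad & 2\,\mathrm{tr}\{V^T X L^S X^T V\} = 1. \end{aligned} \tag{8}$$

Finally, apply the Lagrange multipliers to (8), and set the
derivative with respect to V to zero. It follows that the
columns of the optimal V are generalized eigenvectors corresponding
to the $\ell$ largest eigenvalues in (4).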
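The key step of the proof, rewriting a weighted sum of pairwise squared distances as a trace involving a graph Laplacian, can be checked numerically. The sketch below uses random data and an arbitrary symmetric weight matrix W (all sizes and names are illustrative assumptions) and compares the double sum against 2 tr{V^T X (D − W) X^T V}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, ell = 4, 7, 2                     # feature dim, number of samples, embedding dim
X = rng.normal(size=(n, m))             # columns are the samples x_i
V = rng.normal(size=(n, ell))           # an arbitrary projection matrix
W = rng.random(size=(m, m))
W = (W + W.T) / 2.0                     # the identity requires a symmetric W

# Left-hand side: sum_{i,j} ||V^T x_i - V^T x_j||^2 W_ij, computed pair by pair.
Z = V.T @ X
pairwise = sum(
    W[i, j] * np.sum((Z[:, i] - Z[:, j]) ** 2)
    for i in range(m)
    for j in range(m)
)

# Right-hand side: 2 tr{V^T X L X^T V} with L = D - W.
L = np.diag(W.sum(axis=1)) - W
trace_form = 2.0 * np.trace(V.T @ X @ L @ X.T @ V)

print(np.isclose(pairwise, trace_form))
```

Because the identity is linear in W, it carries over to the combination W^N − γW^P in the objective and to W^S in the constraint.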

It should be clear now that ARE is a semi-supervised

learning technique for dimensionality reduction. The aug-

mented information used in learning a semantic manifold is

nicely encoded in the three relational graphs, $G^S$, $G^P$, and
$G^N$. Like other manifold-learning methods, the proposed
ARE can preserve local geometry by referencing the neighborhood
similarity relations in $G^S$. On the other hand, by
exploring the relevance feedback information in $G^P$ and $G^N$,

ARE automatically captures the intrinsic semantics behind

the user interactions with a retrieval system.

ARE–Initialization. In general the query-by-example of

CBIR starts only with some query image provided by a user.

That is, at the inception of manifold learning there is
no feedback information. Consequently, equation (4) is
not well defined, and it makes sense to start ARE in an
unsupervised manner, i.e., by solving

$$X D^S X^T v = \lambda X L^S X^T v. \tag{9}$$

Following the definitions of $D^S$ and $L^S$, it can be easily
verified that with (9), ARE initially behaves like LPP. It
then switches to a semi-supervised scheme for learning a
semantic manifold when (4) becomes valid.
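As a rough illustration of the initialization, the following sketch solves (9) for the largest eigenvalues given a precomputed similarity weight matrix. The function name `are_init` and the ridge term `reg` are assumptions made for the example, not part of the original method.

```python
import numpy as np


def are_init(X, WS, dim=2, reg=1e-8):
    """Solve X D^S X^T v = lambda X L^S X^T v (eq. (9)) for the largest lambdas.

    X is n x m (one image per column); WS is a symmetric, non-negative
    m x m weight matrix. `reg` is an assumed ridge term that keeps the
    right-hand matrix positive definite.
    """
    D = np.diag(WS.sum(axis=1))
    L = D - WS
    A = X @ D @ X.T
    B = X @ L @ X.T
    B = B + reg * np.trace(B) / B.shape[0] * np.eye(B.shape[0])
    C = np.linalg.cholesky(B)                  # B = C C^T
    Ci = np.linalg.inv(C)
    M = Ci @ A @ Ci.T                          # symmetric standard problem
    w, U = np.linalg.eigh((M + M.T) / 2.0)     # ascending eigenvalues
    order = np.argsort(w)[::-1][:dim]
    return Ci.T @ U[:, order], w[order]
```

LPP is usually stated as the minimization problem X L X^T a = λ' X D X^T a; taking the largest λ in (9) selects the same directions, since λ = 1/λ'.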


2.2 Kernel ARE

The query/classification efficiency induced by ARE can

sometimes be further improved, especially when the data in

the original space are highly nonlinearly distributed. Mo-

tivated by the success of support vector machines (SVMs)

[30], we describe a similar strategy to kernelize the linear

ARE. The idea is to nonlinearly map the image data to a

high-dimensional feature space, and then perform ARE to

learn a semantic manifold in that space. Such a generaliza-

tion is meaningful in the sense that a kernelized ARE would

generally achieve better accuracy, and relax the restriction

of ARE being only a linear embedding scheme.

Let $\Phi : \mathbb{R}^n \to Y$ be a nonlinear mapping. Then the image
data matrix in the feature space $Y$ can be denoted as
$\Phi(X) \equiv [\Phi(x_1)\; \Phi(x_2)\; \cdots\; \Phi(x_m)]$. Since the analysis mostly
involves inner products between pairs of mapped data, it is
convenient to work with Mercer kernels instead of worrying
about the exact form of $\Phi$. Specifically, we have used the
RBF kernel, $k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = \exp(-\|x_i - x_j\|^2 / c)$,
for the experimental results presented in this work.

Consider now a kernel-based optimization problem, the
same as (8) except that X is replaced by $\Phi(X)$. Its generalized
eigenvalue problem is then given by

$$\Phi(X) [L^N - \gamma L^P] \Phi(X)^T v = \lambda \Phi(X) L^S \Phi(X)^T v. \tag{10}$$

To establish the kernel ARE, we note that the eigenvectors
of (10) are in the span of $\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_m)$. In
particular, let the eigenvector $v_i$ of (10) be the ith column
of V, and assume the following expansion

$$v_i = \sum_{j=1}^{m} \alpha_{ij} \Phi(x_j) = \Phi(X) \alpha_i, \tag{11}$$

where $\alpha_i = [\alpha_{i1}\; \alpha_{i2}\; \cdots\; \alpha_{im}]^T$. To this end, it is convenient
to define another matrix by $A = [\alpha_1\; \alpha_2\; \cdots\; \alpha_\ell]$, and denote
the kernel matrix as $K_{ij} = k(x_i, x_j)$. Furthermore, it can be
shown that $V^T \Phi(X) = A^T K$ by element-wise comparison:

$$(V^T \Phi(X))_{ij} = v_i^T \Phi(x_j) = (A^T K)_{ij} \tag{12}$$

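for $1 \le i \le \ell$ and $1 \le j \le m$. Therefore the kernelized
optimization problem of ARE can be stated as

$$\begin{aligned} \text{Maximize} \quad & U(A) = 2\,\mathrm{tr}\{A^T K (L^N - \gamma L^P) K A\} \\ \text{subject to} \quad & 2\,\mathrm{tr}\{A^T K L^S K A\} = 1. \end{aligned} \tag{13}$$

The optimal A to (13) would comprise $\alpha_1, \alpha_2, \ldots, \alpha_\ell$
that are the generalized eigenvectors corresponding to the $\ell$
largest eigenvalues of

$$K [L^N - \gamma L^P] K \alpha = \lambda K L^S K \alpha. \tag{14}$$

Analogously, given a query image $\bar{x}$ for retrieval, the kernel
ARE would map the data by $\bar{z} = V^T \Phi(\bar{x})$ with the ith
coordinate derived by $\bar{z}_i = v_i^T \Phi(\bar{x}) = \sum_{j=1}^{m} \alpha_{ij} k(x_j, \bar{x})$.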
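A minimal sketch of kernel ARE, assuming the three graph Laplacians and γ have already been computed as in Section 2.1. The function names are hypothetical, and the ridge term `reg` is an assumption: K L^S K is always singular (the Laplacian annihilates the constant vector), so some regularization is needed before the problem can be solved this way, and its size influences the result.

```python
import numpy as np


def rbf_kernel(A, B, c=1.0):
    """k(a, b) = exp(-||a - b||^2 / c) for all column pairs of A and B."""
    d2 = (np.sum(A * A, axis=0)[:, None] + np.sum(B * B, axis=0)[None, :]
          - 2.0 * (A.T @ B))
    return np.exp(-np.maximum(d2, 0.0) / c)


def kernel_are_fit(K, LS, LP, LN, gamma, dim=2, reg=1e-6):
    """Top-`dim` solutions of K (L^N - gamma L^P) K a = lambda K L^S K a, eq. (14)."""
    A = K @ (LN - gamma * LP) @ K
    B = K @ LS @ K
    B = B + reg * np.trace(B) / B.shape[0] * np.eye(B.shape[0])  # assumed ridge
    C = np.linalg.cholesky(B)                  # B = C C^T
    Ci = np.linalg.inv(C)
    M = Ci @ A @ Ci.T
    w, U = np.linalg.eigh((M + M.T) / 2.0)     # ascending eigenvalues
    order = np.argsort(w)[::-1][:dim]
    return Ci.T @ U[:, order]                  # columns are alpha_1, ..., alpha_ell


def kernel_are_embed(alpha, X, query, c=1.0):
    """Coordinates z_i = sum_j alpha_ij k(x_j, query) of a query image."""
    kq = rbf_kernel(X, query.reshape(-1, 1), c)[:, 0]
    return alpha.T @ kq
```

Embedding a database image with `kernel_are_embed` reproduces the corresponding column of A^T K, which is the relation stated in (12).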

3. FEATURES FOR IMAGE RETRIEVAL

Selecting good features is as important as designing an ef-

fective learning algorithm for classification problems. In our

case, we intend to choose features that are likely to grasp

the user’s preferences, and general enough for accommodat-

ing most retrieval systems. While there is no particular way

to categorize image features, we shall divide them into two

groups, global and local features. Bear in mind that the

main distinction between the two categories of features is

not on how they are computed, but on what image scale a

feature is set to characterize. We detail in what follows both

the global and local features used in our experiments, in-

cluding their advantages and disadvantages. Then a scheme

integrating the two categories is proposed to form augmented

features for manifold learning with ARE.

3.1 Global Features for CBIR

As we have emphasized, those features used to describe

properties concerning a whole image are classified as global.

Specifically, in our implementation, we have investigated

three types of global features for CBIR.

• Color. Features related to color are widely adopted

for their simplicity and good performance. We test

three kinds of color features: 1) After quantizing the

HSV color space, a 64-bin color histogram is evaluated;

2) The first three moments are accordingly extracted

from the H, S, and V channels; and 3) Due to the

lack of spatial information in the first two, we also

add a 128-dimensional color coherence vector (CCV)

[19] into our global features, to take account of each

color’s coherence.

• Texture. Roughly speaking, texture features refer to
the image patterns that display homogeneity. They
thus play an important role in image indexing of CBIR.
In our system, we have considered two kinds of Tamura
features, coarseness and directionality. The former is
to measure the distribution about the sizes of image
regions with which each pixel is associated, and the latter
depicts the information about the magnitudes and
the directions of pixel-wise gradients. Similarly, we
represent these two features in the form of histograms
with 10 and 8 bins, respectively.

• Wavelet. Frequency is another aspect of information

useful for characterizing images. Among the various

techniques, wavelets are deemed to be a powerful tool

for capturing both spatial and frequency properties.

We apply discrete wavelet transform (DWT) to de-

rive a 3-level image decomposition, and then calculate

the first two moments of coefficients from the 9 high-

frequency sub-bands, i.e., the High/Low, Low/High,

and High/High bands in all the three levels.
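The dimensionalities of the global descriptors listed above can be tallied as a quick check. The per-descriptor breakdown below is inferred from the text (for instance, three moments for each of the three HSV channels, and two moments for each of the nine sub-bands); the dictionary keys are just illustrative labels.

```python
# Dimensionality of each global descriptor, as read from the descriptions above.
global_dims = {
    "HSV color histogram": 64,
    "HSV color moments": 3 * 3,            # first three moments of H, S, and V
    "color coherence vector": 128,
    "Tamura coarseness histogram": 10,
    "Tamura directionality histogram": 8,
    "wavelet moments": 2 * 9,              # first two moments of 9 sub-bands
}

total_global_dim = sum(global_dims.values())
print(total_global_dim)  # 237
```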

Having normalized each dimension, we can now represent

an image with a 237-dimensional feature vector, computed

from the foregoing global descriptors. However, despite the

many advantages mentioned above, using global features for

CBIR exclusively is not sufficient for ensuring good retrieval

performance. In particular, their effectiveness for CBIR

could suffer from the following situations.

• When the semantic concepts implied by a user pertain

only to sub-images, it is possible the computations of

global features may include too many irrelevant fac-

tors. As a result, the precision and recall would be-

come worse in that the information used in deriving

global features does not fairly reflect the feedback.

• Even for the same semantic concept, the corresponding

appearances may differ from image to image, such as

poses, scales, pictured viewpoints, or locations of ROI

in images. Most global features cannot account for
these variations.


3.2 Local Features for CBIR

Local features are introduced in our implementation to

describe properties of size-varying regions associated with

interest points in an image. It takes two steps to carry out

the computations of local features. First, the detection of

interest points in a given image is done by using Lowe’s

difference-of-Gaussian (DoG) detector [15], which has been

shown to be robust and invariant to scale and rotation. The

DoG detector identifies potential interest points by searching
for local extrema in scale space. After eliminating
the unstable ones, the dominant orientation
and detected scale are assigned to each of the remaining interest
points. Second, we calculate local features from each

salient region, identified by the scale, location, and orien-

tation of a detected interest point. Motivated by the good

results reported in [18], we consider stacking three different

kinds of local features to characterize salient regions.

• Generalized RGB Color Moments. The formulation is

$$M^{abc}_{pq} = \iint_{\Omega} x^p y^q \, [R(x,y)]^a \, [G(x,y)]^b \, [B(x,y)]^c \, dx \, dy,$$

with degree a + b + c and order p + q. Let (x,y) denote
the relative coordinates with respect to an interest
point and its orientation, and Ω be the set of (x,y)
within a local region. Setting the degree as 1 or 2, and
the order up to 1, we get a 27-dimensional vector.

• RGB Color histogram. A color histogram of 64 bins

is evaluated to capture the RGB color distribution in

the local region of each interest point.

• SIFT (Scale Invariant Feature Transform) descriptor.

As shown in, e.g., [8, 14, 15, 17], SIFT descriptors are

quite effective in representing local image properties.

We have used a 128-dimensional SIFT descriptor.
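To make the counts in the list above concrete, the sketch below enumerates the (a, b, c) and (p, q) index combinations that give the 27 generalized color moments, and adds a discrete approximation of a single moment over a rectangular patch. The helper `generalized_color_moment` is hypothetical: it centers the coordinates on the patch and ignores the orientation normalization described in the text. The stacked dimensionality of 219 is inferred from the three descriptor sizes and is not stated in the original.

```python
from itertools import product

import numpy as np

# Degree a + b + c in {1, 2}; order p + q at most 1.
degrees = [(a, b, c) for a, b, c in product(range(3), repeat=3) if 1 <= a + b + c <= 2]
orders = [(p, q) for p, q in product(range(2), repeat=2) if p + q <= 1]
num_moments = len(degrees) * len(orders)

# Stacking moments, the 64-bin RGB histogram, and the 128-dim SIFT descriptor.
local_dim = num_moments + 64 + 128


def generalized_color_moment(patch, p, q, a, b, c):
    """Discrete sum approximating M^{abc}_{pq} over an h x w x 3 RGB patch."""
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = xs - (w - 1) / 2.0                 # coordinates relative to the patch center
    y = ys - (h - 1) / 2.0
    R, G, B = patch[..., 0], patch[..., 1], patch[..., 2]
    return float(np.sum(x ** p * y ** q * R ** a * G ** b * B ** c))


print(num_moments, local_dim)
```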

With the above descriptors and proper normalizations,

an image would yield a local representation comprising a

bag of d local-feature vectors, denoted as Γ = {l1,l2,...,ld}

where d is the number of interest points detected, and li

is the local-feature vector for the ith interest point. Since

the value of d is image-dependent, the dimensionality of the

local representation Γ is not fixed. Thus, for the sake of uni-

formity that facilitates a similarity measure, we apply the

vector quantization technique [18, 26] to cluster local-feature

vectors resulting from all images into k clusters. In this way,

the local representations caused by different numbers of in-

terest points can all be transformed into k-dimensional vec-

tors, where for a given image the value of the ith dimension

now records the number of local-feature vectors in Γ being

included in the ith cluster.
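The quantization step can be sketched as follows. The text only says that vector quantization is applied, so the plain k-means used here is an assumption, as are the function names; the point is merely that bags of different sizes all map to histograms of the same length k.

```python
import numpy as np


def assign(points, centers):
    """Index of the nearest center for each row of `points`."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations; `points` is a float array with one vector per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:           # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return centers


def bag_histogram(bag, centers):
    """k-dimensional count vector: entry i is the number of vectors in cluster i."""
    return np.bincount(assign(bag, centers), minlength=len(centers))
```

The centers would be learned once from the local-feature vectors of all images, after which each image's bag Γ is converted with `bag_histogram`.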

A proper setting for the value of k is indeed a tradeoff
between the degree of precise image representation and the
ease of similarity measurement. With a larger k, the differences
between two images are more faithfully characterized;
however, it becomes inefficient/inappropriate to correlate
two images using the bin-by-bin similarity measures,
e.g., $L_2$ distance and Kullback-Leibler divergence. We instead
consider the Earth Mover’s Distance (EMD) proposed
by Rubner et al. [22], for its nice property in addressing
the cross-bin dissimilarity. The k-dimensional local representation
is therefore converted to the signature form used
in EMD, where each cluster is represented by its center and
the weight (the number of elements in the cluster divided
by the total number of elements). Furthermore, the cost between
each cluster pair is defined by their geodesic distance,
which can be efficiently computed by Floyd’s algorithm [6].
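Two ingredients of this paragraph are easy to sketch: the conversion of a cluster histogram into an EMD signature, and Floyd's all-pairs shortest-path algorithm for geodesic costs. The text does not say which graph over the cluster centers the geodesic distances are measured on, so the edge-weight matrix passed to `floyd_warshall` is left as an input, and dropping empty clusters from the signature is an assumption. The EMD optimization itself is not shown.

```python
import numpy as np


def floyd_warshall(W):
    """All-pairs shortest path lengths.

    W[i, j] is the edge length between nodes i and j, or np.inf if
    there is no edge. Edge lengths are assumed non-negative.
    """
    D = np.array(W, dtype=float)
    np.fill_diagonal(D, 0.0)
    for k in range(D.shape[0]):
        # Allow paths that pass through node k.
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D


def to_signature(hist, centers):
    """EMD signature: cluster centers with normalized weights.

    Empty clusters are dropped (an assumption); the weights sum to 1.
    """
    hist = np.asarray(hist, dtype=float)
    keep = hist > 0
    return centers[keep], hist[keep] / hist.sum()
```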

To justify the use of EMD with the local representation,
we conduct a simple but constructive test by excluding the
feedback information and the use of ARE. We begin by
preparing a 30-category image set in which each category
has 100 images. Those images in the same category are
considered relevant, and otherwise, irrelevant. The assumption
serves as the ground truth. In the testing of each
image, we find its 15-NN (not including the query image)
with some similarity measure, and calculate its accuracy.
Then the accuracy of each category can be calculated by
averaging. The effectiveness of EMD is compared with
that of three other distance measures, including the
$L_2$ distance, the negative Bhattacharyya coefficient (BHC),
and the dot product (cosine of angle) coupled with the term
frequency–inverse document frequency (TF-IDF) weighting
strategy suggested in [26]. Also, the value of k, i.e., the
number of feature clusters, is set to 3000. The experimental
outcomes are shown in Figure 1a. Among the four measures,
EMD is clearly the most effective one for our local
representation.
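The evaluation protocol described here (15 nearest neighbors per query, excluding the query itself, averaged per category) can be sketched as below. The function name and its inputs are illustrative: `dist` stands for any precomputed pairwise dissimilarity matrix, for example EMD or L2 values, and `labels` holds the category of each image.

```python
import numpy as np


def knn_category_accuracy(dist, labels, k=15):
    """Per-category average of the fraction of same-category images among
    each query's k nearest neighbors (the query itself is excluded)."""
    labels = np.asarray(labels)
    d = np.array(dist, dtype=float)
    np.fill_diagonal(d, np.inf)                # exclude the query image
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbors per query
    per_query = (labels[nn] == labels[:, None]).mean(axis=1)
    return {int(c): float(per_query[labels == c].mean()) for c in np.unique(labels)}
```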

3.3 Augmented Features for CBIR

We compare the performance of using either global or local

features for CBIR by redoing the experiment in Figure 1a, in

which the $L_2$ distance and EMD are respectively employed. The

results are shown in Figure 1b (the blue and green curves).

Overall the global representation produces better accuracy

rates. However, it is worthwhile to note that the two rep-

resentation schemes complement each other for images of

many categories. Taking the most extreme cases into account,
e.g., categories 20 and 6, we illustrate several images

belonging to the two categories in Figures 1c and 1d.

It is evident that the global representation works well for

category 20 owing to the consistency in the backgrounds,

though the underlying semantic concept is difficult to be

identified. On the other hand, using local features achieves

good performance for category 6 in that the locally distinc-

tive and unique patterns of jaguars typically appear in small-

area sub-images and the backgrounds are also arbitrary and

complex. In view of the significantly complementary nature

in the query performance, we thus seek a representation that

can reasonably include both image properties. Nevertheless,
a critical obstacle to be surmounted is that the distance
measures used for the two types of features are quite
different. For example, while the $L_2$ distance is often used
for correlating global feature vectors, it performs poorly for
local features. The difficulty prompts the idea of using augmented
features for CBIR described next.

Given an image, suppose again there are d interest points
detected. Therefore its local representation is a bag of local
features, $\Gamma = \{l_1, l_2, \ldots, l_d\}$. Let g be the corresponding
global-feature vector for the same image. We define a new
representation by augmenting each bag of local features:

$$f_i = \begin{bmatrix} \omega \, l_i \\ (1 - \omega) \, g \end{bmatrix}, \quad \text{for } 1 \le i \le d, \tag{15}$$

where ω is a relative weighting factor. Then, the proposed
representation, $\widetilde{\Gamma} = \{f_1, f_2, \ldots, f_d\}$, can be handled similarly
as for the local representation, including the vector quantization
and the use of EMD for distance measurement.
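Equation (15) amounts to appending the same weighted global vector to every weighted local vector in the bag. A short sketch, with a hypothetical function name and the bag stored one local vector per row:

```python
import numpy as np


def augment(local_bag, g, omega=0.5):
    """Build f_i = [omega * l_i ; (1 - omega) * g] for every l_i in the bag.

    local_bag has shape (d, local_dim); g has shape (global_dim,).
    The result has shape (d, local_dim + global_dim).
    """
    local_bag = np.asarray(local_bag, dtype=float)
    g = np.asarray(g, dtype=float)
    tiled = np.tile((1.0 - omega) * g, (local_bag.shape[0], 1))
    return np.hstack([omega * local_bag, tiled])
```

The augmented bag can then go through the same quantization and signature conversion as the purely local bag.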


[Figure 1. Panels (a) and (b) plot average accuracy (0 to 1) against category (1 to 30). Curves in (a): Local + TF-IDF, Local + BHC, Local + Euclidean, Local + EMD. Curves in (b): Global + Euclidean, Local + EMD, Augmented + EMD. Panels (c)–(e) show sample images.]

Figure 1: (a) Accuracy comparisons among four similarity measures for local features. (b) Accuracy comparisons for global, local, and augmented features. (c)–(e) Images from category 20, 6, and 23, respectively.

To evaluate the efficiency of the proposed representation,

we also carry out the same experiment as in testing the

global and the local ones. With ω empirically set to 0.5, the
results are shown in Figure 1b (the red curve). The curve runs along
the skyline of the curves for global and local features. For
some categories, e.g., category 23 in Figure 1e, the augmented
features even significantly outperform both global and local ones
because in these categories the two kinds of statistics, local
and global, are both meaningful in the similarity measurement.

4. EXPERIMENTS AND DISCUSSIONS

We present several experimental results and comparisons

to demonstrate the effectiveness of the proposed manifold-learning
algorithm, coupled with the use of augmented features.
In Section 4.1 we describe the image dataset used
in the experiments. The various settings concerning the

performance evaluation metrics, cross validation, and im-

plementation details are given in Section 4.2. We discuss

the efficiencies of the three key components in our method,

including image representations, learning algorithms, and

proper dimensions of embedding spaces for CBIR in Sec-

tions 4.3–4.5. Then examples of 2-D visualization for em-

bedding spaces are provided in Section 4.6 to illustrate the

progressive improvements through the feedback processes.

4.1 Image Dataset

The COREL dataset is widely used in CBIR systems, such as [9, 12, 23, 28, 29]. For evaluation purposes, we also choose this collection for our testing. We empirically select 30 categories of color images, each consisting of 100 samples. Images in the same category share the same semantic concept, but vary individually. The category membership serves as the ground truth in the experiments, i.e., images from the same category are considered relevant, and otherwise irrelevant.

4.2 Evaluation and Implementation Settings

To exhibit the advantages of using our method, we need a reliable way of evaluating the retrieval performance and of comparing it with other systems. We also run cross validation to ensure that the reported results are general and credible. Besides these evaluation settings, other experimental details are described below.

Evaluation Metrics. Though the precision-recall curve is commonly used as a performance measure for retrieval, it is less suitable for the results of CBIR, due to the often relatively low recall [23]. We instead adopt the precision-scope curve and the precision rate as the performance-evaluation metrics. In this context, the scope specifies the number, N, of top-ranking images returned in response to the user’s query, and the precision is the ratio of the number of relevant returns to the scope N. In practice a precision-scope curve records the precision over a range of scopes. The precision rate emphasizes the precision for a particular value of scope, and thus can be viewed as a point on the precision-scope curve. Specifically, we set N = 20 for the precision rate in all our experiments. For the experiments designed for comparing features, we use the precision-scope curve to measure the performance, because it gives more comprehensive results. For our other experiments on learning algorithms and dimensionality analysis of the embedding spaces, we prefer the precision rate, since the emphasis should be on the precision differences among all the feedback iterations, and hence a compact description is more appropriate.

Figure 2: Retrieval performance of ARE with the three types of features for image representation: global, local, and augmented features. (a)–(c) Precision-scope curves (precision against scope, 0 to 100) for the 1st, 2nd, and 5th feedback iterations, respectively.
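The two metrics can be sketched as follows. This is a minimal illustration, assuming the ranked result list is available as a sequence of relevance flags; the function names and the toy ranking are hypothetical.

```python
from typing import List, Sequence


def precision_at_scope(ranked_relevance: Sequence[bool], n: int) -> float:
    """Precision rate: fraction of relevant images among the top-n returns."""
    if n <= 0:
        raise ValueError("scope must be positive")
    # True counts as 1, so the sum is the number of relevant returns in the scope.
    return sum(ranked_relevance[:n]) / n


def precision_scope_curve(ranked_relevance: Sequence[bool],
                          scopes: Sequence[int]) -> List[float]:
    """Precision-scope curve: the precision recorded over a range of scopes."""
    return [precision_at_scope(ranked_relevance, n) for n in scopes]


# Toy ranking of ten returned images (True means relevant to the query).
ranked = [True, True, False, True, False, False, True, False, False, False]
curve = precision_scope_curve(ranked, [2, 5, 10])
```

With this toy ranking, the precision rate at a single scope is simply one entry of the curve, which is the relation between the two metrics described above.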

Five-Fold Cross Validation. To test our system, we only

consider queries that are not in the database. The strategy is

practical and meaningful because testing with training data

is less persuasive in the evaluation of a learning algorithm.

Driven by the query-by-example execution of our system,

we use five-fold cross validation to simulate the queries with

examples not in the image database. More precisely, we ran-

domly divide all the images into five equal-size sets. In each

run of cross validation, we pick one set as the query set, and

leave the other four sets as the training data. It implies that

the nodes of the relational graphs in the ARE algorithm cor-

respond to images in the training set. The precision-scope

curve and precision rate are derived by averaging the re-

sults from the five runs of cross validation. We adopt the

automatic feedback scheme described in [9] for performance

evaluation. For each submitted query, our system retrieves and ranks the images in the training set by iteratively running ARE. At each feedback iteration, the four top-ranked relevant images and the four top-ranked irrelevant images are selected and inserted into the relevant and irrelevant sets, i.e., $F^+$ and $F^-$ in (2) and (3), respectively. Note that images that have been selected in previous iterations are excluded from later selections. For each query, the automatic feedback mechanism is carried out for eight iterations.
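The protocol above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `five_fold_splits` and `select_feedback` are hypothetical, the ranking is a plain list of image ids, and the ground-truth relevance test is passed in as a callable. The ARE re-ranking step itself is not reproduced here.

```python
import random
from typing import Callable, Iterator, List, Sequence, Set, Tuple


def five_fold_splits(num_images: int, seed: int = 0,
                     folds: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (query_ids, training_ids) pairs from a random equal-size partition."""
    ids = list(range(num_images))
    random.Random(seed).shuffle(ids)
    size = num_images // folds
    for k in range(folds):
        query = ids[k * size:(k + 1) * size]
        train = ids[:k * size] + ids[(k + 1) * size:]
        yield query, train


def select_feedback(ranking: Sequence[int],
                    is_relevant: Callable[[int], bool],
                    already_used: Set[int],
                    per_class: int = 4) -> Tuple[List[int], List[int]]:
    """Pick the top-ranked unused relevant and irrelevant images."""
    pos: List[int] = []
    neg: List[int] = []
    for img in ranking:
        if img in already_used:
            continue  # images selected in earlier iterations are excluded
        if is_relevant(img):
            if len(pos) < per_class:
                pos.append(img)
        elif len(neg) < per_class:
            neg.append(img)
        if len(pos) == per_class and len(neg) == per_class:
            break
    return pos, neg


# 30 categories x 100 images = 3000 images, split into five sets of 600.
splits = list(five_fold_splits(3000))

# Toy feedback round: ids divisible by 3 are "relevant"; id 0 was used before.
pos, neg = select_feedback(list(range(20)), lambda i: i % 3 == 0, already_used={0})
```

In a full simulation, `pos` and `neg` would be added to the relevant and irrelevant sets, the embedding would be relearned, and the loop would repeat for eight iterations per query.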

Implementation Details. We discuss two implementation issues. First, for numerical stability, we exploit the technique suggested in [1] to avoid singularities encountered in solving the generalized eigenvalue problems. Specifically, we apply PCA to the column space of the data matrix, i.e., X in (4), and keep 98% of the information in its low-rank approximation. Second, instead of directly calculating $X[L^N - \gamma L^P]X^T$ in the generalized eigenvalue problem in (4) at each feedback iteration, we compute the result of $\sum_{i,j}(x_i - x_j)(W^N_{ij} - \gamma W^P_{ij})(x_i - x_j)^T$, since their equivalence has been shown in (7). In this way, we take advantage of the sparseness of $W^P$ and $W^N$ to save computational resources and avoid multiplications between large matrices. Furthermore, since $W^P$ and $W^N$ are incrementally updated in the feedback iterations, we only take the changed elements of $W^P$ and $W^N$ into account at each iteration.
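The equivalence exploited here can be checked numerically. The sketch below assumes symmetric weight matrices and the usual graph Laplacian L = D − W, with D the diagonal matrix of row sums; the factor 0.5 is included so that the two sides match exactly under that convention, and the constant may be absorbed differently in (7). All names and the toy data are hypothetical.

```python
import numpy as np


def laplacian(W: np.ndarray) -> np.ndarray:
    """Graph Laplacian L = D - W, with D the diagonal matrix of row sums."""
    return np.diag(W.sum(axis=1)) - W


def pairwise_scatter(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T, visiting nonzero W_ij only."""
    S = np.zeros((X.shape[0], X.shape[0]))
    for i, j in zip(*np.nonzero(W)):
        diff = X[:, i] - X[:, j]          # columns of X are the samples
        S += 0.5 * W[i, j] * np.outer(diff, diff)
    return S


rng = np.random.default_rng(0)
n, d, gamma = 8, 3, 0.5
X = rng.normal(size=(d, n))

# Sparse, symmetric toy graphs: one relevant pair and two irrelevant pairs.
WP = np.zeros((n, n))
WN = np.zeros((n, n))
WP[0, 1] = WP[1, 0] = 1.0
WN[0, 5] = WN[5, 0] = 1.0
WN[2, 5] = WN[5, 2] = 1.0

direct = X @ (laplacian(WN) - gamma * laplacian(WP)) @ X.T
sparse_sum = pairwise_scatter(X, WN - gamma * WP)
```

Because the loop touches only the nonzero entries, its cost grows with the number of feedback-induced edges rather than with the square of the number of images, and a feedback iteration that changes a few entries only needs to add the corresponding terms.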

4.3 Image Features for ARE

In the previous section we compared the effectiveness of the global, local, and augmented features for retrieval, and reported the results in Figure 1. Those experiments were done without relevance feedback or any embedding algorithm. Here we again evaluate these three types of image features by testing them with the ARE algorithm. Their performance in the 1st, 2nd, and 5th feedback iterations is plotted as precision-scope curves in Figures 2a–2c, respectively. Based on the results shown in the diagrams, we observe: 1) The augmented features are more effective than the other two classes in all the iterations; 2) Owing to the increasing number of feedback images in the later iterations, the precision is improved over the entire range of the scope; and 3) In the later feedback iterations, the precision decays slightly within the small-scope range, as shown in Figure 2c. This phenomenon may be caused by a better fitting of ARE with more feedback information.

4.4 Manifold Learning Schemes

To demonstrate the power of the proposed ARE algorithm in learning the semantic concepts from feedback examples, we compare its retrieval performance with that of a related scheme, namely, the incremental Locality Preserving Projections (LPP) [9]. Both algorithms measure similarities locally based on the manifold assumption, and are designed for learning the semantic space via solving eigenvalue problems. The critical difference between the two schemes is that the incremental LPP maintains only one graph for recording both the neighborhood and feedback relations at the same time, while ARE treats the neighborhood relation as a constraint and formulates, in a discriminant manner, the feedback information into an objective function.

Besides incremental LPP and ARE, the kernel ARE is also included in the comparisons. Together with the three possible choices of feature representations, we conduct nine experiments on the precision of learning a semantic manifold. By iteratively adding the user’s feedback, the corresponding precision results of the three learning schemes are shown in Figures 3a–3c, ordered by the image features used. We next highlight some remarks on the experimental results in Figure 3.

Figure 3: Performance evaluation of the three learning algorithms (LPP, ARE, and kernel ARE) in learning the semantic concepts from the feedback, using (a) global, (b) local, and (c) augmented features. Each panel plots precision against the number of feedback iterations (0 to 8).

• No matter what kind of image representation is used, ARE and kernel ARE significantly outperform the incremental LPP, especially in the later feedback iterations. The incremental LPP is formulated based on only one neighborhood graph. For a node in the graph that corresponds to a labeled relevant image, it cannot differentiate other labeled relevant images from its neighbors (which can be either relevant or irrelevant). Meanwhile, it also cannot distinguish other labeled irrelevant images from those which are not its neighbors (which again can be either relevant or irrelevant). Thus, incremental LPP does not account well for users’ feedback. In contrast, ARE uses two additional graphs to encode the augmented relations from the feedback, and effectively transforms these relations into a constrained optimization problem. The underlying semantics of a user are therefore faithfully retained.

• Kernel ARE in most cases outperforms ARE, except for the later iterations when using local features (though they eventually converge to a similar degree of precision). The performance boost is owing to the fact that kernel ARE is a nonlinear scheme, and performs manifold learning over a high-dimensional feature space. However, the main drawback of kernel ARE is that it takes considerably longer than ARE to learn the embedding. In particular, the value of the variance used in the RBF kernel function is often determined by a time-consuming brute-force search. In addition, for a testing image, its inner products with all the training images in the high-dimensional space need to be computed to find its coordinates in the embedding space.
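The cost of embedding a testing image with a kernel method can be seen in the following sketch. It assumes the common form in which each embedding coordinate is a weighted sum of kernel values between the query and every training image, with an RBF kernel parametrized by σ; the coefficient matrix `coeffs` stands in for whatever kernel ARE learns and is purely hypothetical, as are the function names.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def embed_query(query: Sequence[float],
                training: Sequence[Sequence[float]],
                coeffs: Sequence[Sequence[float]],
                sigma: float) -> List[float]:
    """Embed a query; coeffs[i][k] weights training image i on embedding axis k."""
    # One kernel evaluation per training image: this is the per-query overhead.
    k_vec = [rbf_kernel(query, t, sigma) for t in training]
    dims = len(coeffs[0])
    return [sum(k_vec[i] * coeffs[i][k] for i in range(len(training)))
            for k in range(dims)]


# Toy example with two training images and a 2-D embedding.
training = [[0.0, 0.0], [1.0, 0.0]]
coeffs = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical learned coefficients
coords = embed_query([0.0, 0.0], training, coeffs, sigma=1.0)
```

Each query therefore requires as many kernel evaluations as there are training images, whereas a linear embedding needs only a single matrix-vector product with the learned projection.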

4.5 Embedding Dimensions

The dimensionality of the embedding space is critical to both the retrieval precision and the time complexity of the nearest-neighbor search in our system. We argue that a good manifold-learning algorithm should satisfy the following two requirements for the learned manifold. First, the precision should converge rapidly as the dimension increases. This property ensures the correctness of a system, the compactness of image representations, and the efficiency of similarity search in the low-dimensional embedding space. Second, it is useful to have a broad range of dimensions that is optimal for system precision. Furthermore, the change in the precision along the dimension axis should be stable and smooth, so that the optimal dimension for embedding can be conveniently spotted. Also, for image retrieval with relevance feedback, it is preferable that the optimal ranges of embedding dimensions mostly overlap for all the feedback iterations. Thus a common value of dimensionality can be applied to each iteration.

To verify whether our method has the above properties of an efficient manifold-learning scheme, we evaluate the precision of ARE in different feedback iterations over a range of embedding dimensions. The results are shown in Figure 4. Besides the iterative improvement and convergence in precision over the feedback iterations, we also observe that ARE satisfies the requirements: the precision converges (along the curve) near the dimensions of 30 ∼ 40, and a broad optimal region around dimensions 40 ∼ 100 is shared by all the feedback iterations.

4.6 Visualization of Semantics

To gain insight into ARE, we display the learned semantic manifold in a 2-D plane. However, ARE does not perform well when embedding the manifold into such a low-dimensional space (see Figure 4). Instead of directly embedding into a 2-D space for visualization, we use ARE to embed the semantic manifold into a 30-dimensional space, and project the points onto a plane via multidimensional scaling (MDS) [7], which preserves the inter-point distances of the 30-D space as faithfully as possible. In Figures 5a and 5b, we show two queries with the respective semantic concepts of firework and office interiors, and their learned semantic manifolds. The images of the two queries are depicted in the first row. The second row includes the initial embedding spaces (without any feedback). In the last two rows, the semantic manifolds learned after the 3rd and the 8th feedback iterations are given. In each figure, the red points represent the images relevant to the query, and the green points stand for the irrelevant ones. The four magenta and the four cyan points respectively denote the relevant and irrelevant feedback that will be returned to the system. The region centered at the query point is zoomed in to reveal the detail of the 100 nearest neighbors (100-NN) of the query. Note that the relevant (red) points progressively gather together while the irrelevant points keep away from the relevant ones, especially in the region around the query. Besides the quantitative results, the illustrations of the embedding spaces also demonstrate the effectiveness of the proposed ARE in learning the semantic manifolds.

Figure 4: The dimensionality effect of ARE in learning the semantic manifold. We depict the precisions in different feedback iterations (1st, 2nd, 4th, 6th, and 8th) with the dimensionality of the embedding space ranging from 5 to 120. Note that, using 40 to 100 dimensions, ARE uniformly gives stable and reliable results for all the feedback iterations.
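The projection step used for visualization can be sketched with classical (Torgerson) MDS. Which MDS variant from [7] was actually used is not stated, so the choice of the classical form is an assumption of this sketch; the function name and the random stand-in data are hypothetical as well.

```python
import numpy as np


def classical_mds(points: np.ndarray, out_dim: int = 2) -> np.ndarray:
    """Project n x D points to out_dim dimensions, preserving Euclidean
    inter-point distances as closely as a linear spectral method allows."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T   # squared distances
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering                    # double centering
    vals, vecs = np.linalg.eigh(gram)                           # ascending order
    top = np.argsort(vals)[::-1][:out_dim]                      # largest eigenvalues
    # Clip tiny negative eigenvalues caused by round-off before the square root.
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))


# Stand-in for a learned 30-D embedding of 50 images.
rng = np.random.default_rng(0)
embedded = rng.normal(size=(50, 30))
plane = classical_mds(embedded)
```

When the points happen to lie in a 2-D subspace of the 30-D space, this projection reproduces all inter-point distances exactly; otherwise the 2-D picture is only an approximation of the 30-D geometry.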

5. CONCLUSION

We have presented a framework for learning a semantic manifold for CBIR, and applied the technique to capture the user’s preferences from few feedback examples. Our method is further consolidated with the use of augmented features, designed to characterize an image more precisely by both its global and local properties. The ARE completes the embedding in a transductive manner by taking both the class of the labeled images and the intrinsic structure of the unlabeled ones into account. The promising experimental results and several useful comparisons justify their use in CBIR with relevance feedback. Owing to the generality of ARE, we plan to extend the algorithm to other multimedia data such as audio and video in our future work.

6. ACKNOWLEDGMENTS

This work was supported by grants 93-2213-E-001-018,

94-2213-E-001-005 and 94-EC-17-A-02-S1-032.

7. REFERENCES

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class-specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and

spectral techniques for embedding and clustering. In

Neural Information Processing Systems, 2001.

[3] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. In Neural Information Processing Systems, 2003.

[4] C. Carson, M. Thomas, S. Belongie, J. Hellerstein,

and J. Malik. Blobworld: A system for region-based

image indexing and retrieval. In Visual Information

Systems, pages 509–516, 1999.

[5] H.-T. Chen, H.-W. Chang, and T.-L. Liu. Local

discriminant embedding and its variants. In Int’l

Conference on Computer Vision and Pattern

Recognition, pages II: 846–853, 2005.

[6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein.

Introduction to Algorithms. The MIT Press, 2nd

edition, 2001.

[7] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.

[8] K. Grauman and T. Darrell. Efficient image matching

with distributions of local invariant features. In Int’l

Conference on Computer Vision and Pattern

Recognition, pages II: 627–634, 2005.

[9] X. He. Incremental semi-supervised subspace learning

for image retrieval. In ACM Conference on

Multimedia, pages 2–8, 2004.

[10] X. He, W.-Y. Ma, and H.-J. Zhang. Learning an

image manifold for retrieval. In ACM Conference on

Multimedia, pages 17–23, 2004.

[11] X. He and P. Niyogi. Locality preserving projections.

In Neural Information Processing Systems, 2003.

[12] C.-H. Hoi and M. Lyu. A novel log-based relevance

feedback technique in content-based image retrieval. In

ACM Conference on Multimedia, pages 24–31, 2004.

[13] Y. Ishikawa, R. Subramanya, and C. Faloutsos.

Mindreader: Querying databases through multiple

examples. In International Conference on Very Large

Data Bases, pages 218–227, 1998.

[14] Y. Ke, R. Sukthankar, and L. Huston. Efficient

near-duplicate detection and sub-image retrieval. In

ACM Conference on Multimedia, pages 869–876, 2004.

[15] D. Lowe. Distinctive image features from

scale-invariant keypoints. Int’l Journal of Computer

Vision, 60(2):91–110, 2004.

[16] W.-Y. Ma and B. Manjunath. Netra: A toolbox for

navigating large image databases. In Multimedia

Systems, volume 7, pages 184–198, 1999.

[17] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Int’l Conference on Computer Vision and Pattern Recognition, pages 257–263, 2003.

[18] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Weak

hypotheses and boosting for generic object detection

and recognition. In Euro. Conference on Computer

Vision, pages 71–84, 2004.

[19] G. Pass, R. Zabih, and J. Miller. Comparing images

using color coherence vectors. In ACM Conference on

Multimedia, pages 65–73, 1996.

[20] K. Porkaew and K. Chakrabarti. Query refinement for

multimedia similarity retrieval in MARS. In ACM

Conference on Multimedia, pages 235–238, 1999.

[21] S. Roweis and L. Saul. Nonlinear dimensionality

reduction by locally linear embedding. Science,

290:2323–2326, 2000.