Dynamic Two-Stage Image Retrieval
from Large Multimodal Databases
Avi Arampatzis, Konstantinos Zagoris, and Savvas A. Chatzichristofis
Department of Electrical and Computer Engineering,
Democritus University of Thrace, Xanthi 67100, Greece
Abstract. Content-based image retrieval (CBIR) with global features is notori-
ously noisy, especially for image queries with low percentages of relevant images
in a collection. Moreover, CBIR typically ranks the whole collection, which is
inefficient for large databases. We experiment with a method for image retrieval
from multimodal databases, which improves both the effectiveness and efficiency
of traditional CBIR by exploring secondary modalities. We perform retrieval in
a two-stage fashion: first rank by a secondary modality, and then perform CBIR
only on the top-K items. Thus, effectiveness is improved by performing CBIR
on a ‘better’ subset. Using a relatively ‘cheap’ first stage, efficiency is also im-
proved via the fewer CBIR operations performed. Our main novelty is that K is
dynamic, i.e. estimated per query to optimize a predefined effectiveness measure.
We show that such dynamic two-stage setups can be significantly more effective
and robust than similar setups with static thresholds previously proposed.
1 Introduction

In content-based image retrieval (CBIR), images are represented by global or local features. Global features are capable of generalizing an entire image with a single vector, describing color, texture, or shape. Local features are computed at multiple points on an image and are capable of recognizing objects.
CBIR with global features is notoriously noisy for image queries of low generality,
i.e. the fraction of relevant images in a collection. In contrast to text retrieval where
documents matching no query keyword are not retrieved, CBIR methods typically rank
the whole collection via some distance measure. For example, a query image of a red
tomato on white background would retrieve a red pie-chart on white paper. If the query
image happens to have a low generality, early rank positions may be dominated by
spurious results such as the pie-chart, which may even be ranked before tomato images
on non-white backgrounds. Figures 2a-b demonstrate this particular problem.
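As a toy illustration of this failure mode, consider ranking an entire collection by distance between global feature vectors. The two-dimensional "features" and the helper name below are purely illustrative, not the descriptors used in this paper:

```python
import numpy as np

def rank_by_global_feature(query_vec, collection_vecs):
    """Rank every image in the collection by Euclidean distance to the
    query's global feature vector (smaller = more similar). Note that the
    whole collection is ranked, including images that share only
    incidental traits (e.g. dominant colors) with the query."""
    dists = np.linalg.norm(collection_vecs - query_vec, axis=1)
    return np.argsort(dists)  # indices, best match first

# Toy 2-D "color" features: [redness, background whiteness]
query = np.array([0.9, 0.9])   # red tomato on white background
collection = np.array([
    [0.85, 0.95],              # red pie-chart on white paper
    [0.90, 0.20],              # tomato on a dark background
    [0.10, 0.10],              # unrelated dark image
])
# The pie-chart (index 0) outranks the real tomato (index 1)
print(rank_by_global_feature(query, collection))
```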
Local-feature approaches provide slightly better retrieval effectiveness than global features. They represent images with multiple points in a feature space, in contrast to single-point global feature representations. While local approaches provide more robust information, they are computationally more expensive due to the high dimensionality of their feature spaces, and they usually need nearest-neighbor approximation to perform point matching. High-dimensional indexing still remains a challenging problem in the database field. Thus, global features are more popular in CBIR systems, as they are easier to handle and still provide basic retrieval mechanisms. In any case, CBIR with either local or global features does not scale up well to large databases efficiency-wise. In small databases a simple sequential scan may be acceptable; however, when scaling up to millions or billions of images, efficient indexing algorithms are imperative.

P. Clough et al. (Eds.): ECIR 2011, LNCS 6611, pp. 326–337, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Nowadays, information collections are not only large, but they may also be multi-
modal. Take as an example Wikipedia, where a single topic may be covered in several
languages and include non-textual media such as image, sound, and video. Moreover,
non-textual media may be annotated in several languages in a variety of metadata fields
such as object caption, description, comment, and filename. In an image retrieval sys-
tem where users are assumed to target visual similarity, all modalities beyond image
can be considered as secondary; nevertheless, they can still provide useful information
for improving image retrieval.
In this paper, we experiment with a two-stage method for image retrieval from multimodal databases, which targets to improve both the effectiveness and efficiency of traditional
CBIR by exploring information from secondary modalities. In the setup considered,
an information need is expressed by a query in the primary modality (i.e. an image
example) accompanied by a query in a secondary modality (e.g. text). The core idea
for improving effectiveness is to raise query generality before performing CBIR, by
reducing collection size via filtering methods. In this respect, we perform retrieval in
a two-stage fashion: first use the secondary modality to rank the collection and then
perform CBIR only on the top-K items. Using a ‘cheaper’ secondary modality, this
improves also efficiency by cutting down on costly CBIR operations.
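The core two-stage procedure can be sketched as follows; `text_scores` and `visual_distance` are stand-ins for the actual first- and second-stage retrieval models:

```python
def two_stage_retrieval(text_scores, visual_distance, K):
    """First stage: rank all items by the secondary (text) modality.
    Second stage: re-rank only the top-K items by visual similarity;
    items below rank K keep their text-based order."""
    text_ranking = sorted(text_scores, key=text_scores.get, reverse=True)
    head, tail = text_ranking[:K], text_ranking[K:]
    head = sorted(head, key=visual_distance)  # costly CBIR on only K items
    return head + tail

# Toy example: 5 items with text scores and visual distances to the query image
text_scores = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}
visual_distance = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.2, "e": 0.3}.get
print(two_stage_retrieval(text_scores, visual_distance, K=3))
```

Only the K items surviving the first stage are compared visually, so the number of CBIR operations is bounded by K rather than the collection size.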
Re-ranking of best results by visual content has been seen before, but mostly in different setups than the one we consider or for different purposes, e.g. result clustering or diversity. Others used external information, e.g. an external set of diversified images (also, they did not use image queries), web images to depict a topic, or training data. All these approaches employed a static predefined K for all queries, except one that re-ranked the top-30% of retrieved items. They all used global features for images. Effectiveness results have been mixed: it worked for some and not for others, while some did not provide a comparative evaluation or system study. Later, we will review the aforementioned literature in more detail.
In view of the related literature, our main contributions are the following. Firstly, our threshold is calculated dynamically per query to optimize a predefined effectiveness measure, without using external information or training data; this is also our biggest novelty. We show that the choice between static and dynamic thresholding can make the difference between failure and success of two-stage setups. Secondly, we provide an extensive evaluation in relation to thresholding types and levels, showing that dynamic thresholding is not only more effective but also more robust than static. Thirdly, we investigate the influence of different effectiveness levels of the second, visual stage on the whole two-stage procedure. Fourthly, we provide a comprehensive review of related literature and discuss the conditions under which such setups can be applied effectively. In summary, with a simpler two-stage setup than most previously proposed in the literature, we achieve significant improvements over retrieval with text-only, several image-only, and two-stage static-thresholding setups.
The rest of the paper is organized as follows. In Section 2 we discuss the assump-
tions, hypotheses, and requirements behind two-stage image retrieval from multimodal
databases. In Section 3 we perform an experiment on a standardized multimodal snap-
shot of Wikipedia. In Section 4 we review related work. Conclusions and directions for
further research are summarized in Section 5.
2 Two-Stage Image Retrieval from Multimodal Databases
Multimodal databases consist of multiple descriptions or media for each retrievable item; in the setup we consider, these are image and annotations. On the one hand, textual descriptions are key to retrieving relevant results for a query, but at the same time they provide little information about the image content. On the other hand, the visual content of images contains large amounts of information which can hardly be described by words.
Traditionally, the method followed in order to deal effectively with multimodal databases is to search the modalities separately and fuse their results, e.g. with a linear combination of the retrieval scores of all modalities per item. While fusion has proved robust, it has a few issues: a) appropriate weighting of modalities is not a trivial problem and may require training data, b) total search time is the sum of the times taken for searching the participating modalities, and most importantly, c) it is not a theoretically sound method if results are assessed by visual similarity only; the influence of textual scores may worsen the visual quality of end-results. The latter issue points to the existence of a primary modality, i.e. the one targeted and assessed by users.
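For contrast, the fusion approach just described can be sketched as a per-item linear combination of min-max normalized scores; the weight `w` is the non-trivial parameter of issue (a), and the function names are illustrative:

```python
def fuse(text_scores, image_scores, w=0.5):
    """Late fusion: per-item linear combination of min-max normalized
    scores from both modalities. Both modalities are searched in full,
    so the total cost is the sum of the two searches (issue b)."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {k: (v - lo) / (hi - lo or 1.0) for k, v in scores.items()}
    t, i = norm(text_scores), norm(image_scores)
    fused = {k: w * t[k] + (1 - w) * i[k] for k in t}
    return sorted(fused, key=fused.get, reverse=True)

text = {"a": 2.0, "b": 1.0, "c": 0.0}
image = {"a": 0.0, "b": 1.0, "c": 0.5}
print(fuse(text, image))  # b's strong image score lifts it above a
```

Note how the textual score of item "a" is diluted by its image score even if the user assesses results by visual similarity alone, illustrating issue (c).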
An approach that may tackle the issues of fusion is to search in a two-stage fashion: first rank with a secondary modality, draw a rank threshold, and then re-rank only the top items with the primary modality. Such a two-stage setup rests on the assumption that a primary modality exists, and its success depends on the relative effectiveness of the two stages: if text retrieval always performs better than CBIR (irrespective of query generality), then CBIR is redundant; if it is the other way around, CBIR alone is sufficient. Thus, the hypothesis is that CBIR can do better than text retrieval in small sets or sets of high query generality.
In order to reduce collection size while raising query generality, a ranking can be thresholded at an arbitrary rank or item score. This improves efficiency by cutting down on costly CBIR operations, but it may not improve result quality much: a too-tight threshold would produce similar results to a text-only search, making CBIR redundant, while a too-loose threshold would produce results haunted by the red-tomato/red-pie-chart effect mentioned in the Introduction. Three factors determine the right threshold: 1) the number of relevant items in the collection, 2) the quality of the ranking, and 3) the measure that the threshold targets to optimize. The first two factors are query-dependent, thus thresholds should be selected dynamically per query, not statically as in most previously proposed methods in the literature (reviewed in Section 4).
The approach of re-ranking the top-30% of retrieved items, which can be considered dynamic, does not take into account the three aforementioned factors. While the number of retrieved results might be argued to correlate with the number of relevant items (thus, seemingly taking into account the first factor), this correlation can be very weak at times, e.g. consider a high-frequency query word (almost a stop-word) which retrieves a large fraction of the collection irrespective of the number of relevant items; it is also only remotely connected to factors (2) and (3). Consequently, we resort to an approach which, based on the distribution of item scores, is capable of estimating (1), as well as mapping scores to probabilities of relevance. Having the latter, (2) can be determined, and any measure defined in (3) can be optimized in a straightforward way. More on the method can be found in the last-cited study.
Aiming to enhance query generality, the most appropriate measure to optimize would be precision. However, since the smoothed precision estimated by the method monotonically declines with rank, it makes sense to set a precision threshold. The choice of precision threshold is dependent on the effectiveness of the CBIR stage: it can be seen as guaranteeing the minimum generality required by the CBIR method at hand for achieving good effectiveness. Not knowing the relation between CBIR effectiveness and minimum required generality, we will try a series of thresholds on precision, as well as optimize other cost-gain measures. Thus, while it may seem that we exchange the initial problem of where to set a static threshold with where to threshold precision or which measure to optimize, it will turn out that the latter problem is less sensitive to its available options, as we will see.
A possible drawback of the two-stage setup considered is that relevant images with empty or very noisy secondary modalities would be completely missed, since they will
not be retrieved by the first stage. If there are any improvements compared to single-
stage text-only or image-only setups, these will first show up on early precision since
only the top results are re-ranked; mean average precision or other measures may im-
prove as a side effect. In any case, there are efficiency benefits from searching the most
expensive modality only on a subset of the collection.
The requirement of such a two-stage CBIR at the user side is that information needs are expressed by visual as well as textual descriptions. The community is already experimenting with such setups, e.g. the ImageCLEF 2010 Wikipedia Retrieval task was performed on a multimodal collection with topics made of textual and image queries at the same time. Furthermore, multimodal or holistic query interfaces are showing up in experimental search engines allowing concurrent multimedia queries. As a last resort, automatic image annotation methods [14,7] may be employed for generating queries for secondary modalities in traditional image retrieval systems.
3 An Experiment on Wikipedia
In this section, we report on experiments performed on a standardized multimodal snapshot of Wikipedia. It is worth noting that the collection is one of the largest benchmark image databases by today's standards. It is also highly heterogeneous, containing color natural images, graphics, grayscale images, etc., in a variety of sizes.
3.1 Datasets, Systems, and Methods
The ImageCLEF 2010 Wikipedia test collection has image as its primary medium,
consisting of 237,434 items, associated with noisy and incomplete user-supplied tex-
tual annotations and the Wikipedia articles containing the images. Associated anno-
tations exist in any combination of English, German, French, or any other unidentified
(non-marked) language. There are 70 test topics, each one consisting of a textual and
a visual part: three title fields (one per language—English, German, French), and one
or more example images. The topics are assessed by visual similarity to the image
examples. More details on the dataset can be found in .
For text indexing and retrieval, we employ the Lemur Toolkit V4.11 and Indri V2.11, with the default settings of the system except that we enable Krovetz stemming. We index only the English annotations, and use only the English query of the topics.
We index the images with two descriptors that capture global image features: the
Joint Composite Descriptor (JCD) and the Spatial Color Distribution (SpCD). The JCD
is developed for color natural images and combines color and texture information .
In several benchmarking databases, JCD has been found more effective than MPEG-7
descriptors. The SpCD combines color and its spatial distribution; it is considered more suitable for colored graphics, since they consist of a relatively small number of colors and fewer texture regions than color natural images. It was recently introduced and found to perform better than JCD in a heterogeneous image database.
We evaluate on the top-1000 results with mean average precision (MAP), precision
at 10 and 20, and bpref .
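These measures can be computed from a ranked result list and the set of relevant items; below is a minimal sketch of P@k and (per-topic) average precision, not the official trec_eval implementation:

```python
def precision_at(k, ranking, relevant):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision at each rank where a relevant item appears;
    non-retrieved relevant items contribute zero. MAP averages this
    value over all topics."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranking, relevant = ["a", "x", "b", "y"], {"a", "b"}
print(precision_at(2, ranking, relevant))   # one of top-2 is relevant
print(average_precision(ranking, relevant))
```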
3.2 Thresholding and Re-ranking
We investigate two types of thresholding: static and dynamic. In static thresholding, the same fixed pre-selected rank threshold K is applied to all topics. We experiment with levels of K at 25, 50, 100, 250, 500, and 1000. The results that are not re-ranked by image are retained as they are ranked by text; the same holds in dynamic thresholding.
For dynamic thresholding, we use the Score-Distributional Threshold Optimization
(SDTO) as described in  and with the code provided by its authors. For tf.idf scores,
we used the technically truncated model of a normal-exponential mixture. The method normalizes retrieval scores to probabilities of relevance (prels), enabling the optimization of K for any user-defined effectiveness measure. Per query, we search for the optimal K in [0, 2500], where 0 or 1 results in no re-ranking. Thus, for estimation with the SDTO we truncate at the score corresponding to rank 2500, but use no truncation at high scores, as tf.idf has no theoretical maximum. If there are 25 text results or fewer, we always re-rank by image; these are too few scores to apply the SDTO reliably. In this category fall topics 1, 10, 23, and 46, with only 18, 16, 2, and 18 text results respectively. The biggest strength of the SDTO is that it does not require training data; more details on the method can be found in the last-mentioned study.
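The normalization at the heart of the SDTO can be sketched as an application of Bayes' rule to the fitted mixture: relevant scores are modeled as normal, non-relevant scores as exponential. The parameter values used below are illustrative, not actual fitted values:

```python
import math

def prel(score, lam, mu, sigma, beta):
    """P(rel | score) under a binary mixture: relevant scores ~
    Normal(mu, sigma), non-relevant scores ~ Exponential(beta),
    with mixing weight lam = P(rel). Straight Bayes' rule on the
    two component densities."""
    f_rel = math.exp(-((score - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))
    f_non = beta * math.exp(-beta * score) if score >= 0 else 0.0
    num = lam * f_rel
    return num / (num + (1 - lam) * f_non)

# Illustrative parameters: 5% relevant, relevant scores around 8
p_low = prel(2.0, lam=0.05, mu=8.0, sigma=1.0, beta=1.0)
p_high = prel(9.0, lam=0.05, mu=8.0, sigma=1.0, beta=1.0)
print(p_low, p_high)  # low scores map to prel near 0, high scores near 1
```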
We experiment with the SDTO by thresholding on prel as well as on precision. Thresholding on fixed prels happens to optimize linear utility measures, with corresponding rank thresholds:

– max K : P(rel|D_K) ≥ θ, where D_K is the K-th ranked document.

For the prel threshold θ, we try six values. Two of them are:
– 0.5000: It corresponds to 1 loss per relevant non-retrieved and 1 loss per non-relevant retrieved, i.e. the Error Rate, and it is precision-recall balanced.
– 0.3333: It corresponds to 2 gain per relevant retrieved and 1 loss per non-relevant retrieved, i.e. the T9U measure used in the TREC 2000 Filtering Track, and it is recall-oriented.
These prel thresholds may optimize other measures as well. We arbitrarily enrich the experimental set of levels with four more thresholds: 0.9900, 0.9500, 0.8000, and 0.1000.
Furthermore, having normalized scores to prels, we can estimate the precision in any top-K set by simply adding the prels and dividing by K. The estimated precision can be seen as the generality in the sub-ranking. According to the hypothesis that the effectiveness of CBIR is positively correlated with query generality, we experiment with the following thresholding: the largest K such that the estimated precision in the top-K is at least g, where g can be seen as the minimum generality required by the CBIR method at hand for good effectiveness. Having no clue about usable g values, we arbitrarily try levels of g at 0.9900, 0.9500, 0.8000, 0.5000, 0.3333, and 0.1000.
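Both dynamic rules can be sketched on top of the list of prels down the ranking (function names are ours, for illustration):

```python
def k_by_prel(prels, theta):
    """Largest K such that the prel of the K-th ranked item is >= theta;
    assumes prels are listed in (descending) rank order."""
    k = 0
    for p in prels:
        if p < theta:
            break
        k += 1
    return k

def k_by_precision(prels, g):
    """Largest K such that the estimated precision in the top-K,
    i.e. the running mean of the prels, is >= g."""
    best, total = 0, 0.0
    for k, p in enumerate(prels, start=1):
        total += p
        if total / k >= g:
            best = k
    return best

prels = [0.95, 0.90, 0.60, 0.30, 0.10]
# The running mean declines slower than the prels themselves, so the
# precision rule admits a looser K than the prel rule at the same level.
print(k_by_prel(prels, theta=0.8), k_by_precision(prels, g=0.8))
```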
3.3 Setting the Baseline
In initial experiments, we investigated the effectiveness of each of the stages individu-
ally, trying to tune them for best results.
In the textual stage, we employ the tf.idf model since it has been found to work well
with the SDTO. The SDTO method fits a binary mixture of probability distributions of high-quality keywords. To be on the safe side, in initial experiments we tried to
increase query length by enabling pseudo relevance feedback of the top-10 documents,
but all our combinations of the parameter values for the number of feedback terms and
initial query weight led to significant decreases in the effectiveness of text retrieval. We
attribute this to the noisy nature of the annotations. Consequently, we do not run any
two-stage experiments with pseudo relevance feedback at the first textual stage.
In the visual stage, first we tried the JCD alone, as the collection seems to con-
tain more color natural images than graphics, and used only the first example image;
this represents a simple but practically realistic setup. Then, incorporating all example
images, the natural combination is to assign to each collection image the maximum
similarity seen from its comparisons to all example images; this can be interpreted as
looking for images similar to any of the example images. Last, assuming that the SpCD
descriptor captures orthogonal information to JCD, we added its contribution. We did
not normalize the similarity values prior to combining them, as these descriptors pro-
duce comparable similarity distributions . Table 1 presents the results; the index i
runs over example images.
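The scoring of the combined run can be sketched as follows; `jcd_sim` and `spcd_sim` stand in for the actual descriptor comparisons, and no normalization is applied before the sum, per the assumption of comparable similarity distributions:

```python
def combined_score(image, examples, jcd_sim, spcd_sim):
    """Assign to a collection image its maximum JCD similarity over all
    query example images (i.e. 'similar to ANY example'), plus its
    maximum SpCD similarity; the raw values are added unnormalized."""
    return (max(jcd_sim(image, e) for e in examples)
            + max(spcd_sim(image, e) for e in examples))

# Toy 1-D stand-ins for descriptor similarity (1 = identical)
jcd_sim = lambda a, b: 1 - abs(a - b)
spcd_sim = lambda a, b: 1 - abs(a - b)
print(combined_score(0.5, [0.0, 1.0], jcd_sim, spcd_sim))
```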
The image-only runs perform far below the text-only run. This puts in perspective the quality of currently effective global CBIR descriptors: their effectiveness in image retrieval is much worse than that of the traditional tf.idf text retrieval model, even on sparse and noisy annotations. Since the image-only runs would have provided very weak baselines, we choose the text-only run as a much stronger baseline for statistical significance testing. This makes sense also from an efficiency point of view: if using a secondary text modality for image retrieval is more effective than current CBIR methods, then there is no reason at all for using computationally costly CBIR methods.

Table 1. Effectiveness of different CBIR setups against tf.idf text-only retrieval

item scoring by             MAP   P@10  P@20  bpref
JCD (first example only)    .0058 .0486 .0479 .0352
maxi JCDi                   .0072 .0614 .0614 .0387
maxi JCDi + maxi SpCDi      .0112 .0871 .0886 .0415
tf.idf (text-only)          .1293 .3614 .3314 .1806
Comparing the image-only runs to each other, we see that using more information, i.e. more example images or an extra descriptor, improves effectiveness. In order to investigate the impact of the effectiveness level of the second stage on the whole two-stage procedure, we will present two-stage results for both the best and the worst CBIR setups.

Table 2 presents two-stage image retrieval results against text- and image-only retrieval. It is easy to see that the dynamic thresholding methods improve retrieval effectiveness in most of the experiments. In particular, dynamic thresholding using θ shows improvements for all values we tried. The greatest improvement (+28%) is observed in P@10 at θ = 0.8. The table contains lots of numbers; while there may be consistent increases or decreases in some places, in the rest of this section we focus on and summarize only the statistically significant differences.
Irrespective of measure and CBIR method, the best thresholds are roughly at: 25 or 50 for K, 0.95 for g, and 0.8 for θ. The weakest thresholding method is the static K: there are very few improvements, only in P@20 at tight cutoffs, and they are accompanied by a reduced MAP and bpref. Actually, static thresholds hurt MAP and/or bpref almost anywhere, and effectiveness degrades also in early precision at loose cutoffs. Dynamic thresholding is much more robust. Comparing the two CBIR methods at the second stage, the stronger method helps the dynamic methods considerably, while static thresholding does not seem to receive much improvement.

Concerning the dynamic thresholding methods, the probability thresholds θ correspond to tighter effective rank thresholds than those of the precision thresholds g, for g and θ taking values in the range [0.1000, 0.9900]. As a proxy for the effective K we use the median threshold K across all topics. This is expected, since precision declines slower than prel. Nevertheless, the fact that a wide range of prel thresholds results in a tight range of K reveals a sharp decline in prel below some score per query. This makes the end-effectiveness less sensitive to prel thresholds in comparison to precision thresholds, thus more robust against possibly unsuitable user-selected values. Furthermore, if we compare the dynamic methods at similar effective K, we see that prel thresholds perform slightly better. Figure 1 depicts the evaluation measures against K for all methods and the stronger CBIR; Figure 2 presents the top image results for a query.
Table 2. Two-stage image retrieval results for both the worst and the best CBIR second stage, measured in MAP, P@10, P@20, and bpref: text-only baseline (.1293 .3614 .3314 .1806), static thresholds K ∈ {25, 50, 100, 250, 500, 1000}, dynamic precision thresholds g (median K: 49, 68, 95, 237, 711), dynamic prel thresholds θ (median K: 51, 81, 91, 109, 130), and the image-only runs. The best results per measure and thresholding type are in boldface. Significance-tested with a bootstrap test, one-tailed, at significance levels 0.05, 0.01, and 0.001, against the text-only baseline. [Table body not reproduced: the remaining cell values were scrambled in extraction.]
In summary, static thresholding improves initial precision at the cost of MAP and bpref, while dynamic thresholding on precision or prel does not have this drawback. The choice of a static or precision threshold greatly influences effectiveness, and unsuitable choices (e.g. too loose) may lead to degraded performance. Prel thresholds are much more robust in this respect. As expected, better CBIR at the second stage leads to overall improvements; nevertheless, the thresholding type seems more important: while the two CBIR methods we employ vary greatly in performance (the best has almost double the effectiveness of the other), static thresholding is not influenced much by this choice; we attribute this to its lack of respect for the number of relevant items and for the ranking quality. Dynamic methods benefit more from improved CBIR. Overall, prel thresholds perform best, for a wide range of values.
4 Related Work
Image re-ranking can be performed using textual or visual descriptions. Next, we will focus only on visual re-ranking. Subset re-ranking by visual content has been seen before, but mostly in different setups than the one we consider or for different purposes, e.g. result clustering or diversity. It is worth mentioning that all the previously proposed methods we review below used global image features to re-rank images.
Fig.1. Effectiveness, for the strongest CBIR stage: (A) MAP, (B) P@10, (C) P@20, (D) bpref
For example, one study proposed an image retrieval system using keyword-based retrieval and then clustering the images retrieved from Google Images according to their visual similarity. Using the clusters, retrieved images
were arranged in such a way that visually similar images are positioned close to each
other. Although the method may have had a similar effect to ours, it was not evaluated
against text-only or image-only baselines, and the impact of different values of K was
not investigated. In , the authors retrieved the top-50 results by text and then clus-
tered the images in order to obtain a diverse ranking based on cluster representatives.
The clusters were evaluated against manually-clustered results, and it was found that
the proposed clustering methods tend to reproduce manual clustering in the majority of
cases. The approach we have taken does not aim at increasing diversity.
Another similar approach was proposed in , where the authors state that Web
image retrieval by text queries is often noisy and employ image processing techniques
in order to re-rank retrieved images. The re-ranking technique was based on the visual
similarity between image search results and on their dissimilarity to an external con-
trastive class of diversified images. The basic idea is that an image will be relevant to
the query, if it is visually similar to other query results and dissimilar to the external
class. To determine the visual coherence of a class, they took the top 30% of retrieved images and computed the average number of neighbors to the external class. The effects of the re-ranking were analyzed via a user study with 22 participants. Visual re-ranking
seemed to be preferred over the plain keyword-based approach by a large majority of
the users. Note that they did not use an image query but only a text one; in this respect,
the setup we have considered differs in that image queries are central, and we do not
require external information.
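A minimal sketch of this contrastive re-ranking idea follows; the function names and the use of cosine similarity over global feature vectors are our assumptions for illustration, not the cited authors' code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank_contrastive(results, features, contrast_class):
    """Re-rank text-retrieved images: score each one by its average visual
    similarity to the other results minus its average similarity to an
    external contrastive class of diversified images."""
    def score(doc):
        others = [cosine(features[doc], features[d]) for d in results if d != doc]
        external = [cosine(features[doc], c) for c in contrast_class]
        return sum(others) / len(others) - sum(external) / len(external)
    return sorted(results, key=score, reverse=True)
```

An image resembling the other results but not the contrastive class floats to the top; visual outliers sink.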
Dynamic Two-Stage Image Retrieval from Large Multimodal Databases
Fig. 2. Retrieval results: (a) query, (b) image-only, (c) text-only, (d) K25, (e) θ0.8
In , the authors also proposed a two-stage image retrieval system with external
information requirements: the first stage is text-based with automatic query expansion,
whereas the second exploits the visual properties of the query to improve the results
of the text search. In order to visually re-rank the top-1000 images, they employed a visual model (a set of images depicting each topic) built from Web images. To describe
the visual content of the images, several methods using global or local features were
employed. Experimental results demonstrated that visual re-ranking significantly improves retrieval performance in MAP, P@10, and P@20. We have confirmed that
visual re-ranking of top-ranked results improves early precision, though with a simpler
setup without using external information.
Some other setups similar to the one we propose are those in  and . In , the
authors trained their system to perform automatic re-ranking on all results returned by
text retrieval. The re-ranking method considered several aspects of both document and
query (e.g. generality of the textual features, color amount from the visual features).
Improved results were obtained only when the training set had been derived from the database being searched. Our method re-ranks the results using only visual features; it does not require training and can be applied to any database. In , the authors re-ranked the top-K results retrieved by text using visual information. Rank thresholds of 60 and 300 were tried, and both resulted in a decrease in mean average precision compared to the text-only baseline, with 300 performing worse. Our experiments have confirmed their result: static thresholds degrade MAP. They did not report early precision.
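The static two-stage scheme discussed throughout this section can be sketched as follows; `two_stage_static` and the toy similarity function are hypothetical names for illustration, not code from any of the cited works:

```python
def two_stage_static(text_ranking, query_feat, features, sim, k=300):
    """Static two-stage retrieval: re-rank only the top-k text results by
    visual similarity to the image query; the tail keeps its text order."""
    head, tail = text_ranking[:k], text_ranking[k:]
    head.sort(key=lambda d: sim(query_feat, features[d]), reverse=True)
    return head + tail
```

Only k items are scored by the expensive image stage, which is the source of the efficiency gain; the weakness is that a fixed k cannot adapt to how many relevant items each query actually has in the top ranks.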
5 Conclusions and Directions for Further Research
We have experimented with two-stage image retrieval from a large multimodal database, by first using a text modality to rank the collection and then performing content-based image retrieval only on the top-K items. In view of the previous literature, the biggest novelty of our method is that re-ranking is not applied to a preset number of top-K results; instead, K is calculated dynamically per query to optimize a predefined effectiveness measure. Additionally, the proposed method does not require any external information or training data. The choice between a static and a dynamic rank-threshold has turned out to make the difference between failure and success of the two-stage setup.
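The dynamic thresholding idea can be illustrated with a simplified sketch. We assume per-rank relevance probabilities are available (in the actual method such estimates come from fitted score distributions), and K is chosen to maximize the expected value of the measure under optimization, here expected F1; the function name and the probability inputs are our illustrative assumptions:

```python
def dynamic_cutoff(prob_rel):
    """Choose the re-ranking depth K per query by maximizing the expected
    value of a target measure (expected F1 here) over candidate cutoffs.
    prob_rel[i] is the estimated probability that the item at rank i+1
    is relevant."""
    exp_total = sum(prob_rel)       # expected number of relevant items
    best_k, best_val, exp_tp = 1, -1.0, 0.0
    for k, p in enumerate(prob_rel, start=1):
        exp_tp += p                 # expected relevant items in the top-k
        prec = exp_tp / k
        rec = exp_tp / exp_total if exp_total else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        if f1 > best_val:
            best_k, best_val = k, f1
    return best_k
```

Queries whose relevant items are concentrated near the top get a small K, while queries with relevance spread deeper get a larger one, which is exactly the per-query adaptivity that static thresholds lack.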
We have found that two-stage retrieval with dynamic thresholding is more effective
and robust than static thresholding, practically insensitive to a wide range of reason-
able choices for the measure under optimization, and beats significantly the text-only
and several image-only baselines. A two-stage approach, irrespective of thresholding type, also has an obvious efficiency benefit: it greatly cuts down on expensive image operations. Although we have not measured running times, only 0.02–0.05% of the items (on average) had to be scored at the expensive image stage for effective retrieval from the collection at hand. While the dynamic method incurs some overhead for estimating thresholds, this offsets only a small part of the efficiency gains.
There are a couple of interesting directions to pursue in the future. First, the idea can
be generalized to multi-stage retrieval for multimodal databases, where rankings for the modalities are successively thresholded and re-ranked according to a modality hierarchy. Second, although in Section 2 we merely argued for the unsuitability of fusion under the assumptions of the setup we considered, a future plan is to compare the
effectiveness of two-stage against fusion. Irrespective of the outcome, fusion does not
have the efficiency benefits of two-stage retrieval.
Acknowledgments. We thank Jaap Kamps for providing the code for the statistical significance testing.
References
1. Aly, M., Welinder, P., Munich, M.E., Perona, P.: Automatic discovery of image families:
global vs. local features. In: ICIP, pp. 777–780. IEEE, Los Alamitos (2009)
2. Arampatzis, A., Kamps, J., Robertson, S.: Where to stop reading a ranked list: threshold
optimization using truncated score distributions. In: SIGIR, pp. 524–531. ACM, New York (2009)
3. Arampatzis, A., Robertson, S., Kamps, J.: Score distributions in information retrieval. In:
Azzopardi, L., Kazai, G., Robertson, S., Rüger, S., Shokouhi, M., Song, D., Yilmaz, E. (eds.)
ICTIR 2009. LNCS, vol. 5766, pp. 139–151. Springer, Heidelberg (2009)
4. Barthel, K.U.: Improved image retrieval using automatic image sorting and semi-automatic
generation of image semantics. In: International Workshop on Image Analysis for Multime-
dia Interactive Services, pp. 227–230 (2008)
5. Berber, T., Alpkocak, A.: DEU at ImageCLEFMed 2009: Evaluating re-ranking and inte-
grated retrieval systems. In: CLEF Working Notes (2009)
6. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: SIGIR,
pp. 25–32. ACM, New York (2004)
7. Chang, E., Goh, K., Sychay, G., Wu, G.: CBSA: content-based soft annotation for multi-
modal image retrieval using bayes point machines. IEEE Transactions on Circuits and Sys-
tems for Video Technology 13(1), 26–38 (2003)
8. Chatzichristofis, S.A., Boutalis, Y.S., Lux, M.: Selection of the proper compact composite
descriptor for improving content-based image retrieval. In: SPPRA, pp. 134–140 (2009)
9. Chatzichristofis, S.A., Boutalis, Y.S., Lux, M.: SpCD—spatial color distribution descriptor.
A fuzzy rule based compact composite descriptor appropriate for hand drawn color sketches
retrieval. In: ICAART, pp. 58–63 (2010)
10. Chatzichristofis, S.A., Arampatzis, A.: Late fusion of compact composite descriptors for
retrieval from heterogeneous image databases. In: SIGIR, pp. 825–826. ACM, New York (2010)
11. Kilinc, D., Alpkocak, A.: DEU at ImageCLEF 2009 WikipediaMM task: Experiments with expansion and reranking approaches. In: CLEF Working Notes (2009)
12. van Leuken, R.H., Pueyo, L.G., Olivares, X., van Zwol, R.: Visual diversification of image
search results. In: WWW, pp. 341–350. ACM, New York (2009)
13. Lewis, D.D.: Evaluating and optimizing autonomous text classification systems. In: SIGIR,
pp. 246–254. ACM Press, New York (1995)
14. Li, J., Wang, J.Z.: Real-time computerized annotation of pictures. IEEE Transactions on
Pattern Analysis and Machine Intelligence 30, 985–1002 (2008)
15. Li, X., Chen, L., Zhang, L., Lin, F., Ma, W.Y.: Image annotation by large-scale content-based
image retrieval. In: ACM Multimedia, pp. 607–610. ACM, New York (2006)
16. Maillot, N., Chevallet, J.P., Lim, J.H.: Inter-media pseudo-relevance feedback application to
imageclef 2006 photo retrieval. In: CLEF Working Notes (2006)
17. Myoupo, D., Popescu, A., Le Borgne, H., Moellic, P.: Multimodal image retrieval over a
large database. In: Peters, C., Caputo, B., Gonzalo, J., Jones, G.J.F., Kalpathy-Cramer, J.,
Müller, H., Tsikrika, T. (eds.) CLEF 2009. LNCS, vol. 6242, pp. 177–184. Springer, Heidelberg (2010)
18. Popescu, A., Moëllic, P.A., Kanellos, I., Landais, R.: Lightweight web image reranking. In:
ACM Multimedia, pp. 657–660. ACM, New York (2009)
19. Popescu, A., Tsikrika, T., Kludas, J.: Overview of the Wikipedia retrieval task at ImageCLEF 2010. In: CLEF (Notebook Papers/LABs/Workshops) (2010)
20. Robertson, S.E., Hull, D.A.: The TREC-9 filtering track final report. In: TREC (2000)
21. Zagoris, K., Arampatzis, A., Chatzichristofis, S.A.: www.mmretrieval.net: a multimodal
search engine. In: SISAP, pp. 117–118. ACM, New York (2010)