Bag-of-Visual-Words vs Global Image Descriptors
on Two-Stage Multimodal Retrieval
Konstantinos Zagoris Savvas A. Chatzichristofis
Department of Electrical and Computer Engineering
Democritus University of Thrace, Xanthi 67100, Greece
ABSTRACT
The Bag-Of-Visual-Words (BOVW) paradigm is fast becoming a popular image representation for Content-Based Image Retrieval (CBIR), mainly because of its better retrieval effectiveness over global feature representations on collections whose images are near-duplicates of the queries. In this experimental study, we demonstrate that this advantage of BOVW is diminished when visual diversity is enhanced by using a secondary modality, such as text, to pre-filter images. The TOP-SURF descriptor is evaluated against Compact Composite Descriptors in a two-stage image retrieval setup, which first uses a text modality to rank the collection and then performs CBIR only on the top-K items.
Categories and Subject Descriptors: H.3.3 [Information Storage and
Retrieval]: Information Search and Retrieval—retrieval models, search
General Terms: Measurement, Experimentation, Theory
Keywords: Image Retrieval, Bag-Of-Visual-Words, Two-Stage Retrieval
1. INTRODUCTION
In CBIR, images are typically represented by either global or local features. Global features are capable of
generalizing an entire image with a single vector, describing color,
texture, or shape. Local features are computed at multiple points of
interest on an image and are capable of recognizing objects.
Compact Composite Descriptors (CCDs) are global image features capturing more than one type of information at the same time in a very compact representation. Their retrieval quality has been evaluated on several benchmark databases and found to be better than that of other descriptors, such as the MPEG-7 descriptors.
SURF local features are among the best interest-point descrip-
tors currently available. They have been shown to outperform other
well-known methods based on interest points such as SIFT and
GLOH. Nevertheless, in large-scale CBIR, it is clear that using the
SURF descriptor is storage-wise infeasible.
Bag-Of-Visual-Words (BOVW) is a representation of images built from a large set of local features. It is inspired by the bag-of-words model in text retrieval, where a document is represented by a set of distinct keywords. Analogously, in BOVW models, an image is represented by a set of distinct visual words derived from local features. The most modern implementation of BOVW suitable for a wide range of CBIR applications is the TOP-SURF descriptor. TOP-SURF combines interest points with visual words, resulting in a high-performance, compact descriptor.
Copyright is held by the author/owner(s).
SIGIR’11, July 24–28, 2011, Beijing, China.
The TOP-SURF descriptor initially extracts SURF local features
from images and then groups them into a desired number of clus-
ters. Each cluster can be seen as a visual word. All visual words are
stored in a visual dictionary. Next, tf.idf weighting is applied in or-
der to assign a score to all the visual words. The TOP-SURF image
descriptor is created by choosing the top-scoring visual words.
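The pipeline above can be sketched as follows. This is an illustrative NumPy mock-up, not the TOP-SURF toolkit's actual code: the plain k-means clustering, the function names, and the `top_n` parameter are our own assumptions for exposition.

```python
import numpy as np

def build_dictionary(local_features, n_words, rng=None):
    """Cluster local features (e.g. SURF vectors) into visual words
    with a few iterations of plain k-means (illustrative only)."""
    rng = np.random.default_rng(rng)
    centers = local_features[rng.choice(len(local_features), n_words, replace=False)]
    for _ in range(10):
        # assign each feature to its nearest visual word
        d = np.linalg.norm(local_features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for w in range(n_words):
            members = local_features[labels == w]
            if len(members):
                centers[w] = members.mean(axis=0)
    return centers

def describe(image_features, centers, doc_freq, n_docs, top_n=4):
    """tf.idf-weight the visual words of one image and keep
    only the top-scoring words as the image's descriptor."""
    d = np.linalg.norm(image_features[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)                              # visual word per feature
    tf = np.bincount(words, minlength=len(centers)) / len(words)
    idf = np.log(n_docs / (1 + doc_freq))                 # doc_freq per visual word
    scores = tf * idf
    top = np.argsort(scores)[::-1][:top_n]
    return {w: scores[w] for w in top if scores[w] > 0}
```

Keeping only the top-scoring words is what makes the final descriptor compact even when the dictionary itself is large.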
Nowadays, information collections are large and multimodal (for example, Wikipedia): a single topic may be covered in several languages and include non-textual media such as images, audio, and video. In an image retrieval system where users are assumed to target visual similarity, all modalities beyond image can be considered secondary; nevertheless, they can still provide useful information for improving CBIR.
In previous work, we experimented with two-stage image retrieval from a large multimodal database, first using a text modality to rank the collection and then performing CBIR only on the top-K items. The approach employed CCDs and was found to be significantly more effective than the text-only and image-only baselines when K was dynamically calculated with respect to the underlying query generality. Traditionally, multimodal databases have been handled by searching the modalities separately and fusing their results. While fusion has proven robust, we also found the two-stage approach to be more effective than fusion. Furthermore, a two-stage approach has an efficiency benefit: it cuts down greatly on expensive image operations.
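The two-stage scheme amounts to the following sketch, where the scorer callables are hypothetical placeholders for the text and visual retrieval models:

```python
def two_stage(topic, collection, text_score, image_score, k):
    """Stage 1: rank the whole collection by the text modality.
    Stage 2: re-rank only the top-K items by visual similarity;
    items below rank K keep their text-only order, so the expensive
    image comparisons are limited to K items."""
    ranked = sorted(collection, key=lambda d: text_score(topic, d), reverse=True)
    head, tail = ranked[:k], ranked[k:]
    head.sort(key=lambda d: image_score(topic, d), reverse=True)
    return head + tail
```

The efficiency benefit is visible in the structure: `image_score` is called only K times per query instead of once per collection item.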
The BOVW paradigm is fast becoming a widely used represen-
tation for CBIR, mainly because of its better retrieval effective-
ness over global feature representations on collections with im-
ages being near-duplicate to queries. In this experimental study,
we evaluate the performance of the BOVW approach in comparison to CCDs in a multistage multimodal setup. We intend to check whether this advantage of BOVW persists when CBIR is performed on a pre-filtered set of images with a high probability of relevance. Given that the high relevance of such a subset
is based on the text modality, it is likely to consist of more diverse
images than the top-retrieved images of CBIR systems.
2. TOP-SURF VS CCDS
We experimented with the ImageCLEF 2010 Wikipedia test col-
lection, which consists of 237,434 images associated with noisy
and incomplete user-supplied textual annotations. There are 70
test topics, each one consisting of a textual and a visual part with
one or more example images. The topics were assessed by visual similarity to the image examples. The collection is one of the largest benchmark image databases by today's standards. It is also
highly heterogeneous, containing color natural images, graphics,
grayscale images, etc., in a variety of sizes.
We indexed the images with two CCDs: the Joint Composite De-
scriptor (JCD) and the Spatial Color Distribution (SpCD) descriptor. We also indexed the images with the TOP-SURF descrip-
tor, employing two visual-word dictionaries: one with 10,000 and
the other with 200,000 visual words. For CBIR we used the JCD,
SpCD, and TOP-SURF, separately, as well as a late fusion setup of
JCD and SpCD explained next.
Let i be the index running over example images (i = 1, 2, ...) and j the index running over the visual descriptors (j ∈ {1, 2}). Thus, DESC_ji is the score of a collection item against the ith example image for the jth descriptor. We normalize DESC_ji values with MinMax, taking the maximum score seen across example images per descriptor. Assuming that the descriptors capture orthogonal information,
we add their scores per example image. Then, to take into account
all example images, the natural combination is to assign to each
collection image the maximum similarity seen from its compar-
isons to all example images; this can be interpreted as looking for
images similar to any of the example images. Summarizing, the
score s for a collection image against the topic, for the JCD/SpCD
fused setup, is defined as:

    s = max_i ( DESC_1i + DESC_2i ),

with the DESC_ji already MinMax-normalized as described above.
With the same reasoning, the max over example images (max_i) is also applied in the TOP-SURF runs to handle multi-image topics.
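The combination can be sketched in a few lines of NumPy; the array layout (descriptor × example image × collection item) is our own choice for illustration:

```python
import numpy as np

def fuse(scores):
    """scores: shape (n_desc, n_examples, n_items), raw similarity
    scores of every collection item against every example image under
    every descriptor. MinMax-normalize per descriptor (min/max taken
    across all its example images), sum over descriptors, then take
    the max over example images."""
    s = np.asarray(scores, dtype=float)
    lo = s.min(axis=(1, 2), keepdims=True)
    hi = s.max(axis=(1, 2), keepdims=True)
    norm = (s - lo) / np.where(hi > lo, hi - lo, 1.0)
    return norm.sum(axis=0).max(axis=0)  # sum over descriptors, max over examples
```

Taking the max over example images ranks a collection image highly if it is similar to any one of the topic's examples.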
For text indexing and retrieval, we employed the Lemur Toolkit
V4.11 and Indri V2.11 with the tf.idf retrieval model. We used the
default settings that come with these versions of the system except
that we enabled Krovetz stemming. We indexed only the English
annotations, and used only the English query of the topics.
First, the collection was ranked with the secondary text modal-
ity, and then the top-K results were re-ranked by the primary vi-
sual modality using CBIR methods based on the aforementioned
descriptors. The threshold K was calculated dynamically per query by thresholding on the estimated probabilities of relevance (prels). We report results for three prel thresholds θ, i.e. 0.800, 0.500, and 0.333; these were the best three performers in our previous experiments.
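A minimal sketch of this thresholding, assuming the estimated probabilities of relevance for the text ranking are already available in rank order (how they are estimated, via truncated score distributions, is beyond this sketch):

```python
def dynamic_k(prels, theta):
    """Walk the text ranking top-down and count items while the
    estimated probability of relevance stays at or above theta;
    K is the rank where it first drops below."""
    k = 0
    for p in prels:
        if p < theta:
            break
        k += 1
    return k
```

Lower values of θ admit more items into the re-ranked set, i.e. smaller θ means larger K.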
We evaluated on the top-1000 results with MAP, precision at 10
and 20. We tested the results for statistical significance against the
text-only baseline; image retrieval based on the text queries and annotations was previously found to perform better, by a wide margin, than CBIR-only in the same setup. For measuring efficiency,
we report the average matching time per topic. The results are pre-
sented in Table 1.
For all θ, the CCDs perform similarly to (JCD) or significantly better than (SpCD and JCD/SpCD) the text-only baseline, while the
TOP-SURF descriptor shows significant drops in effectiveness ir-
respective of dictionary size. The differences in effectiveness of
CCDs and TOP-SURF are larger in early precision than in MAP.
We also observe that the TOP-SURF effectiveness degrades with
increased dictionary size. Furthermore, TOP-SURF is more sensitive to the choice of θ: as θ decreases (i.e., for larger K), its effectiveness deteriorates faster than that of the CCDs.
Efficiency-wise, the experimental results show that although TOP-SURF uses a fast matching algorithm, it still cannot match the speed of the global descriptors.
[Table 1: Retrieval effectiveness and matching time. The best results per measure and retrieval type are in boldface. Significance was tested against the text-only baseline with a one-tailed bootstrap test, at significance levels 0.05, 0.01, and 0.001. Experiments were conducted on a Pentium Dual-Core E2200 (2.4 GHz) with 4 GB of memory.]
3. CONCLUSIONS
We investigated the performance of BOVW models, specifically the TOP-SURF image descriptor, in a two-stage multimodal retrieval setup, in comparison to CCDs. We found that CCDs are more effective, as well as faster in matching, than TOP-SURF. Although BOVW models are currently trendy because of
their ability to recognize objects and retrieve near-duplicate (to the
query) images, this advantage over global features such as CCDs is
diminished when visual diversity is enhanced by using a secondary
modality, such as text, to pre-filter images. In practice this means that in applications like Google Goggles, where a user queries with an
image in order to recognize a logo or a famous painting, BOVW
models should be more effective. But in applications like Google
Similar Images, where images are pre-filtered by text similarity,
global features should be more suitable.
REFERENCES
[1] A. Arampatzis, J. Kamps, and S. Robertson. Where to stop reading a ranked list? Threshold optimization using truncated score distributions. In SIGIR, pages 524–531. ACM, 2009.
[2] A. Arampatzis, K. Zagoris, and S. A. Chatzichristofis. Dynamic two-stage image retrieval from large multimodal databases. In ECIR, volume 6611 of Lecture Notes in Computer Science, pages 326–337. Springer, 2011.
[3] A. Arampatzis, K. Zagoris, and S. A. Chatzichristofis. Fusion vs. two-stage for multimodal retrieval. In ECIR, volume 6611 of Lecture Notes in Computer Science, pages 759–762. Springer, 2011.
[4] S. A. Chatzichristofis, A. Arampatzis, and Y. S. Boutalis. Investigating the behavior of compact composite descriptors in early fusion, late fusion, and distributed image retrieval. Radioengineering, 19(4):725–733, 2010.
[5] O. G. Cula and K. J. Dana. Compact representation of bidirectional texture functions. In CVPR (1).
[6] B. Thomee, E. M. Bakker, and M. S. Lew. TOP-SURF: a visual words toolkit. In ACM Multimedia, pages 1473–1476, 2010.