Retrieval of Multimedia Objects by Fusing
Multiple Modalities
Ilias Gialampoukidis
ITI-CERTH
Thessaloniki, Greece
heliasgj@iti.gr
Anastasia Moumtzidou
ITI-CERTH
Thessaloniki, Greece
moumtzid@iti.gr
Theodora Tsikrika
ITI-CERTH
Thessaloniki, Greece
theodora.tsikrika@iti.gr
Stefanos Vrochidis
ITI-CERTH
Thessaloniki, Greece
stefanos@iti.gr
Ioannis Kompatsiaris
ITI-CERTH
Thessaloniki, Greece
ikom@iti.gr
ABSTRACT
Searching for multimedia objects with heterogeneous modalities is critical for the construction of effective multimedia retrieval systems. Towards this direction, we propose a framework for the multimodal fusion of visual and textual similarities, based on visual features, visual concepts and textual concepts. Our method is compared to a baseline that fuses only two modalities, yet integrates all early, late, linearly weighted, diffusion-based and graph-based models in one unifying framework. Our framework integrates more than two modalities and high-level information, so as to retrieve multimedia objects enriched with high-level textual and visual concepts, in response to a multimodal query. The experimental comparison is performed under the same memory complexity, on two multimedia collections, in the multimedia retrieval task. The results show that our method outperforms the baseline in terms of Mean Average Precision.
1. INTRODUCTION
Multimedia retrieval systems are becoming more and more
popular as there is a need for effective and efficient ac-
cess to very large and diverse collections of multimedia ob-
jects, such as videos (e.g. YouTube and Netflix) and images
(e.g. Flickr). Searching in such collections is challenging due
to the heterogeneous media that each item in the collec-
tion may contain (e.g. text, images, and videos). Therefore,
these multiple media and the different features that can be
extracted from them, e.g. low-level visual descriptors (based
on color, shape, location, etc.), low-level textual features
(image captions, video subtitles, etc.), high-level textual or
visual features (named entities, concepts, etc.), or metadata
(timestamps, tags, etc.), need to be combined to support
various multimedia analysis tasks, such as retrieval, sum-
marization, clustering, and classification; this combination
of multiple modalities is referred to as multimodal fusion.
Multimodal fusion has been widely investigated and is
typically performed at the feature level (early fusion), at
the decision level (late fusion), or in a hybrid manner (see
[3] for a survey). This work focuses on the decision level
or late fusion of multiple modalities for the multimedia re-
trieval task; to this end, several multimedia and cross-media
approaches have been proposed. Metric fusion [13] is a ran-
dom walk approach designed to fuse different “views” of the
same modality, such as SIFT, GIST and LBP visual features;
our focus though is on the combination of diverse modali-
ties, such as textual and visual similarity scores. A recent
video retrieval framework [9] proposes to fuse textual simi-
larity scores based on video subtitles with visual similarity
scores based on visual concepts in a simple non-linear way.
Other approaches have been motivated by Latent Dirich-
let Allocation (LDA) and either generate a joint topic prob-
ability distribution for all modalities, or combine the topic
distribution per modality [4]. Each query is related to a
topic and the retrieved documents are assigned a topic dis-
tribution. If the topic distribution of a retrieved document
is maximized at the query’s topic, the document is consid-
ered to be relevant. Convolutional Neural Networks (CNN)
have also been used to learn high-level features and combine
two modalities (text-image pairs) for cross-modal retrieval
[12]. A Partial Least Squares (PLS) based approach [10]
that maps different modalities into a common latent space
has also been used in an image retrieval task. Contrary to
these approaches that require training, our focus is on un-
supervised multimodal fusion. Furthermore, many of these
approaches can only support monomodal queries, whereas
our goal is to also cater for queries consisting of multiple
modalities. It should be noted that multimedia search sys-
tems are often interactive and user feedback is incorporated
for progressively refining the query (e.g. [14]); such relevance
feedback approaches are beyond the scope of this work.
One unsupervised multimedia retrieval approach that has
been recently proposed [2] combines textual and visual sim-
ilarities by integrating into a unifying graph-based frame-
work (i) a cross-media approach that not only considers the
similarity of the query to the objects in the collection, but
also the similarities among them [1] and (ii) a random walk
approach for multimodal fusion, and in particular a video
retrieval approach that links two objects (i.e. nodes in the
graph) with a weighted edge if there exists a multimodal
similarity between them [6]. To decrease the complexity,
the framework assumes that the textual part of a multi-
modal query “is the main semantic source with regard to
the user information” [2] and thus it first filters out any
object not in the top-l retrieved based on their textual similarity scores, and then applies graph-based techniques only to these l selected items. This graph-based framework in-
cludes as special cases all well-known early, late, weighted,
diffusion-based, as well as graph-based fusion models, does
not require users’ relevance feedback, and has been evalu-
ated in the context of multimedia retrieval tasks. Thus far,
though, this framework has only considered two modalities.
2. METHODOLOGY
First, we briefly describe the graph-based framework proposed in [2] that supports the fusion of two modalities and, then, we present its extension to $M$ modalities.
2.1 Background
As described above, the framework first selects the top-l
multimedia documents based on their textual similarity to
query q. All subsequent operations are performed on these
$l$ selected documents. First, the $1 \times l$ query-based similarity vectors on the textual and visual modalities, $s_t(q,.)$ and $s_v(q,.)$, respectively, are computed and are normalized so that their elements sum to one. Then, the $l \times l$ textual and visual similarity matrices, $S_t$ and $S_v$, respectively, are computed for these documents and are normalized using $(s(d,d') - \min s(d,.)) / (\max s(d,.) - \min s(d,.))$, where $s(d,d')$ denotes the similarity between two documents $d$ and $d'$. By denoting the regular matrix multiplication operation as "$\cdot$" and the $(i,j)$ element of a matrix $A$ as $A[i,j]$, this graph-based framework sets $x_{(0)} = s_t(q,.)$, $y_{(0)} = s_v(q,.)$, and defines the following update rule:

$$x_{(i)} \propto K(x_{(i-1)}, k) \cdot [(1-\gamma)\, D \cdot (\beta S_t + (1-\beta) S_v) + \gamma\, e \cdot s_t(q,.)]$$
$$y_{(i)} \propto K(y_{(i-1)}, k) \cdot [(1-\gamma)\, D \cdot (\beta S_v + (1-\beta) S_t) + \gamma\, e \cdot s_v(q,.)]$$

where $D$ is the row-normalizing matrix, $e$ is the $l \times 1$ vector of ones, and the operator $K(x, k)$ takes as input a vector $x$ and assigns a zero value to elements with score strictly lower than the $k$-th highest score in $x$. Following $i$ iterations, the final ranking with respect to query $q$ is given by the linear combination of $s_t$, $s_v$, $x_{(i)}$ and $y_{(i)}$:

$$\mathrm{score}(q) = \alpha_t s_t(q,.) + \alpha_v s_v(q,.) + \alpha_{tv} x_{(i)} + \alpha_{vt} y_{(i)} \quad (1)$$

under the restriction that $\alpha_t + \alpha_v + \alpha_{tv} + \alpha_{vt} = 1$.
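For concreteness, the row-wise min-max normalization and the pruning operator $K(\cdot, k)$ used above can be written in a few lines of NumPy; this is a minimal sketch with our own function names (minmax_rows, prune_top_k), not the authors' implementation.

```python
import numpy as np

def minmax_rows(S):
    """(s(d,d') - min s(d,.)) / (max s(d,.) - min s(d,.)), applied row by row."""
    lo, hi = S.min(axis=1, keepdims=True), S.max(axis=1, keepdims=True)
    return (S - lo) / np.maximum(hi - lo, 1e-12)   # guard against constant rows

def prune_top_k(x, k):
    """K(x, k): zero out entries strictly lower than the k-th highest score in x."""
    thr = np.sort(x)[::-1][k - 1]
    return np.where(x >= thr, x, 0.0)

# Example: K(x, 2) keeps only the two highest scores.
print(prune_top_k(np.array([0.1, 0.4, 0.3, 0.2]), 2))   # [0.  0.4 0.3 0. ]
```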
This framework includes all well-known weighted, graph-based and diffusion-based fusion techniques as special cases of its parameters. For $\alpha_{tv} = \alpha_{vt} = 0$, Equation (1) becomes the weighted mean fusion model. For $\alpha_t = \alpha_v = \alpha_{vt} = 0$, $\gamma = 0$ and a sufficiently large number of iterations $i$, the model is the random walk approach [6]. For $\alpha_v = \alpha_{tv} = 0$, $\beta = 0$, $\gamma = 0$, and $i = 1$, the model is the cross-media approach [1].
In the experiments reported in [2], $\beta = 0$, $\gamma = 0.3$, $i = 1$ and $k = 10$ are recommended as the default parameters when fusing the top-$l$ ($l = 1000$) results returned by text-based search. The weights in the linear combination of $s_t$, $s_v$, $x_{(1)}$ and $y_{(1)}$ are tuned in $\{0.1, 0.2, \dots, 0.9\}$ and the best values are compared to the uniform weighting strategy ($\alpha_t = \alpha_v = \alpha_{tv} = \alpha_{vt} = 0.25$). The results show only an incremental increase in Mean Average Precision (MAP) on one dataset and no increase on the other datasets, indicating the potential effectiveness of this uniform weighting scheme.
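For reference, these special cases correspond to parameter settings of Equation (1) along the following lines; this is a hypothetical, illustrative mapping, in which only the constraints named above are taken from [1, 2, 6] and the remaining free parameters are set arbitrarily.

```python
# Illustrative parameter settings of Equation (1) for the special cases above;
# free parameters (e.g. alpha_t vs. alpha_vt in the cross-media case) are set
# arbitrarily, subject to alpha_t + alpha_v + alpha_tv + alpha_vt = 1.
special_cases = {
    # alpha_tv = alpha_vt = 0: weighted mean fusion of s_t and s_v.
    "weighted_mean": dict(alpha_t=0.5, alpha_v=0.5, alpha_tv=0.0, alpha_vt=0.0),
    # alpha_t = alpha_v = alpha_vt = 0, gamma = 0, many iterations: random walk [6].
    "random_walk": dict(alpha_t=0.0, alpha_v=0.0, alpha_tv=1.0, alpha_vt=0.0,
                        gamma=0.0, iterations=100),
    # alpha_v = alpha_tv = 0, beta = 0, gamma = 0, i = 1: cross-media approach [1].
    "cross_media": dict(alpha_t=0.5, alpha_v=0.0, alpha_tv=0.0, alpha_vt=0.5,
                        beta=0.0, gamma=0.0, iterations=1),
    # Default setting recommended in [2] (uniform weights).
    "defaults_of_2": dict(alpha_t=0.25, alpha_v=0.25, alpha_tv=0.25, alpha_vt=0.25,
                          beta=0.0, gamma=0.3, iterations=1, k=10),
}
```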
2.2 Multimedia Retrieval Using M Modalities
Our aim is to extend the aforementioned graph-based framework to more than two modalities. Assuming that there are $M$ modalities with corresponding $1 \times l$ query-based similarity vectors $s_m(q,.)$ and $l \times l$ similarity matrices $S_m$, $m = \{1, 2, \dots, M\}$, we compute the following contextual similarity matrix for each modality $m$:
$$C_m = \Big(1 - \sum_{w=1}^{M-1} \beta_w\Big) S_m + \sum_{w=1}^{M-1} \beta_w S_{w \neq m} \quad (2)$$
The matrices $C_m$ of Equation (2) are row-normalized so as to obtain the corresponding row-stochastic transition probability matrices $P_m$ with elements:

$$P_m[i,j] = \frac{C_m[i,j]}{\sum_{j'=1}^{l} C_m[i,j']} \quad (3)$$
For all modalities $m$, we set $x^m_{(0)} = s_m(q,.)$, $m = \{1, \dots, M\}$, and, motivated by [2], we define the following update rule:

$$x^m_{(i)} \propto K(x^m_{(i-1)}, k) \cdot \Big[\Big(1 - \sum_{\substack{w=1 \\ w \neq m}}^{M} \gamma_w\Big) P_m + \sum_{\substack{w=1 \\ w \neq m}}^{M} \gamma_w\, e \cdot s_w(q,.)\Big] \quad (4)$$
Inspired by the model of Equation (1), we finally propose the vector of relevance scores in response to query $q$:

$$\mathrm{score}(q) = \sum_{m=1}^{M} \alpha_m s_m(q,.) + \sum_{m=1}^{M} \alpha'_m x^m_{(i)} \quad (5)$$

where

$$\sum_{m=1}^{M} \alpha_m + \sum_{m=1}^{M} \alpha'_m = 1 \quad (6)$$
For three modalities, for example, Equation (2) becomes:

$$C_1 = (1 - \beta_1 - \beta_2) S_1 + \beta_1 S_2 + \beta_2 S_3$$
$$C_2 = (1 - \beta_1 - \beta_2) S_2 + \beta_1 S_1 + \beta_2 S_3$$
$$C_3 = (1 - \beta_1 - \beta_2) S_3 + \beta_1 S_1 + \beta_2 S_2$$

The contextual similarity matrices $C_m$, $m = \{1, 2, 3\}$, are row-normalized to obtain $P_m$, $m = \{1, 2, 3\}$, using Equation (3), and the update rule (Equation (4)) becomes:

$$x^1_{(i)} \propto K(x^1_{(i-1)}, k) \cdot [(1 - \gamma_2 - \gamma_3) P_1 + \gamma_2\, e \cdot s_2(q,.) + \gamma_3\, e \cdot s_3(q,.)]$$
$$x^2_{(i)} \propto K(x^2_{(i-1)}, k) \cdot [(1 - \gamma_1 - \gamma_3) P_2 + \gamma_1\, e \cdot s_1(q,.) + \gamma_3\, e \cdot s_3(q,.)]$$
$$x^3_{(i)} \propto K(x^3_{(i-1)}, k) \cdot [(1 - \gamma_1 - \gamma_2) P_3 + \gamma_1\, e \cdot s_1(q,.) + \gamma_2\, e \cdot s_2(q,.)]$$

Finally, $\mathrm{score}(q)$ is computed as in Equation (5), which linearly combines $s_m(q,.)$ and $x^m_{(i)}$ for $m = \{1, 2, 3\}$.
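To illustrate Equations (2)-(5), a minimal NumPy sketch of the $M$-modality fusion follows, instantiated for $M = 3$ on random toy data; the helper names and the renormalization used to resolve the proportionality in Equation (4) are our own assumptions, not part of the original formulation.

```python
import numpy as np

def minmax_rows(S):
    """Row-wise min-max normalization, as in the Section 2.1 sketch."""
    lo, hi = S.min(axis=1, keepdims=True), S.max(axis=1, keepdims=True)
    return (S - lo) / np.maximum(hi - lo, 1e-12)

def prune_top_k(x, k):
    """K(x, k): zero out scores strictly lower than the k-th highest score."""
    thr = np.sort(x)[::-1][k - 1]
    return np.where(x >= thr, x, 0.0)

def fuse_m_modalities(S_list, s_list, alphas, alphas_prime, betas, gammas,
                      k=10, iterations=1):
    """S_list: M similarity matrices (l x l); s_list: M query vectors (length l).
    betas[m][w] / gammas[m][w]: weight of modality w inside the update of m."""
    M, l = len(S_list), len(s_list[0])
    e = np.ones((l, 1))
    s = [v / v.sum() for v in s_list]              # query vectors sum to one
    S = [minmax_rows(A) for A in S_list]
    # Equation (2): contextual similarity matrix of each modality.
    C = [(1 - sum(betas[m].values())) * S[m]
         + sum(b * S[w] for w, b in betas[m].items()) for m in range(M)]
    # Equation (3): row-stochastic transition probability matrices.
    P = [A / A.sum(axis=1, keepdims=True) for A in C]
    x = [v.copy() for v in s]
    for _ in range(iterations):
        for m in range(M):
            # Equation (4): pruned random-walk step with query "teleport" terms.
            T = ((1 - sum(gammas[m].values())) * P[m]
                 + sum(g * (e @ s[w][None, :]) for w, g in gammas[m].items()))
            x[m] = prune_top_k(x[m], k) @ T
            x[m] = x[m] / x[m].sum()               # resolve the proportionality
    # Equation (5): linear combination of query similarities and walk scores.
    return (sum(a * v for a, v in zip(alphas, s))
            + sum(a * v for a, v in zip(alphas_prime, x)))

# Toy run for M = 3 (l = 815 documents of random data), with uniform weights
# alpha_m = alpha'_m = 1/6 and beta_w = gamma_w = 1/3, close to Section 3.3.
rng = np.random.default_rng(0)
l = 815
S_list = [rng.random((l, l)) for _ in range(3)]
s_list = [rng.random(l) for _ in range(3)]
betas = [{w: 1 / 3 for w in range(3) if w != m} for m in range(3)]
gammas = [{w: 1 / 3 for w in range(3) if w != m} for m in range(3)]
scores = fuse_m_modalities(S_list, s_list, alphas=[1 / 6] * 3,
                           alphas_prime=[1 / 6] * 3, betas=betas, gammas=gammas)
top10 = np.argsort(scores)[::-1][:10]              # indices of the top-10 documents
```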
Figure 1 depicts our multimedia retrieval framework in the particular case of fusing three modalities, namely visual features, visual concepts and textual concepts. The top-$l$ documents in the filtering step are obtained by using the textual concepts to index and retrieve each document in response to the query $q$ with the open-source Apache Lucene system (https://lucene.apache.org/core/). Then, the $l \times l$ similarity matrices $S_1$ on visual descriptors, $S_2$ on visual concepts and $S_3$ on textual concepts, and the corresponding $1 \times l$ query-based similarity vectors $s_1(q,.)$ on visual descriptors, $s_2(q,.)$ on visual concepts and $s_3(q,.)$ on textual concepts, are computed. Finally, we fuse these similarity matrices and the query-based similarity vectors to obtain a single relevance score vector in response to query $q$: $\mathrm{score}(q)$.
Figure 1: Multimedia retrieval framework fusing three modalities. (Offline: extraction of visual descriptors, visual concepts and textual concepts from the multimedia database. Online: given a multimedia-enriched object query q, Lucene search and top-l filtering on the textual concepts, computation of the similarities S_1, s_1(q,.), S_2, s_2(q,.) and S_3, s_3(q,.), followed by their fusion into score(q).)
Memory Complexity. The memory complexity is $O(l^2)$ for the computation of each similarity matrix $S_m$, $O(l)$ for each similarity vector $s_m(q,.)$ and $O(kl)$ for each $x^m_{(i)}$, $m = 1, 2, \dots, M$; thus the overall memory complexity is quadratic in $l$: $O(Ml^2 + Mkl + Ml)$. In order to compare the baseline method directly with our retrieval framework with $M$ modalities, under the same memory complexity, we seek the number of filtered documents $l'$ such that $Ml'^2 + Mkl' + Ml' = 2l^2 + 2kl + 2l$. The non-negative solution is:

$$l' = \sqrt{\frac{(k+1)^2}{4} + \frac{2l^2 + 2kl + 2l}{M}} - \frac{k+1}{2} \quad (7)$$

For example, for $M = 3$, $k = 10$, $l = 1000$, we find $l' \approx 815$. We also observe that even for 15 modalities, the number of top-$l'$ filtered documents remains greater than 300; hence a significant number of documents is involved in the fusion, even in the case of several modalities. We shall examine whether the fusion of three modalities in this framework outperforms the baseline approach, without additional memory cost.
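As a quick numerical check of Equation (7), a minimal sketch using the values stated above:

```python
from math import sqrt

def l_prime(M, k, l):
    """Non-negative root of M*l'^2 + M*k*l' + M*l' = 2*l^2 + 2*k*l + 2*l (Eq. (7))."""
    return sqrt((k + 1) ** 2 / 4 + (2 * l ** 2 + 2 * k * l + 2 * l) / M) - (k + 1) / 2

print(int(l_prime(M=3, k=10, l=1000)))   # 815, the value used in the experiments
print(int(l_prime(M=15, k=10, l=1000)))  # 361, still well above 300
```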
3. EXPERIMENTS
In this section, we describe the datasets used for evalua-
tion, the features extracted from their multimedia objects,
the employed similarities, and the experimental results.
3.1 Evaluation Datasets
We evaluate our framework using two test collections: the
IAPR-TC12 (http://imageclef.org/photodata) and the WIKI11 (http://www.imageclef.org/wikidata), both created in the context of the ImageCLEF benchmarking activities. The IAPR-
TC12 collection consists of (i) 20,000 images, each annotated
with a title and a description, and also various metadata
(e.g. date, location, etc.) not considered in this work and
(ii) the 60 topics created in ImageCLEF 2007, each with
a title and three image examples. The WIKI11 collection
consists of (i) 237,434 images extracted from Wikipedia and
their user-generated captions/descriptions and (ii) 50 topics,
each with a title and one to five query images.
3.2 Features and Monomodal Similarities
We use the following state-of-the-art features and similar-
ity functions for each modality in the documents and queries;
it should be noted though that our method is capable of fus-
ing any similarity score obtained by any kind of features. As
visual descriptors (VD), we extract the scale-invariant local
descriptors RGB-SIFT [11], which are then locally aggre-
gated into one vector representation using VLAD encoding
[7], and employ a similarity function based on the Euclidean
distance [5]. We also employ the 346 high-level visual con-
cepts (VC) introduced in the TRECVID 2011 semantic indexing task (http://www-nlpir.nist.gov/projects/tv2011/tv11.sin.346.concepts.simple.txt), which are detected by multiple independent concept detectors that use the aforementioned visual descriptors as input to Logistic Regression classifiers whose outputs are averaged and further refined [8]. Finally, we use the title/caption of each image so as to extract textual concepts (TC) using the DBpedia Spotlight (http://dbpedia-spotlight.github.io/demo/) annotation tool. For the cases of visual and textual concepts, similarities are computed based on Lucene's retrieval function.
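To illustrate the visual-descriptor pipeline, here is a minimal sketch of VLAD aggregation and a Euclidean-distance-based similarity; it assumes RGB-SIFT descriptors and a k-means codebook are already available, and the 1/(1+d) similarity is a placeholder rather than necessarily the exact function of [5].

```python
import numpy as np

def vlad_encode(local_descriptors, codebook):
    """Aggregate local descriptors (n x d) into a single VLAD vector (K*d,)
    given a k-means codebook (K x d): sum of residuals per nearest centroid."""
    K, d = codebook.shape
    # Assign each descriptor to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(local_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((K, d))
    for k in range(K):
        members = local_descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - codebook[k]).sum(axis=0)   # residual sum
    vlad = vlad.ravel()
    return vlad / max(np.linalg.norm(vlad), 1e-12)          # L2 normalization

def euclidean_similarity(u, v):
    """Turn a Euclidean distance into a similarity in (0, 1]."""
    return 1.0 / (1.0 + np.linalg.norm(u - v))

# Toy usage with random data standing in for RGB-SIFT descriptors.
rng = np.random.default_rng(0)
codebook = rng.random((64, 128))                 # K = 64 visual words, d = 128
query_vlad = vlad_encode(rng.random((200, 128)), codebook)
doc_vlad = vlad_encode(rng.random((150, 128)), codebook)
print(euclidean_similarity(query_vlad, doc_vlad))
```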
3.3 Experimental Setup and Results
We evaluate the MAP of our framework that fuses three
modalities against the baseline (Section 2.1) that fuses two
modalities. As the baseline models all well-known weighted,
graph-based and diffusion-based fusion techniques as special
cases of its parameters α, β, and γ, the best performance
among all these fusion techniques coincides with the best
performance of the weight parameters α, β and γ. Therefore,
we tune these parameters so as to present, in Table 1, the
best MAP scores of the fusion using two modalities. First,
we combine textual concepts with visual descriptors (TC &
VD), and then, we combine textual with visual concepts (TC
& VC); the combination of visual descriptors and concepts
(VD & VC) is not considered as it reduces the problem to the
classic image retrieval task and no initial filtering can be per-
formed with respect to the textual modality. We adopt the
default parameters reported in [2], i.e. $k = 10$, one iteration ($i = 1$) and uniform weights ($\alpha_t = \alpha_v = \alpha_{tv} = \alpha_{vt} = 1/4$).
To compare the baseline method directly with our framework with three modalities under the same memory complexity, we use $l' = 815$ (see Equation (7)). For $m = \{1, 2, 3\}$, we adopt a uniform weighting strategy ($\alpha_m = \alpha'_m = 1/6$) and we tune the parameters $\gamma_m$, while the parameters $\beta_m$ are kept constant (and equal to $1/3$); the results for different values of $\gamma_m$ are reported in Table 1.
We observe that our framework outperforms the baseline method for several values of the parameters $\gamma_m$, $m = \{1, 2, 3\}$, under the same memory cost. In particular, MAP increases by up to 13.44% for WIKI11 and up to 15.71% for IAPR-TC12. We further tuned the parameters $\alpha_m$, $\alpha'_m$ and $\beta_m$, $m = \{1, 2, 3\}$, and did not observe any further increase in MAP, which implies that no other weighted, graph-based or diffusion-based fusion technique outperforms our framework. The small differences in MAP when employing the best parameters $\gamma_m$, $m = \{1, 2, 3\}$, compared to the uniform ones ($\gamma_m = 1/3$) allow for setting $\gamma_m = 1/3$, $m = \{1, 2, 3\}$, without a significant decrease in MAP (less than 2%).
Table 1: MAP values for both datasets; values marked with an asterisk (*) outperform both two-modality baselines.

M   γ1     γ2     γ3      WIKI11    IAPR
2   TC & VD (best α, β, γ)  0.3654    0.2769
2   TC & VC (best α, β, γ)  0.3472    0.2729
3   0.00   0.00   1.00    0.3637    0.3065*
3   0.00   1.00   0.00    0.3433    0.2858*
3   1.00   0.00   0.00    0.3855*   0.2518
3   0.50   0.50   0.00    0.4083*   0.2912*
3   0.50   0.25   0.25    0.4145*   0.2970*
3   0.25   0.50   0.25    0.4029*   0.3136*
3   0.25   0.25   0.50    0.4048*   0.3204*
3   0.34   0.33   0.33    0.4105*   0.3148*
4. CONCLUSION
We presented an unsupervised graph-based framework that
fuses M modalities for multimedia retrieval. In this work, we
consider the fusion of three modalities (textual concepts, vi-
sual descriptors, and visual concepts), but the overall frame-
work is directly applicable to any number of modalities. We
also presented a theoretical formula which provides the op-
timal number of documents that need to be initially filtered,
so that the memory cost in the case of M modalities remains
the same as in the case of two modalities. The experiments
have shown that the MAP improves in the case of three
modalities (up to 15.71% in some cases). We also observed
that the results of a uniform weighting strategy do not sig-
nificantly differ from those obtained using the best weights.
In the future, we plan to evaluate our framework in multilin-
gual settings using language agnostic features, such as tex-
tual concepts either in the language of the query or mapped
to a common language using multilingual ontologies.
5. ACKNOWLEDGMENTS
This work was partially supported by the European Commission under the projects MULTISENSOR (FP7-610411) and KRISTINA (H2020-645012).
6. REFERENCES
[1] J. Ah-Pine, S. Clinchant, and G. Csurka. Comparison
of several combinations of multimodal and diversity
seeking methods for multimedia retrieval. In
Multilingual Information Access Evaluation II.
Multimedia Experiments, pages 124–132. Springer,
2009.
[2] J. Ah-Pine, G. Csurka, and S. Clinchant.
Unsupervised visual and textual information fusion in
CBMIR using graph-based methods. ACM Transactions
on Information Systems (TOIS), 33(2):9, 2015.
[3] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S.
Kankanhalli. Multimodal fusion for multimedia
analysis: a survey. Multimedia systems, 16(6):345–379,
2010.
[4] J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia,
G. R. Lanckriet, R. Levy, and N. Vasconcelos. On the
role of correlation and abstraction in cross-modal
multimedia retrieval. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 36(3):521–535,
2014.
[5] J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and
W. Niblack. Efficient color histogram indexing for
quadratic form distance functions. Pattern Analysis
and Machine Intelligence, IEEE Transactions on,
17(7):729–736, 1995.
[6] W. H. Hsu, L. S. Kennedy, and S.-F. Chang. Video
search reranking through random walk over
document-level context graph. In Proceedings of the
15th international conference on Multimedia, pages
971–980. ACM, 2007.
[7] H. Jégou, M. Douze, C. Schmid, and P. Pérez.
Aggregating local descriptors into a compact image
representation. In Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on,
pages 3304–3311. IEEE, 2010.
[8] B. Safadi and G. Quénot. Re-ranking by local
re-scoring for video indexing and retrieval. In
Proceedings of the 20th ACM international conference
on Information and knowledge management, pages
2081–2084. ACM, 2011.
[9] B. Safadi, M. Sahuguet, and B. Huet. When textual
and visual information join forces for multimedia
retrieval. In Proceedings of International Conference
on Multimedia Retrieval, page 265. ACM, 2014.
[10] B. Siddiquie, B. White, A. Sharma, and L. S. Davis.
Multi-modal image retrieval for complex queries using
small codes. In Proceedings of International
Conference on Multimedia Retrieval, page 321. ACM,
2014.
[11] K. E. Van De Sande, T. Gevers, and C. G. Snoek.
Evaluating color descriptors for object and scene
recognition. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 32(9):1582–1596,
2010.
[12] J. Wang, Y. He, C. Kang, S. Xiang, and C. Pan.
Image-text cross-modal retrieval via modality-specific
feature learning. In Proceedings of the 5th ACM on
International Conference on Multimedia Retrieval,
pages 347–354. ACM, 2015.
[13] Y. Wang, X. Lin, and Q. Zhang. Towards metric
fusion on multi-view data: a cross-view based graph
random walk approach. In Proceedings of the 22nd
ACM international conference on Conference on
information & knowledge management, pages 805–810.
ACM, 2013.
[14] S. Xu, H. Li, X. Chang, S.-I. Yu, X. Du, X. Li,
L. Jiang, Z. Mao, Z. Lan, S. Burger, et al. Incremental
multimodal query construction for video search. In
Proceedings of the 5th ACM on International
Conference on Multimedia Retrieval, pages 675–678.
ACM, 2015.