Effective multimedia retrieval requires the combination of the heterogeneous media contained within multimedia objects and the features that can be extracted from them. To this end, we extend a unifying framework that integrates all well-known weighted, graph-based, and diffusion-based fusion techniques that combine two modalities (textual and visual similarities) to model the fusion of multiple modalities. We also provide a theoretical formula for the optimal number of documents that need to be initially selected, so that the memory cost in the case of multiple modalities remains the same as in the case of two modalities. Experiments using two test collections and three modalities (similarities based on visual descriptors, visual concepts, and textual concepts) indicate improvements in the effectiveness over bimodal fusion under the same memory complexity.
Retrieval of Multimedia Objects by Fusing
Multiple Modalities
Ilias Gialampoukidis
ITI-CERTH
Thessaloniki, Greece
heliasgj@iti.gr
Anastasia Moumtzidou
ITI-CERTH
Thessaloniki, Greece
moumtzid@iti.gr
Theodora Tsikrika
ITI-CERTH
Thessaloniki, Greece
theodora.tsikrika@iti.gr
Stefanos Vrochidis
ITI-CERTH
Thessaloniki, Greece
stefanos@iti.gr
Ioannis Kompatsiaris
ITI-CERTH
Thessaloniki, Greece
ikom@iti.gr
ABSTRACT
Searching for multimedia objects with heterogeneous modal-
ities is critical for the construction of effective multimedia re-
trieval systems. Towards this direction, we propose a frame-
work for the multimodal fusion of visual and textual similar-
ities, based on visual features, visual concepts and textual
concepts. Our method is compared to the baseline method
that only fuses two modalities but integrates all early, late,
linearly weighted, diffusion and graph-based models in one
unifying framework. Our framework integrates more than
two modalities and high-level information, so as to retrieve
multimedia objects enriched with high-level textual and vi-
sual concepts, in response to a multimodal query. The ex-
perimental comparison is done under the same memory com-
plexity, in two multimedia collections in the multimedia re-
trieval task. The results have shown that we outperform the
baseline method, in terms of Mean Average Precision.
1. INTRODUCTION
Multimedia retrieval systems are becoming more and more
popular as there is a need for effective and efficient ac-
cess to very large and diverse collections of multimedia ob-
jects, such as videos (e.g. YouTube and Netflix) and images
(e.g. Flickr). Searching in such collections is challenging due
to the heterogeneous media that each item in the collec-
tion may contain (e.g. text, images, and videos). Therefore,
these multiple media and the different features that can be
extracted from them, e.g. low-level visual descriptors (based
on color, shape, location, etc.), low-level textual features
(image captions, video subtitles, etc.), high-level textual or
visual features (named entities, concepts, etc.), or metadata
(timestamps, tags, etc.), need to be combined to support
various multimedia analysis tasks, such as retrieval, summarization,
clustering, and classification; this combination of multiple
modalities is referred to as multimodal fusion.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
ICMR'16, June 06-09, 2016, New York, NY, USA
© 2016 ACM. ISBN 978-1-4503-4359-6/16/06...$15.00
DOI: http://dx.doi.org/10.1145/2911996.2912068
Multimodal fusion has been widely investigated and is
typically performed at the feature level (early fusion), at
the decision level (late fusion), or in a hybrid manner (see
[3] for a survey). This work focuses on the decision level
or late fusion of multiple modalities for the multimedia re-
trieval task; to this end, several multimedia and cross-media
approaches have been proposed. Metric fusion [13] is a ran-
dom walk approach designed to fuse different “views” of the
same modality, such as SIFT, GIST and LBP visual features;
our focus though is on the combination of diverse modali-
ties, such as textual and visual similarity scores. A recent
video retrieval framework [9] proposes to fuse textual simi-
larity scores based on video subtitles with visual similarity
scores based on visual concepts in a simple non-linear way.
Other approaches have been motivated by Latent Dirich-
let Allocation (LDA) and either generate a joint topic prob-
ability distribution for all modalities, or combine the topic
distribution per modality [4]. Each query is related to a
topic and the retrieved documents are assigned a topic dis-
tribution. If the topic distribution of a retrieved document
is maximized at the query’s topic, the document is consid-
ered to be relevant. Convolutional Neural Networks (CNN)
have also been used to learn high-level features and combine
two modalities (text-image pairs) for cross-modal retrieval
[12]. A Partial Least Squares (PLS) based approach [10]
that maps different modalities into a common latent space
has also been used in an image retrieval task. Contrary to
these approaches that require training, our focus is on un-
supervised multimodal fusion. Furthermore, many of these
approaches can only support monomodal queries, whereas
our goal is to also cater for queries consisting of multiple
modalities. It should be noted that multimedia search sys-
tems are often interactive and user feedback is incorporated
for progressively refining the query (e.g. [14]); such relevance
feedback approaches are beyond the scope of this work.
One unsupervised multimedia retrieval approach that has
been recently proposed [2] combines textual and visual
similarities by integrating into a unifying graph-based
framework (i) a cross-media approach that not only considers the
similarity of the query to the objects in the collection, but
also the similarities among them [1] and (ii) a random walk
approach for multimodal fusion, and in particular a video
retrieval approach that links two objects (i.e. nodes in the
graph) with a weighted edge if there exists a multimodal
similarity between them [6]. To decrease the complexity,
the framework assumes that the textual part of a multimodal
query “is the main semantic source with regard to
the user information” [2] and thus it first filters out any
object not in the top-$l$ retrieved based on their textual
similarity scores, and then applies graph-based techniques only
to these $l$ selected items. This graph-based framework
includes as special cases all well-known early, late, weighted,
diffusion-based, as well as graph-based fusion models, does
not require users' relevance feedback, and has been
evaluated in the context of multimedia retrieval tasks. Thus far,
though, this framework has only considered two modalities.
This is a draft version of the paper. The final version is available at:
http://dl.acm.org/citation.cfm?id=2912068
In: Proceedings of the 2016 ACM on International Conference on
Multimedia Retrieval (pp. 359-362)
2. METHODOLOGY
First, we briefly describe the graph-based framework proposed
in [2] that supports the fusion of two modalities and,
then, we present its extension to $M$ modalities.
2.1 Background
As described above, the framework first selects the top-$l$
multimedia documents based on their textual similarity to
query $q$. All subsequent operations are performed on these
$l$ selected documents. First, the $1 \times l$ query-based
similarity vectors on the textual and visual modalities,
$s_t(q, .)$ and $s_v(q, .)$, respectively, are computed and
normalized so that their elements sum to one. Then, the
$l \times l$ textual and visual similarity matrices, $S_t$ and
$S_v$, respectively, are computed for these documents and
normalized using
\[ \frac{s(d, d') - \min s(d, .)}{\max s(d, .) - \min s(d, .)}, \]
where $s(d, d')$ denotes the similarity between two documents
$d$ and $d'$. By denoting the regular matrix multiplication
operation as “$\cdot$” and the $(i, j)$ element of a matrix $A$
as $A[i, j]$, this graph-based framework sets
$x_{(0)} = s_t(q, .)$, $y_{(0)} = s_v(q, .)$, and defines the
following update rule:
\[ x_{(i)} \propto K(x_{(i-1)}, k) \cdot \left[ (1-\gamma) D \cdot (\beta S_t + (1-\beta) S_v) + \gamma \, e \cdot s_t(q, .) \right] \]
\[ y_{(i)} \propto K(y_{(i-1)}, k) \cdot \left[ (1-\gamma) D \cdot (\beta S_v + (1-\beta) S_t) + \gamma \, e \cdot s_v(q, .) \right] \]
where $D$ is the row-normalizing matrix, $e$ is the $l \times 1$
vector of ones, and the operator $K(x, k)$ takes as input a
vector $x$ and assigns a zero value to elements with score
strictly lower than the $k$-th highest score in $x$. Following
$i$ iterations, the final ranking with respect to query $q$ is
given by the linear combination of $s_t$, $s_v$, $x_{(i)}$ and
$y_{(i)}$:
\[ score(q) = \alpha_t s_t(q, .) + \alpha_v s_v(q, .) + \alpha_{tv} x_{(i)} + \alpha_{vt} y_{(i)} \tag{1} \]
under the restriction that $\alpha_t + \alpha_v + \alpha_{tv} + \alpha_{vt} = 1$.
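The bimodal update rule and Equation (1) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of [2]: the function names (`min_max_rows`, `row_normalize`, `top_k`, `bimodal_scores`) are hypothetical, the row-normalizing matrix $D$ is assumed to divide each row by its sum, and the proportionality in the update rule is left unnormalized.

```python
import numpy as np

def min_max_rows(S):
    """Row-wise min-max normalization of a similarity matrix."""
    S = np.asarray(S, dtype=float)
    lo = S.min(axis=1, keepdims=True)
    span = S.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # constant rows map to zeros instead of dividing by zero
    return (S - lo) / span

def row_normalize(A):
    """Assumed effect of D: divide each row by its sum."""
    A = np.asarray(A, dtype=float)
    sums = A.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0  # leave all-zero rows unchanged
    return A / sums

def top_k(x, k):
    """Operator K(x, k): zero every score strictly lower than the k-th highest."""
    x = np.asarray(x, dtype=float)
    if k >= x.size:
        return x.copy()
    kth = np.sort(x)[-k]
    return np.where(x >= kth, x, 0.0)

def bimodal_scores(s_t, s_v, S_t, S_v, alpha=(0.25, 0.25, 0.25, 0.25),
                   beta=0.0, gamma=0.3, k=10, iterations=1):
    """Fuse textual and visual similarities as in Equation (1)."""
    l = s_t.size
    e = np.ones((l, 1))
    # Bracketed terms of the update rule for x (textual) and y (visual).
    T_x = (1 - gamma) * row_normalize(beta * S_t + (1 - beta) * S_v) \
        + gamma * (e @ s_t[None, :])
    T_y = (1 - gamma) * row_normalize(beta * S_v + (1 - beta) * S_t) \
        + gamma * (e @ s_v[None, :])
    x, y = s_t.copy(), s_v.copy()
    for _ in range(iterations):
        x = top_k(x, k) @ T_x
        y = top_k(y, k) @ T_y
    a_t, a_v, a_tv, a_vt = alpha
    return a_t * s_t + a_v * s_v + a_tv * x + a_vt * y
```

With `alpha=(0.5, 0.5, 0.0, 0.0)` the sketch reduces to a weighted mean of the two query-based similarity vectors, whatever the values of the remaining parameters.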
This framework includes all well-known weighted, graph-based
and diffusion-based fusion techniques as special cases
of its parameters. For $\alpha_{tv} = \alpha_{vt} = 0$, Equation (1)
becomes the weighted mean fusion model. For
$\alpha_t = \alpha_v = \alpha_{vt} = 0$, $\gamma = 0$ and a sufficiently
large number of iterations $i$, the model is the random walk
approach [6]. For $\alpha_v = \alpha_{tv} = 0$, $\beta = 0$, $\gamma = 0$,
and $i = 1$, the model is the cross-media approach [1].
In the experiments reported in [2], $\beta = 0$, $\gamma = 0.3$, $i = 1$
and $k = 10$ are recommended as the default parameters
when fusing the top-$l$ ($l = 1000$) results returned by
text-based search. The weights in the linear combination of
$s_t$, $s_v$, $x_{(1)}$ and $y_{(1)}$ are tuned in $\{0.1, 0.2, \ldots, 0.9\}$
and the best values are compared to the uniform weighting strategy
($\alpha_t = \alpha_v = \alpha_{tv} = \alpha_{vt} = 0.25$). The results show
that tuning yields only an incremental increase in Mean Average
Precision (MAP) in one dataset and no increase in the other
datasets, indicating the potential effectiveness of this uniform
weighting scheme.
2.2 Multimedia Retrieval Using M Modalities
Our aim is to extend the aforementioned graph-based
framework to more than two modalities. Assuming that
there are $M$ modalities with corresponding $1 \times l$ query-based
similarity vectors $s_m(q, .)$ and $l \times l$ similarity matrices $S_m$,
$m = \{1, 2, \ldots, M\}$, we compute the following contextual
similarity matrix for each modality $m$:
\[ C_m = \left(1 - \sum_{w=1}^{M-1} \beta_w\right) S_m + \sum_{w=1}^{M-1} \beta_w S_{w \neq m} \tag{2} \]
The matrices $C_m$ of Equation (2) are row-normalized so as
to obtain the corresponding row-stochastic transition
probability matrices $P_m$ with elements:
\[ P_m[i, j] = \frac{C_m[i, j]}{\sum_{j=1}^{l} C_m[i, j]} \tag{3} \]
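Equations (2) and (3) can be illustrated with a short NumPy sketch. The function names `contextual_matrices` and `transition_matrices` are hypothetical, and the sketch assumes that the $M-1$ weights $\beta_w$ are assigned to the modalities other than $m$ in increasing index order, which is how the three-modality instance of Equation (2) reads.

```python
import numpy as np

def contextual_matrices(S, betas):
    """Equation (2): one contextual similarity matrix C_m per modality.

    S     : list of M similarity matrices, each of shape (l, l)
    betas : M-1 weights for the modalities other than m (assumed ordering:
            increasing modality index, skipping m)
    """
    M = len(S)
    if len(betas) != M - 1:
        raise ValueError("expected M-1 beta weights")
    C = []
    for m in range(M):
        others = [S[w] for w in range(M) if w != m]
        C_m = (1.0 - sum(betas)) * np.asarray(S[m], dtype=float)
        for beta_w, S_w in zip(betas, others):
            C_m = C_m + beta_w * np.asarray(S_w, dtype=float)
        C.append(C_m)
    return C

def transition_matrices(C):
    """Equation (3): row-normalize each C_m into a row-stochastic P_m."""
    P = []
    for C_m in C:
        sums = C_m.sum(axis=1, keepdims=True)
        sums[sums == 0] = 1.0  # guard against all-zero rows
        P.append(C_m / sums)
    return P
```

For non-negative similarities with at least one positive entry per row, every row of each returned `P_m` sums to one, so it can be read as a transition probability matrix.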
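For all modalities $m$, we set $x^m_{(0)} = s_m(q, .)$, $m = \{1, \ldots, M\}$,
and motivated by [2], we define the following update rule:
\[ x^m_{(i)} \propto K(x^m_{(i-1)}, k) \cdot \left[ \left(1 - \sum_{\substack{w=1 \\ w \neq m}}^{M} \gamma_w\right) P_m + \sum_{\substack{w=1 \\ w \neq m}}^{M} \gamma_w \, e \cdot s_w(q, .) \right] \tag{4} \]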
Inspired by the model of Equation (1), we finally propose
the vector of relevance scores in response to query $q$:
\[ score(q) = \sum_{m=1}^{M} \alpha_m s_m(q, .) + \sum_{m=1}^{M} \alpha'_m x^m_{(i)} \tag{5} \]
where
\[ \sum_{m=1}^{M} \alpha_m + \sum_{m=1}^{M} \alpha'_m = 1 \tag{6} \]
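For three modalities, for example, Equation (2) becomes:
\[ C_1 = (1 - \beta_1 - \beta_2) S_1 + \beta_1 S_2 + \beta_2 S_3 \]
\[ C_2 = (1 - \beta_1 - \beta_2) S_2 + \beta_1 S_1 + \beta_2 S_3 \]
\[ C_3 = (1 - \beta_1 - \beta_2) S_3 + \beta_1 S_1 + \beta_2 S_2 \]
The contextual similarity matrices $C_m$, $m = \{1, 2, 3\}$, are
row-normalized to obtain $P_m$, $m = \{1, 2, 3\}$, using Equation
(3), and the update rule (Equation (4)) becomes:
\[ x^1_{(i)} \propto K(x^1_{(i-1)}, k) \cdot [(1 - \gamma_2 - \gamma_3) P_1 + \gamma_2 \, e \cdot s_2(q, .) + \gamma_3 \, e \cdot s_3(q, .)] \]
\[ x^2_{(i)} \propto K(x^2_{(i-1)}, k) \cdot [(1 - \gamma_1 - \gamma_3) P_2 + \gamma_1 \, e \cdot s_1(q, .) + \gamma_3 \, e \cdot s_3(q, .)] \]
\[ x^3_{(i)} \propto K(x^3_{(i-1)}, k) \cdot [(1 - \gamma_1 - \gamma_2) P_3 + \gamma_1 \, e \cdot s_1(q, .) + \gamma_2 \, e \cdot s_2(q, .)] \]
Finally, $score(q)$ is computed as in Equation (5), which
linearly combines $s_m(q, .)$ and $x^m_{(i)}$ for $m = \{1, 2, 3\}$.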
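The update rule of Equation (4) and the final score of Equation (5) can be sketched for any number of modalities as below. This is an illustrative reading, not the authors' code: `top_k` and `fuse_modalities` are hypothetical names, the transition matrices `P` are assumed to be already row-stochastic (Equation (3)), and the proportionality in Equation (4) is left unnormalized.

```python
import numpy as np

def top_k(x, k):
    """Operator K(x, k): zero every score strictly lower than the k-th highest."""
    x = np.asarray(x, dtype=float)
    if k >= x.size:
        return x.copy()
    kth = np.sort(x)[-k]
    return np.where(x >= kth, x, 0.0)

def fuse_modalities(s, P, gammas, alphas, alphas_prime, k=10, iterations=1):
    """Equations (4) and (5) for M modalities.

    s            : list of M query-based similarity vectors, each of length l
    P            : list of M row-stochastic (l, l) transition matrices
    gammas       : list of M weights gamma_w
    alphas       : list of M weights on s_m(q, .)
    alphas_prime : list of M weights on the diffused vectors x^m
    """
    M = len(s)
    l = s[0].size
    e = np.ones((l, 1))
    score = np.zeros(l)
    for m in range(M):
        others = [w for w in range(M) if w != m]
        # Bracketed term of Equation (4) for modality m.
        T = (1.0 - sum(gammas[w] for w in others)) * P[m]
        for w in others:
            T = T + gammas[w] * (e @ s[w][None, :])
        x = np.asarray(s[m], dtype=float).copy()
        for _ in range(iterations):
            x = top_k(x, k) @ T
        score += alphas[m] * s[m] + alphas_prime[m] * x
    return score
```

Setting all entries of `alphas_prime` to zero turns the sketch into a plain weighted late fusion of the vectors $s_m(q, .)$, which is a convenient sanity check on the weights.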
Figure 1 depicts our multimedia retrieval framework in the
particular case of fusing three modalities, namely visual
features, visual concepts and textual concepts. The top-$l$
documents in the filtering step are obtained by using the textual
concepts to index and retrieve each document in response to
the query $q$ using the open-source Apache Lucene$^1$ system. Then,
the $l \times l$ similarity matrices ($S_1$ on visual descriptors,
$S_2$ on visual concepts, $S_3$ on textual concepts) and the
corresponding $1 \times l$ query-based similarity vectors ($s_1(q, .)$
on visual descriptors, $s_2(q, .)$ on visual concepts and
$s_3(q, .)$ on textual concepts) are computed. Finally, we fuse
these similarity matrices and the query-based similarity vectors
to get a single relevance score vector in response to query $q$:
$score(q)$.
$^1$ https://lucene.apache.org/core/
Figure 1: Multimedia retrieval framework fusing 3 modalities.
[Diagram. Offline: feature extraction of visual descriptors, visual
concepts and textual concepts from the multimedia database. Online:
for a multimedia-enriched object query $q$, Lucene search and top-$l$
filtering, computation of the similarities $S_1$, $S_2$, $S_3$ and
$s_1(q, .)$, $s_2(q, .)$, $s_3(q, .)$, and fusion into $score(q)$.]
Memory Complexity. The memory complexity is $O(l^2)$
for the computation of each similarity matrix $S_m$, $O(l)$ for
each similarity vector $s_m(q, .)$ and $O(kl)$ for each
$x^m_{(i)}$, $m = 1, 2, \ldots, M$, thus the overall memory
complexity is quadratic in $l$: $O(Ml^2 + Mkl + Ml)$. In order to
compare directly the baseline method with our retrieval framework
with $M$ modalities, under the same memory complexity, we seek the
number of filtered documents $l'$, such that
$M{l'}^2 + Mkl' + Ml' = 2l^2 + 2kl + 2l$. The non-negative solution is:
\[ l' = \sqrt{\frac{(k+1)^2}{4} + \frac{2l^2 + 2kl + 2l}{M}} - \frac{k+1}{2} \tag{7} \]
For example, for $M = 3$, $k = 10$, $l = 1000$, we find $l' = 815$.
We also observe that even for 15 modalities, the number of
filtered documents $l'$ remains above 300, hence a significant
number of documents is involved in the fusion, even
in the case of several modalities. We shall examine whether
the fusion of three modalities in this framework outperforms
the baseline approach, without additional memory cost.
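Equation (7) is easy to check numerically. The short sketch below (the function name `filtered_documents` is hypothetical) evaluates the closed-form solution and confirms the figures quoted above; the result is a real number, so truncating it to an integer number of documents is an assumption of this sketch.

```python
import math

def filtered_documents(l, k, M):
    """Equation (7): l' such that M modalities cost as much memory as two.

    Solves M*l'^2 + M*k*l' + M*l' = 2*l^2 + 2*k*l + 2*l for the
    non-negative root l'.
    """
    budget = 2 * l**2 + 2 * k * l + 2 * l   # memory of the bimodal baseline
    half = (k + 1) / 2
    return math.sqrt(half**2 + budget / M) - half

l_three = filtered_documents(l=1000, k=10, M=3)    # about 815.5, truncated to 815
l_fifteen = filtered_documents(l=1000, k=10, M=15)  # still above 300
l_two = filtered_documents(l=1000, k=10, M=2)       # recovers l = 1000
```

For $M = 2$ the formula returns the original $l$, as expected, since the two sides of the memory equation then coincide.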
3. EXPERIMENTS
In this section, we describe the datasets used for evalua-
tion, the features extracted from their multimedia objects,
the employed similarities, and the experimental results.
3.1 Evaluation Datasets
We evaluate our framework using two test collections: the
IAPR-TC12$^2$ and the WIKI11$^3$, both created in the context
of the ImageCLEF benchmarking activities. The IAPR-TC12
collection consists of (i) 20,000 images, each annotated
with a title and a description, and also various metadata
(e.g. date, location, etc.) not considered in this work and
(ii) the 60 topics created in ImageCLEF 2007, each with
a title and three image examples. The WIKI11 collection
consists of (i) 237,434 images extracted from Wikipedia and
their user-generated captions/descriptions and (ii) 50 topics,
each with a title and one to five query images.
$^2$ http://imageclef.org/photodata
$^3$ http://www.imageclef.org/wikidata
3.2 Features and Monomodal Similarities
We use the following state-of-the-art features and similarity
functions for each modality in the documents and queries;
it should be noted though that our method is capable of fusing
any similarity score obtained by any kind of features. As
visual descriptors (VD), we extract the scale-invariant local
descriptors RGB-SIFT [11], which are then locally aggregated
into one vector representation using VLAD encoding
[7], and employ a similarity function based on the Euclidean
distance [5]. We also employ the 346 high-level visual concepts
(VC) introduced in the TRECVID 2011 semantic indexing
task$^4$, which are detected by multiple independent
concept detectors that use the aforementioned visual descriptors
as input to Logistic Regression classifiers that have
their output averaged and further refined [8]. Finally, we use
the title/caption of each image so as to extract textual concepts
(TC) using the DBpedia Spotlight$^5$ annotation tool.
For the cases of visual and textual concepts, similarities are
computed based on Lucene's retrieval function.
3.3 Experimental Setup and Results
We evaluate the MAP of our framework that fuses three
modalities against the baseline (Section 2.1) that fuses two
modalities. As the baseline models all well-known weighted,
graph-based and diffusion-based fusion techniques as special
cases of its parameters α, β, and γ, the best performance
among all these fusion techniques coincides with the best
performance of the weight parameters α, β and γ. Therefore,
we tune these parameters so as to present, in Table 1, the
best MAP scores of the fusion using two modalities. First,
we combine textual concepts with visual descriptors (TC &
VD), and then, we combine textual with visual concepts (TC
& VC); the combination of visual descriptors and concepts
(VD & VC) is not considered as it reduces the problem to the
classic image retrieval task and no initial filtering can be per-
formed with respect to the textual modality. We adopt the
default parameters reported in [2], i.e. $k = 10$, one iteration
($i = 1$) and uniform weights ($\alpha_t = \alpha_v = \alpha_{tv} = \alpha_{vt} = 1/4$).
To compare directly the baseline method with our framework
with three modalities under the same memory complexity,
we use $l' = 815$ (see Equation (7)). For $m = \{1, 2, 3\}$,
we adopt a uniform weighting strategy ($\alpha_m = \alpha'_m = 1/6$)
and we tune the parameters $\gamma_m$, while the parameters
$\beta_m$ are kept constant (and equal to 1/3); the results
for different values of $\gamma_m$ are reported in Table 1.
We observe that our framework outperforms the baseline
method for several values of the parameters $\gamma_m$,
$m = \{1, 2, 3\}$, under the same memory cost. In particular,
relative to the best bimodal baseline, MAP increases by up to
13.44% for WIKI11 and by up to 15.71% for IAPR-TC12. We further
tuned the parameters $\alpha_m$, $\alpha'_m$ and $\beta_m$,
$m = \{1, 2, 3\}$, and did not observe any further increase in MAP,
which implies that no other weighted, graph-based or
diffusion-based fusion technique outperforms our framework.
The small differences in MAP when employing the best parameters
$\gamma_m$, $m = \{1, 2, 3\}$, compared to the uniform ones
($\gamma_m = 1/3$) allow for setting $\gamma_m = 1/3$,
$m = \{1, 2, 3\}$, without a significant decrease in
MAP (less than 2%).
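The reported percentages are relative changes in MAP and can be reproduced from the values in Table 1. The sketch below (with a hypothetical helper `relative_change`) compares the best three-modality MAP to the best bimodal baseline, and the near-uniform setting to the best three-modality setting, for each collection.

```python
def relative_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

# Best three-modality MAP versus best bimodal baseline (Table 1).
wiki_gain = relative_change(0.4145, 0.3654)   # WIKI11
iapr_gain = relative_change(0.3204, 0.2769)   # IAPR-TC12

# Near-uniform gammas (0.34, 0.33, 0.33) versus the best gammas.
wiki_uniform_loss = relative_change(0.4105, 0.4145)
iapr_uniform_loss = relative_change(0.3148, 0.3204)

print(f"WIKI11 gain: {wiki_gain:.2f}%, IAPR-TC12 gain: {iapr_gain:.2f}%")
print(f"Uniform-weight loss: {wiki_uniform_loss:.2f}% and {iapr_uniform_loss:.2f}%")
```

Rounded to two decimals, the gains match the 13.44% and 15.71% quoted above, and both losses from using near-uniform weights stay below 2% in absolute value.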
$^4$ http://www-nlpir.nist.gov/projects/tv2011/tv11.sin.346.concepts.simple.txt
$^5$ http://dbpedia-spotlight.github.io/demo/
Table 1: MAP values for both datasets; values that outperform both
baselines (bold in the typeset version) are marked with *.
M   γ1     γ2     γ3        WIKI11    IAPR-TC12
2   TC & VD (best α, β, γ)  0.3654    0.2769
2   TC & VC (best α, β, γ)  0.3472    0.2729
3   0.00   0.00   1.00      0.3637    0.3065*
3   0.00   1.00   0.00      0.3433    0.2858*
3   1.00   0.00   0.00      0.3855*   0.2518
3   0.50   0.50   0.00      0.4083*   0.2912*
3   0.50   0.25   0.25      0.4145*   0.2970*
3   0.25   0.50   0.25      0.4029*   0.3136*
3   0.25   0.25   0.50      0.4048*   0.3204*
3   0.34   0.33   0.33      0.4105*   0.3148*
4. CONCLUSION
We presented an unsupervised graph-based framework that
fuses $M$ modalities for multimedia retrieval. In this work, we
consider the fusion of three modalities (textual concepts,
visual descriptors, and visual concepts), but the overall
framework is directly applicable to any number of modalities. We
also presented a theoretical formula which provides the
optimal number of documents that need to be initially filtered,
so that the memory cost in the case of $M$ modalities remains
the same as in the case of two modalities. The experiments
have shown that the MAP improves in the case of three
modalities (up to 15.71% in some cases). We also observed
that the results of a uniform weighting strategy do not
significantly differ from those obtained using the best weights.
In the future, we plan to evaluate our framework in multilin-
gual settings using language agnostic features, such as tex-
tual concepts either in the language of the query or mapped
to a common language using multilingual ontologies.
5. ACKNOWLEDGMENTS
This work was partially supported by the European Com-
mission by the projects MULTISENSOR (FP7-610411) and
KRISTINA (H2020-645012).
6. REFERENCES
[1] J. Ah-Pine, S. Clinchant, and G. Csurka. Comparison
of several combinations of multimodal and diversity
seeking methods for multimedia retrieval. In
Multilingual Information Access Evaluation II.
Multimedia Experiments, pages 124–132. Springer,
2009.
[2] J. Ah-Pine, G. Csurka, and S. Clinchant.
Unsupervised visual and textual information fusion in
CBMIR using graph-based methods. ACM Transactions
on Information Systems (TOIS), 33(2):9, 2015.
[3] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S.
Kankanhalli. Multimodal fusion for multimedia
analysis: a survey. Multimedia systems, 16(6):345–379,
2010.
[4] J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia,
G. R. Lanckriet, R. Levy, and N. Vasconcelos. On the
role of correlation and abstraction in cross-modal
multimedia retrieval. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 36(3):521–535,
2014.
[5] J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and
W. Niblack. Efficient color histogram indexing for
quadratic form distance functions. Pattern Analysis
and Machine Intelligence, IEEE Transactions on,
17(7):729–736, 1995.
[6] W. H. Hsu, L. S. Kennedy, and S.-F. Chang. Video
search reranking through random walk over
document-level context graph. In Proceedings of the
15th international conference on Multimedia, pages
971–980. ACM, 2007.
[7] H. Jégou, M. Douze, C. Schmid, and P. Pérez.
Aggregating local descriptors into a compact image
representation. In Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on,
pages 3304–3311. IEEE, 2010.
[8] B. Safadi and G. Quénot. Re-ranking by local
re-scoring for video indexing and retrieval. In
Proceedings of the 20th ACM international conference
on Information and knowledge management, pages
2081–2084. ACM, 2011.
[9] B. Safadi, M. Sahuguet, and B. Huet. When textual
and visual information join forces for multimedia
retrieval. In Proceedings of International Conference
on Multimedia Retrieval, page 265. ACM, 2014.
[10] B. Siddiquie, B. White, A. Sharma, and L. S. Davis.
Multi-modal image retrieval for complex queries using
small codes. In Proceedings of International
Conference on Multimedia Retrieval, page 321. ACM,
2014.
[11] K. E. Van De Sande, T. Gevers, and C. G. Snoek.
Evaluating color descriptors for object and scene
recognition. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 32(9):1582–1596,
2010.
[12] J. Wang, Y. He, C. Kang, S. Xiang, and C. Pan.
Image-text cross-modal retrieval via modality-specific
feature learning. In Proceedings of the 5th ACM on
International Conference on Multimedia Retrieval,
pages 347–354. ACM, 2015.
[13] Y. Wang, X. Lin, and Q. Zhang. Towards metric
fusion on multi-view data: a cross-view based graph
random walk approach. In Proceedings of the 22nd
ACM international conference on Conference on
information & knowledge management, pages 805–810.
ACM, 2013.
[14] S. Xu, H. Li, X. Chang, S.-I. Yu, X. Du, X. Li,
L. Jiang, Z. Mao, Z. Lan, S. Burger, et al. Incremental
multimodal query construction for video search. In
Proceedings of the 5th ACM on International
Conference on Multimedia Retrieval, pages 675–678.
ACM, 2015.
... The VERGE architecture is shown in Fig. 1, with multimodal fusion of low-and high-level visual and textual information, color-based clustering, served by the VERGE Graphical User Interface (GUI). The overall system is novel, since it integrates the fusion of multiple modalities [4], in a hybrid graph-based and non-linear way [5], with several functionalities (eg. multimedia retrieval, image retrieval, search by visual or textual concept, etc.) already presented in [1], but in a unified user interface. ...
... Multimodal fusion is performed on semantically filtered multimodal objects, utilizing the textual modality [6], and therefore, the overall memory and computational complexity of the multimedia retrieval module is reduced [4]. ...
... In brief the multimedia retrieval module [5], constructs one similarity matrix per modality and one similarity vector (query based) per modality, given M modalities and a query, but only for the results of a text-based search, assuming that text description is the main semantic source of information [6]. A graph-based fusion of multiple modalities [4] is combined with all similarity vectors in a non-linear way [5], which in general may fuse multiple modalities. In this context, we employ M = 3 modalities, namely visual features (RGB-SIFT), locally aggregated into one vector representation using VLAD encoding (Section II.B.1), text description (Section II.C), 346 high-level visual concepts (Section II.B.2), and textual high-level concepts, which are DBpedia 2 entities. ...
... The non-linear fusion of (16) has M − 1 free parameters a m , M − 1 free parameters β m and M − 1 free parameters γ m , thus 3M − 3 parameters in total, hence the increase in the number of parameters is linear in the number of modalities. The model of (10) has been extended to multiple modalities in [9], along with a mathematical formula to keep the memory complexity at the same level with the retrieval on two modalities. Instead of using a linear combination of fused similarity scores, as in (15), the non-linear combination of all fused similarity scores of (16) has shown slightly better performance on the same datasets [8]. ...
... Secondly, we use the cross-media fusion [5] of three modalities and thirdly the random-walk approach of [12]. The fourth baseline method is the non-linear fusion [22] of all modalities and finally we compare our framework with the extension of the unifying fusion framework of [2] in the case of three modalities [9] in two cases: first with the SIFT visual descriptors and second with the state-of-the-art DCNN visual features. Our proposed framework first combines SIFT with DCNN using PLS Regression and then uses non-linear graph-based fusion for all three modalities. ...
Article
Full-text available
Heterogeneous sources of information, such as images, videos, text and metadata are often used to describe different or complementary views of the same multimedia object, especially in the online news domain and in large annotated image collections. The retrieval of multimedia objects, given a multimodal query, requires the combination of several sources of information in an efficient and scalable way. Towards this direction, we provide a novel unsupervised framework for multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual metadata, integrating non-linear graph-based fusion and Partial Least Squares Regression. The fusion strategy is based on the construction of a multimodal contextual similarity matrix and the non-linear combination of relevance scores from query-based similarity vectors. Our framework can employ more than two modalities and high-level information, without increase in memory complexity, when compared to state-of-the-art baseline methods. The experimental comparison is done in three public multimedia collections in the multimedia retrieval task. The results have shown that the proposed method outperforms the baseline methods, in terms of Mean Average Precision and Precision@20.
... CL fuses several sources of information with the goal to create links between the APT related information retrieved from CA. Given an APT report as input and using its meta-data and extracted concepts as multiple modalities, the fusion of all available modalities is based on a semantic filtering stage [33]. This process filters out the non-relevant results in a progressive way starting from the dominant modality, i.e., the attribute/concept that has been proven most effective in uni-modal APT-to-APT comparison. ...
Conference Paper
Full-text available
Many real-world objects described by multiple attributes or features can be decomposed as multiple "views" (e.g., an image can be described by a color view or a shape view), which often provides complementary information to each other. Learning a metric (similarity measures) for multi-view data is primary due to its wide applications in practices. However, leveraging multi-view information to produce a good metric is a great challenge and existing techniques are concerned with pairwise similarities, leading to undesirable fusion metric and high computational complexity. In this paper, we propose a novel Metric Fusion technique via cross-view graph Random Walk, named MFRW, regarding a multi-view based similarity graphs (with each similarity graph constructed under each view). Instead of using pairwise similarities, we seek a high-order metric yielded by graph random walks over constructed similarity graphs. Observing that ``outlier views" may exist in the fusion process, we incorporate the coefficient matrices representing the correlation strength between any two views into MFRW, named WMFRW. The principle of \textsf{WMFRW} is implemented by exploring the ``common latent structure" between views. The empirical studies conducted on real-world databases demonstrate that our approach outperforms the state-of-the-art competitors in terms of effectiveness and efficiency.
Conference Paper
Full-text available
Currently, popular search engines retrieve documents on the basis of text information. However, integrating the visual information with the text-based search for video and image retrieval is still a hot research topic. In this paper, we propose and evaluate a video search framework based on using visual information to enrich the classic text-based search for video retrieval. The framework extends conventional text-based search by fusing together text and visual scores, obtained from video subtitles (or automatic speech recognition) and visual concept detectors respectively. We attempt to overcome the so called problem of semantic gap by automatically mapping query text to semantic concepts. With the proposed framework, we endeavor to show experimentally, on a set of real world scenarios, that visual cues can effectively contribute to the quality improvement of video retrieval. Experimental results show that mapping text-based queries to visual concepts improves the performance of the search system. Moreover, when appropriately selecting the relevant visual concepts for a query, a very significant improvement of the system's performance is achieved.
Article
Full-text available
The problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, for example, using an image to search for texts. A mathematical formulation is proposed, equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities. Two hypotheses are then investigated regarding the fundamental attributes of these spaces. The first is that low-level cross-modal correlations should be accounted for. The second is that the space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method which models cross-modal correlations, semantic matching (SM), a supervised technique that relies on semantic representation, and semantic correlation matching (SCM), which combines both. An extensive evaluation of retrieval performance is conducted to test the validity of the hypotheses. All approaches are shown successful for text retrieval in response to image queries and vice versa. It is concluded that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.
Article
Full-text available
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors (software to compute the color descriptors from this paper is available from http://www.colordescriptors.com) in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a data set with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results further reveal that, for light intensity shifts, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the data set and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8 percent on the PASCAL VOC 2007 and by 7 percent on the Mediamill Challenge.
Conference Paper
Recent improvements in content-based video search have led to systems with promising accuracy, thus opening up the possibility for interactive content-based video search to the general public. We present an interactive system based on a state-of-the-art content-based video search pipeline which enables users to do multimodal text-to-video and video-to-video search in large video collections, and to incrementally refine queries through relevance feedback and model visualization. Also, the comprehensive functionalities enhance a flexible formulation of multimodal queries with different characteristics. Quantitative and qualitative analysis shows that our system is capable of assisting users to incrementally build effective queries over complex event topics.
Conference Paper
Cross-modal retrieval extends the ability of search engines to deal with massive cross-modal data. The goal of image-text cross-modal retrieval is to search images (texts) using text (image) queries by directly computing the similarities between images and texts. Many existing methods rely on low-level visual features and textual features for cross-modal retrieval, ignoring the characteristics existing in the raw data of different modalities. In this paper, a novel model based on modality-specific feature learning is proposed. Considering the characteristics of different modalities, the model uses two types of convolutional neural networks to map the raw data to the latent space representations for images and texts, respectively. In particular, the convolution-based network used for texts involves word embedding learning, which has been proven effective for extracting meaningful textual features for text classification. In the latent space, the mapped features of images and texts form relevant and irrelevant image-text pairs, which are used by the one-vs-more learning scheme. This learning scheme can achieve ranking functionality by allowing for one relevant pair and multiple irrelevant pairs. The standard back-propagation technique is employed to update the parameters of the two convolutional networks. Extensive cross-modal retrieval experiments are carried out on three challenging datasets that consist of image-document pairs or image-query click-through data from a search engine, and the results firmly demonstrate that the proposed model is much more effective.
Article
Multimedia collections are more than ever growing in size and diversity. Effective multimedia retrieval systems are thus critical to access these datasets from the end-user perspective and in a scalable way. We are interested in repositories of image/text multimedia objects and we study multimodal information fusion techniques in the context of content-based multimedia information retrieval. We focus on graph-based methods, which have proven to provide state-of-the-art performances. We particularly examine two such methods: cross-media similarities and random-walk-based scores. From a theoretical viewpoint, we propose a unifying graph-based framework, which encompasses the two aforementioned approaches. Our proposal allows us to highlight the core features one should consider when using a graph-based technique for the combination of visual and textual information. We compare cross-media and random-walk-based results using three different real-world datasets. From a practical standpoint, our extended empirical analyses allow us to provide insights and guidelines about the use of graph-based methods for multimodal information fusion in content-based multimedia information retrieval.
Conference Paper
We propose a unified framework for image retrieval capable of handling complex and descriptive queries of multiple modalities in a scalable manner. A novel aspect of our approach is that it supports query specification in terms of objects, attributes and spatial relationships, thereby allowing for substantially more complex and descriptive queries. We allow these complex queries to be specified in three different modalities - images, sketches and structured textual descriptions. Furthermore, we propose a unique multi-modal hashing algorithm capable of mapping queries of different modalities to the same binary representation, enabling efficient and scalable image retrieval based on multi-modal queries. Extensive experimental evaluation shows that our approach outperforms the state-of-the-art image retrieval and hashing techniques on the MSRC and SUN09 datasets by about 100%, while the performance on a dataset of 1M images, from Flickr, demonstrates its scalability.