Unified Hybrid Image Retrieval System with
Continuous Relevance Feedback
Leszek Kaliciak, Hans Myrhaug, and Ayse Goker
Ambiesense Limited, Aberdeen, Scotland
Email: leszek@ambiesense.com, hans@ambiesense.com, ayse@ambiesense.com
Abstract—In this paper, we present a unified hybrid image retrieval system consisting of the following components: various visual features and their combinations, a combination of visual and textual feature spaces, a combination of visual and textual feature spaces in the context of search refinement, and an interactive user interface with hybrid relevance feedback, exploratory search, query history, a relevance bar, and positive and negative results panels.
In the paper we also introduce two novel hybrid spinoff models and describe a new continuous relevance feedback framework that allows us to move away from graded relevance and shows the relationships between feedback images.
Keywords—Information Fusion, Intelligent User Interfaces,
Hybrid Models, Multimedia Retrieval, Relevance Feedback
I. INTRODUCTION AND RELATED WORK
Only a few prototype systems utilize both visual and textual information in the context of relevance feedback (e.g. [1], [14], [12]). These systems, however, combine the features in an ad hoc manner: no inter- or intra-feature relationships are modelled within the proposed frameworks. Moreover, existing hybrid prototype systems do not support user-system interaction at as high a level as some mono-modal approaches (e.g. [8], [3]).
Commercial systems have only recently started to use the visual content of images (e.g. Baidu, Yandex, Google, iStock Photo). They usually allow searching based on only one of the modalities and offer very little interactivity with the user. We propose a novel solution to alleviate this problem.
Relevance feedback in Content-based Image Retrieval (CBIR) helps to narrow down the search by involving the users in the search refinement process. After issuing a query and being presented with the initial results, the user can provide relevance feedback. Typically, the relevance feedback is binary in nature (relevant/not relevant), e.g. [10], [18]. A few systems allow for a discrete number of relevance degrees, e.g. relevant, partially relevant, and not relevant, or finer gradations [9], [15].
This paper presents our prototype system consisting of an
interactive user interface, a hybrid model for the combination
of visual and textual features, and a hybrid adaptive model
for the combination of features in the context of relevance
feedback.
The novel interactive interface offers the functionalities of mono-modal prototype systems and more: continuous degrees of relevance, represented in the form of a relevance bar onto which a user can drag and drop feedback images in a natural way, integrated with the hybrid relevance feedback model. Other functionalities include: exploratory search (browsing through positive and negative results, either of which may become the current query, with total control over the query history and the images in the feedback set), query history, zooming in and out on images of interest, positive and negative results panels whose contents can both be utilized as feedback, and selection of visual features and their combinations.
II. INTELLIGENT USER INTERFACE
Figure 1 shows our user interface. The panels and functionalities of the interface are: Visual Example Panel, Positive and Negative Results Panels, Relevance Feedback Panel, Query History Panel, a show/hide panel description button, a help button, two reset buttons for resetting the set of feedback images or the query history, a “reset all” button for resetting both the set of feedback images and the query history, and expandable “collection” and “visual features” lists for selecting image collections and specific visual features and their combinations.
An example use case of the system could be the following: a user selects an image as a visual example (“a woman with a tattoo and sunglasses”), performs the search, narrows the search down by dragging and dropping images from both the positive and negative results panels (concept tattoo), and clicks the refine button. Then the user utilizes one of the result images as a new query and clicks the pagination buttons to find more relevant images (exploratory search).
Visual Example Panel: here, the user can browse the data collection by clicking the browse button or by using the quick browse panel (visual example). An image can then be selected to serve as the visual example with which to query the system. The selected query is displayed in the current query panel. A double mouse click on the query image displays it in full screen mode. Having selected the visual example, the user can click the search button, which performs the data collection search based on the hybrid model.
Positive and Negative Results Panels display the positive/negative results starting from the most/least relevant image (top left). Each result image can be displayed in full screen mode by double clicking it. Additionally, each result image can be dragged and dropped into the current query panel, thus becoming the current query itself. Buttons marked “<<” and “>>” allow the user to browse the search results (exploratory search). Images from the positive/negative results panels can also be dragged and dropped onto the relevance bar, thus indicating the level of relevance of specific images.
Fig. 1: User interface.
Relevance Bar indicates the levels of relevance of feedback images. The user can always change the levels of relevance of the feedback images already placed on the relevance bar. The degree of relevance is naturally represented in the form of a spectrum, with one end corresponding to positive feedback and the other end corresponding to negative feedback. If a current query is selected and the user, having already given feedback to the system, needs to continue the search, then clicking the refine button will narrow down and also correct the search.
The expandable Query History Panel, as the name suggests, shows the previously issued queries. These queries are also utilized during the search refinement process.
Other functionalities provided by our interface are: reset buttons for resetting the relevance feedback, the query history, or both; show/hide buttons that display or hide the panels’ descriptions; a help button that displays descriptions of the provided functionalities; a left mouse click on the collection button, which displays a list of data collections; and a left click on the visual features button, which displays a list of visual features and various combinations of visual features.
III. RELEVANCE CONTINUUM IN AN INTERACTIVE USER INTERFACE
Existing image retrieval systems that incorporate interaction with the user in the form of relevance feedback utilise binary (relevant/not relevant) or graded relevance (e.g. relevant, partially relevant, not relevant). The evidence suggests that graded relevance judgments play an important role in the information seeking process [2], [16]. This should not be surprising, as relevance is rarely binary in nature, and more degrees of relevance offer the user more flexibility and more freedom of choice. However, the existing implementations of graded relevance feedback do not show the relationships between images in terms of their relevance. By analogy, let us imagine a set of unordered whole numbers. When placed in a real coordinate space of one dimension (an axis), the relative position of these numbers (in relation to each other) allows us to compare them immediately, because the number on the right hand side is always greater than the number on the left hand side. The absolute position of the numbers (in relation to the entire axis) indicates their value. Another advantage of placing these numbers on an axis is that the aforementioned relationships make it easy to change one or more of these numbers (relevance updating) or add a new one to the set.

Fig. 2: Relevance bar, continuous degree of relevance.
In this section we present a relevance continuum consisting
of the relevance bar and continuous relevance feedback model.
We can envision a relevance bar representing a spectrum of
possible relevance degrees (see Figure 2). A user can drag and
drop feedback images onto this spectrum according to their
relevance degrees.
Not only the absolute relevance (in relation to the entire bar) but also the relative relevance (in relation to each other) of feedback images is shown on the spectrum bar. The relevance bar thus shows the relationships between the relevance degrees associated with the images selected for feedback. This in turn makes it straightforward to add new images to the feedback set and to update the feedback set: the feedback images can simply be rearranged on the relevance continuum.
The relevance degree from the relevance bar can be calculated as
\[ w_i = -1 + 2 \cdot \mathrm{dist}(0, \mathit{imPos}_i) \tag{1} \]
\[ \mathrm{dist}(\cdot,\cdot) \in [0,1] \tag{2} \]
\[ w_i \in [-1,1] \tag{3} \]
where \(\mathrm{dist}(0, \mathit{imPos}_i)\) represents the distance from zero to the position of feedback image \(i\) on the relevance spectrum, as shown in Figure 2.
The weights \(w_i\) calculated in this way can be utilized to modify the importance of individual feedback images. For example, the weights can be incorporated into the classic Rocchio algorithm (or variations inspired by it), which modifies the query so that it moves closer to the centroid of the relevant documents and further away from the centroid of the irrelevant ones:
\[ Q_m = a \cdot Q_o + b \cdot \frac{1}{|D_r|} \sum_{D_j \in D_r} w_j \cdot D_j + c \cdot \frac{1}{|D_{nr}|} \sum_{D_k \in D_{nr}} w_k \cdot D_k \tag{4} \]
where \(Q_m\) is the modified query vector, \(Q_o\) the original query vector, \(D_j\) a related document vector, \(D_k\) a non-related document vector, \(a\) the original query weight, \(b\) the related documents’ weight, \(c\) the non-related documents’ weight, \(D_r\) the set of related documents, and \(D_{nr}\) the set of non-related documents.
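To illustrate how the continuous weights could drive the refinement, the sketch below (Python/NumPy) maps relevance-bar positions to weights via Eq. (1) and plugs them into the weighted Rocchio update of Eq. (4). It is a minimal illustration under our own assumptions: the function names are hypothetical, and we split the feedback into related/non-related sets by the sign of \(w_i\), which the paper does not prescribe.

```python
import numpy as np

def relevance_weight(pos):
    """Eq. (1): map a relevance-bar position pos in [0, 1]
    (0 = negative end, 1 = positive end) to a weight in [-1, 1]."""
    return -1.0 + 2.0 * pos

def rocchio_update(q, docs, positions, a=1.0, b=0.75, c=0.25):
    """Weighted Rocchio update of Eq. (4). docs is an (n, dim) array of
    feedback image vectors and positions their bar positions in [0, 1]."""
    w = np.array([relevance_weight(p) for p in positions])
    rel = w > 0                      # assumed split: positive weights -> D_r
    q_mod = a * np.asarray(q, dtype=float)
    if rel.any():                    # weighted centroid of related documents
        q_mod += b * (w[rel, None] * docs[rel]).sum(axis=0) / rel.sum()
    if (~rel).any():                 # negative w_k pushes the query away
        q_mod += c * (w[~rel, None] * docs[~rel]).sum(axis=0) / (~rel).sum()
    return q_mod
```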
IV. IMPLEMENTED MODELS
We have implemented two tensor-based hybrid models in our prototype system demo. The data in Multimedia Retrieval is often multimodal and heterogeneous. Tensors, as generalizations of vectors and matrices, provide a natural and scalable framework for handling data with inherent structure and complex dependencies. There has been a recent renaissance of tensor-based methods in machine learning, ranging from scalable algorithms for tensor operations (e.g. [13]) and novel models based on tensor representations (e.g. [7]) to industry applications such as Google TensorFlow [19] and Torch [20].
The first is the hybrid model for the combination of visual and textual features. We compute Euclidean distances on tensored representations. This is equivalent to combining the Euclidean metric (visual similarity) and the cosine similarity (textual similarity) in a specific late fusion form. The advantages of this hybrid approach are discussed in [6]. Thus, we combine the distances as
\[ s_e^2(d_1^v, d_2^v) \cdot s_c(d_1^t, d_2^t) - 2\, s_c(d_1^t, d_2^t) + 2 \tag{5} \]
where \(s_e\) denotes the Euclidean distance, \(s_c\) the cosine similarity measure, \(d_1^v\) and \(d_1^t\) the visual and textual representations of the query, respectively, \(d_2^v\) and \(d_2^t\) the visual and textual representations of an image in the data collection, respectively, and \(\otimes\) the tensor operator. Thus, we utilize the Euclidean distance to measure the similarity between visual representations, and the cosine similarity to measure the similarity between textual representations. The Euclidean distance, in the case of our mid-level visual features, performs better than the cosine similarity, because normalization of our local features hampers retrieval performance. On the other hand, the cosine similarity in the textual space performs better than other similarity measures.
It can be shown that the aforementioned combination of measurements performed on the individual feature spaces is equivalent to
\[ s_e^2(d_1^v, d_2^v) \cdot s_c(d_1^t, d_2^t) - 2\, s_c(d_1^t, d_2^t) + 2 = s_e^2\!\left(d_1^v \otimes d_1^t,\; d_2^v \otimes d_2^t\right) \tag{6} \]
Thus, the implemented model is equivalent to computing the Euclidean distance on a tensored representation. This equivalence helps us avoid the curse of dimensionality, since there is no need to actually perform the tensor operation.
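The equivalence can be checked numerically. The following is a minimal sketch (our own illustration, in Python/NumPy); it assumes unit-norm visual vectors, under which the combined score of Eq. (5) coincides with the squared Euclidean distance on the tensored representation, and therefore yields the same ranking:

```python
import numpy as np

def combined_score(dv1, dv2, dt1, dt2):
    """Left-hand side of Eq. (6): late fusion of the squared Euclidean
    distance (visual) with the cosine similarity (textual)."""
    se2 = np.sum((dv1 - dv2) ** 2)
    sc = dt1 @ dt2 / (np.linalg.norm(dt1) * np.linalg.norm(dt2))
    return se2 * sc - 2.0 * sc + 2.0

def tensored_score(dv1, dv2, dt1, dt2):
    """Right-hand side of Eq. (6): squared Euclidean distance computed
    directly on the (outer-product) tensored representations."""
    t1 = np.outer(dv1, dt1 / np.linalg.norm(dt1)).ravel()
    t2 = np.outer(dv2, dt2 / np.linalg.norm(dt2)).ravel()
    return np.sum((t1 - t2) ** 2)
```

In practice only `combined_score` needs to be evaluated, which is why the curse of dimensionality is avoided.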
The second implemented hybrid model is the hybrid relevance feedback model for image refinement. Figure 3 shows the hybrid relevance model at work. It utilizes the correlation and complementarity between different feature spaces [5]. Moreover, because a query can be correlated with its context to different extents ([17], [4]), the implemented model adapts its weights based on the interactions with the user.¹

¹In this paper, the textual and visual terms refer to image tags and instances of visual words, respectively. Additionally, the visual and textual context is represented as a visual and textual subspace of the feedback images, respectively.
Fig. 3: Hybrid relevance feedback at work. Initially, users X and Y issue the same query, depicting a road sign and snow among other concepts (first image from the top). User X narrows down the search to the concept snow by dragging and dropping relevant images onto the relevance bar (second image from the top). User Y narrows down the search to the concept sign (third image from the top). Based on the combination of textual and visual features, the top results for the original query are re-ranked to present the two users with different results (second and third images from the top, respectively).
Let us assume that \(\mathrm{tr}\) denotes the matrix trace operator, \(\langle\cdot|\cdot\rangle\) represents the inner product, \(M_1, M_2\) are co-occurrence matrices corresponding to different feature spaces (a subspace generated by the query vector and the vectors from the feedback set), \(\otimes\) denotes the tensor operator, \(a, b\) are the representations of an image in the collection corresponding to \(M_1\) and \(M_2\), \(q_v, q_t\) denote the visual and textual representations of the query, \(c_i, d_i\) denote the visual and textual representations of the images in the feedback set, \(D_q^v, D_f^v\) denote the density (co-occurrence) matrices of a visual query and its visual context (feedback images), \(D_q^t, D_f^t\) denote the density matrices of a textual query and its textual context, and \(n\) denotes the number of images in the feedback set. Then
\[ \mathrm{tr}\left((M_1 \otimes M_2) \cdot (a^T a \otimes b^T b)\right) = \left( str_v \langle q_v|a\rangle^2 + (1 - str_v)\,\frac{1}{n} \sum_i \langle c_i|a\rangle^2 \right) \cdot \left( str_t \langle q_t|b\rangle^2 + (1 - str_t)\,\frac{1}{n} \sum_i \langle d_i|b\rangle^2 \right) \tag{7} \]
where
\[ str_v = \frac{\langle D_q^v \mid D_f^v \rangle}{\| D_q^v \| \, \| D_f^v \|} = \frac{\sum_i \langle q_v \mid c_i \rangle^2}{\langle q_v \mid q_v \rangle \, n \sqrt{\sum_i \langle c_i \mid c_i \rangle^2}} \tag{8} \]
and
\[ str_t = \frac{\langle D_q^t \mid D_f^t \rangle}{\| D_q^t \| \, \| D_f^t \|} = \frac{\sum_i \langle q_t \mid d_i \rangle^2}{\langle q_t \mid q_t \rangle \, n \sqrt{\sum_i \langle d_i \mid d_i \rangle^2}} \tag{9} \]
The presented hybrid adaptive model can easily be expanded to incorporate more features (e.g. textual and multiple visual features):
\[ \mathrm{tr}\left(\left(\bigotimes_n M_n\right) \cdot \left(\bigotimes_n a_n^T a_n\right)\right) = \prod_n \mathrm{tr}\left(M_n \cdot a_n^T a_n\right) \tag{10} \]
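The factorisation in Eq. (10) can be verified on small matrices. The sketch below (our own illustration, with hypothetical helper names) compares the factorised product against an explicit Kronecker-product construction:

```python
import numpy as np

def factorised_measurement(Ms, xs):
    """Right-hand side of Eq. (10): product of per-space traces,
    using tr(M x^T x) = x M x^T for a row vector x."""
    return np.prod([x @ M @ x for M, x in zip(Ms, xs)])

def explicit_measurement(Ms, xs):
    """Left-hand side of Eq. (10) for two spaces, materialising the
    tensor (Kronecker) products; only feasible for tiny dimensions."""
    M = np.kron(Ms[0], Ms[1])
    x = np.kron(xs[0], xs[1])
    return np.trace(M @ np.outer(x, x))
```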
Let us assume that the relevance feedback is provided after the first retrieval round to refine the query. The adaptive weighting can then be interpreted in the following way (see also the sketch after this list):
1) small \(\langle D_q|D_f\rangle\): a weak relationship between the query and its context, so the context becomes important. We adjust the probability of the original query terms; the adjustment will significantly modify the original query.
2) large \(\langle D_q|D_f\rangle\): a strong relationship (similarity) between the query and its context, so the context will not help much. The original query terms will tend to dominate the whole term distribution in the modified model; the adjustment will not significantly modify the original query.
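A minimal sketch of the adaptive weighting (Python/NumPy, our own illustration): it assumes the density matrices are (averaged) outer products compared with the Frobenius (trace) inner product, mirroring the first equalities of Eqs. (8) and (9); the helper names are hypothetical.

```python
import numpy as np

def density_matrix(vectors):
    """Averaged outer-product (density / co-occurrence) matrix of a
    set of row vectors, e.g. the query alone or the feedback set."""
    V = np.atleast_2d(np.asarray(vectors, dtype=float))
    return V.T @ V / V.shape[0]

def adaptive_strength(query_vec, feedback_vecs):
    """Normalised trace inner product between the query density matrix
    and the feedback-context density matrix (cf. Eqs. 8 and 9).
    A small value signals a weak query-context relationship, so the
    context term in Eq. (7) receives more weight."""
    Dq = density_matrix(query_vec)
    Df = density_matrix(feedback_vecs)
    num = np.sum(Dq * Df)  # tr(Dq Df) for symmetric matrices
    return num / (np.linalg.norm(Dq) * np.linalg.norm(Df))
```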
The visual features implemented in our prototype system comprise: edge histogram, homogeneous texture, bag of visual words features, colour histogram, co-occurrence matrix, and their combinations.
In the following subsections we present different variations
of our hybrid model.
A. Hybrid Relevance Feedback Model Based on the Orthogonal Projection
First, let us assume that, as in the original model, \(M_1 = r_1 \cdot D_q^v + \frac{r_2}{n} \cdot D_f^v\) and \(M_2 = r_1 \cdot D_q^t + \frac{r_2}{n} \cdot D_f^t\). We can decompose (via eigenvalue decomposition) the density matrices \(M_1, M_2\) to estimate the bases \((p_i^v, p_j^t)\) of the subspaces generated by the query and the images in the feedback set. (It has been highlighted [11] that the orthogonal decomposition may not be the best option for visual spaces, because the receptive fields that result from this process are not localized, and the vast majority do not resemble any known cortical receptive fields. Thus, in the case of visual spaces, we may want to utilize decomposition methods that produce non-orthogonal basis vectors.)
Now, let us consider the measurement
\[ \left\langle P_1 \otimes P_2 \,\middle|\, a^T a \otimes b^T b \right\rangle \tag{11} \]
where \(P_1, P_2\) are the projectors onto the visual and textual subspaces generated by the query and the images in the feedback set (\(P_1 = \sum_i (p_i^v)^T p_i^v\), \(P_2 = \sum_j (p_j^t)^T p_j^t\)), and \(a, b\) are the visual and textual representations of an image from the data set. Because the tensor product of the projectors corresponding to the visual and textual Hilbert spaces (\(H_1, H_2\)) is a projector onto the tensored Hilbert space (\(H_1 \otimes H_2\)), our similarity measurement can be interpreted as the probability of the relevance context: the probability that the vector \(a \otimes b\) was generated within the subspace (representing the relevance context) generated by \(M_1 \otimes M_2\).
Hence
\[ \left\langle P_1 \otimes P_2 \,\middle|\, a^T a \otimes b^T b \right\rangle = \left\langle P_1 \,\middle|\, a^T a \right\rangle \cdot \left\langle P_2 \,\middle|\, b^T b \right\rangle = \sum_i \left\langle (p_i^v)^T p_i^v \,\middle|\, a^T a \right\rangle \cdot \sum_j \left\langle (p_j^t)^T p_j^t \,\middle|\, b^T b \right\rangle = \sum_i \langle p_i^v | a \rangle^2 \cdot \sum_j \langle p_j^t | b \rangle^2 = \sum_i Pr_i^v \cdot \sum_j Pr_j^t = \left\| \left( \langle p_1^v|a\rangle, \ldots, \langle p_n^v|a\rangle \right) \otimes \left( \langle p_1^t|b\rangle, \ldots, \langle p_n^t|b\rangle \right) \right\|^2 \tag{12} \]
We can see that this measurement is equivalent to a weighted combination of the probabilities of the projections for all the images involved. In quantum mechanics, the square of the absolute value of the inner product between the initial state and an eigenstate is the probability of the system collapsing to this eigenstate. In our case, the square of the absolute value of the inner product can be interpreted as a particular contextual factor influencing the measurement.
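A sketch of this projection-based measurement follows (our own illustration; the use of `numpy.linalg.eigh` and the choice of how many leading eigenvectors k to keep are assumptions on our part):

```python
import numpy as np

def projector_from_density(M, k):
    """Projector onto the span of the k leading eigenvectors of a
    density matrix M (the eigenvalue decomposition of Section IV-A)."""
    vals, vecs = np.linalg.eigh(M)
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # leading eigenvectors
    return top @ top.T

def projection_probability(P1, P2, a, b):
    """Eq. (12): the tensored measurement factorises into the product of
    per-space projection probabilities, i.e. (a P1 a^T) * (b P2 b^T);
    a and b should be unit-norm for a probability reading."""
    return (a @ P1 @ a) * (b @ P2 @ b)
```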
B. Hybrid Relevance Feedback Model for Image Re-ranking
Another version of the original hybrid relevance feedback model employs density matrices corresponding to the feedback images only (no query density matrix). The quantum-like measurement is then utilized to re-rank the top images from the first retrieval round. We have found that fixed-weight relevance feedback models which utilize both query and feedback information should employ the measurement for re-scoring of the whole data collection, whereas relevance feedback models which utilize the feedback information only should employ the measurement for re-ranking of the top images. This is related to the dynamic nature of the importance of the query and its context, an issue the adaptive models help to alleviate. One of the main advantages of the hybrid relevance feedback model for image re-ranking is its lower computational cost compared with the re-scoring models.
Thus, the density matrices are now generated by the feedback images only:
\[ M_1 = \sum_i c_i^T \cdot c_i \tag{13} \]
and
\[ M_2 = \sum_i d_i^T \cdot d_i \tag{14} \]
and the model simplifies to
\[ \left\langle M_1 \otimes M_2 \,\middle|\, a^T a \otimes b^T b \right\rangle = \left\langle M_1 \,\middle|\, a^T a \right\rangle \cdot \left\langle M_2 \,\middle|\, b^T b \right\rangle = \left\langle \sum_i c_i^T c_i \,\middle|\, a^T a \right\rangle \cdot \left\langle \sum_i d_i^T d_i \,\middle|\, b^T b \right\rangle = \sum_i \langle c_i | a \rangle^2 \cdot \sum_i \langle d_i | b \rangle^2 \tag{15} \]
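Because Eq. (15) factorises over the feature spaces, re-ranking the top-m first-round results reduces to a couple of matrix products. A minimal sketch (our own illustration; the array names are hypothetical):

```python
import numpy as np

def rerank_scores(feedback_v, feedback_t, cand_v, cand_t):
    """Eq. (15): score each of the m candidate images by the product of
    the summed squared inner products with the visual (c_i) and textual
    (d_i) feedback vectors. feedback_* are (n, dim) arrays and cand_*
    are (m, dim) arrays."""
    sv = (cand_v @ feedback_v.T) ** 2   # squared <c_i | a> for every pair
    st = (cand_t @ feedback_t.T) ** 2   # squared <d_i | b> for every pair
    return sv.sum(axis=1) * st.sum(axis=1)

# usage sketch: indices that re-order the top-m results by relevance
# order = np.argsort(-rerank_scores(Fv, Ft, Av, At))
```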
V. CONCLUSIONS
In this paper, we have presented a unified hybrid image retrieval system consisting of various visual features and their combinations, a combination of visual and textual feature spaces, a combination of visual and textual feature spaces in the context of search refinement with an adaptive weighting scheme, and an interactive user interface with exploratory search, query history, continuous degrees of relevance, and positive and negative results panels.
Because the original hybrid model for the combination of features in the context of user feedback can be modified to produce other hybrid models, we also describe two novel spinoff models and discuss their potential advantages.
The relevance continuum for an image retrieval system consists of the relevance bar and the continuous relevance feedback model. The relevance continuum shows the absolute and relative relationships between feedback images with respect to their relevance degrees. This makes it straightforward to add new images to the feedback set and to update the feedback set: the feedback images can be rearranged on the relevance continuum.
VI. FUTURE WORK
We plan to perform rigorous testing and comparison of the two variations of the original hybrid model. The investigated models will be used in an ocean monitoring application, where the images and videos are captured by marine robots and made accessible to users via wireless communication and cloud services. The user will be provided with a unique solution combining a virtual reality real-time headset, a 360-degree view from the 360-degree surface and underwater camera, and augmented reality to remotely monitor the surface and underwater environment (Figure 4). Our objective is to enhance the user interaction with the remote sensing and control applications.

Fig. 4: Ocean Monitoring Application - augmented user interface.

The marine robot will augment the real-time surface and underwater data stream with information about similar (previously encountered) information objects. The retrieval of similar objects will be based on the fusion of different types of features, in order to reduce the semantic gap: the difference between the machine representation and the human perception of images. Thus, the real-time visual image of the environment will be augmented with additional digital information. When the user presses the select button on the Virtual Reality (VR) control pad, on-the-fly image capture and retrieval will present the user with visual and textual information related to similar information objects. A pop-up window will be used to display the retrieval results, which can be freely browsed. The user will be able to further narrow down the presented top retrieval results by highlighting the relevant images with one of the control pad buttons. When the select button is used on one of the top result images, the associated textual information will be displayed.
ACKNOWLEDGMENT
This work has been partially funded by the CERBERO project (no. 732105), a HORIZON 2020 EU project. The CERBERO project aims at developing a design environment for Cyber Physical Systems based on two pillars: a cross-layer model-based approach to describe, optimize, and analyze the system and all its different views concurrently; and an advanced adaptivity support based on a multi-layer autonomous engine. AmbieSense works on a new type of marine robot with surface and underwater surveillance capabilities, which is one of the CERBERO use cases.
REFERENCES
[1] Z. Chen, L. Wenyin, F. Zhang, M. Li, H. Zhang. Web Mining for Web
Image Retrieval. Journal of the American Society for Information Science
and Technology, 52(10):831–839, 2001.
[2] F. Crestani, G. Pasi. Soft Computing in Information Retrieval: Techniques
and Applications. Physica, 50, 2013.
[3] T. Deselaers, D. Keysers, H. Ney. FIRE - Flexible Image Retrieval
Engine: ImageCLEF 2004 Evaluation. Proceedings of the 5th Conference
on Cross-Language Evaluation Forum: Multilingual Information Access
for Text, Speech and Images, 688–698, 2005.
[4] A. Goker, H. Myrhaug, R. Bierig. Context and Information Retrieval,
in Information Retrieval: Searching in the 21st Century. John Wiley and
Sons, 2009.
[5] L. Kaliciak, H. Myrhaug, A. Goker, D. Song. Adaptive Relevance
Feedback for Fusion of Text and Visual Features. The 18th International
Conference on Information Fusion (Fusion 2015), Washington DC, USA,
1322–1329, 2015.
[6] L. Kaliciak, H. Myrhaug, A. Goker, D. Song. On the Duality of Fusion
Strategies and Query Modification as a Combination of Scores. The
17th International Conference on Information Fusion (Fusion 2014),
Salamanca, Spain, 2014.
[7] L. Kuang, F. Hao, L. T. Yang, M. Lin, C. Luo and G. Min. A
Tensor-Based Approach for Big Data Representation and Dimensionality
Reduction. IEEE Transactions on Emerging Topics in Computing, 2,
3:280–291, 2014.
[8] H. Liu, S. Zagorac, V. Uren, D. Song, S. Ruger. Enabling Effective
User Interactions in Content-Based Image Retrieval. Proceedings of
the 5th Asia Information Retrieval Symposium on Information Retrieval
Technology, 265–276, 2009.
[9] H. Liu, P. Mulholland, D. Song, V. Uren, S. Ruger. An Information
Foraging Theory-based User Study of an Adaptive User Interaction
Framework for Content-based Image Retrieval. 17th International
Conference on MultiMedia Modeling, (MMM), 6524: 241–251, 2011.
[10] D. Markonis, R. Schaer, H. Müller. Evaluating Multimodal Relevance Feedback Techniques for Medical Image Retrieval. Information Retrieval Journal, 19, 1:100–112, 2016.
[11] B.A. Olshausen, D.J. Field. Emergence of Simple-cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature, 381:607–609, 1996.
[12] M. Ortega-Binderberger, S. Mehrotra, K. Chakrabarti, K. Porkaew.
WebMARS: a Multimedia Search Engine. Proceedings of the 12th
Annual ACM International Conference on Multimedia, 314–321, 1999.
[13] L. Qiao, B. Zhang, L. Zhuang and J. Su. An Efficient Algorithm for Tensor Principal Component Analysis via Proximal Linearized Alternating Direction Method of Multipliers. International Conference on Advanced Cloud and Big Data (CBD), 283–288, 2016.
[14] T. Quack, U. Monich, L. Thiele, B. S. Manjunath. Cortina: a System
for Large-scale, Content-based Web Image Retrieval. Proceedings of the
12th Annual ACM International Conference on Multimedia, 508–511,
2004.
[15] Y. Rui, T. S. Huang, M. Ortega, S. Mehrotra. Relevance Feedback:
a Power Tool for Interactive Content-based Image Retrieval. IEEE
Transactions on Circuits and Systems for Video Technology, 8,5:644–
655, 1998.
[16] A. Spink, H. Greisdorf, J. Bateman. From Highly Relevant to Not
Relevant: Examining Different Regions of Relevance. Information
Processing and Management, 34, 5: 599–621, 1998.
[17] J. Teevan, S. Dumais, E. Horvitz. Personalizing Search via Automated Analysis of Interests and Activities. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 449–456, 2005.
[18] R. Tronci, G. Murgia, M. Pili, L. Piras, G. Giacinto. Imagehunter: a
Novel Tool for Relevance Feedback in Content-based Image Retrieval.
New Challenges in Distributed Information Filtering and Retrieval, 53–
70, 2013.
[19] TensorFlow. https://www.tensorflow.org/
[20] Torch. http://torch.ch/