Submitted to International Journal of Multimedia Information Retrieval
Final authenticated publication:
Learning Visual Features for Relational CBIR
Nicola Messina · Giuseppe Amato · Fabio Carrara · Fabrizio Falchi ·
Claudio Gennaro
Submitted: 15 April 2019 / Revised: 20 July 2019 / Accepted: 4 September 2019
Springer-Verlag London Ltd., part of Springer Nature 2019
Abstract Recent works in deep-learning research
highlighted remarkable relational reasoning capabili-
ties of some carefully designed architectures. In this
work, we employ a relationship-aware deep learning
model to extract compact visual features for use as
relational image descriptors. In particular, we are in-
terested in Relational Content-Based Image Retrieval
(R-CBIR), a task consisting in finding images contain-
ing similar inter-object relationships. Inspired by the
Relation Networks (RN) employed in Relational Vi-
sual Question Answering (R-VQA), we present novel
architectures to explicitly capture relational informa-
tion from images in the form of network activations
that can be subsequently extracted and used as vi-
sual features. We describe a two-stage Relation Net-
work module (2S-RN), trained on the R-VQA task,
able to collect non-aggregated visual features. Then,
we propose the Aggregated Visual Features Relation
Network (AVF-RN) module, which is able to produce
better relationship-aware features by learning the ag-
gregation directly inside the network. We employ an
R-CBIR ground-truth built by exploiting scene-graph
similarities available in the CLEVR dataset in order to
rank images in a relational fashion. Experiments show
that features extracted from our two-stage RN (2S-RN)
model provide an improved retrieval performance with
respect to standard non-relational methods. Moreover,
we demonstrate that the features extracted from the
novel AVF-RN can further improve the performance
measured on the R-CBIR task, reaching the state-of-
the-art on the proposed dataset.
N. Messina, G. Amato, F. Carrara, F. Falchi, C. Gennaro
via G. Moruzzi, 1 - 56124 Pisa, Italy
Keywords CLEVR, Content-Based Image Retrieval,
Deep Learning, Relational Reasoning, Relation
Networks, Deep Features
1 Introduction
Recent advances in deep-learning technologies brought
to light remarkable capabilities of neural networks. In
particular, focusing on the computer vision world, one
of the aims of deep-learning architectures consists in
understanding the content of an image at a high-level
of abstraction. In this respect, some specific tasks have
been developed in order to test the capabilities of newly
proposed architectures to cope with high-level reason-
Understanding relationships between entities is con-
sidered a difficult task since it requires complex reason-
ing skills. For this reason, some challenging tasks such
as Relational Visual Question Answering (R-VQA) and
Visual Relationships Detection (VRD) have been in-
troduced as reference tasks for probing relational ca-
pabilities of deep-learning solutions. R-VQA consists of
answering questions related to difficult inter-object re-
lationships in an image; on the other hand, VRD tries
to recover relationships between couples of objects in
the images by coding the information in the form of ⟨subject, predicate, object⟩ triplets. R-VQA and VRD underlined some of the difficulties that current deep-learning approaches face when reasoning about relationships between different objects: plain convolutional architectures achieve strong performance in tasks such as image classification or object recognition, but they exhibit clear limitations in relational contexts.
In this work, we analyze the possibility of applying
relational understanding capabilities to the Content-
Based Image Retrieval (CBIR) task. More in detail, we are interested in the sub-field of Relational CBIR
(R-CBIR) in which the aim is to retrieve images with
given relationships among objects.
This study is focused on bringing image retrieval a
step further with respect to current approaches, keep-
ing the basic idea untouched. In fact, the similarity be-
tween two images is always measured as the affinity between some sort of high-level features extracted from the images. Our objective consists in extracting a relationship-
aware descriptor able to embed relational information.
These descriptors should be easily comparable using
standard distance metrics so that they can be used in
standard indexing engines. The distance between fea-
tures should embody the dissimilarity between the re-
spective images in terms of relationships between the
objects contained in them.
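As an illustration of this requirement, the following minimal sketch (ours, with purely illustrative names) ranks a gallery by Euclidean distance between l2-normalized relational descriptors, exactly the kind of comparison a standard metric index supports:

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    # l2-normalize so that Euclidean distance behaves like cosine dissimilarity
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(g - q, axis=1)
    return np.argsort(dists)  # most relationally similar images first
```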
The key contribution of this work is the introduc-
tion of architectures able to learn relational features
directly inside the network. These proposed architec-
tures, however, are not trained directly on the R-CBIR
task; instead, this work investigates the possibility of learning features from networks trained on the task of R-VQA.
The transfer-learning methodology is not a novel
approach for CBIR. Standard CBIR features are ex-
tracted from architectures trained for example on im-
age classification tasks. Image classification, however,
does not require the architecture to learn difficult rela-
tional concepts. Hence, as far as R-CBIR is concerned,
relational-aware features can be extracted from archi-
tectures trained on a task that requires high-level rea-
soning capabilities, and the R-VQA task perfectly fills
this need. In fact, we rely on the assumption that ar-
chitectures that are able to correctly answer questions
on complex inter-object relationships have internally
learned some relational concepts that can be later ex-
tracted and compared.
We perform this study in a fully controlled environ-
ment, using the images and scene graphs provided by
the CLEVR synthetic dataset. CLEVR is a diagnostic
dataset originally designed for the task of R-VQA, and
it is composed of 3D rendered scenes made up of simple
shapes. Unlike real-world datasets like Visual Genome,
it avoids common relational biases. Also, being a highly
controlled environment, it is useful for testing in fine detail the very specific relational capabilities of deep-learning architectures.
In this work, we extend the study published at the
CEFRL workshop of ECCV 2018 on 2S-RN [18] in which
we discussed the possibility of extracting relationship-
aware visual features from an architecture trained on
the R-VQA task. 2S-RN is designed so that the extracted features must be aggregated afterwards, by averaging all the contributions from every object couple. For this reason, it is possible that the aggregated features do not embed in an efficient manner all the information needed to fully describe a scene. The
novel proposed network Aggregated Visual Features Re-
lation Network (AVF-RN) solves this problem by learn-
ing the aggregation directly inside the network. By do-
ing so, we are obliging the network to incorporate as
much information as possible inside the aggregated fea-
tures. Hence, the extracted activations can immediately
be used as compact visual features. To sum up, we extend the 2S-RN approach by adding the following contributions:

– we propose the Aggregated Visual Features Relation Network (AVF-RN), a novel architecture that is able to learn aggregated relationship-aware features directly inside the network;
– we train AVF-RN on the R-VQA task on the CLEVR dataset;
– we compare the features extracted from the AVF-RN network with 2S-RN features on the R-CBIR task, using three different CLEVR dataset configurations; we also include, as a non-relational baseline, the CNN features extracted from a simple model trained on multi-label classification of CLEVR scenes.
The rest of the paper is organized as follows. In
section 2, we review some of the works belonging to the
Relational Learning world, mainly focusing on VRD, R-
VQA, and R-CBIR. In section 3, we describe in detail the process needed for creating the relational ground truth from CLEVR. In section 4, we describe in detail
the proposed AVF-RN architecture. In section 5, we
describe our experimental setup, we collect the results
also considering baseline architectures present in the
literature, and we discuss the obtained results. Finally,
in section 6, we recap our contribution, and we present
future directions for this research.
2 Related Work
In this section, we review some of the works related to Relational Learning, in particular those related to the Relational Visual Question Answering (R-VQA) and Visual Relationship Detection (VRD) tasks. Afterward, we review some of the existing approaches to Relational CBIR (R-CBIR).

Visual Relationship Detection (VRD) Recent work has
addressed the problem of visual relationships detection
(VRD) in images in the form of triplets (subject, predicate, object), where subject and object are common ob-
jects present in an image, and predicate indicates a re-
lationship between them out of a set of possible relationships containing verbs, prepositions, and comparatives.
Several datasets comprise a large set of visual relationships, such as [11,13,19]. They have opened the way to approaches aimed at detecting inter-object relationships in images [13,19,4].
A common approach to VRD, employed by many [13,27,20,29], consists at first in proposing entities using region proposal networks, such as Faster-RCNN [23]. Then, once the entities have been located, a network tries to reason on the relationships occurring between them.
Although approaches that solve VRD are able to detect relationships, they usually do not encode the learned information in a compact representation: all possible relationships are combinatorially tested at prediction time.
Relational VQA (R-VQA) R-VQA comes from the ba-
sic task of VQA (Visual Question Answering). Plain
VQA consists in giving the correct answer to a ques-
tion asked about a given picture, so it requires connecting together different entities coming from heterogeneous representations (text and visuals).
Some works [31,28] proposed approaches to stan-
dard VQA problems on datasets such as VQA [1],
DAQUAR [15], COCO-QA [22].
Recently, there has been a tendency to conceptually separate VQA and Relational-VQA (R-VQA). In R-VQA, in fact, images contain difficult inter-object relationships, and questions are formulated in such a way that it is impossible for deep architectures to answer correctly without having understood the high-level interactions between the objects in the same image. Some datasets,
such as CLEVR [7], RVQA [14], FigureQA [10], move
the attention towards this new challenging task.
On the CLEVR dataset, the authors of [25] and [21] proposed a novel architecture specialized to reason in a relational way. They introduced a particular layer called the Relation Network (RN), which is specialized in comparing pairs of objects. Object representations are learned
by means of a four-layer CNN, and the question embed-
ding is generated through an LSTM. The overall archi-
tecture, composed of CNN, LSTM, and the RN, can be
trained fully end-to-end, and it is able to reach super-
human performances. Other solutions [6,8] introduce
compositional approaches able to explicitly model the
reasoning process by dynamically building a reasoning
graph that states which operations must be carried out
and in which order to obtain the right answer. These
architectures are internally split into two different sub-
components: a generator network that produces an ex-
ecution graph based on the question embeddings, and
an execution network that executes the graph produced
by the generator network taking in input the image fea-
tures and outputting the answer. Usually, these archi-
tectures tend to perform poorly when compared with other high-performing solutions.
In order to close the performance gap between in-
terpretable architectures and high performing solutions,
[16] proposed a set of visual-reasoning primitives that
are able to perform complex reasoning tasks in an ex-
plicitly interpretable manner.
R-CBIR While standard CBIR attracted a lot of attention even before the deep-learning era, R-CBIR involves complex reasoning skills, and current deep-learning approaches have shown promising results in this direction.
Nevertheless, in this work, we use the same basic ideas from the standard CBIR methodology; we act only on the feature extraction process. We take as ref-
erence the work by [26] that introduced RMAC features
— one of the state-of-the-art non-relational image de-
scriptors for image instance retrieval. This descriptor
encodes and aggregates several regions of the image in a
dense and compact global image representation exploit-
ing a pre-trained fully convolutional network for feature
map extraction. The aggregated descriptor is obtained
by max-pooling the feature map over different regions
and scales and summing them together.
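As a rough illustration of this aggregation, the following toy sketch (ours; real RMAC also uses overlapping regions and PCA-whitening, omitted here) max-pools a precomputed feature map over a multi-scale grid of regions and sums the l2-normalized region vectors:

```python
import torch
import torch.nn.functional as F

def rmac_like(fmap, scales=(1, 2, 3)):
    # fmap: (C, H, W) activation map from a fully convolutional network
    C, H, W = fmap.shape
    regions = []
    for s in scales:
        h, w = H // s, W // s  # non-overlapping grid: a simplification
        for y in range(s):
            for x in range(s):
                r = fmap[:, y*h:(y+1)*h, x*w:(x+1)*w]
                regions.append(F.normalize(r.amax(dim=(1, 2)), dim=0))
    # sum the per-region descriptors and re-normalize into one global vector
    return F.normalize(torch.stack(regions).sum(dim=0), dim=0)
```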
As regards the work carried out on R-CBIR, there has been some experimentation using both CLEVR and real-world datasets. [9] introduced a CRF model able to ground relationships, given in the form of a scene graph, to test images for image retrieval purposes. However, this model is not able to produce a compact feature. They employed a simple dataset composed of 5,000 images annotated with objects and their relationships.
More recently, using the Visual Genome dataset, [30] implemented a large-scale image retrieval system
able to map textual triplets into visual ones (object-
subject-relation inferred from the image) projecting
them into a common space learned through a modified
version of triplet-loss.
The works by [2,18] exploit the graph data asso-
ciated with every image in order to produce ranking
goodness metrics, such as nDCG and Spearman-Rho
ranking correlation indexes. Their objective was evalu-
ating the quality of the ranking produced for a given
query, keeping into consideration the relational content
of every scene.
3 A Relational-CBIR Ground-Truth
In order to evaluate the quality of any relational fea-
ture extracted from a relationship-aware system, we
compute a specific ground-truth exploiting relational
knowledge embedded into graphs (scene-graphs).
By carefully choosing a distance function between
graphs, we are able to give a good estimation of the
relational similarity between scenes. In order to accom-
plish this task, we need some datasets that include a
formal and precise description of relations occurring in-
side the scene. In this work, we use the synthetically generated CLEVR dataset [7].

3.1 The CLEVR dataset

CLEVR [7] is a synthetic dataset composed of 3D
rendered scenes, and it has been designed for the R-
VQA task. There are 100k rendered images subdivided
among training (70k), validation (15k), and test (15k)
sets. The total number of questions is 865k, again split among training (700k), validation (150k), and test (15k) sets.
scene contains different simple shaped objects with
mixtures of colors, materials, and sizes. There are
cubes, spheres, and cylinders, each one of which can
have a color chosen among eight; they can be big or
small, and they can be made of one of two different ma-
terials, metal or rubber. The scene is fully and uniquely
described by a scene graph. The scene graph describes
in a formal way all the relationships between objects.
Each question is formulated in the form of a functional program; the answer to a question on a scene is simply calculated by executing its functional program on the scene
graph. Scene graphs are rendered to photo-realistic 3D
scenes by using Blender, a free 3D software; instead,
functional programs are converted to natural language
expressions compiling textual templates embedded in
the dataset and written in English.
The CLEVR dataset gives us far more control over the learning phase than other datasets in the literature. The information in each sample of the dataset is complete and exclusive: no common-sense knowledge is needed in order to correctly answer the questions. Answers can be given by simply understanding the question and reasoning exclusively on the image, without needing external concepts.
Fig. 1: CLEVR scene with associated scene graph.
3.2 Scene graphs
The best way to formally describe relations inside a
scene is by making use of scene graphs, already avail-
able in CLEVR. More in detail, a scene graph contains nodes, which account for the objects occupying the scene, and edges, which describe the relations occurring among them.
Every node or edge can be assigned a set of attributes that fully describes it. CLEVR includes some specific object attributes, namely the color, the shape, the material, and the size, and accounts for the following spatial relationships: to the left of, to the right of, in front of, and behind.
In Figure 1, we report an example image from CLEVR with the associated scene graph. Note that, although the CLEVR graph is complete, half of the edges can be removed without losing information, since to the right of implies an opposite edge to the left of, and in front of implies an opposite edge behind.
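For illustration, a CLEVR-style scene graph with the four node attributes and one spatial edge per ordered pair can be sketched with networkx (a toy example of ours, not code shipped with the dataset):

```python
import networkx as nx

g = nx.DiGraph()
g.add_node(0, shape="cylinder", color="cyan", material="metal", size="small")
g.add_node(1, shape="cylinder", color="blue", material="rubber", size="small")
# one direction per pair suffices: "right of" and "front of" imply their opposites
g.add_edge(0, 1, relation="front")  # object 0 is in front of object 1
```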
3.3 Ground-truth generation
We define a ground-truth for retrieving images with
similar relations among objects relying on the similar-
ity between scene graphs. Two scene graphs should be
similar if they can depict almost the same relations be-
tween the same objects. However, evaluating the sim-
ilarity between two graphs is not trivial; furthermore,
it is often a subjective task, since there are aspects of the graph (e.g., the attributes associated with the nodes) that weigh differently depending on the specific application.
Although many solutions have been proposed in the literature for defining distances between graph-structured data [3], for this particular use-case we decided to employ the graph edit distance (GED), an extension of the well-known edit distance to graphs.
Differently from strings, edit operations on graphs
include delete,insert, and substitute for both nodes and
edges, for a total of 6 edit operations. The computa-
tion of the GED is cast as an optimization problem. Since the GED problem is known to be computationally hard, in this work we employ an approximated version of the GED algorithm: computational times quickly become unworkable on CLEVR scene graphs, even when removing the redundant behind and left edges. For
this reason, we used an implementation based on [24],
that is able to perform an efficient approximation of
the algorithm. The approximated GED algorithm does not consider the entire space of solutions; instead, it looks for a small subset of edit sequences, obtained by first matching similar nodes using linear assignment and then matching edges based on the resulting node pairing.
The node and edge edit costs can be customized on the basis of their attributes. In particular, we use a cost of 1 for node and edge insertions/deletions, and a cost of 1 for substituting an edge with one of a different relation kind; a null cost is applied otherwise. The node substitution cost is driven by a policy that weights all attributes equally: since in CLEVR there are 4 attributes per node, every attribute substitution costs 0.25.
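The following sketch illustrates this cost policy. Note that the paper relies on an implementation based on [24]; here, purely as an example, we plug the same costs into networkx's any-time approximate GED optimizer:

```python
import networkx as nx

ATTRS = ("shape", "color", "material", "size")  # the 4 CLEVR node attributes

def node_subst_cost(n1, n2):
    return 0.25 * sum(n1[a] != n2[a] for a in ATTRS)  # 0.25 per changed attribute

def edge_subst_cost(e1, e2):
    return 0.0 if e1["relation"] == e2["relation"] else 1.0  # 1 if relation differs

def approx_ged(g1, g2):
    # the optimizer yields progressively better solutions; take the first one
    return next(nx.optimize_graph_edit_distance(
        g1, g2,
        node_subst_cost=node_subst_cost, edge_subst_cost=edge_subst_cost,
        node_del_cost=lambda n: 1.0, node_ins_cost=lambda n: 1.0,
        edge_del_cost=lambda e: 1.0, edge_ins_cost=lambda e: 1.0))
```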
To clarify how the GED algorithm works with our cost policy, we report an example in Figure 2. This instance of GED computation, transforming the upper scene into the lower one, returns a cost of 1.5.
In the light of this, given a query, we compute
the ground-truth ranking of the dataset by sorting all
scenes using computed GEDs between the scene graph
of the query image and the graphs from all the others.
Given an image ranking produced by an arbitrary relationship-aware system, a rank correlation metric is computed against the ground-truth ranking. In this work, we use the Spearman-Rho correlation index, which is a common ranking-similarity measure often employed in information retrieval scenarios [17].
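For a single query, computing this index is immediate with scipy (a minimal sketch with made-up numbers):

```python
from scipy.stats import spearmanr

ged_to_query = [0.0, 1.5, 2.25, 0.5, 3.0]   # ground-truth graph distances
feat_to_query = [0.1, 1.2, 2.00, 0.7, 2.6]  # distances between visual features

rho, _ = spearmanr(ged_to_query, feat_to_query)
print(f"Spearman-Rho: {rho:.2f}")  # 1.0 means the two rankings coincide
```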
4 Models
In this section, we describe our architectures tailored
to explicit relationship-aware features learning. First
of all, we review the basic formulation of the Rela-
tion Network (RN) for the sake of comparison with the
newly introduced architecture. Then, we describe our
proposals, namely the 2-stage Relation Network (2S-
RN, previously introduced in [18]) and its extension —
the novel AVF-RN architecture. Differently from 2S-
RN, AVF-RN performs the aggregation of the visual
features directly inside the network.
4.1 RN and 2S-RN overview
The Relation Network (RN) [25] approached the task of R-VQA and obtained remarkable results on the CLEVR dataset. RN modules combine input objects by forming all possible pairs and apply a common transformation to them, producing activations aimed at storing information about the possible relationships among the input objects. For the specific task of R-VQA, the authors used a four-layer CNN to learn visual object representations, which are
then fed to the RN module and combined with the tex-
tual embedding of the question produced by an LSTM,
conditioning the relationship information on the tex-
tual modality. The core of the RN module is given by
the following:

r = Σ_{i,j} gθ(oi, oj, q),  (1)
where gθ is a parametric function whose parameters θ can be learned during the training phase; specifically, it is a multi-layer perceptron (MLP) network. oi and oj are the objects forming the pair under consideration, and q is the question embedding vector obtained from the LSTM module. The answer is then predicted by a downstream network fφ followed by a softmax layer that outputs probabilities for every answer:

a = softmax(fφ(r)).  (2)
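A compact PyTorch sketch may clarify Eqs. (1)-(2); the layer sizes follow the configuration of Section 4.1.1, while the broadcasting details are an illustrative choice of ours:

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=24, q_dim=256, hidden=256, n_answers=28):
        super().__init__()
        self.g = nn.Sequential(  # g_theta: four 256-d fully-connected layers
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(  # f_phi: two layers plus the answer logits
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers))

    def forward(self, objects, q):
        # objects: (B, N, obj_dim); q: (B, q_dim)
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)
        o_j = objects.unsqueeze(1).expand(B, N, N, D)
        q_rep = q.unsqueeze(1).unsqueeze(2).expand(B, N, N, q.size(-1))
        r = self.g(torch.cat([o_i, o_j, q_rep], dim=-1)).sum(dim=(1, 2))  # Eq. (1)
        return self.f(r)  # Eq. (2); the softmax is applied inside the loss
```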
Relationship-aware features useful for R-CBIR
should be extracted from a stage inside the network still not conditioned on the question. Hence, valid R-CBIR
features can be extracted from the original RN mod-
ule only at the output of the convolutional layer since,
after that, questions entirely condition the remaining layers.
For this reason, the two-stage pipeline [18] was pro-
posed in order to decouple visual relationships process-
ing (first-stage) from the question elaboration (second-
stage) so that the activations from a layer in the first
stage can be employed as visual relationship-aware fea-
tures.

Fig. 2: GED computation example. Steps: (1) substitute node small-cyan-metal-cylinder with big-cyan-metal-sphere (2 attribute changes, cost 0.5); (2) substitute edge small-cyan-metal-cylinder behind small-blue-rubber-cylinder with big-cyan-metal-sphere in front of small-blue-rubber-cylinder (different relation kind, cost 1).

The 2S-RN considers all possible relationships between objects gθ(oi, oj) in the image. The function gθ is called the first stage of the RN. The output of this stage is a representation of the relationships between objects in the image, not conditioned on the question. Then, the obtained relational representations ri,j = gθ(oi, oj) are combined with the question embedding q as follows:

Σ_{i,j} hψ(ri,j, q) = Σ_{i,j} hψ(gθ(oi, oj), q),  (3)
where hψ is the second stage, implemented as a multi-layer perceptron network with parameters ψ. Using this solution, the 2S-RN constrains the network to learn relational concepts without considering the questions, at least during the first stage, before the hψ(·) function evaluation. Hence, the 2S-RN architecture enables relationship-aware feature extraction from the output of any layer of the gθ(·) function.
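The two-stage split can be sketched as follows (an illustrative PyTorch rendition of ours; the extractable, question-agnostic features are the activations of g):

```python
import torch
import torch.nn as nn

class TwoStageRN(nn.Module):
    def __init__(self, obj_dim=24, q_dim=256, hidden=256):
        super().__init__()
        self.g = nn.Sequential(  # first stage: sees only object pairs
            nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.h = nn.Sequential(  # second stage: question-conditioned
            nn.Linear(hidden + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, objects, q):
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)
        o_j = objects.unsqueeze(1).expand(B, N, N, D)
        r_ij = self.g(torch.cat([o_i, o_j], dim=-1))  # extractable 2S-RN features
        q_rep = q.unsqueeze(1).unsqueeze(2).expand(B, N, N, q.size(-1))
        return self.h(torch.cat([r_ij, q_rep], dim=-1)).sum(dim=(1, 2))  # Eq. (3)
```

At retrieval time, the r_ij activations are averaged over all the couples to obtain the image descriptor.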
4.1.1 Detailed Configuration
Both the RN and the 2S-RN architectures are trained
on the R-VQA task on the CLEVR dataset.
Concerning the RN network, we use the very same setup described by the authors. In particular, the CNN is composed of 4 convolutional layers, each with 24 kernels, ReLU non-linearities, and batch normalization; gθ and fφ are composed of 256-dimensional fully-connected layers, with ReLU non-linearities after every layer, with four and two layers respectively. The final
linear layer with 28 units produces logits for a softmax
layer over the answers vocabulary; finally, the learning
rate follows an exponential step increasing policy, that
doubles it every 20 epochs, from 5e-6 up to 5e-4. Fea-
tures are extracted directly at the end of the CNN and
are aggregated using global average pooling.
2S-RN follows a setup very similar to that of the original RN. Differently from the RN, gθ and the novel hψ are both composed of 2 fully-connected layers. In this case, features are extracted at the end of the gθ function, immediately before the question concatenation.
Detailed architectures are shown in Figures 3a and 3b.
Both RN and 2S-RN reach very high performance when trained on CLEVR R-VQA: they obtain 93.6% and 93.8% accuracy on the test set, respectively.
4.2 Aggregated Visual Features Relation Network
The 2S-RN approach is able to extract the relational
content from the visual pipeline before it is conditioned
by the question embedding. Nevertheless, features ex-
tracted from the 2S-RN are still not aggregated and
contain all the descriptions from every couple of ob-
jects. Hence, standard 2S-RN features are aggregated only during the extraction process, by simply averaging them while iterating through all the couples.
Our contribution consists in learning the feature aggregation directly inside the network. To this aim, we slightly changed the 2S-RN architecture in order to aggregate all the object couples before inserting the question embedding into the pipeline. Hence, the AVF-RN network can be described by the following equation:

r = (q, hψ(Σ_{i,j} ri,j)) = (q, hψ(Σ_{i,j} gθ(oi, oj))),  (4)
with the same naming conventions used for 2S-RN; here, (·, ·) denotes the concatenation of the question embedding with the aggregated visual feature. However, differently from 2S-RN, hψ is not evaluated for every couple; instead, it is evaluated once, on the already-aggregated visual features. For this reason, the role of hψ changes with respect to the 2S-RN case: in AVF-RN, the purpose of hψ is to process the already-aggregated visual feature, while in 2S-RN it processes textual and visual features from every couple of objects.
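A minimal sketch of Eq. (4), assuming the weighted-sum aggregation discussed later in Section 4.2.1 (the fixed number of objects, the per-pair weight vector, and the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AVFRN(nn.Module):
    def __init__(self, obj_dim=24, n_obj=8, q_dim=256, hidden=256, feat=512,
                 n_answers=28):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, feat), nn.ReLU())
        self.pair_w = nn.Parameter(torch.ones(n_obj * n_obj))  # weighted-sum weights
        self.h = nn.Sequential(nn.Linear(feat, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_answers))

    def forward(self, objects, q):
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)
        o_j = objects.unsqueeze(1).expand(B, N, N, D)
        r_ij = self.g(torch.cat([o_i, o_j], dim=-1)).flatten(1, 2)  # (B, N*N, feat)
        agg = (self.pair_w.unsqueeze(-1) * r_ij).sum(dim=1)  # learned weighted sum
        visual = self.h(agg)  # this activation is the compact extracted feature
        return self.f(torch.cat([visual, q], dim=-1))  # question enters only here
```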
The architecture has been designed so that each function gθ, hψ, and fφ can be customized with any number of fully-connected layers with any number of neurons each. More in detail:

– gθ comprises the n layers before the aggregation operation;
– hψ comprises the m layers between the aggregation and the question insertion;
– fφ comprises the k layers after the question insertion; they are aimed at processing the joint visual aggregated features and the textual ones to obtain the information needed to predict the answer.

Fig. 3: Detailed RN and 2S-RN architectures with layer configurations: (a) Relation Network (RN) architecture; (b) two-stage Relation Network (2S-RN) architecture.
The overall architecture is reported in Figure 4.
4.2.1 Detailed configuration and hyper-parameters
In the case of RN and 2S-RN, the concatenation of the
question with all the couples works as a simple but
quite effective attention mechanism. The novel AVF-
RN model, instead, introduces the question embedding
after the aggregation. We gain in feature relational ex-
pressiveness but, on the other hand, the attention effect
is lost. For this reason, we obtain an overall lower accuracy with respect to the RN and the 2S-RN architec-
tures. There are several hyper-parameters that should
be tuned and an extensive search is not feasible. Among
the hyper-parameters, the most important ones are the
number of fully-connected layers for every function gθ, hψ, and fφ, namely n, m, and k, and the output size for
all of these layers. We try to stick, wherever possible,
to successful configurations observed when training the
RN and the 2S-RN architectures. In Table 1 we collect
some of the hyper-parameters experimentation we per-
formed on this architecture, together with the reached
accuracy on the CLEVR R-VQA task.
The best result is obtained using a weighted sum as aggregation, with weights learned during training, one layer for hψ, and three layers for fφ. The aggregation is positioned after the 4th fully-connected layer of gθ, while the question is inserted after a single fully-connected layer of hψ.
The 4th layer of gθ is larger in order to augment the expressiveness of the aggregated feature. In order to speed up convergence, we initialize the weights of the CNN and of the first two fully-connected layers of gθ with the weights coming from the respective layers of the 2S-RN architecture (they are the only layers maintaining the same role and the same interface in the AVF-RN).
Fig. 4: AVF-RN architecture overview. The number of fully-connected layers is fully customizable, as well as the
aggregation function.
Table 1: Accuracy for different fully-connected layer configurations of the functions gθ, hψ, and fφ. Each configuration lists the output size of every fully-connected layer.

gθ config.          | hψ config. | fφ config.    | Aggr. type   | Accuracy (%)
256, 256, 512       | 256        | 256, 256      | sum          | 53.8
256, 256, 256, 512  | 256        | 256, 256      | sum          | 53.2
256, 256, 256, 256  | 256        | 256, 256, 256 | sum          | 54.0
256, 256, 256, 512  | 256        | 256, 256, 256 | sum          | 54.2
256, 256, 256, 1024 | -          | 512, 1024     | weighted-sum | 55.7
256, 256, 256, 512  | 256        | 256, 256, 256 | weighted-sum | 64.5
Even if the reached accuracy is quite far from the performance reached by the RN and the 2S-RN architectures, it is sufficient for learning relationship-aware visual features.
5 Experimental Setup
In this section, we compare all the different architec-
tures explained in Section 4 on the task of R-CBIR.
We use a standard CBIR metric for comparing our re-
sults, namely the Spearman-Rho metric. As a baseline,
we choose the ranking obtained with one of the state-
of-the-art non-relational image descriptors for image in-
stance retrieval, namely the RMAC descriptor [26].
Also, as a further non-relational baseline, we train a simple architecture on a multi-label classification task, where the objective consists in correctly classifying all the objects inside every CLEVR scene. This simple architecture consists of the CNN already used in the original RN architecture, followed by 2 fully-connected layers with ReLU non-linearities acting as a multi-label classifier. Similarly
to the basic RN architecture, features are extracted by
average-pooling the CNN activations. We call this ar-
chitecture Multi-label CNN.
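An illustrative sketch of this baseline (our reconstruction under stated assumptions: the RN's 4-layer, 24-kernel CNN with stride-2 convolutions, global average pooling, and a sigmoid multi-label head; with 3 shapes, 8 colors, 2 materials, and 2 sizes there are 96 possible object types):

```python
import torch.nn as nn

class MultiLabelCNN(nn.Module):
    def __init__(self, n_labels=96):  # 3 shapes x 8 colors x 2 materials x 2 sizes
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(4):  # the same 4-layer, 24-kernel CNN used by the RN
            layers += [nn.Conv2d(in_ch, 24, 3, stride=2, padding=1),
                       nn.BatchNorm2d(24), nn.ReLU()]
            in_ch = 24
        self.cnn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)  # features: global average pooling
        self.head = nn.Sequential(nn.Linear(24, 256), nn.ReLU(),
                                  nn.Linear(256, n_labels))

    def forward(self, x):
        feat = self.pool(self.cnn(x)).flatten(1)  # the extracted visual feature
        return self.head(feat)  # train with nn.BCEWithLogitsLoss
```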
All the architectures are trained on the CLEVR training set; however, features are always extracted from the test set in order to evaluate the generalization capabilities of the system. All the architectures are trained on an RTX 2080Ti, with a batch size of 640. During the experiments, we observed that the training time was almost the same for all the RN-derived architectures. We trained for about 300 epochs and then picked the model with the highest validation accuracy among all the training epochs.
The average training speed was about 25 minutes
per epoch. Instead, extracting all the features from the
whole test set required only about 1 minute. Questions
are not needed at extraction time, so the entire archi-
tecture is considerably lighter.
We use three different setups for evaluating the retrieval performance:

1. CLEVR-Full - We use the entire CLEVR test set. Any image can be selected as a query, and any image is eligible for being retrieved.
2. CLEVR-Filtered-Queries - We select as queries only the images containing at most N objects, while any image remains eligible for being retrieved.
3. CLEVR-Subset - We filter the entire CLEVR test set, keeping only the images containing at most N objects. Hence, both queries and retrieved images contain at most N objects.

Table 2: Spearman-Rho correlation index for existing methods and our novel AVF-RN features. We report the 95% confidence intervals for the mean over 500 queries.

Method          | CLEVR Full  | CLEVR Filtered Queries | CLEVR Subset
RMAC [5]        | 0.15 ± 0.02 | 0.02 ± 0.02            | 0.09 ± 0.01
Multi-label CNN | 0.05 ± 0.05 | 0.64 ± 0.04            | 0.18 ± 0.04
RN [25]         | 0.04 ± 0.05 | 0.64 ± 0.03            | 0.20 ± 0.03
2S-RN [18]      | 0.15 ± 0.04 | 0.65 ± 0.02            | 0.26 ± 0.02
AVF-RN (ours)   | 0.28 ± 0.04 | 0.72 ± 0.02            | 0.34 ± 0.02
CLEVR-Full is the same scenario used for evaluating the 2S-RN performance in [18]. However, the approximated GED algorithm we employ presents some notable differences from the exact version when graphs have a large number of nodes. For this reason, during the experimentation, we also explore the simpler scenarios CLEVR-Filtered-Queries and CLEVR-Subset.
CLEVR comes with rendered images containing no more than 10 objects; in our experiments, we set N = 5.
Table 2 reports the values of the Spearman-Rho correlation index for all the experiments on all three versions of the CLEVR dataset. The Spearman-Rho correlations are relative to the ground-truth generated as explained in Section 3.3 and obtained by ranking images using the approximated version of the GED algorithm. The Spearman-Rho correlation index is evaluated over multiple rankings, generated using 500 query images, in order to produce statistically meaningful results.
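The aggregation behind these numbers can be sketched as follows (assuming, on our part, a normal-approximation confidence interval over the per-query correlations):

```python
import numpy as np

def mean_with_ci95(rhos):
    rhos = np.asarray(rhos)  # one Spearman-Rho value per query (500 here)
    half_width = 1.96 * rhos.std(ddof=1) / np.sqrt(len(rhos))
    return rhos.mean(), half_width  # reported in Table 2 as mean +/- half_width
```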
5.1 Discussion
The new AVF-RN features reach the state-of-the-art on the R-CBIR task, outperforming both the non-relational baseline methods and the RN and 2S-RN relationship-aware techniques. It is worth noting the almost-zero performance gap between the convolutional features extracted from the RN and from the multi-label CNN networks. The
results tell us that the simple global average pooling
of the last feature maps of the CNN is not able to
catch significant relational content, even in the case of
a downstream RN network.
On the CLEVR-Full scenario, our AVF-RN features
obtain an almost doubled Spearman-Rho value with re-
spect to the 2S-RN one. This suggests that the novel
AVF-RN architecture is able to correctly order com-
plex relevant scenes in terms of their relational con-
tent. However, due to the approximation introduced by
ApproxGED in the case of a large number of objects, it is
difficult to strongly confirm this claim in this scenario.
On the other hand, in the CLEVR-Filtered-Queries
scenario, the images with few objects are privileged
by the ground-truth. Hence, standard approaches like
RMAC or simple CNN features behave quite well since
they can exploit their capability of retrieving images
having a similar number of objects with respect to the
query. Beyond counting, however, they are unable to catch intrinsic inter-object relationships. Instead, these
details are well captured by AVF-RN and 2S-RN fea-
tures. However, the aggregation learned inside the net-
work in AVF-RN obliges the layers after the aggrega-
tion to learn compact and smart scene descriptions.
Consequently, AVF-RN captures more detailed scene-
information with respect to the simple posterior aggre-
gation performed for the 2S-RN feature.
Similarly, in the CLEVR-Subset scenario, all the re-
trieved images are forced to contain a small number of objects; hence, the basic recognition abilities of CNN features do not capture the finest relational details. In
this case, since all the images contain few objects, the
only way to obtain remarkable results is by understand-
ing the intrinsic relational content of the scene. This
explains why there is a great improvement of AVF-RN
features over standard methods.
Even if it is quite difficult to give an objective evaluation of the proposed methodology by only looking at the 10 most relevant images, the visual evaluation reported in Figure 5 is useful for giving qualitative feedback and an intuition beyond the statistics. We collect these
visual results from the challenging Full CLEVR exper-
iment. In particular, we can see that RMAC features
always try to find the very same objects as the query,
in any position inside the image. Similarly, multi-label
CNN features seem very noisy.
Fig. 5: Most relevant images for the proposed query from the Full CLEVR experiment, using both non-relational approaches (RMAC, Multi-label CNN) and relational ones (RN, 2S-RN, AVF-RN). The first row belongs to the ground-truth generated as explained in section 3.3.

It appears that 2S-RN and AVF-RN, instead, interpret the scene from a high-level perspective, finding all the images having a big object (better if a metallic blue cube) surrounded by other smaller objects. On our website, you can find an interactive browsing system for exploring the R-CBIR results of the proposed methods for different query images.
5.2 Success/Failure Analysis
In Figure 6 we report simple cases of success and failure
of the top-performing method AVF-RN against the two
baselines RMAC and multi-label CNN. We consider a result successful if our AVF-RN features retrieve more ground-truth images than the baselines; otherwise, the experiment is considered failed for the examined query. For the sake of simplicity, we analyze
only the top 10 results.
It can be noticed that the successfully retrieved images (Figures 6a and 6b) approximate the ground-truth scene graphs well. This is because AVF-RN features
exhibit some scene-wide image understanding that is
not tailored to the features of single objects. On the
other hand, RMAC features are quite good at catching
the key visual features of the single objects, such as their size, but they have trouble focusing the attention on the global scene arrangement.
Failure cases (Figures 6c and 6d) demonstrate that
AVF-RN features cannot always catch the relational
content of the scene. In particular, in the failure ex-
ample of Figure 6c, the AVF-RN features seem to be always triggered by a yellow object, which is perhaps not such an important characteristic when considering the whole scene arrangement.
Instead, Figure 6d demonstrates that it is difficult to catch objects arranged in precise configurations (in this case, placed on the same line). In this example, both the multi-label CNN baseline and our AVF-RN features fail to capture this arrangement.
6 Conclusions
State-of-the-art methods for relational reasoning evaluate their capabilities on some challenging tasks such as R-VQA (Relational Visual Question Answering) and VRD (Visual Relationships Detection).
In this work, we defined the sub-task of R-CBIR in
which retrieved images should be similar to the query
in terms of relationships among objects. This was moti-
vated by the fact that current image retrieval systems,
performing traditional CBIR, are not able to infer re-
lations among the query and the retrieved images.
Given the novelty of the proposed task, we had to
generate a relational benchmark. To this aim, we em-
ployed CLEVR, a synthetic and unbiased dataset origi-
nally developed for the task of R-VQA. In particular, we
(a) Success against RMAC features
(b) Success against Multi-label CNN features
(c) Failure against RMAC features
(d) Failure against Multi-label CNN features
Fig. 6: Success (a)(b) and failure (c)(d) cases for AVF-RN compared to the baselines, RMAC (a)(c) and Multi-
Label CNN (b)(d). Matches among GT and AVF-RN are marked in green, while matches among GT and the
baselines in red.
compared scene graphs using a graph distance metric
called Graph Edit Distance (GED), in order to define
a relational-aware concept of distance between CLEVR scenes.
Distance evaluation among graphs, however, pre-
sented some degrees of freedom. In fact, the employed
GED distance must be initialized with some cost pa-
rameters. Costs have been set to values that do not favor any object attribute over the others, in order to produce the fairest configuration.
We described the 2S-RN approach and, afterwards,
we proposed an extension to the 2S-RN module, called
Aggregated Visual Features Relation Network (AVF-
RN). This modification aims at aggregating the visual
features directly inside the network. We proved that
features from our AVF-RN are able to encode in a com-
pact representation the relationships between objects in
the image, outperforming some baseline non-relational
methods as well as the 2S-RN relational features.
Although the AVF-RN system lacks the native at-
tention mechanism that both RN and 2S-RN use when
they concatenate the question with all the object couples, this method can successfully learn compact rela-
tional features.
We noticed that, despite the encouraging performance measured with the introduced metrics, our approach generates results that are difficult to interpret when images have a high number of objects. This is proba-
bly due to the fact that having many objects implies
too many relationships that are difficult to track by the
human eye. Also, the proposed architectures must be
trained on VQA datasets, since the relationships be-
tween objects in the image are learned by answering
questions. In this regard, the need for a VQA train-
ing dataset is overall a strong constraint that should be
relaxed in future works.
Next steps in this ongoing research include the pos-
sibility of learning features by training architectures di-
rectly on the R-CBIR task, by using metric learning
approaches such as siamese-learning methods. Also, it would be interesting to study how the performance of the models changes when using real-world datasets such
as Visual Genome [11] or Open Images [12].
Acknowledgements This work was partially supported by Automatic Data
and documents Analysis to enhance human-based pro-
cesses (ADA), CUP CIPE D55F17000290009, and by
the AI4EU project, funded by the EC (H2020 - Con-
tract n. 825619). We also gratefully acknowledge the
support of NVIDIA Corporation with the donation of
the Tesla K40 GPU used for this research.
References

1. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D.,
Zitnick, C.L., Parikh, D.: VQA: visual question answer-
ing. CoRR abs/1505.00468 (2015)
2. Belilovsky, E., Blaschko, M.B., Kiros, J.R., Urtasun, R.,
Zemel, R.: Joint embeddings of scene graphs and images.
ICLR (2017)
3. Cai, H., Zheng, V.W., Chang, K.C.: A comprehensive
survey of graph embedding: Problems, techniques and
applications. CoRR abs/1709.07604 (2017)
4. Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships
with deep relational networks. In: 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pp. 3298–3308. IEEE (2017)
5. Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-
end learning of deep visual representations for image re-
trieval. arXiv preprint arXiv:1610.07940 (2016)
6. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko,
K.: Learning to reason: End-to-end module networks for
visual question answering. In: The IEEE International
Conference on Computer Vision (ICCV) (2017)
7. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei,
L., Zitnick, C.L., Girshick, R.: Clevr: A diagnostic dataset
for compositional language and elementary visual reason-
ing (2017)
8. Johnson, J., Hariharan, B., van der Maaten, L., Hoffman,
J., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Infer-
ring and executing programs for visual reasoning. In:
The IEEE International Conference on Computer Vision
(ICCV) (2017)
9. Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma,
D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene
graphs. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 3668–3678 (2015)
10. Kahou, S.E., Atkinson, A., Michalski, V., Kádár, Á., Trischler, A., Bengio, Y.: FigureQA: An annotated figure dataset for visual reasoning. CoRR abs/1710.07300 (2017)
11. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K.,
Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma,
D.A., Bernstein, M., Fei-Fei, L.: Visual genome: Connect-
ing language and vision using crowdsourced dense image
annotations (2016)
12. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J.R.R.,
Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci,
M., Duerig, T., Ferrari, V.: The open images dataset V4:
unified image classification, object detection, and visual
relationship detection at scale. CoRR abs/1811.00982
(2018)
13. Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual
relationship detection with language priors. In: European
Conference on Computer Vision (2016)
14. Lu, P., Ji, L., Zhang, W., Duan, N., Zhou, M., Wang, J.:
R-vqa: Learning visual relation facts with semantic at-
tention for visual question answering. In: SIGKDD 2018
15. Malinowski, M., Fritz, M.: A multi-world approach to
question answering about real-world scenes based on un-
certain input. In: Z. Ghahramani, M. Welling, C. Cortes,
N. Lawrence, K. Weinberger (eds.) Advances in Neural
Information Processing Systems 27, pp. 1682–1690. Cur-
ran Associates, Inc. (2014)
16. Mascharka, D., Tran, P., Soklaski, R., Majumdar, A.:
Transparency by design: Closing the gap between per-
formance and interpretability in visual reasoning. In:
The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2018)
17. Melucci, M.: On rank correlation in information retrieval
evaluation. SIGIR Forum 41(1), 18–33 (2007)
18. Messina, N., Amato, G., Carrara, F., Falchi, F., Gen-
naro, C.: Learning relationship-aware visual features. In:
L. Leal-Taixé, S. Roth (eds.) Computer Vision – ECCV
2018 Workshops, pp. 486–501. Springer International
Publishing, Cham (2019)
19. Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Weakly-
supervised learning of visual relations. In: ICCV 2017- In-
ternational Conference on Computer Vision 2017. Venice, Italy (2017)
20. Qi, M., Li, W., Yang, Z., Wang, Y., Luo, J.: Atten-
tive relational networks for mapping images to scene
graphs. CoRR abs/1811.10696 (2018)
21. Raposo, D., Santoro, A., Barrett, D.G.T., Pascanu, R.,
Lillicrap, T.P., Battaglia, P.W.: Discovering objects and
their relations from entangled scene representations.
CoRR abs/1702.05068 (2017)
22. Ren, M., Kiros, R., Zemel, R.: Exploring models and
data for image question answering. In: C. Cortes, N.D.
Lawrence, D.D. Lee, M. Sugiyama, R. Garnett (eds.) Ad-
vances in Neural Information Processing Systems 28, pp.
2953–2961. Curran Associates, Inc. (2015)
23. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: To-
wards real-time object detection with region proposal
networks. In: C. Cortes, N.D. Lawrence, D.D. Lee,
M. Sugiyama, R. Garnett (eds.) Advances in Neural In-
formation Processing Systems 28, pp. 91–99. Curran As-
sociates, Inc. (2015)
24. Riesen, K., Bunke, H.: Approximate graph edit distance
computation by means of bipartite graph matching. Im-
age and Vision Computing 27(7), 950–959 (2009)
25. Santoro, A., Raposo, D., Barrett, D.G., Malinowski, M.,
Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural
network module for relational reasoning. In: I. Guyon,
U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, R. Garnett (eds.) Advances in Neural Infor-
mation Processing Systems 30, pp. 4967–4976. Curran
Associates, Inc. (2017)
26. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval
with integral max-pooling of cnn activations. arXiv
preprint arXiv:1511.05879 (2015)
27. Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.:
Graph R-CNN for scene graph generation. CoRR
abs/1808.00191 (2018)
28. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked
attention networks for image question answering. CoRR
abs/1511.02274 (2015)
29. Yao, T., Pan, Y., Li, Y., Mei, T.: Exploring visual rela-
tionship for image captioning. CoRR abs/1809.07041
(2018)
30. Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., El-
gammal, A.M., Elhoseiny, M.: Large-scale visual rela-
tionship understanding. CoRR abs/1804.10660 (2018).
31. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus,
R.: Simple baseline for visual question answering. CoRR
abs/1512.02167 (2015)
In today's age, massive image data is generated rapidly. This influx has made labeling images tedious and, in turn, made it harder to retrieve images through searching algorithms that rely only on labels, keywords, or other meta-data in the images. Modern Content-Based Image Retrieval (CBIR) techniques rely on the visual features within the image to return relevant results to a search query. Deep Convolutional Neural Network (DCNN) models made great strides in the last decade. This paper relies on these complex pre-trained models to extract visual features from images. The proposed work has used pre-trained models like VGG16, MobileNet, Inceptionv3, and Xception for this task. Some studies in the CBIR space also suggest increased accuracy when both visual and textual features are considered. This paper proposes a novel three-step process for obtaining textual features. Firstly, the proposed model receives keywords for each image using Google Cloud Vision API. Secondly, the proposed model replaces each keyword with a 300-dimensional embedding vector obtained using word2vec, trained on the Google News dataset. Finally, the proposed model trains a combination of the Deep Semantic Similarity Model (DSSM) and Long Short-Term Memory (LSTM) model to reduce the 300-dimensional vector to a 64-dimensional vector. Using these new shortened word vectors, the proposed model computes the cosine similarity to replace each keyword of an image with five of its synonyms. Here, these additional steps increased the accuracy compared to simply using a word embedding technique. Finally, the proposed model combined the visual and textual feature vectors and observed that this feature set showed maximum classification accuracy of 98.33%, which is also compared with and found relatively better than other similar model results.
Full-text available
The increasing interest in social networks, smart cities, and Industry 4.0 is encouraging the development of techniques for processing, understanding, and organizing vast amounts of data. Recent important advances in Artificial Intelligence brought to life a subfield of Machine Learning called Deep Learning, which can automatically learn common patterns from raw data directly, without relying on manual feature selection. This framework overturned many computer science fields, like Computer Vision and Natural Language Processing, obtaining astonishing results. Nevertheless, many challenges are still open. Although deep neural networks obtained impressive results on many tasks, they cannot perform non-local processing by explicitly relating potentially interconnected visual or textual entities. This relational aspect is fundamental for capturing high-level semantic interconnections in multimedia data or understanding the relationships between spatially distant objects in an image. This thesis tackles the relational understanding problem in Deep Neural Networks, considering three different yet related tasks: Relational Content-based Image Retrieval (R-CBIR), Visual-Textual Retrieval, and the Same-Different tasks. We use state-of-the-art deep learning methods for relational learning, such as the Relation Networks and the Transformer Networks for relating the different entities in an image or in a text.
En el marco de los estudios visuales, se observa un desarrollo de singulares prácticas cuya orientación tecnológica está basada en la innovación de algoritmos de inteligencia artificial. En este contexto, la investigación busca revelar la emergencia de una nueva interpretación de la visualidad, concretamente, mediante el análisis de dos líneas principales (cuya relación se trata de mostrar): por una parte, la visión artificial y su extensión en el universo posinternet de las redes sociales y de la web, donde la imagen pierde su significado simbólico y su dimensión estética para valorarse como una información que cambia el estado de un sistema; y, por otro lado, el conocimiento social del mundo virtual a través del uso, la actitud y el comportamiento humano con los algoritmos inteligentes. Mediante la revisión bibliográfica multidisciplinar, como método principal, las conclusiones apuntan a una importante presencia de una visualidad dependiente de las máquinas inteligentes, que aportan un mayor enriquecimiento del estudio tanto de la naturaleza humana como de la realidad social en el entorno virtual.
An innovative image retrieval agenda by concatenating deep learning features from GoogleNet and low‐level features from HSI and RGB color space is proposed in this article. Most of the CNN features suffer from loss of information due to image resize as a pre‐processing stage. To reduce this information loss super‐resolution technic is used for resizing images. An improved form of dot‐diffused block truncation coding is used for extracting RGB handcraft features. To discover the interdependencies between color and intensity component of an image, interchannel voting between hue, saturation, and intensity component is calculated as a color feature in HSI space. Histogram of orientated gradient (HOG) feature is used as shape feature. Five standard performance parameters, average precision rate, average recall rate, F‐Measure, Average Normalized Modified Retrieval Rank, and Total Minimum Retrieval Epoch, are applied on nine image datasets: Corel‐1K, Corel‐5K, Corel‐10K, VisTex, STex, ColorBrodatz and three subsets of ImageNet dataset for evaluation process of proposed method. For all dataset the best performance is achieved by the proposed method with respect to all performance parameters.
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links invalidate any chance to separately extract visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward the research for effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks on the Recall@1 metric. The code used for the experiments is publicly available on GitHub at .
This paper proposes an efficient image retrieval framework based on feature fusion of high-level features from an improved version of DarkNet-53, named GroupNormalized-Inception-Darknet-53 (GN-Inception-Darknet-53), and handcrafted features extracted from both the RGB and HSI color models. To extract more detailed image features, we augment the network with an inception layer comprising 1×1, 3×3, and 5×5 kernels in place of an existing 3×3 kernel. To make the normalization process of the proposed model less dependent on batch size, a Group Normalization (GN) layer is used instead of Batch Normalization (BN). A modified version of dot-diffused block truncation coding (DDBTC) is used to extract handcrafted features in the RGB color space. For the HSI color space, inter-channel voting between the hue, saturation, and intensity components is used as the color feature. To extract shape features, a histogram of oriented gradients (HOG) is applied in the RGB color space. To evaluate the efficiency of our proposed method, Average Precision Rate (APR), Average Recall Rate (ARR), F-Measure, Average Normalized Modified Retrieval Rank (ANMRR), and Total Minimum Retrieval Epoch (TMRE) are calculated for the Corel-1K, Corel-5K, Corel-10K, VisTex, STex, and ColorBrodatz datasets. On all datasets, the proposed method shows the best results in all instances, with a minimum average improvement of 7.02%.
In recent research, deep learning methods have shown promising performance in various fields of computer vision, including content-based image retrieval (CBIR). In this paper, an improved version of Darknet-53, called GroupNormalized-Inception-Darknet-53 (GN-Inception-Darknet-53), is proposed to extract features for the CBIR model. To extract more detailed image features, we augment the network with an inception layer comprising 1×1, 3×3, and 5×5 kernels in place of an existing 3×3 kernel; the output of this newly added inception layer is the concatenation of the results of these three kernels. To make the normalization process of the proposed model less dependent on batch size, a group normalization (GN) layer is used instead of batch normalization. A total of five such inception layers are used in the proposed GN-Inception-Darknet-53, and the outputs of all these layers are depth-concatenated to extract more detailed image features. A transfer learning mechanism is used to train the proposed model. Five standard performance measures (average precision rate, average recall rate, F-measure, average normalized modified retrieval rank, and total minimum retrieval epoch) are calculated to evaluate the efficiency of our proposed method. To assess its performance, seven challenging image datasets are used: three natural datasets (Corel-1K, Corel-5K, and Corel-10K), three subsets of the ImageNet dataset, and the UKBench dataset. On all these datasets, the proposed method outperforms the nineteen compared methods, which include both traditional and CNN-based CBIR approaches.
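The inception-plus-GroupNorm block described above can be sketched in PyTorch as follows. Channel counts, the number of groups, and the placement of the activation are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class GNInceptionBlock(nn.Module):
    def __init__(self, in_ch, branch_ch=64, groups=8):
        super().__init__()
        # Parallel 1x1, 3x3, and 5x5 convolutions; padding keeps spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)
        ])
        # GroupNorm is independent of batch size, unlike BatchNorm.
        self.norm = nn.GroupNorm(groups, 3 * branch_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)  # depth concat
        return self.act(self.norm(out))

x = torch.randn(2, 256, 14, 14)
print(GNInceptionBlock(256)(x).shape)  # torch.Size([2, 192, 14, 14])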
Deep learning has demonstrated major abilities in solving many kinds of real-world problems in the computer vision literature. However, deep models are still strained by simple reasoning tasks that humans consider easy to solve. In this work, we probe current state-of-the-art convolutional neural networks on a difficult set of tasks known as the same-different problems. All the problems require the same prerequisite to be solved correctly: understanding whether two random shapes inside the same image are the same or not. With the experiments carried out in this work, we demonstrate that residual connections, and more generally skip connections, seem to have only a marginal impact on the learning of the proposed problems. In particular, we experiment with DenseNets, and we examine the contribution of residual and recurrent connections in already-tested architectures, ResNet-18 and CorNet-S respectively. Our experiments show that older feed-forward networks, AlexNet and VGG, are almost unable to learn the proposed problems, except in some specific scenarios. We show that recently introduced architectures can converge even when important parts of their architecture are removed. We finally carry out some zero-shot generalization tests and discover that, in these scenarios, residual and recurrent connections can have a stronger impact on the overall test accuracy. On four difficult problems from the SVRT dataset, we reach state-of-the-art results with respect to previous approaches, obtaining super-human performance on three of the four problems.
Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning, as it requires understanding both the visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn their joint feature embedding via multimodal fusion or attention mechanisms. Some recent studies utilize external VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes might be unrelated to the VQA task and have limited semantic capacity. To better utilize semantic knowledge in images, we propose a novel framework to learn visual relation facts for VQA. Specifically, we build a Relation-VQA (R-VQA) dataset based on the Visual Genome dataset via a semantic similarity module, in which each sample consists of an image, a corresponding question, a correct answer, and a supporting relation fact. A well-defined relation detector is then adopted to predict relation facts related to the visual question. We further propose a multi-step attention model composed of visual attention and semantic attention applied sequentially to extract related visual knowledge and semantic knowledge. We conduct comprehensive experiments on two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.
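The multi-step attention idea reduces to a short sketch: the question first attends over visual region features, and the resulting context then guides attention over embedded relation facts. Dimensions and the additive fusion are assumptions for illustration, not the authors' exact model.

import torch

def attend(query, keys):
    # keys: (n, d), query: (d,) -> softmax-weighted sum of the keys
    scores = torch.softmax(keys @ query, dim=0)
    return scores @ keys

q = torch.randn(512)                           # question embedding
regions = torch.randn(36, 512)                 # visual region features
facts = torch.randn(20, 512)                   # embedded relation facts
visual_ctx = attend(q, regions)                # step 1: visual attention
semantic_ctx = attend(q + visual_ctx, facts)   # step 2: semantic attention
joint = q + visual_ctx + semantic_ctx          # input to the answer classifier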
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection, and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth, comprehensive statistics about the dataset, validate the quality of the annotations, study how the performance of several modern models evolves with increasing amounts of training data, and demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Large-scale visual understanding is challenging, as it requires a model to handle the widely spread and imbalanced distribution of 〈subject, relation, object〉 triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved. We learn a visual and a semantic module that map features from the two modalities into a shared space, where matched pairs of features must discriminate against unmatched ones while also maintaining close distances to semantically similar ones. Benefiting from this, our model can achieve superior performance even when the visual entity categories scale up to more than 80,000, with an extremely skewed class distribution. We demonstrate the efficacy of our model on a large and imbalanced benchmark based on Visual Genome that comprises 53,000+ objects and 29,000+ relations, a scale at which no previous work has been evaluated. We show the superiority of our model over competitive baselines on the original Visual Genome dataset with 80,000+ categories. We also show state-of-the-art performance on the VRD dataset and on the scene graph dataset, a subset of Visual Genome with 200 categories.
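A minimal sketch of this two-module design is a pair of projections trained with a margin loss so that matched visual-semantic pairs outscore unmatched ones. Projection sizes and the margin value are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

vis_proj = nn.Linear(2048, 300)  # visual module: CNN features -> shared space
sem_proj = nn.Linear(300, 300)   # semantic module: word embeddings -> shared space

def matching_loss(v_feat, s_pos, s_neg, margin=0.2):
    v = F.normalize(vis_proj(v_feat), dim=-1)
    p = F.normalize(sem_proj(s_pos), dim=-1)
    n = F.normalize(sem_proj(s_neg), dim=-1)
    # Matched pairs must score higher than unmatched ones by a margin;
    # semantic affinity comes from reusing pretrained word embeddings.
    return F.relu(margin - (v * p).sum(-1) + (v * n).sum(-1)).mean()

loss = matching_loss(torch.randn(8, 2048), torch.randn(8, 300), torch.randn(8, 300))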
Relational reasoning in computer vision has recently shown impressive results on visual question answering tasks. On the challenging CLEVR dataset, the recently proposed Relation Network (RN), a simple plug-and-play module and one of the state-of-the-art approaches, has obtained very good accuracy (95.5%) answering relational questions. In this paper, we define a sub-field of Content-Based Image Retrieval (CBIR) called Relational-CBIR (R-CBIR), in which we are interested in retrieving images with given relationships among objects. To this aim, we employ the RN architecture to extract relation-aware features from CLEVR images. To prove the effectiveness of these features, we extended both the CLEVR and Sort-of-CLEVR datasets, generating a ground truth for R-CBIR by exploiting the relational data embedded in scene graphs. Furthermore, we propose a modification of the RN module, a two-stage Relation Network (2S-RN), that enabled us to extract relation-aware features by using a preprocessing stage able to focus on the image content, leaving the question apart. Experiments show that our RN features, especially the 2S-RN ones, outperform the state-of-the-art RMAC features on this new challenging task.
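The relation-aware features come from the pairwise aggregation at the heart of the RN, sketched below: g_theta processes every ordered pair of object features, and the sum over pairs yields a fixed-size relational descriptor. Layer sizes here are illustrative, and in the 2S-RN variant this first stage is question-independent by design.

import torch
import torch.nn as nn

class RNPooling(nn.Module):
    def __init__(self, obj_dim=256, hidden=256):
        super().__init__()
        self.g = nn.Sequential(              # g_theta over object pairs
            nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, objects):              # objects: (n, obj_dim)
        n = objects.size(0)
        oi = objects.unsqueeze(1).expand(n, n, -1)
        oj = objects.unsqueeze(0).expand(n, n, -1)
        pairs = torch.cat([oi, oj], dim=-1).reshape(n * n, -1)
        return self.g(pairs).sum(dim=0)      # aggregated relational feature

feature = RNPooling()(torch.randn(16, 256))  # usable as an R-CBIR descriptor
print(feature.shape)                         # torch.Size([256])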
We propose a novel scene graph generation model called Graph R-CNN that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captures contextual information between objects and relations. Finally, we introduce a new evaluation metric that is more holistic and realistic than existing metrics. We report state-of-the-art performance on scene graph generation as evaluated using both existing and our proposed metrics.
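For intuition, an attentional graph-convolution step in the spirit of the aGCN can be sketched as attention-weighted message passing over the scene graph. The pairwise scoring function and the residual update are assumptions for illustration, not the published architecture.

import torch
import torch.nn as nn

class AttentionalGCNLayer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.att = nn.Linear(2 * dim, 1)

    def forward(self, nodes, adj):
        # nodes: (n, d); adj: (n, n) 0/1 mask, assumed to include self-loops
        n = nodes.size(0)
        pair = torch.cat([nodes.unsqueeze(1).expand(n, n, -1),
                          nodes.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.att(pair).squeeze(-1).masked_fill(adj == 0, float('-inf'))
        weights = torch.softmax(logits, dim=-1)  # attention over neighbors
        return torch.relu(nodes + weights @ self.msg(nodes))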