Conference PaperPDF Available
Testing Deep Neural Networks on the
Same-Different Task
Nicola Messina, Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro
Institute of Information Science and Technologies
National Research Council
Pisa, Italy
Abstract—Developing abstract reasoning abilities in neural
networks is an important goal towards the achievement of
human-like performances on many tasks. As of now, some works
have tackled this problem, developing ad-hoc architectures and
reaching overall good generalization performances. In this work
we try to understand to what extent state-of-the-art convolutional
neural networks for image classification are able to deal with a
challenging abstract problem, the so-called same-different task.
This problem consists in understanding if two random shapes
inside the same image are the same or not. A recent work
demonstrated that simple convolutional neural networks are
almost unable to solve this problem. We extend their work,
showing that ResNet-inspired architectures are able to learn,
while VGG cannot converge. In light of this, we suppose that
residual connections have some important role in the learning
process, while the depth of the network seems not so relevant.
In addition, we carry out some targeted tests on the converged
architectures to figure out to what extent they are able to
generalize to never seen patterns. However, further investigation
is needed in order to understand what are the architectural
peculiarities and limits as far as abstract reasoning is concerned.
Index Terms—AI, Deep Learning, Abstract Reasoning, Rela-
tional Reasoning, Convolutional Neural Networks
Artificial intelligence and in particular deep neural networks
have recently shown impressive results in key domains such
as vision, language, control, and decision-making. In partic-
ular, with the work carried out on the ImageNet challenge,
Krizhevsky et al. [1] demonstrated major capabilities of deep
neural networks in the field of image processing. Deep learning
architectures, and in particular Convolutional Neural Networks
(CNNs) [2], constitute now de-facto standard approaches to
image processing and understanding.
Recently, deep convolutional architectures have defined the
state-of-the-art on multiple computer vision tasks, such as
image classification [1], [3], [4], object detection and seg-
mentation [5], [6], multimedia and cross-media information
retrieval [7], [8], and detection of adversarial examples [9].
Despite their success, there are still many open problems
with current deep architectures. In fact, it is known that they
cannot generalize well to unseen objects and they lack human-
like reasoning capabilities.
Humans, as well as animals, are able to abstract visual
perception in order to recognize some shape patterns never
seen before. Unlike humans, convolutional networks are able
to learn very precise representations, usually very difficult to
generalize to abstract concepts.
For this reason, in this work, we concentrate on a very
specific task designed to test abstract reasoning capabilities of
neural networks. This task is called same-different and consists
in predicting if two shapes inside the same image are the same
or not. It is a challenging task for convolutional architectures
since it is required not to learn specific shape patterns in order
to solve the problem.
A clear understanding of the same-different concept, and the
ability to actuate complex relational reasoning are considered
key features in many tasks. For example, these concepts may
have a great impact in classification problems, in the research
of particular patterns in cultural heritage, and in the detection
of patterns defining aesthetic beauty in images and even music.
In fact, the world is often perceived by humans as a set of
recurrent structures composited together, such as the human
eyes in a face or the repeating chorus in a song.
In this work we employ a carefully designed dataset, called
SVRT [10], containing multiple visual problems pertaining to
the same-different category. We study the behavior of modern
standard visual deep-learning architectures by training and
validating them on the SVRT dataset.
More in detail, we extend the study carried out by Kim
et al. [11], by testing a variety of state-of-the-art deep con-
volutional architectures, originally designed for image classi-
fication, on some challenging same-different problems in the
SVRT dataset. In particular, we show that residual networks
are able to correctly solve the problem, with a remarkable
generalization margin.
We also discovered that different image interpolation al-
gorithms used during the data augmentation process have
an important effect on the final performance. This finding
remarks the overall vulnerability of neural networks to subtle
perturbations of the input image.
Our contribution is two-fold: first, we train state-of-the-
art deep image classification networks on the same-different
task, treating it as a binary classification problem; second, we
perform an extensive study upon the generalization capabilities
of the probed networks, in order to understand to what978-1-7281-4673-7/19/$31.00 ©2019 IEEE
extent current models are able to deal with higher-level visual
The rest of the paper is organized as follows: in Section
II we review some of the related work concerning abstract
reasoning capabilities of deep neural networks; in Section III
we go into details with the SVRT dataset and we recall the
models that will be probed against the same-different problem;
in Section IV we will present our experimental setup and we
will discuss the obtained results; finally, in Section V we will
remark on the importance of these studies and we will propose
future directions for this work.
It has been proven that deep neural networks can obtain
remarkable performance on different computer vision fields.
However, relatively few works have tackled the cognitive and
abstract reasoning capabilities of state-of-the-art deep neural
Recently, the introduction of on-purpose generated bench-
marks such as CLEVR and Sort-of-CLEVR [12] has paved
the way towards fine-grained studies on relational capabilities
of neural networks. In particular, [13] developed a relational
reasoning module able to solve the CLEVR visual-question-
answering task, by augmenting a standard convolutional net-
work with a reasoning module able to process couples of
objects. A slight variation of this network, the 2S-RN [14],
has been used for relational content-based image retrieval (R-
CBIR), in order to extract relationship-aware visual features
for indexing purposes.
Transparency and explainability have been considered key
objectives for understanding what kind of reasoning neural net-
works are internally performing. To this aim, on the CLEVR
dataset, [15] used explicit module composition in order to
build an explicit reasoning pipeline, while [16] proposed a set
of primitives which, when composed, manifest as a model ca-
pable of performing complex reasoning tasks in an explicitly-
interpretable manner. In particular, [16] reached more than
99% accuracy on the CLEVR test set.
Other than CLEVR, other difficult tasks, such as Raven’s
Progressive Matrices (RPM), have been proposed to test in
great detail, and under fully-controlled environments, reason-
ing capabilities of a given architecture.
In particular, [17] worked on RPMs and tried to establish
a semantic link between vision and reasoning by providing
structured representation. Similarly, [18] used an on-purpose
created dataset similar to RPM, called Procedurally Generated
Matrices (PGM). They demonstrated that popular models such
as ResNets perform poorly on this benchmark, and they
presented a novel architecture demonstrating quite effective
reasoning capabilities. Differently from [18], our work con-
centrates on a very simple yet challenging problem, the same-
different challenge.
The work by [19] introduced a synthetic dataset composed
of sequences of 2D images in order to test memorization
capabilities of neural networks.
[10] introduced SVRT, a simple dataset composed of 2D
shapes in order to test comparison and relational capabilities
of neural networks. [20] first showed that problems involv-
ing comparisons between SVRT shapes were difficult for
convolutional architectures like LeNet and GoogLeNet [21].
By contrast, instead, the traditional boosting method used by
[10] was able to reach very good results even on comparison
Even [11], in their recent work, found that the same-different
problem strains simple feed-forward convolutional networks.
In this paper we extend their work, giving some insights about
the behavior of very-deep convolutional networks on the same-
different task.
This work takes into consideration state-of-the-art deep
convolutional neural networks originally developed for image
classification tasks. Our aim consists in studying their perfor-
mance when dealing with higher-level abstraction problems.
First of all, we give some insights into the SVRT bench-
mark; then, we will discuss the architectures that we will
employ to tackle this problem.
A. SVRT Dataset
SVRT [10] is an extensive benchmark designed to test
abstract reasoning capabilities of machine learning algorithms.
SVRT requires more than trivial local descriptors in order to
be solved correctly. It consists of simple 2D images containing
simple closed curves. It is clear that brute-force memorization
cannot solve the task since the shape of curves is always
randomly generated.
SVRT collects 23 different sub-problems, that can be further
divided into two clusters: problems related to the spatial
arrangement of shapes, and problems regarding comparisons
between shapes. The latter set of sub-problems pertain to
the same-different challenge, and this is the set that we are
interested in.
Kim et al. [11] probed simple CNNs upon all the problems
in the SVRT dataset, and they found some of them particularly
straining for their simple setup. In particular, following the
findings by Kim et al., problems 1, 5, 20, 21, are among the
most difficult to solve for a convolutional neural network.
In particular, these problems pose the following challenges:
Problem 1 (P.1): Detecting the very same shapes, ran-
domly placed in the image, but having the same rotation
and scale.
Problem 5 (P.5): Detecting two couples of identical
shapes, randomly placed in the image. The two images
inside every couple have the same rotation and scale.
Problem 20 (P.20): Detecting the same shape, translated
and flipped along a randomly chosen axis.
Problem 21 (P.21): Detecting the same shape, randomly
translated, rotated and scaled.
All the images from the proposed problems contain two
shapes except problem 5, where images contain two couples
#1 #5 #20 #21
Fig. 1: Positive and negative examples from the four considered SVRT problems.
of shapes. Figure 1 shows some positive and negative examples
from each one of the above-listed problems.
In none of these benchmarks are shapes overlapping. They
are always well separated. Despite being a quite straightfor-
ward dataset, built of simple shapes and with no color infor-
mation, SVRT can help to highlight some intrinsic limitations
of current neural network models.
B. Models
In this work we take into consideration the following state-
of-the-art architectures for image classification: AlexNet [1];
VGG19 [4]; three variants of the Resnet [3], in order of
increasing complexity ResNet-18, ResNet-34 and ResNet-101;
a recently introduced biologically inspired network called
CorNet-S [22], and its simpler version, Cornet-Z, a simple
feed-forward convolutional network used as baseline.
AlexNet and CorNet-Z are intended to be our baselines.
In fact, they are straightforward convolutional networks, with
relative few layers with respect to VGG and ResNets. In
particular, CorNet-Z is a lightweight version of the AlexNet.
It consists of only four convolutional layers, with ReLU
activations and MaxPooling, with a single fully-connected
layer as a classifier, outputting probabilities for the two target
VGG19 still contains a simple convolutional architecture
comparable with the AlexNet structure, but it is significantly
deeper. ResNets, on the other hand, introduce residual connec-
tions. These skip connections force the architecture to learn
incremental differences in the stored representations, refining
the information multiple times until it can be used for the
downstream task.
Following the work by Kar et al. [23], it seems that ResNets
can be considered, from a functional point of view, biologi-
cally inspired networks. In fact, they continuously refine the
information coming from pixels, just like the visual cortex
processes the information flow coming from the eyes. In
particular, [23] found some experimental evidence on primates
brain claiming that visual cortex could be comprised of
recurrent connections. In light of this, they developed CorNet-
S [22]. CorNet-S is composed of four blocks, mimicking
different brain cortical areas involved with vision; each one of
these four blocks contains a recurrent connection together with
a skip connection, taking inspiration from ResNets. Basically,
ResNets are unrolled versions of CorNet-S. Compared to
ResNet101, that has a comparable number of weights with
CorNet-S, this model is also lighter to train since, differently
from ResNets, recurrent connections make most of neurons
weights shared among different timesteps.
We train the networks discussed in Section III on all the four
same-different problems from the SVRT dataset. Concerning
AlexNet, ResNets, and VGG19, we use the models provided
by the PyTorch framework. Instead, for CorNet-S and CorNet-
Z, we employ the public available implementation provided by
the authors.
For each of the benchmarks, we use 400k training examples
and 100k images for testing. All the positive and negative
examples are perfectly balanced in both sets. For all the
probed models, we use SGD as optimization algorithm, with a
momentum of 0.9, weight decay of 1e-4 and an initial learning
rate of 0.1. We use an exponential decay schedule for the
learning rate, that halves it every 20 epochs. We do not use
pre-trained weights if they are available.
A. Experiment 1
a) Description: Our primary objective consists in trying
to correctly learn an SVRT problem, measuring the accuracy
on the test set of the same problem.
Together with the obtained accuracies, in this experiment we
are also interested in understanding what is the convergence
speed of the model, as a measure of the strain perceived by the
network during the training phase. For this reason, we keep
track of the epoch in which the test accuracy reaches 90%.
From this moment on, we will refer to this particular point
as the convergence epoch (CE). For performance reasons we
validate the model every half epoch, so we are able to provide
the CE with a resolution of 0.5 epochs. This resolution is
enough to capture substantial differences in the training curves.
Problem 1 Problem 5 Problem 20 Problem 21
Model Acc. (%) CE Acc. (%) CE Acc. (%) CE Acc. (%) CE
LeNet [20] 57.0 n.a. 54.0 n.a. 55.0 n.a. 51.0 n.a.
GoogLeNet [20] 50.0 n.a. 50.0 n.a. 50.0 n.a. 51.0 n.a.
AdaBoost [10] 98.0 n.a. 87.0 n.a. 70.0 n.a. 50.0 n.a.
Human [10] 98.0 n.a. 90.0 n.a. 98.0 n.a. 83.0 n.a.
AlexNet 50.0 - 50.0 - 50.0 - 50.0 -
CorNet-Z 50.0 - 50.0 - 50.0 - 50.0 -
VGG-19 50.0 - 50.0 - 50.0 - 50.0 -
ResNet-18 99.0 2.0 99.8 2.5 95.8 2.0 96.1 17.5
ResNet-34 99.4 0.5 98.7 1.5 94.6 6.5 96.9 13.0
ResNet-101 99.1 3.5 97.8 3.5 95.9 4.0 91.1 20.5
CorNet-S 99.5 2.0 96.8 2.0 95.3 2.0 95.9 13.0
TABLE I: Accuracy values measured on the probed architectures, for each of the four SVRT problems. The values from the
first experiments are reported as they are from [10], [20]. They did not report any convergence information (CE is n.a.).
Together with our measurements on deep convolutional
networks, we also report the values as measured by [10], [20]
on LeNet, GoogLeNet and AdaBoost (using feature group 3).
Table I summarizes the accuracies reached from all the
architectures, over the test sets of the respective problems. For
the architectures trained by us, we report also the measured
CEs, only when the architecture converged.
Learning rotation invariance is known to be one of the major
sources of strain in convolutional networks. For this reason, we
also validate ResNet18, ResNet101, and Cornet-S on different
instances of the test set, where all the images from the same
instance are rotated by the same amount. For these trials, we
take as reference the models trained on P.21.
Rotation is inherently lossy for digital images, especially if
images do not have high resolution, as in our case. For this
reason, we try to rotate images with and without interpolation,
in order to appreciate how much the model is robust to the
noise added during the transform.
Figure 2 collects accuracy values measured on a particular
rotated instance of the test set, with and without pixel inter-
b) Discussion: Table I shows that, among all the probed
architectures, only ResNets and Cornet-S are able to learn all
the four problems correctly, perhaps with a very few error rate,
defeating all the state-of-the-art results previously reached with
AdaBoost [10].
AlexNet, CorNet-Z, and VGG19 are unable to learn. On
the validation set, they remain on the chance level accuracy
of 50%. AlexNet and CorNet-Z are not very deep and they are
quite straightforward. Hence, the performance obtained with
these models seems to be in line with results shown by Kim
et al. [11] on their simple convolutional architectures.
One interesting finding is the fact that a very-deep convo-
lutional network, VGG19, is unable to learn. Specifically, it
always outputs same for all the test samples of all the four
problems. This particular outcome confirms the inability of
the VGG architecture to discern discriminant data from the
training examples.
By contrast, residual networks (ResNet-18, ResNet-34,
ResNet-101 and CorNet-S) are able to obtain best perfor-
mances on all the problems. Even the small ResNet-18 behaves
well, both in terms of reached accuracies and CEs. Considering
P.1, all the ResNets reach almost perfect performance on
the validation set, approaching more than 99% accuracy with
pretty fast convergence. Only P.21 seems to cause some strain
in the training process since more epochs are required in order
for the models to converge.
Overall, ResNets and Cornet-S are able to defeat humans
on three of the four tasks.
These pieces of evidence on the probed models suggest that
the cause of the convergence is not related to the depth of the
network, but to its architecture instead. Given that also the
very-deep inception architecture of GoogLeNet seems unable
to converge [20], the results obtained with ResNets make us
realize that their residual architecture may have an important
role in the convergence when addressing the same-different
Looking at Figure 2, it seems that converged models are also
able to correctly handle rotated test set images. The peaks at
0, 90, 180 and 270 degrees tell us that the models are robust
to rotation, and they are able to generalize to a brand new set
of rotated figures never seen before.
On the other hand, the fluctuations in the accuracy at inter-
mediate angles are a clear signal that the noise added while
rotating images strains these models. In particular, ResNet18
and ResNet101 show more sensitivity to this noise. ResNet101
is the most vulnerable to this disturbance. Its accuracy falls
to full chance when the image is rotated by 45 degrees and
no interpolation is used. Resnet18 and Resnet101 suffer in the
same way from the padding and center-crop transforms needed
in order to preserve shapes during the rotation (see Figure 2
caption for details), meaning that ResNets have also stronger
scale-dependent problems with respect to Cornet-S.
By contrast, CorNet-S seems very stable, both to rotation
(a) ResNet101 (b) ResNet18 (c) CorNet-S
Fig. 2: Accuracy values when the test set is rotated by the given angle. Ideal refers to the accuracy value as measured on the
original validation-set, without the re-scaling operations used for rotating the images. No interp is measured by removing the
interpolation during the rotation; the shape acquires at most 1-pixel wide distortion; Bilinear interp. uses bilinear interpolation
during rotation; in this case, rotation artifacts are strongly attenuated. At zero degrees, Ideal can diverge from No interp. and
Bilinear Interp.. In fact, in order to be rotated, the image has been first padded and then center-cropped, in order to avoid
cutting the shapes during the rotation. These preliminary transforms are responsible for the gap in the case of zero rotation. A
gap at zero degrees is a clear signal indicating that the model is strained by this simple pre-scaling procedure.
noise and to image re-scaling.
B. Experiment 2
a) Description: In this scenario, we test generalization
capabilities of the converged models by measuring their per-
formance on the test set of other problems. In particular, we
set up the following trials:
1) Train on P.21, test on P.20: we test the capability of the
model to generalize to mirroring transformations, never
seen when training on P.20.
2) Train on P.21, test on P.5: we test the capability of the
model to generalize to multiple instances of the same-
different problem.
3) Train on P.1, test on P.20: we test the capability of the
model to generalize to mirroring transformations. It is
more challenging with respect to point 1), since P.1
does not even give the possibility to learn the notion
of rotation and scale invariance.
4) Train on P.1, test on P.21: we test the capability of the
model to generalize to arbitrary rotations and scales.
Table II summarizes the accuracies of all the converged
models in the above-proposed configurations.
b) Discussion: Looking at Table II it turns out that
almost all the architectures trained on P.21 are also able to
correctly solve P.20. This means that P.21 gave the ResNets
the ability to correctly understand shape flips, starting from
the notion of scale and rotation.
By contrast, there is no notion of rotation or flip invariance
in the architectures trained on the simple P.1. This is an
expected result since convolutional architectures are not able
to natively deal with invariance to affine transformations, with
the exception of the plain translation.
Also, tests on P.5 reveal that discerning multiple instances
of the same-different problem is a difficult task when the
Model Train P.21
Test P.20
Train P.21
Test P.5
Train P.1
Test P.20
Train P.1
Test P.21
ResNet-18 95.9 57.5 67.8 51.8
ResNet-34 96.5 66.1 60.1 51.5
ResNet-101 90.5 56.2 57.3 51.3
CorNet-S 95.7 54.7 55.4 51.6
TABLE II: Accuracy values measured on the probed archi-
tectures, by training and testing them on different SVRT
architectures have been trained to detect only one instance.
This problem requires the models to understand that objects
should be clustered into two couples of possibly identical
shapes. Instead, an architecture trained on P.1 most likely
learned to output a positive result only if all the shapes are
In this study we tackled the problem of understanding
to what extent very-deep convolutional neural networks are
able to deal with the same-different challenging task. We
think that developing abstract and relational abilities of neural
networks is an important step towards the achievement of
some interesting new tasks, such as the discovery of particular
patterns in cultural heritage, or the search for aesthetic beauty
patterns in images and even music.
In particular, we stuck to the work by Kim et al. [11].
They showed how the same-different problem strain simple
feed-forward convolutional neural networks. We extended their
study to very-deep state-of-the-art neural networks for image
Considering the SVRT visual challenge, our results show
that, despite some difficulties, ResNets and CorNet-S (a
biologically-inspired architecture similar to ResNet architec-
ture) are able to correctly understand and generalize to never
seen shapes. We also found that Cornet-S is able to reach
strong stability with respect to 1-pixel wide noise, in the case
of shapes rotation without interpolation.
By contrast, not very-deep models such as AlexNet and
CorNet-Z are not able to learn any of the proposed problems.
This evidence is aligned with Kim et al. findings.
However, with the evidence that also deep neural networks
such as VGG19 and GoogLeNet are not able to converge, we
hypothesized that the residual connections in the ResNet archi-
tecture are important key points for the successful convergence
of these models.
There are still many open problems in this research di-
rection. As of now, abstract reasoning capabilities of neural
networks are tested on very simple and low distribution
variability datasets, such as SVRT or Sort-of-CLEVR. One
of the important steps to enhance this research would be to
consider more complex images collecting real-world shapes,
possibly seen from different perspective angles.
In our work, we tested CorNet-S, a model that directly
draws inspiration from neuroscientific evidence on the pri-
mates brain. We think that it would be interesting to access
more neuroscience research, in order to understand what are
the functional components of the human brain that contribute
to complex abstract reasoning.
This work was partially supported by the AI4EU project,
funded by the EC (H2020 - Contract n. 825619), and Auto-
matic Data and documents Analysis to enhance human-based
processes (ADA), CUP CIPE D55F17000290009.
We also gratefully acknowledge the support of NVIDIA
Corporation with the donation of the Tesla K40 GPU used
for this research.
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural Infor-
mation Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and
K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[2] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, Nov 1998.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016.
[4] S. Liu and W. Deng, “Very deep convolutional neural network based
image classification using small training sample size,” in 2015 3rd IAPR
Asian Conference on Pattern Recognition (ACPR), Nov 2015, pp. 730–
[5] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
CoRR, vol. abs/1804.02767, 2018.
[6] K. He, G. Gkioxari, P. Doll´
ar, and R. Girshick, “Mask r-cnn,” in
Computer Vision (ICCV), 2017 IEEE International Conference on.
IEEE, 2017, pp. 2980–2988.
[7] F. Carrara, A. Esuli, T. Fagni, F. Falchi, and A. M. Fern´
andez, “Picture it
in your mind: Generating high level visual representations from textual
descriptions,” Information Retrieval Journal, vol. 21, no. 2-3, pp. 208–
229, 2018.
[8] L. Vadicamo, F. Carrara, A. Cimino, S. Cresci, F. Dell’Orletta, F. Falchi,
and M. Tesconi, “Cross-media learning for image sentiment analysis in
the wild,” in 2017 IEEE International Conference on Computer Vision
Workshops (ICCVW), Oct 2017, pp. 308–317.
[9] F. Carrara, F. Falchi, R. Caldelli, G. Amato, R. Fumarola, and
R. Becarelli, “Detecting adversarial example attacks to deep neural
networks,” in Proceedings of the 15th International Workshop on
Content-Based Multimedia Indexing. ACM, 2017, pp. 38:1–38:7.
[Online]. Available:
[10] F. Fleuret, T. Li, C. Dubout, E. K. Wampler, S. Yantis, and D. Geman,
“Comparing machines and humans on a visual categorization test,”
Proceedings of the National Academy of Sciences, vol. 108, no. 43,
pp. 17 621–17 625, 2011.
[11] J. Kim, M. Ricci, and T. Serre, “Not-so-CLEVR: Visual relations
strain feedforward neural networks,” 2018. [Online]. Available:
[12] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei,
C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for
compositional language and elementary visual reasoning,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), July
[13] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu,
P. Battaglia, and T. P. Lillicrap, “A simple neural network module
for relational reasoning,” CoRR, vol. abs/1706.01427, 2017. [Online].
[14] N. Messina, G. Amato, F. Carrara, F. Falchi, and C. Gennaro, “Learning
relationship-aware visual features,” in Computer Vision – ECCV 2018
Workshops, L. Leal-Taix´
e and S. Roth, Eds. Cham: Springer Interna-
tional Publishing, 2019, pp. 486–501.
[15] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-
Fei, C. Lawrence Zitnick, and R. Girshick, “Inferring and executing
programs for visual reasoning,” in The IEEE International Conference
on Computer Vision (ICCV), Oct 2017.
[16] D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, “Transparency
by design: Closing the gap between performance and interpretability
in visual reasoning,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
[17] C. Zhang, F. Gao, B. Jia, Y. Zhu, and S.-C. Zhu, “RAVEN: A Dataset
for Relational and Analogical Visual rEasoNing,arXiv e-prints, Mar
[18] D. Barrett, F. Hill, A. Santoro, A. Morcos, and T. Lillicrap, “Measuring
abstract reasoning in neural networks,” in Proceedings of the 35th
International Conference on Machine Learning, ser. Proceedings of
Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.
Stockholmsmssan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp.
[19] G. R. Yang, I. Ganichev, X.-J. Wang, J. Shlens, and D. Sussillo, “A
dataset and architecture for visual reasoning with a working memory,” in
Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu,
and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp.
[20] S. Stabinger, A. Rodr´
anchez, and J. Piater, “25years of cnns:
Can we compare to human abstraction capabilities?” in Artificial Neural
Networks and Machine Learning – ICANN 2016, A. E. Villa, P. Masulli,
and A. J. Pons Rivero, Eds. Cham: Springer International Publishing,
2016, pp. 380–387.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2015.
[22] J. Kubilius, M. Schrimpf, A. Nayebi, D. Bear, D. L. K. Yamins, and
J. J. DiCarlo, “Cornet: Modeling the neural mechanisms of core object
recognition,” bioRxiv, 2018.
[23] K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, and J. J. DiCarlo,
“Evidence that recurrent circuits are critical to the ventral stream’s
execution of core object recognition behavior,Nature Neuroscience,
2019. [Online]. Available:
... The authors in [3,14] found that deep CNNs, like ResNet-50, can solve the SVRT problems even with a relatively small amount of samples (28k images). Similarly, the authors in [21,22] demonstrated that also many other state-of-the-art deep learning architectures for classifying images (ResNet, DenseNets, CorNet) models can learn this task, generalizing to some extent. Recently, [32] discussed the important role of attention in deep learning models for addressing the same-different problems. ...
... From previous works [20,3] it is clear that relational problems -the ones involving shape comparisons under different geometric transformations -are the most difficult to solve for Deep Neural Networks. Thus, as in [21,22], we focus the attention on four of these problems: Problem 1 (P.1) -detecting the very same shapes, randomly placed in the image, having the same orientation and scale; Problem 5 (P.5) -detecting two pairs of identical shapes, randomly placed in the image. Problem 20 (P.20) -detecting the same shape, translated and flipped along a randomly chosen axis; Problem 21 (P.21) -detecting the same shape, randomly translated, orientated, and scaled. ...
... During training, we set the maximum time steps T = 9. We collected results using both 28k training images, following [3], and 400k training images, for comparing our proposed architectures with convolutional networks trained in [21,22]. We used 18k images both for validation and testing. ...
Although convolutional neural networks (CNNs) showed remarkable results in many vision tasks, they are still strained by simple yet challenging visual reasoning problems. Inspired by the recent success of the Transformer network in computer vision, in this paper, we introduce the Recurrent Vision Transformer (RViT) model. Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT dataset. The weight-sharing both in spatial and depth dimensions regularizes the model, allowing it to learn using far fewer free parameters, using only 28k training samples. A comprehensive ablation study confirms the importance of a hybrid CNN + Transformer architecture and the role of the feedback connections, which iteratively refine the internal representation until a stable prediction is obtained. In the end, this study can lay the basis for a deeper understanding of the role of attention and recurrent connections for solving visual abstract reasoning tasks.
... We found that our feed-forward model can in fact perform well on the same-different tasks of SVRT (see Figure 3B; see also concurrent work of Messina et al., 2019). This result was not due to an increase in the number of training samples. ...
... This result was not due to an increase in the number of training samples. In fact, we used fewer images (28,000 images) than Kim et al. (2018) (1 million images) and Messina et al. (2019) (400,000 images). Of course, the results were obtained on the SVRT data set and might not hold for other visual reasoning data sets (see Introduction, "Testing generalization of mechanisms"). ...
Full-text available
With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed toward comparing information processing in humans and machines. These studies are an exciting chance to learn about one system by studying the other. Here, we propose ideas on how to design, conduct, and interpret experiments such that they adequately support the investigation of mechanisms when comparing human and machine perception. We demonstrate and apply these ideas through three case studies. The first case study shows how human bias can affect the interpretation of results and that several analytic tools can help to overcome this human reference point. In the second case study, we highlight the difference between necessary and sufficient mechanisms in visual reasoning tasks. Thereby, we show that contrary to previous suggestions, feedback mechanisms might not be necessary for the tasks in question. The third case study highlights the importance of aligning experimental conditions. We find that a previously observed difference in object recognition does not hold when adapting the experiment to make conditions more equitable between humans and machines. In presenting a checklist for comparative studies of visual reasoning in humans and machines, we hope to highlight how to overcome potential pitfalls in design and inference.
... On the other hand, even though it was concluded in [90] that DL models struggle with solving S-D tasks, it was later shown in [92][93][94] that the initial categorization problems from SVRT can be solved with ResNets -modern DL models that excel in computer vision. ...
Full-text available
Visual Reasoning (AVR) problems are commonly used to approximate human intelligence. They test the ability of applying previously gained knowledge, experience and skills in a completely new setting, which makes them particularly well-suited for this task. Recently, the AVR problems have become popular as a proxy to study machine intelligence, which has led to emergence of new distinct types of problems and multiple benchmark sets. In this work we review this emerging AVR research and propose a taxonomy to categorise the AVR tasks along 5 dimensions: input shapes, hidden rules, target task, cognitive function, and main challenge. The perspective taken in this survey allows to characterise AVR problems with respect to their shared and distinct properties, provides a unified view on the existing approaches for solving AVR tasks, shows how the AVR problems relate to practical applications, and outlines promising directions for future work. One of them refers to the observation that in the machine learning literature different tasks are considered in isolation, which is in the stark contrast with the way the AVR tasks are used to measure human intelligence, where multiple types of problems are combined within a single IQ test.
... In 2019, Messina et al. (2019) were able to solve Problems 1, 5, 20, and 21 of the SVRT dataset. The authors were able to achieve an accuracy of above 95% for all four problems using different ResNet architectures by He et al. (2016) as well as the Table 3 for the CorNet-S results), but the authors had to use 400,000 training images to achieve these results, which might be problematic, considering the discussed potential problems of the SVRT dataset. ...
Full-text available
Convolutional neural networks have become the state-of-the-art method for image classification in the last 10 years. Despite the fact that they achieve superhuman classification accuracy on many popular datasets, they often perform much worse on more abstract image classification tasks. We will show that these difficult tasks are linked to relational concepts from cognitive psychology and that despite progress over the last few years, such relational reasoning tasks still remain difficult for current neural network architectures. We will review deep learning research that is linked to relational concept learning, even if it was not originally presented from this angle. Reviewing the current literature, we will argue that some form of attention will be an important component of future systems to solve relational tasks. In addition, we will point out the shortcomings of currently used datasets, and we will recommend steps to make future datasets more relevant for testing systems on relational reasoning.
... Further research by Messina et al. (2019) showed that ResNet by He et al. (2016) as well as CorNet-S by Kubilius et al. (2018) can solve some of the previously unsolved problems of the SVRT dataset, but needed to use 400.000 training images to do so. Funke et al. (2020) finally were able to achieve accuracies above 90% on all problems of the SVRT dataset using a ResNet architecture with just 28.000 images. ...
Convolutional neural networks have established themselves over the past years as the state of the art method for image classification, and for many datasets, they even surpass humans in categorizing images. Unfortunately, the same architectures perform much worse when they have to compare parts of an image to each other to correctly classify this image. Until now, no well-formed theoretical argument has been presented to explain this deficiency. In this paper, we will argue that convolutional layers are of little use for such problems, since comparison tasks are global by nature, but convolutional layers are local by design. We will use this insight to reformulate a comparison task into a sorting task and use findings on sorting networks to propose a lower bound for the number of parameters a neural network needs to solve comparison tasks in a generalizable way. We will use this lower bound to argue that attention, as well as iterative/recurrent processing, is needed to prevent a combinatorial explosion.
Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images or videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to be detected. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deep fake detection on faces, given that most methods are becoming extremely accurate in the generation of realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC). The code for reproducing our results is publicly available here:
Full-text available
The increasing interest in social networks, smart cities, and Industry 4.0 is encouraging the development of techniques for processing, understanding, and organizing vast amounts of data. Recent important advances in Artificial Intelligence brought to life a subfield of Machine Learning called Deep Learning, which can automatically learn common patterns from raw data directly, without relying on manual feature selection. This framework overturned many computer science fields, like Computer Vision and Natural Language Processing, obtaining astonishing results. Nevertheless, many challenges are still open. Although deep neural networks obtained impressive results on many tasks, they cannot perform non-local processing by explicitly relating potentially interconnected visual or textual entities. This relational aspect is fundamental for capturing high-level semantic interconnections in multimedia data or understanding the relationships between spatially distant objects in an image. This thesis tackles the relational understanding problem in Deep Neural Networks, considering three different yet related tasks: Relational Content-based Image Retrieval (R-CBIR), Visual-Textual Retrieval, and the Same-Different tasks. We use state-of-the-art deep learning methods for relational learning, such as the Relation Networks and the Transformer Networks for relating the different entities in an image or in a text.
Full-text available
Deepfakes are the result of digital manipulation to obtain credible videos in order to deceive the viewer. This is done through deep learning techniques based on autoencoders or GANs that become more accessible and accurate year after year, resulting in fake videos that are very difficult to distinguish from real ones. Traditionally, CNN networks have been used to perform deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC).
Convolutional neural networks have established themselves over the past years as the state of the art method for image classification, and for many datasets, they even surpass humans in categorizing images. Unfortunately, the same architectures perform much worse when they have to compare parts of an image to each other to correctly classify this image. Until now, no well-formed theoretical argument has been presented to explain this deficiency. In this paper, we will argue that convolutional layers are of little use for such problems, since comparison tasks are global by nature, but convolutional layers are local by design. We will use this insight to reformulate a comparison task into a sorting task and use findings on sorting networks to propose a lower bound for the number of parameters a neural network needs to solve comparison tasks in a generalizable way. We will use this lower bound to argue that attention, as well as iterative/recurrent processing, is needed to prevent a combinatorial explosion.
Full-text available
Non-recurrent deep convolutional neural networks (CNNs) are currently the best at modeling core object recognition, a behavior that is supported by the densely recurrent primate ventral stream, culminating in the inferior temporal (IT) cortex. If recurrence is critical to this behavior, then primates should outperform feedforward-only deep CNNs for images that require additional recurrent processing beyond the feedforward IT response. Here we first used behavioral methods to discover hundreds of these ‘challenge’ images. Second, using large-scale electrophysiology, we observed that behaviorally sufficient object identity solutions emerged ~30 ms later in the IT cortex for challenge images compared with primate performance-matched ‘control’ images. Third, these behaviorally critical late-phase IT response patterns were poorly predicted by feedforward deep CNN activations. Notably, very-deep CNNs and shallower recurrent CNNs better predicted these late IT responses, suggesting that there is a functional equivalence between additional nonlinear transformations and recurrence. Beyond arguing that recurrent circuits are critical for rapid object identification, our results provide strong constraints for future recurrent model development.
Conference Paper
Relational reasoning in Computer Vision has recently shown impressive results on visual question answering tasks. On the challenging dataset called CLEVR, the recently proposed Relation Network (RN), a simple plug-and-play module and one of the state-of-the-art approaches, has obtained a very good accuracy (95.5%) answering relational questions. In this paper, we define a sub-field of Content-Based Image Retrieval (CBIR) called Relational-CBIR (R-CBIR), in which we are interested in retrieving images with given relationships among objects. To this aim, we employ the RN architecture in order to extract relation-aware features from CLEVR images. To prove the effectiveness of these features, we extended both CLEVR and Sort-of-CLEVR datasets generating a ground-truth for R-CBIR by exploiting relational data embedded into scene-graphs. Furthermore, we propose a modification of the RN module – a two-stage Relation Network (2S-RN) – that enabled us to extract relation-aware features by using a preprocessing stage able to focus on the image content, leaving the question apart. Experiments show that our RN features, especially the 2S-RN ones, outperform the RMAC state-of-the-art features on this new challenging task.
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at
A vexing problem in artificial intelligence is reasoning about events that occur in complex, changing visual stimuli such as in video analysis or game play. Inspired by a rich tradition of visual reasoning and memory in cognitive psychology and neuroscience, we developed an artificial, configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures. We additionally propose a deep learning architecture that performs competitively on other diagnostic VQA datasets (i.e. CLEVR) as well as easy settings of the COG dataset. However, several settings of COG result in datasets that are progressively more challenging to learn. After training, the network can zero-shot generalize to many new tasks. Preliminary analyses of the network architectures trained on COG demonstrate that the network accomplishes the task in a manner interpretable to humans.
The robust and efficient recognition of visual relations in images is a hallmark of biological vision. Here, we argue that, despite recent progress in visual recognition, modern machine vision algorithms are severely limited in their ability to learn visual relations. Through controlled experiments, we demonstrate that visual-relation problems strain convolutional neural networks (CNNs). The networks eventually break altogether when rote memorization becomes impossible such as when the intra-class variability exceeds their capacity. We further show that another type of feedforward network, called a relational network (RN), which was shown to successfully solve seemingly difficult visual question answering (VQA) problems on the CLEVR datasets, suffers similar limitations. Motivated by the comparable success of biological vision, we argue that feedback mechanisms including working memory and attention are the key computational components underlying abstract visual reasoning.