Digital Object Identifier 10.1109/ACCESS.2017.DOI
A Comprehensive Survey on Visual Question
Answering Datasets and Algorithms
RAIHAN KABIR1, NAZNIN HAQUE1, MD SAIFUL ISLAM1, 2, AND MARIUM-E-JANNAT1, 3
1Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
2Department of Computing Science, University of Alberta, Edmonton, Canada
3Department of Computer Science, University of British Columbia, Kelowna, Canada
ABSTRACT Visual question answering (VQA) refers to the problem where, given an image and a natural language
question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both
the visual understanding of the image and the semantic understanding of the question. It also has to possess the necessary
reasoning capability to deduce the answer correctly. Consequently, a capable VQA model would lead to significant
advancements in artificial intelligence. Since the inception of this field, a plethora of VQA datasets and models have
been published. In this article, we meticulously analyze the current state of VQA datasets and models while cleanly
dividing them into distinct categories and then summarize the methodologies and characteristics of each category. We
divide VQA datasets into four categories:
• General datasets that contain a rich collection of authentic images
• Synthetic datasets that contain only synthetic images produced through artificial means
• Diagnostic datasets that are specially designed to test model performance in a particular area, e.g., understanding scene text
• KB (Knowledge-Based) datasets that are designed to measure a model's ability to utilize outside knowledge
Concurrently, we explore six main paradigms of VQA models: fusion, where we discuss different methods of fusing
information between the visual and textual modalities; attention, the technique of using information from one
modality to filter information from another; external knowledge bases, where we discuss models that utilize
outside information; composition or reasoning, where we analyze techniques for answering advanced questions that require
complex reasoning steps; explanation, the process of generating visual and/or textual descriptions to verify
sound reasoning; and graph models, which encode and manipulate relationships through nodes in a graph. We also discuss
some miscellaneous topics, such as scene text understanding, counting, and bias reduction.
INDEX TERMS Visual question answering, computer vision, natural language processing, multimodal learning,
machine learning, deep learning, reinforcement learning, neural network.
I. INTRODUCTION
The field of computer vision has seen significant advances
in recent times. Since the development of neural networks,
many computer vision problems such as captioning, object
recognition, object detection, and scene classification have
seen tremendous progress. Moreover, the advent of social
networking sites, especially image-sharing websites, has led
to abundant image data. Many efficient methods have been
developed to extract and collect this data, making large-scale
image datasets possible. Crowd-sourcing has also made it
possible to quickly utilize human workers to perform cap-
tioning, annotation, etc. The same is true for natural language
processing, another field to which VQA belongs. Significant
work has been done in tasks such as language modeling, word
and sentence representation, POS (part-of-speech) tagging,
sentiment analysis, textual question answering, etc.
The problem of VQA is that given any image and any
natural language question about that image, a correct natural
language answer has to be produced.
Despite this simple description, VQA encompasses a lot of
sub-problems such as:
• Object recognition: What is behind the chair?
• Object detection: Are there any people in the image?
• Counting: How many dogs are there?
• Scene classification: Is it raining?
• Attribute classification: Is the person happy?
In VQA, the image has to be understood with the context
given by the question. The same image has to be “seen”
FIGURE 1: Examples of VQA from the VQA-v2 dataset. Figure from [42]
differently for different questions. So proper understand-
ing of the question is crucial. Questions can be simple or
complex. Sometimes the reasoning process only involves a
single object, but it can also involve multiple objects, their
attributes, and their relations.
So, what exactly would solving VQA achieve? There are many applications for VQA. An obvious
one is helping blind people understand their surroundings.
Image search can also be made easier, as the user only
has to ask a question and image results are returned
based on that query. But the ultimate aim of the VQA problem
is to pass the Visual Turing Test, a major
milestone for AI research. A genuinely intelligent system must be able
to understand and utilize visual information correctly, and VQA
provides a natural way to test this understanding.
II. DATASETS
We divide VQA datasets into four categories: 1) General;
2) Synthetic; 3) Diagnostic, and 4) Knowledge-Based. Gen-
eral datasets are large datasets consisting of natural im-
ages and human-generated QA pairs. Synthetic datasets
contain computer-generated images and questions that are
constructed from a fixed number of hand-made templates.
Diagnostic datasets are smaller datasets that can be used to
isolate and evaluate a specific capability of a model. Finally,
knowledge-based datasets provide an external knowledge
base that can be queried for extra information.
A. GENERAL DATASETS
General datasets are the largest, richest, and most used
datasets in VQA. General datasets contain many thousands of
real-world images from mainstream image datasets like MS-
COCO [74] and ImageNet [32]. These datasets are notable for
their large scope and diversity. This variety is important as
VQA datasets need to reflect the general nature of VQA. Al-
though these datasets do not necessarily capture the endless
complexity and variety of visuals in real life, they achieve a
close approximation.
1) COCO-QA
COCO-QA was one of the first VQA datasets. It consists of
123K images from MS-COCO. QA pairs were automatically
generated from image descriptions. The dataset contains very
basic questions and only a few question types. Questions are
mainly about object presence, number, color, and location.
2) VQA-v1
FIGURE 2: An example from VQA-v1 [10]
VQA-v1 [10] (often called the “VQA dataset”) was the
first significantly large VQA dataset. It is divided into two
parts, real and abstract. VQA-real consists of real-life images
containing multiple objects and a rich, complex mixture
of visual information. In contrast, VQA-abstract contains
simple artificial scenes generated using clipart software;
we discuss VQA-abstract in the synthetic section. Questions
and answers were generated by Amazon Mechanical
Turk (AMT) workers who were tasked with asking questions
that would stump a smart robot. There are severe biases in
the dataset, for example, a bias of 55.86% towards “yes” in
“yes/no” questions and “2” being the most popular answer
for “number” questions [10]. Many questions also have a
low level of inter-human agreement, and many ask for
subjective answers and opinions which cannot be evaluated
for correctness.
3) VQA-v2
VQA-v2 [42] (also known as balanced VQA) tries to balance
the biases in the original VQA-v1 dataset. The authors be-
lieved that the original VQA dataset contained significant
biases and language priors which enabled models to improve
their accuracy by simply leveraging those biases without
actually looking at the image. VQA-v2 tries to counter this by
having two similar images with two different correct answers
for the same question. As a result, the model is forced to look
at the image to deduce the correct answer.
FIGURE 3: Two complementary images with different answers for the question “What color is the fire hydrant?” from VQA-v2 [42]
4) Visual Genome
Among the VQA datasets, Visual Genome (VG) [66] is one
of the richest in terms of information. It contains over 100K
images where each image has an average of 21 objects, 18
attributes, and 18 pairwise relationships between objects. The
Visual Genome dataset consists of seven main components:
1) region descriptions; 2) objects; 3) attributes; 4) relation-
ships; 5) region graphs; 6) scene graphs; and 7) QA pairs.
It is the first VQA dataset to provide a formalized structural
representation of an image through scene graphs. Questions
were collected specifically so that they were not ambigu-
ous or speculative but answerable if and only if the image
is shown. A notable difference between Visual Genome’s
question distribution and VQA-v1’s is that the authors focus
on ensuring that all categories of questions are adequately
represented while in VQA-v1, 32.37% of the questions are
“yes/no” binary questions. Visual Genome excludes
binary questions in order to encourage more difficult QA
pairs.
5) Visual7W
Visual7W [153] establishes a semantic link between textual
descriptions and image regions through object-level ground-
ing. For each word that indicates an object in the question,
the bounding boxes of the corresponding object in the im-
age are provided. This helps the model to better associate
objects mentioned in the question with objects present in the
image. It also enables the model to provide visual answers.
Visual7W provides 7 types of questions: what, where, when,
who, why, how, and which. The first 6 question types form
the “telling” questions which require descriptive answers
while the “which” category forms the “pointing” questions
that have to point out an object in the image as the answer.
“Yes/No” questions are not present in this dataset.
FIGURE 4: An example image (a) from Visual Genome [66] with part of its scene graph (b)
6) TDIUC
Task Driven Image Understanding Challenge (TDIUC) [62]
is specifically tailored to remove biases that plagued previous
datasets. It contains about 1.6 million questions about 160K
images. The questions are divided into 12 distinct categories
each of which represents a specific vision task that a good
VQA model should be able to solve. Previous datasets had
severe biases in their distribution of different question types
and in the distribution of answers within each question type.
TDIUC tries to mitigate these biases as much as possible. It
also proposes new evaluation metrics which better evaluate
a model’s ability to answer each question type. In previous
datasets, doing well on some classes of questions was
rewarded more than others. For example, in VQA-v1, im-
proving accuracy on “Is/Are” questions by 15% will increase
overall accuracy by over 5% but answering all “Why/Where”
questions correctly will increase overall accuracy by only
4.1%. TDIUC also introduces absurd questions which should
be answered with “does not apply”. It should be noted, however,
that there are only about a thousand unique answers due to
most answers being restricted to one or two words.
7) GQA
GQA [54] is a dataset that marries compositional reasoning
with real-world images. It provides a wealth of data consist-
ing of 22M questions for 113k images. Like Visual Genome,
it provides rich, structural, formalized image information
through scene graphs. But unlike Visual Genome and like
CLEVR [60], it also provides formal, structural representa-
tions of questions in the form of functional programs that de-
tail the reasoning steps needed to answer that question. GQA
is richer than CLEVR in terms of the variety of object classes
and properties as it uses real images instead of synthetic
ones. Although GQA questions are automatically generated,
they are not completely synthetic like CLEVR. Grammatical
Dataset | Number of Images | Number of QA Pairs | Question Type | Question Category | Image Source | QA Source | Evaluation Metric | Comments
COCO-QA [101] | 123K | 118K | OE | Object, color, number, and location | MS COCO | Auto | WUPS | Very basic questions, only a few question types
VQA-v1 [10] | 204K | 614K (3 per image) | OE and MC | Yes/no, number, and other | MS COCO | Manual | VQA metric | Severely biased
VQA-v2 [42] | 200K | 1.1M | OE and MC | Yes/no, number, and other | MS COCO | Manual | VQA metric | Slightly less biased
Visual Genome [66] | 100K | 1.7M (17 per image) | OE | What, where, how, when, who, and why | MS COCO | Manual | N/A | Rich annotation
Visual7W [153] | 47K | 328K | MC | Telling (what, where, how, when, who, why) and pointing (which) | MS COCO | Manual | N/A | New pointing-type question, object grounding
TDIUC [62] | 167K | 1.65M | OE | 12 types | MS COCO | Manual and auto | A-MPT and H-MPT (normalized and unnormalized) | Better evaluation metrics, absurd questions
GQA [54] | 113K | 22M | OE | 25 types | MS COCO | Auto | Consistency, validity and plausibility, distribution, grounding | Better evaluation metrics, rich structural image and question annotations, object grounding, explanation annotation
TABLE 1: Statistics of general VQA datasets.
rules learned from natural language questions and the use of
probabilistic grammar makes GQA questions more human-
like and more immune to memorization of language priors.
Moreover, sampling and balancing are performed to reduce
skews and biases in the question and answer distribution.
As questions are not generated by humans, many human-
produced biases are not present. Finally, GQA provides
new evaluation metrics which better facilitate evaluating a
model’s true capability.
B. SYNTHETIC DATASETS
Synthetic datasets contain artificial images, produced using
software, instead of real images. A good VQA model should
be able to perform well on both real and synthetic data like
humans do. Synthetic datasets are easier, less expensive, and
less time-consuming to produce as the building of a large
dataset can be automated. Synthetic datasets can be tailored
so that performing well on them requires better reasoning and
composition skills.
1) VQA Abstract
VQA abstract is a subset of the VQA-v1 dataset consisting
of 50K artificial scenes. The images were generated using
clip-art software. VQA abstract scenes are relatively simple
compared to its real counterpart and contain less variety and
noise. It was designed this way so a model could focus more
on high level reasoning than low level vision tasks.
2) Yin and Yang
Just as VQA-v2 tried to balance VQA-v1, Yin and Yang [149]
tries to balance the VQA-abstract dataset. The authors focus only
on the yes/no questions and show that models have been
using language biases to gain accuracy. As a solution, they
balance the number of “yes” and “no” answers. They also
provide complementary scenes so that each question has a
“yes” answer for one scene and “no” for the other.
3) SHAPES
SHAPES is a small dataset introduced in [9], consisting of
64 images that are composed by arranging colored geometric
shapes in different spatial orientations. Each image has the
same 244 “yes/no” questions resulting in 15,616 questions.
Due to its small size, it can only be used for diagnostic
purposes to evaluate a model’s spatial reasoning skills.
4) CLEVR
CLEVR [60] is a dataset built solely to evaluate a model’s
high level reasoning ability. CLEVR contains 100k synthetic
images and 1M artificially generated questions. Scenes are
presented as collections of objects annotated with shape,
Dataset | Number of Images | Number of QA Pairs | Type | Composition | QA Source | Evaluation Metric
CLEVR [60] | 100K | 1M | OE | Exist, count, compare integer, query attribute, and compare attribute | Auto | Exact matching
VQA-abstract [10] | 50K | 150K | OE and MC | Yes/no, number, and other | Manual | VQA metric
SHAPES [9] | 15.5K | 15.5K | Binary | Yes/no | Auto | Exact matching
Yin and Yang [149] | 15K | 33K | Binary | Yes/no | Manual | VQA metric
TABLE 2: Statistics of synthetic VQA datasets.
size, color, material, and position on the ground-plane. The
CLEVR universe contains three object shapes that come in
two absolute sizes, two materials, and eight colors. Objects
are spatially related via four relationships: “left”, “right”,
“behind” and “in front”. CLEVR questions are automatically
generated from hand-made templates.
FIGURE 5: An example from CLEVR. Figure from [60]
A unique feature of CLEVR is that it provides structural
representations of questions as extra annotation. Each ques-
tion is represented by a sequence of functions which reflect
the correct reasoning steps. CLEVR questions are longer
and more complex than questions in general VQA datasets.
This allows CLEVR to generate a greater variety of unique
questions which mitigates question conditional biases. Long
questions also correlate to longer reasoning steps necessary
to arrive at the answer.
One drawback of CLEVR is that it contains relatively
simple and basic visuals. Models can overfit due to smaller
variety of objects and object features. To test this and to also
test compositional generalization, the authors synthesized
two new versions of CLEVR: 1) In condition A, all cubes are
gray, blue, brown, or yellow and all cylinders are red, green,
purple, or cyan; 2) In condition B, these shapes swap color
palettes. Both conditions contain spheres of all eight colors.
Good models should be able to train on one set and perform
well on the other. As CLEVR does not contain real images
or natural language questions, the authors suggest using it
only as a diagnostic tool to test a model’s compositional
ability. CLEVR also has a CLEVR-humans variation where
the questions are human provided natural language questions.
C. DIAGNOSTIC DATASETS
Diagnostic datasets are specialized in the sense that they test
a model’s ability in a particular area. They are usually small
in size and are meant to complement larger, more general
datasets by diagnosing the model’s performance in a distinct
area which may not have pronounced results in the more
general dataset.
1) C-VQA and VQA-CP
C-VQA [6] is a rearranged version of VQA-v1. It is meant
to help evaluate a model’s compositional ability, i.e., the
ability to answer questions about unseen compositions of
seen concepts. For instance, a model is said to be composi-
tional if it can correctly answer [“What color are the horses”,
“white”] without seeing this QA pair during training, but
perhaps having seen [“What color are the horses?”, “red”]
and [“What color are the dogs?”, “white”]. C-VQA is created
by rearranging the train and val splits of VQA-v1 so that
the QA pairs in C-VQA test split are compositionally novel
with respect to those in C-VQA train split. For instance,
“tennis” is the most frequent answer for the question type
“what sport” in C-VQA train split whereas “skiing” is the
most frequent answer for the same question type in C-VQA
test split. However for VQA-v1, “tennis” is the most frequent
answer for both the train and val splits. Like C-VQA, VQA-
CP [4] reorganizes VQA-v1 and VQA-v2 so that the answer
distribution of each question type is different in the train and
test splits.
Dataset | Number of Images | Number of QA Pairs | Type | Image Source | QA Source | Evaluation Metric | Diagnosis Type
C-VQA [6] | 123K | 370K | OE and MC | VQA-v1 | VQA-v1 | VQA metric | Composition
VQA-CP v1 [4] | 123K | 370K | OE and MC | VQA-v1 | VQA-v1 | VQA metric | Composition
VQA-CP v2 [4] | 123K | 658K | OE and MC | VQA-v2 | VQA-v2 | VQA metric | Composition
ConVQA [100] | 108K | 6.7M | OE | VG | Auto | Perfect-con, avg-con | Consistency
VQA-Rephrasings [106] | 40K | 160K | OE | VQA-v2 | Manual | VQA metric and consensus score | Consistency
VQA P2 [134] | 16K | 36K | OE | VQA-v2 | Auto | VQA metric and consensus score | Consistency
VQA-HAT [30] | 20K | 60K | OE | VQA-v1 | VQA-v1 | Rank correlation | Visual attention
VQA-X [56] | 28K | 33K (QA), 42K (E) | OE | VQA-v1 and VQA-v2 | Manual | BLEU-4, METEOR, ROUGE, CIDEr, SPICE, and human evaluation | Textual explanation
VQA-E [71] | 108K | 270K | OE | VQA-v2 | VQA-v2 (QA), auto (explanation) | BLEU-N, METEOR, ROUGE-L, and CIDEr-D | Textual explanation
HowManyQA [123] | 47K | 111K | OE | VQA-v2 + VG | VQA-v2 + VG | VQA metric, RMSE (root mean squared error) | Counting (simple)
TallyQA [1] | 165K | 288K | OE | VQA-v2 + VG + TDIUC | VQA-v2 + VG + TDIUC | Accuracy and RMSE | Counting (simple and complex)
ST-VQA [15] | 23K | 32K | OE | Various datasets | Manual | Normalized Levenshtein distance | Scene text
Text-VQA [111] | 28K | 45K | OE | Open Images v3 | Manual | VQA metric | Scene text
TABLE 3: Statistics of diagnostic VQA datasets.
2) VQA-E, VQA-X and VQA-HAT
VQA-E [71] and VQA-X [56] argue that a VQA model
should provide an explanation along with the answer so that
it can be verified that the answer was deduced properly
without exploiting dataset biases. To facilitate this, both
datasets provide explanations along with the usual image-
question-answer triplet in their training sets. This explanation
annotation can teach a model to produce explanations for
unseen instances. VQA-E contains 270K textual explanations
for 270K QA pairs and 108K images while VQA-X is smaller
containing 42K explanations for 33K QA pairs and 28K
images. Explanations in VQA-X are human-provided while
in VQA-E they are automatically generated. VQA-X also
provides visual explanations in the form of human-annotated
attention maps for its images. VQA-HAT [31] is another
dataset that provides attention annotations. To collect data
for VQA-HAT, human subjects were presented with a blurred
image and then were asked a question about that image. They
were instructed to deblur regions in the image that would help
them to answer the question correctly. The regions they chose
to deblur are thought to correspond to the regions humans
naturally choose to focus on in order to deduce the answer.
3) VQA-Rephrasings
VQA-Rephrasings [106] shows that most state-of-the-art
VQA models are notoriously brittle to linguistic varia-
tions in questions. VQA-Rephrasings provides 3 human-
provided rephrasings for 40k questions spanning 40k images
from the VQA-v2 validation dataset. They also provide a
model-agnostic method to improve the ability to answer
rephrased questions.
FIGURE 6: An example from C-VQA, showing training (a, b) and testing (c, d) examples. Figure from [6]
FIGURE 7: An example from ConVQA. Figure from [100]
4) ConVQA
ConVQA [100] introduces the task of belief consistency in
VQA models, i.e., the task of consistently answering ques-
tions that differ in linguistics but share the same semantics.
For example, if the answer to “Is it a vegetarian pizza?”
is “yes”, then the answer to “Is there pepperoni on the
pizza?” should be “no”. Previous VQA datasets did not have
such multiple questions about a single concept to test the
consistency of a model. There are a total of 6,792,068 QA
pairs in 1,530,979 consistent sets over 108,077 images in
ConVQA.
5) HowManyQA and TallyQA
HowManyQA (introduced in [123]) was made by combin-
ing counting questions from VQA-v2 and Visual Genome
resulting in 106,356 counting questions. TallyQA [1] points
out that most counting questions in VQA datasets and
HowManyQA are simple, requiring nothing more than object
detection, e.g., “How many dogs are there?”. They present
a dataset which contains a large number of complex counting
questions, for example, “How many dogs are laying on the
grass?”. These questions require more complex reasoning
about objects, their attributes and their relationships with
other objects and background regions. TallyQA contains
287,907 questions of which 76,477 are complex questions.
6) Text-VQA and ST-VQA
FIGURE 8: An example from ST-VQA. Figure from [15]
A lot of questions humans ask about an image tend to be
about text in the image, for example, “What is the number on
the license plate?”. Such types of questions are rare in general
VQA datasets. TextVQA [111] contains 45,336 questions
about 28,408 images that require reasoning involving the
text present in the image. Each image-question pair has 10
ground truth answers provided by humans. ST-VQA [15]
contains 31,791 scene text related questions about 23,038
images. ST-VQA ensures that all questions are unambiguous
and all images contain a minimum amount of text. It also
ensures that the question and answer distributions are not
biased, and it proposes a new evaluation metric better suited
to this particular task.
D. KB DATASETS
Sometimes it is not possible to answer a question with
only the information present in the image. In such cases,
the required knowledge has to be acquired from external
sources. This is where KB datasets come in. They provide
questions that require finding and using external knowledge.
KB datasets can teach a model to know when it needs to
search for absent knowledge and how to acquire that knowl-
edge. Even the largest VQA datasets cannot contain all real-
world concepts. So VQA models should know how to acquire
Dataset | Number of Images | Number of QA Pairs | Type | Question Types | Image Source | QA Source | Metric | External Knowledge Annotation | Knowledge Base Used in Construction
FVQA [129] | 2190 | 5826 | OE | 32 | MS COCO and ImageNet | Manual | Exact matching, WUPS | Supporting fact (arg1, rel, arg2) | Combination of DBpedia, WebChild, and ConceptNet
KB-VQA [128] | 700 | 2402 | OE | 23 | MS COCO | Mixed | Exact matching, WUPS, manual | None | DBpedia
OK-VQA [82] | 14K | 14K | OE | 10 | MS COCO | Manual | VQA metric | None | None
TABLE 4: Statistics of KB datasets.
knowledge about unseen concepts to achieve true general-
izability. It should be noted that KB datasets are different
from EKBs or External Knowledge Bases which are large
structured databases compiling various world information.
Examples of such EKBs include large-scale KBs constructed by
human annotation, e.g., DBpedia [11], Freebase [16], and Wikidata
[127], and KBs built by automatic extraction from unstructured/semi-
structured data, e.g., YAGO [48], [80], OpenIE [12], [34],
[35], NELL [22], NEIL [25], WebChild [118], and ConceptNet
[76].
1) FVQA
FVQA [129] is a dataset where most of the questions (5826
questions about 2190 images) require external information
to answer. To support this, FVQA adds an additional “sup-
porting fact” to each image-question-answer triplet. Each
supporting-fact is represented as a structural triplet, such
as (Cat, CapableOf, ClimbingTrees), which can be used to
answer the question. These facts mimic common-sense and
external knowledge that help humans (and should likewise
help VQA models) answer questions that require outside
knowledge. By using these facts, FVQA teaches a model to
look for the appropriate supporting fact from a facts-database
and use that to deduce the answer.
2) KB-VQA
KB-VQA [128] is a small dataset containing 2402 questions
about 700 images based on 23 question templates. Among
these, 1256 are visual questions (requiring no external
information), 883 require common-sense knowledge, and 263
require querying an external knowledge base.
3) OK-VQA
OK-VQA [82] contains 14,055 questions about 14,031 im-
ages. All questions were collected so that they require outside
knowledge to answer. OK-VQA is larger than both FVQA
and KB-VQA. Care was taken to reduce various biases, such
as bias in the answer distribution, as much as possible.
FIGURE 9: An example from FVQA. Figure from [129]
E. EVALUATION
Whether a model’s performance is evaluated correctly depends on
the evaluation metric used. Unfortunately, a major problem in
VQA is that there is no widely agreed-upon evaluation metric,
and many different metrics have been proposed.
1) Exact matching
The simplest metric: a prediction is considered correct only
if it matches the ground-truth answer exactly. There are obvious
problems with this approach. Many questions have multiple valid
answers. The number of valid answers increases with the
number of words. This metric can be used for datasets where
most of the answers are single-word. But for datasets like
Visual Genome where 27% of answers contain three or more
words, this metric performs poorly.
2) WUPS
WUPS, or the Wu-Palmer similarity test [139], is a metric where
the predicted answer is given a rating between 0 and 1 with
respect to the ground-truth answer. The rating is calculated
by locating the two words’ least common subsumer in a taxonomy
tree, and it reflects how “semantically” similar the two words are. For
example, “bird” and “eagle” would have a rating closer to
1 whereas “bird” and “dog” would have a lower rating. But
some word pairs are ranked high in this scheme even though
they are very loosely related. Moreover, some questions are
such that only one answer is correct whereas other answers,
despite being semantically similar, are wrong. For example,
a question with “What color ..” can only have one correct
answer, e.g., “red” while other colors like “black” are wrong
answers.
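To make the idea concrete, the following is a minimal sketch of a WUPS-style score computed with NLTK's WordNet interface; the exact taxonomy and the thresholded variant used in [139] and in later VQA papers may differ, so treat this as an illustration rather than the official metric.

```python
# Approximate WUPS scoring with WordNet (requires: pip install nltk,
# then nltk.download("wordnet")). Official WUPS implementations also
# apply a down-weighting threshold (commonly 0.9), omitted here.
from nltk.corpus import wordnet as wn

def wup_score(prediction: str, ground_truth: str) -> float:
    """Best Wu-Palmer similarity over all synset pairs of the two words."""
    pred_synsets = wn.synsets(prediction)
    gt_synsets = wn.synsets(ground_truth)
    if not pred_synsets or not gt_synsets:
        # Fall back to exact matching when a word is not in WordNet.
        return float(prediction == ground_truth)
    return max(
        (p.wup_similarity(g) or 0.0)
        for p in pred_synsets
        for g in gt_synsets
    )

print(wup_score("eagle", "bird"))  # higher
print(wup_score("dog", "bird"))    # lower, but still fairly high
```

The second call illustrates the problem described above: loosely related word pairs can still receive a high rating.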
3) Consensus
This was the metric used for the VQA-v1 dataset. The metric is
Accuracy_VQA = min(n / 3, 1)    (1)
where n is the number of annotators who gave the same answer
as the predicted one. Basically, this metric considers any
prediction correct if it matches the answers of at least 3
annotators, which allows multiple valid answers. There
are some problems with this metric. It turns out that not all
questions have annotators agreeing on what the answer should
be; notably, 59% of “why” questions have no answer shared
by more than two annotators, which means the metric will
consider any answer for these questions incorrect. Moreover,
there are questions where directly conflicting answers, like
“yes” and “no”, each have at least three annotators. This means
that both “yes” and “no” are considered valid answers for
these questions.
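A minimal sketch of the metric in Eq. (1) is shown below; the official VQA evaluation additionally normalizes answer strings and averages the score over subsets of annotators, which is omitted here.

```python
from collections import Counter

def vqa_consensus_accuracy(prediction: str, human_answers: list[str]) -> float:
    """min(n/3, 1), where n is the number of annotators agreeing with the prediction."""
    counts = Counter(a.lower().strip() for a in human_answers)
    n = counts[prediction.lower().strip()]
    return min(n / 3.0, 1.0)

answers = ["yes"] * 6 + ["no"] * 4            # 10 annotators for one question
print(vqa_consensus_accuracy("yes", answers))  # 1.0
print(vqa_consensus_accuracy("no", answers))   # 1.0 -- conflicting answers can both get full credit
```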
4) TDIUC metrics
Because of skewed question and answer distribution, normal
evaluation metrics do not capture how well a model is per-
forming across all question and answer categories. TDIUC
[62] introduces A-MPT (arithmetic mean per type) and
H-MPT (harmonic mean per type). A-MPT evaluates average
accuracy per question type like previous metrics. H-MPT measures
the ability of a system to score well across all question types and
is skewed towards the lowest-performing categories. TDIUC
also introduces N-MPT (normalized MPT) to compensate for
bias in the distribution of answers within each question-type.
This is done by computing the accuracy for each unique
answer separately within a question-type and then averaging
them together. A large discrepancy between unnormalized
and normalized scores suggests that the model is not gen-
eralizing to rarer answers.
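The sketch below illustrates how A-MPT, H-MPT, and N-MPT could be computed from per-question records; the input format and the lack of answer normalization are simplifying assumptions, not the official TDIUC evaluation code.

```python
import statistics
from collections import defaultdict

def tdiuc_metrics(records):
    """records: iterable of (question_type, ground_truth, prediction) triples.
    Returns (A-MPT, H-MPT, N-MPT) as sketched in the text."""
    per_type = defaultdict(list)
    per_type_answer = defaultdict(lambda: defaultdict(list))
    for qtype, gt, pred in records:
        correct = float(gt == pred)
        per_type[qtype].append(correct)
        per_type_answer[qtype][gt].append(correct)

    type_acc = {t: statistics.mean(v) for t, v in per_type.items()}
    a_mpt = statistics.mean(type_acc.values())           # arithmetic mean over types
    h_mpt = statistics.harmonic_mean(type_acc.values())  # skewed toward the weakest type
    # N-MPT: accuracy is first averaged per unique answer within each type.
    norm_type_acc = [
        statistics.mean(statistics.mean(v) for v in answers.values())
        for answers in per_type_answer.values()
    ]
    n_mpt = statistics.mean(norm_type_acc)
    return a_mpt, h_mpt, n_mpt
```

A large gap between A-MPT and N-MPT computed this way would indicate, as the text explains, that the model is leaning on the most frequent answers within each question type.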
5) GQA metrics
GQA [54] introduces 5 new metrics for detailed analysis of
the true capability of a model. Consistency checks that the
model provides consistent answers for different questions
based on the same semantics. For example, if the model
answers the question “What is the color of the apple to the
right of the plate?” correctly with red, then it should also
be able to infer the answer to the question “Is the plate to
the left of the red apple?”. Validity checks whether a given
answer is in the question scope, e.g., responding some color,
and not some size, to a color question. Plausibility score
goes a step further measuring whether the answer makes
sense given the question. For example, red and green are
plausible apple colors and conversely purple is implausible.
Distribution measures the overall match between the true
answer distribution and the predicted distribution. It allows
us to see if the model has generalized to rarer answers.
For attention-based models, Grounding checks whether the
model attends to regions within the image that are relevant
to the question. This metric evaluates the degree to which
the model grounds its reasoning in the image rather than just
making educated guesses.
III. ALGORITHMS
Most VQA algorithms follow a basic structure:
• Image representation
• Question representation
• Fusion and/or attention
• Answering
A. IMAGE REPRESENTATION
Image representation in VQA is done using either pre-trained
CNNs (convolutional neural networks) or object detectors.
Some examples of pre-trained CNNs that have been used
are GoogLeNet [116], VGGNet [110], ResNet [46]. Among
these, ResNet seems to consistently give higher accuracies
than other CNNs. For detecting object-regions, the primary
choice has been Faster R-CNN [102]. Mask R-CNN [45] is a
more recent detector that seems to be more robust and better
at detection than Faster R-CNN.
Pre-trained CNNs like ResNet have been pre-trained on
millions of images from large databases like ImageNet [32].
Images from these datasets were also used to construct most
general VQA datasets which makes transfer learning possi-
ble. When given an input image, a CNN goes through several
convolution and pooling layers to produce a C × W × H
shaped output. Here, W and H refer to the modified width
and height of the image and C refers to the number of
convolution filters. Each convolution filter can be thought of
as detecting a certain pattern.
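As an illustration, the following sketch extracts such a grid of features with a pre-trained ResNet from torchvision (assuming a recent torchvision release; example.jpg is a hypothetical input file). A 448 × 448 input with a stride-32 backbone yields a 14 × 14 grid of 2048-dimensional vectors.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Take a ResNet-152 pre-trained on ImageNet and drop the pooling/classifier
# head, keeping the final convolutional feature map as the grid features.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((448, 448)),          # 448x448 input -> 14x14 grid after the stride-32 backbone
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # hypothetical image path
with torch.no_grad():
    features = backbone(image)     # shape: (1, 2048, 14, 14), i.e. a C x 14 x 14 grid
grid = features.flatten(2).transpose(1, 2)  # (1, 196, 2048): one 2048-d vector per grid cell
```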
Object detectors work a bit differently. They produce mul-
tiple bounding boxes. Each bounding box usually contains an
object belonging to a specific object class. Here “object” can
mean things like sky, cloud, water which aren’t considered
objects in the traditional sense. Faster R-CNN produces
rectangular bounding boxes which means an object doesn’t
always fit inside its box. Object detection also suffers from
overlapping issues. The same object can be part of multiple
bounding boxes all overlapping each other. For example,
consider an image containing a person wearing a shirt.
The shirt can be part of two overlapping bounding boxes,
the bounding box containing ’shirt’ and the bounding box
containing ’person’. Bounding boxes of different objects can
also overlap each other especially if one object occludes an-
other object behind it. Mask R-CNN [45] produces polygonal
masks which reduces these problems. Most VQA algorithms
using object detectors typically use the top 24-32 predicted
objects from the image.
FIGURE 10: Basic approach of VQA algorithms
FIGURE 11: CNN (left) and Faster R-CNN (right). Figure
from [7]
According to the method used, image representation can be
divided into grid-based and object-based approaches. CNNs
divide any image into a uniform grid (typically 14 × 14)
and object detectors divide the image into multiple object
regions (typically 32). A uniform grid structure means some-
times an object can overlap into several adjacent cells. Ir-
relevant noise can be an issue as the whole image is used
but this can be mitigated through attention mechanisms.
Object detectors break the image down into salient objects
which may provide a stronger training signal as the scene
is organised into meaningful units. But they ignore all area
outside of the predicted bounding boxes which may result in
information loss. Visual information about broader semantics,
such as scene and sentiment, is also lost.
FIGURE 12: Mask R-CNN. Figure from [45]
Which method of visual representation is ideal is a topic of
major debate in VQA. Earlier models universally used pre-
trained CNN features. BUTD [7] was the first major VQA
model to utilize object detection. Since BUTD, most models
have opted to use Faster R-CNN pre-trained on the Visual
Genome object detection task. But recently [59] have pre-
sented findings arguing for using grid features. Like BUTD,
[59] also pre-trains Faster R-CNN for VG object detection
but unlike BUTD, they discard region prediction during the
inference step and directly use the grid features from the last
convolution layer. They identify that pre-training for object
detection is an important step that incorporates richer visual
understanding, but extracting object features is not essential
for answering visual questions.
B. QUESTION REPRESENTATION
Question representation in VQA is usually done by first
embedding individual words and then using an RNN or a
CNN to produce an embedding of the entire question.
Some earlier models used one-hot vectors and BOW (bag-of-words)
representations to embed individual words. Then they either
used them directly or fed them to an RNN or CNN. Later
models used a pre-trained word embedding model such as
Word2Vec [84] or GloVe [93]. Word2Vec was built using
skip-gram and negative sampling. It produces similar vectors
for words with similar semantic meanings. GloVe uses word-
word co-occurrence probability on the basis that words that
occur frequently together have a higher probability of being
semantically connected. More recent works have utilized lan-
guage models such as ELMo [95], BERT [33], and GPT [97]
[98] [17] which have set new state-of-the-art performances
across a range of NLP tasks. These models are pre-trained on
a huge amount of natural language data and can be expected
to produce sufficiently useful representations for the VQA
task.
After getting the word embeddings, they are usually passed
to an RNN such as a GRU [27], an LSTM [47], a bi-LSTM, or a
bi-GRU. Another option is to also use a CNN for question
representation. Some models skip the word embedding and
directly embed the whole question using a sentence embed-
ding model like Skip-thought [65].
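A minimal sketch of this common pipeline is given below; the vocabulary size and dimensions are arbitrary, and the embedding matrix is randomly initialized where a real model would load pre-trained GloVe or Word2Vec vectors.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embed word indices, run an LSTM, and keep the final hidden state as the
    question representation. In practice the embedding matrix would be
    initialized with pre-trained GloVe/Word2Vec vectors."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)
        return hidden[-1]                      # (batch, hidden_dim)

encoder = QuestionEncoder()
question = torch.randint(1, 10000, (2, 14))    # two padded questions of 14 tokens
q_features = encoder(question)                 # (2, 1024)
```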
In terms of which RNN model works best for VQA, there
have not been any conclusive results favoring any particular
model. Another question to ask is “Does RNN work better
than CNNs?”. Although RNNs usually outperform CNNs in
many NLP-related tasks, the same cannot be definitively said
for VQA. [132] performs a detailed comparison between
RNN and CNN for VQA. They conclude that CNN is better
suited for VQA than RNN because of the limited length of
question sentences which prevents RNNs from using their
full potential. Understanding a question properly requires
identifying keywords which CNN does better as it treats indi-
vidual words as features. We think that comparison between
CNN and RNN for VQA is something that should be looked
into further.
C. FUSION/ATTENTION
For the network to predict answers, it has to jointly reason
over both question and image. The interaction of the visual
and textual domain in VQA is either done directly through
multimodal fusion or indirectly through attention mecha-
nisms. We will talk about both methods in the upcoming
sections.
D. ANSWERING
For open-ended VQA, answers are predicted either through
a generative approach or a non-generative approach. In the
non-generative approach, all or the most frequent unique
answers in the dataset are set as individual classes. This
approach is easier to evaluate and implement but has the
obvious shortcoming of not being able to predict answers
not seen during training. Another option is to generate the
answer word by word using an RNN but there are no good
methods of evaluating the generated answers which means
almost all models adopt the non-generative approach. For
multiple-choice VQA, answer prediction is treated as a rank-
ing problem where each question, image and answer trio gets
a score and the answer with the highest score is selected.
There are two types of non-generative approaches: single-
label classification and multi-label regression. In single-
label classification, models output a probability distribution
over the answer classes using softmax. In this approach,
models are trained to maximise probability for a single
correct answer, usually the ground truth answer most an-
notators agree on. It can be argued that this single-label
approach results in a poorer training signal and a multi-
label regressive approach fares better. Most VQA questions
have multiple correct answers, so it makes sense to produce
multiple labels. The most popular VQA dataset, VQA-v1,
actually provides multiple ground truth answers. The answers
are assigned soft accuracies according to the metric
min((# humans that provided that answer) / 3, 1). This makes
evaluating multiple correct answers possible as answers with fewer
annotators still get a partial score.
For a long time, VQA models used single-label classi-
fication. BUTD [7] was the first model to use the answer
scores from VQA-v1 as soft targets and cast the answering
task as a regression of scores for candidate answers instead
of traditional classification. BUTD argues that this method
is much more effective than simple softmax classification
because it allows for producing multiple answers and mod-
els the distribution of human provided annotations better.
Following BUTD, most recent models use the multi-label
regression approach.
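A minimal sketch of this soft-target formulation is shown below, using a binary cross-entropy loss over candidate answers; the answer indices and scores are made up for illustration, and this is a sketch of the approach described above rather than BUTD's exact implementation.

```python
import torch
import torch.nn as nn

num_answers = 3000
logits = torch.randn(2, num_answers, requires_grad=True)  # model scores for 2 questions

# Soft targets: each candidate answer gets min(#annotators/3, 1) as its score,
# so several answers can be (partially) correct for the same question.
targets = torch.zeros(2, num_answers)
targets[0, 42] = 1.0       # hypothetical answer given by >= 3 annotators
targets[0, 7] = 1 / 3      # a rarer answer given by one annotator
targets[1, 100] = 2 / 3

loss = nn.BCEWithLogitsLoss()(logits, targets)  # regression of answer scores
loss.backward()
```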
Almost all models represent answers as one-hot vectors
for easier implementation. But this handicaps semantic un-
derstanding as there is now a gap between how the model
interprets the question and how it interprets the answer,
despite both coming from the same semantic space. For
example, for a question with the correct answer of “dog”,
most models will penalize “cat” and “German shepherd”
equally even though “German shepherd” comes closer to the
correct answer. Some works have tried to close this gap by
projecting the answers into the same semantic space as the
questions. Just like the questions, the answers are turned into
vectors and the task of answering becomes regressing the
answer vector. This way the model can have a more accurate
view of the answer space where “German shepherd” is closer
to “dog” than “cat” resulting in a better training signal and a
more nuanced understanding.
IV. MULTIMODAL FUSION
In order to perform joint reasoning on a QA pair, information
from the two modalities has to mix and interact. This can
be achieved by multimodal fusion. We divide fusion in VQA
into two types: vector operations and bilinear pooling.
A. VECTOR OPERATION
Early VQA models used vector addition, multiplication and
concatenation to perform fusion. They are relatively simple
in terms of complexity and computational cost. In vector
addition and multiplication, question and image features are
projected linearly through fully-connected layers to match
their dimensions. Although vector operations are simple to
implement, they do not result in good accuracy. Among the
three vector operations, concatenation usually has the worst
performance while multiplication seems to fare the best.
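The three operations can be sketched as follows; the feature dimensions are arbitrary and the linear layers stand in for the fully-connected projections mentioned above.

```python
import torch
import torch.nn as nn

d_img, d_q, d_joint = 2048, 1024, 1024
img = torch.randn(8, d_img)             # batch of image features
question = torch.randn(8, d_q)          # batch of question features

# Project both modalities to a common dimension before fusing.
proj_img = nn.Linear(d_img, d_joint)
proj_q = nn.Linear(d_q, d_joint)
v, q = proj_img(img), proj_q(question)

fused_add = v + q                       # element-wise addition
fused_mul = v * q                       # element-wise (Hadamard) product
fused_cat = torch.cat([v, q], dim=1)    # concatenation (dimension doubles)
```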
B. BILINEAR POOLING
Bilinear pooling [119] computes the outer product of the
question and image feature vectors. The outer product of
two vectors gives a complete representation of all possible
interactions between the elements of the vectors. If z_i is the
i-th output (with bias terms omitted), then
z_i = x^T W_i y    (2)
where x ∈ R^m, y ∈ R^n, z ∈ R^o, and W ∈ R^{m×n×o}. But in VQA,
directly computing the outer product is infeasible. If both
image and question feature vectors have dimensions of 2048
and there are 3000 answer classes, we would need to com-
pute 12.5 billion parameters [37]. This is computationally
expensive and may lead to overfitting. The models we discuss
try to control the number of parameters by imposing various
restrictions. Ultimately, we will see that these models are just
different approaches towards the trade-off between express-
ibility and complexity or trainability.
It has been proven in [96] that the count sketch of the outer
product of two vectors can be expressed as the convolution
of count sketches of the two vectors,
Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s)    (3)
Here, Ψ is the count sketch, ⊗ is the outer product, ∗ is the
convolution operator, and h and s are parameters randomly
sampled by the model. MCB [37] utilizes the above approach
but replaces the convolution with a more efficient element-
wise product in FFT(Fast Fourier Transform) space. In this
way, MCB indirectly computes the outer product.
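A rough sketch of this count-sketch/FFT trick is given below; unlike the actual MCB layer, the hash and sign vectors here are drawn inside the function rather than sampled once and fixed, and per-location spatial pooling is omitted.

```python
import torch

def count_sketch(x, h, s, d):
    """Project x of shape (batch, n) to (batch, d) using hash indices h and signs s."""
    sketch = torch.zeros(x.size(0), d, device=x.device)
    sketch.index_add_(1, h, x * s)        # sketch[:, h[i]] += s[i] * x[:, i]
    return sketch

def mcb(x, y, d=16000):
    """Compact bilinear pooling: circular convolution of count sketches via FFT.
    In MCB, h and s would be sampled once and kept fixed across calls."""
    hx = torch.randint(0, d, (x.size(1),))
    hy = torch.randint(0, d, (y.size(1),))
    sx = (torch.randint(0, 2, (x.size(1),)) * 2 - 1).float()
    sy = (torch.randint(0, 2, (y.size(1),)) * 2 - 1).float()
    fx = torch.fft.rfft(count_sketch(x, hx, sx, d))
    fy = torch.fft.rfft(count_sketch(y, hy, sy, d))
    return torch.fft.irfft(fx * fy, n=d)  # (batch, d) joint representation

img_feat = torch.randn(4, 2048)
q_feat = torch.randn(4, 2048)
joint = mcb(img_feat, q_feat)             # (4, 16000)
```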
MLB [64] argues that MCB still has too many parameters.
MCB still needs to maintain a large number of parameters to
overcome the bias that results from h and s being fixed. MLB
tries to mitigate this by decomposing W_i into W_i = U_i V_i^T,
which results in
z_i = 1^T (U_i^T x ∘ V_i^T y)    (4)
where ∘ is the Hadamard product. Here, U_i ∈ R^{m×k} and
V_i ∈ R^{n×k}. This decomposition reduces the number of parameters
by restricting the rank of W_i to be at most k, where
k ≤ min(m, n). To further reduce the order of the weight
tensors by one, MLB replaces the unit tensor with another
matrix P_i ∈ R^{k×c} to get
z_i = P_i^T (U_i^T x ∘ V_i^T y)    (5)
k and c are hyperparameters that decide the dimension of
the joint embedding and the dimension of the model output,
respectively. But MLB has problems too: it takes many
iterations to converge and it is sensitive to hyperparameters.
MFB [148] modifies MLB by replacing U ∈ R^{m×k×o} and
V ∈ R^{n×k×o} with U′ ∈ R^{m×ko} and V′ ∈ R^{n×ko} and then
performing
z = SumPool(U′^T x ∘ V′^T y, k)    (6)
where the function SumPool(X, k) means using a 1-D non-
overlapping window of size k to perform sum pooling over
X. The authors show that MLB is a special case of MFB with
k = 1. They also introduce MFH, which can perform higher-
dimensional pooling by stacking multiple MFBs.
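The following sketch illustrates Eq. (6), together with the power and L2 normalization commonly applied in MFB implementations; the dimensions and the factor k are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MFB(nn.Module):
    """Multi-modal factorized bilinear pooling: project both modalities to a
    k*o-dimensional space, take the Hadamard product, then sum-pool over
    non-overlapping windows of size k."""
    def __init__(self, d_img=2048, d_q=1024, k=5, o=1000):
        super().__init__()
        self.k, self.o = k, o
        self.U = nn.Linear(d_img, k * o, bias=False)
        self.V = nn.Linear(d_q, k * o, bias=False)

    def forward(self, img, q):
        joint = self.U(img) * self.V(q)                     # (batch, k*o)
        joint = joint.view(-1, self.o, self.k).sum(dim=2)   # sum pooling with window k
        # Power and L2 normalization, as commonly used after MFB pooling.
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
        return nn.functional.normalize(joint, dim=1)

mfb = MFB()
out = mfb(torch.randn(4, 2048), torch.randn(4, 1024))       # (4, 1000)
```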
MUTAN [13] uses Tucker decomposition [124] to decompose
W into
W = τ_c ×_1 W_q ×_2 W_v ×_3 W_o    (7)
Each of the four components has a specific role in the
modeling. The matrices W_q and W_v project the question and the
image vector into dimensions t_q and t_v, respectively. These
dimensions directly impact the complexity of modeling for
each modality. The core tensor τ_c controls the complexity of the
interactions between the projections produced by W_q and W_v.
The number of parameters can be controlled by imposing
constraints on the rank of τ_c. Finally, the matrix W_o scores
each pair of embeddings for each class in the answer set. It can
be shown that MCB and MLB are special cases of MUTAN.
Finally, there is BLOCK [14] which uses block-term de-
composition. BLOCK aims to achieve a balance between
MLB and MUTAN. MLB can be thought of as having the
same number of blocks as the rank, each block being of
size (1,1,1). This means that projections of both modalities
can be very high dimensional but interactions between them
will be poor. In comparison, MUTAN uses only one block
of size (L, M, N), which means better interactions between
projections but the projections are modeled less accurately.
BLOCK takes a middle approach between these methods and
performs better.
V. ATTENTION
Attention is the mechanism most widely used by VQA
models to improve accuracy. When we are shown an image
and asked a question about that image, we usually focus on
the particular region of the image that helps us answer the
question. VQA models try to replicate this through attention.
This improves accuracy because a lot of irrelevant noise is
filtered out and the model can focus better on the parts of the
image that are actually relevant for answering the question.
There are many different types of attention which we discuss
below.
A. SOFT AND HARD ATTENTION
The first aspect of note is soft and hard attention. Most
models use the question to produce an attention map over
the image. This map assigns different weights to different
regions in the image according to their perceived relevance
in answering the question. In soft attention, every region is
assigned weights with more weight given to more relevant
areas. In hard attention, only regions which are deemed sufficiently
relevant are selected for further consideration while
all other regions are completely ignored. So, to compare,
hard attention will completely ignore some areas of the image
while soft attention will pay less attention to those areas. The
selected regions in hard attention may go through a further
soft attention stage.
Model | VQA-v1 | VQA-v2
MLB [64] | 65.1 | 66.62
MFB [148] | 65.8 | -
MCB [37] | 66.5 | 62.27
MUTAN [13] | 67.36 | 66.38
BLOCK [14] | - | 67.92
TABLE 5: Accuracy of different fusion models.
FIGURE 13: First row presents the original images. Second row presents grid-based soft attention using CNNs. Third row presents object-based hard attention using object detectors. Figure from [79]
B. GRID AND OBJECT BASED ATTENTION
Attention can also be divided into grid and object based
approaches. In the grid based approach, the whole image
is divided into a uniform grid where each cell is the same
size. Almost all grid based approaches use soft attention.
As the image is divided into regions of same size, many
regions contain multiple objects. Using hard attention in this
case may erase information about multiple objects which
could result in significant information loss. In object based
approaches, object detectors are used to divide the image
into multiple objects. This falls under hard attention as image
area that is not contained by the bounding box of any object
is completely ignored. This is why some object based ap-
proaches also provide the whole image to the model to ensure
access to all information.
Some recent models [79] [108] [36] try to balance between
grid and object based attention. They argue that some ques-
tions benefit more from object level attention and some ben-
efit more from a grid based approach. These models extract
image features from both grid-based CNN and object-based
Faster R-CNN. The question is then used to assign weights to
the features. In this way, the question itself determines which
image feature should be used to answer the question.
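The sketch below shows a generic question-guided soft attention layer over a set of region features (grid cells or detected objects); it is a simplified illustration rather than any specific model's attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Soft attention: the question scores every image region (grid cell or
    detected object) and the weighted sum becomes the attended image feature."""
    def __init__(self, d_v=2048, d_q=1024, d_hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(d_v, d_hidden)
        self.proj_q = nn.Linear(d_q, d_hidden)
        self.score = nn.Linear(d_hidden, 1)

    def forward(self, regions, question):
        # regions: (batch, num_regions, d_v), question: (batch, d_q)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (batch, num_regions)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)     # (batch, d_v)
        return attended, weights

attend = QuestionGuidedAttention()
v, q = torch.randn(2, 36, 2048), torch.randn(2, 1024)
attended, weights = attend(v, q)    # weights sum to 1 over regions for each image
```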
C. BOTTOM-UP AND TOP-DOWN ATTENTION
Another important concept is bottom-up and top-down at-
tention. This is related to how human brains process vision.
In the human visual system, attention can be focused voli-
tionally by top-down signals determined by the current task,
e.g., looking for something, and automatically by bottom-up
signals associated with unexpected, novel, or salient stimuli
[18] [29]. Similarly in the context of VQA, a top-down ap-
proach means considering the question first and then looking
at the image while a bottom-up approach implies looking at
the image first and then considering the question. BUTD [7]
argues that both types of attention are useful. BUTD first
applies bottom-up attention using Faster R-CNN to detect
the most salient objects. This is equivalent to just looking
at an image and noting all the prominent objects. BUTD then
weighs these objects according to the question. This is top-
down attention as visual information is being processed after
considering the question.
Soft, grid-based, top-down attention | SAN [142], SMem [140], QAM [23], MRN [63], RAU [90], VQA-Machine [131], HieCoAtt [78], DAN [86], MLAN [145], HOA [103], GVQA [5], DCN [89]
Hard, object-based, bottom-up attention | Foc-reg [109], FDA [58], BUTD [7], AOA [112], CVA [113], FEA [75], MuRel [19], DFAF [38], MCAN [147]
Both | Dual-MFA [79], QTA [108], DRAU [92]
TABLE 6: Models with different attention schemes.
Single-step attention | Foc-reg [109], QAM [23], FDA [58], VQA-Machine [131], HieCoAtt [78], MLAN [145], BUTD [7], HOA [103], AOA [112], CVA [113], QTA [108]
Multi-step attention | SAN [142], SMem [140], MRN [63], RAU [90], DAN [86], GVQA [5], Dual-MFA [79], DCN [89], FEA [75], MuRel [19], DFAF [38], MCAN [147], DRAU [92]
TABLE 7: Models using single and multi-step attention.
D. SINGLE AND MULTI-STEP ATTENTION
FIGURE 14: Multi-step attention through multiple attention
layers in SAN. Figure from [142]
Attention can be single or multi-step. This is also referred
to as “glimpse”. A model using multiple glimpses means
multiple attention layers have been cascaded one after an-
other. Some questions are simple and only require a single
attention step to answer. For example, “What color is the
dog?” only requires a single attention step to detect the
dog. But many questions require multiple reasoning steps
which may also require applying attention multiple times. For
example, “What color is the dog standing beside the chair
behind the counter?”. This question requires three attention
steps. The first one to identify the counter, the second to
identify the chair behind the counter, and the third to identify
the dog beside the chair. So we see that multi-step attention
can lead to improved reasoning skills.
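A simplified multi-glimpse layer, loosely following the stacked attention idea of SAN, is sketched below; the query-update rule and dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttention(nn.Module):
    """SAN-style multi-glimpse attention: after each attention step, the
    attended visual feature is folded back into the query so the next step
    can focus on a different part of the reasoning chain."""
    def __init__(self, d=1024, num_glimpses=2):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.proj_v = nn.Linear(d, d)
        self.proj_q = nn.Linear(d, d)
        self.score = nn.Linear(d, 1)

    def forward(self, regions, question):
        # regions: (batch, num_regions, d), question: (batch, d)
        query = question
        for _ in range(self.num_glimpses):
            joint = torch.tanh(self.proj_v(regions) + self.proj_q(query).unsqueeze(1))
            weights = F.softmax(self.score(joint).squeeze(-1), dim=1)
            attended = (weights.unsqueeze(-1) * regions).sum(dim=1)
            query = query + attended        # refined query for the next glimpse
        return query

san = StackedAttention()
answer_query = san(torch.randn(2, 36, 1024), torch.randn(2, 1024))  # (2, 1024)
```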
E. CO-ATTENTION AND SELF-ATTENTION
Until now we have only discussed image attention, but what
about attending to the question? HieCoAtt [78] was the first
model to seriously explore question attention. They argued
that just as image attention can filter out noise from the
image, question attention can also reduce unwanted features.
Just as the question can guide attention on the image, the
image can also guide attention on the question. This is
referred to as co-attention. Co-attention can take the image
into context and apply weights to question words according
to how relevant they are given that particular image.
In co-attention, the image and the question attend to each other, but
it is also possible for each modality to apply attention to
itself. This is aptly called self-attention. Self-attention can be
applied to the question because not all words in a question
are of equal importance. For example, consider the question
“What color is the dog?”. Here the question can be shortened
to “What color dog” without losing any significant meaning.
“is” and “the” can be considered noise which do not con-
tribute to the model’s understanding of the question. Self-
attention can also be applied to the image. Self-attention for
image determines which region or object is important after
taking into consideration all other regions and objects. For
example, consider an image of a sheep-dog guarding a herd
of sheep. Here the individual sheep are less important and
the herd is more important as a whole. So self-attention can
modify the weights such that the combined weight of all
the sheep is equal to the weight of the dog which basically
embodies the main relation in the image which is “A dog
guarding a herd of sheep”.
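All of these variants can be expressed with the same generic attention unit, sketched below with scaled dot-product attention; feeding queries and keys from the same modality gives self-attention, while crossing the modalities gives co-attention. This is a simplification of what models like DFAF and MCAN actually do.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product(query, key, value):
    """Generic attention unit: each query attends over the keys and returns a
    weighted sum of the values."""
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ value

regions = torch.randn(2, 36, 512)   # image region features
words = torch.randn(2, 14, 512)     # question word features

img_self = scaled_dot_product(regions, regions, regions)          # image self-attention
q_self = scaled_dot_product(words, words, words)                  # question self-attention
img_attended_by_q = scaled_dot_product(words, regions, regions)   # image attended by question words
q_attended_by_img = scaled_dot_product(regions, words, words)     # question attended by image regions
```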
VI. EXTERNAL KNOWLEDGE
Sometimes the information required to answer a question can
not be found in the given image. For example, the question
“What do the two animals in the image have in common?” for
an image containing a giraffe and a lion cannot be answered
from the image alone. To correctly answer this question,
i.e., they are both mammals, the model has to have an idea
about the outside concept of “mammal”. As conventional
VQA models have no way of gathering knowledge about
unseen concepts, they will guess blindly from the answer
space of known concepts. Most VQA models have limited
knowledge of real-world concepts because the datasets they
are trained on are restricted in their size and cannot reflect
every combination of every concept. A simple way to solve
this problem is to give the model the capability to query an
External Knowledge Base or EKB.
FIGURE 15: Co-attention where both image and question guide attention on each other. The rows present co-attention for two different images while the question is the same. The columns present co-attention for the same image while the questions are different. Figure from [89]
Co-attention (image by question) | Foc-reg [109], SAN [142], SMem [140], QAM [23], FDA [58], MRN [63], VQA-Machine [131], HieCoAtt [78], DAN [86], MLAN [145], GVQA [5], Dual-MFA [79], AOA [112], DCN [89], FEA [75], MuRel [19], DFAF [38], DRAU [92], MCAN [147]
Co-attention (question by image) | HieCoAtt [78], DAN [86], BUTD [7], DCN [89], DFAF [38], DRAU [92], MCAN [147]
Self-attention (image) | DFAF [38], MCAN [147]
Self-attention (question) | DFAF [38], MCAN [147]
TABLE 8: Models using co-attention and self-attention.
EKBs are structured representations of real-world knowl-
edge. Construction, organization, and querying of these
knowledge bases are problems that have been studied for
years. This has led to the development of large-scale KBs
constructed by human annotation, e.g., DBpedia [11], Free-
base [16], and Wikidata [127], and KBs constructed by automatic
extraction from unstructured or semi-structured data, e.g.,
YAGO [48] [80], OpenIE [12] [34] [35], NELL [22], NEIL
[25], WebChild [118], ConceptNet [76]. In structured KBs,
knowledge is typically represented by a large number of
triples of the form (arg1, rel, arg2). “arg1” and “arg2” denote
two Concepts in the KB, each describing a concrete or
abstract entity with specific characteristics. “rel” represents
a specific Relationship between them. A collection of such
triples forms a large interlinked graph. Information can be
retrieved by using an SQL-like query language. Queries usu-
ally return one or multiple sub-graphs satisfying the specified
conditions.
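The toy sketch below illustrates the (arg1, rel, arg2) representation and pattern-style querying; real EKBs such as DBpedia or ConceptNet are orders of magnitude larger and are accessed through dedicated query languages such as SPARQL, and the facts here are made up for illustration.

```python
# Toy (arg1, rel, arg2) knowledge base with pattern queries. Real EKBs are
# queried through dedicated query languages (e.g., SPARQL), not like this.
triples = [
    ("cat", "CapableOf", "climbing trees"),
    ("cat", "IsA", "mammal"),
    ("giraffe", "IsA", "mammal"),
    ("lion", "IsA", "mammal"),
    ("pepperoni", "MadeOf", "meat"),
]

def query(arg1=None, rel=None, arg2=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (arg1 is None or t[0] == arg1)
        and (rel is None or t[1] == rel)
        and (arg2 is None or t[2] == arg2)
    ]

# "What do the two animals have in common?" -> intersect their IsA facts.
giraffe_types = {t[2] for t in query(arg1="giraffe", rel="IsA")}
lion_types = {t[2] for t in query(arg1="lion", rel="IsA")}
print(giraffe_types & lion_types)   # {'mammal'}
```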
EKB models usually try to map image concepts and
question concepts to their equivalent KB concepts as extra
information about those concepts is likely to be helpful in
the answering process. Some models [130], [129] do this by
building an image-specific knowledge graph. Visual concepts
such as objects, scenes, attributes, actions, etc., are detected
in the image and used to build a scene graph. A knowledge
sub-graph is produced by querying the knowledge base with
the detected image concepts. The two graphs are combined
by linking the visual concepts with their corresponding KB
concepts. This combined graph is the image-specific knowl-
edge graph that integrates additional information about vari-
ous image content. Along with visual concepts, KDMN [67]
also uses keywords from the question to query the knowledge
base. Thus, its knowledge graph contains question concepts
along with visual concepts. ACK [137] uses the image to
predict top-5 attributes and uses them to query DBpedia
which returns five knowledge paragraphs. STTF [88] and
Model VQA-v1 VQA-v2
SAN [142] 58.9 -
FDA [58] 59.5 -
MRN [63] 61.8 -
HieCoAtt [78] 62.1 -
RAU [90] 63.2 -
DAN [86] 64.2 -
MLAN [145] 64.8 -
Dual-MFA [79] 66.09 -
BUTD [7] - 65.67
DCN [89] 67.02 67.04
DRAU [92] 67.16 66.85
MuRel [19] - 68.41
DFAF [38] - 70.34
MCAN [147] - 70.90
TABLE 9: Accuracy of different attention models.
FIGURE 16: Example of an image-question pair that needs
external knowledge to answer. Figure adapted from [138]
OOTB [87] extract visual concepts from the image and use
the question to filter relevant facts from a facts database.
STTF then selects a top fact by a ranking mechanism and uses
this to answer the question whereas OOTB [87] constructs a
graph and uses graph convolution to produce the final answer.
Although there has been significant work in integrating
external knowledge in VQA, EKB models still suffer from
FIGURE 17: An example structure of an EKB. Figure from
[129]
many issues. Some of these issues stem from the difficulty
of interacting with most KBs. KBs cannot be queried with
natural language questions and need specific query formats.
There are only a limited number of ways in which a KB can
be queried and any question we want to ask the KB must be
reducible to one of the available query templates. Parsing the
image and the question to produce appropriate queries is still
an error-prone process that many EKB models struggle with.
Another issue is with integrating knowledge graphs. They are
symbolic in nature which makes it difficult to train an EKB
model in an end-to-end differentiable manner. Some models
work around this by converting the knowledge graph into a
continuous vector representation. For example, KDMN [67]
treats each knowledge triple as a three-word SVO(subject-
verb-object) phrase. It embeds each triple into a common
feature space by feeding its word embeddings through an
RNN architecture. In this case, the external knowledge is
embedded in the same semantic space as the question and
the answer. Thus, the model can be trained in an end-to-end
manner.
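A minimal sketch of this idea is shown below, assuming a GRU encoder and an illustrative vocabulary and dimensions; none of these choices are taken from KDMN's actual implementation.

```python
import torch
import torch.nn as nn

# Sketch: embed a (subject, verb, object) knowledge triple as a three-word
# phrase with a GRU so that it lives in the same continuous space as the
# question and answer embeddings and gradients can flow end to end.
# The vocabulary, dimensions, and layer choices are illustrative assumptions.
vocab = {"dog": 0, "is_a": 1, "mammal": 2}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=64)
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

triple = ["dog", "is_a", "mammal"]
ids = torch.tensor([[vocab[w] for w in triple]])   # shape (1, 3)
word_vecs = embed(ids)                             # (1, 3, 64)
_, h_n = gru(word_vecs)                            # h_n: (1, 1, 128)
triple_feature = h_n.squeeze(0)                    # (1, 128) continuous triple embedding
```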
VII. COMPOSITIONAL REASONING
Most VQA models do not possess significant reasoning
skills. They show good performance on general VQA
datasets, which lack complex questions. They fail on datasets
like CLEVR, which was designed specifically to test a
model’s high-level reasoning skills. To correctly reason about
complex questions, a model needs to be compositional in
nature. By composition, we refer to the ability to break a
question down into individual reasoning steps which, when
performed sequentially, produce the correct answer. An example
of such a question would be “How many large cubes are be-
hind the purple cylinder?”. Answering this question requires
detecting the purple cylinder, then filtering the objects behind
it, then filtering again to find objects that are both large and
cube-shaped, and finally counting them. In contrast, most
VQA models are trained to answer simple questions which
usually require a single visual verification step. We divide
compositional models into two camps: 1) PGE(Program generator and executor) models and 2) Blackbox models.
A. PGE MODELS
PGE models consist of two parts, program generation and
program execution. In the first part, the input question is
parsed to produce a program. This program is executed in the
second part to obtain the answer. Most complex questions can
be mapped to a dependency tree where each node represents
a single reasoning step. The correct answer can be found by
traversing the tree and processing each node appropriately.
After a question has been mapped to a valid tree, the next
step is to transform this tree into an executable program
by using neural modules. PGE models have a toolbox of
neural modules where each module can perform a specific
reasoning task such as a “find” module that can locate an
object/concept in the image. A neural module is plugged
into each node of the tree according to the task that node
represents. It is important to note here that these modules
are not hard-coded but their parameters are trained just like
a neural network. In the next step, the complete network
is dynamically assembled by laying out the modules in the
tree. The answer is predicted by feeding the inputs to this
network. Due to their modular nature, PGE models offer
greater transparency and interpretability. As the inference
process follows a chain of discrete steps, reasoning failures
can be pinpointed to a particular node which is not possible
in a traditional end-to-end neural network. This can also be
leveraged to produce explanations in order to validate sound
reasoning.
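The sketch below illustrates this assembly process with toy "find", "filter", and "count" modules; the module definitions and the layout are hypothetical placeholders meant only to show how a per-question layout wires trainable modules into an executable network.

```python
import torch
import torch.nn as nn

# Sketch of PGE-style dynamic assembly: a toolbox of trainable modules is
# wired together according to a per-question layout and executed in order.
# The modules below are toy placeholders, not any published model's code.
class Find(nn.Module):
    """Locate a concept: produce an attention map over image regions."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, att):
        return torch.softmax(self.score(regions).squeeze(-1), dim=-1)

class Filter(nn.Module):
    """Refine an incoming attention map with a second concept."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, att):
        scores = self.score(regions).squeeze(-1) + torch.log(att + 1e-8)
        return torch.softmax(scores, dim=-1)

class Count(nn.Module):
    """Map the final attention map to answer logits."""
    def __init__(self, n_regions, n_answers):
        super().__init__()
        self.fc = nn.Linear(n_regions, n_answers)

    def forward(self, regions, att):
        return self.fc(att)

modules = {"find": Find(512), "filter": Filter(512), "count": Count(36, 20)}
layout = ["find", "filter", "count"]       # e.g., parsed from the question
regions = torch.randn(36, 512)             # detected region features

state = None
for name in layout:                        # assemble the network on the fly
    state = modules[name](regions, state)
answer_logits = state                      # shape (20,)
```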
FIGURE 18: An example of a tree parsed from a question.
Figure adapted from [8]
PGE models have to parse the input question into a depen-
dency tree which can be used to produce a module layout.
Earlier models such as NMN [9] and D-NMN [8] used the
Stanford dependency parser to produce the dependency tree.
But due to the parser being error-prone, the selected layout
was often wrong. Later models [50] [49] [107] treated layout
mapping as a sequence-to-sequence learning problem and
replaced off-the-shelf parsers with encoder-decoder models.
At first, models like N2NMN [50] selected layouts discretely,
which was a non-differentiable process. This meant that the
program generation part of the model had to be optimized
using reinforcement learning while the execution part was
trained using gradients. This multi-stage joint training was
difficult as the program generator needed to produce the right
program without understanding what the individual modules
meant, and the execution engine had to produce the right an-
swer from a program that might not accurately represent the
question. Later models like Stack-NMN [49] and XNM [107]
converted layout selection into a soft continuous process so
that the model could be optimized in a fully differentiable
manner using gradient descent. DDRprog [115] solved the
problem of differentiability by interleaving module predic-
tion and generation, thereby producing the layout “on the
fly”. PTGRN [21] solves it by eliminating the need for module prediction, using generic PTGRN modules instead of specialized neural modules. It directly uses the dependency tree produced by the Stanford parser as the program layout, replacing each node with a PTGRN node module and each edge with a PTGRN edge module.
Earlier PGE models could not generalize well to main-
stream VQA datasets as they required program supervision
from layout annotations. More recent models like XNM
do not require expert layouts and have shown competitive
performance on general VQA datasets on par with models
that do not use compositional reasoning. This is significant
considering that non-compositional models have been shown
to utilize various dataset biases while PGE models are less
prone to bias due to their modular architecture.
B. BLACKBOX MODELS
A major setback for many PGE models is that they rely on
program annotations, brittle handcrafted parsers, or expert
Model VQA-v1 VQA-v2 CLEVR
NMN [9] 58.7 - -
D-NMN [8] 59.4 - -
N2NMN [50] - 63.3 83.7
PTGRN [21] - - 95.47
Stack-NMN [49] 66.0 64.0 96.5
IEP [61] - - 96.9
DDRProg [115] - - 98.3
NS-CL [81] - - 98.9
TbD-NET [83] - - 99.1
UnCoRd [126] - - 99.74
NS-VQA [144] - - 99.8
XNM [107] - 67.5 100
TABLE 10: Accuracy of different PGE models.
Model CLEVR
FiLM [94] 97.7
CMM [143] 98.6
MAC [53] 98.9
OCCAM [133] 99.4
TABLE 11: Accuracy of different Blackbox models.
demonstrations and require relatively complex multi-stage
reinforcement learning training schemes. These models are
also limited to using a few handcrafted neural modules each
with a highly rigid structure. This makes many PGE models
inflexible compared to conventional neural network-based
VQA models. Blackbox models try to achieve a balance
between both worlds. Like PGE models, blackbox models
utilize structural priors to a certain extent while retaining the
fluidity and dynamic nature of neural network-based models.
Blackbox models achieve this by using a single highly flex-
ible blackbox module instead of multiple operation-specific
neural modules. Multiple blackbox cells are cascaded one af-
ter another to form a long chain. Because of self-attention and
residual connections between the cells, this linear network is
capable of representing arbitrarily complex acyclic reasoning
graphs in a soft manner. Because of the use of a single
reusable module and the physically sequential nature of the
network, blackbox models manage to circumvent two major
bottlenecks of PGE models, namely, module selection and
layout selection. This allows blackbox models to be trained
in an end-to-end manner using simple backpropagation just
like conventional VQA models instead of the complex multi-
stage training of PGE models. Though multiple instances of
the same cell are used in multiple layers, they only share the
general structure. Through training, different cells in differ-
ent layers learn to perform different reasoning operations.
This allows blackbox models to perform long and complex
chains of reasoning. Blackbox models retain fluidity and
flexibility while maintaining compositional reasoning ability
albeit at the cost of some transparency and interpretability.
MAC [53] consists of a sequence of MAC cells. MAC
follows a controller-memory scheme where strict separation
is maintained between the representation space of the ques-
tion and the image. This is in contrast to the conventional
approach of fusing the question and the image representation
in the same semantic space. Each MAC cell consists of three
units: a) Control; b) Read; and c) Write. They operate on
a control state and a memory state. The control states are
determined by the question and the control states in different
cells are expected to encode a series of reasoning operations.
The read unit extracts the required visual information from
the image guided by the control and memory state. After
the control unit performs its reasoning operation, the write
unit integrates the new information into the memory state.
This architecture allows MAC to divide its operation into
discrete reasoning steps where the question and the image
interact in a controlled way. This makes it easier to interpret
the inner workings of a particular MAC cell resulting in
better transparency. Similar to MAC, FiLM [94] uses the
question to guide its computation over the image. In contrast
to MAC, where the question does not influence the image
feature extraction process, FiLM uses the question to guide
how its visual pipeline extracts features from the image.
FiLM consists of several convolution layers like a typical
CNN. But in FiLM, each convolution layer is paired with a
FiLM layer which controls the feature map computation in
that layer. The parameters of each FiLM layer are determined
by the question. The intuition is that each FiLM layer will
encode the reasoning operation necessary at that stage of the
pipeline. CMM [143] modifies FiLM so that both modalities
are influenced by each other. It argues that, since information
comes from multiple modalities, it is not intuitive to assume
that one modality(language) is the “program generator”, and
the other(image) is the “executor”. One way to avoid making
this assumption is to perform multiple steps of reasoning with
each modality acting as generator and executor alternately in
each step.
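A minimal sketch of the FiLM-style conditioning described above is given below, with assumed dimensions and a toy residual block rather than the published FiLM architecture; the question embedding predicts a per-channel scale and shift for each layer.

```python
import torch
import torch.nn as nn

# Sketch of a FiLM-style conditioning layer: the question embedding predicts
# per-channel scale (gamma) and shift (beta) that modulate a convolutional
# feature map. Dimensions and the residual structure are illustrative.
class FiLMBlock(nn.Module):
    def __init__(self, q_dim=256, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.film = nn.Linear(q_dim, 2 * channels)   # -> [gamma, beta]

    def forward(self, x, q):
        gamma, beta = self.film(q).chunk(2, dim=-1)  # each (B, C)
        h = self.conv(x)
        h = gamma[..., None, None] * h + beta[..., None, None]
        return torch.relu(h) + x                     # residual connection

block = FiLMBlock()
feature_map = torch.randn(2, 64, 14, 14)             # image features
question = torch.randn(2, 256)                       # question embedding
out = block(feature_map, question)                   # (2, 64, 14, 14)
```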
VIII. EXPLANATION
A recent trend in VQA models has been towards explain-
ability. Many have argued that VQA models should be able
to explain their answers, i.e., the precise reasoning steps
the model followed to deduce the answer. Traditional neural
networks act like blackboxes whose inner workings remain
opaque to outside observers. This makes it difficult to verify
if the network is reasoning correctly. The model may be
using dataset biases or merely guessing the answer. There
is no way to be sure that the model is producing the right
answers for the right reasons. A good AI model should be
able to explain its actions and expose the reasoning process
it followed. Forcing models to generate explanations helps build
trust in VQA systems and exposes models that merely rely on
biases and shortcuts. Explanations force models to reveal
their internal process and it can be verified that the correct
answer is being predicted logically. In case of the model
predicting a wrong answer, the precise step where failure
occurred can be pinpointed. Maybe the model failed to under-
stand image content correctly or maybe the model had trouble
understanding the question. These are understanding failures.
It could also be that the model understood the image and the
question properly but failed to follow the correct reasoning
steps. This is a reasoning failure. Explanations help to gain a
better understanding of how the model is actually operating.
So if the model predicts the correct answer, we can make
sure it was because it followed proper logical reasoning.
If the prediction is wrong, we can narrow down the failed
step. Generating explanations acts as a type of regularization
method. It results in better answer prediction as the model
focuses on proper reasoning instead of taking shortcuts.
Explanations can be visual or textual. Visual explanations
FIGURE 19: Example of both visual explanation and textual
explanation. Figure from [135]
are attention heatmaps indicating which image regions the
model focused on for answer prediction. Textual explana-
tions are natural language explanations, usually comprising
multiple sentences which describe key information the model
used in its reasoning process. Earlier explanation models
only produced visual explanations as attention heatmaps are
easy to build using gradient-based methods such as Grad-
CAM [104]. Visual explanations are intuitive and reveal
the areas of the image the model concentrated on. If the
model deduced the right answer but did not look at the right
area, the model’s reasoning is suspect. If the model failed
to answer correctly, we can analyze the attention map to
figure out why the model failed. But visual explanations
alone may not be enough to fully understand how the model
is operating. Many times the model attends to the correct
region but still fails to use the visual information correctly in
its inference process. Textual explanations can provide better
insights and reveal whether the attended visual information
was used correctly. Natural language explanations are also
easier to understand for humans. Textual explanations can
explain the reasoning process step by step which is clearer
than a single attention map. The best approach seems to
be producing both types of explanations. Models generating
such multimodal explanations argue that visual and textual
explanations complement each other. Both fine-tune each
other and improve each other’s quality. Generating natural
language explanations results in better attention maps making
them less noisy, more relevant and focused. On the other
hand, attending to important image regions motivates the
textual explanations to be more relevant, more succinct and
more related to image content.
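The sketch below shows the general recipe behind such gradient-based heatmaps in the spirit of Grad-CAM, applied to a toy network; the single-layer setup and all sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal Grad-CAM-style sketch: the gradient of the predicted answer score
# with respect to a convolutional feature map is pooled into per-channel
# weights, and the weighted sum of feature maps gives a coarse attention
# heatmap. The toy network below is only for illustration.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
head = nn.Linear(8, 10)                              # 10 candidate answers

image = torch.randn(1, 3, 32, 32)
feats = conv(image)                                  # (1, 8, 32, 32)
feats.retain_grad()
score = head(feats.mean(dim=(2, 3)))[0]              # (10,) answer scores
score[score.argmax()].backward()                     # gradient of top answer

weights = feats.grad.mean(dim=(2, 3), keepdim=True)  # (1, 8, 1, 1)
heatmap = F.relu((weights * feats).sum(dim=1))       # (1, 32, 32) visual explanation
```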
[151] uses attention supervision to train the model to look
at relevant image regions and produce an attention map of
the image as the visual explanation. It devises a way to au-
tomatically collect this attention supervision from available
region descriptions and object annotations. [70] divides its
answering process into two parts. In the “explaining” part,
the image is used to detect attributes and produce a caption.
These are used in the “reasoning” part in place of the image
to infer the correct answer. As the model bases its reasoning
on the produced caption and attributes, they can be thought
of as textual explanations explaining what information the
model deduced its answer from. [71] uses human-generated
explanations to supervise its textual explanation generation.
It argues that forcing the model to produce natural language
explanations can help it predict more accurate answers. [39]
uses the attention heatmap to separate relevant parts of the
image scene graph. It then uses this structured information
about the relevant scene entities to produce suitable natural
language explanations. [57] produces multimodal explana-
tions which point to the visual evidence for a decision and
also provides textual justifications. [135] argues for more
faithful multimodal explanations where both visual and tex-
tual explanations ground and regulate each other. Prominent
objects in the attention map should be mentioned in the ex-
planation text and the textual explanation should ground any
object it mentions in the image. In this way, both explanations
calibrate each other, resulting in less noise and more relevant
explanations.
IX. GRAPH MODELS
Graph models use graphs to process information from the
image and the question. Graph models argue that conven-
tional VQA models cannot do complex reasoning with their
monolithic vector representations. Graph models pin the
ability of a model to do higher-level reasoning on its ca-
pacity to perform relational reasoning. Almost any complex
question about an image requires reasoning about its objects
and relationships. For example, a ‘boy’ and a ‘sandwich’
with the relation ‘eating’ encodes an important fact about
an image. Another fact like this could be (‘boy’, ‘next to’,
‘dog’). Now if a VQA model is asked the question ‘What
is the boy next to the dog eating?’, it will be much more
successful in answering the question if it understands the
various relationships that exist among the multiple objects.
As we will see, graph models use graphs to model these
relationships and use them to answer questions.
There can be two types of graphs in a graph model,
an intra-modality graph and an inter-modality graph. Intra-
modality graphs are graphs belonging to a single modality,
either the image or the question. In image graphs, the nodes
represent the objects detected in the image and the edges rep-
resent the relationships between them. In question graphs, the
nodes represent the words and the edges could represent their
syntactic dependencies or some other relationship. Having
two separate graphs means that the image and the question
do not interact with each other but rather their elements
interact within themselves. In this way, intra-modality graphs
can model self-attention or simply produce better image and
question features. On the other hand, inter-modality graphs
model interactions between the two modalities. These inter-
actions could be direct(fusion) or indirect(attention).
Graph-VQA [122] builds two graphs, a question graph and
a scene graph. In the question graph, nodes are word features
and edges are syntactic dependency types as predicted by the
Stanford Dependency Parser. In the scene graph, nodes are
object features and edges are relative spatial positions. LCGN
[51] and TRRNet [141] also use object features as nodes in
their graphs. While LCGN uses all detected objects, TRRNet
selects top k objects using the question. NSM [55] generates
a scene graph where each object node is accompanied by a
bounding box, a mask, dense visual features, and a collection
of discrete probability distributions for each of the object’s
semantic properties, such as its color, material, shape, etc.,
defined over a fixed concept vocabulary. Some graph models
do not represent objects as nodes. Rather, they fuse the object
features with the question [91] [19] [68] or the words [43]
and treat the joint embeddings as individual nodes.
A Graph model usually processes its graphs in multiple
iterations. In each iteration, each node gathers context and
information from its neighbors and updates itself. The edges
could also update alongside the nodes [51] but most models
only implement node updates. Each node has its own neigh-
borhood, where the neighborhood could mean all other nodes
or a subset of them based on some criterion. For example,
the criterion could be that a node’s neighbors are the top k
nodes according to edge weights. The neighborhood selec-
tion criterion is important due to computational concerns. A
fully connected graph where each node considers all other
nodes its neighbors is computation-heavy. On the other hand,
a more choosy selection criterion might leave out relevant
nodes and miss important relations and interactions. In most
graph models like Graph-VQA [122] and NSM [55], each
node considers its adjacent nodes to be neighbors. In CGS
[91], a node’s neighbors correspond to the K most similar
nodes. In other graph models like MuReL [19] and LCGN
[51], the graph is fully connected and each node considers all
other nodes its neighbors.
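The following sketch shows one such update iteration under a top-k neighborhood criterion; the shapes, edge scores, and residual-style update rule are illustrative assumptions rather than any specific model's formulation.

```python
import torch

# Sketch of one iteration of node updates in a graph model: each node gathers
# features from its neighbors (here, the k nodes with the strongest edges to
# it) and updates its own representation. Shapes and the update rule are toy.
n_nodes, d, k = 6, 32, 3
node_feats = torch.randn(n_nodes, d)
edge_weights = torch.rand(n_nodes, n_nodes)          # pairwise edge scores

topk = edge_weights.topk(k, dim=-1)                  # each node's k strongest edges
neighbor_feats = node_feats[topk.indices]            # (n_nodes, k, d)
messages = (topk.values.unsqueeze(-1) * neighbor_feats).sum(dim=1)  # (n_nodes, d)
node_feats = node_feats + messages                   # residual-style node update
```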
Graph processing can be done in three ways – graph
convolution, graph attention, or graph traversal. We discuss
these three methods through three models that each make
use of one of them. CGS [91] uses graph convolution. Graph
convolution can be thought to summarize information from a
particular neighborhood of nodes. Multiple layers of convo-
lution can produce higher and higher levels of representations
of the graph. CGS implements a patch operator by using a set
of K Gaussian kernels of learnable means and covariances.
For a given vertex, the output of the patch operator can be
thought of as a weighted sum of the neighboring features.
ReGAT [68] uses graph attention. In the final step of process-
ing, it uses self-attention to produce attention weights that
are used to compute the final features. ReGAT argues that
compared to graph convolutional networks which treat all
neighbors the same, graph attention assigns different weights
to different neighbors and this can reflect which relations
are more relevant to a specific question. Finally, we have
graph traversal which is used by NSM [55]. NSM converts
the question into a set of instructions and treats the graph as a
state machine where the nodes correspond to states and the
edges correspond to transitions. Then it performs iterative
Model VQA-v2 GQA
CGS [91] 66.18 -
MuRel [19] 68.41 -
LCGN [51] - 56.1
ReGAT [68] 70.58 -
GRN [44] 71.12 57.04
TRRNet [141] 71.2 60.74
BGN [43] 72.41 -
NSM [55] - 63.17
TABLE 12: Accuracy of different Graph models.
computations where the question-generated instructions are
treated as input and traversing the graph simulates traversing
the state machine’s states, effectively performing sequential
reasoning. At each instruction step, NSM traverses the graph
by using the current nodes and edges to determine the next set
of nodes. Node features are not updated in graph traversal.
Traversal just determines the final set of nodes the answer
will be based on. After processing, some models [91] [19]
[122] summarize the node features in some way to represent
the entire final graph. The summarization can be done by
max-pooling or averaging the top k nodes or by predicting
attention weights on the final nodes or some other method.
This final representation is then used with the question to
predict the answer.
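A minimal sketch of this final summarization step is shown below, assuming attention-based pooling over the node features and illustrative layer sizes.

```python
import torch
import torch.nn as nn

# Sketch of a graph readout step: after the update iterations, attention
# weights over the final node features summarize the graph into one vector,
# which is combined with the question to predict the answer. All layer sizes
# here are assumptions for illustration.
d, n_nodes, n_answers = 32, 6, 10
node_feats = torch.randn(n_nodes, d)
question = torch.randn(d)

scorer = nn.Linear(2 * d, 1)
pair = torch.cat([node_feats, question.expand(n_nodes, d)], dim=-1)
alpha = torch.softmax(scorer(pair).squeeze(-1), dim=0)    # per-node attention
graph_vec = (alpha.unsqueeze(-1) * node_feats).sum(dim=0)

classifier = nn.Linear(2 * d, n_answers)
logits = classifier(torch.cat([graph_vec, question], dim=-1))
```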
X. TRANSFORMER MODELS
Many recent VQA models have utilized BERT-like trans-
former models to extract cross-modal joint embeddings.
[125] introduced transformers, which use iterative self-attention to model the dependencies among all input elements.
BERT [33] is a transformer-based model that is pre-trained
on several proxy tasks to learn better linguistic represen-
tations. Inspired by BERT’s success on a wide range of
NLP benchmarks, recent efforts have tried to leverage task-
agnostic pre-training to tackle Vision-and-Language(V+L)
challenges. Most V+L tasks rely on joint multimodal embeddings to bridge the semantic gap between the visual and
textual elements in image and text. Transformer models try
to learn this joint representation through pre-training similar
to BERT. Instead of using separate vision and language pre-
training, unified multi-modal pre-training allows the model
to implicitly discover useful alignments between both sets of
inputs and build up a useful joint representation. BERT-like
models define some proxy tasks to enable this pre-training.
The tasks involve masking one part of the data and training
the model to predict the ‘missing’ part from the context
provided by the remaining data. Through these proxy tasks,
the model gains a deeper level of semantic understanding.
Transformer models fall into two camps, two-stream mod-
els and single-stream models. In the two-stream architecture,
two separate transformers are applied to image and text
and their outputs are fused by a third Transformer in a
later stage. In contrast, single-stream models use a single
transformer for joint intra-modal and inter-modal interaction.
ViLBERT [77], LXMERT [117], and ERNIE-ViL [146] use
the two-stream architecture while VisualBERT [69], VL-
BERT [114], UNITER [26], VLP [152], Oscar [72], and
Pixel-BERT [52] use the single-stream method. Two-stream
models argue that using separate transformers to perform iso-
lated self-attention brings more specific representations while
single-stream models argue that self-attention benefits more
from the context provided by the other modality resulting in
better cross-modal alignment.
Designing proxy tasks is the most important part of pre-
training. The proxy tasks must force the model to use
information from one modality to make valid inferences
about the other. In the process, the model is expected to
mine useful cross-modal representations that encode general
semantic information. We describe some of these tasks. In
MLM(Masked Language Modelling), some textual input to-
kens are randomly masked and the model has to figure out
these tokens by using information from the remaining tex-
tual and visual tokens. Only using textual information often
results in ambiguity which can only be resolved by utilizing
image information. This forces the model to learn useful
alignments between the modalities. By definition, MLM pro-
vides bi-directional context where all textual tokens that were
not selected for masking are visible. VLP [152] introduces an
additional task in the form of unidirectional MLM where only
the tokens on the left of the ‘to-be-predicted’ token are visible
to the model. In MRFR(Masked Region Feature Regression),
the model has to regress region features of randomly masked
image regions. This task is harder to train on because regress-
ing on a pixel level is difficult and visual ambiguity is more
abundant than language ambiguity. For example, there are
endless variations of valid pixel arrangements for a region
labeled ‘dog’. MRC(Masked Region Classification) is a more
tenable task where the labels of randomly selected image
Models Pre-training Tasks
LXMERT [117] MLM, MRFR, MRC, ITM, IQA
ViLBERT [77] MLM, MRC, ITM
VisualBERT [69] MLM, ITM
VL-BERT [114] MLM, MRC
UNITER [26] MLM, MRFR, MRC, ITM, WRA
VLP [152] MLM, Uni-MLM
Oscar [72] MLM, Contrastive Loss
Pixel-BERT [52] MLM, ITM
ERNIE-ViL [146] MLM, MRC, ITM, SGP
TABLE 13: Pre-training tasks used by different transformer models.
Models VQA-v2
VLP [152] 70.7
ViLBERT [77] 70.92
VisualBERT [69] 71.00
VL-BERT [114] 72.22
LXMERT [117] 72.54
Oscar [72] 73.82
UNITER [26] 74.02
Pixel-BERT [52] 74.55
ERNIE-ViL [146] 74.93
TABLE 14: Accuracy of different transformer models.
regions are masked and the model has to predict the miss-
ing object labels. UNITER [26] introduces the WRA(Word-
Region Alignment) task where optimal transport is used to
explicitly encourage fine-grained alignment between words
and image regions. Optimal transport tries to minimize the
cost of transporting the embeddings from image regions to
words in a sentence and vice versa. In addition to text and
image, Oscar [72] provides a third input in the form of object
tags. Oscar argues that object tags can act as helpful anchors
to ease the learning of cross-modal alignments. Oscar defines
a contrastive loss task where the model has to match object
tags to valid images. ERNIE-ViL [146] introduces scene
graph prediction, where the model is provided with three tasks: object prediction, attribute prediction, and relationship prediction. ERNIE-ViL contends that MLM pre-training alone
cannot help the model determine whether a word represents
an object or an attribute or a relation and that scene-graphs
have to be explicitly included in the training objective for the
model to gain better understanding. Pixel-BERT [52] argues
that using region features from pre-trained object detectors
causes an information gap because useful information like
shape, spatial position, scene information, etc., is lost. In-
stead of region features, Pixel-BERT uses raw pixels as the
visual input. To make this manageable, a random sampling
method is used. This makes training computationally feasible
and encourages the model to learn semantic knowledge from
incomplete visual input which enhances robustness.
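To make the MLM proxy task concrete, the sketch below shows only the masking step; the token ids, the 15% masking rate, and the ignore index follow common practice but are assumptions here rather than the recipe of any particular model.

```python
import torch

# Sketch of the MLM proxy task's masking step: a fraction of text tokens is
# replaced with a [MASK] id, and the model must recover them from the
# remaining text tokens plus the visual tokens. Ids and the 15% rate are
# common-practice assumptions, not tied to a specific model.
MASK_ID, mask_prob = 103, 0.15
text_tokens = torch.tensor([[101, 2054, 2003, 1996, 3899, 2725, 102]])
labels = text_tokens.clone()

mask = torch.rand(text_tokens.shape) < mask_prob
mask[:, 0] = mask[:, -1] = False                     # never mask special tokens
inputs = text_tokens.masked_fill(mask, MASK_ID)
labels[~mask] = -100                                 # ignore unmasked positions in the loss

# A multimodal transformer would now consume [visual_tokens; inputs] and be
# trained with cross-entropy against `labels` at the masked positions.
```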
XI. MISCELLANEOUS
A. UNDERSTANDING TEXT IN IMAGE
Many questions about images relate to text present in the
image. For example, “What does the sign say?”, “What time
does the clock show?”, etc. Questions like these require new
answers unseen during training. As it is not possible to
include all possible answers in the test set, the model needs
to have the ability to generate novel answers. Most of the
existing VQA models are inept at reasoning about scene text.
But some recent models have tried to tackle this problem.
LoRRA [111] is a VQA model which has an additional OCR
module that allows it to answer questions about scene text.
The OCR module extracts all texts in the image as tokens
and LoRRA determines which of these tokens is needed in
the answering process. It can also output one of the tokens
as the answer. ST-VQA [15] is another model that tries to
incorporate scene text understanding. ST-VQA argues that
PHOC [41] should be used for embedding the extracted text
instead of traditional word embedding models. Unlike other
word embedding models, PHOC puts more emphasis on the
morphology of words rather than the semantics. As a result,
it is more suitable for representing out-of-vocabulary words
such as license plate numbers.
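A simplified sketch of a PHOC-style descriptor is given below; the real PHOC uses a more careful region-occupancy rule and additional bigram levels, so this is only meant to convey why the encoding reflects word morphology and still produces meaningful codes for out-of-vocabulary strings.

```python
import string

# Simplified sketch of a PHOC-style embedding: at pyramid level L the word is
# split into L regions, and each region gets a binary histogram over the
# alphabet marking which characters fall inside it. The real descriptor uses
# a more careful occupancy rule and extra bigram levels.
def simple_phoc(word, levels=(1, 2, 3), alphabet=string.ascii_lowercase + string.digits):
    word = word.lower()
    vec = []
    for L in levels:
        for region in range(L):
            hist = [0] * len(alphabet)
            for pos, ch in enumerate(word):
                if ch in alphabet and int(L * pos / len(word)) == region:
                    hist[alphabet.index(ch)] = 1
            vec.extend(hist)
    return vec

# Out-of-vocabulary strings such as license plates still get informative codes:
print(sum(simple_phoc("ka-01-hh-1234")))
```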
B. COUNTING
Among the various sub-problems of VQA, counting is often
singled out because of its difficulty. Models that use CNNs
for image features are prone to failing on counting questions.
This is because the convolution and pooling layers of a CNN
aggregate the spatial information of various objects together. The
resulting features can identify the presence of a certain object
but cannot count how many instances of that object are present.
So, more recent models have switched to using object detec-
tors. Object detectors can detect all instances of a particular
object. The only issue here is that the predicted bounding
boxes sometimes overlap which can lead to duplicate count-
ing. Another issue is the information loss that occurs in
the attention layer. Almost all models use normalized soft
attention, in which the attention weights sum to 1.0. So if an
image contains 1 cat, that cat will receive an attention weight
of 1.0, but if there are 2 cats, both will receive weights of
0.5, which effectively means that the two cats are being
averaged back into a single cat. So, instead of all instances of
the object getting the highest attention, attention gets divided
and diluted. Skipping normalization is not an option as [121]
shows it degrades performance on non-counting questions
while the increase in accuracy for counting questions is not
that significant.
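The toy computation below makes the dilution effect explicit: with normalized attention, the attended feature for two identical cats is indistinguishable from that of a single cat. The feature values are placeholders.

```python
import torch

# Numeric sketch of the attention-dilution problem: with normalized soft
# attention, two identical "cat" regions each get weight 0.5, so the attended
# feature is the same as for a single cat. Feature values are toy.
cat = torch.tensor([1.0, 0.0, 2.0])
one_cat = torch.stack([cat])                     # image with 1 cat
two_cats = torch.stack([cat, cat])               # image with 2 cats

def attend(regions):
    scores = torch.ones(len(regions))            # both regions look equally "cat-like"
    alpha = torch.softmax(scores, dim=0)         # weights sum to 1.0
    return alpha, (alpha.unsqueeze(-1) * regions).sum(dim=0)

print(attend(one_cat))    # weights [1.0],       attended feature = cat
print(attend(two_cats))   # weights [0.5, 0.5],  attended feature = cat again
```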
IRLC [123] defines counting as a sequential object selec-
tion problem. IRLC uses reinforcement learning to train a
scoring function based on the question. The object with the
highest score gets selected, and this selection prompts new
scores for the remaining objects. However, the use of reinforcement
learning hinders assimilating this method into existing VQA
models. Counter [150] is a model that uses object detection
to construct a graph of all detected objects based on how they
overlap. Edges in the graph are removed based on several
heuristics to ensure that each object is only counted once.
RCN [1] is a model made specifically for the VQA counting
task. It uses Faster R-CNN to detect foreground(fg) and
background(bg) objects. It then performs relational reasoning
with all fg-fg and fg-bg object pairs. RCN argues that using
relational reasoning instead of directly working with the
bounding boxes allows it to overcome the overlapping issue.
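A minimal de-duplication heuristic in this spirit is sketched below; the IoU threshold and boxes are illustrative, and the actual models remove graph edges with more involved, learned rules rather than this single hard threshold.

```python
# Minimal sketch of one de-duplication heuristic for counting: boxes that
# overlap heavily (high IoU) are assumed to cover the same object, so each
# object is counted once. Threshold and boxes below are illustrative.
def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (area(a) + area(b) - inter)

def dedup_count(boxes, thresh=0.5):
    kept = []
    for box in boxes:
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return len(kept)

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (30, 30, 40, 40)]  # first two overlap heavily
print(dedup_count(boxes))   # 2
```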
C. BIAS REDUCTION
Dataset bias has been a persistent issue in VQA from the
start. Many papers [3] [85] [42] [60] [149] [5] have produced
findings suggesting that most existing VQA models boost
their performance by exploiting spurious correlations,
such as statistical biases and language priors, instead of
properly reasoning about the image and question. Question
bias in particular seems to plague most models, making them
indifferent to image information. Existing VQA datasets contain
extremely skewed question and answer distributions, so
it is hard to distinguish whether a model is really doing well
or merely exploiting these biases. To diagnose bias-reliance
in models, VQA-cp [4] was introduced. It re-organizes the
VQA dataset(both v1 and v2) to make the training and test
splits contain different distributions. So only models with
actual generalization ability do well on the test set while
biased models fail. There have been many recent efforts
at bias reduction in VQA. Approaches fall roughly under
three categories: 1) Utilizing a bias-only model with the
main VQA model; 2) Forcing the model to take the image
into consideration; and 3) Augmenting new counter-factual
samples. The first two approaches try to remove bias from
the model side while the third tries to remove bias from the
dataset side.
A bias-only model can be fitted with the main VQA model
to isolate training signals that originate from bias. [99], [28],
and [20] follow this approach. [99] uses adversarial training
where they set the main VQA model and the question-only
adversary against each other. The main VQA network tries
to modify its question embedding so that it becomes less
useful to the question-only model. This results in the question
encoder capturing less bias in its encoding. [28] follows
the straightforward route of using a simple ensemble of the
main and the bias-only model. They argue that the ensem-
bled model learns to use the unbiased model for questions
that the biased model fails to answer. [20] uses the biased
model to dynamically adjust its loss such that the network
gives less importance to the most biased examples and more
importance to the less biased ones.
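The sketch below illustrates the general loss-reweighting idea with a question-only bias model; it is a simplified illustration of the concept, not the exact formulation of any of the cited methods.

```python
import torch
import torch.nn.functional as F

# Sketch of the loss-reweighting idea: a question-only (bias) model scores
# each training example, and examples the bias model already answers
# confidently contribute less to the main model's loss. Logits and labels
# are random placeholders.
main_logits = torch.randn(4, 10)                 # main VQA model
bias_logits = torch.randn(4, 10)                 # question-only model
labels = torch.tensor([1, 3, 5, 7])

bias_conf = F.softmax(bias_logits, dim=-1).gather(1, labels[:, None]).squeeze(1)
per_example = F.cross_entropy(main_logits, labels, reduction="none")
loss = ((1.0 - bias_conf) * per_example).mean()  # most biased examples weigh least
```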
Another way to reduce question bias is to make sure the
model takes the image into consideration. HINT [105] argues
that the model may focus on the right region in the attention
stage but still disregard the visual information in its
reasoning process. HINT works on a gradient level to make
sure the answering process is affected by the visual branch
of the model. HINT proposes a ranking loss which tries
to align its visual attention with human-provided attention
annotations. In contrast, SCR [136] introduces a self-critical
training objective which penalizes the model if correct an-
swers are more sensitive to non-important regions than to
important regions, and if incorrect answers are more sensitive
to important regions than correct answers.
Some approaches have tried debiasing from the data side
by augmenting existing data with counter-factual samples.
These samples help the model to locate the underlying causal
relations that lead to the correct answer. A counter-factual
sample is produced by slightly modifying the original sample
which causes a subtle semantic shift resulting in a different
answer. Take for example an image of a dog. If the dog in
the image is erased or masked, the answer to the question “Is
there a dog in the image?” changes from “yes” to “no”. Usu-
ally counter-factual samples differ from the original sample
in a slight way while still remaining very similar. This forces
the model to gain a more nuanced understanding in order to
differentiate between the two samples. For VQA, counter-
factual samples are produced by modifying the image or the
question or both. Images can be modified by masking one or
more regions. Masking can be done either by cropping out
the region [73] [120] [24] or by replacing it with GAN-based
in-painting [2] [40]. [40] also experiments with inverting the
color of an object in the image. Similar to images, questions
can be modified by masking one or more words [73] [24]
[40]. In addition to word masking, [40] also replaces words
with their negations and adversarial substitutes.
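A toy sketch of producing such counterfactual pairs on both modalities is given below; the question, region features, and answers are placeholders, and real methods select the masked word or region using attention or gradient-based importance rather than by hand.

```python
import torch

# Sketch of counterfactual augmentation on both modalities: masking a critical
# question word or erasing an image region produces a new sample whose answer
# flips. The token, regions, and answers here are toy placeholders.
question = ["is", "there", "a", "dog", "in", "the", "image"]
cf_question = ["[MASK]" if w == "dog" else w for w in question]

image_regions = torch.randn(36, 2048)            # detected region features
dog_region = 5                                   # index of the region to erase
cf_image = image_regions.clone()
cf_image[dog_region] = 0.0                       # "remove" the dog

original_sample = (question, image_regions, "yes")
counterfactual = (question, cf_image, "no")      # same question, different answer
```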
XII. CONCLUSION
Note that we did not discuss any particular dataset or model
at length; our goal was to give a bird’s-eye view of the
entire VQA field. In
preparing this survey, we found that many outdated models
still provided valuable insights or a novel way of approaching
the problem. We stress that accuracy is not everything, as
biases in widely used datasets, VQA-v1 and v2 for example,
can present a skewed depiction of a model’s true ability. It
would not be a stretch to say that the VQA field has exploded
in a few short years. It is easy to feel overwhelmed and lost
as a newcomer. We hope that our work will provide anyone
new to this field with a map and a starting point.
REFERENCES
[1] Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answer-
ing complex counting questions. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 33, pages 8076–8084, 2019.
[2] Vedika Agarwal, Rakshith Shetty, and Mario Fritz. Towards causal vqa:
Revealing and reducing spurious correlations by invariant and covariant
semantic editing. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 9690–9698, 2020.
[3] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing
the behavior of visual question answering models. arXiv preprint
arXiv:1606.07356, 2016.
[4] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kemb-
havi. Don’t just assume; look and answer: Overcoming priors for visual
question answering. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4971–4980, 2018.
[5] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kemb-
havi. Don’t just assume; look and answer: Overcoming priors for visual
question answering. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4971–4980, 2018.
[6] Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, and Devi
Parikh. C-vqa: A compositional split of the visual question answering
(vqa) v1. 0 dataset. arXiv preprint arXiv:1704.08243, 2017.
[7] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark John-
son, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention
for image captioning and visual question answering. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
6077–6086, 2018.
[8] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learn-
ing to compose neural networks for question answering. arXiv preprint
arXiv:1601.01705, 2016.
[9] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural
module networks. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[10] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell,
Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question
answering. In Proceedings of the IEEE international conference on
computer vision, pages 2425–2433, 2015.
[11] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard
Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data.
In The semantic web, pages 722–735. Springer, 2007.
[12] Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broad-
head, and Oren Etzioni. Open information extraction from the web. In
Ijcai, volume 7, pages 2670–2676, 2007.
[13] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome.
Mutan: Multimodal tucker fusion for visual question answering. In
Proceedings of the IEEE international conference on computer vision,
pages 2612–2620, 2017.
[14] Hedi Ben-Younes, Rémi Cadene, Nicolas Thome, and Matthieu Cord.
Block: Bilinear superdiagonal fusion for visual question answering and
visual relationship detection. arXiv preprint arXiv:1902.00038, 2019.
[15] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal
Rusiñol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene
text visual question answering. arXiv preprint arXiv:1905.13648, 2019.
[16] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie
Taylor. Freebase: a collaboratively created graph database for structuring
human knowledge. In Proceedings of the 2008 ACM SIGMOD interna-
tional conference on Management of data, pages 1247–1250. ACM, 2008.
[17] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, et al. Language models are few-shot learners.
arXiv preprint arXiv:2005.14165, 2020.
[18] Timothy J Buschman and Earl K Miller. Top-down versus bottom-
up control of attention in the prefrontal and posterior parietal cortices.
science, 315(5820):1860–1862, 2007.
[19] Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome.
Murel: Multimodal relational reasoning for visual question answering.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1989–1998, 2019.
[20] Remi Cadene, Corentin Dancette, Hedi Ben-Younes, Matthieu Cord,
and Devi Parikh. Rubi: Reducing unimodal biases in visual question
answering. arXiv preprint arXiv:1906.10169, 2019.
[21] Qingxing Cao, Xiaodan Liang, Bailin Li, and Liang Lin. Interpretable
visual question answering by reasoning on dependency trees. arXiv
preprint arXiv:1809.01810, 2018.
[22] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R
Hruschka, and Tom M Mitchell. Toward an architecture for never-ending
language learning. In Twenty-Fourth AAAI Conference on Artificial
Intelligence, 2010.
[23] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and
Ram Nevatia. Abc-cnn: An attention based convolutional neural network
for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[24] Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, and
Yueting Zhuang. Counterfactual samples synthesizing for robust visual
question answering. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 10800–10809, 2020.
[25] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. Neil: Extracting
visual knowledge from web data. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 1409–1416, 2013.
[26] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed,
Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text
representation learning. In European conference on computer vision,
pages 104–120. Springer, 2020.
[27] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using rnn encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078, 2014.
[28] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take
the easy way out: Ensemble based methods for avoiding known dataset
biases. arXiv preprint arXiv:1909.03683, 2019.
[29] Maurizio Corbetta and Gordon L Shulman. Control of goal-directed
and stimulus-driven attention in the brain. Nature reviews neuroscience,
3(3):201, 2002.
[30] Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv
Batra. Human attention in visual question answering: Do humans and
deep networks look at the same regions? Computer Vision and Image
Understanding, 163:90–100, 2017.
[31] Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv
Batra. Human attention in visual question answering: Do humans and
deep networks look at the same regions? Computer Vision and Image
Understanding, 163:90–100, 2017.
[32] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-
Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE
conference on computer vision and pattern recognition, pages 248–255.
IEEE, 2009.
[33] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language under-
standing. arXiv preprint arXiv:1810.04805, 2018.
[34] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland,
et al. Open information extraction: The second generation. In Twenty-
Second International Joint Conference on Artificial Intelligence, 2011.
[35] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying rela-
tions for open information extraction. In Proceedings of the conference
on empirical methods in natural language processing, pages 1535–1545.
Association for Computational Linguistics, 2011.
[36] Moshiur R Farazi and Salman H Khan. Reciprocal attention fusion for
visual question answering. arXiv preprint arXiv:1805.04247, 2018.
[37] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor
Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling
for visual question answering and visual grounding. arXiv preprint
arXiv:1606.01847, 2016.
[38] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi,
Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra-and inter-
modality attention flow for visual question answering. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
6639–6648, 2019.
[39] Shalini Ghosh, Giedrius Burachas, Arijit Ray, and Avi Ziskind. Generat-
ing natural language explanations for visual question answering using
scene graphs and visual attention. arXiv preprint arXiv:1902.05715,
2019.
[40] Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. Mu-
tant: A training paradigm for out-of-distribution generalization in visual
question answering. arXiv preprint arXiv:2009.08566, 2020.
[41] Lluís Gómez, Andrés Mafla, Marçal Rusinol, and Dimosthenis Karatzas.
Single shot scene text retrieval. In Proceedings of the European Confer-
ence on Computer Vision (ECCV), pages 700–715, 2018.
[42] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi
Parikh. Making the v in vqa matter: Elevating the role of image
understanding in visual question answering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 6904–
6913, 2017.
[43] Dalu Guo, Chang Xu, and Dacheng Tao. Bilinear graph networks for
visual question answering. arXiv preprint arXiv:1907.09815, 2019.
[44] Dalu Guo, Chang Xu, and Dacheng Tao. Graph reasoning networks for
visual question answering. 2019.
[45] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask
r-cnn. In Proceedings of the IEEE international conference on computer
vision, pages 2961–2969, 2017.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 770–778, 2016.
[47] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
[48] Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard
Weikum. Yago2: A spatially and temporally enhanced knowledge base
from wikipedia. Artificial Intelligence, 194:28–61, 2013.
[49] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Ex-
plainable neural computation via stack neural module networks. In
Proceedings of the European conference on computer vision (ECCV),
pages 53–69, 2018.
[50] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and
Kate Saenko. Learning to reason: End-to-end module networks for visual
question answering. In Proceedings of the IEEE International Conference
on Computer Vision, pages 804–813, 2017.
[51] Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko.
Language-conditioned graph networks for relational reasoning. arXiv
preprint arXiv:1905.04405, 2019.
[52] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong
Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal
transformers. arXiv preprint arXiv:2004.00849, 2020.
[53] Drew A Hudson and Christopher D Manning. Compositional attention
networks for machine reasoning. arXiv preprint arXiv:1803.03067, 2018.
[54] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for
real-world visual reasoning and compositional question answering. In
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 6700–6709, 2019.
[55] Drew A Hudson and Christopher D Manning. Learning by abstraction:
The neural state machine. arXiv preprint arXiv:1907.03950, 2019.
[56] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach,
Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal expla-
nations: Justifying decisions and pointing to the evidence. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 8779–8788, 2018.
[57] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach,
Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal expla-
nations: Justifying decisions and pointing to the evidence. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 8779–8788, 2018.
[58] Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. A focused dy-
namic attention model for visual question answering. arXiv preprint
arXiv:1604.01485, 2016.
[59] Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, and
Xinlei Chen. In defense of grid features for visual question answering.
In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 10267–10276, 2020.
[60] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei,
C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for
compositional language and elementary visual reasoning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 2901–2910, 2017.
[61] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoff-
man, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and
executing programs for visual reasoning. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2989–2998, 2017.
[62] Kushal Kafle and Christopher Kanan. An analysis of visual question an-
swering algorithms. In Proceedings of the IEEE International Conference
on Computer Vision, pages 1965–1973, 2017.
[63] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee
Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual
learning for visual qa. In Advances in neural information processing
systems, pages 361–369, 2016.
[64] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-
Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank
bilinear pooling. arXiv preprint arXiv:1610.04325, 2016.
[65] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel
Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In
Advances in neural information processing systems, pages 3294–3302,
2015.
[66] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata,
Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A
Shamma, et al. Visual genome: Connecting language and vision using
crowdsourced dense image annotations. International Journal of Com-
puter Vision, 123(1):32–73, 2017.
[67] Guohao Li, Hang Su, and Wenwu Zhu. Incorporating external knowledge
to answer open-domain visual questions with dynamic memory networks.
arXiv preprint arXiv:1712.00733, 2017.
[68] Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. Relation-aware
graph attention network for visual question answering. arXiv preprint
arXiv:1903.12314, 2019.
[69] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei
Chang. Visualbert: A simple and performant baseline for vision and
language. arXiv preprint arXiv:1908.03557, 2019.
[70] Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. Tell-and-
answer: Towards explainable visual question answering using attributes
and captions. arXiv preprint arXiv:1801.09041, 2018.
[71] Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and Jiebo Luo. Vqa-e: Ex-
plaining, elaborating, and enhancing your answers for visual questions. In
Proceedings of the European Conference on Computer Vision (ECCV),
pages 552–567, 2018.
[72] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei
Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar:
Object-semantics aligned pre-training for vision-language tasks. In
European Conference on Computer Vision, pages 121–137. Springer,
2020.
[73] Zujie Liang, Weitao Jiang, Haifeng Hu, and Jiaying Zhu. Learning to
contrast the counterfactual samples for robust visual question answering.
In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 3285–3292, 2020.
[74] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Per-
ona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft
coco: Common objects in context. In European conference on computer
vision, pages 740–755. Springer, 2014.
[75] Yuetan Lin, Zhangyang Pang, Donghui Wang, and Yueting Zhuang.
Feature enhancement in attention for visual question answering. In
IJCAI, pages 4216–4222, 2018.
[76] Hugo Liu and Push Singh. Conceptnet—a practical commonsense
reasoning tool-kit. BT technology journal, 22(4):211–226, 2004.
[77] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretrain-
ing task-agnostic visiolinguistic representations for vision-and-language
tasks. arXiv preprint arXiv:1908.02265, 2019.
[78] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical
question-image co-attention for visual question answering. In Advances
In Neural Information Processing Systems, pages 289–297, 2016.
[79] Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, and Xiaogang Wang.
Co-attending free-form regions and detections with multi-modal multi-
plicative feature embedding for visual question answering. In Thirty-
Second AAAI Conference on Artificial Intelligence, 2018.
[80] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. Yago3:
A knowledge base from multilingual wikipedias. 2013.
[81] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum,
and Jiajun Wu. The neuro-symbolic concept learner: Interpreting
scenes, words, and sentences from natural supervision. arXiv preprint
arXiv:1904.12584, 2019.
[82] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mot-
taghi. Ok-vqa: A visual question answering benchmark requiring external
knowledge. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3195–3204, 2019.
[83] David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar.
Transparency by design: Closing the gap between performance and in-
terpretability in visual reasoning. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 4942–4950, 2018.
[84] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff
Dean. Distributed representations of words and phrases and their compo-
sitionality. In Advances in neural information processing systems, pages
3111–3119, 2013.
[85] Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and
Kedar Dhamdhere. Did the model understand the question? arXiv
preprint arXiv:1805.05492, 2018.
[86] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention
networks for multimodal reasoning and matching. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages
299–307, 2017.
[87] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out
of the box: Reasoning with graph convolution nets for factual visual
question answering. In Advances in Neural Information Processing
Systems, pages 2654–2665, 2018.
[88] Medhini Narasimhan and Alexander G Schwing. Straight to the facts:
Learning knowledge base retrieval for factual visual question answer-
ing. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 451–468, 2018.
[89] Duy-Kien Nguyen and Takayuki Okatani. Improved fusion of visual
and language representations by dense symmetric co-attention for visual
question answering. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6087–6096, 2018.
[90] Hyeonwoo Noh and Bohyung Han. Training recurrent answering units
with joint loss minimization for vqa. arXiv preprint arXiv:1606.03647,
2016.
[91] Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning
conditioned graph structures for interpretable visual question answering.
In Advances in Neural Information Processing Systems, pages 8334–
8343, 2018.
[92] Ahmed Osman and Wojciech Samek. Drau: Dual recurrent attention
units for visual question answering. Computer Vision and Image Understanding, 2019.
[93] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove:
Global vectors for word representation. In Proceedings of the 2014 con-
ference on empirical methods in natural language processing (EMNLP),
pages 1532–1543, 2014.
[94] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and
Aaron Courville. Film: Visual reasoning with a general conditioning
layer. In Thirty-Second AAAI Conference on Artificial Intelligence,
2018.
[95] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christo-
pher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized
word representations. arXiv preprint arXiv:1802.05365, 2018.
[96] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels
via explicit feature maps. In Proceedings of the 19th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages
239–247. ACM, 2013.
[97] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
Improving language understanding by generative pre-training. 2018.
[98] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. Language models are unsupervised multitask learners.
OpenAI blog, 1(8):9, 2019.
[99] Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Over-
coming language priors in visual question answering with adversarial
regularization. arXiv preprint arXiv:1810.03649, 2018.
[100] Arijit Ray, Giedrius T Burachas, Karan Sikka, Anirban Roy, Avi Ziskind,
Yi Yao, and Ajay Divakaran. Make up your mind: Towards consistent
answer predictions in vqa models. In European Conference on Computer
Vision (ECCV), Workshops, 2018.
[101] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and
data for image question answering. Advances in neural information
processing systems, 28:2953–2961, 2015.
[102] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn:
Towards real-time object detection with region proposal networks. In
Advances in neural information processing systems, pages 91–99, 2015.
[103] Idan Schwartz, Alexander Schwing, and Tamir Hazan. High-order
attention models for visual question answering. In Advances in Neural
Information Processing Systems, pages 3664–3674, 2017.
[104] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakr-
ishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual
explanations from deep networks via gradient-based localization. In
Proceedings of the IEEE international conference on computer vision,
pages 618–626, 2017.
[105] Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini
Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint:
Leveraging explanations to make vision and language models more
grounded. In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 2591–2600, 2019.
[106] Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-
consistency for robust visual question answering. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages
6649–6658, 2019.
[107] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit
visual reasoning over scene graphs. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 8376–
8384, 2019.
[108] Yang Shi, Tommaso Furlanello, Sheng Zha, and Animashree Anandku-
mar. Question type guided attention in visual question answering. In
Proceedings of the European Conference on Computer Vision (ECCV),
pages 151–166, 2018.
[109] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus
regions for visual question answering. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 4613–
4621, 2016.
[110] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. arXiv preprint arXiv:1409.1556,
2014.
[111] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen,
Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models
that can read. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 8317–8326, 2019.
[112] Jasdeep Singh, Vincent Ying, and Alex Nutkiewicz. Attention on atten-
tion: Architectures for visual question answering (vqa). 2018.
[113] Jingkuan Song, Pengpeng Zeng, Lianli Gao, and Heng Tao Shen. From
pixels to objects: Cubic visual attention for visual question answering. In IJCAI, 2018.
[114] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng
Dai. Vl-bert: Pre-training of generic visual-linguistic representations.
arXiv preprint arXiv:1908.08530, 2019.
[115] Joseph Suarez, Justin Johnson, and Fei-Fei Li. Ddrprog: A clevr differen-
tiable dynamic reasoning programmer. arXiv preprint arXiv:1803.11361,
2018.
[116] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 1–9, 2015.
[117] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder
representations from transformers. arXiv preprint arXiv:1908.07490,
2019.
[118] Niket Tandon, Gerard De Melo, and Gerhard Weikum. Acquiring
comparative commonsense knowledge from the web. In Twenty-Eighth
AAAI Conference on Artificial Intelligence, 2014.
[119] Joshua B Tenenbaum and William T Freeman. Separating style and
content with bilinear models. Neural computation, 12(6):1247–1283,
2000.
[120] Damien Teney, Ehsan Abbasnejad, and Anton van den Hengel. Learning
what makes a difference from counterfactual examples and gradient
supervision. arXiv preprint arXiv:2004.09034, 2020.
[121] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hen-
gel. Tips and tricks for visual question answering: Learnings from the
2017 challenge. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4223–4232, 2018.
[122] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-
structured representations for visual question answering. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–9, 2017.
[123] Alexander Trott, Caiming Xiong, and Richard Socher. Interpretable
counting for visual question answering. arXiv preprint arXiv:1712.08697, 2017.
[124] Ledyard R Tucker. Some mathematical notes on three-mode factor
analysis. Psychometrika, 31(3):279–311, 1966.
[125] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In Advances in neural information processing systems,
pages 5998–6008, 2017.
[126] Ben-Zion Vatashsky and Shimon Ullman. Vqa with no questions-answers
training. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 10376–10386, 2020.
[127] Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative
knowledge base. Communications of the ACM, 57(10):78–85, 2014.
[128] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den
Hengel. Explicit knowledge-based reasoning for visual question answer-
ing. In Proceedings of the Twenty-Sixth International Joint Conference
on Artificial Intelligence, IJCAI-17, pages 1290–1296, 2017.
[129] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den
Hengel. Fvqa: Fact-based visual question answering. IEEE transactions
on pattern analysis and machine intelligence, 40(10):2413–2427, 2018.
[130] Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony
Dick. Explicit knowledge-based reasoning for visual question answering.
arXiv preprint arXiv:1511.02570, 2015.
[131] Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. The vqa-
machine: Learning how to use existing vision algorithms to answer new
questions. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1173–1182, 2017.
[132] Zhengyang Wang and Shuiwang Ji. Learning convolutional text represen-
tations for visual question answering. In Proceedings of the 2018 SIAM
International Conference on Data Mining, pages 594–602. SIAM, 2018.
[133] Zhonghao Wang, Kai Wang, Mo Yu, Jinjun Xiong, Wen-mei Hwu, Mark
Hasegawa-Johnson, and Humphrey Shi. Interpretable visual reasoning
via induced symbolic space. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision, pages 1878–1887, 2021.
[134] Spencer Whitehead, Hui Wu, Yi Ren Fung, Heng Ji, Rogerio Feris, and
Kate Saenko. Learning from lexical perturbations for consistent visual
question answering. arXiv preprint arXiv:2011.13406, 2020.
[135] Jialin Wu and Raymond J Mooney. Faithful multimodal explanation for
visual question answering. arXiv preprint arXiv:1809.02805, 2018.
[136] Jialin Wu and Raymond J Mooney. Self-critical reasoning for robust
visual question answering. arXiv preprint arXiv:1905.09998, 2019.
[137] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den
Hengel. Image captioning and visual question answering based on
attributes and external knowledge. IEEE transactions on pattern analysis
and machine intelligence, 40(6):1367–1381, 2017.
[138] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den
Hengel. Ask me anything: Free-form visual question answering based on
knowledge from external sources. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 4622–4630, 2016.
[139] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection.
In Proceedings of the 32nd annual meeting on Association for Com-
putational Linguistics, pages 133–138. Association for Computational
Linguistics, 1994.
[140] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring
question-guided spatial attention for visual question answering. In
European Conference on Computer Vision, pages 451–466. Springer,
2016.
[141] Xiaofeng Yang, Guosheng Lin, Fengmao Lv, and Fayao Liu. Trrnet:
Tiered relation reasoning for compositional visual question answering.
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 414–430.
Springer, 2020.
[142] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola.
Stacked attention networks for image question answering. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages
21–29, 2016.
[143] Yiqun Yao, Jiaming Xu, Feng Wang, and Bo Xu. Cascaded mutual
modulation for visual reasoning. arXiv preprint arXiv:1809.01943, 2018.
[144] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli,
and Josh Tenenbaum. Neural-symbolic vqa: Disentangling reasoning
from vision and language understanding. In Advances in Neural Infor-
mation Processing Systems, pages 1031–1042, 2018.
[145] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention
networks for visual question answering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4709–
4717, 2017.
[146] Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng
Wang. Ernie-vil: Knowledge enhanced vision-language representations
through scene graph. arXiv preprint arXiv:2006.16934, 2020.
[147] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular
co-attention networks for visual question answering. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
6281–6290, 2019.
[148] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao.
Beyond bilinear: Generalized multimodal factorized high-order pooling
for visual question answering. IEEE transactions on neural networks and
learning systems, 29(12):5947–5959, 2018.
[149] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi
Parikh. Yin and yang: Balancing and answering binary visual questions.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5014–5022, 2016.
[150] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to count
objects in natural images for visual question answering. arXiv preprint
arXiv:1802.05766, 2018.
[151] Yundong Zhang, Juan Carlos Niebles, and Alvaro Soto. Interpretable
visual question answering by visual grounding from attention supervision
mining. In 2019 IEEE Winter Conference on Applications of Computer
Vision (WACV), pages 349–357. IEEE, 2019.
[152] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and
Jianfeng Gao. Unified vision-language pre-training for image captioning
and vqa. In Proceedings of the AAAI Conference on Artificial Intelli-
gence, volume 34, pages 13041–13049, 2020.
[153] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w:
Grounded question answering in images. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 4995–
5004, 2016.