Conference Paper

CAPTION: Correction by Analyses, POS-Tagging and Interpretation of Objects using only Nouns



Recently, Deep Learning (DL) methods have shown excellent performance in image captioning and visual question answering. However, despite their performance, DL methods do not learn the semantics of the words used to describe a scene, making it difficult to spot incorrect words used in captions or to interchange words that have similar meanings. This work proposes combining a DL method for object detection with natural language processing of the caption to validate image captions. We test our method on the FOIL-COCO data set, since it provides correct and incorrect captions for various images using only objects represented in the MS-COCO image data set. Results show that our method has a good overall performance that nears human performance in some cases.
No Author Given
No Institute Given
1 Introduction
Recently, Deep Learning (DL) methods have shown excellent performance in image
captioning and visual question answering. However, it has also been shown that, despite
their performance, DL methods do not learn the semantics of the words that are being
used to describe a scene, making it difficult to spot incorrect terms used in captions or
to substitute words with their synonyms. The method we propose uses natural language
processing (NLP) to add meaning to terms used in captions along with a DL method
for object recognition, in order to maintain consistency in image captioning. We use
the FOIL-COCO [20] data set as a test bed since it provides both correct and
incorrect captions for various images using only objects represented in the MS-COCO
image data set [12].
The FOIL-COCO [20] data set is a collection of annotations on top of MS-COCO,
which provides a caption that can be either correct or have one wrong word. This data
set has been proposed to test the ability of machine learning (ML) methods to comprehend
and give meaning to terms used when generating captions, allowing the following three
tasks: 1) classify the caption as correct or not; 2) if it is incorrect, find the mistake;
and 3) fix the caption by replacing the wrong word. [20] shows that, although the tested
ML methods performed well in Visual Question Answering (VQA) tasks [2], they perform
poorly on FOIL-COCO. The solution to this problem calls for an appropriate combination of
knowledge representation and reasoning methods with deep learning strategies in order
to make sense of the captions, connect the words with objects and apply inferences to
find the appropriate word to fix the wrong caption.
In contrast to what is traditionally done in this area, this work uses Deep Learning
models not as a single method for captioning, but to extract information from the image
(e.g., objects, actions, relations) that could be used by a high-level reasoning system.
The result of performing NLP on the caption is then applied to describe facts about the
world and to reason about the image, maintaining consistency of the caption with respect
to the objects in the images and the relations depicted. The results of this project
are expected to have a positive impact on advanced sensing and perception, enhancing
situation awareness for both users and automated systems, while also promoting rapid
understanding of the environment and enabling quick and effective decision making.
The next section presents in more detail the VQA problem and automatic caption
generation, along with the FOIL-COCO data set. We then describe object detection using
DL and caption processing using NLP (in the Background section), which provide
the set of tools needed to develop the CAPTION algorithm, which is first proposed
in the Proposal section. Section Experiments presents the tests executed to evaluate
CAPTION, followed by Results and Discussion of the overall performance of this algorithm.
2 Related Work
The goal of object detection is to locate objects pertaining to instances of specific classes
in visual inputs, such as images and videos. Although object detection has been a promi-
nent task in the field of computer vision, recent advances in Deep Learning and espe-
cially in Neural Network Architectures specialised in processing visual inputs, such as
Convolutional Neural Networks (CNN), have paved the way for the creation of new
methods which vastly improve results in existing image classification and object detec-
tion challenges [18]. Furthermore, multiple data sets of annotated images are available
online [12, 18, 11], enabling new models to be easily trained, tested and compared under
standard conditions.
Object detection can be separated into two subtasks: object localisation and clas-
sification. For solving the classification task, many recent techniques employ convolu-
tional layers as one of their building blocks [23] in order to learn convolutional filters
for object classification. While localisation was initially achieved through the use of
standalone region and object proposal algorithms, recent methods are able to learn how
to achieve effective proposals by fine tuning region proposals through processes such
as bounding box regression [17], or by performing classification in multiple regions of
an input image, at different scales and aspect ratios [13, 16].
Recent successful object detection techniques include You Only Look Once (YOLO)
[16], Single Shot MultiBox Detector (SSD) [13] and Faster R-CNN [17].
YOLO [16] divides the input image into an N × N grid; it then generates B bounding
boxes for each cell and predicts C class probabilities for each bounding box. Each
of the predictions is composed of five values, x, y, h, w, k, where x and y represent
the coordinates of the bounding box centre, h and w represent the bounding box's
height and width and k is a class probability. At training time, each of the C predictors
for a bounding box is specialised in a given class. This specialisation is encoded in a
multipart loss function, whose sum squared error is minimised via gradient descent.
SSD [13] employs a CNN, called the base network, to generate feature maps, as well
as additional convolutional layers of multiple sizes to detect objects in multiple scales.
Multiple regions of different scales and aspect ratios are evaluated in the target image
in order to accomplish detection. Training is also done via gradient descent and may be
boosted by techniques such as hard negative mining and data augmentation strategies.
The loss function is a weighted sum of localisation and classification loss.
R-CNN [5] employs selective search [22] to extract a fixed number of 2000 region
proposals from an input image. All region proposals are normalised to the same size via
warping and used as input to a CNN, which learns a fixed-length feature vector for each
one. These vectors are then used as input to several Support Vector Machine (SVM)
classifiers, each one trained to classify objects of a specific class.
Fast R-CNN [4] improves upon R-CNN by processing all region proposals from
an input image in a single forward pass, as well as by replacing multiple SVM classifiers
with a single fully-connected, softmax output layer for classification of region
proposals.
Lastly, Faster R-CNN [17] uses the same classification strategy as Fast R-CNN,
while introducing a Region Proposal Network (RPN) for object localisation. The RPN
uses convolutional filters to produce region proposals represented by x, y, h, w, k, where
x and y represent the coordinates of a region proposal, h and w represent its height and
width and k is an “objectness” score. The bounding box predictions of the RPN can
be trained through gradient descent. Since both the RPN and Fast R-CNN share convo-
lutional filters, while minimising different loss functions, [17] propose alternating the
training of both networks until an acceptable performance is reached.
The use of Neural Networks to answer questions about images has seen great advances
in recent years. The end-to-end use of neural networks was shown to achieve
high performance in question answering and caption generation tasks, which spurred
the creation of various data sets to further test and develop these ideas.
[9] presents CLEVR, a data set of 3D rendered objects along with a set of example
questions. The aim of CLEVR is to provide a standard set of objects and questions to
be used as a benchmark. However, the small set of objects available in the images
and the artificial scenario lack the complexity that artificial neural networks
were already capable of coping with.
Instead of CLEVR, a data set that has been extensively used for object detection is
the Microsoft Common Objects in COntext (MS-COCO) [12], which has over 300,000
images with objects divided into 91 types and 11 super-categories (collections of types).
Each image has objects that could be easily recognised by a 4-year-old and are presented
in their common context, not artificially rendered or modified.
The Visual Question Answering (VQA) data set [2] has been widely used as a
testbed for question-answering systems since it expands the MS-COCO [12] data set
with more information and abstract scenes. Furthermore, for each image, VQA pro-
vides a set of at least three challenging questions to be answered by any AI method.
Although machine learning (ML) methods can be used to solve the problem pre-
sented by VQA with high accuracy, two issues became apparent. First, it is unknown
how much of the visual information presented in the images is really learned and used
by the ML methods to answer the proposed questions, as some ML methods that did not
use the image and considered only the question being asked had a good performance
answering those questions. Second, it has been shown that the VQA data set is biased
towards one type of answer in multiple choice questions (e.g., the same answer for dif-
ferent questions), which explains why ML methods that do not use the input images as
source of information are able to answer some questions with great accuracy [20].
The FOIL-COCO [20] data set is a set of annotations on top of MS-COCO
which provides a caption that can be either correct or have one wrong noun. This data
set has been proposed to test the ability of ML methods to comprehend and give meaning
to terms used when generating captions and proposes three tasks: 1) classify the caption
as correct or not; 2) if it is incorrect, find the mistake; and 3) correct the wrong word
in the caption. However, [20] shows that although the tested ML methods performed well
in VQA, they perform quite poorly on FOIL-COCO. [19] then extends the
FOIL-COCO data set with annotations for other attributes such as adjectives, adverbs
and prepositions (i.e., spatial relations).
While FOIL-COCO evaluates how much of the image the neural network really
comprehends, [21] demonstrates that the longer the caption gets, the less relevant the
image becomes to the captioning system since the caption generation depends more and
more on the prediction of the next word instead of the information from the image. [21]
also shows that objects in the image are the most relevant information used by caption
generation systems.
The works of [7] and [6] are the ones that relate most to our proposal. [6] proposes
Phrase Critic, a caption generator that verifies the relevance of a caption to a given image
and, thus, can be used both to generate a caption and to check if a caption is wrong.
By using an off-the-shelf localisation model called Visual Genome [10], it generates
possible descriptions of parts of a picture which are used to ground the caption to the
image (i.e., it checks if everything described in the caption also appears in the image).
The main difference between [6] and this work is how the caption is grounded in the
image. While [6] uses an LSTM to generate possible explanations for the image and to
compare them with the output of Visual Genome, we use NLP to process the caption and DL
for image processing (object detection). However, the use of Visual Genome instead
of a method for object detection is being studied as an improvement for our current
This paper uses FOIL-COCO as the data set for the experiments, but contrary to
the ML methods for CLEVR, VQA and also FOIL-COCO, which used only Artificial
Neural Networks (ANN) to solve the proposed problem, we combine ANN with natural
language processing, which can provide semantics to the words used in the caption,
making it easier to explain the answers to the three tasks proposed by [20].
3 Background
This section presents the items used in the construction of CAPTION, namely object
detection using ANN, NLP and WordNet.
3.1 Object detection using Artificial Neural Networks
3.2 Natural Language Processing
In this work, the image captions are manipulated mainly by two Natural Language
Processing (NLP) methods: tokenization and tagging [3].
Given a text, tokenizing it is the process of splitting the text into subtexts according
to a set of predefined delimiters that mark the end of a sentence (e.g., punctuation)
or of a single word (e.g., spaces). For example, considering the text “a man riding a
motorcycle”, tokenizing it by word would give us the list of words [“a”, “man”, “riding”,
“a”, “motorcycle”], which does not alter the order of the text or its meaning, but allows
each word to be processed either independently of the others or together with other words
within a particular text window. This process can be used in two ways when dealing
with captions. First, given a caption composed of more than one sentence, it is possible
to split it into sentences and check each one for errors independently. Second, given a
single sentence, it is possible to process it to obtain more information about the words
in it. One such process that can be used to derive more information from the words
present in the caption is tagging.
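As a minimal sketch of the two uses of tokenization described above, the example below splits a caption into sentences and a sentence into words. It uses regular expressions as a simple stand-in for a library tokenizer; the function names `split_sentences` and `tokenize_words` are hypothetical and are not part of the implementation described in this paper.

```python
import re


def split_sentences(text):
    """Split a text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize_words(sentence):
    """Return the words and punctuation marks of a sentence, in their original order."""
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = tokenize_words("a man riding a motorcycle")
print(tokens)
```

The order of the words is preserved, so later steps can still look at neighbouring tokens within a window if needed.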
Each word that constitutes a given sentence can be classified into lexical categories or
parts of speech (POS), such as nouns, verbs, adverbs, etc. POS-tagging (or just
tagging) is thus the process of inferring a lexical category for each word in a sentence.
Considering the same text that we used for tokenization, tagging it provides us with the list
of tuples [(“a”, article), (“man”, noun), (“riding”, verb), (“a”, article), (“motorcycle”,
noun)], giving a category for each word found.
By tokenizing and POS-tagging a caption, we can filter it for the type of information
that we want to process (e.g., nouns, verbs and adjectives) instead of having to deal with
the complete sentence and trying to infer a meaning to words that are not important for
the task at hand.
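The filtering step can be sketched as follows. The tagged tokens are written by hand for illustration, and the sketch assumes Penn Treebank style tags (in which noun tags start with "NN") rather than the category names used in the example above; `filter_nouns` is a hypothetical helper name.

```python
# Tagged tokens as a POS tagger might return them (hand-written for illustration,
# assuming Penn Treebank style tags).
tagged = [
    ("a", "DT"),
    ("man", "NN"),
    ("riding", "VBG"),
    ("a", "DT"),
    ("motorcycle", "NN"),
]


def filter_nouns(tagged_tokens):
    """Keep only the tokens whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return {word.lower() for word, tag in tagged_tokens if tag.startswith("NN")}


nouns = filter_nouns(tagged)
print(sorted(nouns))
```

Returning a set is a design choice made here so that the result can be compared directly with other sets of terms.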
Although these two methods provide some information about the structure of the
text, their use in caption analysis depends on the possible meanings of the caption words,
which are obtained by using WordNet [15].
3.3 WordNet
Created in the mid-1980s from the theories of human semantic organisation, WordNet
is a large lexical database of words arranged into a semantic network. Words that have
the same lexical category and represent the same concept are grouped into sets known
as cognitive synonyms or synsets. These synsets are interlinked considering lexical re-
lations and conceptual-semantic notions [14, 15].
The relations between synsets are most of the time defined by an IS-A relation
(hypernymy) but can also be defined in terms of a part-whole relation (meronymy) or even
as cross-POS relations between different parts of speech. An important aspect of these
relations is that, since synsets form a semantic network, it is possible to navigate it and
to calculate the similarity between synsets, such as the path similarity, using the shortest
path from one synset to another.
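To make the idea of path similarity concrete, the sketch below computes it over a tiny hand-written IS-A hierarchy. The hierarchy is a toy example, not WordNet's actual structure, and the score is assumed to follow the common definition 1 / (1 + length of the shortest path); the names `HYPERNYM`, `ancestors` and `path_similarity` are introduced only for this illustration.

```python
# Toy IS-A hierarchy: each term points to its (single) hypernym.
HYPERNYM = {
    "pizza": "dish",
    "dish": "food",
    "cake": "baked_goods",
    "baked_goods": "food",
    "food": "entity",
    "knife": "tool",
    "tool": "entity",
}


def ancestors(term):
    """Map every node on the IS-A chain from term to the root to its distance from term."""
    distances, steps = {}, 0
    while term is not None:
        distances[term] = steps
        term = HYPERNYM.get(term)
        steps += 1
    return distances


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length) between a and b, or None if unconnected."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None
    shortest = min(up_a[node] + up_b[node] for node in common)
    return 1 / (1 + shortest)


print(path_similarity("pizza", "cake"), path_similarity("pizza", "knife"))
```

In this toy hierarchy "pizza" is closer to "cake" than to "knife", which is the kind of ordering a similarity-based substitution relies on.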
Based on these concepts, the next section introduces the architecture proposed in this work.
4 Correction by Analyses, POS-Tagging and Interpretation of
Objects using Nouns (CAPTION)
One common approach for caption generation is to use Artificial Neural Networks
(ANN) end-to-end, i.e., the input of the ANN is an image and the output is the final
caption. However, when using the same approach to validate a caption, the ANN is incapable
of recognising eventual mistakes in the use of words, since it is still not known whether
an ANN (or any other deep learning tool) can deal with the meanings of words [20].
In order to verify the generated image captions, we propose the inclusion of an additional
step, which takes an image and its related generated caption, and compares the
information extracted from both.
The proposed architecture is presented as pseudocode in Algorithm 1 and as a
diagram in Figure 1, in which orange boxes represent the inputs, blue boxes represent
the steps performed using ANN for object detection, green boxes represent the NLP
tasks and yellow boxes represent simple set operations used to compare information from
the previous steps. Each of these steps is described in detail in the following sections.
4.1 Inputs for the architecture
The proposed architecture uses three pieces of information in order to validate the cap-
tion: the caption itself, the related image and a set of common terms used to represent
both caption and objects in the image.
While the caption and its related image are the inputs to be analysed, the set of common
terms provides a bridge between words in the caption and object labels obtained by the
image processing. For example, if the caption has “woman” in it but the object detection
can only recognise “person”, the set of common terms is used to map the
word “woman” to “person”.
This set of terms is used by both object detection and caption processing steps, as
will be described in the following sections.
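A minimal sketch of this mapping is shown below. The dictionary `TERM_MAP` is written by hand for illustration and is an assumption of this example, as is the choice to leave unmapped words unchanged; the paper does not prescribe either.

```python
# Hypothetical hand-written mapping from caption words to common terms.
TERM_MAP = {
    "woman": "person",
    "man": "person",
    "bike": "bicycle",
    "motorbike": "motorcycle",
}


def to_common_terms(words):
    """Map each word to its common term, leaving unmapped words unchanged."""
    return {TERM_MAP.get(word, word) for word in words}


print(sorted(to_common_terms({"woman", "cake"})))
```

Because both the caption nouns and the detected object labels pass through the same mapping, the resulting sets can be compared directly.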
4.2 Object detection
In this architecture, the object detection steps (blue boxes in Figure 1) use any off-
the-shelf method (in our current implementation, the Faster R-CNN) that is capable of
returning a set of objects that can be found in an image, such as the ones presented in
the Object Detection using Artificial Neural Networks section.
Given an image, we use an object detection method to provide a list of objects
recognised in it, without the need for information regarding the position of the object in
the image or even the degree of confidence of the object detection. This list of objects is
then mapped into a set Snames of words using the set of common terms. This set Snames
of terms recognised in the image is then used to analyse the caption along with the set
obtained from the caption processing, presented in the next section.
Fig. 1: A flowchart of the CAPTION architecture for checking captions. Its steps are: map nouns to common terms, map objects to common terms, and compare the sets of common terms obtained from the caption and from the image.
Fig. 2: Example of an image that can generate a wrong classification using intersection. (a) Original image. (b) Marked image.
Fig. 3: Example of an image that can generate a wrong classification using Simage. (a) Original image. (b) Marked image.
Input: an image and a caption
Output: foil classification and a dictionary with foil words and corrections
Snames ← detect_objects(image);
tokens ← tokenize(caption);
tags ← POS-tagging(tokens);
Snouns ← filter_nouns(tags);
Sinter ← Snouns ∩ Snames;
Scaption ← Snouns − Snames;
Simage ← Snames − Snouns;
Create corrections as a dictionary;
foreach foil ∈ Scaption do
    Add foil as a key to corrections;
    foreach correction ∈ Simage do
        if foil is similar to correction then
            corrections[foil] ← correction
        end
    end
end
if corrections is not empty then
    return True, corrections;
else
    return False
end
Algorithm 1: CAPTION’s pseudocode
4.3 Caption processing
The caption processing steps (green boxes in Figure 1) are responsible for transforming
the automatically generated caption into terms that can be compared to the detected
objects in the image.
Given the caption, the first step of this process is to recognise only the nouns in it,
which is the information that we expect to be related to the objects in the image. This set
Snouns is then mapped to the same set of common terms that were used in the objects
detected in the image, so that they can be compared in the next step, named Comparison
and classification.
4.4 Comparison and classification
The caption processing step provides a set of nouns defined in terms of the set of com-
mon terms (Snouns) and the object detection step provides a set of objects also defined
in terms of the set of common terms (Snames). The caption classification into correct
or not (yellow boxes in Figure 1) can be done simply in terms of some set operations
(i.e., intersection and difference between sets).
The intersection of both sets (Sinter = Snouns ∩ Snames) provides information
about which objects that were recognised in the image are also in the caption, thus
confirming that the object exists. However, we cannot state that the caption is correct
only with this information, since the wrong word in the caption can simply be another
object in the image. For example, in Figure 2 a man is on a motorcycle and a woman is
walking with a bicycle; a wrong caption can state that “the man is riding a bicycle”.
Thus, the intersection tells only which objects were used to generate the caption.
The set of objects that are in the image but not in the caption (Simage = Snames −
Snouns) provides information about which objects should be checked when trying to
correct the caption. Once again, we cannot state that the object should be in the caption
only because it appears in the image (e.g., in Figure 3, a woman is using a knife,
recognised in the image, but the caption could simply state that “a woman is cutting a cake”
without mentioning the term “knife”).
However, this set Simage can be used to check objects that are potentially
wrong in the caption (e.g., if the caption of Figure 3 stated that “the woman is cutting
a pizza” instead of a “cake”), and the properties of its objects can be considered in the
caption correction (e.g., exchanging “pizza” for “cake” in the same example).
The difference between the set of nouns in the caption and the set of objects in the
image (Scaption = Snouns − Snames) provides information about the objects described
in the caption but that were not detected in the image, giving us two possibilities:
1. The object detection method failed to find the object in the image and the caption
may be correct;
2. The object described in the caption is not in the image and the caption is wrong.
While the first possibility may be solved by using better object detection methods, the
second possibility is exactly the information that we are looking for. If an object de-
scribed in the caption cannot be found anywhere in the image, we can consider the
possibility that the caption is wrong and identify the mistake in it. Therefore, Scaption
informs us if there is something in the caption that is not in the image, possibly the
wrong word w1 ∈ Scaption. A possible correction for this caption would then be to
exchange w1 ∈ Scaption for a word w2 ∈ Simage that can be somewhat equivalent (e.g.,
in Figure 3, “pizza” and “cake” are both food and could be exchanged). Therefore, if
we are able to find a word w1 ∈ Scaption and a word w2 ∈ Simage that can substitute
w1, the caption can be inferred to be wrong and w2 should substitute w1.
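The three sets discussed in this section map directly onto Python's built-in set operations. The sketch below uses hand-written sets for the Figure 3 example, assuming that the nouns and the detected objects have already been mapped to the common terms (so that “woman” has become “person”); the variable names are chosen for this illustration.

```python
# Terms already mapped to the set of common terms (hand-written for illustration).
s_names = {"person", "knife", "cake"}  # objects detected in the image
s_nouns = {"person", "pizza"}          # nouns of the foil caption "the woman is cutting a pizza"

s_inter = s_nouns & s_names    # in both the caption and the image
s_caption = s_nouns - s_names  # in the caption but not detected in the image
s_image = s_names - s_nouns    # detected in the image but not in the caption

print(s_inter, s_caption, s_image)
```

Here "pizza" is the candidate foil word and "knife" and "cake" are the candidate replacements to be checked for equivalence.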
4.5 Current implementation
For this paper, we implemented the architecture described in Figure 1 using a few
Python libraries in a container environment.
We used the FOIL-COCO data set [20], which provides a set of captions for MS-COCO
images [12] that may or may not be correct, together with the images themselves, so that
each caption can be matched with its image. Since the captions had more terms than the
ones originally described in the images, we used the names and supercategories of
MS-COCO as the set of common terms.
The object detection process uses one of the pre-trained TensorFlow [1] models
available in its model zoo [8]¹. Since this application does not need to run in real
time, we employed Faster R-CNN [17] with a base network generated through neural
architecture search [24] on the MS-COCO data set. This choice was made considering
the metrics provided by [8] for the models available in the model zoo: Faster R-CNN
was selected since it provided the best reported mean Average Precision (mAP), despite
being the slowest, with a reported 1833 ms for processing a single image. The output of
this object detection method is a list of objects already defined in terms of MS-COCO
For processing the caption we used NLTK [3] with punkt and averaged perceptron tagger
to, respectively, tokenize and tag words in the caption. NLTK’s interface with Word-
Net [15] was used to calculate the similarity between nouns found in the caption and
the names defined in MS-COCO.
A very important aspect of our current implementation is that we use only
the nouns found in the caption, as recognised by NLTK. However, our
experiments used the complete FOIL-COCO data set as available in [20].
After detecting objects using TensorFlow and generating the set of nouns using
NLTK, the rest of the process used only operations of Python’s standard set type.
The next section presents the experiments used to test our proposal along with an
explanation of how this architecture was implemented in this paper.
5 Experiments
In this section we first present the tasks used to analyse CAPTION's performance and then
the hardware and software setup used during the experiments.
5.1 Tasks
To test the architecture proposed in this work, we used the three tasks described in [20]:
1. Classify the caption as right or wrong (foil);
2. Detect the wrong word, in case of a foil caption;
3. Correct the wrong word in the caption.
¹ The pre-trained models are available at research/object detection
The three tasks were accomplished by analysing Scaption and comparing it to
Simage, so that if Scaption is empty, the caption is considered correct. However, if
Scaption is not empty, each of its terms is compared to each of the terms in Simage to
find a suitable substitution (i.e., if w1 ∈ Scaption and w2 ∈ Simage have the same
MS-COCO supercategory, w2 is a suitable substitute for w1). If a suitable substitute is
found, the caption is considered wrong and both the term from Scaption and its substitute
are reported. However, if a suitable substitute is not found, the caption is considered
correct. Thus, in the proposed method, the three tasks are not solved independently,
since it finds the foil word (task 2) together with the correction (task 3), which results in
defining the caption as wrong or not (task 1).
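The decision procedure described above can be sketched as a single function that solves the three tasks together. This is an illustration under stated assumptions, not the implementation used in the experiments: the `SUPERCATEGORY` table is a small hand-written excerpt, `check_caption` is a hypothetical name, and taking the first matching candidate in alphabetical order is a choice made here only to keep the output deterministic.

```python
# Small hand-written excerpt of object names and their supercategories.
SUPERCATEGORY = {
    "person": "person",
    "bicycle": "vehicle",
    "motorcycle": "vehicle",
    "pizza": "food",
    "cake": "food",
    "knife": "kitchen",
}


def check_caption(s_nouns, s_names):
    """Return (is_foil, corrections) for caption nouns and detected object names."""
    s_caption = s_nouns - s_names  # nouns with no matching detected object
    s_image = s_names - s_nouns    # detected objects not mentioned in the caption
    corrections = {}
    for foil in sorted(s_caption):
        category = SUPERCATEGORY.get(foil)
        if category is None:
            continue  # unknown terms cannot be matched by supercategory
        for candidate in sorted(s_image):
            if SUPERCATEGORY.get(candidate) == category:
                corrections[foil] = candidate
                break
    # The caption is a foil only if some word has a suitable substitute.
    return bool(corrections), corrections


print(check_caption({"person", "pizza"}, {"person", "knife", "cake"}))
```

With these inputs, a caption mentioning only objects that were detected (for instance the Figure 2 caption, where both "person" and "bicycle" are in the image) is classified as correct, which reflects the limitation discussed for the intersection.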
Since we used the same data set as [20], the results obtained by the proposed method
for the three tasks, presented in the Results section, are compared with the results
provided in that paper for each task.
5.2 Setup
The experiments were executed on an Intel Core i7-8700 @ 4.6GHz with 16GB of
RAM and an NVidia GeForce GTX 1070 with 8GB of RAM running Debian GNU/Linux
Bullseye (the testing version as of this writing). All experiments were run in a container
using Docker and NVidia-Docker and an image based on tensorflow/tensorflow:1.14.0-gpu-py3-jupyter
with scikit-learn, opencv-python, nltk and dodo detector installed via
This section presents the results independently for the three tasks presented in the pre-
vious section, while using the results presented (and reproduced in this section) by [20]
and [7] as baseline for CAPTION. In the latter part of this section, an overall analysis
of the performance of the proposed method is presented.
6.1 Task 1: Caption classification
The first task consists of classifying whether the caption for a given image is correct or
a foil. Table 1 presents metrics related only to the proposed architecture. It is worth
pointing out that CAPTION achieves over 0.7 in precision and recall both for detecting
foil captions and for classifying captions as correct.
Caption Precision Recall F1-score
Correct 0.81 0.74 0.77
Foil 0.72 0.79 0.75
Table 1: Caption classification results for the proposed architecture.
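As a quick consistency check of Table 1, the F1-score can be recomputed from the reported precision and recall, assuming the usual definition of F1 as their harmonic mean. The function name `f1_score` is introduced only for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 1.
table_1 = {"Correct": (0.81, 0.74), "Foil": (0.72, 0.79)}

for label, (precision, recall) in table_1.items():
    print(label, round(f1_score(precision, recall), 2))
```

Under this assumption, the recomputed values agree with the F1-scores reported in the table for both classes.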
[20] performed the same task with four different algorithms that used both the caption
and the image to classify the caption (CNN + LSTM, IC-Wang, LSTM + norm I
and HieCoAtt), with Blind LSTM (a language-only method that did not use any information
from the image to classify the caption) and with two sets of classifications done
by humans: the first is based on the majority of votes to classify the caption and the
second is based on the agreement of all human judges on the classification. [7] used
the Phrase Critic for the same task. Table 2 presents the results obtained by [20]
and [7] along with the ones obtained with CAPTION.
Classifier          Overall(%)  Correct(%)  Foil(%)
Blind LSTM          55.62       86.20       25.04
CNN + LSTM          61.07       89.16       32.98
IC-Wang             42.21       38.98       45.44
LSTM + norm I       63.26       92.02       34.51
HieCoAtt            64.14       91.89       36.38
Phrase Critic       87.00       -           73.72
CAPTION             76.31       80.90       71.72
Human (majority)    92.89       91.24       92.52
Human (unanimity)   76.32       73.73       78.90
Table 2: Results for the classification task presented by [20], [7] and those obtained by
CAPTION. The correct classification results are not provided by [7].
Considering only the non-human classifiers presented by [20], CAPTION's overall
performance (76.31%) is over 12 percentage points above the best (HieCoAtt with
64.14%). However, when classifying captions as correct, it is only better than IC-Wang
(38.98%), with a performance about 11 percentage points below the best classifier
(LSTM + norm I with 92.02%). For foil classification, CAPTION's performance
(71.72%) is about 26 percentage points higher than the second best (IC-Wang with
45.44%). Comparing with the classification done by humans, CAPTION’s performance
is worse than the classification done by humans voting, but it is surprisingly close to the
unanimity classification for the three criteria.
Compared to CAPTION, the Phrase Critic's overall performance is about 10 percentage
points higher, while the foil classification is almost the same (2 percentage
points of difference). Although CAPTION’s performance is worse than that of Phrase
Critic, the results show how relevant grounding the caption in the image is when
checking for errors in captions and, most importantly, when finding the foil word in the
caption, since both Phrase Critic and CAPTION have the top non-human performance
in both tasks.
Overall, CAPTION’s performance is good across the three criteria, being close to
the human unanimity performance for the same task. CAPTION’s performance is worse
only than that of the Phrase Critic, but again, the two approaches are closely related,
which may indicate that comparing the information from the image to the caption may be
the reason for the good performance.
Given that CAPTION’s classification can only happen when solving the other two
tasks, it can be expected that the results for the other two tasks are at least as good as the
results for the first one, since performing poorly in identifying the error and correcting
it would directly affect CAPTION’s performance in classifying the caption as right or wrong.
6.2 Task 2: Error detection
Given a foil caption, the second task is identifying which word is wrong in the caption.
Although in the FOIL-COCO data set any correct word can be changed to be a foil word,
the current implementation of CAPTION considers only nouns as possible foil words.
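As an illustration of this noun-only restriction, the sketch below reduces a POS-tagged caption to its nouns, which are then the only candidate foil words. It assumes Penn Treebank tags of the kind produced by taggers such as the one in NLTK [3]; the function name `candidate_foil_words` and the hand-written tags are hypothetical and are not taken from CAPTION's implementation.

```python
from typing import List, Tuple


def candidate_foil_words(tagged_caption: List[Tuple[str, str]]) -> List[str]:
    """Return the nouns of a POS-tagged caption, in order of appearance.

    Penn Treebank noun tags are NN, NNS, NNP and NNPS, so a prefix test
    on "NN" is enough to select them.
    """
    return [word.lower() for word, tag in tagged_caption if tag.startswith("NN")]


# Hand-written (assumed) tagger output for "a woman cutting a pizza".
tagged = [
    ("a", "DT"),
    ("woman", "NN"),
    ("cutting", "VBG"),
    ("a", "DT"),
    ("pizza", "NN"),
]

nouns = candidate_foil_words(tagged)
print(nouns)
```

Under this restriction a foil word that replaces a verb or an adjective can never be selected, which is the limitation stated above.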
For this task, [20] used three algorithms (IC-Wang, LSTM + norm I and HieCoAtt),
the same human classification methods (voting and unanimity) and a random classifier
(represented by the label ‘Chance’ in the results). Again, [7] used Phrase Critic for this
task.
Results for this task are presented in Table 3. Although CAPTION only considers
nouns as possible foil words, we did not compare it with the methods of [20] that also
considered only nouns, since our method has no a priori information about which word
is a noun and classifies each word during the caption processing steps.
Identifier          Nouns (%)   All words (%)
Chance              23.25       15.87
IC-Wang             27.59       23.32
LSTM + norm I       26.32       24.25
HieCoAtt            38.79       33.69
Phrase Critic       -           73.72
CAPTION             -           71.72
Human (majority)    -           97.00
Human (unanimity)   -           73.60
Table 3: Results for the error detection task presented by [20] along with CAPTION.
Comparing only the non-human identifiers from [20], CAPTION’s performance
(71.72%) is more than twice that of the second best method (HieCoAtt with
33.69%). For the human identifiers, we obtained results analogous to those of the first
task: CAPTION’s performance is worse than voting (97%) but close to the unanimity
result (73.60%).
Compared with the Phrase Critic, once again our method’s performance is very close to
it, which may not be a surprise considering that both compare information from image
and caption to detect the foil word.
Considering the results of this task, and the fact that CAPTION only considers a
caption as foil when it detects a foil word and knows how to correct it, the results for the
next task are as important as the results for the first two tasks in the analysis of
CAPTION’s overall performance.
6.3 Task 3: Error correction
The last task is to correct the foil word in the caption. In the methods evaluated by [20],
the caption and the foil word were given as input to the algorithm. The algorithm’s
only task was, then, to replace the foil word with a correct one. In CAPTION, however,
since identifying and correcting the foil word is an important part of the classification
step, we did not tell our method which word is the foil word in each caption. Instead, we
present the results based on the corrections proposed by CAPTION while performing
the other two tasks.
For this task, [20] uses the same non-human methods as the second task (Chance,
IC-Wang, LSTM + norm I and HieCoAtt), but this time, no human method was used
for comparison. Once more, Phrase Critic was used for this task by [7]. Results are
presented in Table 4 for each of those four methods and also for CAPTION.
Method           All target words (%)
Chance            1.38
IC-Wang          22.16
LSTM + norm I     4.7
HieCoAtt          4.21
Phrase Critic    49.60
CAPTION          90.11
Table 4: Results for the word correction task presented by [20] along with CAPTION.
In this task, CAPTION’s performance (90.11%) is about four times the performance
of the best method presented by [20] (IC-Wang with 22.16%). Compared with Phrase
Critic (49.60%), CAPTION’s performance is superior by a large margin for the first time
in the three tasks. This may be due to the fact that CAPTION’s correction is based on
semantics, exchanging words related to each other, while Phrase Critic uses quantitative
metrics to indicate possible corrections.
Since performing poorly in this task would interfere with the results of the previous
tasks, CAPTION was expected to perform well in it. Treating the three tasks as a
single, larger task with three steps seems to have helped to develop a method with
good overall performance in each of the three tasks.
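The comparisons quoted in Sections 6.1 to 6.3 can be re-derived from the reported percentages. The sketch below only restates figures given in the text, each paired with the best non-human method of [20] for that criterion; the dictionary layout and the helper name `compare` are ours and purely illustrative.

```python
# (CAPTION score, best score among the non-human methods of [20]), in percent.
reported = {
    "overall classification": (76.31, 64.14),  # vs. HieCoAtt
    "foil classification": (71.72, 45.44),     # vs. IC-Wang
    "error detection": (71.72, 33.69),         # vs. HieCoAtt (all words)
    "error correction": (90.11, 22.16),        # vs. IC-Wang
}


def compare(caption_score: float, baseline_score: float):
    """Return the gap in percentage points and the ratio between two scores."""
    gap = round(caption_score - baseline_score, 2)
    ratio = round(caption_score / baseline_score, 2)
    return gap, ratio


summary = {task: compare(*scores) for task, scores in reported.items()}

for task, (gap, ratio) in summary.items():
    print(f"{task}: +{gap:.2f} points, {ratio:.2f}x the baseline")
```

The gaps of roughly 12 and 26 percentage points and the ratios of roughly 2 and 4 agree with the phrases "over 10 percentage points", "about 25 percentage points", "more than twice" and "about four times" used in the text.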
6.4 Discussion
The first aspect worth pointing out is that [20] considered the tasks as three separate
problems and used distinct, specialised methods to solve each task. In contrast, CAPTION
uses the results of the second and third tasks (error detection and correction) to
perform the first task (caption classification). This indicates, first, that it is possible
to solve the three tasks at the same time with the same method, without the need to
specialise a method for a certain task. Second, information about the second and third
tasks is important for solving the first task, since it gives a reason for the caption to be
classified as foil or not. Third, combining the three tasks gives us an explainable
output. For example, consider Figure 3 and the caption “a woman cutting a pizza”. By
combining the three tasks we can tell that the caption is foil (task 1) because there is
no pizza in the image (task 2), but the object detection method found a cake and, since
cake and pizza are both food, the woman may be cutting a cake instead of a pizza (task 3).
Thus, CAPTION not only has a better performance than the methods evaluated by [20],
but also gives a more explainable answer.
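The way the three tasks feed into one another can be sketched as follows. This is a minimal illustration under stated assumptions, not CAPTION's actual implementation: the names `Verdict`, `check_caption` and `same_category`, the category table and the threshold of 0.5 are all hypothetical, and the detector is assumed to return labels that can be matched directly against caption nouns.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Verdict(NamedTuple):
    is_foil: bool               # task 1: caption classification
    foil_word: Optional[str]    # task 2: error detection
    correction: Optional[str]   # task 3: error correction


# Hypothetical category table standing in for a semantic resource.
CATEGORY = {"pizza": "food", "cake": "food", "woman": "person", "knife": "utensil"}


def same_category(a: str, b: str) -> float:
    """Toy similarity: 1.0 when both words share a known category, else 0.0."""
    cat_a, cat_b = CATEGORY.get(a), CATEGORY.get(b)
    return 1.0 if cat_a is not None and cat_a == cat_b else 0.0


def check_caption(
    nouns: Iterable[str],
    detected: Iterable[str],
    similarity: Callable[[str, str], float] = same_category,
    threshold: float = 0.5,
) -> Verdict:
    """Flag a caption as foil only if a noun is missing from the image
    and a sufficiently similar detected object can replace it."""
    objects = set(detected)
    for noun in nouns:
        if noun in objects or not objects:
            continue  # noun is grounded, or there is nothing to compare with
        # Sorting makes the choice deterministic when similarities tie.
        best = max(sorted(objects), key=lambda obj: similarity(noun, obj))
        if similarity(noun, best) >= threshold:
            return Verdict(True, noun, best)
    return Verdict(False, None, None)


# "a woman cutting a pizza" with and without the cake being detected.
with_cake = check_caption(["woman", "pizza"], ["woman", "cake", "knife"])
without_cake = check_caption(["woman", "pizza"], ["woman", "knife"])
print(with_cake)
print(without_cake)
```

The first call mirrors the Figure 3 example: the verdict carries the classification together with the foil word and its proposed replacement, which is what makes the answer explainable. The second call shows that, in this sketch, a missed detection leaves the caption classified as correct.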
However, our method has some limitations, which help to explain the
results. First, CAPTION depends on the performance of the object detection method to
correctly classify the caption. For example, using the same Figure 3 with the same foil
caption of “a woman cutting a pizza”, if the object detection method fails to recognise
the cake in the image, CAPTION will not be able to find a suitable substitute for pizza
and cannot tell that the caption is foil. Thus, it would incorrectly classify the caption as
correct.
Second, the foil word in the caption can be some other object in the image, which
also causes CAPTION to misclassify the caption. For example, consider Figure 2
with the caption “a man riding a bicycle” and an object detection method that detects
the man riding the motorcycle and also the woman with the bicycle in the back of the
image. In this case, since both objects are in the image and the current implementation
of CAPTION only considers the objects but not the relations between them, it would not
be able to find the foil word and would classify the caption as correct.
Third, in the current implementation, a substitute for a foil word is chosen
by finding the detected object most similar to the foil word; thus CAPTION’s
error correction is only as good as the similarity function used to compare the objects
in the caption and image. As a result, CAPTION can find mistakes in correct captions and
also propose incorrect corrections for foil captions.
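To make the dependence on the similarity function concrete, the sketch below picks a substitute with a path-based similarity over a small hand-built taxonomy. The taxonomy, the function names and the choice of measure (the inverse of one plus the number of edges between two words) are assumptions made for illustration; this section does not state which WordNet similarity measure CAPTION uses.

```python
# Hypothetical is-a taxonomy: each word maps to its parent (None for the root).
PARENT = {
    "entity": None,
    "food": "entity",
    "vehicle": "entity",
    "pizza": "food",
    "cake": "food",
    "bicycle": "vehicle",
    "motorcycle": "vehicle",
}


def ancestors(word):
    """Path from the word up to the root, with the word itself first."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    up_a, up_b = ancestors(a), ancestors(b)
    steps_from_b = {node: i for i, node in enumerate(up_b)}
    for steps_from_a, node in enumerate(up_a):
        if node in steps_from_b:  # first shared ancestor
            return 1.0 / (1 + steps_from_a + steps_from_b[node])
    return 0.0  # no shared ancestor (cannot happen in a single-rooted tree)


def best_substitute(foil_word, detected_objects):
    """Detected object most similar to the foil word."""
    return max(detected_objects, key=lambda obj: path_similarity(foil_word, obj))


print(path_similarity("pizza", "cake"))
print(path_similarity("pizza", "bicycle"))
print(best_substitute("pizza", ["bicycle", "cake"]))
```

With a different taxonomy or measure the ranking of candidates can change, which is why both false alarms on correct captions and wrong corrections on foil captions trace back to this function.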
Although these three problems are relevant to CAPTION’s performance, improvements
in methods for the first and third problems would improve the overall performance
of CAPTION. For instance, as object detection methods improve, the chance of not
recognising an object decreases and the first problem becomes less likely to occur. For the
second problem, a solution is to use more knowledge about the image to confirm the
information provided by the caption (e.g., knowing that a man is riding a motorcycle
and not a bicycle in the image implies that the caption “a man riding a bicycle” is wrong).
Lastly, [20] suggests that the reason for the poor performance of the methods in the
three tasks they propose is the lack of word meaning available to a neural network,
which makes it unable to detect the error and to correct it. Instead of adding meaning
to words and objects before they are processed by the ANN, so that the ANN has
information about the meaning of words and may use it to find errors and correct the text, in
CAPTION we add meaning to the output of the ANN in a way that does not change what
happens inside the ANN. Furthermore, we use a well-defined and accepted lexical resource
(i.e., WordNet) to provide these meanings to words from both image and caption and, since
WordNet is constructed as a semantic network, we are capable of performing some
operations with those meanings (e.g., calculating similarities). Thus, we direct each method
(ANN and NLP) to its most suitable problem (i.e., object detection and text processing,
respectively) instead of trying to use a single method to solve the three tasks, as
done by [20] using ANN methods. As a result, we have a method with an overall better
performance and also a solution that can be explained.
When compared with Phrase Critic, CAPTION’s performance is worse for the first
task (caption classification), almost identical in the second one (foil word detection)
and better in the last one (foil word correction), although the two methods have a lot
in common. This shows that comparing (grounding) the caption to the image is an
important step for the classification and also provides information for detecting the
foil word. However, semantics still seems to play an important role when correcting the
foil word.
7 Conclusion
This paper proposes an architecture to solve the three tasks proposed by [20] for
automatic caption generation for images, which are to 1) classify the caption as correct or not,
2) detect the foil word in the caption and 3) correct the caption.
Our architecture combines object detection of the image with natural language
processing of the caption to search for the incorrect word in the caption and to suggest a
correction. With this information, the caption is classified as correct or not and the
corrections are reported if needed.
Results show that CAPTION has a better overall performance than the methods
tested by [20] and in some tasks it is comparable to human performance. However,
there are some drawbacks in our approach such as objects not being recognised by the
detector, or the use of poor word similarity functions.
Future work consists of improving the caption processing by using more
information, not just nouns, and also considering object relations in the image so that it is
possible to overcome some of the shortcomings of the current implementation. These
improvements will allow us to test our method in the data set provided by [19].
8 Acknowledgements
References
1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis,
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia,
Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S.,
Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker,
P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke,
M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems
(2015), software available from
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual
Question Answering. In: International Conference on Computer Vision (ICCV) (2015)
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text
with the Natural Language Toolkit. O’Reilly Media (2009)
4. Girshick, R.: Fast r-cnn. In: 2015 IEEE International Conference on Computer Vision
(ICCV). pp. 1440–1448 (Dec 2015)
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object
detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and
Pattern Recognition. pp. 580–587 (June 2014)
6. Hendricks, L.: Visual Understanding through Natural Language. Ph.D. thesis, EECS Depart-
ment, University of California, Berkeley (May 2019),
7. Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explanations. In: Ferrari, V.,
Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. pp. 269–286.
Springer International Publishing, Cham (2018)
8. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z.,
Song, Y., Guadarrama, S., Murphy, K.: Speed/accuracy trade-offs for modern convolutional
object detectors. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 3296–3297 (July 2017)
9. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.:
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning.
CoRR abs/1612.06890 (2016),
10. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y.,
Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Visual genome: Connecting language and
vision using crowdsourced dense image annotations. In: CoRR (2016),
11. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S.,
Popov, S., Malloci, M., Duerig, T., Ferrari, V.: The open images dataset v4: Unified im-
age classification, object detection, and visual relationship detection at scale. arXiv preprint
arXiv:1811.00982 (2018)
12. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick,
C.L.: Microsoft coco: Common objects in context. European Conference on Computer
Vision (ECCV) (2014), /se3/wp-content/uploads/2014/09/coco eccv.pdf,
13. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single
shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision
– ECCV 2016, pp. 21–37. Springer International Publishing, Cham (2016)
14. Pease, A.: Ontology: A Practical Guide. Articulate Software Press (2011)
15. Princeton University: About WordNet. (2010)
16. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time
object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 779–788 (June 2016)
17. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with
region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence
39(6), 1137–1149 (June 2017)
18. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition
Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
19. Shekhar, R., Pezzelle, S., Herbelot, A., Nabi, M., Sangineto, E., Bernardi, R.: Vision and lan-
guage integration: Moving beyond objects. In: IWCS 2017 — 12th International Conference
on Computational Semantics — Short papers (2017),
20. Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., Bernardi,
R.: FOIL it! find one mismatch between image and language caption. In: Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). pp. 255–265. Association for Computational Linguistics, Vancouver, Canada (Jul
2017)
21. Tanti, M., Gatt, A., Camilleri, K.: Quantifying the amount of visual information used by
neural caption generators. In: Leal-Taixé, L., Roth, S. (eds.) Computer Vision – ECCV
2018 Workshops: Proceedings of the Workshop on Shortcomings in Vision and Language.
pp. 124–132. Springer, Munich, Germany (2019),
978-3-030-11018-5 11
22. Uijlings, J., van de Sande, K., Gevers, T., Smeulders, A.: Selective search for object recogni-
tion. International Journal of Computer Vision (2013),
23. Zhao, Z., Zheng, P., Xu, S., Wu, X.: Object detection with deep learning: A review. IEEE
Transactions on Neural Networks and Learning Systems pp. 1–21 (2019)
24. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scal-
able image recognition. The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (June 2018)