CAPTION: Correction by Analyses, POS-Tagging and
Interpretation of Objects using only Nouns
No Author Given
No Institute Given
Abstract. Recently, Deep Learning (DL) methods have shown an excellent per-
formance in image captioning and visual question answering. However, despite
their performance, DL methods do not learn the semantics of the words that are
being used to describe a scene, making it difﬁcult to spot incorrect words used
in captions or to interchange words that have similar meaning. This work pro-
poses the combination of a DL method for object detection and natural language
processing of the caption to validate image captions. We test our method on the
FOIL-COCO data set, since it provides correct and incorrect captions for various
images using only objects represented in the MS-COCO image data set. Results
show that our method has a good overall performance that nears human
performance in some cases.
1 Introduction
Recently, Deep Learning (DL) methods have shown excellent performance in image
captioning and visual question answering. However, it has also been shown that, despite
their performance, DL methods do not learn the semantics of the words that are being
used to describe a scene, making it difficult to spot incorrect terms used in captions or
to substitute words with their synonyms. The method we propose uses natural language
processing (NLP) to add meaning to terms used in captions along with a DL method
for object recognition, in order to maintain consistency in image captioning. We use
the FOIL-COCO data set as a test bed since it provides both correct and
incorrect captions for various images using only objects represented in the MS-COCO
image data set.
The FOIL-COCO data set is a collection of annotations on top of MS-COCO,
which provides a caption that can be either correct or have one wrong word. This data
set has been proposed to test the ability of ML methods to comprehend and give meaning to
terms used when generating captions, allowing the following three tasks: 1) classify the
caption as correct or not; 2) if it is incorrect, find the mistake; and 3) fix the caption
by replacing the wrong word. It has been shown that, although the tested ML methods
performed very well in Visual Question Answering (VQA) tasks, they perform poorly
in FOIL-COCO. The solution for this problem calls for an appropriate combination of
knowledge representation and reasoning methods with deep learning strategies in order
to make sense of the captions, connect the words with objects and apply inferences to
find the appropriate word to fix the wrong caption.
In contrast to what is traditionally done in this area, this work uses Deep Learning
models not as a single method for captioning, but to extract information from the image
(e.g., objects, actions, relations) that could be used by a high-level reasoning system.
The result of performing NLP on the caption is then applied to describe facts about the
world and to reason about the image, maintaining consistency of the caption with
respect to the objects in the images and the relations depicted. The results of this project
are expected to have a positive impact on advanced sensing and perception, enhancing
situation awareness for both users and automated systems, while also promoting rapid
understanding of the environment and enabling quick and effective decision making.
The next section presents in more detail the VQA problem and automatic caption
generation, along with the FOIL-COCO data set. We then describe object detection
using DL and caption processing using NLP (in the Background section), which
provide the set of tools needed to develop the CAPTION algorithm, which is first proposed
in the Proposal section. Section Experiments presents the tests executed to evaluate
CAPTION, followed by Results and Discussion of the overall performance of this
algorithm.
2 Related Work
The goal of object detection is to locate objects pertaining to instances of speciﬁc classes
in visual inputs, such as images and videos. Although object detection has been a promi-
nent task in the ﬁeld of computer vision, recent advances in Deep Learning and espe-
cially in Neural Network Architectures specialised in processing visual inputs, such as
Convolutional Neural Networks (CNN), have paved the way for the creation of new
methods which vastly improve results in existing image classiﬁcation and object detec-
tion challenges . Furthermore, multiple data sets of annotated images are available
online [12, 18, 11], enabling new models to be easily trained, tested and compared under
Object detection can be separated into two subtasks: object localisation and clas-
siﬁcation. For solving the classiﬁcation task, many recent techniques employ convolu-
tional layers as one of their building blocks  in order to learn convolutional ﬁlters
for object classiﬁcation. While localisation was initially achieved through the use of
standalone region and object proposal algorithms, recent methods are able to learn how
to achieve effective proposals by ﬁne tuning region proposals through processes such
as bounding box regression , or by performing classiﬁcation in multiple regions of
an input image, at different scales and aspect ratios [13, 16].
Recent successful object detection techniques include You Only Look Once (YOLO),
Single Shot MultiBox Detector (SSD) and Faster R-CNN.
YOLO divides the input image into an N × N grid. It then generates B bounding
boxes for each cell and predicts C class probabilities for each bounding box. Each
of the predictions is composed of five values, x, y, h, w, k, where x and y represent
the coordinates of the bounding box centre, h and w represent the bounding box's
height and width and k is a class probability. At training time, each of the C predictors
for a bounding box is specialised in a given class. This specialisation is encoded in a
multipart loss function, whose sum squared error is minimised via gradient descent.
SSD  employs a CNN, called the base network, to generate feature maps, as well
as additional convolutional layers of multiple sizes to detect objects in multiple scales.
Multiple regions of different scales and aspect ratios are evaluated in the target image
in order to accomplish detection. Training is also done via gradient descent and may be
boosted by techniques such as hard negative mining and data augmentation strategies.
The loss function is a weighted sum of localisation and classiﬁcation loss.
R-CNN employs selective search to extract a fixed number of 2000 region
proposals from an input image. All region proposals are normalised to the same size via
warping and used as input to a CNN, which learns a fixed-length feature vector for each
one. These vectors are then used as input to several Support Vector Machine (SVM)
classifiers, each one trained to classify objects of a specific class.
Fast R-CNN improves upon R-CNN by processing all region proposals from
an input image in a single forward pass, as well as by replacing multiple SVM
classifiers with a single fully-connected, softmax output layer for classification of region
proposals.
Lastly, Faster R-CNN uses the same classification strategy as Fast R-CNN,
while introducing a Region Proposal Network (RPN) for object localisation. The RPN
uses convolutional filters to produce region proposals represented by x, y, h, w, k, where
x and y represent the coordinates of a region proposal, h and w represent its height and
width and k is an “objectness” score. The bounding box predictions of the RPN can
be trained through gradient descent. Since both the RPN and Fast R-CNN share
convolutional filters, while minimising different loss functions, its authors propose
alternating the training of both networks until an acceptable performance is reached.
The use of Neural Networks to answer questions about images has seen great
advances in recent years. The end-to-end use of neural networks was shown to achieve
high performance in question answering and caption generation tasks, which fostered
the creation of various data sets to further test and develop these ideas.
CLEVR is a data set of 3D rendered objects along with a set of example
questions. The aim of CLEVR is to provide a standard set of objects and questions to
be used as a benchmark. However, the small set of objects available in the images
and the artificial scenario created lacked the complexity that artificial neural networks
were already capable of coping with.
Instead of CLEVR, a data set that has been extensively used for object detection is
Microsoft Common Objects in COntext (MS-COCO), which has over 300,000
images with objects divided into 91 types and 11 super-categories (collections of types).
Each image has objects that could be easily recognised by a four-year-old and are presented
in their common context, not artificially rendered or modified.
The Visual Question Answering (VQA) data set  has been widely used as a
testbed for question-answering systems since it expands the MS-COCO  data set
with more information and abstract scenes. Furthermore, for each image, VQA pro-
vides a set of at least three challenging questions to be answered by any AI method.
Although machine learning (ML) methods can be used to solve the problem pre-
sented by VQA with high accuracy, two issues became apparent. First, it is unknown
how much of the visual information presented in the images is really learned and used
by the ML methods to answer the proposed questions, as some ML methods that did not
use the image and considered only the question being asked had a good performance
answering those questions. Second, it has been shown that the VQA data set is biased
towards one type of answer in multiple choice questions (e.g., the same answer for
different questions), which explains why ML methods that do not use the input images as
a source of information are able to answer some questions with great accuracy.
The FOIL-COCO data set is a set of annotations on top of MS-COCO
which provides a caption that can be either correct or have one wrong noun. This data
set has been proposed to test the ability of ML methods to comprehend and give meaning to
terms used when generating captions and proposes three tasks: 1) classify the caption
as correct or not; 2) if it is incorrect, find the mistake; and 3) correct the wrong word
in the caption. However, it has been shown that, although the tested ML methods performed
very well in VQA, they perform quite poorly in FOIL-COCO. A later work then extends the
FOIL-COCO data set with annotations for other attributes such as adjectives, adverbs
and prepositions (i.e., spatial relations).
While FOIL-COCO evaluates how much of the image the neural network really
comprehends,  demonstrates that the longer the caption gets, the less relevant the
image becomes to the captioning system since the caption generation depends more and
more on the prediction of the next word instead of the information from the image. 
also shows that objects in the image are the most relevant information used by caption
The works of  and  are the ones that relate most to our proposal.  proposes
Phrase Critic, a caption generator that veriﬁes the relevance of a caption to a given image
and, thus, can be used both to generate a caption and to check if a caption is wrong.
By using an off-the-shelf localisation model called Visual Genome , it generates
possible descriptions of parts of a picture which are used to ground the caption to the
image (i.e., it checks if everything described in the caption also appears in the image).
The main difference between  and this work is how the caption is grounded in the
image. While  uses an LSTM to generate possible explanations to the image and to
compare with the output of Visual Genome, we use NLP to process the caption and DL
for image processing (object detection). However, the use of Visual Genome instead
of a method for object detection is being studied as an improvement for our current
This paper uses FOIL-COCO as the data set for the experiments but, contrary to
the ML methods for CLEVR, VQA and also FOIL-COCO, which used only Artificial
Neural Networks (ANN) to solve the proposed problem, we combine ANN with natural
language processing, which can provide semantics to the words used in the caption,
making it easier to explain the answers to the three FOIL-COCO tasks.
3 Background
This section presents the items used in the construction of CAPTION, namely object
detection using ANN, NLP and WordNet.
3.1 Object detection using Artiﬁcial Neural Networks
3.2 Natural Language Processing
In this work, the image captions are manipulated mainly by two Natural Language
Processing (NLP) methods: tokenization and tagging .
Given a text, tokenizing it is the process of splitting the text into subtexts
considering a set of predefined delimiters that mark the end of a sentence (e.g., punctuation)
or of a single word (e.g., spaces). For example, considering the text “a man riding a
motorcycle”, tokenizing it by word would give us the list of words [“a”, “man”, “riding”,
“a”, “motorcycle”], which does not alter the order of the text or its meaning, but allows
each word to be processed either independently of the others or together with other words
within a particular text window. This process can be used in two ways when dealing
with captions. First, given a caption composed of more than one sentence, it is possible
to split each sentence and check for errors independently. Second, given a single sen-
tence, it is possible to process it to obtain more information about the words in it. One
such process that can be done to derive more information from the words present in the
caption is tagging.
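The two uses of tokenization described above can be sketched as follows. This is a minimal illustration only: it uses regular expressions as a stand-in for a trained tokenizer, and the function names `tokenize_words` and `split_sentences` are assumptions made for this example rather than part of any library.

```python
import re

def tokenize_words(text):
    """Split a text into word tokens, keeping their original order.

    Stand-in for a trained tokenizer: a word is a run of letters, digits
    or apostrophes, and any other non-space character is its own token.
    """
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

caption = "a man riding a motorcycle"
tokens = tokenize_words(caption)
print(tokens)  # ['a', 'man', 'riding', 'a', 'motorcycle']

# A caption with more than one sentence can be checked sentence by sentence.
for sentence in split_sentences("A man rides a bike. A dog runs."):
    print(tokenize_words(sentence))
```

A trained tokenizer handles abbreviations and contractions that this sketch does not, which is why it should be read only as an illustration of the input and output of the step.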
Each word that constitutes a given sentence can be classified into lexical categories or
parts of speech (POS), such as nouns, verbs, adverbs, etc. POS-tagging (or just
tagging) is thus the process of inferring a lexical category for each word in a sentence.
Considering the same text that we used for tokenization, tagging it provides us with the list
of tuples [(“a”, article), (“man”, noun), (“riding”, verb), (“a”, article), (“motorcycle”,
noun)], giving a category for each word found.
By tokenizing and POS-tagging a caption, we can filter it for the type of information
that we want to process (e.g., nouns, verbs and adjectives) instead of having to deal with
the complete sentence and trying to infer a meaning for words that are not important for
the task at hand.
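Filtering a tagged caption for nouns can be sketched as below. The sketch assumes Penn Treebank tags (the tag set used by NLTK's default English tagger), in which every noun tag starts with "NN"; the tagged list is written by hand for illustration and is not the output of a tagger, and `filter_nouns` is a hypothetical helper name.

```python
def filter_nouns(tagged_tokens):
    """Keep only the tokens whose tag marks them as nouns.

    Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
    """
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Hand-written tags for illustration (assumed, not tagger output).
tagged = [("a", "DT"), ("man", "NN"), ("riding", "VBG"),
          ("a", "DT"), ("motorcycle", "NN")]

nouns = filter_nouns(tagged)
print(nouns)  # ['man', 'motorcycle']
```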
Although these two methods provide some information about the structure of the
text, their use in caption analysis depends on a possible meaning of the caption words,
which is obtained by using WordNet.
3.3 WordNet
Created in the mid-1980s from theories of human semantic organisation, WordNet
is a large lexical database of words arranged into a semantic network. Words that have
the same lexical category and represent the same concept are grouped into sets known
as cognitive synonyms or synsets. These synsets are interlinked considering lexical
relations and conceptual-semantic notions [14, 15].
The relations between synsets are most of the time defined by an IS-A relation
(hypernymy) but can also be defined in terms of a part-whole relation (meronymy) or even
as cross-POS relations, which link words across parts of speech. An important aspect of
these relations is that, since synsets form a semantic network, it is possible to navigate it
and to calculate the similarity between synsets, such as the path similarity, which uses the
shortest path from one synset to another.
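To make the idea of path similarity concrete, the sketch below computes it over a tiny hand-made IS-A hierarchy. The taxonomy in `HYPERNYM` is hypothetical and far smaller than WordNet, and the score 1 / (1 + shortest path length) is one common definition of path similarity, assumed here for illustration.

```python
# Hypothetical toy taxonomy: child -> parent (IS-A relation).
HYPERNYM = {
    "pizza": "food",
    "cake": "food",
    "food": "entity",
    "knife": "tool",
    "tool": "entity",
}

def ancestors(term):
    """Return {ancestor: distance} for a term, including itself at 0."""
    dist, steps = {}, 0
    while term is not None:
        dist[term] = steps
        term = HYPERNYM.get(term)
        steps += 1
    return dist

def path_similarity(a, b):
    """1 / (1 + length of the shortest IS-A path between a and b)."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return 0.0
    shortest = min(up_a[c] + up_b[c] for c in common)
    return 1.0 / (1 + shortest)

print(path_similarity("pizza", "cake"))   # path of length 2 through "food"
print(path_similarity("pizza", "knife"))  # path of length 4 through "entity"
```

In this toy hierarchy "pizza" is closer to "cake" than to "knife", which is the kind of signal that can be used to decide whether one term is a plausible substitute for another.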
Based on these concepts, the next section introduces the architecture proposed in this
work.
4 Correction by Analyses, POS-Tagging and Interpretation of
Objects using Nouns (CAPTION)
One common approach for caption generation is to use Artiﬁcial Neural Networks
(ANN) from end-to-end, i.e. the input of the ANN is an image and the output is the ﬁnal
caption. However, when using the same approach to validate a caption, the ANN is
incapable of recognising possible mistakes in the use of words, since it is still not known
how an ANN (or any other deep learning tool) could deal with the meanings of words.
In order to verify the generated image captions, we propose the inclusion of an
additional step, which takes an image and its related generated caption, and compares the
information extracted from both.
The proposed architecture is presented as pseudocode in Algorithm 1 and as a
diagram in Figure 1. In the diagram, the inputs are shown as orange boxes, the steps
performed using ANN for object detection as blue boxes, the NLP tasks as green boxes
and the simple set operations used to compare information from the previous steps as
yellow boxes. Each of these steps is described in detail in the following sections.
4.1 Inputs for the architecture
The proposed architecture uses three pieces of information in order to validate the cap-
tion: the caption itself, the related image and a set of common terms used to represent
both caption and objects in the image.
While the caption and its related image are the inputs to be analysed, the common
terms provide a bridge between words in the caption and object labels obtained by the
image processing. For example, if the caption has “woman” in it but the object detection
can only recognise “person”, the set of common terms is used to map the
word “woman” to “person”.
This set of terms is used by both object detection and caption processing steps, as
will be described in the following sections.
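The role of the set of common terms can be illustrated with a small lookup table. The table `COMMON_TERMS` below is a hypothetical excerpt written for this example, and dropping words that have no known mapping is an assumption of the sketch, not a documented behaviour of the architecture.

```python
# Hypothetical excerpt of a common-term table: caption word -> object label.
COMMON_TERMS = {
    "man": "person", "woman": "person", "person": "person",
    "motorbike": "motorcycle", "motorcycle": "motorcycle",
    "bike": "bicycle", "bicycle": "bicycle",
}

def to_common_terms(words):
    """Map words to common terms, dropping words with no known mapping."""
    return {COMMON_TERMS[w.lower()] for w in words if w.lower() in COMMON_TERMS}

caption_terms = to_common_terms(["woman", "bicycle", "street"])
print(caption_terms)
```

Because both the caption nouns and the detector labels are expressed in the same vocabulary after this step, they can be compared directly with set operations.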
4.2 Object detection
In this architecture, the object detection steps (blue boxes in Figure 1) use any off-
the-shelf method (in our current implementation, the Faster R-CNN) that is capable of
returning a set of objects that can be found in an image, such as the ones presented in
the Object Detection using Artiﬁcial Neural Networks section.
Given an image, we use an object detection method to provide a list of objects
recognised in it, without the need for information regarding the position of the object in
the image or even the degree of confidence of the object detection. This list of objects is
then mapped into a set Snames of words using the set of common terms. The set Snames
of terms recognised in the image is then used to analyse the caption along with the set
obtained from the caption processing, presented in the next section.
Fig. 1: A flowchart of the CAPTION architecture for checking captions.
Fig. 2: Example of image that can generate a wrong classification using intersection. (a) Original image. (b) Marked image.
Fig. 3: Example of image that can generate a wrong classification using Simage. (a) Original image. (b) Marked image.
Input: an image and a caption
Output: foil classification and a dictionary with foil words and corrections
Snames ← detect objects(image);
tokens ← tokenize(caption);
tags ← tag(tokens);
Snouns ← filter nouns(tags);
Sinter ← Snouns ∩ Snames;
Scaption ← Snouns − Snames;
Simage ← Snames − Snouns;
Create corrections as a dictionary;
foreach foil ∈ Scaption do
    Add foil as a key to corrections;
    foreach correction ∈ Simage do
        if foil is similar to correction then
            corrections[foil] ← correction
        end
    end
end
if corrections is not empty then
    return True, corrections;
else
    return False
end
Algorithm 1: CAPTION’s pseudo code
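The comparison part of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the two sets are taken as already computed (object detection and caption tagging are outside the sketch), the similarity test is passed in as a function, and a foil is stored in the dictionary only when a similar term is found, which follows the decision rule used in the experiments. The names `check_caption` and `toy_similar` are hypothetical.

```python
def check_caption(s_names, s_nouns, is_similar):
    """Classify a caption as foil and propose corrections.

    s_names: common terms for the objects detected in the image.
    s_nouns: common terms for the nouns found in the caption.
    is_similar: function (foil, candidate) -> bool.
    Returns (is_foil, corrections).
    """
    s_caption = s_nouns - s_names   # in the caption, not in the image
    s_image = s_names - s_nouns     # in the image, not in the caption
    corrections = {}
    for foil in sorted(s_caption):
        for candidate in sorted(s_image):
            if is_similar(foil, candidate):
                # As in the pseudocode, a later match overwrites an earlier one.
                corrections[foil] = candidate
    if corrections:
        return True, corrections
    return False, {}

def toy_similar(a, b):
    """Toy similarity for the example: only pizza and cake are similar."""
    return {a, b} == {"pizza", "cake"}

is_foil, fixes = check_caption({"person", "knife", "cake"},
                               {"person", "pizza"}, toy_similar)
print(is_foil, fixes)  # True {'pizza': 'cake'}
```

Passing the similarity test as a parameter keeps the sketch independent of how similarity is measured, for instance through WordNet or through shared categories.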
4.3 Caption processing
The caption processing steps (green boxes in Figure 1) are responsible for transforming
the automatically generated caption into terms that can be compared to the detected
objects in the image.
Given the caption, the first step of this process is to recognise only the nouns in it,
which is the information that we expect to be related to the objects in the image. This set
Snouns is then mapped to the same set of common terms that were used for the objects
detected in the image, so that they can be compared in the next step, named Comparison
and classification.
4.4 Comparison and classiﬁcation
The caption processing step provides a set of nouns deﬁned in terms of the set of com-
mon terms (Snouns) and the object detection step provides a set of objects also deﬁned
in terms of the set of common terms (Snames). The caption classiﬁcation into correct
or not (yellow boxes in Figure 1) can be done simply in terms of some set operations
(i.e., intersection and difference between sets).
The intersection of both sets (Sinter =Snouns ∩ Snames) provides information
about which objects that were recognised in the image are also in the caption, thus
conﬁrming that the object exists. However, we cannot state that the caption is correct
only with this information since the wrong word in the caption can be simply another
object in the image (e.g., in Figure 2 a man is on a motorcycle and a woman is walking
with a bicycle. A wrong caption can state that “the man is riding a bicycle”). Thus, the
intersection tells only which objects were used to generate the caption.
The set of objects that are in the image but not in the caption (Simage =Snames −
Snouns) provides information about which objects should be checked when trying to
correct the caption. Once again, we cannot state that the object should be in the caption
only because it appears in the image (e.g., in Figure 3, a woman is using a knife, recog-
nised in the image, but the caption could only state that “a woman is cutting a cake”
without mentioning the term “knife”).
However, this set Simage can be used to check for possible objects that are potentially
wrong in the caption (e.g., if the caption of Figure 3 stated that “the woman is
cutting a pizza” instead of a “cake”) and their properties can be considered in the caption
correction (e.g., exchanging “pizza” for “cake” in the same example).
The difference between the set of nouns in the caption and the set of objects in the
image (Scaption =Snouns−Snames ) provides information about the objects described
in the caption but that were not detected in the image, giving us two possibilities:
1. The object detection method failed to ﬁnd the object in the image and the caption
may be correct;
2. The object described in the caption is not in the image and the caption is wrong.
While the ﬁrst possibility may be solved by using better object detection methods, the
second possibility is exactly the information that we are looking for. If an object de-
scribed in the caption cannot be found anywhere in the image, we can consider the
possibility that the caption is wrong and identify the mistake in it. Therefore, Scaption
informs us if there is something in the caption that is not in the image, possibly the
wrong word w1 ∈ Scaption. A possible correction for this caption would then be to
exchange w1 ∈ Scaption for a word w2 ∈ Simage that is somewhat equivalent (e.g.,
in Figure 3, “pizza” and “cake” are both food and could be exchanged). Therefore, if
we are able to find a word w1 ∈ Scaption and a word w2 ∈ Simage that can substitute
w1, we can infer that the caption is wrong and that w2 should substitute w1.
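The three sets discussed in this section can be worked out for the two example scenarios with plain set operations. The detected and caption terms below are assumed values written for illustration (with “man” and “woman” already mapped to “person”), and `compare` is a hypothetical helper name.

```python
def compare(s_nouns, s_names):
    """Return (S_inter, S_caption, S_image) for two sets of common terms."""
    return s_nouns & s_names, s_nouns - s_names, s_names - s_nouns

# Figure 3 scenario: the image shows a woman, a knife and a cake,
# while the foil caption says "the woman is cutting a pizza".
inter, s_caption, s_image = compare({"person", "pizza"},
                                    {"person", "knife", "cake"})
print(inter, s_caption, s_image)

# Figure 2 scenario: "the man is riding a bicycle" uses a noun that is
# present in the image, so S_caption is empty although the caption is wrong.
inter2, s_caption2, s_image2 = compare({"person", "bicycle"},
                                       {"person", "motorcycle", "bicycle"})
print(inter2, s_caption2, s_image2)
```

In the first scenario “pizza” ends up in Scaption and “cake” in Simage, giving a candidate foil and a candidate correction. In the second, the intersection contains every caption noun, which shows why the intersection alone cannot confirm that a caption is correct.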
4.5 Current implementation
For this paper, we implemented the architecture described in Figure 1 using a few
Python libraries in a container environment.
We used the FOIL-COCO data set, which provides a set of captions for MS-COCO
images that may or may not be correct, together with the images themselves, so that each
caption can be matched to its image. Since the captions had more terms than the ones
originally described in the images, we used the names and supercategories of MS-COCO as
the set of common terms.
The object detection process uses one of the pre-trained TensorFlow models
available in its model zoo 1. Since this application does not need to run in real
time, we employed Faster R-CNN with a base network generated through neural
architecture search on the MS-COCO data set. This choice was made considering
the metrics reported for the models available in the model zoo: Faster R-CNN
was selected since it provided the best reported mean Average Precision (mAP), despite
being the slowest, with a reported 1833 ms for processing a single image. The output of
this object detection method is a list of objects already defined in terms of MS-COCO
For processing the caption we used NLTK  with punkt and averaged perceptron tagger
to, respectively, tokenize and tag words in the caption. NLTK’s interface with Word-
Net  was used to calculate the similarity between nouns found in the caption and
the names deﬁned in MS-COCO.
A very important aspect of our current implementation is that we are using only
the nouns found in the caption, that were recognised by using NLTK. However, our
experiments used the complete FOIL-COCO data set as available in .
After detecting objects using Tensorﬂow and generating the set of nouns using
NLTK, the rest of the process used only Python’s standard set type operations.
The next section presents the experiments used to test our proposal along with an
explanation of how this architecture was implemented in this paper.
5 Experiments
In this section we first present the tasks used to analyse the performance of CAPTION
and then the hardware and software setup used during the experiments.
To test the architecture proposed in this work, we used the three tasks described in :
1. Classify the caption as right or wrong (foil);
2. Detect the wrong word, in case of a foil caption;
3. Correct the wrong word in the caption.
The three tasks were accomplished by analysing Scaption and comparing it to
Simage: if Scaption is empty, the caption is considered correct. However, if
Scaption is not empty, each of its terms is compared to each of the terms in Simage to
find a suitable substitution (i.e., if w1 ∈ Scaption and w2 ∈ Simage have the same
MS-COCO supercategory, w2 is a suitable substitute for w1). If a suitable substitute is found,
the caption is considered wrong and both the term from Scaption and its substitute
are reported. However, if a suitable substitute is not found, the caption is considered
correct. Thus, in the proposed method, the three tasks are not solved independently,
since it finds the foil word (task 2) together with the correction (task 3), which results in
defining the caption as wrong or not (task 1).
1 The pre-trained models are available at https://github.com/tensorflow/models/tree/master/
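The decision rule described above, in which a shared supercategory makes a term a suitable substitute, can be sketched as follows. The table `SUPERCATEGORY` is a small hypothetical excerpt of the MS-COCO name-to-supercategory mapping, and the function names are assumptions made for this example.

```python
# Hypothetical excerpt of the MS-COCO name -> supercategory table.
SUPERCATEGORY = {"pizza": "food", "cake": "food",
                 "knife": "kitchen", "person": "person"}

def same_supercategory(w1, w2):
    """True when both terms have the same known supercategory."""
    c1, c2 = SUPERCATEGORY.get(w1), SUPERCATEGORY.get(w2)
    return c1 is not None and c1 == c2

def solve_tasks(s_caption, s_image):
    """Return (is_foil, {foil_word: substitute}) in one pass."""
    pairs = {w1: w2
             for w1 in sorted(s_caption)
             for w2 in sorted(s_image)
             if same_supercategory(w1, w2)}
    return bool(pairs), pairs

print(solve_tasks({"pizza"}, {"knife", "cake"}))  # foil found and corrected
print(solve_tasks({"pizza"}, {"knife"}))          # no substitute: correct
print(solve_tasks(set(), {"knife"}))              # empty S_caption: correct
```

The single return value answers the three tasks at once: the Boolean is the classification, the dictionary keys are the detected foil words and the dictionary values are the proposed corrections.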
Since we used the same data set as , the results presented in the next section
will be compared with the results provided in that paper for each task.
The next section presents the results obtained by using the proposed method to solve
the three tasks presented in this section.
The experiments were executed on an Intel Core i7-8700 @ 4.6GHz with 16GB of
RAM and an NVidia GeForce GTX 1070 with 8GB of RAM running Debian GNU/Linux
Bullseye (the testing version as of this writing). All experiments were run in a container
using Docker and NVidia-Docker and an image based on
tensorflow/tensorflow:1.14.0-gpu-py3-jupyter with scikit-learn, opencv-python, nltk and
dodo detector installed via
6 Results
This section presents the results independently for the three tasks presented in the
previous section, using the previously published results for FOIL-COCO and Phrase Critic
(reproduced in this section) as baselines for CAPTION. In the latter part of this section,
an overall analysis of the performance of the proposed method is presented.
6.1 Task 1: Caption classiﬁcation
The first task consists of classifying whether the caption for a given image is correct or a foil.
Table 1 presents metrics related only to the proposed architecture. It is worth pointing
out that CAPTION achieves over 0.7 in precision and recall both for foil captions and
for correct captions.
Caption Precision Recall F1-score
Correct 0.81 0.74 0.77
Foil 0.72 0.79 0.75
Table 1: Caption classiﬁcation results for the proposed architecture.
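The F1-scores in Table 1 follow from the precision and recall columns as their harmonic mean. The short check below recomputes them from the table values; the helper name `f1_score` is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from Table 1.
table1 = {"Correct": (0.81, 0.74), "Foil": (0.72, 0.79)}

for label, (p, r) in table1.items():
    print(label, round(f1_score(p, r), 2))
# Correct 0.77
# Foil 0.75
```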
The same task was previously performed with four different algorithms that used both
the caption and the image to classify the caption (CNN + LSTM, IC-Wang, LSTM + norm I
and HieCoAtt), with Blind LSTM (a language-only method that did not use any
information from the image to classify the caption) and with two sets of classifications
done by humans: the first is based on the majority of votes to classify the caption and the
second is based on the agreement of all human judges on the classification. Phrase Critic
was also used for the same task. Table 2 presents those results along with the ones
obtained with CAPTION.
Classifier          Overall (%)  Correct (%)  Foil (%)
Blind LSTM          55.62        86.20        25.04
CNN + LSTM          61.07        89.16        32.98
IC-Wang             42.21        38.98        45.44
LSTM + norm I       63.26        92.02        34.51
HieCoAtt            64.14        91.89        36.38
Phrase Critic       87.00        -            73.72
CAPTION             76.31        80.90        71.72
Human (majority)    92.89        91.24        92.52
Human (unanimity)   76.32        73.73        78.90
Table 2: Results for the classification task from the baseline works and those obtained by
CAPTION. The classification result for correct captions is not provided for Phrase Critic.
Considering only the non-human classifiers other than Phrase Critic, CAPTION's overall
performance (76.31%) is over 10 percentage points above the best (HieCoAtt with
64.14%). However, when classifying captions as correct, it is only better than IC-Wang
(38.98%), performing about 11 percentage points below the best
classifier (LSTM + norm I with 92.02%). For foil classification, CAPTION's performance
(71.72%) is about 26 percentage points higher than the second best (IC-Wang with
45.44%). Compared with the classification done by humans, CAPTION's performance
is worse than the majority-vote classification, but it is surprisingly close to the
unanimity classification for the three criteria.
Compared with CAPTION, Phrase Critic has an overall performance about 10 percentage
points higher, while the foil classification is almost the same (2 percentage
points of difference). Although CAPTION's performance is worse than that of Phrase
Critic, the results show how relevant grounding the caption in the image is when
checking for errors in captions and, most importantly, when finding the foil word in the
caption, since both Phrase Critic and CAPTION have the top non-human performance
in both tasks.
Overall, CAPTION's performance is good across the three criteria, being close to
the human unanimity performance for the same task. CAPTION's performance is worse
only than that of Phrase Critic but, again, the two approaches are closely related, which may
indicate that comparing the information from the image to the caption may be the reason for the
Given that CAPTION's classification can only happen when solving the other two
tasks, it can be expected that the results for the other two tasks are at least as good as the
results for the first one, since performing poorly in identifying the error and correcting
it would directly affect CAPTION's performance in classifying the caption as right or
wrong.
6.2 Task 2: Error detection
Given a foil caption, the second task is identifying which word is wrong in the caption.
Although in the FOIL-COCO data set any correct word can be changed to be a foil word,
the current implementation of CAPTION considers only nouns as possible foil words.
For this task, three algorithms (IC-Wang, LSTM + norm I and HieCoAtt),
the same human classification methods (voting and unanimity) and a random classifier
(represented by the label ‘Chance’ in the results) have been evaluated. Again, Phrase Critic
was also used for this task.
Results for this task are presented in Table 3. Although CAPTION only considers
nouns as possible foil words, we did not restrict the comparison to the results that also
considered only nouns, since our method has no a priori information about which word is a
noun, classifying it during the caption processing steps.
Identifier Nouns (%) All words (%)
Chance 23.25 15.87
IC-Wang 27.59 23.32
LSTM + norm I 26.32 24.25
HieCoAtt 38.79 33.69
Phrase Critic 73.72
CAPTION 71.72
Human (majority) 97.00
Human (unanimity) 73.60
Table 3: Results for the error detection task presented by  along with CAPTION.
Comparing only the non-human identifiers from , CAPTION's performance
(71.72%) is more than twice that of the second-best method (HieCoAtt with
33.69%). For the human identifiers, we obtained results analogous to those in the first
task: CAPTION's performance is worse than voting (97%) but close to unanimity (73.60%).
Compared to Phrase Critic, our method's performance is once again very close,
which may not be a surprise considering that both compare information from image
and caption to detect the foil word.
Considering the results of this task, and the fact that CAPTION only considers a
caption as foil when it detects a foil word and knows how to correct it, the results for the
next task are as important as the results for the first two tasks in CAPTION's overall performance.
6.3 Task 3: Error correction
The last task is to correct the foil word in the caption. In the methods evaluated by ,
the caption and the foil word were given as input to the algorithm. The algorithm's
only task was, then, to replace the foil word with a correct one. In CAPTION, however,
since identifying and correcting the foil word is an important part of the classification
step, we did not tell our method which word is the foil word in each caption. Instead, we
present the results based on the corrections proposed by CAPTION while performing
the other two tasks.
For this task,  uses the same non-human methods as in the second task (Chance,
IC-Wang, LSTM + norm I and HieCoAtt), but this time no human method was used
for comparison. Once more, Phrase Critic was used for this task by . Results are
presented in Table 4 for each of those four methods and also for CAPTION.
Method All target words (%)
LSTM + norm I 4.7
Phrase Critic 49.60
Table 4: Results for the word correction task presented by  along with CAPTION.
In this task, CAPTION's performance (90.11%) is about four times the performance
of the best method presented by  (IC-Wang with 22.16%). Compared to Phrase
Critic, CAPTION's performance is superior by a wide margin for the first time in the three
tasks. This may be due to the fact that CAPTION's correction is based on semantics,
exchanging words related to each other, while Phrase Critic uses quantitative metrics to
indicate possible corrections.
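As a quick check of the ratios quoted in the text, the short sketch below recomputes them from the reported percentages only. The variable names are illustrative and no value outside the text and Tables 3 and 4 is used.

```python
# Accuracies (%) quoted in the text and tables for tasks 2 and 3.
caption_detection, hiecoatt_detection = 71.72, 33.69
caption_correction, ic_wang_correction = 90.11, 22.16
phrase_critic_correction = 49.60

# Relative performance against the best non-human baseline in each task.
detection_ratio = caption_detection / hiecoatt_detection
correction_ratio = caption_correction / ic_wang_correction

# Absolute margin (percentage points) over Phrase Critic in task 3.
margin_over_phrase_critic = caption_correction - phrase_critic_correction

print(f"{detection_ratio:.2f}")            # 2.13
print(f"{correction_ratio:.2f}")           # 4.07
print(f"{margin_over_phrase_critic:.2f}")  # 40.51
```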
Since performing poorly in this task would interfere with the results of the previous
tasks, CAPTION was expected to perform well in it. We can see that
considering the three tasks as a single, larger task with three steps seems to have helped
to develop a method with good performance in each of the three tasks individually.
The first aspect worth pointing out is that  considered the tasks as three separate
problems and used distinct, specialised methods to solve each task. In contrast,
CAPTION uses the results of the second and third tasks (error detection and correction) to
perform the first task (caption classification). This indicates, first, that it is possible
to solve the three tasks at the same time with the same method, without the need to
specialise a method for a certain task. Second, information about the second and third
tasks is important for solving the first task, since it gives a reason for the caption to be
classified as foil or not. Third, combining the three tasks gives us an explainable
output. For example, consider Figure 3 and the caption “a woman cutting a pizza”. By
combining the three tasks we can tell that the caption is foil (task 1) because there is
no pizza in the image (task 2), but the object detection method found a cake and, since
cake and pizza are both food, the woman may be cutting a cake instead of a pizza (task 3).
Thus, CAPTION not only has a better performance than the other methods considered,
but also gives a more explainable answer.
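The pizza and cake example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CAPTION implementation: the `CATEGORY` table is a hypothetical stand-in for a WordNet lookup, the names `Verdict` and `check_caption` are invented for this sketch, and the caption nouns and detector labels are assumed to have been extracted already.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical toy category table standing in for a WordNet lookup.
CATEGORY = {
    "pizza": "food", "cake": "food",
    "woman": "person", "man": "person",
    "bicycle": "vehicle", "motorcycle": "vehicle",
}

@dataclass
class Verdict:
    is_foil: bool                      # task 1: caption classification
    foil_word: Optional[str] = None    # task 2: error detection
    correction: Optional[str] = None   # task 3: error correction
    explanation: str = "every caption noun matches a detected object"

def check_caption(caption_nouns: List[str], detected_objects: List[str]) -> Verdict:
    """Flag a caption as foil only if a foil noun AND a correction are found."""
    detected = set(detected_objects)
    for noun in caption_nouns:
        if noun in detected:
            continue  # the noun is grounded in the image
        category = CATEGORY.get(noun)
        if category is None:
            continue  # no semantic information available for this noun
        # Look for a detected object of the same category not used in the caption.
        for obj in detected_objects:
            if obj not in caption_nouns and CATEGORY.get(obj) == category:
                reason = (f"no {noun} was detected, but a {obj} was, "
                          f"and both are {category}")
                return Verdict(True, noun, obj, reason)
    return Verdict(False)

verdict = check_caption(["woman", "pizza"], ["woman", "cake"])
print(verdict.is_foil, verdict.foil_word, verdict.correction)
print(verdict.explanation)

# If the detector misses the cake, no correction exists and the caption passes.
missed = check_caption(["woman", "pizza"], ["woman"])
print(missed.is_foil)
```

In this sketch the classification is a by-product of detection and correction, which is why a single verdict carries the answer to all three tasks together with its justification.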
However, there are some limitations in our method that help explain the
results. First, CAPTION depends on the performance of the object detection method to
correctly classify the caption. For example, using the same Figure 3 with the same foil
caption of “a woman cutting a pizza”, if the object detection method fails to recognise
the cake in the image, CAPTION will not be able to find a suitable substitute for pizza
and cannot tell that the caption is foil. Thus, it would incorrectly classify the caption as correct.
Second, the foil word in the caption can be some other object in the image, which
also causes CAPTION to misclassify the caption. For example, consider Figure 2
with the caption “a man riding a bicycle” and an object detection method that detects
the man riding the motorcycle and also the woman with the bicycle in the back of the
image. In this case, since both objects are in the image and the current implementation
of CAPTION considers only the objects and not the relations between them, it would not
be able to find the foil word and would classify the caption as correct.
Third, in the current implementation, a substitute for a foil word is chosen
by finding the detected object that is most similar to the foil word; thus CAPTION's
error correction is only as good as the similarity function used to compare the objects in the
caption and image. As a result, CAPTION can find mistakes in correct captions and
also propose incorrect corrections for foil captions.
Although these three problems are relevant to CAPTION's performance, improvements
in methods for the first and third problems would improve the overall performance
of CAPTION. That is, as object detection methods improve, the chance of not
recognising an object decreases and the first problem becomes less likely to occur. For the
second problem, a solution is to use more knowledge about the image to confirm the
information provided by the caption (e.g., knowing that a man is riding a motorcycle
and not a bicycle in the image implies that the caption “a man riding a bicycle” is wrong).
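The motorcycle and bicycle case can be illustrated with a small sketch that contrasts an object-only check with a relation-aware one. It assumes a hypothetical relation detector that outputs (subject, predicate, object) triples, which is not part of the current implementation; the function names and the triples are invented for illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def objects_only_is_foil(caption: Triple, image_triples: List[Triple]) -> bool:
    """Object-only check: foil if some caption noun is absent from the image."""
    seen = {t[0] for t in image_triples} | {t[2] for t in image_triples}
    return caption[0] not in seen or caption[2] not in seen

def find_relation_foil(caption: Triple,
                       image_triples: List[Triple]) -> Optional[Tuple[str, str]]:
    """Relation-aware check: return (foil_word, correction) or None."""
    if caption in image_triples:
        return None  # the whole relation is grounded in the image
    subj, pred, obj = caption
    for s, p, o in image_triples:
        # Same subject and predicate but a different object: the object is the foil.
        if s == subj and p == pred and o != obj:
            return (obj, o)
    return None

# Assumed output of a hypothetical relation detector for Figure 2.
image = [("man", "riding", "motorcycle"), ("woman", "with", "bicycle")]
caption = ("man", "riding", "bicycle")

print(objects_only_is_foil(caption, image))  # False: both nouns are in the image
print(find_relation_foil(caption, image))    # ('bicycle', 'motorcycle')
```

The object-only check accepts the caption because a bicycle does appear in the image, whereas the relation-aware check rejects it because the man is not the one related to the bicycle.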
Lastly,  suggests that the reason for the poor performance of the methods in the
three tasks they propose is that a neural network has no access to the meaning of words,
which makes it unable to detect the error and to correct it. Instead of adding meaning
to words and objects before they are processed by the ANN, so that the ANN has information
about the meaning of words and may use it to find errors and correct the text, in
CAPTION we add meaning to the output of the ANN in a way that does not change what
happens inside the ANN. Furthermore, we use a well-defined and accepted corpus (i.e.,
WordNet) to provide these meanings to words from both image and caption and, since
WordNet is constructed as a semantic network, we are capable of performing some
operations with those meanings (e.g., calculating similarities). Thus, we direct each method
(ANN and NLP) to its most suitable problem (i.e., object detection and text processing,
respectively) instead of trying to use a single method to solve the three tasks, as
done by  using ANN methods. As a result, we have a method with an overall better
performance and also a solution that can be explained.
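To make the idea of computing similarities over a semantic network concrete, the sketch below uses a tiny hand-written hypernym graph in place of WordNet. The graph, the function names and the inverse path length score are assumptions made for illustration; the similarity function actually used by CAPTION is not specified here.

```python
from typing import Dict, List

# Hypothetical toy hypernym links (child -> parent); WordNet is far larger.
HYPERNYM: Dict[str, str] = {
    "pizza": "dish", "dish": "food",
    "cake": "baked_goods", "baked_goods": "food",
    "food": "entity",
    "bicycle": "vehicle", "motorcycle": "vehicle",
    "vehicle": "entity",
}

def path_to_root(word: str) -> List[str]:
    """List the word followed by its successive hypernyms up to the root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def path_similarity(a: str, b: str) -> float:
    """Inverse path length: 1 / (1 + edges on the path between a and b)."""
    steps_from_b = {node: i for i, node in enumerate(path_to_root(b))}
    for steps_from_a, node in enumerate(path_to_root(a)):
        if node in steps_from_b:  # first shared ancestor
            return 1.0 / (1 + steps_from_a + steps_from_b[node])
    return 0.0  # no shared ancestor in the toy graph

def most_similar(word: str, detected_objects: List[str]) -> str:
    """Pick the detected object closest to the word in the graph."""
    return max(detected_objects, key=lambda obj: path_similarity(word, obj))

print(path_similarity("pizza", "cake"))            # 0.2
print(most_similar("pizza", ["bicycle", "cake"]))  # cake
```

Because the network is only consulted after the detector has produced its labels, the detector itself is left unchanged, in line with the design choice described above.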
Compared to Phrase Critic, CAPTION's performance is worse for the first
task (caption classification), almost identical in the second one (foil word detection)
and better in the last one (foil word correction), although the two methods have a lot
in common. This shows that comparing (grounding) the caption to the image is an
important step for the classification and also provides information for detecting the
foil word. However, semantics still seems to play an important role when correcting the foil word.
This paper proposes an architecture to solve the three tasks proposed by  for
automatic caption generation for images, which are 1) classifying the caption as correct or not,
2) detecting the foil word in the caption and 3) correcting the caption.
Our architecture combines object detection of the image with natural language
processing of the caption to search for the incorrect word in the caption and suggest a
correction. With this information, the caption is classified as correct or not and the
corrections are reported if needed.
Results show that CAPTION has a better overall performance than the methods
tested by  and in some tasks it is comparable to human performance. However,
there are some drawbacks in our approach such as objects not being recognised by the
detector, or the use of poor word similarity functions.
Future work consists of improving the caption processing by using more
information, not just nouns, and also considering object relations in the image so that it is
possible to overcome some of the shortcomings of the current implementation. These
improvements will allow us to test our method in the data set provided by .
References
1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis,
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia,
Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S.,
Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker,
P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke,
M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems
(2015), https://www.tensorflow.org/, software available from tensorflow.org
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual
Question Answering. In: International Conference on Computer Vision (ICCV) (2015)
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text
with the Natural Language Toolkit. O’Reilly Media (2009)
4. Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision
(ICCV). pp. 1440–1448 (Dec 2015)
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object
detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and
Pattern Recognition. pp. 580–587 (June 2014)
6. Hendricks, L.: Visual Understanding through Natural Language. Ph.D. thesis, EECS Depart-
ment, University of California, Berkeley (May 2019), http://www2.eecs.berkeley.edu/Pubs/
7. Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explanations. In: Ferrari, V.,
Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. pp. 269–286.
Springer International Publishing, Cham (2018)
8. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z.,
Song, Y., Guadarrama, S., Murphy, K.: Speed/accuracy trade-offs for modern convolutional
object detectors. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 3296–3297 (July 2017)
9. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.:
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning.
CoRR abs/1612.06890 (2016), http://arxiv.org/abs/1612.06890
10. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y.,
Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Visual genome: Connecting language and
vision using crowdsourced dense image annotations. In: CoRR (2016), https://arxiv.org/abs/
11. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S.,
Popov, S., Malloci, M., Duerig, T., Ferrari, V.: The open images dataset v4: Uniﬁed im-
age classiﬁcation, object detection, and visual relationship detection at scale. arXiv preprint
12. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick,
C.L.: Microsoft COCO: Common objects in context. European Conference on Computer
Vision (ECCV) (2014), /se3/wp-content/uploads/2014/09/coco eccv.pdf,http://mscoco.org
13. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single
shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision
– ECCV 2016, pp. 21–37. Springer International Publishing, Cham (2016)
14. Pease, A.: Ontology: A Practical Guide. Articulate Software Press (2011)
15. Princeton University: About WordNet. (2010)
16. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Uniﬁed, real-time
object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 779–788 (June 2016)
17. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with
region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence
39(6), 1137–1149 (June 2017)
18. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition
Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
19. Shekhar, R., Pezzelle, S., Herbelot, A., Nabi, M., Sangineto, E., Bernardi, R.: Vision and lan-
guage integration: Moving beyond objects. In: IWCS 2017 — 12th International Conference
on Computational Semantics — Short papers (2017), https://www.aclweb.org/anthology/
20. Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., Bernardi,
R.: FOIL it! ﬁnd one mismatch between image and language caption. In: Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). pp. 255–265. Association for Computational Linguistics, Vancouver, Canada (Jul
21. Tanti, M., Gatt, A., Camilleri, K.: Quantifying the amount of visual information used by
neural caption generators. In: Leal-Taixé, L., Roth, S. (eds.) Computer Vision – ECCV
2018 Workshops: Proceedings of the Workshop on Shortcomings in Vision and Language.
pp. 124–132. Springer, Munich, Germany (2019), https://link.springer.com/chapter/10.1007/
22. Uijlings, J., van de Sande, K., Gevers, T., Smeulders, A.: Selective search for object recogni-
tion. International Journal of Computer Vision (2013), http://www.huppelen.nl/publications/
23. Zhao, Z., Zheng, P., Xu, S., Wu, X.: Object detection with deep learning: A review. IEEE
Transactions on Neural Networks and Learning Systems pp. 1–21 (2019)
24. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scal-
able image recognition. The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (June 2018)