VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and
Linguistic Knowledge from Pretraining
Jun Chen¹, Han Guo², Kai Yi¹, Boyang Li³, Mohamed Elhoseiny¹
¹King Abdullah University of Science and Technology
²Carnegie Mellon University  ³Nanyang Technological University
{jun.chen,kai.yi,mohamed.elhoseiny}@kaust.edu.sa
hanguo@cs.cmu.edu, boyang.li@ntu.edu.sg
Abstract
The ability to quickly learn from a small quantity of
training data widens the range of applications of machine
learning. In this paper, we propose a data-efficient im-
age captioning model, VisualGPT, which leverages the lin-
guistic knowledge from a large pretrained language model
(LM). A crucial challenge is to balance between the use
of visual information in the image and prior linguistic
knowledge acquired from pretraining. We designed a novel
self-resurrecting encoder-decoder attention mechanism to
quickly adapt the pretrained LM as the language decoder
on a small amount of in-domain training data. The pro-
posed self-resurrecting activation unit produces sparse ac-
tivations but is not susceptible to zero gradients. When
trained on 0.1%, 0.5% and 1% of MSCOCO [35] and Con-
ceptual Captions [56], the proposed model, VisualGPT, sur-
passes strong image captioning baselines. VisualGPT out-
performs the best baseline model by up to 10.8% CIDEr
on MS COCO and up to 5.4% CIDEr on Conceptual Cap-
tions. To the best of our knowledge, this is the first work
that improves data efficiency of image captioning by utiliz-
ing LM pretrained on unimodal data. Our code is available
at: https://github.com/Vision-CAIR/VisualGPT.
1. Introduction
Image captioning [28,63,24,12,22] is a prominent ex-
ample of cross-modal reasoning, requiring accurate under-
standing of the visual content and precise expression of that
understanding in natural language. The task underpins novel applications such as helping people with impaired vision understand their surroundings [7, 11] and generating reports for medical images [33, 10].
However, most of the recent performance gains in image captioning rely on large-scale image-caption corpora such as MS COCO [35] or Conceptual Captions [56]. For
[Figure 1 diagram: K encoder layers (self-attention, feed forward) and decoder layers initialized from pretrained LM weights (masked self-attention, cross attention, feed forward), connected through the self-resurrecting encoder-decoder attention with gates $B^{vis}$ and $B^{lan}$; example output: "A cop on brown horse on sidewalk next to truck."]
Figure 1. Our VisualGPT model transfers the knowledge from
a pretrained language model to the caption decoder. A self-
resurrecting encoder-decoder attention is designed to connect the
multi-level visual features and caption decoder.
instance, MS COCO contains approximately one million
human-written captions. Manually creating captions for
such large datasets requires significant time and effort. On
the other hand, semi-automatic approaches for collecting
image-caption pairs from the Internet, as used by Concep-
tual Captions [56], may generate incorrect or undesirable
training data even after multiple rounds of data cleaning;
data crawled from the Internet are unlikely to cover highly
specific domains such as computerized tomography (CT)
scans. Thus, the availability of training data limits the range
of objects and scenes that image captioning systems can re-
liably describe [1]. Improved data efficiency of image cap-
tioning will allow practitioners to quickly curate sufficient
amount of data and establish systems that can describe rare
objects in specific domains.
In this paper, we investigate the data efficiency prob-
lem for image captioning. This problem is distinct from
the novel object captioning problem [20,1], which relies
Figure 2. Comparison of the part-of-speech distributions of the MS
COCO and WikiText-2 datasets [43]. We use the spacy parser and
show only the most important categories.
on abundant in-domain data but zero out-of-domain data.
Instead, we aim to improve the performance of image cap-
tioning systems trained on a small subset of in-domain data.
We propose to improve data efficiency by leveraging pre-
trained language models (LMs) [15,37,30,51], such as
BERT [13], XLNet [65], and GPT [49,50,8]. Via self-
supervised learning, these models acquire rich linguistic
and semantic knowledge, which has been shown to inform
downstream tasks in NLP [9,17].
A challenge in utilizing pretrained LMs is to bridge the
gap between multi-modal data and the single-modal textual
data the LMs are pretrained on. In Figure 2, we compare the
part-of-speech distributions of MS COCO and WikiText-2
[43]. MS COCO employs 75% more nouns but 14% fewer
verbs. This suggests that the MS COCO captions are biased
toward descriptions of static objects rather than actions. As
a result, effective use of pretrained LMs in image caption-
ing requires careful balancing of the linguistic knowledge
acquired from pretraining and the visual input information.
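To make this comparison concrete, the snippet below is a minimal sketch of how such part-of-speech distributions can be computed with spaCy; the corpus file names and the subset of coarse tags shown are illustrative assumptions, not part of our pipeline.

```python
# Sketch: compare coarse part-of-speech distributions of two text corpora
# with spaCy. File paths and the tag subset are illustrative assumptions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET", "PRON"]

def pos_distribution(sentences):
    """Return the fraction of tokens falling into each coarse POS tag."""
    counts = Counter()
    total = 0
    for doc in nlp.pipe(sentences, disable=["ner", "parser"]):
        for token in doc:
            counts[token.pos_] += 1
            total += 1
    return {tag: counts[tag] / total for tag in TAGS}

coco_captions = open("coco_captions.txt").read().splitlines()      # assumed file
wikitext_lines = open("wikitext2_train.txt").read().splitlines()   # assumed file
print("MS COCO :", pos_distribution(coco_captions))
print("WikiText:", pos_distribution(wikitext_lines))
```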
Figure 1 shows the overall architecture of the proposed
network, called VisualGPT. In the commonly used encoder-
decoder architecture for image captioning, we initialize the
parameters of the decoder from pretrained LMs such as
GPT-2 [50], whereas the encoder layers are randomly ini-
tialized. In addition, we propose a self-resurrecting atten-
tion mechanism that precisely balances the input from the
visual encoder and the prior linguistic knowledge from the
lower decoder layer. The proposed self-resurrecting atten-
tion mechanism can learn to ignore small magnitude inputs
and produce sparse activations. Notably, the mechanism
does not suffer from the zero gradient problem and can “turn
on” an activation again after it has been zeroed out.
We evaluate our VisualGPT against several strong base-
line models on 0.1%,0.5% and 1% of the MS COCO
dataset, and the experimental results demonstrate that our
VisualGPT can easily outperform the baselines. We also
conduct several ablative experiments to confirm the useful-
ness of pretrained LMs and the proposed self-resurrecting
attention mechanism.
With this paper, we make the following contributions:
• We propose to investigate the data efficiency problem
for image captioning and to borrow weights from pre-
trained language models to initialize the decoder. Us-
ing only a small amount of in-domain training data,
the proposed encoder-decoder quickly adapts network
weights obtained from the textual modality to the
cross-modal task of image captioning. To our knowl-
edge, this is the first work that focuses on efficiently
adapting large pretrained language models for image
captioning.
• We propose a novel encoder-decoder attention with
a self-resurrecting activation unit (SRAU), which can
learn to balance features from the visual and textual
modalities. SRAU produces sparse activations while
not being “trapped” in a zero-gradient region.
• We apply the proposed VisualGPT model to several
small subsets of MS COCO and Conceptual Captions.
In both automatic evaluation and human evaluation,
VisualGPT surpasses several state-of-the-art baselines.
2. Related Work
Image Captioning. Numerous image captioning models
have been proposed in the past few years. Earlier approaches generated an image caption template and filled in the blanks with the outputs of object or attribute predictors [57, 66]. In contrast, modern approaches adopt the neural encoder-decoder architecture, where an encoder network encodes the visual features and a decoder network generates the language description [62, 14, 23]. Visual features may be represented by a grid of CNN features [63, 41] or image regions containing objects [3].
Graph neural networks have also been adopted to repre-
sent scene graphs or spatial relationships between objects
[67,64]. Recurrent networks [3,4] and Transformer net-
works [31,22,12] are popular choices for the language de-
coder. Reinforcement learning enables model optimization
with non-differentiable evaluation metrics [53,36].
The novel object captioning problem treats image captioning as a zero-shot learning problem in which the captioning system is required to describe objects that do not exist in the training data. Lu et al. [42] propose a hybrid template-based method that fills the slots based on the object categories recognized by object detectors. Feng et al. [16] propose an unsupervised approach that trains on unpaired images and captions with a visual concept detector. From a learning efficiency perspective, Kim et al. [25] improve data efficiency by borrowing knowledge from unpaired image-caption data.
Self-supervised NLP Models. Self-supervised training
of large neural networks on textual data proves to be an
important technique in the creation of high-performance
NLP models. Several self-supervision signals have been
proposed, such as autoregressive language modeling and
masked language modeling.
Autoregressive language modeling is arguably one of the most fundamental tasks in NLP. The task is to predict the next word conditioned on all preceding words. More formally, given a sequence of words $w_1, \ldots, w_N$ and a probability distribution $p_\theta$ parameterized by $\theta$, the training objective is
$$\max_\theta \sum_{t=1}^{N} \log p_\theta(w_t \mid w_1, \ldots, w_{t-1}) \qquad (1)$$
Pretrained models using this objective include [6, 44] and the GPT series [49, 50, 8].
Another popular objective is masked language model-
ing, which predicts a randomly masked word in a textual
sequence based on all other words. Given a random variable $Z \in \{1, \ldots, N\}$, the training objective is
$$\max_\theta \; \mathbb{E}_Z\!\left[\log p_\theta(w_Z \mid w_1, \ldots, w_{Z-1}, w_{Z+1}, \ldots, w_N)\right] \qquad (2)$$
Models using this objective include ELMo [47] and BERT-
related methods [13,29,38].
In this paper, we propose a quick adaptation technique
for network weights obtained using the language modeling
objective. However, the technique is not specific to this type
of self-supervision signal and can be applied to other mod-
els, as the masked LM objective can be easily converted to
the LM objective by masking only the last word in the tex-
tual sequence.
Unlike neural networks pretrained on multimodal data
(e.g., [48,59,58,40,32]), our method only requires a small
amount of multimodal training data and focuses on adapting
linguistic knowledge learned from the textual modality.
3. The VisualGPT Architecture
The VisualGPT model contains an image encoder and a caption decoder comprising $K$ and $M$ Transformer [60] layers, respectively. Given an image, we first extract objects in the image using an off-the-shelf object detection network. After that, we extract features from the detected bounding boxes and feed them into the image encoder. We denote the number of extracted objects as $O$ and the dimension of the hidden states in the Transformer layers as $D$. As such, the image encoder outputs a tensor $I$ of dimension $D \times O \times K$. Conditioned on $I$, the caption decoder outputs words in the caption in an autoregressive manner. For a maximum caption length $T$, the decoder outputs a tensor $C$ of dimension $D \times M \times T$. The output of the last decoder layer is classified into sub-word tokens under Byte Pair Encoding (BPE) [55]. At layer $m$ of the decoder, we use the self-resurrecting encoder-decoder attention mechanism to find the right balance between the visual information $I$ and the linguistic output $C[m-1]$ from the immediately lower decoder layer. In the next few subsections, we describe these network components in detail.
3.1. Visual Encoder
The visual encoder consists of $K$ Transformer layers, each of which contains a self-attention operation, a feed-forward operation, and an addition-normalization operation. These components are described below.
The self-attention operation can be understood as encoding each element in the input as a convex combination of the other elements. Let $I_{k-1}$ denote the output of encoder layer $k-1$ and the input to encoder layer $k$. We first linearly project the input to the query matrix $Q$, key matrix $K$, and value matrix $V$:
$$Q = W_q I_{k-1}, \quad K = W_k I_{k-1}, \quad V = W_v I_{k-1}, \qquad (3)$$
where the matrices $W_q$, $W_k$, and $W_v$ are learnable parameters for the $j$-th head at layer $k$. The output of the self-attention mechanism is a convex combination of the columns of $V$:
$$f_{att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{D}}\right) V \qquad (4)$$
In standard multi-head attention, we utilize multiple sets of $W_q$, $W_k$, and $W_v$. The outputs of all heads are concatenated and linearly projected back to $D$ dimensions. The output of the multi-head attention, $I^{att}_k$, is fed into a feed-forward neural network $\mathrm{FFN}(\cdot)$, which is applied to each object feature separately and identically. It is composed of two affine transformations and the GELU activation [21].
Finally, an encoder layer also contains residual connections and layer normalization, which are denoted by AddNorm. For an arbitrary input $z$ and function $g(\cdot)$, the definition of AddNorm is
$$\mathrm{AddNorm}(z, g(\cdot)) = \mathrm{LayerNorm}(z + g(z)) \qquad (5)$$
For simplicity, we write $\mathrm{AddNorm}(g(z))$ when there is no ambiguity. The final output of encoder layer $k$ is
$$I_k = \mathrm{AddNorm}(\mathrm{FFN}(\mathrm{AddNorm}(I^{att}_k))) \qquad (6)$$
The same encoder layer is repeated $K$ times to form the complete encoder. The final output of the encoder contains the outputs of all layers, $I_1, \ldots, I_K$, which form the tensor $I$ of dimension $D \times O \times K$.
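For concreteness, the following is a minimal PyTorch sketch of one encoder layer implementing Equations 3-6 with a single attention head; the class and variable names are ours, and the sketch is an illustration rather than the released implementation.

```python
# Minimal single-head sketch of one visual encoder layer (Eqs. 3-6).
# Names and the single-head simplification are ours, not the released code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)   # W_q
        self.w_k = nn.Linear(d_model, d_model)   # W_k
        self.w_v = nn.Linear(d_model, d_model)   # W_v
        self.ffn = nn.Sequential(                # two affine maps with GELU
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.d_model = d_model

    def forward(self, x):                        # x: (batch, O, D) object features
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_model), dim=-1)
        x = self.norm1(x + attn @ v)             # AddNorm around self-attention (Eq. 5)
        return self.norm2(x + self.ffn(x))       # AddNorm around the feed-forward (Eq. 6)

# The K encoder layers are stacked and all K outputs are kept for the decoder.
layers = nn.ModuleList([EncoderLayer() for _ in range(3)])
feats = torch.randn(2, 50, 768)                  # e.g. 50 detected objects per image
outputs = []
for layer in layers:
    feats = layer(feats)
    outputs.append(feats)                        # list of K tensors, one per layer
```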
[Figure 3 diagram: the output of encoder layer k (D×O) and the output of decoder layer m−1 (D×T) pass through masked self-attention, cross attention (query, key, value) and the self-resurrecting activation, producing $\tilde{H}_{m,1}, \ldots, \tilde{H}_{m,K}$; these are averaged (×1/K) and fed through Add & Norm and a feed-forward layer to give the output of decoder layer m.]
Figure 3. The architecture of one caption decoder layer. The diagram focuses on the interaction between the output of encoder layer $k$ and the output of decoder layer $m-1$, which yields the matrix $\tilde{H}_{m,k}$. After that, the final output of the layer, $H_m$, is derived from the sum of $\tilde{H}_{m,1}, \ldots, \tilde{H}_{m,K}$. Striped boxes indicate network components initialized from pretrained weights.
3.2. Caption Decoder
We create the caption decoder by adapting network
weights learned from the uni-modal language modeling
task. A pretrained language model (LM), parameterized by $\theta$, generates the next token $w_t$ conditioned on the predecessor words $w_1, \ldots, w_{t-1}$. To train the LM is to fit the conditional distribution $P_\theta(w_t \mid w_1, \ldots, w_{t-1})$. In comparison, the caption decoder fits the conditional distribution $P_\theta(w_t \mid w_1, \ldots, w_{t-1}, I)$, where $I$ is the output of the image encoder. This may look like a small change, as we add only one more term to the condition. In practice, however, adding the term $I$ completely changes the distribution because it requires reconciling information from two different modalities.
We hypothesize that the generation of visual words, such
as “person”, “truck”, or “dog”, requires the model to rely on
visual information. In contrast, the generation of determin-
ers or connectives requires only linguistic knowledge. Ide-
ally, we would like to exploit the massive amount of linguis-
tic knowledge stored in the pretrained LM weights, while
referring to the visual input only when required.
To achieve this goal, we design the decoder architecture,
which contains a masked self-attention operation, a cross-
attention operation, and an encoder-decoder attention with a
self-resurrecting activation unit. Without loss of generality,
we now describe these components in the decoder layer m.
These components are also illustrated in Figure 3.
Masked Self-Attention. At decoder layer $m$, we apply masked self-attention, a standard component in Transformer-based language decoders, to the output of decoder layer $m-1$, which is denoted as $H_{m-1}$. When decoding at time step $t$, we use a binary mask to prevent the self-attention operation from seeing any information at time step $t+1$ and beyond. The output of the masked self-attention is denoted as $\dot{H}_m$.
Cross-modality Attention. In the cross-modality attention, we linearly project $\dot{H}_m \in \mathbb{R}^{D \times T}$ to the query matrix and the output of encoder layer $k$, $I_k \in \mathbb{R}^{D \times O}$, to both the key and value matrices. More formally, we apply the same $f_{att}$ function defined in Equation 4:
$$\dot{I}_k = f_{att}(W_{dq} \dot{H}_m,\; W_{dk} I_k,\; W_{dv} I_k), \qquad (7)$$
where $W_{dq}$, $W_{dk}$, and $W_{dv}$ are trainable parameters.
Encoder-Decoder Attention. To quickly adapt a pretrained
language model to a cross-modality task, it is crucial for
the neural network to correctly employ the visual input
and the linguistic knowledge acquired from pretraining at
the right time. The visual information should take prior-
ity when generating common visual words, whereas the lin-
guistic knowledge can contribute to connectives or uncom-
mon words.
In order to balance the visual input $\dot{I}_k$ and the linguistic input $\dot{H}_m$, we propose a new encoder-decoder attention. The balance is controlled by two gating matrices $B^{vis}_m \in [0,1]^{D \times T}$ and $B^{lan}_m \in [0,1]^{D \times T}$; they control the relative strengths of the visual input and the linguistic input to decoder layer $m$. As such, we compute the interaction between the $m$-th decoder layer and the $k$-th encoder layer as
$$\tilde{H}_{m,k} = B^{vis}_m \odot \dot{I}_k + B^{lan}_m \odot \dot{H}_m \qquad (8)$$
where $\odot$ denotes component-wise multiplication. The fused representation $\tilde{H}_m$ is computed as the average of all encoder-decoder interactions:
$$\tilde{H}_m = \frac{1}{K} \sum_{k=1}^{K} \tilde{H}_{m,k} \qquad (9)$$
We introduce two techniques for computing $B^{vis}$ and $B^{lan}$. The first soft gating technique computes them in pairs using a sigmoid activation:
$$B^{vis}_m[i,j] = \sigma(A[i,j]), \qquad B^{lan}_m[i,j] = 1 - \sigma(A[i,j]) \qquad (10)$$
where $B[i,j]$ denotes the $(i,j)$ entry of matrix $B$. Here $A$ is computed as an affine transformation of the two input matrices,
$$A = W^{gate}_m [\dot{I}_k; \dot{H}_m] + C^{gate}_m, \qquad (11)$$
where $W^{gate}_m \in \mathbb{R}^{D \times 2D}$ and $C^{gate}_m \in \mathbb{R}^{D \times T}$ are trainable parameters, and $[I; H]$ denotes the concatenation of matrices $I$ and $H$.
The final output of decoder layer $m$ is denoted as $H_m$ and is computed using FFN and AddNorm:
$$H_m = \mathrm{AddNorm}(\mathrm{FFN}(\mathrm{LayerNorm}(\tilde{H}_m))) \qquad (12)$$
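The soft gating variant of the encoder-decoder attention (Equations 7-12) can be sketched in PyTorch as follows; this is a simplified illustration with our own naming that uses a standard multi-head attention module for the cross attention, not the released implementation.

```python
# Sketch of the gated encoder-decoder interaction with soft gating
# (Eqs. 7-12), with our own naming conventions.
import torch
import torch.nn as nn

class GatedEncDecAttention(nn.Module):
    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, num_heads=12, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)   # W_gate and C_gate of Eq. 11
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm_in = nn.LayerNorm(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, h_dot, enc_outputs):
        # h_dot: (batch, T, D) output of the masked self-attention at layer m
        # enc_outputs: list of K encoder outputs, each of shape (batch, O, D)
        fused = 0
        for i_k in enc_outputs:
            i_dot, _ = self.cross(h_dot, i_k, i_k)          # cross attention (Eq. 7)
            a = self.gate(torch.cat([i_dot, h_dot], dim=-1))
            b_vis = torch.sigmoid(a)                        # soft gates (Eq. 10)
            b_lan = 1.0 - b_vis
            fused = fused + b_vis * i_dot + b_lan * h_dot   # gated mix (Eq. 8)
        h_tilde = fused / len(enc_outputs)                  # average over K (Eq. 9)
        h = self.norm_in(h_tilde)                           # LayerNorm in Eq. 12
        return self.norm_out(h + self.ffn(h))               # AddNorm(FFN(.)) in Eq. 12
```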
Self-Resurrecting Activation Unit. For the second method to compute $B^{vis}_m$ and $B^{lan}_m$, we propose a novel paired activation function, which we call the self-resurrecting activation unit (SRAU), defined as follows:
$$\mathrm{SRAU}(\alpha, \tau) = \big[\, \sigma(\alpha)\,\mathbb{1}(\sigma(\alpha) > \tau);\;\; (1 - \sigma(\alpha))\,\mathbb{1}(1 - \sigma(\alpha) > \tau) \,\big] \qquad (13)$$
The entries of the matrices $B^{vis}$ and $B^{lan}$ are computed from SRAU in pairs:
$$[\,B^{vis}_m[i,j];\; B^{lan}_m[i,j]\,] = \mathrm{SRAU}(A[i,j], \tau) \qquad (14)$$
where $\tau$ is a predefined threshold hyperparameter and $\mathbb{1}(\cdot)$ is the indicator function, which returns 1 if the inner statement is true and 0 otherwise.
Figure 4 (left) plots the SRAU function. The SRAU contains two gates, $\sigma(\alpha)$ and $1 - \sigma(\alpha)$, which are complementary to each other. When one of the gates falls below the threshold $\tau$, it is rectified to zero. This behavior suppresses small values and creates sparse activations, which may mitigate overfitting. However, when a gate variable is set to zero, it receives zero gradient and cannot be optimized via gradient descent. This is known as a "dead" gate and may impede proper optimization. It is worth noting that, in the design of SRAU, when $\tau < 0.5$, the two complementary gates cannot simultaneously receive zero gradient. In other words, if $\sigma(\alpha)$ receives zero gradient, $1 - \sigma(\alpha)$ can continue to be optimized. As such, the SRAU avoids being "trapped" in a flat region. That is why we name the function the Self-resurrecting Activation Unit.
We contrast SRAU with a "normalized" version, which may seem intuitive because it ensures that each pair of gates $B^{vis}_m[i,j]$ and $B^{lan}_m[i,j]$ adds up to 1:
$$[\beta_1; \beta_2] = \mathrm{SRAU}(\alpha, \tau)$$
$$\mathrm{NormSRAU}(\alpha, \tau) = \left[\frac{\beta_1}{\beta_1 + \beta_2};\; \frac{\beta_2}{\beta_1 + \beta_2}\right] \qquad (15)$$
Figure 4. Left: Self-resurrecting activation function with τ=0.2.
Right: Normalized self-resurrecting activation. The x-axis indi-
cates the function inputs and the y-axis indicates function values.
However, the normalization introduces large flat regions of
zero gradients, as illustrated in Figure 4(right). In Section
4.4, we compare the two versions and show the unnormal-
ized SRAU works better.
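A minimal sketch of SRAU (Equations 13-14) and its normalized variant (Equation 15) is given below; the small epsilon in the normalized version is our own numerical-safety assumption, and the tensor layout is illustrative.

```python
# Minimal sketch of the self-resurrecting activation unit (Eqs. 13-14)
# and its normalized variant (Eq. 15); element-wise on a tensor of pre-activations.
import torch

def srau(a, tau=0.2):
    """Return the pair (B_vis, B_lan) from the pre-activation tensor a."""
    g = torch.sigmoid(a)
    b_vis = g * (g > tau).float()               # zeroed when sigma(a) <= tau
    b_lan = (1 - g) * ((1 - g) > tau).float()   # zeroed when 1 - sigma(a) <= tau
    # For tau < 0.5, at most one of the two gates can be zero at a time, so the
    # other gate still carries gradient and can "resurrect" its partner.
    return b_vis, b_lan

def norm_srau(a, tau=0.2, eps=1e-8):
    b_vis, b_lan = srau(a, tau)
    s = b_vis + b_lan + eps                     # eps is a defensive assumption
    # Normalizing makes the pair sum to 1 but creates flat regions: whenever one
    # gate is zeroed, the other becomes exactly 1 regardless of a.
    return b_vis / s, b_lan / s
```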
4. Experiments
4.1. Datasets and Evaluation Metrics
We evaluate our model on the popular MS COCO dataset [35] and the Conceptual Captions dataset [56]. MS COCO contains 123,287 images, each annotated with 5 different captions. We follow the "Karpathy" split [24] for the validation and test sets. The Conceptual Captions dataset [56] contains a wider variety of both images and image caption styles than MS COCO. It contains around 3.3M images for training and 28K for validation. As the test data is not publicly available, we instead use the public validation data as our test set and randomly sample 5,000 different image-caption pairs from the training set as the validation set. All sentences are converted to lowercase.
To create the small training data setup, we randomly sample 0.1%, 0.5% and 1% of the image-caption pairs and use them as training data. The procedure is repeated 4 times with different random seeds.
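The low-data sampling protocol can be sketched as follows; the structure of the image-caption pair list is an assumption made only for illustration.

```python
# Sketch of the low-data sampling protocol: draw a fixed fraction of
# image-caption pairs with several random seeds. Fractions follow the paper;
# the data structure of `all_pairs` is an assumption.
import random

all_pairs = [(f"img_{i}.jpg", f"caption {i}") for i in range(100000)]  # placeholder data

def sample_subset(pairs, fraction, seed):
    rng = random.Random(seed)
    k = max(1, int(len(pairs) * fraction))
    return rng.sample(pairs, k)

subsets = {(frac, seed): sample_subset(all_pairs, frac, seed)
           for frac in (0.001, 0.005, 0.01)   # 0.1%, 0.5%, 1%
           for seed in range(4)}              # repeated with 4 random seeds
```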
The evaluation metrics include BLEU [45], METEOR [5], ROUGE [34], CIDEr [61] and SPICE [2]. We report the average performance with standard deviation.
4.2. Experimental settings
Baselines. We compare our model with several state-of-the-
art transformer-based models, including (1) Plain Trans-
former [60] model. (2) AoANet [22], which replaces the
feed-forward module with an attention-on-attention mod-
ule in every transformer layer. (3) M2Transformer [12],
the current state-of-the-art image-captioning model on MS
COCO. As VisualGPT has 12 decoder layers, for fair com-
parisons, we also create variants of Transformer and M2
Transformer with 12-layer decoders.
Method  Decoder Layers  BLEU-1  BLEU-4  METEOR  ROUGE  CIDEr  SPICE
0.1% training data
Transformer [60] 3 57.4±0.26 13.1±0.15 16.7±0.25 40.7±0.17 40.8±0.15 10.3±0.15
M2Transformer [12] 3 56.9±0.14 13.1±0.21 16.9±0.09 40.6±0.24 40.9±0.89 10.2±0.10
AoANet [22] 3 56.6±1.85 13.5±1.81 15.9±0.36 40.7±1.04 38.4±4.30 9.9±0.58
Transformer [60] 12 44.0±2.35 3.8±0.44 9.5±0.40 36.0±3.15 4.74±1.10 2.1±0.15
M2Transformer [12] 12 52.0±1.85 9.1±0.31 13.7±3.03 39.5±1.30 33.1±1.00 7.8±0.21
AoANet [22] 12 20.7±2.49 2.0±2.65 7.9±1.56 34.0±2.98 7.0±1.34 3.2±0.49
VisualGPT (Ours) 12 58.2±2.30 16.4±0.65 18.5±1.85 41.9±0.17 45.1±1.90 10.9±0.40
0.5% training data
Transformer 3 62.8±0.45 18.8±0.64 19.4±0.20 45.2±0.57 59.2±0.41 13.0±0.15
M2Transformer 3 63.3±0.10 19.4±0.42 19.8±0.27 45.6±0.12 61.3±0.75 13.7±0.22
AoANet 3 63.5±1.25 20.2±1.56 19.4±0.31 45.8±0.60 63.9±4.30 13.8±0.60
Transformer 12 60.9±0.34 15.8±0.29 18.0±0.12 43.1±0.25 49.7±1.04 11.0±0.08
M2Transformer 12 60.1±1.38 14.4±0.10 17.9±0.72 43.7±0.86 44.2±0.30 11.4±0.26
AoANet 12 57.9±2.62 15.5 ±0.71 17.1 ±0.42 43.4±0.28 46.8±1.13 11.3±0.21
VisualGPT (Ours) 12 66.2±1.19 22.1±0.96 21.1±0.40 47.3±0.61 70.3±1.70 14.6±0.25
1% training data
Transformer 3 66.0±0.34 21.9±0.21 21.1±0.15 47.3±0.5 71.9±1.19 14.5±0.15
M2Transformer 3 67.1±0.58 23.4±0.42 21.3±0.17 48.3±0.26 73.0±1.0 14.9±0.13
AoANet 3 67.6±0.71 23.6±1.43 21.5±0.31 48.4±0.28 75.5±2.30 15.1±0.32
Transformer 12 64.0±0.56 19.6±0.83 19.5±0.22 45.7±0.49 62.1±0.67 12.5±0.17
M2Transformer 12 63.3±1.28 18.0±0.62 19.3±0.64 46.1±0.85 52.9±5.00 12.7±0.55
AoANet 12 63.7±0.92 17.7 ±0.14 18.5 ±0.14 48.2±2.12 58.4±0.57 12.5±0.64
VisualGPT (Ours) 12 69.7±0.62 25.7±0.72 22.6±0.21 49.8±0.21 82.5±1.81 15.8±0.21
Table 1. Testing results by training on small subsets: performance of the compared methods trained on 0.1%, 0.5% and 1% of MS COCO image-caption pairs. The best performance in each evaluation is indicated in bold font.
4.3. Quantitative Results
Small In-domain Training Data. Results on MS COCO and Conceptual Captions are presented in Tables 1 and 3, respectively. On MS COCO, VisualGPT achieves the best performance among all models. VisualGPT outperforms the best baseline model by 4.2 CIDEr when trained on 0.1% of MS COCO data, 6.4 CIDEr with 0.5% data and 7.0 CIDEr with 1% data. In the experiments on the Conceptual Captions dataset, we compare against only the baseline models with 3-layer decoders, as these have demonstrated superior performance on MS COCO. Once again, VisualGPT outperforms all the baselines on every evaluation metric. It outperforms the best baseline model by 4.2 CIDEr under 0.1% training data, 5.4 CIDEr under 0.5% data and 1.4 CIDEr under 1% data.
Comparison against semi-supervised and unsupervised methods. Kim et al. [26] proposed a semi-supervised learning method to improve the data efficiency of image captioning. They used 1% of images as training data, rather than 1% of image-caption pairs as in Table 1. For Kim et al. + unpaired, they also employ the other 99% of MS COCO as unpaired images and captions for training. We replicate their setup. In Table 4, we compare VisualGPT against the results reported in [26]. Without using unpaired images and captions, the proposed VisualGPT method outperforms Kim et al. by 20.6 CIDEr.
We also compared VisualGPT against the unsupervised methods of Gu et al. [18] and Feng et al. [16], which use tens of millions of unpaired images and captions. Even though these are not fair comparisons, it is encouraging to see that only 1,133 training images are needed to surpass their performance.
4.4. Ablation Studies
To further quantify the contributions of the pretrained language model and the proposed self-resurrecting encoder-decoder attention, we conduct experiments on the following ablated versions of VisualGPT.
Base + random init. This is the base model, a Trans-
former [60] architecture with a 3-layer encoder, a 12-
layer decoder, and a traditional cross-modality atten-
tion between the encoder and the decoder. The model
Ablation B-1 B-4 M R C S
0.1% training
Base + random init. 44.0 3.8 9.5 36.0 4.7 2.1
Base + GPT2 init. 56.8 15.3 17.0 41.2 42.9 10.5
Base + GPT2 + Meshed 54.9 14.7 16.6 41.1 41 10.4
Base + GPT2 + AoA 55.5 14.4 16.2 40.7 40.1 10.2
Normalized SRAU 55.7 15.0 16.8 41.2 42.4 10.4
Full VisualGPT 58.2 16.4 18.5 41.9 45.1 10.9
0.5% training
Base + random init. 60.9 15.8 18.0 43.1 49.7 11.0
Base + GPT2 init. 65.1 21.8 20.6 46.6 69.5 14.1
Base + GPT2 + Meshed 64.7 21.8 20.7 47.1 68.5 14.2
Base + GPT2 + AoA 64.2 21.2 20.5 46.5 67.2 13.8
Normalized SRAU 65.3 21.8 20.9 47.0 69.3 14.1
Full VisualGPT 66.2 22.1 21.1 47.3 70.3 14.6
1% training
Base + random init. 64.0 19.6 19.5 45.7 62.1 12.5
Base + GPT2 init. 68.5 25.1 22.1 49.0 80.5 15.4
Base + GPT2 + Meshed 68.2 25.0 22.4 49.2 80.4 15.4
Base + GPT2 + AoA 68.5 24.6 22.0 48.6 78.4 15.0
Normalized SRAU 69.1 25.2 22.3 49.3 81.4 15.5
Full VisualGPT 69.7 25.7 22.6 49.8 82.5 15.8
Table 2. Ablation study on VisualGPT with different encoder-decoder attention mechanisms, comparing the contribution of the pretrained language model.
Models  Decoder Layers  B-1  B-4  M  R  C
0.1% training data
Transformer 3 12.4 2.4 4.9 15.2 21.2
M2Transformer 3 13.1 2.8 4.8 15.5 23.5
AoANet 3 11.4 2.4 4.6 14.7 20.9
VisualGPT 12 13.9 3.2 5.6 16.7 27.7
0.5% training data
Transformer 3 13.2 3.3 5.5 16.3 29.6
M2Transformer 3 14.5 3.6 6.0 17.1 32.0
AoANet 3 13.8 3.3 5.6 17.9 31.8
VisualGPT 12 15.4 4.1 6.6 18.4 37.4
1% training data
Transformer 3 13.9 3.7 6.3 18.1 37.9
M2Transformer 3 16.0 4.1 6.8 18.9 39.8
AoANet 3 14.9 4.1 6.5 18.6 39.0
VisualGPT 12 16.4 4.3 6.9 19.2 41.2
Table 3. Testing results by training on small Conceptual Captions subsets.
parameters are randomly initialized instead of pre-
trained.
Base + GPT2 init. On top of the base model, we load
the GPT-2 pretrained weights into the decoder. Other
weights remain randomly initialized.
Method B-1 B-4 M R C
Kim et al. [26] 58.1 13.4 15.9 - 36.0
Kim et al. + unpaired 63.0 18.7 20.7 - 55.2
VisualGPT 67.1 24.3 21.9 48.6 75.8
Gu et al. [18] 46.2 5.4 13.2 - 17.7
Feng et al. [16] 58.9 18.6 17.9 - 54.9
Table 4. Comparison using Kim et al.'s split of MS COCO. Kim et al. employ only 1% of images for training, whereas Kim et al. + unpaired also use the rest of the training data as unpaired images and texts. We also include the unsupervised baselines of Gu et al. and Feng et al.
Base + GPT2 + Meshed [12]. On top of the Base
+ GPT2 init. model, we apply the meshed cross-
connection between the encoder and the decoder [12]
instead of the traditional cross-modality attention.
Base + GPT2 + AoA [22]. On top of the Base + GPT2
init. model, we add Attention on Attention [22] to the
simple cross-modality attention in the decoder.
Normalized SRAU. We replace the self-resurrecting
activation unit in VisualGPT with the normalized self-
resurrecting activation unit (see Figure 4). We ex-
perimented with other activation functions that do not
suffer from zero gradients, such as Leaky ReLU and
GELU, but the training crashed as the activation val-
ues became too large.
Effects of GPT-2 pretrained weights. Comparing the ran-
dom initialization (Base + random init.) and the GPT-2 pre-
trained weights (Base + GPT2 init.), it is evident that the
GPT-2 weights play a significant role in learning from small
data. In particular, the gap between these two models is the
most pronounced when training on the least data.
Effects of the proposed encoder-decoder attention. We
compare the full VisualGPT model with two other variants
of the encoder-decoder attention, Base + GPT2 + Meshed
and Base + GPT2 + AOA. The VisualGPT model achieves
the best performance in all three setups, demonstrating the
effectiveness of the proposed mechanism.
Effects of self-resurrecting activation. In the Normalized SRAU ablation baseline, the self-resurrecting capability of SRAU is eliminated. This results in substantially lower performance, decreasing CIDEr from the full VisualGPT by 2.7, 1.0, and 0.3, respectively, on the three setups. This demonstrates that the self-resurrecting property is beneficial for learning from small data.
4.5. Human Study
We conducted an Amazon Mechanical Turk study to investigate human preferences over the generated captions. We randomly select 50 test images from the three setups
[Figure 5 examples: ground-truth captions paired with VisualGPT outputs and per-word visual scores, e.g. "a cat is sitting in front of a television", "a woman sitting on a bench in a park", "a laptop sitting on a desk with a mouse", "a couple of people sitting on a snowy surface".]
Figure 5. Visual scores of words in generated captions. We show
the raw visual scores and highlight them according to normalized
visual scores. High visual scores are in blue and low scores in red.
[Figure 6 word lists: words with the highest visual attention include "bench", "wooden", "sitting", "clock" and "toilet"; words with the lowest visual attention include "to", "of", "on", "the" and "a".]
Figure 6. Distributions of linguistic attention ($B^{lan}$) and visual attention ($B^{vis}$) at every decoding layer. We also show the words generated with the highest and lowest visual attention.
of 0.1%, 0.5%, and 1% training data. For every image, we generate one caption from VisualGPT and from each of three high-performing baselines from Table 1, Transformer [60], M2 Transformer [12], and AoANet [22], all with three decoder layers. Every image is evaluated by 5 different Turkers, who choose the caption that most accurately describes the image content. In total we received 750 valid responses; the results are shown in Table 5.
Overall, we observe that the captions generated by VisualGPT received the most votes: 39.6% for the 0.1% split, 38.0% for the 0.5% split, and 36.4% for the 1% split. For each training setup, we conducted Pearson's Chi-square test [46], which shows the differences are statistically significant with p < 0.05 in all cases.
Method 0.1% data 0.5% data 1% data
Transformer [60] 19.6% 19.2% 17.2%
AoANet [22] 9.6% 19.2% 24.4%
M2Transformer [12] 31.2% 23.6% 22.0%
VisualGPT 39.6% 38.0% 36.4%
Table 5. The percentage of votes for our VisualGPT and baseline
models in 0.1%,0.5% and 1% training data.
4.6. Qualitative Analysis
In this section, we examine examples from the VisualGPT model trained on 1% of MS COCO. First, we show example captions generated by VisualGPT in Figure 5 and the associated $B^{vis}$ at the last decoder layer. Note that for every generated word, we have a 768-dimensional visual gate vector, which is the slice of $B^{vis}$ at the corresponding time step. We take the mean of the gate vector as the visual score for that word. After that, we normalize the visual scores across the dataset to the $[0,1]$ interval and highlight the words accordingly. Blue indicates high visual scores and red indicates low visual scores. We observe that, in agreement with our intuition, VisualGPT assigns high visual scores to words like "desk" and "snowy surface" and low visual scores to determiners and prepositions.
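A minimal sketch of this visual-score computation is given below; the (T, D) layout of the gate matrix and the use of min-max normalization are our own conventions for illustration.

```python
# Sketch: derive a per-word visual score from the gate matrix B_vis of the
# last decoder layer; variable names and layout conventions are ours.
import torch

def visual_scores(b_vis):
    """b_vis: (T, D) gate values for one generated caption (one row per word)."""
    return b_vis.mean(dim=-1)                  # mean of the 768-dim gate vector

def normalize_over_dataset(all_scores):
    """Min-max normalize the raw scores across the whole dataset to [0, 1]."""
    flat = torch.cat(all_scores)               # all_scores: list of 1-D score tensors
    lo, hi = flat.min(), flat.max()
    return [(s - lo) / (hi - lo) for s in all_scores]
```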
In Figure 6, we plot the distribution of $B^{vis}$ and $B^{lan}$ at every decoder layer as a box-and-whisker diagram. We also show the words with the highest and lowest visual scores, which are again in line with our expectations. Additionally,
we observe that, going from layer 0 to layer 9, the decoder
makes increasing use of visual information, but the upper-
most layers, 10 and 11, make more balanced use of informa-
tion. We hypothesize that the low layers focus on low-level
linguistics like syntax, whereas the middle layers learn to
fuse linguistic information with visual information. Finally,
the two information sources become balanced in the upper-
most layers.
5. Conclusion
In this paper, we presented a data-efficient image captioning model, VisualGPT, which leverages the linguistic knowledge of a pretrained language model. To bridge the semantic gap between modalities, we designed a novel encoder-decoder attention mechanism with an unsaturated rectified gating function. We evaluated our model on 0.1%, 0.5% and 1.0% of the MS COCO dataset. The experimental results demonstrate the effectiveness of our approach, which outperforms several strong baseline models.
References
[1] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen,
Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Ste-
fan Lee, and Peter Anderson. nocaps: novel object caption-
ing at scale. In Proceedings of the IEEE International Con-
ference on Computer Vision, pages 8948–8957, 2019. 1
[2] Peter Anderson, Basura Fernando, Mark Johnson, and
Stephen Gould. Spice: Semantic propositional image cap-
tion evaluation. In European Conference on Computer Vi-
sion, pages 382–398. Springer, 2016. 5
[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien
Teney, Mark Johnson, Stephen Gould, and Lei Zhang.
Bottom-up and top-down attention for image captioning and
visual question answering. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
6077–6086, 2018. 2,12
[4] Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing.
Convolutional image captioning. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 5561–5570, 2018. 2
[5] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic
metric for mt evaluation with improved correlation with hu-
man judgments. In Proceedings of the acl workshop on in-
trinsic and extrinsic evaluation measures for machine trans-
lation and/or summarization, pages 65–72, 2005. 5
[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003. 3
[7] Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little,
Andrew Miller, Robert C. Miller, Aubrey Tatarowicz, Bran-
dyn White, Samuel White, and Tom Yeh. Vizwiz: Nearly
real-time answers to visual questions. In Proceedings of the
2010 International Cross Disciplinary Conference on Web
Accessibility (W4A), 2010. 1
[8] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub-
biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan-
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. arXiv preprint
arXiv:2005.14165, 2020. 2,3
[9] Pawel Budzianowski and Ivan Vulic. Hello, it’s GPT-2 - how
can I help you? towards the use of pretrained language mod-
els for task-oriented dialogue systems. In Alexandra Birch,
Andrew M. Finch, Hiroaki Hayashi, Ioannis Konstas, Thang
Luong, Graham Neubig, Yusuke Oda, and Katsuhito Sudoh,
editors, Proceedings of the 3rd Workshop on Neural Gener-
ation and Translation@EMNLP-IJCNLP 2019, Hong Kong,
November 4, 2019, pages 15–22. Association for Computa-
tional Linguistics, 2019. 2
[10] Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical image analysis, 66, 2020. 1
[11] Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Ar-
jun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Ba-
tra, and Devi Parikh. Evaluating visual conversational agents
via cooperative human-ai games. In HCOMP, 2017. 1
[12] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and
Rita Cucchiara. Meshed-memory transformer for image cap-
tioning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 10578–
10587, 2020. 1,2,5,6,7,8,12,13
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In Proceedings of the
2019 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages
4171–4186, 2019. 2,3
[14] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama,
Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
and Trevor Darrell. Long-term recurrent convolutional net-
works for visual recognition and description. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 2625–2634, 2015. 2
[15] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu,
Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon.
Unified language model pre-training for natural language un-
derstanding and generation. In Advances in Neural Informa-
tion Processing Systems, pages 13063–13075, 2019. 2
[16] Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Unsupervised
image captioning. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4125–4134,
2019. 2,6,7
[17] Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko,
Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf.
Large-scale transfer learning for natural language genera-
tion. In Proceedings of the 57th Annual Meeting of the As-
sociation for Computational Linguistics, pages 6053–6058,
2019. 2
[18] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. Un-
paired image captioning by language pivoting. In Pro-
ceedings of the European Conference on Computer Vision
(ECCV), pages 503–519, 2018. 6,7
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016. 12
[20] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus
Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Dar-
rell. Deep compositional captioning: Describing novel ob-
ject categories without paired training data. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 1–10, 2016. 1
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear
units (gelus). arXiv preprint arXiv:1606.08415, 2016. 3
[22] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei.
Attention on attention for image captioning. In Proceedings
of the IEEE International Conference on Computer Vision,
pages 4634–4643, 2019. 1,2,5,6,7,8,13
[23] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap:
Fully convolutional localization networks for dense caption-
ing. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4565–4574, 2016. 2
[24] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic align-
ments for generating image descriptions. In Proceedings of
the IEEE conference on computer vision and pattern recog-
nition, pages 3128–3137, 2015. 1,5
[25] Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So
Kweon. Image captioning with very scarce supervised data:
Adversarial semi-supervised learning approach. In Kentaro
Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,
Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-
IJCNLP 2019, Hong Kong, China, November 3-7, 2019,
pages 2012–2023. Association for Computational Linguis-
tics, 2019. 2
[26] Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So
Kweon. Image captioning with very scarce supervised
data: Adversarial semi-supervised learning approach. In
Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-
IJCNLP), pages 2012–2023, Hong Kong, China, Nov. 2019.
Association for Computational Linguistics. 6,7
[27] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson,
Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan-
tidis, Li-Jia Li, David A Shamma, et al. Visual genome:
Connecting language and vision using crowdsourced dense
image annotations. International journal of computer vision,
123(1):32–73, 2017. 12
[28] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sag-
nik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and
Tamara L Berg. Babytalk: Understanding and generat-
ing simple image descriptions. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 35(12):2891–2903,
2013. 1
[29] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin
Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite
bert for self-supervised learning of language representations.
In International Conference on Learning Representations,
2019. 3
[30] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvinine-
jad, Abdelrahman Mohamed, Omer Levy, Veselin Stoy-
anov, and Luke Zettlemoyer. BART: Denoising sequence-to-
sequence pre-training for natural language generation, trans-
lation, and comprehension. In Proceedings of the 58th An-
nual Meeting of the Association for Computational Linguis-
tics.2
[31] Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. Entan-
gled transformer for image captioning. In Proceedings of the
IEEE International Conference on Computer Vision, pages
8928–8937, 2019. 2
[32] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei
Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu
Wei, et al. Oscar: Object-semantics aligned pre-training for
vision-language tasks. In European Conference on Computer
Vision, pages 121–137. Springer, 2020. 3
[33] Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Hy-
brid retrieval-generation reinforced agent for medical image
report generation. In Advances in Neural Information Pro-
cessing Systems 31, pages 1530–1540. 2018. 1
[34] Chin-Yew Lin. Rouge: A package for automatic evaluation
of summaries. In Text summarization branches out, pages
74–81, 2004. 5
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 1, 5, 12
[36] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and
Kevin Murphy. Improved image captioning via policy gra-
dient optimization of spider. In Proceedings of the IEEE in-
ternational conference on computer vision, pages 873–881,
2017. 2
[37] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
moyer, and Veselin Stoyanov. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv Preprint, arXiv
1907.11692, 2019. 2
[38] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
moyer, and Veselin Stoyanov. Roberta: A robustly optimized
bert pretraining approach. 2019. 3
[39] Ilya Loshchilov and Frank Hutter. Fixing weight decay reg-
ularization in adam. 2018. 12
[40] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13–23, 2019. 3
[41] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher.
Knowing when to look: Adaptive attention via a visual sen-
tinel for image captioning. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
375–383, 2017. 2
[42] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh.
Neural baby talk. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7219–7228,
2018. 2
[43] Stephen Merity, Caiming Xiong, James Bradbury, and
Richard Socher. Pointer sentinel mixture models. 2017. 2
[44] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5528–5531. IEEE, 2011. 3
[45] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meeting of the
Association for Computational Linguistics, pages 311–318,
2002. 5
[46] Karl Pearson. X. on the criterion that a given system of de-
viations from the probable in the case of a correlated system
of variables is such that it can be reasonably supposed to
have arisen from random sampling. The London, Edinburgh,
and Dublin Philosophical Magazine and Journal of Science,
50(302):157–175, 1900. 8
[47] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gard-
ner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer.
Deep contextualized word representations. In Proceedings
of NAACL-HLT, pages 2227–2237, 2018. 3
[48] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and
Arun Sacheti. Imagebert: Cross-modal pre-training with
large-scale weak-supervised image-text data. arXiv preprint
arXiv:2001.07966, 2020. 3
[49] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
Sutskever. Improving language understanding by generative
pre-training. 2,3
[50] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, and Ilya Sutskever. Language models are unsuper-
vised multitask learners. OpenAI blog, 1(8):9, 2019. 2,3
[51] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee,
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J. Liu. Exploring the limits of transfer learning with a
unified text-to-text transformer. Journal of Machine Learn-
ing Research, 21(140):1–67, 2020. 2
[52] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information pro-
cessing systems, pages 91–99, 2015. 12
[53] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret
Ross, and Vaibhava Goel. Self-critical sequence training for
image captioning. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 7008–
7024, 2017. 2
[54] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret
Ross, and Vaibhava Goel. Self-critical sequence training for
image captioning. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 7008–
7024, 2017. 12
[55] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neu-
ral machine translation of rare words with subword units.
In Proceedings of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics, ACL 2016, August 7-12,
2016, Berlin, Germany, Volume 1: Long Papers. The Asso-
ciation for Computer Linguistics, 2016. 3,12
[56] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu
Soricut. Conceptual captions: A cleaned, hypernymed, im-
age alt-text dataset for automatic image captioning. In Pro-
ceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages
2556–2565, 2018. 1,5
[57] Richard Socher and Li Fei-Fei. Connecting modalities:
Semi-supervised segmentation and annotation of images us-
ing unaligned text corpora. In 2010 IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition,
pages 966–973. IEEE, 2010. 2
[58] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu
Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-
linguistic representations. In International Conference on
Learning Representations, 2020. 3
[59] Hao Tan and Mohit Bansal. LXMERT: learning cross-
modality encoder representations from transformers. In Ken-
taro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,
Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-
IJCNLP 2019, Hong Kong, China, November 3-7, 2019,
pages 5099–5110. Association for Computational Linguis-
tics, 2019. 3
[60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in neural
information processing systems, pages 5998–6008, 2017. 3,
5,6,8,13
[61] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi
Parikh. Cider: Consensus-based image description evalua-
tion. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4566–4575, 2015. 5
[62] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du-
mitru Erhan. Show and tell: Lessons learned from the 2015
MSCOCO image captioning challenge. IEEE transactions
on pattern analysis and machine intelligence, 39(4):652–
663, 2016. 2
[63] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron
Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Bengio. Show, attend and tell: Neural image caption gen-
eration with visual attention. In International conference on
machine learning, pages 2048–2057, 2015. 1,2
[64] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai.
Auto-encoding scene graphs for image captioning. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 10685–10694, 2019. 2
[65] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell,
Russ R Salakhutdinov, and Quoc V Le. Xlnet: General-
ized autoregressive pretraining for language understanding.
In Advances in neural information processing systems, pages
5753–5763, 2019. 2
[66] Benjamin Z Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and
Song-Chun Zhu. I2t: Image parsing to text description. Pro-
ceedings of the IEEE, 98(8):1485–1508, 2010. 2
[67] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring
visual relationship for image captioning. In Proceedings of
the European conference on computer vision (ECCV), pages
684–699, 2018. 2
A. Supplementary material
In the supplementary material, we provide experimental details and additional qualitative examples.
A.1. Additional implementation details
Image and Word Features. Following [3], we use a Faster R-CNN network [52] with a ResNet-101 [19] backbone trained on the Visual Genome dataset [27], and we extract a 2048-dimensional feature vector for each object.
We use Byte Pair Encoding (BPE) [55], which effectively incorporates sub-word information and is beneficial for dealing with out-of-vocabulary words. We employ learnable positional encodings and initialize the token embeddings from the pretrained weights of GPT-2.
Architecture and Hyperparameters. We use 3 layers in the encoder and 12 layers in the decoder, with 12 heads in each layer. The hidden size $D$ in each layer is 768. We load the GPT-2 (small) pretrained weights, which comprise 117M parameters, into the decoder. We use a learning rate of 1e-4 under the XE loss and 1e-5 during reinforcement learning. We train the models with the AdamW optimizer [39] and a batch size of 25. The beam size is 5. The threshold $\tau$ is tuned on the validation set for each training data size.
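As one way to obtain the pretrained weights and BPE vocabulary, the GPT-2 (small) checkpoint can be fetched through the HuggingFace transformers library as sketched below; how the weights are mapped onto the 12 decoder layers is model-specific and omitted here.

```python
# Sketch: fetch the GPT-2 (small) weights and BPE tokenizer via HuggingFace
# transformers; the mapping onto the decoder layers is not shown.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # byte-pair encoding vocabulary
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")      # ~117M-parameter pretrained LM

state = gpt2.state_dict()                           # pretrained weights to copy
print(len(tokenizer), "BPE tokens,",
      sum(p.numel() for p in gpt2.parameters()), "parameters")
```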
A.2. Training Details
We train all the models in two steps. We first train the models with the cross-entropy (XE) loss and then finetune them using reinforcement learning. The cross-entropy loss $L_{XE}$ is the traditional autoregressive classification loss,
$$L_{XE} = -\sum_{t=1}^{T} \log p(w_t \mid w_{1:t-1}) \qquad (16)$$
where $w_{1:T}$ represents the target ground-truth sequence.
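A sketch of one XE training step corresponding to Equation 16 is shown below; the model and batch interfaces are assumptions made for illustration.

```python
# Sketch of the cross-entropy (XE) training step of Eq. 16; `model`,
# `optimizer` and the batch format are assumed interfaces.
import torch.nn.functional as F

def xe_step(model, optimizer, visual_feats, captions, pad_id):
    # captions: (batch, T) token ids; predict token t from tokens < t
    logits = model(visual_feats, captions[:, :-1])          # (batch, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           captions[:, 1:].reshape(-1),
                           ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```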
For reinforcement learning, we employ a variant of Self-Critical Sequence Training [54]. Following [12], we sample $L$ sentences, $\hat{w}^1_{1:T}, \ldots, \hat{w}^L_{1:T}$, with beam search and use the mean reward of the $L$ sentences as the baseline $b$. The gradient is
$$\nabla_\theta L_{RL}(\theta) = -\frac{1}{L} \sum_{i=1}^{L} \left( r(\hat{w}^i_{1:T}) - b \right) \nabla_\theta \log p(\hat{w}^i_{1:T}) \qquad (17)$$
where $r(\cdot)$ represents the CIDEr-D reward.
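A sketch of the self-critical update of Equation 17 is given below; `model.sample_beam` and `cider_reward` are assumed interfaces introduced only for illustration, not the released API.

```python
# Sketch of the self-critical update of Eq. 17: the mean reward of the
# beam-searched samples is used as the baseline. `model.sample_beam` and
# `cider_reward` are hypothetical helpers.
def scst_step(model, optimizer, visual_feats, refs, beam_size=5):
    captions, log_probs = model.sample_beam(visual_feats, beam_size)  # L sampled sentences
    rewards = cider_reward(captions, refs)                 # (batch, L) CIDEr-D scores
    baseline = rewards.mean(dim=1, keepdim=True)           # mean reward as baseline b
    loss = -((rewards - baseline) * log_probs).mean()      # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```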
A.3. More training data
Figure 7 shows additional results obtained by training the networks on 5%, 10%, 20%, 50% and 100% of the data. VisualGPT outperforms the other baseline models when we sample 20% of the training data, highlighting its effectiveness in low-data regimes.
A.4. Ablation study on τ of our SRAU
To evaluate the effect of different values of τ in our SRAU, we set τ to 0, 0.1, 0.2 and 0.3 and test on the COCO dataset [35]. In Figure 8, we show that τ > 0 outperforms τ = 0 in most cases, meaning that SRAU is better than soft gating.
[Figure 7 plot: CIDEr (y-axis) versus the percentage of training data from 0.1% to 100% (x-axis) for Transformer, AoANet, M2 Transformer and VisualGPT.]
Figure 7. Evaluation on different percentages of training data.
[Figure 8 data: CIDEr as a function of the SRAU threshold τ ∈ {0, 0.1, 0.2, 0.3} for each training-data fraction]
Training data   τ = 0   τ = 0.1   τ = 0.2   τ = 0.3
0.1%            44.3    42.7      45.1      41.4
0.5%            69.0    67.9      70.3      67.6
1%              82.5    81.5      80.9      81.7
5%              99.5    100.5     103.0     99.0
Figure 8. Ablation study on different thresholds τ for SRAU.
A.5. Attention over Different types of words
We use the spaCy parser to detect the part of speech of words in the captions and calculate the mean visual attention score for each part of speech. The result is presented in Fig. 9. We find that parts of speech tied to visual content, such as nouns (0.71), verbs (0.71) and adjectives (0.72), have high visual attention scores, whereas linguistic parts of speech such as pronouns (0.53), punctuation (0.58), and determiners (0.61) receive low attention.
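A minimal sketch of this per-part-of-speech aggregation is shown below; the pairing of captions with per-word scores is an assumed input format.

```python
# Sketch: average the per-word visual scores by part of speech with spaCy.
# The (caption, scores) pairing format is an assumption, not the released code.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def mean_visual_score_by_pos(captions_with_scores):
    """captions_with_scores: list of (caption string, list of per-word scores)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for caption, scores in captions_with_scores:
        doc = nlp(caption)
        for token, score in zip(doc, scores):   # assumes one score per spaCy token
            sums[token.pos_] += score
            counts[token.pos_] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}
```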
A.6. Hallucination Effect of GPT-2
Directly applying a pretrained language model could potentially suffer from hallucination: generating words that do not correspond to the image but conform to the prior knowledge of GPT-2. To evaluate this hallucination effect, we performed a human study on models trained using 1% of the COCO data. We randomly sample 250 images with the generated caption from each model. For each image, we ask 5 different participants whether the caption describes objects not in the image or misses objects in the image
(shown in Tables 6 and 7). To catch random clickers, we create 5 images with verified captions and ask the same questions to 100 participants for each image. Participants who answered these questions incorrectly are considered unreliable and removed from the results. Compared to the baselines, VisualGPT hallucinates less and has higher coverage of objects.
Figure 9. Attention scores over different part-of-speech words.
Answer Ours M2Transformer Transformer AoANet GT
No 719 624 633 621 973
Yes 367 438 456 447 73
No Rate 0.66 0.59 0.58 0.58 0.93
Table 6. Does the caption miss things that are shown in the image?
Answer Ours M2Transformer Transformer AoANet GT
No 720 692 633 655 448
Yes 360 418 423 412 43
No Rate 0.67 0.62 0.60 0.61 0.96
Table 7. Does the caption describe things that aren’t in the image?
A.7. More Qualitative Examples
In Figure 10, we provide more examples of the visual attention assigned to
each word. Blue indicates high visual scores and red indicates low visual
scores. We can observe that VisualGPT assigns high scores to content words
like "steam engine", "elephants", "horse", "lush" and "cabinets", and low
visual scores to determiners and prepositions like "to" and "at".
We also show examples of captions generated by our VisualGPT and several
strong baseline models, including the Transformer (3 layers) [60],
M2 Transformer (3 layers) [12] and AoANet [22], in Table 8, Table 9 and
Table 10. Overall, we can observe that our VisualGPT describes the image
content more accurately than the baseline models.
Figure 10. More examples of visual attention for each word in the generated captions. High visual scores are shown in blue and low scores in red.
Transformer: a woman riding some skis on skis
M2 Transformer: a couple of skiers are standing near the snow
AoANet: a man with skis in the snow
VisualGPT (ours): a group of people walk on a snowy mountain
GT1: the people are walking through snow in a wooded area
GT2: two people wearing skis traveling through the snow
GT3: a man is walking down a path covered in a snow
GT4: a couple is skiing through the snowy woods
GT5: a couple of people that are in a snowy field

Transformer: a street that has some street in it
M2 Transformer: a traffic light over a street light under a traffic light
AoANet: a street with people on a city street
VisualGPT (ours): a street with tall signs and traffic signs
GT1: a yellow traffic light above a street next to houses
GT2: a street scene of an intersection with a street light
GT3: a stop light hanging over an intersection in a residential area
GT4: a traffic signal at an intersection is suspended on wire
GT5: a street intersection with a traffic light over it

Transformer: some pizza are sitting on a plate
M2 Transformer: a plate with food and a knife on it
AoANet: a plate of pizza on a table
VisualGPT (ours): a plate of bread are served on a table
GT1: a batch of bread slices sitting on a plate
GT2: a plate with some pieces of bread on it
GT3: sliced french bread is on a plat that is lying on a table
GT4: bread that is sitting on a plate that is on a table
GT5: a white plate with lots topped with garlic bread

Transformer: two tennis player playing tennis on the ball
M2 Transformer: a tennis player about to hit a ball
AoANet: a baseball players on a game playing a game
VisualGPT (ours): a tennis player hits a ball with a racket
GT1: a man holding a racquet on top of a tennis court
GT2: a man with a tennis racket reaches for a ball
GT3: a man with a tennis racket is running on a court
GT4: a young man is playing a game of tennis
GT5: a tennis player in a blue shirt runs toward a ball

Transformer: a group of birds that are standing in the grass
M2 Transformer: a flock of birds perched in a tree branch
AoANet: several giraffe are standing next to each trees
VisualGPT (ours): a bird standing in the middle of a pond
GT1: a bird is perched a top a branch over a river
GT2: a bird sits on a branch above a stream
GT3: a bird on top of a tree branch over water
GT4: a picture of an outside region that appears incredible
GT5: a bird on a fallen branch in a body of water

Table 8. Captions generated by our VisualGPT, Transformer, M2 Transformer and AoANet on the 0.1% MS COCO data split.
Transformer: several boats are sitting in the middle of a lake
M2 Transformer: a boat filled with boats floating in the water
AoANet: an empty boat that has water and water
VisualGPT (ours): a canal filled with boats in the water
GT1: a blue boat docked on a green lush shore
GT2: a small marina with boats docked there
GT3: a group of boats sitting together with no one around
GT4: some boats parked in the water at a dock
GT5: boats sitting around the side of a lake by a tree

Transformer: pizza slices and pizza in a plate covered pizza
M2 Transformer: people sitting at a table eating pizza and other salad
AoANet: two pizza eating a table with pizza on the table
VisualGPT (ours): a group of pizza on a iron plate with toppings
GT1: a set of five pizzas sitting next to each other each with different toppings
GT2: a handful of prepared pizzas sit next to each other
GT3: five uncooked pizzas with a variety of different toppings
GT4: five unbaked pizzas that include various types of cheeses
GT5: five different pizzas are being prepared over a metal tray
Transformer: a group of people taking a child in a in a building
M2 Transformer: a group of people in an airport with their hands
AoANet: a picture of a young group of people standing for men
VisualGPT (ours): a group of people standing around a tv
GT1: a group of men standing around a room
GT2: some people are waiting in a long room
GT3: people are standing in a room looking at a television screen
GT4: a person sitting on a bench while the rest look somehwere else
GT5: a man in red winter clothes sits on a bench with people behind him gather in front of a tv

Transformer: an elephant eating a elephant has a elephant
M2 Transformer: elephant with its trunk with their elephant with its trunk
AoANet: two elephants standing at a lot of trees
VisualGPT (ours): three elephants standing next to some trees
GT1: two adult elephants are surrounding a baby elephant
GT2: a baby elephant kneeling in front of two bigger elephants
GT3: a baby elephant and it 's parents eat fruit
GT4: elephants eat fruit a baby elephant rummaging in the food
GT5: a pair of adult elephants with a baby elephant eat from a pile of fruit

Table 9. Captions generated by our VisualGPT, Transformer, M2 Transformer and AoANet on the 0.5% MS COCO data split.
Transformer: a man in a suit and a woman standing in a shop
M2 Transformer: a man is standing in a shop with a people holding people
AoANet: a man is working on a bus in a
VisualGPT (ours): a group of people standing at an airport with their luggage
GT1: several people are purchasing tickets at a bus station
GT2: some people are checking in at the ticket counter somewhere in asia
GT3: people waiting in line with luggage at a ticket counter
GT4: people are standing near an airport ticket kiosk
GT5: customers stand at a kiosk waiting for tickets

Transformer: a bus that is parked in front of a building
M2 Transformer: a couple of people walking down the side of a street
AoANet: a bus is parked in a city street
VisualGPT (ours): a while and blue bus is parked on the side of a city street
GT1: people standing outside of a blue and white bus
GT2: an image of a tour bus that is picking people up
GT3: several people standing around buses and most wearing orange vests
GT4: a public transit bus pulling up to pick up passengers
GT5: a city bus at a stop waiting to pick up passengers

Transformer: a blue and white airplane flying through a sky
M2 Transformer: an air plane flying in the air
AoANet: a plane airplane flying down in the sky
VisualGPT (ours): a plane is flying in the air over the trees
GT1: there 's and airplane in the sky flying over some trees
GT2: a large plane is flying over a crowd of trees
GT3: a aeroplane soaring high in the sky above the trees
GT4: a passenger plane flies in the sky over a forest
GT5: an airplane is seen flying over several trees

Transformer: a white toilet sitting in a white bathroom next to a sink
M2 Transformer: a cat sitting in the toilet
AoANet: a bathroom with a toilet and a sink
VisualGPT (ours): a cat sitting on top of a bathroom sink
GT1: a cat climbing into a bathroom sink looking at someone
GT2: a cat looks up as it stands in the bathroom sink
GT3: a large cat stands inside of a clean bathroom sink
GT4: cat is caught stepping in to the bathroom sink
GT5: a cute kitty cat in the sink of a bathroom near a brush and other items

Transformer: a little girl is eating a birthday cake
M2 Transformer: a child and a child are sitting at a table with table with table
AoANet: two children sitting at a table with a laptop computer
VisualGPT (ours): a woman and a girl sitting at a table with a birthday cake
GT1: a woman and child stand next to a table with cake on it
GT2: a lady standing near the table with a baby is posing for the camera
GT3: a woman stands beside a baby in a high chair a table is set with a birthday cake and champagne
GT4: a woman setting up her house for a party
GT5: a person standing next to a child in a booster seat

Table 10. Captions generated by our VisualGPT, Transformer, M2 Transformer and AoANet on the 1% MS COCO data split.