Available via license: CC BY 4.0
BORNON: BENGALI IMAGE CAPTIONING WITH
TRANSFORMER-BASED DEEP LEARNING APPROACH
A PREPRINT
Faisal Muhammad Shah
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
faisal.cse@aust.edu
Mayeesha Humaira
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
mayeeshahumaira@gmail.com
Md Abidur Rahman Khan Jim
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
jimrahman33@gmail.com
Amit Saha Ami
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
amitsaha.aust@gmail.com
Shimul Paul
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
shimulpaul59@gmail.com
September 14, 2021
ABSTRACT
Image captioning with an Encoder-Decoder approach, where a CNN is used as the Encoder and a sequence
generator such as an RNN as the Decoder, has proven to be very effective. However, this method has a
drawback: the sequence needs to be processed in order. To overcome this drawback, some researchers
have utilized the Transformer model to generate captions from images using English datasets. However,
none of them generated captions in Bengali using the Transformer model. We therefore utilized
three different Bengali datasets to generate Bengali captions from images using the Transformer
model. Additionally, we compared the performance of the Transformer-based model with a visual
attention-based Encoder-Decoder approach. Finally, we compared the results of the Transformer-based
model with those of other models that employed different Bengali image captioning datasets.
Keywords Bengali Image Captioning · Transformer Model · Visual Attention · Bornon Dataset
1 Introduction
Image captioning is the process of describing an image in natural language; it combines two fields of deep
learning, computer vision and natural language processing (NLP). For many years, researchers have examined methods
to caption images automatically. The task involves recognizing the objects in an image, their attributes, and their
relationships in order to generate fluent and correct sentences, which makes it very challenging. Image captioning can
be used for social and security purposes: for example, it can increase children's interest in early education, and
security camera footage can be captioned in real time to prevent theft or hazards such as fire.
Image captioning is a sequence modeling problem that commonly employs a CNN-RNN-based encoder-decoder framework. In
this framework, the encoder extracts image features to obtain feature vectors, which are then passed through an RNN to
generate the language description. Most previous work, e.g., Subash et al. [2019], Wang et al. [2018], Humaira et al.
[2021], utilized this CNN-RNN approach to generate captions from images. However, this method has a drawback: due to
the structure of the LSTM and other RNNs, the current output depends on the hidden state at the previous time step.
As a result, they can only operate step by step, which makes it impossible to parallelize caption generation.
The Transformer model proposed by Vaswani et al. [2017] solved this parallelism problem. Because it is based on an
attention mechanism and has no sequential dependence, the Transformer can run in parallel during the training phase.
arXiv:2109.05218v1 [cs.CV] 11 Sep 2021
Recently, some researchers Zhang et al. [2019], He et al. [2020] have utilized the Transformer model instead of an RNN
to generate captions from images. However, these studies were conducted on English datasets. To see how this model
performs on Bengali data, we utilized three Bengali datasets. The approach to captioning images in Bengali using the
Transformer model is illustrated in Fig. 1. Furthermore, we compare its performance with the visual attention-based
approach to captioning images in Bengali that was proposed by Ami et al. [2020]. This visual attention-based
approach is shown in Fig. 4. Bengali is the 7th most used language worldwide¹, and most natives in some parts of
India and Bangladesh do not know English. Hence, it is also necessary to caption images in Bengali alongside English.
The contributions of this paper are as follows:
• Three Bengali datasets were used to train the model.
• A Transformer model was combined with a CNN to generate captions from images in Bengali.
• A visual attention-based approach was employed to compare its performance with the Transformer-based approach.
• The performance of the Transformer-based model was compared with that of other models for captioning images in Bengali.
2 Related Works
This section reviews the progress in image captioning. To date, many studies have been conducted and many
models have been developed in order to generate captions that are syntactically correct.
2.1 Image captioning in Bengali
Only seven works have been done on image captioning in Bengali till now. Rahman et al. [2019] was the first paper on
image captioning in Bengali, followed by Deb et al. [2019], Kamal et al. [2020], Khan et al. [2021] and Jishan et al.
[2021]. Rahman et al. [2019] aimed to outline an automatic image captioning system in Bengali called 'Chittron'.
Their model was trained to predict a Bengali caption from an input image one word at a time. The training
process was carried out on 15700 images of their own dataset, BanglaLekha. In their model, the image feature vector
and the words, converted to vectors by an embedding layer, were fed to a stacked LSTM layer.
One drawback of their work was that they utilized the sentence BLEU score instead of the corpus BLEU score.
Deb et al. [2019], on the other hand, illustrated two models for image captioning in Bengali, the Par-Inject
architecture and the Merge architecture. In the Par-Inject model, image feature vectors were fed into an intermediate
LSTM, and the output of that LSTM and the word vectors were combined and fed to another LSTM to generate captions in
Bengali. In the Merge model, in contrast, image feature vectors and word vectors were combined and passed to an LSTM
without the use of an intermediate LSTM. They utilized 4000 images of the Flickr8k dataset, and the Bengali captions
their models generated were not fluent. Kamal et al. [2020] used a CNN-RNN-based model in which VGG-16 was used as the
CNN and an LSTM with 256 channels as the RNN. They trained their model on the BanglaLekha dataset, which has 9154
images. Khan et al. [2021] proposed a CNN-ResNet-50 merged model consisting of a ResNet-50 as the image
feature extractor and a 1D-CNN with word embedding for generating linguistic information. These two features
were then given as inputs to a multimodal layer that predicts what to generate next using the information at each time
step. Furthermore, Jishan et al. [2021] utilized the BNLIT dataset to implement a CNN-RNN model in which both
BRNN and LSTM were used as the RNN. Humaira et al. [2021] proposed a hybridized Encoder-Decoder approach
in which two word embeddings, fastText and GloVe, were concatenated. They also utilized beam search and greedy search
to compute the BLEU scores. Additionally, Ami et al. [2020] employed visual attention with the
Encoder-Decoder approach to caption images in Bengali. They added attention weights to the image features and passed
them to the GRU together with word vectors to generate captions. However, they did not use corpus BLEU scores to
evaluate the captions. We compare the corpus BLEU scores of the visual attention-based approach with those of the
Transformer-based approach.
¹ https://www.vistawide.com/languages/top_30_languages.htm
Figure 1: Visualization of how the Transformer model generates words from an input image. First, image features
were extracted by the CNN and passed to the Encoder of the Transformer. Then the vocabulary was passed to the
Decoder part of the Transformer. The Transformer then generated a Bengali caption for the corresponding image.
Figure 2: Illustration of how our model learns words from an input image to generate a caption using the visual
attention-based approach Ami et al. [2020]. First, image features were extracted using the CNN. Then attention scores
were applied to the image features, which were passed to the GRU. In parallel, tokenized words were passed to the
embedding layer to convert the vocabulary to vectors. These word vectors were also passed to the GRU. The GRU then
generates Bengali captions word by word using the word vectors and the attention-weighted image features.
2.2 Image captioning in other Languages
Previously, much research was conducted on English, as the available datasets were all in the English language. The
authors of Aneja et al. [2018] adapted the attention mechanism to generate captions. For the vision part of image
captioning, most papers Liu et al. [2018], Subash et al. [2019], Wang et al. [2018] used VGG-16 as the CNN, but some of
them also used AlexNet Liu et al. [2018], Wang et al. [2018] or ResNet Liu et al. [2018] for feature extraction.
Some researchers also utilized BiLSTM Liu et al. [2018]. Alongside English, researchers have also generated
captions in Chinese Lan et al. [2017], Li et al. [2016], Japanese Yoshikawa et al. [2017], Arabic Jindal [2018] and
Bahasa Indonesia Nugraha et al. [2019].
2.3 Image captioning using Transformer
The Transformer model has previously been used for image captioning on English datasets. Li et al. [2019]
investigated a Transformer-based sequence modeling framework for image captioning that was built only with
attention layers and feedforward layers. Herdade et al. [2019] employed object spatial relationship
modeling for image captioning within the Transformer encoder-decoder architecture by incorporating an
object relation module in the Transformer encoder. Atliha and Šešok [2020] proposed augmenting the image
captions in a dataset, including augmentation using BERT, to improve solutions to the image captioning
problem. Furthermore, Lee et al. [2020] utilized a two-stream Transformer-based architecture, with one stream for the
visual part and another for the textual part. Zhu et al. [2018] used an encoder-decoder architecture in which
the encoder is a CNN model and the decoder is a Transformer model; it also uses a stacked self-attention mechanism.
Zhang et al. [2019] use a CNN as an encoder to extract image features; the output of the encoder is a context vector
that contains the necessary information from the image, which is then fed into a Transformer to generate the captions.
He et al. [2020], on the other hand, introduced the image transformer for image captioning, in which each transformer
layer implements multiple sub-transformers to encode spatial relationships between image regions and decode the
diverse information in the image regions.
2.4 Image captioning using Attention Mechanism
Visual attention has previously been used on English datasets by many researchers. Two main types of attention
have been used in encoder-decoder models for image or video captioning: semantic attention, which applies
attention to text, and spatial attention, which applies attention to images. Xu et al.
[2015] proposed the first visual attention model for image captioning. They used either “hard” pooling, which selects
the most probably attentive region, or “soft” pooling, which averages the spatial features with attentive weights.
Additionally, Chen et al. [2017] utilized spatial and channel-wise attention in a CNN. Chen et al. [2018] also
employed visual attention to generate captions. Lastly, You et al. [2016] employed a semantic attention model
to combine the visual features with visual concepts in a recurrent neural network that generates the image caption.
3 Model Architecture
We utilized the Transformer model and the attention-based model proposed by Ami et al. [2020] to caption images in
Bengali. The attention-based model processes the sequence in order, whereas the Transformer model does not.
Hence, the Transformer model allows parallel processing of captions. The Transformer model is illustrated in
Fig. 3 and the attention-based approach is shown in Fig. 4.
3.1 Transformer-based Approach
The Transformer Vaswani et al. [2017] is a deep learning model that utilizes the attention mechanism to weight
the influence of different parts of the input data. It is made of a stack of encoder and decoder components.
In Fig. 3, the left block marked Nx is the encoder and the right block marked Nx is the decoder, where N is a
hyperparameter that represents the number of encoder and decoder components. The model takes two inputs: the image
features extracted by the CNN, which go to the Encoder, and the vocabulary formed from the list of target captions
in the dataset, which goes to the Decoder.
3.1.1 Encoder
InceptionV3 was used as the CNN in this experiment. The images in the dataset were first passed to the CNN.
InceptionV3 extracts image features and passes them through a dense layer with ReLU activation, which changes
the dimension of the image feature vector from d to d_model, where d_model is the dimension of the word embedding.
Figure 3: Illustration of the Transformer-based model to caption images in Bengali.
These image feature vectors were then summed with the positional encoding and passed through N encoder layers.
Figure 4: Illustration of the attention-based approach Ami et al. [2020] to caption images in Bengali.
Each of these encoder layers was made up of two sublayers: multi-head attention with a padding mask and a
point-wise feed-forward network. Masking ensures that the model does not treat padding as input.
The output of the encoder was then passed to the decoder as K (key) and V (value). The multi-head mechanism is
explained in Section 7.
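The text does not state which positional encoding was used. As a minimal sketch, assuming the sinusoidal encoding of Vaswani et al. [2017] and assuming the 8x8 feature grid is flattened into 64 positions, the summation of image features and positional encoding could look as follows. The function names and the small d_model value are illustrative only.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position codes."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(features, encoding):
    """Element-wise sum of projected image features and position codes."""
    return [[f + p for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(features, encoding)]

# Assumption: 64 positions stand for an 8x8 feature grid flattened row by row;
# d_model=8 is a toy value chosen only to keep the example small.
pe = positional_encoding(num_positions=64, d_model=8)
features = [[1.0] * 8 for _ in range(64)]  # placeholder projected features
encoder_input = add_positions(features, pe)
```

Because the encoding depends only on the position index, it can be computed once and reused for every image.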
3.1.2 Decoder
The decoder takes as input the target captions in the dataset, which were passed through an embedding layer and
summed with the positional encoding. Positional encoding is added to give the model information about the relative
position of the words in the sentence, so that words are placed in the d-dimensional space based on both the
similarity of their meaning and their position in the sentence. The output of the summation was then passed through
N decoder layers. Each of these decoder layers was made of three sublayers. The first was masked multi-head attention
with a look-ahead mask and a padding mask. The second was multi-head attention with a padding mask, in which V (value)
and K (key) receive the encoder output as inputs and Q (query) receives the output of the masked multi-head attention
sublayer. The third was a point-wise feed-forward network. The output of the decoder was then sent to the linear layer
as input. Finally, the model predicts one word at a time using softmax probabilities and uses the output so far to
decide what to predict next. This whole process is illustrated in Fig. 3.
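The two masks mentioned above can be illustrated with a small sketch. It assumes that token id 0 marks padding (consistent with the zero padding described in Section 5) and that a mask value of 1 means "do not attend"; both conventions and all function names are assumptions made for illustration, not details taken from the authors' code.

```python
def padding_mask(token_ids, pad_id=0):
    """1 marks a padded position that attention should ignore, 0 a real token."""
    return [1 if t == pad_id else 0 for t in token_ids]

def look_ahead_mask(size):
    """mask[i][j] == 1 when position j lies in the future of position i."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

def combined_mask(token_ids, pad_id=0):
    """A position is blocked if it is in the future or if it is padding."""
    pad = padding_mask(token_ids, pad_id)
    ahead = look_ahead_mask(len(token_ids))
    size = len(token_ids)
    return [[max(ahead[i][j], pad[j]) for j in range(size)] for i in range(size)]

# Hypothetical caption of three tokens padded to length five.
tokens = [5, 12, 7, 0, 0]
mask = combined_mask(tokens)
```

With this mask, the first decoder position can attend only to itself, while the third position can attend to all three real tokens but never to the padding.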
3.2 Visual Attention-based Approach
A visual attention-based model was used to focus only on the relevant parts of the image. It is an Encoder-Decoder
approach that processes the sequence in order. This model has three main parts. Firstly, a Convolutional Neural Network
(CNN) extracts features from images. Secondly, an attention mechanism gives weights to the image features, while
the Bengali vocabulary is converted to word vectors using an embedding layer. Finally, Gated Recurrent Units
(GRU) Chung et al. [2014], a sequence generator, take the word vectors and the weighted image features as input and
generate Bengali captions in order. This process is illustrated in Fig. 4.
4 Dataset
The main aim of this research is to generate Bengali captions from images. To accomplish this task, a dataset in the
Bengali language is needed, consisting of a set of images and a text file containing the Bengali captions associated
with each image. However, almost all the datasets available for image captioning are in English. The only available
Bengali dataset till now is the BanglaLekha dataset. Since one dataset is not enough to validate the performance of
the models, we created a new Bengali dataset named Bornon. Furthermore, we utilized the Flickr8k dataset by translating
its English captions to Bengali, and then merged it with the BanglaLekha and Bornon datasets to form a new merged
dataset. This merged dataset was constructed especially to test the Transformer model, since Transformer-based models
are data-hungry. For generating Bengali captions from images, these datasets were split into three parts: training,
validation, and testing. The split of each dataset used in our experiment is shown in Table 1.
Table 1: Distribution of data for the different Bengali datasets used in our experiment.
Dataset       Total Images   Training       Validation     Testing
Flickr8k      8000           6000 (75%)     1000 (12.5%)   1000 (12.5%)
BanglaLekha   9154           7154 (78%)     1000 (11%)     1000 (11%)
Bornon        4100           2900 (72%)     600 (14%)      600 (14%)
Merged        21414          12850 (60%)    4282 (20%)     4282 (20%)
4.1 Flickr8K_BN
The Flickr8k² dataset is a publicly available English dataset that contains 8091 images, of which 6000 (75%) are
employed for training, 1000 (12.5%) for validation, and 1000 (12.5%) for testing. Each image of the Flickr8k dataset
comes with five ground truth captions describing it, which adds up to a total of 40455 captions for 8091 images.
For image captioning in Bengali, those 40455 captions were converted to Bengali using Google Translator³.
Unfortunately, some of the translated captions were syntactically incorrect, as shown in Fig. 5. Hence, we manually
checked all 40455 translated captions and corrected them. Flickr8K-BN is the resulting Bengali Flickr8k dataset.
Some images of the Flickr8k dataset along with their associated captions are shown in Fig. 6.
Figure 5: Illustration of Bengali captions after being translated using Google Translator. Sentences marked
in red are syntactically incorrect Bengali sentences, and the sentences inside the brackets are the manually
corrected sentences.
4.2 BanglaLekha
We also utilized the BanglaLekha⁴ dataset, which consists of 9154 images, of which 7154 (78%) are employed
for training, 1000 (11%) for validation, and 1000 (11%) for testing. It is the only available
Bengali dataset till now. All its captions are human-annotated. One problem with this dataset is that it has only two
captions associated with each image, resulting in 18308 captions for those 9154 images. Hence, its vocabulary is
smaller than that of Flickr8k-BN: Flickr8k-BN consists of 12953 unique Bengali words, whereas BanglaLekha consists of
only 5270. Some images of the BanglaLekha dataset along with their associated captions are shown in Fig. 7.
Figure 6: Illustration of some images of the Flickr8k dataset along with their five Bengali captions.
² https://www.kaggle.com/adityajn105/flickr8k/activity
³ https://translate.google.com/
⁴ https://data.mendeley.com/datasets/hf6sf8zrkc/2
4.3 Bornon
Owing to the lack of Bengali image captioning datasets and to overcome the shortcomings of the existing Bengali image
captioning dataset, BanglaLekha Rahman et al. [2019], we created a new dataset named Bornon. The Bornon dataset
consists of 4100 images, and each image has five captions describing it. Thus, there is a total of 20500 captions for
4100 images. Images were kept in a folder and the associated captions were kept in a text file. Some images of the
Bornon dataset along with their associated captions are shown in Fig. 8.
The images of this dataset were taken from a personal photography club. All images were in jpg format. These
images portray various subjects such as animals, birds, people, food, weather, trees, flowers, buildings, cars, and
boats. Frequent Bengali words in this dataset are illustrated in Fig. 9. Around 17 native Bengali speakers were
responsible for annotating and evaluating the captions.
Only two captions are associated with each image in the BanglaLekha dataset, which reduces the vocabulary size; hence,
we provided five captions for each image in our Bornon dataset. The vocabulary of the Bornon dataset contains 6228
unique Bengali words for only 4100 images, whereas the BanglaLekha dataset has a vocabulary of 5270 unique Bengali
words for 9154 images. If the vocabulary is small, repetition of words is observed in the predicted captions. However,
4100 images are not enough to train a Transformer-based model. Therefore, we plan to increase the number
of images in our dataset in the future.
Figure 7: Illustration of some images of the BanglaLekha dataset along with their two Bengali captions.
4.4 Merged Dataset
The Transformer model is data-hungry: it performs well when a large amount of data is provided. However, all the
Bengali datasets mentioned above are small, which is not enough to train the Transformer model. As a
result, we merged the three datasets Flickr8k, BanglaLekha, and Bornon. This resulted in 21414 images, each with
two associated captions, for a total of 42828 captions. We took two captions per image from all the
datasets because the BanglaLekha dataset had only two Bengali captions describing each image. This merging led to a
vocabulary of 13416 unique Bengali words and images of various categories.
5 Text Embedding
Firstly, the maximum length of the target captions in each dataset was computed. Then all sentences shorter than
the maximum length were padded with zeroes. Afterward, the top 5000 unique Bengali words were selected
from each dataset to tokenize the Bengali captions using Keras’s text tokenizer. Since a model cannot be trained
directly on text, these tokens were converted to numeric form using text embedding. The embedding layer maps each
token to a vector having d_model dimensions. All the embedding vectors of one sentence are combined into a matrix
and provided as input to the Transformer or the GRU.
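The tokenization and padding steps can be sketched in plain Python. This is a simplified stand-in for Keras's text tokenizer, not the authors' code: the helper names, the `<pad>` and `<unk>` entries, the `<start>`/`<end>` markers, and the English placeholder captions (standing in for Bengali captions) are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(captions, top_k=5000):
    """Keep the top_k most frequent words; 0 is padding, 1 is out-of-vocabulary."""
    counts = Counter(word for caption in captions for word in caption.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(top_k):
        vocab[word] = len(vocab)
    return vocab

def encode_and_pad(captions, vocab):
    """Map words to ids and right-pad every sequence with zeroes to the maximum length."""
    encoded = [[vocab.get(w, vocab["<unk>"]) for w in caption.split()]
               for caption in captions]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]

# Placeholder captions; real inputs would be Bengali sentences.
captions = ["<start> a dog runs <end>", "<start> a boy plays outside <end>"]
vocab = build_vocab(captions, top_k=3)   # tiny top_k to show the <unk> fallback
padded = encode_and_pad(captions, vocab)
```

With `top_k=3`, only the three words that occur twice are kept, so every other word is mapped to the out-of-vocabulary id, and the shorter caption receives one trailing zero.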
6 Convolutional Neural Network
InceptionV3 Szegedy et al. [2016] was utilized as the Convolutional Neural Network (CNN) for extracting features from
images for the Transformer-based model. As this is not a classification task, the last layer of InceptionV3, a softmax
layer, was removed from the model. All the images were then resized to the same size, 299×299, before
being fed into the model. The shape of the resulting output was 8x8x2048. The features were extracted,
stored as .npy files, and then passed through the encoder.
Two different CNNs, InceptionV3 and Xception Chollet [2017], were employed in the different experimental setups
of the visual attention-based model. The last layer of both CNNs was removed. Like the Transformer model, the
attention-based model took images of size 299×299, so images were resized and fed to the CNN. The
extracted image features were then saved to .npy files, and attention weights were applied to them.
Figure 8: Illustration of some images of the Bornon dataset along with their five Bengali captions.
7 Attention in Transformer
Self-attention is calculated using vectors. Three matrices, Query, Key, and Value, are needed for each of the
encoder’s inputs to calculate self-attention. These matrices are obtained by multiplying the embedding matrix by the
trained weight matrices. Finally, the self-attention matrix is calculated using the following formula:
$$Z = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{1}$$
where Z is the self-attention matrix, Q is the Query matrix, K is the Key matrix, V is the Value matrix, and $d_k$ is
the dimension of the key matrix. Vaswani et al. [2017] further refined the self-attention layer by adding a
mechanism called “Multi-Head” attention. Multi-Head attention performs the self-attention calculation eight different
times with different weight matrices.
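Eq. 1 can be illustrated with a small pure-Python sketch for a single attention head. The helper names and the toy matrices are illustrative assumptions; a real implementation would use batched tensor operations and apply the masks before the softmax.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scaled_dot_product_attention(q, k, v):
    """Z = softmax(Q K^T / sqrt(d_k)) V, returning Z and the attention weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy example: both keys are identical, so each query attends to them equally.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
z, w = scaled_dot_product_attention(q, k, v)
```

In this toy case every row of `w` is uniform, so each row of `z` is simply the average of the two value vectors.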
Figure 9: Illustration of the most frequent Bengali words in the Bornon dataset.
8 Visual Attention Mechanism
Two types of spatial attention that are widely used are Global attention Luong et al. [2015] and Local attention. We
employed Local attention, which is also known as Bahdanau attention Bahdanau et al. [2014], because Global attention
is computationally expensive and infeasible for long sentences. Global attention places attention on all source
positions, whereas Bahdanau attention focuses on a small subset of the hidden states of the encoder per target word.
Several steps were followed to implement Bahdanau attention. Firstly, the extracted image features were passed through
a fully connected layer in the CNN Encoder to produce a hidden state for each element. Then alignment scores were
calculated from the hidden state produced by the decoder in the previous time step and the encoder outputs, using the
formula shown in Eq. 2. This alignment score is the main component of the attention mechanism.
$$\mathrm{score}_{\mathrm{alignment}} = W_{\mathrm{combined}} \cdot \tanh\left(W_{\mathrm{decoder}} \cdot H_{\mathrm{decoder}} + W_{\mathrm{encoder}} \cdot H_{\mathrm{encoder}}\right) \tag{2}$$
The alignment scores were then passed through the softmax function and represented in a single vector called the
attention weights using Eq. 3. This vector was then multiplied with the image features to form the context vector
using Eq. 4.
$$a_{jt} = \mathrm{softmax}(e_{jt}) \tag{3}$$
where $a_{jt} = \frac{\exp(e_{jt})}{\sum_{k=1}^{T_x} \exp(e_{kt})}$, such that $\sum_{j=1}^{T_x} a_{jt} = 1$ and $a_{jt} \ge 0$, and $e_{jt}$ is the alignment score.
$$c_t = \sum_{j=1}^{T_x} a_{jt} h_j \tag{4}$$
where $\sum_{j=1}^{T_x} a_{jt} = 1$, $a_{jt} \ge 0$, and $c_t$ is the context vector, that is, the weighted sum of the input.
Finally, this context vector was concatenated with the previous decoder output. It was then fed into the decoder Gated
Recurrent Unit (GRU) to produce a new output.
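The three steps of Eq. 2 to Eq. 4 can be traced with a small pure-Python sketch. The weight shapes, the toy values, and the function names are assumptions made for illustration; in particular, the combined weight is treated here as a vector that reduces the tanh output to one scalar score per image location.

```python
import math

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def additive_attention(features, h_dec, w_enc, w_dec, w_comb):
    """Return the context vector and attention weights for one decoding step."""
    dec_part = matvec(w_dec, h_dec)
    scores = []
    for h_j in features:
        enc_part = matvec(w_enc, h_j)
        hidden = [math.tanh(d + e) for d, e in zip(dec_part, enc_part)]
        scores.append(sum(w * x for w, x in zip(w_comb, hidden)))  # Eq. 2
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                             # Eq. 3
    dim = len(features[0])
    context = [sum(a * h_j[d] for a, h_j in zip(weights, features))
               for d in range(dim)]                                 # Eq. 4
    return context, weights

# Toy inputs: three image locations with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_dec = [0.5]                      # previous decoder hidden state
w_enc = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical encoder projection
w_dec = [[1.0], [1.0]]             # hypothetical decoder projection
w_comb = [1.0, 1.0]                # hypothetical scoring vector
context, weights = additive_attention(features, h_dec, w_enc, w_dec, w_comb)
```

The weights always sum to one, so the context vector is a convex combination of the image features; the location whose projected feature best matches the decoder state receives the largest weight.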
9 Gated Recurrent Units
In the attention-based approach, Gated Recurrent Units (GRU) Chung et al. [2014] were employed as the sequence
generator. Before words were passed to the GRU, they were converted to vectors using the embedding layer.
The GRU then predicts the next word in the sequence using the previous hidden state of the decoder, the
previously predicted word, and the context vector calculated by the attention model. The equation used to predict
the next word is given in Eq. 5.
$$s_t = \mathrm{RNN}\left(s_{t-1}, [e(\hat{y}_{t-1}), c_t]\right) \tag{5}$$
where $s_t$ is the new state of the decoder, $s_{t-1}$ is the previous state of the decoder, $e(\hat{y}_{t-1})$ is
the embedding of the previously predicted word, and $c_t$ is the context vector, that is, the weighted sum of the input.
However, the sequence problem remains: the data must be processed in order, so the beginning of the
sequence has to be processed before the end. To address this issue, we utilized the Transformer-based model to caption
images in Bengali.
10 Hyperparameters
The techniques used in this experiment were implemented in Jupyter Notebook. The models were developed with
Keras 2.3.1 and TensorFlow 2.1.0. We ran our experiments on an NVIDIA RTX 2060 GPU, which offers 1920
CUDA cores and 6 GB of GDDR6 VRAM. With these settings, it took approximately three hours to train each of the
experimental setups.
The number of layers and the number of heads in the Transformer were varied to tune the Transformer-based model.
Three, five, and seven were used as the number of layers, and one and two were used as the number of
heads. Furthermore, internal validation was employed to test the generalization ability of the trained model.
The model was trained for 50 epochs using the Adam optimizer with
a custom learning rate scheduler according to Eq. 6, where warmup_steps was set to 4000. Additionally,
SparseCategoricalCrossentropy was utilized as the loss function. The loss plot of one of the experimental setups using
the Transformer-based model is shown in Fig. 10.
$$lrate = d_{\mathrm{model}}^{-0.5} \cdot \min\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right) \tag{6}$$
Figure 10: Loss plot for the Transformer-based model over 50 epochs using 3 layers and 2 heads for the Bornon dataset.
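The learning rate schedule of Eq. 6 can be written as a short function. The value d_model = 512 used below is a hypothetical choice for illustration, since the text does not state the embedding dimension; only warmup_steps = 4000 is taken from the text.

```python
def transformer_lr(step_num, d_model, warmup_steps=4000):
    """Learning rate of Eq. 6; step_num must be at least 1."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# d_model=512 is an assumed value, not one reported in the paper.
lr_early = transformer_lr(1000, d_model=512)
lr_peak = transformer_lr(4000, d_model=512)
lr_late = transformer_lr(16000, d_model=512)
```

The rate grows linearly during the first warmup_steps steps, peaks when step_num equals warmup_steps, and then decays with the inverse square root of the step number, so quadrupling the step count after the peak halves the rate.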
Two different CNNs, InceptionV3 and Xception, were used in the different experimental setups of the
visual attention-based model. Internal validation was again employed to test the generalization
ability of the trained model. This model was trained for 50 epochs with a batch size of 64, using the
Adam optimizer and SparseCategoricalCrossentropy as the loss function. Fig.
11 and Fig. 12 demonstrate how the loss decreased over 50 epochs for InceptionV3 and Xception, respectively.
Figure 11: Loss plot for the visual attention-based model using InceptionV3 over 50 epochs on the Flickr8k-BN dataset.
Figure 12: Loss plot for the visual attention-based model using Xception over 50 epochs on the Flickr8k-BN dataset.
11 Experimental Results
After generating captions, the most important step is evaluating them to verify how similar the generated captions are
to the human-annotated captions. Hence, we used two evaluation metrics, BLEU and METEOR, to assess the
accuracy of our proposed models.
11.1 BLEU
Bilingual Evaluation Understudy (BLEU) Papineni et al. [2001] is currently the most widely used metric for evaluating
the quality of generated text. It indicates how closely machine-generated sentences match human-generated sentences.
It is predominantly utilized to evaluate the performance of machine translation. Sentences are compared based on
modified n-gram precision to generate BLEU scores. BLEU scores are computed using the following equations:
$$P(i) = \frac{\mathrm{Matched}(i)}{H(i)} \tag{7}$$
P(i) is the precision for each i-gram, where i = 1, 2, ..., N; it is the percentage of the i-gram tuples in the
hypothesis that also occur in the references. H(i) is the number of i-gram tuples in the hypothesis, and Matched(i)
is computed using the following formula:
$$\mathrm{Matched}(i) = \sum_{t_i} \min\left\{ C_h(t_i),\ \max_j C_{h_j}(t_i) \right\} \tag{8}$$
where $t_i$ is an i-gram tuple in hypothesis h, $C_h(t_i)$ is the number of times $t_i$ occurs in the hypothesis, and
$C_{h_j}(t_i)$ is the number of times $t_i$ occurs in reference j of this hypothesis.
$$\rho = \exp\left\{ \min\left(0, \frac{n - L}{n}\right) \right\} \tag{9}$$
where $\rho$ is the brevity penalty that penalizes short translations, n is the length of the hypothesis, and L is the
length of the reference. Finally, the BLEU score is computed by:
$$\mathrm{BLEU} = \rho \left\{ \prod_{i=1}^{N} P(i) \right\}^{\frac{1}{N}} \tag{10}$$
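Eq. 7 to Eq. 10 can be followed step by step with a small pure-Python sketch for a single hypothesis. This is an illustration of the equations as written, not the evaluation code used in the experiments: the function names are hypothetical, the reference length L is assumed to be the length of the reference closest to the hypothesis, and the score is set to zero when any precision is zero.

```python
import math
from collections import Counter

def ngrams(tokens, i):
    """Count all i-gram tuples in a token list."""
    return Counter(tuple(tokens[k:k + i]) for k in range(len(tokens) - i + 1))

def modified_precision(hyp, refs, i):
    """P(i) = Matched(i) / H(i), Eq. 7 and Eq. 8."""
    hyp_counts = ngrams(hyp, i)
    total = sum(hyp_counts.values())                    # H(i)
    if total == 0:
        return 0.0
    matched = 0
    for t, count in hyp_counts.items():
        max_ref = max(ngrams(r, i)[t] for r in refs)    # max_j C_hj(t_i)
        matched += min(count, max_ref)
    return matched / total

def bleu(hyp, refs, max_n=4):
    """BLEU = rho * (prod_i P(i))^(1/N), Eq. 9 and Eq. 10."""
    precisions = [modified_precision(hyp, refs, i) for i in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    n = len(hyp)
    # Assumption: L is the reference length closest to the hypothesis length.
    L = min((len(r) for r in refs), key=lambda length: (abs(length - n), length))
    rho = math.exp(min(0.0, (n - L) / n))
    return rho * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Clipping example: the repeated word is credited only once.
clipped = modified_precision(["the", "the", "the"], [["the", "cat"]], 1)
```

The clipping in Eq. 8 is what prevents a hypothesis from scoring well by repeating a common word, and the brevity penalty lowers the score of a hypothesis that is shorter than its reference.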
11.2 METEOR
Metric for Evaluation of Translation with Explicit Ordering (METEOR) Denkowski and Lavie [2014] is based on
unigram matching between the reference sentences and the machine-predicted sentences, using the harmonic mean of
unigram precision and recall, with recall weighted higher than precision. It was formulated to mend some of the issues
found in the BLEU metric. Unigram precision P is calculated as:
P=m
wt
(11)
Where m is the number of unigrams in the candidate translation that are also found in the reference translation, and
wt
is the number of unigrams in the candidate translation. Unigram recall R is computed as follows:
R=m
wr
(12)
Where m is as above, and
wr
is the number of unigrams in the reference translation. Precision and recall are combined
using the harmonic mean. There recall is weighted 9 times more than precision as shown in the equation below:
$$F_{mean} = \frac{10PR}{R + 9P} \qquad (13)$$
To account for the agreement of longer segments that appear in both the reference and the candidate sentence, a
penalty p is applied. The penalty is calculated using the following equation:
$$p = 0.5\left(\frac{C}{u_m}\right)^{3} \qquad (14)$$
where C is the number of chunks, and $u_m$ is the number of unigrams that have been mapped. Finally, the METEOR
score M for a segment is calculated as shown in the equation below:
$$M = F_{mean}(1 - p) \qquad (15)$$
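Equations 11 to 15 can be traced with a short Python sketch. It is an illustration only, not the scoring code used in this work: the function name `meteor_from_counts` is hypothetical, the match counts are assumed to be given (the unigram alignment and chunk counting are not implemented), and the number of mapped unigrams $u_m$ is taken to be equal to m.

```python
def meteor_from_counts(m, w_t, w_r, chunks):
    """METEOR score from pre-computed counts, following Eqs. 11-15.

    m      -- matched unigrams (also used as u_m, by assumption)
    w_t    -- unigrams in the candidate translation
    w_r    -- unigrams in the reference translation
    chunks -- number of contiguous matched chunks, C
    """
    if m == 0:
        return 0.0  # nothing matched, so precision and recall are both zero
    precision = m / w_t                                            # Eq. 11
    recall = m / w_r                                               # Eq. 12
    f_mean = 10 * precision * recall / (recall + 9 * precision)    # Eq. 13
    penalty = 0.5 * (chunks / m) ** 3                              # Eq. 14
    return f_mean * (1 - penalty)                                  # Eq. 15


if __name__ == "__main__":
    # Hypothetical counts: 6-word candidate, 6-word reference, all words matched.
    print(meteor_from_counts(m=6, w_t=6, w_r=6, chunks=1))  # words in the same order
    print(meteor_from_counts(m=6, w_t=6, w_r=6, chunks=3))  # same words, reordered
```

Both calls have identical precision and recall; only the chunk count differs, so the second score is lower. This is how the penalty in Eq. 14 rewards candidates whose matched words appear in the same order as in the reference.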
11.3 Result Analysis
We computed BLEU and METEOR scores for every experimental setup. The scores of the transformer-based model
are shown in Table 2, and the scores of the visual attention-based model are shown in Table 3. For every dataset, the
transformer-based model obtained its highest BLEU-1 score with 3 layers. On the other hand, METEOR was highest
for the BanglaLekha dataset with 7 layers and for the Bornon dataset with 3 layers. For the merged dataset, METEOR
scores did not show any increasing or decreasing trend with the number of layers. These BLEU scores also compare
favorably with those reported by Deb et al. [2019], Kamal et al. [2020], and Khan et al. [2021]. The transformer-based
method scored slightly higher on the Bornon and BanglaLekha datasets than on the merged dataset.
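The layer-wise observations above can be checked directly against the numbers in Table 2. The sketch below is only a convenience for the reader: the BLEU-1 and METEOR values are transcribed from Table 2, the grouping of rows into 3, 5, and 7 layers with two head settings each is an assumption about the table layout, and the names `TABLE_2` and `best_layers` are hypothetical.

```python
# Rows are (layers, heads, BLEU-1, METEOR), transcribed from Table 2.
# Assumption: each dataset lists two head settings for 3, 5, and 7 layers.
TABLE_2 = {
    "BanglaLekha": [
        (3, 1, 0.665, 0.255), (3, 2, 0.662, 0.241),
        (5, 1, 0.648, 0.251), (5, 2, 0.660, 0.263),
        (7, 1, 0.633, 0.267), (7, 2, 0.644, 0.268),
    ],
    "Bornon": [
        (3, 1, 0.696, 0.361), (3, 2, 0.687, 0.346),
        (5, 1, 0.688, 0.359), (5, 2, 0.683, 0.340),
        (7, 1, 0.684, 0.340), (7, 2, 0.665, 0.346),
    ],
    "merged": [
        (3, 1, 0.621, 0.196), (3, 2, 0.624, 0.201),
        (5, 1, 0.616, 0.189), (5, 2, 0.607, 0.200),
        (7, 1, 0.592, 0.187), (7, 2, 0.602, 0.197),
    ],
}

BLEU1, METEOR = 2, 3  # column positions inside each row tuple


def best_layers(rows, column):
    """Return the layer count of the row with the highest value in `column`."""
    return max(rows, key=lambda row: row[column])[0]


if __name__ == "__main__":
    for dataset, rows in TABLE_2.items():
        print(dataset,
              "| best BLEU-1 at layers:", best_layers(rows, BLEU1),
              "| best METEOR at layers:", best_layers(rows, METEOR))
```

Only BLEU-1 and METEOR are included here; the same lookup could be repeated for the BLEU-2 to BLEU-4 columns.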
From Table 3 it can be seen that the visual attention-based model generally gave higher scores when Xception was used
as the CNN. However, its overall BLEU scores are lower than those of the transformer-based model. As the visual
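Table 2: Result of Transformer-based model using InceptionV3 as CNN over 50 epochs.
| Dataset | Layers (N) | Heads | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR |
|---|---|---|---|---|---|---|---|
| BanglaLekha | 3 | 1 | 0.665 | 0.556 | 0.476 | 0.408 | 0.255 |
| BanglaLekha | 3 | 2 | 0.662 | 0.548 | 0.462 | 0.389 | 0.241 |
| BanglaLekha | 5 | 1 | 0.648 | 0.546 | 0.470 | 0.402 | 0.251 |
| BanglaLekha | 5 | 2 | 0.660 | 0.557 | 0.480 | 0.415 | 0.263 |
| BanglaLekha | 7 | 1 | 0.633 | 0.541 | 0.471 | 0.409 | 0.267 |
| BanglaLekha | 7 | 2 | 0.644 | 0.548 | 0.476 | 0.412 | 0.268 |
| Bornon | 3 | 1 | 0.696 | 0.589 | 0.507 | 0.439 | 0.361 |
| Bornon | 3 | 2 | 0.687 | 0.572 | 0.486 | 0.415 | 0.346 |
| Bornon | 5 | 1 | 0.688 | 0.583 | 0.502 | 0.437 | 0.359 |
| Bornon | 5 | 2 | 0.683 | 0.575 | 0.493 | 0.425 | 0.340 |
| Bornon | 7 | 1 | 0.684 | 0.567 | 0.478 | 0.405 | 0.340 |
| Bornon | 7 | 2 | 0.665 | 0.556 | 0.477 | 0.411 | 0.346 |
| merged | 3 | 1 | 0.621 | 0.492 | 0.398 | 0.326 | 0.196 |
| merged | 3 | 2 | 0.624 | 0.494 | 0.400 | 0.329 | 0.201 |
| merged | 5 | 1 | 0.616 | 0.482 | 0.384 | 0.311 | 0.189 |
| merged | 5 | 2 | 0.607 | 0.483 | 0.391 | 0.323 | 0.200 |
| merged | 7 | 1 | 0.592 | 0.468 | 0.376 | 0.308 | 0.187 |
| merged | 7 | 2 | 0.602 | 0.481 | 0.390 | 0.322 | 0.197 |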
Table 3: Result of Visual attention-based model using GRU as sequence generator over 50 epochs.
| Dataset | CNN | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR |
|---|---|---|---|---|---|---|
| Flickr8k-BN | InceptionV3 | 0.543 | 0.445 | 0.362 | 0.294 | 0.161 |
| Flickr8k-BN | Xception | 0.546 | 0.447 | 0.364 | 0.296 | 0.156 |
| BanglaLekha | InceptionV3 | 0.567 | 0.460 | 0.385 | 0.319 | 0.204 |
| BanglaLekha | Xception | 0.570 | 0.462 | 0.387 | 0.322 | 0.208 |
| Bornon | InceptionV3 | 0.596 | 0.475 | 0.390 | 0.324 | 0.314 |
| Bornon | Xception | 0.605 | 0.492 | 0.412 | 0.351 | 0.348 |
attention-based model used a GRU as its sequence generator whereas the transformer-based model used the Transformer
as its sequence generator, the gap likely stems from the sequence generator. This suggests that improving only the
computer vision side of an image captioning model will not improve the results. Since image captioning combines two
fields, computer vision and NLP, equal importance must be given to both to get better results. Table 3 also reports
corpus-level BLEU scores, which were not reported by Ami et al. [2020].
We tested the transformer-based model and the visual attention-based model on a test set of images that were not
present in the training set or the validation set. The Bengali captions generated by various experimental setups of the
transformer-based model on the three datasets (BanglaLekha, Bornon, and merged) are shown in Fig. 13. Additionally,
the Bengali captions generated by the visual attention-based model on the Flickr8k-BN, BanglaLekha, and Bornon
datasets are illustrated in Fig. 14. From these figures, it can be seen that the transformer-based model gave noticeably
more accurate Bengali captions than the visual attention-based model.
Since both the transformer-based model and the visual attention-based model were trained on the BanglaLekha and
Bornon datasets, a brief comparison of the captions generated for the same test images from these datasets is shown in
Fig. 15. From this figure, it can be seen that the visual attention-based model generated Bengali captions related to the
objects present in the image, whereas the transformer-based model gave a general Bengali caption that describes the
whole image. The performance of the transformer-based model was also compared with the results reported in other
papers, as shown in Table 4. On the BanglaLekha dataset, the transformer-based model scored higher on BLEU-2,
BLEU-3, and BLEU-4 than earlier work, while its BLEU-1 score (0.665) is close to the 0.667 reported by Kamal et al. [2020].
12 Conclusions
In our work, we employed a visual attention-based approach that assigns attention weights to image features. Because
this is a traditional Encoder-Decoder approach, we compared it with a transformer-based approach. In the transformer-based
Figure 13: Illustration of Bengali captions generated by Transformer-based models.
approach, we feed the feature vector extracted by the CNN and the target Bengali captions into the Transformer model.
This model learns to generate Bengali captions using a multi-head attention mechanism. The model not only improves
on the performance of the traditional approach, but also speeds up training by allowing parallelism. Our experiments
confirmed that the transformer-based method indeed performs better than the visual attention-based method. Hence,
in the future, the transformer-based model can replace the traditional encoder-decoder architecture. This will enhance
the performance
Figure 14: Illustration of Bengali captions generated by visual attention-based models.
and efficiency of caption generation from images. We also used several Bengali datasets to test both approaches. This
shows that the transformer model can be used to generate captions from images in languages other than English.
References
R Subash, R Jebakumar, Yash Kamdar, and Nishit Bhatt. Automatic image captioning using convolution neural
networks and lstm. In Journal of Physics: Conference Series, volume 1362, page 012096. IOP Publishing, 2019.
Cheng Wang, Haojin Yang, and Christoph Meinel. Image captioning with deep bidirectional lstms and multi-task
learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(2s):1–20,
2018.
Mayeesha Humaira, Shimul Paul, Md Abidur Rahman Khan Jim, Amit Saha Ami, and Faisal Muhammad Shah. A
hybridized deep learning method for bengali image captioning. International Journal of Advanced Computer Science
and Applications, 12:698–707, 2021.
Figure 15: Illustration of Bengali captions generated by visual attention-based models and transformer-based models.
Visual attention-based Bengali captions were generated using Xception and the transformer-based Bengali captions
were generated using 3 layers and 4 heads.
Table 4: A brief comparison of BLEU scores for existing models and the transformer-based model.
| Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|---|
| BanglaLekha | VGG-16+LSTM Kamal et al. [2020] | 0.667 | 0.436 | 0.315 | 0.238 |
| BanglaLekha | CNN-ResNet-50 Khan et al. [2021] | 0.651 | 0.426 | 0.278 | 0.175 |
| BanglaLekha | Transformer Model | 0.665 | 0.556 | 0.476 | 0.408 |
| Flickr8k (4000 images) | Inception+LSTM Deb et al. [2019] | 0.62 | 0.45 | 0.33 | 0.22 |
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008,
2017.
Wei Zhang, Wenbo Nie, Xinle Li, and Yao Yu. Image caption generation with adaptive transformer. In 2019 34rd Youth
Academic Annual Conference of Chinese Association of Automation (YAC), pages 521–526. IEEE, 2019.
Sen He, Wentong Liao, Hamed R Tavakoli, Michael Yang, Bodo Rosenhahn, and Nicolas Pugeault. Image captioning
through image transformer. In Proceedings of the Asian Conference on Computer Vision, 2020.
Amit Saha Ami, Mayeesha Humaira, Md Abidur Rahman Khan Jim, Shimul Paul, and Faisal Muhammad Shah. Bengali
image captioning with visual attention. In 2020 23rd International Conference on Computer and Information
Technology (ICCIT), pages 1–5. IEEE, 2020.
Matiur Rahman, Nabeel Mohammed, Nafees Mansoor, and Sifat Momen. Chittron: An automatic bangla image
captioning system. Procedia Computer Science, 154:636–642, 2019.
Tonmoay Deb, Mohammad Zariff Ahsham Ali, Sanchita Bhowmik, Adnan Firoze, Syed Shahir Ahmed, Muhammad
Abeer Tahmeed, NSM Rahman, and Rashedur M Rahman. Oboyob: A sequential-semantic bengali image
captioning engine. Journal of Intelligent & Fuzzy Systems, 37(6):7427–7439, 2019.
Abrar Hasin Kamal, Md Asifuzzaman Jishan, and Nafees Mansoor. Textmage: The automated bangla caption generator
based on deep learning. In 2020 International Conference on Decision Aid Sciences and Application (DASA), pages
822–826. IEEE, 2020.
Mohammad Faiyaz Khan, SM Sadiq-Ur-Rahman, and Md Saiful Islam. Improved bengali image captioning via deep
convolutional neural network based encoder-decoder model. In Proceedings of International Joint Conference on
Advances in Computational Intelligence, pages 217–229. Springer, 2021.
Md Asifuzzaman Jishan, Khan Raqib Mahmud, Abul Kalam Al Azad, Mohammad Rifat Ahmmad, Bijan Paul Rashid,
and Md Shahabub Alam. Bangla language textual image description by hybrid neural network model. Indonesian
Journal of Electrical Engineering and Computer Science, 21(2):757–767, 2021.
Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing. Convolutional image captioning. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 5561–5570, 2018.
Shuang Liu, Liang Bai, Yanli Hu, and Haoran Wang. Image captioning based on deep neural networks. In MATEC Web
of Conferences, volume 232, page 01052. EDP Sciences, 2018.
Weiyu Lan, Xirong Li, and Jianfeng Dong. Fluency-guided cross-lingual image captioning. In Proceedings of the 25th
ACM international conference on Multimedia, pages 1549–1557, 2017.
Xirong Li, Weiyu Lan, Jianfeng Dong, and Hailong Liu. Adding chinese captions to images. In Proceedings of the
2016 ACM on international conference on multimedia retrieval, pages 271–275, 2016.
Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi. Stair captions: Constructing a large-scale japanese image
caption dataset. arXiv preprint arXiv:1705.00823, 2017.
Vasu Jindal. Generating image captions in arabic using root-word based recurrent neural networks and deep neural
networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Aditya Alif Nugraha, Anditya Arifianto, et al. Generating image description on indonesian language using convolutional
neural network and gated recurrent unit. In 2019 7th International Conference on Information and Communication
Technology (ICoICT), pages 1–6. IEEE, 2019.
Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. Entangled transformer for image captioning. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 8928–8937, 2019.
Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. Image captioning: Transforming objects into words.
arXiv preprint arXiv:1906.05963, 2019.
Viktar Atliha and Dmitrij Šešok. Text augmentation using bert for image captioning. Applied Sciences, 10(17):5978,
2020.
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. Vilbertscore:
Evaluating image caption using vision-and-language bert. In Proceedings of the First Workshop on Evaluation and
Comparison of NLP Systems, pages 34–39, 2020.
Xinxin Zhu, Lixiang Li, Jing Liu, Haipeng Peng, and Xinxin Niu. Captioning transformer with stacked attention
modules. Applied Sciences, 8(5):739, 2018.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on
machine learning, pages 2048–2057. PMLR, 2015.
Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and
channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 5659–5667, 2017.
Hui Chen, Guiguang Ding, Zijia Lin, Sicheng Zhao, and Jungong Han. Show, observe and tell: Attribute-driven
attention model for image captioning. In IJCAI, pages 606–612, 2018.
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 2818–2826, 2016.
François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 1251–1258, 2017.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025, 2015.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and
translate. arXiv preprint arXiv:1409.0473, 2014.
K Papineni, S Roukos, T Ward, and WJ Zhu. Bleu: a method for automatic evaluation of machine translation. IBM
Research Report, IBM Research Division Technical Report, 2001.
Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language.
In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
Access to this full-text is provided by Springer Nature.
Content available from SN Computer Science