BORNON: BENGALI IMAGE CAPTIONING WITH
TRANSFORMER-BASED DEEP LEARNING APPROACH
A PREPRINT
Faisal Muhammad Shah
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
faisal.cse@aust.edu
Mayeesha Humaira
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
mayeeshahumaira@gmail.com
Md Abidur Rahman Khan Jim
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
jimrahman33@gmail.com
Amit Saha Ami
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
amitsaha.aust@gmail.com
Shimul Paul
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
shimulpaul59@gmail.com
September 14, 2021
ABSTRACT
Image captioning using an Encoder-Decoder based approach, where a CNN is used as the Encoder and a
sequence generator such as an RNN as the Decoder, has proven to be very effective. However, this method has
the drawback that the sequence must be processed in order. To overcome this drawback, some researchers
have utilized the Transformer model to generate captions from images using English datasets. However,
none of them generated captions in Bengali using the Transformer model. Therefore, we utilized
three different Bengali datasets to generate Bengali captions from images using the Transformer
model. Additionally, we compared the performance of the transformer-based model with a visual
attention-based Encoder-Decoder approach. Finally, we compared the results of the transformer-based
model with other models that employed different Bengali image captioning datasets.
Keywords Bengali Image Captioning · Transformer Model · Visual Attention · Bornon Dataset
1 Introduction
Image captioning is the process of describing an image, accomplished by combining two fields of deep learning:
computer vision and natural language processing (NLP). For many years researchers have examined methods
to caption images automatically. The task involves recognizing the objects, attributes, and their relationships in
an image to generate fluent sentences, which makes it very challenging. Image captioning can
be used for social and security purposes: it can increase children's interest in early education, and security
camera footage can be captioned in real time to prevent theft or hazards such as fire.
Image captioning is a sequence modeling problem that employs a CNN-RNN-based encoder-decoder framework. In
this framework, the encoder extracts image features to obtain feature vectors, which are then passed through an RNN to
generate the language description. Previously, researchers mostly utilized this CNN-RNN approach Subash et al. [2019], Wang et al.
[2018], Humaira et al. [2021] to generate captions from images. However, this method has a drawback:
due to the structure of the LSTM or other RNNs, the current output depends on the hidden state at the previous moment.
As a result, they can only operate over time steps, which makes it implausible to parallelize the process of generating
captions. Nevertheless, the Transformer model proposed by Vaswani et al. [2017] solved this parallelism
problem. The Transformer can run in parallel during the training phase because it is based on an attention mechanism
and has no sequence dependence.
Recently, some researchers Zhang et al. [2019], He et al. [2020] have utilized the Transformer model instead of an RNN
to generate captions from images. However, these studies were conducted on English datasets. To see how this model
performs on Bengali datasets, we utilized three Bengali datasets. The approach to caption images in Bengali using the
Transformer model is illustrated in Fig. 1. Furthermore, we compare its performance with the visual attention-based
approach to caption images in Bengali that was proposed by Ami et al. [2020]. This visual attention-based
approach is shown in Fig. 4. Bengali is the 7th most used language worldwide 1, and many natives in parts of
India and Bangladesh do not know English. Hence, it is also necessary to caption images in Bengali alongside English.
The contributions of this paper are as follows:
• Three Bengali datasets were used to train the models.
• A Transformer model was combined with a CNN to generate captions from images in Bengali.
• A visual attention-based approach was employed to compare its performance with the transformer-based approach.
• The performance of other models was compared with the transformer-based model for captioning images in Bengali.
2 Related Works
This section depicts the progress in image captioning. Hitherto, many studies have been conducted and many
models have been developed in order to obtain captions that are syntactically correct.
2.1 Image captioning in Bengali
Only seven works have been done on image captioning in Bengali till now. Rahman et al. [2019] was the first paper on
image captioning in Bengali, followed by Deb et al. [2019], Kamal et al. [2020], Khan et al. [2021] and Jishan et al.
[2021]. Rahman et al. [2019] aimed to outline an automatic image captioning system in Bengali
called 'Chittron'. Their model was trained to predict Bengali captions from input images one word at a time. The training
process was carried out on 15700 images of their own dataset, BanglaLekha. In their model, the image feature vector and
the word vectors, obtained by passing words through an embedding layer, were fed to a stacked LSTM layer.
One drawback of their work was that they utilized the sentence BLEU score instead of the corpus BLEU score. On the
other hand, Deb et al. [2019] illustrated two models, Par-Inject Architecture and Merge Architecture, for image
captioning in Bengali. In the Par-Inject model, image feature vectors were fed into an intermediate LSTM, and the output
of that LSTM and the word vectors were combined and fed to another LSTM to generate captions in Bengali. Whereas, in
the Merge model, image feature vectors and word vectors were combined and passed to an LSTM without the use of an
intermediate LSTM. They utilized 4000 images of the Flickr8k dataset, and the Bengali captions their models generated
were not fluent. Kamal et al. [2020] used a CNN-RNN based model where VGG-16 was used as the CNN and an LSTM
with 256 channels was used as the RNN. They trained their model on the BanglaLekha dataset having 9154 images. On the
other hand, Khan et al. [2021] proposed a CNN-ResNet-50 merged model, consisting of a ResNet-50 as image
feature extractor and a 1D-CNN with word embedding for generating linguistic information. Later, these two features
were given as inputs to a multimodal layer that predicts what to generate next using the information at each time step.
Furthermore, Jishan et al. [2021] utilized the BNLIT dataset to implement a CNN-RNN model where they used both
a BRNN and an LSTM as the RNN. Humaira et al. [2021] proposed a hybridized Encoder-Decoder approach
where two word embeddings, fastText and GloVe, were concatenated. They also utilized beam search and greedy search
to compute the BLEU scores. Additionally, Ami et al. [2020] employed visual attention with the
Encoder-Decoder approach to caption images in Bengali. They added attention weights to image features and passed
them to a GRU with word vectors to generate captions. However, they did not use corpus BLEU scores to evaluate the
captions. We will compare the corpus BLEU scores of the visual attention-based approach with the transformer-based
approach.
1https://www.vistawide.com/languages/top_30_languages.htm
Figure 1: Visualization of how the Transformer model generates words from an input image. First, image features
were extracted by the CNN and passed to the Encoder of the Transformer. Then the vocabulary was passed to the
Decoder part of the Transformer. The Transformer then generated a Bengali caption for the corresponding image.
Figure 2: Illustration of how our model learns words from an input image to generate a caption using the visual
attention-based approach Ami et al. [2020]. First, image features were extracted using a CNN. Then attention scores
were assigned to the image features, which were passed to the GRU. Meanwhile, tokenized words were passed to the
embedding layer to convert the vocabulary to vectors. These word vectors were also passed to the GRU. The GRU then
generates Bengali captions word by word using the word vectors and attention-weighted image features.
2.2 Image captioning in other Languages
Previously, most research was conducted on English, as the available datasets were all in the English language. The
authors of Aneja et al. [2018] adapted the attention mechanism to generate captions. For the vision part of image captioning,
VGG-16 was used as the CNN by most papers Liu et al. [2018], Subash et al. [2019], Wang et al. [2018], but some
also used AlexNet Liu et al. [2018], Wang et al. [2018] or ResNet Liu et al. [2018] as the CNN for feature extraction.
Some researchers also utilized a BiLSTM Liu et al. [2018]. Alongside English, researchers have generated
captions in Chinese Lan et al. [2017], Li et al. [2016], Japanese Yoshikawa et al. [2017], Arabic Jindal [2018] and
Bahasa Indonesia Nugraha et al. [2019].
2.3 Image captioning using Transformer
The Transformer model was used previously for image captioning on English datasets. Li et al. [2019]
investigated a Transformer-based sequence modeling framework for image captioning which was built only with
attention layers and feedforward layers. Additionally, Herdade et al. [2019] employed object spatial relationship
modeling for image captioning, specifically within the Transformer encoder-decoder architecture, by incorporating an
object relation module within the Transformer encoder. Atliha and Šešok [2020] proposed augmenting
the image captions in a dataset, including augmentation using BERT, to improve solutions to the image captioning
problem. Furthermore, Lee et al. [2020] utilized two streams of transformer-based architecture, one for the
visual part and another for the textual part. Zhu et al. [2018] used a transformer-based architecture that consists
of an encoder-decoder model where the encoder is a CNN and the decoder is a Transformer, together with a
stacked self-attention mechanism. Zhang et al. [2019] used a CNN as the encoder to extract
image features; the output of the encoder is a context vector that contains the necessary information from the image,
which is then fed into a Transformer to generate the captions. On the other hand, He et al. [2020] introduced the image
transformer for image captioning, where each transformer layer implements multiple sub-transformers, to encode spatial
relationships between image regions and decode the diverse information in the image regions.
2.4 Image captioning using Attention Mechanism
Visual attention on English datasets was used previously by many researchers. In the past, two main types of attention
were used in encoder-decoder models for image or video captioning: semantic attention, which applies attention to text,
and spatial attention, which applies attention to images. Xu et al. [2015] proffered the first visual attention model
in image captioning. They used "hard" pooling, which selects the most probable attentive region, or "soft" pooling,
which averages the spatial features with attentive weights. Additionally, Chen et al. [2017] utilized spatial and
channel-wise attention in a CNN. Chen et al. [2018] also employed visual attention to generate captions. Lastly,
You et al. [2016] employed a semantic attention model to combine visual features with visual concepts in a recurrent
neural network that generates the image caption.
3 Model Architecture
We utilized the transformer model and the attention-based model proposed by Ami et al. [2020] to caption images in
Bengali. The transformer model does not process the sequence in order, whereas the attention-based model does;
hence, the transformer model allows parallel processing of captions. The transformer model is illustrated in
Fig. 3 and the attention-based approach is shown in Fig. 4.
3.1 Transformer-based Approach
The Transformer Vaswani et al. [2017] is a deep learning model that utilizes the attention mechanism to weight the
influence of different parts of the input data. The Transformer is made of a stack of encoder and decoder components.
In Fig. 3, the left block marked Nx is the encoder and the right block marked Nx is the decoder, where N is a
hyperparameter that represents the number of encoder and decoder components. This model takes two inputs: the
image features extracted by the CNN in the Encoder, and the vocabulary formed from the list of target captions in the
dataset in the Decoder.
3.1.1 Encoder
InceptionV3 was used as the CNN in this experiment. The images in the dataset were first passed to the CNN.
InceptionV3 extracts image features and passes them through a dense layer with ReLU activation to project the
image feature vectors from dimension d to d_model, where d_model is the dimension of the word embedding.
Figure 3: Illustration of the transformer-based model to caption images in Bengali.
These image feature vectors were then summed with the positional encoding and then passed through N encoder layers.
Figure 4: Illustration of the attention-based approach Ami et al. [2020] to caption images in Bengali.
Each of these encoder layers was made up of two sublayers: multi-head attention with a padding mask, and a
point-wise feed-forward network. Masking ensures that the model does not treat padding as input.
The output of the encoder was then passed to the decoder as K (key) and V (value). The multi-head mechanism is
explained in Section 7.
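The encoder stage described above can be sketched as follows. This is a minimal illustration only, assuming a recent TensorFlow/Keras version (where tf.keras.layers.MultiHeadAttention is available) rather than the authors' exact code; d_model, num_heads, dff and the stand-in feature tensor are assumed values.

import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    # One encoder layer: multi-head self-attention (sub-layer 1) followed by a
    # point-wise feed-forward network (sub-layer 2), each with a residual
    # connection and layer normalization. Hyperparameter values are assumptions.
    def __init__(self, d_model=512, num_heads=2, dff=2048):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, mask=None):
        attn = self.mha(query=x, value=x, key=x, attention_mask=mask)  # padding mask goes here
        x = self.norm1(x + attn)
        return self.norm2(x + self.ffn(x))

# Image features (here 64 regions of 2048 dimensions, a stand-in for the
# InceptionV3 output) are projected to d_model with a ReLU dense layer and
# passed through N encoder layers. Positional encoding is omitted for brevity.
features = tf.random.uniform((1, 64, 2048))
x = tf.keras.layers.Dense(512, activation="relu")(features)
for _ in range(3):  # N = 3 encoder layers
    x = EncoderLayer()(x)
print(x.shape)  # (1, 64, 512)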
3.1.2 Decoder
The decoder takes as input the target captions in the dataset, which were passed through an embedding layer and
summed with the positional encoding. Positional encoding is added to give the model some information about the
relative position of the words in the sentence, based on the similarity of their meaning and their position in the
sentence, in the d-dimensional space. The output of the summation was then passed through N decoder layers. Each of
these decoder layers was made of three sub-layers: masked multi-head attention with a look-ahead mask and padding
mask; multi-head attention with a padding mask, where V (value) and K (key) receive the encoder output as inputs and
Q (query) receives the output of the masked multi-head attention sublayer; and a point-wise feed-forward network. The
output of the decoder was then sent to the linear layer as input. Finally, a softmax produces probabilistic predictions
one word at a time, and the output generated so far is used to decide the next word. This whole process is illustrated
in Fig. 3.
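A corresponding sketch of one decoder layer, under the same assumptions (TensorFlow >= 2.4, hypothetical hyperparameter names), is given below; the look-ahead mask lets each position attend only to earlier positions of the caption.

import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    # One decoder layer: masked multi-head self-attention over the caption
    # prefix, multi-head attention over the encoder output (K and V from the
    # encoder, Q from the previous sub-layer), and a point-wise feed-forward
    # network. Hyperparameter values are assumptions.
    def __init__(self, d_model=512, num_heads=2, dff=2048):
        super().__init__()
        self.self_mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.cross_mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, enc_output, look_ahead_mask=None, padding_mask=None):
        attn1 = self.self_mha(query=x, value=x, key=x, attention_mask=look_ahead_mask)
        x = self.norm1(x + attn1)
        attn2 = self.cross_mha(query=x, value=enc_output, key=enc_output,
                               attention_mask=padding_mask)
        x = self.norm2(x + attn2)
        return self.norm3(x + self.ffn(x))

def look_ahead_mask(size):
    # Lower-triangular mask: position i may attend only to positions <= i.
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)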
3.2 Visual Attention-based Approach
To focus only on the relevant parts of the image, a visual attention-based model was used. It is an Encoder-Decoder
approach that processes the sequence in order. This model has three main parts. Firstly, a Convolutional Neural Network
(CNN) extracts features from images. Secondly, an attention mechanism gives weights to the image features. The
Bengali vocabulary is converted to word vectors using an embedding layer. Finally, a Gated Recurrent Unit
(GRU) Chung et al. [2014], which is a sequence generator, takes the word vectors and weighted image features as input
and generates Bengali captions in order. This process is illustrated in Fig. 4.
4 Dataset
The main aim of this research is to generate a Bengali caption from the image. To accomplish this task a dataset in the
Bengali language is needed which must have several images and a text file containing Bengali captions associated with
Table 1: Distribution of data for the different Bengali datasets used in our experiment.

Dataset       Total Images   Training       Validation     Testing
Flickr8k      8000           6000 (75%)     1000 (12.5%)   1000 (12.5%)
BanglaLekha   9154           7154 (78%)     1000 (11%)     1000 (11%)
Bornon        4100           2900 (72%)     600 (14%)      600 (14%)
Merged        21414          12850 (60%)    4282 (20%)     4282 (20%)
each image. However, almost all the datasets available for image captioning are in English. The only available Bengali
dataset till now is the BanglaLekha dataset. Since one dataset is not enough to validate the performance of the models,
we created a new Bengali dataset named Bornon. Furthermore, we utilized the Flickr8k dataset by translating its English
captions to Bengali and then merged it with the BanglaLekha and Bornon datasets to form a new merged dataset.
This merged dataset was constructed especially to test the transformer model, since transformer-based models are
data-hungry. For generating Bengali captions from images, these datasets were split into three parts: training, testing,
and validation. The split ratio of each dataset used in our experiment is shown in Table 1.
4.1 Flickr8K_BN
Flickr8k 2 is a publicly available English dataset that contains 8091 images, of which 6000 (75%) images are
employed for training, 1000 (12.5%) for validation, and 1000 (12.5%) for testing. Moreover,
each image of the Flickr8k dataset has five ground-truth captions describing it, which adds up
to a total of 40455 captions for 8091 images. For image captioning in Bengali, those 40455 captions were converted to
the Bengali language using Google Translator 3. Unfortunately, some of the translated captions were syntactically
incorrect, as shown in Fig. 5. Hence, we manually checked all 40455 translated captions and corrected them.
Flickr8K-BN is the resulting Bengali Flickr8k dataset. Some images of the Flickr8k dataset along with their associated
captions are shown in Fig. 6.
Figure 5: Illustration of Bengali captions after being translated using Google Translator. Sentences marked
in red indicate syntactically incorrect Bengali sentences, and the sentences inside the brackets are the manually
corrected sentences.
4.2 BanglaLekha
We also utilized the BanglaLekha 4 dataset, which consists of 9154 images, of which 7154 (78%) images are employed
for training, 1000 (11%) for validation, and 1000 (11%) for testing. It is the only available
2https://www.kaggle.com/adityajn105/flickr8k/activity
3https://translate.google.com/
4https://data.mendeley.com/datasets/hf6sf8zrkc/2
Figure 6: Illustration of some images of the Flickr8k dataset along with their five Bengali captions.
Bengali dataset till now. All its captions are human-annotated. One problem with this dataset is that it has only two
captions associated with each image, resulting in 18308 captions for those 9154 images. Hence, its vocabulary size is
lower than that of Flickr8k-BN: Flickr8k-BN consists of 12953 unique Bengali words, whereas BanglaLekha consists of
only 5270 unique Bengali words. Some images of the BanglaLekha dataset along with their associated captions are
shown in Fig. 7.
4.3 Bornon
Due to the lack of Bengali image captioning datasets and to overcome the shortcomings of the existing Bengali image
captioning dataset BanglaLekha Rahman et al. [2019], we created a new dataset named Bornon. The Bornon dataset
consists of 4100 images, and each image has five captions describing it. Thus, there is a total of 20500 captions for
4100 images. Images were kept in a folder and the associated captions were kept in a text file. Some images of the
Bornon dataset along with their associated captions are shown in Fig. 8.
The images of this dataset were taken from a personal photography club. All images are in jpg format. These
images portray various objects like animals, birds, people, food, weather, trees, flowers, buildings, cars, and boats.
Frequent Bengali words in this dataset are illustrated in Fig. 9. Around 17 people who are native Bengali speakers were
responsible for annotating and evaluating the captions.
Only two captions are associated with each image in the BanglaLekha dataset, which reduces the vocabulary size; hence
we gave five captions to each image in our Bornon dataset. The vocabulary size of the Bornon dataset is 6228 unique
Bengali words for only 4100 images, whereas the BanglaLekha dataset has a vocabulary size of 5270 unique Bengali
words for 9154 images. If the vocabulary size is small, repetition of words is observed in the predicted captions. However,
Figure 7: Illustration of some images of the BanglaLekha dataset along with their two Bengali captions.
these 4100 images are not enough to train a transformer-based model. Therefore, in the future, we plan to increase the
number of images in our dataset.
4.4 Merged Dataset
The transformer model is data-hungry: it performs well when a large amount of data is provided. However, all the
Bengali datasets mentioned above contain a small amount of data, which is not enough to train the transformer model. As a
result, we merged the three datasets Flickr8k, BanglaLekha, and Bornon. This resulted in 21414 images, each with two
associated captions, which adds up to a total of 42828 captions. We took two captions from each dataset because the
BanglaLekha dataset has only two Bengali captions describing each image. This merging led to a vocabulary size of
13416 unique Bengali words and images of various categories.
5 Text Embedding
Firstly, the maximum length of the target captions in each dataset was computed. Then all sentences shorter than
the maximum length were padded with zeroes. Afterward, the top 5000 unique Bengali words were selected
from each dataset to tokenize the Bengali captions using Keras's text tokenizer. Since we cannot train a model directly
on text, we converted these tokens to numeric form using text embedding. The embedding model maps each token to a
vector of d_model dimensions. All the embedding vectors of one sentence are combined into a matrix
and provided as input to the Transformer or the GRU.
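A minimal sketch of this preprocessing step, assuming Keras's text tokenizer and a hypothetical list of Bengali caption strings, is shown below.

import tensorflow as tf

# Placeholder Bengali captions (hypothetical); real captions come from the datasets.
captions = ["<start> একটি কুকুর ঘাসের উপর দৌড়াচ্ছে <end>",
            "<start> একটি ছেলে ফুটবল খেলছে <end>"]

top_k = 5000  # top 5000 unique Bengali words per dataset
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k, oov_token="<unk>")
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)

# Pad every caption with zeroes up to the maximum caption length.
max_length = max(len(s) for s in sequences)
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_length, padding="post")

# An embedding layer maps each token id to a d_model-dimensional vector.
d_model = 512  # assumed embedding dimension
embedding = tf.keras.layers.Embedding(input_dim=top_k + 1, output_dim=d_model)
caption_vectors = embedding(padded)  # shape: (num_captions, max_length, d_model)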
6 Convolutional Neural Network
InceptionV3 Szegedy et al. [2016] was utilized as the Convolutional Neural Network (CNN) for extracting features from
images for the transformer-based model. As this is not a classification task, the last layer of InceptionV3, a softmax
layer, was removed from the model. All images were then preprocessed to the same size, 299 × 299, before feeding them
into the model. The shape of the output of this layer was 8 × 8 × 2048. The extracted features were stored as .npy
files and then passed through the encoder.
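This feature-extraction step can be sketched as follows; it is an illustration under the stated settings (299 × 299 inputs, 8 × 8 × 2048 output, .npy storage), not the authors' exact code.

import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head; for a 299 x 299 input the output
# feature map has shape (8, 8, 2048).
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
extractor = tf.keras.Model(base.input, base.output)

def extract_features(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    features = extractor(tf.expand_dims(img, 0))    # (1, 8, 8, 2048)
    features = tf.reshape(features, (-1, 2048))     # 64 regions x 2048 dimensions
    np.save(image_path + ".npy", features.numpy())  # cached for the encoder
    return features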
Two different CNNs, InceptionV3 and Xception Chollet [2017], were employed in the different experimental setups
of the visual attention-based model. The last layer of both CNNs was removed. Like the transformer model, the
attention-based model also took images of size 299 × 299; as a result, images were reshaped and fed to the CNN. The
extracted image features were then saved to .npy files, and attention weights were added to them.
Figure 8: Illustration of some images of the Bornon dataset along with their five Bengali captions.
7 Attention in Transformer
Self-attention is calculated using vectors. Three matrices, Query, Key, and Value, computed from each of the encoder's
inputs, are needed to calculate self-attention. These matrices are obtained by multiplying the embedding matrix with
trained weight matrices. Finally, the self-attention matrix is calculated using the following formula:

Z = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V   (1)

where Z is the self-attention matrix, Q is the Query matrix, K is the Key matrix, V is the Value matrix, and d_k is the
dimension of the key matrix. The paper Vaswani et al. [2017] further refined the self-attention layer by adding a
mechanism called "Multi-Head" attention. Multi-Head attention implements the self-attention calculation multiple
times in parallel (eight times in Vaswani et al. [2017]) with different weight matrices.
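For illustration, Eq. 1 can be written as a short NumPy function; this is a didactic sketch of scaled dot-product attention, not the model's actual layer.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Z = softmax(Q K^T / sqrt(d_k)) V, as in Eq. 1.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy example: 4 query/key/value vectors of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)  # (4, 8)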
Figure 9: Illustration of the most frequent Bengali words in the Bornon dataset.
8 Visual Attention Mechanism
Two types of spatial attention that are widely used are global attention Luong et al. [2015] and local attention. We
employed local attention, also known as Bahdanau attention Bahdanau et al. [2014], because global attention
is computationally expensive and unfeasible for long sentences. Global attention places attention on all source positions,
whereas Bahdanau attention focuses on a small subset of the hidden states of the encoder per target word. To implement
Bahdanau attention, several steps were followed. Firstly, the extracted image features were passed through a fully
connected layer of the CNN encoder to produce a hidden state for each element. Then alignment scores were calculated
using the hidden state produced by the decoder in the previous time step and the encoder outputs, using the formula
shown in Eq. 2. This alignment score is the main component of the attention mechanism.

score_{alignment} = W_{combined} \cdot \tanh(W_{decoder} \cdot H_{decoder} + W_{encoder} \cdot H_{encoder})   (2)

The alignment scores were then passed through the softmax function and represented as a single vector of attention
weights using Eq. 3. This vector was then multiplied with the image features to form the context vector using Eq. 4.
a_{jt} = \mathrm{Softmax}(e_{jt})   (3)

where a_{jt} = \frac{\exp(e_{jt})}{\sum_{k=1}^{T_x} \exp(e_{kt})}, such that \sum_{j=1}^{T_x} a_{jt} = 1 and a_{jt} \geq 0, and e_{jt} is the alignment score.

c_t = \sum_{j=1}^{T_x} a_{jt} h_j   (4)

where c_t is the context vector, i.e. the weighted sum of the input features h_j.
Finally, this context vector was concatenated with the previous decoder output. It was then fed into the decoder Gated
Recurrent Unit (GRU) to produce a new output.
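A minimal Keras-style sketch of this attention step (Eqs. 2-4) is given below; layer and variable names are hypothetical and the tensor shapes are assumed.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    # Implements Eqs. 2-4: alignment scores, softmax attention weights, and the
    # context vector as a weighted sum of the image features.
    def __init__(self, units=512):
        super().__init__()
        self.W_encoder = tf.keras.layers.Dense(units)
        self.W_decoder = tf.keras.layers.Dense(units)
        self.W_combined = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, regions, feature_dim) image features from the CNN encoder
        # hidden:   (batch, units) previous hidden state of the GRU decoder
        hidden = tf.expand_dims(hidden, 1)
        score = self.W_combined(tf.nn.tanh(self.W_decoder(hidden) + self.W_encoder(features)))  # Eq. 2
        attention_weights = tf.nn.softmax(score, axis=1)                                         # Eq. 3
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)                     # Eq. 4
        return context_vector, attention_weights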
9 Gated Recurrent Units
In the attention-based approach, a Gated Recurrent Unit (GRU) Chung et al. [2014] was employed as the sequence
generator. Before passing words to the GRU, they were converted to vectors using the embedding layer. Afterward, these
word embeddings of Bengali words were passed to the GRU. The GRU then predicts the next word in the sequence using the
previous hidden state of the decoder, the previously predicted word, and the context vector calculated by the attention
model. The equation used to predict the next word is depicted in Eq. 5.
s_t = \mathrm{RNN}(s_{t-1}, [e(\hat{y}_{t-1}), c_t])   (5)

where s_t is the new state of the decoder, s_{t-1} is the previous state of the decoder, e(\hat{y}_{t-1}) is the embedding of the previously predicted word, and c_t is the context vector, i.e. the weighted sum of the input.
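The decoding step of Eq. 5 can be sketched as the following Keras layer; vocab_size, embedding_dim and units are assumed hyperparameters rather than values reported in this paper.

import tensorflow as tf

class GRUDecoderStep(tf.keras.layers.Layer):
    # One decoding step of Eq. 5: the GRU consumes the embedding of the
    # previously predicted word concatenated with the context vector, and a
    # dense layer scores the next Bengali word.
    def __init__(self, vocab_size=5001, embedding_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, prev_word, context_vector, prev_state):
        x = self.embedding(prev_word)                              # e(y_{t-1}): (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], -1)  # [e(y_{t-1}), c_t]
        output, state = self.gru(x, initial_state=prev_state)      # s_t = RNN(s_{t-1}, ...)
        return self.fc(output), state                              # next-word scores and new state s_t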
However, the sequence problem remains: the beginning of the sequence must be processed before the end. To solve this
issue, we utilized the transformer-based model to caption images in Bengali.
10 Hyperparameters
The techniques used in this experiment were implemented in Jupyter Notebook. The models were developed with
Keras 2.3.1 and TensorFlow 2.1.0. We ran our experiments on an NVIDIA RTX 2060 GPU, which offers 1920
CUDA cores with 6 GB of GDDR6 VRAM. With these settings it took approximately three hours to train each of the
experimental setups.
The number of layers and the number of heads in the transformer were varied to tune the transformer-based model.
Three, five, and seven were used as the number of layers, and one and two as the number of heads. Furthermore,
internal validation was employed to test the generalization ability of the trained model. The model was trained for
50 epochs with the Adam optimizer and a custom learning rate scheduler following Eq. 6, where 4000 was used as the
number of warmup steps. Additionally, SparseCategoricalCrossentropy was utilized as the loss function. The loss plot
of one of the experimental setups using the transformer-based model is shown in Fig. 10.
lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})   (6)
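A sketch of this learning rate schedule as a Keras LearningRateSchedule is shown below; it follows the schedule of Vaswani et al. [2017], and the d_model value and the Adam beta parameters are assumptions, not values stated in this paper.

import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), as in Eq. 6.
    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * (self.warmup_steps ** -1.5))

optimizer = tf.keras.optimizers.Adam(TransformerSchedule(), beta_1=0.9, beta_2=0.98, epsilon=1e-9)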
Figure 10: Loss plot for the transformer-based model over 50 epochs using 3 layers and 2 heads on the Bornon dataset.
Two different CNNs, InceptionV3 and Xception, were used in the different experimental setups of the visual
attention-based model. Furthermore, internal validation was employed to test the generalization ability of the trained
model. The model was trained for 50 epochs with a batch size of 64. Additionally, Adam was used as the optimizer and
SparseCategoricalCrossentropy was utilized for calculating the loss. Fig. 11 and Fig. 12 demonstrate how the loss
decreased over 50 epochs for InceptionV3 and Xception, respectively.
Figure 11: Loss plot for the visual attention-based model using InceptionV3 over 50 epochs on the Flickr8k-BN dataset.
Figure 12: Loss plot for the visual attention-based model using Xception over 50 epochs on the Flickr8k-BN dataset.
11 Experimental Results
After generating captions, the most important part is evaluating them to verify how similar the generated captions are
to human-annotated captions. Hence, we took the aid of two evaluation metrics, BLEU and METEOR, to justify the
accuracy of our proposed models.
11.1 BLEU
Bilingual Evaluation Understudy (BLEU) Papineni et al. [2001] is the most widely used metric for evaluating the
quality of generated text. It measures how closely generated sentences match human-generated reference sentences and
is predominantly utilized to evaluate the performance of machine translation. Sentences are compared based on modified
n-gram precision to compute BLEU scores. BLEU scores are computed using the following equations:
P(i) = \frac{Matched(i)}{H(i)}   (7)

P(i) is the precision for each i-gram, where i = 1, 2, ..., N; it is the percentage of the i-gram tuples in the hypothesis that also
occur in the references. H(i) is the number of i-gram tuples in the hypothesis, and Matched(i) is computed
using the following formula:

Matched(i) = \sum_{t_i} \min\{C_h(t_i),\ \max_j C_{hj}(t_i)\}   (8)
where t_i is an i-gram tuple in hypothesis h, C_h(t_i) is the number of times t_i occurs in the hypothesis, and C_{hj}(t_i) is the
number of times t_i occurs in reference j of this hypothesis.
\rho = \exp\left\{\min\left(0, \frac{n - L}{n}\right)\right\}   (9)

where \rho is the brevity penalty that penalizes short translations, n is the length of the hypothesis, and L is the length of the
reference. Finally, the BLEU score is computed by:
BLEU = \rho \cdot \left\{\prod_{i=1}^{N} P(i)\right\}^{\frac{1}{N}}   (10)
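As an illustration, the corpus-level BLEU-1 to BLEU-4 scores reported in Tables 2-4 can be computed with NLTK's corpus_bleu as sketched below; the tokenized reference and hypothesis captions are hypothetical.

from nltk.translate.bleu_score import corpus_bleu

# One reference set per image (here a single Bengali reference) and one
# generated caption per image; both are hypothetical tokenized examples.
references = [[["একটি", "কুকুর", "ঘাসের", "উপর", "দৌড়াচ্ছে"]]]
hypotheses = [["একটি", "কুকুর", "মাঠে", "দৌড়াচ্ছে"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu1, bleu2, bleu3, bleu4)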
11.2 METEOR
Metric for Evaluation of Translation with Explicit Ordering (METEOR) Denkowski and Lavie [2014] is based on
unigram matching between the reference sentence and the sentence predicted by the machine, using the harmonic mean
of unigram precision and recall, with recall weighted higher than precision. It was formulated to mend some of the
issues found in the BLEU metric. Unigram precision P is calculated as:
P = \frac{m}{w_t}   (11)

where m is the number of unigrams in the candidate translation that are also found in the reference translation, and w_t is the number of unigrams in the candidate translation. Unigram recall R is computed as follows:

R = \frac{m}{w_r}   (12)

where m is as above, and w_r is the number of unigrams in the reference translation. Precision and recall are combined
using the harmonic mean, with recall weighted 9 times more than precision, as shown in the equation below:

F_{mean} = \frac{10PR}{R + 9P}   (13)
To account for congruity concerning longer segments that appear in both the reference and the candidate sentence, a
penalty p is added. The penalty is calculated using the following equation:

p = 0.5\left(\frac{C}{u_m}\right)^3   (14)

where C is the number of chunks and u_m is the number of unigrams that have been mapped. Finally, the METEOR
score M for a segment is calculated as shown in the equation below:

M = F_{mean}(1 - p)   (15)
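The computation of Eqs. 11-15 can be illustrated with the simplified sketch below; it uses exact unigram matching and a simplified chunk count, so it is a didactic approximation rather than the full METEOR implementation used for evaluation.

def simple_meteor(candidate, reference):
    # Didactic approximation of Eqs. 11-15 with exact unigram matching and a
    # simplified chunk count; not the full METEOR aligner.
    cand, ref = candidate.split(), reference.split()
    matched = [w for w in cand if w in ref]
    m = len(matched)
    if m == 0:
        return 0.0
    P = m / len(cand)                  # Eq. 11: unigram precision
    R = m / len(ref)                   # Eq. 12: unigram recall
    f_mean = 10 * P * R / (R + 9 * P)  # Eq. 13: recall weighted 9 times more
    chunks = 1
    for prev, curr in zip(matched, matched[1:]):
        if ref.index(curr) != ref.index(prev) + 1:  # new chunk when the match is not contiguous
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3  # Eq. 14
    return f_mean * (1 - penalty)      # Eq. 15

# Hypothetical candidate and reference captions.
print(simple_meteor("একটি কুকুর মাঠে দৌড়াচ্ছে", "একটি কুকুর ঘাসের উপর দৌড়াচ্ছে"))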
11.3 Result Analysis
We computed BLEU and METEOR scores for every experimental setup. The scores of the transformer-based model
are shown in Table 2 and the scores of the visual attention-based model are shown in Table 3. The highest BLEU
scores for all datasets using the transformer-based model were obtained with 3 layers. On the other hand, METEOR
was higher for the BanglaLekha dataset with 7 layers and higher for the Bornon dataset with 3 layers. For
the merged dataset, the METEOR scores did not show any trend of increasing or decreasing with the number of layers.
These scores were also better than the BLEU scores obtained by Deb et al. [2019], Kamal et al. [2020] and
Khan et al. [2021]. The Bornon and BanglaLekha datasets performed slightly better than the merged dataset with the
transformer-based method.
From Table 3 it can be seen that the experimental setups of the visual attention-based model with Xception as the CNN
gave higher scores. However, the overall BLEU scores are lower than those of the transformer-based model.
Table 2: Result of the transformer-based model using InceptionV3 as the CNN over 50 epochs.

Dataset      Layers(N)  Heads  BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR
BanglaLekha  3          1      0.665   0.556   0.476   0.408   0.255
             3          2      0.662   0.548   0.462   0.389   0.241
             5          1      0.648   0.546   0.470   0.402   0.251
             5          2      0.660   0.557   0.480   0.415   0.263
             7          1      0.633   0.541   0.471   0.409   0.267
             7          2      0.644   0.548   0.476   0.412   0.268
Bornon       3          1      0.696   0.589   0.507   0.439   0.361
             3          2      0.687   0.572   0.486   0.415   0.346
             5          1      0.688   0.583   0.502   0.437   0.359
             5          2      0.683   0.575   0.493   0.425   0.340
             7          1      0.684   0.567   0.478   0.405   0.340
             7          2      0.665   0.556   0.477   0.411   0.346
merged       3          1      0.621   0.492   0.398   0.326   0.196
             3          2      0.624   0.494   0.400   0.329   0.201
             5          1      0.616   0.482   0.384   0.311   0.189
             5          2      0.607   0.483   0.391   0.323   0.200
             7          1      0.592   0.468   0.376   0.308   0.187
             7          2      0.602   0.481   0.390   0.322   0.197
Table 3: Result of the visual attention-based model using a GRU as sequence generator over 50 epochs.

Dataset      CNN          BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR
Flickr8k-BN  InceptionV3  0.543   0.445   0.362   0.294   0.161
Flickr8k-BN  Xception     0.546   0.447   0.364   0.296   0.156
BanglaLekha  InceptionV3  0.567   0.460   0.385   0.319   0.204
BanglaLekha  Xception     0.570   0.462   0.387   0.322   0.208
Bornon       InceptionV3  0.596   0.475   0.390   0.324   0.314
Bornon       Xception     0.605   0.492   0.412   0.351   0.348
The visual attention-based model used a GRU as its sequence generator, whereas the transformer-based model used the
Transformer itself as the sequence generator. This supports the observation that improving only the computer vision
side of image captioning models will not improve the results. Since image captioning is a mixture of two fields,
computer vision and NLP, equal importance must be given to both fields to get better results. In Table 3 we report the
corpus BLEU scores, which were not provided by Ami et al. [2020].
We tested the transformer-based model and the visual attention-based model using a test set containing images that
were not present in the training or validation sets. The Bengali captions generated by various experimental setups of
the transformer-based model using the BanglaLekha, Bornon, and merged datasets are shown in Fig. 13. Additionally,
the Bengali captions generated by the visual attention-based model using the Flickr8k-BN, BanglaLekha, and Bornon
datasets are illustrated in Fig. 14. From these figures, it can be seen that the transformer-based model gave more
accurate Bengali captions than the attention-based model.
Since both the transformer-based model and the visual attention-based model were trained on the BanglaLekha and
Bornon datasets, a brief comparison of the captions generated for the same test images of these datasets is depicted
in Fig. 15. From this figure, it can be seen that the visual attention-based model generated Bengali captions focused
on individual objects present in the image, whereas the transformer-based model gave a more general Bengali caption
that describes the whole image. The performance of the transformer-based model was also compared with that of other
papers, and the results are illustrated in Table 4. This table shows that the transformer-based model performed better
than other research done on Bengali image captioning using the same datasets.
12 Conclusions
In our work, we employed a visual attention-based approach that gives attention weights to image features. This is a
traditional Encoder-Decoder approach, so we compared it with a transformer-based approach. In the transformer-based
Figure 13: Illustration of Bengali captions generated by Transformer-based models.
approach, we feed the feature vectors extracted by the CNN and the target Bengali captions into the Transformer model.
This model learns to generate Bengali captions using a multi-head attention mechanism. Not only can the model
improve on the original performance, but it can also increase the training speed by allowing parallelism. It was
validated that the transformer-based method indeed performs better than the visual attention-based method. Hence, in
the future, the transformer-based model can replace the traditional encoder-decoder architecture. This will enhance the performance
Figure 14: Illustration of Bengali captions generated by visual attention-based models.
and efficiency of caption generation from images. We also utilized various Bengali datasets to test both approaches. This
demonstrates that the transformer model can be used to generate captions from images in other languages alongside
English.
References
R Subash, R Jebakumar, Yash Kamdar, and Nishit Bhatt. Automatic image captioning using convolution neural
networks and lstm. In Journal of Physics: Conference Series, volume 1362, page 012096. IOP Publishing, 2019.
Cheng Wang, Haojin Yang, and Christoph Meinel. Image captioning with deep bidirectional lstms and multi-task
learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(2s):1–20,
2018.
Mayeesha Humaira, Shimul Paul, Md Abidur Rahman Khan Jim, Amit Saha Ami, and Faisal Muhammad Shah. A
hybridized deep learning method for bengali image captioning. International Journal of Advanced Computer Science
and Applications, 12:698–707, 2021.
17
arXiv Template A PREPRINT
Figure 15: Illustration of Bengali captions generated by visual attention-based models and transformer-based models.
Visual attention-based Bengali captions were generated using Xception and the transformer-based Bengali captions
were generated using 3 layers and 4 heads.
Table 4: A brief comparison of BLEU scores for existing models and the transformer-based model.

Dataset                  Model                               BLEU-1  BLEU-2  BLEU-3  BLEU-4
BanglaLekha              VGG-16+LSTM Kamal et al. [2020]     0.667   0.436   0.315   0.238
BanglaLekha              CNN-ResNet-50 Khan et al. [2021]    0.651   0.426   0.278   0.175
BanglaLekha              Transformer Model                   0.665   0.556   0.476   0.408
Flickr8k (4000 images)   Inception+LSTM Deb et al. [2019]    0.62    0.45    0.33    0.22
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008,
2017.
Wei Zhang, Wenbo Nie, Xinle Li, and Yao Yu. Image caption generation with adaptive transformer. In 2019 34rd Youth
Academic Annual Conference of Chinese Association of Automation (YAC), pages 521–526. IEEE, 2019.
Sen He, Wentong Liao, Hamed R Tavakoli, Michael Yang, Bodo Rosenhahn, and Nicolas Pugeault. Image captioning
through image transformer. In Proceedings of the Asian Conference on Computer Vision, 2020.
18
arXiv Template A PREPRINT
Amit Saha Ami, Mayeesha Humaira, Md Abidur Rahman Khan Jim, Shimul Paul, and Faisal Muhammad Shah. Bengali
image captioning with visual attention. In 2020 23rd International Conference on Computer and Information
Technology (ICCIT), pages 1–5. IEEE, 2020.
Matiur Rahman, Nabeel Mohammed, Nafees Mansoor, and Sifat Momen. Chittron: An automatic bangla image
captioning system. Procedia Computer Science, 154:636–642, 2019.
Tonmoay Deb, Mohammad Zariff Ahsham Ali, Sanchita Bhowmik, Adnan Firoze, Syed Shahir Ahmed, Muham-
mad Abeer Tahmeed, NSM Rahman, and Rashedur M Rahman. Oboyob: A sequential-semantic bengali image
captioning engine. Journal of Intelligent & Fuzzy Systems, 37(6):7427–7439, 2019.
Abrar Hasin Kamal, Md Asifuzzaman Jishan, and Nafees Mansoor. Textmage: The automated bangla caption generator
based on deep learning. In 2020 International Conference on Decision Aid Sciences and Application (DASA), pages
822–826. IEEE, 2020.
Mohammad Faiyaz Khan, SM Sadiq-Ur-Rahman, and Md Saiful Islam. Improved bengali image captioning via deep
convolutional neural network based encoder-decoder model. In Proceedings of International Joint Conference on
Advances in Computational Intelligence, pages 217–229. Springer, 2021.
Md Asifuzzaman Jishan, Khan Raqib Mahmud, Abul Kalam Al Azad, Mohammad Rifat Ahmmad Rashid, Bijan Paul,
and Md Shahabub Alam. Bangla language textual image description by hybrid neural network model. Indonesian
Journal of Electrical Engineering and Computer Science, 21(2):757–767, 2021.
Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing. Convolutional image captioning. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 5561–5570, 2018.
Shuang Liu, Liang Bai, Yanli Hu, and Haoran Wang. Image captioning based on deep neural networks. In MATEC Web
of Conferences, volume 232, page 01052. EDP Sciences, 2018.
Weiyu Lan, Xirong Li, and Jianfeng Dong. Fluency-guided cross-lingual image captioning. In Proceedings of the 25th
ACM international conference on Multimedia, pages 1549–1557, 2017.
Xirong Li, Weiyu Lan, Jianfeng Dong, and Hailong Liu. Adding chinese captions to images. In Proceedings of the
2016 ACM on international conference on multimedia retrieval, pages 271–275, 2016.
Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi. Stair captions: Constructing a large-scale japanese image
caption dataset. arXiv preprint arXiv:1705.00823, 2017.
Vasu Jindal. Generating image captions in arabic using root-word based recurrent neural networks and deep neural
networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Aditya Alif Nugraha, Anditya Arifianto, et al. Generating image description on indonesian language using convolutional
neural network and gated recurrent unit. In 2019 7th International Conference on Information and Communication
Technology (ICoICT), pages 1–6. IEEE, 2019.
Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. Entangled transformer for image captioning. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 8928–8937, 2019.
Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. Image captioning: Transforming objects into words.
arXiv preprint arXiv:1906.05963, 2019.
Viktar Atliha and Dmitrij Šešok. Text augmentation using bert for image captioning. Applied Sciences, 10(17):5978,
2020.
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. Vilbertscore:
Evaluating image caption using vision-and-language bert. In Proceedings of the First Workshop on Evaluation and
Comparison of NLP Systems, pages 34–39, 2020.
Xinxin Zhu, Lixiang Li, Jing Liu, Haipeng Peng, and Xinxin Niu. Captioning transformer with stacked attention
modules. Applied Sciences, 8(5):739, 2018.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on
machine learning, pages 2048–2057. PMLR, 2015.
Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and
channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 5659–5667, 2017.
Hui Chen, Guiguang Ding, Zijia Lin, Sicheng Zhao, and Jungong Han. Show, observe and tell: Attribute-driven
attention model for image captioning. In IJCAI, pages 606–612, 2018.
19
arXiv Template A PREPRINT
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 2818–2826, 2016.
François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 1251–1258, 2017.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025, 2015.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and
translate. arXiv preprint arXiv:1409.0473, 2014.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine
translation. IBM Research Division Technical Report, 2001.
Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language.
In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
20
... Word embeddings and positional encoding are derived from the sahajBERT model. [11] compared an InceptionV3 and transformer based encoderdecoder model to a visual attention based model (Incep-tionV3/Xception encoders and GRU decoders with visual attention). They introduced the "Bornon" dataset containing 4100 images with 5 captions each. ...
... The size of n-grams that should not be repeated in the generated sequences, is set to 3, and the length penalty applied during beam search decoding is set to 2. Baseline models. We evaluated the performance of our methods against several recent benchmark Bengali image captioning studies, including [5], [6], [8], [9], [11], [13]. Note that the dataset splits may vary as none seem to follow a standard split. ...
Preprint
Full-text available
An exemplary caption not only describes what is happening in a particular image but also denotes intricate traditional objects in the image by their local representative terms through which the native speakers can recognize the object in question. A caption that fails to accomplish the latter is not effective in conveying proper utility. To ensure caption locality, we aim to explore the potential of Large Language Models (LLMs) in Bengali image captioning, which have lately shown promising results in English language caption generation. As a first for the Bengali language, we utilized CLIP (Contrastive Language-Image Pre-training) encodings as a prefix to the captions by employing a mapping network, followed by fine-tuning BanglaGPT, a Bengali pre-trained large language model to generate the image captions. Furthermore, we explored vision transformer-based encoders (ViT, Swin) with BanglaGPT as the decoder. The best BanglaGPT-based model outperformed the current benchmark results, with BLEU-4, METEOR, and CIDEr scores of 54.3, 39.2, and 95.9 on the BanglaLekha dataset and 67.4, 36.6, and 76.9 on the BNature dataset.
... Pada dasarnya image captioning adalah membuat kalimat deskriptif berdasarkan gambar secara otomatis [5]. Teknologi tersebut didasari oleh computer vision yang menggabungkan identifikasi objek dalam gambar melalui penggunaan Convolutional Neural Network (CNN) sebagai encoder dan Natural Language Processing (NLP) yang digunakan untuk menghasilkan deskripsi atau keterangan dari gambar tersebut dalam bentuk bahasa alami sebagai decoder [6]. ...
... mampu meningkatkan kinerja dan kecepatan pelatihan dengan memungkinkan mekanisme paralel. Di dalam Transformer juga dilengkapi dengan multi-head attention mechanism [5] . ...
Article
Full-text available
Penelitian ini bertujuan untuk mengatasi kurangnya pemahaman terhadap rambu lalu lintas di Indonesia melalui pengembangan model image captioning menggunakan Inception V3 dan Transformer. Dengan menggunakan pendekatan ini, dataset gambar rambu lalu lintas yang terdiri dari 9.594 gambar dengan 31 kelas dikumpulkan dan dimodifikasi. Evaluasi model dilakukan menggunakan metrik BLEU, ROUGE-L, METEOR, dan CIDEr. Hasil penelitian menunjukkan kinerja yang baik dengan skor BLEU-1=0.89, BLEU-2 = 0.82, BLEU-3 = 0.75, BLEU-4 = 0.68, CIDEr = 0.57, ROUGE-L = 0.25, dan METEOR = 0.26. Dari hasil tersebut, dapat mengindikasikan bahwa model ini dapat meningkatkan pemahaman tentang rambu lalu lintas Indonesia. Pendekatan ini dapat membantu pengguna jalan memahami rambu lalu lintas dengan lebih baik, serta memiliki potensi untuk diterapkan dalam aplikasi praktis untuk meningkatkan keselamatan lalu lintas.
... RNNs) to generate Bengali captions from images. Subsequent investigations by the same Nabaraj Subedi, Nirajan Paudel, Manish Chhetri, Sudarshan Acharya, Nabin Lamichhane Journal of Soft Computing Paradigm, Month 2024, Volume 6, Issue 1 73 authors have delved into utilizing transformer models for Bengali image captioning (Shah et al., 2021)[10] focusing on addressing sequential processing challenges. ...
Article
Full-text available
The advent of deep neural networks has made the image captioning task more feasible. It is a method of generating text by analyzing the different parts of an image. A lot of tasks related to this have been done in the English language, while very little effort is put into this task in other languages, particularly the Nepali language. It is an even harder task to carry out research in the Nepali language because of its difficult grammatical structure and vast language domain. Further, the little work done in the Nepali language is done to generate only a single sentence, but the proposed work emphasizes generating paragraph-long coherent sentences. The Stanford human genome dataset, which was translated into Nepali language using the Google Translate API is used in the proposed work. Along with this, a manually curated dataset consisting of 800 images of the cultural sites of Nepal, along with their Nepali captions, was also used. These two datasets were combined to train the deep learning model. The task involved working with transformer architecture. In this setup, image features were extracted using a pretrained Inception V3 model. These features were then inputted into the encoder segment after position encoding. Simultaneously, embedded tokens from captions were fed into the decoder segment. The resulting captions were assessed using BLEU scores, revealing higher accuracy and BLEU scores for the test images.
... The model's decoder part made use of a softmax layer, selecting as its output the word with the highest probability. The use of transformers for annotating Bengali images was first introduced by Shah et al. [13]. They developed their own dataset called Bornon to evaluate their proposed model. ...
Conference Paper
Full-text available
Our research focuses on Bangla Image Captioning which involves generating descriptive captions for the images. To address this task, we propose a new approach using the Vision Encoder-Decoder model, consisting of interconnected models for image encoding and text decoding. Previous work in this area has not explored the use of the Vision Encoder-Decoder Model specifically for Bangla Image Captioning. We have conducted several studies using two publicly available Bengali datasets, Bornon and BanCap, and merged them to create a comprehensive dataset to assess the performance of our model. Our proposed model outperforms recent developments in Bengali image captioning, delivering exceptional results in both quantitative and qualitative analyses.
... Creating an image captioning system for the Bengali language involves generating descriptive and contextually relevant captions for images in Bengali. A CNN serves as the encoder and an RNN serves as the decoder in the encoder-decoder-based technique [86] to picture captioning. However, this approach needs to be processed sequentially. ...
Preprint
Full-text available
In image captioning, we generate visual descriptions from an image. Image Cap-tioning requires identifying the key entity, feature, and association in an image. There is also a requirement to generate captions that are syntactically and semantically correct. The process of image captioning requires computer vision and natural language processing. In the past few decades, a substantial attempt has been made to generate the caption for images. In this survey article, we are going to present an extensive survey on image captioning for Indian Languages. To summarize recent research work in image captioning, first, we briefly review the traditional approach to image captioning depending on template and retrieval. Further deep-learning approaches for image captioning are concentrated which are classified as encoder-decoder architecture, attention-based approach, and transformer architecture. Our main focus in this survey is based on image cap-tioning techniques for Indian languages like Hindi, Bengali Assamese, etc. After that, we analyze the state-of-the-art approach on the most widely dataset i.e. MS COCO dataset with their strengths, limitations, and performance metrics i.e. BLEU, ROUGE, METEOR, CIDEr, SPICE. At last, we explore discussion on open challenges and future direction in the field of image captioning.
Article
Automatic Image Captioning (AIC) refers to the process of synthesizing semantically and syntactically correct descriptions for images. Existing research on AIC has predominantly focused on the English language. Comparatively, lower numbers of works have focused on developing captioning systems for low-resource Indian languages like Assamese. This paper investigates AIC for the Assamese language using two distinct approaches. The first approach involves utilizing state-of-the-art AIC model pretrained on an English image-caption dataset to generate English captions for input images. Next, these English captions are translated to the Assamese language using a publicly available automatic translator. The second approach involves exclusively training the AIC model using an Assamese image-caption dataset to predict captions directly in Assamese. The experiments are performed on two types of state-of-art models, one which uses LSTM as a decoder and the other one uses a transformer. Through extensive experimentation, the performance of these approaches is evaluated both quantitatively and qualitatively. The quantitative results are obtained using automatic metrics such as BLEU-n and CIDEr. For qualitative analysis, human evaluation is performed. The comparative performances between the two approaches reveal that models trained exclusively on Assamese image-caption datasets achieve superior results both in terms of quantitative measures and qualitative assessment when compared to models pretrained on English and subsequently translated into Assamese.
Conference Paper
Visual commonsense reasoning, an integral aspect of human intelligence, extends beyond mere object identification, encompassing the nuanced inference of actions, intentions, and emotions from visual content. While humans effortlessly excel in this cognitive domain, contemporary Bengali vision systems encounter obstacles in attaining comparable proficiency. This is primarily attributed to the need for advanced cognitive faculties and common-sense reasoning abilities.This research paper introduces the Ben-gali Visual Commonsense Reasoning System (BVCRS), differentiating from prevailing Bengali vision models that predominantly focus on tasks such as image cap-tioning, object recognition, segmentation, and visual question answering. The Bengali Visual Commonsense Reasoning(BVCR) model, coupled with our Bangla-VCR dataset, stands as a pivotal asset for advancing research in Bengali visual understanding and reasoning .Our BVCR model aspires to capture real-world dynamics and human intentions in Bengali by incorporating state-of-the-art deep learning techniques.We leverage a pre-trained ResNet50 for visual features and fine-tuned M-BERT for Bengali text.To extract relationships between textual and visual features for deducing answers and reasoning, a bidirectional LSTM with attention mechanism is employed.In experiments, BVCR achieved 42.6% accuracy in visual question answering and 44.2% in visual commonsense reasoning .Our work highlights the challenges in attaining human-level performance in Bengali visual reasoning, providing insights for future research.
Conference Paper
The task of image captioning is a complex process that involves generating textual descriptions for images. Much of the research in this domain, especially using transformer models, has focused on the English language; relatively little has been dedicated to Bengali. This study addresses that gap and proposes a novel approach to automatic Bengali image captioning: a multimodal, transformer-based, end-to-end model with an encoder-decoder architecture built on a pre-trained EfficientNet Transformer network. To evaluate the effectiveness of the approach, the model is compared with a Vision Transformer that uses a non-convolutional encoder pre-trained on ImageNet. Both models were tested on the BanglaLekhaImageCaptions dataset and evaluated using BLEU metrics.
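As an illustration of the general design, the following sketch pairs an EfficientNet-B0 feature grid with a standard PyTorch transformer decoder over caption tokens; the hyperparameters, vocabulary size, and the use of torchvision's EfficientNet are assumptions, not the paper's exact configuration.

```python
# Illustrative sketch (assumed configuration): EfficientNet-B0 features as encoder
# memory for a vanilla PyTorch Transformer decoder that emits caption tokens.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

class EffNetTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=40):
        super().__init__()
        self.backbone = efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT).features
        self.img_proj = nn.Linear(1280, d_model)      # 1280 = EfficientNet-B0 channels
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)     # learned positional embedding
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing)
        feat = self.backbone(images)                              # (B, 1280, 7, 7)
        memory = self.img_proj(feat.flatten(2).transpose(1, 2))   # (B, 49, d_model)
        T = captions.size(1)
        pos = torch.arange(T, device=captions.device)
        tgt = self.embed(captions) + self.pos(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(captions.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                                   # (B, T, vocab) logits

model = EffNetTransformerCaptioner(vocab_size=8000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 8000, (2, 12)))
```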
Chapter
Image Captioning is an arduous task of producing syntactically and semantically correct textual descriptions of an image in natural language, with context related to the image. Existing notable research in Bengali Image Captioning (BIC) is based on encoder-decoder architectures. This paper presents an end-to-end image captioning system with a multimodal architecture that combines a one-dimensional convolutional neural network (CNN), which encodes sequence information, with a pre-trained ResNet-50 image encoder that extracts region-based visual features. We investigate our approach’s performance on the BanglaLekhaImageCaptions dataset using existing evaluation metrics and perform a human evaluation for qualitative analysis. Experiments show that our language encoder captures the fine-grained information in the caption and, combined with the image features, generates accurate and diverse captions. Our work outperforms all existing BIC works and achieves a new state-of-the-art (SOTA) performance, scoring 0.651 on BLEU-1, 0.572 on CIDEr, 0.297 on METEOR, 0.434 on ROUGE, and 0.357 on SPICE.
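A hedged sketch of such a multimodal design is shown below: a Conv1d language encoder over the partial caption merged with pooled ResNet-50 features to score the next word. The dimensions and fusion scheme are assumptions; this is not the authors' published code.

```python
# Hedged sketch (approximate reading of the described design, not the authors' code):
# a 1-D CNN caption encoder fused with ResNet-50 image features for next-word scoring.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class MultimodalCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, channels=256):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.image_enc = nn.Sequential(*list(cnn.children())[:-1])  # (B, 2048, 1, 1)
        self.img_fc = nn.Linear(2048, channels)
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.text_enc = nn.Sequential(                # 1-D CNN over the partial caption
            nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.out = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU(),
                                 nn.Linear(channels, vocab_size))

    def forward(self, images, partial_captions):
        img = torch.relu(self.img_fc(self.image_enc(images).flatten(1)))
        txt = self.text_enc(self.embed(partial_captions).transpose(1, 2)).squeeze(-1)
        return self.out(torch.cat([img, txt], dim=-1))               # next-word logits

model = MultimodalCaptioner(vocab_size=6000)
scores = model(torch.randn(2, 3, 224, 224), torch.randint(1, 6000, (2, 10)))
```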
Article
Generating captions from an input image is an omnipresent and challenging research topic in computer vision. Numerous experiments have been conducted on image captioning in English, but caption generation in Bengali is still sparse and in need of refinement; only a few papers have worked on Bengali image captioning so far. Hence, we propose a standard strategy for Bengali image caption generation on two different sizes of the Flickr8k dataset and on the BanglaLekha dataset, the only publicly available Bengali dataset for image captioning. The Bengali captions produced by our model are compared with Bengali captions generated by other researchers using different architectures. We employ a hybrid approach based on InceptionResNetV2 or Xception as the convolutional neural network and a Bidirectional Long Short-Term Memory or Bidirectional Gated Recurrent Unit network on the two Bengali datasets, and we also adapt different combinations of word embeddings. Finally, performance is evaluated using the Bilingual Evaluation Understudy (BLEU) metric, showing that the proposed model indeed performs better on the Bengali dataset consisting of 4000 images and on the BanglaLekha dataset.
Article
Image captioning is an important task both for improving human-computer interaction and for a deeper understanding of the mechanisms underlying image description by humans. In recent years this research field has developed rapidly and a number of impressive results have been achieved. Typical models are based on neural networks, combining convolutional networks for encoding images with recurrent networks for decoding them into text; attention mechanisms and transformers are also actively used to boost performance. However, even the best models are limited in quality when data is lacking: generating varied descriptions of objects in different situations requires a large training set. The datasets in common use, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods, including augmentation with synonyms as a baseline and the state-of-the-art language model Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on a dataset without augmentation.
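The following sketch approximates the BERT-based augmentation idea with the Hugging Face fill-mask pipeline: mask one word of a caption and keep the model's top replacements as new caption variants. The multilingual BERT checkpoint and the sample English caption are assumptions for illustration, not the authors' exact pipeline.

```python
# Small sketch of masked-LM caption augmentation (an approximation, not the
# authors' pipeline): mask one word and let BERT propose replacement captions.
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

def augment_caption(caption, num_variants=3):
    words = caption.split()
    i = random.randrange(len(words))                 # pick one word to replace
    masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
    variants = []
    for pred in fill(masked, top_k=num_variants + 1):
        if pred["token_str"].strip() != words[i]:    # skip the original word
            variants.append(pred["sequence"])        # full caption with the fill
    return variants[:num_variants]

# Hypothetical caption used purely for illustration.
print(augment_caption("a dog runs across the green field"))
```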
Article
Computer vision has become ubiquitous in our society, with applications in several fields. Given a large set of images with their captions, the goal is to build a predictive model that produces natural, creative, and interesting captions for unseen images. A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene. The aim here is to simplify the problem of generating captions for images by building a model that gives accurate captions, which can further be used in other helpful applications and use cases. However, this remarkable ability has proven to be an elusive task for visual recognition models. Most previous research in scene recognition has concentrated on labeling images with a fixed set of visual categories, and great progress has been made in these tasks. For a query image, earlier methods retrieve relevant candidate natural-language phrases by visually comparing the query image with database images. However, while closed vocabularies of visual concepts constitute a convenient modeling assumption, they are vastly restrictive compared with the enormous range of rich descriptions that a human can compose, and such approaches impose a limit on the variety of captions generated. The model should therefore be free of assumptions about specific predetermined templates, rules, or categories and instead rely on learning to generate sentences from the training data. The proposed model uses a convolutional neural network to extract features from the image whose caption is to be generated, and then forms suitable sentences and produces captions using a probabilistic approach and natural language processing techniques.
Conference Paper
Attention-based approaches have proven to be effective in image captioning. Attention can be applied to the text, known as semantic attention, or to the image, known as spatial attention. We chose to implement the latter, since a core problem in image captioning is failing to detect the objects in an image properly. In this work, we develop an approach that extracts features from images using two different convolutional neural networks and combines these features with an attention model in order to generate captions with an RNN. We adopt Xception and InceptionV3 as our CNNs and GRU as our RNN. Moreover, we evaluate the proposed model on the Flickr8k dataset translated into Bengali, so that captions can be generated in Bengali using visual attention.
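A minimal sketch of the spatial-attention step described here is given below, under assumed feature and hidden sizes: at each decoding step the GRU hidden state scores every spatial location of the CNN feature grid, and the weighted context vector is returned for the next decoding step.

```python
# Minimal additive (Bahdanau-style) spatial attention over a flattened CNN feature grid.
# Shapes are assumptions for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, L, feat_dim) flattened grid, e.g. L = 8*8 locations
        # hidden:   (B, hidden_dim) current GRU state
        scores = self.v(torch.tanh(self.w_feat(features) +
                                   self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)           # (B, L, 1) attention weights
        context = (alpha * features).sum(dim=1)        # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)

attend = SpatialAttention(feat_dim=2048, hidden_dim=512)
ctx, alpha = attend(torch.randn(2, 64, 2048), torch.randn(2, 512))
```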
Chapter
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by successes in text analysis and translation, previous works have proposed the transformer architecture for image captioning. However, the structure of the semantic units in images (usually the regions detected by an object detection model) differs from that of sentences (individual words), and limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the inner architecture of the original transformer layer to adapt to the structure of images. With only region features as input, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks. The code is available at https://github.com/wtliao/ImageTransformer.
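For orientation only, the sketch below applies a vanilla PyTorch transformer encoder to detected-region features; the image transformer proposed in the paper modifies the layer's inner structure, so this shows just the unmodified baseline it departs from, with assumed dimensions.

```python
# Baseline sketch only: a plain Transformer encoder over detected-region features
# (the paper's image transformer widens the layer's inner architecture instead).
import torch
import torch.nn as nn

region_dim, d_model = 2048, 512
proj = nn.Linear(region_dim, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)

regions = torch.randn(2, 36, region_dim)   # e.g. 36 detected regions per image
memory = encoder(proj(regions))            # (2, 36, d_model) contextualized region features
# `memory` would then feed a caption decoder via cross-attention.
```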