(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 2, 2021
A Hybridized Deep Learning Method for Bengali
Image Captioning
Mayeesha Humaira1, Shimul Paul2, Md Abidur Rahman Khan Jim3, Amit Saha Ami4, Faisal Muhammad Shah5
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
Abstract—An omnipresent challenging research topic in computer vision is the generation of captions from an input image. Previously, numerous experiments have been conducted on image captioning in English, but caption generation from images in Bengali is still sparse and in need of more refining. Only a few papers till now have worked on image captioning in Bengali. Hence, we proffer a standard strategy for Bengali image caption generation on two different sizes of the Flickr8k dataset and on the BanglaLekha dataset, which is the only publicly available Bengali dataset for image captioning. Afterward, the Bengali captions of our model were compared with Bengali captions generated by other researchers using different architectures. Additionally, we employed a hybrid approach based on InceptionResnetV2 or Xception as the Convolutional Neural Network and Bidirectional Long Short-Term Memory or Bidirectional Gated Recurrent Unit on the two Bengali datasets. Furthermore, different combinations of word embeddings were also adopted. Lastly, the performance was evaluated using the Bilingual Evaluation Understudy score, which proved that the proposed model indeed performed better for the Bengali dataset consisting of 4000 images and for the BanglaLekha dataset.
Keywords—Bengali image captioning; hybrid architecture; InceptionResNet; Xception
I. INTRODUCTION
An image is worth a thousand stories. It is effortless for humans to describe these stories, but it is troublesome for a machine to portray them. To obtain captions from images it is necessary to combine computer vision and natural language processing. Previously, a lot of research has been done on image captioning, but most of it was done in English; research on image captioning in other languages [13], [15], [16] is still limited. Few works until now have been conducted on image captioning in Bengali [5], [23], [37], so we aim to explore image captioning in the Bengali language further.
About 215 million people worldwide speak Bengali, among whom 196 million individuals are natives of India and Bangladesh. Bengali is the 7th most utilized language worldwide1. As a result, it is momentous to generate image captions in Bengali alongside English. Moreover, most of the natives have no knowledge of English. Additionally, image captioning can be used to aid blind people by converting the generated text into speech so that they can understand the image. Also, surveillance footage can be captioned in real time so that theft, crime or accidents can be detected faster.
The main issue of image captioning in the Bengali language is the availability of a dataset: most of the available datasets are in English. English datasets can be translated using manual labor or machine translation. Although manual translations have higher accuracy, producing them is extremely monotonous and troublesome; machine translation, on the other hand, provides a better solution. In our experiment, we used a machine translator, Google Translator2, to translate English captions to Bengali and manually modified those translations that were syntactically incorrect. Furthermore, we also utilized the BanglaLekha3 dataset, which is the only publicly available Bengali dataset for image captioning till now. All the captions in this dataset are in Bengali and human annotated. We employed two approaches to captioning images in Bengali. Firstly, a hybrid model was used, as demonstrated in Fig. 1, where two embedding layers were concatenated: one was GloVe [22], which utilizes a file pre-trained in Bengali, and the other was fastText [7], which was trained on the available vocabulary. Secondly, two different models were trained with a single embedding: one with only a trainable fastText embedding and the other with a GloVe embedding pre-trained in Bengali. For all three cases, InceptionResnetV2 [28] and Xception [38] were used as the Convolutional Neural Network (CNN) to detect objects from images.
In this work, we proposed a hybridized Deep Learning method for image captioning, achieved by concatenating two word embeddings. The contributions of this paper are as follows:
• We introduced a hybridized method of image captioning where two word embeddings, pre-trained GloVe and fastText, were concatenated.
• Experiments were carried out on both of our models using Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU). BiGRU has not been used before for image captioning in languages other than English.
• Moreover, these two models have been tested on two Flickr8k datasets of varying sizes. One dataset contains 4000 images and the other contains 8000 images. To the best of our knowledge, no paper has used the full Flickr8k dataset translated into Bengali for image captioning.
• Additionally, our model was also tested on the BanglaLekha dataset, which contains 9154 images.
1 https://www.vistawide.com/languages/top\30\languages.htm
2 https://translate.google.com/
3 https://data.mendeley.com/datasets/rxxch9vw59/2
• Lastly, it was shown that our proposed hybrid model achieved higher BLEU scores for both the Flickr4k-BN dataset and the BanglaLekha dataset.
Fig. 1. Illustration of Hybridized (Right) Model and Model with Single
Embedding FastText or GloVe (Left).
II. RELATED WORK
This section depicts the progress in image captioning.
Hitherto, many kinds of research have been conducted and many models have been developed to generate captions that are syntactically correct. The authors in [2] presented a model that deems the probabilistic distribution of the next word using the previous word and image features. On the other hand, H. Dong et al. [6] proposed a new training method, Image-Text-Image, which amalgamates text-to-image and image-to-text synthesis to revamp the performance of text-to-image synthesis. Furthermore, J. Aneja [21] and S. J. Rennie [25] adapted the attention mechanism to generate captions. For the vision part of image captioning, VGG16 was used as the CNN by most of the papers [2], [11], [24], [25], [27], [30], but some also used YOLO [9], InceptionV3 [6], [31], AlexNet [24], [30], ResNet [11], [18], [24] or UNet [4] as the CNN for feature extraction. Concurrently, LSTM [6], [9], [11], [17], [31] was used by most of the papers for generating the next word in the sequence, although some researchers also utilized RNN [19] or BiLSTM [4], [30]. Moreover, P. Blandfort et al. [32] systematically characterized diverse image captions that appear “in the wild” in order to understand how people caption images naturally. Alongside English, researchers have also generated captions in Chinese [15], [16], Japanese [1], Arabic [12], Bahasa Indonesia [13], Hindi [26], German [29] and Bengali [5], [23]. M. Rahman et al. [23] generated image captions in Bengali for the first time, followed by T. Deb et al. [5]. The researchers of paper [23] used VGG-16 to extract image features and stacked LSTMs. On the contrary, the researchers of paper [5] generated image captions using InceptionResnetV2 or VGG-16 and LSTM. They utilized 4000 images of the Flickr8k dataset to generate captions. We modified the merge model adopted by paper [5] to get much better and more fluent captions in Bengali.
Only three works have been done on image captioning in Bengali till now. The authors of [23] were the first to work on image captioning in Bengali, followed by [5] and [37]. Rahman et al. [23] aimed to outline an automatic image captioning system in Bengali called ‘Chittron’. Their model was trained to predict the Bengali caption of an input image one word at a time. The training process was carried out on 15700 images of their own dataset, BanglaLekha. In their model, the image feature vector and the word vectors obtained from the embedding layer were fed to stacked LSTM layers. One drawback of their work was that they utilized the sentence BLEU score instead of the corpus BLEU score. On the other hand, Deb et al. [5] illustrated two models, a Par-Inject architecture and a Merge architecture, for image captioning in Bengali. In the Par-Inject model, image feature vectors were fed into an intermediate LSTM, and the output of that LSTM and the word vectors were combined and fed to another LSTM to generate the caption in Bengali. In the Merge model, image feature vectors and word vectors were combined and passed to an LSTM without the use of an intermediate LSTM. They utilized 4000 images of the Flickr8k dataset, and the Bengali captions their models generated were not fluent. Paper [37] used a CNN-RNN based model where VGG-16 was used as the CNN and an LSTM with 256 channels was used as the RNN. They trained their model on the BanglaLekha dataset having 9154 images.
To overcome the above-mentioned drawback regarding caption fluency, we conducted our experiment using a hybridized approach. Moreover, we used 8000 images of the Flickr8k dataset alongside the Flickr4k dataset. We further validated the performance of our model using the human-annotated BanglaLekha dataset.
III. OUR APPROACH
We employed an Encoder-Decoder approach where both InceptionResnetV2 and Xception were used separately in different experimental setups to encode images into feature vectors, and different word embeddings were used to convert the vocabulary into word vectors. Image feature vectors and word vectors, after passing through a special kind of RNN, were merged and passed to a decoder to predict captions word by word; this process is illustrated in Fig. 2. We propose a hybrid model that consists of two embedding layers, unlike the merge model of [5]. We also conducted experiments on the merge model having either a pre-trained GloVe [22] or a trainable fastText [7] embedding. To be more precise, we trained the merge model using three settings, as shown in Fig. 1.
Our proposed hybrid model is shown in Fig. 3. It consists of two parts: an encoder and a decoder.
Encoder
The encoder comprises two parts: one for handling image features and another for handling word sequence pairs. Firstly, image features were extracted using InceptionResnetV2 [28] or Xception [38]. These image features were passed to a dropout layer followed by a fully connected layer and then another dropout layer. The fully connected layer was used to reduce the dimension of the image feature vector from 1536 or 2048 to 256 to match the dimension of the word prediction output. Secondly, input word sequence pairs were fed to two embedding layers: one was a pre-trained GloVe embedding and the other was fastText, which was not pre-trained. Both embeddings were used to convert words to vectors of dimension 100. The vectors from the two embeddings were then passed through separate dropout layers followed by either a BiLSTM or a BiGRU of dimension 128. To match the dimension of the visual feature vector output, these vectors were passed through an additional fully connected layer of dimension 256. These two outputs were then concatenated. This concatenated output was then combined with the visual part of the encoder using another concatenation and forwarded to the decoder.
Fig. 2. An overview of how captions are generated word by word using our model.
Decoder
The decoder is a feed-forward network that ends with a Softmax layer. It takes the concatenated output of the encoder as input. This input was first passed through a fully connected layer of 256 dimensions followed by a dropout layer. Finally, a probabilistic Softmax function outputs the next word in the sequence; the Softmax output is decoded greedily by selecting the word with the maximum probability.
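For concreteness, the following is a minimal Keras sketch of the hybrid encoder-decoder described above. Layer sizes follow the text (a 256-dimensional image projection, two 100-dimensional embeddings, 128-unit bidirectional recurrent layers, and the stated dropout rates), while names such as vocab_size, max_length and glove_matrix are placeholders we introduce for illustration; this is a sketch under those assumptions, not the authors' released code.

```python
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     Bidirectional, LSTM, Concatenate)
from tensorflow.keras.models import Model

def build_hybrid_model(vocab_size, max_length, glove_matrix, feature_dim=1536):
    """Sketch of the hybrid merge model: CNN features + two word embeddings."""
    # Visual branch: pre-extracted CNN feature vector -> dropout -> FC(256) -> dropout
    img_in = Input(shape=(feature_dim,), name='image_features')
    v = Dropout(0.0)(img_in)
    v = Dense(256, activation='elu')(v)
    v = Dropout(0.5)(v)

    # Language branch: the same partial-caption sequence fed to two embeddings
    seq_in = Input(shape=(max_length,), name='partial_caption')
    e_glove = Embedding(vocab_size, 100, weights=[glove_matrix],
                        trainable=False)(seq_in)                  # pre-trained Bengali GloVe
    e_fast = Embedding(vocab_size, 100, trainable=True)(seq_in)   # trainable (fastText-style)

    s1 = Dense(256, activation='relu')(Bidirectional(LSTM(128))(Dropout(0.3)(e_glove)))
    s2 = Dense(256, activation='relu')(Bidirectional(LSTM(128))(Dropout(0.3)(e_fast)))

    # Concatenate the two sequence branches, then concatenate with the visual branch
    merged = Concatenate()([v, Concatenate()([s1, s2])])

    # Decoder: FC(256) -> dropout -> Softmax over the vocabulary
    d = Dropout(0.5)(Dense(256, activation='elu')(merged))
    out = Dense(vocab_size, activation='softmax')(d)
    return Model(inputs=[img_in, seq_in], outputs=out)
```

Replacing LSTM with GRU in the two Bidirectional wrappers yields the BiGRU variant of the same model.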
IV. EXPERIMENTAL SETUP
This section narrates the overall strategy adopted to obtain captions from images. The different tuning techniques used are also described here.
Fig. 3. Proposed Hybrid Model.
A. Dataset Processing
Flickr8k dataset has 8091 images of which 6000 (75%)
images are employed for training, 1000 (12.5%) images for
validation and 1000 (12.5%) images are used for testing.
Moreover, each image of the Flickr8k dataset is paired with five ground truth captions describing it, which adds up to a total of 40455 captions for 8091 images. For image captioning in Bengali, those 40455 captions were converted
to the Bengali language using Google Translator. Unfortunately, some of the translated captions were syntactically incorrect. Hence, we manually checked all 40455 translated captions and corrected them. We utilized all 8000 images as well as a selection of 4000 images, as done by Deb et al. [5], in Bengali (Flickr8k-BN and Flickr4k-BN). These 4000 images were selected based on the frequency of words in those 40455 captions. Using POS taggers, the most frequent Bengali noun words were identified from the ground truth captions; the most frequent words in the Bengali Flickr8k dataset are shown in Fig. 4 for Bengali and English, respectively. The 4000 images corresponding to these words were selected to form the smaller Flickr4k-BN dataset.
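A minimal sketch of this frequency-based selection is shown below, assuming the Bengali captions are already loaded into a dict mapping image IDs to caption lists and that a set of noun tokens (obtained from a Bengali POS tagger) is available; the names `captions`, `noun_tokens` and the value `top_k=200` are illustrative, not the authors' exact procedure.

```python
from collections import Counter

def select_images_by_frequent_nouns(captions, noun_tokens, top_k=200, n_images=4000):
    """Pick the images whose captions contain the most frequent noun words."""
    # Count how often each noun appears across all ground-truth captions.
    freq = Counter(tok for caps in captions.values()
                       for cap in caps
                       for tok in cap.split() if tok in noun_tokens)
    top_nouns = {word for word, _ in freq.most_common(top_k)}

    # Keep images whose captions mention at least one of the top nouns.
    selected = [img for img, caps in captions.items()
                if any(tok in top_nouns for cap in caps for tok in cap.split())]
    return selected[:n_images]
```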
We also utilized the BanglaLekha dataset, which consists of 9154 images. It is the only publicly available Bengali image captioning dataset till now, and all its captions are human annotated. One problem with this dataset is that it has only two captions associated with each image, resulting in 18308 captions for those 9154 images. Hence, its vocabulary size is lower than that of Flickr4k-BN and Flickr8k-BN: Flickr8k-BN consists of 12953 unique Bengali words, Flickr4k-BN consists of 6420 unique Bengali words and BanglaLekha consists of 5270 unique Bengali words. It can be seen that the BanglaLekha dataset has a vocabulary size even lower than Flickr4k-BN. Hence, we employed the Flickr8k-BN dataset alongside the Flickr4k-BN and BanglaLekha datasets. The split ratios of all three datasets for training, testing and validation are shown in Table I.
Fig. 4. Illustration of Most Frequent Noun Bengali Words in Flickr8k
Bengali Dataset.
B. Image Feature Extraction
One essential part of image captioning is to extract features
from given images. This task is achieved using Convolutional
Neural Network architectures. These architectures are used to
detect objects from images. They can be trained on a large
number of images for extracting image features. This training
process requires an enormous number of images and time.
Due to the shortage of a large number of images, we utilized Convolutional Neural Network architectures that were pre-trained on more than a million images from the ImageNet [33] dataset, namely InceptionResnetV2 [28] and Xception [38]. These two pre-trained architectures were used separately in different experimental setups. The reason for using InceptionResnetV2 and Xception is that these models can achieve higher accuracy at lower epochs. The last layer of the pre-trained InceptionResnetV2 model, which is used for prediction purposes, was pulled out, and the last two layers of the pre-trained Xception model were pulled out. Finally, the average pooling layer was used to extract image features and convert them into a feature vector of 1536 dimensions for InceptionResnetV2 and 2048 dimensions for Xception. All the images were given an input shape of 299x299x3 before entering the InceptionResnetV2 model. Here 3 represents the three color channels R, G and B.
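The feature extraction step can be sketched as follows with the pre-trained ImageNet weights shipped with Keras; the swap to Xception (2048-dimensional output, from `tensorflow.keras.applications.xception`) is analogous, and the file path is a placeholder.

```python
import numpy as np
from tensorflow.keras.applications.inception_resnet_v2 import (
    InceptionResNetV2, preprocess_input)
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# include_top=False drops the classification layers; pooling='avg' applies global
# average pooling, giving a 1536-dimensional vector (2048 for Xception).
cnn = InceptionResNetV2(weights='imagenet', include_top=False, pooling='avg')

def extract_features(image_path):
    img = load_img(image_path, target_size=(299, 299))              # 299x299x3 input
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return cnn.predict(x)[0]                                         # shape: (1536,)
```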
C. Embeddings
Handling word sequences requires word embeddings that can convert words to vectors before passing them to special recurrent neural networks (RNNs). In our model, GloVe [22] and fastText [7] have been used as embeddings.
• GloVe is a model for distributed word representation. The model employs an unsupervised learning algorithm for acquiring vector representations of words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.
• fastText is a library for learning word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model employs unsupervised or supervised learning algorithms for obtaining vector representations of words. fastText offers two models for computing word representations, namely skipgram and cbow. The skipgram model learns to forecast a target word using the nearby words. Conversely, the cbow model forecasts the target word according to its context, where the context is a bag of the words contained in a fixed-size window around the target word.
Both GloVe and fastText have pre-trained word vectors that are trained over a large vocabulary. These embeddings can also be trained. In the hybrid model shown in Fig. 3, two embeddings have been used: GloVe and fastText. There, GloVe was pre-trained while fastText was trained on the vocabulary available in the dataset. Trainable fastText was used instead of pre-trained fastText to enrich the vocabulary with the words in the Flickr8k and BanglaLekha datasets; also, results with pre-trained fastText have already been demonstrated by Deb et al. [5]. The combination of two embeddings leads to redundancy of words, but it gives fluent captions in Bengali as the vocabulary size increases. On the other hand, using pre-trained files for both GloVe and fastText in the hybrid model would give much greater redundancy, and the vocabulary size becomes small as it does not contain the unique words in the dataset. Two other models were trained alongside the hybrid model. Unlike the hybrid model, these two models had a single embedding, either a trainable fastText embedding or a pre-trained GloVe embedding. The GloVe file “bn_glove.39M.100d”4, pre-trained in the Bengali language, was used for the Bengali datasets.
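A minimal sketch of how the pre-trained Bengali GloVe vectors can be loaded into the embedding matrix used by the non-trainable embedding layer; `word_index` (the tokenizer's word-to-index map) and the file path are assumptions on our part.

```python
import numpy as np

def load_glove_matrix(glove_path, word_index, dim=100):
    """Build an embedding matrix from a pre-trained Bengali GloVe file.

    Words missing from the GloVe file keep a zero vector, so they are
    effectively learned only through the trainable fastText-style branch.
    """
    vectors = {}
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

    matrix = np.zeros((len(word_index) + 1, dim))   # index 0 reserved for padding
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# e.g. glove_matrix = load_glove_matrix('bn_glove.39M.100d.txt', tokenizer.word_index)
```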
D. Word Sequence Generation
The Flickr8k dataset has five captions associated with each image and BanglaLekha has two captions associated with each image.
4 https://github.com/sagorbrur/GloVe-Bengali
TABLE I. DISTRIBUTION OF DATA FOR THE THREE BENGALI DATASETS USED. THE SAME DISTRIBUTION WAS USED FOR THE FLICKR8K ENGLISH AND BENGALI DATASETS.
Dataset | Total Images | Training | Validation | Testing
Flickr4k | 4000 | 2400 (60%) | 800 (20%) | 800 (20%)
Flickr8k | 8000 | 6000 (75%) | 1000 (12.5%) | 1000 (12.5%)
BanglaLekha | 9154 | 7154 (78%) | 1000 (11%) | 1000 (11%)
One of the difficult tasks of image captioning is to make the model learn how to generate these sentences. Two different types of special Recurrent Neural Network (RNN) were used to train the model to generate the next word in the sequence of a caption. The input and output sizes were fixed to the maximum length of the sentences present in the dataset. In the case of Flickr4k-BN and Flickr8k-BN, the maximum length was 23. On the other hand, two different maximum sequence lengths, 40 and 26, were used for the BanglaLekha dataset. Reducing the maximum sequence length significantly increased the evaluation scores. During training, if any sequence had a length less than the maximum length, zero-padding was applied to make that sequence equal to the fixed length. Additionally, an extra start token and end token were added to each sequence pair for identification in the training process. During training, the image feature vector and the previous words, converted to vectors using the embedding layer, were used to generate the next word in the sequence via a probabilistic Softmax with the help of different types of RNN. Fig. 5 illustrates the input and output pairs.
Fig. 5. Demonstrates How Word Sequences are Generated.
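The construction in Fig. 5 can be sketched as follows, assuming each caption has already been wrapped in start and end tokens and fitted to a Keras Tokenizer; the one-hot target shown here matches the categorical cross-entropy setting (for sparse categorical cross-entropy the integer index `seq[i]` would be used directly). Variable names are illustrative.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def build_sequences(tokenizer, caption, image_feature, max_length, vocab_size):
    """Turn one caption into (image, partial sequence) -> next-word training pairs."""
    X_img, X_seq, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]   # caption already has start/end tokens
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]        # zero-padded prefix
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]  # next word, one-hot
        X_img.append(image_feature)
        X_seq.append(in_seq)
        y.append(out_word)
    return np.array(X_img), np.array(X_seq), np.array(y)
```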
Due to the limitation of the basic Recurrent Neural Network (RNN) [34] in retaining long-term memory, a better approach was taken by Deb et al. [5], which uses Long Short-Term Memory (LSTM). However, LSTM [10] only preserves preceding words, but for proper sentence generation succeeding words are also necessary. As a result, our model uses Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU), which are illustrated in Fig. 6. Each box marked as A or A' was either a Long Short-Term Memory (LSTM) or a Gated Recurrent Unit (GRU) [8] unit. X[0...i] are the input words and Y[0...i] are the output words. Y[0...i] are determined using Eq. (1).
$\hat{y}^{<t>} = g(W_y[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y)$  (1)
where $\hat{y}^{<t>}$ is the output at time t when the activation function g is applied to the recurrent component's weight $W_y$ and bias $b_y$, with both the forward activation $\overrightarrow{a}^{<t>}$ and the backward activation $\overleftarrow{a}^{<t>}$ at time t.
Fig. 6. Illustrates a Bidirectional RNN having X[0...i] as input and Y[0...i] as output. The A and A' boxes are both either LSTM or GRU units, where A is the forward recurrent component and A' is the backward recurrent component.
GRU is a special type of RNN. The reset and update gates of a GRU help to solve the vanishing gradient problem of RNNs. The update gate of the GRU decides how much information from the previous units must be forwarded. The update gate is computed by the following formula:
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$  (2)
where $z_t$ is the update gate output at the current timestamp, $W_z$ is the weight matrix of the update gate, $h_{t-1}$ is the information from previous units, and $x_t$ is the input at the current unit.
The reset gate is used by the model to determine how much information from the previous units to forget. The reset gate is computed by the following formula:
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$  (3)
where $r_t$ is the reset gate output at the current timestamp, $W_r$ is the weight matrix of the reset gate, $h_{t-1}$ is the information from previous units, and $x_t$ is the input at the current unit.
The current memory content is used to store the relevant information from the previous units. It is calculated as follows:
$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$  (4)
where $\tilde{h}_t$ is the current memory content, $W$ is the weight of the current unit, $r_t$ is the reset gate output at the current timestamp, $h_{t-1}$ is the information from previous units, and $x_t$ is the input at the current unit.
The final memory at the current unit is a vector used to store the final information for the current unit and pass it to the next layer. It is calculated using the formula:
$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$  (5)
where $h_t$ is the final memory at the current unit, $z_t$ is the update gate output at the current timestamp, $h_{t-1}$ is the information from previous units, and $\tilde{h}_t$ is the current memory content.
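Equations (2)-(5) amount to a single GRU step; the following is a minimal NumPy sketch (biases omitted, as in the equations above, with weight matrices assumed to be already learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following Eq. (2)-(5); [h_{t-1}, x_t] is the concatenated input."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                        # update gate, Eq. (2)
    r_t = sigmoid(W_r @ hx)                                        # reset gate, Eq. (3)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # current memory, Eq. (4)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # final memory, Eq. (5)
```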
LSTM is another special type of RNN. Unlike the GRU, the LSTM has three gates, namely the input gate, the forget gate and the output gate. The equations for the gates in the LSTM are:
$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$  (6)
$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$  (7)
$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$  (8)
where $i_t$ represents the input gate, $f_t$ represents the forget gate, $o_t$ represents the output gate, $\sigma$ represents the sigmoid function, $W_x$ represents the weights of the respective gate (x) neurons, $h_{t-1}$ represents the output of the previous LSTM block at timestamp t-1, $x_t$ represents the input at the current timestamp and $b_x$ represents the biases of the respective gates (x).
The input gate tells what new information is going to be stored in the cell state, the forget gate determines what information to throw away from the cell state, and the output gate is used to provide the output at timestamp t. The equations for the cell state, the candidate cell state and the final output are:
$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$  (9)
$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$  (10)
$h_t = o_t * \tanh(c_t)$  (11)
where $c_t$ represents the cell state at timestamp t and $\tilde{c}_t$ represents the candidate cell state at timestamp t. The candidate cell state must be generated to obtain the memory vector $c_t$ for the current timestamp. The cell state is then passed through an activation function to generate $h_t$. Finally, $h_t$ is passed through a Softmax layer to get the output $y_t$.
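Similarly, equations (6)-(11) describe one LSTM step; a minimal NumPy sketch with the bias terms included:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
    """One LSTM step following Eq. (6)-(11); [h_{t-1}, x_t] is the concatenated input."""
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ hx + b_i)        # input gate, Eq. (6)
    f_t = sigmoid(W_f @ hx + b_f)        # forget gate, Eq. (7)
    o_t = sigmoid(W_o @ hx + b_o)        # output gate, Eq. (8)
    c_tilde = np.tanh(W_c @ hx + b_c)    # candidate cell state, Eq. (9)
    c_t = f_t * c_prev + i_t * c_tilde   # new cell state, Eq. (10)
    h_t = o_t * np.tanh(c_t)             # hidden state, Eq. (11)
    return h_t, c_t
```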
E. Hyperparameter Selection
One major problem of machine learning is overfitting. Overfit models have high variance and cannot generalize well; as a result, this is a huge problem for image captioning. We observed the performance of our model and noticed that it was suffering from overfitting rather than underfitting. To minimize this overfitting problem, some hyperparameter tuning was adopted in our model. Firstly, different dropout [35] values were used for the sequence model, the image feature extractor and the decoder; dropout helps prevent overfitting. For the feature extractor a dropout value of 0.0 was used, a dropout of 0.3 was used for the sequence model and in the case of the decoder a dropout value of 0.5 was utilized. Secondly, different activation functions were employed for the different fully connected layers: for the feature extractor model and the decoder the ELU [3] activation function was used, and for the sequence model the ReLU [36] activation function was employed. Thirdly, we employed external validation to provide an unbiased evaluation, and ModelCheckpoint was used to save the models that had the minimum validation loss. On the other hand, ReduceLROnPlateau was used for the models that had Xception as the CNN. Moreover, the Adam optimizer [14] was utilized and the models were trained for 50 and 100 epochs with learning rates of 0.0001 and 0.00001. A short summary of the hyperparameters adopted in the different models is shown in Table II, and the loss plots of the BanglaLekha dataset and the Flickr8k-BN dataset are shown in Fig. 7 and Fig. 8, respectively. From these plots, it can be seen that the model converges towards epoch 100. Another important factor that improved the results was the maximum sentence length. In BanglaLekha, only a few sentences had lengths greater than 26. As a result, we set the maximum sentence length for this dataset to 26, which enhanced the evaluation scores greatly.
Fig. 7. Loss Plot of BanglaLekha Dataset for 100 epochs.
Fig. 8. Loss Plot of Flickr8k-BN Dataset for 100 epochs.
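The training configuration summarized above and in Table II translates into Keras calls along the following lines; `model` and the training/validation arrays are placeholders, and the ReduceLROnPlateau factor and patience are our own illustrative choices since the text does not state them.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# InceptionResnetV2-based models: categorical cross-entropy, lr = 1e-4, 50 epochs,
# keeping only the weights with minimum validation loss.
model.compile(optimizer=Adam(learning_rate=1e-4), loss='categorical_crossentropy')
callbacks = [ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True)]
model.fit([X_img_train, X_seq_train], y_train,
          validation_data=([X_img_val, X_seq_val], y_val),
          epochs=50, callbacks=callbacks)

# Xception-based models: sparse categorical cross-entropy, lr = 1e-5, 100 epochs,
# reducing the learning rate when the validation loss plateaus.
model.compile(optimizer=Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy')
callbacks = [ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)]
model.fit([X_img_train, X_seq_train], y_train_int,
          validation_data=([X_img_val, X_seq_val], y_val_int),
          epochs=100, callbacks=callbacks)
```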
V. ANALYSIS
We implemented the algorithm using Keras 2.3.1 and Python 3.8.1. Additionally, we ran our experiments on an RTX 2060 GPU. Our code and the Bengali Flickr8k dataset are available on GitHub5. We translated the Flickr8k dataset to Bengali using Google Translator, similar to what was done by [16]. The Bilingual Evaluation Understudy (BLEU) [20] score was used to evaluate the performance of our models, as it is the most widely used metric nowadays to evaluate the caliber of generated text. It depicts how natural the generated sentences are compared with human-generated sentences and is broadly utilized to evaluate the performance of machine translation. Sentences are compared based on modified n-gram precision for generating BLEU scores. BLEU scores are computed using the following equations:
$P(i) = \frac{Matched(i)}{H(i)}$  (12)
5 https://github.com/MayeeshaHumaira/A-Hybridized-Deep-Learning-Method-for-Bengali-Image-Captioning
TABLE II. HYPERPARAMETERS ADOPTED IN DIFFERENT MODELS.
Search Type | Model | Learning Rate | Loss Function | Callback | Epochs
Greedy | Xception+BiLSTM | 0.00001 | Sparse Categorical Crossentropy | ReduceLROnPlateau | 100
Greedy | InceptionResnetV2+BiLSTM | 0.0001 | Categorical Crossentropy | ModelCheckpoint | 50
Beam=3 | Xception+BiLSTM | 0.00001 | Sparse Categorical Crossentropy | ReduceLROnPlateau | 100
Beam=5 | Xception+BiLSTM | 0.00001 | Sparse Categorical Crossentropy | ReduceLROnPlateau | 100
where P(i) is the precision for each i-gram, i = 1, 2, ..., N, i.e. the percentage of the i-gram tuples in the hypothesis that also occur in the references. H(i) is the number of i-gram tuples in the hypothesis, and Matched(i) is computed using the following formula:
$Matched(i) = \sum_{t_i} \min\{C_h(t_i), \max_j C_{hj}(t_i)\}$  (13)
where $t_i$ is an i-gram tuple in hypothesis h, $C_h(t_i)$ is the number of times $t_i$ occurs in the hypothesis, and $C_{hj}(t_i)$ is the number of times $t_i$ occurs in reference j of this hypothesis.
$\rho = \exp\{\min(0, \frac{n - L}{n})\}$  (14)
where $\rho$ is the brevity penalty to penalize short translations, n is the length of the hypothesis and L is the length of the reference. Finally, the BLEU score is computed by:
$BLEU = \rho \cdot \{\prod_{i=1}^{N} P(i)\}^{\frac{1}{N}}$  (15)
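In practice, corpus-level BLEU-1 to BLEU-4 can be computed with NLTK's implementation of Eq. (12)-(15) using the standard cumulative weights; `references` (one list of tokenized ground-truth captions per test image) and `hypotheses` (one tokenized generated caption per test image) are placeholder names.

```python
from nltk.translate.bleu_score import corpus_bleu

# Cumulative n-gram weights give BLEU-1 through BLEU-4.
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
```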
Two different search types, greedy and beam search, were used to compute these BLEU scores. In a greedy search, the word with the maximum probability is chosen as the next word in the sequence. On the other hand, beam search considers n candidate words to choose from for the next word in the sequence, where n is the width of the beam. For our experiment, we considered beam widths of 3 and 5. We computed 1-gram BLEU (BLEU-1), 2-gram BLEU (BLEU-2), 3-gram BLEU (BLEU-3) and 4-gram BLEU (BLEU-4) for the various architectures. These are illustrated in Table III, Table IV and Table V.
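A minimal sketch of the two decoding strategies follows; with beam_width=1 the loop reduces to the greedy search described above. The start/end token names and variable names are illustrative, not the authors' exact implementation.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, image_feature, tokenizer, max_length, beam_width=3):
    """Beam search decoding; beam_width=1 reduces to the greedy search."""
    start = tokenizer.word_index['startseq']   # start/end token names are illustrative
    end = tokenizer.word_index['endseq']
    beams = [([start], 0.0)]                   # (token sequence, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                 # finished captions are carried over unchanged
                candidates.append((seq, score))
                continue
            padded = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([np.array([image_feature]), padded], verbose=0)[0]
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq = max(beams, key=lambda c: c[1])[0]
    index_to_word = {v: k for k, v in tokenizer.word_index.items()}
    return ' '.join(index_to_word.get(i, '') for i in best_seq[1:] if i != end)
```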
The performance of the proposed hybrid architecture and of the single embedding GloVe or fastText models on the Flickr4k-BN dataset, consisting of 4000 images in Bengali, is demonstrated in Table III. From Table III it can be stated that the hybrid model performed better on the Bengali dataset, for both BiLSTM and BiGRU, than the GloVe-only and fastText-only word embeddings. Moreover, we obtained better BLEU scores than paper [5]. The greedy search was employed to compute these BLEU scores.
Consequently, the performance of the single embedding GloVe or fastText models and of the hybrid architecture on the Flickr8k-BN dataset, consisting of 8000 images, and on the BanglaLekha dataset are displayed in Table IV and Table V, respectively. There, too, it can be observed that the proposed hybrid model performed better for both BiGRU and BiLSTM than the other models. The highest BLEU scores were obtained using BiLSTM on Flickr4k-BN and Flickr8k-BN; as a result, the captions generated by the hybrid model for both datasets are illustrated in Fig. 9. Furthermore, our proposed hybrid model also gave the highest BLEU scores for the BanglaLekha dataset for both BiLSTM and BiGRU, as shown in Table V. From there it can be observed that Xception and the learning rate played a vital role in increasing the BLEU scores. These scores were even better than the BLEU scores obtained by paper [37]. Table VI illustrates a brief comparison of the BLEU scores obtained by our proposed model and the scores obtained by other papers. From there it can be observed that our proposed hybrid model indeed gave a better performance. The captions generated by these models for test images of the BanglaLekha dataset are shown in Fig. 10. The Flickr8k-BN dataset consisting of 8000 images was not previously used by any other paper for generating captions in Bengali.
Fig. 9. Illustration of Captions Generated by the Best Performing Hybrid Architecture using the Flickr4k-BN and Flickr8k-BN Datasets.
TABLE III. RESULTS OF INCEPTIONRESNETV2 USED ON FLICKR4K-BN
Experimental Model | RNN | Training Accuracy | Validation Accuracy | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
Proposed hybrid architecture | BiLSTM | 0.421 | 0.387 | 0.661 | 0.508 | 0.382 | 0.229
Proposed hybrid architecture | BiGRU | 0.432 | 0.386 | 0.660 | 0.503 | 0.371 | 0.215
GloVe | BiLSTM | 0.432 | 0.388 | 0.644 | 0.491 | 0.369 | 0.220
GloVe | BiGRU | 0.429 | 0.386 | 0.651 | 0.497 | 0.373 | 0.223
fastText | BiLSTM | 0.414 | 0.372 | 0.638 | 0.490 | 0.370 | 0.219
fastText | BiGRU | 0.426 | 0.379 | 0.653 | 0.505 | 0.381 | 0.226
TABLE IV. BLEU SCORES OBTAINED USING THE FLICKR8K-BN DATASET
Search Type | Learning Rate | Word Embedding | Experimental Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
Greedy | 0.00001 | Hybrid | Xception+BiLSTM | 0.504 | 0.326 | 0.232 | 0.119
Greedy | 0.00001 | Hybrid | Xception+BiGRU | 0.536 | 0.352 | 0.246 | 0.126
Greedy | 0.00001 | GloVe | Xception+BiLSTM | 0.539 | 0.356 | 0.249 | 0.129
Greedy | 0.00001 | GloVe | Xception+BiGRU | 0.532 | 0.352 | 0.241 | 0.121
Greedy | 0.00001 | fastText | Xception+BiLSTM | 0.190 | 0.055 | 0.000 | 0.000
Greedy | 0.00001 | fastText | Xception+BiGRU | 0.194 | 0.068 | 0.012 | 0.000
Greedy | 0.0001 | Hybrid | InceptionResnetV2+BiLSTM | 0.540 | 0.370 | 0.268 | 0.145
Greedy | 0.0001 | Hybrid | InceptionResnetV2+BiGRU | 0.526 | 0.360 | 0.261 | 0.141
Greedy | 0.0001 | GloVe | InceptionResnetV2+BiLSTM | 0.534 | 0.369 | 0.265 | 0.142
Greedy | 0.0001 | GloVe | InceptionResnetV2+BiGRU | 0.512 | 0.350 | 0.255 | 0.138
Greedy | 0.0001 | fastText | InceptionResnetV2+BiLSTM | 0.528 | 0.363 | 0.269 | 0.140
Greedy | 0.0001 | fastText | InceptionResnetV2+BiGRU | 0.530 | 0.362 | 0.260 | 0.140
Beam=3 | 0.00001 | Hybrid | Xception+BiLSTM | 0.416 | 0.246 | 0.176 | 0.089
Beam=3 | 0.00001 | Hybrid | Xception+BiGRU | 0.414 | 0.247 | 0.178 | 0.093
Beam=3 | 0.00001 | GloVe | Xception+BiLSTM | 0.395 | 0.239 | 0.174 | 0.089
Beam=3 | 0.00001 | GloVe | Xception+BiGRU | 0.404 | 0.245 | 0.178 | 0.090
Beam=3 | 0.00001 | fastText | Xception+BiLSTM | 0.034 | 0.000 | 0.000 | 0.000
Beam=3 | 0.00001 | fastText | Xception+BiGRU | 0.059 | 0.003 | 0.001 | 0.000
Beam=5 | 0.00001 | Hybrid | Xception+BiLSTM | 0.409 | 0.240 | 0.175 | 0.090
Beam=5 | 0.00001 | Hybrid | Xception+BiGRU | 0.403 | 0.239 | 0.171 | 0.089
Beam=5 | 0.00001 | GloVe | Xception+BiLSTM | 0.377 | 0.226 | 0.162 | 0.079
Beam=5 | 0.00001 | GloVe | Xception+BiGRU | 0.393 | 0.241 | 0.172 | 0.085
Beam=5 | 0.00001 | fastText | Xception+BiLSTM | 0.034 | 0.000 | 0.000 | 0.000
Beam=5 | 0.00001 | fastText | Xception+BiGRU | 0.059 | 0.003 | 0.001 | 0.000
VI. CONCLUSION
In this work, we exhibited a notion for automatically generating captions from an input image in Bengali. Firstly, a detailed description of how the Flickr8k dataset was translated into Bengali and split into datasets of two sizes was presented. Secondly, how image features were extracted and the different combinations of word embeddings utilized were also discussed. Moreover, the reasons for using a special kind of word sequence generator were elucidated. Furthermore, the different parts of the proposed architecture were described. Finally, using the BLEU score it was shown that the proposed architecture performs better for both the Flickr4k-BN and BanglaLekha datasets. This validates the fact that image captioning in the Bengali language can be refined further in the future. We will try to adopt visual attention and transformer models in the near future for better feature extraction and more precise captions. Additionally, we aim to build our own dataset having five captions for each image, unlike the BanglaLekha dataset that has two captions associated with each image, to enrich the vocabulary of our dataset.
REFERENCES
[1] Y. Yoshikawa, Y. Shigeto, and A. Takeuchi, “STAIR captions: Construct-
ing a large-scale Japanese image caption dataset,” ACL 2017 - 55th
Annu. Meet. Assoc. Comput. Linguist. Proc. Conf. (Long Pap., vol. 2,
pp. 417–421, 2017, doi: 10.18653/v1/P17-2066.
[2] J. Gu, G. Wang, J. Cai, and T. Chen, “An Empirical Study of Language
CNN for Image Captioning,” Proc. IEEE Int. Conf. Comput. Vis., vol.
2017-October, pp. 1231–1240, 2017, doi: 10.1109/ICCV.2017.138.
[3] D. A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep
network learning by exponential linear units (ELUs),” 4th Int. Conf.
Learn. Represent. ICLR 2016 - Conf. Track Proc., pp. 1–14, 2016.
[4] W. Cui et al., “Landslide image captioning method based on semantic
gate and bi-temporal LSTM,” ISPRS Int. J. Geo-Information, vol. 9, no.
4, 2020, doi: 10.3390/ijgi9040194.
[5] T. Deb et al., “Oboyob: A sequential-semantic Bengali image captioning
engine,” J. Intell. Fuzzy Syst., vol. 37, no. 6, pp. 7427–7439, 2019, doi:
10.3233/JIFS-179351.
[6] H. Dong, J. Zhang, D. Mcilwraith, and Y. Guo, “I2T2I: Learning text to image synthesis with textual data augmentation.”
[7] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, “Learning
word vectors for 157 languages,” Lr. 2018 - 11th Int. Conf. Lang. Resour.
Eval., pp. 3483–3487, 2019.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation
of Gated Recurrent Neural Networks on Sequence Modeling,” pp. 1–9,
2014, [Online]. Available: http://arxiv.org/abs/1412.3555.
[9] M. Han, W. Chen, and A. D. Moges, “Fast image captioning using
LSTM,” Cluster Comput., vol. 22, pp. 6143–6155, May 2019, doi:
10.1007/s10586-018-1885-9.
TABLE V. BLEU SCORES OBTAINED USING THE BANGLALEKHA DATASET
Search Type | Learning Rate | Word Embedding | Experimental Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
Greedy | 0.00001 | Hybrid | Xception+BiLSTM | 0.673 | 0.525 | 0.454 | 0.339
Greedy | 0.00001 | Hybrid | Xception+BiGRU | 0.674 | 0.527 | 0.454 | 0.344
Greedy | 0.00001 | GloVe | Xception+BiLSTM | 0.612 | 0.453 | 0.380 | 0.265
Greedy | 0.00001 | GloVe | Xception+BiGRU | 0.610 | 0.454 | 0.383 | 0.272
Greedy | 0.00001 | fastText | Xception+BiLSTM | 0.618 | 0.463 | 0.389 | 0.277
Greedy | 0.00001 | fastText | Xception+BiGRU | 0.624 | 0.473 | 0.402 | 0.290
Greedy | 0.0001 | Hybrid | InceptionResnetV2+BiLSTM | 0.568 | 0.396 | 0.287 | 0.160
Greedy | 0.0001 | Hybrid | InceptionResnetV2+BiGRU | 0.571 | 0.402 | 0.301 | 0.173
Greedy | 0.0001 | GloVe | InceptionResnetV2+BiLSTM | 0.568 | 0.401 | 0.301 | 0.174
Greedy | 0.0001 | GloVe | InceptionResnetV2+BiGRU | 0.570 | 0.403 | 0.303 | 0.176
Greedy | 0.0001 | fastText | InceptionResnetV2+BiLSTM | 0.553 | 0.390 | 0.291 | 0.169
Greedy | 0.0001 | fastText | InceptionResnetV2+BiGRU | 0.567 | 0.398 | 0.300 | 0.171
Beam=3 | 0.00001 | Hybrid | Xception+BiLSTM | 0.434 | 0.344 | 0.303 | 0.234
Beam=3 | 0.00001 | Hybrid | Xception+BiGRU | 0.411 | 0.324 | 0.286 | 0.221
Beam=3 | 0.00001 | GloVe | Xception+BiLSTM | 0.383 | 0.285 | 0.245 | 0.176
Beam=3 | 0.00001 | GloVe | Xception+BiGRU | 0.401 | 0.302 | 0.263 | 0.196
Beam=3 | 0.00001 | fastText | Xception+BiLSTM | 0.419 | 0.320 | 0.283 | 0.214
Beam=3 | 0.00001 | fastText | Xception+BiGRU | 0.434 | 0.329 | 0.293 | 0.221
Beam=5 | 0.00001 | Hybrid | Xception+BiLSTM | 0.420 | 0.335 | 0.297 | 0.232
Beam=5 | 0.00001 | Hybrid | Xception+BiGRU | 0.399 | 0.316 | 0.280 | 0.219
Beam=5 | 0.00001 | GloVe | Xception+BiLSTM | 0.368 | 0.273 | 0.234 | 0.170
Beam=5 | 0.00001 | GloVe | Xception+BiGRU | 0.385 | 0.292 | 0.256 | 0.194
Beam=5 | 0.00001 | fastText | Xception+BiLSTM | 0.422 | 0.324 | 0.288 | 0.219
Beam=5 | 0.00001 | fastText | Xception+BiGRU | 0.429 | 0.326 | 0.291 | 0.222
TABLE VI. A BRIEF COMPARISON OF BLEU SCORES FOR EXISTING MODELS AND OUR PROPOSED HYBRID MODEL.
Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
BanglaLekha | VGG-16+LSTM [37] | 66.7 | 43.6 | 31.5 | 23.8
BanglaLekha | Xception+BiGRU (our hybrid model) | 0.674 | 0.527 | 0.454 | 0.344
Flickr8k (4000 images) | Merge Bengali (Inception+LSTM) [5] | 0.62 | 0.45 | 0.33 | 0.22
Flickr4k-BN | Our hybrid model (InceptionResnetV2+BiLSTM) | 0.661 | 0.508 | 0.382 | 0.229
[10] S. Hochreiter and J. Schmidhuber, “Long Short-Term Mem-
ory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997, doi:
10.1162/neco.1997.9.8.1735.
[11] K. Xu, H. Wang, and P. Tang, “Image captioning with deep LSTM based on sequential residual,” pp. 361–366, 2017.
[12] V. Jindal, “Generating Image Captions in Arabic Using Root-Word
Based Recurrent Neural Networks and Deep Neural Networks.” Avail-
able: www.aaai.org.
[13] A. A. Nugraha, A. Arifianto, and Suyanto, “Generating image descrip-
tion on Indonesian language using convolutional neural network and
gated recurrent unit,” 2019 7th Int. Conf. Inf. Commun. Technol. ICoICT
2019, pp. 1–6, 2019, doi: 10.1109/ICoICT.2019.8835370.
[14] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimiza-
tion,” 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc.,
pp. 1–15, 2015.
[15] W. Lan, X. Li, and J. Dong, “Fluency-guided cross-lingual image cap-
tioning,” MM 2017 - Proc. 2017 ACM Multimed. Conf., pp. 1549–1557,
2017, doi: 10.1145/3123266.3123366.
[16] X. Li, W. Lan, J. Dong, and H. Liu, “Adding Chinese captions to
images,” ICMR 2016 - Proc. 2016 ACM Int. Conf. Multimed. Retr.,
pp. 271–275, 2016, doi: 10.1145/2911996.2912049.
[17] C. Liu, F. Sun, and C. Wang, “MMT: A multimodal translator for image
captioning,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes
Artif. Intell. Lect. Notes Bioinformatics), vol. 10614 LNCS, p. 784, 2017.
[18] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Explain Images
with Multimodal Recurrent Neural Networks,” pp. 1–9, 2014, [Online].
Available: http://arxiv.org/abs/1410.1090.
[19] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016. Available: http://openaccess.thecvf.com/.
[20] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. 40th Annu. Meet. Assoc. Comput. Linguist., 2002, doi: 10.3115/1073083.1073135.
[21] J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional Image
Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., pp. 5561–5570, 2018, doi: 10.1109/CVPR.2018.00583.
[22] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors
for word representation,” EMNLP 2014 - 2014 Conf. Empir. Methods
Nat. Lang. Process. Proc. Conf., no. June, pp. 1532–1543, 2014, doi:
10.3115/v1/d14-1162.
[23] M. Rahman, N. Mohammed, N. Mansoor, and S. Momen, “Chittron:
An Automatic Bangla Image Captioning System,” Procedia Comput. Sci.,
vol. 154, pp. 636–642, 2018, doi: 10.1016/j.procs.2019.06.100.
[24] S. Liu, L. Bai, Y. Hu, and H. Wang, “Image Captioning Based on Deep
Neural Networks,” MATEC Web Conf., vol. 232, pp. 1–7, 2018, doi:
10.1051/matecconf/201823201052.
Fig. 10. Illustration of Captions Generated by Best Performing Hybrid Architecture using BanglaLekha Dataset.
[25] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention.” Available: http://proceedings.mlr.press/v37/xuc15.
[26] S. R. Laskar, R. P. Singh, P. Pakray, and S. Bandyopadhyay, “English
to Hindi Multi-modal Neural Machine Translation and Hindi Image
Captioning,” pp. 62–67, 2019, doi: 10.18653/v1/d19-5205.
[27] R. Subash, R. Jebakumar, Y. Kamdar, and N. Bhatt, “Automatic image
captioning using convolution neural networks and LSTM,” J. Phys. Conf.
Ser., vol. 1362, no. 1, 2019, doi: 10.1088/1742- 6596/1362/1/012096.
[28] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-ResNet and the impact of residual connections on learning,”
31st AAAI Conf. Artif. Intell. AAAI 2017, pp. 4278–4284, 2017.
[29] A. Jaffe, “Generating Image Descriptions using Multilingual Data,” pp.
458–464, 2018, doi: 10.18653/v1/w17-4750.
[30] C. Wang, H. Yang, and C. Meinel, “Image Captioning with Deep
Bidirectional LSTMs and Multi-Task Learning,” ACM Trans. Multimed.
Comput. Commun. Appl., vol. 14, no. 2s, 2018, doi: 10.1145/3115432.
[31] Y. Xian and Y. Tian, “Self-Guiding Multimodal LSTM - When We
Do Not Have a Perfect Training Dataset for Image Captioning,” IEEE
Trans. Image Process., vol. 28, no. 11, pp. 5241–5252, 2019, doi:
10.1109/TIP.2019.2917229.
[32] P. Blandfort, T. Karayil, D. Borth, and A. Dengel, “Image captioning
in the wild: How people caption images on flickr,” MUSA2 2017 - Proc.
Work. Multimodal Underst. Soc. Affect. Subj. Attrib. co-located with
MM 2017, pp. 21–29, 2017, doi: 10.1145/3132515.3132522.
[33] L. Fei-Fei, J. Deng, and K. Li, “ImageNet: Constructing a large-
scale image database,” J. Vis., vol. 9, no. 8, pp. 1037–1037, 2010, doi:
10.1167/9.8.1037.
[34] M. Tanti, A. Gatt, and K. Camilleri, “What is the Role of Recurrent
Neural Networks (RNNs) in an Image Caption Generator?,” pp. 51–60,
2018, doi: 10.18653/v1/w17-3506.
[35] N. Srivastava, G. Hinton, A. Krizhevsky, and R. Salakhutdinov,
“Dropout: A Simple Way to Prevent Neural Networks from Overfitting,”
2014. doi: 10.5555/2627435.2670313.
[36] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines.” Available: https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf.
[37] A. H. Kamal, M. A. Jishan and N. Mansoor, ”TextMage: The Automated
Bangla Caption Generator Based On Deep Learning,” 2020 International
Conference on Decision Aid Sciences and Application (DASA), Sakheer,
Bahrain, 2020, pp. 822-826, doi: 10.1109/DASA51403.2020.9317108.
[38] F. Chollet, “Xception: Deep learning with depthwise separable con-
volutions,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recog-
nition, CVPR 2017, vol. 2017-January, pp. 1800–1807, 2017, doi:
10.1109/CVPR.2017.195.
www.ijacsa.thesai.org 707 |Page
... Moreover, it needed further retention and it was expensive to compute. The Hybridized Deep Learning (DL) approach was established by Humaira et al. (2021) for captioning the image which involved two embedding layers. Here the overfitting was lessened and appropriate sentences with consequent words were generated by this model. ...
... However, it was still a challenge to syndicate the extracted data from images and to decrease the special effects of unusable information. • In Humaira et al. (2021), the Hybridized DL approach performed better for the datasets of BanglaLekha and Flickr4k-Bn datasets. However, it was challenging to authenticate the information in which image captioning consumes the language in Bengali. ...
... The comparative methods, such as Mask RCNN+LSTM in Thangavel et al. (2023), Hybridized DL in Humaira et al. (2021), DL method in Chang et al. (2021), Encoder-Decoder deep architecture in Al-Malla, Jafar, and Ghneim (2022), and Deep CNN_ACOA without feature extraction are used for evaluation. ...
Article
Full-text available
The task of providing a natural language description of graphical information of the image is known as image captioning. As a result, it needs an algorithm to create a series of output words and understand the relations between textual and visual elements. The main goal of this research is to caption the image by extracting the features and detecting the object from the image. Here, the object is detected by employing Deep Embedding Clustering. The features from the input image are extracted such as Local Vector Pattern (LVP), Spider Local Image Features, and some statistical features like mean, variance, standard deviation, kurtosis, and skewness. The extracted features and detected objects are given to image captioning which is exploited by Deep Convolutional Neural Network (Deep CNN). The Deep CNN is trained by using the proposed Adaptive Coati Optimization Algorithm (ACOA). The proposed ACOA is attained by the integration of the Adaptive concept and Coati Optimization Algorithm (COA) and thus the image is captioned. The proposed ACOA achieved maximum values in the training data such as 90.5% of precision, 89.9% of recall 89.1% of F1-Score, 90.4% of accuracy, 90.4% of BELU, and 90.9% of ROUGE.
... In this task, the encoder is used to extract the image feature to obtain feature vectors, then pass it through an RNN to generate the language description. Previously all researchers utilized this CNN- RNN Subash et al. [2019], , Humaira et al. [2021] approach to generate captions from images. However, this method has a drawback that is due to the structure of the LSTM or other RNNs, the current output depends on the hidden state at the previous moment. ...
... On the other hand, Deb et al. Deb et al. [2019] Humaira et al. [2021] proposed a hybridized Encoder-Decoder approach where two word embeddings fastText and GloVe were concatenated. They also utilized beam search and greedy search to compute the BLEU scores. ...
Preprint
Full-text available
Image captioning using Encoder-Decoder based approach where CNN is used as the Encoder and sequence generator like RNN as Decoder has proven to be very effective. However, this method has a drawback that is sequence needs to be processed in order. To overcome this drawback some researcher has utilized the Transformer model to generate captions from images using English datasets. However, none of them generated captions in Bengali using the transformer model. As a result, we utilized three different Bengali datasets to generate Bengali captions from images using the Transformer model. Additionally, we compared the performance of the transformer-based model with a visual attention-based Encoder-Decoder approach. Finally, we compared the result of the transformer-based model with other models that employed different Bengali image captioning datasets.
... Humaira et al. [35] proposed a hybrid image captioning model using bidirectional long short-term memory (BiLSTM) and bidirectional gated recurrent unit (BiGRU). This study evaluated the hybrid method on two datasets (Flickr8k with 4000 and 8000 images), where the captions were translated into Bengali using Google Translate. ...
... The attention model may, at any moment, determine the correct location in the image for image caption creation and produce a description that fits the observation content [38]. This work employed context-aware attention to capturing more precise sub-region image features instead of conventional RNN-based models [9,35], which only use the overall image features. The attention approach can adjust its gaze on corresponding areas of images when generating words related to various objects in an image, improving performance. ...
Article
Full-text available
Image captioning, the process of generating natural language descriptions based on image content, has garnered attention in AI research for its implications in scene understanding and human-computer interaction. While much prior research has focused on caption generation for English, addressing low-resource languages like Bengali presents challenges, particularly in producing coherent captions linking visual objects with corresponding words. This paper proposes a context-aware attention mechanism over semantic attention to accurately diagnose objects for image captioning in Bengali. The proposed architecture consists of an encoder and a decoder block. We chose ResNet-50 over the other pre-trained models for encoding the image features due to its ability to solve the vanishing gradient problem and recognize complex object features. For decoding generated captions, a bidirectional Gated Recurrent Unit (GRU) architecture combined with an attention mechanism captures contextual dependencies in both directions, resulting in more accurate captions. The paper also highlights the challenge of transferring knowledge between domains, especially with culturally specific images. Evaluation of three Bengali benchmark datasets, namely BAN-Cap, BanglaLekhaImageCaption, and Bornon, demonstrates significant performance improvement in METEOR score over existing methods by approximately 30%, 18%, and 45%, respectively. The proposed context-aware, attention-based image captioning system significantly outperforms current state-of-the-art models in Bengali caption generation despite limitations in reference captions on certain datasets.
... They create a new Bengali dataset called BNATURE, containing 8,000 images with five captions each. [7] combined fastText and GloVe word embeddings and created a hybridized encoder-decoder based approach. ...
Preprint
Full-text available
An exemplary caption not only describes what is happening in a particular image but also denotes intricate traditional objects in the image by their local representative terms through which the native speakers can recognize the object in question. A caption that fails to accomplish the latter is not effective in conveying proper utility. To ensure caption locality, we aim to explore the potential of Large Language Models (LLMs) in Bengali image captioning, which have lately shown promising results in English language caption generation. As a first for the Bengali language, we utilized CLIP (Contrastive Language-Image Pre-training) encodings as a prefix to the captions by employing a mapping network, followed by fine-tuning BanglaGPT, a Bengali pre-trained large language model to generate the image captions. Furthermore, we explored vision transformer-based encoders (ViT, Swin) with BanglaGPT as the decoder. The best BanglaGPT-based model outperformed the current benchmark results, with BLEU-4, METEOR, and CIDEr scores of 54.3, 39.2, and 95.9 on the BanglaLekha dataset and 67.4, 36.6, and 76.9 on the BNature dataset.
... Researchers in the field of image captioning have investigated a variety of approaches to the problem of how to improve both their accuracy and their rate of production [9]. Recent developments in the field include the incorporation of semantic concepts for improved contextual understanding and attention approaches that allow the model to focus on specific portions of the image [10]. The field is continually evolving, and some of the more recent achievements include these incorporations [11]. ...
Article
Full-text available
Image captioning is a challenging task involving generating descriptive sentences to describe images. The application of semantic concepts to automatically annotate images has made significant progress. However, the now available frameworks have apparent limitations, particularly in concept detection. Incomplete labelling due to biased annotations, using synonyms in training captions, and the enormous gap between positive and negative thought samples contribute to the problem. Incomplete labelling is a result of biased annotations. The captioning frameworks that are now in use are inadequate and create a barrier to accurate image captioning. Unequal sample occurrences and missing training captions negatively affect the model's potential to develop rich and varied descriptions of images. Inadequate sample occurrences and missing training captions also contribute to insufficient idea generation. To circumvent these limitations, a novel approach has been designed to automatically generate images using Weighted Stacked Generative Adversarial Network (WSGAN). With the help of this boost, the uneven distribution of concepts is intended to be rectified, thereby expanding the breadth of the horizons covered by the training set. The proposed approach utilizes a WSGAN in conjunction with a Gated Recurrent Units (GRU)–based Deep Learning (DL) model and a Visual Attention Mechanism (VAM)–based DL model. The purpose of the GRU-VAM model is to enable the generation of text captions for images. To train the model, combining the MS COCO dataset with a wide variety of original and machine-generated image datasets in numerous permutations is necessary. The WSGAN-generated images correct the imbalance and incompleteness in the training dataset, which boosts the model's ability to capture a wider variety of thoughts. During testing and evaluation, the proposed WSGAN- GRU-VAM demonstrates significant enhancements in image captioning metrics compared to existing models. WSGAN-GRU-VAM is superior to other well-known image captioning algorithms such as EnsCaption, Fast RF-UIC, RAGAN, and SAT-GPT-3 in terms of its performance across various essential parameters. Increase in BLEU (8%), METEOR (7%), CIDEr (9%), and ROUGE-L (6%), on average, reflect the model's capacity to provide image captions with enhanced linguistic accuracy, relevance, and coherence.
... The Flickr8k dataset, which had been translated into Bengali, was used for training and assessment. A 2D CNN called InceptionResNetV2 was used in the study of Humaira et al. [12] to extract features from the images. Both GloVe and fastText were used to embed the captions, and a BiGRU layer was subsequently applied to extract caption features. ...
Conference Paper
Full-text available
Our research focuses on Bangla Image Captioning which involves generating descriptive captions for the images. To address this task, we propose a new approach using the Vision Encoder-Decoder model, consisting of interconnected models for image encoding and text decoding. Previous work in this area has not explored the use of the Vision Encoder-Decoder Model specifically for Bangla Image Captioning. We have conducted several studies using two publicly available Bengali datasets, Bornon and BanCap, and merged them to create a comprehensive dataset to assess the performance of our model. Our proposed model outperforms recent developments in Bengali image captioning, delivering exceptional results in both quantitative and qualitative analyses.
... The BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores for the model are 66 percentage accurate. The author [90] uses the Flickr8k and BanglaLekha datasets to propose a common methodology for creating Bengali image captions. Using several architectures and a hybrid method that combined Incep-tionResnetV2 and Bidirectional Long Short-Term Memory, the model was compared to other Bengali captions. ...
Preprint
In image captioning, we generate visual descriptions from an image. Image captioning requires identifying the key entities, features, and associations in an image, as well as generating captions that are syntactically and semantically correct. The process therefore combines computer vision and natural language processing. Over the past few decades, substantial effort has been devoted to generating captions for images. In this survey article, we present an extensive survey of image captioning for Indian languages. To summarize recent research, we first briefly review the traditional approaches to image captioning based on templates and retrieval. We then concentrate on deep-learning approaches, classified into encoder-decoder architectures, attention-based approaches, and transformer architectures. Our main focus in this survey is on image captioning techniques for Indian languages such as Hindi, Bengali, and Assamese. After that, we analyze state-of-the-art approaches on the most widely used dataset, MS COCO, with their strengths, limitations, and performance metrics, i.e. BLEU, ROUGE, METEOR, CIDEr, and SPICE. Finally, we discuss open challenges and future directions in the field of image captioning.
Article
In recent years, there has been growing interest among researchers in the field of image captioning, which involves generating one or more descriptions for an image that closely resembles a human-generated description. Most of the existing studies in this area focus on the English language, utilizing CNN and RNN variants as encoder and decoder models, often enhanced by attention mechanisms. Despite Bengali being the fifth most-spoken native language and the seventh most widely spoken language, it has received far less attention in comparison to resource-rich languages like English. This study aims to bridge that gap by introducing a novel approach to image captioning in Bengali. By leveraging state-of-the-art Convolutional Neural Networks such as EfficientNetV2S, ConvNeXtSmall, and InceptionResNetV2 along with an improvised Transformer, the proposed system achieves both computational efficiency and the generation of accurate, contextually relevant captions. Additionally, Bengali text-to-speech synthesis is incorporated into the framework to assist visually impaired Bengali speakers in understanding their environment and visual content more effectively. The model has been evaluated using a chimeric dataset, combining Bengali descriptions from the Ban-Cap dataset with corresponding images from the Flickr 8k dataset. Utilizing EfficientNet, the proposed model attains METEOR, CIDEr, and ROUGE scores of 0.34, 0.30, and 0.40, while BLEU scores for unigram, bigram, trigram, and four-gram matching are 0.66, 0.59, 0.44 and 0.26 respectively. The study demonstrates that the proposed approach produces precise image descriptions, outperforming other state-of-the-art models in generating Bengali descriptions.
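A compressed Keras sketch of the general pattern described here, a pretrained CNN producing region features that a small Transformer-style decoder attends over, is given below; the layer sizes, input resolution, single decoder block, and the causal-mask flag (available in TensorFlow 2.10 and later) are illustrative assumptions rather than the cited system's exact architecture.

# Sketch: EfficientNetV2S region features attended by one Transformer-style decoder block.
# Dimensions and hyperparameters are assumptions, not the cited system's exact settings.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_len, d_model, num_heads = 8000, 35, 256, 4

class TokenAndPositionEmbedding(layers.Layer):
    # Token embedding plus a learned positional embedding for the caption tokens.
    def __init__(self, max_len, vocab_size, d_model):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, d_model)
        self.pos_emb = layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

# Image encoder: EfficientNetV2S feature map flattened into a sequence of region vectors.
cnn = tf.keras.applications.EfficientNetV2S(include_top=False, weights="imagenet")
image_input = layers.Input(shape=(384, 384, 3))
feat = cnn(image_input)                               # (batch, H, W, C)
feat = layers.Reshape((-1, feat.shape[-1]))(feat)     # (batch, H*W, C)
feat = layers.Dense(d_model)(feat)

# Decoder block: masked self-attention over tokens, then cross-attention to image regions.
token_input = layers.Input(shape=(max_len,))
x = TokenAndPositionEmbedding(max_len, vocab_size, d_model)(token_input)
self_attn = layers.MultiHeadAttention(num_heads, key_dim=d_model)(x, x, use_causal_mask=True)
x = layers.LayerNormalization()(x + self_attn)
cross_attn = layers.MultiHeadAttention(num_heads, key_dim=d_model)(x, feat)
x = layers.LayerNormalization()(x + cross_attn)
ffn = layers.Dense(d_model * 2, activation="relu")(x)
x = layers.LayerNormalization()(x + layers.Dense(d_model)(ffn))
logits = layers.Dense(vocab_size)(x)                  # per-position next-word logits

model = tf.keras.Model([image_input, token_input], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))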
Conference Paper
The task of image captioning is a complex process that involves generating textual descriptions for images. Much of the research done in this particular domain, especially using transformer models, has been focused on English language. However, there has been relatively little research dedicated to the context of the Bengali language. This study addresses the lack of research in the context of Bengali language and proposes a novel approach to automatic image captioning that involves a multi-modal, transformer-based, end-to-end model with an encoder-decoder architecture. Our approach utilizes a pre-trained EfficientNet Transformer Network. To evaluate the effectiveness of our approach, we compare our model with a Vision Transformer that utilizes a non-convolutional encoder pre-trained on ImageNet. The two models were tested on the BanglaLekhaImageCaptions dataset and evaluated using BLEU metrics.
Conference Paper
Automatic caption generation from images has evolved into an active research topic that requires Natural Language Processing (NLP) and Computer Vision (CV) to comprehend the image input and represent it in text. This can assist visually impaired people by generating text captions of images to help them understand their surroundings. In this study, we present a Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN) approach that can generate natural language descriptions for an image. A dataset containing 8,000 images and a total of 37,611 captions is used to train our model, and VGG16 is employed to extract features from the images. Finally, performance is evaluated, showing an accuracy of 66% and BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.40, 0.18, 0.11, and 0.03, respectively.
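Since BLEU-1 through BLEU-4 are the scores reported throughout these works, a short NLTK sketch of how such corpus-level scores are typically computed is shown below; the tokenized reference and candidate captions are placeholders.

# Computing corpus-level BLEU-1..BLEU-4 with NLTK; captions here are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each test image has a list of tokenized reference captions and one generated caption.
references = [
    [["a", "dog", "runs", "on", "the", "grass"], ["a", "dog", "is", "running", "outside"]],
]
candidates = [
    ["a", "dog", "runs", "on", "grass"],
]

smooth = SmoothingFunction().method1
weights = {
    "BLEU-1": (1.0, 0, 0, 0),
    "BLEU-2": (0.5, 0.5, 0, 0),
    "BLEU-3": (1/3, 1/3, 1/3, 0),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    score = corpus_bleu(references, candidates, weights=w, smoothing_function=smooth)
    print(f"{name}: {score:.4f}")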
Article
When a landslide happens, it is important to recognize the hazard-affected bodies surrounding the landslide for risk assessment and emergency rescue. To realize this recognition, the spatial relationship between landslides and other geographic objects such as residences, roads and schools needs to be defined. Compared with semantic segmentation and instance segmentation, which can only recognize geographic objects separately, image captioning can provide richer semantic information, including the spatial relationships among these objects. However, traditional image captioning methods based on RNNs have two main shortcomings: errors in the prediction process are often accumulated, and the location of attention is not always accurate, which can lead to misjudgment of risk. To handle these problems, a landslide image interpretation network based on a semantic gate and a bi-temporal long-short term memory network (SG-BiTLSTM) is proposed in this paper. In the SG-BiTLSTM architecture, a U-Net is employed as an encoder to extract features of the images and generate the mask maps of the landslides and other geographic objects. The decoder of this structure consists of two interactive long-short term memory networks (LSTMs) that describe the spatial relationships among these geographic objects so as to further determine the role of the classified geographic objects in identifying the hazard-affected bodies. The purpose of this research is to identify the hazard-affected bodies of the landslide (i.e., buildings and roads) through the SG-BiTLSTM network to provide geographic information support for emergency services. The remote sensing data were taken by the Worldview satellite after the Wenchuan earthquake of 2008. The experimental results demonstrate that the SG-BiTLSTM network shows remarkable improvements in the recognition of landslides and hazard-affected bodies: compared with the traditional LSTM (the baseline model), the BLEU-1 of the SG-BiTLSTM is improved by 5.89%, and the matching rate between the mask maps and the focus matrix of the attention is improved by 42.81%. In conclusion, the SG-BiTLSTM network can recognize landslides and the hazard-affected bodies simultaneously to provide basic geographic information services for emergency decision-making.
Article
Computer vision has become ubiquitous in our society, with applications in several fields. Given a collection of images with their captions, the goal is to build a predictive model that produces natural, creative, and interesting captions for unseen images. A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene. The aim is to simplify the problem of generating captions for images by building a model that provides accurate captions, which can then be used in other useful applications and use cases. However, this remarkable ability has proven to be an elusive task for visual recognition models. Most previous research in scene recognition has focused on labelling images with a fixed set of visual categories, and great progress has been achieved in these tasks. For a query image, earlier methods retrieve relevant candidate natural language phrases by visually comparing the query image with database images. However, while closed vocabularies of visual concepts constitute a convenient modelling assumption, they are vastly restrictive compared with the enormous range of rich descriptions a human can compose, and such approaches impose a limit on the variety of captions generated. The model should be free of assumptions about specific predetermined templates, rules, or categories and instead rely on learning to generate sentences from the training data. The proposed model uses Convolutional Neural Networks to extract features of the image whose caption is to be generated, and then forms suitable sentences and produces captions using a probabilistic approach and natural language processing techniques.
Article
Automatic image caption generation aims to produce an accurate description of an image in natural language automatically. However, Bangla, the fifth most widely spoken language in the world, is lagging considerably in research and development in this domain. Besides, while there are many established datasets related to image annotation in English, no such resource exists for Bangla yet. Hence, this paper outlines the development of “Chittron”, an automatic image captioning system in Bangla. To address the dataset availability issue, a collection of 16,000 Bangladeshi contextual images has been accumulated and manually annotated in Bangla. This dataset is then used to train a model that integrates a pre-trained VGG16 image embedding model with stacked LSTM layers. The model is trained to predict the caption one word at a time when the input is an image. The results show that the model has successfully learned a working language model and can generate captions of images quite accurately in many cases. The results are evaluated mainly qualitatively, although BLEU scores are also reported. It is expected that better results can be obtained with a bigger and more varied dataset.
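Predicting the caption one word at a time, as described above, corresponds to the standard greedy decoding loop; a minimal sketch is given below, assuming a trained Keras merge-style model, a fitted tokenizer with startseq/endseq markers, and precomputed image features (all names are placeholders, not the cited system's code).

# Greedy word-by-word caption generation; model, tokenizer, and photo_features
# are assumed to exist already (names are placeholders).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_len=35):
    """Generate a caption by repeatedly predicting the most probable next word."""
    caption = "startseq"
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo_features, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()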
Article
With the development of deep learning, the combination of computer vision and natural language processing has attracted great attention in the past few years. Image captioning is representative of this field: it makes the computer learn to use one or more sentences to describe the visual content of an image. Generating meaningful descriptions of high-level image semantics requires not only recognition of the objects and the scene, but also the ability to analyze their states, attributes, and the relationships among them. Though image captioning is a complicated and difficult task, many researchers have achieved significant improvements. In this paper, we mainly describe three image captioning approaches based on deep neural networks: CNN-RNN based, CNN-CNN based, and reinforcement-based frameworks. We then introduce representative work for each of these approaches, describe the evaluation metrics, and summarize the benefits and major challenges.
Article
Generating a novel and descriptive caption of an image is drawing increasing interests in computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information at high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting in training deep models. To understand how our models “translate” image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K datasets. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also prove that multi-task learning is beneficial to increase model generality and gain performance. We also demonstrate the performance of transfer learning of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.