Conference Paper

Abstract

Attention-based approaches have proven to be effective in image captioning. Attention can be applied either to the text, which is known as semantic attention, or to the image, which is known as spatial attention. We chose to implement the latter, since the main problem in image captioning is failing to detect the objects in an image properly. In this work, we developed an approach that extracts features from images using two different convolutional neural networks (CNNs) and combines these features with an attention model in order to generate captions with a recurrent neural network (RNN). We adopted Xception and InceptionV3 as our CNNs and a GRU as our RNN. We evaluated the proposed model on the Flickr8k dataset translated into Bengali, so that captions can be generated in Bengali using visual attention.
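
The abstract describes a pipeline of CNN feature extraction, spatial (visual) attention, and a GRU decoder. Below is a minimal sketch of one decoding step in that spirit, written with Keras layers; layer sizes, names, and the exact wiring are illustrative assumptions, not the authors' released code.

    # Sketch of one visual-attention decoding step (assumed sizes, not the paper's code).
    import tensorflow as tf

    units, vocab_size, embed_dim = 512, 5000, 256

    attn_w1 = tf.keras.layers.Dense(units)   # projects CNN feature-map locations
    attn_w2 = tf.keras.layers.Dense(units)   # projects the GRU hidden state
    attn_v  = tf.keras.layers.Dense(1)       # scores each spatial location
    embed   = tf.keras.layers.Embedding(vocab_size, embed_dim)
    gru     = tf.keras.layers.GRU(units, return_state=True)
    out     = tf.keras.layers.Dense(vocab_size)

    def decode_step(features, prev_word, hidden):
        # features: (batch, locations, feat_dim) from Xception/InceptionV3
        # prev_word: (batch,) previous token ids; hidden: (batch, units)
        score = attn_v(tf.nn.tanh(attn_w1(features) + attn_w2(hidden)[:, None, :]))
        alpha = tf.nn.softmax(score, axis=1)               # attention over locations
        context = tf.reduce_sum(alpha * features, axis=1)  # weighted image feature
        x = tf.concat([context, embed(prev_word)], axis=-1)[:, None, :]
        output, hidden = gru(x, initial_state=hidden)
        return out(output), hidden, alpha                  # logits for the next word

Repeating this step, feeding back the predicted word and hidden state, yields the caption one token at a time.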

... The approach to captioning images in Bengali using the transformer model is illustrated in Fig. 1. Furthermore, we compare its performance with the visual attention-based approach to captioning images in Bengali proposed by Ami et al. [2020]. This visual attention-based approach is shown in Fig. 4. Bengali is the 7th most used language worldwide, and most native speakers in parts of India and Bangladesh do not know English. ...
... They also utilized beam search and greedy search to compute the BLEU scores. Additionally, Ami et al. [2020] employed visual attention with the Encoder-Decoder approach to caption images in Bengali. They added attention weights to image features and passed them to the GRU along with word vectors to generate captions. ...
... We utilized the transformer model and the attention-based model proposed by Ami et al. [2020] to caption images in Bengali. The transformer model does not process the sequence in order, whereas the attention-based model does. ...
Preprint
Full-text available
Image captioning using an Encoder-Decoder based approach, where a CNN is used as the encoder and a sequence generator such as an RNN as the decoder, has proven to be very effective. However, this method has the drawback that the sequence needs to be processed in order. To overcome this drawback, some researchers have utilized the Transformer model to generate captions from images using English datasets. However, none of them generated captions in Bengali using the transformer model. As a result, we utilized three different Bengali datasets to generate Bengali captions from images using the Transformer model. Additionally, we compared the performance of the transformer-based model with a visual attention-based Encoder-Decoder approach. Finally, we compared the results of the transformer-based model with other models that employed different Bengali image captioning datasets.
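
For contrast with the sequential RNN decoder, here is a rough sketch of the transformer-decoder idea this preprint refers to: caption tokens attend to one another (causally) and to image patch features in parallel rather than step by step. Dimensions are assumptions, use_causal_mask requires a reasonably recent Keras, and this is not the cited implementation.

    import tensorflow as tf

    d_model, heads = 256, 8
    self_attn  = tf.keras.layers.MultiHeadAttention(heads, d_model // heads)
    cross_attn = tf.keras.layers.MultiHeadAttention(heads, d_model // heads)
    ffn  = tf.keras.Sequential([tf.keras.layers.Dense(4 * d_model, activation="relu"),
                                tf.keras.layers.Dense(d_model)])
    norm = [tf.keras.layers.LayerNormalization() for _ in range(3)]

    def decoder_layer(tokens, image_feats):
        # tokens: (batch, caption_len, d_model); image_feats: (batch, patches, d_model)
        x = norm[0](tokens + self_attn(tokens, tokens, use_causal_mask=True))
        x = norm[1](x + cross_attn(x, image_feats))   # every word attends to the image
        return norm[2](x + ffn(x))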
... Although recent studies have focused on Bengali image captioning, the majority of them use a CNN-RNN-based architecture [9, 12-14], where they employed InceptionV3, VGG16 and Xception for image feature extraction. Only a few consider attention-based models [15, 16], whose performance needs to be more reliable. Moreover, cross-domain transfer approaches for image captioning tasks in Bengali need to be researched. ...
... However, visual and textual attention could have been more focused. Ami et al. [15] introduced the attention mechanism for image captioning using the Flickr8k dataset. They utilized visual attention on the image, also known as spatial attention, and a GRU as the RNN to generate captions. ...
... Second, instead of the conventional encoder-decoder architecture used in past research, this work uses a context-aware attention mechanism to selectively focus on visual regions and generate meaningful descriptions. Third, most past studies exclusively utilize an LSTM or GRU for decoding, which cannot capture information from future and past contexts [13, 34, 15, 8]. To resolve this issue, this work adopts a bidirectional approach to effectively capture contextual dependencies from both directions. ...
Article
Full-text available
Image captioning, the process of generating natural language descriptions based on image content, has garnered attention in AI research for its implications in scene understanding and human-computer interaction. While much prior research has focused on caption generation for English, addressing low-resource languages like Bengali presents challenges, particularly in producing coherent captions linking visual objects with corresponding words. This paper proposes a context-aware attention mechanism over semantic attention to accurately diagnose objects for image captioning in Bengali. The proposed architecture consists of an encoder and a decoder block. We chose ResNet-50 over the other pre-trained models for encoding the image features due to its ability to solve the vanishing gradient problem and recognize complex object features. For decoding generated captions, a bidirectional Gated Recurrent Unit (GRU) architecture combined with an attention mechanism captures contextual dependencies in both directions, resulting in more accurate captions. The paper also highlights the challenge of transferring knowledge between domains, especially with culturally specific images. Evaluation of three Bengali benchmark datasets, namely BAN-Cap, BanglaLekhaImageCaption, and Bornon, demonstrates significant performance improvement in METEOR score over existing methods by approximately 30%, 18%, and 45%, respectively. The proposed context-aware, attention-based image captioning system significantly outperforms current state-of-the-art models in Bengali caption generation despite limitations in reference captions on certain datasets.
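
A hedged sketch of the kind of encoder-decoder this abstract describes (ResNet-50 image regions, a bidirectional GRU over the teacher-forced caption, and dot-product attention from words to regions) is given below; the layer sizes, vocabulary size, and fusion choice are assumptions for illustration, not the authors' architecture.

    import tensorflow as tf

    cnn = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3))
    image_in   = tf.keras.Input(shape=(224, 224, 3))
    caption_in = tf.keras.Input(shape=(None,), dtype="int32")

    feats = tf.keras.layers.Reshape((49, 2048))(cnn(image_in))   # 7x7 grid -> 49 regions
    feats = tf.keras.layers.Dense(512)(feats)

    emb = tf.keras.layers.Embedding(5000, 256)(caption_in)
    seq = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(256, return_sequences=True))(emb)    # context from both directions
    ctx = tf.keras.layers.Attention()([seq, feats])              # each word attends to regions
    logits = tf.keras.layers.Dense(5000)(tf.keras.layers.Concatenate()([seq, ctx]))
    model = tf.keras.Model([image_in, caption_in], logits)       # per-step next-word scores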
... The visual attention-based approach [13] is described in three parts: 1. image features extracted by a CNN [3]; 2. an attention mechanism for obtaining weighted image features [14]; 3. a GRU [13] for generating the caption. A block diagram of this model is shown in Fig. 2. ...
... Visual attention model for image captioning [13]. ...
Article
Full-text available
Indeed, image captioning has become a crucial aspect of contemporary artificial intelligence because it tackles two crucial parts of the AI field: Computer Vision and Natural Language Processing. Currently, Bangla stands as the 7th most widely spoken language globally, and Bangla image captioning has therefore gained recognition as a significant research area. Many established datasets exist in English, but there are no standard datasets in Bangla. For our research, we used the BAN-Cap dataset, which contains 8091 images with 40455 sentences. Many effective encoder-decoder and visual attention approaches have been used for image captioning, where a CNN is utilized as the encoder and an RNN as the decoder. In this study, however, we propose a transformer-based image captioning model with different pre-trained image feature extraction models such as ResNet50, InceptionV3, and VGG16 on the BAN-Cap dataset, evaluate its efficiency and accuracy using several performance metrics such as BLEU, METEOR, ROUGE, and CIDEr, and also identify the drawbacks of other models.
... In Table 2, the CNN-Merge (Faiyaz Khan et al., 2021) model achieved the lowest scores in all evaluation metrics. The Visual-Attention (Ami et al., 2020) model improves the performance by extracting only the important features from an image during caption prediction. However, despite having a relatively simple architecture, the Transformer (Shah et al., 2021) model outperforms the Visual-Attention and CNN-Merge models by utilizing multi-head attention and the better context-awareness ability of the transformer. ...
... We trained the following models on our dataset: CNN-Merge: Faiyaz Khan et al. (2021) proposed this model following the merge architecture of Tanti et al. (2017). Visual-Attention: Proposed by Ami et al. (2020), the visual attention model is very similar to the one introduced in Xu et al. (2015). Transformer: This model, proposed by Shah et al. (2021), is also based on the encoder-decoder architecture. ...
Preprint
Full-text available
As computers have become efficient at understanding visual information and transforming it into a written representation, research interest in tasks like automatic image captioning has seen a significant leap over the last few years. While most of the research attention is given to the English language in a monolingual setting, resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets. Addressing this issue, we present a new dataset BAN-Cap following the widely used Flickr8k dataset, where we collect Bangla captions of the images provided by qualified annotators. Our dataset represents a wider variety of image caption styles annotated by trained people from different backgrounds. We present a quantitative and qualitative analysis of the dataset and the baseline evaluation of the recent models in Bangla image captioning. We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning. We also present this dataset's multipurpose nature, especially on machine translation for Bangla-English and English-Bangla. This dataset and all the models will be useful for further research.
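
The Contextualized Word Replacement (CWR) augmentation mentioned above can be pictured with a masked language model: mask a word in a caption and substitute the model's in-context prediction. The snippet below is only an illustration of that idea, not the BAN-Cap authors' tool, and the Bangla model name is an assumption; any fill-mask model would do.

    import random
    from transformers import pipeline

    fill = pipeline("fill-mask", model="sagorsarker/bangla-bert-base")  # assumed model id

    def cwr_augment(caption: str) -> str:
        words = caption.split()
        i = random.randrange(len(words))                 # pick one word to replace
        masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
        best = fill(masked)[0]["token_str"]              # most likely in-context substitute
        words[i] = best
        return " ".join(words)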
... Meanwhile, in 2020, Ami et al. [27] trained the first local attention-based model using the Bengali-translated Flickr8k dataset. They built the model with Xception and InceptionV3 encoders and a GRU decoder, and evaluated its captions using BLEU metrics. ...
... In the domain of non-English language image captioning, significant strides have been made, particularly in languages such as Hindi and Bengali, which share similarities with Nepali. In Bengali language research, notable studies by S. Paul et al. [9] have explored techniques utilizing convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Similarly, in the Hindi language, S.K. Mishra et al. [11] have introduced a novel image captioning model tailored specifically for Hindi, leveraging transformer networks within an encoder-decoder architecture. This approach, which incorporates a Hindi dataset translated from MSCOCO [12] and refined by human annotators, showcases the potential of transformer-based models in multilingual image captioning tasks. ...
Article
Full-text available
The advent of deep neural networks has made the image captioning task more feasible. It is a method of generating text by analyzing the different parts of an image. A lot of work related to this has been done in the English language, while very little effort has been put into this task in other languages, particularly the Nepali language. It is an even harder task to carry out research in the Nepali language because of its difficult grammatical structure and vast language domain. Further, the little work done in the Nepali language generates only a single sentence, but the proposed work emphasizes generating paragraph-long coherent sentences. The Stanford human genome dataset, which was translated into the Nepali language using the Google Translate API, is used in the proposed work. Along with this, a manually curated dataset consisting of 800 images of the cultural sites of Nepal, along with their Nepali captions, was also used. These two datasets were combined to train the deep learning model. The task involved working with a transformer architecture. In this setup, image features were extracted using a pretrained InceptionV3 model. These features were then fed into the encoder segment after positional encoding. Simultaneously, embedded tokens from the captions were fed into the decoder segment. The resulting captions were assessed using BLEU scores, revealing higher accuracy and BLEU scores on the test images.
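
The feature-extraction step this abstract describes (a pretrained InceptionV3 without its classifier head turning each image into a grid of vectors for the transformer encoder) can be sketched as follows; the 299x299 input and 8x8x2048 output are standard for InceptionV3, but the surrounding function and names are illustrative.

    import tensorflow as tf

    extractor = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

    def image_features(path):
        img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        img = tf.image.resize(img, (299, 299))
        img = tf.keras.applications.inception_v3.preprocess_input(img)
        feats = extractor(img[None, ...])           # (1, 8, 8, 2048) spatial grid
        return tf.reshape(feats, (1, 64, 2048))     # 64 positions for the encoder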
... For training, the BanglaLekhaImageCaptions dataset was adjusted and comprised 9,154 images with two captions each. An attention-based methodology was used by Ami et al. [11] in their work on Bengali image captioning. They employed a Recurrent Neural Network (RNN) for the decoder component and a Convolutional Neural Network (CNN) for the encoder. ...
Conference Paper
Full-text available
Our research focuses on Bangla Image Captioning which involves generating descriptive captions for the images. To address this task, we propose a new approach using the Vision Encoder-Decoder model, consisting of interconnected models for image encoding and text decoding. Previous work in this area has not explored the use of the Vision Encoder-Decoder Model specifically for Bangla Image Captioning. We have conducted several studies using two publicly available Bengali datasets, Bornon and BanCap, and merged them to create a comprehensive dataset to assess the performance of our model. Our proposed model outperforms recent developments in Bengali image captioning, delivering exceptional results in both quantitative and qualitative analyses.
Article
In recent years, there has been growing interest among researchers in the field of image captioning, which involves generating one or more descriptions for an image that closely resembles a human-generated description. Most of the existing studies in this area focus on the English language, utilizing CNN and RNN variants as encoder and decoder models, often enhanced by attention mechanisms. Despite Bengali being the fifth most-spoken native language and the seventh most widely spoken language, it has received far less attention in comparison to resource-rich languages like English. This study aims to bridge that gap by introducing a novel approach to image captioning in Bengali. By leveraging state-of-the-art Convolutional Neural Networks such as EfficientNetV2S, ConvNeXtSmall, and InceptionResNetV2 along with an improvised Transformer, the proposed system achieves both computational efficiency and the generation of accurate, contextually relevant captions. Additionally, Bengali text-to-speech synthesis is incorporated into the framework to assist visually impaired Bengali speakers in understanding their environment and visual content more effectively. The model has been evaluated using a chimeric dataset, combining Bengali descriptions from the Ban-Cap dataset with corresponding images from the Flickr 8k dataset. Utilizing EfficientNet, the proposed model attains METEOR, CIDEr, and ROUGE scores of 0.34, 0.30, and 0.40, while BLEU scores for unigram, bigram, trigram, and four-gram matching are 0.66, 0.59, 0.44 and 0.26 respectively. The study demonstrates that the proposed approach produces precise image descriptions, outperforming other state-of-the-art models in generating Bengali descriptions.
Article
Investigating the synergy between image captioning and audio integration, our study employs an advanced Encoder-Decoder framework. Integrating visual and auditory features, the model enhances caption precision. Through extensive experimentation, our results reveal a notable improvement in contextual understanding compared to conventional image captioning models. This research underscores the potential of multimodal approaches for enriching multimedia applications, particularly in contexts where comprehensive content comprehension and accessibility are paramount. Key Words: Deep Learning, Natural Language Processing, Multi Modal Attention Mechanism, Image Captioning.
Conference Paper
Full-text available
Many image captioning tasks have been carried out in recent years, the majority of the work being for the English language. A few research works have also been carried out for Hindi and Bengali languages in the domain. Unfortunately, not much research emphasis seems to be given to the Nepali language in this direction. Furthermore, the datasets are also not publicly available in the Nepali language. The aim of this research is to prepare a dataset with Nepali captions and develop a deep learning model based on the Convolutional Neural Network (CNN) and Transformer combined model to automatically generate image captions in the Nepali language. The dataset for this work is prepared by applying different data pre-processing techniques on the Flickr8k dataset. The preprocessed data is then passed to the CNN-Transformer model to generate image captions. ResNet-101 and EfficientNetB0 are the two pre-trained CNN models employed for this work. We have achieved some promising results which can be further improved in the future.
Article
Full-text available
Automatic image caption generation aims to produce an accurate description of an image in natural language automatically. However, Bangla, the fifth most widely spoken language in the world, is lagging considerably in research and development in this domain. Besides, while there are many established data sets related to image annotation in English, no such resource exists for Bangla yet. Hence, this paper outlines the development of “Chittron”, an automatic image captioning system in Bangla. To address the data set availability issue, a collection of 16,000 Bangladeshi contextual images has been accumulated and manually annotated in Bangla. This data set is then used to train a model that integrates a pre-trained VGG16 image embedding model with stacked LSTM layers. The model is trained to predict the caption, one word at a time, when the input is an image. The results show that the model has successfully been able to learn a working language model and to generate captions of images quite accurately in many cases. The results are evaluated mainly qualitatively. However, BLEU scores are also reported. It is expected that a better result can be obtained with a bigger and more varied data set.
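
The "one word at a time" generation described here is essentially greedy decoding with a language model conditioned on the image embedding. A hedged sketch follows; the model object, the <start>/<end> tokens, and the vocabulary lookups are hypothetical placeholders, not Chittron's code.

    import numpy as np
    import tensorflow as tf

    def greedy_caption(model, image_embedding, word_index, index_word, max_len=20):
        # word_index / index_word: vocabulary lookups; <start>/<end> tags assumed
        seq = [word_index["<start>"]]
        for _ in range(max_len):
            padded = tf.keras.preprocessing.sequence.pad_sequences([seq], maxlen=max_len)
            probs = model.predict([image_embedding, padded], verbose=0)[0]
            next_id = int(np.argmax(probs))          # most probable next word
            if index_word.get(next_id) == "<end>":
                break
            seq.append(next_id)
        return " ".join(index_word[i] for i in seq[1:])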
Conference Paper
Full-text available
In neural image captioning systems, a recurrent neural network (RNN) is typically viewed as the primary 'generation' component. The dominant model in the literature is one in which visual features encoded by a convolutional network are 'injected' into the RNN. An alternative architecture encodes visual and linguistic features separately, merging them at a late stage. This paper compares these two architectures. We find that late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.
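
A minimal Keras sketch of the 'merge' alternative discussed here, where the LSTM sees only the words and the image vector is fused late, is shown below; the sizes and the addition-based fusion are assumptions for illustration.

    import tensorflow as tf

    image_vec = tf.keras.Input(shape=(2048,))           # pre-extracted CNN features
    caption   = tf.keras.Input(shape=(None,), dtype="int32")

    txt = tf.keras.layers.Embedding(5000, 256)(caption)
    txt = tf.keras.layers.LSTM(256)(txt)                # RNN encodes language only ("merge")
    img = tf.keras.layers.Dense(256, activation="relu")(image_vec)

    fused = tf.keras.layers.add([txt, img])             # late fusion of the two streams
    next_word = tf.keras.layers.Dense(5000, activation="softmax")(fused)
    merge_model = tf.keras.Model([image_vec, caption], next_word)

In the 'inject' variant, by contrast, the image vector would be fed into the LSTM itself, for example as its initial state or as a pseudo-word at the start of the sequence.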
Article
Full-text available
Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN which encodes an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism --- a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. SCA-CNN achieves significant improvements over state-of-the-art visual attention-based image captioning methods.
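
A rough sketch of the spatial plus channel-wise attention idea: the decoder state first re-weights feature channels, then spatial locations of the re-weighted map. The shapes, layer sizes, and gating are simplified assumptions, not the SCA-CNN implementation.

    import tensorflow as tf

    C, units = 2048, 512                       # channel count / decoder size (assumed)
    wc = tf.keras.layers.Dense(C)              # channel-attention scorer
    ws = tf.keras.layers.Dense(units)          # projects re-weighted features
    us = tf.keras.layers.Dense(units)          # projects the decoder state
    vs = tf.keras.layers.Dense(1)              # spatial-attention scorer

    def sca_attention(feature_map, hidden):
        # feature_map: (batch, H*W, C) conv features; hidden: (batch, units) decoder state
        chan_desc = tf.reduce_mean(feature_map, axis=1)                   # (batch, C)
        beta = tf.nn.sigmoid(wc(tf.concat([chan_desc, hidden], axis=-1))) # channel weights
        weighted = feature_map * beta[:, None, :]                         # "what" to attend to
        score = vs(tf.nn.tanh(ws(weighted) + us(hidden)[:, None, :]))     # (batch, H*W, 1)
        alpha = tf.nn.softmax(score, axis=1)                              # "where" to attend
        return tf.reduce_sum(alpha * weighted, axis=1)                    # attended feature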
Conference Paper
Full-text available
This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).
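
Meteor Universal itself is a separate tool with language-specific resources; purely as a hedged illustration of how a METEOR score is computed against references in practice, NLTK's implementation can be used (it relies on English WordNet, so scores on Bengali text are only indicative).

    import nltk
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)

    references = [["a", "dog", "runs", "across", "the", "field"]]
    candidate  =  ["a", "dog", "is", "running", "in", "the", "field"]
    print(round(meteor_score(references, candidate), 3))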
Article
Full-text available
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
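
In the notation usually used for this additive attention, the previous decoder state s_{i-1} scores every encoder annotation h_j, and the normalized weights form a per-step context vector (a sketch consistent with the abstract, not a quotation of the paper):

    e_{ij} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h_j), \qquad
    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
    c_i = \sum_{j} \alpha_{ij} h_j

The same scoring scheme, with CNN feature-map locations in place of source-word annotations, underlies the visual attention used in the main paper above.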
Conference Paper
Despite the fact that attribute-based approaches and attention-based approaches have been proven to be effective in image captioning, most attribute-based approaches simply predict attributes independently without taking the co-occurrence dependencies among attributes into account. Besides, most attention-based captioning models directly leverage the feature map extracted from the CNN, in which many features may be redundant in relation to the image content. In this paper, we focus on training a good attribute-inference model via a recurrent neural network (RNN) for image captioning, where the co-occurrence dependencies among attributes can be maintained. The uniqueness of our inference model lies in the usage of an RNN with the visual attention mechanism to observe the image before generating captions. Additionally, it is noticed that compact and attribute-driven features will be more useful for the attention-based captioning model. To this end, we extract the context feature for each attribute, and guide the captioning model to adaptively attend to these context features. We verify the effectiveness and superiority of the proposed approach over the other captioning approaches by conducting massive experiments and comparisons on the MS COCO image captioning dataset.
Article
Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language. Only few studies have been conducted for image captioning in a cross-lingual setting. Different from these works that manually build a dataset for a target language, we aim to learn a cross-lingual captioning model fully from machine-translated sentences. To conquer the lack of fluency in the translated sentences, we propose in this paper a fluency-guided learning framework. The framework comprises a module to automatically estimate the fluency of the sentences and another module to utilize the estimated fluency scores to effectively train an image captioning model for the target language. As experiments on two bilingual (English-Chinese) datasets show, our approach improves both fluency and relevance of the generated captions in Chinese, but without using any manually written sentences from the target language.
Conference Paper
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
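
An illustrative sketch (not the actual Inception-v3 code) of the factorizations referred to here: a 5x5 convolution replaced by two stacked 3x3 convolutions, and an nxn convolution replaced by a 1xn followed by an nx1.

    import tensorflow as tf

    def factorized_5x5(filters):
        # same receptive field as one 5x5 convolution, with fewer parameters
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu"),
            tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu"),
        ])

    def factorized_nxn(filters, n=7):
        # asymmetric factorization used in the later Inception blocks
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(filters, (1, n), padding="same", activation="relu"),
            tf.keras.layers.Conv2D(filters, (n, 1), padding="same", activation="relu"),
        ])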
Conference Paper
This paper extends research on automated image captioning in the dimension of language, studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate image captioning in this novel context, we present Flickr8k-CN, a bilingual extension of the popular Flickr8k set. The new multimedia dataset can be used to quantitatively assess the performance of Chinese captioning and English-Chinese machine translation. The possibility of re-using existing English data and models via machine translation is investigated. Our study reveals to some extent that a computer can master two distinct languages, English and Chinese, at a similar level for describing the visual world. Data is publicly available at http://lixirong.net/datasets/flickr8kcn
Article
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
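
A hedged sketch of the two attention flavours compared here: 'global' attention scores every source position, while 'local' attention restricts the same computation to a window around an estimated alignment point. Dot-product scoring and the window handling are simplifying assumptions.

    import tensorflow as tf

    def global_attention(h_t, source_states):
        # h_t: (batch, d) decoder state; source_states: (batch, S, d) encoder states
        scores = tf.einsum("bd,bsd->bs", h_t, source_states)   # score all S positions
        align = tf.nn.softmax(scores, axis=-1)
        return tf.einsum("bs,bsd->bd", align, source_states)   # context vector

    def local_attention(h_t, source_states, center, width=5):
        # attend only to a window of 2*width+1 positions around `center`
        lo = max(center - width, 0)
        window = source_states[:, lo:center + width + 1, :]
        return global_attention(h_t, window)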
Article
Image dataset is a pivotal resource for vision research. We introduce here the preview of a new dataset called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate each of the majority of the 80,000 synsets (concrete and countable nouns and their synonym sets) of WordNet with an average of 500–1000 clean images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. To construct ImageNet, we first collect a large set of candidate images (about 10 thousand) from the Internet image search engines for each synset, which typically contains approximately 10% suitable images. We then deploy an image annotation task on the online workers market Amazon Mechanical Turk. To obtain a reliable rating, each image is evaluated by a dynamically determined number of online workers. At the time this abstract is written, we have six completed sub-ImageNets with more than 2500 synsets and roughly 2 million images in total (Mammal-Net, Vehicle-Net, MusicalInstrument-Net, Tool-Net, Furniture-Net, and GeologicalFormation-Net). Our analyses show that ImageNet is much larger in scale and diversity and much more accurate than the current existing image datasets. A particularly interesting question arises in the construction of ImageNet - the degree to which a concept (within concrete and countable nouns) can be visually represented. We call this the “imageability” of a synset. While a German shepherd is an easy visual category, it is not clear how one could represent a “two-year old horse” with reliable images. We show that “imageability” can be quantified as a function of human subject consensus. Given the large scale of our dataset, ImageNet can offer, for the first time, an “imageability” measurement to a large number of concrete and countable nouns in the English language.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
  • K Xu
  • J Ba
  • R Kiros
  • K Cho
International Conference on Machine Learning (ICML), 2015