Conference Paper

StructCap: Structured Semantic Embedding for Image Captioning


Abstract

Image captioning has attracted ever-increasing research attention in multimedia and computer vision. To encode the visual content, existing approaches typically utilize an off-the-shelf deep Convolutional Neural Network (CNN) model to extract visual features, which are sent to Recurrent Neural Network (RNN) based textual generators to output word sequences. More recently, some methods encode visual objects and scene information with attention mechanisms. Despite the promising progress, one distinct disadvantage lies in distinguishing and modeling key semantic entities and their relations, which are widely regarded as important cues for describing image content. In this paper, we propose a novel image captioning model, termed StructCap. It parses a given image into key entities and their relations, organized in a visual parsing tree, which is transformed and embedded under an encoder-decoder framework via visual attention. We give an end-to-end formulation to facilitate joint training of the visual tree parser, the structured semantic attention and the RNN-based captioning modules. Experimental results on two public benchmarks, Microsoft COCO and Flickr30K, show that the proposed StructCap model outperforms state-of-the-art approaches under various standard evaluation metrics.
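
To make the encoder-decoder-with-attention idea concrete, the sketch below shows additive attention over a set of node embeddings, such as the entity/relation nodes of a visual parsing tree, feeding an LSTM decoder. This is a minimal illustration of the general framework, not the authors' exact StructCap architecture; all class and parameter names (NodeAttention, node_dim, etc.) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NodeAttention(nn.Module):
    """Additive attention over a set of node embeddings (hypothetical names)."""
    def __init__(self, node_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_node = nn.Linear(node_dim, attn_dim)
        self.w_state = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, nodes, state):
        # nodes: (B, N, node_dim) embeddings of parse-tree nodes
        # state: (B, hidden_dim) current decoder hidden state
        scores = self.v(torch.tanh(self.w_node(nodes) + self.w_state(state).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)      # (B, N, 1) attention weights
        context = (alpha * nodes).sum(dim=1)      # (B, node_dim) attended context
        return context, alpha.squeeze(-1)

class AttentiveDecoder(nn.Module):
    """One decoding step of an LSTM conditioned on the attended node context."""
    def __init__(self, vocab_size, embed_dim, node_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = NodeAttention(node_dim, hidden_dim, attn_dim=256)
        self.lstm = nn.LSTMCell(embed_dim + node_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, nodes, h, c):
        context, alpha = self.attn(nodes, h)
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=-1), (h, c))
        return self.out(h), h, c, alpha
```

In a full captioner, the node embeddings would come from the visual tree parser and the decoder would be unrolled over the caption with a cross-entropy loss.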


... In terms of structured feature embedding, existing works for multimedia data [2,3,5] employed different structures, e.g., chain, tree, and graph. Chen et al. [2,3] proposed to enhance the visual representation for the image captioning task with a linear-based structured tree model. However, because of the simple linear-based tree model in these schemes [2,3], only limited contextual information is transferred between different layers, and no attention mechanism is used. Chen et al. [5] applied a chain-structured model using an RNN for visual embeddings, which unfortunately ignores the underlying structure. ...
Preprint
Full-text available
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments, like regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, the retrieval performance remains unsatisfactory due to a lack of consistent representation in both semantics and structural spaces. In this work, we propose to address the above issue from two aspects: (i) constructing intrinsic structure (along with relations) among the fragments of respective modalities, e.g., "dog $\to$ play $\to$ ball" in semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. In this paper, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In order to jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA creates a novel multi-modal structured module with a shared context-aware referral tree. In particular, the relations of the visual and textual fragments are modeled by constructing Visual Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We utilize the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model in comparison to the state-of-the-art methods.
... modeling semantics in ImC and embedding semantics into ImD, as illustrated in Fig. 1 (Bottom). On the one hand, to overcome the arbitrary syntax of the caption [16,17], we construct a structured semantic tree architecture, where the semantic contents (entities and relations) are automatically parsed from a severely blurred image. On the other hand, to align the semantic contents spatially with the blurred image, we design a spatial semantic representation for the tree nodes, where each entity/relation is represented in the form of feature maps, and convolution is performed among the nodes of the tree structure. ...
... Based on the CNN-RNN architecture, Xu et al. [14] and You et al. [30] proposed to use an attention module in captioning based on the spatial features and the semantic concepts, respectively. To capture the semantic entities and their relations, Chen et al. [16] proposed a visual parsing tree model to embed and attend the structured semantic content into captioning. To advance the attention model, a bottom-up and top-down attention mechanism was designed in [15], which enabled attention to be calculated at the object/region level. ...
... where $j$ is the index of the tree node as shown in Fig. 3, and $H_j \in \mathbb{R}^{w' \times h' \times c'}$ ($w'$, $h'$, and $c'$ denote the width, the height, and the channel, respectively) is the feature map tensor of the $j$-th node. $E$ represents one of four semantic entities, i.e., subject 1 (Subj1), object 1 (Obj1), subject 2 (Subj2), and object 2 (Obj2), as set up in [16]. $\sigma$ is an element-wise nonlinear function applied to the convolution operation $\circledast$ with the convolution kernel set $K_E$ for convolutional decoupling in the $j$-th node ($j = 1, 3, 5, 7$). ...
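
The excerpt above describes per-entity convolution kernels that produce one feature map tensor per tree node. The following toy sketch shows one way such convolutional decoupling could look; the module, channel sizes, and entity names are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn

class ConvDecoupling(nn.Module):
    """Toy sketch: one convolution kernel set per semantic entity (Subj1, Obj1, Subj2, Obj2).

    Each entity-specific convolution maps a shared spatial feature map to the
    feature map tensor H_j of a tree node; names and shapes are illustrative only.
    """
    def __init__(self, in_channels=512, out_channels=512,
                 entities=("Subj1", "Obj1", "Subj2", "Obj2")):
        super().__init__()
        self.branches = nn.ModuleDict({
            e: nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
            for e in entities
        })
        self.act = nn.ReLU()  # element-wise nonlinearity (sigma)

    def forward(self, feat_map):
        # feat_map: (B, C, h', w') shared spatial features of the image
        # returns one feature map tensor per entity node
        return {e: self.act(conv(feat_map)) for e, conv in self.branches.items()}
```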
Preprint
Full-text available
Image deblurring has achieved exciting progress in recent years. However, traditional methods fail to deblur severely blurred images, where semantic contents appear ambiguously. In this paper, we conduct image deblurring guided by the semantic contents inferred from image captioning. Specifically, we propose a novel Structured-Spatial Semantic Embedding model for image deblurring (termed S3E-Deblur), which introduces a novel Structured-Spatial Semantic tree model (S3-tree) to bridge two basic tasks in computer vision: image deblurring (ImD) and image captioning (ImC). In particular, S3-tree captures and represents the semantic contents as structured spatial features in ImC, and then embeds the spatial features of the tree nodes into GAN-based ImD. Co-training of S3-tree, ImC, and ImD is conducted to optimize the overall model in a multi-task end-to-end manner. Extensive experiments on severely blurred MSCOCO and GoPro datasets demonstrate the significant superiority of S3E-Deblur compared to the state of the art on both ImD and ImC tasks.
... Also, an implementation of the BLEU score is available in the Python Natural Language Toolkit (NLTK) library, which can be used to evaluate generated text against a reference text. In this paper, we have utilised the BLEU score to cross-reference the captions generated by the local speaker and by the machine, after preparing an overview of the information contained in the images [15]. The BLEU score lies between 0 and 1, and the model's predictions are more accurate when all the BLEU scores are high. ...
... Evaluating matching n-grams of a single order is known as an individual n-gram score, for example single words (1-gram) or word pairs (2-gram, or bigram), whereas evaluating individual n-gram scores at all orders from 1 to n and combining them by a weighted geometric mean is known as a cumulative score. Further, we have collected confidence scores [15,16], a parameter that accompanies the developer's request; we analyse this further in Section 3. We attained a mean confidence score of 0.5 with the help of the Microsoft Computer Vision API. ...
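
As a concrete illustration of the individual versus cumulative BLEU scores mentioned above, the snippet below uses NLTK's sentence-level BLEU; the example tokens are invented, and smoothing is added only so short toy sentences do not collapse to zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "dog", "plays", "with", "a", "ball"]]        # human caption(s)
candidate = ["a", "dog", "is", "playing", "with", "a", "ball"]  # machine caption

smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no match

# Individual n-gram scores: weight a single order only.
bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu2 = sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=smooth)

# Cumulative BLEU-4: weighted geometric mean of 1- to 4-gram precisions.
bleu4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(bleu1, bleu2, bleu4)
```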
Article
Designing and releasing software in production that contains images takes a lot of time due to the need to write ALT-text attributes for the images embedded in the applications. This paper automates the task of writing ALT-text attributes in HTML, especially when image integration is large, with the use of a Python pip package and the Microsoft Computer Vision API. This saves developers substantial time and effort by automating, to a great extent, the task of captioning images manually. The challenge that confronts us is the quality of the annotations generated by the machine with respect to human-generated annotations. To study the appropriateness of the captions delivered by the APIs, a blend of human and machine assessment was used. We noticed a high similarity between human- and machine-generated annotations, as reflected in the individual and cumulative BLEU score metrics. Another metric is the confidence score, with a mean of 0.5. We also measured the time taken per caption, which is 1.6 seconds per image, taking 6.01 minutes to caption 200 images.
... For (2), in most existing works the pre-trained object detector is kept frozen when training the target VL task. This implies that the conditioning relationship between the detected objects and the input image is not jointly optimized with the target VL task. ...
... In the early stage of image captioning, researchers used an image encoder such as ResNet [16] to encode the input image into a global pooled representation [2,3,6,11,13,15,22,24,34,51,54,61]. Captions are then generated conditioned on the encoded global feature. ...
Preprint
Full-text available
Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
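
The abstract above mentions using a pre-trained multi-modal model (CLIP) to retrieve contextual descriptions such as attributes and relationships. The sketch below shows one simple way to rank candidate phrases against an image with the public CLIP package; the candidate list, file name, and variable names are assumptions for illustration, and this is not the paper's exact retrieval pipeline.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical pool of attribute/relationship phrases (e.g., mined from Visual Genome).
candidates = ["man riding horse", "dog playing with ball", "red car on street"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)   # cosine similarities image vs. phrases

top = sims.topk(k=2)
retrieved = [candidates[i] for i in top.indices.tolist()]  # contextual descriptions to condition on
print(retrieved)
```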
... Melnyk et al. [22] reported a comparison of a context-aware LSTM captioner and a co-attentive discriminator for image captioning. Wu et al. [23] used question features and image features, but gave no information on how the question information is generated; Chen et al. [24] used the parsing-tree-based StructCap, but handling a tree-based representation is difficult; Jiang et al. [25] proposed a sequence-to-sequence framework for joint learning, but it depends on resource-hungry training; and Wu et al. [26] proposed a dual temporal model in which different understandings of the same features complement each other. Image-Text Surgery in [16,27] uses attribute-driven attention with different attribute influences, which are difficult to gather from images; Cornia et al. [28] used a generative recurrent neural network, but defining a random-generator-based captioner is difficult; Zhao et al. [29] introduced MLAIC for better representation, but defining a representation for sentences can be challenging. There is also [30] text-guided attention, similar to Sur et al. [31] but with enhanced semantic features due to TPR; Chen et al. [32] a reference-based LSTM; Chen et al. [33] an adversarial neural network [34] with a generative process mimicking sentence generation; high-dimensional attentions [35,36] with coarse-to-fine skeleton sentences, where previously generated sentences are used to generate newer sentences effectively; Chen et al. [37] specific styles for different applications based on data that was not shared; Chen et al. [38] structural relevance and structural diversity; multimodal attention [39] with different modes of attention that have hardly any structural or content-based explainable relevance; Harzig et al. [40] popular-brand captions; Liu et al. [41] diversified captions; Chunseong Park et al. [42] stylish captions; Sharma et al. [43] sub-categorical styles; Yao et al. [44] personalized [45] captions; Zhang et al. [46] actor-critic reinforcement learning [47]; Fu et al. [48] scene-specific attention contexts; Ren et al. [49] a policy network for captions, where the policies relate to image content selection; Liu et al. [50] reinforcement-learning-based training for fine-tuning the captioner model; Cohn-Gordon et al. [51] distinguishing between similar kinds for diversity; Liu et al. [52] improved correctness of attention in the image; Lu et al. [53] adaptivity of attention; Vinyals et al. [54] a combination of computer vision and machine translation; Zhang et al. [55] an adaptive re-weighted loss function; Park et al. [56] personalized captioning; Wu et al. [57] high-level semantic concepts; Vinyals et al. [8]; [71] a new tree-based approach to composing expressive image descriptions; Mao et al. [72] a transposed weight-sharing scheme; Mathews et al. [73] different emotions and sentiments; and Yang et al. [74], where nouns, verbs, scenes and prepositions are used for structuring sentences. ...
... While most prior works focused only on what the model has learned, we paid more attention to what we can create and feed into the network, as shown in Eqs. (24) and (37). With this approach, we have established a new performance level, surpassing previous works on all the metrics. ...
Article
Full-text available
Region visual features enhance the generative capability of machines based on those features. However, they lack proper interaction-based attentional perception and end up with biased or uncorrelated sentences or pieces of misinformation. In this work, we propose Attribute Interaction-Tensor Product Representation (aiTPR), which is a convenient way of gathering more information through orthogonal combination and learning the interactions as physical entities (tensors), thereby improving the captions. Compared to previous works, where features add up to undefined feature spaces, TPR helps maintain sanity in the combinations, and orthogonality helps define familiar spaces. We introduce a new concept layer that defines the objects and their interactions, which can play a crucial role in determining different descriptions. The interaction portions contribute heavily to better caption quality and outperform various previous works in this domain on the MSCOCO dataset. For the first time, we introduce the notion of combining regional image features and abstracted interaction-likelihood embeddings for image captioning.
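
The abstract above relies on Tensor Product Representations, where fillers are bound to orthogonal role vectors so that combinations stay separable. The short NumPy sketch below illustrates the basic binding/unbinding mechanics only; the dimensions and the use of the standard basis as roles are illustrative assumptions, not the aiTPR model itself.

```python
import numpy as np

# Orthonormal role vectors (here: the standard basis) keep bindings separable.
roles = np.eye(3)                    # role_0, role_1, role_2
fillers = np.random.randn(3, 8)      # e.g., object/interaction embeddings (dim 8)

# Binding: sum of outer products role_i (x) filler_i -> a single tensor representation.
T = sum(np.outer(roles[i], fillers[i]) for i in range(3))   # shape (3, 8)

# Unbinding: because the roles are orthonormal, role_i^T T recovers filler_i exactly.
recovered = roles[1] @ T
print(np.allclose(recovered, fillers[1]))   # True
```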
... Image Captioning. Image captioning is the task of describing images with syntactically and semantically meaningful sentences. Current deep learning-based image captioning models have evolved into encoder-decoder frameworks with multi-modal connections (Chen et al., 2017), attentive strategies (Huang et al., 2019; Guo et al., 2020) and fusion strategies (Zhou et al., 2020). Standard captioning datasets include Flickr30K (Young et al., 2014) and the commonly used MS COCO (Lin et al., 2014), which consist of images with events, objects and scenes. ...
Preprint
Full-text available
Recent advances in image captioning are mainly driven by large-scale vision-language pretraining, relying heavily on computational resources and increasingly large multimodal datasets. Instead of scaling up pretraining data, we ask whether it is possible to improve performance by improving the quality of the samples in existing datasets. We pursue this question through two approaches to data curation: one that assumes that some examples should be avoided due to mismatches between the image and caption, and one that assumes that the mismatch can be addressed by replacing the image, for which we use the state-of-the-art Stable Diffusion model. These approaches are evaluated using the BLIP model on MS COCO and Flickr30K in both finetuning and few-shot learning settings. Our simple yet effective approaches consistently outperform baselines, indicating that better image captioning models can be trained by curating existing resources. Finally, we conduct a human study to understand the errors made by the Stable Diffusion model and highlight directions for future work in text-to-image generation.
... [42] reported the comparison of a context-aware LSTM captioner and a co-attentive discriminator for image captioning. [72] used question features and image features, [4] a parsing-tree-based StructCap, [23] a sequence-to-sequence framework, and [74] a dual temporal model; Image-Text Surgery in [17], [2] attribute-driven attention, [11] a generative recurrent neural network, [85] MLAIC for better representation. There are also [31] text-guided attention, [3] reference-based LSTM, [7] an adversarial neural network, high-dimensional attentions [80], [70] coarse-to-fine skeleton sentences, [8] specific styles, [5] structural relevance and structural diversity, multimodal attention [34], [21] popular-brand captions, [32] diversified captions, [9] stylish captions, [48] sub-categorical styles, [78] personalized captions, [83] actor-critic reinforcement learning, [16] scene-specific attention contexts, [46] a policy network for captions, [35] reinforcement-learning-based training, [10] distinguishing between similar kinds for diversity, [33] improved correctness of attention in the image, [37] adaptivity of attention, [68] a combination of computer vision and machine translation, [84] an adaptive re-weighted loss function, [43] personalized captioning, [73] high-level semantic concepts, [69] combinations of visual features and machine-translation attention, [19] different caption styles, [24] shifting attention, [27] characteristics of text-based representations, [44] variational autoencoder representations, [49] dependency-tree embeddings, [65] character-level language modeling, [66] fixed-dimension representations, [30] 3-dimensional convolutional networks, [67] human judgments and out-of-domain data handling, [82] semantic attention, [18] the Semantic Compositional Network (SCN), [20] localizing and segmenting objects, [22] extra semantic attention, [28] content planning and recognition algorithms, [29] a new tree-based approach to composing expressive image descriptions, [40] a transposed weight-sharing scheme, [41] different emotions and sentiments, and [77] where nouns, verbs, scenes and prepositions are used for structuring sentences. ...
Article
Full-text available
While image captioning through machines requires structured learning and a basis for interpretation, improvement requires understanding and processing multiple contexts in a meaningful way. This research provides a novel concept for context combination and impacts many applications that deal with visual features as an equivalence of descriptions of objects, activities, and events. There are three components in our architecture: the Feature Distribution Composition (FDC) Layer Attention, the Multiple Role Representation Crossover (MRRC) Attention Layer, and the Language Decoder. FDC Layer Attention helps in generating weighted attention from RCNN features, the MRRC Attention Layer acts as intermediate representation processing and helps in generating the next-word attention, and the Language Decoder estimates the likelihood of the next probable word in the sentence. We demonstrated the effectiveness of FDC, MRRC, regional object feature attention, and reinforcement learning for effective learning to generate better captions from images. The performance of our model improved upon previous performances, reaching 35.3%, and created a new standard and theory for representation generation based on logic, better interpretability, and contexts.
... Melnyk et al. [33] reported the comparison of a context-aware LSTM captioner and a co-attentive discriminator for image captioning. Wu et al. [34] used question features and image features, [35] a parsing-tree-based StructCap, [36] a sequence-to-sequence framework, and [37] a dual temporal model; Image-Text Surgery in [11,38] attribute-driven attention, [39] a generative recurrent neural network, [40] MLAIC for better representation. There are also [41] text-guided attention, [42] reference-based LSTM, [43] an adversarial neural network, high-dimensional attentions [44,45] coarse-to-fine skeleton sentences, [46] specific styles, [47] structural relevance and structural diversity, multimodal attention [48,49] popular-brand captions, [50] diversified captions, [51] stylish captions, [52] sub-categorical styles, [53] personalized captions, [54] actor-critic reinforcement learning, [55] scene-specific attention contexts, [56] a policy network for captions, [57] reinforcement-learning-based training, [58] distinguishing between similar kinds for diversity, [59] improved correctness of attention in the image, [29] adaptivity of attention, [60] a combination of computer vision and machine translation, [61] an adaptive re-weighted loss function, [62] personalized captioning, [63] high-level semantic concepts, [6] combinations of visual features and machine-translation attention, [64] different caption styles, [65] shifting attention, [66] characteristics of text-based representations, [67] variational autoencoder representations, [68] dependency-tree embeddings, [69] character-level language modeling, [70] fixed-dimension representations, [71] 3-dimensional convolutional networks, [72] human judgments and out-of-domain data handling, [28] semantic attention, [12] the semantic compositional network (SCN), [73] localizing and segmenting objects, [74] extra semantic attention, [75] content planning and recognition algorithms, [76] a new tree-based approach to composing expressive image descriptions, [77] a transposed weight-sharing scheme, [78] different emotions and sentiments, and [79] where nouns, verbs, scenes and prepositions are used for structuring sentences. ...
Article
Full-text available
Semantic feature composition from image features has a drawback: it is unable to capture the content of the captions and fails to evolve into longer and more meaningful captions. In this paper, we propose improvements to semantic features that can generate and evolve captions through a new approach called mixed fusion of representations and decomposition. Semantic captioning works on the principle of using CNN visual features to generate a context-word distribution and using that to generate captions with a language decoder. The generated semantics are used for captioning, but have limitations. We introduce a far better and newer approach with an enhanced representation-based network known as the mixed representation enhanced (de)compositional network (MRECN), which can help produce better and different content for captions. As the results show (0.351 BLEU-4), it has outperformed most of the state of the art. We defined a better feature decoding scheme using learned networks, which establishes coherence of related words in captions. From our research, we have come to some important conclusions regarding mixed representation strategies, as they emerge as the most viable and promising way of representing the relationships of sophisticated features for decision making and for complex applications like image-to-natural-language generation.
... Here, object-based semantic image representation was used in a deep network as features to retrieve and select the relevant image(s). Chen et al. [4] introduced StructCap, where they used an extra set of features derived from the parsing tree that was created from the knowledge of the objects gathered from the visual features. The model parsed an image into key entities, derived their relations and organized them into a visual parsing tree. ...
Article
Full-text available
In this work we have analyzed a novel concept of a sequential-binding-based, learning-capable network built on the coupling of recurrent units with a Bayesian prior definition. The coupling structure encodes the input to generate efficient tensor representations that can be decoded into efficient sentences describing certain events. These descriptions are derived from structural representations of the visual features of images and media. An elaborate study of the different types of coupling recurrent structures is provided, along with some insights into their performance. Supervised learning performance for natural language processing is judged based on statistical evaluations; however, the truth is a matter of perspective, and in this case the qualitative evaluations reveal the real capability of the different architectural strengths and variations. The Bayesian prior definition of the different embeddings helps in better characterization of the sentences based on the natural language structure related to parts of speech and other semantic-level categorization, in a form which is machine-interpretable and inherits the characteristics of Tensor Representation binding and unbinding based on mutual orthogonality. Our approach has surpassed some of the existing basic works related to image captioning.
... Overall, the research on MMT is of great significance. On the one hand, similar to other multimodal tasks such as image captioning [11,12,24,41] and visual question answering [22,33,37,64], MMT involves computer vision and natural language processing (NLP) and demonstrates the effectiveness of visual features in translation tasks. In other words, it not only requires an algorithm with an in-depth understanding of visual contexts, but also connects its interpretation with a language model to create a natural sentence. ...
Preprint
Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both the computer vision and natural language processing communities. Most current MMT models resort to attention mechanisms, global context modeling or multimodal joint representation learning to utilize visual features. However, the attention mechanism lacks sufficient semantic interaction between modalities, while the other two provide a fixed visual context, which is unsuitable for modeling the observed variability when generating translations. To address the above issues, in this paper we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each timestep of decoding, we first employ the conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. In particular, as we represent the input image with global and regional visual features, we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder for the prediction of the target word. Experimental results on the Multi30K dataset for English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available on https://github.com/DeepLearnXMU/MM-DCCN.
... Here, object-based semantic image representation was used in a deep network as features to retrieve and select the relevant image(s). Chen et al. [9] introduced StructCap, where they used an extra set of features derived from the parsing tree that was created from the knowledge of the objects gathered from the visual features. The model parsed an image into key entities, derived their relations and organized them into a visual parsing tree. ...
Article
Full-text available
Progress in image captioning is gradually becoming more complex as researchers try to generalize the model and define the representation between visual features and natural language processing. In the absence of any established relationship, every time a new dividend is added, it produces very little improvement, not considerable enough to make the model general. This work tries to define such a relationship in the form of a representation called Algebraic Amalgamation-based Composed Representation (AACR), which generalizes the scheme of language modeling and structures the linguistic attributes (related to the grammar and parts of speech of the language), providing much better structure and grammatically correct sentences. AACR enables better and more unique representation and structuring of the feature space, and enables a transfer-learning-like infrastructure for all machines to interact with the external world (both human and machine) through these representations. A large part of the different ways of defining and improving AACR is discussed, and their performance relative to traditional procedures and feature representations is evaluated for the image captioning application. The new models achieved considerable improvement over the corresponding previous architectures.
... [8] reported the comparison of a context-aware LSTM captioner and a co-attentive discriminator for image captioning. [9] used question features and image features, [11] a parsing-tree-based StructCap, [12] a sequence-to-sequence framework, and [13] a dual temporal model; Image-Text Surgery in [14], [15] attribute-driven attention, [16] a generative recurrent neural network, [17] MLAIC for better representation. There are also [18] text-guided attention, [19] reference-based LSTM, [21] an adversarial neural network, high-dimensional attentions [22], [23] coarse-to-fine skeleton sentences, [24] specific styles, [25] structural relevance and structural diversity, multimodal attention [26], [27] popular-brand captions, [28] diversified captions, [29] stylish captions, [30] sub-categorical styles, [31] personalized captions, [32] actor-critic reinforcement learning, [33] scene-specific attention contexts, [34] a policy network for captions, [35] reinforcement-learning-based training, [36] distinguishing between similar kinds for diversity, [37] improved correctness of attention in the image, [39] adaptivity of attention, [40] a combination of computer vision and machine translation, [42] an adaptive re-weighted loss function, [43] personalized captioning, [46] high-level semantic concepts, [47] combinations of visual features and machine-translation attention, [56] different caption styles, [57] shifting attention, [58] characteristics of text-based representations, [62] variational autoencoder representations, [63] dependency-tree embeddings, [64] character-level language modeling, [65] fixed-dimension representations, [66] 3-dimensional convolutional networks, [67] human judgments and out-of-domain data handling, [70] semantic attention, [73] the Semantic Compositional Network (SCN), [74] localizing and segmenting objects, [76] extra semantic attention, [78] content planning and recognition algorithms, [80] a new tree-based approach to composing expressive image descriptions, [81] a transposed weight-sharing scheme, [82] different emotions and sentiments, and [85] where nouns, verbs, scenes and prepositions are used for structuring sentences. ...
Preprint
Full-text available
While image captioning through machines requires structured learning and a basis for interpretation, improvement requires understanding and processing multiple contexts in a meaningful way. This research provides a novel concept for context combination and will impact many applications that deal with visual features as an equivalence of descriptions of objects, activities and events. There are three components in our architecture: the Feature Distribution Composition (FDC) Layer Attention, the Multiple Role Representation Crossover (MRRC) Attention Layer and the Language Decoder. FDC Layer Attention helps in generating weighted attention from RCNN features, the MRRC Attention Layer acts as intermediate representation processing and helps in generating the next-word attention, while the Language Decoder estimates the likelihood of the next probable word in the sentence. We demonstrated the effectiveness of FDC, MRRC, regional object feature attention and reinforcement learning for effective learning to generate better captions from images. The performance of our model enhanced previous performances by 35.3% and created a new standard and theory for representation generation based on logic, better interpretability and contexts.
... The framework is used to "translate" an image into a sentence, where the visual features are extracted by a convolutional neural network (CNN) and fed into a Long Short-Term Memory (LSTM) network to generate captions. Image captioning techniques have been extensively explored in [22,45,31,9,4,5]. A few models [51,54,2] seek to apply attention mechanisms to bridge the gap between visual understanding and language processing. ...
Preprint
Image captioning has attracted ever-increasing research attention in the multimedia community. To this end, most cutting-edge works rely on an encoder-decoder framework with attention mechanisms, which has achieved remarkable progress. However, such a framework does not consider scene concepts when attending to visual information, which leads to sentence bias in caption generation and degrades the performance correspondingly. We argue that such scene concepts capture higher-level visual semantics and serve as an important cue in describing images. In this paper, we propose a novel scene-based factored attention module for image captioning. Specifically, the proposed module first embeds the scene concepts into factored weights explicitly and attends to the visual information extracted from the input image. Then, an adaptive LSTM is used to generate captions for specific scene types. Experimental results on the Microsoft COCO benchmark show that the proposed scene-based attention module substantially improves model performance, outperforming state-of-the-art approaches under various evaluation metrics.
... Chen et al. [14] introduced StructCap, where they used an extra set of features derived from the parsing tree that was created from the knowledge of the objects gathered from the visual features. The model parsed an image into key entities, derived their relations and organized them into a visual parsing tree. ...
Article
Full-text available
Deep learning architectures have been researched the most in this decade because of their capability to scale up and solve problems that could not be solved before. Meanwhile, many NLP applications have cropped up, and there is a need to understand how the concepts have gradually evolved since the perceptron was introduced in 1959. This document provides a detailed description of this computational neuroscience lineage, starting from artificial neural networks, and of how researchers retrospected on the drawbacks faced by previous architectures and paved the way for modern deep learning. Modern deep learning is more than what was perceived decades ago and has been extended to architectures with exceptional intelligence, scalability and precision. This document provides an overview of this continuing line of work and also specifically deals with applications in various domains related to natural language processing and visual and media content.
Article
Full-text available
Grid-based features have been proven to be as effective as region-based features in multi-modal tasks such as visual question answering. However, their application to image captioning encounters two main issues, namely noisy features and fragmented semantics. In this paper, we propose a novel feature selection scheme, with a Relation-Aware Selection (RAS) and a Fine-grained Semantic Guidance (FSG) learning strategy. Based on the grid-wise interactions, RAS can enhance the salient visual regions and channels and suppress the less important ones. In addition, this selection process is guided by FSG, which uses fine-grained semantic knowledge to supervise the selection process. Experimental results on MS COCO show that the proposed RAS-FSG scheme achieves state-of-the-art performance in both offline and online testing, i.e., 134.3 CIDEr for offline testing and 135.4 for the online testing of MS COCO. Extensive ablation studies and visualizations also validate the effectiveness of our scheme.
Article
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
Article
Full-text available
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text. Unlike classic reviews of deep learning where monomodal image classifiers such as VGG, ResNet and Inception module are central topics, this paper will examine recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants. These models go beyond the simple image classifiers in which they can do uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering) multimodal tasks. Besides, we analyze two aspects of the challenge in terms of better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial in overcoming the aforementioned challenges. Finally, we include several promising directions for future research.
Article
Full-text available
Automatic image captioning has achieved great progress. However, existing captioning frameworks basically enumerate the objects in the image. The generated captions lack real-world knowledge about named entities and their relations, such as the relations among famous persons, organizations and buildings. On the contrary, humans interpret images in a specific way by providing real-world knowledge about the relations of the aforementioned named entities. To generate human-like captions, we focus on captioning news images, which can provide real-world knowledge of the whole story behind the images. We then propose a novel model that makes captions closer to the human-like description of the image by leveraging the semantic relevance of the named entities. The named entities are not only extracted from the news under the guidance of the image content, but also extended with external knowledge based on the semantic relations. In detail, we propose a sentence correlation analysis algorithm to selectively draw contextual information from the news, and use an entity-linking algorithm based on a knowledge graph to discover the relations of entities with a global view. The results of extensive experiments on a real-world dataset collected from news show that our model generates image captions closer to the corresponding real-world captions.
Article
Image captioning is a task to generate natural descriptions of images. In existing image captioning models, the generated captions usually lack semantic discriminability. Semantic discriminability is difficult, as it requires the model to capture detailed differences in images. In this paper, we propose an image captioning framework with semantic-enhanced features and extremely hard negative examples. These two components are combined in a Semantic-Enhanced Module. The semantic-enhanced module consists of an image-text matching sub-network and a Feature Fusion layer that provides semantic-enhanced features with rich semantic information. Moreover, in order to improve semantic discriminability, we propose an extremely hard negative mining method which utilizes extremely hard negative examples to improve the latent alignment between visual and language information. Experimental results on MSCOCO and Flickr30K show that our proposed framework and training method can simultaneously improve the performance of image-text matching and image captioning, achieving competitive performance against state-of-the-art methods.
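
To ground the idea of hard negative mining for image-text matching, the sketch below shows a bidirectional triplet ranking loss that keeps only the hardest in-batch negative per anchor (in the spirit of VSE++-style max-of-hinges losses). It is an illustrative baseline under assumed shapes, not the paper's exact "extremely hard negative" formulation.

```python
import torch

def hardest_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional ranking loss using the hardest in-batch negatives.

    img_emb, txt_emb: (B, D) L2-normalized embeddings of matched image-text pairs,
    where row i of img_emb corresponds to row i of txt_emb.
    """
    sims = img_emb @ txt_emb.t()                  # (B, B) cosine similarities
    pos = sims.diag().view(-1, 1)                 # similarity of matched pairs

    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    # image anchor vs. wrong texts, and text anchor vs. wrong images
    cost_txt = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0)

    # Keep only the hardest (maximum-violation) negative per anchor.
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```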
Article
Full-text available
Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between image and text to learn effective multi-modal features is crucial for image-text matching. However, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks and an attention mechanism to learn effective and robust multi-modal embeddings for better matching between image and text. Adversarial learning is implemented as an interplay between two processes. First, two attention models are proposed to exploit two types of correlation between the image and text for multi-modal embedding learning and to confuse the other process. Then the discriminator tries to distinguish the two types of multi-modal embeddings learned by the two attention models, whereby the two attention models are reinforced mutually. Through adversarial learning, it is expected that both embeddings from the attention models can well exploit the two types of correlation and thus deceive the discriminator into believing that each was generated by the other attention-based model. By integrating the attention mechanism and adversarial learning, the learned multi-modal embeddings are more effective for image and text matching. Extensive experiments have been conducted on the benchmark datasets Flickr30K and MSCOCO to demonstrate the superiority of the proposed approaches over state-of-the-art methods on image-text retrieval.
Preprint
In this work we have analyzed a novel concept of a sequential-binding-based, learning-capable network built on the coupling of recurrent units with a Bayesian prior definition. The coupling structure encodes the input to generate efficient tensor representations that can be decoded into efficient sentences describing certain events. These descriptions are derived from structural representations of the visual features of images and media. An elaborate study of the different types of coupling recurrent structures is provided, along with some insights into their performance. Supervised learning performance for natural language processing is judged based on statistical evaluations; however, the truth is a matter of perspective, and in this case the qualitative evaluations reveal the real capability of the different architectural strengths and variations. The Bayesian prior definition of the different embeddings helps in better characterization of the sentences based on the natural language structure related to parts of speech and other semantic-level categorization, in a form which is machine-interpretable and inherits the characteristics of Tensor Representation binding and unbinding based on mutual orthogonality. Our approach has surpassed some of the existing basic works related to image captioning.
Article
Automatically describing the content of an image has been attracting considerable research attention in the multimedia field. To represent the content of an image, many approaches directly utilize convolutional neural networks (CNNs) to extract visual representations, which are fed into recurrent neural networks to generate natural language. Recently, some approaches have detected semantic concepts from images and then encoded them into high-level representations. Although substantial progress has been achieved, most of the previous methods treat entities in images individually, thus lacking structured information that provides important cues for image captioning. In this paper, we propose a framework based on scene graphs for image captioning. Scene graphs contain abundant structured information because they not only depict object entities in images but also present pairwise relationships. To leverage both visual features and semantic knowledge in structured scene graphs, we extract CNN features from the bounding box offsets of object entities for visual representations, and extract semantic relationship features from triples (e.g., man riding bike ) for semantic representations. After obtaining these features, we introduce a hierarchical-attention-based module to learn discriminative features for word generation at each time step. The experimental results on benchmark datasets demonstrate the superiority of our method compared with several state-of-the-art methods.
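
The abstract above extracts semantic relationship features from scene-graph triples (e.g., man riding bike) alongside region CNN features. The sketch below shows one simple way such triples could be embedded into relationship features that an attention module can then consume; the vocabulary handling, dimensions, and class name are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Encodes (subject, predicate, object) triples, e.g. (man, riding, bike),
    into semantic relationship features to be attended alongside region features."""
    def __init__(self, vocab_size, embed_dim=300, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(3 * embed_dim, out_dim)

    def forward(self, triples):
        # triples: (B, T, 3) word indices for subject, predicate, object
        e = self.embed(triples)              # (B, T, 3, embed_dim)
        e = e.flatten(start_dim=2)           # (B, T, 3*embed_dim) concatenated triple
        return torch.relu(self.proj(e))      # (B, T, out_dim) relationship features
```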
Conference Paper
Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it using natural language. Different from previous work that encodes video information using a single flow, in this work we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning, which utilizes a two-branch architecture to collaboratively encode videos. The first, content branch encodes the visual content information of the video via an autoencoder, and the second, semantic branch encodes the semantic information via visual-semantic joint embedding. Both branches are then effectively combined with a soft-attention mechanism and finally fed into an RNN decoder to generate captions. With SibNet explicitly capturing both content and semantic information, the proposed method can better represent the rich information in videos. Extensive experiments on the YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.
Preprint
Many vision-language tasks can be reduced to the problem of sequence prediction for natural language output. In particular, recent advances in image captioning use deep reinforcement learning (RL) to alleviate the "exposure bias" during training: the ground-truth subsequence is exposed at every prediction step during training, which introduces bias at test time when only the predicted subsequence is seen. However, existing RL-based image captioning methods only focus on the language policy and not the visual policy (e.g., visual attention), and thus fail to capture the visual context that is crucial for compositional reasoning such as visual relationships (e.g., "man riding horse") and comparisons (e.g., "smaller cat"). To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning. At every time step, CAVP explicitly accounts for the previous visual attentions as the context, and then decides whether the context is helpful for the current word generation given the current visual attention. Compared against traditional visual attention that only fixes a single image region at every step, CAVP can attend to complex visual compositions over time. The whole image captioning model, CAVP and its subsequent language policy network, can be efficiently optimized end-to-end by using an actor-critic policy gradient method with respect to any caption evaluation metric. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO offline split and online server, using various metrics and sensible visualizations of qualitative visual context. The code is available at https://github.com/daqingliu/CAVP
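
To illustrate the sequence-level policy gradient training referenced above, the snippet below shows a simplified REINFORCE-with-baseline loss in which a caption metric (e.g., CIDEr) acts as the reward; it conveys the general actor-critic/policy-gradient idea only, not the CAVP implementation, and all argument names are assumptions.

```python
import torch

def policy_gradient_loss(sample_logprobs, sample_reward, baseline_reward):
    """REINFORCE with a baseline for sequence-level caption training.

    sample_logprobs: (B, T) log-probabilities of the sampled caption tokens
    sample_reward:   (B,)  metric score (e.g., CIDEr) of the sampled captions
    baseline_reward: (B,)  metric score of a baseline caption (e.g., greedy decode
                           or a critic's value estimate)
    """
    advantage = (sample_reward - baseline_reward).unsqueeze(1)   # (B, 1)
    # Negative sign: gradient ascent on expected reward == descent on this loss.
    return -(advantage * sample_logprobs).sum(dim=1).mean()
```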
Conference Paper
Full-text available
Image captioning is an important problem in artificial intelligence, related to both computer vision and natural language processing. There are two main problems in existing methods: in the training phase, it is difficult to determine which parts of the captions are more essential to the image; in the caption generation phase, the objects or the scenes are sometimes misrecognized. In this paper, we consider the training images as references and propose a Reference-based Long Short Term Memory (R-LSTM) model, aiming to solve these two problems at once. When training the model, we assign different weights to different words, which enables the network to better learn the key information of the captions. When generating a caption, the consensus score is utilized to exploit the reference information of neighboring images, which can fix misrecognitions and make the descriptions more natural-sounding. The proposed R-LSTM model outperforms state-of-the-art approaches on the benchmark dataset MS COCO and obtains a top-2 position on 11 of the 14 metrics on the online test server.
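
The abstract above assigns different weights to different caption words during training. How those weights are derived from the reference images is specific to the paper; the sketch below only shows the generic mechanism of feeding per-word weights into a caption cross-entropy loss, with assumed shapes and names.

```python
import torch
import torch.nn.functional as F

def weighted_caption_loss(logits, targets, word_weights, pad_id=0):
    """Cross-entropy over a caption where each ground-truth word carries its own weight.

    logits:       (B, T, V) decoder scores over the vocabulary
    targets:      (B, T)    ground-truth word indices
    word_weights: (B, T)    importance weight of each target word
                            (e.g., key words weighted higher than filler words)
    """
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    mask = (targets != pad_id).float()
    return (nll * word_weights * mask).sum() / mask.sum().clamp(min=1)
```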
Article
Full-text available
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they can directly map variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics, yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Article
Full-text available
Visual attention plays an important role in understanding images and demonstrates its effectiveness in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach that retrieves from the training data the captions associated with each image and uses them to learn attention on visual features. Our attention model enables describing the detailed state of scenes by distinguishing small or confusable objects effectively. We validate our model on the MS-COCO Captioning benchmark and achieve state-of-the-art performance on standard metrics.
Article
Full-text available
Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN which encodes an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism --- a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. SCA-CNN achieves significant improvements over state-of-the-art visual attention-based image captioning methods.
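
To make the "channel-wise and spatial" attention idea above concrete, the sketch below applies a channel re-weighting followed by spatial attention over a conv feature map, both conditioned on the decoder hidden state. It is a simplified, SCA-CNN-style illustration under assumed shapes and names, not the published architecture.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Simplified channel-wise then spatial attention over a conv feature map."""
    def __init__(self, channels, hidden_dim, attn_dim=256):
        super().__init__()
        # scores one channel from its pooled activation plus the decoder state
        self.ch_score = nn.Sequential(nn.Linear(1 + hidden_dim, attn_dim),
                                      nn.Tanh(), nn.Linear(attn_dim, 1))
        # scores one spatial location from its channel vector plus the decoder state
        self.sp_score = nn.Sequential(nn.Linear(channels + hidden_dim, attn_dim),
                                      nn.Tanh(), nn.Linear(attn_dim, 1))

    def forward(self, feat, h):
        # feat: (B, C, H, W) conv feature map, h: (B, hidden_dim) decoder hidden state
        B, C, H, W = feat.shape

        # Channel-wise attention ("what"): weight each channel.
        ch_desc = feat.mean(dim=(2, 3)).unsqueeze(-1)                 # (B, C, 1)
        h_c = h.unsqueeze(1).expand(B, C, h.size(-1))                 # (B, C, hidden)
        beta = torch.softmax(
            self.ch_score(torch.cat([ch_desc, h_c], dim=-1)).squeeze(-1), dim=1)
        feat = feat * beta.view(B, C, 1, 1)                           # re-weight channels

        # Spatial attention ("where"): weight each location of the re-weighted map.
        loc = feat.flatten(2).transpose(1, 2)                         # (B, H*W, C)
        h_s = h.unsqueeze(1).expand(B, H * W, h.size(-1))             # (B, H*W, hidden)
        alpha = torch.softmax(
            self.sp_score(torch.cat([loc, h_s], dim=-1)).squeeze(-1), dim=1)
        context = (alpha.unsqueeze(-1) * loc).sum(dim=1)              # (B, C) attended context
        return context, beta, alpha
```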
Conference Paper
Full-text available
The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. How to effectively integrate local and contextual visual cues from these regions has become a fundamental problem in object detection. Most existing works simply concatenated features or scores obtained from the support regions. In this paper, we propose a novel gated bi-directional CNN (GBD-Net) to pass messages between features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate the existence of each other by learning their nonlinear relationships, and their close interactions are modeled in a much more complex way. It is also shown that message passing is not always helpful, depending on individual samples. Gated functions are further introduced to control message transmission, and their on-and-off is controlled by extra visual evidence from the input sample. GBD-Net is implemented under the Fast RCNN detection framework. Its effectiveness is shown through experiments on three object detection datasets: ImageNet, PASCAL VOC 2007 and Microsoft COCO.
Article
Full-text available
Object detection is a fundamental problem in image understanding. One popular solution is the R-CNN framework and its fast versions. They decompose the object detection problem into two cascaded, easier tasks: 1) generating object proposals from images, and 2) classifying proposals into various object categories. Although these are two relatively easier tasks, they are not solved perfectly, and there is still room for improvement. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better-localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances. CRAFT achieves consistent and considerable improvement over the state of the art on object detection benchmarks such as PASCAL VOC 07/12 and ILSVRC.
Article
Full-text available
In this work we focus on the problem of image caption generation. We propose an extension of the long short-term memory (LSTM) model, which we coin gLSTM for short. In particular, we add semantic information extracted from the image as extra input to each unit of the LSTM block, with the aim of guiding the model towards solutions that are more tightly coupled to the image content. Additionally, we explore different length normalization strategies for beam search in order to prevent the search from favoring short sentences. On various benchmark datasets such as Flickr8K, Flickr30K and MS COCO, we obtain results that are on par with or even outperform the current state-of-the-art.
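To make the length-normalization point concrete, here is a small sketch of one common strategy (total log-probability divided by a length penalty); the exact strategies explored in the paper may differ.

```python
import math

def length_normalized_score(logprobs, alpha=1.0):
    """Total log-probability divided by a length penalty so that beam
    search does not automatically prefer the shortest hypotheses;
    alpha=1 is a plain per-word average, other exponents are common."""
    return sum(logprobs) / (len(logprobs) ** alpha)

short = [math.log(0.50)] * 2    # 2-word caption, per-word probability 0.50
longer = [math.log(0.55)] * 5   # 5-word caption, per-word probability 0.55

print(sum(short) > sum(longer))                 # True: raw log-prob favors the short caption
print(length_normalized_score(longer) >
      length_normalized_score(short))           # True: normalization favors the longer one
```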
Article
Full-text available
Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image caption system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience, where the attention shifting among the visual regions imposes a thread of visual ordering. This alignment characterizes the flow of "abstract meaning", encoding what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. The contexts adapt language models for word generation to specific scene types. We benchmark our system against published results on several popular datasets. We show that using either region-based attention or scene-specific contexts improves over systems without those components. Furthermore, combining these two modeling ingredients attains state-of-the-art performance.
Article
Full-text available
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models recently proposed for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
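The soft-search can be sketched as an additive alignment model: each encoder state is scored against the previous decoder state, the scores are normalized into attention weights, and the context is their weighted sum. The weight shapes below are illustrative.

```python
import numpy as np

def soft_alignment(s_prev, enc_states, W, U, v):
    """Score every encoder state against the decoder state, normalize the
    scores into attention weights, and return the weighted context."""
    scores = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in enc_states])
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                               # attention over source positions
    context = (alpha[:, None] * enc_states).sum(axis=0)
    return context, alpha

enc_states = np.random.rand(6, 4)   # 6 source positions, hidden size 4
s_prev = np.random.rand(4)          # previous decoder state
W, U, v = np.random.rand(5, 4), np.random.rand(5, 4), np.random.rand(5)
context, alpha = soft_alignment(s_prev, enc_states, W, U, v)
print(alpha.sum())                  # ~1.0, a proper distribution over positions
```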
Chapter
Full-text available
In this chapter we present a Bayesian framework for parsing images into their constituent visual patterns. The parsing algorithm optimizes the posterior probability and outputs a scene representation as a “parsing graph”, in a spirit similar to parsing sentences in speech and natural language. The algorithm constructs the parsing graph and re-configures it dynamically using a set of moves, which are mostly reversible Markov chain jumps. This computational framework integrates two popular inference approaches – generative (top-down) methods and discriminative (bottom-up) methods. The former formulates the posterior probability in terms of generative models for images defined by likelihood functions and priors. The latter computes discriminative probabilities based on a sequence (cascade) of bottom-up tests/filters. In our Markov chain algorithm design, the posterior probability, defined by the generative models, is the invariant (target) probability for the Markov chain, and the discriminative probabilities are used to construct proposal probabilities to drive the Markov chain. Intuitively, the bottom-up discriminative probabilities activate top-down generative models. In this chapter, we focus on two types of visual patterns – generic visual patterns, such as texture and shading, and object patterns including human faces and text. These types of patterns compete and cooperate to explain the image and so image parsing unifies image segmentation, object detection, and recognition (if we use generic visual patterns only then image parsing will correspond to image segmentation [48].). We illustrate our algorithm on natural images of complex city scenes and show examples where image segmentation can be improved by allowing object specific knowledge to disambiguate low-level segmentation cues, and conversely where object detection can be improved by using generic visual patterns to explain away shadows and occlusions.
Conference Paper
Full-text available
Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure helps us to not only identify the units that an image or sentence contains but also how they interact to form a whole. We introduce a max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. For segmentation and annotation our algorithm obtains a new level of state-of-the-art performance on the Stanford background dataset (78.1%). The features from the image parse tree outperform Gist descriptors for scene classification by 4%.
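The core operation can be sketched as a composition function that merges two child vectors (image segments or words) into a parent vector and scores the merge; greedy parsing repeatedly applies it to the highest-scoring adjacent pair, and max-margin training pushes gold merges above all alternatives. The weights and dimensions below are placeholders.

```python
import numpy as np

def compose(c1, c2, W, b, w_score):
    """Merge two child representations into a parent and score the merge."""
    parent = np.tanh(W @ np.concatenate([c1, c2]) + b)   # parent representation
    score = float(w_score @ parent)                      # plausibility of this merge
    return parent, score

d = 8
W, b, w_score = np.random.rand(d, 2 * d), np.zeros(d), np.random.rand(d)
left, right = np.random.rand(d), np.random.rand(d)
parent, score = compose(left, right, W, b, w_score)
```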
Conference Paper
Full-text available
Social media sharing web sites like Flickr allow users to annotate images with free tags, which significantly facilitate Web image search and organization. However, the tags associated with an image generally are in a random order without any importance or relevance information, which limits the effectiveness of these tags in search and other applications. In this paper, we propose a tag ranking scheme, aiming to automatically rank the tags associated with a given image according to their relevance to the image content. We first estimate initial relevance scores for the tags based on probability density estimation, and then perform a random walk over a tag similarity graph to refine the relevance scores. Experimental results on a 50,000-image Flickr photo collection show that the proposed tag ranking method is both effective and efficient. We also apply tag ranking to three applications: (1) tag-based image search, (2) tag recommendation, and (3) group recommendation, which demonstrates that the proposed tag ranking approach boosts the performance of social-tagging-related applications.
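A small sketch of the refinement step, assuming a random walk with restart over a row-normalized tag similarity graph; the damping factor and normalization are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def refine_tag_scores(initial, similarity, alpha=0.85, iters=50):
    """At every step a tag inherits relevance from similar tags, blended
    with its initial (density-based) estimate."""
    P = similarity / similarity.sum(axis=1, keepdims=True)   # transition probabilities
    v = initial / initial.sum()                              # restart distribution
    r = v.copy()
    for _ in range(iters):
        r = alpha * (P.T @ r) + (1 - alpha) * v
    return r

initial = np.array([0.5, 0.3, 0.2])               # initial relevance of three tags
similarity = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])
print(refine_tag_scores(initial, similarity))     # refined relevance scores
```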
Conference Paper
Full-text available
The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.
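A minimal usage example (tokenization and part-of-speech tagging); the corresponding models must be downloaded once before first use.

```python
import nltk

# one-time downloads, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "A brown dog plays with a red ball in the park."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tags = nltk.pos_tag(tokens)             # part-of-speech tags, e.g. ('dog', 'NN')
print(tags)
```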
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
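A single step of the now-standard LSTM variant (which adds a forget gate to the original formulation) can be sketched in numpy as follows; the stacked weight matrix and dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: multiplicative gates open and close access to the
    cell state, which carries information across long time lags."""
    z = W @ np.concatenate([x, h_prev]) + b
    d = h_prev.size
    i = sigmoid(z[0*d:1*d])     # input gate
    f = sigmoid(z[1*d:2*d])     # forget gate
    o = sigmoid(z[2*d:3*d])     # output gate
    g = np.tanh(z[3*d:4*d])     # candidate cell update
    c = f * c_prev + i * g      # cell state (constant error carousel)
    h = o * np.tanh(c)          # hidden state exposed to the next layer
    return h, c

x_dim, h_dim = 4, 8
W = np.random.randn(4 * h_dim, x_dim + h_dim) * 0.1
b = np.zeros(4 * h_dim)
h, c = np.zeros(h_dim), np.zeros(h_dim)
for x in np.random.randn(10, x_dim):   # run over a 10-step input sequence
    h, c = lstm_step(x, h, c, W, b)
```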
Article
Full-text available
Human evaluations of machine translation are extensive but expensive. They can take months to finish and involve human labor that cannot be reused.
Conference Paper
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Technical Report
We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations.
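The alignment objective can be sketched, in simplified form, as a max-margin ranking loss over a batch of matched image and sentence embeddings; this generic hinge formulation stands in for the paper's structured objective.

```python
import numpy as np

def ranking_loss(img_emb, sent_emb, margin=0.1):
    """The score of each true image-sentence pair should beat every
    mismatched pair by at least the margin."""
    scores = img_emb @ sent_emb.T        # (N, N) similarity matrix
    diag = np.diag(scores)               # scores of the true pairs
    n = scores.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                loss += max(0.0, margin - diag[i] + scores[i, j])  # wrong sentence for image i
                loss += max(0.0, margin - diag[j] + scores[i, j])  # wrong image for sentence j
    return loss / n

img_emb = np.random.rand(4, 16)    # 4 image embeddings (projected CNN features)
sent_emb = np.random.rand(4, 16)   # 4 sentence embeddings (projected RNN states)
print(ranking_loss(img_emb, sent_emb))
```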
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Conference Paper
This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of the nonlinearity transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting when training deep models. We visualize the evolution of bidirectional LSTM internal states over time and qualitatively analyze how our models "translate" images to sentences. Our proposed models are evaluated on caption generation and image-sentence retrieval tasks with three benchmark datasets: Flickr8K, Flickr30K and MSCOCO. We demonstrate that bidirectional LSTM models achieve highly competitive performance compared to the state-of-the-art on caption generation, even without integrating additional mechanisms (e.g. object detection, attention models, etc.), and significantly outperform recent methods on the retrieval task.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
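The residual unit can be sketched with fully connected layers standing in for the convolutions: the stacked layers learn a residual F(x) and the block outputs F(x) + x through an identity shortcut.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Two-layer residual branch plus an identity shortcut."""
    out = relu(W1 @ x)       # first layer of the residual branch
    out = W2 @ out           # second layer (no activation before the addition)
    return relu(out + x)     # shortcut adds the input back, then activation

d = 16
x = np.random.randn(d)
W1, W2 = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
y = residual_block(x, W1, W2)
```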
Article
This work proposes a method to interpret a scene by assigning a semantic label at every pixel and inferring the spatial extent of individual object instances together with their occlusion relationships. Starting with an initial pixel labeling and a set of candidate object masks for a given test image, we select a subset of objects that explain the image well and have valid overlap relationships and occlusion ordering. This is done by minimizing an integer quadratic program either using a greedy method or a standard solver. Then we alternate between using the object predictions to refine the pixel labels and vice versa. The proposed system obtains promising results on two challenging subsets of the LabelMe and SUN datasets, the largest of which contains 45,676 images and 232 classes.
Article
In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.
Article
This paper proposes a learning-based approach to scene parsing inspired by the deep Recursive Context Propagation Network (RCPN). RCPN is a deep feed-forward neural network that utilizes the contextual information from the entire image, through bottom-up followed by top-down context propagation via random binary parse trees. This improves the feature representation of every super-pixel in the image for better classification into semantic categories. We analyze RCPN and propose two novel contributions to further improve the model. We first analyze the learning of RCPN parameters and discover the presence of bypass error paths in the computation graph of RCPN that can hinder contextual propagation. We propose to tackle this problem by including the classification loss of the internal nodes of the random parse trees in the original RCPN loss function. Secondly, we use an MRF on the parse tree nodes to model the hierarchical dependency present in the output. Both modifications provide performance boosts over the original RCPN and the new system achieves state-of-the-art performance on Stanford Background, SIFT-Flow and Daimler urban datasets.
Article
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
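The fusion step can be sketched as a multimodal layer that projects the current word embedding, the recurrent state and the CNN image feature into a common space and sums them; the projection sizes and the tanh nonlinearity are illustrative choices.

```python
import numpy as np

def multimodal_layer(word_emb, rnn_state, img_feat, Vw, Vr, Vi):
    """Project the three inputs into one space, sum and squash them."""
    return np.tanh(Vw @ word_emb + Vr @ rnn_state + Vi @ img_feat)

m_dim = 32
word_emb, rnn_state, img_feat = np.random.rand(10), np.random.rand(20), np.random.rand(64)
Vw = np.random.rand(m_dim, 10)
Vr = np.random.rand(m_dim, 20)
Vi = np.random.rand(m_dim, 64)
m = multimodal_layer(word_emb, rnn_state, img_feat, Vw, Vr, Vi)
# a softmax over the vocabulary on top of m would give the next-word distribution
```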
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has low memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
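A compact sketch of one Adam update, applied to a toy quadratic objective; the hyper-parameter defaults follow the commonly cited values.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    its element-wise square (v), bias-corrected, set the adaptive step."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(theta) = ||theta||^2 as a toy objective
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print((theta ** 2).sum())   # far below the starting value of 5.0
```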
Conference Paper
We propose a method to identify and localize object classes in images. Instead of operating at the pixel level, we advocate the use of superpixels as the basic unit of a class segmentation or pixel localization scheme. To this end, we construct a classifier on the histogram of local features found in each superpixel. We regularize this classifier by aggregating histograms in the neighborhood of each superpixel and then refine our results further by using the classifier in a conditional random field operating on the superpixel graph. Our proposed method exceeds the previously published state-of-the-art on two challenging datasets: Graz-02 and the PASCAL VOC 2007 Segmentation Challenge.
Article
In this paper, we propose a Hierarchical Image Model (HIM) which parses images to perform segmentation and object recognition. The HIM represents the image recursively by segmentation and recognition templates at multiple levels of the hierarchy. This has advantages for representation, inference, and learning. First, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information (similar to how natural language models represent sentence structure in terms of hierarchical representations such as verb and noun phrases). Second, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which yields the first polynomial time algorithm for image labeling. Third, we learn the HIM efficiently using machine learning methods from a labeled data set. We demonstrate that the HIM is comparable with the state-of-the-art methods by evaluation on the challenging public MSRC and PASCAL VOC 2007 image data sets.
Article
This paper presents a simple attribute graph grammar as a generative representation for man-made scenes, such as buildings, hallways, kitchens, and living rooms, and studies an effective top-down/bottom-up inference algorithm for parsing images in the process of maximizing a Bayesian posterior probability or, equivalently, minimizing a description length (MDL). Given an input image, the inference algorithm computes (or constructs) a parse graph, which includes a parse tree for the hierarchical decomposition and a number of spatial constraints. In the inference algorithm, the bottom-up step detects an excessive number of rectangles as weighted candidates, which are sorted in a certain order and activate top-down predictions of occluded or missing components through the grammar rules. In the experiment, we show that the grammar and top-down inference can largely improve the performance of bottom-up detection.
Conference Paper
We propose a two-class classification model for grouping. Human segmented natural images are used as positive examples. Negative examples of grouping are constructed by randomly matching human segmentations and images. In a preprocessing stage an image is over-segmented into super-pixels. We define a variety of features derived from the classical Gestalt cues, including contour, texture, brightness and good continuation. Information-theoretic analysis is applied to evaluate the power of these grouping cues. We train a linear classifier to combine these features. To demonstrate the power of the classification model, a simple algorithm is used to randomly search for good segmentations. Results are shown on a wide range of images.
Conference Paper
We propose a general framework for parsing images into regions and objects. In this framework, the detection and recognition of objects proceed simultaneously with image segmentation in a competitive and cooperative manner. We illustrate our approach on natural images of complex city scenes where the objects of primary interest are faces and text. This method makes use of bottom-up proposals combined with top-down generative models using the data-driven Markov chain Monte Carlo (DDMCMC) algorithm, which is guaranteed to converge to the optimal estimate asymptotically. More precisely, we define generative models for faces, text, and generic regions, e.g., shading, texture, and clutter. These models are activated by bottom-up proposals. The proposals for faces and text are learnt using a probabilistic version of AdaBoost. The DDMCMC combines reversible jump and diffusion dynamics to enable the generative models to explain the input images in a competitive and cooperative manner. Our experiments illustrate the advantages and importance of combining bottom-up and top-down models and of performing segmentation and object detection/recognition simultaneously.