... [8] reported a comparison of a context-aware LSTM captioner and a co-attentive discriminator for image captioning. Other work includes [9], which used question features and image features, [11] the parsing-tree StructCap, [12] a sequence-to-sequence framework, [13] a dual temporal modal, Image-Text Surgery in [14], [15] attribute-driven attention, [16] a generative recurrent neural network, and [17] MLAIC for better representation. There are also [18] text-guided attention, [19] reference-based LSTM, [21] an adversarial neural network, high-dimensional attentions [22], [23] coarse-to-fine skeleton sentences, [24] specific styles, [25] structural relevance and structural diversity, multimodal attention [26], [27] popular brand captions, [28] diversified captions, [29] stylish captions, [30] sub-categorical styles, and [31] personalized captions. Further, [32] studied actor-critic reinforcement learning, [33] scene-specific attention contexts, [34] a policy network for captions, [35] reinforcement-learning-based training, [36] distinguishing between similar kinds for diversity, [37] improved correctness of attention in the image, [39] adaptivity for attention, [40] a combination of computer vision and machine translation, [42] an adaptive re-weighted loss function, [43] personalized captioning, [46] high-level semantic concepts, [47] combinations of visual features and machine translation attention, [56] different caption styles, [57] shifting attention, [58] characteristics of text-based representations, [62] a variational autoencoder representation, [63] dependency tree embeddings, [64] character-level language modeling, [65] a fixed-dimension representation, [66] 3-dimensional convolutional networks, [67] human judgments, out-of-domain data handling, [70] semantic attention, [73] the Semantic Compositional Network (SCN), [74] localizing and segmenting objects, [76] extra semantic attention, [78] content planning and recognition algorithms, [80] a new tree-based approach to composing expressive image descriptions, [81] a transposed weight sharing scheme, [82] different emotions and sentiments, and [85], where nouns, verbs, scenes and prepositions are used to structure the sentence. ...