... [42] reported a comparison of a context-aware LSTM captioner and a co-attentive discriminator for image captioning. Other approaches include [72] question features and image features, [4] the parsing tree StructCap, [23] a sequence-to-sequence framework, [74] a dual temporal modal, [17] Image-Text Surgery, [2] attribute-driven attention, [11] a generative recurrent neural network, and [85] MLAIC for better representation. There are also [31] text-guided attention, [3] reference-based LSTM, [7] an adversarial neural network, [80] high-dimensional attentions, [70] coarse-to-fine skeleton sentences, [8] specific styles, [5] structural relevance and structural diversity, [34] multimodal attention, [21] captions for popular brands, [32] diversified captions, [9] stylish captions, [48] sub-categorical styles, and [78] personalized captions. Further work covers [83] actor-critic reinforcement learning, [16] scene-specific attention contexts, [46] a policy network for captions, [35] reinforcement-learning-based training, [10] distinguishing between similar kinds for diversity, [33] improved correctness of attention in the image, [37] adaptivity of attention, [68] a combination of computer vision and machine translation, [84] an adaptive re-weight loss function, and [43] personalized captioning. Still other studies address [73] high-level semantic concepts, [69] combinations of visual features and machine translation attention, [19] different caption styles, [24] shifting attention, [27] characteristics of text-based representations, [44] variational autoencoder representation, [49] dependency-tree embedding, [65] character-level language modeling, [66] fixed-dimension representation, [30] 3-dimensional convolutional networks, [67] human judgments, out-of-domain data handling, [82] semantic attention, [18] the Semantic Compositional Network (SCN), [20] localizing and segmenting objects, [22] extra semantic attention, [28] content planning and recognition algorithms, [29] a new tree-based approach to composing expressive image descriptions, [40] a transposed weight
sharing scheme, [41] different emotions and sentiments, and [77] an approach in which nouns, verbs, scenes and prepositions are used to structure the sentence. ...