Conference Paper

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

... However, existing VL-NLE models suffer from at least one of the following issues: First, the models incorporate separate answer prediction and NLE generation modules (e.g., FME [31], e-UG [17]), which leads to inferior NLE generation performance due to the loose integration between the two modules. Second, their backbone models are often pretrained on a limited set of tasks (e.g., VL-T5 [29]), not fully exploiting the potential of multi-task learning with a unified architecture. Third, they incorporate ad hoc solutions using additional task-specific modules or additional resources to increase their performance on the datasets at hand (e.g., NLX-GPT [38], RExC [26]), which does not allow them to be used as omnipotent models, a vision of current AI research [35,43]. ...
... NLX-GPT, however, is not trained on diverse tasks, so it needs to resort to additional task-specific modules (e.g., a concept-detection module for e-SNLI-VE or a bounding-box projection module for VCR) to improve its performance. VL-T5 and VL-BART are introduced in [29] and extend the Transformer-based unimodal language models BART-Base [20] and T5-Base [34] with visual inputs (i.e., bounding box features from Faster R-CNN [36]). Like NLX-GPT, the two models unify answer prediction and explanation generation. ...
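As a rough sketch of that design (my own simplification, not the VL-T5 release), the box features produced by the detector can be projected to the language model's hidden size and prepended to the token embeddings as extra "visual tokens":

```python
# Minimal sketch (assumed, not the VL-T5 authors' implementation) of feeding
# Faster R-CNN bounding-box features into a text Transformer encoder.
import torch
import torch.nn as nn

hidden_size = 768
num_regions, feat_dim = 36, 2048                  # 36 boxes of 2048-d features is a common choice
batch_size, seq_len, vocab_size = 2, 20, 32128    # illustrative numbers

visual_proj = nn.Linear(feat_dim, hidden_size)    # maps box features into the text embedding space
token_embed = nn.Embedding(vocab_size, hidden_size)

region_feats = torch.randn(batch_size, num_regions, feat_dim)   # stand-in for detector output
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

visual_tokens = visual_proj(region_feats)                        # (2, 36, 768)
text_tokens = token_embed(token_ids)                             # (2, 20, 768)
encoder_input = torch.cat([visual_tokens, text_tokens], dim=1)   # (2, 56, 768)
print(encoder_input.shape)
```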
... Following previous work on natural language explanations in vision-language settings [17,29,38], we focus our experiments on the VQA-X [31], e-SNLI-VE [17], and VCR [47] datasets and the tasks associated with each. In addition, inspired by the OFA model's multi-task pretraining, we introduce a unified explanation task that combines the three datasets and tasks into one challenge designed to assess the model's multi-task performance. ...
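A unified explanation task of this kind could, for instance, be assembled by tagging each example with a task prefix before pooling; the prefixes and toy examples below are illustrative assumptions, not the paper's data:

```python
# Hypothetical sketch of pooling VQA-X, e-SNLI-VE, and VCR examples into one
# multi-task stream; prefixes and examples are my own, not the paper's.
import random

def make_example(task, text_input, answer, explanation):
    """Serialize one example as a text-to-text pair carrying a task prefix."""
    return {"source": f"{task}: {text_input}",
            "target": f"{answer} because {explanation}"}

pool = [
    make_example("vqa-x", "What sport is shown?", "tennis",
                 "the player is holding a racket on a court"),
    make_example("e-snli-ve", "hypothesis: A man is sleeping.", "contradiction",
                 "the man in the image is clearly running"),
    make_example("vcr", "Why is [person1] smiling?", "she won the game",
                 "the scoreboard shows her team ahead"),
]

random.shuffle(pool)   # mix tasks so each training batch sees all three
for ex in pool:
    print(ex["source"], "->", ex["target"])
```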
Preprint
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models. While current models offer impressive performance on task accuracy and explanation plausibility, they suffer from a range of issues: Some models feature a modular design where the explanation generation module is poorly integrated with a separate module for task-answer prediction, employ backbone models trained on limited sets of tasks, or incorporate ad hoc solutions to increase performance on single datasets. We propose to evade these limitations by applying recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks. Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth in two out of three evaluated datasets. As a novel challenge in VL-NLE research, we propose the problem of multi-task VL-NLE and show that jointly training on multiple tasks can increase the explanation quality. We discuss the ethical implications of high-quality NLE generation and other issues in recent VL-NLE research.
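One way to realize such tight integration is to let the decoder emit the task answer and its explanation as a single string and split it afterwards; the template and separator below are assumed for illustration, not taken from the paper:

```python
# Illustrative serialization of a VL-NLE target for a single generative model;
# the "because" separator is an assumption, not the paper's exact format.
def build_target(answer, explanation):
    return f"{answer} because {explanation}"

def parse_generation(text):
    """Split a generated string back into (answer, explanation)."""
    answer, _, explanation = text.partition(" because ")
    return answer.strip(), explanation.strip()

generated = build_target("tennis", "the player is swinging a racket on a court")
print(parse_generation(generated))
# ('tennis', 'the player is swinging a racket on a court')
```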
... MDETR [11], 12-in-1 [12], and VL-T5 [13] utilize vision-language pretraining [14,15] for enhancing the understanding of multi-modality [16,17]. In addition, although there are some works [18,19,20,21,22,23] that explore self-explaining neural networks on NLP tasks and image captioning, we focus on constraining and improving pre-trained vision-language models for reasoning in compositional VQA. ...
... We leave it to future work to investigate the impact of artifacts in training data on the effect of rationales. The differences between NLI and CQA also suggest that evaluations based solely on NLI may not cleanly transfer to other tasks; this finding provides further evidence that the benefits of rationales are task-dependent (Carton et al., 2020; Palaskar et al., 2022) and that evaluations on one task such as NLI alone are not comprehensive enough to draw general conclusions about the utility of rationales. ...
Preprint
Rationalization is fundamental to human reasoning and learning. NLP models trained to produce rationales along with predictions, called self-rationalization models, have been investigated for their interpretability and utility to end-users. However, the extent to which training with human-written rationales facilitates learning remains an under-explored question. We ask whether training models to self-rationalize can aid in their learning to solve tasks for the right reasons. Specifically, we evaluate how training self-rationalization models with free-text rationales affects robustness to spurious correlations in fine-tuned encoder-decoder and decoder-only models of six different sizes. We evaluate robustness to spurious correlations by measuring performance on 1) manually annotated challenge datasets and 2) subsets of original test sets where reliance on spurious correlations would fail to produce correct answers. We find that while self-rationalization can improve robustness to spurious correlations in low-resource settings, it tends to hurt robustness in higher-resource settings. Furthermore, these effects depend on model family and size, as well as on rationale content. Together, our results suggest that explainability can come at the cost of robustness; thus, appropriate care should be taken when training self-rationalizing models with the goal of creating more trustworthy models.
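For context, self-rationalization models of this kind are commonly trained on text-to-text pairs in the spirit of WT5; the exact prefixes below are assumptions for illustration, not this preprint's format:

```python
# Hedged sketch of WT5-style input/target formatting for self-rationalization;
# the "explain" prefix and field layout are assumptions.
def format_self_rationalization(task, premise, hypothesis, label, rationale):
    source = f"explain {task} premise: {premise} hypothesis: {hypothesis}"
    target = f"{label} explanation: {rationale}"
    return source, target

src, tgt = format_self_rationalization(
    "nli",
    "A dog is chasing a ball in the park.",
    "An animal is outdoors.",
    "entailment",
    "a dog is an animal and a park is outdoors",
)
print(src)
print(tgt)
```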
Article
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, where three pre-training tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-Linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks and show the powerful ability of cross-modal pre-training.
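A minimal sketch of how the three objectives could be combined into one pretraining loss; the label conventions and equal weighting are assumptions, not the authors' implementation:

```python
# Assumed combination of the three Unicoder-VL-style objectives into one loss.
import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, moc_logits, moc_labels, vlm_logits, vlm_labels):
    # Masked Language Modeling: recover masked text tokens (-100 marks unmasked positions).
    mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels, ignore_index=-100)
    # Masked Object Classification: predict the detector class of masked regions.
    moc = F.cross_entropy(moc_logits.transpose(1, 2), moc_labels, ignore_index=-100)
    # Visual-Linguistic Matching: binary "does this caption describe this image?".
    vlm = F.binary_cross_entropy_with_logits(vlm_logits, vlm_labels.float())
    return mlm + moc + vlm   # equal weighting is an assumption

# Toy shapes: 2 captions of 12 tokens (vocab 30522) and 36 regions (1600 classes).
mlm_logits = torch.randn(2, 12, 30522)
mlm_labels = torch.full((2, 12), -100)
mlm_labels[:, 3] = 42                    # one masked token per caption
moc_logits = torch.randn(2, 36, 1600)
moc_labels = torch.full((2, 36), -100)
moc_labels[:, 0] = 7                     # one masked region per image
vlm_logits = torch.randn(2)
vlm_labels = torch.tensor([1, 0])
print(pretraining_loss(mlm_logits, mlm_labels, moc_logits, moc_labels, vlm_logits, vlm_labels))
```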
Article
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.
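The relationships in the carriage example can be thought of as scene-graph triples; the toy data structure below is illustrative, not Visual Genome's actual JSON schema:

```python
# Toy sketch of Visual Genome-style relationship triples; field names are mine.
from dataclasses import dataclass

@dataclass
class Relationship:
    subject: str
    predicate: str
    obj: str

# The carriage example from the abstract, expressed as two triples.
scene_graph = [
    Relationship("man", "riding", "carriage"),
    Relationship("horse", "pulling", "carriage"),
]

def answer_vehicle_question(graph):
    """Answer 'What vehicle is the person riding?' by walking the triples."""
    for rel in graph:
        if rel.subject == "man" and rel.predicate == "riding":
            return rel.obj
    return None

print(answer_vehicle_question(scene_graph))   # carriage
```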
Article
Books are a rich source of both fine-grained information, what a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
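A minimal sketch of the underlying alignment idea, with random stand-ins for the learned sentence and clip embeddings (the paper learns these with a sentence encoder and a video-text embedding, and adds a context-aware CNN on top):

```python
# Toy alignment by cosine similarity in a shared embedding space; embeddings
# here are random placeholders, not the learned representations from the paper.
import numpy as np

rng = np.random.default_rng(0)
sentence_emb = rng.standard_normal((1000, 300))   # one row per book sentence
clip_emb = rng.standard_normal((50, 300))         # one row per movie clip

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between every clip and every sentence.
sim = l2_normalize(clip_emb) @ l2_normalize(sentence_emb).T   # (50, 1000)
best_sentence = sim.argmax(axis=1)                            # nearest sentence per clip
print(best_sentence[:5])
```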
Conference Paper
Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.
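A hedged sketch of cloze-style probing in this spirit, using the Hugging Face fill-mask pipeline rather than the released LAMA code at the URL above:

```python
# Query a pretrained masked language model with a cloze statement and inspect
# its top fillers (a stand-in for the paper's probing setup, not its codebase).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-cased")
for result in unmasker("Dante was born in [MASK]."):
    print(f"{result['token_str']:>12}  {result['score']:.3f}")
```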
Conference Paper
In order for machine learning to garner widespread public adoption, models must be able to provide interpretable and robust explanations for their decisions, as well as learn from human-provided explanations at train time. In this work, we extend the Stanford Natural Language Inference dataset with an additional layer of human-annotated natural language explanations of the entailment relations. We further implement models that incorporate these explanations into their training process and output them at test time. We show how our corpus of explanations, which we call e-SNLI, can be used for various goals, such as obtaining full sentence justifications of a model's decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets. Our dataset thus opens up a range of research directions for using natural language explanations, both for improving models and for asserting their trust.
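An invented e-SNLI-style record illustrating the extra annotation layer and one common way to fold the explanation into a generation target (the example and the "because" template are mine, not rows from the corpus):

```python
# Made-up e-SNLI-style record: an SNLI pair plus a free-text explanation that
# a model can learn to emit alongside the label at test time.
example = {
    "premise": "A woman is playing a violin on stage.",
    "hypothesis": "A musician is performing.",
    "label": "entailment",
    "explanation": "Someone playing a violin on stage is a musician performing.",
}

# Fold the explanation into the training target for a generative model.
target = f"{example['label']} because {example['explanation'].lower()}"
print(target)
```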
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
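The core operation, scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, can be written in a few lines (a bare-bones sketch without multiple heads, masking, or projections):

```python
# Scaled dot-product attention: weight the values by how well each query
# matches each key, scaled by sqrt(d_k) to keep the softmax well-behaved.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)          # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    return weights @ V                                        # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))    # 4 positions, d_k = 8
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)
```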
Article
We propose to use the visual denotations of linguistic expressions (i.e. the set of images they describe) to define novel denotational similarity metrics, which we show to be at least as beneficial as distributional similarities for two tasks that require semantic inference. To compute these denotational similarities, we construct a denotation graph, i.e. a subsumption hierarchy over constituents and their denotations, based on a large corpus of 30K images and 150K descriptive captions.
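A toy sketch of denotational similarity with invented image IDs; Jaccard overlap is just one simple choice of set-overlap metric, not necessarily the one used in the paper:

```python
# Each expression's denotation is the set of images it describes; two
# expressions are denotationally similar when those sets overlap.
def denotational_similarity(images_a, images_b):
    """Jaccard overlap between two denotations (one simple choice of metric)."""
    inter = len(images_a & images_b)
    union = len(images_a | images_b)
    return inter / union if union else 0.0

denotation = {
    "a dog running": {"img_12", "img_48", "img_90"},
    "an animal outdoors": {"img_12", "img_48", "img_90", "img_200"},
    "a man cooking": {"img_7"},
}

print(denotational_similarity(denotation["a dog running"],
                              denotation["an animal outdoors"]))   # high overlap
print(denotational_similarity(denotation["a dog running"],
                              denotation["a man cooking"]))        # no overlap
```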
Article
We created the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) in 2014 as part of the Yahoo Webscope program, which is a reference library of interesting and scientifically useful datasets. The YFCC100M is the largest public multimedia collection ever released, with a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all uploaded to Flickr between 2004 and 2014 and published under a CC commercial or noncommercial license. The dataset is distributed through Amazon Web Services as a 12.5GB compressed archive containing only metadata. However, as with many datasets, the YFCC100M is constantly evolving; over time, we have released and will continue to release various expansion packs containing data not yet in the collection; for instance, the actual photos and videos, as well as several visual and aural features extracted from the data, have already been uploaded to the cloud, ensuring the dataset remains accessible and intact for years to come. The YFCC100M dataset overcomes many of the issues affecting existing multimedia datasets in terms of modalities, metadata, licensing, and, principally, volume.
Article
Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
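A brief usage sketch with torchvision's pretrained Faster R-CNN, which implements the RPN + Fast R-CNN design described above (not the authors' original release; older torchvision versions use pretrained=True instead of the weights argument):

```python
# Run a pretrained Faster R-CNN and keep confident detections.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    output = model([image])[0]            # dict with boxes, labels, scores

keep = output["scores"] > 0.8             # keep confident detections only
print(output["boxes"][keep], output["labels"][keep])
```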
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
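A short sketch of reading the per-instance annotations with the official pycocotools API; the annotation file path is a placeholder:

```python
# Iterate over the instance annotations of one COCO image.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")   # placeholder path
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=img_id)
for ann in coco.loadAnns(ann_ids):
    category = coco.loadCats(ann["category_id"])[0]["name"]
    print(category, ann["bbox"])                     # [x, y, width, height]
```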
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877-1901. Curran Associates, Inc.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning UNiversal Image-TExt Representations. In Proceedings of the European Conference on Computer Vision (ECCV).
Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning. PMLR.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 13042-13054.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR 2021). OpenReview.net.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB dataset of diverse text for language modeling. CoRR, abs/2101.00027.
Liangke Gui, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. 2022a. Training Vision-Language Transformers from Captions Alone. arXiv preprint.
Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. 2022. Towards general purpose vision systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision (ECCV).
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems (NeurIPS).
Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. CoRR, abs/2004.14546.
Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, and Yejin Choi. 2020. Visual Commonsense Graphs: Reasoning about the dynamic context of a still image. In Proceedings of the European Conference on Computer Vision (ECCV).
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748-8763. PMLR.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2022. How much can CLIP benefit vision-and-language tasks? In International Conference on Learning Representations (ICLR).