Pranava Madhyastha's research while affiliated with City, University of London and other places

Publications (70)

Preprint
Full-text available
Providing explanations for visual question answering (VQA) has gained much attention in research. However, most existing systems use separate models for predicting answers and providing explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this,...
Preprint
Full-text available
In this study, we investigate the generalization of LSTM, ReLU and GRU models on counting tasks over long sequences. Previous theoretical work has established that RNNs with ReLU activation and LSTMs have the capacity for counting with suitable configuration, while GRUs have limitations that prevent correct counting over longer sequences. Despite t...
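The counting capacity claimed for ReLU RNNs in this abstract can be illustrated with a hand-configured single-unit cell (a toy sketch under an assumed +1/-1 input encoding, not the paper's experimental setup):

```python
def relu(x):
    return max(0.0, x)

def count_relu_rnn(sequence):
    """Hand-configured ReLU RNN cell: h_t = relu(h_{t-1} + x_t).

    'a' is encoded as +1 and 'b' as -1, so the hidden state acts as an
    unbounded counter and the cell recognises a^n b^n for any n."""
    h = 0.0
    for token in sequence:
        x = 1.0 if token == "a" else -1.0
        h = relu(h + x)
    return h

# The counter generalises to lengths far beyond any "training" range:
for n in (2, 10, 1000):
    assert count_relu_rnn("a" * n + "b" * n) == 0.0  # balanced -> 0
assert count_relu_rnn("aaab") == 2.0                 # surplus of two a's
```

A GRU, by contrast, keeps its hidden state bounded by its gating nonlinearities, which is the intuition behind the limitation over long sequences that the study investigates.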
Article
Recent advances in fake news detection have exploited the success of large-scale pre-trained language models (PLMs). The predominant state-of-the-art approaches are based on fine-tuning PLMs on labelled fake news datasets. However, large-scale PLMs are generally not trained on structured factual data and hence may not possess priors that are ground...
Preprint
Full-text available
Recent advances in fake news detection have exploited the success of large-scale pre-trained language models (PLMs). The predominant state-of-the-art approaches are based on fine-tuning PLMs on labelled fake news datasets. However, large-scale PLMs are generally not trained on structured factual data and hence may not possess priors that are ground...
Preprint
Numerical reasoning based machine reading comprehension is a task that involves reading comprehension along with using arithmetic operations such as addition, subtraction, sorting, and counting. The DROP benchmark (Dua et al., 2019) is a recent dataset that has inspired the design of NLP models aimed at solving this task. The current standings of t...
Preprint
We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing multimodal and multilingual pretrained models VL-BERT and M-BERT, respectively. BERTGEN is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multitask setting. With a...
Preprint
Full-text available
In this paper we present a controlled study on the linearized IRM framework (IRMv1) introduced in Arjovsky et al. (2020). We show that IRMv1 (and its variants) framework can be potentially unstable under small changes to the optimal regressor. This can, notably, lead to worse generalisation to new environments, even compared with ERM which converge...
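The IRMv1 penalty term the study analyses can be sketched numerically for a scalar linear predictor (a minimal illustration of the penalty from Arjovsky et al. (2020), not a reproduction of the paper's stability analysis; the toy environments below are invented):

```python
def env_risk(scale, features, targets):
    """Squared-error risk of the scaled predictor w * f(x) in one environment."""
    return sum((scale * f - y) ** 2 for f, y in zip(features, targets)) / len(features)

def irmv1_penalty(features, targets):
    """IRMv1 gradient penalty: the squared derivative of the environment risk
    with respect to a scalar dummy classifier w, evaluated at w = 1."""
    n = len(features)
    grad = sum(2.0 * (f - y) * f for f, y in zip(features, targets)) / n
    return grad ** 2

# Two toy environments sharing the featurizer f(x) = x:
env_a = ([1.0, 2.0], [1.0, 2.0])  # f is already optimal here -> zero penalty
env_b = ([1.0, 2.0], [2.0, 4.0])  # f is off by a factor of 2 -> positive penalty

assert irmv1_penalty(*env_a) == 0.0
assert irmv1_penalty(*env_b) > 0.0
```

The penalty vanishes only when the shared predictor is simultaneously optimal in every environment, which is why small perturbations to the optimal regressor can move the objective sharply.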
Article
Full-text available
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The l...
Article
Full-text available
We propose multimodal machine translation (MMT) approaches that exploit the correspondences between words and image regions. In contrast to existing work, our referential grounding method considers objects as the visual unit for grounding, rather than whole images or abstract image regions, and performs visual grounding in the source language, rath...
Preprint
Full-text available
Reasoning about information from multiple parts of a passage to derive an answer is an open challenge for reading-comprehension models. In this paper, we present an approach that reasons about complex questions by decomposing them to simpler subquestions that can take advantage of single-span extraction reading-comprehension models, and derives the...
Preprint
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images ar...
Preprint
This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the tex...
Article
Collecting textual descriptions is an especially costly task for dense video captioning, since each event in the video needs to be annotated separately and a long descriptive paragraph needs to be provided. In this paper, we investigate a way to mitigate this heavy burden and propose to leverage captions of visually similar images as auxiliary cont...
Preprint
Full-text available
Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lin...
Preprint
Full-text available
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The l...
Preprint
Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and pa...
Preprint
Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addi...
Preprint
Full-text available
Public opinion influences events, especially related to stock market movement, in which a subtle hint can influence the local outcome of the market. In this paper, we present a dataset that allows for company-level analysis of tweet based impact on one-, two-, three-, and seven-day stock returns. Our dataset consists of 862,231 labelled instances...
Article
Full-text available
Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, the state-of-the-art systems are inherently unimodal, in the sense that they take a single modality...
Conference Paper
Full-text available
Current Automatic Text Simplification (TS) work relies on sequence-to-sequence neural models that learn simplification operations from parallel complex-simple corpora. In this paper we address three open challenges in these approaches: (i) avoiding unnecessary transformations, (ii) determining which operations to perform, and (iii) generating simpl...
Preprint
Full-text available
This paper describes the Imperial College London team's submission to the 2019 VATEX video captioning challenge, where we first explore two sequence-to-sequence models, namely a recurrent (GRU) model and a transformer model, which generate captions from the I3D action features. We then investigate the effect of dropping the encoder and the attenti...
Preprint
Full-text available
In this paper, we focus on quantifying model stability as a function of random seed by investigating the effects of the induced randomness on model performance and the robustness of the model in general. We specifically perform a controlled study on the effect of random seeds on the behaviour of attention, gradient-based and surrogate model based (...
Preprint
Full-text available
Recent literature shows that large-scale language modeling provides excellent reusable sentence representations with both recurrent and self-attentive architectures. However, there has been less clarity on the commonalities and differences in the representational properties induced by the two architectures. It also has been shown that visual inform...
Preprint
Full-text available
We address the task of text translation on the How2 dataset using a state-of-the-art transformer-based multimodal approach. The question we ask ourselves is whether visual features can support the translation process; in particular, since this dataset is extracted from videos, we focus on the translation of actions, which we believe are poor...
Preprint
Full-text available
We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The me...
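A crude approximation of the idea behind VIFIDEL, matching detected object labels against description words in an embedding space, can be sketched as follows (the tiny embedding table is invented for illustration; the actual metric uses pretrained word vectors and a more careful matching scheme):

```python
import math

# Hypothetical toy embeddings; VIFIDEL itself uses pretrained word vectors.
EMB = {
    "dog":   [1.0, 0.1, 0.0],
    "puppy": [0.9, 0.2, 0.0],
    "ball":  [0.0, 1.0, 0.1],
    "car":   [0.0, 0.0, 1.0],
}

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def faithfulness(object_labels, caption_words):
    """Score each detected object label by its best match among the caption
    words, then average: a crude proxy for an embedding-based faithfulness
    metric. Assumes every word appears in the toy embedding table."""
    scores = []
    for label in object_labels:
        best = max(cos(EMB[label], EMB[w]) for w in caption_words if w in EMB)
        scores.append(best)
    return sum(scores) / len(scores)

# A caption mentioning "puppy" is scored nearly as faithful to a detected
# "dog" as one mentioning "dog" itself, while "car" scores poorly:
assert faithfulness(["dog", "ball"], ["puppy", "ball"]) > faithfulness(["dog", "ball"], ["car"])
```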
Preprint
Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where image...
Preprint
Full-text available
Explaining and interpreting the decisions of recommender systems are becoming extremely relevant, both for improving predictive performance and for providing valid explanations to users. While most of the recent interest has focused on providing local explanations, there has been a much lower emphasis on studying the effects of model dynamics and its...
Conference Paper
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the ge...
Preprint
Full-text available
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the ge...
Preprint
Full-text available
One third of stroke survivors have language difficulties. Emerging evidence suggests that their likelihood of recovery depends mainly on the damage to language centers. Thus previous research for predicting language recovery post-stroke has focused on identifying damaged regions of the brain. In this paper, we introduce a novel method where we only...
Preprint
An increasing number of datasets contain multiple views, such as video, sound and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further complicated by the existence of a latent alignment between views, such as between speech and its transcription, and by t...
Preprint
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn `distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the `image' side of image c...
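The "distributional similarity" hypothesis suggests a simple baseline: caption a test image by copying the caption of its nearest training image in the shared feature space. A minimal sketch, with invented feature vectors standing in for CNN embeddings:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def nearest_neighbour_caption(test_feat, training_set):
    """Caption a test image by retrieving the caption of its nearest
    training image in feature space (a baseline for the hypothesis, not
    the trained end-to-end systems the paper probes)."""
    best = max(training_set, key=lambda item: cosine(test_feat, item[0]))
    return best[1]

# Hypothetical (feature, caption) pairs:
train = [
    ([0.9, 0.1, 0.0], "a dog playing with a ball"),
    ([0.0, 0.2, 0.9], "a red car on the street"),
]
assert nearest_neighbour_caption([0.8, 0.2, 0.1], train) == "a dog playing with a ball"
```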
Preprint
We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect lingu...
Preprint
The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic inf...
Article
Tasks that require modeling of both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural-network-based mod...
Conference Paper
We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect lingu...
Conference Paper
The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic info...
Conference Paper
Full-text available
Neural Machine Translation (NMT) has recently demonstrated improved performance over statistical machine translation and relies on an encoder-decoder framework for translating text from source to target. The structure of NMT makes it amenable to add auxiliary features, which can provide complementary information to that present in the source text....
Conference Paper
This paper describes the University of Sheffield’s submission to the WMT17 Multimodal Machine Translation shared task. We participated in Task 1 to develop an MT system to translate an image description from English to German and French, given its corresponding image. Our proposed systems are based on the state-of-the-art Neural Machine Translation...
Article
Full-text available
Recent work on multimodal machine translation has attempted to address the problem of producing target language image descriptions based on both the source language description and the corresponding image. However, existing work has not been conclusive on the contribution of visual information. This paper presents an in-depth study of the problem b...
Conference Paper
Recent work on multimodal machine translation has attempted to address the problem of producing target language image descriptions based on both the source language description and the corresponding image. However, existing work has not been conclusive on the contribution of visual information. This paper presents an in-depth study of the problem b...
Article
Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when the system is used on a different domain than the one where it was trained. In order to alleviate the problem, we propose to use a log-bilinear softmax-based model for vocabulary expansion, such that given an out-of-vocabulary source wor...
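A log-bilinear softmax over a target vocabulary can be sketched in a few lines (hypothetical two-dimensional bilingual embeddings; the paper learns an interaction matrix, which is taken to be the identity here for brevity):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def translate_oov(source_vec, target_vocab):
    """Score each in-vocabulary target word by a dot product with the OOV
    source word's embedding, normalise with a softmax, and return the
    highest-probability candidate. (Identity interaction matrix assumed.)"""
    words = list(target_vocab)
    scores = [sum(a * b for a, b in zip(source_vec, target_vocab[w])) for w in words]
    probs = softmax(scores)
    return max(zip(words, probs), key=lambda wp: wp[1])

# Hypothetical shared bilingual embedding space:
vocab = {"Hund": [1.0, 0.0], "Auto": [0.0, 1.0]}
word, prob = translate_oov([0.9, 0.1], vocab)
assert word == "Hund" and prob > 0.5
```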
Conference Paper
Full-text available
We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm which is able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge, as there will be a large number of classes for which only a few examples are available. We propose handling...
Conference Paper
Full-text available
We consider the supervised training setting in which we learn task-specific word embeddings. We assume that we start with initial embeddings learned from unlabelled data and update them to learn taskspecific embeddings for words in the supervised training data. However, for new words in the test set, we must use either their initial embeddings or a...
Article
Full-text available
We consider the setting in which we train a supervised model that learns task-specific word representations. We assume that we have access to some initial word representations (e.g., unsupervised embeddings), and that the supervised learning procedure updates them to task-specific representations for words contained in the training data. But what a...
Article
Full-text available
We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm which is able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge, as there will be a large number of classes for which only a few examples are available. We propose handling...
Article
Full-text available
We investigate the problem of inducing word embeddings that are tailored for a particular bilexical relation. Our learning algorithm takes an existing lexical vector space and compresses it such that the resulting word embeddings are good predictors for a target bilexical relation. In experiments we show that task-specific embeddings can benefit bo...
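The compression idea, projecting an existing vector space so that dot products in the compressed space predict a target bilexical relation, can be sketched with a fixed projection (in the paper the projection is learned; the embeddings and matrix below are invented):

```python
def project(vec, matrix):
    """Compress a word vector with a low-rank projection matrix."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def relation_score(head, modifier, proj):
    """Score a bilexical pair (e.g. adjective-noun) as a dot product in the
    compressed space, in the spirit of relation-specific embedding learning."""
    h, m = project(head, proj), project(modifier, proj)
    return sum(a * b for a, b in zip(h, m))

# Hypothetical 3-d embeddings compressed to 2 dimensions; the projection
# drops the third, relation-irrelevant axis.
P = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
noun  = [0.8, 0.3, 0.9]
adj_a = [0.7, 0.4, -0.9]   # compatible modifier
adj_b = [-0.7, -0.4, 0.9]  # incompatible modifier
assert relation_score(noun, adj_a, P) > relation_score(noun, adj_b, P)
```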

Citations

... The following issues plague the two existing approaches to constructing verb-argument knowledge: (1) In most semantic framework dictionaries, verb-argument knowledge is represented by a combination of argument roles, semantic classes, and verbs, such as 施事(agent) + 把(ba) + 受事(patient) + v, human + 吃(eat) + food, with no scaled representation of language instances that can act as arguments (just listing a few instances). (2) Due to limitations in the size and domain of the argument-labeling treebank, high-frequency verbs are labeled repeatedly while the argument roles of low-frequency verbs are rarely described. Simultaneously, the scale of the argument-role labeling treebank is typically kept at around tens of thousands of sentences, and the coverage of verbs and their argument roles is relatively limited. ...
... Lately, much attention has been paid to training Transformers to learn how to reason (Helwe et al., 2021; Al-Negheimish et al., 2021; Storks et al., 2019; Gontier et al., 2020). This is usually done by embedding an algebraic reasoning problem in a natural language formulation. ...
... In [17], a method was proposed for extracting action sequences from NMT architectures, which were later used with sentence pairs in imitation learning to learn an optimal policy. Recently, reinforcement learning was used in multimodal translation [18], utilizing text and visual data to improve the quality of translations. ...
... Translation: "Les lunettes sont cassées." ("The glasses are broken.") Figure 1: Visual context resolving the ambiguity of the English word glasses for English-to-French translation. Caglayan et al., 2021; Li et al., 2022). It has typically been difficult to surpass strong text-only baselines, the image modality often being ignored (Wu et al., 2021). ...
... UNITER (Chen et al., 2020) is extended by initializing the text encoder with mBERT and XLM-R, yielding mUNITER and xUNITER respectively. A similar approach is adopted by Mitzalis et al. (2021) to propose BERTGEN by fusing VL-BERT with M-BERT initialization. Specifically, it is demonstrated successfully for the task of MMT, where unrolling is used as masking to create the next example and self-attention is performed at every time step. ...
... as a loss function/objective function in the latent variable space. In contrast, in general object recognition tasks, an object is considered to have a very different appearance in an image, depending on the viewing conditions [19]. ...
... The model will still be encouraged to produce a single output with high average scores to nearby references, which will be maximized at a smoothed mode in the training text distribution. Failure modes of other methods of aggregation are discussed in both Caglayan et al. [8] and Yeh et al. [46], including issues with multi-modal reference distributions and single outlier texts. ...
... translation (MT). Calixto et al. (2019) propose a latent variable model for multi-modal MT, to learn an association between an image and its target language description. Long et al. (2021) and Li et al. (2022) first synthesize an image conditioned on the source sentence, then use both the source sentence and the synthesized image to produce the translation. Caglayan et al. (2020) obtain a lower latency in simultaneous MT by supplying visual context. Differently, Vokenization (Tan and Bansal, 2020) extends BERT ...
... However, in the case of no simplification (P0), the system has not simplified 9 sentences (which, as mentioned before, were discarded from evaluation), but only two of them were correctly unaltered. This result suggests that preprocessing should be applied before performing any simplification step, as proposed by Scarton et al. [51]. ...
... For instance, Xu et al. (2015) show that, by using attention, the model can use different regions of the image while performing image captioning. More recent work shows that bounding boxes (Ren et al., 2015), a discrete variant of attention over images, improve the representation and hence the performance of different tasks such as VQA (Anderson et al., 2018), image captioning (Yin and Ordonez, 2017) and machine translation (Specia et al., 2020). In this work, we apply this methodology to multimodal ASR (see Section 3.4). ...