Pranava Madhyastha's research while affiliated with City, University of London and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (70)
Providing explanations for visual question answering (VQA) has gained much attention in research. However, most existing systems use separate models for predicting answers and providing explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this,...
In this study, we investigate the generalization of LSTM, ReLU and GRU models on counting tasks over long sequences. Previous theoretical work has established that RNNs with ReLU activation and LSTMs have the capacity for counting with suitable configuration, while GRUs have limitations that prevent correct counting over longer sequences. Despite t...
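As background for the capacity claim above, here is a minimal sketch of the textbook single-unit counting construction for a ReLU RNN, with hand-set weights rather than anything trained (the paper studies whether trained networks recover this behaviour):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def count_with_relu_rnn(sequence):
    """Single-unit ReLU RNN whose hidden state tracks a running count.

    Hand-set weights: h_t = ReLU(h_{t-1} + x_t), where x_t is +1 for an
    'a' and -1 for a 'b'. This is the standard counting construction,
    not a learned model.
    """
    h = 0.0
    for token in sequence:
        x = 1.0 if token == "a" else -1.0
        h = relu(h + x)  # recurrent weight 1, input weight 1
    return h

# On a^n b^n the final state returns to 0 for any n, including lengths
# far beyond what a trained model would have seen.
print(count_with_relu_rnn("aaabbb"))  # 0.0
print(count_with_relu_rnn("aaaabb"))  # 2.0 (two unmatched a's)
```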
Recent advances in fake news detection have exploited the success of large-scale pre-trained language models (PLMs). The predominant state-of-the-art approaches are based on fine-tuning PLMs on labelled fake news datasets. However, large-scale PLMs are generally not trained on structured factual data and hence may not possess priors that are ground...
Numerical-reasoning-based machine reading comprehension is a task that combines reading comprehension with arithmetic operations such as addition, subtraction, sorting, and counting. The DROP benchmark (Dua et al., 2019) is a recent dataset that has inspired the design of NLP models aimed at solving this task. The current standings of t...
We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing multimodal and multilingual pretrained models VL-BERT and M-BERT, respectively. BERTGEN is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multitask setting. With a...
In this paper we present a controlled study on the linearized IRM framework (IRMv1) introduced in Arjovsky et al. (2020). We show that the IRMv1 framework (and its variants) can be unstable under small changes to the optimal regressor. This can, notably, lead to worse generalisation to new environments, even compared with ERM which converge...
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The l...
We propose multimodal machine translation (MMT) approaches that exploit the correspondences between words and image regions. In contrast to existing work, our referential grounding method considers objects as the visual unit for grounding, rather than whole images or abstract image regions, and performs visual grounding in the source language, rath...
Reasoning about information from multiple parts of a passage to derive an answer is an open challenge for reading-comprehension models. In this paper, we present an approach that reasons about complex questions by decomposing them to simpler subquestions that can take advantage of single-span extraction reading-comprehension models, and derives the...
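A toy sketch of the decomposition idea above, where `decompose` and `extract_span` are hypothetical stand-ins for the paper's components (hard-coded here for a single comparison-style question):

```python
def decompose(question):
    """Hypothetical decomposer: split a comparison question into
    single-span subquestions (hard-coded for illustration)."""
    return ["How many points did X score?", "How many points did Y score?"]

def extract_span(subquestion, passage):
    """Stand-in for a single-span reading-comprehension model."""
    answers = {"How many points did X score?": 21,
               "How many points did Y score?": 14}
    return answers[subquestion]

def answer(question, passage):
    a, b = (extract_span(q, passage) for q in decompose(question))
    return "X" if a > b else "Y"  # combine subanswers via comparison

print(answer("Who scored more points, X or Y?", "..."))  # X
```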
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images ar...
This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the tex...
Collecting textual descriptions is an especially costly task for dense video captioning, since each event in the video needs to be annotated separately and a long descriptive paragraph needs to be provided. In this paper, we investigate a way to mitigate this heavy burden and propose to leverage captions of visually similar images as auxiliary cont...
Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lin...
Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and pa...
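For context on the de facto metrics mentioned above, a minimal sketch of corpus-level BLEU scoring with the sacrebleu library; this illustrates the kind of metric being critiqued, not the paper's own method, and the sentences are made-up examples:

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["a man is playing a guitar on stage"]
# one reference stream: references[i][j] is the i-th reference
# for the j-th hypothesis
references = [["a man plays the guitar on a stage"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```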
Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addi...
Public opinion influences events, especially those related to stock market movement, in which a subtle hint can influence the local outcome of the market. In this paper, we present a dataset that allows for company-level analysis of tweet-based impact on one-, two-, three-, and seven-day stock returns. Our dataset consists of 862,231 labelled instances...
Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, the state-of-the-art systems are inherently unimodal, in the sense that they take a single modality...
Current Automatic Text Simplification (TS) work relies on sequence-to-sequence neural models that learn simplification operations from parallel complex-simple corpora. In this paper we address three open challenges in these approaches: (i) avoiding unnecessary transformations, (ii) determining which operations to perform, and (iii) generating simpl...
This paper describes the Imperial College London team's submission to the 2019 VATEX video captioning challenge, where we first explore two sequence-to-sequence models, namely a recurrent (GRU) model and a transformer model, which generate captions from the I3D action features. We then investigate the effect of dropping the encoder and the attenti...
In this paper, we focus on quantifying model stability as a function of random seed by investigating the effects of the induced randomness on model performance and the robustness of the model in general. We specifically perform a controlled study on the effect of random seeds on the behaviour of attention, gradient-based and surrogate model based (...
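A minimal sketch of the kind of controlled multi-seed protocol described above; `train_and_evaluate` is a hypothetical stand-in for a real training pipeline, and the placeholder metric only illustrates the bookkeeping:

```python
import random
import numpy as np

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    # with PyTorch one would also call torch.manual_seed(seed)

def train_and_evaluate(seed):
    """Hypothetical stand-in: train a model under `seed` and return a
    scalar metric (e.g., accuracy). Replace with a real pipeline."""
    set_seed(seed)
    return 0.85 + np.random.normal(0, 0.01)  # placeholder metric

scores = [train_and_evaluate(seed) for seed in range(10)]
print(f"mean={np.mean(scores):.4f}  std={np.std(scores):.4f}")
```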
Recent literature shows that large-scale language modeling provides excellent reusable sentence representations with both recurrent and self-attentive architectures. However, there has been less clarity on the commonalities and differences in the representational properties induced by the two architectures. It also has been shown that visual inform...
We address the task of text translation on the How2 dataset using a state-of-the-art transformer-based multimodal approach. The question we ask ourselves is whether visual features can support the translation process; in particular, given that this is a dataset extracted from videos, we focus on the translation of actions, which we believe are poor...
We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The me...
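Since VIFIDEL rests on embedding similarity between detected object labels and description words, here is a sketch under the assumption that Word Mover's Distance over pretrained word vectors is an acceptable proxy; this uses gensim's wmdistance and is not the released VIFIDEL implementation (the vector file name is an assumption):

```python
# pip install gensim (wmdistance additionally needs the POT package)
from gensim.models import KeyedVectors

# e.g. GloVe vectors converted to word2vec format
vectors = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt")

# labels of objects detected in the image vs. words of the caption
image_labels = ["dog", "frisbee", "grass"]
caption = "a puppy catches a disc in a park".split()

# lower distance -> caption more faithful to the depicted objects
distance = vectors.wmdistance(image_labels, caption)
print(f"WMD = {distance:.3f}")
```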
Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where image...
Explaining and interpreting the decisions of recommender systems are becoming extremely relevant, both for improving predictive performance and for providing valid explanations to users. While most of the recent interest has focused on providing local explanations, there has been much less emphasis on studying the effects of model dynamics and its...
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the ge...
One third of stroke survivors have language difficulties. Emerging evidence suggests that their likelihood of recovery depends mainly on the damage to language centers. Thus previous research for predicting language recovery post-stroke has focused on identifying damaged regions of the brain. In this paper, we introduce a novel method where we only...
An increasing number of datasets contain multiple views, such as video, sound and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further complicated by the existence of a latent alignment between views, such as between speech and its transcription, and by t...
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn `distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the `image' side of image c...
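One way to probe this "distributional similarity" hypothesis is a retrieval baseline: map a test image into the visual feature space and reuse the caption of its nearest training image. A minimal sketch with scikit-learn, using random stand-in features rather than the paper's exact probe:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# stand-ins: pooled CNN features for training images and their captions
train_feats = np.random.rand(1000, 2048)
train_captions = [f"caption {i}" for i in range(1000)]

index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(train_feats)

def retrieve_caption(test_feat):
    """Return the caption of the nearest training image in feature space."""
    _, idx = index.kneighbors(test_feat.reshape(1, -1))
    return train_captions[idx[0, 0]]

print(retrieve_caption(np.random.rand(2048)))
```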
We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect lingu...
The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic inf...
Tasks that require modeling of both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural-network-based mod...
Neural Machine Translation (NMT) has recently demonstrated improved performance over statistical machine translation and relies on an encoder-decoder framework for translating text from source to target. The structure of NMT makes it amenable to add auxiliary features, which can provide complementary information to that present in the source text....
This paper describes the University of Sheffield’s submission to the WMT17 Multimodal Machine Translation shared task. We participated in Task 1 to develop an MT system to translate an image description from English to German and French, given its corresponding image. Our proposed systems are based on the state-of-the-art Neural Machine Translation...
Recent work on multimodal machine translation has attempted to address the problem of producing target language image descriptions based on both the source language description and the corresponding image. However, existing work has not been conclusive on the contribution of visual information. This paper presents an in-depth study of the problem b...
Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when the system is used on a different domain than the one where it was trained. In order to alleviate the problem, we propose to use a log-bilinear softmax-based model for vocabulary expansion, such that given an out-of-vocabulary source wor...
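The core of such a model can be sketched as a softmax over bilinear scores between an out-of-vocabulary word's embedding and the embeddings of in-vocabulary candidates; the shapes, vocabulary, and random parameters below are toy assumptions, not the paper's trained system:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                   # embedding dimensionality
vocab = ["house", "home", "building"]    # in-vocabulary candidates
E = rng.normal(size=(len(vocab), d))     # candidate embeddings
W = rng.normal(size=(d, d))              # bilinear map (learned in practice)

def expand(oov_embedding):
    """p(w | oov) proportional to exp(e_oov^T W e_w) over candidates."""
    scores = E @ (W @ oov_embedding)
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return dict(zip(vocab, probs))

print(expand(rng.normal(size=d)))
```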
We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm which is able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge, as there will be a large number of classes for which only a few examples are available. We propose handling...
We consider the supervised training setting in which we learn task-specific word embeddings. We assume that we start with initial embeddings learned from unlabelled data and update them to learn task-specific embeddings for words in the supervised training data. However, for new words in the test set, we must use either their initial embeddings or a...
We consider the setting in which we train a supervised model that learns task-specific word representations. We assume that we have access to some initial word representations (e.g., unsupervised embeddings), and that the supervised learning procedure updates them to task-specific representations for words contained in the training data. But what a...
We investigate the problem of inducing word embeddings that are tailored for a particular bilexical relation. Our learning algorithm takes an existing lexical vector space and compresses it such that the resulting word embeddings are good predictors for a target bilexical relation. In experiments we show that task-specific embeddings can benefit bo...
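The compression idea can be sketched as a low-rank bilinear score s(x, y) = x^T U^T V y, so that Ux and Vy act as small task-specific embeddings; the dimensions and random projections below are toy stand-ins for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 300, 25                        # original and compressed dimensions
U = rng.normal(size=(k, d)) / d**0.5  # learned in practice; random here
V = rng.normal(size=(k, d)) / d**0.5

def bilexical_score(x, y):
    """Low-rank bilinear score: equivalent to x^T (U^T V) y, but
    computed in the k-dimensional compressed space."""
    return float((U @ x) @ (V @ y))

x = rng.normal(size=d)  # e.g. head-word embedding
y = rng.normal(size=d)  # e.g. modifier embedding
print(bilexical_score(x, y))
```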
Citations
... The following issues plague the two existing approaches to constructing verb-argument knowledge: (1) In most semantic framework dictionaries, verb-argument knowledge is represented by a combination of argument roles, semantic classes, and verbs, such as 施事(agent) + 把(ba) + 受事(patient) + v, or human + 吃(eat) + food, with no scaled representation of the language instances that can act as arguments (just listing a few instances). (2) Due to the limited size and domain of the argument-labeling treebank, high-frequency verbs are labelled repeatedly while the argument roles of low-frequency verbs are rarely described. At the same time, the scale of the argument-role labeling treebank is typically kept at around tens of thousands of sentences, and the coverage of verbs and their argument roles is relatively limited. ...
... Lately, much attention has been put on training Transformers to learn how to reason (Helwe et al., 2021; Al-Negheimish et al., 2021; Storks et al., 2019; Gontier et al., 2020). This is usually done by embedding an algebraic reasoning problem in a natural language formulation. ...
... In [17], a method was proposed for extracting action sequences from NMT architectures, which were later used with sentence pairs in imitation learning to learn an optimal policy. Recently, reinforcement learning was used in multimodal translation [18], utilizing text and visual data to improve the quality of translations. ...
... [Figure 1: Visual context resolving the ambiguity of the English word "glasses" for English-to-French translation; example translation: "Les lunettes sont cassées." ("The glasses are broken.")] (Caglayan et al., 2021; Li et al., 2022). It has typically been difficult to surpass strong text-only baselines, the image modality often being ignored (Wu et al., 2021). ...
... UNITER (Chen et al., 2020) is extended by initializing the text encoder with mBERT and XLM-R, yielding mUNITER and xUNITER respectively. A similar approach is adopted by Mitzalis et al. (2021) to propose BERTGEN by fusing VL-BERT with M-BERT initialization. Specifically, it is demonstrated successfully for the task of MMT, where unrolling is used as masking to create the next example and self-attention is performed at every time step. ...
... As a loss function/objective function in the latent variable space. In contrast, in general object recognition tasks, an object is considered to have a very different appearance in an image, depending on the viewing conditions [19]. ...
... The model will still be encouraged to produce a single output with high average scores to nearby references, which will be maximized at a smoothed mode in the training text distribution. Failure modes of other methods of aggregation are discussed in both Caglayan et al. [8] and Yeh et al. [46], including issues with multi-modal reference distributions and single outlier texts. ...
... translation (MT). Calixto et al. (2019) propose a latent variable model for multi-modal MT, to learn an association between an image and its target language description. Long et al. (2021) and Li et al. (2022) first synthesize an image conditioned on the source sentence, then use both the source sentence and the synthesized image to produce the translation. Caglayan et al. (2020) obtain a lower latency in simultaneous MT by supplying visual context. Differently, Vokenization (Tan and Bansal, 2020) extends BERT ...
... However, in the case of no simplification (P0), the system has not simplified 9 sentences (which, as mentioned before, were discarded from evaluation), but only two of them were correctly unaltered. This result suggests that preprocessing should be applied before performing any simplification step, as proposed by Scarton et al. [51]. ...
... For instance, Xu et al. (2015) show that, by using attention, the model can use different regions of the image while performing image captioning. More recent work shows that bounding boxes (Ren et al., 2015), a discrete variant of attention over images, improve the representation and hence the performance of different tasks such as VQA (Anderson et al., 2018), image captioning (Yin and Ordonez, 2017) and machine translation (Specia et al., 2020). In this work, we apply this methodology to multimodal ASR (see Section 3.4). ...