Josiah Wang
  • PhD
  • Fellow at Imperial College London

About

  • 41 Publications
  • 6,473 Reads
  • 867 Citations
Introduction
My research experience is in the field of Artificial Intelligence, more specifically in two major fields of applied Machine Learning: Computer Vision and Natural Language Processing. My research emphasis is on developing AI systems that learn from multimodal data and on applying the acquired knowledge to tackle specialised Computer Vision and/or Natural Language Processing tasks with few or no supervised examples for those specific tasks.
Current institution
Imperial College London
Current position
  • Fellow
Additional affiliations
September 2013 - February 2019
The University of Sheffield
Position
  • Research Associate
Description
  • I worked on the Visual Sense (ViSen) project with the Natural Language Processing Research Group at the Department of Computer Science, The University of Sheffield.
Education
September 2008 - July 2013
University of Leeds
Field of study
  • Computer Science

Publications (41)
Conference Paper
Full-text available
We investigate the task of learning models for visual object recognition from natural language descriptions alone. The approach contributes to the recognition of fine-grained object categories, such as animal and plant species, where it may be difficult to collect many images for training, but where textual descriptions of visual attributes are readi...
Conference Paper
Full-text available
We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both in the case when the la...
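To make the setup concrete, the sketch below illustrates one plausible reading of the geometric side of this task: training a classifier to predict a preposition from simple bounding-box features of the two entities. The feature set, toy data and labels are illustrative assumptions, not the paper's actual model.

```python
# Illustrative sketch (not the paper's code): predicting a preposition
# linking two visual entities from geometric features of their
# bounding boxes. Features and training data here are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def geometric_features(box_a, box_b):
    """Boxes are (x, y, w, h). Returns a small geometric feature vector."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Normalised centre offset between the two entities.
    dx = ((bx + bw / 2) - (ax + aw / 2)) / aw
    dy = ((by + bh / 2) - (ay + ah / 2)) / ah
    # Area ratio and overlap give cues for "on", "in", "next to", etc.
    area_ratio = (aw * ah) / (bw * bh)
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    overlap = (ix * iy) / (aw * ah)
    return [dx, dy, area_ratio, overlap]

# Toy training pairs: (subject box, object box) -> preposition label.
X = np.array([
    geometric_features((10, 10, 40, 40), (15, 60, 30, 20)),   # "above"
    geometric_features((10, 10, 40, 40), (12, 12, 35, 35)),   # "on"
    geometric_features((10, 10, 20, 20), (60, 12, 25, 25)),   # "next to"
])
y = np.array(["above", "on", "next to"])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([geometric_features((0, 0, 30, 30), (5, 40, 20, 15))]))
```

A full system would concatenate textual and visual features alongside these geometric ones, as the abstract describes.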
Conference Paper
Full-text available
Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers i...
Conference Paper
Full-text available
In this paper, we introduce the notion of visually descriptive language (VDL) – intuitively a text segment whose truth can be confirmed by visual sense alone. VDL can be exploited in many vision-based tasks, e.g. image interpretation and story illustration. In contrast to previous work requiring pre-aligned texts and images, we propose a broader de...
Article
Full-text available
We propose multimodal machine translation (MMT) approaches that exploit the correspondences between words and image regions. In contrast to existing work, our referential grounding method considers objects as the visual unit for grounding, rather than whole images or abstract image regions, and performs visual grounding in the source language, rath...
Preprint
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images ar...
Article
Full-text available
Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, the state-of-the-art systems are inherently unimodal, in the sense that they take a single modality...
Preprint
Full-text available
This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the ex...
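The cascade described above can be pictured as a two-stage pipeline: ASR output is fed to a multimodal MT model together with visual features. The sketch below is a schematic stand-in with hypothetical model interfaces (`transcribe`, `translate`), not the actual IWSLT 2019 system.

```python
# Schematic sketch of a cascaded multimodal speech translation pipeline:
# stage 1 transcribes speech; stage 2 translates the transcript while
# conditioning on visual features. All interfaces are hypothetical.
from dataclasses import dataclass

@dataclass
class CascadeSpeechTranslator:
    asr_model: object   # speech -> source-language transcript
    mmt_model: object   # (transcript, visual features) -> target text

    def translate(self, audio, visual_features):
        # Stage 1: automatic speech recognition.
        transcript = self.asr_model.transcribe(audio)
        # Stage 2: Transformer-based multimodal MT conditioned on
        # both the ASR transcript and the visual features.
        return self.mmt_model.translate(transcript, visual_features)
```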
Preprint
Full-text available
This paper describes the Imperial College London team's submission to the 2019 VATEX video captioning challenge, where we first explore two sequence-to-sequence models, namely a recurrent (GRU) model and a transformer model, which generate captions from the I3D action features. We then investigate the effect of dropping the encoder and the attenti...
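A minimal sketch of the recurrent variant mentioned above: a GRU decoder conditioned on I3D video features, written in PyTorch. All dimensions and the initialisation scheme are illustrative assumptions, not the submitted system.

```python
# Minimal PyTorch sketch of a GRU caption decoder over I3D features.
import torch
import torch.nn as nn

class I3DGRUCaptioner(nn.Module):
    def __init__(self, vocab_size, i3d_dim=1024, hidden=512, embed=256):
        super().__init__()
        self.feat_proj = nn.Linear(i3d_dim, hidden)  # features -> initial state
        self.embed = nn.Embedding(vocab_size, embed)
        self.gru = nn.GRU(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, i3d_feats, captions):
        # Initialise the decoder state from mean-pooled I3D features.
        h0 = torch.tanh(self.feat_proj(i3d_feats.mean(dim=1))).unsqueeze(0)
        emb = self.embed(captions)       # (B, T, embed)
        out, _ = self.gru(emb, h0)       # (B, T, hidden)
        return self.out(out)             # per-step vocabulary logits

model = I3DGRUCaptioner(vocab_size=10000)
feats = torch.randn(2, 32, 1024)             # batch of 32-segment I3D features
tokens = torch.randint(0, 10000, (2, 12))    # teacher-forced caption prefixes
logits = model(feats, tokens)                # (2, 12, 10000)
```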
Preprint
Localizing phrases in images is an important part of image understanding and can be useful in many applications that require mappings between textual and visual information. Existing work attempts to learn these mappings from examples of phrase-image region correspondences (strong supervision) or from phrase-image pairs (weak supervision). We postu...
Chapter
Automatic image annotation is the task of automatically assigning some form of semantic label to images, such as words, phrases or sentences describing the objects, attributes, actions, and scenes depicted in the image. In this chapter, we present an overview of the various automatic image annotation tasks that were organized in conjunction with th...
Preprint
Full-text available
We address the task of text translation on the How2 dataset using a state-of-the-art transformer-based multimodal approach. The question we ask is whether visual features can support the translation process; in particular, given that this dataset is extracted from videos, we focus on the translation of actions, which we believe are poor...
Preprint
Full-text available
We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The me...
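The core idea, measuring how well a caption covers the objects detected in the image, can be illustrated with a simple embedding-similarity sketch. This is not the published VIFIDEL metric; the toy embedding table and the best-match averaging below are placeholder assumptions.

```python
# Rough illustration of an image-aware faithfulness score: compare
# detected object labels against the caption's words via embedding
# similarity. A simplified sketch, not the published VIFIDEL metric.
import numpy as np

# Placeholder word vectors; a real system would use pretrained embeddings.
EMB = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "ball":  np.array([0.1, 0.9, 0.2]),
    "car":   np.array([0.0, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def faithfulness(object_labels, caption_words):
    """For each detected object, take its best match in the caption,
    then average: high when the caption covers what the image shows."""
    scores = []
    for label in object_labels:
        best = max(cosine(EMB[label], EMB[w])
                   for w in caption_words if w in EMB)
        scores.append(best)
    return sum(scores) / len(scores)

print(faithfulness(["dog", "ball"], ["puppy", "ball"]))  # high (~0.99)
print(faithfulness(["dog", "ball"], ["car"]))            # low  (~0.22)
```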
Preprint
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the 'image' side of image c...
Preprint
We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect lingu...
Preprint
The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic inf...
Article
Tasks that require modeling of both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural-network-based mod...
Conference Paper
We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect lingu...
Conference Paper
The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic info...
Preprint
Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers i...
Article
Full-text available
Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers i...
Conference Paper
This paper describes the University of Sheffield’s submission to the WMT17 Multimodal Machine Translation shared task. We participated in Task 1 to develop an MT system to translate an image description from English to German and French, given its corresponding image. Our proposed systems are based on the state-of-the-art Neural Machine Translation...
Conference Paper
Recent work on multimodal machine translation has attempted to address the problem of producing target language image descriptions based on both the source language description and the corresponding image. However, existing work has not been conclusive on the contribution of visual information. This paper presents an in-depth study of the problem b...
Article
Full-text available
Recent work on multimodal machine translation has attempted to address the problem of producing target language image descriptions based on both the source language description and the corresponding image. However, existing work has not been conclusive on the contribution of visual information. This paper presents an in-depth study of the problem b...
Conference Paper
Full-text available
Since 2010, ImageCLEF has run a scalable image annotation task, to promote research into the annotation of images using noisy web page data. It aims to develop techniques to allow computers to describe images reliably, localise different concepts depicted and generate descriptions of the scenes. The primary goal of the challenge is to encourage cre...
Conference Paper
Full-text available
We tackle the sub-task of content selection as part of the broader challenge of automatically generating image descriptions. More specifically, we explore how decisions can be made to select what object instances should be mentioned in an image description, given an image and labelled bounding boxes. We propose casting the content selection problem...
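The abstract is truncated before it names the paper's formulation, so the sketch below only illustrates the task interface: given labelled bounding boxes, rank object instances and keep the top few to mention. The size-and-centrality heuristic is a hypothetical stand-in for the learned model.

```python
# Hedged sketch of content selection for image description: given
# labelled bounding boxes, choose which object instances to mention.
# This size/centrality heuristic is only an illustrative stand-in.
def select_content(boxes, image_w, image_h, k=3):
    """boxes: list of (label, x, y, w, h). Returns k labels to mention,
    preferring large, centrally placed objects."""
    def score(box):
        _, x, y, w, h = box
        area = (w * h) / (image_w * image_h)
        cx, cy = x + w / 2, y + h / 2
        # Distance of the object's centre from the image centre, normalised.
        off = ((cx / image_w - 0.5) ** 2 + (cy / image_h - 0.5) ** 2) ** 0.5
        return area - 0.5 * off
    ranked = sorted(boxes, key=score, reverse=True)
    return [label for label, *_ in ranked[:k]]

boxes = [("person", 100, 80, 200, 300), ("tree", 0, 0, 60, 400),
         ("bird", 420, 20, 30, 30)]
print(select_content(boxes, image_w=640, image_h=480, k=2))  # person, tree
```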
Conference Paper
Full-text available
This paper presents an overview of the ImageCLEF 2016 evaluation campaign, an event that was organized as part of the CLEF (Conference and Labs of the Evaluation Forum) labs 2016. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to collections of...
Conference Paper
This paper describes the University of Sheffield’s submission for the WMT16 Multimodal Machine Translation shared task, where we participated in Task 1 to develop German-to-English and English-to-German statistical machine translation (SMT) systems in the domain of image descriptions. Our proposed systems are standard phrase-based SMT systems based...
Conference Paper
We harvest training images for visual object recognition by casting it as an IR task. In contrast to previous work, we concentrate on fine-grained object categories, such as the large number of particular animal subspecies, for which manual annotation is expensive. We use 'visual descriptions' from nature guides as a novel augmentation to the well-...
Conference Paper
The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems. However, not much work has been done to analyse and understand...
Conference Paper
Full-text available
In this paper, we present the task of generating image descriptions with gold standard visual detections as input, rather than directly from an image. This allows the Natural Language Generation community to focus on the text generation process, rather than dealing with the noise and complications arising from the visual detection process. We propo...
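The task interface can be illustrated with a toy realiser that maps gold-standard detections straight to text, sidestepping the visual detection process as the abstract proposes. The templates are hypothetical, not the paper's generation model.

```python
# Toy illustration of generating a description from gold-standard
# detections rather than raw pixels: a simple template realiser over
# labelled detections. Templates are hypothetical placeholders.
def describe(detections):
    """detections: list of object labels ordered by prominence."""
    if not detections:
        return "An empty scene."
    if len(detections) == 1:
        return f"A {detections[0]}."
    head, *rest = detections
    return f"A {head} with " + ", ".join(f"a {r}" for r in rest) + "."

print(describe(["man", "dog", "frisbee"]))  # "A man with a dog, a frisbee."
```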
Conference Paper
Full-text available
The ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task was the fourth edition of a challenge aimed at developing more scalable image annotation systems. In particular, this year the three subtasks available to participants aimed to develop techniques to allow computers to reliably describe images,...
Conference Paper
Full-text available
This paper presents an overview of the ImageCLEF 2015 evaluation campaign, an event that was organized as part of the CLEF labs 2015. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to databases of images in various usage scenarios and domains. I...
Conference Paper
Full-text available
Different people may describe the same object in different ways, and at varied levels of granularity ("poodle", "dog", "pet" or "animal"?). In this paper, we propose the idea of 'granularity-aware' groupings where semantically related concepts are grouped across different levels of granularity to capture the variation in how different people describ...
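One plausible way to realise such granularity-aware groupings is to map each concept to a shared ancestor in the WordNet hierarchy, so that "poodle", "dog" and "animal" can land in one group. The sketch below uses NLTK's WordNet API; the fixed ancestor depth and first-sense choice are illustrative assumptions, not the paper's grouping procedure.

```python
# Sketch of a 'granularity-aware' grouping via WordNet hypernym chains:
# concepts sharing an ancestor at a fixed depth fall into one group.
# Requires: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def group_key(word, depth=6):
    """Return the hypernym at a fixed depth of the word's first noun
    sense; words sharing that ancestor land in the same group."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word
    path = synsets[0].hypernym_paths()[0]   # root -> ... -> word's synset
    return path[min(depth, len(path) - 1)].name()

for w in ["poodle", "dog", "animal", "bicycle"]:
    print(w, "->", group_key(w))
```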
Conference Paper
Full-text available
This paper reports the results of an experiment to combine research and teaching in Corpus Linguistics, using an AI-inspired intelligent agent architecture, but casting students as the intelligent agents (Atwell 2007). Computing students studying Computational Modelling and Technologies for Knowledge Management were given the data-mining coursework...