Tamara L. Berg's research while affiliated with Meta and other places

Publications (110)

Chapter
We consider the targeted image editing problem, namely blending a region in a source image with a driver image that specifies the desired change. Differently from prior works, we solve this problem by learning a conditional probability distribution of the edits, end-to-end in code space. Training such a model requires addressing the lack of example edits...
Preprint
Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if yes, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explo...
Preprint
Full-text available
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change. Differently from prior works, we solve this problem by learning a conditional probability distribution of the edits, end-to-end. Training such a model requires addressing a fundamental technical challenge: the l...
Preprint
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately witho...
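The contrast drawn in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the vectors stand in for encoder outputs, and the names `dual_encoder_score`, `rank_with_dual_encoder`, `cross_encoder_score`, and `toy_fusion` are hypothetical. `toy_fusion` is only a placeholder for dense multi-modal fusion, assuming each input is a list of token or region vectors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dual_encoder_score(image_vec: Vector, text_vec: Vector) -> float:
    """Dual encoder: image and text are embedded independently
    (embeddings are assumed to be precomputed), then compared
    with a single dot product."""
    return dot(image_vec, text_vec)


def rank_with_dual_encoder(text_vec: Vector, image_vecs: List[Vector]) -> List[int]:
    """Return image indices sorted from best to worst match."""
    scores = [dual_encoder_score(v, text_vec) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def toy_fusion(image_tokens: List[Vector], text_tokens: List[Vector]) -> float:
    """Hypothetical stand-in for dense fusion: every text token is
    compared with every image token, and the best matches are averaged."""
    best = [max(dot(t, r) for r in image_tokens) for t in text_tokens]
    return sum(best) / len(best)


def cross_encoder_score(image_tokens: List[Vector], text_tokens: List[Vector]) -> float:
    """Cross encoder: both inputs are fed jointly, so the score
    cannot be precomputed per image."""
    return toy_fusion(image_tokens, text_tokens)
```

In this sketch the dual encoder needs only one dot product per candidate image, while the cross encoder must process every image-text pair jointly, which is the usual speed versus accuracy trade-off between the two designs.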
Preprint
Full-text available
We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with the given piece of content (image, text, image+text), and having the capability to generalize to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-t...
Preprint
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles. Compared to existing moment retrieval datasets, mTVR is multilingual, larger,...
Preprint
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos...
Preprint
Full-text available
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) ben...
Preprint
We study the problem of learning how to predict attribute-object compositions from images, and its generalization to unseen compositions missing from the training data. To the best of our knowledge, this is a first large-scale study of this problem, involving hundreds of thousands of compositions. We train our framework with images from Instagram u...
Preprint
We introduce a unified framework to jointly model images, text, and human attention traces. Our work is built on top of the recent Localized Narratives annotation framework [30], where each word of a given caption is paired with a mouse trace segment. We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding...
Preprint
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks different from the target domains, rendering these fixe...
Chapter
We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window....
Chapter
Query expansion is a technique widely used in image search that combines highly ranked images from an original query into an expanded query that is then reissued, generally leading to increased recall and precision. An important aspect of query expansion is choosing an appropriate way to combine the images into a new query. Interestingly,...
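The retrieve, combine, and reissue loop described in this abstract can be sketched as follows. The combination step here is plain averaging of the query with its top-k results, a common baseline chosen for illustration and not the combination method proposed in the publication. Descriptors are assumed to be plain float vectors, and `search` and `expand_query` are hypothetical helper names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def search(query: Vector, database: List[Vector], k: int) -> List[int]:
    """Return the indices of the k database items most similar to the query."""
    scores = [dot(query, d) for d in database]
    ranked = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def expand_query(query: Vector, database: List[Vector], k: int) -> List[float]:
    """Average the query with its top-k results (assumed baseline),
    then renormalize to obtain the expanded query."""
    top = search(query, database, k)
    summed = list(query)
    for i in top:
        for j, value in enumerate(database[i]):
            summed[j] += value
    return normalize([s / (len(top) + 1) for s in summed])


# Usage: issue the original query, build the expanded one, then reissue it.
database = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.0]
expanded = expand_query(query, database, k=2)
reranked = search(expanded, database, k=3)
```

The expanded query drifts toward the highly ranked neighbors, which is what lets the second search reach items the original query scored poorly.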
Preprint
Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multi...
Preprint
Query expansion is a technique widely used in image search that combines highly ranked images from an original query into an expanded query that is then reissued, generally leading to increased recall and precision. An important aspect of query expansion is choosing an appropriate way to combine the images into a new query. Interestingly,...
Preprint
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory...
Preprint
We introduce a new multimodal retrieval task - TV show Retrieval (TVR), in which a short video moment has to be localized from a large video (with subtitle) corpus, given a natural language query. Different from previous moment retrieval tasks dealing with videos only, TVR requires the system to understand both the video and the associated subtitle...
Preprint
In this work, we present a new operator, called Instance Mask Projection (IMP), which projects a predicted Instance Segmentation as a new feature for semantic segmentation. It also supports backpropagation, so it is trainable end-to-end. Our experiments show the effectiveness of IMP on both Clothing Parsing (with complex layering, large deformations,...
Preprint
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos. We first augment the TVQA dataset with 310.8k bounding boxes, linking depicted objects to visual...
Preprint
Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA makes the fundamental assumption that every question, e.g., "what color is the car?", has exactly one target ("car") being inquired about. This assumption puts a direct limitation on the abilit...
Preprint
This work presents computational methods for transferring body movements from one person to another with videos collected in the wild. Specifically, we train a personalized model on a single video from the Internet which can generate videos of this target person driven by the motions of other people. Our model is built on two generative networks: a...
Article
Full-text available
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classific...
Article
Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web. As an alternative, we propose a method that is able to compute a 3D model of a human body and its outfit from a single photograph with little human interac...
Preprint
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours o...
Article
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image descript...
Article
Given a still photograph, one can imagine how dynamic objects might move against a static background. This idea has been actualized in the form of cinemagraphs, where the motion of particular objects within a still image is repeated, giving the viewer a sense of animation. In this paper, we learn computational models that can generate cinemagraph s...
Article
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flex...
Article
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such cap...
Article
We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) t...
Conference Paper
Full-text available
We develop an optimization-based framework to automatically illustrate poems and songs. Our method is able to produce both semantically relevant and visually coherent illustrations, all while matching a particular user-selected visual style. We demonstrate our method on a selection of 200 popular poems and songs collected from the internet and oper...
Article
Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the lis...
Article
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet dataset, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and a...
Conference Paper
Based on life-long observations of physical, chemical, and biologic phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future. But, what about computers? In this paper, we learn computational models of object transformations from time-lapse videos. In particular, we explore the use of...
Conference Paper
Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to oth...
Article
Based on life-long observations of physical, chemical, and biologic phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future. But, what about computers? In this paper, we learn computational models of object transformations from time-lapse videos. In particular, we explore the use of...
Article
Full-text available
In this paper, we explore deep learning methods for estimating when objects were made. Automatic methods for this task could potentially be useful for historians, collectors, or any individual interested in estimating when their artifact was created. Direct applications include large-scale data organization or retrieval. Toward this goal, we utiliz...
Article
This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the ImageNet dataset, despite the wide scope of questions. In contrast, our approach employs features derived from...
Article
Full-text available
Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web. As an alternative, we propose a method that is able to compute a rich and realistic 3D model of a human body and its outfits from a single photograph wit...
Article
Full-text available
Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images. In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to oth...
Conference Paper
Although deep convolutional neural networks (CNNs) have shown remarkable results for feature learning and prediction tasks, many recent studies have demonstrated improved performance by incorporating additional handcrafted features or by fusing predictions from multiple CNNs. Usually, these combinations are implemented via feature concatenation or...
Article
We have seen remarkable recent progress in computational visual recognition, producing systems that can classify objects into thousands of different categories with increasing accuracy. However, one question that has received relatively less attention is "what labels should recognition systems output?" This paper looks at the problem of predicting...
Article
What is the story of an image? What is the relationship between pictures, language, and information we can extract using state of the art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated...
Article
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions,...
Article
Clothing recognition is a societally and commercially important yet extremely challenging problem due to large variations in clothing appearance, layering, style, and body shape and pose. In this paper, we tackle the clothing parsing problem using a retrieval-based approach. For a query image, we find similar styles from a large database of tagged...
Article
Entry-level categories—the labels people use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition...
Article
Full-text available
Clothing and fashion are an integral part of our everyday lives. In this paper we present an approach to studying fashion both on the runway and in more real-world settings, computationally, and at large scale, using computer vision. Our contributions include collecting a new runway dataset, designing features suitable for capturing outfit appearan...
Article
We study Refer-to-as relations as a new type of semantic knowledge. Compared to the much studied Is-a relation, which concerns factual taxonomy knowledge, Refer-to-as relations aim to address pragmatic semantic knowledge. For example, a “penguin” is a “bird” from a taxonomy point of view, but people rarely refer to a “penguin” as a “bird” in vernacular u...
Conference Paper
Full-text available
We study Refer-to-as relations as a new type of semantic knowledge. Compared to the much studied Is-a relation, which concerns factual taxonomic knowledge, Refer-to-as relations aim to address pragmatic semantic knowledge. For example, a “penguin” is a “bird” from a taxonomic point of view, but people rarely refer to a “penguin” as a “bird” in vern...
Article
We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as...
Article
From Flickr to Facebook to Pinterest, pictures are increasingly becoming a core content type in social networks. But, how important is this visual content and how does it influence behavior in the network? In this paper we study the effects of visual, textual, and social factors on popularity in a large real-world network focused on fashion. We mak...
Conference Paper
The clothing we wear and our identities are closely tied, revealing to the world clues about our wealth, occupation, and socio-identity. In this paper we examine questions related to what our clothing reveals about our personal style. We first design an online competitive Style Rating Game called Hipster Wars to crowdsource reliable human judgment...
Conference Paper
Human observers make a variety of perceptual inferences about pictures of places based on prior knowledge and experience. In this paper we apply computational vision techniques to the task of predicting the perceptual characteristics of places by leveraging recent work on visual features along with a geo-tagged dataset of images associated with cro...
Conference Paper
We explore the use of computer vision methods for organizing, searching, and classifying x-ray scattering images. X-ray scattering is a technique that shines an intense beam of x-rays through a sample of interest. By recording the intensity of x-ray deflection as a function of angle, scientists can measure the structure of materials at the molecula...
Article
Full-text available
We posit that a person's gaze behavior while freely viewing a scene contains an abundance of information, not only about their intent and what they consider to be important in the scene, but also about the scene's content. Experiments are reported, using two popular image datasets from computer vision, that explore the relationship between the fixa...
Conference Paper
Entry-level categories - the labels people will use to name an object - were originally defined and studied by psychologists in the 1980s. In this paper we study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. Our models combine visual recognition predictions with proxies for word...
Conference Paper
Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. O...
Conference Paper
The ever growing amount of web images and their associated texts offers new opportunities for integrative models bridging natural language processing and computer vision. However, the potential benefits of such data are yet to be fully realized due to the complexity and noise in the alignment between image content and text. We address this challeng...