About
242
Publications
89,879
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
19,744
Citations
Additional affiliations
January 2006 - present
Publications
Publications (242)
Scriptwriters usually rely on their mental visualization to create a vivid story by using their imagination to see, feel, and experience the scenes they are writing. Besides mental visualization, they often refer to existing images or scenes in movies and analyze the visual elements to create a certain mood or atmosphere. In this paper, we develop...
We present an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a "latent transparency" that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model....
This paper presents a unified differentiable boolean operator for implicit solid shape modeling using Constructive Solid Geometry (CSG). Traditional CSG relies on min, max operators to perform boolean operations on implicit shapes. But because these boolean operators are discontinuous and discrete in the choice of operations, this makes optimizatio...
Motion graphics videos are widely used in Web design, digital advertising, animated logos and film title sequences, to capture a viewer's attention. But editing such video is challenging because the video provides a low-level sequence of pixels and frames rather than higher-level structure such as the objects in the video with their corresponding m...
Recent work has shown that when both the chart and caption emphasize the same aspects of the data, readers tend to remember the doubly-emphasized features as takeaways; when there is a mismatch, readers rely on the chart to form takeaways and can miss information in the caption text. Through a survey of 280 chart-caption pairs in real-world sources...
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained textto-image diffusion models. ControlNet locks the productionready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional cont...
Recent work has shown that when both the chart and caption emphasize the same aspects of the data, readers tend to remember the doubly-emphasized features as takeaways; when there is a mismatch, readers rely on the chart to form takeaways and can miss information in the caption text. Through a survey of 280 chart-caption pairs in real-world sources...
We present ZoomShop, a photographic composition editing tool for adjusting relative size, position, and foreshortening of scene elements. Given an image and corresponding depth map as input, ZoomShop combines a novel non‐linear camera model and a depth‐aware image warp to reproject and deform the image. Users can isolate objects by selecting depth...
Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions. However, it remains unclear which types of compositional reasoning cause models to mispredict. Furthermore, it is difficult to discern whether models arrive at answers using compositional reasoning or by leveraging data biase...
Prior benchmarks have analyzed models' answers to questions about videos in order to measure visual compositional reasoning. Action Genome Question Answering (AGQA) is one such benchmark. AGQA provides a training/test split with balanced answer distributions to reduce the effect of linguistic biases. However, some biases remain in several AGQA cate...
Learning 3D generative models from a dataset of monocular images enables self-supervised 3D reasoning and controllable synthesis. State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis. Images are synthesized by rendering the volumes from a given camera. These models can disentangle the 3D scene...
Statically analyzing information flow, or how data influences other data within a program, is a challenging task in imperative languages. Analyzing pointers and mutations requires access to a program's complete source. However, programs often use pre-compiled dependencies where only type signatures are available. We demonstrate that ownership types...
We present a text-based tool for editing talking-head video that enables an iterative editing workflow. On each iteration users can edit the wording of the speech, further refine mouth motions if necessary to reduce artifacts, and manipulate non-verbal aspects of the performance by inserting mouth gestures (e.g., a smile) or changing the overall pe...
Foundation paper piecing is a popular technique for constructing fabric patchwork quilts using printed paper patterns. But, the construction process imposes constraints on the geometry of the pattern and the order in which the fabric pieces are attached to the quilt. Manually designing foundation paper pieceable patterns that meet all of these cons...
We present a system that converts annotated broadcast video of tennis matches into interactively controllable video sprites that behave and appear like professional tennis players. Our approach is based on controllable video textures and utilizes domain knowledge of the cyclic structure of tennis rallies to place clip transitions and accept control...
Charts often contain visually prominent features that draw attention to aspects of the data and include text captions that emphasize aspects of the data. Through a crowdsourced study, we explore how readers gather takeaways when considering charts and captions together. We first ask participants to mark visually prominent regions in a set of line c...
Program tracing, or mentally simulating a program on concrete inputs, is an important part of general program comprehension. Programs involve many kinds of virtual state that must be held in memory, such as variable/value pairs and a call stack. In this work, we examine the influence of short-term working memory (WM) on a person's ability to rememb...
We present a text-based tool for editing talking-head video that enables an iterative editing workflow. On each iteration users can edit the wording of the speech, further refine mouth motions if necessary to reduce artifacts and manipulate non-verbal aspects of the performance by inserting mouth gestures (e.g. a smile) or changing the overall perf...
Cable TV news reaches millions of U.S. households each day, and decisions about who appears on the news, and what stories get talked about, can profoundly influence public opinion and discourse. In this paper, we use computational techniques to analyze a data set of nearly 24/7 video, audio, and text captions from three major U.S. cable TV networks...
We present a system that converts annotated broadcast video of tennis matches into interactively controllable video sprites that behave and appear like professional tennis players. Our approach is based on controllable video textures, and utilizes domain knowledge of the cyclic structure of tennis rallies to place clip transitions and accept contro...
Efficient rendering of photo‐realistic virtual worlds is a long standing effort of computer graphics. Modern graphics techniques have succeeded in synthesizing photo‐realistic images from hand‐crafted scene representations. However, the automatic generation of shape, materials, lighting, and other aspects of scenes remains a challenging problem tha...
Efficient rendering of photo-realistic virtual worlds is a long standing effort of computer graphics. Modern graphics techniques have succeeded in synthesizing photo-realistic images from hand-crafted scene representations. However, the automatic generation of shape, materials, lighting, and other aspects of scenes remains a challenging problem tha...
Technologies for manipulating our digital appearance alter the way the world sees us as well as the way we see ourselves.
We present a capture-time tool designed to help casual photographers orient their subject to achieve a user-specified target facial appearance. The inputs to our tool are an HDR environment map of the scene captured using a 360 camera, and a target facial appearance, selected from a gallery of common studio lighting styles. Our tool computes the op...
A major concern for filmmakers creating 360° video is ensuring that the viewer does not miss important narrative elements because they are looking in the wrong direction. This paper introduces gated clips which do not play the video past a gate time until a filmmaker-defined viewer gaze condition is met, such as looking at a specific region of inte...
Many real-world video analysis applications require the ability to identify domain-specific events in video, such as interviews and commercials in TV news broadcasts, or action sequences in film. Unfortunately, pre-trained models to detect all the events of interest in video may not exist, and training new models from scratch can be costly and labo...
We present a search engine for D3 visualizations that allows queries based on their visual style and underlying structure. To build the engine we crawl a collection of 7860 D3 visualizations from the Web and deconstruct each one to recover its data, its data-encoding marks and the encodings describing how the data is mapped to visual attributes of...
We present a search engine for D3 visualizations that allows queries based on their visual style and underlying structure. To build the engine we crawl a collection of 7860 D3 visualizations from the Web and deconstruct each one to recover its data, its data-encoding marks and the encodings describing how the data is mapped to visual attributes of...
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method auto...
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method auto...
When watching how-to videos related to physical tasks, users' hands are often occupied by the task, making voice input a natural fit. To better understand the design space of voice interactions for how-to video navigation, we conducted three think-aloud studies using: 1) a traditional video interface, 2) a research probe providing a voice controlle...
Spatial layout is a key component in graphic design. While people who are blind or visually impaired (BVI) can use screen readers or magnifiers to access digital content, these tools fail to fully communicate the content's graphic design information. Through semi-structured interviews and contextual inquiries, we identify the lack of this informati...
Visual blends are an advanced graphic design technique to draw attention to a message. They combine two objects in a way that is novel and useful in conveying a message symbolically. This paper presents VisiBlends, a flexible workflow for creating visual blends that follows the iterative design process. We introduce a design pattern for blending sy...
Difficulties in accessing, isolating, and iterating on the components and connections of a printed circuit board (PCB) create unique challenges in PCB debugging. Manual probing methods are slow and error prone, and even dedicated PCB testing equipment remains limited by its inability to modify the circuit during testing. We present Pinpoint, a tool...
Dubbing puppet videos to make the characters (e.g. Kermit the Frog) convincingly speak a new speech track is a popular activity with many examples of well-known puppets speaking lines from films or singing rap songs. But manually aligning puppet mouth movements to match a new speech track is tedious as each syllable of the speech must match a close...
Document authors commonly use tables to support arguments presented in the text. But, because tables are usually separate from the main body text, readers must split their attention between different parts of the document. We present an interactive document reader that automatically links document text with corresponding table cells. Readers can se...
Visual blends are an advanced graphic design technique to draw users' attention to a message. They blend together two objects in a way that is novel and useful in conveying a message symbolically. This demo presents an interactive pipeline for creating visual blends that follows the iterative design process. Our pipeline decomposes the process into...
Online creative communities allow creators to share their work with a large audience, maximizing opportunities to showcase their work and connect with fans and peers. However, sharing in-progress work can be technically and socially challenging in environments designed for sharing completed pieces. We propose an online creative community where shar...
We present a visual analogue for musical rhythm derived from an analysis of motion in video, and show that alignment of visual rhythm with its musical counterpart results in the appearance of dance. Central to our work is the concept of visual beats --- patterns of motion that can be shifted in time to control visual rhythm. By warping visual beats...
It can be difficult to understand physical measurements (e.g., 28 lb, 600 gallons) that appear in news stories, data reports, and other documents. We develop tools that automatically re-express unfamiliar measurements using the measurements of familiar objects. Our work makes three contributions: (1) we identify effectiveness criteria for objects u...
For cooking professionals and culinary students, understanding cooking instructions is an essential yet demanding task. Common tasks include categorizing different approaches to cooking a dish and identifying usage patterns of particular ingredients or cooking methods, all of which require extensive browsing and comparison of multiple recipes. Howe...
Audio Podcasts have gained popularity because they are a compelling form of storytelling and are easy to consume. However, they are not as easy to produce since resources are invested in the research, recording, and editing process and the average length of an episode is over an hour. Some audio podcasts could benefit from visuals to increase engag...
Virtual reality filmmakers creating 360-degree video currently rely on cinematography techniques that were developed for traditional narrow field of view film. They typically edit together a sequence of shots so that they appear at a fixed-orientation irrespective of the viewer's field of view. But because viewers set their own camera orientation t...
Analog circuit design is a complex, error-prone task in which the processes of gathering observations, formulating reasonable hypotheses, and manually adjusting the circuit raise significant barriers to an iterative workflow. We present Scanalog, a tool built on programmable analog hardware that enables users to rapidly explore different circuit de...
We present a system for efficiently editing video of dialogue-driven scenes. The input to our system is a standard film script and multiple video takes, each capturing a different camera framing or performance of the complete scene. Our system then automatically selects the most appropriate clip from one of the input takes, for each line of dialogu...
In culture analytics, it is important to ask fundamental questions that address salient characteristics of collective human behavior. This paper explores how analyzing cooking recipes in aggregate and at scale identifies these characteristics in the cooking culture, and answer fundamental questions like 'what makes a chocolate chip cookie a chocola...
High-quality hand-made furniture often employs intrinsic joints that geometrically interlock along mating surfaces. Such joints increase the structural integrity of the furniture and add to its visual appeal. We present an interactive tool for designing such intrinsic joints. Users draw the visual appearance of the joints on the surface of an input...
High-quality hand-made furniture often employs intrinsic joints that geometrically interlock along mating surfaces. Such joints increase the structural integrity of the furniture and add to its visual appeal. We present an interactive tool for designing such intrinsic joints. Users draw the visual appearance of the joints on the surface of an input...
We present SceneSuggest: an interactive 3D scene design system providing context-driven suggestions for 3D model retrieval and placement. Using a point-and-click metaphor we specify regions in a scene in which to automatically place and orient relevant 3D models. Candidate models are ranked using a set of static support, position, and orientation p...
Understanding how humans explore virtual environments is crucial for many applications, such as developing compression algorithms or designing effective cinematic virtual reality (VR) content, as well as to develop predictive computational models. We have recorded 780 head and gaze trajectories from 86 users exploring omni-directional stereo panora...
Online creative communities allow creators to share their work with a large audience, maximizing opportunities to showcase their work and connect with fans and peers. However, sharing in-progress work can be technically and socially challenging in environments designed for sharing completed pieces. We propose an online creative community where shar...
We present QuickCut, an interactive video editing tool designed to help authors efficiently edit narrated video. QuickCut takes an audio recording of the narration voiceover and a collection of raw video footage as input. Users then review the raw footage and provide spoken annotations describing the relevant actions and objects in the scene. Quick...
Video production is a collaborative process in which stakeholders regularly review drafts of the edited video to indicate problems and offer suggestions for improvement. Although practitioners prefer in-person feedback, most reviews are conducted asynchronously via email due to scheduling and location constraints. The use of this impoverished mediu...
We present a technique for converting a basic D3 chart into a reusable style template. Then, given a new data source we can apply the style template to generate a chart that depicts the new data, but in the style of the template. To construct the style template we first deconstruct the input D3 chart to recover its underlying structure: the data, t...
We present a technique for converting a basic D3 chart into a reusable style template. Then, given a new data source we can apply the style template to generate a chart that depicts the new data, but in the style of the template. To construct the style template we first deconstruct the input D3 chart to recover its underlying structure: the data, t...
Distances and areas frequently appear in text articles. However, people struggle to understand these measurements when they cannot relate them to measurements of locations that they are personally familiar with. We contribute tools for generating personalized spatial analogies: re-expressions that contextualize spatial measurements in terms of loca...
Digital image editing is usually an iterative process; users repetitively perform short sequences of operations, as well as undo and redo using history navigation tools. In our collected data, undo, redo and navigation constitute about 9 percent of the total commands and consume a significant amount of user time. Unfortunately, such activities also...
" Well-performed audio narrations are a hallmark of captivating podcasts, explainer videos, radio stories, and movie trailers. To record these narrations, professional voiceover actors follow guidelines that describe how to use low-level vocal components---volume, pitch, timbre, and tempo---to deliver performances that emphasize important words whi...
" Searching for scenes in movies is a time-consuming but crucial task for film studies scholars, film professionals, and new media artists. In pilot interviews we have found that such users search for a wide variety of clips---e.g., actions, props, dialogue phrases, character performances, locations---and they return to particular scenes they have...
A collection of photos and a three-dimensional reconstruction of the photos are used to construct and texture a mesh model. In one embodiment, a first digital image of a first view of a real world scene is analyzed to identify lines in the first view. Among the lines, parallel lines are identified. A three-dimensional vanishing direction in a three...
Feedback is an important component of the design process, but gaining access to high-quality critique outside a classroom or firm is challenging. We present CrowdCrit, a web-based system that allows designers to receive design critiques from non-expert crowd workers. We evaluated CrowdCrit in three studies focusing on the designer's experience and...
We present a method for automatically identifying and validating predictive relationships between the visual appearance of a city and its non-visual attributes (e.g. crime statistics, housing prices, population density etc.). Given a set of street-level images and (location, city-attribute-value) pairs of measurements, we first identify visual elem...
Designers often create physical works-like prototypes early in the product development cycle to explore possible mechanical architectures for a design. Yet, creating functional prototypes requires time and expertise, which discourages rapid design iterations. Designers must carefully specify part and joint parameters to ensure that parts move and f...
Highly-produced audio stories often include musical scores that reflect the emotions of the speech. Yet, creating effective musical scores requires deep expertise in sound production and is time-consuming even for experts. We present a system and algorithm for re-sequencing music tracks to generate emotionally relevant music scores for audio storie...
The D3 JavaScript library has become a ubiquitous tool for developing visualizations on the Web. Yet, once a D3 visualization is published online its visual style is difficult to change. We present a pair of tools for deconstructing and restyling existing D3 visualizations. Our deconstruction tool analyzes a D3 visualization to extract the data, th...
Increasingly, authors are publishing long informational talks, lectures, and distance-learning videos online. However, it is difficult to browse and skim the content of such videos using current timeline-based video players. Video digests are a new format for informational videos that afford browsing and skimming by segmenting videos into a chapter...