Conference Paper

Time-Offset Conversations on a Life-Sized Automultiscopic Projector Array

... Over the last decade we have seen the field of computational photography, especially light field and multi-view imaging, emerge and mature as a new paradigm in imaging technology. These technologies enable a range of novel applications, from advanced multidimensional image processing such as refocusing [Ng et al. 2005] and depth estimation [Vaish et al. 2006] to cinematic editing [Jarabo et al. 2014], glasses-free 3D display systems [Jones et al. 2016; Lee et al. 2016; Wetzstein et al. 2012b,a], and single-sensor light field cameras [Babacan et al. 2012; Marwah et al. 2013]. ...
... The 3D display in [Jones et al. 2016] at 30 Hz corresponds to more than 5 GB of data per second. ...
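The quoted rate follows directly from view count × resolution × color depth × frame rate. A back-of-the-envelope check in Python, with illustrative (assumed, not quoted) view-count and resolution figures:

```python
# Rough uncompressed data rate for a multi-view 3D display.
# The view count and resolution are illustrative assumptions,
# not values taken from the cited papers.
views = 72            # assumed number of simultaneous views
w, h = 1024, 768      # assumed per-view resolution
bytes_per_pixel = 3   # uncompressed 8-bit RGB
fps = 30              # refresh rate quoted in the snippet above

rate = views * w * h * bytes_per_pixel * fps
print(f"{rate / 1e9:.1f} GB/s")   # -> 5.1 GB/s at these settings
```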
Article
Full-text available
In this article we present a novel dictionary learning framework designed for compression and sampling of light fields and light field videos. Unlike previous methods, where a single dictionary with one-dimensional atoms is learned, we propose to train a Multidimensional Dictionary Ensemble (MDE). It is shown that learning an ensemble in the native dimensionality of the data promotes sparsity, hence increasing the compression ratio and sampling efficiency. To make maximum use of correlations within the light field data sets, we also introduce a novel nonlocal pre-clustering approach that constructs an Aggregate MDE (AMDE). The pre-clustering not only improves the image quality but also reduces the training time by an order of magnitude in most cases. The decoding algorithm supports efficient local reconstruction of the compressed data, which enables efficient real-time playback of high-resolution light field videos. Moreover, we discuss the application of AMDE for compressed sensing. A theoretical analysis is presented that indicates the required conditions for exact recovery of point-sampled light fields that are sparse under AMDE. The analysis provides guidelines for designing efficient compressive light field cameras. We use various synthetic and natural light field and light field video data sets to demonstrate the utility of our approach in comparison with the state-of-the-art learning-based dictionaries, as well as established analytical dictionaries.
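To give a flavor of the ensemble idea, the sketch below encodes a vectorized patch against every dictionary in an ensemble and keeps the member with the smallest k-sparse residual. This is a simplified 1-D analogue under assumed shapes; the paper's MDE operates in the data's native dimensionality with tensor products, which this sketch does not reproduce.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: greedy k-sparse code of x in
    dictionary D, whose columns are assumed to be unit-norm atoms."""
    residual, idx, coef = x.astype(float).copy(), [], np.zeros(0)
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        residual = x - D[:, idx] @ coef
    code = np.zeros(D.shape[1])
    code[idx] = coef
    return code, float(np.linalg.norm(residual))

def encode_with_ensemble(dictionaries, x, k):
    """Try every dictionary in the ensemble and keep the one giving
    the smallest k-sparse reconstruction error for this patch."""
    results = [omp(D, x, k) + (i,) for i, D in enumerate(dictionaries)]
    return min(results, key=lambda t: t[1])   # (code, error, index)
```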
... The angular resolution is the basis of many applications [11][12][13]. For example, it directly affects the accuracy of depth map estimation, which in turn influences the quality of 3D reconstruction. ...
Article
Full-text available
There is a trade-off between spatial and angular resolution in light field applications; various algorithms have been proposed to enhance angular resolution while preserving high spatial resolution, a task also called view synthesis. Among them, depth estimation-based methods can use only four corner views to reconstruct a novel view at an arbitrary location. However, depth estimation is a time-consuming process, and the quality of the reconstructed novel view depends not only on the number of input views but also on their locations. In this paper, we explore how the choice of input views affects angular super-resolution results. Different numbers and positions of input views are selected to compare the speed of super-resolution reconstruction and the quality of the novel views. Experimental results show that reconstruction slows as more input views are used per novel view, and that novel-view quality degrades with distance from the input views. Using two input views on the same line to reconstruct the novel views between them achieves fast and accurate light field view synthesis.
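As a minimal illustration of the depth-estimation-based synthesis discussed above, the sketch below forward-warps one input view to a nearby angular position using a per-pixel disparity map. Real methods additionally resolve occlusions, fill holes, and blend several warped inputs; all names and shapes here are illustrative.

```python
import numpy as np

def warp_view(src, disparity, delta_u):
    """Forward-warp a source view to a novel angular position delta_u
    (in units of camera baseline) using per-pixel disparity.
    src: (H, W, 3) image; disparity: (H, W) in pixels per unit baseline."""
    H, W = disparity.shape
    out = np.zeros_like(src)
    xs = np.arange(W)[None, :] + disparity * delta_u     # shifted columns
    xs = np.clip(np.rint(xs).astype(int), 0, W - 1)
    out[np.arange(H)[:, None], xs] = src                 # scatter pixels
    return out
```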
... These interviews were recorded inside a large Light Stage system [8] with fifty-four high-definition video cameras. The multiview data enabled the conversations to be projected three-dimensionally on an automultiscopic display [18,19]. The light stage system is designed for recording relightable reflectance fields, where the subject is illuminated from one lighting direction at a time, and these datasets can be recombined through image-based relighting [7]. If the subject is recorded with a high-speed video camera, a large number of lighting conditions can be recorded during a normal video frame duration [39,8], allowing a dynamic video to be lit with new lighting. ...
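At its core, the image-based relighting mentioned in this snippet is a weighted sum of the one-light-at-a-time (OLAT) basis images: each stage light's image is scaled by the target environment's intensity in that light's direction. A minimal sketch, with array shapes as assumptions:

```python
import numpy as np

def relight(olat_images, light_weights):
    """Image-based relighting as a linear combination of OLAT basis images.
    olat_images:   (n_lights, H, W, 3) one image per stage light
    light_weights: (n_lights, 3) RGB intensity of the target environment
                   sampled toward each stage light direction."""
    return np.einsum('nhwc,nc->hwc', olat_images, light_weights)
```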
Preprint
We present a learning-based method for estimating the 4D reflectance field of a person from video footage of that person captured under flat lighting. For training data, we illuminate the subject with one light at a time and capture reflectance field data in a variety of poses and viewpoints. We estimate the lighting environment of the input video footage and use the subject's reflectance field to create synthetic images of the subject illuminated by the input lighting environment. We then train a deep convolutional neural network to regress the reflectance field from the synthetic images. We also use a differentiable renderer to provide feedback for the network by matching the relit images with the input video frames. This semi-supervised training scheme allows the neural network to handle poses unseen in the dataset and to compensate for lighting estimation error. We evaluate our method on video footage of real Holocaust survivors and show that it outperforms state-of-the-art methods in both realism and speed.
... Camera arrays have been used on mobile phones as well, such as Pelican [238], where the phone is equipped with an array of small cameras, as shown in Figure 6.1(e). Multi-camera systems have also been used to create life-size digital humans by capturing real people in a light stage environment and displaying them as a hologram using an automultiscopic projector array [239]. Recently, Overbeck et al. [240] presented a system for acquiring, processing, and rendering panoramic light field stills. ...
Book
Full-text available
The introduction and recent advancements of computational photography have revolutionized the imaging industry. Computational photography is a combination of imaging techniques at the intersection of fields such as optics, computer vision, and computer graphics. These methods enhance the capabilities of traditional digital photography by applying computational techniques both during and after the capturing process. This thesis targets two major subjects in this field: High Dynamic Range (HDR) image reconstruction and Light Field (LF) compressive capturing, compression, and real-time rendering. The first part of the thesis focuses on HDR images, which concurrently contain detailed information from the very darkest shadows to the brightest areas in a scene. One of the main contributions presented in this thesis is a unified reconstruction algorithm for spatially variant exposures in a single image. This method is based on a camera noise model, and it simultaneously resamples, reconstructs, denoises, and demosaics the image while extending its dynamic range. Furthermore, the HDR reconstruction algorithm is extended to adapt to the local features of the image, as well as the noise statistics, to preserve high-frequency edges during reconstruction. In the second part of the thesis, the research focus shifts to the acquisition, encoding, reconstruction, and rendering of light field images and videos in a real-time setting. Unlike traditional integral photography, a light field captures information about the dynamic environment from all angles, at all points in space, and across spectral wavelengths and time. This thesis employs sparse representation to provide an end-to-end solution to the problem of encoding, real-time reconstruction, and rendering of high-dimensional light field video data sets. These solutions are applied to various types of data sets, such as light fields captured with multi-camera systems or hand-held cameras equipped with micro-lens arrays, and spherical light fields. Finally, sparse representation of light fields was utilized to develop a single-sensor light field video camera equipped with a color-coded mask. A new compressive sensing model is presented that is suitable for dynamic scenes with temporal coherency and is capable of reconstructing high-resolution light field videos.
... Multiple viewpoint images need to be presented according to users' positions for multiple users to view 3D images simultaneously [15,36,23]. Researchers have proposed various methods of projecting viewpoint images from multiple projectors onto a screen with narrow diffusion characteristics [33,19,18,22]. In such a multi-view 3D display, the number of viewpoints is increased to smooth changes between viewpoints, present binocular parallax, and widen the viewing area. ...
Conference Paper
We present an interactive 360-degree tabletop display system for collaborative work around a round table. Users can see 3D objects on the tabletop display anywhere around the table without 3D glasses. The system uses a visual perceptual mechanism to deliver smooth motion parallax in the horizontal direction with fewer projectors than previous works. A 360-degree camera mounted above the table and image recognition software detect users' positions around the table and the heights of their faces (eyes) as they move around the table in real time. These mechanisms help display correct vertical and horizontal motion parallax for different users simultaneously. Our system also lets users manipulate the 3D objects displayed on the table with a tablet device. These functions support collaborative work and communication between users. We implemented a prototype system and demonstrated the collaborative features of the 360-degree tabletop display system.
... Light field imaging has been an active research topic for more than a decade. Several new techniques have been proposed focusing on light field capture (Liang et al., 2008; Babacan et al., 2012), super-resolution (Wanner and Goldluecke, 2013; Choudhury et al., 2017), depth estimation (Vaish et al., 2006; Williem and Park, 2016), refocusing (Ng, 2005), geometry estimation (Levoy, 2001), and display (Jones et al., 2016; Wetzstein et al., 2012). A light field represents a subset of the plenoptic function (Adelson and Bergen, 1991), where we store the outgoing radiance at several spatial locations (r_i, t_j) and along multiple directions (u_α, v_β), as well as the spectral data λ_γ. ...
Conference Paper
Full-text available
We present a method for GPU-accelerated compression of light fields, using a dictionary learning framework for compression of light field images. The large amount of data produced when capturing light fields makes compression challenging, so we accelerate the encoding routine with GPGPU computation. We compress the data by projecting each data point onto a set of trained multidimensional dictionaries and seeking the sparsest representation with the least error. This is done by parallelizing the tensor-matrix products on the GPU. We also present a greedy algorithm optimized for GPU computation. The data is encoded segment by segment in parallel for faster computation while maintaining quality. The results show an order-of-magnitude faster encoding time compared to prior results in the same research field. We conclude that further speed improvements are possible, putting interactive compression speed within reach.
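The data-parallel heart of such an encoder is a single large matrix (or tensor) product applied to a whole batch of patches at once. A minimal numpy stand-in for that step is shown below; a GPU implementation would run the same product with a device array library. Shapes and names are assumptions.

```python
import numpy as np

def batch_correlations(patches, D):
    """Correlate a batch of vectorized patches with every dictionary atom
    in one matrix product -- the step that maps naturally onto a GPU.
    patches: (n, d); D: (d, n_atoms) -> (n, n_atoms)."""
    return patches @ D

def best_atom_per_patch(patches, D):
    """First greedy selection step, done for all patches simultaneously."""
    return np.abs(batch_correlations(patches, D)).argmax(axis=1)
```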
Chapter
Research has shown that learning outcomes can be improved by interactive visualization and exploration. This has led to the appearance of interactive installations on a range of platforms, from handheld devices to large immersive dome theaters. One of the underlying principles of this data-driven visualization for broad audiences is the confluence of exploratory and explanatory visualization into the concept of "Exploranation," meaning that explanation and exploration converge in the same application. However, specific visualization and interaction design principles must be applied to enable engaging storytelling and user-driven discovery in interactive installations targeting a general audience. The design principles are unique to different platforms and uses. Here we present an account of results, challenges, and areas in need of further research. We also describe a set of cases in which visualization has been used to reach broad audiences. Based on these examples, lessons learned are described and general principles and recommendations are provided.
Conference Paper
Full-text available
While virtual humans are proven tools for training, education, and research, they are far from realizing their full potential. Advances are needed in individual capabilities, such as character animation and speech synthesis, but perhaps more importantly, fundamental questions remain as to how best to integrate these capabilities into a single framework that allows us to efficiently create characters that can engage users in meaningful and realistic social interactions. This integration requires in-depth, interdisciplinary understanding that few individuals, or even teams of individuals, possess. We help address this challenge by introducing the ICT Virtual Human Toolkit, which offers a flexible framework for exploring a variety of different types of virtual human systems, from virtual listeners and question-answering characters to virtual role-players. We show that, due to its modularity, the Toolkit allows researchers to mix and match the provided capabilities with their own, lowering the barrier of entry to this multi-disciplinary research challenge.
Article
Full-text available
We present the first end-to-end solution to create high-quality free-viewpoint video encoded as a compact data stream. Our system records performances using a dense set of RGB and IR video cameras, generates dynamic textured surfaces, and compresses these to a streamable 3D video format. Four technical advances contribute to high fidelity and robustness: multimodal multi-view stereo fusing RGB, IR, and silhouette information; adaptive meshing guided by automatic detection of perceptually salient areas; mesh tracking to create temporally coherent subsequences; and encoding of tracked textured meshes as an MPEG video stream. Quantitative experiments demonstrate geometric accuracy, texture fidelity, and encoding efficiency. We release several datasets with calibrated inputs and processed results to foster future research.
Conference Paper
Full-text available
To increase the interest and engagement of middle school students in science and technology, the InterFaces project has created virtual museum guides that are in use at the Museum of Science, Boston. The characters use natural language interaction and have a near-photoreal appearance to increase engagement. The paper presents reports from museum staff on visitor reaction. Keywords: virtual human applications, photoreal characters, natural language interaction, virtual museum guides, STEM, informal science education
Conference Paper
Full-text available
There is a growing need for creating life-like virtual human simulations that can conduct a natural spoken dialog with a human student on a predefined subject. We present an overview of a spoken-dialog system that supports a person interacting with a full-size hologram-like virtual human character in an exhibition kiosk setting. We also give a brief summary of the natural language classification component of the system and describe the experiments we conducted with the system.
Conference Paper
Full-text available
State-of-the-art motion estimation algorithms suffer from three major problems: poorly textured regions, occlusions, and small-scale image structures. Based on the Gestalt principles of grouping, we propose to incorporate a low-level image segmentation process to tackle these problems. Our new motion estimation algorithm is based on non-local total variation regularization, which allows us to integrate the low-level image segmentation process in a unified variational framework. Numerical results on the Middlebury optical flow benchmark data set demonstrate that we can cope with the aforementioned problems.
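One common form of such an energy (a hedged reconstruction; the paper's exact data term and weighting may differ) couples a robust data term with non-local regularization whose support weights come from the low-level segmentation:

$$E(u) = \int_{\Omega} \left|I_1(\mathbf{x} + u(\mathbf{x})) - I_0(\mathbf{x})\right| d\mathbf{x} \;+\; \lambda \int_{\Omega} \int_{\mathcal{N}(\mathbf{x})} w(\mathbf{x},\mathbf{y}) \left|u(\mathbf{x}) - u(\mathbf{y})\right| d\mathbf{y}\, d\mathbf{x}$$

where w(x, y) measures color and spatial affinity, so flow discontinuities are encouraged to align with segment boundaries.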
Conference Paper
Full-text available
We present an image-based approach for capturing the appearance of a walking or running person so they can be rendered realistically under variable viewpoint and illumination. In our approach, a person walks on a treadmill at a regular rate as a turntable slowly rotates the person's direction. As this happens, the person is filmed with a vertical array of high-speed cameras under a time-multiplexed lighting basis, acquiring a seven-dimensional dataset of the person under variable time, illumination, and viewing direction in approximately forty seconds. We process this data into a flowed reflectance field using an optical flow algorithm to correspond pixels in neighboring camera views and time samples to each other, and we use image compression to reduce the size of this data. We then use image-based relighting and a hardware-accelerated combination of view morphing and light field rendering to render the subject under user-specified viewpoint and lighting conditions. To composite the person into a scene, we use an alpha channel derived from backlighting and a retroreflective treadmill surface, and a visual hull process to render the shadows the person would cast onto the ground. We demonstrate realistic composites of several subjects into real and virtual environments using our technique.
Article
Full-text available
NPCEditor is a system for building a natural language-processing component for virtual humans capable of engaging a user in spoken dialog on a limited domain. It uses statistical language-classification technology for mapping from a user's text input to system responses. NPCEditor provides a user-friendly editor for creating effective virtual humans quickly. It has been deployed as part of various virtual human systems in several applications.
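A toy stand-in for this kind of statistical input-to-response mapping is retrieval by text similarity: score the user's utterance against the questions linked to each stored response and return the best match, falling back to an off-topic reply below a threshold. NPCEditor itself uses a more sophisticated cross-language retrieval model; the sketch below is only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ToyClassifier:
    """Illustrative question-to-response mapper (not NPCEditor's model)."""
    def __init__(self, qa_pairs):                  # [(question, response), ...]
        self.questions, self.responses = zip(*qa_pairs)
        self.vectorizer = TfidfVectorizer()
        self.q_matrix = self.vectorizer.fit_transform(self.questions)

    def respond(self, utterance, threshold=0.2):
        sims = cosine_similarity(self.vectorizer.transform([utterance]),
                                 self.q_matrix)[0]
        best = sims.argmax()
        if sims[best] >= threshold:
            return self.responses[best]
        return "I can't answer that."              # off-topic fallback
```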
Conference Paper
Full-text available
Synthetic interviews are a technology developed at Carnegie Mellon University (CMU) in Pittsburgh, Pennsylvania. Synthetic interviews provide a means of conversing in depth with an individual or character, permitting users to ask questions in a conversational manner (just as they would if they were interviewing the figure face-to-face) and receive relevant, pertinent answers. Existing synthetic interviews are accessible via either typed or spoken interfaces. Through this exploration of the CG persona, users are able to discover a character's behavior, likes and dislikes, values, qualities, influences, beliefs, or personal knowledge. The synthetic interview also strives to capture and convey the core human attributes of reflection, humor, perplexity, bewilderment, frustration, and enjoyment. Synthetic interviews therefore attempt to create nothing less than a 'dyad': any two individuals maintaining a socially significant relationship (though this is not to imply that synthetic interviews are limited to one-on-one experiences; in fact, synthetic interviews utilizing multiple interviewers or interviewees are currently being developed).
Article
Full-text available
This paper describes the Swedish spoken dialogue system August, which has been used to collect spontaneous speech data, largely from people with no previous experience of speech technology or computers. The aim was to analyse how novice users interact with a multi-modal information kiosk placed without supervision in a public location. The system featured an animated talking agent, August. Speech data was collected during the six months that the system was exposed to the general public. The system and its components are briefly described, with references to more detailed papers. Keywords: multi-modal dialogue system, talking head
Article
We present a technique for achieving tracked vertical parallax for multiple users using a variety of autostereoscopic projector array setups, including front- and rear-projection and curved display surfaces. This hybrid parallax approach allows for immediate horizontal parallax as viewers move left and right and tracked parallax as they move up and down, allowing cues such as three-dimensional (3-D) perspective and eye contact to be conveyed faithfully. We use a low-cost RGB-depth sensor to simultaneously track multiple viewer head positions in 3-D space, and we interactively update the imagery sent to the array so that imagery directed to each viewer appears from a consistent and correct vertical perspective. Unlike previous work, we do not assume that the imagery sent to each projector in the array is rendered from a single vertical perspective. This lets us apply hybrid parallax to displays where a single projector forms parts of multiple viewers' imagery. Thus, each individual projected image is rendered with multiple centers of projection, and might show an object from above on the left and from below on the right. We demonstrate this technique using a dense horizontal array of pico-projectors aimed into an anisotropic vertical diffusion screen, yielding 1.5 deg angular resolution over a 110 deg field of view. To create a seamless viewing experience for multiple viewers, we smoothly interpolate the set of viewer heights and distances on a per-vertex basis across the array's field of view, reducing image distortion, cross talk, and artifacts from tracking errors.
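The per-vertex interpolation of viewer heights can be pictured as a 1-D interpolation across the display's field of view: each horizontal direction gets the vertical perspective of whichever tracked viewer (or blend of viewers) looks from that direction. A minimal sketch under those assumptions:

```python
import numpy as np

def viewer_height_at(view_angle, viewer_angles, viewer_heights):
    """Smoothly interpolate tracked viewer heights across the field of
    view (degrees). Illustrative only: the paper interpolates per vertex
    and blends viewer distances as well as heights."""
    order = np.argsort(viewer_angles)
    return np.interp(view_angle,
                     np.asarray(viewer_angles)[order],
                     np.asarray(viewer_heights)[order])
```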
Conference Paper
We report on our efforts to prepare Ada and Grace, virtual guides in the Museum of Science, Boston, to interact directly with museum visitors, including children. We outline the challenges in extending the exhibit to support this usage, mostly relating to the processing of speech from a broad population, especially child speech. We also present the summative evaluation, showing success in all the intended impacts of the exhibit: that children ages 7–14 will increase their awareness of, engagement in, interest in, positive attitude about, and knowledge of computer science and technology.
Article
We have proposed a glasses-free three-dimensional (3D) display for presenting 3D images on a large screen using multiple projectors and an optical screen consisting of a special diffuser film with a large condenser lens. To achieve high-presence communication with natural large-screen 3D images, we numerically analyze the factors that degrade image quality as the image size increases. A major factor determining 3D image quality is the arrangement of the component units, such as the projector array and condenser lens, as well as the diffuser film characteristics. We design and fabricate a prototype 200-inch glasses-free 3D display system on the basis of the numerical results. We select a suitable diffuser film and combine it with an optimally designed condenser lens. We use 57 high-definition projector units to obtain a viewing angle of 13.5°. The prototype system can display glasses-free 3D images of a life-size car using natural parallax images.
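Taking the stated figures at face value, and assuming the 57 projector units are spread evenly across the viewing angle, the angular pitch between adjacent views is roughly

$$\frac{13.5^\circ}{57} \approx 0.24^\circ$$

between neighboring viewpoints.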
Article
We propose a glasses-free 3D display, named fVisiOn, that floats virtual 3D objects on a flat tabletop surface. The approach employs a combination of an optical device and a circularly arranged projector array. It reproduces a light field optimized for a seated viewing condition at a certain volume and forms a ring-shaped viewing area. In this principle, the optical device must control the traveling direction of the numerous rays passing through the volume, and an appropriate algorithm for supplying source images to the projector array must also be considered. This paper introduces several approaches for fabricating a variety of these optical devices in different shapes. The devices work like a special screen by applying an anisotropic diffusion characteristic. We then propose an algorithm that generates source images and describe an implementation method for the 3D display as a system. The algorithm is validated with several 3D images generated by the system.
Article
We present an integrated 3D capturing, visualization, and user interaction system composed of a computer vision based 3D capturing device, a scene composer, and a large-scale holographic display. The system performs in real time and provides the facilities required for capturing realistic human 3D body models, inserting the human representations into virtual scenarios, detecting 3D interactions between the body models and the virtual objects present in the scene, and visualizing the resulting 3D performance on a true 3D holographic display.
Article
We describe a set of rendering techniques for an autostereoscopic light field display able to present interactive 3D graphics to multiple simultaneous viewers 360 degrees around the display. The display consists of a high-speed video projector, a spinning mirror covered by a holographic diffuser, and FPGA circuitry to decode specially rendered DVI video signals. The display uses a standard programmable graphics card to render over 5,000 images per second of interactive 3D graphics, projecting 360-degree views with 1.25 degree separation up to 20 updates per second. We describe the system's projection geometry and its calibration process, and we present a multiple-center-of-projection rendering technique for creating perspective-correct images from arbitrary viewpoints around the display. Our projection technique allows correct vertical perspective and parallax to be rendered for any height and distance when these parameters are known, and we demonstrate this effect with interactive raster graphics using a tracking system to measure the viewer's height and distance. We further apply our projection technique to the display of photographed light fields with accurate horizontal and vertical parallax. We conclude with a discussion of the display's visual accommodation performance and discuss techniques for displaying color imagery.
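The throughput figures are mutually consistent: with 1.25° between views, a full revolution comprises 360°/1.25° = 288 distinct views, and refreshing all of them up to 20 times per second requires

$$288 \times 20\ \text{Hz} = 5760\ \text{images/s},$$

in line with the stated rendering rate of over 5,000 images per second.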
Conference Paper
This paper presents a quantitative comparison of several multi-view stereo reconstruction algorithms. Until now, the lack of suitable calibrated multi-view image datasets with known ground truth (3D shape models) has prevented such direct comparisons. In this paper, we first survey multi-view stereo algorithms and compare them qualitatively using a taxonomy that differentiates their key properties. We then describe our process for acquiring and calibrating multiview image datasets with high-accuracy ground truth and introduce our evaluation methodology. Finally, we present the results of our quantitative comparison of state-of-the-art multi-view stereo reconstruction algorithms on six benchmark datasets. The datasets, evaluation details, and instructions for submitting new models are available online at http://vision.middlebury.edu/mview.
Article
We present a set of algorithms and an associated display system capable of producing correctly rendered eye contact between a three-dimensionally transmitted remote participant and a group of observers in a 3D teleconferencing system. The participant's face is scanned in 3D at 30Hz and transmitted in real time to an autostereoscopic horizontal-parallax 3D display, displaying him or her over more than a 180° field of view observable to multiple observers. To render the geometry with correct perspective, we create a fast vertex shader based on a 6D lookup table for projecting 3D scene vertices to a range of subject angles, heights, and distances. We generalize the projection mathematics to arbitrarily shaped display surfaces, which allows us to employ a curved concave display surface to focus the high speed imagery to individual observers. To achieve two-way eye contact, we capture 2D video from a cross-polarized camera reflected to the position of the virtual participant's eyes, and display this 2D video feed on a large screen in front of the real participant, replicating the viewpoint of their virtual self. To achieve correct vertical perspective, we further leverage this image to track the position of each audience member's eyes, allowing the 3D display to render correct vertical perspective for each of the viewers around the device. The result is a one-to-many 3D teleconferencing system able to reproduce the effects of gaze, attention, and eye contact generally missing in traditional teleconferencing systems.
Article
Three-dimensional TV is expected to be the next revolution in the history of television. We implemented a 3D TV prototype system with real-time acquisition, transmission, and 3D display of dynamic scenes. We developed a distributed, scalable architecture to manage the high computation and bandwidth demands. Our system consists of an array of cameras, clusters of network-connected PCs, and a multi-projector 3D display. Multiple video streams are individually encoded and sent over a broadband network to the display. The 3D display shows high-resolution (1024 × 768) stereoscopic color images for multiple viewpoints without special glasses. We implemented systems with rear-projection and front-projection lenticular screens. In this paper, we provide a detailed overview of our 3D TV system, including an examination of design choices and tradeoffs. We present the calibration and image alignment procedures that are necessary to achieve good image quality. We present qualitative results and some early user feedback. We believe this is the first real-time end-to-end 3D TV system with enough views and resolution to provide a truly immersive 3D experience.
Article
A number of techniques have been proposed for flying through scenes by redisplaying previously rendered or digitized views. Techniques have also been proposed for interpolating between views by warping input images, using depth information or correspondences between multiple images. In this paper, we describe a simple and robust method for generating new views from arbitrary camera positions without depth information or feature matching, simply by combining and resampling the available images. The key to this technique lies in interpreting the input images as 2D slices of a 4D function - the light field. This function completely characterizes the flow of light through unobstructed space in a static scene with fixed illumination.
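A minimal sketch of this "combining and resampling" idea: for a two-plane light field L[u, v, s, t] sampled on a camera plane, a novel view at a fractional camera position can be approximated by bilinearly blending the four nearest recorded views. This omits aperture filtering and constrains the virtual camera to the camera plane; names and shapes are assumptions.

```python
import numpy as np

def render_view(L, u, v):
    """Blend the four recorded views nearest to fractional camera-plane
    position (u, v). L: (U, V, S, T, 3) two-plane light field."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, L.shape[0] - 1), min(v0 + 1, L.shape[1] - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * L[u0, v0] + du * (1 - dv) * L[u1, v0] +
            (1 - du) * dv * L[u0, v1] + du * dv * L[u1, v1])
```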