
Immersive 3D Telepresence

Authors: Henry Fuchs and Andrei State, University of North Carolina at Chapel Hill; Jean-Charles Bazin, ETH Zürich

Abstract

Cutting-edge work on 3D telepresence at a multinational research center provides insight into the technology's potential, as well as into its remaining challenges. The first Web extra at http://youtu.be/r4SqJdXkOjQ is a video describing FreeCam, a system capable of generating live free-viewpoint video by simulating the output of a virtual camera moving through a dynamic scene. The second Web extra at http://youtu.be/Dw1glKUKs9A is a video showing a system designed to capture the enhanced 3D structure of a room-sized dynamic scene with commodity depth cameras, such as Microsoft Kinects. The third Web extra at http://youtu.be/G_VPzXRrmIw is a video demonstrating a system that adapts to a wide variety of telepresence scenarios. By combining Kinect-based 3D scanning with optical see-through HMDs, the user can precisely control which parts of the scene are real and which are virtual or remote objects. The fourth Web extra at http://youtu.be/n45N5AHsoCI is a video demonstrating a method based on moving least squares surfaces that robustly and efficiently reconstructs dynamic scenes captured by a set of hybrid color+depth cameras. The reconstruction provides spatiotemporal consistency and seamlessly fuses color and geometric information. The video also illustrates the formulation on a variety of real sequences and demonstrates that it favorably compares to state-of-the-art methods. The fifth Web extra at http://youtu.be/OSl3f2qZzKs is a video demonstrating a 3D acquisition system capable of simultaneously capturing an entire room-sized volume with an array of commodity depth cameras and rendering it from a novel viewpoint in real time. The sixth Web extra at http://youtu.be/zKWByH7evo0 is a video demonstrating a gaze-correction approach based on a single Kinect sensor that preserves both the integrity and expressiveness of the face as well as the fidelity of the scene as a whole, producing nearly artifact-free imagery. The method is suitable for mainstream home video conferencing: it uses inexpensive consumer hardware, achieves real-time performance, and requires just a simple and short setup.
For more than two decades, individually and with many collaborators, we have actively explored immersive 3D telepresence technology. Since 2011, we have been working within the BeingThere International Research Centre for Tele-Presence and Tele-Collaboration, a joint research effort among Nanyang Technological University (NTU) in Singapore, ETH Zürich in Switzerland, and the University of North Carolina (UNC) at Chapel Hill.

The BeingThere Centre is directed by Nadia Magnenat-Thalmann at NTU, Markus Gross at ETH Zürich, and Henry Fuchs at UNC. We invite readers to visit the Centre's website (http://imi.ntu.edu.sg/BeingThereCentre), which provides information about the dozens of faculty, staff, and students researching 3D telepresence as well as mobile avatars, virtual humans, and various 3D scanning and display technologies.
Here we present a brief overview of some of our recent immersive 3D telepresence work, focusing on major issues, recent results, and remaining challenges, mainly with respect to 3D acquisition and reconstruction, and 3D display. We do not discuss issues related to real-time data transmission, such as networking and compression.
TELEPRESENCE
Researchers have long envisioned "telepresent" communication among groups of people located in two or more geographically separate rooms, such as offices or lounges, by means of virtual joining of the spaces. As Figure 1 shows, shared walls become transparent, enabling participants to perceive the physically remote rooms and their occupants as if they were just beyond the walls: life size, in three dimensions, and with live motion and sound.

An ideal implementation would provide wall-size, multiuser autostereoscopic (or multiscopic, that is, showing individualized 3D views to each user) displays along the shareable walls, allowing encumbrance-free, geometrically correct 3D viewing of the remote sites. Together with directional sound, such a system should create a convincing sense of co-presence within the joint real–virtual space, enabling almost any kind of natural interaction and communication short of actually stepping across the seemingly transparent walls into the other rooms (Figure 2a).

Alternatively, if a display were mounted on a moving platform (Figure 2b), remote participants could move anywhere in the local environment. With the help of a transparent screen, they could be even more effectively integrated with that space and its occupants, at the expense of not showing their own remote environments.
WHY 3D MIGHT BE BETTER THAN 2D

With conventional 2D teleconferencing systems such as Skype, and even with high-end systems such as Cisco TelePresence TX9000, the imagery seen by all participants at one site is exactly what is acquired by the one or more cameras located at the remote site(s). This is fundamentally different from in-person, face-to-face meetings, where each participant sees the surroundings from his or her own point of view, and each point of view is unique because participants are sitting or standing in different locations around the room. In face-to-face meetings, we each change our location and direction of gaze so naturally that we hardly give it a thought. In addition, when someone is looking at us, not only do we see that person looking at us, but everyone else can observe that person looking at us from his or her own point of view. It has been shown that mutual gaze enhances human communication,1 and thus we also aim to offer this capability in the systems we design.

Natural movement in 3D space, situational awareness, gaze direction, and eye contact are very difficult to provide in 2D teleconferencing, where all participants see the remote scene from the fixed viewpoint(s) of the remote camera(s). Hence, to achieve most of the benefits of face-to-face interaction, we believe that each local participant should receive personal imagery of the remote environment that matches his or her dynamically changing point of view.

The importance of 3D display is an active research question and appears to depend on the target application and context.2 For example, while 2D display might be sufficient for casual one-to-one video conferencing, 3D display could play a key role in telepresence scenarios such as collaborative work, 3D object or data manipulation, and remote space immersion.
3D TELEPRESENCE REQUIREMENTS
To generate the novel views for each participant, two major approaches are being used: image-based methods and 3D reconstruction. Image-based methods3 require deployment of many cameras and are appropriate when the novel viewpoints are close to the cameras' physical locations. In contrast, 3D reconstruction estimates the scene's actual 3D shape (objects, people, background, and so on), resulting in a dynamic geometric model that can then be rendered for these novel viewpoints. Moreover, such a geometric model can be enhanced with synthetic representations of objects of interest.
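To make the reconstruction-based approach concrete, the sketch below (our own simplified illustration, not code from any of the systems described here) projects the points of a reconstructed model into a virtual pinhole camera placed at a participant's tracked viewpoint. A real renderer would rasterize textured, occlusion-tested geometry, but the underlying camera math is the same.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build world-to-camera rotation and translation for a viewer at `eye` looking at `target`."""
    fwd = target - eye
    fwd = fwd / np.linalg.norm(fwd)            # camera z-axis (viewing direction)
    right = np.cross(up, fwd)
    right = right / np.linalg.norm(right)      # camera x-axis
    true_up = np.cross(fwd, right)             # camera y-axis
    R = np.stack([right, true_up, fwd])        # rows = camera axes expressed in world coordinates
    t = -R @ eye
    return R, t

def project_points(points_world, R, t, fx, fy, cx, cy):
    """Project Nx3 world-space points into pixel coordinates of a pinhole camera."""
    pc = points_world @ R.T + t                # world -> camera frame
    pc = pc[pc[:, 2] > 1e-6]                   # keep only points in front of the camera
    u = fx * pc[:, 0] / pc[:, 2] + cx
    v = fy * pc[:, 1] / pc[:, 2] + cy
    return np.stack([u, v], axis=1), pc[:, 2]  # pixel positions and their depths

# Toy usage: a random point set stands in for the reconstructed room geometry.
model = np.random.rand(1000, 3) * np.array([3.0, 2.0, 3.0])
R, t = look_at(eye=np.array([1.5, 1.6, -2.0]), target=np.array([1.5, 1.0, 1.5]))
pixels, depths = project_points(model, R, t, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```

Recomputing this projection each frame, from each tracked eye position, is what turns a single shared 3D model into per-participant imagery.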
To provide dynamic personalized views to each telepresence participant, we reconstruct the 3D environment of each site and display it in 3D at each of the other sites. This requires three distinct but closely coupled processes (a simplified sketch of how they might fit together follows the list):

- continuously scan each environment to build and maintain an up-to-date 3D model of it, including all people and objects;
- transmit that 3D model to the other sites; and
- generate and display to each participant the appropriate 3D view of each distant room and its contents.
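The skeleton below shows, in deliberately simplified form, how these three processes might be coupled in one site's main loop. The names and structure are ours and purely illustrative; real systems run the stages concurrently across GPUs and dedicated capture PCs.

```python
from dataclasses import dataclass

@dataclass
class SceneModel:
    """Stand-in for a dynamic textured 3D model of one room (illustrative only)."""
    vertices: list
    colors: list
    timestamp: float

def capture_local_model(cameras):
    """Process 1: fuse the newest RGB-D frames into an up-to-date model of the local room."""
    return SceneModel(vertices=[], colors=[], timestamp=0.0)   # placeholder

def transmit(model, remote_sites):
    """Process 2: compress and send the local model to every other site."""
    pass   # placeholder; networking and compression are outside this article's scope

def render_view(remote_model, viewpoint):
    """Process 3: draw a remote room from one participant's current (tracked) viewpoint."""
    pass   # placeholder for the actual stereo renderer

def telepresence_loop(cameras, remote_sites, local_participants, latest_remote_models):
    """One site's main loop: scan continuously, stream out, render personalized views."""
    while True:
        local_model = capture_local_model(cameras)
        transmit(local_model, remote_sites)
        for person in local_participants:
            for remote_model in latest_remote_models():   # most recent model per remote site
                render_view(remote_model, person["viewpoint"])
```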
Our recent work has emphasized the acquisition and display challenges, both active research areas in the 3D vision and graphics communities.
Figure 2. Telepresence implementations. (a) Natural face-to-face interaction between two participants in a room-based scenario. (b) Remote participant displayed on a mobile, life-size, transparent stereoscopic display.

Figure 1. An artist's depiction of the BeingThere Centre's multiroom telepresence concept shows three geographically remote rooms, virtually joined as if co-located and separated by seemingly transparent walls.
PREVIOUS WORK
The dream of 3D telepresence has inspired researchers for many decades, but, due to technological difficulties, prototypes emerged slowly in the 1980s and 1990s. The basic approach for creating 3D reconstructions of room-size environments involves deploying multiple cameras around the space and using their imagery with various stereo reconstruction techniques to continually update the 3D model of that space, including, of course, the moving people. A UNC-led team conducted an early experiment in 3D teleconferencing using such a "sea" of cameras.4 Carnegie Mellon University researchers created one of the first systems to capture dynamic scenes and render them from new viewpoints using a set of 51 cameras fixed on a five-meter dome.5 ETH Zürich's blue-c was perhaps the first bidirectional 3D telepresence system: it scanned as well as displayed in 3D a participant at each of two locations.6 Later work at the Electronic Visualization Laboratory of the University of Illinois at Chicago introduced simultaneous 3D display for two or three local users.7

Early prototypes used bulky head-mounted displays (HMDs).4 While providing a strong 3D illusion of a distant person or environment, these early HMDs were inadequate if for no other reason than they required each participant to view other distant or local partners by means of a helmet or goggles, hardly a satisfactory illusion. The display in blue-c was a considerable improvement: a CAVE (computer-assisted virtual environment)-like experience with head-tracked stereo display using active shutter stereo glasses for the single local user.6

Although today's stereo shutter glasses are still so dark that they preclude effective eye contact and thus impede some forms of natural interaction, they are the only currently available technology supporting fully individualized stereo views for multiple local users. An excellent example of a multiuser stereo display (and 3D acquisition) system was developed by a team at Bauhaus University.8,9 It supports up to six local users through six stereo projectors rear-projecting onto the same screen area. The ingenious design permanently assigns each projector the task of showing a primary color (red, green, or blue) and a single eye's view (left or right) to all six users, with each user's view displayed at one of six time slots during each video frame.
3D ACQUISITION AND RECONSTRUCTION
3D acquisition and reconstruction are the technologies that feed the 3D telepresence pipeline. They have to meet critical requirements of accuracy, completeness, and speed. A room-size environment can be viewed by remote partners from a multitude of viewpoints located anywhere in the remote site's "telepresence room." For example, consider Figure 1: one of the seated participants at the UNC site (foreground) might get up and walk up very close to the wall display to conduct a semiconfidential, low-volume conversation with one of the NTU participants, perhaps as illustrated in Figure 2a. Supporting such natural behavior requires that the 3D telepresence system acquire and reconstruct minute details of each environment with high accuracy.
Figure 3. Real-time automatic 3D reconstruction of participants from a single color-plus-depth (RGB-D) camera. The top row shows the geometry, and the bottom row shows its textured version. The images on the left show raw RGB-D data; the images on the right show RGB-D data after filtering and the application of occlusion and photometric-consistency operations.
People's continuous movement (walking, gesturing, changing facial expressions, and so on) makes this task exceptionally difficult. Yet such details must be captured and properly reconstructed at remote sites without annoying or misleading visual artifacts. Today's teleconferencing users are accustomed to high-definition 2D video and are unlikely to accept jarring image-quality degradation in exchange for true viewpoint-specific dynamic stereoscopy.

The most popular traditional 3D reconstruction strategy has been to use numerous conventional color cameras. The recent emergence of inexpensive color-plus-depth (RGB-D) cameras, such as Microsoft's Kinect, has revolutionized 3D reconstruction, and we have been using them in many of our telepresence projects. For small-scale scenarios with few participants (typically up to two at each site), one or two Kinects are sufficient. While Kinects can extract textured geometry in real time, their data contains spatial and temporal noise as well as missing values, especially along depth discontinuities; therefore, raw Kinect data must be processed to eliminate or reduce those.
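The sketch below shows the kind of per-frame cleanup this implies, assuming a depth image in millimeters with zeros marking missing samples: suppress speckle noise with a median filter, discard samples that straddle large depth discontinuities (where structured-light depth is least reliable), and fill small holes from nearby valid measurements. It is a minimal stand-in for, not a reproduction of, the filtering used in the systems cited here.

```python
import numpy as np
from scipy import ndimage

def clean_depth(depth_mm, max_jump_mm=60, kernel=3, max_hole_px=4):
    """Reduce noise and unreliable samples in a raw depth map (uint16 millimeters, 0 = missing)."""
    depth = depth_mm.astype(np.float32)
    valid = depth > 0

    # 1. Median filtering suppresses isolated speckle noise.
    depth = ndimage.median_filter(depth, size=kernel)
    depth[~valid] = 0.0

    # 2. Drop samples sitting on strong depth discontinuities ("flying pixels" along
    #    silhouettes), where the sensor's depth estimates are least trustworthy.
    lo = ndimage.minimum_filter(np.where(valid, depth, np.inf), size=kernel)
    hi = ndimage.maximum_filter(np.where(valid, depth, -np.inf), size=kernel)
    depth[(hi - lo) > max_jump_mm] = 0.0

    # 3. Fill small holes from the nearest remaining valid sample.
    missing = depth == 0
    dist, nearest = ndimage.distance_transform_edt(missing, return_indices=True)
    small_hole = missing & (dist <= max_hole_px)
    depth[small_hole] = depth[tuple(nearest)][small_hole]
    return depth
```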
Figure 3 shows the quality enhancement achieved by one of our 3D reconstruction techniques (http://beingthere.ethz.ch/videos/VMV2011.mp4).10 For larger environments or more participants, Figure 4 shows a typical real-time 3D reconstruction of a room scene using 10 Kinect cameras, which represents the approximate limit of the amount of data that can be processed today within a single common PC in real time (www.cs.unc.edu/TelepresenceVideos/RealTimeVolumetricTelepresence.mp4).11 We have also used RGB-D cameras for real-time gaze correction, and developed a Skype plug-in to provide convincing eye contact between videoconferencing participants (http://beingthere.ethz.ch/videos/SA2012.mp4).12
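For a multicamera rig like the 10-Kinect setup above, each depth image must be back-projected into 3D and transformed into a shared room coordinate system using that camera's calibrated intrinsics and extrinsics before the views can be merged. The sketch below shows that step for one frame per camera; it is our own simplification with a hypothetical calibration format, and the real pipeline additionally performs volumetric merging, meshing, and texturing on the GPU.

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image (meters, 0 = missing) into camera-space 3D points."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)   # Nx3 camera-space points

def to_world(points_cam, R_cam_to_world, t_cam_to_world):
    """Apply a camera's calibrated extrinsics to move its points into the shared room frame."""
    return points_cam @ R_cam_to_world.T + t_cam_to_world

def fuse_frames(frames):
    """Merge one frame from each calibrated depth camera into a single room-scale point cloud.

    `frames` is a list of (depth_m, intrinsics_dict, R, t) tuples -- a stand-in for whatever
    calibration format a real system would use.
    """
    clouds = []
    for depth_m, K, R, t in frames:
        pts = depth_to_points(depth_m, K["fx"], K["fy"], K["cx"], K["cy"])
        clouds.append(to_world(pts, R, t))
    return np.concatenate(clouds, axis=0) if clouds else np.empty((0, 3))
```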
3D DISPLAY
When the 3D telepresence display only needs to show a single distant individual, we might choose to project that person without his or her background environment on a human-size transparent display to give a strong illusion of that distant individual's presence in the local environment (Figure 2b). Figure 5 shows one of our implementations of this concept, with a transparent screen displaying rear-projected stereoscopic imagery.10,13
If there is only one local participant and the reduced eye contact of stereo glasses is acceptable, then a simple head-tracked stereo display might be adequate. We have built numerous such systems, including the transparent display13 of Figure 5 and others consisting of several large stereoscopic TVs forming a wall-size personal display window into the remote site.
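Head-tracked stereo of this kind amounts to recomputing, every frame and for each eye, an off-axis perspective frustum from the tracked eye position through the fixed rectangle of the screen. The sketch below follows the standard generalized-perspective-projection construction; the screen corners and eye positions are made-up example values, not measurements from any of our displays.

```python
import numpy as np

def frustum(l, r, b, t, n, f):
    """OpenGL-style perspective frustum matrix."""
    return np.array([
        [2*n/(r-l), 0,          (r+l)/(r-l),  0],
        [0,         2*n/(t-b),  (t+b)/(t-b),  0],
        [0,         0,         -(f+n)/(f-n), -2*f*n/(f-n)],
        [0,         0,         -1,            0]])

def off_axis_projection(eye, pa, pb, pc, near=0.05, far=50.0):
    """Projection and view matrices for one eye looking through a fixed screen rectangle.

    pa, pb, pc are the screen's lower-left, lower-right, and upper-left corners in room
    (tracker) coordinates; `eye` is the tracked 3D position of that eye.
    """
    vr = pb - pa; vr /= np.linalg.norm(vr)            # screen right axis
    vu = pc - pa; vu /= np.linalg.norm(vu)            # screen up axis
    vn = np.cross(vr, vu); vn /= np.linalg.norm(vn)   # screen normal (toward the viewer)

    va, vb, vc = pa - eye, pb - eye, pc - eye         # eye-to-corner vectors
    d = -np.dot(va, vn)                               # eye-to-screen distance
    l = np.dot(vr, va) * near / d
    r = np.dot(vr, vb) * near / d
    b = np.dot(vu, va) * near / d
    t = np.dot(vu, vc) * near / d

    P = frustum(l, r, b, t, near, far)
    M = np.eye(4)                                     # view matrix: rotate into the screen
    M[:3, :3] = np.stack([vr, vu, vn])                # basis, then translate by -eye
    M[:3, 3] = -M[:3, :3] @ eye
    return P, M

# Per frame, per user: one call for each tracked eye (spaced roughly 6.5 cm apart).
pa, pb, pc = np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([-1.0, 1.2, 0.0])
P_l, V_l = off_axis_projection(np.array([-0.032, 1.1, 1.5]), pa, pb, pc)
P_r, V_r = off_axis_projection(np.array([ 0.032, 1.1, 1.5]), pa, pb, pc)
```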
A more significant challenge arises when there are multiple local participants, in which case we seek to present to each participant the correct stereoscopic view from each of their positions, preferably without any encumbering stereo glasses. To achieve this multiscopic display, we have been exploring techniques developed for compressive light-field displays at the MIT Media Lab.14 Together with its team, we have recently improved such displays by optimizing the light-field views only for the current spatial locations of all viewers.15 Figure 6 shows two photos of our optimized display, simultaneously taken from two different vantage points without any filters or specialized glasses. We plan to build larger displays from tiled copies of this 27-inch prototype.
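The intuition behind compressive multilayer displays can be shown with a toy one-dimensional, two-layer example: each emitted ray passes through one back-layer pixel and one front-layer pixel, its intensity is the product of the two, and the layer patterns are optimized so that the set of rays approximates the desired views. The sketch below fits such a factorization to a made-up target with simple alternating updates; the real displays solve a much larger constrained tensor factorization, and the tracked variant additionally weights only the rays headed toward the viewers' current eye positions.

```python
import numpy as np

def factor_two_layer(target, iters=200, eps=1e-9):
    """Fit back-layer pattern f and front-layer pattern g so that f[a] * g[b]
    approximates the non-negative target ray intensities target[a, b]."""
    a_res, b_res = target.shape
    rng = np.random.default_rng(0)
    f = rng.random(a_res) + 0.1
    g = rng.random(b_res) + 0.1
    for _ in range(iters):
        # Alternating least-squares updates, clipped so the patterns stay displayable (>= 0).
        f = np.clip(target @ g / (g @ g + eps), 0.0, None)
        g = np.clip(target.T @ f / (f @ f + eps), 0.0, None)
    return f, g

# Made-up target: each entry is the intensity wanted for the ray through back pixel a and
# front pixel b; different (a, b) pairs correspond to different view directions, which is
# what lets two layers encode parallax.
a_grid, b_grid = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32), indexing="ij")
target = 0.2 + 0.8 * np.exp(-((a_grid - b_grid) ** 2) / 0.02)   # brighter near head-on rays
f, g = factor_two_layer(target)
print("mean abs error:", np.abs(np.outer(f, g) - target).mean())
```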
Figure 4. Virtual views of real-time 3D room reconstructions from 10 Kinect RGB-D cameras.
REMAINING CHALLENGES
Despite recent progress in 3D telepresence, challenges remain both in 3D acquisition and reconstruction and in 3D display.
3D acquisition and reconstruction
Accurate and rapid 3D reconstruction of an entire meeting room will require dozens of RGB-D cameras, more than can be operated by a single PC today. To that end, we must design a distributed real-time acquisition system; such a system's work becomes increasingly challenging as the acquisition volume increases. System complexity is indeed a significant issue: today, even high-end teleconferencing systems use only a small number of displays and a few high-quality cameras. In contrast, immersive 3D telepresence systems will likely require many more cameras, advanced unconventional displays, and considerably more processing.
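One plausible shape for such a distributed system, sketched here under our own assumptions rather than as a description of any existing implementation, is a capture node per camera that compresses each depth frame and streams it, with a timestamp for synchronization, to a central fusion server. The host name, port, and wire format below are hypothetical.

```python
import socket
import struct
import zlib
import numpy as np

FUSION_HOST, FUSION_PORT = "fusion-server.local", 9099   # hypothetical fusion endpoint

def send_frame(sock, camera_id, depth_mm, timestamp_us):
    """Compress one uint16 depth frame and send it with a small fixed-size header."""
    payload = zlib.compress(depth_mm.astype(np.uint16).tobytes(), level=1)
    header = struct.pack("<IIQII", camera_id, len(payload), timestamp_us,
                         depth_mm.shape[0], depth_mm.shape[1])
    sock.sendall(header + payload)

def capture_node(camera_id, grab_depth_frame):
    """Runs on the PC attached to one RGB-D camera; `grab_depth_frame` stands in for the
    actual camera API and returns (depth_mm, timestamp_us)."""
    with socket.create_connection((FUSION_HOST, FUSION_PORT)) as sock:
        while True:
            depth_mm, timestamp_us = grab_depth_frame()
            send_frame(sock, camera_id, depth_mm, timestamp_us)
```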
Another challenge is that the quality of images obtained through real-time 3D capture and reconstruction is not on par with that of 2D images directly acquired by conventional cameras. Reconstruction artifacts and missing 3D data are intolerable to users accustomed to high-definition image quality even from inexpensive webcams. However, there have been rapid advances in consumer-grade RGB-D cameras, both in existing offerings (Microsoft Kinect, PrimeSense Capri) and in new models (Google's Project Tango, Intel's RealSense 3D camera).

There also has been progress in 3D reconstruction algorithms. Recent work demonstrates improvements in reconstruction quality from processing of shadows from the infrared projector in an RGB-D camera,16 accumulation of temporal data for fixed as well as deformable objects such as people (www.cs.unc.edu/TelepresenceVideos/VR2014.mp4),17 and spatiotemporally consistent reconstruction from several hybrid cameras (http://beingthere.ethz.ch/videos/EG2014.mp4).18 However, most of these techniques are not yet capable of real-time performance.
3D display
Today, even state-of-the-art wall-size stereo displays that provide personalized views to each freely moving user require specialized stereo glasses.9 For many, these dark glasses would be uncomfortable and visually unacceptable, as they impede eye contact. The more attractive alternatives, high-quality, large-format multiscopic displays, are still in a basic research phase and remain to be built, but they would have an enormous impact on the field.

Augmented reality (AR) eyeglass-style displays could enable more flexible interaction among participants than even multiscopic displays. Using 3D models of the remote and local environments, these HMDs could achieve the most powerful sense of combined presence yet,19 as Figure 7 shows (www.cs.unc.edu/TelepresenceVideos/AugmentedRealityTelepresence.mp4). The newest designs promise significant improvements over older-style goggles: more transparency for better eye contact and a brighter view of the local environment, as well as a wider field of view and an eyeglass form factor suitable for long-term wear.20 We hope for a convenient see-through AR display like Google Glass and the Lumus DK-32, but with a wide field of view like that of the Oculus Rift.
The most encouraging aspect for the future of 3D telepresence is that relevant technologies are rapidly advancing because of consumer interest in the various components: ever-higher-quality large-format video displays, ever-higher-quality RGB-D cameras, and ever-faster GPUs for gaming and entertainment. Most of these technologies can be easily repurposed for the demanding scenarios of 3D telepresence. In the past five years, the advances have been greater than in the previous 15; we expect continuing and exciting improvements in the next few years.
Figure 5. Photo of a near-life-size stereoscopic transparent rear-projection display, showing both left and right eye stereo images (the display is viewed with passive stereo glasses). Note that furniture in the background is visible through the transparent display.
We see no obvious roadblocks to the realization of immersive 3D telepresence. As with other dramatic changes, such as the move from analog to digital television, the older technology can remain dominant during decades of incremental development. However, as cost and effectiveness of new 3D telepresence technologies continue to improve, the advantages of 3D telepresence over 2D teleconferencing will become increasingly attractive.
Acknowledgments
We gratefully acknowledge our colleagues within and outside of the BeingThere Centre: codirectors Markus Gross and Nadia Magnenat-Thalmann; project leaders Tat Jen Cham, I-Ming Chen, Marc Pollefeys, Gerald Seet, and Greg Welch; assistant director Frank Guan; professors Jan-Michael Frahm, Anselmo Lastra, Tiberiu Popa, Miriam Reiner, and Turner Whitted; and research assistants Nate Dierk, Mingsong Dou, Iskandarsyah, Claudia Kuster, Andrew Maimone, Tobias Martin, and Nicola Ranieri. Special thanks to Renjie Chen for the photos of our multiscopic display and to Mingsong Dou for the 3D head-scan data. We are also grateful to Mark Bolas of the University of Southern California and Tracy McSheery of PhaseSpace for the experimental optical see-through HMD through which the image in Figure 7 was acquired.

This research was supported in part by the BeingThere Centre, a collaboration among ETH Zürich, NTU Singapore, and UNC Chapel Hill, supported by ETH, NTU, UNC, and the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the Interactive Digital Media Programme Office. Part of this research was also supported by Cisco Systems, and by the US National Science Foundation, Award IIS-1319567, "HCC: CGV: Small: Eyeglass-Style Multi-Layer Optical See-Through Displays for Augmented Reality."
Figure 6. Simultaneously taken photos of a multiscopic display from two different viewpoints. Note that the virtual head faces perpendicularly out of the screen; its spatial relationship to the mug in the foreground is preserved, and each viewer sees a different side of the head.

Figure 7. View through an experimental augmented reality head-mounted display showing a distant participant across the local table and a virtual couch model on the table.
References
1. M. Argyle and M. Cook, Gaze and Mutual Gaze, Cambridge Univ. Press, 1976.
2. J.P. McIntire, P.R. Havig, and E.E. Geiselman, "Stereoscopic 3D Displays and Human Performance: A Comprehensive Review," Displays, vol. 35, no. 1, 2014, pp. 18–26.
3. H.-Y. Shum, S.-C. Chan, and S.B. Kang, Image-Based Rendering, Springer, 2008.
4. H. Fuchs et al., "Virtual Space Teleconferencing Using a Sea of Cameras," Proc. 1st Int'l Conf. Medical Robotics and Computer Assisted Surgery (MRCAS 94), 1994, pp. 161–167.
5. T. Kanade, P. Rander, and P.J. Narayanan, "Virtualized Reality: Constructing Virtual Worlds from Real Scenes," IEEE MultiMedia, vol. 4, no. 1, 1997, pp. 34–47.
6. M. Gross et al., "blue-c: A Spatially Immersive Display and 3D Video Portal for Telepresence," ACM Trans. Graphics, vol. 22, no. 3, 2003, pp. 819–827.
7. T. Peterka et al., "Advances in the Dynallax Solid-State Dynamic Parallax Barrier Autostereoscopic Visualization Display System," IEEE Trans. Visualization and Computer Graphics, vol. 14, no. 3, 2008, pp. 487–499.
8. A. Kulik et al., "C1x6: A Stereoscopic Six-User Display for Co-located Collaboration in Shared Virtual Environments," ACM Trans. Graphics, vol. 30, no. 6, 2011; doi:10.1145/2070781.2024222.
9. S. Beck et al., "Immersive Group-to-Group Telepresence," IEEE Trans. Visualization and Computer Graphics, vol. 19, no. 4, 2013, pp. 616–625.
10. C. Kuster et al., "Towards Next Generation 3D Teleconferencing Systems," Proc. 3DTV-Conf.: The True Vision—Capture, Transmission and Display of 3D Video (3DTV-CON 12), 2012; doi:10.1109/3DTV.2012.6365454.
11. A. Maimone and H. Fuchs, "Real-Time Volumetric 3D Capture of Room-Sized Scenes for Telepresence," Proc. 3DTV-Conf.: The True Vision—Capture, Transmission and Display of 3D Video (3DTV-CON 12), 2012; doi:10.1109/3DTV.2012.6365430.
12. C. Kuster et al., "Gaze Correction for Home Video Conferencing," ACM Trans. Graphics, vol. 31, no. 6, 2012; doi:10.1145/2366145.2366193.
13. N. Ranieri, H. Seifert, and M. Gross, "Transparent Stereoscopic Display and Application," Proc. SPIE, vol. 9011, 2014; doi:10.1117/12.2037308.
14. G. Wetzstein et al., "Tensor Displays: Compressive Light Field Synthesis Using Multilayer Displays with Directional Backlighting," ACM Trans. Graphics, vol. 31, no. 4, 2012; doi:10.1145/2185520.2185576.
15. A. Maimone et al., "Wide Field of View Compressive Light Field Display Using a Multilayer Architecture and Tracked Viewers," to appear in Proc. SID Display Week, 2014.
16. T. Deng et al., "Kinect Shadow Detection and Classification," Proc. IEEE Int'l Conf. Computer Vision Workshops (ICCVW 13), 2013, pp. 708–713.
17. M. Dou and H. Fuchs, "Temporally Enhanced 3D Capture of Room-Sized Dynamic Scenes with Commodity Depth Cameras," Proc. IEEE Conf. Virtual Reality (VR 14), 2014, pp. 39–44.
18. C. Kuster et al., "Spatio-Temporal Geometry Fusion for Multiple Hybrid Cameras Using Moving Least Squares Surfaces," Computer Graphics Forum, vol. 33, no. 2, 2014; doi:10.1111/cgf.12285.
19. A. Maimone et al., "General-Purpose Telepresence with Head-Worn Optical See-through Displays and Projector-Based Lighting," Proc. IEEE Conf. Virtual Reality (VR 13), 2013; doi:10.1109/VR.2013.6549352.
20. A. Maimone and H. Fuchs, "Computational Augmented Reality Eyeglasses," Proc. IEEE Int'l Symp. Mixed and Augmented Reality (ISMAR 13), 2013, pp. 29–38.
Henry Fuchs is the Federico Gil Distinguished Professor of Computer Science at the University of North Carolina at Chapel Hill, and codirector of the BeingThere Centre. His research interests include telepresence, augmented reality, and graphics hardware and algorithms. Fuchs received a PhD in computer science from the University of Utah. He is a member of the National Academy of Engineering. Contact him at fuchs@cs.unc.edu.

Andrei State is a senior research scientist in the Department of Computer Science at the University of North Carolina at Chapel Hill, and cofounder of InnerOptic Technology, which creates virtual reality guidance for surgeons. His research interests include telepresence and virtual and augmented reality. State received a Dipl.-Ing. (aer) from the University of Stuttgart, Germany, and an MS in computer science from UNC Chapel Hill. Contact him at andrei@cs.unc.edu.

Jean-Charles Bazin is a senior researcher in the ETH Zürich Computer Graphics Laboratory, and conducts research at the BeingThere Centre. His research interests include various topics in computer vision and graphics, such as image/video editing and 3D data processing. Bazin received a PhD in electrical engineering from KAIST, South Korea, and an MS in computer science from Université de Technologie de Compiègne, France. Contact him at jebazin@inf.ethz.ch.