Real-time Terascale Implementation of Tele-immersion
Nikhil Kelshikar1, Xenophon Zabulis1, Jane Mulligan4, Kostas Daniilidis1, Vivek
Sawant2, Sudipta Sinha2, Travis Sparks2, Scott Larsen2, Herman Towles2, Ketan
Mayer-Patel2, Henry Fuchs2, John Urbanic3, Kathy Benninger3, Raghurama Reddy3,
and Gwendolyn Huntoon3
1University of Pennsylvania
2University of North Carolina at Chapel Hill
3Pittsburgh Supercomputing Center
4University of Colorado at Boulder
Abstract. Tele-immersion is a new medium that enables a user to share a virtual space with remote participants, by creating the illusion that users at geographically dispersed locations reside in the same physical space. A person is immersed in a remote world, whose 3D representation is acquired remotely, then transmitted and displayed in the viewer's environment. Tele-immersion is effective only when all three components (computation, transmission, and rendering) operate in real time. In this paper, we describe the real-time implementation of scene reconstruction on the Terascale Computing System at the Pittsburgh Supercomputing Center.
1 Introduction
Tele-immersion enables users at geographically distributed locations to collaborate in a shared space, which integrates the environments at these locations. In an archetypical tele-immersion environment as proposed at the origin of this project [8, 4], a user wears polarized glasses and a tracker capturing the head's pose. On a stereoscopic display, a remote scene is rendered so that it can be viewed from all potential viewpoints in the space of the viewer. To achieve this, we propose an architecture that enables real-time, view-independent 3D scene acquisition, transmission, and rendering (see Fig. 1). Most of the computational challenges are posed by the 3D scene acquisition. This stage associates pixels with the 3D coordinates of the world points they depict, in a view-independent coordinate system. This association can be based on finding pixel correspondences between a pair of images. The derived correspondences constitute the basis for computing a disparity map, from which the depth of the depicted world points and, in turn, their coordinates can be estimated. The solution of the correspondence problem is associated with many challenging open topics, such as establishing correspondences for pixels that reside in textureless image regions, detecting occlusions, and coping with specular illumination effects. This involves a trade-off between being conservative, which produces many holes in the depth map, and being lenient, which covers everything at the cost of outliers.
Contact person: Nikhil Kelshikar, nikhil@grasp.cis.upenn.edu, University of Pennsylvania, GRASP Laboratory, 3401 Walnut St., Philadelphia, PA 19104-6228. All three sites acknowledge financial support by NSF grant IIS-0121293.
[Figure 1 diagram: images are acquired by a camera cluster in Baltimore, sent by a video sender over Internet2 to stereo reconstruction nodes on Lemieux at PSC, and the 3D scene is returned over Internet2 to the renderer and immersive display suite (stereo display, head tracker) in Baltimore.]
Fig. 1. System architecture. Images are acquired with a cluster of cameras, processed by a parallel
computational engine to provide the 3D-description, transmitted, and displayed immersively.
As of summer 2001, we had achieved an 8 Hz acquisition rate with limited depth quality, limited resolution, and a very small operation space. Eventually, the original collaborative project [3] to produce a perceptually realistic tele-immersive environment reached a level of maturity such that the remaining performance bottlenecks were well understood. We then established a goal to produce a real-time version with a dramatic increase in the volume of the scanned area, resolution, and depth accuracy. As opposed to other systems, which capture one person inside a small area, we employed an array of stereo systems distributed so that they capture all actions from the span of all possible viewpoints of the remote viewer (wide-area scanning). Each stereo unit provides a 2½D view as well as the correctly registered texture of this view.
Fig. 2. 30 (of the 55 pictured) cameras were used in the November 2002 Supercomputing Conference demonstration (left). The acquired scene, computed at PSC, was displayed immersively (right).
To achieve this dramatic improvement in space and resolution while maintaining real-time performance, a significant increase in computational power is required, boosting the requirements to the supercomputing level. Such a resource was available at a remote location, and thus we established one of the first applications where sensing, computation, and display are at different sites and coupled in real time. To overcome the transmission constraints, the initial implementation uses a video server transmitting video streams over TCP/IP and a reliable UDP transmission of the depth maps from the computation site to the display site.
1.1 Computational and Networking Requirements
A single camera unit requires a triad of cameras grabbing pictures at 30 frames/second at VGA resolution. We use two monochromatic cameras (8 bits per pixel) and one color camera (8 bits per pixel; the color conversion is done in software). The images are captured at 640 × 480 pixel resolution. This produces 7.03 Mbits of data per frame. At the capture rate of 30 fps, we would produce data at 205.7 Mbits/sec. For each acquisition site we proposed to use a cluster of 10 or more camera units to adequately reconstruct the entire acquisition room, thus increasing data rates to 2.02 Gbits/sec per site. This is the amount of raw data that must make its way to the computing engine. The computing engine must execute 640 × 480 × 100 depths × 31 × 31 kernel size ≈ 29.5 G multiplications and additions per camera unit. The non-parallel version needed approximately 320 × 240 × 64 depths × 5 × 5 kernel size ≈ 122 M multiplications and additions per camera unit (actually twice as much, because it used trinocular stereo). The produced 3D data stream is 11.72 Mbits/frame and consists of a 16-bit inverse depth and a 24-bit color texture. This also scales with the number of camera units and users.
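For concreteness, the following short Python sketch recomputes the figures above from the stated parameters (resolution, bit depths, frame rate, disparity range, kernel sizes). Treating "Mbits" as binary megabits (2^20 bits) is an assumption that appears to match the quoted values; small rounding differences may remain.

```python
# Back-of-the-envelope check of the data-rate and operation-count figures
# quoted above (a sketch; binary megabits are assumed).

W, H = 640, 480            # image resolution
CAMERAS_PER_UNIT = 3       # two monochrome + one color camera, 8 bits/pixel each
FPS = 30                   # capture rate
UNITS_PER_SITE = 10        # proposed camera units per acquisition site

Mbit = 2 ** 20             # binary megabit

# Raw capture data
bits_per_frame = W * H * 8 * CAMERAS_PER_UNIT
print(f"raw data per frame:   {bits_per_frame / Mbit:.2f} Mbit")            # ~7.03 Mbit
print(f"raw data per site:    {bits_per_frame * FPS * UNITS_PER_SITE / (1024 * Mbit):.2f} Gbit/s")  # ~2 Gbit/s

# Stereo matching cost (multiply-adds): image size x disparity range x kernel area
ops_parallel = W * H * 100 * 31 * 31
ops_serial = 320 * 240 * 64 * 5 * 5
print(f"parallel version ops: {ops_parallel / 1e9:.1f} G multiply-adds per unit")   # ~29.5 G
print(f"serial version ops:   {ops_serial / 1e6:.1f} M multiply-adds per unit")     # ~122.9 M

# Output 3D stream: 16-bit inverse depth + 24-bit RGB per pixel
bits_3d_frame = W * H * (16 + 24)
print(f"3D stream per frame:  {bits_3d_frame / Mbit:.2f} Mbit")              # ~11.72 Mbit
```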
1.2 Related Work
Our real-time wide-area stereo scanning is not directly comparable to any other system
in the literature. No other existing system combines viewpoint independent wide-area
acquisition with spatially augmented displays and interaction. Though the explosion of the Internet has produced many systems claiming tele-presence, none of them works at the scale necessary for tele-immersion. The closest to the proposed work is
CMU’s Virtualized Reality [7] and its early dedicated real-time parallel architecture for
stereo. Other multi-camera systems include the view-dependent visual hull approach
[5], the Keck laboratory at the University of Maryland [1] and the Argus system at
Duke University [2].
2 Acquisition Algorithm
We now elaborate on the main steps of the reconstruction algorithm [6], emphasizing the factors that affect the quality of reconstruction and the processing time. The initial implementation is based on two images, but it is easily extensible to a polynocular configuration. We rely on the well-known stereo processing steps of matching and triangulation, given that the cameras are calibrated.
Rectification When a 3D point is projected onto the left and the right image planes of a fixating stereo rig, the difference in image positions is in both the horizontal and the vertical directions. Given a point in the first image, we can reduce the 2D search to a 1D search if we know the so-called epipolar geometry of the camera pair, which is given by calibration. Because the subsequent correlation step is area based, and to reduce time complexity, we first warp the images so that every epipolar line becomes horizontal. This transformation is called rectification and results in corresponding points having coordinates (u, v) and (u − d, v) in the left and right rectified images, respectively, where d is the horizontal disparity.
Matching: The degree of correspondence is measured by a modified normalized cross-correlation measure c(I_L, I_R) = 2 cov(I_L, I_R) / (var(I_L) + var(I_R)), where I_L and I_R are the left and right rectified images over the selected correlation windows. For each pixel (u, v) in the left image, the matching produces a correlation profile c(u, v, d), where d ranges over a disparity range. This definition domain, the so-called disparity range, depends on the depth of the working volume, i.e., the range of possible depths we want to reconstruct. The time complexity of matching is linearly proportional to the size of the correlation window as well as to the disparity range.
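As an illustration of this matching step, the following NumPy sketch computes the modified normalized cross-correlation profile for a single left-image pixel over a disparity range on already rectified images. The function names, the 31 × 31 default window, and the per-pixel loop are illustrative assumptions; the actual system processes whole image strips and is heavily optimized.

```python
import numpy as np

def modified_ncc(wl, wr):
    """Modified normalized cross-correlation between two equally sized
    windows: 2*cov(wl, wr) / (var(wl) + var(wr))."""
    wl = wl.astype(np.float64).ravel()
    wr = wr.astype(np.float64).ravel()
    cov = np.mean((wl - wl.mean()) * (wr - wr.mean()))
    denom = wl.var() + wr.var()
    return 2.0 * cov / denom if denom > 0 else 0.0

def correlation_profile(left, right, u, v, d_range, half=15):
    """Correlation profile c(u, v, d) for one left-image pixel (u, v) over a
    disparity range, on rectified images; half=15 gives a 31x31 window.
    (u, v) is assumed to lie far enough from the image border."""
    wl = left[v - half:v + half + 1, u - half:u + half + 1]
    profile = np.full(len(d_range), -np.inf)
    for i, d in enumerate(d_range):
        # the corresponding right-image window is centered at (u - d, v)
        if u - d - half < 0 or u - d + half + 1 > right.shape[1]:
            continue
        wr = right[v - half:v + half + 1, u - d - half:u - d + half + 1]
        profile[i] = modified_ncc(wl, wr)
    return profile
```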
We consider all peaks of the correlation profile as possible disparity hypotheses. This differs from other matching approaches, which decide early on the maximum of the matching criterion. We call the resulting list of hypotheses for all positions a disparity volume. The hypotheses in the disparity volume are pruned by a selection procedure based on the following constraints:
Visibility: If a spatial point is visible, then there can not be any other point on the viewing rays through this point and the left or right camera.
Ordering: Depth ordering constrains the image positions in the rectified images.
Both constraints can be formulated in terms of disparities, without reconstructing the considered 3D point.
The output of this procedure is an integer disparity map. To refine the 3D position estimates, a sub-pixel correction of the integer disparity map is computed, which results in a sub-pixel disparity map. To achieve fast sub-pixel estimation, we fit a quadratic polynomial to the five-neighborhood of the integer disparity at the correlation maximum.
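A minimal sketch of this sub-pixel refinement idea: fit a quadratic to the five correlation samples centered on the integer maximum and take the vertex of the parabola. The exact fitting and boundary handling used in the system are not specified in the paper, so the details below are assumptions.

```python
import numpy as np

def subpixel_disparity(profile, d_int):
    """Refine an integer disparity d_int (index into a correlation profile)
    by fitting a quadratic to the five samples centered on the maximum and
    returning the parabola's vertex (a sketch, not the actual routine)."""
    lo, hi = d_int - 2, d_int + 3
    if lo < 0 or hi > len(profile):
        return float(d_int)                    # too close to the range border
    x = np.arange(lo, hi, dtype=np.float64)
    y = profile[lo:hi]
    a, b, _ = np.polyfit(x, y, 2)              # y ~ a*x^2 + b*x + c
    if a >= 0:                                 # not a maximum; keep integer value
        return float(d_int)
    return -b / (2.0 * a)                      # vertex of the fitted parabola
```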
Reconstruction Each of the stereo rigs is calibrated before the experiment using a modification of Bouguet's camera calibration toolbox. Given estimates of the two 3 × 4 projection matrices for the left and the right camera, and the disparity at each point, the coordinates of a 3D point can be computed.
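The computation can be illustrated with a standard linear (DLT) triangulation from the two 3 × 4 projection matrices; this generic formulation is a stand-in for, not a transcription of, the system's actual reconstruction routine.

```python
import numpy as np

def triangulate(P_left, P_right, u, v, d):
    """Linear (DLT) triangulation of the 3D point seen at (u, v) in the left
    rectified image and (u - d, v) in the right one, given the two 3x4
    projection matrices. A generic sketch, not the paper's exact routine."""
    ur, vr = u - d, v
    A = np.vstack([
        u  * P_left[2]  - P_left[0],
        v  * P_left[2]  - P_left[1],
        ur * P_right[2] - P_right[0],
        vr * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                        # inhomogeneous 3D coordinates
```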
Color Image Warping The stereo cameras used to compute the 3D points are monochromatic. A third, color camera is used to color the 3D points. The calibration technique also estimates the projection matrix of the color camera. This projection matrix is used to compute a lookup table of where each 3D point lies in the color image; the lookup table is then used to map color to the 3D point set.
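A sketch of this color mapping: project each reconstructed 3D point through the color camera's 3 × 4 projection matrix and sample the color image. In the actual system this is precomputed as a lookup table; the function below only shows the underlying projection, with illustrative names.

```python
import numpy as np

def color_lookup(points_3d, P_color, color_image):
    """Project reconstructed 3D points (Nx3) through the color camera's 3x4
    projection matrix and look up an RGB value per point (a sketch of the
    projection behind the lookup table described above)."""
    h, w = color_image.shape[:2]
    n = len(points_3d)
    homog = np.hstack([points_3d, np.ones((n, 1))])    # Nx4 homogeneous points
    proj = homog @ P_color.T                           # Nx3 homogeneous image coordinates
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    colors = np.zeros((n, 3), dtype=color_image.dtype)
    colors[valid] = color_image[v[valid], u[valid]]
    return colors, valid
```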
Depth Stream Next, the 3D depth and the color image must be sent to the remote viewer. Depth is encoded into a 3D stream that consists of a 16-bit inverse-depth image and a 24-bit RGB color image. This stream is encoded in a raw byte format and transmitted over the network. The renderer also receives (once, during initialization) the inverse projection matrix for mapping the viewer coordinate system to the world coordinate system.
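The per-frame payload can be pictured as follows; the exact byte layout of the 3D stream is not specified in the paper, so the concatenation order here is an assumption.

```python
import numpy as np

def encode_depth_frame(inverse_depth, rgb):
    """Pack one depth-stream frame as raw bytes: a 16-bit inverse-depth image
    followed by a 24-bit RGB image (assumed layout, for illustration only)."""
    assert inverse_depth.dtype == np.uint16 and rgb.dtype == np.uint8
    return inverse_depth.tobytes() + rgb.tobytes()
    # For a 640x480 frame this is 640*480*(2 + 3) bytes, i.e. ~11.7 Mbit.
```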
The error in the reconstruction depends on the error in the disparity and the error in the calibration matrices. Since the action to be reconstructed is close to the origin of the world coordinate system, the depth error due to calibration is negligible compared to the error in the disparities. The principal concern is the number of outliers in the depth estimates, which result in large peaks usually appearing near occlusions or texture-less areas.
3 Rendering
It is the task of the rendering system to take the multiple independent streams of 3D depth maps and re-create a life-size, view-dependent, stereo display of the acquired scene. Received depth maps are converted into 3D points and rendered as point clouds from a user-tracked viewpoint. Multiple depth-map streams are time synchronized and simply Z-buffered to create a composite display frame. Display is accomplished using a two-projector passive stereo configuration, and the user's eye positions are estimated with a HiBall wide-area tracker, as described in [3].
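A sketch of the renderer-side conversion from a received inverse-depth image to a 3D point cloud. The paper does not give the exact parameterization of the inverse projection matrix, so a 3 × 3 pinhole back-projection and a simple scale factor are assumed here.

```python
import numpy as np

def depth_map_to_points(inverse_depth, inv_projection, scale=1.0):
    """Convert a 16-bit inverse-depth image into a 3xN point cloud using an
    inverse projection matrix (assumed here to be a 3x3 back-projection;
    the actual parameterization is not spelled out in the paper)."""
    h, w = inverse_depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = inverse_depth > 0                              # zero marks holes/outliers
    z = scale / inverse_depth[valid].astype(np.float64)    # depth from inverse depth
    pixels = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)
    rays = inv_projection @ pixels                         # back-projected viewing rays
    return (rays / rays[2]) * z                            # one 3D point per valid pixel
```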
While it is relatively easy to scale the video capture and reconstruction front end, it is far more difficult to architect a rendering system that is not the system bottleneck. At 640 × 480 resolution, each depth-map stream arriving at the rendering system carries 11.72 Mbits per frame. Without data compression, 10 streams operating at 10 fps can generate a 1.3 Gbps data rate. At 80 percent reconstruction efficiency (or approximately 250K points per 640 × 480 stream), ten streams produce about 2.5M points to render, requiring a point rendering performance of greater than 75M points/sec for truly smooth 30 Hz, view-dependent rendering.
Our current system is architected around a three-PC Linux cluster interconnected with a gigabit network. One PC serves as the network aggregation node for the multiple depth-map streams arriving from the Terascale Computing System (TCS) platform. The other two PCs render the left-eye and right-eye views. For performance reasons, data arriving at the aggregation node are multicast using the UDP protocol to the two rendering nodes. These PCs are 2.4 GHz dual-processor Dell workstations, and the rendering is facilitated by NVIDIA GeForce4 Ti 4600 cards. Using Vertex_Array_Range extensions rather than OpenGL display lists, we are able to render 3D point clouds with up to 2M points at 30 Hz. The 30 Hz view-dependent display loop runs asynchronously to the depth-map stream update rate, which was limited to approximately 1 Hz during the SC02 tests to avoid increasing latency and network buffering. Frames in each stream are time stamped so the render system can re-synchronize these independent streams. Frame swap synchronization between the left-eye and right-eye PCs is achieved with simple out-of-band protocols.
4 Terascale System/Parallel Porting Plan
The original development system consisted of a triple of cameras connected to a dual-processor machine running at 2.4 GHz. There were five such camera triples connected to five different servers. Each server was used to grab three 320 × 240 pixel images. The acquired data was processed locally. To reduce the processing time, only the foreground was processed for stereo matching and reconstruction. The serial system used two processors per stream; each processor processed half the image. The algorithm used small correlation kernels of 5 × 5 size. This system ran at 8 frames per second. The main bottleneck here is the processing hardware.
The quality of the reconstruction was not satisfactory, and the real-time requirement precluded the use of any sophisticated algorithms. The images used were low resolution. The use of background subtraction in the images eliminated 66% of the data, hence the viewer could only see the remote participant in an artificial surrounding.
Complete reconstruction of the scene from more cameras, using higher resolution images and more sophisticated algorithms, requires much more processing time. The real-time constraint of this system required us to harness much more processing power. It became obvious that this serial system would have to be migrated to a parallel platform. The platform chosen was the Terascale Computing System at the Pittsburgh Supercomputing Center, called Lemieux, which comprises 3000 1 GHz Alpha processors. The key parallelization insights were:
- The problem decomposes naturally by camera stream.
- Serial image analysis code can remain fundamentally the same in the parallel implementation.
- Each processor would process a fraction of the image.
It was decided that a parallel framework, properly constructed, would allow the retention of the serial image analysis code and approach without sacrificing excellent scalability to thousands of processors. In addition, past parallel coding experience led to the incorporation of several other design goals:
- Define an explicit parallelization interface to the existing serial code.
- Anticipate the need for run-time debugging.
- Demand full scalability: allow no serial code to remain, and introduce none.
- Permit partial stream asynchronicity during development, but demand fully asynchronous code when done.
- Design with physical I/O constraints in mind.
In addition, previous experience in developing real-time (defined as maximum allowable latency) codes on other Unix-like parallel platforms led us to anticipate that system services, and particularly "heartbeats", would be problematic. Code with maximum tolerance would be necessary.
5 Data Flow
The parallel framework that was developed was based on servicing each stream independently of the others, scaling within each stream, and ensuring that physical I/O points on the system were not allowed to become bottlenecks. The resulting schematic for a single stream is shown in Fig. 3 (left). Data is received over the Internet into a designated Lemieux input node equipped with a 100 Mb Ethernet interface. It is then divided up into congruent horizontal bands for each of the three cameras and distributed to a variable number of computational nodes for image analysis. The analyzed images are then gathered to a single node, which combines them into a processed frame, and broadcast over the Internet to a remote display station. Note that the streams retain full independence in this schematic. This was maintained in software, even while retaining a coherent single executable, through the use of MPI communicator mechanisms.
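The per-stream decomposition can be sketched with mpi4py (the production code on Lemieux is native MPI, and all names and the processing stub below are illustrative): ranks are split into one communicator per camera stream, the stream's input rank broadcasts the frame, each rank analyzes its horizontal band, and the results are gathered on a single rank for transmission.

```python
# Minimal mpi4py sketch of the per-stream MPI communicator layout described above.
from mpi4py import MPI
import numpy as np

NUM_STREAMS = 9
world = MPI.COMM_WORLD
stream_id = world.Get_rank() % NUM_STREAMS
stream = world.Split(color=stream_id, key=world.Get_rank())  # one communicator per camera stream

rank, size = stream.Get_rank(), stream.Get_size()
H, W = 480, 640

image = None
if rank == 0:                                    # the stream's designated input rank
    image = np.random.randint(0, 256, (H, W), dtype=np.uint8)  # stand-in for a received frame
image = stream.bcast(image, root=0)              # broadcast the frame to every rank in the stream

rows = np.array_split(np.arange(H), size)[rank]  # this rank's horizontal band of the image
band = image[rows].astype(np.float32) * 0.5      # placeholder for stereo analysis of the band

pieces = stream.gather(band, root=0)             # gather processed bands on a single rank
if rank == 0:
    frame = np.vstack(pieces)                    # reassembled processed frame, ready to transmit
```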
6 Performance
The switch to a terascale system allowed us to accommodate more computationally intensive algorithms and higher quality images. We incorporated several changes, notably changing the search area and the correlation kernel size. However, the core stereo matching algorithm was not changed; only the parameters used to operate it were tuned to increase match reliability.
[Figure 3 diagram: a single video stream passes from the UPenn and UNC routers over Fast Ethernet to a Lemieux input node (Alpha processor), across the Quadrics network to Lemieux compute nodes and an output node, and back out over Fast Ethernet.]
Fig. 3. Single stream data flow.
Computation The parallel system operates on images four times the size of those used in the serial algorithm. The correlation window size is 31 × 31 rather than 5 × 5, increasing computation approximately 36 times. However, we use binocular instead of trinocular stereo, so matching is performed once rather than pairwise. Thus, the new system requires at least 72 times more computation. Since we do not perform background subtraction, an additional order of magnitude of complexity is required.
The correlation window size is the main parameter affecting performance. The complexity of the stereo algorithm is O(m^2 n^2), where m^2 is the size of the correlation kernel and n^2 is the size of the image. We ran a series of tests to verify the performance and the scalability of the system. The performance of the real-time system, with networked input of video and networked output of 3D streams, is constrained by many external factors that could cause bottlenecks. Hence, for performance analysis of the parallel algorithm we switched to file-based I/O. The image streams are read from disk, and we measure the time for image distribution on the cluster, image analysis, and 3D data gathering from the various cluster nodes, which together contribute to the total processing time.
The reconstruction algorithm broadcasts the image to be processed, in its entirety, to each node of a stream. Hence, as the number of PEs used for a particular stream increases, so does the broadcast time, as shown in Fig. 4 (left). Each processor then performs stereo matching on a small strip of the entire image. This is the lowest level of parallelization. As the number of processors increases, each processor processes fewer pixels. Fig. 4 (right) shows the speedup in the process-frame routine, which performs image rectification, stereo matching, and the reconstruction of the 3D points. We show the processing time for seven different correlation window sizes.
[Figure 4 plots: processing time vs. number of processors, with one curve per correlation kernel size from 5×5 up to 31×31.]
Fig. 4. Left: The time required to broadcast images to each node increases as the number of processors increases. Right: Total processing time in msec vs. number of processors. Each plot corresponds to a different kernel size.
The reconstructed 3D points have to be re-assembled, as different parts of the images are reconstructed on different nodes. This gather operation speeds up with the number of processors, because less data must be gathered from each node.
Based on the above studies, we have observed that the algorithm scales very efficiently with the number of processors per stream. The program parallelizes so that each stream is synchronized when reading data but otherwise runs independently of the others, and hence streams do not affect each other's performance. Each stream of images has similar parameters, and hence execution time is almost the same.
Networking We conducted several networking and performance tests during the Supercomputing 2002 (SC2002) Conference in Baltimore, MD. The video servers and the rendering system on the SC2002 exhibition floor communicated with the PSC Lemieux system over Abilene, the high-speed Internet2 backbone.
Fig. 5. Bandwidth usage plot from the Nov. 19, 2002 Supercomputing 2002 demonstration, showing Mbps over time. The darker (red) line shows the data traffic from the acquisition site to the computation site, while the lighter (green) line shows the traffic from the computation site to the display site.
The video servers and the rendering system were each connected to Abilene by separate Gigabit links. Tests were performed with nine streams of images (from nine video servers). The image data originating at the video servers was transmitted to Lemieux using TCP. The data was processed on Lemieux, and the depth/texture data sets were (optionally) sent back to the rendering system using Reliable UDP (RUDP), a protocol specifically designed by us for this application. RUDP provides the reliable data transmission required by the application without any congestion control, thereby providing better throughput than TCP.
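The wire format of RUDP is not described in the paper; the stop-and-wait sketch below merely illustrates the general idea of adding reliability on top of UDP without TCP-style congestion control (sequence-numbered datagrams, retransmitted until acknowledged). A real protocol achieving the reported throughput would pipeline many packets in flight rather than waiting on each one.

```python
# Didactic reliability-over-UDP sketch, NOT the actual RUDP protocol.
import socket
import struct

CHUNK = 1400                      # payload bytes per datagram (fits a typical MTU)

def send_reliable(data: bytes, addr, timeout=0.05):
    """Send `data` to `addr` as sequence-numbered UDP chunks, resending each
    chunk until the receiver echoes its sequence number back as an ACK."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    for seq, chunk in enumerate(chunks):
        packet = struct.pack("!II", seq, len(chunks)) + chunk   # header: seq, total chunks
        while True:
            sock.sendto(packet, addr)
            try:
                ack, _ = sock.recvfrom(8)
                if struct.unpack("!I", ack[:4])[0] == seq:
                    break                                       # acknowledged; next chunk
            except socket.timeout:
                pass                                            # lost packet or ACK; resend
    sock.close()
```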
Fig. 5 shows the bandwidth usage for the run during the "Bandwidth Challenge" at SC2002. The run consists of two phases. The first phase involved the entire system, from image acquisition to rendering. An average frame rate of 1 frame/sec was achieved during this phase. The average data rate from the video servers to Lemieux over TCP was 63 Mbps, with peak rates over 200 Mbps. From Lemieux to the rendering system, RUDP was able to achieve data rates of over 105 Mbps with peaks over 700 Mbps (see the caption on the plot). The frame rate achievable in this end-to-end operation was limited by the capacity of the rendering system to consume the depth/texture data. In the second phase, the rendering system was left out. The image data was processed at Lemieux as before, but it was not transmitted back to the rendering system. In this phase, a frame rate of up to 8 frames/sec was achieved, and a data rate of over 500 Mbits/s was observed for the image transmission over TCP. The burstiness of the traffic is due to the synchronous, lock-step operation of the system. During the entire run, 1080 processors (120 per stream) were employed for the stereo reconstruction on Lemieux. The output of each processor is the depth estimates for 4 rows of 640-pixel width.
7 Conclusions - Outlook
We have ported a real-time application from a dedicated serial environment to one that crosses a wide-area network and utilizes a centralized parallel computing resource. The parallel code demonstrates excellent scalability and continues to exploit a clean separation between the image analysis and the parallel framework, which allows concurrent work by all development groups. This nearly perfect scalability, both by PEs per stream and by stream, makes us optimistic that we will be able to continue our overall performance gains by three routes:
- better per-processor performance through optimization,
- larger machine-size runs on a routine basis,
- larger platforms as they become available.
We are easily capable of saturating our current, routinely available networking connection; imminent network enhancements will permit progress on that front. To improve reconstruction quality, we will employ an adaptive window size, in order to preserve discontinuities, as well as inter-scanline processing to alleviate rectification inaccuracies. In the quest for further performance improvements, we are actively investigating advanced rendering techniques, including moving the conversion from depth maps (integer) to 3D points (floats) from the CPU into the graphics hardware. In the future, we would like to apply image compression techniques to reduce the data bandwidth requirements. We are also exploring the issues of latency and susceptibility to network congestion, to develop a simple protocol that will minimize both and improve multi-stream throughput. The task of developing a rendering architecture that will scale as easily and linearly as the independent camera-capture/reconstruction streams remains a significant research challenge.
References
1. P. Baker and Y. Aloimonos. Complete calibration of a multi-camera network. In IEEE Workshop on Omnidirectional Vision, Hilton Head Island, SC, June 2000.
2. D. J. Brady, R. A. Stack, S. Feller, L. Fernandez, E. Cull, D. Kammeyer, and R. Brady. Information flow in streaming 3D video. In Three-Dimensional Video and Display: Devices and Systems, SPIE Press, Vol. CR76, 2000.
3. H. Towles et al. 3D tele-collaboration over Internet2. In International Workshop on Immersive Telepresence, Juan-les-Pins, France, Dec. 6, 2003.
4. J. Lanier. Virtually there. Scientific American, pages 66-75, April 2001.
5. W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan. Image-based visual hulls. In Proceedings of ACM SIGGRAPH, 2000.
6. J. Mulligan, V. Isler, and K. Daniilidis. Trinocular stereo: A new algorithm and its evaluation. International Journal of Computer Vision, 47:51-61, 2002.
7. P. Narayanan, P. Rander, and T. Kanade. Constructing virtual worlds using dense stereo. In Proc. Int. Conf. on Computer Vision, pages 3-10, 1998.
8. R. Raskar, G. Welch, M. Cutts, A. Lake, L. Stesin, and H. Fuchs. The office of the future: A unified approach to image-based modeling and spatially immersive displays. In ACM SIGGRAPH, pages 179-188, 1998.