Real-time Terascale Implementation of Tele-immersion
Nikhil Kelshikar1, Xenophon Zabulis1, Jane Mulligan4, Kostas Daniilidis1, Vivek
Sawant2, Sudipta Sinha2, Travis Sparks2, Scott Larsen2, Herman Towles2, Ketan
Mayer-Patel2, Henry Fuchs2, John Urbanic3, Kathy Benninger3, Raghurama Reddy3,
and Gwendolyn Huntoon3
1University of Pennsylvania
2University of North Carolina at Chapel Hill
3Pittsburgh Supercomputing Center
4University of Colorado at Boulder
Abstract. Tele-immersion is a new medium that enables a user to share a virtual space
with remote participants by creating the illusion that users at geographically dispersed
locations reside in the same physical space. A person is immersed in a remote world,
whose 3D representation is acquired remotely, then transmitted and displayed in the
viewer's environment. Tele-immersion is effective only when its three components
(computation, transmission, and rendering) all operate in real time. In this paper, we
describe the real-time implementation of scene reconstruction on the Terascale Computing
System at the Pittsburgh Supercomputing Center.
1 Introduction
Tele-immersion enables users at geographically distributed locations to collaborate in a
shared space, which integrates the environments at these locations. In an archetypical
tele-immersion environment as proposed at the origin of this project [8, 4], a user wears
polarized glasses and a tracker capturing the head's pose. On a stereoscopic display,
a remote scene is rendered so that it can be viewed from all potential viewpoints in
the space of the viewer. To achieve this, an architecture that enables real-time
view-independent 3D scene acquisition, transmission, and rendering is proposed (see
Fig. 1). Most of the computational challenges arise in the 3D scene acquisition. This
stage associates pixels with the 3D coordinates of the world points they depict, in a
view-independent coordinate system. This association can be based on finding pixel
correspondences between a pair of images. The derived correspondences form the basis for
computing a disparity map, from which the depth of the depicted world points and, in
turn, their coordinates can be estimated. The correspondence problem involves many
challenging open topics, such as establishing correspondences for pixels that reside in
textureless image regions, detecting occlusions, and coping with specular illumination
effects. Any practical solution involves a trade-off between being conservative, which
leaves many holes in the depth map, and being lenient, which covers everything at the
cost of outliers.
Contact person: Nikhil Kelshikar, email@example.com, University of Pennsylvania,
GRASP Laboratory, 3401 Walnut St., Philadelphia, PA 19104-6228. All three sites acknowl-
edge ﬁnancial support by the NSF grant IIS-0121293.
Fig. 1. System architecture. Images are acquired with a cluster of cameras, processed by a parallel
computational engine to provide the 3D-description, transmitted, and displayed immersively.
As of summer 2001, we had achieved an 8 Hz acquisition rate with limited depth quality,
limited resolution, and a very small operation space. Eventually, the original
collaborative project to produce a perceptually realistic tele-immersive environment
reached a level of maturity such that the remaining performance bottlenecks were well
understood. We then established a goal to produce a real-time version with a dramatic
increase in the volume of the scanned area, resolution, and depth accuracy. As opposed
to other systems, which capture one person inside a small area, we employed an array
of stereo systems distributed so that they capture all actions from the span of all
possible viewpoints of the remote viewer (wide-area scanning). Each stereo unit provides
a 2D view as well as the correctly registered texture of this view.
Fig. 2. 30 (of the 55 pictured) cameras were used in the November 2002 Supercomputing
Conference demonstration (left). The acquired scene computed at PSC was displayed
immersively (right).
To achieve this dramatic improvement in space and resolution while maintaining real-time
performance, a significant increase in computational power is required, boosting the
computational requirements to the supercomputing level. Such a resource was available
at a remote location and, thus, we established one of the first applications where
sensing, computation, and display are at different sites and coupled in real time. To
overcome the transmission constraints, the initial implementation uses a video server
that transmits video streams over TCP/IP and a reliable UDP transmission of the depth
maps from the computation site to the display site.
1.1 Computational and Networking Requirements
A single camera unit requires a triad of cameras grabbing pictures at 30 frames/second
at VGA resolution. We use two monochromatic cameras (8 bits per pixel) and one color
camera (8 bits per pixel; the color conversion is done in software). The images are
captured at 640 × 480 pixel resolution. This produces data of 7.03 Mbits/frame. At the
data capture rate of 30 fps, we would produce data at 205.7 Mbits/sec. For each
acquisition site we proposed to use a cluster of 10 or more camera units to adequately
reconstruct the entire acquisition room, thus increasing data rates to 2.02 Gbits/sec
per site. This is the amount of raw data that must make its way to the computing engine.
The computing engine must execute 640 × 480 × 100 depths × 31 × 31 kernel size ≈ 29.5 G
multiplications and additions per camera unit. The non-parallel version needed
approximately 320 × 240 × 64 depths × 5 × 5 kernel size ≈ 122 M multiplications and
additions per camera unit (actually twice as much because it used trinocular stereo).
The produced 3D data stream is 11.72 Mbits/frame and consists of a 16 bit inverse depth
and a 24 bit color texture. This also scales with the number of camera units and users.
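The per-frame sizes and operation counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative, and the convention 1 Mbit = 2^20 bits is an assumption chosen because it reproduces the quoted per-frame values.

```python
# Per-camera-unit figures from Section 1.1.
# Assumption: 1 Mbit = 2**20 bits (this reproduces the quoted per-frame sizes).
WIDTH, HEIGHT = 640, 480
MBIT = 2 ** 20

def raw_frame_mbits(cameras=3, bits_per_pixel=8):
    """Raw input per frame for one camera unit (three 8-bit cameras)."""
    return WIDTH * HEIGHT * cameras * bits_per_pixel / MBIT

def depth_frame_mbits(depth_bits=16, color_bits=24):
    """Output 3D stream per frame: inverse depth plus color texture."""
    return WIDTH * HEIGHT * (depth_bits + color_bits) / MBIT

def correlation_ops(width, height, depths, kernel):
    """Multiply-add count for one stereo match over the full image."""
    return width * height * depths * kernel * kernel

if __name__ == "__main__":
    print(round(raw_frame_mbits(), 2), "Mbits per raw frame")
    print(round(depth_frame_mbits(), 2), "Mbits per 3D frame")
    print(correlation_ops(640, 480, 100, 31) / 1e9, "G ops (parallel version)")
    print(correlation_ops(320, 240, 64, 5) / 1e6, "M ops (serial version)")
```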
1.2 Related Work
Our real-time wide-area stereo scanning is not directly comparable to any other system
in the literature. No other existing system combines viewpoint-independent wide-area
acquisition with spatially augmented displays and interaction. Though the explosion of
the Internet has produced many systems claiming tele-presence, none of them is working
towards the scale necessary for tele-immersion. The closest to the proposed work is
CMU's Virtualized Reality and its early dedicated real-time parallel architecture for
stereo. Other multi-camera systems include the view-dependent visual hull approach,
the Keck laboratory at the University of Maryland, and the Argus system at
Duke University.
2 Acquisition Algorithm
We now elaborate on the main steps of the reconstruction algorithm, emphasizing the
factors that affect the quality of reconstruction and the processing time. The initial
implementation is based on two images, but it is easily extensible to a polynocular
configuration. We rely on the well-known stereo processing steps of matching and
triangulation, given that the cameras are calibrated.
Rectification. When a 3D point is projected onto the left and right image planes of a
fixating stereo rig, the image positions differ in both the horizontal and vertical
directions. Given a point in the first image, we can reduce the 2D search to a 1D search
if we know the so-called epipolar geometry of the camera pair, which is given by
calibration. Because the subsequent correlation step is area-based, and to reduce time
complexity, we first warp the images so that every epipolar line becomes horizontal.
This image transformation is called rectification and results in corresponding points
having coordinates (u, v) and (u − d, v) in the left and right rectified images,
respectively, where d is the horizontal disparity.
Matching. The degree of correspondence is measured by a modified normalized
cross-correlation measure
$c(I_L, I_R) = \frac{2\,\mathrm{cov}(I_L, I_R)}{\mathrm{var}(I_L) + \mathrm{var}(I_R)}$,
where $I_L$ and $I_R$ are the left and right rectified images over the selected
correlation windows. For each pixel (u, v) in the left image, the matching produces a
correlation profile c(u, v, d), where d ranges over the so-called disparity range. This
range depends on the depth of the working volume, i.e. the range of possible depths we
want to reconstruct. The time complexity of matching is linearly proportional to the
size of the correlation window as well as to the disparity range.
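To make the matching step concrete, here is a minimal pure-Python sketch of a correlation profile. It assumes the modified measure has the form 2·cov/(var + var), represents images as lists of rows, and uses hypothetical helper names (`mncc`, `window`, `correlation_profile`); it is an illustration, not the code that ran on Lemieux.

```python
def mncc(a, b):
    """Modified normalized cross-correlation of two equal-length windows:
    2*cov(a, b) / (var(a) + var(b)). Returns 0.0 when both windows are flat."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    denom = var_a + var_b
    return 2.0 * cov / denom if denom > 0 else 0.0

def window(img, u, v, half):
    """Flatten the (2*half+1) x (2*half+1) window centred on column u, row v."""
    return [img[r][c]
            for r in range(v - half, v + half + 1)
            for c in range(u - half, u + half + 1)]

def correlation_profile(left, right, u, v, half, d_max):
    """c(u, v, d) for d = 0..d_max, matching left (u, v) against right (u - d, v)."""
    ref = window(left, u, v, half)
    profile = []
    for d in range(d_max + 1):
        if u - d - half < 0:
            break  # the window would leave the right image
        profile.append(mncc(ref, window(right, u - d, v, half)))
    return profile
```

The two nested loops make the cost per pixel proportional to the window area times the number of disparities tested, which is the linear dependence stated in the text.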
We consider all peaks of the correlation profile as possible disparity hypotheses. This
differs from other matching approaches, which decide early on the maximum of the
matching criterion. We call the resulting list of hypotheses for all positions a
disparity volume. The hypotheses in the disparity volume are pruned by a selection
procedure based on the following constraints:
– Visibility: If a spatial point is visible, then there cannot be any other point on
the viewing rays through this point and the left or right camera.
– Ordering: Depth ordering constrains the image positions in the rectified images.
Both constraints can be formulated in terms of disparities without reconstructing
the considered 3D point.
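Collecting every peak of a correlation profile, rather than only its maximum, can be sketched as below. The function name and the `min_score` threshold are hypothetical, and the visibility and ordering selection that follows in the actual algorithm is not shown.

```python
def profile_peaks(profile, min_score=0.0):
    """Return (disparity, score) for every interior local maximum of a
    correlation profile. min_score is an illustrative threshold, not a
    value taken from the paper."""
    peaks = []
    for d in range(1, len(profile) - 1):
        is_peak = profile[d] > profile[d - 1] and profile[d] >= profile[d + 1]
        if is_peak and profile[d] >= min_score:
            peaks.append((d, profile[d]))
    return peaks
```

Running this for every pixel yields the list of hypotheses that the text calls the disparity volume.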
The output of this procedure is an integer disparity map. To refine the 3D position
estimates, a sub-pixel correction of the integer disparity map is computed, which
results in a sub-pixel disparity map. To achieve fast sub-pixel estimation, we fit a
quadratic polynomial on the five-neighborhood of the integer disparity at the
correlation maximum.
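One way to realize the sub-pixel step is a least-squares parabola through five correlation samples. The sketch below assumes that the five-neighborhood means the profile values at disparities d−2 to d+2 around the integer maximum d; the function name and the clamping to one pixel are illustrative choices.

```python
def subpixel_offset(c):
    """Least-squares parabola a*x^2 + b*x + k through five samples c[0..4]
    taken at x = -2..2 (disparities d-2..d+2). Returns the vertex offset
    relative to the integer disparity d."""
    xs = (-2, -1, 0, 1, 2)
    s0 = sum(c)
    s1 = sum(x * y for x, y in zip(xs, c))
    s2 = sum(x * x * y for x, y in zip(xs, c))
    a = (s2 - 2.0 * s0) / 14.0   # curvature, from the normal equations
    b = s1 / 10.0                # slope at the centre sample
    if a >= 0.0:
        return 0.0               # not a maximum: keep the integer disparity
    return max(-1.0, min(1.0, -b / (2.0 * a)))
```

Because the sample positions are symmetric about zero, the normal equations decouple and the fit needs only three weighted sums, which keeps the per-pixel cost small.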
Reconstruction. Each of the stereo rigs is calibrated before the experiment using a
modification of Bouguet's camera calibration toolbox. Given estimates of the two 3 × 4
projection matrices for the left and the right camera and the disparity at each point,
the coordinates of a 3D point can be computed.
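For intuition, the relation between disparity and depth can be sketched for an idealized rectified pair. This is a simplification of the general case described above, which works from the two projection matrices: it assumes identical pinhole cameras with focal length `f` (in pixels), baseline `baseline` (in metres), and principal point (`cx`, `cy`), all of which are hypothetical parameters.

```python
def point_from_disparity(u, v, d, f, baseline, cx, cy):
    """3D point in the left camera frame for pixel (u, v) with disparity d,
    assuming an ideal rectified pair. Returns None for non-positive
    disparity (point at infinity or an invalid match)."""
    if d <= 0:
        return None
    z = f * baseline / d          # depth is inversely proportional to disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)
```

Since depth varies as 1/d, a fixed disparity error produces a depth error that grows with the square of the depth.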
Color Image Warping. The stereo cameras used to compute the 3D points are monochromatic
cameras. A third color camera is used to color the 3D points. The calibration technique
also estimates the projection matrix for the color camera. The projection matrix is
used to compute a lookup table of where each 3D point lies in the color image. This
lookup table is then used to map color to the 3D point set.
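The lookup described here amounts to projecting each 3D point with the color camera's 3 × 4 matrix. A minimal sketch follows, assuming the matrix is given as three rows of four numbers and that all points lie in front of the camera; the function names are hypothetical.

```python
def project(P, X):
    """Project the 3D point X = (x, y, z) with a 3x4 projection matrix P
    (a list of three rows). Assumes the point is in front of the camera."""
    xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(p * x for p, x in zip(row, xh)) for row in P)
    return u / w, v / w

def color_lookup(P_color, points, width, height):
    """For each 3D point, the (col, row) color pixel it maps to, or None
    when the projection falls outside the color image."""
    table = []
    for X in points:
        u, v = project(P_color, X)
        col, row = int(round(u)), int(round(v))
        inside = 0 <= col < width and 0 <= row < height
        table.append((col, row) if inside else None)
    return table
```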
Depth Stream. Next, the 3D depth and the color image must be sent to the remote
viewer. Depth is encoded into a 3D stream, which consists of a 16 bit inverse depth
image and a 24 bit RGB color image. This stream is then encoded in a raw byte format and
transmitted over the network. The renderer also receives (once, during initialization)
the inverse projection matrix for mapping the viewer coordinate system to the world
coordinate system.
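A raw 40-bit-per-pixel encoding of this kind can be sketched with Python's `struct` module. The quantization range `max_inv_depth`, the little-endian byte order, and the interleaved per-pixel layout are all assumptions made for illustration; the text specifies only the bit widths.

```python
import struct

PIXEL_FORMAT = "<H3B"  # assumed layout: 16-bit inverse depth, then R, G, B

def encode_pixel(depth_m, rgb, max_inv_depth=4.0):
    """Pack one pixel as 16-bit inverse depth plus 24-bit RGB (5 bytes).
    max_inv_depth (in 1/m) sets the quantization range and is hypothetical."""
    inv = 0.0 if depth_m <= 0 else min(1.0 / depth_m, max_inv_depth)
    q = int(round(inv / max_inv_depth * 65535))
    return struct.pack(PIXEL_FORMAT, q, *rgb)

def decode_pixel(buf, max_inv_depth=4.0):
    """Inverse of encode_pixel; depth is None when no depth was stored."""
    q, r, g, b = struct.unpack(PIXEL_FORMAT, buf)
    depth = None if q == 0 else 65535.0 / (q * max_inv_depth)
    return depth, (r, g, b)
```

Quantizing inverse depth rather than depth spends the 16 bits where stereo is most precise, since disparity is itself proportional to inverse depth.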
The error in the reconstruction depends on the error in the disparity and the error in
the calibration matrices. Since the action to be reconstructed is close to the origin
of the world coordinate system, the depth error due to calibration is negligible
compared to the error in the disparities. The principal concern is the number of
outliers in the depth estimates, which result in large peaks usually appearing near
occlusions or texture-less regions.
3 Rendering
It is the task of the rendering system to take the multiple independent streams of 3D
depth maps and re-create a life-size, view-dependent, stereo display of the acquired
scene. Received depth maps are converted into 3D points and rendered as point clouds
from a user-tracked viewpoint. Multiple depth map streams are time synchronized and
simply Z-buffered to create a composite display frame. Display is accomplished using
a two-projector passive stereo configuration, and the user's eye positions are estimated
with a HiBall wide-area tracker as described in .
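The Z-buffered compositing of several streams can be illustrated in software as follows. In the actual system this work is done by the graphics hardware; the sketch assumes each stream has already been projected into fragments `(col, row, depth, color)` for the current viewpoint, and the function name is hypothetical.

```python
def zbuffer_composite(streams, width, height):
    """Merge per-stream fragments (col, row, depth, color) into one frame,
    keeping at each pixel the fragment nearest to the viewer."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for fragments in streams:
        for col, row, z, rgb in fragments:
            if 0 <= col < width and 0 <= row < height and z < depth[row][col]:
                depth[row][col] = z
                color[row][col] = rgb
    return color
```

Because the depth test is order-independent, the streams need no mutual coordination beyond being expressed in a common coordinate system and time step.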
While it is relatively easy to scale the video capture and reconstruction front-end, it
is a most difficult task to architect a rendering system that is not the system
bottleneck. At 640 × 480 resolution, each depth map stream arriving at the rendering
system includes 11.72 Mbits per frame. Without data compression, 10 streams operating
at 10 fps can generate a 1.3 Gbps data rate. At 80 percent reconstruction efficiency
(or approximately 250K points per 640 × 480 resolution stream), ten streams tend to
produce 2.5M points to render, requiring a point rendering performance of greater than
75M points/sec for truly smooth 30 Hz, view-dependent rendering.
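The rendering load follows from simple arithmetic, sketched below with an illustrative function name. The exact per-stream value is about 246K points, which the text rounds to 250K before scaling up to 2.5M points and 75M points/sec.

```python
def rendering_load(width=640, height=480, efficiency=0.8, streams=10, hz=30):
    """Points per stream, points per composite frame, and points per second
    that the renderer must sustain."""
    per_stream = width * height * efficiency
    per_frame = per_stream * streams
    per_second = per_frame * hz
    return per_stream, per_frame, per_second

if __name__ == "__main__":
    per_stream, per_frame, per_second = rendering_load()
    print(f"{per_stream / 1e3:.0f}K points per stream")
    print(f"{per_frame / 1e6:.2f}M points per frame")
    print(f"{per_second / 1e6:.1f}M points per second")
```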
Our current system is built around a three-PC Linux cluster interconnected with
a gigabit network. One PC serves as the network aggregation node for the multiple depth
map streams arriving from the Terascale Computing System (TCS) platform. The other
two PCs render the left and right-eye views. For performance reasons, data arriving at
the aggregation node are multicast using the UDP protocol to the two rendering nodes.
These PCs are 2.4 GHz dual-processor Dell workstations, and the rendering is performed
by Nvidia GeForce4 Ti 4600 cards. Using Vertex_Array_Range extensions rather
than OpenGL display lists, we are able to render 3D point clouds with up to 2M points at
30 Hz. The 30 Hz view-dependent display loop runs asynchronously to the depth map stream
update rate, which was limited to approximately 1 Hz during the SC02 tests to avoid
increasing latency and network buffering. Frames in each stream are time stamped so
the render system can re-synchronize these independent streams. Frame swap
synchronization between the left and right-eye PCs is achieved with simple out-of-band
protocols.
4 Terascale System/Parallel Porting Plan
The original development system consisted of a triple of cameras connected to a dual
processor machine running at 2.4 GHz. There were five such camera triples connected
to five different servers. Each server was used to grab three 320 × 240 pixel images.
The acquired data was processed locally. To reduce the processing time, only the
foreground was processed for stereo matching and reconstruction. The serial system used
two processors per stream, and each processor processed half the image. The algorithm
used small correlation kernels of size 5 × 5. This system ran at 8 frames per
second. The main bottleneck was the processing hardware.
The quality of the reconstruction was not satisfactory, and the real-time requirement
precluded the use of any sophisticated algorithms. The images used were low resolution.
The use of background subtraction in the images eliminated 66% of the data, hence the
viewer could only see the remote participant in an artificial surrounding.
Complete reconstruction of the scene from more cameras, using higher resolution images
and more sophisticated algorithms, requires much more processing time. The real-time
constraint of this system required us to harness much more processing power. It
became obvious that this serial system would have to be migrated to a parallel
platform. The platform chosen was the Terascale Computing System at the Pittsburgh
Supercomputing Center. It comprises 3000 1 GHz Alpha processors and is called
Lemieux. The key parallelization insights were:
– The problem decomposes naturally by camera stream.
– Serial image analysis code can remain fundamentally the same in the parallel
implementation.
– Each processor would process a fraction of the image.
It was decided that a parallel framework, properly constructed, would allow the
retention of the serial image analysis code and approach without sacrificing excellent
scalability to thousands of processors. In addition, past parallel coding experience
led to the incorporation of several other design goals:
– Define an explicit parallelization interface to the existing serial code.
– Anticipate the need for run-time debugging.
– Demand full scalability: allow no serial code to remain, introduce none.
– Permit partial stream asynchronicity during development, but demand fully
asynchronous code when done.
– Design with physical I/O constraints in mind.
In addition, previous experience in developing real-time codes (where real time is
defined by a maximum allowable latency) on other Unix-like parallel platforms led us to
anticipate that system services, and particularly "heart beats", would be problematic.
Code with maximum tolerance to such interruptions would be necessary.
5 Data Flow
The parallel framework that was developed was based on servicing each stream
independently of the others, scaling within each stream, and ensuring that physical I/O
points on the system were not allowed to become bottlenecks. The resulting schematic for
a single stream is shown in Fig. 3, left. Data is received over the Internet into a
designated Lemieux input node equipped with a 100 Mb Ethernet interface. It is then
divided into congruent horizontal bands for each of the three cameras and distributed
to a variable number of computational nodes for image analysis. The analyzed images
are then gathered to a single node, which combines them into a processed frame and
sends it over the Internet to a remote display station. Note that the streams retain
full independence in this schematic. This was maintained in software, even while
retaining a coherent single executable, with the use of MPI communicator mechanisms.
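The decomposition into streams and horizontal bands can be sketched without any MPI dependency. The function names are hypothetical; the second function only computes the pair of integers that would play the role of the color and key arguments of `MPI_Comm_split` when building one communicator per stream.

```python
def row_bands(height, workers):
    """Split rows 0..height-1 into `workers` contiguous bands (start, stop),
    as equal in size as possible."""
    base, extra = divmod(height, workers)
    bands, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        bands.append((start, stop))
        start = stop
    return bands

def stream_of_rank(rank, pes_per_stream):
    """(stream index, rank within the stream) for a global rank."""
    return divmod(rank, pes_per_stream)
```

With 480 image rows and the 120 processors per stream used in the SC2002 run, `row_bands(480, 120)` gives each processor a band of 4 rows.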
The switch to a terascale system allowed us to accommodate more computationally
intensive algorithms with higher quality images. We incorporated several changes,
notably to the search area and the correlation kernel size. However, the core stereo
matching algorithm was not changed; only the parameters used to operate it were tuned
to increase match reliability.
Fig. 3. Single stream data flow.
Computation. The parallel system operates on images four times the size of those used
in the serial algorithm. The correlation window size is 31 × 31 rather than 5 × 5,
increasing computation approximately 36 times. However, we use binocular instead of
trinocular stereo, so we perform matching once rather than performing
pairwise matching. Thus, the new system requires at least 72 times more computation.
Since we do not perform background subtraction, an additional order of magnitude of
complexity is required.
The correlation window size is the main parameter affecting performance. The
complexity of the stereo algorithm is $O(m^2 n^2)$, where $m^2$ is the size of the
correlation kernel and $n^2$ is the size of the image. We ran a series of tests to
verify the performance and the scalability of the system. The performance of the
real-time system, with networked input of video and network output of 3D streams, is
constrained by many external factors that could cause bottlenecks. Hence, for
performance analysis of the parallel algorithm we switched to file-based I/O. The image
streams are read from disk, and we measure the times for image distribution on the
cluster, image analysis, and 3D data gathering from the various cluster nodes, which
together make up the total processing time.
The reconstruction algorithm broadcasts the image to be processed on a particular node
in its entirety. Hence, as the number of PEs used for a particular stream increases,
so does the broadcast time, as shown in Fig. 4 (left). Each processor then performs
stereo matching on a small strip of the entire image. This is the lowest level of
parallelization. As the number of processors increases, each processor processes fewer
pixels. Fig. 4 (right) shows the speedup in the process frame routine, which performs
image rectification, stereo matching, and the reconstruction of the 3D points. We show
the processing time for seven different correlation window sizes. The reconstructed 3D
points have to be re-assembled, as different parts of the images are reconstructed on
different nodes. This gather operation speeds up with the number of processors because
less data must be gathered from each node.
Fig. 4. Left: The time required to broadcast images to each node increases as the number of
processors increases. Right: Total processing time in msec vs number of processors. Each plot
corresponds to a different kernel size.
Based on the above studies, we have observed that the algorithm scales very efficiently
with the number of processors per stream. The program parallelizes so that the streams
are synchronized when reading data but otherwise run independently of each other, and
hence do not affect each other's performance. Each stream of images has similar
parameters, and hence the execution time is almost the same.
Networking. We conducted several networking and performance tests during the
Supercomputing 2002 (SC2002) Conference in Baltimore, MD. The video servers and the
rendering system on the SC2002 exhibition floor communicated with the PSC Lemieux
system over Abilene, the high-speed Internet2 backbone. The video servers and the
rendering system were each connected to Abilene by separate Gigabit links. Tests were
performed with nine streams of images (from nine video servers). The image data
originating at the video servers was transmitted to Lemieux using TCP. The data was
processed on Lemieux, and the depth/texture data sets were (optionally) sent back to the
rendering system using Reliable UDP (RUDP), a protocol specifically designed by us for
this application. RUDP provides the reliable data transmission required by the
application without any congestion control, thereby providing better throughput than TCP.
Fig. 5. Bandwidth usage plot from the Nov 19 2002 SuperComputing 2002 demonstration, showing
Mbps over time. The darker line (red) shows the data traffic from acquisition to computation site,
while the lighter line (green) shows the traffic from computation to display site.
Fig. 5 shows the bandwidth usage for the run during the "Bandwidth Challenge"
at SC2002. The run consists of two phases. The first phase involved the entire system
from image acquisition to rendering. An average frame rate of 1 frame/sec was achieved
during this phase. The average data rate from the video servers to Lemieux over TCP was
63 Mbps, with peak rates over 200 Mbps. From Lemieux to the rendering system, RUDP
was able to achieve data rates of over 105 Mbps with peaks over 700 Mbps (see the
caption on the plot). The frame rate achievable in this end-to-end operation was limited
by the capacity of the rendering system to consume the depth/texture data. In the second
phase, the rendering system was left out. The image data was processed at Lemieux as
before, but it was not transmitted back to the rendering system. In this phase, a frame
rate of up to 8 frames/sec was achieved, and a data rate of over 500 Mbits/s was
observed for the image transmission over TCP. The burstiness of the traffic is due to
the synchronous, lock-step operation of the system. During the entire run, 1080
processors (120 per stream) were employed for the stereo reconstruction on Lemieux.
Each processor outputs depth estimates for 4 rows, each 640 pixels wide.
7 Conclusions - Outlook
We have ported a real-time application from a dedicated serial environment to one that
crosses a wide area network and utilizes a centralized parallel computing resource. The
parallel code demonstrates excellent scalability and continues to exploit a friendly
segmentation between the image analysis and the parallel framework, which allows
concurrent work by all development groups. This nearly perfect scalability, both by PE
per stream and by stream, makes us optimistic that we will be able to continue our
overall performance gains by three routes:
– Better per-processor performance through optimization.
– Larger machine size runs on a routine basis.
– Larger platforms that become available.
We are easily capable of saturating our current, routinely available networking
connection. Imminent network enhancements will permit progress on that front. To improve
reconstruction quality, we will employ an adaptive window size in order to preserve
discontinuities, as well as inter-scanline processing to alleviate rectification
inaccuracies. In the quest for further performance improvements, we are actively
investigating advanced rendering techniques, including moving the conversion from depth
maps (integer) to 3D points (floats) from the CPU into the graphics hardware. In the
future, we would like to apply image compression techniques to reduce the data bandwidth
requirements. We are also exploring the issue of latency and susceptibility to network
congestion in order to develop a simple protocol that will minimize both, improving
multi-stream throughput. The task of developing a rendering architecture that will scale
as easily and linearly as independent camera-capture/reconstruction streams remains a
significant research challenge.
References
1. P. Baker and Y. Aloimonos. Complete calibration of a multi-camera network. In IEEE
Workshop on Omnidirectional Vision, Hilton Head Island, SC, June 12, 2000.
2. D.J. Brady, R.A. Stack, S. Feller, L. Fernandez, E. Cull, D. Kammeyer, and R. Brady.
Information flow in streaming 3d video. In Three-Dimensional Video and Display Devices
and Systems, SPIE PRESS Vol. CR76, 2000.
3. H. Towles et al. 3d tele-collaboration over internet2. In International Workshop on
Immersive Telepresence, Juan-les-Pins, France, 06 Dec, 2003.
4. J. Lanier. Virtually there. Scientific American, pages 66–75, April 2001.
5. W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan. Image-based visual
hulls. In Proceedings of ACM SIGGRAPH, 2000.
6. J. Mulligan, V. Isler, and K. Daniilidis. Trinocular stereo: A new algorithm and its
evaluation. International Journal of Computer Vision, 47:51–61, 2002.
7. P. Narayanan, P. Rander, and T. Kanade. Constructing virtual worlds using dense
stereo. In Proc. Int. Conf. on Computer Vision, pages 3–10, 1998.
8. R. Raskar, G. Welch, M. Cutts, A. Lake, L. Stesin, and H. Fuchs. The office of the
future: A unified approach to image-based modeling and spatially immersive displays. In
ACM SIGGRAPH, pages 179–188, 1998.