Volograms & V-SENSE Volumetric Video Dataset
Rafael Pagés1, Konstantinos Amplianitis1, Jan Ondrej1, Emin Zerman2 and Aljosa Smolic2
1Volograms Limited, Guinness Enterprise Centre, Taylors Lane, Dublin 8, Ireland
2V-SENSE, School of Computer Science and Statistics, Trinity College Dublin, Ireland
{rafa, kostas, jan}, {emin.zerman, smolica}
Keywords: Volumetric video, 3D reconstructions, Augmented Reality, Virtual Reality
Abstract: Volumetric video is a new form of visual media that enables novel ways of immersive visualisation and
interaction. Currently volumetric video technologies receive a lot of attention in research and standardization,
leading to an increasing need for related test data. This paper describes the Volograms & V-SENSE Volumetric
Video Dataset which is made publicly available to help said research and standardisation efforts.
1. Introduction
Volumetric video is a media format that allows
reconstruction of dynamic 3D objects from real life
and their visualization in immersive applications
such as augmented reality and virtual reality.
Generally, volumetric video is captured in dedicated studios using many inward-facing cameras that record synchronised videos (as further described in Section 2). The volumetric video is then generated as a sequence of 3D models that change over time [4]. Examples of 3D models from a sample volumetric video are shown in Fig. 1. The generated volumetric video can be represented with either coloured point clouds or textured meshes [6].
There are different approaches and methods for content creation. The number of inward-facing cameras in a studio ranges from 4 Microsoft Kinect cameras [7], through 12 synchronised RGB cameras [4] and 32 cameras (i.e., 16 stereo pairs) [8], to 106 cameras including both RGB and infrared cameras [9]. Many different techniques are used for 3D reconstruction.
In contrast to the many volumetric video generation techniques, there have been few volumetric video datasets. The 8i dataset [10] is commonly used for the point cloud representation of volumetric video. There are only two other publicly available volumetric video datasets: vsenseVVDB [11] and vsenseVVDB2 [6].
The former provides two volumetric videos in
coloured point cloud format with four different point
cloud densities. The latter provides four new
volumetric videos in both coloured point cloud and
textured 3D mesh formats.
Even though there have been many studies that focus
on volumetric video creation, the number of datasets
that provide publicly available volumetric video is
very limited. With new MPEG standardisation
activities on compression and quality assessment of
point clouds and dynamic meshes as well as
intensified research in this area, there is a need for
new datasets.
This paper introduces the Volograms & V-SENSE Volumetric Video Dataset, which releases three new volumetric videos with differing characteristics (i.e., different texture and movement characteristics) and different durations. The following sections describe the content creation process, applications, and details of the new dataset.
Figure 1. Example of 3D models (textured and non-textured) from different time instances of a volumetric video.
2. Content Creation
2.1 Capture Setup
The volumetric capture studio, located in Dublin,
Ireland, is a cubic room whose approximate capture volume is a cylinder of 2 m radius.
The studio contains an aluminium frame covered in
green fabric, which also covers the ceiling. The floor
of the capture space is also green. See Fig. 2 below.
Figure 2. Different viewing angles of the capture studio
in Dublin, Ireland.
2.2 Cameras & Capture Capacity
There are 12 cameras mounted to the aluminium frame:
- 6 Blackmagic Micro Studio Full HD cameras
- 6 Blackmagic Micro Studio 4K cameras
- 12 Olympus 7-14mm f/2.8 Pro M.Zuiko Digital ED lenses
- 6 Blackmagic Video Assist units
The capture capacity of the studio is approximately 60 minutes, and the cameras are synchronised using a BMD Sync Generator and a 20-port video/sync
2.2.1 Lighting
The studio is evenly lit by white light (4400 K) emanating from an arrangement of 10 LED industrial light panels (0.3 m × 1.2 m) located at different heights.
2.2.2 Geometric calibration
The cameras are modelled using the pinhole camera model [3], including radial and tangential distortion parameters, which are used to remove the distortion from the images after calibration.
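As a concrete illustration, the pinhole model with radial and tangential distortion can be sketched as follows. This is a minimal NumPy example; the coefficient names k1, k2, p1, p2 follow the common OpenCV-style convention and are an assumption, not taken from the paper:

```python
import numpy as np

def project(points_cam, K, dist):
    """Project 3D points (in the camera frame) to pixels, applying radial and
    tangential distortion before the intrinsic matrix.

    points_cam : (N, 3) array of 3D points in the camera coordinate frame.
    K          : (3, 3) intrinsic matrix.
    dist       : (k1, k2, p1, p2) distortion coefficients.
    """
    k1, k2, p1, p2 = dist
    x = points_cam[:, 0] / points_cam[:, 2]            # normalised image coords
    y = points_cam[:, 1] / points_cam[:, 2]
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2              # radial distortion factor
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)   # + tangential terms
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    u = K[0, 0] * x_d + K[0, 2]                        # focal length and principal point
    v = K[1, 1] * y_d + K[1, 2]
    return np.stack([u, v], axis=1)
```

Calibration estimates K and the distortion coefficients so that the inverse of this mapping can be used to undistort the captured images.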
To calibrate the cameras, a calibration totem with the following characteristics is used:
- A set of cubes placed at multiple orientations, to guarantee different planar surfaces. These cubes can be simple cardboard boxes or similar, making the totem very simple for third parties to build.
- A height similar to a person's height.
- Unique random patterns, generated using the approach by Li et al. [1], attached to each planar surface of the totem.
An example of the calibration totem is provided in
Fig. 3.
Figure 3. Calibration totem example.
The totem is placed in several positions inside the studio and captured with all the cameras. The automatic calibration process is performed by running a multiple structure-from-motion (SfM) system, which takes the different positions of the totem separately but uses all the sets of point matches in a single SfM and bundle adjustment problem.
2.2.3 Radiometric calibration
To perform radiometric calibration, a Macbeth
ColorChecker (24 squares) is used.
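A sketch of how a colour correction could be fitted from ColorChecker measurements, assuming a simple linear (3×3 matrix) model solved by least squares. The linear model is an assumption for illustration; the paper does not specify the exact correction model used:

```python
import numpy as np

def fit_colour_matrix(measured, reference):
    """Fit a 3x3 colour correction matrix M such that reference ≈ measured @ M.

    measured  : (N, 3) RGB values of the chart patches as seen by a camera.
    reference : (N, 3) known RGB values of the same patches.
    """
    # Ordinary least squares over all chart patches (N >= 3 independent colours).
    M, *_ = np.linalg.lstsq(measured, reference, rcond=None)
    return M
```

Applying `img_corrected = img.reshape(-1, 3) @ M` would then bring each camera's colours towards the shared reference, so all views agree radiometrically.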
2.3 3D Reconstruction Pipeline
To generate the 3D data, our volumetric pipeline uses a proprietary algorithm that combines the best of SfM, MVS (Multi-View Stereo) and volume estimation to guarantee the generation of both detailed and complete 3D human models, even when a reduced number of cameras is used in the capture. The global approach is described in the work of Pagés et al. [4].
Firstly, a foreground segmentation algorithm separates the performer from the background in each of the cameras, and the resulting foreground masks are used to estimate a scene volume through the visual hull. In the following stage, a multi-view depth estimation algorithm computes a depth map for each of the cameras by using an MVS algorithm in the volume constrained by the visual hull, which improves accuracy and performance.
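The visual-hull volume estimation described above can be sketched as a simple voxel carving loop: a voxel belongs to the hull only if it projects inside the foreground silhouette in every camera. This is an illustrative NumPy example, not the authors' implementation:

```python
import numpy as np

def visual_hull(voxels, cameras, masks):
    """Keep only voxels whose projections fall inside every silhouette mask.

    voxels  : (N, 3) voxel centres in world coordinates.
    cameras : list of (3, 4) projection matrices P = K [R | t].
    masks   : list of (H, W) boolean foreground masks, one per camera.
    """
    keep = np.ones(len(voxels), dtype=bool)
    hom = np.hstack([voxels, np.ones((len(voxels), 1))])   # homogeneous coordinates
    for P, mask in zip(cameras, masks):
        uvw = hom @ P.T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        H, W = mask.shape
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        keep &= hit                # carve away voxels outside any silhouette
    return voxels[keep]
```

The surviving voxels bound the scene volume, which is what constrains the subsequent MVS depth search.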
Furthermore, the resulting point cloud is fused with
the estimated scene volume in an intelligent way by
statistically analysing the differences between the
MVS point cloud and the volume point cloud (the
vertices of the volume mesh) in the voxel space,
identifying missing geometry and keeping only the
information that is denser and more accurate [4].
Next, a volume-constrained Poisson Surface Reconstruction [5] process is applied to obtain the final detailed mesh while avoiding the connection of small gaps in the model. Lastly, a re-meshing process reduces the polygon count of the resulting models.
Once a model per frame has been obtained, it is necessary to apply a key-framing and temporal consistency process: a fundamental step with two key purposes. The first is to ensure the meshes are temporally coherent, reducing flickering and other temporal artefacts. The second is to identify temporal redundancies, reducing the amount of information to store and enabling smaller files.
this, we analyse the motion of the sequence through
the optical flow and use the flow correspondences to
drive a robust ICP algorithm that registers meshes in
a tracking sub-sequence.
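The registration step can be illustrated with the closed-form rigid alignment (Kabsch) that a single ICP iteration performs once flow-based correspondences are available. This is a simplified sketch of one alignment step, not the robust ICP variant used in the pipeline:

```python
import numpy as np

def rigid_align(src, dst):
    """Best-fit rotation R and translation t so that src @ R.T + t ≈ dst (Kabsch).

    src, dst : (N, 3) corresponding points, e.g. mesh vertices of consecutive
               frames matched via optical flow correspondences.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)            # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Iterating this alignment while re-estimating correspondences (and rejecting outliers) yields an ICP-style registration of the meshes within a tracking sub-sequence.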
The last step of the pipeline is generating a set of UV
maps and colouring them. We use D-charts to
generate the texture atlases and the method by Pagés
et al. [2] to blend the colour information from the
different cameras.
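The idea of blending colour information from several cameras can be illustrated with a simple view-angle weighting scheme, where cameras that see a surface point more frontally contribute more. This weighting is an assumption for illustration only; the actual blending method is the one described in Pagés et al. [2]:

```python
import numpy as np

def blend_colours(colours, normal, view_dirs):
    """Blend per-camera colour samples for one surface point.

    colours   : (C, 3) colour sampled from each of C cameras.
    normal    : (3,) unit surface normal at the point.
    view_dirs : (C, 3) unit vectors from the point towards each camera.
    Assumes at least one camera faces the surface (positive weight).
    """
    w = np.clip(view_dirs @ normal, 0.0, None)   # back-facing cameras get weight 0
    w = w / w.sum()
    return w @ colours                           # weighted average colour
```

Evaluating this per texel while writing into the UV atlas gives a seam-free blend across camera views.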
2.4 Applications
The generated volumetric video can be used in many different applications, such as education, museums (or cultural heritage), tour guides, entertainment, telepresence, or teleconferencing.
One of the main applications of volumetric video is
telepresence. That is, the users can project their 3D
images onto another person’s reality and can feel
“present” in that reality. One of the first efforts using volumetric video to achieve this was Microsoft's Holoportation system, which used HoloLens head-mounted displays [12]; more recently, newer systems have been developed for similar tasks [13]. In another instance, Trinity College
Dublin’s then Provost was recorded and his image
was shown at the Trinity Business and Technology
Forum 2018 [14].
Another interesting avenue of application is cultural heritage. With volumetric
video, Samuel Beckett’s “Play” could be reenacted
in AR or VR (i.e., Virtual “Play”) [15], Jonathan
Swift’s likeness can reappear at Trinity College
Dublin Library [16], or James Joyce’s novel
character Stephen Dedalus from Ulysses can be
brought to life in VR [17]. A sample of these
projects can be seen in Fig. 4.
Other applications can include education, empathy
building (e.g., the creative experiment of “Bridging
the Blue”) [18], or entertainment (e.g., Awake:
Episode One) [19].
Figure 4. Some projects using Volograms technology.
3. Details of the Dataset
This dataset includes three sequences featuring three
different characters, each of them captured with a
different purpose and application in mind. The three
sequences feature male characters with varying skin
colour, clothing, stature and range of movements.
Rafa
This sequence shows a performer, Rafa, who does a quick electro move in a five-second clip. Rafa wears a standard shirt and jeans, which is representative of many volumetric captures done nowadays. He also finishes his moves with a thumbs-up gesture, which shows the reconstruction accuracy for fingers. Rafa was captured at V-SENSE's 12-camera studio in Dublin, Ireland. Meshes are ~40k polygons per frame and texture images are 4096x4096. A sample frame of this dataset is provided in Fig. 5.
Figure 5. Sample frame of Rafa sequence.
Levi
A five-second dancing sequence featuring Levi, an incredibly talented performer. Levi's moves are fast, dynamic and very complex, which poses a great challenge for 3D reconstruction algorithms. The models present very accurate details, such as fingers and facial features, even though motion blur can be an issue for the reconstruction process. Levi was captured in a 60-camera studio in California, US. Meshes are ~40k polygons per frame and texture images are 4096x4096. A sample frame of this dataset is provided in Fig. 6.
Figure 6. Sample frame of Levi sequence.
Sir Frederick
A one-minute monologue sequence featuring an actor performing as Sir Frederick Hamilton, from Manorhamilton Castle in Leitrim, Ireland. This capture was done for an immersive cultural activation at the castle. Sir Frederick wears mediaeval clothing with some dark and shiny elements, which are typically challenging for 3D reconstruction. Sir Frederick speaks to the audience, and his facial features are very expressive. Sir Frederick was captured at V-SENSE's 12-camera studio in Dublin, Ireland. Meshes are ~40k polygons per frame and texture images are 4096x4096. A sample frame of this dataset is provided in Fig. 7.
Figure 7. Sample frame of Sir Frederick sequence.
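For users of the dataset, a minimal loader for a per-frame mesh sequence might look as follows. The OBJ format, directory layout, and file naming here are assumptions for illustration, not specified by the dataset description:

```python
import glob
import os

def load_obj_vertices(path):
    """Minimal OBJ parser: return the list of (x, y, z) vertex positions."""
    verts = []
    with open(path) as f:
        for line in f:
            if line.startswith("v "):          # vertex position line
                _, x, y, z = line.split()[:4]
                verts.append((float(x), float(y), float(z)))
    return verts

def load_sequence(directory):
    """Load every per-frame mesh in a directory, sorted by filename."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.obj"))):
        frames[os.path.basename(path)] = load_obj_vertices(path)
    return frames
```

A real loader would also read the face indices, UV coordinates, and the per-frame texture image referenced by the material file.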
4. Conclusion
This paper introduces the Volograms & V-SENSE Volumetric Video Dataset, which includes three volumetric video sequences, each created with a different application scenario in mind. Each of the sequences has different characteristics in terms of texture (e.g., clothing, skin colour) and movement (e.g., from little movement to fast, varying movement).
Furthermore, the volumetric video sequences have
different durations. These aspects make this dataset
unique for its use in scientific studies and
standardisation activities.
Acknowledgements
This work is part of the INVICTUS project that has
received funding from the European Union’s
Horizon 2020 research and innovation programme
under grant agreement No 952147. It reflects only
the authors' views and the Commission is not
responsible for any use that may be made of the
information it contains.
This publication has emanated partially from
research conducted with the financial support of
Science Foundation Ireland (SFI) under the Grant
References
1. Li, B., Heng, L., Köser, K., & Pollefeys, M. (2013). A multiple-camera system calibration toolbox using a feature descriptor-based calibration pattern. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, pp. 1301-1307.
2. Pagés, R., Berjón, D., Morán, F., & García, N.
(2015, February). Seamless, Static
Multi-Texturing of 3D Meshes. In Computer
Graphics Forum (Vol. 34, No. 1, pp. 228-238).
3. Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
4. Pagés, R., Amplianitis, K., Monaghan, D.,
Ondřej, J., & Smolić, A. (2018). Affordable
content creation for free-viewpoint video and
VR/AR applications. Journal of Visual
Communication and Image Representation, 53.
5. Kazhdan, M., & Hoppe, H. (2013). Screened
poisson surface reconstruction. ACM
Transactions on Graphics (ToG),32(3), 1-13.
6. Zerman, E., Ozcinar, C., Gao, P., & Smolic, A.
(2020, May). Textured mesh vs coloured point
cloud: A subjective study for volumetric video
compression. In 2020 Twelfth International
Conference on Quality of Multimedia
Experience (QoMEX) (pp. 1-6). IEEE.
7. Alexiadis, D. S., Zarpalas, D., & Daras, P.
(2012). Real-time, full 3-D reconstruction of
moving foreground objects from multiple
consumer depth cameras. IEEE Transactions on
Multimedia, 15(2), 339-358.
8. Schreer, O., Feldmann, I., Renault, S., Zepp, M.,
Worchel, M., Eisert, P., & Kauff, P. (2019,
September). Capture and 3D video processing
of volumetric video. In 2019 IEEE International
conference on image processing (ICIP) (pp.
4310-4314). IEEE.
9. Collet, A., Chuang, M., Sweeney, P., Gillett, D.,
Evseev, D., Calabrese, D., ... & Sullivan, S.
(2015). High-quality streamable free-viewpoint
video. ACM Transactions on Graphics (ToG),
34(4), 1-13.
10. d’Eon, E., Harrison, B., Myers, T., & Chou, P.
A. (2017). 8i voxelized full bodies, version 2–A
voxelized point cloud dataset. ISO/IEC
JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG)
input document m40059/M74006.
11. Zerman, E., Gao, P., Ozcinar, C., & Smolic, A.
(2019). Subjective and objective quality
assessment for volumetric video compression.
Electronic Imaging, 2019(10), 323-1.
12. Orts-Escolano, S., Rhemann, C., Fanello, S.,
Chang, W., Kowdle, A., Degtyarev, Y., ... &
Izadi, S. (2016, October). Holoportation: Virtual
3d teleportation in real-time. In Proceedings of
the 29th annual symposium on user interface
software and technology (pp. 741-754).
13. Jansen, J., Subramanyam, S., Bouqueau, R.,
Cernigliaro, G., Cabré, M. M., Pérez, F., &
Cesar, P. (2020, May). A pipeline for multiparty
volumetric video conferencing: transmission of
point clouds over low latency DASH. In
Proceedings of the 11th ACM Multimedia
Systems Conference (pp. 341-344).
14. V-SENSE. (2018, October 22). Volumetric Video of Trinity Provost Patrick Prendergast. Retrieved March 21, 2022.
15. O’Dwyer, N., Johnson, N., Bates, E., Pagés, R.,
Ondřej, J., Amplianitis, K., ... & Smolić, A.
(2017, October). Virtual play in free-viewpoint
video: Reinterpreting samuel beckett for virtual
reality. In 2017 IEEE International Symposium
on Mixed and Augmented Reality
(ISMAR-Adjunct) (pp. 262-267). IEEE.
16. O’Dwyer, N., Zerman, E., Young, G. W.,
Smolic, A., Dunne, S., & Shenton, H. (2021).
Volumetric Video in Augmented Reality
Applications for Museological Narratives: A
user study for the Long Room in the Library of
Trinity College Dublin. Journal on Computing
and Cultural Heritage (JOCCH), 14(2), 1-20.
17. O’Dwyer, N., Young, G. W., & Smolic, A.
(2022). XR Ulysses: addressing the
disappointment of cancelled site-specific
re-enactments of Joycean literary cultural
heritage on Bloomsday. International Journal of
Performance Arts and Digital Media, 1-19.
18. Arielle, L. G. and Smolic, A. “Bridging the
Blue”, in The Art Exhibit at ICIDS 2019 Art
Book: The Expression of Emotion in Humans
and Technology, edited by Ryan Brown and
Brian Salisbury, pp. 15-27, Carnegie Mellon
University, Pittsburgh: ETC Press, 2020.
19. Start VR: Awake (Accessed: Jan 2020).