
MotionHub: Middleware for Unification of Multiple Body Tracking Systems

Philipp Ladwig, Kester Evers, Eric J. Jansen, Ben Fischer, David Nowottnik, Christian Geiger
Mixed Reality and Visualization Group (MIREVI), University of Applied Sciences Düsseldorf, Germany
Figure 1: MotionHub is an open-source middleware that offers interfaces to multiple body tracking systems. a) Two users were captured and tracked by a Microsoft Azure Kinect. b) The graphical user interface of MotionHub. The green cubes represent an OptiTrack recording while the yellow ones represent an Azure Kinect live capture. c) MotionHub streams a unified skeletal representation in real time to clients such as the Unity game engine via a plug-in.
ABSTRACT
There is a substantial number of body tracking systems (BTSs), which cover a wide variety of technologies, quality levels and price ranges for character animation, dancing or gaming. To the disadvantage of developers and artists, almost every BTS streams out different protocols and tracking data. Not only do they vary in terms of scale and offset, but their skeletal data also differs in rotational offsets between joints and in the overall number of bones. Due to this circumstance, BTSs are not effortlessly interchangeable. Usually, software that makes use of a BTS is rigidly bound to it, and a change
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MOCO ’20, July 15–17, 2020, Jersey City/ Virtual, NJ, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
ACM ISBN 978-1-4503-7505-4/20/07. . . $15.00
to another system can be a complex procedure. In this paper, we present our middleware solution MotionHub, which can receive and process data of different BTS technologies. It converts the spatial as well as the skeletal tracking data into a standardized format in real time and streams it to a client (e.g. a game engine). That way, MotionHub ensures that a client always receives the same skeletal-data structure, irrespective of the BTS used. As a simple interface enabling the user to easily change, set up, calibrate, operate and benchmark different tracking systems, the software targets artists and technicians. MotionHub is open source, and other developers are welcome to contribute to this project.
CCS CONCEPTS
• Computing methodologies → Motion processing; • Information systems → Multimedia streaming;

KEYWORDS
Body tracking, middleware, skeletal data, motion capture, Azure Kinect, OptiTrack
ACM Reference Format:
Philipp Ladwig, Kester Evers, Eric J. Jansen, Ben Fischer, David Nowottnik, and Christian Geiger. 2020. MotionHub: Middleware for Unification of Multiple Body Tracking Systems. In 7th International Conference on Movement and Computing (MOCO ’20), July 15–17, 2020, Jersey City/ Virtual, NJ, USA. ACM, New York, NY, USA, 8 pages.
1 INTRODUCTION
Real-time body tracking is an indispensable part of many interactive art performances, mixed-reality applications and games. Due to progress in research, higher computing power and advances in sensor technology, a large number of BTSs that are suitable for such applications have been developed in recent years. Developers are spoilt for choice: They have to select a system and must consider various advantages and disadvantages concerning price, accuracy and the size of the tracking area. However, in many cases not all requirements are known at the beginning of a project; some only emerge over time. This can pose a challenge, as a subsequent change to another tracking system can be costly. In such a scenario, a middleware that allows for an effortless switch between different BTSs would be useful.
The term ’middleware’ was probably first used in 1968 []. Since then, various subtypes have been described, but the basic meaning has never changed: A middleware receives, processes and transmits information between at least two peers. Middlewares are often able to understand multiple information representations and to translate them into a standardized or unified one.
The contribution of this paper is the open-source middleware ’MotionHub’ and an open-source game engine plug-in, which can be seen in Fig. 1b and c. The intention behind our work is to unify the output data of the wide variety of available BTSs. We believe that such a unification will not only lead to an easier switch between BTSs during production but that it will also reduce the preparation time for developments that involve BTSs. Integrating a BTS into an existing application requires manual effort. If developers then want to integrate another BTS or exchange the previous one, additional effort must be considered. In some cases, an update to a newer version of a BTS is desired, which produces an even higher manual effort. Our intention is to integrate a generic BTS protocol into applications once in order to reduce the overall time spent on maintenance, setup and switching between different BTSs.
Beyond this, more benets can be mentioned. If a middleware
understands dierent BTSs, it is possible to prioritize and to switch
between them automatically to exploit the advantages of each sys-
tem. A possible scenario could be the use of a BTS that is accurate
but limited in tracking space and a dierent one that has a larger
tracking space but is less accurate. A middleware could prioritize
the more accurate BTS as long as the tracked person stays in the
’accurate’ tracking space and automatically switches to the ’less
accurate’ space whenever he or she leaves it.
Despite these advantages, the concept of the MotionHub also has a major drawback: By nature, a middleware induces a delay because it receives, converts and resends the data between two ends, which requires a certain processing time. Therefore, this aspect receives special attention in this work, and we evaluate it at the end of this paper.
A demo video with a summary of MotionHub’s capabilities can
be watched here:
The Git repository can be found here:
2 RELATED WORK
Body tracking is a wide field with many years of research and development, and it has produced a large number of different software and hardware approaches as well as standardizations and file formats. In this section, we only mention the ones most common and important for MotionHub.
2.1 Standards and File Formats
Probably the best-known and oldest de facto standard for data exchange in virtual reality is the Virtual-Reality Peripheral Network (VRPN) by Taylor et al. []. It offers simple and unified interfaces to a broad range of devices from different manufacturers. Many of these devices share common functionalities such as 6-DOF tracking or button input, while the way of accessing these functions differs between manufacturers. VRPN unifies functions across different devices as generic classes such as vrpn_Tracker or vrpn_Button. Therefore, it can be seen as both a standard and a middleware. The approach of MotionHub is similar, but it focuses on body tracking.
The rst ocial international standard for humanoid animation
is the H-Anim, which was created within the scope of the Extensible
3D (X3D) standard and is a successor of the Virtual Reality Model-
ing Language (VRML) [
]. H-Anim was published in 2006 [
updated in 2019 [
] and is one of the only eorts yet to create
an ocial open standard for humanoid avatar motion and data
COLLADA and FBX are interchange file formats for 3D applications and are widely used today. While humanoid animation is not the focus of COLLADA, its open and versatile structure enables developers to save body tracking data. Compared to COLLADA, the proprietary FBX format mainly focuses on motion data but lacks clear documentation, which has led to incompatible versions.
While COLLADA and FBX can also be used for writing and reading 3D geometry, the Biovision Hierarchy (BVH) file format was developed exclusively for handling skeletal motion data and is therefore simpler in structure. It is supported by many body tracking applications and, because of its simplicity and lower overhead compared to other file formats, it is often used for real-time transmission of humanoid motion data. A deeper and more comprehensive overview of further tracking file formats is given by
Meredith and Maddock [30].
2.2 Software and Hardware
Microsoft started shipping the Kinect in 2010 and thereby allowed the body tracking community to grow substantially, since the Kinect was an affordable and capable sensor. During this time, PrimeSense made OpenNI [] and NITE [, p.15] publicly available. OpenNI offers low-level access to the Microsoft Kinect and other PrimeSense sensors, while NITE was a middleware that enables the user to perform higher-level operations such as body tracking and gesture recognition. PrimeSense stopped the distribution of OpenNI and NITE after having been acquired by Apple Inc. Although OpenNI is still available on other web pages [], there is no active development anymore.
Based on the success of the Kinect and other PrimeSense sensors, many proprietary BTSs that are based on RGB-D streams have been developed, such as iPi Mocap Studio [], Brekel Body [] or Nuitrack []. Examples of BTSs that use IMU-based tracking are Xsens MVN [], Perception Neuron [] or Rokoko Smartsuit []. Optical solutions that are based on passive or active (infrared) markers include OptiTrack Motive:Body [], Qualisys Track Manager [], ART-Human [], Vicon Tracker [] and Motion Analysis Cortex [23].
The concept of OpenVR [], which is implemented in SteamVR [], and OpenXR [] is similar to the idea behind MotionHub. It unifies the interfaces of a large number of different types of mixed-reality input and output devices into a single library. The fact that it is open source is probably an important reason for its success. While OpenVR and OpenXR do not focus on body tracking, there are efforts to extend their functionalities. For example, IKinema [] has developed an application named Orion, which is based on inverse kinematics and utilizes HTC Vive trackers that are attached to the hip and the feet. IKinema was acquired by Apple Inc. in 2019, and they stopped the distribution of Orion.
MiddleVR [] is a proprietary middleware that supports various commercial input and output devices and offers a generic interface to the Unity game engine []. Body tracking is supported, but it is not the focus of the software. The commercial software most closely related to MotionHub is Reallusion iClone with its plug-in Motion LIVE []. It focuses on real-time body tracking and supports several different tracking systems for face, hand and body capture.
2.3 Research
The research that is most closely related to our work is OpenTracker [] and Ubiquitous Tracking (UbiTrack) []. Both systems are generic data-flow network libraries for different tracking systems. Unlike MotionHub, they do not focus solely on body tracking. They offer a generic interface to object tracking systems for mixed-reality applications, similar to the idea behind VRPN. Although OpenTracker and UbiTrack have a different focus than MotionHub and the research was conducted more than 16 years ago, the concept of unification is similar and reusable for our work.
Suma et al. [] have developed FAAST (Flexible Action and Articulated Skeleton Toolkit), which is based on OpenNI and NITE. It provides a VRPN server for streaming the user’s skeleton joints over a network. However, its development has been discontinued. Damasceno et al. [] present a middleware for multiple low-cost motion capture devices that is applied to virtual rehabilitation. A similar system is suggested by Eckert et al. [] for playing exergames.
OpenPTrack [] is one of the most recent systems. It is not described as a middleware itself but rather as a BTS. However, since OpenPTrack supports the processing of multiple RGB-D camera streams of various manufacturers and uses different tracking algorithms (such as its own or OpenPose []), it rather acts as a middleware. Similar to MotionHub, OpenPTrack is open source and focuses mainly on body tracking. The difference between the two systems is that OpenPTrack solely concentrates on working with RGB-D streams, while MotionHub is based on the intention of including different tracking technologies such as optical, IMU-based or magnetic tracking in order to exploit the specific advantages of each individual technology. Therefore, our approach must be more generic in order to fuse, calibrate and merge the heterogeneous data of different BTSs.
To the authors’ knowledge, there is no middleware available today that supports high-cost and low-cost BTSs as well as recent systems and different technologies (not only RGB-D streams) and is open source. Although the concept of a body tracking middleware is not new, MotionHub meets the mentioned aspects and is therefore, to our knowledge, a unique system that makes a valuable contribution to the community.
3 MOTIONHUB
MotionHub receives raw skeletal data from different BTSs and processes it to create a unified skeleton in real time. The raw data is generated by the respective software development kits (SDKs) of each BTS. Each system uses different transmission methods and protocols, hierarchy structures, units, coordinate systems and numbers of joints as well as rotation offsets between joints. To perform a correct transformation to the unified skeleton of the MotionHub, each input type requires a specific procedure for receiving, transforming and mapping data. As soon as the data is available in its proper format, it is transmitted to the MotionHub client via the Open Sound Control (OSC) [46] protocol.
3.1 Unied Skeleton
Figure 2: The unified skeleton structure streamed by MotionHub.

In order to unify the output data of different BTSs, we transform the heterogeneous skeletal-data structures into a generic skeleton representation consisting of 21 joints, as shown in Fig. 2. The structure of the joints is based on the Unity humanoid avatar skeleton [], is similar to the H-Anim LOA-1 standard [, p.20] and is also used by
BTSs such as Azure Kinect. This standard of joint names and joint
indices allows for a structured implementation of the transformed skeletal data for avatar animation in external applications. Each joint is represented by a global position (Vector3), a global rotation (Quaternion) and a confidence value (Enum [none, low, medium and high]) in a right-handed coordinate system. All raw skeletal data is processed based on this unification, which will be discussed in detail later on.
Some BTSs like the Azure Kinect provide joint confidence values based on distance and visibility. For joints of BTSs that do not provide confidence values, for example OptiTrack, we use ’high’ as the default value.
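The joint representation described above can be sketched as a plain data structure. This is an illustrative reconstruction from the text, not MotionHub's actual type names:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the unified skeleton described above.
enum class Confidence { None, Low, Medium, High };

struct Vector3 { float x, y, z; };
struct Quaternion { float w, x, y, z; };

struct Joint {
    Vector3 position{0.0f, 0.0f, 0.0f};          // global position, right-handed
    Quaternion rotation{1.0f, 0.0f, 0.0f, 0.0f}; // global rotation (identity)
    Confidence confidence = Confidence::High;    // default for BTSs without values
};

// 21 joints, following the Unity humanoid avatar layout.
constexpr std::size_t kJointCount = 21;
using Skeleton = std::array<Joint, kJointCount>;
```

Defaulting the confidence to 'high' matches the behavior described for BTSs, such as OptiTrack, that deliver no per-joint confidence.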
3.2 Subsystem Architecture
Each BTS has its own native capture frequency and update rate. In order to receive and process incoming data as fast as possible, the data-processing code needs to be executed independently of the main program loop. As a solution to this problem, we incorporated autonomous threads for each BTS within the MotionHub. A tracking thread receives raw skeletal data from the respective SDK, processes it and immediately sends it to the (game engine) client. When using threads, it is important to protect memory areas from simultaneous access with appropriate ’atomic’ structures. Many consumers such as the tracker threads, the UI thread, the render thread and the network sender thread access protected areas. We experienced better performance and lower latency in processing and sending data when protected memory was first copied to pre-allocated areas and then processed on the copy, instead of locking critical sections for the whole duration of processing.
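The copy-then-process pattern can be sketched as follows. This is an illustrative example (names and the millimetre-to-metre conversion are our own stand-ins, not MotionHub's actual code):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Shared state accessed by tracker, UI, render and network threads.
struct SharedPool {
    std::mutex mtx;                 // guards jointData against concurrent access
    std::vector<float> jointData;   // protected joint values
};

void trackerThreadStep(SharedPool& pool, std::vector<float>& scratch) {
    {
        std::lock_guard<std::mutex> lock(pool.mtx);
        scratch = pool.jointData;   // short critical section: copy only
    }
    // Conversion and sending happen on the copy, so other threads are not
    // blocked in the meantime. Stand-in for real work: millimetres to metres.
    for (float& v : scratch) v *= 0.001f;
}
```

Reusing a pre-allocated `scratch` buffer across calls keeps the critical section down to a memcpy-sized copy.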
To stream skeletal data to the game engine client, we use the OSC protocol [] because of its simplicity. The structure of our protocol is similar to the Biovision Hierarchy (BVH) file format but is extended with further MotionHub-specific control messages, which are necessary for the communication between MotionHub and the client side. Furthermore, we created a skeletal-data representation that is more compact than BVH or VRPN-based data streams in order to further reduce the network latency. As our lower-level protocol we prioritize UDP over TCP because we prefer a fast connection and low latency over packet-loss recovery. Each OSC package consists of translation (three float values) and rotation (four float values) data for each joint as well as the skeleton ID (integer). Furthermore, a render window is integrated into the UI module to visualize incoming as well as transformed skeleton representations. Skeleton joints are rendered in different colors depending on their confidence values, as shown in Fig. 3.
3.3 Dependencies and Build System
MotionHub is written in C++ for two reasons: First, C++ is considered one of the fastest programming languages [], which is an important aspect for a real-time body tracking middleware. The second reason is that building an interface to numerous SDKs is only possible in C++, because a significant number of BTSs only provide an SDK that is based on this programming language.
Because Microsoft’s body tracking SDK [] is based on neural networks, it requires the NVIDIA CUDA Deep Neural Network library (cuDNN) []. For handling matrix, vector and quaternion calculations, MotionHub uses Eigen []. The user interface is created with the Qt 5 framework [5].

Figure 3: The OpenGL render window in a) indicates the current tracking confidence of joints with the aid of color. Some BTSs, such as Azure Kinect, provide such values. Yellow joints have a ’medium’ confidence value, while red joints have ’low’ confidence because they are occluded, as can be seen in b). For BTSs that do not deliver a confidence value, a default value can be selected manually. In this figure, the OptiTrack recording’s value is ’high’ by default and rendered in green.
MotionHub aims to be as open as possible, and it automatically downloads and configures several software dependencies when building with CMake []. This significantly reduces the required amount of work for developers and allows for easier development in the future. Binaries are also available. We have chosen Windows as our target operating system, because many SDKs are only available for this system.
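Fetching and configuring a dependency automatically at configure time can be sketched with CMake's FetchContent module. This is a hypothetical fragment under our own assumptions (project name, Eigen as the example dependency, tag 3.3.7), not MotionHub's actual build script:

```cmake
# Sketch: download and wire up a dependency during CMake configuration.
cmake_minimum_required(VERSION 3.14)
project(MotionHubExample CXX)

include(FetchContent)
FetchContent_Declare(
  eigen
  GIT_REPOSITORY https://gitlab.com/libeigen/eigen.git
  GIT_TAG        3.3.7
)
FetchContent_MakeAvailable(eigen)

add_executable(example main.cpp)
target_link_libraries(example PRIVATE Eigen3::Eigen)
```

With this approach a fresh checkout builds without manually installing the dependency first, which is the reduction in developer effort described above.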
3.4 User Interface
MotionHub’s main UI can be seen in Fig. 1b. A detailed view of the right side of the UI is shown in Fig. 4. The white numbers 1 and 2 indicate the buttons for adding and removing BTSs. Number 3 is a toggle button, which starts or stops the global tracking loop. All BTSs can be started or stopped together or individually.
As shown in Fig. 4 below number 4, selecting a tracker in the list (orange box) will display its properties in the Tracker Property Inspector. Different tracking systems have different coordinate systems. For example, OptiTrack has its origin in the middle of the tracking area on the floor, while the origin of the Azure Kinect lies inside the camera. When multiple BTSs are combined in one tracking area, their origins must be spatially merged. Therefore, the user can offset the coordinate system origin of different BTSs in the Tracker Property Inspector.

Figure 4: MotionHub user interface: Tracker Control Panel (1–3) and Tracker Property Inspector window (4).
3.5 Conversion Matrices
Processing joint position data into MotionHub’s unified coordinate space is done in multiple steps: applying translation, rotation and scale offsets to merge tracking spaces and, if necessary, mirroring the correct axes. Rotations of joints, however, are the most complicated part and differ between BTSs. A list of the specific rotations of the implemented BTSs can be found online in MotionHub’s documentation: In addition to the coordinate system, we had to consider the skeletal structure. In some BTSs the skeleton is structured in a hierarchy, so that the joint rotations are in local coordinate spaces. These local values are transformed to global rotations by MotionHub before they are transmitted to the receiver side. For example, each joint rotation of the Azure Kinect system is offset by different values.
The output rotation quaternion R_i for all joints i of a tracker is the product of the tracker-specific global coordinate system transformation T, the inverse global offset orientation O_i and the raw input rotation I_i:

R_i = T · O_i⁻¹ · I_i
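The product of the coordinate-system transform, the inverse offset orientation and the raw input rotation can be sketched with a minimal quaternion type. MotionHub itself uses Eigen for such math; this self-contained stand-in (with illustrative names) only serves the example:

```cpp
#include <cassert>
#include <cmath>

struct Quat { float w, x, y, z; };

// Hamilton product of two quaternions.
Quat mul(const Quat& a, const Quat& b) {
    return { a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
             a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
             a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
             a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w };
}

// For unit quaternions the conjugate equals the inverse.
Quat conj(const Quat& q) { return { q.w, -q.x, -q.y, -q.z }; }

// Output rotation for one joint: coordinate-system transform T, inverse of
// the joint's global offset orientation O, then the raw input rotation I.
Quat outputRotation(const Quat& T, const Quat& O, const Quat& I) {
    return mul(mul(T, conj(O)), I);
}
```

With identity T and O, the output equals the raw input rotation, which is a convenient sanity check when integrating a new BTS.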
Figure 5: The data ow (image template kindly provided by
Daemen et al. [7]).
Figure 6: The Unity game engine plug-in: a) shows the avatar view in the Unity Renderer. b) shows the debug view displaying the position and rotation of all joint axes.
3.6 Game Engine Client
To be able to receive skeletal data in a game engine, we developed a receiver package for the Unity engine. It contains code that creates a character for each received skeleton and animates it with the given rotation values, as shown in Fig. 6. Character animation is solved in the plug-in code by multiplying the joint rotation in T-pose and all inverse joint-rotation values of the character in hierarchy order:

R_i(client) = I_i · T_i · ∏_{k=1}^{n} r_{f(i,k)}⁻¹

For all joints i, the transmitted rotation quaternion I_i is multiplied by the character’s joint rotation in T-pose, T_i, and the product of all inverse joint rotations r in the skeleton hierarchy above the current joint. Here, f(i,k) returns the joint that is k nodes above i in the hierarchy, and j(i) gives the hierarchy level on which joint i is located. The process iterates through the joint hierarchy upwards, starting with the parent of the joint and ending with the root joint, with n representing the number of iterations. Afterwards, the product quaternion R_i(client) is applied to the character’s local rotation.
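The upward walk through the hierarchy can be sketched as follows. This is a hypothetical reconstruction from the description above (the quaternion type, parent-index layout and function names are our own, not the plug-in's actual C# code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Quat { float w, x, y, z; };

// Hamilton product of two quaternions.
Quat mul(const Quat& a, const Quat& b) {
    return { a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
             a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
             a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
             a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w };
}

// For unit quaternions the conjugate equals the inverse.
Quat conj(const Quat& q) { return { q.w, -q.x, -q.y, -q.z }; }

// Received rotation I for one joint, composed with the character's T-pose
// rotation and the inverse T-pose rotations of all ancestors up to the root.
Quat clientRotation(const std::vector<int>& parent,   // parent index, -1 at root
                    const std::vector<Quat>& tPose,   // T-pose rotation per joint
                    const Quat& I, int joint) {
    Quat r = mul(I, tPose[joint]);                    // I_i * T_i
    for (int p = parent[joint]; p != -1; p = parent[p])
        r = mul(r, conj(tPose[p]));                   // walk up to the root
    return r;
}
```

With an all-identity T-pose the result reduces to the transmitted rotation itself, which matches the intuition that the composition only compensates for the character's bind pose.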
During the development of a plug-in and the integration of a BTS into MotionHub, it is critical to preview the processed data in order to identify and debug rotation offsets on different axes. To facilitate this process in Unity, our plug-in is able to visualize debug options, as can be seen in Fig. 6b. These options include toggles for the display of the skeleton ID, avatar position, joint axes, joint names and avatar meshes. The skeleton ID is equal to the internal one of the MotionHub and is transferred to the client in OSC data packages. Moreover, the plug-in is designed to be avatar-mesh independent. This means that a switch between different skeletons is possible without code changes as long as Unity recognizes them as humanoid.

Figure 7: Playing ’Human Tetris’ with the MotionHub. a) shows the physical set-up with OptiTrack and Azure Kinect. b) shows our evaluation game created in Unity 3D.
4 EVALUATION
In order to verify the practical benefit of the MotionHub during the development and production of an application, we created an interactive game called ’Human Tetris’. It incorporates the movements of two users. The procedure is as follows: Player #1 starts with a random body pose. The shadow of this pose is orthogonally projected onto a wall and is frozen when Player #1 verbally confirms that he or she is ready. Next, Player #2 needs to prepare and try to copy the body posture and shadow of Player #1 while the wall visually approaches the players. As soon as the wall reaches Player #2, the game calculates and displays a score depending on how well the posture was imitated. The game is shown in Fig. 7, and a video of the game can be watched here:
4.1 Procedures
We conducted three tests: 1) playing Human Tetris, 2) measuring the time for switching between BTSs and 3) measuring the delay induced by MotionHub. For the first test, we played two rounds of Human Tetris and incorporated two Azure Kinects as well as an OptiTrack motion tracking system with 24 cameras (12 x Prime 13 and 12 x Prime 17W) and a suit with 50 passive markers. We ran one instance each of the game, OptiTrack Motive and the MotionHub,
Table 1: Delays between physical motion and recognized motion by two specific BTSs (MH stands for MotionHub).
System @30 Hz With MH Without MH Induced delay
OptiTrack 127 ms 114 ms 13 ms
Azure Kinect 222 ms 151 ms 71 ms
which received the data of the three BTSs simultaneously. Both Azure Kinects and OptiTrack Motive ran on the same PC (Intel i7 6700K, Nvidia GTX 1080). The communication between the BTSs and MotionHub was conducted via localhost. First, we played a round with only one Azure Kinect and OptiTrack while we covered the second Kinect in order to prove and illustrate for the demo video that this sensor was deactivated. At the beginning of the second round, the OptiTrack player left the tracking area, and another player uncovered the Kinect and joined the game. After the first test, we gathered qualitative data from the players in a non-structured interview.
The second test encompassed measuring the time required for
switching between two BTSs in MotionHub. We utilized our Human
Tetris game to conduct the test and switched between OptiTrack
and Kinect.
In the last test, we recorded the visual response times with a high-speed camera to be able to measure the induced delay. To this end, we recorded the time between a physical movement and the recognition of this movement for each BTS. The test was conducted without MotionHub (visual responses in Motive and the Kinect Body Tracking Viewer) and with MotionHub (visual responses in the MotionHub render window and in the Unity game engine renderer). MotionHub and the Unity game engine were executed on the same PC (Intel i7 6700K, Nvidia RTX 2080) and were connected by a localhost connection via UDP. We used a computer monitor with a refresh rate of 144 Hz, and the camera captured with 1000 frames per second (Sony DSC-RX10M4). OptiTrack and Kinect worked with an update rate of 30 Hz (33.3 ms per frame). Any higher update rate of the OptiTrack system results in a significant decrease of the Kinect’s tracking quality due to infrared light interference with the OptiTrack system. We used Motive:Body v2.0.2 and the Azure Kinect Body Tracking SDK v1.0.1.
4.2 Results
The rst test showed that the concept of MotionHub is applicable.
The players were familiar with both BTSs and reported no unex-
pected behavior of the tracking systems except for a slightly higher
delay of the Azure Kinect.
In the second test, the measured time for switching between two different BTSs (from OptiTrack to Azure Kinect) was 8 seconds.
For the third test, the results of the measurements of the induced delay, conducted with the high-speed camera, can be seen in Table 1. Previous tests had shown significantly higher delays due to coupling the frequency of sending out data packets to the main application loop of the MotionHub. Shorter delays were achieved by independent tracker threads that send out new data as soon as it is available (without waiting for the main application loop, which also draws the user interface and processes user input). This solution is described in more detail in section 3.2.
We assume that the dierence of the induced delay by Motion-
Hub between OptiTrack (13 ms) and Azure Kinect (71 ms) is caused
by dierent refresh rates of the underlying SDKs. OptiTrack Motive
and the thread of NatNetSDK within MotionHub exchange data
with more than 240 Hz (although the cameras only capture with
30 Hz) while the Azure Kinect Body Tracking SDKs exchange data
with 30 Hz.
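This assumption can be made concrete with a back-of-the-envelope model (our own sketch, not a measurement from the paper): data that becomes ready at a random moment within a fixed exchange period waits, on average, half a period before being picked up.

```cpp
#include <cassert>

// Average waiting time in milliseconds for data arriving at a random moment
// within an exchange period: half the period length.
double avgBufferingDelayMs(double exchangeRateHz) {
    return 1000.0 / exchangeRateHz / 2.0;
}
```

At 240 Hz this model predicts roughly 2 ms of average buffering delay, at 30 Hz roughly 16.7 ms per hand-off, which is consistent with the Azure Kinect path accumulating noticeably more delay than the OptiTrack path.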
We did not include the delay between the renderers of MotionHub and Unity in Table 1 because we were not able to visually identify a difference in the camera images. Both renderers showed synchronized output and therefore the same delay. However, in order to make a statement about the network delay between MotionHub and the game engine, we used the timestamps of the network packets to measure the time required for sending and receiving a UDP packet on the same PC (localhost), and it constantly stayed below 1 ms even when MotionHub was sending data.
5 FUTURE WORK
The current status and the concept of MotionHub can be widely extended. For example, a node-based network structure would be possible, in which multiple MotionHub instances with different BTSs could work together simultaneously. Each MotionHub node could stream tracking data to a master node that merges the tracking data into a common coordinate space and performs sensor fusion. Possible applications of such a structure include improving the tracking quality when tracking a single person or realizing the tracking of a large number of people at the same time. In order to accomplish this, a reliable method of calibrating and spatially merging the coordinate systems of different BTSs must be found. While merging different RGB(-D) cameras is a well-studied problem, it remains an open question how to successfully merge BTSs with different technologies such as IMU-based tracking and optical-marker tracking. Future research is required in this regard.
Another feature of the concept of MotionHub could be to add further tracking modules such as face or hand tracking. Many manufacturers and research publications focus on either body, face or hand tracking. With some minor alterations, MotionHub could be extended to include the tracking of these body areas.
Furthermore, the acquisition of data for machine learning could also be realized, because (unified) data of different sensors could be acquired simultaneously within one application. Moreover, this functionality could also be used for benchmarking different systems.
We have presented and evaluated our open-source system Motion-
Hub. It allows for merging tracking data of dierent body tracking
systems into a standardized skeleton, which can be streamed to
clients such as a game engine. The MotionHub induces an addi-
tional delay of 13 ms for a marker-based optical tracking system
(OptiTrack) and 71 ms for a markerless optical tracking system
(Azure Kinect). Our system enables its users to change the tracking
system on the y and without further congurations within the
receiver side. MotionHub has pointed out the feasibility of promis-
ing features to extend and combine the possibilities of available
BTSs. In the future, we intend to further develop those features. We
MOCO ’20, July 15–17, 2020, Jersey City/ Virtual, NJ, USA Ladwig et al.
plan to add a semi-automatic calibration procedure for matching
different coordinate systems between BTSs, and to extend MotionHub
with support for more BTSs such as OpenPTrack [ ] and OpenPose [ ],
among others. We hope to contribute a valuable solution for
the community.
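The added delay reported above (13 ms for OptiTrack, 71 ms for Azure Kinect) can in principle be measured by timestamping each frame when it enters the middleware and when its unified skeleton leaves towards the client. A minimal sketch of that computation, not MotionHub's actual instrumentation:

```python
def added_delay_ms(arrival_ts, departure_ts):
    """Mean middleware-induced delay in milliseconds, given matching
    per-frame timestamps (in seconds) taken when a frame enters the
    middleware and when its unified skeleton is sent to the client."""
    assert len(arrival_ts) == len(departure_ts)
    diffs = [d - a for a, d in zip(arrival_ts, departure_ts)]
    return 1000.0 * sum(diffs) / len(diffs)
```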
We thank the MIREVI Group and the ’Promotionszentrum Angewandte
Informatik’ (PZAI) in Hessen, especially Ralf Dörner. This
project is sponsored by the German Federal Ministry of Education
and Research (BMBF) under the project numbers 16SV8182 and
13FH022IX6. Project names: HIVE-Lab (Health Immersive Virtual
Environment Lab) and Interactive body-near production technology
4.0 (German: ’Interaktive körpernahe Produktionstechnik 4.0’).
Qualisys AB. 2020. Qualisys Track Manager (QTM). (2020). Retrieved February
24, 2020 from
Brekel. 2020. Affordable tools for Motion Capture & Volumetric Video. (2020).
Retrieved February 24, 2020 from
Don Brutzman. 2006. Humanoid Animation (H-Anim). (2006). Retrieved February 29, 2020 from
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2018.
OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.
CoRR abs/1812.08008 (2018). arXiv:1812.08008
The QT Company. 2020. Qt 5. (2020). Retrieved February 12, 2020 from
Valve Corporation. 2020. SteamVR on Steam. (2020). Retrieved February 26,
2020 from
Je Daemen, Jens Herder, Cornelius Koch, Philipp Ladwig, Roman Wiche, and Kai
Wilgen. 2016. Semi-Automatic Camera and Switcher Control for Live Broadcast.
In Proceedings of the ACM International Conference on Interactive Experiences for
TV and Online Video (TVX ’16). Association for Computing Machinery, New York,
NY, USA, 129–134.
Eduardo Damasceno. 2014. A Motion Capture Middleware for Exergames: Increase the Precision with Neural Nets. (06 2014).
E. F. Damasceno, A. Cardoso, and E. A. Lamounier. 2013. An middleware for
motion capture devices applied to virtual rehab. In 2013 IEEE Virtual Reality (VR).
M. Eckert, I. Gómez-Martinho, J. Meneses, and J. F. M. Ortega. 2016. A modular
middleware approach for exergaming. In 2016 IEEE 6th International Conference
on Consumer Electronics - Berlin (ICCE-Berlin). 169–173.
Eigen. 2020. C++ template library for linear algebra. (2020). Retrieved February
21, 2020 from
Rokoko Electronics. 2020. Smartsuit Pro - Motion Capture Suit. (2020). Retrieved
February 24, 2020 from
International Organization for Standardization. 1997. ISO/IEC 14772-1:1997. (1997).
International Organization for Standardization. 2003. ISO/IEC 14772-1:1997/AMD
1:2003. (2003).
International Organization for Standardization. 2006. ISO/IEC 19774:2006. (2006).
International Organization for Standardization. 2019. ISO/IEC 19774-1:2019. (2019).
International Organization for Standardization. 2019. ISO/IEC 19774-2:2019. (2019).
Advanced Realtime Tracking GmbH. 2020. ART Human. (2020). Retrieved
February 24, 2020 from human/
Isaac Gouy. 2020. The Computer Language Benchmarks Game. (2020). Retrieved
February 27, 2020 from which-programs-are-fastest.html
IKinema. 2020. IKinema Orion. (2020). Retrieved February 25, 2020 from
3DiVi Inc. 2020. Nuitrack Full Body Skeletal Tracking Software. (2020). Retrieved
February 24, 2020 from
Kitware Inc. 2020. CMake. (2020). Retrieved February 27, 2020 from https:
Motion Analysis Inc. 2020. Cortex Software. (2020). Retrieved February 24, 2020
NaturalPoint Inc. 2020. OptiTrack Motive:Body. (2020). Retrieved February 24,
2020 from
Reallusion Inc. 2020. iClone Motion LIVE. (2020). Retrieved June 07, 2020 from live-mocap/default.html
The Khronos Group Inc. 2020. OpenXR. (2020). Retrieved February 25, 2020
iPi Soft LLC. 2020. iPi Soft- Markerless Motion Capture. (2020). Retrieved
February 24, 2020 from
Vicon Motion Systems Ltd. 2020. Tracker - Delivery Precise Real-World Data
| Motion Capture Software. (2020). Retrieved February 24, 2020 from https:
Joe Ludwig. 2020. OpenVR. (2020). Retrieved February 25, 2020 from https:
Michael Meredith and Steve Maddock. 2001. Motion Capture File Formats Explained. Production (01 2001).
Microsoft. 2020. Azure Kinect Body Tracking SDK. (2020). Retrieved
February 12, 2020 from dk/body-sdk-download
Microsoft. 2020. Azure Kinect Sensor SDK. (2020). Retrieved February 12, 2020
from SDK
MiddleVR. 2020. MiddleVR SDK. (2020). Retrieved February 10, 2020 from
Matteo Munaro, Filippo Basso, and Emanuele Menegatti. 2016. OpenPTrack.
Robot. Auton. Syst. 75, PB (Jan. 2016), 525–538.
Peter Naur and Brian Randell. 1969. Software Engineering: Report of a Conference
Sponsored by the NATO Science Committee, Garmisch, Germany, 7-11 Oct. 1968,
Brussels, Scientic Aairs Division, NATO.
J. Newman, M. Wagner, M. Bauer, A. MacWilliams, T. Pintaric, D. Beyer, D.
Pustka, F. Strasser, D. Schmalstieg, and G. Klinker. 2004. Ubiquitous tracking for
augmented reality. In Third IEEE and ACM International Symposium on Mixed
and Augmented Reality. 192–201.
Noitom. 2020. Perception Neuron Motion Capture. (2020). Retrieved February
24, 2020 from
Deep Neural Network library. (2019). Retrieved
February 12, 2020 from
Occipital Inc. 2020. OpenNI 2 SDK. (2020). Retrieved February 10, 2020 from
Gerhard Reitmayr and Dieter Schmalstieg. 2005. OpenTracker: A Flexible Software Design for Three-Dimensional Interaction. Virtual Real. 9, 1 (Dec. 2005),
E. A. Suma, B. Lange, A. S. Rizzo, D. M. Krum, and M. Bolas. 2011. FAAST: The
Flexible Action and Articulated Skeleton Toolkit. In 2011 IEEE Virtual Reality
Conference. 247–248.
Russell M. Taylor, Thomas C. Hudson, Adam Seeger, Hans Weber, Jeffrey Juliano,
and Aron T. Helser. 2001. VRPN: A Device-Independent, Network-Transparent
VR Peripheral System. In Proceedings of the ACM Symposium on Virtual Reality
Software and Technology (VRST ’01). Association for Computing Machinery, New
York, NY, USA, 55–61.
Unity Technologies. 2019. Preparing Humanoid Assets for export.
(2019). Retrieved February 7, 2020 from
Unity Technologies. 2020. Unity - Manual: Humanoid Avatars. (2020).
Retrieved February 29, 2020 from
M. Wagner, A. MacWilliams, M. Bauer, G. Klinker, J. Newman, T. Pintaric, and D.
Schmalstieg. 2004. Fundamentals of Ubiquitous Tracking. In Advances in
Pervasive Computing. Austrian Computer Society, 285–290.
Matthew Wright. 2005. Open Sound Control: An Enabling Technology for Musical
Networking. Org. Sound 10, 3 (Dec. 2005), 193–200.
XSens. 2020. MVN Animate. (2020). Retrieved February 24, 2020 from https: