Evaluation of Pose Tracking Accuracy in the
First and Second Generations of Microsoft Kinect
Qifei Wang & Gregorij Kurillo
University of California, Berkeley
Department of EECS
Berkeley, CA 94720, USA
Email: {qifei.wang, gregorij}@eecs.berkeley.edu
Ferda Ofli
Qatar Computing Research Institute
Hamad Bin Khalifa University
Doha, Qatar
Email: fofli@qf.org.qa
Ruzena Bajcsy
University of California, Berkeley
Department of EECS
Berkeley, CA 94720, USA
Email: bajcsy@eecs.berkeley.edu
Abstract—Microsoft Kinect camera and its skeletal tracking
capabilities have been embraced by many researchers and com-
mercial developers in various applications of real-time human
movement analysis. In this paper, we evaluate the accuracy of the
human kinematic motion data in the first and second generation
of the Kinect system, and compare the results with an optical
motion capture system. We collected motion data in 12 exercises
for 10 different subjects and from three different viewpoints.
We report on the accuracy of the joint localization and bone
length estimation of Kinect skeletons in comparison to the motion
capture. We also analyze the distribution of the joint localization
offsets by fitting a mixture of Gaussian and uniform distribution
models to determine the outliers in the Kinect motion data. Our
analysis shows that overall Kinect 2 has more robust and more
accurate tracking of human pose as compared to Kinect 1.
I. INTRODUCTION
Affordable markerless motion capture technology is be-
coming increasingly pervasive in applications of human-
computer and human-machine interaction, entertainment,
healthcare, communication, surveillance and others. Although
the methods for capturing and extracting human pose from
image data have been around for several years, the advances
in sensor technologies (infrared sensors) and computing power
(e.g., GPUs) have facilitated new systems that provide ro-
bust and relatively accurate markerless acquisition of human
movement. An important milestone for wide adoption of these
technologies was the release of Microsoft Kinect camera [1]
for the gaming console Xbox 360 in 2010, followed by
the release of Kinect for Windows with the accompanying
Software Development Kit (SDK) in 2011. The Kinect SDK
for Windows featured real-time full-body tracking of human
limbs based on the algorithm by Shotton et al. [2]. Several
other technology makers followed suit by releasing their
own 3D cameras that focused on capture of human motion for
interactive applications (Xtion by Asus, RealSense by Intel).
Many researchers and commercial developers embraced the
Kinect in a wide range of applications that took advantage of
its real-time 3D acquisition capabilities and provided skeletal
tracking, such as in physical therapy and rehabilitation [3], fall
detection [4] and exercise in elderly [5], [6], ergonomics [7],
[8] and anthropometry [9], computer vision [10], and many
others. In 2013 the second generation of the Kinect camera
was released as part of the Xbox One gaming console. In 2014
a standalone version of Kinect for Windows (k4w) was offi-
cially released featuring a wider camera angle, higher resolution
of the depth and color images, improved skeletal tracking, and
detection of facial expressions. (This research was supported by
the National Science Foundation (NSF) under Grant No. 1111965.)
In this paper we focus on the evaluation of accuracy and
performance of skeletal tracking in the two Kinect systems
(referred to as Kinect 1 and Kinect 2 in the remainder of this
paper) compared to a marker-based motion capture system.
Several publications have previously addressed the accuracy
of the skeletal tracking of Kinect 1 for various applications;
however, the accuracy of Kinect 2 has been reported only
to a limited extent in the research literature. Furthermore,
concurrent comparison of the two systems has not yet been
published to the best of our knowledge. Although both Kinect
systems employ similar methodology for human body segmen-
tation and tracking based on the depth data, the underlying
technology for acquisition of the depth differs between the
two. We report the accuracy rates of the skeletal tracking
and the corresponding error distributions in a set of exercise
motions that include standing and sitting body configurations.
Such an extensive performance assessment of the technology
is intended to assist researchers who rely on Kinect as a
measurement device in the studies of human motion.
II. RELATED WORK
In this section we review several publications related to
evaluation of the Kinect systems. Kinect 1 has been extensively
investigated in terms of 3D depth map acquisition as well as
body tracking accuracy for various applications. Khoshelham
and Elberink [11] examined the accuracy of depth acquisition
in Kinect 1, and found that the depth error ranges from a few
millimeters up to about 4 cm at the maximum range. They
recommended that the data for mapping applications should be
acquired within 1-3 m distance. Smisek et al. [12] proposed
a geometrical model and calibration method to improve the
accuracy of Kinect 1 for 3D measurements. Kinect 1 and
Kinect 2 were jointly evaluated by Gonzalez-Jorge et al. [13]
who reported that the precision of both systems is similar
(about 8 mm) in the range of under 1 m, while Kinect 2
outperforms Kinect 1 at the range of 2 m with the error values
of up to 25 mm. They also reported that precision of Kinect 1
decreases rapidly following a second order polynomial, while
Kinect 2 exhibits a more stable behavior inside its work range.
3D accuracy of Kinect 2 was recently evaluated by Yang et
al. [14] who reported on the spatial distribution of the depth
accuracy with regard to the vertical and horizontal displacement.
Skeletal tracking of Kinect was examined primarily in the
context of biomechanical and exercise performance analyses.
In this review, we limit ourselves only to the evaluations
of skeletal tracking based on the official Microsoft Kinect
SDK. Obdrˇ
z´
alek et al. [15] performed accuracy and robustness
analysis of the Kinect skeletal tracking in six exercises for
elderly population. Their paper reports on the error bounds for
particular joints obtained from the comparison with an optical
motion capture system. The authors conclude that employing a
more anthropometric kinematic model with fixed limb lengths
could improve the performance. Clark et al. [16] examined
the clinical feasibility of using Kinect for postural control
assessment. The evaluation with Kinect and motion capture
performed in 20 healthy subjects included three postural
tests: forward reach, lateral reach, and single-leg eyes-closed
standing balance assessment. The authors found high inter-
trial reliability and excellent concurrent validity for the
majority of the measurements. The study, however, revealed
the presence of proportional biases for some of the outcome
measures, in particular for sternum and pelvis evaluations. The authors
proposed the use of calibration equations that could potentially
correct for such biases. Several other works have examined
the body tracking accuracy for specific applications in phys-
ical therapy, such as for example upper extremity function
evaluation [17], assessment of balance disorders [18], full-
body functional assessment [19], and movement analysis in
Parkinson’s disease [20]. Plantard et al. [21] performed an
extensive evaluation of Kinect 1 skeletal tracking accuracy
for ergonomic assessment. By using a virtual mannequin,
they generated a synthetic depth map that was input into the
Kinect SDK algorithm to predict potential accuracy of joint
locations in a large number of skeletal configurations and
camera positions. The simulation results were validated by a
small number of real experiments. The authors concluded that
the kinematic information obtained by the Kinect is generally
accurate enough for ergonomic assessment.
To the best of our knowledge, publication by Xu and
McGorry [22] is to date the only work that reported on the
evaluation of Kinect 2 skeletal tracking alongside an optical
motion capture system. In their study the authors examined 8
standing and 8 sitting static poses of daily activities. Similar
poses were also captured with Kinect 1; however, the two
data sets were not obtained concurrently. The authors reported
that the average static error across all the participants and
all Kinect-identified joint centers was 76 mm for Kinect 1
and 87 mm for Kinect 2. They further concluded that there
was no significant difference between the two Kinects. This
conclusion, however, is of limited validity as the comparison
was done indirectly with two different sets of subjects.
Since the Kinect 1 system is being replaced by the Kinect 2
system in many applications, it is important to evaluate the
performance of the new camera and software for tracking of
dynamic human activities. This is especially relevant since the
depth estimation in the two systems is based on two different
physical principles. Side-by-side comparison can thus provide
a better understanding of the performance improvements as
well as potential drawbacks. In this paper, we report on
the experimental evaluation of the joint tracking accuracy of
the two Kinects in comparison to an optical motion capture
system. We analyze the results for 12 different activities that
include standing and sitting poses as well as slow and fast
movements. Furthermore, we examine the performance of pose
estimation with respect to three different horizontal orientation
angles. We provide the error bounds for joint positions and
extracted limb lengths for both systems. We also analyze the
distribution of joint localization errors by fitting a mixture of
Gaussian and uniform distribution models to determine the
outliers in the motion data.
III. ACQUISITION SYSTEMS
In this section we provide more details on the experimental
setup and a brief overview of the technology behind each
Kinect system. For the experimental evaluation, the movements
were simultaneously captured by Kinect 1, Kinect 2, and a
marker-based optical motion capture system which served as a
baseline. The two Kinects were secured together and mounted
on a tripod at the height of about 1.5 m. All three systems
were geometrically calibrated and synchronized prior to the
data collection using the procedure described below.
A. Motion Capture System (MoCap)
The motion capture data were acquired using the Impulse X2
system (PhaseSpace, San Leandro, CA, USA) with 8 infrared
stereo cameras. The cameras were positioned around the
capture space of about 4 m by 4 m. The system provides
3D position of LED markers with sub-millimeter accuracy and
frequency of up to 960 Hz. A capture rate of 480 Hz was selected
for this study. For each subject, 43 markers were attached to
a motion capture suit at standard body landmarks using velcro.
To obtain the skeleton from the marker data, a rigid
kinematic structure was dynamically fitted into the 3D point
cloud. We used PhaseSpace Recap2 software to obtain the
skeleton for each subject based on collected calibration data
which consisted of a sequence of individual joint rotations. The
built-in algorithm determines the length of the body segments
based on the set of markers associated with different parts
of the body and generates a skeleton with 29 joint positions.
Once an individual’s kinematic model is calibrated, the skeletal
sequence can be extracted for any motion of that person.
B. Kinect 1
The Kinect 1 sensor features acquisition rates of up to 30 Hz
for the color and depth data with resolutions of 640 × 480
pixels and 320 × 240 pixels, respectively. The depth data
are obtained using a structured light approach, where a pseudo-
random infrared dot pattern is projected onto the scene while
being captured by an infrared camera. Stereo triangulation is
used to obtain 3D position of the points from their projections.
This approach provides a robust 3D reconstruction even in low-
light conditions. The depth accuracy decreases with the
square of the distance, with typical errors ranging from about
1-4 cm at distances of 1-4 m [12]. To obtain a dense depth
map, surface interpolation is applied based on the acquired
depth values at the data points. The fixed density of the points
limits the accuracy when moving away from the camera as
the points become sparser. The boundaries of surfaces in the
distance are thus often jagged.
Real-time skeletal tracking provided by the Microsoft
Kinect SDK is based on the depth data, using a body part
estimation algorithm based on random decision forests proposed
by Shotton et al. [2]. The algorithm estimates candidate body
parts based on a large training set of synthetically generated
depth images of humans of many different shapes in various
poses derived from a motion capture database [1]. The Kinect 1
SDK can track up to two users, providing the 3D location of
20 joints for each tracked skeleton.
C. Kinect 2
The Kinect 2 sensor features high definition color (1920 × 1080
pixels) and higher resolution depth data (512 × 424 pixels) as
compared to Kinect 1. The depth acquisition is based on the
time-of-flight (ToF) principle where the distance to points on
the surface is measured by computing the phase shift
of modulated infrared light. The intensity of the captured
image is thus proportional to the distance of the points in 3D
space. The ToF technology as opposed to the structured light
inherently provides a dense depth map; however, the results can
suffer from various artifacts caused by the reflections of light
signal from the scene geometry and the reflectance properties
of observed materials. The depth accuracy of Kinect 2 is
relatively constant within a specific capture volume; however, it
depends on the vertical and horizontal displacement as the light
pulses are scattered away from the center of the camera [14].
The reported average depth accuracy is under 2 mm in the
central viewing cone and increases to 2-4 mm in the range of
up to 3.5 m. The maximal range captured by Kinect 2 is 4.0
m where the average error typically increases beyond 4 mm.
The skeletal tracking method implemented in Kinect SDK
v2.0 has not been fully disclosed by Microsoft; however, it
appears to follow a similar methodology to Kinect 1 while
taking advantage of GPU computation to reduce the latency
and to improve the performance. The Kinect SDK v2.0 features
skeletal tracking of up to 6 users with 3D locations of 25 joints
for each skeleton. In comparison to Kinect 1, the skeleton
includes additional joints at the hand tip, thumb tip and neck.
The arrangement of the joints, i.e. the kinematic structure of
the model, is comparable to a standard motion capture skeleton.
Kinect 2 includes some additional features, such as detection
of hand opening/closing and tracking of facial features.
D. Calibration and Data Acquisition
For the capture of the database, we connected the two
Kinect cameras to a single PC running Windows 8.1, with
Kinect 1 connected via USB 2.0 and Kinect 2 connected via
USB 3.0 on a separate PCI bus. This arrangement allowed
for both sensors to capture at the full frame rate of 30 Hz.
The skeletal data for both cameras were extracted in real time
via Microsoft Kinect SDK v1.8 and Kinect for Windows SDK
v2.0 for Kinect 1 and Kinect 2, respectively.
The temporal synchronization of the captured data was
performed using Network Time Protocol (NTP). The motion
capture server provided the time stamps for the Kinect PC
over the local area network. Meinberg NTP Client Software
(Meinberg Radio Clocks GmbH, Bad Pyrmont, Germany) was
installed on the Windows computer to obtain more precise
clock synchronization.
Prior to the data acquisition, we first calibrated the motion
capture system using the provided calibration software. The coor-
dinate frames of the Kinect cameras were then aligned with
the motion capture coordinates using the following procedure.
A planar checkerboard with three motion capture markers
attached to corners of the board was placed in three different
positions in front of the Kinects. In each configuration, marker
position, color and depth data were recorded. Next, the 3D
positions of the corners were extracted from the depth data
using the intrinsic parameters of the Kinect and corresponding
depth pixel values. Finally, a rigid transformation matrix
that maps 3D data captured by each Kinect into the motion
capture coordinate system was determined by minimizing the
squared distance between the Kinect acquired points and the
corresponding marker locations.
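To make this alignment step concrete, the following sketch estimates such a rigid transformation from paired 3D points with the standard Kabsch/SVD method; the paper does not specify its exact solver, and all names and array shapes here are illustrative.

```python
# Sketch of the extrinsic calibration step: estimate the rotation R and
# translation t that map Kinect points onto motion capture points by
# minimizing the summed squared distance (Kabsch/SVD method).
import numpy as np

def estimate_rigid_transform(kinect_pts, mocap_pts):
    """kinect_pts, mocap_pts: (N, 3) arrays of corresponding 3D points."""
    ck, cm = kinect_pts.mean(axis=0), mocap_pts.mean(axis=0)  # centroids
    H = (kinect_pts - ck).T @ (mocap_pts - cm)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                    # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cm - R @ ck
    return R, t                                               # mocap_pt ≈ R @ kinect_pt + t
```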
Fig. 1. Three skeletons captured by Kinect 1, Kinect 2, and motion capture
(extracted via Recap2 software) after geometric and temporal alignment.
E. Data Processing
The collected marker data were first processed in Recap2 to
obtain the skeletal sequence for each subject, and then exported
to BVH file format. The rest of the analysis was performed
in MATLAB (MathWorks, Natick, MA). First, the skeletal
sequences from the Kinect cameras were mapped to the
motion capture coordinate space using the rigid transformation
obtained from the calibration. Next, we aligned the sequences
using the time stamps, and re-sampled all the data points to
the time stamps of Kinect 2 in order to compare the joint
localization at the same time instances. Fig. 1 demonstrates the
three skeletal configurations projected into the motion capture
coordinate space after the calibration.
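A minimal sketch of the resampling and joint-comparison step is shown below. It assumes both sequences carry synchronized timestamps and uses per-axis linear interpolation; the paper does not state its interpolation scheme, so this is one plausible choice, and the names are illustrative.

```python
import numpy as np

def resample_to(ts_src, joints_src, ts_ref):
    """Interpolate a (T, J, 3) joint sequence onto the reference timestamps."""
    _, J, _ = joints_src.shape
    out = np.empty((len(ts_ref), J, 3))
    for j in range(J):
        for axis in range(3):  # linear interpolation, one coordinate at a time
            out[:, j, axis] = np.interp(ts_ref, ts_src, joints_src[:, j, axis])
    return out

def joint_offsets(joints_kinect, joints_mocap):
    """Per-frame Euclidean distance between corresponding joints, shape (T, J)."""
    return np.linalg.norm(joints_kinect - joints_mocap, axis=-1)
```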
After the spatial transformation and temporal alignment, we
obtained three sequences of 3D joint positions for Kinect 1,
Kinect 2, and motion capture. Since the three skeletal config-
urations have different arrangements and number of joints, we
selected 20 joints that are common to all the three systems.
The remaining joints were ignored in this analysis. Next,
we evaluated the position accuracy by calculating the distance
between the corresponding joints in each time frame. When the
Kinect skeletal tracking loses track of the body parts for certain
joints (e.g. due to occlusions), such frames can be flagged
as outliers. Since the outlier samples can take arbitrary
values, we model them with a uniform distribution. The
distribution of the valid (on-
track) data samples is on the other hand modeled by a Gaussian
distribution with the mean representing the average offset of
that joint. The overall distribution of the joint offset data, p(θ),
can thus be modeled by a mixture model of Gaussian and
uniform distributions as follows:
p(θ) = ρ × N(µ, σ) + (1 − ρ) × U(x1, x2).    (1)
In equation (1), µ and σ denote the mean and standard deviation
of the Gaussian distribution N, and x1 and x2 denote the bounds
of the uniform distribution U. ρ denotes the weight of the
Gaussian component. In this paper, we use the
maximum-likelihood method to estimate these parameters with
the input data samples. After estimating the mixture model, the
data are classified into either on-track or off-track state. The
off-track data are then excluded from the accuracy evaluation.
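The sketch below fits the mixture in equation (1) with a simple EM loop, one common way to carry out the maximum-likelihood estimation; the uniform support spanning the observed offsets and the 0.5 classification threshold are our assumptions, not details given in the paper.

```python
import numpy as np

def fit_gauss_uniform(offsets, n_iter=100):
    """Fit p(x) = rho*N(mu, sigma) + (1 - rho)*U(x1, x2) to joint offsets
    via EM and classify each sample as on-track or off-track (outlier)."""
    x = np.asarray(offsets, dtype=float)
    x1, x2 = x.min(), x.max()                    # assumed uniform support
    u = 1.0 / (x2 - x1)                          # uniform density
    mu, sigma, rho = np.median(x), x.std(), 0.9  # crude initialization
    for _ in range(n_iter):
        gauss = rho * np.exp(-0.5 * ((x - mu) / sigma) ** 2) \
                / (sigma * np.sqrt(2 * np.pi))
        resp = gauss / (gauss + (1.0 - rho) * u)  # E-step: P(on-track | x)
        mu = np.sum(resp * x) / np.sum(resp)      # M-step: weighted Gaussian fit
        sigma = np.sqrt(np.sum(resp * (x - mu) ** 2) / np.sum(resp)) + 1e-9
        rho = resp.mean()
    on_track = resp > 0.5                         # illustrative threshold
    return (mu, sigma, rho), on_track
```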
Another important parameter for the accuracy assessment
of human pose tracking is the variability of the limb lengths.
The human skeleton is typically modeled as a kinematic
chain with rigid body segments. The Kinect skeletal tracking,
however, does not explicitly constrain the length of body
segments. In this paper, we thus report on the variability of
the bone lengths by calculating the distance between two
end-joints of each bone for the Kinect systems. For motion
capture, the bone length is extracted from the segment length
parameters in the BVH file.
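As a brief illustration of the bone-length computation, the following sketch derives per-bone mean lengths and SDs from a joint sequence; the bone list is a hypothetical subset of the parent-child joint pairs in Fig. 4.

```python
import numpy as np

# Hypothetical (parent, child) joint-index pairs, e.g. upper/lower arm bones.
BONES = [(4, 5), (5, 6), (8, 9), (9, 10)]

def bone_length_stats(joints, bones=BONES):
    """joints: (T, J, 3) joint positions. Returns per-bone mean length and SD;
    the SD over time serves as a measure of kinematic-model robustness."""
    lengths = np.stack([np.linalg.norm(joints[:, c] - joints[:, p], axis=1)
                        for p, c in bones], axis=1)  # (T, n_bones)
    return lengths.mean(axis=0), lengths.std(axis=0)
```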
IV. EXPERIMENTS
In this section we describe the experimental protocol for
the data accuracy evaluation. As described in Section III, the
motion data were captured by the setup consisting of Kinect 1,
Kinect 2, and the motion capture system. We selected 12
different exercises (Table I, Fig. 2), consisting of six sitting
(and sit-to-stand) exercises and six standing exercises. In the
first set of exercises, subjects were interacting with the chair,
while no props were used in the second set. We analyze the
two sets of exercises separately.
TABLE I. LIST OF EXERCISES.
Name Pose Description
1. Shallow Squats Sitting Stand-to-sit movements without sitting.
2. Chair Stands Sitting Sit-to-stand movements.
3. Buddha’s Prayer Sitting Vertical hand lifts with palms together.
4. Cops & Robbers Sitting Shoulder rotation and forward arm extension.
5. Abs in, Knee Lifts Sitting Alternating knee lifts.
6. Lateral Stepping Sitting Alternating front and side stepping.
7. Pickup & Throw Standing Step forward, pick up off the floor and throw.
8. Jogging Standing Running in place.
9. Clapping Standing Wide hand clapping while standing still.
10. Punching Standing Alternating forward punching.
11. Line Stepping Standing Alternating forward foot tapping.
12. Pendulum Standing Alternating leg adduction.
We captured the motion data from 10 subjects (mean age: 27).
Before starting the recording, each subject was instructed on
how to perform the exercise via a video. We first recorded
the motion capture calibration sequence for the subsequent
skeleton fitting. Each exercise recording consisted of five
repetitions, except for the Jogging which required subjects to
perform ten jogging steps. The recording of the 12 exercises
was repeated for three different orientation angles of the
subjects with respect to the Kinect cameras, i.e. at 0° with
the subject facing the cameras and at 30° and 60° with the
subject rotated to the left of the cameras. Figs. 2 and 3 show the video
snapshots and the corresponding motion capture skeletons of
the key poses for the 12 exercises for one of the subjects.
After the data acquisition, the joint coordinates of Kinect 1
and Kinect 2 were transformed into the global coordinate
system of the motion capture. Additionally, the temporal data
were synchronized according to the time stamp of the sequence
captured by Kinect 2, as described in Section III-D.
For the analysis of joint position accuracy, we selected 20
joints that are common between the three systems. These joints
and their abbreviated names are shown in Fig. 4. In addition
to the joint position accuracy, we also evaluated the accuracy
of the bone lengths for the upper and lower extremities. Those
bones and their abbreviated names are also shown in Fig. 4.
Fig. 4. Diagram of the 20-joint skeleton with labeled joint and bone segments
that are used in the analysis (L-left, R-right, UP-upper, LO-lower).
V. RESULTS AND DISCUSSION
In this section, we present detailed analysis of the pose
tracking accuracy in Kinect 1 and Kinect 2 in comparison to
the motion capture system which we use as a baseline. All the
reported results are the average values across all the subjects.
The values for the sitting or standing pose represent the mean
values across all the exercises in the corresponding set.
A. Joint Position Accuracy
Tables II and III summarize the mean offsets for all the
joints in the sitting and standing sets of exercises in three
different viewpoint directions. The mean offset represents the
average distance between the corresponding joint position of
Kinect 1 or Kinect 2 as compared to the location identified
from the motion capture data.
In the sitting set of exercises (Table II), the majority of
the mean joint offsets range between 50 mm and 100 mm
for both Kinect systems. The largest offset in Kinect 1 is
consistently observed in the pelvic area which includes the
following three joints: ROOT, HIP L, and HIP R. Kinect 2 on
the other hand has smaller offsets for these particular joints.
In Kinect 2, the largest offsets are observed in the following
four joints of the lower extremities: ANK L, ANK R, FOO L,
and FOO R. These joints typically have a large vertical offset
from the ground plane, while the same is not observed in
Kinect 1. Similar observations can be made in the standing set
of exercises (Table III) where the largest offsets in Kinect 1
are again in the pelvic area and the largest offsets in Kinect 2
are found in the lower extremities. These observations are also
clearly visible in Fig. 1.
(a) Shallow Squats (b) Chair Stands (c) Buddha’s Prayer (d) Cops & Robbers (e) Abs in, Knee Lifts (f) Lateral Stepping
(g) Pickup & Throw (h) Jogging (i) Clapping (j) Punching (k) Line Stepping (l) Pendulum
Fig. 2. Video snapshots of the 12 exercises. The first set (a-f) consisted of seated exercises, while the second set (g-l) consisted of standing exercises.
(a) Shallow Squats (b) Chair Stands (c) Buddha’s Prayer (d) Cops & Robbers (e) Abs in, Knee Lifts (f) Lateral Stepping
(g) Pickup & Throw (h) Jogging (i) Clapping (j) Punching (k) Line Stepping (l) Pendulum
Fig. 3. Motion capture skeleton of the key poses for the 12 exercises shown in Fig. 2.
Tables II and III also summarize the standard deviation
(SD) of the joint position offsets which reflects the variability
of a particular joint tracking. For most of the joints, the SD
ranges between 10 mm and 50 mm. The joints that exhibit
considerable motion during an exercise have much higher
variability, and thus SD, typically greater than 50 mm. In
most cases, the SDs of the joint positions in Kinect 2 are
considerably smaller than those in Kinect 1. This is most
likely due to an increased resolution and reduced noise level
of Kinect 2 depth maps.
Furthermore, we can observe that the mean offset and SD
of the joints that are most active in a particular exercise
both increase with the viewpoint angle. This is especially
noticeable on the side of the skeleton that is turned further
away from the camera as the occlusion of joints increases the
uncertainty of the pose detection. In our experiments, the left
side of the skeleton was turning away from the camera with
the increasing viewpoint angle.
In order to examine whether Kinect 2 is performing better
than Kinect 1, we performed statistical analysis of the accuracy
for each of the 20 joints in comparison to the motion capture.
We used a pair-wise t-test for the joint position analysis. Our
hypothesis was that the position of a particular joint is not
significantly different between a Kinect and the motion cap-
ture. The results of the analysis are shown in Tables II and III
where the joints with a significant difference are denoted with
the * symbol when p < 0.05 and the ** symbol when p < 0.01,
respectively.
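As a sketch of this test, SciPy's paired t-test can be applied per joint; the pairing unit (per-subject mean offsets for Kinect 1 vs. Kinect 2) is our assumption, as the paper does not spell it out, and the names are illustrative.

```python
from scipy.stats import ttest_rel

def joint_significance(offsets_k1, offsets_k2, alpha=0.05):
    """offsets_k1, offsets_k2: paired per-subject mean offsets for one joint.
    Returns the t statistic, p-value, and whether the difference is significant."""
    t, p = ttest_rel(offsets_k1, offsets_k2)
    return t, p, p < alpha
```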
The results of the t-test analyses show that the joint position
offsets of the joints ROOT, SPINE, HIP L, HIP R, ANK L,
Fig. 5. Mean offset and SD of skeletal joints for the exercise Cops &
Robbers (top row: Kinect 1, bottom row: Kinect 2) as captured at three
different viewpoint angles.
Fig. 6. Mean offset and SD of skeletal joints for the exercise Lateral Stepping
(top row: Kinect 1, bottom row: Kinect 2) as captured at three different
viewpoint angles.
Fig. 7. Mean offset and SD of skeletal joints for the exercise Jogging (top
row: Kinect 1, bottom row: Kinect 2) as captured at three different viewpoint
angles.
Fig. 8. Mean offset and SD of skeletal joints for the exercise Punching (top
row: Kinect 1, bottom row: Kinect 2) as captured at three different viewpoint
angles.
ANK R, FOO L, and FOO R for Kinect 2 have significantly
different mean offsets as compared to Kinect 1. The mean joint
position offsets of the other joints were not significantly different
between the two systems.
Similar conclusions can be drawn for the standing set of
exercises (Table III). For example, the SDs of the joint position
offset in Kinect 2 are usually smaller than those of Kinect 1.
The variances in the more active joints typically increase
with the viewpoint angle. Statistically significant differences
in the accuracy of Kinect 1 vs. Kinect 2 can be found in
the following joints: ROOT, SPINE, HIP L, HIP R, ANK L,
ANK R, FOO L, and FOO R. Overall, the means and SDs of
the joint position offsets in the standing poses are usually larger
than those in the sitting poses. The sitting poses involve a
higher number of static joints, which in general have smaller
variability.
Figs. 5, 6, 7, and 8 demonstrate the means and SDs of
the joint position offsets for the exercises Cops & Robbers,
Lateral Stepping, Jogging, and Punching, respectively. In the
figures, the skeleton in magenta represents one of the key
poses in the exercise sequence as captured by the motion
capture system. The blue or black lines on the other hand
represent the corresponding skeletons generated from the mean
joint position offsets that were observed in either Kinect 1
or Kinect 2, respectively. The ellipsoids at each joint denote
the SDs in the 3D space, analyzed for each axis of the local
coordinate system attached to the corresponding segment. A
larger ellipsoid indicates a larger SD of the joint position
in Kinect compared with the joint position captured
TABLE II. JOINT POSITION OFFSETS IN SITTING EXERCISES.

                   Kinect 1                     Kinect 2
          Mean (mm)      SD (mm)        Mean (mm)      SD (mm)
Joint     0°   30°  60°  0°  30°  60°   0°   30°  60°  0°  30°  60°
ROOT**    256  262  263  25  20   25    100  102  106  17  18   16
SPINE      91   97  100  24  20   21    110  117  126  13  14   11
NECK       79   65   62  25  21   23     84   78   73  14  16   14
HEAD       74   70   67  26  21   21     50   51   50  13  15   12
SHO L      90   89   97  26  24   33     76   76   82  16  24   29
ELB L      81   86   98  27  29   35     87  103  114  17  28   25
WRI L      76   90  118  33  48   55     59   84  115  25  53   44
HAN L      85  106  134  41  60   68     64   95  125  31  60   53
SHO R      78   74   69  28  24   23     80   83   78  17  20   17
ELB R      95   93   89  28  38   34     88   77   70  21  25   19
WRI R      64   93  110  30  53   65     61   64   71  25  28   24
HAN R      83  113  130  35  65   80     74   71   75  24  28   26
HIP L**   188  200  215  25  20   22    115  122  139  16  17   15
KNE L      96   95   93  24  24   27     76   95  114  16  18   25
ANK L*     54   77   81  17  25   32     93  103  113  16  18   29
FOO L*     66   74   86  19  26   38     93  105  119  18  25   39
HIP R**   207  210  211  25  21   24    133  128  132  15  18   16
KNE R     123  118  128  23  23   34    120  117  119  15  17   24
ANK R*     67   75   91  17  21   27    112  122  132  15  19   27
FOO R*     67   74   84  17  20   26    126  134  139  19  20   28
* t-test, p < 0.05; ** t-test, p < 0.01
TABLE III. JOINT POSITION OFFSETS IN STANDING EXERCISES.

                   Kinect 1                     Kinect 2
          Mean (mm)      SD (mm)        Mean (mm)      SD (mm)
Joint     0°   30°  60°  0°  30°  60°   0°   30°  60°  0°  30°  60°
ROOT**    245  256  267  23  25   25     76   81   93  24  19   18
SPINE*     79   89  102  22  22   24    112  129  144  17  16   17
NECK       82   91  102  23  24   23    113  129  143  18  16   16
HEAD       89   84   87  30  29   26     79   82   90  26  22   20
SHO L      76   76   80  33  36   38     68   72   78  29  31   31
ELB L      98  112  134  46  52   66    112  137  159  37  41   57
WRI L      85   93  110  56  57   71     66   84  111  47  52   73
HAN L      85   96  114  62  65   80     77   94  121  49  55   80
SHO R      76   76   74  30  30   26     96  106  109  26  24   22
ELB R      98   89   80  38  39   34     96   91   80  33  32   31
WRI R      80   82   80  43  43   38     78   88   87  33  32   31
HAN R      85   83   77  46  49   44     81   79   74  33  34   32
HIP L**   186  205  228  28  27   27    106  125  150  23  19   21
KNE L     101  111  124  33  38   52    107  129  148  26  34   51
ANK L*    119  144  135  43  61   58    135  168  174  33  51   58
FOO L*    115  135  122  42  60   59    130  170  180  33  56   63
HIP R**   188  195  207  25  24   23    103  104  109  25  17   17
KNE R     105  102  101  31  29   26    115  119  118  26  25   27
ANK R*    111  113  108  38  39   28    146  158  157  32  33   35
FOO R*     97   93   86  36  37   29    146  157  150  32  33   38
* t-test, p < 0.05; ** t-test, p < 0.01
by the motion capture.
Such visualization of results provides a quick and intuitive
way for comparison of accuracy of different joints. The results
show that the overall SDs are larger in Kinect 1 as compared
to Kinect 2. The variability of offsets also increases with
the viewpoint angle. In more dynamic movements, such as
Jogging (Fig. 7) and Punching (Fig. 8), the end-points, such
as feet and hands, have considerably larger SD with increasing
viewpoint angle. Finally, we can observe that certain joints
have consistently large offsets from the baseline skeleton,
such as ROOT, HIP L, and HIP R in Kinect 1 and ANK L,
ANK R, FOO L, and FOO R in Kinect 2.
The joint position offsets in general depend on various
sources of error, such as systematic errors (e.g. offset of hips
in Kinect 1), noise from depth computation, occlusions, loss
of tracking, etc. We further analyze the error distribution
to discriminate between the random errors and the errors
due to tracking loss. We expect that the random errors follow
a Gaussian distribution, while the errors due to tracking loss
can be treated as outliers belonging to a uniform
distribution. As an example, we show the histogram of the
joint position offsets for the right elbow and right knee as
captured in the exercises Cops & Robbers and Jogging from
different viewpoint angles (Figs. 9 and 10, respectively). These
two joints are among the more active joints in these two
exercises. The histograms support our assumption about
the error distribution: the joint position offsets are mainly
concentrated close to zero with a long tail to the right side. In
order to determine the outliers in the tracking data, we use
a mixture model of a Gaussian distribution and a uniform
distribution to approximate the distribution of the joint position
offsets, as defined in equation (1). Fig. 11 demonstrates the
distribution fitting results for the right elbow in the exercise
Cops & Robbers. The results show the mixture model of
the Gaussian and uniform distributions overlaid on the data
histograms.
After applying the mixture model fitting for each joint
independently, we can classify the data into either on-track
state or off-track state. Table IV shows
Fig. 9. Distribution of joint position offsets for the right elbow in the exercise
Cops & Robbers.
Fig. 10. Distribution of joint position offsets for the right knee in the exercise
Jogging.
Fig. 11. Mixture model fitting into the distribution of joint position offsets
for the right elbow in the exercise Cops & Robbers (viewpoint: 60°), shown
for Kinect 1 and Kinect 2.
the average on-track ratio for each joint, defined as the ratio
between the number of on-track samples and the total number
of frames. The results show that for most joints the on-track
ratio is above 90%. For the frontal view, the on-track ratios of
all the joints are relatively similar. At the viewpoints of 30° and
60°, the active joints which are further away from the camera
typically have lower ratios than the joints closer to the camera.
If all the joints in a frame are on-track, that frame is marked
as a valid frame for the data accuracy evaluation. The last row
TABLE IV. ON-TRACK RATIOS FOR KINECT 1 AND KINECT 2.

                Sitting (%)                Standing (%)
          Kinect 1      Kinect 2      Kinect 1      Kinect 2
Joint     0°  30° 60°   0°  30° 60°   0°  30° 60°   0°  30° 60°
ROOT      95  97  96    99  98  100   93  97  99    96  99  99
SPINE     96  98  98    99  99  100   95  100 99    97  99  99
NECK      96  98  99    99  99  99    98  98  98    97  99  99
HEAD      96  99  99    100 99  99    97  98  99    97  99  99
SHO L     96  98  99    99  97  95    97  96  95    97  97  97
ELB L     98  98  96    99  97  97    96  94  92    97  96  92
WRI L     97  93  90    99  91  96    97  95  91    97  95  96
HAN L     97  93  92    97  89  93    96  93  89    97  96  95
SHO R     97  97  99    99  98  99    98  98  97    98  99  97
ELB R     98  94  97    99  98  99    97  95  98    98  99  99
WRI R     98  93  91    99  99  99    99  99  99    98  99  99
HAN R     97  92  88    99  98  99    98  99  98    98  99  100
HIP L     96  97  99    99  98  99    95  99  98    95  98  99
KNE L     96  99  96    98  98  97    97  95  87    97  96  91
ANK L     99  98  97    98  98  94    94  93  90    96  91  96
FOO L     99  98  98    97  94  96    95  89  91    96  94  96
HIP R     94  93  98    99  96  100   94  98  99    98  99  100
KNE R     97  96  97    98  97  94    97  98  96    97  98  98
ANK R     99  96  99    97  99  98    94  95  95    97  96  98
FOO R     99  99  98    96  96  95    96  97  95    97  97  99
Overall   92  81  75    94  84  81    90  79  73    92  82  80
in Table IV summarizes the percentage of valid frames. The
percentage of valid frames is typically higher for Kinect 2 than
Kinect 1. Furthermore, the percentages of valid frames at the
viewpoints of 30° and 60° drop by 10% and 15% compared
to those in the frontal view, respectively. Finally, in Tables V
and VI we show the mean and SD of the joint position offsets
after the removal of outliers.
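Given the per-joint on-track flags produced by the mixture-model classification, the ratios in Table IV reduce to simple averages, as in this illustrative sketch (array names are our own):

```python
import numpy as np

def track_statistics(on_track):
    """on_track: (T, J) boolean array, True where a joint is classified on-track.
    Returns per-joint on-track ratios and the fraction of valid frames
    (frames in which every joint is on-track)."""
    joint_ratios = on_track.mean(axis=0)
    valid_frame_ratio = on_track.all(axis=1).mean()
    return joint_ratios, valid_frame_ratio
```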
Compared with the results in Tables II and III, both the
mean and SD of most joints in Tables V and VI are reduced
since the outliers are excluded from the analysis. Table VII
summarizes the average reduction of the mean and SD of
the joint position offsets after excluding the outliers. The
results demonstrate that the data accuracy can be significantly
improved by classifying the data with the mixture model.
TABLE VII. AVERAGE REDUCTION OF THE JOINT POSITION OFFSETS AFTER EXCLUDING OUTLIERS.

                Kinect 1                  Kinect 2
         Mean (%)     SD (%)       Mean (%)     SD (%)
         0°  30° 60°  0°  30° 60°  0°  30° 60°  0°  30° 60°
Sitting  5   8   6    35  38  35   2   5   3    23  36  23
Standing 8   7   8    39  36  33   8   5   6    49  46  43
B. Bone Length Accuracy
Another important parameter for evaluation of Kinect pose
tracking performance is the bone length. As mentioned pre-
viously, the Kinect tracking algorithm does not specifically
pre-define or calibrate for the anthropometric values of the
body/bone segments. On the other hand, the human skeleton
can be approximated as a kinematic structure with rigid
segments, so we expect that the bone lengths should stay
relatively constant. The size of the variance (and SD) of the
bone length over time can thus be interpreted as a measure of
robustness of the extracted kinematic model.
For the Kinect skeleton, we define the bone length as the
l2 (Euclidean) distance between the positions of two adjacent joints.
The bone length for the motion capture is on the other hand
determined during the calibration phase and remains constant
during the motion sequence. Figs. 12 and 13 show the means
and SDs of the bone length difference of Kinect 1 and Kinect 2,
respectively, as compared to the bone length calibrated from
the motion capture data across all the subjects. The mean
bone length difference does not vary substantially across
exercises. The SDs typically increase with larger viewpoint
angles. We can observe that the bone lengths in
Kinect 1 usually have larger offsets and SDs as compared to
Kinect 2, especially for the upper legs due to the large vertical
offset of the hip joints.
Tables VIII and IX summarize the mean and SD of the
bone length differences in Kinect 1 and Kinect 2 in the three
viewpoints for sitting and standing exercises, respectively. We
can observe that the mean differences in the bone lengths and
SDs are smaller in Kinect 2, suggesting that the kinematic
structure of its skeleton is more robust.
C. Summary of Findings
Based on the experimental results reported in this paper,
we can make the following observations:
• As reported by other researchers, the hip joints in Kinect 1
are located much higher than normal, with an offset of about
200 mm. These offsets should be considered when calculating
knee and hip angles, in particular in the sitting position. The
skeleton in Kinect 2, on the other hand, is in general more
anthropometric, with smaller offsets.
• The foot and ankle joints of Kinect 2 are offset from the
ground plane by about 100 mm or more. The orientation
of the feet is thus unreliable. Once the foot is lifted off
the ground, the tracking of the joints is more accurate.
The unreliable foot position may originate from ToF
artifacts that generate large amounts of noise close to large
planar surfaces.
• The overall accuracy of joint positions in Kinect 2 is better
than in Kinect 1, except for the location of the feet. The
average offsets are typically between 50 mm and 100 mm.
• The analysis of the distribution mixture shows that Kinect 2
has a smaller uniform distribution component (i.e., fewer
outliers), suggesting that the tracking of joints is more robust.
Kinect 2 also tracks human movement more reliably even
with partial body occlusions.
• The difference and variance of the extracted limb lengths
are smaller in Kinect 2 than in Kinect 1.
• The skeleton tracking in Kinect 2 has much smaller latency
as compared to Kinect 1, which is noticeable especially
during fast actions (e.g. exercises Jogging and Punching).
VI. CONCLUSION
In this paper, we compared the human pose estimation
for the first and second generations of Microsoft Kinect with
standard motion capture technology. The results of our analysis
showed that overall Kinect 2 has better accuracy in joint
estimation while providing skeletal tracking that is more robust
to occlusions and body rotation. Only the lower legs were
tracked with large offsets, possibly due to ToF artifacts. This
phenomenon was not observed in Kinect 1, which employs
structured light for depth acquisition. For Kinect 1, the largest
offsets were observed in the pelvic area as also noted by
others. Our analyses show that Kinect 1 can be replaced by
Kinect 2 for the majority of motions. Furthermore, by applying
a mixture of Gaussian and uniform distribution models we
were able to evaluate the robustness of the pose tracking. We
TABLE V. JOINT POSITION OFFSETS IN SITTING EXERCISES WITH EXCLUDED OUTLIERS.

                   Kinect 1                     Kinect 2
          Mean (mm)      SD (mm)        Mean (mm)      SD (mm)
Joint     0°   30°  60°  0°  30°  60°   0°   30°  60°  0°  30°  60°
ROOT**    251  259  261  15  13   17    100  100  107  14  12   14
SPINE      85   92   96  13  12   14    110  116  128  11  10   11
NECK       73   61   60  14  14   16     84   73   71  12  11   12
HEAD       69   66   63  15  14   15     49   47   50  11  10   11
SHO L      84   83   92  14  16   24     74   71   78  13  16   23
ELB L      77   80   92  18  17   22     86   95  109  13  14   19
WRI L      70   75  111  20  24   29     55   70  109  15  19   26
HAN L      78   86  123  28  30   38     60   79  117  19  20   28
SHO R      71   70   68  16  15   16     79   82   77  14  13   15
ELB R      90   80   83  18  18   22     88   77   71  17  17   17
WRI R      59   70   89  19  22   30     59   60   71  16  16   18
HAN R      79   88  105  23  27   35     72   66   75  14  16   19
HIP L**   182  197  214  15  13   15    115  120  138  13  12   14
KNE L      92   90   86  17  18   19     76   88  104  14  14   20
ANK L*     52   72   75  13  17   21     93   99  107  13  13   18
FOO L*     66   70   81  17  19   27     91  100  110  14  17   27
HIP R**   200  206  210  14  13   16    132  127  133  13  12   13
KNE R     120  113  123  16  16   25    120  114  111  12  12   16
ANK R*     66   72   88  13  14   20    112  121  127  13  15   18
FOO R*     66   72   81  13  16   18    125  134  133  14  15   19
* t-test, p < 0.05; ** t-test, p < 0.01
TABLE VI. JOINT POSITION OFFSETS IN STANDING EXERCISES WITH EXCLUDED OUTLIERS.

                   Kinect 1                     Kinect 2
          Mean (mm)      SD (mm)        Mean (mm)      SD (mm)
Joint     0°   30°  60°  0°  30°  60°   0°   30°  60°  0°  30°  60°
ROOT**    244  257  267  16  17   20     67   80   91  10  11   12
SPINE      76   87   99  14  16   17    108  130  143   9  10   10
NECK       81   80   84  18  20   19     73   79   88  13  14   15
HEAD       86   87   87  17  19   18     58   73   77  11  12   13
SHO L      68   67   70  20  23   26     60   64   71  15  18   20
ELB L      88  100  109  26  32   35    101  125  136  17  18   23
WRI L      70   77   85  26  29   34     50   66   80  17  20   24
HAN L      68   78   84  26  31   35     62   78   89  16  17   22
SHO R      70   72   70  20  23   21     93  107  109  16  16   16
ELB R      92   83   76  30  31   28     90   86   78  24  22   23
WRI R      73   74   74  31  29   27     72   85   86  21  18   19
HAN R      77   74   70  32  33   31     73   73   71  18  18   18
HIP L**   183  203  224  17  19   19    100  124  145  11  11   13
KNE L      95  103  108  18  23   23    103  122  132  14  15   18
ANK L*    110  130  119  23  28   25    133  157  160  17  19   21
FOO L*    106  120  105  24  31   26    126  156  165  18  23   25
HIP R**   184  196  207  16  17   19     95  102  109  10  10   12
KNE R      99   98   98  18  19   19    109  121  118  15  17   17
ANK R*    106  107  108  23  23   22    144  161  156  19  20   20
FOO R*     92   85   85  23  23   24    142  157  151  20  20   24
* t-test, p < 0.05; ** t-test, p < 0.01
Fig. 12. Mean bone length differences and the corresponding SDs for the exercise Cops & Robbers as captured by Kinect 1 and Kinect 2 from three different
viewpoints.
Fig. 13. Mean bone length differences and the corresponding SDs for the exercise Jogging as captured by Kinect 1 and Kinect 2 from three different viewpoints.
showed that the SDs of the joint positions can be reduced by
30% to 40% by employing the classification with a mixture
distribution model. This finding suggests that by excluding the
outliers from the data and compensating for the offsets, more
accurate human motion analysis can be achieved.
REFERENCES
[1] Z. Zhang, “Microsoft Kinect sensor and its effect,” IEEE MultiMedia,
vol. 19, no. 2, pp. 4–10, Feb 2012.
[2] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,
A. Kipman, and A. Blake, “Real-time human pose recognition in parts
from single depth images,” in Proceedings of the 2011 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR). Washington,
DC, USA: IEEE Computer Society, 2011, pp. 1297–1304.
[3] H. M. Hondori and M. Khademi, “A review on technical and clinical
impact of microsoft Kinect on physical therapy and rehabilitation,”
Journal of Medical Engineering, vol. 2014, p. 16, 2014.
[4] E. E. Stone and M. Skubic, “Evaluation of an inexpensive depth
camera for passive in-home fall risk assessment,” in Proceedings of
5th International Conference on Pervasive Computing Technologies for
Healthcare (PervasiveHealth) and Workshops, 2011, pp. 71–77.
TABLE VIII. MEAN AND SD OF THE BONE LENGTH DIFFERENCE, SITTING POSE.

                    Kinect 1                    Kinect 2
             Mean (mm)       SD (mm)      Mean (mm)       SD (mm)
Bone         0°   30°  60°   0°  30° 60°  0°   30°  60°   0°  30° 60°
ARM UP L    -76  -74  -81    17  18  21   -67  -62  -68   14  17  20
ARM LO L    -46  -40  -32    17  22  27   -39  -30  -24   14  19  23
HAND L**     -3   -4   -3    22  19  20   -20  -17  -17   20  24  25
ARM UP R    -68  -70  -69    17  17  18   -67  -67  -69   15  14  13
ARM LO R    -37  -30  -29    17  20  21   -35  -28  -30   13  15  16
HAND R**      2    3    3    21  24  24   -15  -13   -9   20  19  17
PELVIC L**  -28  -28  -36     6   5   8   -54  -62  -75    3   3   4
LEG UP L     49   38   38    30  32  34     9    5    5   16  18  18
LEG LO L    -90 -100  -81    23  28  31   -80  -79  -75   17  18  21
FOOT L**    -13   -4    1     8  10  13    36   38   32   12  14  19
PELVIC R*   -29  -33  -41     6   5   7   -52  -47  -49    3   3   4
LEG UP R     44   46   53    31  28  31     1    9   18   16  18  18
LEG LO R    -86  -88  -95    25  26  31   -76  -82  -74   17  19  23
FOOT R**     -8   -2    8     7   9  11    41   44   40   12  11  12
* t-test, p < 0.05; ** t-test, p < 0.01
TABLE IX. MEAN AND SD OF THE BONE LENGTH DIFFERENCE, STANDING POSE.

                    Kinect 1                    Kinect 2
             Mean (mm)       SD (mm)      Mean (mm)       SD (mm)
Bone         0°   30°  60°   0°  30° 60°  0°   30°  60°   0°  30° 60°
ARM UP L*   -60  -62  -65    19  22  23   -54  -54  -51   15  18  20
ARM LO L*   -30  -31  -34    16  18  20   -34  -33  -32   17  17  18
HAND L**     -7   -5    0    21  21  23   -32  -32  -27   20  21  23
ARM UP R    -53  -58  -59    21  19  18   -59  -66  -66   17  14  14
ARM LO R    -22  -22  -22    17  18  20   -30  -27  -25   18  18  17
HAND R**     -1    4    6    23  21  19   -21  -16  -14   19  18  18
PELVIC L**  -27  -30  -37     9   7   8   -52  -63  -75    5   5   5
LEG UP L**  113  119  137    28  33  39   -13  -10    4   19  21  25
LEG LO L    -64  -66  -65    29  29  31   -79  -85  -86   21  23  25
FOOT L**     -6   -1    2     9  12  12    43   42   34    9  12  14
PELVIC R**  -28  -34  -41     7   6   8   -47  -44  -44    7   4   5
LEG UP R**  103  115  128    29  27  29   -25  -18   -6   18  18  20
LEG LO R    -61  -57  -58    26  25  25   -79  -79  -77   20  20  20
FOOT R**     -2    1    4     8   9  10    48   45   32    9  11  16
* t-test, p < 0.05; ** t-test, p < 0.01
[5] D. Webster and O. Celik, “Systematic review of Kinect applications
in elderly care and stroke rehabilitation,” Journal of Neuroengineering
and Rehabilitation, vol. 11, p. 24, 2014.
[6] F. Ofli, G. Kurillo, S. Obdrzalek, R. Bajcsy, H. Jimison, and M. Pavel,
“Design and evaluation of an interactive exercise coaching system for
older adults: Lessons learned,” IEEE Journal of Biomedical and Health
Informatics, p. 15, Epub ahead of print, 2015.
[7] J. A. Diego-Mas and J. Alcaide-Marzal, “Using Kinect sensor in obser-
vational methods for assessing postures at work,” Applied Ergonomics,
vol. 45, no. 4, pp. 976–985, 2013.
[8] P. Plantard, E. Auvinet, A.-S. Pierres, and F. Multon, “Pose estimation
with a Kinect for ergonomic studies: Evaluation of the accuracy using
a virtual mannequin,” Sensors, vol. 15, no. 1, pp. 1785–1803, 2015.
[9] M. Robinson and M. B. Parkinson, “Estimating anthropometry with
microsoft Kinect,” in Proceedings of the 2nd International Digital
Human Modeling Symposium, 2013.
[10] J. Han, L. Shao, D. Xu, and J. Shotton, “Enhanced computer vision with
microsoft Kinect sensor: A review,” IEEE Transactions on Cybernetics,
vol. 43, no. 5, pp. 1318–1334, 2013.
[11] K. Khoshelham and S. O. Elberink, “Accuracy and resolution of Kinect
depth data for indoor mapping applications,” Sensors, vol. 12, no. 2,
pp. 1437–1454, 2012.
[12] J. Smisek, M. Jancosek, and T. Pajdla, “3D with Kinect,” in Proceedings
of IEEE International Conference on Computer Vision Workshops
(ICCV Workshops), Nov 2011, pp. 1154–1160.
[13] H. Gonzalez-Jorge, P. Rodríguez-Gonzálvez, J. Martínez-Sánchez,
D. González-Aguilera, P. Arias, M. Gesto, and L. Díaz-Vilariño,
“Metrological comparison between Kinect I and Kinect II sensors,”
Measurement, vol. 70, pp. 21–26, 2015.
[14] L. Yang, L. Zhang, H. Dong, A. Alelaiwi, and A. El Saddik, “Evaluating
and improving the depth accuracy of Kinect for Windows v2,” IEEE
Sensors Journal, p. 11, Epub ahead of print, 2015.
[15] S. Obdrzalek, G. Kurillo, F. Ofli, R. Bajcsy, E. Seto, H. Jimison, and
M. Pavel, “Accuracy and robustness of kinect pose estimation in the
context of coaching of elderly population,” in Proceedings of Annual
International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC), Aug 2012, pp. 1188–1193.
[16] R. A. Clark, Y.-H. Pua, K. Fortin, C. Ritchie, K. E. Webster, L. Denehy,
and A. L. Bryant, “Validity of the Microsoft Kinect for assessment of
postural control,” Gait & Posture, vol. 36, no. 3, pp. 372–377, 2012.
[17] G. Kurillo, A. Chen, R. Bajcsy, and J. J. Han, “Evaluation of upper
extremity reachable workspace using kinect camera,” Technology and
Health Care, vol. 21, no. 6, pp. 641–656, Nov. 2013.
[18] H. Funaya, T. Shibata, Y. Wada, and T. Yamanaka, “Accuracy as-
sessment of Kinect body tracker in instant posturography for balance
disorders,” in Proceedings of 7th International Symposium on Medical
Information and Communication Technology (ISMICT), March 2013,
pp. 213–217.
[19] B. Bonnechère, B. Jansen, P. Salvia, H. Bouzahouene, L. Omelina,
F. Moiseev, V. Sholukha, J. Cornelis, M. Rooze, and S. Van Sint Jan,
“Validity and reliability of the Kinect within functional assessment
activities: Comparison with standard stereophotogrammetry,” Gait &
Posture, vol. 39, no. 1, pp. 593–598, 2014.
[20] B. Galna, G. Barry, D. Jackson, D. Mhiripiri, P. Olivier, and
L. Rochester, “Accuracy of the Microsoft Kinect sensor for measuring
movement in people with Parkinson's disease,” Gait & Posture, vol. 39,
no. 4, pp. 1062–1068, 2014.
[21] P. Plantard, E. Auvinet, A.-S. L. Pierres, and F. Multon, “Pose estima-
tion with a Kinect for ergonomic studies: Evaluation of the accuracy
using a virtual mannequin,” Sensors, vol. 15, no. 1, pp. 1785–1803,
2015.
[22] X. Xu and R. W. McGorry, “The validity of the first and second
generation Microsoft Kinect for identifying joint center locations during
static postures,” Applied Ergonomics, vol. 49, pp. 47–54, 2015.
... These systems acquire the data by tomography through a set of cameras or through an RGB-Depth system, which infers the depth data with the help of an infrared (IR) camera. Recent improvements in the performance of the IR cameras and the computational power analysis, are allowing these systems to acquire a great acuity for the capture of human movement [13]. One of the most widespread RGB-IR systems is the Microsoft Kinect originally launched for the Xbox in 2010. ...
... Recently, several research groups proposed the Kinect system as a versatile and powerful tool to be applied to areas such as rehabilitation, ergonomics, or computer vision, due to the capability to capture the image of the human body in 3D in real time [13]. Other studies implemented the code of the system to provide postural information of the human body with assessments for ergonomic scales [3]. ...
... We first have shown that the depth measurement (z axes) of the Azure Kinect system based on IR technology is as efficient as the mechanism to assess the x and y coordinates based on the RGB system. Other studies had previously reported the accuracy of the Kinect camera, thus, [13] analyzed the accuracy and performance of Kinect 1 and Kinect 2 cameras (predecessors of the Azure Kinect) compared to other motion capture methods. They concluded that the Kinect 2 camera had higher accuracy than the Kinect 1, however, the Kinect 2 depth showed large measurement deviations in the detection of the lower legs due to the infrared technology. ...
Chapter
In this study we have implemented and validated the Azure Kinect system for the acquisition and analysis of the human kinematics to be applied to the practical clinic for physiotherapy and rehabilitation, as well as in research studies. The progressive increase in the ageing of the world population in the first world countries, increasingly demands the need to find new, more automated, and versatile technological systems for the acquisition and analysis of human movement data that help us to diagnose, track the evolution of the pathologies and determine how our movements influence the development of the musculoskeletal pathologies. In this work, we were able to develop a measurement technology and validate the ability of the system based on Deep Learning (DL) and Convolutional Neural Networks (CNN), to make precise and fast measurements in real-time compared to the gold standard goniometry used by clinicians. Its precision has allowed verifying its validity for the measurement of large body joints.
... Human pose tracking aims to maintain identities of multiple poses across frames in a video sequence, to provide skeleton sequences as input for actions recognition. The work by Wang et al. [14] first introduced pose tracking for evaluating the performance of Kinect devices. In the early stages, due to the immaturity pose estimation techniques tracking human poses in videos was significantly challenging. ...
Article
Full-text available
The skeleton data of limbs are unreliable due to occlusion and camera viewpoints on intercity railway platforms. It hinders the acquisition of skeleton sequences and disturbs the skeleton-based abnormal action recognition. To overcome these issues, this work proposes a framework consisting of a pose tracking module and an abnormal action recognition module. The proposed pose tracking module maintains the identities of multiple human poses across frames and provides skeleton sequences as input for recognition. Instead of utilizing the whole skeleton, the pose tracking method tracks the trunk for more stable results of identity association as the estimations of the limbs are unreliable. In addition, a position embedding graph convolutional network (PEGCN) is proposed to recognize abnormal actions. PEGCN utilizes a simple cosine encoding as position embeddings for enhancing the differentiation of skeleton vertices and an SElayer for extracting temporal dynamics. The pose tracking method achieves 66.42% tracking accuracy scores and higher frame rates than previous methods on the PoseTrack dataset. Additionally, PEGCN achieves competitive results on the Intercity Railway Action Dataset (IRAD) and the public NTU-RGB+D dataset.
... However, skeletal data representations may vary significantly in graph configurations, involving different skeleton resolution (i.e., number of joints) and joint connectivity (i.e., relative position of joints being tracked), depending on the MoCap systems used for data collection. For instance, the different generations of Microsoft Kinect rely on different skeleton body tracking algorithms and sensing modalities [4], resulting in a high degree of skeletal representation discrepancy in joint permutations, numbers, and positions [5], as depicted in Fig. 1. ...
... The preprocessed data (i.e., denoised and segmented data) of the wrist joint are shown in Figure 2B-right, where the red dots represent the motion segmentation points. The wrist joint position offsets (mean Euclid distance between the corresponding joints in each time frame) for the four motions (I, II, III, V) were 54, 43, 138, 150 mm, respectively, with a mean offset of 96mm, which was acceptable and close to that of Kinect V2 (72 mm) [32]. Therefore, the RealSense data processed by SSA could achieve high precision with limited data acquisition conditions. ...
Article
Full-text available
Motor function assessment is essential for post-stroke rehabilitation, while the requirement for professional therapists’ participation in current clinical assessment limits its availability to most patients. By means of sensors that collect the motion data and algorithms that conduct assessment based on such data, an automated system can be built to optimize the assessment process, benefiting both patients and therapists. To this end, this paper proposed an automated Fugl-Meyer Assessment (FMA) upper extremity system covering all 30 voluntary items of the scale. RGBD sensors, together with force sensing resistor sensors were used to collect the patients’ motion information. Meanwhile, both machine learning and rule-based logic classification were jointly employed for assessment scoring. Clinical validation on 20 hemiparetic stroke patients suggests that this system is able to generate reliable FMA scores. There is an extremely high correlation coefficient (r = 0.981, p < 0.01) with that yielded by an experienced therapist. This study offers guidance and feasible solutions to a complete and independent automated assessment system.
... It provides depth information, i.e., the distance from the camera, for every point in the scene in each video frame. When filming human subjects, a skeletal body representation composed of 3D joints and connecting bones (Figure 2) is extracted from the captured depth information using machine learning algorithms [67-69]. For the purpose of tracking and estimating BBS task performance, we also collected the 3D data points in the patient's immediate surroundings, the floor position and orientation, and the 3D points of objects in the scene relevant to the BBS task. ...
Article
Automated, non-invasive fall risk assessment, particularly in the elderly population, offers an efficient means of widely screening individuals for fall risk and determining their need for participation in fall prevention programs. We present an automated and efficient system for fall risk assessment based on a multi-depth-camera human motion tracking system, which captures patients performing the well-known and validated Berg Balance Scale (BBS). Trained machine learning classifiers predict the patient's 14 BBS scores from spatio-temporal features extracted from the captured human motion records. Additionally, we used machine learning tools to develop fall risk predictors that reduce the number of BBS tasks required to assess fall risk from 14 to 4-6, without compromising the quality and accuracy of the BBS assessment. The reduced battery, termed Efficient-BBS (E-BBS), can be administered by physiotherapists in a traditional setting or deployed through our automated system, allowing an efficient and effective BBS evaluation. We report on a pilot study, run in a major hospital, including accuracy and statistical evaluations. We show the accuracy and confidence levels of the E-BBS, as well as the average number of BBS tasks required to reach the accuracy thresholds. The trained E-BBS system reduces the number of tasks in the BBS test by approximately 50% while maintaining 97% accuracy. The presented approach enables wide screening of individuals for fall risk without requiring significant time or resources from the medical community. Furthermore, the technology and machine learning algorithms can be applied to other batteries of medical tests and evaluations.
... In clinical assessments, Kinect 2 displayed adequate performance when tracking joint center displacement (Napoli et al., 2017). The second generation demonstrates better accuracy in joint estimation and remains more robust to body rotation and occlusions during movements such as walking and jogging (Wang et al., 2015; Guess et al., 2017). Kinect 2 therefore seems to outperform Kinect 1 in locomotion tracking, except for foot position tracking during standing, where a larger amount of noise is generated, possibly due to ToF artifacts (Otte et al., 2016). ...
Article
The understanding of locomotion in neurological disorders requires technologies for quantitative gait analysis. Numerous modalities are available today to objectively capture spatiotemporal gait and postural control features. Nevertheless, many obstacles prevent these technologies from being applied to their full potential in neurological research and, especially, clinical practice. These include the required expert knowledge, the time needed for data collection, and missing standards for data analysis and reporting. Here, we provide a technological review of wearable and vision-based portable motion analysis tools that emerged in the last decade, with recent applications in neurological disorders such as Parkinson's disease and multiple sclerosis. The goal is to enable the reader to understand the available technologies, with their individual strengths and limitations, in order to make an informed decision for their own investigations and clinical applications. We foresee that ongoing developments toward user-friendly automated devices will allow for closed-loop applications, long-term monitoring, and telemedical consulting in real-life environments.
Conference Paper
Teleoperation requires accurate and robust motion mapping between human and humanoid motion to generate intuitive robot control with human-like motion. Data-driven methods are often deployed because they can produce intuitive, real-time motion mapping. When using these methods, the common focus is on the accuracy of the motion mapping model. However, effort must also be put into making the mapping model robust in the face of noisy or incomplete datasets; in other words, the model needs to learn generalizable mapping rules, not just predict the training data accurately. To create a robust and accurate model for motion mapping, we developed the novel CycleAutoencoder method, which simultaneously trains two autoencoders using traditional losses, mixed losses, and cycle losses. These losses allow the autoencoders to reconstruct motion mutually between humans and humanoids, letting the method learn the mapping with improved accuracy and robustness compared to training a traditional autoencoder. Experiments involving human subjects demonstrated that the CycleAutoencoder method achieves both higher accuracy and greater robustness than other autoencoder-based mapping methods.
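The interplay of reconstruction and cycle losses can be sketched as follows; the layer sizes, pose dimensions, and the omission of the paper's mixed losses are all simplifications for illustration:

```python
import torch
import torch.nn as nn

# Shared-latent autoencoders: 75-D human pose, 42-D humanoid pose,
# 32-D latent space (all sizes are illustrative assumptions).
enc_h, dec_h = nn.Linear(75, 32), nn.Linear(32, 75)
enc_r, dec_r = nn.Linear(42, 32), nn.Linear(32, 42)
mse = nn.functional.mse_loss

def total_loss(x_h, x_r):
    # Traditional losses: each autoencoder reconstructs its own domain.
    rec = mse(dec_h(enc_h(x_h)), x_h) + mse(dec_r(enc_r(x_r)), x_r)
    # Cycle losses: human -> humanoid -> human must return to the start
    # (and symmetrically for humanoid poses).
    x_h_cycled = dec_h(enc_r(dec_r(enc_h(x_h))))
    x_r_cycled = dec_r(enc_h(dec_h(enc_r(x_r))))
    cyc = mse(x_h_cycled, x_h) + mse(x_r_cycled, x_r)
    return rec + cyc

x_h, x_r = torch.randn(8, 75), torch.randn(8, 42)
print(total_loss(x_h, x_r).item())
```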
Article
Available data may differ from the true data in many cases due to sensing errors, especially in the Internet of Things (IoT). Although privacy-preserving data mining has been widely studied during the last decade, little attention has been paid to data values containing errors. Differential privacy, the de facto standard privacy metric, can be achieved by adding noise to a target value that must be protected. However, if the target value already contains errors, there is no reason to add extra noise. In this paper, a novel privacy model called true-value-based differential privacy (TDP) is proposed. This model applies traditional differential privacy to the “true value,” which is unknown to the data owner or anonymizer, rather than to the “measured value” containing errors. Based on TDP, the amount of noise added by differential privacy techniques can be reduced by approximately 20% by our solution. As a result, the error of generated histograms can be reduced by 40.4% and 29.6% on average according to mean square error and Jensen-Shannon divergence, respectively. We validate this result on synthetic and five real data sets. Moreover, we prove that the privacy protection level does not decrease as long as the measurement error is not overestimated.
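A loose sketch of the underlying intuition, using the standard Laplace mechanism: if the measured value already carries sensing error, less noise may need to be added on top. The 20% reduction below simply reuses the figure quoted in the abstract, hard-coded for illustration; the actual calibration rule comes from the paper's analysis:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, noise_reduction=0.0):
    """Standard Laplace mechanism, with an optional factor that shrinks
    the noise scale when measurement error already masks the value."""
    scale = (sensitivity / epsilon) * (1.0 - noise_reduction)
    return value + np.random.default_rng().laplace(scale=scale)

measured = 37.2   # a sensor reading that already contains error
print(laplace_mechanism(measured, sensitivity=1.0, epsilon=0.5))
print(laplace_mechanism(measured, sensitivity=1.0, epsilon=0.5,
                        noise_reduction=0.20))  # TDP-style reduced noise
```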
Conference Paper
Limb exercises are common in physical therapy to improve range of motion (RoM), strength, and flexibility of the arm or leg. To improve therapy outcomes and reduce cost, motion tracking systems have been used to monitor the user's movements during the exercises and to provide guidance. Traditional motion tracking systems are based on either cameras or inertial measurement unit (IMU) sensors. Camera-based systems face problems caused by occlusion and lighting, while traditional IMU-based systems require at least two IMU sensors to track the motion of the entire limb, which is inconvenient to use. In this paper, we propose a novel limb motion tracking system that uses a single 9-axis IMU sensor worn on the distal joint of the limb (i.e., the wrist for the arm or the ankle for the leg). Limb motion tracking with a single IMU sensor is challenging because 1) the noisy IMU data cause drift when estimating position from acceleration, and 2) a single IMU sensor measures the motion of only one joint, whereas limb motion involves multiple joints. To solve these problems, we propose a recurrent neural network (RNN) model that estimates, in real time, the 3D positions of the distal joint as well as the other joints of the limb (e.g., elbow or knee) from the noisy IMU data. Our approach achieves high accuracy, with a median error of 7.2/7.1 cm for the wrist/elbow joint in leave-one-subject-out cross-validation when tracking arm motion, outperforming the state-of-the-art approach by more than 10%. In addition, the proposed model is lightweight, enabling real-time applications on mobile devices. Clinical relevance: This work has great potential to improve the monitoring of limb exercises and RoM measurement in home-based physical therapy. It is also cost-effective and can be made widely available for immediate application.
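A minimal sketch of this kind of sequence model, mapping windows of 9-axis IMU readings to per-frame 3D positions of two joints; the architecture and sizes are illustrative assumptions, not the paper's RNN:

```python
import torch
import torch.nn as nn

class LimbTracker(nn.Module):
    def __init__(self, imu_channels=9, hidden=64, joints=2):
        super().__init__()
        self.rnn = nn.LSTM(imu_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, joints * 3)   # xyz per joint

    def forward(self, imu_seq):                     # (batch, time, 9)
        h, _ = self.rnn(imu_seq)
        return self.head(h)                         # (batch, time, 6)

model = LimbTracker()
imu = torch.randn(4, 100, 9)    # 4 windows of 100 IMU samples
positions = model(imu)          # per-frame wrist and elbow estimates
print(positions.shape)          # torch.Size([4, 100, 6])
```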
Article
The Microsoft Kinect sensor has been widely used in many applications since the launch of its first version. Recently, Microsoft released a new version of the Kinect sensor with improved hardware; however, the accuracy of the new sensor remains to be assessed. In this paper, we measure the depth accuracy of the newly released Kinect v2 depth sensor and derive a cone model to illustrate its accuracy distribution. We then evaluate the variance of the captured depth values by depth entropy. In addition, we propose a trilateration method to improve the depth accuracy using multiple Kinects simultaneously. Experimental results are provided to validate the proposed model and method.
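Trilateration from several depth cameras can be posed as a small least-squares problem; a generic sketch (the camera positions and measured ranges below are invented for the example, not taken from the paper):

```python
import numpy as np
from scipy.optimize import least_squares

cams = np.array([[0.0, 0.0, 0.0],
                 [2.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0]])        # known Kinect positions (m)
ranges = np.array([1.70, 1.35, 1.50])     # measured depths to target (m)

def residuals(p):
    # Difference between predicted and measured camera-to-target ranges.
    return np.linalg.norm(cams - p, axis=1) - ranges

estimate = least_squares(residuals, x0=np.array([1.0, 1.0, 1.0])).x
print(estimate)   # fused 3-D position of the target point
```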
Article
Analyzing human poses with a Kinect is a promising method for evaluating potential risks of musculoskeletal disorders at workstations. In ecological situations, complex 3D poses and constraints imposed by the environment make it difficult to obtain reliable kinematic information. Thus, predicting the potential accuracy of the measurement for such complex 3D poses and sensor placements is challenging in classical experimental setups. To tackle this problem, we propose a new evaluation method based on a virtual mannequin. In this study, we apply this method to the evaluation of joint positions (shoulder, elbow, and wrist), joint angles (shoulder and elbow), and the corresponding RULA (a popular ergonomic assessment grid) upper-limb score for a large set of poses and sensor placements. Thanks to this evaluation method, more than 500,000 configurations have been automatically tested, which would be almost impossible with classical protocols. The results show that the kinematic information obtained by the Kinect software is generally accurate enough to fill in ergonomic assessment grids. However, inaccuracy increases strongly for some specific poses and sensor positions. Using this evaluation method enabled us to identify the configurations that lead to these high inaccuracies. As supplementary material, we provide a software tool to help designers evaluate the expected accuracy of this sensor for a set of upper-limb configurations. Results obtained with the virtual mannequin are in accordance with those obtained from a real subject for a limited set of poses and sensor placements.
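Ergonomic grids such as RULA are filled in from joint angles derived from tracked joint positions; a minimal sketch of one such derivation (the coordinates below are illustrative):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by segments b->a and b->c."""
    u, v = a - b, c - b
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

shoulder = np.array([0.0, 1.4, 0.0])
elbow    = np.array([0.0, 1.1, 0.1])
wrist    = np.array([0.2, 1.0, 0.3])
print(f"elbow angle: {joint_angle(shoulder, elbow, wrist):.1f} deg")
```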
Article
This paper reviews the technical and clinical impact of the Microsoft Kinect in physical therapy and rehabilitation. It covers studies on patients with neurological disorders, including stroke, Parkinson's disease, cerebral palsy, and multiple sclerosis, as well as elderly patients. Search results in PubMed and Google Scholar reveal increasing interest in using the Kinect in medical applications. Relevant papers are reviewed and divided into three groups: 1) papers which evaluated the Kinect's accuracy and reliability, 2) papers which used the Kinect in a rehabilitation system and provided clinical evaluation involving patients, and 3) papers which proposed a Kinect-based system for rehabilitation but fell short of providing clinical validation. Finally, to serve as a technical comparison to aid future rehabilitation designs, other sensors similar to the Kinect are reviewed.
Article
In this paper we present a review of the most current avenues of research into Kinect-based elderly care and stroke rehabilitation systems, providing an overview of the state of the art, limitations, and issues of concern, as well as suggestions for future work in this direction. The central purpose of this review was to collect all relevant study information in one place in order to support and guide current research, as well as to inform researchers planning to embark on similar studies or applications. The paper is structured into three main sections, each presenting a review of the literature for a specific topic. The Elderly Care section comprises two subsections: Fall Detection and Fall Risk Reduction. The Stroke Rehabilitation section contains studies grouped under Evaluation of Kinect's Spatial Accuracy and Kinect-based Rehabilitation Methods. The third section, Serious and Exercise Games, contains studies that are indirectly related to the first two sections and present complete systems for elderly care or stroke rehabilitation in a Kinect-based game format. Each of the three main sections concludes with a discussion of the limitations of the Kinect in its respective applications. The paper closes with overall remarks regarding the use of the Kinect in elderly care and stroke rehabilitation applications and suggestions for future work. A concise summary with significant findings and subject demographics (when applicable) of each study included in the review is also provided in table format.
Conference Paper
The accuracy of the body-tracking algorithm of the Microsoft Kinect SDK was investigated with the aim of applying the device to the medical care of balance disorders. We focused on the inclination of body posture, assuming that the Kinect is used for instant posturography in balance testing. Subjects participated in several balance tests while the positions of motion-capture markers on the subjects were recorded simultaneously with the Kinect body-tracking measurements. After calibrating the coordinates and compensating for the temporal delay between the two sensor systems using a hybrid marker, the correspondence between the Kinect body-part locations and the motion-capture marker locations was modeled as a linear combination of neighboring markers, with optimal parameters estimated according to a trajectory-minimization criterion. An experiment with six healthy adults showed that our method can track body motion with the accuracy required for standard balance tests.
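The linear combination model described above can be fitted by ordinary least squares over the recorded trajectories; a minimal sketch with synthetic data standing in for the markers (the weights and noise level are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
frames = 500
markers = rng.normal(size=(frames, 4, 3))        # 4 neighboring markers
true_w = np.array([0.4, 0.3, 0.2, 0.1])          # ground-truth weights
kinect = markers.transpose(0, 2, 1) @ true_w     # (frames, 3)
kinect += rng.normal(scale=0.01, size=kinect.shape)

# Stack x/y/z rows and solve for weights shared across the three axes.
A = markers.transpose(0, 2, 1).reshape(-1, 4)    # (frames*3, 4)
b = kinect.reshape(-1)
w, *_ = np.linalg.lstsq(A, b, rcond=None)
print(w)   # recovered weights, close to [0.4, 0.3, 0.2, 0.1]
```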
Article
Background: The Microsoft Kinect sensor (Kinect) is potentially a low-cost solution for clinical and home-based assessment of movement symptoms in people with Parkinson's disease (PD). The purpose of this study was to establish the accuracy of the Kinect in measuring clinically relevant movements in people with PD.
Methods: Nine people with PD and 10 controls performed a series of movements which were measured concurrently with a Vicon three-dimensional motion analysis system (gold standard) and the Kinect. The movements included quiet standing, multidirectional reaching and stepping, walking on the spot, and the following items from the Unified Parkinson's Disease Rating Scale: hand clasping, finger tapping, foot and leg agility, chair rising, and hand pronation. Outcomes included mean timing and range of motion across movement repetitions.
Results: The Kinect measured the timing of movement repetitions very accurately (low bias, 95% limits of agreement <10% of the group mean, ICCs >0.9, and Pearson's r >0.9). However, the Kinect had varied success measuring spatial characteristics, ranging from excellent for gross movements such as sit-to-stand (ICC = .989) to very poor for fine movements such as hand clasping (ICC = .012). Despite this, results from the Kinect related strongly to those obtained with the Vicon system (Pearson's r > 0.8) for most movements.
Conclusions: The Kinect can accurately measure timing and gross spatial characteristics of clinically relevant movements, but not with the same spatial accuracy for smaller movements, such as hand clasping.
Article
This work presents a metrological comparison between the Kinect I and Kinect II laser scanners. The comparison is made using a standard artefact based on 5 spheres and 7 cubes. Accuracy and precision tests are performed at different ranges and with varying inclination angles between each sensor and the artefact. Results at 1 m range show similar precision in both cases, with values between 2 mm and 6 mm. At 2 m range, however, Kinect I values increase up to 12 mm in some cases, while Kinect II keeps all results below 8 mm. Accuracy is also better for Kinect II at 1 m and 2 m ranges, with errors always smaller in magnitude than 5 mm, whereas Kinect I accuracy reaches −12 mm at 1 m range and −25 mm at 2 m range. The precision study shows that precision degrades with range following a second-order polynomial for Kinect I, while Kinect II yields much more stable data. The measurement range of Kinect II is limited to 4 m, while Kinect I can acquire data up to 6 m.
Article
The Kinect™ sensor released by Microsoft is a low-cost, portable, marker-less motion tracking system developed for the video game industry. Since the first-generation Kinect sensor was released in 2010, many studies have examined the validity of this sensor for measuring body movement in different research areas. In 2014, Microsoft released the second-generation Kinect sensor for computers, with a better-resolution depth sensor. However, very few studies have directly compared all the Kinect-identified joint center locations with their counterparts identified by a motion tracking system, a comparison that may provide insight into the error of Kinect-identified segment lengths and joint angles, as well as the feasibility of applying inverse dynamics to Kinect-identified joint centers. The purpose of the current study is first to propose a method to align the coordinate system of the Kinect sensor with the global coordinate system of a motion tracking system, and then to examine the accuracy of the Kinect-identified joint locations during 8 standing and 8 sitting postures of daily activities. The results indicate that the proposed alignment method can effectively align the Kinect sensor with the motion tracking system. The accuracy of the Kinect-identified joint center locations is posture- and joint-dependent. For the upright standing posture, the average error across all participants and all Kinect-identified joint centers is 76 mm and 87 mm for the first- and second-generation Kinect sensors, respectively. In general, standing postures are identified with better accuracy than sitting postures, and the joints of the upper extremities are identified more accurately than those of the lower extremities. These results may inform the feasibility of using the Kinect sensor in future studies.
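The abstract does not spell out the alignment method, but a common way to register one coordinate system to another from corresponding points is the Kabsch (SVD-based) rigid alignment; a generic sketch, not necessarily the paper's procedure:

```python
import numpy as np

def rigid_align(P, Q):
    """Find rotation R and translation t minimizing ||R p + t - q||
    over corresponding points. P, Q: (N, 3) arrays in the two frames."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)              # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T)) # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# Usage: P holds Kinect joint positions, Q the same joints from mocap.
P = np.random.default_rng(1).normal(size=(10, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([0.1, -0.2, 0.05])
R, t = rigid_align(P, Q)
print(np.allclose(P @ R.T + t, Q))   # True
```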
Article
Although the positive effects of exercise on the well-being and quality of independent living of older adults are well accepted, many elderly individuals lack access to exercise facilities or the skills and motivation to exercise at home. To provide a more engaging environment that promotes physical activity, various fitness applications have been proposed. Many of the available products, however, are geared toward a younger population and are not appropriate or engaging for an older population. To address these issues, we developed an automated interactive exercise coaching system using the Microsoft Kinect. The coaching system guides users through a series of video exercises, tracks and measures their movements, provides real-time feedback, and records their performance over time. Our system consists of exercises to improve balance, flexibility, strength, and endurance, with the aim of reducing fall risk and improving the performance of daily activities. In this paper, we report on the development of the exercise system, discuss the results of our recent field pilot study with six independently living elderly individuals, and highlight the lessons learned relating to the in-home system setup, user tracking, feedback, and exercise performance evaluation.
Article
This paper examines the potential use of the Kinect™ range sensor in observational methods for assessing postural loads. Range sensors can detect the positions of the joints at high sampling rates without attaching sensors or markers directly to the subject under study. First, a computerized OWAS ergonomic assessment system was implemented to permit data acquisition from the Kinect™ and data processing to identify the risk level of each recorded posture. Output data were compared with the results provided by human observers and were used to determine the influence of the sensor's view angle relative to the worker. The tests show high inter-method agreement in the classification of risk categories (proportion agreement index = 0.89, κ = 0.83) when the tracked subject is facing the sensor. The camera's point of view relative to the position of the tracked subject significantly affects the correct classification of the postures. Although the results are promising, some aspects involved in the use of low-cost range sensors should be further studied before they are applied in real environments.
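To make the agreement statistics above concrete, here is a minimal sketch of computing the proportion agreement index and Cohen's κ between sensor-based and observer-based risk categories (the category labels below are invented for illustration):

```python
import numpy as np

def proportion_agreement(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def cohens_kappa(a, b):
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)                           # observed agreement
    labels = np.union1d(a, b)
    p_e = sum(np.mean(a == lab) * np.mean(b == lab) for lab in labels)
    return (p_o - p_e) / (1.0 - p_e)                # chance-corrected

observer = [1, 1, 2, 3, 2, 1, 4, 2, 1, 3]           # OWAS risk categories
kinect   = [1, 1, 2, 3, 2, 2, 4, 2, 1, 3]
print(proportion_agreement(observer, kinect), cohens_kappa(observer, kinect))
```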