Markerless kinematic model and motion capture from volume sequences

In Proceedings of IEEE Computer Vision and Pattern Recognition, Vol. 2, pp. 475-482, Madison, Wisconsin, USA, June 16-22, 2003.
Chi-Wei Chu, Odest Chadwicke Jenkins, Maja J Matarić
Robotics Research Laboratory
Center for Robotics and Embedded Systems
Department of Computer Science
University of Southern California
Los Angeles, CA, USA 90089-0781
{chuc,cjenkins,mataric}@usc.edu
Abstract
We present an approach for model-free markerless mo-
tion capture of articulated kinematic structures. This ap-
proach is centered on our method for generating underlying
nonlinear axes (or a skeleton curve) of a volume of genus
zero (i.e., without holes). We describe the use of skeleton
curves for deriving a kinematic model and motion (in the
form of joint angles over time) from a captured volume se-
quence. Our motion capture method uses a skeleton curve,
found in each frame of a volume sequence, to automatically
determine kinematic postures. These postures are aligned
to determine a common kinematic model for the volume se-
quence. The derived kinematic model is then reapplied to
each frame in the volume sequence to find the motion se-
quence suited to this model. We demonstrate our method on
several types of motion, from synthetically generated vol-
ume sequences with an arbitrary kinematic topology, to hu-
man volume sequences captured from a set of multiple cali-
brated cameras.
1. Introduction
The ability to collect human motion data is invaluable for
applications such as computer animation, activity recogni-
tion, human-computer interfaces, and humanoid robot con-
trol and teleoperation. This fact is evidenced by the increas-
ing amount of research geared towards developing and uti-
lizing motion capture technologies. Typical motion capture
mechanisms require that the subject be instrumented with
several beacons or markers. The motion of the subject is
then reconciled from the sensed positions and/or orienta-
tions of the markers. However, such systems can:
1. be prohibitively expensive;
2. require subjects to be instrumented with cumbersome
markers;
3. greatly restrict the volume of capture;
4. have difficulty assigning consistent labels to occluding
markers;
5. have difficulty converting marker data into kinematic
motion.
An emerging area of research suited to address these
problems involves uninstrumented capture of motion, or
markerless motion capture. For markerless motion cap-
ture, subject data are acquired through some passive sens-
ing mechanism and then reconciled into kinematic mo-
tion. Several model-based markerless capture approaches
[6, 14, 3, 8, 4, 13, 22, 10] have been proposed that assume
an a priori kinematic or body model. However, it would
be preferable to eliminate this model dependence to capture
both the subject’s motion and kinematic model and, thus,
perform model and motion capture.
In this paper, we introduce a solution for model-free
vision-based markerless motion capture of subjects with
tree-structured kinematics from multiple calibrated cam-
eras. Using the functional structure of a motion capture sys-
tem described by Moeslund and Granum [15], we summa-
rize our approach for markerless motion capture. Moeslund
and Granum describe a motion capture system as consisting
of four components: initialization, tracking, pose estima-
tion, and recognition. For initialization, a set of cameras
is calibrated, using a method such as Bouguet’s [2]. Be-
cause we assume no a priori kinematic model, no model
initialization is necessary. We assume for the tracking com-
ponent a system capable of capturing an individual sub-
ject’s movement over time as a volume sequence, such as
[16, 19]. The pose estimation component we develop is
more than pose estimation because it performs model and
motion capture. In this component, we perform automatic
model and posture estimation for each frame in the volume
sequence. The models and postures produced from each
frame are aligned in a second pass to determine a common
kinematic model across the volume sequence. The common
kinematic model is then applied to each frame in the volume
sequence to perform pose estimation with respect to a con-
sistent model. Our current methodology for capture does
not include a recognition component. However, we envi-
sion our capture system providing vast amounts of motion
data for other uses. For instance, Jenkins and Matarić [11]
require long streams of motion data as demonstrations for
automatically deriving vocabularies of behaviors and con-
trollers for humanoid robot control.
Central to our model and motion capture approach is the
ability to estimate a kinematic model and its posture from
a subject’s volume in a single frame. Towards this end, we
developed a model-free method, called nonlinear spherical
shells (NSS), for extracting skeleton point features that are
linked into a tree-structured skeleton curve for a particular
frame within a motion. A skeleton curve is an approxima-
tion of the “underlying axes” of a subject, similar to prin-
cipal curves [9], the axis of a generalized cylinder, or the
wire spine of a posable puppet. NSS works by accentuat-
ing the underlying axes of a volume through Isomap non-
linear dimension reduction [21] and traversing the result-
ing “Da Vinci”-like posture. Isomap essentially eliminates
the nonlinearities caused by joint rotations. Using the skeleton
curve provided by NSS, we automatically estimate the
tree-structured kinematics and posture of the volume.
Several advantages arise in using our approach for mark-
erless motion capture. First, our method is fast and accurate
enough to be tractably applied to all frames in a motion.
Our method can be used alone or as an initialization step for
model-based capture approaches. Second, our dependence
on modeling human bodies is eliminated. Automated model
derivation is especially useful when the subject’s kinemat-
ics differ from standard human kinematics due to missing
limbs or objects the subject is manipulating. Third, the pos-
ture of the human subject is automatically determined with-
out complicated label assignments.
2. Volume Sequence Capture
The volume sequence data used for this work came from
two sources. One source was volume data captured from
real-world subjects (humans) by multiple cameras. The
other source was synthetically generated volume data from
an articulated 3D geometry with arbitrary kinematics.
For real-world volume capture, we used an existing
volume capture technique for multiple calibrated cameras.
While not the focus of our work, this implementation
does provide an adequate means for collecting volume sequences.
Figure 1. An illustrated outline of our approach. (a) A subject
viewed in multiple cameras over time is used to build (b) a
Euclidean space point volume sequence. Postures in each frame
are estimated by: transforming the subject volume (c) to an
intrinsic space pose-invariant volume, finding its (d) principal
curves, projecting the principal curves to a (e) skeleton curve,
and breaking the skeleton curve into a kinematic model.
(f) Kinematic models for all frames are (g) aligned to find
the joints for a normalized kinematic model. The normalized
kinematic model is applied to all frames in the volume sequence
to estimate its (h) motion, shown from an animation viewing
program.
The implementation is derived from the work of
Penny et al. [16] for real-time volume capture; however,
several other approaches are readily available (e.g., [19, 4]).
The capture approach is a basic brute-force method that
checks each element of a voxel grid for inclusion in the
point volume. In our capture setup, we place multiple cam-
eras around three sides of a hypothetical rectangular vol-
ume, such that each camera can view roughly all of the vol-
ume. This rectangular volume is a voxel grid that divides
the space in which moving objects can be captured.
The intrinsic and extrinsic calibration parameters for the
cameras are extracted using the camera calibration toolbox
designed by Bouguet [2]. The parameters from calibration allow us
to precompute a look-up table for mapping a voxel to pixel
locations in each camera. For each frame in the motion, sil-
houettes of foreground objects in the capture space are seg-
mented within the image of each camera and used to carve
the voxel grid. A background subtraction method proposed
in [7] was used. It can then be determined if each voxel
in the grid is part of a foreground object by counting and
thresholding the number of camera images in which it is
part of a silhouette. One set of volume data is collected for
each frame (i.e., set of synchronized camera images) and
stored for offline processing.
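The counting-and-thresholding test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the capture implementation itself: the function name `carve_voxels`, the layout of the look-up table, and the `min_views` threshold are hypothetical, and the silhouettes are toy binary masks.

```python
def carve_voxels(lookup, silhouettes, min_views):
    """Return the indices of voxels judged to be part of a foreground object.

    lookup[v][c]   -- (row, col) pixel of voxel v in camera c, or None when
                      the voxel projects outside that camera's image
    silhouettes[c] -- binary mask (list of rows) for camera c, 1 = foreground
    min_views      -- assumed threshold on the number of agreeing cameras
    """
    occupied = []
    for v, pixels in enumerate(lookup):
        votes = 0
        for c, pix in enumerate(pixels):
            if pix is None:
                continue
            row, col = pix
            votes += silhouettes[c][row][col]
        if votes >= min_views:
            occupied.append(v)
    return occupied


# Toy setup: two cameras with 2x2 images and three voxels.
silhouettes = [
    [[1, 0], [0, 0]],  # camera 0
    [[1, 1], [0, 0]],  # camera 1
]
lookup = [
    [(0, 0), (0, 0)],  # voxel 0: inside the silhouette in both cameras
    [(0, 1), (0, 1)],  # voxel 1: inside the silhouette in camera 1 only
    [(1, 1), None],    # voxel 2: background in camera 0, unseen by camera 1
]
occupied = carve_voxels(lookup, silhouettes, min_views=2)
print(occupied)
```

Because the look-up table is computed once from the calibration, the per-frame cost in this sketch is one mask read per voxel and camera.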
For synthetic data, we artificially create motion se-
quences from a synthetic articulated object with arbitrary
tree-structured kinematics. We use these data to test our approach
on objects that are not readily available or controllable in the
real world. In creating these data, we manually specified
the kinematic model, rigid body geometries (cylinders), and
joint angle trajectories. The motion of the object is con-
verted into a volume sequence by scan converting each
frame according to a voxel grid.
3. Nonlinear Spherical Shells
Nonlinear spherical shells (NSS) is our model-free ap-
proach for extracting a skeleton curve feature from a
Euclidean-space volume of points. For NSS, we assume
that nonlinearity of rigid-body kinematic motion is intro-
duced by rotations about the joint axes. By removing these
joint nonlinearities, we can trivially extract skeleton curves.
Fortunately for us, recent work on manifold learning
techniques has produced methods capable of uncovering
nonlinear structure from spatial data. These techniques in-
clude Isomap [21], Kernel PCA [18], and Locally Linear
Embedding [17]. Isomap works by building geodesic dis-
tances between data point pairs on an underlying spatial
manifold. These distances are used to perform a nonlin-
ear PCA-like embedding to an intrinsic space, a subspace
of the original data containing the underlying spatial man-
ifold. Isomap, in particular, has been demonstrated to ex-
tract meaningful nonlinear representations for high dimen-
sional data such as images of handwritten digits, natural
hand movements, and a pose-varying human head.
The NSS procedure works in three main steps:
1. removing pose-dependent nonlinearities from the volume
by transforming the volume into an intrinsic space
using Isomap;
2. dividing and clustering the pose-independent volume
such that principal curves are found in intrinsic space;
3. projecting the points defining the intrinsic space principal
curves into the original Euclidean space to produce a
skeleton curve for the volume.
Isomap is applied in the first step of the NSS procedure
to remove pose nonlinearities from a set of points comprising
the captured human in Euclidean space. We use the
implementation provided by the authors of Isomap (avail-
able at http://isomap.stanford.edu/). This implementation is
applied directly to the volume data. Isomap requires the
user to specify only the number of dimensions for the in-
trinsic space and how to construct local neighborhoods for
each data point. Because dimension reduction is not our
aim, the intrinsic space is set to have 3 dimensions. Each
point determines other points within its local neighborhood
using k-nearest neighbors or an epsilon sphere with a cho-
sen radius.
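To make the role of the neighborhood parameter concrete, the following sketch computes the geodesic distances that form the first stage of Isomap, using an epsilon-sphere neighborhood graph and all-pairs shortest paths. It is a hedged illustration only: the function name `geodesic_distances` and the toy "arm" are assumptions, the Floyd-Warshall shortest-path step is a simple choice made here for brevity, and the embedding step (classical multidimensional scaling on these distances) is omitted.

```python
import math


def geodesic_distances(points, epsilon):
    """All-pairs shortest-path distances over an epsilon-neighborhood graph."""
    n = len(points)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d <= epsilon:  # connect only points inside the epsilon sphere
                dist[i][j] = dist[j][i] = d
    # Floyd-Warshall: relax every pair through every intermediate point.
    for k in range(n):
        for i in range(n):
            if dist[i][k] == inf:
                continue
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


# A bent "arm": five unit-spaced points with a right-angle elbow.
arm = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0)]
geo = geodesic_distances(arm, epsilon=1.1)
print(geo[0][4])                  # distance measured along the arm
print(math.dist(arm[0], arm[4]))  # straight-line distance across the bend
```

Bending the arm changes the straight-line distance between its ends but not the distance measured along it, which is why an embedding built from geodesic distances is largely insensitive to joint rotation.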
The application of Isomap transforms the volume points
into a pose-independent arrangement in the intrinsic space.
The pose-independent arrangement is similar to a “Da
Vinci” pose in 3 dimensions (Figure 2). Isomap can pro-
duce the Da Vinci point arrangement for any point volume
with distinguishable limbs.
The next step in the NSS procedure is processing the
intrinsic space volume for principal curves. The definition
of principal curves can be found in [9] or [12] as “self-
consistent” smooth curves that pass through the “middle”
of a d-dimensional data cloud, or nonlinear principal com-
ponents. While smoothness is not our primary concern, we
are interested in placing a curve through the “middle” of our
Euclidean space volume. Depending on the posture of the
human, this task can be difficult in Euclidean space. How-
ever, the pose-invariant volume provided by Isomap makes
the extraction of principal curves simple, due to properties
of the intrinsic space volume. Isomap provides an intrinsic
space volume that is mean-centered at the origin and has
limb points that extend away from the origin.
Points on the principal curves in intrinsic space can be found
by the following subprocedure (Figure 4):
1. partitioning the intrinsic space volume points into con-
centric spherical shells;
2. clustering the points in each partition;
3. averaging the points of each cluster to produce a prin-
cipal curve point;
4. linking principal curve points with overlapping clus-
ters in adjacent spherical shells.
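The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `shell_index`, `cluster`, and `principal_curve_nodes` are hypothetical, a simple distance-threshold grouping stands in for the clustering method, and "overlapping" clusters are approximated by clusters in consecutive non-empty shells that contain points within `gap` of each other.

```python
import math


def shell_index(p, width):
    """Index of the concentric shell (of the given width) containing p."""
    return int(math.sqrt(sum(c * c for c in p)) // width)


def cluster(points, gap):
    """Group points connected by chains of neighbors closer than gap."""
    unvisited = list(range(len(points)))
    clusters = []
    while unvisited:
        frontier = [unvisited.pop()]
        members = []
        while frontier:
            i = frontier.pop()
            members.append(points[i])
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= gap]
            for j in near:
                unvisited.remove(j)
            frontier.extend(near)
        clusters.append(members)
    return clusters


def principal_curve_nodes(points, width, gap):
    """Return tree nodes as dicts with 'shell', 'point', and 'parent'."""
    shells = {}
    for p in points:
        shells.setdefault(shell_index(p, width), []).append(p)
    nodes = []
    previous = []  # (node index, member points) for the previous shell
    for s in sorted(shells):
        current = []
        for members in cluster(shells[s], gap):
            centroid = tuple(sum(c) / len(members) for c in zip(*members))
            parent = None
            for idx, prev_members in previous:
                if any(math.dist(a, b) <= gap
                       for a in members for b in prev_members):
                    parent = idx
                    break
            nodes.append({"shell": s, "point": centroid, "parent": parent})
            current.append((len(nodes) - 1, members))
        previous = current
    return nodes


# Toy intrinsic-space volume: two limbs extending from the origin along x.
limbs = [(float(x), 0.0, 0.0) for x in range(-4, 5)]
nodes = principal_curve_nodes(limbs, width=1.5, gap=1.1)
for node in nodes:
    print(node)
```

In this toy case the innermost shell yields a single root node at the origin, and each outer shell yields one node per limb, linked back toward the root.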
Figure 2. A captured human volume in Euclidean space (top) and its pose-invariant intrinsic space representation (bottom).
Clustering used for each partition was developed from
the one-dimensional “sweep-and-prune” technique, de-
scribed by Cohen et al. [5], for finding clusters bounded by
axis-aligned boxes. This clustering method requires spec-
ification of a separating distance threshold for each axis
rather than the expected number of clusters. The result from
the principal curves procedure is a set of points defining
the principal curves linked in a hierarchical tree-structure.
These include three types of indicator nodes: a root node
located at the mean of the volume, branching nodes that
separate into articulations, and leaf nodes at terminal points
of the body.
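A threshold-based clustering of this kind can be sketched with one sorted sweep per axis. The code below is an illustration in the spirit of the description, not the authors' implementation: the names `split_by_gaps` and `sweep_cluster` are hypothetical, a single pass over the axes is assumed to be sufficient, and the input is assumed to be non-empty.

```python
def split_by_gaps(points, axis, threshold):
    """Sweep points sorted along one axis; start a new group at large gaps."""
    ordered = sorted(points, key=lambda p: p[axis])
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[axis] - prev[axis] > threshold:
            groups.append([])
        groups[-1].append(cur)
    return groups


def sweep_cluster(points, thresholds):
    """Split along each axis in turn, using one separating distance per axis."""
    groups = [list(points)]
    for axis, threshold in enumerate(thresholds):
        groups = [part
                  for group in groups
                  for part in split_by_gaps(group, axis, threshold)]
    return groups


shell_points = [(0.0, 0.0, 0.0), (0.5, 0.2, 0.0), (5.0, 0.0, 0.0), (5.2, 4.0, 0.0)]
clusters = sweep_cluster(shell_points, thresholds=(1.0, 1.0, 1.0))
print([len(c) for c in clusters])
```

The number of clusters falls out of the separating distances, so it does not have to be specified in advance.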
The final step in the NSS procedure projects the intrin-
sic space principal curve points onto a skeleton curve in
the original Euclidean space. We use Shepard’s interpola-
tion [20] to map principal curve points onto the Euclidean
space volume, producing skeleton curve points. The skele-
ton curve is formed by reapplying the tree-structured link-
ages of the intrinsic space principal curves to the skeleton
curve points.
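Shepard's interpolation is inverse-distance weighting: a query point in intrinsic space is mapped to a weighted average of the Euclidean positions of the volume points, with weights that decay with intrinsic-space distance. The sketch below assumes a weighting exponent of 2 and uses all volume points as the neighborhood; both choices, and the name `shepard_map`, are assumptions made for illustration.

```python
import math


def shepard_map(query, intrinsic_pts, euclidean_pts, power=2.0):
    """Map an intrinsic-space point to Euclidean space by inverse-distance
    weighting over corresponding (intrinsic, Euclidean) point pairs."""
    weights = []
    for u, x in zip(intrinsic_pts, euclidean_pts):
        d = math.dist(query, u)
        if d == 0.0:
            return tuple(float(c) for c in x)  # query coincides with a sample
        weights.append(1.0 / d ** power)
    total = sum(weights)
    dim = len(euclidean_pts[0])
    return tuple(
        sum(w * x[k] for w, x in zip(weights, euclidean_pts)) / total
        for k in range(dim)
    )


# Two volume points with known positions in both spaces.
intrinsic = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
euclidean = [(10.0, 0.0, 0.0), (10.0, 4.0, 0.0)]
skeleton_point = shepard_map((1.0, 0.0, 0.0), intrinsic, euclidean)
print(skeleton_point)
```

A principal curve point that lies midway between the two samples in intrinsic space lands midway between their Euclidean positions.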
Other methods for volume skeletonization are available.
These approaches include distance coding [23], boundary
peeling [23], and self-organizing feature maps [1]. For
our purposes, it is important to ensure that the skeletoniza-
tion produces a bordered 1-manifold, not necessarily a me-
dial axis that is potentially a 2-manifold.
3.1. Skeleton Curve Refinement
The skeleton curve found by the NSS procedure will
be indicative of the underlying spatial structure of the Eu-
clidean space volume, but may contain a few undesirable
artifacts. We handle these artifacts using a skeleton curve
refinement procedure. The refinement procedure first elim-
inates noise branches in the skeleton curve that typically
occur in areas of small articulation, such as the hands and
feet. Noise branches are detected as branches with depth
under some threshold. A noise branch is eliminated through
merging its skeleton curve points with a non-noise branch.
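Detecting noise branches by depth can be sketched on a tree stored as a mapping from each node to its children. This is a hypothetical illustration: the names `subtree_height` and `noise_branches`, the node labels, and the definition of depth as the number of nodes on the longest path below a branching node are assumptions, and the merge into a non-noise branch is not shown.

```python
def subtree_height(children, node):
    """Number of nodes on the longest downward path starting at node."""
    kids = children.get(node, [])
    return 1 + max((subtree_height(children, k) for k in kids), default=0)


def noise_branches(children, min_depth):
    """Return the first node of every branch shallower than min_depth."""
    noisy = []
    for node, kids in children.items():
        if len(kids) < 2:
            continue  # only branching nodes can spawn noise branches
        for k in kids:
            if subtree_height(children, k) < min_depth:
                noisy.append(k)
    return noisy


# Toy skeleton curve: two limbs, one of which has a one-node stub.
children = {
    "root": ["a", "b"],
    "a": ["a1"],
    "a1": ["a2"],
    "a2": ["a3", "stub"],
    "a3": ["a4"],
    "b": ["b1"],
    "b1": ["b2"],
}
print(noise_branches(children, min_depth=2))
```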
The refinement procedure then eliminates noise for the
root of the skeleton curve. Shell partitions around the mean
of the body volume will be encompassed by the volume
(i.e., contain a single cluster spread across the shell). The
skeleton curve points for such partitions will be roughly lo-
cated near the volume mean. These skeleton curve points
are merged to yield a new root to the skeleton curve. The
result is a skeleton curve having a root and two or more im-
mediate descendants.
The minor variations in the topology of the skeleton
curve are then eliminated by merging adjacent branching
nodes. These are two skeleton points on adjacent spherical
shells with adjacent clusters that both introduce a branching
of the skeleton curve. The branches at these nodes are as-
sumed to represent the same branching node. Thus, the two
skeleton points are merged into a single branching node.
4. Model and Motion Capture
In this section, we describe the application of NSS within
the context of our approach for markerless model and mo-
tion capture. The model and motion capture (MMC) proce-
dure automatically determines a common kinematic model
and joint angle motion from a volume sequence in a three-
pass process. In the first pass, the procedure applies NSS
independently to each frame in the volume sequence. From
the skeleton curve and volume of each frame, a kinematic
model and posture is produced that is specific to the frame.
A second pass across the specific kinematic models of each
frame is used to produce a single normalized kinematic
model with respect to the frames in the volume sequence.
Finally, the third pass applies the normalized model to each
volume and skeleton curve in the sequence to produce esti-
mated posture parameters.
The described NSS procedure is capable of producing
skeleton curve features in a model-free fashion. The skele-
ton curve is used to derive a kinematic model for the vol-
ume in each frame. First, we consider each branch (occur-
ring between two indicator nodes) as a kinematic link. The
root node and all branching nodes are classified as joints.
Each branch is then segmented into smaller kinematic links
based on the curvature of the skeleton curve. This division
is performed by starting at the parent indicator node and it-
eratively including skeleton points until the corresponding
volume points become nonlinear. Nonlinearity is tested by
applying a threshold to the skewness of the volume points
with respect to the line between the first and last included
skeleton point. When the nonlinearity occurs, a segment,
representing a joint placement, is set at the last included
skeleton point. The segment then becomes the first node
in the determination of the next link and the process iter-
ates until the next indicator node is reached. The length of
these segments, relative to the length of the whole branch, is
recorded in the branch. The specific kinematic models de-
rived from the volume sequence may have different branch
lengths and each branch may be divided into a different
number of links.
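The iterative division of a branch can be sketched as follows. Note the substitution: instead of thresholding the skewness of the volume points, this sketch tests how far the included skeleton points deviate from the line between the first and last included point. The names `point_line_distance` and `segment_branch` and the `tolerance` parameter are assumptions made for illustration.

```python
import math


def point_line_distance(p, a, b):
    """Distance from p to the infinite line through a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        return math.dist(p, a)
    t = sum(x * y for x, y in zip(ap, ab)) / ab_len2
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def segment_branch(curve, tolerance):
    """Return indices of skeleton points chosen as joint placements."""
    joints = []
    start, end = 0, 1
    while end < len(curve):
        straight = all(
            point_line_distance(curve[i], curve[start], curve[end]) <= tolerance
            for i in range(start + 1, end)
        )
        if straight:
            end += 1  # keep growing the current link
        else:
            joints.append(end - 1)  # last point that kept the link straight
            start = end - 1         # the joint starts the next link
    return joints


# An L-shaped branch with its elbow at index 3.
branch = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 1, 0), (3, 2, 0)]
print(segment_branch(branch, tolerance=0.1))
```

The tolerance plays the same role as the nonlinearity threshold: a looser value yields fewer, longer links.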
In the second pass, a normalization procedure is used
across all frame-specific models to produce a common
model for the sequence. For normalization, we aim to align
all specific models in the sequence and look for groupings
of joints. The alignment method we used iteratively col-
lapsed two models in subsequent frames using a matching
procedure to find correspondences. The matching procedure
uses summed error values of minimum squared distance
between branch parents, the difference between angles
of branches, and the difference between branch lengths.
The normalization procedure finds the mapping that mini-
mizes the total error value. We have also begun to experi-
ment with a simpler alternative alignment procedure. This
procedure uses Isomap to align by constructing neighbor-
hoods for each skeleton point that considers its intra-frame
skeleton curve neighbors and corresponding points on the
skeleton curve in adjacent frames.
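The summed-error matching can be sketched as an exhaustive search over branch correspondences. This is a hedged illustration, not the authors' procedure: the branch representation (a parent position, a unit direction, and a length), the equal weighting of the three terms, the brute-force search, and the names `branch_error` and `best_mapping` are all assumptions, and both frames are assumed to have the same number of branches.

```python
import math
from itertools import permutations


def branch_error(a, b):
    """Sum of the three error terms for one pair of branches."""
    parent_term = math.dist(a["parent"], b["parent"]) ** 2
    dot = sum(x * y for x, y in zip(a["direction"], b["direction"]))
    angle_term = math.acos(max(-1.0, min(1.0, dot)))  # unit directions assumed
    length_term = abs(a["length"] - b["length"])
    return parent_term + angle_term + length_term


def best_mapping(branches_a, branches_b):
    """Return (total error, mapping) minimizing the summed branch error."""
    best = None
    for perm in permutations(range(len(branches_b))):
        total = sum(branch_error(a, branches_b[j])
                    for a, j in zip(branches_a, perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


frame_a = [
    {"parent": (0, 0, 0), "direction": (1, 0, 0), "length": 5.0},
    {"parent": (0, 0, 0), "direction": (-1, 0, 0), "length": 5.0},
]
frame_b = [  # same two limbs, listed in the opposite order
    {"parent": (0, 0, 0), "direction": (-1, 0, 0), "length": 5.2},
    {"parent": (0, 0, 0), "direction": (1, 0, 0), "length": 4.9},
]
total, mapping = best_mapping(frame_a, frame_b)
print(mapping)
```

Exhaustive search grows factorially with the number of branches, so it is only reasonable for the handful of branches in a single skeleton curve.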
Once the specific kinematic models are aligned, cluster-
ing on each branch is performed to identify joint positions.
Each branch is normalized by averaging the length of the
branch and number of links in the branch. The location of
the aligned joint locations along the branch forms a 1D data
sequence. An example is shown (Figure 3) for a branch with
an average number of joints rounded to three. In this figure,
the joint positions roughly form three sparse clusters of joint
points along the branch, with some outliers. To identify the
joint clusters, we used a clustering method that estimates
density of all joint locations and places a joint cluster where
peaks in the density are found.
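Placing joint clusters at density peaks can be sketched with a one-dimensional kernel density estimate evaluated on a grid. The Gaussian kernel, the bandwidth, the grid resolution, and the names `density` and `joint_clusters` are assumptions made for illustration; joint locations are assumed to be normalized to the interval [0, 1] along the branch.

```python
import math


def density(x, samples, bandwidth):
    """Unnormalized Gaussian kernel density estimate at position x."""
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


def joint_clusters(samples, bandwidth, steps=100):
    """Return grid positions in (0, 1) where the density has a local peak."""
    grid = [i / steps for i in range(steps + 1)]
    values = [density(x, samples, bandwidth) for x in grid]
    peaks = []
    for i in range(1, steps):
        if values[i] > values[i - 1] and values[i] >= values[i + 1]:
            peaks.append(grid[i])
    return peaks


# Aligned joint locations along one branch, gathered from several frames.
locations = [0.24, 0.25, 0.26, 0.49, 0.50, 0.51, 0.74, 0.75, 0.76]
peaks = joint_clusters(locations, bandwidth=0.03)
print(peaks)
```

The bandwidth controls how readily nearby joint locations merge into one cluster; a wider kernel suppresses outliers at the cost of resolving fewer joints.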
Figure 3. (top) Aligned segmentation points (stars) and joint
clusters (circles) of one of the branches in the synthetic data.
(bottom) The normalized kinematic model (circles as joints) with
respect to the aligned skeleton curve sequence.
In the third pass, the common kinematic model is applied
to the skeleton curve in each frame to find the motion of the
model (Figure 3). The coordinate system of the root node of
the model is always aligned to the world coordinate system.
For every joint, the direction of the link is the Z axis of
the joint’s coordinate system. The Y axis of the joint is
derived by the cross product of its Z axis and its parent’s X
axis. The cross product of the Y and Z axes is the X axis of
the joint. The world space coordinate system for each joint
is converted to a local coordinate system by determining
its 3D rotational transformation from its parent. The set of
these 3D rotations provides the joint angle configuration for
the current posture of the derived model.
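The frame construction above can be sketched directly from the cross products. The names `joint_frame` and `local_rotation` are hypothetical, frames are stored as a triple of axis vectors (X, Y, Z), and a link parallel to its parent's X axis is treated as an error because the construction is undefined in that case. Converting the resulting rotation matrix to joint angles is not shown.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / length for c in v)


def joint_frame(link_direction, parent_x):
    """World-space axes (X, Y, Z) of a joint, with Z along the link."""
    z = normalize(link_direction)
    y = normalize(cross(z, parent_x))  # Y = Z x parent's X
    x = cross(y, z)                    # X = Y x Z
    return x, y, z


def local_rotation(parent_frame, child_frame):
    """Rotation of the child relative to its parent: R_parent^T R_child,
    where the columns of each R are that frame's axes."""
    return [[dot(p_axis, c_axis) for c_axis in child_frame]
            for p_axis in parent_frame]


world = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
frame = joint_frame(link_direction=(0.0, 2.0, 0.0), parent_x=world[0])
rotation = local_rotation(world, frame)
print(frame)
print(rotation)
```

In the example, a link pointing along the world Y axis produces a local rotation of -90 degrees about X, which carries the parent's Z axis onto the link direction.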
5. Results and Observations
In this section, we describe the implementation of our
markerless model and motion capture approach and the re-
sults from its application to both captured human volume
data and synthetic data. The human volume data contain
two different motion sequences: waving and jumping jacks.
Figure 4. Partitioning of the pose-invariant volume (top), its tree-structured principal curves (middle), and their projection back into Euclidean space (bottom).
Our approach was implemented in Matlab, with our volume
capture implementation in Microsoft Visual C++. The exe-
cution of the entire implementation was performed on a 350
MHz Pentium with 128 MB of memory.
For each human motion sequence, a volume sequence
was captured and stored for offline processing by the model
and motion capture procedure. Using the Intel Image Processing
Library, we were able to capture volumes within an
80 × 80 × 50 grid of cubic $50\,\mathrm{mm}^3$ voxels at 10 Hz. Each
volume sequence consisted of roughly 50 frames. Due to
our frugal choices for camera and framegrabber options,
our ability to capture human volumes was significantly restricted.
Our image technology allowed for 320 × 240 image
data from each camera, which produced several artifacts
such as incorrectly activated voxels from shadows, occlu-
sion ghosting, and image noise. This limitation restricted
our capture motions to exaggerated, but usable, motion,
where the limbs were very distinct from each other. Im-
proving our proof-of-concept volume capture system, with
more and better cameras, lighting, and computer vision
techniques, will vastly improve our capture system, without
having to adjust the model and motion capture procedure.
Using the captured volume sequences, our model and
motion capture mechanism was able to accurately deter-
mine appropriate postures for each volume without fail.
We used the same user parameters for each motion, consisting
of an Isomap epsilon-ball neighborhood of radius
$(50\,\mathrm{mm}^3)^{1/2}$ and 25 for the number of concentric sphere
partitions. In addition to accurate postures, the derived
kinematic model parameters for each sequence appropriately
matched the kinematics of the capture subject. However,
for camera captured volume data, a significant amount of
noise occurred between subsequent frames in the produced
motion sequence. Noise is typical for many instrumented
motion capture systems and should be expected when inde-
pendently processing frames for temporally dependent mo-
tion. We were able to clean up this noise to produce aes-
thetically viable motion using standard low pass filtering.
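As one simple example of low-pass filtering, a centered moving average can be applied to each joint angle trajectory. The specific filter used for the results is not stated, so the moving average, the window size, and the name `moving_average` are assumptions made for illustration.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks at the sequence ends."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


# One joint angle (degrees) over five frames, with a spike in frame 2.
angles = [10.0, 12.0, 30.0, 14.0, 16.0]
smoothed = moving_average(angles, window=3)
print(smoothed)
```

A wider window removes more frame-to-frame jitter but also flattens fast motion, so the window is a trade-off against the 10 Hz capture rate.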
When applied to synthetic data, our method can reconstruct
the original kinematic model with reasonable accuracy.
These data were subject to the problem of over-segmentation,
i.e., joints are placed where there is in fact
only one straight link. There are three causes for this prob-
lem. First, a joint will always be placed at branching nodes
in the skeleton curves. A link will be segmented if another
link is branching from its side. Second, the root node of the
skeleton curve is always classified as a joint, even if it is
placed in the middle of an actual link. Third, noise in the
volume data may add fluctuation of the skeleton curves and
cause unwanted segments.
Motions were output to files in the Biovision
BVH motion capture format. Figure 5 shows the
kinematic posture output for each motion. More
images and movies of our results are available at
http://robotics.usc.edu/~cjenkins/markerless/.
In observing the performance of our markerless model
and motion capture system, several benefits of our approach
became evident. First, the relative speed of our capture
procedure made the processing of each frame of a motion
tractable. Depending on the number of volume points, the
elapsed time for producing a posture from a volume by our
Matlab implementation ranged between 60 and 90 seconds,
with approximately 90 percent of this time spent for Isomap
processing. Further improvements can be made to our implementation
to speed up the procedure and process volumes with increasingly
finer resolution. Second, our implementation required no explicit
model of human kinematics, no initialization procedure, and no
optimization of parameters with respect to a volume. Our model-free
NSS procedure produced a representative skeleton curve description
of a human posture based on the geometry of the volume.
Lastly, the skeleton curve may be a useful representation
of posture in and of itself. Rigid-body motion is often represented
through kinematics that are typically model-specific. Instead,
the skeleton curve may allow for an expression of
motion that can be shared between kinematic models, for
purposes such as robot imitation.
Figure 5. Results from producing kinematic motion for human waving, jumping jacks, and synthetic object motion (rows). The results are shown as a snapshot of the performing human or object, the captured or generated point volume data, the pose-invariant volume, and the derived kinematic posture (columns).
6. Issues for Future Work
Using our current work as a platform, we aim to im-
prove our ability to collect human motion data in various
scenarios. Motion data are critically important for other related
projects, such as the derivation of behavior vocabularies [11].
Areas for further improvement to our capture approach include:
i) a more consistent mechanism for segmenting skeleton curve
branches, ii) different mechanisms for aligning and clustering
joints from specific kinematic models in a sequence,
iii) automatically deriving kinematic models and motion for
kinematic topologies containing cycles (i.e., “bridges”, volumes
of genus greater than zero), iv) exploring connections between
model-free methods for robust model creation and initialization
and model-based methods for robust temporal tracking,
v) extensions to Isomap for volumes of greater resolution and
faster processing of data, and vi) using better computer vision
techniques for volume capture to extend the types of subject
motion that can be converted into kinematic motion.
7. Conclusion
We have presented an approach for model-free marker-
less model and motion capture. In our approach, a kine-
matic model and joint angle motion are extracted from vol-
ume sequences of subjects with arbitrary tree-structured
kinematics. We have presented the application of Isomap
nonlinear dimension reduction to volume data for both the
removal of pose-dependent nonlinearities and the extraction of
skeleton curve features for a captured human volume. We
proposed an approach, nonlinear spherical shells, for ex-
tracting skeleton curve features from a human volume. This
feature extraction is placed within the context of a larger ap-
proach for capturing a kinematic model and corresponding
motion. Our approach was successfully applied to different
types of subject motion.
8. Acknowledgments
This research was partially supported by the DARPA
MARS Program grant DABT63-99-1-0015 and ONR
MURI grant N00014-01-1-0890. The authors wish to thank
Gabriel Brostow for valuable discussions and feedback.
References
[1] C. M. Bishop. Neural Networks for Pattern Recognition.
Oxford University Press, 1995.
[2] J.-Y. Bouguet. Camera calibration toolbox for Matlab.
http://www.vision.caltech.edu/bouguetj/calib_doc/index.html.
[3] C. Bregler and J. Malik. Tracking people with twists and
exponential maps. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 8–15, Santa Barbara, CA,
USA, 1998.
[4] K. M. Cheung, T. Kanade, J.-Y. Bouguet, and M. Holler. A
real time system for robust 3d voxel reconstruction of hu-
man motions. In Proceedings of the 2000 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR ’00),
volume 2, pages 714 – 720, June 2000.
[5] J. D. Cohen, M. C. Lin, D. Manocha, and M. K. Ponamgi. I-
COLLIDE: An interactive and exact collision detection sys-
tem for large-scale environments. In Proceedings of the
1995 symposium on Interactive 3D graphics, pages 189–
196, 218, Monterey, CA, USA, 1995. ACM Press.
[6] J. Deutscher, A. Blake, and I. Reid. Articulated body motion
capture by annealed particle filtering. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recog-
nition, volume 2, pages 126–133, Hilton Head, SC, USA,
2000.
[7] A. R. Francois and G. G. Medioni. Adaptive color
background modeling for real-time segmentation of video
streams. In Proceedings of the International Conference on Imaging
Science, Systems, and Technology, pages 227–232, Las Vegas,
NV, USA, June 1999.
[8] D. Gavrila and L. Davis. 3d model-based tracking of hu-
mans in action: A multi-view approach. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 73–80,
San Francisco, CA, USA, 1996.
[9] T. Hastie and W. Stuetzle. Principal curves. Journal of the
American Statistical Association, 84:502–516, 1989.
[10] A. Hilton, J. Starck, and G. Collins. From 3d shape
capture to animated models. In 1st International Sympo-
sium on 3D Data Processing Visualization and Transmission
(3DPVT’02), pages 246–257, Padova, Italy, Jun 2002.
[11] O. C. Jenkins and M. J. Matarić. Automated derivation of
behavior vocabularies for autonomous humanoid motion. To
appear in the Second International Joint Conference on
Autonomous Agents and Multiagent Systems (Agents 2003),
Melbourne, Australia, July 2003.
[12] B. Kégl, A. Krzyzak, T. Linder, and K. Zeger. Learning and
design of principal curves. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(3):281–297, 2000.
[13] J. Luck, D. Small, and C. Q. Little. Real-time tracking of
articulated human models using a 3d shape-from-silhouette
method. In Robot Vision, International Workshop RobVis,
volume 1998, pages 19–26, Feb 2001.
[14] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman. Articu-
lated body posture estimation from multi-camera voxel data.
In IEEE International Conference on Computer Vision and
Pattern Recognition, pages 455–460, Kauai, HI, USA, De-
cember 2001.
[15] T. Moeslund and E. Granum. A survey of computer vision-
based human motion capture. Computer Vision and Image
Understanding, 81(3):231–268, March 2001.
[16] S. G. Penny, J. Smith, and A. Bernhardt. Traces: Wireless
full body tracking in the cave. In Ninth International Con-
ference on Artificial Reality and Telexistence (ICAT’99), De-
cember 1999.
[17] S. T. Roweis and L. K. Saul. Nonlinear dimension-
ality reduction by locally linear embedding. Science,
290(5500):2323–2326, 2000.
[18] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear
component analysis as a kernel eigenvalue problem. Neural
Computation, 10(5):1299–1319, 1998.
[19] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruc-
tion by voxel coloring. In Proc. Computer Vision and Pat-
tern Recognition Conf., pages 1067–1073, 1997.
[20] D. Shepard. A two-dimensional interpolation function for
irregularly-spaced data. In Proceedings of the ACM national
conference, pages 517–524. ACM Press, 1968.
[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global
geometric framework for nonlinear dimensionality reduc-
tion. Science, 290(5500):2319–2323, 2000.
[22] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland.
Pfinder: Real-time tracking of the human body. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
19(7):780–785, 1997.
[23] Y. Zhou and A. W. Toga. Efficient skeletonization of vol-
umetric objects. IEEE Transactions on Visualization and
Computer Graphics, 5(3):196–209, July-September 1999.