First-Person Pose Recognition using Egocentric Workspaces


Grégory Rogez1,2, James S. Supančič III1, Deva Ramanan1
1Dept of Computer Science, University of California, Irvine, CA, USA
2Universidad de Zaragoza, Zaragoza, Spain
grogez,jsupanci,
Abstract
We tackle the problem of estimating the 3D pose of an
individual’s upper limbs (arms+hands) from a chest-mounted
depth camera. Importantly, we consider pose estimation
during everyday interactions with objects. Past work shows
that strong pose+viewpoint priors and depth-based features
are crucial for robust performance. In egocentric views,
hands and arms are observable within a well-defined
volume in front of the camera. We call this volume an
egocentric workspace. A notable property is that hand
appearance correlates with workspace location. To exploit this
correlation, we classify arm+hand configurations in a global
egocentric coordinate frame, rather than a local scanning
window. This greatly simplifies the architecture and improves
performance. We propose an efficient pipeline which
1) generates synthetic workspace exemplars for training using a
virtual chest-mounted camera whose intrinsic parameters
match our physical camera, 2) computes perspective-aware
depth features on this entire volume and 3) recognizes
discrete arm+hand pose classes through a sparse multi-class
SVM. We achieve state-of-the-art hand pose recognition
performance from egocentric RGB-D images in real-time.
1. Introduction
Understanding hand poses and hand-object manipula-
tions from a wearable camera has potential applications
in assisted living [23], augmented reality [6] and life log-
ging [19]. As opposed to hand-pose recognition from third-
person views, egocentric views may be more difficult due
to additional occlusions (from manipulated objects, or self-
occlusions of fingers by the palm) and the fact that hands
interact with the environment and often leave the field-of-
view. The latter necessitates constant re-initialization, pre-
cluding the use of a large body of hand trackers which typ-
ically perform well given manual initialization.
Previous work for egocentric hand analysis tends to rely
on local 2D features, such as pixel-level skin classification
[17,18] or gradient-based processing of depth maps with
scanning-window templates [25]. Our approach follows in
the tradition of [25], who argue that near-field depth
measurements obtained from an egocentric depth sensor
considerably simplify hand analysis. Interestingly, egocentric depth
is not “cheating” in the sense that humans make use of
stereoscopic depth cues for near-field manipulations [7]. We
extend this observation by building an explicit 3D map of the
observable near-field workspace.
Figure 1. Egocentric workspaces. We directly model the
observable egocentric workspace in front of a human with a 3D
volumetric descriptor, extracted from a 2.5D egocentric depth
sensor. In this example, this volume is discretized into 4×3×4
bins. This feature can be used to accurately predict shoulder,
arm, hand poses, even when interacting with objects. We describe
models learned from synthetic examples of observable egocentric
workspaces obtained by placing a virtual Intel Creative camera on
the chest of an animated character.
Contributions: In this work, we describe a new com-
putational architecture that makes use of global egocentric
views, volumetric representations, and contextual models
of interacting objects and human-bodies. Rather than de-
tecting hands with a local (translation-invariant) scanning-
window classifier, we process the entire global egocentric
view (or work-space) in front of the observer (Fig. 1). Hand
appearance is not translation-invariant due to perspective ef-
fects and kinematic constraints with the arm. To capture
such effects, we build a library of synthetic 3D egocentric
workspaces generated using real capture conditions (see ex-
amples in Fig. 2). We animate a 3D human character model
inside virtual scenes with objects, and render such anima-
tions with a chest-mounted camera whose intrinsics match
our physical camera. We simultaneously recognize arm
and hand poses while interacting with objects by classify-
ing the whole 3D volume using a multi-class Support Vec-
tor Machine (SVM) classifier. Recognition is simple and
fast enough to be implemented in 4 lines of code.
1.1. Related work
Hand-object pose estimation: While there is a large
body of work on hand-tracking [14,13,12,1,22,31,34],
we focus on hand pose estimation during object manipulations.
Object interactions complicate analysis due
to additional occlusions, but also provide additional
contextual constraints (hands cannot penetrate object geometry,
for example). [10] describe an articulated tracker with soft
anti-penetration constraints, increasing robustness to occlu-
sion. Hamer et al. describe contextual priors for hands in
relation to objects [9], and demonstrate their effectiveness
for increasing tracking accuracy. Objects are easier to ani-
mate than hands because they have fewer joint parameters.
With this intuition, object motion can be used as an input
signal for estimating hand motions [8]. [26] use a large
synthetic dataset of hands manipulating objects, similar to us.
We differ in our focus on single-image and egocentric analysis.
Egocentric Vision: Previous studies have focused on ac-
tivities of daily living [23,5]. Long-scale temporal structure
was used to handle complex hand object interactions, ex-
ploiting the fact that objects look different when they are
manipulated (active) versus not manipulated (passive) [23].
Much previous work on egocentric hand recognition makes
exclusive use of RGB cues [18,16], while we focus on vol-
umetric depth cues. Notable exceptions include [3], who
employ egocentric RGB-D sensors for personal workspace
monitoring in industrial environments and [20], who em-
ploy such sensors to assist blind users in navigation.
Depth features: Previous work has shown the efficacy
of depth cues [28,35]. We compute volumetric depth
features from point clouds. Previous work has examined
point-cloud processing of depth-images [36,29,35]. A common
technique estimates local surface orientations and normals
[36,35], but this may be sensitive to noise since it requires
derivative computations. We use simpler volumetric
features, similar to [30] except that we use a spherical
coordinate frame that does not slide along a scanning window (we
want to measure depth in an egocentric coordinate frame).
Figure 2. Synthesis: We generate a training set by sampling
different dimensions of a workspace model, yielding a total
number of N_arm × N_hand × N_object × N_background samples. We
sample N_arm arm poses, a fixed set of hand-object configurations
(N_hand × N_object = 100) and a fixed set of N_background
background scenes captured with an Intel Creative depth camera. For
each hand-object model, we randomly perturb shoulder, arm and
hand joint angles to generate physically possible arm+hand+object
configurations. We show 2 examples of a bottle-grasp (left) and a
juice-box-grasp (right) rendered in front of a flat wall.
Non-parametric recognition: Our work is inspired by
non-parametric techniques that make use of synthetic train-
ing data [26,27,10,2,33]. [27] make use of pose-sensitive
hashing techniques for efficient matching of synthetic RGB
images rendered with Poser. We generate synthetic depth
images, mimicking capture conditions of our actual camera.
2. Training data
We begin by generating a training set of realistic 3D ego-
centric workspaces. Specifically, we render synthetic 3D
hand-object data (generated from a 3D animation system)
on top of real 3D background scenes, making use of the test
camera projection matrix. Because egocentric scenes in-
volve objects that lie close to the camera, we found it useful
to model camera-specific perspective effects.
Poser models. Our in-house grasp database is
constructed by modifying the commercial Everyday hands
Grasp Poser library [4]. We vary the objects being
interacted with, as well as the clothing of the character, i.e.,
with and without sleeves. We use more than 200 grasping
postures and 49 objects, including kitchen utensils,
personal bathroom items, office/classroom objects, fruits, etc.
Additionally we use 6 models of empty hands: wave, fist,
thumbs-up, point, open/close fingers. Some objects can be
handled with different canonical grasps. For example, one
can grip a bottle by its body or by its lid when opening it.
We manually add such variants.
Figure 3. Examples of synthetic training images (panels (a) to (d)).
Our rendering pipeline produces realistic depth maps consisting of
multiple hands manipulating household objects in front of everyday
backgrounds.
Kinematic model. Let θ be a vector of arm joint
angles, and let φ be a vector of grasp-specific hand joint
angles, obtained from the above set of Poser models. We use
a standard forward kinematic chain to convert the location
of finger joints u (in a local coordinate system) to image
coordinates:
$$p = C \prod_i T(\theta_i) \prod_j T(\phi_j)\, u, \quad \text{where } T, C \in \mathbb{R}^{4 \times 4},$$
$$u = [u_x \; u_y \; u_z \; 1]^T, \quad (x, y) = \left(f \frac{p_x}{p_z},\; f \frac{p_y}{p_z}\right), \qquad (1)$$
where T specifies rigid-body transformations (rotation and
translation) along the kinematic chain and C specifies the
extrinsic camera parameters. Here p represents the 3D
position of point u in the camera coordinate system. To generate
the corresponding image point, we assume camera
intrinsics are given by identity scale factors and a focal length
f (though it is straightforward to use more complex
intrinsic parameterizations). We found it important to use
the f corresponding to our physical camera, as it is
crucial to correctly model perspective effects for our near-field
workspace.
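As an illustration of the kinematic chain and pinhole projection described in this paragraph, the following is a minimal sketch and not the authors' implementation. The helper names `matmul`, `rot_z` and `project` are hypothetical, each joint is assumed to rotate about its local z-axis only, and the camera uses a single focal length f with no principal-point offset, as in the text.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rot_z(angle, tx=0.0, ty=0.0, tz=0.0):
    """Rigid transform: rotate about the local z-axis, then translate by (tx, ty, tz)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def project(C, transforms, u, f):
    """Compute p = C * T_1 * ... * T_n * u and the image point (f*px/pz, f*py/pz)."""
    M = C
    for T in transforms:          # accumulate the chain from the camera outwards
        M = matmul(M, T)
    uh = [u[0], u[1], u[2], 1.0]  # homogeneous local coordinates
    p = [sum(M[i][k] * uh[k] for k in range(4)) for i in range(3)]
    return p, (f * p[0] / p[2], f * p[1] / p[2])

# Toy example: one joint rotated by 90 degrees, placed 0.5 m in front of the camera.
p, xy = project(rot_z(0.0), [rot_z(math.pi / 2, tz=0.5)], (0.1, 0.0, 0.0), f=500.0)
```

In this toy example the point ends up 0.5 m from the camera along the optical axis, so a 0.1 m lateral offset maps to roughly 100 pixels for f = 500; halving the depth would double that displacement, which is the near-field perspective effect the text refers to.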
Pose synthesis: We wish to generate a large set of
postured hands. However, building a generative model of
grasps is not trivial. One option is to take a data-driven
approach and collect training samples using motion cap-
ture [24]. Instead, we take a model-driven approach that
perturbs a small set of manually-defined canonical postures.
To ensure that physically plausible perturbations are gener-
ated, we take a simple rejection sampling approach. We fix
φparameters to respect the hand grasps from Poser, and add
small Gaussian perturbations to arm joint angles
$$\theta'_i = \theta_i + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
Importantly, this generates hand joints p at different
translations and viewpoints, correctly modeling the dependencies
between both. For each perturbed pose, we render hand
joints using (1) and keep exemplars that are 90% visible
(i.e., their projected (x, y) coordinates lie within the image
boundaries). We show examples in Fig. 2.
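The perturb-and-reject procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: `render` is a hypothetical callback that maps a vector of joint angles to projected 2D keypoints, and "90% visible" is interpreted here as at least 90% of the keypoints falling inside the image.

```python
import random

def visible_fraction(points_2d, width, height):
    """Fraction of projected keypoints that fall inside the image."""
    inside = sum(1 for x, y in points_2d if 0 <= x < width and 0 <= y < height)
    return inside / len(points_2d)

def sample_poses(theta, sigma, render, n_samples, width, height,
                 min_visible=0.9, max_tries=10000, rng=None):
    """Perturb arm angles with Gaussian noise and keep only mostly-visible poses."""
    rng = rng or random.Random(0)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_samples:
            break
        # theta'_i = theta_i + eps, with eps drawn from N(0, sigma^2)
        candidate = [t + rng.gauss(0.0, sigma) for t in theta]
        # Rejection step: discard candidates whose keypoints leave the image.
        if visible_fraction(render(candidate), width, height) >= min_visible:
            kept.append(candidate)
    return kept
```

The `max_tries` cap is a safeguard added for the sketch so that a canonical pose lying almost entirely outside the field of view cannot loop forever.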
Depth maps. Associated with each rendered set of key-
points, we would also like a depth map. To construct a depth
map, we represent each rigid limb with a dense cloud of 3D
vertices {ui}. We produce this cloud by (over) sampling the
3D meshes defining each rigid-body shape. We render this
dense cloud using forward kinematics (1), producing a set
of points {p_i} = {(p_{x,i}, p_{y,i}, p_{z,i})}. We define a 2D depth
map z[u, v] by ray-tracing. Specifically, we cast a ray from
the origin, in the direction of each image (or depth sensor)
pixel location (u, v) and find the closest point:
$$z[u, v] = \min_{k \in \mathrm{Ray}(u, v)} \lVert p_k \rVert \qquad (2)$$
where Ray(u, v) denotes the set of points on (or near) the
ray passing through pixel (u, v). We found the above
approach simpler to implement than hidden surface removal,
so long as we projected a sufficiently dense cloud of points.
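A minimal sketch of the depth-map construction in (2) follows, assuming that "points on (or near) the ray" means points whose pinhole projection lands in the same pixel. The function name `render_depth` and the centred principal point (cx, cy) are assumptions made for the example and are not taken from the paper.

```python
import math

def render_depth(points, f, width, height):
    """Per-pixel minimum range ||p|| over the points projecting to that pixel."""
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # assumed principal point at the image centre
    for px, py, pz in points:
        if pz <= 0:
            continue  # behind the camera
        u = int(math.floor(f * px / pz + cx))
        v = int(math.floor(f * py / pz + cy))
        if 0 <= u < width and 0 <= v < height:
            r = math.sqrt(px * px + py * py + pz * pz)
            depth[v][u] = min(depth[v][u], r)  # keep the closest point on this ray
    return depth
```

Pixels that receive no point keep the value `math.inf`, which plays the role of "no measurement"; a sparse cloud would leave holes, which is why the text stresses projecting a sufficiently dense cloud.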
Multiple hands: Some object interactions require mul-
tiple hands interacting with a single object. Additionally,
many views contain the second hand in the “background”.
For example, two hands are visible in roughly 25% of the
frames in our benchmark videos. We would like our train-
ing dataset to have similar statistics. Our existing Poser li-
brary contains mostly single-hand grasps. To generate ad-
ditional multi-arm egocentric views, we randomly pair 25%
of the arm poses with a mirrored copy of another randomly-
chosen pose. We then add noise to the arm joint angles,
as described above. Such a procedure may generate unnat-
ural or self-intersecting poses. To efficiently remove such
cases, we separately generate depth maps for the left and
right arms, and only keep pairings that produce compatible
depth maps:
$$|z_{left}[u, v] - z_{right}[u, v]| > \delta \quad \forall u, v \qquad (3)$$
We find this simple procedure produces surprisingly
realistic multi-arm configurations (Fig. 3). Finally we add
background clutter from depth maps of real egocentric scenes
(not from our benchmark data). We use the above approach
to generate over 100,000 multi-hand(+arm+objects)
configurations and associated depth-maps.
Figure 4. Volume quantization. We quantize those points that fall
within the egocentric workspace (observable volume within zmax =
70cm) into a binary spherical voxel grid of Nu × Nv × Nw voxels (a).
We vary the azimuth angle α to generate equal-size projections on
the image plane (b). Spherical bins ensure that voxels at different
distances project to the same image area (c). This allows for efficient
feature computation and occlusion handling, since occluded voxels
along the same line-of-sight can easily be identified.
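The compatibility test in (3) can be sketched as below. One assumption is made explicit: pixels where either arm is absent (stored here as `math.inf`) are skipped, since two surfaces cannot intersect where only one of them is present. The function name `compatible` is hypothetical.

```python
import math

def compatible(z_left, z_right, delta):
    """True if the two arm depth maps never come within delta of each other.

    Assumption: a pixel that is empty (math.inf) in either map is skipped.
    """
    for row_l, row_r in zip(z_left, z_right):
        for zl, zr in zip(row_l, row_r):
            if math.isinf(zl) or math.isinf(zr):
                continue
            if abs(zl - zr) <= delta:
                return False  # the arms overlap in the image at nearly the same depth
    return True
```

A pairing of a left and a mirrored right arm pose would be kept only when `compatible` returns True for their separately rendered depth maps.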
3. Formulation
3.1. Perspective-aware depth features
It may seem attractive to work in orthographic (or scaled
orthographic) coordinates, as this simplifies much of 3D
analysis. Instead, we posit that perspective distortion is use-
ful in egocentric settings and should be exploited: objects
of interest (hands, arms, and manipulated things) tend to
lie near the body and exhibit perspective effects. Specifi-
cally, parts of objects that are closer to the camera project
to a larger image size. To model such effects, we con-
struct a spherical bin histogram by gridding up the egocen-
tric workspace volume by varying azimuth and elevation an-
gles (Fig. 4). We demonstrate that this feature outperforms
orthographic counterparts, and is also faster to compute.
Binarized volumetric features: Much past work pro-
cesses depth maps as 2D rasterized sensor data. Though
convenient for applying efficient image processing routines
such as gradient computations (e.g., [32]), rasterization
may not fully capture the 3D nature of the data. Alter-
natively, one can convert depth maps to a full 3D point
cloud [15], but the result is orderless making operations
such as correspondence-estimation difficult. We propose
encoding depth data in a 3D volumetric representation, sim-
ilar to [30]. To do so, we can back-project the depth map
from (2) into a cloud of visible 3D points {p_k}, visualized
in Fig. 5-(b). They are a subset of the original cloud of 3D
points {p_i} in Fig. 5-(a). We now bin those visible points
that fall within the egocentric workspace in front of the
camera (observable volume within zmax = 70cm) into a binary
voxel grid of Nu × Nv × Nw voxels:
$$b[u, v, w] = \begin{cases} 1 & \text{if } \exists k \text{ s.t. } p_k \in F(u, v, w) \\ 0 & \text{otherwise,} \end{cases} \qquad (4)$$
where F(u, v, w) denotes the set of points within a voxel
centered at coordinate (u, v, w).
Spherical voxels: Past work tends to use rectilinear vox-
els [30,15]. Instead, we use a spherical binning structure,
centering the sphere at the camera origin (Fig. 4). At first
glance, this might seem strange because voxels now vary in
size – those further away from the camera are larger. The
main advantage of a “perspective-aware” binning scheme is
that all voxels now project to the same image area in pixels
(Fig. 4-(c)). We will show that this both increases accuracy
(because one can better reason about occlusions) and speed
(because volumetric computations are sparse).
Efficient quantization: Let us choose spherical bins
F(u, v, w) such that they project to a single pixel (u, v) in
the depth map. This allows one to compute the binary voxel
grid b[u, v, w] by simply “reading off” the depth value
z[u, v] at each pixel, quantizing it to z′[u, v], and assigning
1 to the corresponding voxel:
$$b[u, v, w] = \begin{cases} 1 & \text{if } w = z'[u, v] \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
This results in a sparse volumetric voxel feature, visualized
in Fig. 5-(c). Crucially, a spherical parameterization
allows one to efficiently reason about occlusions: once a depth
measurement is observed at position b[u0, v0, w0] = 1, all
voxels behind it (those with w ≥ w0) are occluded. This arises from
the fact that single camera depth measurements are, in fact,
2.5D. By convention, we define occluded voxels to be “1”.
Note that such occlusion reasoning is difficult with
orthographic parameterizations because voxels are not arranged
along line-of-sight rays.
Figure 5. Binarized volumetric feature. We synthesize training
examples by randomly perturbing shoulder, arm and hand joint angles in
a physically possible manner (a). For each example, a synthetic depth
map is created by projecting the visible set of dense 3D points using
a real-world camera projection matrix (b). The resulting 2D depth map
is then quantized with a regular grid in x-y directions and binned
in the viewing direction to compute our new binarized volumetric
feature (c). In this example, we use a 32 × 24 × 35 grid. Note that for
clarity we only show the sparse version of our 3D binary feature. We
also show the quantized depth map z[u, v] as a gray scale image (c).
In practice, we consider a coarse discretization of the
volume to make the problem more tractable. The depth map
z[x, y] is resized to Nu × Nv (smaller than depth map size)
and quantized in the z-direction. To minimize the effect of
noise when counting the points which fall in the different
voxels, we quantize the depth measurements by applying a
median filter on the pixel values within each image region:
$$\forall (u, v) \in [1, N_u] \times [1, N_v], \quad z'[u, v] = \frac{N_w}{z_{max}} \, \mathrm{median}\big(z[x, y] : (x, y) \in P(u, v)\big), \qquad (6)$$
where P(u, v) is the set of pixel coordinates in the original
depth map corresponding to pixel (u, v) in the resized depth map.
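The quantization in (6) and the backfilled binary feature can be sketched as follows. This is a sketch under assumptions: indices are 0-based, the scaled median is floored to an integer bin, regions with no measurement inside zmax are marked `None`, and the input image is assumed to be at least Nv × Nu pixels so that every region is non-empty. The function names are hypothetical.

```python
import math
from statistics import median

def quantize_depth(z, n_u, n_v, n_w, z_max):
    """Reduce a depth image (rows of metric depths) to n_v x n_u quantized bins.

    Returns z_q[v][u] in 0..n_w-1, or None where the region median is not within z_max.
    """
    height, width = len(z), len(z[0])
    z_q = [[None] * n_u for _ in range(n_v)]
    for v in range(n_v):
        for u in range(n_u):
            # P(u, v): the block of original pixels covered by the coarse pixel (u, v).
            block = [z[y][x]
                     for y in range(v * height // n_v, (v + 1) * height // n_v)
                     for x in range(u * width // n_u, (u + 1) * width // n_u)]
            m = median(block)
            if m < z_max:
                z_q[v][u] = min(n_w - 1, int(n_w * m / z_max))
    return z_q

def binary_feature(z_q, n_w):
    """b[v][u][w] = 1 for the observed voxel and every voxel behind it (w >= z_q)."""
    return [[[1 if zq is not None and w >= zq else 0 for w in range(n_w)]
             for zq in row] for row in z_q]
```

With a 32 × 24 × 35 grid the dense feature has 26,880 entries per frame, but it is fully determined by the 768 quantized depths, which is what makes the representation cheap to compute.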
3.2. Global pose classification
We quantize the set of poses from our synthetic database
into K coarse classes for each limb, and train a K-way
pose-classifier for pose-estimation. The classifier is linear
and makes use of our sparse volumetric features, making it
quite simple and efficient to implement.
Pose space quantization: For each training exemplar,
we generate the set of 3D keypoints: 17 joints (elbow +
wrist + 15 finger joints) and the 5 finger tips. Since we
want to recognize coarse limb (arm+hand) configurations,
we cluster the resulting training set by applying K-means
to the elbow+wrist+knuckle 3D joints. We usually rep-
resent each of the K resulting clusters using the average
3D/2D keypoint locations of both arm+hand (see examples
in Fig. 6). Note that K can be chosen as a compromise be-
tween accuracy and speed.
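The pose-space quantization step can be illustrated with a plain K-means (Lloyd's algorithm) sketch. It assumes each training exemplar has been flattened into a tuple holding its elbow, wrist and knuckle 3D coordinates; the function name `kmeans`, the random initialization and the fixed iteration count are choices made for the example, not details given in the paper.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on equal-length coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise with k distinct exemplars
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center in squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: mean of the members; an empty cluster keeps its old center.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels
```

Each returned label would serve as the class index for one exemplar, and each center (or the mean of the full keypoint set over the cluster members) as the pose reported for that class.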
Global classification: We use a linear SVM for a multi-
class classification of upper-limb poses. However, in-
stead of classifying local scanning-windows, we classify
global depth maps quantized into our binarized depth fea-
ture b[u, v, w]from (5). Global depth maps allow the clas-
sifier to exploit contextual interactions between multiple
hands, arms and objects. In particular, we find that
modeling arms is especially helpful for detecting hands. For
each class k ∈ {1, 2, ..., K}, we train a one-vs-all SVM
classifier, obtaining a weight vector which can be re-arranged into
an Nu × Nv × Nw tensor βk[u, v, w]. The score for class k
is then obtained by a simple dot product of this weight and
our binarized feature b[u, v, w]:
$$\mathrm{score}[k] = \sum_{u, v, w} \beta_k[u, v, w] \cdot b[u, v, w]. \qquad (7)$$
We visualize projections of the learned weight tensor
βk[u, v, w] in Fig. 6 and slices of the tensor in Fig. 7.
3.3. Joint feature extraction and classification
To increase run-time efficiency, we exploit the sparsity
of our binarized volumetric feature and jointly implement
feature extraction and SVM scoring. Since the final score is
a simple dot product with binary features, one can readily
extract the feature and update the score on the fly. Because
all voxels behind the first measurement are backfilled, the
SVM score for each class k from (7) can be written as:
$$\mathrm{score}[k] = \sum_{u, v} \beta'_k[u, v, z'[u, v]], \qquad (8)$$
where z′[u, v] is the quantized depth map and tensor
β′k[u, v, w] is the cumulative sum of the weight tensor along
dimension w, taken over all voxels at or behind w:
$$\beta'_k[u, v, w] = \sum_{d \ge w} \beta_k[u, v, d]. \qquad (9)$$
Note that the above cumulative-sum tensors can be
precomputed. This makes test-time classification (8) quite efficient.
Feature extraction and SVM classification can be computed
jointly following the algorithm presented in Alg. 1. Our
implementation runs at 275 frames per second.
Figure 6. Pose classifiers. We visualize the linear weight tensor
βk[u, v, w] learnt by the SVM for a 32 × 24 × 35 grid of binary
features for 3 different pose clusters. We plot a 2D (u, v) visualization
obtained by computing the max along w. We also visualize the
corresponding average 3D pose in the egocentric volume together with
the top 500 positive (light gray) and negative weights (dark gray)
within βk[u, v, w].
Figure 7. Weights along w. We visualize the SVM weights
βk[u, v, w] for a particular (u, v) location. Our histogram
encoding allows us to learn smooth nonlinear functions of depth values.
For example, the above weights respond positively to depth values
midway into the egocentric volume, but negatively to those closer.
input : Quantized depth map z′[u, v].
        Cumsum’ed weights {β′k[u, v, w]}.
output: score[k]
1 for u ∈ {1, 2, ..., Nu} do
2   for v ∈ {1, 2, ..., Nv} do
3     for k ∈ {1, 2, ..., K} do
4       score[k] += β′k[u, v, z′[u, v]]
Algorithm 1: Joint feature extraction & classification.
We jointly extract binarized depth features and evaluate
linear classifiers for all quantized poses k. We precompute
a “cumsum” β′k of our SVM weights. At each
location (u, v), we add all the weights corresponding
to the voxels behind z′[u, v], i.e. such that w ≥ z′[u, v].
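The lookup-based scoring of Alg. 1 can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: tensors are nested lists indexed as [v][u][w] with 0-based indices, pixels without a measurement are marked `None` and contribute nothing, and the function names are hypothetical.

```python
def reverse_cumsum(beta_k):
    """beta_cum[v][u][w] = sum of beta_k[v][u][d] over all d >= w."""
    out = []
    for row in beta_k:
        out_row = []
        for weights in row:
            acc, cum = 0.0, [0.0] * len(weights)
            for w in range(len(weights) - 1, -1, -1):  # accumulate from the far end
                acc += weights[w]
                cum[w] = acc
            out_row.append(cum)
        out.append(out_row)
    return out

def scores(z_q, betas_cum):
    """One table lookup per pixel and class, replacing the dense dot product."""
    result = [0.0] * len(betas_cum)
    for v, row in enumerate(z_q):
        for u, zq in enumerate(row):
            if zq is None:
                continue  # no depth measurement at this pixel
            for k, beta_cum in enumerate(betas_cum):
                result[k] += beta_cum[v][u][zq]
    return result
```

Because the backfilled feature is 1 exactly for w ≥ z′[u, v], the lookup returns the same value as the dense dot product in (7) while touching Nu·Nv·K entries instead of Nu·Nv·Nw·K.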
4. Experiments
For evaluation, we use the recently released UCI Ego-
centric dataset [25] and score hand pose detection as a proxy
for limb pose recognition (following the benchmark criteria
used in [25]). The dataset consists of 4 video sequences
(around 1000 frames each) of everyday egocentric scenes
with hand annotations every 10 frames.
Feature evaluation: We first compare hand detection
accuracy for different K-way SVM classifiers trained on
HOG on depth (as in [25]) and HOG on RGB-D, thus
exploiting the stereo-views provided by RGB and depth
sensors. To evaluate our voxel encoding, we also trained an
SVM directly on the quantized depth map z[u, v]
(without constructing a sparse binary feature). To evaluate our
perspective voxels, we compare to an orthographic version
of our binarized volumetric feature (similar to past work
[30,15]). In that case, we quantize those points that fall
within a 64 × 48 × 70 cm³ egocentric workspace in front of
the camera into a binary grid of cubic voxels:
$$b[u, v, w] = \begin{cases} 1 & \text{if } \exists i \text{ s.t. } (x_i, y_i, z_i) \in N(u, v, w) \\ 0 & \text{otherwise,} \end{cases}$$
where N(u, v, w) specifies a 2 × 2 × 2 cm cube centered at
voxel (u, v, w). Note that this feature is considerably more
involved to calculate, since it requires an explicit
backprojection and explicit geometric computations for binning.
Moreover, identifying occluded voxels is difficult because
they are not arranged along line-of-sight rays.
Figure 8. Feature evaluation. Panels: (a) feature comparison,
(b) feature resolution. We compare our feature encoding
to different variants (for K = 750 classes) in (a). Our feature
outperforms HOG-on-depth and HOG-on-RGBD. Our feature also
outperforms orthographic voxels and the raw quantized depth map,
which surprisingly itself outperforms all other baselines. When
combined with a linear classifier, our sparse encoding can learn
nonlinear functions of depth (see Fig. 7), while the raw depth map
can only learn linear functions. We also vary the resolution of our
feature in (b), again for K = 750. A size of 32 × 24 × 35 is a good
trade-off between size and performance. Doubling the resolution
in u, v marginally improves accuracy.
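For comparison, the orthographic baseline can be sketched as an explicit binning of back-projected points. The sketch assumes points are given in centimetres in the camera frame and that the 64 × 48 × 70 cm box is centred on the optical axis and starts at the camera; neither placement detail is stated in the text, and the function name is hypothetical.

```python
def orthographic_voxels(points, size_cm=(64.0, 48.0, 70.0), cell_cm=2.0):
    """Sparse binary occupancy grid of cubic cells over a box in front of the camera."""
    n_x, n_y, n_z = (int(s / cell_cm) for s in size_cm)
    occupied = set()
    for x, y, z in points:  # coordinates in centimetres, camera frame
        i = int((x + size_cm[0] / 2) // cell_cm)  # shift so the box is centred in x
        j = int((y + size_cm[1] / 2) // cell_cm)  # and in y
        k = int(z // cell_cm)
        if 0 <= i < n_x and 0 <= j < n_y and 0 <= k < n_z:
            occupied.add((i, j, k))
    return occupied, (n_x, n_y, n_z)
```

With 2 cm cells the grid has 32 × 24 × 35 cells, the same resolution as the spherical grid, but the cells in one (i, j) column do not lie along a single line of sight, so an occupied cell does not tell us which other cells are occluded.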
The results obtained with K = 750 pose classes are
reported in Fig. 8-(a). Our perspective binary features clearly
outperform other types of features. We reach 72%
detection accuracy while the state of the art [25] reports 60%
accuracy. Our volumetric feature has empirically strong
performance in egocentric settings. One reason is that it is
robust to small intra-cluster misalignment and deformations
because all voxels behind the first measurement are
backfilled. Second, it is sensitive to variations in apparent size
induced by perspective effects (because voxels have
consistent perspective projections). In Fig. 8-(b), we also show
results varying the resolution of the grid. Our choice of
32 × 24 × 35 is a good trade-off between feature
dimensionality and performance.
Figure 9. Clustering and size of training set. Panels: (a) detection
varying K, (b) detection varying size of training. In (a), we plot
performance as a function of K (the number of discretized pose
classes) for a fixed-size training set. For reference, we also plot
the state-of-the-art method from [25]. In (b), we plot performance
as we increase the amount of training data for a fixed K. Both results
suggest that our system may perform better with more training data
and more quantized poses. Please see text for further discussion.
We compare primarily to [25], as that method was al-
ready shown to outperform commercial (Intel PXC [11])
and fully-featured tracking systems [21]. Such systems per-
form poorly due to occlusions inherent in egocentric view-
points. Notably, [25] use local part templates in a scanning
window fashion. Our global approach captures correlations
between pose and spatial location, and better deals with oc-
clusion where local appearance can be misleading.
Training data and clustering: We evaluated the
performance of our algorithm when varying the number of
quantized pose classes K and the amount of training data.
Fig. 9-(a) varies K for a fixed training set of 120,000
training images. Performance maxes out relatively quickly at
K = 750, suggesting that our model may be overfitting
due to lack of training data. Fig. 9-(b) fixes K = 750 and
increases the amount of training data per quantized class.
Here, we see a more consistent increase in accuracy. These
results suggest that a massive training set and larger K may
produce better results.
Qualitative results: We illustrate successes in difficult
scenarios in Fig. 10 and analyze common failure modes in
Fig. 11. Please see the figures for additional discussion.
5. Conclusions
We have proposed a new approach to the problem of
egocentric 3D hand pose recognition during interactions with
objects. Instead of classifying local depth image regions
through a typical translation-invariant scanning window, we
have shown that classifying the global arm+hand+object
configurations within the “whole” egocentric workspace in
front of the camera allows for fast and accurate results. We
train our model by synthesizing workspace exemplars
consisting of hands, arms, objects and backgrounds. Our model
explicitly reasons about perspective occlusions while being
both conceptually and practically simple to implement (4
lines of code). We produce state-of-the-art real-time results
for egocentric pose estimation.
Figure 10. Good detections. We show frames where arm and hand are
correctly detected. First, we present some easy cases of hands in
free-space (top row). Noisy depth data and cluttered background cases
(middle row) showcase the robustness of our system, while novel
objects (bottom row: envelope, staple box, pan, double-handed cup and
lamp) require generalization to objects unseen at train-time.
Figure 11. Hard cases. We show cases where the pose is not correctly
recognized (sometimes not even detected): excessively-noisy depth
data, hands manipulating reflective material (phone or bottle of wine)
or malsegmentability cases of hands touching background. Panel labels:
reflective object (phone), bottle, noisy depth/clutter, unseen object
(keys), malsegmentability, ambiguous pose.
Acknowledgement. GR was supported by the European
Commission under FP7 Marie Curie IOF grant
“Egovision4Health” (PIOF-GA-2012-328288). JS and DR were
supported by NSF Grant 0954083, ONR-MURI Grant
N00014-10-1-0933, and the Intel Science and Technology
Center - Visual Computing.
[1] V. Athitsos and S. Sclaroff. Estimating 3d hand pose from a
cluttered image. In CVPR (2), pages 432–442, 2003. 2
[2] T. Y. D. Tang and T.-K. Kim. Real-time articulated hand
pose estimation using semi-supervised transductive regres-
sion forests. In ICCV, pages 1–8, 2013. 2
[3] D. Damen, A. P. Gee, W. W. Mayol-Cuevas, and A. Calway.
Egocentric real-time workspace monitoring using an rgb-d
camera. In IROS, 2012. 2
[4] Daz3D. Every-hands pose library. http://www.daz3d.
com/everyday-hands- poses-for- v4-and- m4,
2013. 2
[5] A. Fathi, A. Farhadi, and J. Rehg. Understanding egocentric
activities. In ICCV, 2011. 2
[6] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interac-
tions: A first-person perspective. In CVPR, pages 1226–
1233, 2012. 1
[7] A. R. Fielder and M. J. Moseley. Does stereopsis matter in
humans? Eye, 10(2):233–238, 1996. 1
[8] H. Hamer, J. Gall, R. Urtasun, and L. Van Gool. Data-driven
animation of hand-object interactions. In 2011 IEEE Inter-
national Conference on Automatic Face Gesture Recognition
and Workshops (FG 2011), pages 360–367. 2
[9] H. Hamer, J. Gall, T. Weise, and L. Van Gool. An object-
dependent hand pose prior from sparse training data. In 2010
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 671–678. 2
[10] H. Hamer, K. Schindler, E. Koller-Meier, and L. Van Gool.
Tracking a hand manipulating an object. In 2009 IEEE 12th
International Conference on Computer Vision, pages 1475–
1482. 2
[11] Intel. Perceptual computing sdk, 2013. 7
[12] M. Kölsch. An appearance-based prior for hand tracking. In
ACIVS (2), pages 292–303, 2010. 2
[13] M. Kölsch and M. Turk. Hand tracking with flocks of features.
In CVPR (2), page 1187, 2005. 2
[14] T. Kurata, T. Kato, M. Kourogi, K. Jung, and K. Endo. A
functionally-distributed hand tracking method for wearable
visual interfaces and its applications. In MVA, pages 84–89,
2002. 2
[15] K. Lai, L. Bo, and D. Fox. Unsupervised feature learning for
3d scene labeling. In ICRA, 2014. 4,7
[16] C. Li and K. M. Kitani. Model recommendation with virtual
probes for egocentric hand detection. In ICCV, 2013. 2
[17] C. Li and K. M. Kitani. Model recommendation with virtual
probes for egocentric hand detection. In ICCV, 2013. 1
[18] C. Li and K. M. Kitani. Pixel-level hand detection in ego-
centric videos. In CVPR, 2013. 1,2
[19] Z. Lu and K. Grauman. Story-driven summarization for ego-
centric video. In CVPR, 2013. 1
[20] S. Mann, J. Huang, R. Janzen, R. Lo, V. Rampersad,
A. Chen, and T. Doha. Blind navigation with a wearable
range camera and vibrotactile helmet. In ACM International
Conf. on Multimedia, MM ’11, 2011. 2
[21] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient
model-based 3d tracking of hand articulations using kinect.
In BMVC, 2011. 7
[22] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Tracking
the Articulated Motion of Two Strongly Interacting Hands.
In CVPR, 2012. 2
[23] H. Pirsiavash and D. Ramanan. Detecting activities of daily
living in first-person camera views. In CVPR, 2012. 1,2
[24] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. Seidel, and
B. Rosenhahn. Multisensor-fusion for 3d full-body human
motion capture. In CVPR, pages 663–670, 2010. 3
[25] G. Rogez, M. Khademi, J. Supancic, J. Montiel, and D. Ramanan.
3d hand pose detection in egocentric rgbd images.
In ECCV Workshop on Consumer Depth Camera for Vision
(CDC4V), pages 1–11, 2014. 1,6,7
[26] J. Romero, H. Kjellstrom, C. H. Ek, and D. Kragic. Non-
parametric hand pose estimation with object context. Im.
and Vision Comp., 31(8):555–564, 2013. 2
[27] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estima-
tion with parameter-sensitive hashing. In Computer Vision,
2003. Proceedings. Ninth IEEE International Conference on,
pages 750–757. IEEE, 2003. 2
[28] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook,
M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman,
and A. Blake. Efficient human pose estimation from single
depth images. PAMI, 35(12):2821–2840, 2013. 2
[29] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finoc-
chio, A. Blake, M. Cook, and R. Moore. Real-time human
pose recognition in parts from single depth images. Commu-
nications of the ACM, 56(1):116–124, 2013. 2
[30] S. Song and J. Xiao. Sliding shapes for 3d object detection in
rgb-d images. In European Conference on Computer Vision,
2014. 2,4,7
[31] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Model-
based hand tracking using a hierarchical bayesian filter.
PAMI, 28(9):1372–1384, 2006. 2
[32] S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Sku-
bic, and S. Lao. Histogram of oriented normal vectors for ob-
ject recognition with a depth sensor. In ACCV 2012, pages
525–538. 2013. 4
[33] D. Tzionas and J. Gall. A comparison of directional dis-
tances for hand pose estimation. In J. Weickert, M. Hein,
and B. Schiele, editors, Pattern Recognition, number 8142
in Lecture Notes in Computer Science. Springer Berlin Hei-
delberg, Jan. 2013. 2
[34] R. Wang and J. Popovic. Real-time hand-tracking with a
color glove. ACM Trans on Graphics, 28(3), 2009. 2
[35] C. Xu and L. Cheng. Efficient hand pose estimation from a
single depth image. In ICCV, 2013. 2
[36] M. Ye, X. Wang, R. Yang, L. Ren, and M. Pollefeys. Accu-
rate 3d pose estimation from a single depth image. In ICCV,
pages 731–738, 2011. 2