First-Person Pose Recognition using Egocentric Workspaces
Grégory Rogez1,2, James S. Supančić III1, Deva Ramanan1
1Dept of Computer Science, University of California, Irvine, CA, USA
2Universidad de Zaragoza, Zaragoza, Spain
grogez@unizar.es    {grogez,jsupanci,dramanan}@ics.uci.edu
Abstract
We tackle the problem of estimating the 3D pose of an in-
dividual’s upper limbs (arms+hands) from a chest mounted
depth-camera. Importantly, we consider pose estimation
during everyday interactions with objects. Past work shows
that strong pose+viewpoint priors and depth-based features
are crucial for robust performance. In egocentric views,
hands and arms are observable within a well defined vol-
ume in front of the camera. We call this volume an egocen-
tric workspace. A notable property is that hand appearance
correlates with workspace location. To exploit this correla-
tion, we classify arm+hand configurations in a global ego-
centric coordinate frame, rather than a local scanning win-
dow. This greatly simplifies the architecture and improves
performance. We propose an efficient pipeline which 1) gen-
erates synthetic workspace exemplars for training using a
virtual chest-mounted camera whose intrinsic parameters
match our physical camera, 2) computes perspective-aware
depth features on this entire volume and 3) recognizes dis-
crete arm+hand pose classes through a sparse multi-class
SVM. We achieve state-of-the-art hand pose recognition
performance from egocentric RGB-D images in real-time.
1. Introduction
Understanding hand poses and hand-object manipula-
tions from a wearable camera has potential applications
in assisted living [23], augmented reality [6] and life log-
ging [19]. As opposed to hand-pose recognition from third-
person views, egocentric views may be more difficult due
to additional occlusions (from manipulated objects, or self-
occlusions of fingers by the palm) and the fact that hands
interact with the environment and often leave the field-of-
view. The latter necessitates constant re-initialization, pre-
cluding the use of a large body of hand trackers which typ-
ically perform well given manual initialization.
Previous work for egocentric hand analysis tends to rely
on local 2D features, such as pixel-level skin classification
[17,18] or gradient-based processing of depth maps with
Figure 1. Egocentric workspaces. We directly model the observ-
able egocentric workspace in front of a human with a 3D vol-
umetric descriptor, extracted from a 2.5D egocentric depth sen-
sor. In this example, this volume is discretized into 4×3×4
bins. This feature can be used to accurately predict shoulder,
arm, hand poses, even when interacting with objects. We describe
models learned from synthetic examples of observable egocentric
workspaces obtained by placing a virtual Intel Creative camera on
the chest of an animated character.
scanning-window templates [25]. Our approach follows in
the tradition of [25], who argue that near-field depth mea-
sures obtained from an egocentric-depth sensor considerably
simplify hand analysis. Interestingly, egocentric-depth is
not “cheating” in the sense that humans make use of stereo-
scopic depth cues for near-field manipulations [7]. We ex-
tend this observation by building an explicit 3D map of the
observable near-field workspace.
Contributions: In this work, we describe a new com-
putational architecture that makes use of global egocentric
views, volumetric representations, and contextual models
of interacting objects and human-bodies. Rather than de-
tecting hands with a local (translation-invariant) scanning-
window classifier, we process the entire global egocentric
view (or work-space) in front of the observer (Fig. 1). Hand
appearance is not translation-invariant due to perspective ef-
fects and kinematic constraints with the arm. To capture
such effects, we build a library of synthetic 3D egocentric
workspaces generated using real capture conditions (see ex-
amples in Fig. 2). We animate a 3D human character model
inside virtual scenes with objects, and render such anima-
tions with a chest-mounted camera whose intrinsics match
our physical camera. We simultaneously recognize arm
and hand poses while interacting with objects by classify-
ing the whole 3D volume using a multi-class Support Vec-
tor Machine (SVM) classifier. Recognition is simple and
fast enough to be implemented in 4 lines of code.
1.1. Related work
Hand-object pose estimation: While there is a large
body of work on hand-tracking [14,13,12,1,22,31,34],
we focus on hand pose estimation during object manipu-
lations. Object interactions both complicate analysis due
to additional occlusions, but also provide additional con-
textual constraints (hands cannot penetrate object geome-
try, for example). [10] describe an articulated tracker with soft
anti-penetration constraints, increasing robustness to occlu-
sion. Hamer et al. describe contextual priors for hands in
relation to objects [9], and demonstrate their effectiveness
for increasing tracking accuracy. Objects are easier to ani-
mate than hands because they have fewer joint parameters.
With this intuition, object motion can be used as an input
signal for estimating hand motions [8]. [26] use a large syn-
thetic dataset of hands manipulating objects, similar to us.
We differ in our focus on single-image and egocentric anal-
ysis.
Egocentric Vision: Previous studies have focused on ac-
tivities of daily living [23,5]. Long-scale temporal structure
was used to handle complex hand object interactions, ex-
ploiting the fact that objects look different when they are
manipulated (active) versus not manipulated (passive) [23].
Much previous work on egocentric hand recognition makes
exclusive use of RGB cues [18,16], while we focus on vol-
umetric depth cues. Notable exceptions include [3], who
employ egocentric RGB-D sensors for personal workspace
monitoring in industrial environments and [20], who em-
ploy such sensors to assist blind users in navigation.
Depth features: Previous work has shown the efficacy
of depth cues [28,35]. We compute volumetric depth fea-
tures from point clouds. Previous work has examined point-
cloud processing of depth-images [36,29,35]. A common
technique estimates local surface orientations and normals
[36,35], but this may be sensitive to noise since it requires
Figure 2. Synthesis: We generate a training set by sampling dif-
ferent dimensions of a workspace model, yielding a total num-
ber of Narm × Nhand × Nobject × Nbackground samples. We
sample Narm arm poses, a fixed set of hand-object configurations
(Nhand ×Nobject = 100) and a fixed set of Nbackground back-
ground scenes captured with an Intel Creative depth camera. For
each hand-object model, we randomly perturb shoulder, arm and
hand joint angles to generate physically possible arm+hand+object
configurations. We show 2 examples of a bottle-grasp (left) and a
juice-box-grasp (right) rendered in front of a flat wall.
derivative computations. We use simpler volumetric fea-
tures, similar to [30] except that we use a spherical coordi-
nate frame that does not slide along a scanning window (we
want to measure depth in an egocentric coordinate frame).
Non-parametric recognition: Our work is inspired by
non-parametric techniques that make use of synthetic train-
ing data [26,27,10,2,33]. [27] make use of pose-sensitive
hashing techniques for efficient matching of synthetic RGB
images rendered with Poser. We generate synthetic depth
images, mimicking capture conditions of our actual camera.
2. Training data
We begin by generating a training set of realistic 3D ego-
centric workspaces. Specifically, we render synthetic 3D
hand-object data (generated from a 3D animation system)
on top of real 3D background scenes, making use of the test
camera projection matrix. Because egocentric scenes in-
volve objects that lie close to the camera, we found it useful
to model camera-specific perspective effects.
Poser models. Our in-house grasp database is con-
structed by modifying the commercial Everyday hands
Grasp Poser library [4]. We vary the objects being inter-
acted with, as well as the clothing of the character, i.e.,
with and without sleeves. We use more than 200 grasp-
(a) (b)
(c) (d)
Figure 3. Examples of synthetic training images. Our render-
ing pipeline produces realistic depth maps consisting of multiple
hands manipulating household objects in front of everyday back-
grounds.
ing postures and 49 objects, including kitchen utensils, per-
sonal bathroom items, office/classroom objects, fruits, etc.
Additionally we use 6 models of empty hands: wave, fist,
thumbs-up, point, open/close fingers. Some objects can be
handled with different canonical grasps. For example, one
can grip a bottle by its body or by its lid when opening it.
We manually add such variants.
Kinematic model. Let θ be a vector of arm joint angles, and let φ be a vector of grasp-specific hand joint angles, obtained from the above set of Poser models. We use a standard forward kinematic chain to convert the location of finger joints u (in a local coordinate system) to image coordinates:

    p = C ∏_i T(θ_i) ∏_j T(φ_j) u,   where T, C ∈ R^{4×4},
    u = [u_x, u_y, u_z, 1]^T,   (x, y) = (f p_x / p_z, f p_y / p_z),      (1)

where T specifies rigid-body transformations (rotation and translation) along the kinematic chain and C specifies the extrinsic camera parameters. Here p represents the 3D position of point u in the camera coordinate system. To generate the corresponding image point, we assume camera intrinsics are given by identity scale factors and a focal length f (though it is straightforward to use more complex intrinsic parameterizations). We found it important to use the f corresponding to our physical camera, as it is crucial to correctly model perspective effects for our near-field workspaces.
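To make (1) concrete, the sketch below (in Python with numpy, our illustration language rather than the paper's implementation) chains 4×4 homogeneous joint transforms and applies the pinhole projection; the joint axes, offsets and the extrinsics C passed in are placeholder assumptions.

import numpy as np

def rigid_transform(angle, axis, offset):
    """4x4 homogeneous transform: rotation by `angle` about `axis`, then translation by `offset`."""
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)  # Rodrigues' formula
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = offset
    return T

def project_joint(u_local, arm_chain, hand_chain, C, f):
    """Map a local 3D joint through the arm (theta) and hand (phi) chains, then project as in (1)."""
    u = np.append(u_local, 1.0)                       # homogeneous coordinates [ux, uy, uz, 1]
    M = C.copy()                                      # extrinsic camera parameters C
    for (theta, axis, offset) in arm_chain:           # product of T(theta_i) factors
        M = M @ rigid_transform(theta, axis, offset)
    for (phi, axis, offset) in hand_chain:            # product of T(phi_j) factors
        M = M @ rigid_transform(phi, axis, offset)
    p = M @ u                                         # 3D position in camera coordinates
    x, y = f * p[0] / p[2], f * p[1] / p[2]           # pinhole projection with focal length f
    return p[:3], (x, y)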
Pose synthesis: We wish to generate a large set of
postured hands. However, building a generative model of
grasps is not trivial. One option is to take a data-driven
approach and collect training samples using motion cap-
ture [24]. Instead, we take a model-driven approach that
perturbs a small set of manually-defined canonical postures.
To ensure that physically plausible perturbations are gener-
ated, we take a simple rejection sampling approach. We fix
φ parameters to respect the hand grasps from Poser, and add small Gaussian perturbations to arm joint angles:

    θ'_i = θ_i + ε,   where ε ∼ N(0, σ²).
Importantly, this generates hand joints p at different translations and viewpoints, correctly modeling the dependencies between both. For each perturbed pose, we render hand joints using (1) and keep exemplars that are 90% visible (i.e., 90% of the projected (x, y) coordinates lie within the image boundaries). We show examples in Fig. 2.
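A minimal sketch of this rejection-sampling step, assuming a hypothetical render_joints routine that implements (1) and a placeholder noise level σ:

import numpy as np

def perturb_and_filter(theta, render_joints, img_w, img_h, sigma=0.05,
                       min_visible=0.9, rng=np.random.default_rng(0)):
    """Perturb arm joint angles (theta'_i = theta_i + eps, eps ~ N(0, sigma^2)) and
    keep the pose only if at least `min_visible` of its projected joints fall in the image."""
    theta_new = theta + rng.normal(0.0, sigma, size=theta.shape)
    xy = render_joints(theta_new)                     # (num_joints, 2) projections via (1)
    inside = ((xy[:, 0] >= 0) & (xy[:, 0] < img_w) &
              (xy[:, 1] >= 0) & (xy[:, 1] < img_h))
    return theta_new if inside.mean() >= min_visible else None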
Depth maps. Associated with each rendered set of key-
points, we would also like a depth map. To construct a depth
map, we represent each rigid limb with a dense cloud of 3D
vertices {u_i}. We produce this cloud by (over-)sampling the 3D meshes defining each rigid-body shape. We render this dense cloud using forward kinematics (1), producing a set of points {p_i} = {(p_{x,i}, p_{y,i}, p_{z,i})}. We define a 2D depth map z[u, v] by ray-tracing. Specifically, we cast a ray from the origin in the direction of each image (or depth sensor) pixel location (u, v) and find the closest point:

    z[u, v] = min_{k ∈ Ray(u,v)} ||p_k||,      (2)

where Ray(u, v) denotes the set of points on (or near) the ray passing through pixel (u, v). We found the above approach simpler to implement than hidden surface removal, so long as we projected a sufficiently dense cloud of points.
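The following sketch renders such a depth map in the spirit of (2) by projecting each point to a pixel and keeping the smallest range along that pixel's ray; the focal length and image size are placeholders:

import numpy as np

def render_depth(points, f, img_w, img_h):
    """points: (N, 3) dense cloud in camera coordinates. Returns z[u, v] as in (2)."""
    z_map = np.full((img_h, img_w), np.inf)
    valid = points[:, 2] > 0                          # keep points in front of the camera
    pts = points[valid]
    u = np.round(f * pts[:, 0] / pts[:, 2] + img_w / 2).astype(int)
    v = np.round(f * pts[:, 1] / pts[:, 2] + img_h / 2).astype(int)
    dist = np.linalg.norm(pts, axis=1)                # ||p_k||, range along the ray
    inside = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    for ui, vi, di in zip(u[inside], v[inside], dist[inside]):
        if di < z_map[vi, ui]:                        # closest point wins (no hidden-surface removal)
            z_map[vi, ui] = di
    return z_map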
Multiple hands: Some object interactions require mul-
tiple hands interacting with a single object. Additionally,
many views contain the second hand in the “background”.
For example, two hands are visible in roughly 25% of the
frames in our benchmark videos. We would like our train-
ing dataset to have similar statistics. Our existing Poser li-
brary contains mostly single-hand grasps. To generate ad-
ditional multi-arm egocentric views, we randomly pair 25%
of the arm poses with a mirrored copy of another randomly-
chosen pose. We then add noise to the arm joint angles,
as described above. Such a procedure may generate unnat-
ural or self-intersecting poses. To efficiently remove such
cases, we separately generate depth maps for the left and
right arms, and only keep pairings that produce compatible
depth maps:
    |z_left[u, v] − z_right[u, v]| > δ   ∀ u, v      (3)
We find this simple procedure produces surprisingly realis-
tic multi-arm configurations (Fig. 3). Finally we add back-
ground clutter from depth maps of real egocentric scenes
Figure 4. Volume quantization. We quantize those points that fall within the egocentric workspace (observable volume within z_max = 70 cm) into a binary spherical voxel grid of Nu × Nv × Nw voxels (a). We vary the azimuth angle α to generate equal-size projections on the image plane (b). Spherical bins ensure that voxels at different distances project to the same image area (c). This allows for efficient feature
computation and occlusion handling, since occluded voxels along the same line-of-sight can easily be identified.
(not from our benchmark data). We use the above approach
to generate over 100,000 multi-hand(+arm+objects) config-
urations and associated depth-maps.
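To make the pairing test in (3) concrete, the sketch below checks that two independently rendered arm depth maps stay at least δ apart wherever both arms are visible; this masked reading of (3) and the value of δ are our assumptions:

import numpy as np

def arms_compatible(z_left, z_right, delta=0.02):
    """Depth maps in meters, with np.inf where an arm is absent; returns True if (3) holds."""
    both = np.isfinite(z_left) & np.isfinite(z_right)   # pixels where both arms project
    if not both.any():
        return True                                      # arms never overlap in the image
    return bool(np.all(np.abs(z_left[both] - z_right[both]) > delta))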
3. Formulation
3.1. Perspective-aware depth features
It may seem attractive to work in orthographic (or scaled
orthographic) coordinates, as this simplifies much of 3D
analysis. Instead, we posit that perspective distortion is use-
ful in egocentric settings and should be exploited: objects
of interest (hands, arms, and manipulated things) tend to
lie near the body and exhibit perspective effects. Specifi-
cally, parts of objects that are closer to the camera project
to a larger image size. To model such effects, we con-
struct a spherical bin histogram by gridding up the egocen-
tric workspace volume by varying azimuth and elevation an-
gles (Fig. 4). We demonstrate that this feature outperforms
orthographic counterparts, and is also faster to compute.
Binarized volumetric features: Much past work pro-
cesses depth maps as 2D rasterized sensor data. Though
convenient for applying efficient image processing routines
such as gradient computations (e.g., [32]), rasterization
may not fully capture the 3D nature of the data. Alter-
natively, one can convert depth maps to a full 3D point
cloud [15], but the result is orderless making operations
such as correspondence-estimation difficult. We propose
encoding depth data in a 3D volumetric representation, sim-
ilar to [30]. To do so, we can back-project the depth map
from (2) into a cloud of visible 3D points {p_k}, visualized in Fig. 5-(b). They are a subset of the original cloud of 3D points {p_i} in Fig. 5-(a). We now bin those visible points that fall within the egocentric workspace in front of the camera (observable volume within z_max = 70 cm) into a binary voxel grid of Nu × Nv × Nw voxels:

    b[u, v, w] = 1 if ∃ k s.t. p_k ∈ F(u, v, w), and 0 otherwise,      (4)

where F(u, v, w) denotes the set of points within a voxel centered at coordinate (u, v, w).
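A possible implementation of the generic binning in (4), with visible points assigned to spherical voxels indexed by azimuth, elevation and radial distance; the angular field-of-view bounds below are placeholders, not the specifications of our camera:

import numpy as np

def spherical_voxelize(points, Nu=32, Nv=24, Nw=35, z_max=0.70,
                       az_range=(-0.6, 0.6), el_range=(-0.45, 0.45)):
    """points: (N, 3) visible 3D points in camera coordinates. Returns binary b[u, v, w]."""
    b = np.zeros((Nu, Nv, Nw), dtype=bool)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                # radial distance from the camera
    az = np.arctan2(x, z)                             # azimuth angle
    el = np.arctan2(y, z)                             # elevation angle
    keep = ((z > 0) & (r < z_max) &
            (az > az_range[0]) & (az < az_range[1]) &
            (el > el_range[0]) & (el < el_range[1]))
    u = ((az[keep] - az_range[0]) / (az_range[1] - az_range[0]) * Nu).astype(int)
    v = ((el[keep] - el_range[0]) / (el_range[1] - el_range[0]) * Nv).astype(int)
    w = (r[keep] / z_max * Nw).astype(int)
    b[np.clip(u, 0, Nu - 1), np.clip(v, 0, Nv - 1), np.clip(w, 0, Nw - 1)] = True
    return b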
Spherical voxels: Past work tends to use rectilinear vox-
els [30,15]. Instead, we use a spherical binning structure,
centering the sphere at the camera origin (Fig. 4). At first
glance, this might seem strange because voxels now vary in
size – those further away from the camera are larger. The
main advantage of a “perspective-aware” binning scheme is
that all voxels now project to the same image area in pixels
(Fig. 4-(c)). We will show that this both increases accuracy
(because one can better reason about occlusions) and speed
(because volumetric computations are sparse).
Efficient quantization: Let us choose spherical bins F(u, v, w) such that they project to a single pixel (u, v) in the depth map. This allows one to compute the binary voxel grid b[u, v, w] by simply "reading off" the depth value at each (u, v) coordinate, quantizing it to z'[u, v], and assigning 1 to the corresponding voxel:

    b[u, v, w] = 1 if w = z'[u, v], and 0 otherwise.      (5)
This results in a sparse volumetric voxel feature, visual-
ized in Fig. 5-(c). Crucially, a spherical parameterization al-
Figure 5. Binarized volumetric feature. We synthesize training examples by randomly perturbing shoulder, arm and hand joint angles in
a physically possible manner (a). For each example, a synthetic depth map is created by projecting the visible set of dense 3D points using
a real-world camera projection matrix (b). The resulting 2D depth map is then quantized with a regular grid in x-y directions and binned
in the viewing direction to compute our new binarized volumetric feature (c). In this example, we use a 32 × 24 × 35 grid. Note that for
clarity we only show the sparse version of our 3D binary feature. We also show the quantized depth map z[u, v] as a gray scale image (c).
lows one to efficiently reason about occlusions: once a depth measurement is observed at position (u_0, v_0, w_0), i.e. b[u_0, v_0, w_0] = 1, all voxels behind it (w ≥ w_0) are occluded. This arises from
the fact that single camera depth measurements are, in fact,
2.5D. By convention, we define occluded voxels to be “1”.
Note that such occlusion reasoning is difficult with ortho-
graphic parameterizations because voxels are not arranged
along line-of-sight rays.
In practice, we consider a coarse discretization of the
volume to make the problem more tractable. The depth map
z[x, y] is resized to Nu × Nv (smaller than the depth map size) and quantized in the z-direction. To minimize the effect of noise when counting the points which fall in the different voxels, we quantize the depth measurements by applying a median filter on the pixel values within each image region: for all (u, v) ∈ [1, Nu] × [1, Nv],

    z'[u, v] = (Nw / z_max) · median(z[x, y] : (x, y) ∈ P(u, v)),      (6)

where P(u, v) is the set of pixel coordinates in the original depth map corresponding to coordinate (u, v) in the resized depth map.
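A sketch of (5)-(6), assuming each coarse cell (u, v) corresponds to a rectangular block of depth-map pixels: a block-wise median is quantized into Nw range bins, and occluded voxels are backfilled with 1:

import numpy as np

def quantize_depth(z, Nu=32, Nv=24, Nw=35, z_max=0.70):
    """z: (H, W) depth map in meters. Returns the integer map z'[u, v] of shape (Nu, Nv)."""
    H, W = z.shape
    bh, bw = H // Nv, W // Nu                         # pixel block P(u, v) feeding each cell
    blocks = z[:Nv * bh, :Nu * bw].reshape(Nv, bh, Nu, bw)
    med = np.median(blocks, axis=(1, 3))              # median over P(u, v), as in (6)
    zq = np.clip((Nw / z_max) * med, 0, Nw - 1)       # depths past z_max saturate in this sketch
    return zq.astype(int).T                           # transpose so the map is indexed [u, v]

def binary_feature(zq, Nw=35, backfill=True):
    """Binary voxel grid b[u, v, w] from the quantized depth map, as in (5)."""
    w = np.arange(Nw)[None, None, :]
    if backfill:                                      # occluded voxels (w >= z') set to 1
        return (w >= zq[..., None]).astype(np.uint8)
    return (w == zq[..., None]).astype(np.uint8)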
3.2. Global pose classification
We quantize the set of poses from our synthetic database
into K coarse classes for each limb, and train a K-way
pose-classifier for pose-estimation. The classifier is linear
and makes use of our sparse volumetric features, making it
quite simple and efficient to implement.
Pose space quantization: For each training exemplar,
we generate the set of 3D keypoints: 17 joints (elbow +
wrist + 15 finger joints) and the 5 finger tips. Since we
want to recognize coarse limb (arm+hand) configurations,
we cluster the resulting training set by applying K-means
to the elbow+wrist+knuckle 3D joints. We usually rep-
resent each of the K resulting clusters using the average
3D/2D keypoint locations of both arm+hand (See examples
in Fig. 6). Note that K can be chosen as a compromise be-
tween accuracy and speed.
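As an illustration of this quantization step (scikit-learn's KMeans is our stand-in, not necessarily the clustering code used here):

import numpy as np
from sklearn.cluster import KMeans

def quantize_poses(coarse_joints, full_poses, K=750, seed=0):
    """coarse_joints: (N, D) flattened elbow/wrist/knuckle 3D coordinates.
    full_poses: (N, P) full arm+hand keypoints. Returns cluster labels and per-class mean poses."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(coarse_joints)
    labels = km.labels_
    class_means = np.stack([full_poses[labels == k].mean(axis=0) for k in range(K)])
    return labels, class_means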
Global classification: We use a linear SVM for a multi-
class classification of upper-limb poses. However, in-
stead of classifying local scanning-windows, we classify
global depth maps quantized into our binarized depth fea-
ture b[u, v, w]from (5). Global depth maps allow the clas-
sifier to exploit contextual interactions between multiple
hands, arms and objects. In particular, we find that mod-
eling arms is particularly helpful for detecting hands. For
each class k ∈ {1, 2, ..., K}, we train a one-vs-all SVM classifier, obtaining a weight vector which can be re-arranged into a Nu × Nv × Nw tensor β_k[u, v, w]. The score for class k is then obtained by a simple dot product of this weight and our binarized feature b[u, v, w]:

    score[k] = Σ_u Σ_v Σ_w β_k[u, v, w] · b[u, v, w].      (7)
We visualize projections of the learned weight tensor β_k[u, v, w] in Fig. 6 and slices of the tensor in Fig. 7.
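As a sketch of the training step (using scikit-learn's LinearSVC as a stand-in for a one-vs-all linear SVM, which may differ from our actual solver), the learned weight vectors are simply reshaped into the tensors β_k of (7):

import numpy as np
from sklearn.svm import LinearSVC

def train_pose_classifiers(features, labels, Nu=32, Nv=24, Nw=35, C=1.0):
    """features: (N, Nu*Nv*Nw) flattened binary voxel grids; labels: (N,) pose-class indices.
    Returns the weight tensors beta_k stacked into an array of shape (K, Nu, Nv, Nw)."""
    svm = LinearSVC(C=C, multi_class='ovr').fit(features, labels)
    return svm.coef_.reshape(-1, Nu, Nv, Nw)          # one beta_k[u, v, w] per class, used in (7)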
3.3. Joint feature extraction and classification
To increase run-time efficiency, we exploit the sparsity
of our binarized volumetric feature and jointly implement
Figure 6. Pose classifiers. We visualize the linear weight tensor βk[u, v, w]learnt by the SVM for a 32 ×24 ×35 grid of binary features for
3 different pose clusters. We plot a 2D (u, v)visualization obtained by computing the max along w. We also visualize the corresponding
average 3D pose in the egocentric volume together with the top 500 positive (light gray) and negative weights (dark gray) within βk[u, v, w].
Figure 7. Weights along w. We visualize the SVM weights
βk[u, v, w]for a particular (u, v )location. Our histogram encod-
ing allows us to learn smooth nonlinear functions of depth values.
For example, the above weights respond positively to depth values
midway into the egocentric volume, but negatively to those closer.
feature extraction and SVM scoring. Since the final score is
a simple dot product with binary features, one can readily
extract the feature and update the score on the fly. Because
all voxels behind the first measurement are backfilled, the
SVM score for each class k from (7) can be written as:

    score[k] = Σ_u Σ_v β'_k[u, v, z'[u, v]],      (8)

where z'[u, v] is the quantized depth map and the tensor β'_k[u, v, w] is the cumulative sum of the weight tensor along dimension w:

    β'_k[u, v, w] = Σ_{d ≥ w} β_k[u, v, d].      (9)
Note that the above cumulative-sum tensors can be precom-
puted. This makes test-time classification quite efficient (8).
Feature extraction and SVM classification can be computed
jointly following the algorithm presented in Alg. 1. Our im-
plementation runs at 275 frames per second.
input : Quantized depth map z'[u, v].
        Cumsum'ed weights {β'_k[u, v, w]}.
output: score[k]
for u ∈ {0, 1, ..., Nu} do
    for v ∈ {0, 1, ..., Nv} do
        for k ∈ {0, 1, ..., K} do
            score[k] += β'_k[u, v, z'[u, v]]
        end
    end
end
Algorithm 1: Joint feature extraction & classification. We jointly extract binarized depth features and evaluate linear classifiers for all quantized poses k. We precompute a "cumsum" β'_k of our SVM weights. At each location (u, v), we add all the weights corresponding to the voxels behind z'[u, v], i.e. such that w ≥ z'[u, v].
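The same loop vectorizes naturally. The sketch below precomputes the reversed cumulative sum of (9) and evaluates (8) for every class with a single gather over the quantized depth map; array shapes follow the earlier sketches, not necessarily our actual implementation:

import numpy as np

def precompute_cumsum(beta):
    """beta: (K, Nu, Nv, Nw). Returns beta'[k, u, v, w] = sum over d >= w of beta[k, u, v, d], as in (9)."""
    return np.flip(np.cumsum(np.flip(beta, axis=-1), axis=-1), axis=-1)

def score_all_classes(beta_cum, zq):
    """beta_cum: (K, Nu, Nv, Nw) cumsum'ed weights; zq: (Nu, Nv) quantized depth in [0, Nw-1].
    Returns score[k] for every pose class, evaluating (8) in one pass."""
    u, v = np.meshgrid(np.arange(zq.shape[0]), np.arange(zq.shape[1]), indexing='ij')
    return beta_cum[:, u, v, zq].sum(axis=(1, 2))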
4. Experiments
For evaluation, we use the recently released UCI Ego-
centric dataset [25] and score hand pose detection as a proxy
for limb pose recognition (following the benchmark criteria
used in [25]). The dataset consists of 4 video sequences
(around 1000 frames each) of everyday egocentric scenes
with hand annotations every 10 frames.
Feature evaluation: We first compare hand detection
accuracy for different K-way SVM classifiers trained on
HOG on depth (as in [25]) and HOG on RGB-D, thus ex-
ploiting the stereo-views provided by RGB and depth sen-
sors. To evaluate our voxel encoding, we also trained an
SVM directly on the quantized depth map z[u, v] (with-
out constructing a sparse binary feature). To evaluate our
Figure 8. Feature evaluation. We compare our feature encoding
to different variants (for K = 750 classes) in (a). Our feature
outperforms HOG-on-depth and HOG-on-RGBD. Our feature also
outperforms orthographic voxels and the raw quantized depth map,
which surprisingly itself outperforms all other baselines. When
combined with a linear classifier, our sparse encoding can learn
nonlinear functions of depth (see Fig. 7), while the raw depth map
can only learn linear functions. We also vary the resolution of our
feature in (b), again for K = 750. A size of 32 × 24 × 35 is a good
trade-off between size and performance. Doubling the resolution
in u, v marginally improves accuracy.
perspective voxels, we compare to an orthographic version
of our binarized volumetric feature (similar to past work
[30,15]). In that case, we quantize those points that fall
within a 64 × 48 × 70 cm³ egocentric workspace in front of the camera into a binary grid of square voxels:

    b⊥[u, v, w] = 1 if ∃ i s.t. (x_i, y_i, z_i) ∈ N(u, v, w), and 0 otherwise,      (10)

where N(u, v, w) specifies a 2 × 2 × 2 cm cube centered at
voxel (u, v, w). Note that this feature is considerably more
involved to calculate, since it requires an explicit backpro-
jection and explicit geometric computations for binning.
Moreover, identifying occluded voxels is difficult because
they are not arranged along line-of-sight rays.
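For reference, a sketch of this orthographic baseline; the box extent and voxel size follow the text, while the centering convention is our assumption:

import numpy as np

def orthographic_voxelize(points, dims_cm=(64, 48, 70), voxel_cm=2.0):
    """points: (N, 3) in meters, camera at the origin. Returns the binary grid b_perp[u, v, w]."""
    voxel_m = voxel_cm / 100.0
    dims_m = np.array(dims_cm) / 100.0
    res = tuple(int(round(d / voxel_cm)) for d in dims_cm)    # 32 x 24 x 35 cells of 2 cm
    # center the box on the optical axis in x and y; depth runs from 0 to 0.70 m
    shifted = points + np.array([dims_m[0] / 2, dims_m[1] / 2, 0.0])
    idx = np.floor(shifted / voxel_m).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(res)), axis=1)
    b = np.zeros(res, dtype=bool)
    b[tuple(idx[keep].T)] = True
    return b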
The results obtained with K = 750 pose classes are re-
ported in Fig. 8-(a). Our perspective binary features clearly
outperform other types of features. We reach 72% detec-
tion accuracy while the state of the art [25] reports 60% ac-
curacy. Our volumetric feature has empirically strong per-
formance in egocentric settings. One reason is that it is ro-
bust to small intra-cluster misalignment and deformations
because all voxels behind the first measurement are back-
filled. Second, it is sensitive to variations in apparent size
induced by perspective effects (because voxels have consis-
Figure 9. Clustering and size of training set. In (a), we plot
performance as a function of K (the number of discretized pose
classes) for a fixed-size training set. For reference, we also plot
the state-of-the-art method from [25]. In (b), we plot performance
as we increase the amount of training data for K. Both results
suggest that our system may perform better with more training data
and more quantized poses. Please see text for further discussion.
tent perspective projections). In Fig. 8-(b), we also show
results varying the resolution of the grid. Our choice of
32 × 24 × 35 is a good trade-off between feature dimen-
sionality and performance.
We compare primarily to [25], as that method was al-
ready shown to outperform commercial (Intel PXC [11])
and fully-featured tracking systems [21]. Such systems per-
form poorly due to occlusions inherent in egocentric view-
points. Notably, [25] use local part templates in a scanning
window fashion. Our global approach captures correlations
between pose and spatial location, and better deals with oc-
clusion where local appearance can be misleading.
Training data and clustering: We evaluated the per-
formance of our algorithm when varying the number of
quantized pose classes K and the amount of training data. Fig. 9-(a) varies K for a fixed training set of 120,000 training images. Performance maxes out relatively quickly at K = 750, suggesting that our model may be overfitting due to lack of training data. Fig. 9-(b) fixes K = 750 and
increases the amount of training data per quantized class.
Here, we see a more consistent increase in accuracy. These
results suggest that a massive training set and larger K may
produce better results.
Qualitative results: We illustrate successes in difficult
scenarios in Fig. 10 and analyze common failure modes in
Fig. 11. Please see the figures for additional discussion.
5. Conclusions
We have proposed a new approach to the problem of ego-
centric 3D hand pose recognition during interactions with
objects. Instead of classifying local depth image regions
through a typical translation-invariant scanning window, we
have shown that classifying the global arm+hand+object
configurations within the “whole” egocentric workspace in
front of the camera allows for fast and accurate results.
Figure 10. Good detections. We show frames where the arm and hand are correctly detected. First, we present some easy cases of hands in free space (top row). Noisy depth data and cluttered background cases (middle row) showcase the robustness of our system, while novel objects (bottom row: envelope, staple box, pan, double-handed cup and lamp) require generalization to objects unseen at training time.
Figure 11. Hard cases. We show cases where the pose is not correctly recognized (sometimes not even detected): excessively noisy depth data, hands manipulating reflective material (phone or bottle of wine), unseen objects (keys), ambiguous poses, or malsegmentability cases of hands touching the background.
We train our model by synthesizing workspace exemplars consisting of hands, arms, objects and backgrounds. Our model explicitly reasons about perspective occlusions while being both conceptually and practically simple to implement (4 lines of code). We produce state-of-the-art results for egocentric pose estimation in real time.
Acknowledgements. GR was supported by the European
Commission under FP7 Marie Curie IOF grant “Egovi-
sion4Health” (PIOF-GA-2012-328288). JS and DR were
supported by NSF Grant 0954083, ONR-MURI Grant
N00014-10-1-0933, and the Intel Science and Technology
Center - Visual Computing.
References
[1] V. Athitsos and S. Sclaroff. Estimating 3d hand pose from a
cluttered image. In CVPR (2), pages 432–442, 2003. 2
[2] T. Y. D. Tang and T.-K. Kim. Real-time articulated hand
pose estimation using semi-supervised transductive regres-
sion forests. In ICCV, pages 1–8, 2013. 2
[3] D. Damen, A. P. Gee, W. W. Mayol-Cuevas, and A. Calway.
Egocentric real-time workspace monitoring using an rgb-d
camera. In IROS, 2012. 2
[4] Daz3D. Every-hands pose library. http://www.daz3d.com/everyday-hands-poses-for-v4-and-m4, 2013. 2
[5] A. Fathi, A. Farhadi, and J. Rehg. Understanding egocentric
activities. In ICCV, 2011. 2
[6] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interac-
tions: A first-person perspective. In CVPR, pages 1226–
1233, 2012. 1
[7] A. R. Fielder and M. J. Moseley. Does stereopsis matter in
humans? Eye, 10(2):233–238, 1996. 1
[8] H. Hamer, J. Gall, R. Urtasun, and L. Van Gool. Data-driven
animation of hand-object interactions. In 2011 IEEE Inter-
national Conference on Automatic Face Gesture Recognition
and Workshops (FG 2011), pages 360–367. 2
[9] H. Hamer, J. Gall, T. Weise, and L. Van Gool. An object-
dependent hand pose prior from sparse training data. In 2010
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 671–678. 2
[10] H. Hamer, K. Schindler, E. Koller-Meier, and L. Van Gool.
Tracking a hand manipulating an object. In 2009 IEEE 12th
International Conference on Computer Vision, pages 1475–
1482. 2
[11] Intel. Perceptual computing sdk, 2013. 7
[12] M. Kölsch. An appearance-based prior for hand tracking. In
ACIVS (2), pages 292–303, 2010. 2
[13] M. Kölsch and M. Turk. Hand tracking with flocks of fea-
tures. In CVPR (2), page 1187, 2005. 2
[14] T. Kurata, T. Kato, M. Kourogi, K. Jung, and K. Endo. A
functionally-distributed hand tracking method for wearable
visual interfaces and its applications. In MVA, pages 84–89,
2002. 2
[15] K. Lai, L. Bo, and D. Fox. Unsupervised feature learning for
3d scene labeling. In ICRA, 2014. 4,7
[16] C. Li and K. M. Kitani. Model recommendation with virtual
probes for egocentric hand detection. In ICCV, 2013. 2
[17] C. Li and K. M. Kitani. Model recommendation with virtual
probes for egocentric hand detection. In ICCV, 2013. 1
[18] C. Li and K. M. Kitani. Pixel-level hand detection in ego-
centric videos. In CVPR, 2013. 1,2
[19] Z. Lu and K. Grauman. Story-driven summarization for ego-
centric video. In CVPR, 2013. 1
[20] S. Mann, J. Huang, R. Janzen, R. Lo, V. Rampersad,
A. Chen, and T. Doha. Blind navigation with a wearable
range camera and vibrotactile helmet. In ACM International
Conf. on Multimedia, MM ’11, 2011. 2
[21] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient
model-based 3d tracking of hand articulations using kinect.
In BMVC, 2011. 7
[22] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Tracking
the Articulated Motion of Two Strongly Interacting Hands.
In CVPR, 2012. 2
[23] H. Pirsiavash and D. Ramanan. Detecting activities of daily
living in first-person camera views. In CVPR, 2012. 1,2
[24] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. Seidel, and
B. Rosenhahn. Multisensor-fusion for 3d full-body human
motion capture. In CVPR, pages 663–670, 2010. 3
[25] G. Rogez, M. Khademi, J. Supancic, J. Montiel, and D. Ra-
manan. 3d hand pose detection in egocentric rgbd images.
In ECCV Workshop on Consumer Depth Camera for Vision
(CDC4V), pages 1–11, 2014. 1,6,7
[26] J. Romero, H. Kjellstrom, C. H. Ek, and D. Kragic. Non-
parametric hand pose estimation with object context. Im.
and Vision Comp., 31(8):555 – 564, 2013. 2
[27] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estima-
tion with parameter-sensitive hashing. In Computer Vision,
2003. Proceedings. Ninth IEEE International Conference on,
pages 750–757. IEEE, 2003. 2
[28] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook,
M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman,
and A. Blake. Efficient human pose estimation from single
depth images. PAMI, 35(12):2821–2840, 2013. 2
[29] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finoc-
chio, A. Blake, M. Cook, and R. Moore. Real-time human
pose recognition in parts from single depth images. Commu-
nications of the ACM, 56(1):116–124, 2013. 2
[30] S. Song and J. Xiao. Sliding shapes for 3d object detection in
rgb-d images. In European Conference on Computer Vision,
2014. 2,4,7
[31] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Model-
based hand tracking using a hierarchical bayesian filter.
PAMI, 28(9):1372–1384, 2006. 2
[32] S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Sku-
bic, and S. Lao. Histogram of oriented normal vectors for ob-
ject recognition with a depth sensor. In ACCV 2012, pages
525–538. 2013. 4
[33] D. Tzionas and J. Gall. A comparison of directional dis-
tances for hand pose estimation. In J. Weickert, M. Hein,
and B. Schiele, editors, Pattern Recognition, number 8142
in Lecture Notes in Computer Science. Springer Berlin Hei-
delberg, Jan. 2013. 2
[34] R. Wang and J. Popovic. Real-time hand-tracking with a
color glove. ACM Trans on Graphics, 28(3), 2009. 2
[35] C. Xu and L. Cheng. Efficient hand pose estimation from a
single depth image. In ICCV, 2013. 2
[36] M. Ye, X. Wang, R. Yang, L. Ren, and M. Pollefeys. Accu-
rate 3d pose estimation from a single depth image. In ICCV,
pages 731–738, 2011. 2