Understanding Everyday Hands in Action from RGB-D Images


We analyze functional manipulations of handheld objects, formalizing the problem as one of fine-grained grasp classification. To do so, we make use of a recently developed fine-grained taxonomy of human-object grasps. We introduce a large dataset of 12000 RGB-D images covering 71 everyday grasps in natural interactions. Our dataset is different from past work (typically addressed from a robotics perspective) in terms of its scale, diversity, and combination of RGB and depth data. From a computer-vision perspective, our dataset allows for exploration of contact and force prediction (crucial concepts in functional grasp analysis) from perceptual cues. We present extensive experimental results with state-of-the-art baselines, illustrating the role of segmentation, object context, and 3D-understanding in functional grasp analysis. We demonstrate a near 2X improvement over prior work and a naive deep baseline, while pointing out important directions for improvement.
Grégory Rogez
Inria Rhône-Alpes
James S. Supančič III
University of California, Irvine
Deva Ramanan
Carnegie Mellon University
1. Introduction
Humans can interact with objects in complex ways, in-
cluding grasping, pushing, or bending them. In this work,
we address the perceptual problem of parsing such inter-
actions, with a focus on handheld, manipulatable objects.
Much previous work on hand analysis tends to focus on
kinematic pose estimation [17,12]. Interestingly, the same
kinematic pose can be used for dramatically different func-
tional manipulations (Fig. 1), where differences are mani-
fested in terms of distinct contact points and force vectors.
Thus, contact points and forces play a crucial role when
parsing such interactions from a functional perspective.
Problem setup: Importantly, we wish to analyze
human-object interactions in situ. To do so, we make use of
wearable depth cameras to ensure that recordings are mo-
bile (allowing one to capture diverse scenes [33,7]) and
passive (avoiding the need for specialized pressure sen-
sors/gloves [6,24]). We make no explicit assumption about
the environment, such as known geometry [32]. However,
we do make explicit use of depth cues, motivated by the fact
that humans make use of depth for near-field analysis [15].
Our problem formulation is thus: given a first-person RGB-D
image of a hand-object interaction, predict the 3D kinematic
hand pose, contact points, and force vectors.
Figure 1. Same kinematic pose, but different functions: We
show 3 images of near-identical kinematic hand pose, but very
different functional manipulations, including a wide-object grasp
(a), a precision grasp (b), and a finger extension (c). Contact
regions (green) and force vectors (red), visualized below each image,
appear to define such manipulations. This work (1) introduces a
large-scale dataset for predicting pose+contacts+forces from images
and (2) proposes an initial method based on fine-grained
grasp classification.
Motivation: We see several motivating scenarios and
applications. Our long-term goal is to produce a truly
functional description of a scene that is useful for an autonomous
robot. When faced with a novel object, it will be useful to know
whether it can be pushed or grasped, and what forces and contacts
are necessary to do so [40]. A practical
application of our work is imitation learning or learning
by demonstration for robotics [3,16], where a robot can
be taught a task by observing humans performing it. Fi-
nally, our problem formulation has direct implications for
assistive technology. Clinicians watch and evaluate patients
performing everyday hand-object interactions for diagnosis
and evaluation [2]. A patient-wearable camera that enabled
automated parsing of object manipulations would allow for
long-term monitoring.
Why is this hard? Estimating forces from visual signals
typically requires knowledge of object mass and velocity,
which is difficult to reliably infer from a single image or
even a video sequence. Isometric forces are even more dif-
ficult to estimate because no motion may be observed. Fi-
nally, even traditional tasks such as kinematic hand pose es-
timation are now difficult because manipulated objects tend
to generate significant occlusions. Indeed, much previous
work on kinematic hand analysis considers isolated hands
in free-space [43], which is a considerably easier problem.
Approach: We address the continuous problem of
pose+contact+force prediction as a discrete fine-grained
classification task, making use of a recent 73-class tax-
onomy of fine-grained hand-object interactions developed
from the robotics community [28]. Our approach is inspired
by prototype-based approaches for continuous shape estimation
that treat the problem as a discrete categorical prediction
task, such as shapemes [34] or poselets [5]. However,
rather than learning prototypes, we make use of expert
domain knowledge to quantize the space of manipulations,
which allows us to treat the problem as one of (fine-grained)
classification. A vital property of our classification engine
is that it is data-driven rather than model-based. We put
forth considerable effort toward assembling a large collec-
tion of diverse images that span the taxonomy of classes.
We experiment with both parametric and exemplar-based
classification architectures trained on our collection.
Our contributions: Our primary contribution is (1)
a new “in-the-wild”, large-scale dataset of fine-grained
grasps, annotated with contact points and forces. Impor-
tantly, the data is RGB-D and collected from a wearable per-
spective. (2) We develop a pipeline for fine-grained grasp
classification exploiting depth and RGB data, training on
combinations of both real and synthetic training data and
making use of state-of-the-art deep features. Overall, our
results indicate that grasp classification is challenging, with
accuracy approaching 20% for a 71-way classification prob-
lem. (3) We describe a simple post-processing exemplar
framework that predicts contacts and forces associated with
hand manipulations, providing an initial proof-of-concept
system that addresses this rather novel visual prediction task.
2. Related Work
Hand pose with RGB(D): Hand pose estimation is a
well-studied task, using both RGB and RGB-D sensors as
input. Much work formulates the task as articulated track-
ing over time [25,23,22,4,31,42,44], but we focus on
single-image hand pose estimation during object manipu-
lations. Relatively few papers deal with object manipu-
lations, with the important exceptions of [39,38,27,26].
Most similar to us is [32], who estimate contact forces during
hand-object interactions, but do so in an “in-the-lab” scenario
where objects of known geometry are used. We focus
on single-frame “in-the-wild” footage where the observer is
instrumented, but the environment (and its constituent objects)
is not.
Egocentric hand analysis: Spurred by the availability
of cheap wearable sensors, there has been a considerable
amount of recent work on object manipulation and grasp
analysis from egocentric viewpoints [11,8,18,7,13]. The
detection and pose estimation of human hands from wear-
able cameras was explored in [36]. [8] propose a fully auto-
matic vision-based approach for grasp analysis from a wear-
able RGB camera, while [18] explores unsupervised clus-
tering techniques for automatically discovering common
modes of human hand use. Our work is very much inspired
by such lines of thought, but we take a data-driven perspec-
tive, focusing on large-scale dataset collection guided by a
functional taxonomy.
Grasp taxonomies: Numerous taxonomies of grasps
have been proposed, predominantly from the robotics community.
Early work by Cutkosky [9] introduced 16 grasps,
which were later extended to 33 by Feix et al. [14], following
a definition of a grasp as a “static hand posture
with which an object can be held with one hand”. Though
this excluded two-handed, dynamic, and gravity-dependent
grasps, this taxonomy has been widely used [37,8,7]. Our
work is based on a recent fine-grained taxonomy proposed
in [28], that significantly broadens the scope of manipu-
lations to include non-prehensile object interactions (that
are technically not grasps, such as pushing or pressing) as
well as other gravity-dependent interactions (such as lift-
ing). The final taxonomy includes 73 grasps that are an-
notated with various qualities (including hand shape, force
type, direction of movement and effort).
Datasets. Because grasp understanding is usually ad-
dressed from a robotics perspective, the resulting meth-
ods and datasets developed for the problem tend to be tai-
lored for that domain. For example, robotics platforms of-
ten require an unavoidable real-time constraint, limiting the
choice of algorithms, which also (perhaps implicitly) lim-
ited the difficulty of the data in terms of diversity (few sub-
jects, few objects, few scenes). We overview the existing
grasp datasets in Table 1 and tailor our new dataset to “fill
the gap” in terms of overall scale, diversity, and annotation
Dataset        View  Cam.   Sub.  Scn  Frms    Label  Tax.
YALE [7]       Ego   RGB    4     4    9100    Gr.    33
UTG [8]        Ego   RGB    4     1    ?       Gr.    17
GTEA [13]      Ego   RGB    4     4    00      Act.   7
UCI-EGO [36]   Ego   RGB-D  2     4    400     Pose   ?
Ours           Ego   RGB-D  8     >5   12,000  Gr.    71
Table 1. Object manipulation datasets. [7] captured 27.7 hours
but labelled only 9100 frames with grasp annotations. While our
dataset is balanced and contains the same amount of data for each
grasp, [7] is imbalanced in that common grasps appear much more
often than rare grasps (10 grasps suffice to explain 80% of the
data). [8] uses the same set of objects for the 4 subjects.
Figure 2. GUN-71: Grasp Understanding dataset. We have captured (from a chest-mounted RGB-D camera) and annotated our own
dataset of fine-grained grasps, following the recent taxonomy of [28]. In the top row, the “writing tripod” grasp class exhibits low variability
in object and pose/view across 6 different subjects and environments. In the second row, “flat hand cupping” exhibits high variability in
objects and low variability in pose due to being gravity-dependent. In the third row, “trigger press” exhibits high variability in objects and
pose/view. Finally, in the bottom row, we show 6 views of the same grasp captured for a particular object and a particular subject in our dataset.
3. GUN-71: Grasp UNderstanding Dataset
We begin by describing our dataset, visualized in Fig. 2.
We start with the 73-class taxonomy of [28], but omit
grasps 50 and 61 because of their overly-specialized nature
(holding a ping-pong racket and playing saxophone, respec-
tively), resulting in 71 classes.
3.1. Data capture
To capture truly in-the-wild data, we might follow the
approach of [7] and monitor unprompted subjects behaving
naturally throughout the course of a day. However, this re-
sults in a highly imbalanced distribution of observed object
manipulations. [7] shows that 10 grasps suffice to explain
80% of the object interactions of everyday users. Balanced
class distributions arguably allow for more straightforward
analysis, which is useful when addressing a relatively unexplored
problem. Collecting a balanced distribution in such an
unprompted manner would be prohibitively expensive, both
in terms of raw data collection and manual annotation. In-
stead, we prompt users using the scheme below.
Capture sessions: We ask subjects to perform the 71
grasps on personal objects (typical for the specific grasp),
mimicking real object manipulation scenarios in their home
environment. Capture sessions were fairly intensive and laborious
as a result. We mount Intel’s Senz3D, a wearable
time-of-flight sensor [20,10,29], on the subject’s chest using
a GoPro harness (as in [36]). We tried to vary the types
of objects as much as possible and considered between 3
and 4 different objects per subject for each of the 71 grasps.
For each hand-object configuration, we took between 5 and
6 views of the manipulation scene. These views correspond
to several steps of a particular action (opening a lid, pouring
water) as well as different 3D locations and orientation of
the hand holding the object (with respect to the camera).
Diversity: This process led to the capture of roughly
12,000 RGB-D images labeled with one of the 71 grasps.
We captured 28 objects per grasp, resulting in 28 × 71 =
1988 different hand-object configurations with 5-6 views
for each. We consider 8 different subjects (4 males and 4
females) in 5 different houses, ensuring that “house mates”
avoid using the same objects to allow leave-one-out exper-
iments (we can leave out one subject for testing and en-
sure that the objects will be novel as well). Six of our
eight subjects were right handed. To ensure consistency, we
asked the two left-handed subjects to perform grasps with
their right hand. We posited that body shape characteristics
might affect accuracy/generalizability, particularly in terms
of hand size, shape, and movement. To facilitate such analysis,
we also measured arm and finger lengths for each subject.
Figure 3. Contact point and forces. We show the 3D hand model for 18 grasps of the considered taxonomy. We also show the contact
points (in green) and forces (in red) corresponding to each grasp. The blue points help visualize the shape of the typical object associated
with each of these 18 grasps. We can observe that power grasps have wider contact areas on finger and palm regions, while precision grasps
exhibit more localized contact points on finger tips and finger pads.
4. Synthetic (training) data generation
3D hand-object models: In addition to GUN-71, we
construct a synthetic training dataset that will be used dur-
ing our grasp-recognition pipeline. To construct this syn-
thetic dataset, we make use of synthetic 3D hand models.
We obtain a set of 3D models by extending the publicly-
available Poser models from [39] to cover the selected
grasps from [28]’s taxonomy (by manually articulating the
models to match the visual description of grasp).
Contact and force annotations: We compute contact
points and applied forces on our 3D models with the follow-
ing heuristic procedure. First, we look for physical points of
contact between the hand and object mesh. We do this by in-
tersecting the triangulated hand and object meshes with the
efficient method of [30]. We produce a final set of contact
regions by connected-component clustering the set of 3D
vertices lying within an intersection boundary. To estimate
a force vector, we assume that contact points are locally sta-
ble and will not slide along the surface of the object (imply-
ing the force vector is normal to the surface of the object).
We estimate this normal direction by simply reporting the
average normal of vertices within each contact region. Note
this only produces an estimate of the force direction, and
not magnitude. Nevertheless, we find our heuristic proce-
dure to produce surprisingly plausible estimates of contact
points and force directions for each 3D model (Fig. 3).
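The clustering and normal-averaging steps of this heuristic can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the mesh intersection of [30] has already produced a set of contact vertex indices, and the names `contact_regions`, `force_direction`, `edges`, and `normals` are hypothetical. Whether the averaged normal points into or out of the object depends on the mesh orientation convention, which the text does not specify.

```python
import math
from collections import defaultdict, deque


def contact_regions(contact_vertices, edges):
    """Group contact vertices into connected components along mesh edges.

    contact_vertices: vertex indices found inside an intersection boundary
    (assumed to be given by an external mesh-intersection step).
    edges: iterable of (a, b) vertex-index pairs of the mesh.
    """
    contact = set(contact_vertices)
    neighbours = defaultdict(set)
    for a, b in edges:
        if a in contact and b in contact:
            neighbours[a].add(b)
            neighbours[b].add(a)
    regions, seen = [], set()
    for start in sorted(contact):
        if start in seen:
            continue
        seen.add(start)
        queue, region = deque([start]), []
        while queue:  # breadth-first search over contact vertices only
            v = queue.popleft()
            region.append(v)
            for w in sorted(neighbours[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        regions.append(sorted(region))
    return regions


def force_direction(region, normals):
    """Average the vertex normals of one region; direction only, no magnitude."""
    sx = sum(normals[v][0] for v in region)
    sy = sum(normals[v][1] for v in region)
    sz = sum(normals[v][2] for v in region)
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    return (sx / norm, sy / norm, sz / norm)


# Toy example: vertices 0-1-2 form one contact patch, vertex 5 another.
toy_edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
toy_normals = {0: (0, 0, 1), 1: (0, 0, 1), 2: (0, 1, 0), 5: (1, 0, 0)}
toy_regions = contact_regions([0, 1, 2, 5], toy_edges)
toy_forces = [force_direction(r, toy_normals) for r in toy_regions]
```

Each entry of `toy_forces` is a unit vector, consistent with the note above that the procedure estimates force direction but not magnitude.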
Synthetic training data: We use our 3D models to gen-
erate an auxiliary dataset of synthetic depth data, annotated
with 3D poses, grasp class label, contacts, and force di-
rection vectors. We additionally annotate each rendered
depth map with a segmentation mask denoting background,
hand, and object pixels. We render over 200,000 training
instances (3,000 per grasp). We will release our models,
rendered images, as well as GUN-71 (our dataset of real-world
RGB-D images) to spur further research in the area.
5. Recognition pipeline
We now describe a fairly straightforward recognition
system for recognizing grasps given real-world RGB-D images.
Our pipeline consists of two stages: hand segmentation
and fine-grained grasp classification.
5.1. Segmentation
The first stage of our pipeline is segmenting the hand
from background clutter, both in the RGB and depth data.
Many state-of-the-art approaches [8,38,39] employ user-
specific skin models to localize and segment out the hand.
We want a system that does not require such a user-specific
learning stage and could be applied to any new user and
environment, and so instead make use of depth cues to seg-
ment out the hand.
Depth-based hand detection: We train a P-way classifier
designed to report one of P = 1500 quantized hand
poses, using the approach of [35]. This classifier is trained
on the synthetic training data, which is clustered off-line
into P pose classes. Note that the number of pose classes P
is significantly larger than the number of fine-grained grasps
K = 71. We use the segmentation mask associated with
this coarse quantized pose detection to segment out the hand
(and object) from the test image, described further below.
Figure 4. Segmentation. We show the different steps of our
segmentation stage: the depth map (a) is processed using a P-way
pose classifier [35], which reports a quantized pose detection k and
associated foreground prior b_ik (b) and mean depth µ_ik (c) (used
to compute a posterior following Eq. 1). To incorporate bottom-up
RGB cues, we first extract superpixels (e) and then label superpixels
instead of pixels to produce a segmentation mask (f). This produces
a segmented RGB image in (g), which can then be cropped
(h) and/or unsegmented (i). We concatenate (deep) features
extracted from (d), (g), (h), and (i) to span a variety of resolutions
and local/global contexts.
Pixel model: We would like to use hand detections
to generate binary segmentation masks. To do so, we
use a simple probabilistic model where $x_i$ denotes the
depth value of pixel $i$ and $y_i \in \{0,1\}$ is its binary
foreground/background label. We write the posterior probability
of label $y_i$ given observation $x_i$, all conditioned on pose
class $k$, as:
$$p(y_i \mid x_i, k) \propto p(y_i \mid k)\, p(x_i \mid y_i, k) \qquad (1)$$
which can easily be derived from Bayes rule. The first term on
the right-hand side is the “prior” probability of pixel $i$ being
fg/bg, and the second term is a “likelihood” of observing a
depth value given a pose class $k$ and label:
$$p(y_i = 1 \mid k) = b_{ik} \qquad \text{Bernoulli} \qquad (2)$$
$$p(x_i \mid y_i = 1, k) = \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik}^2) \qquad \text{Normal} \qquad (3)$$
$$p(x_i \mid y_i = 0, k) \propto \text{constant} \qquad \text{Uniform} \qquad (4)$$
We use a pixel-specific Bernoulli distribution for the prior,
and a univariate Normal and Uniform (uninformative) distribution
for the likelihood. Intuitively, foreground depths
tend to be constrained by the pose, while the background
will not be. Given training data of depth images $x$ with
foreground masks $y$ and pose class labels $k$, it is straightforward
to estimate model parameters $\{b_{ik}, \mu_{ik}, \sigma_{ik}\}$ with
maximum likelihood estimation (frequency counts, sample
means, and sample variances). We visualize the pixel-wise
Bernoulli prior $b_{ik}$ and mean depth $\mu_{ik}$ for a particular class
$k$ in Fig. 4-b and Fig. 4-c.
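As a worked illustration of the pixel model, the sketch below fits the Bernoulli prior and Normal likelihood for a single pixel of a single pose class and then evaluates the foreground posterior by normalizing the prior-times-likelihood product over both labels. Several details are assumptions made only for this example: the function names `fit_pixel` and `foreground_posterior`, the variance floor `min_var`, and the value of the uniform background density, which is taken to be one over a hypothetical sensor depth range `depth_range` (in the same units as the depths).

```python
import math


def fit_pixel(depths, labels, min_var=1.0):
    """Maximum-likelihood estimates (b, mu, var) for one pixel of one pose class.

    depths: depth values seen at this pixel across the training images of class k.
    labels: matching 0/1 foreground labels.
    min_var: hypothetical variance floor that avoids degenerate Gaussians.
    """
    fg = [x for x, y in zip(depths, labels) if y == 1]
    b = len(fg) / len(labels)                        # frequency count
    if not fg:
        return b, 0.0, min_var
    mu = sum(fg) / len(fg)                           # sample mean
    var = sum((x - mu) ** 2 for x in fg) / len(fg)   # sample variance (ML)
    return b, mu, max(var, min_var)


def foreground_posterior(x, b, mu, var, depth_range=4000.0):
    """p(y=1 | x, k): prior times likelihood, normalized over both labels."""
    gauss = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    fg = b * gauss                       # Bernoulli prior x Normal likelihood
    bg = (1.0 - b) / depth_range         # (1 - prior) x assumed Uniform density
    total = fg + bg
    return fg / total if total > 0.0 else 0.0


# Toy training set for one pixel: three foreground depths, one background depth.
b, mu, var = fit_pixel([500.0, 520.0, 480.0, 900.0], [1, 1, 1, 0])
near = foreground_posterior(500.0, b, mu, var)   # depth consistent with the pose
far = foreground_posterior(900.0, b, mu, var)    # depth far behind the hand
```

A depth close to the class mean yields a posterior near one, while a depth far from it falls back to the uninformative background term, which matches the intuition that foreground depths are constrained by the pose and background depths are not.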
RGB-cues: Thus far, our segmentation model does
not make use of RGB-based grouping cues such as color
changes across object boundaries. To do so, we first com-
pute RGB-based superpixels [1] on a test image and reason
about the binary labels of superpixels rather than pixels:
where $S_j$ denotes the set of pixels from superpixel $j$. We
show a sample segmentation in Fig. 4. Our probabilistic approach
tends to produce more reliable segmentations than
existing approaches based on connected-component heuris-
tics [19].
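One plausible way to move from pixel posteriors to superpixel labels is sketched below. The aggregation rule used here (average the per-pixel foreground posterior over each superpixel and threshold at 0.5) is an assumption made for illustration and is not confirmed by the text; the function name `label_superpixels` and its arguments are likewise hypothetical.

```python
def label_superpixels(posteriors, superpixel_ids, threshold=0.5):
    """Assign one binary label per superpixel from per-pixel posteriors.

    posteriors: flat list of p(y_i = 1 | x_i, k), one value per pixel.
    superpixel_ids: flat list of the superpixel index j of each pixel.
    Returns (labels per superpixel, per-pixel mask).
    """
    sums, counts = {}, {}
    for p, j in zip(posteriors, superpixel_ids):
        sums[j] = sums.get(j, 0.0) + p
        counts[j] = counts.get(j, 0) + 1
    # Assumed rule: a superpixel is foreground if its mean posterior exceeds
    # the threshold.
    labels = {j: int(sums[j] / counts[j] > threshold) for j in sums}
    mask = [labels[j] for j in superpixel_ids]
    return labels, mask


# Toy example with two superpixels of three pixels each.
toy_labels, toy_mask = label_superpixels(
    [0.9, 0.8, 0.2, 0.1, 0.6, 0.3],
    [0, 0, 0, 1, 1, 1],
)
```

Labeling whole superpixels lets a few confident pixels pull along uncertain neighbours on the same side of a color boundary, which is the grouping effect the RGB cue is meant to provide.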
5.2. Fine-grained classification
We use the previous segmentation stage to produce features
that will be fed into a K = 71-way classifier. We use
state-of-the-art deep networks (specifically, Deep19 [41])
to extract a 3096 dimensional feature. We extract
off-the-shelf deep features for (1) the entire RGB image,
(2) a cropped window around the detected hand, and (3)
a segmented RGB image (Fig. 4 (d,g,h,i)). We resize each
window to a canonical size (of 224 x 224 pixels) before processing.
The intuition behind this choice is to mix high and
low resolution features, as well as global (contextual) and
local features. The final concatenated descriptors are fed
into a linear multi-class SVM for processing.
Exemplar matching: The above stages return an esti-
mate for the employed grasp and a fairly accurate quan-
tized pose class, but it is still quantized nonetheless. One
can refine this quantization by returning the closest syn-
thetic training example belonging to the recognized grasp
and the corresponding pose cluster. We do this by returning
the training example $n$ from quantized class $k$ with the
closest foreground depth:
NN(x)= min
We match only foreground depths in the $n$-th synthetic training
image $x^n$, as specified by its binary label $y^n$. Because
each synthetic exemplar is annotated with hand-object con-
tact points and forces from its parent 3D hand model, we
can predict forces and contact points by simply transferring
them from the selected grasp model to the exemplar location
in the 3D space.
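The exemplar lookup can be sketched as a masked nearest-neighbour search. The distance used below, a sum of squared depth differences restricted to the exemplar's foreground pixels, is an assumption chosen for illustration; the text only states that foreground depths are matched. The names `masked_depth_distance`, `nearest_exemplar`, and the dictionary keys are hypothetical.

```python
def masked_depth_distance(test_depth, exemplar_depth, exemplar_mask):
    """Squared depth difference summed over the exemplar's foreground pixels."""
    return sum(
        m * (x - xn) ** 2
        for x, xn, m in zip(test_depth, exemplar_depth, exemplar_mask)
    )


def nearest_exemplar(test_depth, exemplars):
    """Return the index of the closest synthetic exemplar in one pose class.

    exemplars: list of dicts with flat 'depth' and 'mask' lists; in a full
    system each would also carry its contact points and force directions.
    """
    return min(
        range(len(exemplars)),
        key=lambda n: masked_depth_distance(
            test_depth, exemplars[n]["depth"], exemplars[n]["mask"]
        ),
    )


# Toy example: four-pixel depth maps, last two pixels are background.
toy_test = [500.0, 510.0, 900.0, 900.0]
toy_exemplars = [
    {"depth": [400.0, 400.0, 0.0, 0.0], "mask": [1, 1, 0, 0]},
    {"depth": [505.0, 505.0, 0.0, 0.0], "mask": [1, 1, 0, 0]},
]
toy_best = nearest_exemplar(toy_test, toy_exemplars)
```

Note that an unnormalized sum favours exemplars with fewer foreground pixels; dividing by the mask size would be one way to remove that bias if the exemplars within a pose class differ much in foreground area.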
6. Experiments
For all the experiments of this section, we use a leave-one-out
approach where we train our 1-vs-all SVM classifiers
on 7 subjects and test on the remaining 8th subject. We repeat
that operation for each of the 8 subjects and average the results.
When analyzing our results, we refer to grasps by their id#.
In the supplementary material, we include a visualization of
all grasps in our taxonomy.
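The leave-one-subject-out protocol described above can be sketched as follows. The classifier is abstracted as a callback, and `majority_baseline` is only a hypothetical stand-in used to make the example runnable; it is not the SVM used in the experiments. The sample layout `(subject_id, features, label)` is also an assumption.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test) splits, one per subject.

    samples: list of (subject_id, features, label) tuples.
    """
    subjects = sorted({subject for subject, _, _ in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def majority_baseline(train, test):
    """Hypothetical placeholder classifier: predict the most frequent label."""
    labels = [label for _, _, label in train]
    most_common = max(sorted(set(labels)), key=labels.count)
    return [most_common for _ in test]


def mean_accuracy(samples, train_and_predict):
    """Average the per-fold accuracy over all held-out subjects."""
    accuracies = []
    for _, train, test in leave_one_subject_out(samples):
        predicted = train_and_predict(train, test)
        correct = sum(p == s[2] for p, s in zip(predicted, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy data: three subjects, two samples each (features omitted).
toy_samples = [
    (1, None, "a"), (1, None, "b"),
    (2, None, "a"), (2, None, "a"),
    (3, None, "a"), (3, None, "a"),
]
toy_folds = list(leave_one_subject_out(toy_samples))
toy_accuracy = mean_accuracy(toy_samples, majority_baseline)
```

Splitting by subject rather than by image keeps all views of a given hand-object configuration on the same side of the split, so test objects and scenes remain unseen during training.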
Baselines: We first run some “standard” baselines:
HOG-RGB, HOG-Depth, and an off-the-shelf deep RGB
feature [41]. We obtained the following average classifica-
tion rate: HOG-RGB (3.30%), HOG-Depth (6.55%), con-
catenated HOG-RGB and HOG-Depth (6.55%) and Deep-
RGB (11.31%). Consistent with recent evidence, deep fea-
tures considerably outperform their hand-designed counter-
parts, though overall performance is still rather low (Tab. 2).
Segmented/cropped data: Next, we evaluate the role
of context and clutter. Using segmented RGB images
marginally decreases accuracy of deep features from
11.31% to 11.10%, but recognition rates appear more
homogeneous. Looking at the individual grasp classification
rates, segmentation helps a little for most grasps but
hurts the accuracy of “easy” grasps where context or ob-
ject shape are important (but removed in the segmentation).
This includes non-prehensile “pressing” grasps (interacting
with a keyboard) and grasps associated with unique ob-
jects (chopsticks). Deep features extracted from a cropped
segmentation and cropped detection increase accuracy to
12.55% and 13.67%, respectively, suggesting that some
amount of local context around the hand and object helps.
Competing methods: [38,8] make use of HOG tem-
plates defined on segmented RGB images obtained with
skin detection. Because skin detectors did not work well
on our (in-the-wild) dataset, we re-implemented [8] us-
ing HOG templates defined on our depth-based segmen-
tations and obtained 7.69% accuracy. To evaluate re-
cent non-parametric methods [38], we experimented with
a naive nearest neighbor (NN) search using the different
features extracted for the above experiments and obtained
6.10%, 6.97%, and 6.31% grasp recognition accuracy using
Deep-RGB, cropped-RGB and cropped+segmented-RGB, respectively.
For clarity, these replace the K-way SVM classifier with
a NN search. The significant drop in performance suggests
that the learning is important, implying that our dataset is
still not big enough to cover all possible variation in pose,
objects and scenes.
Cue-combination: To take advantage of detection and
segmentation without hurting classes where context is important,
we trained our SVM grasp classifier on the concatenation
of all the deep features. Our final overall classification
rate of 17.97% is a considerable improvement over
a naive deep model (11.31%) as well as (our reimplementation
of) prior work (7.69%). The recognition rates per grasp
and the confusion matrix for this
classifier are given in Fig. 5.
Figure 5. RGB Deep feature + SVM. We show the individual
classification rates for the 71 grasps in our dataset (a, “Grasp
classification”) and the corresponding confusion matrix (b,
“Confusion matrix”).
Features                       Acc.   top 20  top 10  min   max
HOG-RGB                        3.30   7.20    9.59    0.00  28.54
HOG-Depth                      6.55   12.96   15.74   0.66  26.18
HOG-RGBD                       6.54   13.76   19.24   0.00  45.62
Deep-RGB [41]                  11.31  25.92   35.28   0.69  61.39
Deep-RGB (segm.)               11.10  21.56   26.51   0.69  29.46
HOG-RGB (cropped)              5.84   11.22   14.03   0.00  27.85
Deep-RGB (cropped)             13.67  27.32   36.95   1.22  55.35
HOG-RGB (crop.+segm.) [8]      7.69   15.23   18.65   0.69  30.77
HOG-Depth (crop.+segm.)        10.68  22.04   27.99   0.52  42.40
Deep-RGB (crop.+segm.)         12.55  22.89   27.85   0.69  37.49
Deep-RGB (All)                 17.97  36.20   44.97   2.71  68.48
Table 2. Grasp classification results. We present the result ob-
tained when training a K-way linear SVM (K=71) with different
types of features: HOG-RGB, HOG-Depth and Deep-RGB fea-
tures, on the whole workspace, i.e. entire image, on a cropped
detection window or on cropped and segmented image.
View               Acc.   top 20  top 10  min   max
All (All)          17.97  36.20   44.97   2.71  68.48
Best scoring view  22.67  47.53   59.04   0     79.37
Table 3. View selection. We present grasp recognition results ob-
tained when training a K-way linear SVM on a concatenation of
Deep features. We present the results obtained when computing
the average classification rate over 1) the entire dataset and 2) over
the top scoring view of each hand-object configuration.
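The summary columns of Tables 2 and 3 and the best-view protocol can be sketched as below. The column definitions are assumptions inferred from the surrounding text: "Acc." is taken as the mean of the per-grasp recognition rates, "top N" as the mean over the N best-recognized grasps, and "min"/"max" as the worst and best per-grasp rates. The function names and the `(config_id, score, is_correct)` layout are hypothetical.

```python
def summarize(per_class_rates, tops=(20, 10)):
    """Summarize per-grasp recognition rates into table-style columns."""
    ranked = sorted(per_class_rates, reverse=True)
    summary = {
        "acc": sum(ranked) / len(ranked),
        "min": ranked[-1],
        "max": ranked[0],
    }
    for n in tops:
        top = ranked[:n]  # the n best-recognized grasps
        summary["top %d" % n] = sum(top) / len(top)
    return summary


def all_view_accuracy(views):
    """Fraction of correctly classified views over the entire dataset."""
    return sum(1 for _, _, correct in views if correct) / len(views)


def best_view_accuracy(views):
    """Accuracy when only the top-scoring view of each configuration counts.

    views: list of (config_id, classifier_score, is_correct) tuples.
    """
    best = {}
    for config, score, correct in views:
        if config not in best or score > best[config][0]:
            best[config] = (score, correct)
    return sum(1 for _, correct in best.values() if correct) / len(best)


# Toy example: two hand-object configurations with several views each.
toy_views = [
    ("c1", 0.9, True), ("c1", 0.2, False), ("c1", 0.1, False),
    ("c2", 0.4, False), ("c2", 0.3, True),
]
toy_summary = summarize([10.0, 20.0, 30.0, 40.0], tops=(2,))
toy_all = all_view_accuracy(toy_views)
toy_best = best_view_accuracy(toy_views)
```

Because the top-N averages are taken over progressively easier subsets of grasps, the "top 10" column is expected to be at least as high as the "top 20" column, as it is in the tables.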
                   71 Gr. [28]  33 Gr. [14]  17 Gr. [9]
All views          17.97        20.50        20.53
Best scoring view  22.67        21.90        23.44
Table 4. Grasp classification for different sized taxonomies. We
present the results obtained for K = 71 [28], K = 33 [14] and
K = 17 [9], smaller taxonomies being obtained by selecting the
corresponding subsets of grasps.
Easy cases: High-performing grasp classes (Fig. 5) tend
to be characterized by limited variability in terms of viewpoint
(i.e., position and orientation of the hand w.r.t. the camera)
and/or object: e.g., opening a lid (#10), writing (#20), holding
chopsticks (#21), measuring with a tape (#33), grabbing
a large sphere such as a basketball (#45), using a screwdriver
(#47), trigger press (#49), using a keyboard (#60),
thumb press (#62), holding a wine cup (#72). Other high-performing
classes tend to exhibit limited occlusions of the
hand: hooking a small object (#15) and palm press (#55).
Common confusions: Common confusions in Fig. 6
suggest that finger kinematics are a strong cue captured by
deep features. Many confusions correspond to genuinely
similar grasps that differ by small details that might be eas-
ily occluded by the hand or the manipulated object: “Large
diameter” (#1) and “Ring” (#31) are both used to grasp
cylindrical objects, except that “Ring” only uses thumb and
index finger. When the last three fingers are fully occluded
by the object, it is visually impossible to differentiate them
(see Fig. 6-c). “Adduction-Grip” (#23) and “Middle-over-Index”
(#51) both involve grasping an object using the index
and middle finger. Adduction-Grip holds the object between
the two fingers, while Middle-over-Index holds the
object using the pad of the middle finger and nail of the index
finger (see Fig. 6-f).
Figure 6. Common confusions. The confusions occur when some
fingers are occluded (a and c) or when the poses are very similar
but the functionality (associated forces and contact points) is
different (b, d, e and f).
Best view: To examine the effect of viewpoint, we se-
lect the top-scoring view for each grasp class, increasing
accuracy from 17.97% to 22.67% (Tab. 3). Comparing the
two sets of recognition rates, best-view generally increases
the performance of easy grasps significantly more than difficult
ones: e.g., the average recognition rate of the top
20 grasps grows from 36.20% to 47.53%, while that of the top 10
grasps grows from 44.97% to 59.04%. This suggests that
some views may be considerably more ambiguous than others.
Comparison to state-of-the-art. We now compare
our results to those systems evaluated on previous grasp
datasets. Particularly relevant is [8], which presents vi-
sual grasp recognition results in similar settings, i.e. ego-
centric perspective and daily activities. In their case, they
consider a reduced 17-grasp taxonomy from Cutkosky [9],
obtaining 31% with HOG features overall and 42% on a
specific “machinist sequence” from [7]. Though these results
appear more accurate than ours, it is important to note
that their dataset contains less variability in the background
and scenes, and, crucially, their system appears to require
training a skin detector on a subset of the test set. Additionally,
it is not clear if they (or indeed, other past work)
allow for the same subject/scene to be included across the
train and test set. If we allow for this, recognition rate dramatically
increases to 85%. This is highly suggestive of overfitting,
and can be seen as a compelling motivation for the distinctly
large number of subjects and scenes that we capture
in our dataset.
Evaluations on limited taxonomies: If we limit our tax-
onomy to the 17 grasps from [8], i.e. by evaluating only the
subset of 17 classes, we obtain 20.53% and 23.44% (best
view). See Tab. 4. These numbers are comparable to those
reported in [8]. Best-view may be a fair comparison because
[7] used data that was manually labelled, where annota-
tors were explicitly instructed to only annotate those frames
that were visually unambiguous. In our case, subjects were
asked to naturally perform object manipulations, and the
data was collected “as-is”. Finally, if we limit our taxonomy
to the 33 grasps from Feix et al. [14], we obtained 20.50%
and 21.90% (best view). The marginal improvement when
evaluating grasps from smaller taxonomies suggests that the
new classes are not much harder to recognize. Rather, we
believe that overall performance is somewhat low because
our dataset is genuinely challenging due to diverse subjects,
scenes, and objects.
Force and contact point prediction: Finally, we
present preliminary results for force and contact prediction.
We do so by showing the best-matching synthetic 3D ex-
emplar from the detected pose class, along with its contact
and force annotations. Fig. 7 shows frames for which the
entire pipeline (detection + grasp recognition + exemplar
matching) led to an acceptable prediction. Unfortunately, we
are not able to provide a numerical evaluation as obtaining
ground-truth annotation of contact and forces is challenging.
One attractive option is to use active force sensors,
either embedded into pressure-sensitive gloves worn by the
user or through objects equipped with force sensors at predefined
grasp points (as done for a simplified cuboid object
in [32]). While certainly attractive, active sensing somewhat
violates the integrity of a truly in-the-wild, everyday
Figure 7. Force and contact points prediction. We show frames for which the entire pipeline (detection + grasp recognition + exemplar
matching) led to an acceptable prediction of forces and contact points. For each selected frame, we show from top to bottom: the RGB
image, the depth image with contact points and forces (respectively represented by green points and red arrows), the top scoring 3D exemplar
with associated forces and contact points, and finally the RGB image with overlaid forces and contact points.
7. Conclusions
We have introduced the challenging problem of under-
standing hands in action, including force and contact point
prediction, during scenes of in-the-wild, everyday object
manipulations. We have proposed an initial solution that
reformulates this high-dimensional, continuous prediction
task as a discrete fine-grained (functional grasp) classifica-
tion task. To spur further research, we have captured a new
large scale dataset of fine-grained grasps that we will re-
lease together with 3D models and rendering engine. Im-
portantly, we have captured this dataset from an egocentric
perspective, using RGB-D sensors to record multiple scenes
and subjects. We have also proposed a pipeline which ex-
ploits depth and RGB data, producing state-of-the-art grasp
recognition results. Our first analysis shows that depth
information is crucial for detection and segmentation, while
the richer RGB features allow for better grasp recognition.
Overall, our results indicate that grasp classification is
challenging, with accuracy approaching 20% for a 71-way
classification problem.
We have used a single 3D model per grasp. In future
work, it would be interesting to (1) model within-grasp vari-
ability, capturing the dependence of hand kinematics on
object shape and size and (2) consider subject-specific 3D
hand shape models [21], which could lead to a more accurate
set of synthetic exemplars (and associated forces and
contact points).
Acknowledgement. G. Rogez was supported by the Euro-
pean Commission under FP7 Marie Curie IOF grant “Ego-
vision4Health” (PIOF-GA-2012-328288). J. Supancic and
D. Ramanan were supported by NSF Grant 0954083, ONR-
MURI Grant N00014-10-1-0933, and the Intel Science and
Technology Center - Visual Computing. We thank our hand
models Allon H., Elisabeth R., Marga C., Nico L., Odile H.,
Santi M. and Sego L. for participating in the data collection.
References
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and
S. Susstrunk. Slic superpixels compared to state-of-the-art
superpixel methods. PAMI, 34(11):2274–2282, 2012.
[2] S. Allin and D. Ramanan. Assessment of Post-Stroke Func-
tioning using Machine Vision. In MVA, 2007.
[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A
survey of robot learning from demonstration. Robotics and
autonomous systems, 57(5):469–483, 2009.
[4] V. Athitsos and S. Sclaroff. Estimating 3d hand pose from a
cluttered image. In CVPR (2), pages 432–442, 2003.
[5] L. Bourdev and J. Malik. Poselets: Body part detectors
trained using 3d human pose annotations. In ICCV, 2009.
[6] M. Bouzit, G. Burdea, G. Popescu, and R. Boian. The rutgers
master ii-new design force-feedback glove. Mechatronics,
IEEE/ASME Transactions on, 7(2):256–263, 2002.
[7] I. M. Bullock, T. Feix, and A. M. Dollar. The yale hu-
man grasping dataset: Grasp, object, and task data in house-
hold and machine shop environments. I. J. Robotic Res.,
34(3):251–255, 2015.
[8] M. Cai, K. M. Kitani, and Y. Sato. A scalable approach for
understanding the visual structures of hand grasps. In ICRA,
[9] M. R. Cutkosky. On grasp choice, grasp models, and the
design of hands for manufacturing tasks. IEEE T. Robotics
and Automation, 5(3):269–279, 1989.
[10] D. Damen, A. P. Gee, W. W. Mayol-Cuevas, and A. Calway.
Egocentric real-time workspace monitoring using an rgb-d
camera. In IROS, 2012.
[11] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and
W. W. Mayol-Cuevas. You-do, i-learn: Discovering task rel-
evant objects and their modes of interaction from multi-user
egocentric video. In BMVC, 2014.
[12] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and
X. Twombly. Vision-based hand pose estimation: A review.
CVIU, 108(1-2):52–73, 2007.
[13] A. Fathi, X. Ren, and J. Rehg. Learning to recognize objects
in egocentric activities. In CVPR, 2011.
[14] T. Feix, R. Pawlik, H. Schmiedmayer, J. Romero, and
D. Kragic. A comprehensive grasp taxonomy. In RSS
Workshop on Understanding the Human Hand for Advanc-
ing Robotic Manipulation, 2009.
[15] A. R. Fielder and M. J. Moseley. Does stereopsis matter in
humans? Eye, 10(2):233–238, 1996.
[16] M. A. Goodrich and A. C. Schultz. Human-robot interac-
tion: a survey. Foundations and trends in human-computer
interaction, 1(3):203–275, 2007.
[17] R. P. Harrison. Nonverbal communication. Human Commu-
nication As a Field of Study: Selected Contemporary Views,
113, 1989.
[18] D.-A. Huang, W.-C. Ma, M. Ma, and K. M. Kitani. How
do we use our hands? discovering a diverse set of common
grasps. In CVPR, 2015.
[19] Intel. Perceptual computing sdk, 2013.
[20] Y. Jang, S. Noh, H. J. Chang, T. Kim, and W. Woo. 3d finger
CAPE: clicking action and position estimation under self-
occlusions in egocentric viewpoint. IEEE Trans. Vis. Com-
put. Graph., 21(4):501–510, 2015.
[21] S. Khamis, T. J., S. J., K. C., I. S., and A. Fitzgibbon. Learn-
ing an efficient model of hand shape variation from depth
images. In CVPR, 2015.
[22] M. Kölsch. An appearance-based prior for hand tracking. In
ACIVS (2), pages 292–303, 2010.
[23] M. Kölsch and M. Turk. Hand tracking with flocks of
features. In CVPR (2), page 1187, 2005.
[24] P. G. Kry and D. K. Pai. Interaction capture and synthesis.
In ACM Transactions on Graphics (TOG), volume 25, pages
872–880. ACM, 2006.
[25] T. Kurata, T. Kato, M. Kourogi, K. Jung, and K. Endo. A
functionally-distributed hand tracking method for wearable
visual interfaces and its applications. In MVA, 2002.
[26] N. Kyriazis and A. A. Argyros. Physically plausible 3d scene
tracking: The single actor hypothesis. In CVPR, 2013.
[27] N. Kyriazis and A. A. Argyros. Scalable 3d tracking of mul-
tiple interacting objects. In CVPR, 2014.
[28] J. Liu, F. Feng, Y. C. Nakamura, and N. S. Pollard. A taxon-
omy of everyday grasps in action. In 14th IEEE-RAS Inter-
national Conf. on Humanoid Robots, Humanoids 2014.
[29] S. Mann, J. Huang, R. Janzen, R. Lo, V. Rampersad,
A. Chen, and T. Doha. Blind navigation with a wearable
range camera and vibrotactile helmet. In ACM International
Conf. on Multimedia, MM ’11, 2011.
[30] T. Moller. A fast triangle-triangle intersection test. Journal
of Graphics Tools, 2:25–30, 1997.
[31] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Tracking
the Articulated Motion of Two Strongly Interacting Hands.
In CVPR, 2012.
[32] T.-H. Pham, A. Kheddar, A. Qammaz, and A. A. Argyros.
Towards force sensing from vision: Observing hand-object
interactions to infer manipulation forces. In CVPR, 2015.
[33] H. Pirsiavash and D. Ramanan. Detecting activities of daily
living in first-person camera views. In CVPR, 2012.
[34] X. Ren, C. C. Fowlkes, and J. Malik. Figure/ground assign-
ment in natural images. In Computer Vision–ECCV 2006,
pages 614–627. Springer, 2006.
[35] G. Rogez, J. S. S. III, and D. Ramanan. First-person pose
recognition using egocentric workspaces. In CVPR, 2015.
[36] G. Rogez, M. Khademi, J. Supancic, J. M. M. Montiel, and
D. Ramanan. 3d hand pose detection in egocentric RGB-D
images. In ECCV Workshop on Consumer Depth Camera
For Computer Vision, 2014.
[37] J. Romero, T. Feix, H. Kjellstrom, and D. Kragic. Spatio-
temporal modeling of grasping actions. In IROS, 2010.
[38] J. Romero, H. Kjellström, C. H. Ek, and D. Kragic.
Non-parametric hand pose estimation with object context. Image
Vision Comput., 31(8):555–564, 2013.
[39] J. Romero, H. Kjellstrom, and D. Kragic. Hands in action:
real-time 3D reconstruction of hands in interaction with ob-
jects. In ICRA, pages 458–463.
[40] A. Saxena, J. Driemeyer, J. Kearns, and A. Y. Ng. Robotic
grasping of novel objects. In NIPS, 2006.
[41] K. Simonyan and A. Zisserman. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014.
[42] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Model-
based hand tracking using a hierarchical bayesian filter.
PAMI, 28(9):1372–1384, 2006.
[43] J. Supancic, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan.
Depth-based hand pose estimation: data, methods, and chal-
lenges. In ICCV, 2015.
[44] R. Wang and J. Popovic. Real-time hand-tracking with a
color glove. ACM Trans on Graphics, 28(3), 2009.
