Appears in the IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France 2008.
Visual Recognition of Grasps for Human-to-Robot Mapping
Hedvig Kjellström, Javier Romero, Danica Kragić
Computational Vision and Active Perception Lab
Centre for Autonomous Systems
School of Computer Science and Communication
KTH, SE-100 44 Stockholm, Sweden
hedvig,jrgn,dani@kth.se
Abstract— This paper presents a vision-based method for grasp classification. It is developed as part of a Programming by Demonstration (PbD) system, for which recognition of objects and pick-and-place actions represent basic building blocks for task learning. In contrast to earlier approaches, no articulated 3D reconstruction of the hand over time is performed. The input consists of a single image of the human hand. A 2D representation of the hand shape, based on gradient orientation histograms, is extracted from the image. The hand shape is then classified as one of six grasps by finding similar hand shapes in a large database of grasp images. The database search is performed using Locality Sensitive Hashing (LSH), an approximate k-nearest neighbor approach. The nearest neighbors also give an estimated hand orientation with respect to the camera. The six human grasps are mapped to three Barret hand grasps. Depending on the type of robot grasp, a precomputed grasp strategy is selected. The strategy is further parameterized by the orientation of the hand relative to the object. To evaluate the potential for the method to be part of a robust vision system, experiments were performed, comparing classification results to a baseline of human classification performance. The experiments showed the LSH recognition performance to be comparable to human performance.
I. INTRODUCTION
Programming service robots for new tasks puts significant requirements on the programming interface and the user. It has been argued that the Programming by Demonstration (PbD) paradigm offers a great opportunity for inexperienced users to integrate complex tasks into a robotic system [1]. The aim of a PbD system is to use natural modes of human-robot interaction, where the robot can be programmed for new tasks simply by observing a human performing the task. However, representing, detecting and understanding human activities has proven difficult, and has been investigated closely during the past several years in the field of robotics [2], [3], [4], [5], [6], [7], [8].
In our work, we have been studying different types of object manipulation tasks where grasp recognition represents one of the major building blocks of the system [1]. Grasp recognition was previously performed using magnetic trackers [7], which, together with data gloves, are the most common way of obtaining such measurements in the robotics field. Although magnetic trackers and data gloves deliver exact values of the hand joints, it is desirable from a usability point of view that the user demonstrates tasks to the robot as naturally as possible; the use of gloves or other types of sensors may prevent a natural grasp. This motivates the use of systems with visual input.
Fig. 1. Human grasps are recognized and mapped to a robot. From one time frame of video, the hand is localized and segmented from the background. The hand orientation relative to the camera and the type of grasp are recognized by nearest neighbor comparison of the hand view with a database consisting of synthesized views of all grasp types from different orientations. The human grasp class is mapped to a corresponding robot grasp, and a predefined grasp strategy, the whole approach-grasp-retreat sequence, for that grasp is selected. The strategy is parameterized with the orientation and position of the hand relative to the object, obtained from the hand and object positions and orientations relative to the camera. (In our experiments, the object position and orientation were obtained by hand.)
Fig. 2. The six grasps (numbered according to Cutkosky's grasp taxonomy [9]) considered in the classification, and the three grasps for a Barret hand, with the human-to-robot class mappings ((a,b,c,e)→(g), (d)→(h), (f)→(i)) shown. a) Large Diameter grasp, 1. b) Small Diameter grasp, 2. c) Abducted Thumb grasp, 4. d) Pinch grasp, 9. e) Power Sphere grasp, 10. f) Precision Disc grasp, 12. g) Barret Wrap. h) Barret Two-finger Thumb. i) Barret Precision Disc.
Vision based recognition of a grasping hand is a difficult
problem, due to the self occlusion of the fingers as well as the
occlusion of the hand by the grasped object [10], [11], [12],
[13]. To simplify the problem, some approaches use optical
markers [14], but markers make the system less usable when
service robot applications are considered. We therefore strive
to develop a markerless grasp recognition approach.
Figure 1 outlines the whole mapping procedure. Although the scientific focus of this paper is the classification of human grasps, the classification method should be thought of as part of the whole mapping procedure, which consists of three main parts: the human grasp classification, the extraction of the hand position relative to the grasped object (with object detection not implemented for our experiments), and the compilation of a robot grasp strategy, parameterized by the type of grasp and the relative hand-object orientation and position, described in Section VI.
The main contribution of this paper is a non-parametric method for grasp recognition. While articulated 3D reconstruction of the hand is straightforward when using magnetic data or markers, 3D reconstruction of an unmarked hand from images is an extremely difficult problem due to the heavy occlusion [10], [11], [12], [13], and is actually more difficult than the grasp recognition problem itself, as discussed in Section II. Our method can classify grasps and find their orientation from a single image, from any viewpoint, without building an explicit representation of the hand, similarly to [12], [15]. Other grasp recognition methods (Section II) consider only a single viewpoint or employ an invasive sensing device such as data gloves, optical markers for motion capture, or magnetic sensors.
The general idea of recognizing the human grasp and selecting a precomputed grasping strategy is a secondary contribution of the paper, since it differs from the traditional way of approaching the mapping problem [7]: recovering the whole 3D pose of the human hand, tracking it through the grasp, and then mapping the motion to the robot arm. A recognition-based approach such as ours avoids the difficult 3D reconstruction problem, and is also much more computationally efficient, since it only requires processing of a single video frame.
The grasp recognition problem is here formalized as the
problem of classifying a hand shape as one of six grasp
classes, labeled according to Cutkosky’s grasp taxonomy [9].
The classes are, as shown in Figure 2a-f, Large Diameter
grasp, Small Diameter grasp, Abducted Thumb grasp, Pinch
grasp, Power Sphere grasp and Precision Disc grasp.
The input to the grasp classification method is a single image (one time instance, one camera viewpoint) from the robot's camera. The hand is segmented out using skin color segmentation, presented in more detail in Section III. From the segmented image, a representation of the 2D hand shape based on gradient orientation histograms is computed, as presented in Section IV. A large set of synthetic hand views, from many different viewpoints and performing all six types of grasps, has been generated; details are given in Section III. The new hand shape is classified as one of the six grasps by approximate k-nearest neighbor comparison using Locality Sensitive Hashing (LSH) [16]. Along with the grasp class, the estimated orientation of the hand relative to the camera is obtained by interpolating between the orientations of the found nearest neighbors. This is presented in Section V. Experiments presented in Section VII show the method to perform comparably to humans, which indicates that it is fit to be included in a complex vision system, such as the one required in a PbD framework.

Fig. 3. Processing of image data. a) Input image I from the robot, grabbed with an AVT Marlin F-080C camera. b) Segmented hand image H. c) Synthetic view of the hand, H_synth, generated in Poser 7.
II. RELATED WORK
Classification of hand pose is most often used for gesture recognition, e.g. sign language recognition [12], [17]. These applications are often characterized by little or no occlusion of the hands by other objects, and by a well defined and visually disparate set of hand poses; in the sign language case, the poses are designed to be easily separable to enable fast communication. Our problem of grasp recognition differs from this application in two ways. Firstly, the grasped object usually occludes large parts of the grasping hand. We address this by including expected occlusion in our dataset; occluding objects are present in all example views (Section III). Secondly, the different grasping poses are in some cases very similar, and there is also a large intra-class variation, which makes the classification problem more difficult.
Related approaches to grasp recognition [14], [18] first reconstruct the hand in 3D, from infrared images [18] or from an optical motion capture system which gives 3D marker positions [14]. Features from the 3D pose are then used for classification. The work of Ogawara et al. [18] views the grasp recognition problem as a problem of shape reconstruction, which makes their results hard to compare to ours. In addition, they use a wide baseline stereo system with infrared cameras, which makes their approach difficult to adopt on a humanoid platform.
The more recent work of Chang et al. [14] learns a non-redundant representation of pose, a subset of features selected from all 3D marker positions, using linear regression combined with supervised selection. In contrast, we use a completely non-parametric approach where the classification problem is transformed into a problem of fast LSH nearest neighbor search (Section V). While a linear approach is sufficient in the 3D marker space of Chang et al. [14], the classes in the orientation histogram space are less Gaussian-shaped and more intertwined, which necessitates a non-linear or non-parametric classifier such as ours.
Using 3D motion capture data as input, Chang et al. [14]
reached an astonishing recognition rate of up to 91.5%. For
the future application of teaching of service robots it is
however not realistic to expect that the teacher will be able
or willing to wear markers to provide the suitable input for
the recognition system. 3D reconstructions, although with
lower accuracy, can also be achieved from unmarked video
[19], [20]. However, Chang et al. [14] note that the full
3D reconstruction is not needed to recognize grasp type.
Grasp recognition from images is thus an easier problem
than 3D hand pose reconstruction from images, since fewer
parameters need to be extracted from the input. We conclude that full 3D reconstruction is an unnecessary (and error-prone) step in the chain from video input to grasp type.
Our previous work [7] considered an HMM framework for
recognition of grasping sequences using magnetic trackers.
Here, we are interested in evaluating a method that can
perform grasp classification based on a single image only,
but it should be noted that the method can easily be extended
for use in a temporal framework.
III. EXTRACTING THE HAND IMAGE
Since the robot grasp strategies are predefined, and only parameterized by the hand orientation, position and type of grasp, there is no need for the human to show the whole grasp procedure; a single time instance is enough (for example, the image that is grabbed when the human tells the robot “now I am grasping”).
The input to the recognition method is thus a single monocular image I from a camera mounted on the robot. For our experiments, we use an AVT Marlin F-080C camera. An example of an input image is shown in Figure 3a. Before being fed into the recognition stage, the image is preprocessed: the grasping hand is segmented from the background.
A. Segmentation of hand images
The hand segmentation could be done using a number of
modalities such as depth (estimated using stereo or an active
sensor) or color. We choose to use skin color segmentation; the details of the method used are described in [21]. To remove segmentation noise at the borders between background and foreground, the segmentation mask is median filtered three times with a 3 × 3 window.

Fig. 4. Gradient orientation histograms from the hand image H of Figure 3b, with B = 4 bins, on level l = 1 in the pyramid of L = 4 levels (spatial resolution 8 × 8). a) Bin 1, orientation π/8. b) Bin 2, orientation 3π/8. c) Bin 3, orientation 5π/8. d) Bin 4, orientation 7π/8.
The segmented image Ĥ is cropped around the hand and converted from RGB to grayscale. An example of the resulting hand image H is shown in Figure 3b.
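For concreteness, the post-processing of the skin-color segmentation can be sketched as below. The skin-color detector itself (from [21]) is assumed to be available as a binary mask; the median filtering, cropping and grayscale conversion follow the description above, while the OpenCV/NumPy calls are merely one possible implementation, not the code used in the paper.

```python
import cv2
import numpy as np

def extract_hand_image(rgb_image, skin_mask):
    """Sketch of Section III-A: turn a skin-color mask into the hand image H.

    `rgb_image` is the input frame I, `skin_mask` a binary (0/255) mask from a
    skin-color segmentation such as [21] (assumed given, not implemented here).
    """
    mask = skin_mask.astype(np.uint8)
    # Median filter the mask three times with a 3x3 window to remove
    # segmentation noise at the background/foreground borders.
    for _ in range(3):
        mask = cv2.medianBlur(mask, 3)

    # Keep only the skin-colored pixels and crop around the hand.
    segmented = cv2.bitwise_and(rgb_image, rgb_image, mask=mask)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # no hand detected in this frame
    cropped = segmented[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Convert the cropped hand image from RGB to grayscale.
    return cv2.cvtColor(cropped, cv2.COLOR_RGB2GRAY)
```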
B. Generation of synthetic hand images for the classification
The fact that the classification method (Section V) is non-parametric, and that no explicit model of the hand is built (Section IV), means that a very large set of examples, from many different views, is needed for each grasp.
As it is virtually intractable to generate such training sets using real images, we use commercial software, Poser 7, to generate synthetic views H_synth of different hand configurations. Poser 7 supplies a realistic 3D hand model which can be configured by bending the finger joints. For our purposes, the model was configured by hand into the six iconic grasps, which were slightly exaggerated to provide clear distinctions between the classes. 900 views of each configuration were generated, with viewing angles covering a half-sphere in steps of 6 degrees in camera elevation and azimuth; these are the views that can be expected by a robot with cameras above human waist-height. The synthetic hand was grasping an object whose shape was selected to be typical of that grasp [9]. The object was black (as the background), and occluded parts of the hand as it would in the corresponding real view of that grasp. This makes the synthetic views as similar as possible to the real views (e.g. Figure 3b), complete with the expected occlusion for that view and grasp. Figure 3c shows such a database example.
The synthetic images H_synth can be seen as ideal versions of the segmented and filtered real hand images H. Note that the recognition method is tested (Section VII) using real hand images prepared as described in the previous subsection, and that the synthetic images are used only for the database. Note further that the hand in the database is not the same as the hand in the test images.
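The database generation loop can be sketched as follows. The `render_view` function stands in for the Poser 7 rendering and is purely hypothetical, as is the assignment of the Euler angles from elevation and azimuth; the 6-degree sampling over a half-sphere giving 15 × 60 = 900 views per class follows the description above.

```python
import numpy as np

def build_grasp_database(render_view, feature_fn, grasp_classes):
    """Sketch of the database generation in Section III-B.

    `render_view(grasp_class, elevation, azimuth)` is a hypothetical wrapper
    around the Poser 7 rendering (not part of the paper), and `feature_fn`
    computes the orientation histogram feature of Section IV.
    """
    database = []  # list of (feature vector x, class label y, orientation o)
    for grasp_class in grasp_classes:          # the M = 6 grasp classes
        for elevation in range(0, 90, 6):      # 15 elevation steps
            for azimuth in range(0, 360, 6):   # 60 azimuth steps
                image = render_view(grasp_class, elevation, azimuth)
                x = feature_fn(image)
                # Assumed mapping from view angles to the stored Euler angles.
                o = np.array([0.0, elevation, azimuth])
                database.append((x, grasp_class, o))
    return database
```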
IV. IMAGE REPRESENTATION
For classification of grasps, we seek a representation of hand views (Figures 3b and 3c) with as low intra-class variance and as high inter-class variance as possible. We choose gradient orientation histograms, frequently used for representation of human shape [22], [23].
The gradient orientation Φ ∈ [0, π) is computed from the segmented hand image H as

\Phi = \arctan(H_y / H_x) \quad (1)

where H_x and H_y denote the image derivatives, x being the downward (vertical) direction and y the rightward (horizontal) direction in the image.
From Φ, a pyramid with L levels of histograms at different spatial resolutions is created; on each level l, the gradient orientation image is divided into 2^{L-l} × 2^{L-l} equal partitions. A histogram with B bins is computed from each partition. An example of the histograms at the lowest level of the pyramid can be seen in Figure 4.
The hand view is represented by x, which is the concatenation of all histograms at all levels in the pyramid. The length of x is thus B \sum_{l=1}^{L} 2^{2(L-l)}. The performance of the classifier is quite insensitive to the choices of B ∈ [3, 8] and L ∈ [2, 5]; in our experiments in Section VII we use B = 4 and L = 3.
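As an illustration of this representation, a minimal sketch of the feature computation is given below. It follows Eq. (1) and the pyramid construction described above; whether the histogram bins are weighted by gradient magnitude is not specified in the paper, so the weighting here is an assumption.

```python
import numpy as np

def orientation_histogram_pyramid(hand_image, num_bins=4, num_levels=3):
    """Gradient orientation histogram pyramid (sketch of Section IV).

    `hand_image` is the cropped grayscale hand image H. Returns the
    concatenated feature vector x of length B * sum_l 2^(2(L-l)).
    """
    H = hand_image.astype(np.float64)
    # Image derivatives; x is the vertical and y the horizontal direction.
    Hx, Hy = np.gradient(H)
    # Gradient orientation in [0, pi), Eq. (1).
    phi = np.mod(np.arctan2(Hy, Hx), np.pi)
    magnitude = np.hypot(Hx, Hy)  # assumption: bins weighted by gradient magnitude

    features = []
    rows, cols = phi.shape
    for level in range(1, num_levels + 1):
        cells = 2 ** (num_levels - level)          # 2^(L-l) partitions per side
        r_edges = np.linspace(0, rows, cells + 1).astype(int)
        c_edges = np.linspace(0, cols, cells + 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                patch_phi = phi[r_edges[i]:r_edges[i+1], c_edges[j]:c_edges[j+1]]
                patch_mag = magnitude[r_edges[i]:r_edges[i+1], c_edges[j]:c_edges[j+1]]
                hist, _ = np.histogram(patch_phi, bins=num_bins,
                                       range=(0.0, np.pi), weights=patch_mag)
                features.append(hist)
    return np.concatenate(features)
```

With B = 4 and L = 3 the resulting vector has length 4 · (16 + 4 + 1) = 84.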
V. APPROXIMATE NEAREST NEIGHBOR CLASSIFICATION
A database of grasp examples is created by synthesizing N = 900 views H^{synth}_{i,j}, i ∈ [1, M], j ∈ [1, N], from each of the M = 6 grasp classes (Section III), and generating gradient orientation histograms x_{i,j} from the synthetic views (Section IV). Each sample has associated with it a class label y_{i,j} = i and a hand-vs-camera orientation o_{i,j} = [φ_j, θ_j, ψ_j], i.e. the Euler angles from the camera coordinate system to a hand-centered coordinate system.
To find the grasp class ŷ and orientation ô of an unknown grasp view x acquired by the robot, a distance-weighted k-nearest neighbor (kNN) classification/regression procedure is used. First, X_k, the set of k nearest neighbors to x in terms of Euclidean distance d_{i,j} = \| x - x_{i,j} \|, is retrieved.
Fig. 5. Distance-weighted nearest neighbor classification. a-h) Some of the approximate nearest neighbors to the hand view in Figure 3b, with associated grasp class y_{i,j}, distance in state space d_{i,j}, and 3D orientation o_{i,j}: (a) 1, 0.5039, (0, 90, 132); (b) 1, 0.5238, (0, 96, 138); (c) 1, 0.5308, (0, 96, 132); (d) 1, 0.5517, (0, 90, 126); (e) 1, 0.5523, (0, 96, 144); (f) 1, 0.5584, (0, 102, 132); (g) 1, 0.5870, (0, 90, +108); (h) 4, 0.6068, (0, 90, +120).

As an exact kNN search would put serious limitations on the size of the database, an approximate kNN search method, Locality Sensitive Hashing (LSH) [16], is employed. LSH is a method for efficient ε-nearest neighbor (εNN) search, i.e. the problem of finding a neighbor x_{NN_ε} for a query x such that

\| x - x_{NN_\epsilon} \| \le (1 + \epsilon) \| x - x_{NN} \| \quad (2)

where x_{NN} is the true nearest neighbor of x. This is done as follows (see [16] for details): 1) T different hash tables are created independently. 2) For t = 1, ..., T, the part of the state space in which the dataset {x_{i,j}}, i ∈ [1, M], j ∈ [1, N], resides is randomly partitioned by K hyperplanes. 3) Every point x_{i,j} can thereby be described by a K-bit binary number f_{t,i,j}, defined by its position relative to the hyperplanes of table t. 4) As the total number of possible values of f_{t,i,j} is large, a hash function h(f_{t,i,j}) gives the index into a hash table of fixed size H.
The εNN distance to the unknown grasp view x is now found as follows: 1) For each of the T hash tables, compute the hash index h(f_t) for x. 2) Let X_∪ = {x_m}, m ∈ [1, N_∪], be the union set of the examples found in the T buckets. The εNN distance is then \| x - x_{NN_\epsilon} \| = \min_{m \in [1, N_\cup]} \| x - x_m \|. Analogously, the min(N_∪, k) ε-nearest neighbors X_k are found as the min(N_∪, k) nearest neighbors in X_∪.
The parameters K and T for a certain value of ε are dataset dependent, but are learned from the data itself [24]. We use ε = 0.05.
The computational complexity of retrieving the εNN with LSH [16] is O(D N^{1/(1+\epsilon)}), which gives sublinear performance for any ε > 0. For examples of ε-nearest neighbors to the hand in Figure 3b, see Figure 5.
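To make the retrieval step concrete, the following sketch indexes the database of feature vectors with random-hyperplane hashing in the spirit of [16]. The number of tables T, the number of hyperplanes K and the bucket keying are placeholder simplifications; in practice they are tuned for the desired ε as described above, and the implementation of [16] differs in detail (e.g. it hashes the K-bit code into a table of fixed size H).

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Sketch of a random-hyperplane LSH index as described in Section V."""

    def __init__(self, dim, num_tables=10, num_hyperplanes=12, seed=0):
        rng = np.random.default_rng(seed)
        # One (K x dim) set of hyperplane normals per table.
        self.hyperplanes = [rng.standard_normal((num_hyperplanes, dim))
                            for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.data = []      # stored feature vectors x_{i,j}
        self.labels = []    # (class y_{i,j}, orientation o_{i,j}) per sample

    def _key(self, planes, x):
        # K-bit binary code f: position of x relative to each hyperplane.
        bits = (planes @ np.asarray(x, dtype=float)) > 0
        return bits.tobytes()

    def add(self, x, label):
        idx = len(self.data)
        self.data.append(np.asarray(x, dtype=float))
        self.labels.append(label)
        for planes, table in zip(self.hyperplanes, self.tables):
            table[self._key(planes, x)].append(idx)

    def query(self, x, k=12):
        # Union of the buckets that x falls into, over all T tables.
        candidates = set()
        for planes, table in zip(self.hyperplanes, self.tables):
            candidates.update(table.get(self._key(planes, x), []))
        if not candidates:
            return []
        cand = sorted(candidates)
        dists = [np.linalg.norm(np.asarray(x, dtype=float) - self.data[i])
                 for i in cand]
        order = np.argsort(dists)[:k]
        # Approximate k nearest neighbors as (distance, label) pairs.
        return [(dists[i], self.labels[cand[i]]) for i in order]
```

A query returns at most k candidates drawn from the union of the matching buckets, which are then fed to the distance-weighted classification of Eqs. (3)-(4) below.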
From X_k the estimated class of x is found as

\hat{y} = \arg\max_i \sum_{j : x_{i,j} \in X_k} \exp\left( -\frac{d_{i,j}^2}{2\sigma^2} \right), \quad (3)

i.e. a distance-weighted selection of the most common class label among the k nearest neighbors, and the estimated orientation as

\hat{o} = \frac{ \sum_{j : x_{\hat{y},j} \in X_k} o_{\hat{y},j} \exp\left( -d_{\hat{y},j}^2 / (2\sigma^2) \right) }{ \sum_{j : x_{\hat{y},j} \in X_k} \exp\left( -d_{\hat{y},j}^2 / (2\sigma^2) \right) }, \quad (4)

i.e. a distance-weighted mean of the orientations of those samples among the k nearest neighbors for which y_{i,j} = ŷ. (The cyclic properties of the angles are also taken into account in the computation of the mean.) As can be seen in Figure 5h, the orientation of a sample from a different class has very low correlation with the true orientation, simply because the hand in a different grasp has a different shape. Therefore, only samples with the same class label as ŷ are used in the orientation regression. All in all, the dependency between the state space and the global Euler angle space is highly complex, which is why it is modeled non-parametrically.
The standard deviation σ is computed from the data as

\sigma = \frac{1}{2MN} \sum_i \sum_{j_1, j_2 \in [1,N],\, j_1 \neq j_2} \| x_{i,j_1} - x_{i,j_2} \|, \quad (5)

the mean intra-class, inter-point distance in the orientation histogram space [25].
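A compact sketch of the classification and regression steps of Eqs. (3)-(5) is given below. The handling of the cyclic Euler angles via a per-component circular mean, and the normalization of the kernel width as a plain pairwise mean, are simplifying assumptions; they follow the spirit, but not necessarily the exact constants, of the equations above.

```python
import numpy as np

def classify_grasp(neighbors, sigma):
    """Distance-weighted kNN classification and orientation regression.

    `neighbors` is a list of (distance, grasp_class, euler_angles_deg) tuples,
    e.g. as returned by an approximate kNN query; `sigma` is the kernel width.
    """
    # Eq. (3): distance-weighted vote over class labels.
    weights = {}
    for d, cls, _ in neighbors:
        weights[cls] = weights.get(cls, 0.0) + np.exp(-d**2 / (2.0 * sigma**2))
    best_class = max(weights, key=weights.get)

    # Eq. (4): weighted circular mean over neighbors of the winning class only.
    num = np.zeros(3, dtype=complex)
    den = 0.0
    for d, cls, angles in neighbors:
        if cls != best_class:
            continue
        w = np.exp(-d**2 / (2.0 * sigma**2))
        num += w * np.exp(1j * np.radians(np.asarray(angles, dtype=float)))
        den += w
    orientation = np.degrees(np.angle(num / den))
    return best_class, orientation


def kernel_width(features_by_class):
    """Mean intra-class, inter-point distance in feature space (cf. Eq. (5)).

    `features_by_class` maps each class to an (N x D) array of samples.
    Note: Eq. (5) normalizes by 1/(2MN); the plain pairwise mean is used here
    as an approximation.
    """
    total, count = 0.0, 0
    for X in features_by_class.values():
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        n = X.shape[0]
        total += dists.sum()      # diagonal entries are zero
        count += n * (n - 1)
    return total / count if count else 1.0
```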
The obviously erroneous neighbors in Figures 5g and 5h could perhaps have been avoided with a larger database containing hands of varying basic shape, such as male/female/skinny/fat/long-fingered/short-fingered hands. The hand in the test images (Figure 3b) is considerably different from the synthetic Poser 7 hand (Figures 3c, 5), and thus their 3D shapes differ even though they take the same pose. This poses no problem for the method in general; since the approximate kNN classification/regression has sub-linear complexity, the database can be increased considerably at a limited computational cost.

Fig. 6. Barret Wrap grasp, carried out on the same type and size of object as the human Large Diameter grasp shown in Figure 3b.
VI. EXAMPLE-BASED MAPPING OF GRASP TO ROBOT
To illustrate how the grasp classification can be employed for human-to-robot mapping in a pick-and-place scenario, a simulated robot arm is controlled with parameterized, predefined grasping strategies, as illustrated in Figure 1.
A human-to-robot grasp mapping scheme is defined depending on the type of robot hand used; here we use a Barret hand with three types of grasps, as shown in Figure 2. The type of robot grasp defines the preshape of the robot hand.
The hand orientation estimate ô relative to the camera, along with the hand position estimate and the estimated position and orientation of the grasped object relative to the camera, are used to derive the estimated position and orientation of the human hand relative to the object, as depicted in Figure 1. The estimation of the object position and orientation is assumed perfect; this part of the system is not implemented, and the ground truth is instead given in the simulations.
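The geometry step and the class mapping of Figure 2 can be sketched as follows. The Euler angle convention ('xyz', degrees) is an illustrative assumption, since the paper does not state which convention is used; the HUMAN_TO_BARRET dictionary simply encodes the mapping shown in Figure 2.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Human grasp class (Cutkosky numbering) -> Barret hand grasp, as in Figure 2.
HUMAN_TO_BARRET = {
    1: "wrap",              # Large Diameter
    2: "wrap",              # Small Diameter
    4: "wrap",              # Abducted Thumb
    10: "wrap",             # Power Sphere
    9: "two_finger_thumb",  # Pinch
    12: "precision_disc",   # Precision Disc
}

def hand_pose_relative_to_object(hand_pos_cam, hand_euler_cam,
                                 obj_pos_cam, obj_euler_cam):
    """Express the estimated hand pose in the object frame (Figure 1 geometry).

    All inputs are camera-frame quantities; the 'xyz' Euler convention in
    degrees is an assumption made for illustration.
    """
    R_hand = R.from_euler("xyz", hand_euler_cam, degrees=True).as_matrix()
    R_obj = R.from_euler("xyz", obj_euler_cam, degrees=True).as_matrix()

    # T_obj_cam^{-1} * T_hand_cam: rotation and translation of the hand
    # expressed in the object coordinate system.
    R_rel = R_obj.T @ R_hand
    t_rel = R_obj.T @ (np.asarray(hand_pos_cam) - np.asarray(obj_pos_cam))
    return t_rel, R.from_matrix(R_rel).as_euler("xyz", degrees=True)
```

The selected Barret preshape, together with this relative pose, parameterizes the precomputed approach-grasp-retreat strategy.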
In contrast to related grasping approaches [26], the robot here does not explore a range of approach vectors, but instead directly imitates the human approach vector, encoded in the hand position and orientation relative to the object. This leads to a much shorter computation time, at the expense of grasp optimality in terms of grasp quality. However, since the selection of the robot preshape has been guided, the stability of the robot grasp will be similar to that of the human one, leading to a non-optimal but successful grasp, provided that the errors in the orientation and position estimates are sufficiently small.
An analysis of the robustness to position errors can be found in [26]. For an optimally chosen preshape, there is an error window of 4 cm × 4 cm about the position of the object within which the grasps are successful. The positioning of the robot hand can also be improved by fusing the estimated human hand position with an automatic selection of grasping point based on object shape recognition [27].
The robustness to orientation errors depends greatly on
the type of grasp and object shape. We investigate the
robustness of the Barret Wrap grasp with an approach vector
perpendicular to the table (Figure 6). We get good results
for orientation errors around the vertical axis of up to 15
degrees. As a comparison, the mean regression error of this
orientation (Section VII-B) is on the same order as the
error window size, 10.5 degrees, which indicates that the
orientation estimation from the grasp classifier should be
used as an initial value for a corrective movement procedure
using e.g. the force sensors on the hand.
VII. EXPERIMENTAL RESULTS
Quantitative evaluations of the grasp classification and
orientation estimation performance were made.
For each of the six grasp types, two video sequences of the hand were captured, from two different viewpoints. From each video, three snapshots were taken: one where the hand was starting to reach for the object, one where the hand was about to grasp, and one where the grasp was completed. This test set is denoted X.
The test examples from the beginning of the sequences are naturally more difficult than the others, since the hand configuration in those cases is closer to a neutral configuration, and thus the examples are more alike than those taken closer to the completed grasp. It is interesting to study the classification rate for the different levels of neutrality, since it indicates the robustness to temporal errors when the robot grabs the image upon which the classification is based (Section III). In some tests below, we therefore removed the 12 most neutral examples from the test set; the resulting set is denoted X′. In other tests, we kept only the 12 most specific examples, denoted X′′.
A. Classification of human grasps: Comparison of LSH and
human classification performance
Fig. 7. Confusion matrices for classification of the six grasps. White represents a 100% recognition rate, while black represents 0%. a) LSH performance, all hand images (X): 61% correct classifications. b) LSH performance, images with hand close to object (X′): 71% correct classifications. c) LSH performance, images with hand grasping object (X′′): 75% correct classifications. d) Human performance, all hand images (X): 74% correct classifications.

Figures 7a, 7b, and 7c show the confusion matrices for LSH classification of test sets X, X′, and X′′, respectively. Apart from the fact that the performances on X′ and X′′
are better than for X, it can be noted that the performance on the Pinch grasp (9) and the Precision Disc grasp (12) is very good. This is expected, since these grasps are visibly very different from the others. Interestingly, it also concords with the mapping to the Barret grasps (Figure 2), in which these grasps have unique mappings while the others are all mapped to the same grasp. Note, however, that the human grasps map differently to more articulated robot hands.
The error rates alone say little about how the method would perform in a PbD system. There, the grasp recognition would interact with methods for object, shape and action recognition, and a perfect performance on an isolated grasp recognition task is probably not needed.
How do we then know what error rate is “enough”? Humans are very good at learning new tasks by visual observation, and reach near perfect performance on combined object, shape, action and grasp recognition. Human recognition performance on the same task as our classifier, with the same input data, would thus be a good baseline.
As an important side note, two things can be noted about this comparison. Firstly, in a natural learning situation, a human would also use information about the grasped object and the motion of the hand. This information is removed in this experiment. As discussed in the Conclusions, we intend to integrate automatic grasp, object and action recognition in the future. Secondly, it is debated how important depth perception is for human recognition; humans perceive depth both through stereo and through prior knowledge about hand proportions. We therefore disregard depth as a cue in the human experiment.
Figure 7d shows the classification performance of a human
familiar with the Cutkosky grasp taxonomy. The human was
shown the segmented hand images H in the set X in random
order and was asked to determine which of the six grasp
classes they belonged to.
Interestingly, the human made the same types of mistakes as the LSH classifier, although to a lower extent. He sometimes misclassified the Power Sphere grasp (10) as the Large Diameter grasp (1), and the Small Diameter grasp (2) as the Abducted Thumb grasp (4). This indicates that these types of confusions are intrinsic to the problem rather than dependent on the LSH and the training set. Since humans are successful at grasp recognition in a real-world setting, these confusions must be compensated for in some other way, probably by recognition of the shape of the grasped objects. It can also be noted that the human was better at recognizing the most neutral grasps, present in X but not in X′ or X′′.
Overall, the LSH performance is on par with, or slightly worse than, human performance. This must be regarded as a successful experimental result, and indicates that the grasp recognition method can be part of a PbD system with a low error rate.
B. Classification of human grasps: Orientation accuracy
Figure 8 shows the mean orientation error for regression with X. The angular displacement of the two coordinate systems corresponds to how far off a robot hand would be in grasping an object without corrective movements during grasping. As noted in Section VI, the orientation estimate from this method should only be regarded as an initial value, from which a stable grasp is found using a corrective movement procedure.
VIII. CONCLUSIONS
PbD frameworks are considered an important area for future robot development, where robots are supposed to learn new tasks through observation and imitation. Manipulating and grasping known and unknown objects represents a significant challenge, both in terms of modeling the observation process and in terms of executing it on the robot.
In this paper, a method for classification of grasps, based on a single image as input, was presented. A grasping hand was represented as a gradient orientation histogram, a 2D image-based representation. A new hand image could be classified as one of six grasps by a kNN search among a large set of synthetically generated hand images.
On the isolated task of grasp recognition, the method
performed comparably to a human. This indicates that the
method is fit for use in a PbD system, where it is used
in interaction with classifiers of object shape and human
actions. The dataset contained grasps from all expected viewpoints and with expected occlusion. This made the method view-independent, although no 3D representation of the hand was computed.

Fig. 8. Mean orientation error, all hand images (X): (0, 0.29, 0.18) radians = (0, 16.8, 10.5) degrees.
The method was considered part of a grasp mapping
framework, in which precomputed grasp strategies were
compiled based on the detected type of grasp and hand-object
orientation.
A. Future Work
It would be interesting to add an object orientation estimation technique to the system, and to execute the grasps on a real robot arm. Furthermore, we will investigate the inclusion of automatic positioning methods into the grasp strategies, as suggested in Section VI.
The classifier will also benefit from a training set with hands of many different shapes and grasped objects of different sizes. Although this will increase the size of the database, the sub-linear computational complexity of the LSH approximate kNN search ensures that the computation time will grow at a very limited rate.
This paper discussed instantaneous recognition of grasps in isolation. Most probably, a higher recognition performance can be reached using a sequence of images over time. Moreover, there is a statistical correlation between types of objects, object shapes, human hand actions, and human grasps in a PbD scenario. We are therefore working on integrating the grasp classifier into a method for continuous simultaneous recognition of objects and human hand actions, using Conditional Random Fields (CRF) [28].
IX. ACKNOWLEDGMENTS
This research has been supported by the EU through
the project PACO-PLUS, FP6-2004-IST-4-27657, and by the
Swedish Foundation for Strategic Research.
REFERENCES
[1] S. Ekvall, “Robot task learning from human demonstration, Ph.D.
dissertation, KTH, Stockholm, Sweden, 2007.
[2] Y. Kuniyoshi, M. Inaba, and H. Inoue, “Learning by watching, IEEE
Transactions on Robotics and Automation, vol. 10, no. 6, pp. 799–822,
1994.
[3] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends
in Cognitive Sciences, vol. 3, no. 6, pp. 233–242, 1999.
[4] A. Billard, “Imitation: A review, in Handbook of Brain Theory and
Neural Networks, M. Arbib, Ed., 2002, pp. 566–569.
[5] K. Ogawara, S. Iba, H. Kimura, and K. Ikeuchi, “Recognition of
human task by attention point analysis, in IEEE International Con-
ference on Intelligent Robots and Systems, 2000, pp. 2121–2126.
[6] M. C. Lopes and J. S. Victor, “Visual transformations in gesture
imitation: What you see is what you do, in IEEE International
Conference on Robotics and Automation, 2003, pp. 2375–2381.
[7] S. Ekvall and D. Kragić, “Grasp recognition for programming by demonstration tasks,” in IEEE International Conference on Robotics and Automation, 2005, pp. 748–753.
[8] S. Calinon, A. Billard, and F. Guenter, “Discriminative and adapta-
tive imitation in uni-manual and bi-manual tasks, in Robotics and
Autonomous Systems, vol. 54, 2005.
[9] M. Cutkosky, “On grasp choice, grasp models and the design of
hands for manufacturing tasks, IEEE Transactions on Robotics and
Automation, vol. 5, no. 3, pp. 269–279, 1989.
[10] J. Rehg and T. Kanade, “Visual tracking of high dof articulated
structures: An application to human hand tracking, in European
Conference on Computer Vision, vol. 2, 1994, pp. 35–46.
[11] E. Ueda, Y. Matsumoto, M. Imai, and T. Ogasawara, A hand-pose
estimation for vision-based human interfaces, in IEEE Transactions
on Industrial Electronics, vol. 50(4), 2003, pp. 676–684.
[12] V. Athitsos and S. Sclaroff, “Estimating 3D hand pose from a
cluttered image, in IEEE Conference on Computer Vision and Pattern
Recognition, 2003, pp. 432–439.
[13] C. Schwarz and N. Lobo, “Segment-based hand pose estimation, in
Canadian Conf. on Computer and Robot Vision, 2005, pp. 42–49.
[14] L. Y. Chang, N. S. Pollard, T. M. Mitchell, and E. P. Xing, “Fea-
ture selection for grasp recognition from optical markers, in IEEE
International Conference on Intelligent Robots and Systems, 2007.
[15] H. Murase and S. Nayar, “Visual learning and recognition of 3-D
objects from appearance, International Journal of Computer Vision,
vol. 14, pp. 5–24, 1995.
[16] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high
dimensions via hashing, in International Conference on Very Large
Databases, 1999, pp. 518–529.
[17] Y. Wu and T. S. Huang, “Vision-based gesture recognition: A review,
in International Gesture Workshop on Gesture-Based Communication
in Human-Computer Interaction, 1999, pp. 103–115.
[18] K. Ogawara, J. Takamatsu, K. Hashimoto, and K. Ikeuchi, “Grasp recognition using a 3D articulated model and infrared images,” in IEEE International Conference on Intelligent Robots and Systems, vol. 2, 2003, pp. 1590–1595.
[19] B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, “Model-
based hand tracking using a hierarchical bayesian filter, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9,
pp. 1372–1384, 2006.
[20] E. Sudderth, M. I. Mandel, W. T. Freeman, and A. S. Willsky,
“Visual hand tracking using non-parametric belief propagation, in
IEEE Workshop on Generative Model Based Vision, 2004.
[21] A. A. Argyros and M. I. A. Lourakis, “Real time tracking of multiple
skin-colored objects with a possibly moving camera, in European
Conference on Computer Vision, vol. 3, 2004, pp. 368–379.
[22] W. T. Freeman and M. Roth, “Orientational histograms for hand
gesture recognition, in IEEE International Conference on Automatic
Face and Gesture Recognition, 1995.
[23] G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with
parameter sensitive hashing, in IEEE International Conference on
Computer Vision, vol. 2, 2003, pp. 750–757.
[24] B. Georgescu, I. Shimshoni, and P. Meer, “Mean shift based clustering
in high dimensions: a texture classification example, in IEEE Inter-
national Conference on Computer Vision, vol. 1, 2003, pp. 456–463.
[25] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, “Core vector machines:
Fast SVM training on very large data sets, Journal of Machine
Learning Research, no. 6, pp. 363–392, 2005.
[26] J. Tegin, S. Ekvall, D. Kragić, B. Iliev, and J. Wikander, “Demonstration based learning and control for automatic grasping,” in International Conference on Advanced Robotics, 2007.
[27] A. Saxena, J. Driemeyer, J. Kearns, and A. Y. Ng, “Robotic grasping
of novel objects, in Neural Information Processing Systems, 2006.
[28] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields:
Probabilistic models for segmenting and labeling sequence data, in
International Conference on Machine Learning, 2001.