Learning Probabilistic Discriminative Models of Grasp Affordances
under Limited Supervision
Ayşe Naz Erkan, Oliver Kroemer, Renaud Detry, Yasemin Altun, Justus Piater, Jan Peters
Abstract— This paper addresses the problem of learning and efficiently representing discriminative probabilistic models of object-specific grasp affordances, particularly in situations where the number of labeled grasps is extremely limited. The proposed method does not require an explicit 3D model but rather learns an implicit manifold on which it defines a probability distribution over grasp affordances. We obtain hypothetical grasp configurations from visual descriptors that are associated with the contours of an object. While these hypothetical configurations are abundant, labeled configurations are very scarce, as these are acquired via time-costly experiments carried out by the robot. Kernel logistic regression (KLR) via joint kernel maps is trained to map this hypothesis space of grasps to continuous class-conditional probability values indicating their achievability. We propose a soft-supervised extension of KLR and a framework that combines the merits of semi-supervised and active learning approaches to tackle the scarcity of labeled grasps. Experimental evaluation shows that combining active and semi-supervised learning is favorable in the presence of an oracle. Furthermore, semi-supervised learning outperforms supervised learning, particularly when the labeled data is very limited.
I. INTRODUCTION
Grasping is a fundamental skill for robots that need to
interact with their environment in a flexible manner. A wide
spectrum of tasks (e.g., emptying a dishwasher, opening
a bottle, or using a hammer) depend on the capability to
reliably grasp an object or tool as part of a larger planning
framework. It is therefore imperative that the robot learns
a task-independent model of an object’s grasp affordances
in an efficient manner. Given such a flexible model, a
planner can be used to grasp and manipulate the object for
a wide range of tasks. In this paper, we investigate learning
probabilistic models of grasp affordances for an autonomous
robot equipped with a 3D vision system (see Figure 1). An object's grasp affordances refer to the likelihood of a location on the object being graspable, from a specific orientation, by the robot.
Until this decade, the predominant approach to grasping has been to obtain a full 3D model of the object and then employ techniques such as friction cones [1] and form and force closure [2]. Given the difficulty of obtaining a 3D model with sufficient accuracy to reliably apply these techniques, designing statistical learning methods
Max Planck Institute for Biological Cybernetics, Spemannstraße 38, Tübingen, Germany
{naz,oliverkro,altun,jan.peters}@tuebingen.mpg.de
Department of Electrical Engineering and Computer Science, Montefiore Institute, Université de Liège, 4000 Liège (Sart Tilman), Belgium
{renaud.detry,justus.piater}@ulg.ac.be
New York University, Computer Science Department, New York, NY
for grasping has become an active research field [3], [4],
[5], [6]. These new learning methods often employ efficient
representations and vision based models, without requiring
full 3D reconstruction, in order to provide a more robust
alternative to traditional approaches. Much of the previous
work focuses only on learning successful grasps [3], [4].
While such generative approaches can be advantageous in
cases of a well-defined data distribution, it is well-known
that discriminative learning methods have three main advan-
tages over generative models [7]: Firstly, they model class-
conditional probabilities of both successful and unsuccessful
grasp configurations, leading to a more descriptive model and
higher confidences for unsuccessful grasp regions. Secondly,
they can incorporate arbitrary feature representations more
flexibly. Thirdly, due to the conditional training, they are not affected by modeling errors of the data distribution.
The investigation of discriminative learning methods for
grasp affordances presented in this paper builds on previous approaches to conditional grasp affordance models,
namely [5] and [6]. In [5], the authors propose extracting a set of 2D image features and applying a discriminative supervised learning method to model grasp affordance probabilities given the 2D image. In [6], this approach is extended by
combining the classifier of [5] with a probabilistic classifier
using a set of arm/finger kinematics features in order to
identify physically impossible 2D points for the robot to
reach. The strength of their approach is the combination of
two important kinds of information, i.e., image and kinematic
features, in a probabilistic manner.
We propose using Kernel Logistic Regression (KLR) [8]
for training grasp affordance models. The main motivation
behind this approach is to have the system learn a mapping
from local visual features to probabilities directly, as this
yields more general models than a comparison of explicit
geometric models to those in an object database. While this
approach enjoys the advantages of a probabilistic model, it
can also capture the non-linear relations between potential
grasps efficiently via kernels. This is an essential merit, since
our visual grasp features are extracted from the contours of
the objects and the orientation of the robot’s hand, which
results in the grasps lying on a non-linear manifold.
The KLR method provides a principled way of combining
information from the object as well as from the robot hand
via joint kernels [9]. By training a single classifier using
joint kernels, as opposed to training two separate classifiers
as was previously done [6], our approach can capture non-
linear interactions of the morphology of the robot hand and
the surface characteristics of the object implicitly. The system
therefore does not have to rely on explicit representations
such as closed form geometric descriptions or libraries of
feasible grasps.
Executing and labeling grasps of novel objects is a time-
consuming process that requires human monitoring and may
damage the objects. However, a vast number of hypothetical
grasp configurations can be generated by a vision model,
such as the Early Cognitive Vision reconstructor. These
hypothetical grasps cannot be given confident labels, as they have not been empirically tested, and are therefore effectively unlabeled. We investigate using such unlabeled
data in our KLR approach to reduce the number of grasps
that need to be annotated for the affordance model. In
particular, we propose combining a novel semi-supervised
KLR method with active learning in the context of robot
grasping.
Semi-supervised learning and active learning are sub-fields
of machine learning that aim to handle the scarcity of labeled
data. Semi-supervised learning methods, e.g., [10] and the
references therein, use a large set of unlabeled data in
order to improve the classification performance by revealing
the underlying geometry of the data. Active learning does
not rely on a source of unlabeled data, but rather assumes
the existence of an annotator, commonly referred to as the
oracle, that can provide labels to queries. In a robotics
context, the annotator corresponds to the robot attempting
to perform new grasps. The goal of active learning is to
guide the robot to evaluate the most informative grasps so
that the classification error is reduced with the fewest queries
possible.
Fig. 1. Three-finger Barrett hand equipped with a 3D vision system. A table tennis paddle is used in the experiments.
This framework enables the robot to learn incrementally by autonomously evaluating grasps. We provide comparisons between supervised, semi-supervised, and hybrid semi-supervised/active learning setups, as minimizing the need for large amounts of labeled data is an essential concern. Experimental evaluations show not only that the proposed active learning and semi-supervised learning methods individually improve the system's performance, but also that the amount of necessary annotated data is significantly reduced when supervised learning is combined with active learning.
This paper is organized as follows. In Section II, we describe the details of the feature acquisition. Section III
gives a detailed explanation of the machine learning tech-
niques evaluated in the context of robot grasping. Section IV
overviews relevant work in the literature. In Section V, we
introduce the experimental setup, give empirical results and
provide a comparison of supervised, semi-supervised and
active learning approaches. Finally, Section VI provides a
discussion and directions for future work.
(a) Feasible configurations. (b) Infeasible configurations. (c) Hypothesis space.
Fig. 2. The kernel logistic regression algorithm is used to discriminate the successful grasps (a) from the unsuccessful grasps (b), which lie on separable nonlinear manifolds. The entire hypothesis space (c) of potential grasp configurations, extracted from pairs of ECV descriptors, contains feasible as well as infeasible configurations.
II. VISUAL FEATURE EXTRACTION FOR GRASPING
The inputs of our learning algorithm are represented as
grasp configurations generated from Early Cognitive Vision
(ECV) descriptors [11], [12], which represent short edge
segments in 3D space, as described in [3]. Accordingly,
an ECV reconstruction is performed. Next, pose hypotheses
for potential grasps are generated from pairs of co-planar
ECV descriptors. The grasp position is set to the location of one of the descriptors in the pair, whereas the grasp orientation is computed from the normal of the plane on
which these descriptors lie. The assumption is that two co-
planar segments constitute a potential edge of the object that
the robot hand can hold. However, this assumption is quite optimistic, as many infeasible edges and orientations will be included in the hypothesis space (see Figure 2). Hence, we need a
learning algorithm to discriminate between the feasible and
infeasible grasps contained in this set.
Each grasp is represented by seven values in the object-relative reference frame: three for the position and four for the orientation as a unit-length quaternion. The object-relative reference frame is a coordinate system that is attached to the object such that any rigid-body transformation applied to the object is also applied to the coordinate system and objects therein.
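For concreteness, a minimal sketch of this seven-value encoding is given below; the helper name and sample values are hypothetical illustrations, not code from our system.

```python
import numpy as np

def make_grasp(position, quaternion):
    """Encode a grasp as x = (s, r): a 3D position s in the object-relative
    frame plus a unit-length quaternion r for the hand orientation."""
    s = np.asarray(position, dtype=float)      # 3 position values
    r = np.asarray(quaternion, dtype=float)    # 4 orientation values
    r = r / np.linalg.norm(r)                  # enforce unit length
    return np.concatenate([s, r])              # 7-D configuration vector

# Because the frame is attached to the object, a rigid-body transform applied
# to the object leaves this representation unchanged.
x = make_grasp([0.02, -0.01, 0.15], [0.92, 0.0, 0.38, 0.0])
assert x.shape == (7,)
```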
III. LEARNING GRASP AFFORDANCES
In this section we outline the key concepts of our learning
algorithm. First, we describe a kernel used as a distance
metric between pairs of grasp configurations. This kernel
decomposes into separate distance measures on the position
and rotation parameters. We use this kernel in the KLR al-
gorithm. Later, we propose a soft-supervised variation of the
KLR algorithm so that it can accommodate unlabeled data
via this distance metric. Finally, we describe the uncertainty
criterion to select grasps for the queries in the active learning
setting.
A. Joint Kernel
Each grasp configuration $x = (s, r)$ consists of seven parameters, i.e., three from the 3D position $s$ of the robot hand in the object's reference frame, and four from the unit quaternion $r$ defining the rotation. These values live in different coordinate systems and have to be treated separately in order to obtain a proper distance metric. This distance metric, which indicates the similarity of two configurations, is employed both for the kernel computation and for the similarity measure required by semi-supervised learning, see Equation (2). We define the joint kernel as
$$K(x_a, x_b) = \exp\!\left( -\frac{\|s_a - s_b\|^2}{2\sigma_s^2} - \frac{f(\theta_{ab})^2}{2\sigma_{f(\theta)}^2} \right),$$
where $f$ is the rotational distance, and $\sigma_s$ and $\sigma_{f(\theta)}$ are the standard deviations of the position and rotation distances over all pairs of samples, respectively. In order to cope with the double-cover property [13] of quaternions, we compute the rotational distance $f(\theta_{ab})$ as the smaller angle between the two unit-length quaternions $r_a$ and $r_b$. This definition allows us to use a Gaussian distribution on the rotational distance metric. Here, $\theta_{ab}$ is the angle of the 3D rotation that moves $r_a$ to $r_b$, i.e., $\theta_{ab} = \theta(r_a, r_b) = \arccos(r_a^{T} r_b)$, and
$$f(\theta_{ab}) = \min\{\theta(r_a, r_b),\ \theta(r_a, -r_b)\}.$$
For further details on distance computations between unit quaternions, see [13]. This joint kernel is similar to that of [14] in the way it decomposes into kernels on position and rotation features. However, the authors there employ a Dimroth-Watson distribution for the rotational kernel; the Gaussian distribution is preferable due to the computational complexity of the former.
B. Kernel Logistic Regression
Our goal is to model the conditional probability distribution of grasp success $y \in \{-1, 1\}$ given a grasp configuration $x$ as defined in Section III-A. Given labeled data $S = \{(x_i, y_i)\}_{i=1}^{l}$, KLR achieves this goal by maximizing the regularized log-likelihood of the data $R(w; S)$, defined by
$$R(w; S) = \sum_{i=1}^{l} \log p(y_i \mid x_i; w) - \lambda \|w\|^2, \qquad (1)$$
$$p(y = 1 \mid x; w) = 1 / \left(1 + \exp(-\langle w, f(x) \rangle)\right),$$
where $f(x)$ refers to an implicit feature representation induced by a kernel $k$, and $w$ is the corresponding weight vector. It has been shown that this optimization problem can be derived from the Maximum Entropy (MaxEnt) framework, where the goal is to find a conditional probability distribution $p(y \mid x)$ that matches the data (in the sense that the expected values of features with respect to $p(y \mid x)$ should match their empirical counterparts) while remaining as simple as possible, or equivalently maximizing the class-conditional entropy $H = -\sum_y p(y \mid x) \log p(y \mid x)$:
$$\max_p \; \mathbb{E}_{x \sim \tilde p_m}\!\left[H(p(y \mid x))\right] \quad \text{s.t.} \quad \left\| \mathbb{E}_{x \sim \tilde p_m} \mathbb{E}_{y \sim p(y \mid x)}[y f(x)] - \mathbb{E}_{(x,y) \sim \tilde p_j}[y f(x)] \right\| \le \epsilon.$$
Here $\tilde p_j$ denotes the empirical joint distribution and $\tilde p_m$ the empirical marginal distribution over $x$. Defining $\tilde p_m(x_i) = 1/l$ and $\tilde p_j(x_i, y_i) = 1/l$ for all $(x_i, y_i) \in S$ and using duality techniques yields (1).
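As an illustration, KLR with the joint kernel can be trained by maximizing (1) over dual coefficients alpha, using the representer form w = sum_i alpha_i f(x_i) (introduced formally in Section III-C). The following sketch uses generic gradient-based optimization and is not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative KLR training sketch. K is the n x n Gram matrix of the joint
# kernel, y holds labels in {-1, +1}, and lam is the regularization constant
# (lambda in Equation (1)).

def train_klr(K, y, lam):
    n = K.shape[0]

    def neg_objective(alpha):
        # Representer form: <w, f(x_i)> = (K @ alpha)_i
        margins = y * (K @ alpha)
        log_lik = -np.sum(np.logaddexp(0.0, -margins))  # sum_i log p(y_i|x_i)
        reg = lam * alpha @ K @ alpha                   # lambda * ||w||^2
        return -(log_lik - reg)                         # minimize the negative

    res = minimize(neg_objective, np.zeros(n), method="L-BFGS-B")
    return res.x

def predict_proba(K_test, alpha):
    """p(y=1|x) for test rows of kernel evaluations against training data."""
    return 1.0 / (1.0 + np.exp(-(K_test @ alpha)))
```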
C. Semi-Supervised Kernel Logistic Regression
The duality relation mentioned in Section III-B suggests
that the accuracy of KLR depends on accurate estimates of
the empirical marginal and joint distributions. Our goal in the
semi-supervised KLR (SSKLR) method is to use unlabeled
data to reduce the sampling bias of these distributions.
This can be achieved by imposing the smoothness of the
conditional distribution in the sense that two similar grasp
configurations have similar success and failure probabilities.
To this end, we propose assigning soft labels to unlabeled grasp configurations $\{x_i\}_{i=l+1}^{n}$ that are in the vicinity of labeled grasp configurations with respect to the manifold on which the grasp configurations lie. If the similarity
metric conveys the true geometry of the grasp configurations
and KLR is trained with respect to the soft success/failure
assignments for unlabeled grasp configurations as well as
the true labels of labeled grasp configurations, the resulting
conditional probability distribution is expected to be smooth.
Similarity-based soft-label assignment is equivalent to manipulating the joint distribution $\tilde p_j$ to include soft-labeled data. We define $\tilde p_m(x_i) = 1/n$ and $\tilde p_j$ as
$$\tilde p_j(x_i, y) = \begin{cases} 1/Z_j & \text{if } 1 \le i \le l,\ y = y_i, \\ s_{ik}/Z_j & \text{if } l < i \le n,\ 1 \le k \le l,\ x_i \in N_k,\ y = y_k, \\ 0 & \text{otherwise}, \end{cases} \qquad (2)$$
where $s_{ik}$ is the similarity between $x_i$ and $x_k$ given by the distance metric of Section III-A, $N_k$ is the neighborhood of the labeled configuration $x_k$, and $Z_j$ is the normalization factor for $\tilde p_j$ to be a proper probability distribution.
Equation (2) allows an unlabeled data point to be soft-labeled by multiple labeled data points with possibly different labels, which is desirable if an unlabeled data point lies close to multiple label regions. Given these definitions and using duality, we
derive the SSKLR problem as maximizing
$$R(w; S) = \sum_{i=1}^{n} \sum_{y} \tilde p_j(x_i, y) \langle w, y f(x_i) \rangle - \sum_{i=1}^{n} \tilde p_m(x_i) \log \sum_{y} \exp \langle w, y f(x_i) \rangle - \lambda \|w\|^2. \qquad (3)$$
The Representer Theorem [15] states that the optimal weight vector of Equation (3) admits the form $w = \sum_{i=1}^{n} \alpha_i f(x_i)$. When we substitute this solution into Equation (3), we obtain a convex optimization problem over $\alpha$, which can be solved with any convex optimization technique. Inference for a new grasp configuration $x$ is given by the sign of $\sum_{i=1}^{n} \alpha_i k(x_i, x)$.
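The soft-label assignment of Equation (2) can be sketched as follows. This is an illustrative reading in which each unlabeled configuration receives mass from its most similar labeled configurations under the joint kernel; the names and the exact neighborhood rule are our assumptions:

```python
import numpy as np

# Sketch of the soft-label assignment in Equation (2). X_lab/y_lab are the l
# labeled configurations and their labels, X_unl the unlabeled ones; `kernel`
# is the joint kernel of Section III-A, K_neigh the neighborhood size.

def soft_joint_distribution(X_lab, y_lab, X_unl, kernel, K_neigh):
    l = len(X_lab)
    entries = []  # tuples (sample index, label, unnormalized mass)
    for i, y in enumerate(y_lab):
        entries.append((i, y, 1.0))          # labeled: mass 1/Z_j on (x_i, y_i)
    for j, x in enumerate(X_unl):
        sims = np.array([kernel(x, xk) for xk in X_lab])
        # x is soft-labeled by nearby labeled configurations; here the
        # neighborhood is approximated by the K_neigh most similar ones.
        for k in np.argsort(sims)[-K_neigh:]:
            entries.append((l + j, y_lab[k], sims[k]))  # mass s_jk / Z_j
    Z = sum(m for _, _, m in entries)        # normalize to a distribution
    return [(i, y, m / Z) for i, y, m in entries]
```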
D. Uncertainty-Based Active Learning
We can employ active learning in scenarios where the
robot has the means to choose what to learn. For the active
selection of grasps, we use uncertainty sampling [16] which
is straightforward for probabilistic models. In this method,
the algorithm queries for the grasps on which it is the least confident. Therefore, at each iteration, the algorithm requests the true label for the grasp $x^*$ that has the highest class-conditional entropy among the set of unlabeled grasps $U$:
$$x^* = \operatorname*{argmax}_{x \in U} H(p(y \mid x)).$$
In turn, the robot carries out the configuration that corresponds to $x^*$ and labels it accordingly.
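A one-function sketch of this query rule, assuming a trained model that returns p(y=1|x) for each pool element (naming is ours):

```python
import numpy as np

def select_query(pool_probs):
    """Uncertainty sampling: pick the pool index whose predicted p(y=1|x)
    yields the highest binary class-conditional entropy."""
    p = np.clip(np.asarray(pool_probs), 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return int(np.argmax(entropy))

# Example: the grasp with probability closest to 0.5 is queried.
print(select_query([0.9, 0.48, 0.1]))  # -> 1
```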
IV. RELATED WORK
Efficient representation and vision based modeling of
grasp configurations is an active research field [3], [5]. We
follow the methodology in [3] to obtain grasp pose candi-
dates and orientations as described in Section II. However,
the authors learn grasp densities using successful grasps
only, whereas in this paper, we model the class condi-
tional probabilities of both successful and unsuccessful grasp
configurations in a discriminative manner. Furthermore, we
focus on the scarcity of the labeled data points and we
evaluate active and semi-supervised learning algorithms with
the smallest number of annotated experiences possible.
de Granville et al. [4] present a method where the robot learns a mapping from object representations to grasps from human demonstration. They cluster the orientations of grasps, and each cluster is associated with a canonical approach orientation. The authors indicate that limiting the encoding to orientations, i.e., excluding position knowledge, is due to their underlying assumption that orientation and position are independent.
As labeled data collection is expensive for most robotics
tasks, active learning techniques have already been consid-
ered. Salganicoff et al. [17] proposed some of the earliest
work on uncertainty based active learning for vision-based
grasp learning by modifying the ID3, a decision tree algo-
rithm. Montesano and Lopes [18] also propose a method to
learn local visual descriptors of good grasping points via self-
experimentation. Their method associates the outputs with
confidence values.
In machine learning, various methods to combine semi-
supervised and active learning have been proposed to exploit
the merits of both approaches [19], [20]. To our knowledge, our work is the first to do so in the context of robotics. The active learning
methodology in [20] is similar to ours, as the authors
employ confidence sampling for active learning based on the
probabilistic outputs of a logistic regression classifier. Their
method differs from ours since they perform semi-supervised
learning via self-training, whereas we propose a soft-labeling
approach motivated from the maximum entropy framework.
V. EMPIRICAL EVALUATION
We have empirically evaluated the methods described in
Section III on a 3-finger Barrett robot with simple objects
such as a table tennis paddle. For supervised learning, we
have used a Kernel Logistic Regression classifier and the
joint kernel defined on position and orientation features. The
labels were collected by a human demonstrator. For the semi-supervised experiments, we have used the SSKLR loss given in Section III-C. Details on the experimental setup, such as data
collection, preprocessing, model selection and the results are
given below.
A. Experimental Setup
We collected 200 samples, 100 successful (positive labels)
and 100 unsuccessful (negative labels) grasps. We preprocess
the data by normalizing the position parameters to zero
mean and unit variance. The unit quaternions do not require
preprocessing.
All experiments are carried out using the following variation of a fourfold cross-validation. We have separated the 200 samples into four non-overlapping validation sets of size 50. The model variance in semi-supervised and active learning can be high, as the training set is typically very small. In order to compensate for the resulting high variance, we have generated five random training sets from each of the remaining 150 samples, with equal numbers of positive and negative samples. For the data-set simulations of the active learning scenario, we used the rest of the samples as the active learning pool for each of the 20 training sets, trn_1, ..., trn_20. Model selection is performed over the averages of the models trained on these 20 training sets, and their classification performance is assessed on the corresponding validation sets. To summarize, models trained on sets trn_1, ..., trn_5 are assessed on validation set val_1, trn_6, ..., trn_10 on validation set val_2, and so on.
Our framework has two hyper-parameters which are to be set during model selection. The first parameter, $K$, is the size of the neighborhood in the soft-label assignment step in Equation (2). The second parameter, $\lambda$, is the regularization constant of the kernel logistic regression algorithm. We
sweep over a grid of values $K \in \{10, 20, 30, 50\}$ and $\lambda \in \{10^2, 10^3, 10^4\}$, and report the error for the hyper-parameters selected with the cross-validation error described above. Note that, for active learning, model selection is performed only once, at the initial step.
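A compact sketch of this model selection protocol; the `fit` and `error` callbacks and the fold bookkeeping are hypothetical placeholders, not our actual pipeline:

```python
import numpy as np
from itertools import product

# Grid search over (K, lam) using the 20 training sets trn_1..trn_20 and the
# four validation folds; trn_1..trn_5 map to val_1, trn_6..trn_10 to val_2, etc.

def model_selection(train_sets, val_sets, fit, error,
                    Ks=(10, 20, 30, 50), lams=(1e2, 1e3, 1e4)):
    best = None
    for K, lam in product(Ks, lams):
        errs = []
        for i, trn in enumerate(train_sets):   # 20 training realizations
            val = val_sets[i // 5]              # matching validation fold
            model = fit(trn, K=K, lam=lam)
            errs.append(error(model, val))
        mean_err = np.mean(errs)                # average over the 20 models
        if best is None or mean_err < best[0]:
            best = (mean_err, K, lam)
    return best                                 # (error, K, lam)
```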
B. Evaluation on Collected Data Sets
We evaluate the supervised and semi-supervised models
with increasing sizes of labeled data. When additional data
is selected with uncertainty sampling, we assess the ac-
tive supervised and active semi-supervised performances. In
all experiments, we train initial models with 10 randomly
selected labeled samples. We perform model selection in
this setup and fix the value of the hyper-parameters for the
following experiments. The semi-supervised algorithm uses
an additional unlabeled set of size 4000. All results are the
averages over the models trained over 20 realizations of the
training set and the fourfold cross validation.
First, we empirically evaluate the performance of semi-supervised learning versus supervised learning. Figure 3 shows the improvement in classification error as randomly selected samples are added to the training sets one at a time, i.e., the classification error of KLR and SSKLR with respect to an increasing amount of labeled data. As expected, when the size of the labeled data is small, semi-supervised learning is advantageous over supervised learning. The difference diminishes as the dataset gets larger.

Fig. 3. Supervised and semi-supervised logistic regression error on the validation sets versus the number of randomly selected labeled samples added to the initial training set of size 10. Model selection is carried out at the initial step with 10 samples. 50 samples are added incrementally, and all models are retrained at each iteration. SSKLR uses an unlabeled training set of size 4000. K, the neighborhood size for the similarity-based augmentation (Equation (2)), is set to 30.
An alternative evaluation measure is the perplexity of the data, $2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}$, which measures the uncertainty of the predictions of the trained models. This information-theoretic measure is commonly used for probabilistic models in fields such as speech recognition and natural language processing [21]. In Figure 5, we plot the perplexity of KLR and SSKLR. This figure shows that the semi-supervised model is more confident (smaller perplexity) in its predictions than the supervised model, and thus yields preferable results. We also note that the variance of the perplexity across different validation sets is smaller in the case of SSKLR when the dataset is small. This renders semi-supervised learning more robust compared to supervised learning in real-life scenarios.
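For reference, a small sketch of the perplexity computation under one plausible reading of the formula above (summing the per-prediction binary entropies over the validation samples, which matches the magnitudes in Figures 5 and 6; naming is ours):

```python
import numpy as np

def perplexity(probs):
    """Perplexity 2**H of a set of binary predictions p(y=1|x), where H sums
    the per-prediction entropies in bits; larger means less confident."""
    p = np.clip(np.asarray(probs), 1e-12, 1 - 1e-12)
    H = -np.sum(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return 2.0 ** H

print(perplexity([0.99, 0.97]))  # confident model -> close to 1
print(perplexity([0.5, 0.5]))    # maximally uncertain -> 4.0 (2 bits)
```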
Secondly, we comparatively demonstrate the impact of active learning. Figure 4 illustrates the performance of both KLR and SSKLR when incrementally trained with uncertainty-based sampling. The corresponding perplexity plots are shown in Figure 6. The comparison of KLR and SSKLR in the active learning setting shows a similar behaviour to that of random selection (Figures 3 and 5). Figure 7 illustrates the classification error rate for all four scenarios together. For the supervised classifier, the improvement rate is clearly faster with active learning than with random selection. A 10% error rate is achieved with 17 actively selected samples, whereas 40 samples are required to reach the same error rate with random selection.
C. On-Policy Evaluation
In order to test our approach in a real-life setting, we have used a second object: the watering can shown in Figure 8(a).
Fig. 4. Supervised and semi-supervised classification error on the validation sets as actively selected samples are queried via uncertainty sampling. The error bars indicate one standard deviation over the 20 models. The initial 10 labeled samples are randomly selected.
Fig. 5. Perplexity in random sampling.
Fig. 6. Perplexity in active sampling.
Fig. 7. Classification error rate for KLR, SSKLR, active KLR, and active SSKLR.
Fig. 8. (a) The watering can used for the on-policy evaluation. (b) The initial set of 20 labeled training configurations used to initiate the incremental algorithm.
For the experiments, we have collected a total of 20 labeled instances: 10 successful and 10 unsuccessful configurations. Figure 8(b) illustrates this initial training set, where green marks feasible grasps and red marks infeasible ones. We then trained the system incrementally with 15 additional samples, separately with randomly sampled (RS) and actively sampled (AS) data. After training, we identified the 10 test configurations on which the AS and RS models disagree the most. When we carried out these configurations on the robot, the AS decision was correct and the RS decision failed in 10 out of 10 cases, indicating that AS is stronger near the decision boundaries.
VI. CONCLUSION AND FUTURE WORK
We have presented a probabilistic approach to model the
success likelihoods of grasp configurations from a pool of
hypothetical configurations extracted from ECV descriptors.
The main bottleneck in the learning process is the scarcity of labeled data, due to the time cost of annotating grasps. Therefore, we have used semi-supervised and active learning approaches in the context of robot grasping. We have experimentally evaluated these approaches in two settings: in the former, the data is provided only once as a batch, whereas in the latter, the agent has the means to query new labeled samples incrementally. We provided results for a three-finger Barrett hand and simple objects. Experimental evaluation indicates that combining semi-supervised and active learning approaches is effective in improving the robot's performance under limited supervision. However, it may not always be possible to train a system incrementally; when that is the case, semi-supervised learning alone is advantageous.
A future direction is to learn visual cues that are shared
among various objects so that the grasp affordance models
are not object-specific but can be generalized to many object
categories. We plan to investigate this direction by using
the features proposed in [6] within the joint kernel KLR
framework.
REFERENCES
[1] M. T. Mason and J. K. Salisbury, Manipulator Grasping and Pushing Operations. MIT Press, 1985.
[2] A. Bicchi and V. Kumar, "Robotic grasping and contact: a review," in IROS, 2000.
[3] R. Detry, E. Baseski, M. Popovic, Y. Touati, N. Krüger, O. Kroemer, J. Peters, and J. Piater, "Learning object-specific grasp affordance densities," in International Conference on Development and Learning (ICDL), 2009, pp. 1–7.
[4] C. de Granville, J. Southerland, and A. H. Fagg, "Learning grasp affordances through human demonstration," in Proceedings of the International Conference on Development and Learning (ICDL), 2006.
[5] A. Saxena, J. Driemeyer, and A. Y. Ng, "Robotic grasping of novel objects using vision," The International Journal of Robotics Research, vol. 27, no. 2, pp. 157–173, 2008.
[6] A. Saxena, L. Wong, and A. Y. Ng, "Learning grasp strategies with partial shape information," in AAAI, 2008.
[7] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.
[8] J. Zhu and T. Hastie, "Kernel logistic regression and the import vector machine," in NIPS. MIT Press, 2001.
[9] G. Bakir, J. Weston, and B. Schölkopf, "Learning to find pre-images," in NIPS. MIT Press, 2003.
[10] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.
[11] N. Krüger, M. Lappe, and F. Wörgötter, "Biologically motivated multi-modal processing of visual primitives," Interdisciplinary Journal of Artificial Intelligence and the Simulation of Behaviour (AISB Journal), vol. 1, no. 5, pp. 417–427, 2004.
[12] N. Pugeault, Early Cognitive Vision: Feedback Mechanisms for the Disambiguation of Early Visual Representation. Verlag Dr. Müller, ISBN 978-3-639-09357-5, 2008.
[13] J. J. Kuffner, "Effective sampling and distance metrics for 3D rigid body path planning," in IEEE International Conference on Robotics and Automation, 2004, pp. 3993–3998.
[14] R. Detry, N. Pugeault, and J. H. Piater, "A probabilistic framework for 3D visual object representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1790–1803, 2009.
[15] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[16] D. D. Lewis and J. Catlett, "Heterogeneous uncertainty sampling for supervised learning," in ICML, W. W. Cohen and H. Hirsh, Eds. Morgan Kaufmann, 1994, pp. 148–156.
[17] M. Salganicoff, L. H. Ungar, and R. Bajcsy, "Active learning for vision-based robot grasping," Machine Learning, vol. 23, no. 2-3, pp. 251–278, 1996.
[18] L. Montesano and M. Lopes, "Learning grasping affordances from local visual descriptors," in International Conference on Development and Learning (ICDL), 2009.
[19] X. Zhu, J. Lafferty, and Z. Ghahramani, "Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions," in ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003, pp. 58–65.
[20] G. Tur, D. Hakkani-Tür, and R. E. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Communication, vol. 45, no. 2, pp. 171–186, 2005.
[21] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.