Robot Learning Manipulation Action Plans by “Watching” Unconstrained Videos
from the World Wide Web
Yezhou Yang
University of Maryland
Yi Li
NICTA, Australia
Cornelia Fermüller
University of Maryland
Yiannis Aloimonos
University of Maryland
In order to advance action generation and creation in robots
beyond simple learned schemas we need computational tools
that allow us to automatically interpret and represent human
actions. This paper presents a system that learns manipula-
tion action plans by processing unconstrained videos from
the World Wide Web. Its goal is to robustly generate the se-
quence of atomic actions of seen longer actions in video in
order to acquire knowledge for robots. The lower level of the
system consists of two convolutional neural network (CNN)
based recognition modules, one for classifying the hand grasp
type and the other for object recognition. The higher level
is a probabilistic manipulation action grammar based pars-
ing module that aims at generating visual sentences for robot
manipulation. Experiments conducted on a publicly avail-
able unconstrained video dataset show that the system is able
to learn manipulation actions by “watching” unconstrained
videos with high accuracy.
The ability to learn actions from human demonstrations is
one of the major challenges for the development of intel-
ligent systems. Particularly, manipulation actions are very
challenging, as there is large variation in the way they can
be performed and there are many occlusions.
Our ultimate goal is to build a self-learning robot that is
able to enrich its knowledge about fine grained manipulation
actions by “watching” demo videos. In this work we explic-
itly model actions that involve different kinds of grasping,
and aim at generating a sequence of atomic commands by
processing unconstrained videos from the World Wide Web.
The robotics community has been studying perception
and control problems of grasping for decades (Shimoga
1996). Recently, several learning based systems were re-
ported that infer contact points or how to grasp an ob-
ject from its appearance (Saxena, Driemeyer, and Ng 2008;
Lenz, Lee, and Saxena 2014). However, the desired grasp-
ing type could be different for the same target object, when
used for different action goals. Traditionally, data about the
grasp has been acquired using motion capture gloves or hand
trackers, such as the model-based tracker of (Oikonomidis, Kyriazis, and Argyros 2011). The acquisition of grasp information from video (without 3D information) is still considered very difficult because of the large variation in appearance and the occlusions of the hand from objects during manipulation.

Copyright © 2015, Association for the Advancement of Artificial Intelligence ( All rights reserved.
Our premise is that actions of manipulation are repre-
sented at multiple levels of abstraction. At lower levels the
symbolic quantities are grounded in perception, and at the
high level a grammatical structure represents symbolic in-
formation (objects, grasping types, actions). With the recent
development of deep neural network approaches, our system
integrates a CNN based object recognition and a CNN based
grasping type recognition module. The latter recognizes the
subject’s grasping type directly from image patches.
The grasp type is an essential component in the charac-
terization of manipulation actions. Just from the viewpoint
of processing videos, the grasp contains information about
the action itself, and it can be used for prediction or as a fea-
ture for recognition. It also contains information about the
beginning and end of action segments, thus it can be used to
segment videos in time. If we are to perform the action with
a robot, knowledge about how to grasp the object is neces-
sary so the robot can arrange its effectors. For example, con-
sider a humanoid with one parallel gripper and one vacuum
gripper. When a power grasp is desired, the robot should
select the vacuum gripper for a stable grasp, but when a pre-
cision grasp is desired, the parallel gripper is a better choice.
Thus, knowing the grasping type provides information for
the robot to plan the configuration of its effectors, or even
the type of effector to use.
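As a toy illustration of this effector-selection logic (the grasp-type names follow the Power/Precision categories introduced later; the mapping itself is only the hypothetical rule from the example above, not part of our system):

```python
def choose_effector(grasp_type):
    """Map a recognized grasp type to an effector, following the example
    in the text: power grasps call for the vacuum gripper, precision
    grasps for the parallel gripper."""
    return 'vacuum gripper' if grasp_type.startswith('Power') else 'parallel gripper'

print(choose_effector('Power-Spherical'))   # vacuum gripper
print(choose_effector('Precision-Small'))   # parallel gripper
```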
In order to perform a manipulation action, the robot also
needs to learn what tool to grasp and on what object to per-
form the action. Our system applies CNN based recogni-
tion modules to recognize the objects and tools in the video.
Then, given the beliefs of the tool and object (from the out-
put of the recognition), our system predicts the most likely
action using language, by mining a large corpus using a
technique similar to (Yang et al. 2011). Putting everything
together, the output from the lower level visual perception
system is in the form of (LeftHand GraspType1 Object1 Ac-
tion RightHand GraspType2 Object2). We will refer to this
septet of quantities as a visual sentence.
At the higher level of representation, we generate a sym-
bolic command sequence. (Yang et al. 2014) proposed a
context-free grammar and related operations to parse ma-
nipulation actions. However, their system only processed
RGBD data from a controlled lab environment. Further-
more, they did not consider the grasping type in the gram-
mar. This work extends (Yang et al. 2014) by modeling ma-
nipulation actions using a probabilistic variant of the context
free grammar, and explicitly modeling the grasping type.
Using as input the belief distributions from the CNN
based visual perception system, a Viterbi probabilistic parser
is used to represent actions in form of a hierarchical and
recursive tree structure. This structure innately encodes the
order of atomic actions in a sequence, and forms the basic
unit of our knowledge representation. By reverse parsing it,
our system is able to generate a sequence of atomic com-
mands in predicate form, i.e. as Action(Subject, Patient)
plus the temporal information necessary to guide the robot.
This information can then be used to control the robot effec-
tors (Argall et al. 2009).
Our contributions are twofold. (1) A convolutional neural
network (CNN) based method has been adopted to achieve
state-of-the-art performance in grasping type classification
and object recognition on unconstrained video data; (2) a
system for learning information about human manipulation
action has been developed that links lower level visual per-
ception and higher level semantic structures through a prob-
abilistic manipulation action grammar.
Related Works
Most work on learning from demonstrations in robotics has
been conducted in fully controlled lab environments (Aksoy
et al. 2011). Many of the approaches rely on RGBD sensors
(Summers-Stay et al. 2013), motion sensors (Guerra-Filho,
Fermüller, and Aloimonos 2005; Li et al. 2010) or specific
color markers (Lee et al. 2013). The proposed systems are
fragile in real world situations. Also, the amount of data used
for learning is usually quite small. It is extremely difficult to
learn automatically from data available on the internet, for
example from unconstrained cooking videos from Youtube.
The main reason is that the large variation in the scenery will
not allow traditional feature extraction and learning mecha-
nism to work robustly.
At the high level, a number of studies on robotic ma-
nipulation actions have proposed ways on how instruc-
tions are stored and analyzed, often as sequences. Work
by (Tenorth, Ziegltrum, and Beetz 2013), among others,
investigates how to compare sequences in order to reason
about manipulation actions using sequence alignment meth-
ods, which borrow techniques from informatics. Our paper
proposes a more detailed representation of manipulation ac-
tions, the grammar trees, extending earlier work. Chomsky
in (Chomsky 1993) suggested that a minimalist generative
grammar, similar to the one of human language, also ex-
ists for action understanding and execution. The works most
closely related to this paper are (Pastra and Aloimonos 2012;
Summers-Stay et al. 2013; Guha et al. 2013; Yang et al.
2014). (Pastra and Aloimonos 2012) first discussed a Chom-
skyan grammar for understanding complex actions as a theo-
retical concept, and (Summers-Stay et al. 2013) provided an im-
plementation of such a grammar using as perceptual input
only objects. (Yang et al. 2014) proposed a set of context-
free grammar rules for manipulation action understanding.
However, their system used data collected in a lab environ-
ment. Here we process unconstrained data from the internet.
In order to deal with the noisy visual data, we extend the ma-
nipulation action grammar and adapt the parsing algorithm.
The recent development of deep neural network based
approaches revolutionized visual recognition research. Dif-
ferent from the traditional hand-crafted features (Lowe
2004; Dalal and Triggs 2005), a multi-layer neural network
architecture efficiently captures sophisticated hierarchies de-
scribing the raw data (Bengio, Courville, and Vincent 2013),
which has shown superior performance on standard object
recognition benchmarks (Krizhevsky, Sutskever, and Hinton
2013; Ciresan, Meier, and Schmidhuber 2012) while utiliz-
ing minimal domain knowledge. The work presented in this
paper shows that with the recent developments of deep neu-
ral networks in computer vision, it is possible to learn ma-
nipulation actions from unconstrained demonstrations using
CNN based visual perception.
Our Approach
We developed a system to learn manipulation actions from
unconstrained videos. The system takes advantage of: (1)
the robustness from CNN based visual processing; (2) the
generality of an action grammar based parser. Figure 1 shows
our integrated approach.
CNN based visual recognition
The system consists of two visual recognition modules, one
for classification of grasping types and the other for recogni-
tion of objects. In both modules we used convolutional neu-
ral networks as classifiers. First, we briefly summarize the
basic concepts of Convolutional Neural Networks, and then
we present our implementations.
A Convolutional Neural Network (CNN) is a multilayer
learning framework, which may consist of an input layer,
a few convolutional layers and an output layer. The goal
of CNN is to learn a hierarchy of feature representations.
Response maps in each layer are convolved with a number
of filters and further down-sampled by pooling operations.
These pooling operations aggregate values in a smaller re-
gion by downsampling functions including max, min, and
average sampling. The learning in CNN is based on Stochas-
tic Gradient Descent (SGD), which includes two main oper-
ations: Forward and BackPropagation. Please refer to (Le-
Cun and Bengio 1998) for details.
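For illustration, the convolution and max-pooling operations can be sketched in plain NumPy; the filter bank mirrors our first layer (32 filters of size 5×5), but the weights here are random stand-ins rather than learned parameters:

```python
import numpy as np

def conv2d(x, filters):
    """'Valid' 2-D convolution of a single-channel image with a filter bank.
    x: (H, W); filters: (K, fh, fw) -> response maps (K, H-fh+1, W-fw+1)."""
    K, fh, fw = filters.shape
    H, W = x.shape
    out = np.zeros((K, H - fh + 1, W - fw + 1))
    for k in range(K):
        for i in range(H - fh + 1):
            for j in range(W - fw + 1):
                out[k, i, j] = np.sum(x[i:i + fh, j:j + fw] * filters[k])
    return out

def max_pool(x, size=2):
    """Downsample each response map by taking the max over size x size regions."""
    K, H, W = x.shape
    x = x[:, :H - H % size, :W - W % size]
    return x.reshape(K, H // size, size, W // size, size).max(axis=(2, 4))

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))       # gray-scale input patch
filters = rng.standard_normal((32, 5, 5))   # 32 filters of size 5x5
maps = max_pool(conv2d(image, filters))     # convolve, then 2x2 max-pool
print(maps.shape)                           # (32, 14, 14)
```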
We used a seven layer CNN (including the input layer and
two perception layers for regression output). The first con-
volution layer has 32 filters of size 5×5, the second convo-
lution layer has 32 filters of size 5×5, and the third convo-
lution layer has 64 filters of size 5×5, respectively. The first
perception layer has 64 regression outputs and the final per-
ception layer has 6 regression outputs. Our system considers
6 grasping type classes.
Grasping Type Recognition A number of grasping tax-
onomies have been proposed in several areas of research, in-
Figure 1: The integrated system reported in this work.
cluding robotics, developmental medicine, and biomechan-
ics, each focusing on different aspects of action. In a recent
survey (Feix et al. 2013) reported 45 grasp types in the litera-
ture, of which only 33 were found valid. In this work, we use
a categorization into six grasping types. First we distinguish,
according to the most commonly used classification (based
on functionality) into power and precision grasps (Jeannerod
1984). Power grasping is used when the object needs to be
held firmly in order to apply force, such as “grasping a knife
to cut”; precision grasping is used in order to do fine grain
actions that require accuracy, such as “pinch a needle”. We
then further distinguish among the power grasps, whether
they are spherical, or otherwise (usually cylindrical), and
we distinguish the latter according to the grasping diame-
ter, into large diameter and small diameter ones. Similarly,
we distinguish the precision grasps into large and small di-
ameter ones. Additionally, we also consider a Rest position
(no grasping performed). Table 1 illustrates our grasp cat-
egories. We denote the list of these six grasps as G in the
remainder of the paper.
Table 1: The list of the grasping types: Power Small Diameter, Power Large Diameter, Power Spherical, Precision Small Diameter, Precision Large Diameter, and Rest (no grasping performed).
The input to the grasping type recognition module is a gray-scale image patch around the target hand performing the grasping. We resize each patch to 32×32 pixels, and subtract the global mean obtained from the training data.
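This preprocessing step can be sketched as follows (a nearest-neighbor resize is used here for brevity; the raw patch size and the zero mean are placeholder values):

```python
import numpy as np

def preprocess(patch, mean, size=32):
    """Nearest-neighbor resize of a gray-scale patch to size x size,
    then subtraction of the global mean computed on the training data."""
    h, w = patch.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = patch[rows][:, cols].astype(np.float64)
    return resized - mean

rng = np.random.default_rng(2)
hand_patch = rng.random((60, 45))   # cropped hand region (hypothetical size)
mean = np.zeros((32, 32))           # stand-in for the training-set mean
print(preprocess(hand_patch, mean).shape)   # (32, 32)
```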
For each testing video with M frames, we pass the target hand patches (left hand and right hand, if present) frame by frame, and we obtain an output of size 6×M. We sum it up along the temporal dimension and then normalize the output. We use the classification for both hands to obtain (GraspType1) for the left hand, and (GraspType2) for the right hand. For the video of M frames the grasping type recognition system outputs two belief distributions of size 6×1: PGraspType1 and PGraspType2.
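The temporal aggregation step amounts to summing the per-frame scores and renormalizing; a minimal sketch with hypothetical scores for M = 4 frames:

```python
import numpy as np

def aggregate_beliefs(frame_scores):
    """Collapse per-frame classifier outputs (C x M) into a single belief
    distribution (C x 1) by summing over the temporal dimension and
    normalizing the result to sum to one."""
    summed = frame_scores.sum(axis=1)
    return summed / summed.sum()

# Hypothetical per-frame scores for the 6 grasp classes over 4 frames;
# the first row (one grasp class) dominates in every frame.
frame_scores = np.array([
    [0.70, 0.60, 0.80, 0.50],
    [0.10, 0.20, 0.10, 0.30],
    [0.05, 0.05, 0.02, 0.05],
    [0.05, 0.05, 0.03, 0.05],
    [0.05, 0.05, 0.03, 0.05],
    [0.05, 0.05, 0.02, 0.05],
])
p_grasp = aggregate_beliefs(frame_scores)
print(p_grasp.argmax())   # 0: the dominant class survives the aggregation
```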
Object Recognition and Corpus Guided Action Prediction The input to the object recognition module is an RGB image patch around the target object. We resize each patch to 32×32×3 pixels, and we subtract the global mean obtained from the training data.
Similar to the grasping type recognition module, we also
used a seven layer CNN. The network structure is the same
as before, except that the final perception layer has 48 regression outputs. Our system considers 48 object classes, and we denote this candidate object list as O in the rest of the paper. Table 2 lists the object classes.
apple, blender, bowl, bread, brocolli, brush, butter, carrot,
chicken, chocolate, corn, creamcheese, croutons, cucumber,
cup, doughnut, egg, fish, flour, fork, hen, jelly, knife, lemon,
lettuce, meat, milk, mustard, oil, onion, pan, peanutbutter,
pepper, pitcher, plate, pot, salmon, salt, spatula, spoon,
spreader, steak, sugar, tomato, tongs, turkey, whisk, yogurt.
Table 2: The list of the objects considered in our system.
For each testing video with M frames, we pass the target object patches frame by frame, and get an output of size 48×M. We sum it up along the temporal dimension and then normalize the output. We classify two objects in the image: (Object1) and (Object2). At the end of classification, the object recognition system outputs two belief distributions of size 48×1: PObject1 and PObject2.
We also need the ‘Action’ that was performed. Due to the
large variations in the video, the visual recognition of actions
is difficult. Our system bypasses this problem by using a
trained language model. The model predicts the most likely
verb (Action) associated with the objects (Object1, Object2).
In order to do prediction, we need a set of candidate actions
V. Here, we consider the top 10 most common actions in
cooking scenarios. They are (Cut, Pour, Transfer, Spread,
Grip, Stir, Sprinkle, Chop, Peel, Mix). The same technique used here was applied before to a larger set of candidate actions (Yang et al. 2011).
We compute from the Gigaword corpus (Graff 2003) the
probability of a verb occurring, given the detected nouns,
P(Action|Object1, Object2). We do this by computing the
log-likelihood ratio (Dunning 1993) of trigrams (Object1,
Action, Object2), computed from the sentence in the English
Gigaword corpus (Graff 2003). This is done by extracting only the words in the corpus that are defined in O and V (including their synonyms). This way we obtain a reduced corpus sequence from which we obtain our target trigrams. The
log-likelihood ratios computed for all possible trigrams are
then normalized to obtain P(Action|Object1, Object2).
For each testing video, we can compute a belief distribution over the candidate action set V of size 10×1 as:

    PAction = Σ_{Object1, Object2} P(Action | Object1, Object2) × PObject1 × PObject2.    (1)
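Equation (1) is a marginalization over the two object beliefs; with the conditional table stored as a 10×48×48 array it is a single tensor contraction (all probabilities below are random stand-ins for the corpus statistics):

```python
import numpy as np

rng = np.random.default_rng(1)
p_obj1 = rng.random(48); p_obj1 /= p_obj1.sum()    # PObject1
p_obj2 = rng.random(48); p_obj2 /= p_obj2.sum()    # PObject2

# P(Action | Object1, Object2): normalized over the 10 actions for every
# object pair; random stand-in for the normalized log-likelihood ratios.
p_cond = rng.random((10, 48, 48))
p_cond /= p_cond.sum(axis=0, keepdims=True)

# Equation (1): sum out both object beliefs.
p_action = np.einsum('aij,i,j->a', p_cond, p_obj1, p_obj2)
print(p_action.shape)         # (10,)
print(round(p_action.sum()))  # 1: a proper distribution over actions
```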
From Recognitions to Action Trees
The output of our visual system are belief distributions of
the object categories, grasping types, and actions. However,
they are not sufficient for executing actions. The robot also
needs to understand the hierarchical and recursive structure
of the action. We argue that grammar trees, similar to those
used in linguistic analysis, are a good representation cap-
turing the structure of actions. Therefore we integrate our
visual system with a manipulation action grammar based
parsing module (Yang et al. 2014). Since the output of our
visual system is probabilistic, we extend the grammar to a
probabilistic one and apply the Viterbi probabilistic parser
to select the parse tree with the highest likelihood among
the possible candidates.
Manipulation Action Grammar We made two exten-
sions from the original manipulation grammar (Yang et al.
2014): (i) Since grasping is conceptually different from other
actions, and our system employs a CNN based recognition
module to extract the grasping type, we assign an additional nonterminal symbol G to represent the grasp. (ii) To
accommodate the probabilistic output from the processing
of unconstrained videos, we extend the manipulation action
grammar into a probabilistic one.
The design of this grammar is motivated by three obser-
vations: (i) Hands are the main driving force in manipula-
tion actions, so a specialized nonterminal symbol H is used for their representation; (ii) an action (A) or a grasping (G) can be applied to an object (O) directly or to a hand phrase (HP), which in turn contains an object (O), as encoded in Rule (1), which builds up an action phrase (AP); (iii) an action phrase (AP) can be combined either with the hand (H) or a hand phrase, as encoded in Rule (2), which recursively builds up the hand phrase. The rules discussed in Table 3
form the syntactic rules of the grammar.
To make the grammar probabilistic, we first treat each sub-rule in rules (1) and (2) equally, and assign equal probability to each sub-rule. With regard to the hand H in rule (3), we only consider a robot with two effectors (arms), and assign equal probability to ‘LeftHand’ and ‘RightHand’. For the terminal rules (4-8), we assign the normalized belief distributions (PObject1, PObject2, PGraspType1, PGraspType2, PAction) obtained from the visual processes to each candidate object, grasping type and action.
AP → G1 O1 | G2 O2 | A O2 | A HP      0.25 each     (1)
HP → H AP | HP AP                     0.5 each      (2)
H → ‘LeftHand’ | ‘RightHand’          0.5 each      (3)
G1 → ‘GraspType1’                     PGraspType1   (4)
G2 → ‘GraspType2’                     PGraspType2   (5)
O1 → ‘Object1’                        PObject1      (6)
O2 → ‘Object2’                        PObject2      (7)
A → ‘Action’                          PAction       (8)
Table 3: A Probabilistic Extension of Manipulation Action
Context-Free Grammar.
Parsing and tree generation We use a bottom-up vari-
ation of the probabilistic context-free grammar parser that
uses dynamic programming (best known as the Viterbi parser
(Church 1988)) to find the most likely parse for an input vi-
sual sentence. The Viterbi parser parses the visual sentence
by filling in the most likely constituent table, and the parser
uses the grammar introduced in Table 3. For each testing
video, our system outputs the most likely parse tree of the
specific manipulation action. By reverse parsing the tree
structure, the robot could derive an action plan for execu-
tion. Figure 3 shows sample output trees, and Table 4 shows
the final control commands generated by reverse parsing.
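To make the chart-filling step concrete, the following is a pure-Python miniature of a probabilistic CKY (Viterbi) parser over a grammar shaped like Table 3; the terminal probabilities are made-up stand-ins for the CNN beliefs, and the actual system uses the NLTK implementation (Bird, Klein, and Loper 2009):

```python
from collections import defaultdict

# Binary rules lhs -> (B, C) with probabilities, modeled on Table 3.
binary = {
    'AP': [(('G1', 'O1'), 0.25), (('G2', 'O2'), 0.25),
           (('A', 'O2'), 0.25), (('A', 'HP'), 0.25)],
    'HP': [(('H', 'AP'), 0.5), (('HP', 'AP'), 0.5)],
}
# Terminal rules lhs -> word; probabilities are stand-ins for CNN beliefs.
lexical = {
    'H': [('LeftHand', 0.5), ('RightHand', 0.5)],
    'G1': [('GraspType1', 0.9)],
    'O1': [('Object1', 0.8)],
    'A': [('Action', 0.7)],
    'O2': [('Object2', 0.8)],
}

def viterbi_parse(words, start='HP'):
    """CKY chart: table[i][j][X] = best probability that X derives words[i:j]."""
    n = len(words)
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for lhs, prods in lexical.items():
            for word, p in prods:
                if word == w:
                    table[i][i + 1][lhs] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, prods in binary.items():
                    for (B, C), p in prods:
                        q = p * table[i][k][B] * table[k][j][C]
                        if q > table[i][j][lhs]:
                            table[i][j][lhs] = q
                            back[i][j][lhs] = (k, B, C)
    return table[0][n][start], back

# A one-hand visual sentence fragment: "LeftHand GraspType1 Object1".
prob, _ = viterbi_parse(['LeftHand', 'GraspType1', 'Object1'])
print(round(prob, 4))   # 0.045 = 0.5 (HP) * 0.5 (H) * 0.25 (AP) * 0.9 * 0.8
```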
Experiments

The theoretical framework we have presented suggests two
hypotheses that deserve empirical tests: (a) the CNN based
object recognition module and the grasping type recognition
module can robustly recognize input frame patches from
unconstrained videos into correct class labels; (b) the inte-
grated system using the Viterbi parser with the probabilistic
extension of the manipulation action grammar can generate
a sequence of execution commands robustly.
To test the two hypotheses empirically, we need to de-
fine a set of performance variables and how they relate to
our predicted results. The first hypothesis relates to visual
recognition, and we can empirically test it by measuring the
precision and recall metrics by comparing the detected ob-
ject and grasping type labels with the ground truth ones. The
second hypothesis relates to execution command generation,
and we can also empirically test it by comparing the gen-
erated command predicates with the ground truth ones on
testing videos. To validate our system, we conducted experi-
ments on an extended version of a publicly available uncon-
strained cooking video dataset (YouCook) (Das et al. 2013).
Dataset and experimental settings
Cooking is an activity requiring a variety of manipulation actions that future service robots will most likely need to learn.
We conducted our experiments on a publicly available cook-
ing video dataset collected from the WWW and fully la-
beled, called the Youtube cooking dataset (YouCook) (Das
et al. 2013). The data was prepared from 88 open-source
Youtube cooking videos with unconstrained third-person
view. Frame-by-frame object annotations are provided for
49 out of the 88 videos. These features make it a good em-
pirical testing bed for our hypotheses.
We conducted our experiments using the following proto-
cols: (1) 12 video clips, which contain one typical kitchen
action each, are reserved for testing; (2) all other video
frames are used for training; (3) we randomly reserve 10%
of the training data as validation set for training the CNNs.
For training the grasping type, we extended the dataset by
annotating image patches containing hands in the training
videos. The image patches were converted to gray-scale and
then resized to 32×32 pixels. The training set contains 1525
image patches and was labeled with the six grasping types.
We used a GPU based CNN implementation (Jia 2013) to
train the neural network, following the structure described above.
For training the object recognition CNN, we first ex-
tracted annotated image patches from the labeled training
videos, and then resized them to 32 ×32 ×3. We used the
same GPU based CNN implementation to train the neural
network, following the structures described above.
For localizing hands on the testing data, we first applied
the hand detector from (Mittal, Zisserman, and Torr 2011)
and picked the top two hand patch proposals (left hand and
right hand, if present). For objects, we trained general object
detectors from labeled training data using techniques from
(Cheng et al. 2014). Furthermore, we associated each candidate object patch with the left or right hand depending on which had the smaller Euclidean distance.
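The hand association step is a nearest-center assignment; a minimal sketch with hypothetical pixel coordinates:

```python
import math

def assign_to_hand(obj_center, hand_centers):
    """Associate an object patch with the hand whose detected center is
    nearest in Euclidean distance."""
    return min(hand_centers,
               key=lambda h: math.dist(obj_center, hand_centers[h]))

hands = {'LeftHand': (100, 200), 'RightHand': (400, 210)}
print(assign_to_hand((120, 190), hands))   # LeftHand
```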
Grasping Type and Object Recognition
On the reserved 10% validation data, the grasping type recognition module achieved an average precision of 77% and an average recall of 76%, and the object recognition module achieved an average precision of 93% and an average recall of 93%. Figure 2
shows the confusion matrices for grasping type and object
recognition, respectively. From the figure we can see the ro-
bustness of the recognition.
The performance of the object and grasping type recog-
nition modules is also reflected in the commands that our
system generated from the testing videos. We observed an
overall recognition accuracy of 79% on objects, of 91% on
grasping types and of 83% on predicted actions (see Table
4). It is worth mentioning that in the generated commands
the performance in the recognition of object drops, because
some of the objects in the testing sequences do not have training data, such as “Tofu”. The performance in the classification of grasping type goes up, because we sum up the grasping type belief distributions over the frames, which helps to smooth out wrong labels. The performance metrics reported here empirically support our hypothesis (a).

Figure 2: Confusion matrices. Left: grasping type; right: object recognition.
Visual Sentence Parsing and Commands
Generation for Robots
Following the probabilistic action grammar from Table 3, we
built upon the implementation of the Viterbi parser from the
Natural Language Toolkit (NLTK) (Bird, Klein, and Loper
2009) to generate the single most likely parse tree from the
probabilistic visual sentence input. Figure 3 shows the sam-
ple visual processing outputs and final parse trees obtained
using our integrated system. Table 4 lists the commands
generated by our system on the reserved 12 testing videos,
shown together with the ground truth commands. The overall percentage of correct commands is 68%. Note that we considered a command predicate wrong if any of the object, grasping type or action was recognized incorrectly. The performance metrics reported here empirically support our
hypothesis (b).
The performance metrics reported in the experiment sec-
tion empirically support our hypotheses that: (1) our system
is able to robustly extract visual sentences with high accu-
racy; (2) our system can learn atomic action commands with
few errors compared to the ground-truth commands. We be-
lieve this preliminary integrated system raises hope towards
a fully intelligent robot for manipulation tasks that can auto-
matically enrich its own knowledge resource by “watching”
recordings from the World Wide Web.
Conclusion and Future Work
In this paper we presented an approach to learn manipulation
action plans from unconstrained videos for cognitive robots.
Two convolutional neural network based recognition mod-
ules (for grasping type and objects respectively), as well as
a language model for action prediction, compose the lower
level of the approach. The probabilistic manipulation action
grammar based Viterbi parsing module is at the higher level,
and its goal is to generate atomic commands in predicate
form. We conducted experiments on a cooking dataset which
consists of unconstrained demonstration videos. From the
performance on this challenging dataset, we can conclude that our system is able to recognize and generate action commands robustly.

Figure 3: Upper row: input unconstrained video frames; lower left: color coded (see legend at the bottom) visual recognition output frame by frame along the timeline; lower right: the most likely parse tree generated for each clip.
We believe that the grasp type is an essential component
for fine grain manipulation action analysis. In future work
we will (1) further extend the list of grasping types to have
a finer categorization; (2) investigate the possibility of using
the grasp type as an additional feature for action recognition;
(3) automatically segment a long demonstration video into
action clips based on the change of grasp type.
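Future work item (3) reduces to change-point detection on the per-frame grasp labels; a minimal sketch (the label sequence below is hypothetical):

```python
def segment_by_grasp(grasp_labels):
    """Split a per-frame grasp-type label sequence into clips at the frames
    where the grasp type changes; returns (start, end, label) triples."""
    clips, start = [], 0
    for t in range(1, len(grasp_labels)):
        if grasp_labels[t] != grasp_labels[t - 1]:
            clips.append((start, t, grasp_labels[start]))
            start = t
    clips.append((start, len(grasp_labels), grasp_labels[start]))
    return clips

labels = ['Rest', 'Rest', 'PoS', 'PoS', 'PoS', 'PrS', 'PrS']
print(segment_by_grasp(labels))
# [(0, 2, 'Rest'), (2, 5, 'PoS'), (5, 7, 'PrS')]
```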
Table 4: Ground truth and learned commands for the 12 reserved testing videos. LH: LeftHand; RH: RightHand; PoS: Power-Small; PoL: Power-Large; PoP: Power-Spherical; PrS: Precision-Small; PrL: Precision-Large. Incorrect learned entities are marked with an asterisk (*).

(1) Ground truth: Grasp PoS(LH, Knife); Grasp PrS(RH, Tofu); Action Cut(Knife, Tofu)
    Learned: Grasp PoS(LH, Knife); Grasp PrS(RH, Bowl*); Action Cut(Knife, Bowl*)
(2) Ground truth: Grasp PoS(LH, Blender); Grasp PrL(RH, Bowl); Action Blend(Blender, Bowl)
    Learned: Grasp PoS(LH, Bowl*); Grasp PoL*(RH, Bowl); Action Pour*(Bowl*, Bowl)
(3) Ground truth: Grasp PoS(LH, Tongs); Action Grip(Tongs, Chicken)
    Learned: Grasp PoS(LH, Chicken*); Action Cut*(Chicken*, Chicken)
(4) Ground truth: Grasp PoS(LH, Brush); Grasp PrS(RH, Corn); Action Spread(Brush, Corn)
    Learned: Grasp PoS(LH, Brush); Grasp PrS(RH, Corn); Action Spread(Brush, Corn)
(5) Ground truth: Grasp PoS(LH, Tongs); Action Grip(Tongs, Steak)
    Learned: Grasp PoS(LH, Tongs); Action Grip(Tongs, Steak)
(6) Ground truth: Grasp PoS(LH, Spreader); Grasp PrL(RH, Bread); Action Spread(Spreader, Bread)
    Learned: Grasp PoS(LH, Spreader); Grasp PrL(RH, Bowl*); Action Spread(Spreader, Bowl*)
(7) Ground truth: Grasp PoL(LH, Mustard); Grasp PrS(RH, Bread); Action Spread(Mustard, Bread)
    Learned: Grasp PoL(LH, Mustard); Grasp PrS(RH, Bread); Action Spread(Mustard, Bread)
(8) Ground truth: Grasp PoS(LH, Spatula); Grasp PrS(RH, Bowl); Action Stir(Spatula, Bowl)
    Learned: Grasp PoS(LH, Spatula); Grasp PrS(RH, Bowl); Action Stir(Spatula, Bowl)
(9) Ground truth: Grasp PoL(LH, Pepper); Grasp PoL(RH, Pepper); Action Sprinkle(Pepper, Bowl)
    Learned: Grasp PoL(LH, Pepper); Grasp PoL(RH, Pepper); Action Sprinkle(Pepper, Pepper*)
(10) Ground truth: Grasp PoS(LH, Knife); Grasp PrS(RH, Lemon); Action Cut(Knife, Lemon)
    Learned: Grasp PoS(LH, Knife); Grasp PrS(RH, Lemon); Action Cut(Knife, Lemon)
(11) Ground truth: Grasp PoS(LH, Knife); Grasp PrS(RH, Broccoli); Action Cut(Knife, Broccoli)
    Learned: Grasp PoS(LH, Knife); Grasp PoL*(RH, Broccoli); Action Cut(Knife, Broccoli)
(12) Ground truth: Grasp PoS(LH, Whisk); Grasp PrL(RH, Bowl); Action Stir(Whisk, Bowl)
    Learned: Grasp PoS(LH, Whisk); Grasp PrL(RH, Bowl); Action Stir(Whisk, Bowl)

Overall accuracy — Object: 79%; Grasping type: 91%; Action: 83%; Overall percentage of correct commands: 68%.

Another line of future work lies in the higher level of the system. The probabilistic manipulation action grammar
used in this work is still a syntax grammar. We are cur-
rently investigating the possibility of coupling manipulation
action grammar rules with semantic rules using lambda ex-
pressions, through the formalism of combinatory categorial
grammar developed by (Steedman 2002).
Acknowledgements This research was funded in part by
the support of the European Union under the Cognitive Sys-
tems program (project POETICON++), the National Sci-
ence Foundation under INSPIRE grant SMA 1248056, and
support by DARPA through U.S. Army grant W911NF-14-
1-0384 under the Project: Shared Perception, Cognition and
Reasoning for Autonomy. NICTA is funded by the Aus-
tralian Government as represented by the Department of
Broadband, Communications and the Digital Economy and
the Australian Research Council through the ICT Centre of
Excellence program.
References

Aksoy, E.; Abramov, A.; Dörr, J.; Ning, K.; Dellen, B.; and Wörgötter, F. 2011. Learning the semantics of object-action relations by observation. The International Journal of Robotics Research 30(10):1229–1249.
Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009.
A survey of robot learning from demonstration. Robotics and Au-
tonomous Systems 57(5):469–483.
Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation
learning: A review and new perspectives. Pattern Analysis and
Machine Intelligence, IEEE Transactions on 35(8):1798–1828.
Bird, S.; Klein, E.; and Loper, E. 2009. Natural language processing with Python. O’Reilly Media, Inc.
Cheng, M.-M.; Zhang, Z.; Lin, W.-Y.; and Torr, P. H. S. 2014.
BING: Binarized normed gradients for objectness estimation at
300fps. In IEEE CVPR.
Chomsky, N. 1993. Lectures on government and binding: The Pisa
lectures. Berlin: Walter de Gruyter.
Church, K. W. 1988. A stochastic parts program and noun phrase
parser for unrestricted text. In Proceedings of the second confer-
ence on Applied natural language processing, 136–143. Associa-
tion for Computational Linguistics.
Ciresan, D. C.; Meier, U.; and Schmidhuber, J. 2012. Multi-column
deep neural networks for image classification. In CVPR 2012.
Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients
for human detection. In Computer Vision and Pattern Recogni-
tion, 2005. CVPR 2005. IEEE Computer Society Conference on,
volume 1, 886–893. IEEE.
Das, P.; Xu, C.; Doell, R. F.; and Corso, J. J. 2013. A thousand
frames in just a few words: Lingual description of videos through
latent topics and sparse object stitching. In Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition.
Dunning, T. 1993. Accurate methods for the statistics of surprise
and coincidence. Computational Linguistics 19(1):61–74.
Feix, T.; Romero, J.; Ek, C. H.; Schmiedmayer, H.; and Kragic,
D. 2013. A Metric for Comparing the Anthropomorphic Motion
Capability of Artificial Hands. Robotics, IEEE Transactions on
Graff, D. 2003. English gigaword. In Linguistic Data Consortium,
Philadelphia, PA.
Guerra-Filho, G.; Ferm¨
uller, C.; and Aloimonos, Y. 2005. Dis-
covering a language for human activity. In Proceedings of the
AAAI 2005 Fall Symposium on Anticipatory Cognitive Embodied
Systems. Washington, DC: AAAI.
Guha, A.; Yang, Y.; Ferm¨
uller, C.; and Aloimonos, Y. 2013. Min-
imalist plans for interpreting manipulation actions. In Proceed-
ings of the 2013 International Conference on Intelligent Robots
and Systems, 5908–5914. Tokyo: IEEE.
Jeannerod, M. 1984. The timing of natural prehension movements.
Journal of motor behavior 16(3):235–254.
Jia, Y. 2013. Caffe: An open source convolutional architecture for
fast feature embedding. http://caffe.berkeleyvision.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. 2013. Imagenet clas-
sification with deep convolutional neural networks. In NIPS 2012.
LeCun, Y., and Bengio, Y. 1998. The handbook of brain theory
and neural networks. Cambridge, MA, USA: MIT Press. chapter
Convolutional networks for images, speech, and time series, 255–
Lee, K.; Su, Y.; Kim, T.-K.; and Demiris, Y. 2013. A syntac-
tic approach to robot imitation learning using probabilistic activity
grammars. Robotics and Autonomous Systems 61(12):1323–1334.
Lenz, I.; Lee, H.; and Saxena, A. 2014. Deep learning for detect-
ing robotic grasps. International Journal of Robotics Research to
Li, Y.; Ferm¨
uller, C.; Aloimonos, Y.; and Ji, H. 2010. Learning
shift-invariant sparse representation of actions. In Proceedings of
the 2009 IEEE Conference on Computer Vision and Pattern Recog-
nition, 2630–2637. San Francisco, CA: IEEE.
Lowe, D. G. 2004. Distinctive image features from scale-invariant
keypoints. International journal of computer vision 60(2):91–110.
Mittal, A.; Zisserman, A.; and Torr, P. H. 2011. Hand detection
using multiple proposals. In BMVC, 1–11. Citeseer.
Oikonomidis, I.; Kyriazis, N.; and Argyros, A. 2011. Efficient
model-based 3D tracking of hand articulations using Kinect. In
Proceedings of the 2011 British Machine Vision Conference, 1–11.
Dundee, UK: BMVA.
Pastra, K., and Aloimonos, Y. 2012. The minimalist grammar of
action. Philosophical Transactions of the Royal Society: Biological
Sciences 367(1585):103–117.
Saxena, A.; Driemeyer, J.; and Ng, A. Y. 2008. Robotic grasping of
novel objects using vision. The International Journal of Robotics
Research 27(2):157–173.
Shimoga, K. B. 1996. Robot grasp synthesis algorithms: A survey.
The International Journal of Robotics Research 15(3):230–266.
Steedman, M. 2002. Plans, affordances, and combinatory gram-
mar. Linguistics and Philosophy 25(5-6):723–753.
Summers-Stay, D.; Teo, C.; Yang, Y.; Ferm¨
uller, C.; and Aloi-
monos, Y. 2013. Using a minimal action grammar for activity un-
derstanding in the real world. In Proceedings of the 2013 IEEE/RSJ
International Conference on Intelligent Robots and Systems, 4104–
4111. Vilamoura, Portugal: IEEE.
Tenorth, M.; Ziegltrum, J.; and Beetz, M. 2013. Automated align-
ment of specifications of everyday manipulation tasks. In IROS.
Yang, Y.; Teo, C. L.; Daum´
e III, H.; and Aloimonos, Y. 2011.
Corpus-guided sentence generation of natural images. In Proceed-
ings of the Conference on Empirical Methods in Natural Language
Processing, 444–454. Association for Computational Linguistics.
Yang, Y.; Guha, A.; Fermuller, C.; and Aloimonos, Y. 2014. A
cognitive system for understanding human manipulation actions.
Advances in Cognitive Sysytems 3:67–86.