A Modular Software Framework for Eye-hand Coordination in Humanoid Robots

Jürgen Leitner 1*, Simon Harding 2, Alexander Förster 3 and Peter Corke 1

1 Australian Centre for Robotic Vision, Queensland University of Technology, Brisbane, QLD, Australia
2 Machine Intelligence Ltd, South Zeal, United Kingdom
3 Institute for Artificial Intelligence, Universität Bremen, Bremen, Germany

Correspondence*:
Jürgen Leitner
Australian Centre for Robotic Vision, Queensland University of Technology, Brisbane, QLD 4000, Australia, j.leitner@roboticvision.org
ABSTRACT

We describe our software system enabling a tight integration between vision and control modules on complex, high-DOF humanoid robots. This is demonstrated with the iCub humanoid robot performing visual object detection, reaching and grasping actions. A key capability of this system is reactive avoidance of obstacle objects detected from the video stream while carrying out reach-and-grasp tasks. The subsystems of our architecture can be independently improved and updated; for example, we show that by using machine learning techniques we can improve visual perception by collecting images during the robot's interaction with the environment. We describe the task and software design constraints that led to the layered modular system architecture.

Keywords: Humanoid Robots, Software Framework, Robotic Vision, Eye-Hand Coordination, Reactive Reaching, Machine Learning
1 INTRODUCTION
In the last century, robots have transitioned from science fiction to science fact. When interacting with the world around them, robots need to be able to reach for, grasp and manipulate a wide range of objects in arbitrary positions. Object manipulation, as this is referred to in robotics, is a canonical problem that autonomous systems must solve to become truly useful. We aim to overcome the limitations of current robots and the software systems that control them, with a focus on complex bi-manual robots. It has previously been suggested that better perception and coordination between sensing and acting are key requirements to increase the capabilities of current systems (Ambrose et al., 2012; Kragic and Vincze, 2009). Yet with the increasing complexity of the mechanical systems of modern robots, programming these machines can be tedious, error prone, and inaccessible to non-experts. Roboticists are increasingly considering learning over time to "program" motions into robotic systems. In addition, continuous learning increases the flexibility and provides the means for self-adaptation, leading to more capable, autonomous systems. Research in artificial intelligence (AI) techniques has led to computers that can play chess well enough to win against (and/or tutor) the average human player (Sadikov et al., 2007). Robotic manipulation of chess pieces with human-level precision and adaptation is still beyond current systems.
The problem is not with the mechanical systems. Sensory feedback is of critical importance for acting in a purposeful manner. For humans particularly, vision is an important factor in the development of reaching and grasping skills (Berthier et al., 1996; McCarty et al., 2001). The essential challenge in robotics is to create a similarly efficient perception system. For example, NASA's Space Technology Roadmap is calling for the development of autonomously calibrating hand-eye systems enabling successful off-world robotic manipulation (Ambrose et al., 2012). This ability is fundamental for humans and animals alike, leading to many experimental studies on how we perform these actions (Posner, 1989; Jeannerod, 1997). The process is still not fully understood, but basic computational models exist for how humans develop their reaching and grasping skills during infancy (Oztop et al., 2004). Whereas 14-month-old infants can imitate and perform simple manipulation skills (Meltzoff, 1988), robots can only perform simple, pre-programmed reaching and grasping in limited scenarios. Robots currently lack our ability to adapt to changing environments during motion execution. Yet this adaptation is important: even if the environment can be perceived precisely, it will not be static in most (interesting) settings.

Coming back to the chess example, for an autonomous system to pick up a chess piece it needs to be able to perceive the board, detect the right piece, and locate its position accurately, before executing a purposeful motion that is safe for the robot and its environment. These sub-problems have turned out to be much harder than expected a few decades ago. With the progress in mechanical design, motion control and computer vision, it is time to revisit the close coupling between those systems to create robots that perform actions in day-to-day environments.
1.1 Motion and Action: Interacting With the Environment
In the chess example, even if the state of the board and its location are known perfectly, moving a certain chess piece from one square to another without toppling other pieces is a non-trivial problem. Children, even at a very young age, have significantly better (more "natural", smoother) hand movements than almost all currently available humanoid robots. In humans the development of hand control starts at an early age, albeit clumsily, and the precision grasp does not mature until the age of 8-10 years (Forssberg et al., 1991). Even after manipulation skills have been learnt, they are constantly adapted by a perception-action loop to yield desired results during action execution. Vision and action are closely integrated in the human brain. Various specialisations related to extracting and encoding information about the location and graspability of objects also develop in the visual pathways of infants (Johnson and Munakata, 2005).
To enable robots to interact with objects in unstructured, cluttered environments, a variety of reactive approaches have been investigated. These quickly generate control commands based on sensory input – similar to reflexes – without sampling the robot's configuration space and deliberately searching for a solution (Khatib, 1986; Brooks, 1991; Schoner and Dose, 1992). Generally such approaches apply a heuristic to transform local information (in the sensor reference frame) into commands sent to the motors, leading to fast, reflex-like obstacle avoidance. Reactive approaches have become popular in the context of safety and human-robot interaction (De Santis et al., 2007a; Dietrich et al., 2011) but are brittle and inefficient at achieving global goals. A detailed model of the world, in contrast, enables the planning of coordinated actions. Finding a path or trajectory is referred to as the path planning problem. This search for non-colliding poses is generally expensive, and increasingly so with higher DOF. Robots controlled this way are typically slow and appear cautious in their motion execution. Reactive approaches started to appear in the 1980s as an alternative to this "think first, act later" paradigm. Current robotic systems operate in a very sequential manner: after a trajectory is planned, it is performed by the robot, before the actual manipulation begins. We aim to move away from such brittle global planner paradigms and have these parts overlap and be continuously refined based on visual feedback. The framework described here provides quick, reactive motions and the interfaces to these, so they can be controlled by higher-level agents or "opportunistic" planners.
Once the robot has moved its end-effector close enough to an object, it can start to interact with it. In recent years good progress has been made in this area thanks to the development of robust and precise grippers and hands and the improvement of grasping techniques. In addition, novel concepts of 'grippers' have appeared in research, including some quite ingenious solutions such as the granular gripper (Brown et al., 2010). As an alternative to designing a grasping strategy, it may be possible to learn one using only a small number of real-world examples, where good grasping points are known, and to generalize or transfer these to a wide variety of previously unseen objects (Saxena et al., 2008). An overview of the state of research in robot grasping is found in Carbone (2013). Our framework provides an interface to "action repertoires". In one of the examples later on we show how we use a simple grasping module that is triggered when the robot's end-effector is close to a target object. While vision may be suitable for guiding a robot to an object, the very last phase of object manipulation – the transition to contact – may require the use of sensed forces.
1.2 Robotic Vision: Perceiving the Environment
For a robot to pick a chess piece, for example, finding the chess board and each of the chess pieces in the camera image, or even just realising that there is a chess board and pieces in the scene, is critical. An important area of research is the development of artificial vision systems that provide robots with such capabilities. The robot's perception system needs to be able to determine whether the image data contains some specific object, feature, or activity. While closely related to computer vision, there are a few differences, mainly in how the images are acquired and how the outcome will provide input for the robot to make informed decisions. For example, visual feedback has extensively been used in mobile robot applications for obstacle avoidance, mapping and localisation (Davison and Murray, 2002; Karlsson et al., 2005). Especially in the last decade there has been a surge of computer vision research. A focus is put on the areas relevant for object manipulation (in recent years various challenges have emerged around this topic, such as the Amazon Picking Challenge and RoboCup@Home) and on the increased interest in working around and with humans.
Robots are required to detect objects in their surroundings even if they were previously unknown. In addition, we require them to be able to build models so they can re-identify and memorise these objects in the future. Progress has been made on detecting objects – especially when limiting the focus to specific settings – and the interested reader is referred to current surveys (Cipolla et al., 2010; Verschae and Ruiz-del Solar, 2015). A solution for the general case, i.e. detecting arbitrary objects in arbitrary situations, remains elusive though (Kemp et al., 2007). Environmental factors, including changing light conditions, inconsistent sensing or incomplete data acquisition, seem to be the main cause of missed or erroneous detections (Kragic and Vincze, 2009) (see also the environmental changes in Figure 1). Most object detection applications have been using hand-crafted features, such as SIFT (Lowe, 1999) or SURF (Bay et al., 2006), or extensions of these for higher robustness (Stückler et al., 2013). Experimental robotics still relies heavily on artificial landmarks to simplify (and speed up) the detection problem, though there is recent progress specifically for the iCub platform (Ciliberto et al., 2011; Fanello et al., 2013; Gori et al., 2013). Many AI and learning techniques have been applied to object detection and classification over the last years. Deep learning has emerged as a promising technology for extracting general features from ever larger datasets (LeCun et al., 2015; Schmidhuber, 2015). An interface to such methods is integrated in our framework and has been applied to autonomously learn object detectors from small datasets (only 5-20 images) (Leitner et al., 2012a, 2013a; Harding et al., 2013).
Another problem relevant to eye-hand coordination is estimating the position of an object with respect to the robot and its end-effector. 'Spatial perception', as this is known, is a requirement for planning useful actions and building cohesive world models. Studies in brain and neuroscience have uncovered trends in what changes when we learn to reason about distances by interacting with the world; how these changes happen, in contrast, is not yet clear (Plumert and Spencer, 2007). In robotics, multiple camera views provide the observations required to obtain a distance measure. Projective geometry and its implementation in stereo vision systems are quite common on robotic platforms; an overview of the theory and techniques can be found in Hartley and Zisserman (2000). While projective geometry approaches work well under carefully controlled experimental circumstances, they are not easily transferred to robotics applications. These methods fall short when the cameras are separately movable (as on the iCub, which can be seen in the imprecise out-of-the-box localisation module (Pattacini, 2011)) or when only a single camera is available (as with Baxter). In addition, the method needs to cope with separate movement of the robot's head, gaze and upper body. A goal for the framework was therefore also to enable the learning of depth estimation from separately controllable camera pairs, even on complex humanoid robots moving about (Leitner et al., 2012b).
1.3 Integration: Sensorimotor Coordination
Although there exists a rich body of literature in computer vision, path planning, and feedback control, wherein many critical subproblems are addressed individually, most demonstrable behaviours for humanoid robots do not effectively integrate elements from all three disciplines. Consequently, tasks that seem trivial to humans, such as picking up a specific object in a cluttered environment, remain beyond the state-of-the-art in experimental robotics. A close integration of computer vision and control is of importance: for example, it was shown that just providing a point-cloud generated model of the world was not sufficient for a 5 DOF robotic arm to calculate reach and grasp behaviours on the fly (Saxena et al., 2008). The work by Maitin-Shepard et al. (2010) succeeded in manipulating towels thanks to a sequence of visually-guided re-grasps. 'Robotics, Vision and Control' (Corke, 2011) puts the close integration of these components into the spotlight and describes common pitfalls and issues when trying to build systems with high levels of sensorimotor integration.

Figure 1. During a stereotypical manipulation task, object detection is a hard but critical problem to solve. These images, collected during our experiments, show the changes in lighting, occlusions and pose of complex objects. (Note: best viewed in colour) We provide a framework that allows for the easy integration of multiple, new detectors (Leitner et al., 2013a).
Visual servoing (Chaumette and Hutchinson, 2006) is a commonly used approach to create a tight coupling of visual perception and motor control. This closed-loop, vision-based control can be seen as a very basic level of eye-hand coordination. It has been shown to work as a functional strategy to control robots without any prior calibration of the camera to end-effector transformation (Vahrenkamp et al., 2008). A drawback of visual servoing is that it requires the robust extraction of visual features; in addition, the final configuration of these features in image space needs to be known a priori.
Active vision investigates how controlling the motion of the camera, i.e. where to look, can be used to extract additional information from a scene. Welke et al. (2010), for example, presented a method that creates a segmentation out of multiple viewpoints of an object, generated by rotating the object in the robot's hand in front of its camera. Such exploratory behaviours are important for creating a fully functioning autonomous object classification system and highlight one of the big differences between computer vision and robotic vision.
Creating a system that can improve actions by using visual feedback, and vice versa improve visual perception by performing manipulation actions, necessitates a flexible way of representing, learning and storing visual object descriptions. We have developed a software framework for creating a functioning eye-hand coordination system on a humanoid robot. It covers quite distinct areas of robotics research, namely machine learning, computer vision and motion generation. Herein we describe and showcase this modular architecture that combines those areas into an integrated system running on a real robotic platform. It was started as a tool for the iCub humanoid but, thanks to its modular design, it can be and has been used with other robots, most recently a Baxter.
Robotic Systems Software Design and Toolkits
Current humanoid robots are stunning feats of engineering. With the increased complexity of these systems, the software to run these machines is increasing in complexity as well. In fact, programming today's robots requires a big effort and usually a team of researchers. To reduce the time needed to set up robotics experiments and to avoid repeatedly reinventing the wheel, good system-level tools are needed. This has led to the emergence of many open-source projects in robotics (Gerkey et al., 2003; van den Bergen, 2004; Metta et al., 2006; Jackson, 2007; Diankov and Kuffner, 2008; Fitzpatrick et al., 2008; Quigley et al., 2009). State-of-the-art software development methods have also been translated into the robotics domain. Innovative ideas have been introduced in various areas to promote the reuse of robotic software "artifacts", such as components, frameworks and architectural styles (Brugali, 2007). To build more general models of robot control, robotic vision and their close integration, robot software needs to be able to abstract certain specifics of the underlying robotic system. There exists a wide variety of middleware systems which abstract the specifics of each robot's sensors and actuators. Furthermore, such systems need to provide the ability to communicate between modules running in parallel on separate computers.
ROS (Robot Operating System) (Quigley et al., 2009) is one of the most popular robotics software platforms. At heart it is a component-based middleware which allows computational nodes to publish and subscribe to messages on particular topics, and to provide services to each other. Nodes communicate via "messages", i.e. data blocks of pre-defined structure; they can execute as a networked, distributed computer system and their connections can be changed dynamically at runtime. ROS also contains a wider set of tools for computer vision (OpenCV and the point-cloud library PCL), motion planning, visualization, data logging and replay, debugging, and system startup, as well as drivers for a multitude of sensors and robot platforms. For the iCub, YARP (Yet Another Robotics Platform) (Metta et al., 2006) is the middleware of choice. It is largely written in C++ and uses separately running code instances, titled "modules". These can be dynamically and flexibly linked and communicate via concise, pre-defined messages called "bottles", facilitating component-based design. There is a wide range of other robotic middleware systems available, such as ArmarX (Vahrenkamp et al., 2015), OROCOS (Soetens, 2006) and OpenRTM (Ando et al., 2008), all with their own benefits and drawbacks; see Elkady and Sobh (2012) for a comprehensive comparison.
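To illustrate the component-based style used throughout this paper, the following is a minimal sketch (not taken from the framework's code base) of a YARP module that reads bottles from an input port, processes them and publishes a result on an output port; the port names are placeholders chosen for this example.

    #include <yarp/os/Network.h>
    #include <yarp/os/BufferedPort.h>
    #include <yarp/os/Bottle.h>

    int main() {
        yarp::os::Network yarp;                      // connect to the YARP name server

        yarp::os::BufferedPort<yarp::os::Bottle> in, out;
        in.open("/example/data:i");                  // placeholder port names
        out.open("/example/data:o");

        while (true) {
            yarp::os::Bottle* b = in.read();         // blocking read of the next bottle
            if (b == nullptr) break;                 // port closed or interrupted

            yarp::os::Bottle& reply = out.prepare();
            reply.clear();
            reply.addString("processed");
            reply.addDouble(b->size());              // toy "processing": report bottle size
            out.write();                             // publish to all connected readers
        }
        return 0;
    }

Ports of such modules can then be wired together at runtime (for example with the yarp connect command line tool), which is how the detection filters, the localisation module and MoBeE described below end up running as separately connected processes.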
The close integration of vision and control has been addressed by VISP (Visual Servoing Platform), developed at INRIA (Marchand et al., 2005). It provides a library for controlling robotic systems based on visual feedback and contains a multitude of image processing operations, enabling robots to extract useful features from an image. By providing the desired feature values, a controller for the robot's motion can be derived (Hutchinson et al., 1996; Chaumette and Hutchinson, 2006). The framework presented here builds on these software systems to provide a module-based approach that tightly integrates computer vision and motion control for reaching and grasping on a humanoid robot. The architecture grew naturally over the last few years and was initially designed for the iCub, and hence used YARP. As there also exists a "bridge" component in YARP allowing it to communicate with ROS topics and nodes, it was easy to port the architecture to ROS and Baxter. Furthermore, a branch is currently being developed that aims to be fully agnostic to the underlying middleware.
2 THE EYE-HAND FRAMEWORK
The goal of our research is to improve the autonomous skills of humanoid robots by providing a library that gives a solid base of sensorimotor coordination. To do so we developed a modular framework that allows experiments on humanoid robots to be easily run and repeated. To create better perception and motion, as well as coordination between them, we split the system into two subsystems, one focusing on action, the other on vision (our primary sense). To deal with uncertainties, various machine learning (ML) and artificial intelligence (AI) techniques are applied to support both subsystems and their integration. We close the loop and perform grasping of objects while adapting to unknown, complex environments based on visual feedback, showing that combining robot learning approaches with computer vision improves adaptivity and autonomy in robotic reaching and grasping.
Our framework, sketched in Figure 2A, provides an integrated system for eye-hand coordination. The Perception (green) and Action (yellow) subsystems are supported by Memory (in blue), which enables the persistent modelling of the world. Functionality has grown over time; the currently existing modules that have been used in support of the eye-hand coordination framework for cognitive robotics research (Leitner, 2014, 2015) are:

Perception: Object Detection and Identification: as mentioned above, the detection and identification of objects is a hard problem. To perform object detection and identification we use a system called icVision. It provides a modular approach for the parallel execution of multiple object detectors and identifiers. While these can be hard-coded (e.g. optical-flow segmentation of a moving object, or simple colour thresholding), the main advantage of this flexible system is that it can be interfaced by a learning agent (sketched in Figure 2B). In our case we have successfully used Cartesian Genetic Programming for Image Processing (CGP-IP) (Harding et al., 2013) as an agent to learn visual object models in both a supervised and unsupervised fashion (Leitner et al., 2012a). The resulting modules perform object-specific segmentation of the camera images.

Perception: Object Localisation: icVision also provides a module for estimating the location of an object detected by multiple cameras – i.e. the two eyes in the case of the iCub. Here again the flexibility of the perception framework allows a learning agent to predict object positions with a technique based on genetic programming and artificial neural network estimators (Leitner et al., 2012b). These modules can easily be swapped or run in parallel, even on different machines.

Action: Collision Avoidance and Motion Generation: MoBeE is used to safeguard the robot from collisions both with itself and with the objects detected. It is implemented as a low-level interface to the robot and uses virtual forces based on the robot kinematics to generate the robot's motion. A high-level agent or planner can provide the input to this system (more details in the next section).

Memory: World Model: in addition to modelling the kinematics of the robot, MoBeE also keeps track of the detected objects in operational space. It is also used as a visualization of the robot's current belief state by highlighting impending collisions (see Figure 7).

Memory: Action Repertoire: a light-weight, easy-to-use, one-shot grasping system is used. It can be configured to perform a variety of grasps, all of which require closing the fingers in a coordinated fashion. The iCub incorporates touch sensors on the fingertips but, due to the high noise, we use the error reported by the PID controllers of the finger motors to detect when they are in contact with the object.
Figure 2. (A) Overview of the common subsystems for functional eye-hand coordination on humanoid robots. In broader terms one can separate the (visual) perception side (in green) from the action side (yellow). In addition to these, a memory subsystem (blue) allows an action repertoire and a set of object models to be built up. (B) The framework presented herein consists of a modular way of combining perception tasks, encapsulated in the icVision subsystem (green, as in (A)), with the action side and a world model, represented by the MoBeE subsystem (in yellow and blue). In addition, agents can interface with these systems to generate specific behaviours or to learn from the interaction with the environment (see Results). To allow portability the system uses a communication layer and a robot abstraction middleware, e.g. ROS or YARP.
Complex, state-of-the-art humanoid robots are controlled by a distributed system of computers, most of which are not on board the robot. On the iCub (and similarly on Baxter) an umbilical provides power to the robot and a local-area-network (LAN) connection. Figure 3 sketches the distributed computing system used to operate a typical humanoid robot: very limited on-board computing, which mainly focuses on low-level control and sensing, is supported by networked computers for computationally intensive tasks. The iCub, for example, employs an on-board PC104 controller which communicates with actuators and sensors using CANBus. Similarly, Baxter has an on-board computing system (Intel i7) acting as the gateway to joints and cameras. More robot-specific information about setup and configuration, as well as the code base, can be found on the iCub and Baxter Wiki pages (http://wiki.icub.org and http://api.rethinkrobotics.com), where researchers from a large collection of research labs using the robots contribute and build up a knowledge base.
All the modules described communicate with each other using a middleware framework (depicted in Figure 2B). The first experiments were performed on the iCub, therefore the first choice of middleware was YARP. A benefit of using a robotic middleware is that actuators and sensors can be abstracted, i.e. the modules that connect to icVision and MoBeE do not need to know the robot specifics. Another benefit of building on existing robotics middleware is the ability to distribute modules across multiple computers. In our setup we implemented the various computational tasks as separate nodes which were distributed over the computer network. For the experiments on the iCub, a separate computer was used to run multiple object detection modules in parallel, while another computer (MoBeEBox) performed the collision avoidance and visualised the world model. During the development of new modules an additional user PC was connected via Ethernet to run and debug them. Component-level abstraction using middleware increases portability across different robotic systems. For example, running MoBeE with different robot arms is easily done by simply providing the new arm's kinematic model as an XML file. Transferring to other middleware systems is also possible, though a bit more intricate. We have ported various parts of the architecture to ROS-based modules, allowing interaction with ROS-based robots such as Baxter.
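As an illustration of what this abstraction buys, the sketch below (not code from the framework itself) opens an arm through YARP's generic remote_controlboard device and reads its encoders; the remote port name is a placeholder, and swapping it (together with the kinematic XML) is essentially all that changes when moving between robots.

    #include <yarp/os/Network.h>
    #include <yarp/os/Property.h>
    #include <yarp/dev/PolyDriver.h>
    #include <yarp/dev/ControlBoardInterfaces.h>
    #include <vector>
    #include <cstdio>

    int main() {
        yarp::os::Network yarp;

        yarp::os::Property options;
        options.put("device", "remote_controlboard"); // generic client device
        options.put("remote", "/icub/right_arm");     // placeholder: robot-side port
        options.put("local",  "/example/right_arm");  // placeholder: our client port

        yarp::dev::PolyDriver driver(options);
        yarp::dev::IEncoders* enc = nullptr;
        if (!driver.isValid() || !driver.view(enc)) return 1;

        int nj = 0;
        enc->getAxes(&nj);                            // number of controlled joints
        std::vector<double> q(nj);
        enc->getEncoders(q.data());                   // current joint angles [deg]
        for (int j = 0; j < nj; ++j) std::printf("joint %d: %.2f\n", j, q[j]);
        return 0;
    }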
Figure 3. A sketch of the computing setup we used to operate the iCub at the IDSIA Robotics Lab. The pc104 handles the on-board data processing and controls the motors via CAN-bus. The icubServer runs the YARP server and is the router into the IDSIA-wide network and the internet. Dedicated computers for vision (icubVision) and collision avoidance (MoBeEBox) are used.
Figure 4. Little coding is required for a new module to be added as a filter to icVision; this shows a simple red filter being added. The image acquisition, connection of the communication ports and cleanup are all handled by the superclass.
Humanoid robots, and the iCub in particular, have a high DOF, which allows for complex motions. To perform useful actions, many joints need to be controlled in unison, requiring robust control and planning algorithms. Our framework therefore includes an action subsystem, which in turn contains collision avoidance and grasping capabilities.
2.1 Object Detection and Localisation Modules: icVision
Our humanoid robot should be able to learn how to perceive and detect objects from very few examples, in a manner similar to humans. It should have the ability to develop a representation that allows it to detect the same object again and again, even when the lighting conditions change, e.g. during the course of a day. This is a necessary prerequisite for enabling adaptive, autonomous behaviours based on visual feedback. Our goal is to apply a combination of robot learning approaches, artificial intelligence and machine learning techniques, together with computer vision, to enable a variety of proposed tasks for robots.
icVision (Leitner et al., 2013c) was developed to support current and future research in cognitive robotics. It follows a "passive" approach to the understanding of vision, in which the actions of the human or robot are not taken into account. It processes the visual inputs received by the cameras and builds (internal) representations of objects. This computation is distributed over multiple modules. icVision facilitates the 3D localisation of objects detected in the 2D image plane and provides this information to other systems, e.g. a motion planner. It allows distributed systems of loosely coupled modules to be created and provides standardised interfaces. Special focus is put on object detection in the received input images. Figure 4 shows how a simple red detector can be added as a separately running module. Specialised modules, each containing a specific model, are used to detect distinct patterns or objects. These specialised modules can be connected to form pathways that perform, e.g., object detection, similarly to the hierarchies in the visual cortex. While the focus herein is on the use of single and stereo camera images, we are confident that information from RGB-D cameras (such as the Microsoft Kinect) can be easily integrated.
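As a rough illustration of what such a filter boils down to (the actual icVision base class and its callback names differ; RedFilter and processImage below are hypothetical), a colour-thresholding detector can be written as a few OpenCV operations producing a binary segmentation:

    #include <opencv2/imgproc.hpp>
    #include <opencv2/core.hpp>

    // Hypothetical filter: the real icVision superclass handles image acquisition,
    // port connections and cleanup; only the per-image processing is sketched here.
    class RedFilter {
    public:
        // Input: BGR camera image. Output: binary mask of sufficiently red pixels.
        cv::Mat processImage(const cv::Mat& bgr) const {
            std::vector<cv::Mat> ch;
            cv::split(bgr, ch);                       // ch[0]=B, ch[1]=G, ch[2]=R

            cv::Mat redness;
            cv::subtract(ch[2], ch[1], redness);      // red minus green suppresses white/grey

            cv::Mat mask;
            cv::threshold(redness, mask, 60, 255, cv::THRESH_BINARY);
            cv::medianBlur(mask, mask, 5);            // remove speckle noise
            return mask;
        }
    };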
The system consists of different modules, with the core module providing basic functionality and information flow. Figure 5 shows separate modules for detection and localisation and their connection to the core, which abstracts the robot's cameras and the communication with external agents. These external agents are further modules and can perform a wide variety of tasks, for example, specifically testing and comparing different object detection or localisation techniques. icVision provides a pipeline that connects visual perception with the world modelling in the MoBeE module (dashed line in Figure 5). By processing the incoming images from the robot with a specific filter for each "eye", the location of a specific object can be estimated by the localisation module and then communicated to MoBeE (Figure 6 depicts the typical information flow).
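In our system this mapping from the two image positions and the joint encoders to a 3D position is learned (Leitner et al., 2012b), since the iCub's eyes move independently. For comparison, the textbook baseline this replaces is plain rectified-stereo triangulation, sketched below; the focal length, baseline and principal point are placeholder calibration values, and the formula only holds for a fixed, rectified camera pair with non-zero disparity.

    #include <cstdio>

    struct Point3D { double x, y, z; };

    // Rectified-stereo triangulation: (uL, v) and (uR, v) are the object's pixel
    // coordinates in the left and right images. Intrinsics below are placeholders.
    Point3D triangulate(double uL, double uR, double v) {
        const double f  = 400.0;   // focal length [px]      (placeholder)
        const double B  = 0.068;   // camera baseline [m]    (placeholder)
        const double cx = 160.0;   // principal point x [px] (placeholder)
        const double cy = 120.0;   // principal point y [px] (placeholder)

        double disparity = uL - uR;               // assumes rectified images
        double Z = f * B / disparity;             // depth
        double X = (uL - cx) * Z / f;
        double Y = (v  - cy) * Z / f;
        return {X, Y, Z};
    }

    int main() {
        Point3D p = triangulate(180.0, 150.0, 130.0);
        std::printf("%.3f %.3f %.3f\n", p.x, p.y, p.z);
        return 0;
    }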
Figure 5. The framework described consists of multiple software entities (indicated by cogwheels), all connected via a communication layer such as YARP or ROS. The provided robot abstraction is used by the main modules. On the perception side, icVision processes the incoming camera images in its core module and sends them to (possibly multiple) separately running detection filters. Another separate entity performs the localisation based on the detected objects and the robot's pose. The icVision core also provides interfaces for agents to query for specific objects, or for agents that learn object representations (such as a CGP-IP based learner). The interface also provides the object locations to MoBeE, where a world model is created by calculating the forward kinematics from the incoming joint positions. The same entity performs the collision avoidance between separate body parts or the objects which have been detected by icVision. The controller is independent and translates the virtual forces created by MoBeE, or provided by higher-level planning agents, into motor commands.
2.2 Robot and World Modeling for Collision Avoidance: MoBeE
MoBeE (Modular Behaviour Environment for Robots) is at the core of the described framework for eye-hand coordination. It is a solid, reusable, open-source toolkit (https://github.com/kailfrank/MoBeE) for prototyping behaviours on the iCub humanoid robot. MoBeE represents the state-of-the-art in humanoid robot control and is similar in conception to the control system that runs DLR's Justin (De Santis et al., 2007b; Dietrich et al., 2011). The goal of MoBeE is to facilitate the close integration of planning and motion control (sketched in Figure 2B). Inspired by Brooks (1991), it aims to embody the planner, provide safe and robust action primitives and perform real-time re-planning. This facilitates exploratory behaviour using a real robot, with MoBeE acting as a supervisor preventing collisions, even between multiple robots. It consists of three main parts, all implemented in C++: a kinematic library with a visualisation, and a controller, running in two separate modules. Together these provide the "collision avoidance" (yellow) and "world model" (blue) depicted in Figure 2A. Figure 5 shows the connections between the various software entities required to run the full eye-hand coordination framework. MoBeE communicates with the robot and provides an interface to other modules, one of which is the perception side, icVision.
307
necessary as most current robotic systems lack a physical skin which would provide sensory information
308
to perform reflexive motions. It was intended to enforce constraints in real time while a robot is under
309
the control of any arbitrary planner/controller. This led to a design based on switching control, which
310
facilitated experimentation with pre-existing control modules. A kinematic model is loaded from an XML
311
file using “Zero Position Displacement Notation” (Gupta, 1986).312
When the stochastic or exploratory agent/controller (light gray at the top in Figure 5) does something dangerous or undesirable, MoBeE intervenes. Collision detection is performed on the loaded kinematic robot model, which consists of a collection of simple geometries forming separate body parts (see Figure 7). These geometries are created as C++ objects that inherit functionality from both the fast geometric intersection library and the OpenGL visualization. The joint encoders provided by the robot abstraction layer are used to calculate collisions, i.e. intersecting body parts. In the first version this collision signal was used to avoid collisions by switching control, which was later abandoned in favor of a second-order dynamical system (Frank, 2014). Constraints, such as impending collisions, joint limits or cable lengths, can be addressed by adding additional forces to the system. Due to the dynamical system, many of the collisions encountered in practice no longer stop the robot's action, but rather deflect the requested motion, bending it around an obstacle.

Figure 6. To provide information about the 3D location of an object to MoBeE, the following is performed: at first, camera images are received by the core from the hardware via the communication layer. The images are split into channels and made available to each individual icVision filter that is currently active. These then perform binary segmentation for a specific object. The object's (centre) location in the image frame, (u,v), is then communicated to a 3D localisation module. Using the joint encoder values and the object's location in both eyes, a location estimate is then sent to the MoBeE world model (Leitner et al., 2013a).

Figure 7. A scene of the iCub avoiding an object (inset) during one of our experiments (Leitner et al., 2014b) and the corresponding visualization of the MoBeE model. Red body parts highlight impending collisions with either another body part (as in the case of the hip with the upper body) or an object in the world model (hand with the cup). (See video: https://www.youtube.com/watch?v=w_qDH5tSe7g)
MoBeE continuously mixes control and constraint forces to generate the robot motion in real time, which results in smoother, more intuitive motions in response to constraints/collisions (Figure 8). The effects of sensory noise are mitigated passively by the controller. The constraint forces associated with collisions are proportional to their penetration depth; in our experimentation it was observed that the noise in the motor encoder signal has a minimal effect on the collision response. The sporadic shallow collisions, which can be observed when the robot is operating close to an obstacle, such as the other pieces on a chess board, generate tiny forces that only serve to nudge the robot gently away from the obstacle. MoBeE can in addition be used for adaptive roadmap planning (Kavraki et al., 1996; Stollenga et al., 2013); the dynamical approach means that the planner/controller is free to explore continuous spaces, without the need to divide them into safe and unsafe regions.
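A minimal sketch of such an attractor system, written here in generic notation (the exact formulation used in MoBeE is given in Frank (2014)), mixes an attractor force towards the commanded configuration with repulsive constraint forces proportional to penetration depth:

    M \ddot{q} + C \dot{q} \;=\; K_p \,(q^{*} - q) \;+\; \sum_{i} k_i \, d_i(q)\, \hat{n}_i(q)

Here q is the joint configuration, q* the commanded (or resting) pose, the first term on the right the control (attractor) force, d_i(q) the penetration depth of the i-th impending collision or violated constraint, and \hat{n}_i(q) the joint-space direction that resolves it. Shallow penetrations thus produce small corrective forces, which matches the gentle "nudging" behaviour described above.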
The interface for external agents is further simplified by allowing them to subscribe to specific points of interest in the imported models (seen in yellow in Figure 7). These markers can be defined on static or moving objects or on the robots themselves. The marker positions, or events such as a body part being in a colliding pose, are broadcast via the interface, allowing connected agents to react, e.g. to trigger a grasp primitive. More details about the whole MoBeE architecture and how it was used for reach learning can be found in Frank (2014). Additionally, we have published multiple videos of our robotic experiments using MoBeE, foremost 'Towards Intelligent Humanoids' (http://Juxi.net/media/ or direct video URL: http://vimeo.com/51011081).
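To make this event-driven style concrete, the sketch below shows how a small agent might react to such broadcasts; the port names and the ("collision" <marker>) bottle layout are purely hypothetical stand-ins for MoBeE's actual interface, which is documented with the MoBeE source.

    #include <yarp/os/Network.h>
    #include <yarp/os/BufferedPort.h>
    #include <yarp/os/Bottle.h>
    #include <yarp/os/RpcClient.h>

    int main() {
        yarp::os::Network yarp;

        yarp::os::BufferedPort<yarp::os::Bottle> events;
        events.open("/agent/events:i");                       // hypothetical port names
        yarp::os::Network::connect("/MoBeE/events:o", "/agent/events:i");

        yarp::os::RpcClient grasper;
        grasper.open("/agent/grasp:rpc");
        yarp::os::Network::connect("/agent/grasp:rpc", "/LEOGrasper/rpc");

        while (true) {
            yarp::os::Bottle* e = events.read();              // blocking read
            if (e == nullptr) break;
            // Hypothetical event layout: ("collision" <markerName>)
            if (e->get(0).asString() == "collision" &&
                e->get(1).asString() == "right_hand_marker") {
                yarp::os::Bottle cmd, reply;
                cmd.addString("close");                       // trigger the grasp primitive
                grasper.write(cmd, reply);
            }
        }
        return 0;
    }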
Figure 8. The virtual forces created by the dynamical system within MoBeE. It continuously mixes control and constraint forces (orange vectors) to generate the robot motion in real time, resulting in smoother, more intuitive motions in response to constraints/collisions (dashed green line). To calculate the force, the distance of the object to the hand in its coordinate frame CSHand is used. MoBeE handles the transformation from the coordinate systems of the cameras (CSR/CSL) to the world frame CSWorld and to CSHand.
2.3 Action Repertoire: LEOGrasper Module
Robotic grasping is a large and active research area. A main issue is that, in order to grasp successfully, the pose of the object to be grasped has to be known quite precisely, because grasp planners are required to plan the precise placement and motion of each individual "finger" (or gripper). Several methods for robust grasp planning exploit the object geometry or tactile sensor feedback. However, the uncertainties introduced by estimating a range of possible object poses can also be exploited to choose more robust grasps (Carbone, 2013).
Our implementation uses a different, more reactive approach. Grasp primitives are triggered from MoBeE and involve controlling the five-digit iCub hand. These primitives consist of target points in joint space to be reached sequentially during grasp execution. Another problem is realising when to stop the grasping motion. The iCub has touch sensors on the palm and fingertips; to know when there is a successful grasp, these sensors need to be calibrated for the material in use. Especially for objects as varied as plastic cups, ceramic mugs and tin cans, the tuning can be quite cumbersome and leads to a low signal-to-noise ratio. We decided to overcome this by using the errors from the joint controllers directly. This approach also allows feedback on whether a grasp was successful to be provided to a planner or learning system.
LEOGrasper is our light-weight, easy-to-use, one-shot grasping system for the iCub. The system itself is contained in one single module using YARP to communicate. It can be triggered by a simple command from the command line, over the network or, as in our case, from MoBeE. The module can be configured for multiple grasp types; these are loaded from a simple text file containing some global parameters (such as the maximum velocity) as well as the trajectories. Trajectories are specified by providing positions for each joint individually, covering multiple joints per digit as well as abduction, spread, etc. on the iCub. We provide power and pinch grasps and pointing gestures. For example, to close all digits in a coordinated fashion, at least two positions need to be defined, the start and end position (see Figure 9). For more intricate grasps, multiple intermediate points can be provided. On a close signal, the robot's fingers are driven through the defined points towards the end state; on open they are driven back towards the start state.

Figure 9. Defining a grasp in LEOGrasper is simple: defining a start and end position in joint space is all that is required. The open command will revert the hand to the start state (left), close will attempt to reach the end state (right). For more complex grasps intermediate states can be provided.

Figure 10. The iCub hand during grasp execution with a variety of objects, including tin cans, tea boxes and plastic cups (Leitner et al., 2014b).

LEOGrasper has been used extensively in our robotics lab and selected successful grasps are shown in Figure 10 (source code available at https://github.com/Juxi/iCub/). The existing trajectories and holding parameters were tuned through experimentation; in the future we aim to learn these primitives from human demonstrations or reward signals.
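A minimal sketch of the waypoint-sequencing idea is given below; it assumes a YARP position-control interface to the hand joints and a hard-coded two-waypoint "power grasp", whereas LEOGrasper itself loads these values and further parameters from its configuration file.

    #include <yarp/dev/ControlBoardInterfaces.h>
    #include <yarp/os/Time.h>
    #include <vector>

    // Drive the hand joints through a sequence of joint-space waypoints.
    // 'joints' lists the hand-joint indices on the arm control board,
    // 'waypoints' holds one target vector (in degrees) per step.
    void executeGrasp(yarp::dev::IPositionControl* pos,
                      const std::vector<int>& joints,
                      const std::vector<std::vector<double>>& waypoints)
    {
        for (const auto& wp : waypoints) {
            for (size_t k = 0; k < joints.size(); ++k)
                pos->positionMove(joints[k], wp[k]);   // command one joint at a time

            bool done = false;
            while (!done) {                            // wait until the waypoint is reached
                pos->checkMotionDone(&done);
                yarp::os::Time::delay(0.05);
            }
        }
    }

    // Example call with invented values (a hypothetical open -> closed power grasp):
    // std::vector<int> hand = {7, 8, 9, 10, 11, 12, 13, 14, 15};
    // executeGrasp(pos, hand, { {0,10,0,0,0,0,0,0,0}, {0,80,60,60,60,60,60,60,100} });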
3 METHOD OF INTEGRATING ACTION & VISION: APPLYING THE FRAMEWORK
The framework has been used extensively over the last few years in our experimental robotics research. Various papers have been published during the development of the different subsystems and their improvements; Table 1 provides an overview. MoBeE can be pre-loaded with a robot model using an XML file that describes the kinematics based on "Zero Position Displacement Notation" (Gupta, 1986). Figure 11 shows a snippet from the XML describing the Katana robotic arm. In addition, a pre-defined, marked-up world model can be loaded from a separate file as well. This is particularly useful for stationary objects in the world or to restrict the movement space of the robot during learning operations.
Through the common MoBeE interface, the properties of each object can be modified via an RPC call following the YARP standard, accessible from the command line, a webpage or any other module connecting to it. Objects are placed in the world model either by loading them from a file at start-up or at runtime by agents such as the icVision core. Through the interface an object can also be set as an obstacle, which means repelling forces are calculated, or as a target, which will attract the end-effector. In addition, objects can be defined as ghosts, leading to the object being ignored in the force calculation.
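As an illustration of this kind of runtime reconfiguration, the snippet below sends object-type changes to MoBeE over a YARP RPC connection; the port names and the exact command vocabulary ("setType" etc.) are placeholders chosen for this sketch rather than MoBeE's documented commands.

    #include <yarp/os/Network.h>
    #include <yarp/os/RpcClient.h>
    #include <yarp/os/Bottle.h>

    int main() {
        yarp::os::Network yarp;

        yarp::os::RpcClient world;
        world.open("/agent/world:rpc");                      // placeholder client port
        yarp::os::Network::connect("/agent/world:rpc", "/MoBeE/world:rpc");

        // Hypothetical vocabulary: switch the cup from obstacle to target so the
        // virtual forces attract the end-effector instead of repelling it.
        yarp::os::Bottle cmd, reply;
        cmd.addString("setType");
        cmd.addString("cup");
        cmd.addString("target");                             // "obstacle", "target" or "ghost"
        world.write(cmd, reply);
        return 0;
    }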
Figure 11. (Left) The XML file used to describe the kinematics of the Katana arm. (Right) Visualisation of the same XML file in MoBeE.
Table 1. Overview of experiments facilitated by parts of the architecture presented herein.

Experiment Description                | Framework        | Reference (Year)
Autonomous Object Detection           | icVision, CGP-IP | Leitner et al. (2012a, 2013b)
Multi Robot Collision Avoidance       | MoBeE (vSkin)    | Leitner et al. (2012b,c)
Safe Policy Learning                  | MoBeE (vSkin)    | Pathak et al. (2013)
Object Detection and Localisation     | icVision         | Leitner et al. (2013a)
Spatial Perception Learning           | MoBeE, icVision  | Leitner et al. (2013d)
Learning Object Detection             | CGP-IP           | Leitner et al. (2013e)
Humanoid Motion Planning              | MoBeE            | Stollenga et al. (2013)
Reinforcement Learning for Reaching   | MoBeE            | Frank et al. (2014)
Improving Vision Through Interaction  | full system      | Leitner et al. (2014a)
Reactive Reaching and Grasping        | full system      | Leitner et al. (2014b)
Cognitive and Developmental Robots    | full system      | Leitner (2015)
As mentioned earlier, previous research suggests that connections between motor actions and observations exist in the human brain and describes their importance to human development (Berthier et al., 1996). Interfacing and connecting artificial systems that perform visual and motor cortex-like operations on robots will be crucial for the development of autonomous robotic systems. When attempting to learn behaviours on a complex robot like the iCub or Baxter, state-of-the-art AI and control theories can be tested (Frank et al., 2014), and shortcomings of these learning methods can be discovered (Zhang et al., 2015) and addressed. For example, Hart et al. (2006) showed that a developmental approach can be used for a robot to learn to reach and grasp. We developed modules for action generation and collision avoidance and their interfaces to the perception side. By having the perception and action sides tightly coupled, we can use learning algorithms that also require negative feedback, which we can generate without actually "hurting" the robot.
3.1 Example: Evolving Object Detectors
We previously developed a technique based on Cartesian Genetic Programming (CGP) (Miller, 1999, 2011) that allows the automatic generation of computer programs for robot vision tasks, called Cartesian Genetic Programming for Image Processing (CGP-IP) (Harding et al., 2013). CGP-IP draws inspiration from previous work in the field of machine learning and combines it with the available tools of the image processing discipline, namely in the form of OpenCV functions. OpenCV is an open-source framework providing a mature toolbox for a variety of image processing tasks. This domain knowledge is integrated into CGP-IP, allowing object detectors to be evolved quickly from only a small training set: our experiments showed that just a handful of (5-20) images per object are required. These detectors can then be used to perform the binary image segmentation within the icVision framework. In addition, CGP-IP allows for the segmentation of colour images with multiple channels, a key difference to much of the previous work focussing on gray-scale images. CGP-IP deals with separate channels and splits incoming colour images into individual channels before they are used at each node in the detector. This leads to the evolutionary process selecting which channels will be used and how they are combined.
CGP-IP manages a population of candidates, each consisting of individual genes representing the nodes. Single channels are used as inputs and outputs of each node, while the action of each node is the execution of an OpenCV function. The full candidate can be interpreted as a computer program performing a sequence of image operations on the input image. The output of each candidate filter is a binary segmentation. The approach is supervised, in the sense that a fitness needs to be calculated for each candidate; for scoring each individual, a ground-truth segmentation has to be provided. A new generation of candidates is then created out of the fittest individuals. An illustrative example of a CGP-IP candidate is shown in Figure 12. CGP-IP can directly create C# or C++ code from these graphs. The code can be executed directly on the real hardware or pushed as updates to existing filter modules running within icVision. CGP-IP includes classes for image operations, the evolutionary search and the integration with the robotic side through a middleware. Currently we are extending the C# implementation to run on various operating systems and to be integrated into a distributed visual system, such as DRVS (Chamberlain et al., 2016).
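The code CGP-IP exports is essentially a fixed chain of OpenCV calls ending in a threshold. The fragment below is a hand-written illustration of that shape (not an actual evolved detector): the particular operations, channel choices and threshold value stand in for what evolution would select.

    #include <opencv2/imgproc.hpp>
    #include <opencv2/core.hpp>

    // Illustrative only: the structure mirrors an evolved CGP-IP candidate,
    // i.e. a short, fixed sequence of OpenCV operations on selected channels,
    // followed by a threshold that yields the binary segmentation.
    cv::Mat evolvedLikeFilter(const cv::Mat& bgr) {
        std::vector<cv::Mat> ch;
        cv::split(bgr, ch);                                    // node 0: split channels

        cv::Mat a, b, diff, mask;
        cv::GaussianBlur(ch[2], a, cv::Size(5, 5), 0);         // node 1: smooth red channel
        cv::GaussianBlur(ch[0], b, cv::Size(9, 9), 0);         // node 2: smooth blue channel
        cv::absdiff(a, b, diff);                               // node 3: channel difference
        cv::dilate(diff, diff, cv::Mat());                     // node 4: grow strong responses
        cv::threshold(diff, mask, 40, 255, cv::THRESH_BINARY); // output: binary segmentation
        return mask;
    }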
CGP-IP not only provides simple, reusable object detection, it also provides a simple way of learning detectors from very small training sets. In connection with our framework, this data can be collected on the real hardware and the learned results executed directly. For this, an agent module was designed that communicates with the icVision core through its interface. Arbitrary objects are placed in front of the robot and images are collected while the robot is moving about. The collected images are then processed by the agent and a training set is created. With the ground-truth location of the object known, so is the location of the object in the image. A fixed-size bounding box around this location yields the ground truth required to evolve an object detector. This way the robot (and some prior knowledge of the location) can be used to autonomously learn object detectors for all the objects in the robot's environment (Leitner et al., 2012a).
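A sketch of how such a ground-truth mask could be generated from the known (projected) object location is shown below; the box size and image dimensions are placeholder values.

    #include <opencv2/imgproc.hpp>
    #include <opencv2/core.hpp>

    // Build a binary ground-truth mask by placing a fixed-size box around the
    // pixel (u, v) where the object is known to project into the camera image.
    cv::Mat makeGroundTruth(int imgWidth, int imgHeight, int u, int v, int boxSize = 60) {
        cv::Mat mask = cv::Mat::zeros(imgHeight, imgWidth, CV_8UC1);
        cv::Rect box(u - boxSize / 2, v - boxSize / 2, boxSize, boxSize);
        box &= cv::Rect(0, 0, imgWidth, imgHeight);     // clip to the image borders
        cv::rectangle(mask, box, cv::Scalar(255), cv::FILLED);
        return mask;
    }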
Figure 12. Example illustration of a CGP-IP genotype, taking three input channels of the current image. A sequence of OpenCV operations is performed before thresholding occurs to produce a binary segmented output image.

3.2 Example: Reaching While Avoiding a Moving Obstacle

The inverse kinematics problem, i.e. placing the hand at a given coordinate in operational space, can be solved with previously available software on the iCub, such as the existing operational space controller (Pattacini, 2011) or a roadmap-based approach (Stollenga et al., 2013). These systems require very accurate knowledge of the mechanical system to produce precise solutions, requiring a lengthy calibration procedure. They also tend to be brittle when changes in the robot's environment require the created motions to be adapted.
To overcome this problem, the framework, as described above, creates virtual forces based on the world model within MoBeE to govern the actual movement of the robot. Static objects in the environment, such as the table in front of the robot, can be added directly into the model via an XML file. Once in the model, actions and behaviours are adapted due to the computed constraint forces. This way we are able to send arbitrary motions to our system while ensuring the safety of our robot. Even with just these static objects, this has been shown to provide an interesting way to learn robot reaching behaviours through reinforcement (Pathak et al., 2013; Frank et al., 2014). The presented system provides the same functionality for arbitrary, non-static objects: after detection in both cameras, the object's location is estimated and updated in the world model. The forces are continually recalculated to avoid impending collisions even with moving objects. Figure 7 shows how the localised object is in the way of the arm and the hand. To ensure the safety of the rather fragile fingers, a collision sphere around the end-effector was added – seen in red, indicating a possible collision due to the sphere intersecting with the object. The same can be seen with the lower arm. The forces push the intersecting geometries away from each other, leading to a movement of the end-effector away from the obstacle. Figure 13 shows how the robot's arm is "pushed" aside when the cup is moved close to the arm, thereby avoiding a non-stationary obstacle. It does so until the arm reaches its limit; then the forces accumulate and the end-effector is "forced" upwards to continue avoiding the obstacle. Without an obstacle the arm settles back into its resting pose q. The reaching behaviour is adapted similarly while the object is moved. By simply sending a signal through the interface, the type of the object within the world model can be changed from obstacle to target, so that the calculated forces become attracting instead of repelling. MoBeE also allows certain responses to be triggered when collisions occur. In the case where we want the robot to pick up the object, we can activate a grasp subsystem whenever the hand is in the close vicinity of the object. We are using a prototypical power-grasp style hand-closing action, which has been used successfully in various demos and videos (see http://Juxi.net/media/). Figure 10 shows the iCub successfully picking up (by adding an extra upwards force) various objects using our grasping subsystem, executing the same action. Our robot frameworks are able to track multiple objects at the same time, which is also visible in Figure 7, where both the cup and the tea box are tracked. By simply changing the type of the object within MoBeE, the robot reaches for a certain object while avoiding the other.

Figure 13. The iCub's arm is controlled by MoBeE, using reactive virtual forces, to stay in a pose that does not collide with the moving obstacle or the table. (See video: https://www.youtube.com/watch?v=w_qDH5tSe7g)
3.3 Example: Improving Robot Vision by Interaction
The two subsystems can further be integrated for use by higher-level agents controlling the robot's behaviour. Building on the previous section, the following example shows how an agent can be used to learn visual representations (in CGP-IP) by having the robot interact with its environment. Starting from the previously mentioned evolved object detectors, we extended the robot's interaction abilities to become better at segmenting objects. Similar to the experiment by Welke et al. (2010), the robot was able to curiously rotate the object of interest with its hand. Additional actions were added for the robot to perform, such as poking, pushing and a simple viewpoint change by leaning left and right. Furthermore, a baseline image dataset is collected while the robot (and the object) are static. In this experiment we wanted to measure the impact of specific actions on segmentation performance. After the robot performed one of the four pre-programmed actions, a new training set was collected containing the images from the static scenario and the images recorded during action execution. While more data generally means better results, we could also see that some actions led to better results than others. Figure 14 shows visually how the improvement leads to better object segmentation on validation images. On the left is the original camera image, in the middle the segmentation performed by an evolved filter based solely on the “static scene baseline”, and on the right the segmentation when integrating the new observations collected during a haptic exploration action.
Because the improvement is measurable, the robot can select and perform the action that yields the best possible improvement for a specific detector. Interaction provides robots with a unique possibility (compared to passive camera systems) to build more accurate and robust visual representations. Simple leaning actions change the camera viewpoint sufficiently to collect a different dataset. This does not just help with separating the geometries of the scene but also creates more robust and discriminative classifiers. Active scene interaction, e.g. by applying forces to objects, enables the robot to start reasoning about relationships between objects, such as “are two objects (inseparably) connected”, or to find out other physical properties, like “is the juice box full or empty”. We are planning to add more complex actions and abilities to learn more object properties, and have started to investigate how to determine an object's mechanical properties through interaction and observation (Dansereau et al., 2016).
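The action-selection loop described above can be summarised with the following Python sketch. It is only a schematic outline, not part of the actual framework: `collect_images`, `retrain_detector` and `segmentation_score` are hypothetical stand-ins for recording camera frames during an action, re-running CGP-IP training, and scoring a detector on held-out validation images.

```python
# Schematic sketch of selecting the exploratory action that most improves a detector.
# All helper callables are hypothetical placeholders (see the note above).

ACTIONS = ["rotate_in_hand", "poke", "push", "lean_left_right"]

def best_exploration_action(baseline_images, validation_images,
                            collect_images, retrain_detector, segmentation_score):
    # Detector trained only on the static-scene baseline.
    baseline_detector = retrain_detector(baseline_images)
    baseline_score = segmentation_score(baseline_detector, validation_images)

    improvements = {}
    for action in ACTIONS:
        # Images recorded while the robot executes the action are added to the
        # baseline set before the detector is re-trained.
        action_images = collect_images(action)
        detector = retrain_detector(baseline_images + action_images)
        improvements[action] = (segmentation_score(detector, validation_images)
                                - baseline_score)

    # The robot then performs the action with the largest measurable improvement.
    best = max(improvements, key=improvements.get)
    return best, improvements
```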
4 DISCUSSION
Herein we present our modular software framework, applied in our research towards autonomous and adaptive robotic manipulation with humanoids. A tightly integrated sensorimotor system, based on subsystems developed over the past years, enables a basic level of eye-hand coordination on our robots. The robot detects objects placed at random positions on a table and performs a visually guided reaching motion before executing a simple grasp.
Our implementation enables the robot to adapt to changes in the environment. It safeguards complex humanoid robots, such as the iCub, from unwanted interactions, i.e. collisions with the environment or itself. This is achieved by integrating the visual system with the motor side, applying attractor dynamics based on the robot's pose and a model of the world. We achieve a level of integration between visual perception and action not previously seen on the iCub. Our approach, while comparable to visual servoing, has the advantages of being completely modular and of taking collisions (and other constraints) into account.
Figure 14. Segmentation improvements for two objects after interaction. The left column shows the robot's view of the scene. The middle column shows the first segmentation, generated from the “static scene baseline”. The right column shows the improved segmentation after learning continued with new images collected during the manipulation of the object.
The framework has grown over recent years and has been used in a variety of experiments, mainly with the iCub humanoid robot. It has since been partly ported to work with ROS, with the aim of running pick-and-place experiments on Baxter; the code will be made available on the authors' webpage at: http://juxi.net/projects/VisionAndActions/. The overarching goal was to enable a way of controlling a complex humanoid robot that combines motion planning with low-level reflexes driven by visual feedback. icVision provides the detection and localisation of objects in the visual stream. For example, it will provide the location of a chess board on a table in front of the robot. It can also provide the positions of chess pieces to the world model. Based on this, an agent can plan a motion to pick up a specific piece. During the execution of that motion MoBeE calculates forces for each chess piece: an attracting force for the target piece and repelling forces for all the other pieces. These forces are updated whenever a new object (or object location) is perceived, yielding a more robust execution of the motion due to better coordination between vision and action.
The current system consists of a mix of pre-defined and learned parts; in the future we plan to integrate further machine learning techniques to improve the object manipulation skills of robotic systems, for example, learning to plan around obstacles, including improved prediction and selection of actions. This will lead to a more adaptive, versatile robot, able to work in unstructured, cluttered environments. Furthermore, it might be of interest to investigate an even tighter sensorimotor coupling, e.g. by working directly in image space – similar to image-based visual servoing approaches (Chaumette and Hutchinson, 2006) – thereby avoiding the need to translate 2D image features into operational-space locations.
In the future, we are aiming to extend the capabilities to allow for quick end-to-end training of reaching (Zhang et al., 2015) and manipulation tasks (Levine et al., 2015), as well as an easy transition from simulation to real-world experiments. We are also looking at developing agents that interface with this framework to learn the robot's kinematics and adapt to changes occurring due to malfunction or wear, leading to self-calibration of a robot's eye-hand coordination.
ACKNOWLEDGMENTS
The authors would like to thank Mikhail Frank, Marijn Stollenga, Leo Pape, Adam Tow and William Chamberlain for their discussions and valued inputs to this paper and the underlying software frameworks presented herein.
Funding: various European research projects (IM-CLeVeR #FP7-IST-IP-231722, STIFF #FP7-IST-IP-231576) and the Australian Research Council Centre of Excellence for Robotic Vision (#CE140100016).
REFERENCES
Ambrose, R., Wilcox, B., Reed, B., Matthies, L., Lavery, D., and Korsmeyer, D. (2012). NASA's Space Technology Roadmaps (STRs): Robotics, Tele-Robotics, and Autonomous Systems Roadmap. Tech. rep., National Aeronautics and Space Administration (NASA)
Ando, N., Suehiro, T., and Kotoku, T. (2008). A software platform for component based RT-system development: OpenRTM-aist. In Simulation, Modeling, and Programming for Autonomous Robots (Springer). 87–98
Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded up robust features. In Computer Vision – ECCV 2006, eds. A. Leonardis, H. Bischof, and A. Pinz (Springer), vol. 3951 of Lecture Notes in Computer Science. 404–417
Berthier, N., Clifton, R., Gullapalli, V., McCall, D., and Robin, D. (1996). Visual information and object size in the control of reaching. Journal of Motor Behavior 28, 187–197
Brooks, R. (1991). Intelligence without representation. Artificial Intelligence 47, 139–159
Brown, E., Rodenberg, N., Amend, J., Mozeika, A., Steltz, E., Zakin, M., et al. (2010). Universal robotic gripper based on the jamming of granular material. Proceedings of the National Academy of Sciences (PNAS) 107, 18809–18814
Brugali, D. (2007). Software Engineering for Experimental Robotics, vol. 30 (Springer)
Carbone, G. (2013). Grasping in Robotics, vol. 10 of Mechanisms and Machine Science (Springer)
Chamberlain, W., Leitner, J., and Corke, P. (2016). A Distributed Robotic Vision Service. In Proceedings of the International Conference on Robotics and Automation (ICRA)
Chaumette, F. and Hutchinson, S. (2006). Visual servo control, Part I: Basic approaches. IEEE Robotics & Automation Magazine 13, 82–90
Ciliberto, C., Smeraldi, F., Natale, L., and Metta, G. (2011). Online multiple instance learning applied to hand detection in a humanoid robot. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS)
Cipolla, R., Battiato, S., and Farinella, G. M. (2010). Computer Vision: Detection, Recognition and Reconstruction, vol. 285 (Springer)
Corke, P. (2011). Robotics, Vision and Control, vol. 73 of Springer Tracts in Advanced Robotics (Springer)
Dansereau, D. G., Singh, S. P. N., and Leitner, J. (2016). In IEEE International Conference on Robotics and Automation (ICRA) (IEEE)
Davison, A. J. and Murray, D. W. (2002). Simultaneous localization and map-building using active vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 865–880
De Santis, A., Albu-Schäffer, A., Ott, C., Siciliano, B., and Hirzinger, G. (2007a). The skeleton algorithm for self-collision avoidance of a humanoid manipulator. In Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics
De Santis, A., Albu-Schäffer, A., Ott, C., Siciliano, B., and Hirzinger, G. (2007b). The skeleton algorithm for self-collision avoidance of a humanoid manipulator. In Advanced Intelligent Mechatronics, 2007 IEEE/ASME International Conference on (IEEE), 1–6
Diankov, R. and Kuffner, J. (2008). OpenRAVE: A planning architecture for autonomous robotics. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-08-34 79
Dietrich, A., Wimböck, T., Täubig, H., Albu-Schäffer, A., and Hirzinger, G. (2011). Extensions to reactive self-collision avoidance for torque and position controlled humanoids. In Proceedings of the International Conference on Robotics and Automation (ICRA). 3455–3462
Elkady, A. and Sobh, T. (2012). Robotics middleware: A comprehensive literature survey and attribute-based bibliography. Journal of Robotics 2012
Fanello, S. R., Ciliberto, C., Natale, L., and Metta, G. (2013). Weakly supervised strategies for natural object recognition in robotics. In Proceedings of the International Conference on Robotics and Automation (ICRA)
Fitzpatrick, P., Metta, G., and Natale, L. (2008). Towards long-lived robot genes. Robotics and Autonomous Systems 56, 29–45
Forssberg, H., Eliasson, A., Kinoshita, H., Johansson, R., and Westling, G. (1991). Development of human precision grip I: Basic coordination of force. Experimental Brain Research 85, 451–457
Frank, M. (2014). Learning To Reach and Reaching To Learn: A Unified Approach to Path Planning and Reactive Control through Reinforcement Learning. Ph.D. thesis, Università della Svizzera italiana, Lugano
Frank, M., Leitner, J., Stollenga, M., Förster, A., and Schmidhuber, J. (2014). Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in Neurorobotics 7. doi:10.3389/fnbot.2013.00025
Gerkey, B., Vaughan, R. T., and Howard, A. (2003). The Player/Stage project: Tools for multi-robot and distributed sensor systems. In Proceedings of the 11th International Conference on Advanced Robotics. vol. 1, 317–323
Gori, I., Fanello, S., Odone, F., and Metta, G. (2013). A compositional approach for 3D arm-hand action recognition. In Proceedings of the IEEE Workshop on Robot Vision (WoRV)
Gupta, K. (1986). Kinematic analysis of manipulators using the zero reference position description. The International Journal of Robotics Research 5, 5
Harding, S., Leitner, J., and Schmidhuber, J. (2013). Cartesian genetic programming for image processing. In Genetic Programming Theory and Practice X, eds. R. Riolo, E. Vladislavleva, M. D. Ritchie, and J. H. Moore (Ann Arbor: Springer New York), Genetic and Evolutionary Computation. 31–44. doi:10.1007/978-1-4614-6846-2_3
Hart, S., Ou, S., Sweeney, J., and Grupen, R. (2006). A framework for learning declarative structure. In Proceedings of the RSS Workshop: Manipulation in Human Environments
Hartley, R. and Zisserman, A. (2000). Multiple View Geometry in Computer Vision (Cambridge University Press), 2nd edn.
Hutchinson, S., Hager, G. D., and Corke, P. I. (1996). A tutorial on visual servo control. IEEE Transactions on Robotics and Automation 12, 651–670
Jackson, J. (2007). Microsoft Robotics Studio: A technical introduction. IEEE Robotics & Automation Magazine 14, 82–87
Jeannerod, M. (1997). The Cognitive Neuroscience of Action (Blackwell Publishing)
Johnson, M. H. and Munakata, Y. (2005). Processes of change in brain and cognitive development. Trends in Cognitive Sciences 9, 152–158
Karlsson, N., Di Bernardo, E., Ostrowski, J., Goncalves, L., Pirjanian, P., and Munich, M. (2005). The vSLAM Algorithm for Robust Localization and Mapping. In Proceedings of the International Conference on Robotics and Automation (ICRA)
Kavraki, L. E., Švestka, P., Latombe, J.-C., and Overmars, M. H. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12, 566–580
Kemp, C., Edsinger, A., and Torres-Jara, E. (2007). Challenges for robot manipulation in human environments [Grand Challenges of Robotics]. IEEE Robotics & Automation Magazine 14, 20–29
Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. The International Journal of Robotics Research 5, 90
Kragic, D. and Vincze, M. (2009). Vision for robotics. Foundations and Trends in Robotics 1, 1–78
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444
Leitner, J. (2014). Towards adaptive and autonomous humanoid robots: From vision to actions. Ph.D. thesis, Università della Svizzera italiana, Lugano, Switzerland
Leitner, J. (2015). A bottom-up integration of vision and actions to create cognitive humanoids. In Cognitive Robotics (CRC Press), chap. 10. 191–214
Leitner, J., Chandrashekhariah, P., Harding, S., Frank, M., Spina, G., Förster, A., et al. (2012a). Autonomous learning of robust visual object detection and identification on a humanoid. In Proceedings of the International Conference on Development and Learning and Epigenetic Robotics (ICDL)
Leitner, J., Förster, A., and Schmidhuber, J. (2014a). Improving robot vision models for object detection through interaction. In International Joint Conference on Neural Networks (IJCNN)
Leitner, J., Frank, M., Förster, A., and Schmidhuber, J. (2014b). Reactive reaching and grasping on a humanoid: Towards closing the action-perception loop on the iCub. In Proceedings of the International Conference on Informatics in Control, Automation and Robotics (ICINCO). 102–109
Leitner, J., Harding, S., Chandrashekhariah, P., Frank, M., Förster, A., Triesch, J., et al. (2013a). Learning visual object detection and localization using icVision. Biologically Inspired Cognitive Architectures 5, 29–41
Leitner, J., Harding, S., Frank, M., Förster, A., and Schmidhuber, J. (2012b). Learning spatial object localization from vision on a humanoid robot. International Journal of Advanced Robotic Systems 9. doi:10.5772/54657
Leitner, J., Harding, S., Frank, M., Förster, A., and Schmidhuber, J. (2012c). Transferring spatial perception between robots operating in a shared workspace. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS)
Leitner, J., Harding, S., Frank, M., Förster, A., and Schmidhuber, J. (2013b). ALife in humanoids: Developing a framework to employ artificial life techniques for high-level perception and cognition tasks on humanoid robots. In Workshop on 'Artificial Life Based Models of Higher Cognition' at the European Conference on Artificial Life (ECAL)
Leitner, J., Harding, S., Frank, M., Förster, A., and Schmidhuber, J. (2013c). An Integrated, Modular Framework for Computer Vision and Cognitive Robotics Research (icVision). In Biologically Inspired Cognitive Architectures 2012, eds. A. Chella, R. Pirrone, R. Sorbello, and K. Jóhannsdóttir (Springer Berlin Heidelberg), vol. 196 of Advances in Intelligent Systems and Computing. 205–210
Leitner, J., Harding, S., Frank, M., Förster, A., and Schmidhuber, J. (2013d). Artificial neural networks for spatial perception: Towards visual object localisation in humanoid robots. In Proceedings of the International Joint Conference on Neural Networks (IJCNN) (IEEE), 1–7
Leitner, J., Harding, S., Frank, M., Förster, A., and Schmidhuber, J. (2013e). Humanoid learns to detect its own hands. In Proceedings of the IEEE Conference on Evolutionary Computation (CEC). 1411–1418
Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2015). End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702
Lowe, D. (1999). Object Recognition from Local Scale-Invariant Features. In Proceedings of the International Conference on Computer Vision (ICCV)
Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., and Abbeel, P. (2010). Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In Proceedings of the International Conference on Robotics and Automation (ICRA). 2308–2315
Marchand, E., Spindler, F., and Chaumette, F. (2005). ViSP for visual servoing: A generic software platform with a wide class of robot control skills. IEEE Robotics & Automation Magazine 12, 40–52
McCarty, M., Clifton, R., Ashmead, D., Lee, P., and Goubet, N. (2001). How infants use vision for grasping objects. Child Development 72, 973–987
Meltzoff, A. (1988). Infant imitation after a 1-week delay: Long-term memory for novel acts and multiple stimuli. Developmental Psychology 24, 470
Metta, G., Fitzpatrick, P., and Natale, L. (2006). YARP: Yet Another Robot Platform. International Journal of Advanced Robotics Systems, Special Issue on Software Development and Integration in Robotics 3
Miller, J. (1999). An empirical study of the efficiency of learning boolean functions using a cartesian genetic programming approach. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO). 1135–1142
Miller, J. F. (ed.) (2011). Cartesian Genetic Programming. Natural Computing Series (Springer). doi:10.1007/978-3-642-17310-3
Oztop, E., Bradley, N., and Arbib, M. (2004). Infant grasp learning: a computational model. Experimental Brain Research 158, 480–503
Pathak, S., Pulina, L., Metta, G., and Tacchella, A. (2013). Ensuring safety of policies learned by reinforcement: Reaching objects in the presence of obstacles with the iCub. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS)
Pattacini, U. (2011). Modular Cartesian Controllers for Humanoid Robots: Design and Implementation on the iCub. Ph.D. thesis, Italian Institute of Technology, Genova
Plumert, J. and Spencer, J. (2007). The Emerging Spatial Mind (Oxford University Press)
Posner, M. (1989). Foundations of Cognitive Science (The MIT Press)
Quigley, M., Gerkey, B., Conley, K., Faust, J., Foote, T., Leibs, J., et al. (2009). ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software
Sadikov, A., Možina, M., Guid, M., Krivec, J., and Bratko, I. (2007). Automated chess tutor. In Computers and Games, eds. H. Herik, P. Ciancarini, and H. Donkers (Springer Berlin Heidelberg), vol. 4630 of Lecture Notes in Computer Science. 13–25
Saxena, A., Driemeyer, J., and Ng, A. (2008). Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27, 157
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks 61, 85–117
Schoner, G. and Dose, M. (1992). A dynamical systems approach to task-level system integration used to plan and control autonomous vehicle motion. Robotics and Autonomous Systems 10, 253–267
Soetens, P. (2006). A Software Framework for Real-Time and Distributed Robot and Machine Control. Ph.D. thesis, Department of Mechanical Engineering, Katholieke Universiteit Leuven, Belgium. http://www.mech.kuleuven.be/dept/resources/docs/soetens.pdf
Stollenga, M., Pape, L., Frank, M., Leitner, J., Förster, A., and Schmidhuber, J. (2013). Task-relevant roadmaps: A framework for humanoid motion planning. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS)
Stückler, J., Badami, I., Droeschel, D., Gräve, K., Holz, D., McElhone, M., et al. (2013). NimbRo@Home: Winning team of the RoboCup@Home competition 2012. In Robot Soccer World Cup XVI (Springer). 94–105
Vahrenkamp, N., Wächter, M., Kröhnert, M., Welke, K., and Asfour, T. (2015). The robot software framework ArmarX. it – Information Technology 57, 99–111
Vahrenkamp, N., Wieland, S., Azad, P., Gonzalez, D., Asfour, T., and Dillmann, R. (2008). Visual servoing for humanoid grasping and manipulation tasks. In Proceedings of the International Conference on Humanoid Robots. 406–412
van den Bergen, G. (2004). Collision Detection in Interactive 3D Environments
Verschae, R. and Ruiz-del-Solar, J. (2015). Object detection: Current and future directions. Frontiers in Robotics and AI 2. doi:10.3389/frobt.2015.00029
Welke, K., Issac, J., Schiebener, D., Asfour, T., and Dillmann, R. (2010). Autonomous acquisition of visual multi-view object representations for object recognition on a humanoid robot. In IEEE International Conference on Robotics and Automation (ICRA) (IEEE), 2012–2019
Zhang, F., Leitner, J., Milford, M., Upcroft, B., and Corke, P. (2015). Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791