Online Learning of Visuo-Motor Coordination
in a Humanoid Robot.
A Biologically Inspired Model.
Guido Schillaci, Verena V. Hafner
Cognitive Robotics Group
Humboldt-Universität zu Berlin, Germany
Cognitive Robotics Group
Universidad Autónoma del Estado de Morelos, Mexico
Abstract—Coordinating vision with movements of the body is a
fundamental prerequisite for the development of complex motor
and cognitive skills. Visuo-motor coordination seems to rely on
processes that map spatial vision onto patterns of muscular contraction.
In this paper, we investigate the formation and the coupling of
sensory maps in the humanoid robot Aldebaran Nao. We propose
a biologically inspired model for coding internal representations
of sensorimotor experience that can be fed with data coming
from different motor and sensory modalities, such as visual,
auditory and tactile. The model is inspired by the self-organising
properties of areas in the human brain, whose topologies are
structured by the information produced through the interaction
of the individual with the external world. In particular, Dynamic
Self-Organising Maps (DSOMs) proposed by Rougier et al. 
have been adopted together with a Hebbian paradigm for online
and continuous learning on both static and dynamic data distributions.
Results show how the humanoid robot improves the quality
of its visuo-motor coordination over time, starting from an
initial conﬁguration where no knowledge about how to visually
follow its arm movements is present. Moreover, plasticity of
the proposed model is tested. At a certain point during the
developmental timeline, damage to the system is simulated by
adding a perturbation to the motor command used for training
the model. Consequently, the performance of the visuo-motor
coordination is affected by an initial degradation, followed by
a new improvement as the proposed model adapts to the new
data distribution.
I. INTRODUCTION
Coordinating vision with movements of the body is a fun-
damental prerequisite for the development of complex motor
and cognitive skills. In early developmental stages, infants
progressively bootstrap their attention capabilities towards a
growing number of salient events in their environment, such
as moving objects, their own body, external objects and
other individuals . Developmental studies showed an early
coupling between visual and motor systems in infants 
and suggested a correlation between hand-eye coordination,
learning capabilities and social skills .
Control of movement is a capability that has been observed
to be acquired through exploration behaviours already during
prenatal stages. Zoia and colleagues showed that there
is no evidence of coordinated kinematic patterns in hand-
to-mouth and hand-to-eye movements in foetuses up to the
gestational age of 18 weeks. However, around the 22nd week
of gestation, foetuses perform movements that show kinematic
patterns with acceleration and deceleration phases apparently
planned according to the size and to the delicacy of the target
(facial parts, such as mouth or eyes) .
Work related to the development of visuo-motor coordina-
tion can be found also in the developmental robotics literature.
Metta  implemented an adaptive control system inspired
by biological development of visuo-motor coordination for
the acquisition of orienting and reaching behaviours on a
humanoid robot. Following a developmental paradigm, the
system starts with moving the eyes only. At this point, control
is a mixture of random and goal-directed movements. The
development proceeds with the acquisition of closed loop
gains, reﬂex-like modules controlling the arm sub-system,
acquisition of an eye-head coordination and of a head-arm coordination.
Saegusa et al.  studied self-body perception in a hu-
manoid robot based on the coherence of visual and pro-
prioceptive sensory feedback. A robot was programmed to
generate random arm movements and to store image cues
in a visuomotor base together with joint angles information.
Correlations between visual and physical movements were
used to predict the location of the robot’s body in the visual
input, and to recognise it.
In recent publications  , we showed how a humanoid
robot acquires hand-eye coordination and reaching skills by
exploring its movement capabilities through body babbling
and by using a biologically inspired model consisting of Self-
Organising Maps (SOMs ). Such a behaviour led to the
development of pointing gestures. The model architecture is
inspired by the Epigenetic Robotics Architecture , where
a structured association of multiple SOMs has been adopted
for mapping different sensorimotor modalities in a humanoid
robot. We also showed how a robot can deal with tool-
use when equipped with self-exploration behaviours and with
the capability to execute internal simulations of sensorimotor
cycles.
Fig. 1. A screenshot of the robot Aldebaran Nao babbling its left arm
in the simulated environment Cyberbotics Webots. The bottom left window
shows the visual input grabbed from the bottom camera of the robot. A
fiducial marker (ARToolkit, www.hitl.washington.edu/artoolkit) has been used
for tagging the hand of the robot. The experiment is run in real time.
Visuo-motor coordination seems to rely on processes that
map spatial vision onto patterns of muscular contraction.
Such a mapping would be acquired over time through the
physical interaction of the infant with its surroundings, with
a gradual formation of internal representations already during
the early stages of development. Rochat  demonstrated that
infants start to show, around the age of 3 months, systematic
visual and proprioceptive self-exploration. Rochat and Morgan
suggested that infants, by the age of 12 months, already
possess a sense of a calibrated intermodal (that is, occurring
from multiple sensory modalities) space of their body, or a
body schema, that is a perceptually organised entity which
they can monitor and control .
Body schemas are thought to rely on mappings between
different motor and sensory modalities. Evidence from neuroscience
suggests the existence of topographic maps in the brain,
which can be seen as projections of sensory receptors or of
effector systems into structured areas of the brain. These maps
self-organise throughout the brain development in a way that
adjacent regions process spatially close sensory parts of the
body. Studies show the existence of such maps in the visual,
auditory, olfactory and somatosensory systems, as well as in
parts of the motor brain areas .
In this paper, we investigate the formation and the coupling
of sensory and motor maps in the humanoid robot Aldebaran
Nao, inspired by the formation of body schemas in humans.
We propose a biologically inspired model for coding internal
representations of sensorimotor experience that can be fed with
data coming from different motor and sensory modalities, such
as visual, auditory and tactile. The model is inspired by the
self-organising properties of areas in the human brain, whose
topology is structured by the sensory information produced by
the interaction of the individual with the external world.
Already in 1990, Martinetz et al.  proposed an ex-
tension of Kohonen’s self-organizing mapping for learning
visuo-motor coordination in a simulated robot arm with ﬁxed
cameras. The authors used a network with three-dimensional
topology matched to the work space of the robot arm. The
system extracted the position of an object to reach from
the visual input and fed the 3D-lattice of nodes with its
coordinates. An output vector representing the arm posture was
associated to each node of the map. A training session has been
run for mapping sequences of input-output relations, to learn
the required transformations for visuo-motor coordination of a
robot arm . However, as Arras and colleagues  pointed
out, the approach proposed by Martinetz and colleagues
was based on a time-dependent learning rate. While the model
worked well for the initial learning, it then kept the learning
rate at a constant level, which was insufﬁcient for allowing
the network to adapt to changes in the robot’s environment.
Arras et al. therefore extended the algorithm by coupling the
learning rate to the arm positioning error estimated from
the continuous camera feedback, thus allowing for adaptation
to drastic changes in the robot’s work environment.
However, both approaches addressed learning of visuo-motor
coordination of a robot arm with fixed cameras, using
a model consisting of a three-dimensional map whose nodes
contain both visual input and motor output information.
In this paper, Dynamic Self-Organising Maps (DSOMs)
proposed by Rougier et al.  have been adopted as topology
preserving maps. Similarly to the algorithm presented in ,
DSOMs allow for online and continuous learning on both
static and dynamic data distributions, thus enabling a dynamic
coupling between the environment and the model. In the ex-
periment presented here, we address visuo-motor coordination
in a humanoid robot with moving arm and camera, using
two DSOMs for coding the proprioceptive information coming
from the joint encoders of the arm and of the neck of the robot.
The two DSOMs are associated through Hebbian learning
modulated by the visual input through the interaction of
the robotic agent with its surroundings.
II. DYNAMIC SELF-ORGANISING MAPS
Classical Self-Organising Map algorithms implement de-
caying adaptation parameters for tracking data distribution.
Thus, self-organisation depends heavily on time-dependent
decreasing learning rate and neighbourhood function. Once
the adaptation strength has decayed, the network is unable to
react to subsequent changes in the signal distribution .
Models such as Growing Neural Gas (GNG) have been
proposed for online and lifelong learning, that can also adapt
to dynamic distributions . GNGs have no parameters that
change over time and they allow for continuous learning,
adding units and connections, until a performance criterion
has been met . Similarly, Evolving Self-Organising Maps
(ESOMs)  implement incremental networks that create
nodes dynamically based on the distance of the winner node
to the input data.
Rougier et al.  proposed the Dynamic Self-Organising
Map (DSOM), a modiﬁed SOM algorithm where the learning
rule and the neighbourhood function do not depend on time.
The authors demonstrated how the model dynamically adapts
to changing environments, or data distributions, as well as
stabilises over stationary distributions. They also reported
DSOM to perform better than classical SOM and Neural Gas
in a simulated scenario .
DSOM is a structured neural map composed of nodes with
fixed positions $p_i$ in $\mathbb{R}^q$ in the lattice, where $q$ is the dimension
of the lattice (in our experiment $q = 2$). Each node $i$ has a
weight $w_i$ that is updated according to the input data pattern
$v$ through a learning function and a neighbourhood function.
For each input pattern $v$, a winner $s$ is determined as the
closest node in the DSOM to $v$ using the Euclidean distance.
As described by Rougier et al. , all codes $w_i$ are thus shifted
towards $v$ according to the following rule:
$$\Delta w_i = \epsilon \, \| v - w_i \| \, h_\eta(i, s, v) \, (v - w_i) \qquad (1)$$
where $\epsilon$ is a constant learning rate and $h_\eta(i, s, v)$ is a
neighbourhood function of the form:
$$h_\eta(i, s, v) = e^{-\frac{1}{\eta^2} \frac{\| p_i - p_s \|^2}{\| v - w_s \|^2}} \qquad (2)$$
where $\eta$ is the elasticity or plasticity parameter, $p_i$ is the
position of the node $i$ in the lattice, $p_s$ is the position of the
winner node in the lattice. If $v = w_s$, then $h_\eta(i, s, v) = 0$.
The rationale behind such equations is that if a node is close
enough to the data, there is no need for other nodes to learn
anything, since the winner can represent the data. If there is
no node close enough to the data, any node learns the data
according to its own distance to the data .
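As an illustration of equations (1) and (2), a minimal Python sketch of a single DSOM update step is given below. It is not the implementation used in the experiment: the function name `dsom_update`, the default values of `epsilon` and `eta`, and the list-based data layout are assumptions made for the example, and the input data are assumed to be scaled so that distances stay below 1.

```python
import math


def dsom_update(weights, positions, v, epsilon=0.1, eta=1.0):
    """Apply one DSOM step (equations (1) and (2)) in place; return the winner index.

    weights   -- list of weight vectors w_i (one per node, in input space)
    positions -- list of fixed lattice positions p_i (one per node)
    v         -- input data pattern
    epsilon, eta -- learning rate and elasticity (illustrative defaults)
    """
    # Distance of every node's weight vector to the input pattern.
    dists = [math.dist(w, v) for w in weights]
    s = min(range(len(weights)), key=dists.__getitem__)  # winner node
    d_win = dists[s]
    if d_win == 0.0:
        # v equals the winner's weight: the neighbourhood function is 0,
        # so no node is moved.
        return s
    for i, w in enumerate(weights):
        lattice_sq = math.dist(positions[i], positions[s]) ** 2
        h = math.exp(-lattice_sq / (eta ** 2 * d_win ** 2))  # equation (2)
        step = epsilon * dists[i] * h                         # equation (1)
        weights[i] = [wk + step * (vk - wk) for wk, vk in zip(w, v)]
    return s
```

Note that neither `epsilon` nor the neighbourhood width depends on time: when the winner is already close to the data, `d_win` is small, the exponent becomes strongly negative for all other nodes, and only the winner moves appreciably.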
However, the DSOM algorithm is not parameter free: the
elasticity parameter modulates the strength of the coupling
between nodes. If elasticity is too high, nodes cannot span
the whole space and the DSOM algorithm does not converge.
If elasticity is too low, coupling between nodes is weak and
may prevent self-organisation from occurring. The effect of the
elasticity, as reported by the authors, not only depends on the
size of the network and the size of the support but also on the
initial conditions. As mentioned by Rougier and colleagues,
in order to reduce the dependency on the elasticity, the initial
conﬁguration of the network should cover as much as possible
the entire support .
Nonetheless, DSOMs allow for dynamic neighbourhood and
lead to a qualitatively different self-organisation that can be
controlled using the elasticity parameter. DSOMs map the
structure or support of the distribution rather than its density,
which is what many other Vector Quantisation algorithms map.
III. LEARNING VISUO-MOTOR COORDINATION
We implemented a biologically inspired model for learning
visuo-motor coordination in the Nao robot. The model consists
of two bi-dimensional DSOMs encoding the arm postures and
the head postures of the robot, respectively. Arm postures
consist of 4-dimensional vectors containing the angle positions
of the following joints of the robot: shoulder pitch, shoulder
roll, elbow yaw, elbow roll. Head postures consist of 2-
dimensional vectors containing the angle positions of the neck
joints of the robot: head yaw, head pitch.
The two DSOMs are associated through Hebbian links.
In particular, each node of the ﬁrst DSOM is connected to
each node of the second DSOM, where the connection is
characterised by a weight. The weight is updated according
to a positive Hebbian rule that simulates synaptic plasticity
of the brain: the connection between a pre-synaptic neuron (a
node in the ﬁrst DSOM) and a post-synaptic neuron (a node
in the second DSOM) increases if the two neurons activate
simultaneously. Thus, the model consists of two DSOMs and
a Hebbian table containing the weights of the links connecting
the two DSOMs. The size of the table is equal to the number
of nodes of the ﬁrst DSOM multiplied by the number of nodes
of the second DSOM.
Learning consists of two parallel processes. The robot
executes random body babbling of its arm, that is, every 1.5
seconds, it sends its arm a motor command towards a
joint configuration sampled from a uniform random
distribution within its arm joint ranges (footnote 1). The first learning
process consists in updating the two DSOMs during the
execution of the random arm movement. The process of
moving the arm is decoupled from the process of updating
the DSOMs. Instant by instant, at a frequency of 15 Hz,
the current positions of the arm joints are used as the input
data vector for the learning rule of the arm DSOM (equations
(1) and (2)). The head DSOM is updated in the same way, using
equations (1) and (2) with the current angle positions of the neck
joints as input data vector, at the same frequency of 15 Hz.
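The babbling step can be sketched as follows. This is a hedged illustration only: the dictionary keys and the function name `sample_arm_target` are hypothetical, the joint ranges are the ones listed in footnote 1, and expressing the target in radians is an assumption.

```python
import math
import random

# Left-arm joint ranges in degrees, as listed in footnote 1.
ARM_JOINT_RANGES_DEG = {
    "shoulder_pitch": (-119.5, 119.5),
    "shoulder_roll": (-18.0, 76.0),
    "elbow_yaw": (-119.5, 119.5),
    "elbow_roll": (-88.5, 2.0),
}

BABBLE_PERIOD_S = 1.5   # a new random arm target every 1.5 seconds
UPDATE_RATE_HZ = 15.0   # DSOM updates run at 15 Hz, decoupled from the motion


def sample_arm_target(rng=random):
    """Draw one random arm posture (radians), uniform within each joint range."""
    return [math.radians(rng.uniform(lo, hi))
            for lo, hi in ARM_JOINT_RANGES_DEG.values()]


# DSOM updates that fall within one babbling period (1.5 s * 15 Hz = 22.5).
UPDATES_PER_COMMAND = BABBLE_PERIOD_S * UPDATE_RATE_HZ
```

Because the maps are updated with the joint positions read while the arm is moving, each motor command contributes a short trajectory of training samples rather than a single posture.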
Head movements are also generated every 1.5 seconds,
at the same time as the generation of arm movements. In
particular, a motor command is sent to the neck joints as follows:
- search for the winner node of the arm DSOM (the closest
node in the arm DSOM to the input vector represented
by the current arm joints conﬁguration);
- select the winner node in the head DSOM as the one that
has the highest connection weight to the winner node in
the arm DSOM. If there is more than one winner node
(that is, multiple connections with the same weight), then
choose a random one from the group of winners;
- send a motor command to the joints of the neck equal to
the weight vector of the winning node in the head DSOM.
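The three steps above can be sketched in Python as follows. The function names and the representation of the Hebbian table as a list of rows (one row per arm node, one column per head node) are assumptions made for illustration.

```python
import math
import random


def nearest_node(weights, v):
    """Index of the node whose weight vector is closest to v (Euclidean distance)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], v))


def select_head_command(arm_winner, hebbian, head_weights, rng=random):
    """Return (head node index, neck command) for the given arm winner node."""
    row = hebbian[arm_winner]          # link weights from the arm winner to every head node
    best = max(row)
    candidates = [j for j, w in enumerate(row) if w == best]  # all nodes tied at the maximum
    j = rng.choice(candidates)         # random pick among tied winners
    return j, list(head_weights[j])    # the node's weight vector is the [yaw, pitch] command
```

At the start of learning all Hebbian weights are zero, so every head node is tied and the head command is effectively random; the choice becomes selective only as links are strengthened.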
A second learning process based on a Hebbian learn-
ing paradigm is run in parallel to the ﬁrst learning pro-
cess. The Hebbian learning paradigm describes an associative
connection between activities of two connected nodes .
Here, when the end-effector of the robot is visible from
the visual input, the connection between the winner nodes
of the two DSOMs is strengthened. The hand of the robot
has been tagged with a fiducial marker and its position in
image coordinates has been estimated using the ARToolkit
(www.hitl.washington.edu/artoolkit). The bottom camera of
the Nao robot has been used for grabbing the visual input.
Footnote 1. In the experiment presented here, only the four joints of the left arm of the
robot are used: shoulder pitch, whose joint angle position can range between
−119.5 degrees and 119.5 degrees; shoulder roll (range: −18 degrees to 76
degrees); elbow yaw (range: −119.5 degrees to 119.5 degrees); elbow roll
(range: −88.5 degrees to 2 degrees).
Fig. 2. Illustration of the proposed model. On the left side, the 2-dimensional lattices of the two DSOMs (arm and head) are shown. The DSOMs can
also be represented in the input space, where nodes are positioned according to their weights (right side). Lines connecting the two DSOMs represent Hebbian
links with weights w ≠ 0. Thicker lines correspond to stronger Hebbian links.
Thus, visuo-motor coordination can be considered as suc-
cessful if the marker tagging the end-effector of the robot is
visible from the visual input. In this case, the Hebbian learning
process updates the Hebbian table connecting the two DSOMs
as follows. If a marker is visible:
- select the pre-synaptic neuron (winner node) as the closest
node $i$ in the arm DSOM to the current arm joint
configuration $x$;
- select the post-synaptic neuron (winner node) as the
closest node $j$ in the head DSOM to the current neck
joint configuration $y$;
- strengthen the connection $w_{ij}$ between the pre- and
post-synaptic neurons according to the modified positive
Hebbian rule:
$$\Delta w_{ij} = \lambda \, f_c \, A_i(x) \, A_j(y) \qquad (3)$$
where $A_i(x)$ is the activation function of the neuron $i$ over
the Euclidean distance between the neural weights and the data
pattern $x$, $\lambda$ is a scaling factor for slowing down the growth
of the weights (in this experiment it is initialised to
0.01), and $f_c$ is a multiplying factor related to the distance
from the perceived position of the hand (marker) in image
coordinates to the center of the image grabbed from the robot
camera (image size: 320 × 240). $f_c$ ranges from 1 (hand at the
center of the image) to 0 (hand at the corner of the image)
and it is used to make the system choose head positions that
result in the hand being close to the center of the image.
As in Kajic et al. , the activation function of a neuron,
$A(d)$, is computed as:
$$A(d) = \frac{1}{1 + \tanh(d)} \qquad (4)$$
where $d$ is the Euclidean distance between the position
of the node and the input pattern. All weights between the
two DSOMs are set initially to zero allowing for an activity-
dependent role of structural growth in neural networks .
Figure 2 shows an illustration of the proposed model, which
consists of two DSOMs connected by Hebbian links.
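A minimal sketch of the Hebbian update, run whenever the marker is visible, is shown below. It assumes that the weight increment is the product of the scaling factor, the centering factor and the two activations, which is how the quantities defined above are read here; the function names and the list-of-rows table layout are illustrative, and `f_c` is taken as an already computed input.

```python
import math


def activation(d):
    """Equation (4): equals 1 at zero distance and decreases towards 0.5 as d grows."""
    return 1.0 / (1.0 + math.tanh(d))


def hebbian_update(table, i, j, arm_w, head_w, x, y, f_c, lam=0.01):
    """Strengthen, in place, the link between arm node i and head node j.

    table  -- Hebbian table, table[i][j] is the link weight (initially all zeros)
    x, y   -- current arm and neck joint configurations
    f_c    -- centering factor in [0, 1] (1: hand at image center, 0: at a corner)
    lam    -- scaling factor slowing down weight growth (0.01 in the experiment)
    """
    a_i = activation(math.dist(arm_w[i], x))    # pre-synaptic activation
    a_j = activation(math.dist(head_w[j], y))   # post-synaptic activation
    table[i][j] += lam * f_c * a_i * a_j        # assumed product form of the rule
    return table[i][j]
```

Since the rule is purely positive, link weights never decrease; a detection at the image corner (`f_c = 0`) leaves the table unchanged.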
IV. RESULTS
A preliminary experiment was run on the Cyberbotics
Webots robot simulator, where no noise was modelled in
the joint encoders. Future work will include reproducing the
experiment on a real robot.
As described in the previous section, the arm DSOM and the
head DSOM were trained with data generated through motor
babbling. Each DSOM consisted of 30 ×30 nodes. A weight
vector of four dimensions was associated with each node of
the arm DSOM, representing the positions of the following
joints: shoulder pitch, shoulder roll, elbow yaw and elbow roll.
Similarly, each node of the head DSOM was associated with
a weight vector of two dimensions, representing the following
joint positions: head yaw and head pitch. The weights of the nodes
of both DSOMs were randomly initialised within the
ranges of the corresponding joints, to reduce the effect of
elasticity dependency. As pointed out by Rougier et al. ,
the initial configuration of the DSOM network should cover
the entire support as much as possible to reduce elasticity
dependency.
The experiment was run for approximately 3 hours and 18 minutes
(197.58 minutes). It consisted in the robot generating random
arm movements and moving its head according to its current
visuo-motor coordination skills. Learning was performed
online, in parallel to the execution of the movements. It
consisted in updating the DSOM-based model with training
data represented by the current positions of the joints of the
arm and those of the head. Instant by instant, the current
arm joint conﬁguration was used as input pattern for the arm
DSOM update rule, as described by equation (1). Similarly,
the current head joints conﬁguration was used as input pattern
for the update rule of the head DSOM. Frequency of the
updates matched the 15 Hz frame rate of the visual input.
Therefore, during 197.58 minutes, the DSOMs were updated
using 177,823 input training patterns. In parallel to the DSOM
updates, the Hebbian table connecting the two DSOMs was
updated with the positive Hebbian rule described by equation
3, only when the hand of the robot was visible in the visual
input. During the 197.58 minutes, the hand of the robot was
detected 91,658 times. The Hebbian table was updated at each detection.
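As a quick consistency check on these figures (a back-of-the-envelope calculation, assuming one DSOM update and at most one hand detection per 15 Hz frame):

```latex
\begin{align*}
197.58\ \text{min} \times 60\ \tfrac{\text{s}}{\text{min}} \times 15\ \text{Hz}
  &= 177{,}822 \approx 177{,}823\ \text{input patterns},\\
\frac{91{,}658}{177{,}823} &\approx 0.515 .
\end{align*}
```

Under this assumption, the hand was visible in roughly half of the frames over the whole first learning session.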
As a measure for the quality of visuo-motor coordination,
we considered the number of times the hand of the robot was
detected from the visual input during a time window of 5
minutes. This measurement was repeated every 5 minutes for
the entire duration of the learning session (197.58 minutes).
A linear regression computed on the collected measurements
showed a positive trend (slope 12.147, intercept 2175.176),
suggesting that the quality of visuo-motor coordination, in
terms of the number of times the robot detected its hand,
improved over time.
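The measure and the trend estimate can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the fit is an ordinary least-squares line, and the unit of the regression abscissa (window index or minutes) is not specified in the text. At 15 Hz, a 5-minute window corresponds to 4,500 frames.

```python
def window_counts(detected, window):
    """Number of detections in each consecutive, non-overlapping window of frames.

    detected -- sequence of 0/1 flags, one per frame (1: hand detected)
    window   -- window length in frames (4500 for 5 minutes at 15 Hz)
    A trailing incomplete window is dropped.
    """
    return [sum(detected[k:k + window])
            for k in range(0, len(detected) - window + 1, window)]


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A positive slope of the line fitted to the per-window counts then indicates that the hand is detected more often as learning proceeds.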
In addition, the capability of the proposed model to adapt to
unexpected changes of the input data distributions was tested.
After the first learning session, damage to the system was
simulated by adding a perturbation to the motor command
used for training the model. In particular, arm movements
were randomly generated as in the ﬁrst learning session but
the vector representing the current arm joint conﬁguration
was affected by a perturbation. The perturbation consisted
in translating the vector of the arm motor command by an offset. The
perturbation was drawn at random once and then
kept constant. In this experiment, the following perturbation
was added to the arm joints: 0.1265 radians to the shoulder
pitch joint, 1.1411 radians to the shoulder roll joint, 1.2295
radians to the elbow yaw joint and -0.2242 radians to the elbow
roll joint.
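For concreteness, the perturbation amounts to adding a fixed offset vector to every arm joint vector, as in the following sketch (the names `PERTURBATION` and `perturb` are hypothetical; the offsets are the values reported above):

```python
# Constant offsets in radians: shoulder pitch, shoulder roll, elbow yaw, elbow roll.
PERTURBATION = (0.1265, 1.1411, 1.2295, -0.2242)


def perturb(arm_posture, offset=PERTURBATION):
    """Translate a 4-dimensional arm joint vector by a fixed offset."""
    if len(arm_posture) != len(offset):
        raise ValueError("posture and offset must have the same length")
    return [q + dq for q, dq in zip(arm_posture, offset)]
```

Since the same offset is applied to every sample, the data distribution seen by the arm DSOM is shifted as a whole while its shape is preserved.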
Learning then continued in the perturbation regime
for 106.72 minutes. During this second learning session,
96,049 new input patterns containing the perturbation were
used for the online update of the models. As in the previous
analysis, the hand-detection rate was measured over
a 5-minute window. A linear regression computed on the
measurements collected during the first 35 minutes of learning
affected by the perturbation showed a negative trend (slope: -80.286,
intercept: 2685.857). In other words, the performance of the
visuo-motor coordination degraded.
However, a new improvement in the visuo-motor coordination
was reported during the following 71.72 minutes, as conﬁrmed
by the positive trend of the linear regression (slope: 1.2154,
intercept: 2255.264) computed over the measurements of the
third learning phase. This suggests that the proposed model
was able to partially recover from the unexpected change in
the data distributions. However, we did not analyse how fast
the model can fully recover to the original performance. This
will be addressed in future experiments, including evaluating
the results of several runs to assess the reliability of the learning
and recovery processes in response to different perturbations.
Figure 3 shows the trends of the quality of visuo-motor
coordination. The three blue segments show the linear regres-
sions of: the initial learning phase (without perturbation), the
ﬁrst degradation phase under the perturbation regime and the
final phase, under the perturbation regime, characterised by a
re-adaptation to the new data distribution.
We performed a similar analysis on the trend of the distance
between the detected position (in image coordinates) of the
robot’s hand and the center of the image grabbed from the
robot camera. As described in the Hebbian rule in equation 3,
the connections between pre- and post-synaptic neurons were
strengthened also according to a multiplying factor related to
such a distance. The multiplying factor, which ranged between
1 (hand at the center of the image) and 0 (hand at the corner
of the image), was used for making the system choose head
positions that resulted in the hand being close to the center of
the image. In other words, such a distance can be interpreted
as a measure for the accuracy of the head movements, where
a good movement results in the hand being visible at the center of
the image.
We expected to observe an improvement of the accuracy of
the robot’s head movements while learning, followed by an
initial degradation of the performance under the perturbation
regime. A linear regression was computed on the measurement
$1 - \mathrm{distance}(\mathrm{hand}, \mathrm{center\ of\ image}) / \mathrm{max\ distance}$,
averaged over a 5-minute window as in the previous analysis.
The results of a linear regression computed on the measure-
ments of the ﬁrst learning phase (197.58 minutes) showed
a slightly positive trend (slope: 0.0032, intercept: 0.60687).
A linear regression computed on the following measurements
collected during the ﬁrst 35 minutes of learning under the
perturbation regime showed, as expected, a negative trend
(slope: -0.0020, intercept: 0.77210), suggesting a degradation
of the performance. However, unlike in the previous
analysis, there was very little improvement in the accuracy
of the head movements during the third learning phase under
the perturbation regime (the last 71.72 minutes of learning).
Although higher than in the second learning phase, the slope
of the linear regression was still negative (slope: -0.0009,
intercept: 0.73715). This issue will be addressed in future work.
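As a worked example of the accuracy measure (assuming that the maximum distance is the distance from the image center to a corner, consistent with the factor being 0 at the corner): for the 320 × 240 image the center lies at pixel (160, 120), and a hand detected at the hypothetical pixel (240, 180) gives

```latex
\begin{align*}
d_{\max} &= \sqrt{160^2 + 120^2} = 200\ \text{px},\\
d\big((240,180),(160,120)\big) &= \sqrt{80^2 + 60^2} = 100\ \text{px},\\
1 - \frac{d}{d_{\max}} &= 1 - \frac{100}{200} = 0.5 .
\end{align*}
```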
Figures 4 and 5 show the trends of the distortion measurement
of the arm DSOM and of the head DSOM. Distortion
is a popular criterion for assessing the quality of a Kohonen
map . It is computed as follows. For each input pattern:
- Update the DSOM using the input pattern;
- Compute the quantization error, as the distance between
the input pattern and the winner node (the closest DSOM
node to the input)
Distortion is computed as the average of the quantization
errors, that is, the sum of the calculated distances, divided
by the number of input patterns. Since we were dealing with
online learning mechanisms, only a partial set of the processed
input data was used for computing the distortion. At each
instant, only the previous 1,800 observations (corresponding to
two minutes of exploration) were used for computing the error.
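A moving-window distortion of this kind can be sketched as follows. The class name `MovingDistortion` is hypothetical, and the sketch only records the quantization error; the DSOM update for the same input pattern is assumed to be performed by the caller beforehand.

```python
import math
from collections import deque


class MovingDistortion:
    """Mean quantization error over the most recent `window` input patterns."""

    def __init__(self, window=1800):
        # A bounded deque drops the oldest error automatically once full.
        self.errors = deque(maxlen=window)

    def add(self, weights, v):
        """Record the quantization error of v and return the current distortion."""
        # Quantization error: distance from v to the closest node (the winner).
        err = min(math.dist(w, v) for w in weights)
        self.errors.append(err)
        return sum(self.errors) / len(self.errors)
```

With the default window of 1,800 samples and updates at 15 Hz, the returned value summarises the last two minutes of exploration.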
Fig. 3. The quality of visuo-motor coordination was measured as the number
of times the hand of the robot was detected from the visual input during a
time window of 5 minutes. This measurement (red line in the ﬁgure) was
repeated every 5 minutes for the entire duration of the learning session. Blue
lines show linear regressions. First learning phase: slope 12.147, intercept
2175.176; second learning phase under the perturbation regime (the ﬁrst
vertical green line marks the instant when the perturbation is added): slope
-80.286, intercept 2685.857; third learning phase under perturbation regime
(re-adaptation to the new data distribution): slope 1.2154, intercept 2255.264.
The instant represented by the second green line, corresponding to the end of
the degradation phase, has been arbitrarily chosen.
Fig. 4. Distortion measurement of the arm DSOM. The green vertical line
marks the time instant (39.51, or 197.58 minutes) when the perturbation is
added to the arm commands. Errors are computed over a moving window
of 1,800 input samples (2 minutes, considering the update frequency of 15
frames per second).
Figure 4 shows a decreasing distortion for the arm DSOM
during the ﬁrst 5 minutes of learning, followed by a quasi-
stationary error until the moment when the perturbation was
added to the arm command (around 197 minutes). At that point, an
increase of the distortion was reported, corresponding to
the change in the distribution from which the data is sampled.
Once the DSOM adapted to the new distribution, the distortion
error started to decrease and to stabilise.
The head DSOM was not affected by the perturbation, in the
current experiment. In fact, as shown in Figure 5, no signiﬁcant
jumps in the distortion error signal have been reported.
Fig. 5. Distortion measurement of the head DSOM. Errors are computed over
a moving window of 1,800 input samples (2 minutes, considering the update
frequency of 15 frames per second). After the first 10 minutes of learning,
the distortion error stabilises between 0.01 and 0.02, since the underlying data
distribution is not changing.
V. CONCLUSION
Developmental studies in humans suggest that control and
coordination of movements are capabilities that are acquired
over time through exploration behaviours and that would
pave the ground for the acquisition of more complex skills.
Applying such a developmental paradigm to robotics could not
only provide insights into the human development of
cognitive and motor skills, but could also be used for providing
robots with adaptive behaviours and with the capability to
react to unexpected circumstances. This paper addressed the
challenge of autonomous acquisition of internal body repre-
sentations in artiﬁcial agents, a fundamental prerequisite for
making robots able to successfully interact with humans and
with their environment.
We investigated the formation and the coupling of sensory
maps in the humanoid robot Aldebaran Nao. In particular,
we proposed a biologically inspired model for online and
continuous learning of visuo-motor coordination. From an
engineering perspective, one might conclude that, because robots
have kinematic models, learning hand-eye coordination
is a redundant problem. However, defining models
of robots’ embodiment and of their surrounding world a priori
should be avoided, since it risks producing
robot behaviours that lack adaptability. Using
humanoid robots equipped with neurally plausible models
also provides a controlled environment for studying learning
mechanisms in infants. Recently, we showed how a similar
architecture based on classical Self-Organizing Maps imple-
mented on a robotic platform accounts for the development
of pointing gestures, an attentional behaviour fundamental
for social interaction and imitation learning. In particular, we
studied how a robot can develop the capability to generate
attention manipulation behaviours as failed attempts to reach
for an object . The reaching capability was acquired through
self-exploration, as in the experiment presented here. However,
the motor babbling algorithm used in  manually generated
head movements for following the hand trajectories. Once the
babbling session was over, the collected data was used for
training the hand-eye coordination model. Here, going a step
back in the developmental timeline, we addressed the
acquisition of visuo-motor coordination in an online fashion
and already at the babbling level, where no a priori knowledge
on how to follow hand movements was present in the system.
The model proposed here consists of two Dynamic Self-
Organising Maps associated through Hebbian links, which
allow for online learning of mappings between different sensorimotor
modalities. As in , the modular organization of the
model in terms of sensory maps allows for easier identification
of its components with their biological equivalents. Moreover,
results show that the model is able to adapt to dynamic data
distributions.
In particular, the aim of the experiment presented here was
to make the robot able to learn how to follow the movements of
its hand, while generating random motor commands to its arm
joints. During the random movement generation, arm and head
postures were used for updating the corresponding DSOMs in
an online fashion, while they were associated through Hebbian
learning whenever the hand of the robot was visible in the
visual input. Head movements were generated as outputs of the
proposed model. The quality of the head movements depended
on how well the DSOMs encoded the data distributions from which
the arm and neck postures were sampled, and on how
well they were associated through Hebbian learning.
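The learning loop summarised above can be illustrated with a minimal sketch. It assumes the DSOM update rule of Rougier and Boniface (constant learning rate, neighbourhood width scaled by the winner's distance to the sample) and postures normalised to [0, 1]. The Hebbian rule, the class and function names, and all parameter values are illustrative assumptions, not the implementation used on the robot.

```python
import math
import random


class DSOM:
    """Minimal Dynamic Self-Organising Map on a square grid (illustrative)."""

    def __init__(self, side, dim, epsilon=0.5, elasticity=2.0, rng=None):
        rng = rng or random.Random(0)
        self.epsilon = epsilon        # constant learning rate, no decay schedule
        self.elasticity = elasticity  # higher values couple neighbours more tightly
        # Grid coordinates scaled to [0, 1] (side must be >= 2).
        self.pos = [(r / (side - 1), c / (side - 1))
                    for r in range(side) for c in range(side)]
        # Random initial codewords; samples are assumed normalised to [0, 1].
        self.w = [[rng.random() for _ in range(dim)] for _ in self.pos]

    def winner(self, v):
        """Index of the node whose codeword is closest to sample v."""
        return min(range(len(self.w)), key=lambda i: math.dist(self.w[i], v))

    def update(self, v):
        """One online learning step; returns the winner chosen before the update."""
        s = self.winner(v)
        denom = (self.elasticity * math.dist(self.w[s], v)) ** 2
        if denom == 0.0:
            return s  # sample already matched exactly: nothing to learn
        for i, w_i in enumerate(self.w):
            grid2 = sum((a - b) ** 2 for a, b in zip(self.pos[i], self.pos[s]))
            h = math.exp(-grid2 / denom)  # neighbourhood narrows as the fit improves
            step = self.epsilon * math.dist(w_i, v) * h
            self.w[i] = [w + step * (x - w) for w, x in zip(w_i, v)]
        return s


class HebbianLinks:
    """Link strengths between nodes of two maps (assumed bounded Hebbian rule)."""

    def __init__(self, n_src, n_dst):
        self.c = [[0.0] * n_dst for _ in range(n_src)]

    def associate(self, i, j, rate=0.1):
        # Strengthen the link between co-active winners, saturating at 1.
        self.c[i][j] += rate * (1.0 - self.c[i][j])

    def recall(self, i):
        # Destination node most strongly linked to source node i.
        row = self.c[i]
        return max(range(len(row)), key=row.__getitem__)


def training_step(arm_map, head_map, links, arm, head, hand_visible):
    """Update both maps online; link their winners only when the hand is seen."""
    i = arm_map.update(arm)
    j = head_map.update(head)
    if hand_visible:
        links.associate(i, j)
    return i, j


def predict_head(arm_map, head_map, links, arm):
    """Head posture proposed by the model for a given arm posture."""
    return head_map.w[links.recall(arm_map.winner(arm))]
```

In a babbling loop, `training_step` would be called once per random arm command, with `hand_visible` supplied by a hand detector, and `predict_head` would provide the next neck command. Because the learning rate is constant, the same loop keeps adapting if the input distribution later changes.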
Using the proposed model, the humanoid robot improved
the quality of its visuo-motor coordination over time, starting
from a random configuration where no knowledge about how
to visually follow its arm movements was present. Moreover,
the capability of the proposed model to adapt to unexpected
changes was tested. At a certain point during the developmental
timeline, damage to the system was simulated
by adding a perturbation to the motor command used for
training the model, which translated the original data
distribution. Consequently, the performance of the visuo-motor
coordination initially degraded and then improved
again as the arm DSOM adapted to the new
data distribution and the Hebbian connections between the arm
DSOM and the head DSOM adapted to the new mapping.
Future work will include adding a sensory map coding the
hand position of the robot between the arm and head maps.
Extending the model to represent different motor and sensory
modalities, such as visual, auditory and tactile, will also be
ACKNOWLEDGMENT
The research leading to these results has received funding
from the European Union’s Seventh Framework Programme
(FP7/2007-2013) under grant agreement n. 609465, related
to the EARS (Embodied Audition for RobotS) project. The
authors would like to thank the members of the Cognitive
Robotics Group at the Humboldt-Universität zu Berlin for very
REFERENCES
 N. Rougier and Y. Boniface, “Dynamic self-organising map,” Neuro-
comput., vol. 74, no. 11, pp. 1840–1847, May 2011.
 F. Kaplan and V. V. Hafner, “The challenges of joint attention,” Inter-
action Studies, vol. 7, no. 2, pp. 135–169, 2006.
 S. Tükel, “Development of visual-motor coordination in children with
neurological dysfunctions,” 2013.
 C. Yu and L. B. Smith, “Joint attention without gaze following: Human
infants and their parents coordinate visual attention to objects through
eye-hand coordination,” PLoS ONE, vol. 8, no. 11, p. e79659, 11 2013.
 S. Zoia, L. Blason, G. D’Ottavio, M. Bulgheroni, E. Pezzetta,
A. Scabar, and U. Castiello, “Evidence of early development of action
planning in the human foetus: a kinematic study,” Experimental Brain
Research, vol. 176, no. 2, pp. 217–226, 2007. [Online]. Available:
 G. Metta, “Babyrobot – a study on sensori-motor development,” Ph.D.
 R. Saegusa, G. Metta, and G. Sandini, “Self-body discovery based on vi-
suomotor coherence,” in 3rd Conference on Human System Interactions
(HSI), 2010, May 2010, pp. 356–362.
 I. Kajic, G. Schillaci, S. Bodiroza, and V. V. Hafner, “A biologically
inspired model for coding sensorimotor experience leading to the devel-
opment of pointing behaviour in a humanoid robot,” in Proceedings of
the Workshop "HRI: a bridge between Robotics and Neuroscience". 9th
ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2014), 2014.
 V. V. Hafner and G. Schillaci, “From field of view to field of reach -
could pointing emerge from the development of grasping?” Frontiers
in Computational Neuroscience, Conference Abstract: IEEE ICDL-
EPIROB 2011, 2011.
 T. Kohonen, “Self-organized formation of topologically correct feature
maps,” Biological cybernetics, vol. 43, no. 1, pp. 59–69, 1982.
 A. Morse, J. de Greeff, T. Belpaeme, and A. Cangelosi, “Epigenetic
robotics architecture (ERA),” Autonomous Mental Development, IEEE
Transactions on, vol. 2, no. 4, pp. 325–339, 2010.
 G. Schillaci, V. Hafner, and B. Lara, “Coupled inverse-forward models
for action execution leading to tool-use in a humanoid robot,” in 7th
ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI), 2012, March
2012, pp. 231–232.
 G. Schillaci, “Sensorimotor learning and simulation of experience
as a basis for the development of cognition in robotics,”
Ph.D. dissertation, 2014. [Online]. Available: http://edoc.hu-
 H. Ritter, “Self-organizing maps for internal representations,” Psycho-
logical Research, vol. 52, no. 2-3, pp. 128–136, 1990.
 P. Rochat, “Self-perception and action in infancy,” Experimental Brain
Research, vol. 123, no. 1-2, pp. 102–109, 1998. [Online]. Available:
 P. Rochat and R. Morgan, “Two functional orientations of
self-exploration in infancy,” British Journal of Developmental
Psychology, vol. 16, no. 2, pp. 139–154, 1998. [Online]. Available:
 J. H. Kaas, “Topographic maps are fundamental to
sensory processing,” Brain Research Bulletin, vol. 44,
no. 2, pp. 107 – 112, 1997. [Online]. Available:
 T. Martinetz, H. Ritter, and K. Schulten, “Three-dimensional neural net
for learning visuomotor coordination of a robot arm,” IEEE Transactions
on Neural Networks, vol. 1, no. 1, pp. 131–136, Mar 1990.
 M. K. Arras, P. W. Protzel, and D. L. Palumbo, “Automatic learning
rate adjustment for self-supervising autonomous robot control,” NASA
Technical Memorandum TM-107592, NASA Langley Research Center,
Tech. Rep., 1992.
 B. Fritzke, “A self-organizing network that can follow non-stationary
distributions,” in Artificial Neural Networks - ICANN’97, ser. LNCS,
W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., 1997, vol.
1327, pp. 613–618.
 ——, “A growing neural gas network learns topologies,” in Advances
in Neural Information Processing Systems 7. MIT Press, 1995, pp.
 D. Deng and N. Kasabov, “ESOM: an algorithm to evolve self-organizing
maps from online data streams,” in Proceedings of the IEEE-INNS-ENNS
International Joint Conference on Neural Networks, 2000. IJCNN 2000,
vol. 6, 2000, pp. 3–8.