
Active inference body perception and action for humanoid robots

Guillermo Oliver‡, Pablo Lanillos‡, Gordon Cheng‡

‡Institute for Cognitive Systems, Technical University of Munich, Arcisstrasse 21, 80333 Munich, Germany

Abstract—One of the biggest challenges in robotics systems is interacting under uncertainty. Unlike robots, humans learn, adapt and perceive their body as a unity when interacting with the world. We hypothesize that the nervous system counteracts sensor and motor uncertainties by unconscious processes that robustly fuse the available information for approximating the body and the world state. Being able to unite perception and action under a common principle has been sought for decades, and active inference is one of the potential unification theories. In this work, we present a humanoid robot interacting with the world by means of a brain-inspired perception and control algorithm based on the free-energy principle. Until now, active inference had only been tested in simulated examples. Its application on a real robot shows the advantages of such an algorithm for real-world applications. The humanoid robot iCub was capable of performing robust reaching behaviors with both arms and active head object tracking in the visual field, despite the visual noise, the artificially introduced noise in the joint encoders (up to 40 degrees of deviation), the differences between the model and the real robot, and the misdetections of the hand.

Index Terms—Active inference, Free-energy optimization, Bio-inspired perception, Predictive coding, Humanoid robots, iCub.

I. INTRODUCTION

The medical doctor and physicist Hermann von Helmholtz described visual perception as an unconscious mechanism that infers the world [1]. In other words, the brain has generative models that complete or reconstruct the world from partial information. Nowadays, there is a scientific mainstream that describes the inner workings of the brain as those of a Bayesian inference machine [2], [3]. This approach holds that we are able to adjust the contribution of the cues (visual, proprioceptive, tactile, etc.) to our interpretation in a Bayesian-optimal way, taking into account sensor and motor uncertainties. This implies that the brain is able to encode uncertainty not only for perception but also for acting in the world. Optimal feedback control was proposed for modelling motor coordination [4]. Alternatively, active inference [5] defended that perception and action are two sides of the same process: an unconscious mechanism that infers and adapts to the environment. Either way, perception is inevitably connected to the body senses and actuators, the body being the entity of interaction [6], possibly learned through development [7].

From the several brain theories that have arisen in the last decades, some can be unified under the free-energy principle [5]. This principle accounts for perception, action and learning through the minimization of surprise: the discrepancy between the current state and the predicted or desired one, also known as the prediction error. According to this approach, free-energy is a way of quantifying surprise, and it can be optimized by changing the current beliefs (perception) or by acting on the environment (action) to reduce the difference between reality and prediction [8].

This work has been supported by the SELFCEPTION project (www.selfception.eu), European Union Horizon 2020 Programme, grant agreement n. 741941, and the European Union's Erasmus+ Programme.

Fig. 1: Dual arm and head active inference. The robot dynamically infers its body configuration (transparent arm) using the prediction errors. The visual prediction error is the difference between the real visual location of the hand (red) and the predicted one (green), which generates an action to reduce this discrepancy. In the presence of a perceptual attractor (blue), an error in the desired sensory state is produced, promoting an action towards the goal: the equilibrium point appears when the hand reaches the object. The head is in motion to keep the object in its visual field, improving the reaching performance.

We present robotic body perception as a flexible and dynamic process that approximates the body latent configuration using the error between the expected and the observed sensory information. In this work we provide an active inference mathematical model for a humanoid robot combining perception and action, extending [9]. This model enabled the robot to have adaptive body perception and to perform robust reaching behaviors even under high levels of sensor noise and discrepancies between the model and the real robot (Fig. 1). This is due to the way the optimization framework fuses the available information from different sources, and to the coupling between action and perception.

arXiv:1906.03022v2 [cs.RO] 8 Aug 2019

A. Related work

Multisensory perception has been widely studied in the literature and enables the robot to combine joint information with other sensors such as images and tactile cues. Bayesian estimation has been proved to achieve robust and accurate model-based robot arm tracking [10], even under occlusion [11]. Furthermore, integrated visuomotor processes enabled humanoid robots to learn object representations through manipulation without any prior knowledge about them [12], to learn motor representations for robust reaching [13], [14], and even visuotactile motor representations for reaching and avoidance behaviors [15].

Active inference (under the free-energy principle) includes action as a classical spinal reflex arc pathway triggered by perception prediction errors, and has mainly been studied in theoretical or simulated conditions. Friston presented in [8] a theoretical motor model with two degrees of freedom as an extension of the dynamic expectation maximization algorithm. It was recently studied in robot control of a simulated PR2 robot [16], a one-degree-of-freedom simulated vehicle [17] and a two-degrees-of-freedom simulated robot arm [18].

A first model of free-energy optimization in a real robot was presented in [9], working as an approximate Bayesian filter estimation, where the robot was able to perceive its arm location by fusing visual, proprioceptive and tactile information. However, the authors left out the action. In this work, we take one step further and model and apply active inference on the iCub robot for dual arm reaching with active head tracking. For reproducibility, the code is publicly available¹. While the arms' goal is to minimize the prediction error between the goal (object) and the end-effector visual location, the head's goal is to maintain the object centered in the field of view, providing wider and more accurate reaching capabilities.

B. Paper organization

First, in Sec. II we explain the general mathematical free-energy optimization for perception and action. Afterwards, Sec. III describes the iCub physical model, and in Sec. IV and V we detail the active inference computational model that allows the robot to perform robust reaching and tracking tasks. Finally, Sec. VI presents the obtained results, analyzing the advantages and limitations of the proposed algorithm.

II. FREE-ENERGY OPTIMIZATION MODEL

A. Bayesian inference

According to the Bayesian inference model for the brain, the body configuration² x is inferred using the available sensory data s by applying Bayes' theorem:

$$p(x|s) = \frac{p(s|x)\,p(x)}{p(s)} \quad (1)$$

where the posterior probability, p(x|s), corresponding to the probability of body configuration x given the observed data s, is obtained as a consequence of three antecedents: (1) the likelihood, p(s|x), or compatibility of the observed data s with the body configuration x; (2) the prior probability, p(x), or current belief about the configuration before receiving the sensory data s, also known as the previous state belief; and (3) the marginal likelihood, p(s), which corresponds to the marginalization of the likelihood of receiving sensory data s regardless of the configuration. This is a normalization term, $p(s) = \int_x p(s|x)p(x)\,dx$, which ensures that the posterior probability p(x|s) integrates to 1 over the whole range of x.

The goal is to find the value of x which maximizes p(x|s), because it is the most likely value for the real-world body configuration x according to the obtained sensory data s. This direct method presents a great difficulty: the marginalization over all possible body states becomes intractable.

¹To be released.

²We define body configuration or body schema as a generic way to refer to the body position in space, for instance the joint angles of the robot.
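For intuition, the posterior in (1) can be computed exactly when the configuration is discretized onto a small grid, so the marginalization becomes a finite sum. The following sketch is illustrative only; the 1-D grid, the prior and the noise values are assumptions, not part of the paper's model.

```python
import numpy as np

# Hypothetical 1-D example: infer a joint angle x (discretized grid) from a
# noisy encoder reading s, using Bayes' theorem p(x|s) ∝ p(s|x) p(x).
x = np.linspace(-1.0, 1.0, 201)            # candidate joint angles (rad)
prior = np.exp(-0.5 * (x / 0.5) ** 2)       # prior belief: angle near 0
prior /= prior.sum()

s = 0.3                                     # observed encoder value (rad)
sigma_s = 0.1                               # sensor noise std (assumed)
likelihood = np.exp(-0.5 * ((s - x) / sigma_s) ** 2)

unnormalized = likelihood * prior
evidence = unnormalized.sum()               # p(s): the marginalization term
posterior = unnormalized / evidence         # integrates (sums) to 1

x_map = x[np.argmax(posterior)]             # MAP estimate of the configuration
print(round(x_map, 3))                      # precision-weighted compromise
```

The MAP estimate lands between the prior mean and the measurement, weighted by their precisions; the free-energy machinery below replaces this explicit (and, in high dimensions, intractable) summation.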

B. Free-energy principle

The free-energy principle [5] provides a tractable solution to this obstacle: instead of calculating the marginal likelihood, the idea is to minimize the Kullback-Leibler divergence [19] between a reference distribution q(x) and the real posterior p(x|s), so that the former becomes a good approximation of the latter.

$$D_{KL}(q(x)\,\|\,p(x|s)) = \int q(x) \ln \frac{q(x)}{p(x|s)}\,dx = \int q(x) \ln \frac{q(x)}{p(s,x)}\,dx + \ln p(s) = -F + \ln p(s) \ge 0 \quad (2)$$

It is important to note that the marginal likelihood, p(s), is independent of the configuration variable x, and that q(x) is a probability distribution, so the integral over its entire range is 1; thus p(s) falls out of the integration. Maximizing the negative of the first term effectively minimizes the difference between these two densities, with only the marginal likelihood remaining. Unlike the whole expression of the K-L divergence, the first term can be evaluated, because it depends on the reference distribution and on the knowledge about the environment we can assume the agent has, p(s,x) = p(s|x)p(x). This term is defined as the variational negative free-energy [20], [21].

$$F \equiv -\int q(x) \ln \frac{q(x)}{p(s,x)}\,dx = \int q(x) \ln \frac{p(s,x)}{q(x)}\,dx \quad (3)$$

Maximizing this expression with respect to q(x) is known as free-energy optimization and results in minimizing the K-L divergence between the two distributions³.

Maximizing F is equivalent to the previous goal of maximizing the posterior probability p(x|s), due to the fact that all probability distributions are strictly non-negative. Considering that the second term of the K-L divergence, ln p(s), depends neither on the reference distribution q(x) nor on the value of the body configuration x, the same value of x optimizes all three quantities: F, p(x|s) and D_KL(q(x)||p(x|s)).

³The expression for free-energy also appears in the literature in its positive version, without the preceding negative sign; in that case the objective of the optimization is the minimization of the expression.

According to the free-energy optimization theory, there are two ways to minimize surprise, which accounts for the discrepancy between the current state and the predicted or desired one (the prediction error): changing the belief (perceptual inference) or acting on the world (active inference). Both perceptual inference and active inference optimize the value of the free-energy expression F, while active inference also optimizes the value of the marginal likelihood by acting on the environment and changing the sensory data s.

Under the mean-field and Laplace approximations (the posterior is approximated from a family of tractable distributions) we can use the Maximum-A-Posteriori (MAP) estimate to approximate the mode [21] and simplify the calculation of F. Each factorized variational density q_i is assumed to have Gaussian form N(µ_i, Σ_i). Then, defining the Laplace-encoded energy [22] as L(s,µ) = −ln p(s,µ), F can be approximated as:

$$F = \int q(x)\,L(s,\mu)\,dx + \int q(x) \ln q(x)\,dx \approx L(s,\mu) + \sum_i \frac{1}{2}\left(\ln|\Sigma_i| + n_i \ln 2\pi\right) \quad (4,5)$$

where n_i is the number of parameters in the i-th set. Finally, the modes µ_i can be computed by maximizing the expected variational energy. We can further approximate µ using gradient descent on the Laplace-encoded energy L(s,µ), assuming that the second term is a constant.

C. Perceptual inference

Perceptual inference is the process of updating the inner model belief to best account for the sensory data, minimizing the prediction error.

The agent must update the most-likely or optimal value of the body configuration µ in each state. This optimal value is the one that maximizes negative free-energy, therefore a first-order iterative optimization algorithm of gradient ascent is applied to approach the local maximum of the function. In this case, this means that µ should be changed proportionally to the gradient of the negative free-energy.

For static systems, this update is done directly using only the gradient ascent formulation: $\dot{\mu} = \frac{\partial F}{\partial \mu}$ [9]. In dynamic systems, the time derivatives of the body configuration should be considered. Usually the first- and second-order derivatives, µ′ and µ″, are considered, but higher-order derivatives could also be included if their dynamic equations of behavior are known. The state variable is now a vector: µ = [µ, µ′, µ″]ᵀ. In this case, all values and derivatives must be updated taking into consideration the next higher-order derivative:

$$\dot{\mu} = D\mu + \frac{\partial F}{\partial \mu} \quad (6)$$

where D is the block-matrix derivative operator with the superdiagonal filled with ones.

When negative free-energy is maximized, the value of its derivative is ∂F/∂µ = 0 and the system is at equilibrium; i.e. $\dot{\mu} = 0$ for static systems and $\dot{\mu} = D\mu$ for dynamic systems.

Expression (6) denotes the change of µ with time, and it is used to update the value of µ with any numerical integration method. We use a simple first-order Euler integration method, where in each iteration the value is calculated using a linear dependency: $\mu_{i+1} = \mu_i + \dot{\mu}\,\Delta t$, where Δt = T is the period of execution of the updating cycle for the internal state.
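The update scheme above can be sketched as follows. The D operator and the Euler integration follow the text; the quadratic free-energy gradient (pulling the position belief towards a noisy encoder reading q, with trivial dynamics f = 0) is a toy stand-in, not the paper's F.

```python
import numpy as np

# Sketch of the generalized-coordinates update (6), mu_dot = D mu + dF/dmu,
# integrated with first-order Euler. Gradient below is a toy stand-in.
n = 3                                    # joints
mu = np.zeros(2 * n)                     # generalized state [mu, mu']
D = np.zeros((2 * n, 2 * n))
D[:n, n:] = np.eye(n)                    # superdiagonal block: shifts mu' into mu slot

q = np.array([0.4, -0.2, 0.1])           # encoder reading (assumed constant)

def dF_dmu(mu_vec):
    # Toy gradient: pull the position belief towards q, damp the velocity
    # belief (i.e. the dynamics model f(mu) = 0 is an assumption).
    grad = np.zeros_like(mu_vec)
    grad[:n] = q - mu_vec[:n]            # proprioceptive prediction error
    grad[n:] = -mu_vec[n:]               # dynamics prediction error
    return grad

dt = 0.05
for _ in range(400):
    mu = mu + dt * (D @ mu + dF_dmu(mu))  # Euler: mu_{i+1} = mu_i + mu_dot*dt

print(np.round(mu[:n], 3))               # position belief converges near q
```

At equilibrium the gradient vanishes and the belief settles on the encoder value, which is the static-system case described above.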

D. Active inference

Active inference [8] is the extension of perceptual inference to the relationship between sensors and actions, taking into account that actions can change the world to make the sensory data more consistent with the predictions made by the inner model.

Action plays a core role in the optimization and improves the approximation of the real distribution, thereby reducing the prediction error by minimizing free-energy. It also acts on the marginal likelihood by changing the real configuration, which modifies the sensory data s to obtain new data that is more in concordance with the agent's belief.

In this case, the optimal value is also the one which maximizes negative free-energy, and again a gradient ascent approach is taken to update the value of the action:

$$\dot{a} = \frac{\partial F}{\partial a} \quad (7)$$

where a is calculated using a first-order Euler numerical integration with an explicit gain: $a_{i+1} = a_i + k_a \dot{a}\,\Delta t$.

The combination of active inference and perception provides the mathematical framework for free-energy optimization.
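A minimal 1-D loop can illustrate how the coupled perception and action updates reach equilibrium. The world model (the proprioceptive sensor integrates the velocity action, with ds/da = 1), the attractor f(µ) = ρ − µ and the unit variances are all illustrative assumptions; only the update rules follow the equations above.

```python
import numpy as np

# 1-D toy loop coupling perception (Sec. II-C) and action (Sec. II-D):
#   mu_dot = mu' + dF/dmu,  mu'_dot = dF/dmu',  a_dot = dF/da,
# integrated with first-order Euler. All variances set to 1 (assumption).
rho = 0.8                              # attractor: desired position

mu, mup, a, s = 0.0, 0.0, 0.0, 0.0     # belief, its dynamics, action, sensor
dt = 0.01
for _ in range(20000):
    f = rho - mu                       # toy attractor dynamics, df/dmu = -1
    eps_s = s - mu                     # sensory prediction error
    eps_f = mup - f                    # dynamics prediction error
    mu += dt * (mup + eps_s - eps_f)   # dF/dmu = eps_s + (df/dmu)*eps_f
    mup += dt * (-eps_f)               # dF/dmu' = -eps_f
    a += dt * (-eps_s)                 # dF/da = -(ds/da)*eps_s, ds/da = 1
    s += dt * a                        # "world": velocity action moves joint

print(round(s, 3), round(mu, 3))       # both settle at the attractor rho
```

Note that the action is driven only by the sensory prediction error: the attractor pulls the belief away from the current sensor reading, and the action then moves the world to cancel the resulting error, which is exactly the equilibrium mechanism exploited in the reaching task.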

III. ROBOT PHYSICAL MODEL

Fig. 2: Model description. (Left) Generative model of the robot. (Right) Notation and configuration of the robot. Shown variables: internal variables µ and µ′, joint angle position q, actions applied to the joints a, the visual location of the end-effector v, and the causal variables ρ.

iCub [23] (v1.4) is a 104 centimeter tall and 22 kilogram humanoid robot that resembles a small child, with 53 degrees of freedom powered by electric motors and driven by tendons. The upper body has 38 degrees of freedom, distributed in 7 for each arm, 9 for each hand and 6 for the head (3 on the neck and 3 for the eyes). The lower body has 15 degrees of freedom, 6 for each leg and 3 more in the waist. The software is built on top of the YARP (Yet Another Robot Platform) framework, which facilitates communication between different hardware and software implementations in robotics [24].

The robot is divided into several kinematic chains, distributed according to its extremities. All kinematic chains are defined through homogeneous transformation matrices using the Denavit-Hartenberg convention. We focus on two kinematic chains: those with the end-effector being the right hand (without considering its fingers) and the left eye.

TABLE I: Considered DOF for the iCub robot

Location          Link  θ_i (deg)            Joint name
Arm (right/left)  4     -90 + [0, 160.8]     (r/l) shoulder roll
Arm (right/left)  5     -105 + [-37, 100]    (r/l) shoulder yaw
Arm (right/left)  6     [5.5, 106]           (r/l) elbow
Head              3     90 + [-40, 30]       neck pitch
Head              5     90 + [-55, 55]       neck yaw

Without loss of generality, the arm model is defined as a three-degree-of-freedom (revolute joints) system: r_shoulder_roll, r_shoulder_yaw and r_elbow. The left eye camera observes the end-effector position and the world around it. The joints considered for the motion of the head are neck_pitch and neck_yaw.

The symbolic matrices for the kinematics of these chains in terms of the joint variables were obtained using Mathematica. These are the homogeneous transformation matrices for both complete chains, from the local robot origin to the end-effector reference frame, as well as their partial derivatives in terms of the three degrees of freedom.

We generated a 3-dimensional model in SolidWorks in order to provide more accurate simulations for multi-body dynamics. Figures 3c and 3f show the surface that defines the working range of the robot in terms of its degrees of freedom. We used this model to design all reaching experiments.

IV. ACTIVE INFERENCE COMPUTATIONAL MODEL FOR ICUB ARM REACHING TASK

A. Problem formulation

The body configuration, or internal variables, is defined as the joint angles. The estimated state µ ∈ ℝ³ is the belief the agent has about the joint angle positions, and the action a ∈ ℝ³ is the angular velocity of those same joints. Because we use velocity control for the joints, the first-order dynamics µ′ ∈ ℝ³ must also be considered.

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} \quad \mu' = \begin{bmatrix} \mu'_1 \\ \mu'_2 \\ \mu'_3 \end{bmatrix} \quad a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \quad (8)$$

The sensory data is obtained through several input sensors that provide information about the position of the end-effector in the visual field, s_v ∈ ℝ², and the joint angle positions, s_p ∈ ℝ³.

$$s_p = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \end{bmatrix} \quad s_v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \quad (9)$$

The likelihood p(s|µ) is made up of proprioception functions in terms of the current body configuration, while the prior p(µ) takes into account the dynamic model of the agent, which describes how the internal state changes with time. The combination of both probabilities formalizes the negative free-energy. Adapting the Laplace-encoded energy to the model described in Fig. 2:

$$p(s,\mu) = p(s|\mu)\,p(\mu) = p(s_p|\mu)\,p(s_v|\mu)\,p(\mu'|\mu,\rho) \quad (10)$$

B. Negative free-energy optimization

In order to define the conditional densities for each of the terms, we first define the expressions for the sensory data. The joint angle position, s_p, is obtained directly from the joint angle sensors. Let us assume that this input is noisy and follows a normal distribution with mean at the internal value µ and variance Σ_sp. The end-effector visual position, s_v, is defined by a non-linear function of the body configuration, obtained using the forward model of the right arm and the pinhole camera model for the left eye camera of the robot. Let us assume that this input is noisy and follows a normal distribution with mean at the value of this function, g(µ) ∈ ℝ², and variance Σ_sv. The dynamic model is determined by a function which depends on both the current state µ and the causal variables ρ (e.g. the visual plane position of the object to be reached). We assume that this input is noisy and follows a normal distribution with mean at the value of this function, f(µ,ρ) ∈ ℝ³, and variance Σ_sµ.

$$g(\mu) = \begin{bmatrix} g_1(\mu) \\ g_2(\mu) \end{bmatrix} \quad f(\mu,\rho) = \begin{bmatrix} f_1(\mu,\rho) \\ f_2(\mu,\rho) \\ f_3(\mu,\rho) \end{bmatrix} \quad (11)$$

Considering the same normal distribution assumptions for the internal state and sensory terms, the expressions of the probability functions are extended to consider all the elements of the vectors, where $C_i = \frac{1}{\sqrt{2\pi\Sigma_{s_i}}}$:

$$p(s_p|\mu) = C_p \prod_{i=1}^{3} \exp\left(-\frac{1}{2\Sigma_{s_p}}(q_i - \mu_i)^2\right) \quad (12)$$

$$p(s_v|\mu) = C_v \prod_{i=1}^{2} \exp\left(-\frac{1}{2\Sigma_{s_v}}(v_i - g_i(\mu))^2\right) \quad (13)$$

$$p(\mu'|\mu,\rho) = C_\mu \prod_{i=1}^{3} \exp\left(-\frac{1}{2\Sigma_{s_\mu}}(\mu'_i - f_i(\mu,\rho))^2\right) \quad (14)$$

The variational negative free-energy, considering the previous density functions, is obtained by applying the natural logarithm to (10). The product is transformed into a summation due to the properties of the natural logarithm.

$$F = \ln p(s_p|\mu) + \ln p(s_v|\mu) + \ln p(\mu'|\mu,\rho) + C = \sum_{i=1}^{3} -\frac{1}{2\Sigma_{s_p}}(q_i - \mu_i)^2 + \sum_{i=1}^{2} -\frac{1}{2\Sigma_{s_v}}(v_i - g_i(\mu))^2 + \sum_{i=1}^{3} -\frac{1}{2\Sigma_{s_\mu}}(\mu'_i - f_i(\mu,\rho))^2 + \ln C_p + \ln C_v + \ln C_\mu + C \quad (15)$$

The vectorial equations used for the gradient ascent formulation are obtained by differentiating the scalar free-energy term with respect to the internal state vector and the action vector (Eqs. (6) and (7)). The dependency of F with respect to the vector of internal variables µ can be calculated using the chain rule on the functions that depend on those internal variables. The dependency of F with respect to the vector of actions a is calculated considering that the only magnitudes directly affected by the action are the values obtained from the sensors.

$$\frac{\partial F}{\partial \mu} = \frac{1}{\Sigma_{s_p}}(s_p - \mu) + \frac{1}{\Sigma_{s_v}}\left(\frac{\partial g(\mu)}{\partial \mu}\right)^T (s_v - g(\mu)) + \frac{1}{\Sigma_{s_\mu}}\left(\frac{\partial f(\mu,\rho)}{\partial \mu}\right)^T (\mu' - f(\mu,\rho)) \quad (16)$$

$$\frac{\partial F}{\partial a} = -\left[\frac{1}{\Sigma_{s_p}}\left(\frac{\partial s_p}{\partial a}\right)^T (s_p - \mu) + \frac{1}{\Sigma_{s_v}}\left(\frac{\partial s_v}{\partial a}\right)^T (s_v - g(\mu))\right] \quad (17)$$

Even though an angular velocity control is carried out, the agent can also be aware of the values of the first-order dynamics, which can be updated using a gradient ascent formulation. The dependency of F with respect to the first-order dynamics vector µ′ is limited to the influence of the dynamic model.

$$\frac{\partial F}{\partial \mu'} = -\frac{1}{\Sigma_{s_\mu}}(\mu' - f(\mu,\rho)) = \frac{1}{\Sigma_{s_\mu}}(f(\mu,\rho) - \mu') \quad (18)$$

Equation (18) shows that the update of the first-order dynamics has a negative feedback, which has a beneficial effect on the stability of the process. This also means that at the equilibrium point the value of the derivative should be zero. Considering the equations above, the complete update equations are:

$$\dot{\mu} = \mu' + \frac{\partial F}{\partial \mu} \qquad \dot{\mu}' = \frac{\partial F}{\partial \mu'} \qquad \dot{a} = \frac{\partial F}{\partial a} \quad (19)$$

A first-order Euler integration method is applied to update the values of µ, µ′ and a in each iteration.
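The full update loop can be sketched numerically. The visual forward model g below is a toy planar arm with made-up link lengths (the paper derives g from the iCub kinematics and a pinhole camera model), the variances, gains and target are illustrative, the sensors are noise-free, and the (∂f/∂µ)ᵀ term of (16) is omitted for simplicity; only the structure of the update equations follows the text.

```python
import numpy as np

# Sketch of the arm update equations (16)-(19): 3 joints, 2-D visual sensor.
Sp, Sv, Sm = 1.0, 1.0, 1.0            # Sigma_sp, Sigma_sv, Sigma_smu (assumed)
dt = 0.01                             # Euler integration step
T = dt                                # cycle time: ds_p/da = T*I (eq. (22))
ka = 100.0                            # action gain k_a (assumed)
L = np.array([0.5, 0.4, 0.3])         # hypothetical link lengths

def g(mu):                            # toy visual forward model (planar arm)
    th = np.cumsum(mu)
    return np.array([np.sum(L * np.cos(th)), np.sum(L * np.sin(th))])

def Jg(mu):                           # analytic Jacobian dg/dmu (2x3)
    th = np.cumsum(mu)
    return np.array([[-np.sum(L[j:] * np.sin(th[j:])) for j in range(3)],
                     [np.sum(L[j:] * np.cos(th[j:])) for j in range(3)]])

rho = np.array([0.5, 0.6, 1.0])       # visual target (rho1, rho2) + gain rho3

def f(mu):                            # attractor dynamics, eqs. (20)-(21)
    return np.linalg.pinv(Jg(mu)) @ (rho[2] * (rho[:2] - g(mu)))

mu = np.array([0.5, 0.8, 0.9])        # belief about the joint angles
mup = np.zeros(3)                     # first-order dynamics belief mu'
q = mu.copy()                         # real joints ("world" state)
a = np.zeros(3)                       # velocity action

for _ in range(20000):
    sp, sv = q, g(q)                  # noise-free sensor readings
    Jt = Jg(mu).T
    dF_dmu = (sp - mu) / Sp + Jt @ (sv - g(mu)) / Sv   # eq. (16), simplified
    dF_dmup = (f(mu) - mup) / Sm                       # eq. (18)
    # eq. (17) with ds_p/da = T*I and ds_v/da = Jg*T (chain rule, eq. (23))
    dF_da = -(T / Sp) * (sp - mu) - (T / Sv) * Jt @ (sv - g(mu))
    mu = mu + dt * (mup + dF_dmu)     # eq. (19), Euler integration
    mup = mup + dt * dF_dmup
    a = a + ka * dt * dF_da
    q = q + a * T                     # world: velocity command moves joints

err = np.linalg.norm(g(q) - rho[:2])  # final visual error
print(round(float(err), 3))
```

The belief drifts towards the attractor, creating a proprioceptive and visual prediction error that the action then cancels by moving the real joints, so the hand ends up at the visual target.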

C. Perceptual attractor dynamics

We introduce the reaching goal as a perceptual attractor in the visual field as follows:

$$A(\mu,\rho) = \rho_3 \left(\begin{bmatrix} \rho_1 \\ \rho_2 \end{bmatrix} - \begin{bmatrix} g_1(\mu) \\ g_2(\mu) \end{bmatrix}\right) \quad (20)$$

The internal variable dynamics are defined in terms of the attractor:

$$f(\mu,\rho) = T(\mu)\,A(\mu,\rho) \quad (21)$$

where T(µ) is the function that transforms the attractor vector from the target space (visual) to the joint space. The system is velocity controlled, therefore the target-space quantity is a linear velocity vector and the joint-space quantity is an angular velocity.

The visual Jacobian matrix that relates the visual space (2 coordinates) to the joint space (3 DOF) is a rectangular matrix, so the mapping matrix used is its generalized inverse (Moore-Penrose pseudoinverse): T(µ) = J⁺_v(µ). This matrix is calculated using the singular-value decomposition (SVD), where J_v = UΣVᵀ and J⁺_v = VΣ⁺Uᵀ.
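The SVD-based pseudoinverse can be sketched as follows; the Jacobian values are arbitrary illustrative numbers, not the iCub's.

```python
import numpy as np

# Moore-Penrose pseudoinverse of the 2x3 visual Jacobian via SVD:
# J = U S V^T  =>  J^+ = V S^+ U^T, with S^+ inverting the non-zero
# singular values. Jacobian entries below are made up for illustration.
Jv = np.array([[0.8, -0.3, 0.1],
               [0.2, 0.6, -0.4]])

U, s, Vt = np.linalg.svd(Jv, full_matrices=False)
J_pinv = Vt.T @ np.diag(1.0 / s) @ U.T      # S^+: reciprocal singular values

# Sanity checks: matches numpy's pinv; J J^+ = I since J has full row rank.
assert np.allclose(J_pinv, np.linalg.pinv(Jv))
assert np.allclose(Jv @ J_pinv, np.eye(2))

A = np.array([5.0, -2.0])                   # attractor vector (visual plane)
f = J_pinv @ A                              # joint-space velocity, eq. (21)
print(np.round(f, 3))
```

The pseudoinverse yields the minimum-norm joint velocity that produces the desired visual-plane velocity, which is the natural choice for the redundant (3 DOF to 2 coordinates) arm chain.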

D. Active inference

The action is set to be an angular velocity magnitude, a, which corresponds to the angular joint velocity in the latent space. We must calculate the expressions of the partial derivative matrices ∂s_p/∂a and ∂s_v/∂a in (17) to quantify the dependency of these sensor readings with respect to the velocities of the joints.

We assume that the control action a is updated every cycle, and therefore has the same value during each interval of time between cycles. For each period (cycle time between updates), the equation of a uniform circular motion is satisfied for each joint position. If this equation is discretized, then for each sampling moment, T seconds apart, the value is updated as $q_{i+1} = q_i + a_i T$. The dependency of the joint angle position with respect to the control action is therefore defined.

The partial derivatives of the joint position s_p with respect to the action, considering that there is no cross-influence or coupling between joint velocities and that q_i and its expected value µ_i should converge at equilibrium, are given by the following expression:

$$\frac{\partial q_i}{\partial a_j} = \frac{\partial \mu_i}{\partial a_j} = \begin{cases} T & i = j, \\ 0 & \text{otherwise.} \end{cases} \quad (22)$$

With the dependency of the joint position on the action known, we can use the chain rule to calculate the dependency of the visual sensor, s_v. Considering that the values of g_1(µ) and g_2(µ) should also converge to v_1 and v_2 at equilibrium, the partial derivatives are given by the following expression:

$$\frac{\partial v_i}{\partial a_j} = \frac{\partial v_i}{\partial q_j}\frac{\partial q_j}{\partial a_j} = \frac{\partial g_i(\mu)}{\partial \mu_j}\frac{\partial \mu_j}{\partial a_j} \quad (23)$$
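The chain rule in (22)-(23) can be checked numerically: differentiating the visual position directly through the discretized update q' = q + aT must match the analytic product (∂g/∂q)·T. The forward model g is a toy planar stand-in, not the iCub model.

```python
import numpy as np

# Numerical check of the chain rule (23): dv/da = (dg/dq) * T.
T = 0.05                                    # cycle time between updates
L = np.array([0.5, 0.4, 0.3])               # hypothetical link lengths

def g(q):                                   # toy visual forward model
    th = np.cumsum(q)
    return np.array([np.sum(L * np.cos(th)), np.sum(L * np.sin(th))])

q = np.array([0.3, 0.2, 0.1])
a = np.zeros(3)
eps = 1e-6

# Analytic side: dv_i/da_j = dg_i/dq_j * T (here dg/dq by central differences)
Jg = np.zeros((2, 3))
for j in range(3):
    dq = np.zeros(3); dq[j] = eps
    Jg[:, j] = (g(q + dq) - g(q - dq)) / (2 * eps)
dv_da_chain = Jg * T

# Direct side: finite difference through the update q' = q + a*T (eq. (22))
dv_da_direct = np.zeros((2, 3))
for j in range(3):
    da = np.zeros(3); da[j] = eps
    dv_da_direct[:, j] = (g(q + (a + da) * T) - g(q + (a - da) * T)) / (2 * eps)

print(np.allclose(dv_da_chain, dv_da_direct, atol=1e-6))
```

Both sides agree, confirming that the visual sensitivity to the action is simply the visual Jacobian scaled by the cycle time T.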

V. ACTIVE INFERENCE COMPUTATIONAL MODEL FOR ICUB HEAD OBJECT TRACKING TASK

We extend the arm reaching model to the head to obtain an object tracking motion behavior. The goal of this task is to maintain the object in the center of the visual field, thus increasing the reaching working range capabilities. Two available degrees of freedom (yaw and pitch) are used for this purpose.

Fig. 3: Results of right arm object reaching using sensory fusion and under noisy sensors. Panels: (a) sensory fusion error, (b) sensory fusion path, (c) working range of right arm end-effector (front), (d) encoder noise handling error, (e) encoder noise handling path, (f) working range of right arm end-effector (back). (Left) RMS visual error between the real visual position and the attractor position. For each run, the moment at which the attractor position is achieved is marked with a number inside a colored circle. (Center) Path followed by the end-effector in the visual plane. (Right) Working range of the right hand end-effector used to calculate the attractor positions. The dark surface considers only shoulder roll and elbow (2 DOF), while the volume also considers shoulder yaw (3 DOF).

A. Problem formulation

Sensory data and proprioception for the humanoid robot head are defined by the internal variable beliefs µ_e ∈ ℝ², the actions a_e ∈ ℝ², and the first-order dynamics vector µ′_e ∈ ℝ². Because the end-effector in this case is the camera itself, there is only joint angle position sensing, s_e ∈ ℝ².

B. Negative free-energy optimization

The variational negative free-energy for the head motion, F_e, is obtained from the conditional densities, and its dependency with respect to the internal variables µ_e, the actions a_e and the first-order dynamics µ′_e is calculated.

$$F_e = \sum_{i=1}^{2} -\frac{1}{2\Sigma_{s_e}}(q_{e,i} - \mu_{e,i})^2 + \ln C_e + \sum_{i=1}^{2} -\frac{1}{2\Sigma_{s_{\mu e}}}(\mu'_{e,i} - f_{e,i}(\mu_e,\rho))^2 + \ln C_{\mu e} + C \quad (24)$$

$$\frac{\partial F_e}{\partial \mu_e} = \frac{1}{\Sigma_{s_e}}(s_e - \mu_e) + \frac{1}{\Sigma_{s_{\mu e}}}\left(\frac{\partial f_e(\mu_e,\rho)}{\partial \mu_e}\right)^T (\mu'_e - f_e(\mu_e,\rho))$$

$$\frac{\partial F_e}{\partial a_e} = -\frac{1}{\Sigma_{s_e}}\left(\frac{\partial s_e}{\partial a_e}\right)^T (s_e - \mu_e)$$

$$\frac{\partial F_e}{\partial \mu'_e} = \frac{1}{\Sigma_{s_{\mu e}}}(f_e(\mu_e,\rho) - \mu'_e) \quad (25)$$

C. Perceptual attractor dynamics

In order to obtain the desired motion in the visual field, an attractor is defined towards the center of the image, (c_x, c_y). The attractor position (ρ_1, ρ_2) is read from the visual input and is dynamically updated with the motion of the head, while the center coordinates have a constant value.

$$A_e(\mu_e,\rho) = \rho_3 \left(\begin{bmatrix} c_x \\ c_y \end{bmatrix} - \begin{bmatrix} \rho_1 \\ \rho_2 \end{bmatrix}\right) \quad (26)$$

The internal variable dynamics are defined in terms of the attractor: f_e(µ_e,ρ) = T_e(µ_e) A_e(µ_e,ρ), with f_e(µ_e,ρ) ∈ ℝ². With two pixel coordinates and two degrees of freedom, the inverse of the Jacobian matrix can be directly used as the mapping matrix in the visual space: T_e(µ_e) = J⁻¹_ev(µ_e).

VI. RESULTS

A. Experimental Setup

The iCub robot is placed in a controlled environment with an easily recognizable object in front of it, which serves as a perceptual attractor to produce the movement of the robot. The values of the causal variables ρ_1 and ρ_2 are the horizontal and vertical positions of that object in the image plane obtained from the left eye camera, and the value of ρ_3 is a weighting factor that adjusts the strength of the attractor. The right arm end-effector is also recognized by a visual marker placed on the hand of the robot, providing the values of v_1 and v_2. The goal is to obtain a reaching behavior in the robot: the proposed algorithm generates an action in the right arm towards the object, in order to reduce the discrepancy between the current state and the desired state imposed by the goal.

Fig. 4: Right arm reaching and head object tracking. Panels: (a) right arm internal state and encoders, (b) head internal state and encoders, (c) error (s_p − µ) and (s_e − µ_e), (d) visual coordinate 1, (e) visual coordinate 2, (f) error (s_v − g(µ)), (g) right arm actions, (h) negative free-energy, (i) robot vision. Internal states are driven towards the perceptual attractor position. The difference between the internal states and the encoders (Fig. 4c), and between the calculated position and the visual location of the end-effector (Fig. 4f), drive the actions. The algorithm optimizes free-energy and reaches the target attractor position. Fig. 4i shows the visual perception of the robot with the attractor (blue), the prediction (green) and the real position (red).

Three different experiments were performed: (1) robustness: right arm reaching towards a series of locations that the robot must follow in its visual plane; (2) dynamics evaluation: right arm reaching and active head models with a moving object; (3) generalization: both arms reaching with active head.

The relevant parameters of the algorithm are: (1) the variance Σ_sp of the encoder sensor, (2) the variance Σ_sv of the visual perception, (3) the variance Σ_sµ of the attractor dynamics and (4) the action gains k_a. These parameters were tuned empirically with their physical meaning in mind and remain constant during the experiments, except in the encoder noise handling situation of the first experiment, where the encoder variance was modified to withstand more deviation.

B. Right arm reaching with sensory fusion under noisy sensors

The first experiment tests the robustness of the algorithm under two different conditions (Figure 3). The robot has to reach four different static locations in the visual field with the right arm. Once a location is reached, the next location becomes active. A location is considered reached when the visual position lies inside a disk with a radius of five pixels centered at the location. The evaluation is assessed using the root mean square (RMS) of the error in visual plane coordinates between the real end-effector location (visual marker) and the target location.
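The reach criterion and the evaluation metric can be written down directly; this is a minimal sketch of the stated definitions (function names are ours), not the authors' evaluation code.

```python
import numpy as np

REACH_RADIUS = 5.0  # pixels, as stated in the experiment description

def reached(end_effector_px, target_px, radius=REACH_RADIUS):
    """A target counts as reached when the visual marker lies inside a
    disk of the given radius centered on the target."""
    diff = np.asarray(end_effector_px, dtype=float) - np.asarray(target_px, dtype=float)
    return np.hypot(diff[0], diff[1]) <= radius

def rms_error(marker_traj, target_traj):
    """RMS of the visual-plane error between the marker and the target
    over a whole trial (one 2-D point per time step)."""
    err = np.asarray(marker_traj, dtype=float) - np.asarray(target_traj, dtype=float)
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```

For example, a marker at pixel (3, 4) reaches a target at (0, 0), since its distance is exactly 5 pixels.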

Under the first condition, no noise was added (only the intrinsic sensor noise and model errors). We tested the contribution of each source of sensory information to the reaching task: vision only, joint angle encoders only, and both together. Figures 3a and 3b show the RMS error and the path followed by the arm, respectively. Even though the model had been verified and the camera calibrated, there was a difference between the forward model and the real robot, due to possible deviations in the parameters and motor backlash, which means that the robot has to use the visual information to correct its real position.

Employing joint angle encoders and vision together provides the best behavior for reaching the fixed locations in the visual field, attaining all positions along the shortest paths. Visual perception alone also reaches all positions but does not follow the optimal path, while using only the encoder values fails to reach all locations.

Under the second condition, we injected Gaussian noise into the robot motor encoders s_p in order to test the robustness against high noise. Four trials were performed with zero-mean Gaussian noise with standard deviations of 0° (control), 10°, 20° and 40°. Figures 3d and 3e show the reaching error and the followed path for each trial. The runs with no noise and with σ = 10° achieved very similar results, with the noise-free run reaching the targets slightly faster. With σ = 20°, motion was affected by deviations in the path followed by the end-effector. The extreme case of σ = 40° caused oscillations and erroneous approach trajectories that significantly delayed reaching the target locations. These results show the importance of reliable visual perception when discrepancies in the model or unreliable joint angle measurements are present.
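The noise injection itself is straightforward; a minimal sketch of the corruption applied to the encoder readings (our own helper, mirroring the 0°/10°/20°/40° trials) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def noisy_encoders(q_deg, sigma_deg):
    """Corrupt joint encoder readings (in degrees) with zero-mean Gaussian
    noise of standard deviation sigma_deg, as in the robustness trials."""
    q = np.asarray(q_deg, dtype=float)
    return q + rng.normal(0.0, sigma_deg, size=q.shape)
```

With sigma_deg = 0 the readings pass through unchanged (the control trial); with sigma_deg = 40 individual readings can easily be tens of degrees off, which is what forces the filter to lean on vision.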

C. Right arm reaching of moving object with active head

We evaluated the algorithm for the right arm model with an active head tracking a moving object (manually operated by a human, Fig. 4i). The variable dynamics are shown in Fig. 4. The initial position of the right arm lies outside of the visual plane (v2 is missing in Fig. 4e). Hence, the free-energy optimization algorithm relies only on joint measurements to produce the reaching motion until the right hand appears in the visual plane (v1 enters Fig. 4d from the top). Figs. 4a and 4b show both the encoder measurements q and the estimated joint angles µ of the arm and head. Figs. 4d and 4e show how the predicted g and real v visual positions of the right arm end-effector follow the perceptual attractor ρ, while the head tries to keep the object at the center of the visual field c. Right arm actions are depicted in Fig. 4g; the stop action is produced by the sense of touch, as contact on any of the pressure sensors triggers the grasping motion of the hand. Finally, Fig. 4h shows that the algorithm optimizes (maximizes) the negative free-energy for both arm and head.
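The behavior described above, where the update falls back to proprioception alone while the hand is outside the visual plane, amounts to gating the visual prediction-error term on marker detection. A minimal sketch under our own naming (stand-in forward model, not the robot's):

```python
import numpy as np

def belief_gradient(mu, q, v, Sigma_sp=0.1, Sigma_sv=0.05, g=np.sin, dg=np.cos):
    """Gradient of the joint belief mu with the visual term gated out when
    the hand marker is not detected (v is None). Only the encoder reading q
    drives the update in that case, as in the start of the experiment."""
    dmu = (q - mu) / Sigma_sp              # proprioceptive term, always available
    if v is not None:                      # visual term only once v1 is detected
        dmu += dg(mu) * (v - g(mu)) / Sigma_sv
    return dmu
```

Once the marker enters the image, the visual term switches on and pulls the belief (and hence the action) towards the visually corrected estimate.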

D. Generalization: Dual arm reaching and active head

We generalized the algorithm to dual arm reaching. The free-energy optimization reaching task was replicated for the left arm, obtaining a reaching motion for both arms together with a tracking motion performed by the head. The result of this experiment, along with other runs of the previous experiments, can be found in the supplementary video (to be released).

VII. CONCLUSIONS

This work presents the first active inference model working on a real humanoid robot for dual arm reaching and active head object tracking. The robot, evaluated under different levels of sensor noise (up to 40 degrees of joint angle deviation), was able to reach the visual goal, compensating the errors through free-energy optimization. The body configuration was treated as an unobserved variable, and the forward model as an approximation of the real end-effector location corrected online with visual input, thus tackling model errors. The proposed approach can be generalized to whole body reaching and can incorporate forward model learning, as shown in [9].

REFERENCES

[1] H. von Helmholtz, Handbuch der physiologischen Optik. L. Voss, 1867.

[2] D. C. Knill and A. Pouget, "The Bayesian brain: the role of uncertainty in neural coding and computation," Trends in Neurosciences, vol. 27, no. 12, pp. 712–719, 2004.

[3] K. Friston, "A theory of cortical responses," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 360, no. 1456, pp. 815–836, 2005.

[4] E. Todorov and M. I. Jordan, "Optimal feedback control as a theory of motor coordination," Nature Neuroscience, vol. 5, no. 11, p. 1226, 2002.

[5] K. J. Friston, "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, vol. 11, pp. 127–138, 2010.

[6] P. Lanillos, E. Dean-Leon, and G. Cheng, "Yielding self-perception in robots through sensorimotor contingencies," IEEE Transactions on Cognitive and Developmental Systems, no. 99, pp. 1–1, 2016.

[7] Y. Kuniyoshi and S. Sangawa, "Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model," Biological Cybernetics, vol. 95, p. 589, 2006.

[8] K. J. Friston, J. Daunizeau, J. Kilner, and S. J. Kiebel, "Action and behavior: a free-energy formulation," Biological Cybernetics, vol. 102, no. 3, pp. 227–260, 2010.

[9] P. Lanillos and G. Cheng, "Adaptive robot body learning and estimation through predictive coding," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.

[10] C. Fantacci, U. Pattacini, V. Tikhanoff, and L. Natale, "Visual end-effector tracking using a 3D model-aided particle filter for humanoid robot platforms," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1411–1418.

[11] C. Garcia Cifuentes, J. Issac, M. Wüthrich, S. Schaal, and J. Bohg, "Probabilistic articulated real-time tracking for robot manipulation," IEEE Robotics and Automation Letters, 2016.

[12] A. Ude, D. Omrčen, and G. Cheng, "Making object learning and recognition an active process," International Journal of Humanoid Robotics, vol. 5, no. 2, pp. 267–286, 2008.

[13] C. Gaskett and G. Cheng, "Online learning of a motor map for humanoid robot reaching," in 2nd International Conference on Computational Intelligence, Robotics and Autonomous Systems, Singapore, 2003.

[14] L. Jamone, M. Brandao, L. Natale, K. Hashimoto, G. Sandini, and A. Takanishi, "Autonomous online generation of a motor representation of the workspace for intelligent whole-body reaching," Robotics and Autonomous Systems, vol. 62, no. 4, pp. 556–567, 2014.

[15] A. Roncone, M. Hoffmann, U. Pattacini, L. Fadiga, and G. Metta, "Peripersonal space and margin of safety around the body: learning visuo-tactile associations in a humanoid robot with artificial skin," PLoS ONE, vol. 11, no. 10, p. e0163713, 2016.

[16] L. Pio-Lopez, A. Nizard, K. Friston, and G. Pezzulo, "Active inference and robot control: a case study," Journal of the Royal Society Interface, vol. 13, 2016.

[17] M. Baltieri and C. L. Buckley, "An active inference implementation of phototaxis," in 2018 Conference on Artificial Life, pp. 36–43, 2017.

[18] P. Lanillos and G. Cheng, "Active inference with function learning for robot body perception," International Workshop on Continual Unsupervised Sensorimotor Learning, ICDL-EpiRob, 2018.

[19] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.

[20] R. Bogacz, "A tutorial on the free-energy framework for modelling perception and learning," Journal of Mathematical Psychology, vol. 76, part B, pp. 198–211, 2015.

[21] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny, "Variational free energy and the Laplace approximation," NeuroImage, vol. 34, no. 1, pp. 220–234, 2007.

[22] C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth, "The free energy principle for action and perception: a mathematical review," Journal of Mathematical Psychology, 2017.

[23] G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori, "The iCub humanoid robot: an open platform for research in embodied cognition," in Performance Metrics for Intelligent Systems Workshop, 2008.

[24] G. Metta, P. Fitzpatrick, and L. Natale, "YARP: Yet Another Robot Platform," International Journal of Advanced Robotic Systems, vol. 3, no. 1, pp. 43–48, 2006.