
End-to-End Pixel-Based Deep Active Inference for Body Perception and Action

Cansu Sancaktar
Technical University of Munich, Germany
cansu.sancaktar@tum.de

Marcel A. J. van Gerven
Donders Institute for Brain, Cognition and Behaviour, Radboud University, Netherlands
m.vangerven@donders.ru.nl

Pablo Lanillos
Donders Institute for Brain, Cognition and Behaviour, Radboud University, Netherlands
p.lanillos@donders.ru.nl

Abstract—We present a pixel-based deep active inference algorithm (PixelAI) inspired by human body perception and action. Our algorithm combines the free energy principle from neuroscience, rooted in variational inference, with deep convolutional decoders to scale the algorithm to directly deal with raw visual input and provide online adaptive inference. Our approach is validated by studying body perception and action in a simulated and a real Nao robot. Results show that our approach allows the robot to perform 1) dynamical body estimation of its arm using only monocular camera images and 2) autonomous reaching to "imagined" arm poses in visual space. This suggests that robot and human body perception and action can be efficiently solved by viewing both as an active inference problem guided by ongoing sensory input.

Index Terms—Active inference, Deep learning, Free energy optimization, Bio-inspired perception, Predictive coding, Robotics.

I. INTRODUCTION

Learning and adaptation are two core characteristics that allow humans to perform flexible whole-body dynamic estimation and robust actions in the presence of uncertainty [1]. We hypothesize that the human brain acquires a representation (model) of the body, starting at the earliest stages of life, by learning a mapping between tactile, proprioceptive and visual cues [2]. This cross-modal sensorimotor mapping is encoded [3], [4] in some sort of internal model that allows us to predict the sensory effect of our body in space and to deal with unexpected perturbations through reactive actions [5]. Both this mapping and the perception of our body in space [6], [7] are flexible and depend on the interplay between top-down expectations and the current sensory inputs. Hence, unsupervised learning mechanisms are enhanced by online supervised adaptation [8].

In contrast, robots usually use a fixed rigid-body model in which the arm end-effector is defined as a pose, i.e., a 3D point in space plus an orientation. Hence, any error in the model or change in the conditions will result in failure. Several solutions have been proposed to overcome this problem, usually separated into perception and control approaches.

This work has been supported by the SELFCEPTION project, EU Horizon 2020 Programme, grant nr. 741941. PixelAI code: https://github.com/cansu97/PixelAI.

Fig. 1. Pixel-based deep active inference (PixelAI). The robot infers its body (e.g., joint angles) by minimizing the visual prediction error, i.e., the discrepancy between the camera sensor value x_v and the expected sensation g(µ) computed using a convolutional decoder. The error signal is used to update the internal belief to match the observed sensation (perceptual inference) and to generate an action a that reduces the discrepancy between the observed and predicted sensations. Both are computed by minimizing the variational free-energy bound.

For instance, by working in visual space (e.g., visual servoing [9]) we can exploit a set of invariant visual keypoints to provide control that incorporates real-world errors. Bayesian sensory fusion in combination with model-based fitting allows adaptation to sensory noise and model errors [10], and model-based active inference provides online adaptation in both action and perception [5]. Finally, learning approaches have shown that the difference between the model and reality can be overcome by optimizing the body parameters or by explicitly learning the policy for a task, e.g., through imitation learning or reinforcement learning (RL). Recently, model-free approaches, particularly deep RL, have demonstrated the potential for directly using raw images as an input for learning visual control policies [11]. Deep active inference has also emerged as a competitive alternative, at least in toy problems [12], [13].

In this work, we introduce PixelAI, a novel pixel-based deep active inference algorithm, depicted in Fig. 1, which directly scales to high-dimensional inputs (e.g., images), provides adaptation and model-free learning and, moreover, unifies perception and action into a single variational inference formulation. We combine the free energy principle (i.e., active inference) [14], which updates an internal model of the body through

perception and action, with deep convolutional decoders that map internal beliefs to expected sensations. More concretely, the agent learns a generative model of the body to aid in the construction of expected sensations from incoming partial multimodal information. This information, instead of being directly encoded, is processed by means of the error between the predicted outcome and the current input. In this sense, we formalize body perception and action as a consequence of surprise minimization via prediction errors [14], [15].

Following this approach, the robot should learn a latent representation ("state") of the body and the relation between its state and the expected sensations ("predictions"). These predictions are compared with the observed sensory data, generating an error signal that can be propagated to refine the belief that the robot has about its body state. Compensatory actions follow a similar principle: they are exerted so that the observed sensations better correspond to the predictions made by the learnt models, giving the robot the capacity to actively adjust to online changes in the body and the environment (see Fig. 1). This also offers a natural way to realize a plan by setting an imagined goal in sensory space [16]. For example, when working in visual space we can provide an image as a goal. By optimizing the free-energy bound, the agent then executes those actions that minimize the discrepancy between expected sensations (our goal) and observed sensations. Our approach was validated by studying body perception and action in a simulated and a real Nao robot. Results show that our approach allows the robot to perform 1) dynamical body estimation of its arm using only raw monocular camera images and 2) autonomous reaching to arm poses provided by a goal image.

II. BACKGROUND

A. The free energy principle

We model body perception as inferring the body state z based on the available sensory data x. Given a sensation x, the goal is to find z such that the posterior p(z|x) = p(x|z)p(z)/p(x) is maximized. However, computing the marginal likelihood p(x) requires an integration over all possible body states, i.e., p(x) = ∫_z p(x|z)p(z) dz, which becomes intractable for large state spaces. The free energy [17], largely exploited in machine learning [18] and neuroscience [19], circumvents this problem by introducing a reference distribution (also called recognition density) q(z) with a known tractable form. The goal of the minimization problem hence becomes finding the reference distribution q(z) that best approximates the posterior p(z|x). For tractability purposes, this approximation is calculated by optimizing the variational free energy F, whose negative is the evidence lower bound (ELBO). F can be defined as the Kullback-Leibler divergence D_KL plus the negative log-evidence or sensory surprise −ln p(x):

$$F = D_{KL}(q(z)\,\|\,p(z|x)) - \ln p(x) = \int_z q(z) \ln \frac{q(z)}{p(z|x)}\,dz - \ln p(x), \quad (1)$$

which, due to the non-negativity of D_KL, is an upper bound on surprise. Alternatively, we can use the identity $\ln p(x) = \int_z q(z) \ln p(x)\,dz$ to include the second term into the integral and write Equation (1) as

$$F = \int_z q(z) \ln \frac{q(z)}{p(x,z)}\,dz \quad (2)$$

$$= -\int_z q(z) \ln p(x,z)\,dz + \int_z q(z) \ln q(z)\,dz. \quad (3)$$

According to the free energy principle [14] both perception and action optimize the free energy and hence minimize surprise:

1) Perceptual inference: The agent updates its internal belief by approximating the conditional density (inference), maximizing the likelihood of the observed sensation:

$$z = \arg\min_{z} F(z, x). \quad (4)$$

2) Active inference: The agent generates an action a that results in a new sensory state x(a) that is consistent with the current internal representation:

$$a = \arg\min_{a} F(z, x(a)). \quad (5)$$

Under the Laplace approximation, the variational density takes the form of a Gaussian q(z) = N(µ, Σ), where µ is the conditional mode and Σ is the covariance of the parameters. By incorporating this reference distribution into Equation (3), the free energy can be approximated as (see [19] for the full derivation)

$$F = -\ln p(x, \mu) - \tfrac{1}{2}\left(\ln |\Sigma| + n \ln 2\pi\right), \quad (6)$$

where the first term is the joint density of the observed and the latent variables, with µ an n-dimensional state vector.
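To make the perceptual inference step concrete, the toy sketch below (our illustration, not part of the paper's model) performs gradient descent on the free energy for a linear-Gaussian generative model g(µ) = Wµ with the dynamics term omitted; all names and numerical values are hypothetical.

```python
import numpy as np

# Toy illustration: perceptual inference as gradient descent on the free energy
# for a linear generative model g(mu) = W @ mu with sensory variance Sigma_v.
# The free-energy gradient reduces to a precision-weighted prediction error
# mapped back to belief space through W^T (cf. Eq. 13 without the dynamics term).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 2))          # hypothetical sensory mapping
mu_true = np.array([0.3, -0.7])       # "true" latent state
x = W @ mu_true                       # noiseless observation for simplicity
sigma_v = 1.0                         # sensory variance (precision = 1 / sigma_v)

mu = np.zeros(2)                      # initial belief
dt = 0.02
for _ in range(500):
    pred_error = x - W @ mu           # (x_v - g(mu))
    mu_dot = W.T @ pred_error / sigma_v   # (dg/dmu)^T * Sigma_v^-1 * error
    mu = mu + dt * mu_dot             # first-order Euler update

print(mu, mu_true)                    # the belief converges towards the true state
```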

III. PIXEL-BASED DEEP ACTIVE INFERENCE

Our proposed PixelAI approach combines free energy optimization with deep learning to work directly with images as visual input. The optimization provides adaptation, while the neural network allows learning from high-dimensional input. We frame and experimentally validate the proposed algorithm in body perception and action in robots. Figure 1 visually describes PixelAI. The agent first learns approximate generative forward models of the body, implemented here as convolutional decoders. While interacting, the expected sensation (predicted by the decoder) is compared with the real visual input, and the prediction error is used to 1) update the belief and 2) generate actions. This is performed by optimizing the variational free-energy bound.

A. Active inference model

We formalize body perception as inferring the unobserved body state µ, e.g., the estimation of robot joint angles such as shoulder pitch and roll. We define the robot internal belief as an n-dimensional vector µ^[d] ∈ R^n for each temporal order d. For instance, for first-order (velocity) generalized coordinates the belief is µ = {µ^[0], µ^[1]}. The observed variables x are the visual sensory input x_v and the external causal variables ρ: x = {x_v, ρ}.

For instance, the robot has access to visual information x_v (an image of size w×h) and proprioceptive information (the joint encoder values q). The causal variables ρ are independent variables that produce effects in the world. Finally, let us define two generative models that describe the system: the sensory forward model g, which is the predictor that computes the sensory outcome x_v given the internal state µ, and the body internal state dynamics f. Both functions can be considered as the approximations that the agent has about reality:

$$x_v = g(\mu) + w_v \quad (7)$$

$$\mu^{[1]} = f(\mu, \rho) + w_\mu \quad (8)$$

where w_v and w_µ are noise terms assumed to be drawn from multivariate normal distributions with zero mean and covariances Σ_v and Σ_µ, respectively.

In order to compute the variational free energy under the Laplace approximation from Equation (6) we need the joint density. Assuming independence of the observed variables:

$$\ln p(x, \mu) = \ln p(x_v, \rho, \mu) = \ln p(x_v|\mu) + \ln p(\mu^{[1]}|\mu^{[0]}, \rho) \quad (9)$$

where p(x_v|µ) is the likelihood of having a visual sensation given the internal state and p(µ^[1]|µ^[0], ρ) is the transition dynamics of the latent variables (body state).

Body perception is then instantiated as computing the body state that minimizes the variational free energy. This can be performed through gradient optimization ∂F/∂µ. Since the temporal difference µ^[0]_{t+1} − µ^[0]_t is equal to the first-order dynamics µ^[1] at equilibrium, this term has to be included in the computation of the belief update to find a stationary solution during the gradient descent procedure, where the gradient ∂F/∂µ vanishes at the optimum [20]. Hence,

$$\dot{\mu}^{[0]} - \mu^{[1]} = -\frac{\partial F}{\partial \mu} = \frac{\partial \ln p(x_v, \rho, \mu)}{\partial \mu} = \frac{\partial \ln p(x_v|\mu)}{\partial \mu} + \frac{\partial \ln p(\mu^{[1]}|\mu^{[0]}, \rho)}{\partial \mu}. \quad (10)$$

In order to compute the likelihoods, we assume that the observed image x_v is noisy and follows a normal distribution with mean g(µ) and variance Σ_v. Considering that every pixel contribution is independent, such that Σ_v = diag(Σ_{v_1}, ..., Σ_{v_{h·w}}), the likelihood p(x_v|µ) is obtained as a product of independent Gaussians:

$$p(x_v|\mu) = \prod_{k=1}^{h \cdot w} \frac{1}{\sqrt{2\pi\Sigma_{v_k}}} \exp\left(-\frac{1}{2\Sigma_{v_k}} \left(x_{v_k} - g_k(\mu)\right)^2\right) \quad (11)$$

Analogously, the density that defines the latent variable dynamics is also assumed to be noisy and follows a normal distribution with mean f(µ, ρ) and variance Σ_µ:

$$p(\mu^{[1]}|\mu^{[0]}, \rho) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\Sigma_{\mu_i}}} \exp\left(-\frac{1}{2\Sigma_{\mu_i}} \left(\mu^{[1]}_i - f_i(\mu, \rho)\right)^2\right) \quad (12)$$
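As a side note, the factorised likelihood in Eq. (11) is simply a sum of per-pixel Gaussian log-densities. The short sketch below is illustrative only, with hypothetical array names, and evaluates ln p(x_v | µ) given a predicted image and a per-pixel variance.

```python
import numpy as np

def visual_log_likelihood(x_v, g_mu, sigma_v):
    """ln p(x_v | mu) for independent per-pixel Gaussians (Eq. 11).

    x_v, g_mu: arrays of shape (h, w); sigma_v: scalar or (h, w) variances.
    """
    sq_err = (x_v - g_mu) ** 2
    log_terms = -0.5 * np.log(2.0 * np.pi * sigma_v) - sq_err / (2.0 * sigma_v)
    return float(np.sum(log_terms))
```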

Substituting the likelihoods and computing the partial derivatives of Equation (10), the body state is then given by the following differential equation:

$$\dot{\mu}^{[0]} = \mu^{[1]} + \underbrace{\frac{\partial g(\mu)}{\partial \mu}^{T}}_{\text{mapping}} \underbrace{\Sigma_v^{-1}}_{\text{precision}} \underbrace{\left(x_v - g(\mu)\right)}_{\text{prediction error}} + \frac{\partial f(\mu, \rho)}{\partial \mu}^{T} \Sigma_\mu^{-1} \left(\mu^{[1]} - f(\mu, \rho)\right) \quad (13)$$

For notational simplicity, hereinafter we name the first and second summation terms of Eq. (9) as −F_g and −F_f respectively, and −∂_µ F_g and −∂_µ F_f the second and third summation terms of Eq. (13).

The action is computed analogously. However, only the sensory information is a function of the action, x(a), and therefore the action depends only on the free-energy term with sensory information, F_g:

$$\dot{a} = -\frac{\partial F_g}{\partial a} = -\frac{\partial F_g}{\partial x}\frac{\partial x}{\partial a} = -\frac{\partial x_v}{\partial a}^{T} \Sigma_v^{-1} \left(x_v - g(\mu)\right) = -\frac{\partial g(\mu)}{\partial \mu}^{T} \Delta t\, \Sigma_v^{-1} \left(x_v - g(\mu)\right) \quad (14)$$

To derive the last equality, we have employed the same approximation as in [5], assuming that the actions are joint velocities. In a velocity controller scheme we can approximate the angle change between two time steps of each joint j as ∂q_j/∂a_j = Δt, because the target values of the joint encoders q are computed as q_{t+1} = q_t + Δt a_t and Δt is a fixed value that defines the duration of each iteration. Then, assuming convergence of the sensation values at the equilibrium point, with µ → q and g(µ) → x_v, the term ∂x_v/∂a can be computed using the following equation:

$$\frac{\partial x_v}{\partial a_j} = \frac{\partial x_v}{\partial q_j}\frac{\partial q_j}{\partial a_j} = \frac{\partial g(\mu)}{\partial \mu_j}\frac{\partial \mu_j}{\partial a_j} = \frac{\partial g(\mu)}{\partial \mu_j}\Delta t \quad (15)$$

The update rule for both µ and a is finally calculated with first-order Euler integration:

$$\mu_{t+1} = \mu_t + \Delta t\, \dot{\mu}, \qquad a_{t+1} = a_t + \Delta t\, \dot{a} \quad (16)$$
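The coupled update rules of Eqs. (13), (14) and (16) can be written compactly as below; this is a minimal sketch under our naming, assuming a callable decoder g(µ) returning a flattened image and a helper returning the Jacobian-transpose product (∂g/∂µ)^T e, with the dynamics term omitted.

```python
def pixelai_step(mu, a, x_v, g, jac_T_dot, sigma_v, dt, k=1.0):
    """One belief/action update (Eqs. 13, 14, 16), dynamics term omitted.

    g(mu)            -> predicted image, flattened array
    jac_T_dot(mu, e) -> (dg/dmu)^T @ e, mapping a pixel error to belief space
    """
    e_v = x_v - g(mu)                        # visual prediction error
    mapped = jac_T_dot(mu, e_v) / sigma_v    # precision-weighted, back-projected
    mu_dot = k * mapped                      # perceptual update (Eq. 13 with mu[1] = 0)
    a_dot = -mapped * dt                     # action update (Eq. 14)
    return mu + dt * mu_dot, a + dt * a_dot  # Euler integration (Eq. 16)
```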

B. Scaling up with deep learning

In order to perform pixel-based free-energy optimization we compute Equations (13) and (14) by exploiting the forward and backward pass of a deep neural network. We approximate the visual generative model g(µ) and its partial derivative ∂_µ g(µ) with respect to the internal state by means of a convolutional decoder.

1) Prediction of the expected sensation: We approximate the forward model g(µ) by means of a generative network, based on the architecture proposed in [21] and described in Fig. 2. It outputs the predicted image given the robot's n-dimensional internal belief of the body state µ, e.g., the joint angles of the robot.

The input goes through two fully-connected layers (FC1 and FC2). Afterwards, a transposed convolution (UpConv) is performed to upsample the image.

Fig. 2. Network architecture of the convolutional decoder (FC: fully-connected layer, R: reshape operator, UpConv: transposed convolution, Conv: convolution). The feature maps flow from FC1 (512 neurons) and FC2 (1120 neurons), reshaped to 7x10x16, through UpConv/Conv stages of 14x20x64 and 28x40x16, followed by dropout and a final UpConv producing the 56x80 output image.

This deconvolution uses the input as the weights for the filters and can be regarded as a backward pass of the standard convolution operator [22]. Following [21], each transposed convolution layer was followed by a standard convolutional layer, which helps to smooth out potential artifacts from the transposed convolution step. There is an additional 1D-dropout layer before the last transposed convolution layer to avoid overfitting and achieve better generalization performance. All layers use the rectified linear unit (ReLU) as the activation function, except for the last layer, where a sigmoid function was used to obtain pixel intensity values in the range [0, 1]. Throughout the consecutive UpConv-Conv operations in the network, the number of channels is increased and decreased again to obtain the required output image size.
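A possible PyTorch sketch of the convolutional decoder of Fig. 2, reconstructed from the layer sizes reported in the paper (FC layers of 512 and 1120 units, a 7x10x16 reshape, 4x4 transposed convolutions with stride 2 and padding 1, 3x3 convolutions, dropout 0.15 and a sigmoid output of size 56x80); the exact channel ordering is our assumption, not the released implementation.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Sketch of the visual forward model g(mu) of Fig. 2 (our reconstruction)."""

    def __init__(self, n_joints: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_joints, 512), nn.ReLU(),
            nn.Linear(512, 1120), nn.ReLU(),     # reshaped below to 16 x 7 x 10
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(16, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Dropout(p=0.15),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mu: torch.Tensor) -> torch.Tensor:
        h = self.fc(mu).view(-1, 16, 7, 10)
        return self.deconv(h)                    # (batch, 1, 56, 80) grayscale image

# Example: ConvDecoder()(torch.zeros(1, 4)).shape -> torch.Size([1, 1, 56, 80])
```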

2) Backward pass and mapping to the latent variable: An essential term for computing both perception and action is the mapping between the error in the sensory space and the inferred variables: ∂g(µ)/∂µ. This term is calculated by performing a backward pass over the convolutional decoder. In fact, we can compute the whole partial derivative of the visual input term ∂F_g/∂µ in just one forward and one backward pass. The reason is that multiplying the prediction error between the expected and observed sensation (x_v − g(µ)) by the inverse variance and the partial derivative is equivalent to applying the backpropagation algorithm. It is important to note that when the function g(µ) outputs images of size w×h, ∂g(µ)/∂µ is a three-dimensional tensor. We stack the output into a vector ∈ R^{w·h} (row major). The following equation is obtained:

$$\frac{\partial F_g}{\partial \mu} = \underbrace{\begin{pmatrix} \frac{\partial g_{1,1}}{\partial \mu_1} & \frac{\partial g_{1,2}}{\partial \mu_1} & \cdots & \frac{\partial g_{w,h}}{\partial \mu_1} \\ \frac{\partial g_{1,1}}{\partial \mu_2} & \frac{\partial g_{1,2}}{\partial \mu_2} & \cdots & \frac{\partial g_{w,h}}{\partial \mu_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial g_{1,1}}{\partial \mu_n} & \frac{\partial g_{1,2}}{\partial \mu_n} & \cdots & \frac{\partial g_{w,h}}{\partial \mu_n} \end{pmatrix}}_{\left(\frac{\partial g}{\partial \mu}\right)^T} \underbrace{\begin{pmatrix} \frac{\partial F_g}{\partial g_{1,1}} \\ \frac{\partial F_g}{\partial g_{1,2}} \\ \vdots \\ \frac{\partial F_g}{\partial g_{w,h}} \end{pmatrix}}_{\frac{\partial F_g}{\partial g}} \quad (17)$$

where −∂F_g/∂g_{i,l} is given by (1/Σ_{v_{i,l}})(x_{v_{i,l}} − g_{i,l}(µ)). The action is computed by reusing this term and multiplying it by Δt.
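The one-forward, one-backward-pass trick can be realised with automatic differentiation: back-propagating the precision-weighted prediction error through the decoder yields (∂g/∂µ)^T Σ_v^{-1}(x_v − g(µ)) directly. A hedged PyTorch sketch follows, assuming a decoder module such as the ConvDecoder sketched above; the function name and interface are ours.

```python
import torch

def visual_belief_gradient(decoder, mu, x_v, sigma_v):
    """Return -dF_g/dmu = (dg/dmu)^T Sigma_v^-1 (x_v - g(mu)) via autograd.

    mu: tensor of shape (1, n_joints); x_v: observed image tensor matching g(mu).
    """
    mu = mu.clone().detach().requires_grad_(True)
    g_mu = decoder(mu)                              # forward pass
    # Gradient of this precision-weighted squared error w.r.t. mu equals
    # -(dg/dmu)^T Sigma_v^-1 (x_v - g(mu)), so the sign is flipped afterwards.
    loss = 0.5 * torch.sum((x_v - g_mu) ** 2 / sigma_v)
    loss.backward()                                 # backward pass
    return -mu.grad.detach()                        # back-projected, weighted error
```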

C. Formalizing the task with the brain variable dynamics

In active inference we include the goal as a prior in the body state dynamics function f(µ, ρ). For example, to perform a reaching task, we encode the desired goal of the robot as an instance in the sensory space (an image), which acts as an attractor generated by the causal variable ρ. This produces an error in the inferred state that promotes an action towards the goal. Note that the error ρ − g(µ) is zero when the prediction matches the desired goal. We define the body state dynamics with a causal variable attractor as:

$$f(\mu, \rho) = T(\mu)\, \beta\, \left(\rho - g(\mu)\right), \quad (18)$$

where β is a gain parameter that defines the intensity of the attractor and T(µ) = ∂g(µ)^T/∂µ is the mapping from the sensory space (e.g., the pixel domain) to the internal belief µ (e.g., joint space). Note that this term is obtained through the backward pass of the decoder. Finally, substituting the new dynamics generative model into Eq. (13), we write the last term ∂F_f/∂µ as:

$$-\frac{\partial F_f}{\partial \mu} = \frac{\partial f(\mu, \rho)}{\partial \mu}^{T} \Sigma_\mu^{-1} \left(\mu^{[1]} - \frac{\partial g(\mu)}{\partial \mu}^{T} \beta \left(\rho - g(\mu)\right)\right) \quad (19)$$

In the final model used in the experiments, we further simplified this equation by not including the first-order internal dynamics in the optimization process (µ^[1] = 0) and by noting that the correct mapping and direction from the sensory space to the latent variable is already provided by ∂g(µ)^T/∂µ. Thus, we greedily approximate ∂f(µ, ρ)/∂µ by −1, avoiding the Hessian computation of T at the cost of a less exact optimization. With these assumptions, the partial derivative of the dynamics term becomes:

$$-\frac{\partial F_f}{\partial \mu} = \Sigma_\mu^{-1} \frac{\partial g(\mu)}{\partial \mu}^{T} \beta \left(\rho - g(\mu)\right) = \Sigma_\mu^{-1} f(\mu, \rho) \quad (20)$$
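Under the simplifications leading to Eq. (20), the attractor contribution is again a back-projected error, now taken with respect to the goal image ρ. A sketch analogous to the autograd helper above (illustrative only; names are ours):

```python
import torch

def attractor_belief_gradient(decoder, mu, rho, beta, sigma_mu):
    """-dF_f/dmu under Eq. (20): Sigma_mu^-1 (dg/dmu)^T beta (rho - g(mu))."""
    mu = mu.clone().detach().requires_grad_(True)
    g_mu = decoder(mu)
    # Gradient of this loss w.r.t. mu is -(dg/dmu)^T (rho - g(mu)).
    loss = 0.5 * torch.sum((rho - g_mu) ** 2)
    loss.backward()
    return -beta * mu.grad.detach() / sigma_mu
```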

D. PixelAI algorithm

Algorithm 1 summarizes the proposed method. In the robot body perception and action application, x_v is set to the image provided by the robot's monocular camera and the decoder input becomes the internal belief (e.g., the estimated joint angles). The convolutional decoder is trained using the proprioceptive information (joint angle encoders), yielding a predictor of the visual forward model. The prediction error e_v is the difference between the expected visual sensation and the observation (line 6). The variational free-energy optimization, for perception (line 7) and action (line 8), updates the differential equations that drive the state estimation and control. Finally, with the dynamics term we added the possibility of inputting desired goals in the visual space (line 11). Although this implementation assumes that µ^[1] = 0, it is straightforward to add the first-order dynamics when there is velocity image information or joint encoders [5].

¹ The gain parameter K_Σv is added to allow the model to generate large action values without increasing the internal belief increments.

Algorithm 1 PixelAI: Deep Active Inference Algorithm
Require: Σ_v, Σ_µ, β, Δt
 1: µ ← initial joint angle estimation
 2: while true do
 3:   x_v ← Resize(camera image)                 ▷ visual sensation
 4:   g(µ) ← ConvDecoder.forward(µ)
 5:   ∂g ← ConvDecoder.backward(µ)
 6:   e_v = (x_v − g(µ))                          ▷ prediction error
 7:   µ̇ = K_Σv ∂g^T e_v / Σ_v ¹
 8:   ȧ = −(∂g^T e_v / Σ_v) Δt
 9:   if ∃ρ then                                  ▷ desired goal ρ dynamics
10:     e_f = β(ρ − g(µ))
11:     µ̇ = µ̇ + ∂g^T e_f / Σ_µ
12:   µ = µ + Δt µ̇ ;  a = a + Δt ȧ               ▷ first-order Euler integration
13:   SetVelocityController(a)
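Putting Algorithm 1 together as plain Python gives the sketch below (our naming; `read_camera` and `set_velocity` stand for the robot interface and are hypothetical, and the two gradient callables can be the autograd helpers sketched earlier):

```python
import torch

def pixelai_loop(mu, read_camera, set_velocity, belief_grad, goal_grad=None,
                 rho=None, dt=0.065, k_sigma_v=1e-3, steps=1500):
    """Sketch of Algorithm 1. `belief_grad(mu, x_v)` should return
    (dg/dmu)^T (x_v - g(mu)) / Sigma_v; `goal_grad(mu, rho)` the term of Eq. (20)."""
    a = torch.zeros_like(mu)
    for _ in range(steps):
        x_v = read_camera()                       # line 3: resized visual sensation
        e_grad = belief_grad(mu, x_v)             # lines 4-7: one forward + backward pass
        mu_dot = k_sigma_v * e_grad
        a_dot = -e_grad * dt                      # line 8: action update
        if rho is not None:                       # lines 9-11: desired goal dynamics
            mu_dot = mu_dot + goal_grad(mu, rho)
        mu = mu + dt * mu_dot                     # line 12: first-order Euler integration
        a = a + dt * a_dot
        set_velocity(a)                           # line 13
    return mu, a
```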

IV. EXPERIMENTS

We tested PixelAI on both a simulated and a real Aldebaran NAO humanoid robot (Fig. 3). We used the left arm to test both the perception and the action schemes. The dataset and the code to replicate the experiments can be found at https://github.com/cansu97/PixelAI.

Fig. 3. Experimental setup in simulation (Gazebo) and with the real robot, and a subset of the goal images used for the benchmark.

A. Visual forward model

a) Data acquisition & preprocessing: The dataset used to train the model consisted of 3200 data samples of the left arm elbow and shoulder joint readings (q = [q_1, q_2, q_3, q_4]^T) and the observed images x_v obtained through NAO's bottom camera². Data samples were generated using three different methods and later concatenated (~25%, ~20% and ~55% for each method).

In the first method, the joint angles were randomly drawn from a uniform distribution over the range of the joint limits. Afterwards, the samples where the robot's arm was out of the camera frame were eliminated. The ratio of acquired images with the robot hand centered in the camera image was significantly lower than that of images with the hand located at the corners of the frame. To reduce this drawback, in the second method, the robot's arm was manually moved by an operator and the joint angle readings were recorded during these trajectories. This way, a subset of data was obtained where the robot hand was centered in the image. Finally, in the third method, a multivariate Gaussian was fit to the second subset using the expectation-maximization algorithm and random samples were drawn from this Gaussian for the third and final part of the dataset. The goal was to introduce randomness into the centered images and not be limited by the operator's choice of trajectories.

² In simulation, the color of the right arm was changed to dark grey to achieve contrast with the grey background in the camera images.

For the images collected in the Gazebo NAO simulator, the only preprocessing step performed was resizing the image from 640x480 to 80x56. For the real NAO, the images were obtained on a green background (Fig. 3) and the following preprocessing steps were performed: 1) median filtering with kernel size 11 on the original image, 2) masking the monochrome background (e.g., green) in the HSV color space and replacing it with dark gray to ensure contrast, 3) converting the image to grayscale, and 4) resizing the image to 80x56.
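A sketch of this real-robot preprocessing chain using OpenCV; the HSV thresholds for the green background are placeholders that would need tuning to the actual setup.

```python
import cv2
import numpy as np

def preprocess_real_image(bgr_image):
    """Median filter -> mask green background -> grayscale -> resize to 80x56."""
    img = cv2.medianBlur(bgr_image, 11)                       # kernel size 11
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # Hypothetical green range; the actual thresholds depend on the setup.
    mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))
    img[mask > 0] = (64, 64, 64)                              # replace with dark gray
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (80, 56)).astype(np.float32) / 255.0
```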

b) Training: The convolutional decoder was trained using the ADAM optimizer with mini-batches of 200 samples and an initial learning rate of α = 10^-4 with exponential decay of 0.95 every 5000 steps. Training was stopped after ca. 7000 iterations for the simulated NAO dataset and 12000 iterations for the real NAO dataset to avoid overfitting, as the test set error started to increase for the corresponding model. The output of the second fully-connected layer (FC2) was an 1120-dimensional vector that was reshaped into a 7x10x16 tensor. The UpConv layers all use a stride of 2 and a padding of 1. Moreover, a kernel size of 4x4 was chosen to avoid checkerboard artifacts due to uneven overlap [23]. Convolutional layers used kernel size 3, stride 1 and padding 1. The dropout probability was set to 0.15. The final layer outputs a 1x56x80 tensor corresponding to a grayscale image.
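A minimal training sketch consistent with the reported setup (Adam, mini-batches of 200, initial learning rate 1e-4 with exponential decay of 0.95 every 5000 steps, pixel-wise MSE loss); the dataset handling and function name are our assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_decoder(decoder, joint_angles, images, steps=7000):
    """joint_angles: (N, 4) tensor; images: (N, 1, 56, 80) tensor with values in [0, 1]."""
    loader = DataLoader(TensorDataset(joint_angles, images), batch_size=200, shuffle=True)
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5000, gamma=0.95)
    loss_fn = torch.nn.MSELoss()
    step = 0
    while step < steps:                     # the paper stops early based on test error
        for q, x in loader:
            opt.zero_grad()
            loss = loss_fn(decoder(q), x)   # pixel-wise reconstruction error
            loss.backward()
            opt.step()
            sched.step()
            step += 1
            if step >= steps:
                break
    return decoder
```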

A benchmark with three levels of difficulty was created to evaluate the performance of PixelAI on randomized samples for both perceptual and active inference. A set of 50 different cores (i.e., images of the arm) was generated by sampling the multivariate Gaussian distribution (see method 2 in Section IV-A(a)). A subset of the generated cores is shown in Fig. 3. For each of the cores, 10 different random tests were performed. In total, there were 2500 trials composed of 5 runs of 500 testing image arm poses per benchmark level³. The test samples for each core were generated differently depending on the benchmark level (a sketch of this sample-generation logic is given below):

• Level 1 (close similar poses): One of the 4 joints was chosen randomly and a random perturbation of ±[5°, 10°] sampled from a uniform distribution was added to the joint angle value to generate the new test sample.

• Level 2 (far similar poses): For all of the 4 joints, a random perturbation of ±[5°, 10°] was sampled from a uniform distribution and added to the core joint angles.

• Level 3 (random): For each core, 10 different cores were chosen randomly and used as the test samples.

³ For the real robot benchmark tests (perceptual inference), only a subset of 20 cores was used.
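The sample generation for the three benchmark levels could be written as follows (our reading of the description above; angles in degrees, hypothetical helper name).

```python
import numpy as np

def make_test_sample(cores, core_idx, level, rng):
    """Perturb a core pose (4 joint angles, degrees) according to the benchmark level."""
    q = cores[core_idx].copy()
    if level == 1:                                   # perturb one random joint by +/-[5, 10] deg
        j = rng.integers(4)
        q[j] += rng.choice([-1, 1]) * rng.uniform(5, 10)
    elif level == 2:                                 # perturb all four joints
        q += rng.choice([-1, 1], size=4) * rng.uniform(5, 10, size=4)
    else:                                            # level 3: another randomly chosen core
        others = [i for i in range(len(cores)) if i != core_idx]
        q = cores[rng.choice(others)].copy()
    return q

# Example: rng = np.random.default_rng(0); sample = make_test_sample(cores, 3, 1, rng)
```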

c) Perceptual Inference: In order to evaluate the body perception performance, the robot has to infer its real arm pose using only visual information. The robot's arm was initialized to each core pose and then 10 separate test runs were performed, where the internal belief was set to a perturbed value of the corresponding pose. These tests are static in nature, i.e., the change solely takes place in the internal predictions of the robot. The goal is that the robot's internal belief µ converges to the true arm position, which is equal to the joint angles of the chosen core.

d) Active Inference: In order to evaluate the combined perception and action performance, the core poses were treated as the desired goal (image), encoded as an attractor in the model. Again, for each core, 10 separate test runs were performed. In this case, the robot arm was initialized to a core pose and the initial internal belief was set to the current joint measurements: µ = q. In each test, the goal was that the robot reached the desired imagined arm pose. The update of the internal belief should generate an action that compensates for the mismatch between the current and the predicted visual sensations. In a successful test run, the robot arm moves to the imagined arm position and the internal belief also converges to the imagined joint angles, so that x_v = g(µ) = ρ.

e) Algorithm Parameters: The parameters of the PixelAI algorithm are Σ_v, β, K_Σv, γ_Σv and Δt, and were determined empirically⁴ (see Table I). The intuition behind the variance terms is as follows: the prediction errors get multiplied by the inverse of the variances, so these weigh the relevance of the corresponding sensory information error [20]. The β term, which is part of the attractor dynamics (see Eq. 18), essentially has the same effect and controls how much we want to push the internal belief in the direction of the attractor. The gain parameter K_Σv is added to allow the model to generate large action values without increasing the internal belief increments (see line 7 of Algorithm 1). For level 3 in perceptual inference, we used a smaller Σ_v until the visual prediction error was below a certain threshold (0.01). This introduces a new model parameter γ_Σv, which is used to scale Σ_v once the error threshold is reached. This heuristic form of adaptation helped speed up convergence for the more complex level 3 trajectories. The parameter Δt was set to 0.1 for all the perceptual inference tests and to 0.065 for the active inference tests. For active inference, the value of Δt was determined based on the internal time of the robot loop execution. Finally, the generated actions (velocity values) were clipped so that each joint could not move more than ±2° per time step.
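The variance-scheduling heuristic and the per-step velocity clipping can be expressed compactly; this is a sketch of our reading, with the 0.01 error threshold and the ±2° clip taken from the text.

```python
import numpy as np

def adapt_sigma_v(sigma_v, gamma_sigma_v, visual_mse, threshold=0.01):
    """Scale Sigma_v by gamma_Sigma_v once the visual prediction error drops below the threshold."""
    return sigma_v * gamma_sigma_v if visual_mse < threshold else sigma_v

def clip_action(a, dt, limit_deg=2.0):
    """Clip the joint velocity command so each joint moves at most +/- 2 degrees per step."""
    max_vel = np.deg2rad(limit_deg) / dt
    return np.clip(a, -max_vel, max_vel)
```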

V. RESULTS

A. Statistical analysis of perception and action in simulation

First, perceptual inference tests were run for 5000 time steps for all three levels. An example of the perceptual inference for each level is depicted in Fig. 5.

⁴ β and Σ_µ are combined into a single parameter β.

TABLE I
PixelAI parameters used for the perceptual and active inference benchmarks in simulation.

                        Σ_v       β         K_Σv     γ_Σv
Active Inference
  Level 1               6×10²     2×10⁻⁵    10⁻³     1
  Level 2               2×10²     5×10⁻⁵    10⁻³     1
  Level 3               20        5×10⁻⁴    10⁻³     1
Perceptual Inference
  Level 1               2×10⁴     -         -        1
  Level 2               2×10⁴     -         -        1
  Level 3               2×10³     -         -        10

For levels 1 and 2, the algorithm converged fast to the ground truth, while inferring the body location from a totally random initialization (level 3) raised the complexity considerably. Table II shows the average over all trials of the mean absolute joint errors (|q_true − µ|). Level 1 and 2 results converged successfully to the true internal belief values. Figure 4(b) shows the error during the optimization process and Figure 4(a) shows the visual prediction error. Shoulder pitch and shoulder roll angles were estimated with better accuracy than the elbow angles. This is due to the fact that a small change in the shoulder pitch angle yields a greater difference in the visual field than the same amount of change in the elbow roll angle. Since PixelAI achieves perception by minimizing the visual prediction error, the accuracy increases when the pixel-based difference is stronger. Therefore, the mean error and standard deviation increase for the elbow joint angle estimates.

The errors in level 3, where the robot had to converge to random arm locations, were larger compared to levels 1 and 2, as shown in Fig. 4(b). This is due to two reasons. The first is the local minima problem inherent to our gradient descent approach. The second concerns the desired joint position: several joint configurations yield a small visual prediction error, which increases the risk of getting stuck in a local minimum.

TABLE II
Perceptual inference in simulation: joint angle absolute error (mean ± std in degrees).

Level   Shoulder Pitch   Shoulder Roll   Elbow Yaw       Elbow Roll
1       0.26 ± 0.34      0.33 ± 0.41     0.85 ± 0.96     0.94 ± 1.06
2       0.39 ± 0.71      0.62 ± 0.98     1.19 ± 1.48     1.57 ± 2.18
3       5.32 ± 11.06     5.61 ± 7.68     17.64 ± 26.08   12.25 ± 15.41

TABLE III
Perceptual inference on the real robot: joint angle absolute error (mean ± std in degrees).

Level   Shoulder Pitch   Shoulder Roll   Elbow Yaw       Elbow Roll
1       1.33 ± 0.82      0.68 ± 0.77     1.57 ± 1.44     1.97 ± 2.03
2       1.86 ± 1.85      2.22 ± 3.04     2.94 ± 3.31     3.81 ± 3.32
3       9.80 ± 12.63     12.77 ± 10.91   29.35 ± 35.48   21.82 ± 17.23

Fig. 4. Perceptual inference results for all levels (1-3) of the benchmark. The visual prediction error (pixel-MSE) and the L2 norm of the error between the internal belief µ and s_p are plotted against the time step. The curves shown are the median values of all the test runs in the corresponding benchmark level, bounded by the upper and lower quartiles. (a) Simulation: visual prediction error. (b) Simulation: L2 norm of s_p − µ. (c) Real: visual prediction error. (d) Real: L2 norm of µ − s_p.

Fig. 5. Example of the internal trajectories of the latent space during the perceptual inference tests for the three levels of difficulty for core 3.

Secondly, active inference tests with goal images were performed using the simulated NAO for 1500 time steps on all benchmark levels. The results for all three benchmark levels are shown in Fig. 6. The joint encoder readings followed the internal belief values through the actions generated by free-energy optimization. The performance drop in level 3 shows that interacting is more complex than perceiving, as it includes the body and world constraints.

Fig. 6. Simulated NAO active inference test results for all three levels. The curves shown are the median values of all the test runs in the corresponding benchmark level, bounded by the upper and lower quartiles. (a) Visual error between the visual attractor ρ and the observed camera image s_v. (b) L2 norm of the error between the joint angles of the attractor position q_attr and the proprioceptive sensor readings s_p.

B. Active inference in the real robot

We tested the proposed algorithm on the real robot. Conversely to simulation, the robot's movements were imprecise due to the mechanical backlash in the actuators (±5°) [24]. We used the same network architecture and training procedure as in simulation. A low training error was achieved on the training dataset (MSE in pixel intensity: ca. 0.0015). The visual forward model was expected to model the more complex structure of the real robot hand, which is subject to lighting differences and has a reflective surface. Unlike in the simulator, the same conditions cannot be restored perfectly in the real world, so the model training is always subject to additional noise in the dataset.

The results of perceptual inference for the real NAO on all three benchmark levels are shown in Fig. 4 as well as in Table III. Similar perceptual convergence behaviour to the simulation results was found in levels 1 and 2, while level 3 had a larger error due to local minima. Figure 7 shows the PixelAI algorithm running on the robot. While body estimation converged smoothly, the real movements were unsmooth due to the velocity controller deployed on top of the built-in NAO position control, which introduced delays in the execution of the action commands. Direct access to the motor driver should resolve the mismatch between the internal error and the actual arm position that is visible in the plots.

VI. CONCLUSIONS

We have described a pixel-based deep active inference algorithm and applied it to robot body perception and action. We have shown that variational free-energy optimization can work as a general inner mechanism for both estimation and control. Our algorithm extends previous active inference works by tackling high-dimensional visual inputs and providing learning of the sensory generative models. This prediction error variant of control as inference [25] exploits the learnt representation to generate the actions indirectly, without a policy: the robot produces the actions that reach the desired goal in visual space without learning an explicit policy. Our algorithm enabled body estimation using a monocular camera input and performed goal-driven behaviour using imagined goals in the visual space. Statistical results showed convergence in both perception and action at different levels of difficulty, with a larger error when dealing with totally random arm poses. This neuroscience-inspired approach is thought to provide deeper interpretations than conventional engineering solutions [26], giving some grounding for novel machine learning developments, especially for body perception and action.

Fig. 7. PixelAI test on the real Nao. µ is the inferred state, q is the real joint angle readings, goal is the ground-truth goal angles, and 0.05 steps = 1 s. (Bottom row) Arm sequence: the goal image and the Nao visual input are superimposed.

Further work will focus on bringing this approach closer to biological plausibility. We will explore the integration of complex proprioceptive information within the optimization framework [27] by learning multimodal representations, and include hierarchies in the architecture to allow the robot to perceive its body in an abstracted way and not only in the pixel-based domain.

REFERENCES

[1] D. M. Wolpert, J. Diedrichsen, and J. R. Flanagan, "Principles of sensorimotor learning," Nature Reviews Neuroscience, vol. 12, no. 12, p. 739, 2011.
[2] Y. Yamada, H. Kanazawa, S. Iwasaki, Y. Tsukahara, O. Iwata, S. Yamada, and Y. Kuniyoshi, "An embodied brain model of the human foetus," Scientific Reports, vol. 6, 2016.
[3] P. Lanillos, E. Dean-Leon, and G. Cheng, "Yielding self-perception in robots through sensorimotor contingencies," IEEE Transactions on Cognitive and Developmental Systems, no. 99, pp. 1-1, 2016.
[4] G. Diez-Valencia, T. Ohashi, P. Lanillos, and G. Cheng, "Sensorimotor learning for artificial body perception," arXiv preprint arXiv:1901.09792, 2019.
[5] G. Oliver, P. Lanillos, and G. Cheng, "Active inference body perception and action for humanoid robots," arXiv preprint arXiv:1906.03022, 2019.
[6] M. Botvinick and J. Cohen, "Rubber hands 'feel' touch that eyes see," Nature, vol. 391, no. 6669, p. 756, 1998.
[7] N.-A. Hinz, P. Lanillos, H. Mueller, and G. Cheng, "Drifting perceptual patterns suggest prediction errors fusion rather than hypothesis selection: replicating the rubber-hand illusion on a robot," in 2018 Joint IEEE 8th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). IEEE, 2018, pp. 125-132.
[8] K. Doya, "What are the computations of the cerebellum, the basal ganglia and the cerebral cortex?" Neural Networks, vol. 12, no. 7-8, pp. 961-974, 1999.
[9] S. Hutchinson, G. D. Hager, and P. I. Corke, "A tutorial on visual servo control," IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651-670, 1996.
[10] C. G. Cifuentes, J. Issac, M. Wüthrich, S. Schaal, and J. Bohg, "Probabilistic articulated real-time tracking for robot manipulation," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 577-584, 2016.
[11] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334-1373, 2016.
[12] A. Tschantz, M. Baltieri, A. Seth, C. L. Buckley et al., "Scaling active inference," arXiv preprint arXiv:1911.10601, 2019.
[13] B. Millidge, "Deep active inference as variational policy gradients," Journal of Mathematical Psychology, vol. 96, p. 102348, 2020.
[14] K. J. Friston, "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, vol. 11, pp. 127-138, 2010.
[15] R. P. Rao and D. H. Ballard, "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects," Nature Neuroscience, vol. 2, no. 1, pp. 79-87, 1999.
[16] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, "Visual reinforcement learning with imagined goals," in Advances in Neural Information Processing Systems, 2018, pp. 9191-9200.
[17] G. E. Hinton and R. S. Zemel, "Autoencoders, minimum description length and Helmholtz free energy," in Advances in Neural Information Processing Systems, 1994, pp. 3-10.
[18] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1, pp. 1-305. [Online]. Available: http://www.nowpublishers.com/product.aspx?product=MAL&doi=2200000001
[19] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny, "Variational free energy and the Laplace approximation," NeuroImage, vol. 34, no. 1, pp. 220-234, 2007.
[20] C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth, "The free energy principle for action and perception: A mathematical review," Journal of Mathematical Psychology, 2017.
[21] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox, "Learning to generate chairs, tables and cars with convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 692-705, 2016.
[22] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," arXiv preprint arXiv:1603.07285, 2016.
[23] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, vol. 1, no. 10, p. e3, 2016.
[24] D. Gouaillier, C. Collette, and C. Kilner, "Omni-directional closed-loop walk for NAO," in 2010 10th IEEE-RAS International Conference on Humanoid Robots. IEEE, 2010, pp. 448-454.
[25] M. Toussaint, "Robot trajectory optimization using approximate inference," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1049-1056.
[26] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick, "Neuroscience-inspired artificial intelligence," Neuron, vol. 95, no. 2, pp. 245-258, 2017.
[27] T. Rood, M. van Gerven, and P. Lanillos, "A deep active inference model of the rubber-hand illusion," 2020.