End-to-End Pixel-Based Deep Active Inference for Body Perception and Action

Cansu Sancaktar
Technical University of Munich, Germany
cansu.sancaktar@tum.de

Marcel A. J. van Gerven
Donders Institute for Brain, Cognition and Behaviour, Radboud University, Netherlands
m.vangerven@donders.ru.nl

Pablo Lanillos
Donders Institute for Brain, Cognition and Behaviour, Radboud University, Netherlands
p.lanillos@donders.ru.nl
Abstract—We present a pixel-based deep active inference algorithm (PixelAI) inspired by human body perception and action. Our algorithm combines the free energy principle from neuroscience, rooted in variational inference, with deep convolutional decoders to scale the algorithm to directly deal with raw visual input and provide online adaptive inference. Our approach is validated by studying body perception and action in a simulated and a real Nao robot. Results show that our approach allows the robot to perform 1) dynamical body estimation of its arm using only monocular camera images and 2) autonomous reaching to “imagined” arm poses in visual space. This suggests that robot and human body perception and action can be efficiently solved by viewing both as an active inference problem guided by ongoing sensory input.
Index Terms—Active inference, Deep learning, Free energy op-
timization, Bio-inspired perception, Predictive coding, Robotics.
I. INTRODUCTION
Learning and adaptation are two core characteristics that allow humans to perform flexible whole-body dynamic estimation and robust actions in the presence of uncertainty [1]. We hypothesize that the human brain acquires a representation (model) of the body, starting at the earliest stages of life, by learning a mapping between tactile, proprioceptive and visual cues [2]. This cross-modal sensorimotor mapping is encoded [3], [4] in some sort of internal model that allows us to predict the sensory effect of our body in space and to deal with unexpected perturbations through reactive actions [5]. This mapping, as well as the perception of our body in space [6], [7], is flexible and depends on the interplay between top-down expectations and the current sensory inputs. Hence, unsupervised learning mechanisms are enhanced by online supervised adaptation [8].
In contrast, robots usually use a fixed rigid-body model where the arm end-effector is defined as a pose, i.e., a 3D point in space plus an orientation. Hence, any error in the model or change in the conditions will result in failure. Several solutions have been proposed to overcome this problem, usually separated into perception and control approaches.
This work has been supported by the SELFCEPTION project, EU Horizon 2020 Programme, grant nr. 741941. PixelAI code: https://github.com/cansu97/PixelAI.
Fig. 1. Pixel-based deep active inference (PixelAI). The robot infers its body (e.g., joint angles) by minimizing the visual prediction error, i.e., the discrepancy between the camera sensor value $x_v$ and the expected sensation $g(\mu)$ computed using a convolutional decoder. The error signal is used to update the internal belief to match the observed sensation (perceptual inference) and to generate an action $a$ that reduces the discrepancy between the observed and predicted sensations. Both are computed by minimizing the variational free-energy bound. (Diagram blocks: observed visual sensation from the bottom camera, predicted sensation via the decoder forward pass, visual prediction error, free-energy optimization via the backward pass, internal belief, action.)
For instance, by working in visual space (e.g., visual servoing [9]) we can exploit a set of invariant visual keypoints to provide control that incorporates real-world errors. Bayesian sensory fusion in combination with model-based fitting allows adaptation to sensory noise and model errors [10], and model-based active inference provides online adaptation in both action and perception [5]. Finally, learning approaches have shown that the difference between the model and reality can be overcome by optimizing the body parameters or by explicitly learning the policy for a task, e.g., through imitation learning or reinforcement learning (RL). Recently, model-free approaches, particularly deep RL, have demonstrated the potential of directly using raw images as input for learning visual control policies [11]. Alternatively, deep active inference has emerged as a competitive alternative, at least in toy problems [12], [13].
In this work, we introduce PixelAI, a novel pixel-based deep active inference algorithm, depicted in Fig. 1, which directly scales to high-dimensional inputs (e.g., images), provides adaptation and model-free learning, and, moreover, unifies perception and action into a single variational inference formulation. We combine the free energy principle (i.e., active inference) [14], which updates an internal model of the body through perception and action, with deep convolutional decoders that map internal beliefs to expected sensations.
the agent learns a generative model of the body to aid in
the construction of expected sensations from incoming partial
multimodal information. This information, instead of being
directly encoded, is processed by means of the error between
the predicted outcome and the current input. In this sense,
we formalize body perception and action as a consequence of
surprise minimization via prediction errors [14], [15].
Following this approach, the robot should learn a latent
representation (“state”) of the body and the relation between
its state and the expected sensations (“predictions”). These
predictions will be compared with the observed sensory data,
generating an error signal that can be propagated to refine the
belief that the robot has about its body state. Compensatory actions follow a similar principle and are exerted so that the observed sensations better correspond to the predictions made by the learnt models, giving the robot the capacity to actively adjust to online changes in the body and the environment (see Fig. 1). This also offers
a natural way to realize a plan by setting an imagined goal in
sensory space [16]. For example, when working in visual space
we can provide an image as a goal. By means of optimizing the
free-energy bound the agent then executes those actions that
minimize the discrepancy between expected sensations (our
goal) and observed sensations. Our approach was validated by
studying body perception and action in a simulated and a real
Nao robot. Results show that our approach allows the robot to
perform 1) dynamical body estimation of its arm using only
raw monocular camera images and 2) autonomous reaching to
arm poses provided by a goal image.
II. BACKGROUND
A. The free energy principle
We model body perception as inferring the body state $z$ based on the available sensory data $x$. Given a sensation $x$, the goal is to find $z$ such that the posterior $p(z|x) = p(x|z)p(z)/p(x)$ is maximized. However, computing the marginal likelihood $p(x)$ requires an integration over all possible body states, that is, $p(x) = \int_z p(x|z)p(z)\,dz$, which becomes intractable for large state spaces. The free energy [17], largely exploited in machine learning [18] and neuroscience [19], circumvents this problem by introducing a reference distribution (also called the recognition density) $q(z)$ with a known tractable form. The goal of the minimization problem hence becomes finding the reference distribution $q(z)$ that best approximates the posterior $p(z|x)$. For tractability purposes, this approximation is calculated by optimizing the variational free energy $F$, whose negative is also referred to as the evidence lower bound (ELBO).
$F$ can be defined as the Kullback-Leibler divergence $D_{KL}$ plus the negative log-evidence or sensory surprise $-\ln p(x)$:
$$F = D_{KL}\bigl(q(z)\,\|\,p(z|x)\bigr) - \ln p(x) = \int_z q(z)\ln\frac{q(z)}{p(z|x)}\,dz - \ln p(x), \qquad (1)$$
which, due to the non-negativity of $D_{KL}$, is an upper bound on surprise. Alternatively, we can use the identity $\ln p(x) = \int_z q(z)\ln p(x)\,dz$ to include the second term in the integral and write Equation (1) as
$$F = \int_z q(z)\ln\frac{q(z)}{p(x,z)}\,dz \qquad (2)$$
$$= -\int_z q(z)\ln p(x,z)\,dz + \int_z q(z)\ln q(z)\,dz. \qquad (3)$$
According to the free energy principle [14], both perception and action optimize the free energy and hence minimize surprise:
1) Perceptual inference: The agent updates its internal belief by approximating the conditional density (inference), maximizing the likelihood of the observed sensation:
$$z = \arg\min_z F(z, x). \qquad (4)$$
2) Active inference: The agent generates an action $a$ that results in a new sensory state $x(a)$ that is consistent with the current internal representation:
$$a = \arg\min_a F(z, x(a)). \qquad (5)$$
Under the Laplace approximation, the variational density can take the form of a Gaussian $q(z) = \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the conditional mode and $\Sigma$ is the covariance of the parameters. By incorporating this reference distribution in Equation (3), the free energy can be approximated as (see [19] for the full derivation)
$$F = -\ln p(x, \mu) - \frac{1}{2}\bigl(\ln|\Sigma| + n\ln 2\pi\bigr), \qquad (6)$$
where the first term is the joint density of the observed and the latent variables, with $\mu$ an $n$-dimensional state vector.
III. PIXEL-BASED DEEP ACTIVE INFERENCE
Our proposed PixelAI approach combines free energy
optimization with deep learning to directly work with images
as visual input. The optimization provides adaptation and the
neural network incorporates learning of high-dimensional input.
We frame and experimentally validate the proposed algorithm
in body perception and action in robots. Figure 1 visually
describes PixelAI. The agent first learns the approximate
generative forward models of the body, implemented here
as convolutional decoders. While interacting, the expected
sensation (predicted by the decoder) is compared with the real
visual input and the prediction error is used to 1) update the
belief and 2) generate actions. This is performed by means of
optimizing the variational free-energy bound.
A. Active inference model
We formalize body perception as inferring the unobserved body state $\mu$, e.g., the estimation of robot joint angles such as shoulder pitch and roll. We define the robot internal belief as an $n$-dimensional vector $\mu^{[d]} \in \mathbb{R}^n$ for each temporal order $d$. For instance, for first-order (velocity) generalized coordinates the belief is $\mu = \{\mu^{[0]}, \mu^{[1]}\}$. The observed variables $x$ are the visual sensory input $x_v$ and the external causal variables $\rho$: $x = \{x_v, \rho\}$. For instance, the robot has access to visual
information $x_v$ (an image of size $w \times h$) and proprioceptive information (the joint encoder values $q$). The causal variables $\rho$ are independent variables that produce effects in the world. Finally, let us define two generative models that describe the system: the sensory forward model $g$, which is the predictor that computes the sensory outcome $x_v$ given the internal state $\mu$, and the internal state dynamics of the body $f$. Both functions can be considered as approximations that the agent has of reality:
$$x_v = g(\mu) + w_v \qquad (7)$$
$$\mu^{[1]} = f(\mu, \rho) + w_\mu \qquad (8)$$
where $w_v, w_\mu$ are process noise terms, assumed to be drawn from multivariate normal distributions with zero mean and covariances $\Sigma_v$ and $\Sigma_\mu$, respectively.
In order to compute the variational free energy under the Laplace approximation from Equation (6) we need the joint density. Assuming independence of the observed variables:
$$\ln p(x,\mu) = \ln p(x_v, \rho, \mu) = \ln p(x_v|\mu) + \ln p(\mu^{[1]}|\mu^{[0]}, \rho) \qquad (9)$$
where $p(x_v|\mu)$ is the likelihood of having a visual sensation given the internal state and $p(\mu^{[1]}|\mu^{[0]}, \rho)$ is the transition dynamics of the latent variables (body state).
Body perception is then instantiated as computing the body state that minimizes the variational free energy. This can be performed through gradient optimization, $\partial F/\partial \mu$. Since the temporal difference $\mu^{[0]}_{t+1} - \mu^{[0]}_{t}$ is equal to the first-order dynamics $\mu^{[1]}$ at equilibrium, this term has to be included in the computation of $\dot{\mu}^{[0]}$ so that the gradient descent procedure reaches a stationary solution where the gradient $\partial F/\partial \mu$ vanishes at the optimum [20]. Hence,
$$\dot{\mu}^{[0]} - \mu^{[1]} = -\frac{\partial F}{\partial \mu} = \frac{\partial \ln p(x_v, \rho, \mu)}{\partial \mu} = \frac{\partial \ln p(x_v|\mu)}{\partial \mu} + \frac{\partial \ln p(\mu^{[1]}|\mu^{[0]}, \rho)}{\partial \mu} \qquad (10)$$
In order to compute the likelihoods, we assume that the observed image $x_v$ is noisy and follows a normal distribution with mean $g(\mu)$ and variance $\Sigma_v$. Considering that every pixel contributes independently, such that $\Sigma_v = \mathrm{diag}(\Sigma_{v_1}, \dots, \Sigma_{v_{h\cdot w}})$, the likelihood $p(x_v|\mu)$ is obtained as a product of independent Gaussians:
$$p(x_v|\mu) = \prod_{k=1}^{h\cdot w} \frac{1}{\sqrt{2\pi\Sigma_{v_k}}} \exp\!\left(-\frac{1}{2\Sigma_{v_k}}\bigl(x_{v_k} - g_k(\mu)\bigr)^2\right) \qquad (11)$$
Analogously, the density that defines the latent-variable dynamics is also assumed to be noisy and follows a normal distribution with mean $f(\mu,\rho)$ and variance $\Sigma_\mu$:
$$p(\mu^{[1]}|\mu^{[0]}, \rho) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\Sigma_{\mu_i}}} \exp\!\left(-\frac{1}{2\Sigma_{\mu_i}}\bigl(\mu^{[1]}_i - f_i(\mu,\rho)\bigr)^2\right) \qquad (12)$$
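As a brief intermediate step, spelled out here for clarity (it follows directly from Eqs. (11) and (12) and uses only quantities already defined above), taking logarithms turns these Gaussian products into precision-weighted sums of squared prediction errors; differentiating these sums with respect to $\mu$ is what yields Eq. (13) below:
$$\ln p(x_v|\mu) = -\sum_{k=1}^{h\cdot w}\frac{1}{2\Sigma_{v_k}}\bigl(x_{v_k}-g_k(\mu)\bigr)^2 + \text{const}, \qquad \ln p(\mu^{[1]}|\mu^{[0]},\rho) = -\sum_{i=1}^{n}\frac{1}{2\Sigma_{\mu_i}}\bigl(\mu^{[1]}_i-f_i(\mu,\rho)\bigr)^2 + \text{const}.$$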
Substituting the likelihoods and computing the partial derivatives in Equation (10), the body state is then given by the following differential equation:
$$\dot{\mu}^{[0]} = \mu^{[1]} + \underbrace{\frac{\partial g(\mu)^{T}}{\partial \mu}}_{\text{mapping}}\,\underbrace{\Sigma_v^{-1}}_{\text{precision}}\,\underbrace{(x_v - g(\mu))}_{\text{prediction error}} + \frac{\partial f(\mu,\rho)^{T}}{\partial \mu}\,\Sigma_\mu^{-1}\bigl(\mu^{[1]} - f(\mu,\rho)\bigr) \qquad (13)$$
For notational simplicity, hereinafter we denote the first and second summation terms of Eq. (9) as $-F_g$ and $-F_f$, respectively, and the second and third summation terms of Eq. (13) as $-\partial_\mu F_g$ and $-\partial_\mu F_f$.
The action is computed analogously. However, only the sensory information is a function of the action, $x(a)$, and therefore the action depends only on the free-energy term with sensory information, $F_g$:
$$\dot{a} = -\frac{\partial F_g}{\partial a} = -\frac{\partial F_g}{\partial x}\frac{\partial x}{\partial a} = -\frac{\partial x_v^{T}}{\partial a}\,\Sigma_v^{-1}(x_v - g(\mu)) = -\frac{\partial g(\mu)^{T}}{\partial \mu}\,\Delta t\,\Sigma_v^{-1}(x_v - g(\mu)) \qquad (14)$$
To derive the last equality, we have employed the same approximation as in [5], assuming that the actions are joint velocities. In a velocity-control scheme we can approximate the angle change between two time steps of each joint $j$ as $\partial q_j/\partial a_j = \Delta t$, because the target values of the joint encoders $q$ are computed as $q_{t+1} = q_t + \Delta t\, a_t$ and $\Delta t$ is a fixed value that defines the duration of each iteration. Then, assuming convergence of the sensation values at the equilibrium point, with $\mu \to q$ and $g(\mu) \to x_v$, the term $\partial x_v/\partial a$ can be computed using the following equation:
$$\frac{\partial x_v}{\partial a_j} = \frac{\partial x_v}{\partial q_j}\frac{\partial q_j}{\partial a_j} = \frac{\partial g(\mu)}{\partial \mu_j}\frac{\partial \mu_j}{\partial a_j} = \frac{\partial g(\mu)}{\partial \mu_j}\,\Delta t \qquad (15)$$
The update rules for both $\mu$ and $a$ are finally calculated with first-order Euler integration:
$$\mu_{t+1} = \mu_t + \Delta t\,\dot{\mu}, \qquad a_{t+1} = a_t + \Delta t\,\dot{a} \qquad (16)$$
B. Scaling up with deep learning
In order to perform pixel-based free-energy optimization we compute Equations (13) and (14) by exploiting the forward and backward pass properties of the deep neural network. We approximate the visual generative model $g(\mu)$ and its partial derivative with respect to the internal state, $\partial g(\mu)/\partial \mu$, by means of a convolutional decoder.
1) Prediction of the expected sensation: We approximate the forward model $g(\mu)$ by means of a generative network, based on the architecture proposed in [21] and described in Fig. 2. It outputs the predicted image given the robot's $n$-dimensional internal belief of the body state $\mu$, e.g., the joint angles of the robot.
The input goes through two fully-connected layers (FC1 and FC2). Afterwards, a transposed convolution (UpConv) is performed to upsample the image.
Fig. 2. Network architecture of the convolutional decoder (FC: fully-connected layer, R: reshape operator, UpConv: transposed convolution, Conv: convolution). Layer output sizes include 512 and 1120 neurons (reshaped to 7×10×16), feature maps of 14×20×64 and 28×40×16, and a final 56×80 output image.
This deconvolution uses the input values as weights on the filter kernels and can be regarded as a backward pass of the standard convolution operator [22].
Following [21], each transposed convolution layer was followed
by a standard convolutional layer, which helps to smooth
out the potential artifacts from the transposed convolution
step. There is an additional 1D-Dropout layer before the last
transposed convolution layer to avoid overfitting and achieve
better generalization performance. All layers use the rectified
linear unit (ReLU) as the activation function, except for the last
layer, where a sigmoid function was used to get pixel intensity
values in the range
[0,1]
. Throughout the consecutive UpConv-
Conv operations in the network, the number of channels is
increased and decreased again to get the required output image
size.
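To make the architecture concrete, a minimal PyTorch sketch of such a decoder is given below. It follows the layer sequence of Fig. 2 and the hyperparameters reported in Section IV (UpConv: kernel 4, stride 2, padding 1; Conv: kernel 3, stride 1, padding 1; dropout 0.15; sigmoid output); the class name `ConvDecoder` and the exact channel ordering of the intermediate feature maps are our reading of the figure, not taken from the released code.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Sketch of the visual forward model g(mu): joint angles -> predicted image."""
    def __init__(self, n_joints: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_joints, 512), nn.ReLU(),
            nn.Linear(512, 1120), nn.ReLU(),          # reshaped to 16 x 7 x 10
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(16, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 14 x 20
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 28 x 40
            nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Dropout(p=0.15),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),              # 56 x 80
            nn.Sigmoid(),                              # pixel intensities in [0, 1]
        )

    def forward(self, mu: torch.Tensor) -> torch.Tensor:
        h = self.fc(mu).view(-1, 16, 7, 10)
        return self.deconv(h)
```

A forward pass on a batch of belief vectors, e.g. `ConvDecoder()(torch.zeros(1, 4))`, returns a 1×1×56×80 predicted image.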
2) Backward pass and mapping to the latent variable: An essential term for computing both perception and action is the mapping between the error in the sensory space and the inferred variables, $\partial g(\mu)/\partial \mu$. This term is calculated by performing a backward pass over the convolutional decoder. In fact, we can compute the whole partial derivative of the visual input term $\partial F_g/\partial \mu$ in just one forward and one backward pass: multiplying the prediction error between the expected and observed sensation, $(x_v - g(\mu))$, by the inverse variance and the partial derivative is equivalent to applying the backpropagation algorithm. It is important to note that when the function $g(\mu)$ outputs images of size $w \times h$, $\partial g(\mu)/\partial \mu$ is a three-dimensional tensor. We stack the output into a vector in $\mathbb{R}^{w\cdot h}$ (row major).
The following equation is obtained:
$$\frac{\partial F_g}{\partial \mu} = \underbrace{\begin{bmatrix} \frac{\partial g_{1,1}}{\partial \mu_1} & \frac{\partial g_{1,2}}{\partial \mu_1} & \cdots & \frac{\partial g_{w,h}}{\partial \mu_1} \\ \frac{\partial g_{1,1}}{\partial \mu_2} & \frac{\partial g_{1,2}}{\partial \mu_2} & \cdots & \frac{\partial g_{w,h}}{\partial \mu_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial g_{1,1}}{\partial \mu_4} & \frac{\partial g_{1,2}}{\partial \mu_4} & \cdots & \frac{\partial g_{w,h}}{\partial \mu_4} \end{bmatrix}}_{\left(\frac{\partial g}{\partial \mu}\right)^{T}} \underbrace{\begin{bmatrix} \frac{\partial F_g}{\partial g_{1,1}} \\ \frac{\partial F_g}{\partial g_{1,2}} \\ \vdots \\ \frac{\partial F_g}{\partial g_{w,h}} \end{bmatrix}}_{\frac{\partial F_g}{\partial g}} \qquad (17)$$
where $-\frac{\partial F_g}{\partial g_{i,l}}$ is given by $\frac{1}{\Sigma_{v_{i,l}}}(x_{v_{i,l}} - g_{i,l}(\mu))$. The action is computed by reusing this term and multiplying it by $\Delta t$.
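In an automatic-differentiation framework this vector-Jacobian product need not be formed explicitly. The sketch below is our illustration (it uses PyTorch's `torch.autograd.grad`, not a function from the released code) of how $\partial F_g/\partial \mu$ is obtained with a single forward and backward pass through the decoder; Eq. (13) then adds the negative of this gradient to $\dot{\mu}$.

```python
import torch

def visual_free_energy_grad(decoder, mu, x_v, sigma_v):
    """Return dF_g/dmu via one forward and one backward pass over the decoder.

    sigma_v may be a scalar or a per-pixel variance tensor broadcastable to x_v.
    """
    mu = mu.clone().requires_grad_(True)
    g = decoder(mu)                                   # forward pass: predicted image g(mu)
    # F_g (up to constants) is the precision-weighted squared prediction error.
    F_g = 0.5 * ((x_v - g) ** 2 / sigma_v).sum()
    # Backpropagating F_g through g multiplies the pixel errors by (dg/dmu)^T, as in Eq. (17).
    grad_mu, = torch.autograd.grad(F_g, mu)
    # grad_mu equals -(dg/dmu)^T (x_v - g)/sigma_v; the perception update uses its negative.
    return grad_mu
```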
C. Formalizing the task with the brain variable dynamics
In active inference we include the goal as a prior in the body state dynamics function $f(\mu,\rho)$. For example, to perform a reaching task, we encode the desired goal of the robot as an instance in the sensory space (an image), which acts as an attractor generated by the causal variable $\rho$. This produces an error in the inferred state that promotes an action towards the goal. Note that the error $\rho - g(\mu)$ is zero when the prediction matches the desired goal. We define the body state dynamics with a causal-variable attractor as:
$$f(\mu,\rho) = T(\mu)\,\beta\,(\rho - g(\mu)), \qquad (18)$$
where $\beta$ is a gain parameter that defines the intensity of the attractor and $T(\mu) = \partial g(\mu)^{T}/\partial \mu$ is the mapping from the sensory space (e.g., the pixel domain) to the internal belief $\mu$ (e.g., joint space). Note that this term is obtained through the backward pass of the decoder. Finally, substituting the new dynamics generative model into Eq. (13), we write the last term $\partial F_f/\partial \mu$ as:
$$-\frac{\partial F_f}{\partial \mu} = \frac{\partial f(\mu,\rho)^{T}}{\partial \mu}\,\Sigma_\mu^{-1}\left(\mu^{[1]} - \frac{\partial g(\mu)^{T}}{\partial \mu}\,\beta\,(\rho - g(\mu))\right) \qquad (19)$$
In the final model used in the experiments, we further simplify this equation by not including the first-order internal dynamics in the optimization process ($\mu^{[1]} = 0$) and by noting that the correct mapping and direction from the sensory space to the latent variable is already provided by $\partial g(\mu)^{T}/\partial \mu$. Thus, we greedily approximate $\partial f(\mu,\rho)/\partial \mu$ by $-1$, avoiding the Hessian computation of $T$ at the cost of a less exact optimization. With these assumptions, the partial derivative of the dynamics term becomes:
$$-\frac{\partial F_f}{\partial \mu} = \Sigma_\mu^{-1}\,\frac{\partial g(\mu)^{T}}{\partial \mu}\,\beta\,(\rho - g(\mu)) = \Sigma_\mu^{-1} f(\mu,\rho) \qquad (20)$$
D. PixelAI algorithm
Algorithm 1 summarizes the proposed method. In the robot body perception and action application, $x_v$ is set to the image provided by the robot's monocular camera and the decoder input is the internal belief (e.g., the estimated joint angles). The convolutional decoder is trained using the proprioceptive information (joint angle encoders), yielding a predictor of the visual forward model. The prediction error $e_v$ is the difference between the expected visual sensation and the observation (line 6). The variational free-energy optimization, for perception (line 7) and action (line 8), updates the differential equations that drive the state estimation and control. Finally, with the dynamics term we add the possibility of providing desired goals in the visual space (line 11). Although this implementation assumes $\mu^{[1]} = 0$, it is straightforward to add the first-order dynamics when velocity image information or joint encoders are available [5].
¹ The gain parameter $K_{\Sigma_v}$ is added to allow the model to generate large action values without increasing the internal belief increments.
Algorithm 1 PixelAI: Deep Active Inference Algorithm
Require: $\Sigma_v, \Sigma_\mu, \beta, \Delta t$
1: $\mu \leftarrow$ initial joint angle estimation
2: while (true) do
3:   $x_v \leftarrow$ Resize(camera image)  ▷ visual sensation
4:   $g(\mu) \leftarrow$ ConvDecoder.forward($\mu$)
5:   $\partial g \leftarrow$ ConvDecoder.backward($\mu$)
6:   $e_v = (x_v - g(\mu))$  ▷ prediction error
7:   $\dot{\mu} = K_{\Sigma_v}\,\partial g^{T} e_v / \Sigma_v$¹
8:   $\dot{a} = -(\partial g^{T} e_v / \Sigma_v)\,\Delta t$
9:   if $\exists \rho$ then  ▷ desired goal $\rho$ dynamics
10:    $e_f = \beta(\rho - g(\mu))$
11:    $\dot{\mu} = \dot{\mu} + \partial g^{T} e_f / \Sigma_\mu$
12:  $\mu = \mu + \Delta t\,\dot{\mu}$;  $a = a + \Delta t\,\dot{a}$  ▷ first-order Euler integration
13:  SetVelocityController($a$)
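For readers who prefer code, a minimal Python/PyTorch sketch of one iteration of this loop is given below. It mirrors Algorithm 1 under our own naming (`decoder` and the returned values stand in for the robot interface; none of these names come from the released code), and uses autograd to obtain the $\partial g^{T} e$ products of lines 7, 8 and 11.

```python
import torch

def pixelai_step(decoder, mu, a, x_v, sigma_v, sigma_mu, beta, K_sigma_v, dt, rho=None):
    """One iteration of Algorithm 1 (sketch). mu, a: current belief and action."""
    mu = mu.clone().requires_grad_(True)
    g = decoder(mu)                                  # line 4: forward pass
    e_v = x_v - g                                    # line 6: prediction error
    # Lines 5/7: dg^T e_v via one backward pass (vector-Jacobian product).
    dg_T_ev, = torch.autograd.grad(g, mu, grad_outputs=e_v, retain_graph=rho is not None)
    mu_dot = K_sigma_v * dg_T_ev / sigma_v           # line 7: perception update
    a_dot = -(dg_T_ev / sigma_v) * dt                # line 8: action update
    if rho is not None:                              # lines 9-11: goal attractor
        e_f = beta * (rho - g)
        dg_T_ef, = torch.autograd.grad(g, mu, grad_outputs=e_f)
        mu_dot = mu_dot + dg_T_ef / sigma_mu
    mu_new = (mu + dt * mu_dot).detach()             # line 12: Euler integration
    a_new = a + dt * a_dot
    return mu_new, a_new                             # line 13: a_new goes to the velocity controller
```

In the actual runs the resulting joint velocities are additionally clipped (Section IV) before being sent to the controller.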
IV. EXPERIMENTS
We tested PixelAI on both a simulated and a real Aldebaran NAO humanoid robot (Fig. 3). We used the left arm to test both the perception and action schemes. The dataset and the code to replicate the experiments can be found at https://github.com/cansu97/PixelAI.
Fig. 3. Experimental setup in simulation (Gazebo) and with the real robot,
and a subset of the goal images used for the benchmark.
A. Visual forward model
a) Data acquisition & preprocessing: The dataset used to train the model consisted of 3200 data samples of the left-arm elbow and shoulder joint readings ($q = [q_1, q_2, q_3, q_4]^T$) and the observed images $x_v$ obtained through NAO's bottom camera.² Data samples were generated using three different methods and later concatenated (∼25%, ∼20% and ∼55% of the dataset, respectively).
In the first method, the joint angles were randomly drawn from a uniform distribution within the joint limits. Afterwards, the samples where the robot's arm was out of the camera frame were eliminated. The ratio of acquired images with the robot hand centered in the camera image was significantly lower than that of images with the hand located at the corners of the frame. To mitigate this imbalance, in the second method the robot's arm was manually moved by an operator and the joint angle readings were recorded along these trajectories. This way, a subset of data was obtained in which the robot hand was centered in the image. Finally, in the third method, a multivariate Gaussian was fit to the second subset using the expectation-maximization algorithm and random samples were drawn from this Gaussian for the third and final part of the dataset. The goal was to introduce randomness into the centered images and not be limited to the operator's choice of trajectories.
² In simulation, the color of the right arm was changed to dark grey to achieve contrast with the grey background in the camera images.
For the images collected in the Gazebo NAO simulator, the only preprocessing step was resizing the 640×480 images to 80×56. For the real NAO, the images were obtained against a green background (Fig. 3) and the following preprocessing steps were performed: 1) median filtering with kernel size 11 on the original image, 2) masking the monochrome (green) background in the HSV color space and replacing it with dark gray to ensure contrast, 3) converting the image to grayscale, and 4) resizing the image to 80×56.
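A possible implementation of this pipeline with OpenCV is sketched below; the function name and the HSV thresholds for the green background are illustrative assumptions (the paper does not report them) and would need tuning to the actual setup.

```python
import cv2
import numpy as np

def preprocess_real_nao(img_bgr,
                        hsv_low=(35, 60, 60), hsv_high=(85, 255, 255)):  # illustrative green range
    """Preprocess a real-NAO camera frame as described in Sec. IV-A (sketch)."""
    filtered = cv2.medianBlur(img_bgr, 11)                 # 1) median filter, kernel 11
    hsv = cv2.cvtColor(filtered, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_low), np.array(hsv_high))
    filtered[mask > 0] = (64, 64, 64)                      # 2) replace green background with dark gray
    gray = cv2.cvtColor(filtered, cv2.COLOR_BGR2GRAY)      # 3) grayscale
    small = cv2.resize(gray, (80, 56))                     # 4) resize to 80x56 (width x height)
    return small.astype(np.float32) / 255.0                # pixel intensities in [0, 1]
```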
b) Training: The convolutional decoder was trained with the ADAM optimizer using mini-batches of 200 samples and an initial learning rate of $\alpha = 10^{-4}$ with an exponential decay of 0.95 every 5000 steps. Training was stopped after ca. 7000 iterations for the simulated NAO dataset and 12000 iterations for the real NAO dataset to avoid overfitting, as the test-set error of the corresponding model started to increase. The output of the second fully-connected layer (FC2) was a 1120-dimensional vector that was reshaped into a 7×10×16 tensor. The UpConv layers all use a stride of 2 and a padding of 1; moreover, a kernel of size 4×4 was chosen to avoid checkerboard artifacts due to uneven overlap [23]. Convolutional layers used kernel size 3, stride 1 and padding 1. The dropout probability was set to 0.15. The final layer outputs a 1×56×80 tensor corresponding to a grayscale image.
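A minimal training loop consistent with these settings is sketched below (our illustration; the data-loading scheme, the pixel-wise MSE loss and the fixed step budget are assumptions based on the reported error metric and stopping points, not details quoted from the released code).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_decoder(decoder, joints, images, max_steps=7000, device="cpu"):
    """joints: (N, 4) float tensor of joint angles; images: (N, 1, 56, 80) targets in [0, 1]."""
    loader = DataLoader(TensorDataset(joints, images), batch_size=200, shuffle=True)
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5000, gamma=0.95)  # 0.95 decay / 5000 steps
    decoder.to(device).train()
    step = 0
    while step < max_steps:              # stopped around 7k (sim) / 12k (real) iterations in the paper
        for q, x in loader:
            pred = decoder(q.to(device))
            loss = torch.nn.functional.mse_loss(pred, x.to(device))  # pixel-intensity MSE
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            step += 1
            if step >= max_steps:
                break
    return decoder
```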
A benchmark with three levels of difficulty was created to evaluate the performance of PixelAI on randomized samples for both perceptual and active inference. A set of 50 different cores (i.e., images of the arm) was generated by sampling the multivariate Gaussian distribution (see the second data-acquisition method in Section IV-A). A subset of the generated cores is shown in Fig. 3. For each of the cores, 10 different random tests were performed. In total, there were 2500 trials, composed of 5 runs of 500 test image arm poses per benchmark level.³ The test samples for each core were generated differently depending on the benchmark level:
• Level 1 (close similar poses): One of the 4 joints was chosen randomly and a random perturbation of ±[5°, 10°], sampled from a uniform distribution, was added to its joint angle value to generate the new test sample.
• Level 2 (far similar poses): For all of the 4 joints, a random perturbation of ±[5°, 10°] was sampled from a uniform distribution and added to the core joint angles (a sampling sketch is given after this list).
• Level 3 (random): For each core, 10 different cores were chosen randomly and used as the test samples.
³ For the real robot benchmark tests (perceptual inference) only a subset of 20 cores was used.
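The level-1 and level-2 perturbations can be reproduced with a short sketch such as the following (our illustration; the function name and the use of radians are assumptions, the paper only specifies the ±[5°, 10°] uniform range):

```python
import numpy as np

def perturb_core(q_core, level, rng=None):
    """Generate a Level-1 or Level-2 benchmark test pose from a core pose (4 joint angles, radians)."""
    rng = rng or np.random.default_rng()
    # Uniform magnitude in [5, 10] degrees with a random sign, per joint.
    delta = np.deg2rad(rng.uniform(5.0, 10.0, size=4)) * rng.choice([-1.0, 1.0], size=4)
    q = np.asarray(q_core, dtype=float).copy()
    if level == 1:        # Level 1: perturb one randomly chosen joint
        j = rng.integers(0, 4)
        q[j] += delta[j]
    else:                 # Level 2: perturb all four joints
        q += delta
    return q              # Level 3 instead reuses other randomly chosen cores as targets
```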
c) Perceptual inference: In order to evaluate the body perception performance, the robot has to infer its real arm pose using only visual information. The robot's arm was initialized to each core pose and then 10 separate test runs were performed, in which the internal belief was set to a perturbed value of the corresponding pose. These tests are static in nature, i.e., the change takes place solely in the internal predictions of the robot. The goal is that the robot's internal belief $\mu$ converges to the true arm position, which is equal to the joint angles of the chosen core.
d) Active inference: In order to evaluate the combined perception and action performance, the core poses were treated as desired goals (images) encoded as attractors in the model. Again, 10 separate test runs were performed for each core. In this case, the robot arm was initialized to a core pose and the initial internal belief was set to the current joint measurements, $\mu = q$. In each test, the goal was for the robot to reach the desired imagined arm pose. The update of the internal belief should generate an action that compensates for the mismatch between the current and the predicted visual sensations. In a successful test run, the robot arm moves to the imagined arm position and the internal belief converges to the imagined joint angles, so that $x_v = g(\mu) = \rho$.
e) Algorithm parameters: The parameters of the PixelAI algorithm are $\Sigma_v, \beta, K_{\Sigma_v}, \gamma_{\Sigma_v}$ and $\Delta t$, and were determined empirically⁴ (see Table I). The intuition behind the variance terms is as follows: the prediction errors get multiplied by the inverse of the variances, so these actually weigh the relevance of the corresponding sensory information error [20]. The $\beta$ term, which is part of the attractor dynamics (see Eq. (18)), essentially has the same effect and controls how strongly we push the internal belief in the direction of the attractor. The gain parameter $K_{\Sigma_v}$ is added to allow the model to generate large action values without increasing the internal belief increments (see line 7 of Algorithm 1). For level 3 in perceptual inference, we used a smaller $\Sigma_v$ until the visual prediction error was below a certain threshold (0.01). This introduces a new model parameter $\gamma_{\Sigma_v}$, which is used to scale $\Sigma_v$ once the error threshold is reached. This heuristic adaptation helped speed up the convergence for the more complex level-3 trajectories. The parameter $\Delta t$ was set to 0.1 for all the perceptual inference tests and to 0.065 for the active inference tests. For active inference, the value of $\Delta t$ was determined based on the internal time of the robot loop execution. Finally, the generated actions (velocity values) were clipped so that each joint could not move more than [−2°, 2°] each time step.
⁴ $\beta$ and $\Sigma_\mu$ are combined into a single parameter $\beta$.
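For reference, the settings of Table I and the clipping described above can be collected into a configuration such as the following sketch (the dictionary layout and names are ours, not from the released code):

```python
# Active inference parameters per benchmark level (Sigma_v, beta, K_Sigma_v, gamma_Sigma_v), from Table I.
ACTIVE_INFERENCE_PARAMS = {
    1: {"sigma_v": 6e2, "beta": 2e-5, "K_sigma_v": 1e-3, "gamma_sigma_v": 1},
    2: {"sigma_v": 2e2, "beta": 5e-5, "K_sigma_v": 1e-3, "gamma_sigma_v": 1},
    3: {"sigma_v": 20.0, "beta": 5e-4, "K_sigma_v": 1e-3, "gamma_sigma_v": 1},
}
# Perceptual inference uses only Sigma_v and gamma_Sigma_v (levels 1-3).
PERCEPTUAL_INFERENCE_PARAMS = {
    1: {"sigma_v": 2e4, "gamma_sigma_v": 1},
    2: {"sigma_v": 2e4, "gamma_sigma_v": 1},
    3: {"sigma_v": 2e3, "gamma_sigma_v": 10},
}
DT_PERCEPTION, DT_ACTION = 0.1, 0.065      # integration step per test type
ACTION_CLIP_DEG = 2.0                      # joint velocities clipped to +/- 2 degrees per step
```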
V. RESULTS

TABLE I
PIXELAI PARAMETERS USED FOR THE PERCEPTUAL AND ACTIVE INFERENCE BENCHMARKS IN SIMULATION.
                        Σ_v      β        K_Σv    γ_Σv
Active inference
  Level 1               6×10²    2×10⁻⁵   10⁻³    1
  Level 2               2×10²    5×10⁻⁵   10⁻³    1
  Level 3               20       5×10⁻⁴   10⁻³    1
Perceptual inference
  Level 1               2×10⁴    -        -       1
  Level 2               2×10⁴    -        -       1
  Level 3               2×10³    -        -       10

A. Statistical analysis of perception and action in simulation
First, perceptual inference tests were run for 5000 time steps for all 3 levels. An example of the perceptual inference for
each level is depicted in Fig. 5. For levels 1 and 2, the algorithm converged quickly to the ground truth, while inferring the body location from a totally random initialization (level 3) raised the complexity considerably. Table II shows the mean absolute joint errors ($|q_{true} - \mu|$) averaged over all trials. Level 1 and 2 results converged successfully to the true internal belief values. Figure 4(b) shows the error during the optimization process and Figure 4(a) shows the visual prediction error. Shoulder pitch and shoulder roll angles were estimated with better accuracy than the elbow angles. This is because a small change in the shoulder pitch angle yields a greater difference in the visual field than the same amount of change in the elbow roll angle. Since PixelAI achieves perception by minimizing the visual prediction error, accuracy increases when the pixel-based difference is stronger. Therefore, the mean error and standard deviation increase for the elbow joint angle estimates.
The errors in level 3, where the robot had to converge to random arm locations, were larger compared to levels 1 and 2, as shown in Fig. 4(b). This is due to two reasons. The first is the local-minima problem inherent to our gradient descent approach. The second concerns the desired joint position: several joint configurations yield a small visual prediction error, increasing the risk of converging to a local minimum.
TABLE II
PERCEPTUAL INFERENCE IN SIMULATION: JOINT ANGLE ABSOLUTE ERROR (MEAN ± STD, IN DEGREES).
Level  Shoulder Pitch  Shoulder Roll  Elbow Yaw      Elbow Roll
1      0.26 ± 0.34     0.33 ± 0.41    0.85 ± 0.96    0.94 ± 1.06
2      0.39 ± 0.71     0.62 ± 0.98    1.19 ± 1.48    1.57 ± 2.18
3      5.32 ± 11.06    5.61 ± 7.68    17.64 ± 26.08  12.25 ± 15.41
TABLE III
PERCEPTUAL INFERENCE ON THE REAL ROBOT: JOINT ANGLE ABSOLUTE ERROR (MEAN ± STD, IN DEGREES).
Level  Shoulder Pitch  Shoulder Roll  Elbow Yaw      Elbow Roll
1      1.33 ± 0.82     0.68 ± 0.77    1.57 ± 1.44    1.97 ± 2.03
2      1.86 ± 1.85     2.22 ± 3.04    2.94 ± 3.31    3.81 ± 3.32
3      9.80 ± 12.63    12.77 ± 10.91  29.35 ± 35.48  21.82 ± 17.23
Fig. 4. Perceptual inference results for all benchmark levels (1-3). Panels: (a) simulation, visual prediction error MSE$(s_v - g(\mu))$; (b) simulation, L2 norm $\|s_p - \mu\|_2$ (radians); (c) real robot, visual prediction error; (d) real robot, L2 norm of $\mu - s_p$; all plotted over time steps. The curves shown are the median values of all the test runs in the corresponding benchmark level, bounded by the upper and lower quartiles.
Fig. 5. Example of the internal trajectories of the latent space during the
perceptual inference tests for three different levels of difficulty for core 3.
Secondly, active inference tests with goal images were performed using the simulated NAO for 1500 time steps at all benchmark levels. The results for all three benchmark levels are shown in Fig. 6. The joint encoder readings followed the internal belief values through the actions generated by the free-energy optimization. The drop in level-3 performance shows that interacting is more complex than perceiving, as it involves the constraints of the body and the world.
Fig. 6. Simulated NAO active inference test results for all three levels. Panels: (a) visual error MSE$(\rho - s_v)$ between the visual attractor $\rho$ and the observed camera image $s_v$; (b) L2 norm $\|s_p - q_{attr}\|_2$ (radians) of the error between the joint angles of the attractor position $q_{attr}$ and the proprioceptive sensor readings $s_p$; both plotted over time steps. The curves shown are the median values of all the test runs in the corresponding benchmark level, bounded by the upper and lower quartiles.
B. Active inference in the real robot
We tested the proposed algorithm on the real robot. In contrast to simulation, the robot's movements were imprecise due to the mechanical backlash in the actuators (±5°) [24]. We used the same network architecture and training procedure as in simulation. A low training error was achieved on the training dataset (pixel-intensity MSE of ca. 0.0015). The visual forward model had to capture the more complex structure of the real robot hand, which is subject to lighting differences and has a reflective surface. Unlike in the simulator, the same conditions cannot be restored perfectly in the real world, so the model training is always subject to additional noise in the dataset.
The results of the perceptual inference for the real NAO on all 3 benchmark levels are shown in Fig. 4 as well as in Table III. Similar perceptual convergence behaviour to the simulation results was found for levels 1 and 2, while level 3 had a larger error due to local minima. Figure 7 shows the PixelAI algorithm running on the robot. While the body estimation converged smoothly, the real movements were unsmooth due to the velocity controller being deployed on top of the built-in NAO position control, which introduced delays in the execution of the action commands. Direct access to the motor driver should solve the mismatch between the internal error and the actual arm position that is visible in the plots.
VI. CONCLUSIONS
We have described a pixel-based deep active inference algorithm and applied it to robot body perception and action. We have shown that variational free-energy optimization can work as a general inner mechanism for both estimation and control. Our algorithm extends previous active inference works by tackling high-dimensional visual inputs and by learning the sensory generative models. This prediction-error variant of control as inference [25] exploits the learnt representation to indirectly generate the actions without a policy: the robot produces the actions to reach the desired goal in the visual space without learning an explicit policy. Our algorithm enabled body estimation using a monocular camera input and performed goal-driven behaviour using imaginary goals in the visual space. Statistical results showed convergence in both perception and action at different levels of difficulty, with a larger error when dealing with totally random arm poses. This neuroscience-inspired approach is intended to provide deeper interpretations than conventional engineering solutions [26], giving some grounding for novel machine learning developments, especially for body perception and
Fig. 7. PixelAI test on the real NAO. $\mu$ is the inferred state, $q$ are the real joint angle readings, "goal" is the ground-truth goal angles, and 0.05 steps = 1 s. (Bottom row) Arm sequence: the goal image and the NAO visual input are superimposed.
action. Further work will focus on bringing this approach closer to biological plausibility. We will explore the integration of complex proprioceptive information within the optimization framework [27] by learning multimodal representations, and will include hierarchies in the architecture to allow the robot to perceive its body in an abstract way and not only in the pixel-based domain.
REFERENCES
[1]
D. M. Wolpert, J. Diedrichsen, and J. R. Flanagan, “Principles of
sensorimotor learning,” Nature Reviews Neuroscience, vol. 12, no. 12, p.
739, 2011.
[2]
Y. Yamada, H. Kanazawa, S. Iwasaki, Y. Tsukahara, O. Iwata, S. Yamada,
and Y. Kuniyoshi, “An embodied brain model of the human foetus,”
Scientific Reports, vol. 6, 2016.
[3]
P. Lanillos, E. Dean-Leon, and G. Cheng, “Yielding self-perception in
robots through sensorimotor contingencies,” IEEE Trans. on Cognitive
and Developmental Systems, no. 99, pp. 1–1, 2016.
[4]
G. Diez-Valencia, T. Ohashi, P. Lanillos, and G. Cheng, “Sensorimotor
learning for artificial body perception,” arXiv preprint arXiv:1901.09792,
2019.
[5]
G. Oliver, P. Lanillos, and G. Cheng, “Active inference body perception
and action for humanoid robots,” arXiv preprint arXiv:1906.03022, 2019.
[6] M. Botvinick and J. Cohen, "Rubber hands 'feel' touch that eyes see," Nature, vol. 391, no. 6669, p. 756, 1998.
[7]
N.-A. Hinz, P. Lanillos, H. Mueller, and G. Cheng, “Drifting perceptual
patterns suggest prediction errors fusion rather than hypothesis selection:
replicating the rubber-hand illusion on a robot,” in 2018 Joint IEEE 8th
International Conference on Development and Learning and Epigenetic
Robotics (ICDL-EpiRob). IEEE, 2018, pp. 125–132.
[8]
K. Doya, “What are the computations of the cerebellum, the basal ganglia
and the cerebral cortex?” Neural networks, vol. 12, no. 7-8, pp. 961–974,
1999.
[9]
S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo
control,” IEEE transactions on robotics and automation, vol. 12, no. 5,
pp. 651–670, 1996.
[10]
C. G. Cifuentes, J. Issac, M. Wüthrich, S. Schaal, and J. Bohg,
“Probabilistic articulated real-time tracking for robot manipulation,” IEEE
Robotics and Automation Letters, vol. 2, no. 2, pp. 577–584, 2016.
[11]
S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of
deep visuomotor policies,” The Journal of Machine Learning Research,
vol. 17, no. 1, pp. 1334–1373, 2016.
[12]
A. Tschantz, M. Baltieri, A. Seth, C. L. Buckley et al., “Scaling active
inference,” arXiv preprint arXiv:1911.10601, 2019.
[13]
B. Millidge, “Deep active inference as variational policy gradients,”
Journal of Mathematical Psychology, vol. 96, p. 102348, 2020.
[14]
K. J. Friston, “The free-energy principle: a unified brain theory?” Nature
Reviews. Neuroscience, vol. 11, pp. 127–138, 02 2010.
[15]
R. P. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a
functional interpretation of some extra-classical receptive-field effects,”
Nature neuroscience, vol. 2, no. 1, pp. 79–87, 1999.
[16]
A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, “Visual
reinforcement learning with imagined goals,” in Advances in Neural
Information Processing Systems, 2018, pp. 9191–9200.
[17]
G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
length and helmholtz free energy,” in Advances in neural information
processing systems, 1994, pp. 3–10.
[18] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1, pp. 1-305, 2008. [Online]. Available: http://www.nowpublishers.com/product.aspx?product=MAL&doi=2200000001
[19]
K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny,
“Variational free energy and the laplace approximation,” Neuroimage,
vol. 34, no. 1, pp. 220–234, 2007.
[20]
C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth, “The free energy
principle for action and perception: A mathematical review,” Journal of
Mathematical Psychology, 2017.
[21]
A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox,
“Learning to generate chairs, tables and cars with convolutional networks,”
IEEE transactions on pattern analysis and machine intelligence, vol. 39,
no. 4, pp. 692–705, 2016.
[22]
V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep
learning,” arXiv preprint arXiv:1603.07285, 2016.
[23]
A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard
artifacts,” Distill, vol. 1, no. 10, p. e3, 2016.
[24]
D. Gouaillier, C. Collette, and C. Kilner, “Omni-directional closed-loop
walk for nao,” in 2010 10th IEEE-RAS International Conference on
Humanoid Robots. IEEE, 2010, pp. 448–454.
[25]
M. Toussaint, “Robot trajectory optimization using approximate infer-
ence,” in Proceedings of the 26th annual international conference on
machine learning, 2009, pp. 1049–1056.
[26]
D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick,
“Neuroscience-inspired artificial intelligence,” Neuron, vol. 95, no. 2,
pp. 245–258, 2017.
[27]
T. Rood, M. van Gerven, and P. Lanillos, “A deep active inference model
of the rubber-hand illusion,” 2020.