End-to-End Pixel-Based Deep Active Inference for Body Perception and Action

Authors:
Cansu Sancaktar
Technical University of Munich
Germany
cansu.sancaktar@tum.de
Marcel A. J. van Gerven
Donders Institute for
Brain, Cognition and Behaviour
Radboud University
Netherlands
m.vangerven@donders.ru.nl
Pablo Lanillos
Donders Institute for
Brain, Cognition and Behaviour
Radboud University
Netherlands
p.lanillos@donders.ru.nl
Abstract—We present a pixel-based deep active inference algorithm (PixelAI) inspired by human body perception and action. Our algorithm combines the free energy principle from neuroscience, rooted in variational inference, with deep convolutional decoders to scale the algorithm to directly deal with raw visual input and provide online adaptive inference. Our approach is validated by studying body perception and action in a simulated and a real Nao robot. Results show that our approach allows the robot to perform 1) dynamical body estimation of its arm using only monocular camera images and 2) autonomous reaching to “imagined” arm poses in visual space. This suggests that robot and human body perception and action can be efficiently solved by viewing both as an active inference problem guided by ongoing sensory input.
Index Terms—Active inference, Deep learning, Free energy optimization, Bio-inspired perception, Predictive coding, Robotics.
I. INTRODUCTION
Learning and adaptation are two core characteristics that allow humans to perform flexible whole-body dynamic estimation and robust actions in the presence of uncertainty [1]. We hypothesize that the human brain acquires a representation (model) of the body, starting at the earliest stages of life, by learning a mapping between tactile, proprioceptive and visual cues [2]. This cross-modal sensorimotor mapping is encoded [3], [4] in some sort of internal model that allows us to predict the sensory effects of our body in space and to deal with unexpected perturbations through reactive actions [5]. This mapping is flexible, as is the perception of our body in space [6], [7], and it depends on the interplay between top-down expectations and current sensory inputs. Hence, unsupervised learning mechanisms are enhanced by online supervised adaptation [8].
In contrast, robots usually use a fixed rigid-body model in which the arm end-effector is defined as a pose, i.e., a 3D point in space and an orientation. Hence, any error in the model or change in the conditions will result in failure. Several solutions have been proposed to overcome this problem, usually separated into perception and control approaches. For instance, by working
This work has been supported by SELFCEPTION project EU Horizon 2020 Programme, grant nr. 741941. PixelAI code: https://github.com/cansu97/PixelAI.
[Figure 1: diagram with the blocks Internal Belief, Forward Pass, Predicted Sensation, Visual Prediction Error, Free-Energy Optimization, Backward Pass, Observed Visual Sensation (Bottom Camera) and Action.]
Fig. 1. Pixel-based deep active inference (PixelAI). The robot infers its body (e.g., joint angles) by minimizing the visual prediction error, i.e., the discrepancy between the camera sensor value $x_v$ and the expected sensation $g(\mu)$ computed using a convolutional decoder. The error signal is used to update the internal belief to match the observed sensation (perceptual inference) and to generate an action $a$ to reduce the discrepancy between the observed and predicted sensations. Both are computed by minimizing the variational free-energy bound.
in visual space (e.g., visual servoing [9]) we can exploit a set of invariant visual keypoints to provide control that incorporates real-world errors. Bayesian sensory fusion in combination with model-based fitting allows adaptation to sensory noise and model errors [10], and model-based active inference provides online adaptation in both action and perception [5]. Finally, learning approaches have shown that the difference between the model and reality can be overcome by optimizing the body parameters or by explicitly learning the policy for a task, e.g., through imitation learning or reinforcement learning (RL). Recently, model-free approaches, particularly deep RL, have demonstrated the potential of directly using raw images as input for learning visual control policies [11]. Deep active inference has also emerged as a competitive alternative, at least in toy problems [12], [13].
In this work, we introduce PixelAI, a novel pixel-based deep active inference algorithm, depicted in Fig. 1, which directly scales to high-dimensional inputs (e.g., images), provides adaptation and model-free learning and, moreover, unifies perception and action into a single variational inference formulation. We combine the free energy principle (i.e., active inference) [14], which updates an internal model of the body through
Authorized licensed use limited to: Radboud University Nijmegen. Downloaded on January 12,2021 at 10:47:42 UTC from IEEE Xplore. Restrictions apply.
perception and action, with deep convolutional decoders, that
map internal beliefs to expected sensations. More concretely,
the agent learns a generative model of the body to aid in
the construction of expected sensations from incoming partial
multimodal information. This information, instead of being
directly encoded, is processed by means of the error between
the predicted outcome and the current input. In this sense,
we formalize body perception and action as a consequence of
surprise minimization via prediction errors [14], [15].
Following this approach, the robot should learn a latent
representation (“state”) of the body and the relation between
its state and the expected sensations (“predictions”). These
predictions will be compared with the observed sensory data,
generating an error signal that can be propagated to refine the
belief that the robot has about its body state. Compensatory
actions would follow a similar principle and will be exerted to
better correspond to the prediction made by the models learnt,
giving the robot the capacity to actively adjust to online changes
in the body and the environment (see Fig. 1). This also offers
a natural way to realize a plan by setting an imagined goal in
sensory space [16]. For example, when working in visual space
we can provide an image as a goal. By means of optimizing the
free-energy bound the agent then executes those actions that
minimize the discrepancy between expected sensations (our
goal) and observed sensations. Our approach was validated by
studying body perception and action in a simulated and a real
Nao robot. Results show that our approach allows the robot to
perform 1) dynamical body estimation of its arm using only
raw monocular camera images and 2) autonomous reaching to
arm poses provided by a goal image.
II. BACKGROUND
A. The free energy principle
We model body perception as inferring the body state $z$ based on the available sensory data $x$. Given a sensation $x$, the goal is to find $z$ such that the posterior $p(z|x) = p(x|z)p(z)/p(x)$ is maximized. However, computing the marginal likelihood $p(x)$ requires an integration over all possible body states, that is, $p(x) = \int_z p(x|z)p(z)\,dz$, which becomes intractable for large state spaces. The free energy [17], largely exploited in machine learning [18] and neuroscience [19], circumvents this problem by introducing a reference distribution (also called recognition density) $q(z)$ with a known tractable form. The goal of the minimization problem hence becomes finding the reference distribution $q(z)$ that best approximates the posterior $p(z|x)$. For tractability purposes, this approximation is calculated by optimizing the variational free energy $F$, whose negative is also referred to as the evidence lower bound (ELBO). $F$ can be defined as the Kullback-Leibler divergence $D_{KL}$ plus the negative log-evidence, or sensory surprise, $-\ln p(x)$:
$$F = D_{KL}\big(q(z)\,\|\,p(z|x)\big) - \ln p(x) = \int_z q(z) \ln \frac{q(z)}{p(z|x)}\,dz - \ln p(x), \tag{1}$$
which, due to the non-negativity of $D_{KL}$, is an upper bound on surprise. Alternatively, we can use the identity $\ln p(x) = \int_z q(z) \ln p(x)\,dz$ to include the second term in the integral and write Equation (1) as
$$F = \int_z q(z) \ln \frac{q(z)}{p(x,z)}\,dz \tag{2}$$
$$= -\int_z q(z) \ln p(x,z)\,dz + \int_z q(z) \ln q(z)\,dz. \tag{3}$$
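As a quick numerical sanity check of Eqs. (1) to (3), the following minimal sketch evaluates the free energy for a toy binary latent variable, with sums replacing the integrals. The prior, likelihood and recognition density are made-up illustrative values and are not taken from the paper.

```python
import math

# Toy model with a binary latent state z (illustrative numbers, not from the paper).
prior = [0.7, 0.3]          # p(z)
likelihood = [0.2, 0.9]     # p(x | z) for one fixed observation x

joint = [p * l for p, l in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                # p(x)
posterior = [j / evidence for j in joint]            # p(z | x)
surprise = -math.log(evidence)                       # -ln p(x)


def free_energy(q):
    """Eq. (2): F = sum_z q(z) ln(q(z) / p(x, z))."""
    return sum(qz * math.log(qz / jz) for qz, jz in zip(q, joint))


def kl(q, p):
    """Kullback-Leibler divergence D_KL(q || p) for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))


q = [0.5, 0.5]                           # arbitrary recognition density
F_eq2 = free_energy(q)                   # Eq. (2)
F_eq1 = kl(q, posterior) + surprise      # Eq. (1)
F_at_posterior = free_energy(posterior)  # bound is tight when q equals the posterior
```

Both forms give the same value, the value never falls below the surprise, and the bound becomes tight when the recognition density equals the true posterior.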
According to the free energy principle [14], both perception and action optimize the free energy and hence minimize surprise:
1) Perceptual inference: The agent updates its internal belief by approximating the conditional density (inference), maximizing the likelihood of the observed sensation:
$$z = \arg\min_z F(z, x). \tag{4}$$
2) Active inference: The agent generates an action $a$ that results in a new sensory state $x(a)$ that is consistent with the current internal representation:
$$a = \arg\min_a F(z, x(a)). \tag{5}$$
Under the Laplace approximation, the variational density can take the form of a Gaussian $q(z) = \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the conditional mode and $\Sigma$ is the covariance of the parameters. By incorporating this reference distribution in Equation (3), the free energy can be approximated as (see [19] for the full derivation)
$$F = -\ln p(x, \mu) - \frac{1}{2}\left(\ln|\Sigma| + n \ln 2\pi\right), \tag{6}$$
where the first term is the negative logarithm of the joint density of the observed and the latent variables, and $\mu$ is an $n$-dimensional state vector.
III. PIXEL-BASED DEEP ACTIVE INFERENCE
Our proposed PixelAI approach combines free energy
optimization with deep learning to directly work with images
as visual input. The optimization provides adaptation and the
neural network incorporates learning of high-dimensional input.
We frame and experimentally validate the proposed algorithm
in body perception and action in robots. Figure 1 visually
describes PixelAI. The agent first learns the approximate
generative forward models of the body, implemented here
as convolutional decoders. While interacting, the expected
sensation (predicted by the decoder) is compared with the real
visual input and the prediction error is used to 1) update the
belief and 2) generate actions. This is performed by means of
optimizing the variational free-energy bound.
A. Active inference model
We formalize body perception as inferring the unobserved body state $\mu$, e.g., the estimation of robot joint angles such as shoulder pitch and roll. We define the robot's internal belief as an $n$-dimensional vector $\mu^{[d]} \in \mathbb{R}^n$ for each temporal order $d$. For instance, for first-order (velocity) generalized coordinates the belief is $\mu = \{\mu^{[0]}, \mu^{[1]}\}$. The observed variables $x$ are the visual sensory input $x_v$ and the external causal variables $\rho$: $x = \{x_v, \rho\}$. For instance, the robot has access to visual
information $x_v$ (an image of size $w \times h$) and proprioceptive information (the joint encoder values $q$). The causal variables $\rho$ are independent variables that produce effects in the world. Finally, let us define the two generative models that describe the system: the sensory forward model $g$, which is the predictor that computes the sensory outcome $x_v$ given the internal state $\mu$, and the internal body-state dynamics $f$. Both functions can be regarded as the approximations that the agent has of reality:
$$x_v = g(\mu) + w_v \tag{7}$$
$$\mu^{[1]} = f(\mu, \rho) + w_\mu \tag{8}$$
where $w_v$ and $w_\mu$ are both process noise and are assumed to be drawn from multivariate normal distributions with zero mean and covariances $\Sigma_v$ and $\Sigma_\mu$, respectively.
In order to compute the variational free energy under the Laplace approximation from Equation (6), we need the joint density. Assuming independence of the observed variables:
$$\ln p(x, \mu) = \ln p(x_v, \rho, \mu) = \ln p(x_v|\mu) + \ln p(\mu^{[1]}|\mu^{[0]}, \rho), \tag{9}$$
where $p(x_v|\mu)$ is the likelihood of having a visual sensation given the internal state and $p(\mu^{[1]}|\mu^{[0]}, \rho)$ is the transition dynamics of the latent variables (body state).
Body perception is then instantiated as computing the body state that minimizes the variational free energy. This can be performed through gradient optimization using $\partial F/\partial\mu$. Since the temporal difference $\mu^{[0]}_{t+1} - \mu^{[0]}_{t}$ is equal to the first-order dynamics $\mu^{[1]}$ at equilibrium, this term has to be included in the computation of $\dot{\mu}^{[0]}$ to find a stationary solution during the gradient descent procedure, where the gradient $\partial F/\partial\mu$ vanishes at the optimum [20]. Hence,
$$\dot{\mu}^{[0]} - \mu^{[1]} = -\frac{\partial F}{\partial \mu} = \frac{\partial \ln p(x_v, \rho, \mu)}{\partial \mu} = \frac{\partial \ln p(x_v|\mu)}{\partial \mu} + \frac{\partial \ln p(\mu^{[1]}|\mu^{[0]}, \rho)}{\partial \mu} \tag{10}$$
In order to compute the likelihoods, we assume that the observed image $x_v$ is noisy and follows a normal distribution with mean $g(\mu)$ and variance $\Sigma_v$. Considering that every pixel contribution is independent, such that $\Sigma_v = \mathrm{diag}(\Sigma_{v_1}, \ldots, \Sigma_{v_{h \cdot w}})$, the likelihood $p(x_v|\mu)$ is obtained as a product of independent Gaussians:
$$p(x_v|\mu) = \prod_{k=1}^{h \cdot w} \frac{1}{\sqrt{2\pi\Sigma_{v_k}}} \exp\left[-\frac{1}{2\Sigma_{v_k}}\left(x_{v_k} - g_k(\mu)\right)^2\right] \tag{11}$$
Analogously, the density that defines the latent variable dynamics is also assumed to be noisy and follows a normal distribution with mean $f(\mu, \rho)$ and variance $\Sigma_\mu$:
$$p(\mu^{[1]}|\mu^{[0]}, \rho) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\Sigma_{\mu_i}}} \exp\left[-\frac{1}{2\Sigma_{\mu_i}}\left(\mu^{[1]}_i - f_i(\mu,\rho)\right)^2\right] \tag{12}$$
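The step from these Gaussian densities to gradient terms can be made explicit. The following worked derivation is a sketch that uses only Eqs. (11) and (12) and treats $\mu^{[1]}$ as fixed when differentiating with respect to $\mu$; it is not spelled out in the source.

```latex
% Log-likelihood of the visual input, from Eq. (11)
\ln p(x_v|\mu) = -\frac{1}{2}\sum_{k=1}^{h \cdot w}
  \left[\frac{\left(x_{v_k} - g_k(\mu)\right)^2}{\Sigma_{v_k}} + \ln\left(2\pi\Sigma_{v_k}\right)\right]

% Its gradient with respect to the belief: Jacobian transpose, precision, prediction error
\frac{\partial \ln p(x_v|\mu)}{\partial \mu}
  = \sum_{k=1}^{h \cdot w} \frac{\partial g_k(\mu)}{\partial \mu}\,
    \frac{x_{v_k} - g_k(\mu)}{\Sigma_{v_k}}
  = \frac{\partial g(\mu)}{\partial \mu}^{T} \Sigma_v^{-1}\left(x_v - g(\mu)\right)

% The same steps applied to Eq. (12)
\frac{\partial \ln p(\mu^{[1]}|\mu^{[0]},\rho)}{\partial \mu}
  = \frac{\partial f(\mu,\rho)}{\partial \mu}^{T} \Sigma_\mu^{-1}\left(\mu^{[1]} - f(\mu,\rho)\right)
```

The variance-only terms $\ln(2\pi\Sigma)$ do not depend on $\mu$ and therefore drop out of the gradient.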
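Substituting the likelihoods and computing the partial derivatives of Equation (10), the body state is then given by the following differential equation:
$$\dot{\mu}^{[0]} = \mu^{[1]} + \overbrace{\frac{\partial g(\mu)}{\partial \mu}^{T}}^{\text{mapping}}\; \overbrace{\Sigma_v^{-1}}^{\text{precision}}\; \overbrace{\left(x_v - g(\mu)\right)}^{\text{prediction error}} + \frac{\partial f(\mu,\rho)}{\partial \mu}^{T} \Sigma_\mu^{-1}\left(\mu^{[1]} - f(\mu,\rho)\right) \tag{13}$$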
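To illustrate the visual term of this update in isolation, the sketch below runs perceptual inference on a toy problem. The two-"pixel" forward model `g`, its Jacobian `dg_dmu` and all numeric values are hypothetical stand-ins for the convolutional decoder; the first-order dynamics are set to zero and the belief is integrated with a simple Euler step.

```python
import math


def g(mu):
    """Hypothetical two-'pixel' forward model standing in for the decoder."""
    return [math.sin(mu), math.cos(mu)]


def dg_dmu(mu):
    """Analytic Jacobian of g with respect to the scalar belief mu."""
    return [math.cos(mu), -math.sin(mu)]


def belief_rate(mu, x_v, sigma_v):
    """Visual term of the belief update: (dg/dmu)^T Sigma_v^-1 (x_v - g(mu))."""
    pred = g(mu)
    jac = dg_dmu(mu)
    return sum(j * (x - p) / sigma_v for j, x, p in zip(jac, x_v, pred))


true_angle = 0.9           # pose that generated the observation (radians, made up)
x_v = g(true_angle)        # noiseless "camera image"
mu, dt, sigma_v = 0.0, 0.1, 1.0
for _ in range(500):
    mu += dt * belief_rate(mu, x_v, sigma_v)   # first-order Euler step
```

In this toy case the belief converges to the angle that generated the observation. With a learned decoder and real images, convergence is only local, as the level 3 results later in the paper indicate.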
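For notational simplicity, hereinafter we denote by $F_g$ and $F_f$ the free-energy contributions of the first and second summands of Eq. (9), respectively. The second and third summands of Eq. (13) are then the corresponding descent terms $-\partial_\mu F_g$ and $-\partial_\mu F_f$ (cf. Eq. (10)).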
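The action is computed analogously. However, only the sensory information is a function of the action, $x(a)$, and therefore the action depends only on the free-energy term with sensory information, $F_g$:
$$\dot{a} = -\frac{\partial F_g}{\partial a} = -\frac{\partial F_g}{\partial x}\frac{\partial x}{\partial a} = -\frac{\partial x_v}{\partial a}^{T} \Sigma_v^{-1}\left(x_v - g(\mu)\right) = -\frac{\partial g(\mu)}{\partial \mu}^{T} \Delta t\, \Sigma_v^{-1}\left(x_v - g(\mu)\right) \tag{14}$$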
To derive the last equality, we have employed the same approximation as in [5], assuming that the actions are joint velocities. In a velocity controller scheme we can approximate the angle change between two time steps of each joint $j$ as $\partial q_j/\partial a_j = \Delta t$, because the target values of the joint encoders $q$ are computed as $q_{t+1} = q_t + \Delta t\, a_t$, where $\Delta t$ is a fixed value that defines the duration of each iteration. Then, assuming convergence of the sensation values at the equilibrium point, with $\mu \to q$ and $g(\mu) \to x_v$, the term $\partial x_v/\partial a$ can be computed using the following equation:
$$\frac{\partial x_v}{\partial a_j} = \frac{\partial x_v}{\partial q_j}\frac{\partial q_j}{\partial a_j} = \frac{\partial g(\mu)}{\partial \mu_j}\frac{\partial \mu_j}{\partial a_j} = \frac{\partial g(\mu)}{\partial \mu_j}\Delta t \tag{15}$$
The update rule for both $\mu$ and $a$ is finally calculated with first-order Euler integration:
$$\mu_{t+1} = \mu_t + \Delta t\,\dot{\mu}, \qquad a_{t+1} = a_t + \Delta t\,\dot{a} \tag{16}$$
B. Scaling up with deep learning
In order to perform pixel-based free-energy optimization, we compute Equations (13) and (14) by exploiting the forward and backward pass properties of the deep neural network. We approximate the visual generative model $g(\mu)$ and its partial derivative $\partial_\mu g(\mu)$ with respect to the internal state by means of a convolutional decoder.
1) Prediction of the expected sensation: We approximate the forward model $g(\mu)$ by means of a generative network based on the architecture proposed in [21] and described in Fig. 2. It outputs the predicted image given the robot's $n$-dimensional internal belief of the body state $\mu$, e.g., the joint angles of the robot.
The input goes through 2 fully-connected layers (FC1 and
FC2). Afterwards, the transposed convolution (UpConv) is
performed to upsample the image. This deconvolution uses
[Figure 2: FC1 (512 neurons) → FC2 (1120 neurons) → R (7 x 10 x 16) → UpConv1 (14 x 20 x 64) → Conv1 (14 x 20 x 64) → UpConv2 (28 x 40 x 16) → Conv2 (28 x 40 x 16) → Dropout (28 x 40 x 16) → UpConv3 (56 x 80).]
Fig. 2. Network architecture of the convolutional decoder (FC: fully-connected layer, R: reshape operator, UpConv: transposed convolution, Conv: convolution).
the input as the weights for the filters and can be regarded
as a backward pass of the standard convolution operator [22].
Following [21], each transposed convolution layer was followed
by a standard convolutional layer, which helps to smooth
out the potential artifacts from the transposed convolution
step. There is an additional 1D-Dropout layer before the last transposed convolution layer to avoid overfitting and achieve better generalization performance. All layers use the rectified linear unit (ReLU) as the activation function, except for the last layer, where a sigmoid function was used to get pixel intensity values in the range $[0,1]$. Throughout the consecutive UpConv-Conv operations in the network, the number of channels is increased and decreased again to get the required output image size.
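2) Backward pass and mapping to the latent variable: An essential term for computing both perception and action is the mapping between the error in the sensory space and the inferred variables, $\partial g(\mu)/\partial\mu$. This term is calculated by performing a backward pass over the convolutional decoder. In fact, we can compute the whole partial derivative of the visual input term, $\partial F_g/\partial\mu$, in just one forward and one backward pass. The reason is that multiplying the prediction error between the expected and observed sensation, $(x_v - g(\mu))$, by the inverse variance and the partial derivative is equivalent to applying the backpropagation algorithm. It is important to note that when the function $g(\mu)$ outputs images of size $w \times h$, $\partial g(\mu)/\partial\mu$ is a three-dimensional tensor. We stack the output into a vector in $\mathbb{R}^{w \cdot h}$ (row major). The following equation is obtained:
$$\frac{\partial F_g}{\partial \mu} = \underbrace{\begin{bmatrix} \dfrac{\partial g_{1,1}}{\partial \mu_1} & \dfrac{\partial g_{1,2}}{\partial \mu_1} & \cdots & \dfrac{\partial g_{w,h}}{\partial \mu_1} \\ \dfrac{\partial g_{1,1}}{\partial \mu_2} & \dfrac{\partial g_{1,2}}{\partial \mu_2} & \cdots & \dfrac{\partial g_{w,h}}{\partial \mu_2} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial g_{1,1}}{\partial \mu_4} & \dfrac{\partial g_{1,2}}{\partial \mu_4} & \cdots & \dfrac{\partial g_{w,h}}{\partial \mu_4} \end{bmatrix}}_{\left(\frac{\partial g}{\partial \mu}\right)^{T}} \underbrace{\begin{bmatrix} \dfrac{\partial F_g}{\partial g_{1,1}} \\ \dfrac{\partial F_g}{\partial g_{1,2}} \\ \vdots \\ \dfrac{\partial F_g}{\partial g_{w,h}} \end{bmatrix}}_{\frac{\partial F_g}{\partial g}} \tag{17}$$
where $\partial F_g/\partial g_{i,l}$ is given by $-\frac{1}{\Sigma_{v_{i,l}}}\left(x_{v_{i,l}} - g_{i,l}(\mu)\right)$. The action is computed by reusing this term and multiplying it by $\Delta t$.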
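The claim that one backward pass yields the full gradient is a vector-Jacobian product. The sketch below checks this on a tiny hand-written map from 2 joint angles to 3 "pixels" by comparing the analytic product with central finite differences. The map `g`, the variances and the test values are hypothetical, and $F_g$ is taken here as the negative log-likelihood of the visual input up to a constant, which is an assumption about the sign convention.

```python
import math

SIGMA_V = [0.5, 2.0, 1.0]   # per-pixel variances (illustrative values)


def g(mu):
    """Hypothetical smooth map from 2 joint angles to 3 'pixels'."""
    m1, m2 = mu
    return [math.sin(m1), math.cos(m1 + m2), m1 * m2]


def jacobian(mu):
    """Rows are pixels, columns are joints: entry [k][i] = dg_k / dmu_i."""
    m1, m2 = mu
    return [[math.cos(m1), 0.0],
            [-math.sin(m1 + m2), -math.sin(m1 + m2)],
            [m2, m1]]


def F_g(mu, x_v):
    """Visual free-energy term: 0.5 * sum_k (x_k - g_k)^2 / Sigma_k."""
    return 0.5 * sum((x - p) ** 2 / s for x, p, s in zip(x_v, g(mu), SIGMA_V))


def grad_F_g(mu, x_v):
    """Jacobian transpose times dF_g/dg, i.e. what a backward pass computes."""
    dF_dg = [-(x - p) / s for x, p, s in zip(x_v, g(mu), SIGMA_V)]
    J = jacobian(mu)
    return [sum(J[k][i] * dF_dg[k] for k in range(3)) for i in range(2)]


def numeric_grad(mu, x_v, eps=1e-6):
    """Central finite differences of F_g, used only as a reference."""
    out = []
    for i in range(2):
        hi = list(mu)
        lo = list(mu)
        hi[i] += eps
        lo[i] -= eps
        out.append((F_g(hi, x_v) - F_g(lo, x_v)) / (2 * eps))
    return out


mu = [0.3, -0.7]
x_v = [0.1, 0.8, -0.4]
analytic = grad_F_g(mu, x_v)
numeric = numeric_grad(mu, x_v)
```

In a deep-learning framework the same product would typically be obtained by backpropagating the precision-weighted error through the decoder instead of forming the Jacobian explicitly.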
C. Formalizing the task with the brain variable dynamics
In active inference, we include the goal as a prior in the body state dynamics function $f(\mu,\rho)$. For example, to perform a reaching task, we encode the desired goal of the robot as an instance in the sensory space (an image), which acts as an attractor generated by the causal variable $\rho$. This produces an error in the inferred state that will promote an action towards the goal. Note that the error $\rho - g(\mu)$ is zero when the prediction matches the desired goal. We define the body state dynamics with a causal variable attractor as
$$f(\mu,\rho) = T(\mu)\,\beta\,\left(\rho - g(\mu)\right), \tag{18}$$
where $\beta$ is a gain parameter that defines the intensity of the attractor and $T(\mu) = \partial g(\mu)^{T}/\partial\mu$ is the mapping from the sensory space (e.g., pixel domain) to the internal belief $\mu$ (e.g., joint space). Note that this term is obtained through the backward pass of the decoder. Finally, substituting the new dynamics generative model in Eq. (13), we write the last term $\partial F_f/\partial\mu$ as
$$\frac{\partial F_f}{\partial \mu} = -\frac{\partial f(\mu,\rho)}{\partial \mu}^{T} \Sigma_\mu^{-1}\left(\mu^{[1]} - \frac{\partial g(\mu)}{\partial \mu}^{T} \beta\left(\rho - g(\mu)\right)\right) \tag{19}$$
In the final model used in the experiments, we have further simplified this equation by not including the first-order internal dynamics in the optimization process, i.e., $\mu^{[1]} = 0$, and by noting that the correct mapping and direction from the sensory space to the latent variable is already provided by $\partial g(\mu)^{T}/\partial\mu$. Thus, we greedily approximate $\partial f(\mu,\rho)/\partial\mu$ by $-1$, avoiding the Hessian computation of $T$ at the cost of some optimization performance. With these assumptions, the partial derivative of the dynamics term becomes
$$\frac{\partial F_f}{\partial \mu} = -\Sigma_\mu^{-1} \frac{\partial g(\mu)}{\partial \mu}^{T} \beta\left(\rho - g(\mu)\right) = -\Sigma_\mu^{-1} f(\mu,\rho) \tag{20}$$
D. PixelAI algorithm
Algorithm 1 summarizes the proposed method. In the robot body perception and action application, $x_v$ is set to the image provided by the robot's monocular camera and the decoder input becomes the internal belief (e.g., the estimated joint angles). The convolutional decoder is trained using the proprioceptive information (joint angle encoders), obtaining a predictor of the visual forward model. The prediction error $e_v$ is the difference between the observation and the expected visual sensation (line 6). The variational free-energy optimization, for perception (line 7) and action (line 8), updates the differential equations that drive the state estimation and control. Finally, with the dynamics term we added the possibility of inputting desired goals in the visual space (line 11). Although this implementation assumes that $\mu^{[1]} = 0$, it is straightforward to add the first-order dynamics when velocity image information or joint encoders are available [5].
1 The gain parameter $K_{\Sigma_v}$ is added to allow the model to generate large action values without increasing the internal belief increments.
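Algorithm 1 PixelAI: Deep Active Inference Algorithm
Require: $\Sigma_v$, $\Sigma_\mu$, $\beta$, $\Delta t$
1: $\mu \leftarrow$ Initial joint angle estimation
2: while (true) do
3:     $x_v \leftarrow$ Resize(camera image)    ▷ Visual sensation
4:     $g(\mu) \leftarrow$ ConvDecoder.forward($\mu$)
5:     $\partial g \leftarrow$ ConvDecoder.backward($\mu$)
6:     $e_v = (x_v - g(\mu))$    ▷ Prediction error
7:     $\dot{\mu} = K_{\Sigma_v}\, \partial g^{T} e_v / \Sigma_v$    (see footnote 1)
8:     $\dot{a} = -(\partial g^{T} e_v / \Sigma_v)\, \Delta t$
9:     if $\exists \rho$ then    ▷ Desired goal $\rho$ dynamics
10:        $e_f = \beta(\rho - g(\mu))$
11:        $\dot{\mu} = \dot{\mu} + \partial g^{T} e_f / \Sigma_\mu$
12:    $\mu = \mu + \Delta t\, \dot{\mu}$ ; $a = a + \Delta t\, \dot{a}$    ▷ 1st order Euler integration
13:    SetVelocityController($a$)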
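The loop can be exercised end to end on a toy problem. The following is a minimal sketch, not the authors' implementation: a single joint is observed through a hypothetical two-"pixel" decoder, the camera is simulated by applying the same map to the true joint angle, the velocity controller is simulated by integrating the action, and all parameter values are made up so that the toy system is well damped. The sign conventions (belief moves towards the observation, action moves the observation towards the prediction) are assumptions consistent with the update equations above.

```python
import math


def decoder_forward(mu):
    """Stand-in for ConvDecoder.forward: two 'pixels' from one joint angle."""
    return [math.sin(mu), math.cos(mu)]


def decoder_backward(mu, err):
    """Stand-in for ConvDecoder.backward: (dg/dmu)^T applied to an error vector."""
    return math.cos(mu) * err[0] - math.sin(mu) * err[1]


def pixelai_loop(q0, mu0, rho, steps=1500, dt=0.1,
                 sigma_v=0.1, sigma_mu=1.0, beta=0.5, k_sigma_v=0.2):
    """Run the toy perception-action loop and return the final (q, mu)."""
    q, mu, a = q0, mu0, 0.0
    for _ in range(steps):
        x_v = decoder_forward(q)                       # simulated camera image
        pred = decoder_forward(mu)                     # expected sensation
        e_v = [x - p for x, p in zip(x_v, pred)]       # prediction error
        grad_v = decoder_backward(mu, e_v)             # error mapped to joint space
        mu_dot = k_sigma_v * grad_v / sigma_v          # perception update
        a_dot = -(grad_v / sigma_v) * dt               # action update
        if rho is not None:                            # goal attractor, if any
            e_f = [beta * (r - p) for r, p in zip(rho, pred)]
            mu_dot += decoder_backward(mu, e_f) / sigma_mu
        mu += dt * mu_dot                              # Euler integration
        a += dt * a_dot
        q += dt * a                                    # simulated velocity controller
    return q, mu


goal_angle = 0.6   # made-up target pose in radians
q_final, mu_final = pixelai_loop(q0=0.0, mu0=0.0, rho=decoder_forward(goal_angle))
```

The goal image pulls the belief towards the target, the resulting prediction error drives the action, and both the simulated joint and the belief settle at the goal angle. Calling the function with `rho=None` reduces the loop to perception and action without an attractor.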
IV. EXPERIMENTS
We tested PixelAI on both a simulated and a real Aldebaran NAO humanoid robot (Fig. 3). We used the left arm to test both the perception and action schemes. The dataset and the code to replicate the experiments can be found at https://github.com/cansu97/PixelAI.
Fig. 3. Experimental setup in simulation (Gazebo) and with the real robot, and a subset of the goal images used for the benchmark.
A. Visual forward model
a) Data acquisition & preprocessing: The dataset used to train the model consisted of 3200 data samples of the left arm elbow and shoulder joint readings ($q = [q_1, q_2, q_3, q_4]^T$) and the observed images $x_v$ obtained through NAO's bottom camera (see footnote 2). Data samples were generated using three different methods and later concatenated (25%, 20% and 55% for each method).
In the first method, the joint angles were randomly drawn from a uniform distribution in the range of the joint limits. Afterwards, the samples where the robot's arm was out of the camera frame were eliminated. The ratio of acquired images with the robot hand centered in the camera image was significantly lower than that of images with the hand located at the corners of the frame. To mitigate this imbalance, in the second method, the robot's arm was manually moved by an operator and the joint angle readings were recorded during these trajectories. This way, a subset of data was obtained where the robot hand was centered in the image. Finally, in the third method, a multivariate Gaussian was fit to the second subset using the expectation-maximization algorithm and random samples were drawn from this Gaussian for the third and final part of the dataset. The goal was to introduce randomness to the centered images and not be limited to the operator's choice of trajectories.
2 In simulation, the color of the right arm was changed to dark grey to achieve contrast with the grey background in the camera images.
For the images collected in the Gazebo NAO simulator, the only preprocessing step performed was resizing the image from $640 \times 480$ to $80 \times 56$. For the real NAO, the images were obtained on a green background (Fig. 3) and the following preprocessing steps were performed: 1) median filtering with kernel size 11 on the original image, 2) masking the monochrome background (e.g., green) in the HSV color space and replacing it with dark gray to ensure contrast, 3) converting the image to grayscale, and 4) resizing the image to $80 \times 56$.
b) Training: The convolutional decoder was trained using the ADAM optimizer with mini-batches of 200 samples and an initial learning rate of $\alpha = 10^{-4}$ with exponential decay of 0.95 every 5000 steps. The training was stopped after ca. 7000 iterations for the simulated NAO dataset and 12000 iterations for the real NAO dataset to avoid overfitting, as the test set error started to increase for the corresponding model. The output of the second fully connected layer (FC2) was a 1120-dimensional vector that was reshaped into a $7 \times 10 \times 16$ tensor. The UpConv layers all use a stride of 2 and a padding of 1. Moreover, a kernel size of $4 \times 4$ was chosen to avoid checkerboard artifacts due to uneven overlap [23]. Convolutional layers used a kernel size of 3, stride 1 and padding 1. The dropout probability was set to 0.15. The final layer outputs a $1 \times 56 \times 80$ tensor corresponding to a grayscale image.
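The stated kernel, stride and padding values can be checked against the layer sizes with the standard output-size formulas for convolution and transposed convolution (no output padding, dilation 1). The sketch below only traces spatial sizes; the helper names are illustrative and the layer order follows Fig. 2.

```python
def upconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def decoder_spatial_sizes(h=7, w=10):
    """Trace (height, width) through UpConv1, Conv1, UpConv2, Conv2, UpConv3."""
    sizes = [(h, w)]
    for layer in ("up", "conv", "up", "conv", "up"):
        f = upconv_out if layer == "up" else conv_out
        h, w = f(h), f(w)
        sizes.append((h, w))
    return sizes


sizes = decoder_spatial_sizes()
fc2_units = 7 * 10 * 16   # elements in the reshaped FC2 output
```

Each transposed convolution doubles the spatial size and each standard convolution preserves it, so the 7 x 10 feature map reaches the 56 x 80 output after three upsampling steps.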
A benchmark with three levels of difficulty was created to evaluate the performance of PixelAI on randomized samples for both perceptual and active inference. A set of 50 different cores (i.e., images of the arm) was generated by sampling the multivariate Gaussian distribution (see method 2 in Section IV-A.a). A subset of the generated cores is shown in Fig. 3. For each of the cores, 10 different random tests were performed. In total, there were 2500 trials composed of 5 runs of 500 testing image arm poses per benchmark level (see footnote 3). The test samples for each core were generated differently depending on the benchmark level:
Level 1 (close similar poses): One of the 4 joints was chosen randomly and a random perturbation $\pm[5, 10]$ sampled from a uniform distribution was added to the joint angle value to generate the new test sample.
Level 2 (far similar poses): For all of the 4 joints, a random perturbation $\pm[5, 10]$ was sampled from a uniform distribution and added to the core joint angles.
Level 3 (random): For each core, 10 different cores were chosen randomly and used as the test samples.
3 For the real robot benchmark tests (perceptual inference) only a subset of 20 cores was used.
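The level 1 and level 2 sampling rules can be sketched as follows. This is one possible reading and not the authors' generator: it assumes that $\pm[5, 10]$ means a magnitude drawn uniformly from $[5, 10]$ with a random sign, it leaves the unit of the perturbation unspecified as in the text, and the core pose values are made up.

```python
import random


def perturbation(rng, low=5.0, high=10.0):
    """Draw a magnitude uniformly from [low, high] and attach a random sign."""
    return rng.choice((-1.0, 1.0)) * rng.uniform(low, high)


def make_test_sample(core, level, rng):
    """Return a perturbed copy of a 4-joint core pose for benchmark level 1 or 2."""
    sample = list(core)
    if level == 1:                       # perturb one randomly chosen joint
        j = rng.randrange(len(sample))
        sample[j] += perturbation(rng)
    elif level == 2:                     # perturb every joint
        sample = [q + perturbation(rng) for q in sample]
    else:
        raise ValueError("level 3 reuses other cores instead of perturbing")
    return sample


rng = random.Random(0)
core = [10.0, 20.0, -30.0, -40.0]        # made-up joint angles
level1 = make_test_sample(core, 1, rng)
level2 = make_test_sample(core, 2, rng)
```

Level 3 needs no perturbation function, since its test samples are other cores drawn at random.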
c) Perceptual inference: In order to evaluate the body perception performance, the robot has to infer its real arm pose using only visual information. The robot's arm was initialized to each core pose and then 10 separate test runs were performed, in which the internal belief was set to a perturbed value of the corresponding pose. These tests are static in nature, i.e., the change takes place solely in the internal predictions of the robot. The goal is for the robot's internal belief $\mu$ to converge to the true arm position, which is equal to the joint angles of the chosen core.
d) Active inference: In order to evaluate the perception and action performance, the core poses were treated as the desired goal (image) encoded as an attractor in the model. Again, for each core, 10 separate test runs were performed. In this case, the robot arm was initialized to a core pose and the initial internal belief was set to the current joint measurements: $\mu = q$. In each test, the goal was for the robot to reach the desired imagined arm pose. The update of the internal belief should generate an action to compensate for the mismatch between the current and the predicted visual sensations. In a successful test run, the robot arm should move to the imagined arm position and the internal belief should also converge to the imagined joint angles, so that $x_v = g(\mu) = \rho$.
e) Algorithm parameters: The parameters of the PixelAI algorithm are $\Sigma_v$, $\beta$, $K_{\Sigma_v}$, $\gamma_{\Sigma_v}$ and $\Delta t$, and were determined empirically (see footnote 4 and Table I). The intuition behind the variance terms is as follows: the prediction errors are multiplied by the inverse of the variances, so these weigh the relevance of the corresponding sensory information error [20]. The $\beta$ term, which is part of the attractor dynamics (see Eq. (18)), essentially has the same effect and controls how much we want to push the internal belief in the direction of the attractor. The gain parameter $K_{\Sigma_v}$ is added to allow the model to generate large action values without increasing the internal belief increments (see line 7 of Algorithm 1). For level 3 in perceptual inference, we used a smaller $\Sigma_v$ until the visual prediction error was below a certain threshold (0.01). This introduces a new model parameter $\gamma_{\Sigma_v}$, which is used to scale $\Sigma_v$ once the error threshold is reached. This heuristic adaptation helped speed up the convergence for the more complex level 3 trajectories. The parameter $\Delta t$ was set to 0.1 for all the perceptual inference tests and to 0.065 for the active inference tests. For active inference, the value of $\Delta t$ was determined based on the internal time of the robot loop execution. Finally, the generated actions (velocity values) were clipped so that each joint could not move more than $[-2; 2]$ each time step.
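The two heuristics in this paragraph, the threshold-based rescaling of $\Sigma_v$ and the clipping of the action, are simple enough to sketch directly. The function names are hypothetical, the clipping limit is applied to the action values as written in the text, and the unit of the limit is not specified in the source.

```python
def effective_sigma_v(sigma_v, gamma, visual_error, threshold=0.01):
    """Use sigma_v until the error drops below the threshold, then scale it by gamma."""
    return sigma_v if visual_error >= threshold else gamma * sigma_v


def clip_action(a, limit=2.0):
    """Clamp each joint velocity command to the interval [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in a]


far_from_goal = effective_sigma_v(2e3, 10, visual_error=0.05)
near_goal = effective_sigma_v(2e3, 10, visual_error=0.005)
clipped = clip_action([3.0, -5.0, 0.5])
```

Because the prediction error is divided by the variance, the smaller initial value gives larger belief updates while the error is high, and the rescaled value gives more cautious updates near convergence.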
V. RESULTS
A. Statistical analysis of perception and action in simulation
First, perceptual inference tests were run for 5000 time steps for all 3 levels. An example of the perceptual inference for
4 $\beta$ and $\Sigma_\mu$ are combined into a single parameter $\beta$.
TABLE I
PIXELAI PARAMETERS USED FOR THE PERCEPTUAL AND ACTIVE INFERENCE BENCHMARKS IN SIMULATION.
Columns: $\Sigma_v$, $\beta$, $K_{\Sigma_v}$, $\gamma_{\Sigma_v}$
Active Inference
Level 1 6×1022×1051031
Level 2 2×1025×1051031
Level 3 20 5 ×1041031
Perceptual Inference
Level 1 2×104- - 1
Level 2 2×104- - 1
Level 3 2×103- - 10
each level is depicted in Fig. 5. For levels 1 and 2, the algorithm converged quickly to the ground truth, while inferring the body location from a totally random initialization (level 3) raised the complexity considerably. Table II shows the mean absolute joint errors ($|q_{true} - \mu|$) averaged over all trials. In levels 1 and 2, the internal belief values converged successfully. Figure 4(b) shows the error during the optimization process and Figure 4(a) shows the visual prediction error. Shoulder pitch and shoulder roll angles were estimated with better accuracy than the elbow angles. This is because a small change in the shoulder pitch angle yields a greater difference in the visual field than the same amount of change in the elbow roll angle. Since PixelAI achieves perception by minimizing the visual prediction error, the accuracy increases when the pixel-based difference is stronger. Therefore, the mean error and standard deviation increase for the elbow joint angle estimations.
The errors in level 3, where the robot had to converge to random arm locations, were larger than in levels 1 and 2, as shown in Fig. 4(b). This is due to two reasons. The first is the local minima problem inherent to our gradient descent approach. The second concerns the desired joint position: several joint configurations can have a small visual prediction error, which increases the risk of getting stuck in a local minimum.
TABLE II
PERCEPTUAL INFERENCE IN SIMULATION: JOINT ANGLES ABSOLUTE ERROR (MEAN ± STD IN DEGREES).
Level | Shoulder Pitch | Shoulder Roll | Elbow Yaw     | Elbow Roll
1     | 0.26 ± 0.34    | 0.33 ± 0.41   | 0.85 ± 0.96   | 0.94 ± 1.06
2     | 0.39 ± 0.71    | 0.62 ± 0.98   | 1.19 ± 1.48   | 1.57 ± 2.18
3     | 5.32 ± 11.06   | 5.61 ± 7.68   | 17.64 ± 26.08 | 12.25 ± 15.41
TABLE III
PERCEPTUAL INFERENCE IN REAL ROBOT: JOINT ANGLES ABSOLUTE ERROR (MEAN ± STD IN DEGREES).
Level | Shoulder Pitch | Shoulder Roll | Elbow Yaw     | Elbow Roll
1     | 1.33 ± 0.82    | 0.68 ± 0.77   | 1.57 ± 1.44   | 1.97 ± 2.03
2     | 1.86 ± 1.85    | 2.22 ± 3.04   | 2.94 ± 3.31   | 3.81 ± 3.32
3     | 9.80 ± 12.63   | 12.77 ± 10.91 | 29.35 ± 35.48 | 21.82 ± 17.23
Fig. 4. Perceptual inference results for all levels (1-3) of the benchmark. The visual prediction error (pixel MSE) and the L2 norm of the error between the internal belief $\mu$ and $s_p$ are plotted. The curves shown are the median values of all the test runs in the corresponding benchmark level, bounded by the upper and lower quartiles. (a) Simulation: visual prediction error. (b) Simulation: L2 norm of $s_p - \mu$. (c) Real: visual prediction error. (d) Real: L2 norm of $\mu - s_p$. Panels (a)-(b) show results for the simulated Nao and panels (c)-(d) for the real Nao.
Fig. 5. Example of the internal trajectories of the latent space during the perceptual inference tests for three different levels of difficulty for core 3.
Secondly, active inference tests with goal images were
performed using the simulated NAO for 1500 time steps in the
benchmark levels. The results for all three benchmark levels
are shown in Fig. 6. The joint encoder readings followed the
internal belief values through the actions generated by free-energy
optimization. The performance degradation in level 3 shows that
interacting is more complex than perceiving, as it involves the
constraints of both the body and the world.
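The qualitative behaviour described above, in which the joint readings follow the internal belief while the belief is drawn toward the goal, can be reproduced in a deliberately simplified one-dimensional loop. The sketch below is not the PixelAI update rule: the identity "decoder" `g`, the gains `beta` and `k`, and the step size `dt` are assumptions chosen only to make the toy system stable.

```python
# Toy 1-D loop: the belief mu is pulled toward a goal observation, and the
# action moves the joint q so that the observation follows the belief.

def g(x: float) -> float:
    """Hypothetical stand-in for a learned visual decoder (identity here)."""
    return x

def dg(x: float) -> float:
    """Derivative of the toy decoder."""
    return 1.0

def run(q0: float, q_goal: float, steps: int = 1500, dt: float = 0.05,
        beta: float = 1.0, k: float = 2.0):
    q, mu = q0, q0
    rho = g(q_goal)                  # goal (attractor) observation
    for _ in range(steps):
        s = g(q)                     # current observation
        e_sens = s - g(mu)           # sensory prediction error
        e_goal = rho - g(mu)         # error with respect to the goal
        # Perception: move the belief to reduce both errors.
        mu += dt * dg(mu) * (e_sens + beta * e_goal)
        # Action: velocity command that reduces the sensory error
        # (dg is used as a crude stand-in for ds/da in this toy).
        a = -k * dg(mu) * e_sens
        q += dt * a
    return q, mu

q_final, mu_final = run(q0=0.0, q_goal=1.0)
print(f"q = {q_final:.4f}, mu = {mu_final:.4f}")
```

In this toy setting the belief leads and the joint follows, so both end at the goal. With a nonlinear decoder the same loop could stall in a local minimum, as observed in level 3.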
[Fig. 6 plots. (a) Visual error: MSE between $\rho$ and $s_v$ versus time step $t$. (b) L2-norm of joint angle errors: $\lVert s_p - q_{attr} \rVert_2$ (radians) versus time step $t$. Each panel shows curves for Level 1, Level 2 and Level 3.]
Fig. 6. Simulated NAO active inference test results for all three levels. The
curves shown are the median values of all the test runs in the corresponding
benchmark level, bounded by the upper and lower quartiles. (a) Visual error
between the visual attractor $\rho$ and the observed camera image $s_v$.
(b) L2-norm of the error between the joint angles of the attractor position
$q_{attr}$ and the proprioceptive sensor readings $s_p$.
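The median curves and quartile bands used in Figs. 4 and 6 can be obtained per time step from the individual test runs. The sketch below assumes that every run is an equally long list of error values and uses the "inclusive" quantile method of the Python standard library; which quantile convention was used for the published plots is not specified here, so this choice is an assumption.

```python
import statistics

def median_and_quartiles(runs):
    """Return (lower quartile, median, upper quartile) curves.

    runs: list of equally long error curves, one per test run.
    """
    lower, median, upper = [], [], []
    for values in zip(*runs):        # all runs at one time step
        q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
        lower.append(q1)
        median.append(q2)
        upper.append(q3)
    return lower, median, upper

# Made-up example: five runs, two time steps each.
runs = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
low, med, up = median_and_quartiles(runs)
print(low, med, up)
```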
B. Active inference in the real robot
We tested the proposed algorithm on the real robot. In contrast
to the simulation, the robot’s movements were imprecise
due to the mechanical backlash in the actuators (±5°) [24]. We
used the same network architecture and training procedure as
in simulation. A low error was achieved on the training
dataset (MSE in pixel intensity: ca. 0.0015). The visual forward
model was expected to model the more complex structure of
the real robot hand, which is subject to lighting differences and
has a reflective surface. Unlike in the simulator, the same
conditions cannot be restored perfectly in the real world, so
the model training is always subject to additional noise in the
dataset.
The results of the perceptual inference for the real NAO on
all 3 benchmark levels are shown in Fig. 4 and in
Table III. Perceptual convergence similar to the simulation
results was found in levels 1 and 2, while level 3
had a larger error due to local minima. Figure 7 shows the
PixelAI algorithm running on the robot. While body estimation
converged smoothly, the real movements were jerky because the
velocity controller was deployed on top of the built-in NAO position
control, which introduced delays in the execution of the action
commands. Direct access to the motor driver should resolve the
mismatch between the internal error and the actual arm position
that is visible in the plots.
VI. CONCLUSIONS
We have described a Pixel-based Deep Active Inference
algorithm and applied it to robot body perception and action.
We have shown that variational free-energy optimization can
work as a general inner mechanism for both estimation and
control. Our algorithm extends previous active inference work
by tackling high-dimensional visual inputs and by learning the
sensory generative models. This prediction error variant of
control as inference [25] exploits the learnt representation to
generate actions indirectly: the robot produces the actions that
reach the desired goal in the visual space without learning an
explicit policy. Our algorithm enabled body estimation using a
monocular camera input and performed goal-driven behaviour
using imaginary goals in the visual space. Statistical results
showed convergence in both perception and action at different
levels of difficulty, with a larger error when dealing with totally
random arm poses. This neuroscience-inspired approach is thought to
make deeper interpretations than conventional engineering
solutions [26], giving some grounding for novel machine
learning developments, especially for body perception and
action. Further work will focus on bringing this approach
closer to biological plausibility. We will explore the integration
of complex proprioceptive information within the optimization
framework [27] by learning multimodal representations, and
include hierarchies in the architecture to permit the robot to
perceive its body in an abstracted way and not only in the
pixel-based domain.
Fig. 7. PixelAI test on the real Nao. $\mu$ is the inferred state, $q$ is the measured joint angle readings, goal is the ground truth goal angles and 0.05 steps = 1 s.
(Bottom row) Arm sequence: goal image and Nao visual input are superimposed.
REFERENCES
[1] D. M. Wolpert, J. Diedrichsen, and J. R. Flanagan, “Principles of sensorimotor learning,” Nature Reviews Neuroscience, vol. 12, no. 12, p. 739, 2011.
[2] Y. Yamada, H. Kanazawa, S. Iwasaki, Y. Tsukahara, O. Iwata, S. Yamada, and Y. Kuniyoshi, “An embodied brain model of the human foetus,” Scientific Reports, vol. 6, 2016.
[3] P. Lanillos, E. Dean-Leon, and G. Cheng, “Yielding self-perception in robots through sensorimotor contingencies,” IEEE Trans. on Cognitive and Developmental Systems, no. 99, pp. 1–1, 2016.
[4] G. Diez-Valencia, T. Ohashi, P. Lanillos, and G. Cheng, “Sensorimotor learning for artificial body perception,” arXiv preprint arXiv:1901.09792, 2019.
[5] G. Oliver, P. Lanillos, and G. Cheng, “Active inference body perception and action for humanoid robots,” arXiv preprint arXiv:1906.03022, 2019.
[6] M. Botvinick and J. Cohen, “Rubber hands ‘feel’ touch that eyes see,” Nature, vol. 391, no. 6669, p. 756, 1998.
[7] N.-A. Hinz, P. Lanillos, H. Mueller, and G. Cheng, “Drifting perceptual patterns suggest prediction errors fusion rather than hypothesis selection: replicating the rubber-hand illusion on a robot,” in 2018 Joint IEEE 8th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). IEEE, 2018, pp. 125–132.
[8] K. Doya, “What are the computations of the cerebellum, the basal ganglia and the cerebral cortex?” Neural Networks, vol. 12, no. 7-8, pp. 961–974, 1999.
[9] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651–670, 1996.
[10] C. G. Cifuentes, J. Issac, M. Wüthrich, S. Schaal, and J. Bohg, “Probabilistic articulated real-time tracking for robot manipulation,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 577–584, 2016.
[11] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[12] A. Tschantz, M. Baltieri, A. Seth, C. L. Buckley et al., “Scaling active inference,” arXiv preprint arXiv:1911.10601, 2019.
[13] B. Millidge, “Deep active inference as variational policy gradients,” Journal of Mathematical Psychology, vol. 96, p. 102348, 2020.
[14] K. J. Friston, “The free-energy principle: a unified brain theory?” Nature Reviews. Neuroscience, vol. 11, pp. 127–138, 02 2010.
[15] R. P. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999.
[16] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, “Visual reinforcement learning with imagined goals,” in Advances in Neural Information Processing Systems, 2018, pp. 9191–9200.
[17] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description length and Helmholtz free energy,” in Advances in Neural Information Processing Systems, 1994, pp. 3–10.
[18] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” vol. 1, no. 1, pp. 1–305. [Online]. Available: http://www.nowpublishers.com/product.aspx?product=MAL&doi=2200000001
[19] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny, “Variational free energy and the Laplace approximation,” Neuroimage, vol. 34, no. 1, pp. 220–234, 2007.
[20] C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth, “The free energy principle for action and perception: A mathematical review,” Journal of Mathematical Psychology, 2017.
[21] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox, “Learning to generate chairs, tables and cars with convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 692–705, 2016.
[22] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
[23] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, vol. 1, no. 10, p. e3, 2016.
[24] D. Gouaillier, C. Collette, and C. Kilner, “Omni-directional closed-loop walk for NAO,” in 2010 10th IEEE-RAS International Conference on Humanoid Robots. IEEE, 2010, pp. 448–454.
[25] M. Toussaint, “Robot trajectory optimization using approximate inference,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1049–1056.
[26] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick, “Neuroscience-inspired artificial intelligence,” Neuron, vol. 95, no. 2, pp. 245–258, 2017.
[27] T. Rood, M. van Gerven, and P. Lanillos, “A deep active inference model of the rubber-hand illusion,” 2020.