A short presentation of Active Inference
Jules Tsukahara (IMJ-PRG, Sorbonne Université), Grégoire Sergeant-Perthuis (LCQB, Sorbonne Université)
March 2024
Abstract
This document presents a concise mathematical formulation of Active Inference. Active Inference is an algorithm that aims to model adaptive behaviors and has a wide range of applications [Da +20], in particular when the number of parameters used to describe a phenomenon is of 'reasonable' size. In the simplest setting, an agent has a generative model of the time evolution of its environment and of the consequences of its actions on that environment. The agent infers beliefs about its environment through noisy observations and plans its actions in a Bayesian fashion, i.e., rewards are stochastic and the best action is chosen by maximizing the likelihood of possible actions.
1 References
This note covers the material of the presentation given at the 'Paris Mathematical Models of Cognition and Consciousness' seminar on March 20, 2024. The main references we used to write this document are [Da +20; Da +24] and [Lal21]. We will follow [Da +20] and its specification of Active Inference. When referring to 'Active Inference' (in uppercase), we refer to a precise algorithm, first formalized in [Fri10] and reformulated in [Da +20], rather than 'active inference' (in lowercase), which is a broader concept that encapsulates cycles of action and perception, whose formulation could depart from [Fri10]. In other words, throughout the document, 'Active Inference' is synonymous with the Free Energy Principle.
2 Introduction
In Active Inference, agents are assumed to have an internal world model of
their environment over which they maintain beliefs that are updated with time;
such a hypothesis is sometimes coined the Bayesian Brain Hypothesis. This
internal world model is an incomplete description of the agent’s environment.
In this case, one could say that an agent has incomplete information about its environment. There are two possible interpretations of this world model. In the first one, the agent is 'conscious' that it has a world model that accounts for its environment and leverages it to make optimal decisions or actions. In the second interpretation, the world model is hard-coded in the agent, for example through neuronal connections that account for the dependency relations between the variables of the model (see Section 5.2 of [TG21]); in this setting, the reaction of the agent follows active inference in an automated manner. In this second setting, we can therefore argue that the agent is not 'conscious' of having a world model of its environment but acts according to this world model. The second interpretation is closer to the conceptual framework of Partially Observable Markov Decision Processes (POMDPs): in this setting, an agent is designed to respond optimally with incomplete knowledge of its environment. The difference between POMDPs and active inference is discussed, for example, in [FSM12].
The random variable that accounts for the state of the environment (modulo the actions of the agent) at a given time step $t$ will be denoted by $S_t$, and the associated state space, i.e., the space of all possible configurations of the environment, will be denoted by $E_{S_t}$. The environment is to be understood in a broad sense; it accounts for all the information the agent has on its 'world' and could account for some knowledge of itself, such as the position and configuration of some of its parts. $O_t$ is the random variable associated to one observation the agent can make on its environment at time $t$; the associated state space is $E_{O_t}$. The agent can choose to make an action $A_t$ from a set $E_{A_t}$. In this document, for simplicity, $E_{S_t}$, $E_{O_t}$, and $E_{A_t}$ are finite sets.
We assume that at each time step the spaces $E_{S_t}$, $E_{O_t}$, and $E_{A_t}$ are the same; in other words, for any $t, t_1$, $E_{S_t} = E_{S_{t_1}}$, $E_{O_t} = E_{O_{t_1}}$, and $E_{A_t} = E_{A_{t_1}}$. We drop the reference to time $t$ in the notation of these spaces: $E_S$, $E_O$, $E_A$. Let us denote by $T$ a time horizon which corresponds, for simplicity, to the number of steps we want the Active Inference algorithm to run for.

The state of the environment up to $T$ is encoded by the tuple $(S_0, \dots, S_T)$, denoted $S_{0:T}$, taking values in $E_{S_{0:T}} := \prod_{0 \le t \le T} E_{S_t}$, which in our setting is $\prod_{0 \le t \le T} E_S$. Similarly, we denote by $A_{0:T}$ the tuple of actions $(A_0, \dots, A_T)$ and by $O_{0:T}$ the tuple of observations $(O_0, \dots, O_T)$. The associated realizations will be denoted in lowercase; for example, a realization of $(S_0, \dots, S_T)$ is written $(s_0, \dots, s_T)$.
3 Generative model of the environment and observations
The time evolution of the environment from time $t$ to time $t+1$ is encoded by a stochastic map, or Markov kernel, denoted by $\mathcal{T}$. Let us denote by $\mathcal{P}(F)$ the set of probability measures over the state space $F$. A Markov kernel $K$ from state space $F$ to state space $F_1$ is a (measurable) map $K : F \to \mathcal{P}(F_1)$ which sends any element $\omega \in F$ to a probability distribution $K_\omega \in \mathcal{P}(F_1)$, which satisfies $\sum_{\omega_1 \in F_1} K_\omega(\omega_1) = 1$.
The Markov kernel $\mathcal{T}$ that models the possible evolutions of the environment of the agent is conditioned on the possible actions of the agent at time $t$. This implies that $\mathcal{T} : E_{A_t} \times E_{S_t} \to \mathcal{P}(E_{S_{t+1}})$. This kernel is the same for any time $t$ and encodes all possible consequences of an action $a \in E_A$ on a state $s \in E_S$ as a distribution of possible states over $E_S$. Therefore the explicit map is:
$$\mathcal{T} : E_S \times E_A \to \mathcal{P}(E_S), \qquad (s_t, a_t) \mapsto \mathcal{T}(\cdot \mid s_t, a_t).$$
The states of the environment at time $t$, $S_t$, and the observations at time $t$, $O_t$, are related through a Markov kernel $f^s : E_S \to \mathcal{P}(E_O)$. Once again, we assume that this kernel does not depend on time, corresponding to assuming that the sensors remain the same throughout the time period of the experiment. We explicitly denote by $f^s(o_t \mid s_t) \in [0,1]$ the probability of the event $o_t$ given $s_t$. Here $s$ is intended as a shorthand for the word sensation.

In active inference, the sensory experience of the agent inherently includes a level of uncertainty, which is accounted for by assuming that the sensory kernel $f^s$ is a random variable. In [Da +20], $f^s$ is itself a random variable that takes values in the space of probability kernels from $E_S$ to $\mathcal{P}(E_O)$; we consider a slightly more general version where instead the Markov kernel depends on a parameter $\Theta \in E_\Theta$, where $\Theta$ is a random variable. In this setting, $f^s_\theta$ now depends on a realization $\theta \in E_\Theta$, and $f^s$ is redefined to be a kernel from $E_\Theta \times E_S$ to $\mathcal{P}(E_O)$, i.e. $f^s : E_\Theta \times E_S \to \mathcal{P}(E_O)$.
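As a concrete illustration, finite Markov kernels such as $\mathcal{T}$ and $f^s$ can be represented as stochastic arrays whose last axis sums to one. The following Python sketch is ours and purely illustrative; the names and sizes (`N_S`, `N_A`, `N_O`, `N_THETA`) are assumptions, not notation from [Da +20].

```python
# A minimal sketch (assumed, not from the paper): finite Markov kernels
# stored as stochastic numpy arrays; sizes are stand-ins for
# |E_S|, |E_A|, |E_O|, |E_Theta|.
import numpy as np

rng = np.random.default_rng(0)
N_S, N_A, N_O, N_THETA = 4, 2, 3, 2

def random_kernel(*shape):
    """Random stochastic array: the last axis sums to 1."""
    k = rng.random(shape)
    return k / k.sum(axis=-1, keepdims=True)

# Transition kernel T : E_S x E_A -> P(E_S), stored as T[s, a, s_next].
T = random_kernel(N_S, N_A, N_S)
# Sensor kernel f^s : E_Theta x E_S -> P(E_O), stored as FS[theta, s, o].
FS = random_kernel(N_THETA, N_S, N_O)

# Each slice is a probability distribution over the last axis.
assert np.allclose(T.sum(axis=-1), 1.0)
assert np.allclose(FS.sum(axis=-1), 1.0)
```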
The dependencies between the variables $S$ and $O$ can be summarized by the directed graphical model shown in Figure 1. Recall that a graphical model associates to a graph a factorization of a probability distribution [GR08].

[Figure 1: Part of the generative model of states of the environment and of observations. Nodes $s_t \to s_{t+1}$ are linked by the kernel $\mathcal{T}$, and each $s_t \to o_t$ by $f^s(\cdot \mid s_t, \theta)$.]
The probability distribution $P$ associated to Figure 1, for a given $\theta \in E_\Theta$ and $a_t \in E_A$, factorizes as follows:
$$P_{S_{t,t+1}, O_{t,t+1} \mid \theta, a_t}(s_{t,t+1}, o_{t,t+1}) = \mathcal{T}(s_{t+1} \mid s_t, a_t)\, f^s(o_{t+1} \mid s_{t+1}, \theta)\, f^s(o_t \mid s_t, \theta) \tag{3.1}$$
Equation 3.1 represents only the part of the generative model of the agent corresponding to what happens at times $t$ and $t+1$, for a given action and the sensor associated with $\theta$. We will use the notation $\mathrm{pol}$, for policy, to condense $(a_{0:T-1})$, and $\Pi$ will denote the variable $A_{0:T-1}$. The agent assigns weights to collections of possible actions using a probability distribution, indicating its consideration of competing outcomes in relation to its choice of action. $\tau$ corresponds to a time up to which the agent has access to observations; it is greater than $0$ and less than the horizon $T$. In practice, if the agent can remember all the observations it made up to step $I$ of the Active Inference algorithm, then $I$ equals $\tau$. However, the agent can simulate the environment after $\tau$ and up to time $T$ even though it does not receive any observations; this fact is accounted for in the generative model that we will now state explicitly. Following [Da +20] and [Lal21], we introduce the generative model of active inference:
$$P_{S_{0:T}, O_{0:\tau}, \Pi, \Theta}(s_{0:T}, o_{0:\tau}, \mathrm{pol}, \theta) = f_\Pi(a_{0:T-1})\, f_\Theta(\theta) \prod_{0 \le t \le T-1} \mathcal{T}(s_{t+1} \mid s_t, a_t) \prod_{0 \le t \le \tau} f^s(o_t \mid s_t, \theta)$$
Here $f_\Pi \in \mathcal{P}(E_{A_{0:T-1}})$ is a probability distribution over sequences of actions and $f_\Theta \in \mathcal{P}(E_\Theta)$ is a probability distribution over the parameter $\Theta$. The associated graphical model for a fixed sensor $\theta$ and collection of actions $a_0, \dots, a_{T-1}$ is given in Figure 2.
[Figure 2: Generative model of the states of the environment and of observations up to time $\tau$, for a given sensor $\theta$ and collection of actions $a_0, \dots, a_{T-1}$. States $s_0, s_1, \dots, s_\tau, s_{\tau+1}, \dots, s_T$ are linked by the kernels $\mathcal{T}_{a_0}, \dots, \mathcal{T}_{a_{T-1}}$; observations $o_0, \dots, o_\tau$ are emitted by $f^s(\cdot \mid s_t, \theta)$.]
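To make the factorization concrete, the following sketch (ours, with illustrative names; it reuses the arrays `T`, `FS` and `numpy` from the previous snippet) evaluates the log-probability of one realization under the generative model displayed above.

```python
def log_generative(s, o, a, theta, T, FS, log_f_Pi, log_f_Theta, tau):
    """log P(s_{0:T}, o_{0:tau}, pol, theta) for one realization.

    s: states s_0..s_T; o: observations o_0..o_tau;
    a: actions a_0..a_{T-1}; theta: index of the sensor parameter;
    log_f_Pi, log_f_Theta: log-priors over action sequences and theta.
    """
    logp = log_f_Pi(a) + log_f_Theta(theta)
    for t in range(len(a)):              # transition factors, 0 <= t <= T-1
        logp += np.log(T[s[t], a[t], s[t + 1]])
    for t in range(tau + 1):             # observation factors, 0 <= t <= tau
        logp += np.log(FS[theta, s[t], o[t]])
    return logp
```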
4 Active Inference algorithm: Inference and action
There are two steps in the Active Inference algorithm. The first one is the Inference step, in which the agent computes an approximation to the posterior given by conditioning the generative model on observations. In doing so, it updates its beliefs about future states of the environment.

The second step is a probabilistic variant of choosing the best action that maximizes a sum of rewards, given the previously computed update of beliefs on the future outcomes. This will be referred to as the Action step.
4.1 Inference step
At step $\tau$ of the Active Inference algorithm, the agent has as input the observations $o_{0:\tau}$. The agent wants to compute the posterior of the generative model $P$, that is:
$$P_{S_{0:T}, A_{0:T-1}, \Theta \mid O_{0:\tau}}(s_{0:T}, a_{0:T-1}, \theta \mid o_{0:\tau}) \tag{4.1}$$
To ease readability, we will group the triple $(S_{0:T}, A_{0:T-1}, \Theta)$ into one random variable $X$ and denote the corresponding state space by $E_X$. Under this notation, the posterior is written $P_{X \mid O_{0:\tau}}(x \mid o_{0:\tau})$.
Computing the posterior is generally intractable. Instead, we proceed by variational inference (see Appendix A for definitions and an exposition of variational inference in a general setting); that is to say, we approximate the posterior by optimizing an entropy functional under constraints that account for fitness with respect to the observations. Thus, variational inference is a constrained optimization problem over the space of distributions $Q_X \in \mathcal{P}(E_X)$, with an objective function $F_1(Q_X)$ known as the variational free energy. $F_1(Q_X)$ arises naturally when minimizing the Kullback-Leibler divergence $D_{KL}(Q_X \| P_{X \mid O_{0:\tau}})$. It takes the following form:
$$F_1(Q_X) = -\sum_{x \in E_X} Q_X(x) \ln P_{X, O_{0:\tau}}(x, o_{0:\tau}) - S(Q_X) \tag{4.2}$$
with $S(Q_X)$ the entropy of $Q_X$, defined as $S(Q_X) = -\sum_{x \in E_X} Q_X(x) \ln Q_X(x)$.
To simplify the optimization problem and to obtain better robustness with respect to variation in the observations, we assume that $Q_X$ takes values in a subspace $\mathrm{Fac}$ of $\mathcal{P}(E_X)$, implicitly defined by imposing the following factorization: $\forall Q_X \in \mathrm{Fac}$, $\exists g_\Pi \in \mathcal{P}(E_{A_{0:T-1}})$, $\exists g_\Theta \in \mathcal{P}(E_\Theta)$, $\exists g_T \in \mathcal{P}(E_{S_T})$, and $\forall t \in [0, T-1]$, $\exists g_t : E_{A_t} \to \mathcal{P}(E_{S_t})$ such that:
$$Q_X(x) = g_\Pi(a_{0:T-1})\, g_\Theta(\theta) \prod_{0 \le t \le T-1} g_t(s_t \mid a_t)\, g_T(s_T). \tag{4.3}$$
This is known as the Naïve Bayes assumption or the mean-field approximation [Bis06]. Let
$$Q = \arg\min_{Q_X \in \mathrm{Fac}} F_1(Q_X). \tag{4.4}$$
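A toy sketch of this Inference step, under our own illustrative conventions (not the paper's code): we assume the log-joint $\ln P(x, o_{0:\tau})$ has been precomputed as an array `log_joint` with one axis per factored variable, and we apply the standard mean-field coordinate update, which lowers $F_1$ at each step [Bis06].

```python
import numpy as np

def free_energy(q_factors, log_joint):
    """F_1(Q) = -E_Q[ln P(x, o)] - S(Q) for Q(x) = prod_i q_i(x_i)."""
    q = q_factors[0]
    for f in q_factors[1:]:
        q = np.multiply.outer(q, f)      # full mean-field distribution
    q, lp = q.ravel(), log_joint.ravel()
    nz = q > 0
    return float(np.sum(q[nz] * (np.log(q[nz]) - lp[nz])))

def update_factor(i, q_factors, log_joint):
    """Mean-field fixed point: q_i(x_i) proportional to exp(E_{q_-i}[ln P])."""
    parts = [np.ones_like(f) if j == i else f for j, f in enumerate(q_factors)]
    w = parts[0]
    for f in parts[1:]:
        w = np.multiply.outer(w, f)
    other_axes = tuple(j for j in range(log_joint.ndim) if j != i)
    e = (w * log_joint).sum(axis=other_axes)
    q = np.exp(e - e.max())              # stabilized exponential
    return q / q.sum()
```

Cycling `update_factor` over the factors until the factors stop changing is the classical coordinate-descent scheme for Equation 4.4.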
4.2 Action step
During the Inference step, the agent updates its beliefs about the environment. To select subsequent actions, it is necessary to quantify the agent's preferred states. According to those preferences and its inferred beliefs about the environment, the agent will seek to select actions that are most likely to produce those states. To do so, the second step of Active Inference is to compute a likelihood distribution $q_\Pi \in \mathcal{P}(E_{A_{0:\tau+1}})$ on the actions $a_{0:\tau+1}$ from a second 'free energy' term $F_2(a_{0:\tau}, a_{\tau+1}, Q_X)$. The likelihood satisfies the proportionality relation:
$$q_\Pi(a_{0:\tau}, a_{\tau+1} \mid Q_X) \propto e^{-F_2(a_{0:\tau}, a_{\tau+1}, Q_X)}. \tag{4.5}$$
We let $\tilde{q}_\Pi(a_{0:\tau}, a_{\tau+1} \mid Q_X) = e^{-F_2(a_{0:\tau}, a_{\tau+1}, Q_X)}$ denote the unnormalized likelihood.
This second free energy is often referred to as the 'expected free energy' in the literature. It is the sum of two terms, referred to as 'Exploitation' and 'Exploration'. Each of them depends on the previously computed posterior approximation $Q_{S_{0:T}, \Theta \mid A_{0:T-1}, O_{0:\tau} = o_{0:\tau}}$, conditioned on the actions $A_{0:T-1}$, for the observations up to $t = \tau$. For simplicity, we write it as $Q_{S_{0:T}, \Theta \mid A_{0:T-1}, o_{0:\tau}}$. The free energy $F_2$ can thus be written as follows:
$$F_2(a_{0:\tau}, a_{\tau+1}, Q_X) = \mathrm{Exploitation}(Q_{S_{0:T}, \Theta \mid A_{0:T-1}, o_{0:\tau}}) + \mathrm{Exploration}(Q_{S_{0:T}, \Theta \mid A_{0:T-1}, o_{0:\tau}}). \tag{4.6}$$
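In code, the proportionality (4.5) amounts to a Boltzmann/softmax distribution over expected free energies. A sketch of ours, assuming an array `F2_values` holding $F_2$ for each candidate action sequence:

```python
import numpy as np

def policy_likelihood(F2_values):
    """q_Pi proportional to exp(-F_2), normalized with max-subtraction."""
    z = -np.asarray(F2_values, dtype=float)
    z -= z.max()                 # numerical stability before exponentiating
    q = np.exp(z)
    return q / q.sum()
```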
4.2.1 Exploitation
The Exploitation term corresponds to approaching the preferred next state. It is encoded by a distribution $D_{S_{\tau+1}, \Theta} \in \mathcal{P}(E_{S_{\tau+1}} \times E_\Theta)$ over possible states at time $\tau + 1$. In theory, one could also include more future states by replacing $\tau + 1$ with $t_1 : t_2$ such that $T \ge t_2 \ge t_1 \ge \tau + 1$. Following [Da +20], the full Exploitation term is given by the following Kullback-Leibler divergence:
$$\mathrm{Exploitation}(Q_{S_{0:T}, \Theta \mid A_{0:T-1}}) = D_{KL}\big(Q_{S_{\tau+1}, \Theta \mid A_{0:T-1}} \,\big\|\, D_{S_{\tau+1}, \Theta}\big). \tag{4.7}$$
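A direct sketch of this term, assuming (our convention) that the predicted belief and the preference distribution are given as arrays on the same finite space:

```python
import numpy as np

def exploitation(q_pred, d_pref, eps=1e-12):
    """D_KL(q_pred || d_pref) over a finite space, as in eq. (4.7)."""
    q, d = np.asarray(q_pred).ravel(), np.asarray(d_pref).ravel()
    nz = q > 0                   # 0 * log 0 is taken to be 0
    return float(np.sum(q[nz] * (np.log(q[nz]) - np.log(d[nz] + eps))))
```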
4.2.2 Exploration
The Exploration term reflects the uncertainty about future observations. Minimizing this term corresponds to choosing actions that are more likely to generate observations that convey more information about the state. This is achieved by minimizing the expected entropy of the sensation kernel $f^s$:
$$\begin{aligned}
\mathrm{Exploration}(Q_{S_\tau, \Theta \mid A_{0:T-1}}) &= \mathbb{E}_{Q_{S_\tau, \Theta \mid A_{0:T-1}}}\left[S\big(f^s_{O_{\tau+1} \mid S_\tau, \Theta}\big)\right] \\
&= -\mathbb{E}_{Q_{S_\tau, \Theta \mid A_{0:T-1}}\, f^s_{O_{\tau+1} \mid S_\tau, \Theta}}\left[\ln f^s_{O_{\tau+1} \mid S_\tau, \Theta}\right] \\
&= -\sum_{s_\tau, \theta, o_{\tau+1}} Q_{S_\tau, \Theta \mid A_{0:T-1}}(s_\tau, \theta)\, f^s(o_{\tau+1} \mid s_\tau, \theta) \ln f^s(o_{\tau+1} \mid s_\tau, \theta).
\end{aligned} \tag{4.8}$$
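In array form, this is the entropy of each conditional distribution of the sensor kernel, averaged under the predicted belief. A sketch with our illustrative shapes `q_s_theta[s, theta]` and `FS[theta, s, o]` (as in the earlier snippets):

```python
import numpy as np

def exploration(q_s_theta, FS):
    """Expected entropy of f^s(. | s, theta) under Q(s, theta), eq. (4.8)."""
    log_fs = np.log(FS, where=FS > 0, out=np.zeros_like(FS))
    ent = -np.sum(FS * log_fs, axis=-1)       # ent[theta, s]
    return float(np.sum(q_s_theta * ent.T))   # average over the belief
```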
The action chosen is the one that maximizes the marginal over $a_{\tau+1}$ of the distribution $\tilde{q}_\Pi$:
$$a^*_{\tau+1} = \arg\max_{a_{\tau+1} \in E_A} \sum_{\tilde{a}_{0:T} \text{ s.t. } \tilde{a}_{\tau+1} = a_{\tau+1}} \tilde{q}_\Pi(\tilde{a}_{0:T}). \tag{4.9}$$
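A sketch of this selection, under the assumption (ours) that the unnormalized weights $\tilde{q}_\Pi$ are stored as an array `q_tilde` with one axis per time step:

```python
import numpy as np

def best_action(q_tilde, tau):
    """a*_{tau+1}: argmax of the marginal of q_tilde over axis tau + 1."""
    other_axes = tuple(t for t in range(q_tilde.ndim) if t != tau + 1)
    marginal = q_tilde.sum(axis=other_axes)   # weight of each a_{tau+1}
    return int(np.argmax(marginal))
```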
The agent executes the action, evolving the environment. Inference can then take place again at step $\tau + 1$.
A Variational inference: inference as minimization of Entropy
Consider a joint distribution $P_{X,Y} \in \mathcal{P}(E \times E_1)$ over two random variables $X \in E$, $Y \in E_1$. A classical problem is, given an observation $\omega_1$ of $Y$, to compute the posterior $P_{X \mid Y}(\omega, \omega_1) = \frac{P_{X,Y}(\omega, \omega_1)}{P_Y(\omega_1)}$ with $P_Y(\omega_1) = \sum_{\omega \in E} P_{X,Y}(\omega, \omega_1)$, the marginal distribution of $Y$. However, doing so requires summing over all possible configurations of $X$, which can be computationally too costly. This is the case, for example, when $X = (X_0, \dots, X_T)$ with each $X_i$ taking values in a common finite set $B$ and $E = \prod_{i \le T} B$. Instead, one resorts to variational inference to compute $P_{X \mid Y}$ approximately [Alq20]. We will now explain what variational inference is, but first let us introduce entropy and Gibbs free energy. When $E$ is a finite set, the entropy of a probability distribution $Q$ on $E$ is defined as:
$$S(Q) = -\sum_{x \in E} Q(x) \ln Q(x). \tag{A.1}$$
Let $H$ be a measurable function $H : E \to \mathbb{R}$. For $Q \in \mathcal{P}(E)$, one calls $\mathbb{E}_Q[H] - \frac{1}{\beta} S(Q)$ the Gibbs free energy; in general $\beta = 1$. An important property is that
$$-\ln \sum_{\omega \in E} e^{-\beta H(\omega)} = \inf_{Q \in \mathcal{P}(E)} \mathbb{E}_Q[\beta H] - S(Q). \tag{A.2}$$
The optimal solution to Equation A.2 is given by the Boltzmann distribution
$$Q^*(\omega) = \frac{e^{-\beta H(\omega)}}{\sum_{\omega' \in E} e^{-\beta H(\omega')}}. \tag{A.3}$$
Let $H(\omega) = -\ln P_{X,Y}(\omega, \omega_1)$ and $\beta = 1$; then $Q^*(\omega) = P_{X \mid Y}(\omega \mid \omega_1)$. Therefore, solving the optimization problem of Equation A.2 is equivalent to computing the posterior $P_{X \mid Y}(\omega \mid \omega_1)$. Solving Equation A.2 over a subset of distributions $Q \in \Theta \subseteq \mathcal{P}(E)$ is called variational inference. If furthermore the Gibbs free energy is replaced by an approximation, we call it approximate variational inference.
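The identity (A.2)-(A.3) can be checked numerically. A small sketch of ours, with an arbitrary illustrative energy $H$:

```python
import numpy as np

rng = np.random.default_rng(1)
H, beta = rng.random(6), 1.0

def gibbs(Q):
    """Gibbs free energy E_Q[beta * H] - S(Q)."""
    nz = Q > 0
    return float(Q @ (beta * H) + np.sum(Q[nz] * np.log(Q[nz])))

boltzmann = np.exp(-beta * H) / np.exp(-beta * H).sum()
lhs = -np.log(np.exp(-beta * H).sum())

assert np.isclose(gibbs(boltzmann), lhs)          # the infimum is attained
assert gibbs(np.full(H.size, 1 / H.size)) >= lhs  # no distribution does better
```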
One remarks that $\inf_{Q \in \mathcal{P}(E)} \mathbb{E}_Q[\beta H] - S(Q)$ is equivalent to $\sup_{Q \in \mathcal{P}(E)} S(Q) - \mathbb{E}_Q[\beta H]$. This last optimization problem relates, through Lagrange multipliers, to maximizing entropy under an energy constraint $U \in \mathbb{R}$:
$$\sup_{\substack{Q \in \mathcal{P}(E) \\ \mathbb{E}_Q[H] = U}} -\sum_{x \in E} Q(x) \ln Q(x). \tag{A.4}$$
In the physics literature, one refers to Equation A.4 as MaxEnt [Kes09], which stands for the principle of maximum entropy; this principle has many applications, see [Kes09; DD18]. In that context, variational inference is called the variational principle.
A celebrated example of variational inference is called Naive Bayes in the machine learning literature (see Chapter 8 of [Bis06]) and mean-field approximation in statistical physics; let us now present this case. The global state space $E = \prod_{i \in I} E_i$ is the joint configuration space of variables $X_i \in E_i$. $\Theta \subseteq \mathcal{P}(E)$ is the set of product distributions, i.e., those for which $Q_I(x_i, i \in I) = \prod_{i \in I} Q_i(x_i)$. One then solves, for example through gradient descent,
$$\inf_{Q \in \Theta} \mathbb{E}_Q[\beta H] - S(Q). \tag{A.5}$$
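For two binary variables, the mean-field problem (A.5) is small enough to grid-search directly. A toy sketch of ours, with an arbitrary illustrative energy:

```python
import numpy as np

H = np.array([[0.0, 1.0], [1.0, 0.3]])   # energy H(x0, x1) on {0, 1}^2
beta = 1.0

def gibbs(Q):
    """Gibbs free energy E_Q[beta * H] - S(Q) on the 2 x 2 space."""
    nz = Q > 0
    return float(np.sum(Q * beta * H) + np.sum(Q[nz] * np.log(Q[nz])))

# Product distributions Q(x0, x1) = Q0(x0) Q1(x1), parameterized by (p, q).
grid = np.linspace(0.01, 0.99, 99)
best = min((gibbs(np.outer([p, 1 - p], [q, 1 - q])), p, q)
           for p in grid for q in grid)
print("mean-field Gibbs free energy ~ %.4f at p=%.2f, q=%.2f" % best)
```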
References

[Da +20] Lancelot Da Costa et al. "Active inference on discrete state-spaces: A synthesis". In: Journal of Mathematical Psychology 99 (2020), p. 102447.

[Da +24] Lancelot Da Costa et al. Active inference as a model of agency. 2024.

[Lal21] Rida Lali. Inférence Active pour les agents émotionnels au sein du modèle projectif de la conscience. M1 internship report done under the supervision of D. Rudrauf and G. Sergeant-Perthuis. 2021.

[Fri10] Karl J. Friston. "The free-energy principle: a unified brain theory?" In: Nature Reviews Neuroscience 11 (2010), pp. 127-138. URL: https://api.semanticscholar.org/CorpusID:5053247.

[TG21] Youri Timsit and Grégoire Sergeant-Perthuis. "Towards the Idea of Molecular Brains". In: International Journal of Molecular Sciences 22.21 (2021). ISSN: 1422-0067. DOI: 10.3390/ijms222111868. URL: https://www.mdpi.com/1422-0067/22/21/11868.

[FSM12] Karl J. Friston, Spyridon Samothrakis, and Read Montague. "Active inference and agency: optimal control without cost functions". In: Biological Cybernetics 106 (2012), pp. 523-541. URL: https://api.semanticscholar.org/CorpusID:253889571.

[GR08] Kevin Gimpel and Daniel Rudoy. Statistical inference in graphical models. Massachusetts Institute of Technology, Lincoln Laboratory, 2008.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[Alq20] Pierre Alquier. "Approximate Bayesian Inference". In: Entropy 22.11 (Nov. 2020), p. 1272. ISSN: 1099-4300. DOI: 10.3390/e22111272. URL: https://www.mdpi.com/1099-4300/22/11/1272 (visited on 02/20/2023).

[Kes09] H. K. Kesavan. "Jaynes' maximum entropy principle". In: Encyclopedia of Optimization. Boston, MA: Springer US, 2009, pp. 1779-1782. ISBN: 978-0-387-74759-0. DOI: 10.1007/978-0-387-74759-0_312. URL: https://doi.org/10.1007/978-0-387-74759-0_312.

[DD18] Andrea De Martino and Daniele De Martino. "An introduction to the maximum entropy approach and its application to inference problems in biology". In: Heliyon 4.4 (2018), e00596. ISSN: 2405-8440. DOI: 10.1016/j.heliyon.2018.e00596. URL: https://www.sciencedirect.com/science/article/pii/S2405844018301695.