Available via license: CC BY 4.0

Content may be subject to copyright.

Journal of Mathematical Psychology 99 (2020) 102447

Journal of Mathematical Psychology

journal homepage: www.elsevier.com/locate/jmp

Review

Active inference on discrete state-spaces: A synthesis

Lancelot Da Costa a,b,∗, Thomas Parr b, Noor Sajid b, Sebastijan Veselic b, Victorita Neacsu b, Karl Friston b

a Department of Mathematics, Imperial College London, London, SW7 2RH, United Kingdom

b Wellcome Centre for Human Neuroimaging, University College London, London, WC1N 3AR, United Kingdom

article info

Article history:

Received 17 April 2020

Received in revised form 23 July 2020

Accepted 3 September 2020

Available online 6 November 2020

Keywords:

Active inference

Free energy principle

Process theory

Variational Bayesian inference

Markov decision process

Mathematical review

abstract

Active inference is a normative principle underwriting perception, action, planning, decision-making and learning in biological or artificial agents. From its inception, its associated process theory has grown to incorporate complex generative models, enabling simulation of a wide range of complex behaviours. Due to successive developments in active inference, it is often difficult to see how its underlying principle relates to process theories and practical implementation. In this paper, we try to bridge this gap by providing a complete mathematical synthesis of active inference on discrete state-space models. This technical summary provides an overview of the theory, derives neuronal dynamics from first principles and relates these dynamics to biological processes. Furthermore, this paper provides a fundamental building block needed to understand active inference for mixed generative models, which allow continuous sensations to inform discrete representations. This paper may be used in three ways: as a guide to outstanding research challenges, as a practical guide to implementing active inference to simulate experimental behaviour, or as a pointer towards various in-silico neurophysiological responses that may be used to make empirical predictions.

©2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license

(http://creativecommons.org/licenses/by/4.0/).

Contents

1. Introduction
2. Active inference
3. Discrete state-space generative models
4. Variational Bayesian inference
    4.1. Free energy and model evidence
    4.2. On the family of approximate posteriors
    4.3. Computing the variational free energy
5. Perception
    5.1. Plausibility of neuronal dynamics
6. Planning, decision-making and action selection
    6.1. Planning and decision-making
    6.2. Action selection, policy-independent state-estimation
    6.3. Biological plausibility
    6.4. Pruning of policy trees
    6.5. Discussion of the action–perception cycle
7. Properties of the expected free energy
8. Learning
9. Structure learning
    9.1. Bayesian model reduction
    9.2. Bayesian model expansion
10. Discussion
11. Conclusion
Declaration of competing interest
Appendix A. More complex generative models
    A.1. Learning B and D
    A.2. Complexifying the prior over policies
    A.3. Multiple state and outcome modalities
    A.4. Deep temporal models
Appendix B. Computation of dynamics underlying perception
    B.1. Free energy conditioned upon a policy
    B.2. Free energy gradients
Appendix C. Expected free energy as reaching steady-state
Appendix D. Computing expected free energy
    D.1. Ambiguity
    D.2. Risk
    D.3. Novelty
References

∗ Corresponding author at: Department of Mathematics, Imperial College London, London, SW7 2RH, United Kingdom.

E-mail address: l.da-costa@imperial.ac.uk (L. Da Costa).

https://doi.org/10.1016/j.jmp.2020.102447

0022-2496/© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Active inference is a normative principle underlying perception, action, planning, decision-making and learning in biological or artificial agents. It inherits from the free energy principle, a theory of self-organisation in the neurosciences (Buckley et al., 2017; Friston, 2019; Friston et al., 2006). Active inference postulates that these processes may all be seen as optimising two complementary objective functions: a variational free energy, which measures the fit between an internal model and past sensory observations, and an expected free energy, which scores possible future courses of action in relation to prior preferences. Active inference has been employed to simulate a wide range of complex behaviours in neuropsychology and machine learning, including planning and navigation (Kaplan & Friston, 2018a), reading (Friston et al., 2018b), curiosity and abstract rule learning (Friston, Lin et al., 2017), substance use disorder (Smith, Schwartenbeck et al., 2020), approach-avoidance conflict (Smith, Kirlic et al., 2020), saccadic eye movements (Parr & Friston, 2018a), visual foraging (Mirza et al., 2016; Parr & Friston, 2017a), visual neglect (Parr & Friston, 2018c), hallucinations (Adams et al., 2013), niche construction (Bruineberg et al., 2018; Constant et al., 2018), social conformity (Constant et al., 2019), impulsivity (Mirza et al., 2019), image recognition (Millidge, 2019), and the mountain car problem (Çatal et al., 2019; Friston, Adams et al., 2012; Friston et al., 2009). The key idea that underwrites these simulations is that creatures use an internal forward (generative) model to predict their sensory input, which they use to infer the causes of these data. In addition to simulating behaviour, active inference allows one to answer questions about an individual’s psychological processes, by comparing the evidence for different mechanistic hypotheses in relation to behavioural data.

Active inference is very generic and allows one to view different models of behaviour in the same light. For example, a drift diffusion model can now be seen in relation to predictive coding, as both can be interpreted as minimising free energy through a process of evidence accumulation (Bogacz, 2017; Buckley et al., 2017; Friston & Kiebel, 2009). Similarly, a dynamic programming model of choice behaviour corresponds to minimising expected free energy under the prior preference of maximising reward (Da Costa et al., 2020). Being generic, active inference is not meant to replace any of the existing models; rather, it should be used as a tool to uncover the commitments and assumptions of more specific models.

Early formulations of active inference employed generative models expressed in continuous space and time (for an introduction see Bogacz, 2017; for a review see Buckley et al., 2017), with behaviour modelled as a continuously evolving random dynamical system. However, we know that some processes in the brain conform better to discrete, hierarchical representations than to continuous representations (e.g., visual working memory (Luck & Vogel, 1997; Zhang & Luck, 2008), state estimation via place cells (Eichenbaum et al., 1999; O’Keefe & Dostrovsky, 1971), language, etc.). Reflecting this, many of the paradigms studied in neuroscience are naturally framed as discrete state-space problems. Decision-making tasks are a prime candidate for this, as they often entail a series of discrete alternatives that an agent needs to choose among (e.g., multi-armed bandit tasks (Daw et al., 2006; Reverdy et al., 2013; Wu et al., 2018), multi-step decision tasks (Daw et al., 2011)). This explains why, in active inference, agent behaviour is often modelled using a discrete state-space formulation, the particular applications of which are summarised in Table 1. More recently, mixed generative models (Friston, Parr et al., 2017), which combine discrete and continuous states, have been used to model behaviour involving discrete and continuous representations (e.g., decision-making and movement (Parr & Friston, 2018d), speech production and recognition (Friston, Sajid et al., 2020), pharmacologically induced changes in eye-movement control (Parr & Friston, 2019), or reading, in which continuous visual sampling informs inferences about discrete semantics (Friston, Parr et al., 2017)).

Due to the pace of recent theoretical advances in active inference, it is often difficult to retain a comprehensive overview of its process theory and practical implementation. In this paper, we hope to provide a comprehensive (mathematical) synthesis of active inference on discrete state-space models. This technical summary provides an overview of the theory, derives the associated (neuronal) dynamics from first principles and relates these to known biological processes. Furthermore, this paper and Buckley et al. (2017) provide the building blocks necessary to understand active inference on mixed generative models. This paper can be read as a practical guide on how to implement active inference for simulating experimental behaviour, or a pointer towards various in-silico neuro- and electro-physiological responses that can be tested empirically.

This paper is structured as follows. Section 2 is a high-level overview of active inference. The following sections elucidate the formulation by deriving the entire process theory from first principles, incorporating perception, planning and decision-making. This formalises the action–perception cycle: (1) an agent is presented with a stimulus, (2) it infers its latent causes, (3) plans into the future and (4) realises its preferred course of action; the cycle then repeats. This enactive cycle allows us to explore the dynamics of synaptic plasticity, which mediate learning of the contingencies of the world at slower timescales. We conclude in Section 9 with an overview of structure learning in active inference.


Table 1

Applications of active inference (discrete state-space).

Application | Description | References
Decision-making under uncertainty | Initial formulation of active inference on partially observable Markov decision processes. | Friston, Samothrakis et al. (2012)
Optimal control | Application of KL or risk sensitive control in an engineering benchmark: the mountain car problem. | Çatal et al. (2019) and Friston, Adams et al. (2012)
Evidence accumulation | Illustrating the role of evidence accumulation in decision-making through an urns task. | FitzGerald, Moran et al. (2015) and FitzGerald, Schwartenbeck et al. (2015)
Psychopathology | Simulation of addictive choice behaviour. | Schwartenbeck, FitzGerald, Mathys, Dolan, Wurst et al. (2015)
Dopamine | The precision of beliefs about policies provides a plausible description of dopaminergic discharges. | Friston et al. (2014) and FitzGerald, Dolan et al. (2015)
Functional magnetic resonance imaging | Empirical prediction and validation of dopaminergic discharges. | Schwartenbeck, FitzGerald, Mathys, Dolan and Friston (2015)
Maximal utility theory | Evidence in favour of surprise minimisation as opposed to utility maximisation in human decision-making. | Schwartenbeck, FitzGerald, Mathys, Dolan, Kronbichler et al. (2015)
Social cognition | Examining the effect of prior preferences on interpersonal inference. | Moutoussis et al. (2014)
Exploration–exploitation dilemma | Casting behaviour as expected free energy minimising accounts for epistemic and pragmatic choices. | Friston et al. (2015)
Habit learning and action selection | Formulating learning as an inferential process and action selection as Bayesian model averaging. | Friston et al. (2016) and FitzGerald et al. (2014)
Scene construction and anatomy of time | Mean-field approximation for multi-factorial hidden states, enabling high dimensional representations of the environment. | Friston and Buzsáki (2016) and Mirza et al. (2016)
Electrophysiological responses | Synthesising various in-silico neurophysiological responses via a gradient descent on free energy. E.g., place-cell activity, mismatch negativity, phase-precession, theta sequences, theta–gamma coupling and dopaminergic discharges. | Friston, FitzGerald et al. (2017)
Structure learning, curiosity and insight | Simulation of artificial curiosity and abstract rule learning. Structure learning via Bayesian model reduction. | Friston, Lin et al. (2017)
Hierarchical temporal representations | Generalisation to hierarchical generative models with deep temporal structure and simulation of reading. | Friston et al. (2018b) and Parr and Friston (2017b)
Computational neuropsychology | Simulation of visual neglect, hallucinations, and prefrontal syndromes under alternative pathological priors. | Benrimoh et al. (2018), Parr, Benrimoh et al. (2018), Parr and Friston (2018c), Parr, Rees et al. (2018) and Parr, Rikhye et al. (2019)
Neuromodulation | Use of precision parameters to manipulate exploration during saccadic searches; associating uncertainty with cholinergic and noradrenergic systems. | Parr and Friston (2017a, 2019), Sales et al. (2018) and Vincent et al. (2019)
Decisions to movements | Mixed generative models combining discrete and continuous states to implement decisions through movement. | Friston, Parr et al. (2017) and Parr and Friston (2018d)
Planning, navigation and niche construction | Agent induced changes in environment (generative process); decomposition of goals into subgoals. | Bruineberg et al. (2018), Constant et al. (2018) and Kaplan and Friston (2018a)
Atari games | Active inference compares favourably to reinforcement learning in the game of Doom. | Cullen et al. (2018)
Machine learning | Scaling active inference to more complex machine learning problems. | Tschantz et al. (2019)

2. Active inference

To survive in a changing environment, biological (and artificial) agents must maintain their sensations within a certain hospitable range (i.e., maintaining homeostasis through allostasis). In brief, active inference proposes that agents achieve this by optimising two complementary objective functions: a variational free energy and an expected free energy. The former measures the fit between the agent’s internal (generative) model of its sensations and its sensory observations, while the latter scores each possible course of action in terms of its ability to reach the range of “preferred” states of being.

Our first premise is that agents represent the world through an internal model. Through minimisation of variational free energy, this model becomes a good model of the environment. In other words, this probabilistic model and the probabilistic beliefs¹ that it encodes are continuously updated to mirror the environment and its dynamics. Such a world model is considered to be generative, in that it is able to generate predictions about sensations (e.g., during planning or dreaming), given beliefs about future states of being. If an agent senses a heat source (e.g., another agent) via some temperature receptors, the sensation of warmth represents an observed outcome and the temperature of the heat source a hidden state; minimisation of variational free energy then ensures that beliefs about hidden states closely match the true temperature. Formally, the generative model is a joint probability distribution over possible hidden states and sensory consequences, which specifies how the former cause the latter, and minimisation of variational free energy enables the agent to “invert” the model; i.e., determine the most likely hidden states given sensations. The variational free energy is the negative evidence lower bound that is optimised in variational Bayes in machine learning (Bishop, 2006; Xitong, 2017). Technically, by minimising variational free energy, agents perform approximate Bayesian inference (Sengupta & Friston, 2016; Sengupta et al., 2016), which enables them to infer the causes of their sensations (e.g., perception). This is the point of contact between active inference and the Bayesian brain (Aitchison & Lengyel, 2017; Friston, 2012; Knill & Pouget, 2004). Crucially, agents may incorporate an optimism bias (McKay & Dennett, 2009; Sharot, 2011) in their model, thereby scoring certain “preferred” sensations as more likely. This lends a higher plausibility to those courses of action that realise these sensations. In other words, a preference is simply something an agent (believes it) is likely to work towards.

¹ By beliefs we mean Bayesian beliefs, i.e., probability distributions over a variable of interest (e.g., current position). Beliefs are therefore used in the sense of Bayesian belief updating or belief propagation, as opposed to propositional or folk psychology beliefs.
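To make the idea of inverting a generative model concrete, the following minimal Python sketch applies Bayes' rule to the heat-source example above. The two hidden states, the two outcomes and all probability values are illustrative assumptions, not taken from the paper, and the sketch computes the exact posterior rather than the variational approximation discussed in later sections.

```python
# Hypothetical two-state version of the heat-source example.
# All names and numbers below are illustrative assumptions.
prior = {"cold": 0.5, "hot": 0.5}            # p(s): beliefs before sensing
likelihood = {                                # p(o | s): how hidden states cause sensations
    "cold": {"cool": 0.9, "warm": 0.1},
    "hot": {"cool": 0.2, "warm": 0.8},
}


def invert(observation):
    """Return the exact posterior p(s | o) and the model evidence p(o)."""
    # Joint probability p(o, s) of the observation with each hidden state.
    joint = {s: prior[s] * likelihood[s][observation] for s in prior}
    evidence = sum(joint.values())            # p(o), obtained by marginalising over s
    posterior = {s: joint[s] / evidence for s in joint}   # Bayes' rule
    return posterior, evidence


posterior, evidence = invert("warm")
print(posterior, evidence)
```

After sensing warmth, the posterior concentrates on the "hot" state. In realistic models the sum over hidden states is intractable, which is why a variational approximation is used instead of this exact computation.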


Fig. 1. Markov blankets in active inference. This figure illustrates the Markov blanket assumption of active inference. A Markov blanket is a set of variables through which states internal and external to the system interact. Specifically, the system must be such that we can partition it into a Bayesian network of internal states µ, external states η, sensory states o and active states u (µ, o and u are often referred to together as particular states), with probabilistic (causal) links in the directions specified by the arrows. All interactions between internal and external states are therefore mediated by the blanket states b (i.e., the sensory and active states). The sensory states represent the sensory information that the body receives from the environment and the active states express how the body influences the environment. This blanket assumption is quite generic, in that it can be reasonably assumed for a brain as well as elementary organisms. For example, when considering a bacillus, the sensory states become the cell membrane and the active states comprise the actin filaments of the cytoskeleton. Under the Markov blanket assumption, together with the assumption that the system persists over time (i.e., possesses a non-equilibrium steady state), a generalised synchrony appears, such that the dynamics of the internal states can be cast as performing inference over the external states (and vice versa) via a minimisation of variational free energy (Friston, 2019; Parr et al., 2020). This coincides with existing approaches to inference; i.e., variational Bayes (Beal, 2003; Bishop, 2006; Blei et al., 2017; Jordan et al., 1998). This can be viewed as the internal states mirroring external states via sensory states (e.g., perception), and external states mirroring internal states via active states (e.g., a generalised form of self-assembly, autopoiesis or niche construction). Furthermore, under these assumptions the most likely courses of action can be shown to minimise expected free energy. Note that external states beyond the system should not be confused with the hidden states of the agent’s generative model (which model external states). In fact, the internal states are exactly the parameters (i.e., sufficient statistics) encoding beliefs about hidden states and other latent variables, which model external states in a process of variational free energy minimisation. Hidden and external states may or may not be isomorphic. In other words, an agent uses its internal states to represent hidden states that may or may not exist in the external world.

To maintain homeostasis, and ensure survival, agents must minimise surprise.² Since the generative model scores preferred outcomes as more likely, minimising surprise corresponds to maximising model evidence.³ In active inference, this is assured by the aforementioned processes: the variational free energy turns out to be an upper bound on surprise, and minimising expected free energy ensures preferred outcomes are realised, thereby avoiding surprise on average.
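The claim that variational free energy bounds surprise from above can be checked numerically. The sketch below assumes the standard definition F[q] = Σ_s q(s) [log q(s) − log p(o, s)] and a hypothetical two-state model with illustrative numbers; the helper name `free_energy` is ours.

```python
import math

# Hypothetical two-state model (cold, hot); the numbers are illustrative assumptions.
prior = [0.5, 0.5]              # p(s)
likelihood_warm = [0.1, 0.8]    # p(o = warm | s)


def free_energy(q):
    """Variational free energy F[q] = sum_s q(s) * (log q(s) - log p(o, s))."""
    return sum(
        qs * (math.log(qs) - math.log(ps * ls))
        for qs, ps, ls in zip(q, prior, likelihood_warm)
        if qs > 0  # terms with q(s) = 0 contribute nothing
    )


evidence = sum(ps * ls for ps, ls in zip(prior, likelihood_warm))   # p(o)
surprise = -math.log(evidence)                                       # -log p(o)
exact_posterior = [ps * ls / evidence for ps, ls in zip(prior, likelihood_warm)]

# F is never below the surprise, and meets it at the exact posterior.
for q in ([0.5, 0.5], [0.2, 0.8], exact_posterior):
    print(f"F = {free_energy(q):.4f}   surprise = {surprise:.4f}")
```

The gap between F and the surprise is the Kullback–Leibler divergence between the approximate belief q and the exact posterior, so it closes only when the two coincide.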

Active inference can thus be framed as the minimisation of surprise (Friston, 2009, 2010; Friston et al., 2006; Friston & Stephan, 2007) by perception and action. In discrete state models of the sort discussed here, this means agents select from different possible courses of action (i.e., policies) in order to realise their preferences and thus minimise the surprise that they expect to encounter in the future. This enables a Bayesian formulation of the perception–action cycle (Fuster, 1990): agents perceive the world by minimising variational free energy, ensuring their model is consistent with past observations, and act by minimising expected free energy, to make future sensations consistent with their model. This account of behaviour can be concisely framed as self-evidencing (Hohwy, 2016).

² In information theory, the surprise (a.k.a. surprisal or self-information) associated with an outcome under a generative model is given by −log p(o). This specifies the extent to which an observation is unusual and surprises the agent; it does not mean that the agent consciously experiences surprise.

³ In Bayesian statistics, the model evidence (often referred to as marginal likelihood) associated with a generative model is p(o): the probability of observed outcomes according to the model (sometimes this is written as p(o|m), explicitly conditioning upon a model). The model evidence scores the goodness of the model as an explanation of data that are sampled, by rewarding accuracy and penalising complexity, which avoids overfitting.
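As a rough illustration of how policies might be scored, the sketch below evaluates a one-step expected free energy as the sum of risk (the divergence of predicted outcomes from preferred outcomes) and ambiguity (the expected entropy of the likelihood), then converts the scores into beliefs about policies. The two policies, all numerical values and the use of a softmax over negative expected free energy are assumptions made for this example; the paper's own derivation appears in later sections and Appendix D.

```python
import math

# Illustrative assumptions: two hidden states, two outcomes (cool, warm), two policies.
likelihood = [[0.9, 0.1],    # p(o | s = 0) over outcomes (cool, warm)
              [0.2, 0.8]]    # p(o | s = 1)
preferences = [0.1, 0.9]     # p(o): prior preferences, 'warm' is preferred
predicted_states = {         # q(s | policy): states each policy is expected to reach
    "stay": [0.9, 0.1],
    "approach": [0.1, 0.9],
}


def expected_free_energy(q_s):
    """One-step expected free energy, computed as risk plus ambiguity."""
    # Predicted outcomes q(o | policy) = sum_s p(o | s) q(s | policy).
    q_o = [sum(q_s[s] * likelihood[s][o] for s in range(2)) for o in range(2)]
    # Risk: KL divergence between predicted and preferred outcomes.
    risk = sum(qo * math.log(qo / po) for qo, po in zip(q_o, preferences) if qo > 0)
    # Ambiguity: expected entropy of the likelihood under predicted states.
    ambiguity = sum(
        q_s[s] * -sum(p * math.log(p) for p in likelihood[s]) for s in range(2)
    )
    return risk + ambiguity


G = {policy: expected_free_energy(q) for policy, q in predicted_states.items()}

# Assumed here: beliefs about policies are a softmax of negative expected free energy.
weights = {policy: math.exp(-g) for policy, g in G.items()}
total = sum(weights.values())
policy_beliefs = {policy: w / total for policy, w in weights.items()}
print(G, policy_beliefs)
```

In this toy setting the "approach" policy has the lower expected free energy, because it is expected to produce the preferred warm sensation, and therefore receives the higher probability.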

In contrast to other normative models of behaviour, active inference is a ‘first principle’ account, which is grounded in statistical physics (Friston, 2019; Parr et al., 2020). Active inference describes the dynamics of systems that persist (i.e., do not dissipate) during some timescale of interest, and that can be statistically segregated from their environment, conditions which are satisfied by biological systems. Mathematically, the first condition means that the system is at non-equilibrium steady-state (NESS). This implies the existence of a steady-state probability density to which the system self-organises and returns after perturbation (i.e., the agent’s preferences). The statistical segregation condition is the presence of a Markov blanket (cf. Fig. 1) (Kirchhoff et al., 2018; Pearl, 1998): a set of variables through which states internal and external to the system interact (e.g., the skin is a Markov blanket for the human body). Under these assumptions it can be shown that the states internal to the system parameterise Bayesian beliefs about external states and that their dynamics can be cast as a process of variational free energy minimisation (Friston, 2019; Parr et al., 2020). This coincides with existing approaches to approximate inference (Beal, 2003; Bishop, 2006; Blei et al., 2017; Jordan et al., 1998). Furthermore, it can be shown that the most likely courses of action taken by those systems are those which minimise expected free energy (or a variant thereof, see Appendix C), a quantity that subsumes many existing constructs in science and engineering (see Section 7).

By subscribing to the above assumptions, it is possible to describe the behaviour of viable living systems as performing active inference; the remaining challenge is to determine the computational and physiological processes that they implement to do so. This paper aims to summarise possible answers to this question, by reviewing the technical details of a process theory for active inference on discrete state-space generative models, first presented in Friston, FitzGerald et al. (2017). Note that it is important to distinguish active inference as a principle (presented above) from active inference as a process theory. The former is a consequence of fundamental assumptions about living systems, while the latter is a hypothesis concerning the computational and biological processes in the brain that might implement active inference. The ensuing process theory can then be used to predict plausible neuronal dynamics and electrophysiological responses that are elicited experimentally.

3. Discrete state-space generative models

The generative model (Bishop,2006) expresses how the agent

represents the world. This is a joint probability distribution over

sensory data and the hidden (or latent) causes of these data.

The sorts of discrete state-space generative models used in active

inference are specifically suited to represent discrete time series

and decision-making tasks. These can be expressed as variants

of partially observable Markov decision processes (POMDPs;

Aström, 1965): from simple Markov decision processes (Barto &

Sutton, 1992; Stone, 2019; White, 2001) to generalisations in the

form of deep probabilistic (hierarchical) models (Allenby et al.,

2005;Box & Tiao,1965;Friston et al.,2018b). For clarity, the

process theory is derived for the simplest model that facilitates

understanding of subsequent generalisations; namely, a POMDP

where the agent holds beliefs about the probability of the initial

state (specified as D), the transition probabilities from one state

to the next (defined as matrix B) and the probability of outcomes

given states (i.e., the likelihood matrix A); see Fig. 2.
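
To make the roles of D, B and A concrete, the following is a minimal sketch of ancestral sampling from such a POMDP. The numerical values, the action names and the helper functions are hypothetical and chosen only for illustration; the policy is taken to be a list of T − 1 actions, one per transition, which is a simplifying assumption of the sketch.

```python
import random

# Hypothetical toy model: m = 2 hidden states, n = 2 outcomes, 2 actions.
# Columns are probability vectors, following the column convention of Table 2.
D = [0.5, 0.5]                       # D[i] = P(s_1 = i)
A = [[0.9, 0.2],                     # A[j][i] = P(o = j | s = i)
     [0.1, 0.8]]
B = {                                # B[u][k][i] = P(s_next = k | s = i, action u)
    "stay":   [[0.95, 0.05], [0.05, 0.95]],
    "switch": [[0.05, 0.95], [0.95, 0.05]],
}

def column(matrix, i):
    """Return the i-th column of a matrix stored as a list of rows."""
    return [row[i] for row in matrix]

def sample_index(probs, rng):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_trajectory(policy, rng):
    """Sample hidden states and outcomes for len(policy) + 1 time steps."""
    s = sample_index(D, rng)                      # initial state from D
    states = [s]
    outcomes = [sample_index(column(A, s), rng)]  # outcome from likelihood A
    for u in policy:
        s = sample_index(column(B[u], s), rng)    # transition under action u
        states.append(s)
        outcomes.append(sample_index(column(A, s), rng))
    return states, outcomes

rng = random.Random(0)
states, outcomes = sample_trajectory(["stay", "switch", "stay"], rng)
print(states, outcomes)
```

The agent only has access to the sampled outcomes; the sampled states are the hidden causes that it must infer.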

As mentioned above, a substantial body of work justifies

describing certain neuronal representations with discrete state-

space generative models (e.g., Luck & Vogel,1997;Tee & Taylor,

2018;Zhang & Luck,2008). Furthermore, it has been long known

that – at the level of neuronal populations – computations occur

periodically (i.e., in distinct and sometimes nested oscillatory

bands). Similarly, there is evidence for sequential computation

in a number of processes (e.g., attention: Buschman & Miller,

2010; Duncan et al., 1994; Landau & Fries, 2012; visual

perception: Hanslmayr et al., 2013; Rolls & Tovee, 1994) and at

different levels of the neuronal hierarchy (Friston,2008;Friston

et al.,2018b), in line with ideas from hierarchical predictive

processing (Chao et al.,2018;Iglesias et al.,2013). This accom-

modates the fact that visual saccadic sampling of observations

occurs at a frequency of approximately 4 Hz (Parr & Friston,

2018d). The relatively slow presentation of a discrete sequence of

observations enables inferences to be performed in peristimulus

time by (much) faster neuronal dynamics.

Active inference, implicitly, accounts for fast and slow neu-

ronal dynamics. At each time-step the agent observes an out-

come, from which it infers the past, present and future (hidden)

states through perception. This underwrites a plan into the future,

by evaluating (the expected free energy of) possible policies. The

inferred (best) policies specify the most likely action, which is

executed. At a slower timescale, parameters encoding the con-

tingencies of the world (e.g., A), are inferred. This is referred to as

learning. Even more slowly, the structure of the generative model

is updated to better account for available observations—this is

called structure learning. The following sections elucidate these

aspects of the active inference process theory.

This paper will be largely concerned with deriving and in-

terpreting the inferential dynamics that agents might implement

using the generative model in Fig. 2. We leave the discussion of

more complex models to Appendix A, since the derivations are

analogous in those cases.

4. Variational Bayesian inference

4.1. Free energy and model evidence

Variational Bayesian inference rests upon minimisation of a

quantity called (variational) free energy, which bounds the im-

probability (i.e., the surprise) of sensory observations, under a

generative model. Simultaneously, free energy minimisation is

a statistical inference technique that enables the approximation

of the posterior distribution in Bayes rule. In machine learning,

this is known as variational Bayes (Beal,2003;Bishop,2006;Blei

et al.,2017;Jordan et al.,1998). Active inference agents minimise

variational free energy, enabling concomitant maximisation of

their model evidence and inference of the latent variables of their

generative model. In the following, we fix a particular time

point $t \in \{1, \ldots, T\}$, by which the agent has observed a

sequence of outcomes $o_{1:t}$. The posterior about the latent causes

of sensory data is given by Bayes rule:

$$
P(s_{1:T}, A, \pi \mid o_{1:t}) = \frac{P(o_{1:t} \mid s_{1:T}, A, \pi)\, P(s_{1:T}, A, \pi)}{P(o_{1:t})} \tag{1}
$$

Note that the policy $\pi$ is a random variable. This entails planning

as inferring the best action sequence from observations (Attias,

2003;Botvinick & Toussaint,2012). Computing the posterior

distribution requires computing the model evidence

$P(o_{1:t}) = \sum_{\pi \in \Pi} \sum_{s_{1:T} \in S^T} \int P(o_{1:t}, s_{1:T}, A, \pi)\, dA$, which is intractable for

complex generative models embodied by biological and artifi-

cial systems (Friston,2008)—a well-known problem in Bayesian

statistics. An alternative to computing the exact posterior distri-

bution is to optimise an approximate posterior distribution over

latent causes $Q(s_{1:T}, A, \pi)$, by minimising its Kullback–Leibler

(KL) divergence $D_{KL}$ (Kullback & Leibler, 1951) from the true

posterior, a non-negative measure of discrepancy between probability distributions. We

can use the definition of the KL divergence and Bayes rule to

arrive at the variational free energy F, which is a functional of

approximate posterior beliefs:

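$$
\begin{aligned}
0 &\le D_{KL}[Q(s_{1:T}, A, \pi) \,\|\, P(s_{1:T}, A, \pi \mid o_{1:t})] \\
&= E_{Q(s_{1:T}, A, \pi)}[\log Q(s_{1:T}, A, \pi) - \log P(s_{1:T}, A, \pi \mid o_{1:t})] \\
&= E_{Q(s_{1:T}, A, \pi)}[\log Q(s_{1:T}, A, \pi) - \log P(o_{1:t}, s_{1:T}, A, \pi) + \log P(o_{1:t})] \\
&= \underbrace{E_{Q(s_{1:T}, A, \pi)}[\log Q(s_{1:T}, A, \pi) - \log P(o_{1:t}, s_{1:T}, A, \pi)]}_{=:\, F[Q(s_{1:T}, A, \pi)]} + \log P(o_{1:t}) \\
\Rightarrow\ -\log P(o_{1:t}) &\le F[Q(s_{1:T}, A, \pi)]
\end{aligned}
\tag{2}
$$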

From (2), one can see that varying $Q$ to minimise the variational

free energy enables us to approximate the true posterior,

while simultaneously ensuring that surprise remains low. The

former offers the intuitive interpretation of the free energy as

a generalised prediction error, as minimising free energy cor-

responds to suppressing the discrepancy between predictions,

i.e., Q, and the actual state of affairs, i.e., the posterior; and indeed

for a particular class of generative models, we recover the predic-

tion error given by predictive coding schemes (see Bogacz,2017;

Buckley et al., 2017; Friston et al., 2007). Altogether, this means

that variational free energy minimising agents, simultaneously,

infer the latent causes of their observations and maximise the

evidence for their generative model. One should note that the

free energy equals the surprise $-\log P(o_{1:t})$ only at the global free

energy minimum, when the approximate posterior $Q(s_{1:T}, A, \pi)$

equals the true posterior $P(s_{1:T}, A, \pi \mid o_{1:t})$. Outside of the global

free energy minimum, the free energy upper bounds the surprise,

in which case, since the true posterior is generally intractable, the

tightness of the bound is generally unknowable.

Fig. 2. Example of a discrete state-space generative model. Panel 2a specifies the form of the generative model, which is how the agent represents the world. The

generative model is a joint probability distribution over (hidden) states, outcomes and other variables that cause outcomes. In this representation, states unfold

in time causing an observation at each time-step. The likelihood matrix A encodes the probabilities of state–outcome pairs. The policy π specifies which action

to perform at each time-step. Note that the agent’s preferences may be specified either in terms of states or outcomes. It is important to distinguish between

states (resp. outcomes) that are random variables, and the possible values that they can take in S (resp. in O), which we refer to as possible states (resp. possible

outcomes). Note that this type of representation comprises a finite number of timesteps, actions, policies, states, outcomes, possible states and possible outcomes.

In Panel 2b, the generative model is displayed as a probabilistic graphical model (Bishop, 2006; Jordan et al., 1998; Pearl, 1988, 1998) expressed in factor graph

form (Loeliger, 2004). The variables in circles are random variables, while squares represent factors, whose specific forms are given in Panel 2a. The arrows represent

causal relationships (i.e., conditional probability distributions). The variables highlighted in grey can be observed by the agent, while the remaining variables are

inferred through approximate Bayesian inference (see Section 4) and called hidden or latent variables. Active inference agents perform inference by optimising

the parameters of an approximate posterior distribution (see Section 4). Panel 2c specifies how this approximate posterior factorises under a particular mean-field

approximation (Tanaka, 1999), although other factorisations may be used (Parr, Markovic et al., 2019; Schwöbel et al., 2018). A glossary of terms used in this figure is

available in Table 2. The mathematical yoga of generative models is heavily dependent on Markov blankets. The Markov blanket of a random variable in a probabilistic

graphical model comprises those variables that share a common factor. Crucially, a variable conditioned upon its Markov blanket is conditionally independent of all other

variables. We will use this property extensively (and implicitly) in the text.

To aid intuition, the variational free energy can be rearranged

into complexity and accuracy:

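$$
F[Q(s_{1:T}, A, \pi)] = \underbrace{D_{KL}[Q(s_{1:T}, A, \pi) \,\|\, P(s_{1:T}, A, \pi)]}_{\text{Complexity}} - \underbrace{E_{Q(s_{1:T}, A, \pi)}[\log P(o_{1:t} \mid s_{1:T}, A, \pi)]}_{\text{Accuracy}} \tag{3}
$$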

The first term of (3) can be regarded as complexity: a simple

explanation for observable data Q, which makes few assumptions

over and above the prior (i.e., with KL divergence close to zero),

is a good explanation. In other words, a good explanation is an

accurate account of some data that requires minimal movement

for updating of prior to posterior beliefs (c.f., Occam’s principle).

The second term is accuracy; namely, the expected log probability of

the data under posterior beliefs Q about the latent causes; in other words,

how well the generative model fits the observed data. The idea

that neural representations weigh complexity against accuracy

underwrites the imperative to find the most accurate explanation

for sensory observations that is minimally complex, which has

been leveraged by things like Horace Barlow’s principle of min-

imum redundancy (Barlow,2001) and subsequently supported

empirically (Dan et al.,1996;Lewicki,2002;Olshausen & Field,

2004;Olshausen & O’Connor,2002). Fig. 3 illustrates the various

implications of minimising free energy.
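
The bound in (2) and the decomposition in (3) can be checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions: a single hidden state with two possible values, one observed outcome, a known likelihood, and no policies or parameter uncertainty. All numerical values are hypothetical.

```python
import math

prior = [0.5, 0.5]                   # P(s)
likelihood = [[0.9, 0.2],            # likelihood[o][s] = P(o | s)
              [0.1, 0.8]]
o = 0                                # index of the observed outcome

def free_energy(q):
    """F[Q] = E_Q[log Q(s) - log P(o, s)], the quantity defined in (2)."""
    return sum(q[s] * (math.log(q[s]) - math.log(likelihood[o][s] * prior[s]))
               for s in range(len(q)))

def complexity(q):
    """KL divergence from the prior to the approximate posterior."""
    return sum(q[s] * math.log(q[s] / prior[s]) for s in range(len(q)))

def accuracy(q):
    """Expected log-likelihood of the observed outcome under Q."""
    return sum(q[s] * math.log(likelihood[o][s]) for s in range(len(q)))

evidence = sum(likelihood[o][s] * prior[s] for s in range(len(prior)))
surprise = -math.log(evidence)
posterior = [likelihood[o][s] * prior[s] / evidence for s in range(len(prior))]

for q in ([0.5, 0.5], [0.7, 0.3], posterior):
    F = free_energy(q)
    gap = F - surprise               # equals KL[Q || true posterior], so >= 0
    print(f"F={F:.4f}  complexity-accuracy={complexity(q) - accuracy(q):.4f}  "
          f"surprise={surprise:.4f}  gap={gap:.4f}")
```

For every choice of Q the free energy is at least as large as the surprise, and the gap closes only when Q equals the exact posterior.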

4.2. On the family of approximate posteriors

The goal is now to minimise variational free energy with

respect to Q. To obtain a tractable expression for the variational

free energy, we need to assume a certain simplifying factori-

sation of the approximate posterior. There are many possible

forms (e.g., mean-field, marginal, Bethe, see Heskes,2006;Parr,

Markovic et al.,2019;Yedidia et al.,2005), each of which trades

off the quality of the inferences with the complexity of the

computations involved. For the purpose of this paper we use

a particular structured mean-field approximation (see Table 2

for an explanation of the different distributions and variables in

play):

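$$
Q(s_{1:T}, A, \pi) = Q(A)\, Q(\pi) \prod_{\tau=1}^{T} Q(s_\tau \mid \pi) \tag{4}
$$

$$
\begin{aligned}
Q(s_\tau \mid \pi) &= \mathrm{Cat}(\mathbf{s}_{\pi\tau}), & \mathbf{s}_{\pi\tau} &\in \{x \in \mathbb{R}^m \mid x_i > 0, \textstyle\sum_i x_i = 1\} \\
Q(\pi) &= \mathrm{Cat}(\boldsymbol{\pi}), & \boldsymbol{\pi} &\in \{x \in \mathbb{R}^{|\Pi|} \mid x_i > 0, \textstyle\sum_i x_i = 1\} \\
Q(A) &= \prod_{i=1}^{m} Q(A_{\bullet i}), \quad Q(A_{\bullet i}) = \mathrm{Dir}(\mathbf{a}_{\bullet i}), & \mathbf{a}_{\bullet i} &\in (\mathbb{R}_{>0})^n
\end{aligned}
$$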

Table 2

Glossary of terms and notation.

| Notation | Meaning | Type |
|---|---|---|
| $S$ | Set of all possible (hidden) states. | Finite set of cardinality $m > 0$. |
| $s_\tau$ | (Hidden) state at time $\tau$. In computations, if $s_\tau$ evaluates to the $i$th possible state, then interpret it as the $i$th unit vector in $\mathbb{R}^m$. | Random variable over $S$. |
| $s_{1:t}$ | Sequence of hidden states $s_1, \ldots, s_t$. | Random variable over $S \times \cdots \times S$ ($t$ times) $= S^t$. |
| $O$ | Set of all possible outcomes. | Finite set of cardinality $n > 0$. |
| $o_\tau$ | Outcome at time $\tau$. In computations, if $o_\tau$ evaluates to the $j$th possible outcome, then interpret it as the $j$th unit vector in $\mathbb{R}^n$. | Random variable over $O$. |
| $o_{1:t}$ | Sequence of outcomes $o_1, \ldots, o_t$. | Random variable over $O \times \cdots \times O$ ($t$ times) $= O^t$. |
| $T$ | Number of timesteps in a trial of observation epochs under the generative model. | Positive integer. |
| $U$ | Set of all possible actions. | Finite set. |
| $\Pi$ | Set of all allowable policies; i.e., action sequences indexed in time. | Finite subset of $U \times \cdots \times U$ ($T$ times) $= U^T$. |
| $\pi$ | Policy or actions sequence indexed in time. | Random variable over $\Pi$, or element of $\Pi$ depending on context. |
| $Q$ | Approximate posterior distribution over the latent variables of the generative model $s_{1:T}, A, \pi$. | Scalar valued probability distribution over $S \times \{x \in \mathbb{R}^n \mid x_i > 0, \sum_i x_i = 1\}^m \times \Pi$. |
| $F, F_\pi$ | Variational free energy and variational free energy conditioned upon a policy. | Functionals of $Q$ that evaluate to a scalar quantity. |
| $G$ | Expected free energy. | Function defined on $\Pi$ that evaluates to a scalar quantity. |
| $\mathrm{Cat}$ | Categorical distribution; probability distribution over a finite set assigning strictly positive probabilities. | Probability distribution over a finite set of cardinality $k$ parameterised by a real valued vector of probabilities in $\{x \in \mathbb{R}^k \mid x_i > 0, \sum_i x_i = 1\}$. |
| $\mathrm{Dir}$ | Dirichlet distribution (conjugate prior of the categorical distribution). Probability distribution over the parameter space of the categorical distribution, parameterised by a vector of positive reals. | Probability distribution over $\{x \in \mathbb{R}^k \mid x_i > 0, \sum_i x_i = 1\}$, itself parameterised by an element of $(\mathbb{R}_{>0})^k$. |
| $X_{\bullet i}, X_{ki}$ | $i$th column and $(k,i)$th element of matrix $X$. | Matrix indexing convention. |
| $\cdot, \otimes, \odot, {}^{\odot}$ | Respectively inner product, Kronecker product, element-wise product and element-wise power. Following existing active inference literature, we adopt the convention $X \cdot Y := X^T Y$ for matrices. | Operation on vectors and matrices. |
| $A$ | Likelihood matrix. The probability of the state–outcome pair $o_\tau, s_\tau$, namely $P(o_\tau \mid s_\tau, A)$, is given by $o_\tau \cdot A s_\tau$. | Random variable over the subset of $M_{n \times m}(\mathbb{R})$ with columns in $\{x \in \mathbb{R}^n \mid x_i > 0, \sum_i x_i = 1\}$. |
| $B_{\pi_{\tau-1}}$ | Matrix of transition probabilities from one state to the next state given action $\pi_{\tau-1}$. The probability of possible state $s_\tau$, given $s_{\tau-1}$ and action $\pi_{\tau-1}$, is $s_\tau \cdot B_{\pi_{\tau-1}} s_{\tau-1}$. | Matrix in $M_{m \times m}(\mathbb{R})$ with columns in $\{x \in \mathbb{R}^m \mid x_i > 0, \sum_i x_i = 1\}$. |
| $D$ | Vector of probabilities of initial state. The probability of the $i$th possible state occurring at time 1 is $D_i$. | Vector of probabilities in $\{x \in \mathbb{R}^m \mid x_i > 0, \sum_i x_i = 1\}$. |
| $a, \mathbf{a}$ | Parameters of prior and approximate posterior beliefs about $A$. | Matrices in $M_{n \times m}(\mathbb{R}_{>0})$. |
| $a_0, \mathbf{a}_0$ | Matrices of the same size as $a, \mathbf{a}$, with homogeneous columns; any of its $i$th column elements are denoted by $a_{i0}, \mathbf{a}_{i0}$ and defined by $a_{i0} = \sum_{j=1}^{n} a_{ji}$, $\mathbf{a}_{i0} = \sum_{j=1}^{n} \mathbf{a}_{ji}$. | Matrices in $M_{n \times m}(\mathbb{R}_{>0})$. |
| $\log, \Gamma, \psi$ | Natural logarithm, gamma function and digamma function. By convention these functions are taken component-wise on vectors and matrices. | Functions. |
| $E_{P(X)}[f(X)]$ | Expectation of a random variable $f(X)$ under a probability density $P(X)$, taken component-wise if $f(X)$ is a matrix. $E_{P(X)}[f(X)] := \int f(X) P(X)\, dX$ | Real-valued operator on random variables. |
| $\mathbf{A}$ | $\mathbf{A} := E_{Q(A)}[A] = \mathbf{a} \odot \mathbf{a}_0^{\odot(-1)}$ | Matrix in $M_{n \times m}(\mathbb{R}_{>0})$. |
| $\mathbf{logA}$ | $\mathbf{logA} := E_{Q(A)}[\log A] = \psi(\mathbf{a}) - \psi(\mathbf{a}_0)$. Note that $\mathbf{logA} \neq \log \mathbf{A}$! | Matrix in $M_{n \times m}(\mathbb{R})$. |
| $\sigma$ | Softmax function or normalised exponential. $\sigma(x)_k = e^{x_k} / \sum_i e^{x_i}$ | Function $\mathbb{R}^k \to \{x \in \mathbb{R}^k \mid x_i > 0, \sum_i x_i = 1\}$ |
| $H[P]$ | Shannon entropy of a probability distribution $P$. Explicitly, $H[P] = E_{P(x)}[-\log P(x)]$ | Functional over probability distributions. |

This choice is driven by didactic purposes and by the fact that this

factorisation has been used extensively in the active inference

literature (Friston, FitzGerald et al., 2017; Friston, Parr et al., 2017;

Friston et al.,2018b). However, the most recent software im-

plementation of active inference (available in spm_MDP_VB_X.m)

employs a marginal approximation (Parr,2019;Parr, Markovic

et al.,2019), which retains the simplicity and biological inter-

pretation of the neuronal dynamics afforded by the mean-field

approximation, while approximating the more accurate infer-

ences of the Bethe approximation. For these reasons, the marginal

free energy currently stands as the most biologically plausible.

4.3. Computing the variational free energy

The next sections focus on producing biologically plausible

neuronal dynamics that perform perception and learning based

on variational free energy minimisation. To enable this, we first

compute the variational free energy, using the factorisations of

the generative model and approximate posterior (c.f., Fig. 2):

$$
\begin{aligned}
F[Q(s_{1:T}, A, \pi)] &= E_{Q(s_{1:T}, A, \pi)}[\log Q(s_{1:T}, A, \pi) - \log P(o_{1:t}, s_{1:T}, A, \pi)] \\
&= E_{Q(s_{1:T}, A, \pi)}\Big[\log Q(A) + \log Q(\pi) + \sum_{\tau=1}^{T} \log Q(s_\tau \mid \pi) \\
&\qquad - \log P(A) - \log P(\pi) - \log P(s_1) \\
&\qquad - \sum_{\tau=2}^{T} \log P(s_\tau \mid s_{\tau-1}, \pi) - \sum_{\tau=1}^{t} \log P(o_\tau \mid s_\tau, A)\Big] \\
&= D_{KL}[Q(A) \,\|\, P(A)] + D_{KL}[Q(\pi) \,\|\, P(\pi)] + E_{Q(\pi)}[F_\pi[Q(s_{1:T} \mid \pi)]]
\end{aligned}
\tag{5}
$$

where

$$
\begin{aligned}
F_\pi[Q(s_{1:T} \mid \pi)] := {} & \sum_{\tau=1}^{T} E_{Q(s_\tau \mid \pi)}[\log Q(s_\tau \mid \pi)] - \sum_{\tau=1}^{t} E_{Q(s_\tau \mid \pi) Q(A)}[\log P(o_\tau \mid s_\tau, A)] \\
& - E_{Q(s_1 \mid \pi)}[\log P(s_1)] - \sum_{\tau=2}^{T} E_{Q(s_\tau \mid \pi) Q(s_{\tau-1} \mid \pi)}[\log P(s_\tau \mid s_{\tau-1}, \pi)]
\end{aligned}
\tag{6}
$$

is the variational free energy conditioned upon pursuing a

particular policy. This is the same quantity that we would have

obtained by omitting $A$ and conditioning all probability

distributions in the numerators of (1) on $\pi$. In the next section, we will

see how perception can be framed in terms of variational free

energy minimisation.

Fig. 3. Markov blankets and self-evidencing. This schematic illustrates the various interpretations of minimising variational free energy. Recall that the existence of

a Markov blanket implies a certain lack of influences among internal, blanket and external states. These independencies have an important consequence; internal

and active states are the only states that are not influenced by external states, which means their dynamics (i.e., perception and action) are a function of, and

only of, particular states (i.e., internal, sensory and active states); here, the variational (free energy) bound on surprise. This surprise has a number of interesting

interpretations. Given it is the negative log probability of finding a particle or creature in a particular state, minimising surprise corresponds to maximising the

value of a particle’s state. This interpretation is licensed by the fact that the states with a high probability are, by definition, attracting states. On this view, one

can then spin-off an interpretation in terms of reinforcement learning (Barto & Sutton, 1992), optimal control theory (Todorov & Jordan, 2002) and, in economics,

expected utility theory (Bossaerts & Murawski, 2015). Indeed, any scheme predicated on the optimisation of some objective function can now be cast in terms of

minimising surprise – in terms of perception and action (i.e., the dynamics of internal and active states) – by specifying these optimal values to be the agent’s

preferences. The minimisation of surprise (i.e., self-information) leads to a series of influential accounts of neuronal dynamics; including the principle of maximum

mutual information (Linsker, 1990; Optican & Richmond, 1987), the principles of minimum redundancy and maximum efficiency (Barlow, 1961) and the free energy

principle (Friston et al., 2006). Crucially, the average or expected surprise (over time or particular states of being) corresponds to entropy. This means that action and

perception look as if they are minimising entropy. This leads us to theories of self-organisation, such as synergetics in physics (Haken, 1978; Kauffman, 1993; Nicolis

& Prigogine, 1977) or homeostasis in physiology (Ashby, 1947; Bernard, 1974; Conant & Ashby, 1970). Finally, the probability of any blanket states given a Markov

blanket (m) is, on a statistical view, model evidence (MacKay, 1995, 2003). This means that all the above formulations are internally consistent with things like the

Bayesian brain hypothesis, evidence accumulation and predictive coding; most of which inherit from Helmholtz’s notion of unconscious inference (von Helmholtz &

Southall, 1962), later unpacked in terms of perception as hypothesis testing in 20th century psychology (Gregory, 1980) and machine learning (Dayan et al., 1995).

5. Perception

In active inference, perception is equated with state estima-

tion (Friston, FitzGerald et al.,2017) (e.g., inferring the tempera-

ture from the sensation of warmth), consistent with the idea that

perceptions are hypotheses (Gregory,1980). To infer the (past,

present and future) states of the environment, an agent must

minimise the variational free energy with respect to Q(s1:T|π) for

each policy π. This provides the agent’s inference over hidden

states, contingent upon pursuing a given policy. Since the only

part of the free energy that depends on $Q(s_{1:T} \mid \pi)$ is $F_\pi$, the

agent must simply minimise $F_\pi$. Substituting $Q(s_\tau \mid \pi)$ by their

sufficient statistics (i.e., the vector of parameters $\mathbf{s}_{\pi\tau}$), $F_\pi$ becomes

a function of those parameters. This enables us to rewrite (6)

conveniently in matrix form (see Appendix B for details):

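$$
F_\pi(\mathbf{s}_{\pi 1}, \ldots, \mathbf{s}_{\pi T}) = \sum_{\tau=1}^{T} \mathbf{s}_{\pi\tau} \cdot \log \mathbf{s}_{\pi\tau} - \sum_{\tau=1}^{t} o_\tau \cdot \mathbf{logA}\, \mathbf{s}_{\pi\tau} - \mathbf{s}_{\pi 1} \cdot \log D - \sum_{\tau=2}^{T} \mathbf{s}_{\pi\tau} \cdot \log(B_{\pi_{\tau-1}})\, \mathbf{s}_{\pi\tau-1} \tag{7}
$$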

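This enables us to compute the variational free energy

gradients (Petersen & Pedersen, 2012):

$$
\nabla_{\mathbf{s}_{\pi\tau}} F_\pi(\mathbf{s}_{\pi 1}, \ldots, \mathbf{s}_{\pi T}) = \vec{1} + \log \mathbf{s}_{\pi\tau} -
\begin{cases}
o_\tau \cdot \mathbf{logA} + \mathbf{s}_{\pi\tau+1} \cdot \log(B_{\pi_\tau}) + \log D & \text{if } \tau = 1 \\
o_\tau \cdot \mathbf{logA} + \mathbf{s}_{\pi\tau+1} \cdot \log(B_{\pi_\tau}) + \log(B_{\pi_{\tau-1}})\, \mathbf{s}_{\pi\tau-1} & \text{if } 1 < \tau \le t \\
\mathbf{s}_{\pi\tau+1} \cdot \log(B_{\pi_\tau}) + \log(B_{\pi_{\tau-1}})\, \mathbf{s}_{\pi\tau-1} & \text{if } \tau > t
\end{cases}
\tag{8}
$$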

The neuronal dynamics are given by a gradient descent on

free energy (Friston, FitzGerald et al.,2017), with state-estimation

expressed as a softmax function of accumulated (negative) free

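energy gradients, that we denote by $v_{\pi\tau}$ (see Section 5.1 for an

interpretation). The constant term $\vec{1}$ is generally omitted since the

softmax function removes it anyway.

$$
\begin{aligned}
\dot{v}_{\pi\tau}(\mathbf{s}_{\pi 1}, \ldots, \mathbf{s}_{\pi T}) &= -\nabla_{\mathbf{s}_{\pi\tau}} F_\pi(\mathbf{s}_{\pi 1}, \ldots, \mathbf{s}_{\pi T}) \\
\mathbf{s}_{\pi\tau} &= \sigma(v_{\pi\tau})
\end{aligned}
\tag{9}
$$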

The softmax function σ– a generalisation of the sigmoid to

vector inputs – is a natural choice as the variational free energy

gradient is a logarithm and the components of sπτ must sum

to one. Note the continuous time gradient descent on the free

energy (9); although we focus on active inference with discrete

generative models, this does not preclude the belief updating

from occurring in continuous time (this is particularly important

when relating these dynamics to neurobiological processes, see

below). Yet, any numerical implementation of active inference

would implement a discretised version of (9) until convergence,

for example

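$$
\begin{aligned}
v^{(k)}_{\pi\tau} &= v^{(k-1)}_{\pi\tau} - \kappa \nabla_{\mathbf{s}^{(k-1)}_{\pi\tau}} F_\pi(\mathbf{s}^{(k-1)}_{\pi 1}, \ldots, \mathbf{s}^{(k-1)}_{\pi T}) \quad \text{for small } \kappa > 0 \\
\mathbf{s}^{(k)}_{\pi\tau} &= \sigma(v^{(k)}_{\pi\tau}).
\end{aligned}
$$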
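
As an illustration, the discretised update can be sketched in a few lines of plain Python for a single policy. This is a minimal sketch under stated assumptions, not the scheme implemented in any particular toolbox: the likelihood matrix is treated as known, so that log A stands in for the expected log-likelihood of Table 2; the constant vector of ones in (8) is dropped; and the model, step size `kappa` and iteration count are hypothetical.

```python
import math

def softmax(v):
    top = max(v)                                  # subtract max for stability
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def log_matrix(M):
    return [[math.log(x) for x in row] for row in M]

def mat_vec(M, x):                                # (M x)_k = sum_i M[k][i] x[i]
    return [sum(M[k][i] * x[i] for i in range(len(x))) for k in range(len(M))]

def mat_T_vec(M, x):                              # (M^T x)_i = sum_k M[k][i] x[k]
    return [sum(M[k][i] * x[k] for k in range(len(x))) for i in range(len(M[0]))]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def free_energy(s, obs, logA, logB, logD):
    """F_pi of (7); s[tau] is the belief over states at 0-indexed time tau."""
    F = sum(dot(st, [math.log(p) for p in st]) for st in s)
    F -= sum(dot(logA[o], s[tau]) for tau, o in enumerate(obs))
    F -= dot(s[0], logD)
    F -= sum(dot(s[tau], mat_vec(logB[tau - 1], s[tau - 1]))
             for tau in range(1, len(s)))
    return F

def gradient(s, tau, obs, logA, logB, logD):
    """Gradient (8) at time tau, without the constant vector of ones."""
    msg = [0.0] * len(s[tau])
    if tau < len(obs):                            # likelihood term (tau <= t)
        msg = [a + b for a, b in zip(msg, logA[obs[tau]])]
    if tau + 1 < len(s):                          # term from the next time step
        msg = [a + b for a, b in zip(msg, mat_T_vec(logB[tau], s[tau + 1]))]
    if tau == 0:                                  # initial-state prior
        msg = [a + b for a, b in zip(msg, logD)]
    else:                                         # term from the previous step
        msg = [a + b for a, b in zip(msg, mat_vec(logB[tau - 1], s[tau - 1]))]
    return [math.log(p) - g for p, g in zip(s[tau], msg)]

def infer_states(obs, A, B, D, T, kappa=0.25, n_iter=200):
    logA, logD = log_matrix(A), [math.log(d) for d in D]
    logB = [log_matrix(Bt) for Bt in B]
    v = [[0.0] * len(D) for _ in range(T)]        # start from uniform beliefs
    s = [softmax(vt) for vt in v]
    for _ in range(n_iter):
        grads = [gradient(s, tau, obs, logA, logB, logD) for tau in range(T)]
        v = [[x - kappa * g for x, g in zip(vt, gt)] for vt, gt in zip(v, grads)]
        s = [softmax(vt) for vt in v]
    return s

A = [[0.9, 0.1], [0.1, 0.9]]                      # A[j][i] = P(o = j | s = i)
B_step = [[0.9, 0.1], [0.1, 0.9]]                 # B_step[k][i] = P(s' = k | s = i)
D = [0.5, 0.5]
obs, T = [0, 0], 3                                # t = 2 outcomes seen, horizon T = 3
B = [B_step] * (T - 1)

s = infer_states(obs, A, B, D, T)
logs = (log_matrix(A), [log_matrix(Bt) for Bt in B], [math.log(d) for d in D])
F_start = free_energy([[0.5, 0.5]] * T, obs, *logs)
F_end = free_energy(s, obs, *logs)
print([[round(p, 3) for p in st] for st in s], round(F_start, 3), round(F_end, 3))
```

In this toy setting, beliefs about the unobserved future time step are shaped only by the transition term from the preceding time step, which is how past observations inform predictions about the future.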

5.1. Plausibility of neuronal dynamics

The temporal dynamics expressed in (9) unfold at a much

faster timescale than the sampling of new observations (i.e.,

within timesteps) and correspond to fast neuronal processing

in peristimulus time. This is consistent with behaviour-relevant

computations at frequencies that are higher than the rate of

visual sampling (e.g., working memory (Lundqvist et al.,2016),

visual stimulus perception in humans (Hanslmayr et al.,2013)

and macaques (Rolls & Tovee,1994)).

Furthermore, these dynamics (9) are consistent with predic-

tive processing (Bastos et al.,2012;Rao & Ballard,1999) – since

active inference prescribes dynamics that minimise prediction

error – although they generalise it to a wide range of generative

models. Note that, while also a variational free energy, this sort

of prediction error (7) is not the same as that given by predictive

coding schemes (which rely upon a certain kind of continuous

state-space generative model, see Bogacz,2017;Buckley et al.,

2017;Friston et al.,2007).

Just as neuronal dynamics involve translation from post-

synaptic potentials to firing rates, (9) involves translating from

a vector of real numbers (v), to a vector whose elements are

bounded between zero and one (sπτ ); via the softmax function. As

a result, it is natural to interpret the components of vas the av-

erage membrane potential of distinct neural populations, and sπτ

as the average firing rate of those populations, which is bounded

thanks to neuronal refractory periods. This is consistent with

mean-field formulations of neural population dynamics, in that

the average firing rate of a neuronal population follows a sigmoid

function of the average membrane potential (Deco et al.,2008;

Marreiros et al.,2008;Moran et al.,2013). Using the fact that a

softmax function is a generalisation of the sigmoid to vector in-

puts – here the average membrane potentials of coupled neuronal

populations – it follows that their average firing follows a softmax

function of their average potential. In this context, the softmax

function may be interpreted as performing lateral inhibition,

which can be thought of as leading to narrower tuning curves of

individual neurons and thereby sharper inferences (Von Békésy,

1967). Importantly, this tells us that state-estimation can be

performed in parallel by different neuronal populations, and a

simple neuronal architecture is sufficient to implement these

dynamics (see Parr, Markovic et al. (2019, Figure 6)).

Lastly, interpreting the dynamics in this way has a

degree of face validity, as it enables us to synthesise a wide

range of biologically plausible electrophysiological responses;

including repetition suppression, mismatch negativity, violation

responses, place-cell activity, phase precession, theta sequences,

theta–gamma coupling, evidence accumulation, race-to-bound

dynamics and transfer of dopamine responses (Friston, FitzGer-

ald et al.,2017;Schwartenbeck, FitzGerald, Mathys, Dolan and

Friston,2015).

The neuronal dynamics for state estimation coincide with vari-

ational message passing (Dauwels,2007;Winn & Bishop,2005),

a popular algorithm for approximate Bayesian inference. This

follows, as we have seen, from free energy minimisation under

a particular mean-field approximation (4). If one were to use the

Bethe approximation, the corresponding dynamics coincide with

belief propagation (Bishop,2006;Loeliger,2004;Parr, Markovic

et al.,2019;Schwöbel et al.,2018;Yedidia et al.,2005), another

widely used algorithm for approximate inference. This offers a

formal connection between active inference and message pass-

ing interpretations of neuronal dynamics (Dauwels et al.,2007;

Friston, Parr et al.,2017;George,2005). In the next section, we

examine planning, decision-making and action selection.

6. Planning, decision-making and action selection

So far, we have focused on optimising beliefs about hidden

states under a particular policy by minimising a variational free

energy functional of an approximate posterior over hidden states,

under each policy.

In this section, we explain how planning and decision-making

arise as a minimisation of expected free energy—a function scor-

ing the goodness of each possible future course of action. We

briefly motivate how the expected free energy arises from first-

principles. This allows us to frame decision-making and action-

selection in terms of expected free energy minimisation. Finally,

we conclude by discussing the computational cost of planning

into the future.

6.1. Planning and decision-making

At the heart of active inference is a description of agents

that strive to attain a target distribution specifying the range of

preferred states of being, given a sufficient amount of time. To

work towards reaching these preferences, agents select policies

Q(π), such that their predicted states Q(sτ,A) at some future time

point τ > t(usually, the time horizon of a policy T) reach the

preferred states P(sτ,A), which are specified by the generative

model. These considerations allow us to show in Appendix C

that the requisite approximate posterior over policies Q(π) is a

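softmax function of the negative expected free energy $G$ (see footnote 4):

$$
\begin{aligned}
Q(\pi) &= \sigma(-G(\pi)) \\
G(\pi) &= \underbrace{D_{KL}[Q(s_\tau, A \mid \pi) \,\|\, P(s_\tau, A)]}_{\text{Risk}} \; \underbrace{-\, E_{Q(s_\tau, A \mid \pi) P(o_\tau \mid s_\tau, A)}[\log P(o_\tau \mid s_\tau, A)]}_{\text{Ambiguity}}
\end{aligned}
\tag{10}
$$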

By risk we mean the divergence between predicted states and a

priori preferred states in the future (e.g., the quantification of losses

as in financial risk) and ambiguity is the uncertainty associated

with future observations, given states. This means that the most

likely (i.e., best) policies minimise expected free energy. This

ensures that future courses of action are exploitative (i.e., risk

minimising) and explorative (i.e., ambiguity minimising). In par-

ticular, the expected free energy balances goal-seeking and itin-

erant novelty-seeking behaviour, given some prior preferences

or goals. Note that the ambiguity term rests on an expecta-

tion over fictive (i.e., predicted) outcomes under beliefs about

future states. This means that optimising beliefs about future

states during perception is crucial to accurately predict future

outcomes during planning. In summary, planning corresponds to

evaluating the expected free energy of different policies, which

scores their goodness in relation to prior preferences, while

decision-making corresponds to forming approximate posterior

beliefs about policies.
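
The risk and ambiguity terms of (10) can be evaluated directly for a toy problem. The sketch below makes a simplifying assumption that is not part of the general formulation: the likelihood matrix A is treated as known, so that risk reduces to a KL divergence over states and ambiguity to the expected entropy of the likelihood. The predicted states, preferences and matrix entries are hypothetical.

```python
import math

def softmax(x):
    top = max(x)
    exps = [math.exp(v - top) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def kl(q, p):
    """KL divergence between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def expected_free_energy(q_s, preferred_s, A):
    """Risk plus ambiguity, as in (10), assuming a known likelihood A."""
    risk = kl(q_s, preferred_s)
    # Ambiguity: entropy of each likelihood column, averaged over predicted states.
    ambiguity = -sum(
        q_s[i] * sum(A[j][i] * math.log(A[j][i]) for j in range(len(A)))
        for i in range(len(q_s))
    )
    return risk + ambiguity

A = [[0.9, 0.5],                     # A[j][i] = P(o = j | s = i); state 1 is ambiguous
     [0.1, 0.5]]
preferred = [0.8, 0.2]               # prior preferences over future states
predicted = {                        # predicted future states under each policy
    "policy_1": [0.9, 0.1],
    "policy_2": [0.2, 0.8],
}

G = {name: expected_free_energy(q, preferred, A) for name, q in predicted.items()}
Q_pi = dict(zip(G, softmax([-g for g in G.values()])))
print(G, Q_pi)
```

Here the first policy leads towards the preferred and less ambiguous state, so it has the lower expected free energy and the higher posterior probability.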

6.2. Action selection, policy-independent state-estimation

Approximate posterior beliefs about policies allow us to obtain

the most plausible action as the most likely under all policies;

this can be expressed as a Bayesian model average

$$
u_t = \arg\max_{u \in U} \Bigg( \sum_{\pi \in \Pi,\ \pi_t = u} Q(\pi) \Bigg). \tag{11}
$$

In addition, we obtain a policy independent state-estimation

at any time point Q(sτ), τ ∈ {1,...,T}, as a Bayesian model av-

erage of approximate posterior beliefs about hidden states under

policies, which may be expressed in terms of the distribution’s

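parameters ($Q(s_\tau) = \mathrm{Cat}(\mathbf{s}_\tau)$, $Q(s_\tau \mid \pi) = \mathrm{Cat}(\mathbf{s}_{\pi\tau})$):

$$
Q(s_\tau) = \sum_{\pi \in \Pi} Q(s_\tau \mid \pi)\, Q(\pi) \iff \mathbf{s}_\tau = \sum_{\pi \in \Pi} \mathbf{s}_{\pi\tau}\, Q(\pi) \tag{12}
$$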

Note that these Bayesian model averages may be implemented

by neuromodulatory mechanisms (FitzGerald et al.,2014).
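
Both Bayesian model averages, (11) and (12), amount to weighted sums over policies. The following sketch illustrates them with hypothetical policies, posterior probabilities and state beliefs; the function names are illustrative only.

```python
def select_action(policies, q_pi, t):
    """Pick the action at 0-indexed time t with the largest summed Q(pi), as in (11)."""
    totals = {}
    for policy, q in zip(policies, q_pi):
        totals[policy[t]] = totals.get(policy[t], 0.0) + q
    return max(totals, key=totals.get), totals

def average_states(s_pi, q_pi):
    """Policy-independent state beliefs at one time point, as in (12)."""
    m = len(s_pi[0])
    return [sum(q * s[i] for s, q in zip(s_pi, q_pi)) for i in range(m)]

policies = [("left", "left"), ("left", "right"), ("right", "right")]
q_pi = [0.3, 0.3, 0.4]                            # approximate posterior Q(pi)
s_pi = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]       # Q(s_tau | pi) for each policy

action, totals = select_action(policies, q_pi, t=0)
s_avg = average_states(s_pi, q_pi)
print(action, totals, s_avg)
```

In this example the selected first action is "left", even though the single most probable policy begins with "right": the action is scored by the total probability of all policies that prescribe it.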

6.3. Biological plausibility

Winner-take-all architectures of decision-making are already

commonplace in computational neuroscience (e.g., models of

selective attention and recognition (Carpenter & Grossberg, 1987;

Itti et al., 1998), hierarchical models of vision (Riesenhuber &

Poggio, 1999)). This is convenient, since the softmax function in (10)

can be seen as providing a biologically plausible (Deco et al.,

2008; Marreiros et al., 2008; Moran et al., 2013), smooth

approximation to the maximum operation, which is known as soft

winner-take-all (Maass, 2000). In fact, the generative model,

presented in Fig. 2, can be naturally extended such that the

approximate posterior contains an (inverse) temperature

parameter γ multiplying the expected free energy inside the softmax

function (see Appendix A.2). This temperature parameter

regulates how precisely the softmax approximates the maximum

function, thus recovering winner-take-all architectures for high

parameter values (technically, this converts Bayesian model

averaging into Bayesian model selection, where the policy

corresponds to a model of what the agent is doing). This parameter,

regulating precision of policy selection, has a clear biological

interpretation in terms of confidence encoded in dopaminergic

firing (FitzGerald, Dolan et al., 2015; Friston, FitzGerald et al.,

2017; Friston et al., 2014; Schwartenbeck, FitzGerald, Mathys,

Dolan and Friston, 2015). Interestingly, Daw and colleagues (Daw

et al., 2006) uncovered evidence in favour of a similar model

employing a softmax function and temperature parameter in

human decision-making.

Footnote 4: A more complete treatment may include priors over policies

(usually denoted by E) and the evidence for a policy afforded by observed outcomes

(usually denoted by F). These additional terms supplement the expected free

energy, leading to an approximate posterior of the form σ(−log E−F−G) (Friston

et al., 2018b).
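
The effect of the (inverse) temperature parameter γ on policy selection can be illustrated with a short sketch. The expected free energy values and the values of γ below are hypothetical.

```python
import math

def policy_posterior(G, gamma=1.0):
    """Softmax of the negative expected free energy, scaled by precision gamma."""
    scores = [-gamma * g for g in G]
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

G = [1.0, 1.5, 3.0]                  # hypothetical expected free energies
for gamma in (0.5, 1.0, 16.0):
    q = policy_posterior(G, gamma)
    print(gamma, [round(p, 4) for p in q])
```

As γ grows, the posterior concentrates on the policy with the smallest expected free energy, which is the winner-take-all limit; for small γ the posterior stays close to uniform.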

6.4. Pruning of policy trees

From a computational perspective, planning (i.e., computing

the expected free energy) for each possible policy can be cost-

prohibitive, due to the combinatorial explosion in the number of

sequences of actions when looking deep into the future. There

has been work in understanding how the brain finesses this

problem (Huys et al.,2012), which suggests a simple answer:

during mental planning, humans stop evaluating a policy as soon

as they encounter a large loss (i.e., a high value of the expected

free energy that renders the policy highly implausible). In ac-

tive inference this corresponds to using an Occam window; that

is, we stop evaluating the expected free energy of a policy if

it becomes much higher than the best (smallest expected free

energy) policy—and set its approximate posterior probability to

an arbitrarily low value accordingly. This biologically plausible

pruning strategy drastically reduces the number of policies one

has to evaluate exhaustively.
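One way to picture this pruning is sketched below. It is an illustration under stated assumptions rather than a description of any particular implementation: the function `prune_with_occam_window`, the per-time-step contributions `G_steps` and the window width are hypothetical, policies are compared at each planning depth on their running totals, and pruned policies are given zero (rather than an arbitrarily low) probability.

```python
import numpy as np

def prune_with_occam_window(G_steps, window=3.0):
    """Accumulate expected free energy over time, abandoning implausible policies.

    G_steps: array of shape (n_policies, horizon) holding hypothetical per-step
             contributions to the expected free energy of each policy.
    window:  a policy is abandoned once its running total exceeds the best
             running total by more than this amount (in nats).
    Returns the posterior over policies and a mask of fully evaluated policies.
    """
    G_steps = np.asarray(G_steps, dtype=float)
    n_policies, horizon = G_steps.shape
    total = np.zeros(n_policies)
    alive = np.ones(n_policies, dtype=bool)
    for tau in range(horizon):
        total[alive] += G_steps[alive, tau]   # only surviving policies are evaluated
        best = total[alive].min()
        alive &= total <= best + window       # Occam window around the best policy
    G = np.where(alive, total, np.inf)        # pruned policies: zero probability
    q = np.exp(-(G - G[alive].min()))
    return q / q.sum(), alive

G_steps = [[1.0, 1.0, 1.0],
           [1.5, 0.5, 1.0],
           [6.0, 0.0, 0.0],   # large loss at the first step: pruned immediately
           [1.0, 5.0, 0.2]]   # large loss at the second step: pruned there
q, alive = prune_with_occam_window(G_steps, window=3.0)
print(alive, np.round(q, 3))
```

In this toy example only two of the four policies are evaluated over the full horizon; the saving grows with the depth of the policy tree, although, as discussed next, it does not remove the combinatorial explosion.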

Although effective and biologically plausible, the Occam win-

dow for pruning policy trees cannot deal with large policy spaces

that ensue with deep policy trees and long temporal horizons.

This means that pruning can only partially explain how biologi-

cal organisms perform deep policy searches. Further research is

needed to characterise the processes by which biological agents

reduce large policy spaces to tractable subspaces. One explana-

tion – for the remarkable capacity of biological agents to evaluate

deep policy trees – rests on deep (hierarchical) generative mod-

els, in which policies operate at each level. These deep models

enable long-term policies, modelling slow transitions among hid-

den states at higher levels in the hierarchy, to contextualise

faster state transitions at subordinate levels (see Appendix A).

The resulting (semi-Markovian) process can then be specified in

terms of a hierarchy of limited horizon policies that are nested

over temporal scales; c.f., motor chunking (Dehaene et al.,2015;

Fonollosa et al.,2015;Haruno et al.,2003).

6.5. Discussion of the action–perception cycle

Minimising variational and expected free energy are com-

plementary and mutually beneficial processes. Minimisation of

variational free energy ensures that the generative model is a

good predictor of its environment; this allows the agent to ac-

curately plan into the future by evaluating expected free energy,

which in turn enables it to realise its preferences. In other words,

minimisation of variational free energy is a vehicle for effective


planning and reaching preferences via the expected free energy;

in turn, reaching preferences minimises the expected surprise of

future states of being.

In conclusion, we have seen how agents plan into the future

and make decisions about the best possible course of action. This

concludes our discussion of the action–perception cycle. In the

next section, we examine expected free energy in greater detail.

Then, we will see how active agents can learn the contingencies

of the environment and the structure of their generative model

at slower timescales.

7. Properties of the expected free energy

The expected free energy is a fundamental construct of inter-

est. In this section, we unpack its main features and highlight its

importance in relation to many existing theories in neurosciences

and engineering.

The expected free energy of a policy can be unpacked in a

number of ways. Perhaps the most intuitive is in terms of risk

and ambiguity:

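\[
G(\pi) = \underbrace{D_{KL}\big[Q(s_\tau, A \mid \pi) \,\|\, P(s_\tau, A)\big]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q(s_\tau, A \mid \pi)}\big[H[P(o_\tau \mid s_\tau, A)]\big]}_{\text{Ambiguity}} \tag{13}
\]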

This means that policy selection minimises risk and ambiguity.

Risk, in this setting, is simply the (KL) divergence between predicted

and prior beliefs about final states. In other words, policies will

be deemed more likely if they bring about states that conform

to prior preferences. In the optimal control literature, this part of

expected free energy underwrites KL control (Todorov,2008;van

den Broek et al.,2010). In economics, it leads to risk sensitive

policies (Fleming & Sheu,2002). Ambiguity reflects the uncer-

tainty about future outcomes, given hidden states. Minimising

ambiguity therefore corresponds to choosing future states that

generate unambiguous and informative outcomes (e.g., switching

on a light in the dark).
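The light-switch example can be made concrete with a small numerical sketch of (13). The assumptions here are ours: the likelihood is treated as known (so beliefs about A are omitted), and the two-state model, the matrix `A`, the preferences `p_s` and the function names are hypothetical.

```python
import numpy as np

def kl(q, p):
    """KL divergence between two categorical distributions (in nats)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                       # terms with q = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def risk_and_ambiguity(A, q_s, p_s):
    """A[o, s] = P(o | s); q_s = Q(s | policy); p_s = prior preferences P(s)."""
    A, q_s = np.asarray(A, dtype=float), np.asarray(q_s, dtype=float)
    risk = kl(q_s, p_s)
    # Entropy of the outcome distribution generated by each hidden state.
    H = -np.sum(A * np.log(np.clip(A, 1e-16, None)), axis=0)
    ambiguity = float(H @ q_s)
    return risk, ambiguity

# Hidden states: [dark, lit]. In the dark, outcomes are uninformative.
A = np.array([[0.5, 0.9],
              [0.5, 0.1]])
p_s = np.array([0.5, 0.5])             # no preference between the two states

risk_dark, amb_dark = risk_and_ambiguity(A, [1.0, 0.0], p_s)  # policy: stay in the dark
risk_lit, amb_lit = risk_and_ambiguity(A, [0.0, 1.0], p_s)    # policy: switch on the light
G_dark, G_lit = risk_dark + amb_dark, risk_lit + amb_lit
print(G_dark, G_lit)
```

Because the preferences are flat, both policies carry the same risk; the policy that switches on the light has the lower expected free energy purely because it leads to a less ambiguous state.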

We can express the expected free energy of a policy as a bound

on information gain and expected log (model) evidence (a.k.a.,

Bayesian risk):

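\[
\begin{aligned}
G(\pi) &= \underbrace{\mathbb{E}_{Q}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, P(s_\tau, A \mid o_\tau)]\big]}_{\text{Expected evidence bound}} - \underbrace{\mathbb{E}_{Q}[\log P(o_\tau)]}_{\text{Expected log evidence}} - \underbrace{\mathbb{E}_{Q}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, Q(s_\tau, A \mid \pi)]\big]}_{\text{Expected information gain}} \\
&\geq -\underbrace{\mathbb{E}_{Q}[\log P(o_\tau)]}_{\text{Expected log evidence}} - \underbrace{\mathbb{E}_{Q}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, Q(s_\tau, A \mid \pi)]\big]}_{\text{Expected information gain}}
\end{aligned} \tag{14}
\]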

The first term in (14) is the expectation of log evidence under

beliefs about future outcomes, while the second ensures that this

expectation is maximally informed, when outcomes are encoun-

tered. Collectively, these two terms underwrite the resolution

of uncertainty about hidden states (i.e., information gain) and

outcomes (i.e., expected surprise) in relation to prior beliefs.

When the agent’s preferences are expressed in terms of out-

comes (c.f., Fig. 2), it is useful to express risk in terms of outcomes,

as opposed to hidden states. This is most useful when the gen-

erative model is not known or during structure learning, when

the state-space evolves over time. In these cases, the risk over

hidden states can be replaced by risk over outcomes, assuming that the

KL divergence between the predicted and true posterior (under

expected outcomes) is small:

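\[
\begin{aligned}
\underbrace{D_{KL}\big[Q(s_\tau, A \mid \pi) \,\|\, P(s_\tau, A)\big]}_{\text{Risk (states)}} &= \underbrace{D_{KL}\big[Q(o_\tau \mid \pi) \,\|\, P(o_\tau)\big]}_{\text{Risk (outcomes)}} + \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, P(s_\tau, A \mid o_\tau)]\big]}_{\approx 0} \\
&\approx \underbrace{D_{KL}\big[Q(o_\tau \mid \pi) \,\|\, P(o_\tau)\big]}_{\text{Risk (outcomes)}}
\end{aligned} \tag{15}
\]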

This divergence constitutes an expected evidence bound that

also appears if we express expected free energy in terms of

intrinsic and extrinsic value:

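\[
\begin{aligned}
G(\pi) = {} & -\underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}[\log P(o_\tau)]}_{\text{Extrinsic value}} + \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau, A \mid o_\tau, \pi) \,\|\, P(s_\tau, A \mid o_\tau)]\big]}_{\text{Expected evidence bound}} \\
& - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[D_{KL}[Q(s_\tau \mid o_\tau, \pi) \,\|\, Q(s_\tau \mid \pi)]\big]}_{\text{Intrinsic value (states) or salience}} - \underbrace{\mathbb{E}_{Q(o_\tau, s_\tau \mid \pi)}\big[D_{KL}[Q(A \mid o_\tau, s_\tau, \pi) \,\|\, Q(A)]\big]}_{\text{Intrinsic value (parameters) or novelty}}
\end{aligned} \tag{16}
\]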

Extrinsic value is just the expected value of log evidence,

which can be associated with reward and utility in

behavioural psychology and economics, respectively (Barto et al.,

2013;Kauder,1953;Schmidhuber,2010). In this setting, ex-

trinsic value is the negative of Bayesian risk (Berger,1985),

when reward is log evidence. The intrinsic value of a policy

is its epistemic value or affordance (Friston et al.,2015). This

is just the expected information gain afforded by a particular

policy, which can be about hidden states (i.e., salience) or model

parameters (i.e., novelty). It is this term that underwrites artificial

curiosity (Schmidhuber,2006).
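The relationship between these decompositions can be checked numerically in a simplified setting. The sketch below makes assumptions beyond the text: the likelihood is treated as known (so the novelty term vanishes), the expected evidence bound is taken to be zero, and risk is expressed over outcomes as in (15). The function name and all numerical values are hypothetical. Under these assumptions, risk plus ambiguity coincides with negative extrinsic value minus salience.

```python
import numpy as np

def expected_free_energy_two_ways(A, q_s, log_C):
    """A[o, s] = P(o | s); q_s = Q(s | policy); log_C[o] = log P(o) (preferences)."""
    A, q_s, log_C = (np.asarray(x, dtype=float) for x in (A, q_s, log_C))
    eps = 1e-16                                   # guards against log(0)
    q_o = A @ q_s                                 # predicted outcomes Q(o | policy)
    joint = A * q_s                               # Q(o, s | policy), shape (n_o, n_s)

    # Form 1: risk over outcomes plus ambiguity.
    risk = float(np.sum(q_o * (np.log(q_o + eps) - log_C)))
    ambiguity = float(-np.sum(joint * np.log(A + eps)))

    # Form 2: negative extrinsic value minus salience (expected information gain
    # about hidden states, i.e. the mutual information between states and outcomes).
    extrinsic = float(q_o @ log_C)
    salience = float(np.sum(joint * (np.log(joint + eps)
                                     - np.log(np.outer(q_o, q_s) + eps))))
    return risk + ambiguity, -extrinsic - salience, extrinsic, salience

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
q_s = np.array([0.5, 0.5])
log_C = np.log(np.array([0.8, 0.2]))              # the first outcome is preferred

G_risk_form, G_value_form, extrinsic, salience = expected_free_energy_two_ways(A, q_s, log_C)
print(G_risk_form, G_value_form, extrinsic, salience)
```

The two forms agree up to floating-point error, and the salience term is non-negative, as expected of a mutual information.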

Intrinsic value corresponds to the expected information gain

about hidden states and model parameters. It is also known as intrinsic motivation

in neurorobotics (Barto et al.,2013;Deci & Ryan,1985;Oudeyer

& Kaplan,2009), the value of information in economics (Howard,

1966), salience in the visual neurosciences and (rather confus-

ingly) Bayesian surprise in the visual search literature (Itti & Baldi,

2009;Schwartenbeck et al.,2013;Sun et al.,2011). In terms of

information theory, intrinsic value is mathematically equivalent

to the expected mutual information between hidden states in

the future and their consequences—consistent with the princi-

ples of minimum redundancy or maximum efficiency (Barlow,

1961,1974;Linsker,1990). Finally, from a statistical perspective,

maximising intrinsic value (i.e., salience and novelty) corresponds

to optimal Bayesian design (Lindley,1956) and machine learning

derivatives, such as active learning (MacKay,1992). On this view,

active learning is driven by novelty; namely, the information

gain afforded model parameters, given future states and their

outcomes. Heuristically, this curiosity resolves uncertainty about

‘‘what would happen if I did that’’ (Schmidhuber,2010). Fig. 4

illustrates the compass of expected free energy, in terms of its

special cases; ranging from optimal Bayesian design through to

Bayesian decision theory.

Fig. 4. Expected free energy. This figure illustrates the various ways in which minimising expected free energy can be unpacked (omitting model parameters for clarity). The upper panel casts perception and action as the minimisation of variational and expected free energy, respectively. Crucially, active inference introduces beliefs over policies that enable a formal description of planning as inference (Attias, 2003; Botvinick & Toussaint, 2012; Kaplan & Friston, 2018a). In brief, posterior beliefs about hidden states of the world, under plausible policies, are optimised by minimising a variational (free energy) bound on log evidence. These beliefs are then used to evaluate the expected free energy of allowable policies, from which actions can be selected (Friston, FitzGerald et al., 2017). Crucially, expected free energy subsumes several special cases that predominate in the psychological, machine learning and economics literature. These special cases are disclosed when one removes particular sources of uncertainty from the implicit optimisation problem. For example, if we ignore prior preferences, then the expected free energy reduces to information gain (Lindley, 1956; MacKay, 2003) or intrinsic motivation (Barto et al., 2013; Deci & Ryan, 1985; Oudeyer & Kaplan, 2009). This is mathematically the same as expected Bayesian surprise and mutual information that underwrite salience in visual search (Itti & Baldi, 2009; Sun et al., 2011) and the organisation of our visual apparatus (Barlow, 1961, 1974; Linsker, 1990; Optican & Richmond, 1987). If we now remove risk but reinstate prior preferences, one can effectively treat hidden and observed (sensory) states as isomorphic. This leads to risk sensitive policies in economics (Fleming & Sheu, 2002; Kahneman & Tversky, 1988) or KL control in engineering (van den Broek et al., 2010). Here, minimising risk corresponds to aligning predicted outcomes to preferred outcomes. If we then remove ambiguity and relative risk of action (i.e., intrinsic value), we are left with extrinsic value or expected utility in economics (Von Neumann & Morgenstern, 1944) that underwrites reinforcement learning and behavioural psychology (Barto & Sutton, 1992). The Bayesian formulation of maximising expected utility under uncertainty is also known as Bayesian decision theory (Berger, 1985). Finally, if we just consider a completely unambiguous world with uninformative priors, expected free energy reduces to the negative entropy of posterior beliefs about the causes of data; in accord with the maximum entropy principle (Jaynes, 1957). The expressions for variational and expected free energy correspond to those described in the main text (omitting model parameters for clarity). They are arranged to illustrate the relationship between complexity and accuracy, which become risk and ambiguity, when considering the consequences of action. This means that risk-sensitive policy selection minimises expected complexity or computational cost. The coloured dots above the terms in the equations correspond to the terms that constitute the special cases in the lower panels.

8. Learning

In active inference, learning concerns the dynamics of synaptic plasticity, which are thought to encode beliefs about the contingencies of the environment (Friston, FitzGerald et al., 2017) (e.g., beliefs about B, in some settings, are thought to be encoded in recurrent excitatory connections in the prefrontal cortex (Parr, Rikhye et al., 2019)). The fact that beliefs about matrices (e.g., A, B) may be encoded in synaptic weights conforms to connectionist models of brain function, as it offers a convenient way to compute probabilities: the synaptic weights could be interpreted as performing matrix multiplication, as in artificial neural networks, to predict, for example, outcomes from beliefs about states using the likelihood matrix A.
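As a minimal illustration of this matrix-multiplication reading, the snippet below predicts outcomes from beliefs about hidden states. The numerical values of the likelihood matrix and the state beliefs are hypothetical, chosen only for the example.

```python
import numpy as np

# Hypothetical likelihood matrix: A[o, s] = P(outcome o | hidden state s).
# Each column is a probability distribution over outcomes.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])

s = np.array([0.7, 0.3])   # beliefs about hidden states, Q(s)

# Predicted outcomes: a single "layer" of weights A applied to the state beliefs.
o = A @ s
print(o)
```

Because the columns of A and the vector s are each normalised, the predicted outcome vector is itself a probability distribution.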

These synaptic dynamics (e.g., long-term potentiation and de-

pression) evolve at a slower timescale than action and percep-

tion, which is consistent with the fact that such inferences need

evidence accumulation over multiple state–outcome pairs. For

simplicity, we will assume the only variable that is learned is A,

but what follows generalises to more complex generative models

(cf. Appendix A.1). Learning A means that approximate posterior

beliefs about A follow a gradient descent on variational free

energy. Seeing the variational free energy (5) as a function of a

(the sufficient statistic of Q(A)) we can write:

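\[
\begin{aligned}
F(a) &= D_{KL}[Q(A) \,\|\, P(A)] - \sum_{\tau=1}^{t} \mathbb{E}_{Q(\pi) Q(s_\tau \mid \pi) Q(A)}\big[o_\tau \cdot \log(A)\, s_\tau\big] + \cdots \\
&= D_{KL}[Q(A) \,\|\, P(A)] - \sum_{\tau=1}^{t} o_\tau \cdot \mathbf{log\,A}\, \mathbf{s}_\tau + \cdots
\end{aligned} \tag{17}
\]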

Here, we ignore the terms in (5) that do not depend on Q(A), as

these will vanish when we take the gradient. The KL-divergence
