Learning in Factored Domains with Information-Constrained Visual Representations


Abstract

Humans learn quickly even in tasks that contain complex visual information. This is due in part to the efficient formation of compressed representations of visual information, allowing for better generalization and robustness. However, compressed representations alone are insufficient for explaining the high speed of human learning. Reinforcement learning (RL) models that seek to replicate this impressive efficiency may do so through the use of factored representations of tasks. These informationally simple task representations share their motivation with compressed representations of visual information. Recent studies have connected biological visual perception to disentangled and compressed representations. This raises the question of how humans learn to efficiently represent visual information in a manner useful for learning tasks. In this paper we present a model of human factored representation learning based on an altered form of the β-Variational Autoencoder, applied to a visual learning task. Modelling results demonstrate a trade-off, governed by the informational complexity of the model's latent dimension space, between the speed of learning and the accuracy of reconstructions.
Tyler Malloy
Rensselaer Polytechnic Institute
Tim Klinger
IBM Research AI
Miao Liu
IBM Research AI
Matthew Riemer
IBM Research AI
Gerald Tesauro
IBM Research AI
Chris R. Sims
Rensselaer Polytechnic Institute
1 Introduction
Deep Reinforcement Learning (DRL) has achieved super-human performance on a variety of tasks by leveraging large neural networks trained on long timescales [10]. However, much of the research applying RL to cognitive modelling of human learning has been limited to domains with small state and action spaces [11], due to the low sample efficiency of traditional DRL methods [2].
Recent methods have applied DRL to predicting human learning by modifying β-Variational Autoencoders (β-VAE) to additionally predict utility in a supervised fashion [9]. Disentangled representations have also been applied to improving zero-shot transfer learning in the DRL setting by using latent representations as input to a policy network [6]. The model presented in this work differs from these previous approaches by applying a hypothesis generation and evaluation method to latent representations, in the context of a factored task representation.
Factored representations of state transition and reward functions can be used by RL methods to improve generalization and robustness in tasks with a causal structure that corresponds to the factored Markov Decision Process problem specification [7]. This could be a useful source of the higher sample efficiency required to predict human learning using deep learning methods.
Information-Theoretic Principles in Cognitive Systems Workshop at the 36th Conference on Neural Information
Processing Systems (NeurIPS 2022).
The model presented in this work seeks to leverage the disentangled representations learned by β-VAE models to learn the factored representation of a task. This is achieved by generating a set of hypotheses that predict future rewards and states based on the latent features of visual information. This hypothesis space is used to explain the causal structure of a given task, and is repeatedly re-evaluated and re-generated based on the experience of the agent.
2 Beta Variational Autoencoders
A β-Variational Autoencoder model consists of a deep neural network $q_\phi(z|x)$ that learns information-constrained representations of visual information $x$. These representations take the form of a vector of means $\mu_z$ and variances $\sigma_z$ that define a multi-variate Gaussian $\mathcal{N}(\mu_z, \sigma_z)$. This distribution is sampled to produce a vector of values $z$ that is then fed through the subsequent network layers $p_\theta(x|z)$ to produce a reconstruction. The entire model is trained to minimize the difference between the input and the reconstruction by maximizing the objective function [3]:

$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) \tag{1}$$

The β parameter allows for additional control over the information bottleneck of the model by adding a weight to the informational complexity of the latent representations defining the multi-variate Gaussian distribution. The result is that the entire model is trained to balance reconstruction accuracy and latent representation complexity in an adjustable fashion.
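As a minimal sketch of how the two terms of Eq. 1 trade off, the Python below evaluates the objective for a diagonal Gaussian posterior. It assumes a standard normal prior p(z), which is the usual choice for a β-VAE but is not stated above. The reconstruction log-likelihood is passed in as a number rather than computed by a decoder, and all function names are illustrative.

```python
import math
import random


def kl_to_standard_normal(means, variances):
    """Closed-form D_KL(N(mu, diag(var)) || N(0, I)); the N(0, I) prior is an assumption."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(means, variances))


def sample_latent(means, variances, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (the reparameterisation trick)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(means, variances)]


def beta_vae_objective(log_likelihood, means, variances, beta):
    """Eq. 1: reconstruction log-likelihood minus the beta-weighted KL term."""
    return log_likelihood - beta * kl_to_standard_normal(means, variances)


rng = random.Random(0)
z = sample_latent([0.5, -0.2], [1.0, 0.25], rng)  # one sampled latent vector
```

Raising `beta` lowers the objective for any posterior that differs from the prior, which is the sense in which β tightens the information bottleneck.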
3 Reinforcement Learning for Factored MDPs
Reinforcement Learning (RL) for Factored MDPs seeks to solve the problem specification described by the Factored Markov Decision Process (FMDP). The FMDP setting is a special case of the MDP formed by relating it to a dynamic Bayesian network defined by a directed acyclic graph with nodes $\{X_1, X_2, \dots, X_n\}$ and scopes $S_1, \dots, S_n$ [ ]. A scope $S_i$ of this network describes the dependencies of future state features or rewards on previous features and actions, with $x[S_i]$ signifying the features of state $x$ corresponding to the scope $S_i$. This allows for a definition of the factored transition function $P(x'|x, a)$ and reward function $R(x)$ as follows [12]:

$$P(x'|x, a) = \prod_{i=1}^{n} P(x'_i \mid x[S_i], a), \qquad R(x) = \frac{1}{n} \sum_{i=1}^{n} R_i(x[S_i]) \tag{2}$$
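The factorization in Eq. 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: scopes are represented as tuples of feature indices, the per-factor functions are hypothetical callables supplied by the caller, and the reward is read as an average over the n factors.

```python
def select(state, scope):
    """x[S_i]: the features of the state that fall inside one scope."""
    return [state[j] for j in scope]


def factored_reward(state, scopes, reward_fns):
    """R(x) = (1/n) * sum_i R_i(x[S_i])."""
    total = sum(fn(select(state, scope)) for scope, fn in zip(scopes, reward_fns))
    return total / len(scopes)


def factored_transition_prob(next_state, state, action, scopes, transition_fns):
    """P(x'|x, a) = prod_i P(x'_i | x[S_i], a)."""
    prob = 1.0
    for next_value, scope, fn in zip(next_state, scopes, transition_fns):
        prob *= fn(next_value, select(state, scope), action)
    return prob


# Hypothetical two-feature example: the first factor depends on feature 0 only,
# the second on both features.
scopes = [(0,), (0, 1)]
reward_fns = [lambda f: f[0], lambda f: f[0] + f[1]]
transition_fns = [lambda nv, f, a: 0.5, lambda nv, f, a: 0.5]
reward = factored_reward([1.0, 2.0], scopes, reward_fns)
prob = factored_transition_prob([1.0, 2.0], [1.0, 2.0], 0, scopes, transition_fns)
```

Because each factor only sees the features in its scope, a learner needs to estimate n small functions instead of one function over the full state.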
These factored representations can be leveraged to significantly improve sample efficiency when the causal structure is provided [4]. However, it can be difficult to learn these factored representations from scratch, especially in environments with complex information such as visual domains. In the following section we describe how the proposed model leverages disentangled latent representations with a given hypothesis generation method to produce useful factored representations.
4 Proposed Model
The proposed RLβ-VAE model (see Figure 1) begins with a slight alteration to the β-VAE, in order to additionally make predictions of the reward associated with a stimulus and action pair. The resulting network is trained with the following objective:

$$\mathcal{L}(\theta, \phi; x, z, \beta, \upsilon, r) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) + \upsilon \big(R(z|a) - r\big)^2 \tag{3}$$
Here $\upsilon$ is an additional parameter that weighs the importance of the accuracy of reward predictions, and the reward $R(z|a)$ is defined in terms of the factored reward of the latent representation $z$ and the discounted value of the subsequent latent representation $Z'$ observed after performing action $a$:

$$R(z|a) = \frac{1}{n} \sum_{i=1}^{n} R_i(z[S_i]) + \gamma V\Big(\prod_{i=1}^{n} P(z'_i \mid z[S_i], a)\Big) \tag{4}$$
Figure 1: Example of the RLβ-VAE model forming a reconstruction and predicted reward.
$\gamma V(Z')$ is the discounted value of the subsequent latent representation $Z'$, here calculated using the factored transition function from Eq. 2. The model is first pre-trained in an unsupervised fashion, using a reward of 0 to calculate the training loss. After pre-training, the model can leverage the learned disentangled representations to predict a factored reward structure that allows for improved generalization and robustness, resulting in higher sample efficiency.
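The following Python sketch shows one way the terms of Eq. 3 and Eq. 4 could be combined. It is written as a loss to be minimized, so the reconstruction error and the squared reward error both enter as penalties. The value of the next latent representation is supplied by the caller rather than computed from the factored transition function, and every name here is an illustrative assumption.

```python
def predicted_reward(latent, scopes, reward_fns, gamma=0.0, next_value=0.0):
    """Eq. 4: mean of the factored rewards plus the discounted value of the next latent."""
    factored = sum(fn([latent[j] for j in scope]) for scope, fn in zip(scopes, reward_fns))
    return factored / len(scopes) + gamma * next_value


def rl_beta_vae_loss(recon_error, kl, reward_prediction, observed_reward, beta, upsilon):
    """Loss form of Eq. 3: reconstruction error + beta * KL + upsilon * squared reward error."""
    return recon_error + beta * kl + upsilon * (reward_prediction - observed_reward) ** 2


# Hypothetical factored reward over two latent features.
scopes = [(0,), (1,)]
reward_fns = [lambda f: 25.0 * f[0], lambda f: 50.0 * f[0]]
r_hat = predicted_reward([1.0, 2.0], scopes, reward_fns, gamma=0.9, next_value=10.0)

# During pre-training the observed reward is fixed at 0, as described above.
pretrain_loss = rl_beta_vae_loss(2.0, 0.5, 1.0, 0.0, beta=4.0, upsilon=10.0)
```

In the bandit task of the next section there is no subsequent state, so the sketch's default `gamma=0.0` reduces the prediction to the factored reward alone.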
Transitioning from disentangled latent features to a factored representation requires the generation and evaluation of a set of hypotheses that correspond to potential scopes $S_1, \dots, S_n$. The method of hypothesis generation and evaluation used here has previously been applied to abstract inductive reasoning [13]. The process consists of two steps: 1) sampling a reduced hypothesis space from a probability distribution, and 2) evaluating the hypotheses in the reduced space through some metric for how well each hypothesis matches experience [1]. For an example of the factored hypothesis generation and evaluation method see the appendix.
For the learning task described in this paper, hypotheses can be generated through a simple linear fit of the learned representations to the observed reward. The space of hypotheses consists of all possible scopes $S_1, \dots, S_n$ that define the factored reward function. The evaluation step ranks each hypothesis by the mean-squared error of its reward predictions. Alternatives to this approach (including Bayesian inference or TD-error updates) are possible, but are not required due to the simple structure of the deterministic contextual bandit task described in the next section.
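A minimal sketch of this generate-and-rank step is given below in plain Python. It assumes that a hypothesis is a single scope of at most `max_scope_size` latent indices, that each scope is fit by ordinary least squares with an intercept, and that all scopes are enumerated exhaustively. The small solver and the function names are illustrative and not taken from the paper.

```python
import itertools
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def scope_mse(latents, rewards, scope):
    """Fit reward ~ sum_{j in scope} w_j * z_j + bias and return the mean-squared error."""
    rows = [[z[j] for j in scope] + [1.0] for z in latents]
    d = len(scope) + 1
    # Normal equations, with a tiny ridge term to keep the system non-singular.
    xtx = [[sum(r[i] * r[j] for r in rows) + (1e-9 if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, rewards)) for i in range(d)]
    weights = solve(xtx, xty)
    preds = [sum(w * v for w, v in zip(weights, r)) for r in rows]
    return sum((p - y) ** 2 for p, y in zip(preds, rewards)) / len(rewards)


def rank_hypotheses(latents, rewards, max_scope_size):
    """Enumerate candidate scopes and sort them from lowest to highest MSE."""
    n = len(latents[0])
    scopes = [s for k in range(1, max_scope_size + 1)
              for s in itertools.combinations(range(n), k)]
    return sorted(scopes, key=lambda s: scope_mse(latents, rewards, s))


# Synthetic check: the reward depends only on latent features 0 and 2.
rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(30)]
rewards = [25.0 * z[0] + 50.0 * z[2] for z in latents]
best_scope = rank_hypotheses(latents, rewards, max_scope_size=2)[0]
```

On this synthetic data the top-ranked scope is the pair of features that actually generates the reward, since it is the only candidate with near-zero error.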
5 Learning Task
While factored MDPs can aid the sample efficiency of RL algorithms in many domains, in this learning task we focus on reward factorization using a simple bandit learning environment. The task consists of a contextual N-armed bandit based on two images of celebrity faces [8]. The two actions available in the 2-armed bandit setting correspond to selecting the left and right stimuli, meaning we can further simplify the input to the RLβ-VAE model to only the face corresponding to the action chosen. The result is two reward predictions $[r_{left}, r_{right}]$, which are the input to a simple soft-max function, a method commonly used in cognitive modelling of human bandit learning [11].
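The choice rule can be sketched as follows. The temperature parameter is an assumption added for illustration, since the text specifies only a simple soft-max over the two reward predictions.

```python
import math


def softmax_choice_probabilities(predicted_rewards, temperature=1.0):
    """Map reward predictions to choice probabilities with a soft-max."""
    top = max(predicted_rewards)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp((r - top) / temperature) for r in predicted_rewards]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical predictions for the left and right stimulus.
p_left, p_right = softmax_choice_probabilities([25.0, 50.0], temperature=25.0)
```

The probability assigned to the higher-reward arm is the quantity used to assess learning speed in the modelling results.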
In our contextual bandit task, faces wearing glasses are worth 25 points, faces wearing hats are worth 50 points, faces wearing both are worth 75 points, and faces wearing neither are worth 0 points. The hypothesis generation method used by the RLβ-VAE model assumes that the reward can be predicted by the sum of simple linear functions which map the latent representation values $Z: \{z_0, z_1, \dots, z_n\}$ onto the observed reward. As noted previously, more complex hypothesis generation and evaluation methods are possible, but unnecessary for this learning task.
Before being applied to predicting reward, the RLβ-VAE models were pre-trained for 100 epochs on the full 220K image dataset of celebrity faces [8], with 100 test images removed. During contextual bandit model testing, two images of celebrities are randomly chosen from a set of 100 images (25 each of hats, glasses, both, and neither) not included in the initial model pre-training. To ensure that one of the options always has a higher reward, the images are selected from different categories.
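The trial structure described above can be sketched as below. The reward values and the 25-images-per-category split come from the text, while the image identifiers and function names are placeholders invented for the example.

```python
import random

# Points per stimulus category, as stated in the task description.
REWARDS = {"neither": 0, "glasses": 25, "hats": 50, "both": 75}


def sample_trial(images_by_category, rng):
    """Pick two images from different categories so one arm always pays more."""
    left_cat, right_cat = rng.sample(sorted(images_by_category), 2)
    left = rng.choice(images_by_category[left_cat])
    right = rng.choice(images_by_category[right_cat])
    return (left, REWARDS[left_cat]), (right, REWARDS[right_cat])


# Placeholder identifiers standing in for the 100 held-out test images.
test_images = {cat: [f"{cat}_{i}" for i in range(25)] for cat in REWARDS}
rng = random.Random(0)
(left_image, left_reward), (right_image, right_reward) = sample_trial(test_images, rng)
```

Sampling the two categories without replacement is what guarantees that the two rewards differ on every trial.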
6 Modelling Results
The main method of assessing the speed of learning in the contextual bandit task is the probability
the model assigns to selecting the higher reward bandit arm. The results shown in the middle column
of Figure 2 demonstrate that smaller latent dimension spaces allow for faster learning of the factored
reward structure in this contextual bandit task. Notably, the models with small latent dimension sizes
are able to consistently select the option with a higher reward after only two experiences in this task.
Figure 2: Left: Model pre-training reconstruction loss by training epoch (lower is better); color indicates latent dimension size. Middle: Model accuracy by trial in contextual bandit training over 1000 runs; means (dots) are fit to a logarithmic function (lines). Right: Representation difference in mean-squared error between images containing hats, glasses, and both, compared to images containing neither.
The left column of Figure 2 compares reconstruction loss by pre-training epoch. These results demonstrate lower end-of-training reconstruction accuracy for models with smaller latent spaces. While these small latent dimensions are useful for quick hypothesis generation, they make accurate reconstruction of stimuli more difficult due to the tight information bottleneck imposed on the model. This represents a trade-off between learning speed and reconstruction accuracy that has direct implications for how the human mind forms constrained representations of visual information that are used in learning tasks. Future research in this area can investigate the specific balance of this trade-off made by humans engaged in learning tasks based on visual information.
In the right column of Figure 2, we compare the average latent representation difference, as measured by mean squared error, between each of the three non-zero utility stimulus types (glasses, hats, both) and the stimuli wearing neither glasses nor hats. Initially all representations are equally similar to stimuli without hats or glasses. As utility is learned, representations of higher utility stimuli become relatively more differentiated. In these results, the low utility stimulus type is most similar to the zero utility stimuli, and the highest utility stimulus type is most different. This demonstrates a utility-based acquired equivalence whereby stimuli with similar utility outcomes have similar latent representations.
7 Conclusions
The results presented in this work show the value of disentangled representations of visual information
in learning factored rewards. The learning task used in testing these models, while simple, revealed
potential explanations of how the human mind performs fast learning through hypothesis generation
in an information-compressed space that allows for better generalization and robustness. The method
of generating potential hypotheses that explain the reward observed in this contextual bandit task
was designed for the deterministic nature of the contextual bandit task, but simple adjustments are
possible to extend this application into alternative domains.
In addition to providing insight into the structure of visual information as it is being processed by the reinforcement learning faculty of the human brain, this work is also related to the question of how best to define disentanglement, which has been identified as an interesting open question [5]. Specifically, the results provided here suggest the usefulness of a behavioural definition of disentanglement, which is achieved when representations are disentangled in a way that makes them useful for behavioural goals such as forming hypotheses that explain experience and direct future behaviour.
References
[1] Elizabeth Baraff Bonawitz and Thomas L Griffiths. Deconfounding hypothesis generation and evaluation in Bayesian models. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 32, 2010.
[2] Matthew Botvinick, Sam Ritter, Jane X Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422, 2019.
[3] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
[4] Xiaoyu Chen, Jiachen Hu, Lihong Li, and Liwei Wang. Efficient reinforcement learning in factored MDPs with application to constrained RL. In International Conference on Learning Representations, 2020.
[5] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
[6] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, pages 1480–1490. PMLR, 2017.
[7] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In IJCAI, volume 16, pages 740–747, 1999.
[8] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[9] Tyler Malloy, Chris R. Sims, and Tim Klinger. Modeling human reinforcement learning with disentangled visual representations. In Reinforcement Learning and Decision Making (RLDM), July 2022.
[10] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[11] Yael Niv, Reka Daniel, Andra Geana, Samuel J Gershman, Yuan Chang Leong, Angela Radulescu, and Robert C Wilson. Reinforcement learning in multidimensional environments relies on attention mechanisms. Journal of Neuroscience, 35(21):8145–8157, 2015.
[12] Brian Sallans and Geoffrey E Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.
[13] Joshua B Tenenbaum, Thomas L Griffiths, and Charles Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7):309–318, 2006.
8 Appendices
8.1 Stimuli examples
Figure 3: Examples of face images with either eyeglasses or hats from the celebA dataset [8].
8.2 Hypothesis Generation and Evaluation
As mentioned previously, the steps of this process consist of 1) sampling a reduced hypothesis space from a probability distribution and 2) evaluating the hypotheses in the reduced space through some metric for how well each hypothesis matches experience [1].
In the factored MDP setting, a hypothesis is a set of scopes $S_1, \dots, S_n$ that correspond to the causal structure of an environment. Figure 4 shows one possible hypothesis for the causal structure of a learning environment. In this example the first scope $S_1 = \{z_1\}$ corresponds to the relationship between the features contained in $S_1$ for the factored state transition function and reward function described in Eq. 2. This relationship is signified in the Dynamic Bayesian Network in the left column of Figure 4 by the arrow from $z_1$. Because the first scope $S_1$ only contains the feature $z_1$, the first function of the factored reward $r_1$ depends only on the first latent feature $z_1 = 123$.
The full hypothesis space for the reward of a given latent representation $z$ of size $n$ with $k$ scope elements is $\binom{n}{k}$ for each of the possible scopes $S_1, \dots, S_n$. In the example hypothesis shown in Figure 4, $n = 5$, $k$ is 1, 2, or 3, and the hypothetical scopes are defined as $S_1 = \{z_1\}$, $S_2 = \{z_2, z_3\}$, $S_3 = \{z_2, z_4, z_5\}$, $S_4 = \{z_4\}$, $S_5 = \{z_4, z_5\}$.
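To give a sense of how quickly the candidate set grows, the sketch below counts and enumerates scopes for the n = 5 example. It assumes that a scope is an unordered subset of the latent features, so the number of scopes with k elements is the binomial coefficient; the function names are illustrative.

```python
import itertools
import math


def scopes_of_size(n_latents, k):
    """All candidate scopes containing exactly k of the latent features z_1..z_n."""
    return list(itertools.combinations(range(1, n_latents + 1), k))


def scope_count(n_latents, k):
    """Number of k-element scopes, assuming scopes are unordered subsets."""
    return math.comb(n_latents, k)


# Candidate scopes per factor for the n = 5 example with k = 1, 2, or 3.
counts = {k: scope_count(5, k) for k in (1, 2, 3)}
total_per_factor = sum(counts.values())
```

Under this assumption each of the five factors has 5, 10, and 10 candidate scopes for k = 1, 2, and 3, which is why restricting k shrinks the search considerably.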
Figure 4: Example of a dynamic Bayesian network defined by one hypothesized scope, together with an example stimulus with its latent representation and factored reward function. Note that the hypothetical DBN describes the transition function, which is not used for the contextual bandit task.
In practice, when performing the contextual bandit task described in the paper, the reduced hypothesis space is formed by selecting some limited complexity of scopes, set as $k = 1$, meaning only 1 or 2 elements were contained in each scope, which significantly reduces the possible hypothesis space. The probability function sampling the reduced space was defined to deterministically select the most likely hypothesis as evaluated by the mean-squared error of the most recent reward prediction. This simple evaluation and hypothesis sampling approach was adequate for the deterministic reward setting of this contextual bandit, but a more complex sampling approach is also possible.
... Humans use compositional reasoning in a variety of domains related to language such as letter writing and sentence generation (Lake, Salakhutdinov, and Tenenbaum 2013; Piantadosi, Tenenbaum, and Goodman 2016). Recent methods in reinforcement learning have sought to apply these concepts to improve generalization in machine learning methods (Malloy et al. 2022;Ito et al. 2022). ...
... These factored representations can be leveraged to significantly improve sample efficiency when the causal structure is provided (Chen et al. 2020). However, it can be difficult to learn these factored representations from scratch (Malloy et al. 2022). The proposed CC-RL model addresses this issue using causal inference as a method of update the feature weights Φ. ...
Full-text available
An important characteristic of human learning and decision-making is the flexibility with which we rapidly adapt to novel tasks. To this day, models of human behavior have been unable to emulate the ease and success with which humans transfer knowledge in one context to another. Humans rely on a lifetime of experience and a variety of cognitive mechanisms that are difficult to represent computationally. To address this problem, we propose a novel human behavior model that accounts for human transfer of learning using three mechanisms: compositional reasoning, causal inference, and optimal forgetting. To evaluate this proposed model, we introduce an experiment task designed to elicit human transfer of learning under different conditions. Our proposed model demonstrates a more human-like transfer of learning compared to models that optimize transfer or human behavior models that do not directly account for transfer of learning. The results of the ablation testing of the proposed model and a systematic comparison to human data demonstrate the importance of each component of the cognitive model underlying the transfer of learning.
... This information-bottleneck motivation of these 145 models has been associated with cognitive limitations that impact decision making in humans, resulting in 146 suboptimal behavior (Bhui et al., 2021; Lai and Gershman, 2021). 147 These representations have been related to the processing of visual information from humans in learning 148 tasks (Malloy and Sims, 2022), as they excel in retaining key details associated with stimulus generation 149 factors (such as the shape of a ball or the age of a person's face) (Malloy et al., 2022b). Although we employ 150 β-VAEs in this work, there are many alternative visual GMs that are capable of forming representations 151 useful for decision making. ...
Full-text available
Introduction Generative Artificial Intelligence has made significant impacts in many fields, including computational cognitive modeling of decision making, although these applications have not yet been theoretically related to each other. This work introduces a categorization of applications of Generative Artificial Intelligence to cognitive models of decision making. Methods This categorization is used to compare the existing literature and to provide insight into the design of an ablation study to evaluate our proposed model in three experimental paradigms. These experiments used for model comparison involve modeling human learning and decision making based on both visual information and natural language, in tasks that vary in realism and complexity. This comparison of applications takes as its basis Instance-Based Learning Theory, a theory of experiential decision making from which many models have emerged and been applied to a variety of domains and applications. Results The best performing model from the ablation we performed used a generative model to both create memory representations as well as predict participant actions. The results of this comparison demonstrates the importance of generative models in both forming memories and predicting actions in decision-modeling research. Discussion In this work, we present a model that integrates generative and cognitive models, using a variety of stimuli, applications, and training methods. These results can provide guidelines for cognitive modelers and decision making researchers interested in integrating Generative AI into their methods.
Full-text available
As the research community aims to build better AI assistants that are more dynamic and personalized to the diversity of humans that they interact with, there is increased interest in evaluating the theory of mind capabilities of large language models (LLMs). Indeed, several recent studies suggest that LLM theory of mind capabilities are quite impressive, approximating human-level performance. Our paper aims to rebuke this narrative and argues instead that past studies were not directly measuring agent performance, potentially leading to findings that are illusory in nature as a result. We draw a strong distinction between what we call literal theory of mind i.e. measuring the agent's ability to predict the behavior of others and functional theory of mind i.e. adapting to agents in-context based on a rational response to predictions of their behavior. We find that top performing open source LLMs may display strong capabilities in literal theory of mind, depending on how they are prompted, but seem to struggle with functional theory of mind -- even when partner policies are exceedingly simple. Our work serves to highlight the double sided nature of inductive bias in LLMs when adapting to new situations. While this bias can lead to strong performance over limited horizons, it often hinders convergence to optimal long-term behavior.
Full-text available
Instance-Based Learning Theory (IBLT) suggests that humans learn to engage in dynamic decision making tasks through the accumulation of experiences, represented by the decision task features, the actions performed, and the utility of decision outcomes. This theory has been applied to the design of Instance-Based Learning (IBL) models of human behavior in a variety of contexts. One key feature of all IBL model applications is the method of accumulating instance-based memory and performing recognition-based retrieval. In simple tasks with few features, this knowledge representation and retrieval could hypothetically be done using all relevant information. However, these methods do not scale well to complex tasks when exhaustive enumeration of features is unfeasible. This requires cognitive modelers to design task-specific representations of state features, as well as similarity metrics, which can be time consuming and fail to generalize to related tasks. To address this issue, we leverage recent advancements in Artificial Neural Networks, specifically generative models (GMs), to learn representations of complex dynamic decision making tasks without relying on domain knowledge. We evaluate a range of GMs in their usefulness in forming representations that can be used by IBL models to predict human behavior in a complex decision making task. This work connects generative and cognitive models by using GMs to form representations and determine similarity.
Conference Paper
Full-text available
Instance-Based Learning Theory (IBLT) suggests that humans learn to engage in dynamic decision making tasks through the accumulation of experiences, represented by the decision task features, the actions performed , and the utility of decision outcomes. This theory has been applied to the design of Instance-Based Learning (IBL) models of human behavior in a variety of contexts. One key feature of all IBL model applications is the method of accumulating instance-based memory and performing recognition-based retrieval. In simple tasks with few features, this knowledge representation and retrieval could hypothetically be done using all relevant information. However, these methods do not scale well to complex tasks when exhaustive enumeration of features is unfeasible. This requires cog-nitive modelers to design task-specific representations of state features, as well as similarity metrics, which can be time consuming and fail to generalize to related tasks. To address this issue, we leverage recent advancements in Artificial Neural Networks, specifically generative models (GMs), to learn representations of complex dynamic decision making tasks without relying on domain knowledge. We evaluate a range of GMs in their usefulness in forming representations that can be used by IBL models to predict human behavior in a complex decision making task. This work connects generative and cognitive models by using GMs to form representations and determine similarity.
Full-text available
How do humans coordinate perception and memory when learning and making decisions? Additionally, how do cognitive limitations and behavioural goals influence the optimal functioning of these faculties? Many accounts have sought to explain one or more of these faculties and how they impact behaviour. Relatively little attention has been given to how these cognitive faculties and goals are interrelated. This thesis will provide an account for how the human mind might optimally coordinate perception and memory with learning and decision making, relative to individual cognitive constraints. To achieve this goal, this thesis presents a cognitive model inspired by two areas of research. Firstly, computational modelling of biological visual perception and memory, which seeks to understand and predict these cognitive functions. Secondly, resource-rational analysis which seeks to understand how cognitive agents behave optimally under cognitive constraints, specifically information-theoretic constraints. The result of these connections is a cognitive model that makes predictions of perception, memory, learning, and decision making, while explaining how individuals coordinate these faculties relative to their goals and limitations. This model is first applied onto predicting human behaviour in a visual learning task collected in a previous experiment. Next, two novel experiments are introduced that incorporate utility judgements, change detection, and learning. Results from these experiments demonstrate that the proposed model is better able to account for detailed aspects of human behaviour compared to related methods. This improvement is due to the successful integration of multiple areas of research in biological perception and memory with learning and decision making, all under the resource-rational approach to cognitive modelling. 
This thesis concludes with a broad discussion of the importance of the proposed model and how it relates to remaining open questions in computational models of biological perception, memory, learning, and decision making.
Conference Paper
Full-text available
Humans are able to learn about the visual world with a remarkable degree of generality and robustness, in part due to attention mechanisms which focus limited resources onto relevant features. Deep learning models that seek to replicate this feature of human learning can do so by optimizing a so-called "disentanglement objective", which encourages representations that factorize stimuli into separable feature dimensions [4]. This objective is achieved by methods such as the β-Variational Autoencoder (β-VAE), which has demonstrated a strong correspondence to neural activity in biological visual representation formation [5]. However, in the β-VAE method, learned visual representations are not influenced by the utility of information, but are solely learned in an unsupervised fashion. In contrast to this, humans exhibit generalization of learning through acquired equivalence of visual stimuli associated with similar outcomes [7]. The question of how humans combine utility-based and unsupervised learning in the formation of visual representations is therefore unanswered. The current paper seeks to address this question by developing a modified β-VAE model which integrates both unsupervised learning and reinforcement learning. This model is trained to produce both psychological representations of visual information as well as predictions of utility based on these representations. The result is a model that predicts the impact of changing utility on visual representations. Our model demonstrates a high degree of predictive accuracy of human visual learning in a contextual multi-armed bandit learning task [8]. Importantly, our model takes as input the same complex visual information presented to participants, instead of relying on hand-crafted features. These results provide further support for disentanglement as a plausible learning objective for visual representation formation by demonstrating their usefulness in learning tasks that rely on attention mechanisms.
Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample-inefficient – that is, it may simply be too slow – to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.
In recent years, ideas from the computational field of reinforcement learning have revolutionized the study of learning in the brain, famously providing new, precise theories of how dopamine affects learning in the basal ganglia. However, reinforcement learning algorithms are notorious for not scaling well to multidimensional environments, as is required for real-world learning. We hypothesized that the brain naturally reduces the dimensionality of real-world problems to only those dimensions that are relevant to predicting reward, and conducted an experiment to assess by what algorithms and with what neural mechanisms this "representation learning" process is realized in humans. Our results suggest that a bilateral attentional control network comprising the intraparietal sulcus, precuneus, and dorsolateral prefrontal cortex is involved in selecting what dimensions are relevant to the task at hand, effectively updating the task representation through trial and error. In this way, cortical attention mechanisms interact with learning in the basal ganglia to solve the "curse of dimensionality" in reinforcement learning.
A novel method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces. The method approximates state-action values as negative free energies in an undirected graphical model called a product of experts. The model parameters can be learned efficiently because values and derivatives can be efficiently computed for a product of experts. Actions can be found even in large factored action spaces by the use of Markov chain Monte Carlo sampling. Simulation results show that the product of experts approximation can be used to solve large problems. In one simulation it is used to find actions in action spaces of size 2.
Inductive inference allows humans to make powerful generalizations from sparse data when learning about word meanings, unobserved properties, causal relationships, and many other aspects of the world. Traditional accounts of induction emphasize either the power of statistical learning, or the importance of strong constraints from structured domain knowledge, intuitive theories or schemas. We argue that both components are necessary to explain the nature, use and acquisition of human knowledge, and we introduce a theory-based Bayesian framework for modeling inductive learning and reasoning as statistical inferences over structured knowledge representations.
We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN). Our algorithm generalizes the recent algorithm of Kearns and Singh, and assumes that we are given both an algorithm for approximate planning, and the graphical structure (but not the parameters) of the DBN. Unlike the original algorithm, our new algorithm exploits the DBN structure to achieve a running time that scales polynomially in the number of parameters of the DBN, which may be exponentially smaller than the number of global states.
Elizabeth Baraff Bonawitz and Thomas L Griffiths. Deconfounding hypothesis generation and evaluation in Bayesian models. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 32, 2010.
Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
Xiaoyu Chen, Jiachen Hu, Lihong Li, and Liwei Wang. Efficient reinforcement learning in factored MDPs with application to constrained RL. In International Conference on Learning Representations, 2020.
Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.