Page 1

Adaptive Design Optimization:

A Mutual Information Based Approach to

Model Discrimination in Cognitive Science

Daniel R. CavagnaroJay I. MyungMark A. PittJanne V. Kujala

May 26, 2009

Abstract

Discriminating among competing statistical models is a pressing issue for many experimentalists in the

field of cognitive science. Resolving this issue begins with designing maximally informative experiments.

To this end, the problem to be solved in adaptive design optimization is identifying experimental designs

under which one can infer the underlying model in the fewest possible steps. When the models under

consideration are nonlinear, as is often the case in cognitive science, this problem can be impossible to

solve analytically without simplifying assumptions. However, as we show in this paper, a full solution

can be found numerically with the help of a Bayesian computational trick derived from the statistics

literature, which recasts the problem as a probability density simulation in which the optimal design is

the mode of the density. We use a utility function based on mutual information, and give three intuitive

interpretations of the utility function in terms of Bayesian posterior estimates. As a proof of concept,

we offer a simple example application to an experiment on memory retention.

1 Introduction

Experimentation is fundamental to the advancement of science, whether one is interested in studying the

neuronal basis of a sensory process in cognitive science or assessing the efficacy of a new drug in clinical

trials. In an adaptive experiment, the information learned from each test is used to adapt subsequent tests

to be maximally informative, in an appropriately defined sense. The problem to be solved in adaptive design

optimization (ADO) is to identify an experimental design under which one can infer the underlying model

in the fewest possible steps. This is particularly important in cases where measurements are costly or time

consuming.

Because of its flexibility and efficiency, the use of adaptive designs has become popular in many

fields of science. For example, in astrophysics, ADO has been used in the design of experiments to detect

extrasolar planets (Loredo, 2004). ADO has also been used in designing phase I and phase II clinical trials

to ascertain the dose-response relationship of experimental drugs (Haines et al., 2003; Ding et al., 2008), as

well as in estimating psychometric functions (Kujala and Lukka, 2006; Lesmes et al., 2006).

Bayesian decision theory offers a principled approach to the ADO problem. In this framework, each

potential design is treated as a gamble whose payoff is determined by the outcome of an experiment carried

out with that design. The idea is to estimate the utilities of hypothetical experiments carried out with each

design, so that an “expected utility” of each design can be computed. This is done by considering every

possible observation that could be obtained from an experiment with each design, and then evaluating the

relative likelihoods and statistical values of these observations. The design with the highest expected utility

value is then chosen as the optimal design.

Natural metrics for the utility of an experiment can be found in information theory. This was first

pointed out by Lindley (1956), who suggested maximization of Shannon information as a sensible criterion

for design optimization. MacKay (1992) was one of the first to apply such a criterion to ADO, using

1

Page 2

the expected change in entropy from one stage of experimentation to the next as the utility function. A

few other information-based utility functions have been proposed, including cross-entropy, Kullback-Leibler

divergence, and mutual information (Cover and Thomas, 1991). In particular, the desirability and usefulness

of the latter was formally justified by Paninski (2005) who proved that, under acceptably weak modeling

conditions, the adaptive approach with a utility function based on mutual information leads to consistent

and efficient parameter estimates.

Despite its theoretical appeal, the complexity of computing mutual information directly has proved

to be a major implementational challenge (Bernardo, 1979; Paninski, 2003, 2005). Consequently, most design

optimization research has been restricted to special cases such as linear-Gaussian models. For example, Lewi

et al. (2009) offers a fast algorithm for finding the design of a neurophysiology experiment that maximizes

the mutual information between the observed data and the parameters of a generalized linear model. Using

a Gaussian approximation of the posterior distribution to facilitate estimation of the mutual information,

the algorithm decreases the uncertainty of the parameter estimates much faster than an i.i.d. design, and

converges to the asymptotically optimal design. Other special cases can also facilitate the implementation of

the mutual-information-based approach. For example, Kujala and Lukka (2006) and Kujala et al. (submitted)

successfully implemented mutual information-based utility functions, for estimating psychometric functions

and for the design of adaptive learning games, respectively, with direct computation made possible by the

binary nature of the experimental outcomes.

The need for fast and accurate design optimization algorithms that can accommodate nonlinear

models has grown with recent developments of such models in cognitive science, such as those found in

memory retention (Rubin and Wenzel, 1996; Wixted and Ebbesen, 1991), category learning (Nosofsky and

Zaki, 2002; Vanpaemel and Storms, 2008), and numerical estimation (Opfer and Siegler, 2007). This problem

has also been approached in the astrophysics literature by Loredo (2004) who shows that so-called maximum

entropy sampling can be used to find the design that maximizes the expected Shannon information of the

posterior parameter estimates. This approach addresses the problem of ADO for parameter estimation, but

not for model discrimination. The latter problem is significantly more complex because it requires integration

over the space of models in addition to the integration over each model’s parameter space.

The problem of design optimization for discrimination of nonlinear models is considered in a non-

adaptive setting by Heavens et al. (2007) and by Myung and Pitt (in press). Heavens et al. used a Laplace

approximation of the expected Bayes factor as their utility function, and compared only nested models.

Myung and Pitt consider the problem much more generally. Rather than using an information-theoretic

utility function, they use a utility function based on the minimum description length principle (Gr¨ unwald,

2005). They bring to bear advanced stochastic Bayesian optimization techniques which allow them to find

optimal designs for discriminating among even highly complex, non-nested, nonlinear models.

In this paper we address the design optimization problem for discrimination of nonlinear models

in an adaptive setting. Following Paninski (2003, 2005), Kujala and Lukka (2006), Lewi et al. (2009), and

Kujala et al. (submitted), we use a utility function based on mutual information. That is, we measure the

utility of a design by the amount of information, about the relative likelihoods of the models in question, that

would be provided by the results of an experiment with that given design. Further, following Myung and

Pitt (in press), we apply a simulation-based approach for finding the full solution to the design optimization

problem, without relying upon linearization, normalization, nor approximation, as has often been done in

the past. We apply a Bayesian computational trick that was recently introduced in the statistics literature

(M¨ uller et al., 2004), which allows the optimal design to be found without evaluating the high-dimensional

integration and optimization directly. Briefly, the idea is to recast the problem as a density simulation in

which the optimal design corresponds to the mode of the density. The density is simulated with an interacting

particle filter, and the mode is found by gradually “sharpening up” the distribution with simulated annealing.

We also give several intuitive interpretations of the mutual information based utility function in terms of

Bayesian posterior estimates, which both elucidates the logic of the algorithm and connects it with common

statistical approaches to model selection in cognitive science. Finally, we demonstrate the approach with a

simple example application to an experiment on memory retention. In simulated experiments, the optimal

adaptive design outperforms all other comptetitors at identifying the data-generating model.

2

Page 3

2 Bayesian ADO Framework

Adaptive design optimization within a Bayesian framework has been considered at length in the statistics

community (Kiefer, 1959; Box and Hill, 1967; Chaloner and Verdinelli, 1995; Atkinson and Donev, 1992)

as well as in other science and engineering disciplines (e.g., El-Gamal and Palfrey, 1996; Bardsley et al.,

1996; Allen et al., 2003). The issue is essentially a Bayesian decision problem where, at each stage of

experimentation, the most informative design (i.e., the design with the highest expected utility) is chosen

based on the outcomes of the previous experiments. The criterion for the informativeness of a design often

depends on the goals of the experimenter. The experiment which yields the most precise parameter estimates

may not be the most effective at discriminating among competing models, for example (see Nelson, 2005,

for a comparison of several utility functions that have been used in cognitive science research).

Whatever the goals of the experiment may be, solving for the optimal design is a highly nontrivial

problem. The computation requires simultaneous optimization and high-dimensional integration, which can

be analytically intractable for the complex, nonlinear models as often used in many real-world problems.

Formally, ADO for model discrimination entails finding an optimal design that maximizes a utility function

U(d)

d∗= argmax

d

with the utility function defined as

{U(d)}

(1)

U(d) =

K

?

m=1

p(m)

? ?

u(d,θm,y)p(y|θm,d)p(θm)dy dθm,

(2)

where m = {1,2,...,K} is one of a set of K models being considered, d is a design, y is the outcome of

an experiment with design d under model m, and θmis a parameterization of model m. We refer to the

function u(d,θm,y) in Equation 2 as the local utility of the design d. It measures the utility of a hypothetical

experiment carried out with design d when the data generating model is m, the parameters of the model takes

the value θm, and the outcome y is observed. Thus, U(d) represents the expected value of the local utility

function, where the expectation is taken over all models under consideration, the full parameter space of

each model, and all possible observations given a particular model-parameter pair, with respect to the model

prior probability p(m), the parameter prior distribution p(θm), and the sampling distribution p(y|θm,d),

respectively.

The model and parameter priors are being updated on each stage s = {1,2,...} of experimentation.

Specifically, upon the specific outcome zsobserved at stage s of an actual experiment carried out with design

ds, the model and parameter priors to be used to find an optimal design at the next stage are updated via

Bayes rule and Bayes factor calculation (e.g., Gelman et al., 2004) as

ps+1(θm)=

p(zs|θm,ds)ps(θm)

?p(zs|θm,ds)ps(θm)dθm

?K

(3)

ps+1(m)=

p0(m)

k=1p0(k)BF(k,m)(zs)ps(θ)

(4)

where BF(k,m)(zs)ps(θ)denotes the Bayes factor defined as the ratio of the marginal likelihood of model k

to that of model m given the realized outcome zs, where the marginals are over the updated parameter

estimates from the preceding stage. The above updating scheme is applied successively on each stage of

experimentation, after an initialization with equal model priors p(s=0)(m) = 1/K and a non-informative

parameter prior p(s=0)(θm).

3 Computational Methods

To find the optimal design d∗in a general setting is exceedingly difficult. Given the multiple computational

challenges involved, standard optimization methods such as Newton-Raphson are out of question. However,

3

Page 4

a promising new approach to this problem has been proposed in statistics (M¨ uller et al., 2004).

a simulation-based approach that includes an ingenious computational trick that allows one to find the

optimal design without having to evaluate the integration and optimization directly in Equations (1) and(2).

The basic idea is to recast the design optimization problem as a simulation from a sequence of augmented

probability models.

To illustrate how it works, let us consider the design optimization problem to be solved at any given

stage s of experimentation, and, for simplicity, we will suppress the subscript s in the remainder of this

section. According to the computational trick of M¨ uller et al. (2004), we treat the design d as a random

variable and define an auxiliary distribution h(d,·) that admits U(d) as its marginal density. Specifically, we

define

?

m=1

where α(> 0) is the normalizing constant of the auxiliary distribution and

It is

h(d,y1,θ1,...,yK,θK) = α

K

?

p(m)u(d,θm,ym)

?

p(y1,θ1,...,yK,θK|d) (5)

p(y1,θ1,...,yK,θK|d) =

K

?

m=1

p(ym|θm,d)p(θm).

(6)

Note that the subscript m in the above equations refers to model m, not the stage of experimentation. For

instance, ymdenotes an experimental outcome generated from model m with design d and parameter θm.

Marginalizing h(d,·) over (y1,θ1,...,yK,θK) yields

?

K

?

=

αU(d).

h(d)=

...

?

h(d,y1,θ1,...yK,θK)dy1dθ1...dyKdθK

(7)

=

α

m=1

p(m)

? ?

u(d,θm,ym)p(ym|θm,d)p(θm)dymθm

(8)

(9)

Consequently, the design with the highest utility can be found by taking the mode of a sufficiently large

sample from the marginal distribution h(d). However, finding the global optimum could potentially require

a very large number of samples from h(d), especially if there are many local optima, or if the design space

is very irregular or high-dimensional. To overcome this problem, assuming h(d,·) is non-negative1and

bounded, we augment the auxiliary distribution with independent samples of y’s and θ’s given design d as

follows

J?

for a positive integer J and αJ(> 0). The marginal distribution of hJ(d) obtained after integrating out model

parameters and outcome variables will then be equal to αJU(d)J. Hence, as J increases, the distribution

hJ(d) will become more highly peaked around its (global) mode corresponding to the optimal design d∗,

thereby making it easier to identify the mode.

Following Amzal et al. (2006), we implemented a sequential Monte Carlo particle filter algorithm that

begins by simulating hJ(d,·) in Equation (10) for J = 1 and then increases J incrementally on subsequent

iterations on an appropriate simulated annealing schedule (Kirkpatrick et al., 1983; Doucet et al., 2001).

hJ(d,·) = αJ

j=1

h(d,y1,j,θ1,j,...,yK,j,θK,j)(10)

1Negative values of h(d,·) can be handled in the implementation by adding a small constant to the distribution and truncating

it at zero. This transformation does not change the location of the global maximum, provided that the truncated values are

not too extremeley negative. However, adding a constant does decrease the relative concentration of the distribution around

the global maximum, making it more difficult to find.

4

Page 5

4 Mutual Information Utility

Selection of a utility function that adequately captures the goals of the experiment is an integral, often crucial,

part of design optimization. A design that is optimal for parameter estimation is not necessarily optimal

for model selection. Perhaps the most studied optimization criterion in the design optimization literature

is minimization of the variance of parameter estimates. In the case of linear models, this is achieved by

maximizing the determinant of the variance-covariance matrix, which is called the D-optimality criterion

(Atkinson and Donev, 1992). For nonlinear models, a sensible choice of utility function is the negative

entropy of the posterior parameter estimates after observing experimental outcomes (Loredo, 2004; K¨ ueck

et al., 2006). It has been shown that such an entropy-based utility function also leads to D-optimality in the

linear-Gaussian case (Bernardo, 1979).

Implicit in the preceding optimality criteria is the assumption that the underlying model is correct.

Quite often, however, the researcher entertains multiple models and wishes to design an experiment that can

effectively distinguish them. One way to achieve this goal is to minimize model mimicry (i.e., the ability

of a model to account for data generated by a competing model). To this end, the T-optimality criterion

maximizes the sum-of-squares error between data generated from a model and the best fitting prediction of

another competing model (Atkinson and Federov, 1975a,b). In practice, however, sum-of-squares error is a

poor choice for model discrimination because it is biased toward more complex models (e.g., Myung, 2000).

As an alternative, one can use a statistical model selection criterion such as the Akaike information criterion

(Akaike, 1973), the Bayes factor (Kass and Raftery, 1995), or the minimum description length principle

(Gr¨ unwald, 2005; Myung and Pitt, in press; Balasubramanian et al., 2008).

One can also construct a utility function motivated from information theory (Cover and Thomas,

1991). In particular, mutual information seems to provide an ideal measure for quantifying the value of

an experiment design. Specifically, mutual information measures the reduction in uncertainty about one

variable that is provided by knowledge of the value of the other random variable. Formally, the mutual

information of a pair of random variables P and Q, taking values in X, is given by

I(P;Q) = H(P) − H(P|Q)(11)

where H(P) = −?

about P due to knowledge of Q. For example, if the two distributions were perfectly correlated, meaning

that knowledge of Q allowed perfect prediction of P, then the conditional distribution would be degenerate,

having entropy zero. Thus, the mutual information of P and Q would be H(P), meaning that all of the

entropy of P was eliminated through knowledge of Q. Mutual information is symmetric in the sense that

I(P;Q) = I(Q;P).

Mutual information can also be defined by a Kullback-Leibler divergence between a joint distri-

bution and the product of marginal distributions as I(P;Q) = DKL((P,Q),PQ), where DKL(P,Q) =

?

Thus, the mutual information of P and Q measures how “far” (in terms of KL-divergence) the actual joint

distribution is from what it would be if the distributions were independent. For example, if the distributions

actually were independent then the actual and hypothetical joint distributions would be identical and hence

the KL-divergence would be zero, meaning that Q provides no information about P.

Mutual information can be implemented as an optimality criterion in ADO for model discrimination

on each stage s (= 1,2,...) of experimentation in the following way. (For simplicity, we omit the subscript

s in the equations below.) Let M be a random variable defined over a model set {1,2,...,K}, representing

uncertainty about the true model, and let Y be a random variable denoting an experimental outcome. Hence

Prob.(M = m) = p(m) is the prior probability of model m, and Prob.(Y = y|d) =?K

Then I(M;Y |d) = H(M) − H(M|Y,d) measures the decrease in uncertainty about which model drives the

process under investigation given the outcome of an experiment with design d. Since H(M) is independent

x∈Xp(x)logp(x) is the entropy of P, and H(P|Q) =?

x∈Xp(x)H(P|Q = x) is the

conditional entropy of P given Q. A high mutual information indicates a large reduction in uncertainty

x∈XP(x)logP(x)

product PQ represents the hypothetical joint distribution of P and Q in the case that they were independent.

Q(x)is the Kullback-Leibler (KL) divergence between the two distributions P and Q. The

m=1p(y|d,m)p(m),

where p(y|d,m) =?p(y|θm,d)p(θm)dθm, is the associated prior over experimental outcomes given design d.

5

Page 6

of the design d, maximizing I(M;Y |d) on each stage of ADO is equivalent to minimizing H(M|Y,d), which

is the expected posterior entropy of M given d.

Implementing this ADO criterion requires identification of an appropriate local utility function

u(d,θm,y) in Equation (2); specifically, a function whose expectation over models, parameters, and observa-

tions is I(M;Y |d). Such a function can be found by writing

?

from whence it follows that setting u(d,θm,y) = logp(m|y,d)

p(m)

of a design for a given model and experimental outcome is the log ratio of the posterior probability to the

prior probability of that model. Put another way, the above utility function prescribes that a design that

increases our certainty about the model upon the observation of an outcome is more valued than a design

that does not.

Another interpretation of this local utility function can be obtained by rewriting it, applying Bayes

rule, as u(d,θm,y) = logp(y|d,m)

(in terms of KL-divergence) incurred from estimating the true distribution P∗over Y |d with the distribution

p(y|d) (Haussler and Opper, 1997). This net loss, or ‘regret’ is the additional loss over that which would

have be incurred from estimating P∗if true model were known (i.e., with p(y|d,m)). What this means for

ADO is that the observation that is to be made at each stage is the one whose result is the least expected,

or equivalently, the most surprising. In a manner of speaking, to learn the most we should test where we

know the least.

This local utility function can be interpreted in yet another way, in terms of Bayes factors, by

rewriting it as

?

where BF(k,m)(y) =

m, i.e., the Bayes Factor for model k over model m for y.2Examining equation 13 more closely, the weighted

sum of Bayes factors quantifies the evidence against m, provided by an observation y, aggregated across

head-to-head comparisons of m with each of the models under consideration. Further, the negative sign

means that to maximize the local utility is to minimize the aggregate evidence against m. Accordingly, the

designs that are favored by the utility function in Equation 12 are those that, on average, are expected to

produce the least amount of evidence against the true model, or equivalently, the largest amount of evidence

for the true model relative to the other models under consideration.

In what follows, we demonstrate the application of the adaptive design optimization framework for

discriminating retention models in cognitive science.

I(M;Y |d) =

K

m=1

p(m)

? ?

p(y|θm,d)p(θm) logp(m|y,d)

p(m)

dy dθm

(12)

yields U(d) = I(M;Y |d). Thus, the local utility

p(y|d). In this form, the local utility can be interpreted as the net informational loss

u(d,θm,y) = −log

K

k=1

p(k)BF(k,m)(y) (13)

p(y|k)

p(y|m)is the marginal likelihood of model k divided by the marginal likelihood of model

5 Application

A central issue in memory research is the rate of forgetting over time. Of the dozens of retention functions

(so called because the amount of information retained after study is measured) that have been evaluated by

researchers, two models, power (POW) and exponential (EXP), have received considerable attention. Both

are Bernoulli models, defined by p = a(t + 1)−band p = ae−bt, respectively, where p is the probability

of correct recall of a stimulus item (e.g., word) at time t, and a,b are model parameters. The maximum

likelihood estimates for a data set collected by Rubin et al. (1999) are depicted in Figure 1.

2Bayes factor evaluations within each utility estimate can be done by grid discretization if each model has only a few

parameters. More generally, Monte Carlo estimates can be used, but care must be taken to limit sampling error (see Han and

Carlin, 2001, for example).

6

Page 7

0 20 4060 80100

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Lag Time

Probability of Correct Response

y=0.9025(t+1)−0.4861

y=0.7103e−0.0833t

Figure 1: Maximum likelihood estimates for POW (solid lines) and EXP (dashed lines) obtained by Rubin

et al. (1999).

7

Page 8

Many experiments have been performed to precisely identify the functional form of retention (see

Rubin and Wenzel, 1996, for a thorough review). In a typical retention experiment, data are collected through

a sequence of trials, each of which assesses retention at a single time point, and the data are then aggregated

across trials so that a retention curve can be estimated. Each trial consists of ‘study phase,’ in which a

participant is given a list of words to memorize, followed by a ‘test phase,’ in which retention is assessed by

testing how many words the participant can correctly recall from the study list. The length of time between

the the study phase and the test phase is called the ‘lag time.’ The lag times are design variables that can

be controlled by the experimenter. Thus, the goal of design optimization is to find the most informative set

of lag times for the purpose of discriminating between the power and exponential models.

We conducted computer simulations to illustrate the ADO procedure for discriminating between the

power and exponential models of retention, in which optimal designs were sought over a series of stages of

experimentation. For simplicity, we only considered designs in which one lag time was tested in each stage

of experimentation. This luxury was afforded by two considerations. Firstly, unlike the non-adaptive setting

in which all of the lag times must be chosen before experimentation begins, in the adaptive setting we can

choose a new lag time after each set of observations. Secondly, unlike utility functions based on statistical

model selection criteria such as minimum description length (MDL), the mutual-information-based utility

function does not require computation of the maximum likelihood estimate (MLE) for each model. For these

two-parameter models, observations at no less than three distinct time points would be required to compute

the MLE, hence an MDL-based utility function would be undefined for a design with less than three test

phases.3

We used parameter priors a ∼ Beta(2,1) and b ∼ Beta(1,4) for POW, and a ∼ Beta(2,1) and

b ∼ Beta(1,80) for EXP.4Figure 2 depicts a random sample of curves generated by each model with

parameters drawn from these priors. At each stage of the simulated experiment, the most informative lag

time for discriminating the models was computed, data were generated from POW with a = 0.9025 and

b = 0.4861 (i.e., the MLE for EXP from Rubin et al.) and 10 Bernoulli trials at that time point, and the

predictive distributions were updated accordingly. We continued the process for ten stages of the experiment.

A typical profile of the posterior model probability ps(POW) as a function of stage s is shown by the solid

black line in Figure 3.

For comparison, we also conducted several simulated experiments with randomly generated designs.

These experiments with random designs proceeded in the manner described above, except that the lag time

at each stage was chosen randomly (i.e., from a continuous, uniform distribution) between zero and 100

seconds. The solid gray line in Figure 3 shows a typical posterior model probability curve obtained in these

random experiments.

The results of the experiments with random designs show the advantage of ADO over a less principled

approach to designing a sequential experiment, but they do not show how ADO compares with the current

standard in retention research. To do just that, we conducted additional simulations using a typical design

from the retention literature. While there is no established standard for the set of lag times to test in retention

experiments, a few conventions have emerged. For one, previous experiments utilize what we call ‘fixed

designs,’ in which the set of lag times at which to assess memory are specified before experimentation begins,

and held fixed for the duration of the experiment. Thus, there is no Bayesian updating between stages as

there would be in a sequential design, such as what would be prescribed by ADO. The lag times are typically

concentrated near zero and spaced roughly geometrically. For example, the aforementioned data set collected

by Rubin et al. (1999) used a design consisting of 10 lag times: (0s,1s,2s,4s,7s,12s,21s,35s,59s,99s). To

get a meaningful comparison between this fixed design and the sequential designs, we generated data at each

stage from the same model as in the previous simulations, but with just 1 Bernoulli trial at each of the 10 lag

times in the Rubin et al. design. That way, the “cost” of each stage, in terms of the number of trials, is the

3The MDL-based utility function is implemented with the Fisher Information Approximation (FIA) defined as FIA =

−lnf(y|ˆθ) +k

size, and I(θ) is the Fisher information matrix of sample size 1 (Myung and Pitt, in press).

4The priors reflect conventional wisdom about these retention models based on many years of investigation. The choice of

priors does indeed change the optimal solution, but the importance of this example is the process of finding a solution, not the

actual solution itself.

2lnn

2+ ln? ?|I(θ)|dθ, where f(y|ˆθ) is the maximum likelihood, k is the number of parameters n is the sample

8

Page 9

0 2040 6080 100

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Lag Time

Probability of Correct Response

Figure 2: A random sample of curves generated from POW (solid lines) and EXP (dashed lines) illustrating

the ability of these models to mimic one another. The models also include binomial error (not shown), which

further complicates the task of discriminating them.

9

Page 10

0246810

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Stage of Experiment

Probability of True Model (POW)

Optimal Adaptive Design

Random Sequential Design

Fixed 10pt Design

Figure 3: Posterior model probability curves from simulated experiments with each of the three designs, in

which data were generated from POW with a = 0.9025, b = 0.4861, and 10 Bernoulli trials per stage. As

predicted by the theory, the optimal adaptive design accumulates evidence for POW faster than either of

the competing designs.

10

Page 11

same as in the adaptive design. The posterior model probabilities and parameter estimates were computed

after each stage from all data up to that point. The obtained posterior model probabilities from a typical

simulation are shown by the dashed line in Figure 3.

The results of these simulations clearly demonstrate the efficiency of the optimal adaptive design.

The optimal-adaptive-design simulation identifies the correct model with over 0.95 probability after just four

stages or 40 Bernoulli trials. In contrast, the fixed-design simulation requires twice as many observations (8

stages or 80 Bernoulli trials) to produce a similar level of evidence in favor of the true model. The random-

design simulation does not conclusively discriminate the models even after all 10 stages were complete.

To ensure that the advantage of the optimal adaptive design was not due to the choice of POW as the

true model, we repeated each of the simulated experiments with data generated from EXP, with a = 0.7103

and b = 0.0833 (i.e., the MLE for EXP from Rubin et al.). The results of these simulations are given in Figure

4, and the advantage of the optimal adaptive design is apparent once again. The optimal-adaptive-design

simulation identifies the true model with over 0.93 probability after just 4 stages or 40 Bernoulli trials. This

is much quicker accumulation of evidence than in the fixed-design simulation, which requires all 10 stages, or

100 Bernoulli trials, to identify the true model 0.92 probability. Again, the random-design simulation does

not conclusively discriminate the models even after all 10 stages were complete.

This example is intended as a proof-of-concept. In this simple case, an optimal design could have been

found via comprehensive grid searches. However, the approach that we have demonstrated here generalizes

easily to much more complex problems in which a brute-force approach would be impractical or impossible.

Moreover, this example shows that the methodology does not necessarily require state-of-the-art computing

hardware, as all of the computations were performed in one night on a personal computer.

6 Conclusions

ADO is an example of a large class of problems that can be framed as Bayesian decision problems with

expected information as expected utility. For example, current work in neurophysics aims to continuously

optimize a stimulus ensemble in order to maximize mutual information between inputs and outputs (Toy-

oizumi et al., 2005; Brunel and Nadal, 1998; Machens, 2002; Machens et al., 2005). It is also related to

optimization of dynamic sensor networks (Hoffman et al., 2006), and online learning in neural networks

(Opper, 1999). In machine learning and reinforcement learning literatures, DO is known as active learning

or policy decision. Essentially, the same math problem is to be solved. In constructing phase portraits of

dynamic systems, designs are sought to minimize the mutual information between observations (Fraser and

Swinney, 1986).

The Bayesian ADO framework developed here is myopic in the sense that the optimization at each

stage is done as though the current stage will be the final stage. That is, it does not take into account

the potential for future stages at which a new optimal design will be sought based on the outcome at the

current stage. In reality, later designs will depend on previous outcomes. Finding the globally optimal

sequence of designs requires backward induction involving an exponentially increasing number of scenarios.

This challenging problem is considered by M¨ uller et al. (2007), who also offer an algorithm for approximating

a solution using constrained backward induction. We believe that future work should approach the ADO

problem from this framework.

In the special case where the goal of experimentation is to disciminate between just two models, a

natural choice for the utility of a design is the expected Bayes factor between the two models. This was the

approach employed by Heavens et al. (2007), for example. The expected Bayes factor works well as a utility

function because, as with mutual information, it is nonnegative, parameterization invariant, and does not

require computation of the MLE. However, the Bayes factor is not appropriate for comparing more than two

models. Mutual information provides a natural generalization of the expected Bayes Factor for comparing

more than two models.

In sum, the growing importance of computational modeling in many disciplines has led to a need

for sophisticated methods to discriminate these models. Adaptive design optimization is a principled and

maximally efficient means of doing so, one which achieves this goal by increasing the informativeness of an

11

Page 12

02468 10

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Stage of Experiment

Probability of True Model (EXP)

Optimal Adaptive Design

Random Sequential Design

Fixed 10pt Design

Figure 4: Posterior model probability curves from simulated experiments with each of the three designs,

in which data were generated from EXP with a = 0.7103, b = 0.0833, and 10 Bernoulli trials per stage.

Again, the optimal adaptive design accumulates evidence for the true model much faster than either of the

competing designs. The nonmonotonic behavior results from the data observed at a given stage being more

likely, according to the priors at that stage, under POW than under EXP. Even though the data are always

generated by EXP in these simulations, such behavior is not surprising given how closely POW can mimic

EXP as shown in Figure 2.

12

Page 13

experiment. When combined with a utility function that is based on mutual-information, the methodology

increases in flexibility, being applicable to more than two models simultaneously, and provides useful insight

into the model discrimination process.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Caski,

B. N. P. . F., editor, Proceedings of the Second International Symposium on Information Theory, pages

267–281, Budapest. Akademiai Kiado.

Allen, T., Yu, L., and Schmitz, J. (2003). An experimental design criterion for minimizing meta-model

prediction errors applied to die casting process design. Applied Statistics, 52:103–117.

Amzal, B., Bois, F., Parent, E., and Robert, C. (2006). Bayesian-optimal design via interacting particle

systems. Journal of the American Statistical Association, 101(474):773–785.

Atkinson, A. and Donev, A. (1992). Optimum Experimental Designs. Oxford University Press.

Atkinson, A. and Federov, V. (1975a). The design of experiments for discriminating between two rival

models. Biometrika, 62(1):57.

Atkinson, A. and Federov, V. (1975b). Optimal design: Experiments for discriminating between several

models. Biometrika, 62(2):289.

Balasubramanian, V., Larjo, K., and Seth, R. (2008). Experimental design and model selection: The example

of exoplanet detection. Festschrift in Honor of Jorma Rissanen.

Bardsley, W., Wood, R., and Melikhova, E. (1996). Optimal design: A computer program to study the best

possible spacing of design points for model discrimination. Computers & Chemistry, 20:145–157.

Bernardo, J. (1979). Expected information as expected utility. The Annals of Statistics, 7(3):686–690.

Box, G. and Hill, W. (1967). Discrimination among mechanistic models. Technometrics, 9:57–71.

Brunel, N. and Nadal, J.-P. (1998). Mutual information, fisher information, and population coding. Neural

Computation, 10:1731–1757.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3):273–

304.

Cover, T. and Thomas, J. (1991). Elements of Information Theory. John Wiley & Sons, Inc.

Ding, M., Rosner, G., and M¨ uller, P. (2008). Bayesian optimal design for phase ii screening trials. Biometrics,

64:886–894.

Doucet, A., de Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer.

El-Gamal, M. and Palfrey, T. (1996). Economical experiments: Bayesian efficient experimental design.

International Journal of Game Theory, 25:495–517.

Fraser, M. A. and Swinney, H. L. (1986). Independent coordinates for strange attractors from mutual

information. Physical Review A, 33(2):1134–1140.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2004). Bayesian Data Analysis. Chapman & Hall.

Gr¨ unwald, P. (2005). A tutorial introduction to the minimum description length principle. In Gr¨ unwald, P.,

Myung, I. J., and Pitt, M. A., editors, Advances in Minimum Description Length: Theory and Applications.

The M.I.T. Press.

13

Page 14

Haines, L., Perevozskaya, I., and Rosenberer, W. (2003). Bayesian optimal designs for phase i clinical trials.

Biometrics, 59:591–600.

Han, C. and Carlin, B. P. (2001). Mcmc methods for computing bayes factors: a comparative review. Journal

of the American Statistical Association, 96:1122–1132.

Haussler, D. and Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy

risk. Annals of Statistics, 25:2451–2492.

Heavens, A., Kitching, T., and Verde, L. (2007). On model selection forecasting, dark energy and modified

gravity. Monthly Notices of the Royal Astronomical Society, 380(3):1029–1035.

Hoffman, G., Waslander, S., and Tomlin, C. (2006). Mutual information methods with particle filters for

mobile sensor network control. IEEE Conference on Decision and Control, page 1019.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association,

90:773–795.

Kiefer, J. (1959). Optimum experimental designs. Journal of the Royal Statistical Society. Series B (Method-

ological), 21(2):272–319.

Kirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by simulated annealing. Science, 220:671–

680.

K¨ ueck, H., de Freitas, N., and Doucet, A. (2006). Smc samplers for bayesian optimal nonlinear design.

Nonlinear Statistical Signal Processing Workshop (NSSPW).

Kujala, J. and Lukka, T. (2006). Bayesian adaptive estimation: The next dimension. Journal of Mathematical

Psychology, 50(4):369–389.

Kujala, V., Richardson, U., and Lyytinen, H. (submitted). A bayesian-optimal principle for child-friendly

adaptation in learning games. Journal of Mathematical Psychology.

Lesmes, L., Jeon, S.-T., Lu, Z.-L., and Dosher, B. (2006). Bayesian adaptive estimation of threshold versus

contrast external noise functions: The quick tvc method. Vision Research, 46:3160–3176.

Lewi, J., Butera, R., and Paninski, L. (2009). Sequential optimal design of neurophysiology experiments.

Neural Computation, 21:619–687.

Lindley, D. (1956). On a measure of the information provided by an experiment. Annals of Mathematical

Statistics, 27(4):986–1005.

Loredo, T. J. (2004). Bayesian adaptive exploration. In Erickson, G. J. and Zhai, Y., editors, Bayesian

Inference and Maximum Entropy Methods in Science and Engineering: 23rd International Workshop on

Bayesian Inference and Maximum Entropy Methods in Science and Engineering, volume 707, pages 330–

346. AIP.

Machens, C. (2002). Adaptive sampling by information maximization. Physical Review Letters, 88(22).

Machens, C., Gollisch, T., Kolesnikova, O., and Herz, A. (2005). Testing the efficiency of sensory coding

with optimal stimulus ensembles. Neuron, 47:447–456.

MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computa-

tion, 4(4):590–604.

M¨ uller, P., Berry, D., Grieve, A., Smith, M., and Krams, M. (2007). Simulation-based sequential bayesian

design. Journal of Statistical Planning and Inference, 137:3140–3150.

14

Page 15

M¨ uller, P., Sanso, B., and De Iorio, M. (2004). Optimal bayesian design by inhomogeneous markov chain

simulation. Journal of the American Statistical Association, 99(467):788–798.

Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology,

44(4):190–204.

Myung, J. I. and Pitt, M. (in press). Optimal experimental design for model discrimination. Psychological

Review.

Nelson, J. (2005). Finding useful questions: On bayesian diagnosticity, probability, impact, and information

gain. Psychological Review, 112(4):979–999.

Nosofsky, R. and Zaki, S. (2002). Exemplar and prototype models revisited: Response strategies, selective

attention and stimulus generalization. Journal of Experimental Psychology, 28:924–940.

Opfer, J. and Siegler, R. (2007). Representational change and children’s numerical estimation. Cognitive

Psychology, 55:165–195.

Opper, M. (1999). A Bayesian approach to online learning. In Saad, D., editor, On-line Learning in Neural

networks, pages 363–377. Cambridge University Press.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.

Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design. Neural Computation,

17:1480–1507.

Rubin, D., Hinton, S., and Wenzel, A. (1999). The precise time course of retention. Journal of Experimental

Psychology, 25(5):1161–1176.

Rubin, D. and Wenzel, A. (1996). One hundred years of forgetting: A quantitative description of retention.

Psychological Review, 103(4):734–760.

Toyoizumi, T., Pfister, J.-P., Aihara, K., and Gerstner, W. (2005). Generalized bienenstock-cooper-munro

rule for spiking neurons that maximizes information transmission. Proceedings of the National Academy

of Sciences, 102(14):5239–5244.

Vanpaemel, W. and Storms, G. (2008). In search of abstraction: The varying abstraction model of catego-

rization. Psychonomic Bulletin & Review, 15:732–749.

Wixted, J. and Ebbesen, E. (1991). On the form of forgetting. Psychological Science, 2(6):409–415.

Acknowledgements

This research is supported by National Institute of Health Grant R01-MH57472 to JIM and MAP. Parts

of this work have been submitted for presentation at the 2009 Annual Meeting of the Cognitive Science

Society in Amsterdam, Netherlands. We wish to thank Hendrik K¨ ueck and Nando de Freitas for valuable

feedback and technical help provided for the project, and Michael Rosner for the implementation of the

design optimization algorithm in C++. Correspondence concerning this article should be addressed to

Daniel Cavagnaro, Department of Psychology, Ohio State University, 1835 Neil Avenue, Columbus, OH

43210. E-mail: cavagnaro.2@osu.edu.

15