Philosophy and the practice of Bayesian statistics

Andrew Gelman

Department of Statistics and Department of Political Science, Columbia University

Cosma Rohilla Shalizi

Statistics Department, Carnegie Mellon University

Santa Fe Institute

19 July 2010

Abstract

A substantial school in the philosophy of science identifies Bayesian inference with

inductive inference and even rationality as such, and seems to be strengthened by the

rise and practical success of Bayesian statistics. We argue that the most successful

forms of Bayesian statistics do not actually support that particular philosophy but

rather accord much better with sophisticated forms of hypothetico-deductivism. We

examine the actual role played by prior distributions in Bayesian models, and the crucial

aspects of model checking and model revision, which fall outside the scope of Bayesian

confirmation theory. We draw on the literature on the consistency of Bayesian updating

and also on our experience of applied work in social science.

Clarity about these matters should benefit not just philosophy of science, but also

statistical practice. At best, the inductivist view has encouraged researchers to fit and

compare models without checking them; at worst, theorists have actively discouraged

practitioners from performing model checking because it does not fit into their frame-

work.

1 The usual story—which we don’t like

In so far as I have a coherent philosophy of statistics, I hope it is “robust” enough

to cope in principle with the whole of statistics, and sufficiently undogmatic not

to imply that all those who may think rather differently from me are necessarily

stupid. If at times I do seem dogmatic, it is because it is convenient to give my

own views as unequivocally as possible. (Bartlett, 1967, p. 458)

Schools of statistical inference are sometimes linked to approaches to the philosophy

of science. “Classical” statistics—as exemplified by Fisher’s p-values, Neyman-Pearson

hypothesis tests, and Neyman’s confidence intervals—is associated with the hypothetico-

deductive and falsificationist view of science. Scientists devise hypotheses, deduce implica-

tions for observations from them, and test those implications. Scientific hypotheses can be

rejected (that is, falsified), but never really established or accepted in the same way. Mayo

(1996) presents the leading contemporary statement of this view.

arXiv:1006.3868v3 [math.ST] 19 Jul 2010

In contrast, Bayesian statistics or “inverse probability”—starting with a prior distribu-

tion, getting data, and moving to the posterior distribution—is associated with an inductive

approach of learning about the general from particulars. Rather than testing and attempted

falsification, learning proceeds more smoothly: an accretion of evidence is summarized by a

posterior distribution, and scientific process is associated with the rise and fall in the pos-

terior probabilities of various models; see Figure 1 for a schematic illustration. In this view,

the expression p(θ|y) says it all, and the central goal of Bayesian inference is computing

the posterior probabilities of hypotheses. Anything not contained in the posterior distri-

bution p(θ|y) is simply irrelevant, and it would be irrational (or incoherent) to attempt

falsification, unless that somehow shows up in the posterior. The goal is to learn about

general laws, as expressed in the probability that one model or another is correct. This

view, strongly influenced by Savage (1954), is widespread and influential in the philosophy

of science (especially in the form of Bayesian confirmation theory; see Howson and Urbach

1989; Earman 1992) and among Bayesian statisticians (Bernardo and Smith, 1994). Many

people see support for this view in the rising use of Bayesian methods in applied statistical

work over the last few decades.¹

We think most of this received view of Bayesian inference is wrong. Bayesian methods

are no more inductive than any other mode of statistical inference, which is to say, not inductive

in any strong sense. Bayesian data analysis is much better understood from a hypothetico-

deductive perspective.² Implicit in the best Bayesian practice is a stance that has much in

common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist

orientation. Indeed, crucial parts of Bayesian data analysis, such as model checking, can be

understood as “error probes” in Mayo’s sense.

We proceed by a combination of examining concrete cases of Bayesian data analysis in

empirical social science research, and theoretical results on the consistency and convergence

of Bayesian updating. Social-scientific data analysis is especially salient for our purposes
because there is general agreement that, in this domain, all models in use are wrong: not
merely falsifiable, but actually false. With enough data (and often only a fairly moderate
amount) any analyst could reject any model now in use to any desired level of confidence.
Model fitting is nonetheless a valuable activity, and indeed the crux of data analysis. To
understand why this is so, we need to examine how models are built, fitted, used, and
checked, and the effects of misspecification on models.

¹ Consider the current (9 June 2010) state of the Wikipedia article on Bayesian inference, which begins as
follows:

Bayesian inference is statistical inference in which evidence or observations are used to update
or to newly infer the probability that a hypothesis may be true.

It then continues with:

Bayesian inference uses aspects of the scientific method, which involves collecting evidence that
is meant to be consistent or inconsistent with a given hypothesis. As evidence accumulates,
the degree of belief in a hypothesis ought to change. With enough evidence, it should become
very high or very low.... Bayesian inference uses a numerical estimate of the degree of belief
in a hypothesis before evidence has been observed and calculates a numerical estimate of
the degree of belief in the hypothesis after evidence has been observed.... Bayesian inference
usually relies on degrees of belief, or subjective probabilities, in the induction process and does
not necessarily claim to provide an objective method of induction. Nonetheless, some Bayesian
statisticians believe probabilities can have an objective value and therefore Bayesian inference
can provide an objective method of induction.

These views differ from those of, e.g., Bernardo and Smith (1994) or Howson and Urbach (1989) only in the
omission of technical details.

² We are not interested in the hypothetico-deductive “confirmation theory” prominent in philosophy of
science from the 1950s through the 1970s, and linked to the name of Hempel (1965). The
hypothetico-deductive account of scientific method to which we appeal is distinct from, and much older than, this
particular sub-branch of confirmation theory.

Figure 1: Hypothetical picture of idealized Bayesian inference under the conventional
inductive philosophy. The posterior probability of different models changes over time with the
expansion of the likelihood as more data are entered into the analysis. Depending on the
context of the problem, the time scale on the x-axis might be hours, years, or decades, in any
case long enough for information to be gathered and analyzed that first knocks out model 1
in favor of model 2, which in turn is dethroned in favor of the current champion, model 3.

2 The data-analysis cycle

We begin with a very brief reminder of how statistical models are built and used in data

analysis, following Gelman et al. (2003), or, from a frequentist perspective, Guttorp (1995).

The statistician begins with a model that stochastically generates all the data y, whose

joint distribution is specified as a function of a vector of parameters θ from a space Θ

(which may, in the case of some so-called non-parametric models, be infinite dimensional).

This joint distribution is the likelihood function. The stochastic model may involve other,

unmeasured but potentially observable variables ỹ (that is, missing or latent data) and

more-or-less fixed aspects of the data-generating process as covariates. For both Bayesians

and frequentists, the joint distribution of (y, ỹ) depends on θ. Bayesians insist on a full joint

distribution, embracing observables, latent variables, and parameters, so that the likelihood

function becomes a conditional probability density, p(y|θ). In designing the stochastic pro-

cess for (y, ỹ), the goal is to represent the systematic relationships between the variables

and between the variables and the parameters, as well as to represent the noisy (con-

tingent, accidental, irreproducible) aspects of the data stochastically. Against the desire

for accurate representation one must balance conceptual, mathematical and computational

tractability. Some parameters thus have fairly concrete real-world referents, such as the

number of rats in the famous (in statistics) survey of the rat population of Baltimore (Brown et al., 1955). Others,

however, will reflect the specification as a mathematical object more than the reality be-

ing modeled—t-distributions are sometimes used to model heavy-tailed observational noise,

with the number of degrees of freedom for the t representing the shape of the distribution;

few statisticians would take this as realistically as the number of rats.

Bayesian modeling, as mentioned, requires a joint distribution for (y, ỹ, θ), which is
conveniently factored (without loss of generality) into a prior distribution for the parameters,
p(θ), and the complete-data likelihood, p(y, ỹ|θ), so that p(y|θ) = ∫ p(y, ỹ|θ) dỹ. The prior
distribution is, as we will see, really part of the model. In practice, the various parts of the
model have functional forms picked by a mix of substantive knowledge, scientific conjectures,
statistical properties, analytical convenience, and computational tractability.

Having completed the specification, the Bayesian analyst calculates the posterior
distribution p(θ|y); for this quantity to make sense, the observed y and the parameters
θ must have a joint distribution. The rise of Bayesian methods in applications has rested
on finding new ways to actually carry through this calculation, even if only approximately,
notably by adopting Markov chain Monte Carlo methods, originally developed in
statistical physics to evaluate high-dimensional integrals (Metropolis et al., 1953; Newman
and Barkema, 1999), to sample from the posterior distribution. The natural counterparts of
this stage for non-Bayesian analyses are various forms of point and interval estimation to
identify the set of values of θ that are consistent with the data y.

According to the view we sketched above, data analysis basically ends with the
calculation of the posterior p(θ|y). At most, this might be elaborated by partitioning Θ into a set
of models or hypotheses, Θ_1, ..., Θ_K, each with a prior probability p(Θ_k) and its own set of
parameters θ_k. One would then compute the posterior parameter distribution within each
model, p(θ_k|y, Θ_k), and the posterior probabilities of the models,

$$
p(\Theta_k \mid y)
= \frac{p(\Theta_k)\, p(y \mid \Theta_k)}{\sum_{k'} p(\Theta_{k'})\, p(y \mid \Theta_{k'})}
= \frac{p(\Theta_k) \int p(y, \theta_k \mid \Theta_k)\, d\theta_k}{\sum_{k'} p(\Theta_{k'}) \int p(y, \theta_{k'} \mid \Theta_{k'})\, d\theta_{k'}} .
$$

These posterior probabilities of hypotheses can be used for Bayesian model selection or
Bayesian model averaging (topics to which we return below). Scientific progress, in this
view, consists of gathering data (perhaps through well-designed experiments, designed to
distinguish among interesting competing scientific hypotheses; cf. Atkinson and Donev,
1992; Paninski, 2005) and then plotting the p(Θ_k|y)'s over time and watching the system
learn (as sketched in Figure 1).

In our view, the account of the last paragraph is crucially mistaken. The data-analysis
process, Bayesian or otherwise, does not end with calculating parameter estimates or
posterior distributions. Rather, the model can then be checked, by comparing the implications
of the fitted model to the empirical evidence. One asks questions like, Do simulations from
the fitted model resemble the original data? Is the fitted model consistent with other data
not used in the fitting of the model? Do variables that the model says are noise (“error
terms”) in fact display readily-detectable patterns? Discrepancies between the model and
data can be used to learn about the ways in which the model is inadequate for the scientific
purposes at hand, and thus to motivate expansions and changes to the model (§4).
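
To make the contrast concrete, here is a minimal sketch in Python, under assumptions of our own choosing rather than anything from the applied work discussed in this paper. It takes three hypothetical point models for a binary outcome (success probabilities 0.3, 0.5 and 0.7, with equal prior probabilities) and data whose success frequency is 0.65, so that every candidate model is false. The function `model_posterior` computes the posterior model probabilities p(Θ_k|y); the function `z_score` is one simple check of an implication of the favored model, namely its predicted count of successes. All names and numbers are illustrative.

```python
import math

# Hypothetical point models Theta_1..Theta_3: each fixes the success
# probability of an i.i.d. binary outcome. Equal prior probabilities.
thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]


def model_posterior(k, n):
    """Posterior model probabilities given k successes in n trials.

    The binomial coefficient is the same for every model and cancels in
    the ratio, so it is omitted. Work on the log scale to avoid underflow.
    """
    log_post = [
        math.log(p) + k * math.log(t) + (n - k) * math.log(1 - t)
        for p, t in zip(prior, thetas)
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def z_score(k, n, theta):
    """Standardized gap between the observed count and the model's prediction."""
    return (k - n * theta) / math.sqrt(n * theta * (1 - theta))


results = {}
for n in (20, 200, 2000):
    k = 13 * n // 20  # observed frequency 0.65, which matches none of the models
    post = model_posterior(k, n)
    best = max(range(len(thetas)), key=post.__getitem__)
    results[n] = (post, z_score(k, n, thetas[best]))
    print(n, [round(p, 3) for p in post], round(results[n][1], 2))
```

As n grows, the posterior probability of the model with θ = 0.7 approaches 1, while the same data put the observed count almost five standard deviations away from that model's prediction at n = 2000. The posterior model probabilities only rank the candidates against one another; the misfit shows up only when an implication of the favored model is compared with the data.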

2.1 Example: Estimating voting patterns in subsets of the population

We demonstrate the hypothetico-deductive Bayesian modeling process with an example

from our recent applied research (Gelman et al., 2010). In recent years, American political

scientists have been increasingly interested in the connections between politics and income

inequality (see, e.g., McCarty et al. 2006). In our own contribution to this literature, we

estimated the attitudes of rich, middle-income, and poor voters in each of the fifty states

(Gelman et al., 2008b). As we described in our article on the topic (Gelman et al., 2008c),

we began by fitting a varying-intercept logistic regression: modeling votes (coded as y = 1

for votes for the Republican presidential candidate or y = 0 for Democratic votes) given

family income (coded in five categories from low to high as x = −2,−1,0,1,2), using a

model of the form Pr(y = 1) = logit⁻¹(a_s + bx), where s indexes state of residence (the

model is fit to survey responses) and the varying intercepts a_s correspond to some states

being more Republican-leaning than others. Thus, for example, a_s has a positive value in a

conservative state such as Utah and a negative value in a liberal state such as California.

The coefficient b represents the “slope” of income, and its positive value indicates that,

within any state, richer voters are more likely to vote Republican.
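
The functional form of the varying-intercept model can be sketched in a few lines of Python. The intercepts and slope below are hypothetical values chosen only to respect the signs described in the text (a positive intercept for Utah, a negative one for California, a positive common slope); they are not estimates from our fitted model.

```python
import math


def inv_logit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values for illustration only, not fitted estimates.
a = {"Utah": 0.8, "California": -0.4}  # state intercepts a_s
b = 0.2                                # common income slope
income = [-2, -1, 0, 1, 2]             # five income categories, low to high

# Pr(y = 1) = inv_logit(a_s + b * x) for each state and income category.
prob = {
    state: [inv_logit(a_s + b * x) for x in income]
    for state, a_s in a.items()
}

for state, p in prob.items():
    print(state, [round(v, 2) for v in p])
```

Because the slope is shared, the curves for different states are shifted copies of one another on the logit scale: the intercept moves a state's curve up or down, but the income gradient is forced to be the same everywhere.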

It turned out that this varying-intercept model did not fit our data, as we learned

by making graphs of the average survey response and fitted curves for the different income

categories within each state. We had to expand to a varying-intercept, varying-slope model,

Pr(y = 1) = logit⁻¹(a_s + b_s x), in which the slopes b_s varied by state as well. This model

expansion led to a corresponding expansion in our understanding: we learned that the gap

in voting between rich and poor is much greater in poor states such as Mississippi than in

rich states such as Connecticut. Thus, the polarization between rich and poor voters varied

in important ways geographically.
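
The kind of discrepancy that such graphs reveal can be imitated numerically. In the sketch below, written in Python with made-up numbers, the "observed" proportions are generated without noise from a varying-slope model (a steep slope for Mississippi, a nearly flat one for Connecticut) and compared with the curve implied by a common-slope model. The intercepts, slopes, and the noise-free stand-in for average survey responses are all assumptions for illustration, not our data or estimates.

```python
import math


def inv_logit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


income = [-2, -1, 0, 1, 2]

# Hypothetical parameters, chosen only to mimic the qualitative pattern.
a = {"Mississippi": 0.2, "Connecticut": -0.2}      # state intercepts a_s
b_state = {"Mississippi": 0.5, "Connecticut": 0.05}  # state slopes b_s
b_common = 0.275                                   # slope forced to be shared

residuals = {}
for state in a:
    # Stand-in for the average survey response in each income category.
    observed = [inv_logit(a[state] + b_state[state] * x) for x in income]
    # Curve implied by the varying-intercept (common-slope) model.
    fitted = [inv_logit(a[state] + b_common * x) for x in income]
    residuals[state] = [o - f for o, f in zip(observed, fitted)]
    print(state, [round(r, 3) for r in residuals[state]])
```

Under these assumptions the residuals are not scattered around zero: they run from negative to positive across income categories in the steep-slope state and from positive to negative in the flat-slope state. A systematic pattern of this sort in what the model treats as noise is the signal that the model needs to be expanded.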

We found this not through any process of Bayesian induction but rather through model

checking. Bayesian inference was crucial, not for computing the posterior probability that

any particular model was true—we never actually did that—but in allowing us to fit rich

enough models in the first place that we could study state-to-state variation, incorporating

in our analysis relatively small states such as Mississippi and Connecticut that did not have

large samples in our survey. (Gelman and Hill (2006) review the hierarchical models that

allow such partial pooling.)

Life continues, though, and so do our statistical struggles. After the 2008 election,

we wanted to make similar plots, but this time we found that even our more complicated

logistic regression model did not fit the data—especially when we wanted to expand our

model to estimate voting patterns for different ethnic groups. Comparing the data to the fitted

model prompted further model expansions and produced our current specification, which uses a varying-

intercept, varying-slope logistic regression as a baseline but allows for nonlinear and even

non-monotonic patterns on top of that. Figure 2 shows some of our inferences in map form,

while Figure 3 shows one of our diagnostics of data and model fit.

The power of Bayesian inference here is deductive: given the data and some model

assumptions, it allows us to make lots of inferences, many of which can be checked and
