Philosophy and the practice of Bayesian statistics
Andrew Gelman
Department of Statistics and Department of Political Science, Columbia University
Cosma Rohilla Shalizi
Statistics Department, Carnegie Mellon University
Santa Fe Institute
19 July 2010
A substantial school in the philosophy of science identifies Bayesian inference with
inductive inference and even rationality as such, and seems to be strengthened by the
rise and practical success of Bayesian statistics. We argue that the most successful
forms of Bayesian statistics do not actually support that particular philosophy but
rather accord much better with sophisticated forms of hypothetico-deductivism. We
examine the actual role played by prior distributions in Bayesian models, and the crucial
aspects of model checking and model revision, which fall outside the scope of Bayesian
confirmation theory. We draw on the literature on the consistency of Bayesian updating
and also on our experience of applied work in social science.
Clarity about these matters should benefit not just philosophy of science, but also
statistical practice. At best, the inductivist view has encouraged researchers to fit and
compare models without checking them; at worst, theorists have actively discouraged
practitioners from performing model checking because it does not fit into their framework.
1 The usual story—which we don’t like
In so far as I have a coherent philosophy of statistics, I hope it is “robust” enough
to cope in principle with the whole of statistics, and sufficiently undogmatic not
to imply that all those who may think rather differently from me are necessarily
stupid. If at times I do seem dogmatic, it is because it is convenient to give my
own views as unequivocally as possible. (Bartlett, 1967, p. 458)
Schools of statistical inference are sometimes linked to approaches to the philosophy
of science. “Classical” statistics—as exemplified by Fisher’s p-values, Neyman-Pearson
hypothesis tests, and Neyman’s confidence intervals—is associated with the hypothetico-
deductive and falsificationist view of science. Scientists devise hypotheses, deduce implica-
tions for observations from them, and test those implications. Scientific hypotheses can be
rejected (that is, falsified), but never really established or accepted in the same way. Mayo
(1996) presents the leading contemporary statement of this view.
In contrast, Bayesian statistics or “inverse probability”—starting with a prior distribu-
tion, getting data, and moving to the posterior distribution—is associated with an inductive
approach of learning about the general from particulars. Rather than testing and attempted
falsification, learning proceeds more smoothly: an accretion of evidence is summarized by a
posterior distribution, and scientific process is associated with the rise and fall in the pos-
terior probabilities of various models; see Figure 1 for a schematic illustration. In this view,
the expression p(θ|y) says it all, and the central goal of Bayesian inference is computing
the posterior probabilities of hypotheses. Anything not contained in the posterior distri-
bution p(θ|y) is simply irrelevant, and it would be irrational (or incoherent) to attempt
falsification, unless that somehow shows up in the posterior. The goal is to learn about
general laws, as expressed in the probability that one model or another is correct. This
view, strongly influenced by Savage (1954), is widespread and influential in the philosophy
of science (especially in the form of Bayesian confirmation theory; see Howson and Urbach
1989; Earman 1992) and among Bayesian statisticians (Bernardo and Smith, 1994). Many
people see support for this view in the rising use of Bayesian methods in applied statistical
work over the last few decades.1
We think most of this received view of Bayesian inference is wrong. Bayesian methods
are no more inductive than any other mode of statistical inference, which is to say, not
inductive in any strong sense. Bayesian data analysis is much better understood from a
hypothetico-deductive perspective.2 Implicit in the best Bayesian practice is a stance that has much in
common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist
orientation. Indeed, crucial parts of Bayesian data analysis, such as model checking, can be
understood as “error probes” in Mayo’s sense.
We proceed by a combination of examining concrete cases of Bayesian data analysis in
empirical social science research, and theoretical results on the consistency and convergence
of Bayesian updating. Social-scientific data analysis is especially salient for our purposes
1Consider the current (9 June 2010) state of the Wikipedia article on Bayesian inference, which begins as
Bayesian inference is statistical inference in which evidence or observations are used to update
or to newly infer the probability that a hypothesis may be true.
It then continues with:
Bayesian inference uses aspects of the scientific method, which involves collecting evidence that
is meant to be consistent or inconsistent with a given hypothesis. As evidence accumulates,
the degree of belief in a hypothesis ought to change. With enough evidence, it should become
very high or very low....Bayesian inference uses a numerical estimate of the degree of belief
in a hypothesis before evidence has been observed and calculates a numerical estimate of
the degree of belief in the hypothesis after evidence has been observed....Bayesian inference
usually relies on degrees of belief, or subjective probabilities, in the induction process and does
not necessarily claim to provide an objective method of induction. Nonetheless, some Bayesian
statisticians believe probabilities can have an objective value and therefore Bayesian inference
can provide an objective method of induction.
These views differ from those of, e.g., Bernardo and Smith (1994) or Howson and Urbach (1989) only in the
omission of technical details.
2We are not interested in the hypothetico-deductive “confirmation theory” prominent in philosophy of
science from the 1950s through the 1970s, and linked to the name of Hempel (1965). The hypothetico-
deductive account of scientific method to which we appeal is distinct from, and much older than, this
particular sub-branch of confirmation theory.
Figure 1: Hypothetical picture of idealized Bayesian inference under the conventional in-
ductive philosophy. The posterior probability of different models changes over time with the
expansion of the likelihood as more data are entered into the analysis. Depending on the
context of the problem, the time scale on the x-axis might be hours, years, or decades, in any
case long enough for information to be gathered and analyzed that first knocks out model 1
in favor of model 2, which in turn is dethroned in favor of the current champion, model 3.
because there is general agreement that, in this domain, all models in use are wrong—not
merely falsifiable, but actually false. With enough data—and often only a fairly moderate
amount—any analyst could reject any model now in use to any desired level of confidence.
Model fitting is nonetheless a valuable activity, and indeed the crux of data analysis. To
understand why this is so, we need to examine how models are built, fitted, used, and
checked, and the effects of misspecification on models.
2 The data-analysis cycle
We begin with a very brief reminder of how statistical models are built and used in data
analysis, following Gelman et al. (2003), or, from a frequentist perspective, Guttorp (1995).
The statistician begins with a model that stochastically generates all the data y, whose
joint distribution is specified as a function of a vector of parameters θ from a space Θ
(which may, in the case of some so-called non-parametric models, be infinite dimensional).
This joint distribution is the likelihood function. The stochastic model may involve other,
unmeasured but potentially observable variables ỹ—that is, missing or latent data—and
more-or-less fixed aspects of the data-generating process as covariates. For both Bayesians
and frequentists, the joint distribution of (y, ỹ) depends on θ. Bayesians insist on a full joint
distribution, embracing observables, latent variables, and parameters, so that the likelihood
function becomes a conditional probability density, p(y|θ). In designing the stochastic
process for (y, ỹ), the goal is to represent the systematic relationships between the variables
and between the variables and the parameters, as well as to represent the noisy (contingent,
accidental, irreproducible) aspects of the data stochastically. Against the desire
for accurate representation one must balance conceptual, mathematical and computational
tractability. Some parameters thus have fairly concrete real-world referents, such as the fa-
mous (in statistics) survey of the rat population of Baltimore (Brown et al., 1955). Others,
however, will reflect the specification as a mathematical object more than the reality be-
ing modeled—t-distributions are sometimes used to model heavy-tailed observational noise,
with the number of degrees of freedom for the t representing the shape of the distribution;
few statisticians would take this as realistically as the number of rats.
Bayesian modeling, as mentioned, requires a joint distribution for (y, ỹ, θ), which is
conveniently factored (without loss of generality) into a prior distribution for the parameters,
p(θ), and the complete-data likelihood, p(y, ỹ|θ), so that p(y|θ) = ∫ p(y, ỹ|θ) dỹ. The prior
distribution is, as we will see, really part of the model. In practice, the various parts of the
model have functional forms picked by a mix of substantive knowledge, scientific conjectures,
statistical properties, analytical convenience, and computational tractability.
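As a toy illustration of this factorization (our example, not one from the paper): in a conjugate beta-binomial model the prior p(θ) and likelihood p(y|θ) combine into a posterior that is available in closed form, so the role of the prior as part of the model is explicit. All numbers here are hypothetical.

```python
# Toy model: y | theta ~ Binomial(n, theta), theta ~ Beta(a, b).
# The joint p(y, theta) factors as prior p(theta) times likelihood p(y | theta);
# conjugacy gives the posterior p(theta | y) in closed form as Beta(a + y, b + n - y).

def beta_binomial_posterior(a, b, n, y):
    """Return the (a, b) parameters of the Beta posterior p(theta | y)."""
    return a + y, b + n - y

# A flat Beta(1, 1) prior updated with 7 successes out of 10 trials.
a_post, b_post = beta_binomial_posterior(1, 1, 10, 7)
posterior_mean = a_post / (a_post + b_post)  # (1 + 7) / (2 + 10) = 8/12
```

Changing the prior parameters (a, b) changes the posterior, which is one concrete sense in which the prior is part of the model rather than external to it.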
Having completed the specification, the Bayesian analyst calculates the posterior distri-
bution p(θ|y); it is so that this quantity makes sense that the observed y and the parameters
θ must have a joint distribution. The rise of Bayesian methods in applications has rested
on finding new ways to actually carry through this calculation, even if only approximately,
notably by adopting Markov chain Monte Carlo methods, originally developed in
statistical physics to evaluate high-dimensional integrals (Metropolis et al., 1953; Newman
and Barkema, 1999), to sample from the posterior distribution. The natural counterparts of
this stage for non-Bayesian analyses are various forms of point and interval estimation to
identify the set of values of θ that are consistent with the data y.
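A minimal random-walk Metropolis sketch can make the sampling idea concrete. This is an illustration under made-up data (7 successes in 10 trials, flat prior), not any sampler used in the applied work cited; real applications use far more elaborate algorithms.

```python
import random
import math

def log_post(theta, n=10, y=7):
    # Unnormalized log posterior: flat prior on (0, 1) times binomial likelihood.
    if not 0.0 < theta < 1.0:
        return -math.inf
    return y * math.log(theta) + (n - y) * math.log(1.0 - theta)

def metropolis(log_target, start, n_iter=20000, step=0.2, seed=1):
    rng = random.Random(seed)
    theta, lp = start, log_target(start)
    draws = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)        # random-walk proposal
        lp_prop = log_target(prop)
        if math.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
            theta, lp = prop, lp_prop
        draws.append(theta)
    return draws

draws = metropolis(log_post, start=0.5)
post_mean = sum(draws[5000:]) / len(draws[5000:])  # discard burn-in
```

The retained draws approximate the posterior, here concentrating near the analytic posterior mean 8/12.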
According to the view we sketched above, data analysis basically ends with the calcula-
tion of the posterior p(θ|y). At most, this might be elaborated by partitioning Θ into a set
of models or hypotheses, Θ1, . . . , ΘK, each with a prior probability p(Θk) and its own set of
parameters θk. One would then compute the posterior parameter distribution within each
model, p(θk|y, Θk), and the posterior probabilities of the models,

    p(Θk|y) = p(y|Θk) p(Θk) / Σk′ p(y|Θk′) p(Θk′),  where p(y|Θk) = ∫ p(y|θk, Θk) p(θk|Θk) dθk.
These posterior probabilities of hypotheses can be used for Bayesian model selection or
Bayesian model averaging (topics to which we return below). Scientific progress, in this
view, consists of gathering data—perhaps through well-designed experiments, designed to
distinguish among interesting competing scientific hypotheses (cf. Atkinson and Donev,
1992; Paninski, 2005)—and then plotting the p(Θk|y)’s over time and watching the system
learn (as sketched in Figure 1).
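For a pair of hypothetical models where the marginal likelihoods p(y|Θk) are available in closed form, the posterior model probabilities can be computed exactly; the example below (ours, for illustration) compares a point-null model against a uniform-prior alternative for binomial data.

```python
from math import comb

n, y = 10, 7  # hypothetical data: 7 successes in 10 trials

# Model 1: theta fixed at 0.5, so the marginal likelihood is just the likelihood.
m1 = comb(n, y) * 0.5 ** n

# Model 2: theta ~ Uniform(0, 1). Integrating the binomial likelihood over this
# prior gives exactly 1 / (n + 1).
m2 = 1.0 / (n + 1)

# With equal prior model probabilities p(M1) = p(M2) = 1/2, Bayes' rule gives:
p1 = m1 / (m1 + m2)
p2 = m2 / (m1 + m2)
```

Plotting such probabilities as data accumulate is exactly the dynamic sketched in Figure 1, and it is this picture of inference that the rest of the paper argues against taking as the whole story.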
In our view, the account of the last paragraph is crucially mistaken. The data-analysis
process—Bayesian or otherwise—does not end with calculating parameter estimates or pos-
terior distribution. Rather, the model can then be checked, by comparing the implications
of the fitted model to the empirical evidence. One asks questions like, Do simulations from
the fitted model resemble the original data? Is the fitted model consistent with other data
not used in the fitting of the model? Do variables that the model says are noise (“error
terms”) in fact display readily-detectable patterns? Discrepancies between the model and
data can be used to learn about the ways in which the model is inadequate for the scientific
purposes at hand, and thus to motivate expansions and changes to the model (§4).
2.1 Example: Estimating voting patterns in subsets of the population
We demonstrate the hypothetico-deductive Bayesian modeling process with an example
from our recent applied research (Gelman et al., 2010). In recent years, American political
scientists have been increasingly interested in the connections between politics and income
inequality (see, e.g., McCarty et al. 2006). In our own contribution to this literature, we
estimated the attitudes of rich, middle-income, and poor voters in each of the fifty states
(Gelman et al., 2008b). As we described in our article on the topic (Gelman et al., 2008c),
we began by fitting a varying-intercept logistic regression: modeling votes (coded as y = 1
for votes for the Republican presidential candidate or y = 0 for Democratic votes) given
family income (coded in five categories from low to high as x = −2,−1,0,1,2), using a
model of the form Pr(y = 1) = logit−1(as + bx), where s indexes state of residence—the
model is fit to survey responses—and the varying intercepts as correspond to some states
being more Republican-leaning than others. Thus, for example, as has a positive value in a
conservative state such as Utah and a negative value in a liberal state such as California.
The coefficient b represents the “slope” of income, and its positive value indicates that,
within any state, richer voters are more likely to vote Republican.
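The structure of this varying-intercept model can be sketched directly; the intercepts and slope below are illustrative numbers chosen to match the qualitative description, not the paper's estimates.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical state intercepts a_s and a common positive income slope b.
a = {"Utah": 0.8, "California": -0.6}  # illustrative values, not estimates
b = 0.25
incomes = [-2, -1, 0, 1, 2]            # the five income categories

# Probability of a Republican vote by state and income category:
# Pr(y = 1) = inv_logit(a_s + b * x).
probs = {s: [inv_logit(a_s + b * x) for x in incomes] for s, a_s in a.items()}
```

Because b is shared across states, every state's curve rises with income at the same rate; the intercepts only shift the curves up or down. That restriction is precisely what the model check described next calls into question.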
It turned out that this varying-intercept model did not fit our data, as we learned
by making graphs of the average survey response and fitted curves for the different income
categories within each state. We had to expand to a varying-intercept, varying-slope model,
Pr(y = 1) = logit−1(as + bsx), in which the slopes bs varied by state as well. This model
expansion led to a corresponding expansion in our understanding: we learned that the gap
in voting between rich and poor is much greater in poor states such as Mississippi than in
rich states such as Connecticut. Thus, the polarization between rich and poor voters varied
in important ways geographically.
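Under the expanded model, the rich-poor gap within a state depends on the state-specific slope bs. The numbers below are made up to echo the qualitative finding (a steep income slope in a poor state, a shallow one in a rich state); they are not the fitted values.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (a_s, b_s) pairs: steep income slope in the poor state,
# shallow slope in the rich state. Not the paper's estimates.
states = {"Mississippi": (0.4, 0.60), "Connecticut": (-0.3, 0.05)}

def rich_poor_gap(a_s, b_s):
    # Difference in Pr(Republican vote) between the richest (x = 2)
    # and poorest (x = -2) income categories within a state.
    return inv_logit(a_s + 2 * b_s) - inv_logit(a_s - 2 * b_s)

gaps = {s: rich_poor_gap(a_s, b_s) for s, (a_s, b_s) in states.items()}
```

The varying-intercept model forces these gaps to be (nearly) equal across states; letting bs vary is what allows the data to reveal that they are not.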
We found this not through any process of Bayesian induction but rather through model
checking. Bayesian inference was crucial, not for computing the posterior probability that
any particular model was true—we never actually did that—but in allowing us to fit rich
enough models in the first place that we could study state-to-state variation, incorporating
in our analysis relatively small states such as Mississippi and Connecticut that did not have
large samples in our survey. (Gelman and Hill (2006) review the hierarchical models that
allow such partial pooling.)
Life continues, though, and so do our statistical struggles. After the 2008 election,
we wanted to make similar plots, but this time we found that even our more complicated
logistic regression model did not fit the data—especially when we wanted to expand our
model to estimate voting patterns for different ethnic groups. Comparison of data to fit
led to further model expansions, leading to our current specification, which uses a varying-
intercept, varying-slope logistic regression as a baseline but allows for nonlinear and even
non-monotonic patterns on top of that. Figure 2 shows some of our inferences in map form,
while Figure 3 shows one of our diagnostics of data and model fit.
The power of Bayesian inference here is deductive: given the data and some model
assumptions, it allows us to make lots of inferences, many of which can be checked and
Peter D. Gr¨ unwald and John Langford. Suboptimal behavior of Bayes and MDL in clas-
sification under misspecification. Machine Learning, 66:119–149, 2007. doi: 10.1007/
s10994-007-0716-7. URL http://www.cwi.nl/~pdg/ftp/inconsistency.pdf.
Peter Guttorp. Stochastic Modeling of Scientific Data. Chapman and Hall, London, 1995.
Joseph Y. Halpern. Cox’s theorem revisited. Journal of Artificial Intelligence Research, 11:
Mark S. Handcock. Assessing degeneracy in statistical models of social networks. Tech-
nical Report 39, Center for Statistics and the Social Sciences, University of Washing-
ton, 2003. URL http://csde.washington.edu/statnet/www.csss.washington.edu/
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Carl G. Hempel. Aspects of Scientific Explanation. The Free Press, Glencoe, Illinois, 1965.
John H. Holland, Keith J. Holyoak, Richard E. Nisbett, and Paul R. Thagard. Induction:
Processes of Inference, Learning, and Discovery. MIT Press, Cambridge, Massachusetts,
Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open
Court, La Salle, Illinois, 1989.
David R. Hunter, Steven M. Goodreau, and Mark S. Handcock. Goodness of fit of social
network models. Journal of the American Statistical Association, 103:248–258, 2008. doi:
10.1198/016214507000000446. URL http://www.csss.washington.edu/Papers/wp47.
E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press,
Cambridge, England, 2003.
Robert E. Kass and Adrian E. Raftery. Bayes factors. Journal of the American Statis-
tical Association, 90:773–795, 1995. URL http://www.stat.cmu.edu/~kass/papers/
Robert E. Kass and Paul W. Vos. Geometrical Foundations of Asymptotic Inference. Wiley,
New York, 1997.
Robert E. Kass and Larry Wasserman. The selection of prior distributions by formal rules.
Journal of the American Statistical Association, 91:1343–1370, 1996. URL http://www.
Kevin T. Kelly. The Logic of Reliable Inquiry. Oxford University Press, Oxford, 1996.
Kevin T. Kelly. Simplicity, truth, and probability. In Prasanta Bandyopadhyay and Malcolm
Forster, editors, Handbook on the Philosophy of Statistics. Elsevier, Dordrecht, 2010. URL
Philip Kitcher. The Advancement of Science: Science without Legend, Objectivity without
Illusions. Oxford University Press, Oxford, 1993.
B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian
statistics. Annals of Statistics, 34:837–877, 2006. URL http://arxiv.org/math.ST/
Leszek Kolakowski. The Alienation of Reason: A History of Positivist Thought. Doubleday,
Garden City, New York, 1968. Translated by Norbert Guterman from the Polish Filozofia
Pozytywistyczna (od Hume ’a do Kola Wiedenskiego), Panstvove Wydawinctwo Naukowe,
Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active
learning. In Gerald Tesauro, David Tourtetsky, and Todd Leen, editors, Advances in
Neural Information Processing 7 [NIPS 1994], pages 231–238, Cambridge, Massachusetts,
1995. MIT Press. URL http://books.nips.cc/papers/files/nips07/0231.pdf.
Thomas S. Kuhn. The Copernican Revolution: Planetary Astronomy in the Development
of Western Thought. Harvard University Press, Cambridge, Massachusetts, 1957.
Thomas S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press,
Chicago, second edition, 1970.
Imre Lakatos. Philosophical Papers. Cambridge University Press, Cambridge, England,
Larry Laudan. Beyond Positivism and Relativism: Theory, Method and Evidence. Westview
Press, Boulder, Colorado, 1996.
Larry Laudan. Science and Hypothesis. D. Reidel, Dodrecht, 1981.
Qi Li and Jeffrey Scott Racine. Nonparametric Econometrics: Theory and Practice. Prince-
ton University Press, Princeton, New Jersey, 2007.
Antonio Lijoi, Igor Pr¨ unster, and Stephen G. Walker. Bayesian consistency for stationary
models. Econometric Theory, 23:749–759, 2007. doi: 10.1017/S0266466607070314.
Bruce Lindsay and Liawei Liu. Model assessment tools for a model false world. Statistical
Science, 24:303–318, 2009. URL http://projecteuclid.org/euclid.ss/1270041257.
Charles F. Manski. Identification for Prediction and Decision. Harvard University Press,
Cambridge, Massachusetts, 2007.
Deborah G. Mayo. Error and the Growth of Experimental Knowledge. University of Chicago
Press, Chicago, 1996.
Deborah G. Mayo and D. R. Cox. Frequentist statistics as a theory of inductive inference.
In Javier Rojo, editor, Optimality: The Second Erich L. Lehmann Symposium, pages
77–97, Bethesda, Maryland, 2006. Institute of Mathematical Statistics. URL http://
Deborah G. Mayo and Aris Spanos. Methodology in practice: Statistical misspecification
testing. Philosophy of Science, 71:1007–1025, 2004. URL http://www.error06.econ.
Deborah G. Mayo and Aris Spanos. Severe testing as a basic concept in a Neyman-Pearson
philosophy of induction. The British Journal for the Philosophy of Science, 57:323–357,
2006. doi: 10.1093/bjps/axl003.
David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999.
Nolan McCarty, Keith T. Poole, and Howard Rosenthal. Polarized America: The Dance
of Ideology and Unequal Riches. Walras-Pareto Lectures. MIT Press, Cambridge, Massachusetts, 2006.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations
of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
Ulrich K. M¨ uller. Risk of Bayesian inference in misspecified models, and the sandwich covari-
ance matrix. Electronic pre-print, 2010. URL http://www.princeton.edu/~umueller/
Mark E. J. Newman and G. T. Barkema. Monte Carlo Methods in Statistical Physics.
Clarendon Press, Oxford, 1999.
John D. Norton. A material theory of induction. Philosophy of Science, 70:647–670, 2003.
Scott E. Page. The Difference: How the Power of Diversity Creates Better Groups, Firms,
Schools, and Societies. Princeton University Press, Princeton, New Jersey, 2007.
Liam Paninski. Asymptotic theory of information-theoretic experimental design. Neu-
ral Computation, 17:1480–1507, 2005. URL http://www.stat.columbia.edu/~liam/
Karl R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1934/1959. Trans-
lated by the author from Logik der Forschung (Vienna: Julius Springer Verlag).
Karl R. Popper. The Open Society and Its Enemies. Routledge, London, 1945.
Willard Van Orman Quine. From a Logical Point of View: Logico-Philosophical Essays.
Harvard University Press, Cambridge, Mass., second edition, 1961. First edition, 1953.
Adrian E. Raftery. Bayesian model selection in social research. Sociological Methodology,
25:111–196, 1995. URL http://www.stat.washington.edu/raftery/Research/PDF/
Brian D. Ripley. Statistical Inference for Spatial Processes. Cambridge University Press,
Cambridge, England, 1988.
Douglas Rivers and Quang H. Vuong. Model selection tests for nonlinear dynamic models.
The Econometrics Journal, 5:1–39, 2002.
Donald B. Rubin. Bayesianly justifiable and relevant frequency calculations for the applied
statistician. Annals of Statistics, 12:1151–1172, 1984. URL http://projecteuclid.
Bertrand Russell. Human Knowledge: Its Scope and Limits. Simon and Schuster, New York, 1948.
Wesley C. Salmon. The appraisal of theories: Kuhn meets Bayes. PSA: Proceedings of
the Biennial Meeting of the Philosophy of Science Association, 1990:325–332, 1990. URL
Leonard J. Savage. The Foundations of Statistics. Wiley, New York, 1954.
Mark J. Schervish. Theory of Statistics. Springer-Verlag, Berlin, 1995.
Teddy Seidenfeld. Entropy and uncertainty. In I. B. MacNeill and G. J. Umphrey, editors,
Foundations of Statistical Inference, pages 259–287, Dordrecht, 1987. D. Reidel.
Teddy Seidenfeld. Why I am not an objective Bayesian: Some reflections prompted by
Rosenkrantz. Theory and Decision, 11:413–440, 1979. URL http://www.hss.
Cosma Rohilla Shalizi. Dynamics of Bayesian updating with dependent data and misspeci-
fied models. Electronic Journal of Statistics, 3:1039–1074, 2009. doi: 10.1214/09-EJS485.
Tom A. B. Snijders, Philippa E. Pattison, Garry L. Robins, and Mark S. Handcock. New
specifications for exponential random graph models. Sociological Methodology, 36:99–153,
2006. doi: 10.1111/j.1467-9531.2006.00176.x. URL http://www.csss.washington.edu/
Aris Spanos. Curve fitting, the reliability of inductive inference, and the error-statistical
approach. Philosophy of Science, 74:1046–1066, 2007. doi: 10.1086/525643.
David C. Stove. The Rationality of Induction. Clarendon Press, Oxford, 1986.
David C. Stove. Popper and After: Four Modern Irrationalists. Pergamon Press, Oxford,
Charles Tilly. Explaining Social Processes. Paradigm Publishers, Boulder, Colorado, 2008.
Charles Tilly. Observations of social processes and their formal representations. Sociological
Theory, 22:595–602, 2004.
Stephen Toulmin. Human Understanding: The Collective Use and Evolution of Concepts.
Princeton University Press, Princeton, New Jersey, 1972.
Jos Uffink. The constraint rule of the maximum entropy principle. Studies in History
and Philosophy of Modern Physics, 27:47–79, 1996.
Jos Uffink. Can the maximum entropy principle be explained as a consistency requirement?
Studies in History and Philosophy of Modern Physics, 26B:223–261, 1995. URL http:
M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks.
Springer-Verlag, Berlin, second edition, 2003.
Quang H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses.
Econometrica, 57:307–333, 1989. URL http://www.jstor.org/pss/1912557.
Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied
Mathematics, Philadelphia, 1990.
Larry Wasserman. Frequentist Bayes is objective. Bayesian Analysis, 1:451–456, 2006. URL
Steven Weinberg. What is quantum field theory, and what did we think it was? In
Tian Yu Cao, editor, Conceptual Foundations of Quantum Field Theory, pages 241–251,
Cambridge, England, 1999. Cambridge University Press. URL http://arxiv.org/abs/
Halbert White. Estimation, Inference and Specification Analysis. Cambridge University
Press, Cambridge, England, 1994.
John Ziman. Real Science: What It Is, and What It Means. Cambridge University Press,
Cambridge, England, 2000.