
Commentary

Replicability, Confidence,

and Priors

Peter R. Killeen

Arizona State University

All commentaries concern priors. In this issue of Psychological

Science, Cumming graphically demonstrates the implications

of our ignorance of δ. Doros and Geier found mistakes

in my argument and provide the Bayesian account. Macdonald

notes that my program is like Fisher’s, Fisher’s is like the

Bayesians’, and the Bayesians’ is incoherent. These Commen-

taries strengthen the foundation while leaving all conclusions

intact.

REPLICATING prep

Cumming reminds us that prep is an estimate of the probability that a replication with the same power will support the original finding—that it will give an effect of the same sign. The histogram of probabilities of replication (PRs) at the bottom of his Figure 1 is therefore reassuring: All but 6 of the 139 cases have PRs greater than .5: More than 95% of the cases therefore support the original finding. Indeed, because the distribution of PR is negatively skewed, we can generally expect the typical (median) replicability to be better than claimed, as was the case in Cumming's example. In that sense, prep is a conservative estimate of replicability.

Cumming's real concern is not that a few replications may be victims of sampling error, but that the original experiment might have been a victim. Again, there is consolation to be found in his histogram: By current standards (corresponding to prep = .9), for none of his 139 cases did Δ go far enough in the wrong direction to have supported a decision to publish an unreplicable finding (i.e., in no case was PR < .1). Define strong evidence as a prep greater than ps. If we set ps to a relatively liberal .8, the probability that replication of an experiment that provided strong evidence in the first place will provide strong contradictory evidence (the replication's own prep is greater than ps, but the effect is in the wrong direction) is less than .05.¹ Given Cumming's original prep of .89, approximately 3 of Cumming's cases should have strongly contradicted the original; 2 did so.
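These probabilities can be computed directly from the formulas in Footnote 1. A minimal sketch in Python (the function names mirror the spreadsheet functions NORMSDIST and NORMSINV; the bisection inverse is an implementation convenience, not part of the original argument):

```python
from math import erf, sqrt

def normsdist(z):
    # standard normal cumulative distribution function, Phi(z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normsinv(p):
    # inverse of the standard normal CDF, by bisection
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if normsdist(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def p_strong_contradiction(p_s, p_rep):
    # probability that a replicate yields strong evidence (its own
    # prep > ps) in the WRONG direction, per Footnote 1
    return 1.0 - normsdist(normsinv(p_s) + normsinv(p_rep))

def p_strong_support(p_s, p_rep):
    # probability that a replicate yields strong evidence in the
    # SAME direction as the original
    return 1.0 - normsdist(normsinv(p_s) - normsinv(p_rep))
```

With ps = .8 and prep = .89, the contradiction probability is about .02, consistent with the "less than .05" figure; multiplied by 139 cases it yields roughly 3 expected strong contradictions.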

Neither prep nor any other statistic can overcome the probabilistic nature of the relation between evidence and inference. There is no surety, but only the relative safety of numbers, good experimental design, and empirical replication.

I was edified by Cumming's explanation of replication intervals in terms of confidence intervals (CIs). Yet, although everyone agrees on the importance of reporting some measure of effect size, CIs are less than ideal: First, most researchers do not understand what a CI means (Cumming, Williams, & Fidler, 2004, p. 299). The problem is not confined to psychologists: "A confidence interval is an assertion that an unknown parameter lies in a computed range, with a specified probability [sic]" (Rinaman, Heil, Strauss, Mascagni, & Sousa, 1996, p. 608). Such misunderstanding may be part of the reason why "of the 15 measurements of the Astronomical Unit that [Youden, 1972] presented, not a single one fell within the range of the possible values given by its immediate predecessor" (Stigler, 1996, p. 780)—or at least may be a reason for the bemusement that attends such observations. Second, as Fidler, Thomason, Cumming, Finch, and Leeman (2005) noted, "what to construct CIs around—and how to display them—remain issues for debate" (p. 495). Third, CIs are an impure measure of effect size, because they invoke a sampling distribution to set the relation between level and interval (May, 2003), and that is an easily avoided source of error: Just use d or r.

If there is "still much to learn about confidence intervals" (Fidler et al., 2005), there is fortunately much less to learn about replication intervals: Calculate the standard error, center it over the statistic, and the long-run probability of a replication falling within those limits (Cumming's average probability of capture, or APC) is approximately 50%. Perhaps it is time to start explaining the complicated in terms of the simple.

Address correspondence to Peter Killeen, Department of Psychology, Box 1104, McAllister St., Arizona State University, Tempe, AZ 85287-1104; e-mail: killeen@asu.edu.

¹The probability of a replicate speaking strongly against an original, where strong means the replicate has a prep of its own of ps, is 1 − NORMSDIST(NORMSINV(ps) + NORMSINV(prep)), where NORMSDIST is the standardized normal distribution, and NORMSINV is its inverse. The probability of a replication speaking strongly for an original is 1 − NORMSDIST(NORMSINV(ps) − NORMSINV(prep)). If we would call the top quartile of preps supportive, the bottom quartile contradictory, and the rest ambiguous, then set ps equal to .75.

PSYCHOLOGICAL SCIENCE, Volume 16—Number 12, 1009. Copyright © 2005 American Psychological Society.

Cumming’s table, figures, and Web site should help readers

to understand this alternative to null-hypothesis significance

testing, as his insightful and encouraging comments helped me

to understand it in the first place.
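The roughly 50% capture rate of a ±1 SE replication interval is easy to verify by simulation. A sketch with arbitrary illustrative values (the difference of two estimates sharing standard error SE is distributed N(0, 2·SE²), so the exact probability is 2Φ(1/√2) − 1 ≈ .52):

```python
import random

random.seed(1)

def capture_rate(n_sims=200_000, delta=0.5, se=0.1):
    """Fraction of simulated replicates falling inside the original
    estimate +/- 1 standard error (a replication interval)."""
    hits = 0
    for _ in range(n_sims):
        original = random.gauss(delta, se)   # original study's estimate
        replicate = random.gauss(delta, se)  # replicate's estimate
        if abs(replicate - original) < se:   # captured by the interval?
            hits += 1
    return hits / n_sims
```

Running `capture_rate()` returns a value near .52, close to the approximately 50% figure above regardless of the particular δ and SE chosen.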

ERROR AND CORRECTION

I arrived at prep by conditioning on the unknown δ and integrating it out, assuming flat priors. This is also how Cumming simulated his PRs. I recognized this to be tantamount to a convolution and took the variables I was differencing to be the sampling errors of the original and replicate. But as Doros and Geier show, my reduction of the argument to $d'_2 = d'_1 - \Delta_1 + \Delta_2$, although correct in any particular case, does not give the expected value of $d'_2$. Their fourth proposal (B2) provides the Bayesian route to my result. Treat $\sigma^2_\delta$ as a prior and divide the numerator and denominator of their equation leading to Equation 4 by $\sigma^2_\delta$. If knowledge of $\mu_c$ is vague or n is large, then $\sigma^2_\delta \gg \sigma^2_d$, whereupon their Equation 4 reduces to

$$P(d'_2 > 0 \mid d'_1) = \int_{-\infty}^{\mu_c/\sigma_c} N(0,1),$$

with $\mu_c \to d'_1$ (its maximum likelihood estimate) and $\sigma_c \to \sigma_{d_R} = \sqrt{2}\,\sigma_d$, just as in my original report (Killeen, 2005).

I did not use $\sigma^2_\delta$ as a prior but as the variance of the hyperparameter $\delta_j$ for the reference population of experiments j, and should have subscripted it as $\sigma^2_{\delta_j}$ (see the appendix for errata and further discussion of priors). Then my Equation 7, written as $\sigma_{d_R} = \sqrt{2\left(\sigma^2_{d_i} + \sigma^2_{\delta_j}\right)}$, is correct. As a realization variance, $\sigma^2_{\delta_j}$ represents the divergence of different populations of subjects, measurements, or operations, and approaches zero only for identical replications.
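Under vague priors, Doros and Geier's reduced Equation 4 is simply the standard normal CDF evaluated at $d'_1/(\sqrt{2}\,\sigma_d)$. A sketch in Python (the argument values are illustrative; `erf` supplies the normal CDF without external libraries):

```python
from math import erf, sqrt

def p_rep_from_d1(d1, sigma_d):
    # P(d2' > 0 | d1') = Phi(mu_c / sigma_c), with mu_c -> d1'
    # and sigma_c -> sqrt(2) * sigma_d (flat prior on delta)
    z = d1 / (sqrt(2.0) * sigma_d)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

For example, an original effect 1.645 standard errors from zero yields a prep near .88, and a null original effect yields prep = .50, as it should.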

PRIOR IGNORANCE

Macdonald argues that the distribution of replicate effect sizes may be derived either from Fisher's fiducial arguments or from Bayesian analyses, but that the former are invalid and the latter incoherent. Viable interpretations of Fisher's arguments reduce to a Bayesian model, such as Doros and Geier's, with uniform priors on the location parameter δj. Seidenfeld (1979, p. 131) blamed Fisher's failure on the difficulty in formulating uninformative priors that were invariant over arbitrary transformations of the variables. But such invariance is a useless luxury for scientists. Most of the inferential statistics we use depend on the additivity of random variables, and those remain additive only under linear transformations. If simple reaction times are normally distributed on log(t), then log(t), not t, is the scale on which to express priors. Such measurement constraints,² long dismissed by statisticians (Hand, 2004), demark the boundaries within which Fisher's fiducial probabilities and Bayesian inferences are both valid and coherent.³ Statistics lose their authority to the extent that the variables and their transformations depart from linear comparability; their justification then must be found in their less principled, but often considerable, pragmatic utility.

Priors

Statistics can address three different types of questions (Royall,

1997):

• What should I believe?

• What should I do?

• How should I evaluate this evidence?

The first question requires Bayesian updating of priors to incorporate new data. If the priors are subjective, Bayesian analysis is "a code of consistency for the person applying it, not a system of predictions about the world around him" (Savage, 1972, p. 59, who nonetheless took personal probability as "the only probability concept essential to science," p. 56). If the priors are objective, Bayesian updating is the tool of choice for secondary meta-analysis, and provides the machinery for a cumulative science. Had the astrophysicists cited by Youden (1972) incorporated priors in their final parameter estimates, there would have been less humor and more truth in the title of his article. Scientists wanting to know what to believe about claims—their own or others'—should respect prior information (Field, 2003). After Bayesian updating, prep provides an excellent prognostic.

Neyman and Pearson avoided the Bayesian implications of

the first question by skipping to the second, asserting that a

counsel to action carries no implications for belief (Neyman,

1960, p. 290). But an answer to the second question requires

both efficient use of the data—not possible in their schema—

and a payoff matrix. By providing the first, prep lays the

groundwork of a decision theory for scientific inference.

The standard answer to the third question is that results should be evaluated by classifying them as either significant or nonsignificant. But this approach "is an impoverished, potentially misleading way to describe evidence" (Dixon, 2003, p. 200; J.E. Hunter, 1997). Given the typical case of a composite alternative hypothesis (e.g., "not the null"), prep predicts the probability that replications will provide evidence supporting the original effect. Given well-defined alternative hypotheses, likelihood analysis (Royall, 1997), corrected for bias (Forster & Sober, 2004), estimates the strength of evidence favoring the alternatives. If additional statistical evaluation is wanted, randomization of the constituent log likelihoods will provide empirical sampling distributions from which prep may be inferred. In either case, priors "can obfuscate formal tests by including information not specifically contained within the experiment itself" (Maurer, 2004, p. 17); they flavor the evidence with the idiosyncratic taste of the evaluator. Flat (uninformative) priors provide the level playing field necessary for unbiased evaluation. After evidence passes a filter such as prep, it may be weighted and added to the canon. Belief is best constructed from independently established facts, composed with an eye to their cumulating effect.

²Seidenfeld's (1979) "smoothly invertible canonical pivotal variables" concisely embody the necessary constraints, but he considered those too restricting. The issues are subtle; consult Macdonald's references in this issue of Psychological Science and Seidenfeld's (1979) book. Note, however, that Seidenfeld's paragon estimation of the volume of a cube from weights of capacity and side does not survive dimensional analysis; the weight of his ruler is a volumetric measure and should be added, not cubed.

³Transformation techniques permit nonlinear transforms by appropriately warping one of the scales; for statistical utility, the scale on which the central-limit theorem holds should be treated as privileged.

Supernatural Paradoxes

If we knew that δ = 0, as in Macdonald's example, then the probability of a positive effect in replication would be .50, no matter what prep predicts. But Macdonald assumes supernatural knowledge; prep does not. Individual experiments do not establish parameters; meta-analyses converge on parameters. To know what to believe, enter all relevant information into that inferential engine. To know what research to advise students to undertake, attend to priors. To evaluate experimental results, however, use prep, unflavored. It comes with the proviso of ceteris paribus, and its doubled variance allows for sampling error in the original and the replicate.

Doros and Geier conclude that, because prep can be calculated⁴ from p, it inherits the shortcomings of null-hypothesis significance testing. Wrong. These statistics, although informationally equivalent, are distinguished by the inferences they warrant; prep is a valid posterior predictive probability, p is not. That is precisely why Fisher pursued the fiducial argument, which, absent measurements on interval scales, is unattainable. With linearity, "selection of an 'ignorance' prior can be made without fear of violating the probability calculus" (Seidenfeld, 1979, p. 133).
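The informational equivalence is exactly the relation in Footnote 4: prep = NORMSDIST(NORMSINV(1 − p)/√2). A sketch (a pure-stdlib normal CDF and a bisection inverse stand in for the spreadsheet functions):

```python
from math import erf, sqrt

def normsdist(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normsinv(p):
    # inverse standard normal CDF by bisection; precision is ample here
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if normsdist(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def prep_from_p(p):
    # Footnote 4: prep = NORMSDIST(NORMSINV(1 - p) / SQRT(2))
    return normsdist(normsinv(1.0 - p) / sqrt(2.0))
```

For instance, `prep_from_p(0.05)` is about .88. The mapping is monotone, so p and prep order results identically; only the inference each licenses differs.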

THE REFERENCE SET FOR prep

Much of my discussion thus far is, in the end, irrelevant to most readers of this article. Virtually all psychological data are observational or are drawn from convenience samples, subsets of which are randomly assigned to control or experimental conditions. These standard empirical procedures are incompatible with the normal statistical models, which assume random sampling from a reference set or population (Lunneborg, 2000). Randomization tests emulate our experimental operations (Byrne, 1993), do not depend on priors, do not depend on the form of the populations sampled, and permit fiducial inference (Pitman, 1937). Their logic is straightforward; M.A. Hunter and May (2003) have provided a clear overview and useful references. The p value from such a test gives the proportion of occasions on which the data would have segregated into such disparate groups (or have been so correlated with a predictor) by chance.⁵ The corresponding prep estimates the probability of replication in samples from the same data set (cf. Pitman's w statistic). It also predicts replicability in general, with its accuracy depending on the similarity of the subjects and procedures in the original and replicate. Permutation tests and prep respect what we do and tell us what we need to know. They are the right analytic tools for most of our primary research questions.
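The logic of such a test fits in a few lines. A sketch with made-up data (exact enumeration of all relabelings of the pooled scores; real analyses with larger samples would sample permutations instead):

```python
from itertools import combinations

def permutation_p(group_a, group_b):
    """One-tailed exact permutation test: the proportion of relabelings
    of the pooled scores whose mean difference (a - b) is at least as
    large as the observed difference."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = sum(a) / len(a) - sum(b) / len(b)
        total += 1
        if diff >= observed - 1e-12:  # tolerance for float comparison
            extreme += 1
    return extreme / total
```

For `group_a = [5, 6, 7, 8]` and `group_b = [1, 2, 3, 4]`, every relabeling but the observed one gives a smaller mean difference, so p = 1/70 ≈ .014.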

Acknowledgments—National Science Foundation Grant IBN

0236821 and National Institute of Mental Health Grant

1R01MH066860 supported this work.

REFERENCES

Bernardo, J.M. (in press). Reference analysis. In D. Dey & C.R. Rao

(Eds.), Handbook of statistics (Vol 25). Amsterdam: Elsevier.

Byrne, M.D. (1993). A better tool for the Cognitive Scientist’s toolbox:

Randomization statistics. In W. Kintsch (Ed.), Proceedings of the

Fifteenth Annual Conference of the Cognitive Science Society (pp.

289–293). Mahwah, NJ: Erlbaum. (Available from http://chil.rice.edu/byrne/pubs.html)

Cumming, G. (2005). Understanding the average probability of rep-

lication: Comment on Killeen (2005). Psychological Science, 16,

1002–1004.

Cumming, G., Williams, J., & Fidler, F. (2004). Replication and re-

searchers’ understanding of confidence intervals and standard

error bars. Understanding Statistics, 3, 299–311.

Dixon, P. (2003). The p-value fallacy and how to avoid it. Canadian

Journal of Experimental Psychology, 57, 189–202.

Doros, G., & Geier, A.B. (2005). Probability of replication revisited:

Comment on ‘‘An alternative to null-hypothesis significance

tests.’’ Psychological Science, 16, 1005–1006.

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap.

London: Chapman & Hall.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2005). Still much to learn about confidence intervals. Psychological Science, 16, 494–495.

Field, A.P. (2003). The problems in using fixed-effects models of meta-analysis on real-world data. Understanding Statistics, 2, 105–124.

Forster, M., & Sober, E. (2004). Why likelihood? In M.L. Taper & S.R.

Lele (Eds.), The nature of scientific evidence: Statistical, philo-

sophical, and empirical considerations (pp. 153–190). Chicago:

University of Chicago Press.

Hand, D.J. (2004). Measurement theory and practice. New York: Oxford

University Press.

Hunter, J.E. (1997). Needed: A ban on the significance test. Psycho-

logical Science, 8, 3–7.

Hunter, M.A., & May, R.B. (2003). Statistical testing and null distri-

butions: What to do when samples are not random. Canadian

Journal of Experimental Psychology, 57, 176–188.

Killeen, P.R. (2005). An alternative to null-hypothesis significance

tests. Psychological Science, 16, 345–353.

⁴The calculation is as follows: prep = NORMSDIST(NORMSINV(1 − p)/SQRT(2)).

⁵Permutation tests evaluate any difference in samples. They may be modified to test differences of means (Efron & Tibshirani, 1993).


Lee, M.D., & Wagenmakers, E.-J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.

Lunneborg, C.E. (2000). Data analysis by resampling: Concepts and

applications. Pacific Grove, CA: Brooks/Cole/Duxbury.

Macdonald, R.R. (2005). Why replication probabilities depend on prior probability distributions: A rejoinder to Killeen (2005). Psychological Science, 16, 1007–1008.

Maurer, B.A. (2004). Models of scientific inquiry and statistical practice: Implications for the structure of scientific knowledge. In M.L. Taper & S.R. Lele (Eds.), The nature of scientific evidence: Statistical, philosophical, and empirical considerations (pp. 17–50). Chicago: University of Chicago Press.

May, K. (2003). A note on the use of confidence intervals. Under-

standing Statistics, 2, 133–135.

Neyman, J. (1960). First course in probability and statistics. New York:

Holt, Rinehart and Winston.

O’Hagan, A., & Forster, J. (2004). Kendall’s advanced theory of statis-

tics: Vol. 2B: Bayesian inference (2nd ed.). New York: Oxford

University Press.

Pitman, E.J.G. (1937). Significance tests which may be applied to

samples from any populations. Supplement to the Journal of the

Royal Statistical Society, 4, 119–130.

Rinaman, W.C., Heil, C., Strauss, M.T., Mascagni, M., & Sousa, M.

(1996). Probability and statistics. In D. Zwillinger (Ed.), CRC

standard mathematical tables and formulae (30th ed., pp. 569–

668). Boca Raton, FL: CRC Press.

Royall, R. (1997). Statistical evidence: A likelihood paradigm. London:

Chapman & Hall.

Savage, L.J. (1972). The foundations of statistics (2nd ed.). New York: Dover.

Seidenfeld, T. (1979). Philosophical problems of statistical inference:

Learning from R. A. Fisher. London: D. Reidel.

Stigler, S.M. (1996). Statistics and the question of standards. Journal of Research of the National Institute of Standards and Technology, 101, 779–789.

Youden, W.J. (1972). Enduring values. Technometrics, 14, 1–11.

(RECEIVED 7/6/05; ACCEPTED 8/21/05;

FINAL MATERIALS RECEIVED 8/22/05)

APPENDIX

Errata

S. Sirois (personal communication, May 10, 2005) noticed that the standard error of replication on p. 347 in my original article should have been $\sigma_{d_R} = \sqrt{2}\,\sigma_d$. The second variance under the radical in Equation 7 should have been $\sigma^2_{\delta_j}$. An unembellished δ, as used by Cumming, simplifies notation.

Flat Priors

Bayesians recommend either Jeffreys priors (Lee & Wagenmakers, 2005, have provided an excellent Bayesian tutorial) or reference priors (Bernardo, in press). The Jeffreys prior for the

mean of normally distributed data is uniform. Alas, over an infinite range, that leaves any particular prior equaling an unproductive zero. But this is not a problem if the range is merely huge (e.g., spread with $\sigma^2 \approx 10^{10}$), as the prior's influence will then fall below the measurement error of rational data. "If prior information is genuinely weak relative to the data, the posterior distribution should be robust to any reasonable choice of prior distribution [including improper priors]" (O'Hagan & Forster, 2004, p. 107).

Priors that are flat for δ cannot also be flat for r² (Macdonald, this issue). Ignorance has structure. Reference priors cash out that structure against the models tested. The reference prior $p(\delta, \sigma) = \left(\sigma\sqrt{1 + \delta^2/2}\,\right)^{-1}$ is relatively flat for the effect sizes and variances involved whenever statistical analysis is deemed necessary. For the range of effect sizes that concern psychologists, whether they use Jeffreys priors or reference priors, δ or r², it is all pretty much Kansas.
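That relative flatness is easy to check numerically. A sketch: the δ-dependent factor of the reference prior $p(\delta, \sigma) = (\sigma\sqrt{1 + \delta^2/2})^{-1}$ varies by less than 20% over the range of effect sizes (|δ| ≤ 1) typical in psychology.

```python
from math import sqrt

def reference_prior_delta_factor(delta):
    # the part of p(delta, sigma) that depends on delta;
    # the 1/sigma factor is common to all effect sizes
    return 1.0 / sqrt(1.0 + delta ** 2 / 2.0)
```

The ratio `reference_prior_delta_factor(1.0) / reference_prior_delta_factor(0.0)` is about .82, so over this range the prior is nearly uniform: all pretty much Kansas.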
