Peter R. Killeen
Arizona State University
All commentaries concern priors. In this issue of Psychological
Science, Cumming graphically demonstrates the implications
of our ignorance of δ. Doros and Geier found mistakes
in my argument and provide the Bayesian account. Macdonald
notes that my program is like Fisher’s, Fisher’s is like the
Bayesians’, and the Bayesians’ is incoherent. These Commen-
taries strengthen the foundation while leaving all conclusions
Cumming reminds us that prep is an estimate of the probability
that a replication with the same power will support the original
finding—that it will give an effect of the same sign. The histo-
gram of probabilities of replication (PRs) at the bottom of his
Figure 1 is therefore reassuring: All but 6 of the 139 cases have
PRs greater than .5; more than 95% of the cases therefore
support the original finding. Indeed, because the distribution of
PR is negatively skewed, we can generally expect the typical
(median) replicability to be better than claimed, as was the case
in Cumming’s example. In that sense, prep is a conservative
estimate of replicability.
Cumming’s real concern is not that a few replications may be
histogram: By current standards (corresponding to prep = .9), for
none of his 139 cases did D go far enough in the wrong direction
to have supported a decision to publish an unreplicable finding
(i.e., in no case was PR < .1). Define strong evidence as a prep
greater than ps. If we set ps to a relatively liberal .8, the probability
that replication of an experiment that provided strong
evidence in the first place will provide strong contradictory
evidence (the replication’s own prep is greater than ps, but the
effect is in the wrong direction) is less than .05.¹ Given Cumming’s
original prep of .89, approximately 3 of Cumming’s cases
should have strongly contradicted the original; 2 did so.
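The footnoted calculation behind these figures can be sketched in a few lines of Python. This is a minimal sketch, not the author's code: it assumes that the spreadsheet functions NORMSDIST and NORMSINV correspond to the standard normal cumulative distribution function and its inverse, and the function names p_strong_against and p_strong_for are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: mean 0, standard deviation 1


def p_strong_against(prep, ps):
    """Probability that a replicate strongly contradicts the original.

    Mirrors 1 - NORMSDIST(NORMSINV(ps) + NORMSINV(prep)).
    """
    return 1.0 - _STD.cdf(_STD.inv_cdf(ps) + _STD.inv_cdf(prep))


def p_strong_for(prep, ps):
    """Probability that a replicate strongly supports the original.

    Mirrors 1 - NORMSDIST(NORMSINV(ps) - NORMSINV(prep)).
    """
    return 1.0 - _STD.cdf(_STD.inv_cdf(ps) - _STD.inv_cdf(prep))


if __name__ == "__main__":
    # An original result that only just counts as strong (prep = ps = .8).
    print(p_strong_against(0.8, 0.8))
    # Expected number of strong contradictions among 139 replications
    # when the original prep is .89 and ps is .8.
    print(139 * p_strong_against(0.89, 0.8))
```

With ps = .8, the first value falls below .05, and the expected count for an original prep of .89 comes to a little under 3 of 139 cases, in line with the figures quoted above.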
Neither prep nor any other statistic can overcome the pro-
There is no surety, but only the relative safety of numbers, good
experimental design, and empirical replication.
I was edified by Cumming’s explanation of replication inter-
vals in terms of confidence intervals (CIs). Yet, although ev-
eryone agrees on the importance of reporting some measure of
effect size, CIs are less than ideal: First, most researchers do not
understand what a CI means (Cumming, Williams, & Fidler,
2004, p. 299). The problem is not confined to psychologists: ‘‘A
confidence interval is an assertion that an unknown parameter
lies in a computed range, with a specified probability [sic]’’
(Rinaman, Heil, Strauss, Mascagni, & Sousa, 1996, p. 608).
measurements of the Astronomical Unit that [Youden, 1972]
presented, not a single one fell within the range of the possible
values given by its immediate predecessor’’ (Stigler, 1996, p.
780)—or at least may be a reason for the bemusement that at-
tends such observations. Second, as Fidler, Thomason, Cum-
ming, Finch, and Leeman (2005) noted, ‘‘what to construct CIs
around—and how to display them—remain issues for debate’’
(p. 495). Third, CIs are an impure measure of effect size, be-
cause they invoke a sampling distribution to set the relation
between level and interval (May, 2003), and that is an easily
avoided source of error: Just use d or r.
If there is ‘‘still much to learn about confidence intervals’’
(Fidler et al., 2005), there is fortunately much less to learn
about replication intervals: Calculate the standard error,
center it over the statistic, and the long-run probability of
Address correspondence to Peter Killeen, Department of Psychology,
1104; e-mail: firstname.lastname@example.org.
¹The probability of a replicate speaking strongly against an original, where
strong means the replicate has a prep of its own of ps, is
1 − NORMSDIST(NORMSINV(ps) + NORMSINV(prep)), where NORMSDIST is the
standardized normal distribution, and NORMSINV is its inverse. The probability of a
replication speaking strongly for an original is
1 − NORMSDIST(NORMSINV(ps) − NORMSINV(prep)). If we would call the top quartile
of preps supportive, the bottom quartile contradictory, and the rest ambiguous, then set ps as
equal to .75.
Volume 16—Number 12
Copyright © 2005 American Psychological Society
a replication falling within those limits (Cumming’s average
Cumming’s table, figures, and Web site should help readers
to understand this alternative to null-hypothesis significance
testing, as his insightful and encouraging comments helped me
to understand it in the first place.
ERROR AND CORRECTION
I arrived at prep by conditioning on the unknown δ and integrating
it out, assuming flat priors. This is also how Cumming
simulated his PRs. I recognized this to be tantamount to a
convolution and took the variables I was differencing to be
the sampling errors of the original and replicate. But as
Doros and Geier show, my reduction of the argument to
[...]_1 − Δ_1 + Δ_2, although correct in any particular case,
[...] provides the Bayesian route to my result. Treat σ²_δ as a prior and
divide the numerator and denominator of their equation leading
to Equation 4 by σ²_δ. If knowledge of μ_c is vague or n is large,
[...] δ, whereupon their Equation 4 reduces to
[...], with μ_c [...] d′ [...] likelihood estimate), and σ_c = σ_dR [...] σ_d, just as in my
original report (Killeen, 2005).
I did not use σ²_δ as a prior but as the variance of the hyperparameter
δ_j for the reference population of experiments j, and
[...] further discussion of priors). Then my Equation 7, written
[...], is correct. As a realization variance,
[...] δ_j represents the divergence of different populations of sub-
Macdonald argues that the distribution of replicate effect sizes
may be derived either from Fisher’s fiducial arguments or
from Bayesian analyses, but that the former are invalid and the
latter incoherent. Viable interpretations of Fisher’s arguments
reduce to a Bayesian model, such as Doros and Geier’s, with uniform
priors on the location parameter δ_j. Seidenfeld (1979, p.
131) blamed Fisher’s failure on the difficulty in formulating
uninformative priors that were invariant over arbitrary trans-
formations of the variables. But such invariance is a useless
luxury for scientists. Most of the inferential statistics we use
depend on the additivity of random variables, and those remain
additive only under linear transformations. If simple reaction
times are normally distributed on log(t), then log(t), not t, is
the scale on which to express priors. Such measurement
constraints,² long dismissed by statisticians (Hand, 2004), de-
and Bayesian inferences are both valid and coherent.³ Statistics
lose their authority to the extent that the variables and their
transformations depart from linear comparability; their justifi-
cation then must be found in their less principled, but often
considerable, pragmatic utility.
Statistics can address three different types of questions (Royall,
• What should I believe?
• What should I do?
• How should I evaluate this evidence?
The first question requires Bayesian updating of priors to in-
corporate new data. If the priors are subjective, Bayesian
system of predictions about the world around him’’ (Savage,
1972, p. 59, who nonetheless took personal probability as ‘‘the
only probability concept essential to science,’’ p. 56). If the
priors are objective, Bayesian updating is the tool of choice for
secondary meta-analysis, and provides the machinery for a cu-
mulative science. Had the astrophysicists cited by Youden
(1972) incorporated priors in their final parameter estimates,
there would have been less humor and more truth in the title of
his article. Scientists wanting to know what to believe about
claims—their own or others’—should respect prior information
(Field, 2003). After Bayesian updating, prep provides an excel-
Neyman and Pearson avoided the Bayesian implications of
the first question by skipping to the second, asserting that a
counsel to action carries no implications for belief (Neyman,
1960, p. 290). But an answer to the second question requires
both efficient use of the data—not possible in their schema—
and a payoff matrix. By providing the first, prep lays the
groundwork of a decision theory for scientific inference.
The standard answer to the third question is that results
should be evaluated by classifying them as either significant or
nonsignificant. But this approach ‘‘is an impoverished, poten-
tially misleading way to describe evidence’’ (Dixon, 2003, p.
200; J.E. Hunter, 1997). Given the typical case of a composite
alternative hypothesis (e.g., ‘‘not the null’’), prep predicts the
probability that replications will provide evidence supporting
²Seidenfeld’s (1979) ‘‘smoothly invertible canonical pivotal variables’’ con-
The issues are subtle; consult Macdonald’s references in this issue of Psycho-
logical Science and Seidenfeld’s (1979) book. Note, however, that Seidenfeld’s
not survive dimensional analysis; the weight of his ruler is a volumetric measure
and should be added, not cubed.
³Transformation techniques permit nonlinear transforms by appropriately
warping one of the scales; for statistical utility, the scale on which the central-
limit theorem holds should be treated as privileged.
the original effect. Given well-defined alternative hypotheses,
likelihood analysis (Royall, 1997), corrected for bias (Forster &
Sober, 2004), estimates the strength of evidence favoring the al-
ternatives. If additional statistical evaluation is wanted, random-
ization of the constituent log likelihoods will provide empirical
sampling distributions from which prep may be inferred. In either
case, priors ‘‘can obfuscate formal tests by including information
not specifically contained within the experiment itself’’ (Maurer,
field necessary for unbiased evaluation. After evidence passes a
filter such as prep, it may be weighted and added to the canon.
Belief is best constructed from independently established facts,
composed with an eye to their cumulating effect.
If we knew that δ = 0, as in Macdonald’s example, then the
probability of a positive effect in replication would be .50, no
matter what prep predicts. But Macdonald assumes supernatural
knowledge; prep does not. Individual experiments do not establish
parameters; meta-analyses converge on parameters. To
know what to believe, enter all relevant information into that
inferential engine. To know what research to advise students to
undertake, attend to priors. To evaluate experimental results,
however, use prep, unflavored. It comes with the proviso of ceteris
original and the replicate.
from p, it inherits the shortcomings of null-hypothesis signifi-
cance testing. Wrong. These statistics, although informationally
is a valid posterior predictive probability, p is not. That is pre-
cisely why Fisher pursued the fiducial argument, which, absent
measurements on interval scales, is unattainable. With linearity,
‘‘selection of an ‘ignorance’ prior can be made without fear of
violating the probability calculus’’ (Seidenfeld, 1979, p. 133).
THE REFERENCE SET FOR prep
Much of my discussion thus far is, in the end, irrelevant to most
readers of this article. Virtually all psychological data are ob-
servational or are drawn from convenience samples, subsets of
which are randomly assigned to control or experimental condi-
tions. These standard empirical procedures are incompatible
with the normal statistical models, which assume random sam-
pling from a reference set or population (Lunneborg, 2000).
Randomization tests emulate our experimental operations
(Byrne, 1993), do not depend on priors, do not depend on the
form of the populations sampled, and permit fiducial inference
(Pitman, 1937). Their logic is straightforward; M.A. Hunter and
May (2003) have provided a clear overview and useful refer-
ences. The p value from such a test gives the proportion of oc-
casions on which the data would have segregated into such
chance.⁵ The corresponding prep estimates the probability of
replication in samples from the same data set (cf. Pitman’s w
statistic). It also predicts replicability in general, with its ac-
curacy depending on the similarity of the subjects and proce-
dures in the original and replicate. Permutation tests and prep
respect what we do and tell us what we need to know. They are
the right analytic tools for most of our primary research ques-
Acknowledgments—National Science Foundation Grant IBN
0236821 and National Institute of Mental Health Grant
1R01MH066860 supported this work.
Bernardo, J.M. (in press). Reference analysis. In D. Dey & C.R. Rao
(Eds.), Handbook of statistics (Vol 25). Amsterdam: Elsevier.
Byrne, M.D. (1993). A better tool for the Cognitive Scientist’s toolbox:
Randomization statistics. In W. Kintsch (Ed.), Proceedings of the
Fifteenth Annual Conference of the Cognitive Science Society (pp.
289–293). Mahwah, NJ: Erlbaum (Available from http://chil.
Cumming, G. (2005). Understanding the average probability of rep-
lication: Comment on Killeen (2005). Psychological Science, 16,
Cumming, G., Williams, J., & Fidler, F. (2004). Replication and re-
searchers’ understanding of confidence intervals and standard
error bars. Understanding Statistics, 3, 299–311.
Dixon, P. (2003). The p-value fallacy and how to avoid it. Canadian
Journal of Experimental Psychology, 57, 189–202.
Doros, G., & Geier, A.B. (2005). Probability of replication revisited:
Comment on ‘‘An alternative to null-hypothesis significance
tests.’’ Psychological Science, 16, 1005–1006.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap.
London: Chapman & Hall.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2005).
Still much to learn about confidence intervals. Psychological
Science, 16, 494–495.
Field, A.P. (2003). The problems in using fixed-effects models of meta-
analysis on real-world data. Understanding Statistics, 2, 105–124.
Forster, M., & Sober, E. (2004). Why likelihood? In M.L. Taper & S.R.
Lele (Eds.), The nature of scientific evidence: Statistical, philo-
sophical, and empirical considerations (pp. 153–190). Chicago:
University of Chicago Press.
Hand, D.J. (2004). Measurement theory and practice. New York: Oxford
Hunter, J.E. (1997). Needed: A ban on the significance test. Psycho-
logical Science, 8, 3–7.
Hunter, M.A., & May, R.B. (2003). Statistical testing and null distri-
butions: What to do when samples are not random. Canadian
Journal of Experimental Psychology, 57, 176–188.
Killeen, P.R. (2005). An alternative to null-hypothesis significance
tests. Psychological Science, 16, 345–353.
⁴The calculation is as follows: prep = NORMSDIST((NORMSINV(1 − p))/
⁵Permutation tests evaluate any difference in samples. They may be modified
to test differences of means (Efron & Tibshirani, 1993).
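The randomization logic described in the section on the reference set, together with the conversion from p to prep, can be illustrated with a short Python sketch. It rests on several assumptions that are not in the text: the test statistic is the difference of group means, the p value is one-tailed and estimated from random reshuffles rather than full enumeration, and the truncated conversion in footnote 4 is taken to end in division by the square root of 2, as in Killeen (2005). The names permutation_p and prep_from_p are hypothetical.

```python
import random
from statistics import NormalDist


def permutation_p(treatment, control, n_shuffles=10_000, seed=0):
    """One-tailed randomization p value for a difference of means."""
    rng = random.Random(seed)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)

    def mean_diff(values):
        first, second = values[:n_t], values[n_t:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign scores to groups at random
        # Small tolerance so reorderings of the same split still count.
        if mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Counting the observed arrangement itself keeps p strictly above 0.
    return (hits + 1) / (n_shuffles + 1)


def prep_from_p(p):
    """Assumed conversion: prep = Phi(Phi^-1(1 - p) / sqrt(2))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    std = NormalDist()
    return std.cdf(std.inv_cdf(1.0 - p) / 2 ** 0.5)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    experimental = [5.1, 6.0, 5.8, 6.4, 5.9, 6.2]
    control = [4.0, 4.4, 3.9, 4.6, 4.1, 4.3]
    p = permutation_p(experimental, control)
    print(p, prep_from_p(p))
```

Because the p value is estimated by reshuffling the observed scores, it makes no appeal to priors or to the form of the sampled populations. The conversion is undefined at p = 1, so prep_from_p rejects that value rather than returning a number.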
Lee, M.D., & Wagenmakers, E.-J. (2005). Bayesian statistical inference
in psychology: Comment on Trafimow (2003). Psychological Re-
view, 112, 662–668.
Lunneborg, C.E. (2000). Data analysis by resampling: Concepts and
applications. Pacific Grove, CA: Brooks/Cole/Duxbury.
probability distributions: A rejoinder to Killeen (2005). Psycho-
logical Science, 16, 1007–1008.
Maurer, B.A. (2004). Models of scientific inquiry and statistical prac-
Taper & S.R. Lele (Eds.), The nature of scientific evidence: Sta-
tistical, philosophical, and empirical considerations (pp. 17–50).
Chicago: University of Chicago Press.
May, K. (2003). A note on the use of confidence intervals. Under-
standing Statistics, 2, 133–135.
Neyman, J. (1960). First course in probability and statistics. New York:
Holt, Rinehart and Winston.
O’Hagan, A., & Forster, J. (2004). Kendall’s advanced theory of statis-
tics: Vol. 2B: Bayesian inference (2nd ed.). New York: Oxford
Pitman, E.J.G. (1937). Significance tests which may be applied to
samples from any populations. Supplement to the Journal of the
Royal Statistical Society, 4, 119–130.
Rinaman, W.C., Heil, C., Strauss, M.T., Mascagni, M., & Sousa, M.
(1996). Probability and statistics. In D. Zwillinger (Ed.), CRC
standard mathematical tables and formulae (30th ed., pp. 569–
668). Boca Raton, FL: CRC Press.
Royall, R. (1997). Statistical evidence: A likelihood paradigm. London:
Chapman & Hall.
Seidenfeld, T. (1979). Philosophical problems of statistical inference:
Learning from R. A. Fisher. London: D. Reidel.
Stigler, S.M. (1996). Statistics and the question of standards. Journal of
Research of the National Institute of Standards and Technology,
Youden, W.J. (1972). Enduring values. Technometrics, 14, 1–11.
(RECEIVED 7/6/05; ACCEPTED 8/21/05;
FINAL MATERIALS RECEIVED 8/22/05)
S. Sirois (personal communication, May 10, 2005) noticed that
the standard error of replication on p. 347 in my original article
should have been σ_dR = [...] σ_d. The second variance under the
radical in Equation 7 should have been σ²_δj. An unembellished
d, as used by Cumming, simplifies notation.
Bayesians recommend either Jeffreys priors (Lee & Wagenmakers,
2005, have provided an excellent Bayesian tutorial) or
reference priors (Bernardo, in press). The Jeffreys prior for the
mean of normally distributed data is uniform. Alas, over an infinite
range, that leaves any particular prior equaling an unproductive
zero. But this is not a problem if the range is merely
huge (e.g., spread with σ² ≈ 10^10), as the prior’s influence will
then fall below the measurement error of rational data. ‘‘If prior
information is genuinely weak relative to the data, the posterior
distribution should be robust to any reasonable choice of prior
distribution [including improper priors]’’ (O’Hagan & Forster,
2004, p. 107).
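The robustness claim can be checked numerically with the standard conjugate update for a normal mean. The sketch below is an illustration under stated assumptions, not a reproduction of Doros and Geier’s equations: the observed effect d is treated as normally distributed around the unknown effect with a known standard error se, the prior on that effect is normal, and the function names normal_posterior and prob_positive_replicate are hypothetical.

```python
from statistics import NormalDist


def normal_posterior(d, se, prior_mean=0.0, prior_var=1e10):
    """Posterior mean and variance of the effect after observing d."""
    precision_prior = 1.0 / prior_var
    precision_data = 1.0 / se ** 2
    post_var = 1.0 / (precision_prior + precision_data)
    post_mean = post_var * (precision_prior * prior_mean + precision_data * d)
    return post_mean, post_var


def prob_positive_replicate(d, se, prior_mean=0.0, prior_var=1e10):
    """Predictive probability that a same-sized replicate is positive.

    The predictive variance adds the replicate's own sampling variance
    to the posterior variance of the effect.
    """
    post_mean, post_var = normal_posterior(d, se, prior_mean, prior_var)
    pred_sd = (post_var + se ** 2) ** 0.5
    return NormalDist().cdf(post_mean / pred_sd)


if __name__ == "__main__":
    d, se = 0.5, 0.2  # made-up values, for illustration only
    print(normal_posterior(d, se))                       # vague prior
    print(normal_posterior(d, se, prior_var=se ** 2))    # informative prior
    print(prob_positive_replicate(d, se))
```

With a prior variance of 10^10 the posterior is centered on d with variance se², so the predictive standard deviation is the square root of 2 times se, which is the flat-prior case. An informative prior of the same variance as the data instead pulls the posterior mean halfway toward the prior mean.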
Priors that are flat for δ cannot also be flat for r² (Macdonald,
this issue). Ignorance has structure. Reference priors cash out
that structure against the models tested. The reference prior
[...] = (σ√(1 + δ²/2))⁻¹ is relatively flat for the effect
sizes and variances involved whenever statistical analysis is
deemed necessary. For the range of effect sizes that concern
psychologists, whether they use Jeffreys priors or reference
priors, d or r², it is all pretty much Kansas.