Null hypothesis significance testing interpreted and calibrated by estimating probabilities of sign errors: A Bayes-frequentist continuum theory
December 10, 2019
David R. Bickel
Ottawa Institute of Systems Biology
Department of Biochemistry, Microbiology and Immunology
Department of Mathematics and Statistics
University of Ottawa
451 Smyth Road
Ottawa, Ontario, K1H 8M5
+01 (613) 562-5800, ext. 8670
Abstract

Concepts from multiple testing can improve tests of single hypotheses. The proposed definition of the calibrated p value is an estimate of the local false sign rate, the posterior probability that the direction of the estimated effect is incorrect. Interpreting one-sided p values as estimates of conditional posterior probabilities, that calibrated p value is (1 − LFDR) p/2 + LFDR, where p is a two-sided p value and LFDR is an estimate of the local false discovery rate, the posterior probability that a point null hypothesis is true given p. A simple option for LFDR is the posterior probability derived from estimating the Bayes factor to be its e p ln(1/p) lower bound.

The calibration provides a continuum between significance testing and traditional Bayesian testing. The former effectively assumes the prior probability of the null hypothesis is 0, as some statisticians argue is the case. Then the calibrated p value is equal to p/2, a one-sided p value, since LFDR = 0. In traditional Bayesian testing, the prior probability of the null hypothesis is at least 50%, which usually results in LFDR ≫ p. At that end of the continuum, the calibrated p value is close to LFDR.

Keywords: calibrated effect size estimation; p value; directional error; dividing null hypothesis; replication crisis; reproducibility crisis; sign error; Type III error
1 Introduction
Meta-analyses of large numbers of previous studies from biomedicine and neuroscience have raised concerns that many published results cannot be replicated (Ioannidis, 2005; Nieuwenhuis et al., 2011; Button et al., 2013), contributing to the perceived replication crisis in many scientific fields (Begley and Ioannidis, 2015), especially psychology (Open Science Collaboration, 2015; Hughes, 2018). The statistics community has responded with guidelines on hypothesis testing and recommendations to emphasize effect sizes (e.g., Wasserstein and Lazar, 2016). However, conflicting proposals among statisticians on how to improve statistical data analysis (e.g., Wasserstein et al., 2019, and references) cause confusion among non-statisticians (Schachtman, 2019; Mayo, 2019), leaving statistical consultants with the responsibility of sifting through the arguments to provide their collaborators with practical solutions.
For example, many Bayesians propose to address criticisms of null hypothesis significance testing by transforming the p value to a lower bound on the posterior probability that the null hypothesis is true; see Held and Ott (2018) and its references.
Example 1. Assuming the two-sided p value is not large ($p \leq 1/e$) when testing the null hypothesis $H_0\colon \theta = \theta_{H_0}$, Sellke et al. (2001) and Benjamin and Berger (2019) recommend

\[
\bar{B} = e\, p \ln(1/p) \qquad (1)
\]

as a lower bound on the Bayes factor $B = \Pr(P = p \mid \theta = \theta_{H_0}) / \Pr(P = p \mid \theta \neq \theta_{H_0})$, where $\theta$ is the unknown value of the parameter of interest, $\theta_{H_0}$ is the fixed parameter value of the null hypothesis, and $P$ is the random variable representing the p value before it is observed to be equal to the number $p$. Since the posterior probability is

\[
\Pr(\theta = \theta_{H_0} \mid P = p) = \frac{\Pr(\theta = \theta_{H_0}) \Pr(P = p \mid \theta = \theta_{H_0})}{\Pr(P = p)} = \left(1 + \frac{\Pr(\theta \neq \theta_{H_0})}{\Pr(\theta = \theta_{H_0})}\, B^{-1}\right)^{-1} \qquad (2)
\]

according to Bayes's theorem, it has a lower bound of

\[
v = \left(1 + \frac{\Pr(\theta \neq \theta_{H_0})}{\Pr(\theta = \theta_{H_0})}\, \bar{B}^{-1}\right)^{-1}, \qquad (3)
\]

called the v value because a quantity approximated by $\bar{B}$ appears in Vovk (1993, §9).
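As a numerical sketch of the v value, the Sellke et al. (2001) bound can be combined with a prior probability of the null hypothesis in a few lines. The function name `v_value` and its interface are illustrative, not the paper's:

```python
import math

def v_value(p, prior_null):
    """Lower bound on Pr(theta = theta_H0 | P = p).

    Plugs the Sellke et al. (2001) lower bound e * p * ln(1/p) on the
    Bayes factor B into the posterior-probability formula, so the result
    is a lower bound on the posterior probability of the null hypothesis.
    Valid for two-sided p values with 0 < p <= 1/e.
    """
    if not 0.0 < p <= 1.0 / math.e:
        raise ValueError("the bound requires 0 < p <= 1/e")
    bayes_factor_bound = math.e * p * math.log(1.0 / p)
    # Prior odds against the null: Pr(theta != theta_H0) / Pr(theta = theta_H0).
    prior_odds_alt = (1.0 - prior_null) / prior_null
    return 1.0 / (1.0 + prior_odds_alt / bayes_factor_bound)
```

For p = 0.05 and a 50% prior probability of the null, this gives v ≈ 0.29, the familiar result that a marginally significant p value can leave a substantial posterior probability on the null hypothesis.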
Since $\Pr(\theta = \theta_{H_0} \mid P = p)$ is typically much larger than $p$ when $\Pr(\theta = \theta_{H_0}) \geq 1/2$, it is often claimed that $p$ "overstates" the strength of the evidence against the null hypothesis (e.g., Goodman, 1999). That conclusion is disputed by Hurlbert and Lombardi (2009), who argue that since prudent scientists tend to believe the null hypotheses they test are false, $\Pr(\theta = \theta_{H_0})$ should be much smaller than 1/2, perhaps 1/10 or 1/100.
In fact, Bernardo (2011), McShane et al. (2019), and others argue that since systematic errors prevent $\theta = \theta_{H_0}$ from ever being exactly true, 0 is the only reasonable value for $\Pr(\theta = \theta_{H_0})$; cf. van den Bergh et al. (2019). In that case, $\Pr(\theta = \theta_{H_0} \mid P = p) = 0$, which would make traditional Bayesian hypothesis testing useless. Frequentist hypothesis testing, on the other hand, could still serve to determine whether the sample is large enough to warrant concluding that $\theta > \theta_{H_0}$ or that $\theta < \theta_{H_0}$. In that context, $\theta = \theta_{H_0}$ is called a dividing null hypothesis (Cox, 1977; Bickel, 2011). The idea is that if the p value is low enough, then $\hat{s} = \operatorname{sign}(\hat{\theta} - \theta_{H_0})$ is a reasonable estimate of $s = \operatorname{sign}(\theta - \theta_{H_0})$, where $\hat{\theta}$ is an observed point estimate of $\theta$ and the function $\operatorname{sign}(\bullet)$ has a value of 1 if its argument is positive, −1 if its argument is negative, and 0 otherwise. In that way, testing the null hypothesis that $\theta = \theta_{H_0}$ is used as an indirect method of deciding whether to claim that $s = \hat{s}$.
A more direct way to make that decision would be to claim that $s = \hat{s}$ only if it is sufficiently probable or, equivalently, if the sign error $s \neq \hat{s}$ is sufficiently improbable. The sign error (Stephens, 2016) is also called a "Type III error" (Butler and Jones, 2018) and a "directional error" (Grandhi et al., 2019). The posterior probability of making a sign error given a two-sided p value is

\[
\Pr(s \neq \hat{s} \mid P = p) =
\begin{cases}
\Pr(\theta > \theta_{H_0} \mid P = p) + \Pr(\theta = \theta_{H_0} \mid P = p) & \text{if } \hat{\theta} < \theta_{H_0} \\
\Pr(\theta < \theta_{H_0} \mid P = p) + \Pr(\theta = \theta_{H_0} \mid P = p) & \text{if } \hat{\theta} > \theta_{H_0}.
\end{cases}
\qquad (4)
\]

Under broadly applicable conditions, that is reasonably estimated by

\[
\widehat{\Pr}(s \neq \hat{s} \mid P = p) = (1 - v)\,\frac{p}{2} + v \qquad (5)
\]

whenever $v$, the v value of equation (3), is a reasonable estimate of the $\Pr(\theta = \theta_{H_0} \mid P = p)$ in equation (2). The result is proved for all reasonable estimates of $\Pr(\theta = \theta_{H_0} \mid P = p)$ in Section 2.
The form of equation (5) represents a continuum between null hypothesis significance testing and conventional Bayesian testing. The frequentist practice of considering $\theta = \theta_{H_0}$ to be a dividing null hypothesis (Cox, 1977; Bickel, 2011) is recovered by setting $\Pr(\theta = \theta_{H_0}) = 0$, for in that case $v = 0$ and $\widehat{\Pr}(s \neq \hat{s} \mid P = p) = p/2$, which is a one-sided p value. At the opposite extreme, the traditional Bayesian practice of setting $\Pr(\theta = \theta_{H_0}) \geq 1/2$ often results in a v value that is much greater than the p value, in which case $\widehat{\Pr}(s \neq \hat{s} \mid P = p) \approx v$. Choices of $\Pr(\theta = \theta_{H_0})$ between those frequentist and Bayesian extremes place $\widehat{\Pr}(s \neq \hat{s} \mid P = p)$ within a continuum of values between $p/2$ and 1. For that reason, the easily interpreted estimate $\widehat{\Pr}(s \neq \hat{s} \mid P = p)$ is a natural choice of a calibrated p value, as illustrated by example in Section 3. There, Figure 1 vividly portrays the continuum.
The American Statistical Association's call to emphasize effect size estimation (Wasserstein and Lazar, 2016) does not necessarily warrant reporting conventional effect size estimates without modification (van den Bergh et al., 2019). In particular, a large effect size estimate can be misleading when the direction of the effect is too uncertain. To address that problem, Section 4 derives a simple calibration of the effect size estimate. The calibrated p value $\widehat{\Pr}(s \neq \hat{s} \mid P = p)$ emerges as the degree of shrinkage.
Finally, implications for the debate and practice of testing null hypotheses are discussed in Section 5.

2 Estimating the local false sign rate of a single null hypothesis
For making connections to the literature and for succinctly deriving equation (5) regarding a test of the null hypothesis $\theta = \theta_{H_0}$, some terminology originally developed for testing multiple null hypotheses will prove useful. Since Efron et al. (2001) call the $\Pr(\theta = \theta_{H_0} \mid P = p)$ of equation (2) the local false discovery rate, let $\mathrm{LFDR} = \Pr(\theta = \theta_{H_0} \mid P = p)$; see Efron (2010) and Bickel (2019a) for expositions. Similarly, since Stephens (2016) calls the $\Pr(s \neq \hat{s} \mid P = p)$ of equation (4) the local false sign rate, let $\mathrm{LFSR} = \Pr(s \neq \hat{s} \mid P = p)$.
As equation (4) suggests, to estimate the LFSR of a single null hypothesis, we need not only an estimate of the LFDR, but also estimates of $\Pr(\theta \gtrless \theta_{H_0} \mid P = p)$. Seeing that

\[
\begin{aligned}
\Pr(\theta \gtrless \theta_{H_0} \mid P = p) &= \Pr(\theta \gtrless \theta_{H_0}, \theta \neq \theta_{H_0} \mid P = p) \\
&= \Pr(\theta \neq \theta_{H_0} \mid P = p)\, \Pr(\theta \gtrless \theta_{H_0} \mid P = p, \theta \neq \theta_{H_0}) \\
&= (1 - \mathrm{LFDR})\, \Pr(\theta \gtrless \theta_{H_0} \mid P = p, \theta \neq \theta_{H_0}),
\end{aligned}
\]

we estimate $\Pr(\theta \gtrless \theta_{H_0} \mid P = p)$ by $(1 - \widehat{\mathrm{LFDR}})\, p_{\lessgtr}$, where $p_{\lessgtr}$ is the estimate of $\Pr(\theta \gtrless \theta_{H_0} \mid P = p, \theta \neq \theta_{H_0})$ that is defined as a one-sided p value testing the null hypothesis that $\theta = \theta_{H_0}$ with $\theta \lessgtr \theta_{H_0}$ as the alternative hypothesis. From here on, the two-sided p value is $p = 2 \min(p_<, p_>)$.
Estimating $\Pr(\theta \gtrless \theta_{H_0} \mid P = p, \theta \neq \theta_{H_0})$ by $p_{\lessgtr}$ has both a Bayesian justification and a Fisherian justification. The Bayesian justification is that $p_{\lessgtr}$ is in many cases an approximation of a $\Pr(\theta \gtrless \theta_{H_0} \mid P = p, \theta \neq \theta_{H_0})$ based on any member of a wide class of prior distributions that do not concentrate prior probability at $\theta_{H_0}$ or at any other point (Pratt, 1965; Casella and Berger, 1987). Setting $\Pr(\theta = \theta_{H_0}) > 0$ need not conflict with those priors since $\Pr(\theta \gtrless \theta_{H_0} \mid P = p, \theta \neq \theta_{H_0})$, unlike $\Pr(\theta \gtrless \theta_{H_0} \mid P = p)$, is conditional on $\theta \neq \theta_{H_0}$ (cf. Bickel, 2012b, 2018).

The Fisherian justification is that $p_{\lessgtr}$, as a fiducial probability or observed confidence level (Polansky, 2007) that $\theta \gtrless \theta_{H_0}$ (Bickel, 2011), can serve as an estimate of a posterior probability that $\theta \gtrless \theta_{H_0}$ even though, as many have noted (e.g., Grundy, 1956; Lindley, 1958; Evans, 2015, §3.6), it does not necessarily satisfy the properties of a Bayesian posterior probability. In the same way, many optimal point estimates can have values that are not possible for the parameters they estimate (Bickel, 2019b). That is why Wilkinson (1977, §6.2) considered fiducial probability as an estimate of a level of belief rather than as a level of belief. Similarly, confidence distributions, a modern development of fiducial distributions (Nadarajah et al., 2015), have been interpreted in terms of estimating $\theta$ (Singh et al., 2007; Xie and Singh, 2013) or an indicator of hypothesis truth (Bickel, 2012a).
Plugging the above estimates into equation (4) yields

\[
\widehat{\mathrm{LFSR}} =
\begin{cases}
(1 - \widehat{\mathrm{LFDR}})\, p_< + \widehat{\mathrm{LFDR}} & \text{if } \hat{\theta} < \theta_{H_0} \\
(1 - \widehat{\mathrm{LFDR}})\, p_> + \widehat{\mathrm{LFDR}} & \text{if } \hat{\theta} > \theta_{H_0}.
\end{cases}
\qquad (6)
\]

Theorem 1. If $\operatorname{sign}(\hat{\theta} - \theta_{H_0}) = \operatorname{sign}(p_< - p_>)$, then

\[
\widehat{\mathrm{LFSR}} = (1 - \widehat{\mathrm{LFDR}})\,\frac{p}{2} + \widehat{\mathrm{LFDR}}. \qquad (7)
\]

Proof. By equation (6), it is sufficient to prove that

\[
\frac{p}{2} =
\begin{cases}
p_< & \text{if } \hat{\theta} < \theta_{H_0} \\
p_> & \text{if } \hat{\theta} > \theta_{H_0}.
\end{cases}
\]

Since the $\operatorname{sign}(\hat{\theta} - \theta_{H_0}) = \operatorname{sign}(p_< - p_>)$ condition implies that $\hat{\theta} < \theta_{H_0} \iff p_< < p_>$ and $\hat{\theta} > \theta_{H_0} \iff p_> < p_<$, it is enough to prove that $p/2 = \min(p_<, p_>)$, which follows immediately from $p = 2\min(p_<, p_>)$. □
The $\operatorname{sign}(\hat{\theta} - \theta_{H_0}) = \operatorname{sign}(p_< - p_>)$ condition for the theorem says the sign estimated by the parameter estimate agrees with the sign indicated by the one-sided p values. It holds in nearly all applications.
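Theorem 1's formula lends itself to a one-line implementation. The following is a minimal sketch that works with any LFDR estimate; the function name `calibrated_p_value` is illustrative, not the paper's:

```python
def calibrated_p_value(p, lfdr_estimate):
    """Estimate of the local false sign rate given a two-sided p value.

    Implements (1 - LFDR_hat) * p/2 + LFDR_hat, which interpolates
    between the one-sided p value p/2 (when LFDR_hat = 0) and the
    estimated posterior probability of the null (when LFDR_hat is high).
    """
    if not 0.0 <= lfdr_estimate <= 1.0:
        raise ValueError("lfdr_estimate must be a probability")
    return (1.0 - lfdr_estimate) * p / 2.0 + lfdr_estimate
```

Setting `lfdr_estimate` to 0 recovers the one-sided p value of the frequentist end of the continuum, while values near 1 recover Bayesian-style posterior probabilities of the null.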
3 Estimates of local false sign rates as calibrated p values
The estimate of the local false sign rate approaches a local false discovery rate or a one-sided p
value, depending on the limiting conditions.
Corollary 1. If $\operatorname{sign}(\hat{\theta} - \theta_{H_0}) = \operatorname{sign}(p_< - p_>)$, then $\lim_{p \to 0} \widehat{\mathrm{LFSR}} = \widehat{\mathrm{LFDR}}$ and $\lim_{\widehat{\Pr}(\theta = \theta_{H_0}) \to 0} \widehat{\mathrm{LFSR}} = p/2$, where $\widehat{\Pr}(\theta = \theta_{H_0})$ is the prior probability that yields $\widehat{\mathrm{LFDR}}$ as the posterior probability.

Proof. By Bayes's theorem, $\widehat{\mathrm{LFDR}} \to 0$ as $\widehat{\Pr}(\theta = \theta_{H_0}) \to 0$. Both claims then follow from Theorem 1. □
Since $p/2 = \min(p_<, p_>)$, that result justifies calling $\widehat{\mathrm{LFSR}}$ the $\widehat{\mathrm{LFDR}}$-calibrated p value and accordingly denoting it by $p(\widehat{\mathrm{LFDR}})$ to stress its dependence on the choice of an estimate of the LFDR.

Example 2. A simple option for $\widehat{\mathrm{LFDR}}$ is $v$, the lower bound given in equation (3), with $\widehat{\Pr}(\theta = \theta_{H_0})$ in place of $\Pr(\theta = \theta_{H_0})$. Then we write the v-calibrated p value as $p(v)$. The resulting Bayes-frequentist continuum is displayed as Figure 1, with traditional frequentism at the left end of each plot and traditional Bayesianism at the right. Figure 2 zooms in on three points in the continuum.
Many other lower bounds on the LFDR are available (e.g., Held and Ott, 2018, and references). But why estimate the LFDR with an estimate of a lower bound such as the v value (Example 2)? There are multiple reasons to accept the v value as an adequate estimate of the LFDR. First, as the Bayes factor can be lower than $\bar{B}$ (Held and Ott, 2018), which is the Bayes factor bound behind the v value, the v value is not necessarily a lower bound on the LFDR. Second, $\bar{B}$ is close to estimated Bayes factors for many studies in epidemiology, genetics, and ecology (Bayarri et al., 2016, Fig. 3),
and the v value would be close in those cases to the LFDR. Third, the v value is quantitatively similar to the following estimate of the LFDR.

Figure 1: The three curves are $p(v)$, $v$, and $p/2$ as functions of $\Pr(\theta = \theta_{H_0})$. For both $p = 0.05$ and $p = 0.005$, the v-calibrated p value $p(v)$ approaches the one-sided p value $p/2$ as $\Pr(\theta = \theta_{H_0})$ decreases and approaches the estimated posterior probability $v$ as $\Pr(\theta = \theta_{H_0})$ increases.

Figure 2: The three curves are $p(v)$, $v$, and $p/2$ as functions of $p$, the two-sided p value, for each of three prior probabilities: $\Pr(\theta = \theta_{H_0}) = 0.01, 0.1, 0.5$. In the plot corresponding most to traditional frequentism ($\Pr(\theta = \theta_{H_0}) = 0.01$), the v-calibrated p value $p(v)$ is close to $p/2$, a one-sided p value. In the plot corresponding most to traditional Bayesianism ($\Pr(\theta = \theta_{H_0}) = 0.5$), the v-calibrated p value $p(v)$ is close to $v$, the estimated posterior probability. The remaining plot ($\Pr(\theta = \theta_{H_0}) = 0.1$) shows a more interesting relationship between the v-calibrated p value, the estimated posterior probability, and the one-sided p value.
Example 3. Let $z$ denote the probit transform of $p/2$; the probit function is implemented in R as qnorm and in Microsoft Excel as NORM.S.INV. For $|z| \geq 1$, the L value is

\[
L = \left(1 + \frac{\widehat{\Pr}(\theta \neq \theta_{H_0})}{\widehat{\Pr}(\theta = \theta_{H_0})}\, \widehat{B}^{-1}\right)^{-1},
\]

where $\widehat{B} = 1.86\, |z|\, e^{-z^2/2}$ is the median-unbiased estimate of the Bayes factor assuming the probit transform of a one-sided p value is normal with mean 0 under $\theta \neq 0$ (Bickel, 2019a,d). (See Held and Ott (2016) for the maximum likelihood estimate under the same model and Pace and Salvan (1997) on the 0% confidence interval as a median-unbiased estimate.) Then $p(L)$ is the L-calibrated p value. It could be approximated by $p(v)$ since $p(L) \approx p(v)$, and the simplicity of $p(v)$ may make it more practical for general use (cf. Benjamin and Berger, 2019) than $p(L)$, which requires the probit transform.
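Example 3 can be sketched numerically. In this sketch, Python's `statistics.NormalDist.inv_cdf` stands in for R's qnorm, the posterior formula into which $\widehat{B}$ is plugged is assumed to have the same form as the v value's, and the function name `l_value` is mine:

```python
import math
from statistics import NormalDist

def l_value(p, prior_null):
    """L value of Example 3: an LFDR estimate from a two-sided p value.

    z is the probit transform of p/2 (qnorm in R, NORM.S.INV in Excel),
    and B_hat = 1.86 * |z| * exp(-z**2 / 2) is the median-unbiased
    estimate of the Bayes factor, plugged into the posterior formula.
    Requires |z| >= 1, i.e., a p value that is not too large.
    """
    z = NormalDist().inv_cdf(p / 2.0)
    if abs(z) < 1.0:
        raise ValueError("the estimate requires |z| >= 1")
    bayes_factor_estimate = 1.86 * abs(z) * math.exp(-z * z / 2.0)
    prior_odds_alt = (1.0 - prior_null) / prior_null
    return 1.0 / (1.0 + prior_odds_alt / bayes_factor_estimate)
```

For p = 0.05 and a 50% prior probability of the null, this gives roughly 0.35, in the same range as the corresponding v value of about 0.29, consistent with the claim that $p(L) \approx p(v)$.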
While the local false sign rate and local false discovery rate are posterior probabilities conditional on $P = p$, other posterior probabilities might serve as approximations.

Example 4. The positive predictive value $\Pr(\theta \neq \theta_{H_0} \mid P \leq \alpha)$ plays a key role in multiple papers related to the reproducibility crisis (e.g., Ioannidis, 2005; Button et al., 2013; Dreber et al., 2015; Wilson and Wixted, 2018). It is isomorphic to

\[
\Pr(\theta = \theta_{H_0} \mid P \leq \alpha) = 1 - \Pr(\theta \neq \theta_{H_0} \mid P \leq \alpha),
\]

which is known as the false positive report probability (Wacholder et al., 2004) and, in the multiple testing literature, as the Bayesian false discovery rate (Efron and Tibshirani, 2002) and the nonlocal false discovery rate (Bickel, 2013). An estimate of $\Pr(\theta = \theta_{H_0} \mid P \leq \alpha)$, such as the upper bound proposed by Bickel (2019c), is denoted by $w$ and called a w value after Wacholder et al. (2004). Using it as an estimate of the LFDR results in $p(w)$, the w-calibrated p value. However, $w$ is highly biased as an estimate of the LFDR when $\alpha = p$ (Colquhoun, 2017, 2019; Bickel and Rahal, 2019).
4 Effect size estimation informed by local false sign rate estimation
If all relevant prior distributions were known, the Bayes-optimal estimate of the effect size $\theta$ under squared error loss would be its posterior mean,

\[
\begin{aligned}
\mathrm{E}(\theta \mid P = p) ={}& \Pr(s = \hat{s} \mid P = p)\, \mathrm{E}(\theta \mid P = p, s = \hat{s}) \\
&+ \Pr(s \neq \hat{s}, \theta = \theta_{H_0} \mid P = p)\, \mathrm{E}(\theta \mid P = p, s \neq \hat{s}, \theta = \theta_{H_0}) \\
&+ \Pr(s \neq \hat{s}, \theta \neq \theta_{H_0} \mid P = p)\, \mathrm{E}(\theta \mid P = p, s \neq \hat{s}, \theta \neq \theta_{H_0}) \\
={}& (1 - \mathrm{LFSR})\, \mathrm{E}(\theta \mid P = p, s = \hat{s}) + \mathrm{LFDR}\, \theta_{H_0} \\
&+ (\mathrm{LFSR} - \mathrm{LFDR})\, \mathrm{E}(\theta \mid P = p, s \neq \hat{s}, \theta \neq \theta_{H_0}).
\end{aligned}
\]

Without that knowledge, $\theta$ may instead be estimated by estimating $\mathrm{E}(\theta \mid P = p)$. In agreement with the $\widehat{\mathrm{LFSR}} = p(\widehat{\mathrm{LFDR}})$ framework of Sections 2–3, $\mathrm{E}(\theta \mid P = p)$ is estimated by the $\widehat{\mathrm{LFDR}}$-calibrated effect size estimate

\[
\hat{\theta}(\widehat{\mathrm{LFDR}}) = (1 - \widehat{\mathrm{LFSR}})\, \hat{\theta} + \widehat{\mathrm{LFDR}}\, \theta_{H_0} + (\widehat{\mathrm{LFSR}} - \widehat{\mathrm{LFDR}})\, \theta_{H_0}, \qquad (8)
\]

which uses $\hat{\theta}$ to estimate $\mathrm{E}(\theta \mid P = p, s = \hat{s})$ and $\theta_{H_0}$ to estimate $\mathrm{E}(\theta \mid P = p, s \neq \hat{s}, \theta \neq \theta_{H_0})$. The latter estimate works best when $\theta$ would probably be close to $\theta_{H_0}$ conditional on a sign error. The calibrated effect size estimate simplifies to

\[
\hat{\theta}(\widehat{\mathrm{LFDR}}) = \left(1 - p(\widehat{\mathrm{LFDR}})\right) \hat{\theta} + p(\widehat{\mathrm{LFDR}})\, \theta_{H_0},
\]

which reveals $p(\widehat{\mathrm{LFDR}})$ as the degree to which $\hat{\theta}$ is shrunk toward $\theta_{H_0}$. The next result follows immediately from that and Corollary 1.
Corollary 2. If $\operatorname{sign}(\hat{\theta} - \theta_{H_0}) = \operatorname{sign}(p_< - p_>)$, then

\[
\lim_{\widehat{\Pr}(\theta = \theta_{H_0}) \to 0} \hat{\theta}(\widehat{\mathrm{LFDR}}) = \left(1 - \frac{p}{2}\right) \hat{\theta} + \frac{p}{2}\, \theta_{H_0}. \qquad (9)
\]

Figure 3: $\hat{\theta}(v)$ as a function of $\widehat{\Pr}(\theta = 0)$ for $\theta_{H_0} = 0$ and $p = 0.05, 0.15, 0.25, 0.35$. The v-calibrated effect size estimate $\hat{\theta}(v)$ is seen to shrink $\hat{\theta}$ toward 0 as $p$ or $\widehat{\Pr}(\theta = 0)$ increases.
The right-hand side of equation (8) has been used in multiple testing situations (e.g., Montazeri et al., 2010; Yanofsky and Bickel, 2010). Equation (9) records the effect of considering the local false sign rate even at the frequentist end of the Bayes-frequentist continuum.
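The shrinkage reading of the calibrated effect size estimate can be sketched by reusing the calibrated p value as the shrinkage weight. This is a sketch only; the function names and the default $\theta_{H_0} = 0$ are mine:

```python
def calibrated_effect_size(theta_hat, p, lfdr_estimate, theta_null=0.0):
    """Shrink the point estimate theta_hat toward theta_null.

    The calibrated p value (1 - LFDR_hat) * p/2 + LFDR_hat serves as
    the degree of shrinkage: the estimate is pulled toward theta_null
    more for higher p values, but never all the way unless the
    calibrated p value reaches 1.
    """
    p_cal = (1.0 - lfdr_estimate) * p / 2.0 + lfdr_estimate
    return (1.0 - p_cal) * theta_hat + p_cal * theta_null
```

At the frequentist end (`lfdr_estimate = 0`), a p value of 0.05 shrinks the estimate by only 2.5%; higher p values or higher LFDR estimates shrink it further.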
An advantage of $\hat{\theta}(\widehat{\mathrm{LFDR}})$ is that it shrinks $\hat{\theta}$ toward $\theta_{H_0}$ more for higher p values without ever shrinking it all the way to $\theta_{H_0}$, as seen in Figure 3. As a result, reporting calibrated effect size estimates could help prevent researchers from concluding that $\theta = \theta_{H_0}$ on the basis of a high p value.
5 Discussion
Imagine a world in which abstracts have v-calibrated effect size estimates and "$p(v) = 0.04$," "$p(v) = 0.01$," etc. in place of our world's uncalibrated estimates and "$p < 0.05$." Adopting the local false sign rate estimate as a calibrated p value may focus current discussions about estimation and testing. The traditional Bayesian and frequentist positions would no longer be incommensurable paradigms or matters of upbringing and taste but rather opposite directions on the continuum determined by the prior probability of the null hypothesis (Figures 1–2). Going forward, debates would then concentrate on ways to estimate the prior probability for each field, data type, or other reference class (cf. Lakens et al., 2018; de Ruiter, 2019). Progress is already being made in measuring how the prior is influenced by a field's risk tolerance (Wilson and Wixted, 2018), echoing the report that a demand for novelty leads to less reproducible results (Open Science Collaboration, 2015).
Even before a consensus is reached, statisticians can inform their collaborators of the impact of the prior probability on the local false sign rate estimate and help them determine adequate estimates of the prior for the data at hand. Estimates may be available in some cases from meta-analyses. For example, Benjamin et al. (2017) derived their infamous 0.005 significance threshold in part from meta-analyses suggesting $\widehat{\Pr}(\theta = \theta_{H_0}) = 10/11$ in psychology (Dreber et al., 2015; Johnson et al., 2017). The high value of that estimate reflects modeling assumptions that would in effect include values of $\theta$ that are close to $\theta_{H_0}$ with the null hypothesis rather than the alternative hypothesis. How close is close enough for inferential purposes may be a fruitful subject of future study and argument since it determines the calibrated p value through $\widehat{\Pr}(\theta = \theta_{H_0})$.

The difficulties involved in estimating prior probabilities may at times force us to retreat to null hypothesis significance testing without any prior or to traditional Bayesian testing with the default 50% prior probability. The calibrated p value would then tell us what the estimated probability of making a sign error would be if the prior probability of the null hypothesis were actually 0% or 50%, respectively.
This research was partially supported by the Natural Sciences and Engineering Research Council
of Canada (RGPIN/356018-2009).
References
Bayarri, M., Benjamin, D. J., Berger, J. O., Sellke, T. M., 2016. Rejection odds and rejection ratios:
A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology 72,
Begley, C. G., Ioannidis, J. P., 2015. Reproducibility in science. Circulation Research 116 (1),
Benjamin, D. J., Berger, J. O., 2019. Three recommendations for improving the use of p-values.
The American Statistician 73 (sup1), 186–191.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R.,
Bollen, K. A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C. D., Clyde, M.,
Cook, T. D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Eﬀerson, C., Fehr, E., Fidler,
F., Field, A. P., Forster, M., George, E. I., Gonzalez, R., Goodman, S., Green, E., Green, D. P.,
Greenwald, A. G., Hadﬁeld, J. D., Hedges, L. V., Held, L., Hua Ho, T., Hoijtink, H., Hruschka,
D. J., Imai, K., Imbens, G., Ioannidis, J. P. A., Jeon, M., Jones, J. H., Kirchler, M., Laibson, D.,
List, J., Little, R., Lupia, A., Machery, E., Maxwell, S. E., McCarthy, M., Moore, D. A., Morgan,
S. L., Munafó, M., Nakagawa, S., Nyhan, B., Parker, T. H., Pericchi, L., Perugini, M., Rouder, J.,
Rousseau, J., Savalei, V., Schönbrodt, F. D., Sellke, T., Sinclair, B., Tingley, D., Van Zandt, T.,
Vazire, S., Watts, D. J., Winship, C., Wolpert, R. L., Xie, Y., Young, C., Zinman, J., Johnson,
V. E., 9 2017. Redeﬁne statistical signiﬁcance. Nature Human Behaviour, 1.
Bernardo, J. M., 2011. Integrated objective Bayesian estimation and hypothesis testing. Bayesian
statistics 9, 1–68.
Bickel, D. R., 2011. Estimating the null distribution to adjust observed conﬁdence levels for genome-
scale screening. Biometrics 67, 363–370.
Bickel, D. R., 2012a. Coherent frequentism: A decision theory based on conﬁdence sets. Communi-
cations in Statistics - Theory and Methods 41, 1478–1496.
Bickel, D. R., 2012b. Empirical Bayes interval estimates that are conditionally equal to unadjusted
conﬁdence intervals or to default prior credibility intervals. Statistical Applications in Genetics
and Molecular Biology 11 (3), art. 7.
Bickel, D. R., 2013. Simple estimators of false discovery rates given as few as one or two p-values
without strong parametric assumptions. Statistical Applications in Genetics and Molecular Biol-
ogy 12, 529–543.
Bickel, D. R., 2018. Conﬁdence distributions and empirical Bayes posterior distributions uniﬁed as
distributions of evidential support, working paper, DOI: 10.5281/zenodo.2529438.
Bickel, D. R., 2019a. Genomics Data Analysis: False Discovery Rates and Empirical Bayes Methods.
Chapman and Hall/CRC, New York.
Bickel, D. R., 2019b. Maximum entropy derived and generalized under idempotent probability
to address Bayes-frequentist uncertainty and model revision uncertainty, working paper, DOI:
Bickel, D. R., 2019c. Null hypothesis signiﬁcance testing defended and calibrated by Bayesian model
checking. The American Statistician, DOI: 10.1080/00031305.2019.1699443.
Bickel, D. R., 2019d. Sharpen statistical signiﬁcance: Evidence thresholds and Bayes factors sharp-
ened into Occam’s razor. Stat 8 (1), e215.
Bickel, D. R., Rahal, A., 2019. Correcting false discovery rates for their bias to-
ward false positives. Communications in Statistics - Simulation and Computation, DOI:
Butler, J. S., Jones, P., Apr 2018. Theoretical and empirical distributions of the p value. METRON
76 (1), 1–30.
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., Munafò,
M. R., 2013. Power failure: why small sample size undermines the reliability of neuroscience.
Nature Reviews Neuroscience 14 (5), 365.
Casella, G., Berger, R. L., 1987. Reconciling Bayesian and frequentist evidence in the one-sided
testing problem. Journal of the American Statistical Association 82, 106–111.
Colquhoun, D., 2017. The reproducibility of research and the misinterpretation of p-values. Royal
Society Open Science 4 (12), 171085.
Colquhoun, D., 2019. The false positive risk: A proposal concerning what to do about p-values.
The American Statistician 73 (sup1), 192–201.
Cox, D. R., 1977. The role of signiﬁcance tests. Scandinavian Journal of Statistics 4, 49–70.
de Ruiter, J., Apr 2019. Redeﬁne or justify? comments on the alpha debate. Psychonomic Bulletin
& Review 26 (2), 430–433.
Dreber, A., Pfeiﬀer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., Johan-
nesson, M., 2015. Using prediction markets to estimate the reproducibility of scientiﬁc research.
Proceedings of the National Academy of Sciences 112 (50), 15343–15347.
Efron, B., 2010. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and
Prediction. Cambridge University Press, Cambridge.
Efron, B., Tibshirani, R., 2002. Empirical Bayes methods and false discovery rates for microarrays.
Genetic Epidemiology 23, 70–86.
Efron, B., Tibshirani, R., Storey, J. D., Tusher, V., 2001. Empirical Bayes analysis of a microarray
experiment. Journal of the American Statistical Association 96, 1151–1160.
Evans, M., 2015. Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC
Monographs on Statistics & Applied Probability. CRC Press, New York.
Goodman, S. N., 06 1999. Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. Annals
of Internal Medicine 130 (12), 1005–1013.
Grandhi, A., Guo, W., Romano, J., 2019. Control of directional errors in ﬁxed sequence multiple
testing. Statistica Sinica 29 (2), 1047–1064.
Grundy, P. M., 1956. Fiducial distributions and prior distributions: An example in which the
former cannot be associated with the latter. Journal of the Royal Statistical Society, Series B 18,
Held, L., Ott, M., 2016. How the maximal evidence of p-values against point null hypotheses
depends on sample size. American Statistician 70 (4), 335–341.
Held, L., Ott, M., 2018. On p-values and Bayes factors. Annual Review of Statistics and Its Appli-
cation 5, 393–419.
Hughes, B., 2018. Psychology in Crisis. Palgrave, London.
Hurlbert, S., Lombardi, C., 2009. Final collapse of the Neyman-Pearson decision theoretic frame-
work and rise of the neoFisherian. Annales Zoologici Fennici 46, 311–349.
Ioannidis, J. P., 2005. Why most published research ﬁndings are false. PLoS Medicine 2 (8), e124.
Johnson, V., Payne, R., Wang, T., Asher, A., Mandal, S., 2017. On the reproducibility of psycho-
logical science. Journal of the American Statistical Association 112 (517), 1–10.
Lakens, D., Adolﬁ, F. G., Albers, C. J., Anvari, F., Apps, M. A., Argamon, S. E., Baguley, T.,
Becker, R. B., Benning, S. D., Bradford, D. E., et al., 2018. Justify your alpha. Nature Human
Behaviour 2 (3), 168.
Lindley, D. V., 1958. Fiducial distributions and Bayes’ theorem. Journal of the Royal Statistical
Society B 20, 102–107.
Mayo, D. G., 2019. The ASA’s p-value project: Why it’s doing more harm than good (cont from
11/4/19). Web page, accessed 3 December 2019.
McShane, B. B., Gal, D., Gelman, A., Robert, C., Tackett, J. L., 2019. Abandon statistical signiﬁ-
cance. The American Statistician 73 (sup1), 235–245.
Montazeri, Z., Yanofsky, C. M., Bickel, D. R., 2010. Shrinkage estimation of eﬀect sizes as an
alternative to hypothesis testing followed by estimation in high-dimensional biology: Applications
to diﬀerential gene expression. Statistical Applications in Genetics and Molecular Biology 9, 23.
Nadarajah, S., Bityukov, S., Krasnikov, N., 2015. Conﬁdence distributions: A review. Statistical
Methodology 22, 23–46.
Nieuwenhuis, S., Forstmann, B. U., Wagenmakers, E.-J., 2011. Erroneous analyses of interactions
in neuroscience: a problem of signiﬁcance. Nature neuroscience 14 (9), 1105.
Open Science Collaboration, 2015. Estimating the reproducibility of psychological science. Science
Pace, L., Salvan, A., 1997. Principles of Statistical Inference: From a Neo-Fisherian Perspective.
Advanced Series on Statistical Science & Applied Probability. World Scientiﬁc, Singapore.
Polansky, A. M., 2007. Observed Conﬁdence Levels: Theory and Application. Chapman and Hall,
Pratt, J. W., 1965. Bayesian interpretation of standard inference statements. Journal of the Royal
Statistical Society B 27, 169–203.
Schachtman, N. A., 2019. Palavering about p-values. Web page, accessed 3 December 2019.
URL http://schachtmanlaw.com/palavering-about-p-values/
Sellke, T., Bayarri, M. J., Berger, J. O., 2001. Calibration of p values for testing precise null
hypotheses. American Statistician 55, 62–71.
Singh, K., Xie, M., Strawderman, W. E., 2007. Conﬁdence distribution (CD) – distribution estima-
tor of a parameter. IMS Lecture Notes Monograph Series 2007 54, 132–150.
Stephens, M., 10 2016. False discovery rates: a new deal. Biostatistics 18 (2), 275–294.
van den Bergh, D., Haaf, J. M., Ly, A., Rouder, J. N., Wagenmakers, E.-J., Nov 2019. A cautionary
note on estimating eﬀect size. PsyArXiv, DOI: 10.31234/osf.io/h6pr8.
Vovk, V. G., 1993. A logic of probability, with application to the foundations of statistics. Journal
of the Royal Statistical Society: Series B (Methodological) 55 (2), 317–341.
Wacholder, S., Chanock, S., Garcia-Closas, M., Ghormli, L. E., Rothman, N., 2004. Assessing
the probability that a positive report is false: An approach for molecular epidemiology studies.
Journal of the National Cancer Institute 96, 434–442.
Wasserstein, R. L., Lazar, N. A., 2016. The ASA’s statement on p-values: Context, process, and
purpose. The American Statistician 70 (2), 129–133.
Wasserstein, R. L., Schirm, A. L., Lazar, N. A., 2019. Moving to a world beyond "p < 0.05". The
American Statistician 73 (sup1), 1–19.
Wilkinson, G. N., 1977. On resolving the controversy in statistical inference (with discussion).
Journal of the Royal Statistical Society B 39, 119–171.
Wilson, B. M., Wixted, J. T., 2018. The prior odds of testing a true eﬀect in cognitive and social
psychology. Advances in Methods and Practices in Psychological Science 1 (2), 186–197.
Xie, M.-G., Singh, K., 2013. Conﬁdence distribution, the frequentist distribution estimator of a
parameter: A review. International Statistical Review 81 (1), 3–39.
Yanofsky, C. M., Bickel, D. R., 2010. Validation of diﬀerential gene expression algorithms: Ap-
plication comparing fold-change estimation to hypothesis testing. BMC Bioinformatics 11, art.