Null hypothesis significance testing interpreted and
calibrated by estimating probabilities of sign errors: A
Bayes-frequentist continuum
December 10, 2019
David R. Bickel
Ottawa Institute of Systems Biology
Department of Biochemistry, Microbiology and Immunology
Department of Mathematics and Statistics
University of Ottawa
451 Smyth Road
Ottawa, Ontario, K1H 8M5
+01 (613) 562-5800, ext. 8670
Abstract

Concepts from multiple testing can improve tests of single hypotheses. The proposed defini-
tion of the calibrated p value is an estimate of the local false sign rate, the posterior probability
that the direction of the estimated effect is incorrect. Interpreting one-sided p values as esti-
mates of conditional posterior probabilities, that calibrated p value is (1 - LFDR) p/2 + LFDR,
where p is a two-sided p value and LFDR is an estimate of the local false discovery rate, the
posterior probability that a point null hypothesis is true given p. A simple option for LFDR is
the posterior probability derived from estimating the Bayes factor to be its e p ln(1/p) lower
bound. The calibration provides a continuum between significance testing and traditional Bayesian
testing. The former effectively assumes the prior probability of the null hypothesis is 0, as some
statisticians argue is the case. Then the calibrated p value is equal to p/2, a one-sided p value,
since LFDR = 0. In traditional Bayesian testing, the prior probability of the null hypothesis
is at least 50%, which usually results in LFDR >> p. At that end of the continuum, the
calibrated p value is close to LFDR.
Keywords: calibrated effect size estimation; p value; directional error; dividing null hypothesis;
replication crisis; reproducibility crisis; sign error; Type III error
1 Introduction
Meta-analyses of large numbers of previous studies from biomedicine and neuroscience have raised
concerns that many published results cannot be replicated (Ioannidis, 2005; Nieuwenhuis et al.,
2011; Button et al., 2013), contributing to the perceived replication crisis in many scientific fields
(Begley and Ioannidis, 2015), especially psychology (Open Science Collaboration, 2015; Hughes,
2018). The statistics community has responded with guidelines on hypothesis testing and rec-
ommendations to emphasize effect sizes (e.g., Wasserstein and Lazar, 2016). However, conflicting
proposals among statisticians on how to improve statistical data analysis (e.g., Wasserstein et al.,
2019, and references) cause confusion among non-statisticians (Schachtman, 2019; Mayo, 2019),
leaving statistical consultants with the responsibility of sifting through the arguments to provide
their collaborators practical solutions.
For example, many Bayesians propose to address criticisms of null hypothesis significance testing
by transforming the p value into a lower bound on the posterior probability that the null hypothesis
is true; see Held and Ott (2018) and its references.
Example 1. Assuming the two-sided p value is not large (p ≤ 1/e) when testing the null hypothesis
H0: θ = θ_H0, Sellke et al. (2001) and Benjamin and Berger (2019) recommend

    B̲ = −e p ln p    (1)

as a lower bound on the Bayes factor B = Pr(P = p | θ = θ_H0) / Pr(P = p | θ ≠ θ_H0), where θ is
the unknown value of the parameter of interest, θ_H0 is the fixed parameter value of the null
hypothesis, and P is the random variable representing the p value before it is observed to be equal
to the number p. Since the posterior probability is

    Pr(θ = θ_H0 | P = p) = Pr(θ = θ_H0) Pr(P = p | θ = θ_H0) / Pr(P = p)
                         = [1 + ((1 − Pr(θ = θ_H0)) / Pr(θ = θ_H0)) B⁻¹]⁻¹    (2)

according to Bayes's theorem, it has a lower bound of

    v = [1 + ((1 − Pr(θ = θ_H0)) / Pr(θ = θ_H0)) B̲⁻¹]⁻¹    (3)

called the v value because a quantity approximated by B̲ appears in Vovk (1993, §9). ∎
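As a numerical illustration (my own sketch, not from the paper; the function names are mine), the bound (1) and the v value (3) can be computed directly:

```python
from math import e, log

def bayes_factor_bound(p):
    # Equation (1): -e p ln(p), a lower bound on the Bayes factor for p <= 1/e
    if not 0 < p <= 1 / e:
        raise ValueError("the bound assumes 0 < p <= 1/e")
    return -e * p * log(p)

def v_value(p, prior_null):
    # Equation (3): lower bound on Pr(theta = theta_H0 | P = p)
    b = bayes_factor_bound(p)
    return 1 / (1 + ((1 - prior_null) / prior_null) / b)

# p = 0.05 with a 50% prior probability of the null hypothesis:
# bayes_factor_bound(0.05) ≈ 0.407 and v_value(0.05, 0.5) ≈ 0.289
```

Note that even a "significant" p = 0.05 translates into a null posterior probability of roughly 29% under a 50% prior, which is the usual motivation for such calibrations.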
Since Pr(θ = θ_H0 | P = p) is typically much larger than p when Pr(θ = θ_H0) ≥ 1/2, it is often
claimed that p "overstates" the strength of the evidence against the null hypothesis (e.g., Goodman,
1999). That conclusion is disputed by Hurlbert and Lombardi (2009), who argue that since prudent
scientists tend to believe the null hypotheses they test are false, Pr(θ = θ_H0) should be much smaller
than 1/2, perhaps 1/10 or 1/100.
In fact, Bernardo (2011), McShane et al. (2019), and others argue that since systematic errors
prevent θ = θ_H0 from ever being exactly true, it follows that 0 is the only reasonable value for
Pr(θ = θ_H0); cf. van den Bergh et al. (2019). In that case, Pr(θ = θ_H0 | P = p) = 0, which would
make traditional Bayesian hypothesis testing useless. Frequentist hypothesis testing, on the other
hand, could still serve to determine whether the sample is large enough to warrant concluding that
θ > θ_H0 or that θ < θ_H0. In that context, θ = θ_H0 is called a dividing null hypothesis (Cox, 1977;
Bickel, 2011). The idea is that if the p value is low enough, then ŝ = sign(θ̂ − θ_H0) is a reasonable
estimate of s = sign(θ − θ_H0), where θ̂ is an observed point estimate of θ and the function sign(·)
has a value of 1 if its argument is positive, −1 if its argument is negative, and 0 otherwise. In that
way, testing the null hypothesis that θ = θ_H0 is used as an indirect method of deciding whether to
claim that s = ŝ.
A more direct way to make that decision would be to claim that s = ŝ only if it is sufficiently
probable or, equivalently, if the sign error s ≠ ŝ is sufficiently improbable. The sign error (Stephens,
2016) is also called a "Type III error" (Butler and Jones, 2018) and a "directional error" (Grandhi
et al., 2019). The posterior probability of making a sign error given a two-sided p is

    Pr(s ≠ ŝ | P = p) = Pr(θ > θ_H0 | P = p) + Pr(θ = θ_H0 | P = p)   if θ̂ < θ_H0
                      = Pr(θ < θ_H0 | P = p) + Pr(θ = θ_H0 | P = p)   if θ̂ > θ_H0    (4)

Under broadly applicable conditions, that is reasonably estimated by

    P̂r(s ≠ ŝ | P = p) = (1 − v) p/2 + v    (5)

whenever v, the v value of equation (3), is a reasonable estimate of the Pr(θ = θ_H0 | P = p) in
equation (2). The result is proved for all reasonable estimates of Pr(θ = θ_H0 | P = p) in Section 2.
The form of equation (5) represents a continuum between null hypothesis significance testing
and conventional Bayesian testing. The frequentist practice of considering θ = θ_H0 to be a dividing
null hypothesis (Cox, 1977; Bickel, 2011) is recovered by setting Pr(θ = θ_H0) = 0, for in that case
v = 0 and P̂r(s ≠ ŝ | P = p) = p/2, which is a one-sided p value. At the opposite extreme, the
traditional Bayesian practice of setting Pr(θ = θ_H0) ≥ 1/2 often results in a v value that is much
greater than the p value, in which case P̂r(s ≠ ŝ | P = p) ≈ v. Choices of Pr(θ = θ_H0) between those
frequentist and Bayesian extremes place P̂r(s ≠ ŝ | P = p) within a continuum of values between
p/2 and 1. For that reason, the easily interpreted estimate P̂r(s ≠ ŝ | P = p) is a natural choice of
a calibrated p value, as illustrated by example in Section 3. There, Figure 1 vividly portrays the
Bayes-frequentist continuum.
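A minimal sketch of the continuum (my own illustration, using the v value of Example 1 as the LFDR estimate in equation (5)):

```python
from math import e, log

def calibrated_p(p, prior_null):
    # Equation (5) with the LFDR estimated by the v value; assumes p <= 1/e
    if prior_null == 0:
        return p / 2  # frequentist end: v = 0, so a one-sided p value
    b = -e * p * log(p)                                # Bayes factor bound (1)
    v = 1 / (1 + ((1 - prior_null) / prior_null) / b)  # v value (3)
    return (1 - v) * p / 2 + v

# For p = 0.05: a prior of 0 gives 0.025 (one-sided p value);
# a prior of 0.5 gives about 0.307, close to v ≈ 0.289.
```

The calibrated value increases monotonically with the prior probability of the null, tracing the continuum from p/2 toward v.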
The American Statistical Association's call to emphasize effect size estimation (Wasserstein
and Lazar, 2016) does not necessarily warrant reporting conventional effect size estimates without
modification (van den Bergh et al., 2019). In particular, a large effect size estimate can be misleading
when the direction of the effect is too uncertain. To address that problem, Section 4 derives a simple
calibration of the effect size estimate. The calibrated p value P̂r(s ≠ ŝ | P = p) emerges as the
degree of shrinkage.
Finally, implications for the debate and practice of testing null hypotheses are discussed in
Section 5.
2 Estimating the local false sign rate of a single null hypothesis
For making connections to the literature and for succinctly deriving equation (5) regarding a test
of the null hypothesis θ = θ_H0, some terminology originally developed for testing multiple null
hypotheses will prove useful. Since Efron et al. (2001) calls the Pr(θ = θ_H0 | P = p) of equation (2)
the local false discovery rate, let LFDR = Pr(θ = θ_H0 | P = p); see Efron (2010) and Bickel (2019a)
for expositions. Similarly, since Stephens (2016) calls the Pr(s ≠ ŝ | P = p) of equation (4) the local
false sign rate, let LFSR = Pr(s ≠ ŝ | P = p).

As equation (4) suggests, to estimate the LFSR of a single null hypothesis, we need not only
an estimate of the LFDR but also estimates of Pr(θ > θ_H0 | P = p) and Pr(θ < θ_H0 | P = p).
Seeing that

    Pr(θ > θ_H0 | P = p) = Pr(θ > θ_H0, θ ≠ θ_H0 | P = p)
                         = Pr(θ ≠ θ_H0 | P = p) Pr(θ > θ_H0 | P = p, θ ≠ θ_H0)
                         = (1 − LFDR) Pr(θ > θ_H0 | P = p, θ ≠ θ_H0),

let P̂r(θ > θ_H0 | P = p) = (1 − L̂FDR) p_<, where p_< is the estimate of Pr(θ > θ_H0 | P = p, θ ≠ θ_H0)
that is defined as a one-sided p value testing the null hypothesis that θ = θ_H0 with θ < θ_H0 as the
alternative hypothesis; P̂r(θ < θ_H0 | P = p) = (1 − L̂FDR) p_> is defined analogously. From here on,
the two-sided p value is p = 2 min(p_<, p_>).
Estimating Pr(θ > θ_H0 | P = p, θ ≠ θ_H0) by p_< has both a Bayesian justification and a Fish-
erian justification. The Bayesian justification is that p_< is in many cases an approximation of a
Pr(θ > θ_H0 | P = p, θ ≠ θ_H0) based on any member of a wide class of prior distributions that do not
concentrate prior probability at θ_H0 or at any other point (Pratt, 1965; Casella and Berger, 1987).
Setting Pr(θ = θ_H0) > 0 need not conflict with those priors since Pr(θ > θ_H0 | P = p, θ ≠ θ_H0),
unlike Pr(θ > θ_H0 | P = p), is conditional on θ ≠ θ_H0 (cf. Bickel, 2012b, 2018).
The Fisherian justification is that p_<, as a fiducial probability or observed confidence level
(Polansky, 2007) that θ > θ_H0 (Bickel, 2011), can serve as an estimate of a posterior probability
that θ > θ_H0 even though, as many have noted (e.g., Grundy, 1956; Lindley, 1958; Evans, 2015,
§3.6), it does not necessarily satisfy the properties of a Bayesian posterior probability. In the same
way, many optimal point estimates can have values that are not possible for the parameters they
estimate (Bickel, 2019b). That is why Wilkinson (1977, §6.2) considered fiducial probability as an
estimate of a level of belief rather than as a level of belief. Similarly, confidence distributions, a
modern development of fiducial distributions (Nadarajah et al., 2015), have been interpreted in
terms of estimating θ (Singh et al., 2007; Xie and Singh, 2013) or an indicator of hypothesis truth
(Bickel, 2012a).
Plugging the above estimates into equation (4) yields

    L̂FSR = (1 − L̂FDR) p_< + L̂FDR   if θ̂ < θ_H0
         = (1 − L̂FDR) p_> + L̂FDR   if θ̂ > θ_H0    (6)

Theorem 1. If sign(θ̂ − θ_H0) = sign(p_< − p_>), then

    L̂FSR = (1 − L̂FDR) p/2 + L̂FDR.    (7)

Proof. By equation (6), it is sufficient to prove that

    p = 2 p_<   if θ̂ < θ_H0
      = 2 p_>   if θ̂ > θ_H0.

Since the sign(θ̂ − θ_H0) = sign(p_< − p_>) condition implies that θ̂ < θ_H0 ⟹ p_< < p_> and
θ̂ > θ_H0 ⟹ p_> < p_<, it is enough to prove that

    p = 2 p_<   if p_< < p_>
      = 2 p_>   if p_> < p_<,

which follows immediately from p = 2 min(p_<, p_>). ∎

The sign(θ̂ − θ_H0) = sign(p_< − p_>) condition for the theorem says the sign estimated by the
parameter estimate agrees with the sign indicated by the one-sided p values. It holds in nearly all
real situations.
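To make the theorem concrete, here is a hypothetical z-test check (my own sketch, not the paper's code): the branch form of equation (6) and the closed form of the theorem agree whenever the sign condition holds.

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def lfsr_branch(z, lfdr):
    # Equation (6) for a z statistic: p_< = Phi(z) (alternative theta < theta_H0),
    # p_> = 1 - Phi(z) (alternative theta > theta_H0); here theta_hat - theta_H0
    # is assumed to have the same sign as z
    p_less, p_greater = Phi(z), 1 - Phi(z)
    return (1 - lfdr) * (p_less if z < 0 else p_greater) + lfdr

def lfsr_theorem(z, lfdr):
    # Theorem 1: (1 - LFDR) p/2 + LFDR with two-sided p = 2 min(p_<, p_>)
    p = 2 * min(Phi(z), 1 - Phi(z))
    return (1 - lfdr) * p / 2 + lfdr

# the two forms coincide for estimates on either side of the null
for z in (-2.5, -0.3, 0.7, 1.96):
    assert abs(lfsr_branch(z, 0.2) - lfsr_theorem(z, 0.2)) < 1e-12
```

For |z| = 1.96 (p ≈ 0.05) and an LFDR estimate of 0.2, both forms give an estimated sign-error probability of about 0.22.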
3 Estimates of local false sign rates as calibrated p values

The estimate of the local false sign rate approaches a local false discovery rate or a one-sided p
value, depending on the limiting conditions.

Corollary 1. If sign(θ̂ − θ_H0) = sign(p_< − p_>), then lim_{p→0} L̂FSR = L̂FDR and
lim_{P̂r(θ=θ_H0)→0} L̂FSR = p/2, where P̂r(θ = θ_H0) is the prior probability that yields
L̂FDR as the posterior probability.

Proof. By Bayes's theorem, L̂FDR → 0 as P̂r(θ = θ_H0) → 0. Both claims then follow from Theorem
1. ∎

Since p/2 = min(p_<, p_>), that result justifies calling L̂FSR the L̂FDR-calibrated p value and
accordingly denoting it by p(L̂FDR) to stress its dependence on the choice of an estimate of the
LFDR.

Example 2. A simple option for L̂FDR is v, the lower bound given in equation (3), with P̂r(θ = θ_H0)
in place of Pr(θ = θ_H0). Then we write the v-calibrated p value as p(v).
The resulting Bayes-frequentist continuum is displayed as Figure 1, with traditional frequentism
at the left end of each plot and traditional Bayesianism at the right. Figure 2 zooms in on three
points in the continuum. ∎
Many other lower bounds on the LFDR are available (e.g., Held and Ott, 2018, and references).
But why estimate the LFDR with an estimate of a lower bound such as the v value (Example 2)?
There are multiple reasons to accept the v value as an adequate estimate of the LFDR. First, as
the Bayes factor can be lower than B̲ (Held and Ott, 2018), which is the Bayes factor bound behind
the v value, the v value is not necessarily a lower bound on the LFDR. Second, B̲ is close to estimated
Bayes factors for many studies in epidemiology, genetics, and ecology (Bayarri et al., 2016, Fig. 3),
and the v value would be close in those cases to the LFDR. Third, the v value is quantitatively similar
to the following estimate of the LFDR.

Figure 1: The three curves are p(v), v, and p/2 as functions of Pr(θ = θ_H0). For both p = 0.05
and p = 0.005, the v-calibrated p value p(v) approaches the one-sided p value p/2 as Pr(θ = θ_H0)
decreases and approaches the estimated posterior probability v as Pr(θ = θ_H0) increases.

Figure 2: The three curves are p(v), v, and p/2 as functions of p, the two-sided p value, for
each of three prior probabilities: Pr(θ = θ_H0) = 0.01, 0.1, 0.5. In the plot corresponding most to
traditional frequentism (Pr(θ = θ_H0) = 0.01), the v-calibrated p value p(v) is close to p/2, a one-
sided p value. In the plot corresponding most to traditional Bayesianism (Pr(θ = θ_H0) = 0.5), the
v-calibrated p value p(v) is close to v, the estimated posterior probability. The remaining plot
(Pr(θ = θ_H0) = 0.1) shows a more interesting relationship between the v-calibrated p value, the
estimated posterior probability, and the one-sided p value.
Example 3. Let z denote the probit transform of p/2; the probit (inverse standard normal distribution)
function is implemented in R as qnorm and in Microsoft Excel as NORM.S.INV. For |z| ≥ 1, the L value is

    L = [1 + ((1 − P̂r(θ = θ_H0)) / P̂r(θ = θ_H0)) B̂⁻¹]⁻¹,

where B̂ = 1.86 |z| e^(−z²/2) is the median-unbiased estimate of the Bayes factor assuming the probit
transform of a one-sided p value is normal with mean 0 under θ ≠ 0 (Bickel, 2019a,d). (See Held
and Ott (2016) for the maximum likelihood estimate under the same model and Pace and Salvan
(1997) on the 0% confidence interval as a median-unbiased estimate.) Then p(L) is the L-calibrated
p value. It could be approximated by p(v) since p(L) ≈ p(v), and the simplicity of p(v) may
make it more practical for general use (cf. Benjamin and Berger, 2019) than p(L), which requires
the probit transform. ∎
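As a numerical sketch (mine, not the paper's code; Python's statistics.NormalDist supplies the probit, playing the role of R's qnorm), the L value of this example can be computed and compared with the v value:

```python
from math import exp
from statistics import NormalDist

def L_value(p, prior_null):
    # Example 3: LFDR estimate from the median-unbiased Bayes factor estimate
    # B_hat = 1.86 |z| exp(-z^2 / 2), valid for |z| >= 1
    z = NormalDist().inv_cdf(p / 2)  # probit transform of p/2 (negative for small p)
    if abs(z) < 1:
        raise ValueError("the L value assumes |z| >= 1")
    b_hat = 1.86 * abs(z) * exp(-z * z / 2)
    return 1 / (1 + ((1 - prior_null) / prior_null) / b_hat)

# For p = 0.05 and a 50% prior: L ≈ 0.348, near the v value of about 0.289,
# so the calibrated p values p(L) and p(v) are also close.
```

This illustrates the "quantitatively similar" claim above: for p = 0.05 the two LFDR estimates differ by only about 0.06.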
While the local false sign rate and the local false discovery rate are posterior probabilities conditional
on P = p, other posterior probabilities might serve as approximations.

Example 4. The positive predictive value Pr(θ ≠ θ_H0 | P ≤ α) plays a key role in multiple papers
related to the reproducibility crisis (e.g., Ioannidis, 2005; Button et al., 2013; Dreber et al., 2015;
Wilson and Wixted, 2018). It is isomorphic to

    Pr(θ = θ_H0 | P ≤ α) = 1 − Pr(θ ≠ θ_H0 | P ≤ α),

which is known as the false positive report probability (Wacholder et al., 2004) and, in the multiple
testing literature, as the Bayesian false discovery rate (Efron and Tibshirani, 2002) and the nonlocal
false discovery rate (Bickel, 2013). An estimate of Pr(θ = θ_H0 | P ≤ α), such as the upper bound
proposed by Bickel (2019c), is denoted by w and called a w value after Wacholder et al. (2004).
Using it as an estimate of the LFDR results in p(w), the w-calibrated p value. However, w is highly
biased as an estimate of the LFDR when α = p (Colquhoun, 2017, 2019; Bickel and Rahal, 2019). ∎
4 Effect size estimation informed by local false sign rate estimation

If all relevant prior distributions were known, the Bayes-optimal estimate of the effect size θ under
squared error loss would be its posterior mean,

    E(θ | P = p) = Pr(s = ŝ | P = p) E(θ | P = p, s = ŝ)
                 + Pr(s ≠ ŝ, θ = θ_H0 | P = p) E(θ | P = p, s ≠ ŝ, θ = θ_H0)
                 + Pr(s ≠ ŝ, θ ≠ θ_H0 | P = p) E(θ | P = p, s ≠ ŝ, θ ≠ θ_H0)
                 = (1 − LFSR) E(θ | P = p, s = ŝ) + (LFDR) θ_H0
                 + (LFSR − LFDR) E(θ | P = p, s ≠ ŝ, θ ≠ θ_H0).

Without that knowledge, θ may instead be estimated by estimating E(θ | P = p).

In agreement with the L̂FSR = p(L̂FDR) framework of Sections 2-3, E(θ | P = p) is estimated
by the L̂FDR-calibrated effect size estimate,

    θ̂(L̂FDR) = (1 − L̂FSR) θ̂ + (L̂FDR) θ_H0 + (L̂FSR − L̂FDR) θ_H0,

which uses θ̂ to estimate E(θ | P = p, s = ŝ) and θ_H0 to estimate E(θ | P = p, s ≠ ŝ, θ ≠ θ_H0). The
latter estimate works best when θ would probably be close to θ_H0 conditional on a sign error. The
calibrated effect size estimate simplifies to

    θ̂(L̂FDR) = (1 − p(L̂FDR)) θ̂ + p(L̂FDR) θ_H0,

which reveals p(L̂FDR) as the degree to which θ̂ is shrunk toward θ_H0. The next result follows
immediately from that and Corollary 1.
Corollary 2. If sign(θ̂ − θ_H0) = sign(p_< − p_>), then

    lim_{p→0} θ̂(L̂FDR) = (1 − L̂FDR) θ̂ + (L̂FDR) θ_H0;    (8)
    lim_{P̂r(θ=θ_H0)→0} θ̂(L̂FDR) = (1 − p/2) θ̂ + (p/2) θ_H0.    (9)

Figure 3: θ̂(v) as a function of P̂r(θ = 0) for θ_H0 = 0 and p = 0.05, 0.15, 0.25, 0.35. The
v-calibrated effect size estimate θ̂(v) is seen to shrink θ̂ toward 0 as p or P̂r(θ = 0) increases.

The right-hand side of equation (8) has been used in multiple testing situations (e.g., Montazeri
et al., 2010; Yanofsky and Bickel, 2010). Equation (9) records the effect of considering the local
false sign rate even at the frequentist end of the Bayes-frequentist continuum.
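The simplified shrinkage form can be sketched in one line (an illustration of mine; the calibrated p value passed in would be any of the estimates above, such as p(v)):

```python
def calibrated_effect_size(theta_hat, calibrated_p, theta_null=0.0):
    # Shrink the point estimate theta_hat toward theta_null
    # by the calibrated p value, as in the simplified form above
    return (1 - calibrated_p) * theta_hat + calibrated_p * theta_null

# A point estimate of 2.0 with a calibrated p value of 0.3 shrinks to 1.4;
# full shrinkage to theta_null would require a calibrated p value of 1.
```

Because the calibrated p value never reaches 1 for finite evidence, the estimate is pulled toward the null value without ever collapsing onto it.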
An advantage of θ̂(L̂FDR) is that it shrinks θ̂ toward θ_H0 more for higher p values without
ever shrinking it all the way to θ_H0, as seen in Figure 3. As a result, reporting calibrated effect
size estimates could help prevent researchers from concluding that θ = θ_H0 on the basis of a high
p value.
5 Discussion
Imagine a world in which abstracts have v-calibrated effect size estimates and “p(v)=0.04,” “p(v)=0.01,”
etc. in place of our world’s uncalibrated estimates and “p<0.05.” Adopting the local false sign rate
estimate as a calibrated pvalue may focus current discussions about estimation and testing. The
traditional Bayesian and frequentist positions would no longer be incommensurate paradigms or
matters of upbringing and taste but rather opposite directions on the continuum determined by the
prior probability of the null hypothesis (Figures 1-2). Going forward, debates would then concen-
trate on ways to estimate the prior probability for each field, data type, or other reference class (cf.
Lakens et al., 2018; de Ruiter, 2019). Progress is already being made in measuring how the prior is
influenced by a field’s risk tolerance (Wilson and Wixted, 2018), echoing the report that a demand
for novelty leads to less reproducible results (Open Science Collaboration, 2015).
Even before a consensus is reached, statisticians can inform their collaborators of the impact
of the prior probability on the local false sign rate estimate and help them determine adequate
estimates of the prior for the data at hand. Estimates may be available in some cases from meta-
analyses. For example, Benjamin et al. (2017) derived their infamous 0.005 significance threshold
in part from meta-analyses suggesting P̂r(θ = θ_H0) = 10/11 in psychology (Dreber et al., 2015;
Johnson et al., 2017). The high value of that estimate reflects modeling assumptions that would in
effect include values of θ that are close to θ_H0 with the null hypothesis rather than the alternative
hypothesis. How close is close enough for inferential purposes may be a fruitful subject of future
study and argument since it determines the calibrated p value through P̂r(θ = θ_H0).
The difficulties involved in estimating prior probabilities may at times force us to retreat to
null hypothesis significance testing without any prior or to traditional Bayesian testing with
the default 50% prior probability. The calibrated pvalue would then tell us what the estimated
probability of making a sign error would be if the prior probability of the null hypothesis were
actually 0% or 50%, respectively.
Acknowledgments

This research was partially supported by the Natural Sciences and Engineering Research Council
of Canada (RGPIN/356018-2009).
References

Bayarri, M., Benjamin, D. J., Berger, J. O., Sellke, T. M., 2016. Rejection odds and rejection ratios:
A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology 72,
Begley, C. G., Ioannidis, J. P., 2015. Reproducibility in science. Circulation Research 116 (1),
Benjamin, D. J., Berger, J. O., 2019. Three recommendations for improving the use of p-values.
The American Statistician 73 (sup1), 186–191.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R.,
Bollen, K. A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C. D., Clyde, M.,
Cook, T. D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Efferson, C., Fehr, E., Fidler,
F., Field, A. P., Forster, M., George, E. I., Gonzalez, R., Goodman, S., Green, E., Green, D. P.,
Greenwald, A. G., Hadfield, J. D., Hedges, L. V., Held, L., Hua Ho, T., Hoijtink, H., Hruschka,
D. J., Imai, K., Imbens, G., Ioannidis, J. P. A., Jeon, M., Jones, J. H., Kirchler, M., Laibson, D.,
List, J., Little, R., Lupia, A., Machery, E., Maxwell, S. E., McCarthy, M., Moore, D. A., Morgan,
S. L., Munafó, M., Nakagawa, S., Nyhan, B., Parker, T. H., Pericchi, L., Perugini, M., Rouder, J.,
Rousseau, J., Savalei, V., Schönbrodt, F. D., Sellke, T., Sinclair, B., Tingley, D., Van Zandt, T.,
Vazire, S., Watts, D. J., Winship, C., Wolpert, R. L., Xie, Y., Young, C., Zinman, J., Johnson,
V. E., 2017. Redefine statistical significance. Nature Human Behaviour, 1.
Bernardo, J. M., 2011. Integrated objective Bayesian estimation and hypothesis testing. Bayesian
statistics 9, 1–68.
Bickel, D. R., 2011. Estimating the null distribution to adjust observed confidence levels for genome-
scale screening. Biometrics 67, 363–370.
Bickel, D. R., 2012a. Coherent frequentism: A decision theory based on confidence sets. Communi-
cations in Statistics - Theory and Methods 41, 1478–1496.
Bickel, D. R., 2012b. Empirical Bayes interval estimates that are conditionally equal to unadjusted
confidence intervals or to default prior credibility intervals. Statistical Applications in Genetics
and Molecular Biology 11 (3), art. 7.
Bickel, D. R., 2013. Simple estimators of false discovery rates given as few as one or two p-values
without strong parametric assumptions. Statistical Applications in Genetics and Molecular Biol-
ogy 12, 529–543.
Bickel, D. R., 2018. Confidence distributions and empirical Bayes posterior distributions unified as
distributions of evidential support, working paper, DOI: 10.5281/zenodo.2529438.
Bickel, D. R., 2019a. Genomics Data Analysis: False Discovery Rates and Empirical Bayes Methods.
Chapman and Hall/CRC, New York.
Bickel, D. R., 2019b. Maximum entropy derived and generalized under idempotent probability
to address Bayes-frequentist uncertainty and model revision uncertainty, working paper, DOI:
Bickel, D. R., 2019c. Null hypothesis significance testing defended and calibrated by Bayesian model
checking. The American Statistician, DOI: 10.1080/00031305.2019.1699443.
Bickel, D. R., 2019d. Sharpen statistical significance: Evidence thresholds and Bayes factors sharp-
ened into Occam’s razor. Stat 8 (1), e215.
Bickel, D. R., Rahal, A., 2019. Correcting false discovery rates for their bias to-
ward false positives. Communications in Statistics - Simulation and Computation, DOI:
Butler, J. S., Jones, P., 2018. Theoretical and empirical distributions of the p value. METRON
76 (1), 1–30.
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., Munafò,
M. R., 2013. Power failure: why small sample size undermines the reliability of neuroscience.
Nature Reviews Neuroscience 14 (5), 365.
Casella, G., Berger, R. L., 1987. Reconciling Bayesian and frequentist evidence in the one-sided
testing problem. Journal of the American Statistical Association 82, 106–111.
Colquhoun, D., 2017. The reproducibility of research and the misinterpretation of p-values. Royal
Society Open Science 4 (12), 171085.
Colquhoun, D., 2019. The false positive risk: A proposal concerning what to do about p-values.
The American Statistician 73 (sup1), 192–201.
Cox, D. R., 1977. The role of significance tests. Scandinavian Journal of Statistics 4, 49–70.
de Ruiter, J., 2019. Redefine or justify? Comments on the alpha debate. Psychonomic Bulletin
& Review 26 (2), 430–433.
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., Johan-
nesson, M., 2015. Using prediction markets to estimate the reproducibility of scientific research.
Proceedings of the National Academy of Sciences 112 (50), 15343–15347.
Efron, B., 2010. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and
Prediction. Cambridge University Press, Cambridge.
Efron, B., Tibshirani, R., 2002. Empirical Bayes methods and false discovery rates for microarrays.
Genetic Epidemiology 23, 70–86.
Efron, B., Tibshirani, R., Storey, J. D., Tusher, V., 2001. Empirical Bayes analysis of a microarray
experiment. Journal of the American Statistical Association 96, 1151–1160.
Evans, M., 2015. Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC
Monographs on Statistics & Applied Probability. CRC Press, New York.
Goodman, S. N., 1999. Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. Annals
of Internal Medicine 130 (12), 1005–1013.
Grandhi, A., Guo, W., Romano, J., 2019. Control of directional errors in fixed sequence multiple
testing. Statistica Sinica 29 (2), 1047–1064.
Grundy, P. M., 1956. Fiducial distributions and prior distributions: An example in which the
former cannot be associated with the latter. Journal of the Royal Statistical Society, Series B 18,
Held, L., Ott, M., 2016. How the maximal evidence of p-values against point null hypotheses
depends on sample size. American Statistician 70 (4), 335–341.
Held, L., Ott, M., 2018. On p-values and Bayes factors. Annual Review of Statistics and Its Appli-
cation 5, 393–419.
Hughes, B., 2018. Psychology in Crisis. Palgrave, London.
Hurlbert, S., Lombardi, C., 2009. Final collapse of the Neyman-Pearson decision theoretic frame-
work and rise of the neoFisherian. Annales Zoologici Fennici 46, 311–349.
Ioannidis, J. P., 2005. Why most published research findings are false. PLoS Medicine 2 (8), e124.
Johnson, V., Payne, R., Wang, T., Asher, A., Mandal, S., 2017. On the reproducibility of psycho-
logical science. Journal of the American Statistical Association 112 (517), 1–10.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A., Argamon, S. E., Baguley, T.,
Becker, R. B., Benning, S. D., Bradford, D. E., et al., 2018. Justify your alpha. Nature Human
Behaviour 2 (3), 168.
Lindley, D. V., 1958. Fiducial distributions and Bayes’ theorem. Journal of the Royal Statistical
Society B 20, 102–107.
Mayo, D. G., 2019. The ASA’s p-value project: Why it’s doing more harm than good (cont from
11/4/19). Web page, accessed 3 December 2019.
McShane, B. B., Gal, D., Gelman, A., Robert, C., Tackett, J. L., 2019. Abandon statistical signifi-
cance. The American Statistician 73 (sup1), 235–245.
Montazeri, Z., Yanofsky, C. M., Bickel, D. R., 2010. Shrinkage estimation of effect sizes as an
alternative to hypothesis testing followed by estimation in high-dimensional biology: Applications
to differential gene expression. Statistical Applications in Genetics and Molecular Biology 9, 23.
Nadarajah, S., Bityukov, S., Krasnikov, N., 2015. Confidence distributions: A review. Statistical
Methodology 22, 23–46.
Nieuwenhuis, S., Forstmann, B. U., Wagenmakers, E.-J., 2011. Erroneous analyses of interactions
in neuroscience: a problem of significance. Nature neuroscience 14 (9), 1105.
Open Science Collaboration, 2015. Estimating the reproducibility of psychological science. Science
349 (6251).
Pace, L., Salvan, A., 1997. Principles of Statistical Inference: From a Neo-Fisherian Perspective.
Advanced Series on Statistical Science & Applied Probability. World Scientific, Singapore.
Polansky, A. M., 2007. Observed Confidence Levels: Theory and Application. Chapman and Hall,
New York.
Pratt, J. W., 1965. Bayesian interpretation of standard inference statements. Journal of the Royal
Statistical Society B 27, 169–203.
Schachtman, N. A., 2019. Palavering about p-values. Web page, accessed 3 December 2019.
URL about-p-values/
Sellke, T., Bayarri, M. J., Berger, J. O., 2001. Calibration of p values for testing precise null
hypotheses. American Statistician 55, 62–71.
Singh, K., Xie, M., Strawderman, W. E., 2007. Confidence distribution (CD) – distribution estima-
tor of a parameter. IMS Lecture Notes Monograph Series 2007 54, 132–150.
Stephens, M., 2016. False discovery rates: a new deal. Biostatistics 18 (2), 275–294.
van den Bergh, D., Haaf, J. M., Ly, A., Rouder, J. N., Wagenmakers, E.-J., 2019. A cautionary
note on estimating effect size. PsyArXiv, DOI: 10.31234/
Vovk, V. G., 1993. A logic of probability, with application to the foundations of statistics. Journal
of the Royal Statistical Society: Series B (Methodological) 55 (2), 317–341.
Wacholder, S., Chanock, S., Garcia-Closas, M., Ghormli, L. E., Rothman, N., 2004. Assessing
the probability that a positive report is false: An approach for molecular epidemiology studies.
Journal of the National Cancer Institute 96, 434–442.
Wasserstein, R. L., Lazar, N. A., 2016. The ASA’s statement on p-values: Context, process, and
purpose. The American Statistician 70 (2), 129–133.
Wasserstein, R. L., Schirm, A. L., Lazar, N. A., 2019. Moving to a world beyond "p < 0.05". The
American Statistician 73 (sup1), 1–19.
Wilkinson, G. N., 1977. On resolving the controversy in statistical inference (with discussion).
Journal of the Royal Statistical Society B 39, 119–171.
Wilson, B. M., Wixted, J. T., 2018. The prior odds of testing a true effect in cognitive and social
psychology. Advances in Methods and Practices in Psychological Science 1 (2), 186–197.
Xie, M.-G., Singh, K., 2013. Confidence distribution, the frequentist distribution estimator of a
parameter: A review. International Statistical Review 81 (1), 3–39.
Yanofsky, C. M., Bickel, D. R., 2010. Validation of differential gene expression algorithms: Ap-
plication comparing fold-change estimation to hypothesis testing. BMC Bioinformatics 11, art.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Researchers commonly use p-values to answer the question: How strongly does the evidence favor the alternative hypothesis relative to the null hypothesis? p-Values themselves do not directly answer this question and are often misinterpreted in ways that lead to overstating the evidence against the null hypothesis. Even in the “post p < 0.05 era,” however, it is quite possible that p-values will continue to be widely reported and used to assess the strength of evidence (if for no other reason than the widespread availability and use of statistical software that routinely produces p-values and thereby implicitly advocates for their use). If so, the potential for misinterpretation will persist. In this article, we recommend three practices that would help researchers more accurately interpret p-values. Each of the three recommended practices involves interpreting p-values in light of their corresponding “Bayes factor bound,” which is the largest odds in favor of the alternative hypothesis relative to the null hypothesis that is consistent with the observed data. The Bayes factor bound generally indicates that a given p-value provides weaker evidence against the null hypothesis than typically assumed. We therefore believe that our recommendations can guard against some of the most harmful p-value misinterpretations. In research communities that are deeply attached to reliance on “p < 0.05,” our recommendations will serve as initial steps away from this attachment. We emphasize that our recommendations are intended merely as initial, temporary steps and that many further steps will need to be taken to reach the ultimate destination: a holistic interpretation of statistical evidence that fully conforms to the principles laid out in the ASA statement on statistical significance and p-values.
Significance testing is often criticized because p values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check. That result leads to a way to calibrate a p value by transforming it into an upper bound on the posterior probability of the null hypothesis (conditional on rejection) for any model that would pass the check. The calibration may be calculated from a prior probability of the null hypothesis and the stringency of the check without more detailed modeling. An upper bound, as opposed to a lower bound, can justify concluding that the null hypothesis has a low posterior probability.
Statisticians have met the need to test hundreds or thousands of genomics hypotheses simultaneously with novel empirical Bayes methods that combine advantages of traditional Bayesian and frequentist statistics. Techniques for estimating the local false discovery rate assign probabilities of differential gene expression, genetic association, etc. without requiring subjective prior distributions. This book brings these methods to scientists while keeping the mathematics at an elementary level. Readers will learn the fundamental concepts behind local false discovery rates, preparing them to analyze their own genomics data and to critically evaluate published genomics research. Key Features:
* dice games and exercises, including one using interactive software, for teaching the concepts in the classroom
* examples focusing on gene expression and on genetic association data and briefly covering metabolomics data and proteomics data
* gradual introduction to the mathematical equations needed
* how to choose between different methods of multiple hypothesis testing
* how to convert the output of genomics hypothesis testing software to estimates of local false discovery rates
* guidance through the minefield of current criticisms of p values
* material on non-Bayesian prior p values and posterior p values not previously published
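The preprint's own abstract defines the calibrated p value as (1 - LFDR) p/2 + LFDR, with a simple LFDR option obtained by taking the Bayes factor in favor of the null to equal its e p ln(1/p) lower bound. A minimal sketch, assuming even prior odds for that LFDR option (the function names and the default prior are mine, not the article's):

```python
import math

def lfdr_from_bound(p, prior_null=0.5):
    """A simple LFDR estimate: the posterior probability of the null
    obtained by setting the Bayes factor in favor of the null equal to
    its e * p * ln(1/p) lower bound (valid for 0 < p < 1/e)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    bf_null = math.e * p * math.log(1.0 / p)
    posterior_odds_null = (prior_null / (1.0 - prior_null)) * bf_null
    return posterior_odds_null / (1.0 + posterior_odds_null)

def calibrated_p_value(p, lfdr):
    """Estimate of the local false sign rate: (1 - LFDR) * p/2 + LFDR,
    where p is a two-sided p value. With LFDR = 0 this reduces to the
    one-sided p value p/2; with LFDR >> p it approaches LFDR."""
    return (1.0 - lfdr) * p / 2.0 + lfdr

p = 0.05
print(calibrated_p_value(p, 0.0))              # significance-testing end: 0.025
print(calibrated_p_value(p, lfdr_from_bound(p)))  # Bayesian end of the continuum
```

The two calls trace the continuum described in the abstract: LFDR = 0 recovers the one-sided p value, while a bound-based LFDR near 0.29 pulls the calibrated value up toward the posterior probability of the null.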
The way false discovery rates (FDRs) are used in the analysis of genomics data leads to excessive false positive rates. In this sense, FDRs overcorrect for the excessive conservatism (bias toward false negatives) of methods of adjusting p values that control a family-wise error rate. Estimators of the local FDR (LFDR) are much less biased but have not been widely adopted due to their high variance and lack of availability in software. To address both issues, we propose estimating the LFDR by correcting an estimated FDR or the level at which an FDR is controlled.
Occam's razor suggests assigning more prior probability to a hypothesis corresponding to a simpler distribution of data than to a hypothesis with a more complex distribution of data, other things equal. An idealization of Occam's razor in terms of the entropy of the data distributions tends to favor the null hypothesis over the alternative hypothesis. As a result, lower p values are needed to attain the same level of evidence. A recently debated argument for lowering the significance level to 0.005 as the p value threshold for a new discovery and to 0.05 for a suggestive result would then support further lowering them to 0.001 and 0.01, respectively.
A simple example shows that the classical theory of probability implies more than one can deduce via Kolmogorov's calculus of probability. Developing Dawid's ideas I propose a new calculus of probability which is free from this drawback. This calculus naturally leads to a new interpretation of probability. I argue that attempts to create a general empirical theory of probability should be abandoned and we should content ourselves with the logic of probability establishing relations between probabilistic theories and observations. My approach to the logic of probability is based on a variant of Ville's principle of the excluded gambling strategy. In addition to the classical theory of probability this approach is applied to the probabilistic theories provided by the problem of testing validity of probability forecasts and by statistical models.
Benjamin et al. (Nature Human Behaviour 2, 6-10, 2017) proposed improving the reproducibility of findings in psychological research by lowering the alpha level of our conventional null hypothesis significance tests from .05 to .005, because findings with p-values close to .05 represent insufficient empirical evidence. They argued that findings with a p-value between 0.005 and 0.05 should still be published, but not called “significant” anymore. This proposal was criticized and rejected in a response by Lakens et al. (Nature Human Behaviour 2, 168-171, 2018), who argued that instead of lowering the traditional alpha threshold to .005, we should stop using the term “statistically significant,” and require researchers to determine and justify their alpha levels before they collect data. In this contribution, I argue that the arguments presented by Lakens et al. against the proposal by Benjamin et al. are not convincing. Thus, given that it is highly unlikely that our field will abandon the NHST paradigm any time soon, lowering our alpha level to .005 is at this moment the best way to combat the replication crisis in psychology.
A sound basis for the theory of statistical inference: Measuring Statistical Evidence Using Relative Belief provides an overview of recent work on developing a theory of statistical inference based on measuring statistical evidence. It shows that being explicit about how to measure statistical evidence allows you to answer the basic question of when a statistical analysis is correct. The book attempts to establish a gold standard for how a statistical analysis should proceed. It first introduces basic features of the overall approach, such as the roles of subjectivity, objectivity, infinity, and utility in statistical analyses. It next discusses the meaning of probability and the various positions taken on probability. The author then focuses on the definition of statistical evidence and how it should be measured. He presents a method for measuring statistical evidence and develops a theory of inference based on this method. He also discusses how statisticians should choose the ingredients for a statistical problem and how these choices are to be checked for their relevance in an application.
Psychology has made great strides since it came of age in the late 1800s. Its subject matter receives a great deal of popular attention in the news, and its professionals are recognised as highly trained experts with wide-ranging and valuable skills. It is one of the most in-demand science subjects in education systems around the world, and more of its research is being conducted – and funded – than ever before. However, reviews of psychology’s standard research approaches have revealed the risk of systematic error to be troublingly high, and the arbitrary ways in which psychologists draw conclusions from evidence have been highlighted. In many ways psychology faces a number of important crises, including: a replication crisis; a paradigmatic crisis; a measurement crisis; a statistical crisis; a sampling crisis; and a crisis of exaggeration. This book addresses these and many other existential crises that face psychology today.