
Null hypothesis significance testing interpreted and calibrated by estimating probabilities of sign errors: A Bayes-frequentist continuum

December 10, 2019

David R. Bickel

Ottawa Institute of Systems Biology

Department of Biochemistry, Microbiology and Immunology

Department of Mathematics and Statistics

University of Ottawa

451 Smyth Road

Ottawa, Ontario, K1H 8M5

+01 (613) 562-5800, ext. 8670

dbickel@uottawa.ca

Abstract

Concepts from multiple testing can improve tests of single hypotheses. The proposed definition of the calibrated p value is an estimate of the local false sign rate, the posterior probability that the direction of the estimated effect is incorrect. Interpreting one-sided p values as estimates of conditional posterior probabilities, that calibrated p value is (1 − LFDR) p/2 + LFDR, where p is a two-sided p value and LFDR is an estimate of the local false discovery rate, the posterior probability that a point null hypothesis is true given p. A simple option for LFDR is the posterior probability derived from estimating the Bayes factor to be its e p ln(1/p) lower bound.

The calibration provides a continuum between significance testing and traditional Bayesian testing. The former effectively assumes the prior probability of the null hypothesis is 0, as some statisticians argue is the case. Then the calibrated p value is equal to p/2, a one-sided p value, since LFDR = 0. In traditional Bayesian testing, the prior probability of the null hypothesis is at least 50%, which usually results in LFDR ≫ p. At that end of the continuum, the calibrated p value is close to LFDR.

Keywords: calibrated effect size estimation; p value; directional error; dividing null hypothesis; replication crisis; reproducibility crisis; sign error; Type III error

1 Introduction

Meta-analyses of large numbers of previous studies from biomedicine and neuroscience have raised concerns that many published results cannot be replicated (Ioannidis, 2005; Nieuwenhuis et al., 2011; Button et al., 2013), contributing to the perceived replication crisis in many scientific fields (Begley and Ioannidis, 2015), especially psychology (Open Science Collaboration, 2015; Hughes, 2018). The statistics community has responded with guidelines on hypothesis testing and recommendations to emphasize effect sizes (e.g., Wasserstein and Lazar, 2016). However, conflicting proposals among statisticians on how to improve statistical data analysis (e.g., Wasserstein et al., 2019, and references) cause confusion among non-statisticians (Schachtman, 2019; Mayo, 2019), leaving statistical consultants with the responsibility of sifting through the arguments to provide their collaborators with practical solutions.

For example, many Bayesians propose to address criticisms of null hypothesis significance testing by transforming the p value to a lower bound on the posterior probability that the null hypothesis is true: see Held and Ott (2018) and its references.

Example 1. Assuming the two-sided p value is not large (p ≤ 1/e) when testing the null hypothesis H0: θ = θ_H0, Sellke et al. (2001) and Benjamin and Berger (2019) recommend

B = −e p ln p    (1)

as a lower bound on the Bayes factor B = Pr(P = p | θ = θ_H0) / Pr(P = p | θ ≠ θ_H0), where θ is the unknown value of the parameter of interest, θ_H0 is the fixed parameter value of the null hypothesis, and P is the random variable representing the p value before it is observed to be equal to the number p. Since the posterior probability is

Pr(θ = θ_H0 | P = p) = Pr(θ = θ_H0) Pr(P = p | θ = θ_H0) / Pr(P = p)
                     = [1 + B⁻¹ (1 − Pr(θ = θ_H0)) / Pr(θ = θ_H0)]⁻¹    (2)

according to Bayes's theorem, it has a lower bound of

v = [1 + B⁻¹ (1 − Pr(θ = θ_H0)) / Pr(θ = θ_H0)]⁻¹,    (3)

called the v value because a quantity approximated by B appears in Vovk (1993, §9). ∎
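To make the bound concrete, the following Python sketch computes B and the v value from equations (1) and (3); the function names are illustrative, not from any established package.

```python
import math

def bayes_factor_bound(p):
    """Equation (1): the lower bound B = -e * p * ln(p) on the Bayes
    factor in favor of the null, valid for p <= 1/e."""
    if not 0 < p <= 1 / math.e:
        raise ValueError("the bound assumes 0 < p <= 1/e")
    return -math.e * p * math.log(p)

def v_value(p, prior_null):
    """Equation (3): the resulting lower bound on Pr(theta = theta_H0 | P = p)
    for a given prior probability of the null."""
    B = bayes_factor_bound(p)
    return 1 / (1 + (1 - prior_null) / (prior_null * B))

# For p = 0.05 and a 50% prior on the null, B is about 0.41 and v is
# about 0.29, far above the p value itself.
```

The p = 0.05 case reproduces the well-known result that the posterior probability of the null cannot fall below roughly 0.29 when the prior probability is 1/2.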

Since Pr(θ = θ_H0 | P = p) is typically much larger than p when Pr(θ = θ_H0) ≥ 1/2, it is often claimed that p "overstates" the strength of the evidence against the null hypothesis (e.g., Goodman, 1999). That conclusion is disputed by Hurlbert and Lombardi (2009), who argue that since prudent scientists tend to believe the null hypotheses they test are false, Pr(θ = θ_H0) should be much smaller than 1/2, perhaps 1/10 or 1/100.

In fact, Bernardo (2011), McShane et al. (2019), and others argue that since systematic errors prevent θ = θ_H0 from ever being exactly true, it follows that 0 is the only reasonable value for Pr(θ = θ_H0); cf. van den Bergh et al. (2019). In that case, Pr(θ = θ_H0 | P = p) = 0, which would make traditional Bayesian hypothesis testing useless. Frequentist hypothesis testing, on the other hand, could still serve to determine whether the sample is large enough to warrant concluding that θ > θ_H0 or that θ < θ_H0. In that context, θ = θ_H0 is called a dividing null hypothesis (Cox, 1977; Bickel, 2011). The idea is that if the p value is low enough, then ŝ = sign(θ̂ − θ_H0) is a reasonable estimate of s = sign(θ − θ_H0), where θ̂ is an observed point estimate of θ and the function sign(•) has a value of 1 if its argument is positive, −1 if its argument is negative, and 0 otherwise. In that way, testing the null hypothesis that θ = θ_H0 is used as an indirect method of deciding whether to claim that s = ŝ.
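In code, the sign function and the resulting sign estimate are trivial; this Python fragment (names my own) restates the definitions above.

```python
def sign(x):
    """1 if x is positive, -1 if negative, 0 otherwise."""
    return (x > 0) - (x < 0)

def estimated_sign(theta_hat, theta_null):
    """The sign estimate s_hat = sign(theta_hat - theta_null)."""
    return sign(theta_hat - theta_null)

# A point estimate of 1.3 against theta_H0 = 0 gives s_hat = 1;
# a sign error occurs if the true theta is actually below 0.
```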

A more direct way to make that decision would be to claim that s = ŝ only if it is sufficiently probable or, equivalently, if the sign error s ≠ ŝ is sufficiently improbable. The sign error (Stephens, 2016) is also called a "Type III error" (Butler and Jones, 2018) and a "directional error" (Grandhi et al., 2019). The posterior probability of making a sign error given a two-sided p is

Pr(s ≠ ŝ | P = p) = Pr(θ > θ_H0 | P = p) + Pr(θ = θ_H0 | P = p) if θ̂ < θ_H0,
Pr(s ≠ ŝ | P = p) = Pr(θ < θ_H0 | P = p) + Pr(θ = θ_H0 | P = p) if θ̂ > θ_H0.    (4)

Under broadly applicable conditions, that is reasonably estimated by

P̂r(s ≠ ŝ | P = p) = (1 − v) p/2 + v,    (5)

whenever v, the v value of equation (3), is a reasonable estimate of the Pr(θ = θ_H0 | P = p) in equation (2). The result is proved for all reasonable estimates of Pr(θ = θ_H0 | P = p) in Section 2.

The form of equation (5) represents a continuum between null hypothesis significance testing and conventional Bayesian testing. The frequentist practice of considering θ = θ_H0 to be a dividing null hypothesis (Cox, 1977; Bickel, 2011) is recovered by setting Pr(θ = θ_H0) = 0, for in that case v = 0 and P̂r(s ≠ ŝ | P = p) = p/2, which is a one-sided p value. At the opposite extreme, the traditional Bayesian practice of setting Pr(θ = θ_H0) ≥ 1/2 often results in a v value that is much greater than the p value, in which case P̂r(s ≠ ŝ | P = p) ≈ v. Choices of Pr(θ = θ_H0) between those frequentist and Bayesian extremes place P̂r(s ≠ ŝ | P = p) within a continuum of values between p/2 and 1. For that reason, the easily interpreted estimate P̂r(s ≠ ŝ | P = p) is a natural choice of a calibrated p value, as illustrated by example in Section 3. There, Figure 1 vividly portrays the Bayes-frequentist continuum.

The American Statistical Association's call to emphasize effect size estimation (Wasserstein and Lazar, 2016) does not necessarily warrant reporting conventional effect size estimates without modification (van den Bergh et al., 2019). In particular, a large effect size estimate can be misleading when the direction of the effect is too uncertain. To address that problem, Section 4 derives a simple calibration of the effect size estimate. The calibrated p value P̂r(s ≠ ŝ | P = p) emerges as the degree of shrinkage.

Finally, implications for the debate and practice of testing null hypotheses are discussed in Section 5.

2 Estimating the local false sign rate of a single null hypothesis

For making connections to the literature and for succinctly deriving equation (5) regarding a test of the null hypothesis θ = θ_H0, some terminology originally developed for testing multiple null hypotheses will prove useful. Since Efron et al. (2001) call the Pr(θ = θ_H0 | P = p) of equation (2) the local false discovery rate, let LFDR = Pr(θ = θ_H0 | P = p); see Efron (2010) and Bickel (2019a) for expositions. Similarly, since Stephens (2016) calls the Pr(s ≠ ŝ | P = p) of equation (4) the local false sign rate, let LFSR = Pr(s ≠ ŝ | P = p).

As equation (4) suggests, to estimate the LFSR of a single null hypothesis, we need not only L̂FDR, an estimate of LFDR, but also estimates of Pr(θ ≷ θ_H0 | P = p). Seeing that

Pr(θ ≷ θ_H0 | P = p) = Pr(θ ≷ θ_H0, θ ≠ θ_H0 | P = p)
                     = Pr(θ ≠ θ_H0 | P = p) Pr(θ ≷ θ_H0 | P = p, θ ≠ θ_H0)
                     = (1 − LFDR) Pr(θ ≷ θ_H0 | P = p, θ ≠ θ_H0),

let P̂r(θ ≷ θ_H0 | P = p) = (1 − L̂FDR) p≶, where p≶ is the estimate of Pr(θ ≷ θ_H0 | P = p, θ ≠ θ_H0) that is defined as a one-sided p value testing the null hypothesis that θ = θ_H0 with θ ≶ θ_H0 as the alternative hypothesis. From here on, the two-sided p value is p = 2 min(p<, p>).

Estimating Pr(θ ≷ θ_H0 | P = p, θ ≠ θ_H0) by p≶ has both a Bayesian justification and a Fisherian justification. The Bayesian justification is that p≶ is in many cases an approximation of a Pr(θ ≷ θ_H0 | P = p, θ ≠ θ_H0) based on any member of a wide class of prior distributions that do not concentrate prior probability at θ_H0 or at any other point (Pratt, 1965; Casella and Berger, 1987). Setting Pr(θ = θ_H0) > 0 need not conflict with those priors since Pr(θ ≷ θ_H0 | P = p, θ ≠ θ_H0), unlike Pr(θ ≷ θ_H0 | P = p), is conditional on θ ≠ θ_H0 (cf. Bickel, 2012b, 2018).

The Fisherian justification is that p≶, as a fiducial probability or observed confidence level (Polansky, 2007) that θ ≷ θ_H0 (Bickel, 2011), can serve as an estimate of a posterior probability that θ ≷ θ_H0 even though, as many have noted (e.g., Grundy, 1956; Lindley, 1958; Evans, 2015, §3.6), it does not necessarily satisfy the properties of a Bayesian posterior probability. In the same way, many optimal point estimates can have values that are not possible for the parameters they estimate (Bickel, 2019b). That is why Wilkinson (1977, §6.2) considered fiducial probability as an estimate of a level of belief rather than as a level of belief. Similarly, confidence distributions, a modern development of fiducial distributions (Nadarajah et al., 2015), have been interpreted in terms of estimating θ (Singh et al., 2007; Xie and Singh, 2013) or an indicator of hypothesis truth (Bickel, 2012a).

Plugging the above estimates into equation (4) yields

L̂FSR = (1 − L̂FDR) p< + L̂FDR if θ̂ < θ_H0,
L̂FSR = (1 − L̂FDR) p> + L̂FDR if θ̂ > θ_H0.    (6)

Theorem 1. If sign(θ̂ − θ_H0) = sign(p< − p>), then

L̂FSR = (1 − L̂FDR) p/2 + L̂FDR.

Proof. By equation (6), it is sufficient to prove that

p = 2 p< if θ̂ < θ_H0 and p = 2 p> if θ̂ > θ_H0.

Since the sign(θ̂ − θ_H0) = sign(p< − p>) condition implies that θ̂ < θ_H0 ⟺ p< < p> and θ̂ > θ_H0 ⟺ p> < p<, it is enough to prove that

p = 2 p< if p< < p> and p = 2 p> if p> < p<,

which follows immediately from p = 2 min(p<, p>). ∎

The sign(θ̂ − θ_H0) = sign(p< − p>) condition for the theorem says the sign estimated by the parameter estimate agrees with the sign indicated by the one-sided p values. It holds in nearly all real situations.
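Theorem 1 turns the estimate into a one-line computation from the two one-sided p values and any LFDR estimate; a hypothetical Python sketch:

```python
def lfsr_estimate(p_less, p_greater, lfdr_hat):
    """Theorem 1: with the two-sided p value p = 2 * min(p_less, p_greater),
    the estimated local false sign rate is (1 - lfdr_hat) * p/2 + lfdr_hat."""
    p_two_sided = 2 * min(p_less, p_greater)
    return (1 - lfdr_hat) * p_two_sided / 2 + lfdr_hat

# One-sided p values of 0.015 and 0.985 with an LFDR estimate of 0.2 give
# 0.8 * 0.015 + 0.2 = 0.212.
```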

3 Estimates of local false sign rates as calibrated p values

The estimate of the local false sign rate approaches a local false discovery rate or a one-sided p value, depending on the limiting conditions.

Corollary 1. If sign(θ̂ − θ_H0) = sign(p< − p>), then lim_{p→0} L̂FSR = L̂FDR and lim_{P̂r(θ=θ_H0)→0} L̂FSR = p/2, where P̂r(θ = θ_H0) is the prior probability that yields L̂FDR as the posterior probability.

Proof. By Bayes's theorem, L̂FDR → 0 as P̂r(θ = θ_H0) → 0. Both claims then follow from Theorem 1. ∎

Since p/2 = min(p<, p>), that result justifies calling L̂FSR the L̂FDR-calibrated p value and accordingly denoting it by p(L̂FDR) to stress its dependence on the choice of an estimate of LFDR.

Example 2. A simple option for L̂FDR is v, the lower bound given in equation (3), with P̂r(θ = θ_H0) in place of Pr(θ = θ_H0). Then we write the v-calibrated p value as p(v).

The resulting Bayes-frequentist continuum is displayed as Figure 1, with traditional frequentism at the left end of each plot and traditional Bayesianism at the right. Figure 2 zooms in on three points in the continuum. ∎
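A direct computation of p(v) shows the continuum numerically; this Python sketch (illustrative names, using the −e p ln p bound of Example 2) sweeps the prior probability of the null for p = 0.05:

```python
import math

def p_v(p, prior_null):
    """The v-calibrated p value p(v) = (1 - v) * p/2 + v of Example 2."""
    B = -math.e * p * math.log(p)                      # equation (1)
    v = 1 / (1 + (1 - prior_null) / (prior_null * B))  # equation (3)
    return (1 - v) * p / 2 + v

for prior_null in (1e-4, 0.01, 0.1, 0.5):
    print(f"Pr(theta = theta_H0) = {prior_null:6.4f} -> p(v) = {p_v(0.05, prior_null):.4f}")
# p(v) rises from about p/2 = 0.025 toward v as the prior probability
# of the null grows, tracing the curves of Figure 1.
```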

Figure 1: The three curves are p(v), v, and p/2 as functions of Pr(θ = θ_H0). For both p = 0.05 and p = 0.005, the v-calibrated p value p(v) approaches the one-sided p value p/2 as Pr(θ = θ_H0) decreases and approaches the estimated posterior probability v as Pr(θ = θ_H0) increases.

Figure 2: The three curves are p(v), v, and p/2 as functions of p, the two-sided p value, for each of three prior probabilities: Pr(θ = θ_H0) = 0.01, 0.1, 0.5. In the plot corresponding most to traditional frequentism (Pr(θ = θ_H0) = 0.01), the v-calibrated p value p(v) is close to p/2, a one-sided p value. In the plot corresponding most to traditional Bayesianism (Pr(θ = θ_H0) = 0.5), the v-calibrated p value p(v) is close to v, the estimated posterior probability. The remaining plot (Pr(θ = θ_H0) = 0.1) shows a more interesting relationship between the v-calibrated p value, the estimated posterior probability, and the one-sided p value.

Many other lower bounds on LFDR are available (e.g., Held and Ott, 2018, and references). But why estimate the LFDR with an estimate of a lower bound such as the v value (Example 2)? There are multiple reasons to accept the v value as an adequate estimate of the LFDR. First, as the Bayes factor can be lower than B (Held and Ott, 2018), which is the Bayes factor bound behind the v value, the v value is not necessarily a lower bound on LFDR. Second, B is close to estimated Bayes factors for many studies in epidemiology, genetics, and ecology (Bayarri et al., 2016, Fig. 3), and the v value would be close in those cases to LFDR. Third, the v value is quantitatively similar to the following estimate of LFDR.

Example 3. Let z denote the probit transform of p/2; the probit function is implemented in R as qnorm and in Microsoft Excel as NORM.S.INV. For |z| ≥ 1, the L value is

L = 1 / (1 + 1/B̂),

where B̂ = 1.86 |z| e^(−z²/2) is the median-unbiased estimate of the Bayes factor assuming the probit transform of a one-sided p value is normal with mean 0 under θ ≠ 0 (Bickel, 2019a,d). (See Held and Ott (2016) for the maximum likelihood estimate under the same model and Pace and Salvan (1997) on the 0% confidence interval as a median-unbiased estimate.) Then p(L) is the L-calibrated p value. It could be approximated by p(v) since p(L) ≈ p(v), and the simplicity of p(v) may make it more practical for general use (cf. Benjamin and Berger, 2019) than p(L), which requires the probit transform. ∎
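Under the stated normal model, the L value is also a few lines of code; this Python sketch (function name illustrative) uses the standard-normal quantile function for the probit:

```python
import math
from statistics import NormalDist

def l_value(p):
    """Example 3: z is the probit transform of p/2; for |z| >= 1,
    B_hat = 1.86 * |z| * exp(-z**2 / 2) and L = 1 / (1 + 1/B_hat)."""
    z = NormalDist().inv_cdf(p / 2)   # probit; qnorm in R
    if abs(z) < 1:
        raise ValueError("the L value assumes |z| >= 1")
    b_hat = 1.86 * abs(z) * math.exp(-z ** 2 / 2)
    return 1 / (1 + 1 / b_hat)

# l_value(0.05) is roughly 0.35, in the same range as the v value of
# about 0.29 from a 50% prior, illustrating the claimed similarity.
```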

While the local false sign rate and local false discovery rate are posterior probabilities conditional on P = p, other posterior probabilities might serve as approximations.

Example 4. The positive predictive value Pr(θ ≠ θ_H0 | P ≤ α) plays a key role in multiple papers related to the reproducibility crisis (e.g., Ioannidis, 2005; Button et al., 2013; Dreber et al., 2015; Wilson and Wixted, 2018). It is isomorphic to

Pr(θ = θ_H0 | P ≤ α) = 1 − Pr(θ ≠ θ_H0 | P ≤ α),

which is known as the false positive report probability (Wacholder et al., 2004) and, in the multiple testing literature, as the Bayesian false discovery rate (Efron and Tibshirani, 2002) and the nonlocal false discovery rate (Bickel, 2013). An estimate of Pr(θ = θ_H0 | P ≤ α), such as the upper bound proposed by Bickel (2019c), is denoted by w and called a w value after Wacholder et al. (2004). Using it as an estimate of LFDR results in p(w), the w-calibrated p value. However, w is highly biased as an estimate of LFDR when α = p (Colquhoun, 2017, 2019; Bickel and Rahal, 2019). ∎


4 Effect size estimation informed by local false sign rate estimation

If all relevant prior distributions were known, the Bayes-optimal estimate of the effect size θ under squared error loss would be its posterior mean,

E(θ | P = p) = Pr(s = ŝ | P = p) E(θ | P = p, s = ŝ)
             + Pr(s ≠ ŝ, θ = θ_H0 | P = p) E(θ | P = p, s ≠ ŝ, θ = θ_H0)
             + Pr(s ≠ ŝ, θ ≠ θ_H0 | P = p) E(θ | P = p, s ≠ ŝ, θ ≠ θ_H0)
           = (1 − LFSR) E(θ | P = p, s = ŝ) + (LFDR) θ_H0
             + (LFSR − LFDR) E(θ | P = p, s ≠ ŝ, θ ≠ θ_H0).

Without that knowledge, θ may instead be estimated by estimating E(θ | P = p).

In agreement with the L̂FSR = p(L̂FDR) framework of Sections 2–3, E(θ | P = p) is estimated by the L̂FDR-calibrated effect size estimate,

θ̂(L̂FDR) = (1 − p(L̂FDR)) θ̂ + L̂FDR θ_H0 + (p(L̂FDR) − L̂FDR) θ_H0,

which uses θ̂ to estimate E(θ | P = p, s = ŝ) and θ_H0 to estimate E(θ | P = p, s ≠ ŝ, θ ≠ θ_H0). The latter estimate works best when θ would probably be close to θ_H0 conditional on a sign error. The calibrated effect size estimate simplifies to

θ̂(L̂FDR) = (1 − p(L̂FDR)) θ̂ + p(L̂FDR) θ_H0,    (7)

which reveals p(L̂FDR) as the degree to which θ̂ is shrunk toward θ_H0. The next result follows immediately from that and Corollary 1.

Corollary 2. If sign(θ̂ − θ_H0) = sign(p< − p>), then

lim_{p→0} θ̂(L̂FDR) = (1 − L̂FDR) θ̂ + L̂FDR θ_H0;    (8)

lim_{P̂r(θ=θ_H0)→0} θ̂(L̂FDR) = (1 − p/2) θ̂ + (p/2) θ_H0.    (9)


Figure 3: θ̂(v)/θ̂ as a function of P̂r(θ = 0) for θ_H0 = 0 and p = 0.05, 0.15, 0.25, 0.35. The v-calibrated effect size estimate θ̂(v) is seen to shrink θ̂ toward 0 as p or P̂r(θ = 0) increases.

The right-hand side of equation (8) has been used in multiple testing situations (e.g., Montazeri et al., 2010; Yanofsky and Bickel, 2010). Equation (9) records the effect of considering the local false sign rate even at the frequentist end of the Bayes-frequentist continuum.

An advantage of θ̂(L̂FDR) is that it shrinks θ̂ toward θ_H0 more for higher p values without ever shrinking it all the way to θ_H0, as seen in Figure 3. As a result, reporting calibrated effect size estimates could help prevent researchers from concluding that θ = θ_H0 on the basis of a high p value.
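The shrinkage of equation (7) is likewise simple to compute; this Python sketch (illustrative names, using the v-based calibration of Example 2) shows how the prior probability of the null drives the shrinkage:

```python
import math

def calibrated_effect_size(theta_hat, theta_null, p, prior_null):
    """Equation (7): shrink theta_hat toward theta_null by the
    v-calibrated p value p(v)."""
    B = -math.e * p * math.log(p)                      # equation (1)
    v = 1 / (1 + (1 - prior_null) / (prior_null * B))  # equation (3)
    p_cal = (1 - v) * p / 2 + v                        # calibrated p value
    return (1 - p_cal) * theta_hat + p_cal * theta_null

# With theta_hat = 2.0, theta_H0 = 0, and p = 0.05, a 50% prior on the
# null shrinks the estimate to about 1.39, while a 1% prior leaves
# about 1.94, mirroring Figure 3.
```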

5 Discussion

Imagine a world in which abstracts have v-calibrated effect size estimates and "p(v) = 0.04," "p(v) = 0.01," etc. in place of our world's uncalibrated estimates and "p < 0.05." Adopting the local false sign rate estimate as a calibrated p value may focus current discussions about estimation and testing. The traditional Bayesian and frequentist positions would no longer be incommensurate paradigms or matters of upbringing and taste but rather opposite directions on the continuum determined by the prior probability of the null hypothesis (Figures 1–2). Going forward, debates would then concentrate on ways to estimate the prior probability for each field, data type, or other reference class (cf. Lakens et al., 2018; de Ruiter, 2019). Progress is already being made in measuring how the prior is influenced by a field's risk tolerance (Wilson and Wixted, 2018), echoing the report that a demand for novelty leads to less reproducible results (Open Science Collaboration, 2015).

Even before a consensus is reached, statisticians can inform their collaborators of the impact of the prior probability on the local false sign rate estimate and help them determine adequate estimates of the prior for the data at hand. Estimates may be available in some cases from meta-analyses. For example, Benjamin et al. (2017) derived their infamous 0.005 significance threshold in part from meta-analyses suggesting P̂r(θ = θ_H0) = 10/11 in psychology (Dreber et al., 2015; Johnson et al., 2017). The high value of that estimate reflects modeling assumptions that would in effect include values of θ that are close to θ_H0 with the null hypothesis rather than the alternative hypothesis. How close is close enough for inferential purposes may be a fruitful subject of future study and argument since it determines the calibrated p value through P̂r(θ = θ_H0).

The difficulties involved in estimating prior probabilities may at times force us to retreat to null hypothesis significance testing without any prior or to traditional Bayesian testing with the default 50% prior probability. The calibrated p value would then tell us what the estimated probability of making a sign error would be if the prior probability of the null hypothesis were actually 0% or 50%, respectively.

Acknowledgments

This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN/356018-2009).

References

Bayarri, M., Benjamin, D. J., Berger, J. O., Sellke, T. M., 2016. Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology 72, 90–103.

Begley, C. G., Ioannidis, J. P., 2015. Reproducibility in science. Circulation Research 116 (1), 116–126.

Benjamin, D. J., Berger, J. O., 2019. Three recommendations for improving the use of p-values. The American Statistician 73 (sup1), 186–191.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C. D., Clyde, M., Cook, T. D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Efferson, C., Fehr, E., Fidler, F., Field, A. P., Forster, M., George, E. I., Gonzalez, R., Goodman, S., Green, E., Green, D. P., Greenwald, A. G., Hadfield, J. D., Hedges, L. V., Held, L., Hua Ho, T., Hoijtink, H., Hruschka, D. J., Imai, K., Imbens, G., Ioannidis, J. P. A., Jeon, M., Jones, J. H., Kirchler, M., Laibson, D., List, J., Little, R., Lupia, A., Machery, E., Maxwell, S. E., McCarthy, M., Moore, D. A., Morgan, S. L., Munafó, M., Nakagawa, S., Nyhan, B., Parker, T. H., Pericchi, L., Perugini, M., Rouder, J., Rousseau, J., Savalei, V., Schönbrodt, F. D., Sellke, T., Sinclair, B., Tingley, D., Van Zandt, T., Vazire, S., Watts, D. J., Winship, C., Wolpert, R. L., Xie, Y., Young, C., Zinman, J., Johnson, V. E., 2017. Redefine statistical significance. Nature Human Behaviour, 1.

Bernardo, J. M., 2011. Integrated objective Bayesian estimation and hypothesis testing. Bayesian Statistics 9, 1–68.

Bickel, D. R., 2011. Estimating the null distribution to adjust observed confidence levels for genome-scale screening. Biometrics 67, 363–370.

Bickel, D. R., 2012a. Coherent frequentism: A decision theory based on confidence sets. Communications in Statistics - Theory and Methods 41, 1478–1496.

Bickel, D. R., 2012b. Empirical Bayes interval estimates that are conditionally equal to unadjusted confidence intervals or to default prior credibility intervals. Statistical Applications in Genetics and Molecular Biology 11 (3), art. 7.

Bickel, D. R., 2013. Simple estimators of false discovery rates given as few as one or two p-values without strong parametric assumptions. Statistical Applications in Genetics and Molecular Biology 12, 529–543.

Bickel, D. R., 2018. Confidence distributions and empirical Bayes posterior distributions unified as distributions of evidential support, working paper, DOI: 10.5281/zenodo.2529438. URL https://doi.org/10.5281/zenodo.2529438

Bickel, D. R., 2019a. Genomics Data Analysis: False Discovery Rates and Empirical Bayes Methods. Chapman and Hall/CRC, New York. URL https://davidbickel.com/genomics/

Bickel, D. R., 2019b. Maximum entropy derived and generalized under idempotent probability to address Bayes-frequentist uncertainty and model revision uncertainty, working paper, DOI: 10.5281/zenodo.2645555. URL https://doi.org/10.5281/zenodo.2645555

Bickel, D. R., 2019c. Null hypothesis significance testing defended and calibrated by Bayesian model checking. The American Statistician, DOI: 10.1080/00031305.2019.1699443. URL https://doi.org/10.1080/00031305.2019.1699443

Bickel, D. R., 2019d. Sharpen statistical significance: Evidence thresholds and Bayes factors sharpened into Occam's razor. Stat 8 (1), e215.

Bickel, D. R., Rahal, A., 2019. Correcting false discovery rates for their bias toward false positives. Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2019.1630432.

Butler, J. S., Jones, P., 2018. Theoretical and empirical distributions of the p value. METRON 76 (1), 1–30.

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., Munafò, M. R., 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14 (5), 365.

Casella, G., Berger, R. L., 1987. Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association 82, 106–111.

Colquhoun, D., 2017. The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science 4 (12), 171085.

Colquhoun, D., 2019. The false positive risk: A proposal concerning what to do about p-values. The American Statistician 73 (sup1), 192–201.

Cox, D. R., 1977. The role of significance tests. Scandinavian Journal of Statistics 4, 49–70.

de Ruiter, J., 2019. Redefine or justify? Comments on the alpha debate. Psychonomic Bulletin & Review 26 (2), 430–433.

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., Johannesson, M., 2015. Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences 112 (50), 15343–15347.

Efron, B., 2010. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, Cambridge.

Efron, B., Tibshirani, R., 2002. Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23, 70–86.

Efron, B., Tibshirani, R., Storey, J. D., Tusher, V., 2001. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151–1160.

Evans, M., 2015. Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, New York.

Goodman, S. N., 1999. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 130 (12), 1005–1013.

Grandhi, A., Guo, W., Romano, J., 2019. Control of directional errors in fixed sequence multiple testing. Statistica Sinica 29 (2), 1047–1064.

Grundy, P. M., 1956. Fiducial distributions and prior distributions: An example in which the former cannot be associated with the latter. Journal of the Royal Statistical Society, Series B 18, 217–221.

Held, L., Ott, M., 2016. How the maximal evidence of p-values against point null hypotheses depends on sample size. The American Statistician 70 (4), 335–341.

Held, L., Ott, M., 2018. On p-values and Bayes factors. Annual Review of Statistics and Its Application 5, 393–419.

Hughes, B., 2018. Psychology in Crisis. Palgrave, London.

Hurlbert, S., Lombardi, C., 2009. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici 46, 311–349.

Ioannidis, J. P., 2005. Why most published research findings are false. PLoS Medicine 2 (8), e124.

Johnson, V., Payne, R., Wang, T., Asher, A., Mandal, S., 2017. On the reproducibility of psychological science. Journal of the American Statistical Association 112 (517), 1–10.

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., et al., 2018. Justify your alpha. Nature Human Behaviour 2 (3), 168.

Lindley, D. V., 1958. Fiducial distributions and Bayes' theorem. Journal of the Royal Statistical Society B 20, 102–107.

Mayo, D. G., 2019. The ASA's p-value project: Why it's doing more harm than good (cont from 11/4/19). Web page, accessed 3 December 2019. URL http://bit.ly/2LgXMKY

McShane, B. B., Gal, D., Gelman, A., Robert, C., Tackett, J. L., 2019. Abandon statistical significance. The American Statistician 73 (sup1), 235–245.

Montazeri, Z., Yanofsky, C. M., Bickel, D. R., 2010. Shrinkage estimation of effect sizes as an alternative to hypothesis testing followed by estimation in high-dimensional biology: Applications to differential gene expression. Statistical Applications in Genetics and Molecular Biology 9, 23.

Nadarajah, S., Bityukov, S., Krasnikov, N., 2015. Confidence distributions: A review. Statistical Methodology 22, 23–46.

Nieuwenhuis, S., Forstmann, B. U., Wagenmakers, E.-J., 2011. Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience 14 (9), 1105.

Open Science Collaboration, 2015. Estimating the reproducibility of psychological science. Science 349 (6251).

Pace, L., Salvan, A., 1997. Principles of Statistical Inference: From a Neo-Fisherian Perspective. Advanced Series on Statistical Science & Applied Probability. World Scientific, Singapore.

Polansky, A. M., 2007. Observed Confidence Levels: Theory and Application. Chapman and Hall, New York.

Pratt, J. W., 1965. Bayesian interpretation of standard inference statements. Journal of the Royal Statistical Society B 27, 169–203.

Schachtman, N. A., 2019. Palavering about p-values. Web page, accessed 3 December 2019. URL http://schachtmanlaw.com/palavering-about-p-values/

Sellke, T., Bayarri, M. J., Berger, J. O., 2001. Calibration of p values for testing precise null hypotheses. The American Statistician 55, 62–71.

Singh, K., Xie, M., Strawderman, W. E., 2007. Confidence distribution (CD) – distribution estimator of a parameter. IMS Lecture Notes Monograph Series 54, 132–150.

Stephens, M., 2016. False discovery rates: a new deal. Biostatistics 18 (2), 275–294.

van den Bergh, D., Haaf, J. M., Ly, A., Rouder, J. N., Wagenmakers, E.-J., 2019. A cautionary note on estimating effect size. PsyArXiv, DOI: 10.31234/osf.io/h6pr8. URL psyarxiv.com/h6pr8

Vovk, V. G., 1993. A logic of probability, with application to the foundations of statistics. Journal of the Royal Statistical Society: Series B (Methodological) 55 (2), 317–341.

Wacholder, S., Chanock, S., Garcia-Closas, M., Ghormli, L. E., Rothman, N., 2004. Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. Journal of the National Cancer Institute 96, 434–442.

Wasserstein, R. L., Lazar, N. A., 2016. The ASA's statement on p-values: Context, process, and purpose. The American Statistician 70 (2), 129–133.

Wasserstein, R. L., Schirm, A. L., Lazar, N. A., 2019. Moving to a world beyond "p < 0.05". The American Statistician 73 (sup1), 1–19.

Wilkinson, G. N., 1977. On resolving the controversy in statistical inference (with discussion). Journal of the Royal Statistical Society B 39, 119–171.

Wilson, B. M., Wixted, J. T., 2018. The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science 1 (2), 186–197.

Xie, M.-G., Singh, K., 2013. Confidence distribution, the frequentist distribution estimator of a parameter: A review. International Statistical Review 81 (1), 3–39.

Yanofsky, C. M., Bickel, D. R., 2010. Validation of differential gene expression algorithms: Application comparing fold-change estimation to hypothesis testing. BMC Bioinformatics 11, art. 63.