
Retire statistical significance



Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.
When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?

If your experience matches ours, there’s a good chance that this happened at the last talk you attended. We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.
How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.
We have some proposals to keep scientists from falling prey to these misconceptions. Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

21 MARCH 2019 | VOL 567 | NATURE | 305
For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.
Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see ‘Beware false conclusions’).
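The numbers in this example can be checked directly. The sketch below back-calculates the two-sided P values from the reported risk ratios and 95% intervals, using the standard Wald approximation on the log scale; the interval limits 0.97–1.48 and 1.09–1.33 are the ratios implied by the −3% to +48% and +9% to +33% risk changes, and the function name is ours.

```python
import math

def p_from_rr_ci(rr, lo, hi, level_z=1.96):
    """Approximate two-sided P value for H0: RR = 1, given a point
    estimate and a 95% CI computed on the log scale (Wald approximation)."""
    se = (math.log(hi) - math.log(lo)) / (2 * level_z)
    z = math.log(rr) / se
    # two-sided P from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 'Non-significant' study: risk ratio 1.2, CI 0.97 to 1.48
p1 = p_from_rr_ci(1.2, 0.97, 1.48)
# 'Significant' study: the same risk ratio 1.2, narrower CI 1.09 to 1.33
p2 = p_from_rr_ci(1.2, 1.09, 1.33)
print(f"P1 = {p1:.3f}, P2 = {p2:.4f}")  # ≈ 0.091 and ≈ 0.0003
```

The two studies differ only in precision, not in the effect they observed; the P values diverge because the second interval is narrower.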
These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).
In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”3. Another article with dozens of signatories also calls on authors and journal editors to disavow those terms4.

We agree, and call for the entire concept of statistical significance to be abandoned.
We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories — all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a “surgical strike against thoughtless testing of statistical significance” and “an opportunity to register your voice in favour of better scientific practices”.
We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis5.
The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different6–8. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.
Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude, and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.

The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan9. This occurs even with the best of intentions.
Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.
One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30.
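This variability is easy to demonstrate by simulation. The sketch below draws pairs of idealized z-test replications of a true effect sized to give exactly 80% power at the two-sided 0.05 level (on the z scale, 1.96 + 0.84 = 2.80), and counts how often the pair lands on opposite sides of the thresholds mentioned above; the setup is ours, not from the article.

```python
import math, random

random.seed(1)

def two_sided_p(z):
    """Two-sided P value for an observed z statistic under H0."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n = 100_000
discordant = 0
for _ in range(n):
    # two independent 'perfect replications' of the same true effect
    p_a = two_sided_p(random.gauss(2.80, 1.0))
    p_b = two_sided_p(random.gauss(2.80, 1.0))
    if min(p_a, p_b) < 0.01 and max(p_a, p_b) > 0.30:
        discordant += 1

print(f"{100 * discordant / n:.1f}% of replication pairs split across P < 0.01 and P > 0.30")
```

Even this extreme split (one study well under 0.01, the other above 0.30) occurs in a few per cent of pairs; milder disagreements across the 0.05 line are far more common still.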
[Figure: Beware false conclusions. Studies currently dubbed ‘statistically significant’ and ‘statistically non-significant’ need not be contradictory, and such designations might cause genuine effects to be dismissed. The figure contrasts a ‘significant’ study (low P value) with a ‘non-significant’ study (high P value) on a scale running from decreased effect through no effect to increased effect: the observed effect (or point estimate) is the same in both studies, so they are not in conflict, even if one is ‘significant’ and the other is not.]
Whether a Pvalue is small or large, caution
is warranted.
We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval7,10. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.
We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials. An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.
When talking about compatibility intervals, bear in mind four things. First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. In fact, values just outside the interval do not differ substantively from those just inside the interval. It is thus wrong to claim that an interval shows all possible values.
Second, not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits. This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval. For example, the authors above could have written: ‘Like a previous study, our results suggest a 20% increase in risk of new-onset atrial fibrillation in patients given the anti-inflammatory drugs. Nonetheless, a risk difference ranging from a 3% decrease, a small negative association, to a 48% increase, a substantial positive association, is also reasonably compatible with our data, given our assumptions.’ Interpreting the point estimate, while acknowledging its uncertainty, will keep you from making false declarations of ‘no difference’, and from making overconfident claims.
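One way to see this gradation of compatibility is to compute, for the first study above, the P value that every hypothesized risk ratio would receive as a null hypothesis. The sketch below does so with the Wald approximation on the log scale, back-calculating the standard error from the reported interval of 0.97 to 1.48; the grid of test values is our own choice.

```python
import math

# standard error back-calculated from the reported 95% CI (0.97 to 1.48)
se = (math.log(1.48) - math.log(0.97)) / (2 * 1.96)  # ≈ 0.108

def compatibility(rr0, rr_hat=1.2):
    """P value for the null hypothesis RR = rr0, given the observed RR."""
    z = (math.log(rr_hat) - math.log(rr0)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

for rr0 in (0.90, 0.97, 1.00, 1.20, 1.48, 1.60):
    print(f"RR = {rr0:.2f}: P = {compatibility(rr0):.3f}")
```

The point estimate (RR = 1.2) gets P = 1 and is maximally compatible; the null value RR = 1 gets the study’s reported P of about 0.09; the 95% limits sit near P = 0.05; and values just outside the limits are only slightly less compatible than values just inside, with no break at the boundary.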
Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision. A different level can be justified, depending on the application. And, as in the anti-inflammatory-drugs example, interval estimates can perpetuate the problems of statistical significance when the dichotomization they impose is treated as a scientific standard.
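The conventional nature of the 95% level shows up directly in the arithmetic: the same data yield different interval widths at different levels. As a sketch, here are 90%, 95% and 99% intervals for the first study’s risk ratio, again using the Wald approximation with the standard error back-calculated from the reported 95% interval.

```python
import math

rr_hat = 1.2
se = (math.log(1.48) - math.log(0.97)) / (2 * 1.96)

# normal quantiles for three conventional levels
for level, z in ((90, 1.645), (95, 1.960), (99, 2.576)):
    lo = math.exp(math.log(rr_hat) - z * se)
    hi = math.exp(math.log(rr_hat) + z * se)
    print(f"{level}% interval: {lo:.2f} to {hi:.2f}")
```

The 95% row reproduces the reported 0.97 to 1.48; at 90% the interval excludes the null value, at 99% it is wider still. Nothing in the data privileges one of these levels over another.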
Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty7,8,10. Make these assumptions as clear as possible and test the ones you can, for example by plotting your data and by fitting alternative models, and then reporting all results.
Whatever the statistics show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical. Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.
The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.
What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13) — without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.
Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.
Valentin Amrhein is a professor of zoology at the University of Basel, Switzerland. Sander Greenland is a professor of epidemiology and statistics at the University of California, Los Angeles. Blake McShane is a statistical methodologist and professor of marketing at Northwestern University in Evanston, Illinois. For a full list of co-signatories, see Supplementary Information.
1. Fisher, R. A. Nature 136, 474 (1935).
2. Schmidt, M. & Rothman, K. J. Int. J. Cardiol. 177, 1089–1090 (2014).
3. Wasserstein, R. L., Schirm, A. L. & Lazar, N. A. Am. Stat. 73, 1–19 (2019).
4. Hurlbert, S. H., Levine, R. A. & Utts, J. Am. Stat. 73, 352–357 (2019).
5. Lehmann, E. L. Testing Statistical Hypotheses 2nd edn 70–71 (Springer, 1986).
6. Gigerenzer, G. Adv. Meth. Pract. Psychol. Sci. 1, 198–218 (2018).
7. Greenland, S. Am. J. Epidemiol. 186, 639–645 (2017).
8. McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Am. Stat. 73, 235–245 (2019).
9. Gelman, A. & Loken, E. Am. Sci. 102, 460–465 (2014).
10. Amrhein, V., Trafimow, D. & Greenland, S. Am. Stat. 73, 262–270 (2019).
Supplementary information accompanies this article.
[Box: Wrong interpretations. An analysis of 791 articles across 5 journals* found that around half mistakenly assume non-significance means no effect.
*Data taken from: P. Schatz et al. Arch. Clin. Neuropsychol. 20, 1053–1059 (2005); F. Fidler et al. Conserv. Biol. 20, 1539–1544 (2006); R. Hoekstra et al. Psychon. Bull. Rev. 13, 1033–1037 (2006); F. Bernardi et al. Eur. Sociol. Rev. 33, 1–15 (2017).]
21 MARCH 2019 | VOL 567 | NATURE | 307
... ANOVA is based on the statistical significance testing (Fisher, 1935;Zhang, 2022c). Statistical significance tests are one of the most important statistical inference methods in statistics (Fisher, 1935;Yates, 1951;Amrhein et al., 2019;Sellke et al., 2001;Zhang, 2022c). Whether a research result is statistically significant is mainly determined by using the p-value obtained from hypothesis testing (Bergstrom and West, 2021). ...
... The p-value is at the heart of the statistical significance testing (Sun, 2016;Zhang, 2022c). In recent years, however, the statistical significance testing have been questioned, mainly because the paradigm of significance tests is wrong, p-value is too sensitive, p-value is a dichotomous subjective index, and statistical significance is related to sample size, etc (Sellke et al., 2001;Trafimow and Marks, 2015;Baker, 2016;Wasserstein and Lazar, 2016;McShane and David, 2017;Amrhein et al., 2019;Tong, 2019;Wasserstein et al., 2019;Zhang, 2022a-c). Statistical significance testing has been one of the sources of false conclusions and research reproducibility crisis (Ioannidis, 2005;Open Science Collaboration, 2015;Errington et al., 2021;Huang, 2021aHuang, -b, 2023Kafdar, 2021;Nature Editorial, 2021;Vrieze, 2021;Zhang, 2022c). ...
Full-text available
Based on the statistical significance testing, the Analysis of Variance (ANOVA) is one of the most popular statistics widely used in experimental sciences. However, in recent years, the statistical significance testing has been widely criticized for its various shortcomings. To address this problem, in present article we developed the new ANOVA methodology in the paradigm of new statistics. In this methodology, effect size testing is added to six most used ANOVA methods and the finer p-value standardard is used in the statistical significance testing of eight ANOVA methods. Both online and offline computational tools are developed for free use. Online computation: ; Find details and offline tool at:
... Differences were considered significant when P < 0.05, or P > 0.05 with large differences of observed effects (as suggested in refs. 92,93 ). ...
Full-text available
Glycolytic intermediary metabolites such as fructose-1,6-bisphosphate can serve as signals, controlling metabolic states beyond energy metabolism. However, whether glycolytic metabolites also play a role in controlling cell fate remains unexplored. Here, we find that low levels of glycolytic metabolite 3-phosphoglycerate (3-PGA) can switch phosphoglycerate dehydrogenase (PHGDH) from cataplerosis serine synthesis to pro-apoptotic activation of p53. PHGDH is a p53-binding protein, and when unoccupied by 3-PGA interacts with the scaffold protein AXIN in complex with the kinase HIPK2, both of which are also p53-binding proteins. This leads to the formation of a multivalent p53-binding complex that allows HIPK2 to specifically phosphorylate p53-Ser46 and thereby promote apoptosis. Furthermore, we show that PHGDH mutants (R135W and V261M) that are constitutively bound to 3-PGA abolish p53 activation even under low glucose conditions, while the mutants (T57A and T78A) unable to bind 3-PGA cause constitutive p53 activation and apoptosis in hepatocellular carcinoma (HCC) cells, even in the presence of high glucose. In vivo, PHGDH-T57A induces apoptosis and inhibits the growth of diethylnitrosamine-induced mouse HCC, whereas PHGDH-R135W prevents apoptosis and promotes HCC growth, and knockout of Trp53 abolishes these effects above. Importantly, caloric restriction that lowers whole-body glucose levels can impede HCC growth dependent on PHGDH. Together, these results unveil a mechanism by which glucose availability autonomously controls p53 activity, providing a new paradigm of cell fate control by metabolic substrate availability.
... 34 One of the advantages of using APDs is that the technique is straightforward; APDs can be prepared at the time of care in a simple and relatively inexpensive manner with low morbidity. 28 Furthermore, as suggested by Amrhein et al., 35 the interpretation made about the direction of the effect of the intervention for certain outcomes was based on the point estimator (specifically on the absolute risk and clinical significance) and not on the statistical significance of the results (P-value). Therefore, the balance could be tipped in favor of using platelet derivatives in patients with cleft lip and palate. ...
... We performed statistical inference on the basis of estimation statistics, which is a much more powerful and informative tool than hypothesis testing [22][23][24]. Accordingly, no statistical test was performed and therefore no measure of statistical significance was reported [25][26][27][28]. Furthermore, we calculated Cohen's d to assess the relative magnitude of the effect sizes that we estimated [29]. ...
Full-text available
Background: The cerebellum and the brainstem are two brain structures involved in pain processing and modulation that have also been associated with migraine pathophysiology. The aim of this study was to investigate possible associations between the morphology of the cerebellum and brainstem and migraine, focusing on gray matter differences in these brain areas. Methods: The analyses were based on data from 712 individuals with migraine and 45,681 healthy controls from the UK Biobank study. Generalized linear models were used to estimate the mean gray matter volumetric differences in the brainstem and the cerebellum. The models were adjusted for important biological covariates such as BMI, age, sex, total brain volume, diastolic blood pressure, alcohol intake frequency, current tobacco smoking, assessment center, material deprivation, ethnic background, and a wide variety of health conditions. Secondary analyses investigated volumetric correlation between cerebellar sub-regions. Results: We found larger gray matter volumes in the cerebellar sub-regions V (mean difference: 72 mm 3 , 95% CI [13, 132]), crus I (mean difference: 259 mm 3 , 95% CI [9, 510]), VIIIa (mean difference: 120 mm 3 , 95% CI [0.9, 238]), and X (mean difference: 14 mm 3 , 95% CI [1, 27]). Conclusions: Individuals with migraine show larger gray matter volumes in several cerebellar sub-regions than controls. These findings support the hypothesis that the cerebellum plays a role in the pathophysiology of migraine.
... Statistical analysis and description of data are an essential step in the dissemination of science; however, many manuscripts in the medical and public health area present a series of statistical errors and/or in reporting them, which were also observed in studies included here: showing results with p-value only and without confidence intervals; reporting "p = ns" or "p < 0.05" or other arbitrary bounds instead of reporting exact p-values; not discussing sources of potential bias and confounding factors, among many others [60,61]. Presenting only the p-value without any context or other evidence is too limiting, so authors need to contextualize the p-value found with parameters such as quality of study design, internal validity, confidence intervals, numerical summaries and data graphics, as well as reinforce the difference between statistical significance and clinical relevance [62,63]. ...
Full-text available
This systematic review aimed to identify the influence of occupational stress on the body mass index of hospital workers. After registering the protocol at PROSPERO (CRD42022331846), we started this systematic review following a search in seven databases, gray literature, as well as manual search and contact with specialists. The selection of studies was performed independently by two evaluators following the inclusion criteria: observational studies evaluating adult hospital workers, in which occupational stress was considered exposure and body composition as a result. The risk of bias in the included studies was assessed using the Joanna Briggs Institute Critical Appraisal checklist. We used the Grading of Recommendations Assessment, Development and Evaluation to grade the certainty of the evidence. Qualitative results were presented and synthesized through a qualitative approach, with simplified information in a narrative form. A total of 12 studies met the eligibility criteria and were included. This review comprised 10,885 workers (2312 men; 1582 women; and 6991 workers whose gender was not identified). Ten studies were carried out only with health workers, and two included workers from other sectors besides health workers. This review showed a relationship between occupational stress and changes in body mass index in hospital workers. However, most studies presented a moderate or high risk of bias and low quality of the evidence. These findings can be useful for clinical practice, administrators and leaders and provide insights for future research in the field of worker health in the hospital setting.
Full-text available
The various debates around model selection paradigms are important, but in lieu of a consensus, there is a demonstrable need for a deeper appreciation of existing approaches, at least among the end-users of statistics and model selection tools. In the ecological literature, the Akaike information criterion (AIC) dominates model selection practices, and while it is a relatively straightforward concept, there exists what we perceive to be some common misunderstandings around its application. Two specific questions arise with surprising regularity among colleagues and students when interpreting and reporting AIC model tables. The first is related to the issue of ‘pretending’ variables, and specifically a muddled understanding of what this means. The second is related to p-values and what constitutes statistical support when using AIC. There exists a wealth of technical literature describing AIC and the relationship between p-values and AIC differences. Here, we complement this technical treatment and use simulation to develop some intuition around these important concepts. In doing so we aim to promote better statistical practices when it comes to using, interpreting and reporting models selected when using AIC.
Hand surgeons have the potential to improve patient care, both with their own research and by using evidenced-based practice. In this first part of a two-part article, we describe key steps for the analysis of clinical data using quantitative methodology. We aim to describe the principles of medical statistics and their relevance and use in hand surgery, with contemporaneous examples. Hand surgeons seek expertise and guidance in the clinical domain to improve their practice and patient care. Part of this process involves the critical analysis and appraisal of the research of others.
Full-text available
Until 2022, Vermont was one of the few US states that did not have an Environmental Justice (EJ) policy. In 2016, the Vermont Department of Environmental Conservation (DEC) initiated a process to create an EJ policy based on an agreement with the US Environmental Protection Agency (EPA). A coalition of academics, non-profit organization leaders, legal experts, and community-based partners formed in response to the DEC’s initial approach because it lacked a robust process to center the voices of the most vulnerable Vermonters. The coalition developed a mixed-method, community-based approach to ask, “What does EJ look like in Vermont?” This article reports the door-to-door survey portion of that broader research effort. The survey of 569 Vermont residents purposively sampled sites of likely environmental harm and health concerns and sites with existing relationships with activists and community organizations engaged in ongoing EJ struggles. The survey results use logistic regression to show that non-white respondents in the sites sampled were significantly more likely to be renters, to report exposures to mold, to have trouble paying for food and electricity, to lack access to public transportation, were less likely to own a vehicle, to have a primary care doctor, and reported higher rates of Lyme disease than white respondents. Our findings contribute to EJ theory regarding the co-productive relationship between environmental privilege and environmental harms within the context of persistent characterizations of Vermont as an environmental leader with abundant environmental benefits.
Full-text available
Obstructive sleep apnea (OSA) affects nearly one billion of the global adult population. It is associated with substantial burden in terms of quality of life, cognitive function, and cardiovascular health. Positive airway pressure (PAP) therapy, commonly considered the first-line treatment, is limited by low compliance and lacking efficacy on long-term cardiovascular outcomes. A substantial body of research has been produced investigating (novel) non-PAP treatments. With increased understanding of OSA pathogenesis, promising therapeutic approaches are emerging. There is an imperative need of high-quality synthesis of evidence; however, current systematic reviews and meta-analyses (SR/MA) on the topic demonstrate important methodological limitations and are seldom based on research questions that fully reflect the complex intricacies of OSA management. Here, we discuss the current challenges in management of OSA, the need of treatable traits based OSA treatment, the methodological limitations of existing SR/MA in the field, potential remedies, as well as future perspectives.
Full-text available
Statistical inference often fails to replicate. One reason is that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to biased reported effect sizes. Nonetheless, considerable non-replication is to be expected even without selective reporting, and generalizations from single studies are rarely if ever warranted. Honestly reported results must vary from replication to replication because of varying assumption violations and random variation; excessive agreement itself would suggest deeper problems, such as failure to publish results in conflict with group expectations or desires. A general perception of a “replication crisis” may thus reflect failure to recognize that statistical tests not only test hypotheses, but countless assumptions and the entire environment in which research takes place. Because of all the uncertain and unknown assumptions that underpin statistical inferences, we should treat inferential statistics as highly unstable local descriptions of relations between assumptions and data, rather than as providing generalizable inferences about hypotheses or models. And that means we should treat statistical results as being much more incomplete and uncertain than is currently the norm. Acknowledging this uncertainty could help reduce the allure of selective reporting: Since a small P-value could be large in a replication study, and a large P-value could be small, there is simply no need to selectively report studies based on statistical results. Rather than focusing our study reports on uncertain conclusions, we should thus focus on describing accurately how the study was conducted, what problems occurred, what data were obtained, what analysis methods were used and why, and what output those methods produced.
Full-text available
Many controversies in statistics are due primarily or solely to poor quality control in journals, bad statistical textbooks, bad teaching, unclear writing, and lack of knowledge of the historical literature. One way to improve the practice of statistics and resolve these issues is to do what initiators of the 2016 ASA statement did: take one issue at a time, have extensive discussions about the issue among statisticians of diverse backgrounds and perspectives and eventually develop and publish a broadly supported consensus on that issue. Upon completion of this task, we then move on to deal with another core issue in the same way. We propose as the next project a process that might lead quickly to a strong consensus that the term “statistically significant” and all its cognates and symbolic adjuncts be disallowed in the scientific literature except where focus is on the history of statistics and its philosophies and methodologies. Calculation and presentation of accurate p-values will often remain highly desirable though not obligatory. Supplementary materials for this article are available online in the form of an appendix listing the names and institutions of 48 other statisticians and scientists who endorse the principal propositions put forward here. © 2019, © 2019 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold, and only then is consideration, often scant, given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real-world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.
The “replication crisis” has been attributed to misguided external incentives gamed by researchers (the strategic-game hypothesis). Here, I want to draw attention to a complementary internal factor, namely, researchers’ widespread faith in a statistical ritual and associated delusions (the statistical-ritual hypothesis). The “null ritual,” unknown in statistics proper, eliminates judgment precisely at points where statistical theories demand it. The crucial delusion is that the p value specifies the probability of a successful replication (i.e., 1 – p), which makes replication studies appear to be superfluous. A review of studies with 839 academic psychologists and 991 students shows that the replication delusion existed among 20% of the faculty teaching statistics in psychology, 39% of the professors and lecturers, and 66% of the students. Two further beliefs, the illusion of certainty (e.g., that statistical significance proves that an effect exists) and Bayesian wishful thinking (e.g., that the probability of the alternative hypothesis being true is 1 – p), also make successful replication appear to be certain or almost certain, respectively. In every study reviewed, the majority of researchers (56%–97%) exhibited one or more of these delusions. Psychology departments need to begin teaching statistical thinking, not rituals, and journal editors should no longer accept manuscripts that report results as “significant” or “not significant.”
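The "replication delusion" described above, that a p-value of 0.05 implies a 95% chance of successful replication, can be checked by simulation. The sketch below is illustrative, not from the paper: it assumes the most favourable case for the delusion (the true effect exactly equals the originally observed one) and uses a simple known-variance z-test with an assumed sample size.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(2)
n = 25                      # per-group sample size (assumed)
se = math.sqrt(2 / n)       # standard error of a two-group mean difference
observed = 1.96 * se        # an original study that landed exactly at p = 0.05

# The delusion: "p = 0.05, so a replication will be significant with
# probability 1 - p = 0.95."  Even granting that the true effect equals
# the observed one, the replication success rate is only about one half.
hits = 0
trials = 20000
for _ in range(trials):
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(observed, 1.0) for _ in range(n)]
    diff = sum(b) / n - sum(a) / n
    if two_sided_p(diff / se) < 0.05:
        hits += 1

print(round(hits / trials, 2))  # close to 0.5, nowhere near 0.95
```

The intuition: a replication's test statistic is centred on the original z of 1.96, so roughly half the time it falls below the significance cutoff.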
There is no complete solution for the problem of abuse of statistics, but methodological training needs to cover cognitive biases and other psychosocial factors affecting inferences. The present paper discusses 3 common cognitive distortions: 1) dichotomania, the compulsion to perceive quantities as dichotomous even when dichotomization is unnecessary and misleading, as in inferences based on whether a P value is "statistically significant"; 2) nullism, the tendency to privilege the hypothesis of no difference or no effect when there is no scientific basis for doing so, as when testing only the null hypothesis; and 3) statistical reification, treating hypothetical data distributions and statistical models as if they reflect known physical laws rather than speculative assumptions for thought experiments. As commonly misused, null-hypothesis significance testing combines these cognitive problems to produce highly distorted interpretation and reporting of study results. Interval estimation has so far proven to be an inadequate solution because it involves dichotomization, an avenue for nullism. Sensitivity and bias analyses have been proposed to address reproducibility problems (Am J Epidemiol. 2017;186(6):646-647); these methods can indeed address reification, but they can also introduce new distortions via misleading specifications for bias parameters. P values can be reframed to lessen distortions by presenting them without reference to a cutoff, providing them for relevant alternatives to the null, and recognizing their dependence on all assumptions used in their computation; they nonetheless require rescaling for measuring evidence. I conclude that methodological development and training should go beyond coverage of mechanistic biases (e.g., confounding, selection bias, measurement error) to cover distortions of conclusions produced by statistical methods and psychosocial forces.
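One of the reframings suggested above, providing p-values for relevant alternatives to the null rather than a single cutoff-referenced test, amounts to reporting a p-value function. The sketch below is illustrative: the study summary (observed difference 0.30, standard error 0.15) and the grid of hypothesized effects are hypothetical numbers, and a normal approximation is assumed.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical study summary (assumed for illustration).
observed, se = 0.30, 0.15

# Instead of one p-value against the null with a 0.05 cutoff, report the
# p-value for a range of candidate effect sizes: every hypothesis with a
# large p is reasonably compatible with the data, the null included or not.
for hypothesized in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    p = two_sided_p((observed - hypothesized) / se)
    print(f"effect = {hypothesized:.1f}: p = {p:.2f}")
```

Here the null (effect = 0.0) gets p of about 0.05, but so does effect = 0.6, and effects near 0.3 get p near 1; singling out the null for special treatment is the "nullism" the abstract warns against.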
There needs to be a balance between maintaining the strictest statistical controls and allowing researchers some flexibility to pursue analysis of unexpected trends observed in a study beyond the limits of pre-registered primary analysis. Given a particular data set, it can seem entirely appropriate to look at the data and construct reasonable rules for data exclusion, coding, and analysis that can lead to statistical significance. In such a case, researchers need to perform only one test, but that test is conditional on the data. If data are gathered with no preconceptions at all, statistical significance can obviously be obtained even from pure noise by the simple means of repeatedly performing comparisons, excluding data in different ways, examining different interactions, controlling for different predictors, and so forth.
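The final point, that pure noise yields statistical significance once enough comparisons are tried, is easy to demonstrate. The sketch below is illustrative: the number of comparisons (standing in for combinations of outcomes, exclusion rules, and covariate choices), the sample size, and the known-variance z-test are all assumptions.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(3)
n = 40            # per-group sample size (assumed)
se = math.sqrt(2 / n)

# 100 comparisons on pure noise, e.g. 10 outcomes x 5 exclusion rules
# x 2 covariate sets.  Both groups are drawn from the same distribution,
# so every apparent difference is an artefact of the search.
trials = 100
p_values = []
for _ in range(trials):
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(0.0, 1.0) for _ in range(n)]  # no real effect
    diff = sum(b) / n - sum(a) / n
    p_values.append(two_sided_p(diff / se))

# Around 5 of 100 comparisons are expected to cross p < 0.05 by chance alone.
print(sum(p < 0.05 for p in p_values))
```

A researcher who runs only the one comparison the data suggested would never see the other 99, which is why the test is "conditional on the data" even when only one analysis is formally performed.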