Title: Redefine Statistical Significance
Authors: Daniel J. Benjamin1*, James O. Berger2, Magnus Johannesson3*, Brian A.
Nosek4,5, E.-J. Wagenmakers6, Richard Berk7, Kenneth A. Bollen8, Björn Brembs9,
Lawrence Brown10, Colin Camerer11, David Cesarini12,13, Christopher D. Chambers14,
Merlise Clyde2, Thomas D. Cook15,16, Paul De Boeck17, Zoltan Dienes18, Anna Dreber3,
Kenny Easwaran19, Charles Efferson20, Ernst Fehr21, Fiona Fidler22, Andy P. Field18,
Malcolm Forster23, Edward I. George10, Richard Gonzalez24, Steven Goodman25, Edwin
Green26, Donald P. Green27, Anthony Greenwald28, Jarrod D. Hadfield29, Larry V.
Hedges30, Leonhard Held31, Teck Hua Ho32, Herbert Hoijtink33, James Holland
Jones39,40, Daniel J. Hruschka34, Kosuke Imai35, Guido Imbens36, John P.A. Ioannidis37,
Minjeong Jeon38, Michael Kirchler41, David Laibson42, John List43, Roderick Little44,
Arthur Lupia45, Edouard Machery46, Scott E. Maxwell47, Michael McCarthy48, Don
Moore49, Stephen L. Morgan50, Marcus Munafó51, 52, Shinichi Nakagawa53, Brendan
Nyhan54, Timothy H. Parker55, Luis Pericchi56, Marco Perugini57, Jeff Rouder58, Judith
Rousseau59, Victoria Savalei60, Felix D. Schönbrodt61, Thomas Sellke62, Betsy
Sinclair63, Dustin Tingley64, Trisha Van Zandt65, Simine Vazire66, Duncan J. Watts67,
Christopher Winship68, Robert L. Wolpert2, Yu Xie69, Cristobal Young70, Jonathan
Zinman71, Valen E. Johnson72*
1Center for Economic and Social Research and Department of Economics, University of
Southern California, Los Angeles, CA 90089-3332, USA.
2Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA.
3Department of Economics, Stockholm School of Economics, SE-113 83 Stockholm,
Sweden.
4University of Virginia, Charlottesville, VA 22908, USA.
5Center for Open Science, Charlottesville, VA 22903, USA.
6University of Amsterdam, Department of Psychology, 1018 VZ Amsterdam, The
Netherlands.
7University of Pennsylvania, School of Arts and Sciences and Department of
Criminology, Philadelphia, PA 19104-6286, USA.
8University of North Carolina Chapel Hill, Department of Psychology and Neuroscience,
Department of Sociology, Chapel Hill, NC 27599-3270, USA.
9Universität Regensburg, Universitätsstrasse 31, 93040 Regensburg, Germany.
10Department of Statistics, The Wharton School, University of Pennsylvania,
Philadelphia, PA 19104, USA.
11Division of the Humanities and Social Sciences, California Institute of Technology,
Pasadena, CA 91125, USA.
12Department of Economics, New York University, New York, NY 10012, USA.
13The Research Institute of Industrial Economics (IFN), SE-102 15 Stockholm, Sweden.
14Cardiff University Brain Research Imaging Centre (CUBRIC), CF24 4HQ, UK.
15Northwestern University, Evanston, IL 60208, USA.
16Mathematica Policy Research, Washington, DC, 20002-4221, USA.
17Department of Psychology, Quantitative Program, Ohio State University, Columbus,
OH 43210, USA.
18School of Psychology, University of Sussex, Brighton BN1 9QH, UK.
19Department of Philosophy, Texas A&M University, College Station, TX 77843-4237,
USA.
20Department of Psychology, Royal Holloway University of London, Egham Surrey
TW20 0EX, UK.
21Department of Economics, University of Zurich, 8006 Zurich, Switzerland.
22School of BioSciences and School of Historical & Philosophical Studies, University of
Melbourne, Vic 3010, Australia.
23Department of Philosophy, University of Wisconsin - Madison, Madison, WI 53706,
USA.
24Department of Psychology, University of Michigan, Ann Arbor, MI 48109-1043, USA.
25Stanford University, General Medical Disciplines, Stanford, CA 94305, USA.
26Department of Ecology, Evolution and Natural Resources SEBS, Rutgers University,
New Brunswick, NJ 08901-8551, USA.
27Department of Political Science, Columbia University in the City of New York, New
York, NY 10027, USA.
28Department of Psychology, University of Washington, Seattle, WA 98195-1525, USA.
29Institute of Evolutionary Biology School of Biological Sciences, The University of
Edinburgh, Edinburgh EH9 3JT, UK.
30Weinberg College of Arts & Sciences Department of Statistics, Northwestern
University, Evanston, IL 60208, USA.
31Epidemiology, Biostatistics and Prevention Institute (EBPI), University of Zurich,
8001 Zurich, Switzerland.
32National University of Singapore, Singapore 119077.
33Department of Methods and Statistics, Universiteit Utrecht, 3584 CH Utrecht, The
Netherlands.
34School of Human Evolution and Social Change, Arizona State University, Tempe, AZ
85287-2402, USA.
35Department of Politics and Center for Statistics and Machine Learning, Princeton
University, Princeton NJ 08544, USA.
36Stanford University, Stanford, CA 94305-5015, USA.
37Departments of Medicine, of Health Research and Policy, of Biomedical Data
Science, and of Statistics and Meta-Research Innovation Center at Stanford (METRICS),
Stanford University, Stanford, CA 94305, USA.
38 Advanced Quantitative Methods, Social Research Methodology, Department of
Education, Graduate School of Education & Information Studies, University of
California, Los Angeles, CA 90095-1521, USA.
39Department of Life Sciences, Imperial College London, Ascot SL5 7PY, UK.
40Department of Earth System Science, Stanford University, Stanford, CA 94305-4216,
USA.
41Department of Banking and Finance, University of Innsbruck and University of
Gothenburg, A-6020 Innsbruck, Austria.
42Department of Economics, Harvard University, Cambridge, MA 02138, USA.
43Department of Economics, University of Chicago, Chicago, IL 60637, USA.
44Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA.
45Department of Political Science, University of Michigan, Ann Arbor, MI 48109-1045,
USA.
46Department of History and Philosophy of Science, University of Pittsburgh, Pittsburgh
PA 15260, USA.
47Department of Psychology, University of Notre Dame, Notre Dame, IN 46556, USA.
48School of BioSciences, University of Melbourne, Vic 3010, Australia.
49Haas School of Business, University of California at Berkeley, Berkeley, CA 94720-
1900, USA.
50Johns Hopkins University, Baltimore, MD 21218, USA.
51MRC Integrative Epidemiology Unit, University of Bristol, Bristol BS8 1TU, UK.
52UK Centre for Tobacco and Alcohol Studies, School of Experimental Psychology,
University of Bristol, Bristol BS8 1TU, UK.
53Evolution & Ecology Research Centre and School of Biological, Earth and
Environmental Sciences, University of New South Wales, Sydney, NSW 2052, Australia.
54Department of Government, Dartmouth College, Hanover, NH 03755, USA.
55Department of Biology, Whitman College, Walla Walla, WA 99362, USA.
56Department of Mathematics, University of Puerto Rico, Rio Piedras Campus, San
Juan, PR 00936-8377.
57Department of Psychology, University of Milan - Bicocca, 20126 Milan, Italy.
58Department of Psychological Sciences, University of Missouri, Columbia, MO 65211,
USA.
59Université Paris Dauphine, 75016 Paris, France.
60Department of Psychology, The University of British Columbia, Vancouver, BC Canada
V6T 1Z4.
61Department of Psychology, Ludwig-Maximilians-University Munich, Leopoldstraße 13,
80802 Munich, Germany.
62Department of Statistics, Purdue University, West Lafayette, IN 47907-2067, USA.
63Department of Political Science, Washington University in St. Louis, St. Louis, MO
63130-4899, USA.
64Government Department, Harvard University, Cambridge, MA 02138, USA.
65Department of Psychology, Ohio State University, Columbus, OH 43210, USA.
66Department of Psychology, University of California, Davis, CA, 95616, USA.
67Microsoft Research, 641 Avenue of the Americas, 7th Floor, New York, NY 10011,
USA.
68Department of Sociology, Harvard University, Cambridge, MA 02138, USA.
69Department of Sociology, Princeton University, Princeton NJ 08544, USA.
70Department of Sociology, Stanford University, Stanford, CA 94305-2047, USA.
71Department of Economics, Dartmouth College, Hanover, NH 03755-3514, USA.
72Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
*Correspondence to: Daniel J. Benjamin; Magnus Johannesson; Valen E. Johnson.
One Sentence Summary: We propose to change the default P-value threshold for
statistical significance for claims of new discoveries from 0.05 to 0.005.
Main Text:
The lack of reproducibility of scientific studies has caused growing concern over
the credibility of claims of new discoveries based on “statistically significant” findings.
There has been much progress toward documenting and addressing several causes of this
lack of reproducibility (e.g., multiple testing, P-hacking, publication bias, and under-
powered studies). However, we believe that a leading cause of non-reproducibility has
not yet been adequately addressed: Statistical standards of evidence for claiming new
discoveries in many fields of science are simply too low. Associating “statistically
significant” findings with P < 0.05 results in a high rate of false positives even in the
absence of other experimental, procedural and reporting problems.
For fields where the threshold for defining statistical significance for new
discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would
immediately improve the reproducibility of scientific research in many fields. Results that
would currently be called “significant” but do not meet the new threshold should instead
be called “suggestive.” While statisticians have known the relative weakness of using
P ≈ 0.05 as a threshold for discovery, and the proposal to lower it to 0.005 is not new (1,
2), a critical mass of researchers now endorse this change.
We restrict our recommendation to claims of discovery of new effects. We do not
address the appropriate threshold for confirmatory or contradictory replications of
existing claims. We also do not advocate changes to discovery thresholds in fields that
have already adopted more stringent standards (e.g., genomics and high-energy physics
research; see Potential Objections below).
We also restrict our recommendation to studies that conduct null hypothesis
significance tests. We have diverse views about how best to improve reproducibility, and
many of us believe that other ways of summarizing the data, such as Bayes factors or
other posterior summaries based on clearly articulated model assumptions, are preferable
to P-values. However, changing the P-value threshold is simple, aligns with the training
undertaken by many researchers, and might quickly achieve broad acceptance.
Strength of evidence from P-values
In testing a point null hypothesis H0 against an alternative hypothesis H1 based on
data xobs, the P-value is defined as the probability, calculated under the null hypothesis,
that a test statistic is as extreme or more extreme than its observed value. The null
hypothesis is typically rejected—and the finding is declared “statistically significant”—if
the P-value falls below the (current) Type I error threshold α = 0.05.
From a Bayesian perspective, a more direct measure of the strength of evidence
for H1 relative to H0 is the ratio of their probabilities. By Bayes’ rule, this ratio may be
written as:

Pr(H1 | xobs) / Pr(H0 | xobs) = BF × [Pr(H1) / Pr(H0)] = BF × prior odds,   (1)

where BF is the Bayes factor that represents the evidence from the data, and the prior
odds can be informed by researchers’ beliefs, scientific consensus, and validated
evidence from similar research questions in the same field. Multiple hypothesis testing,
P-hacking, and publication bias all reduce the credibility of evidence. Some of these
practices reduce the prior odds of H1 relative to H0 by changing the population of
hypothesis tests that are reported. Prediction markets (3) and analyses of replication
results (4) both suggest that for psychology experiments, the prior odds of H1 relative to
H0 may be only about 1:10. A similar number has been suggested in cancer clinical trials,
and the number is likely to be much lower in preclinical biomedical research (5).
There is no unique mapping between the P-value and the Bayes factor, since the
Bayes factor depends on H1. However, the connection between the two quantities can be
evaluated for particular test statistics under certain classes of plausible alternatives (Fig. 1).

Fig. 1. Relationship between the P-value and the Bayes Factor. The Bayes factor (BF)
is defined as f(xobs | H1) / f(xobs | H0). The figure assumes that observations are drawn
i.i.d. according to x ~ N(μ, σ²), where the mean μ is unknown and the variance σ² is
known. The P-value is from a two-sided z test (or equivalently a one-sided χ² test with
one degree of freedom) of the null hypothesis that μ = 0.
“Power”: BF obtained by defining H1 as putting ½ probability on μ = ±m for the value
of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size
typical of that which is implicitly assumed by researchers during experimental design.
“Likelihood Ratio Bound”: BF obtained by defining H1 as putting ½ probability on
μ = ±x̄, where x̄ is approximately equal to the mean of the observations. These BFs are
upper bounds among the class of all H1’s that are symmetric around the null, but they are
improper because the data are used to define H1. “UMPBT”: BF obtained by defining H1
according to the uniformly most powerful Bayesian test (2) that places ½ probability on
μ = ±w, where w is the alternative hypothesis that corresponds to a one-sided test of size
0.0025. This curve is indistinguishable from the “Power” curve that would be obtained if
the power used in its definition was 80% rather than 75%. “Local-H1 Bound”: BF =
1/(−e p ln p), where p is the P-value, is a large-sample upper bound on the BF from among all
unimodal alternative hypotheses that have a mode at the null and satisfy certain regularity
conditions (15). For more details, see the Supplementary Online Materials (SOM).
A two-sided P-value of 0.05 corresponds to Bayes factors in favor of H1 that range from
about 2.5 to 3.4 under reasonable assumptions about H1 (Fig. 1). This is weak evidence
from at least three perspectives. First, conventional Bayes factor categorizations (6)
characterize this range as “weak” or “very weak.” Second, we suspect many scientists
would guess that P ≈ 0.05 implies stronger support for H1 than a Bayes factor of 2.5 to
3.4. Third, using equation (1) and prior odds of 1:10, a P-value of 0.05 corresponds to at
least 3:1 odds (i.e., the reciprocal of the product 1/10 × 3.4) in favor of the null hypothesis!
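These Bayes factor bounds are easy to reproduce numerically. The sketch below (Python, standard library only; the function names are ours, not from the paper's supplementary code) evaluates the "Likelihood Ratio Bound" 0.5·exp(z²/2) and the "Local-H1 Bound" 1/(−e·p·ln p) at both thresholds.

```python
from math import exp, log, e
from statistics import NormalDist

def z_from_p(p: float) -> float:
    """z statistic whose two-sided P-value is p."""
    return NormalDist().inv_cdf(1 - p / 2)

def lr_bound(p: float) -> float:
    """Likelihood-ratio upper bound on the Bayes factor: 0.5 * exp(z^2 / 2)."""
    z = z_from_p(p)
    return 0.5 * exp(z * z / 2)

def local_h1_bound(p: float) -> float:
    """Local-H1 upper bound: 1 / (-e * p * ln p), valid for p < 1/e."""
    return -1 / (e * p * log(p))

for p in (0.05, 0.005):
    print(p, round(local_h1_bound(p), 1), round(lr_bound(p), 1))
```

At P = 0.05 the two bounds bracket the 2.5 to 3.4 range quoted above; at P = 0.005 they give roughly 13.9 and 25.7, matching the red labels in Fig. 1.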
Why 0.005?
The choice of any particular threshold is arbitrary and involves a trade-off
between Type I and II errors. We propose 0.005 for two reasons. First, a two-sided P-
value of 0.005 corresponds to Bayes factors between approximately 14 and 26 in favor of
H1. This range represents “substantial” to “strong” evidence according to conventional
Bayes factor classifications (6).
Second, in many fields the P < 0.005 standard would reduce the false positive
rate to levels we judge to be reasonable. If we let φ denote the proportion of null
hypotheses that are true, (1 − β) the power of tests in rejecting false null hypotheses, and
α the Type I error/significance threshold, then as the population of tested hypotheses
becomes large, the false positive rate (i.e., the proportion of true null effects among the
total number of statistically significant findings) can be approximated by

false positive rate ≈ αφ / [αφ + (1 − β)(1 − φ)].   (2)

For different levels of the prior odds that there is a true effect, (1 − φ)/φ, and for significance
thresholds α = 0.05 and α = 0.005, Figure 2 shows the false positive rate as a function
of power 1 − β.
Fig. 2. Relationship between the P-value threshold, power, and the false positive
rate. Calculated according to Equation (2), with prior odds defined as (1 − φ)/φ =
Pr(H1)/Pr(H0). For more details, see the Supplementary Online Materials (SOM).
In many studies, statistical power is low (e.g., ref. 7). Fig. 2 demonstrates that low
statistical power and 𝛼=0.05 combine to produce high false positive rates.
For many, the calculations illustrated by Fig. 2 may be unsettling. For example,
the false positive rate is greater than 33% with prior odds of 1:10 and a P-value threshold
of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005
would reduce this minimum false positive rate to 5%. Similar reductions in false positive
rates would occur over a wide range of statistical powers.
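The 33% and 5% figures above follow directly from Equation (2). A minimal sketch (Python; `false_positive_rate` is our own name, not from the paper's code):

```python
def false_positive_rate(alpha: float, power: float, phi: float) -> float:
    """Equation (2): large-sample false positive rate.
    alpha: significance threshold; power: 1 - beta; phi: P(null is true)."""
    return alpha * phi / (alpha * phi + power * (1 - phi))

phi = 10 / 11  # prior odds of a true effect of 1:10

# Even with perfect power, alpha = 0.05 leaves a false positive rate of 1/3...
print(false_positive_rate(0.05, 1.0, phi))
# ...while alpha = 0.005 reduces that floor to about 5%.
print(false_positive_rate(0.005, 1.0, phi))
```

Because the rate is decreasing in power, power = 1 gives the minimum; any realistic power makes both rates worse, but the roughly seven-fold gap between the thresholds persists.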
Empirical evidence from recent replication projects in psychology and
experimental economics provides insights into the prior odds in favor of H1. In both
projects, the rate of replication (i.e., significance at P < 0.05 in the replication in a
consistent direction) was roughly double for initial studies with P < 0.005 relative to
initial studies with 0.005 < P < 0.05: 50% versus 24% for psychology (8), and 85%
versus 44% for experimental economics (9). Although based on relatively small samples
of studies (93 in psychology, 16 in experimental economics, after excluding initial studies
with P > 0.05), these numbers are suggestive of the potential gains in reproducibility that
would accrue from the new threshold of P < 0.005 in these fields. In biomedical research,
96% of a sample of recent papers claim statistically significant results with the P < 0.05
threshold (10). However, replication rates were very low (5) for these studies, suggesting
a potential for gains by adopting this new standard in these fields as well.
Potential Objections
We now address the most compelling arguments against adopting this higher
standard of evidence.
The false negative rate would become unacceptably high. Evidence that does not
reach the new significance threshold should be treated as suggestive, and where possible
further evidence should be accumulated; indeed, the combined results from several
studies may be compelling even if any particular study is not. Failing to reject the null
hypothesis does not mean accepting the null hypothesis. Moreover, the false negative rate
will not increase if sample sizes are increased so that statistical power is held constant.
For a wide range of common statistical tests, transitioning from a P-value
threshold of 𝛼=0.05 to 𝛼=0.005 while maintaining 80% power would require an
increase in sample sizes of about 70%. Such an increase means that fewer studies can be
conducted using current experimental designs and budgets. But Figure 2 shows the
benefit: false positive rates would typically fall by factors greater than two. Hence,
considerable resources would be saved by not performing future studies based on false
premises. Increasing sample sizes is also desirable because studies with small sample
sizes tend to yield inflated effect size estimates (11), and publication and other biases
may be more likely in an environment of small studies (12). We believe that efficiency
gains would far outweigh losses.
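The roughly 70% figure can be checked with the textbook sample-size formula for a two-sided z test, under which the required n scales as (z_{α/2} + z_{power})². This sketch assumes that formula (the function name is ours):

```python
from statistics import NormalDist

def required_n_factor(alpha: float, power: float) -> float:
    """Relative sample size for a two-sided z test: (z_{alpha/2} + z_{power})^2."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

# Moving from alpha = 0.05 to 0.005 at 80% power:
ratio = required_n_factor(0.005, 0.80) / required_n_factor(0.05, 0.80)
print(ratio)  # about 1.70, i.e. roughly a 70% larger sample
```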
The proposal does not address multiple hypothesis testing, P-hacking, publication
bias, low power, or other biases (e.g., confounding, selective reporting, measurement
error), which are arguably the bigger problems. We agree. Reducing the P-value
threshold complements—but does not substitute for—solutions to these other problems,
which include good study design, ex ante power calculations, pre-registration of planned
analyses, replications, and transparent reporting of procedures and all statistical analyses.
The appropriate threshold for statistical significance should be different for
different research communities. We agree that the significance threshold selected for
claiming a new discovery should depend on the prior odds that the null hypothesis is true,
the number of hypotheses tested, the study design, the relative cost of Type I versus Type
II errors, and other factors that vary by research topic. For exploratory research with very
low prior odds (well outside the range in Figure 2), even lower significance thresholds
than 0.005 are needed. Recognition of this issue led the genetics research community to
move to a “genome-wide significance threshold” of 5×10⁻⁸ over a decade ago. And in
high-energy physics, the tradition has long been to define significance by a “5-sigma”
rule (roughly a P-value threshold of 3×10⁻⁷). We are essentially suggesting a move from a
2-sigma rule to a 3-sigma rule.
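These sigma correspondences can be verified with the standard normal distribution (Python standard library; the 5-sigma figure uses a one-sided tail, as is conventional in particle physics):

```python
from statistics import NormalDist

nd = NormalDist()
one_sided = lambda k: 1 - nd.cdf(k)        # one-sided tail beyond k sigma
two_sided = lambda k: 2 * (1 - nd.cdf(k))  # two-sided P-value at k sigma

print(two_sided(2))  # ~0.046: the familiar 2-sigma / P < 0.05 regime
print(two_sided(3))  # ~0.0027: a 3-sigma rule, the same order as P < 0.005
print(one_sided(5))  # ~2.9e-7: the 5-sigma rule, roughly 3e-7
```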
Our recommendation applies to disciplines with prior odds broadly in the range
depicted in Figure 2, where use of P < 0.05 as a default is widespread. Within those
disciplines, it is helpful for consumers of research to have a consistent benchmark. We
feel the default should be shifted.
Changing the significance threshold is a distraction from the real solution, which
is to replace null hypothesis significance testing (and bright-line thresholds) with more
focus on effect sizes and confidence intervals, treating the P-value as a continuous
measure, and/or a Bayesian method. Many of us agree that there are better approaches to
statistical analyses than null hypothesis significance testing, but as yet there is no
consensus regarding the appropriate choice of replacement. For example, a recent
statement by the American Statistical Association addressed numerous issues regarding
the misinterpretation and misuse of P-values (as well as the related concept of statistical
significance), but failed to make explicit policy recommendations to address these
shortcomings (13). Even after the significance threshold is changed, many of us will
continue to advocate for alternatives to null hypothesis significance testing.
Concluding remarks
Ronald Fisher understood that the choice of 0.05 was arbitrary when he
introduced it (14). Since then, theory and empirical evidence have demonstrated that a
lower threshold is needed. A much larger pool of scientists are now asking a much larger
number of questions, possibly with much lower prior odds of success.
For research communities that continue to rely on null hypothesis significance
testing, reducing the P-value threshold for claims of new discoveries to 0.005 is an
actionable step that will immediately improve reproducibility. We emphasize that this
proposal is about standards of evidence, not standards for policy action nor standards for
publication. Results that do not reach the threshold for statistical significance (whatever
it is) can still be important and merit publication in leading journals if they address
important research questions with rigorous methods. This proposal should not be used to
reject publications of novel findings with 0.005 < P < 0.05 properly labeled as suggestive
evidence. We should reward quality and transparency of research as we impose these
more stringent standards, and we should monitor how researchers’ behaviors are affected
by this change. Otherwise, science runs the risk that the more demanding threshold for
statistical significance will be met to the detriment of quality and transparency.
Journals can help transition to the new statistical significance threshold. Authors
and readers can themselves take the initiative by describing and interpreting results more
appropriately in light of the new proposed definition of “statistical significance.” The
new significance threshold will help researchers and readers to understand and
communicate evidence more accurately.
References and Notes:
1. A. G. Greenwald et al., Effect sizes and p values: What should be reported and
what should be replicated? Psychophysiology 33, 175-183 (1996).
2. V. E. Johnson, Revised standards for statistical evidence. Proc. Natl. Acad. Sci.
U.S.A. 110, 19313-19317 (2013).
3. A. Dreber et al., Using prediction markets to estimate the reproducibility of
scientific research. Proc. Natl. Acad. Sci. U.S.A. 112, 15343-15347 (2015).
4. V. E. Johnson et al., On the reproducibility of psychological science. J. Am. Stat.
Assoc. 112, 1-10 (2016).
5. C. G. Begley, J. P. A. Ioannidis, Reproducibility in science: Improving the
standard for basic and preclinical research. Circ. Res. 116, 116-126 (2015).
6. R. E. Kass, A. E. Raftery, Bayes Factors. J. Am. Stat. Assoc. 90, 773-795 (1995).
7. D. Szucs, J. P. A. Ioannidis, Empirical assessment of published effect sizes and
power in the recent cognitive neuroscience and psychology literature. PLoS Biol.
15, (2017).
8. Open Science Collaboration, Estimating the reproducibility of psychological
science. Science 349, (2015).
9. C. Camerer et al., Evaluating replicability of laboratory experiments in
economics. Science 351, 1433-1436 (2016).
10. D. Chavalarias et al., Evolution of reporting p values in the biomedical literature,
1990-2015. JAMA 315, 1141-1148 (2016).
11. A. Gelman, J. Carlin, Beyond power calculations: Assessing Type S (Sign) and
Type M (Magnitude) errors. Perspect. Psychol. Sci. 9, 641-651 (2014).
12. D. Fanelli, R. Costas, J. P. A. Ioannidis, Meta-assessment of bias in science. Proc.
Natl. Acad. Sci. U.S.A. 114, 3714-3719 (2017).
13. R. L. Wasserstein, N. A. Lazar, The ASA’s statement on p-values: Context,
process, and purpose. Am. Stat. 70 (and online comments), 129-133 (2016).
14. R. A. Fisher, Statistical Methods for Research Workers (Oliver & Boyd,
Edinburgh, 1925).
15. T. Sellke, M. J. Bayarri, J. O. Berger, Calibration of p-values for testing precise
null hypotheses. Am. Stat. 55, 62-71 (2001).
Acknowledgements: We thank Deanna L. Lormand, Rebecca Royer and Anh Tuan
Nguyen Viet for excellent research assistance.
Supplementary Materials:
Supplementary Text
R code used to generate Figures 1 and 2
Figure 1
All four curves in Figure 1 describe the relationship between (i) a P-value based
on a two-sided normal test and (ii) a Bayes factor or a bound on a Bayes factor. The P-
values are based on a two-sided test that the mean 𝜇 of an independent and identically
distributed sample of normally distributed random variables is 0. The variance of the
observations is known. Without loss of generality, we assume that the variance is 1 and
the sample size is also 1. The curves in the figure differ according to the alternative
hypotheses that they assume for calculating (ii).
Because these curves involve two-sided tests, all alternative hypotheses are restricted to
be symmetric around 0. That is, the density assumed for the value of 𝜇 under the
alternative hypothesis is always assumed to satisfy f(μ) = f(−μ).
The curve labeled “Power” corresponds to defining the alternative hypothesis so that
power is 75% in a two-sided 5% test. This is achieved by assuming that μ under the
alternative hypothesis is equal to ±(z_0.975 + z_0.75) = ±2.63. That is, the alternative
hypothesis places ½ its prior mass on 2.63 and ½ its mass on −2.63.
The curve labeled UMPBT corresponds to the uniformly most powerful Bayesian test (2)
that corresponds to a classical, two-sided test of size 𝛼=0.005. The alternative
hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null
hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is
nearly identical to the “Power” curve if that curve had been defined using 80% power,
rather than 75% power. The Power curve for 80% power would place ½ its mass at ±2.80.
The Likelihood Ratio Bound curve represents an approximate upper bound on the Bayes
factor obtained by defining the alternative hypothesis as putting ½ its mass on ±𝑥, where
𝑥 is the observed sample mean. Over the range of P-values displayed in the figure, this
alternative hypothesis very closely approximates the maximum Bayes factor that can be
attained from among the set of alternative hypotheses constrained to be of the form
0.5 × [f(μ) + f(−μ)] for some density function f.
The Local-H1 curve is described fully in the figure caption.
Equation 2 and Figure 2
This equation defines the large-sample relationship between the false positive
rate, power 1𝛽, type I error rate 𝛼, and the probability that the null hypothesis is true
when a large number of independent experiments have been conducted. More
specifically, suppose that n independent hypothesis tests are conducted, and suppose that
in each test the probability that the null hypothesis is true is 𝜙. If the null hypothesis is
true, assume that the probability that it is falsely rejected (i.e., a false positive occurs) is
α. For test j = 1, …, n, define the random variable X_j = 1 if the null hypothesis is
true and the null hypothesis is rejected, and X_j = 0 if either the alternative hypothesis is
true or the null hypothesis is not rejected. Note that the X_j are independent Bernoulli
random variables with Pr(X_j = 1) = αφ. Also for test j, define another random variable
Y_j = 1 if the alternative hypothesis is true and the null hypothesis is rejected, and Y_j = 0
otherwise. It follows that the Y_j are independent Bernoulli random variables with
Pr(Y_j = 1) = (1 − φ)(1 − β). Note that Y_j is independent of Y_k for j ≠ k, but Y_j is not
independent of X_j. For the n experiments, the false positive rate can then be written as:

false positive rate = Σ_{j=1}^{n} X_j / (Σ_{j=1}^{n} X_j + Σ_{j=1}^{n} Y_j).

By the strong law of large numbers, (1/n) Σ_{j=1}^{n} X_j converges almost surely to αφ, and
(1/n) Σ_{j=1}^{n} Y_j converges almost surely to (1 − φ)(1 − β). Application of the continuous
mapping theorem yields

false positive rate → αφ / [αφ + (1 − φ)(1 − β)].

Figure 2 illustrates this relationship for various values of α and prior odds for the
alternative, (1 − φ)/φ.
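As a cross-check on this limit, the X_j/Y_j construction can be simulated directly. This is our own Python sketch (the paper's supplementary code is in R):

```python
import random

def fpr_limit(alpha: float, power: float, phi: float) -> float:
    """Large-sample false positive rate from Equation (2)."""
    return alpha * phi / (alpha * phi + power * (1 - phi))

def fpr_simulated(alpha: float, power: float, phi: float,
                  n: int = 200_000, seed: int = 1) -> float:
    """Simulate n independent tests using the X_j / Y_j construction."""
    rng = random.Random(seed)
    false_pos = true_pos = 0
    for _ in range(n):
        if rng.random() < phi:                 # null hypothesis is true
            false_pos += rng.random() < alpha  # rejected anyway: X_j = 1
        else:                                  # alternative is true
            true_pos += rng.random() < power   # correctly rejected: Y_j = 1
    return false_pos / (false_pos + true_pos)

phi = 10 / 11  # prior odds 1:10
print(fpr_limit(0.05, 0.5, phi))      # exactly 0.5 for these parameters
print(fpr_simulated(0.05, 0.5, phi))  # close to 0.5
```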
R code used to generate Figure 1:
# P-value grid (reconstructed; the original definition was truncated in the source)
p = exp(seq(log(0.000011), log(0.14), length.out = 1000))
xbar = qnorm(1 - p/2)  # observed z statistic for a two-sided P-value p

# alternative based on 75% power in a two-sided 5% test
type1Power = 0.05
type2 = 0.25
muPower = qnorm(1 - type2) + qnorm(1 - type1Power/2)
bfPow = 0.5*(dnorm(xbar, muPower, 1) + dnorm(xbar, -muPower, 1))/dnorm(xbar, 0, 1)

# UMPBT alternative for a two-sided test of size 0.005
muUMPBT = qnorm(0.9975)
bfUMPBT = 0.5*(dnorm(xbar, muUMPBT, 1) + dnorm(xbar, -muUMPBT, 1))/dnorm(xbar, 0, 1)

# two-sided "LR" bound
bfLR = 0.5/exp(-0.5*xbar^2)

# Local bound; no need for two-sided adjustment
bfLocal = -1/(exp(1)*p*log(p))

# coordinates for dashed lines
U_005 = 0.5/exp(-0.5*qnorm(1 - 0.005/2)^2)  # LR bound at P = 0.005 (about 25.7)
L_005 = -1/(exp(1)*0.005*log(0.005))        # local bound at P = 0.005 (about 13.9)
U_05 = 0.5/exp(-0.5*qnorm(1 - 0.05/2)^2)    # LR bound at P = 0.05 (about 3.4)
L_05 = 0.5*(dnorm(qnorm(0.975), muUMPBT, 1) +
            dnorm(qnorm(0.975), -muUMPBT, 1))/dnorm(qnorm(0.975), 0, 1)  # UMPBT at P = 0.05 (about 2.4)

# plot the four curves on log-log axes
matplot(p, cbind(bfPow, bfLR, bfUMPBT, bfLocal), type = "l", lty = 1,
        log = "xy", col = c("black", "red", "blue", "green4"),
        xlab = expression(paste(italic(P), "-value")),
        ylab = "Bayes Factor",
        ylim = c(0.3, 100), bty = "n", xaxt = "n", yaxt = "n")
legend(0.015, 100,
       c("Power", "Likelihood Ratio Bound", "UMPBT",
         expression(paste("Local-", italic(H)[1], " Bound"))),
       lty = 1, col = c("black", "red", "blue", "green4"), cex = 0.8)

# customizing axes
# x axis
axis(side = 1, at = c(0.0001, 0.001, 0.005, 0.01, 0.05),
     labels = c("0.0001", "0.001", "0.005", "0.01", "0.05"),
     tck = -0.01, padj = -1.1, cex.axis = .8)
# y axis on the left - main
axis(side = 2, at = c(0.3, 0.5, 1, 2, 5, 10, 20, 50, 100), lwd = 1, las = 1,
     tck = -0.01, hadj = 0.6, cex.axis = .8)
# y axis on the left - secondary (red labels)
axis(side = 2, at = c(L_005, U_005), labels = c(13.9, 25.7), lwd = 1, las = 1,
     tck = -0.01, hadj = 0.6, cex.axis = .6, col.axis = "red")
# y axis on the right - main
axis(side = 4, at = c(0.3, 0.5, 1, 2, 5, 10, 20, 50, 100), lwd = 1, las = 1,
     tck = -0.01, hadj = 0.4, cex.axis = .8)
# y axis on the right - secondary (red labels)
axis(side = 4, at = c(L_05, U_05), labels = c(2.4, 3.4), lwd = 1, las = 1,
     tck = -0.01, hadj = 0.4, cex.axis = .6, col.axis = "red")

### dashed lines
segments(x0 = 0.000011, y0 = U_005, x1 = 0.005, y1 = U_005, col = "gray40", lty = 2)
segments(x0 = 0.000011, y0 = L_005, x1 = 0.005, y1 = L_005, col = "gray40", lty = 2)
segments(x0 = 0.005, y0 = 0.00000001, x1 = 0.005, y1 = U_005, col = "gray40", lty = 2)
segments(x0 = 0.05, y0 = U_05, x1 = 0.14, y1 = U_05, col = "gray40", lty = 2)
segments(x0 = 0.05, y0 = L_05, x1 = 0.14, y1 = L_05, col = "gray40", lty = 2)
segments(x0 = 0.05, y0 = 0.00000001, x1 = 0.05, y1 = U_05, col = "gray40", lty = 2)
R code used to generate Figure 2:
pow1=c(5:999)/1000 # power range for 0.005 tests
pow2=c(50:999)/1000 # power range for 0.05 tests
alpha=0.005 # test size
pi0=5/6 # prior probability
N=10^6 # doesn't matter
#graph margins
plot(pow1,alpha*N*pi0/(alpha*N*pi0+pow1*(1-pi0)*N),type='n',ylim =
c(0,1), xlim = c(0,1.5),
xlab='Power ',
ylab='False positive rate', bty="n", xaxt="n", yaxt="n")
#grid lines
segments(x0 = -0.058, y0 = 0, x1 = 1, y1 = 0,lty=1,col = "gray92")
segments(x0 = -0.058, y0 = 0.2, x1 = 1, y1 = 0.2,lty=1,col =
segments(x0 = -0.058, y0 = 0.4, x1 = 1, y1 = 0.4,lty=1,col =
segments(x0 = -0.058, y0 = 0.6, x1 = 1, y1 = 0.6,lty=1,col =
segments(x0 = -0.058, y0 = 0.8, x1 = 1, y1 = 0.8,lty=1,col =
segments(x0 = -0.058, y0 = 1, x1 = 1, y1 = 1,lty=1,col = "gray92")
# Bracket endpoints: the false positive rate at (near-)maximal power for each
# prior-odds curve. The extracted script reused alpha = 0.005 and pi0 = 5/6 for
# all six values, which would make them identical; alpha and pi0 are restored
# per curve here. Note pow1[995] = pow2[950] = 0.999.
fpr = function(a,p0,pow) a*N*p0/(a*N*p0+pow*(1-p0)*N)
odd_1_5_1 = fpr(0.005,5/6,pow1[995]) # alpha = 0.005, prior odds 1:5
odd_1_5_2 = fpr(0.05,5/6,pow2[950]) # alpha = 0.05, prior odds 1:5
odd_1_10_1 = fpr(0.005,10/11,pow1[995]) # alpha = 0.005, prior odds 1:10
odd_1_10_2 = fpr(0.05,10/11,pow2[950]) # alpha = 0.05, prior odds 1:10
odd_1_40_1 = fpr(0.005,40/41,pow1[995]) # alpha = 0.005, prior odds 1:40
odd_1_40_2 = fpr(0.05,40/41,pow2[950]) # alpha = 0.05, prior odds 1:40
#customizing axes
axis(side=2,at=c(-0.5,0,0.2,0.4,0.6,0.8,1.0),labels = c(-0.5,0,0.2,0.4,0.6,0.8,1.0),
lwd=1,las= 1,tck = -0.01, hadj = 0.4, cex.axis = .8)
axis(side=1,at=c(-0.5,0,0.2,0.4,0.6,0.8,1.0),labels = c(-0.5,0,0.2,0.4,0.6,0.8,1.0),
lwd=1,las= 1, tck = -0.01, padj = -1.1, cex.axis = .8)
legend(1.05,1,c("Prior odds = 1:40","Prior odds = 1:10","Prior odds
= 1:5"),pch=c(15,15,15),
col=c("green","red","blue"), cex = 1)
############### Use these commands to add brackets in Figure 2
#add text and brackets; brackets() is provided by the pBrackets package
library(pBrackets)
text(1.11,(odd_1_5_2+odd_1_40_2)/2, expression(paste(italic(P)," <
0.05 threshold")), cex = 0.9,adj=0)
text(1.11,(odd_1_5_1+odd_1_40_1)/2, expression(paste(italic(P)," <
0.005 threshold")), cex = 0.9,adj=0)
brackets(1.03, odd_1_40_1, 1.03, odd_1_5_1, h = NULL, ticks = 0.5,
curvature = 0.7, type = 1,
col = 1, lwd = 1, lty = 1, xpd = FALSE)
brackets(1.03, odd_1_40_2, 1.03, odd_1_5_2, h = NULL, ticks = 0.5,
curvature = 0.7, type = 1,
col = 1, lwd = 1, lty = 1, xpd = FALSE)
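The quantity Figure 2 plots can be checked at a point: the false positive rate among significant findings is alpha*pi0 / (alpha*pi0 + power*(1 - pi0)), where pi0 is the prior probability of the null. A minimal sketch in Python (rather than R, so it is self-contained), evaluated at the bracket-endpoint power of 0.999 for prior odds of a true effect of 1:10:

```python
def false_positive_rate(alpha, power, pi0):
    """Share of 'significant' findings that are false positives, given the
    significance threshold alpha, test power, and prior probability pi0
    that the null hypothesis is true."""
    return alpha * pi0 / (alpha * pi0 + power * (1 - pi0))

# Prior odds of a true effect 1:10 (pi0 = 10/11), power 0.999 as in the
# bracket endpoints above:
print(round(false_positive_rate(0.05, 0.999, 10 / 11), 2))   # -> 0.33
print(round(false_positive_rate(0.005, 0.999, 10 / 11), 2))  # -> 0.05
```

Moving the threshold from 0.05 to 0.005 thus lowers the false positive rate from roughly a third of claimed discoveries to about 5% under these prior odds.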
... In order to reduce the lack of field-wise replicability, it was suggested to lower the standard significance level for publication as a discovery, α, from 0.05 to 0.005. This suggestion was supported by the 72 researchers who authored Benjamin et al. (2018), who believe that a leading cause of published false discoveries is that the statistical standards of evidence for claiming new discoveries in many fields of science are too low. A clear consequence of applying their suggestion is a substantial loss of power to make true discoveries. ...
... Remark 4.1. Decreasing the rejection threshold from α = 0.05 to α = 0.005, following Benjamin et al. (2018), indeed decreases the probability of making a false replicability claim with the naive approach. However, when the number of studies n is large enough, this probability can still be close to 1. ...
... Of course, as α decreases, the power to detect a false no-replicability null hypothesis decreases, and the deterioration in power may be large when moving from α = 0.05 to α = 0.005. The suggestion by Benjamin et al. (2018) to reduce α was motivated by the need for a stricter statistical standard for claiming new discoveries. Since the claim of replicability is stronger than the claim of a single new discovery, it may not be necessary to raise the standards even higher by reducing α. ...
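The power cost discussed above can be quantified. For a two-sided z-test, the sample size required to hold power fixed scales with (z_alpha/2 + z_beta)^2, so the relative cost of a stricter threshold is the ratio of these quantities; at 80% power, moving from α = 0.05 to α = 0.005 requires roughly 70% larger samples. A sketch using only the Python standard library:

```python
from statistics import NormalDist

def sample_size_ratio(alpha_new, alpha_old, power=0.8):
    """Ratio of sample sizes needed for a two-sided z-test to reach the same
    power at two significance thresholds: n is proportional to
    (z_{alpha/2} + z_beta)^2."""
    z = NormalDist().inv_cdf
    z_beta = z(power)
    return ((z(1 - alpha_new / 2) + z_beta) /
            (z(1 - alpha_old / 2) + z_beta)) ** 2

print(round(sample_size_ratio(0.005, 0.05), 2))  # -> 1.7, i.e. ~70% larger samples
```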
Meta-analysis is routinely performed in many scientific disciplines. This analysis is attractive since discoveries are possible even when all the individual studies are underpowered. However, the meta-analytic discoveries may be entirely driven by signal in a single study, and thus non-replicable. The lack of replicability of scientific findings has been of great concern following the influential paper of Ioannidis (2005). Although the great majority of meta-analyses carried out to date do not infer on the replicability of their findings, it is possible to do so. We provide a selective overview of analyses that can be carried out towards establishing replicability of the scientific findings. We describe methods for the setting where a single outcome is examined in multiple studies (as is common in systematic reviews of medical interventions), as well as for the setting where multiple studies each examine multiple features (as in genomics applications). We also discuss some of the current shortcomings and future directions.
... As the current standard practice of using a significance level of 5% is essentially arbitrary and risks a loss of information by reducing results to a dichotomous significant-or-not-significant conclusion, the results are presented as CIs. 17 To limit the possibility of a type 1 error occurring, subgroup analyses were ...
Background: The current guidelines of the American Heart Association (AHA) and European Society of Cardiology (ESC) recommend that when right ventricular myocardial infarction (RVMI) is present patients are not administered nitrates, due to the risk that decreasing preload in the setting of already compromised right ventricular ejection fraction may reduce cardiac output and precipitate hypotension. The cohort study (n=40) underlying this recommendation was recently challenged by new studies suitable for meta-analysis (cumulatively, n=1050), suggesting that this topic merits systematic review.
Methods: The protocol was registered on PROSPERO and published in Evidence Synthesis. Six databases were systematically searched in May 2022: PubMed, Embase, MEDLINE Complete, Cochrane CENTRAL Register, CINAHL and Google Scholar. Two investigators independently assessed for quality and bias and extracted data using Joanna Briggs Institute tools and methods. Risk ratios and 95% CIs were calculated, and meta-analysis performed using the random effects inverse variance method.
Results: Five studies (n=1113) were suitable. Outcomes included haemodynamics, GCS, syncope, arrest and death. Arrest and death did not occur in the RVMI group. Meta-analysis was possible for sublingual nitroglycerin 400 μg (2 studies, n=1050) and found no statistically significant difference in relative risk to combined inferior and RVMI at 1.31 (95% CI 0.81 to 2.12, p=0.27), with an absolute effect of 3 additional adverse events per 100 treatments. Results remained robust under sensitivity analysis.
Conclusions: This review suggests that the AHA and ESC contraindications are not supported by evidence. Key limitations include all studies having concomitant inferior and RVMI, not evaluating beneficial effects and very low certainty of evidence. As adverse events are transient and easily managed, nitrates are a reasonable treatment modality to consider during RVMI on current evidence.
PROSPERO registration number CRD42020172839.
... Any deviations from the preregistered analyses or any added non-preregistered tests will be mentioned in the text below. For statistical tests, we interpret the threshold of p < 0.005 as identifying statistical significance, and the threshold of p < 0.05 as identifying suggestive evidence [34]. The statistical analysis was completed in 'R', using the 'miceadds' package for applying clustered standard errors [35,36]. ...
Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are, and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric score. Most of these preprints were published within 1 year of upload on a preprint server (70%), with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict if preprints will be published after 1 year and if the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload on a preprint server. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help to provide a preliminary assessment of preprints at a faster pace than traditional peer-review, it remains to be investigated if such an assessment is suited to identify methodological problems in preprints.
... This study thus provides important directions for future research, which could focus on the replication of our findings in an older population, as well as in other population groups beyond undergraduates and people older than 50 years. Additionally, a confirmatory replication study would be necessary, given that our p-values are close to .05, which increases the danger of false-positive significant findings in our work (Benjamin et al., 2017). Thus, a successful replication study could increase the trustworthiness of our findings (Anvari & Lakens, 2018; Hendriks, Kienhues, & Bromme, 2020; Wingen, Berkessel, & Englich, 2020). ...
How do media reports about the Covid-19 pandemic influence our mood? Building on the social comparison theory, we predicted that reading negative news affecting a similar group would result in an impaired mood. In contrast, reading negative news about a dissimilar group should lead to an improved mood. To test this, 150 undergraduate students read positive or negative news about the well-being of a similar or dissimilar group during the pandemic. As predicted, a mood assimilation effect occurred for similar groups, whilst a contrast effect occurred for a dissimilar group. The findings suggest that media reports can have a strong impact on mood. The direction of these effects, however, seems to depend strongly on social comparison processes.
... The pF indicates the probability of erroneously stating that the canonical variables are related (Fisher, 1925), hence a small pF means that the correlation is reliable. We set 0.005 as the significance threshold for pF, as suggested by Benjamin et al. (2018). Consequently, taking this threshold for pF into account, only the first 15 cc are significant, as shown in Figure 7B, and are considered to reconstruct the amplification functions. ...
The Japanese KiK-net network comprises about 700 stations spread across the whole territory of Japan. For most of the stations, VP and VS profiles were measured down to the bottom borehole station. Using the vast dataset of earthquake recordings from 1997 to 2020 at a subset of 428 seismic stations, we compute the horizontal-to-vertical spectral ratio of earthquake coda, the S-wave surface-to-borehole spectral ratio, and the equivalent outcropping S-wave amplification function. The de facto equivalence of the horizontal-to-vertical spectral ratio of earthquake coda and ambient vibration is assessed on a homologous Swiss dataset. Based on that, we applied the canonical correlation analysis between amplification information and the horizontal-to-vertical spectral ratio of earthquake coda across all KiK-net sites. The aim of the correlation is to test a strategy to predict local earthquake amplification basing the inference on site condition indicators and single-station ambient vibration recordings. Once the correlation between frequency-dependent amplification factors and amplitudes of horizontal-to-vertical coda spectral ratios is defined, we predict amplification at each site in the selected KiK-net dataset with a leave-one-out cross-validation approach. In particular, for each site, three rounds of predictions are performed, using as prediction target the surface-to-borehole spectral ratio, the equivalent of a standard spectral ratio referred to the local bedrock and to a common Japanese reference rock profile. From our analysis, the most effective prediction is obtained when standard spectral ratios referred to local bedrock and the horizontal-to-vertical spectral ratio of earthquake coda are used, whereas a strong mismatch is obtained when standard spectral ratios are referred to a common reference. 
We ascribe this effect to the fact that, differently from amplification functions referred to a common reference, horizontal-to-vertical spectral ratios are fully site-dependent and then their peak amplitude is influenced by the local velocity contrast between bedrock and overlying sediments. Therefore, to reduce this discrepancy, we add in the canonical correlation as a site proxy the inferred velocity of the bedrock, which improves the final prediction.
The authors investigate whether and how borderline and pathological narcissistic traits differ in their associations with trait and state rejection sensitivity, and with affective reactions to experiences of social rejection occurring in daily life. Community adults (N = 189) completed baseline measures of rejection sensitivity, borderline personality, and pathological narcissism, and daily measures of perceived social rejection and affective states for 7 days. Vulnerable narcissism was the main driver of negative anticipated emotions for social rejection. Borderline personality made people prone to experiencing social rejection in daily life. Moreover, borderline personality traits predicted greater self-directed aggressive impulses when experiencing social rejection. Grandiose narcissism showed only a negative association with anticipatory anxiety for rejection. These findings highlight that sensitivity to social rejection is crucial in both borderline personality and pathological narcissism.
Aim: Evidence indicates most people were resilient to the impact of the COVID-19 pandemic on mental health. However, evidence also suggests the pandemic effect on mental health may be heterogeneous. Therefore, we aimed to identify groups of trajectories of common mental disorders' (CMD) symptoms assessed before (2017–19) and during the COVID-19 pandemic (2020–2021), and to investigate predictors of trajectories.
Methods: We assessed 2,705 participants of the ELSA-Brasil COVID-19 Mental Health Cohort study who reported Clinical Interview Scheduled-Revised (CIS-R) data in 2017–19 and Depression Anxiety Stress Scale-21 (DASS-21) data in May–July 2020, July–September 2020, October–December 2020, and April–June 2021. We used an equi-percentile approach to link the CIS-R total score in 2017–19 with the DASS-21 total score. Group-based trajectory modeling was used to identify CMD trajectories and adjusted multinomial logistic regression was used to investigate predictors of trajectories.
Results: Six groups of CMD symptoms trajectories were identified: low symptoms (17.6%), low-decreasing symptoms (13.7%), low-increasing symptoms (23.9%), moderate-decreasing symptoms (16.8%), low-increasing symptoms (23.3%), severe-decreasing symptoms (4.7%). The severe-decreasing trajectory was characterized by age < 60 years, female sex, low family income, sedentary behavior, previous mental disorders, and the experience of adverse events in life.
Limitations: Pre-pandemic characteristics were associated with lack of response to assessments. Our occupational cohort sample is not representative.
Conclusion: More than half of the sample presented low levels of CMD symptoms. Predictors of trajectories could be used to detect individuals at risk of presenting CMD symptoms in the context of global adverse events.
Negative outcome expectations of psychological treatments predict unfavorable treatment outcomes. Therefore, therapists should approach negative outcome expectations and ideally transform them into more positive outcome expectations. In this study, we investigated the therapist’s interpersonal behavior to optimize the modification of negative outcome expectations. After inducing negative expectations in an online experiment, we presented different videos of therapist–patient interactions to violate the induced negative outcome expectations. We kept the expectation-violating information constant and manipulated the therapist’s warmth and competence. Results confirmed a significant influence of the therapist’s warmth and competence on expectation violation, which led to the most positive outcome expectations when the therapist was both warm and competent. In contrast to former correlational analyses, our experimental study confirms the causal role of the therapist’s interpersonal behavior and its impact on changing patients’ negative outcome expectations. On the basis of these findings, more powerful approaches to optimize critical outcome expectations can be developed.
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64-1.46) for nominally statistically significant results and D = 0.24 (0.11-0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
Numerous biases are believed to affect the scientific literature, but their actual prevalence across disciplines is unknown. To gain a comprehensive picture of the potential imprint of bias in science, we probed for the most commonly postulated bias-related patterns and risk factors, in a large random sample of meta-analyses taken from all disciplines. The magnitude of these biases varied widely across fields and was overall relatively small. However, we consistently observed a significant risk of small, early, and highly cited studies to overestimate effects and of studies not published in peer-reviewed journals to underestimate them. We also found at least partial confirmation of previous evidence suggesting that US studies and early studies might report more extreme effects, although these effects were smaller and more heterogeneously distributed across meta-analyses and disciplines. Authors publishing at high rates and receiving many citations were, overall, not at greater risk of bias. However, effect sizes were likely to be overestimated by early-career researchers, those working in small or long-distance collaborations, and those responsible for scientific misconduct, supporting hypotheses that connect bias to situational factors, lack of mutual control, and individual integrity. Some of these patterns and risk factors might have modestly increased in intensity over time, particularly in the social sciences. Our findings suggest that, besides one being routinely cautious that published small, highly-cited, and earlier studies may yield inflated results, the feasibility and costs of interventions to attenuate biases in the literature might need to be discussed on a discipline-specific and topic-specific basis.
Author summary Biomedical science, psychology, and many other fields may be suffering from a serious replication crisis. In order to gain insight into some factors behind this crisis, we have analyzed statistical information extracted from thousands of cognitive neuroscience and psychology research papers. We established that the statistical power to discover existing relationships has not improved during the past half century. A consequence of low statistical power is that research studies are likely to report many false positive findings. Using our large dataset, we estimated the probability that a statistically significant finding is false (called false report probability). With some reasonable assumptions about how often researchers come up with correct hypotheses, we conclude that more than 50% of published findings deemed to be statistically significant are likely to be false. We also observed that cognitive neuroscience studies had higher false report probability than psychology studies, due to smaller sample sizes in cognitive neuroscience. In addition, the higher the impact factors of the journals in which the studies were published, the lower was the statistical power. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a re-analysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested non-null effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of non-reproducibility. The results of this re-analysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false.
Importance: The use and misuse of P values has generated extensive debates.
Objective: To evaluate in large scale the P values reported in the abstracts and full text of biomedical research articles over the past 25 years and determine how frequently statistical information is presented in ways other than P values.
Design: Automated text-mining analysis was performed to extract data on P values reported in 12 821 790 MEDLINE abstracts and in 843 884 abstracts and full-text articles in PubMed Central (PMC) from 1990 to 2015. Reporting of P values in 151 English-language core clinical journals and specific article types as classified by PubMed also was evaluated. A random sample of 1000 MEDLINE abstracts was manually assessed for reporting of P values and other types of statistical information; of those abstracts reporting empirical data, 100 articles were also assessed in full text.
Main Outcomes and Measures: P values reported.
Results: Text mining identified 4 572 043 P values in 1 608 736 MEDLINE abstracts and 3 438 299 P values in 385 393 PMC full-text articles. Reporting of P values in abstracts increased from 7.3% in 1990 to 15.6% in 2014. In 2014, P values were reported in 33.0% of abstracts from the 151 core clinical journals (n = 29 725 abstracts), 35.7% of meta-analyses (n = 5620), 38.9% of clinical trials (n = 4624), 54.8% of randomized controlled trials (n = 13 544), and 2.4% of reviews (n = 71 529). The distribution of reported P values in abstracts and in full text showed strong clustering at P values of .05 and of .001 or smaller. Over time, the "best" (most statistically significant) reported P values were modestly smaller and the "worst" (least statistically significant) reported P values became modestly less significant. Among the MEDLINE abstracts and PMC full-text articles with P values, 96% reported at least 1 P value of .05 or lower, with the proportion remaining steady over time in PMC full-text articles. In 1000 abstracts that were manually reviewed, 796 were from articles reporting empirical data; P values were reported in 15.7% (125/796 [95% CI, 13.2%-18.4%]) of abstracts, confidence intervals in 2.3% (18/796 [95% CI, 1.3%-3.6%]), Bayes factors in 0% (0/796 [95% CI, 0%-0.5%]), effect sizes in 13.9% (111/796 [95% CI, 11.6%-16.5%]), other information that could lead to estimation of P values in 12.4% (99/796 [95% CI, 10.2%-14.9%]), and qualitative statements about significance in 18.1% (181/1000 [95% CI, 15.8%-20.6%]); only 1.8% (14/796 [95% CI, 1.0%-2.9%]) of abstracts reported at least 1 effect size and at least 1 confidence interval. Among 99 manually extracted full-text articles with data, 55 reported P values, 4 presented confidence intervals for all reported effect sizes, none used Bayesian methods, 1 used false-discovery rates, 3 used sample size/power calculations, and 5 specified the primary outcome.
Conclusions and Relevance: In this analysis of P values reported in MEDLINE abstracts and in PMC articles from 1990-2015, more MEDLINE abstracts and articles reported P values over time, almost all abstracts and articles with P values reported statistically significant results, and, in a subgroup analysis, few articles included confidence intervals, Bayes factors, or effect sizes. Rather than reporting isolated P values, articles should include effect sizes and uncertainty metrics.
The reproducibility of scientific findings has been called into question. To contribute data about reproducibility in economics, we replicate 18 studies published in the American Economic Review and the Quarterly Journal of Economics in 2011-2014. All replications follow predefined analysis plans publicly posted prior to the replications, and have a statistical power of at least 90% to detect the original effect size at the 5% significance level. We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original. The reproducibility rate varies between 67% and 78% for four additional reproducibility indicators, including a prediction market measure of peer beliefs.
Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Concerns about a lack of reproducibility of statistically significant results have recently been raised in many fields, and it has been argued that this lack comes at substantial economic costs. We here report the results from prediction markets set up to quantify the reproducibility of 44 studies published in prominent psychology journals and replicated in the Reproducibility Project: Psychology. The prediction markets predict the outcomes of the replications well and outperform a survey of market participants' individual forecasts. This shows that prediction markets are a promising tool for assessing the reproducibility of published scientific results. The prediction markets also allow us to estimate probabilities for the hypotheses being true at different testing stages, which provides valuable information regarding the temporal dynamics of scientific discovery. We find that the hypotheses being tested in psychology typically have low prior probabilities of being true (median, 9%) and that a "statistically significant" finding needs to be confirmed in a well-powered replication to have a high probability of being true. We argue that prediction markets could be used to obtain speedy information about reproducibility at low cost and could potentially even be used to determine which studies to replicate to optimally allocate limited resources into replications.
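The median 9% prior probability estimated from the prediction markets implies, via Bayes' rule, how much a stricter threshold raises the probability that a "significant" finding is true. A minimal sketch in Python (the 80% power value is an illustrative assumption, not a figure from the study):

```python
def prob_true_given_significant(alpha, power, prior_true):
    """Posterior probability that a tested hypothesis is true given a
    significant result, by Bayes' rule:
    P(H1 | sig) = power*prior / (power*prior + alpha*(1 - prior))."""
    return power * prior_true / (power * prior_true + alpha * (1 - prior_true))

# Median prior from the prediction markets: 9%. Power 0.8 is an assumption.
print(round(prob_true_given_significant(0.05, 0.8, 0.09), 2))   # -> 0.61
print(round(prob_true_given_significant(0.005, 0.8, 0.09), 2))  # -> 0.94
```

Under these assumptions, a single significant result at the 0.05 threshold leaves substantial doubt, consistent with the finding that well-powered replication is needed before a hypothesis has a high probability of being true.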