Title: Redefine Statistical Significance
Authors: Daniel J. Benjamin1*, James O. Berger2, Magnus Johannesson3*, Brian A. Nosek4,5, E.-J. Wagenmakers6, Richard Berk7,10, Kenneth A. Bollen8, Björn Brembs9, Lawrence Brown10, Colin Camerer11, David Cesarini12,13, Christopher D. Chambers14, Merlise Clyde2, Thomas D. Cook15,16, Paul De Boeck17, Zoltan Dienes18, Anna Dreber3, Kenny Easwaran19, Charles Efferson20, Ernst Fehr21, Fiona Fidler22, Andy P. Field18, Malcolm Forster23, Edward I. George10, Richard Gonzalez24, Steven Goodman25, Edwin Green26, Donald P. Green27, Anthony Greenwald28, Jarrod D. Hadfield29, Larry V. Hedges30, Leonhard Held31, Teck Hua Ho32, Herbert Hoijtink33, James Holland Jones39,40, Daniel J. Hruschka34, Kosuke Imai35, Guido Imbens36, John P.A. Ioannidis37, Minjeong Jeon38, Michael Kirchler41, David Laibson42, John List43, Roderick Little44, Arthur Lupia45, Edouard Machery46, Scott E. Maxwell47, Michael McCarthy48, Don Moore49, Stephen L. Morgan50, Marcus Munafó51,52, Shinichi Nakagawa53, Brendan Nyhan54, Timothy H. Parker55, Luis Pericchi56, Marco Perugini57, Jeff Rouder58, Judith Rousseau59, Victoria Savalei60, Felix D. Schönbrodt61, Thomas Sellke62, Betsy Sinclair63, Dustin Tingley64, Trisha Van Zandt65, Simine Vazire66, Duncan J. Watts67, Christopher Winship68, Robert L. Wolpert2, Yu Xie69, Cristobal Young70, Jonathan Zinman71, Valen E. Johnson72*
Affiliations:
1Center for Economic and Social Research and Department of Economics, University of
Southern California, Los Angeles, CA 90089-3332, USA.
2Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA.
3Department of Economics, Stockholm School of Economics, SE-113 83 Stockholm,
Sweden.
4University of Virginia, Charlottesville, VA 22908, USA.
5Center for Open Science, Charlottesville, VA 22903, USA.
6University of Amsterdam, Department of Psychology, 1018 VZ Amsterdam, The
Netherlands.
7University of Pennsylvania, School of Arts and Sciences and Department of
Criminology, Philadelphia, PA 19104-6286, USA.
8University of North Carolina Chapel Hill, Department of Psychology and Neuroscience,
Department of Sociology, Chapel Hill, NC 27599-3270, USA.
9Institute of Zoology – Neurogenetics, Universität Regensburg, Universitätsstrasse 31, 93040 Regensburg, Germany.
10Department of Statistics, The Wharton School, University of Pennsylvania,
Philadelphia, PA 19104, USA.
11Division of the Humanities and Social Sciences, California Institute of Technology,
Pasadena, CA 91125, USA.
12Department of Economics, New York University, New York, NY 10012, USA.
13The Research Institute of Industrial Economics (IFN), SE- 102 15 Stockholm, Sweden.
14Cardiff University Brain Research Imaging Centre (CUBRIC), CF24 4HQ, UK.
15Northwestern University, Evanston, IL 60208, USA.
16Mathematica Policy Research, Washington, DC, 20002-4221, USA.
17Department of Psychology, Quantitative Program, Ohio State University, Columbus,
OH 43210, USA.
18School of Psychology, University of Sussex, Brighton BN1 9QH, UK.
19Department of Philosophy, Texas A&M University, College Station, TX 77843-4237,
USA.
20Department of Psychology, Royal Holloway University of London, Egham Surrey
TW20 0EX, UK.
21Department of Economics, University of Zurich, 8006 Zurich, Switzerland.
22School of BioSciences and School of Historical & Philosophical Studies, University of
Melbourne, Vic 3010, Australia.
23Department of Philosophy, University of Wisconsin - Madison, Madison, WI 53706,
USA.
24Department of Psychology, University of Michigan, Ann Arbor, MI 48109-1043, USA.
25Stanford University, General Medical Disciplines, Stanford, CA 94305, USA.
26Department of Ecology, Evolution and Natural Resources SEBS, Rutgers University,
New Brunswick, NJ 08901-8551, USA.
27Department of Political Science, Columbia University in the City of New York, New
York, NY 10027, USA.
28Department of Psychology, University of Washington, Seattle, WA 98195-1525, USA.
29Institute of Evolutionary Biology School of Biological Sciences, The University of
Edinburgh, Edinburgh EH9 3JT, UK.
30Weinberg College of Arts & Sciences Department of Statistics, Northwestern
University, Evanston, IL 60208, USA.
31Epidemiology, Biostatistics and Prevention Institute (EBPI), University of Zurich,
8001 Zurich, Switzerland.
32National University of Singapore, Singapore 119077.
33Department of Methods and Statistics, Universiteit Utrecht, 3584 CH Utrecht, The
Netherlands.
34School of Human Evolution and Social Change, Arizona State University, Tempe, AZ
85287-2402, USA.
35Department of Politics and Center for Statistics and Machine Learning, Princeton
University, Princeton NJ 08544, USA.
36Stanford University, Stanford, CA 94305-5015, USA.
37Departments of Medicine, of Health Research and Policy, of Biomedical Data
Science, and of Statistics and Meta-Research Innovation Center at Stanford (METRICS),
Stanford University, Stanford, CA 94305, USA.
38 Advanced Quantitative Methods, Social Research Methodology, Department of
Education, Graduate School of Education & Information Studies, University of
California, Los Angeles, CA 90095-1521, USA.
39Department of Life Sciences, Imperial College London, Ascot SL5 7PY, UK.
40Department of Earth System Science, Stanford, CA 94305- 4216, USA.
41Department of Banking and Finance, University of Innsbruck and University of
Gothenburg, A-6020 Innsbruck, Austria.
42Department of Economics, Harvard University, Cambridge, MA 02138, USA.
43Department of Economics, University of Chicago, Chicago, IL 60637, USA.
44Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA.
45Department of Political Science, University of Michigan, Ann Arbor, MI 48109-1045,
USA.
46Department of History and Philosophy of Science, University of Pittsburgh, Pittsburgh
PA 15260, USA.
47Department of Psychology, University of Notre Dame, Notre Dame, IN 46556, USA.
48School of BioSciences, University of Melbourne, Vic 3010, Australia.
49Haas School of Business, University of California at Berkeley, Berkeley, CA 94720-
1900A, USA.
50Johns Hopkins University, Baltimore, MD 21218, USA.
51MRC Integrative Epidemiology Unit, University of Bristol, Bristol BS8 1TU, UK.
52UK Centre for Tobacco and Alcohol Studies, School of Experimental Psychology,
University of Bristol, Bristol BS8 1TU, UK.
53Evolution & Ecology Research Centre and School of Biological, Earth and
Environmental Sciences, University of New South Wales, Sydney, NSW 2052, Australia.
54Department of Government, Dartmouth College, Hanover, NH 03755, USA.
55Department of Biology, Whitman College, Walla Walla, WA 99362, USA.
56Department of Mathematics, University of Puerto Rico, Rio Piedras Campus, San
Juan, PR 00936-8377.
57Department of Psychology, University of Milan - Bicocca, 20126 Milan, Italy.
58Department of Psychological Sciences, University of Missouri, Columbia, MO 65211,
USA.
59Université Paris Dauphine, 75016 Paris, France.
60Department of Psychology, The University of British Columbia, Vancouver, BC Canada
V6T 1Z4.
61Department Psychology, Ludwig-Maximilians-University Munich, Leopoldstraße 13,
80802 Munich, Germany.
62Department of Statistics, Purdue University, West Lafayette, IN 47907-2067, USA.
63Department of Political Science, Washington University in St. Louis, St. Louis, MO
63130-4899, USA.
64Government Department, Harvard University, Cambridge, MA 02138, USA.
65Department of Psychology, Ohio State University, Columbus, OH 43210, USA.
66Department of Psychology, University of California, Davis, CA, 95616, USA.
67Microsoft Research. 641 Avenue of the Americas, 7th Floor, New York, NY 10011,
USA.
68Department of Sociology, Harvard University, Cambridge, MA 02138, USA.
69Department of Sociology, Princeton University, Princeton NJ 08544, USA.
70Department of Sociology, Stanford University, Stanford, CA 94305-2047, USA.
71Department of Economics, Dartmouth College, Hanover, NH 03755-3514, USA.
72Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
*Correspondence to: Daniel J. Benjamin, daniel.benjamin@gmail.com; Magnus
Johannesson, magnus.johannesson@hhs.se; Valen E. Johnson,
vejohnson@exchange.tamu.edu.
One Sentence Summary: We propose to change the default P-value threshold for
statistical significance for claims of new discoveries from 0.05 to 0.005.
Main Text:
The lack of reproducibility of scientific studies has caused growing concern over
the credibility of claims of new discoveries based on “statistically significant” findings.
There has been much progress toward documenting and addressing several causes of this
lack of reproducibility (e.g., multiple testing, P-hacking, publication bias, and under-
powered studies). However, we believe that a leading cause of non-reproducibility has
not yet been adequately addressed: Statistical standards of evidence for claiming new
discoveries in many fields of science are simply too low. Associating “statistically
significant” findings with P < 0.05 results in a high rate of false positives even in the
absence of other experimental, procedural and reporting problems.
For fields where the threshold for defining statistical significance for new
discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would
immediately improve the reproducibility of scientific research in many fields. Results that
would currently be called “significant” but do not meet the new threshold should instead
be called “suggestive.” While statisticians have long known the relative weakness of using
P ≈ 0.05 as a threshold for discovery, and the proposal to lower it to 0.005 is not new (1,
2), a critical mass of researchers now endorse this change.
We restrict our recommendation to claims of discovery of new effects. We do not
address the appropriate threshold for confirmatory or contradictory replications of
existing claims. We also do not advocate changes to discovery thresholds in fields that
have already adopted more stringent standards (e.g., genomics and high-energy physics
research; see Potential Objections below).
We also restrict our recommendation to studies that conduct null hypothesis
significance tests. We have diverse views about how best to improve reproducibility, and
many of us believe that other ways of summarizing the data, such as Bayes factors or
other posterior summaries based on clearly articulated model assumptions, are preferable
to P-values. However, changing the P-value threshold is simple, aligns with the training
undertaken by many researchers, and might quickly achieve broad acceptance.
Strength of evidence from P-values
In testing a point null hypothesis H0 against an alternative hypothesis H1 based on
data x_obs, the P-value is defined as the probability, calculated under the null hypothesis,
that a test statistic is as extreme or more extreme than its observed value. The null
hypothesis is typically rejected—and the finding is declared “statistically significant”—if
the P-value falls below the (current) Type I error threshold α = 0.05.
From a Bayesian perspective, a more direct measure of the strength of evidence
for H1 relative to H0 is the ratio of their probabilities. By Bayes’ rule, this ratio may be
written as:

$$\frac{\Pr(H_1 \mid x_{\mathrm{obs}})}{\Pr(H_0 \mid x_{\mathrm{obs}})} = \frac{f(x_{\mathrm{obs}} \mid H_1)}{f(x_{\mathrm{obs}} \mid H_0)} \times \frac{\Pr(H_1)}{\Pr(H_0)} \equiv \mathrm{BF} \times (\text{prior odds}), \tag{1}$$
where BF is the Bayes factor that represents the evidence from the data, and the prior
odds can be informed by researchers’ beliefs, scientific consensus, and validated
evidence from similar research questions in the same field. Multiple hypothesis testing,
P-hacking, and publication bias all reduce the credibility of evidence. Some of these
practices reduce the prior odds of H1 relative to H0 by changing the population of
hypothesis tests that are reported. Prediction markets (3) and analyses of replication
results (4) both suggest that for psychology experiments, the prior odds of H1 relative to
H0 may be only about 1:10. A similar number has been suggested in cancer clinical trials,
and the number is likely to be much lower in preclinical biomedical research (5).
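To make the arithmetic of equation (1) concrete, the following short R sketch (our addition; the Bayes factor values are illustrative, not taken from the paper) converts prior odds of 1:10 into posterior odds for several Bayes factors:

# Equation (1) in action: posterior odds = Bayes factor x prior odds.
# Illustrative values only (not from the paper).
prior_odds <- 1/10            # Pr(H1)/Pr(H0), as suggested for psychology
bf <- c(3, 14, 26)            # Bayes factors in favor of H1
posterior_odds <- bf * prior_odds
round(posterior_odds, 2)      # 0.3 1.4 2.6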
There is no unique mapping between the P-value and the Bayes factor since the
Bayes factor depends on H1. However, the connection between the two quantities can be
evaluated for particular test statistics under certain classes of plausible alternatives (Fig. 1).
Fig. 1. Relationship between the P-value and the Bayes Factor. The Bayes factor (BF)
is defined as $f(x_{\mathrm{obs}} \mid H_1)/f(x_{\mathrm{obs}} \mid H_0)$. The figure assumes that observations are drawn i.i.d. according to
$x \sim N(\mu, \sigma^2)$, where the mean $\mu$ is unknown and the variance $\sigma^2$ is known. The P-value
is from a two-sided z test (or equivalently a one-sided $\chi^2_1$ test) of the null hypothesis
$H_0: \mu = 0$.
“Power”: BF obtained by defining H1 as putting ½ probability on $\mu = \pm m$ for the value
of $m$ that gives 75% power for the test of size α = 0.05. This H1 represents an effect size
typical of that which is implicitly assumed by researchers during experimental design.
“Likelihood Ratio Bound”: BF obtained by defining H1 as putting ½ probability on
$\mu = \pm\bar{x}$, where $\bar{x}$ is approximately equal to the mean of the observations. These BFs are
upper bounds among the class of all H1’s that are symmetric around the null, but they are
improper because the data are used to define H1. “UMPBT”: BF obtained by defining H1
according to the uniformly most powerful Bayesian test (2) that places ½ probability on
$\mu = \pm w$, where $w$ is the alternative hypothesis that corresponds to a one-sided test of size
0.0025. This curve is indistinguishable from the “Power” curve that would be obtained if
the power used in its definition was 80% rather than 75%. “Local-$H_1$ Bound”: BF $= 1/(-e\,p\ln p)$, where $p$ is the P-value, is a large-sample upper bound on the BF from among all
unimodal alternative hypotheses that have a mode at the null and satisfy certain regularity
conditions (15). For more details, see the Supplementary Online Materials (SOM).
A two-sided P-value of 0.05 corresponds to Bayes factors in favor of H1 that range from
about 2.5 to 3.4 under reasonable assumptions about H1 (Fig. 1). This is weak evidence
from at least three perspectives. First, conventional Bayes factor categorizations (6)
characterize this range as “weak” or “very weak.” Second, we suspect many scientists
would guess that P ≈ 0.05 implies stronger support for H1 than a Bayes factor of 2.5 to
3.4. Third, using equation (1) and prior odds of 1:10, a P-value of 0.05 corresponds to at
least 3:1 odds (i.e., the reciprocal of the product 1/10 × 3.4) in favor of the null hypothesis H0.
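These numbers can be verified in R with the Likelihood Ratio Bound formula from the SOM code (a quick check of our own, not part of the original analysis):

# Verifying the numbers above (our check). At p = 0.05, the two-sided
# z-statistic and the Likelihood Ratio Bound on the Bayes factor:
z <- qnorm(1 - 0.05/2)              # 1.96
bf_upper <- 0.5/exp(-0.5*z^2)       # ~3.4, the upper end of the range
# With prior odds of 1:10, posterior odds for H0 are at least the
# reciprocal of (1/10)*3.4:
1/(bf_upper/10)                     # ~2.9, i.e., roughly 3:1 in favor of H0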
Why 0.005?
The choice of any particular threshold is arbitrary and involves a trade-off
between Type I and II errors. We propose 0.005 for two reasons. First, a two-sided P-value
of 0.005 corresponds to Bayes factors between approximately 14 and 26 in favor of
H1. This range represents “substantial” to “strong” evidence according to conventional
Bayes factor classifications (6).
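As a check on these values, a short R sketch of our own, using the two bound formulas plotted in Fig. 1:

# Our check of the 14-26 range at p = 0.005, using the two bounds in Fig. 1.
p <- 0.005
z <- qnorm(1 - p/2)                 # 2.81
bf_lr    <- 0.5/exp(-0.5*z^2)       # Likelihood Ratio Bound: ~25.7
bf_local <- -1/(exp(1)*p*log(p))    # Local-H1 Bound: ~13.9
c(bf_local, bf_lr)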
Second, in many fields the P < 0.005 standard would reduce the false positive
rate to levels we judge to be reasonable. If we let φ denote the proportion of null
hypotheses that are true, (1 − β) the power of tests in rejecting false null hypotheses, and
α the Type I error/significance threshold, then as the population of tested hypotheses
becomes large, the false positive rate (i.e., the proportion of true null effects among the
total number of statistically significant findings) can be approximated by

$$\text{false positive rate} \approx \frac{\alpha\phi}{\alpha\phi + (1-\beta)(1-\phi)}. \tag{2}$$

For different levels of the prior odds that there is a true effect, $(1-\phi)/\phi$, and for significance
thresholds α = 0.05 and α = 0.005, Figure 2 shows the false positive rate as a function
of power 1 − β.
Fig. 2. Relationship between the P-value threshold, power, and the false positive
rate. Calculated according to Equation (2), with prior odds defined as $(1-\phi)/\phi = \Pr(H_1)/\Pr(H_0)$. For
more details, see the Supplementary Online Materials (SOM).
In many studies, statistical power is low (e.g., ref. 7). Fig. 2 demonstrates that low
statistical power and α = 0.05 combine to produce high false positive rates.
For many, the calculations illustrated by Fig. 2 may be unsettling. For example,
the false positive rate is greater than 33% with prior odds of 1:10 and a P-value threshold
of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005
would reduce this minimum false positive rate to 5%. Similar reductions in false positive
rates would occur over a wide range of statistical powers.
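A minimal R sketch (our addition; the fpr helper function is ours) verifying these two rates from Equation (2):

# Our verification of the two rates quoted above, from Equation (2).
# The false positive rate is smallest when power (1 - beta) equals 1.
fpr <- function(alpha, power, phi) alpha*phi/(alpha*phi + power*(1 - phi))
phi <- 10/11                       # prior odds of 1:10 that H1 is true
fpr(alpha = 0.05,  power = 1, phi) # 0.333: greater than 33%
fpr(alpha = 0.005, power = 1, phi) # 0.048: roughly 5%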
Empirical evidence from recent replication projects in psychology and
experimental economics provides insight into the prior odds in favor of H1. In both
projects, the rate of replication (i.e., significance at P < 0.05 in the replication in a
consistent direction) was roughly double for initial studies with P < 0.005 relative to
initial studies with 0.005 < P < 0.05: 50% versus 24% for psychology (8), and 85%
versus 44% for experimental economics (9). Although based on relatively small samples
of studies (93 in psychology, 16 in experimental economics, after excluding initial studies
with P > 0.05), these numbers are suggestive of the potential gains in reproducibility that
would accrue from the new threshold of P < 0.005 in these fields. In biomedical research,
96% of a sample of recent papers claim statistically significant results with the P < 0.05
threshold (10). However, replication rates were very low (5) for these studies, suggesting
a potential for gains by adopting this new standard in these fields as well.
Potential Objections
We now address the most compelling arguments against adopting this higher
standard of evidence.
The false negative rate would become unacceptably high. Evidence that does not
reach the new significance threshold should be treated as suggestive, and where possible
further evidence should be accumulated; indeed, the combined results from several
studies may be compelling even if any particular study is not. Failing to reject the null
hypothesis does not mean accepting the null hypothesis. Moreover, the false negative rate
will not increase if sample sizes are increased so that statistical power is held constant.
For a wide range of common statistical tests, transitioning from a P-value
threshold of α = 0.05 to α = 0.005 while maintaining 80% power would require an
increase in sample sizes of about 70%. Such an increase means that fewer studies can be
conducted using current experimental designs and budgets. But Figure 2 shows the
benefit: false positive rates would typically fall by factors greater than two. Hence,
considerable resources would be saved by not performing future studies based on false
premises. Increasing sample sizes is also desirable because studies with small sample
sizes tend to yield inflated effect size estimates (11), and publication and other biases
may be more likely in an environment of small studies (12). We believe that efficiency
gains would far outweigh losses.
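For a simple two-sided z test of a normal mean, the approximate 70% figure can be verified with a back-of-the-envelope R calculation (our sketch, assuming the standard sample-size formula n ∝ (z_{1−α/2} + z_{power})²):

# Back-of-the-envelope check (ours): for a two-sided z test, the required
# sample size is proportional to (z_{1-alpha/2} + z_{power})^2.
n_factor <- function(alpha, power) (qnorm(1 - alpha/2) + qnorm(power))^2
n_factor(0.005, 0.80)/n_factor(0.05, 0.80)   # ~1.70, i.e., about 70% more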
The proposal does not address multiple hypothesis testing, P-hacking, publication
bias, low power, or other biases (e.g., confounding, selective reporting, measurement
error), which are arguably the bigger problems. We agree. Reducing the P-value
threshold complements—but does not substitute for—solutions to these other problems,
which include good study design, ex ante power calculations, pre-registration of planned
analyses, replications, and transparent reporting of procedures and all statistical analyses
conducted.
The appropriate threshold for statistical significance should be different for
different research communities. We agree that the significance threshold selected for
claiming a new discovery should depend on the prior odds that the null hypothesis is true,
the number of hypotheses tested, the study design, the relative cost of Type I versus Type
II errors, and other factors that vary by research topic. For exploratory research with very
low prior odds (well outside the range in Figure 2), even lower significance thresholds
than 0.005 are needed. Recognition of this issue led the genetics research community to
move to a “genome-wide significance threshold” of 5×10⁻⁸ over a decade ago. And in
high-energy physics, the tradition has long been to define significance by a “5-sigma”
rule (roughly a P-value threshold of 3×10⁻⁷). We are essentially suggesting a move from a
2-sigma rule to a 3-sigma rule.
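These sigma rules translate into P-value thresholds as follows (a quick R sketch of our own; note that the 5-sigma convention in particle physics is one-sided):

# Translating sigma rules into P-value thresholds (our sketch).
2*pnorm(-2)       # two-sided 2-sigma: ~0.046, close to 0.05
2*pnorm(-2.81)    # two-sided ~3-sigma: ~0.005, the proposed threshold
pnorm(-5)         # one-sided 5-sigma (particle physics): ~2.9e-7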
Our recommendation applies to disciplines with prior odds broadly in the range
depicted in Figure 2, where use of P < 0.05 as a default is widespread. Within those
disciplines, it is helpful for consumers of research to have a consistent benchmark. We
feel the default should be shifted.
Changing the significance threshold is a distraction from the real solution, which
is to replace null hypothesis significance testing (and bright-line thresholds) with more
focus on effect sizes and confidence intervals, treating the P-value as a continuous
measure, and/or a Bayesian method. Many of us agree that there are better approaches to
statistical analyses than null hypothesis significance testing, but as yet there is no
consensus regarding the appropriate choice of replacement. For example, a recent
statement by the American Statistical Association addressed numerous issues regarding
the misinterpretation and misuse of P-values (as well as the related concept of statistical
significance), but failed to make explicit policy recommendations to address these
shortcomings (13). Even after the significance threshold is changed, many of us will
continue to advocate for alternatives to null hypothesis significance testing.
Concluding remarks
Ronald Fisher understood that the choice of 0.05 was arbitrary when he
introduced it (14). Since then, theory and empirical evidence have demonstrated that a
lower threshold is needed. A much larger pool of scientists are now asking a much larger
number of questions, possibly with much lower prior odds of success.
For research communities that continue to rely on null hypothesis significance
testing, reducing the P-value threshold for claims of new discoveries to 0.005 is an
actionable step that will immediately improve reproducibility. We emphasize that this
proposal is about standards of evidence, not standards for policy action nor standards for
publication. Results that do not reach the threshold for statistical significance (whatever
it is) can still be important and merit publication in leading journals if they address
important research questions with rigorous methods. This proposal should not be used to
reject publications of novel findings with 0.005 < P < 0.05 properly labeled as suggestive
evidence. We should reward quality and transparency of research as we impose these
more stringent standards, and we should monitor how researchers’ behaviors are affected
by this change. Otherwise, science runs the risk that the more demanding threshold for
statistical significance will be met to the detriment of quality and transparency.
Journals can help transition to the new statistical significance threshold. Authors
and readers can themselves take the initiative by describing and interpreting results more
appropriately in light of the new proposed definition of “statistical significance.” The
new significance threshold will help researchers and readers to understand and
communicate evidence more accurately.
References and Notes:
1. A. G. Greenwald et al., Effect sizes and p values: What should be reported and
what should be replicated? Psychophysiology 33, 175-183 (1996).
2. V. E. Johnson, Revised standards for statistical evidence. Proc. Natl. Acad. Sci.
U.S.A. 110, 19313-19317 (2013).
3. A. Dreber et al., Using prediction markets to estimate the reproducibility of
scientific research. Proc. Natl. Acad. Sci. U.S.A. 112, 15343-15347 (2015).
4. V. E. Johnson et al., On the reproducibility of psychological science. J. Am. Stat.
Assoc. 112, 1-10 (2016).
5. G. C. Begley, J. P. A. Ioannidis, Reproducibility in science: Improving the
standard for basic and preclinical research. Circ. Res. 116, 116-126 (2015).
6. R. E. Kass, A. E. Raftery, Bayes Factors. J. Am. Stat. Assoc. 90, 773-795 (1995).
7. D. Szucs, J. P. A. Ioannidis, Empirical assessment of published effect sizes and
power in the recent cognitive neuroscience and psychology literature. PLoS Biol.
15, e2000797 (2017).
8. Open Science Collaboration, Estimating the reproducibility of psychological
science. Science 349, aac4716 (2015).
9. C. Camerer et al., Evaluating replicability of laboratory experiments in
economics. Science 351, 1433-1436 (2016).
10. D. Chavalarias et al., Evolution of reporting p values in the biomedical literature,
1990-2015. JAMA 315, 1141-1148 (2016).
11. A. Gelman, J. Carlin, Beyond power calculations: Assessing Type S (Sign) and
Type M (Magnitude) errors. Perspect. Psychol. Sci. 9, 641-651 (2014).
12. D. Fanelli, R. Costas, J. P. A. Ioannidis, Meta-assessment of bias in science. Proc.
Natl. Acad. Sci. U.S.A. 114, 3714-3719 (2017).
13. R. L. Wasserstein, N. A. Lazar, The ASA’s statement on p-values: Context,
process, and purpose. Am. Stat. 70 (and online comments), 129-133 (2016).
14. R. A. Fisher, Statistical Methods for Research Workers (Oliver & Boyd,
Edinburgh, 1925).
15. T. Sellke, M. J. Bayarri, J. O. Berger, Calibration of p-values for testing precise
null hypotheses. Am. Stat. 55, 62-71 (2001).
Acknowledgements: We thank Deanna L. Lormand, Rebecca Royer and Anh Tuan
Nguyen Viet for excellent research assistance.
Supplementary Materials:
Supplementary Text
R code used to generate Figures 1 and 2
Figure 1
All four curves in Figure 1 describe the relationship between (i) a P-value based
on a two-sided normal test and (ii) a Bayes factor or a bound on a Bayes factor. The P-
values are based on a two-sided test that the mean 𝜇 of an independent and identically
distributed sample of normally distributed random variables is 0. The variance of the
observations is known. Without loss of generality, we assume that the variance is 1 and
the sample size is also 1. The curves in the figure differ according to the alternative
hypotheses that they assume for calculating (ii).
Because these curves involve two-sided tests, all alternative hypotheses are restricted to
be symmetric around 0. That is, the density assumed for the value of μ under the
alternative hypothesis is always assumed to satisfy $f(\mu) = f(-\mu)$.
The curve labeled “Power” corresponds to defining the alternative hypothesis so that
power is 75% in a two-sided 5% test. This is achieved by assuming that μ under the
alternative hypothesis is equal to $\pm(z_{0.975} + z_{0.75}) = \pm 2.63$. That is, the alternative
hypothesis places ½ its prior mass on 2.63 and ½ its mass on −2.63.
The curve labeled UMPBT corresponds to the uniformly most powerful Bayesian test (2)
that corresponds to a classical, two-sided test of size 𝛼=0.005. The alternative
hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null
hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is
nearly identical to the “Power” curve if that curve had been defined using 80% power,
rather than 75% power. The Power curve for 80% power would place ½ its mass at
±2.80.
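These three values can be recovered directly with qnorm() (a small verification sketch of our own, echoing the full SOM code below):

# Recovering the alternative-hypothesis means quoted above (our check):
qnorm(0.75) + qnorm(0.975)   # 75% power, two-sided 5% test: 2.63
qnorm(0.9975)                # UMPBT for a two-sided 0.005 test: 2.81
qnorm(0.80) + qnorm(0.975)   # 80% power version: 2.80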
The Likelihood Ratio Bound curve represents an approximate upper bound on the Bayes
factor obtained by defining the alternative hypothesis as putting ½ its mass on $\pm\bar{x}$, where
$\bar{x}$ is the observed sample mean. Over the range of P-values displayed in the figure, this
alternative hypothesis very closely approximates the maximum Bayes factor that can be
attained from among the set of alternative hypotheses constrained to be of the form
$0.5 \times [f(\mu) + f(-\mu)]$ for some density function f.
The Local-H1 curve is described fully in the figure caption. A fuller explanation and
discussion of this bound can be found in ref. 15.
Equation 2 and Figure 2
This equation defines the large-sample relationship between the false positive
rate, power 1 − β, Type I error rate α, and the probability that the null hypothesis is true
when a large number of independent experiments have been conducted. More
specifically, suppose that n independent hypothesis tests are conducted, and suppose that
in each test the probability that the null hypothesis is true is φ. If the null hypothesis is
true, assume that the probability that it is falsely rejected (i.e., a false positive occurs) is
α. For test $j = 1, \dots, n$, define the random variable $X_j = 1$ if the null hypothesis is
true and the null hypothesis is rejected, and $X_j = 0$ if either the alternative hypothesis is
true or the null hypothesis is not rejected. Note that the $X_j$ are independent Bernoulli
random variables with $\Pr(X_j = 1) = \alpha\phi$. Also for test j, define another random variable
$Y_j = 1$ if the alternative hypothesis is true and the null hypothesis is rejected, and 0
otherwise. It follows that the $Y_j$ are independent Bernoulli random variables with
$\Pr(Y_j = 1) = (1-\phi)(1-\beta)$. Note that $Y_j$ is independent of $Y_k$ for $j \neq k$, but $Y_j$ is not
independent of $X_j$. For the n experiments, the false positive rate can then be written as:

$$FPR = \frac{\sum_{j=1}^{n} X_j}{\sum_{j=1}^{n} X_j + \sum_{j=1}^{n} Y_j} = \frac{\sum_{j=1}^{n} X_j/n}{\sum_{j=1}^{n} X_j/n + \sum_{j=1}^{n} Y_j/n}.$$

By the strong law of large numbers, $\sum_{j=1}^{n} X_j/n$ converges almost surely to $\alpha\phi$, and
$\sum_{j=1}^{n} Y_j/n$ converges almost surely to $(1-\phi)(1-\beta)$. Application of the continuous
mapping theorem yields

$$FPR \xrightarrow{\ \mathrm{a.s.}\ } \frac{\alpha\phi}{\alpha\phi + (1-\phi)(1-\beta)}.$$

Figure 2 illustrates this relationship for various values of α and prior odds for the
alternative, (1 − φ)/φ.
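The almost-sure limit can also be illustrated by simulation (our sketch, not part of the original SOM code), here with prior odds 1:10, α = 0.05, and power 0.5, for which the limiting false positive rate is exactly 0.5:

# Simulation sketch (ours) illustrating the almost-sure limit: with
# phi = 10/11, alpha = 0.05, and power 0.5, alpha*phi = (1-phi)*(1-beta)
# = 1/22, so the limiting FPR is exactly 0.5.
set.seed(1)
n <- 10^6; phi <- 10/11; alpha <- 0.05; power <- 0.5
null_true <- rbinom(n, 1, phi)                # 1 if H0 is true for test j
X <- null_true*rbinom(n, 1, alpha)            # false positives
Y <- (1 - null_true)*rbinom(n, 1, power)      # true positives
sum(X)/(sum(X) + sum(Y))                      # close to the theoretical 0.5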
R code used to generate Figure 1:
type1=.005
type1Power=0.05
type2=0.25
p=1-c(9000:9990)/10000
xbar = qnorm(1-p/2)
# alternative based on 75% power in a two-sided 5% test (type2 = 0.25)
muPower = qnorm(1-type2)+qnorm(1-type1Power/2)
bfPow = 0.5*(dnorm(xbar,muPower,1)+dnorm(xbar,-muPower,1))/dnorm(xbar,0,1)
# UMPBT alternative for a two-sided test of size 0.005
muUMPBT = qnorm(0.9975)
bfUMPBT = 0.5*(dnorm(xbar,muUMPBT,1)+dnorm(xbar,-muUMPBT,1))/dnorm(xbar,0,1)
# two-sided "LR" bound
bfLR = 0.5/exp(-0.5*xbar^2)
# Local-H1 bound: -1/(e*p*log(p)), with e approximated as 2.71
bfLocal = -1/(2.71*p*log(p))
#coordinates for dashed lines
data = data.frame(p,bfLocal,bfLR,bfPow,bfUMPBT)
U_005 = max(data$bfLR[data$p=="0.005"])
L_005 = min(data$bfLocal[data$p=="0.005"])
U_05 = max(data$bfLR[data$p=="0.05"])
L_05 = min(data$bfUMPBT[data$p=="0.05"])
# Local bound; no need for two-sided adjustment
#plot margins
par(mai=c(0.8,0.8,.1,0.4))
par(mgp=c(2,1,0))
matplot(p,cbind(bfLR,-1/(2.71*p*log(p))),type='n',log='xy',
xlab=expression(paste(italic(P) ,"-value")),
ylab="Bayes Factor",
ylim = c(0.3,100),
bty="n",xaxt="n",yaxt="n")
lines(p,bfPow,col="red",lwd=2.5)
lines(p,bfLR,col="black",lwd=2.5)
lines(p,bfUMPBT,col="blue",lwd=2.5)
lines(p,bfLocal,col="green",lwd=2.5)
legend(0.015,100,c(expression(paste("Power")),"Likelihood Ratio Bound","UMPBT",
       expression(paste("Local-",italic(H)[1]," Bound"))),lty=c(1,1,1,1),
       lwd=c(2.5,2.5,2.5,2.5),col=c("red","black","blue","green"),
       cex = 0.8)
#text(0.062,65, "\u03B1", font =3, cex = 0.9)
#customizing axes
#x axis
axis(side=1,at=c(-2,0.001,0.0025,0.005,0.010,0.025,0.050,0.100,0.14),
     labels = c("","0.0010","0.0025","0.0050","0.0100","0.0250","0.0500","0.1000",""),
     lwd=1,tck = -0.01, padj = -1.1, cex.axis = .8)
#y axis on the left - main
axis(side=2,at=c(-0.2, 0.3,0.5,1,2,5,10,20,50,100),labels =
c("","0.3","0.5","1.0","2.0","5.0","10.0","20.0","50.0","100.0"),lwd
=1,las= 1,
tck = -0.01, hadj = 0.6, cex.axis = .8)
#y axis on the left - secondary (red labels)
axis(side=2,at=c(L_005,U_005),labels = c(13.9,25.7),lwd=1,las= 1,
tck = -0.01, hadj = 0.6, cex.axis = .6,col.axis="red")
#y axis on the right - main
axis(side=4,at=c(-0.2, 0.3,0.5,1,2,5,10,20,50,100),labels =
c("","0.3","0.5","1.0","2.0","5.0","10.0","20.0","50.0","100.0"),lwd
=1,las= 1,
tck = -0.01, hadj = 0.4, cex.axis = .8)
#y axis on the right - secondary (red labels)
axis(side=4,at=c(L_05,U_05),labels = c(2.4,3.4),lwd=1,las= 1,
tck = -0.01, hadj = 0.4, cex.axis = .6,col.axis="red")
###dashed lines
segments(x0 = 0.000011, y0= U_005, x1 = 0.005, y1 = U_005, col =
"gray40", lty = 2)
segments(x0 = 0.000011, y0= L_005, x1 = 0.005, y1 = L_005, col =
"gray40", lty = 2)
segments(x0 = 0.005, y0= 0.00000001, x1 = 0.005, y1 = U_005, col =
"gray40", lty = 2)
segments(x0 = 0.05, y0= U_05, x1 = 0.14, y1 = U_05, col = "gray40",
lty = 2)
segments(x0 = 0.05, y0= L_05, x1 = 0.14, y1 = L_05, col = "gray40",
lty = 2)
segments(x0 = 0.05, y0= 0.00000001, x1 = 0.05, y1 = U_05, col =
"gray40", lty = 2)
R code used to generate Figure 2:
pow1=c(5:999)/1000   # power range for alpha = 0.005 tests
pow2=c(50:999)/1000  # power range for alpha = 0.05 tests
alpha=0.005          # test size
pi0=5/6              # prior probability that the null is true (prior odds 1:5)
N=10^6               # number of tests; cancels out of the false positive rate
#graph margins
par(mai=c(0.8,0.8,0.1,0.1))
par(mgp=c(2,1,0))
plot(pow1,alpha*N*pi0/(alpha*N*pi0+pow1*(1-pi0)*N),type='n',ylim =
c(0,1), xlim = c(0,1.5),
xlab='Power ',
ylab='False positive rate', bty="n", xaxt="n", yaxt="n")
#grid lines
segments(x0 = -0.058, y0 = 0, x1 = 1, y1 = 0,lty=1,col = "gray92")
segments(x0 = -0.058, y0 = 0.2, x1 = 1, y1 = 0.2,lty=1,col =
"gray92")
segments(x0 = -0.058, y0 = 0.4, x1 = 1, y1 = 0.4,lty=1,col =
"gray92")
segments(x0 = -0.058, y0 = 0.6, x1 = 1, y1 = 0.6,lty=1,col =
"gray92")
segments(x0 = -0.058, y0 = 0.8, x1 = 1, y1 = 0.8,lty=1,col =
"gray92")
segments(x0 = -0.058, y0 = 1, x1 = 1, y1 = 1,lty=1,col = "gray92")
lines(pow1,alpha*N*pi0/(alpha*N*pi0+pow1*(1-
pi0)*N),lty=1,col="blue",lwd=2)
odd_1_5_1 = alpha*N*pi0/(alpha*N*pi0+pow1[995]*(1-pi0)*N)
alpha=0.05
pi0=5/6
lines(pow2,alpha*N*pi0/(alpha*N*pi0+pow2*(1-
pi0)*N),lty=2,col="blue",lwd=2)
odd_1_5_2 = alpha*N*pi0/(alpha*N*pi0+pow2[950]*(1-pi0)*N)
alpha=0.05
pi0=10/11
lines(pow2,alpha*N*pi0/(alpha*N*pi0+pow2*(1-
pi0)*N),lty=2,col="red",lwd=2)
odd_1_10_2 = alpha*N*pi0/(alpha*N*pi0+pow2[950]*(1-pi0)*N)
alpha=0.005
pi0=10/11
lines(pow1,alpha*N*pi0/(alpha*N*pi0+pow1*(1-
pi0)*N),lty=1,col="red",lwd=2)
odd_1_10_1 = alpha*N*pi0/(alpha*N*pi0+pow1[995]*(1-pi0)*N)
alpha=0.05
pi0=40/41
lines(pow2,alpha*N*pi0/(alpha*N*pi0+pow2*(1-
pi0)*N),lty=2,col="green",lwd=2)
odd_1_40_2 = alpha*N*pi0/(alpha*N*pi0+pow2[950]*(1-pi0)*N)
alpha=0.005
pi0=40/41
lines(pow1,alpha*N*pi0/(alpha*N*pi0+pow1*(1-
pi0)*N),lty=1,col="green",lwd=2)
odd_1_40_1 = alpha*N*pi0/(alpha*N*pi0+pow1[995]*(1-pi0)*N)
#customizing axes
axis(side=2,at=c(-0.5,0,0.2,0.4,0.6,0.8,1.0),labels =
c("","0.0","0.2","0.4","0.6","0.8","1.0"),
lwd=1,las= 1,tck = -0.01, hadj = 0.4, cex.axis = .8)
axis(side=1,at=c(-0.5,0,0.2,0.4,0.6,0.8,1.0),labels =
c("","0.0","0.2","0.4","0.6","0.8","1.0"),
lwd=1,las= 1, tck = -0.01, padj = -1.1, cex.axis = .8)
legend(1.05,1,c("Prior odds = 1:40","Prior odds = 1:10","Prior odds
= 1:5"),pch=c(15,15,15),
col=c("green","red","blue"), cex = 1)
############### Use these commands to add brackets in Figure 2
library(pBrackets)
#add text and brackets
text(1.11,(odd_1_5_2+odd_1_40_2)/2, expression(paste(italic(P)," < 0.05 threshold")),
     cex = 0.9,adj=0)
text(1.11,(odd_1_5_1+odd_1_40_1)/2, expression(paste(italic(P)," < 0.005 threshold")),
     cex = 0.9,adj=0)
brackets(1.03, odd_1_40_1, 1.03, odd_1_5_1, h = NULL, ticks = 0.5,
curvature = 0.7, type = 1,
col = 1, lwd = 1, lty = 1, xpd = FALSE)
brackets(1.03, odd_1_40_2, 1.03, odd_1_5_2, h = NULL, ticks = 0.5,
curvature = 0.7, type = 1,
col = 1, lwd = 1, lty = 1, xpd = FALSE)