Justify Your Alpha: A Response to "Redefine Statistical Significance"

Daniel Lakensabc1, Federico G. Adolfibc2, Casper J. Albersab3, Farid Anvarid4, Matthew A. J. Appsa5, Shlomo E. Argamonab6, Thom Baguleyab7, Raymond B. Beckerac8, Stephen D. Benninga9, Daniel E. Bradforda10, Erin M. Buchananab11, Aaron R. Caldwelld12, Ben van Calsterab13, Rickard Carlssond14, Sau-Chin Chena15, Bryan Chunga16, Lincoln J. Collinga17, Gary S. Collinsb18, Zander Crookab19, Emily S. Crossd20, Sameera Danielsab21, Henrik Danielssona22, Lisa DeBruinea23, Daniel J. Dunleavyab24, Brian D. Earpab25, Michele I. Feistbc26, Jason D. Ferrellab27, James G. Fieldab28, Nicholas W. Foxabc29, Amanda Friesend30, Caio Gomesd31, Monica Gonzalez-Marquezabc32, James A. Grangeabc33, Andrew P. Grieved34, Robert Guggenbergerd35, James Gristd36, Anne-Laura van Harmelenab37, Fred Hasselmanbc38, Kevin D. Hochardd39, Mark R. Hoffartha40, Nicholas P. Holmesabc41, Michael Ingreab42, Peder M. Isagerb43, Hanna K. Isotalusab44, Christer Johanssond45, Konrad Juszczykd46, David A. Kennyd47, Ahmed A. Khalilabc48, Barbara Konatd49, Junpeng Laoab50, Erik Gahner Larsena51, Gerine M. A. Lodderab52, Jiří Lukavskýd53, Christopher R. Madand54, David Manheimab55, Stephen R. Martinabc56, Andrea E. Martinab57, Deborah G. Mayod58, Randy J. McCarthya59, Kevin McConwayab60, Colin McFarland61, Amanda Q. X. Nioab62, Gustav Nilsonneab63, Cilene Lino de Oliveirab64, Jean-Jacques Orban de Xivryab65, Sam Parsonsbc66, Gerit Pfuhlab67, Kimberly A. Quinnb68, John J. Sakona69, S. Adil Saribaya70, Iris K. Schneiderab71, Manojkumar Selvarajud72, Zsuzsika Sjoerdsb73, Samuel G. Smithb74, Tim Smitsa75, Jeffrey R. Spiesb76, Vishnu Sreekumarabc77, Crystal N. Steltenpohlabc78, Neil Stenhousea79, Wojciech Świątkowskia80, Miguel A. Vadilloa81, Marcel A. L. M. Van Assenab82, Matt N. Williamsab83, Samantha E. Williamsd84, Donald R. Williamsab85, Tal Yarkonib86, Ignazio Zianod87, Rolf A. Zwaanab88

a) Participated in brainstorming. b) Participated in drafting the commentary. c) Conducted statistical analyses/data preparation. d) Did not participate in a, b, or c, because the points that they would have raised had already been incorporated into the commentary, or endorse a sufficiently large part of the contents as if participation had occurred. Except for the first author, authorship order is alphabetical.

Affiliations
1Human-Technology Interaction, Eindhoven University of Technology, Den Dolech, 5600MB, Eindhoven, The Netherlands
2Laboratory of Experimental Psychology and Neuroscience (LPEN), Institute of Cognitive and Translational Neuroscience (INCYT), INECO Foundation, Favaloro University, Pacheco de Melo 1860, Buenos Aires, Argentina
2National Scientific and Technical Research Council (CONICET), Godoy Cruz 2290, Buenos Aires, Argentina
3Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712TS Groningen, The Netherlands
4College of Education, Psychology & Social Work, Flinders University, GPO Box 2100, Adelaide, SA, 5001, Australia
5Department of Experimental Psychology, University of Oxford, New Radcliffe House, Oxford, OX2 6GG, UK
6Department of Computer Science, Illinois Institute of Technology, 10 W. 31st Street, Chicago, IL 60645, USA
7Department of Psychology, Nottingham Trent University, 50 Shakespeare Street, Nottingham, NG1 4FQ, UK
8Faculty of Linguistics and Literature, Bielefeld University, Universitätsstraße 25, 33615 Bielefeld, Germany
9Psychology, University of Nevada, Las Vegas, 4505 S. Maryland Pkwy., Box 455030, Las Vegas, NV 89154-5030, USA
10Psychology, University of Wisconsin-Madison, 1202 West Johnson St., Madison, WI 53706, USA
11Psychology, Missouri State University, 901 S. National Ave, Springfield, MO, 65897, USA
12Health, Human Performance, and Recreation, University of Arkansas, 155 Stadium Drive, HPER 321, Fayetteville, AR, 72701, USA
13Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000 Leuven, Belgium
13Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Postbus 9600, 2300 RC, Leiden, The Netherlands
14Department of Psychology, Linnaeus University, Stagneliusgatan 14, 392 34, Kalmar, Sweden
15Department of Human Development and Psychology, Tzu-Chi University, No. 67, Jieren St., Hualien City, Hualien County, 97074, Taiwan
16Department of Surgery, University of British Columbia, #301 - 1625 Oak Bay Ave, Victoria, BC, V8R 1B1, Canada
17Department of Psychology, University of Cambridge, Cambridge CB2 3EB, UK
18Centre for Statistics in Medicine, University of Oxford, Windmill Road, Oxford, OX3 7LD, UK
19Department of Psychology, The University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
20School of Psychology, Bangor University, Adeilad Brigantia, Bangor, Gwynedd, LL57 2AS, UK
21Ramsey Decision Theoretics, 4849 Connecticut Ave. NW #132, Washington, DC 20008, USA
22Department of Behavioural Sciences and Learning, Linköping University, SE-581 83, Linköping, Sweden
23Institute of Neuroscience and Psychology, University of Glasgow, 58 Hillhead Street, Glasgow, UK
24College of Social Work, Florida State University, 296 Champions Way, University Center C, Tallahassee, FL, 32304, USA
25Departments of Psychology and Philosophy, Yale University, 2 Hillhouse Ave, New Haven, CT 06511, USA
Justify Your Alpha 3
26Department of English, University of Louisiana at Lafayette, P. O. Box 43719, Lafayette LA 70504,
1
USA
2
27Department of Psychology, St. Edward's University, 3001 S. Congress, Austin, TX 78704, USA
3
27Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin,
4
TX 78712-1043, USA
5
28Department of Management, West Virginia University, 1602 University Avenue, Morgantown, WV
6
26506, USA
7
29Department of Psychology, Rutgers University, New Brunswick, 53 Avenue E, Piscataway NJ
8
08854, USA
9
30Department of Political Science, Indiana University Purdue University, Indianapolis, Indianapolis,
10
425 University Blvd CA417, Indianapolis, IN 46202, USA
11
31Booking.com, Herengracht 597, 1017 CE Amsterdam, The Nederlands
12
32Department of English, American and Romance Studies, RWTH - Aachen University, Aachen,
13
Kármánstraße 17/19, 52062 Aachen, Germany
14
33School of Psychology, Keele University, Keele, Staffordshire, ST5 5BG, UK
15
34Centre of Excellence for Statistical Innovation, UCB Celltech, 208 Bath Road, Slough, Berkshire SL1
16
3WE, UK
17
35Translational Neurosurgery, Eberhard Karls University Tübingen, Tübingen, Germany
18
35University Tübingen, International Centre for Ethics in Sciences and Humanities, Germany
19
36Department of Radiology, University of Cambridge, Box 218, Cambridge Biomedical Campus, CB2
20
0QQ, UK
21
37Department of Psychiatry, University of Cambridge, Cambridge, 18b Trumpington Road, CB2 8AH,
22
UK
23
38Behavioural Science Institute, Radboud University Nijmegen, Montessorilaan 3, 6525 HR, Nijmegen,
24
The Netherlands
25
39Department of Psychology, University of Chester, Chester, Department of Psychology, University of
26
Chester, Chester, CH1 4BJ, UK
27
40Department of Psychology, New York University, 4 Washington Place, New York, NY 10003, USA
28
41School of Psychology, University of Nottingham, Nottingham, University Park, NG7 2RD, UK
29
42None, Independent, Stockholm, Skåpvägen 5, 12245 ENSKEDE, Sweden
30
43Department of Clinical and Experimental Medicine, University of Linköping, 581 83 Linköping,,
31
Sweden
32
44School of Clinical Sciences, University of Bristol, Bristol, Level 2 academic offices, L&R Building,
33
Southmead Hospital, BS10 5NB, UK
34
45Occupational Orthopaedics and Research, Sahlgrenska University Hospital, 413 45 Gothenburg,
35
Sweden
36
46The Faculty of Modern Languages and Literatures, Institute of Linguistics, Psycholinguistics
37
Department, Adam Mickiewicz University, Al. Niepodległości 4, 61-874, Pozn, Poland
38
47Department of Psychological Sciences, University of Connecticut, Storrs, CT, Department of
39
Psychological Sciences, U-1020, Storrs, CT 06269-1020, USA
40
Justify Your Alpha 4
48Center for Stroke Research Berlin, Charité - Universitätsmedizin Berlin, Hindenburgdamm 30, 12200
1
Berlin, Germany
2
48Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1a, 04103 Leipzig,
3
Germany
4
48Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Luisenstraße 56, 10115 Berlin,
5
Germany
6
40Social Sciences, Adam Mickiewicz University, Poznań, Szamarzewskiego 89, 60-568 Poznan,
7
Poland
8
50Department of Psychology, University of Fribourg, Faucigny 2, 1700 Fribourg, Switzerland
9
51School of Politics and International Relations, University of Kent, Canterbury CT2 7NX, UK
10
52 Department of Sociology / ICS, University of Groningen, Grote Rozenstraat 31, 9712 TG Groningen,
11
The Netherlands
12
53Institute of Psychology, Czech Academy of Sciences, Hybernská 8, 11000 Prague, Czech Republic
13
54School of Psychology, University of Nottingham, Nottingham, NG7 2RD, UK
14
55Pardee RAND Graduate School, RAND Corporation, 1200 S Hayes St, Arlington, VA 22202, USA
15
56Psychology and Neuroscience, Baylor University, Waco, One Bear Place 97310, Waco TX, USA
16
57Psychology of Language Department, Max Planck Institute for Psycholinguistics, Nijmegen,
17
Wundtlaan 1, 6525XD, The Netherlands
18
57Department of Psychology, School of Philosophy, Psychology, and Language Sciences, University
19
of Edinburgh, 7 George Square, EH8 9JZ Edinburgh, UK
20
58Dept of Philosophy, Major Williams Hall, Virginia Tech, Blacksburg, VA, US
21
59Center for the Study of Family Violence and Sexual Assault, Northern Illinois University, DeKalb, IL,
22
125 President's BLVD., DeKalb, IL 60115, USA
23
60School of Mathematics and Statistics, The Open University, Milton Keynes, Walton Hall, Milton
24
Keynes MK7 6AA, UK
25
61Skyscanner, 15 Laurison Place, Edinburgh, EH3 9EN, UK
26
62School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
27
63Stress Research Institute, Stockholm University, Stockholm, Frescati Hagväg 16A, SE-10691
28
Stockholm, Sweden
29
63Department of Clinical Neuroscience, Karolinska Institutet, Nobels väg 9, SE-17177 Stockholm,
30
Sweden
31
63Department of Psychology, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
32
64Laboratory of Behavioral Neurobiology, Department of Physiological Sciences, Federal University of
33
Santa Catarina, Florianópolis, Campus Universitário Trindade, 88040900, Brazil
34
65Department of Kinesiology, KU Leuven, Leuven, Tervuursevest 101 box 1501, B-3001 Leuven,
35
Belgium
36
66Department of Experimental Psychology, University of Oxford, Oxford, UK
37
67Department of Psychology, UiT The Arctic University of Norway, Tromsø, Norway
38
68Department of Psychology, DePaul University, Chicago, 2219 N Kenmore Ave, Chicago, IL 60657,
39
USA
40
69Center for Neural Science, New York University, 4 Washington Pl Room 809, New York, NY 10003, USA
70Department of Psychology, Boğaziçi University, Bebek, 34342, Istanbul, Turkey
71Psychology, University of Cologne, Herbert-Lewin-St. 2, 50931, Cologne, Germany
72Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST); Integrated Gulf Biosystems, Riyadh, Saudi Arabia
73Cognitive Psychology Unit, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands
73Leiden Institute for Brain and Cognition, Leiden University, Leiden, The Netherlands
74Leeds Institute of Health Sciences, University of Leeds, Leeds, LS2 9NL, UK
75Institute for Media Studies, KU Leuven, Leuven, Belgium
76Center for Open Science, 210 Ridge McIntire Rd Suite 500, Charlottesville, VA 22903, USA
76Department of Engineering and Society, University of Virginia, Thornton Hall, P.O. Box 400259, Charlottesville, VA 22904, USA
77Surgical Neurology Branch, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
78Department of Psychology, University of Southern Indiana, 8600 University Boulevard, Evansville, Indiana, USA
79Life Sciences Communication, University of Wisconsin-Madison, 1545 Observatory Drive, Madison, WI 53706, USA
80Department of Social Psychology, Institute of Psychology, University of Lausanne, Quartier UNIL-Mouline, Bâtiment Géopolis, CH-1015 Lausanne, Switzerland
81Departamento de Psicología Básica, Universidad Autónoma de Madrid, c/ Ivan Pavlov 6, 28049 Madrid, Spain
82Department of Methodology and Statistics, Tilburg University, Warandelaan 2, 5000 LE Tilburg, The Netherlands
82Department of Sociology, Utrecht University, Padualaan 14, 3584 CH, Utrecht, The Netherlands
83School of Psychology, Massey University, Private Bag 102904, North Shore, Auckland, 0745, New Zealand
84Psychology, Saint Louis University, 3700 Lindell Blvd, St. Louis, MO 63108, USA
85Psychology, University of California, Davis, One Shields Ave, Davis, CA 95616, USA
86Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
87Marketing Department, Ghent University, Tweekerkenstraat 2, 9000 Ghent, Belgium
88Department of Psychology, Education, and Child Studies, Erasmus University Rotterdam, Burgemeester Oudlaan 50, 3000 DR, Rotterdam, The Netherlands
Author Note: We'd like to thank Dale Barr, Felix Cheung, David Colquhoun, Hans IJzerman, Harvey Motulsky, and Richard Morey for helpful discussions while drafting this commentary.

Funding Statement: Daniel Lakens was supported by NWO VIDI 452-17-013. Federico G. Adolfi was supported by CONICET. Matthew Apps was funded by a Biotechnology and Biological Sciences Research Council AFL Fellowship (BB/M013596/1). Gary Collins was supported by the NIHR Biomedical Research Centre, Oxford. Zander Crook was supported by the Economic and Social Research Council [grant number C106891X]. Emily S. Cross was supported by the European Research Council (ERC-2015-StG-677270). Lisa DeBruine is supported by the European Research Council (ERC-2014-CoG-647910 KINSHIP). Anne-Laura van Harmelen is funded by a Royal Society Dorothy Hodgkin Fellowship (DH150176). Mark R. Hoffarth was supported by the National Science Foundation under grant SBE SPRF-FR 1714446. Junpeng Lao was supported by the SNSF grant 100014_156490/1. Cilene Lino de Oliveira was supported by AvH, Capes, CNPq. Andrea E. Martin was supported by the Economic and Social Research Council of the United Kingdom [grant number ES/K009095/1]. Jean-Jacques Orban de Xivry is supported by an internal grant from the KU Leuven (STG/14/054) and by the Fonds voor Wetenschappelijk Onderzoek (1519916N). Sam Parsons was supported by the European Research Council (FP7/2007–2013; ERC grant agreement no. 324176). Gerine Lodder was funded by NWO VICI 453-14-016. Samuel Smith is supported by a Cancer Research UK Fellowship (C42785/A17965). Vishnu Sreekumar was supported by the NINDS Intramural Research Program (IRP). Miguel A. Vadillo was supported by Grant 2016-T1/SOC-1395 from Comunidad de Madrid. Tal Yarkoni was supported by NIH award R01MH109682.

Abstract: In response to recommendations to redefine statistical significance to p ≤ .005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.
Justify Your Alpha: A Response to "Redefine Statistical Significance"

"Tests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict."
Neyman & Pearson, 1928, p. 58.

Renewed concerns about the non-replication of scientific findings have prompted widespread debates about its underlying causes and possible solutions. As an actionable step toward improving standards of evidence for new discoveries, 72 researchers proposed changing the conventional threshold that defines "statistical significance" (i.e., the alpha level) from p ≤ .05 to p ≤ .005 for all novel claims with relatively low prior odds (Benjamin et al., 2017). They argued that this change will "immediately improve the reproducibility of scientific research in many fields" (Benjamin et al., 2017, p. 5).

Benjamin et al. (2017) provided two arguments against the current threshold for statistical significance of .05. First, a p-value of .05 provides only weak evidence for the alternative hypothesis. Second, under certain assumptions, a p-value threshold of .05 leads to a high false positive report probability (FPRP; the probability that a significant finding is a false positive, Wacholder et al., 2004; also referred to as the false positive rate, or false positive risk, Benjamin et al., 2017; Colquhoun, 2017). The authors claim that lowering the threshold for statistical significance to .005 will increase evidential strength for novel discoveries and reduce the FPRP.

We share the concerns raised by Benjamin et al. (2017) regarding the apparent non-replicability¹ of many scientific studies and appreciate their attempt to provide a concrete, easy-to-implement suggestion to improve science. We further agree that the current default alpha level of .05 is arbitrary and may result in weak evidence for the alternative hypothesis. However, we do not think that redefining the threshold for statistical significance to the lower, but equally arbitrary threshold of p ≤ .005 is advisable. In this commentary, we argue that (1) there is insufficient evidence that the current standard for statistical significance is in fact a "leading cause of non-reproducibility" (Benjamin et al., 2017, p. 5), (2) the arguments in favor of a blanket default of p ≤ .005 are not strong enough to warrant the immediate and widespread implementation of such a policy, and (3) a lower significance threshold will likely have positive and negative consequences, both of which should be carefully evaluated before any large-scale changes are proposed. We conclude with an alternative suggestion, whereby researchers justify their choice for an alpha level before collecting the data, instead of adopting a new uniform standard.

¹ We use 'replicability' to refer to the question of whether a conclusion that is sufficiently similar to an earlier study could be drawn from data obtained from a new study, and 'reproducibility' to refer to getting the same results when re-analysing the same data (Peng, 2009).

Lack of evidence that p ≤ .005 improves replicability

One of the main claims made by Benjamin et al. (2017) is that the expected proportion of studies that can be replicated will be considerably higher for studies that observe p ≤ .005 than for studies that observe .005 < p ≤ .05, due to a lower FPRP. All else being equal, we agree with Benjamin et al. (2017) that improvement in replicability is in theory related to the FPRP, and that lower alpha levels will reduce false positive results in the literature. However, it is difficult to predict how much the FPRP will change in practice, because quantifying the FPRP requires accurate estimates of several unknowns, such as the prior odds that the examined hypotheses are true, the true power of any performed experiments, and the (change in) actual behavior of researchers should the newly proposed threshold be put in place.

An analysis of the results of the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) shows that 49% (23 out of 47) of the original findings with p-values below .005 yielded p ≤ .05 in the replication study, whereas only 24% (11 out of 45) of the original studies with .005 < p ≤ .05 yielded p ≤ .05 in the replication study (χ²(1) = 5.92, p = .015, BF10 = 6.84). Benjamin et al. (2017, p. 9) presented this analysis as empirical evidence of the "potential gains in reproducibility that would accrue from the new threshold." However, as they acknowledged, their obtained p-value of .015 is only "suggestive" of such a conclusion, according to their own proposal. Moreover, there is considerable variation in replication rates across p-values (see Figure 1), with few observations in bins of size .005 for .005 < p ≤ .05. In addition, the lower replication rate for p-values just below .05 is likely confounded by p-hacking (the practice of flexibly analysing data until the p-value passes the 'significance' threshold) in the original study. This implies that at least some of the differences in replication rates between studies with .005 < p ≤ .05 compared to studies with p ≤ .005 are not due to the level of evidence per se, but rather due to other mechanisms (e.g., flexibility during data analysis). Indeed, depending on the degree of flexibility exploited by researchers, such p-hacking can be used to overcome any inferential threshold.
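The comparison reported above can be checked directly from the counts given in the text (23 of 47 original findings with p ≤ .005 versus 11 of 45 with .005 < p ≤ .05). A minimal R sketch, assuming a chi-squared test without continuity correction on that 2 × 2 table, reproduces the reported χ²(1) = 5.92 and p = .015 (the Bayes factor depends on additional modelling choices and is not reproduced here):

# Counts taken from the text: replication success (p <= .05 in the RP:P replication)
# broken down by the original study's p-value bin
replication_table <- matrix(c(23, 47 - 23,   # original p <= .005
                              11, 45 - 11),  # original .005 < p <= .05
                            nrow = 2, byrow = TRUE,
                            dimnames = list(original    = c("p <= .005", ".005 < p <= .05"),
                                            replication = c("replicated", "not replicated")))
chisq.test(replication_table, correct = FALSE)  # X-squared = 5.92, df = 1, p = .015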
Even with a p ≤ .005 threshold, only 49% of studies replicated successfully. Furthermore, only 11 out of 30 studies (37%) with .0001 < p ≤ .005 replicated at α = .05. By contrast, a prima facie more satisfactory replication success rate of 71% was obtained only for p < .0001 (12 out of 17 studies). This suggests that a relatively small number of studies with p-values much lower than .005 were largely responsible for the 49% replication rate for studies with p ≤ .005. Further analysis is needed, therefore, to explain the low replication rate of studies with p ≤ .005 before this alpha level is recommended as a new significance threshold for novel discoveries across scientific disciplines.

Figure 1. The proportion of studies (Open Science Collaboration, 2015) that replicated at α = .05 (with a bin width of 0.005). Window start and end positions are plotted on the horizontal axis. The error bars denote 95% Jeffreys confidence intervals. R code to reproduce Figure 1 is available from https://github.com/VishnuSreekumar/Alpha005
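For readers who wish to recompute error bars of the kind shown in Figure 1, the sketch below gives a minimal R implementation of a 95% Jeffreys interval for a replication proportion (the 2.5% and 97.5% quantiles of a Beta(x + 1/2, n − x + 1/2) distribution). The 23-out-of-47 bin from the text is used purely as an illustration; the per-bin counts underlying the figure are available from the repository linked in the caption.

# 95% Jeffreys interval for a binomial proportion
jeffreys_ci <- function(x, n, level = 0.95) {
  alpha <- 1 - level
  qbeta(c(alpha / 2, 1 - alpha / 2), x + 0.5, n - x + 0.5)
}
jeffreys_ci(23, 47)  # interval around the 49% replication rate for original p <= .005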
Weak justifications for the new p ≤ .005 threshold

Even though p-values close to .05 never provide strong 'evidence' against the null hypothesis on their own (Wasserstein & Lazar, 2016), the argument that p-values provide weak evidence based on Bayes factors has been called into question (Casella & Berger, 1987; Greenland et al., 2016; Senn, 2001). Redefining the alpha level as a function of the strength of relative evidence measured by the Bayes factor is undesirable, given that the marginal likelihood is very sensitive to different (somewhat arbitrary) choices for the models that are compared (Gelman et al., 2013). Benjamin et al. (2017) stated that p-values of .005 imply Bayes factors between 14 and 26, but the level of evidence depends on the model priors and the choice of hypotheses tested, and different modelling assumptions would imply a different p-value threshold. The Bayesian analysis that underlies the recommendation actually overstates the evidence against the null from the perspective of error statistics. It would, with high probability, deem an alternative highly probable, even if it is false (Mayo, 1997, 2018). Finally, Benjamin et al. (2017) provided no rationale for why the new p-value threshold should align with equally arbitrary Bayes factor thresholds representing 'substantial' or 'strong' evidence. Indeed, it has been argued that such classifications of Bayes factors themselves introduce arbitrary meaning to a continuous measure (e.g., Morey, 2015). We (even those of us prepared to use likelihoods and Bayesian approaches in lieu of p-values when interpreting results) caution against the idea that the alpha level at which an error rate is controlled should be based on the amount of relative evidence indicated by a Bayes factor. Extending Morey, Wagenmakers, and Rouder (2016), who argued against the frequentist calibration of Bayes factors, we argue against the necessity of a Bayesian calibration of error rates.

The second argument Benjamin et al. (2017) provided for p ≤ .005 is that the FPRP can be high with α = .05. To calculate the FPRP one needs to define the alpha level, the power of the tests that examine true effects, and the ratio of true to false hypotheses tested (the prior odds). The FPRP is only problematic when a high proportion of examined hypotheses are false, and thus Benjamin et al. (2017, p. 10) stated that their "recommendation applies to disciplines with prior odds broadly in the range depicted in Figure 2." Their Figure 2 displays FPRPs for scenarios where many examined hypotheses are false, with ratios of true to false hypotheses (i.e., prior odds) of 1 to 5, 1 to 10, and 1 to 40. Benjamin et al. (2017) recommended p ≤ .005 because this threshold reduces the minimum FPRP to less than 5%, assuming 1 to 10 prior odds of examining a true hypothesis (the true FPRP might still be substantially higher in studies with very low power). This estimate of the prior odds is based on data from the RP:P (Open Science Collaboration, 2015) using an analysis that modelled publication bias for 73 studies (Johnson et al., 2017; see also Ingre, 2016, for a more conservative estimate). Without stating the reference class for the 'base-rate of true nulls' (i.e., does this refer to all hypotheses in science, in a discipline, or by a single researcher?), the concept of 'prior odds that H1 is true' has little meaning in practice. The modelling effort by Johnson et al. (2017) ignored practices that inflate error rates (e.g., p-hacking) and thus likely does not provide an accurate estimate of bias, given the prevalence of such practices (Fiedler & Schwarz, 2016; John et al., 2012). An estimate of the prior probability that a hypothesis is true, similar to that of Johnson et al. (2017), was derived from 92 participants' subjective ratings of the prior probability that the alternative hypothesis was true for 44 studies included in the RP:P (Dreber et al., 2015). As Dreber et al. (2015, p. 15345) noted, "This relatively low average prior may reflect [the fact] that top psychology journals focus on publishing surprising findings, i.e., positive findings on relatively unlikely hypotheses." These observations imply that there are not sufficient representative data to accurately estimate the prior odds that researchers examine a true hypothesis, and thus, there is currently no strong argument based on the FPRP to redefine statistical significance to p ≤ .005.
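The arithmetic behind the FPRP argument is easy to make explicit. In the sketch below (a simple illustration using the standard FPRP expression, with power and prior odds as assumed inputs rather than estimates of our own), prior odds of 1 to 10 and α = .005 yield an FPRP just below 5% only when power is essentially perfect, and a much higher FPRP when power is low, consistent with the caveat above:

# FPRP = alpha * P(H0) / (alpha * P(H0) + power * P(H1)),
# with the prior odds of a true hypothesis expressed as 1 : odds_false
fprp <- function(alpha, power, odds_false) {
  p_h1 <- 1 / (1 + odds_false)
  p_h0 <- 1 - p_h1
  (alpha * p_h0) / (alpha * p_h0 + power * p_h1)
}
fprp(alpha = 0.005, power = 1.0, odds_false = 10)  # ~0.048: the minimum FPRP, just below 5%
fprp(alpha = 0.005, power = 0.2, odds_false = 10)  # ~0.20: substantially higher with low power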
Ways in which a threshold of p ≤ .005 might harm scientific practice

Benjamin et al. (2017) acknowledged that lowering the p-value threshold will not ameliorate other practices that negatively impact the replicability of research findings (such as p-hacking, publication bias, and low power). Yet, they did not address ways in which a p ≤ .005 threshold might harm scientific practice. Chief among our concerns are (1) a reduction in the number of replication studies that can be conducted if such a threshold is adopted, (2) a concomitant reduction in generalisability and breadth of research findings due to a likely increased reliance on convenience samples, and (3) exacerbation of an already exaggerated focus on single p-values.

Risk of fewer replication studies. Replication studies are central to generating reliable scientific knowledge, especially when conclusions are largely based on p-values. As Fisher (1926, p. 85) noted: "A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance." Replication studies are at the heart of scientific progress. In the field of medicine, for example, the FDA requires two independent pre-registered clinical trials, both significant with p ≤ .05, before issuing marketing approval for new drugs (for a discussion, see Senn, 2007, p. 188). Researchers have limited resources, and when studies require larger sample sizes scientists will have to decide what research they will invest in. Achieving 80% power with α = .005, compared to α = .05, will require a 70% larger sample size in a between-subjects design with a two-sided test (and an 88% larger sample size for one-sided tests). This means that researchers can complete almost two studies each powered at α = .05 (e.g., one novel study and one replication study), or only one study powered at α = .005. Therefore, at a time when replication studies are rare, lowering the alpha level to .005 might reduce the number of replication studies. Indeed, general recommendations for evidence thresholds need to carefully balance statistical and non-statistical considerations (e.g., the value of evidence per novel study vs. the value of independent replications).
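The sample-size figures in the previous paragraph can be verified with standard power-analysis tools. A minimal R sketch using power.t.test is shown below; the standardized effect size of 0.5 is an arbitrary illustrative choice, and the resulting ratio of sample sizes is essentially unchanged for other effect sizes:

# Per-group n for 80% power in a two-sided, two-sample t-test at alpha = .05 versus alpha = .005
n_05  <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,  power = 0.80)$n  # ~64 per group
n_005 <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.005, power = 0.80)$n  # ~108 per group
n_005 / n_05  # ~1.7, i.e., roughly a 70% larger sample size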
Risk of reduced generalisability and breadth. All things equal, larger sample sizes increase the informational value of studies, but requiring larger sample sizes across all scientific disciplines would potentially compound problems with over-reliance on convenience samples (such as undergraduate students or Mechanical Turk workers). Lowering the significance threshold could adversely affect the type and breadth of research questions examined if it is done without (1) increased funding, (2) a reward system that values large-scale collaboration, or (3) clear recommendations for how to evaluate research with lower evidential value due to sample size constraints. Achieving a lower p-value in studies with unique populations (e.g., people with rare genetic variants, people diagnosed with post-traumatic stress disorder) or in studies with time- or otherwise resource-intensive data collection (e.g., longitudinal studies) requires exponentially more effort than increasing the amount of evidence in studies that use undergraduate students or Mechanical Turk workers. Thus, researchers may become less motivated, or even tacitly discouraged, to study the former populations or collect those types of data. Hence, lowering the alpha threshold may indirectly reduce the generalisability and breadth of findings (Peterson & Merunka, 2014).

Risk of exaggerating the focus on single p-values. If anything, an excessive focus on p-value thresholds has the potential to mask or even discourage opportunities for more fruitful changes in scientific practice and education. Many researchers have come to recognise p-hacking, low power, and publication bias as more important reasons for non-replication. Benjamin et al. (2017) acknowledged that changing the threshold could be considered a distraction from other solutions, and yet their proposal risks reinforcing the idea that relying only on p-values is a sufficient, if imperfect, way to evaluate findings. The proposed p ≤ .005 threshold is not intended as a publication threshold. However, given the long history of misuse of statistical recommendations, there is a substantial risk that redefining p ≤ .005 as 'statistically significant' will increase publication bias, which, in turn, would bias effect size estimates upwards to an even greater extent (Lane & Dunlap, 1978). As such, Benjamin et al.'s recommendation could divert attention from the burgeoning movement towards a more cumulative evaluation of findings, where the converging results of multiple studies are taken into account when addressing specific research questions. Examples of such approaches are multiple replications (both registered and multi-lab; see, e.g., Hagger et al., 2016), continuously updating meta-analyses (Braver et al., 2014), p-curve analysis (Simonsohn et al., 2014), and pre-registration of studies.
No one alpha to rule them all

Benjamin et al. (2017) recommended that only p-values lower than .005 should be called 'statistically significant' and that studies should generally be designed with α = .005. Our recommendation is similarly twofold. First, when describing results, we recommend that the label 'statistically significant' simply no longer be used. Instead, researchers should provide a more meaningful interpretation (Eysenck, 1960). While p-values can inspire statements about the probability of data (e.g., 'the observed difference in the data was surprisingly large, assuming the null hypothesis is true'), they should not be treated as indices that, on their own, signify evidence for a theory.

Second, when designing studies, we propose that authors transparently specify their design choices. These include (where applicable) the alpha level, the null and alternative models, assumed prior odds, statistical power for a specified effect size of interest, the sample size, and/or the desired accuracy of estimation. Without imposing a single value on any of these design parameters, we ask authors to justify their choices before the data are collected. Fellow researchers can evaluate these decisions on their merits and discuss how appropriate they are for a specific research question, and whether the conclusions follow from the study design. Ideally, this evaluation process occurs prior to data collection, when reviewing a Registered Report submission (Chambers, Dienes, McIntosh, Rotshtein, & Willmes, 2015). Providing researchers (and reviewers) with accessible information on ways to justify (and evaluate) these design choices, tailored to specific research areas, would improve current research practices.

The optimal alpha level will sometimes be lower and sometimes be higher than the current convention of .05 (see Field, Tyre, Jonzén, Rhodes, & Possingham, 2004; Grieve, 2015; Mudge, Baker, Edge, & Houlahan, 2012; Pericchi & Pereira, 2016). Some fields, such as genomics and physics, have lowered the alpha level. However, in genomics the overall false positive rate is still controlled at 5%; the lower alpha level is only used to correct for multiple comparisons (Storey & Tibshirani, 2003). In physics, a five sigma threshold (p ≤ 2.87 × 10⁻⁷) is required to publish an article with 'discovery of' in the title, with less stringent alpha levels being used for article titles with 'evidence for' or 'measurement of' (Franklin, 2014). Physics researchers have also argued against a blanket rule, proposing instead that the alpha level be set based on factors such as how surprising the result would be and how much practical or theoretical impact the discovery would have (Lyons, 2013). In non-human animal research, minimising the number of animals used needs to be directly balanced against the probability of false positives; other trade-offs may be relevant in other areas. Thus, a broadly applied p ≤ .005 threshold will rarely be optimal.

Benjamin et al. (2017, p. 5) stated that a "critical mass of researchers" now endorse the standard of a p ≤ .005 threshold for "statistical significance." However, the presence of a critical mass can only be identified after a norm or practice has been widely adopted, not before. Even if a p ≤ .005 threshold were widely endorsed, this would only reinforce the flawed idea that a single alpha level is universally applicable. Ideally, the decision of where to set the alpha level for a study should be based on statistical decision theory, where costs and benefits are compared against a utility function (Neyman & Pearson, 1933; Skipper, Guenther, & Nass, 1967). Such an analysis can be expected to differ based on the type of study being conducted: for example, analysis of a large existing dataset versus primary data collection relying on hard-to-obtain samples. Science is necessarily diverse, and it is up to scientists within specific fields to justify the alpha level they decide to use. To quote Fisher (1956, p. 42): "...no scientific worker has a fixed level of significance at which, from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas."
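As a concrete illustration of this kind of decision-theoretic reasoning, the sketch below (in the spirit of Mudge, Baker, Edge, and Houlahan, 2012, but with entirely hypothetical costs, prior probability, sample size, and effect size) selects the alpha level that minimizes the expected cost of Type I and Type II errors for a fixed two-sample design. Changing any of these inputs changes the resulting 'optimal' alpha, which is precisely why no single default can serve every study:

# Expected error cost at a given alpha for a fixed design (all inputs hypothetical)
expected_cost <- function(alpha, n = 50, delta = 0.5, p_h1 = 0.5,
                          cost_fp = 1, cost_fn = 1) {
  power <- power.t.test(n = n, delta = delta, sd = 1, sig.level = alpha)$power
  cost_fp * alpha * (1 - p_h1) + cost_fn * (1 - power) * p_h1
}
# Alpha minimizing the expected cost under these assumptions
optimize(expected_cost, interval = c(1e-5, 0.2))$minimum
# If false positives are deemed four times as costly as false negatives, the optimum shifts down
optimize(expected_cost, interval = c(1e-5, 0.2), cost_fp = 4, cost_fn = 1)$minimum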
Conclusion

It is laudable that Benjamin et al. (2017) suggested a concrete step designed to immediately improve science. However, it is not clear that lowering the significance threshold to p ≤ .005 will in practice amount to an improvement in replicability that is worth the potential costs. Instead of simple heuristics and an arbitrary blanket threshold, research should be guided by principles of rigorous science (Casadevall & Fang, 2016; LeBel, Vanpaemel, McCarthy, Earp, & Elson, 2017; Meehl, 1990). These principles include not only sound statistical analyses, but also experimental redundancy (e.g., replication, validation, and generalisation), avoidance of logical traps, intellectual honesty, research workflow transparency, and full accounting for potential sources of error. Single studies, regardless of their p-value, are never enough to conclude that there is strong evidence for a theory. We need to train researchers to recognise what cumulative evidence looks like and work towards an unbiased scientific literature.

Although we agree with Benjamin et al. (2017) that the relatively high rate of non-replication in the scientific literature is a cause for concern, we do not believe that redefining statistical significance is a desirable solution: (1) there is not enough evidence that a blanket threshold of p ≤ .005 will improve replication sufficiently to be worth the additional cost in data collection, (2) the justifications given for the new threshold are not strong enough to warrant the widespread implementation of such a policy, and (3) there are realistic concerns that a p ≤ .005 threshold will have negative consequences for science, which should be carefully examined before a change in practice is instituted. Instead of a narrower focus on p-value thresholds, we call for a broader mandate whereby all justifications of key choices in research design and statistical practice are pre-registered whenever possible, fully accessible, and transparently evaluated.
References

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22). Redefine statistical significance. https://doi.org/10.17605/OSF.IO/MKY9J
Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9(3), 333–342. https://doi.org/10.1177/1745691614529796
Casella, G., & Berger, R. L. (1987). Testing precise hypotheses: Comment. Statistical Science, 2(3), 344–347. https://doi.org/10.1214/ss/1177013243
Casadevall, A., & Fang, F. C. (2016). Rigorous science: A how-to guide. mBio, 7(6), e01902-16. https://doi.org/10.1128/mBio.01902-16
Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered Reports: Realigning incentives in scientific publishing. Cortex, 66, A1–A2. https://doi.org/10.1016/j.cortex.2015.03.022
Colquhoun, D. (2017). The reproducibility of research and the misinterpretation of p-values. bioRxiv, 144337. https://doi.org/10.1101/144337
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., … Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343–15347. https://doi.org/10.1073/pnas.1516179112
Eysenck, H. J. (1960). The concept of statistical significance and the controversy about one-tailed tests. Psychological Review, 67(4), 269–271. http://dx.doi.org/10.1037/h0048412
Fiedler, K., & Schwarz, N. (2016). Questionable research practices revisited. Social Psychological and Personality Science, 7(1), 45–52. http://doi.org/10.1177/1948550615612150
Field, S. A., Tyre, A. J., Jonzen, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33, 503–513.
Fisher, R. A. (1956). Statistical methods and scientific inference. New York: Hafner.
Franklin, A. (2014). Shifting standards: Experiments in particle physics in the twentieth century. University of Pittsburgh Press.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Grieve, A. P. (2015). How to test hypotheses if you must. Pharmaceutical Statistics, 14(2), 139–150. https://doi.org/10.1002/pst.1667
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11, 546–573. https://doi.org/10.1177/1745691616652873
Ingre, M. (2016). Recent reproducibility estimates indicate that negative evidence is observed over 30 times before publication. arXiv preprint arXiv:1605.06414. https://arxiv.org/abs/1605.06414
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23(5), 524–532. https://doi.org/10.2139/ssrn.1996631
Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517), 1–10. https://doi.org/10.1080/01621459.2016.1240079
Koole, S. L., & Lakens, D. (2012). Rewarding replications: A sure and simple way to improve psychological science. Perspectives on Psychological Science, 7(6), 608–614. https://doi.org/10.1177/1745691612462586
Lakens, D. (2015). On the challenges of drawing conclusions from p-values just below 0.05. PeerJ, 3, e1142. https://doi.org/10.7717/peerj.1142
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31(2), 107–112. https://doi.org/10.1111/j.2044-8317.1978.tb00578.x
LeBel, E. P., Vanpaemel, W., McCarthy, R. J., Earp, B. D., & Elson, M. (2017). A unified framework to quantify the trustworthiness of empirical research. https://doi.org/10.17605/OSF.IO/UWMR8
Lyons, L. (2013). Discovering the significance of 5 sigma. arXiv preprint arXiv:1310.1284.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141. https://doi.org/10.1207/s15327965pli0102_1
Mayo, D. (1997). Error statistics and learning from error: Making a virtue of necessity. Philosophy of Science, 64 (Part II: Symposia Papers), S195–S212.
Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Morey, R. (2015). On verbal categories for the interpretation of Bayes factors [Blog post]. BayesFactor. https://bayesfactor.blogspot.nl/2015/01/on-verbal-categories-for-interpretation.html
Morey, R. D., Wagenmakers, E.-J., & Rouder, J. N. (2016). Calibrated Bayes factors should not be used: A reply to Hoijtink, van Kooten, and Hulsker. Multivariate Behavioral Research, 51(1), 11–19. http://dx.doi.org/10.1080/00273171.2015.1052710
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 175–240. https://doi.org/10.2307/2331945
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. http://doi.org/10.1177/1745691612459058
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Peng, R. D. (2009). Reproducible research and biostatistics. Biostatistics, 10(3), 405–408. https://doi.org/10.1093/biostatistics/kxp014
Pericchi, L., & Pereira, C. (2016). Adaptative significance levels using optimal decision rules: Balancing by weighting the error probabilities. Brazilian Journal of Probability and Statistics, 30(1), 70–90. https://doi.org/10.1214/14-BJPS257
Peterson, R. A., & Merunka, D. R. (2014). Convenience samples of college students and research reproducibility. Journal of Business Research, 67(5), 1035–1041. https://doi.org/10.1016/j.jbusres.2013.08.010
Senn, S. (2001). Two cheers for p-values? Journal of Epidemiology and Biostatistics, 6, 193–204. https://doi.org/10.1080/135952201753172953
Senn, S. (2007). Statistical issues in drug development (2nd ed.). Chichester, England; Hoboken, NJ: John Wiley & Sons.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681. https://doi.org/10.1177/1745691614553988
Skipper, J. K., Guenther, A. L., & Nass, G. (1967). The sacredness of .05: A note concerning the uses of statistical levels of significance in social science. The American Sociologist, 2(1), 16–18.
Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16), 9440–9445. https://doi.org/10.1073/pnas.1530509100
Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., & Rothman, N. (2004). Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. Journal of the National Cancer Institute, 96, 434–442. https://doi.org/10.1093/jnci/djh075
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
... Others, however, have argued against the move to reduce α. In a reply to Benjamin et al. signed by 88 authors, Lakens et al. noted that a reduction in α would also have various negative consequences [34]. Perhaps most importantly, decreasing α would decrease statistical power and thereby increase the rate of false negatives (FNs)-that is, the proportion of studies that fail to find conclusive evidence for an effect that actually is present [14,35]. ...
... The current debate about α, however, illustrates the complexity of determining its optimal value [39,40]. Indeed, there are good reasons to believe that no single α level is optimal for all research contexts [34], and in some contexts there are strong arguments for increasing the α level to a value larger than 0.05 [41]. At this point, the only agreement concerning the choice of α level is that researchers within a given area should make it carefully-but how are they to do that? ...
... It is important to examine these total payoffs to understand which research scenario parameters must be considered and to see how the size of the payoff is jointly determined by the various parameter values. There is wide agreement that scientists in any field should consider their α levels carefully [33,34,44,45], and it seems essential to use an objective formalism to compare the expected scientific payoffs of different α levels. ...
Article
Full-text available
Researchers who analyze data within the framework of null hypothesis significance testing must choose a critical “alpha” level, α, to use as a cutoff for deciding whether a given set of data demonstrates the presence of a particular effect. In most fields, α = 0.05 has traditionally been used as the standard cutoff. Many researchers have recently argued for a change to a more stringent evidence cutoff such as α = 0.01, 0.005, or 0.001, noting that this change would tend to reduce the rate of false positives, which are of growing concern in many research areas. Other researchers oppose this proposed change, however, because it would correspondingly tend to increase the rate of false negatives. We show how a simple statistical model can be used to explore the quantitative tradeoff between reducing false positives and increasing false negatives. In particular, the model shows how the optimal α level depends on numerous characteristics of the research area, and it reveals that although α = 0.05 would indeed be approximately the optimal value in some realistic situations, the optimal α could actually be substantially larger or smaller in other situations. The importance of the model lies in making it clear what characteristics of the research area have to be specified to make a principled argument for using one α level rather than another, and the model thereby provides a blueprint for researchers seeking to justify a particular α level.
... We repeated this procedure 1000 times and the a posteriori power is the percentage of how often a focal coefficient was significant in 1000 repetitions. Given the high statistical power we had, we also report standardized p-value p stan in addition to significant p-values (Good, 1982;Lakens, 2018). P stan takes the p-value and multiplies it by the square root of the number of participants or observations to adjust alpha due to the high power . ...
Article
Full-text available
While prior research has found mindfulness to be linked with emotional responses to events, less is known about this effect in a non-clinical sample. Even less is known regarding the mechanisms of the underlying processes: It is unclear whether participants who exhibit increased acceptance show decreased emotional reactivity (i.e., lower affective responses towards events overall) or a speedier emotional recovery (i.e., subsequent decrease in negative affect) due to adopting an accepting stance. To address these questions, we re-analyzed two Ambulatory Assessment data sets. The first (NStudy1 = 125) was a six-week randomized controlled trial (including a 40-day ambulatory assessment); the second (NStudy2 = 175) was a one-week ambulatory assessment study. We found state mindfulness to be more strongly associated with emotional reactivity than with recovery, and that only emotional reactivity was significantly dampened by mindfulness training. Regarding the different facets of mindfulness, we found that the strongest predictor of both emotional reactivity and recovery was non-judgmental acceptance. Finally, we found that being aware of one's own thoughts and behavior could be beneficial or detrimental for emotional recovery, depending on whether participants accepted their thoughts and emotions. Together, these findings provide evidence for predictions derived from the monitoring and acceptance theory. This article is protected by copyright. All rights reserved.
... Various remedies for this particular problem exist, one being an application of the simple Bonferroni correction, which amounts to lowering the significance threshold α-commonly .05, but see for example Benjamin et al. (2018) and Lakens et al. (2018)-to α/m, where m is the number of hypotheses tested. This procedure is not systematically applied in NLG, although the awareness of the issues with multiple comparisons is increasing. ...
... A proposed solution to improve the replicability of psychological science is to use a lower significance threshold before concluding a finding to be significant, especially with regard to novel claims and in fields where less than half of all studies are expected to reflect a real effect 11 . However, experts still disagree about whether the significance level of 0.05 is the leading cause of the non-replicability and whether a lower (but still fixed) threshold will solve the problem without undesired negative consequences (Benjamin et al., 2018;Lakens et al., 2018). ...
Article
Full-text available
Research on money priming typically investigates whether exposure to money-related stimuli can affect people's thoughts, feelings, motivations, and behaviors (for a review, see Vohs, 2015). Our study answers the call for a comprehensive meta-analysis examining the available evidence on money priming (Vadillo, Hardwicke, & Shanks, 2016). By conducting a systematic search of published and unpublished literature on money priming, we sought to achieve three key goals. First, we aimed to assess the presence of biases in the available published literature (e.g., publication bias). econd, in the case of such biases, we sought to derive a more accurate estimate of the effect size after correcting for these biases. Third, we aimed to investigate whether design factors such as prime type and study setting moderated the money priming effects. Our overall meta-analysis included 246 suitable experiments and showed a significant overall effect size estimate (Hedges' g = .31, 95% CI [0.26, 0.36]). However, publication bias and related biases are likely given the asymmetric funnel plots, Egger's test and two other tests for publication bias. Moderator analyses offered insight into the variation of the money priming effect, suggesting for various types of study designs whether the effect was present, absent, or biased. We found the largest money priming effect in lab studies investigating a behavioral dependent measure using a priming technique in which participants actively handled money. Future research should use sufficiently powerful preregistered studies to replicate these findings.
Article
Full-text available
Negative foreign direct investment (divestment) between countries has received little attention in international macroeconomics. This is the first country-level study to investigate whether conventional drivers of bilateral foreign direct investment (FDI) have a reverse, but symmetric, impact on foreign direct divestment (FDD). Using bilateral FDI data between 126 countries, from 2005 to 2018, we find that, whereas some of the same variables are relevant, the view that what deters FDI encourages FDD, and vice versa, is not supported by our empirical findings.
Chapter
In process engineering, teacher‐researchers are confronted with the lack of a framework for building their methods, both in terms of data production and analysis. This chapter proposes training methods for process engineering and identifies criteria to ensure the scientificity of the training methods. The relevance of the training objectives is obviously a very important step in a training process. The chapter discusses the impact of a training course as a kind of product of quality of objectives, pedagogical efficiency, and quality of the transfer of acquired skills. It is possible to take stock of training in process engineering in France, Europe, and the world. The chapter shows the significant number of training courses in process engineering around the world, the advance of Anglo‐Saxon universities, and the rise of Asia. The main axis of development of higher education institutions and engineering grandes ecoles logically concerns the disciplines and contents of training courses.
Article
Transcranial magnetic stimulation (TMS) over human primary somatosensory cortex (S1) does not produce a measurable output, so researchers must rely on indirect methods to position the TMS coil. The 'gold standard' is to use individual functional and structural magnetic resonance imaging (MRI) data, but most studies have not used this method. Instead, the most common method used to locate the hand area of S1 (S1-hand) is to move the coil posteriorly from the hand area of M1. Yet S1-hand is not directly posterior to M1-hand. Here, we addressed the localisation of S1-hand in four ways. First, we re-analysed functional MRI data from 20 participants who received vibrotactile stimulation to their 10 digits. Second, to assist in localising S1-hand and M1-hand without MRI data, we constructed a probabilistic atlas of the central sulcus from 100 healthy adult MRIs and measured the likely scalp location of S1-index. Third, we conducted two novel experiments mapping the effects of TMS across the scalp on tactile discrimination performance. Fourth, we examined all available MRI data from our laboratory on the scalp location of S1-index. Contrary to the prevailing method, and consistent with the systematic review, S1-index is close to the C3/C4 electroencephalography (EEG) electrode locations on the scalp, approximately 7-8 cm lateral to the vertex, and approximately 2 cm lateral and 0.5 cm posterior to the M1-FDI scalp location. These results suggest that an immediate revision of the most commonly used heuristic to locate S1-hand is required, and that the results of many TMS studies of S1-hand need reassessment.
Article
More and more psychological researchers have come to appreciate the perils of common but poorly justified research practices and are rethinking commonly held standards for evaluating research. As this methodological reform expresses itself in psychological research, peer reviewers of such work must also adapt their practices to remain relevant. Reviewers of journal submissions wield considerable power to promote methodological reform, and thereby contribute to the advancement of a more robust psychological literature. We describe concrete practices that reviewers can use to encourage transparency, intellectual humility, and more valid assessments of the methods and statistics reported in articles.
Article
The dominant paradigm for inference in psychology is a null-hypothesis significance testing one. Recently, the foundations of this paradigm have been shaken by several notable replication failures. One recommendation to remedy the replication crisis is to collect larger samples of participants. We argue that this recommendation misses a critical point, which is that increasing sample size will not remedy psychology’s lack of strong measurement, lack of strong theories and models, and lack of effective experimental control over error variance. In contrast, there is a long history of research in psychology employing small-N designs that treats the individual participant as the replication unit, which addresses each of these failings, and which produces results that are robust and readily replicated. We illustrate the properties of small-N and large-N designs using a simulated paradigm investigating the stage structure of response times. Our simulations highlight the high power and inferential validity of the small-N design, in contrast to the lower power and inferential indeterminacy of the large-N design. We argue that, if psychology is to be a mature quantitative science, then its primary theoretical aim should be to investigate systematic, functional relationships as they are manifested at the individual participant level and that, wherever possible, it should use methods that are optimized to identify relationships of this kind.
Article
We wish to answer this question: If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3:1. This is far weaker evidence than the odds of 19 to 1 that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. If you observe p = 0.001 in a well-powered experiment, it gives a likelihood ratio of almost 100:1 odds on there being a real effect. That would usually be regarded as conclusive. But the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. And, in this case, if you wanted to achieve a false positive risk of 5% you would need to observe p = 0.00045. It is recommended that the terms ‘significant’ and ‘non-significant’ should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value. Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as ‘statistically significant’. This practice must contribute to the lack of reproducibility in some areas of science. This is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.
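One way to reproduce this kind of calculation is the Python sketch below. The assumed design (n = 16 per group and a true effect of one standard deviation, roughly 78% power at alpha = .05) is an illustration rather than a value taken from the article, so the printed numbers only approximate the figures quoted above.

    # Sketch of a "p-equals" false positive risk calculation for a two-sided,
    # two-sample t-test. The design (n = 16 per group, true effect d = 1) is an
    # illustrative assumption, not a value taken from this article.
    import numpy as np
    from scipy import stats

    def false_positive_risk(p_obs, prior, n=16, d=1.0):
        """P(H0 is true | p equals p_obs), for a given prior probability of a real effect."""
        df = 2 * n - 2
        ncp = d * np.sqrt(n / 2)                 # noncentrality parameter under H1
        t_obs = stats.t.ppf(1 - p_obs / 2, df)   # t-value corresponding to p_obs
        # Likelihood ratio: density of the observed t under H1 versus under H0.
        like_h1 = stats.nct.pdf(t_obs, df, ncp) + stats.nct.pdf(-t_obs, df, ncp)
        like_h0 = 2 * stats.t.pdf(t_obs, df)
        lr = like_h1 / like_h0
        fpr = (1 - prior) / ((1 - prior) + prior * lr)
        return fpr, lr

    fpr, lr = false_positive_risk(p_obs=0.05, prior=0.5)
    print(f"p = 0.05, prior 0.5 : LR ~ {lr:.1f}:1, false positive risk ~ {fpr:.0%}")
    fpr, lr = false_positive_risk(p_obs=0.001, prior=0.1)
    print(f"p = 0.001, prior 0.1: LR ~ {lr:.0f}:1, false positive risk ~ {fpr:.0%}")

Under these assumptions the likelihood ratio at p = 0.05 with a 50:50 prior comes out near 3:1 and the false positive risk near a quarter, consistent with the order of magnitude described in the abstract; the exact figures depend on the assumed sample size and effect size.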
Article
Proposals to improve the reproducibility of biomedical research have emphasized scientific rigor. Although the word “rigor” is widely used, there has been little specific discussion as to what it means and how it can be achieved. We suggest that scientific rigor combines elements of mathematics, logic, philosophy, and ethics. We propose a framework for rigor that includes redundant experimental design, sound statistical analysis, recognition of error, avoidance of logical fallacies, and intellectual honesty. These elements lead to five actionable recommendations for research education.
Article
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a re-analysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested non-null effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of non-reproducibility. The results of this re-analysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false.
Article
Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so, and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.
Article
Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results. It capitalizes on the fact that the distribution of significant p-values, the p-curve, is a function of the true underlying effect. Researchers armed only with sample sizes and test results of the published findings can correct for publication bias. We validate the technique with simulations and by reanalyzing data from the Many-Labs Replication project. We demonstrate that p-curve can arrive at conclusions opposite to those of existing tools by reanalyzing the meta-analysis of the “choice overload” literature.
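As a rough illustration of the idea that the shape of the p-curve depends only on the true underlying effect, the Python sketch below computes the expected distribution of statistically significant p-values for a two-sided, two-sample t-test. The effect sizes and the per-group sample size are arbitrary assumptions; this is not the p-curve estimation procedure itself, only the forward calculation it builds on.

    # Sketch of the expected p-curve: the distribution of statistically
    # significant p-values from a two-sided, two-sample t-test as a function
    # of the true effect. Effect sizes and n are illustrative assumptions.
    import numpy as np
    from scipy import stats

    def power_at(alpha, d, n):
        """Power of a two-sided two-sample t-test at significance level alpha."""
        df, ncp = 2 * n - 2, d * np.sqrt(n / 2)
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

    def expected_p_curve(d, n=30, cutoffs=(0.01, 0.02, 0.03, 0.04, 0.05)):
        """Expected share of *significant* p-values that fall below each cutoff."""
        sig = power_at(0.05, d, n)
        return {c: round(power_at(c, d, n) / sig, 2) for c in cutoffs}

    print("d = 0.0:", expected_p_curve(0.0))  # uniform: 0.2, 0.4, ..., 1.0
    print("d = 0.5:", expected_p_curve(0.5))  # right-skewed: small p over-represented

Under a null effect the significant p-values are uniform over (0, .05], whereas a true effect concentrates them near zero (a right-skewed p-curve); the published method works backwards from observed significant p-values to the effect size that best explains their distribution.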
Article
The Open Science Collaboration recently reported that 36% of published findings from psychological studies were reproducible by their independent team of researchers. We can use this information to estimate the statistical power needed to produce these findings under various assumptions about prior probabilities and type-1 error rates, and to calculate the expected distribution of positive and negative evidence. We can then compare this distribution to observations indicating that 90% of published findings in the psychological literature are statistically significant and support the authors' hypothesis, to obtain an estimate of publication bias. This estimate indicates that a negative result was expected to be observed 30-200 times before one was published, assuming plausible priors.
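The arithmetic behind this estimate can be reproduced with a short back-of-the-envelope calculation. The priors and power values below are illustrative assumptions, and the sketch assumes that all positive results are published, so it only approximates the 30-200 range reported above.

    # Back-of-the-envelope sketch of the publication-bias estimate described
    # above. The priors and power values are illustrative assumptions, and the
    # sketch assumes that every positive (significant) result is published.
    ALPHA = 0.05
    PUBLISHED_POSITIVE_RATE = 0.90   # ~90% of published findings are positive

    def negatives_per_published_negative(prior, power, alpha=ALPHA):
        positive = prior * power + (1 - prior) * alpha   # share of studies significant
        negative = 1 - positive                          # share non-significant
        # Fraction of negative results that must be published so that negatives
        # make up 10% of the published record:
        published_fraction = ((1 - PUBLISHED_POSITIVE_RATE) / PUBLISHED_POSITIVE_RATE
                              * positive / negative)
        return 1 / published_fraction                    # negatives produced per one published

    for prior, power in [(0.5, 0.5), (0.3, 0.4), (0.1, 0.3)]:
        ratio = negatives_per_published_negative(prior, power)
        print(f"prior = {prior}, power = {power}: ~{ratio:.0f} negatives per published negative")

Plugging in less optimistic priors or lower power pushes the ratio towards the upper end of the range, which is the sense in which the published record can hide dozens to hundreds of unpublished negative results behind each published one.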