Justify Your Alpha: A Response to “Redefine Statistical Significance”

Daniel Lakensabc1, Federico G. Adolfibc2, Casper J. Albersab3, Farid Anvarid4, Matthew A. J. Appsa5, Shlomo E. Argamonab6, Thom Baguleyab7, Raymond B. Beckerac8, Stephen D. Benninga9, Daniel E. Bradforda10, Erin M. Buchananab11, Aaron R. Caldwelld12, Ben van Calsterab13, Rickard Carlssond14, Sau-Chin Chena15, Bryan Chunga16, Lincoln J Collinga17, Gary S. Collinsb18, Zander Crookab19, Emily S. Crossd20, Sameera Danielsab21, Henrik Danielssona22, Lisa DeBruinea23, Daniel J. Dunleavyab24, Brian D. Earpab25, Michele I. Feistbc26, Jason D. Ferrellab27, James G. Fieldab28, Nicholas W. Foxabc29, Amanda Friesend30, Caio Gomesd31, Monica Gonzalez-Marquezabc32, James A. Grangeabc33, Andrew P. Grieved34, Robert Guggenbergerd35, James Gristd36, Anne-Laura van Harmelenab37, Fred Hasselmanbc38, Kevin D. Hochardd39, Mark R. Hoffartha40, Nicholas P. Holmesabc41, Michael Ingreab42, Peder M. Isagerb43, Hanna K. Isotalusab44, Christer Johanssond45, Konrad Juszczykd46, David A. Kennyd47, Ahmed A. Khalilabc48, Barbara Konatd49, Junpeng Laoab50, Erik Gahner Larsena51, Gerine M. A. Lodderab52, Jiří Lukavskýd53, Christopher R. Madand54, David Manheimab55, Stephen R. Martinabc56, Andrea E. Martinab57, Deborah G. Mayod58, Randy J. McCarthya59, Kevin McConwayab60, Colin McFarland61, Amanda Q. X. Nioab62, Gustav Nilsonneab63, Cilene Lino de Oliveirab64, Jean-Jacques Orban de Xivryab65, Sam Parsonsbc66, Gerit Pfuhlab67, Kimberly A. Quinnb68, John J. Sakona69, S. Adil Saribaya70, Iris K. Schneiderab71, Manojkumar Selvarajud72, Zsuzsika Sjoerdsb73, Samuel G. Smithb74, Tim Smitsa75, Jeffrey R. Spiesb76, Vishnu Sreekumarabc77, Crystal N. Steltenpohlabc78, Neil Stenhousea79, Wojciech Świątkowskia80, Miguel A. Vadilloa81, Marcel A. L. M. Van Assenab82, Matt N. Williamsab83, Samantha E. Williamsd84, Donald R. Williamsab85, Tal Yarkonib86, Ignazio Zianod87, Rolf A. Zwaanab88

a) Participated in brainstorming. b) Participated in drafting the commentary. c) Conducted statistical analyses/data preparation. d) Did not participate in a, b, or c, because the points that they would have raised had already been incorporated into the commentary, or endorse a sufficiently large part of the contents as if participation had occurred. Except for the first author, authorship order is alphabetical.

Affiliations

1Human-Technology Interaction, Eindhoven University of Technology, Den Dolech, 5600MB, Eindhoven, The Netherlands
2Laboratory of Experimental Psychology and Neuroscience (LPEN), Institute of Cognitive and Translational Neuroscience (INCYT), INECO Foundation, Favaloro University, Pacheco de Melo 1860, Buenos Aires, Argentina
2National Scientific and Technical Research Council (CONICET), Godoy Cruz 2290, Buenos Aires, Argentina
3Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712TS Groningen, The Netherlands
4College of Education, Psychology & Social Work, Flinders University, Adelaide, GPO Box 2100, Adelaide, SA, 5001, Australia
5Department of Experimental Psychology, University of Oxford, New Radcliffe House, Oxford, OX2 6GG, UK
6Department of Computer Science, Illinois Institute of Technology, 10 W. 31st Street, Chicago, IL 60645, USA
7Department of Psychology, Nottingham Trent University, 50 Shakespeare Street, Nottingham, NG1 4FQ, UK
8Faculty of Linguistics and Literature, Bielefeld University, Universitätsstraße 25, 33615 Bielefeld, Germany
9Psychology, University of Nevada, Las Vegas, 4505 S. Maryland Pkwy., Box 455030, Las Vegas, NV 89154-5030, USA
10Psychology, University of Wisconsin-Madison, 1202 West Johnson St., Madison, WI 53706, USA
11Psychology, Missouri State University, 901 S. National Ave, Springfield, MO, 65897, USA
12Health, Human Performance, and Recreation, University of Arkansas, 155 Stadium Drive, HPER 321, Fayetteville, AR, 72701, USA
13Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000 Leuven, Belgium
13Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Postbus 9600, 2300 RC, Leiden, The Netherlands
14Department of Psychology, Linnaeus University, Stagneliusgatan 14, 392 34, Kalmar, Sweden
15Department of Human Development and Psychology, Tzu-Chi University, No. 67, Jieren St., Hualien City, Hualien County, 97074, Taiwan
16Department of Surgery, University of British Columbia, #301 - 1625 Oak Bay Ave, Victoria, BC, V8R 1B1, Canada
17Department of Psychology, University of Cambridge, Cambridge CB2 3EB, UK
18Centre for Statistics in Medicine, University of Oxford, Windmill Road, Oxford, OX3 7LD, UK
19Department of Psychology, The University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
20School of Psychology, Bangor University, Adeilad Brigantia, Bangor, Gwynedd, LL57 2AS, UK
21Ramsey Decision Theoretics, 4849 Connecticut Ave. NW #132, Washington, DC 20008, USA
22Department of Behavioural Sciences and Learning, Linköping University, SE-581 83, Linköping, Sweden
23Institute of Neuroscience and Psychology, University of Glasgow, 58 Hillhead Street, Glasgow, UK
24College of Social Work, Florida State University, 296 Champions Way, University Center C, Tallahassee, FL, 32304, USA
25Departments of Psychology and Philosophy, Yale University, 2 Hillhouse Ave, New Haven, CT 06511, USA
26Department of English, University of Louisiana at Lafayette, P. O. Box 43719, Lafayette, LA 70504, USA
27Department of Psychology, St. Edward's University, 3001 S. Congress, Austin, TX 78704, USA
27Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
28Department of Management, West Virginia University, 1602 University Avenue, Morgantown, WV 26506, USA
29Department of Psychology, Rutgers University, New Brunswick, 53 Avenue E, Piscataway, NJ 08854, USA
30Department of Political Science, Indiana University Purdue University Indianapolis, 425 University Blvd CA417, Indianapolis, IN 46202, USA
31Booking.com, Herengracht 597, 1017 CE Amsterdam, The Netherlands
32Department of English, American and Romance Studies, RWTH Aachen University, Kármánstraße 17/19, 52062 Aachen, Germany
33School of Psychology, Keele University, Keele, Staffordshire, ST5 5BG, UK
34Centre of Excellence for Statistical Innovation, UCB Celltech, 208 Bath Road, Slough, Berkshire SL1 3WE, UK
35Translational Neurosurgery, Eberhard Karls University Tübingen, Tübingen, Germany
35International Centre for Ethics in Sciences and Humanities, University of Tübingen, Germany
36Department of Radiology, University of Cambridge, Box 218, Cambridge Biomedical Campus, CB2 0QQ, UK
37Department of Psychiatry, University of Cambridge, 18b Trumpington Road, Cambridge, CB2 8AH, UK
38Behavioural Science Institute, Radboud University Nijmegen, Montessorilaan 3, 6525 HR, Nijmegen, The Netherlands
39Department of Psychology, University of Chester, Chester, CH1 4BJ, UK
40Department of Psychology, New York University, 4 Washington Place, New York, NY 10003, USA
41School of Psychology, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
42Independent researcher, Skåpvägen 5, 12245 Enskede, Stockholm, Sweden
43Department of Clinical and Experimental Medicine, University of Linköping, 581 83 Linköping, Sweden
44School of Clinical Sciences, University of Bristol, Level 2 academic offices, L&R Building, Southmead Hospital, Bristol, BS10 5NB, UK
45Occupational Orthopaedics and Research, Sahlgrenska University Hospital, 413 45 Gothenburg, Sweden
46The Faculty of Modern Languages and Literatures, Institute of Linguistics, Psycholinguistics Department, Adam Mickiewicz University, Al. Niepodległości 4, 61-874, Poznań, Poland
47Department of Psychological Sciences, University of Connecticut, U-1020, Storrs, CT 06269-1020, USA
48Center for Stroke Research Berlin, Charité - Universitätsmedizin Berlin, Hindenburgdamm 30, 12200 Berlin, Germany
48Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1a, 04103 Leipzig, Germany
48Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Luisenstraße 56, 10115 Berlin, Germany
49Social Sciences, Adam Mickiewicz University, Szamarzewskiego 89, 60-568 Poznań, Poland
50Department of Psychology, University of Fribourg, Faucigny 2, 1700 Fribourg, Switzerland
51School of Politics and International Relations, University of Kent, Canterbury CT2 7NX, UK
52Department of Sociology / ICS, University of Groningen, Grote Rozenstraat 31, 9712 TG Groningen, The Netherlands
53Institute of Psychology, Czech Academy of Sciences, Hybernská 8, 11000 Prague, Czech Republic
54School of Psychology, University of Nottingham, Nottingham, NG7 2RD, UK
55Pardee RAND Graduate School, RAND Corporation, 1200 S Hayes St, Arlington, VA 22202, USA
56Psychology and Neuroscience, Baylor University, One Bear Place 97310, Waco, TX, USA
57Psychology of Language Department, Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525XD, Nijmegen, The Netherlands
57Department of Psychology, School of Philosophy, Psychology, and Language Sciences, University of Edinburgh, 7 George Square, EH8 9JZ Edinburgh, UK
58Department of Philosophy, Major Williams Hall, Virginia Tech, Blacksburg, VA, USA
59Center for the Study of Family Violence and Sexual Assault, Northern Illinois University, 125 President's Blvd., DeKalb, IL 60115, USA
60School of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes MK7 6AA, UK
61Skyscanner, 15 Lauriston Place, Edinburgh, EH3 9EN, UK
62School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
63Stress Research Institute, Stockholm University, Frescati Hagväg 16A, SE-10691 Stockholm, Sweden
63Department of Clinical Neuroscience, Karolinska Institutet, Nobels väg 9, SE-17177 Stockholm, Sweden
63Department of Psychology, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
64Laboratory of Behavioral Neurobiology, Department of Physiological Sciences, Federal University of Santa Catarina, Campus Universitário Trindade, 88040900, Florianópolis, Brazil
65Department of Kinesiology, KU Leuven, Tervuursevest 101 box 1501, B-3001 Leuven, Belgium
66Department of Experimental Psychology, University of Oxford, Oxford, UK
67Department of Psychology, UiT The Arctic University of Norway, Tromsø, Norway
68Department of Psychology, DePaul University, 2219 N Kenmore Ave, Chicago, IL 60657, USA
69Center for Neural Science, New York University, 4 Washington Pl Room 809, New York, NY 10003, USA
70Department of Psychology, Boğaziçi University, Bebek, 34342, Istanbul, Turkey
71Psychology, University of Cologne, Herbert-Lewin-St. 2, 50931, Cologne, Germany
72Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST); Integrated Gulf Biosystems, Riyadh, Saudi Arabia
73Cognitive Psychology Unit, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands
73Leiden Institute for Brain and Cognition, Leiden University, Leiden, The Netherlands
74Leeds Institute of Health Sciences, University of Leeds, Leeds, LS2 9NL, UK
75Institute for Media Studies, KU Leuven, Leuven, Belgium
76Center for Open Science, 210 Ridge McIntire Rd Suite 500, Charlottesville, VA 22903, USA
76Department of Engineering and Society, University of Virginia, Thornton Hall, P.O. Box 400259, Charlottesville, VA 22904, USA
77Surgical Neurology Branch, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
78Department of Psychology, University of Southern Indiana, 8600 University Boulevard, Evansville, Indiana, USA
79Life Sciences Communication, University of Wisconsin-Madison, 1545 Observatory Drive, Madison, WI 53706, USA
80Department of Social Psychology, Institute of Psychology, University of Lausanne, Quartier UNIL-Mouline, Bâtiment Géopolis, CH-1015 Lausanne, Switzerland
81Departamento de Psicología Básica, Universidad Autónoma de Madrid, c/ Ivan Pavlov 6, 28049 Madrid, Spain
82Department of Methodology and Statistics, Tilburg University, Warandelaan 2, 5000 LE Tilburg, The Netherlands
82Department of Sociology, Utrecht University, Padualaan 14, 3584 CH, Utrecht, The Netherlands
83School of Psychology, Massey University, Private Bag 102904, North Shore, Auckland, 0745, New Zealand
84Psychology, Saint Louis University, 3700 Lindell Blvd, St. Louis, MO 63108, USA
85Psychology, University of California, Davis, One Shields Ave, Davis, CA 95616, USA
86Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
87Marketing Department, Ghent University, Tweekerkenstraat 2, 9000 Ghent, Belgium
88Department of Psychology, Education, and Child Studies, Erasmus University Rotterdam, Burgemeester Oudlaan 50, 3000 DR, Rotterdam, The Netherlands

Author Note: We’d like to thank Dale Barr, Felix Cheung, David Colquhoun, Hans IJzerman, Harvey Motulsky, and Richard Morey for helpful discussions while drafting this commentary.

Funding Statement: Daniel Lakens was supported by NWO VIDI 452-17-013. Federico G. Adolfi was supported by CONICET. Matthew Apps was funded by a Biotechnology and Biological Sciences Research Council AFL Fellowship (BB/M013596/1). Gary Collins was supported by the NIHR Biomedical Research Centre, Oxford. Zander Crook was supported by the Economic and Social Research Council [grant number C106891X]. Emily S. Cross was supported by the European Research Council (ERC-2015-StG-677270). Lisa DeBruine is supported by the European Research Council (ERC-2014-CoG-647910 KINSHIP). Anne-Laura van Harmelen is funded by a Royal Society Dorothy Hodgkin Fellowship (DH150176). Mark R. Hoffarth was supported by the National Science Foundation under grant SBE SPRF-FR 1714446. Junpeng Lao was supported by the SNSF grant 100014_156490/1. Cilene Lino de Oliveira was supported by AvH, Capes, CNPq. Andrea E. Martin was supported by the Economic and Social Research Council of the United Kingdom [grant number ES/K009095/1]. Jean-Jacques Orban de Xivry is supported by an internal grant from the KU Leuven (STG/14/054) and by the Fonds voor Wetenschappelijk Onderzoek (1519916N). Sam Parsons was supported by the European Research Council (FP7/2007–2013; ERC grant agreement no. 324176). Gerine Lodder was funded by NWO VICI 453-14-016. Samuel Smith is supported by a Cancer Research UK Fellowship (C42785/A17965). Vishnu Sreekumar was supported by the NINDS Intramural Research Program (IRP). Miguel A. Vadillo was supported by Grant 2016-T1/SOC-1395 from Comunidad de Madrid. Tal Yarkoni was supported by NIH award R01MH109682.

Abstract: In response to recommendations to redefine statistical significance to p ≤ .005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.

Justify Your Alpha: A Response to “Redefine Statistical Significance”

“Tests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict.”
Neyman & Pearson, 1928, p. 58.

Renewed concerns about the non-replication of scientific findings have prompted widespread debates about its underlying causes and possible solutions. As an actionable step toward improving standards of evidence for new discoveries, 72 researchers proposed changing the conventional threshold that defines “statistical significance” (i.e., the alpha level) from p ≤ .05 to p ≤ .005 for all novel claims with relatively low prior odds (Benjamin et al., 2017). They argued that this change will “immediately improve the reproducibility of scientific research in many fields” (Benjamin et al., 2017, p. 5).

Benjamin et al. (2017) provided two arguments against the current threshold for statistical significance of .05. First, a p-value of .05 provides only weak evidence for the alternative hypothesis. Second, under certain assumptions, a p-value threshold of .05 leads to a high false positive report probability (FPRP; the probability that a significant finding is a false positive, Wacholder et al., 2004; also referred to as the false positive rate, or false positive risk, Benjamin et al., 2017; Colquhoun, 2017). The authors claim that lowering the threshold for statistical significance to .005 will increase evidential strength for novel discoveries and reduce the FPRP.

We share the concerns raised by Benjamin et al. (2017) regarding the apparent non-replicability¹ of many scientific studies and appreciate their attempt to provide a concrete, easy-to-implement suggestion to improve science. We further agree that the current default alpha level of .05 is arbitrary and may result in weak evidence for the alternative hypothesis. However, we do not think that redefining the threshold for statistical significance to the lower, but equally arbitrary, threshold of p ≤ .005 is advisable. In this commentary, we argue that (1) there is insufficient evidence that the current standard for statistical significance is in fact a “leading cause of non-reproducibility” (Benjamin et al., 2017, p. 5), (2) the arguments in favor of a blanket default of p ≤ .005 are not strong enough to warrant the immediate and widespread implementation of such a policy, and (3) a lower significance threshold will likely have positive and negative consequences, both of which should be carefully evaluated before any large-scale changes are proposed. We conclude with an alternative suggestion, whereby researchers justify their choice of alpha level before collecting the data, instead of adopting a new uniform standard.

¹ We use ‘replicability’ to refer to the question of whether a conclusion sufficiently similar to that of an earlier study could be drawn from data obtained in a new study, and ‘reproducibility’ to refer to getting the same results when re-analysing the same data (Peng, 2009).

Lack of evidence that p ≤ .005 improves replicability

One of the main claims made by Benjamin et al. (2017) is that the expected proportion of studies that can be replicated will be considerably higher for studies that observe p ≤ .005 than for studies that observe .005 < p ≤ .05, due to a lower FPRP. All else being equal, we agree with Benjamin et al. (2017) that improvement in replicability is in theory related to the FPRP, and that lower alpha levels will reduce false positive results in the literature. However, it is difficult to predict how much the FPRP will change in practice, because quantifying the FPRP requires accurate estimates of several unknowns, such as the prior odds that the examined hypotheses are true, the true power of any performed experiments, and the (change in) actual behavior of researchers should the newly proposed threshold be put in place.

An analysis of the results of the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) shows that 49% (23 out of 47) of the original findings with p-values below .005 yielded p ≤ .05 in the replication study, whereas only 24% (11 out of 45) of the original studies with .005 < p ≤ .05 yielded p ≤ .05 in the replication study (χ²(1) = 5.92, p = .015, BF₁₀ = 6.84). Benjamin et al. (2017, p. 9) presented this analysis as empirical evidence of the “potential gains in reproducibility that would accrue from the new threshold.” However, as they acknowledged, their obtained p-value of .015 is only “suggestive” of such a conclusion, according to their own proposal. Moreover, there is considerable variation in replication rates across p-values (see Figure 1), with few observations in bins of size .005 for .005 < p ≤ .05. In addition, the lower replication rate for p-values just below .05 is likely confounded by p-hacking (the practice of flexibly analysing data until the p-value passes the ‘significance’ threshold) in the original study. This implies that at least some of the differences in replication rates between studies with .005 < p ≤ .05 compared to studies with p ≤ .005 are not due to the level of evidence per se, but rather due to other mechanisms (e.g., flexibility during data analysis). Indeed, depending on the degree of flexibility exploited by researchers, such p-hacking can be used to overcome any inferential threshold.

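The contingency analysis above can be checked directly from the counts reported in this paragraph (the full analysis scripts are in the repository linked in the Figure 1 caption). A minimal R sketch, assuming only those counts:

    # RP:P replication outcomes as summarised above:
    # 23 of 47 originals with p <= .005 replicated at alpha = .05,
    # versus 11 of 45 originals with .005 < p <= .05.
    counts <- matrix(c(23, 47 - 23,
                       11, 45 - 11),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(original = c("p <= .005", ".005 < p <= .05"),
                                     replication = c("replicated", "not replicated")))
    prop.table(counts, margin = 1)       # 49% vs. 24% replication rates
    chisq.test(counts, correct = FALSE)  # X-squared(1) = 5.92, p = .015

A comparable Bayes factor could be obtained with, for example, the BayesFactor package, although its exact value depends on the sampling model assumed for the contingency table.
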
Even with a p ≤ .005 threshold, only 49% of studies replicated successfully. Furthermore, only 11 out of 30 studies (37%) with .0001 < p ≤ .005 replicated at α = .05. By contrast, a prima facie more satisfactory replication success rate of 71% was obtained only for p < .0001 (12 out of 17 studies). This suggests that a relatively small number of studies with p-values much lower than .005 were largely responsible for the 49% replication rate for studies with p ≤ .005. Further analysis is needed, therefore, to explain the low replication rate of studies with p ≤ .005 before this alpha level is recommended as a new significance threshold for novel discoveries across scientific disciplines.

Figure 1. The proportion of studies (Open Science Collaboration, 2015) that replicated at α = .05 (with a bin width of 0.005). Window start and end positions are plotted on the horizontal axis. The error bars denote 95% Jeffreys confidence intervals. R code to reproduce Figure 1 is available from https://github.com/VishnuSreekumar/Alpha005

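The binned analysis itself is available in the repository above; as an illustration of how the error bars are constructed, the 95% Jeffreys interval for x successful replications out of n studies is the central 95% interval of a Beta(x + 0.5, n − x + 0.5) posterior. A minimal R sketch, using one of the counts reported above:

    # 95% Jeffreys interval for a binomial proportion
    jeffreys_ci <- function(x, n, level = 0.95) {
      a <- (1 - level) / 2
      qbeta(c(a, 1 - a), x + 0.5, n - x + 0.5)
    }
    jeffreys_ci(12, 17)  # e.g., the p < .0001 studies: 12 of 17 replicated
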
Weak justifications for the new p ≤ .005 threshold

Even though p-values close to .05 never provide strong ‘evidence’ against the null hypothesis on their own (Wasserstein & Lazar, 2016), the argument that p-values provide weak evidence based on Bayes factors has been called into question (Casella & Berger, 1987; Greenland et al., 2016; Senn, 2001). Redefining the alpha level as a function of the strength of relative evidence measured by the Bayes factor is undesirable, given that the marginal likelihood is very sensitive to different (somewhat arbitrary) choices for the models that are compared (Gelman et al., 2013). Benjamin et al. (2017) stated that p-values of .005 imply Bayes factors between 14 and 26, but the level of evidence depends on the model priors and the choice of hypotheses tested, and different modelling assumptions would imply a different p-value threshold. The Bayesian analysis that underlies the recommendation actually overstates the evidence against the null from the perspective of error statistics: it would, with high probability, deem an alternative highly probable even if it is false (Mayo, 1997, 2018). Finally, Benjamin et al. (2017) provided no rationale for why the new p-value threshold should align with equally arbitrary Bayes factor thresholds representing ‘substantial’ or ‘strong’ evidence. Indeed, it has been argued that such classifications of Bayes factors themselves introduce arbitrary meaning to a continuous measure (e.g., Morey, 2015). We (even those of us prepared to use likelihoods and Bayesian approaches in lieu of p-values when interpreting results) caution against the idea that the alpha level at which an error rate is controlled should be based on the amount of relative evidence indicated by a Bayes factor. Extending Morey, Wagenmakers, and Rouder (2016), who argued against the frequentist calibration of Bayes factors, we argue against the necessity of a Bayesian calibration of error rates.

The second argument Benjamin et al. (2017) provided for p ≤ .005 is that the FPRP can be high with α = .05. To calculate the FPRP one needs to define the alpha level, the power of the tests that examine true effects, and the ratio of true to false hypotheses tested (the prior odds). The FPRP is only problematic when a high proportion of examined hypotheses are false, and thus Benjamin et al. (2017, p. 10) stated that their “recommendation applies to disciplines with prior odds broadly in the range depicted in Figure 2.” Their Figure 2 displays FPRPs for scenarios where many examined hypotheses are false, with ratios of true to false hypotheses (i.e., prior odds) of 1 to 5, 1 to 10, and 1 to 40. Benjamin et al. (2017) recommended p ≤ .005 because this threshold reduces the minimum FPRP to less than 5%, assuming 1 to 10 prior odds of examining a true hypothesis (the true FPRP might still be substantially higher in studies with very low power). This estimate of prior odds is based on data from the RP:P (Open Science Collaboration, 2015) using an analysis that modelled publication bias for 73 studies (Johnson et al., 2017; see also Ingre, 2016, for a more conservative estimate). Without stating the reference class for the ‘base-rate of true nulls’ (i.e., does this refer to all hypotheses in science, in a discipline, or by a single researcher?), the concept of ‘prior odds that H1 is true’ has little meaning in practice. The modelling effort by Johnson et al. (2017) ignored practices that inflate error rates (e.g., p-hacking) and thus likely does not provide an accurate estimate of bias, given the prevalence of such practices (Fiedler & Schwarz, 2016; John et al., 2012). An estimate of the prior probability that a hypothesis is true, similar to that of Johnson et al. (2017), was derived from 92 participants’ subjective ratings of the prior probability that the alternative hypothesis was true for 44 studies included in the RP:P (Dreber et al., 2015). As Dreber et al. (2015, p. 15345) noted, “This relatively low average prior may reflect [the fact] that top psychology journals focus on publishing surprising findings, i.e., positive findings on relatively unlikely hypotheses.” These observations imply that there are not sufficient representative data to accurately estimate the prior odds that researchers examine a true hypothesis, and thus, there is currently no strong argument based on FPRP to redefine statistical significance to p ≤ .005.

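To make the dependence on these unknowns concrete: with the prior odds defined as the ratio of true to false hypotheses tested, the FPRP discussed here is FPRP = α / (α + power × prior odds). A small R sketch using the illustrative values mentioned above (prior odds of 1 to 10; the power values are arbitrary):

    # FPRP = alpha / (alpha + power * prior_odds),
    # where prior_odds = true hypotheses tested / false hypotheses tested.
    fprp <- function(alpha, power, prior_odds) {
      alpha / (alpha + power * prior_odds)
    }
    fprp(alpha = 0.05,  power = 0.8, prior_odds = 1/10)  # ~0.38
    fprp(alpha = 0.005, power = 0.8, prior_odds = 1/10)  # ~0.06
    fprp(alpha = 0.005, power = 1.0, prior_odds = 1/10)  # ~0.05, the 'minimum' FPRP

Small changes in the assumed power or prior odds move these numbers substantially, which is why accurate estimates of both are needed before the FPRP can justify any particular threshold.
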
Ways in which a threshold of p ≤ .005 might harm scientific practice

Benjamin et al. (2017) acknowledged that lowering the p-value threshold will not ameliorate other practices that negatively impact the replicability of research findings (such as p-hacking, publication bias, and low power). Yet, they did not address ways in which a p ≤ .005 threshold might harm scientific practice. Chief among our concerns are (1) a reduction in the number of replication studies that can be conducted if such a threshold is adopted, (2) a concomitant reduction in generalisability and breadth of research findings due to a likely increased reliance on convenience samples, and (3) exacerbation of an already exaggerated focus on single p-values.

Risk of fewer replication studies. Replication studies are central to generating reliable scientific knowledge, especially when conclusions are largely based on p-values. As Fisher (1926, p. 85) noted: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.” Replication studies are at the heart of scientific progress. In the field of medicine, for example, the FDA requires two independent pre-registered clinical trials, both significant with p ≤ .05, before issuing marketing approval for new drugs (for a discussion, see Senn, 2007, p. 188). Researchers have limited resources, and when studies require larger sample sizes scientists will have to decide what research they will invest in. Achieving 80% power with α = .005, compared to α = .05, will require a 70% larger sample size in a between-subjects design with a two-sided test (and an 88% larger sample size for one-sided tests). This means that researchers can complete almost two studies each powered at α = .05 (e.g., one novel study and one replication study), or only one study powered at α = .005. Therefore, at a time when replication studies are rare, lowering the alpha level to .005 might reduce the number of replication studies. Indeed, general recommendations for evidence thresholds need to carefully balance statistical and non-statistical considerations (e.g., the value of evidence per novel study vs. the value of independent replications).

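The sample size arithmetic behind these figures can be verified with standard power software. A minimal R sketch for a two-sample design (the standardised effect size of 0.5 is an arbitrary choice here; the ratio of the two required sample sizes is nearly identical for other effect sizes):

    # Per-group n for 80% power in a two-sided, two-sample t-test with d = 0.5
    n_05  <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                          power = 0.80, type = "two.sample")$n
    n_005 <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.005,
                          power = 0.80, type = "two.sample")$n
    c(alpha_05 = n_05, alpha_005 = n_005, ratio = n_005 / n_05)  # ratio of about 1.7
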
Risk of reduced generalisability and breadth. All else being equal, larger sample sizes increase the informational value of studies, but requiring larger sample sizes across all scientific disciplines would potentially compound problems with over-reliance on convenience samples (such as undergraduate students or Mechanical Turk workers). Lowering the significance threshold could adversely affect the type and breadth of research questions examined if it is done without (1) increased funding, (2) a reward system that values large-scale collaboration, or (3) clear recommendations for how to evaluate research with lower evidential value due to sample size constraints. Achieving a lower p-value in studies with unique populations (e.g., people with rare genetic variants, people diagnosed with post-traumatic stress disorder) or in studies with time- or otherwise resource-intensive data collection (e.g., longitudinal studies) requires exponentially more effort than increasing the amount of evidence in studies that use undergraduate students or Mechanical Turk workers. Thus, researchers may become less motivated, or even tacitly discouraged, to study the former populations or collect those types of data. Hence, lowering the alpha threshold may indirectly reduce the generalisability and breadth of findings (Peterson & Merunka, 2014).

Risk of exaggerating the focus on single p-values. If anything, an excessive focus on p-value thresholds has the potential to mask or even discourage opportunities for more fruitful changes in scientific practice and education. Many researchers have come to recognise p-hacking, low power, and publication bias as more important reasons for non-replication. Benjamin et al. (2017) acknowledged that changing the threshold could be considered a distraction from other solutions, and yet their proposal risks reinforcing the idea that relying only on p-values is a sufficient, if imperfect, way to evaluate findings. The proposed p ≤ .005 threshold is not intended as a publication threshold. However, given the long history of misuse of statistical recommendations, there is a substantial risk that redefining p ≤ .005 as ‘statistically significant’ will increase publication bias, which, in turn, would bias effect size estimates upwards to an even greater extent (Lane & Dunlap, 1978). As such, Benjamin et al.’s recommendation could divert attention from the burgeoning movement towards a more cumulative evaluation of findings, where the converging results of multiple studies are taken into account when addressing specific research questions. Examples of such approaches are: multiple replications (both registered and multi-lab; see, e.g., Hagger et al., 2016), continuously updating meta-analyses (Braver et al., 2014), p-curve analysis (Simonsohn et al., 2014), and pre-registration of studies.

No one alpha to rule them all

Benjamin et al. (2017) recommended that only p-values lower than .005 should be called ‘statistically significant’ and that studies should generally be designed with α = .005. Our recommendation is similarly twofold. First, when describing results, we recommend that the label ‘statistically significant’ simply no longer be used. Instead, researchers should provide a more meaningful interpretation (Eysenck, 1960). While p-values can inspire statements about the probability of data (e.g., ‘the observed difference in the data was surprisingly large, assuming the null hypothesis is true’), they should not be treated as indices that, on their own, signify evidence for a theory.

Second, when designing studies, we propose that authors transparently specify their design choices. These include (where applicable) the alpha level, the null and alternative models, assumed prior odds, statistical power for a specified effect size of interest, the sample size, and/or the desired accuracy of estimation. Without imposing a single value on any of these design parameters, we ask authors to justify their choices before the data are collected. Fellow researchers can evaluate these decisions on their merits and discuss how appropriate they are for a specific research question, and whether the conclusions follow from the study design. Ideally, this evaluation process occurs prior to data collection when reviewing a Registered Report submission (Chambers, Dienes, McIntosh, Rotshtein, & Willmes, 2015). Providing researchers (and reviewers) with accessible information on ways to justify (and evaluate) these design choices, tailored to specific research areas, would improve current research practices.

The optimal alpha level will sometimes be lower and sometimes be higher than the current convention of .05 (see Field, Tyre, Jonzén, Rhodes, & Possingham, 2004; Grieve, 2015; Mudge, Baker, Edge, & Houlahan, 2012; Pericchi & Pereira, 2016). Some fields, such as genomics and physics, have lowered the alpha level. However, in genomics the overall false positive rate is still controlled at 5%; the lower alpha level is only used to correct for multiple comparisons (Storey & Tibshirani, 2003). In physics, a five sigma threshold (p ≤ 2.87 × 10⁻⁷) is required to publish an article with ‘discovery of’ in the title, with less stringent alpha levels being used for article titles with ‘evidence for’ or ‘measurement of’ (Franklin, 2014). In physics, researchers have also argued against a blanket rule, proposing instead that the alpha level be set based on factors such as how surprising the result would be and how much practical or theoretical impact the discovery would have (Lyons, 2013). In non-human animal research, minimising the number of animals used needs to be directly balanced against the probability of false positives; other trade-offs may be relevant in other areas. Thus, a broadly applied p ≤ .005 threshold will rarely be optimal.

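For reference, the five sigma criterion is the one-sided tail probability of a standard normal distribution five standard deviations above the mean, which is where the 2.87 × 10⁻⁷ figure comes from:

    # One-sided tail probability of a 5 sigma deviation
    pnorm(5, lower.tail = FALSE)  # 2.87e-07
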
Benjamin et al. (2017, p. 5) stated that a “critical mass of researchers” now endorse the standard of a p ≤ .005 threshold for “statistical significance.” However, the presence of a critical mass can only be identified after a norm or practice has been widely adopted, not before. Even if a p ≤ .005 threshold were widely endorsed, this would only reinforce the flawed idea that a single alpha level is universally applicable. Ideally, the decision of where to set the alpha level for a study should be based on statistical decision theory, where costs and benefits are compared against a utility function (Neyman & Pearson, 1933; Skipper, Guenther, & Nass, 1967). Such an analysis can be expected to differ based on the type of study being conducted: for example, analysis of a large existing dataset versus primary data collection relying on hard-to-obtain samples. Science is necessarily diverse, and it is up to scientists within specific fields to justify the alpha level they decide to use. To quote Fisher (1956, p. 42): “...no scientific worker has a fixed level of significance at which, from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”

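As an illustration of what such a decision-theoretic calculation might look like, the following R sketch chooses the alpha that minimises a weighted combination of the Type I and Type II error rates for a fixed design, in the spirit of Mudge, Baker, Edge, and Houlahan (2012). The sample size, smallest effect size of interest, and relative error costs used here are purely hypothetical and would themselves need to be justified for any real study:

    # Weighted combination of error rates for a two-sample t-test design.
    # n, d (smallest effect size of interest), and the weight w are assumptions.
    expected_error <- function(alpha, n = 50, d = 0.5, w = 0.5) {
      power <- power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
                            type = "two.sample")$power
      w * alpha + (1 - w) * (1 - power)
    }
    optimize(expected_error, interval = c(1e-6, 0.5))$minimum  # design-specific alpha
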
Conclusion

It is laudable that Benjamin et al. (2017) suggested a concrete step designed to immediately improve science. However, it is not clear that lowering the significance threshold to p ≤ .005 will in practice amount to an improvement in replicability that is worth the potential costs. Instead of simple heuristics and an arbitrary blanket threshold, research should be guided by principles of rigorous science (Casadevall & Fang, 2016; LeBel, Vanpaemel, McCarthy, Earp, & Elson, 2017; Meehl, 1990). These principles include not only sound statistical analyses, but also experimental redundancy (e.g., replication, validation, and generalisation), avoidance of logical traps, intellectual honesty, research workflow transparency, and full accounting for potential sources of error. Single studies, regardless of their p-value, are never enough to conclude that there is strong evidence for a theory. We need to train researchers to recognise what cumulative evidence looks like and work towards an unbiased scientific literature.

Although we agree with Benjamin et al. (2017) that the relatively high rate of non-replication in the scientific literature is a cause for concern, we do not believe that redefining statistical significance is a desirable solution: (1) there is not enough evidence that a blanket threshold of p ≤ .005 will improve replication sufficiently to be worth the additional cost in data collection, (2) the justifications given for the new threshold are not strong enough to warrant the widespread implementation of such a policy, and (3) there are realistic concerns that a p ≤ .005 threshold will have negative consequences for science, which should be carefully examined before a change in practice is instituted. Instead of a narrower focus on p-value thresholds, we call for a broader mandate whereby all justifications of key choices in research design and statistical practice are pre-registered whenever possible, fully accessible, and transparently evaluated.

References

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22). Redefine statistical significance. https://doi.org/10.17605/OSF.IO/MKY9J
Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9(3), 333-342. https://doi.org/10.1177/1745691614529796
Casadevall, A., & Fang, F. C. (2016). Rigorous science: A how-to guide. mBio, 7(6), e01902-16. https://doi.org/10.1128/mBio.01902-16
Casella, G., & Berger, R. L. (1987). Testing precise hypotheses: Comment. Statistical Science, 2(3), 344-347. https://doi.org/10.1214/ss/1177013243
Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered Reports: Realigning incentives in scientific publishing. Cortex, 66, A1-A2. https://doi.org/10.1016/j.cortex.2015.03.022
Colquhoun, D. (2017). The reproducibility of research and the misinterpretation of p-values. bioRxiv, 144337. https://doi.org/10.1101/144337
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., … Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343-15347. https://doi.org/10.1073/pnas.1516179112
Eysenck, H. J. (1960). The concept of statistical significance and the controversy about one-tailed tests. Psychological Review, 67(4), 269-271. https://doi.org/10.1037/h0048412
Fiedler, K., & Schwarz, N. (2016). Questionable research practices revisited. Social Psychological and Personality Science, 7(1), 45-52. https://doi.org/10.1177/1948550615612150
Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669-675. https://doi.org/10.1111/j.1461-0248.2004.00625.x
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33, 503-513.
Fisher, R. A. (1956). Statistical methods and scientific inference. New York: Hafner.
Franklin, A. (2014). Shifting standards: Experiments in particle physics in the twentieth century. University of Pittsburgh Press.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337-350. https://doi.org/10.1007/s10654-016-0149-3
Grieve, A. P. (2015). How to test hypotheses if you must. Pharmaceutical Statistics, 14(2), 139-150. https://doi.org/10.1002/pst.1667
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11, 546-573. https://doi.org/10.1177/1745691616652873
Ingre, M. (2016). Recent reproducibility estimates indicate that negative evidence is observed over 30 times before publication. arXiv preprint arXiv:1605.06414. https://arxiv.org/abs/1605.06414
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23(5), 524-532. https://doi.org/10.2139/ssrn.1996631
Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517), 1-10. https://doi.org/10.1080/01621459.2016.1240079
Koole, S. L., & Lakens, D. (2012). Rewarding replications: A sure and simple way to improve psychological science. Perspectives on Psychological Science, 7(6), 608-614. https://doi.org/10.1177/1745691612462586
Lakens, D. (2015). On the challenges of drawing conclusions from p-values just below 0.05. PeerJ, 3, e1142. https://doi.org/10.7717/peerj.1142
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31(2), 107-112. https://doi.org/10.1111/j.2044-8317.1978.tb00578.x
LeBel, E. P., Vanpaemel, W., McCarthy, R. J., Earp, B. D., & Elson, M. (2017). A unified framework to quantify the trustworthiness of empirical research. https://doi.org/10.17605/OSF.IO/UWMR8
Lyons, L. (2013). Discovering the significance of 5 sigma. arXiv preprint arXiv:1310.1284.
Mayo, D. (1997). Error statistics and learning from error: Making a virtue of necessity. Philosophy of Science, 64 (Part II: Symposia Papers), S195-S212.
Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108-141. https://doi.org/10.1207/s15327965pli0102_1
Morey, R. (2015). On verbal categories for the interpretation of Bayes factors. BayesFactor blog. https://bayesfactor.blogspot.nl/2015/01/on-verbal-categories-for-interpretation.html
Morey, R. D., Wagenmakers, E.-J., & Rouder, J. N. (2016). Calibrated Bayes factors should not be used: A reply to Hoijtink, van Kooten, and Hulsker. Multivariate Behavioral Research, 51(1), 11-19. https://doi.org/10.1080/00273171.2015.1052710
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240. https://doi.org/10.2307/2331945
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694-706), 289-337. https://doi.org/10.1098/rsta.1933.0009
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615-631. https://doi.org/10.1177/1745691612459058
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Peng, R. D. (2009). Reproducible research and biostatistics. Biostatistics, 10(3), 405-408. https://doi.org/10.1093/biostatistics/kxp014
Pericchi, L., & Pereira, C. (2016). Adaptative significance levels using optimal decision rules: Balancing by weighting the error probabilities. Brazilian Journal of Probability and Statistics, 30(1), 70-90. https://doi.org/10.1214/14-BJPS257
Peterson, R. A., & Merunka, D. R. (2014). Convenience samples of college students and research reproducibility. Journal of Business Research, 67(5), 1035-1041. https://doi.org/10.1016/j.jbusres.2013.08.010
Senn, S. (2001). Two cheers for p-values? Journal of Epidemiology and Biostatistics, 6, 193-204. https://doi.org/10.1080/135952201753172953
Senn, S. (2007). Statistical issues in drug development (2nd ed.). Chichester, England; Hoboken, NJ: John Wiley & Sons.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681. https://doi.org/10.1177/1745691614553988
Skipper, J. K., Guenther, A. L., & Nass, G. (1967). The sacredness of .05: A note concerning the uses of statistical levels of significance in social science. The American Sociologist, 2(1), 16-18.
Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16), 9440-9445. https://doi.org/10.1073/pnas.1530509100
Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., & Rothman, N. (2004). Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. Journal of the National Cancer Institute, 96, 434-442. https://doi.org/10.1093/jnci/djh075
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133. https://doi.org/10.1080/00031305.2016.1154108
13