Article

Justify your alpha

Authors:
  • Daniel Lakens and 87 co-authors

Abstract

In response to recommendations to redefine statistical significance to P ≤ 0.005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.

... Others, however, have argued against the move to reduce α. In a reply to Benjamin et al. signed by 88 authors, Lakens et al. noted that a reduction in α would also have various negative consequences [34]. Perhaps most importantly, decreasing α would decrease statistical power and thereby increase the rate of false negatives (FNs), that is, the proportion of studies that fail to find conclusive evidence for an effect that is actually present [14,35]. ...
... The current debate about α, however, illustrates the complexity of determining its optimal value [39,40]. Indeed, there are good reasons to believe that no single α level is optimal for all research contexts [34], and in some contexts there are strong arguments for increasing the α level to a value larger than 0.05 [41]. At this point, the only agreement concerning the choice of α level is that researchers within a given area should make it carefully-but how are they to do that? ...
... It is important to examine these total payoffs to understand which research scenario parameters must be considered and to see how the size of the payoff is jointly determined by the various parameter values. There is wide agreement that scientists in any field should consider their α levels carefully [33,34,44,45], and it seems essential to use an objective formalism to compare the expected scientific payoffs of different α levels. ...
Article
Full-text available
Researchers who analyze data within the framework of null hypothesis significance testing must choose a critical “alpha” level, α, to use as a cutoff for deciding whether a given set of data demonstrates the presence of a particular effect. In most fields, α = 0.05 has traditionally been used as the standard cutoff. Many researchers have recently argued for a change to a more stringent evidence cutoff such as α = 0.01, 0.005, or 0.001, noting that this change would tend to reduce the rate of false positives, which are of growing concern in many research areas. Other researchers oppose this proposed change, however, because it would correspondingly tend to increase the rate of false negatives. We show how a simple statistical model can be used to explore the quantitative tradeoff between reducing false positives and increasing false negatives. In particular, the model shows how the optimal α level depends on numerous characteristics of the research area, and it reveals that although α = 0.05 would indeed be approximately the optimal value in some realistic situations, the optimal α could actually be substantially larger or smaller in other situations. The importance of the model lies in making it clear what characteristics of the research area have to be specified to make a principled argument for using one α level rather than another, and the model thereby provides a blueprint for researchers seeking to justify a particular α level.
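As a rough illustration of the kind of payoff model this abstract describes, the sketch below scans candidate α levels for a one-sided z-test and reports the expected payoff per study. The base rate of true effects, the effect size, the sample size, and the payoff values are all hypothetical placeholders, not parameters taken from the paper.

```python
# Minimal sketch of an expected-payoff comparison across alpha levels.
# All parameter values (base rate, effect size, payoffs) are hypothetical.
from scipy.stats import norm

def expected_payoff(alpha, base_rate, effect_size, n,
                    gain_tp=1.0, cost_fp=-1.0, gain_tn=0.1, cost_fn=-0.1):
    """Expected payoff per study for a one-sided z-test at a given alpha."""
    z_crit = norm.ppf(1 - alpha)
    power = 1 - norm.cdf(z_crit - effect_size * n**0.5)  # P(reject | effect real)
    p_tp = base_rate * power              # true positive
    p_fn = base_rate * (1 - power)        # false negative
    p_fp = (1 - base_rate) * alpha        # false positive
    p_tn = (1 - base_rate) * (1 - alpha)  # true negative
    return p_tp * gain_tp + p_fp * cost_fp + p_tn * gain_tn + p_fn * cost_fn

for a in (0.10, 0.05, 0.01, 0.005, 0.001):
    print(f"alpha={a:<6} expected payoff="
          f"{expected_payoff(a, base_rate=0.3, effect_size=0.3, n=50):.4f}")
```

Shifting the assumed base rate or the relative payoffs moves the optimum, which is precisely why the authors argue that no single α is best for every research context.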
... The analysis presented here is not a counterproposal to RSS, but rather a refutation which is intended to elucidate the proposal's flaws and therefore neutralize the potential damage which would result from its implementation. Our analysis may be seen as complementary to, but should not be read in any way as an endorsement of, the critiques and alternative proposals by other authors [1,12,16,21]. I discuss this last point further in Section 5. Brief summary. ...
... While I am sympathetic to the sentiment prompting the various responses to RSS [1,12,16,21], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic 'calls to action'. Those calling for bigger and better scientific regulations ought not forget that another regulation-the 5% significance level-lies at the heart of the crisis. ...
Article
A recent proposal to "redefine statistical significance" (Benjamin, et al. Nature Human Behaviour, 2017) claims that false positive rates "would immediately improve" by factors greater than two and replication rates would double simply by changing the conventional cutoff for 'statistical significance' from P<0.05 to P<0.005. I analyze the veracity of these claims, focusing especially on how Benjamin, et al neglect the effects of P-hacking in assessing the impact of their proposal. My analysis shows that once P-hacking is accounted for the perceived benefits of the lower threshold all but disappear, prompting two main conclusions: (i) The claimed improvements to false positive rate and replication rate in Benjamin, et al (2017) are exaggerated and misleading. (ii) There are plausible scenarios under which the lower cutoff will make the replication crisis worse.
... We repeated this procedure 1000 times, and the a posteriori power is the percentage of the 1000 repetitions in which a focal coefficient was significant. Given the high statistical power we had, we also report the standardized p-value p_stan in addition to significant p-values (Good, 1982; Lakens, 2018). p_stan multiplies the p-value by the square root of the number of participants or observations to adjust alpha due to the high power. ...
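A minimal sketch of the standardized p-value described in this excerpt is given below. The first function follows the excerpt's wording (the p-value multiplied by the square root of the number of observations); the second is a commonly cited variant attributed to Good (1982) that rescales to a reference sample size of 100, included here as an assumption rather than a quotation.

```python
# Sketch of a "standardized p-value": the p-value is scaled by the square root
# of the number of observations so that tiny p-values from very large samples
# are judged more strictly.
import math

def p_stan_as_described(p, n):
    """p multiplied by sqrt(n), following the description in the excerpt."""
    return min(1.0, p * math.sqrt(n))

def p_stan_good_1982(p, n):
    """Often-cited variant attributed to Good (1982): rescale to a reference
    sample size of 100 and cap at 0.5 (stated here as an assumption)."""
    return min(0.5, p * math.sqrt(n / 100))

p, n = 0.003, 2000
print(p_stan_as_described(p, n))  # ~0.134
print(p_stan_good_1982(p, n))     # ~0.0134
```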
Article
Full-text available
While prior research has found mindfulness to be linked with emotional responses to events, less is known about this effect in a non-clinical sample. Even less is known regarding the mechanisms of the underlying processes: It is unclear whether participants who exhibit increased acceptance show decreased emotional reactivity (i.e., lower affective responses towards events overall) or a speedier emotional recovery (i.e., subsequent decrease in negative affect) due to adopting an accepting stance. To address these questions, we re-analyzed two Ambulatory Assessment data sets. The first (N_Study1 = 125) was a six-week randomized controlled trial (including a 40-day ambulatory assessment); the second (N_Study2 = 175) was a one-week ambulatory assessment study. We found state mindfulness to be more strongly associated with emotional reactivity than with recovery, and that only emotional reactivity was significantly dampened by mindfulness training. Regarding the different facets of mindfulness, we found that the strongest predictor of both emotional reactivity and recovery was non-judgmental acceptance. Finally, we found that being aware of one's own thoughts and behavior could be beneficial or detrimental for emotional recovery, depending on whether participants accepted their thoughts and emotions. Together, these findings provide evidence for predictions derived from the monitoring and acceptance theory.
... Various remedies for this particular problem exist, one being an application of the simple Bonferroni correction, which amounts to lowering the significance threshold α (commonly 0.05; but see, for example, Benjamin et al., 2018, and Lakens et al., 2018) to α/m, where m is the number of hypotheses tested. This procedure is not systematically applied in NLG, although the awareness of the issues with multiple comparisons is increasing. ...
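For concreteness, the Bonferroni correction mentioned here is just a comparison of each p-value with α/m:

```python
# Minimal Bonferroni correction: compare each p-value with alpha / m,
# where m is the number of hypotheses tested.
def bonferroni_significant(pvals, alpha=0.05):
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

pvals = [0.003, 0.02, 0.04, 0.30]
print(bonferroni_significant(pvals))  # [True, False, False, False] at alpha/m = 0.0125
```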
... A proposed solution to improve the replicability of psychological science is to use a lower significance threshold before concluding a finding to be significant, especially with regard to novel claims and in fields where less than half of all studies are expected to reflect a real effect [11]. However, experts still disagree about whether the significance level of 0.05 is the leading cause of the non-replicability and whether a lower (but still fixed) threshold will solve the problem without undesired negative consequences (Benjamin et al., 2018; Lakens et al., 2018). ...
Article
Full-text available
Research on money priming typically investigates whether exposure to money-related stimuli can affect people’s thoughts, feelings, motivations, and behaviors (for a review, see Vohs, 2015). Our study answers the call for a comprehensive meta-analysis examining the available evidence on money priming (Vadillo, Hardwicke, & Shanks, 2016). By conducting a systematic search of published and unpublished literature on money priming, we sought to achieve three key goals. First, we aimed to assess the presence of biases in the available published literature (e.g., publication bias). Second, in the case of such biases, we sought to derive a more accurate estimate of the effect size after correcting for these biases. Third, we aimed to investigate whether design factors such as prime type and study setting moderated the money priming effects. Our overall meta-analysis included 246 suitable experiments and showed a significant overall effect size estimate (Hedges’ g = .31, 95% CI [0.26, 0.36]). However, publication bias and related biases are likely given the asymmetric funnel plots, Egger’s test and two other tests for publication bias. Moderator analyses offered insight into the variation of the money priming effect, suggesting for various types of study designs whether the effect was present, absent, or biased. We found the largest money priming effect in lab studies investigating a behavioral dependent measure using a priming technique in which participants actively handled money. Future research should use sufficiently powerful preregistered studies to replicate these findings.
Article
Full-text available
Negative foreign direct investment (divestment) between countries has received little attention in international macroeconomics. This is the first country-level study to investigate whether conventional drivers of bilateral foreign direct investment (FDI) have a reverse, but symmetric, impact on foreign direct divestment (FDD). Using bilateral FDI data between 126 countries, from 2005 to 2018, we find that, whereas some of the same variables are relevant, the view that what deters FDI encourages FDD, and vice versa, is not supported by our empirical findings.
Article
Full-text available
Objective - To collect and share information about the prevalence of precarious work in libraries and the factors associated with it. Methods - The authors collected and coded job postings from a nationwide job board in Canada for two years. Descriptive and inferential statistics were used to explore the extent of precarity and its relationship with job characteristics such as job type, institution type, education level, and minimum required experience. Results - The authors collected 1,968 postings, of which 842 (42.8%) were coded as precarious in some way. The most common types of precarious work were contracts (29.1% of all postings) and part-time work (22.7% of all postings). Contracts were most prevalent in and significantly associated with academic libraries and librarian positions, and they were most often one year in length. Both on-call and part-time work were most prevalent in school libraries and for library technicians and assistants, and they were significantly associated with all institution types either positively or negatively. Meanwhile, precarious positions overall were least prevalent in government and managerial positions. In terms of education, jobs requiring a secondary diploma or library technician diploma were most likely to be precarious, while positions requiring an MLIS were least likely. The mean minimum required experience was lower for all types of precarious positions than for stable positions, and the prevalence of precarity generally decreased as minimum required experience increased. Conclusion - The proportion of precarious positions advertised in Canada is substantial and seems to be growing over time. Based on these postings, employees with less experience, without advanced degrees, or in library technician and assistant roles are more likely to be precarious, while those with managerial positions, advanced degrees, or more experience, are less likely to be precarious. Variations in precarity based on factors such as job type, institution type, education level, and minimum required experience suggest that employees will experience precarity differently both within and across library systems.
Chapter
In process engineering, teacher‐researchers are confronted with the lack of a framework for building their methods, both in terms of data production and analysis. This chapter proposes training methods for process engineering and identifies criteria to ensure the scientificity of the training methods. The relevance of the training objectives is obviously a very important step in a training process. The chapter discusses the impact of a training course as a kind of product of quality of objectives, pedagogical efficiency, and quality of the transfer of acquired skills. It is possible to take stock of training in process engineering in France, Europe, and the world. The chapter shows the significant number of training courses in process engineering around the world, the advance of Anglo‐Saxon universities, and the rise of Asia. The main axis of development of higher education institutions and engineering grandes ecoles logically concerns the disciplines and contents of training courses.
Article
Full-text available
Secondary data analysis, or the analysis of preexisting data, provides a powerful tool for the resourceful psychological scientist. Never has this been more true than now, when technological advances enable both sharing data across labs and continents and mining large sources of preexisting data. However, secondary data analysis is easily overlooked as a key domain for developing new open-science practices or improving analytic methods for robust data analysis. In this article, we provide researchers with the knowledge necessary to incorporate secondary data analysis into their methodological toolbox. We explain that secondary data analysis can be used for either exploratory or confirmatory work, and can be either correlational or experimental, and we highlight the advantages and disadvantages of this type of research. We describe how transparency-enhancing practices can improve and alter interpretations of results from secondary data analysis and discuss approaches that can be used to improve the robustness of reported results. We close by suggesting ways in which scientific subfields and institutions could address and improve the use of secondary data analysis.
Article
Full-text available
Transcranial magnetic stimulation (TMS) over human primary somatosensory cortex (S1) does not produce a measurable output. Researchers must rely on indirect methods to position the TMS coil. The 'gold standard' is to use individual functional and structural magnetic resonance imaging (MRI) data, but this method has not been used by most studies. Instead, the most common method used to locate the hand area of S1 (S1-hand) is to move the coil posteriorly from the hand area of M1. Yet, S1-hand is not directly posterior to M1-hand. Here, we addressed the localisation of S1-hand in four ways. First, we re-analysed functional MRI data from 20 participants who received vibrotactile stimulation to their 10 digits. Second, to assist localising S1-hand and M1-hand without MRI data, we constructed a probabilistic atlas of the central sulcus from 100 healthy adult MRIs, and measured the likely scalp location of S1-index. Third, we conducted two novel experiments mapping the effects of TMS across the scalp on tactile discrimination performance. Fourth, we examined all available MRI data from our laboratory on the scalp location of S1-index. Contrary to the prevailing method, and consistent with the systematic review, S1-index is close to the C3/C4 electroencephalography (EEG) electrode locations on the scalp, approximately 7-8 cm lateral to the vertex, and approximately 2 cm lateral and 0.5 cm posterior to the M1-FDI scalp location. These results suggest that an immediate revision to the most commonly used heuristic to locate S1-hand is required. The results of many TMS studies of S1-hand need reassessment.
Article
More and more psychological researchers have come to appreciate the perils of common but poorly justified research practices and are rethinking commonly held standards for evaluating research. As this methodological reform expresses itself in psychological research, peer reviewers of such work must also adapt their practices to remain relevant. Reviewers of journal submissions wield considerable power to promote methodological reform, and thereby contribute to the advancement of a more robust psychological literature. We describe concrete practices that reviewers can use to encourage transparency, intellectual humility, and more valid assessments of the methods and statistics reported in articles.
Article
Full-text available
The dominant paradigm for inference in psychology is a null-hypothesis significance testing one. Recently, the foundations of this paradigm have been shaken by several notable replication failures. One recommendation to remedy the replication crisis is to collect larger samples of participants. We argue that this recommendation misses a critical point, which is that increasing sample size will not remedy psychology’s lack of strong measurement, lack of strong theories and models, and lack of effective experimental control over error variance. In contrast, there is a long history of research in psychology employing small-N designs that treats the individual participant as the replication unit, which addresses each of these failings, and which produces results that are robust and readily replicated. We illustrate the properties of small-N and large-N designs using a simulated paradigm investigating the stage structure of response times. Our simulations highlight the high power and inferential validity of the small-N design, in contrast to the lower power and inferential indeterminacy of the large-N design. We argue that, if psychology is to be a mature quantitative science, then its primary theoretical aim should be to investigate systematic, functional relationships as they are manifested at the individual participant level and that, wherever possible, it should use methods that are optimized to identify relationships of this kind.
Article
Full-text available
Seeking to address the lack of research reproducibility in science, including psychology and the life sciences, a pragmatic solution has been raised recently: to use a stricter p < 0.005 standard for statistical significance when claiming evidence of new discoveries. Notwithstanding its potential impact, the proposal has motivated a large number of authors to dispute it from different philosophical and methodological angles. This article reflects on the original argument and the consequent counterarguments, and concludes with a simpler and better-suited alternative that the authors of the proposal knew about and, perhaps, should have made from their Jeffreysian perspective: to use a Bayes factor analysis in parallel (e.g., via JASP) in order to learn more about frequentist error statistics and about Bayesian prior and posterior beliefs without having to mix inconsistent research philosophies.
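To make the 'Bayes factor in parallel' suggestion concrete, here is a self-contained sketch for the simplest case: a normal mean with known standard deviation, a point null, and a normal prior on the alternative. This is only an illustration of reporting a Bayes factor alongside a p-value; JASP's default t-test uses a JZS (Cauchy) prior, which this sketch does not reproduce.

```python
# Sketch: Bayes factor for H1 (mean != 0, normal prior) vs H0 (mean = 0)
# for a normal sample with known sigma. Illustrative only; not JASP's
# default JZS analysis.
import math
from scipy.stats import norm

def bf10_normal_mean(xbar, n, sigma=1.0, prior_sd=1.0):
    se = sigma / math.sqrt(n)
    # Marginal likelihood of the sample mean under each hypothesis:
    m1 = norm.pdf(xbar, loc=0.0, scale=math.sqrt(prior_sd**2 + se**2))  # H1
    m0 = norm.pdf(xbar, loc=0.0, scale=se)                              # H0
    return m1 / m0

# Example (assumed numbers): sample mean 0.35 with n = 40 and sigma = 1
print(bf10_normal_mean(xbar=0.35, n=40))
```

With these assumed numbers the two-sided p-value is roughly 0.03 yet the Bayes factor is only about 1.7, the kind of contrast such a parallel analysis is intended to expose (and one that depends on the assumed prior scale).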
Article
Full-text available
Recent research has relied on trolley-type sacrificial moral dilemmas to study utilitarian versus nonutilitarian modes of moral decision-making. This research has generated important insights into people’s attitudes toward instrumental harm—that is, the sacrifice of an individual to save a greater number. But this approach also has serious limitations. Most notably, it ignores the positive, altruistic core of utilitarianism, which is characterized by impartial concern for the well-being of everyone, whether near or far. Here, we develop, refine, and validate a new scale—the Oxford Utilitarianism Scale—to dissociate individual differences in the ‘negative’ (permissive attitude toward instrumental harm) and ‘positive’ (impartial concern for the greater good) dimensions of utilitarian thinking as manifested in the general population. We show that these are two independent dimensions of proto-utilitarian tendencies in the lay population, each exhibiting a distinct psychological profile. Empathic concern, identification with the whole of humanity, and concern for future generations were positively associated with impartial beneficence but negatively associated with instrumental harm; and although instrumental harm was associated with subclinical psychopathy, impartial beneficence was associated with higher religiosity. Importantly, although these two dimensions were independent in the lay population, they were closely associated in a sample of moral philosophers. Acknowledging this dissociation between the instrumental harm and impartial beneficence components of utilitarian thinking in ordinary people can clarify existing debates about the nature of moral psychology and its relation to moral philosophy as well as generate fruitful avenues for further research.
Chapter
This chapter offers some recommendations and guiding principles for conducting interdisciplinary moral psychology research, which will benefit students and experienced scholars alike. It is especially helpful to scholars in the humanities looking to apply scientific methods to their work. Drawing from work at the intersection of philosophy and the cognitive and neural sciences, this chapter offers valuable advice for how to critically evaluate the scientific literature and avoid common pitfalls.
Article
Full-text available
We wish to answer this question: If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3: 1. This is far weaker evidence than the odds of 19 to 1 that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. If you observe p = 0.001 in a well-powered experiment, it gives a likelihood ratio of almost 100: 1 odds on there being a real effect. That would usually be regarded as conclusive. But the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. And, in this case, if you wanted to achieve a false positive risk of 5% you would need to observe p = 0.00045. It is recommended that the terms ‘significant’ and ‘non-significant’ should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value. Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as ‘statistically significant’. This practice must contribute to the lack of reproducibility in some areas of science. This is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.
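The arithmetic behind statements like these can be sketched with the simpler 'p ≤ α' version of the false positive risk, i.e. the share of significant results that come from true nulls given a prior probability of a real effect and the test's power. The power value of 0.8 and the priors of 0.5 and 0.1 below are assumptions; the exact 'p-equals' likelihood-ratio calculations in the article give different (generally larger) risks for a p-value observed right at the threshold.

```python
# Simplified false positive risk: among all "significant" results, what
# fraction come from true nulls? Uses the p <= alpha convention; the
# article's exact p-equals calculations differ for p observed at threshold.
def false_positive_risk(prior_real, alpha, power):
    fp = (1 - prior_real) * alpha   # significant results from true nulls
    tp = prior_real * power         # significant results from real effects
    return fp / (fp + tp)

for prior in (0.5, 0.1):
    print(f"prior={prior}: FPR at alpha=0.05, power=0.8 -> "
          f"{false_positive_risk(prior, 0.05, 0.8):.2%}")
```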
Article
Full-text available
Proposals to improve the reproducibility of biomedical research have emphasized scientific rigor. Although the word “rigor” is widely used, there has been little specific discussion as to what it means and how it can be achieved. We suggest that scientific rigor combines elements of mathematics, logic, philosophy, and ethics. We propose a framework for rigor that includes redundant experimental design, sound statistical analysis, recognition of error, avoidance of logical fallacies, and intellectual honesty. These elements lead to five actionable recommendations for research education.
Article
Full-text available
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a re-analysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested non-null effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of non-reproducibility. The results of this re-analysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false.
Article
Full-text available
Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so-and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.
Article
Full-text available
Good self-control has been linked to adaptive outcomes such as better health, cohesive personal relationships, success in the workplace and at school, and less susceptibility to crime and addictions. In contrast, self-control failure is linked to maladaptive outcomes. Understanding the mechanisms by which self-control predicts behavior may assist in promoting better regulation and outcomes. A popular approach to understanding self-control is the strength or resource depletion model. Self-control is conceptualized as a limited resource that becomes depleted after a period of exertion resulting in self-control failure. The model has typically been tested using a sequential-task experimental paradigm, in which people completing an initial self-control task have reduced self-control capacity and poorer performance on a subsequent task, a state known as ego depletion. Although a meta-analysis of ego-depletion experiments found a medium-sized effect, subsequent meta-analyses have questioned the size and existence of the effect and identified instances of possible bias. The analyses served as a catalyst for the current Registered Replication Report of the ego-depletion effect. Multiple laboratories (k = 23, total N = 2,141) conducted replications of a standardized ego-depletion protocol based on a sequential-task paradigm by Sripada et al. Meta-analysis of the studies revealed that the size of the ego-depletion effect was small with 95% confidence intervals (CIs) that encompassed zero (d = 0.04, 95% CI [−0.07, 0.15]). We discuss implications of the findings for the ego-depletion effect and the resource depletion model of self-control.
Article
Full-text available
Our purpose is to recommend a change in the paradigm of testing by generalizing a very natural idea, originated perhaps in Jeffreys (1935, 1961) and clearly exposed by DeGroot (1975), with the aim of developing an approach that is attractive to all schools of statistics, resulting in a procedure better suited to the needs of science. The essential idea is to base testing statistical hypotheses on minimizing a weighted sum of type I and type II error probabilities instead of the prevailing paradigm, which is fixing type I error and minimizing type II error. For simple vs. simple hypotheses, the optimal criterion is to reject the null using the likelihood ratio as the evidence (ordering) statistic, with a fixed threshold value instead of a fixed tail probability. By defining expected type I and type II errors, we generalize the weighting approach and find that the optimal region is defined by the evidence ratio, that is, a ratio of averaged likelihoods (with respect to a prior measure) and a fixed threshold. This approach yields an optimal theory in complete generality, which the classical theory of testing does not. This can be seen as a Bayesian/Non-Bayesian compromise: using a weighted sum of type I and type II error probabilities is Frequentist, but basing the test criterion on a ratio of marginalized likelihoods is Bayesian. We give arguments to push the theory still further, so that the weighting measures (priors) of the likelihoods do not have to be proper and highly informative, but just "well calibrated". That is, priors that give rise to the same evidence (marginal likelihoods) using minimal (smallest) training samples. The theory that emerges, similar to the theories based on objective Bayesian approaches, is a powerful response to criticisms of the prevailing approach of hypothesis testing. For criticisms see, for example, Ioannidis (2005) and Siegfried (2010), among many others.
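A small numerical illustration of the core idea, for an assumed simple-vs-simple case of a single observation from N(0,1) versus N(1,1): minimizing a·α + b·β leads to rejecting when the likelihood ratio exceeds a/b, i.e. a fixed evidence threshold rather than a fixed tail probability.

```python
# Illustration: for simple H0: N(0,1) vs H1: N(1,1), minimizing a*alpha + b*beta
# is achieved by rejecting when the likelihood ratio f1(x)/f0(x) exceeds a/b.
import math
from scipy.stats import norm

a, b = 1.0, 1.0          # assumed relative weights on Type I and Type II errors
mu0, mu1 = 0.0, 1.0      # simple hypotheses about a single observation x

# LR(x) = exp(x*(mu1-mu0) - (mu1**2 - mu0**2)/2) > a/b  <=>  x > x_cut, where:
x_cut = (math.log(a / b) + (mu1**2 - mu0**2) / 2) / (mu1 - mu0)

alpha = 1 - norm.cdf(x_cut, loc=mu0)   # P(reject | H0)
beta = norm.cdf(x_cut, loc=mu1)        # P(accept | H1)
print(f"cutoff x > {x_cut:.2f}: alpha = {alpha:.3f}, beta = {beta:.3f}, "
      f"weighted error = {a*alpha + b*beta:.3f}")
```

With equal weights the resulting α is about 0.31, far from the conventional 0.05, which is the point of contrast with the fixed-tail-probability paradigm.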
Article
Full-text available
Empirically analyzing empirical evidence: One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al. describe the replication of 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study. Science, this issue: 10.1126/science.aac4716
Article
Full-text available
In recent years, researchers have attempted to provide an indication of the prevalence of inflated Type 1 error rates by analyzing the distribution of p-values in the published literature. De Winter & Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a 'surge of p-values between 0.041-0.049 in recent decades' which 'suggests (but does not prove) questionable research practices have increased over the past 25 years.' I show that the changes in the ratio of fractions of p-values between 0.041-0.049 over the years are better explained by assuming the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias (or the file drawer effect) over the years (cf. Fanelli, 2012; Pautasso, 2010), which has led to a relative decrease of 'marginally significant' p-values in abstracts in the literature (instead of an increase in p-values just below 0.05). I explain why researchers analyzing large numbers of p-values need to relate their assumptions to a model of p-value distributions that takes into account the average power of the performed studies, the ratio of true positives to false positives in the literature, the effects of publication bias, and the Type 1 error rate (and possible mechanisms through which it has inflated). Finally, I discuss why publication bias and underpowered studies might be a bigger problem for science than inflated Type 1 error rates, and explain the challenges when attempting to draw conclusions about inflated Type 1 error rates from a large heterogeneous set of p-values.
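The dependence of the p-value distribution on power can be made explicit for a one-sided z-test (a simplifying assumption; the article's argument is framed more generally): the expected fraction of p-values falling between 0.041 and 0.049 changes with the study's power, so shifts in that fraction over time need not indicate questionable research practices.

```python
# Fraction of p-values expected in the 0.041-0.049 window for a one-sided
# z-test, as a function of power (one-sided framing assumed for simplicity).
from scipy.stats import norm

def fraction_in_window(power, lo=0.041, hi=0.049, alpha=0.05):
    delta = norm.ppf(1 - alpha) + norm.ppf(power)   # noncentrality implied by power
    # P(p <= x | H1) = Phi(delta - z_{1-x}) for a one-sided z-test
    cdf = lambda x: norm.cdf(delta - norm.ppf(1 - x))
    return cdf(hi) - cdf(lo)

for power in (0.2, 0.5, 0.8, 0.95):
    print(f"power={power}: P(0.041 < p < 0.049) = {fraction_in_window(power):.4f}")
```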
Article
Full-text available
The current crisis in scientific psychology about whether our findings are irreproducible was presaged years ago by Tversky and Kahneman (1971), who noted that even sophisticated researchers believe in the fallacious Law of Small Numbers: erroneous intuitions about how imprecisely sample data reflect population phenomena. Combined with the low power of most current work, this often leads to the use of misleading criteria about whether an effect has replicated. Rosenthal (1990) suggested more appropriate criteria, here labeled the continuously cumulating meta-analytic (CCMA) approach. For example, a CCMA analysis on a replication attempt that does not reach significance might nonetheless provide more, not less, evidence that the effect is real. Alternatively, measures of heterogeneity might show that two studies that differ in whether they are significant might have only trivially different effect sizes. We present a nontechnical introduction to the CCMA framework (referencing relevant software), and then explain how it can be used to address aspects of replicability or more generally to assess quantitative evidence from numerous studies. We then present some examples and simulation results using the CCMA approach that show how the combination of evidence can yield improved results over the consideration of single studies.
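A minimal illustration of cumulating evidence rather than tallying significance: a fixed-effect (inverse-variance) combination of two studies, using hypothetical numbers rather than anything from the article. The replication is 'non-significant' on its own, yet the pooled estimate is more precise and the combined p-value smaller than the original's.

```python
# Fixed-effect (inverse-variance) combination of effect estimates from two
# studies: a non-significant replication can still strengthen the pooled result.
import math
from scipy.stats import norm

def combine(effects, ses):
    w = [1 / se**2 for se in ses]
    pooled = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)
    pooled_se = math.sqrt(1 / sum(w))
    z = pooled / pooled_se
    return pooled, pooled_se, 2 * (1 - norm.cdf(abs(z)))

# Hypothetical numbers: original d = 0.45 (SE 0.20, p ~ .024),
# replication d = 0.25 (SE 0.18, p ~ .16, "non-significant")
pooled, se, p = combine([0.45, 0.25], [0.20, 0.18])
print(f"pooled d = {pooled:.3f} (SE {se:.3f}), combined p = {p:.4f}")
```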
Article
Full-text available
Hoijtink, van Kooten, and Hulsker (2016) present a method for choosing the prior distribution for an analysis with Bayes factors that is based on controlling error rates, which they advocate as an alternative to our more subjective methods (Morey & Rouder, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). We show that the method they advocate amounts to a simple significance test, and that the resulting Bayes factors are not interpretable. Additionally, their method fails in common circumstances, and has the potential to yield arbitrarily high Type II error rates. After critiquing their method, we outline the position on subjectivity that underlies our advocacy of Bayes factors.
Article
Full-text available
Tests of theory in marketing and consumer behavior research are frequently based on convenience samples of undergraduate college students. In a study of business-related ethicality, analysis of data from four dozen convenience samples of undergraduate business students revealed significant differences in means, variances, intercorrelations, and path parameters across the samples. Depending on the particular convenience sample used, relationships between variables and constructs were positive or negative and statistically significant or insignificant. The present research empirically documents, for the first time, the uncertainty created by using convenience samples of college students as research subjects. Only through empirical replications can researchers pragmatically assess the reliability, validity, and generalizability of research findings.
Article
Full-text available
The error statistical account of testing uses statistical considerations, not to provide a measure of probability of hypotheses, but to model patterns of irregularity that are useful for controlling, distinguishing, and learning from errors. The aim of this paper is (1) to explain the main points of contrast between the error statistical and the subjective Bayesian approach and (2) to elucidate the key errors that underlie the central objection raised by Colin Howson at our PSA 96 Symposium. Copyright 1997 by the Philosophy of Science Association. All rights reserved.
Article
Full-text available
An academic scientist's professional success depends on publishing. Publishing norms emphasize novel, positive results. As such, disciplinary incentives encourage design, analysis, and reporting decisions that elicit positive results and ignore negative results. Prior reports demonstrate how these incentives inflate the rate of false effects in published science. When incentives favor novelty over replication, false results persist in the literature unchallenged, reducing efficiency in knowledge accumulation. Previous suggestions to address this problem are unlikely to be effective. For example, a journal of negative results publishes otherwise unpublishable reports. This enshrines the low status of the journal and its content. The persistence of false findings can be meliorated with strategies that make the fundamental but abstract accuracy motive (getting it right) competitive with the more tangible and concrete incentive (getting it published). This article develops strategies for improving scientific practices and knowledge accumulation that account for ordinary human motivations and biases.
Article
Full-text available
Null hypothesis significance testing has been under attack in recent years, partly owing to the arbitrary nature of setting α (the decision-making threshold and probability of Type I error) at a constant value, usually 0.05. If the goal of null hypothesis testing is to present conclusions in which we have the highest possible confidence, then the only logical decision-making threshold is the value that minimizes the probability (or occasionally, cost) of making errors. Setting α to minimize the combination of Type I and Type II error at a critical effect size can easily be accomplished for traditional statistical tests by calculating the α associated with the minimum average of α and β at the critical effect size. This technique also has the flexibility to incorporate prior probabilities of null and alternate hypotheses and/or relative costs of Type I and Type II errors, if known. Using an optimal α results in stronger scientific inferences because it estimates and minimizes both Type I errors and relevant Type II errors for a test. It also results in greater transparency concerning assumptions about relevant effect size(s) and the relative costs of Type I and II errors. By contrast, the use of α = 0.05 results in arbitrary decisions about what effect sizes will likely be considered significant, if real, and results in arbitrary amounts of Type II error for meaningful potential effect sizes. We cannot identify a rationale for continuing to arbitrarily use α = 0.05 for null hypothesis significance tests in any field, when it is possible to determine an optimal α.
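The optimization described here can be sketched for a two-sided, two-sample z-approximation: scan candidate α levels and keep the one that minimizes a weighted combination of α and β at the critical effect size. The critical effect size, sample size, and weights below are assumptions for illustration; prior odds or error costs can be folded into the weights.

```python
# Find the alpha that minimizes a weighted combination of Type I and Type II
# error at a critical effect size (two-sample, two-sided z approximation).
import numpy as np
from scipy.stats import norm

def optimal_alpha(d_crit, n_per_group, w_type1=1.0, w_type2=1.0):
    alphas = np.linspace(1e-4, 0.3, 3000)
    z_crit = norm.ppf(1 - alphas / 2)
    ncp = d_crit * np.sqrt(n_per_group / 2)          # noncentrality
    power = 1 - norm.cdf(z_crit - ncp) + norm.cdf(-z_crit - ncp)
    beta = 1 - power
    cost = (w_type1 * alphas + w_type2 * beta) / (w_type1 + w_type2)
    i = np.argmin(cost)
    return alphas[i], beta[i]

a, b = optimal_alpha(d_crit=0.5, n_per_group=64)     # assumed numbers
print(f"optimal alpha ~ {a:.3f}, beta at that alpha ~ {b:.3f}")
```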
Article
Full-text available
Environmental management decisions are prone to expensive mistakes if they are triggered by hypothesis tests using the conventional Type I error rate (α) of 0.05. We derive optimal α-levels for decision-making by minimizing a cost function that specifies the overall cost of monitoring and management. When managing an economically valuable koala population, it shows that a decision based on α = 0.05 carries an expected cost over $5 million greater than the optimal decision. For a species of such value, there is never any benefit in guarding against the spurious detection of declines and therefore management should proceed directly to recovery action. This result holds in most circumstances where the species' value substantially exceeds its recovery costs. For species of lower economic value, we show that the conventional α-level of 0.05 rarely approximates the optimal decision-making threshold. This analysis supports calls for reversing the statistical 'burden of proof' in environmental decision-making when the cost of Type II errors is relatively high.
Article
Full-text available
verbal statements regarding "significance' are at best supererogatory restatements in an inconvenient dichotomous form of results already properly stated in terms of a continuous system of p values; at worst they carry unjustified surplus meaning of an entirely subjective kind under the guise of an objective and mathematically meaningful statement It is suggested that the accurate and factual statement of probabilities (two-tailed) should be mandatory and that all subjective considerations, arguments, and judgments should be clearly separated from such factual statements.
Article
Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results. It capitalizes on the fact that the distribution of significant p values, p-curve, is a function of the true underlying effect. Researchers armed only with sample sizes and test results of the published findings can correct for publication bias. We validate the technique with simulations and by reanalyzing data from the Many-Labs Replication project. We demonstrate that p-curve can arrive at conclusions opposite that of existing tools by reanalyzing the meta-analysis of the “choice overload” literature.
Article
The Open Science Collaboration recently reported that 36% of published findings from psychological studies were reproducible by their independent team of researchers. We can use this information to estimate the statistical power needed to produce these findings under various assumptions about prior probabilities and Type 1 error rates, and thereby calculate the expected distribution of positive and negative evidence. We can then compare this distribution to observations indicating that 90% of published findings in the psychological literature are statistically significant and support the authors' hypothesis, to obtain an estimate of publication bias. Such an estimate indicates that, assuming plausible priors, negative evidence was expected to be observed 30-200 times before one instance was published.
Article
Allan Franklin provides an overview of notable experiments in particle physics. Using papers published in Physical Review, the journal of the American Physical Society, as his basis, Franklin details the experiments themselves, their data collection, the events witnessed, and the interpretation of results. From these papers, he distills the dramatic changes to particle physics experimentation from 1894 through 2009.
Article
Significance There is increasing concern about the reproducibility of scientific research. For example, the costs associated with irreproducible preclinical research alone have recently been estimated at US$28 billion a year in the United States. However, there are currently no mechanisms in place to quickly identify findings that are unlikely to replicate. We show that prediction markets are well suited to bridge this gap. Prediction markets set up to estimate the reproducibility of 44 studies published in prominent psychology journals and replicated in The Reproducibility Project: Psychology predict the outcomes of the replications well and outperform a survey of individual forecasts.
Article
The current discussion of questionable research practices (QRPs) is meant to improve the quality of science. It is, however, important to conduct QRP studies with the same scrutiny as all research. We note problems with overestimates of QRP prevalence and the survey methods used in the frequently cited study by John, Loewenstein, and Prelec. In a survey of German psychologists, we decomposed QRP prevalence into its two multiplicative components: the proportion of scientists who ever committed a behavior and, if so, how frequently they repeated this behavior across all their research. The resulting prevalence estimates are lower by orders of magnitude. We conclude that inflated prevalence estimates, due to problematic interpretation of survey data, can create a descriptive norm (QRP is normal) that can counteract the injunctive norm to minimize QRPs and unwantedly damage the image of the behavioral sciences, which are essential to dealing with many societal problems.
Article
Drug development is not the only industrial-scientific enterprise subject to government regulations. In some fields of ecology and environmental sciences, the application of statistical methods is also regulated by ordinance. Over the past 20 years, ecologists and environmental scientists have argued against an unthinking application of null hypothesis significance tests. More recently, Canadian ecologists have suggested a new approach to significance testing, taking account of the costs of both type I and type II errors. In this paper, we investigate the implications of this for testing in drug development and demonstrate that its adoption leads directly to the likelihood principle and Bayesian approaches.
Article
Experiments that find larger differences between groups than actually exist in the population are more likely to pass stringent tests of significance and be published than experiments that find smaller differences. Published measures of the magnitude of experimental effects will therefore tend to overestimate these effects. This bias was investigated as a function of sample size, actual population difference, and alpha level. The overestimation of experimental effects was found to be quite large with the commonly employed significance levels of 5 per cent and 1 per cent. Further, the recently recommended measure, ω², was found to depend much more heavily on the alpha level employed than on the true population ω² value. Hence, it was concluded that effect size estimation is impractical unless scientific journals drop the consideration of statistical significance as one of the criteria of publication.
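A quick simulation (not the paper's own analysis) reproduces the qualitative pattern: among studies that happen to reach significance, the average estimated effect overstates the true effect, and more so at stricter α levels when power is low. The true effect, sample size, and number of simulations are arbitrary choices.

```python
# Simulation: average estimated effect among significant results overstates
# the true effect, especially for small samples and strict alpha levels.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
true_d, n_per_group, n_sims = 0.3, 30, 20_000

a = rng.normal(true_d, 1, size=(n_sims, n_per_group))
b = rng.normal(0.0, 1, size=(n_sims, n_per_group))
res = ttest_ind(a, b, axis=1)
d_hat = a.mean(axis=1) - b.mean(axis=1)   # crude effect estimate (sd ~ 1)

for alpha in (0.05, 0.01):
    sig = res.pvalue < alpha
    print(f"alpha={alpha}: mean estimated d among significant results = "
          f"{d_hat[sig].mean():.2f} (true d = {true_d})")
```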
Article
We discuss the traditional criterion for discovery in particle physics of requiring a significance corresponding to at least 5 sigma, and whether a more nuanced approach might be better.
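For reference, the 5 sigma criterion corresponds to a one-sided tail probability of roughly 2.9 × 10^-7 under a normal approximation; the mapping is a one-liner:

```python
# One-sided tail probabilities for n-sigma thresholds (particle-physics convention).
from scipy.stats import norm
for n_sigma in (3, 5):
    print(f"{n_sigma} sigma -> p = {1 - norm.cdf(n_sigma):.2e}")
# 3 sigma -> ~1.3e-03 (conventionally "evidence"), 5 sigma -> ~2.9e-07 ("discovery")
```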
Article
In social science, everything is somewhat correlated with everything (the "crud factor"), so whether H0 is refuted depends solely on statistical power. In psychology, the directional counternull of interest, H*, is not equivalent to the substantive theory T, there being many plausible alternative explanations of a mere directional trend (weak use of significance tests). Testing against a predicted point value (the strong use of significance tests) can discorroborate T by refuting H*. If used thus to abandon T forthwith, it is too strong, not allowing for theoretical verisimilitude as distinguished from truth. Defense and amendment of an apparently falsified T are appropriate strategies only when T has accumulated a good track record ("money in the bank") by making successful or near-miss predictions of low prior probability (Salmon's "damn strange coincidences"). Two rough indexes are proposed for numerifying the track record, by considering jointly how intolerant (risky) and how close (accurate) are its predictions.
Article
The present position of the art of field experimentation is one of rather special interest. For more than fifteen years the attention of agriculturalists has been turned to the errors of field experiments. During this period, experiments of the uniformity trial type have demonstrated the magnitude and ubiquity of that class of error which cannot be ascribed to carelessness in measuring the land or weighing the produce, and which is consequently described as due to “soil heterogeneity”; much ingenuity has been expended in devising plans for the proper arrangement of the plots; and not without result, for there can be little doubt that the standard of accuracy has been materially, though very irregularly, raised. What makes the present position interesting is that it is now possible to demonstrate (a) that the actual position of the problem is very much more intricate than was till recently imagined, but that realising this (b) the problem itself becomes much more definite and (c) its solution correspondingly more rigorous.
Article
Although replications are vital to scientific progress, psychologists rarely engage in systematic replication efforts. In this article, we consider psychologists' narrative approach to scientific publications as an underlying reason for this neglect and propose an incentive structure for replications within psychology. First, researchers need accessible outlets for publishing replications. To accomplish this, psychology journals could publish replication reports in files that are electronically linked to reports of the original research. Second, replications should get cited. This can be achieved by cociting replications along with original research reports. Third, replications should become a valued collaborative effort. This can be realized by incorporating replications in teaching programs and by stimulating adversarial collaborations. The proposed incentive structure for replications can be developed in a relatively simple and cost-effective manner. By promoting replications, this incentive structure may greatly enhance the dependability of psychology's knowledge base.
Chapter
Drug development is the process of finding and producing therapeutically useful pharmaceuticals, turning them into safe and effective medicine, and producing reliable information regarding the appropriate dosage and dosing intervals. With regulatory authorities demanding increasingly higher standards in such developments, statistics has become an intrinsic and critical element in the design and conduct of drug development programmes. Statistical Issues in Drug Development presents an essential and thought provoking guide to the statistical issues and controversies involved in drug development. This highly readable second edition has been updated to include: Comprehensive coverage of the design and interpretation of clinical trials. Expanded sections on missing data, equivalence, meta-analysis and dose finding. An examination of both Bayesian and frequentist methods. A new chapter on pharmacogenomics and expanded coverage of pharmaco-epidemiology and pharmaco-economics. Coverage of the ICH guidelines, in particular ICH E9, Statistical Principles for Clinical Trials. It is hoped that the book will stimulate dialogue between statisticians and life scientists working within the pharmaceutical industry. The accessible and wide-ranging coverage make it essential reading for both statisticians and non-statisticians working in the pharmaceutical industry, regulatory bodies and medical research institutes. There is also much to benefit undergraduate and postgraduate students whose courses include a medical statistics component.
Article
The problem of testing statistical hypotheses is an old one. Its origins are usually connected with the name of Thomas Bayes, who gave the well-known theorem on the probabilities a posteriori of the possible “causes” of a given event.* Since then it has been discussed by many writers of whom we shall here mention two only, Bertrand† and Borel,‡ whose differing views serve well to illustrate the point from which we shall approach the subject.
Article
P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous, but even the modern frequentist has little time for them. In this essay, I consider what, if anything, might be said in their favour.
Article
With the increase in genome-wide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genome-wide data set are tested against some null hypothesis, where many features are expected to be significant. Here we propose an approach to statistical significance in the analysis of genome-wide data sets, based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true findings and the number of false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q-value is associated with each tested feature in addition to the traditional p-value. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.
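As a concrete companion to this abstract, below is a minimal Benjamini-Hochberg step-up procedure, one standard way to control the false discovery rate over many tests. The q-value approach described in the abstract goes further by estimating the proportion of true null hypotheses, which this sketch deliberately omits.

```python
# Minimal Benjamini-Hochberg step-up procedure controlling the FDR at level q.
# (Storey's q-value method additionally estimates the true-null proportion.)
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0  # largest passing rank
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

pvals = [0.0002, 0.009, 0.013, 0.04, 0.11, 0.49, 0.74]
print(benjamini_hochberg(pvals, q=0.05))
```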