Fig 2 - uploaded by Merlise Aycock Clyde
Relationship between the P-value threshold, power, and the false positive rate. Calculated according to Equation (2), with prior odds defined as
Source publication
We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.
Contexts in source publication
Context 1
... and for significance thresholds α = 0.05 and α = 0.005, Figure 2 shows the false positive rate as a function of power 1 − β, calculated according to Equation (2), with prior odds defined as ...
Context 2
... many studies, statistical power is low (e.g., ref. 7). Fig. 2 demonstrates that low statistical power and α = 0.05 combine to produce high false positive ...
Context 3
... many, the calculations illustrated by Fig. 2 may be unsettling. For example, the false positive rate is greater than 33% with prior odds of 1:10 and a P-value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5%. Similar reductions in false positive rates would occur over a wide range ...
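The figures quoted in this excerpt follow from Bayes' rule: the false positive rate is α·π₀ / (α·π₀ + (1 − β)·π₁), where π₁/π₀ are the prior odds of a true effect. A minimal sketch checking the 1:10 case (the figure's own script is in R; this Python translation and its function name are mine):

```python
def false_positive_rate(alpha, power, prior_odds):
    """P(H0 true | P < alpha), given prior_odds = P(H1) / P(H0)."""
    pi1 = prior_odds / (1 + prior_odds)  # prior probability of a true effect
    pi0 = 1 - pi1                        # prior probability of the null
    return alpha * pi0 / (alpha * pi0 + power * pi1)

# At prior odds 1:10 the FPR never falls below 1/3 for alpha = 0.05,
# even at perfect power; alpha = 0.005 lowers that floor to about 5%.
print(false_positive_rate(0.05, 1.0, 1 / 10))   # ≈ 0.333
print(false_positive_rate(0.005, 1.0, 1 / 10))  # ≈ 0.048
```

Because lower power only shrinks the denominator, power = 1 gives the minimum false positive rate for any fixed α and prior odds.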
Context 4
... an increase means that fewer studies can be conducted using current experimental designs and budgets. But Figure 2 shows the benefit: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises. ...
Context 5
... agree that the significance threshold selected for claiming a new discovery should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of Type I versus Type II errors, and other factors that vary by research topic. For exploratory research with very low prior odds (well outside the range in Figure 2), even lower significance thresholds than 0.005 are needed. Recognition of this issue led the genetics research community to move to a "genome-wide significance threshold" of 5×10⁻⁸ over a decade ago. ...
Context 6
... recommendation applies to disciplines with prior odds broadly in the range depicted in Figure 2, where use of P < 0.05 as a default is widespread. Within those disciplines, it is helpful for consumers of research to have a consistent benchmark. ...
Context 7
... = .8)
# y axis on the left - main
axis(side = 2, at = c(-0.2, 0.3, 0.5, 1, 2, 5, 10, 20, 50, 100),
     labels = c("", "0.3", "0.5", "1.0", "2.0", "5.0", "10.0", "20.0", "50.0", "100.0"), lwd ...
Context 8
... test size
pi0 = 5/6  # prior probability
N = 10^6   # doesn't matter
# graph margins
par(mai = c(0.8, 0.8, 0.1, 0.1))
par(mgp = c(2, 1, 0))
plot(pow1, alpha*N*pi0/(alpha*N*pi0 + pow1*(1 - pi0)*N), type = 'n',
     ylim = c(0, 1), xlim = c(0, 1.5), xlab = 'Power', ylab = 'False positive rate',
     bty = "n", xaxt = "n", yaxt = "n")
# grid lines
segments(x0 = -0.058, ...
Context 9
... odds = 1:40", "Prior odds = 1:10", "Prior odds = 1:5"), pch = c(15, 15, 15),
       col = c("green", "red", "blue"), cex = 1)
############### Use these commands to add brackets in Figure 2
library(pBrackets)
# add text and brackets
text(1.11, (odd_1_5_2 + odd_1_40_2)/2,
     expression(paste(italic(P), " < 0.05 threshold")), cex = 0.9, adj = 0)
text(1.11, (odd_1_5_1 + odd_1_40_1)/2, ...
Similar publications
"We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005."
Citations
... But it was not until very recently, that consensus has been emerging across fields like psychology [8], statistics [3], and beyond [2]. Multiple publications in Nature have advocated for redefining [4] or even abandoning the idea of statistical significance altogether [1,2,12]. ...
Personal informatics has emerged as one of the most popular topics in the HCI + health community, with its motto of "self-knowledge through numbers" resonating deeply with many researchers (myself included). As personal informatics researchers, we are fortunate to have an abundance of analytical methods at our disposal, helping us make sense of raw data. It has been a thrilling journey learning from various disciplines like statistics, psychometrics, and fundamental science. What I have learned has come from analyzing my own data collected with early generations of Fitbit wristbands, from digging through literature, and yes, from those painful yet valuable journal and conference reviews. In this paper, I hope to share some lessons I have learned and spark reflections on how we, as an emerging research community, can improve the rigor of our statistical practices. After all, sometimes a little bit of introspection can go a long way.
... "In statistical practice, perhaps the single biggest problem with p-values is that they are often misinterpreted in ways that lead to overstating the evidence against the null hypothesis" (Benjamin and Berger 2019). The argument that p-values exaggerate the evidence against the null hypothesis has gained ground over decades (Berger and Sellke 1987; Goodman 1993; Stang et al. 2010), culminating in an initiative to reduce the level of statistical significance from 0.05 to 0.005 in certain fields of social science (Benjamin et al. 2018; Machery 2021). ...
This paper proposes a simple correction to the Bayesian information criterion (BIC) for small samples to ensure that it neither overstates nor understates the evidence against a null hypothesis or other tested model. The new correction raises the likelihood ratio in the BIC to the power of 1 minus the reciprocal of the sample size (1 − 1/n, n > 1). That is equivalent to multiplying the log-likelihood term of the BIC by a factor of 1 − 1/n. The correction is applied to the problem of calibrating p-values by transforming them to estimated Bayes factors. The corresponding calibration in the most common case is simply sqrt(n)/exp((1−1/n)*qchisq(1−p,df=1)/2) in R syntax, where the p-value is from a likelihood-ratio test. That intersects the class of betting scores called e-values and, more specifically, admissible calibrators. While all admissible calibrators neither overstate nor understate the evidence against the null hypothesis, previous admissible calibrators are not model-selection consistent since they do not increasingly favor the null hypothesis when it is true. The proposed calibrator is consistent under general conditions, for its corrected BIC is asymptotically equivalent to the BIC.
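The one-line R calibration quoted in this abstract can be sketched with only the standard library: for df = 1, qchisq(1 − p, 1) equals the squared normal quantile Φ⁻¹(1 − p/2)². A minimal Python translation (the function name is mine; the formula is the abstract's):

```python
from math import exp, sqrt
from statistics import NormalDist

def calibrated_bayes_factor(p, n):
    """sqrt(n) / exp((1 - 1/n) * qchisq(1 - p, df=1) / 2):
    the abstract's calibration of a likelihood-ratio-test p-value (df = 1)
    to an estimated Bayes factor. Smaller outputs correspond to smaller
    p-values, i.e. stronger evidence against the null on this scale."""
    # chi-square(1) quantile at 1 - p, via the normal quantile identity
    q = NormalDist().inv_cdf(1 - p / 2) ** 2
    return sqrt(n) / exp((1 - 1 / n) * q / 2)

print(calibrated_bayes_factor(0.005, 100))  # ≈ 0.20
print(calibrated_bayes_factor(0.5, 100))    # ≈ 8.0
```

Note how the sqrt(n) factor makes the calibrator increasingly favor the null as n grows for a fixed, unimpressive p, which is the model-selection-consistency property the abstract emphasizes.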
... Importantly, the evidence linking belief in an event conspiracy theory to reduced stress is strong from a statistical standpoint. Not only do the results reveal the predicted effect in a relatively large sample, the statistical evidence for the effect (p < 0.001) is impressive relative to even the most conservative standards [62]. However, a key limitation of Study 1 is the cross-sectional nature of the data, which seriously limits the ability to draw inferences about prospective relations among the variables. ...
Recent theorizing suggests that people gravitate toward conspiracy theories during difficult times because such beliefs promise to alleviate threats to psychological motives. Surprisingly, however, previous research has largely failed to find beneficial intrapersonal effects of endorsing an event conspiracy theory for outcomes like well-being. The current research provides correlational evidence for a link between well-being and an event conspiracy belief by teasing apart this relation from (1) the influence of experiencing turmoil that nudges people toward believing the event conspiracy theory in the first place and (2) conspiracist ideation—the general tendency to engage in conspiratorial thinking. Across two studies we find that, when statistically accounting for the degree of economic turmoil recently experienced and conspiracist ideation, greater belief in COVID-19 conspiracy theories concurrently predicts less stress and longitudinally predicts greater contentment. However, the relation between COVID-19 conspiracy belief and contentment diminishes in size over time. These findings suggest that despite their numerous negative consequences, event conspiracy beliefs are associated with at least temporary intrapersonal benefits.
... Respondents' answer was modeled as a continuous response. Following Benjamin et al. (2018), we used α = 0.005 throughout, interpreting p-values between 0.005 and 0.05 as providing suggestive evidence. ...
We investigate Scottish end users' and professional forecasters' risk perception in relation to the 5‐point European Avalanche Danger Scale by eliciting numerical estimates of the probability of triggering an avalanche. Our main findings are that neither end users nor professional forecasters interpret the avalanche danger scale as intended, that is, in an exponential fashion. Second, we find that numerical interpretations by end users and professional forecasters have high variance, but are similar, in that both groups tend to overestimate the probability of triggering an avalanche and underestimate the relative risk increase. Finally, we find significant differences in the perceived probability of triggering an avalanche relative to a low or moderate avalanche danger level, and in the numerical interpretation of verbal probability terms depending on whether respondents provide their estimates using a frequency or a percentage chance format. We summarize our findings by identifying important lessons to improve avalanche risk understanding and its communication.
... We called these latter models independent variables PGLS. Statistics based on overconfidence in P-values are today widely criticized (Benjamin et al. 2018, Amrhein et al. 2019, Ioannidis 2019, Yang et al. 2023). The consequences of the widespread use of the P-value, in conjunction with certain questionable research practices (Fiedler and Schwarz 2016, Fraser et al. 2018), have been called the 'replication crisis' (Loken and Gelman 2017). ...
Hearing is essential for odontocete ecology, supporting navigation, hunting prey, communication, and mother-calf bonding. This study examines morphological variation in the periotic bone, focusing on its taxonomic value, its phylogenetic signal, and the influence of ecological factors on its evolution. Using photogrammetry and 3D geometric morphometrics, we analysed 95 periotic bones from 32 species across five families (Delphinidae, Pontoporiidae, Phocoenidae, Ziphiidae, and Physeteridae). The specimens were mainly sourced from three osteological collections in Argentina, covering a wide range of odontocete taxa. We assessed the association and differentiation between families based on periotic shape, estimated the phylogenetic signal, and evaluated the influence of ecological variables on shape variation. Our results revealed clear differences between odontocete families, with a shared periotic morphotype for ziphiids and physeterids and another distinct periotic morphotype grouping of Delphinidae, Pontoporiidae, and Phocoenidae. Phylogenetic analyses showed a strong phylogenetic signal in periotic morphology, while ecological factors such as diet, habitat, diving ecology, and biosonar types were identified as key influences on its evolution. Overall, periotic shape reflects both phylogenetic history and ecological adaptations, offering significant taxonomic value by enabling clear species differentiation.
... Two approaches were undertaken to control the Type-I error rate. First, p values < 0.05 were treated as suggestive, and those ≤ 0.005 as significant (Benjamin et al., 2018). Second, because larger samples can produce many significant but inconsequential correlations, we adopted a smallest effect size of interest (SESOI) equivalent to r = 0.10 (Ferguson & Heene, 2021) and treated correlations smaller than this benchmark as statistical noise. ...
Subclinical narcissism, psychopathy, and Machiavellianism are a cluster of manipulative, callous, and entitled traits known as the Dark Triad (DT). These traits have been repeatedly linked to short-term mating strategies and a tolerance for uncommitted sexual behavior (i.e., unrestricted sociosexuality) in both men and women, a pattern interpreted as consistent with life history theory. Alongside sociosexuality, individuals vary in their distinct capacities toward sexual excitation and sexual inhibition. Although much research has examined the relationships between DT traits and sociosexuality, and between sociosexuality and sexual excitation/inhibition, none has simultaneously evaluated the links among all three. In a large undergraduate sample, DT traits and sexual excitation/inhibition showed unique multivariate associations with sociosexuality, even when accounting for age, sex, relationship status, and sexual orientation. Results suggest that DT traits, elevated sexual excitation, lower inhibition, and bisexuality facilitate fast life history strategies in both males and females.
... This stricter threshold was chosen for several reasons. First, it reduces the risk of false-positive results, which is particularly relevant in retrospective studies where uncontrolled confounding factors may influence outcomes [33]. Second, given the multiple statistical models employed, a stricter significance level mitigates the risk of spurious associations arising from multiple testing. ...
... In contrast to expectations and rationale, PTE of teeth with poor long-term prognosis before HNR does not appear to decrease the frequency of having experienced JORN to levels comparable to those in patients who did not undergo/require PTE before HNR. However, the findings of other recent studies [26,29,33-36] that pre-HNR PTE is an independent risk factor for the development of JORN could not be unreservedly confirmed, warranting further research. HNR patients with PTE in this study had a 61% increased risk for JORN occurrence and were associated with 31% higher odds for experiencing JORN compared with those not requiring/receiving pre-HNR PTE, but these findings were at best marginally significant. ...
Background/Objectives: This retrospective study examined the relationship between prophylactic tooth extraction (PTE) and the occurrence of jaw osteoradionecrosis (JORN) in patients undergoing head and neck radiotherapy (HNR). The primary objective was to determine whether PTE resulted in a JORN rate comparable to that of patients who did not require or undergo PTE. Methods: A total of 497 patients were included. The primary predictor variable was PTE, and the primary outcome was JORN occurrence. Statistical analyses included univariate, bivariate, and multivariate regression, as well as Cox regression. The significance threshold was set at p ≤ 0.005. Results: JORN was more frequent in the PTE group than in patients who did not require or undergo PTE (17.1% vs. 13.0%; hazard ratio [HR] 1.71, 95% CI: 1.08–2.71, p = 0.021). However, a significant association could not be confirmed using multiple logistic regression (odds ratio [OR] 1.36, 95% CI: 0.82–2.26, p = 0.236). Suggestive associations were observed for HNR dose (HR 1.03 per Gy, p = 0.007) and tumor location (pharyngeal HR 0.52, p = 0.03; laryngeal HR 0.51, p = 0.02). Conclusions: Patients with PTE showed a higher JORN rate but the findings were only marginally significant, and no causal relationship was established. The differing results between Cox and logistic regression suggest a time-dependent effect of PTE, with an increased early risk for JORN. Further studies are needed to determine whether greater emphasis should be placed on tooth-preserving measures, limiting extractions before HNR to strictly non-preservable teeth.
... Not surprising then that in 2018, in NATURE, no fewer than seventy-two statisticians (experts in such scientific areas as Biomedicine, Epidemiology, Finance, Industrial Economics, Political Science, Psychology, …) signed a call to "redefine statistical significance" (Benjamin et al., 2018). They argue that we can't claim to have proved anything new unless p turns out to be less than half a percent (p ≤ .005). ...
Manuscript of the article published online by the Journal of Marketing Analytics; the printed version has been posted by SPRINGER NATURE on the profiles of its authors, Alain BULTEZ and Jean-Luc HERRMANN. For readers interested in the technical details, the first author recommends consulting this manuscript.
In most scientific disciplines, the p-value is, still too often, revered as "the" conclusive criterion for hypothesis testing. For years, scholars from diverse horizons have alerted to the hazards of daredevil claims of "proofs" entailed by mere benchmarking of p against an arbitrarily predetermined risk-tolerance cutoff. Deciding whether empirical results are "statistically significant" or not, through such a dichotomization, not only bypasses the uncertainty hanging over the supposed effects, but also prompts insignificance verdicts. First, a cross-examination of an article published by JMR in 2021 traces the root causes of quick-and-dirty handling of p, and recaptures the true meaning of this misconstrued statistic. Next, a recalibration of it is advertised, for it yields a measure of the probability that the effect studied is real: its plausibility. Complementarily, to highlight the substantiality of expected effects, forest charts of their compatibility/confidence intervals are promoted. An example borrowed from an article issued by JAMS in 2022 proves the relevance of the add-ons proposed. Finally, the decision aid Bayesian econometrics provide—materialized by curve plots of credible intervals, chances of positive outcomes, and risks of negative ones—is illustrated through a thorough revisit of a 2018 marketing research case.
... Third, researchers should consider choosing prediction over explanation [62], reporting results which are directly aimed at generalising to unseen data rather than relying on statistical inference in null-hypothesis significance testing (NHST) within a sample. Issues with NHST are well documented [26,63,64] and, combined with small sample sizes, lead to underpowered studies that report distorted effect sizes [8,11,22,65]. Another issue with NHST is that P values may be derived from inappropriate null models (as we saw in Fig. 4); choosing an appropriate null for brain-wide statistics [66] is therefore yet another factor worth considering. ...
Recent studies have used big neuroimaging datasets to answer an important question: how many subjects are required for reproducible brain-wide association studies? These data-driven approaches could be considered a framework for testing the reproducibility of several neuroimaging models and measures. Here we test part of this framework, namely estimates of statistical errors of univariate brain-behaviour associations obtained from resampling large datasets with replacement. We demonstrate that reported estimates of statistical errors are largely a consequence of bias introduced by random effects when sampling with replacement close to the full sample size. We show that future meta-analyses can largely avoid these biases by only resampling up to 10% of the full sample size. We discuss implications that reproducing mass-univariate association studies requires tens-of-thousands of participants, urging researchers to adopt other methodological approaches.