Article

EXPRESS: “Statistical Significance” and Statistical Reporting: Moving Beyond Binary

Abstract

Null hypothesis significance testing (NHST) is the default approach to statistical analysis and reporting in marketing and the biomedical and social sciences more broadly. Despite its default role, NHST has long been criticized by both statisticians and applied researchers including those within marketing. Therefore, the authors propose a major transition in statistical analysis and reporting. Specifically, they propose moving beyond binary: abandoning NHST as the default approach to statistical analysis and reporting. To facilitate this, they briefly review some of the principal problems associated with NHST. They next discuss some principles that they believe should underlie statistical analysis and reporting. They then use these principles to motivate some guidelines for statistical analysis and reporting. They next provide some examples that illustrate statistical analysis and reporting that adheres to their principles and guidelines. They conclude with a brief discussion.

... Going a step further, † In science, it is hardly possible to unequivocally speak of facts or of truths. Likewise, whether or not an effect exists is not always a binary question, i.e., treatment manipulations rarely produce population effects identically equal to zero, although they may be so small that they are negligible (21). In practical terms, we will use the terms "highly robust" and "replicable" when describing results from a Book of Truths perspective. ...
... One of the biggest criticisms of using NHST for defining replication success is that it dichotomizes results (success or failure) in ways that lead to biased reporting and fallacious reasoning. For example, an effect may be present in a population, but, due to natural sampling variability, some replications may be statistically significant by NHST [typically according to an arbitrary threshold of evidence (104)] ... The development and interpretation of statistical methods for assessing replication is an active area of research, with many promising avenues (21, 109-112). A full accounting of modern approaches is beyond the scope of our discussion, but some guiding principles include 1) taking an estimation perspective, i.e., prioritizing interval estimates, and considering evidence in a continuous fashion (21), 2) leveraging Bayesian decision-making (109-111), and 3) moving beyond goodness-of-fit indices (112). ...
... For example, an effect may be present in a population, but, due to natural sampling variability, some replications may be statistically significant by NHST [typically according to an arbitrary threshold of evidence (104)] ... The development and interpretation of statistical methods for assessing replication is an active area of research, with many promising avenues (21, 109-112). A full accounting of modern approaches is beyond the scope of our discussion, but some guiding principles include 1) taking an estimation perspective, i.e., prioritizing interval estimates, and considering evidence in a continuous fashion (21), 2) leveraging Bayesian decision-making (109-111), and 3) moving beyond goodness-of-fit indices (112). There are also recent methods designed to estimate and characterize heterogeneous treatment effects (113). ...
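To make the estimation-first principle above concrete, here is a minimal Python sketch, with purely hypothetical numbers rather than data from any cited study, that reports a point estimate, a 95% interval, and the p-value as a continuous quantity instead of a significant/non-significant verdict:

```python
# Minimal sketch (hypothetical estimate and SE, not from the cited studies):
# report the estimate, an interval, and p as a continuous measure of
# compatibility rather than as a pass/fail verdict at 0.05.
import numpy as np
from scipy import stats

estimate, se = 0.42, 0.24                      # hypothetical effect and standard error
ci_low, ci_high = estimate + np.array([-1, 1]) * 1.96 * se
z = estimate / se                              # test statistic for a zero-effect model
p = 2 * stats.norm.sf(abs(z))                  # two-sided p-value, reported as-is

print(f"estimate = {estimate:.2f}, 95% interval = [{ci_low:.2f}, {ci_high:.2f}]")
print(f"p = {p:.3f}  (continuous compatibility with a zero effect; no 0.05 verdict)")
```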
Article
Full-text available
Replication and the reported crises impacting many fields of research have become a focal point for the sciences. This has led to reforms in publishing, methodological design and reporting, and increased numbers of experimental replications coordinated across many laboratories. While replication is rightly considered an indispensable tool of science, financial resources and researchers’ time are quite limited. In this perspective, we examine different values and attitudes that scientists can consider when deciding whether to replicate a finding and how. We offer a conceptual framework for assessing the usefulness of various replication tools, such as preregistration.
... For the latter, the larger the volume of data, the better the representation of reality; but then complexity and management costs increase, and scholars may need to make compromises. That is why researchers resort to sampling, with the consequent sampling and measurement error and their effects on variance (McShane et al., 2024; McShane & Bockenholt, 2016). Overall, while bias results from an incomplete representation of the true data-generating mechanism by a model because of simplifying assumptions, variance results from random variation in the data due to sampling and measurement error (Wedel & Kannan, 2016, p. 104). ...
... Therefore, it is worthwhile to justify the rationale behind a selected sample size (McShane et al., 2024). While it is well known that a larger volume of data reduces the variance, a better understanding of current practices on sample size and sampling procedures in B2B marketing can help to reduce uncertainty and identify whether journals are accomplishing high standards. ...
... The findings have shown certain elements of potential improvement in terms of methodology; however, in general there is consensus on the preferred approaches. Overall, the sample size is still a battlefield that should be considered, as the problems associated with its selection tend to be the object of criticism (McShane et al., 2024; Rigdon & Sarstedt, 2022). In this regard, addressing concerns about sample size is of utmost importance for greater generalizability of results and for reaching greater academic rigor. ...
Article
Full-text available
Sampling procedures and sample size have the potential to improve the accuracy and efficiency of business-to-business (B2B) marketing research, as well as provide added statistical rigor to research. Therefore, justifying the rationale for sampling is useful in reducing uncertainty and accomplishing high standards. This paper aims to understand the sampling approaches applied in B2B marketing through a literature review of the three most reputable journals focused on this area in the 2013-2023 period. Furthermore, the paper also aims to develop a set of best practices. Results suggest that, although quantitative research is gaining presence in B2B marketing literature, there are different areas of improvement in terms of justification, explanation of the methodology, and the use of data. Achieving a minimum sample size along with the combination of primary and secondary data, and justifying the sampling, data collection, and methods used through statistical lenses, refraining from simplistic arguments, are not only valuable sources of detailed information, but can also increase confidence in quantitative research.
... Statistical significance misuse is a pervasive issue in medicine [1][2][3][4][5][6]. Despite being recognized problems for decades, nullism (exclusive analysis of the null hypothesis of zero effect), the magnitude fallacy (misinterpretation of statistical significance as practical significance), and dichotomania (p-value < 0.05 is significant, p-value ≥ 0.05 is non-significant) persist and consistently lead to serious health consequences, including misidentification of adverse events, illusory replication failures, and mistrust in science [7][8][9][10][11][12][13][14][15][16][17][18][19][20]. Therefore, we propose and discuss alternatives to statistical significance. ...
... In this regard, we suggest that the "the least conditional approach" is most suitable, since this still relies on the authors' overall ability to formulate conclusions. Inference exists only in the degree of consistency of various studies, and it is essential to understand that the p-value (or s-value) for the null or any other hypothesis does not play a particular role in it [8]. For example, if in 10 different studies, conducted to the best of our capacities, we obtain an HR of approximately 4 and a null p≈0.10 (>0.05) (i.e., null s≈3) each time, the most plausible hypothesis is still HR=4 and not the null hypothesis HR0=1. ...
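The arithmetic behind this hazard-ratio example can be sketched in a few lines; the per-study HR of about 4 and null p of about 0.10 come from the passage above, while the inverse-variance pooling is a standard illustration added here, not the cited authors' own calculation:

```python
# Sketch under simplifying assumptions: ten studies each estimate HR ~ 4 with a
# two-sided null p ~ 0.10, so each is "non-significant" in isolation, yet the
# combined evidence sits far from HR = 1 and squarely on HR ~ 4.
import numpy as np
from scipy import stats

hr_hat, p_null, k = 4.0, 0.10, 10
z_single = stats.norm.isf(p_null / 2)          # ~1.645
se_single = np.log(hr_hat) / z_single          # implied per-study SE of log(HR)

se_pooled = se_single / np.sqrt(k)             # fixed-effect (inverse-variance) pooling
z_pooled = np.log(hr_hat) / se_pooled
p_pooled = 2 * stats.norm.sf(z_pooled)

print(f"per-study SE(log HR) = {se_single:.2f}")
print(f"pooled two-sided p against HR = 1: {p_pooled:.1e}")   # ~2e-07
```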
Article
Full-text available
Statistical testing in medicine is a controversial and commonly misunderstood topic. Despite decades of efforts by renowned associations and international experts, fallacies such as nullism, the magnitude fallacy, and dichotomania are still widespread within clinical and epidemiological research. This can lead to serious health errors (e.g., misidentification of adverse reactions). In this regard, our work sheds light on another common interpretive and cognitive error: the fallacy of high significance, understood as the mistaken tendency to prioritize findings that lead to low p-values. Indeed, there are target hypotheses (e.g., a hazard ratio of 0.10) for which a high p-value is an optimal and desirable outcome. Accordingly, we propose a novel method that goes beyond mere null hypothesis testing by assessing the statistical surprise of the experimental result compared to the prediction of several target assumptions. Additionally, we formalize the concept of interval hypotheses based on prior information about costs, risks, and benefits for the stakeholders (NORD-h protocol). The incompatibility graph (or surprisal graph) is adopted in this context. Finally, we discuss the epistemic necessity for a descriptive, (quasi) unconditional approach in statistics, which is essential to draw valid conclusions about the consistency of data with all relevant possibilities, including study limitations. Given these considerations, this new protocol has the potential to significantly impact the production of reliable evidence in public health.
... In light of the costs and risks linked to investigations in public health, it is essential to provide an overview of the most common errors and seek both short-term and long-term solutions. The first common flawed approach is the so-called null hypothesis significance testing (NHST), where only the point hypothesis of zero effect is considered and evaluated in dichotomous terms of 'significance' and 'non-significance' 24,28,29 . Even in the utopian scenario where all background assumptions are perfectly met, a large p-value for the null hypothesis only indicates a high degree of compatibility of the latter with the data (as conditionally evaluated by the test) but does not in any way support such a hypothesis over others. ...
... Good 1952) 32 . In this regard, the reasons behind the vast success of NHST should be sought in university education and cognitive distortions aimed at oversimplifying complex concepts 7,18,24,33 . As a remedy, Rafi and Greenland 26 propose to explain the ambiguous and unclear concept of 'statistical significance' through familiar statistical phenomena such as flipping an unbiased two-headed coin. ...
... In Study 3, we confirmed the effect of anthropomorphism on rating valence, noting that the direct effect reaches statistical significance only at the 10% level. Nevertheless, following McShane et al. (2024), who critically discussed the dichotomous treatment of the arbitrary 5% level for statistical significance, we consider this result a contribution to the cumulative evidence compiled in this paper where the direct effect is replicated multiple times. Beyond the direct effect, the mediation results through social presence were statistically significant at the 5% level. ...
... Finally, the results may be attributed to low statistical power, as the effect was not statistically significant in the studies with the smallest samples. In summary, following McShane et al. (2024), we concluded that the cumulative evidence of the studies substantiates that anthropomorphism can decrease review length, and when it does, it lowers review helpfulness. Future studies could further investigate the boundary conditions of the negative effect of anthropomorphism on review length. ...
Article
Full-text available
Companies are increasingly introducing conversational reviews —reviews solicited via chatbots—to gain customer feedback. However, little is known about how chatbot-mediated solicitation influences rating valence and review helpfulness compared to conventional online forms. Therefore, we conceptualized these review solicitation media on the continuum of anthropomorphism and investigated how various levels of anthropomorphism affect rating valence and review helpfulness, showing that more anthropomorphic media lead to more positive and less helpful reviews. We found that moderate levels of anthropomorphism lead to increased interaction enjoyment, and high levels increase social presence, thus inflating the rating valence and decreasing review helpfulness. Further, the effect of anthropomorphism remains robust across review solicitors’ salience (sellers vs. platforms) and expressed emotionality in conversations. Our study is among the first to investigate chatbots as a new form of technology to solicit online reviews, providing insights to inform various stakeholders of the advantages, drawbacks, and potential ethical concerns of anthropomorphic technology in customer feedback solicitation.
... To create a comprehensive study, we explored various hypotheses, including those predicting no significant differences in users' perceptions between AI and human influencers. This approach aligns with McShane et al. [78], emphasizing transparent result presentation for generalized conclusions based on cumulative evidence. Our goal was to provide insightful results, acting as a benchmark for past and future research comparisons. ...
... Also, the study addresses the ongoing controversy on the use and effectiveness of AI in influencer marketing [90], as well as Huang and Rust's [91] question about how marketers and AI should collaborate to better resonate with customers' needs and preferences. Our study offers empirical support for the CASA paradigm, at the same time stimulating the discussion among prominent theories examining people's reactions to emerging technology interfaces and artificial intelligence (speciesism [77], algorithmic aversion [78], Uncanny Valley [32]). Although a more in-depth examination of the phenomenon is required, our findings provide evidence that when anthropomorphism levels are high, without any visual cues that might evoke uncanny feelings of "almost human, but not quite", users' evaluations are not biased by speciesism or algorithm aversion, and users do not develop an aversion or hostility towards non-humans. ...
... Additionally, the healthcare literature that employs statistical testing is also plagued by the lack of control for underlying assumptions and the incorrect combination of the divergence-theoretic neo-Fisherian and decision-theoretic Neyman-Pearson-Wald approaches (Greenland, 2023). Given that the types of errors and their implications have been thoroughly examined already, this manuscript is specifically designed to discuss new alternative methods, drawing on the context provided in earlier works (Greenland et al., 2016; McShane et al., 2023). ...
... In this regard, we suggest that the most suitable denomination is "the least conditional approach" since this still relies on the authors' overall ability to select the most compatible hypotheses. Inference can still exist but only in the degree of consistency over time (COT) of various studies, and it is essential to understand that the p-value (or S-value) for the null or any other hypothesis does not play a particular role in it (McShane et al., 2023). For example, if in 10 different studies, conducted to the best of our abilities, we obtain HR ~ 4 and a two-sided null p ~ 0.10 (i.e., S ~ 3) each time, the most plausible hypothesis is HR = 4 and not the null hypothesis HR = 1 (since p- and S-values are descriptive statistics only). ...
Preprint
Full-text available
Statistical testing in public health is a controversial and often misunderstood topic. Despite decades of efforts by organizations, associations, and renowned international experts, fallacies of nullism, magnitude, and significance dichotomy are still extremely widespread. Nevertheless, our work highlights and addresses another common interpretive and cognitive error: the fallacy of surprise, understood as the mistaken habit of primarily considering and seeking findings with a low p-value (or a high s-value). Indeed, there are hypotheses (e.g., the efficacy of a drug) for which a high p-value is an optimal and desirable outcome. Therefore, this manuscript proposes a method to address the above situations based on comparing the statistical result with multiple hypotheses rather than just the null. Additionally, the concept of interval hypothesis is formalized based on costs, risks, and benefits known a priori. The objective is to assess the consistency of the data with various interval hypotheses of relevance simultaneously (including study limitations), in order to have a comprehensive understanding of the result reliability in relation to the research purposes. In this regard, the incompatibility graph (or surprisal graph) is introduced to make the reading of data simple and intuitive. As a general rule, unless meta-analyses with systematic review or other solid evidence are considered, these methods aim at providing a non-inferential and non-decisional descriptive overview of the scientific scenario.
... Alongside this, degrees of compatibility that appear markedly different could be highly compatible with each other. As shown by McShane et al., an original study with P = 0.005 and a replication study with P = 0.194 were highly compatible with one another in the sense that the P-value of the chosen comparison test, assuming no difference between them, was P = 0.289 [15]. Therefore, the difference between "statistically significant" and "statistically not significant" would be "statistically not significant" at the 0.05 level [5]. ...
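That comparison can be reproduced approximately in a few lines, under the simplifying assumption that the original and replication studies have equal standard errors; the published figure of 0.289 may rest on slightly different inputs:

```python
# Back out z-statistics from the two reported two-sided p-values and test the
# difference between the estimates, assuming equal standard errors (assumption).
from scipy import stats

p_orig, p_rep = 0.005, 0.194
z_orig = stats.norm.isf(p_orig / 2)    # ~2.81
z_rep = stats.norm.isf(p_rep / 2)      # ~1.30

z_diff = (z_orig - z_rep) / 2 ** 0.5   # difference of two equal-SE estimates
p_diff = 2 * stats.norm.sf(z_diff)
print(f"p for the difference between the two studies ~ {p_diff:.2f}")   # ~0.29
```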
... In addition to this, the compact formulation of multiple intervals can provide a much more complete and clearer overview than that described by a traditional confidence/compatibility interval without excessively burdening the reading, i.e., remaining suitable to be used in summary sections such as the abstract. Although the problems related to statistical testing are numerous and go beyond the scope of this manuscript (e.g., arbitrary multiple comparisons adjustments, p-hacking, statistical power misconceptions, and publication bias), the interpretation of test results is fundamental or integral to each of these [1-18, 22]. Surprisal intervals, in conjunction with surprisals, can provide great assistance to the scientific community in framing research problems, especially in the field of public health where errors regarding statistical significance are as frequent as they are dangerous. ...
Article
Full-text available
PEER-REVIEWED SOURCE: https://revstat.ine.pt/index.php/REVSTAT/article/view/669
Misuse of statistical significance continues to be prevalent in science. The absence of intuitive explanations of this concept often leads researchers to incorrect conclusions. For this reason, some statisticians suggest adopting S-values (surprisals) instead of P-values, as they relate the statistical relevance of an event to the number of consecutive heads when flipping an unbiased coin. This paper introduces the concept of surprisal intervals (S-intervals) as extensions of confidence/compatibility intervals. The proposed approach imposes the assessment of outcomes in terms of more and less surprising than some values, instead of statistically significant and statistically non-significant. Moreover, a novel methodology for presenting multiple consecutive S-intervals (or compatibility intervals as well) in order to evaluate the variation in surprise (or compatibility) with various target hypotheses is discussed.
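For readers unfamiliar with surprisals, a minimal sketch of the S-value transformation the abstract refers to; the surprisal-interval construction itself is defined in the cited paper and is not reproduced here:

```python
# S-value (surprisal): s = -log2(p) re-expresses a p-value as the number of
# consecutive heads from a fair coin that would be roughly as surprising.
import math

for p in (0.25, 0.05, 0.005):
    s = -math.log2(p)
    print(f"p = {p:<5}  ->  S = {s:.1f} bits  (about {s:.0f} heads in a row)")
```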
... Three years later, 800 scientists signed a petition to propose the definitive abandonment of the originally problematic and now completely distorted concept of statistical significance. Such efforts continue to this day (McShane et al., 2023). Nevertheless, the following errors remain extremely common (Greenland et al., 2016; Amrhein & Greenland, 2018): i) Dichotomania: The results are divided into (statistically) significant and non-significant based on an arbitrary threshold. ...
... Nevertheless, with due reservations, it is still possible to largely adhere to this fashion while ensuring interpretative correctness. Specifically, the P-value can be adopted as a continuous measure of the incompatibility (significance) of the data with the statistical t-assumption in the best-case scenario (McShane et al., 2023). Possible ranges of statistical incompatibility (significance) are shown in Table 3c, attempting to remain consistent with Tables 3a and 3b. ...
Preprint
Full-text available
Misuses and misconceptions about statistical testing are widespread in the field of public health. Specifically, the dichotomous use of the P-value (e.g., deemed significant if P<.05 and non-significant if P>.05), coupled with i) nullism (an obsession with the null hypothesis over other hypotheses), ii) failure to validate the statistical model adopted, iii) failure to distinguish between significance and effect size, and iv) failure to distinguish between statistical and empirical levels, creates an extremely fertile ground for overestimating the level of evidence found and drawing scientifically unfounded or incorrect conclusions. For these reasons, widely acknowledged and discussed in statistical literature, this article proposes a framework that aims to both help the reader understand the epistemological boundaries of the statistical approach and provide a structured workflow for conducting a statistical analysis capable of appropriately informing public health decisions. In this regard, novel concepts of multiple compatibility intervals and multiple surprisal intervals are discussed in detail through straightforward examples.
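A small illustration of the multiple-compatibility-interval idea described in this abstract, using a hypothetical estimate and standard error rather than an example from the manuscript itself:

```python
# Show the same (hypothetical) estimate at several compatibility levels instead
# of a single 95% interval read as a binary verdict.
from scipy import stats

estimate, se = 1.8, 0.9                      # hypothetical estimate and SE
for level in (0.50, 0.80, 0.95, 0.99):
    half = stats.norm.isf((1 - level) / 2) * se
    print(f"{level:.0%} compatibility interval: "
          f"[{estimate - half:.2f}, {estimate + half:.2f}]")
```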
... Alongside this, degrees of compatibility that appear markedly different could be highly compatible with each other. As shown by McShane et al., an original study with P = 0.005 and a replication study with P = 0.194 were highly compatible with one another in the sense that the P-value, assuming no difference between them, was P = 0.289 [14]. Therefore, the difference between "statistically significant" and "statistically not significant" would be "statistically not significant" at the 0.05 level [4]. ...
... In addition to this, the compact formulation of multiple intervals can provide a much more complete and clearer overview than that described by a traditional confidence/compatibility interval without excessively burdening the reading, i.e., remaining suitable to be used in summary sections such as the abstract. Although the problems related to statistical testing are numerous and go beyond the scope of this manuscript (e.g., arbitrary multiple comparisons adjustments, p-hacking, statistical power misconceptions, and publication bias), the interpretation of test results is fundamental or integral to each of these [1-17, 21]. Surprisal intervals, in conjunction with surprisals, can provide great assistance to the scientific community in framing research problems, especially in the field of public health where errors regarding statistical significance are as frequent as they are dangerous. ...
Preprint
Despite decades of warnings, misuse and misinterpretation of statistical significance continue to be prevalent in science. The absence of simple and intuitive explanations of this concept often leads researchers to incorrect or misleading conclusions that can have severe consequences, especially in fields like medicine and public health. For this reason, some statisticians suggest adopting S-values (surprisals) instead of P-values, as they relate the statistical relevance of an event to the number of consecutive heads when flipping an unbiased coin. Such a comparison with a phenomenon that we encounter in our daily lives makes S-values simpler to understand than P-values and reduces the likelihood of making overstatements. However, to the best of the author’s knowledge, there is currently no natural extension of confidence/compatibility intervals for statistical surprise. This asymmetry forces researchers to remain bound to the significance threshold α, which is necessary to define said intervals through the 100(1-α)% relationship. To address this issue, this paper introduces the concept of surprisal intervals, explaining their use through straightforward examples. Surprisal intervals enable the definitive abandonment of inferential evaluation of results and inherently prevent the dichotomous approach in hypothesis testing. Indeed, the proposed methodologies impose the assessment of the outcomes in terms of more and less surprising than some fixed values instead of significant and non-significant, thus making the notion of degree of surprise manifest and ineliminable. Based on the above considerations, the use of surprisal intervals is highly recommended in future scientific investigations.
... Neyman and Pearson (1928, p. 232) cautioned that significance "tests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict" (see also Bolles, 1962, p. 645; Boring, 1919, pp. 337-338; Chow, 1998, p. 169; Cox, 1958, p. 357; Hager, 2013, p. 261; Haig, 2018, p. 199; Lykken, 1968, p. 158; McShane et al., 2023; Meehl, 1978, p. 824; Meehl, 1997, p. 401; Szollosi & Donkin, 2021, p. 5). P-hacking is most problematic for those who ignore this advice and rely on p values as the sole arbiters of scientific decisions rather than as mere steppingstones on the way to making substantive theoretical inferences during a fallible process of inference to the best explanation (Haig, 2009; Mackonis, 2013). ...
Article
Full-text available
The inflation of Type I error rates is thought to be one of the causes of the replication crisis. Questionable research practices such as p-hacking are thought to inflate Type I error rates above their nominal level, leading to unexpectedly high levels of false positives in the literature and, consequently, unexpectedly low replication rates. In this article, I offer an alternative view. I argue that questionable and other research practices do not usually inflate relevant Type I error rates. I begin by introducing the concept of Type I error rates and distinguishing between statistical errors and theoretical errors. I then illustrate my argument with respect to model misspecification, multiple testing, selective inference, forking paths, exploratory analyses, p-hacking, optional stopping, double dipping, and HARKing. In each case, I demonstrate that relevant Type I error rates are not usually inflated above their nominal level, and in the rare cases that they are, the inflation is easily identified and resolved. I conclude that the replication crisis may be explained, at least in part, by researchers’ misinterpretation of statistical errors and their underestimation of theoretical errors.
... on a computer, 0.68% on a tablet, 0.19% on a phone, and 0.18% on another device). Following the statistical reporting guidelines in McShane et al. (2024), we plotted point estimates along with 95% confidence intervals, and interpreted statistically non-significant effects only when they were estimated precisely. Note that as the most common type of second screening is show-unrelated (Table 4), the standard errors are smaller than those for the estimated effects of show-related communication or information search. ...
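A sketch of the kind of estimate-plus-interval display this passage describes; the effect labels echo the study's conditions, but the numbers are hypothetical placeholders:

```python
# Coefficient-style plot: point estimates with 95% confidence intervals and a
# zero reference line, instead of a table of significance stars.
# The estimates and standard errors below are hypothetical.
import matplotlib.pyplot as plt
import numpy as np

labels = ["show-related", "show-unrelated", "information search"]
est = np.array([0.08, -0.05, 0.02])          # hypothetical effects
se = np.array([0.02, 0.01, 0.03])            # hypothetical standard errors

plt.errorbar(est, range(len(est)), xerr=1.96 * se, fmt="o", capsize=3)
plt.axvline(0, linestyle="--", linewidth=1)
plt.yticks(range(len(est)), labels)
plt.xlabel("estimated effect (95% CI)")
plt.tight_layout()
plt.show()
```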
Article
Full-text available
This paper examines the effect of second screening, the common practice of using another digital device while watching a television show, on repeat show viewing. We leveraged large-scale individual-level data from mobile diaries of 1,702 US TV viewers on 2,755 prime time shows. We used causal forest analysis for estimation, focusing on the moderating role of viewing preferences and show loyalty, and captured heterogeneity in viewer preferences using latent-class segmentation. We found that overall, show-related second screening has a positive effect on the attitude toward the show , as well as on actual repeat viewing. Show-unrelated second screening diminishes the viewer’s attitude. These effects are especially pronounced in the heavy viewer segment and among infrequent show viewers. Interestingly, our analysis did not provide evidence that second screening harms actual repeat viewing, countering potential concerns of negative distraction effects.
... Finally, the sustainability-related challenges HEIs encounter are complex and may necessitate deeper qualitative perspectives for interpretation, judgment, subjectivity, and learning from data (McShane et al., 2023). Strict theory-based hypothesis testing might be limiting our ability to obtain such insights, whereas an empirics-first approach, which focuses on learning from data and offering insights without strict theoretical restrictions, is akin to increasingly popular analysis approaches such as Qualitative Comparative Analysis or fsQCA (Pappas and Woodside, 2021; Fainshmidt et al., 2020). ...
Article
Full-text available
While previous research focused on investigating students’ perceptions, few studies have analyzed students’ future-oriented normative sustainability expectations from their Higher Educational Institutions (HEIs) in various cultural contexts. The goal of this study is to (1) identify business students’ sustainability transformation expectations from their HEIs, (2) uncover potential differences in expectations across cultural environments, and (3) explain how students’ sustainability expectations impact their behaviors towards HEIs. A mixed qualitative quantitative research design using a semi-standardized questionnaire based on a sample of 239 business students from the USA and Germany was applied. Sustainability topics at HEIs are derived both from a literature review and through interviews and were categorized using content analysis. Data for the study was collected from business students in Bachelor programs at two state universities in the USA and one public university in Germany and the Kano analysis was utilized to examine students’ sustainability expectations. Our analysis uncovered 19 distinct topic areas of sustainability at HEIs. Across both countries, students considered the integration of sustainability in production and consumption, as well as gender equality and inclusion, as basic requirements for future sustainability transformations. Other attributes were evaluated as indifferent. Students from the USA considered staff and faculty development opportunities or institutional support as performance attributes, while students from Germany evaluated them as indifferent. Country variations in students’ expectations of key sustainability attributes from their HEIs are significantly influenced by their level of involvement in sustainability. Finally, students’ expectations significantly impact their behavioral intentions. We provide managerial implications suggesting a tailored focus on sustainability attributes based on Kano categories and the country context. Furthermore, we highlight the need for further research, including replication studies in diverse cultural settings using longitudinal study designs.
... [2] However, such a practice is entirely unfounded and contradicts consolidated evidence on the topic. [1][2][3][4][5][6] There are two main, mutually exclusive approaches within the frequentist scenario: the neo-Fisherian and the Wald-Neyman-Pearson (WNP) ones. [7] And, paradoxically, despite the historical diatribe between these authors and the extreme philosophical and epistemological differences, such approaches are erroneously mixed in much of today's research. ...
Article
Full-text available
Background: Despite the efforts of leading statistical authorities and experts worldwide, misuse of statistical significance remains a common, dangerous practice in public health research. There is an urgent need to quantify this phenomenon. Methods: 200 studies were randomly selected within the PubMed database. An evaluation scale for the interpretation and presentation of statistical results (SRPS) was adopted. The maximum achievable score was 4 points. Abstracts (A) and full texts (FT) were compared to highlight any differences in presentation. The Wilcoxon signed-rank test was employed in this scope. Results: All studies failed to adopt P-values as continuous measures of compatibility between the data and the target hypothesis as assessed by the chosen test. The vast majority did not provide information on the model specification. However, in most cases, all findings were reported in full within the manuscripts. The Wilcoxon signed-rank test showed a marked incompatibility of the null hypothesis of zero difference between A and FT scores with the data obtained in the survey: null P < .001 (as assessed by the model), r = 0.87 (standardized effect size). Additionally, the score difference (207.5 points for A vs. 441.5 points for FT) indicates a scenario consistent with a substantial disparity in the completeness of the outcomes reporting. Conclusion: These findings align with the hypothesis of widespread and severe shortcomings in the use and interpretation of statistical significance within public health research during 2023. Therefore, it is essential for academic journals to compulsorily demand higher scientific quality standards. The suggestions provided in this study could be helpful for this purpose.
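The paired comparison this abstract describes can be illustrated with scipy's Wilcoxon signed-rank test; the scores below are simulated stand-ins on the 0-4 scale, since the survey data themselves are not reproduced here:

```python
# Hypothetical paired abstract (A) vs. full-text (FT) reporting scores, 0-4 scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
ft_scores = rng.integers(1, 5, size=200)                             # simulated FT scores
a_scores = np.clip(ft_scores - rng.integers(0, 3, size=200), 0, 4)   # abstracts score lower

res = stats.wilcoxon(a_scores, ft_scores)          # paired signed-rank test
print(f"Wilcoxon statistic = {res.statistic:.0f}, two-sided p = {res.pvalue:.1e}")
```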
... While we largely agree with this diagnosis and appreciate the emphasis OSC put on cumulative evidence rather than single replication projects, we are also concerned with some unintended consequences emerging from the general discussion and media coverage in the wake of RPP. Despite the OSC's precautionary efforts and the extensive literature on common statistical misconceptions and misperceptions (reviewed for example in Greenland et al., 2016; Wasserstein & Lazar, 2016; Greenland, 2017, 2019; Wasserstein et al., 2019; Amrhein et al., 2019; McShane et al., 2019; McShane et al., 2024), discussions of the RPP still exhibit a lack of conceptual clarity. An example is the common error of interpreting the RPP results as suggesting that all or most of the original studies were false positives, without allowing for false-negative replications. ...
Article
Full-text available
The current controversy surrounding research replication in biomedical and psychosocial sciences often overlooks the uncertainties surrounding both the original and replication studies. Overemphasizing single attempts as definitive replication successes or failures, as exemplified by media coverage of the landmark Reproducibility Project: Psychology (RPP), fosters misleading dichotomies and erodes public trust. To avoid such unintended consequences, science communicators should more clearly articulate statistical variation and other uncertainty sources in replication, while emphasizing the cumulative nature of science in general and replication in particular.
... Conducting multiverse analyses requires researchers to embrace potentially inconclusive results if confidence intervals are wide or Bayes factors are uninformative. Such a step also implies abandoning the dichotomy of statistically significant versus not significant and, instead, interpreting the p-value as a continuous metric (McShane et al., 2023). As standard errors understate the overall uncertainty of results ...
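A toy multiverse in this spirit, with simulated data and two hypothetical analytical choices (covariate inclusion and outlier trimming); the point is to report every estimate and interval rather than a single verdict:

```python
# Run the same focal regression under several defensible choices and collect
# all estimates with their confidence intervals (simulated data throughout).
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
covariate = rng.normal(size=n)
y = 0.3 * x + 0.5 * covariate + rng.normal(size=n)

for add_covariate, trim_outliers in itertools.product([False, True], repeat=2):
    keep = np.abs(y) < 2.5 if trim_outliers else np.full(n, True)
    X = np.column_stack([x] + ([covariate] if add_covariate else []))[keep]
    fit = sm.OLS(y[keep], sm.add_constant(X)).fit()
    b, (lo, hi) = fit.params[1], fit.conf_int()[1]        # coefficient on x
    print(f"covariate={add_covariate!s:<5} trim={trim_outliers!s:<5} "
          f"b = {b:.2f} [{lo:.2f}, {hi:.2f}]")
```

Each row is one defensible specification; the spread across rows is itself part of the evidence.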
Article
Scientific research demands robust findings, yet variability in results persists due to researchers' decisions in data analysis. Despite strict adherence to state-of the-art methodological norms, research results can vary when analyzing the same data. This article aims to explore this variability by examining the impact of researchers' analytical decisions when using different approaches to structural equation modeling (SEM), a widely used method in innovation management to estimate cause–effect relationships between constructs and their indicator variables. For this purpose, we invited SEM experts to estimate a model on absorptive capacity's impact on organizational innovation and performance using different SEM estimators. The results show considerable variability in effect sizes and significance levels, depending on the researchers' analytical choices. Our research underscores the necessity of transparent analytical decisions, urging researchers to acknowledge their results' uncertainty, to implement robustness checks, and to document the results from different analytical workflows. Based on our findings, we provide recommendations and guidelines on how to address results variability. Our findings, conclusions, and recommendations aim to enhance research validity and reproducibility in innovation management, providing actionable and valuable insights for improved future research practices that lead to solid practical recommendations.
... Since the original formulation by Sir Ronald Fisher in the early 1920s, the concept of statistical significance has been subject to serious misinterpretations. Despite more than 100 years having passed, these criticalities remain as vivid today as they were back then, if not more so (Wasserstein & Lazar, 2016; Gelman, 2018; Amrhein et al., 2019a; Greenland et al., 2022; McShane et al., 2023; Mansournia & Nazemipour, 2024). Given that the misuse of statistical testing in public health can lead to highly dangerous outcomes such as the approval of ineffective treatments or the rejection of effective ones, in this brief letter, we present a series of examples aimed at definitively dispelling some of the most common and erroneous beliefs about statistical significance. ...
Preprint
Full-text available
Since the original formulation by Sir Ronald Fisher in the early 1920s, the concept of statistical significance has been subject to serious misinterpretations. Despite more than 100 years having passed, these criticalities remain as vivid today as they were back then, if not more so. Given that the misuse of statistical testing in public health can lead to highly dangerous outcomes such as the approval of ineffective treatments or the rejection of effective ones, in this brief letter, we present a series of examples aimed at definitively dispelling some of the most common and erroneous beliefs about statistical significance.
... Neyman and Pearson (1928, p. 232) cautioned that significance "tests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict" (see also Bolles, 1962, p. 645; Boring, 1919, pp. 337-338; Chow, 1998, p. 169; Cox, 1958, p. 357; Hager, 2013, p. 261; Haig, 2018, p. 199; Lykken, 1968, p. 158; McShane et al., 2023; Meehl, 1978, p. 824; Meehl, 1997, p. 401; Szollosi & Donkin, 2021, p. 5). P-hacking is most problematic for those who ignore this advice and rely on p values as the sole arbiters of scientific decisions rather than as mere steppingstones on the way to making substantive theoretical inferences during a fallible process of inference to the best explanation (Haig, 2009; Mackonis, 2013). ...
Preprint
The inflation of Type I error rates is thought to be one of the causes of the replication crisis. Questionable research practices such as p-hacking are thought to inflate Type I error rates above their nominal level, leading to unexpectedly high levels of false positives in the literature and, consequently, unexpectedly low replication rates. In this article, I offer an alternative view. I argue that questionable and other research practices do not usually inflate relevant Type I error rates. I begin with an introduction to Type I error rates that distinguishes them from theoretical errors. I then illustrate my argument with respect to model misspecification, multiple testing, selective inference, forking paths, exploratory analyses, p-hacking, optional stopping, double dipping, and HARKing. In each case, I demonstrate that relevant Type I error rates are not usually inflated above their nominal level, and in the rare cases that they are, the inflation is easily identified and resolved. I conclude that the replication crisis may be explained, at least in part, by researchers’ misinterpretation of statistical errors and their underestimation of theoretical errors.
Article
Purpose This study aims to analyse the influence of football fans' involvement on sponsor brand equity and their purchase intention toward the sponsoring brand. To achieve this, we specified a structural model examining the relationships between engagement, brand equity and fans’ purchase intentions. Design/methodology/approach The data for this study were collected using a structured questionnaire. Three football teams from the city of Quito (Ecuador) that compete in the first division of Ecuadorian professional football were considered. For data collection, both personal interviews and a web link were used. The personal interviews were carried out directly with the fans of the three teams in the vicinity of the stadiums, prior to matches of the Ecuadorian League. Findings The study concluded that a greater involvement of fans with a football club positively influences both the valuation of the sponsoring brand and the intention to purchase the product and/or service of the sponsoring brand. Practical implications This work contributes to the literature on brand equity. On the one hand, from the companies’ perspective, it is important for brand managers to realise that football fans constitute an especially important segment of the public for strengthening the brand and even for purchasing the sponsoring brand’s products. On the other hand, from the point of view of the clubs, it should be borne in mind that the involvement of the fans with the clubs constitutes a major factor in strengthening the sponsoring brands. Originality/value Most of the research in the literature has studied purchase intention towards the club brand but not towards the sponsoring brand. The research, which is applied to the football industry, conceptually extends the customer-based brand equity (CBBE) model by including the perspective of football fans’ involvement with their clubs.
Article
Purpose Drawing on value cocreation, this study examines health-care customers’ perceptions of patient-centered care (PCC) in hospital and online primary care settings. This study aims to address how are the key principles of PCC related, how the relationships between key PCC principles and outcomes (subjective well-being and service satisfaction) vary depending on the channel providing the care (hospital/online primary care) and what differences are placed on the involvement of family and friends in these different settings by health-care customers. Design/methodology/approach This study comprises four samples of health-care customers (Sample 1 n = 272, Sample 2 n = 278, Sample 3 n = 275 and Sample 4 n = 297) totaling 1,122 respondents. This study models four key principles of PCC: service providers respecting health-care customers’ values, needs and preferences; collaborative resources of the multi-disciplinary care team; health-care customers actively collaborating with their own resources; and health-care customers involving family and friends, explicating which principles of PCC have positive effects on outcomes: subjective well-being and service satisfaction. Findings Findings confirm that health-care customers want to feel respected by service providers, use their own resources to actively collaborate in their care and have multi-disciplinary teams coordinating and integrating their care. However, contrary to prior findings, for online primary care, service providers respecting customers’ values needs and preferences do not translate into health-care customers actively collaborating with their own resources. Further, involving family and friends has mixed results for online primary care. In that setting, this study finds that involving family and friends only positively impacts service satisfaction, when care is provided using video and not voice only. Social implications By identifying which PCC principles influence the health-care customer experience most, this research shows policymakers where they should invest resources to achieve beneficial outcomes for health-care customers, service providers and society, thus advancing current thinking and practice. Originality/value This research provides a health-care customer perspective on PCC and shows how the resources of the health-care system can activate the health-care customer’s own resources. It further shows the role of technology in online care, where it alters how care is experienced by the health-care customer.
Article
The ubiquitous presence of endogenous regressors presents a significant challenge when drawing causal inferences using observational data. The classical econometric method used to handle regressor endogeneity requires instrumental variables (IVs) that must satisfy the stringent condition of exclusion restriction, rendering it unfeasible in many settings. Herein, the authors propose a new IV-free method that uses copulas to address the endogeneity problem. Existing copula correction methods require nonnormal endogenous regressors: Normally or nearly normally distributed endogenous regressors cause model nonidentification or significant finite-sample bias. Furthermore, existing copula control function methods presume the independence of exogenous regressors and endogenous regressors. The authors’ generalized two-stage copula endogeneity-correction (2sCOPE) method simultaneously relaxes the two key identification requirements while maintaining the Gaussian copula regressor–error dependence structure. They prove that under the Gaussian copula dependence structure, 2sCOPE yields consistent causal-effect estimates with correlated endogenous and exogenous regressors as well as normally distributed endogenous regressors. In addition to relaxing the identification requirements, 2sCOPE has superior finite-sample performance and addresses the significant finite-sample bias problem due to insufficient regressor nonnormality. Moreover, 2sCOPE employs generated regressors derived from existing regressors to control for endogeneity, and can thus considerably increase the ease and broaden the applicability of IV-free methods for handling regressor endogeneity. The authors further demonstrate 2sCOPE's performance using simulation studies and illustrate its use in an empirical application.
Article
Purpose This paper aims to investigate how product upgrades influence consumers’ hedonic responses to currently owned products, focusing on the underlying attentional mechanism. Design/methodology/approach Six experiments were conducted, including one pilot study and five main studies, employing various stimuli and methodologies. These experiments used longitudinal designs, manipulated upgrade awareness and measured hedonic decline over time. Mediation and moderation analyses were performed to test the proposed attentional mechanism. Findings The studies demonstrate that awareness of product upgrades induces consumers to experience a faster hedonic decline with their current possessions. This effect occurs because upgrades prompt consumers to shift their attention away from the currently owned product. The research provides both mediation-based and moderation-based evidence for this attentional mechanism. Research limitations/implications The study primarily focused on product upgrades, and future research could explore this effect in nonproduct domains and investigate potential boundary conditions. Practical implications The findings have implications for both consumers and companies in managing product enjoyment and upgrade cycles. Consumers can make more informed decisions about upgrades, while companies can develop strategies to maintain customer satisfaction with current products. Originality/value This work offers a novel perspective on the influence of upgrades on consumer behavior by introducing an attention-based account of hedonic adaptation and the consequent upgrade phenomenon, contrasting with previous research that relied on justifiability or contrast effects.
Article
Full-text available
With this article we hope to achieve two goals. The first is to encourage consumer behavioral researchers to consider Bayesian methods for analyzing experimental and survey data. As such, we provide what we hope will be a persuasive set of arguments for trying Bayes. The second goal is to survey the different uses to which the Bayesian posterior distribution can be put. We organize this survey in terms of loss functions and propose that such loss functions can be chosen so as to simply describe a consumer behavioral phenomenon, to highlight a managerial implication, or to emphasize a theoretical contribution.
Article
This study empirically investigates whether and to what extent suppliers’ decisions to start selling directly to end-consumers provoke reactions in the ordering strategy of downstream channel partners, such as independent multibrand retailers. Using a multimethod approach that combines transactional data, survey data, and a scenario-based experiment, the authors demonstrate that retailers tend to exit these relationships after a direct channel introduction, as exhibited by their strategic decisions to order fewer distinct SKUs, accompanied by higher wholesale prices per unit. On average, retailers decrease the number of distinct SKUs ordered by 15 (or 18.75%) and pay an average wholesale price that is €.79 (or 20.84%) higher. Yet the responses also differ across retailers, reflecting moderating impacts of retailer power, expertise, and relationship quality. Retailer power emerges as a robust moderating factor, with more powerful retailers indicating a lower propensity to exit the relationship. Expertise and relationship quality have more nuanced influences on retailers’ ordering strategies. The multimethod approach makes it possible to reveal the underlying mechanisms of these moderating effects, such that both rational (coercive power and switching costs) and emotional (conflict and confidence) considerations are in play.
Article
Looking back at 50 years of Journal of Consumer Research methods and interviewing some of the field’s most respected methodologists, this article seeks to craft a core set of best practices for scholars in consumer research. From perennial issues like conceptual validity to emerging issues like data integrity and replicability, the advice offered by our experts can help scholars improve the way they approach their research questions, provide empirical evidence that instills confidence, use new tools to make research more inclusive or descriptive of the “real world,” and seek to become thought leaders.
Article
McShane et al.'s (2024) wide-ranging critique of null hypothesis significance testing provides a number of specific suggestions for improved practice in empirical research. This commentary amplifies several of these from the perspective of computational statistics—particularly nonparametrics, resampling/bootstrapping, and Bayesian methods—applied to common research problems. Throughout, the author emphasizes estimation (as opposed to testing) and uncertainty quantification through a comprehensive process of “curating” a variety of graphical and tabular evidence. Specifically, researchers should be encouraged to estimate the quantities that matter, with as few assumptions as possible, in multiple ways, then try to visualize it all, documenting their pathway from data to results for others to follow.
Article
Overconfidence in statistical results in medicine is fueled by improper practices and historical biases afflicting the concept of statistical significance. In particular, the dichotomization of significance (i.e., significant vs. not significant), the blending of Fisherian and Neyman-Pearson approaches, magnitude and nullification fallacies, and other fundamental misunderstandings distort the purpose of statistical investigations entirely, undermining their ability to inform public health decisions or other fields of science in general. For these reasons, the international statistical community has attempted to propose various alternatives or different interpretative modes. However, as of today, such misuses still prevail. In this regard, the present paper discusses the use of multiple confidence (or, more aptly, compatibility) intervals to address these issues at their core. Additionally, an extension of the concept of confidence interval, called the surprisal interval (S-interval), is proposed in the realm of statistical surprisal. The latter is based on comparing the statistical surprise to an easily interpretable phenomenon, such as obtaining S consecutive heads when flipping a fair coin. This allows for a complete departure from the notions of statistical significance and confidence, which carry with them longstanding misconceptions.
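As a rough illustration of the coin-flip reading described above, the sketch below assumes (our assumption for illustration, not necessarily the paper's exact construction) that an S-interval with surprisal S bits corresponds to a compatibility interval at level α = 2^(−S), so that roughly 4.3 bits recovers the conventional 95% interval; the point estimate and standard error are hypothetical.

```python
# A minimal sketch, not the paper's implementation: it assumes an S-interval
# with surprisal S bits corresponds to a compatibility interval at level
# alpha = 2**(-S), and uses a hypothetical estimate and standard error.
from scipy import stats

estimate = 1.2   # hypothetical point estimate (e.g., a mean difference)
se = 0.5         # hypothetical standard error

for s_bits in (3.0, 4.32, 6.0):           # ~4.32 bits recovers the usual 95% interval
    alpha = 2.0 ** (-s_bits)              # "as surprising as s_bits heads in a row"
    z = stats.norm.ppf(1 - alpha / 2)     # two-sided normal critical value
    lo, hi = estimate - z * se, estimate + z * se
    print(f"S = {s_bits:>4} bits (alpha = {alpha:.4f}): [{lo:.2f}, {hi:.2f}]")
```

Read this way, asking for a wider interval simply means demanding a larger surprisal before a parameter value is treated as incompatible with the data.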
Article
How does financial education lead to improved financial behavior and higher financial well-being? An influential Consumer Financial Protection Bureau model (CFPB 2015) proposes that the goal of financial education is to improve financial well-being and that financial education does so by increasing financial knowledge, which improves financial behavior, which improves financial well-being. We test links in the CFPB model, examining the differential roles of objective and subjective knowledge. We also test whether an analogous model might capture effects of physical health education on physical health knowledge, behavior, and well-being. We report a quasi-experiment comparing T2-T1 changes in financial and physical health knowledge, behavior, and well-being over a semester for students enrolled in a personal finance class, a personal health class, or neither. Our study reports the first causal estimates of flow from financial education to financial knowledge to financial behaviors to a validated measure of subjective financial well-being. Financial education caused large changes in both objective and subjective knowledge. Yet only subjective knowledge mediated the large effects of financial education on changes in downstream behaviors. We find weaker but similar results for physical health. Our findings suggest that financial education efforts should refocus to foster subjective knowledge and improved behavior.
Article
Full-text available
P-values have played a central role in the advancement of research in virtually all scientific fields; however, there has been significant controversy over their use. “The ASA president’s task force statement on statistical significance and replicability” has provided a solid basis for resolving the quarrel, but although the significance part is clearly dealt with, the replicability part raises further discussion. Given the clear statement regarding significance, in this article we take the validity of p-value use for statistical inference as given. We briefly review the literature on the relevant controversy in recent years and illustrate how already proposed approaches, or slight adaptations thereof, can be readily implemented to address both significance and reproducibility, adding credibility to empirical study findings. The definitions used for the notions of replicability and reproducibility are also clearly described. We argue that any p-value should be reported along with its corresponding s-value, followed by 100(1 − α)% confidence intervals and the rejection replication index.
Article
Full-text available
The use of statistical significance is under discussion. Many statisticians and researchers advocate for its retirement; conversely, other statisticians and researchers think that its retirement would damage science. There is room for improvement in the use of hypothesis testing and p-values in the biochemical sciences and omics. The selection of variables by statistical significance against hard cutoffs drives, and may bias, the biological interpretation of biochemical data. To obtain robust knowledge by comparing studies, it is essential to report all results thoroughly (both quantitative and categorical variables). Because of the large number of variables, the problems of selecting variables by statistical significance are amplified in omics studies.
Article
Full-text available
At the beginning of our research training, we learned about hypothesis testing, p-values, and statistical inference [...]
Article
Full-text available
Following a fundamental statement made in 2016 by the American Statistical Association and broad and consistent changes in data analysis and interpretation methodology in public health and other sciences, statistical significance/null hypothesis testing is being increasingly criticized and abandoned in the reporting and interpretation of the results of biomedical research. This shift in favor of a more comprehensive and non-dichotomous approach to the assessment of causal relationships may have a major impact on human health risk assessment. It is interesting to see, however, that authoritative opinions by the Supreme Court of the United States and European regulatory agencies have somehow anticipated this tide of criticism of statistical significance testing, thus providing additional support for its demise. Current methodological evidence further warrants abandoning this approach in both the biomedical and public law contexts, in favor of a more comprehensive and flexible method of assessing the effects of toxicological exposure on human and environmental health.
Article
Full-text available
It has long been argued that we need to consider much more than an observed point estimate and a p-value to understand statistical results. One of the most persistent misconceptions about p-values is that they are necessarily calculated assuming a null hypothesis of no effect is true. Instead, p-values can and should be calculated for multiple hypothesized values for the effect size. For example, a p-value function allows us to visualize results continuously by examining how the p-value varies as we move across possible effect sizes. For more focused discussions, a 95% confidence interval shows the subset of possible effect sizes that have p-values larger than 0.05 as calculated from the same data and the same background statistical assumptions. In this sense a confidence interval can be taken as showing the effect sizes that are most compatible with the data, given the assumptions, and thus may be better termed a compatibility interval. The question that should then be asked is whether any or all of the effect sizes within the interval are substantial enough to be of practical importance.
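The idea of scanning p-values across many hypothesized effect sizes, rather than only zero, can be sketched in a few lines. The numbers below are hypothetical, and a normal approximation with a known standard error is assumed purely for illustration.

```python
# A minimal sketch with hypothetical numbers: a p-value function for a mean
# difference, computed under a normal approximation with known standard error.
import numpy as np
from scipy import stats

estimate = 0.8   # hypothetical observed mean difference
se = 0.4         # hypothetical standard error

hypothesized = np.linspace(-1.0, 2.5, 8)   # candidate true effect sizes
z = (estimate - hypothesized) / se         # test statistic against each hypothesis
p = 2 * stats.norm.sf(np.abs(z))           # two-sided p-value for each hypothesis

for h, pv in zip(hypothesized, p):
    marker = "<- inside the 95% interval" if pv > 0.05 else ""
    print(f"hypothesized effect = {h:+.2f}   p = {pv:.3f}   {marker}")

# Under these assumptions, the 95% confidence (compatibility) interval is
# exactly the set of hypothesized effects with p > 0.05:
lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
print(f"95% compatibility interval: [{lo:.2f}, {hi:.2f}]")
```

The printed interval collects exactly those hypothesized effects whose two-sided p-values exceed 0.05, which is why the abstract above suggests reading it as a compatibility interval rather than a confidence statement.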
Preprint
Full-text available
A vivid debate is ongoing in the scientific community about statistical malpractice and the related publication bias. No general consensus exists on the consequences, and this is reflected in the heterogeneous rules defined by scientific journals on the use and reporting of statistical inference. This paper aims to discuss how the debate is perceived by the agricultural economics community and the implications for our roles as researchers, contributors to the scientific publication process, and teachers. We start by summarizing the current state of the p-value debate and the replication crisis, as well as commonly applied statistical practices in our community. This is followed by the motivation, design, results, and discussion of a survey on statistical knowledge and practice among researchers in the agricultural economics community in Austria, Germany, and Switzerland. We conclude that, beyond short-term measures like changing the rules of reporting in publications, a cultural change regarding empirical scientific practices is needed that stretches across all our roles in the scientific process. Acceptance of scientific work should be based largely on theoretical and methodological rigor, with perceived relevance arising from the questions asked, the methodology employed, and the data used, not from the results generated. Revised and clear journal guidelines, the creation of resources for teaching and research, and public recognition of good practice are suggested measures to move forward.
Article
Full-text available
The common approach to meta‐analysis is overwhelmingly dominant in practice but suffers from a major limitation: it is suitable for analyzing only a single effect of interest. However, contemporary psychological research studies—and thus meta‐analyses of them—typically feature multiple dependent effects of interest. In this paper, we introduce novel meta‐analytic methodology that (i) accommodates an arbitrary number of effects—specifically, contrasts of means—and (ii) yields results in standard deviation units in order to adjust for differences in the measurement scales used for the dependent measure across studies. Importantly, when all studies follow the same two‐condition study design and interest centers on the simple contrast between the two conditions as measured on the Cohen’s d scale, our approach is equivalent to the common approach. Consequently, our approach generalizes the common approach to accommodate an arbitrary number of contrasts. As we illustrate and elaborate on across three extensive case studies, our approach has several advantages relative to the common approach. To facilitate the use of our approach, we provide a website that implements it.
Article
Full-text available
The likelihood of a risk factor achieving statistical significance is affected by the overall sample size as well as the distributions of both outcome and predictor variables. Closer attention to confidence intervals and visual displays, in the context of a prespecified determination of effects of interest, contributes to a more comprehensive understanding of the results of a data analysis.
Article
Full-text available
Statistical reporting of quantitative research data has been plagued by potential bias and reporting suppression due to a single numerical output: the p-value. While the p-value has genuine merit, building a pass-fail system around it (with the cutoff set at .05) has created a culture of researchers consigning their project's data to a filing cabinet if it does not yield "statistical significance" by this criterion. The editors of the International Journal of Exercise Science are following the American Statistical Association's call for statistical reform by adjusting our reporting guidelines to the following requirements: (1) make an intentional effort to move away from the statements "statistically significant" or "not significant"; (2) report all p-values in their raw, continuous form; (3) present measures of the magnitude of effect with all p-values; and (4) include either an a priori power analysis with relevant citations or post hoc power calculations to accompany p-values and measures of effect. The ultimate goal of this editorial is to join with other scholars to push the field toward transparency in reporting and critical, thoughtful evaluation of research.
Article
Full-text available
Background: Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and P-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review some simple methods to aid researchers in interpreting statistical outputs. These methods emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. Methods: We use the Shannon transform of the P-value p, also known as the binary surprisal or S-value s = -log2(p), to provide a measure of the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing. We also use tables or graphs of test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. Finally, we reinterpret P-values and interval estimates in unconditional terms, which describe compatibility of data with the entire set of analysis assumptions. We illustrate these methods with a reanalysis of data from an existing record-based cohort study. Conclusions: In line with other recent recommendations, we advise that teaching materials and research reports discuss P-values as measures of compatibility rather than significance, compute P-values for alternative hypotheses whenever they are computed for null hypotheses, and interpret interval estimates as showing values of high compatibility with data, rather than regions of confidence. Our recommendations emphasize cognitive devices for displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates. We believe these simple reforms are well worth the minor effort they require.
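The Shannon transform mentioned above is simple enough to compute by hand; the sketch below (hypothetical p-values, plain Python) just converts a few p-values into bits of information against the test hypothesis and phrases them as runs of heads from a fair coin.

```python
# A minimal sketch of the Shannon transform described above: the binary
# surprisal s = -log2(p), read as roughly s consecutive heads from a fair coin.
# The p-values are hypothetical examples.
import math

for p in (0.25, 0.05, 0.005, 0.0001):
    s = -math.log2(p)
    print(f"p = {p:<6}  ->  S-value ~ {s:4.1f} bits "
          f"(about as surprising as {round(s)} heads in a row)")
```

For instance, p = 0.05 carries about 4.3 bits of information against the tested hypothesis, roughly the surprise of four consecutive heads.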
Article
Full-text available
The American Society for Pharmacology and Experimental Therapeutics has revised the Instructions to Authors for Drug Metabolism and Disposition, Journal of Pharmacology and Experimental Therapeutics, and Molecular Pharmacology. These revisions relate to data analysis (including statistical analysis) and reporting but do not tell investigators how to design and perform their experiments. Their overall focus is on greater granularity in the description of what has been done and found. Key recommendations include the need to differentiate between preplanned, hypothesis-testing, and exploratory experiments or studies; explanations of whether key elements of study design, such as sample size and choice of specific statistical tests, had been specified before any data were obtained or were adapted thereafter; and explanations of whether any outliers (data points or entire experiments) were eliminated and when the rules for doing so had been defined. Variability should be described by S.D. or interquartile range, and precision should be described by confidence intervals; S.E. should not be used. P values should be used sparingly; in most cases, reporting differences or ratios (effect sizes) with their confidence intervals will be preferred. Depiction of data in figures should provide as much granularity as possible, e.g., by replacing bar graphs with scatter plots wherever feasible and violin or box-and-whisker plots when not. This editorial explains the revisions and the underlying scientific rationale. We believe that these revised guidelines will lead to less biased and more transparent reporting of research findings.
Article
Full-text available
Reflecting on common empirical concerns in quantitative entrepreneurship research, recent calls for improved rigor and reproducibility in social science research, and recent methodological developments, we discuss new opportunities for further enhancing rigor in quantitative entrepreneurship research. In addition to highlighting common key concerns of editors and reviewers, we review recent methodological guidelines in the social sciences that offer more in-depth discussions of particular empirical issues and approaches. We conclude by offering a set of best practice recommendations for further enhancing rigor in quantitative entrepreneurship research.
Article
Full-text available
In this commentary, I argue why we should stop engaging in null hypothesis statistical significance testing altogether. Artificial and misleading it may be, but we know how to play the p value threshold and null hypothesis-testing game. We feel secure; we love the certainty. The fly in the ointment is that the conventions have led to questionable research practices. Wasserstein, Schirm, & Lazar (Am Stat 73(sup1):1–19, 2019. https://doi.org/10.1080/00031305.2019.1583913) explain why, in their thought-provoking editorial introducing a special issue of The American Statistician: “As ‘statistical significance’ is used less, statistical thinking will be used more.” Perhaps we empirical researchers can together find a way to work ourselves out of the straitjacket that binds us.
Article
Full-text available
Whether or not "the foundations and the practice of statistics are in turmoil", it is wise to question methods whose misuse has been lamented for over a century. Perhaps the most widespread misuse of statistics is taking the crossing of some threshold as license for declaring "statistical significance" and for generalizing from a single study. Such generalized conclusions are often taken up by science communicators, media, and political stakeholders without recognition of their uncertainty. A major consequence is flip-flopping headlines such as 'chocolate is good for you' followed by 'chocolate is bad for you'. No wonder only about a third of over 2000 respondents in a survey of the British public said they would trust data from medical trials.
Article
Full-text available
eNeuro is moving forward with a new initiative asking authors to present their results with estimation statistics and not to rely solely on p values. In this editorial, I would like to introduce the concept of these new statistics. I first discuss my evaluation of the present situation and my own experience with using statistics to interpret results; I then propose a solution and describe how we will move forward in the journal. I have also included my own experience using these new statistics and provided a list of resources. This new initiative will not change what is already acceptable for statistics in the journal; it simply encourages the addition of estimation statistics.
Article
Mathematics is a limited component of solutions to real-world problems, as it expresses only what is expected to be true if all our assumptions are correct, including implicit assumptions that are omnipresent and often incorrect. Statistical methods are rife with implicit assumptions whose violation can be life-threatening when results from them are used to set policy. Among them are that there is human equipoise or unbiasedness in data generation, management, analysis, and reporting. These assumptions correspond to levels of cooperation, competence, neutrality, and integrity that are absent more often than we would like to believe. Given this harsh reality, we should ask what meaning, if any, we can assign to the P-values, “statistical significance” declarations, “confidence” intervals, and posterior probabilities that are used to decide what and how to present (or spin) discussions of analyzed data. By themselves, P-values and CIs do not test any hypothesis, nor do they measure the significance of results or the confidence we should have in them. The sense otherwise is an ongoing cultural error perpetuated by large segments of the statistical and research community via misleading terminology. So-called inferential statistics can only become contextually interpretable when derived explicitly from causal stories about the real data generator (such as randomization), and can only become reliable when those stories are based on valid and public documentation of the physical mechanisms that generated the data. Absent these assurances, traditional interpretations of statistical results become pernicious fictions that need to be replaced by far more circumspect descriptions of data and model relations.
Article
There is possibly no single term more misused, or idea more misunderstood, than probability values, or P values, in the scientific literature. The most commonly cited definition of the P value is the probability of obtaining results at least as extreme as those observed in a statistical hypothesis test (the plausibility of the data from a sample, i.e., your results), assuming the null hypothesis (the statistical hypothesis which proposes that no difference exists between 2 groups in a given series)¹. The concept has been attributed to the famous English mathematician and statistician Ronald Fisher, when he was confronted with the assertion of a well known British lady that she could detect whether the milk or the tea was poured first into her cup. Astounded at the idea that the lady could do so by taste alone, he devised an experiment in which he set up eight cups of tea for her to taste (half with the tea poured first, half with the milk poured first), served in random order, the famous ‘tea test’ (not to be confused with the t test derived by William Gosset under the pen name of Student). When she correctly identified all eight cups, he calculated that the probability of this accomplishment by chance would be 1 in 70 (approximately 1.43 per cent, or P ≈ 0.014)².
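For readers who want to verify the "1 in 70" figure, the short sketch below recomputes it by counting the C(8, 4) = 70 equally likely ways to choose the four milk-first cups, and again via Fisher's exact test on the perfectly classified 2x2 table (scipy is assumed to be available).

```python
# A quick check of the "1 in 70" figure: with eight cups, four of each kind,
# there are C(8, 4) = 70 equally likely ways to pick the four milk-first cups,
# so identifying all eight correctly by chance has probability 1/70 (~0.014).
import math
from scipy import stats

ways = math.comb(8, 4)
print(ways, 1 / ways)                      # 70  0.0142857...

# The same number arises from Fisher's exact test on the perfectly
# classified 2x2 table (a one-sided test of "all correct or better"):
_, p = stats.fisher_exact([[4, 0], [0, 4]], alternative="greater")
print(p)                                   # ~0.0142857
```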
Article
There are two distinct definitions of “P‐value” for evaluating a proposed hypothesis or model for the process generating an observed dataset. The original definition starts with a measure of the divergence of the dataset from what was expected under the model, such as a sum of squares or a deviance statistic. A P‐value is then the ordinal location of the measure in a reference distribution computed from the model and the data, and is treated as a unit‐scaled index of compatibility between the data and the model. In the other definition, a P‐value is a random variable on the unit interval whose realizations can be compared to a cutoff α to generate a decision rule with known error rates under the model and specific alternatives. It is commonly assumed that realizations of such decision P‐values always correspond to divergence P‐values. But this need not be so: Decision P‐values can violate intuitive single‐sample coherence criteria where divergence P‐values do not. It is thus argued that divergence and decision P‐values should be carefully distinguished in teaching, and that divergence P‐values are the relevant choice when the analysis goal is to summarize evidence rather than implement a decision rule.
Article
It is well known that the statistical analyses in health-science and medical journals are frequently misleading or even wrong. Despite many decades of reform efforts by hundreds of scientists and statisticians, attempts to fix the problem by avoiding obvious error and encouraging good practice have not altered this basic situation. Statistical teaching and reporting remain mired in damaging yet editorially enforced jargon of “significance”, “confidence”, and imbalanced focus on null (no-effect or “nil”) hypotheses, leading to flawed attempts to simplify descriptions of results in ordinary terms. A positive development amidst all this has been the introduction of interval estimates alongside or in place of significance tests and P-values, but intervals have been beset by similar misinterpretations. Attempts to remedy this situation by calling for replacement of traditional statistics with competitors (such as pure-likelihood or Bayesian methods) have had little impact. Thus, rather than ban or replace P-values or confidence intervals, we propose to replace traditional jargon with more accurate and modest ordinary-language labels that describe these statistics as measures of compatibility between data and hypotheses or models, which have long been in use in the statistical modeling literature. Such descriptions emphasize the full range of possibilities compatible with observations. Additionally, a simple transform of the P-value called the surprisal or S-value provides a sense of how much or how little information the data supply against those possibilities. We illustrate these reforms using some examples from a highly charged topic: trials of ivermectin treatment for Covid-19.
Article
Over the last decade, large-scale replication projects across the biomedical and social sciences have reported relatively low replication rates. In these large-scale replication projects, replication has typically been evaluated based on a single replication study of some original study and dichotomously as successful or failed. However, evaluations of replicability that are based on a single study and are dichotomous are inadequate, and evaluations of replicability should instead be based on multiple studies, be continuous, and be multi-faceted. Further, such evaluations are in fact possible due to two characteristics shared by many large-scale replication projects. In this article, we provide such an evaluation for two prominent large-scale replication projects, one which replicated a phenomenon from cognitive psychology and another which replicated 13 phenomena from social psychology and behavioral economics. Our results indicate a very high degree of replicability in the former and a medium to low degree of replicability in the latter. They also suggest an unidentified covariate in each, namely ocular dominance in the former and political ideology in the latter, that is theoretically pertinent. We conclude by discussing evaluations of replicability at large, recommendations for future large-scale replication projects, and design-based model generalization. Supplementary materials for this article are available online.
Article
The use of statistical significance and p-values has become a matter of substantial controversy in various fields using statistical methods. This has gone as far as some journals banning the use of indicators for statistical significance, or even any reports of p-values, and, in one case, any mention of confidence intervals. I discuss three of the issues that have led to these often-heated debates. First, I argue that in many cases, p-values and indicators of statistical significance do not answer the questions of primary interest. Such questions typically involve making (recommendations on) decisions under uncertainty. In that case, point estimates and measures of uncertainty in the form of confidence intervals or even better, Bayesian intervals, are often more informative summary statistics. In fact, in that case, the presence or absence of statistical significance is essentially irrelevant, and including them in the discussion may confuse the matter at hand. Second, I argue that there are also cases where testing null hypotheses is a natural goal and where p-values are reasonable and appropriate summary statistics. I conclude that banning them in general is counterproductive. Third, I discuss that the overemphasis in empirical work on statistical significance has led to abuse of p-values in the form of p-hacking and publication bias. The use of pre-analysis plans and replication studies, in combination with lowering the emphasis on statistical significance may help address these problems.
Article
Replication is an important contemporary issue in psychological research, and there is great interest in ways of assessing replicability, in particular, retrospectively via prior studies. The average power of a set of prior studies is a quantity that has attracted considerable attention for this purpose, and techniques to estimate this quantity via a meta-analytic approach have recently been proposed. In this article, we have two aims. First, we clarify the nature of average power and its implications for replicability. We explain that average power is not relevant to the replicability of actual prospective replication studies. Instead, it relates to efforts in the history of science to catalogue the power of prior studies. Second, we evaluate the statistical properties of point estimates and interval estimates of average power obtained via the meta-analytic approach. We find that point estimates of average power are too variable and inaccurate for use in application. We also find that the width of interval estimates of average power depends on the corresponding point estimates; consequently, the width of an interval estimate of average power cannot serve as an independent measure of the precision of the point estimate. Our findings resolve a seeming puzzle posed by three estimates of the average power of the power-posing literature obtained via the meta-analytic approach.
Article
BACKGROUND: In the light of the recent discussions about the statistical rigour of empirical research, including the interpretation and use of p-values and the importance of the theoretical underpinnings of population studies, the editorial board of Demographic Research has adopted dedicated guidance for authors. Its aim is to clarify our expectations and highlight good practice in these areas. Starting from Volume 42 (2020), authors will be encouraged to follow these guidelines.