## No full-text available

To read the full text of this research, you can request a copy directly from the author.

The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value's inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong. It also reviews the possible consequences of these improper understandings or representations of its meaning. Finally, it contrasts the P value with its Bayesian counterpart, the Bayes' factor, which has virtually all of the desirable properties of an evidential measure that the P value lacks, most notably interpretability. The most serious consequence of this array of P-value misconceptions is the false belief that the probability of a conclusion being in error can be calculated from the data in a single experiment without reference to external evidence or the plausibility of the underlying mechanism.
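The contrast drawn here between the P value and the Bayes factor can be made concrete. The sketch below is ours, not the paper's: it uses the classical lower bound exp(−z²/2) on the Bayes factor of a point null against the best-supported alternative under a normal likelihood. A result sitting exactly at P = 0.05 still leaves the null roughly 15% as likely as the best alternative, much weaker evidence than "1 in 20" suggests.

```python
from math import exp
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def min_bayes_factor(z):
    """Lower bound on the Bayes factor (null vs best-supported alternative)
    for a normal likelihood: exp(-z**2 / 2)."""
    return exp(-z * z / 2)

z = NormalDist().inv_cdf(1 - 0.05 / 2)  # the z-score sitting exactly at p = 0.05
print(round(two_sided_p(z), 6))  # 0.05 by construction
print(min_bayes_factor(z))       # ~0.15: the null stays ~15% as likely as the best alternative
```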


... Researchers have been documenting misconceptions about the p-value for at least 70 years [3][4][5][6][7][8]. Despite all these published warnings, such misconceptions and misunderstandings remain common among researchers. ...

... Misconception 2: a p-value < 0.05 shows clinical significance. What the p-value tells us is statistical significance. By mistake, p-values < 0.05 are often taken as clinically important [4]. This is not always true. ...

... The p-value is often taken to show the risk of making a mistake when we reject the null hypothesis. The simplest way to see that this is false is to note that the p-value is calculated under the assumption that the null hypothesis is true [4]. ...
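The simulation below (ours; the 50% base rate of true nulls and the 80% power are illustrative assumptions) makes the excerpt's point concrete: even with every test run at α = 0.05, the chance that a rejection is mistaken is not the p-value and cannot be computed from the data of one study alone, since it depends on how many of the tested nulls are true.

```python
import random

random.seed(1)
ALPHA, POWER, P_NULL_TRUE = 0.05, 0.80, 0.50  # illustrative assumptions

false_pos = true_pos = 0
for _ in range(100_000):
    null_is_true = random.random() < P_NULL_TRUE
    # p is computed assuming H0: under a true null we reject at rate ALPHA,
    # under a real effect we reject at rate POWER.
    if random.random() < (ALPHA if null_is_true else POWER):
        if null_is_true:
            false_pos += 1
        else:
            true_pos += 1

share_mistaken = false_pos / (false_pos + true_pos)
print(share_mistaken)  # ~0.06 here, and far higher if most tested nulls are true
```

With a lower base rate of real effects the same α = 0.05 yields a much larger share of mistaken rejections, which is exactly why the error probability cannot be read off the p-value.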

... Researchers often misinterpret the information provided by p-values (Goodman, 2008). In our following explanations, we focus on the Neyman-Pearson approach, where the aim of the frequentist branch of statistics is to help you make decisions and limit the number of errors you will make in the long run (Neyman, 1977). ...

... It is important to understand that the cut-off of 5% now appears immutable for disciplines like psychology that routinely use 5% for alpha, but it was never meant as a fixed standard of evidence. Fisher, one of the pioneers of hypothesis testing, commented that he accepted 5% as a low standard of evidence across repeated findings (Goodman, 2008). Fisher (1926) emphasised that individual researchers should consider which alpha is appropriate for the standard of evidence in their study, but this nuance has been lost over time. ...

Authors have highlighted for decades that sample size justification through power analysis is the exception rather than the rule. Even when authors do report a power analysis, there is often no justification for the smallest effect size of interest, or they do not provide enough information for the analysis to be reproducible. We argue one potential reason for these omissions is the lack of a truly accessible introduction to the key concepts and decisions behind power analysis. In this tutorial targeted at complete beginners, we demonstrate a priori and sensitivity power analysis using jamovi for two independent samples and two dependent samples. Respectively, these power analyses allow you to ask the questions: “How many participants do I need to detect a given effect size?”, and “What effect sizes can I detect with a given sample size?”. We emphasise how power analysis is most effective as a reflective process during the planning phase of research to balance your inferential goals with your resources. By the end of the tutorial, you will be able to understand the fundamental concepts behind power analysis and extend them to more advanced statistical models.
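The two questions posed in this abstract can be sketched numerically. The functions below are ours and use the normal approximation for a two-sided test on two independent groups; exact t-based tools such as jamovi or G*Power give slightly larger answers (e.g. 64 rather than 63 per group for d = 0.5).

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """A priori: participants per group needed to detect effect size d
    (two-sided, two independent groups, normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

def detectable_effect(n, alpha=0.05, power=0.80):
    """Sensitivity: smallest effect size d detectable with n per group."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sqrt(2 / n)

print(n_per_group(0.5))                 # 63 (t-based tools give 64)
print(round(detectable_effect(63), 2))  # 0.5
```

The two functions are inverses of each other, which mirrors how a priori and sensitivity analyses are two views of the same planning trade-off.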

... According to Goodman (2008), this is the most common misconception in interpreting the p-value. A p-value of 0.01 does not indicate that the null hypothesis has a 1% probability of being correct. ...

... This recommendation is a settled point in the literature (Goodman, 2008; Greenland & Poole, 2013; Wasserstein & Lazar, 2016). For example, before making inferences about the nature of a given distribution, one should examine a histogram or boxplot of the data. ...

The p-value can be defined as a probability that indicates the degree of incompatibility between the observed data and an expected theoretical model. For this reason, it serves as one of the main parameters of statistical significance in empirical research. However, its incorrect use, combined with problems such as publication bias and the absence of specific standards of reproducibility, has created problems across fields of knowledge. The aim of this paper is to discuss aspects of the importance of the p-value in empirical research. The study is theoretical-reflective, based on the recommendations of the American Statistical Association on the correct interpretation of the p-value. In addition, we discuss the role of statistical significance from an empirical perspective. In particular, the p-value: (1) does not give the probability that the null hypothesis is true; (2) does not indicate that the results were produced by chance; (3) does not estimate the size of the observed effect; (4) does not measure the substantive importance of the results; (5) should never be interpreted on its own; (6) should not be interpreted when the assumptions underlying its calculation are violated; and (7) cannot be interpreted when one is working with the whole population. Critical discussion about the use of significance tests is a sign of statistical maturity. However, researchers cannot decide how to use the p-value before fully understanding its role in empirical research.
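One of the points above, that the p-value does not estimate the size of the observed effect, is easy to demonstrate. In the sketch below (ours, treating the variance as known for simplicity), a trivial effect in a huge sample produces a smaller p-value than a large effect in a small sample.

```python
from math import sqrt
from statistics import NormalDist

def p_for_effect(d, n_per_group):
    """Two-sided p for a standardized mean difference d between two groups
    of size n_per_group, with known variance (z-test sketch)."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A trivial effect in a huge sample yields a *smaller* p than a large effect
# in a small sample, so p cannot be read as an effect-size estimate:
print(p_for_effect(0.05, 10_000))  # ~0.0004
print(p_for_effect(0.80, 25))      # ~0.005
```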

... Over the years, there have been plenty of published papers highlighting these issues and warning against common misinterpretations of P-values and of results labelled as statistically significant findings (2,8,11,17,18). There are also studies that have empirically investigated the magnitude of erroneous inferences from statistical test results (5,9,19-21). ...

... This study aimed to investigate whether statistical inference misunderstandings persist. The answer is clearly yes, since the majority of respondents committed the inferential mistakes often warned about (2,17) and seen in previous studies (5,19,21). The most common inferential mistake was what is often called the inverse probability fallacy: the false belief that a low P-value or a statistically significant result conveys information about the probability of the null hypothesis being true. ...

Background:
The aim was to investigate inferences of statistically significant test results among persons with more or less statistical education and research experience.
Methods:
A total of 75 doctoral students and 64 statisticians/epidemiologists responded to a web questionnaire about inferences from statistically significant findings. Participants were asked about their education and research experience, and also whether a 'statistically significant' test result (P = 0.024, α-level 0.05) could be inferred as proof of, or as a probability statement about, the truth or falsehood of the null hypothesis (H0) and the alternative hypothesis (H1).
Results:
Almost all participants reported having a university degree, and among statisticians/epidemiologists, most reported having a university degree in statistics and working professionally with statistics. Overall, 9.4% of statisticians/epidemiologists and 24.0% of doctoral students responded that the statistically significant finding proved that H0 is not true, and 73.4% of statisticians/epidemiologists and 53.3% of doctoral students responded that the statistically significant finding indicated that H0 is improbable. The corresponding numbers for inferences about the alternative hypothesis (H1) were 12.0% and 6.2% for proving H1 true, and 62.7% and 62.5% for the conclusion that H1 is probable. Correct inferences to both questions, namely that a statistically significant finding can be inferred as neither proof nor a measure of a hypothesis' probability, were given by 10.7% of doctoral students and 12.5% of statisticians/epidemiologists.
Conclusions:
Misinterpretation of P-values and statistically significant test results persists even among persons with substantial statistical education who work professionally with statistics.

... A p-value represents the probability of obtaining a result, or one more extreme, on the condition of a point hypothesis being true (Benjamin et al., 2018). Goodman names 12 common misconceptions of the p-value (Goodman, 2008), four of which are: ...
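The conditional definition quoted here translates directly into a Monte Carlo procedure: simulate data under the point hypothesis and count how often a result at least as extreme as the observed one occurs. A toy example of ours for a fair-coin null (the two-sided extremity measure, distance from the null expectation, is our choice):

```python
import random

random.seed(0)

def mc_p_value(observed_heads, n_flips=100, p_null=0.5, sims=20_000):
    """Monte Carlo p-value: fraction of null-simulated experiments whose
    result is at least as far from the null expectation as the observed one."""
    obs_dev = abs(observed_heads - n_flips * p_null)
    extreme = 0
    for _ in range(sims):
        heads = sum(random.random() < p_null for _ in range(n_flips))
        if abs(heads - n_flips * p_null) >= obs_dev:
            extreme += 1
    return extreme / sims

p = mc_p_value(60)
print(p)  # close to the exact two-sided binomial value of ~0.057
```

Note that every simulated experiment assumes the null is true, which is precisely why the resulting p says nothing about the probability that the null is true.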

... Here, the first stage can be powered to only detect medium and large effects, but the second stage should be powered to detect small effects. Not only would this design entail a replication of an initially measured effect, but it also coincides with Fisher's original recommendation for how to use the p-value threshold of 0.05 (that an experiment should be repeated) (Goodman, 2008). Further, it would refine dichotomous experimental questions of whether an effect exists or not, to precisely estimating the magnitude and direction of said effects. ...

Replicability, the degree to which a previous scientific finding can be repeated in a distinct set of data, has been considered an integral component of institutionalized scientific practice since its inception several hundred years ago. In the past decade, large-scale replication studies have demonstrated that replicability is far from favorable, across multiple scientific fields. Here, I evaluate this literature and describe contributing factors including the prevalence of questionable research practices (QRPs), misunderstanding of p-values, and low statistical power. I subsequently discuss how these issues manifest specifically in preclinical neuroscience research. I conclude that these problems are multifaceted and difficult to solve, relying on the actions of early and late career researchers, funding sources, academic publishers, and others. I assert that any viable solution to the problem of substandard replicability must include changing academic incentives, with adoption of registered reports being the most immediately impactful and pragmatic strategy. For animal research in particular, comprehensive reporting guidelines that document potential sources of sensitivity for experimental outcomes is an essential addition.

... However, despite the p-value not being the probability of the null hypothesis being true, survey studies suggest researchers do interpret p-values in such a way (e.g. Goodman, 2008). Moreover, scientists often misreport non-significant results as evidence of absence of a difference between groups of conditions or evidence of no effect when this inference is not necessarily warranted. ...

... In their survey of 86 Psychonomic Bulletin & Review articles, Hoekstra et al. (2006, p. 1036) reported that: "We found the serious mistake of accepting the null hypothesis and claiming no effect in 60% (CI: 53%, 66%) of the articles that reported statistically nonsignificant results" (emphasis added). And interpreting a non-significant result as if there were no differences between conditions ranks at number 2 of Goodman's (2008) "Dirty Dozen" p-value misconceptions. However, just because a researcher might report the results of significance tests incorrectly, this does not mean that they themselves, or their readers, necessarily interpreted the significance test incorrectly. ...

In this thesis I explore the extent to which researchers of animal cognition should be concerned about the reliability of its scientific results and the presence of theoretical biases across research programmes. To do so I apply and develop arguments born in human psychology's "replication crisis" to animal cognition research and assess a range of secondary data analysis methods to detect bias across heterogeneous research programmes. After introducing these topics in Chapter 1, Chapter 2 makes the argument that areas of animal cognition research likely contain many findings that will struggle to replicate in direct replication studies. In Chapter 3, I combine two definitions of replication to outline the relationship between replication and theory testing, generalisability, representative sampling, and between-group comparisons in animal cognition. Chapter 4 then explores deeper issues in animal cognition research, examining how the academic systems that might select for research with low replicability might also select for theoretical bias across the research process. I use this argument to suggest that much of the vociferous methodological criticism in animal cognition research will be ineffective without considering how the academic incentive structure shapes animal cognition research. Chapter 5 then begins my attempt to develop methods to detect bias and to critically and quantitatively synthesise evidence in animal cognition research. In Chapter 5, I led a team examining publication bias and the robustness of statistical inference in studies of animal physical cognition. Chapter 6 was a systematic review and a quantitative risk-of-bias assessment of the entire corvid social cognition literature. In Chapter 7, I led a team assessing how researchers in animal cognition report and interpret non-significant statistical results, as well as the p-value distributions of non-significant results across a manually extracted dataset and an automatically extracted dataset from the animal cognition literature. Chapter 8 then reflects on the difficulties of synthesising evidence and detecting bias in animal cognition research. In Chapter 9, I present survey data from over 200 animal cognition researchers whom I questioned on the topics of this thesis. Finally, Chapter 10 summarises the findings of this thesis and discusses potential next steps for research in animal cognition.

... The corrected p-values are then interpreted in many incorrect ways, such as (1 minus) the probability of the null hypothesis being false [7,12,13,22,29] or the probability of obtaining the same results in a replication of the experiment [7,12]. ...

Statistical tests are a powerful set of tools when applied correctly, but unfortunately the extended misuse of them has caused great concern. Among many other applications, they are used in the detection of biomarkers so as to use the resulting p-values as a reference with which the candidate biomarkers are ranked. Although statistical tests can be used to rank, they have not been designed for that use. Moreover, there is no need to compute any p-value to build a ranking of candidate biomarkers. Those two facts raise the question of whether or not alternative methods which are not based on the computation of statistical tests that match or improve their performances can be proposed. In this paper, we propose two alternative methods to statistical tests. In addition, we propose an evaluation framework to assess both statistical tests and alternative methods in terms of both the performance and the reproducibility. The results indicate that there are alternative methods that can match or surpass methods based on statistical tests in terms of the reproducibility when processing real data, while maintaining a similar performance when dealing with synthetic data. The main conclusion is that there is room for the proposal of such alternative methods.
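The abstract's claim that no p-value is needed for ranking follows because, for a fixed test and sample size, the p-value is a monotone transform of the test statistic. A toy illustration of ours with made-up z-scores for hypothetical biomarkers:

```python
from statistics import NormalDist

def p_from_z(z):
    """Two-sided p-value for a standard-normal statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up test statistics for five hypothetical candidate biomarkers:
z_scores = {"m1": 0.4, "m2": 3.1, "m3": 1.7, "m4": 2.2, "m5": 0.9}

by_p = sorted(z_scores, key=lambda m: p_from_z(z_scores[m]))
by_stat = sorted(z_scores, key=lambda m: -abs(z_scores[m]))
print(by_p == by_stat)  # True: p is monotone in |z|, so it adds nothing to the ranking
```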

... = 3.0). The statistical significance of each coefficient serves as an indication to the analyst that an estimate of at least that magnitude is unlikely to be generated if no real effect exists: the smaller the probability of observing an estimate at least that large (i.e., the infamous p-value [52]), the higher the confidence one can have that, given the data, the effect exists in reality and is unlikely to be explainable by chance alone. Generally, the threshold for significance is set at p ≤ 0.05, but this may vary considerably depending on the domain of application. ...

Decisional processes are at the basis of most businesses in several application domains. However, they are often not fully transparent and can be affected by human or algorithmic biases that may lead to systematically incorrect or unfair outcomes. In this work, we propose an approach for unveiling biases in decisional processes, which leverages association rule mining for systematic hypothesis generation and regression analysis for model selection and recommendation extraction. In particular, we use rule mining to elicit candidate hypotheses of bias from the observational data of the process. From these hypotheses, we build regression models to determine the impact of variables on the process outcome. We show how the coefficients of the selected model can be used to extract recommendations upon which the decision maker can operate. We evaluated our approach using both synthetic and real-life datasets in the context of discrimination discovery. The results show that our approach provides more reliable evidence compared to that obtained using rule mining alone, and how the obtained recommendations can be used to guide analysts in the investigation of biases affecting the decisional process at hand.

... The final models were selected by comparison based on the AIC, log-likelihood, and significance testing of model coefficients. We did not assume a specific significance level for the variables, such as 0.05, since the use of this type of heuristic cannot be considered valid (Goodman, 2008; Wasserstein et al., 2019; Wasserstein & Lazar, 2016). Instead, we recognised the basis for considering a given relationship to be stronger the lower the p-value, and weaker, but not nonexistent, when the p-value is high, e.g. ...

Purpose:
The technical dimensions of cybercrime and its control have rendered it an inconvenient subject for many criminologists. Adopting either semantic (legal) or syntactic (technical) perspectives on cyber criminality, as theorised by McGuire, can lead to disparate conclusions. The aim of this paper is to examine how these perspectives and corresponding educational backgrounds shape opinions on cybercrime and cybercrime policy.
Design/Methods/Approach:
To address this research question, we first provide a non-exhaustive review of existing critical literature on a few selected controversial issues in the field, including cyber vigilantism, file sharing websites, and political hacking. Based on these areas, we developed an online survey that we then distributed among students of law and computer science, as well as to a 'non-cyber contrast group' including mainly students of philology and philosophy.
Findings:
Statistical analysis revealed differences in the way respondents approached most of the issues, which were sometimes moderated by year of study and gender. In general, the respondents were highly supportive of internet vigilantism, prioritised the cybercrimes of the powerful, and encouraged open access to cybersecurity. The computer science students expressed a lower fear of cybercrime and approved of hacktivism more frequently, while the law students affirmed a conservative vision of copyright and demonstrated higher punitiveness towards cyber offenders. Interestingly, the computer science students were least likely to translate their fear of cybercrime into punitive demands.
Research Limitations/Implications:
The findings support the distinction between various narratives about cybercrime by showing the impact of professional socialization on the expressed opinions. They call for a consciously interdisciplinary approach to the subject and could be complemented by a comprehensive qualitative inquiry in the perception of cyber threats.
Originality/Value:
The authors wish to contribute to the understanding of the construction of cybercrime on the border of criminal law and computer science. Additionally, we present original data which reveal different views on related issues held by potential future professionals in both areas.

... Previous research has shown that misunderstandings and oversimplifications of NHST are prevalent not only among students but also among senior researchers and even methodology teachers (Haller & Krauss, 2002). The most common NHST misconceptions concern the erroneous assumptions that the p-value represents the probability of the null hypothesis being true ("inverse probability fallacy"), that the p-value is a measure of effect size ("effect size fallacy"), or that it gives the probability the result will replicate in further studies ("replication fallacy"; Nickerson, 2000; Badenes-Ribera et al., 2016; Goodman, 2008; Greenland et al., 2016; Ropovik, 2017). Another pervasive misconception is related to the interpretation of negative results, for example, mistaking the absence of evidence for evidence of absence of an effect (Altman & Bland, 1995; Alderson, 2004). ...

In the years following the reproducibility crisis in the behavioral sciences, increased attention of the scientific community has been dedicated to the correct application of statistical inference and the promotion of open science practices. In the present survey, we contacted psychology researchers, lecturers, and doctoral students from all universities in Slovakia and the Slovak Academy of Sciences via email. In total, we received answers from 65 participants. Questions in the survey covered the most common misconceptions about statistical hypothesis testing, as well as awareness, attitudes, and barriers related to the adherence to open science practices. We found a high prevalence of statistical misconceptions, namely related to the interpretation of p-values and the interpretation of null results. At the same time, participants indicated mostly positive attitudes towards open science practices, such as data sharing and preregistration, and were highly interested in further training. These results provide an insight into the needs of the Slovak psychology research community. This is an important step in the further dissemination of open science practices and the prevention of common statistical and methodological errors.

... Yet, such errors are very common in the biomedical literature. Despite its seeming simplicity, the p-value is probably "the most misunderstood, misinterpreted, and occasionally miscalculated index in all of biomedical research" [15]. ...

At the beginning of our research training, we learned about hypothesis testing, p-values, and statistical inference [...]

... This is a subtle, but important, difference. This ease of misinterpretation has led to concern among many scientific fields about the validity of p-value testing (e.g. Lee 2010; Krueger and Heck 2019; Goodman 2008). ...

21cm cosmology is a field in which the absorption and emission from the cosmic radio background by neutral hydrogen gas is used to probe cosmology and astrophysics of early epochs of the universe. In particular, this process is one of the most promising methods of measuring the Cosmic Dawn, when the first stars formed, and the Epoch of Reionisation, making it a key objective of modern radio cosmology. This thesis primarily investigates the application of Bayesian data analysis techniques to global 21cm cosmology, to aid in overcoming two of the most prominent difficulties in detecting a sky-averaged ('global') 21cm signal: the presence of foregrounds around four orders of magnitude brighter than the signal and systematic distortions that arise from chromaticity of the antenna's gain pattern. Therefore, in this thesis, the impact that these difficulties can have on experiments is investigated through simulations and the efficacy with which existing data analysis techniques in the field can manage them is quantified. Following this, a new data analysis pipeline is developed, utilising Bayesian processes, that is designed to overcome limitations with existing techniques. The primary concept of the pipeline developed in this thesis is to perform continuous physically motivated simulations of observations using parametrised models of the sky and antenna to explain and fit for systematic distortions in a physically understood manner. Throughout this thesis, this pipeline is developed and tested in simulations to quantify its performance and limitations. The core of this work was published in Anstey et al. (2021). This thesis also discusses the additional technique of utilising time- and antenna-dependencies in data, coupled with the developed pipeline, to improve the Bayesian modelling process, which is being written in Anstey et al. (in prep.), as well as a method by which simulated observations in the developed pipeline could be used to help guide the design of a global 21cm experiment, published in Anstey et al. (2022). The techniques developed in this thesis are generally applicable to any global 21cm experiment. However, they were developed with the primary intent of being utilised in the Radio Experiment for the Analysis of Cosmic Hydrogen (REACH) (de Lera Acedo et al. 2022).

... The increase in tree-size spatial variability mainly occurs at small scales, less than 10 m for both forests, but changes depending upon the functional group. A widespread practice in most statistical analyses, which we dislike, is to reduce extensive analyses to merely computing P-values or detecting a significant difference (Breiman, 2001; Goodman, 2008, 2019; Ellison and Dennis, 1996; Fanelli et al., 2017; Ioannidis, 2019). Regardless, a rule of thumb is that a semivariogram ratio larger than five could be considered statistically different from one (i.e., no difference) at a P-value of 5%. ...

Spatial patterns reveal critical features at the individual and community levels. However, how to evaluate changes in spatial characteristics remains largely unexplored. We assess spatial changes in spatial point patterns by augmenting current statistical functions and indices. We fitted functions to describe unmarked and marked (tree size) spatial patterns using data from a large-scale silvicultural experiment in southern Chile. Furthermore, we compute the mingling index to represent spatial tree diversity. We proffer the pair correlation function difference before and after treatment to detect changes in the unmarked-point pattern of trees and the semivariogram-ratio to evaluate changes in the marked-point pattern. Our research provides a quantitative assessment of a critical aspect of forest heterogeneity: changes in spatial unmarked and marked-point patterns. The proposed approach can be a powerful tool for quantifying the impacts of disturbances and silvicultural treatments on spatial patterns in forest ecosystems.

... Unfortunately, misinterpretation of this approach is both persistent and pervasive [1]. We suggest that a directional likelihood ratio calculated using the minimum clinically significant effect size as a dividing hypothesis provides a more useful and easily understood metric. ...

A simple and common type of medical research involves the comparison of one treatment against another. The logical aim should be both to establish which treatment is superior and the strength of evidence supporting this conclusion, a task for which null hypothesis significance testing is particularly ill-suited. This paper describes and evaluates a novel sequential inferential procedure based on the likelihood evidential paradigm with the likelihood ratio as its salient statistic. The real-world performance of the procedure as applied to the distribution of treatment effects seen in the Cochrane Database of Systematic Reviews is simulated. The misleading evidence rate was 5% and mostly this evidence was only weakly misleading. Early stopping occurred frequently and was associated with misleading evidence in only 0.4% of cases.
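A minimal sketch of a sequential likelihood-ratio rule of the kind described, for normal data with unit variance; the effect size delta = 0.5 and the evidential threshold k = 32 are our illustrative choices, not the paper's.

```python
import random
from math import exp

random.seed(3)

def sequential_lr(data, delta=0.5, k=32):
    """Accumulate the likelihood ratio for H1: mu = delta vs H0: mu = 0 on
    N(mu, 1) data, stopping once the evidence reaches k or 1/k."""
    lr = 1.0
    for n, x in enumerate(data, start=1):
        lr *= exp(delta * x - delta ** 2 / 2)  # ratio of the two normal densities
        if lr >= k or lr <= 1 / k:
            return lr, n  # early stopping at an evidential boundary
    return lr, len(data)

data = [random.gauss(0.5, 1.0) for _ in range(200)]  # the truth here is mu = 0.5
lr, n = sequential_lr(data)
print(f"stopped after {n} observations with LR = {lr:.1f}")
```

Because the likelihood ratio is a running measure of evidence rather than a long-run error rate, stopping early when it crosses a boundary does not invalidate its interpretation, which is the property the paper's procedure exploits.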

... Having provided some context for conducting Bayesian analyses, we can now turn to the central thesis of the current work: how Bayesian analyses stand to improve upon frequentist analyses when identifying and communicating adverse impact statistics. For decades, scholars have lamented the many problems with using p values in practice (e.g., Goodman, 2008; Greenland et al., 2016; Hoekstra et al., 2006). In fact, the American Statistical Association (ASA) released an official statement on p values, advising against the sole use of p values for testing the truth of a hypothesis. With equal prior odds, the posterior odds equal the Bayes factor: P(H1|D)/P(H0|D) = P(D|H1)/P(D|H0) (Equation 4; Table 1 summarises the evidence provided by the Bayes factor). ...

Adverse impact results from company hiring practices that negatively affect protected classes. It is typically determined on the basis of the 4/5ths Rule (which is violated when the minority selection rate is less than 4/5ths of the majority selection rate) or a chi-square test of statistical independence (which is violated when group membership is associated with hiring decisions). Typically, both analyses are conducted within the traditional frequentist paradigm, involving null hypothesis significance testing (NHST), but we propose that the less-often-used Bayesian paradigm more clearly communicates evidence supporting adverse impact findings, or the lack thereof. In this study, participants read vignettes with statistical evidence (frequentist or Bayesian) supporting the presence or absence of adverse impact at a hypothetical company; then they rated the vignettes on their interpretability (i.e., clarity) and retributive justice (i.e., deserved penalty). A Bayesian analysis of our study results finds moderate evidence in support of no mean difference in either interpretability or retributive justice, across three out of the four vignettes. The one exception was strong evidence supporting the frequentist vignette indicating no adverse impact being viewed as more interpretable than the equivalent Bayesian vignette. Broad implications for using Bayesian analyses to communicate adverse impact results are discussed.
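Both frequentist screens described here, and one standard Bayesian alternative, fit in a few lines. The Bayes factor below compares a shared selection rate (no adverse impact) against two independent rates under uniform Beta(1, 1) priors; this is a textbook beta-binomial construction, not necessarily the analysis used in the study, and the data are hypothetical.

```python
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def four_fifths_violated(k_min, n_min, k_maj, n_maj):
    """4/5ths rule: minority selection rate below 80% of the majority rate."""
    return (k_min / n_min) < 0.8 * (k_maj / n_maj)

def bf01_shared_vs_independent(k1, n1, k2, n2):
    """BF01 for 'one shared selection rate' (no adverse impact) against
    'two independent rates', with uniform Beta(1, 1) priors on each rate.
    The binomial coefficients cancel in the ratio."""
    log_m0 = log_beta(1 + k1 + k2, 1 + (n1 - k1) + (n2 - k2))
    log_m1 = log_beta(1 + k1, 1 + n1 - k1) + log_beta(1 + k2, 1 + n2 - k2)
    return exp(log_m0 - log_m1)

# Hypothetical data: 30 of 100 minority vs 50 of 100 majority applicants hired.
print(four_fifths_violated(30, 100, 50, 100))        # True
print(bf01_shared_vs_independent(30, 100, 50, 100))  # ~0.09: odds ~11:1 against a shared rate
```

Unlike a p-value, the Bayes factor here reads directly as relative evidence for the two models, which is the interpretability advantage the study's vignettes probe.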

... In turn, it is often unknown, misinterpreted, and frequently miscalculated [1,2]. It is worth clarifying that the difficulty in interpreting it may be because it is not an easy concept to understand. In addition, its importance is inappropriately magnified, leading unwary readers to misinterpret the results of scientific publications. ...

... The use of NHST in biomedical research is ubiquitous (Chavalarias et al., 2016) even though it has been criticized repeatedly (e.g., Berger & Delampady, 1987; Berger & Sellke, 1987; Cohen, 1994; Dienes, 2011; Gigerenzer, 2004; Goodman, 1999a, 1999b, 2008; McShane et al., 2019; van Ravenzwaaij & Ioannidis, 2017; Wagenmakers, 2007; Wagenmakers et al., 2018; Wasserstein & Lazar, 2016). Null hypothesis Bayesian testing (NHBT) is an alternative to NHST that has some practical advantages. ...

The use of Cox proportional hazards regression to analyze time-to-event data is ubiquitous in biomedical research. Typically, the frequentist framework is used to draw conclusions about whether hazards are different between patients in an experimental and a control condition. We offer a procedure to calculate Bayes factors for simple Cox models, both for the scenario where the full data is available and for the scenario where only summary statistics are available. The procedure is implemented in our "baymedr" R package. The usage of Bayes factors remedies some shortcomings of frequentist inference and has the potential to save scarce resources.

... Second, we used descriptive statistical methods to summarize data. Proportions and confidence intervals (CIs) were calculated as per published statistical equations and methods [23][24][25]. Third, we used deductive and inductive thematic analysis methods to categorize information and distil pertinent themes. The study was undertaken in line with the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines [26]. ...

Background:
The Hajj is an annual religious mass gathering event held in Makkah, Saudi Arabia. With millions of participants from across the globe attending the Hajj, the risk of importation, transmission, and global spread of infectious diseases is high. The emergence of antimicrobial resistant (AMR) bacteria is of worldwide concern and the Hajj poses a serious risk to its dissemination. This review aims to synthesize published literature on AMR bacteria acquisition and transmission associated with the Hajj.
Methods:
We searched electronic databases to identify literature published between January 1990 and December 2021. The search strategy included medical subject headings and keyword terms related to AMR bacteria and the Hajj.
Results:
After screening 2214 search results, 51 studies were included in the analysis. The review found 6455 AMR bacteria transmissions related to the Hajj. Thirty predominantly enteric or respiratory disease-causing AMR bacterial species were reported, with isolates identified in cases on five continents. Most cases were male, aged above 50 years, and diagnosed in Makkah. Most cases were identified through hospital-based research; few were detected in community or primary health care settings.
Conclusions:
This review provides a contemporary account of knowledge related to AMR transmission at the Hajj. It emphasizes the need for the enhancement of surveillance for AMR bacteria globally.

... 13 Whilst frequentist analyses are ubiquitous in medical sciences, so too is their misinterpretation. [28][29][30] Largely, this results from the underlying assumptions of the Null Hypothesis Significance Testing approach, which require careful study to fully appreciate. In contrast, by directly calculating parameter distributions Bayesian analyses allow multiple hypotheses to be tested while incorporating prior information from previous studies or expert belief. ...

Purpose
Retrospective studies have identified a link between the average set-up error of lung cancer patients treated with image-guided radiotherapy (IGRT) and survival. The IGRT protocol was subsequently changed to reduce the action threshold. In this study, we use a Bayesian approach to evaluate the clinical impact of this change to practice using routine ‘real-world’ patient data.
Methods and Materials
Two cohorts of NSCLC patients treated with IGRT were compared: pre-protocol change (N=780, 5 mm action threshold) and post-protocol change (N=411, 2 mm action threshold). Survival models were fitted to each cohort, and changes in the hazard ratios (HR) associated with residual set-up errors were assessed. The influence of using an uninformative and a skeptical prior in the model was investigated.
Results
Following the reduction of the action threshold, the HR for residual set-up error towards the heart was reduced by up to 10%. Median patient survival increased for patients with set-up errors towards the heart, and remained similar for patients with set-up errors away from the heart. Depending on the prior used, a residual hazard ratio may remain.
Conclusions
Our analysis found a reduced hazard of death and increased survival for patients with residual set-up errors towards versus away from the heart post-protocol change. This study demonstrates the value of a Bayesian approach in the assessment of technical changes in radiotherapy practice and supports the consideration of adopting this approach in further prospective evaluations of changes to clinical practice.

... p-values range between 0 and 1; values closer to 1 indicate correlations more likely to have occurred by random chance (i.e., no statistically significant correlation exists for the given data). Literature (Feinstein 1975; Goodman 2008) customarily recommends that statistically significant correlations be limited to p-values ≤ 0.05. In Table 2, p-values range between 0.002 and 1, where p-value 11 through p-value 13 were all greater than 0.05 by about 0.6, on average, indicating the corresponding correlations are not statistically significant. ...

... First, violation of assumptions behind PCC, such as the presence of outliers and non-normality of data, leads to mischaracterization of the linear dependence [Pernet et al., 2013; Bishara and Hittner, 2015; Armstrong, 2019]. Second, erroneous interpretation of a p-value as the probability of H0 being true leads to inflated Type I errors; this interpretation and its reliability have been criticised [Goodman, 2008; Gao, 2020] and blamed for the replication/reproducibility crisis [Nuzzo, 2014; Halsey et al., 2015; Anderson, 2020]. In a wider context there are calls to abandon NHST altogether [Szucs and Ioannidis, 2017; Häggström, 2017; Cumming, 2014]. ...

Inferring linear relationships lies at the heart of many empirical investigations. A measure of linear dependence should correctly evaluate the strength of the relationship as well as qualify whether it is meaningful for the population. Pearson's correlation coefficient (PCC), the de facto measure for bivariate relationships, is known to lack in both regards. The estimated strength r may be wrong due to limited sample size and non-normality of data. In the context of statistical significance testing, erroneous interpretation of a p-value as a posterior probability leads to Type I errors, a general issue with significance testing that extends to PCC. Such errors are exacerbated when testing multiple hypotheses simultaneously. To tackle these issues, we propose a machine-learning-based predictive data calibration method which essentially conditions the data samples on the expected linear relationship. Calculating PCC using calibrated data yields a calibrated p-value that can be interpreted as a posterior probability, together with a calibrated r estimate, a desired outcome not provided by other methods. Furthermore, the ensuing independent interpretation of each test might eliminate the need for multiple testing correction. We provide empirical evidence favouring the proposed method using several simulations and application to real-world data.
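The multiple-testing problem raised above is easy to reproduce: screening many truly uncorrelated variable pairs at alpha = 0.05 almost guarantees spurious "significant" correlations. A minimal simulation, using the Fisher z-approximation for the correlation p-value and entirely synthetic data:

```python
import random
import math

random.seed(11)

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_pvalue(r, n):
    """Two-sided p-value via the normal approximation to Fisher's z-transform."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# 1000 batches; each batch tests 20 independent, truly uncorrelated pairs.
n, hits = 40, 0
for _ in range(1000):
    ps = []
    for _ in range(20):
        x = [random.gauss(0, 1) for _ in range(n)]
        y = [random.gauss(0, 1) for _ in range(n)]
        ps.append(r_pvalue(pearson_r(x, y), n))
    if min(ps) < 0.05:
        hits += 1   # at least one spurious "significant" correlation in the batch
print(f"batches with a false positive at alpha = 0.05: {hits / 1000:.2f}")
```

With 20 independent null tests, the family-wise false positive rate is about 1 - 0.95^20 ≈ 64%, which is the motivation for multiple-testing corrections such as Bonferroni.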

... While controlled experiments are the gold standard in science for claiming causality, many people misunderstand p-values. A very common misunderstanding is that a statistically significant result with p-value 0.05 has a 5% chance of being a false positive (Goodman 2008; Greenland, Senn, et al. 2016; Vickers 2009). A common alternative to p-values used by commercial vendors is "confidence," which is defined as (1 - p-value) * 100% and often misinterpreted as the probability that the result is a true positive. ...

A/B tests, or online controlled experiments, are heavily used in industry to evaluate implementations of ideas. While the statistics behind controlled experiments are well documented and some basic pitfalls known, we have observed some seemingly intuitive concepts being touted, including by A/B tool vendors and agencies, which are misleading, often badly so. Our goal is to describe these misunderstandings, the "intuition" behind them, and to explain and bust that intuition with solid statistical reasoning. We provide recommendations that experimentation platform designers can implement to make it harder for experimenters to make these intuitive mistakes.
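The "5% chance of being a false positive" misreading described above can be busted with a small simulation: when most tested ideas have no real effect, far more than 5% of the results that reach p < 0.05 are false positives. The parameters below (80% true nulls, effect size, sample size) are hypothetical choices for illustration only.

```python
import random
import math

random.seed(42)

def two_sample_p(n, effect):
    """One simulated A/B test: n unit-variance normal observations per arm."""
    a = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    b = sum(random.gauss(effect, 1.0) for _ in range(n)) / n
    z = (b - a) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

M, n, pi0, delta = 4000, 100, 0.8, 0.5   # assume 80% of tested ideas do nothing
sig_null = sig_alt = 0
for _ in range(M):
    if random.random() < pi0:
        if two_sample_p(n, 0.0) < 0.05:
            sig_null += 1        # significant, but the null was true
    else:
        if two_sample_p(n, delta) < 0.05:
            sig_alt += 1         # significant, and a real effect exists

fdr = sig_null / (sig_null + sig_alt)
print(f"share of 'significant' results that are false positives: {fdr:.2f}")
```

Under these assumptions roughly one in six significant results is a false positive, not one in twenty: the false discovery rate depends on the base rate of true effects and on power, not on the significance threshold alone.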

... However, these p-values should be accompanied by meaningful graphs to describe the situation under study. In any case it is important to communicate and interpret the p-value correctly: not as the probability of the null hypothesis being true [31], but as the probability that the observed (or a more extreme) result would have occurred if the null hypothesis had been true. ...
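The correct reading can be checked directly: if the null hypothesis is true, p-values are uniformly distributed, so about 5% of them fall below 0.05 in the long run. That 5% is an error rate over repeated experiments, not the probability that H0 is true in any one of them. A minimal simulation:

```python
import random
import math

random.seed(1)

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, with known unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)          # sum of n N(0,1) draws, scaled
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate many experiments in which the null hypothesis is exactly true.
pvals = [z_test_p([random.gauss(0, 1) for _ in range(30)]) for _ in range(5000)]

share = sum(p < 0.05 for p in pvals) / len(pvals)
print(f"share of p < 0.05 under a true null: {share:.3f}")
```

The printed share hovers around 0.05 by construction, which is exactly what "the probability of a result this extreme, computed assuming H0" means.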

The German National Cohort (NAKO) is an ongoing, prospective multicenter cohort study, which started recruitment in 2014 and includes more than 205,000 women and men aged 19–74 years. The study data will be available to the global research community for analyses. Although the ultimate decision about the analytic methods will be made by the respective investigator, in this paper we provide the basis for a harmonized approach to the statistical analyses in the NAKO. We discuss specific aspects of the study (e.g., data collection, weighting to account for the sampling design), but also give general recommendations which may apply to other large cohort studies as well.

... Similar methods [20,29,42] have been used in relevant studies to demonstrate the sufficiency of IMs. However, the method proposed by Zelaschi et al. is unreasonable because the calculated p-values are closely related to the number of samples, as noted in relevant studies [40,43]. It becomes more difficult to accept the null hypothesis with an increasing number of samples. ...

To rank the pulse-like ground motions based on the damage potential to different structures, the internal relationship between the damage potential of pulse-like ground motions and engineering demand parameters (EDPs) is analyzed in this paper. First, a total of 240 pulse-like ground motions from the NGA-West2 database and 16 intensity measures (IMs) are selected. Moreover, four reinforced concrete frame structures with significantly different natural vibration periods are established for dynamic analysis. Second, the efficiency and sufficiency of the IMs of ground motion are analyzed, and the IMs that can be used to efficiently and sufficiently evaluate the EDPs are obtained. Then, based on the calculation results, the principal component analysis (PCA) method is employed to obtain a comprehensive IM for characterizing the damage potential of pulse-like ground motions for specific building structures and EDPs. Finally, the pulse-like ground motions are ranked based on the selected IM and the comprehensive IM for four structures and three EDPs. The results imply that the proposed method can be used to efficiently and sufficiently characterize the damage potential of pulse-like ground motions for building structures.

... Comparisons between synaptic densities and synapse morphologies were visualised graphically using estimation plots of effect sizes, and their uncertainty assessed using confidence intervals (Ho et al., 2019). This approach was preferred over traditional null hypothesis significance testing because of the associated limitations (Goodman, 2008). Separation between the confidence interval and 0 was considered an effect. ...
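The estimation approach described above (report an effect size with its confidence interval and check its separation from 0) can be sketched with a percentile bootstrap. All measurements below are invented for illustration; they are not the study's data.

```python
import random
import statistics

random.seed(3)

# Hypothetical measurements for two groups (made-up numbers):
group_a = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 13.8, 12.2]
group_b = [14.2, 15.0, 13.9, 14.8, 14.4, 15.3, 14.1, 14.9]

def boot_ci(a, b, reps=10_000, level=0.95):
    """Percentile bootstrap CI for the difference in means (b - a)."""
    diffs = []
    for _ in range(reps):
        ra = [random.choice(a) for _ in a]   # resample each group with replacement
        rb = [random.choice(b) for _ in b]
        diffs.append(statistics.mean(rb) - statistics.mean(ra))
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * reps)]
    hi = diffs[int((1 + level) / 2 * reps)]
    return lo, hi

lo, hi = boot_ci(group_a, group_b)
print(f"95% bootstrap CI for the mean difference: [{lo:.2f}, {hi:.2f}]")
# An interval that excludes 0 is read as evidence of an effect.
```

This mirrors the estimation-plot logic: the reader sees the magnitude and uncertainty of the difference rather than a binary significant/non-significant verdict.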

Excitatory synapses are typically described as single synaptic boutons (SSBs), where one presynaptic bouton contacts a single postsynaptic spine. Using serial section block face scanning electron microscopy, we found that this textbook definition of the synapse does not fully apply to the CA1 region of the hippocampus. Roughly half of all excitatory synapses in the stratum oriens involved multi-synaptic boutons (MSBs), where a single presynaptic bouton containing multiple active zones contacted many postsynaptic spines (from 2 to 7) on the basal dendrites of different cells. The fraction of MSBs increased during development (from P21 to P100) and decreased with distance from the soma. Curiously, synaptic properties such as active zone (AZ) or postsynaptic density (PSD) size exhibited less within-MSB variation when compared to neighbouring SSBs, features that were confirmed by super-resolution light microscopy. Computer simulations suggest that these properties favour synchronous activity in CA1 networks.

... For a discussion of "the dirty dozen" common misinterpretations of a single p-value, see Goodman (2008). Greenland et al. (2016) expanded on this topic in their guide to 14 common misinterpretations of a single p-value. ...

The aim of this paper is to present an alternative approach to measuring effect size. The proposed model belongs to the r family.

Objective and background:
We assessed current understandings in interpretation of Bayesian and traditional statistical results within the clinical researcher (non-statistician) community.
Methods:
Within a 22-question survey, including demographics and experience and comfort levels with Bayesian analyses, we included questions on how to interpret both Bayesian and traditional statistical outputs. We also assessed whether Bayesian or traditional interpretations are considered more useful.
Results:
Among the 323 respondent clinicians, 42.4% and 36.5% chose the correct interpretations of the posterior probability and 95% credible interval, respectively. Only 11.5% of respondents interpreted the p-value correctly and 23.5% interpreted the 95% confidence interval correctly.
Conclusions:
Based on these survey results, we conclude that most of these clinicians face uncertainty when attempting to interpret results from both Bayesian and traditional statistical outputs. When presented with accurate interpretations, clinicians generally conclude that Bayesian results are more useful than conventional ones. We believe there is a need for education of clinicians in statistical interpretation in ways that are customized to this audience.

Understanding the causal relationship between genotype and phenotype is a major objective in biology. The main interest is in understanding trait architecture and identifying loci contributing to the respective traits. Genome-wide association mapping (GWAS) is one tool to elucidate these relationships and has been successfully used in many different species. However, most studies concentrate on marginal marker effects and ignore epistatic and gene-environment interactions. These interactions are problematic to account for, but are likely to make major contributions to many phenotypes that are not regulated by independent genetic effects, but by more sophisticated gene-regulatory networks. Further complication arises from the fact that these networks vary in different natural accessions. However, understanding the differences of gene regulatory networks and gene-gene interactions is crucial to conceive trait architecture and predict phenotypes. The basic subject of this study – using data from the Arabidopsis 1001 Genomes Project – is the analysis of premature stop codons. These have been incurred in nearly one-third of the ~30k genes. A gene-gene interaction network of the co-occurrence of stop codons has been built, and the over- and under-representation of different pairs has been statistically analyzed. To further classify the significant over- and under-represented gene-gene interactions in terms of the molecular function of the encoded proteins, gene ontology terms (GO-SLIM) have been applied. Furthermore, co-expression analysis specifies gene clusters that co-occur over different genetic and phenotypic backgrounds. To link these patterns to evolutionary constraints, the spatial locations of the respective alleles have been analyzed as well. The latter shows clear patterns for certain gene pairs that indicate differential selection.

P-values and Bayes factors are commonly used as measures of the evidential strength of the data collected in hypothesis tests. It is not clear, however, that they are valid measures of that evidential strength; that is, whether they have the properties that we intuitively expect a measure of evidential strength to have. I argue here that measures of evidential strength should be stochastically ordered by both the effect size and the sample size. I consider the case that the data are normally distributed and show that, for that case, P-values are valid measures of evidential strength while Bayes factors are not. Specifically, I show that in a sharp Null hypothesis test the Bayes factor is stochastically ordered by the sample size only if the effect size or the sample size is sufficiently large. This lack of stochastic ordering lies at the root of the Jeffreys-Lindley paradox.
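The Jeffreys-Lindley behaviour discussed in the abstract can be reproduced numerically: holding the p-value fixed at about 0.05 (z = 1.96) while the sample size grows, the Bayes factor increasingly favours the null. The sketch below assumes a normal mean test with known unit variance and an N(0, 1) prior on the mean under H1; the prior scale is an illustrative choice, not one the paper prescribes.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def bf01(z, n, tau2=1.0):
    """BF01 for H0: mu = 0 vs H1: mu ~ N(0, tau2), unit-variance data, fixed z."""
    xbar = z / math.sqrt(n)                 # sample mean that yields this z-score
    f0 = normal_pdf(xbar, 1.0 / n)          # marginal density of xbar under H0
    f1 = normal_pdf(xbar, tau2 + 1.0 / n)   # marginal density of xbar under H1
    return f0 / f1

# Keep the p-value fixed at ~0.05 (z = 1.96) and let the sample size grow:
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: BF01 = {bf01(1.96, n):.1f}")
```

At small n a "just significant" result mildly favours H1, but as n grows the same p-value yields a Bayes factor that strongly supports H0, which is precisely the paradoxical lack of agreement between the two measures that the paper analyzes.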

Recently, there has been a growing interest in the development of pharmacological interventions targeting ageing, as well as on the use of machine learning for analysing ageing-related data. In this work we use machine learning methods to analyse data from DrugAge, a database of chemical compounds (including drugs) modulating lifespan in model organisms. To this end, we created four datasets for predicting whether or not a compound extends the lifespan of C. elegans (the most frequent model organism in DrugAge), using four different types of predictive biological features, based on compound-protein interactions, interactions between compounds and proteins encoded by ageing-related genes, and two types of terms annotated for proteins targeted by the compounds, namely Gene Ontology (GO) terms and physiology terms from the WormBase Phenotype Ontology. To analyse these datasets we used a combination of feature selection methods in a data pre-processing phase and the well-established random forest algorithm for learning predictive models from the selected features. The two best models were learned using GO terms and protein interactors as features, with predictive accuracies of about 82% and 80%, respectively. In addition, we interpreted the most important features in those two best models in light of the biology of ageing, and we also predicted the most promising novel compounds for extending lifespan from a list of previously unlabelled compounds.

Both the public and clinicians are interested in the application of scientific knowledge concerning problem animal behaviour and its treatment. However, in order to do this effectively it is essential that individuals have not only scientific literacy but also an appreciation of philosophical concepts underpinning a particular approach and their practical implications on the knowledge generated as a result. This paper highlights several common misunderstandings and biases associated with different scientific perspectives relevant to clinical animal behaviour and their consequences for how we determine what may be a useful treatment for a given patient. In addition to more reflective evaluation of results, there is a need for researchers to report more information of value to clinicians; such as relevant treatment outcomes, effect sizes, population characteristics. Clinicians must also appreciate the limitations of population level study results to a given case. These challenges can however be overcome with the careful critical reflection using the scientific principles and caveats described.

This authoritative new book provides a comprehensive overview of diagnostic and therapeutic strategies in hematopoietic cell transplantation, explaining key concepts, successes, controversies and challenges. The authors and editors discuss current and future strategies for major challenges, such as graft-versus-host-disease, including new prophylaxis and treatments. They also discuss long-term complications, such as second malignancies and cardiovascular complications. Chapters are written by leading world experts, carefully edited to achieve a uniform and accessible writing style. Each chapter includes evidence-based explanations and state-of-the-art solutions, providing the reader with practice-changing advice. Full reference lists are also supplied to facilitate further exploration of each topic. Each copy of the printed book is packaged with a password, giving readers online access to all text and images. This inspiring resource demystifies both the basics and subtleties of hematopoietic cell transplantation, and is essential reading for both senior clinicians and trainees.

Complementing more specific "p-value discussions", this paper presents fundamental arguments for when null hypothesis statistical significance tests (NHST) are required and appropriate. The arguments, which are paradigmatic rather than technical, are operationalised and broken down to the extent that their logic can be mapped into a decision tree for the use of NHST. We derive a perspective that does not ban p-values but proposes to minimize their use. P-values will become rather rare in (agricultural) economics if they are not applied in cases where the conditions for their proper use are violated, or where their use is not appropriate or required to answer the questions asked of the data. The accompanying shift from prioritising inferential statistics to recognising the value of descriptive statistics requires not only a change in entrenched habits of thought. This shift also has the potential to trigger changes in the research processes and in the evaluation of new approaches within the disciplines.

The Meinungsmonitor Entwicklungspolitik (Opinion Monitor for Development Policy) 2022 covers two thematic blocks:
The first part examines the population's core attitudes toward development policy. Topics include general support for governmental development cooperation, attitudes toward various motives for development cooperation, and assessments of its effectiveness. Particular attention is paid to attitudes
toward democracy and democracy promotion.
The second part focuses on development policy engagement: on the one hand, non-monetary engagement such as information- and communication-related activities, organizational engagement, and political participation; on the other hand, monetary engagement such as donations to development policy organizations and sustainable consumption. The second part is supplemented by a guest contribution in which the various forms of engagement are combined into a typology and the factors influencing transitions between engagement types over time are analyzed.
In both parts, the indicators for attitudes and engagement are presented (longitudinally) and examined for differences between population groups.
Finally, implications for governmental and civil society development policy actors are formulated concerning strategy, communication, and education work, as well as the promotion of development policy engagement.

Participant retention in longitudinal health research is necessary for generalizable results. Understanding factors that correlate with increased retention could improve retention in future studies. Here, we describe how participant and study process measures are associated with retention in a longitudinal tobacco cessation research study performed in Anchorage, Alaska. Specifically, we conducted a secondary analysis exploring retention among 151 Alaska Native and American Indian (ANAI) people and described our study processes using study retention categories from a recent meta-analysis. We found that our study processes influence retention among ANAI urban residents more than measures collected about the participant. For study process measures, calls where a participant answered and calls participants placed to the study team were associated with higher retention. Calls where the participant did not answer were associated with lower retention. For participant measures, only lower annual income was associated with lower retention at 6 weeks. Promoting communication from participants to the study team could improve retention, and alternative communication methods could be used after unsuccessful calls. Finally, categorizing our study retention strategies demonstrated that additional barrier-reduction strategies might be warranted.

Power laws fit to rockfall frequency-magnitude distributions are commonly used to summarize rockfall inventories, but uncertainty remains regarding which variables control the shape of the distribution, whether by exerting influence on rockfall activity itself or on our ability to measure rockfall activity. In addition, the current literature lacks concise summaries of background on the frequency-magnitude distribution for rockfalls and power law fitting. To help address these knowledge gaps, we present a new review of rockfall frequency literature designed to collect the basic concepts, methods, and applications of the rockfall frequency-magnitude curve in one place, followed by a meta-analysis of 46 rockfall inventories. We re-fit power laws to each inventory based on the maximum likelihood estimate of the scaling exponent and the cutoff volume and used Analysis of Variance and regression to test for relationships between 11 independent physical and systematic variables and the scaling exponent, applying both single-predictor and 2-predictor models. Notable relationships with the scaling exponent were observed for rockmass condition, geology, and maximum inventory volume. Higher scaling exponent values were associated with higher quality rockmasses, sedimentary rocks, and with smaller maximum rockfall volumes. Climate, data collection frequency, and data collection method also appear to have some influence on the scaling exponent, since higher scaling exponents were associated with temperate climates and inventories with shorter temporal extent and methods that involved few or no revisits to the slope. Relationships between the scaling exponent and slope angle, slope aspect, number of rockfalls in the inventory, record length, and minimum inventory volume are more ambiguous due to the noise inherent in comparing many studies together. 
In line with previous work, this study reinforces that sampling large volumes is important to obtaining an accurate distribution, and that the spatial scale of the inventory affects the likelihood of obtaining these measurements. We conclude with discussion of the results and recommendations for future work.
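The maximum likelihood fit of the scaling exponent mentioned above has a closed form for a continuous power law: the standard Clauset-style estimator alpha_hat = 1 + n / sum(ln(x_i / x_min)) over the tail above the cutoff. A sketch with synthetic volumes (the "inventory" parameters below are invented, not taken from the meta-analysis):

```python
import random
import math

random.seed(7)

def sample_power_law(alpha, xmin, n):
    """Draw n volumes from a continuous power law via inverse-CDF sampling."""
    return [xmin * (1 - random.random()) ** (-1.0 / (alpha - 1)) for _ in range(n)]

def mle_exponent(volumes, xmin):
    """Maximum likelihood estimate of the scaling exponent (continuous case)."""
    tail = [v for v in volumes if v >= xmin]
    return 1.0 + len(tail) / sum(math.log(v / xmin) for v in tail)

true_alpha, xmin = 2.2, 0.5      # hypothetical inventory parameters
inventory = sample_power_law(true_alpha, xmin, 5000)
print(f"fitted scaling exponent: {mle_exponent(inventory, xmin):.2f}")
```

The estimator's standard error scales as (alpha - 1) / sqrt(n), which is one reason the review stresses that large inventories, and sampling of large volumes above the cutoff, matter for obtaining an accurate distribution.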

In the field of optimization and machine learning, the statistical assessment of results has played a key role in conducting algorithmic performance comparisons. Classically, null hypothesis statistical tests have been used. However, recently, alternatives based on Bayesian statistics have shown great potential in complex scenarios, especially when quantifying the uncertainty in the comparison. In this work, we delve deep into the Bayesian statistical assessment of experimental results by proposing a framework for the analysis of several algorithms on several problems/instances. To this end, experimental results are transformed to their corresponding rankings of algorithms, assuming that these rankings have been generated by a probability distribution (defined on permutation spaces). From the set of rankings, we estimate the posterior distribution of the parameters of the studied probability models, and several inferences concerning the analysis of the results are examined. Particularly, we study questions related to the probability of having one algorithm in the first position of the ranking or the probability that two algorithms are in the same relative position in the ranking. Not limited to that, the assumptions, strengths, and weaknesses of the models in each case are studied. To help other researchers to make use of this kind of analysis, we provide a Python package and source code implementation at https://zenodo.org/record/6320599.

Conducting high-quality research in early onset scoliosis (EOS) is challenging, requiring the assistance of PhD-trained biostatisticians and epidemiologists with expertise in research methodology. Biostatisticians develop theoretical and statistical methods to analyze data in support of evidence-based decision-making. Epidemiologists provide empirical confirmation of disease processes, identifying factors that affect prognosis to guide the process toward clinical relevancy. Within each step in the study process, there are important principles that investigators can apply to improve the quality of research in EOS:
- Ask a research question that tests a hypothesis, or formulate a hypothesis that answers a research question worth answering.
- Formulate a focused, testable hypothesis.
- Create a study design that tests the hypothesis (i.e., results prove/disprove the hypothesis).
- Identify appropriate patient cohorts (treatment, controls) according to inclusion/exclusion criteria established a priori (prospective and retrospective studies).
- Specify the variables (categorical or quantitative – discrete and/or continuous) to be measured, i.e., the variables hypothesized to impact outcomes: independent variables (patient cohort, gender, treatment method); co-variates (e.g., medical co-morbidities, age, habitus, socioeconomic status, physical abilities); and dependent variables (objective measures of outcomes that reflect disease pathophysiology, treatment and/or prevention: clinical biomarkers, image-based anatomy, HRQoL).
- Analyze data using applicable statistical tests. Sample size (power) calculations are predicated on the type of statistical tests that will be applied to the data, and require specification of a pre-determined effect size (i.e., the strength of the relationship between the independent and dependent variables) and an estimate of the extent of variation in the dependent variables.
- Interpret results established on appropriately powered statistical tests in support/rejection of the hypothesis.
These points, as relevant to EOS research, can be illustrated through an example of a retrospective de novo study identifying risk factors for increased mortality and decreased health-related quality of life (HRQoL) in EOS patients with cerebral palsy (CP) undergoing spine surgery.

In this chapter, we provide an overview of some of the major historic and contemporary statistical controversies, including the use of qualitative versus quantitative methods, the role of description/exploration in research, and the nature of hypothesis testing. We also consider a number of statistical non-controversies that we believe are generally agreed upon, yet still worthy of consideration in the current overview, including the condemnation of fraud, the value of sharing data, and the use of broader/more diverse samples. Finally, we consider reasons why statistical debates can be surprisingly heated and conclude that, regardless of the reasons for controversy or the tone of these debates, impressive progress has been made in the last decade. Given the tools that researchers now have to avoid the mistakes that led to the replication crisis, we expect the quality of research to improve. There will undoubtedly continue to be statistical controversy, but as new practices take hold, we may see a shift in the tone of these debates to being more civil.
Keywords: Statistics, Quantitative, Qualitative, Hypothesis testing, Bayesian statistics, WEIRD samples

Bisphenol A (BPA) is a synthetic chemical used for the manufacturing of plastics, epoxy resin, and many personal care products. This ubiquitous endocrine disruptor is detectable in the urine of over 80% of North Americans. Although adverse neurodevelopmental outcomes have been observed in children with high gestational exposure to BPA, the effects of prenatal BPA on brain structure remain unclear. Here, using magnetic resonance imaging (MRI), we studied the impact of maternal BPA exposure on children's brain structure, as well as the impact of comparable BPA levels in a mouse model. Our human data showed that most maternal BPA exposure effects on brain volumes were small, with the largest effects observed in the opercular region of the inferior frontal gyrus (ρ = −0.2754), superior occipital gyrus (ρ = −0.2556), and postcentral gyrus (ρ = 0.2384). In mice, gestational exposure to an equivalent level of BPA (2.25 μg BPA/kg bw/day) induced structural alterations in brain regions including the superior olivary complex (SOC) and bed nucleus of stria terminalis (BNST) with larger effect sizes (1.07 ≤ Cohen's d ≤ 1.53). Human (n = 87) and rodent (n = 8 each group) sample sizes, while small, are considered adequate to perform the primary endpoint analysis. Combined, these human and mouse data suggest that gestational exposure to low levels of BPA may have some impacts on the developing brain at the resolution of MRI.

Publication bias is a ubiquitous threat to the validity of meta‐analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods' performance to depend on the true data generating process, and no method consistently outperforms the others across a wide range of conditions. Unfortunately, when different methods lead to contradicting conclusions, researchers can choose those methods that lead to a desired outcome. To avoid the condition‐dependent, all‐or‐none choice between competing methods and conflicting results, we extend robust Bayesian meta‐analysis and model‐average across two prominent approaches of adjusting for publication bias: (1) selection models of p‐values and (2) models adjusting for small‐study effects. The resulting model ensemble weights the estimates and the evidence for the absence/presence of the effect from the competing approaches with the support they receive from the data. Applications, simulations, and comparisons to preregistered, multi‐lab replications demonstrate the benefits of Bayesian model‐averaging of complementary publication bias adjustment methods. This article is protected by copyright. All rights reserved.

We discuss a newly published study examining how phrases are used in clinical trials to describe results when the estimated P-value is close to (slightly above or slightly below) 0.05, which has been arbitrarily designated by convention as the boundary for ‘statistical significance’. Terms such as ‘marginally significant’, ‘trending towards significant’, and ‘nominally significant’ are well represented in biomedical literature, but are not actually scientifically meaningful. Acknowledging that ‘statistical significance’ remains a major determinant of publication, we propose that scientific journals de-emphasise the use of P-values for null hypothesis significance testing, a purpose for which they were never intended, and avoid the use of these ambiguous and confusing terms in scientific articles. Instead, investigators could simply report their findings: effect sizes, P-values, and confidence intervals (or their Bayesian equivalents), and leave it to the discerning reader to infer the clinical applicability and importance. Our goal should be to move away from describing studies (or trials) as positive or negative based on an arbitrary P-value threshold, and rather to judge whether the scientific evidence provided is informative or uninformative.

The Global Competitiveness Index measures countries' ability to provide high levels of prosperity to their citizens, and depends on how productively a country uses its available resources. Over the last five years, Panama has fallen 24 positions in this index (from 40th in the 2012-2013 period to 64th in the 2017-2018 period). Research productivity is one of the variables in the calculation of this global index, which implies that the production and dissemination of scientific work by the country's research centres and universities has declined.
Panama has 0.28 researchers per 1,000 workers, a figure far below the Latin American average, which is almost four times higher (Nevache, 2019). It is therefore very important to strengthen research "settings" in the country and, in particular, in the university sector.
The Instituto de Investigación of the Asociación de Universidades Particulares de Panamá has carried out two studies (IdIA, 2016 and 2017) on the productivity, visibility, and impact of these higher-education institutions in the country. Between 2014 and 2016, 250 research projects were registered, and as of 31 December 2016, 117 were in progress. As for formative research, 3,574 projects were counted for the 2014-2016 period (IdIA, 2016). Among the problems detected are the limited dissemination of formative research, the low number of professors who do research, and the lack of institutional incentives to promote research.

Corpus linguistics continues to be a vibrant methodology applied across highly diverse fields of research in the language sciences. With the current steep rise in corpus sizes, computational power, statistical literacy and multi-purpose software tools, and inspired by neighbouring disciplines, approaches have diversified to an extent that calls for an intensification of the accompanying critical debate. Bringing together a team of leading experts, this book follows a unique design, comparing advanced methods and approaches current in corpus linguistics, to stimulate reflective evaluation and discussion. Each chapter explores the strengths and weaknesses of different datasets and techniques, presenting a case study and allowing readers to gauge methodological options in practice. Contributions also provide suggestions for further reading, and data and analysis scripts are included in an online appendix. This is an important and timely volume, and will be essential reading for any linguist interested in corpus-linguistic approaches to variation and change.

In the archaeological tradition of what is today Peru, studies of sedentary agricultural groups have accorded a minor role to the analysis of stone tools relative to other suites of material culture. Here, we illustrate the value of such lithic collections via a case study of settlement sites from the Chachapoyas region of northern Peru (AD 300–1500). This study demonstrates the potential of methods such as use-wear microscopy and raw material analysis to address questions of theoretical interest to archaeologists studying sedentary society, such as subsistence, household behavior, and ceremonial practices. A set of generalized linear models of the spatial distribution of volcanic stone indicates that lithic raw material acquisition at these ceramic period sites was likely embedded in other activities. In addition, we examine an unusual set of limestone and carbonate-patinated artifacts that suggest that lithic procurement and selection were informed and strategic, if not conforming to expected technological priorities. We suggest that, by taking the potential value of lithic artifacts into consideration from project design through field collection and assemblage sampling, researchers can minimize biases that may otherwise limit the value of lithic assemblages.

This commentary reviews the arguments for and against the use of p-values put forward in the Journal and other forums, and shows that they are all missing both a measure and concept of "evidence." The mathematics and logic of evidential theory are presented, with the log-likelihood ratio used as the measure of evidence. The profoundly different philosophy behind evidential methods (as compared to traditional ones) is presented, as well as a comparative example showing the difference between the two approaches. The reasons why we mistakenly ascribe evidential meaning to p-values and related measures are discussed. Unfamiliarity with the technology and philosophy of evidence is seen as the main reason why certain arguments about p-values persist, and why they are frequently contradictory and confusing.

Tests of statistical significance are often used by investigators in reporting the results of clinical research. Although such tests are useful tools, the significance levels are not appropriate indices of the size or importance of differences in outcome between treatments. Lack of "statistical significance" can be misinterpreted in small studies as evidence that no important difference exists. Confidence intervals are important but underused supplements to tests of significance for reporting the results of clinical investigations. Their usefulness is discussed here, and formulas are presented for calculating confidence intervals with types of data commonly found in clinical trials.

Conventional interpretation of clinical trials relies heavily on the classic p value. The p value, however, represents only a false-positive rate, and does not tell the probability that the investigator's hypothesis is correct, given his observations. This more relevant posterior probability can be quantified by an extension of Bayes' theorem to the analysis of statistical tests, in a manner similar to that already widely used for diagnostic tests. Reanalysis of several published clinical trials according to Bayes' theorem shows several important limitations of classic statistical analysis. Classic analysis is most misleading when the hypothesis in question is already unlikely to be true, when the baseline event rate is low, or when the observed differences are small. In such cases, false-positive and false-negative conclusions occur frequently, even when the study is large, when interpretation is based solely on the p value. These errors can be minimized if revised policies for analysis and reporting of clinical trials are adopted that overcome the known limitations of classic statistical theory with applicable bayesian conventions.

Given an observed test statistic and its degrees of freedom, one may compute the observed P value with most statistical packages. It is unknown to what extent test statistics and P values are congruent in published medical papers.
We checked the congruence of statistical results reported in all the papers of volumes 409-412 of Nature (2001) and a random sample of 63 results from volumes 322-323 of BMJ (2001). We also tested whether the frequencies of the last digit of a sample of 610 test statistics deviated from a uniform distribution (i.e., equally probable digits).
11.6% (21 of 181) and 11.1% (7 of 63) of the statistical results published in Nature and BMJ respectively during 2001 were incongruent, probably mostly due to rounding, transcription, or typesetting errors. At least one such error appeared in 38% and 25% of the papers of Nature and BMJ, respectively. In 12% of the cases, the significance level might change by one or more orders of magnitude. The frequencies of the last digit of statistics deviated from the uniform distribution and suggested digit preference in rounding and reporting.
This incongruence of test statistics and P values is another sign that statistical practice is generally poor, even in the most renowned scientific journals, and that the quality of published papers should be more carefully checked and valued.
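
A congruence check of this kind is easy to automate. The sketch below is a simplified, stdlib-only illustration using a z statistic (the study itself worked from reported test statistics and their degrees of freedom); the tolerance is an assumption standing in for rounding error:

```python
import math

def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def congruent(z, reported_p, tol=0.005):
    """Flag reported P values that disagree with their test statistic
    by more than rounding alone could explain (tolerance illustrative)."""
    return abs(two_sided_p(z) - reported_p) <= tol

print(congruent(1.96, 0.05))   # True: statistic and P value agree
print(congruent(1.96, 0.01))   # False: an incongruent pair
```

Run over a corpus of extracted (statistic, P value) pairs, a check like this reproduces the kind of audit the paper describes.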

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
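
The framework sketched in this essay reduces to a positive-predictive-value calculation. A minimal sketch (variable names are mine; `R` is the pre-study odds of a true relationship and `bias` the fraction u of analyses that would report a finding regardless of truth):

```python
def ppv(R, power, alpha, bias=0.0):
    """Post-study probability that a claimed research finding is true.
    R: pre-study odds of a true relationship; bias: fraction u of analyses
    that would yield a claimed finding regardless of truth."""
    true_pos = (power + bias * (1 - power)) * R
    false_pos = alpha + bias * (1 - alpha)
    return true_pos / (true_pos + false_pos)

# Well-powered study of a plausible hypothesis:
print(ppv(R=0.25, power=0.80, alpha=0.05))            # 0.8
# Underpowered, exploratory, somewhat biased field:
print(ppv(R=0.01, power=0.20, alpha=0.05, bias=0.2))  # ~0.015: likely false
```

With small pre-study odds, low power, and modest bias, the claimed finding is far more likely to be false than true, which is the essay's central point.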

This paper reviews the research relating Logo, as a programming language, to the learning and teaching of secondary mathematics. The paper makes a clear distinction between Logo used as a programming language, and Logo used in the terms originally put forth by Papert (1980). The main thesis is that research studies in this area often require that the experimental group learn Logo as well as the regular mathematical content that the control group is learning. Thus the Logo group is often learning more. A result of no significant difference can therefore be more important than a casual glance would indicate.

In a 1935 paper and in his book Theory of Probability, Jeffreys developed a methodology for quantifying the evidence in favor of a scientific theory. The centerpiece was a number, now called the Bayes factor, which is the posterior odds of the null hypothesis when the prior probability on the null is one-half. Although there has been much discussion of Bayesian hypothesis testing in the context of criticism of P-values, less attention has been given to the Bayes factor as a practical tool of applied statistics. In this article we review and discuss the uses of Bayes factors in the context of five scientific applications in genetics, sports, ecology, sociology, and psychology, and emphasize several points about their use in practice.
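
For a concrete, if simplified, illustration of a Bayes factor: with normally distributed data and a normal prior on the effect under the alternative, the factor is a ratio of two marginal densities. The prior scale `tau` below is an assumption of this sketch, not Jeffreys' own choice (he favoured a Cauchy prior):

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_factor_01(xbar, se, tau):
    """BF for H0: mu = 0 against H1: mu ~ N(0, tau^2), given an estimate
    xbar with standard error se. Under H1, xbar ~ N(0, se^2 + tau^2)."""
    return normal_pdf(xbar, se ** 2) / normal_pdf(xbar, se ** 2 + tau ** 2)

# An estimate 1.96 standard errors from zero (P = 0.05) gives BF of about 0.54:
# barely 2-to-1 odds against the null, far weaker than "1 in 20" suggests.
print(round(bayes_factor_01(1.96, 1.0, 1.0), 2))
```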

The prime object of this book is to put into the hands of research workers, and especially of biologists, the means of applying statistical tests accurately to numerical data accumulated in their own laboratories or available in the literature.

I read with interest Joe Fleiss' Letter to the Editor of this journal [8:394 (1987)], entitled "Some Thoughts on Two-Tailed Tests." Certainly Fleiss is one of the more respected members and practitioners in our profession and I respect his opinion. However, I do not believe that he gives sufficient treatment of the one-tailed versus two-tailed issue. I, for one, believe that there are many situations where a one-tailed test is the appropriate test, provided that hypothesis testing itself is meaningful to do. On this point, I shall only state that fundamentally I believe that hypothesis testing is appropriate only in those trials designed to provide a confirmatory answer to a medically and/or scientifically important question. Otherwise, at the analysis stage, one is either in an estimation mode or a hypothesis-generating mode. I should underscore that helping to define the question at the protocol development stage is one of the most important contributions the statistician can make. The question to be answered is the research objective and, I believe most would agree, should be the alternative hypothesis (HA) in the hypothesis testing framework, the idea being that contradicting the null hypothesis (H0) is seen as evidence in support of the research objective. So one point in support of one-sided tests is that if the question the research is directed toward is unidirectional, then significance tests should be one-sided. A second point is that we should have internal consistency in our construction of the alternative hypothesis. An example of what I mean here is the dose response or dose comparison trial. Few (none that I have asked) statisticians would disagree that dose response as a research objective, captured in the hypothesis specification framework, is H0: μp = μd1 = μd2 versus HA: μp ≤ μd1 ≤ μd2, where for simplicity I have assumed that there are two doses, d1 and d2, of the test drug, a placebo (p) control, and the effect of drug is expected to be nondecreasing.
If this is the case and if, for some reason, the research is conducted in the absence of the d1 group, then why would H0: μp = μd2 become HA: μp ≠ μd2? A third point is that if the trial is designed to be confirmatory, then the alternative cannot be two-sided and still be logical. I believe this holds for positive controlled trials as well as for placebo controlled trials. However, it is more likely to gain broader acceptance for placebo controlled trials. To elaborate, suppose the (confirmatory) trial has one dose group and one placebo group and a two-sided alternative was seen as appropriate at the design stage. Suppose further that the analyst is masked and that the only results known are the F statistic and corresponding p value (which is significant). The analyst then has to search for where the difference lies, a situation no different from those that invoke multiple range tests. That search fundamentally precludes a confirmatory conclusion even if the direction favors the drug. [Controlled Clinical Trials 9:383–384 (1988), © Elsevier Science Publishing Co., Inc.]

The Fisher and Neyman-Pearson approaches to testing statistical hypotheses are compared with respect to their attitudes to the interpretation of the outcome, to power, to conditioning, and to the use of fixed significance levels. It is argued that despite basic philosophical differences, in their main practical aspects the two theories are complementary rather than contradictory and that a unified approach is possible that combines the best features of both. As applications, the controversies about the Behrens-Fisher problem and the comparison of two binomials (2 × 2 tables) are considered from the present point of view.

The problem of testing a point null hypothesis (or a “small interval” null hypothesis) is considered. Of interest is the relationship between the P value (or observed significance level) and conditional and Bayesian measures of evidence against the null hypothesis. Although one might presume that a small P value indicates the presence of strong evidence against the null, such is not necessarily the case. Expanding on earlier work [especially Edwards, Lindman, and Savage (1963) and Dickey (1977)], it is shown that actual evidence against a null (as measured, say, by posterior probability or comparative likelihood) can differ by an order of magnitude from the P value. For instance, data that yield a P value of .05, when testing a normal mean, result in a posterior probability of the null of at least .30 for any objective prior distribution. (“Objective” here means that equal prior weight is given the two hypotheses and that the prior is symmetric and nonincreasing away from the null; other definitions of “objective” will be seen to yield qualitatively similar results.) The overall conclusion is that P values can be highly misleading measures of the evidence provided by the data against the null hypothesis.
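
The order-of-magnitude gap the abstract describes can be made concrete with the later -e·p·ln(p) calibration of Sellke, Bayarri and Berger (2001), a related (not identical) bound over a wide class of alternatives:

```python
import math

def min_posterior_null(p):
    """Lower bound on P(H0 | data) when the prior P(H0) is 1/2, using the
    -e*p*ln(p) Bayes-factor calibration; valid only for p < 1/e."""
    assert p < 1 / math.e
    bf_bound = -math.e * p * math.log(p)   # lower bound on the BF for H0
    return bf_bound / (1 + bf_bound)

# P = .05 corresponds to a posterior probability of the null of at least ~0.29,
# consistent with the "at least .30" figure quoted in the abstract.
print(round(min_posterior_null(0.05), 2))
```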

The Empire of Chance tells how quantitative ideas of chance transformed the natural and social sciences, as well as daily life over the last three centuries. A continuous narrative connects the earliest application of probability and statistics in gambling and insurance to the most recent forays into law, medicine, polling and baseball. Separate chapters explore the theoretical and methodological impact in biology, physics and psychology. Themes recur - determinism, inference, causality, free will, evidence, the shifting meaning of probability - but in dramatically different disciplinary and historical contexts. In contrast to the literature on the mathematical development of probability and statistics, this book centres on how these technical innovations remade our conceptions of nature, mind and society. Written by an interdisciplinary team of historians and philosophers, this readable, lucid account keeps technical material to an absolute minimum. It is aimed not only at specialists in the history and philosophy of science, but also at the general reader and scholars in other disciplines.

Bayesian statistics, a currently controversial viewpoint concerning statistical inference, is based on a definition of probability as a particular measure of the opinions of ideally consistent people. Statistical inference is modification of these opinions in the light of evidence, and Bayes' theorem specifies how such modifications should be made. The tools of Bayesian statistics include the theory of specific distributions and the principle of stable estimation, which specifies when actual prior opinions may be satisfactorily approximated by a uniform distribution. A common feature of many classical significance tests is that a sharp null hypothesis is compared with a diffuse alternative hypothesis. Often evidence which, for a Bayesian statistician, strikingly supports the null hypothesis leads to rejection of that hypothesis by standard classical procedures. The likelihood principle emphasized in Bayesian statistics implies, among other things, that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience.

In some comparisons - for example, between two means or two proportions - there is a choice between two sided or one sided tests of significance (all comparisons of three or more groups are two sided).* This is the eighth in a series of occasional notes on medical statistics. When we use a test of significance to compare two groups we usually start with the null hypothesis that there is no difference between the populations from which the data come. If this hypothesis is not true the alternative hypothesis must be true - that there is a difference. Since the null hypothesis specifies no direction for the difference, neither does the alternative hypothesis, and so we have a two sided test. In a one sided test the alternative hypothesis does specify a direction - for example, that an active treatment is better than a placebo. This is sometimes justified by saying that we are not interested in the possibility that the active treatment is worse than no treatment. This possibility is still part of the test; it is part of the null hypothesis, which now states that the difference in the population is zero or in favour of the placebo. A one sided test is sometimes appropriate. Luthra et al investigated the effects of laparoscopy and hydrotubation on the fertility of women presenting at an infertility clinic.1 After some months laparoscopy was carried out on those who had still not conceived. These women were then observed for several further months and some of these women also conceived. The conception rate in the period before laparoscopy was compared with that afterwards. The less fertile a woman is the longer it is likely to take her to conceive. Hence, the women who had the laparoscopy should have a lower conception rate (by an unknown amount) than the larger group who entered the study, because the more fertile women had conceived before their turn for laparoscopy came.
To see whether laparoscopy increased fertility, Luthra et al tested the null hypothesis that the conception rate after laparoscopy was less than or equal to that before. The alternative hypothesis was that the conception rate after laparoscopy was higher than that before. A two sided test was inappropriate because if the laparoscopy had no effect on fertility the conception rate after laparoscopy was expected to be lower. One sided tests are not often used, and sometimes they are not justified. Consider the following example. Twenty five patients with breast cancer were given radiotherapy treatment of 50 Gy in fractions of 2 Gy over 5 weeks.2 Lung function was measured initially, at one week, at three months, and at one year. The aim of the study was to see whether lung function was lowered following radiotherapy. Some of the results are shown in the table, the forced vital capacity being compared between the initial and each subsequent visit using one sided tests. The direction of the one sided tests was not specified, but it may appear reasonable to test the alternative hypothesis that forced vital capacity decreases after radiotherapy, as there is no reason to suppose that damage to the lungs would increase it. The null hypothesis is that forced vital capacity does not change or increases. If the forced vital capacity increases, this is consistent with the null hypothesis, and the more it increases the more consistent the data are with the null hypothesis. Because the differences are not all in the same direction, at least one P value should be greater than 0.5. What has been done here is to test the null hypothesis that forced vital capacity does not change or decreases from visit 1 to visit 2 (one week), and to test the null hypothesis that it does not change or increases from visit 1 to visit 3 (three months) or visit 4 (one year). These authors seem to have carried out one sided tests in both directions for each visit and then taken the smaller probability.
If there is no difference in the population, the probability of getting a significant difference by this approach is 10%, not 5% as it should be. The chance of a spurious significant difference is doubled. Two sided tests should be used, which would give probabilities of 0.26, 0.064, and 0.38, and no significant differences. In general a one sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification. In medicine, things do not always work out as expected, and researchers may be surprised by their results. For example, Galloe et al found that oral magnesium significantly increased the risk of cardiac events, rather than decreasing it as they had hoped.3 If a new treatment kills a lot of patients we should not simply abandon it; we should ask why this happened. Two sided tests should be used unless there is a very good reason for doing otherwise. If one sided tests are to be used the direction of the test must be specified in advance. One sided tests should never be used simply as a device to make a conventionally non-significant difference significant.
References
1. Luthra P, Bland JM, Stanton SL. Incidence of pregnancy after laparoscopy and hydrotubation. BMJ 1982;284:1013.
2. Lund MB, Myhre KI, Melsom H, Johansen B. The effect on pulmonary function of tangential field technique in radiotherapy for carcinoma of the breast. Br J Radiol 1991;64:520-3.
3. Galloe AM, Rasmussen HS, Jorgensen LN, Aurup P, Balslov S, Cintin C, Graudal N, McNair P. Influence of oral magnesium supplementation on cardiac events among survivors of an acute myocardial infarction. BMJ 1993;307:585-7.
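
The 10%-instead-of-5% point is easy to verify by simulation. This stdlib-only sketch draws test statistics under a true null hypothesis and, as the criticised analysis did, takes the smaller of the two one sided P values:

```python
import math
import random

def one_sided_p(z):
    """One sided (upper-tail) P value for a standard normal statistic."""
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(1)
n_sims = 20000
hits = 0
for _ in range(n_sims):
    z = random.gauss(0.0, 1.0)                # the null hypothesis is true
    p = min(one_sided_p(z), one_sided_p(-z))  # pick whichever direction "works"
    if p < 0.05:
        hits += 1

print(hits / n_sims)   # close to 0.10: double the nominal 5% error rate
```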

This paper concerns interim analysis in clinical trials involving two treatments from the points of view of both classical and Bayesian inference. I criticize classical hypothesis testing in this setting and describe and recommend a Bayesian approach in which sampling stops when the probability that one treatment is the better exceeds a specified value. I consider application to normal sampling analysed in stages and evaluate the gain in average sample number as a function of the number of interim analyses.
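
The stopping quantity this approach monitors, the posterior probability that one treatment is the better, has a simple closed form for normal data with vague priors. A sketch with illustrative numbers and threshold:

```python
import math

def prob_first_better(m1, se1, m2, se2):
    """Posterior P(mu1 > mu2) for two treatment arms, assuming normal
    likelihoods and (approximately) flat priors on the two means."""
    z = (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Interim look: stop the trial if this probability exceeds, say, 0.99.
p_better = prob_first_better(10.0, 1.0, 8.0, 1.0)
print(p_better > 0.99)   # False: about 0.92, so sampling would continue
```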

For both P-values and confidence intervals, an alpha level is chosen to set limits of acceptable probability for the role of chance in the observed distinctions. The level of alpha is used either for direct comparison with a single P-value, or for determining the extent of a confidence interval. "Statistical significance" is proclaimed if the calculations yield a P-value that is below alpha, or a 1-alpha confidence interval whose range excludes the null result of "no difference." Both the P-value and confidence-interval methods are essentially reciprocal, since they use the same principles of probabilistic calculation; and both can yield distorted or misleading results if the data do not adequately conform to the underlying mathematical requirements. The major scientific disadvantage of both methods is that their "significance" is merely an inference derived from principles of mathematical probability, not an evaluation of substantive importance for the "big" or "small" magnitude of the observed distinction. The latter evaluation has not received adequate attention during the emphasis on probabilistic decisions; and careful principles have not been developed either for the substantive reasoning or for setting appropriate boundaries for "big" or "small." After a century of "significance" inferred exclusively from probabilities, a basic scientific challenge is to develop methods for deciding what is substantively impressive or trivial.
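
The reciprocity of the two methods can be checked directly: a two-sided P value falls below alpha exactly when the corresponding 1-alpha confidence interval excludes the null value. A stdlib sketch for a normal estimate, with the critical value hard-coded for alpha = 0.05:

```python
import math

def two_sided_p(est, se):
    """Two-sided P value for a normal estimate with standard error se."""
    z = abs(est) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def ci95_excludes_zero(est, se):
    """Does the 95% confidence interval est +/- 1.96*se exclude zero?"""
    z_crit = 1.959964   # Phi^{-1}(0.975)
    return abs(est) - z_crit * se > 0

# The two decisions always agree, including borderline cases:
for est in (1.5, 1.9, 2.1, 3.0):
    assert (two_sided_p(est, 1.0) < 0.05) == ci95_excludes_zero(est, 1.0)
print("P < 0.05 and a 95% CI excluding 0 always coincide")
```

This agreement is exactly the "reciprocal" relationship the abstract describes; neither method, of course, speaks to whether the observed distinction is substantively big or small.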

This article has no abstract; the first 100 words appear below.
Many medical researchers believe that it would be fruitless to submit for publication any paper that lacks statistical tests of significance. Their belief is not ill founded: editors and referees commonly rely on tests of significance as indicators of a sophisticated and meaningful statistical analysis, as well as the primary means to assess sampling variability in a study. The preoccupation with significance tests is embodied in the focus on whether the P value is less than 0.05; results are considered "significant" or "not significant" according to whether or not the P value is less than or greater than 0.05. Dr. . . .
Kenneth J. Rothman, Dr.P.H.
Harvard School of Public Health, Boston, MA 02115
*Freiman JA, Chalmers TC, Smith HS Jr, et al: The importance of beta, the Type II error and sample size in the design and interpretation of the randomized controlled trial: survey of 71 "negative" trials. N Engl J Med 299:690–694, 1978

‘Statistical significance’ is commonly tested in biologic research when the investigator has found an impressive difference in two groups of animals or people. If the groups are relatively small, the investigator (or a critical reviewer) becomes worried about a statistical problem. Although the observed difference in the means or percentages is large enough to be biologically (or clinically) significant, do the groups contain enough members for the numerical differences to be ‘statistically significant’?

This paper reviews the role of statistics in causal inference. Special attention is given to the need for randomization to justify causal inferences from conventional statistics, and the need for random sampling to justify descriptive inferences. In most epidemiologic studies, randomization and random sampling play little or no role in the assembly of study cohorts. I therefore conclude that probabilistic interpretations of conventional statistics are rarely justified, and that such interpretations may encourage misinterpretation of nonrandomized studies. Possible remedies for this problem include deemphasizing inferential statistics in favor of data descriptors, and adopting statistical techniques based on more realistic probability models than those in common use.

Overemphasis on hypothesis testing--and the use of P values to dichotomise significant or non-significant results--has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display. Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract.
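
The calculations the paper recommends are straightforward. A minimal large-sample sketch for two of the cases mentioned, a difference between means and a single proportion (simple Wald intervals; the paper's own formulas may differ in detail):

```python
import math

def ci_mean_diff(m1, s1, n1, m2, s2, n2, z=1.96):
    """Approximate 95% CI for a difference between two means (large samples)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    d = m1 - m2
    return d - z * se, d + z * se

def ci_proportion(k, n, z=1.96):
    """Approximate 95% Wald CI for a single proportion k/n."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

print(ci_proportion(30, 100))   # roughly (0.21, 0.39)
```

Reporting such a range conveys both the size of the difference and its precision, which a bare "significant/not significant" verdict cannot.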

It is not generally appreciated that the p value, as conceived by R. A. Fisher, is not compatible with the Neyman-Pearson hypothesis test in which it has become embedded. The p value was meant to be a flexible inferential measure, whereas the hypothesis test was a rule for behavior, not inference. The combination of the two methods has led to a reinterpretation of the p value simultaneously as an "observed error rate" and as a measure of evidence. Both of these interpretations are problematic, and their combination has obscured the important differences between Neyman and Fisher on the nature of the scientific method and inhibited our understanding of the philosophic implications of the basic methods in use today. An analysis using another method promoted by Fisher, mathematical likelihood, shows that the p value substantially overstates the evidence against the null hypothesis. Likelihood makes clearer the distinction between error rates and inferential evidence and is a quantitative tool for expressing evidential strength that is more appropriate for the purposes of epidemiology than the p value.

The recent controversy over the increased risk of venous thrombosis with third generation oral contraceptives illustrates the public policy dilemma that can be created by relying on conventional statistical tests and estimates: case-control studies showed a significant increase in risk and forced a decision either to warn or not to warn. Conventional statistical tests are an improper basis for such decisions because they dichotomise results according to whether they are or are not significant and do not allow decision makers to take explicit account of additional evidence--for example, of biological plausibility or of biases in the studies. A Bayesian approach overcomes both these problems. A Bayesian analysis starts with a "prior" probability distribution for the value of interest (for example, a true relative risk)--based on previous knowledge--and adds the new evidence (via a model) to produce a "posterior" probability distribution. Because different experts will have different prior beliefs, sensitivity analyses are important to assess the effects on the posterior distributions of these differences. Sensitivity analyses should also examine the effects of different assumptions about biases and about the model which links the data with the value of interest. One advantage of this method is that it allows such assumptions to be handled openly and explicitly. Data presented as a series of posterior probability distributions would be a much better guide to policy, reflecting the reality that degrees of belief are often continuous, not dichotomous, and often vary from one person to another in the face of inconclusive evidence.
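The prior-plus-evidence mechanism above can be sketched with a conjugate normal update on the log relative risk scale. The data estimate (relative risk 2.0, standard error 0.25 on the log scale) and the two priors are hypothetical, chosen only to show how a sensitivity analysis across different prior beliefs works.

```python
import math

def posterior(prior_mean, prior_sd, data_mean, data_sd):
    """Conjugate normal update on the log relative risk scale:
    precision-weighted average of prior and data."""
    w_prior = 1.0 / prior_sd**2
    w_data = 1.0 / data_sd**2
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_sd = math.sqrt(1.0 / (w_prior + w_data))
    return post_mean, post_sd

# Hypothetical study estimate: relative risk 2.0, SE 0.25 on the log scale.
data_mean, data_sd = math.log(2.0), 0.25

# Sensitivity analysis: a sceptical prior centred on "no effect" vs a vague one.
for label, pm, psd in [("sceptical", 0.0, 0.2), ("vague", 0.0, 10.0)]:
    m, s = posterior(pm, psd, data_mean, data_sd)
    print(f"{label} prior: posterior RR = {math.exp(m):.2f} (log-scale SD {s:.2f})")
```

With these numbers the sceptical prior pulls the posterior relative risk down toward 1 (about 1.3), while the vague prior leaves it near the data estimate of 2.0 — making explicit how conclusions depend on prior belief when the evidence is inconclusive.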

An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain "error rates," without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used--the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.

Bayesian inference is usually presented as a method for determining how scientific belief should be modified by data. Although Bayesian methodology has been one of the most active areas of statistical development in the past 20 years, medical researchers have been reluctant to embrace what they perceive as a subjective approach to data analysis. It is little understood that Bayesian methods have a data-based core, which can be used as a calculus of evidence. This core is the Bayes factor, which in its simplest form is also called a likelihood ratio. The minimum Bayes factor is objective and can be used in lieu of the P value as a measure of the evidential strength. Unlike P values, Bayes factors have a sound theoretical foundation and an interpretation that allows their use in both inference and decision making. Bayes factors show that P values greatly overstate the evidence against the null hypothesis. Most important, Bayes factors require the addition of background knowledge to be transformed into inferences--probabilities that a given conclusion is right or wrong. They make the distinction clear between experimental evidence and inferential conclusions while providing a framework in which to combine prior with current evidence.
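The claim that P values overstate the evidence can be made concrete with the minimum Bayes factor for a normal test statistic, exp(-z²/2) — the strongest possible evidence against the null over all simple alternative hypotheses. A small sketch comparing it with conventional two-sided significance thresholds:

```python
import math

def min_bayes_factor(z):
    """Minimum Bayes factor for a normal test statistic z: the smallest
    (strongest) likelihood ratio of null to alternative over all simple
    alternatives, exp(-z^2 / 2)."""
    return math.exp(-z**2 / 2.0)

# z values corresponding roughly to two-sided p = 0.05, 0.01, 0.001
for z in (1.96, 2.58, 3.29):
    print(f"z = {z}: minimum Bayes factor = {min_bayes_factor(z):.4f}")
```

A result at p = 0.05 yields a minimum Bayes factor of about 0.15: even under the most favourable alternative, the null is only about one-seventh as well supported as the alternative — far weaker evidence than "1 in 20" intuitions about p = 0.05 suggest, and the best case rather than the typical one.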

Genetic association studies for multigenetic diseases are like fishing for the truth in a sea of trillions of candidate analyses. Red herrings are unavoidably common, and bias might cause serious misconceptions. However, a sizeable proportion of identified genetic associations are probably true. Meta-analysis, a rigorous, comprehensive, quantitative synthesis of all the available data, might help us to separate the true from the false.
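One common quantitative-synthesis method alluded to above is fixed-effect inverse-variance pooling. The per-study estimates below are hypothetical log odds ratios and standard errors for a single candidate association, invented for illustration.

```python
import math

# Hypothetical per-study (log odds ratio, standard error) pairs.
studies = [(0.40, 0.20), (0.10, 0.15), (0.25, 0.30), (-0.05, 0.25)]

# Fixed-effect inverse-variance pooling: weight each study by 1/SE^2.
weights = [1.0 / se**2 for _, se in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"pooled OR = {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(pooled - 1.96 * pooled_se):.2f}"
      f" to {math.exp(pooled + 1.96 * pooled_se):.2f})")
```

Pooling sharpens the estimate relative to any single study, but — as the abstract warns — it cannot by itself remove bias common to all the included studies.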

Although statistical textbooks have for a long time asserted that "not significant" merely implies "not proven," investigators still display confusion regarding the interpretation of the verdict. This appears to be due to the ambiguity of the term "significance," to inadequate exposition, and especially to the behavior of textbook writers who in the analysis of data act as if "not significant" means "nonexistent" or "unimportant." Appropriate action after a verdict of "nonsignificance" depends on many circumstances and requires much thought. "Significance" tests often could be, and in some instances should be, avoided; then "nonsignificance" would cease to be a serious problem.