
# Null Hypothesis Significance Testing Interpreted and Calibrated by Estimating Probabilities of Sign Errors: A Bayes-Frequentist Continuum


## Abstract

Hypothesis tests are conducted not only to determine whether a null hypothesis (H0) is true but also to determine the direction or sign of an effect. A simple estimate of the posterior probability of a sign error is PSE = (1 - PH0) p/2 + PH0, depending only on a two-sided p value and PH0, an estimate of the posterior probability of H0. A convenient option for PH0 is the posterior probability derived from estimating the Bayes factor to be its e p ln(1/p) lower bound. In that case, PSE depends only on p and an estimate of the prior probability of H0. PSE provides a continuum between significance testing and traditional Bayesian testing. The former effectively assumes the prior probability of H0 is 0, as some statisticians argue. In that case, PSE is equal to a one-sided p value. (In that sense, PSE is a calibrated p value.) In traditional Bayesian testing, on the other hand, the prior probability of H0 is at least 50%, which usually brings PSE close to PH0.
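The quantities in the abstract can be sketched numerically. The Python sketch below assumes that the posterior odds of H0 equal the prior odds multiplied by the Bayes factor bound e p ln(1/p), and that the bound is capped at 1 for p ≥ 1/e; both are assumptions made here for illustration, and the function names are hypothetical rather than taken from the article.

```python
import math


def bayes_factor_bound(p):
    """Lower bound e * p * ln(1/p) on the Bayes factor in favor of H0.

    Assumption: the bound is capped at 1 when p >= 1/e.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p >= 1.0 / math.e:
        return 1.0
    return math.e * p * math.log(1.0 / p)


def posterior_prob_h0(p, prior_h0):
    """Estimate PH0 from a two-sided p value and a prior probability of H0."""
    if not 0.0 <= prior_h0 <= 1.0:
        raise ValueError("prior_h0 must lie in [0, 1]")
    if prior_h0 == 1.0:
        return 1.0
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = prior_odds * bayes_factor_bound(p)
    return posterior_odds / (1.0 + posterior_odds)


def prob_sign_error(p, prior_h0):
    """PSE = (1 - PH0) * p / 2 + PH0, as stated in the abstract."""
    ph0 = posterior_prob_h0(p, prior_h0)
    return (1.0 - ph0) * p / 2.0 + ph0


if __name__ == "__main__":
    for prior in (0.0, 0.5):
        print(prior, prob_sign_error(0.05, prior))
```

With a prior probability of 0, PH0 is 0 and the function returns p/2, the one-sided p value, which is the significance-testing end of the continuum described above.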


... While empirical Bayes estimation is traditionally considered as a multiple comparison procedure (e.g., Efron 2010), its application to estimating false discovery rates applies not only to multiple testing but also to testing a single null hypothesis (e.g., Bickel 2019a, chapter 7; 2021). That can be seen in terms of the problem of testing a null hypothesis against the most commonly used type of alternative hypothesis, that with two sides. ...
... where $\vartheta$ and $P$ are the parameter of interest and the p value as random variables, $\theta_{H_0}$ is the value of $\theta$, the scalar parameter of interest under the null hypothesis that $\vartheta = \theta_{H_0}$ (that $\vartheta$ is sufficiently close to $\theta_{H_0}$ for practical purposes), and $B$ is the Bayes factor $f(p \mid \vartheta = \theta_{H_0}) / f(p \mid \vartheta \neq \theta_{H_0})$ based on $f$, the probability density function of $P$. Any nuisance parameters have been eliminated by integration with respect to their prior distributions. When $\Pr(\vartheta = \theta_{H_0} \mid P = p)$ is a frequency-type probability, it is known as the local false discovery rate (LFDR) in the empirical Bayes literature on testing multiple hypotheses (e.g., Efron 2010; Efron et al. 2001) and on testing a single hypothesis (e.g., Bickel 2017, 2019a, 2021). Under broad assumptions, Sellke, Bayarri, and Berger (2001) derived ...
... Additional arguments for $\widehat{\mathrm{LFDR}} = \mathrm{LFDR}$ appear in Bickel (2021). Example 3. Different assumptions lead to different versions of $B$, the lower bound on the Bayes factor (Held and Ott 2018). ...
Article
Much of the blame for failed attempts to replicate reports of scientific findings has been placed on ubiquitous and persistent misinterpretations of the p value. An increasingly popular solution is to transform a two-sided p value to a lower bound on a Bayes factor. Another solution is to interpret a one-sided p value as an approximate posterior probability. Combining the two solutions results in confidence intervals that are calibrated by an estimate of the posterior probability that the null hypothesis is true. The combination also provides a point estimate that is covered by the calibrated confidence interval at every level of confidence. Finally, the combination of solutions generates a two-sided p value that is calibrated by the estimate of the posterior probability of the null hypothesis. In the special case of a 50% prior probability of the null hypothesis and a simple lower bound on the Bayes factor, the calibrated two-sided p value is about (1 – abs(2.7 p ln p)) p + 2 abs(2.7 p ln p) for small p. The calibrations of confidence intervals, point estimates, and p values are proposed in an empirical Bayes framework without requiring multiple comparisons.
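As a rough numerical illustration, the small-p approximation quoted above for the calibrated two-sided p value (50% prior probability of the null hypothesis, simple Bayes factor bound) can be transcribed directly. This is only a sketch of the quoted formula; the function name is hypothetical and the approximation is stated in the abstract for small p only.

```python
import math


def calibrated_two_sided_p(p):
    """Approximation (1 - |2.7 p ln p|) p + 2 |2.7 p ln p| quoted in the abstract.

    Intended for small p and a 50% prior probability of the null hypothesis.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    b = abs(2.7 * p * math.log(p))
    return (1.0 - b) * p + 2.0 * b


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(p, calibrated_two_sided_p(p))
```

For p = 0.01 the sketch gives roughly 0.26, which shows how much weaker the calibrated value is than the raw two-sided p value under these assumptions.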
... A fiducial distribution is called a confidence distribution if it has the property that there is a 95% probability that $\vartheta$ lies within the limits of a 95% confidence interval computed from the fixed data set. A confidence distribution may be interpreted as an estimate of a Bayesian posterior distribution based on a prior distribution that assumes as little as possible (e.g., Bickel, 2021). Regardless of whether $\vartheta = \vartheta_1$ or $\vartheta = \vartheta_2$, we see that $\Pr(28\ \text{MY} \le \vartheta \le 270\ \text{MY}) \ge 95\%$. ...
Preprint
Confidence intervals of divergence times and branch lengths do not reflect uncertainty about their clades or about the prior distributions and other model assumptions on which they are based. Uncertainty about the clade may be propagated to a confidence interval by multiplying its confidence level by the bootstrap proportion of its clade or by another probability that the clade is correct. (If the confidence level is 95% and the bootstrap proportion is 90%, then the uncertainty-adjusted confidence level is (0.95)(0.90) = 86%.) Uncertainty about the model can be propagated to the confidence interval by reporting the union of the confidence intervals from all the plausible models. Unless there is no overlap between the confidence intervals, that results in an uncertainty-adjusted interval that has as its lower and upper limits the most extreme limits of the models. The proposed methods of uncertainty quantification may be used together. https://doi.org/10.5281/zenodo.5212069
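The two adjustments described above can be sketched in a few lines of Python. The function names and the example intervals are hypothetical, and the union is computed only for the overlapping case that the abstract describes; intervals with a gap between them raise an error instead.

```python
def adjusted_confidence_level(confidence_level, clade_probability):
    """Multiply the confidence level by the probability that the clade is correct."""
    return confidence_level * clade_probability


def union_of_intervals(intervals):
    """Return the single interval spanning all model-specific intervals.

    Assumption: the intervals overlap, so their union is one interval whose
    limits are the most extreme lower and upper limits.
    """
    ordered = sorted(intervals)
    lower, upper = ordered[0]
    for next_lower, next_upper in ordered[1:]:
        if next_lower > upper:
            raise ValueError("intervals do not overlap; union is not a single interval")
        upper = max(upper, next_upper)
    return lower, upper


if __name__ == "__main__":
    # 95% confidence level and 90% bootstrap proportion, as in the abstract.
    print(adjusted_confidence_level(0.95, 0.90))
    # Hypothetical intervals from two plausible models.
    print(union_of_intervals([(28.0, 150.0), (100.0, 270.0)]))
```

The first call reproduces the abstract's (0.95)(0.90) product, which rounds to 86%.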
... [31] To deal with inherent biases, of which some researchers may not even be consciously aware, there is a need for a new kind of design that approaches the hypothesis from different perspectives. It has been suggested that there should be a clear identification of the underlying uncertainty model associated with the scientific study [32] [33] [34] [35] and some have even argued that NHST should be abandoned [36] [37]. ...
Preprint
It is generally accepted that the reproducibility crisis in the natural and biological sciences is in part due to the misuse of Null Hypothesis Significance Testing (NHST). We review the shortcomings in the use of NHST and then go beyond these to consider additional issues. Many natural systems are time-varying and some are scale-free, which requires the design of new methods for such cases. We also consider the problem from the perspective of information efficiency and, since three-way logic is superior to two-way logic, we argue that adding a third hypothesis may be beneficial in certain applications.
Article
A Bayesian model has two parts. The first part is a family of sampling distributions that could have generated the data. The second part of a Bayesian model is a prior distribution over the sampling distributions. Both the diagnostics used to check the model and the process of updating a failed model are widely thought to violate the standard foundations of Bayesianism. That is largely because models are checked before specifying the space of all candidate replacement models, which textbook presentations of Bayesian model averaging would require. However, that is not required under a broad class of utility functions that apply when approximate model truth is an important consideration, perhaps among other important considerations. From that class, a simple criterion for model checking emerges and suggests a coherent approach to updating Bayesian models found inadequate. The criterion only requires the specification of the prior distribution up to ratios of prior densities of the models considered until the time of the check. That criterion, while justified by Bayesian decision theory, may also be derived under possibility theory from a decision-theoretic framework that generalizes the likelihood interpretation of possibility functions.
Article
Recent literature has shown that statistically significant results are often not replicated because the “p-value <0.05” publication rule results in a high false positive rate (FPR) or false discovery rate (FDR) in some scientific communities. While recommendations to address the phenomenon vary, many amount to incorporating additional study summary information, such as prior null hypothesis odds and/or effect sizes, in some way. This article demonstrates that a statistic called the local false discovery rate (lfdr), which incorporates this information, is a sufficient summary for addressing false positive rates. Specifically, it is shown that lfdr-values among published results are sufficient for estimating the community-wide FDR for any well-defined publication policy, and that lfdr-values are sufficient for defining policies for community-wide FDR control. It is also demonstrated that, though p-values can be useful for computing an lfdr, they alone are not sufficient for addressing the community-wide FDR. Data from a recent replication study is used to compare publication policies and illustrate the FDR estimator.
Preprint
We introduce a new Empirical Bayes approach for large-scale hypothesis testing, including estimating False Discovery Rates (FDRs), and effect sizes. This approach has two key differences from existing approaches to FDR analysis. First, it assumes that the distribution of the actual (unobserved) effects is unimodal, with a mode at 0. This “unimodal assumption” (UA), although natural in many contexts, is not usually incorporated into standard FDR analysis, and we demonstrate how incorporating it brings many benefits. Specifically, the UA facilitates efficient and robust computation – estimating the unimodal distribution involves solving a simple convex optimization problem – and enables more accurate inferences provided that it holds. Second, the method takes as its input two numbers for each test (an effect size estimate, and corresponding standard error), rather than the one number usually used ( p value, or z score). When available, using two numbers instead of one helps account for variation in measurement precision across tests. It also facilitates estimation of effects, and unlike standard FDR methods our approach provides interval estimates (credible regions) for each effect in addition to measures of significance. To provide a bridge between interval estimates and significance measures we introduce the term “local false sign rate” to refer to the probability of getting the sign of an effect wrong, and argue that it is a superior measure of significance than the local FDR because it is both more generally applicable, and can be more robustly estimated. Our methods are implemented in an R package ashr available from http://github.com/stephens999/ashr .
Article
Researchers commonly use p-values to answer the question: How strongly does the evidence favor the alternative hypothesis relative to the null hypothesis? p-Values themselves do not directly answer this question and are often misinterpreted in ways that lead to overstating the evidence against the null hypothesis. Even in the “post p < 0.05 era,” however, it is quite possible that p-values will continue to be widely reported and used to assess the strength of evidence (if for no other reason than the widespread availability and use of statistical software that routinely produces p-values and thereby implicitly advocates for their use). If so, the potential for misinterpretation will persist. In this article, we recommend three practices that would help researchers more accurately interpret p-values. Each of the three recommended practices involves interpreting p-values in light of their corresponding “Bayes factor bound,” which is the largest odds in favor of the alternative hypothesis relative to the null hypothesis that is consistent with the observed data. The Bayes factor bound generally indicates that a given p-value provides weaker evidence against the null hypothesis than typically assumed. We therefore believe that our recommendations can guard against some of the most harmful p-value misinterpretations. In research communities that are deeply attached to reliance on “p < 0.05,” our recommendations will serve as initial steps away from this attachment. We emphasize that our recommendations are intended merely as initial, temporary steps and that many further steps will need to be taken to reach the ultimate destination: a holistic interpretation of statistical evidence that fully conforms to the principles laid out in the ASA statement on statistical significance and p-values.
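A minimal sketch of the Bayes factor bound follows. It assumes the bound takes the form 1 / (-e p ln p) for p < 1/e, the reciprocal of the e p ln(1/p) lower bound mentioned in the main abstract; that formula is not stated in the abstract above, and the function name is hypothetical.

```python
import math


def bayes_factor_bound_for_alternative(p):
    """Largest odds in favor of the alternative over the null for a given p value.

    Assumption: the bound is 1 / (-e * p * ln p), valid for p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("this form of the bound requires 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005):
        print(p, bayes_factor_bound_for_alternative(p))
```

Under this assumption, p = 0.05 corresponds to odds of only about 2.5 to 1 in favor of the alternative, which illustrates the abstract's point that a given p-value provides weaker evidence than is typically assumed.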
Article
In response to recommendations to redefine statistical significance to P ≤ 0.005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.
Article
It is widely acknowledged that the biomedical literature suffers from a surfeit of false positive results. Part of the reason for this is the persistence of the myth that observation of p < 0.05 is sufficient justification to claim that you have made a discovery. Unfortunately there has been no unanimity about what should be done about this problem. It is hopeless to expect users to change their reliance on p-values unless they are offered an alternative way of judging the reliability of their conclusions. If the alternative method is to have a chance of being adopted widely, it will have to be easy to understand and to calculate. One such proposal is based on calculation of false positive risk. This is likely to be accepted by users because many of them already think, mistakenly, that the false positive risk is what the p-value tells them, and because it is based on the null hypothesis that the true effect size is zero, a form of reasoning with which most users are familiar. It is suggested that p-values and confidence intervals should continue to be given, but that they should be supplemented by a single additional number that conveys the strength of the evidence better than the p-value. This number could be the prior probability that it would be necessary to believe in order to achieve a false positive risk of, say, 0.05 (which is what many users think, mistakenly, is what the p-value achieves). Alternatively, the (minimum) false positive risk could be specified based on the assumption of a prior probability of 0.5 (the largest value that can be assumed in the absence of hard prior data).
Article
We wish to answer this question: If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3:1. This is far weaker evidence than the odds of 19 to 1 that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. If you observe p = 0.001 in a well-powered experiment, it gives a likelihood ratio of almost 100:1 odds on there being a real effect. That would usually be regarded as conclusive. But the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. And, in this case, if you wanted to achieve a false positive risk of 5% you would need to observe p = 0.00045. It is recommended that the terms ‘significant’ and ‘non-significant’ should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value. Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as ‘statistically significant’. This practice must contribute to the lack of reproducibility in some areas of science. This is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.
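The arithmetic linking likelihood ratios, prior probabilities, and false positive risk can be sketched as follows. The sketch assumes that the false positive risk is the posterior probability of no real effect obtained from Bayes' rule in odds form, and it uses the rounded likelihood ratios quoted in the abstract rather than exact values; the function names are hypothetical.

```python
def false_positive_risk(likelihood_ratio, prior_real_effect):
    """Posterior probability of no real effect, given the likelihood ratio
    in favour of a real effect and the prior probability of a real effect."""
    if not 0.0 < prior_real_effect < 1.0:
        raise ValueError("prior_real_effect must lie in (0, 1)")
    prior_odds = prior_real_effect / (1.0 - prior_real_effect)
    posterior_odds = likelihood_ratio * prior_odds
    return 1.0 / (1.0 + posterior_odds)


def prior_needed(likelihood_ratio, target_risk):
    """Prior probability of a real effect needed to reach a target false positive risk."""
    if not 0.0 < target_risk < 1.0:
        raise ValueError("target_risk must lie in (0, 1)")
    needed_prior_odds = (1.0 / target_risk - 1.0) / likelihood_ratio
    return needed_prior_odds / (1.0 + needed_prior_odds)


if __name__ == "__main__":
    # Likelihood ratio of about 100:1 with a prior probability of 0.1.
    print(false_positive_risk(100.0, 0.1))
    # Likelihood ratio of about 3:1 with a 5% target risk.
    print(prior_needed(3.0, 0.05))
```

With the rounded 100:1 ratio and a prior of 0.1, the risk comes out at about 8%, in line with the abstract. With the rounded 3:1 ratio, the prior needed for a 5% risk is about 86%, close to the 87% the abstract reports from exact calculations.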
Article
In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration--often scant--given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.
Article
We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.
Article
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a re-analysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested non-null effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of non-reproducibility. The results of this re-analysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false.
Article
P values have been critiqued on several grounds but remain entrenched as the dominant inferential method in the empirical sciences. In this article, we elaborate on the fact that in many statistical models, the one-sided P value has a direct Bayesian interpretation as the approximate posterior mass for values lower than zero. The connection between the one-sided P value and posterior probability mass reveals three insights: (1) P values can be interpreted as Bayesian tests of direction, to be used only when the null hypothesis is known from the outset to be false; (2) as a measure of evidence, P values are biased against a point null hypothesis; and (3) with N fixed and effect size variable, there is an approximately linear relation between P values and Bayesian point null hypothesis tests.
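The connection between the one-sided P value and posterior mass can be illustrated in the simplest case. The sketch below assumes a normally distributed estimate with known standard error and a flat prior on the effect, so the posterior is normal and centered at the estimate; these modeling choices and the function names are assumptions for illustration, not the models treated in the article.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def one_sided_p(estimate, std_error):
    """P value for testing effect <= 0 against effect > 0."""
    z = estimate / std_error
    return 1.0 - normal_cdf(z)


def posterior_mass_below_zero(estimate, std_error):
    """Posterior probability that the effect is below zero under a flat prior,
    where the posterior is Normal(estimate, std_error ** 2)."""
    return normal_cdf((0.0 - estimate) / std_error)


if __name__ == "__main__":
    est, se = 1.96, 1.0  # hypothetical estimate and standard error
    print(one_sided_p(est, se), posterior_mass_below_zero(est, se))
```

In this special case the two quantities coincide exactly; the abstract's claim concerns an approximate version of the same relationship in many statistical models.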
Article
In this paper, we consider the problem of simultaneously testing many two-sided hypotheses when rejections of null hypotheses are accompanied by claims of the direction of the alternative. The fundamental goal is to construct methods that control the mixed directional familywise error rate, which is the probability of making any type 1 or type 3 (directional) error. In particular, attention is focused on cases where the hypotheses are ordered as $H_1 , \ldots, H_n$, so that $H_{i+1}$ is tested only if $H_1 , \ldots, H_i$ have all been previously rejected. In this situation, one can control the usual familywise error rate under arbitrary dependence by the basic procedure which tests each hypothesis at level $\alpha$, and no other multiplicity adjustment is needed. However, we show that this is far too liberal if one also accounts for directional errors. But, by imposing certain dependence assumptions on the test statistics, one can retain the basic procedure.
Article
Much of science is (rightly or wrongly) driven by hypothesis testing. Even in situations where the hypothesis testing paradigm is correct, the common practice of basing inferences solely on p-values has been under intense criticism for over 50 years. We propose, as an alternative, the use of the odds of a correct rejection of the null hypothesis to incorrect rejection. Both pre-experimental versions (involving the power and Type I error) and post-experimental versions (depending on the actual data) are considered. Implementations are provided that range from depending only on the p-value to consideration of full Bayesian analysis. A surprise is that all implementations -- even the full Bayesian analysis -- have complete frequentist justification. Versions of our proposal can be implemented that require only minor modifications to existing practices yet overcome some of their most severe shortcomings.
Article
Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Article
This essay grew out of an examination of one-tailed significance testing. One-tailed tests were little advocated by the founders of modern statistics but are widely used and recommended nowadays in the biological, behavioral and social sciences. The high frequency of their use in ecology and animal behavior and their logical indefensibility have been documented in a companion review paper. In the present one, we trace the roots of this problem and counter some attacks on significance testing in general. Roots include: the early but irrational dichotomization of the P scale and adoption of the 'significant/non-significant' terminology; the mistaken notion that a high P value is evidence favoring the null hypothesis over the alternative hypothesis; and confusion over the distinction between statistical and research hypotheses. Resultant widespread misuse and misinterpretation of significance tests have also led to other problems, such as unjustifiable demands that reporting of P values be disallowed or greatly reduced and that reporting of confidence intervals and standardized effect sizes be required in their place. Our analysis of these matters thus leads us to a recommendation that for standard types of significance assessment the paleoFisherian and Neyman-Pearsonian paradigms be replaced by a neoFisherian one. The essence of the latter is that a critical α (probability of type I error) is not specified, the terms 'significant' and 'non-significant' are abandoned, that high P values lead only to suspended judgments, and that the so-called "three-valued logic" of Cox, Kaiser, Tukey, Tryon and Harris is adopted explicitly. Confidence intervals and bands, power analyses, and severity curves remain useful adjuncts in particular situations. Analyses conducted under this paradigm we term neoFisherian significance assessments (NFSA). Their role is assessment of the existence, sign and magnitude of statistical effects.
The common label of null hypothesis significance tests (NHST) is retained for paleoFisherian and Neyman-Pearsonian approaches and their hybrids. The original Neyman-Pearson framework has no utility outside quality control type applications. Some advocates of Bayesian, likelihood and information-theoretic approaches to model selection have argued that P values and NFSAs are of little or no value, but those arguments do not withstand critical review. Champions of Bayesian methods in particular continue to overstate their value and relevance.
Article
By representing fair betting odds according to one or more pairs of confidence set estimators, dual parameter distributions called confidence posteriors secure the coherence of actions without any prior distribution. This theory reduces to the maximization of expected utility when the pair of posteriors is induced by an exact or approximate confidence set estimator or when a reduction rule is applied to the pair. Unlike the p-value, the confidence posterior probability of an interval hypothesis is suitable as an estimator of the indicator of hypothesis truth since it converges to 1 if the hypothesis is true or to 0 otherwise.
Preprint
An increasingly popular approach to statistical inference is to focus on the estimation of effect size. Yet, this approach is implicitly based on the assumption that there is an effect while ignoring the null hypothesis that the effect is absent. We demonstrate how this common "null hypothesis neglect" may result in effect size estimates that are overly optimistic. The overestimation can be avoided by incorporating the plausibility of the null hypothesis into the estimation process through a "spike-and-slab" model. We illustrate the implications of this approach and provide an empirical example.
Article
While empirical Bayes methods thrive in the presence of the thousands of simultaneous hypothesis tests in genomics and other large-scale applications, significance tests and confidence intervals are considered more appropriate for small numbers of tested hypotheses. Indeed, for fewer hypotheses, there is more uncertainty in empirical Bayes estimates of the prior distribution. Confidence intervals have been used to propagate the uncertainty in the prior to empirical Bayes inference about a parameter, but only by combining a Bayesian posterior distribution with a confidence distribution. Combining distributions of both types has also been used to combine empirical Bayes methods and confidence intervals for estimating a parameter of interest. To clarify the foundational status of such combinations, the concept of an evidential model is proposed. In the framework of evidential models, both Bayesian posterior distributions and confidence distributions are special cases of evidential support distributions. Evidential support distributions, by quantifying the sufficiency of the data as evidence, leverage the strengths of Bayesian posterior distributions and confidence distributions for cases in which each type performs well and for cases benefiting from the combination of both. Evidential support distributions also address problems of bioequivalence, bounded parameters, and the lack of a unique confidence distribution.
Article
By convention, a p-value is often computed in frequentist hypothesis testing and compared with the nominal significance level of 0.05 to determine whether or not to reject the null hypothesis. The smaller the p-value, the more significant the statistical test. Under non-informative prior distributions, we establish the equivalence relationship between the p-value and Bayesian posterior probability of the null hypothesis for one-sided tests and, more importantly, the equivalence between the p-value and a transformation of posterior probabilities of the hypotheses for two-sided tests. For two-sided hypothesis tests with a point null, we recast the problem as a combination of two one-sided hypotheses along the opposite directions and establish the notion of a “two-sided posterior probability”, which reconnects with the (two-sided) p-value. In contrast to the common belief, such an equivalence relationship gives the p-value an explicit interpretation of how strongly the data support the null. Extensive simulation studies are conducted to demonstrate the equivalence relationship between the p-value and Bayesian posterior probability. Contrary to broad criticisms of the use of the p-value in evidence-based studies, we justify its utility and reclaim its importance from the Bayesian perspective.
Article
Significance testing is often criticized because p values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check. That result leads to a way to calibrate a p value by transforming it into an upper bound on the posterior probability of the null hypothesis (conditional on rejection) for any model that would pass the check. The calibration may be calculated from a prior probability of the null hypothesis and the stringency of the check without more detailed modeling. An upper bound, as opposed to a lower bound, can justify concluding that the null hypothesis has a low posterior probability.
Book
Statisticians have met the need to test hundreds or thousands of genomics hypotheses simultaneously with novel empirical Bayes methods that combine advantages of traditional Bayesian and frequentist statistics. Techniques for estimating the local false discovery rate assign probabilities of differential gene expression, genetic association, etc. without requiring subjective prior distributions. This book brings these methods to scientists while keeping the mathematics at an elementary level. Readers will learn the fundamental concepts behind local false discovery rates, preparing them to analyze their own genomics data and to critically evaluate published genomics research. Key Features:
* dice games and exercises, including one using interactive software, for teaching the concepts in the classroom
* examples focusing on gene expression and on genetic association data and briefly covering metabolomics data and proteomics data
* gradual introduction to the mathematical equations needed
* how to choose between different methods of multiple hypothesis testing
* how to convert the output of genomics hypothesis testing software to estimates of local false discovery rates
* guidance through the minefield of current criticisms of p values
* material on non-Bayesian prior p values and posterior p values not previously published

More: https://davidbickel.com/genomics/
Article
The way false discovery rates (FDRs) are used in the analysis of genomics data leads to excessive false positive rates. In this sense, FDRs overcorrect for the excessive conservatism (bias toward false negatives) of methods of adjusting p values that control a family-wise error rate. Estimators of the local FDR (LFDR) are much less biased but have not been widely adopted due to their high variance and lack of availability in software. To address both issues, we propose estimating the LFDR by correcting an estimated FDR or the level at which an FDR is controlled.
Article
Occam's razor suggests assigning more prior probability to a hypothesis corresponding to a simpler distribution of data than to a hypothesis with a more complex distribution of data, other things equal. An idealization of Occam's razor in terms of the entropy of the data distributions tends to favor the null hypothesis over the alternative hypothesis. As a result, lower p values are needed to attain the same level of evidence. A recently debated argument for lowering the significance level to 0.005 as the p value threshold for a new discovery and to 0.05 for a suggestive result would then support further lowering them to 0.001 and 0.01, respectively.
Article
A simple example shows that the classical theory of probability implies more than one can deduce via Kolmogorov's calculus of probability. Developing Dawid's ideas I propose a new calculus of probability which is free from this drawback. This calculus naturally leads to a new interpretation of probability. I argue that attempts to create a general empirical theory of probability should be abandoned and we should content ourselves with the logic of probability establishing relations between probabilistic theories and observations. My approach to the logic of probability is based on a variant of Ville's principle of the excluded gambling strategy. In addition to the classical theory of probability this approach is applied to the probabilistic theories provided by the problem of testing validity of probability forecasts and by statistical models.
Article
Benjamin et al. (Nature Human Behaviour 2, 6-10, 2017) proposed improving the reproducibility of findings in psychological research by lowering the alpha level of our conventional null hypothesis significance tests from .05 to .005, because findings with p-values close to .05 represent insufficient empirical evidence. They argued that findings with a p-value between 0.005 and 0.05 should still be published, but not called “significant” anymore. This proposal was criticized and rejected in a response by Lakens et al. (Nature Human Behaviour 2, 168-171, 2018), who argued that instead of lowering the traditional alpha threshold to .005, we should stop using the term “statistically significant,” and require researchers to determine and justify their alpha levels before they collect data. In this contribution, I argue that the arguments presented by Lakens et al. against the proposal by Benjamin et al. are not convincing. Thus, given that it is highly unlikely that our field will abandon the NHST paradigm any time soon, lowering our alpha level to .005 is at this moment the best way to combat the replication crisis in psychology.
Book
A Sound Basis for the Theory of Statistical Inference Measuring Statistical Evidence Using Relative Belief provides an overview of recent work on developing a theory of statistical inference based on measuring statistical evidence. It shows that being explicit about how to measure statistical evidence allows you to answer the basic question of when a statistical analysis is correct. The book attempts to establish a gold standard for how a statistical analysis should proceed. It first introduces basic features of the overall approach, such as the roles of subjectivity, objectivity, infinity, and utility in statistical analyses. It next discusses the meaning of probability and the various positions taken on probability. The author then focuses on the definition of statistical evidence and how it should be measured. He presents a method for measuring statistical evidence and develops a theory of inference based on this method. He also discusses how statisticians should choose the ingredients for a statistical problem and how these choices are to be checked for their relevance in an application.
Book
Psychology has made great strides since it came of age in the late 1800s. Its subject matter receives a great deal of popular attention in the news, and its professionals are recognised as highly trained experts with wide-ranging and valuable skills. It is one of the most in-demand science subjects in education systems around the world, and more of its research is being conducted, and funded, than ever before. However, reviews of psychology’s standard research approaches have revealed the risk of systematic error to be troublingly high, and the arbitrary ways in which psychologists draw conclusions from evidence have been highlighted. In many ways psychology faces a number of important crises, including: a replication crisis; a paradigmatic crisis; a measurement crisis; a statistical crisis; a sampling crisis; and a crisis of exaggeration. This book addresses these and many other existential crises that face psychology today. [See https://www.amazon.co.uk/Psychology-Crisis-Brian-Hughes/dp/1352003007]
Article
Efforts to increase replication rates in psychology generally consist of recommended improvements to methodology, such as increasing sample sizes to increase power or using a lower alpha level. However, little attention has been paid to how the prior odds (R) that a tested effect is true can affect the probability that a significant result will be replicable. The lower R is, the less likely a published result will be replicable even if power is high. It follows that if R is lower in one set of studies than in another, then all else being equal, published results will be less replicable in the set with lower R. We illustrate this point by presenting an analysis of data from the social-psychology and cognitive-psychology studies that were included in the Open Science Collaboration’s (2015) replication project. We found that R was lower for the social-psychology studies than for the cognitive-psychology studies, which might explain why the rate of successful replications differed between these two sets of studies. This difference in replication rates may reflect the degree to which scientists in the two fields value risky but potentially groundbreaking (i.e., low-R) research. Critically, high-R research is not inherently better or worse than low-R research for advancing knowledge. However, if they wish to achieve replication rates comparable to those of high-R fields (a judgment call), researchers in low-R fields would need to use an especially low alpha level, conduct experiments that have especially high power, or both.
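The dependence on the prior odds R described in this abstract can be sketched with the standard positive-predictive-value calculation. This is an illustration under stated assumptions, not the authors' analysis: it assumes a single test with a fixed alpha level and power, no bias or selective reporting, and R defined as the odds that a tested effect is real. The function name and the example values are hypothetical.

```python
def positive_predictive_value(prior_odds, power, alpha):
    """Probability that a significant result reflects a true effect.

    prior_odds: R = Pr(effect is real) / Pr(effect is null) before testing.
    power:      Pr(significant | effect is real).
    alpha:      Pr(significant | effect is null).
    """
    if prior_odds < 0:
        raise ValueError("prior_odds must be non-negative")
    if not 0 < alpha <= 1 or not 0 <= power <= 1:
        raise ValueError("alpha must be in (0, 1] and power in [0, 1]")
    true_positives = power * prior_odds  # proportional to Pr(significant and real)
    false_positives = alpha              # proportional to Pr(significant and null)
    return true_positives / (true_positives + false_positives)


# Hypothetical comparison of a high-R field and a low-R field at the same power.
for r in (1.0, 0.1):
    for alpha in (0.05, 0.005):
        ppv = positive_predictive_value(r, power=0.8, alpha=alpha)
        print(f"R = {r:<4} alpha = {alpha:<6} PPV = {ppv:.3f}")
```

Under these assumptions, lowering R lowers the proportion of significant results that are true at any fixed power, and lowering alpha or raising power pulls it back up, which is the trade-off the abstract describes.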
Article
The use of p values in null hypothesis statistical tests (NHST) is controversial in the history of applied statistics, owing to a number of problems. They are: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and overlooking better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research.
Article
The p-value quantifies the discrepancy between the data and a null hypothesis of interest, usually the assumption of no difference or no effect. A Bayesian approach allows the calibration of p-values by transforming them to direct measures of the evidence against the null hypothesis, so-called Bayes factors. We review the available literature in this area and consider two-sided significance tests for a point null hypothesis in more detail. We distinguish simple from local alternative hypotheses and contrast traditional Bayes factors based on the data with Bayes factors based on p-values or test statistics. A well-known finding is that the minimum Bayes factor, the smallest possible Bayes factor within a certain class of alternative hypotheses, provides less evidence against the null hypothesis than the corresponding p-value might suggest. It is less known that the relationship between p-values and minimum Bayes factors also depends on the sample size and on the dimension of the parameter of interest. We illustrate the transformation of p-values to minimum Bayes factors with two examples from clinical research.
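One widely cited calibration of this kind is the bound of Sellke, Bayarri, and Berger (2001), under which the Bayes factor in favor of the null is at least -e p ln(p) for p < 1/e. The sketch below uses that particular bound as an assumption; the abstract above covers several classes of alternatives and does not single it out. The function names and the default prior probability of 0.5 are hypothetical choices for illustration.

```python
import math


def min_bayes_factor(p):
    """Lower bound -e * p * ln(p) on the Bayes factor favoring the null.

    The bound applies for p < 1/e; for larger p it is taken to be 1.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if p >= 1 / math.e:
        return 1.0
    return -math.e * p * math.log(p)


def posterior_null_lower_bound(p, prior_null=0.5):
    """Lower bound on Pr(H0 | p) implied by the minimum Bayes factor."""
    if not 0 < prior_null < 1:
        raise ValueError("prior_null must be in (0, 1)")
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * min_bayes_factor(p)
    return posterior_odds / (1 + posterior_odds)


for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: min BF = {min_bayes_factor(p):.4f}, "
          f"Pr(H0 | p) >= {posterior_null_lower_bound(p):.4f}")
```

With equal prior probabilities, p = 0.05 gives a minimum Bayes factor of about 0.41 and a posterior probability of the null of at least about 0.29, which illustrates the abstract's point that the evidence against the null is weaker than the p-value might suggest.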
Article
This paper proposes a general framework for prediction in which a prediction is presented in the form of a distribution function, called predictive distribution function. This predictive distribution function is well suited for the notion of confidence subscribed in the frequentist interpretation, and it can provide meaningful answers for questions related to prediction. A general approach under this framework is formulated and illustrated by using the so-called confidence distributions (CDs). This CD-based prediction approach inherits many desirable properties of CD, including its capacity for serving as a common platform for connecting and unifying the existing procedures of predictive inference in Bayesian, fiducial and frequentist paradigms. The theory underlying the CD-based predictive distribution is developed and some related efficiency and optimality issues are addressed. Moreover, a simple yet broadly applicable Monte Carlo algorithm is proposed for the implementation of the proposed approach. This concrete algorithm together with the proposed definition and associated theoretical development produce a comprehensive statistical inference framework for prediction. Finally, the approach is applied to simulation studies, and a real project on predicting the incoming volume of application submissions to a government agency. The latter shows the applicability of the proposed approach to dependent data settings.
Article
Minimum Bayes factors are commonly used to transform two-sided p-values to lower bounds on the posterior probability of the null hypothesis. Several proposals exist in the literature, but none of them depends on the sample size. However, the evidence of a p-value against a point null hypothesis is known to depend on the sample size. In this article, we consider p-values in the linear model and propose new minimum Bayes factors that depend on sample size and converge to existing bounds as the sample size goes to infinity. It turns out that the maximal evidence of an exact two-sided p-value increases with decreasing sample size. The effect of adjusting minimum Bayes factors for sample size is shown in two applications.
Article
R. A. Fisher, the father of modern statistics, proposed the idea of fiducial inference during the first half of the 20th century. While his proposal led to interesting methods for quantifying uncertainty, other prominent statisticians of the time did not accept Fisher’s approach as it became apparent that some of Fisher’s bold claims about the properties of fiducial distribution did not hold up for multi-parameter problems. Beginning around the year 2000, the authors and collaborators started to re-investigate the idea of fiducial inference and discovered that Fisher’s approach, when properly generalized, would open doors to solve many important and difficult inference problems. They termed their generalization of Fisher’s idea as generalized fiducial inference (GFI). The main idea of GFI is to carefully transfer randomness from the data to the parameter space using an inverse of a data generating equation without the use of Bayes theorem. The resulting generalized fiducial distribution (GFD) can then be used for inference. After more than a decade of investigations, the authors and collaborators have developed a unifying theory for GFI, and provided GFI solutions to many challenging practical problems in different fields of science and industry. Overall, they have demonstrated that GFI is a valid, useful, and promising approach for conducting statistical inference. The goal of this paper is to deliver a timely and concise introduction to GFI, to present some of the latest results, as well as to list some related open research problems. It is the authors’ hope that their contributions to GFI will stimulate the growth and usage of this exciting approach for statistical inference.
Article
The complete final product of Bayesian inference is the posterior distribution of the quantity of interest. Important inference summaries include point estimation, region estimation and precise hypotheses testing. Those summaries may appropriately be described as the solution to specific decision problems which depend on the particular loss function chosen. The use of a continuous loss function leads to an integrated set of solutions where the same prior distribution may be used throughout. Objective Bayesian methods are those which use a prior distribution which only depends on the assumed model and the quantity of interest. As a consequence, objective Bayesian methods produce results which only depend on the assumed model and the data obtained. The combined use of intrinsic discrepancy, an invariant information-based loss function, and appropriately defined reference priors, provides an integrated objective Bayesian solution to both estimation and hypothesis testing problems. The ideas are illustrated with a large collection of non-trivial examples.
Article
Concerns about a lack of reproducibility of statistically significant results have recently been raised in many fields, and it has been argued that this lack comes at substantial economic costs. We here report the results from prediction markets set up to quantify the reproducibility of 44 studies published in prominent psychology journals and replicated in the Reproducibility Project: Psychology. The prediction markets predict the outcomes of the replications well and outperform a survey of market participants' individual forecasts. This shows that prediction markets are a promising tool for assessing the reproducibility of published scientific results. The prediction markets also allow us to estimate probabilities for the hypotheses being true at different testing stages, which provides valuable information regarding the temporal dynamics of scientific discovery. We find that the hypotheses being tested in psychology typically have low prior probabilities of being true (median, 9%) and that a "statistically significant" finding needs to be confirmed in a well-powered replication to have a high probability of being true. We argue that prediction markets could be used to obtain speedy information about reproducibility at low cost and could potentially even be used to determine which studies to replicate to optimally allocate limited resources into replications.
Article
Medical and scientific advances are predicated on new knowledge that is robust and reliable and that serves as a solid foundation on which further advances can be built. In biomedical research, we are in the midst of a revolution with the generation of new data and scientific publications at a previously unprecedented rate. However, unfortunately, there is compelling evidence that the majority of these discoveries will not stand the test of time. To a large extent, this reproducibility crisis in basic and preclinical research may be as a result of failure to adhere to good scientific practice and the desperation to publish or perish. This is a multifaceted, multistakeholder problem. No single party is solely responsible, and no single solution will suffice. Here we review the reproducibility problems in basic and preclinical biomedical research, highlight some of the complexities, and discuss potential solutions that may help improve research quality and reproducibility. © 2015 American Heart Association, Inc.
Article
The controversy concerning the fundamental principles of statistics still remains unresolved. It is suggested that one key to resolving the conflict lies in recognizing that inferential probability derived from observational data is inherently noncoherent, in the sense that their inferential implications cannot be represented by a single probability distribution on the parameter space (except in the Objective Bayesian case). More precisely, for a parameter space R1, the class of all functions of the parameter comprise equivalence classes of invertibly related functions, and to each such class a logically distinct inferential probability distribution pertains. (There is an additional cross‐coherence requirement for simultaneous inference.) The non‐coherence of these distributions flows from the nonequivalence of the relevant components of the data for each. Noncoherence is mathematically inherent in confidence and fiducial theory, and provides a basis for reconciling the Fisherian and Neyman–Pearsonian viewpoints. A unified theory of confidence‐based inferential probability is presented, and the fundamental incompatibility of this with Subjective Bayesian theory is discussed.
Article
x is a one‐dimensional random variable whose distribution depends on a single parameter θ. It is the purpose of this note to establish two results: (i) The necessary and sufficient condition for the fiducial distribution of θ, given x, to be a Bayes' distribution is that there exist transformations of x to u, and of θ to τ, such that τ is a location parameter for u. The condition will be referred to as (A). This extends some results of Grundy's (1956). (ii) If, for a random sample of any size from the distribution for x, there exists a single sufficient statistic for θ then the fiducial argument is inconsistent unless condition (A) obtains: And when it does, the fiducial argument is equivalent to a Bayesian argument with uniform prior distribution for τ. The note concludes with an investigation of (A) in the case of the exponential family.
Article
A class of one‐parameter distributions is specified, for which the sample total is a sufficient statistic in samples of arbitrary size. It is proved that the resulting fiducial distribution of the parameter does not coincide with the distribution, a posteriori, given by Bayes' theorem, for any prior distribution whatever.
Article
Contents: Preface. Introduction: Introduction; The Problem of Regions; Some Example Applications; About This Book. Single Parameter Problems: Introduction; The General Case; Smooth Function Model; Asymptotic Comparisons; Empirical Comparisons; Examples; Computation Using R; Exercises. Multiple Parameter Problems: Introduction; Smooth Function Model; Asymptotic Accuracy; Empirical Comparisons; Examples; Computation Using R; Exercises. Linear Models and Regression: Introduction; Statistical Framework; Asymptotic Accuracy; Empirical Comparisons; Examples; Further Issues in Linear Regression; Computation Using R; Exercises. Nonparametric Smoothing Problems: Introduction; Nonparametric Density Estimation; Density Estimation Examples; Solving Density Estimation Problems Using R; Nonparametric Regression; Nonparametric Regression Examples; Solving Nonparametric Regression Problems Using R; Exercises. Further Applications: Classical Nonparametric Methods; Generalized Linear Models; Multivariate Analysis; Survival Analysis; Exercises. Connections and Comparisons: Introduction; Statistical Hypothesis Testing; Multiple Comparisons; Attained Confidence Levels; Bayesian Confidence Levels; Exercises. Appendix: Review of Asymptotic Statistics: Taylor's Theorem; Modes of Convergence; Central Limit Theorem; Convergence Rates; Exercises. References. Index.
Article
In this book, an integrated introduction to statistical inference is provided from a frequentist likelihood-based viewpoint. Classical results are presented together with recent developments, largely built upon ideas due to R.A. Fisher. The term “neo-Fisherian” highlights this. After a unified review of background material (statistical models, likelihood, data and model reduction, first-order asymptotics) and inference in the presence of nuisance parameters (including pseudo-likelihoods), a self-contained introduction is given to exponential families, exponential dispersion models, generalized linear models, and group families. Finally, basic results of higher-order asymptotics are introduced (index notation, asymptotic expansions for statistics and distributions, and major applications to likelihood inference). The emphasis is more on general concepts and methods than on regularity conditions. Many examples are given for specific statistical models. Each chapter is supplemented with problems and bibliographic notes. This volume can serve as a textbook in intermediate-level undergraduate and postgraduate courses in statistical inference.
Article
We live in a new age for statistical inference, where modern scientific technology such as microarrays and fMRI machines routinely produce thousands and sometimes millions of parallel data sets, each with its own estimation or testing problem. Doing thousands of problems at once is more than repeated application of classical methods. Taking an empirical Bayes approach, Bradley Efron, inventor of the bootstrap, shows how information accrues across problems in a way that combines Bayesian and frequentist ideas. Estimation, testing, and prediction blend in this framework, producing opportunities for new methodologies of increased power. New difficulties also arise, easily leading to flawed inferences. This book takes a careful look at both the promise and pitfalls of large-scale statistical inference, with particular attention to false discovery rates, the most successful of the new statistical techniques. Emphasis is on the inferential ideas underlying technical developments, illustrated using a large number of real examples.
Article
A review is provided of the concept of confidence distributions. Material covered includes fundamentals, extensions, and applications of confidence distributions, as well as available computer software. We expect that this review could serve as a source of reference and encourage further research with respect to confidence distributions.
Article
In frequentist inference, it is common to use a single point (a point estimate) or an interval (a confidence interval) to estimate a parameter of interest. A very simple question arises: can one also use a probability distribution for the same purpose, and from the same frequentist perspective, in the way that Bayesians use a posterior distribution? The answer is affirmative, and confidence distributions emerge as a natural choice in this context. The concept of a confidence distribution has a long history, long associated, wrongly, with theories of fiducial inference, which has hindered its development within the frequentist perspective. Confidence distributions have recently attracted renewed interest, and several results have highlighted their considerable potential as an inferential tool. This article presents a modern definition of the concept and examines its recent developments. It covers inference methods, optimality issues, and applications. In light of these new developments, the concept of a confidence distribution encompasses and unifies a wide range of special cases, from regular parametric examples (fiducial distributions), resampling distributions, p-values, and normalized likelihood functions to Bayesian priors and posteriors. The discussion is conducted entirely from a frequentist point of view and emphasizes applications in which frequentist solutions are nonexistent or difficult to apply. Although we also draw attention to the similarities and differences among the frequentist, fiducial, and Bayesian approaches, our intention is not to reopen a philosophical debate that has lasted nearly two hundred years. On the contrary, we hope to help bridge the gap between the different points of view.