Article

Abstract

Occam's razor suggests assigning more prior probability to a hypothesis corresponding to a simpler distribution of data than to a hypothesis with a more complex distribution of data, other things equal. An idealization of Occam's razor in terms of the entropy of the data distributions tends to favor the null hypothesis over the alternative hypothesis. As a result, lower p values are needed to attain the same level of evidence. A recently debated argument for lowering the significance level to 0.005 as the p value threshold for a new discovery and to 0.05 for a suggestive result would then support further lowering them to 0.001 and 0.01, respectively.
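
To make the direction of this argument concrete, the following Python sketch is an illustration only, not the article's derivation: the -e p ln p bound on the Bayes factor and the Occam factor of 4 are assumptions chosen for the example. It solves for the p value threshold that preserves a given level of evidence after the bound is discounted in favor of the null hypothesis.

import math

def bf_bound_against_null(p):
    """Upper bound on the Bayes factor (alternative vs. null), 1 / (-e p ln p), for p < 1/e."""
    return 1.0 / (-math.e * p * math.log(p))

def threshold_after_occam(p_old, occam_factor, lo=1e-12):
    """Find p_new so that the Occam-discounted bound at p_new matches the bound at p_old."""
    target = bf_bound_against_null(p_old)          # evidence level implied by the old threshold
    f = lambda p: bf_bound_against_null(p) / occam_factor - target
    hi = p_old
    for _ in range(200):                           # simple bisection; f increases as p decreases
        mid = math.sqrt(lo * hi)                   # geometric midpoint suits the log scale
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# An illustrative Occam factor of about 4 in favor of the null moves 0.005 to roughly 0.001.
print(round(threshold_after_occam(0.005, occam_factor=4.0), 4))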


... Example 3. Different assumptions lead to different versions of the lower bound on the Bayes factor (Held and Ott 2018). Instead of the version given by Equation (2), this example uses the bound exp(-z^2/2), which Held and Ott (2016) call the universal lower bound, where z is the standard normal quantile of a one-sided p value for testing ϑ = θ_H0 (Bickel 2019d). Since that lower bound may be too low to be a reasonable estimate of the Bayes factor B, it might require some kind of averaging with an upper bound on B. A readily available upper bound is 1 unless the two-sided p value is large enough to increase the probability of the null hypothesis. ...
... Example 4. A lower bound similar to that of Equation (2) is the normal-distribution bound |z| exp(-(z^2 - 1)/2) (Held and Ott 2016), where z, assumed to satisfy |z| > 1, is defined in Example 3. An interpretation of Occam's razor (Bickel 2020a) leads to increasing that lower bound by a factor of |z| (Bickel 2019d). Then concerns about using a lower bound on the Bayes factor B as an estimate of B may motivate instead using |z| times the lower bound as the estimate of the Bayes factor. ...
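
A numerical sketch of the bounds quoted in Examples 3 and 4, under the notation reconstructed above; the formulas exp(-z^2/2), |z| exp(-(z^2 - 1)/2), and the |z| sharpening are taken from those examples, while the function names and the choice of p values are illustrative.

from statistics import NormalDist
import math

def bounds_from_p(p_two_sided):
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)       # standard normal quantile of the one-sided p value
    universal = math.exp(-z**2 / 2)                     # universal lower bound on the Bayes factor (Example 3)
    normal_bound = abs(z) * math.exp(-(z**2 - 1) / 2)   # normal-distribution bound, assumes |z| > 1 (Example 4)
    sharpened = abs(z) * normal_bound                   # illustrative |z|-sharpened estimate (Example 4)
    return z, universal, normal_bound, sharpened

for p in (0.05, 0.005, 0.001):
    z, u, nb, s = bounds_from_p(p)
    print(f"p={p}: z={z:.2f}, universal={u:.4f}, normal bound={nb:.4f}, sharpened={s:.4f}")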
Article
Much of the blame for failed attempts to replicate reports of scientific findings has been placed on ubiquitous and persistent misinterpretations of the p value. An increasingly popular solution is to transform a two-sided p value to a lower bound on a Bayes factor. Another solution is to interpret a one-sided p value as an approximate posterior probability. Combining the two solutions results in confidence intervals that are calibrated by an estimate of the posterior probability that the null hypothesis is true. The combination also provides a point estimate that is covered by the calibrated confidence interval at every level of confidence. Finally, the combination of solutions generates a two-sided p value that is calibrated by the estimate of the posterior probability of the null hypothesis. In the special case of a 50% prior probability of the null hypothesis and a simple lower bound on the Bayes factor, the calibrated two-sided p value is about (1 – abs(2.7 p ln p)) p + 2 abs(2.7 p ln p) for small p. The calibrations of confidence intervals, point estimates, and p values are proposed in an empirical Bayes framework without requiring multiple comparisons.
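
A quick numerical check of the stated approximation for the calibrated two-sided p value; the code is an illustration rather than the authors' own.

import math

def calibrated_p_approx(p):
    lfdr = abs(2.7 * p * math.log(p))        # roughly e p ln(1/p), the lower bound on the Bayes factor
    return (1 - lfdr) * p + 2 * lfdr

for p in (0.05, 0.01, 0.005, 0.001):
    print(p, round(calibrated_p_approx(p), 4))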
... Example 3. Different assumptions lead to different versions of the lower bound on the Bayes factor (Held and Ott, 2018). Instead of the version given by equation (3), this example uses the bound exp(-z^2/2), which Held and Ott (2016) call the universal lower bound, where z is the standard normal quantile of a one-sided p value testing ϑ = θ_H0 (Bickel, 2019g). Since that lower bound may be too low to be a reasonable estimate of the Bayes factor B, it might require some kind of averaging with an upper bound on B. ...
... An interpretation of Occam's razor (Bickel, 2019a) leads to increasing the lower bound by a factor of |z| (Bickel, 2019g). Then concerns about using a lower bound on the Bayes factor B as an estimate of B may motivate instead using |z| times the lower bound as the estimate. [Figure 2: each local false discovery rate estimate as a function of the two-sided p value.] ...
Preprint
Full-text available
Much of the blame for failed attempts to replicate reports of scientific findings has been placed on ubiquitous and persistent misinterpretations of the p value. An increasingly popular solution is to transform a two-sided p value to a lower bound on a Bayes factor. Another solution is to interpret a one-sided p value as an approximate posterior probability. Combining the two solutions results in confidence intervals that are calibrated by an estimate of the posterior probability that the null hypothesis is true. The combination also provides a point estimate that is covered by the calibrated confidence interval at every level of confidence. Finally, the combination of solutions generates a two-sided p value that is calibrated by the estimate of the posterior probability of the null hypothesis. In the special case of a 50% prior probability of the null hypothesis and a simple lower bound on the Bayes factor, the calibrated two-sided p value is about (1-abs(2.7 p ln p)) p + 2 abs(2.7 p ln p) for small p. The calibrations of confidence intervals, point estimates, and p values are proposed in an empirical Bayes framework without requiring multiple comparisons.
... Equation (24) suggests viewing σ^κ B(x; σ) as the κ-sharpened Bayes factor, applicable regardless of the value of P(0). Under the ideal value of κ derived in Section 4, that simplicity adjustment, when coupled with an argument of Benjamin et al. (2017), leads to 0.001 or 0.01 rather than 0.005 or 0.05 as the default p-value threshold of statistical significance (Bickel, 2019c). ...
... Another implication is that prior distributions that represent known physical variability do not require adjustments for simplicity (cf. Bickel, 2019c), for their probabilities are limiting relative frequencies that do not depend on the construction of systems. ...
Article
In Bayesian statistics, if the distribution of the data is unknown, then each plausible distribution of the data is indexed by a parameter value, and the prior distribution of the parameter is specified. To the extent that more complicated data distributions tend to require more coincidences for their construction than simpler data distributions, default prior distributions should be transformed to assign additional prior probability or probability density to the parameter values that refer to simpler data distributions. The proposed transformation of the prior distribution relies on the entropy of each data distribution as the relevant measure of complexity. The transformation is derived from a few first principles and extended to stochastic processes.
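
The abstract does not give the transformation in closed form; the sketch below only illustrates the general idea of tilting a default prior toward lower-entropy (simpler) data distributions, using an assumed exponential tilt exp(-kappa * entropy) that may differ from the paper's actual derivation.

import math

def bernoulli_entropy(theta):
    # entropy of a Bernoulli(theta) data distribution, in nats
    if theta in (0.0, 1.0):
        return 0.0
    return -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta))

def occam_tilted_prior(thetas, default_weights, kappa=1.0):
    # assumed tilt for illustration only: weight each parameter value by exp(-kappa * entropy)
    tilted = [w * math.exp(-kappa * bernoulli_entropy(t))
              for t, w in zip(thetas, default_weights)]
    total = sum(tilted)
    return [w / total for w in tilted]          # renormalize to a probability distribution

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
uniform = [0.2] * 5
print(occam_tilted_prior(thetas, uniform))      # the lower-entropy extremes gain prior mass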
... 48 The so-called Occam's Razor principle suggests that a simpler model is preferred over a complicated one if these two models perform similarly. 49,50 A feature-selection algorithm may significantly reduce the dimensionality of the transcriptome datasets for a binary disease diagnosis model. [51][52][53][54] It is also anticipated that the training and prediction times of a disease diagnosis model will be shortened by selecting a subset of features from the transcriptome dataset. ...
Article
Full-text available
Immune thrombocytopenia (ITP) is an autoimmune disease with the typical symptom of a low platelet count in blood. ITP demonstrated age and sex biases in both occurrence and prognosis, and adult ITP was mainly induced by living environments. The current diagnosis guideline lacks integration of the molecular heterogeneity. This study recruited the largest cohort of platelet transcriptome samples. A comprehensive procedure of feature selection, feature engineering, and stacking classification was carried out to detect ITP biomarkers using the RNA-seq transcriptomes. The 40 detected biomarkers were used to train the final ITP detection model, with an overall accuracy of 0.974. The biomarkers suggest that ITP onset may be associated with various transcribed components, including protein-coding genes, long intergenic non-coding RNA (lincRNA) genes, and pseudogenes with apparent transcription. The delivered ITP detection model may also be utilized as a complementary ITP diagnosis tool. The code and the example dataset are freely available at http://www.healthinformaticslab.org/supp/resources.php.
... That is clear in frequentist inference, for each hypothesis test or confidence interval relies on assumptions that remain assumptions even when they pass statistical tests, for the absence of evidence against those assumptions is not evidence for their truth. Empirical Bayes methods are routinely criticized for neglecting uncertainty in estimating the prior distribution (e.g., Qiu et al., 2005), leading to the use of standard error estimates (Efron, 2007, §5), confidence intervals (Scheid and Spang, 2005), and confidence distributions (Bickel, 2017, 2019b, 2020a) to account for more of the uncertainty in estimating local false discovery rates. Fully Bayesian models also underrepresent uncertainty since they could only incorporate all the uncertainty about the models if all reasonable models, their priors, and a hyperprior over the models could be specified with certainty. ...
Article
The probability distributions that statistical methods use to represent uncertainty fail to capture all of the uncertainty that may be relevant to decision making. A simple way to adjust probability distributions for the uncertainty not represented in their models is to average the distributions with a uniform distribution or another distribution of maximum uncertainty. A decision-theoretic framework leads to averaging the distributions by taking the means of the logit transforms of the probabilities. That method does not prevent convergence to the truth, as does taking the means of the probabilities themselves. The mean-logit approach to moderating distributions is applied to natural language processing performed by a deep neural network.
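
A hedged sketch of the mean-logit moderation described above, for a discrete predictive distribution; the equal mixing weight and the final renormalization step are assumptions of this illustration rather than details taken from the paper.

import math

def logit(q):
    return math.log(q / (1 - q))

def expit(x):
    return 1 / (1 + math.exp(-x))

def moderate(probs, weight=0.5):
    """Mix each probability with the uniform probability 1/K on the logit scale."""
    k = len(probs)
    u = logit(1 / k)
    mixed = [expit((1 - weight) * logit(p) + weight * u) for p in probs]
    total = sum(mixed)
    return [m / total for m in mixed]           # renormalize (an assumption of this sketch)

print(moderate([0.90, 0.06, 0.04]))             # an overconfident prediction is pulled toward uniform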
... To reduce failure to replicate, one solution suggested in the literature is the use of stricter evidential thresholds, possibly varying by discipline (Johnson, 2013; Goodman, 2016). Benjamin et al. (2017) and Bickel (2019) advocate changing the standard threshold for significance from 0.05 to 0.005, or even 0.001, while Lakens et al. (2017) recommend a case-by-case, transparently justified choice, preferably pre-registered. ...
Preprint
The widely claimed replicability crisis in science may lead to revised standards of significance. The customary frequentist confidence intervals, calibrated through hypothetical repetitions of the experiment that is supposed to have produced the data at hand, rely on a feeble concept of replicability. In particular, contradictory conclusions may be reached when a substantial enlargement of the study is undertaken. To redefine statistical confidence in such a way that inferential conclusions are non-contradictory, with large enough probability, under enlargements of the sample, we give a new reading of a proposal dating back to the 60's, namely Robbins' confidence sequences. Directly bounding the probability of reaching, in the future, conclusions that contradict the current ones, Robbins' confidence sequences ensure a clear-cut form of replicability when inference is performed on accumulating data. Their main frequentist property is easy to understand and to prove. We show that Robbins' confidence sequences may be justified under various views of inference: they are likelihood-based, can incorporate prior information, and obey the strong likelihood principle. They are easy to compute, even when inference is on a parameter of interest, especially using a closed-form approximation from normal asymptotic theory.
Article
Hypothesis tests are conducted not only to determine whether a null hypothesis (H0) is true but also to determine the direction or sign of an effect. A simple estimate of the posterior probability of a sign error is PSE = (1 - PH0) p/2 + PH0, depending only on a two-sided p value and PH0, an estimate of the posterior probability of H0. A convenient option for PH0 is the posterior probability derived from estimating the Bayes factor to be its e p ln(1/p) lower bound. In that case, PSE depends only on p and an estimate of the prior probability of H0. PSE provides a continuum between significance testing and traditional Bayesian testing. The former effectively assumes the prior probability of H0 is 0, as some statisticians argue. In that case, PSE is equal to a one-sided p value. (In that sense, PSE is a calibrated p value.) In traditional Bayesian testing, on the other hand, the prior probability of H0 is at least 50%, which usually brings PSE close to PH0.
Preprint
Full-text available
Concepts from multiple testing can improve tests of single hypotheses. The proposed definition of the calibrated p value is an estimate of the local false sign rate, the posterior probability that the direction of the estimated effect is incorrect. Interpreting one-sided p values as estimates of conditional posterior probabilities, that calibrated p value is (1 - LFDR) p/2 + LFDR, where p is a two-sided p value and LFDR is an estimate of the local false discovery rate, the posterior probability that a point null hypothesis is true given p. A simple option for LFDR is the posterior probability derived from estimating the Bayes factor to be its e p ln(1/p) lower bound. The calibration provides a continuum between significance testing and traditional Bayesian testing. The former effectively assumes the prior probability of the null hypothesis is 0, as some statisticians argue is the case. Then the calibrated p value is equal to p/2, a one-sided p value, since LFDR = 0. In traditional Bayesian testing, the prior probability of the null hypothesis is at least 50%, which usually results in LFDR >> p. At that end of the continuum, the calibrated p value is close to LFDR.
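
A small sketch of the calibration described in the two abstracts above, assuming a 50% prior probability of the null hypothesis and the e p ln(1/p) lower bound as the estimate of the Bayes factor; the function names are illustrative.

import math

def lfdr_estimate(p, prior_null=0.5):
    bf_null = math.e * p * math.log(1 / p)              # estimated Bayes factor (null vs. alternative), p < 1/e
    prior_odds = prior_null / (1 - prior_null)
    post_odds = bf_null * prior_odds
    return post_odds / (1 + post_odds)                  # estimated posterior probability that the null is true

def calibrated_p(p, prior_null=0.5):
    lfdr = lfdr_estimate(p, prior_null)
    return (1 - lfdr) * p / 2 + lfdr                    # estimated local false sign rate

for p in (0.05, 0.005, 0.001):
    print(p, round(lfdr_estimate(p), 3), round(calibrated_p(p), 3))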
Article
Significance testing is often criticized because p values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check. That result leads to a way to calibrate a p value by transforming it into an upper bound on the posterior probability of the null hypothesis (conditional on rejection) for any model that would pass the check. The calibration may be calculated from a prior probability of the null hypothesis and the stringency of the check without more detailed modeling. An upper bound, as opposed to a lower bound, can justify concluding that the null hypothesis has a low posterior probability.
Article
The widely claimed replicability crisis in science may lead to revised standards of significance. The customary frequentist confidence intervals, calibrated through hypothetical repetitions of the experiment that is supposed to have produced the data at hand, rely on a feeble concept of replicability. In particular, contradictory conclusions may be reached when a substantial enlargement of the study is undertaken. To redefine statistical confidence in such a way that inferential conclusions are non‐contradictory, with large enough probability, under enlargements of the sample, we give a new reading of a proposal dating back to the 60s, namely, Robbins' confidence sequences. Directly bounding the probability of reaching, in the future, conclusions that contradict the current ones, Robbins' confidence sequences ensure a clear‐cut form of replicability when inference is performed on accumulating data. Their main frequentist property is easy to understand and to prove. We show that Robbins' confidence sequences may be justified under various views of inference: they are likelihood‐based, can incorporate prior information and obey the strong likelihood principle. They are easy to compute, even when inference is on a parameter of interest, especially using a closed form approximation from normal asymptotic theory.
Book
Statisticians have met the need to test hundreds or thousands of genomics hypotheses simultaneously with novel empirical Bayes methods that combine advantages of traditional Bayesian and frequentist statistics. Techniques for estimating the local false discovery rate assign probabilities of differential gene expression, genetic association, etc. without requiring subjective prior distributions. This book brings these methods to scientists while keeping the mathematics at an elementary level. Readers will learn the fundamental concepts behind local false discovery rates, preparing them to analyze their own genomics data and to critically evaluate published genomics research. Key features:
* dice games and exercises, including one using interactive software, for teaching the concepts in the classroom
* examples focusing on gene expression and on genetic association data and briefly covering metabolomics data and proteomics data
* gradual introduction to the mathematical equations needed
* how to choose between different methods of multiple hypothesis testing
* how to convert the output of genomics hypothesis testing software to estimates of local false discovery rates
* guidance through the minefield of current criticisms of p values
* material on non-Bayesian prior p values and posterior p values not previously published
More: https://davidbickel.com/genomics/
Article
According to the general law of likelihood, the strength of statistical evidence for a hypothesis as opposed to its alternative is the ratio of their likelihoods, each maximized over the parameter of interest. Consider the problem of assessing the weight of evidence for each of several hypotheses. Under a realistic model with a free parameter for each alternative hypothesis, this leads to weighing evidence without any shrinkage toward a presumption of the truth of each null hypothesis. That lack of shrinkage can lead to many false positives in settings with large numbers of hypotheses. A related problem is that point hypotheses cannot have more support than their alternatives. Both problems may be solved by fusing the realistic model with a model of a more restricted parameter space for use with the general law of likelihood. Applying the proposed framework of model fusion to data sets from genomics and education yields intuitively reasonable weights of evidence.
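
A small illustration of the problem the abstract raises: with a free parameter under the alternative, the maximized likelihood ratio can never favor a point null hypothesis. The binomial example below is an illustration only, not the paper's model-fusion method.

import math

def binom_loglik(theta, k, n):
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

k, n = 52, 100
null_loglik = binom_loglik(0.5, k, n)
alt_loglik = binom_loglik(k / n, k, n)           # maximized over the free parameter of the alternative
print(math.exp(null_loglik - alt_loglik))        # <= 1 by construction, so the null can never win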
Article
Confidence sets, p values, maximum likelihood estimates, and other results of non-Bayesian statistical methods may be adjusted to favor sampling distributions that are simple compared to others in the parametric family. The adjustments are derived from a prior likelihood function previously used to adjust posterior distributions.
Article
Full-text available
We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.
Article
Full-text available
In the face of rapid environmental and cultural change, orthodox concepts in restoration ecology such as historical fidelity are being challenged. Here we re-examine the diverse roles played by historical knowledge in restoration, and argue that these roles remain vitally important. As such, historical knowledge will be critical in shaping restoration ecology in the future. Perhaps the most crucial role in shifting from the present version of restoration ecology ("v1.0") to a newer formulation ("v2.0") is the value of historical knowledge in guiding scientific interpretation, recognizing key ecological legacies, and influencing the choices available to practitioners of ecosystem intervention under conditions of open-ended and rapid change.
Article
Full-text available
Of the many honorifics bestowed on the articles in this historical series, it is doubtful that any have had applied the best—funny. The rhetorical zest and smiling outrage that Joseph Berkson brings to his puncturing of the quasi-religious precepts of traditional statistics in his classic article 1 recalls for me a public debate I witnessed in the 1980s between a highly respected statistician and a surgeon clinical-trialist. It was a debate on issues related to the adjustment of P-values in clinical trials, and what I remember best was the entrance of the physician in full surgical regalia; green operating scrubs, face mask, shoe covers, the whole bit. Playing effectively the role of the ‘aw-shucks, I’m just a country doc who don’t know nuthin’ ‘bout statistics’ he parodied traditional statistical precepts so effectively, contrasting them unfavourably with common-sense judgements, that the statistician, however meritorious his rebuttal may have been, was left sputtering, helplessly pounding the lectern. So it seems with this commentary, which asks in an innocent yet seemingly unanswerable way, ‘If the population [of people] is not human, what is it?’ This is the leading edge of an attack on Fisher’s P-value which should still be required reading for all students of epidemiology and biostatistics today. The commentary shows us several things. First, it demonstrates just how old are some current criticisms, often presented as enlightened insights from a modern era. His first sentence has almost a nostalgic quality that looks surprising over 60 years later, ‘There was a time when we did not talk about tests of significance; we simply did them.’ These words described the future as much as the pre-1942 past. Second, although it may not be immediately obvious, the argument presented here is closely related to ones that underlie modern recommendations to use CI and even Bayesian methods in lieu of P-values in biomedical research. Third, Berkson makes important distinctions between hypothesis testing and significance tests that continue to be ignored today. Fourth, and perhaps most subtly, he brings in a notion of ‘evidence’, a positive, relative concept that is critical to have on the table as separate and distinct from the P-value. And finally, he provides modern statisticians with a model for how to communicate technical concepts to applied users in an accessible and lively way. All that said, it must be admitted that Berkson’s critique is frustratingly incomplete. While he offers a scathing critique of the P-value, and shows us how standard interpretations contravene scientific intuition (grounded mainly in appeals to common sense) he does not offer a real alternative. He does call for more research, particularly into the meaning of what he calls ‘middle P’s’. It is in this gap that I will spend most of my time in this commentary; linking his insights with the ‘further research’ that indeed occurred over the succeeding 60 years.
Article
Full-text available
The estimation of signal frequency count in the presence of background noise has had much discussion in the recent physics literature, and Mandelkern [1] brings the central issues to the statistical community, leading in turn to extensive discussion by statisticians. The primary focus however in [1] and the accompanying discussion is on the construction of a confidence interval. We argue that the likelihood function and p-value function provide a comprehensive presentation of the information available from the model and the data. This is illustrated for Gaussian and Poisson models with lower bounds for the mean parameter.
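
An illustrative sketch of a likelihood function and p-value function in a signal-plus-background Poisson setting; the observed count, background level, and lower-tail convention are assumptions of this example, not values from the paper.

import math

def poisson_pmf(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

def likelihood(s, n, b):
    # assumed model: observed count n ~ Poisson(s + b), signal s >= 0, known background b
    return poisson_pmf(n, s + b)

def p_value_function(s, n, b):
    """Lower-tail p-value function: P(N <= n) under mean s + b (mid-p corrections omitted)."""
    return sum(poisson_pmf(k, s + b) for k in range(n + 1))

n_obs, background = 7, 3.2
for s in (0.0, 2.0, 5.0, 10.0):
    print(f"s={s}: L={likelihood(s, n_obs, background):.4f}, "
          f"p(s)={p_value_function(s, n_obs, background):.3f}")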
Article
In Bayesian statistics, if the distribution of the data is unknown, then each plausible distribution of the data is indexed by a parameter value, and the prior distribution of the parameter is specified. To the extent that more complicated data distributions tend to require more coincidences for their construction than simpler data distributions, default prior distributions should be transformed to assign additional prior probability or probability density to the parameter values that refer to simpler data distributions. The proposed transformation of the prior distribution relies on the entropy of each data distribution as the relevant measure of complexity. The transformation is derived from a few first principles and extended to stochastic processes.
Article
Confidence sets, p values, maximum likelihood estimates, and other results of non-Bayesian statistical methods may be adjusted to favor sampling distributions that are simple compared to others in the parametric family. The adjustments are derived from a prior likelihood function previously used to adjust posterior distributions.
Book
Interpreting statistical data as evidence, Statistical Evidence: A Likelihood Paradigm focuses on the law of likelihood, fundamental to solving many of the problems associated with interpreting data in this way. Statistics has long neglected this principle, resulting in a seriously defective methodology. This book redresses the balance, explaining why science has clung to a defective methodology despite its well-known defects. After examining the strengths and weaknesses of the work of Neyman and Pearson and the Fisher paradigm, the author proposes an alternative paradigm which provides, in the law of likelihood, the explicit concept of evidence missing from the other paradigms. At the same time, this new paradigm retains the elements of objective measurement and control of the frequency of misleading results, features which made the old paradigms so important to science. The likelihood paradigm leads to statistical methods that have a compelling rationale and an elegant simplicity, no longer forcing the reader to choose between frequentist and Bayesian statistics.
Article
The p-value quantifies the discrepancy between the data and a null hypothesis of interest, usually the assumption of no difference or no effect. A Bayesian approach allows the calibration of p-values by transforming them to direct measures of the evidence against the null hypothesis, so-called Bayes factors. We review the available literature in this area and consider two-sided significance tests for a point null hypothesis in more detail. We distinguish simple from local alternative hypotheses and contrast traditional Bayes factors based on the data with Bayes factors based on p-values or test statistics. A well-known finding is that the minimum Bayes factor, the smallest possible Bayes factor within a certain class of alternative hypotheses, provides less evidence against the null hypothesis than the corresponding p-value might suggest. It is less known that the relationship between p-values and minimum Bayes factors also depends on the sample size and on the dimension of the parameter of interest. We illustrate the transformation of p-values to minimum Bayes factors with two examples from clinical research.
Article
Minimum Bayes factors are commonly used to transform two-sided p-values to lower bounds on the posterior probability of the null hypothesis. Several proposals exist in the literature, but none of them depends on the sample size. However, the evidence of a p-value against a point null hypothesis is known to depend on the sample size. In this article, we consider p-values in the linear model and propose new minimum Bayes factors that depend on sample size and converge to existing bounds as the sample size goes to infinity. It turns out that the maximal evidence of an exact two-sided p-value increases with decreasing sample size. The effect of adjusting minimum Bayes factors for sample size is shown in two applications.
Article
Empirical Bayes estimates of the local false discovery rate can reflect uncertainty about the estimated prior by supplementing their Bayesian posterior probabilities with confidence levels as posterior probabilities. This use of coherent fiducial inference with hierarchical models generates set estimators that propagate uncertainty to varying degrees. Some of the set estimates approach estimates from plug-in empirical Bayes methods for high numbers of comparisons and can come close to the usual confidence sets given a sufficiently low number of comparisons.
Article
An account is given of A. M. Turing's unpublished contributions to statistics during 1941 or 1940.
Article
Scientific inference is thought to be hypothetical-deductive: from given facts or experimental findings we infer laws or theories from which the facts follow or which account for the facts. This is an oversimplification, though, for the facts or findings are seldom logical consequences of the explanatory theory, but merely ‘agree’ with the theory. Bayes’ rule then enters as a more general scheme of hypothetical deduction: from given facts, to infer the most plausible theory that affords those facts highest probability.
Article
In this definitive book, D. R. Cox gives a comprehensive and balanced appraisal of statistical inference. He develops the key concepts, describing and comparing the main ideas and controversies over foundational issues that have been keenly argued for more than two-hundred years. Continuing a sixty-year career of major contributions to statistical thought, no one is better placed to give this much-needed account of the field. An appendix gives a more personal assessment of the merits of different ideas. The content ranges from the traditional to the contemporary. While specific applications are not treated, the book is strongly motivated by applications across the sciences and associated technologies. The mathematics is kept as elementary as feasible, though previous knowledge of statistics is assumed. The book will be valued by every user or student of statistics who is serious about understanding the uncertainty inherent in conclusions from statistical analyses.
Article
We live in a new age for statistical inference, where modern scientific technology such as microarrays and fMRI machines routinely produce thousands and sometimes millions of parallel data sets, each with its own estimation or testing problem. Doing thousands of problems at once is more than repeated application of classical methods. Taking an empirical Bayes approach, Bradley Efron, inventor of the bootstrap, shows how information accrues across problems in a way that combines Bayesian and frequentist ideas. Estimation, testing, and prediction blend in this framework, producing opportunities for new methodologies of increased power. New difficulties also arise, easily leading to flawed inferences. This book takes a careful look at both the promise and pitfalls of large-scale statistical inference, with particular attention to false discovery rates, the most successful of the new statistical techniques. Emphasis is on the inferential ideas underlying technical developments, illustrated using a large number of real examples.
Article
Model selection refers to a data-based choice among competing statistical models, for example choosing between a linear or a quadratic regression function. The most popular model selection techniques are based on interpretations of p-values, using a scale originally suggested by Fisher: .05 is moderate evidence against the smaller model, .01 is strong evidence, etc. Recent Bayesian literature, going back to work by Jeffreys, suggests a quite different answer to the model selection problem. Jeffreys provided an interpretive scale for Bayes factors, a scale which can be implemented in practice by use of the BIC (Bayesian Information Criterion.) The Jeffreys scale often produces much more conservative results, especially in large samples, so for instance a .01 p-value may correspond to barely any evidence at all against the smaller model. This paper tries to reconcile the two theories by giving an interpretation of Fisher's scale in terms of Bayes factors. A general interpretation is given which works fine when checked for the one-dimensional Gaussian problem, where standard hypothesis testing is seen to coincide with a Bayesian analysis that assumes stronger (more informative) priors than those used by the BIC. This argument fails in higher dimensions, where Fisher's scale must be made more conservative in order to get a proper Bayes justification.
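
To see the size of the effect, the sketch below uses the standard one-parameter BIC approximation to the Bayes factor, exp((z^2 - ln n)/2), which is in the spirit of, but not necessarily identical to, the article's analysis: a fixed p value of .01 carries less and less evidence as the sample size grows.

import math
from statistics import NormalDist

def bic_bayes_factor(p_two_sided, n):
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return math.exp((z**2 - math.log(n)) / 2)     # approximate Bayes factor of alternative vs. null

for n in (20, 200, 2000, 20000):
    print(n, round(bic_bayes_factor(0.01, n), 2))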
Article
P values (or significance probabilities) have been used in place of hypothesis tests as a means of giving more information about the relationship between the data and the hypothesis than does a simple reject/do not reject decision. Virtually all elementary statistics texts cover the calculation of P values for one-sided and point-null hypotheses concerning the mean of a sample from a normal distribution. There is, however, a third case that is intermediate to the one-sided and point-null cases, namely the interval hypothesis, that receives no coverage in elementary texts. We show that P values are continuous functions of the hypothesis for fixed data. This allows a unified treatment of all three types of hypothesis testing problems. It also leads to the discovery that a common informal use of P values as measures of support or evidence for hypotheses has serious logical flaws.
Article
The normalized maximum likelihood (NML) is a recent penalized likelihood that has properties that justify defining the amount of discrimination information (DI) in the data supporting an alternative hypothesis over a null hypothesis as the logarithm of an NML ratio, namely, the alternative hypothesis NML divided by the null hypothesis NML. The resulting DI, like the Bayes factor but unlike the P-value, measures the strength of evidence for an alternative hypothesis over a null hypothesis such that the probability of misleading evidence vanishes asymptotically under weak regularity conditions and such that evidence can support a simple null hypothesis. Instead of requiring a prior distribution, the DI satisfies a worst-case minimax prediction criterion. Replacing a (possibly pseudo-) likelihood function with its weighted counterpart extends the scope of the DI to models for which the unweighted NML is undefined. The likelihood weights leverage side information, either in data associated with comparisons other than the comparison at hand or in the parameter value of a simple null hypothesis. Two case studies, one involving multiple populations and the other involving multiple biological features, indicate that the DI is robust to the type of side information used when that information is assigned the weight of a single observation. Such robustness suggests that very little adjustment for multiple comparisons is warranted if the sample size is at least moderate.
Best known in our circles for his key role in the renaissance of low-density parity-check (LDPC) codes, David MacKay has written an ambitious and original textbook. Almost every area within the purview of these TRANSACTIONS can be found in this book: data compression algorithms, error-correcting codes, Shannon theory, statistical inference, constrained codes, classification, and neural networks. The required mathematical level is rather minimal beyond a modicum of familiarity with probability. The author favors exposition by example, there are few formal proofs, and chapters come in mostly self-contained morsels richly illustrated with all sorts of carefully executed graphics. With its breadth, accessibility, and handsome design, this book should prove to be quite popular. Highly recommended as a primer for students with no background in coding theory, the set of chapters on error-correcting codes is an excellent brief introduction to the elements of modern sparse-graph codes: LDPC, turbo, repeat-accumulate, and fountain codes are described clearly and succinctly. As a result of the author's research in the field, the nine chapters on neural networks receive the deepest and most cohesive treatment in the book. Under the umbrella title of Probability and Inference we find a medley of chapters encompassing topics as varied as the Viterbi algorithm and the forward-backward algorithm, Monte Carlo simulation, independent component analysis, clustering, Ising models, the saddle-point approximation, and a sampling of decision theory topics. The chapters on data compression offer good coverage of Huffman and arithmetic codes, and we are rewarded with material not usually encountered in information theory textbooks such as hash codes and efficient representation of integers. The expositions of the memoryless source coding theorem and of the achievability part of the memoryless channel coding theorem stick closely to the standard treatment in (1), with a certain tendency to oversimplify. For example, the source coding theorem is verbalized as: "N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost." Although no treatment of rate-distortion theory is offered, the author gives a brief sketch of the achievability of rate with bit-error rate, and the details of the converse proof of that limit are left as an exercise. Neither Fano's inequality nor an operational definition of capacity puts in an appearance. Perhaps his quest for originality is what accounts for MacKay's proclivity to fail to call a spade a spade. Almost-lossless data compression is called "lossy compression"; a vanilla-flavored binary hypothesis ...
Article
By restricting the possible values of the proportion of null hypotheses that are true, the local false discovery rate (LFDR) can be estimated using as few as one comparison. The proportion of proteins with equivalent abundance was estimated to be about 20% for patient group I and about 90% for group II. The simultaneously-estimated LFDRs give approximately the same inferences as individual-protein confidence levels for group I but are much closer to individual-protein LFDR estimates for group II. Simulations confirm that confidence-based inference or LFDR-based inference performs markedly better for low or high proportions of true null hypotheses, respectively.
Article
P-values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the hypothesis or, even worse, as the posterior probability that the hypothesis is true. The fact that these interpretations can be completely misleading when testing precise hypotheses is first reviewed, through consideration of two revealing simulations. Then two calibrations of a p-value are developed, the first being interpretable as odds and the second as either a (conditional) frequentist error probability or as the posterior probability of the hypothesis. Key words and phrases: Bayes factors; Bayesian robustness; conditional frequentist error probabilities; odds; surprise. 1. Introduction: In statistical analysis of data X, one is frequently working, at a given moment, with an entertained model or hypothesis H0: X ~ f(x), where f(x) is a continuous density. A statistic T(X) ...
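
The two calibrations discussed here are commonly written as -e p ln p (odds of the null to the alternative) and 1 / (1 + 1 / (-e p ln p)) (a conditional error probability or posterior probability of the null), valid for p < 1/e; a minimal computation:

import math

def odds_calibration(p):
    return -math.e * p * math.log(p)

def error_probability_calibration(p):
    b = odds_calibration(p)
    return b / (1 + b)

for p in (0.05, 0.01, 0.005, 0.001):
    print(p, round(odds_calibration(p), 3), round(error_probability_calibration(p), 3))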
Article
Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science surveys. This paper uses false discovery rate methods to carry out both size and power calculations on large-scale problems. A simple empirical Bayes approach allows the false discovery rate (fdr) analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions. Closed-form accuracy formulas are derived for estimated false discovery rates, and used to compare different methodologies: local or tail-area fdr's, theoretical, permutation, or empirical null hypothesis estimates. Two microarray data sets as well as simulations are used to evaluate the methodology, the power diagnostics showing why nonnull cases might easily fail to appear on a list of ``significant'' discoveries.
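
A minimal empirical Bayes sketch in the spirit of the two-groups setting described above, with a theoretical N(0,1) null and an assumed null proportion; Efron's own estimators fit the mixture density and the null proportion more carefully, for example by Poisson regression on a histogram.

import random, math, statistics

random.seed(1)
null_z = [random.gauss(0, 1) for _ in range(9000)]
alt_z = [random.gauss(3, 1) for _ in range(1000)]
z = null_z + alt_z

def mixture_density(x, sample, bandwidth=0.3):
    """Crude kernel estimate of the marginal density f(z)."""
    return statistics.fmean(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) / (bandwidth * math.sqrt(2 * math.pi))
        for s in sample)

def local_fdr(x, pi0=0.9):
    # pi0 is assumed known here; in practice it is estimated from the data
    f0 = math.exp(-0.5 * x**2) / math.sqrt(2 * math.pi)    # theoretical null density
    return min(1.0, pi0 * f0 / mixture_density(x, z))

for x in (0.0, 2.0, 3.0, 4.0):
    print(x, round(local_fdr(x), 3))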
Bickel, D. R. Computable priors sharpened into Occam's razors. Working paper, HAL-01423673.
Bickel, D. R. (2018). An explanatory rationale for priors sharpened into Occam's razors. Working paper, https://doi.org/10.5281/zenodo.1412875
Jeffreys, H. Theory of Probability.