Article

We Agree That Statistical Significance Proves Essentially Nothing: A Rejoinder to Thomas Mayer


Abstract

In several dozen journal reviews and in many other comments we have received, from, among others, four Nobel laureates, the statistician Dennis Lindley (2012), the statistician Arnold Zellner (2004), the mathematician Olle Häggström (2010), the sociologist Steve Fuller (2008), and the historian Theodore Porter (2008), no one has ever tried to defend null hypothesis significance testing and its numerous errors. Recent articles by Thomas Mayer (2012, 2013), commenting on our book The Cult of Statistical Significance, are no exception. Of the five major claims we make in our book about the theory and practice of significance testing in economics, Mayer strongly agrees with four. On the fifth claim our disagreement is a matter of degree, not of kind, with no substantive change in results. Overall, Mayer agrees with us and with the new and growing consensus that statistical significance proves essentially nothing and has to change.


... A good overview of the issues regarding statistical versus economic significance is provided by Mayer (2013) and the further response to Mayer by Ziliak and McCloskey (2013). An excellent analysis and explanation for the relative absence of replication in economics is provided by Galiani et al. (2017), although again, the challenge of replicating results is hardly unique to economics. ...
... Model 2 examines hypothesis 2. A significant and negative coefficient of −0.043 is reported for recognition and disclosure fraud, and a non-significant coefficient is estimated for disclosure fraud. Consistent with hypothesis 2, investors perceive ... Economic significance is a measure of the importance of a relationship and considers the magnitude of the estimated coefficients (Ziliak & McCloskey, 2013). Apart from parametric t-tests, a nonparametric Wilcoxon signed-rank test is also applied. ...
Article
This study examines the impact of different punishments for Chinese accounting fraud on shareholder valuation of firms between 2007 and 2016. From an examination of both monetary and non-monetary ‘name and shame’ penalties, it is reported that all punishments have a negative and significant impact on the shareholder wealth of fraudulent firms. Investors perceive punishments involving monetary penalties far more severely than the non-monetary punishments used to combat accounting fraud. Stock market reactions are also sensitive to the type of fraud committed, with manipulation of recognition and disclosure fraud viewed more negatively by investors than fraud related to disclosure. Information leakage to capital markets prior to the announcement of punishments is also observed. It is proposed that fines have been relatively more effective than ‘name and shame’ punishments in addressing Chinese accounting fraud during the last decade, due not least to information leakage. Full article available at: https://www.sciencedirect.com/science/article/abs/pii/S0890838919300307
... Second, it is also possible that some of the recorded differences may be evolutionarily important, even when statistically non-significant [Anderson, 2000; Burnham & Anderson, 2002; Garamszegi, Calhim, Dochtermann, et al., 2009]. This is an extremely important consideration that continues to be largely ignored in the "small sample paradigm" characteristic of field primatology, despite an extensive literature on the topic [Cohen, 1994; Colquhoun, 2014; Fernandez-Duque, 1997; Ziliak & McCloskey, 2013]. Much of the time we lack adequate biological information on the consequences that a change in the variable of interest may cause, and an exclusive reliance on the statistical significance of a test may prove uninformative at best, if not misleading. ...
Article
Full-text available
Using published and new data from a population of monogamous owl monkeys in the Argentinean Chaco, I examine the hypothesis that social monogamy is a default social system imposed upon males because the spatial and/or temporal distribution of resources and females makes it difficult for a single male to defend access to more than one mate. First, I examine a set of predictions on ranging patterns, use of space, and population density. This first section is followed by a second one considering predictions related to the abundance and distribution of food. Finally, I conclude with a section attempting to link the ranging and ecological data to demographic and life-history parameters as proxies for reproductive success. In support of the hypothesis, owl monkey species do live at densities (7–64 ind/km²) that are predicted for monogamous species, but groups occupy home ranges and core areas that vary substantially in size, with pronounced overlap of home ranges, but not of core areas. There are strong indications that the availability of food sources in the core areas during the dry season may be of substantial importance for regulating social monogamy in owl monkeys. Finally, none of the proxies for the success of groups were strongly related to the size of the home range or core area. The results I present do not conclusively support any single explanation for the evolution of social monogamy in owl monkeys, but they help us to better understand how it may function. Moreover, the absence of conclusive answers linking ranging, ecology, and reproductive success with the evolution of social monogamy in primates offers renewed motivation for continuing to explore the evolution of monogamy in owl monkeys. Am. J. Primatol. © 2015 Wiley Periodicals, Inc.
Article
Null hypothesis significance testing (NHST) provides an important statistical toolbox, but there are a number of ways in which it is often abused and misinterpreted, with bad consequences for the reliability and progress of science. Parts of the contemporary NHST debate, especially in the psychological sciences, are reviewed, and a suggestion is made that a new distinction between strongly, weakly, and very weakly anti-NHST positions is likely to bring added clarity to the debate.
Article
A science, business, or law that is basing its validity on the level of p-values, t statistics and other tests of statistical significance is looking less and less relevant and more and more unethical. Today’s economist uses a lot of wit putting a clever index of opportunity cost into his models; but then, like the amnesiac, he fails to see opportunity cost in statistical estimates he makes of those same models. Medicine, psychology, pharmacology and other fields are similarly damaged by this fundamental error of science, keeping bad treatments on the market and good ones out. A few small changes to the style of the published research paper using statistical methods can bring large beneficial effects to more than academic research papers. It is suggested that misuse of statistical significance be added to the definition of scientific misconduct currently enforced by the NIH, NSF, Office of Research Integrity and others.
Article
Full-text available
Null Hypothesis Significance Testing (NHST) has been a mainstay of the social sciences for empirically examining hypothesized relationships, and the main approach for establishing the importance of empirical results. NHST is the foundation of classical or frequentist statistics. The approach is designed to test the probability of generating the observed data if no relationship exists between the dependent and independent variables of interest, recognizing that the results will vary from sample to sample. This paper is intended to evaluate the state of the criminological and criminal justice literature with respect to the correct application of NHST. We apply a modified version of the instrument used in two reviews of the economics literature by McCloskey and Ziliak to code 82 articles in criminology and criminal justice. We have selected three sources of papers: Criminology, Justice Quarterly, and a recent review of experiments in criminal justice by Farrington and Welsh. We find that most researchers provide the basic information necessary to understand effect sizes and analytical significance in tables which include descriptive statistics and some standardized measure of size (e.g., betas, odds ratios). On the other hand, few of the articles mention statistical power and even fewer discuss the standards by which a finding would be considered large or small. Moreover, less than half of the articles distinguish between analytical significance and statistical significance, and most articles used the term ‘significance’ in ambiguous ways.
Article
Deirdre McCloskey and Stephen Ziliak have graciously replied to my essay titled "Ziliak and McCloskey on Statistical Significance: An Assessment." Only a few of McCloskey and Ziliak's extensive criticisms are valid or partially valid, and these relate to points that can readily be dropped without materially weakening my conclusions. In particular, McCloskey and Ziliak do not engage my estimate of how often or how egregiously economists confuse statistical significance and oomph.
Article
Stephen Ziliak and D. N. McCloskey have sharply criticized the prevailing use of significance tests. Their work has, in turn, come under vigorous attack. The vehemence of the debate may induce readers to wrongly dismiss it as a "he said-she said" debate, or else to take sides in an unbending way that does not do justice to valid points raised by the other side. This paper aims at a more balanced reading. While Ziliak and McCloskey claim that a substantial majority of economists who use significance tests confuse statistical with substantive significance, or commit the logical error of the transposed conditional, I argue that such errors are much less frequent than they claim, though still much too pervasive. They also argue that since significance tests focus on the existence of an effect rather than on its size, the tests do not answer scientific questions. I respond with counter-examples. Ziliak and McCloskey also complain that significance tests ignore loss functions. I argue that loss functions should be introduced only at a later stage. Ziliak and McCloskey are correct, however, that confidence intervals deserve much more emphasis. The most valuable message of their work is that significance tests should be treated less mechanically.
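The point that confidence intervals deserve more emphasis can be made concrete with a small simulation. This is an illustrative sketch, not taken from the paper: the effect size (0.4), sample size, and random seed are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
data = 0.4 + rng.standard_normal(n)   # assumed true effect of 0.4 with unit noise

mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)    # standard error of the mean
crit = 2.010                          # two-sided 95% t critical value, 49 df
lo, hi = mean - crit * se, mean + crit * se
print(f"estimate {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Unlike a bare p-value, the interval reports the range of effect sizes compatible with the data, so a reader can judge magnitude and precision at once.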
Article
The Cult of Statistical Significance. How the Standard Error Costs Us Jobs, Justice, and Lives. By Stephen T. Ziliak and Deirdre N. McCloskey. University of Michigan Press, Ann Arbor, 2008. 348 pp. 75,£48.95.ISBN9780472070077.Paper,75, £48.95. ISBN 9780472070077. Paper, 24.95, £16.50. ISBN 9780472050079. Economics, Cognition, and Society. Through their consideration of a wide range of fields (including their home discipline of economics, psychology, epidemiology, and biomedical science), the authors argue that statistical significance is far too often mistakenly applied--to the detriment of science and society.
Article
Significance testing as used has no theoretical justification. Our article in the Journal of Economic Literature (1996) showed that of the 182 full-length papers published in the 1980s in the American Economic Review 70% did not distinguish economic from statistical significance. Since 1996 many colleagues have told us that practice has improved. We interpret their response as an empirical claim, a judgment about a fact. Our colleagues, unhappily, are mistaken: significance testing is getting worse. We find here that in the next decade, the 1990s, of the 137 papers using a test of statistical significance in the AER fully 82% mistook a merely statistically significant finding for an economically significant finding. A supermajority (81%) believed that looking at the sign of a coefficient sufficed for science, ignoring size. The mistake is causing economic damage: losses of jobs and justice, and indeed of human lives (especially in, to mention another field enchanted with statistical significance as against substantive significance, medical science). The confusion between fit and importance is causing false hypotheses to be accepted and true hypotheses to be rejected. We propose a publication standard for the future: “Tell me the oomph of your coefficient; and do not confuse it with merely statistical significance.”
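The confusion documented in this survey, between a coefficient's statistical significance and its size, can be illustrated with a small simulation. This is a hypothetical sketch (the sample size, effect size, and seed are assumptions, not figures from the paper): with a large enough sample, a substantively trivial effect is almost guaranteed to clear the conventional 5% threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
true_effect = 0.01                    # assumed effect: tiny on any substantive reading
data = true_effect + rng.standard_normal(n)

mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)    # standard error shrinks as 1/sqrt(n)
t_stat = mean / se                    # t statistic for H0: effect = 0
print(f"estimate={mean:.4f}, t={t_stat:.1f}")
```

The estimated effect of roughly 0.01 is negligible, yet its t statistic is far above the conventional cutoff of about 2: statistical significance here reports the precision of the estimate, not its oomph.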
Article
After William Gosset (1876-1937), the 'Student' of Student's t, the best statisticians have distinguished economic (or agronomic or psychological or medical) significance from merely statistical 'significance' at conventional levels. A singular exception among the best was Ronald A. Fisher, who argued in the 1920s that statistical significance at the 0.05 level is a necessary and sufficient condition for establishing a scientific result. After Fisher many economists and some others - but rarely physicists, chemists, and geologists, who seldom use Fisher-significance - have mixed up the two kinds of significance. We have been writing on the matter for some decades, with other critics in medicine, sociology, psychology, and the like. Hoover and Siegler, despite a disdainful rhetoric, agree with the logic of our case. Fisherian 'significance,' they agree, is neither necessary nor sufficient for scientific significance. But they claim that economists already know this and that Fisherian tests can still be used for specification searches. Neither claim seems to be true. Our massive evidence that economists get it wrong appears to hold up. And if rhetorical standards are needed to decide the importance of a coefficient in the scientific conversation, so are they needed when searching for an equation to fit. Fisherian 'significance' signifies nearly nothing, and empirical economics as actually practiced is in crisis.
Article
For more than 20 years, Deirdre McCloskey has campaigned to convince the economics profession that it is hopelessly confused about statistical significance. She argues that many practices associated with significance testing are bad science and that most economists routinely employ these bad practices: ‘Though to a child they look like science, with all that really hard math, no science is being done in these and 96 percent of the best empirical economics …’ (McCloskey 1999). McCloskey's charges are analyzed and rejected. That statistical significance is not economic significance is a jejune and uncontroversial claim, and there is no convincing evidence that economists systematically mistake the two. Other elements of McCloskey's analysis of statistical significance are shown to be ill-founded, and her criticisms of practices of economists are found to be based in inaccurate readings and tendentious interpretations of those economists' work. Properly used, significance tests are a valuable tool for assessing signal strength, for assisting in model specification, and for determining causal structure.
Article
Over the last decade, criticisms of null-hypothesis significance testing have grown dramatically, and several alternative practices, such as confidence intervals, information-theoretic, and Bayesian methods, have been advocated. Have these calls for change had an impact on the statistical reporting practices in conservation biology? In 2000 and 2001, 92% of sampled articles in Conservation Biology and Biological Conservation reported results of null-hypothesis tests. In 2005 this figure dropped to 78%. There were corresponding increases in the use of confidence intervals, information-theoretic, and Bayesian techniques. Of those articles reporting null-hypothesis testing (which still easily constitute the majority), very few report statistical power (8%) and many misinterpret statistical nonsignificance as evidence for no effect (63%). Overall, results of our survey show some improvements in statistical practice, but further efforts are clearly required to move the discipline toward improved practices.
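The finding that many articles read nonsignificance as evidence of no effect connects directly to statistical power, which can be shown with a small simulation. The numbers here (effect size 0.3, n = 20, seed) are assumptions for illustration, not figures from the survey: when power is low, a real effect routinely fails to reach significance.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, sigma, n, reps = 0.3, 1.0, 20, 5000
crit = 2.093  # two-sided 5% critical value for Student's t with n-1 = 19 df

# Simulate many small studies of a real effect and run a one-sample
# t-test of H0: mean = 0 in each one.
samples = true_effect + sigma * rng.standard_normal((reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
power = np.mean(np.abs(means / ses) > crit)
print(f"share of studies reaching p < 0.05: {power:.2f}")
```

Under these assumed numbers only about a quarter of such studies come out significant, so a nonsignificant result is the expected outcome even though the effect is real: absence of evidence is not evidence of absence.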
Article
Leading theorists and econometricians agree with our two main points: first, that economic significance usually has nothing to do with statistical significance and, second, that a supermajority of economists do not explore economic significance in their research. The agreement from Arrow to Zellner on the two main points should by itself change research practice. This paper replies to our critics, showing again that economic significance is what science and citizens want and need.
Article
Models are neither true nor false. Models are sometimes useful and sometimes misleading. The suggestion that “Size Matters” implicitly accepts the truthfulness goal and deploys a truthfulness metric for measuring the distance between the data and the model. With this way of thinking, we would be led decisively to “reject” a road map that has freeways colored red.
Article
The authors argue that economics as a science is held back by the use of “asterisk” reporting. We concur that many authors do a poor job of disseminating their results; however, we consider the call to reject the tools a little overblown. The tools are very useful for many of the most important parts of economics as a science, such as testing theories. Also, the argument that poor reporting holds back science is made by showing that the ‘X’ variable is low, not that it has an impact on the ‘Y’ of scientific progress. The impact may be zero and insignificant!