Article

The significance of the significance test controversy: Comments on 'Size Matters'

Authors:
P. Lunt

Abstract

The concerns expressed and issues raised by Ziliak and McCloskey concerning the use of tests of statistical significance have also been raised in sociology and psychology. This paper examines the similarities and differences between the way the significance test controversy has been discussed in the wider social sciences and Ziliak and McCloskey's discussion of its implications for economics. The issues are similar in that the same kinds of effects of over-reliance on the rejection of the null hypothesis are identified: mistakes and errors in statistical inference, the failure to use diagnostic statistics to qualify and guide statistical inference, and the broader impact on the field of an accumulation of Type II errors and a lack of innovation. There is considerable agreement on these points between the sociologists, psychologists and economists who are concerned about these issues. However, there are also important differences, which are discussed in this response. In particular, in the other social sciences the significance test controversy has broadened out and has been linked first to discussions of the limitations of experimental and correlational designs and then to a wider critique of positivism and scientism in the social sciences. Without this broader context, the significance of the significance test controversy is understood in a more restricted way, as a technical problem with widespread effects, whereas in the social sciences it has been understood as symptomatic of broader disciplinary commitments in theory and purpose.


... For a variety of practical and ethical reasons, analysts find themselves faced with the rather more 'passive' analysis of datasets, over which they have no control. The problem with this 'post hoc dredging of sullen datasets' (Gorard 2006a) is that the statistical methods usually involved were designed for use only in active research (Lunt 2004). ...
... Without a controlled trial, the direct link between a hypothesis and its testing in practice disappears, and is replaced by a much weaker form of 'test', such as those based on probability and significance. The results of these can be very misleading (Lunt 2004). For, in most research situations, it is not sampling variation that is the key to understanding and unlocking the process (Ziliak and McCloskey 2004). ...
Article
Full-text available
This paper compares the official value‐added scores in 2005 for all primary schools in three adjacent Local Education Authorities (LEAs) in England with the raw‐score Key Stage 2 (KS2) results for the same schools. The correlation coefficient for the raw‐ and value‐added scores of these 457 schools is around +0.75. Scatterplots show that there are no low attaining schools with average or higher value‐added, and no high attaining schools with below average value‐added. At least some of the remaining scatter is explained by the small size of some schools. Although some relationship between these measures is to be expected – so that schools adding considerable value would tend to have high examination outcome scores – the relationship shown is too strong for this explanation to be considered sufficient. Value‐added analysis is intended to remove the link between schools' intake scores and their raw‐score outcomes at KS2. It should lead to an estimate of the differential progress made by pupils, assessed between schools. In fact, however, the relationship between value‐added and raw scores is of the same size as the original relationship between intake scores and raw‐scores that the value‐added is intended to overcome. Therefore, however appealing the calculation of value‐added figures is, their development is still at the stage where they are not ready to move from being a research tool to an instrument of judgement on schools. Such figures may mislead parents, governors and teachers and, even more importantly, they are being used in England by the Office for Standards in Education (Ofsted) to pre‐determine the results of school inspections.
... For a variety of practical and ethical reasons, they find themselves faced with the rather more 'passive' analysis of datasets, over which they have no control. The problem with this 'post hoc dredging of sullen datasets' (Gorard 2006a) is that the methods involved were designed for use only in active research (Lunt 2004). ...
... Without a controlled trial, the direct link between a hypothesis and its testing in practice disappears, and is replaced by a much weaker form of 'test', such as those based on probability and significance. The results of these can be very misleading (Lunt 2004). For, in most research situations, it is not sampling variation that is the key to understanding and unlocking the process (Ziliak and McCloskey 2004). ...
Article
Full-text available
This paper compares the official value-added scores in 2005 for all primary schools in three adjacent LEAs in England with the raw-score Key Stage 2 results for the same schools. The correlation coefficient for the raw- and value-added scores of these 457 schools is around +0.75. Scatterplots show that there are no low attaining schools with average or higher value-added, and no high attaining schools with below average value-added. At least some of the remaining scatter is explained by the small size of some schools. Although some relationship between these measures is to be expected - so that schools adding considerable value would tend to have high examination outcome scores - the relationship shown is too strong for this explanation to be considered sufficient. By way of illustration, value-added analysis is intended to remove the link between schools' intake scores (of attainment or social background of pupils) and their raw-score outcomes at KS2. It should be an estimate of the differential progress made by pupils, assessed between schools. In fact, however, the relationship between value-added and raw scores is of the same size as the original relationship between intake scores and raw-scores that the value-added is intended to overcome. Therefore, however appealing the calculation of value-added figures is, their development is still at the stage where they are not ready to move from being a fascinating research tool to an instrument of judgement on schools. Such figures may mislead parents, governors and teachers and, even more importantly, they are being used in England by OFSTED to pre-determine the results of school inspections.
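Read as a how-to, the comparison at the heart of this paper reduces to a single school-level correlation between raw and value-added scores. The sketch below shows that calculation in Python; the file name and column names are placeholders invented for illustration, not the official 2005 dataset.

```python
# Minimal sketch of the paper's core comparison: correlate each school's
# raw Key Stage 2 score with its official value-added score.
# The file name and column names are hypothetical placeholders.
import pandas as pd

schools = pd.read_csv("primary_schools_2005.csv")  # hypothetical file

# Pearson correlation between the two school-level measures.
r = schools["ks2_raw_score"].corr(schools["value_added_score"])
print(f"Correlation between raw and value-added scores: r = {r:.2f}")

# A scatterplot makes the reported pattern visible: no low-attaining
# schools with high value-added, and no high-attaining schools with low.
ax = schools.plot.scatter(x="ks2_raw_score", y="value_added_score")
ax.figure.savefig("raw_vs_value_added.png")
```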
... Although this conforms with standard practice in applied quantitative sociology, which usually ignores it. It was also not discussed in the Ziliak & McCloskey (2004) article that BCL draw on, nor in most of the comments on this article (Elliott & Granger, 2004; Horowitz, 2004; Leamer, 2004; Lunt, 2004; O'Brien, 2004; Thorbecke, 2004; Wooldridge, 2004), except for Zellner (2004). ...
... I further evaluate the economic significance of my findings, since statistical significance by itself may not be enough (Thorbecke, 2004; Elliott and Granger, 2004; Lunt, 2004; Engsted, 2009). As Ziliak and McCloskey argue, "A merely statistical significance cannot substitute for the judgment of a scientist and her community about the largeness or smallness of a coefficient by standards of scientific or policy oomph" (2004: 528). ...
... While the informal reasoning that a researcher uses to argue for the substantive significance of a result is usually sound, s/he may have a difficult time communicating to others precisely how the strength and uncertainty of an effect, along with a scientific aversion to mistakenly accepting the existence of a relationship between X and Y where none exists, combine to form his/her judgment (Lunt, 2004). It may be difficult for that researcher to use consistent standards across multiple studies, particularly when those studies are in different substantive areas. ...
Article
While formal tests exist for statistical significance, researchers have traditionally relied on informal arguments to demonstrate the substantive significance of their results. To improve the transparency and consistency of judgments of substantive significance, I introduce a formal test for the existence of a substantively meaningful relationship in quantitative data. The test takes a rational choice perspective toward evidence, using Bayesian statistical decision theory to ask whether it makes sense to believe in the existence of a statistical relationship given a researcher's view of the consequences of correct and incorrect decisions. The test generates a critical test statistic c with a clear interpretation: if a relationship of size c is not important enough to influence future research and policy advice, then the evidence does not support the existence of a substantively significant effect.
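The abstract describes the general shape of such a test rather than its details, but the underlying Bayesian decision-theoretic idea can be sketched in a few lines: approximate the posterior for the effect, state the smallest effect that would matter and the relative costs of the two kinds of mistake, and choose whichever decision has the lower expected loss. The sketch below is only an illustration of that idea; the numbers, the flat-prior normal approximation, and the loss values are invented, and it is not the author's published c statistic or accompanying software.

```python
# Illustrative sketch of a Bayesian decision-theoretic judgment of
# substantive significance. All numbers and assumptions are hypothetical.
from scipy import stats

# Posterior for the effect, approximated (under a flat prior) as
# Normal(point estimate, standard error) from some fitted model.
estimate, std_error = 0.15, 0.06
posterior = stats.norm(loc=estimate, scale=std_error)

# Smallest effect the researcher would treat as substantively meaningful,
# plus stylised costs for the two kinds of mistaken decision.
threshold = 0.10            # anything smaller would not change advice
cost_false_positive = 3.0   # acting when the true effect is below threshold
cost_false_negative = 1.0   # ignoring it when the true effect is above it

# Posterior probability that the effect is substantively meaningful.
p_meaningful = 1 - posterior.cdf(threshold)

# Expected loss of each decision under the posterior.
loss_act = cost_false_positive * (1 - p_meaningful)
loss_ignore = cost_false_negative * p_meaningful

decision = ("treat as substantively significant"
            if loss_act < loss_ignore else "withhold judgement")
print(f"P(effect > {threshold}) = {p_meaningful:.2f}; decision: {decision}")
```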
... Perhaps the two most important reasons why I believe the path of encouraging such complex analyses is a dead-end are that better quality data would make MLM unnecessary, and that MLM inhibits the wider critique of research that is its only guarantor of quality. Our current analytic problems stem partly from the extension of the logic of experimentation to other contexts, such as regression analysis (Lunt, 2004). But, as Gigerenzer (2004) reminds us, complex modeling is not any kind of test of a proposition generated by the same data as that used for the model. ...
Article
This paper presents an argument against the wider adoption of complex forms of data analysis, using multi‐level modeling (MLM) as an extended case study. MLM was devised to overcome some deficiencies in existing datasets, such as the bias caused by clustering. The paper suggests that MLM has an unclear theoretical and empirical basis, has not led to important practical research results, is largely unnecessary due to the availability of a range of alternatives, and is therefore overly complex, making research reports harder to read and understand for a general audience for no apparent analytic gain. Above all, the paper shows via examples that MLM has made so little difference in practice that it is worth us also considering its analytic costs. These costs include the promotion of an educational form of 'asterisk economics', the creation of unworkable premises such as denying the existence of population data, and the tension between the contradictory need for both a large and small number of cases at each sub‐level. The paper concludes by outlining a substantial number of alternative methods of analysis which can have the same effect as MLM in examining structures in the data or overcoming the problems caused by clustering. Many of these alternatives are easier to use and understand, do not require specialist software, and avoid the problems—such as having to ignore cases with missing variables—created by the use of MLM.
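One common way of handling clustered observations without a multi-level model (whether or not it is among the specific alternatives the paper lists) is to keep an ordinary single-level regression and adjust the standard errors for clustering. The sketch below illustrates this with invented pupil and school data; the variable names, effect size and sample sizes are assumptions made only for the example.

```python
# Minimal sketch of one alternative to MLM for clustered data: ordinary
# least squares with cluster-robust standard errors. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_schools, pupils_per_school = 50, 30

# Pupils nested in schools, with a shared school-level disturbance.
school = np.repeat(np.arange(n_schools), pupils_per_school)
school_effect = rng.normal(0, 0.5, n_schools)[school]
intake = rng.normal(0, 1, n_schools * pupils_per_school)
outcome = 0.6 * intake + school_effect + rng.normal(0, 1, len(intake))

df = pd.DataFrame({"outcome": outcome, "intake": intake, "school": school})

# Plain OLS, but with standard errors clustered at the school level,
# so the inference allows for pupils within a school being alike.
model = smf.ols("outcome ~ intake", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school"]}
)
print(model.summary().tables[1])
```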
... Yet, although we know a great deal about how to formalize rational decisions under uncertainty, such as the one just described, we rarely bring this knowledge to bear when making decisions using statistical results. The result is a lack of transparent and consistent standards for substantive significance, which makes it unclear how the magnitude and uncertainty of a result combine with the researcher's preferences about mistaken decisions to produce the resulting judgment (Horowitz, 2004; Lunt, 2004). ...
Article
We propose a critical statistic c* for determining the substantive significance of an empirical result, which we define as the degree to which it justifies a particular decision (such as the decision to accept or reject a theoretical hypothesis), and provide software tools for calculating c* for a wide variety of models. Our procedure, which is built on ideas from Bayesian statistical decision theory, helps researchers improve the objectivity, transparency, and consistency of their assessments of substantive significance.
... From this perspective, alternative approaches such as Bayesian statistics, Neyman-Pearson decision theory, non-parametric tests, and Tukey exploratory techniques should at least supplement if not replace NHST (Maltz 1994; Zellner 2004). A second group of critics takes a more philosophical approach, focusing on the limitations of experimental and correlational designs in social science (Lunt 2004). These critics would prefer more qualitative evidence and theoretical development and less 'rank empiricism.' In ...
Article
Full-text available
Null Hypothesis Significance Testing (NHST) has been a mainstay of the social sciences for empirically examining hypothesized relationships, and the main approach for establishing the importance of empirical results. NHST is the foundation of classical or frequentist statistics. The approach is designed to test the probability of generating the observed data if no relationship exists between the dependent and independent variables of interest, recognizing that the results will vary from sample to sample. This paper is intended to evaluate the state of the criminological and criminal justice literature with respect to the correct application of NHST. We apply a modified version of the instrument used in two reviews of the economics literature by McCloskey and Ziliak to code 82 articles in criminology and criminal justice. We have selected three sources of papers: Criminology, Justice Quarterly, and a recent review of experiments in criminal justice by Farrington and Welsh. We find that most researchers provide the basic information necessary to understand effect sizes and analytical significance in tables which include descriptive statistics and some standardized measure of size (e.g., betas, odds ratios). On the other hand, few of the articles mention statistical power and even fewer discuss the standards by which a finding would be considered large or small. Moreover, less than half of the articles distinguish between analytical significance and statistical significance, and most articles used the term ‘significance’ in ambiguous ways.
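The gap the authors describe between statistical and analytical (substantive) significance is easy to reproduce numerically: with a large enough sample, a negligible difference becomes "statistically significant", which is exactly why effect sizes and power matter alongside p-values. The sketch below is a minimal illustration with invented numbers; the effect size, sample sizes and seed are assumptions, not figures from the review.

```python
# Minimal sketch (hypothetical numbers): a substantively trivial difference
# between two groups is "statistically significant" once the sample is large,
# so the p-value alone says nothing about the size of the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two groups differing by 0.02 standard deviations -- trivial in practice.
treated = rng.normal(loc=0.02, scale=1.0, size=200_000)
control = rng.normal(loc=0.00, scale=1.0, size=200_000)

t_stat, p_value = stats.ttest_ind(treated, control)

# Standardised effect size (Cohen's d) -- the "oomph" of the difference.
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"p-value   = {p_value:.1e}")   # typically far below 0.05 at this n
print(f"Cohen's d = {cohens_d:.3f}")  # around 0.02: negligible in practice
```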
Article
Empirical political science is not simply about reporting evidence; it is also about coming to conclusions on the basis of that evidence and acting on those conclusions. But whether a result is substantively significant (strong and certain enough to justify acting upon the belief that the null hypothesis is false) is difficult to pin down objectively, in part because different researchers have different standards for interpreting evidence. Instead, this article advocates judging results according to the degree to which a community with heterogeneous standards for interpreting evidence would agree that the result is substantively significant. This study illustrates how this can be done using Bayesian statistical decision techniques. Judging results in this way yields a tangible benefit: false positives are reduced without decreasing the power of the test, thus decreasing the error rate in published results.
Article
Significance testing as used has no theoretical justification. Our article in the Journal of Economic Literature (1996) showed that of the 182 full-length papers published in the 1980s in the American Economic Review, 70% did not distinguish economic from statistical significance. Since 1996 many colleagues have told us that practice has improved. We interpret their response as an empirical claim, a judgment about a fact. Our colleagues, unhappily, are mistaken: significance testing is getting worse. We find here that in the next decade, the 1990s, of the 137 papers using a test of statistical significance in the AER, fully 82% mistook a merely statistically significant finding for an economically significant finding. A supermajority (81%) believed that looking at the sign of a coefficient sufficed for science, ignoring size. The mistake is causing economic damage: losses of jobs and justice, and indeed of human lives (especially in, to mention another field enchanted with statistical significance as against substantive significance, medical science). The confusion between fit and importance is causing false hypotheses to be accepted and true hypotheses to be rejected. We propose a publication standard for the future: "Tell me the oomph of your coefficient; and do not confuse it with merely statistical significance."
Article
This paper takes the opportunity to discuss issues around the interdisciplinary relations between economics and psychology. It argues that there is a discernible gap between economists interested in psychology and psychologists interested in economics. Economists are interested in cognitive neuro-psychology as a resource for elaborating the rationality assumptions of neoclassical economics whereas many economic psychologists are more interested in social aspects of economic beliefs and behaviour. The paper presents a criticism of the prevailing approaches within economics to the appropriation of psychological ideas as mental accounts and in experimental economics. It is then proposed that economic psychologists stop adopting economists' agendas and start to examine economic theory to open new lines of collaboration that will allow them to apply their own conception of psychology to economics.