Article

Statistical Significance Testing From Three Perspectives

... Alternatively, perish the thought that a careless comment about "mean" performance would cause a reader to think about downright nasty behavior! Yes, we agree that authors should always try their best to communicate precisely what they and their statistical tests mean, but summoning in the editorial "language police" (Levin, 1993) to enforce such a rigid policy seems a bit much. Tis better, we think, that journal editors continue to mind a manuscript's substance and that copy editors continue to dot the i's, cross the t's, and go with the "flow" of the manuscript. ...
... And that brings us to what we believe is an even better alternative linguistic mousetrap. If language is to be modified, why not banish the term "significant" altogether in favor of referring to a reported statistical difference as "statistically nonchance" or "statistically real" (Levin, 1993), insofar as those terms convey precisely the statistical test's intended meaning and leave nothing to the reader's imagination? Thus, a sentence such as "There was a statistically real inverse relationship between sentence length and comprehensibility" indicates that in this particular study the negative relationship between sentence length and comprehensibility was a statistically improbable one based on the researcher's a priori specified significance level (α). ...
... We couldn't agree more with Hurlburt and Hays, as well as with Thompson, about authors' need to be more sensitive to substantive significance (as reflected by effect sizes and strength of relationship measures), what it means, and how it differs from statistical significance. However, included in Hurlburt's statement is the germ of an editorial policy amendment that is consistent with Levin's (1993) philosophy and one that we wish to develop in the following paragraphs. ...
... Kaufman (1998) indicates that the "controversy about the use or misuse of statistical significance testing has been evident in the literature for the past 10 years and has become the major methodological issue of our generation" (p. 1). The debate has ranged from those who recommend the elimination of statistical significance testing (e.g., Carver, 1978, 1993; Nix & Barnette, 1998) to those who support it (e.g., Frick, 1996; Levin, 1993, 1998; McLean & Ernest, 1998). However, even those who defend statistical significance testing indicate that significant results should be accompanied by a measure of practical significance. ...
... However, even those who defend statistical significance testing indicate that significant results should be accompanied by a measure of practical significance. The leading method of reporting practical significance is through the provision of an effect size estimate (Kirk, 1996; McLean & Ernest, 1998; Robinson & Levin, 1997; Thompson, 1993). Unfortunately, the criteria for judging the practical significance of results based on effect size have defaulted to the use of Cohen's (1988) guidelines, guidelines that even Cohen himself warned us about (1977, 1988, 1990). ...
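As a concrete point of reference for that caution, here is a minimal sketch (in Python with NumPy; the function names are ours, not drawn from any of the cited papers) of the standardized mean difference and the Cohen (1988) benchmark labels that the passage warns have become the default:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference: (mean(x) - mean(y)) / pooled SD."""
    x, y = np.asarray(x), np.asarray(y)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def cohen_label(d):
    """Cohen's (1988) conventional benchmarks -- the defaults warned about above."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```

The point of the surrounding debate is precisely that the thresholds in `cohen_label` are context-free conventions, not context-sensitive judgments.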
Conference Paper
Full-text available
The concept of effect size has become very important in educational and behavioral research. However, the term effect size is used in many contexts. The three most common are setting an effect size as part of determining the sample size needed to achieve a desired power level, converting various effect sizes and measures of association into a common metric for meta-analysis, and reporting the effect size post hoc as an indication of the practical significance of group differences in experimental or quasi-experimental studies. The point this paper makes is that arbitrary selections of effect size standards cannot be meaningful without knowledge of, and accounting for, the critical characteristics of the situation relative to the number and size of samples or the degrees of freedom. Monte Carlo methods were used to generate the data for this research, using random normal deviates as the basis for sample means to be compared with one-way fixed-effects ANOVA. Standardized effect sizes were generated for 10,000 replications within each combination of number of groups from 2 to 12 and sample sizes of 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 100, 150, 200, 250, and 500, resulting in 2,200,000 total replications. The standardized effect size was computed as the range of means divided by the root mean square error. The results make clear that the current arbitrary and absolute criteria proposed by Cohen are far from sensitive to the variation in standardized effect sizes as functions of the number and size of samples. The paper also demonstrates how this result relates to the three common uses of standardized effect size.
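A reduced sketch of the kind of simulation this abstract describes might look as follows (Python with NumPy; the k/n grid and replication counts are trimmed for brevity, and the helper name is ours). As in the paper, the standardized effect size is the range of sample means divided by the root mean square error, here computed for groups drawn from one common population:

```python
import numpy as np

rng = np.random.default_rng(1)

def null_range_es(k, n, reps=2_000):
    """Standardized ES (range of means / RMSE) for k null groups of size n each."""
    out = np.empty(reps)
    for r in range(reps):
        groups = rng.standard_normal((k, n))     # all true means equal: chance only
        means = groups.mean(axis=1)
        mse = groups.var(axis=1, ddof=1).mean()  # pooled within-group variance (equal n)
        out[r] = (means.max() - means.min()) / np.sqrt(mse)
    return out

for k in (2, 6, 12):
    for n in (5, 30, 100):
        print(f"k={k:2d} n={n:3d}  mean chance ES = {null_range_es(k, n).mean():.3f}")
```

Because every group is drawn from the same population, any nonzero effect size here is pure sampling error; the printout illustrates the abstract's point that its typical magnitude varies systematically with the number and size of the samples.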
... McLean and Ernest's description of Suen's (1992) "overbearing guest" analogy is especially apt in this context. 4 Examples of the seductive power of large observed effect sizes that are more than likely the result of chance outcomes are provided by Levin (1993) and Robinson and Levin (1997). In its extreme form, effect-size-only reporting degenerates to strong conclusions about differential treatment efficacy that are based on comparing a single score of one participant in one treatment condition with that of another participant in a different condition. ...
... Yet, because of the excessive "heat" (Thompson, in press) being generated by hypothesis-testing bickerers, little time is left for shedding "light" on how to enhance the conclusion robustness of educational and psychological research. In addition to the methodological adequacy of an empirical study (e.g., Levin, 1985; Levin & Levin, 1993; Stanovich, 1998), the credibility of its findings is a function of the study's "statistical conclusion validity" (Cook & Campbell, 1979), which in turn encompasses a consideration of the congruence between the statistical tools applied and their associated distributional assumptions. Reviews of the literature indicate that precious little attention is being paid by researchers and journal referees alike to that congruence: Statistical tests are being mindlessly applied or approved even in situations where fundamental assumptions underlying them are likely grossly violated (e.g., Keselman et al., in press; Wilcox, 1997). ...
... Table 1. Different views about the practical application of null hypothesis significance testing. In favor: Levin (1993), Fritz (1995, 1996), Greenwald et al. (1996), Abelson (1997), Cortina and Dunlap (1997), Hagen (1997). Detractors: Bakan (1966), Craig et al. (1976), Carver (1978, 1993), Chow (1988), Thompson (1988, 1989, 1996, 1997, 1999), Cohen (1990, 1994), Falk and Greenbaum (1995), Schmidt (1996), Manzano (1997), Nickerson. Table 1 shows a chronological analysis of the historical debate over the testing and verification of statistical hypotheses. These studies document the confusion, criticism, and controversy among researchers, who initially considered the report of the p value sufficient to reject or accept a hypothesis (Ioannidis, 2018). ...
Article
Full-text available
Null hypothesis significance testing is the most widely used method in scientific research for estimating the statistical significance of any finding. However, its use is now questionable when it is unaccompanied by other statistical criteria that support the credibility and reproducibility of studies. Accordingly, this study reviews how null hypothesis significance testing has been used and the recommendations made regarding the application of complementary statistical criteria for interpreting results. The main controversy over using only the probability value to reject or accept a hypothesis is described. According to the reviewed literature, interpreting a nonsignificant value as proof of the absence of an effect, or a significant value as proof of its existence, is a frequent mistake among scientific researchers. It is suggested that researchers rigorously assess the data obtained and include other statistics in study reports, such as the power of the test and the effect size of the intervention, to offer a complete interpretation and increase the quality of results. Specifically, editors of scientific journals are encouraged to consider the reporting of these statistics, where required, as part of the criteria taken into account in manuscript evaluation.
... In that context, a critical point of contention concerns whether the effect sizes associated with a single-study investigation should be interpreted in the absence of statistical significance. We have cast our nay votes on (and justifications for) this issue elsewhere (e.g., Levin, 1993; Levin & Robinson, 1999; Robinson & Levin, 1997; Robinson, Funk, Halbur, & O'Ryan, in press; Wainer & Robinson, in press) and will summarize our stance here. Almost without exception, introductory statistics textbooks present examples based on single-study investigations. ...
... The problem has persisted for decades, flaring up from time to time, so that we are currently living through a period of controversy with opposed, and in some cases extreme, positions between the defenders (e.g., Abelson, 1997a, 1997b; Cortina & Dunlap, 1997; Fritz, 1995, 1996; Greenwald, Gonzalez, Harris, & Guthrie, 1996; Hagen, 1997; Levin, 1993) and the detractors (e.g., Chow, 1988; Cohen, 1994; Cowles, 1989; Meehl, 1978; Morrison & Henkel, 1970; Murphy, 1997; Schmidt, 1996) of statistical significance tests as a valid instrument for scientific progress. ...
Article
Currently, there is growing interest in the sensitivity and statistical-conclusion validity of experimental designs. Although most books on experimental design stress these issues, many studies in applied psychology still do not take advantage of these advances, as can be deduced from their low statistical power. The goal of this article is to examine the impact of the guidelines of the editorial boards of peer-reviewed journals with respect to the computation and interpretation of measures of effect size together with values of statistical significance.
... We therefore recommend that statistical hypothesis testing and effect-size estimation be used in tandem to establish a reported outcome's believability and magnitude, respectively. As such, tests of significance serve a valuable purpose in determining whether effect-size measures should be ignored or reported, a position endorsed by Fan (2001), Levin (1993), Robinson and Levin (1997), Knapp and Sawilowsky (2001), and even, we think, Gliner, Leech, and Morgan (2002). Let us take a moment to consider the last part of the foregoing sentence. ...
Article
Full-text available
Although estimating substantive importance (in the form of reporting effect sizes) has recently received widespread endorsement, its use has not been subjected to the same degree of scrutiny as has statistical hypothesis testing. As such, many researchers do not seem to be aware that certain of the same criticisms launched against the latter can also be aimed at the former. Our purpose here is to highlight major concerns about effect sizes and their estimation. In so doing, we argue that effect size measures per se are not the hoped-for panaceas for interpreting empirical research findings. Further, we contend that if effect sizes were the only basis for interpreting statistical data, social-science research would not be in any better position than it would if statistical hypothesis testing were the only basis. We recommend that hypothesis testing and effect-size estimation be used in tandem to establish a reported outcome's believability and magnitude, respectively, with hypothesis testing (or some other inferential statistical procedure) retained as a "gatekeeper" for determining whether or not effect sizes should be interpreted. Other methods for addressing statistical and substantive significance are advocated, particularly confidence intervals and independent replications.
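The "gatekeeper" policy recommended above is easy to state in code. The following is only an illustrative sketch (Python, assuming NumPy and SciPy; the function name and report wording are ours, not the authors'):

```python
import numpy as np
from scipy import stats

def gatekept_report(x, y, alpha=0.05):
    """Interpret an effect size only when the significance test opens the gate."""
    x, y = np.asarray(x), np.asarray(y)
    _, p = stats.ttest_ind(x, y)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    d = (x.mean() - y.mean()) / pooled_sd
    if p < alpha:
        return f"statistically real difference: d = {d:.2f} (p = {p:.3f})"
    return f"not significant (p = {p:.3f}); effect size withheld as possibly chance"
```

The design choice mirrors the abstract: the p-value governs whether the effect size is interpreted at all, while the effect size, once admitted through the gate, conveys magnitude.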
... The debate has ranged from those who recommend the elimination of statistical significance testing (e.g., Carver, 1978, 1993; Nix & Barnette, 1998) to those who staunchly support it (e.g., Frick, 1996; Levin, 1993, 1998; McLean & Ernest, 1998). However, even those who defend statistical significance testing indicate that significant results should be accompanied by a measure of practical significance. ...
Conference Paper
Full-text available
Eta-squared (ES) is often used as a measure of strength of association of an effect, a measure often associated with effect size. It is also considered the proportion of total variance accounted for by an independent variable. It is simple to compute and interpret. However, it has one critical weakness, cited by several authors (Huberty; Snyder & Lawson; Snijders): a sampling bias that leads to an inflated judgment of the true effect. The purpose of this research is to determine the degree of inflation by establishing how large ES is likely to be by chance, finding methods of predicting the mean inflation, and then proposing the use of a corrected ES coefficient, the observed ES minus the mean expected ES: a value-added approach. A Monte Carlo study was set up using numbers of samples from 2 to 10 and sample sizes from 5 to 100 in steps of 5. In each number-of-samples and sample-size configuration, 10,000 one-way ANOVA replications, using samples drawn from the unit normal distribution, were conducted, for a total of 1,800,000 replications.
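The chance-level behavior of eta-squared that the study maps can be sketched compactly (Python with NumPy; the settings are scaled down and the code is ours, though the value-added correction at the end follows the abstract's description):

```python
import numpy as np

rng = np.random.default_rng(7)

def null_eta_squared(k, n, reps=10_000):
    """Distribution of eta-squared under the null for k groups of size n each."""
    g = rng.standard_normal((reps, k, n))
    grand = g.mean(axis=(1, 2), keepdims=True)
    ss_total = ((g - grand) ** 2).sum(axis=(1, 2))
    ss_between = (n * (g.mean(axis=2) - grand[:, :, 0]) ** 2).sum(axis=1)
    return ss_between / ss_total

k, n = 4, 10
eta2 = null_eta_squared(k, n)
print(f"mean chance eta^2 = {eta2.mean():.3f}")   # near (k-1)/(N-1) = 3/39 = 0.077
observed = 0.12                                    # hypothetical observed value
print(f"value-added eta^2 = {observed - eta2.mean():.3f}")
```

The run makes the inflation tangible: with 4 groups of 10, an apparently respectable eta-squared of about .08 is expected even when no effect exists at all.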
... The debate has ranged from those who recommend the elimination of statistical significance testing (e.g., Carver, 1978, 1993; Nix & Barnette, 1998) to those who staunchly support it (e.g., Frick, 1996; Levin, 1993, 1998; McLean & Ernest, 1998). However, even those who defend statistical significance testing indicate that significant results should be accompanied by a measure of practical significance. ...
Conference Paper
Full-text available
The probabilities of attaining varying magnitudes of standardized effect sizes by chance and when protected by a 0.05 level statistical test were studied. Monte Carlo procedures were used to generate standardized effect sizes in a one-way analysis of variance situation with 2 through 5, 6, 8, and 10 groups with selected sample sizes from 5 to 500. Within each of the 91 group and sample size configurations, 100,000 replications were generated from a distribution of normal deviates. For each data set, the effect size was computed along with a statistical test of the hypothesis at the 0.05 level. For each n/k combination, the proportion of effect sizes exceeding 0.1 to 2.0 in increments of 0.1 was computed for all cases and for those cases where "the no difference hypothesis" was rejected. There were trends that were common across all configurations. As the magnitude of effect size increased, the probability of getting such a difference by chance decreased, as would be expected. Within a given number of samples situation, as sample size increased, as expected, the probability of getting such a difference by chance decreased. Within a given sample size, as the number of groups increased, the probability of getting such a difference by chance increased. Another finding that was consistent across all configurations was that the significance test protected effect size probability was always equal to or less than the unprotected probability, in some cases dramatically so. It was clear that the addition of the significance test reduced the probability of finding a seemingly large effect size by chance. Such a protected effect size indicator could be an answer to the arguments posed by both those who protest against the use of the significance test and those who propose its use in judging the magnitude of an observed effect. (Contains 15 tables, 14 figures, and 43 references.)
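A compact version of the protected-versus-unprotected tabulation might look like this (Python, assuming NumPy and SciPy; the threshold, group count, and replication count are scaled-down illustrative choices, and the function name is ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def chance_es_probability(k, n, threshold, reps=10_000, alpha=0.05):
    """P(standardized ES >= threshold) by chance, with and without an F-test gate."""
    es = np.empty(reps)
    reject = np.empty(reps, dtype=bool)
    for r in range(reps):
        g = rng.standard_normal((k, n))             # null: no true differences
        _, p = stats.f_oneway(*g)                   # one-way ANOVA on the k rows
        means = g.mean(axis=1)
        rmse = np.sqrt(g.var(axis=1, ddof=1).mean())
        es[r] = (means.max() - means.min()) / rmse
        reject[r] = p < alpha
    big = es >= threshold
    return big.mean(), (big & reject).mean()

unprotected, protected = chance_es_probability(k=3, n=10, threshold=0.8)
print(f"unprotected: {unprotected:.3f}   test-protected: {protected:.3f}")
```

By construction the protected proportion can never exceed the unprotected one, which is the abstract's consistent finding: adding the significance test reduces the probability of mistaking a large chance effect size for a real one.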
... While such effects are "significant" according to a standard hypothesis test, they may be so small that they are not practically relevant. To avoid such misinterpretation, statisticians recommend always reporting effect size measurements along with p-values (Levin, 1993; McLean & Ernest, 1998), allowing the reader to subjectively judge the importance of an effect. On the other hand, as noted above, even large effects can be declared "non-significant" because of an insufficient sample size relative to measurement variability (i.e., a Type II error due to lack of power). ...
... Huberty (1993) thought that some of the blame for this lies with poor presentation by textbooks, teaching and reporting, and poor journal editorial review. More moderate commentators, including some journal editors, suggested that, when properly used, SST still has something to offer to the educational researcher (Asher, 1993; Bourke, 1993; Clements, 1993; Levin, 1993; Robinson & Levin, 1997; Rowley, 1993; Schafer, 1993; Zwick, 1997). ...
Chapter
This chapter considers near-term possible paths for psychological research without statistical inference, before going on to the deeper obstacles to change, including our attachment to rules. These include, on the one hand, a desperate denial of the subject, in pursuit of objectivity, and, on the other, frantic attempts at control, at assertion of the subject. Our only accredited methods are epidemiological, yet we know that psychological and biological processes don't operate on random aggregates, but on individuals. Our contemporary approach is biomedical research without a biomedical model. When physicists construct a linear plot of, say, solubility as a function of temperature, it is because physical theory predicts a linear relationship across the specified range. But when psychologists plot physical symptoms against depression, there is no psychological theory which says that these measurements should exhibit a linear relationship across different individuals; and the resulting plot exhibits almost nothing but error. Bill Powers (Behavior: The control of perception. Chicago, IL: Aldine, 1973; Psychological Review, 85:417–435, 1978) has developed a psychological theory based on individuals, which reconciles the concepts of mechanism and purpose; and the chapter concludes with a brief overview of his Perceptual Control Theory, as an indication of how the science of psychology might intelligently proceed.
Article
Proverbial wisdom and poetic thoughts capture many "truths" precisely, sometimes more aptly and profoundly than clever textbooks. As a foretaste of what awaits the reader in the following chapter, an excerpt from Wilhelm Busch's (1864/2002) poem Trauriges Resultat einer vernachlässigten Erziehung (Sad Result of a Neglected Upbringing): Ah, how often we hear that someone has done something wicked, which is quite understandable if one has a bit of education. Some parents are seen reading the newspaper from early until late; but what does that matter if one does not go to church! For one need only observe how such a couple often raises its own child. Ah, that is simply dreadful! [...]
Article
We compare the sample size requirements of significance tests and confidence intervals by calculating the power of each. The power of a confidence interval is defined as the probability of obtaining a short interval width, conditional on the confidence interval including the parameter of interest. We find that, in the two-sample independent t test, a smaller sample size is required to attain a desired statistical power than to attain comparable confidence-interval power, which is illustrated with an example study that examines the outcome difference between psychotherapy and a control condition in treating depression.
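The paper's definition of confidence-interval power (the probability of a short interval, conditional on the interval covering the parameter) can be estimated by simulation. A rough sketch under assumed values (Python with NumPy and SciPy; the true difference, width criterion, and equal-variance setup are our illustrative choices, not the paper's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def powers(n, delta=0.5, width=0.8, alpha=0.05, reps=10_000):
    """Two-sample t-test power vs. CI power = P(width short | CI covers delta)."""
    tcrit = stats.t.ppf(1 - alpha / 2, df=2 * n - 2)
    rejections = covers = short_and_covers = 0
    for _ in range(reps):
        x = rng.standard_normal(n) + delta        # treatment group
        y = rng.standard_normal(n)                # control group
        diff = x.mean() - y.mean()
        se = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / n)   # pooled SE, equal n
        half = tcrit * se
        rejections += abs(diff) / se > tcrit
        covered = (diff - half) <= delta <= (diff + half)
        covers += covered
        short_and_covers += covered and (2 * half <= width)
    return rejections / reps, short_and_covers / covers

for n in (20, 40, 80):
    test_power, ci_power = powers(n)
    print(f"n={n:3d}  test power = {test_power:.2f}   CI power = {ci_power:.2f}")
```

Under these toy settings the test reaches respectable power at sample sizes where the interval is still too wide to count as short, which is the qualitative pattern the abstract reports.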
Article
Although sleep has been linked to activities in various domains of life, one under-studied link is the relationship between unhealthy sleep practices and conduct problems among adolescents. The present study investigates the influence of adolescents' unhealthy sleep practices (short sleep, e.g., less than 6 h a day; an inconsistent sleep schedule, e.g., social jetlag; and sleep problems) on conduct problems (e.g., substance use, fighting, and skipping class). In addition, this study examines unhealthy sleep practices in relationship to adolescent emotional well-being, defiant attitudes, and academic performance, as well as these three domains as possible mediators of the longitudinal association between sleep practices and conduct problems. Three waves of the Taiwan Youth Project (n = 2,472) were used in this study. At the first time-point examined in this study, youth (51 % male) were aged 13-17 (M = 13.3). The results indicated that all three measures of unhealthy sleep practices were related to conduct problems, such that short sleep, greater social jetlag, and more serious sleep problems were concurrently associated with greater conduct problems. In addition, short sleep and sleep problems predicted conduct problems one year later. Furthermore, these three unhealthy sleep practices were differently related to poor academic performance, low levels of emotional well-being, and defiant attitudes, and some significant indirect effects on later conduct problems through these three attributes were found. Cultural differences and suggestions for prevention are discussed.
Article
In education research, statistical significance and effect size are 2 sides of 1 coin; they complement each other but they do not substitute for each other. Good research practice requires that, to make sound research decisions, both sides should be considered. In a simulation study, the sampling variability of 2 popular effect-size measures (d and R²) was examined. The variability showed that what is statistically significant may not be practically meaningful, and what appears to be practically meaningful could have been the result of sampling error, thus not trustworthy. Some practical guidelines are suggested for combining the 2 sources of information in research practice.
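The sampling variability of d that drives this argument is easy to exhibit directly (Python with NumPy; the population effect and sample sizes are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_d(n, true_d, reps=10_000):
    """Sampling distribution of Cohen's d for two groups of size n, true effect true_d."""
    x = rng.standard_normal((reps, n)) + true_d
    y = rng.standard_normal((reps, n))
    sp = np.sqrt((x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2)
    return (x.mean(axis=1) - y.mean(axis=1)) / sp

for n in (10, 50, 200):
    d = sample_d(n, true_d=0.5)
    lo, hi = np.percentile(d, [2.5, 97.5])
    print(f"n={n:3d}  95% of sample d values fall in [{lo:+.2f}, {hi:+.2f}]")
```

At n = 10 a true "medium" effect of 0.5 routinely yields sample values near zero or near 1.5, which is exactly the sense in which an apparently meaningful effect size can be nothing but sampling error.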
Article
Media comparison studies have long been criticized as an inappropriate research design for measuring the effectiveness of instructional technology. However, a resurgence in their use has recently been noted in distance education for program evaluation purposes. An analysis of the research design will detail why such a methodology is an inappropriate approach to such an investigation. Increased access to such programming does not seem to serve as a satisfactory benefit for the implementation of distance education efforts. Stakeholders desire to prove that participants in distance-delivered courses receive the same quality of instruction off-campus as those involved in the traditional classroom setting. However, the desire to prove that the quality of such distributed offerings is equal to the quality of on-campus programming often results in comparisons of achievement between the two groups of student participants. Statistically, such a research design almost guarantees that the desired outcome will be attained—that indeed distance learners perform as well as campus-based students.
Article
Full-text available
At present, attention to the sensitivity and statistical-conclusion validity of research designs has increased, especially in the treatment these topics receive in current editions of experimental design manuals, although in the applied field (where estimation of effect size is of greatest importance) this has perhaps not developed as far as would be desirable, as power studies of published work demonstrate. The main purpose of this paper is to analyze the repercussion or impact of editorial board guidelines on published research with respect to the joint computation and interpretation of effect size measures together with statistical significance values.
Article
Full-text available
Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.
Article
Full-text available
This article examines current research methodology in psychology in the context of Serlin and Lapsley's response to Meehl's critiques of the scientific practices of psychologists. The argument is made that Serlin and Lapsley's appeal to Lakatos's philosophy of science to defend the rationality of null hypothesis tests and related practices misrepresents that philosophy. It is demonstrated that Lakatos in fact considered psychology an extremely poor science lacking true research programs, an opinion very much in line with Meehl's critique. The present essay speculates on the reasons for Lakatos's negative opinion and reexamines the role of null hypothesis tests in relation to the quality of theories in psychology. It is concluded that null hypothesis tests are destructive to theory building and directly related to Meehl's observation of slow progress in soft psychology.
Article
Full-text available
Discusses editorial policy of the Journal of Educational Psychology with respect to substantive, procedural, and ethical issues. Research should make substantive contributions and conceptually integrate recent developments bearing on the topic. Procedural concerns involve terminological clarity, confounding and controlled factors, the effects of a laboratory orientation on classroom research, and the appropriateness of a statistical analysis. Ethical issues include piecemeal and duplicate publications, plagiarism, and falsification/fabrication of data. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Recent developments in procedures for conducting pairwise multiple comparisons of means prompted an empirical investigation of several competing techniques. Monte Carlo results revealed that the newer multistage sequential procedures maintain their familywise Type I error probabilities while exhibiting power that is superior to the traditional competitors. Of all procedures examined, the modified E. Peritz (1970) procedure (M. A. Seaman et al., 1990) is generally the most powerful according to all definitions of power. At the same time, when computational ease and convenience are taken into consideration, A. J. Hayter's (1986) procedure should be regarded as a viable alternative. Beyond pairwise comparisons of means, the versatile S. Holm (1979) procedure and its modifications (J. P. Shaffer, 1986) are very attractive insofar as they represent simple, yet powerful, data-analytic tools for behavioral researchers. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
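Of the procedures named, S. Holm's (1979) step-down procedure is simple enough to state in a few lines. A minimal sketch (plain Python; the function name is ours):

```python
def holm(pvalues, alpha=0.05):
    """Holm's (1979) step-down procedure; returns a reject/retain flag per test."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):  # thresholds alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break      # stepping down stops: retain this and all larger p-values
    return reject

print(holm([0.010, 0.015, 0.040, 0.400]))   # -> [True, True, False, False]
```

The familywise error control comes from the shrinking denominators; the procedure is uniformly more powerful than the plain Bonferroni correction it modifies.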
Chapter
This chapter provides me with the opportunity to discuss a number of methodological and statistical “bugs” that I have detected creeping into psychological research in general, and into research on children’s learning in particular. Naturally, one cannot hope to exterminate all such bugs with but a single essay. Rather, it is hoped that this chapter will leave a trail of pellets that is sufficiently odorific to get to the source of these potentially destructive little creatures. It also goes without saying that different people in this trade have different entomological lists that they would like to see presented. Although all cannot be presented here, I intend to introduce you to nearly 20 of my own personal favorites. At the same time, it must be stated at the outset that present space limitations do not permit a complete specification and resolution of the problems that these omnipresent bugs can create for cognitive-developmental researchers. Consequently, in most cases I will only allude to a problem and its potential remedies, placing the motivation for additional inquiry squarely in the lap of the curious reader.
Article
Magnitude-of-effect (ME) statistics, when adequately understood and correctly used, are important aids for researchers who do not want to place a sole reliance on tests of statistical significance in substantive result interpretation. We describe why methodologists encourage the use of ME indices as interpretation aids and discuss different types of ME estimates. We discuss correction formulas developed to attenuate statistical bias in ME estimates and illustrate the effect these formulas have on different sample and effect sizes. Finally, we discuss several cautions against the indiscriminate use of these statistics and offer reasons why ME statistics, like all substantive result interpretation aids, are useful only when their strengths and limitations are understood by researchers.
Article
Three of the various criticisms of conventional uses of statistical significance testing are elaborated. Three alternatives for augmenting statistical significance tests in interpreting results are then presented. These include emphasizing effect sizes, evaluating statistical significance tests in a sample size context, and evaluating result replicability. Ways of estimating result replicability from data in hand include cross-validation, jackknife, and bootstrap logics. The bootstrap is explored in some detail.
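Of the replicability logics listed, the bootstrap is the most direct to sketch (Python with NumPy; the toy data and names are ours). Resampling the observed samples shows how much a standardized mean difference might plausibly move on replication:

```python
import numpy as np

rng = np.random.default_rng(2024)

def bootstrap_d(x, y, reps=5_000):
    """Bootstrap percentiles of the standardized mean difference."""
    d = np.empty(reps)
    for r in range(reps):
        bx = rng.choice(x, size=len(x), replace=True)   # resample within groups
        by = rng.choice(y, size=len(y), replace=True)
        sp = np.sqrt((bx.var(ddof=1) + by.var(ddof=1)) / 2)
        d[r] = (bx.mean() - by.mean()) / sp
    return np.percentile(d, [2.5, 50, 97.5])

x = rng.standard_normal(25) + 0.6    # hypothetical treatment scores
y = rng.standard_normal(25)          # hypothetical control scores
print("bootstrap d at percentiles 2.5/50/97.5:", bootstrap_d(x, y).round(2))
```

A wide bootstrap interval signals that the headline effect size would be unlikely to replicate closely, which is the diagnostic use the article advocates.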
Article
Textbook discussion of statistical testing is the topic of interest. Some 28 books published from 1910 to 1949, 19 books published from 1990 to 1992, plus five multiple-edition books were reviewed in terms of presentations of statistical testing. It was of interest to discover textbook coverage of the P-value (i.e., Fisher) and fixed-alpha (i.e., Neyman-Pearson) approaches to statistical testing. Also of interest in the review were some issues and concerns related to the practice and teaching of statistical testing: (a) levels of significance, (b) importance of effects, (c) statistical power and sample size, and (d) multiple testing. It is concluded that it is not statistical testing itself that is at fault; rather, some of the textbook presentation, teaching practices, and journal editorial reviewing may be questioned.
Article
Based on principles of modern philosophy of science, it can be concluded that it is the magnitude of a population effect that is the essential quantity to examine in determining support or lack of support for a theoretical prediction. To test for theoretical support, the corresponding statistical null hypothesis must be derived from the theoretical prediction, which means that we must specify and test a range null hypothesis. Similarly, confidence intervals based on range null hypotheses are required. Certain of the newer multiple comparison procedures are discussed in terms of their applicability to the problem of generating confidence intervals based on range null hypotheses to control the familywise Type I error rate in multiple-sample experiments.
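One way to operationalize a range null hypothesis for two means is a shifted t statistic. The sketch below (Python with NumPy and SciPy) is a rough approximation for illustration rather than the exact procedure the literature prescribes, and all names and values are ours:

```python
import numpy as np
from scipy import stats

def range_null_test(x, y, delta0, alpha=0.05):
    """Approximate test of the range null H0: |mu_x - mu_y| <= delta0."""
    x, y = np.asarray(x), np.asarray(y)
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                 / (nx + ny - 2))
    se = sp * np.sqrt(1 / nx + 1 / ny)
    t = (abs(x.mean() - y.mean()) - delta0) / se   # distance beyond the belt, in SEs
    p = stats.t.sf(t, df=nx + ny - 2)              # shifted-t approximation
    return t, p, bool(p < alpha)
```

Rejecting here supports the theoretical prediction only when the observed difference credibly exceeds the trivial band delta0, which is the abstract's point that the statistical null must be derived from, and match, the theoretical claim.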
Article
A test of statistical significance addresses the question, How likely is a result, assuming the null hypothesis to be true? Randomness, a central assumption underlying commonly used tests of statistical significance, is rarely attained, and the effects of its absence are rarely acknowledged. Statistical significance does not speak to the probability that the null hypothesis or an alternative hypothesis is true or false, to the probability that a result would be replicated, or to treatment effects, nor is it a valid indicator of the magnitude or the importance of a result. The persistence of statistical significance testing is due to many subtle factors. Journal editors are not to blame, but as publishing gatekeepers they could diminish its dysfunctional use.
Article
At present, too many research results in education are blatantly described as significant, when they are in fact trivially small and unimportant. There are several things researchers can do to minimize the importance of statistical significance testing and get articles published without using these tests. First, they can insert statistically in front of significant in research reports. Second, results can be interpreted before p values are reported. Third, effect sizes can be reported along with measures of sampling error. Fourth, replication can be built into the design. The touting of insignificant results as significant because they are statistically significant is not likely to change until researchers break the stranglehold that statistical significance testing has on journal editors.