Article

Significance Tests Have Their Place

Authors: Richard J. Harris

Abstract

Null-hypothesis significance tests (NHST), properly used, tell us whether we have sufficient evidence to be confident of the sign of the population effect - but only if we abandon two-valued logic in favor of Kaiser's (1960) three-alternative hypothesis tests. Confidence intervals provide a useful addition to NHSTs, and can be used to provide the same sign-determination function as NHST. However, when so used, confidence intervals are subject to exactly the same Type I, II, and III error rates as NHST. In addition, NHSTs provide two pieces of information about our data - maximum probability of a Type III error and probability of a successful exact replication - that confidence intervals do not. The proposed alternative to NHST is just as susceptible to misinterpretation as is NHST. The problem of bias due to censoring of data collection or publication can be handled by providing archives for all methodologically sound data sets, but reserving interpretations and conclusions for statistically significant results.
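To make the sign-determination reading of NHST concrete, the following sketch (Python with NumPy/SciPy; the alpha level, effect size, and sample size are illustrative assumptions, not values from the article) applies Kaiser-style three-alternative logic to an independent-samples t test and estimates by simulation how often each of the three conclusions is reached when the population effect is small and positive. The relative frequency of the wrong-sign conclusion estimates the Type III error rate discussed in the abstract.

```python
# A minimal sketch of Kaiser-style three-valued decision logic for a
# two-sample t test, plus a simulation of the Type III (sign) error rate.
# Settings (alpha, effect size, n, reps) are illustrative assumptions.
import numpy as np
from scipy import stats

def three_valued_decision(x, y, alpha=0.05):
    """Return 'mu_x > mu_y', 'mu_x < mu_y', or 'suspend judgment'."""
    t, p = stats.ttest_ind(x, y)
    if p >= alpha:
        return "suspend judgment"          # not enough evidence about the sign
    return "mu_x > mu_y" if t > 0 else "mu_x < mu_y"

rng = np.random.default_rng(0)
n, true_delta, reps = 20, 0.3, 10_000      # small positive population effect
counts = {"mu_x > mu_y": 0, "mu_x < mu_y": 0, "suspend judgment": 0}
for _ in range(reps):
    x = rng.normal(true_delta, 1.0, n)     # population mean of x exceeds y's
    y = rng.normal(0.0, 1.0, n)
    counts[three_valued_decision(x, y)] += 1

print({k: v / reps for k, v in counts.items()})
# The 'mu_x < mu_y' frequency estimates the Type III error rate; it is
# bounded above by alpha/2 and shrinks as power increases.
```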


... A high P value is as consistent with H 1 as with H 0 and is grounds only for indecision or suspension of judgment with respect to the truth of H 0 (Fisher 1925, Tukey 1960, Kalbfleisch & Sprott 1976, Oakes 1986: 31, Abelson 1995, Cortina & Dunlap 1997, Harris 1997a, 1997b, Nickerson 2000, Tryon 2001, Lombardi & Hurlbert 2009). One is free to 'accept' H 0 on grounds other than the test and resultant P value, but the high P value itself provides no grounds for preferring H 0 over H 1 . ...
... Cox (1958) also gave an early formulation of the idea, stating that "the significance test is concerned whether we can, from the data under analysis, claim a difference in the same direction as that observed … [or] whether the direction of any effects has been reasonably well established …." The idea has been refined and recommended by Kaiser (1960), Tukey (1991), Abelson (1995), Harris (1997a, 1997b), Tryon (2001), and Cox (2006a). ...
... Hunter (1997) strongly disparaged this "three-valued" logic paradigm. He stated: "Harris [1997b] argues that the computations of the significance tests can be saved by using a radically different interpretation scheme. However, his scheme was put forward 35 years ago by Kaiser and was adopted by no one. ...
Article
Full-text available
This essay grew out of an examination of one-tailed significance testing. One-tailed tests were little advocated by the founders of modern statistics but are widely used and recommended nowadays in the biological, behavioral and social sciences. The high frequency of their use in ecology and animal behavior and their logical indefensibility have been documented in a companion review paper. In the present one, we trace the roots of this problem and counter some attacks on significance testing in general. Roots include: the early but irrational dichotomization of the P scale and adoption of the 'significant/non-significant' terminology; the mistaken notion that a high P value is evidence favoring the null hypothesis over the alternative hypothesis; and confusion over the distinction between statistical and research hypotheses. Resultant widespread misuse and misinterpretation of significance tests have also led to other problems, such as unjustifiable demands that reporting of P values be disallowed or greatly reduced and that reporting of confidence intervals and standardized effect sizes be required in their place. Our analysis of these matters thus leads us to a recommendation that for standard types of significance assessment the paleoFisherian and Neyman-Pearsonian paradigms be replaced by a neoFisherian one. The essence of the latter is that a critical α (probability of type I error) is not specified, that the terms 'significant' and 'non-significant' are abandoned, that high P values lead only to suspended judgments, and that the so-called "three-valued logic" of Cox, Kaiser, Tukey, Tryon and Harris is adopted explicitly. Confidence intervals and bands, power analyses, and severity curves remain useful adjuncts in particular situations. Analyses conducted under this paradigm we term neoFisherian significance assessments (NFSA). Their role is assessment of the existence, sign and magnitude of statistical effects. The common label of null hypothesis significance tests (NHST) is retained for paleoFisherian and Neyman-Pearsonian approaches and their hybrids. The original Neyman-Pearson framework has no utility outside quality control type applications. Some advocates of Bayesian, likelihood and information-theoretic approaches to model selection have argued that P values and NFSAs are of little or no value, but those arguments do not withstand critical review. Champions of Bayesian methods in particular continue to overstate their value and relevance.
... (cf. Bakan, 1966; Berkson, 1942; Carver, 1978, 1993; Chow, 1998; Cohen, 1994; Dar, Serlin, & Omer, 1994; Hagen, 1997; Harlow, 1997; Hodges & Lehmann, 1954; Hogben, 1957; Hunter & Schmidt, 1990; Lykken, 1968; Meehl, 1967, 1978, 1990a, 1990b; Morrison & Henkel, 1970; Rozeboom, 1960; Sterling, 1959) and culminated in a special section of Psychological Science discussing whether the NHST should be banned (Abelson, 1997b; Estes, 1997b; Harris, 1997b; Hunter, 1997; Scarr, 1997; Shrout, 1997). The American Psychological Association Task Force on Statistical Inference was convened to determine what role, if any, NHST should have in psychological science (Schmidt, 1996; Wilkinson and the Task Force on Statistical Inference, 1999). ...
... The technical merits of the NHST are not disputed here. NHST proponents have successfully defended NHST procedures when correctly used (Abelson, 1997a; Hagen, 1997; Harris, 1997a, 1997b; Mulaik, Raju, & Harshman, 1997; Rindskopf, 1997; Serlin, 1993; Serlin & Lapsley, 1985). Krantz (1999) observed that "statisticians prove theorems or develop methods that, if properly applied, would be useful. ...
Article
Full-text available
Null hypothesis statistical testing (NHST) has been debated extensively but always successfully defended. The technical merits of NHST are not disputed in this article. The widespread misuse of NHST has created a human factors problem that this article intends to ameliorate. This article describes an integrated, alternative inferential confidence interval approach to testing for statistical difference, equivalence, and indeterminacy that is algebraically equivalent to standard NHST procedures and therefore exacts the same evidential standard. The combined numeric and graphic tests of statistical difference, equivalence, and indeterminacy are designed to avoid common interpretive problems associated with NHST procedures. Multiple comparisons, power, sample size, test reliability, effect size, and cause–effect ratio are discussed. A section on the proper interpretation of confidence intervals is followed by a decision rule summary and caveats.
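As a companion to the abstract above, here is a simplified, hypothetical illustration (not Tryon's exact inferential confidence interval construction) of how a single confidence interval on a mean difference can be read to yield one of three outcomes - statistical difference, statistical equivalence, or indeterminacy - where the equivalence margin is an assumed, analyst-chosen value.

```python
# A simplified illustration (not Tryon's exact inferential-CI procedure)
# of classifying a mean difference as a statistical difference, statistical
# equivalence, or indeterminacy from one confidence interval.
# The equivalence margin `delta` is an assumed value chosen by the analyst.
import numpy as np
from scipy import stats

def classify(x, y, delta, conf=0.95):
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / nx + np.var(y, ddof=1) / ny)
    df = nx + ny - 2                       # simple pooled-df approximation
    half = stats.t.ppf(1 - (1 - conf) / 2, df) * se
    lo, hi = diff - half, diff + half
    if lo > 0 or hi < 0:
        return ("difference", lo, hi)      # same decision as rejecting H0
    if -delta < lo and hi < delta:
        return ("equivalence", lo, hi)     # whole CI inside the indifference zone
    return ("indeterminacy", lo, hi)       # data too noisy to decide either way

rng = np.random.default_rng(1)
x = rng.normal(0.1, 1.0, 40)
y = rng.normal(0.0, 1.0, 40)
print(classify(x, y, delta=0.5))
```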
... So, behavior science/analysis as a whole is not immune from threats to reproducibility related to NHST, nor do we suggest that behavior scientists/analysts reject NHST as a potentially viable scientific practice. NHST can serve a useful purpose when conducted and reported properly, particularly when focusing on measures of effect size and confidence intervals, rather than just the p-value (Greenwald et al., 1996; Hales et al., 2018; Harris, 1997; Weaver & Lloyd, 2018). In addition, Jones and Tukey (2000) outlined what they termed "a sensible approach" to significance testing, which they claimed avoids the most common problems associated with traditional NHST (see also Harris, 1997, 2016; Hurlbert & Lombardi, 2009). ...
... NHST can serve a useful purpose when conducted and reported properly, particularly when focusing on measures of effect size and confidence intervals, rather than just the p-value (Greenwald et al., 1996; Hales et al., 2018; Harris, 1997; Weaver & Lloyd, 2018). In addition, Jones and Tukey (2000) outlined what they termed "a sensible approach" to significance testing, which they claimed avoids the most common problems associated with traditional NHST (see also Harris, 1997, 2016; Hurlbert & Lombardi, 2009). As Huitema (1986b) noted, the broader scientific community generally expects statistical results, so reporting such results can improve the acceptability of our research to audiences outside of behavior science/analysis, including funding agencies and nonbehavioral journals. ...
Article
Full-text available
For over a decade, the failure to reproduce findings in several disciplines, including the biomedical, behavioral, and social sciences, has led some authors to claim that there is a so-called “replication (or reproducibility) crisis” in those disciplines. The current article examines: (a) various aspects of the reproducibility of scientific studies, including definitions of reproducibility; (b) published concerns about reproducibility in the scientific literature and public press; (c) variables involved in assessing the success of attempts to reproduce a study; (d) suggested factors responsible for reproducibility failures; (e) types of validity of experimental studies and threats to validity as they relate to reproducibility; and (f) evidence for threats to reproducibility in the behavior science/analysis literature. Suggestions for improving the reproducibility of studies in behavior science and analysis are described throughout.
... Weight is a continuous quantity, and the observations are independent both between and within groups. An independent samples t test is therefore suitable for these data, and the hypotheses can be written in three-valued logic form (Harris, 1997). ...
... It simply means that the null hypothesis has been rejected in another sample, or that the statistics (e.g., correlations) from the two samples are not significantly different. Of course statistical power and the formal probabilities for Type I, Type II, and Type III errors (Harris, 1997) can all be discussed to help ensure students' proper understanding of the statistical nature of these conclusions. Advanced students can additionally be asked to read the recent attempts to develop a p-replication statistic, which ultimately failed (Killeen, 2005; Maraun & Gabriel, 2010). ...
Article
Full-text available
Observation Oriented Modeling is an alternative to traditional methods of data conceptualization and analysis that challenges researchers to develop integrated, explanatory models of patterns of observations. The focus of research is thus shifted away from aggregate statistics, such as means, variances, and correlations, and is instead directed toward assessing the accuracy of judgments based on the observations in hand. In this paper a number of example data sets will be used to demonstrate how Observation Oriented Modeling can be taught to undergraduate and graduate students. While the examples are drawn from psychology, the method of contrasting Observation Oriented Modeling with traditional methods of research design and statistical analysis can easily be adapted to examples from other sciences.
... Another argument put forward by fans of statistical tests is that proposed alternative methods, such as effect sizes and confidence intervals (discussed below), are less informative than statistical tests, and are equally vulnerable to widespread misinterpretation (Frick, 1996; Harris, 1997). For example, Harris (1997) stated that statistical significance testing "provides useful information that is not easily gleaned from the corresponding confidence interval: degree of confidence that we have not made a Type III error and likelihood that our sample result is replicable" (p. 10). ...
Article
Full-text available
The present paper summarizes the literature regarding statistical significance testing, with an emphasis on (a) the post-1994 literature in various disciplines, (b) alternatives to statistical significance testing, and (c) literature exploring why researchers have demonstrably failed to be influenced by the 1994 APA publication manual's "encouragement" (p. 18) to report effect sizes. Also considered are defenses of statistical significance tests. Researchers have long placed a premium on the use of statistical significance testing, notwithstanding withering criticisms of many conventional practices as regards statistical inference (e.g., Burdenski, 1999; Carver, 1978; Daniel, 1999; McLean & Ernest, 1999; Meehl, 1978; Morrison & Henkel, 1970; Nix & Barnette, 1999; Thompson, 1993, 1998a, 1998b, 1998c, 1999a, 1999b, 1999d). A series of articles on these issues appeared in recent editions of the American Psychologist (e.g., Cohen, 1990; Kupfersmid, 1988; Rosnow & Rosenthal, 1989). Especially noteworthy are recent articles by Cohen (1994), Kirk (1996), Schmidt (1996), and Thompson (1996). Indeed, the criticism of statistical testing is growing fierce. For example, Rozeboom (1997) recently argued that: Null-hypothesis significance testing is surely the
... Four years later, Wilson and Miller (1964) discussed the inconclusiveness of accepting the null hypothesis. In 1997, the Psychological Science journal devoted an entire issue to the controversy surrounding significance tests, including a discussion on banning versus not banning the formulation of the null hypothesis and an emphasis on P values (Abelson, 1997a, 1997b; Harris, 1997; Hunter, 1997; Shrout, 1997; Scarr, 1997). Harlow, Mulaik, and Steiger (1997) edited a book summarizing the controversy regarding the question "What if there were no significance tests." ...
Article
Full-text available
Ferguson (2015) observed that the proportion of studies supporting the experimental hypothesis and rejecting the null hypothesis is very high. This paper argues that the reason for this scenario is that researchers in the behavioral sciences have learned that the null hypothesis can always be rejected if one knows the statistical tricks to reject it (e.g., the probability of rejecting the null hypothesis increases with p = 0.05 compared to p = 0.01). Examples of the advancement of science without the need to formulate the null hypothesis are also discussed, as well as alternatives to null hypothesis significance testing (NHST; e.g., effect sizes), and the importance of distinguishing the statistical significance of results from their practical significance.
... Among them are biology, medicine, technique and so on [17,18]. The appropriate tests "has just begun to stir up some interests in the educational and behavioral literature" [19][20][21][22]. ...
Article
Full-text available
Constrained Bayesian method (CBM) and the concept of false discovery rates (FDR) for testing directional hypotheses is considered in the paper. Here is shown that the direct application of CBM allows us to control FDR on the desired level. Theoretically it is proved that mixed directional false discovery rates (mdFDR) are restricted on the desired levels at the suitable choice of restriction levels at different statements of CBM. The correctness of the obtained theoretical results is confirmed by computation results of concrete examples.
... The majority of empirical articles in psychology use NHST (Rodgers, 2010) despite considerable opposition to an exclusive focus on dichotomous significance tests (e.g., Cohen, 1994;Cumming, 2012;Kline, 2013;Rozeboom, 1997;Schmidt & Hunter, 1997;Wilkinson et al., 1999). Amidst these opposing perspectives, a number of researchers endorse the use of significance tests in some circumstances, particularly if accompanied by relevant effect sizes and confidence intervals (CIs) (e.g., Abelson, 1997;Denis, 2003;Hagen, 1997;Harlow, 2010;Harris, 1997;Mulaik, Raju, & Harshman, 1997). The publication manual (APA, 2010) recommends reporting full results from hypothesis tests (including the test statistic, degrees of freedom, and exact p value), but also recommends including information about measures of magnitude and CIs. ...
Article
Full-text available
With recent focus on the state of research in psychology, it is essential to assess the nature of the statistical methods and analyses used and reported by psychological researchers. To that end, we investigated the prevalence of different statistical procedures and the nature of statistical reporting practices in recent articles from the 4 major Canadian psychology journals. The majority of authors evaluated their research hypotheses through the use of analysis of variance, t tests, and multiple regression. Multivariate approaches were less common. Null hypothesis significance testing remains a popular strategy, but the majority of authors reported a standardized or unstandardized effect size measure alongside their significance test results. Confidence intervals on effect sizes were infrequently employed. Many authors provided minimal details about their statistical analyses and less than a third of the articles presented on data complications such as missing data and violations of statistical assumptions. Strengths of and areas needing improvement for reporting quantitative results are highlighted. The article concludes with recommendations for how researchers and reviewers can improve comprehension and transparency in statistical reporting.
... 23). Harris (1997) provides a good discussion on three-valued logic in testing the null hypothesis, and recommends its use. He noted that two-valued logic leads to "such absurdities as stating whether or not results are statistically significant, but not in what direction" (p. ...
Article
Full-text available
Although teaching effect sizes is important, many statistics texts omit the topic for the Mann-Whitney U test and the Wilcoxon signed-rank test. To address this omission, this paper introduces the simple difference formula. The formula states that the correlation equals the simple difference between the proportion of favorable and unfavorable evidence; in symbols this is r = f – u. For the Mann-Whitney U, the evidence consists of pairs. For the signed-rank test, the evidence consists of rank sums. Also, the formula applies to the Binomial Effect Size Display. The formula r = f – u means that a correlation r can yield a prediction so that the proportion correct is f and the proportion incorrect is u.
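The simple difference formula is easy to compute directly. The sketch below (Python; the data are made up for illustration) counts favorable and unfavorable pairs for two independent samples and reports r = f - u alongside the Mann-Whitney U test.

```python
# Sketch of the simple difference formula r = f - u for the Mann-Whitney U
# test: f and u are the proportions of favorable and unfavorable pairs.
# Example data are invented for illustration.
import numpy as np
from scipy import stats

def simple_difference_r(x, y):
    favorable = sum(xi > yj for xi in x for yj in y)
    unfavorable = sum(xi < yj for xi in x for yj in y)
    n_pairs = len(x) * len(y)              # tied pairs count toward neither f nor u
    f, u = favorable / n_pairs, unfavorable / n_pairs
    return f - u

x = [7.1, 8.3, 9.0, 6.5, 7.8]
y = [5.9, 6.8, 7.0, 6.1, 5.5]
u_stat, p = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.3f}, r = f - u = {simple_difference_r(x, y):.2f}")
```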
... Hypothesis testing (null hypothesis testing) indirectly addresses issues of estimation and model selection by chasing statistically significant effects (i.e., the effect is or is not significant), and provides little information on the size of the effect (Yoccoz, 1991; Cherry, 1998; Goodman, 1999). Hypothesis testing has generated schisms in some fields (Abelson, 1997; Shrout, 1997; Batanero, 2000), with very heated exchanges between devotees and "unbelievers" of this approach (in favor: Harris, 1997; Chow, 1998; Robinson and Wainer, 2002; Mogie, 2004; against: Hunter, 1997; Cherry, 1998; Goodman, 1999; Guthery et al., 2001). Regardless, hypothesis testing does not perform particularly well in model selection (e.g., variables selected by forward, backward, or stepwise approaches), and alternatives do exist (Anderson et al., 2001a; Guthery et al., 2001; Johnson, 1999, 2002). ...
... It is interesting that just as clinicians have long lamented the inaccessibility or irrelevance of basic research to clinical practice, so too have the basic logic and machinery of psychological research come under scrutiny. A burgeoning contemporary literature attests to a growing disposition on the part of methodologists to place time-honored research designs under the microscope (Abelson, 1997; Cohen, 1990, 1994; Estes, 1997; Harris, 1997; Loftus, 1993, 1996). The results often prove to be less than flattering, particularly in matters of data analysis. ...
Article
Full-text available
The ongoing transition to managed health care continues to have repercussions for health care providers, perhaps the most important of which is an emphasis on accountability for demonstrating the usefulness of clinical interventions. This requirement places a premium on intervention research and highlights the historically strained relationship between psychological research and professional practice. In the midst of this challenge, researchers have increasingly criticized the logic and practice of traditional null hypothesis significance testing. This article describes the history, epistemology, and advantages of single-participant research designs for behavioral scientists and professionals in clinical settings. Although its lack of correspondence with the Fisherian tradition has precluded widespread adoption, the single-participant alternative features a design power and flexibility well suited to both basic science and applied research.
... Finally, one can also place here those who have defined the so-called Type III error, which concerns the sign of the contrast, that is, the possibility that the differences or relationships found run in the direction opposite to the one predicted. Various authors emphasize this point in one way or another (Leventhal & Huynh, 1996; Harris, 1997a, 1997b). ...
Article
Full-text available
Null hypothesis significance testing has been a source of debate within the scientific community of behavioral researchers for years, since inadequate interpretations have resulted in incorrect use of this procedure. In this paper, we present a revision of the latest contributions of methodologists of different opinions, for and against, and we also set out the guidelines for research within behavioral science recently issued by the APA (American Psychological Association) Task Force on Statistical Inference (Wilkinson, 1999).
... Turning one's focus to abduction rather than statistical inference leads to a number of startling and liberating realizations. First, because population parameters are not necessarily being estimated, issues such as inferential errors (Type I, II, or III; Harris, 1997), statistical power, and parameter bias can fall by the wayside. As will be made explicit below, the goal in Observation Oriented Modeling is to identify meaningful and improbable patterns of observations (i.e., behaviors) of individual honeybees. ...
Article
Full-text available
Observation Oriented Modeling is a novel approach toward conceptualizing and analyzing data. Compared with traditional parametric statistics, Observation Oriented Modeling is more intuitive, relatively free of assumptions, and encourages researchers to stay close to their data. Rather than estimating abstract population parameters, the overarching goal of the analysis is to identify and explain distinct patterns within the observations. Selected data from a recent study by Craig et al. were analyzed using Observation Oriented Modeling; this analysis was contrasted with a traditional repeated measures ANOVA assessment. Various pitfalls in traditional parametric analyses were avoided when using Observation Oriented Modeling, including the presence of outliers and missing data. The differences between Observation Oriented Modeling and various parametric and nonparametric statistical methods were finally discussed.
... implied that a replication study would have a 99% chance of yielding significant results (see Oakes, 1986, for a discussion of why this assumption is incorrect). While recent advocates of significance tests have concurred that the probability of replicating a significant effect is not literally equal to 1 minus the p-value, they have defended researchers' intuition that p-values indicate something about the replicability of a finding (Harris, 1997;Krueger, 2001;Scarr, 1997). ...
... There is still a great deal of work to do in convincing our profession of the merits of Bayesian estimation and inference. Consistent with this landscape, there are many who continue to defend NHST against alternative methodologies like Bayesianism (e.g., Abelson, 1997;Hagen, 1997;Harris, 1997). Moreover, Bayesianism inspires its own raconteurs of cautionary tales, as well as a few outright detractors (e.g., Killeen, 2006;Larry, 2008;Senn, 2008). ...
Article
Bayesian estimation and inference remain infrequently used in organizational science research. Despite innumerable warnings regarding the entrenched frequentist paradigm, our field has yet to embrace the Bayesian revolution that seems to be sweeping through so many other disciplines. With this context as a backdrop, we address a simple yet difficult question: What is the likelihood that Bayesian methodologies eventually will supplement or even supplant traditional frequentist methodologies in the organizational science community? We draw on institutional theory to address this question, highlighting the cultural-cognitive, normative, and regulative forces that play important roles. As novel contributions to the discussion, we go beyond our own ideas and previously published opinions on the subject to report the opinions of 26 institutional elites (current and former officers of academic associations, editors, and editorial board members). These leading scholars help us shed light not only on the likelihood that Bayesianism will take root in the field but also on practical steps that could be taken to assist in this process. In some ways, we build Bayesian priors about Bayesian analysis, where those priors will be qualified on the basis of future events and outcomes.
... Both aspects of meta-analysis are likely to enhance the accuracy of conclusions derived from such a review of the literature. See the following sources for a discussion of the relative merits of focusing on significance levels (Abelson, 1997; Chow, 1988; Harris, 1997) or on effect sizes (Cohen, 1994; Hunter, 1997; Schmidt, 1996). A third problem with focusing on significance levels is their limited utility for translating research findings into real-world implications. ...
Article
Full-text available
The present article discusses the application of quantitative research synthesis techniques to family violence research. While examples are taken from a number of general areas in family violence, we focus on the application of these review methods to evaluation research in particular. Although this is not a “how-to” manual for conducting quantitative or meta-analytic reviews, we present a general description of the meta-analytic process and then address both meritorious and problematic aspects of quantitative research synthesis. The paper examines the manner in which quantitative research synthesis reconceptualizes the review process, the problems often confronted when conducting a meta-analysis, and the way in which these techniques complement the traditional narrative literature review. To demonstrate the applicability of these techniques to evaluation research in family violence, a small meta-analysis of the Spousal Assault Replication Program (SARP) is presented.
... However, it is seldom pointed out that P values may have some merits when testing hypotheses that can be true, and so, critics may ignore some possible benefits of including tests of nil hypotheses in a strength-of-evidence approach (e.g. Fleiss 1986; Frick 1995; Chow 1996; Hagen 1997; Harris 1997a). ...
Article
Full-text available
Interpreting a P value from a traditional nil hypothesis test as a strength-of-evidence for the existence of an environmentally important difference between two populations of continuous variables (e.g. a chemical concentration) has become commonplace. Yet, there is substantial literature, in many disciplines, that faults this practice. In particular, the hypothesis tested is virtually guaranteed to be false, with the result that P depends far too heavily on the number of samples collected (the 'sample size'). The end result is a swinging burden-of-proof (permissive at low sample size but precautionary at large sample size). We propose that these tests be reinterpreted as direction detectors (as has been proposed by others, starting from 1960) and that the test's procedure be performed simultaneously with two types of equivalence tests (one testing that the difference that does exist is contained within an interval of indifference, the other testing that it is beyond that interval - also known as bioequivalence testing). This gives rise to a strength-of-evidence procedure that lends itself to a simple confidence interval interpretation. It is accompanied by a strength-of-evidence matrix that has many desirable features: not only a strong/moderate/dubious/weak categorisation of the results, but also recommendations about the desirability of collecting further data to strengthen findings.
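A rough sketch of the spirit of this proposal is given below (Python; the indifference margin, alpha, and data are illustrative assumptions, and the two-part output is a simplification of the authors' strength-of-evidence matrix): an ordinary two-sided t test is read as a direction detector, while two one-sided tests (TOST) check whether the observed difference lies within the interval of indifference.

```python
# A rough sketch of combining a directional reading of a two-sided test
# with two one-sided equivalence tests (TOST). The interval of indifference
# (-margin, +margin) and the labels are illustrative assumptions, not the
# authors' exact strength-of-evidence matrix.
import numpy as np
from scipy import stats

def direction_and_equivalence(x, y, margin, alpha=0.05):
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / nx + np.var(y, ddof=1) / ny)
    df = nx + ny - 2
    # Direction detector: ordinary two-sided t test read directionally.
    t = diff / se
    p_two_sided = 2 * stats.t.sf(abs(t), df)
    direction = "undetermined"
    if p_two_sided < alpha:
        direction = "x > y" if t > 0 else "x < y"
    # TOST: equivalence is claimed only if both one-sided tests reject.
    p_lower = stats.t.sf((diff + margin) / se, df)   # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)  # H0: diff >= +margin
    equivalent = max(p_lower, p_upper) < alpha
    return direction, equivalent

rng = np.random.default_rng(2)
x = rng.normal(0.05, 1.0, 60)
y = rng.normal(0.00, 1.0, 60)
print(direction_and_equivalence(x, y, margin=0.5))
```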
... Some researchers have argued that replacing NHST with statistical inference based on CIs is unlikely to improve scientific practice because both are ultimately based on the same information and both involve an exclusionary decision of some kind (Cortina & Dunlap, 1997). In fact, the direct relation between NHST outcomes and CIs can lead people to interpret CIs as if they were significance tests (Harris, 1997). When a CI is used only to determine whether the null value lies within the interval (e.g., when comparing the mean difference between two groups), the statistical inference is no different than determining whether the outcome of a t test is significant or not. ...
Article
The use of confidence intervals (CIs) as an addition or as an alternative to null hypothesis significance testing (NHST) has been promoted as a means to make researchers more aware of the uncertainty that is inherent in statistical inference. Little is known, however, about whether presenting results via CIs affects how readers judge the probability that an effect is present in the population of interest and whether a replication would be likely to reveal the same results. In the present study, 66 PhD students were asked to interpret statistical outcomes presented as CIs or as conventional statistics (t statistics and associated p values). Fewer misinterpretations of statistics—such as accepting the null hypothesis—and more references to effect size were found when results were presented as CIs. Furthermore, participants tended to be more certain about the existence of a population effect in the expected direction and about the replicability of the results when the results were presented following the conventions of NHST than when presented using CIs. Contrary to expectations, no evidence of a more precipitous drop in the belief of the existence of a population effect and replicability estimates when p values exceeded the significance level of .05 was found when data were presented using NHST instead of by CIs.
... Most courses and textbooks of statistics routinely present tests of significance as established truth, and tests of significance continue to be a pervasive tool of experimental research. It is difficult to find empirical studies in marketing research, the social sciences, psychology, biology, medicine, and some physical sciences that do not report tests of significance, and the use of these tests continues to be defended (Fleiss 1986;Chow 1988;Frick 1996;Harris 1997;Abelson 1997). ...
Article
Although tests of statistical significance have been the subject of much criticism, they continue to be a major tool in the analysis and presentation of experimental findings. This article presents a probability problem to demonstrate a basic inconsistency in the way statistical significance tests are used when deciding among actions, and proposes a simple alternative method for dealing with the uncertainty inherent in the outcome of experiments.
... Instead we should use confidence intervals to represent the lack of certainty in our belief about our data, power analysis to assess the probability of a type II error (failing to reject the null hypothesis when the null hypothesis is false), and exploratory data analysis (Tukey 1977) and effect sizes (combined with meta-analysis) to represent the magnitude of effects. Those in favor of significance testing respond that making policy and determining courses of treatment require binary decisions, that an α level can be agreed upon for making such a decision, and that the conservative stance of an unknown relationship being nil accurately represents resistance to the implementation of a new program or treatment (Abelson 1997; Chow 1988; Cortina & Dunlap 1997; Frick 1999; Harris 1997; Wainer 1999). ...
Article
Full-text available
Regression coefficients cannot be interpreted as causal if the relationship can be attributed to an alternate mechanism. One may control for the alternate cause through an experiment (e.g., with random assignment to treatment and control) or by measuring a corresponding confounding variable and including it in the model. Unfortunately, there are some circumstances under which it is not possible to measure or control for the potentially confounding variable. Under these circumstances, it is helpful to assess the robustness of a statistical inference to the inclusion of a potentially confounding variable. In this article, an index is derived for quantifying the impact of a confounding variable on the inference of a regression coefficient. The index is developed for the bivariate case and then generalized to the multivariate case, and the distribution of the index is discussed. The index also is compared with existing indices and procedures. An example is presented for the relationship between socioeconomic background and educational attainment, and a reference distribution for the index is obtained. The potential for the index to inform causal inferences is discussed, as are extensions. INTRODUCTION: “BUT HAVE YOU CONTROLLED FOR ...?” As is commonly noted, one must be cautious in making causal inferences from statistical
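The general idea behind such an index can be illustrated with the standard partial-correlation formula (a hedged sketch of the concept, not the paper's exact index): how strongly would an omitted confound have to correlate with both predictor and outcome before the adjusted correlation drops below the significance threshold?

```python
# Hedged sketch of the general idea behind a confounding-impact index:
# find how strong an omitted confound would have to be to reduce an
# observed correlation below the significance threshold. Uses the
# standard partial-correlation formula, not the paper's exact index.
import numpy as np
from scipy import stats

def r_critical(n, alpha=0.05):
    """Smallest |r| that is significant at the two-sided alpha level."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / np.sqrt(t_crit**2 + df)

def partial_r(r_xy, r_xc, r_yc):
    """Correlation of x and y after partialling out confound c."""
    return (r_xy - r_xc * r_yc) / np.sqrt((1 - r_xc**2) * (1 - r_yc**2))

# Illustrative numbers: observed r = .25 with n = 200.
r_xy, n = 0.25, 200
r_crit = r_critical(n)
# Assume the confound correlates equally with predictor and outcome, and
# grid-search for the correlation that would undo the significant inference.
for r_c in np.arange(0.0, 1.0, 0.005):
    if partial_r(r_xy, r_c, r_c) < r_crit:
        print(f"r_crit = {r_crit:.3f}; confound correlations of about "
              f"{r_c:.3f} with both variables would undo the inference")
        break
```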
... The practice of null hypothesis testing and significance testing is not without its proponents (Abelson, 1997;Frick, 1996;Hagen, 1997;Harris, 1997). To review this side of the debate is beyond the scope of this article, but a few points can be mentioned that are relevant in this context. ...
Article
Full-text available
The critique against significance testing has been increasingly acknowledged in recent years. This paper focuses on the relation between meta-analysis and this controversy. A contradiction in the literature can be seen in that significance testing has been blamed for the poor accumulation of knowledge in psychology, while at the same time meta-analytic reviews have claimed the opposite. Although a majority of meta-analytic experts argue against significance testing, this critique cannot account for the success of meta-analysis. Rather, it may be that meta-analysis has facilitated the recognition of the significance test critique. Taking the significance testing critique seriously has important implications for meta-analysis in that its research base (e. g., studies) is viewed as unreliable. Although the significance test controversy may lead to further fragmentation of psychology, it is not clear that this will negatively affect the practice of meta-analysis.
... Mistakes commonly cited include accepting the null hypothesis when it fails to be rejected, automatically interpreting rejected null hypotheses as theoretically or practically meaningful, and failing to consider the likelihood of Type II errors (Loftus, 1996; Shrout, 1997; see also Wilcox, 1998). Some writers (e.g., Hunter, 1997) have recommended outright bans against NHST, arguing that error rates are as high as 60%, not the 5% traditionally thought. Others have argued for the continued use of NHST in addition to the incorporation of effect size statistics and confidence intervals (e.g., Abelson, 1997; Harris, 1997). ...
Article
Full-text available
Statistically significant differences in culture means may or may not reflect practically important differences between people of different cultures. To determine whether differences between culture means represent meaningful differences between individuals, further data analyses involving measures of cultural effect sizes are necessary. In this article the authors recommend four such measures and demonstrate their efficacy on two data sets from previously published studies. They argue for their use in future cross-cultural research as a complement to traditional tests of mean differences.
... I have seen an increase in both (a) papers submitted with effect sizes and (b) revisions, where effect sizes were requested, including effect sizes before the paper appears in print. (personal communication, February 2002) Richard Harris (1997) defended some applications of NHST at the 1996 APA symposium, "Needed: A Ban on Significance Tests." He was also optimistic about the impact of some recommendations: ...
Article
Full-text available
The fifth edition of the Publication Manual of the American Psychological Association (APA) draws on recommendations for improving statistical practices made by the APA Task Force on Statistical Inference (TFSI). The manual now acknowledges the controversy over null hypothesis significance testing (NHST) and includes both a stronger recommendation to report effect sizes and a new recommendation to report confidence intervals. Drawing on interviews with some critics and other interested parties, the present review identifies a number of deficiencies in the new manual. These include lack of follow-through with appropriate explanations and examples of how to report statistics that are now recommended. At this stage, the discipline would be well served by a response to these criticisms and a debate over needed modifications.
Preprint
With recent focus on the state of research in psychology, it is essential to assess the nature of the statistical methods and analyses used and reported by psychological researchers. To that end, we investigated the prevalence of different statistical procedures and the nature of statistical reporting practices in recent articles from the 4 major Canadian psychology journals. The majority of authors evaluated their research hypotheses through the use of analysis of variance, t tests, and multiple regression. Multivariate approaches were less common. Null hypothesis significance testing remains a popular strategy, but the majority of authors reported a standardized or unstandardized effect size measure alongside their significance test results. Confidence intervals on effect sizes were infrequently employed. Many authors provided minimal details about their statistical analyses and less than a third of the articles presented on data complications such as missing data and violations of statistical assumptions. Strengths of and areas needing improvement for reporting quantitative results are highlighted. The article concludes with recommendations for how researchers and reviewers can improve comprehension and transparency in statistical reporting.
Article
The use of p values in null hypothesis statistical tests (NHST) is controversial in the history of applied statistics, owing to a number of problems. They are: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and overlooking better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research.
Book
You're being asked to quantify your usability improvements with statistics. But even practitioners with a background in statistics are often hesitant to analyze their data statistically, unsure which statistical tests to use and finding it hard to defend the use of small test sample sizes. This book is a practical guide to solving common quantitative problems that arise in usability testing with statistics. It addresses common questions you face every day, such as: Is the current product more usable than our competition? Can we be sure at least 70% of users can complete the task on the 1st attempt? How long will it take users to purchase products on the website? The book shows you which test to use and how to apply it, providing a foundation for both the statistical theory and best practices in applying it. The authors draw on decades of statistical literature from human factors, industrial engineering, and psychology, as well as their own published research, to provide the best solutions. They provide concrete solutions (Excel formulas, links to their own web calculators) along with an engaging discussion of the statistical reasons why the tests work and how to communicate the results effectively. *Provides practical guidance on solving usability testing problems with statistics for any project, including those using Six Sigma practices *Shows practitioners which test to use, why the tests work, and best practices in application, along with easy-to-use Excel formulas and web calculators for analyzing data *Recommends ways for practitioners to communicate results to stakeholders in plain English. © 2012 Jeff Sauro and James R. Lewis. Published by Elsevier Inc. All rights reserved.
Chapter
Because many usability practitioners deeply depend on the use of measurement and statistics to guide their design recommendations, they inherit these controversies. In this chapter we summarize both sides of each issue and discuss what we, as pragmatic usability practitioners, recommend.
Article
Full-text available
Sharpe's (2013) article considered reasons for the apparent resistance of substantive researchers to the adoption of newer statistical methods recommended by quantitative methodologists, and possible ways to reduce that resistance, focusing on improved communication. The important point that Sharpe missed, however, is that because research methods vary radically from one subarea of psychology to another, a particular statistical innovation may be much better suited to some subareas than others. Although there may be some psychological or logistical explanations that account for resistance to innovation in general, to fully understand the resistance to any particular innovation, it is necessary to consider how that innovation impacts specific subareas of psychology. In this comment, I focus on the movement to replace null hypothesis significance testing (NHST) with reports of effect sizes and/or confidence intervals, and consider its possible impact on research in which only the direction of the effect is meaningful, and there is no basis for predicting specific effect sizes (and very large samples are rarely used). There are numerous examples of these studies in social psychology, for instance, such as those that deal with priming effects. I use a study in support of terror management theory as my main example. I conclude that the degree to which statistical reformers have overgeneralized their criticisms of NHST, and have failed to tailor their recommendations to different types of research, may explain some of the resistance to abandoning NHST. Finally, I offer suggestions for improved communication to supplement those presented by Sharpe. (PsycINFO Database Record
Article
Many authors have criticized small sample size experiments because of their lack of statistical reliability, on the basis of statistical power and a subjective evaluation of the prior probability of the null hypothesis H0 (Cohen, 1994; Chertow, Palevsky, & Green, 2006). The aim of the present study is to test the reliability of significant results obtained from small sample size experiments in comparison with larger ones. The different samples (10, 20, 40, 80, 160) are obtained by Monte Carlo simulation, representing two conditions: H0 true and H0 false. As the parametric procedure, linear regression analysis has been used. Thus, the frequency of the Type I error and the false positive rate probability (FPRP) have been evaluated. The frequency of the Type I error is around 5%, independently of the sample size. Moreover, the FPRP values obtained in the small size samples are comparable to the values obtained in the larger samples. In conclusion, the significant results obtained in small size samples are reliable and have statistical validity as well as those obtained in larger samples. This is true even from a Bayesian point of view when a non-informative a priori probability of H0 is taken.
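A minimal Monte Carlo sketch in the spirit of this study (Python; the number of replications and the alpha level are illustrative choices) simulates linear regressions under a true null hypothesis and tallies how often p < .05 at each sample size.

```python
# Monte Carlo sketch: simulate linear regressions with a true null
# (slope = 0) and count how often p < .05, for several sample sizes.
# Replication count and alpha are illustrative settings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reps, alpha = 5_000, 0.05
for n in (10, 20, 40, 80, 160):
    false_positives = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        y = rng.normal(size=n)             # independent of x: H0 is true
        result = stats.linregress(x, y)
        if result.pvalue < alpha:
            false_positives += 1
    print(f"n = {n:3d}: estimated Type I error rate = {false_positives / reps:.3f}")
# The estimated rate stays near alpha regardless of sample size, which is
# the pattern reported for the H0-true condition.
```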
Article
Full-text available
Cross-cultural research is now an undeniable part of mainstream psychology and has had a major impact on conceptual models of human behavior. Although it is true that the basic principles of social psychological methodology and data analysis are applicable to cross-cultural research, there are a number of issues that are distinct to it, including managing incongruities of language and quantifying cultural response sets in the use of scales. Cross-Cultural Research Methods in Psychology provides state-of-the-art knowledge about the methodological problems that need to be addressed if a researcher is to conduct valid and reliable cross-cultural research. It also offers practical advice and examples of solutions to those problems and is a must-read for any student of culture.
Article
Lack of ability to think probabilistically makes one prone to a variety of irrational fears and vulnerable to scams designed to exploit probabilistic naiveté, impairs decision making under uncertainty, facilitates the misinterpretation of statistical information, and precludes critical evaluation of likelihood claims. Cognition and Chance presents an overview of the information needed to avoid such pitfalls and to assess and respond to probabilistic situations in a rational way. Dr. Nickerson investigates such questions as how good individuals are at thinking probabilistically and how consistent their reasoning under uncertainty is with principles of mathematical statistics and probability theory. He reviews evidence that has been produced in researchers' attempts to investigate these and similar types of questions. Seven conceptual chapters address such topics as probability, chance, randomness, coincidences, inverse probability, paradoxes, dilemmas, and statistics. The remaining five chapters focus on empirical studies of individuals' abilities and limitations as probabilistic thinkers. Topics include estimation and prediction, perception of covariation, choice under uncertainty, and people as intuitive probabilists. Cognition and Chance is intended to appeal to researchers and students in the areas of probability, statistics, psychology, business, economics, decision theory, and social dilemmas. © 2004 by Lawrence Erlbaum Associates, Inc. All rights reserved.
Article
Full-text available
In the late 1990s, the American Psychological Association advocated improved statistical practices, including reporting confidence intervals (CIs) and effect sizes. Since this statistical reform, the numbers of reports in international academic journals of psychology which have included the CI and effect sizes have increased. We investigated the evidence for statistical reform in Japan by examining papers published from 1982 to 2008 in the Japanese Journal of Psychonomic Science. The reports which included CIs and effect sizes were extremely rare in the journal even after 2001. This observation suggests that this statistical reform has not yet started among Japanese researchers in psychology.
Article
Pro-eating-disorder websites have come under public criticism and are restricted or blocked in many countries for reasons of youth protection. They are assumed to exert a strong detrimental influence on their predominantly young, female readership. In forums and blogs, eating disorders are portrayed as a desirable lifestyle, the underweight body is elevated to an ideal, and harmful eating practices and methods of radical weight reduction are shared. The study reported in this dissertation used an experimental approach to examine effects on affect, body satisfaction, and self-esteem in young women. The 421 participants (mean age 23.5 years) were randomly assigned to one of three conditions: a pro-eating-disorder blog, a self-help blog (aimed at overcoming an eating disorder), or a neutral blog (without eating-disorder-related content). It was hypothesized that negative effects of the blog would appear primarily in vulnerable individuals. Therefore, participants' individual risk of developing an eating disorder was assessed before the experiment, and equal numbers of low-risk and high-risk participants were recruited. Indeed, a general negative effect emerged only for negative affect, which was highest in the pro-eating-disorder condition but was also elevated after reading the self-help blog relative to the neutral blog. Among high-risk participants, body-related self-esteem was lower after reading either the pro-eating-disorder blog or the self-help blog than in the neutral condition. For body dissatisfaction, these participants showed only a tendency toward elevated values in the two eating-disorder conditions. The results indicate that the danger posed by pro-eating-disorder websites can be put into perspective. The negative mood triggered by reading them can be regarded as a natural reaction to content experienced as disturbing and shocking. Among the potential readership, about 30% have an elevated risk of an eating disorder and appear more susceptible to the harmful effects, but the same holds for self-help websites, whose right to exist is rarely questioned. However, the results of the study do not permit conclusions about the effects of more intensive contact with pro-eating-disorder websites.
Article
Legal scholars have increasingly begun to use principles of behavioral psychology, cognitive science, and related disciplines to inform legal analysis. Much of this analysis has been motivated by perceived limitations of economic analysis of law, particularly its foundational assumption that man is a rational maximizer of his expected utilities ("Chicago Man"). Drawing in substantial part from the work of Nobel Prize winning psychologist Daniel Kahneman and his late colleague Amos Tversky, behavioral decision theorists have argued that legal analysis should be based upon more realistic models of human activity (hence, "K-T Man"). As behavioral analysis of law has grown in popularity, inevitably it has come under attack on several dimensions. This article (a) defends both the scientific integrity and legitimacy of this new mode of analysis, and (b) defends its particular applications by legal decision theorists. The author argues that attempts to paint the heuristics and biases literature founded by Kahneman and Tversky as the mere product of parlor tricks used in laboratory experiments involving college sophomores will ring increasingly hollow as new techniques of neuroimaging continue to produce evidence of brain activity substantiating these limitations on human reasoning.
Article
Full-text available
Causal inference is an important, controversial topic in the social sciences, where it is difficult to conduct experiments or measure and control for all confounding variables. To address this concern, the present study presents a probability index to assess the robustness of a causal inference to the impact of a confounding variable. The information from existing covariates is used to develop a reference distribution for gauging the likelihood of observing a given value of the impact of a confounding variable. Applications are illustrated with an empirical example pertaining to educational attainment. The methodology discussed in this study allows for multiple partial causes in the complex social phenomena that we study, and informs the controversy about causal inference that arises from the use of statistical models in the social sciences.
Article
For half a century, methodologists have debated when to use one- and two-tailed tests. But they conducted the debate with scarcely a mention of the little known directional two-tailed test - the only hypothesis test that, properly used, provides for a decision in either direction. In contrast, the traditional two-tailed test assesses nondirectional statistical hypotheses and does not provide for a directional decision. A directional two-tailed test with unequal rejection regions can have virtually the same power as a one-tailed test and, unlike one-tailed tests, it provides for deciding in the unpredicted direction. However, a problem unresolved for one-tailed tests remains for the directional two-tailed test, namely, whether one should create unequal rejection regions just because one has grounds to predict an outcome's direction. Nevertheless, the directional two-tailed test will satisfy research needs much more frequently than will traditional tests and should be adopted as the primary, general-purpose hypothesis test.
Article
Chow's book should be read only by those who already have a firm enough grasp of the logic of significance testing to separate the few valid, insightful points from the many incorrect statements and misrepresentations.
Article
Full-text available
Attention to statistical power and effect size can improve the design and the reporting of behavioral accounting research. Three accounting journals representative of current empirical behavioral accounting research are analyzed for their power (1−β), or control of Type II errors (β), and compared to research in other disciplines. Given this study's findings, additional attention should be directed to adequacy of sample sizes and study design to ensure sufficient power when Type I error is controlled at α = .05 as a baseline. We do not suggest replacing traditional significance testing, but rather augmenting it with the reporting of β to complement and interpret the relevance of a reported α in any given study. In addition, the presentation of results in alternative formats, such as those suggested in this study, will enhance the current reporting of significance tests. In turn, this will allow the reader a richer understanding of, and an increased trust in, a study's results and implications.
Article
We review the publication guidelines of the American Psychological Association (APA) since 1929 and document their advice for authors about statistical practice. Although the advice has been extended with each revision of the guidelines, it has largely focused on null hypothesis significance testing (NHST) to the exclusion of other statistical methods. In parallel, we review over 40 years of critiques of NHST in psychology. Until now, the critiques have had little impact on the APA guidelines. The guidelines are influential in broadly shaping statistical practice, although in some cases recommended reporting practices are not closely followed. The guidelines have an important role to play in reform of statistical practice in psychology. Following the report of the APA's Task Force on Statistical Inference, we propose that future revisions of the guidelines reflect a broader philosophy of analysis and inference, provide detailed statistical requirements for reporting research, and directly address concerns about NHST. In addition, the APA needs to develop ways to ensure that its editors succeed in their leadership role in achieving essential reform.
Article
Two generations of methodologists have criticized hypothesis testing by claiming that most point null hypotheses are false and that hypothesis tests do not provide the probability that the null hypothesis is true. These criticisms are answered. (1) The point-null criticism, if correct, undermines only the traditional two-tailed test, not the one-tailed test or the little-known directional two-tailed test. The directional two-tailed test is the only hypothesis test that, properly used, provides for deciding the direction of a parameter, that is, deciding whether a parameter is positive or negative or whether it falls above or below some interesting nonzero value. The point-null criticism becomes unimportant if we replace traditional one- and two-tailed tests with the directional two-tailed test, a replacement already recommended for most purposes by previous writers. (2) If one interprets probability as a relative frequency, as most textbooks do, then the concept of probability cannot meaningfully be attached to the truth of an hypothesis; hence, it is meaningless to ask for the probability that the null is true. (3) Hypothesis tests provide the next best thing, namely, a relative frequency probability that the decision about the statistical hypotheses is correct. Two arguments are offered.
Article
Full-text available
By imposing directional decisions on traditional two-tailed tests, you will overestimate power, underestimate sample size, and ignore the risk of Type III error. You can avoid these problems by using the directional two-tailed test. This paper demonstrates how to use PROC POWER to estimate the prospective power and Type III error for one-sample, two-sample, and paired t tests for means, and binomial tests for proportions, using the directional two-tailed testing procedure. Examples are given with SAS® code accompanied by selected results. Power and Type III error for the directional two-tailed sign test with unknown direction are also discussed and illustrated with a SAS macro written by the author.
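The paper itself provides SAS code; as a rough, language-neutral analogue of the kind of calculation it describes, the sketch below computes power and Type III error (rejection in the unpredicted direction) for a directional two-tailed one-sample t test. The effect size, sample size, and α split are assumptions chosen for illustration.

```python
# Power and Type III error for a directional two-tailed one-sample t test,
# with alpha split unequally between the predicted and unpredicted tails.
# Values of d, n, and the alpha split are illustrative assumptions.
from math import sqrt
from scipy import stats

def power_and_type3(d, n, alpha_pred=0.04, alpha_unpred=0.01):
    df = n - 1
    nc = d * sqrt(n)                                # noncentrality (true effect positive)
    upper = stats.t.ppf(1 - alpha_pred, df)         # predicted-direction critical value
    lower = stats.t.ppf(alpha_unpred, df)           # unpredicted-direction critical value
    power = stats.nct.sf(upper, df, nc)             # correct directional rejection
    type3 = stats.nct.cdf(lower, df, nc)            # rejection in the wrong direction
    return power, type3

power, type3 = power_and_type3(d=0.4, n=50)
print(f"power = {power:.3f}, Type III error = {type3:.6f}")
```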
Chapter
Cross-cultural studies involve persons from different countries and/or ethnic groups. One of the central methodological problems of these studies is bias, the generic term for multiple explanations of cross-cultural differences. Three different types of bias are distinguished, depending on whether the source of interpretation problems derives from the construct, method of the study, or specific items (called construct, method, and item bias or differential item functioning, respectively). Equivalence refers to the implications of bias on score comparability. Linguistic, structural, measurement unit, and full score equivalence are described. Issues in test translation (translation – back-translation, committee designs, decentering) are discussed. Common subject- and culture-sampling schemes in cross-cultural research are mentioned. The article ends with a discussion of issues in combining individual- and country-level characteristics.
Article
Full-text available
Data analysis methods in psychology still emphasize statistical significance testing, despite numerous articles demonstrating its severe deficiencies. It is now possible to use meta-analysis to show that reliance on significance testing retards the development of cumulative knowledge. But reform of teaching and practice will also require that researchers learn that the benefits that they believe flow from use of significance testing are illusory. Teachers must revamp their courses to bring students to understand that (a) reliance on significance testing retards the growth of cumulative research knowledge; (b) benefits widely believed to flow from significance testing do not in fact exist; and (c) significance testing methods must be replaced with point estimates and confidence intervals in individual studies and with meta-analyses in the integration of multiple studies. This reform is essential to the future progress of cumulative knowledge in psychological research.
Article
Full-text available
Concerning the traditional nondirectional 2-sided test of significance, the author argues that "we cannot logically make a directional statistical decision or statement when the null hypothesis is rejected on the basis of the direction of the difference in the observed means." Thus, this test "should almost never be used." He proposes that "almost without exception the directional two-sided test should replace" it.
Article
For a wide range of tests of single-df hypotheses, the sample size needed to achieve 50% power is readily approximated by setting N such that a significance test conducted on data that fit one's assumptions perfectly just barely achieves statistical significance at one's chosen alpha level. If the effect size assumed in establishing one's N is the minimally important effect size (i.e., that effect size such that population differences or correlations smaller than that are not of any practical or theoretical significance, whether statistically significant or not), then 50% power is optimal, because the probability of rejecting the null hypothesis should be greater than .5 when the population difference is of practical or theoretical significance but lower than .5 when it is not. Moreover, the actual power of the test in this case will be considerably higher than .5, exceeding .95 for a population difference two or more times as large as the minimally important difference (MID). This minimally important difference significant (MIDS) criterion extends naturally to specific comparisons following (or substituting for) overall tests such as the ANOVA F and chi-square for contingency tables, although the power of the overall test (i.e., the probability of finding some statistically significant specific comparison) is considerably greater than .5 when the MIDS criterion is applied to the overall test. However, the proper focus for power computations is one or more specific comparisons (rather than the omnibus test), and the MIDS criterion is well suited to setting sample size on this basis. Whereas N_MIDS (the sample size specified by the MIDS criterion) is much too small for the case in which we wish to prove the modified H0 that there is no important population effect, it nonetheless provides a useful metric for specifying the necessary sample size. In particular, the sample size needed to have a 1 − α probability that the (1 − α)-level confidence interval around one's population parameter includes no important departure from H0 is four times N_MIDS when H0 is true and approximately [4/(1 − b)²]·N_MIDS when b (the ratio between the actual population difference and the minimally important difference) is between zero and unity. The MIDS criterion for sample size provides a useful alternative to the methods currently most commonly employed and taught.
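A worked illustration of the MIDS criterion as described above: find the N at which data fitting the minimally important difference exactly would just barely reach two-sided significance. The minimally important standardized difference used here (d = 0.5) is an arbitrary example value.

```python
# Find the smallest per-group N at which a two-sample t test on data that
# match the assumed minimally important difference exactly is just significant
# (the MIDS criterion, roughly 50% power). d_mid is an illustrative value.
from math import sqrt
from scipy import stats

def n_mids(d_mid, alpha=0.05):
    n = 2
    while True:
        df = 2 * n - 2
        t_obs = d_mid * sqrt(n / 2)                  # t for data matching d_mid exactly
        if t_obs >= stats.t.ppf(1 - alpha / 2, df):  # just reaches two-sided significance
            return n
        n += 1

print(f"per-group N_MIDS for d = 0.5: {n_mids(0.5)}")
```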
Article
The controversy that has raged since the early fifties regarding the admissibility of one-tailed tests of hypotheses was examined. From the review of that literature, it was concluded that the main advantage of the one-tailed test was the gain in power for the prediction while its main disadvantage was its inability to test for significance if the results were opposite to prediction. It is argued here that splitting α unequally between the two tails, placing most of the rejection region on the side of the prediction but a smaller fraction on the opposite side provides both power and the ability to detect opposite-to-prediction outcomes. This compromise procedure requires a finer choice in the splitting of α than the dichotomous choice of putting either all or exactly half of α in the favored tail, i.e., the choice between a one- or a two-tailed test. Rules for the most effective split, based on Bayesian considerations, are prescribed. The fraction of α in the predicted tail should be equal to the investigator's a priori probability that the predicted order, as opposed to the reversed order, of sample means will be obtained. A table of t-values is presented which gives critical regions for significance, both "expected" and "unexpected," at specified levels of a priori probability.
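The splitting rule itself is easy to apply: set the fraction of α in the predicted tail equal to the a priori probability of the predicted ordering, and the two critical t values follow directly, as in this sketch (the prior probabilities, α, and degrees of freedom are illustrative numbers).

```python
# Critical t values when alpha is split according to the investigator's prior
# probability that the predicted direction is correct (illustrative numbers).
from scipy import stats

def split_alpha_criticals(prior_predicted, alpha=0.05, df=30):
    alpha_pred = alpha * prior_predicted            # rejection region in predicted tail
    alpha_unpred = alpha * (1 - prior_predicted)    # rejection region in opposite tail
    t_pred = stats.t.ppf(1 - alpha_pred, df)
    t_unpred = stats.t.ppf(alpha_unpred, df)
    return t_pred, t_unpred

for prior in (0.5, 0.8, 0.95):
    up, low = split_alpha_criticals(prior)
    print(f"prior = {prior:.2f}: reject if t >= {up:.3f} or t <= {low:.3f}")
```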
Article
This paper is based on the 1989 Miller Memorial Lecture at Stanford University. The topic was chosen because of Rupert Miller's long involvement and significant contributions to multiple comparison procedures and theory. Our emphasis will be on the major questions that have received relatively little attention--on what one wants multiple comparisons to do, on why one wants to do that, and on how one can communicate the results. Very little attention will be given to how the results can be calculated--after all, there are books about that (e.g., Miller, 1966, 1981; Hochberg and Tamhane, 1987).
Article
Despite publication of many well-argued critiques of null hypothesis testing (NHT), behavioral science researchers continue to rely heavily on this set of practices. Although we agree with most critics' catalogs of NHT's flaws, this article also takes the unusual stance of identifying virtues that may explain why NHT continues to be so extensively used. These virtues include providing results in the form of a dichotomous (yes/no) hypothesis evaluation and providing an index (p value) that has a justifiable mapping onto confidence in repeatability of a null hypothesis rejection. The most-criticized flaws of NHT can be avoided when the importance of a hypothesis, rather than the p value of its test, is used to determine that a finding is worthy of report, and when p approximately equal to .05 is treated as insufficient basis for confidence in the replicability of an isolated non-null finding. Together with many recent critics of NHT, we also urge reporting of important hypothesis tests in enough descriptive detail to permit secondary uses such as meta-analysis.
Unpublished manuscript
  • R J Harris