Harvey Jay Keselman - University of Manitoba
About
Publications: 157
Reads: 42,841
Citations: 7,921
Current institution: University of Manitoba

Publications (157)
One of the validity conditions of classical test statistics (e.g., Student's t-test, the ANOVA and MANOVA F-tests) is that data be normally distributed in the populations. When this and/or other derivational assumptions do not hold, the classical test statistic can be prone to too many Type I errors (i.e., falsely rejecting too often) and/or have l...
Frane (2015) pointed out the difference between per-family and familywise Type I error control and how different multiple comparison procedures control one method but not necessarily the other. He then went on to demonstrate in the context of a two group multivariate design containing different numbers of dependent variables and correlations betwee...
While numerous investigations have examined the effects of assumption violations on the empirical probability of a Type I error for Tukey’s multiple comparison test, no study to date has numerically quantified and systematically varied the degree of total variation resulting from combining unequal variances with unequal sample sizes. The present in...
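For orientation only (this is not material from the study above): a minimal run of the standard Tukey HSD procedure, whose Type I error behaviour under combined variance and sample-size inequality is what the investigation examines. The example assumes statsmodels is installed; the data, group sizes, and variances are invented.

```python
# Hedged illustration: Tukey HSD on simulated groups with unequal n and unequal spread.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(11)
values = np.concatenate([rng.normal(0, 1, 10),
                         rng.normal(0, 3, 40),    # large group, large variance
                         rng.normal(0, 1, 15)])
groups = np.repeat(["A", "B", "C"], [10, 40, 15])

# All pairwise comparisons at a familywise alpha of .05
print(pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05))
```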
Warr and Erich (2013) compared a frequently recommended procedure in textbooks: the interquartile range divided by the sample standard deviation, against the Shapiro-Wilk's test in assessing normality of data. They found the Shapiro-Wilk's test to be far superior to the deficient interquartile range statistic. We look further into the issue of asse...
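A minimal sketch, assuming SciPy, of the two normality checks being compared: the interquartile range divided by the sample standard deviation (for normal data this ratio is roughly 1.35) and the Shapiro-Wilk test. The skewed sample is invented for illustration.

```python
# Hedged sketch: IQR/SD ratio versus the Shapiro-Wilk test on a skewed sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=50)          # clearly non-normal data

iqr_sd_ratio = stats.iqr(x) / np.std(x, ddof=1)  # ~1.35 expected under normality
w_stat, p_value = stats.shapiro(x)               # Shapiro-Wilk test of normality

print(f"IQR/SD ratio: {iqr_sd_ratio:.2f}")
print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {p_value:.4f}")
```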
One of the validity conditions of classical test statistics (e.g., the ANOVA F-test) is that data be normally distributed in the populations. When this derivational assumption does not hold, classical test statistics can be prone to falsely rejecting too often and/or failing to reject when the null hypothesis is false. Thus, many authors have recommen...
Normality is a distributional requirement of classical test statistics. In order for the test statistic to provide valid results leading to sound and reliable conclusions this requirement must be satisfied. In the not too distant past, it was claimed that violations of normality would not likely jeopardize scientific findings (See Hsu & Feldt, 1969...
In comparing multiple treatments, 2 error rates that have been studied extensively are the familywise and false discovery rates. Different methods are used to control each of these rates. Yet, it is rare to find studies that compare the same methods on both of these rates, and also on the per-family error rate, the expected number of false rejectio...
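As a hedged illustration of two of the error-rate controls discussed (not the study's own methods), the sketch below adjusts one invented set of p values with a Bonferroni (familywise) correction and with the Benjamini-Hochberg (false discovery rate) procedure, assuming statsmodels is available.

```python
# Hedged sketch: familywise versus false-discovery-rate control on the same p values.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.63])

reject_fwer, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni (FWER) rejections:", reject_fwer.sum())
print("Benjamini-Hochberg (FDR) rejections:", reject_fdr.sum())
```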
Reports an error in "Many tests of significance: New methods for controlling type I errors" by H. J. Keselman, Charles W. Miller and Burt Holland (Psychological Methods, 2011[Dec], Vol 16[4], 420-431). The R code for arriving at adjusted p values for one of the methods is incorrect. The specific changes that need to be made are provided in the erra...
During the last half century hundreds of papers published in statistical journals have documented general conditions where reliance on least squares regression and Pearson's correlation can result in missing even strong associations between variables. Moreover, highly misleading conclusions can be made, even when the sample size is large. There are...
Variants of Levene’s and O’Brien’s procedures not investigated by Keselman, Wilcox & Algina (2008) were examined. Simulations indicate that a new O’Brien variant provides very good Type I error control and is simpler for applied researchers to compute than the method recommended by Keselman, et al.
There have been many discussions of how Type I errors should be controlled when many hypotheses are tested (e.g., all possible comparisons of means, correlations, proportions, the coefficients in hierarchical models, etc.). By and large, researchers have adopted familywise (FWER) control, though this practice certainly is not universal. Familywise...
The Kruskal & Wallis and normal scores non-parametric multiple comparison tests were compared to the parametric Tukey test. In addition to quantifying and manipulating the degree of heterogeneity due to combining heterogeneous variances with unequal sample sizes, the tests were compared for their sensitivity, as well as rates of Type I error, under...
Recent literature demonstrates that uniformity of population variances and covariances is a sufficient but not a necessary requirement for valid F ratios in repeated measures designs; the tests will be valid if the less restrictive condition of circularity is satisfied. The circularity assumptions of various repeated measures designs are presented...
One strategy which has been recommended for examining effects in repeated measures designs combines a degrees of freedom (d.f.)-adjusted univariate F test and a multivariate test. The results of a simulation study for a groups × trials repeated measures design are presented, and demonstrate that for balanced designs this combined strategy rarely pr...
Specific information concerning the nature of interaction effects in factorial designs may be obtained through the use of tetrad contrasts. Empirical familywise Type I error rates and power rates associated with 10 procedures for conducting tetrad contrasts in groups-by-trials repeated measures designs were obtained when the assumptions of multisam...
Pairwise comparisons were computed on six procedures which convert tests of spread into tests for mean equality using the usual Student's t statistic and a Welch t statistic which obtains an error term from the variance and sample sizes involved in the comparison. The Tukey multiple comparison critical value was used to assess significance. The obt...
Meta-analytic methods were used to summarize the results of Monte Carlo studies investigating the Type I error and power properties of various univariate and multivariate procedures for testing within-subjects effects in split-plot repeated measures designs. Results indicated that all test procedures were generally robust to violations of the multi...
The Kruskal-Wallis and normal scores non-parametric tests of location equality are compared to the parametric analysis of variance F test. In addition to quantifying and manipulating degrees of variance heterogeneity due to combining heterogeneous variances with unequal sample sizes, the tests are compared for their sensitivity, as well as rates of...
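A quick, hedged comparison of two of the three tests mentioned (the normal scores test is omitted), assuming SciPy; groups with unequal sizes and variances are simulated for illustration.

```python
# Hedged sketch: Kruskal-Wallis versus the one-way ANOVA F test on the same groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
g1 = rng.normal(0.0, 1.0, 20)
g2 = rng.normal(0.5, 2.0, 35)   # unequal variance paired with unequal sample size
g3 = rng.normal(0.0, 1.0, 15)

h, p_kw = stats.kruskal(g1, g2, g3)
f, p_f = stats.f_oneway(g1, g2, g3)
print(f"Kruskal-Wallis p = {p_kw:.4f}, ANOVA F p = {p_f:.4f}")
```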
The probability of detecting factorial effects (main, interaction, simple) with experimentwise and per hypothesis Type I error protection are compared. The effects of level of significance, sample size and size of the true factorial effect are discussed. For researchers working in the personality, social or clinical area, where effects are presumed...
The data obtained from one-way independent groups designs is typically non-normal in form and rarely equally variable across treatment populations (i.e., population variances are heterogeneous). Consequently, the classical test statistic that is used to assess statistical significance (i.e., the analysis of variance F test) typically provides inval...
The increase in the squared multiple correlation coefficient (ΔR²) associated with a variable in a regression equation is a commonly used measure of importance in regression analysis. Algina, Keselman, and Penfield found that intervals based on asymptotic principles were typically very inaccurate, even though the sample size was quite large...
Researchers can adopt one of many different measures of central tendency to examine the effect of a treatment variable across groups. These include least squares means, trimmed means, M-estimators and medians. In addition, some methods begin with a preliminary test to determine the shapes of distributions before adopting a particular estimator of t...
Looney & Stanley's (1989) recommendations regarding analysis strategies for repeated measures designs containing between-subjects grouping variables and within-subjects repeated measures variables were re-examined and compared to recent analysis strategies. That is, corrected degrees of freedom univariate tests, multivariate tests, mixed model test...
Standard least squares analysis of variance methods suffer from poor power under arbitrarily small departures from normality and fail to control the probability of a Type I error when standard assumptions are violated. This article describes a framework for robust estimation and testing that uses trimmed means with an approximate degrees of freedom...
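A minimal two-group sketch in the spirit of the framework described above, assuming SciPy 1.7 or later: a Welch-type (approximate degrees of freedom) test computed on 20% trimmed means, i.e., the Yuen-Welch test. This is not the authors' full ANOVA framework, and the data are simulated.

```python
# Hedged sketch: Yuen-Welch test (trimmed means + approximate degrees of freedom).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.standard_t(df=3, size=40) + 0.5   # heavy-tailed, shifted group
b = rng.standard_t(df=3, size=25)

# equal_var=False gives the Welch-type df; trim=0.2 applies 20% trimming (Yuen's test)
res = stats.ttest_ind(a, b, equal_var=False, trim=0.2)
print(f"Yuen-Welch t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```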
The squared multiple semipartial correlation coefficient is the increase in the squared multiple correlation coefficient that occurs when two or more predictors are added to a multiple regression model. Coverage probability was investigated for two variations of each of three methods for setting confidence intervals for the population squared multi...
We examined 633 procedures that can be used to compare the variability of scores across independent groups. The procedures, except for one, were modifications of the procedures suggested by Levene (1960) and O'Brien (1981). We modified their procedures by substituting robust measures of the typical score and variability, rather than relying on clas...
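The 633 variants are not reproduced here; as a rough illustration of the idea of replacing the classical centering value with a robust one, SciPy's Levene test accepts a median or trimmed-mean center (the latter via the proportiontocut argument).

```python
# Hedged illustration: Levene-type spread tests with robust centering choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(0, 1, 30)
g2 = rng.normal(0, 2, 30)   # larger spread
g3 = rng.normal(0, 1, 30)

stat_med, p_med = stats.levene(g1, g2, g3, center="median")
stat_trim, p_trim = stats.levene(g1, g2, g3, center="trimmed", proportiontocut=0.2)

print(f"median-centered:       W = {stat_med:.2f}, p = {p_med:.4f}")
print(f"trimmed-mean-centered: W = {stat_trim:.2f}, p = {p_trim:.4f}")
```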
A Review of Simultaneous Pairwise Multiple Comparisons. Simultaneous pairwise comparisons can be accomplished with numerous multiple comparison procedures. The methods differ in two essential ways: the choice of critical value and the specification of the estimated standard error of the mean difference. Those methods that assume homogeneou...
Applications of distribution theory for the squared multiple correlation coefficient and the squared cross-validation coefficient are reviewed, and computer programs for these applications are made available. The applications include confidence intervals, hypothesis testing, and sample size selection.
A squared semipartial correlation coefficient (ΔR²) is the increase in the squared multiple correlation coefficient that occurs when a predictor is added to a multiple regression model. Prior research has shown that coverage probability for a confidence interval constructed by using a modified percentile bootstrap method with ΔR² was ge...
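A plain percentile-bootstrap sketch for ΔR² (the article studies a modified percentile bootstrap, which this simplified NumPy version does not implement); the data and the helper names r2 and delta_r2_ci are illustrative assumptions.

```python
# Hedged sketch: percentile bootstrap CI for the increase in R^2 from adding a predictor.
import numpy as np

def r2(X, y):
    X1 = np.c_[np.ones(len(y)), X]
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1 - resid.var() / y.var()

def delta_r2_ci(X_reduced, x_new, y, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                       # resample cases with replacement
        Xr, xn, yb = X_reduced[idx], x_new[idx], y[idx]
        boots[b] = r2(np.c_[Xr, xn], yb) - r2(Xr, yb)     # Delta R^2 in this bootstrap sample
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

rng = np.random.default_rng(8)
X = rng.normal(size=(80, 2))
x_new = rng.normal(size=80)
y = X[:, 0] + 0.4 * x_new + rng.normal(size=80)
print(delta_r2_ci(X, x_new, y))
```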
We examined nine adaptive methods of trimming, that is, methods that empirically determine when data should be trimmed and the amount to be trimmed from the tails of the empirical distribution. Over the 240 empirical values collected for each method investigated, in which we varied the total percentage of data trimmed, sample size, degree of varian...
Several tests for group mean equality have been suggested for analyzing nonnormal and heteroscedastic data. A Monte Carlo study compared the Welch tests on ranked data and heterogeneous, nonparametric statistics with previously recommended procedures. Type I error rates for the Welch tests on ranks and the heterogeneous, nonparametric statistics we...
The increase in the squared multiple correlation coefficient (ΔR²) associated with a variable in a regression equation is a commonly used measure of importance in regression analysis. The coverage probability that an asymptotic and percentile bootstrap confidence interval includes Δρ² was investigated. As expected, coverage probabili...
Kelley compared three methods for setting a confidence interval (CI) around Cohen's standardized mean difference statistic: the noncentral-t-based, percentile (PERC) bootstrap, and bias-corrected and accelerated (BCA) bootstrap methods under three conditions of nonnormality, eight cases of sample size, and six cases of population effect size (ES)...
Consider the linear regression model Y = β₁X + α + τ(X)ε, where X and ε are independent random variables, ε has a mean of zero and variance σ², and τ is some unknown function used to model heteroscedasticity. Many methods have been proposed for testing H₀: τ(X) ≡ 1, the hypothesis that the error term is homoscedastic, with most methods known to be u...
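The methods proposed in the article are not shown; as a familiar baseline for testing H₀: τ(X) ≡ 1, the sketch below applies the Breusch-Pagan test from statsmodels to a simulated regression whose error spread grows with X.

```python
# Hedged sketch: a standard homoscedasticity check (Breusch-Pagan), not the article's method.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, 200)
y = 1.0 + 0.5 * x + x * rng.normal(size=200)   # error variance increases with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM p = {lm_pvalue:.4f}")
```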
A modification to testing pairwise comparisons that may provide better control of Type I errors in the presence of non-normality is to use a preliminary test for symmetry which determines whether data should be trimmed symmetrically or asymmetrically. Several pairwise MCPs were investigated, employing a test of symmetry with a number of heterosceda...
Confidence intervals must be robust in having nominal and actual probability coverage in close agreement. This article examined two ways of computing an effect size in a two-group problem: (a) the classic approach which divides the mean difference by a single standard deviation and (b) a variant of a method which replaces least squares values with...
Nonnormality and variance heterogeneity affect the validity of the traditional tests for treatment group equality (e.g. ANOVA F-test and t-test), particularly when group sizes are unequal. Adopting trimmed means instead of the usual least squares estimator has been shown to be mostly effective in combating the deleterious effects of nonnormality. T...
The probability coverage of intervals involving robust estimates of effect size based on seven procedures was compared for asymmetrically trimming data in an independent two-groups design, and a method that symmetrically trims the data. Four conditions were varied: (a) percentage of trimming, (b) type of nonnormal population distribution, (c) popul...
Multiple comparison procedures (MCPs) are frequently adopted by applied researchers to locate specific differences between treatment groups. That is, omnibus test statistics, such as the analysis of variance F test, can only signify that effects are present, not which specific groups differ from one another (when there are more than two groups). In...
Most classical multivariate procedures (e.g., multivariate analysis of variance, multivariate measures of effect size, classification procedures, maximum likelihood factor analysis) require that the data follow a multivariate normal density function. Behavioral science researchers risk committing many more Type I errors, quantifying inaccurately th...
The authors argue that a robust version of Cohen's effect size constructed by replacing population means with 20% trimmed means and the population standard deviation with the square root of a 20% Winsorized variance is a better measure of population separation than is Cohen's effect size. The authors investigated coverage probability for confidence...
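A hedged sketch of the basic form of the robust effect size described above: the difference in 20% trimmed means divided by the square root of a 20% Winsorized variance (pooled across the two groups here; any rescaling constant that returns the measure to Cohen's d scale is omitted). Assumes SciPy; the helper name robust_d is illustrative.

```python
# Hedged sketch: trimmed-mean difference scaled by a Winsorized standard deviation.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def robust_d(x, y, trim=0.2):
    tm_diff = stats.trim_mean(x, trim) - stats.trim_mean(y, trim)
    # 20% Winsorized variances, pooled by simple averaging for this sketch
    wv_x = np.var(np.asarray(winsorize(x, limits=(trim, trim))), ddof=1)
    wv_y = np.var(np.asarray(winsorize(y, limits=(trim, trim))), ddof=1)
    return tm_diff / np.sqrt((wv_x + wv_y) / 2)

rng = np.random.default_rng(4)
print(robust_d(rng.normal(0.8, 1, 50), rng.normal(0.0, 1, 50)))
```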
Probability coverage for eight different confidence intervals (CIs) of measures of effect size (ES) in a two-level repeated measures design was investigated. The CIs and measures of ES differed with regard to whether they used least squares or robust estimates of central tendency and variability, whether the end critical points of the interval were...
Hotelling's T2 procedure is used to test the equality of means in two-group multivariate designs when covariances are homogeneous. A number of alternatives to T2, which are robust to covariance heterogeneity, have been proposed in the literature. However, all are sensitive to departures from multivariate normality. We demonstrate how to obtain mult...
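For reference, a compact NumPy implementation of the conventional two-group Hotelling's T2 statistic with its F approximation; the robust, bootstrap-based alternatives discussed above are not shown, and the data are simulated.

```python
# Hedged sketch: classical two-group Hotelling T^2 with the usual F reference distribution.
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    n1, p = X.shape
    n2 = Y.shape[0]
    d = X.mean(axis=0) - Y.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(X, rowvar=False) +
                (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S_pooled, d)
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2      # F(p, n1 + n2 - p - 1)
    return t2, stats.f.sf(f, p, n1 + n2 - p - 1)

rng = np.random.default_rng(10)
X = rng.normal(0.0, 1, size=(30, 3))
Y = rng.normal(0.5, 1, size=(25, 3))
print(hotelling_t2(X, Y))
```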
A serious practical problem with the ordinary least squares regression estimator is that it can have a relatively large standard error when the error term is heteroscedastic, even under normality. In practical terms, power can be poor relative to other regression estimators that might be used. This article illustrates the problem and summarizes s...
SAS's PROC MIXED can be problematic when analyzing data from randomized longitudinal two-group designs when observations are missing over time. Overall (1996, 1999) and colleagues found a number of procedures that are effective in controlling the number of false positives (Type I errors) and are yet sensitive (powerful) to detect treatment effects....
Locating pairwise differences among treatment groups is a common practice of applied researchers. Articles published in this journal have addressed the issue of statistical inference within the context of an analysis of variance (ANOVA) framework, describing procedures for comparing means, among other issues. In particular, 1 article (Jaccard & Gui...
Seven test statistics known to be robust to the combined effects of nonnormality and variance heterogeneity were compared for their sensitivity to detect treatment effects in a one-way completely randomized design containing four groups. The six Welch-James-type heteroscedastic tests adopted either symmetric or asymmetric trimmed means, were transf...
The sample mean can have poor efficiency relative to various alternative estimators under arbitrarily small departures from normality. In the multivariate case, (affine equivariant) estimators have been proposed for dealing with this problem, but a comparison of various estimators by Massé and Plante (2003) indicated that the small-sample efficienc...
In a longitudinal two-group randomized trials design, also referred to as randomized parallel-groups design or split-plot repeated measures design, the important hypothesis of interest is whether there are differential rates of change over time, that is, whether there is a group by time interaction. Several analytic methods have been presented in t...
One approach to the analysis of repeated measures data allows researchers to model the covariance structure of their data rather than presume a certain structure, as is the case with conventional univariate and multivariate test statistics. This mixed-model approach, available through SAS PROC MIXED, was compared to a Welch-James type statistic. Th...
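SAS PROC MIXED itself is not reproduced here; as a rough open-source analogue (an assumption, not an equivalence), statsmodels' MixedLM fits a random-intercept model to long-format repeated measures data with a group-by-time term.

```python
# Hedged sketch: random-intercept mixed model for simulated groups-by-time data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
n_subj, n_time = 40, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_time),
    "time": np.tile(np.arange(n_time), n_subj),
    "group": np.repeat(rng.integers(0, 2, n_subj), n_time),
})
subj_eff = np.repeat(rng.normal(0, 1, n_subj), n_time)          # subject random effect
df["y"] = 0.3 * df["time"] * df["group"] + subj_eff + rng.normal(0, 1, len(df))

model = smf.mixedlm("y ~ time * group", df, groups=df["subject"])
print(model.fit().summary())
```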
Health evaluation research often employs multivariate designs in which data on several outcome variables are obtained for independent groups of subjects. This article examines statistical procedures for testing hypotheses of multivariate mean equality in two-group designs. The conventional test for multivariate means, Hotelling's T2, rests on certa...
This article considers the problem of comparing two independent groups in terms of some measure of location. It is well known that with Student's two-independent-sample t test, the actual level of significance can be well above or below the nominal level, confidence intervals can have inaccurate probability coverage, and power can be low relative t...
A simulation study was carried out to compare the Type I error and power of S1, a statistic recommended by Babu et al. (1999) for testing the equality of location parameters for skewed distributions. Othman et al. (in press) showed that this statistic is robust to the underlying populations and is also powerful. In our work, we modified this...
Monte Carlo methods were used to examine Type I error and power rates of 2 versions (conventional and robust) of the paired and independent-samples t tests under nonnormality. The conventional (robust) versions employed least squares means and variances (trimmed means and Winsorized variances) to test for differences between groups.
The Welch-James (WJ) and the Huynh Improved General Approximation (IGA) tests for interaction were examined with respect to Type I error in a between- by within-subjects repeated measures design when data were non-normal, non-spherical and heterogeneous, particularly when group sizes were unequal. The tests were computed with aligned ranks and comp...
The approximate degrees of freedom Welch-James (WJ) and Brown-Forsythe (BF) procedures for testing within-subjects effects in multivariate groups by trials repeated measures designs were investigated under departures from covariance homogeneity and normality. Empirical Type I error and power rates were obtained for least-squares estimators and robu...
Various statistical methods, developed after 1970, offer the opportunity to substantially improve upon the power and accuracy of the conventional t test and analysis of variance methods for a wide range of commonly occurring situations. The authors briefly review some of the more fundamental problems with conventional methods based on means; provid...
This article defines an approximate confidence interval for effect size in correlated (repeated measures) groups designs. The authors found that their method was much more accurate than the interval presented and acknowledged to be approximate by Bird. That is, the coverage probability over all the conditions investigated was very close to the theo...
Researchers in the behavioral sciences are often interested in comparing the means of several treatment conditions on a specific dependent measure. When scores on the dependent measure are not normally distributed, researchers must make important decisions regarding the multiple comparison strategy that is implemented. Although researchers commonly...
Standard least squares analysis of variance methods suffer from poor power under arbitrarily small departures from normality and fail to control the probability of a Type I error when standard assumptions are violated. These problems are vastly reduced when using a robust measure of location; incorporating bootstrap methods can result in additional...
Wilcox, Keselman, Muska and Cribbie (2000) found a method for comparing the trimmed means of dependent groups that performed well in simulations, in terms of Type I errors, with a sample size as small as 21. Theory and simulations indicate that little power is lost under normality when using trimmed means rather than untrimmed means, and trimmed me...
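A rough percentile-bootstrap sketch in the spirit of comparing trimmed means of dependent groups (this is not the Wilcox, Keselman, Muska and Cribbie algorithm): subjects are resampled to preserve the pairing, and the 20% trimmed mean difference is bootstrapped. The data and the helper name boot_ci_trimmed_diff are illustrative.

```python
# Hedged sketch: percentile bootstrap CI for a paired trimmed-mean difference.
import numpy as np
from scipy.stats import trim_mean

def boot_ci_trimmed_diff(x, y, trim=0.2, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                  # resample subjects, keeping pairs intact
        diffs[b] = trim_mean(x[idx], trim) - trim_mean(y[idx], trim)
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

rng = np.random.default_rng(6)
pre = rng.normal(10, 2, 25)
post = pre + rng.normal(0.8, 1, 25)                  # correlated follow-up scores
print(boot_ci_trimmed_diff(post, pre))
```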
Researchers in the behavioural sciences have been presented with a host of pairwise multiple comparison procedures that attempt to obtain an optimal combination of Type I error control, power, and ease of application. However, these procedures share one important limitation: intransitive decisions. Moreover, they can be characterized as a piecemeal...
We investigated bias, sampling variability, Type I error and power of nine approaches for testing the group by time interaction in a repeated measures design under three types of missing data mechanisms. One procedure due to Overall, Ahn, Shivakumar, and Kalburgi (1999) performed reasonably well over a range of conditions.
Researchers can adopt different measures of central tendency and test statistics to examine the effect of a treatment variable across groups (e.g., means, trimmed means, M-estimators, and medians). Recently developed statistics are compared with respect to their ability to control Type I errors when data were nonnormal, heterogeneous, and the design w...
When data are nonnormal in form classical procedures for assessing treatment group equality are prone to distortions in rates of Type I error and power to detect effects. Replacing the usual means with trimmed means reduces rates of Type I error and increases sensitivity to detect effects. If data are skewed, say to the right, then it has been post...
Consider the problem of performing all pair-wise comparisons among J dependent groups based on measures of location associated with the marginal distributions. It is well known that the standard error of the sample mean can be large relative to other estimators when outliers are common. Two general strategies for addressing this problem are to trim...
Methods for analyzing repeated measures data, in addition to the conventional and corrected degrees of freedom univariate and multivariate solutions, are presented in this review. These "newer" methods offer researchers either improved control over Type I errors and/or greater power to detect treatment effects when (a) certain assumptions are viola...
When many tests of significance are examined in a research investigation with procedures that limit the probability of making at least one Type I error--the so-called familywise techniques of control--the likelihood of detecting effects can be very low. That is, when familywise error controlling methods are adopted to assess statistical significanc...
We compared three tests for mean equality: the Welch (1938) heteroscedastic statistic, the Zhou et al. (1997) test, derived to be used with skewed lognormal data, and Yuen's (1974) procedure which uses robust estimators of central tendency and variability with the Welch test in order to combat the combined effects of nonnormality and variance heter...
Numerous authors suggest that the data gathered by investigators are not normal in shape. Accordingly, methods for assessing pairwise multiple comparisons of means with traditional statistics will frequently result in biased rates of Type I error and depressed power to detect effects. One solution is to obtain a critical value to assess statistical...
Given a random sample from each of two independent groups, this article takes up the problem of estimating power, as well as a power curve, when comparing 20% trimmed means with a percentile bootstrap method. Many methods were considered, but only one was found to be satisfactory in terms of obtaining both a point estimate of power as well as a (on...
One approach to the analysis of repeated measures data allows researchers to model the covariance structure of the data rather than presume a certain structure, as is the case with conventional univariate and multivariate test statistics. This mixed-model approach was evaluated for testing all possible pairwise differences among repeated measures m...
Consider two independent groups with K measures for each subject. For the jth group and kth measure, let μtjk be the population trimmed mean, j = 1, 2; k = 1, ..., K. This article compares several methods for testing H₀: μt1k = μt2k such that the probability of at least one Type I error is α, and simultaneous probability coverage is 1 − α when comput...
Repeated measures ANOVA can refer to many different types of analysis. Specifically, this vague term can refer to conventional tests of significance, one of three univariate solutions with adjusted degrees of freedom, two different types of multivariate statistic, or approaches that combine univariate and multivariate tests. Accordingly, it is argu...
Non-normality and covariance heterogeneity between groups affect the validity of the traditional repeated measures methods of analysis, particularly when group sizes are unequal. A non-pooled Welch-type statistic (WJ) and the Huynh Improved General Approximation (IGA) test generally have been found to be effective in controlling rates of Type I err...
The Welch-James and Improved General Approximation tests were examined in between-subjects x within-subjects repeated measures designs for their rates of Type I error when data were nonnormal, nonspherical, and heterogeneous and when group sizes were unequal as well. The tests were computed with either least squares or robust estimators of central...
In a previous paper, Boik presented an empirical Bayes (EB) approach to the analysis of repeated measurements. The EB approach is a blend of the conventional univariate and multivariate approaches. Specifically, in the EB approach, the underlying covariance matrix is estimated by a weighted sum of the univariate and multivariate estimators. In addi...
This paper considers the common problem of testing the equality of means in a repeated measures design. Recent results indicate that practical problems can arise when computing confidence intervals for all pairwise differences of the means in conjunction with the Bonferroni inequality. This suggests, and is confirmed here, that a problem might occu...
The squared cross-validity coefficient is a measure of the predictive validity of a sample linear prediction equation. It provides a more realistic assessment of the usefulness of the equation than the squared multiple-correlation coefficient. The squared cross-validity coefficient cannot be larger than the squared multiple-correlation coefficient;...
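An illustrative simulation, not the analytic results referenced above: the in-sample squared multiple correlation of a fitted prediction equation is compared with its squared cross-validity coefficient estimated on a large independent sample, showing the typical shrinkage. All data and parameter values are invented.

```python
# Hedged sketch: in-sample R^2 versus out-of-sample (cross-validity) R^2 for one equation.
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 8
beta = np.zeros(p)
beta[:2] = 0.5                          # only two predictors carry signal

def make_data(m):
    X = rng.normal(size=(m, p))
    y = X @ beta + rng.normal(size=m)
    return X, y

X_tr, y_tr = make_data(n)               # calibration sample
X_te, y_te = make_data(5000)            # large independent sample

design = np.c_[np.ones(n), X_tr]
coef, *_ = np.linalg.lstsq(design, y_tr, rcond=None)

r2_sample = np.corrcoef(design @ coef, y_tr)[0, 1] ** 2
r2_cross = np.corrcoef(np.c_[np.ones(5000), X_te] @ coef, y_te)[0, 1] ** 2
print(f"in-sample R^2 = {r2_sample:.3f}, cross-validity R^2 = {r2_cross:.3f}")
```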
Tests for mean equality proposed by Weerahandi (1995) and Chen and Chen (1998), tests that do not require equality of population variances, were examined when data were not only heterogeneous but, as well, nonnormal in unbalanced completely randomized designs. Furthermore, these tests were compared to a test examined by Lix and Keselman (1998), a t...
I. Olkin and J. D. Finn (1995) presented 2 methods for comparing squared multiple correlation coefficients for 2 independent samples. In 1 method, the researcher constructs a confidence interval for the difference between 2 population squared coefficients; in the 2nd method, a Fisher-type transformation of the sample squared correlation coefficient...
When undertaking many tests of significance, researchers are faced with the problem of how best to control the probability of committing a Type I error. The familywise approach deals directly with multiplicity problems by setting a level of significance for an entire set of related hypotheses; the comparison approach ignores the issue by setting th...
Mehrotra (1997) presented an 'improved' Brown and Forsythe (1974) statistic which is designed to provide a valid test of mean equality in independent groups designs when variances are heterogeneous. In particular, the usual Brown and Forsythe procedure was modified by using a Satterthwaite approximation for numerator degrees of freedom instead of t...
Articles published in several prominent educational journals were examined to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the...
In 1987, Jennings enumerated data analysis procedures that authors must follow for analyzing effects in repeated measures designs when submitting papers to Psychophysiology. These prescriptions were intended to counteract the effects of nonspherical data, a condition known to produce biased tests of significance. Since this editorial policy was esta...
Power for the improved general approximation (IGA) and Welch-James (WJ) tests of the within-subjects (trials) main effect and the within-subjects x between-subjects (groups x trials) interaction was estimated for a design with one between- and one within-subjects factor. The distribution of the data had two levels: multivariate normal and multivari...
Tests of mean equality proposed by Alexander and Govern, Box, Brown and Forsythe, James, and Welch, as well as the analysis of variance F test, were compared for their ability to limit the number of Type I errors and to detect true treatment group differences in one-way, completely randomized designs in which the underlying distributions were nonno...