
The analysis of repeated measures designs:

A review

H. J. Keselman*

University of Manitoba, Canada

James Algina

University of Florida, USA

Rhonda K. Kowalchuk

University of Manitoba, Canada

Repeated measures ANOVA can refer to many different types of analysis. Specifically, this vague term can refer to conventional tests of significance, one of three univariate solutions with adjusted degrees of freedom, two different types of multivariate statistic, or approaches that combine univariate and multivariate tests. Accordingly, it is argued that, by only reporting probability values and referring to statistical analyses as repeated measures ANOVA, authors convey neither the type of analysis that was used nor the validity of the reported probability value, since each of these approaches has its own strengths and weaknesses. The various approaches are presented with a discussion of their strengths and weaknesses, and recommendations are made regarding the 'best' choice of analysis. Additional topics discussed include analyses for missing data and tests of linear contrasts.

1. Introduction

Papers intended to bring to the attention of applied researchers the latest developments in data

analysis strategies, which are generally introduced in statistical journals, are not uncommon

in the psychological literature (see, for example, Algina & Coombs, 1996; Keselman &

Keselman, 1993; Keselman, Rogan & Games, 1981; McCall & Appelbaum, 1973). Since, as

McCall and Appelbaum note, repeated measures (RM) designs are one of the most common

research paradigms in psychology, it is not surprising that articles of this nature pertaining to

the analysis of repeated measurements have appeared periodically in our literature; for

example, McCall & Appelbaum, Hertzog & Rovine (1985), Keselman & Keselman (1988),

and Keselman & Algina (1996) have provided updates on analysis strategies for RM designs.

Because new analysis strategies for the analysis of repeated measurements have recently appeared in the quantitatively oriented literature, we thought it timely to once again provide an update for psychological researchers.

British Journal of Mathematical and Statistical Psychology (2001), 54, 1–20
© 2001 The British Psychological Society
Printed in Great Britain

* Requests for reprints should be addressed to Professor H. J. Keselman, Department of Psychology, University of Manitoba, 190 Dysart Road, Winnipeg, Manitoba, Canada R3T 2N2 (e-mail: kesel@ms.umanitoba.ca).

In addition to introducing procedures that have appeared in the last five to ten years, we

present a brief review of procedures that are not so new, since recent evidence suggests that

even these are not commonly adopted by behavioural science researchers (see Keselman

et al., 1998). It is important to review these procedures since they are better (i.e., they generally provide better control of the probability of a Type I error) than the conventional univariate method

of analysis and, moreover, because they provide an important theoretical link to the most

recent approaches to the analysis of repeated measurements.

RM designs and analysis of variance (ANOVA) statistics are often used by behavioural

science researchers to assess treatment effects (Keselman et al., 1998). However, ANOVA

statistics are, according to results reported in the literature, sensitive to violations of the

derivational assumptions on which they are based, particularly when the design is unbalanced

(i.e., group sizes are unequal) (Collier, Baker, Mandeville & Hays, 1967; Keselman &

Keselman, 1993; Keselman, Keselman & Lix, 1995; Rogan, Keselman & Mendoza, 1979).

Specifically, the conventional univariate method of analysis assumes that the data have been

obtained from populations that have the well-known normal (multivariate) form, that the

degree of variability (covariance) among the levels of the RM variable conforms to a

spherical pattern, and that the data conform to independence assumptions. Since the data

obtained in many areas of psychological inquiry are not likely to conform to these

requirements and are frequently unbalanced (see Keselman et al., 1998), researchers using

the conventional procedure will erroneously claim treatment effects when none are present,

thus filling their literatures with false positive claims.

However, many other ANOVA-type statistics are available for the analysis of RM designs

which under many conditions will be insensitive (i.e., robust) to violations of the assumptions

associated with the conventional tests or do not depend on the conventional covariance

assumption (i.e., multisample sphericity). These ANOVA-type procedures include univariate tests with adjusted degrees of freedom (df), multivariate test statistics, statistics

that do not depend on the conventional assumptions of multisample sphericity, and hybrid

types of analyses that involve a combining of the univariate and multivariate approaches.

Another fly in this ointment relates to the vagueness associated with the descriptors

typically used by behavioural science researchers to describe the statistical tests employed in

the analysis of treatment effects in RM designs and the use of the associated probability value

(p) to convey success or failure of the treatment. That is, describing the analysis as 'repeated measures ANOVA' does not tell the reader which repeated measures ANOVA technique was

used to test for treatment effects. In addition, just reporting a p-value does not give enough

information (e.g., df of the statistic) for the reader to determine what type of RM analysis was

used, and thus calls into question the legitimacy of the authors' claims regarding the likelihood that the result was due to the manipulated variable rather than to improper use of the test (i.e., use when the assumptions of the test have not been met). Thus, the aim

of this paper is to describe briefly how RM designs are typically analysed by researchers, and

to survey the strengths and weaknesses of other ANOVA-type tests for assessing treatment

effects in RM designs and thus comment on the validity of the associated p-values.

The reader should note that although we typically present the test statistic for the various

approaches, they need not be examined with an eye for obtaining a numerical solution;

numerical results can be obtained with specified software.


2. Older data analysis approaches

2.1. Conventional univariate tests of significance

The simplest of the between- by within-subjects RM designs involves a single between-subjects grouping factor and a single within-subjects RM factor, in which subjects (i = 1, ..., n_j; Σ_j n_j = N) are selected randomly for each level of the between-subjects factor (j = 1, ..., J) and observed and measured under all levels of the within-subjects factor (k = 1, ..., K). In this design, the RM data are modelled by assuming that the observational vectors Y_ij = (Y_ij1 Y_ij2 ... Y_ijK)′ are normal, independent and identically distributed within each level j, with common mean vector μ_j and covariance matrix Σ_j.

Tests of the within-subjects main and interaction effects traditionally have been accomplished by the respective use of the conventional univariate F statistics,

    F_K = MS_K / MS_{K×S/J} ~ F[α; (K−1), (N−J)(K−1)]    (1)

and

    F_{J×K} = MS_{J×K} / MS_{K×S/J} ~ F[α; (J−1)(K−1), (N−J)(K−1)],    (2)

where ~ is to be read as 'is distributed as'. The validity of these tests rests on the assumptions of normality, independence of errors, and homogeneity of the treatment-difference variances, i.e., sphericity (Huynh & Feldt, 1970; Rogan et al., 1979; Rouanet & Lépine, 1970).

Specifically, sphericity is satisfied if and only if C′ΣC = λI_(K−1), where C is a normalized (i.e., unit-length) matrix of K − 1 orthogonal contrasts among the K repeated measurements, Σ is the population covariance matrix, λ is a positive scalar, and I is an identity matrix of order K − 1.¹ As the diagonal and off-diagonal elements of C′ΣC equal the variances and covariances of the K − 1 orthogonal contrasts, the sphericity assumption is satisfied if and only if the K − 1 contrasts are independent and equally variable. Further, the presence of a between-subjects grouping factor requires that the data meet an additional assumption, namely, that the covariance matrices of these treatment differences are the same for all levels of this grouping factor. Jointly, these two assumptions have been referred to as multisample sphericity (Huynh, 1978; Mendoza, 1980; for another description of multisample sphericity, see Hertzog & Rovine, 1985, pp. 792–793). The F tests of simple RM designs containing only within-subjects variables, that is, with no between-subjects grouping variables, also depend on the sphericity assumption; however, they do not require multisample sphericity.
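As a numerical illustration of this condition (ours, not part of the paper), one can form a normalized set of orthogonal contrasts and inspect C′ΣC directly. The sketch below uses a normalized Helmert set of contrasts; a compound symmetric matrix satisfies sphericity, while an autoregressive one generally does not:

```python
import numpy as np

def helmert_contrasts(k):
    """Rows are k-1 orthonormal contrasts among k repeated measurements."""
    c = np.zeros((k - 1, k))
    for i in range(1, k):
        c[i - 1, :i] = 1.0
        c[i - 1, i] = -float(i)
        c[i - 1] /= np.sqrt(i * (i + 1))   # unit-length rows, mutually orthogonal
    return c

def is_spherical(sigma, tol=1e-8):
    """True if C' Sigma C = lambda * I for an orthonormal contrast matrix C."""
    k = sigma.shape[0]
    c = helmert_contrasts(k)
    m = c @ sigma @ c.T                    # (k-1) x (k-1) contrast covariance
    lam = np.trace(m) / (k - 1)            # candidate scalar lambda
    return np.allclose(m, lam * np.eye(k - 1), atol=tol)

# Compound symmetry (equal variances, equal covariances) satisfies sphericity ...
cs = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
# ... whereas an AR(1)-type structure does not.
ar1 = np.array([[0.9 ** abs(i - j) for j in range(4)] for i in range(4)])
print(is_spherical(cs), is_spherical(ar1))   # True False
```

Any orthonormal set of contrasts spanning the same space leads to the same conclusion, since sphericity does not depend on the particular choice of C.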

When the assumptions for the conventional tests have been satisfied they provide a valid test of their respective null hypotheses and are uniformly most powerful for detecting any treatment effects that are present. These traditional tests are easily obtained using the major statistical packages, such as SAS (SAS Institute, 1999) and SPSS (Norušis, 1993). Thus, when assumptions are known to be satisfied, psychological researchers can adopt the conventional procedures and report the associated p-values, since under these conditions these values are an accurate reflection of the probability of rejecting the null hypothesis by chance when the null hypothesis is true.

However, McCall & Appelbaum (1973) provide a very good illustration as to why in many areas of psychology (e.g., developmental, learning), the covariances between the levels of the RM variable will not conform to the required covariance pattern for a valid univariate F test. They use an example from developmental psychology to illustrate this point. Specifically, adjacent-age assessments typically correlate (i.e., covary) more highly than developmentally distant assessments (e.g., 'IQ at age 3 correlates .83 with IQ at age 4 but .46 with IQ at age 12'); this type of correlational (i.e., covariance) structure does not correspond to a spherical covariance structure. That is, for many psychological paradigms successive or adjacent measurement occasions are more highly correlated than non-adjacent measurement occasions, with the correlation between these measurements decreasing the farther apart the measurements are in the series (Danford, Hughes & McNee, 1960; Winer, 1971). Indeed, as McCall and Appelbaum (1973) note: 'Most longitudinal studies using age or time as a factor cannot meet these assumptions.' McCall and Appelbaum also indicate that the covariance pattern found in learning experiments is not likely to conform to a spherical pattern. As they note, 'experiments in which change in some behaviour over short periods of time is compared under several different treatments often cannot meet covariance requirements' (p. 403).

¹ See Rogan et al. (1979) for an example that shows the form of a contrast matrix (C) and the computation of C′ΣC = λI.

The result of applying the conventional tests of significance to data that do not conform to the assumptions of multisample sphericity will be that too many null hypotheses will be falsely rejected (Box, 1954; Collier et al., 1967; Imhof, 1962; Kogan, 1948; Stoloff, 1970). Furthermore, as the degree of non-sphericity increases, the conventional repeated measures F tests become increasingly inflated (Noe, 1976; Rogan et al., 1979). For example, the results reported by Collier et al. and Rogan et al. indicate that Type I error rates can approach 10% for both the test of the RM main and interaction effects when sphericity does not hold. Thus, p-values are not accurate reflections of the observed statistics occurring by chance under their null hypotheses. Rather, they indicate the probability of the statistics arising under some other distribution, a distribution characterized by a sphericity parameter that is not presumed by the conventional tests of significance. Hence, using these p-values to ascertain whether the treatment has been successful or not will give a biased picture of the nature of the treatment.

2.2. The multivariate approach

The multivariate test of the RM main effect in a simple (no between-subjects factors) or between- by within-subjects design is performed by creating K − 1 difference variables. The null hypothesis that is tested, using Hotelling's (1931) T² statistic, is that the vector of population means of these K − 1 difference variables equals the null vector (see McCall & Appelbaum, 1973, for a fuller discussion and numerical example). The upper 100(1 − α) percentage points of the T² distribution can be obtained from the relationship

    F = [(N − J − K + 2) / ((N − J)(K − 1))] T² ~ F[α; K − 1, N − J − K + 2].    (3)
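To make the computations concrete for the simplest case, a single-group design (so J = 1 in equation (3)), the following sketch (ours; variable names are not from the paper) builds the K − 1 difference variables, computes T², and converts it to F:

```python
import numpy as np

def hotelling_rm_main(y):
    """Multivariate test of the RM main effect for a single-group design.

    y: (n, k) matrix of repeated measurements. Forms k-1 successive
    difference variables and tests that their mean vector is zero.
    Returns (T2, F, df1, df2); with J = 1, equation (3) gives
    F = (n - k + 1) / ((n - 1)(k - 1)) * T2.
    """
    n, k = y.shape
    d = y[:, :-1] - y[:, 1:]                 # n x (k-1) difference variables
    dbar = d.mean(axis=0)
    s = np.atleast_2d(np.cov(d, rowvar=False))   # covariance of differences
    t2 = n * dbar @ np.linalg.solve(s, dbar)
    f = (n - k + 1) / ((n - 1) * (k - 1)) * t2
    return t2, f, k - 1, n - k + 1

# With k = 2 there is a single difference variable, and T2 reduces to the
# square of the one-sample t statistic computed on the differences.
rng = np.random.default_rng(0)
y = rng.normal(size=(12, 2))
t2, f, df1, df2 = hotelling_rm_main(y)
d = y[:, 0] - y[:, 1]
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
print(np.isclose(t2, t ** 2))   # True
```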

The multivariate test of the within-subjects interaction effect, on the other hand, is a test of whether the population means of the K − 1 difference variables are equal across the levels of the grouping variable. A test of this hypothesis can be obtained by conducting a one-way multivariate ANOVA, where the K − 1 difference variables are the dependent variables and the grouping variable (J) is the between-subjects independent variable. When J > 2, four popular multivariate criteria are: (1) Wilks's (1932) likelihood ratio; (2) the Pillai–Bartlett trace statistic (Pillai, 1955; Bartlett, 1939); (3) Roy's (1953) largest root

criterion; and (4) the Hotelling–Lawley trace criterion (Hotelling, 1951; Lawley, 1938). When J = 2, all criteria are equivalent to Hotelling's T² statistic.

Valid multivariate tests of the RM hypotheses in between- by within-subjects designs, unlike the univariate tests, depend not on the sphericity assumption but only on the equality of the covariance matrices at all levels of the grouping factor, as well as normality and independence of observations across subjects. Simple designs, however, in addition to normality and independence assumptions, only require that the covariance matrix be positive definite. Multivariate tests of RM design hypotheses are easily obtained from the general linear model program associated with each of the two major statistical packages mentioned earlier.

The empirical results indicate that the multivariate test of the RM main effect is generally robust to assumption violations when the design is balanced (or contains no grouping factors) and not robust when the design is unbalanced (Algina & Oshima, 1994; Keselman et al., 1995; Keselman, Algina, Kowalchuk & Wolfinger, 1999a, 1999b). The interaction test is not necessarily robust even when the group sizes are equal (Olson, 1974). In particular, as was the case with the univariate tests, the multivariate tests are conservative or liberal depending on whether the covariance matrices and group sizes are positively or negatively paired. When positively paired, main as well as interaction effect rates of Type I error can be less than 1%, while for negative pairings rates in excess of 20% have been reported (see Keselman et al., 1995).

2.3. Univariate tests with adjusted degrees of freedom

When the covariance matrices for the orthonormal variables are equal but the common covariance matrix is not spherical, or when the design is balanced (group sizes are equal), the Greenhouse & Geisser (1959) and Huynh & Feldt (1976) adjusted-df univariate tests are robust alternatives to the conventional tests (see also Quintana & Maxwell, 1994, for other adjusted-df tests).

The Greenhouse and Geisser (GG) ε̂-approximate F test is an approximate-df procedure which refers values of F to an adjusted critical value by modifying the usual numerator and denominator df according to a sample estimate (ε̂) of the unknown sphericity parameter ε. That is,

    F_K ≈ F[α; (K−1)ε̂, (N−J)(K−1)ε̂]    (4)

and

    F_{J×K} ≈ F[α; (J−1)(K−1)ε̂, (N−J)(K−1)ε̂],    (5)

where ≈ is to be read as 'is approximately distributed as' and

    ε̂ = [tr(C′SC)]² / {(K−1) tr[(C′SC)²]},    (6)

in which S is the pooled sample covariance matrix which estimates Σ and 'tr' is the trace operator.

The Huynh and Feldt (HF) ε̃-approximate F test (see also Lecoutre, 1991), like the Greenhouse & Geisser (1959) adjustment, refers values of F to the sampling distribution of F based on another sample estimate of ε, one which is intended to be more accurate when ε ≥ 0.75. According to the HF approximation, values of F are referred to

    F_K ≈ F[α; (K−1)ε̃, (N−J)(K−1)ε̃]    (7)

and

    F_{J×K} ≈ F[α; (J−1)(K−1)ε̃, (N−J)(K−1)ε̃],    (8)

where

    ε̃ = [(N−J+1)(K−1)ε̂ − 2] / {(K−1)[N−J − (K−1)ε̂]}.    (9)
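Equations (6) and (9) are easily computed directly. The sketch below (ours) uses a normalized Helmert contrast matrix for C; for a spherical (here compound symmetric) covariance matrix ε̂ equals 1 exactly, while ε̃ can exceed 1 and is truncated to 1 in practice:

```python
import numpy as np

def helmert_contrasts(k):
    """Rows are k-1 orthonormal contrasts among k repeated measurements."""
    c = np.zeros((k - 1, k))
    for i in range(1, k):
        c[i - 1, :i] = 1.0
        c[i - 1, i] = -float(i)
        c[i - 1] /= np.sqrt(i * (i + 1))
    return c

def epsilon_hat(s):
    """Greenhouse-Geisser estimate, equation (6)."""
    k = s.shape[0]
    c = helmert_contrasts(k)
    m = c @ s @ c.T
    return np.trace(m) ** 2 / ((k - 1) * np.trace(m @ m))

def epsilon_tilde(eps_hat, n, j, k):
    """Huynh-Feldt estimate, equation (9), built from the GG estimate."""
    return ((n - j + 1) * (k - 1) * eps_hat - 2) / (
        (k - 1) * (n - j - (k - 1) * eps_hat))

# A compound symmetric matrix is spherical, so the GG estimate is 1; its
# lower bound under extreme non-sphericity is 1 / (k - 1).
cs = 0.4 * np.ones((4, 4)) + 0.6 * np.eye(4)
print(round(epsilon_hat(cs), 6), round(epsilon_tilde(epsilon_hat(cs), 30, 3, 4), 3))
# -> 1.0 1.139  (the HF value above 1 would be truncated to 1)
```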

The empirical literature indicates that the GG and HF adjusted-df tests are robust to violations of multisample sphericity as long as group sizes are equal (see Rogan et al., 1979). The p-values associated with these adjusted statistics will provide an accurate reflection of the probability of obtaining them by chance under the null hypotheses of no treatment effects. Moreover, SAS (1999) and SPSS (Norušis, 1993) provide GG and HF adjusted p-values. However, the GG and HF adjusted-df tests are not robust when the design is unbalanced (Algina & Oshima, 1994, 1995; Keselman et al., 1995; Keselman & Keselman, 1990; Keselman, Lix & Keselman, 1996). Specifically, the tests are conservative (liberal) when group sizes and covariance matrices are positively (negatively) paired with one another. A positive (negative) pairing refers to the case in which the smallest n_j is associated with the covariance matrix with the smallest (largest) element values. For example, the rates when depressed can be lower than 1% and when inflated higher than 11% (see Keselman et al., 1999b).

2.4. The combined approach

Due to the absence of a clear advantage in adopting either an adjusted univariate or multivariate approach, a number of authors have recommended that these procedures be used in combination (Barcikowski & Robey, 1984; Looney & Stanley, 1989). In order to maintain the overall rate of Type I error at α for a test of an RM effect, these authors suggested assessing each of the two tests using an α/2 critical value. In this strategy, rejection of an RM effect null hypothesis occurs if either test is found to be statistically significant (see Barcikowski & Robey, 1984, p. 150; Looney & Stanley, 1989, p. 221). Not surprisingly, this approach to the analysis of repeated measurements results in depressed or inflated rates of Type I error when multisample sphericity is not satisfied and the design is unbalanced (see Keselman et al., 1995).
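The decision rule itself is a simple Bonferroni split; as a sketch (function name ours):

```python
def combined_decision(p_adjusted_univariate, p_multivariate, alpha=0.05):
    """Barcikowski-Robey / Looney-Stanley combined test: each component test
    is assessed at alpha / 2, and the RM effect null hypothesis is rejected
    if either component is significant."""
    return p_adjusted_univariate < alpha / 2 or p_multivariate < alpha / 2

print(combined_decision(0.030, 0.060))   # False: neither p-value is below 0.025
print(combined_decision(0.010, 0.200))   # True
```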

3. Underused and new data analysis approaches

In addition to the Greenhouse & Geisser (1959) and Huynh & Feldt (1976) adjusted-df tests,

other adjusted-df tests are available for obtaining a valid test. The test to be introduced now

not only corrects for non-sphericity, but also adjusts for heterogeneity of the orthonormalized

covariance matrices.

3.1. The Huynh (1978) approximate F tests

Huynh (1978) developed a test of the within-subjects main and interaction hypotheses, the improved general approximation (IGA) test, that is designed to be used when multisample sphericity is violated. The IGA tests of the within-subjects main and interaction hypotheses are the usual statistics, F_K and F_{J×K}, respectively, with corresponding critical values of bF[α; h′, h] and cF[α; h″, h]. The parameters of the critical values are defined in terms of the group covariance matrices and group sample sizes. Estimates of the parameters (c, b, h, h′ and h″) and the correction due to Lecoutre (1991) are presented in Algina (1994) and Keselman & Algina (1996). These parameters adjust the critical value to take into account the effect that violation of multisample sphericity has on F_K and F_{J×K}. If multisample sphericity holds,

    bF[α; h′, h] = F[α; (K−1), (N−J)(K−1)]

and

    cF[α; h″, h] = F[α; (J−1)(K−1), (N−J)(K−1)].

An SAS/IML (SAS Institute, 1999) program is also available for computing this test in any RM design (see Algina, 1997).

The IGA tests have been found to be robust to violations of multisample sphericity, even for unbalanced designs where the data are not multivariate normal in form (see Keselman et al., 1999b). This result is not surprising since these tests were specifically designed to adjust for non-sphericity and heterogeneity of the between-subjects covariance matrices. Thus, the p-values associated with the IGA tests of the repeated measures effects are accurate.

3.2. Mixed model analyses

Another procedure that researchers can adopt to test RM effects can be derived from a general formulation for analysing effects in RM models. This approach to the analysis of repeated measurements is a mixed model analysis. Advocates suggest that it provides the 'best' approach to the analysis of repeated measurements since it can, among other things, handle missing data and also allows users to model the covariance structure of the data. Thus, one can use this procedure to select the most appropriate covariance structure before testing the usual RM hypotheses (e.g., F_K and F_{J×K}). The first of these advantages is typically not a pertinent issue to those involved in controlled experiments, since data in these contexts are rarely missing. The second consideration, however, could be most relevant to experimenters since modelling the correct covariance structure of the data should result in more powerful tests of the fixed-effects parameters.

The linear model underlying the mixed model approach can be written as follows:

    Y = XB + ZU + E,    (10)

where Y is a vector of response scores, X and Z are known design matrices, B is a vector of unknown fixed-effects parameters, U is a vector of unknown random effects, and E is the vector of random errors. The name for this approach to the analysis of repeated measurements stems from the fact that the model contains both unknown fixed and random effects. The model requires that U and E are normally distributed with

    E[U] = 0,  E[E] = 0

and

    Var[U] = G,  Var[E] = R,  Cov(U, E) = 0.

Thus, the variance of the response measure is given by

    V = ZGZ′ + R.    (11)

Accordingly, one can model V by specifying Z and covariance structures for G and R. Note that the usual general linear model is arrived at by letting Z = 0 and R = σ²I. The choice of estimation procedure for mixed model analysis and the formation of test statistics is described in Littell, Milliken, Stroup and Wolfinger (1996, pp. 498–502).
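Equation (11) can be made concrete with the simplest specification, a random subject intercept with independent within-subject errors, which induces a compound symmetric V (a sketch, ours; the scalar values are arbitrary):

```python
import numpy as np

k = 4          # repeated measurements per subject
tau2 = 0.5     # variance of the random subject intercept (the G part)
sigma2 = 1.0   # within-subject error variance (the R part)

z = np.ones((k, 1))          # every occasion loads on the subject intercept
g = np.array([[tau2]])       # Var(U)
r = sigma2 * np.eye(k)       # Var(E)

v = z @ g @ z.T + r          # equation (11): V = Z G Z' + R
# The result is compound symmetric: tau2 + sigma2 on the diagonal, tau2 off it.
print(v[0, 0], v[0, 1])      # 1.5 0.5
```

Richer choices of Z, G and R (e.g., autoregressive R, random slopes in Z) produce the other covariance structures discussed below.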

The mixed approach, and specifically the PROC MIXED procedure in SAS (SAS Institute, 1995, 1996), allows users to fit various covariance structures for G and R. For example, some of the covariance structures that can be fitted with PROC MIXED are: (a) compound symmetric (CS), (b) unstructured (UN), (c) spherical (HF), (d) first-order autoregressive (AR1), and (e) random coefficients (RC) (see Wolfinger, 1996, for specifications of these and other covariance structures). The HF structure, as indicated, is assumed by the conventional univariate F tests in the GLM program (SAS Institute, 1999), while the UN structure is assumed by GLM's multivariate tests of the RM effects. The AR1 and RC structures indicate that measurements that are closer in time could be more highly correlated than those farther apart in time. The program allows users even greater flexibility by allowing covariance structures with within-subjects and/or between-subjects heterogeneity to be modelled. In order to select an appropriate structure for one's data, PROC MIXED users can use either an Akaike (1974) or Schwarz (1978) information criterion (see Littell et al., 1996, pp. 101–102). Keselman et al. (1999a, 1999b) recommend adopting the optional Satterthwaite F tests rather than the default F tests when using PROC MIXED, since they are typically robust to violations of multisample sphericity in cases where the default tests are not.

3.3. A non-pooled adjusted-df multivariate test

Since the effects of testing mean equality in RM designs with heterogeneous data are similar

to the results reported for independent groups designs, one solution to the problem parallels

those found in the context of completely randomized designs. The Johansen (1980) approach,

a multivariate extension of the Welch (1951) and James (1951) procedures for completely

randomized designs, involves the computation of a statistic that does not pool across

heterogeneous sources of variation and estimates error df from sample data. (This is in

contrast to the Huynh, 1978, approach which, by use of the conventional univariate F

statistics, does pool across heterogeneous sources of variance. The Huynh approach adjusts

the critical value to take account of the pooling.)

Consider the RM design described previously, but allow Σ_j ≠ Σ_j′, j ≠ j′. Suppose under these model assumptions that we wish to test the hypothesis

    H0: Cμ = 0,    (12)

where μ = (μ′_1, ..., μ′_J)′, μ_j = (μ_j1, ..., μ_jK)′, j = 1, ..., J, and C is a full-rank contrast matrix of dimension r × JK. Then an approximate-df multivariate Welch–James type statistic (WJ), according to Johansen (1980) and Keselman, Carriere & Lix (1993), is

    T_WJ = (CȲ)′(CŜC′)⁻¹(CȲ),    (13)

where Ȳ = (Ȳ′_1, ..., Ȳ′_J)′, with E(Ȳ) = μ, and the sample covariance matrix of Ȳ is Ŝ = diag(S_1/n_1, ..., S_J/n_J), where S_j is the sample variance–covariance matrix of the jth group. T_WJ/c is distributed, approximately, as an F variable with df f_1 = r and f_2 = r(r + 2)/(3A), and c is given by r + 2A − 6A/(r + 2), with

    A = (1/2) Σ_{j=1}^{J} [tr{(ŜC′(CŜC′)⁻¹CQ_j)²} + {tr(ŜC′(CŜC′)⁻¹CQ_j)}²]/(n_j − 1).    (14)

The matrix Q_j is a block diagonal matrix of dimension JK × JK, corresponding to the jth group. The (s, t)th block of Q_j is I_{K×K} if s = t = j and is 0 otherwise. In order to obtain the main and interaction tests with the WJ procedure, let C_{K−1} be a (K−1) × K contrast matrix and let C_{J−1} be similarly defined. A test of the main effect can be obtained by letting C = 1′_J ⊗ C_{K−1}, where 1_J is the J × 1 unit vector and ⊗ denotes the Kronecker product. The contrast matrix for a test of the interaction effect is C = C_{J−1} ⊗ C_{K−1}.²
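The computations in equations (13) and (14) can be sketched directly (an illustrative implementation, ours, not the Lix & Keselman SAS/IML program); the single-group cross-check at the end verifies that T_WJ reduces to Hotelling's T² computed on the contrast scores:

```python
import numpy as np

def wj_statistic(groups, c):
    """Welch-James/Johansen approximate-df test of H0: C mu = 0.

    groups: list of (n_j, K) data matrices, one per level of the grouping
    factor.  c: (r, J*K) full-rank contrast matrix.  The group covariance
    matrices are never pooled; error df are estimated from the data.
    Returns (T_WJ, F, df1, df2), where F = T_WJ / c* is referred to F[df1, df2].
    """
    j = len(groups)
    k = groups[0].shape[1]
    ybar = np.concatenate([g.mean(axis=0) for g in groups])
    s = np.zeros((j * k, j * k))             # S-hat = diag(S_1/n_1, ..., S_J/n_J)
    for jj, g in enumerate(groups):
        s[jj * k:(jj + 1) * k, jj * k:(jj + 1) * k] = (
            np.cov(g, rowvar=False) / g.shape[0])
    w = np.linalg.inv(c @ s @ c.T)
    t_wj = (c @ ybar) @ w @ (c @ ybar)       # equation (13)
    r = c.shape[0]
    a = 0.0                                  # equation (14)
    for jj, g in enumerate(groups):
        q = np.zeros((j * k, j * k))         # Q_j: identity in the jth block
        q[jj * k:(jj + 1) * k, jj * k:(jj + 1) * k] = np.eye(k)
        m = s @ c.T @ w @ c @ q
        a += (np.trace(m @ m) + np.trace(m) ** 2) / (g.shape[0] - 1)
    a *= 0.5
    c_star = r + 2 * a - 6 * a / (r + 2)
    return t_wj, t_wj / c_star, r, r * (r + 2) / (3 * a)

# Cross-check: with a single group, T_WJ equals Hotelling's T^2 on the
# contrast scores.
rng = np.random.default_rng(1)
y = rng.normal(size=(20, 3))
ck = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])   # (K-1) x K contrasts
t_wj, f, df1, df2 = wj_statistic([y], ck)
d = y @ ck.T
t2 = 20 * d.mean(axis=0) @ np.linalg.solve(np.cov(d, rowvar=False), d.mean(axis=0))
print(np.isclose(t_wj, t2))   # True
```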

The empirical literature indicates that the WJ test is in many instances insensitive to heterogeneity of the covariance matrices and accordingly will provide valid p-values (see Algina & Keselman, 1997; Keselman et al., 1993, 1999a, 1999b). (As a multivariate statistic, WJ does not require a spherical covariance structure.) Researchers should consider using this statistic when they suspect that group covariance matrices are unequal and they have groups of unequal size. However, to obtain a robust statistic researchers must have reasonably large sample sizes. That is, according to Keselman et al. (1993), when J = 3, in order to obtain a robust test of the RM main effect hypothesis, the number of observations in the smallest of the groups (n_min) must be three to four times the number of repeated measurements minus one (K − 1), while the ratio must be five or six to one in order to obtain a robust test of the interaction effect. As J increases, smaller sample sizes will suffice for the main effect but larger sample sizes are required to control the Type I error rate for the interaction test (Algina & Keselman, 1997). Though the test statistic cannot be obtained from the major statistical packages, Lix & Keselman (1995) present a SAS/IML (SAS, 1999) program that can be used to compute the WJ test for any RM design (excluding quantitative covariates). The program requires the user to enter only the data, the number of observations per group (cell), and the coefficients of one or more contrast matrices that represent the hypothesis of interest. Lix and Keselman present illustrations of how to obtain numerical results with their SAS/IML program.

3.4. The empirical Bayes approach

Boik (1997) introduced an empirical Bayes (EB) approach to the analysis of repeated measurements. This is a hybrid approach in that it represents a melding of the adjusted-df univariate and multivariate procedures. As he notes, the varied approaches to the analysis of repeated measurements differ according to how they model the variances and covariances among the levels of the RM variable.

² For a 3 × 4 between- by within-subjects design, a main effect contrast vector among the levels of the RM variable could look like [1 −1 0 0 1 0 −1 0 1 0 0 −1]. Though this example contains simple (pairwise) contrasts (coefficients), the vector can be any set of linearly independent contrasts.

For example, as we indicated, the conventional

univariate approach assumes that there is a spherical structure among the elements of the covariance matrix, whereas the multivariate approach does not require that the covariance matrix assume any particular structure, only that it be positive definite. As we have pointed out, even though users are not typically interested in the structure of the covariance matrix, the covariance model that one adopts affects how well the fixed-effect parameters of the model (e.g., the treatment effects) are estimated. An increase in the precision of the covariance estimator translates into an increase in the sensitivity that the procedure has for detecting treatment effects. As an illustration, consider the multivariate approach to the analysis of repeated measurements. Because it does not put any restrictions on the form of the covariance matrix, it can be inefficient in that many unknown parameters must be estimated (i.e., all of the variances and all of the covariances among the levels of the RM variable), and this inefficiency may mean loss of statistical power to detect treatment effects. Thus, choosing a parsimonious model should be important to applied researchers.

The EB approach is an alternative to the univariate adjusted-df approach to the analysis of repeated measurements. The adjusted-df approach presumes that a spherical model is a reasonable approximation to the unknown covariance structure, and though departures from sphericity are expected, they would not be large enough to abandon the univariate estimator of the covariance matrix. The multivariate approach allows greater flexibility in that the elements of the covariance matrix are not required to follow any particular pattern. In the EB approach the unknown covariance matrix is estimated as a linear combination of the univariate and multivariate estimators. Boik (1997) believed that a combined estimator would be better than either one individually. In effect, Boik's (1997) approach is based on a hierarchical model in which sphericity is satisfied on average, though not necessarily satisfied on any particular experimental outcome. This form of sphericity is referred to as second-stage sphericity (Boik, 1997).
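Boik's (1997) weights follow from his hierarchical model and are not reproduced here; purely to illustrate the idea of blending the spherical and unstructured estimators, a convex combination with an arbitrary stand-in weight w might look like:

```python
import numpy as np

def blended_covariance(s, w):
    """Illustrative only: convex combination of the spherical projection of a
    sample covariance matrix S and S itself.  The weight w is an arbitrary
    stand-in, NOT Boik's empirical Bayes weight."""
    k = s.shape[0]
    lam = np.trace(s) / k                   # scalar for the spherical part
    return w * lam * np.eye(k) + (1 - w) * s

s = np.array([[2.0, 0.8], [0.8, 1.0]])
# w = 1 recovers the spherical (univariate-style) estimator, 1.5 * I here;
# w = 0 recovers the unstructured (multivariate) estimator, S itself.
print(blended_covariance(s, 1.0))
print(blended_covariance(s, 0.0))
```

In Boik's procedure the weighting is estimated from the data, so the estimator adapts between the two extremes rather than fixing w in advance.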

Boik (1997) demonstrated, through Monte Carlo methods, that the EB approach controls its Type I error rate and can be more powerful than either the adjusted-df or multivariate procedure for many non-null mean configurations. Researchers can make inferences about the RM effects by computing hypothesis and error sums of squares and cross-product matrices with Boik's formulae and obtain numerical solutions with any of the conventional multivariate statistics (see Boik, 1997, p. 162, for an illustration).

4. Discussion

The aim of this paper was to indicate that 'repeated measures ANOVA' can refer to a number of different types of analysis for RM designs. Specifically, we indicated that repeated measures ANOVA could be construed to mean the conventional tests of significance, the adjusted-df univariate test statistics, a multivariate analysis, a multivariate analysis that does not require the assumptions associated with the usual multivariate test, or a combined univariate–multivariate test. In addition, by indicating the strengths and weaknesses of each of these approaches, we intended to convey the validity, or lack thereof, that can be associated with the p-values corresponding to each of these approaches. Thus, researchers can better convey the validity of their findings by indicating the type of 'repeated measures ANOVA' that was used to assess treatment effects. We summarize the advantages/disadvantages of the various approaches, indicating as well how numerical results can be obtained, in Table 1.


In conclusion, we feel it is rarely legitimate to use the conventional tests of significance since data are not likely to conform to the very strict assumptions associated with this procedure. On the other hand, researchers should take comfort in the fact that there are many viable alternatives to the conventional tests of significance. Furthermore, we believe that we can offer simple guidelines for choosing between them, guidelines which, by and large, are based on whether group sizes are equal or not.3 That is, for simple RM designs containing no between-subjects variables, or for between- by within-subjects designs having groups of equal size, we recommend either the empirical Bayes or the mixed model approach. Boik (1997) demonstrated that his approach will typically provide more powerful tests of RM effects than uniformly adopting either an adjusted-df univariate approach or a multivariate test statistic. Furthermore, numerical results can easily be obtained with a standard multivariate program. The mixed model approach is also likely to provide more powerful tests of RM effects than the adjusted-df univariate and multivariate approaches because researchers can model the covariance structure of their data. Furthermore, for designs that contain between-subjects grouping variables, heterogeneity across the levels of the grouping variable can also be modelled. To the extent that the actual covariance structure of the data resembles the fitted structure, it is likely that the mixed model approach will provide more powerful tests than the empirical Bayes approach; however, this observation has not yet been confirmed through empirical investigation. A caveat to this recommendation is that when covariance matrices are suspected to be unequal, a safer course of action, in terms of Type I error protection, is to use an adjusted-df univariate test.4 That is, some findings suggest that the EB and mixed model approaches may result in inflated rates of Type I error when covariance matrices are unequal and sample sizes are small, even when group sizes are equal (see Keselman et al., 1999a, 1999b; Keselman, Kowalchuk & Boik, 2000; Wright & Wolfinger, 1996).

In those (fairly typical) cases where the group sizes are unequal and one does not know that the group covariance matrices are equal, researchers should use either the IGA or Welch-James tests. We feel quite comfortable in recommending the WJ and IGA tests as general analytic tools for the analysis of repeated measurements. We believe they are preferable to the conventional univariate (including adjusted-df univariate) and multivariate methods because they will typically control rates of Type I error where the conventional methods of analysis (and newer ones as well) will not. Furthermore, results indicate that the power to detect effects will not be substantially reduced when using WJ or IGA when the assumptions of the conventional procedures are satisfied (see Algina & Keselman, 1998). Thus, there is nothing to lose (with respect to power) and everything to


3 We caution readers that our recommendations are no substitute for carefully examining the characteristics of their data and basing their choice of a test statistic on this examination. There are a myriad of factors (scale of measurement, distributional shape, outliers, etc.), not considered for the sake of simplicity in formulating our recommendations, which could result in other data analysis choices (nonparametric analyses, analyses based on robust estimators rather than least-squares estimators, transformations of the data, etc.). Furthermore, the empirical literature that has been published regarding the efficacy of the new procedures reviewed in this paper is extremely limited, and future findings may accordingly result in better recommendations.

4 It is unknown to what extent covariance matrices are unequal between groups in RM designs since researchers do not report their sample covariance matrices. However, we agree with other researchers who investigate the operating characteristics of statistical procedures that the data in psychological experiments are likely to be heterogeneous (see DeShon & Alexander, 1996; Wilcox, 1987). Accordingly, the safest course of action, when group sizes are unequal, is to adopt a procedure that allows for heterogeneity. The empirical literature also indicates that one will not suffer substantial power losses by using a heterogeneous test statistic when heterogeneity does not exist (see Algina & Keselman, 1998; Keselman et al., 1999b).

Table 1. Data analysis procedures for repeated measures designs

Conventional F tests
  Requirements/considerations/issues:
  · Require, among other assumptions, the unlikely-to-be-satisfied assumption of multisample sphericity
  Empirical findings:
  · Inflated or depressed Type I error rates when data do not conform to multisample sphericity
  Obtaining numerical results:
  · The major statistical packages (e.g., SAS, SPSS) compute these tests

Multivariate F tests
  Requirements/considerations/issues:
  · Require, among other assumptions, homogeneity of the between-subjects covariance matrices
  Empirical findings:
  · Generally robust to heterogeneity of the covariance matrices when group sizes are equal
  · Not robust to covariance heterogeneity when group sizes are unequal
  · Multivariate test of the interaction effect may be non-robust to non-normality
  Obtaining numerical results:
  · The major statistical packages compute these tests

Adjusted-df univariate test statistics: Greenhouse & Geisser (1959), Huynh & Feldt (1976)
  Requirements/considerations/issues:
  · Require, among other assumptions, homogeneity of the between-subjects covariance matrices
  Empirical findings:
  · Inflated or depressed Type I error rates when covariance matrices are unequal, particularly when group sizes are unequal
  Obtaining numerical results:
  · The major statistical packages compute these tests

Combined approach (Barcikowski & Robey, 1984)
  Requirements/considerations/issues:
  · Uses both the adjusted-df univariate and multivariate tests to analyse effects, dividing the level of significance between the two tests
  Empirical findings:
  · Inflated or depressed Type I error rates when data are heterogeneous and non-normal
  Obtaining numerical results:
  · The major statistical packages can be used to compute these tests

Huynh's (1978) IGA F tests
  Requirements/considerations/issues:
  · Derived to be applicable to data that do not conform to multisample sphericity
  · Applicable to any RM design that does not contain covariates or continuous variables
  Empirical findings:
  · Robust to violations of multisample sphericity
  · Robust even when group sizes are unequal and relatively small
  · Robust to non-normality when robust estimators (i.e., trimmed means and Winsorized variances and covariances) are substituted for the least-squares estimators
  Obtaining numerical results:
  · A SAS/IML program can be obtained from Algina (1997)

Mixed model F tests
  Requirements/considerations/issues:
  · Allow the covariance structure of data to be modelled before conducting tests of the RM effects
  · Allow missing data across the levels of the RM variable
  · Allow between-subjects and/or within-subjects heterogeneity
  · Multiple comparisons of RM effects can be obtained through this procedure
  Empirical findings:
  · The default F tests are prone to distorted Type I error rates when covariance matrices are heterogeneous, group sizes are unequal and data are non-normal in form
  · The Satterthwaite optional F tests provide reasonably good protection against Type I errors
  Obtaining numerical results:
  · Results can be obtained from PROC MIXED (SAS Institute, 1999)

Welch-James adjusted-df multivariate F tests (Keselman et al., 1993)
  Requirements/considerations/issues:
  · Sample sizes must conform to prescriptions given by Keselman et al. (1993) and Algina & Keselman (1997)
  · Multiple comparisons of RM effects can be obtained
  Empirical findings:
  · Generally robust to covariance heterogeneity and non-normality if sample size requirements are met
  · Interaction test requires larger sample sizes in order to be robust to non-normality
  · Robust results can be achieved with reasonably moderate sample sizes when robust estimators are substituted for the least-squares estimators
  · Algina & Keselman (1997) found that the WJ test can have substantially more power to detect effects than the IGA approach
  Obtaining numerical results:
  · Lix & Keselman (1995) provide a SAS/IML program that can be used to obtain numerical results in any RM design not containing covariates or continuous variables
  · Keselman et al. (2001) provide a SAS/IML program that computes tests of significance with least-squares and/or robust estimators, with or without bootstrapping

Empirical Bayes approach (Boik, 1997)
  Requirements/considerations/issues:
  · A hybrid approach that combines the adjusted-df univariate and multivariate approaches to the analysis of repeated measurements
  Empirical findings:
  · EB can be more powerful than either the adjusted-df or multivariate approach (Boik, 1997)
  · Generally robust to covariance heterogeneity when group sizes are equal
  · Type I error rates can be inflated or depressed when covariance matrices are heterogeneous and group sizes are unequal
  Obtaining numerical results:
  · Hypothesis and error sums of squares and cross-product matrices can be computed with formulae provided by Boik (1997) and then input to any multivariate test statistic

gain (with respect to Type I error control) by adopting one of these two approaches to the

analysis of repeated measurements. Of the two, we generally recommend the WJ approach;

based upon power analyses, it appears that it can have substantial power advantages over

the IGA test (Algina & Keselman, 1997).

The SAS/IML program (SAS Institute, 1999) presented by Lix & Keselman (1995) can be

used to obtain numerical results. However, according to results provided by Keselman et al.

(1993) and Algina & Keselman (1998), sample sizes cannot be small. When sample sizes

are unequal and small, we recommend the IGA test.

When researchers feel that they are dealing with populations that are non-normal in form (Tukey, 1960, suggests that most populations are skewed and/or contain outliers), and thus subscribe to the position that inferences pertaining to robust parameters are more valid than inferences pertaining to the usual least-squares parameters, then either the IGA or WJ procedure, based on robust estimators, can be adopted. Results provided by Keselman, Algina, Wilcox & Kowalchuk (2000) certainly suggest that these procedures will provide valid tests of the RM main and interaction effect hypotheses (of trimmed population means) when data are non-normal, non-spherical and heterogeneous. Numerical results can be obtained with the SAS/IML program provided by Keselman et al. (2001).
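For readers unfamiliar with these robust estimators, the basic computations can be sketched as follows (a minimal illustration only; the exact standardization used within the robust WJ and IGA statistics differs, and the function names and example scores are ours):

```python
def trimmed_mean(x, prop=0.2):
    """Mean after removing the lowest and highest prop*n observations."""
    xs = sorted(x)
    g = int(prop * len(xs))            # number trimmed from each tail
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def winsorized_variance(x, prop=0.2):
    """Sample variance after replacing each tail's most extreme
    observations with the nearest retained value (Winsorizing)."""
    xs = sorted(x)
    n = len(xs)
    g = int(prop * n)
    w = [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g
    m = sum(w) / n
    return sum((v - m) ** 2 for v in w) / (n - 1)

scores = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
# The 20% trimmed mean ignores the outlier; the ordinary mean here is 22.
```

Because trimming and Winsorizing discount extreme observations, test statistics built on these estimators retain their error-rate properties under the skewed, outlier-prone distributions Tukey described.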

5. Postscripts

5.1. Missing data

We remind the reader that in some areas of psychological research data may be missing over time. Mixed model analyses can provide numerical solutions based on all of the available data, as opposed to statistical software that derives results from complete cases (e.g., PROC GLM in SAS). Alternatively, multiple imputation (Rubin, 1987; Schafer, 1997) can be used in conjunction with software that calculates results from complete cases. However, the validity of these approaches depends on the mechanism that causes the data to be missing. Rubin (1976), Little & Rubin (1987) and Wang-Clow, Lange, Laird & Ware (1995) describe three mechanisms that can cause data to be missing. Citing Diggle & Kenward (1994), Little (1995) describes a fourth. Aspects of the following presentation assume that if a subject does not contribute data on a particular occasion, he or she does not contribute data subsequently. We refer to this as dropout.

Missing completely at random (MCAR). This process assumes that missing data occur at random and that missingness does not depend on individual characteristics or treatment. Thus dropout rates do not vary across treatment levels and dropout is not predictable from any of the variables in the study. Clearly MCAR is a very strong assumption. If data are MCAR, analysis of complete cases is not biased but is inefficient because data are discarded for respondents who have been observed on at least some of the measurement occasions. The maximum likelihood analysis implemented in PROC MIXED and multiple imputation are more efficient.

Covariate-dependent dropout (CDD). Here dropping out is dependent on between-subjects and within-subjects covariates that are fixed in the study. These covariates include the treatments. An example of CDD would be if subjects dropped out of a diet intervention study because they were unwilling to adhere to the diet regimen and not because of their weight

gain or loss (Wang-Clow et al., 1995). Complete case analyses are unbiased but inefficient under CDD (Little, 1995). Correct analyses can be obtained by using the maximum likelihood analysis implemented in PROC MIXED or by using multiple imputation.

Missing at random (MAR). When data are missing at random, missingness depends on the observed values of the dependent variables and on the covariates. For example, discussing dropout, Wang-Clow et al. (1995, p. 295) report that data are MAR if 'attrition occurs at random, but with a probability that depends on an individual's previously observed response'. An example of the MAR process would be if, in a study designed to assess the effectiveness of a drug in reducing weight among obese patients, subjects dropped out because they attained their desired weight loss. When data are MAR, correct analyses can be obtained by using the maximum likelihood analysis implemented in PROC MIXED or by using multiple imputation.
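The consequence of MAR dropout for a complete-case analysis can be illustrated with a small hypothetical data set (all numbers invented): when dropout at the second occasion depends on the first-occasion score, the complete-case occasion-2 mean is biased, which is why likelihood-based or multiple-imputation analyses are preferred.

```python
# Hypothetical (occasion-1, occasion-2) scores for six subjects.
pairs = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0),
         (4.0, 5.0), (5.0, 6.0), (6.0, 7.0)]

# Occasion-2 mean if everyone had stayed in the study.
full_mean = sum(y2 for _, y2 in pairs) / len(pairs)

# MAR dropout rule: subjects with high occasion-1 scores leave the study,
# so their occasion-2 scores are never recorded.
observed = [y2 for y1, y2 in pairs if y1 < 5.0]
cc_mean = sum(observed) / len(observed)   # complete-case occasion-2 mean

# cc_mean underestimates full_mean because the dropouts had high scores.
```

Because the dropout rule depends only on the observed occasion-1 scores, the mechanism is MAR rather than non-ignorable, yet the complete-case estimate is still distorted.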

Non-ignorable missingness (NI). Non-ignorable dropout means that the missing-data mechanism is not ignorable and must be explicitly taken into account in the data analysis. Two varieties of non-ignorable dropout have been described in the literature (Little, 1995). NI outcome-based dropout means that dropping out is predictable from the unrecorded scores on the dependent variable. Wang-Clow et al. (1995, p. 294) give as an example of this mechanism a study designed to control blood pressure where patients with home blood-pressure kits decide not to return to the study based on their own home measurements. NI random-effect dropout means that dropping out is predictable from the random effects in equation (10). For example, if subjects' performances over time are modelled as linear functions of time, and if subjects who have large slopes (i.e., are changing more quickly) are more likely to drop out, then the missing-data mechanism is NI random-effect-dependent.

When the missing-data mechanism is NI, two modelling approaches may be used (Little, 1995). In selection models, the missing-data mechanism is explicitly modelled along with modelling the dependent variable. In pattern-mixture models, data are stratified by missing-data patterns. Little (1995) has advocated the use of pattern-mixture models for taking account of NI dropout. Drawing on the extensive work of Little (1993, 1994, 1995), Hedeker and Gibbons (1997) provide a recent presentation of pattern-mixture models in the context of repeated measures analysis.

5.2. Multiple comparisons

The reader should note that the mixed model, WJ and EB approaches can also be applied to tests of contrasts (see SAS Institute, 1992, Chapter 16; Lix & Keselman, 1995, 1996; Boik, 1997). Our preference is for the approach presented by Keselman, Keselman & Shaffer (1991) (in effect a WJ approach), which can now be implemented with PROC MIXED. Accordingly, we present a brief description of the Keselman et al. approach (a detailed presentation can be found in Kowalchuk & Keselman, 2000).
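As background, Hochberg's (1988) sequentially acceptive step-up Bonferroni procedure operates on the ordered p-values of a family of contrasts; a minimal sketch of the decision rule follows (the Keselman et al. statistic itself is not reproduced here, and the function name and example p-values are ours):

```python
def hochberg_rejections(pvals, alpha=0.05):
    """Hochberg's step-up procedure: order the m p-values, find the
    largest rank i with p_(i) <= alpha / (m - i + 1), and reject the
    hypotheses with the i smallest p-values. Returns the indices of
    the rejected hypotheses in the original list."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, smallest p first
    cutoff = 0
    for rank in range(m, 0, -1):                       # step up from the largest p
        if pvals[order[rank - 1]] <= alpha / (m - rank + 1):
            cutoff = rank
            break
    return sorted(order[:cutoff])

# All four contrasts rejected: the largest p-value already passes alpha/1.
r1 = hochberg_rejections([0.001, 0.04, 0.01, 0.02])
# Only the smallest p-value passes its threshold of alpha/4 here.
r2 = hochberg_rejections([0.01, 0.02, 0.03, 0.06])
```

Because the procedure steps up from the largest p-value, it rejects at least as many hypotheses as the classical Bonferroni rule at the same family-wise error level.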

Keselman et al. (1991) presented a statistic which, when combined with various multiple comparison critical values (Hochberg's (1988) sequentially acceptive step-up Bonferroni approach or Shaffer's (1986) sequentially rejective step-down Bonferroni approach), is robust to the effects of covariance heterogeneity and non-normality in unbalanced non-spherical RM designs. Their statistic, like the omnibus multivariate statistic, allows for a
