Journal of Strength and Conditioning Research, 2005, 19(1), 231–240
© 2005 National Strength & Conditioning Association

Brief Review

QUANTIFYING TEST-RETEST RELIABILITY USING THE INTRACLASS CORRELATION COEFFICIENT AND THE SEM

JOSEPH P. WEIR

Applied Physiology Laboratory, Division of Physical Therapy, Des Moines University—Osteopathic Medical Center, Des Moines, Iowa 50312.
ABSTRACT. Weir, J.P. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J. Strength Cond. Res. 19(1):231–240. 2005.—Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). In addition, the SEM, which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be one concerning the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the necessary variance components for the calculation of ICC values, are useful to determine if systematic error is present. If so, the measurement schedule should be modified (removing trials where learning and/or fatigue effects are present) to remove systematic error, and ICC equations that only consider random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference needed to be exhibited for one to be confident that a true change in performance of an individual has occurred.

KEY WORDS. reproducibility, precision, error, consistency, SEM, intraclass correlation coefficient
INTRODUCTION

Reliability refers to the consistency of a test or measurement. For a seemingly simple concept, the quantifying of reliability and interpretation of the resulting numbers are surprisingly unclear in the biomedical literature in general (49) and in the sport sciences literature in particular. Part of this stems from the fact that reliability can be assessed in a variety of different contexts. In the sport sciences, we are most often interested in simple test-retest reliability; this is what Fleiss (22) refers to as a simple reliability study. For example, one might be interested in the reliability of 1 repetition maximum (1RM) squat measures taken on the same athletes over different days. However, if one is interested in the ability of different testers to get the same results from the same subjects on skinfold measurements, one is now interested in interrater reliability. The quantifying of reliability in these different situations is not necessarily the same, and the decisions regarding how to calculate reliability in these different contexts have not been adequately addressed in the sport sciences literature. In this article, I focus on test-retest reliability (but not limited in the number of retest trials). In addition, I discuss data measured on a continuous scale.

Confusion also stems from the jargon used in the context of reliability, i.e., consistency, precision, repeatability, and agreement. Intuitively, these terms describe the same concept, but in practice some are operationalized differently. Notably, reliability and agreement are not synonymous (30, 49). Further, reliability, conceptualized as consistency, consists of both absolute consistency and relative consistency (44). Absolute consistency concerns the consistency of scores of individuals, whereas relative consistency concerns the consistency of the position or rank of individuals in the group relative to others. In the fields of education and psychology, the term reliability is operationalized as relative consistency and quantified using reliability coefficients called intraclass correlation coefficients (ICCs) (49). Issues regarding quantifying ICCs and their interpretation are discussed in the first half of this article. Absolute consistency, quantified using the SEM, is addressed in the second half of the article. In brief, the SEM is an indication of the precision of a score, and its use allows one to construct confidence intervals (CIs) for scores.

Another confusing aspect of reliability calculations is that a variety of different procedures, besides ICCs and the SEM, have been used to determine reliability. These include the Pearson r, the coefficient of variation, and the LOA (Bland-Altman plots). The Pearson product moment correlation coefficient (Pearson r) was often used in the past to quantify reliability, but the use of the Pearson r is typically discouraged for assessing test-retest reliability (7, 9, 29, 33, 44); however, this recommendation is not universal (43). The primary, although not exclusive, weakness of the Pearson r is that it cannot detect systematic error. More recently, the limits of agreement (LOA) described by Bland and Altman (10) have come into vogue in the biomedical literature (2). The LOA will not be addressed in detail herein other than to point out that the procedure was developed to examine agreement between 2 different techniques of quantifying some variable (so-called method comparison studies; e.g., one could compare testosterone concentration using 2 different bioassays), not reliability per se. The use of LOA as an index of reliability has been criticized in detail elsewhere (26, 49). In this article, the ICC and SEM will be the focus.
Unfortunately, there is considerable confusion concerning both the calculation and interpretation of the ICC. Indeed, there are 6 common versions of the ICC (and others as well), and the choice of which version to use is not intuitively obvious. Similarly, the SEM, which is intimately related to the ICC, has useful applications that are not fully appreciated by practitioners in the movement sciences. The purposes of this article are to provide information on the choice and application of the ICC and to encourage practitioners to use the SEM in the interpretation of test data.
THE ICC

Reliability Theory

For a group of measurements, the total variance (σ²_T) in the data can be thought of as being due to true score variance (σ²_t) and error variance (σ²_e). Similarly, each observed score is composed of the true score and error (44). The theoretical true score of an individual reflects the mean of an infinite number of scores from a subject, whereas error equals the difference between the true score and the observed score (21). Sources of error include errors due to biological variability, instrumentation, error by the subject, and error by the tester. If we make a ratio of the σ²_t to the σ²_T of the observed scores, where σ²_T equals σ²_t plus σ²_e, we have the following reliability coefficient:

R = σ²_t / (σ²_t + σ²_e).   (1)

The closer this ratio is to 1.0, the higher the reliability and the lower the σ²_e. Since we do not know the true score for each subject, an index of the σ²_t is used based on between-subjects variability, i.e., the variance due to how subjects differ from each other. In this context, reliability (relative consistency) is formally defined (5, 21, 49) as follows:

reliability = between-subjects variability / (between-subjects variability + error).   (2)

The reliability coefficient in Equation 2 is quantified by various ICCs. So although reliability is conceptually aligned with terms such as reproducibility, repeatability, and agreement, it is defined as above. The necessary variance estimates are derived from analysis of variance (ANOVA), where appropriate mean square values are recorded from the computer printout. Specifically, the various ICCs can be calculated from mean square values derived from a within-subjects, single-factor ANOVA (i.e., a repeated-measures ANOVA).
The ICC is a relative measure of reliability (18) in that it is a ratio of variances derived from ANOVA, is unitless, and is more conceptually akin to R² from regression (43) than to the Pearson r. The ICC can theoretically vary between 0 and 1.0, where an ICC of 0 indicates no reliability, whereas an ICC of 1.0 indicates perfect reliability. In practice, ICCs can extend beyond the range of 0 to 1.0 (30), although with actual data this is rare. The relative nature of the ICC is reflected in the fact that the magnitude of an ICC depends on the between-subjects variability (as shown in the next section). That is, if subjects differ little from each other, ICC values are small even if trial-to-trial variability is small. If subjects differ from each other a lot, ICCs can be large even if trial-to-trial variability is large. Thus, the ICC for a test is context specific (38, 51). As noted by Streiner and Norman (49), "There is literally no such thing as the reliability of a test, unqualified; the coefficient has meaning only when applied to specific populations." Further, it is intuitive that small differences between individuals are more difficult to detect than large ones, and the ICC is reflective of this (49).
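This context dependence can be made concrete with a small numeric sketch (my own illustration, not from the article): holding the error variance fixed, the reliability coefficient of Equation 1 changes markedly with the between-subjects (true score) variance. The variance values below are hypothetical.

```python
# Hypothetical variances chosen for illustration; only Equation 1 is from the text.
def reliability(true_var, error_var):
    """Equation 1: R = sigma_t^2 / (sigma_t^2 + sigma_e^2)."""
    return true_var / (true_var + error_var)

ERROR_VAR = 25.0  # identical trial-to-trial error variance in both groups

# Heterogeneous group: subjects differ a lot from one another
icc_heterogeneous = reliability(true_var=400.0, error_var=ERROR_VAR)  # ~0.94
# Homogeneous group: subjects differ little, same measurement error
icc_homogeneous = reliability(true_var=50.0, error_var=ERROR_VAR)     # ~0.67
```

The identical instrument thus earns a high or a mediocre ICC depending solely on who is being tested, which is the point of the Streiner and Norman quotation above.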
Error is typically considered as being of 2 types: systematic error (e.g., bias) and random error (2, 39). (Generalizability theory expands sources of error to include various facets of interest but is beyond the scope of this article.) Total error reflects both systematic error and random error (imprecision). Systematic error includes both constant error and bias (38). Constant error affects all scores equally, whereas bias is systematic error that affects certain scores differently than others. For physical performance measures, the distinction between constant error and bias is relatively unimportant, and the focus here on systematic error is on situations that result in a unidirectional change in scores on repeated testing. In testing of physical performance, subjects may improve their test scores simply due to learning effects, e.g., performing the first test serves as practice for subsequent tests, or fatigue or soreness may result in poorer performance across trials. In contrast, random error refers to sources of error that are due to chance factors. Factors such as luck, alertness, attentiveness by the tester, and normal biological variability affect a particular score. Such errors should, in a random manner, both increase and decrease test scores on repeated testing. Thus, we can expand Equation 1 as follows:

R = σ²_t / (σ²_t + σ²_se + σ²_re),   (3)

where σ²_se is the systematic component and σ²_re is the random component of the σ²_e.

It has been argued that systematic error is a concern of validity and not reliability (12, 43). Similarly, systematic error (e.g., learning effects, fatigue) has been suggested to be a natural phenomenon and therefore does not contribute to unreliability per se in test-retest situations (43). Thus, there is a school of thought that suggests that only random error should be assessed in reliability calculations. Under this analysis, the error term in the denominator will only reflect random error and not systematic error, increasing the size of reliability coefficients. The issue of inclusion of systematic error in the determination of reliability coefficients is addressed in a subsequent section.
The Basic Calculations

The calculation of reliability starts with the performance of a repeated-measures ANOVA. This analysis performs 2 functions. First, the inferential test of mean differences across trials is an assessment of systematic error (trend). Second, all of the subsequent calculations can be derived from the output of this ANOVA. In keeping with the nomenclature of Keppel (28), the ANOVA that is used is of a single-factor, within-subjects (repeated-measures) design. Unfortunately, the language gets a bit tortured in many sources, because the different ICC models are referred to as either 1-way or 2-way models; what is important to keep in mind is that both the 1-way and 2-way ICC models can be derived from the same single-factor, within-subjects ANOVA.
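The ANOVA partition described above can be sketched in code. The function below (a minimal implementation assuming NumPy; the function name is mine, not from the article) computes, from a subjects-by-trials score matrix, every mean square that the ICC formulas discussed later require:

```python
import numpy as np

def rm_anova_mean_squares(scores):
    """Single-factor, within-subjects ANOVA partition of an n-subjects by
    k-trials score matrix. Returns the mean squares used by the ICC models:
    MS_B (between subjects; the same value serves as MS_S in the 2-way layout),
    MS_W (within subjects), MS_T (trials), and MS_E (residual error)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    ss_subjects = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
    ss_trials = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_within = ss_total - ss_subjects   # 1-way model: trials + error lumped
    ss_error = ss_within - ss_trials     # 2-way model: error kept separate
    return {
        "MS_B": ss_subjects / (n - 1),
        "MS_W": ss_within / (n * (k - 1)),
        "MS_T": ss_trials / (k - 1),
        "MS_E": ss_error / ((n - 1) * (k - 1)),
    }
```

Run on the trial A1/A2 columns of the example data presented later (Table 1), this reproduces the Table 2 values (MS_B ≈ 2,098, MS_W = 53.75, MS_T ≈ 30.2, MS_E ≈ 57), and the same routine serves any number of trials unchanged.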
TABLE 1. Example data set.

Subject    Trial A1   Trial A2   D      Trial B1   Trial B2   D
1          146        140        -6     166        160        -6
2          148        152        +4     168        172        +4
3          170        152        -18    160        142        -18
4          90         99         +9     150        159        +9
5          157        145        -12    147        135        -12
6          156        153        -3     146        143        -3
7          176        167        -9     156        147        -9
8          205        218        +13    155        168        +13
Mean ± SD  156 ± 33   153 ± 33          156 ± 8    153 ± 13
TABLE 2. Two-way analysis of variance summary table for data set A.*

Source             df   SS         Mean square                           F        p value
Between subjects   7    14,689.8   2,098.4 (MS_B: 1-way; MS_S: 2-way)    36.8
Within subjects    8    430        53.75 (MS_W)
  Trials           1    30.2       30.2 (MS_T)                           0.53     0.49
  Error            7    399.8      57 (MS_E)
Total              15   15,119.8

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
To illustrate the calculations, example data are presented in Table 1. ANOVA summary tables are presented in Tables 2 and 3, and the resulting ICCs are presented in Table 4. Focus on the first two columns of Table 1, which are labeled trial A1 and trial A2. As can be seen, there are 2 sets (columns) of scores, and each set has 8 scores. In this example, each of 8 subjects has provided a score in each set. Assume that each set of scores represents the subjects' scores on the 1RM squat across 2 different days (trials). A repeated-measures ANOVA is performed primarily to test whether the 2 sets of scores are significantly different from each other (i.e., do the scores systematically change between trials) and is summarized in Table 2. Equivalently, one could have used a paired t-test, since there were only 2 levels of trials. However, the ANOVA is applicable to situations with 2 or more trials and is consistent with the ICC literature in defining sources of variance for ICC calculations. Note that there are 3 sources of variability in Table 2: subjects, trials, and error. In a repeated-measures ANOVA such as this, it is helpful to remember that the analysis might be considered as having 2 factors: the primary factor of trials and a secondary factor called subjects (with a sample size of 1 subject per cell). The error term includes the interaction effect of trials by subjects. It is useful to keep these sources of variability in mind for 2 reasons. First, the 1-way and 2-way models of the ICC (6, 44) either collapse the variability due to trials and error together (1-way models) or keep them separate (2-way models). Note that the trials and error sources of variance, respectively, reflect the systematic and random sources of error in the σ²_e of the reliability coefficient. These differences are illustrated in Table 2, where the df and sums of squares values for error in the 1-way model (within-subjects source) are simply the sum of the respective values for trials and error in the 2-way model.

Second, unlike a between-subjects ANOVA, where the "noise" due to different subjects is part of the error term, the variability due to subjects is now accounted for (due to the repeated testing) and therefore is not a part of the error term. Indeed, for the calculation of the ICC, the numerator (the signal) reflects the variance due to subjects. Since the error term of the ANOVA reflects the interaction between subjects and trials, the error term is small in situations where all the subjects change similarly across test days. In situations where subjects do not change in a similar manner across test days (e.g., some subjects' scores increase, whereas others decrease), the error term is large. In the former situation, even small differences across test days, as long as they are consistent across all the subjects, can result in a statistically significant effect for trials. In this example, however, the effect for trials is not statistically significant (p = 0.49), indicating that there is no statistically significant systematic error in the data. It should be kept in mind, however, that the statistical power of the test of mean differences between trials is affected by sample size and random error. Small sample sizes and noisy data (i.e., high random error) will decrease power and potentially hide systematic error. Thus, an inferential test of mean differences alone is insufficient to quantify reliability. Further, the effect for trials ought to be evaluated with a more liberal α, since in this case the implications of a type 2 error are more severe than those of a type 1 error. In cases where systematic error is present, it may be prudent to change the measurement schedule (e.g., add trials if a learning effect is present or increase rest intervals if fatigue is present) to compensate for the bias.
Shrout and Fleiss (46) have presented 6 forms of the ICC. This system has taken hold in the physical therapy literature. However, the specific nomenclature of their system does not seem to be as prevalent in the exercise physiology, kinesiology, and sport science literature, which has instead ignored which model is used or focused on ICC terms that are centered on either 1-way or 2-way ANOVA models (6, 44). Nonetheless, the ICC models of
TABLE 3. Analysis of variance summary table for data set B.*

Source             df   SS      Mean square                       F       p value
Between subjects   7    1,330   190 (MS_B: 1-way; MS_S: 2-way)    3.3
Within subjects    8    430     53.75 (MS_W)
  Trials           1    30.2    30.2 (MS_T)                       0.53    0.49
  Error            7    399.8   57 (MS_E)
Total              15   1,760

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
Shrout and Fleiss (46) overlap with the 1-way and 2-way models presented by Safrit (44) and Baumgartner (6). Three general models of the ICC are present in the Shrout and Fleiss (46) nomenclature, which are labeled 1, 2, and 3. Each model can be calculated 1 of 2 ways. If the scores in the analysis are single scores from each subject for each trial (or rater if assessing interrater reliability), then the ICC is given a second designation of 1. If the scores in the analysis represent the average of the k scores from each subject (i.e., the average across the trials), then the ICC is given a second designation of k. In this nomenclature then, an ICC with a model designation of 2,1 indicates an ICC calculated using model 2 with single scores. The use of these models is typically presented in the context of determining rater reliability (41). For model 1, each subject is assumed to be assessed by a different set of raters than other subjects, and these raters are assumed to be randomly sampled from the population of possible raters, so that raters are a random effect. Model 2 assumes each subject was assessed by the same group of raters, and these raters were randomly sampled from the population of possible raters. In this case, raters are also considered a random effect. Model 3 assumes each subject was assessed by the same group of raters, but these particular raters are the only raters of interest, i.e., one does not wish to generalize the ICCs beyond the confines of the study. In this case, the analysis attempts to determine the reliability of the raters used by that particular study, and raters are considered a fixed effect.
The 1-way ANOVA models (6, 44) coincide with model 1,k for situations where scores are averaged and model 1,1 for single scores for a given trial (or rater). Further, ICC 1,1 coincides with the 1-way ICC model described by Bartko (3, 4), and ICC 1,k has also been termed the Spearman-Brown prediction formula (4). Similarly, ICC values derived from single and averaged scores calculated using the 2-way approach (6, 44) coincide with models 3,1 and 3,k, respectively. Calculations coincident with models 2,1 and 2,k were not reported by Baumgartner (6) or Safrit (44).

More recently, McGraw and Wong (34) expanded the Shrout and Fleiss (46) system to include 2 more general forms, each also with a single score or average score version, resulting in 10 ICCs. These ICCs have been incorporated into SPSS statistical software starting with version 8.0 (36). Fortunately, 4 of the computational formulas of Shrout and Fleiss (46) also apply to the new forms of McGraw and Wong (34), so the total number of formulas is not different.
The computational formulas for the ICC models of Shrout and Fleiss (46) and McGraw and Wong (34) are summarized in Table 5. Unfortunately, it is not intuitively obvious how the computational formulas reflect the intent of equations 1 through 3. This stems from the fact that the computational formulas reported in most sources are derived from algebraic manipulations of basic equations where mean square values from ANOVA are used to estimate the various σ² values reflected in equations 1 through 3. To illustrate, the manipulations for ICC 1,1 (random-effects, 1-way ANOVA model) are shown herein. First, the computational formula for ICC 1,1 is as follows:

ICC 1,1 = (MS_B − MS_W) / [MS_B + (k − 1)MS_W],   (4)
where MS_B indicates the between-subjects mean square, MS_W indicates the within-subjects mean square, and k is the number of trials (3, 46). The relevant mean square values can be found in Table 2. To relate this computational formula to equation 1, one must know that estimation of the appropriate σ² comes from the expected mean squares from ANOVA. Specifically, for this model the expected MS_B equals σ²_e plus kσ²_s, whereas the expected MS_W equals σ²_e (3); therefore, MS_B equals MS_W plus kσ²_s. If from equation 1 we estimate σ²_t from the between-subjects variance (σ²_s), then

ICC = σ²_s / (σ²_s + σ²_e).   (5)
By algebraic manipulation (e.g., σ²_s = [MS_B − MS_W]/k) and substitution of the expected mean squares into equation 5, it can be shown that

ICC 1,1 = σ²_s / (σ²_s + σ²_e)
        = [(MS_B − MS_W)/k] / {[(MS_B − MS_W)/k] + MS_W}
        = (MS_B − MS_W) / [MS_B + (k − 1)MS_W].   (6)

Similar derivations can be made for the other ICC models (3, 34, 46, 49) so that all ultimately relate to equation 1. Of note is that with the different ICC models (fixed vs. random effects, 1-way vs. 2-way ANOVA), the expected mean squares change, and thus the computational formulas commonly found in the literature (30, 41) also change.
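The algebra in equation 6 can be spot-checked numerically (my own check, not part of the article): with exact rational arithmetic, the variance-component form of equation 5 and the computational formula of equation 4 coincide once σ²_s = (MS_B − MS_W)/k and σ²_e = MS_W are substituted. The mean square values below are arbitrary.

```python
from fractions import Fraction

# Arbitrary mean squares and trial count, chosen only for the check
ms_b, ms_w, k = Fraction(2100), Fraction(54), 2

sigma_s2 = (ms_b - ms_w) / k   # sigma_s^2 = (MS_B - MS_W) / k
sigma_e2 = ms_w                # expected MS_W estimates sigma_e^2

icc_from_components = sigma_s2 / (sigma_s2 + sigma_e2)        # equation 5
icc_from_formula = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)    # equation 4

assert icc_from_components == icc_from_formula  # exact; no rounding involved
```

Because Fraction keeps every quantity rational, the equality holds exactly rather than to floating-point tolerance, which is the appropriate standard for an algebraic identity.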
Choosing an ICC
Given the 6 ICC versions of Shrout and Fleiss (46) and
the 10 versions presented by McGraw and Wong (34), the
choice of ICC is perplexing, especially considering that
TABLE 4. ICC values for data sets A and B.*

ICC type   Data set A   Data set B
1,1        0.95         0.56
1,k        0.97         0.72
2,1        0.95         0.55
2,k        0.97         0.71
3,1        0.95         0.54
3,k        0.97         0.70

* ICC = intraclass correlation coefficient.
TABLE 6. Example data set with systematic error.

Subject    Trial C1   Trial C2   D
1          146        161        +15
2          148        162        +14
3          170        189        +19
4          90         100        +10
5          157        175        +18
6          156        171        +15
7          176        195        +19
8          205        219        +14
Mean ± SD  156 ± 33   172 ± 35
TABLE 5. Intraclass correlation coefficient model summary table.*

Shrout and Fleiss   Computational formula                                      McGraw and Wong   Model
1,1                 (MS_B − MS_W) / [MS_B + (k − 1)MS_W]                       1                 1-way random
1,k                 (MS_B − MS_W) / MS_B                                       k                 1-way random
Use 3,1                                                                        C,1               2-way random
Use 3,k                                                                        C,k               2-way random
2,1                 (MS_S − MS_E) / [MS_S + (k − 1)MS_E + k(MS_T − MS_E)/n]    A,1               2-way random
2,k                 (MS_S − MS_E) / [MS_S + (MS_T − MS_E)/n]                   A,k               2-way random
3,1                 (MS_S − MS_E) / [MS_S + (k − 1)MS_E]                       C,1               2-way fixed
3,k                 (MS_S − MS_E) / MS_S                                       C,k               2-way fixed
Use 2,1                                                                        A,1               2-way fixed
Use 2,k                                                                        A,k               2-way fixed

* Adapted from Shrout and Fleiss (46) and McGraw and Wong (34). Mean square abbreviations are based on the 1-way and 2-way analysis of variance illustrated in Table 2. For McGraw and Wong, A = absolute and C = consistency. MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; k = number of trials; n = number of subjects.
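The formulas in Table 5 can be sketched directly in code. The helper below (my own naming; assumes NumPy) derives the mean squares from raw scores and evaluates all 6 Shrout and Fleiss ICCs; run on data set A from Table 1, it reproduces the 0.95/0.97 pattern shown in Table 4.

```python
import numpy as np

def shrout_fleiss_iccs(scores):
    """All six Shrout-Fleiss ICCs from an n-subjects by k-trials matrix,
    using the mean squares defined in Table 2 (helper name is mine)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ss_subj = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_trials = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_within = ((scores - grand) ** 2).sum() - ss_subj
    ms_b = ms_s = ss_subj / (n - 1)                       # MS_B / MS_S
    ms_w = ss_within / (n * (k - 1))                      # MS_W
    ms_t = ss_trials / (k - 1)                            # MS_T
    ms_e = (ss_within - ss_trials) / ((n - 1) * (k - 1))  # MS_E
    return {
        "1,1": (ms_b - ms_w) / (ms_b + (k - 1) * ms_w),
        "1,k": (ms_b - ms_w) / ms_b,
        "2,1": (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_t - ms_e) / n),
        "2,k": (ms_s - ms_e) / (ms_s + (ms_t - ms_e) / n),
        "3,1": (ms_s - ms_e) / (ms_s + (k - 1) * ms_e),
        "3,k": (ms_s - ms_e) / ms_s,
    }

# Data set A from Table 1 (trials A1 and A2, one row per subject)
data_a = [[146, 140], [148, 152], [170, 152], [90, 99],
          [157, 145], [156, 153], [176, 167], [205, 218]]
iccs = shrout_fleiss_iccs(data_a)
```

For data set A the six values round to 0.95 (single-score forms) and 0.97 (average-score forms), matching the Table 4 column for data set A.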
most of the literature deals with rater reliability, not test-retest reliability of physical performance measures. In a classic paper, Brozek and Alexander (11) first introduced the concept of the ICC to the movement sciences literature and detailed the implementation of an ICC for application to test-retest analysis of motor tasks. Their coefficient is equivalent to model 3,1. Thus, one might use ICC 3,1 with test-retest reliability, where trials is substituted for raters. From the rater nomenclature above, if one does not wish to generalize the reliability findings but rather assert that in our hands the procedures are reliable, then ICC 3,1 seems like a logical choice. However, this ICC does not include variance associated with systematic error and is in fact closely approximated by the Pearson r (1, 43). Therefore, the criticism of the Pearson r as an index of reliability holds as well for ICCs derived from model 3. At the least, it needs to be established that the effect for trials (bias) is trivial if reporting an ICC derived from model 3. Use of an effect size for the trials effect in the ANOVA would provide information in this regard. With respect to ICC 3,1, Alexander (1) notes that it "may be regarded as an estimate of the value that would have been obtained if the fluctuation [systematic error] had been avoided."
In a more general sense, there are 4 issues to be addressed in choosing an ICC: (a) 1-way or 2-way model, (b) fixed- or random-effect model, (c) inclusion or exclusion of systematic error in the ICC, and (d) single or mean score. With respect to choosing a 1-way or 2-way model, in a 1-way model the effect of raters or trials (replication study) is not crossed with subjects, meaning that it allows for situations where all raters do not score all subjects (48). Fleiss (22) uses the 1-way model for what he terms simple replication studies. In this model, all sources of error are lumped together into the MS_W (Tables 2 and 3). In contrast, the 2-way models allow the error to be partitioned between systematic and random error. When systematic error is small, MS_W from the 1-way model and the error mean square (MS_E) from the 2-way models (reflecting random error) are similar, and the resulting ICCs are similar. This is true for both data sets A and B. When systematic error is substantial, MS_W and MS_E are disparate, as in data set C (Tables 6 and 7). Two-way models require trials or raters to be crossed with subjects (i.e., subjects provide scores for all trials or each rater rates all subjects). For test-retest situations, the design dictates that trials are crossed with subjects, and the data therefore lend themselves to analysis by 2-way models.
Regarding ﬁxed vs. random effects, a ﬁxed factor is
one in which all levels of the factor of interest (in this
TABLE 7. Analysis of variance summary table for data set C.*

Source             df   SS       Mean square                          F         p value
Between subjects   7    15,925   2,275 (MS_B: 1-way; MS_S: 2-way)     482.58
Within subjects    8    994      124.25 (MS_W)
  Trials           1    961.0    961.0 (MS_T)                         203.85    <0.0001
  Error            7    33.0     4.71 (MS_E)
Total              15   16,919

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
case trials) are included in the analysis and no attempt at generalization of the reliability data beyond the confines of the study is expected. Determining the reliability of a test before using it in a larger study fits this description of a fixed effect. A random factor is one in which the levels of the factor in the design (trials) are but a sample of the possible levels, and the analysis will be used to generalize to other levels. For example, a study designed to evaluate the test-retest reliability of the vertical jump for use by other coaches (with similar athletes) would consider the effect of trials to be a random effect. Both Shrout and Fleiss (46) models 1 and 2 are random-effects models, whereas model 3 is a fixed-effect model. From this discussion, for the 2-way models of Shrout and Fleiss (46), the choice between model 2 and model 3 appears to hinge on a decision regarding a random- vs. fixed-effects model. However, models 2 and 3 also differ in their treatment of systematic error. As noted previously, model 3 only considers random error, whereas model 2 considers both random and systematic error. This system does not include a 2-way fixed-effects model that includes systematic error and does not offer a 2-way random-effects model that only considers random error. The expanded system of McGraw and Wong (34) includes these options. In the nomenclature of McGraw and Wong (34), the designation C refers to consistency and A refers to absolute agreement. That is, the C models consider only random error, and the A models consider both random and systematic error. As noted in Table 5, no new computational formulas are required beyond those presented by Shrout and Fleiss (46). Thus, if one were to choose a 2-way random-effects model that only addressed random error, one would use equation 3,1 (or equation 3,k if the mean across k trials is the criterion score). Similarly, if one were to choose a 2-way fixed-effects model that addressed both systematic and random error, equation 2,1 would be used (or 2,k). Ultimately then, since the computational formulas do not differ between systems, the choice between using the Shrout and Fleiss (46) equations from models 2 vs. 3 hinges on decisions regarding inclusion or exclusion of systematic error in the calculations. As noted by McGraw and Wong (34), "the random-fixed effects distinction is in its effect on the interpretation, but not calculation, of an ICC."
Should systematic error be included in the ICC? First,
if the effect for trials is small, the systematic differences between trials will be small, and the ICCs will be similar to each other. This is evident in both the A and B data sets (Tables 1 through 3). However, if the mean differences are large, then differences between ICCs are evident, especially between equation 3,1, which does not consider systematic error, and equations 1,1 and 2,1, which do consider systematic error. In this regard, the F test for trials and the ICC calculations may give contradictory results from the same data. Specifically, it can be the case that an ICC can be large (indicating good reliability), whereas the ANOVA shows a significant trials effect. An example is given in Tables 6 and 7. In this example, each score in trial C1 was altered in trial C2 so that there was a bias of +15 kg and a random component added to each score. The effect for trials was significant (F(1,7) = 203.85, p < 0.001) and reflected a mean increase of 16 kg. For an ANOVA to be significant, the effect must be large (in this case, the mean differences between trials must be large), the noise (error term) must be small, or both. The error term is small when all subjects behave similarly across test days. When this is the case, even small mean differences can be statistically significant. In this case, the systematic differences explain a significant amount of variability in the data. Despite the rather large systematic error, the ICC values from equations 1,1; 2,1; and 3,1 were 0.896, 0.901, and 0.998, respectively. A cursory examination of just the ICC scores would suggest that the test exhibited good reliability, especially using equation 3,1, which only reflects random error. However, an approximately 10% increase in scores from trial C1 to C2 would suggest otherwise. Thus, an analysis that only focuses on the ICC without consideration of the trials effect is incomplete (31). If the effect for trials is significant, the most straightforward approach is to develop a measurement schedule that will attenuate systematic error (2, 50). For example, if learning effects are present, one might add trials until a plateau in performance occurs. Then the ICC could be calculated only on the trials in the plateau region. The identification of such a measurement schedule would be especially helpful for random-effects situations where others might be using the test being evaluated. For simplicity, all the examples here have been with only 2 levels for trials. If a trials effect is significant, however, 2 trials are insufficient to identify a plateau. The possibility of a significant trials effect should be considered in the design of the reliability study. Fortunately, the ANOVA procedures require no modification to accommodate any number of trials.
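The contrast between the trials F test and the ICC can be reproduced numerically. The following Python sketch uses hypothetical data (not the scores of Tables 6 and 7; all names are illustrative) to add a constant bias to a second trial, showing that equation 3,1 remains near 1.0 even as the trials F ratio becomes very large:

```python
import numpy as np

def variance_components(scores):
    """Partition a subjects-by-trials array into the 2-way ANOVA mean squares."""
    n, k = scores.shape
    grand = scores.mean()
    ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_trials = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((scores - grand) ** 2).sum() - ss_subjects - ss_trials
    return (ss_subjects / (n - 1),            # MS subjects
            ss_trials / (k - 1),              # MS trials
            ss_error / ((n - 1) * (k - 1)))   # MS error

rng = np.random.default_rng(0)
trial1 = rng.normal(150, 30, size=8)               # hypothetical 1RM scores (kg)
trial2 = trial1 + 15 + rng.normal(0, 2, size=8)    # +15 kg systematic bias plus noise
ms_s, ms_t, ms_e = variance_components(np.column_stack([trial1, trial2]))

f_trials = ms_t / ms_e                             # large: systematic error present
icc_3_1 = (ms_s - ms_e) / (ms_s + (2 - 1) * ms_e)  # equation 3,1 ignores the bias
print(f_trials, icc_3_1)
```

The F ratio for trials is enormous while equation 3,1 still reports near-perfect reliability, which is exactly the contradiction described above.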
Interpreting the ICC
At one level, interpreting the ICC is fairly straightforward; it represents the proportion of variance in a set of scores that is attributable to the true score variance, σ²_t. An ICC of 0.95 means that an estimated 95% of the observed score variance is due to σ²_t. The balance of the variance (1 − ICC = 5%) is attributable to error (51). However, how does one qualitatively evaluate the magnitude of an ICC, and what can the quantity tell you? Some sources have attempted to delineate good, medium, and poor levels for the ICC, but there is certainly no consensus as to what constitutes a good ICC (45). Indeed, Charter and Feldt (15) argue that
"it is not theoretically defensible to set a universal standard for test score reliability." These interpretations are further complicated by 2 factors. First, as noted herein, the ICC varies, depending on which version of the ICC is used. Second, the magnitude of the ICC is dependent on the variability in the data (45). All other things being equal, low levels of between-subjects variability will serve to depress the ICC even if the differences between subjects' scores across test conditions are small. This is illustrated by comparing the 2 example sets of data in Table 1. Trials 1 and 2 of data sets A and B have identical mean values and identical change scores between trials 1 and 2. They differ in the variability between subjects, with greater between-subjects variability evident in data set A, as shown in the larger SDs. In Tables 2 and 3, the ANOVA tables have identical outcomes with respect to the inferential test of the factor trials and have identical error terms (since the between-subjects variability is not part of the error term, as noted previously). Table 4 shows the ICC values calculated using the 6 different models of Shrout and Fleiss (46) on the A and B data sets. Clearly, data set B, with the lower between-subjects variability, results in smaller ICC values than data set A.
How then does one interpret an ICC? First, because of the relationship between the ICC and between-subjects variability, the heterogeneity of the subjects should be considered. A large ICC can mask poor trial-to-trial consistency when between-subjects variability is high. Conversely, a low ICC can be found even when trial-to-trial variability is low if the between-subjects variability is low. In this case, the homogeneity of the subjects means it will be difficult to differentiate between subjects even though the absolute measurement error is small. An examination of the SEM in conjunction with the ICC is therefore needed (32). From a practical perspective, a given test can have different reliability, at least as determined from the ICC, depending on the characteristics of the individuals included in the analysis. In the 1RM squat, combining individuals of widely different capabilities (e.g., wide receivers and defensive linemen in American football) into the same analysis increases between-subjects variability and improves the ICC, yet this may not be reflected in the expected day-to-day variation, as illustrated in Tables 1 through 4. In addition, the inferential test for bias described previously needs to be considered. High between-subjects variability may result in a high ICC even if the test for bias is statistically significant.
The relationship between between-subjects variability and the magnitude of the ICC has been used as a criticism of the ICC (10, 39). This is an unfair criticism, since the ICC is used to provide information regarding inferential statistical tests, not to provide an index of absolute measurement error. In essence, the ICC normalizes measurement error relative to the heterogeneity of the subjects. As an index of absolute reliability, then, this is a weakness, and other indices (i.e., the SEM) are more informative. As a relative index of reliability, the ICC behaves as intended.
What are the implications of a low ICC? First, measurement error reflected in an ICC of less than 1.0 serves to attenuate correlations (22, 38). The equation for this attenuation effect is as follows:

r_xy = r̂_xy √(ICC_x × ICC_y), (7)

where r_xy is the observed correlation between x and y, r̂_xy is the correlation between x and y if both were measured without error (i.e., the correlation between the true scores), and ICC_x and ICC_y are the reliability coefficients for x and y, respectively. Nunnally and Bernstein (38) note that the effect of measurement error on correlation attenuation becomes minimal as ICCs increase above 0.80. In addition, reliability affects the power of statistical tests. Specifically, the lower the reliability, the greater the risk of type 2 error (14, 40). Fleiss (22) illustrates how the magnitude of an ICC can be used to adjust sample size and statistical power calculations (45). In short, low ICCs mean that more subjects are required in a study for a given effect size to be statistically significant (40). An ICC of 0.60 may be perfectly fine if the resulting effect on sample size and statistical power is within the logistical constraints of the study. If, however, an ICC of 0.60 means that, for a required level of power, more subjects must be recruited than is feasible, then 0.60 is not acceptable.
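Equation 7 is simple to apply directly. A minimal Python sketch (the function name is illustrative) shows how much a true correlation of 0.70 shrinks when both variables are measured with imperfect reliability:

```python
import math

def attenuated_correlation(true_r, icc_x, icc_y):
    """Equation 7: observed r_xy = true r-hat_xy * sqrt(ICC_x * ICC_y)."""
    return true_r * math.sqrt(icc_x * icc_y)

# A true correlation of 0.70 observed through two tests, each with ICC = 0.80:
print(attenuated_correlation(0.70, 0.80, 0.80))  # about 0.56, noticeably attenuated
# With ICCs of 0.95 the attenuation is modest, consistent with Nunnally and Bernstein:
print(attenuated_correlation(0.70, 0.95, 0.95))  # about 0.66
```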
Although infrequently used in the movement sciences, the ICC of test scores can be used in the setting and interpretation of cut points for classification of individuals. Charter and Feldt (15) show how the ICC can be used to estimate the percentage of false-positive, false-negative, true-positive, and true-negative results for a clinical classification scheme. Although the details of these calculations are beyond the scope of this article, it is worthwhile to note that very high ICCs are required to classify individuals with a minimum of misclassification.
THE SEM
Because the general form of the ICC is a ratio of variance due to differences between subjects (the signal) to the total variability in the data (the noise), the ICC is reflective of the ability of a test to differentiate between different individuals (27, 47). It does not provide an index of the expected trial-to-trial noise in the data, which would be useful to practitioners such as strength coaches. Unlike the ICC, which is a relative measure of reliability, the SEM provides an absolute index of reliability. Hopkins (26) refers to this as the "typical error." The SEM quantifies the precision of individual scores on a test (24). The SEM has the same units as the measurement of interest, whereas the ICC is unitless. The interpretation of the SEM centers on the assessment of reliability within individual subjects (45). The direct calculation of the SEM involves the determination of the SD of a large number of scores from an individual (44). In practice, a large number of scores is not typically collected, so the SEM is estimated. Most references estimate the SEM as follows:

SEM = SD √(1 − ICC), (8)

where SD is the SD of the scores from all subjects, which can be determined from the ANOVA as √(SS_TOTAL/(n − 1)), and ICC is the reliability coefficient. Note the similarity between the equation for the SEM and the standard error of estimate from regression analysis. Since different forms of the ICC can result in different numbers, the choice of ICC can substantively affect the size of the SEM, especially if systematic error is present. However, there is an alternative way of calculating the SEM that avoids these uncertainties. The SEM can be estimated as the square root of the mean square error term from the ANOVA (20, 26, 48). Since this estimate of
the SEM has the advantage of being independent of the specific ICC, its use would allow for more consistency in interpreting SEM values from different studies. However, the mean square error terms differ when using the 1-way vs. 2-way models. In Table 2 it can be seen that using a 1-way model (22) would require the use of MS_W (√53.75 = 7.3 kg), whereas use of a 2-way model would require use of MS_E (√57 = 7.6 kg). Hopkins (26) argues that because the 1-way model combines influences of random and systematic error together, "The resulting statistic is biased high and is hard to interpret because the relative contributions of random error and changes in the mean are unknown." He therefore suggests that the error term from the 2-way model (MS_E) be used to calculate the SEM. Note, however, that in this sample, the 1-way SEM is smaller than the 2-way SEM. This is because the trials effect is small. The high bias of the 1-way model is observed when the trials effect is large (Table 7). The SEM calculated using the MS error from the 2-way model (√4.71 = 2.2 kg) is markedly lower than the SEM calculated using the 1-way model (√124.25 = 11.1 kg), since the SEM defined as √MS_E only considers random error. This is consistent with the concept of a SE, which defines noise symmetrically around a central value. This points to the desirability of establishing a measurement schedule that is free of systematic variation.
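The difference between the 1-way and 2-way SEM estimates can be demonstrated with simulated data. In this Python sketch (hypothetical scores, not the data of Tables 1 through 7), a 10-kg systematic shift between trials inflates √MS_W but leaves √MS_E largely unaffected:

```python
import numpy as np

def sem_estimates(scores):
    """Contrast SEM = sqrt(MS_W) (1-way model) with SEM = sqrt(MS_E) (2-way model).
    scores: subjects-by-trials array."""
    n, k = scores.shape
    grand = scores.mean()
    ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_trials = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    # 1-way model pools trials and random error into the "within" term
    ms_within = (ss_total - ss_subjects) / (n * (k - 1))
    # 2-way model partitions trials out, leaving only random error
    ms_error = (ss_total - ss_subjects - ss_trials) / ((n - 1) * (k - 1))
    return np.sqrt(ms_within), np.sqrt(ms_error)

rng = np.random.default_rng(1)
base = rng.normal(150, 25, 10)                     # 10 subjects, hypothetical 1RM (kg)
scores = np.column_stack([base + rng.normal(0, 5, 10),
                          base + 10 + rng.normal(0, 5, 10)])  # +10 kg systematic shift
sem_1way, sem_2way = sem_estimates(scores)
print(sem_1way, sem_2way)  # the 1-way SEM is inflated by the systematic shift
```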
Another difference between the ICC and SEM is that the SEM is largely independent of the population from which it was determined; i.e., the SEM "is considered to be a fixed characteristic of any measure, regardless of the sample of subjects under investigation" (38). Thus, the SEM is not affected by between-subjects variability as is the ICC. To illustrate, the MS_E values for the data in Tables 2 and 3 are equal (MS_E = 57), despite large differences in between-subjects variability. The resulting SEM is the same for data sets A and B (√57 = 7.6 kg), yet they have different ICC values (Table 4). The results are similar when calculating the SEM using equation 8, even though equation 8 uses the ICC in calculating the SEM, since the effects of the SD and the ICC tend to offset each other (38). However, the effects do not offset each other completely, and use of equation 8 results in an SEM estimate that is modestly affected by between-subjects variability (2).
The SEM is the SE in estimating observed scores (the scores in your data set) from true scores (38). Of course, our problem is just the opposite. We have the observed scores and would like to estimate subjects' true scores. The SEM has been used to define the boundaries around which we think a subject's true score lies. It is often reported (8, 17) that the 95% CI for a subject's true score can be estimated as follows:

T = S ± 1.96(SEM), (9)

where T is the subject's true score, S is the subject's observed score on the measurement, and 1.96 defines the 95% CI. However, strictly speaking this is not correct, since the SEM is symmetrical around the true score, not the observed score (13, 19, 24, 38), and the SEM reflects the SD of the observed scores while holding the true score constant. In lieu of equation 9, an alternate approach is to estimate the subject's true score and calculate an alternate SE (reflecting the SD of true scores while holding observed scores constant). Because of regression to the mean, obtained scores (S) are biased estimators of true scores (16, 19). Scores below the mean are biased downward, and scores above the mean are biased upward. A subject's estimated true score (T) can be calculated as follows:

T = X̄ + ICC(d), (10)

where d = S − X̄. To illustrate, consider data set A in Table 1. With a grand mean of 154.5 and an ICC 3,1 of 0.95, an individual with an S of 120 kg would have a predicted T of 154.5 + 0.95(120 − 154.5) = 121.8 kg. Note that because the ICC is high, the bias is small (1.8 kg). The appropriate SE to define the CI of the true score, which some have referred to as the standard error of estimate (13), is as follows (19, 38):

SEM_TS = SD √(ICC(1 − ICC)). (11)

In this example the value is 31.74 √(0.95(1 − 0.95)) = 6.92, where 31.74 equals the SD of the observed scores around the grand mean. The 95% CI for T is then 121.8 ± 1.96(6.92), which defines a span of 108.2 to 135.4 kg. The entire process, which has been termed the regression-based approach (16), can be summarized as follows (24):

95% CI for T = X̄ + ICC(d) ± 1.96 SD √(ICC(1 − ICC)). (12)

If one had simply used equation 9 with S and the SEM, the resulting interval would span 120 ± 1.96(7.6) = 105.1 to 134.9 kg. Note that the differences between the CIs are small and that the CI width from equation 9 (29.8 kg) is wider than that from equation 12 (27.2 kg). For all ICCs less than 1.0, the CI width will be narrower from equation 12 than from equation 9 (16), but the differences shrink as the ICC approaches 1.0 and as S approaches X̄ (24).
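The regression-based approach of equation 12 can be collected into a single function. This Python sketch (the names are illustrative) reproduces the worked example above to within rounding; the text's 121.8 and 135.4 reflect intermediate rounding of the estimated true score:

```python
import math

def true_score_ci(s, grand_mean, sd, icc, z=1.96):
    """Regression-based 95% CI for a true score (equation 12):
    T = mean + ICC*(S - mean), bounds = T +/- z * SD * sqrt(ICC*(1 - ICC))."""
    t_hat = grand_mean + icc * (s - grand_mean)   # equation 10
    se_ts = sd * math.sqrt(icc * (1 - icc))       # equation 11
    return t_hat, t_hat - z * se_ts, t_hat + z * se_ts

# Values from data set A: grand mean 154.5 kg, SD 31.74, ICC 3,1 = 0.95, S = 120 kg
t_hat, lo, hi = true_score_ci(120, 154.5, 31.74, 0.95)
print(round(t_hat, 1), round(lo, 1), round(hi, 1))  # 121.7, 108.2 to 135.3 kg
```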
MINIMAL DIFFERENCES NEEDED TO BE CONSIDERED REAL
The SEM is an index that can be used to define the difference needed between separate measures on a subject for the difference in the measures to be considered real. For example, if the 1RM of an athlete on one day is 155 kg and at some later time is 160 kg, are you confident that the athlete really increased the 1RM by 5 kg, or is this difference within what you might expect to see in repeated testing just due to the noise in the measurement? The SEM can be used to determine the minimum difference (MD) to be considered "real" and can be calculated as follows (8, 20, 42):

MD = SEM × 1.96 × √2. (13)

Once again the point is to construct a 95% CI, and the 1.96 value is simply the z-score associated with a 95% CI. (One may choose a different z-score instead of 1.96 if a more liberal or more conservative assessment is desired.) But where does the √2 come from?
Why can't we simply calculate the 95% CI for a subject's score as we have done above? If the score is outside that interval, then shouldn't we be 95% confident that the subject's score has really changed? Indeed, this approach has been suggested in the literature (25, 37). The key here is that we now have 2 scores from a subject. Each of these scores has a true component and an error component. That is, both scores were measured with error, and simply seeing if the second score falls outside the CI of the first score does not account for the error in the second score. What we really want here is an index based on the variability of the difference scores. This can be quantified as the SD of the difference scores (SDd). As it turns out, when there are 2 levels of trials (as in the examples herein), the SEM is equal to the SDd divided by √2 (17, 26):

SEM = SDd/√2. (14)

Therefore, multiplying the SEM by √2 solves for the SDd, and then multiplying the SDd by 1.96 allows for the construction of the 95% CI. Once the MD is calculated, any change in a subject's score, either above or below the previous score, greater than the MD is considered real. More precisely, for all people whose differences on repeated testing are at least as large as the MD, 95% of them would reflect real differences. Using data set A, the first subject has a trial A1 score of 146 kg. The SEM for the test is √57 = 7.6 kg. From equation 13, MD = 7.6 × 1.96 × √2 = 21.07 kg. Thus, a change of at least 21.07 kg needs to occur to be confident, at the 95% level, that a change in 1RM reflects a real change and not a difference that is within what might be reasonably expected given the measurement error of the 1RM test.
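Equation 13 reduces to a one-line calculation. A minimal Python sketch using the SEM from data set A:

```python
import math

def minimum_difference(sem, z=1.96):
    """Equation 13: MD = SEM * z * sqrt(2), the change needed to be considered real."""
    return sem * z * math.sqrt(2)

# SEM of 7.6 kg (the square root of MS_E = 57 from data set A):
print(round(minimum_difference(7.6), 2))  # 21.07 kg
```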
However, as with defining a CI for an observed score, the process outlined herein for defining a minimal difference is not precisely accurate. As noted by Charter (13) and Dudek (19), the SE of prediction (SEP) is the correct SE to use in these calculations, not the SEM. The SEP is calculated as follows:

SEP = SD √(1 − ICC²). (15)

To define a 95% CI outside which one could be confident that a retest score reflects a real change in performance, simply calculate the estimated true score (equation 10) plus or minus the SEP. To illustrate, consider the same data as in the example in the previous paragraph. From equation 10, we estimate the subject's true score (T) as T = X̄ + ICC(d) = 154.5 + 0.95(146 − 154.5) ≈ 146.4 kg. The SEP = SD × √(1 − ICC²) = 31.74 × √(1 − 0.95²) = 9.91. The resulting 95% CI is 146.4 ± 1.96(9.91), which defines an interval from approximately 127 to 166 kg. Therefore, any retest score outside that interval would be interpreted as reflecting a real change in performance. As given in Table 1, the retest score of 140 kg is inside the CI and would be interpreted as a change consistent with the measurement error of the test, not a real change in performance. As before, use of a different z-score in place of 1.96 will allow for the construction of a more liberal or conservative CI.
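The SEP-based interval can be sketched the same way. The following Python function (names are illustrative) reproduces the worked example:

```python
import math

def retest_ci(s, grand_mean, sd, icc, z=1.96):
    """Interval outside which a retest score suggests real change:
    estimated true score (equation 10) +/- z * SEP, with SEP = SD*sqrt(1 - ICC^2)."""
    t_hat = grand_mean + icc * (s - grand_mean)   # equation 10
    sep = sd * math.sqrt(1 - icc ** 2)            # equation 15
    return t_hat - z * sep, t_hat + z * sep

# Worked example from the text: S = 146 kg, grand mean 154.5 kg, SD 31.74, ICC = 0.95
lo, hi = retest_ci(146, 154.5, 31.74, 0.95)
print(round(lo, 1), round(hi, 1))  # approximately 127.0 to 165.9 kg
```

A retest score of 140 kg falls inside this interval, matching the interpretation above.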
OTHER CONSIDERATIONS
ONSIDERATIONS
In this article, several considerations regarding ICC and
SEM calculations will not be addressed in detail, but brief
mention will be made here. First, assumptions of ANOVA
apply to these data. The most common assumption vio
lated is that of homoscedasticity. That is, does the size of
the error correlate with the magnitude of the observed
scores? If the data exhibit homoscedasticity, the answer
is no. For physical performance measures, it is common
that the absolute error tends to be larger for subjects who
score higher (2, 26), e.g., the noise from repeated strength
testing of stronger subjects is likely to be larger than the
noise from weaker subjects. If the data exhibit heterosce
dasticity, often a logarithmic transformation is appropri
ate. Second, it is important to realize that ICC and SEM
values determined from sample data are estimates. As
such, it is instructive to construct CIs for these estimates.
Details of how to construct these CIs are addressed in
other sources (34, 35). Third, how many subjects are re
quired to get adequate stability for the ICC and SEM cal
culations? Unfortunately, there is no consensus in this
area. The reader is referred to other studies for further
discussion (16, 35, 52). Finally, reliability, as quantiﬁed
by the ICC, is not synonymous with responsiveness to
change (23). The MD calculation presented herein allows
one to evaluate a change score after the fact. However, a
small MD, in and of itself, is not a priori evidence that a
given test is responsive.
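One simple way to screen for heteroscedasticity, in the spirit of Bland and Altman (10), is to correlate the absolute between-trial differences with the subject means. This Python sketch uses simulated data in which error is deliberately made proportional to the score; the function name, sample size, and 5% error figure are all hypothetical:

```python
import numpy as np

def heteroscedasticity_r(trial1, trial2):
    """Correlate absolute inter-trial differences with subject means.
    A clearly positive r suggests heteroscedasticity; a log transformation of the
    raw scores is then often applied before the reliability analysis."""
    t1, t2 = np.asarray(trial1, float), np.asarray(trial2, float)
    return np.corrcoef((t1 + t2) / 2, np.abs(t1 - t2))[0, 1]

rng = np.random.default_rng(2)
true = rng.uniform(50, 250, 200)                   # simulated strength scores (kg)
t1 = true * (1 + rng.normal(0, 0.05, true.size))   # 5% proportional error, trial 1
t2 = true * (1 + rng.normal(0, 0.05, true.size))   # 5% proportional error, trial 2
r = heteroscedasticity_r(t1, t2)
print(r)  # clearly positive: absolute error grows with score magnitude
```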
PRACTICAL APPLICATIONS
For a comprehensive assessment of reliability, a 3-layered approach is recommended. First, perform a repeated measures ANOVA and cast the summary table as a 2-way model; i.e., trials and error are separate sources of variance. Evaluate the F ratio for the trials effect to examine systematic error. As noted previously, it may be prudent to evaluate the effect for trials using a more liberal α level than the traditional 0.05. If the effect for trials is significant (and the effect size is not trivial), it is prudent to reexamine the measurement schedule for influences of learning and fatigue. If 3 or more levels of trials were included in the analysis, a plateau in performance may be evident, and exclusion of those levels of trials not in the plateau region in a subsequent reanalysis may be warranted. However, this exclusion of trials needs to be reported. Under these conditions, where systematic error is deemed unimportant, the ICC values will be similar and will reflect random error (imprecision). However, it is suggested here that the ICC from equation 3,1 be used (Table 5), since it is most closely tied to the MS_E calculation of the SEM. Once the systematic error is determined to be nonsignificant or trivial, interpret the ICC and SEM within the analytical goals of your study (2). Specifically, researchers interested in group-level responses can use the ICC to assess correlation attenuation, statistical power, and sample size calculations. Practitioners (e.g., coaches, clinicians) can use the SEM (and associated SEs) in the interpretation of scores from individual athletes (CIs for true scores, assessing individual change). Finally, although reliability is an important aspect of measurement, a reliable test is not necessarily a valid test (i.e., it may not measure what it purports to measure).
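The 3-layered approach above can be condensed into one routine. The Python sketch below uses hypothetical data and assumes SciPy is available for the F-distribution p value; it runs the trials F test, then computes ICC equation 3,1, the SEM as √MS_E, and the minimum difference of equation 13:

```python
import numpy as np
from scipy import stats

def reliability_analysis(scores):
    """3-layered workflow sketch: (1) F test for a trials (systematic error) effect
    from the 2-way ANOVA, (2) ICC equation 3,1, (3) SEM = sqrt(MS_E) and the MD."""
    n, k = scores.shape
    grand = scores.mean()
    ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_trials = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((scores - grand) ** 2).sum() - ss_subjects - ss_trials
    ms_subjects = ss_subjects / (n - 1)
    ms_trials = ss_trials / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    p_trials = stats.f.sf(ms_trials / ms_error, k - 1, (n - 1) * (k - 1))
    icc_3_1 = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    sem = np.sqrt(ms_error)
    md = sem * 1.96 * np.sqrt(2)          # equation 13
    return p_trials, icc_3_1, sem, md

rng = np.random.default_rng(3)
base = rng.normal(150, 30, 12)            # 12 subjects, hypothetical 1RM scores (kg)
scores = np.column_stack([base + rng.normal(0, 5, 12),
                          base + rng.normal(0, 5, 12)])  # no systematic error built in
p_trials, icc_3_1, sem, md = reliability_analysis(scores)
print(p_trials, icc_3_1, sem, md)
```

With no built-in bias, the trials p value is typically nonsignificant, so the ICC and SEM can be interpreted directly.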
REFERENCES

1. Alexander, H.W. The estimation of reliability when several trials are available. Psychometrika 12:79–99. 1947.
2. Atkinson, D.B., and A.M. Nevill. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 26:217–238. 1998.
3. Bartko, J.J. The intraclass reliability coefficient as a measure of reliability. Psychol. Rep. 19:3–11. 1966.
4. Bartko, J.J. On various intraclass correlation coefficients. Psychol. Bull. 83:762–765. 1976.
5. Baumgartner, T.A. Estimating reliability when all test trials are administered on the same day. Res. Q. 40:222–225. 1969.
6. Baumgartner, T.A. Norm-referenced measurement: reliability. In: Measurement Concepts in Physical Education and Exercise Science. M.J. Safrit and T.M. Woods, eds. Champaign, IL: Human Kinetics, 1989. pp. 45–72.
7. Baumgartner, T.A. Estimating the stability reliability of a score. Meas. Phys. Educ. Exerc. Sci. 4:175–178. 2000.
8. Beckerman, H., T.W. Vogelaar, G.L. Lankhorst, and A.L.M. Verbeek. A criterion for stability of the motor function of the lower extremity in stroke patients using the Fugl-Meyer assessment scale. Scand. J. Rehabil. Med. 28:3–7. 1996.
9. Bedard, M., N.J. Martin, P. Krueger, and K. Brazil. Assessing reproducibility of data obtained with instruments based on continuous measurements. Exp. Aging Res. 26:353–365. 2000.
10. Bland, J.M., and D.G. Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1:307–310. 1986.
11. Brozek, J., and H. Alexander. Components of variance and the consistency of repeated measurements. Res. Q. 18:152–166. 1947.
12. Bruton, A., J.H. Conway, and S.T. Holgate. Reliability: What is it and how is it measured? Physiotherapy 86:94–99. 2000.
13. Charter, R.A. Revisiting the standard error of measurement, estimate, and prediction and their application to test scores. Percept. Mot. Skills 82:1139–1144. 1996.
14. Charter, R.A. Effect of measurement error on tests of statistical significance. J. Clin. Exp. Neuropsychol. 19:458–462. 1997.
15. Charter, R.A., and L.S. Feldt. Meaning of reliability in terms of correct and incorrect clinical decisions: The art of decision making is still alive. J. Clin. Exp. Neuropsychol. 23:530–537. 2001.
16. Charter, R.A., and L.S. Feldt. The importance of reliability as it relates to true score CIs. Meas. Eval. Counseling Dev. 35:104–112. 2002.
17. Chinn, S. Repeatability and method comparison. Thorax 46:454–456. 1991.
18. Chinn, S., and P.G. Burney. On measuring repeatability of data from self-administered questionnaires. Int. J. Epidemiol. 16:121–127. 1987.
19. Dudek, F.J. The continuing misinterpretation of the standard error of measurement. Psychol. Bull. 86:335–337. 1979.
20. Eliasziw, M., S.L. Young, M.G. Woodbury, and K. Fryday-Field. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Phys. Ther. 74:777–788. 1994.
21. Feldt, L.S., and M.E. McKee. Estimation of the reliability of skill tests. Res. Q. 29:279–293. 1958.
22. Fleiss, J.L. The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons, 1986.
23. Guyatt, G., S. Walter, and G. Norman. Measuring change over time: assessing the usefulness of evaluative instruments. J. Chronic Dis. 40:171–178. 1987.
24. Harvill, L.M. Standard error of measurement. Educ. Meas. Issues Pract. 10:33–41. 1991.
25. Hebert, R., D.J. Spiegelhalter, and C. Brayne. Setting the minimal metrically detectable change on disability rating scales. Arch. Phys. Med. Rehabil. 78:1305–1308. 1997.
26. Hopkins, W.G. Measures of reliability in sports medicine and science. Sports Med. 30:375–381. 2000.
27. Keating, J., and T. Matyas. Unreliable inferences from reliable measurements. Aust. Physiother. 44:5–10. 1998.
28. Keppel, G. Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.
29. Kroll, W. A note on the coefficient of intraclass correlation as an estimate of reliability. Res. Q. 33:313–316. 1962.
30. Lahey, M.A., R.G. Downey, and F.E. Saal. Intraclass correlations: there's more than meets the eye. Psychol. Bull. 93:586–595. 1983.
31. Liba, M. A trend test as a preliminary to reliability estimation. Res. Q. 38:245–248. 1962.
32. Looney, M.A. When is the intraclass correlation coefficient misleading? Meas. Phys. Educ. Exerc. Sci. 4:73–78. 2000.
33. Ludbrook, J. Statistical techniques for comparing measures and methods of measurement: A critical review. Clin. Exp. Pharmacol. Physiol. 29:527–536. 2002.
34. McGraw, K.O., and S.P. Wong. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1:30–46. 1996.
35. Morrow, J.R., and A.W. Jackson. How "significant" is your reliability? Res. Q. Exerc. Sport 64:352–355. 1993.
36. Nichols, C.P. Choosing an intraclass correlation coefficient. Available at: www.spss.com/tech/stat/articles/whichicc.htm. Accessed 1998.
37. Nitschke, J.E., J.M. McMeeken, H.C. Burry, and T.A. Matyas. When is a change a genuine change? A clinically meaningful interpretation of grip strength measurements in healthy and disabled women. J. Hand Ther. 12:25–30. 1999.
38. Nunnally, J.C., and I.H. Bernstein. Psychometric Theory (3rd ed.). New York: McGraw-Hill, 1994.
39. Olds, T. Five errors about error. J. Sci. Med. Sport 5:336–340. 2002.
40. Perkins, D.O., R.J. Wyatt, and J.J. Bartko. Penny-wise and pound-foolish: The impact of measurement error on sample size requirements in clinical trials. Biol. Psychiatry 47:762–766. 2000.
41. Portney, L.G., and M.P. Watkins. Foundations of Clinical Research (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
42. Roebroeck, M.E., J. Harlaar, and G.J. Lankhorst. The application of generalizability theory to reliability assessment: An illustration using isometric force measurements. Phys. Ther. 73:386–401. 1993.
43. Rousson, V., T. Gasser, and B. Seifert. Assessing intrarater, interrater, and test-retest reliability of continuous measurements. Stat. Med. 21:3431–3446. 2002.
44. Safrit, M.J.E. Reliability Theory. Washington, DC: American Alliance for Health, Physical Education, and Recreation, 1976.
45. Shrout, P.E. Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7:301–317. 1998.
46. Shrout, P.E., and J.L. Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86:420–428. 1979.
47. Stratford, P. Reliability: consistency or differentiating between subjects? [Letter]. Phys. Ther. 69:299–300. 1989.
48. Stratford, P.W., and C.H. Goldsmith. Use of standard error as a reliability index of interest: An applied example using elbow flexor strength data. Phys. Ther. 77:745–750. 1997.
49. Streiner, D.L., and G.R. Norman. Measurement Scales: A Practical Guide to Their Development and Use (2nd ed.). Oxford: Oxford University Press, 1995. pp. 104–127.
50. Thomas, J.R., and J.K. Nelson. Research Methods in Physical Activity (2nd ed.). Champaign, IL: Human Kinetics, 1990. p. 352.
51. Traub, R.E., and G.L. Rowley. Understanding reliability. Educ. Meas. Issues Pract. 10:37–45. 1991.
52. Walter, S.D., M. Eliasziw, and A. Donner. Sample size and optimal designs for reliability studies. Stat. Med. 17:101–110. 1998.
Acknowledgments
I am indebted to Lee Brown, Joel Cramer, Bryan Heiderscheit,
Terry Housh, and Bob Oppliger for their helpful comments on
drafts of the paper.
Address correspondence to Dr. Joseph P. Weir,
joseph.weir@dmu.edu.