Journal of Strength and Conditioning Research, 2005, 19(1), 231–240
© 2005 National Strength & Conditioning Association
QUANTIFYING TEST-RETEST RELIABILITY USING
THE INTRACLASS CORRELATION COEFFICIENT
AND THE SEM
JOSEPH P. WEIR
Applied Physiology Laboratory, Division of Physical Therapy, Des Moines University—Osteopathic Medical
Center, Des Moines, Iowa 50312.
ABSTRACT. Weir, J.P. Quantifying test-retest reliability using
the intraclass correlation coefficient and the SEM. J. Strength
Cond. Res. 19(1):231–240. 2005.—Reliability, the consistency of
a test or measurement, is frequently quantified in the movement
sciences literature. A common metric is the intraclass correla-
tion coefficient (ICC). In addition, the SEM, which can be cal-
culated from the ICC, is also frequently reported in reliability
studies. However, there are several versions of the ICC, and con-
fusion exists in the movement sciences regarding which ICC to
use. Further, the utility of the SEM is not fully appreciated. In
this review, the basics of classic reliability theory are addressed
in the context of choosing and interpreting an ICC. The primary
distinction between ICC equations is argued to be one concern-
ing the inclusion (equations 2,1 and 2,k) or exclusion (equations
3,1 and 3,k) of systematic error in the denominator of the ICC
equation. Inferential tests of mean differences, which are per-
formed in the process of deriving the necessary variance com-
ponents for the calculation of ICC values, are useful to deter-
mine if systematic error is present. If so, the measurement
schedule should be modified (removing trials where learning
and/or fatigue effects are present) to remove systematic error,
and ICC equations that only consider random error may be safe-
ly used. The use of ICC values is discussed in the context of
estimating the effects of measurement error on sample size, sta-
tistical power, and correlation attenuation. Finally, calculation
and application of the SEM are discussed. It is shown how the
SEM and its variants can be used to construct confidence inter-
vals for individual scores and to determine the minimal differ-
ence needed to be exhibited for one to be confident that a true
change in performance of an individual has occurred.
KEY WORDS. reproducibility, precision, error, consistency, SEM,
intraclass correlation coefficient
Reliability refers to the consistency of a test or measurement. For a seemingly simple concept, the quantifying of reliability and interpretation of the resulting numbers are surprisingly unclear in the biomedical literature in general (49) and in the sport sciences literature in particular. Part of this stems from the fact that reliability can be assessed in a variety of different contexts. In the sport sciences, we are most often interested in simple test-retest reliability; this is what Fleiss (22) refers to as a simple reliability study. For example, one might be interested in the reliability of 1 repetition maximum (1RM) squat measures taken on the same athletes over different days. However, if one is interested in the ability of different testers to get the same results from the same subjects on skinfold measurements, one is now interested in the interrater reliability. The quantifying of reliability in these different situations is not necessarily the same, and the decisions regarding how to calculate reliability in these different contexts have not been adequately addressed in the sport sciences literature. In this article, I focus on test-retest reliability (but not limited in the number of retest trials). In addition, I discuss data measured on a continuous scale.
Confusion also stems from the jargon used in the con-
text of reliability, i.e., consistency, precision, repeatabili-
ty, and agreement. Intuitively, these terms describe the
same concept, but in practice some are operationalized
differently. Notably, reliability and agreement are not
synonymous (30, 49). Further, reliability, conceptualized
as consistency, consists of both absolute consistency and
relative consistency (44). Absolute consistency concerns
the consistency of scores of individuals, whereas relative
consistency concerns the consistency of the position or
rank of individuals in the group relative to others. In the
fields of education and psychology, the term reliability is
operationalized as relative consistency and quantified us-
ing reliability coefficients called intraclass correlation co-
efficients (ICCs) (49). Issues regarding quantifying ICCs
and their interpretation are discussed in the first half of
this article. Absolute consistency, quantified using the
SEM, is addressed in the second half of the article. In
brief, the SEM is an indication of the precision of a score,
and its use allows one to construct confidence intervals
(CIs) for scores.
Another confusing aspect of reliability calculations is
that a variety of different procedures, besides ICCs and
SEM, have been used to determine reliability. These in-
clude the Pearson r, the coefficient of variation, and the
LOA (Bland-Altman plots). The Pearson product moment
correlation coefficient (Pearson r) was often used in the
past to quantify reliability, but the use of the Pearson r
is typically discouraged for assessing test-retest reliabil-
ity (7, 9, 29, 33, 44); however, this recommendation is not
universal (43). The primary, although not exclusive,
weakness of the Pearson r is that it cannot detect system-
atic error. More recently, the limits of agreement (LOA)
described by Bland and Altman (10) have come into vogue
in the biomedical literature (2). The LOA will not be ad-
dressed in detail herein other than to point out that the
procedure was developed to examine agreement between
2 different techniques of quantifying some variable (so-
called method comparison studies, e.g., one could compare
testosterone concentration using 2 different bioassays),
not reliability per se. The use of LOA as an index of re-
liability has been criticized in detail elsewhere (26, 49).
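The point that the Pearson r cannot detect systematic error can be checked numerically. The sketch below is only an illustration: the scores are hypothetical 1RM values, the 15 kg shift is an assumed learning effect, and `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation for two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical 1RM squat scores (kg) for 8 athletes on day 1.
day1 = [146, 148, 170, 90, 157, 156, 176, 205]
# Hypothetical day-2 scores with only small, unsystematic fluctuations.
day2 = [140, 152, 152, 99, 145, 153, 167, 218]
# The same day-2 scores plus a uniform 15 kg gain (assumed learning effect).
day2_shifted = [score + 15 for score in day2]

r_plain = pearson_r(day1, day2)
r_shifted = pearson_r(day1, day2_shifted)
mean_change = sum(day2_shifted) / len(day2_shifted) - sum(day1) / len(day1)

print(f"r without shift: {r_plain:.3f}")
print(f"r with shift:    {r_shifted:.3f}")
print(f"mean change with shift: {mean_change:.2f} kg")
```

Under these assumptions the two correlations are identical even though every athlete's second score has moved by the same amount, which is why a test of mean differences is needed alongside any correlation-type index.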
In this article, the ICC and SEM will be the focus.
Unfortunately, there is considerable confusion concerning
both the calculation and interpretation of the ICC. In-
deed, there are 6 common versions of the ICC (and others
as well), and the choice of which version to use is not
intuitively obvious. Similarly, the SEM, which is inti-
mately related to the ICC, has useful applications that
are not fully appreciated by practitioners in the move-
ment sciences. The purposes of this article are to provide
information on the choice and application of the ICC and
to encourage practitioners to use the SEM in the inter-
pretation of test data.
For a group of measurements, the total variance ($\sigma^2_T$) in the data can be thought of as being due to true score variance ($\sigma^2_t$) and error variance ($\sigma^2_e$). Similarly, each observed score is composed of the true score and error (44). The theoretical true score of an individual reflects the mean of an infinite number of scores from a subject, whereas error equals the difference between the true score and the observed score (21). Sources of error include errors due to biological variability, instrumentation, error by the subject, and error by the tester. If we make a ratio of the $\sigma^2_t$ to the $\sigma^2_T$ of the observed scores, where $\sigma^2_T$ equals $\sigma^2_t$ plus $\sigma^2_e$, we have the following reliability coefficient:
$$R = \frac{\sigma^2_t}{\sigma^2_t + \sigma^2_e}. \tag{1}$$
The closer this ratio is to 1.0, the higher the reliability and the lower the $\sigma^2_e$. Since we do not know the true score for each subject, an index of the $\sigma^2_t$ is used based on between-subjects variability, i.e., the variance due to how subjects differ from each other. In this context, reliability (relative consistency) is formally defined (5, 21, 49) as follows:
$$R = \frac{\text{between-subjects variability}}{\text{between-subjects variability} + \text{error}}. \tag{2}$$
The reliability coefficient in Equation 2 is quantified by various ICCs. So although reliability is conceptually aligned with terms such as reproducibility, repeatability, and agreement, it is defined as above. The necessary variance estimates are derived from analysis of variance (ANOVA), where appropriate mean square values are recorded from the computer printout. Specifically, the various ICCs can be calculated from mean square values derived from a within-subjects, single-factor ANOVA (i.e., a repeated-measures ANOVA).
The ICC is a relative measure of reliability (18) in that it is a ratio of variances derived from ANOVA, is unitless, and is more conceptually akin to $R^2$ from regression (43) than to the Pearson r. The ICC can theoretically vary between 0 and 1.0, where an ICC of 0 indicates no reliability, whereas an ICC of 1.0 indicates perfect reliability. In practice, ICCs can extend beyond the range of 0 to 1.0 (30), although with actual data this is rare. The relative nature of the ICC is reflected in the fact that the magnitude of an ICC depends on the between-subjects variability (as shown in the next section). That is, if subjects differ little from each other, ICC values are small even if trial-to-trial variability is small. If subjects differ from each other a lot, ICCs can be large even if trial-to-trial variability is large. Thus, the ICC for a test is context specific (38, 51). As noted by Streiner and Norman (49), "There is literally no such thing as the reliability of a test, unqualified; the coefficient has meaning only when applied to specific populations." Further, it is intuitive that small differences between individuals are more difficult to detect than large ones, and the ICC is reflective of this.
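The context dependence of the reliability coefficient can be shown with a small calculation based on Equation 2. The variance values below are hypothetical and chosen only for illustration; the function name `reliability` is likewise an assumption of this sketch.

```python
def reliability(between_subjects_var, error_var):
    """Equation 2: between-subjects variance over (between-subjects + error) variance."""
    return between_subjects_var / (between_subjects_var + error_var)

# The same (hypothetical) error variance is applied to two groups.
error_var = 25.0

# A heterogeneous group: subjects differ a lot from each other.
heterogeneous = reliability(between_subjects_var=1000.0, error_var=error_var)
# A homogeneous group: subjects differ little from each other.
homogeneous = reliability(between_subjects_var=50.0, error_var=error_var)

print(f"heterogeneous group: R = {heterogeneous:.3f}")
print(f"homogeneous group:   R = {homogeneous:.3f}")
```

With identical trial-to-trial error, the coefficient is noticeably lower for the homogeneous group, so the same test can look more or less reliable depending on who is measured.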
Error is typically considered as being of 2 types: sys-
tematic error (e.g., bias) and random error (2, 39). (Gen-
eralizability theory expands sources of error to include
various facets of interest but is beyond the scope of this
article.) Total error reflects both systematic error and
random error (imprecision). Systematic error includes
both constant error and bias (38). Constant error affects
all scores equally, whereas bias is systematic error that
affects certain scores differently than others. For physical
performance measures, the distinction between constant
error and bias is relatively unimportant and the focus
here on systematic error is on situations that result in a
unidirectional change in scores on repeated testing. In
testing of physical performance, subjects may improve
their test scores simply due to learning effects, e.g., per-
forming the first test serves as practice for subsequent
tests, or fatigue or soreness may result in poorer perfor-
mance across trials. In contrast, random error refers to
sources of error that are due to chance factors. Factors
such as luck, alertness, attentiveness by the tester, and
normal biological variability affect a particular score.
Such errors should, in a random manner, both increase
and decrease test scores on repeated testing. Thus, we
can expand Equation 1 as follows:
$$R = \frac{\sigma^2_t}{\sigma^2_t + \sigma^2_{se} + \sigma^2_{re}}, \tag{3}$$
where $\sigma^2_t$ is the true score variance, $\sigma^2_{se}$ is the systematic error variance, and $\sigma^2_{re}$ is the random error variance.
It has been argued that systematic error is a concern of validity and not reliability (12, 43). Similarly, systematic error (e.g., learning effects, fatigue) has been suggested to be a natural phenomenon and therefore does not contribute to unreliability per se in test-retest situations (43). Thus, there is a school of thought that suggests that only random error should be assessed in reliability calculations. Under this analysis, the error term in the denominator will only reflect random error and not systematic error, increasing the size of reliability coefficients. The issue of inclusion of systematic error in the determination of reliability coefficients is addressed in a subsequent section.
The Basic Calculations
The calculation of reliability starts with the performance
of a repeated-measures ANOVA. This analysis performs
2 functions. First, the inferential test of mean differences
across trials is an assessment of systematic error (trend).
Second, all of the subsequent calculations can be derived
from the output from this ANOVA. In keeping with the
nomenclature of Keppel (28), the ANOVA that is used is
of a single-factor, within-subjects (repeated-measures) design. Unfortunately, the language gets a bit tortured in many sources, because the different ICC models are referred to as either 1-way or 2-way models; what is important to keep in mind is that both the 1-way and 2-way ICC models can be derived from the same single-factor, within-subjects ANOVA.
TABLE 1. Example data set. Mean ± SD: trial A1, 156 ± 33; trial A2, 153 ± 33; trial B1, 156 ± 8; trial B2, 153 ± 13.
TABLE 2. Two-way analysis of variance summary table for data set A.*
2098.4 ($MS_B$: 1-way)
* $MS_B$ = between-subjects mean square; $MS_E$ = error mean square; $MS_S$ = subjects mean square; $MS_T$ = trials mean square; $MS_W$ = within-subjects mean square; SS = sums of squares.
To illustrate the calculations, example data are pre-
sented in Table 1. ANOVA summary tables are presented
in Tables 2 and 3, and the resulting ICCs are presented
in Table 4. Focus on the first two columns of Table 1,
which are labeled trial A1 and trial A2. As can be seen,
there are 2 sets (columns) of scores, and each set has 8
scores. In this example, each of 8 subjects has provided a
score in each set. Assume that each set of scores repre-
sents the subjects’ scores on the 1RM squat across 2 dif-
ferent days (trials). A repeated-measures ANOVA is per-
formed to primarily test whether the 2 sets of scores are
significantly different from each other (i.e., do the scores
systematically change between trials) and is summarized
in Table 2. Equivalently, one could have used a paired t-
test, since there were only 2 levels of trials. However, the
ANOVA is applicable to situations with 2 or more trials
and is consistent with the ICC literature in defining
sources of variance for ICC calculations. Note that there
are 3 sources of variability in Table 2: subjects, trials, and
error. In a repeated-measures ANOVA such as this, it is
helpful to remember that this analysis might be consid-
ered as having 2 factors: the primary factor of trials and
a secondary factor called subjects (with a sample size of
1 subject per cell). The error term includes the interaction
effect of trials by subjects. It is useful to keep these sourc-
es of variability in mind for 2 reasons. First, the 1-way
and 2-way models of the ICC (6, 44) either collapse the
variability due to trials and error together (1-way models)
or keep them separate (2-way models). Note that the tri-
als and error sources of variance, respectively, reflect the
systematic and random sources of error in the
reliability coefficient. These differences are illustrated in
Table 2, where the df and sums of squares values for error
in the 1-way model (within-subjects source) are simply
the sum of the respective values for trials and error in
the 2-way model.
Second, unlike a between-subjects ANOVA where the
‘‘noise’’ due to different subjects is part of the error term,
the variability due to subjects is now accounted for (due
to the repeated testing) and therefore not a part of the
error term. Indeed, for the calculation of the ICC, the nu-
merator (the signal) reflects the variance due to subjects.
Since the error term of the ANOVA reflects the interac-
tion between subjects and trials, the error term is small
in situations where all the subjects change similarly
across test days. In situations where subjects do not
change in a similar manner across test days (e.g., some
subjects’ scores increase, whereas others decrease), the
error term is large. In the former situation, even small
differences across test days, as long as they are consistent
across all the subjects, can result in a statistically significant effect for trials. In this example, however, the effect for trials is not statistically significant (p = 0.49), indicating that there is no statistically significant systematic error in the data. It should be kept in mind, however, that the statistical power of the test of mean differences between trials is affected by sample size and random error. Small sample sizes and noisy data (i.e., high random error) will decrease power and potentially hide systematic error. Thus, an inferential test of mean differences alone is insufficient to quantify reliability. Further, the effect for trials ought to be evaluated with a more liberal α level, since in this case, the implications of a type 2 error are more severe than a type 1 error. In
cases where systematic error is present, it may be pru-
dent to change the measurement schedule (e.g., add trials
if a learning effect is present or increase rest intervals if
fatigue is present) to compensate for the bias.
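The partition of variance described above can be sketched in a few lines of plain Python. The per-subject scores below are assumed, illustrative values chosen to be consistent with the summary statistics reported for trials A1 and A2 (156 ± 33 and 153 ± 33); they are not taken from Table 1, and the function name is hypothetical.

```python
def repeated_measures_anova(scores):
    """Single-factor, within-subjects ANOVA for a subjects x trials table.

    `scores` holds one row per subject and one column per trial.
    Returns mean squares for the 2-way partition (subjects, trials, error)
    and for the 1-way partition (within subjects), plus the F ratio for trials.
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of trials
    grand_mean = sum(sum(row) for row in scores) / (n * k)

    subject_means = [sum(row) / k for row in scores]
    trial_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_trials = n * sum((m - grand_mean) ** 2 for m in trial_means)
    ss_error = ss_total - ss_subjects - ss_trials   # subjects x trials interaction
    ss_within = ss_trials + ss_error                # 1-way model pools trials and error

    result = {
        "MSS": ss_subjects / (n - 1),            # subjects (same value as MSB in the 1-way model)
        "MST": ss_trials / (k - 1),              # trials: systematic error
        "MSE": ss_error / ((n - 1) * (k - 1)),   # error: random error
        "MSW": ss_within / (n * (k - 1)),        # within subjects (1-way error term)
    }
    result["F_trials"] = result["MST"] / result["MSE"]
    return result

# Assumed scores for 8 subjects on 2 trials (illustrative only).
trial_1 = [146, 148, 170, 90, 157, 156, 176, 205]
trial_2 = [140, 152, 152, 99, 145, 153, 167, 218]
table = [list(pair) for pair in zip(trial_1, trial_2)]

anova = repeated_measures_anova(table)
for name, value in anova.items():
    print(f"{name}: {value:.3f}")
```

With these assumed scores the subjects mean square comes out near 2098.5, close to the 2098.4 shown for the 1-way between-subjects mean square in Table 2, and the F ratio for trials is well below 1, in line with the nonsignificant trials effect described in the text.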
Shrout and Fleiss (46) have presented 6 forms of the
ICC. This system has taken hold in the physical therapy
literature. However, the specific nomenclature of their
system does not seem to be as prevalent in the exercise
physiology, kinesiology, and sport science literature,
which has instead either ignored which model is used or focused on ICC terms that are centered on either 1-way or 2-way ANOVA models (6, 44). Nonetheless, the ICC models of Shrout and Fleiss (46) overlap with the 1-way and 2-way models presented by Safrit (44) and Baumgartner (6).
TABLE 3. Analysis of variance summary table for data set B.*
* $MS_B$ = between-subjects mean square; $MS_E$ = error mean square; $MS_S$ = subjects mean square; $MS_T$ = trials mean square; $MS_W$ = within-subjects mean square; SS = sums of squares.
Three general models of the ICC are present in the
Shrout and Fleiss (46) nomenclature, which are labeled
1, 2, and 3. Each model can be calculated 1 of 2 ways. If
the scores in the analysis are from single scores from each
subject for each trial (or rater if assessing interrater re-
liability), then the ICC is given a second designation of 1.
If the scores in the analysis represent the average of the
k scores from each subject (i.e., the average across the
trials), then the ICC is given a second designation of k.
In this nomenclature then, an ICC with a model desig-
nation of 2,1 indicates an ICC calculated using model 2
with single scores. The use of these models is typically
presented in the context of determining rater reliability
(41). For model 1, each subject is assumed to be assessed
by a different set of raters than other subjects, and these
raters are assumed to be randomly sampled from the pop-
ulation of possible raters so that raters are a random ef-
fect. Model 2 assumes each subject was assessed by the
same group of raters, and these raters were randomly
sampled from the population of possible raters. In this
case, raters are also considered a random effect. Model 3
assumes each subject was assessed by the same group of
raters, but these particular raters are the only raters of
interest, i.e., one does not wish to generalize the ICCs
beyond the confines of the study. In this case, the analysis
attempts to determine the reliability of the raters used
by that particular study, and raters are considered a fixed effect.
The 1-way ANOVA models (6, 44) coincide with model
1,k for situations where scores are averaged and model
1,1 for single scores for a given trial (or rater). Further,
ICC 1,1 coincides with the 1-way ICC model described by
Bartko (3, 4), and ICC 1,k has also been termed the
Spearman Brown prediction formula (4). Similarly, ICC
values derived from single and averaged scores calculated
using the 2-way approach (6, 44) coincide with models 3,1
and 3,k, respectively. Calculations coincident with models
2,1 and 2,k were not reported by Baumgartner (6) or Safrit (44).
More recently, McGraw and Wong (34) expanded the
Shrout and Fleiss (46) system to include 2 more general
forms, each also with a single score or average score ver-
sion, resulting in 10 ICCs. These ICCs have now been
incorporated into SPSS statistical software starting with
version 8.0 (36). Fortunately, 4 of the computational for-
mulas of Shrout and Fleiss (46) also apply to the new
forms of McGraw and Wong (34), so the total number of
formulas is not different.
The computational formulas for the ICC models of
Shrout and Fleiss (46) and McGraw and Wong (34) are
summarized in Table 5. Unfortunately, it is not intuitive-
ly obvious how the computational formulas reflect the in-
tent of equations 1 through 3. This stems from the fact
that the computational formulas reported in most sources
are derived from algebraic manipulations of basic equa-
tions where mean square values from ANOVA are used
to estimate the various $\sigma^2$ values reflected in equations 1 through 3. To illustrate, the manipulations for ICC 1,1 (random-effects, 1-way ANOVA model) are shown herein. First, the computational formula for ICC 1,1 is as follows:
$$\text{ICC 1,1} = \frac{MS_B - MS_W}{MS_B + (k - 1)MS_W}, \tag{4}$$
where $MS_B$ indicates the between-subjects mean square, $MS_W$ indicates the within-subjects mean square, and k is the number of trials (3, 46). The relevant mean square values can be found in Table 2. To relate this computational formula to equation 1, one must know that estimation of the appropriate $\sigma^2$ comes from expected mean squares from ANOVA. Specifically, for this model the expected $MS_B$ equals the error variance $\sigma^2_e$ plus $k\sigma^2_s$, whereas the expected $MS_W$ equals $\sigma^2_e$ (3); therefore, $MS_B$ equals $MS_W$ plus $k\sigma^2_s$. If from equation 1 we estimate the true score variance ($\sigma^2_t$) from between-subjects variance ($\sigma^2_s$), then
$$\text{ICC} = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_e}. \tag{5}$$
By algebraic manipulation (e.g., $\sigma^2_s = [MS_B - MS_W]/k$) and substitution of the expected mean squares into equation 5, it can be shown that
$$\text{ICC 1,1} = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_e} = \frac{(MS_B - MS_W)/k}{(MS_B - MS_W)/k + MS_W} = \frac{MS_B - MS_W}{MS_B + (k - 1)MS_W}. \tag{6}$$
Similar derivations can be made for the other ICC models (3, 34, 46, 49) so that all ultimately relate to equation 1. Of note is that with the different ICC models (fixed vs. random effects, 1-way vs. 2-way ANOVA), the expected mean squares change and thus the computational formulas commonly found in the literature (30, 41) also change.
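The equivalence between the variance-component form and the computational formula for ICC 1,1 can be verified numerically. In this sketch the between-subjects mean square (2098.4) is the value that survives in Table 2 and k = 2 follows the two-trial example; the within-subjects mean square of 53.75 is an assumed value, and both function names are hypothetical.

```python
def icc_1_1(ms_b, ms_w, k):
    """Computational formula: (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)

def icc_from_variance_components(ms_b, ms_w, k):
    """Same coefficient built from variance components estimated via expected mean squares."""
    var_subjects = (ms_b - ms_w) / k   # from E[MSB] = sigma^2_e + k * sigma^2_s
    var_error = ms_w                   # from E[MSW] = sigma^2_e
    return var_subjects / (var_subjects + var_error)

# MSB as reported in Table 2; MSW is an assumed value for illustration.
ms_b, ms_w, k = 2098.4, 53.75, 2

direct = icc_1_1(ms_b, ms_w, k)
via_components = icc_from_variance_components(ms_b, ms_w, k)

print(f"computational formula: {direct:.4f}")
print(f"variance components:   {via_components:.4f}")
```

Both routes return the same number, which is the sense in which the computational formula is an algebraic rearrangement of the variance ratio.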
TABLE 4. ICC values for data sets A and B.*
Columns: ICC type, data set A, data set B.
* ICC = intraclass correlation coefficient.
TABLE 5. Intraclass correlation coefficient model summary table.*
Shrout and Fleiss | Computational formula | McGraw and Wong | Model
1,1 | $(MS_B - MS_W) / [MS_B + (k - 1)MS_W]$ | 1 | 1-way random
1,k | $(MS_B - MS_W) / MS_B$ | k | 1-way random
2,1 | $(MS_S - MS_E) / [MS_S + (k - 1)MS_E + k(MS_T - MS_E)/n]$ | A,1 | 2-way random
2,k | $(MS_S - MS_E) / [MS_S + (MS_T - MS_E)/n]$ | A,k | 2-way random
3,1 | $(MS_S - MS_E) / [MS_S + (k - 1)MS_E]$ | C,1 | 2-way fixed
3,k | $(MS_S - MS_E) / MS_S$ | C,k | 2-way fixed
* Adapted from Shrout and Fleiss (46) and McGraw and Wong (34). Mean square abbreviations are based on the 1-way and 2-way analysis of variance illustrated in Table 2. For McGraw and Wong, A = absolute and C = consistency. $MS_B$ = between-subjects mean square; $MS_E$ = error mean square; $MS_S$ = subjects mean square; $MS_T$ = trials mean square; $MS_W$ = within-subjects mean square; k = number of trials; n = number of subjects.
TABLE 6. Example data set with systematic error. Mean ± SD: trial C1, 156 ± 33; trial C2, 172 ± 35.
Choosing an ICC
Given the 6 ICC versions of Shrout and Fleiss (46) and the 10 versions presented by McGraw and Wong (34), the choice of ICC is perplexing, especially considering that most of the literature deals with rater reliability, not test-retest reliability of physical performance measures. In a
classic paper, Brozek and Alexander (11) first introduced
the concept of the ICC to the movement sciences litera-
ture and detailed the implementation of an ICC for ap-
plication to test-retest analysis of motor tasks. Their co-
efficient is equivalent to model 3,1. Thus, one might use
ICC 3,1 with test-retest reliability where trials is substi-
tuted for raters. From the rater nomenclature above, if
one does not wish to generalize the reliability findings but
rather assert that in our hands the procedures are reli-
able, then ICC 3,1 seems like a logical choice. However,
this ICC does not include variance associated with sys-
tematic error and is in fact closely approximated by the
Pearson r (1, 43). Therefore, the criticism of the Pearson
r as an index of reliability holds as well for ICCs derived
from model 3. At the least, it needs to be established that
the effect for trials (bias) is trivial if reporting an ICC
derived from model 3. Use of effect size for the trials effect
in the ANOVA would provide information in this regard.
With respect to ICC 3,1, Alexander (1) notes that it ‘‘may
be regarded as an estimate of the value that would have
been obtained if the fluctuation [systematic error] had
In a more general sense, there are 4 issues to be ad-
dressed in choosing an ICC: (a) 1- or 2-way model, (b)
fixed- or random-effect model, (c) include or exclude sys-
tematic error in the ICC, and (d) single or mean score.
With respect to choosing a 1- or 2-way model, in a 1-way
model, the effect of raters or trials (replication study) is
not crossed with subjects, meaning that it allows for sit-
uations where all raters do not score all subjects (48).
Fleiss (22) uses the 1-way model for what he terms simple
replication studies. In this model, all sources of error are
lumped together into the MSW (Tables 2 and 3). In contrast,
the 2-way models allow the error to be partitioned
between systematic and random error. When systematic
error is small, MSW from the 1-way model and error mean
square (MSE) from the 2-way models (reflecting random
error) are similar, and the resulting ICCs are similar.
This is true for both data sets A and B. When systematic
error is substantial, MSW and MSE are disparate, as in
data set C (Tables 6 and 7). Two-way models require tri-
als or raters to be crossed with subjects (i.e., subjects pro-
vide scores for all trials or each rater rates all subjects).
For test-retest situations, the design dictates that trials
are crossed with subjects and therefore lend themselves
to analysis by 2-way models.
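The partition described above can be sketched in a few lines of Python. This is a minimal illustration rather than the computation behind the article's tables: the function name `anova_mean_squares` and the four-subject data set are hypothetical, and the design is assumed to be fully crossed (every subject has a score on every trial).

```python
from typing import Dict, List


def anova_mean_squares(scores: List[List[float]]) -> Dict[str, float]:
    """Return 1-way and 2-way mean squares for a subjects x trials table.

    scores[i][j] is the score of subject i on trial j (fully crossed design).
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of trials
    grand = sum(sum(row) for row in scores) / (n * k)
    subj_means = [sum(row) / k for row in scores]
    trial_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subj_means)
    ss_trials = n * sum((m - grand) ** 2 for m in trial_means)
    ss_within = ss_total - ss_subjects   # 1-way: all error lumped together
    ss_error = ss_within - ss_trials     # 2-way: random error only

    return {
        "MSB": ss_subjects / (n - 1),           # between subjects (1-way)
        "MSW": ss_within / (n * (k - 1)),       # within subjects (1-way)
        "MSS": ss_subjects / (n - 1),           # subjects (2-way), same value as MSB
        "MST": ss_trials / (k - 1),             # trials (2-way)
        "MSE": ss_error / ((n - 1) * (k - 1)),  # error (2-way)
    }


# Hypothetical scores (kg) for 4 subjects on 2 trials; not the article's data.
data = [[100.0, 104.0], [120.0, 122.0], [140.0, 147.0], [160.0, 163.0]]
ms = anova_mean_squares(data)
for name in ("MSB", "MSW", "MSS", "MST", "MSE"):
    print(f"{name} = {ms[name]:.2f}")
```

In these made-up scores, trial 2 averages 4 kg higher than trial 1, so the 1-way MSW (9.75) is clearly larger than the 2-way MSE (about 2.33): the 1-way term has absorbed the systematic shift.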
TABLE 7. Analysis of variance summary table for data set C.* [Table body not recovered; the only surviving entry reads "7 15,925 (MSB: 1-way)".]
* MSB = between-subjects mean square; MSE = error mean square; MSS = subjects mean square; MST = trials mean square; MSW = within-subjects mean square; SS = sums of squares.
Regarding fixed vs. random effects, a fixed factor is
one in which all levels of the factor of interest (in this
case trials) are included in the analysis and no attempt
at generalization of the reliability data beyond the con-
fines of the study is expected. Determining the reliability
of a test before using it in a larger study fits this descrip-
tion of fixed effect. A random factor is one in which the
levels of the factor in the design (trials) are but a sample
of the possible levels, and the analysis will be used to
generalize to other levels. For example, a study designed
to evaluate the test-retest reliability of the vertical jump
for use by other coaches (with similar athletes) would con-
sider the effect of trials to be a random effect. Both Shrout
and Fleiss (46) models 1 and 2 are random-effects models,
whereas model 3 is a fixed-effect model. From this dis-
cussion, for the 2-way models of Shrout and Fleiss (46),
the choice between model 2 and model 3 appears to hinge
on a decision regarding a random- vs. fixed-effects model.
However, models 2 and 3 also differ in their treatment of
systematic error. As noted previously, model 3 only con-
siders random error, whereas model 2 considers both ran-
dom and systematic error. This system does not include
a 2-way fixed-effects model that includes systematic error
and does not offer a 2-way random-effects model that only
considers random error. The expanded system of McGraw
and Wong (34) includes these options. In the nomencla-
ture of McGraw and Wong (34), the designation C refers
to consistency and A refers to absolute agreement. That
is, the C models consider only random error and the A
models consider both random and systematic error. As
noted in Table 5, no new computational formulas are re-
quired beyond those presented by Shrout and Fleiss (46).
Thus, if one were to choose a 2-way random-effects model
that only addressed random error, one would use equa-
tion 3,1 (or equation 3,k if the mean across k trials is the
criterion score). Similarly, if one were to choose a 2-way
fixed-effects model that addressed both systematic and
random error, equation 2,1 would be used (or 2,k). Ulti-
mately then, since the computational formulas do not dif-
fer between systems, the choice between using the Shrout
and Fleiss (46) equations from models 2 vs. 3 hinges on
decisions regarding inclusion or exclusion of systematic
error in the calculations. As noted by McGraw and Wong
(34), ‘‘the random-fixed effects distinction is in its effect
on the interpretation, but not calculation, of an ICC.’’
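Because the computational formulas are shared between the two systems, the single-score ICCs can be written as three small functions of the ANOVA mean squares. The sketch below uses the commonly cited Shrout and Fleiss single-score forms; the function names and the mean squares passed in are hypothetical illustrations, and the formulas should be checked against Table 5 before use.

```python
def icc_1_1(msb: float, msw: float, k: int) -> float:
    """1-way model, single score: all within-subject variation counts as error."""
    return (msb - msw) / (msb + (k - 1) * msw)


def icc_2_1(mss: float, mse: float, mst: float, n: int, k: int) -> float:
    """2-way model, single score: systematic (trials) error stays in the denominator."""
    return (mss - mse) / (mss + (k - 1) * mse + k * (mst - mse) / n)


def icc_3_1(mss: float, mse: float, k: int) -> float:
    """2-way model, single score: only random error appears in the denominator."""
    return (mss - mse) / (mss + (k - 1) * mse)


# Hypothetical mean squares for n = 4 subjects and k = 2 trials (not from the article).
n, k = 4, 2
msb = mss = 1349.0
msw, mst, mse = 9.75, 32.0, 7.0 / 3.0

r11 = icc_1_1(msb, msw, k)
r21 = icc_2_1(mss, mse, mst, n, k)
r31 = icc_3_1(mss, mse, k)
print(f"ICC 1,1 = {r11:.3f}, ICC 2,1 = {r21:.3f}, ICC 3,1 = {r31:.3f}")
```

With these inputs the trials mean square exceeds the error mean square, so ICC 3,1 comes out higher than ICC 2,1 and ICC 1,1. This is the pattern expected when systematic error is present.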
Should systematic error be included in the ICC? First,
if the effect for trials is small, the systematic differences
between trials will be small, and the ICCs will be similar
to each other. This is evident in both the A and B data
sets (Tables 1 through 3). However, if the mean differences
are large, then differences between ICCs are evident,
especially between equation 3,1, which does not consider
systematic error, and equations 1,1 and 2,1, which
do consider systematic error. In this regard, the F test for
trials and the ICC calculations may give contradictory re-
sults from the same data. Specifically, it can be the case
that an ICC can be large (indicating good reliability),
whereas the ANOVA shows a significant trials effect. An
example is given in Tables 6 and 7. In this example, each
score in trial C1 was altered in trial C2 so that there was
a bias of +15 kg and a random component added to each
score. The effect for trials was significant ($F_{1,7}$ = 203.85,
p < 0.001) and reflected a mean increase of 16 kg. For an
ANOVA to be significant, the effect must be large (in this
case, the mean differences between trials must be large),
the noise (error term) must be small, or both. The error
term is small when all subjects behave similarly across
test days. When this is the case, even small mean differ-
ences can be statistically significant. In this case, the sys-
tematic differences explain a significant amount of vari-
ability in the data. Despite the rather large systematic
error, the ICC values from equations 1,1; 2,1; and 3,1
were 0.896, 0.901, and 0.998, respectively. A cursory ex-
amination of just the ICC scores would suggest that the
test exhibited good reliability, especially using equation
3,1, which only reflects random error. However, an ap-
proximately 10% increase in scores from trial C1 to C2
would suggest otherwise. Thus, an analysis that only fo-
cuses on the ICC without consideration of the trials effect
is incomplete (31). If the effect for trials is significant, the
most straightforward approach is to develop a measure-
ment schedule that will attenuate systematic error (2,
50). For example, if learning effects are present, one
might add trials until a plateau in performance occurs.
Then the ICC could be calculated only on the trials in the
plateau region. The identification of such a measurement
schedule would be especially helpful for random-effects
situations where others might be using the test being
evaluated. For simplicity, all the examples here have
been with only 2 levels for trials. If a trials effect is sig-
nificant, however, 2 trials are insufficient to identify a
plateau. The possibility of a significant trials effect should
be considered in the design of the reliability study. For-
tunately, the ANOVA procedures require no modification
to accommodate any number of trials.
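For the 2-trial case used in the examples, the trials effect can be checked directly from the difference scores, because the repeated-measures F ratio equals the square of the paired t statistic. The sketch below is an illustration under that 2-trial assumption; the function name, the data, and the choice of partial eta squared as the effect size are hypothetical and are not taken from the article's tables.

```python
from statistics import mean, stdev
from typing import Dict, List


def trials_effect_two_trials(trial1: List[float], trial2: List[float]) -> Dict[str, float]:
    """F ratio and partial eta squared for the trials effect when k = 2."""
    diffs = [b - a for a, b in zip(trial1, trial2)]
    n = len(diffs)
    d_bar = mean(diffs)            # mean change between trials (bias)
    var_d = stdev(diffs) ** 2      # sample variance of the difference scores

    mst = n * d_bar ** 2 / 2       # trials mean square (df = 1)
    mse = var_d / 2                # error mean square (df = n - 1)
    f_ratio = mst / mse
    df_error = n - 1
    partial_eta_sq = f_ratio / (f_ratio + df_error)

    return {
        "mean_diff": d_bar,
        "MST": mst,
        "MSE": mse,
        "F": f_ratio,
        "df_error": float(df_error),
        "partial_eta_sq": partial_eta_sq,
    }


# Hypothetical scores (kg); trial 2 is systematically higher than trial 1.
trial1 = [100.0, 120.0, 140.0, 160.0]
trial2 = [104.0, 122.0, 147.0, 163.0]
result = trials_effect_two_trials(trial1, trial2)
print(result)
```

The F ratio would still need to be compared with a critical value for 1 and n − 1 degrees of freedom. The effect size is reported alongside it so that a small but statistically significant bias is not overinterpreted.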
Interpreting the ICC
At one level, interpreting the ICC is fairly straightforward;
it represents the proportion of variance in a set of
scores that is attributable to the true score variance.
An ICC of 0.95 means that an estimated 95% of the
observed score variance is due to true score variance.
The balance of the variance (1 − ICC = 5%) is
attributable to error (51). However, how does one
qualitatively evaluate the magnitude of an ICC and what can
the quantity tell you? Some sources have attempted to
delineate good, medium, and poor levels for the ICC, but
there is certainly no consensus as to what constitutes a
good ICC (45). Indeed, Charter and Feldt (15) argue that
‘‘it is not theoretically defensible to set a universal
standard for test score reliability.’’ These interpretations are
further complicated by 2 factors. First, as noted herein,
the ICC varies, depending on which version of the ICC is
used. Second, the magnitude of the ICC is dependent on
the variability in the data (45). All other things being
equal, low levels of between-subjects variability will serve
to depress the ICC even if the differences between sub-
jects’ scores across test conditions are small. This is illus-
trated by comparing the 2 example sets of data in Table
1. Trials 1 and 2 of data sets A and B have identical mean
values and identical change scores between trials 1 and
2. They differ in the variability between subjects, with
greater between-subjects variability evident in data set A
as shown in the larger SDs. In Tables 2 and 3, the AN-
OVA tables have identical outcomes with respect to the
inferential test of the factor trials and have identical error
terms (since the between-subjects variability is not part
of the error term, as noted previously). Table 4 shows the
ICC values calculated using the 6 different models of
Shrout and Fleiss (46) on the A and B data sets. Clearly,
data set B, with the lower between-subjects variability,
results in smaller ICC values than data set A.
How then does one interpret an ICC? First, because
of the relationship between the ICC and between-subjects
variability, the heterogeneity of the subjects should be
considered. A large ICC can mask poor trial-to-trial con-
sistency when between-subjects variability is high. Con-
versely, a low ICC can be found even when trial-to-trial
variability is low if the between-subjects variability is
low. In this case, the homogeneity of the subjects means
it will be difficult to differentiate between subjects even
though the absolute measurement error is small. An ex-
amination of the SEM in conjunction with the ICC is
therefore needed (32). From a practical perspective, a giv-
en test can have different reliability, at least as deter-
mined from the ICC, depending on the characteristics of
the individuals included in the analysis. In the 1RM
squat, combining individuals of widely different capabil-
ities (e.g., wide receivers and defensive linemen in Amer-
ican football) into the same analysis increases between-
subjects variability and improves the ICC, yet this may
not be reflected in the expected day-to-day variation as
illustrated in Tables 1 through 4. In addition, the infer-
ential test for bias described previously needs to be con-
sidered. High between-subjects variability may result in
a high ICC even if the test for bias is statistically significant.
The relationship between between-subjects variability
and the magnitude of the ICC has been used as a criti-
cism of the ICC (10, 39). This is an unfair criticism, since
the ICC is used to provide information regarding infer-
ential statistical tests not to provide an index of absolute
measurement error. In essence, the ICC normalizes mea-
surement error relative to the heterogeneity of the sub-
jects. As an index of absolute reliability then, this is a
weakness and other indices (i.e., the SEM) are more in-
formative. As a relative index of reliability, the ICC be-
haves as intended.
What are the implications of a low ICC? First, mea-
surement error reflected in an ICC of less than 1.0 serves
to attenuate correlations (22, 38). The equation for this
attenuation effect is as follows:
$$r_{xy} = \hat{r}_{xy}\sqrt{ICC_x \, ICC_y}$$
where $r_{xy}$ is the observed correlation between x and y, $\hat{r}_{xy}$
is the correlation between x and y if both were measured
without error (i.e., the correlations between the true
scores), and $ICC_x$ and $ICC_y$ are the reliability coefficients
for x and y, respectively. Nunnally and Bernstein (38)
note that the effect of measurement error on correlation
attenuation becomes minimal as ICCs increase above
0.80. In addition, reliability affects the power of statistical
tests. Specifically, the lower the reliability, the greater
the risk of type 2 error (14, 40). Fleiss (22) illustrates how
the magnitude of an ICC can be used to adjust sample
size and statistical power calculations (45). In short, low
ICCs mean that more subjects are required in a study for
a given effect size to be statistically significant (40). An
ICC of 0.60 may be perfectly fine if the resulting effect on
sample size and statistical power is within the logistical
constraints of the study. If, however, an ICC of 0.60
means that, for a required level of power, more subjects
must be recruited than is feasible, then 0.60 is not acceptable.
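The attenuation relationship is easy to explore numerically. The following sketch applies the equation above in both directions; the function names and the example values (a true-score correlation of 0.60 with ICCs of 0.80 and 0.70) are hypothetical.

```python
import math


def attenuated_r(true_r: float, icc_x: float, icc_y: float) -> float:
    """Observed correlation expected when x and y are measured with error."""
    return true_r * math.sqrt(icc_x * icc_y)


def disattenuated_r(observed_r: float, icc_x: float, icc_y: float) -> float:
    """Estimated true-score correlation given an observed correlation."""
    return observed_r / math.sqrt(icc_x * icc_y)


# Hypothetical values, chosen only to show the size of the effect.
r_obs = attenuated_r(0.60, icc_x=0.80, icc_y=0.70)
r_true = disattenuated_r(r_obs, icc_x=0.80, icc_y=0.70)
print(f"observed r = {r_obs:.3f}, recovered true r = {r_true:.3f}")
```

Under these assumed reliabilities, a true-score correlation of 0.60 would be observed as roughly 0.45, which illustrates why modest ICCs can noticeably weaken observed relationships.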
Although infrequently used in the movement sciences,
the ICC of test scores can be used in the setting and in-
terpretation of cut points for classification of individuals.
Charter and Feldt (15) show how the ICC can be used to
estimate the percentage of false-positive, false-negative,
true-positive, and true-negative results for a clinical clas-
sification scheme. Although the details of these calcula-
tions are beyond the scope of this article, it is worthwhile
to note that very high ICCs are required to classify indi-
viduals with a minimum of misclassification.
Because the general form of the ICC is a ratio of variance
due to differences between subjects (the signal) to the to-
tal variability in the data (the noise), the ICC is reflective
of the ability of a test to differentiate between different
individuals (27, 47). It does not provide an index of the
expected trial-to-trial noise in the data, which would be
useful to practitioners such as strength coaches. Unlike
the ICC, which is a relative measure of reliability, the
SEM provides an absolute index of reliability. Hopkins
(26) refers to this as the ‘‘typical error.’’ The SEM quan-
tifies the precision of individual scores on a test (24). The
SEM has the same units as the measurement of interest,
whereas the ICC is unitless. The interpretation of the
SEM centers on the assessment of reliability within in-
dividual subjects (45). The direct calculation of the SEM
involves the determination of the SD of a large number
of scores from an individual (44). In practice, a large num-
ber of scores is not typically collected, so the SEM is es-
timated. Most references estimate the SEM as follows:
$$SEM = SD\sqrt{1 - ICC} \tag{8}$$
where SD is the SD of the scores from all subjects
(determined from the ANOVA as $\sqrt{SS_{TOTAL}/(n - 1)}$)
and ICC is the reliability coefficient.
Note the similarity between the equation for the SEM
and standard error of estimate from regression analysis.
Since different forms of the ICC can result in different
numbers, the choice of ICC can substantively affect the
size of the SEM, especially if systematic error is present.
However, there is an alternative way of calculating the
SEM that avoids these uncertainties. The SEM can be
estimated as the square root of the mean square error
term from the ANOVA (20, 26, 48). Since this estimate of
the SEM has the advantage of being independent of the
specific ICC, its use would allow for more consistency in
interpreting SEM values from different studies. However,
the mean square error terms differ when using the 1-way
vs. 2-way models. In Table 2 it can be seen that using a
1-way model (22) would require the use of MSW ($\sqrt{53.75}$
= 7.3 kg), whereas use of a 2-way model would require
use of MSE ($\sqrt{57}$ = 7.6 kg). Hopkins (26) argues that because
the 1-way model combines influences of random
and systematic error together, ‘‘The resulting statistic is
biased high and is hard to interpret because the relative
contributions of random error and changes in the mean
are unknown.’’ He therefore suggests that the error term
from the 2-way model (MSE) be used to calculate SEM.
Note however, that in this sample, the 1-way SEM is
smaller than the 2-way SEM. This is because the trials
effect is small. The high bias of the 1-way model is ob-
served when the trials effect is large (Table 7). The SEM
calculated using the MS error from the 2-way model
($\sqrt{4.71}$ = 2.2 kg) is markedly lower than the SEM calculated
using the 1-way model ($\sqrt{124.25}$ = 11.1 kg), since
the SEM defined as $\sqrt{MSE}$ considers only random error.
This is consistent with the concept of a SE, which
defines noise symmetrically around a central value. This
points to the desirability of establishing a measurement
schedule that is free of systematic variation.
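The two routes to the SEM can be compared side by side. The sketch below uses the rounded values quoted in the text (mean squares of 57 and 53.75 from Table 2, 4.71 and 124.25 from Table 7, and an SD of 31.74 with an ICC of 0.95 for data set A); the function names are hypothetical, and because the inputs are rounded the results may differ in the last digit from the figures reported in the text.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Equation 8: SEM estimated from the SD of all scores and an ICC."""
    return sd * math.sqrt(1.0 - icc)


def sem_from_mean_square(mean_square: float) -> float:
    """SEM estimated as the square root of an ANOVA error mean square."""
    return math.sqrt(mean_square)


# Data set A (Table 2), using the rounded values quoted in the text.
sem_a_2way = sem_from_mean_square(57.0)     # MSE, random error only
sem_a_1way = sem_from_mean_square(53.75)    # MSW, all within-subject error
sem_a_eq8 = sem_from_icc(31.74, 0.95)       # equation 8 with ICC 3,1

# Data set C (Table 7), where the trials effect is large.
sem_c_2way = sem_from_mean_square(4.71)
sem_c_1way = sem_from_mean_square(124.25)

print(f"A: 2-way {sem_a_2way:.2f}, 1-way {sem_a_1way:.2f}, eq. 8 {sem_a_eq8:.2f} kg")
print(f"C: 2-way {sem_c_2way:.2f}, 1-way {sem_c_1way:.2f} kg")
```

For data set C the 1-way estimate is roughly five times the 2-way estimate, which is the high bias of the 1-way model that appears when systematic error is large.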
Another difference between the ICC and SEM is that
the SEM is largely independent of the population from
which it was determined, i.e., the SEM ‘‘is considered to
be a fixed characteristic of any measure, regardless of the
sample of subjects under investigation’’ (38). Thus, the
SEM is not affected by between-subjects variability as is
the ICC. To illustrate, the MSE values for the data in Tables 2
and 3 are equal (MSE = 57), despite large differences in
between-subjects variability. The resulting SEM is the
same for data sets A and B ($\sqrt{57}$ = 7.6 kg), yet they
have different ICC values (Table 4). The results are sim-
ilar when calculating the SEM using equation 8, even
though equation 8 uses the ICC in calculating the SEM,
since the effects of the SD and the ICC tend to offset each
other (38). However, the effects do not offset each other
completely, and use of equation 8 results in an SEM estimate
that is modestly affected by between-subjects variability.
The SEM is the SE in estimating observed scores (the
scores in your data set) from true scores (38). Of course,
our problem is just the opposite. We have the observed
scores and would like to estimate subjects’ true scores.
The SEM has been used to define the boundaries around
which we think a subject’s true score lies. It is often re-
ported (8, 17) that the 95% CI for a subject’s true score
can be estimated as follows:
$$T = S \pm 1.96(SEM), \tag{9}$$
where T is the subject’s true score, S is the subject’s ob-
served score on the measurement, and 1.96 defines the
95% CI. However, strictly speaking this is not correct,
since the SEM is symmetrical around the true score, not
the observed score (13, 19, 24, 38), and the SEM reflects
the SD of the observed scores while holding the true score
constant. In lieu of equation 9, an alternate approach is
to estimate the subject’s true score and calculate an al-
ternate SE (reflecting the SD of true scores while holding
observed scores constant). Because of regression to the
mean, obtained scores (S) are biased estimators of true
scores (16, 19). Scores below the mean are biased down-
ward, and scores above the mean are biased upward. A
subject’s estimated true score (T) can be calculated as follows:
$$T = \bar{X} + ICC(d), \tag{10}$$
where $d = S - \bar{X}$. To illustrate, consider data set A in
Table 1. With a grand mean of 154.5 and an ICC 3,1 of
0.95, an individual with an S of 120 kg would have a
predicted T of $154.5 + 0.95(120 - 154.5) = 121.8$ kg.
Note that because the ICC is high, the bias is small (1.8
kg). The appropriate SE to define the CI of the true score,
which some have referred to as the standard error of
estimate (13), is as follows (19, 38):
$$SE = SD\sqrt{ICC(1 - ICC)}. \tag{11}$$
In this example the value is $31.74\sqrt{0.95(1 - 0.95)} =
6.92$, where 31.74 equals the SD of the observed scores
around the grand mean. The 95% CI for T is then 121.8
± 1.96 (6.92), which defines a span of 108.2 to 135.4 kg.
The entire process, which has been termed the regression-
based approach (16), can be summarized as follows (24):
$$95\%\ \mathrm{CI\ for}\ T = \bar{X} + ICC(d) \pm 1.96\,SD\sqrt{ICC(1 - ICC)}. \tag{12}$$
If one had simply used equation 9 using S and SEM,
the resulting interval would span 120 ± 1.96 (7.6) = 105.1
to 134.9 kg. Note that the difference between the CIs is
small and that the CI from equation 9 (29.8 kg) is
wider than that from equation 12 (27.2 kg). For all ICCs
less than 1.0, the CI width will be narrower from equation
12 than from equation 9 (16), but the differences shrink
as the ICC approaches 1.0 and as S approaches $\bar{X}$ (24).
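The two intervals can be reproduced with a short calculation. The sketch below follows equations 9 through 12 using the rounded values quoted in the text (grand mean 154.5, SD 31.74, ICC 0.95, SEM 7.6, observed score 120 kg); the function names are hypothetical, and because the inputs are rounded the limits can differ by about 0.1 kg from those reported above.

```python
import math
from typing import Tuple


def ci_around_observed(s: float, sem: float, z: float = 1.96) -> Tuple[float, float]:
    """Equation 9: interval centered on the observed score."""
    return (s - z * sem, s + z * sem)


def ci_regression_based(
    s: float, grand_mean: float, sd: float, icc: float, z: float = 1.96
) -> Tuple[float, Tuple[float, float]]:
    """Equations 10 to 12: estimated true score and its interval."""
    t_hat = grand_mean + icc * (s - grand_mean)    # equation 10
    se = sd * math.sqrt(icc * (1.0 - icc))         # equation 11
    return t_hat, (t_hat - z * se, t_hat + z * se)


ci_obs = ci_around_observed(120.0, sem=7.6)
t_hat, ci_true = ci_regression_based(120.0, grand_mean=154.5, sd=31.74, icc=0.95)

width_obs = ci_obs[1] - ci_obs[0]
width_true = ci_true[1] - ci_true[0]
print(f"equation 9:  {ci_obs[0]:.1f} to {ci_obs[1]:.1f} kg (width {width_obs:.1f})")
print(f"equation 12: {ci_true[0]:.1f} to {ci_true[1]:.1f} kg (width {width_true:.1f})")
```

The regression-based interval is both narrower and shifted toward the grand mean, reflecting the correction for regression to the mean in the estimated true score.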
MINIMAL DIFFERENCES NEEDED TO BE CONSIDERED REAL
The SEM is an index that can be used to define the dif-
ference needed between separate measures on a subject
for the difference in the measures to be considered real.
For example, if the 1RM of an athlete on one day is 155
kg and at some later time is 160 kg, are you confident
that the athlete really increased the 1RM by 5 kg or is
this difference within what you might expect to see in
repeated testing just due to the noise in the measure-
ment? The SEM can be used to determine the minimum
difference (MD) to be considered ‘‘real’’ and can be cal-
culated as follows (8, 20, 42):
$$MD = SEM \times 1.96 \times \sqrt{2}. \tag{13}$$
Once again the point is to construct a 95% CI, and the
1.96 value is simply the z score associated with a 95% CI.
(One may choose a different z score instead of 1.96 if a
more liberal or more conservative assessment is desired.)
But where does the $\sqrt{2}$ come from?
Why can’t we simply calculate the 95% CI for a sub-
ject’s score as we have done above? If the score is outside
that interval, then shouldn’t we be 95% confident that the
subject’s score has really changed? Indeed, this approach
has been suggested in the literature (25, 37). The key
here is that we now have 2 scores from a subject. Each of
these scores has a true component and an error compo-
nent. That is, both scores were measured with error, and
simply seeing if the second score falls outside the CI of
the first score does not account for the error in the second
score. What we really want here is an index based on the
variability of the difference scores. This can be quantified
as the SD of the difference scores (SDd). As it turns out,
when there are 2 levels of trials (as in the examples
herein), the SEM is equal to the SDd divided by $\sqrt{2}$ (17,
$$SEM = SD_d/\sqrt{2}. \tag{14}$$
Therefore, multiplying the SEM by $\sqrt{2}$ solves for the
SDd and then multiplying the SDd by 1.96 allows for the
construction of the 95% CI. Once the MD is calculated,
then any change in a subject’s score, either above or below
the previous score, greater than the MD is considered
real. More precisely, for all people whose differences on
repeated testing are at least as large as the
MD, 95% of them would reflect real differences. Using
data set A, the first subject has a trial A1 score of 146
kg. The SEM for the test is $\sqrt{57}$ = 7.6 kg. From equation
13, MD = $7.6 \times 1.96 \times \sqrt{2}$ = 21.07 kg. Thus, a change
of at least 21.07 kg needs to occur to be confident, at the
95% level, that a change in 1RM reflects a real change
and not a difference that is within what might be reasonably
expected given the measurement error of the 1RM test.
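The MD calculation and its use as a decision rule can be sketched as follows. The SEM of 7.6 kg is the value quoted in the text for data set A; the function names are hypothetical, and applying that SEM to the 155 kg and 160 kg scores from the earlier question is an assumption made only for illustration.

```python
import math


def minimal_difference(sem: float, z: float = 1.96) -> float:
    """Equation 13: smallest change considered real at the chosen z level."""
    return sem * z * math.sqrt(2)


def exceeds_md(score_1: float, score_2: float, md: float) -> bool:
    """True when the absolute change between two scores is greater than the MD."""
    return abs(score_2 - score_1) > md


sem = 7.6                      # kg, quoted in the text for data set A
md = minimal_difference(sem)   # 95% level by default
real = exceeds_md(155.0, 160.0, md)

print(f"MD = {md:.2f} kg")
print(f"155 kg to 160 kg considered a real change: {real}")
```

Passing a different `z` (for example 1.645 for a 90% level) gives the more liberal or more conservative assessment mentioned in the text.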
However, as with defining a CI for an observed score,
the process outlined herein for defining a minimal differ-
ence is not precisely accurate. As noted by Charter (13)
and Dudek (19), the SE of prediction (SEP) is the correct
SE to use in these calculations, not the SEM. The SEP is
calculated as follows:
$$SEP = SD\sqrt{1 - ICC^2}. \tag{15}$$
To define a 95% CI outside which one could be confident
that a retest score reflects a real change in performance,
simply calculate the estimated true score (equation 10)
plus or minus 1.96 times the SEP. To illustrate, consider
the same data as in the example in the previous paragraph.
From equation 10, we estimate the subject’s true
score (T) as $T = \bar{X} + ICC(d) = 154.5 + 0.95(146 - 154.5)
= 146.4$ kg. The $SEP = SD \times \sqrt{1 - ICC^2} = 31.74 \times
\sqrt{1 - 0.95^2} = 9.91$. The resulting 95% CI is 146.4 ±
1.96 (9.91), which defines an interval from approximately
127 to 166 kg. Therefore, any retest score outside that
interval would be interpreted as reflecting a real change
in performance. As given in Table 1, the retest score of
140 kg is inside the CI and would be interpreted as a
change consistent with the measurement error of the test
and does not reflect a real change in performance. As be-
fore, use of a different z score in place of 1.96 will allow
for the construction of a more liberal or conservative CI.
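The SEP-based check can be written as a small function. The sketch below uses the rounded values quoted in the text (grand mean 154.5, SD 31.74, ICC 0.95, first score 146 kg, retest 140 kg); the function name and return layout are hypothetical.

```python
import math
from typing import Tuple


def sep_interval(
    first_score: float, grand_mean: float, sd: float, icc: float, z: float = 1.96
) -> Tuple[float, float, Tuple[float, float]]:
    """Estimated true score, SEP, and the interval for a retest score."""
    t_hat = grand_mean + icc * (first_score - grand_mean)   # equation 10
    sep = sd * math.sqrt(1.0 - icc ** 2)                    # equation 15
    return t_hat, sep, (t_hat - z * sep, t_hat + z * sep)


t_hat, sep, (low, high) = sep_interval(146.0, grand_mean=154.5, sd=31.74, icc=0.95)
retest = 140.0
real_change = not (low <= retest <= high)

print(f"T = {t_hat:.1f} kg, SEP = {sep:.2f} kg, CI = {low:.0f} to {high:.0f} kg")
print(f"retest of {retest:.0f} kg reflects a real change: {real_change}")
```

As in the text, the retest score of 140 kg falls inside the interval, so the change is treated as consistent with measurement error.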
In this article, several considerations regarding ICC and
SEM calculations will not be addressed in detail, but brief
mention will be made here. First, assumptions of ANOVA
apply to these data. The most common assumption vio-
lated is that of homoscedasticity. That is, does the size of
the error correlate with the magnitude of the observed
scores? If the data exhibit homoscedasticity, the answer
is no. For physical performance measures, it is common
that the absolute error tends to be larger for subjects who
score higher (2, 26), e.g., the noise from repeated strength
testing of stronger subjects is likely to be larger than the
noise from weaker subjects. If the data exhibit heterosce-
dasticity, often a logarithmic transformation is appropri-
ate. Second, it is important to realize that ICC and SEM
values determined from sample data are estimates. As
such, it is instructive to construct CIs for these estimates.
Details of how to construct these CIs are addressed in
other sources (34, 35). Third, how many subjects are re-
quired to get adequate stability for the ICC and SEM cal-
culations? Unfortunately, there is no consensus in this
area. The reader is referred to other studies for further
discussion (16, 35, 52). Finally, reliability, as quantified
by the ICC, is not synonymous with responsiveness to
change (23). The MD calculation presented herein allows
one to evaluate a change score after the fact. However, a
small MD, in and of itself, is not a priori evidence that a
given test is responsive.
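One simple way to screen for heteroscedasticity in a 2-trial design is to correlate each subject's absolute difference between trials with that subject's mean score, before and after a logarithmic transformation. The sketch below is only one possible screen and is not prescribed by the sources cited here; the function names and the data, which were constructed so that error grows with score magnitude, are hypothetical.

```python
import math
from typing import List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation computed from sums of squares and cross products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def error_vs_magnitude_r(trial1: List[float], trial2: List[float]) -> float:
    """Correlation between absolute trial-to-trial differences and subject means."""
    means = [(a + b) / 2 for a, b in zip(trial1, trial2)]
    abs_diffs = [abs(b - a) for a, b in zip(trial1, trial2)]
    return pearson_r(means, abs_diffs)


# Hypothetical scores in which the error is roughly 4% of the score.
trial1 = [50.0, 100.0, 150.0, 200.0]
trial2 = [52.0, 96.0, 156.0, 192.0]

r_raw = error_vs_magnitude_r(trial1, trial2)
r_log = error_vs_magnitude_r(
    [math.log(v) for v in trial1], [math.log(v) for v in trial2]
)
print(f"raw scores: r = {r_raw:.2f}; log-transformed scores: r = {r_log:.2f}")
```

In this constructed example the correlation is close to 1 on the raw scale and substantially lower after the log transformation, which is the pattern that would favor analyzing log-transformed scores.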
For a comprehensive assessment of reliability, a 3-layered
approach is recommended. First, perform a repeated-
measures ANOVA and cast the summary table as a 2-
way model, i.e., trials and error are separate sources of
variance. Evaluate the F ratio for the trials effect to ex-
amine systematic error. As noted previously, it may be
prudent to evaluate the effect for trials using a more liberal
$\alpha$ level than the traditional 0.05 level. If the effect
for trials is significant (and the effect size is not trivial),
it is prudent to reexamine the measurement schedule for
influences of learning and fatigue. If 3 or more levels of
trials were included in the analysis, a plateau in perfor-
mance may be evident, and exclusion of only those levels
of trials not in the plateau region in a subsequent re-
analysis may be warranted. However, this exclusion of
trials needs to be reported. Under these conditions, where
systematic error is deemed unimportant, the ICC values
will be similar and reflect random error (imprecision).
However, it is suggested here that the ICC from equation
3,1 be used (Table 5), since it is most closely tied to the
MSE calculation of the SEM. Once the systematic error is
determined to be nonsignificant or trivial, interpret the
ICC and SEM within the analytical goals of your study
(2). Specifically, researchers interested in group-level re-
sponses can use the ICC to assess correlation attenuation,
statistical power, and sample size calculations. Practi-
tioners (e.g., coaches, clinicians) can use the SEM (and
associated SEs) in the interpretation of scores from in-
dividual athletes (CIs for true scores, assessing individual
change). Finally, although reliability is an important as-
pect of measurement, a test may exhibit reliability but
not be a valid test (i.e., it does not measure what it pur-
ports to measure).
REFERENCES
1. ALEXANDER, H.W. The estimation of reliability when several trials are available. Psychometrika 12:79–99. 1947.
2. ATKINSON, D.B., AND A.M. NEVILL. Statistical methods for assessing measurement error (reliability) in variables relevant to Sports Medicine. Sports Med. 26:217–238. 1998.
3. BARTKO, J.J. The intraclass reliability coefficient as a measure of reliability. Psychol. Rep. 19:3–11. 1966.
4. BARTKO, J.J. On various intraclass correlation coefficients. Psychol. Bull. 83:762–765. 1976.
5. BAUMGARTNER, T.A. Estimating reliability when all test trials are administered on the same day. Res. Q. 40:222–225. 1969.
6. BAUMGARTNER, T.A. Norm-referenced measurement: reliability. In: Measurement Concepts in Physical Education and Exercise Science. M.J. Safrit and T.M. Woods, eds. Champaign, IL: Human Kinetics, 1989. pp. 45–72.
7. BAUMGARTNER, T.A. Estimating the stability reliability of a score. Meas. Phys. Educ. Exerc. Sci. 4:175–178. 2000.
8. BECKERMAN, H., T.W. VOGELAAR, G.L. LANKHORST, AND A.L.M. VERBEEK. A criterion for stability of the motor function of the lower extremity in stroke patients using the Fugl-Meyer assessment scale. Scand. J. Rehabil. Med. 28:3–7. 1996.
9. BEDARD, M., N.J. MARTIN, P. KRUEGER, AND K. BRAZIL. Assessing reproducibility of data obtained with instruments based on continuous measurements. Exp. Aging Res. 26:353–
10. BLAND, J.M., AND D.G. ALTMAN. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1:307–310. 1986.
11. BROZEK, J., AND H. ALEXANDER. Components of variance and the consistency of repeated measurements. Res. Q. 18:152–166.
12. BRUTON, A., J.H. CONWAY, AND S.T. HOLGATE. Reliability: What is it and how is it measured. Physiotherapy 86:94–99.
13. CHARTER, R.A. Revisiting the standard error of measurement, estimate, and prediction and their application to test scores. Percept. Mot. Skills 82:1139–1144. 1996.
14. CHARTER, R.A. Effect of measurement error on tests of statistical significance. J. Clin. Exp. Neuropsychol. 19:458–462. 1997.
15. CHARTER, R.A., AND L.S. FELDT. Meaning of reliability in terms of correct and incorrect clinical decisions: The art of decision making is still alive. J. Clin. Exp. Neuropsychol. 23:530–537.
16. CHARTER, R.A., AND L.S. FELDT. The importance of reliability as it relates to true score CIs. Meas. Eval. Counseling Dev. 35:
17. CHINN, S. Repeatability and method comparison. Thorax 46:
18. CHINN, S., AND P.G. BURNEY. On measuring repeatability of data from self-administered questionnaires. Int. J. Epidemiol.
19. DUDEK, F.J. The continuing misinterpretation of the standard error of measurement. Psychol. Bull. 86:335–337. 1979.
20. ELIASZIW, M., S.L. YOUNG, M.G. WOODBURY, AND K. FRYDAY-FIELD. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Phys. Ther. 74:777–788. 1994.
21. FELDT, L.S., AND M.E. MCKEE. Estimation of the reliability of skill tests. Res. Q. 29:279–293. 1958.
22. FLEISS, J.L. The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons, 1986.
23. GUYATT, G., S. WALTER, AND G. NORMAN. Measuring change over time: assessing the usefulness of evaluative instruments. J. Chronic Dis. 40:171–178. 1987.
24. HARVILL, L.M. Standard error of measurement. Educ. Meas. Issues Pract. 10:33–41. 1991.
25. HEBERT, R., D.J. SPIEGELHALTER, AND C. BRAYNE. Setting the minimal metrically detectable change on disability rating scales. Arch. Phys. Med. Rehabil. 78:1305–1308. 1997.
26. HOPKINS, W.G. Measures of reliability in sports medicine and science. Sports Med. 30:375–381. 2000.
27. KEATING, J., AND T. MATYAS. Unreliable inferences from reliable measurements. Aust. Physiother. 44:5–10. 1998.
28. KEPPEL, G. Design and Analysis: A Researcher’s Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.
29. KROLL, W. A note on the coefficient of intraclass correlation as an estimate of reliability. Res. Q. 33:313–316. 1962.
30. LAHEY, M.A., R.G. DOWNEY, AND F.E. SAAL. Intraclass correlations: there’s more than meets the eye. Psychol. Bull. 93:586–
31. LIBA, M. A trend test as a preliminary to reliability estimation. Res. Q. 38:245–248. 1962.
32. LOONEY, M.A. When is the intraclass correlation coefficient misleading? Meas. Phys. Educ. Exerc. Sci. 4:73–78. 2000.
33. LUDBROOK, J. Statistical techniques for comparing measures and methods of measurement: A critical review. Clin. Exp. Pharmacol. Physiol. 29:527–536. 2002.
34. MCGRAW, K.O., AND S.P. WONG. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1:30–
35. MORROW, J.R., AND A.W. JACKSON. How ‘‘significant’’ is your reliability? Res. Q. Exerc. Sport 64:352–355. 1993.
36. NICHOLS, C.P. Choosing an intraclass correlation coefficient. Available at: www.spss.com/tech/stat/articles/whichicc.htm.
37. NITSCHKE, J.E., J.M. MCMEEKEN, H.C. BURRY, AND T.A. MATYAS. When is a change a genuine change? A clinically meaningful interpretation of grip strength measurements in healthy and disabled women. J. Hand Ther. 12:25–30. 1999.
38. NUNNALLY, J.C., AND I.H. BERNSTEIN. Psychometric Theory (3rd ed.). New York: McGraw-Hill, 1994.
39. OLDS, T. Five errors about error. J. Sci. Med. Sport 5:336–340.
40. PERKINS, D.O., R.J. WYATT, AND J.J. BARTKO. Penny-wise and pound-foolish: The impact of measurement error on sample size requirements in clinical trials. Biol. Psychiatry. 47:762–
41. PORTNEY, L.G., AND M.P. WATKINS. Foundations of Clinical Research (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
42. ROEBROECK, M.E., J. HARLAAR, AND G.J. LANKHORST. The application of generalizability theory to reliability assessment: An illustration using isometric force measurements. Phys. Ther.
43. ROUSSON, V., T. GASSER, AND B. SEIFERT. Assessing intrarater, interrater, and test-retest reliability of continuous measurements. Stat. Med. 21:3431–3446. 2002.
44. SAFRIT, M.J.E. Reliability Theory. Washington, DC: American Alliance for Health, Physical Education, and Recreation, 1976.
45. SHROUT, P.E. Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7:301–317. 1998.
46. SHROUT, P.E., AND J.L. FLEISS. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 36:420–428. 1979.
47. STRATFORD, P. Reliability: consistency or differentiating between subjects? [Letter]. Phys. Ther. 69:299–300. 1989.
48. STRATFORD, P.W., AND C.H. GOLDSMITH. Use of standard error as a reliability index of interest: An applied example using elbow flexor strength data. Phys. Ther. 77:745–750. 1997.
49. STREINER, D.L., AND G.R. NORMAN. Measurement Scales: A Practical Guide to Their Development and Use (2nd ed.). Oxford: Oxford University Press, 1995. pp. 104–127.
50. THOMAS, J.R., AND J.K. NELSON. Research Methods in Physical Activity (2nd ed.). Champaign, IL: Human Kinetics, 1990. pp.
51. TRAUB, R.E., AND G.L. ROWLEY. Understanding reliability. Educ. Meas. Issues Pract. 10:37–45. 1991.
52. WALTER, S.D., M. ELIASZIW, AND A. DONNER. Sample size and optimal designs for reliability studies. Stat. Med. 17:101–110.
I am indebted to Lee Brown, Joel Cramer, Bryan Heiderscheit,
Terry Housh, and Bob Oppliger for their helpful comments on
drafts of the paper.
correspondence to Dr. Joseph P. Weir,