Journal of Strength and Conditioning Research, 2005, 19(1), 231–240
© 2005 National Strength & Conditioning Association

Brief Review
QUANTIFYING TEST-RETEST RELIABILITY USING THE INTRACLASS CORRELATION COEFFICIENT AND THE SEM

JOSEPH P. WEIR

Applied Physiology Laboratory, Division of Physical Therapy, Des Moines University—Osteopathic Medical Center, Des Moines, Iowa 50312.
ABSTRACT. Weir, J.P. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J. Strength Cond. Res. 19(1):231–240. 2005.—Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). In addition, the SEM, which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be one concerning the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the necessary variance components for the calculation of ICC values, are useful to determine if systematic error is present. If so, the measurement schedule should be modified (removing trials where learning and/or fatigue effects are present) to remove systematic error, and ICC equations that only consider random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference needed to be exhibited for one to be confident that a true change in performance of an individual has occurred.

KEY WORDS. reproducibility, precision, error, consistency, SEM, intraclass correlation coefficient
INTRODUCTION

Reliability refers to the consistency of a test or measurement. For a seemingly simple concept, the quantification of reliability and the interpretation of the resulting numbers are surprisingly unclear in the biomedical literature in general (49) and in the sport sciences literature in particular. Part of this stems from the fact that reliability can be assessed in a variety of different contexts. In the sport sciences, we are most often interested in simple test-retest reliability; this is what Fleiss (22) refers to as a simple reliability study. For example, one might be interested in the reliability of 1 repetition maximum (1RM) squat measures taken on the same athletes over different days. However, if one is interested in the ability of different testers to get the same results from the same subjects on skinfold measurements, one is now interested in interrater reliability. The quantification of reliability in these different situations is not necessarily the same, and the decisions regarding how to calculate reliability in these different contexts have not been adequately addressed in the sport sciences literature. In this article, I focus on test-retest reliability (but not limited in the number of retest trials). In addition, I discuss data measured on a continuous scale.
Confusion also stems from the jargon used in the context of reliability, i.e., consistency, precision, repeatability, and agreement. Intuitively, these terms describe the same concept, but in practice some are operationalized differently. Notably, reliability and agreement are not synonymous (30, 49). Further, reliability, conceptualized as consistency, consists of both absolute consistency and relative consistency (44). Absolute consistency concerns the consistency of scores of individuals, whereas relative consistency concerns the consistency of the position or rank of individuals in the group relative to others. In the fields of education and psychology, the term reliability is operationalized as relative consistency and quantified using reliability coefficients called intraclass correlation coefficients (ICCs) (49). Issues regarding quantifying ICCs and their interpretation are discussed in the first half of this article. Absolute consistency, quantified using the SEM, is addressed in the second half of the article. In brief, the SEM is an indication of the precision of a score, and its use allows one to construct confidence intervals (CIs) for scores.

Another confusing aspect of reliability calculations is that a variety of different procedures, besides ICCs and the SEM, have been used to determine reliability. These include the Pearson r, the coefficient of variation, and the limits of agreement (LOA; Bland-Altman plots). The Pearson product moment correlation coefficient (Pearson r) was often used in the past to quantify reliability, but the use of the Pearson r is typically discouraged for assessing test-retest reliability (7, 9, 29, 33, 44); however, this recommendation is not universal (43). The primary, although not exclusive, weakness of the Pearson r is that it cannot detect systematic error. More recently, the LOA described by Bland and Altman (10) have come into vogue in the biomedical literature (2). The LOA will not be addressed in detail herein other than to point out that the procedure was developed to examine agreement between 2 different techniques of quantifying some variable (so-called method comparison studies; e.g., one could compare testosterone concentration using 2 different bioassays), not reliability per se. The use of LOA as an index of reliability has been criticized in detail elsewhere (26, 49). In this article, the ICC and SEM will be the focus.
Unfortunately, there is considerable confusion concerning both the calculation and interpretation of the ICC. Indeed, there are 6 common versions of the ICC (and others as well), and the choice of which version to use is not intuitively obvious. Similarly, the SEM, which is intimately related to the ICC, has useful applications that are not fully appreciated by practitioners in the movement sciences. The purposes of this article are to provide information on the choice and application of the ICC and to encourage practitioners to use the SEM in the interpretation of test data.
THE ICC

Reliability Theory

For a group of measurements, the total variance (σ_T²) in the data can be thought of as being due to true score variance (σ_t²) and error variance (σ_e²). Similarly, each observed score is composed of the true score and error (44). The theoretical true score of an individual reflects the mean of an infinite number of scores from a subject, whereas error equals the difference between the true score and the observed score (21). Sources of error include errors due to biological variability, instrumentation, error by the subject, and error by the tester. If we make a ratio of the σ_t² to the σ_T² of the observed scores, where σ_T² equals σ_t² plus σ_e², we have the following reliability coefficient:

    R = σ_t² / (σ_t² + σ_e²).    (1)

The closer this ratio is to 1.0, the higher the reliability and the lower the σ_e². Since we do not know the true score for each subject, an index of the σ_t² is used based on between-subjects variability, i.e., the variance due to how subjects differ from each other. In this context, reliability (relative consistency) is formally defined (5, 21, 49) as follows:

    reliability = between-subjects variability / (between-subjects variability + error).    (2)
The reliability coefficient in Equation 2 is quantified by various ICCs. So although reliability is conceptually aligned with terms such as reproducibility, repeatability, and agreement, it is defined as above. The necessary variance estimates are derived from analysis of variance (ANOVA), where appropriate mean square values are recorded from the computer printout. Specifically, the various ICCs can be calculated from mean square values derived from a within-subjects, single-factor ANOVA (i.e., a repeated-measures ANOVA).

The ICC is a relative measure of reliability (18) in that it is a ratio of variances derived from ANOVA, is unitless, and is more conceptually akin to R² from regression (43) than to the Pearson r. The ICC can theoretically vary between 0 and 1.0, where an ICC of 0 indicates no reliability, whereas an ICC of 1.0 indicates perfect reliability. In practice, ICCs can extend beyond the range of 0 to 1.0 (30), although with actual data this is rare. The relative nature of the ICC is reflected in the fact that the magnitude of an ICC depends on the between-subjects variability (as shown in the next section). That is, if subjects differ little from each other, ICC values are small even if trial-to-trial variability is small. If subjects differ from each other a lot, ICCs can be large even if trial-to-trial variability is large. Thus, the ICC for a test is context specific (38, 51). As noted by Streiner and Norman (49), "There is literally no such thing as the reliability of a test, unqualified; the coefficient has meaning only when applied to specific populations." Further, it is intuitive that small differences between individuals are more difficult to detect than large ones, and the ICC is reflective of this (49).
Error is typically considered as being of 2 types: systematic error (e.g., bias) and random error (2, 39). (Generalizability theory expands sources of error to include various facets of interest but is beyond the scope of this article.) Total error reflects both systematic error and random error (imprecision). Systematic error includes both constant error and bias (38). Constant error affects all scores equally, whereas bias is systematic error that affects certain scores differently than others. For physical performance measures, the distinction between constant error and bias is relatively unimportant, and the focus here on systematic error is on situations that result in a unidirectional change in scores on repeated testing. In testing of physical performance, subjects may improve their test scores simply due to learning effects (e.g., performing the first test serves as practice for subsequent tests), or fatigue or soreness may result in poorer performance across trials. In contrast, random error refers to sources of error that are due to chance factors. Factors such as luck, alertness, attentiveness by the tester, and normal biological variability affect a particular score. Such errors should, in a random manner, both increase and decrease test scores on repeated testing. Thus, we can expand Equation 1 as follows:

    R = σ_t² / (σ_t² + σ_se² + σ_re²),    (3)

where σ_se² is the systematic σ_e² and σ_re² is the random σ_e².

It has been argued that systematic error is a concern of validity and not reliability (12, 43). Similarly, systematic error (e.g., learning effects, fatigue) has been suggested to be a natural phenomenon and therefore does not contribute to unreliability per se in test-retest situations (43). Thus, there is a school of thought that suggests that only random error should be assessed in reliability calculations. Under this analysis, the error term in the denominator will only reflect random error and not systematic error, increasing the size of reliability coefficients. The issue of inclusion of systematic error in the determination of reliability coefficients is addressed in a subsequent section.
The Basic Calculations

The calculation of reliability starts with the performance of a repeated-measures ANOVA. This analysis performs 2 functions. First, the inferential test of mean differences across trials is an assessment of systematic error (trend). Second, all of the subsequent calculations can be derived from the output from this ANOVA. In keeping with the nomenclature of Keppel (28), the ANOVA that is used is of a single-factor, within-subjects (repeated-measures) design. Unfortunately, the language gets a bit tortured in many sources, because the different ICC models are referred to as either 1-way or 2-way models; what is important to keep in mind is that both the 1-way and 2-way ICC models can be derived from the same single-factor, within-subjects ANOVA.
TABLE 1. Example data set.

            Trial A1   Trial A2     D      Trial B1   Trial B2     D
              146        140       −6        166        160       −6
              148        152       +4        168        172       +4
              170        152      −18        160        142      −18
               90         99       +9        150        159       +9
              157        145      −12        147        135      −12
              156        153       −3        146        143       −3
              176        167       −9        156        147       −9
              205        218      +13        155        168      +13
Mean ± SD   156 ± 33   153 ± 33             156 ± 8    153 ± 13
TABLE 2. Two-way analysis of variance summary table for data set A.*

Source               df        SS       Mean square                           F       p value
Between subjects      7    14,689.8     2098.4 (MS_B: 1-way; MS_S: 2-way)    36.8
Within subjects       8       430         53.75 (MS_W)
  Trials              1        30.2       30.2 (MS_T)                         0.53    0.49
  Error               7       399.8       57 (MS_E)
Total                15    15,119.8

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
To illustrate the calculations, example data are presented in Table 1. ANOVA summary tables are presented in Tables 2 and 3, and the resulting ICCs are presented in Table 4. Focus on the first two columns of Table 1, which are labeled trial A1 and trial A2. As can be seen, there are 2 sets (columns) of scores, and each set has 8 scores. In this example, each of 8 subjects has provided a score in each set. Assume that each set of scores represents the subjects' scores on the 1RM squat across 2 different days (trials). A repeated-measures ANOVA is performed primarily to test whether the 2 sets of scores are significantly different from each other (i.e., do the scores systematically change between trials) and is summarized in Table 2. Equivalently, one could have used a paired t-test, since there were only 2 levels of trials. However, the ANOVA is applicable to situations with 2 or more trials and is consistent with the ICC literature in defining sources of variance for ICC calculations. Note that there are 3 sources of variability in Table 2: subjects, trials, and error. In a repeated-measures ANOVA such as this, it is helpful to remember that this analysis might be considered as having 2 factors: the primary factor of trials and a secondary factor called subjects (with a sample size of 1 subject per cell). The error term includes the interaction effect of trials by subjects. It is useful to keep these sources of variability in mind for 2 reasons. First, the 1-way and 2-way models of the ICC (6, 44) either collapse the variability due to trials and error together (1-way models) or keep them separate (2-way models). Note that the trials and error sources of variance, respectively, reflect the systematic and random sources of error in the σ_e² of the reliability coefficient. These differences are illustrated in Table 2, where the df and sums of squares values for error in the 1-way model (within-subjects source) are simply the sum of the respective values for trials and error in the 2-way model.

Second, unlike a between-subjects ANOVA where the "noise" due to different subjects is part of the error term, the variability due to subjects is now accounted for (due to the repeated testing) and therefore is not part of the error term. Indeed, for the calculation of the ICC, the numerator (the signal) reflects the variance due to subjects. Since the error term of the ANOVA reflects the interaction between subjects and trials, the error term is small in situations where all the subjects change similarly across test days. In situations where subjects do not change in a similar manner across test days (e.g., some subjects' scores increase, whereas others decrease), the error term is large. In the former situation, even small differences across test days, as long as they are consistent across all the subjects, can result in a statistically significant effect for trials. In this example, however, the effect for trials is not statistically significant (p = 0.49), indicating that there is no statistically significant systematic error in the data. It should be kept in mind, however, that the statistical power of the test of mean differences between trials is affected by sample size and random error. Small sample sizes and noisy data (i.e., high random error) will decrease power and potentially hide systematic error. Thus, an inferential test of mean differences alone is insufficient to quantify reliability. Further, the effect for trials ought to be evaluated with a more liberal α level, since in this case the implications of a type 2 error are more severe than those of a type 1 error. In cases where systematic error is present, it may be prudent to change the measurement schedule (e.g., add trials if a learning effect is present or increase rest intervals if fatigue is present) to compensate for the bias.
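To make the variance partitioning concrete, the following minimal Python sketch computes the sums of squares and mean squares of the single-factor, within-subjects ANOVA for data set A in Table 1. It is offered only as an illustration of the arithmetic behind Table 2; the variable names (e.g., data_a) and the use of NumPy are assumptions of this sketch, not part of the original analysis, which was presumably performed with standard statistical software.

```python
import numpy as np

# Data set A from Table 1: rows are subjects, columns are trials.
data_a = np.array([
    [146, 140], [148, 152], [170, 152], [90, 99],
    [157, 145], [156, 153], [176, 167], [205, 218],
], dtype=float)

n, k = data_a.shape                      # 8 subjects, 2 trials
grand_mean = data_a.mean()

# Partition the total sums of squares (SS).
ss_total = ((data_a - grand_mean) ** 2).sum()
ss_subjects = k * ((data_a.mean(axis=1) - grand_mean) ** 2).sum()  # between subjects
ss_trials = n * ((data_a.mean(axis=0) - grand_mean) ** 2).sum()    # trials (systematic)
ss_within = ss_total - ss_subjects                                 # 1-way error term
ss_error = ss_within - ss_trials                                   # 2-way error term

# Mean squares used by the ICC formulas; compare with Table 2.
ms_b = ss_subjects / (n - 1)               # MS_B (1-way) = MS_S (2-way), ~2098
ms_w = ss_within / (n * (k - 1))           # MS_W, ~53.75
ms_t = ss_trials / (k - 1)                 # MS_T, ~30.2
ms_e = ss_error / ((n - 1) * (k - 1))      # MS_E, ~57
print(ms_b, ms_w, ms_t, ms_e)
```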
Shrout and Fleiss (46) have presented 6 forms of the ICC. This system has taken hold in the physical therapy literature. However, the specific nomenclature of their system does not seem to be as prevalent in the exercise physiology, kinesiology, and sport science literature, which has instead ignored which model is used or has focused on ICC terms that are centered on either 1-way or 2-way ANOVA models (6, 44).
TABLE 3. Analysis of variance summary table for data set B.*

Source               df        SS       Mean square                           F       p value
Between subjects      7     1330        190 (MS_B: 1-way; MS_S: 2-way)        3.3
Within subjects       8      430         53.75 (MS_W)
  Trials              1       30.2       30.2 (MS_T)                          0.53    0.49
  Error               7      399.8       57 (MS_E)
Total                15     1760

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
Nonetheless, the ICC models of Shrout and Fleiss (46) overlap with the 1-way and 2-way models presented by Safrit (44) and Baumgartner (6). Three general models of the ICC are present in the Shrout and Fleiss (46) nomenclature, which are labeled 1, 2, and 3. Each model can be calculated 1 of 2 ways. If the scores in the analysis are from single scores from each subject for each trial (or rater if assessing interrater reliability), then the ICC is given a second designation of 1. If the scores in the analysis represent the average of the k scores from each subject (i.e., the average across the trials), then the ICC is given a second designation of k. In this nomenclature then, an ICC with a model designation of 2,1 indicates an ICC calculated using model 2 with single scores. The use of these models is typically presented in the context of determining rater reliability (41). For model 1, each subject is assumed to be assessed by a different set of raters than other subjects, and these raters are assumed to be randomly sampled from the population of possible raters, so that raters are a random effect. Model 2 assumes each subject was assessed by the same group of raters, and these raters were randomly sampled from the population of possible raters. In this case, raters are also considered a random effect. Model 3 assumes each subject was assessed by the same group of raters, but these particular raters are the only raters of interest, i.e., one does not wish to generalize the ICCs beyond the confines of the study. In this case, the analysis attempts to determine the reliability of the raters used by that particular study, and raters are considered a fixed effect.

The 1-way ANOVA models (6, 44) coincide with model 1,k for situations where scores are averaged and model 1,1 for single scores for a given trial (or rater). Further, ICC 1,1 coincides with the 1-way ICC model described by Bartko (3, 4), and ICC 1,k has also been termed the Spearman-Brown prediction formula (4). Similarly, ICC values derived from single and averaged scores calculated using the 2-way approach (6, 44) coincide with models 3,1 and 3,k, respectively. Calculations coincident with models 2,1 and 2,k were not reported by Baumgartner (6) or Safrit (44).
More recently, McGraw and Wong (34) expanded the Shrout and Fleiss (46) system to include 2 more general forms, each also with a single score or average score version, resulting in 10 ICCs. These ICCs have now been incorporated into SPSS statistical software starting with version 8.0 (36). Fortunately, 4 of the computational formulas of Shrout and Fleiss (46) also apply to the new forms of McGraw and Wong (34), so the total number of formulas is not different.
The computational formulas for the ICC models of Shrout and Fleiss (46) and McGraw and Wong (34) are summarized in Table 5. Unfortunately, it is not intuitively obvious how the computational formulas reflect the intent of equations 1 through 3. This stems from the fact that the computational formulas reported in most sources are derived from algebraic manipulations of basic equations in which mean square values from ANOVA are used to estimate the various σ² values reflected in equations 1 through 3. To illustrate, the manipulations for ICC 1,1 (random-effects, 1-way ANOVA model) are shown herein. First, the computational formula for ICC 1,1 is as follows:

    ICC 1,1 = (MS_B − MS_W) / [MS_B + (k − 1)MS_W],    (4)

where MS_B indicates the between-subjects mean square, MS_W indicates the within-subjects mean square, and k is the number of trials (3, 46). The relevant mean square values can be found in Table 2. To relate this computational formula to equation 1, one must know that estimation of the appropriate σ² comes from expected mean squares from ANOVA. Specifically, for this model the expected MS_B equals σ_e² plus kσ_s², whereas the expected MS_W equals σ_e² (3); therefore, MS_B equals MS_W plus kσ_s². If from equation 1 we estimate σ_t² from between-subjects variance (σ_s²), then

    ICC = σ_s² / (σ_s² + σ_e²).    (5)

By algebraic manipulation (e.g., σ_s² = [MS_B − MS_W]/k) and substitution of the expected mean squares into equation 5, it can be shown that

    ICC 1,1 = σ_s² / (σ_s² + σ_e²)
            = ([MS_B − MS_W]/k) / ([MS_B − MS_W]/k + MS_W)
            = (MS_B − MS_W) / [MS_B + (k − 1)MS_W].    (6)

Similar derivations can be made for the other ICC models (3, 34, 46, 49) so that all ultimately relate to equation 1. Of note is that with the different ICC models (fixed vs. random effects, 1-way vs. 2-way ANOVA), the expected mean squares change, and thus the computational formulas commonly found in the literature (30, 41) also change.
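The algebra above can also be checked numerically. The sketch below is one possible implementation of the single-score formulas summarized in Table 5 (the function and argument names are illustrative assumptions of this sketch); applied to the mean squares from Table 2, it returns values close to the 0.95 reported for data set A in Table 4.

```python
def icc_single(ms_b, ms_w, ms_s, ms_t, ms_e, n, k):
    """Single-score ICC forms of Shrout and Fleiss (see Table 5).

    ms_b/ms_w come from the 1-way partition, ms_s/ms_t/ms_e from the
    2-way partition; n = number of subjects, k = number of trials.
    """
    icc_11 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
    icc_21 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_t - ms_e) / n)
    icc_31 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e)
    return icc_11, icc_21, icc_31

# Mean squares for data set A (Table 2); MS_B and MS_S are the same value here.
print(icc_single(ms_b=2098.4, ms_w=53.75, ms_s=2098.4, ms_t=30.2, ms_e=57.0, n=8, k=2))
# Each value is approximately 0.95, matching the data set A column of Table 4.
```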
Choosing an ICC

Given the 6 ICC versions of Shrout and Fleiss (46) and the 10 versions presented by McGraw and Wong (34), the choice of ICC is perplexing, especially considering that most of the literature deals with rater reliability, not test-retest reliability of physical performance measures.
TABLE 4. ICC values for data sets A and B.*

ICC type    Data set A    Data set B
1,1            0.95          0.56
1,k            0.97          0.72
2,1            0.95          0.55
2,k            0.97          0.71
3,1            0.95          0.54
3,k            0.97          0.70

* ICC = intraclass correlation coefficient.
TABLE 6. Example data set with systematic error.

            Trial C1   Trial C2     D
              146        161       +15
              148        162       +14
              170        189       +19
               90        100       +10
              157        175       +18
              156        171       +15
              176        195       +19
              205        219       +14
Mean ± SD   156 ± 33   172 ± 35
TABLE 5. Intraclass correlation coefficient model summary table.*

Shrout and Fleiss   Computational formula                                      McGraw and Wong   Model
1,1                 (MS_B − MS_W) / [MS_B + (k − 1)MS_W]                       1                 1-way random
1,k                 (MS_B − MS_W) / MS_B                                       k                 1-way random
Use 3,1                                                                        C,1               2-way random
Use 3,k                                                                        C,k               2-way random
2,1                 (MS_S − MS_E) / [MS_S + (k − 1)MS_E + k(MS_T − MS_E)/n]    A,1               2-way random
2,k                 (MS_S − MS_E) / [MS_S + (MS_T − MS_E)/n]                   A,k               2-way random
3,1                 (MS_S − MS_E) / [MS_S + (k − 1)MS_E]                       C,1               2-way fixed
3,k                 (MS_S − MS_E) / MS_S                                       C,k               2-way fixed
Use 2,1                                                                        A,1               2-way fixed
Use 2,k                                                                        A,k               2-way fixed

* Adapted from Shrout and Fleiss (46) and McGraw and Wong (34). Mean square abbreviations are based on the 1-way and 2-way analysis of variance illustrated in Table 2. For McGraw and Wong, A = absolute agreement and C = consistency. MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; n = number of subjects; k = number of trials.
In a classic paper, Brozek and Alexander (11) first introduced the concept of the ICC to the movement sciences literature and detailed the implementation of an ICC for application to test-retest analysis of motor tasks. Their coefficient is equivalent to model 3,1. Thus, one might use ICC 3,1 with test-retest reliability where trials are substituted for raters. From the rater nomenclature above, if one does not wish to generalize the reliability findings but rather to assert that in our hands the procedures are reliable, then ICC 3,1 seems like a logical choice. However, this ICC does not include variance associated with systematic error and is in fact closely approximated by the Pearson r (1, 43). Therefore, the criticism of the Pearson r as an index of reliability holds as well for ICCs derived from model 3. At the least, it needs to be established that the effect for trials (bias) is trivial if reporting an ICC derived from model 3. Use of an effect size for the trials effect in the ANOVA would provide information in this regard. With respect to ICC 3,1, Alexander (1) notes that it "may be regarded as an estimate of the value that would have been obtained if the fluctuation [systematic error] had been avoided."

In a more general sense, there are 4 issues to be addressed in choosing an ICC: (a) 1- or 2-way model, (b) fixed- or random-effects model, (c) inclusion or exclusion of systematic error in the ICC, and (d) single or mean score. With respect to choosing a 1- or 2-way model, in a 1-way model the effect of raters or trials (replication study) is not crossed with subjects, meaning that it allows for situations where all raters do not score all subjects (48). Fleiss (22) uses the 1-way model for what he terms simple replication studies. In this model, all sources of error are lumped together into the MS_W (Tables 2 and 3). In contrast, the 2-way models allow the error to be partitioned between systematic and random error. When systematic error is small, MS_W from the 1-way model and the error mean square (MS_E) from the 2-way models (reflecting random error) are similar, and the resulting ICCs are similar. This is true for both data sets A and B. When systematic error is substantial, MS_W and MS_E are disparate, as in data set C (Tables 6 and 7). Two-way models require trials or raters to be crossed with subjects (i.e., subjects provide scores for all trials or each rater rates all subjects). For test-retest situations, the design dictates that trials are crossed with subjects, and the data therefore lend themselves to analysis by 2-way models.
TABLE 7. Analysis of variance summary table for data set C.*

Source               df        SS       Mean square                           F         p value
Between subjects      7    15,925       2275 (MS_B: 1-way; MS_S: 2-way)       482.58
Within subjects       8       994        124.25 (MS_W)
  Trials              1       961.0      961.0 (MS_T)                         203.85    <0.0001
  Error               7        33.0        4.71 (MS_E)
Total                15    16,919

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
Regarding fixed vs. random effects, a fixed factor is one in which all levels of the factor of interest (in this case, trials) are included in the analysis and no attempt at generalization of the reliability data beyond the confines of the study is expected. Determining the reliability of a test before using it in a larger study fits this description of a fixed effect. A random factor is one in which the levels of the factor in the design (trials) are but a sample of the possible levels, and the analysis will be used to generalize to other levels. For example, a study designed to evaluate the test-retest reliability of the vertical jump for use by other coaches (with similar athletes) would consider the effect of trials to be a random effect. Both Shrout and Fleiss (46) models 1 and 2 are random-effects models, whereas model 3 is a fixed-effects model. From this discussion, for the 2-way models of Shrout and Fleiss (46), the choice between model 2 and model 3 appears to hinge on a decision regarding a random- vs. fixed-effects model. However, models 2 and 3 also differ in their treatment of systematic error. As noted previously, model 3 only considers random error, whereas model 2 considers both random and systematic error. This system does not include a 2-way fixed-effects model that includes systematic error and does not offer a 2-way random-effects model that only considers random error. The expanded system of McGraw and Wong (34) includes these options. In the nomenclature of McGraw and Wong (34), the designation C refers to consistency and A refers to absolute agreement. That is, the C models consider only random error, and the A models consider both random and systematic error. As noted in Table 5, no new computational formulas are required beyond those presented by Shrout and Fleiss (46). Thus, if one were to choose a 2-way random-effects model that only addressed random error, one would use equation 3,1 (or equation 3,k if the mean across k trials is the criterion score). Similarly, if one were to choose a 2-way fixed-effects model that addressed both systematic and random error, equation 2,1 would be used (or 2,k). Ultimately then, since the computational formulas do not differ between systems, the choice between using the Shrout and Fleiss (46) equations from models 2 vs. 3 hinges on decisions regarding inclusion or exclusion of systematic error in the calculations. As noted by McGraw and Wong (34), "the random-fixed effects distinction is in its effect on the interpretation, but not calculation, of an ICC."

Should systematic error be included in the ICC? First, if the effect for trials is small, the systematic differences between trials will be small, and the ICCs will be similar to each other. This is evident in both the A and B data sets (Tables 1 through 3). However, if the mean differences are large, then differences between ICCs are evident, especially between equation 3,1, which does not consider systematic error, and equations 1,1 and 2,1, which do consider systematic error. In this regard, the F test for trials and the ICC calculations may give contradictory results from the same data. Specifically, it can be the case that an ICC can be large (indicating good reliability), whereas the ANOVA shows a significant trials effect. An example is given in Tables 6 and 7. In this example, each score in trial C1 was altered in trial C2 so that there was a bias of +15 kg and a random component added to each score. The effect for trials was significant (F1,7 = 203.85, p < 0.001) and reflected a mean increase of 16 kg. For an ANOVA to be significant, the effect must be large (in this case, the mean differences between trials must be large), the noise (error term) must be small, or both. The error term is small when all subjects behave similarly across test days. When this is the case, even small mean differences can be statistically significant. In this case, the systematic differences explain a significant amount of variability in the data. Despite the rather large systematic error, the ICC values from equations 1,1; 2,1; and 3,1 were 0.896, 0.901, and 0.998, respectively. A cursory examination of just the ICC scores would suggest that the test exhibited good reliability, especially using equation 3,1, which only reflects random error. However, an approximately 10% increase in scores from trial C1 to C2 would suggest otherwise. Thus, an analysis that only focuses on the ICC without consideration of the trials effect is incomplete (31). If the effect for trials is significant, the most straightforward approach is to develop a measurement schedule that will attenuate systematic error (2, 50). For example, if learning effects are present, one might add trials until a plateau in performance occurs. Then the ICC could be calculated only on the trials in the plateau region. The identification of such a measurement schedule would be especially helpful for random-effects situations where others might be using the test being evaluated. For simplicity, all the examples here have been with only 2 levels for trials. If a trials effect is significant, however, 2 trials are insufficient to identify a plateau. The possibility of a significant trials effect should be considered in the design of the reliability study. Fortunately, the ANOVA procedures require no modification to accommodate any number of trials.
Interpreting the ICC

At one level, interpreting the ICC is fairly straightforward; it represents the proportion of variance in a set of scores that is attributable to σ_t². An ICC of 0.95 means that an estimated 95% of the observed score variance is due to σ_t². The balance of the variance (1 − ICC = 5%) is attributable to error (51). However, how does one qualitatively evaluate the magnitude of an ICC, and what can the quantity tell you? Some sources have attempted to delineate good, medium, and poor levels for the ICC, but there is certainly no consensus as to what constitutes a good ICC (45).
Indeed, Charter and Feldt (15) argue that "it is not theoretically defensible to set a universal standard for test score reliability." These interpretations are further complicated by 2 factors. First, as noted herein, the ICC varies depending on which version of the ICC is used. Second, the magnitude of the ICC is dependent on the variability in the data (45). All other things being equal, low levels of between-subjects variability will serve to depress the ICC even if the differences between subjects' scores across test conditions are small. This is illustrated by comparing the 2 example sets of data in Table 1. Trials 1 and 2 of data sets A and B have identical mean values and identical change scores between trials 1 and 2. They differ in the variability between subjects, with greater between-subjects variability evident in data set A, as shown by the larger SDs. In Tables 2 and 3, the ANOVA tables have identical outcomes with respect to the inferential test of the factor trials and have identical error terms (since the between-subjects variability is not part of the error term, as noted previously). Table 4 shows the ICC values calculated using the 6 different models of Shrout and Fleiss (46) on the A and B data sets. Clearly, data set B, with the lower between-subjects variability, results in smaller ICC values than data set A.

How then does one interpret an ICC? First, because of the relationship between the ICC and between-subjects variability, the heterogeneity of the subjects should be considered. A large ICC can mask poor trial-to-trial consistency when between-subjects variability is high. Conversely, a low ICC can be found even when trial-to-trial variability is low if the between-subjects variability is low. In this case, the homogeneity of the subjects means it will be difficult to differentiate between subjects even though the absolute measurement error is small. An examination of the SEM in conjunction with the ICC is therefore needed (32). From a practical perspective, a given test can have different reliability, at least as determined from the ICC, depending on the characteristics of the individuals included in the analysis. In the 1RM squat, combining individuals of widely different capabilities (e.g., wide receivers and defensive linemen in American football) into the same analysis increases between-subjects variability and improves the ICC, yet this may not be reflected in the expected day-to-day variation, as illustrated in Tables 1 through 4. In addition, the inferential test for bias described previously needs to be considered. High between-subjects variability may result in a high ICC even if the test for bias is statistically significant.

The relationship between between-subjects variability and the magnitude of the ICC has been used as a criticism of the ICC (10, 39). This is an unfair criticism, since the ICC is used to provide information regarding inferential statistical tests, not to provide an index of absolute measurement error. In essence, the ICC normalizes measurement error relative to the heterogeneity of the subjects. As an index of absolute reliability, then, this is a weakness, and other indices (i.e., the SEM) are more informative. As a relative index of reliability, the ICC behaves as intended.
What are the implications of a low ICC? First, measurement error reflected in an ICC of less than 1.0 serves to attenuate correlations (22, 38). The equation for this attenuation effect is as follows:

    r_xy = r̂_xy √(ICC_x × ICC_y),    (7)

where r_xy is the observed correlation between x and y, r̂_xy is the correlation between x and y if both were measured without error (i.e., the correlation between the true scores), and ICC_x and ICC_y are the reliability coefficients for x and y, respectively. Nunnally and Bernstein (38) note that the effect of measurement error on correlation attenuation becomes minimal as ICCs increase above 0.80. In addition, reliability affects the power of statistical tests. Specifically, the lower the reliability, the greater the risk of type 2 error (14, 40). Fleiss (22) illustrates how the magnitude of an ICC can be used to adjust sample size and statistical power calculations (45). In short, low ICCs mean that more subjects are required in a study for a given effect size to be statistically significant (40). An ICC of 0.60 may be perfectly fine if the resulting effect on sample size and statistical power is within the logistical constraints of the study. If, however, an ICC of 0.60 means that, for a required level of power, more subjects must be recruited than is feasible, then 0.60 is not acceptable.
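As a small illustration of equation 7 (the function name and the example reliabilities below are assumptions of this sketch, not values from the article), the following lines show how quickly an assumed true correlation is attenuated as the reliabilities of the two measures fall.

```python
import math

def attenuated_correlation(r_true, icc_x, icc_y):
    """Equation 7: observed correlation implied by a true correlation and two reliabilities."""
    return r_true * math.sqrt(icc_x * icc_y)

# A true correlation of 0.70 is observed as about 0.56 when both measures
# have ICCs of 0.80, and as about 0.42 when both drop to 0.60.
print(attenuated_correlation(0.70, 0.80, 0.80))  # ~0.56
print(attenuated_correlation(0.70, 0.60, 0.60))  # ~0.42
```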
Although infrequently used in the movement sciences, the ICC of test scores can be used in the setting and interpretation of cut points for classification of individuals. Charter and Feldt (15) show how the ICC can be used to estimate the percentage of false-positive, false-negative, true-positive, and true-negative results for a clinical classification scheme. Although the details of these calculations are beyond the scope of this article, it is worthwhile to note that very high ICCs are required to classify individuals with a minimum of misclassification.
THE SEM

Because the general form of the ICC is a ratio of variance due to differences between subjects (the signal) to the total variability in the data (the noise), the ICC is reflective of the ability of a test to differentiate between different individuals (27, 47). It does not provide an index of the expected trial-to-trial noise in the data, which would be useful to practitioners such as strength coaches. Unlike the ICC, which is a relative measure of reliability, the SEM provides an absolute index of reliability. Hopkins (26) refers to this as the "typical error." The SEM quantifies the precision of individual scores on a test (24). The SEM has the same units as the measurement of interest, whereas the ICC is unitless. The interpretation of the SEM centers on the assessment of reliability within individual subjects (45). The direct calculation of the SEM involves the determination of the SD of a large number of scores from an individual (44). In practice, a large number of scores is not typically collected, so the SEM is estimated. Most references estimate the SEM as follows:

    SEM = SD √(1 − ICC),    (8)

where SD is the SD of the scores from all subjects (which can be determined from the ANOVA as √[SS_TOTAL/(n − 1)]) and ICC is the reliability coefficient. Note the similarity between the equation for the SEM and the standard error of estimate from regression analysis. Since different forms of the ICC can result in different numbers, the choice of ICC can substantively affect the size of the SEM, especially if systematic error is present. However, there is an alternative way of calculating the SEM that avoids these uncertainties. The SEM can be estimated as the square root of the mean square error term from the ANOVA (20, 26, 48).
Since this estimate of the SEM has the advantage of being independent of the specific ICC, its use would allow for more consistency in interpreting SEM values from different studies. However, the mean square error terms differ when using the 1-way vs. 2-way models. In Table 2 it can be seen that using a 1-way model (22) would require the use of MS_W (√53.75 = 7.3 kg), whereas use of a 2-way model would require use of MS_E (√57 = 7.6 kg). Hopkins (26) argues that because the 1-way model combines influences of random and systematic error together, "The resulting statistic is biased high and is hard to interpret because the relative contributions of random error and changes in the mean are unknown." He therefore suggests that the error term from the 2-way model (MS_E) be used to calculate the SEM. Note, however, that in this sample the 1-way SEM is smaller than the 2-way SEM. This is because the trials effect is small. The high bias of the 1-way model is observed when the trials effect is large (Table 7). The SEM calculated using the mean square error from the 2-way model (√4.71 = 2.2 kg) is markedly lower than the SEM calculated using the 1-way model (√124.25 = 11.1 kg), since the SEM defined as √MS_E only considers random error. This is consistent with the concept of a SE, which defines noise symmetrically around a central value. This points to the desirability of establishing a measurement schedule that is free of systematic variation.
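The two SEM estimates discussed above can be compared directly. The sketch below uses the values already reported for data set A (SS_total from Table 2, MS_E = 57, ICC 3,1 = 0.95); the variable names are illustrative assumptions of this sketch.

```python
import math

# Data set A (Tables 1 and 2): 16 scores, SS_total = 15,119.8, MS_E = 57, ICC 3,1 = 0.95.
sd_all_scores = math.sqrt(15119.8 / 15)              # SD of all observed scores, ~31.7 kg

sem_from_icc = sd_all_scores * math.sqrt(1 - 0.95)   # equation 8, ~7.1 kg
sem_from_mse = math.sqrt(57.0)                       # square root of MS_E, ~7.6 kg
print(sem_from_icc, sem_from_mse)
```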
Another difference between the ICC and SEM is that the SEM is largely independent of the population from which it was determined; i.e., the SEM "is considered to be a fixed characteristic of any measure, regardless of the sample of subjects under investigation" (38). Thus, the SEM is not affected by between-subjects variability as is the ICC. To illustrate, the MS_E values for the data in Tables 2 and 3 are equal (MS_E = 57), despite large differences in between-subjects variability. The resulting SEM is the same for data sets A and B (√57 = 7.6 kg), yet they have different ICC values (Table 4). The results are similar when calculating the SEM using equation 8, even though equation 8 uses the ICC in calculating the SEM, since the effects of the SD and the ICC tend to offset each other (38). However, the effects do not offset each other completely, and use of equation 8 results in an SEM estimate that is modestly affected by between-subjects variability (2).
The SEM is the SE in estimating observed scores (the scores in your data set) from true scores (38). Of course, our problem is just the opposite. We have the observed scores and would like to estimate subjects' true scores. The SEM has been used to define the boundaries around which we think a subject's true score lies. It is often reported (8, 17) that the 95% CI for a subject's true score can be estimated as follows:

    T = S ± 1.96(SEM),    (9)

where T is the subject's true score, S is the subject's observed score on the measurement, and 1.96 defines the 95% CI. However, strictly speaking this is not correct, since the SEM is symmetrical around the true score, not the observed score (13, 19, 24, 38), and the SEM reflects the SD of the observed scores while holding the true score constant. In lieu of equation 9, an alternate approach is to estimate the subject's true score and calculate an alternate SE (reflecting the SD of true scores while holding observed scores constant). Because of regression to the mean, obtained scores (S) are biased estimators of true scores (16, 19). Scores below the mean are biased downward, and scores above the mean are biased upward. A subject's estimated true score (T) can be calculated as follows:

    T = X̄ + ICC(d),    (10)

where d = S − X̄. To illustrate, consider data set A in Table 1. With a grand mean of 154.5 and an ICC 3,1 of 0.95, an individual with an S of 120 kg would have a predicted T of 154.5 + 0.95(120 − 154.5) = 121.8 kg. Note that because the ICC is high, the bias is small (1.8 kg). The appropriate SE to define the CI of the true score, which some have referred to as the standard error of estimate (13), is as follows (19, 38):

    SEM_TS = SD √[ICC(1 − ICC)].    (11)

In this example the value is 31.74 √[0.95(1 − 0.95)] = 6.92, where 31.74 equals the SD of the observed scores around the grand mean. The 95% CI for T is then 121.8 ± 1.96(6.92), which defines a span of 108.2 to 135.4 kg. The entire process, which has been termed the regression-based approach (16), can be summarized as follows (24):

    95% CI for T = X̄ + ICC(d) ± 1.96 SD √[ICC(1 − ICC)].    (12)

If one had simply used equation 9 with S and the SEM, the resulting interval would span 120 ± 1.96(7.6) = 105.1 to 134.9 kg. Note that the differences between the CIs are small and that the CI width from equation 9 (29.8 kg) is wider than that from equation 12 (27.2 kg). For all ICCs less than 1.0, the CI width will be narrower from equation 12 than from equation 9 (16), but the differences shrink as the ICC approaches 1.0 and as S approaches X̄ (24).
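The regression-based approach of equations 10 through 12 can be collected into a single helper. The sketch below reproduces the worked example in the text (S = 120 kg, grand mean 154.5, SD 31.74, ICC 0.95); the function name and the default z value are illustrative assumptions.

```python
import math

def true_score_ci(observed, grand_mean, sd, icc, z=1.96):
    """Regression-based CI for an estimated true score (equations 10-12)."""
    t_hat = grand_mean + icc * (observed - grand_mean)     # equation 10
    se_ts = sd * math.sqrt(icc * (1 - icc))                # equation 11
    return t_hat, (t_hat - z * se_ts, t_hat + z * se_ts)   # equation 12

# Worked example from the text: estimated true score ~121.8 kg, 95% CI ~108.2 to ~135.4 kg.
print(true_score_ci(120, 154.5, 31.74, 0.95))
```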
MINIMAL DIFFERENCES NEEDED TO BE CONSIDERED REAL
The SEM is an index that can be used to define the difference needed between separate measures on a subject for the difference in the measures to be considered real. For example, if the 1RM of an athlete on one day is 155 kg and at some later time is 160 kg, are you confident that the athlete really increased the 1RM by 5 kg, or is this difference within what you might expect to see in repeated testing just due to the noise in the measurement? The SEM can be used to determine the minimum difference (MD) to be considered "real" and can be calculated as follows (8, 20, 42):

    MD = SEM × 1.96 × √2.    (13)

Once again the point is to construct a 95% CI, and the 1.96 value is simply the z score associated with a 95% CI. (One may choose a different z score instead of 1.96 if a more liberal or more conservative assessment is desired.) But where does the √2 come from?

Why can't we simply calculate the 95% CI for a subject's score as we have done above? If the score is outside that interval, then shouldn't we be 95% confident that the subject's score has really changed? Indeed, this approach has been suggested in the literature (25, 37). The key here is that we now have 2 scores from a subject. Each of these scores has a true component and an error component. That is, both scores were measured with error, and simply seeing if the second score falls outside the CI of the first score does not account for the error in the second score.
What we really want here is an index based on the variability of the difference scores. This can be quantified as the SD of the difference scores (SDd). As it turns out, when there are 2 levels of trials (as in the examples herein), the SEM is equal to the SDd divided by √2 (17, 26):

    SEM = SDd / √2.    (14)

Therefore, multiplying the SEM by √2 solves for the SDd, and then multiplying the SDd by 1.96 allows for the construction of the 95% CI. Once the MD is calculated, then any change in a subject's score, either above or below the previous score, greater than the MD is considered real. More precisely, for all people whose differences on repeated testing are at least greater than or equal to the MD, 95% of them would reflect real differences. Using data set A, the first subject has a trial A1 score of 146 kg. The SEM for the test is √57 = 7.6 kg. From equation 13, MD = 7.6 × 1.96 × √2 = 21.07 kg. Thus, a change of at least 21.07 kg needs to occur to be confident, at the 95% level, that a change in 1RM reflects a real change and not a difference that is within what might be reasonably expected given the measurement error of the 1RM test.
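Equation 13 amounts to one line of code. The following sketch (the function name is an illustrative assumption) reproduces the MD of approximately 21 kg obtained above from an SEM of 7.6 kg.

```python
import math

def minimal_difference(sem, z=1.96):
    """Equation 13: smallest change treated as real at the chosen confidence level."""
    return sem * z * math.sqrt(2)

print(minimal_difference(7.6))  # ~21.07 kg for data set A
```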
However, as with defining a CI for an observed score, the process outlined herein for defining a minimal difference is not precisely accurate. As noted by Charter (13) and Dudek (19), the SE of prediction (SEP) is the correct SE to use in these calculations, not the SEM. The SEP is calculated as follows:

    SEP = SD √(1 − ICC²).    (15)

To define a 95% CI outside which one could be confident that a retest score reflects a real change in performance, simply calculate the estimated true score (equation 10) plus or minus the SEP. To illustrate, consider the same data as in the example in the previous paragraph. From equation 10, we estimate the subject's true score (T) as T = X̄ + ICC(d) = 154.5 + 0.95(146 − 154.5) ≈ 146.4 kg. The SEP = SD × √(1 − ICC²) = 31.74 × √(1 − 0.95²) = 9.91. The resulting 95% CI is 146.4 ± 1.96(9.91), which defines an interval from approximately 127 to 166 kg. Therefore, any retest score outside that interval would be interpreted as reflecting a real change in performance. As given in Table 1, the retest score of 140 kg is inside the CI and would be interpreted as a change consistent with the measurement error of the test, not as a real change in performance. As before, use of a different z score in place of 1.96 will allow for the construction of a more liberal or conservative CI.
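For completeness, the SEP-based interval of equation 15 can be sketched in the same way; applied to the values from the worked example above, it returns an interval of roughly 127 to 166 kg. The function name and the default z = 1.96 are illustrative assumptions of this sketch.

```python
import math

def sep_change_interval(observed, grand_mean, sd, icc, z=1.96):
    """CI for a retest score based on the SE of prediction (equations 10 and 15)."""
    t_hat = grand_mean + icc * (observed - grand_mean)   # estimated true score, equation 10
    sep = sd * math.sqrt(1 - icc ** 2)                   # equation 15
    return t_hat - z * sep, t_hat + z * sep

# Worked example from the text: baseline score 146 kg, grand mean 154.5, SD 31.74, ICC 0.95;
# the retest score of 140 kg falls inside this interval (~127 to ~166 kg).
print(sep_change_interval(146, 154.5, 31.74, 0.95))
```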
OTHER CONSIDERATIONS

Several considerations regarding ICC and SEM calculations will not be addressed in detail in this article, but brief mention is made here. First, the assumptions of ANOVA apply to these data. The most commonly violated assumption is that of homoscedasticity. That is, does the size of the error correlate with the magnitude of the observed scores? If the data exhibit homoscedasticity, the answer is no. For physical performance measures, it is common that the absolute error tends to be larger for subjects who score higher (2, 26); e.g., the noise from repeated strength testing of stronger subjects is likely to be larger than the noise from weaker subjects. If the data exhibit heteroscedasticity, a logarithmic transformation is often appropriate. Second, it is important to realize that ICC and SEM values determined from sample data are estimates. As such, it is instructive to construct CIs for these estimates. Details of how to construct these CIs are addressed in other sources (34, 35). Third, how many subjects are required to get adequate stability for the ICC and SEM calculations? Unfortunately, there is no consensus in this area. The reader is referred to other studies for further discussion (16, 35, 52). Finally, reliability, as quantified by the ICC, is not synonymous with responsiveness to change (23). The MD calculation presented herein allows one to evaluate a change score after the fact. However, a small MD, in and of itself, is not a priori evidence that a given test is responsive.
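Returning to the first consideration above, one simple screen for heteroscedasticity in a 2-trial design (by no means the only one, and an assumption of this sketch rather than a procedure prescribed in the article) is to correlate each subject's absolute between-trial difference with that subject's mean score; a clearly positive correlation suggests that error grows with the magnitude of the scores, and a log transformation of the raw scores before the ANOVA is a common remedy. The function name below is illustrative.

```python
import numpy as np

def heteroscedasticity_check(trial1, trial2):
    """Correlate absolute trial-to-trial differences with subject means;
    a clearly positive correlation suggests heteroscedasticity."""
    trial1, trial2 = np.asarray(trial1, float), np.asarray(trial2, float)
    abs_diff = np.abs(trial2 - trial1)
    means = (trial1 + trial2) / 2
    return np.corrcoef(means, abs_diff)[0, 1]

# Data set A from Table 1.
a1 = [146, 148, 170, 90, 157, 156, 176, 205]
a2 = [140, 152, 152, 99, 145, 153, 167, 218]
print(heteroscedasticity_check(a1, a2))
```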
PRACTICAL APPLICATIONS

For a comprehensive assessment of reliability, a 3-layered approach is recommended. First, perform a repeated-measures ANOVA and cast the summary table as a 2-way model, i.e., with trials and error as separate sources of variance. Evaluate the F ratio for the trials effect to examine systematic error. As noted previously, it may be prudent to evaluate the effect for trials using a more liberal α level than the traditional 0.05 level. If the effect for trials is significant (and the effect size is not trivial), it is prudent to reexamine the measurement schedule for influences of learning and fatigue. If 3 or more levels of trials were included in the analysis, a plateau in performance may be evident, and exclusion of only those levels of trials not in the plateau region in a subsequent reanalysis may be warranted. However, this exclusion of trials needs to be reported. Under these conditions, where systematic error is deemed unimportant, the ICC values will be similar and reflect random error (imprecision). However, it is suggested here that the ICC from equation 3,1 be used (Table 5), since it is most closely tied to the MS_E calculation of the SEM. Once the systematic error is determined to be nonsignificant or trivial, interpret the ICC and SEM within the analytical goals of your study (2). Specifically, researchers interested in group-level responses can use the ICC to assess correlation attenuation, statistical power, and sample size calculations. Practitioners (e.g., coaches, clinicians) can use the SEM (and associated SEs) in the interpretation of scores from individual athletes (CIs for true scores, assessing individual change). Finally, although reliability is an important aspect of measurement, a test may exhibit reliability but not be a valid test (i.e., it may not measure what it purports to measure).
R
EFERENCES
1. A
LEXANDER
, H.W. The estimation of reliability when several
trials are available. Psychometrika 12:79–99. 1947.
2. A
TKINSON
, D.B.,
AND
A.M. N
EVILL
. Statistical methods for as-
sessing measurement error (reliability) in variables relevant to
Sports Medicine. Sports Med. 26:217–238. 1998.
3. B
ARTKO
, J.J. The intraclass reliability coefficient as a measure
of reliability. Psychol. Rep. 19:3–11. 1966.
4. B
ARTKO
, J.J. On various intraclass correlation coefficients.
Psychol. Bull. 83:762–765. 1976.
5. B
AUMGARTNER
, T.A. Estimating reliability when all test trials
are administered on the same day. Res. Q. 40:222–225. 1969.
6. B
AUMGARTNER
, T.A. Norm-referenced measurement: reliabili-
ty. In: Measurement Concepts in Physical Education and Exer-
cise Science. M.J. Safrit and T.M. Woods, eds. Champaign, IL:
Human Kinetics, 1989. pp. 45–72.
240 W
EIR
7. Baumgartner, T.A. Estimating the stability reliability of a score. Meas. Phys. Educ. Exerc. Sci. 4:175–178. 2000.
8. Beckerman, H., T.W. Vogelaar, G.L. Lankhorst, and A.L.M. Verbeek. A criterion for stability of the motor function of the lower extremity in stroke patients using the Fugl-Meyer assessment scale. Scand. J. Rehabil. Med. 28:3–7. 1996.
9. Bedard, M., N.J. Martin, P. Krueger, and K. Brazil. Assessing reproducibility of data obtained with instruments based on continuous measurements. Exp. Aging Res. 26:353–365. 2000.
10. Bland, J.M., and D.G. Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1:307–310. 1986.
11. Brozek, J., and H. Alexander. Components of variance and the consistency of repeated measurements. Res. Q. 18:152–166. 1947.
12. Bruton, A., J.H. Conway, and S.T. Holgate. Reliability: What is it and how is it measured. Physiotherapy 86:94–99. 2000.
13. Charter, R.A. Revisiting the standard error of measurement, estimate, and prediction and their application to test scores. Percept. Mot. Skills 82:1139–1144. 1996.
14. Charter, R.A. Effect of measurement error on tests of statistical significance. J. Clin. Exp. Neuropsychol. 19:458–462. 1997.
15. Charter, R.A., and L.S. Feldt. Meaning of reliability in terms of correct and incorrect clinical decisions: The art of decision making is still alive. J. Clin. Exp. Neuropsychol. 23:530–537. 2001.
16. Charter, R.A., and L.S. Feldt. The importance of reliability as it relates to true score CIs. Meas. Eval. Counseling Dev. 35:104–112. 2002.
17. Chinn, S. Repeatability and method comparison. Thorax 46:454–456. 1991.
18. Chinn, S., and P.G. Burney. On measuring repeatability of data from self-administered questionnaires. Int. J. Epidemiol. 16:121–127. 1987.
19. Dudek, F.J. The continuing misinterpretation of the standard error of measurement. Psychol. Bull. 86:335–337. 1979.
20. Eliasziw, M., S.L. Young, M.G. Woodbury, and K. Fryday-Field. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Phys. Ther. 74:777–788. 1994.
21. Feldt, L.S., and M.E. McKee. Estimation of the reliability of skill tests. Res. Q. 29:279–293. 1958.
22. Fleiss, J.L. The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons, 1986.
23. Guyatt, G., S. Walter, and G. Norman. Measuring change over time: assessing the usefulness of evaluative instruments. J. Chronic Dis. 40:171–178. 1987.
24. Harvill, L.M. Standard error of measurement. Educ. Meas. Issues Pract. 10:33–41. 1991.
25. Hebert, R., D.J. Spiegelhalter, and C. Brayne. Setting the minimal metrically detectable change on disability rating scales. Arch. Phys. Med. Rehabil. 78:1305–1308. 1997.
26. Hopkins, W.G. Measures of reliability in sports medicine and science. Sports Med. 30:375–381. 2000.
27. Keating, J., and T. Matyas. Unreliable inferences from reliable measurements. Aust. Physiother. 44:5–10. 1998.
28. Keppel, G. Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.
29. Kroll, W. A note on the coefficient of intraclass correlation as an estimate of reliability. Res. Q. 33:313–316. 1962.
30. Lahey, M.A., R.G. Downey, and F.E. Saal. Intraclass correlations: there's more than meets the eye. Psychol. Bull. 93:586–595. 1983.
31. Liba, M. A trend test as a preliminary to reliability estimation. Res. Q. 38:245–248. 1962.
32. Looney, M.A. When is the intraclass correlation coefficient misleading? Meas. Phys. Educ. Exerc. Sci. 4:73–78. 2000.
33. Ludbrook, J. Statistical techniques for comparing measures and methods of measurement: A critical review. Clin. Exp. Pharmacol. Physiol. 29:527–536. 2002.
34. McGraw, K.O., and S.P. Wong. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1:30–46. 1996.
35. Morrow, J.R., and A.W. Jackson. How "significant" is your reliability? Res. Q. Exerc. Sport 64:352–355. 1993.
36. Nichols, C.P. Choosing an intraclass correlation coefficient. Available at: www.spss.com/tech/stat/articles/whichicc.htm. Accessed 1998.
37. Nitschke, J.E., J.M. McMeeken, H.C. Burry, and T.A. Matyas. When is a change a genuine change? A clinically meaningful interpretation of grip strength measurements in healthy and disabled women. J. Hand Ther. 12:25–30. 1999.
38. Nunnally, J.C., and I.H. Bernstein. Psychometric Theory (3rd ed.). New York: McGraw-Hill, 1994.
39. Olds, T. Five errors about error. J. Sci. Med. Sport 5:336–340. 2002.
40. Perkins, D.O., R.J. Wyatt, and J.J. Bartko. Penny-wise and pound-foolish: The impact of measurement error on sample size requirements in clinical trials. Biol. Psychiatry 47:762–766. 2000.
41. Portney, L.G., and M.P. Watkins. Foundations of Clinical Research (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
42. Roebroeck, M.E., J. Harlaar, and G.J. Lankhorst. The application of generalizability theory to reliability assessment: An illustration using isometric force measurements. Phys. Ther. 73:386–401. 1993.
43. Rousson, V., T. Gasser, and B. Seifert. Assessing intrarater, interrater, and test-retest reliability of continuous measurements. Stat. Med. 21:3431–3446. 2002.
44. Safrit, M.J.E. Reliability Theory. Washington, DC: American Alliance for Health, Physical Education, and Recreation, 1976.
45. Shrout, P.E. Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7:301–317. 1998.
46. Shrout, P.E., and J.L. Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 36:420–428. 1979.
47. Stratford, P. Reliability: consistency or differentiating between subjects? [Letter]. Phys. Ther. 69:299–300. 1989.
48. Stratford, P.W., and C.H. Goldsmith. Use of standard error as a reliability index of interest: An applied example using elbow flexor strength data. Phys. Ther. 77:745–750. 1997.
49. Streiner, D.L., and G.R. Norman. Measurement Scales: A Practical Guide to Their Development and Use (2nd ed.). Oxford: Oxford University Press, 1995. pp. 104–127.
50. Thomas, J.R., and J.K. Nelson. Research Methods in Physical Activity (2nd ed.). Champaign, IL: Human Kinetics, 1990. pp. 352.
51. Traub, R.E., and G.L. Rowley. Understanding reliability. Educ. Meas. Issues Pract. 10:37–45. 1991.
52. Walter, S.D., M. Eliasziw, and A. Donner. Sample size and optimal designs for reliability studies. Stat. Med. 17:101–110. 1998.
Acknowledgments
I am indebted to Lee Brown, Joel Cramer, Bryan Heiderscheit,
Terry Housh, and Bob Oppliger for their helpful comments on
drafts of the paper.
Address correspondence to Dr. Joseph P. Weir,
joseph.weir@dmu.edu.