Journal of Strength and Conditioning Research, 2005, 19(1), 231–240
©2005 National Strength & Conditioning Association

Brief Review
QUANTIFYING TEST-RETEST RELIABILITY USING THE INTRACLASS CORRELATION COEFFICIENT AND THE SEM

JOSEPH P. WEIR

Applied Physiology Laboratory, Division of Physical Therapy, Des Moines University—Osteopathic Medical Center, Des Moines, Iowa 50312.
ABSTRACT. Weir, J.P. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J. Strength Cond. Res. 19(1):231–240. 2005.—Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). In addition, the SEM, which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be one concerning the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the necessary variance components for the calculation of ICC values, are useful to determine if systematic error is present. If so, the measurement schedule should be modified (removing trials where learning and/or fatigue effects are present) to remove systematic error, and ICC equations that only consider random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference needed to be exhibited for one to be confident that a true change in performance of an individual has occurred.
KEY WORDS. reproducibility, precision, error, consistency, SEM, intraclass correlation coefficient
INTRODUCTION
Reliability refers to the consistency of a test or
measurement. For a seemingly simple concept,
the quantifying of reliability and interpreta-
tion of the resulting numbers are surprisingly
unclear in the biomedical literature in general
(49) and in the sport sciences literature in particular. Part
of this stems from the fact that reliability can be assessed
in a variety of different contexts. In the sport sciences,
we are most often interested in simple test-retest reli-
ability; this is what Fleiss (22) refers to as a simple reli-
ability study. For example, one might be interested in the
reliability of 1 repetition maximum (1RM) squat mea-
sures taken on the same athletes over different days.
However, if one is interested in the ability of different
testers to get the same results from the same subjects on
skinfold measurements, one is now interested in the in-
terrater reliability. The quantifying of reliability in these
different situations is not necessarily the same, and the
decisions regarding how to calculate reliability in these
different contexts has not been adequately addressed in
the sport sciences literature. In this article, I focus on
test-retest reliability (but not limited in the number of
retest trials). In addition, I discuss data measured on a
continuous scale.
Confusion also stems from the jargon used in the con-
text of reliability, i.e., consistency, precision, repeatabili-
ty, and agreement. Intuitively, these terms describe the
same concept, but in practice some are operationalized
differently. Notably, reliability and agreement are not
synonymous (30, 49). Further, reliability, conceptualized
as consistency, consists of both absolute consistency and
relative consistency (44). Absolute consistency concerns
the consistency of scores of individuals, whereas relative
consistency concerns the consistency of the position or
rank of individuals in the group relative to others. In the
fields of education and psychology, the term reliability is
operationalized as relative consistency and quantified us-
ing reliability coefficients called intraclass correlation co-
efficients (ICCs) (49). Issues regarding quantifying ICCs
and their interpretation are discussed in the first half of
this article. Absolute consistency, quantified using the
SEM, is addressed in the second half of the article. In
brief, the SEM is an indication of the precision of a score,
and its use allows one to construct confidence intervals
(CIs) for scores.
Another confusing aspect of reliability calculations is
that a variety of different procedures, besides ICCs and
SEM, have been used to determine reliability. These in-
clude the Pearson r, the coefficient of variation, and the
LOA (Bland-Altman plots). The Pearson product moment
correlation coefficient (Pearson r) was often used in the
past to quantify reliability, but the use of the Pearson r
is typically discouraged for assessing test-retest reliabil-
ity (7, 9, 29, 33, 44); however, this recommendation is not
universal (43). The primary, although not exclusive,
weakness of the Pearson r is that it cannot detect system-
atic error. More recently, the limits of agreement (LOA)
described by Bland and Altman (10) have come into vogue
in the biomedical literature (2). The LOA will not be ad-
dressed in detail herein other than to point out that the
procedure was developed to examine agreement between
2 different techniques of quantifying some variable (so-
called method comparison studies, e.g., one could compare
testosterone concentration using 2 different bioassays),
not reliability per se. The use of LOA as an index of re-
liability has been criticized in detail elsewhere (26, 49).
In this article, the ICC and SEM will be the focus.
Unfortunately, there is considerable confusion concerning
both the calculation and interpretation of the ICC. In-
deed, there are 6 common versions of the ICC (and others
as well), and the choice of which version to use is not
intuitively obvious. Similarly, the SEM, which is inti-
mately related to the ICC, has useful applications that
are not fully appreciated by practitioners in the move-
ment sciences. The purposes of this article are to provide
information on the choice and application of the ICC and
to encourage practitioners to use the SEM in the inter-
pretation of test data.
THE ICC
Reliability Theory
For a group of measurements, the total variance (σ_T²) in the data can be thought of as being due to true score variance (σ_t²) and error variance (σ_e²). Similarly, each observed score is composed of the true score and error (44). The theoretical true score of an individual reflects the mean of an infinite number of scores from a subject, whereas error equals the difference between the true score and the observed score (21). Sources of error include errors due to biological variability, instrumentation, error by the subject, and error by the tester. If we make a ratio of the σ_t² to the σ_T² of the observed scores, where σ_T² equals σ_t² plus σ_e², we have the following reliability coefficient:

$$R = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}. \qquad (1)$$

The closer this ratio is to 1.0, the higher the reliability and the lower the σ_e². Since we do not know the true score for each subject, an index of the σ_t² is used based on between-subjects variability, i.e., the variance due to how subjects differ from each other. In this context, reliability (relative consistency) is formally defined (5, 21, 49) as follows:

$$\text{reliability} = \frac{\text{between-subjects variability}}{\text{between-subjects variability} + \text{error}}. \qquad (2)$$
The reliability coefficient in Equation 2 is quantified
by various ICCs. So although reliability is conceptually
aligned with terms such as reproducibility, repeatability,
and agreement, it is defined as above. The necessary var-
iance estimates are derived from analysis of variance
(ANOVA), where appropriate mean square values are re-
corded from the computer printout. Specifically, the var-
ious ICCs can be calculated from mean square values de-
rived from a within-subjects, single-factor ANOVA (i.e., a
repeated-measures ANOVA).
The ICC is a relative measure of reliability (18) in that
it is a ratio of variances derived from ANOVA, is unitless,
and is more conceptually akin to R² from regression (43)
than to the Pearson r. The ICC can theoretically vary
between 0 and 1.0, where an ICC of 0 indicates no reli-
ability, whereas an ICC of 1.0 indicates perfect reliability.
In practice, ICCs can extend beyond the range of 0 to 1.0
(30), although with actual data this is rare. The relative
nature of the ICC is reflected in the fact that the mag-
nitude of an ICC depends on the between-subjects vari-
ability (as shown in the next section). That is, if subjects
differ little from each other, ICC values are small even if
trial-to-trial variability is small. If subjects differ from
each other a lot, ICCs can be large even if trial-to-trial
variability is large. Thus, the ICC for a test is context
specific (38, 51). As noted by Streiner and Norman (49),
‘‘There is literally no such thing as the reliability of a test,
unqualified; the coefficient has meaning only when ap-
plied to specific populations.’’ Further, it is intuitive that
small differences between individuals are more difficult
to detect than large ones, and the ICC is reflective of this
(49).
Error is typically considered as being of 2 types: sys-
tematic error (e.g., bias) and random error (2, 39). (Gen-
eralizability theory expands sources of error to include
various facets of interest but is beyond the scope of this
article.) Total error reflects both systematic error and
random error (imprecision). Systematic error includes
both constant error and bias (38). Constant error affects
all scores equally, whereas bias is systematic error that
affects certain scores differently than others. For physical
performance measures, the distinction between constant
error and bias is relatively unimportant and the focus
here on systematic error is on situations that result in a
unidirectional change in scores on repeated testing. In
testing of physical performance, subjects may improve
their test scores simply due to learning effects, e.g., per-
forming the first test serves as practice for subsequent
tests, or fatigue or soreness may result in poorer perfor-
mance across trials. In contrast, random error refers to
sources of error that are due to chance factors. Factors
such as luck, alertness, attentiveness by the tester, and
normal biological variability affect a particular score.
Such errors should, in a random manner, both increase
and decrease test scores on repeated testing. Thus, we
can expand Equation 1 as follows:
$$R = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_{se}^2 + \sigma_{re}^2}, \qquad (3)$$

where σ_se² is the systematic and σ_re² is the random component of σ_e².
It has been argued that systematic error is a concern
of validity and not reliability (12, 43). Similarly, system-
atic error (e.g., learning effects, fatigue) has been sug-
gested to be a natural phenomenon and therefore does
not contribute to unreliability per se in test-retest situa-
tions (43). Thus, there is a school of thought that suggests
that only random error should be assessed in reliability
calculations. Under this analysis, the error term in the
denominator will only reflect random error and not sys-
tematic error, increasing the size of reliability coeffi-
cients. The issue of inclusion of systematic error in the
determination of reliability coefficients is addressed in a
subsequent section.
The Basic Calculations
The calculation of reliability starts with the performance
of a repeated-measures ANOVA. This analysis performs
2 functions. First, the inferential test of mean differences
across trials is an assessment of systematic error (trend).
Second, all of the subsequent calculations can be derived
from the output from this ANOVA. In keeping with the
nomenclature of Keppel (28), the ANOVA that is used is
of a single-factor, within-subjects (repeated-measures) de-
sign. Unfortunately, the language gets a bit tortured in
many sources, because the different ICC models are re-
ferred to as either 1-way or 2-way models; what is im-
portant to keep in mind is that both the 1-way and 2-way
ICC models can be derived from the same single-factor,
within-subjects ANOVA.
TABLE 1. Example data set.

Trial A1   Trial A2   D     Trial B1   Trial B2   D
146        140        -6    166        160        -6
148        152        +4    168        172        +4
170        152        -18   160        142        -18
90         99         +9    150        159        +9
157        145        -12   147        135        -12
156        153        -3    146        143        -3
176        167        -9    156        147        -9
205        218        +13   155        168        +13
156 ± 33   153 ± 33         156 ± 8    153 ± 13

(Bottom row: mean ± SD for each trial.)
TABLE 2. Two-way analysis of variance summary table for data set A.*

Source              df   SS        Mean square                          F      p value
Between subjects    7    14,689.8  2098.4 (MS_B: 1-way; MS_S: 2-way)    36.8
Within subjects     8    430       53.75 (MS_W)
  Trials            1    30.2      30.2 (MS_T)                          0.53   0.49
  Error             7    399.8     57 (MS_E)
Total               15   15,119.8

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
To illustrate the calculations, example data are pre-
sented in Table 1. ANOVA summary tables are presented
in Tables 2 and 3, and the resulting ICCs are presented
in Table 4. Focus on the first two columns of Table 1,
which are labeled trial A1 and trial A2. As can be seen,
there are 2 sets (columns) of scores, and each set has 8
scores. In this example, each of 8 subjects has provided a
score in each set. Assume that each set of scores repre-
sents the subjects’ scores on the 1RM squat across 2 dif-
ferent days (trials). A repeated-measures ANOVA is per-
formed to primarily test whether the 2 sets of scores are
significantly different from each other (i.e., do the scores
systematically change between trials) and is summarized
in Table 2. Equivalently, one could have used a paired t-
test, since there were only 2 levels of trials. However, the
ANOVA is applicable to situations with 2 or more trials
and is consistent with the ICC literature in defining
sources of variance for ICC calculations. Note that there
are 3 sources of variability in Table 2: subjects, trials, and
error. In a repeated-measures ANOVA such as this, it is
helpful to remember that this analysis might be consid-
ered as having 2 factors: the primary factor of trials and
a secondary factor called subjects (with a sample size of
1 subject per cell). The error term includes the interaction
effect of trials by subjects. It is useful to keep these sourc-
es of variability in mind for 2 reasons. First, the 1-way
and 2-way models of the ICC (6, 44) either collapse the
variability due to trials and error together (1-way models)
or keep them separate (2-way models). Note that the tri-
als and error sources of variance, respectively, reflect the
systematic and random sources of error in the σ_e² of the
reliability coefficient. These differences are illustrated in
Table 2, where the df and sums of squares values for error
in the 1-way model (within-subjects source) are simply
the sum of the respective values for trials and error in
the 2-way model.
Second, unlike a between-subjects ANOVA where the
‘‘noise’’ due to different subjects is part of the error term,
the variability due to subjects is now accounted for (due
to the repeated testing) and therefore not a part of the
error term. Indeed, for the calculation of the ICC, the nu-
merator (the signal) reflects the variance due to subjects.
Since the error term of the ANOVA reflects the interac-
tion between subjects and trials, the error term is small
in situations where all the subjects change similarly
across test days. In situations where subjects do not
change in a similar manner across test days (e.g., some
subjects’ scores increase, whereas others decrease), the
error term is large. In the former situation, even small
differences across test days, as long as they are consistent
across all the subjects, can result in a statistically signif-
icant effect for trials. In this example, however, the effect
for trials is not statistically significant (p50.49), indi-
cating that there is no statistically significant systematic
error in the data. It should be kept in mind, however, that
the statistical power of the test of mean differences be-
tween trials is affected by sample size and random error.
Small sample sizes and noisy data (i.e., high random er-
ror) will decrease power and potentially hide systematic
error. Thus, an inferential test of mean differences alone
is insufficient to quantify reliability. Further, evaluation
of the effect for trials ought to be evaluated with a more
liberal α, since in this case, the implications of
a type 2 error are more severe than a type 1 error. In
cases where systematic error is present, it may be pru-
dent to change the measurement schedule (e.g., add trials
if a learning effect is present or increase rest intervals if
fatigue is present) to compensate for the bias.
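
For readers who wish to verify these quantities, the following minimal Python sketch (not part of the original analysis; it assumes only numpy) reproduces the variance partition of Table 2 from the raw scores of data set A.

import numpy as np

# Data set A from Table 1 (rows = subjects, columns = trials).
x = np.array([[146, 140], [148, 152], [170, 152], [90, 99],
              [157, 145], [156, 153], [176, 167], [205, 218]], dtype=float)
n, k = x.shape                                   # 8 subjects, 2 trials
grand_mean = x.mean()

ss_total = ((x - grand_mean) ** 2).sum()
ss_subjects = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()  # between subjects
ss_trials = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()    # trials (systematic)
ss_within = ss_total - ss_subjects                            # 1-way error term
ss_error = ss_within - ss_trials                              # 2-way error term

ms_b = ss_subjects / (n - 1)             # MS_B / MS_S, ~2098
ms_w = ss_within / (n * (k - 1))         # MS_W, ~53.75
ms_t = ss_trials / (k - 1)               # MS_T, ~30.2
ms_e = ss_error / ((n - 1) * (k - 1))    # MS_E, ~57
f_trials = ms_t / ms_e                   # ~0.53 (p ~0.49 in Table 2)
print(ms_b, ms_w, ms_t, ms_e, f_trials)

The same partition, applied to data sets B and C, reproduces Tables 3 and 7.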
Shrout and Fleiss (46) have presented 6 forms of the
ICC. This system has taken hold in the physical therapy
literature. However, the specific nomenclature of their
system does not seem to be as prevalent in the exercise
physiology, kinesiology, and sport science literature,
which has instead ignored which model is used or focused
on ICC terms that are centered on either 1-way or 2-way
ANOVA models (6, 44). Nonetheless, the ICC models of
TABLE 3. Analysis of variance summary table for data set B.*

Source              df   SS     Mean square                        F      p value
Between subjects    7    1330   190 (MS_B: 1-way; MS_S: 2-way)     3.3
Within subjects     8    430    53.75 (MS_W)
  Trials            1    30.2   30.2 (MS_T)                        0.53   0.49
  Error             7    399.8  57 (MS_E)
Total               15   1760

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
Shrout and Fleiss (46) overlap with the 1-way and 2-way
models presented by Safrit (44) and Baumgartner (6).
Three general models of the ICC are present in the
Shrout and Fleiss (46) nomenclature, which are labeled
1, 2, and 3. Each model can be calculated 1 of 2 ways. If
the scores in the analysis are from single scores from each
subject for each trial (or rater if assessing interrater re-
liability), then the ICC is given a second designation of 1.
If the scores in the analysis represent the average of the
k scores from each subject (i.e., the average across the
trials), then the ICC is given a second designation of k.
In this nomenclature then, an ICC with a model desig-
nation of 2,1 indicates an ICC calculated using model 2
with single scores. The use of these models is typically
presented in the context of determining rater reliability
(41). For model 1, each subject is assumed to be assessed
by a different set of raters than other subjects, and these
raters are assumed to be randomly sampled from the pop-
ulation of possible raters so that raters are a random ef-
fect. Model 2 assumes each subject was assessed by the
same group of raters, and these raters were randomly
sampled from the population of possible raters. In this
case, raters are also considered a random effect. Model 3
assumes each subject was assessed by the same group of
raters, but these particular raters are the only raters of
interest, i.e., one does not wish to generalize the ICCs
beyond the confines of the study. In this case, the analysis
attempts to determine the reliability of the raters used
by that particular study, and raters are considered a fixed
effect.
The 1-way ANOVA models (6, 44) coincide with model
1,k for situations where scores are averaged and model
1,1 for single scores for a given trial (or rater). Further,
ICC 1,1 coincides with the 1-way ICC model described by
Bartko (3, 4), and ICC 1,k has also been termed the
Spearman Brown prediction formula (4). Similarly, ICC
values derived from single and averaged scores calculated
using the 2-way approach (6, 44) coincide with models 3,1
and 3,k, respectively. Calculations coincident with models
2,1 and 2,k were not reported by Baumgartner (6) or Saf-
rit (44).
More recently, McGraw and Wong (34) expanded the
Shrout and Fleiss (46) system to include 2 more general
forms, each also with a single score or average score ver-
sion, resulting in 10 ICCs. These ICCs have now been
incorporated into SPSS statistical software starting with
version 8.0 (36). Fortunately, 4 of the computational for-
mulas of Shrout and Fleiss (46) also apply to the new
forms of McGraw and Wong (34), so the total number of
formulas is not different.
The computational formulas for the ICC models of
Shrout and Fleiss (46) and McGraw and Wong (34) are
summarized in Table 5. Unfortunately, it is not intuitive-
ly obvious how the computational formulas reflect the in-
tent of equations 1 through 3. This stems from the fact
that the computational formulas reported in most sources
are derived from algebraic manipulations of basic equa-
tions where mean square values from ANOVA are used
to estimate the various σ² values reflected in equations 1 through 3. To illustrate, the manipulations for ICC 1,1 (random-effects, 1-way ANOVA model) are shown herein. First, the computational formula for ICC 1,1 is as follows:

$$ICC(1,1) = \frac{MS_B - MS_W}{MS_B + (k - 1)\,MS_W}, \qquad (4)$$

where MS_B indicates the between-subjects mean square, MS_W indicates the within-subjects mean square, and k is the number of trials (3, 46). The relevant mean square values can be found in Table 2. To relate this computational formula to equation 1, one must know that estimation of the appropriate σ² comes from expected mean squares from ANOVA. Specifically, for this model the expected MS_B equals σ_e² plus kσ_s², whereas the expected MS_W equals σ_e² (3); therefore, MS_B equals MS_W plus kσ_s². If from equation 1 we estimate σ_t² from between-subjects variance (σ_s²), then

$$ICC = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}. \qquad (5)$$

By algebraic manipulation (e.g., σ_s² = [MS_B − MS_W]/k) and substitution of the expected mean squares into equation 5, it can be shown that

$$ICC(1,1) = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2} = \frac{(MS_B - MS_W)/k}{(MS_B - MS_W)/k + MS_W} = \frac{MS_B - MS_W}{MS_B + (k - 1)\,MS_W}. \qquad (6)$$
Similar derivations can be made for the other ICC models
(3, 34, 46, 49) so that all ultimately relate to equation 1.
Of note is that with the different ICC models (fixed vs.
random effects, 1-way vs. 2-way ANOVA), the expected
mean squares change and thus the computational for-
mulas commonly found in the literature (30, 41) also
change.
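
As a concrete check, a minimal Python sketch (the helper names are ours) applies equation 4, and the corresponding averaged-scores form listed in Table 5, to the mean squares in Tables 2 and 3; the output matches the 1-way entries of Table 4.

def icc_1_1(ms_b, ms_w, k):
    # Equation 4: 1-way random-effects ICC for single scores.
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)

def icc_1_k(ms_b, ms_w):
    # 1-way random-effects ICC when the criterion score is the mean of k trials.
    return (ms_b - ms_w) / ms_b

# Mean squares from Tables 2 and 3 (k = 2 trials).
print(round(icc_1_1(2098.4, 53.75, 2), 2))   # data set A, ~0.95
print(round(icc_1_1(190.0, 53.75, 2), 2))    # data set B, ~0.56
print(round(icc_1_k(2098.4, 53.75), 2))      # data set A, ~0.97
print(round(icc_1_k(190.0, 53.75), 2))       # data set B, ~0.72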
Choosing an ICC
Given the 6 ICC versions of Shrout and Fleiss (46) and
the 10 versions presented by McGraw and Wong (34), the
choice of ICC is perplexing, especially considering that
TABLE 4. ICC values for data sets A and B.*

ICC type   Data set A   Data set B
1,1        0.95         0.56
1,k        0.97         0.72
2,1        0.95         0.55
2,k        0.97         0.71
3,1        0.95         0.54
3,k        0.97         0.70

* ICC = intraclass correlation coefficient.
TABLE 6. Example data set with systematic error.

Trial C1   Trial C2   D
146        161        +15
148        162        +14
170        189        +19
90         100        +10
157        175        +18
156        171        +15
176        195        +19
205        219        +14
156 ± 33   172 ± 35

(Bottom row: mean ± SD for each trial.)
TABLE 5. Intraclass correlation coefficient model summary table.*

Shrout and Fleiss   Computational formula                                       McGraw and Wong   Model
1,1                 (MS_B - MS_W) / [MS_B + (k - 1)MS_W]                        1                 1-way random
1,k                 (MS_B - MS_W) / MS_B                                        k                 1-way random
Use 3,1                                                                         C,1               2-way random
Use 3,k                                                                         C,k               2-way random
2,1                 (MS_S - MS_E) / [MS_S + (k - 1)MS_E + k(MS_T - MS_E)/n]     A,1               2-way random
2,k                 (MS_S - MS_E) / [MS_S + (MS_T - MS_E)/n]                    A,k               2-way random
3,1                 (MS_S - MS_E) / [MS_S + (k - 1)MS_E]                        C,1               2-way fixed
3,k                 (MS_S - MS_E) / MS_S                                        C,k               2-way fixed
Use 2,1                                                                         A,1               2-way fixed
Use 2,k                                                                         A,k               2-way fixed

* Adapted from Shrout and Fleiss (46) and McGraw and Wong (34). Mean square abbreviations are based on the 1-way and 2-way analysis of variance illustrated in Table 2; k = number of trials, n = number of subjects. For McGraw and Wong, A = absolute and C = consistency. MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square.
most of the literature deals with rater reliability not test-
retest reliability of physical performance measures. In a
classic paper, Brozek and Alexander (11) first introduced
the concept of the ICC to the movement sciences litera-
ture and detailed the implementation of an ICC for ap-
plication to test-retest analysis of motor tasks. Their co-
efficient is equivalent to model 3,1. Thus, one might use
ICC 3,1 with test-retest reliability where trials is substi-
tuted for raters. From the rater nomenclature above, if
one does not wish to generalize the reliability findings but
rather assert that in our hands the procedures are reli-
able, then ICC 3,1 seems like a logical choice. However,
this ICC does not include variance associated with sys-
tematic error and is in fact closely approximated by the
Pearson r (1, 43). Therefore, the criticism of the Pearson
r as an index of reliability holds as well for ICCs derived
from model 3. At the least, it needs to be established that
the effect for trials (bias) is trivial if reporting an ICC
derived from model 3. Use of effect size for the trials effect
in the ANOVA would provide information in this regard.
With respect to ICC 3,1, Alexander (1) notes that it ‘‘may
be regarded as an estimate of the value that would have
been obtained if the fluctuation [systematic error] had
been avoided.’’
In a more general sense, there are 4 issues to be ad-
dressed in choosing an ICC: (a) 1- or 2-way model, (b)
fixed- or random-effect model, (c) include or exclude sys-
tematic error in the ICC, and (d) single or mean score.
With respect to choosing a 1- or 2-way model, in a 1-way
model, the effect of raters or trials (replication study) is
not crossed with subjects, meaning that it allows for sit-
uations where all raters do not score all subjects (48).
Fleiss (22) uses the 1-way model for what he terms simple
replication studies. In this model, all sources of error are
lumped together into the MS
W
(Tables 2 and 3). In con-
trast, the 2-way models allow the error to be partitioned
between systematic and random error. When systematic
error is small, MS_W from the 1-way model and error mean
square (MS_E) from the 2-way models (reflecting random
error) are similar, and the resulting ICCs are similar.
This is true for both data sets A and B. When systematic
error is substantial, MS_W and MS_E are disparate, as in
data set C (Tables 6 and 7). Two-way models require tri-
als or raters to be crossed with subjects (i.e., subjects pro-
vide scores for all trials or each rater rates all subjects).
For test-retest situations, the design dictates that trials
are crossed with subjects and therefore lend themselves
to analysis by 2-way models.
Regarding fixed vs. random effects, a fixed factor is
one in which all levels of the factor of interest (in this
TABLE 7. Analysis of variance summary table for data set C.*

Source              df   SS      Mean square                         F        p value
Between subjects    7    15,925  2275 (MS_B: 1-way; MS_S: 2-way)     482.58
Within subjects     8    994     124.25 (MS_W)
  Trials            1    961.0   961.0 (MS_T)                        203.85   <0.0001
  Error             7    33.0    4.71 (MS_E)
Total               15   16,919

* MS_B = between-subjects mean square; MS_E = error mean square; MS_S = subjects mean square; MS_T = trials mean square; MS_W = within-subjects mean square; SS = sums of squares.
case trials) are included in the analysis and no attempt
at generalization of the reliability data beyond the con-
fines of the study is expected. Determining the reliability
of a test before using it in a larger study fits this descrip-
tion of fixed effect. A random factor is one in which the
levels of the factor in the design (trials) are but a sample
of the possible levels, and the analysis will be used to
generalize to other levels. For example, a study designed
to evaluate the test-retest reliability of the vertical jump
for use by other coaches (with similar athletes) would con-
sider the effect of trials to be a random effect. Both Shrout
and Fleiss (46) models 1 and 2 are random-effects models,
whereas model 3 is a fixed-effect model. From this dis-
cussion, for the 2-way models of Shrout and Fleiss (46),
the choice between model 2 and model 3 appears to hinge
on a decision regarding a random- vs. fixed-effects model.
However, models 2 and 3 also differ in their treatment of
systematic error. As noted previously, model 3 only con-
siders random error, whereas model 2 considers both ran-
dom and systematic error. This system does not include
a 2-way fixed-effects model that includes systematic error
and does not offer a 2-way random-effects model that only
considers random error. The expanded system of McGraw
and Wong (34) includes these options. In the nomencla-
ture of McGraw and Wong (34), the designation C refers
to consistency and A refers to absolute agreement. That
is, the C models consider only random error and the A
models consider both random and systematic error. As
noted in Table 5, no new computational formulas are re-
quired beyond those presented by Shrout and Fleiss (46).
Thus, if one were to choose a 2-way random-effects model
that only addressed random error, one would use equa-
tion 3,1 (or equation 3,k if the mean across k trials is the
criterion score). Similarly, if one were to choose a 2-way
fixed-effects model that addressed both systematic and
random error, equation 2,1 would be used (or 2,k). Ulti-
mately then, since the computational formulas do not dif-
fer between systems, the choice between using the Shrout
and Fleiss (46) equations from models 2 vs. 3 hinge on
decisions regarding inclusion or exclusion of systematic
error in the calculations. As noted by McGraw and Wong
(34), ‘‘the random-fixed effects distinction is in its effect
on the interpretation, but not calculation, of an ICC.’’
Should systematic error be included in the ICC? First,
if the effect for trials is small, the systematic differences
between trials will be small, and the ICCs will be similar
to each other. This is evident in both the A and B data
sets (Tables 1 through 3 ). However, if the mean differ-
ences are large, then differences between ICCs are evi-
dent, especially between equation 3,1, which does not con-
sider systematic error, and equations 1,1 and 2,1, which
do consider systematic error. In this regard, the F test for
trials and the ICC calculations may give contradictory re-
sults from the same data. Specifically, it can be the case
that an ICC can be large (indicating good reliability),
whereas the ANOVA shows a significant trials effect. An
example is given in Tables 6 and 7. In this example, each
score in trial C1 was altered in trial C2 so that there was
a bias of +15 kg and a random component added to each
score. The effect for trials was significant (F(1,7) = 203.85,
p < 0.001) and reflected a mean increase of 16 kg. For an
ANOVA to be significant, the effect must be large (in this
case, the mean differences between trials must be large),
the noise (error term) must be small, or both. The error
term is small when all subjects behave similarly across
test days. When this is the case, even small mean differ-
ences can be statistically significant. In this case, the sys-
tematic differences explain a significant amount of vari-
ability in the data. Despite the rather large systematic
error, the ICC values from equations 1,1; 2,1; and 3,1
were 0.896, 0.901, and 0.998, respectively. A cursory ex-
amination of just the ICC scores would suggest that the
test exhibited good reliability, especially using equation
3,1, which only reflects random error. However, an ap-
proximately 10% increase in scores from trial C1 to C2
would suggest otherwise. Thus, an analysis that only fo-
cuses on the ICC without consideration of the trials effect
is incomplete (31). If the effect for trials is significant, the
most straightforward approach is to develop a measure-
ment schedule that will attenuate systematic error (2,
50). For example, if learning effects are present, one
might add trials until a plateau in performance occurs.
Then the ICC could be calculated only on the trials in the
plateau region. The identification of such a measurement
schedule would be especially helpful for random-effects
situations where others might be using the test being
evaluated. For simplicity, all the examples here have
been with only 2 levels for trials. If a trials effect is sig-
nificant, however, 2 trials are insufficient to identify a
plateau. The possibility of a significant trials effect should
be considered in the design of the reliability study. For-
tunately, the ANOVA procedures require no modification
to accommodate any number of trials.
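
The divergence among models when systematic error is present can be verified with a short Python sketch (assuming numpy; the single-score formulas are those listed in Table 5) applied to data set C.

import numpy as np

# Data set C (Table 6), which contains a large systematic shift between trials.
x = np.array([[146, 161], [148, 162], [170, 189], [90, 100],
              [157, 175], [156, 171], [176, 195], [205, 219]], dtype=float)
n, k = x.shape
grand = x.mean()
ss_total = ((x - grand) ** 2).sum()
ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()
ss_trials = n * ((x.mean(axis=0) - grand) ** 2).sum()
ss_error = ss_total - ss_subj - ss_trials

ms_b = ms_s = ss_subj / (n - 1)                   # ~2275
ms_w = (ss_trials + ss_error) / (n * (k - 1))     # ~124.25
ms_t = ss_trials / (k - 1)                        # ~961
ms_e = ss_error / ((n - 1) * (k - 1))             # ~4.71

icc11 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
icc21 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_t - ms_e) / n)
icc31 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e)
print(round(icc11, 2), round(icc21, 2), round(icc31, 2))  # ~0.90, ~0.90, ~1.00

The model 3 coefficient remains near 1.0 despite the roughly 10% shift between trials, which is exactly the behavior discussed above.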
Interpreting the ICC
At one level, interpreting the ICC is fairly straightfor-
ward; it represents the proportion of variance in a set of
scores that is attributable to the σ_t². An ICC of 0.95 means
that an estimated 95% of the observed score variance is
due to σ_t². The balance of the variance (1 − ICC = 5%) is
attributable to error (51). However, how does one quali-
tatively evaluate the magnitude of an ICC and what can
the quantity tell you? Some sources have attempted to
delineate good, medium, and poor levels for the ICC, but
there is certainly no consensus as to what constitutes a
good ICC (45). Indeed, Charter and Feldt (15) argue that
‘‘it is not theoretically defensible to set a universal stan-
dard for test score reliability.’’ These interpretations are
further complicated by 2 factors. First, as noted herein,
the ICC varies, depending on which version of the ICC is
used. Second, the magnitude of the ICC is dependent on
the variability in the data (45). All other things being
equal, low levels of between-subjects variability will serve
to depress the ICC even if the differences between sub-
jects’ scores across test conditions are small. This is illus-
trated by comparing the 2 example sets of data in Table
1. Trials 1 and 2 of data sets A and B have identical mean
values and identical change scores between trials 1 and
2. They differ in the variability between subjects, with
greater between-subjects variability evident in data set A
as shown in the larger SDs. In Tables 2 and 3, the AN-
OVA tables have identical outcomes with respect to the
inferential test of the factor trials and have identical error
terms (since the between-subjects variability is not part
of the error term, as noted previously). Table 4 shows the
ICC values calculated using the 6 different models of
Shrout and Fleiss (46) on the A and B data sets. Clearly,
data set B, with the lower between-subjects variability,
results in smaller ICC values than data set A.
How then does one interpret an ICC? First, because
of the relationship between the ICC and between-subjects
variability, the heterogeneity of the subjects should be
considered. A large ICC can mask poor trial-to-trial con-
sistency when between-subjects variability is high. Con-
versely, a low ICC can be found even when trial-to-trial
variability is low if the between-subjects variability is
low. In this case, the homogeneity of the subjects means
it will be difficult to differentiate between subjects even
though the absolute measurement error is small. An ex-
amination of the SEM in conjunction with the ICC is
therefore needed (32). From a practical perspective, a giv-
en test can have different reliability, at least as deter-
mined from the ICC, depending on the characteristics of
the individuals included in the analysis. In the 1RM
squat, combining individuals of widely different capabil-
ities (e.g., wide receivers and defensive linemen in Amer-
ican football) into the same analysis increases between-
subjects variability and improves the ICC, yet this may
not be reflected in the expected day-to-day variation as
illustrated in Tables 1 through 4. In addition, the infer-
ential test for bias described previously needs to be con-
sidered. High between-subjects variability may result in
a high ICC even if the test for bias is statistically signif-
icant.
The relationship between between-subjects variability
and the magnitude of the ICC has been used as a criti-
cism of the ICC (10, 39). This is an unfair criticism, since
the ICC is used to provide information regarding infer-
ential statistical tests not to provide an index of absolute
measurement error. In essence, the ICC normalizes mea-
surement error relative to the heterogeneity of the sub-
jects. As an index of absolute reliability then, this is a
weakness and other indices (i.e., the SEM) are more in-
formative. As a relative index of reliability, the ICC be-
haves as intended.
What are the implications of a low ICC? First, mea-
surement error reflected in an ICC of less than 1.0 serves
to attenuate correlations (22, 38). The equation for this
attenuation effect is as follows:
$$r_{xy} = \hat{r}_{xy}\sqrt{ICC_x\,ICC_y}, \qquad (7)$$

where r_xy is the observed correlation between x and y, r̂_xy
is the correlation between x and y if both were measured
without error (i.e., the correlation between the true
scores), and ICC_x and ICC_y are the reliability coefficients
for x and y, respectively. Nunnally and Bernstein (38)
note that the effect of measurement error on correlation
attenuation becomes minimal as ICCs increase above
0.80. In addition, reliability affects the power of statistical
tests. Specifically, the lower the reliability, the greater
the risk of type 2 error (14, 40). Fleiss (22) illustrates how
the magnitude of an ICC can be used to adjust sample
size and statistical power calculations (45). In short, low
ICCs mean that more subjects are required in a study for
a given effect size to be statistically significant (40). An
ICC of 0.60 may be perfectly fine if the resulting effect on
sample size and statistical power is within the logistical
constraints of the study. If, however, an ICC of 0.60
means that, for a required level of power, more subjects
must be recruited than is feasible, then 0.60 is not ac-
ceptable.
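
A small sketch of equation 7 makes the attenuation effect concrete; the true correlation of 0.70 and the reliabilities used here are hypothetical values chosen only for illustration.

def attenuated_r(true_r, icc_x, icc_y):
    # Equation 7: the correlation observable between x and y, given the
    # error-free (true score) correlation and the reliability of each measure.
    return true_r * (icc_x * icc_y) ** 0.5

# A hypothetical true correlation of 0.70 observed through two measures with ICC = 0.60:
print(round(attenuated_r(0.70, 0.60, 0.60), 2))   # ~0.42
# With reliabilities of 0.90 the attenuation is modest:
print(round(attenuated_r(0.70, 0.90, 0.90), 2))   # ~0.63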
Although infrequently used in the movement sciences,
the ICC of test scores can be used in the setting and in-
terpretation of cut points for classification of individuals.
Charter and Feldt (15) show how the ICC can be used to
estimate the percentage of false-positive, false-negative,
true-positive, and true-negative results for a clinical clas-
sification scheme. Although the details of these calcula-
tions are beyond the scope of this article, it is worthwhile
to note that very high ICCs are required to classify indi-
viduals with a minimum of misclassification.
THE SEM
Because the general form of the ICC is a ratio of variance
due to differences between subjects (the signal) to the to-
tal variability in the data (the noise), the ICC is reflective
of the ability of a test to differentiate between different
individuals (27, 47). It does not provide an index of the
expected trial-to-trial noise in the data, which would be
useful to practitioners such as strength coaches. Unlike
the ICC, which is a relative measure of reliability, the
SEM provides an absolute index of reliability. Hopkins
(26) refers to this as the ‘‘typical error.’’ The SEM quan-
tifies the precision of individual scores on a test (24). The
SEM has the same units as the measurement of interest,
whereas the ICC is unitless. The interpretation of the
SEM centers on the assessment of reliability within in-
dividual subjects (45). The direct calculation of the SEM
involves the determination of the SD of a large number
of scores from an individual (44). In practice, a large num-
ber of scores is not typically collected, so the SEM is es-
timated. Most references estimate the SEM as follows:
$$SEM = SD\sqrt{1 - ICC}, \qquad (8)$$

where SD is the SD of the scores from all subjects (which can be determined from the ANOVA as √(SS_total/(n − 1))) and ICC is the reliability coefficient.
Note the similarity between the equation for the SEM
and standard error of estimate from regression analysis.
Since different forms of the ICC can result in different
numbers, the choice of ICC can substantively affect the
size of the SEM, especially if systematic error is present.
However, there is an alternative way of calculating the
SEM that avoids these uncertainties. The SEM can be
estimated as the square root of the mean square error
term from the ANOVA (20, 26, 48). Since this estimate of
the SEM has the advantage of being independent of the
specific ICC, its use would allow for more consistency in
interpreting SEM values from different studies. However,
the mean square error terms differ when using the 1-way
vs. 2-way models. In Table 2 it can be seen that using a
1-way model (22) would require the use of MS_W (√53.75
= 7.3 kg), whereas use of a 2-way model would require
use of MS_E (√57 = 7.6 kg). Hopkins (26) argues that be-
cause the 1-way model combines influences of random
and systematic error together, ‘‘The resulting statistic is
biased high and is hard to interpret because the relative
contributions of random error and changes in the mean
are unknown.’’ He therefore suggests that the error term
from the 2-way model (MS_E) be used to calculate SEM.
Note however, that in this sample, the 1-way SEM is
smaller than the 2-way SEM. This is because the trials
effect is small. The high bias of the 1-way model is ob-
served when the trials effect is large (Table 7). The SEM
calculated using the MS error from the 2-way model
(√4.71 = 2.2 kg) is markedly lower than the SEM cal-
culated using the 1-way model (√124.25 = 11.1 kg), since
the SEM defined as √MS_E only considers random er-
ror. This is consistent with the concept of a SE, which
defines noise symmetrically around a central value. This
points to the desire of establishing a measurement sched-
ule that is free of systematic variation.
Another difference between the ICC and SEM is that
the SEM is largely independent of the population from
which it was determined, i.e., the SEM ‘‘is considered to
be a fixed characteristic of any measure, regardless of the
sample of subjects under investigation’’ (38). Thus, the
SEM is not affected by between-subjects variability as is
the ICC. To illustrate, the MS_E for the data in Tables 2
and 3 are equal (MS_E = 57), despite large differences in
between-subjects variability. The resulting SEM is the
same for data sets A and B (√57 = 7.6 kg), yet they
have different ICC values (Table 4). The results are sim-
ilar when calculating the SEM using equation 8, even
though equation 8 uses the ICC in calculating the SEM,
since the effects of the SD and the ICC tend to offset each
other (38). However, the effects do not offset each other
completely, and use of equation 8 results in an SEM es-
timate that is modestly affected by between-subjects var-
iability (2).
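
The two routes to the SEM can be compared directly for data set A with the following sketch (assuming numpy); it uses ICC 3,1 in equation 8 and the root of MS_E for the alternative estimate.

import numpy as np

# Data set A (Table 1): SEM from equation 8 versus SEM taken as the square
# root of the 2-way error mean square (MS_E).
x = np.array([[146, 140], [148, 152], [170, 152], [90, 99],
              [157, 145], [156, 153], [176, 167], [205, 218]], dtype=float)
n, k = x.shape
grand = x.mean()
ss_total = ((x - grand) ** 2).sum()
sd_all = np.sqrt(ss_total / (n * k - 1))               # SD of all 16 scores, ~31.7

ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()
ss_trials = n * ((x.mean(axis=0) - grand) ** 2).sum()
ms_s = ss_subj / (n - 1)
ms_e = (ss_total - ss_subj - ss_trials) / ((n - 1) * (k - 1))

icc_3_1 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e)      # ~0.95
sem_eq8 = sd_all * np.sqrt(1 - icc_3_1)                # equation 8, ~7.3 kg
sem_mse = np.sqrt(ms_e)                                # root MS_E, ~7.6 kg
print(round(sem_eq8, 1), round(sem_mse, 1))

As noted above, the two estimates are similar here because the trials effect is trivial.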
The SEM is the SE in estimating observed scores (the
scores in your data set) from true scores (38). Of course,
our problem is just the opposite. We have the observed
scores and would like to estimate subjects’ true scores.
The SEM has been used to define the boundaries around
which we think a subject’s true score lies. It is often re-
ported (8, 17) that the 95% CI for a subject’s true score
can be estimated as follows:
$$T = S \pm 1.96(SEM), \qquad (9)$$

where T is the subject's true score, S is the subject's ob-
served score on the measurement, and 1.96 defines the
95% CI. However, strictly speaking this is not correct,
since the SEM is symmetrical around the true score, not
the observed score (13, 19, 24, 38), and the SEM reflects
the SD of the observed scores while holding the true score
constant. In lieu of equation 9, an alternate approach is
to estimate the subject’s true score and calculate an al-
ternate SE (reflecting the SD of true scores while holding
observed scores constant). Because of regression to the
mean, obtained scores (S) are biased estimators of true
scores (16, 19). Scores below the mean are biased down-
ward, and scores above the mean are biased upward. A
subject’s estimated true score (T) can be calculated as fol-
lows:
$$T = \bar{X} + ICC(d), \qquad (10)$$

where d = S − X̄. To illustrate, consider data set A in
Table 1. With a grand mean of 154.5 and an ICC 3,1 of
0.95, an individual with an S of 120 kg would have a
predicted T of 154.5 + 0.95(120 − 154.5) = 121.8 kg.
Note that because the ICC is high, the bias is small (1.8
kg). The appropriate SE to define the CI of the true score,
which some have referred to as the standard error of es-
timate (13), is as follows (19, 38):
$$SEM_{TS} = SD\sqrt{ICC(1 - ICC)}. \qquad (11)$$

In this example the value is 31.74 × √(0.95(1 − 0.95)) =
6.92, where 31.74 equals the SD of the observed scores
around the grand mean. The 95% CI for T is then 121.8
± 1.96(6.92), which defines a span of 108.2 to 135.4 kg.
The entire process, which has been termed the regression-
based approach (16), can be summarized as follows (24):
$$95\%\,\text{CI for } T = \bar{X} + ICC(d) \pm 1.96\,SD\sqrt{ICC(1 - ICC)}. \qquad (12)$$

If one had simply used equation 9 with S and the SEM,
the resulting interval would span 120 ± 1.96(7.6) = 105.1
to 134.9 kg. Note that the difference between the CIs is
small and that the CI width from equation 9 (29.8 kg) is
wider than that from equation 12 (27.2 kg). For all ICCs
less than 1.0, the CI width will be narrower from equation
12 than from equation 9 (16), but the differences shrink
as the ICC approaches 1.0 and as S approaches X̄ (24).
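
A compact sketch of the regression-based approach (equations 10 through 12); the function name is ours, and the output agrees with the worked example above to within rounding.

def true_score_ci(observed, grand_mean, sd, icc, z=1.96):
    # Equations 10-12: estimated true score and its regression-based 95% CI.
    t_hat = grand_mean + icc * (observed - grand_mean)      # equation 10
    se_ts = sd * (icc * (1 - icc)) ** 0.5                   # equation 11
    return t_hat, t_hat - z * se_ts, t_hat + z * se_ts

# Data set A example: S = 120 kg, grand mean = 154.5, SD = 31.74, ICC = 0.95.
t_hat, lower, upper = true_score_ci(120, 154.5, 31.74, 0.95)
print(round(t_hat, 1), round(lower, 1), round(upper, 1))    # ~121.7, ~108.2, ~135.3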
MINIMAL DIFFERENCES NEEDED TO BE CONSIDERED REAL
The SEM is an index that can be used to define the dif-
ference needed between separate measures on a subject
for the difference in the measures to be considered real.
For example, if the 1RM of an athlete on one day is 155
kg and at some later time is 160 kg, are you confident
that the athlete really increased the 1RM by 5 kg or is
this difference within what you might expect to see in
repeated testing just due to the noise in the measure-
ment? The SEM can be used to determine the minimum
difference (MD) to be considered ‘‘real’’ and can be cal-
culated as follows (8, 20, 42):
$$MD = SEM \times 1.96 \times \sqrt{2}. \qquad (13)$$

Once again the point is to construct a 95% CI, and the
1.96 value is simply the z score associated with a 95% CI.
(One may choose a different z score instead of 1.96 if a
more liberal or more conservative assessment is desired.)
But where does the √2 come from?
Why can’t we simply calculate the 95% CI for a sub-
ject’s score as we have done above? If the score is outside
that interval, then shouldn’t we be 95% confident that the
subject’s score has really changed? Indeed, this approach
has been suggested in the literature (25, 37). The key
here is that we now have 2 scores from a subject. Each of
these scores has a true component and an error compo-
nent. That is, both scores were measured with error, and
simply seeing if the second score falls outside the CI of
the first score does not account for the error in the second
score. What we really want here is an index based on the
variability of the difference scores. This can be quantified
as the SD of the difference scores (SDd). As it turns out,
when there are 2 levels of trials (as in the examples
herein), the SEM is equal to the SDd divided by √2 (17,
26):

$$SEM = SD_d/\sqrt{2}. \qquad (14)$$
Therefore, multiplying the SEM by √2 solves for the
SDd and then multiplying the SDd by 1.96 allows for the
construction of the 95% CI. Once the MD is calculated,
then any change in a subject’s score, either above or below
the previous score, greater than the MD is considered
real. More precisely, for all people whose differences on
repeated testing are at least greater than or equal to the
MD, 95% of them would reflect real differences. Using
data set A, the first subject has a trial A1 score of 146
kg. The SEM for the test is √57 = 7.6 kg. From equation
13, MD = 7.6 × 1.96 × √2 = 21.07 kg. Thus, a change
of at least 21.07 kg needs to occur to be confident, at the
95% level, that a change in 1RM reflects a real change
and not a difference that is within what might be reason-
ably expected given the measurement error of the 1RM
test.
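
Equation 13 is trivially automated; the sketch below uses the rounded SEM of 7.6 kg from the example above.

import math

def minimal_difference(sem, z=1.96):
    # Equation 13: the smallest change, in the units of the test, that exceeds
    # the noise expected when both scores contain measurement error.
    return sem * z * math.sqrt(2)

sem = 7.6                                   # SEM for data set A (square root of MS_E)
print(round(minimal_difference(sem), 2))    # ~21.07 kg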
However, as with defining a CI for an observed score,
the process outlined herein for defining a minimal differ-
ence is not precisely accurate. As noted by Charter (13)
and Dudek (19), the SE of prediction (SEP) is the correct
SE to use in these calculations, not the SEM. The SEP is
calculated as follows:
$$SEP = SD\sqrt{1 - ICC^2}. \qquad (15)$$
To define a 95% CI outside which one could be confi-
dent that a retest score reflects a real change in perfor-
mance, simply calculate the estimated true score (equa-
tion 10) plus or minus the SEP. To illustrate, consider
the same data as in the example in the previous para-
graph. From equation 10, we estimate the subject’s true
score (T)asT5X
¯1ICC (d)5154 10.95 (146 2154.5)
ù146.4 kg. The SEP 5SD 3Ï(1 2ICC
2
)531.74 3
Ï(1 20.95
2
)59.91. The resulting 95% CI is 146.4 6
1.96 (9.91), which defines an interval from approximately
127 to 166 kg. Therefore, any retest score outside that
interval would be interpreted as reflecting a real change
in performance. As given in Table 1, the retest score of
140 kg is inside the CI and would be interpreted as a
change consistent with the measurement error of the test
and does not reflect a real change in performance. As be-
fore, use of a different z score in place of 1.96 will allow
for the construction of a more liberal or conservative CI.
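
The SEP-based interval of equations 10 and 15 can likewise be scripted; the sketch below (the function name is ours) reproduces the approximately 127 to 166 kg interval from the example above.

def real_change_interval(observed, grand_mean, sd, icc, z=1.96):
    # Estimated true score (equation 10) bracketed by the SE of prediction
    # (equation 15); a retest score outside this interval suggests real change.
    t_hat = grand_mean + icc * (observed - grand_mean)
    sep = sd * (1 - icc ** 2) ** 0.5
    return t_hat - z * sep, t_hat + z * sep

# Data set A, first subject: trial A1 score of 146 kg.
lower, upper = real_change_interval(146, 154.5, 31.74, 0.95)
print(round(lower), round(upper))   # ~127 to ~166 kg; the retest score of 140 kg falls inside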
OTHER CONSIDERATIONS
In this article, several considerations regarding ICC and
SEM calculations will not be addressed in detail, but brief
mention will be made here. First, assumptions of ANOVA
apply to these data. The most common assumption vio-
lated is that of homoscedasticity. That is, does the size of
the error correlate with the magnitude of the observed
scores? If the data exhibit homoscedasticity, the answer
is no. For physical performance measures, it is common
that the absolute error tends to be larger for subjects who
score higher (2, 26), e.g., the noise from repeated strength
testing of stronger subjects is likely to be larger than the
noise from weaker subjects. If the data exhibit heterosce-
dasticity, often a logarithmic transformation is appropri-
ate. Second, it is important to realize that ICC and SEM
values determined from sample data are estimates. As
such, it is instructive to construct CIs for these estimates.
Details of how to construct these CIs are addressed in
other sources (34, 35). Third, how many subjects are re-
quired to get adequate stability for the ICC and SEM cal-
culations? Unfortunately, there is no consensus in this
area. The reader is referred to other studies for further
discussion (16, 35, 52). Finally, reliability, as quantified
by the ICC, is not synonymous with responsiveness to
change (23). The MD calculation presented herein allows
one to evaluate a change score after the fact. However, a
small MD, in and of itself, is not a priori evidence that a
given test is responsive.
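
As an informal check of the homoscedasticity assumption mentioned above, one can correlate the absolute between-trial differences with the subjects' mean scores, in the spirit of a Bland-Altman examination; the sketch below (assuming numpy) applies this to data set A.

import numpy as np

# Informal homoscedasticity check for data set A: correlate the absolute
# trial-to-trial differences with the subjects' mean scores. A strong positive
# correlation suggests that error grows with the size of the score, in which
# case a logarithmic transformation of the raw scores is often used.
x = np.array([[146, 140], [148, 152], [170, 152], [90, 99],
              [157, 145], [156, 153], [176, 167], [205, 218]], dtype=float)
abs_diff = np.abs(x[:, 1] - x[:, 0])
subject_mean = x.mean(axis=1)
print(round(np.corrcoef(subject_mean, abs_diff)[0, 1], 2))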
PRACTICAL APPLICATIONS
For a comprehensive assessment of reliability, a 3-layered
approach is recommended. First, perform a repeated-
measures ANOVA and cast the summary table as a 2-
way model, i.e., trials and error are separate sources of
variance. Evaluate the F ratio for the trials effect to ex-
amine systematic error. As noted previously, it may be
prudent to evaluate the effect for trials using a more lib-
eral α than the traditional 0.05 level. If the effect
for trials is significant (and the effect size is not trivial),
it is prudent to reexamine the measurement schedule for
influences of learning and fatigue. If 3 or more levels of
trials were included in the analysis, a plateau in perfor-
mance may be evident, and exclusion of only those levels
of trials not in the plateau region in a subsequent re-
analysis may be warranted. However, this exclusion of
trials needs to be reported. Under these conditions, where
systematic error is deemed unimportant, the ICC values
will be similar and reflect random error (imprecision).
However, it is suggested here that the ICC from equation
3,1 be used (Table 5), since it is most closely tied to the
MS_E calculation of the SEM. Once the systematic error is
determined to be nonsignificant or trivial, interpret the
ICC and SEM within the analytical goals of your study
(2). Specifically, researchers interested in group-level re-
sponses can use the ICC to assess correlation attenuation,
statistical power, and sample size calculations. Practi-
tioners (e.g., coaches, clinicians) can use the SEM (and
associated SEs) in the interpretation of scores from in-
dividual athletes (CIs for true scores, assessing individual
change). Finally, although reliability is an important as-
pect of measurement, a test may exhibit reliability but
not be a valid test (i.e., it does not measure what it pur-
ports to measure).
R
EFERENCES
1. A
LEXANDER
, H.W. The estimation of reliability when several
trials are available. Psychometrika 12:79–99. 1947.
2. A
TKINSON
, D.B.,
AND
A.M. N
EVILL
. Statistical methods for as-
sessing measurement error (reliability) in variables relevant to
Sports Medicine. Sports Med. 26:217–238. 1998.
3. B
ARTKO
, J.J. The intraclass reliability coefficient as a measure
of reliability. Psychol. Rep. 19:3–11. 1966.
4. B
ARTKO
, J.J. On various intraclass correlation coefficients.
Psychol. Bull. 83:762–765. 1976.
5. B
AUMGARTNER
, T.A. Estimating reliability when all test trials
are administered on the same day. Res. Q. 40:222–225. 1969.
6. B
AUMGARTNER
, T.A. Norm-referenced measurement: reliabili-
ty. In: Measurement Concepts in Physical Education and Exer-
cise Science. M.J. Safrit and T.M. Woods, eds. Champaign, IL:
Human Kinetics, 1989. pp. 45–72.
240 W
EIR
7. BAUMGARTNER, T.A. Estimating the stability reliability of a score. Meas. Phys. Educ. Exerc. Sci. 4:175–178. 2000.
8. BECKERMAN, H., T.W. VOGELAAR, G.L. LANKHORST, AND A.L.M. VERBEEK. A criterion for stability of the motor function of the lower extremity in stroke patients using the Fugl-Meyer assessment scale. Scand. J. Rehabil. Med. 28:3–7. 1996.
9. BEDARD, M., N.J. MARTIN, P. KRUEGER, AND K. BRAZIL. Assessing reproducibility of data obtained with instruments based on continuous measurements. Exp. Aging Res. 26:353–365. 2000.
10. BLAND, J.M., AND D.G. ALTMAN. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1:307–310. 1986.
11. BROZEK, J., AND H. ALEXANDER. Components of variance and the consistency of repeated measurements. Res. Q. 18:152–166. 1947.
12. BRUTON, A., J.H. CONWAY, AND S.T. HOLGATE. Reliability: What is it and how is it measured? Physiotherapy 86:94–99. 2000.
13. CHARTER, R.A. Revisiting the standard error of measurement, estimate, and prediction and their application to test scores. Percept. Mot. Skills 82:1139–1144. 1996.
14. CHARTER, R.A. Effect of measurement error on tests of statistical significance. J. Clin. Exp. Neuropsychol. 19:458–462. 1997.
15. CHARTER, R.A., AND L.S. FELDT. Meaning of reliability in terms of correct and incorrect clinical decisions: The art of decision making is still alive. J. Clin. Exp. Neuropsychol. 23:530–537. 2001.
16. CHARTER, R.A., AND L.S. FELDT. The importance of reliability as it relates to true score CIs. Meas. Eval. Counseling Dev. 35:104–112. 2002.
17. CHINN, S. Repeatability and method comparison. Thorax 46:454–456. 1991.
18. CHINN, S., AND P.G. BURNEY. On measuring repeatability of data from self-administered questionnaires. Int. J. Epidemiol. 16:121–127. 1987.
19. DUDEK, F.J. The continuing misinterpretation of the standard error of measurement. Psychol. Bull. 86:335–337. 1979.
20. ELIASZIW, M., S.L. YOUNG, M.G. WOODBURY, AND K. FRYDAY-FIELD. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Phys. Ther. 74:777–788. 1994.
21. FELDT, L.S., AND M.E. MCKEE. Estimation of the reliability of skill tests. Res. Q. 29:279–293. 1958.
22. FLEISS, J.L. The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons, 1986.
23. GUYATT, G., S. WALTER, AND G. NORMAN. Measuring change over time: assessing the usefulness of evaluative instruments. J. Chronic Dis. 40:171–178. 1987.
24. HARVILL, L.M. Standard error of measurement. Educ. Meas. Issues Pract. 10:33–41. 1991.
25. HEBERT, R., D.J. SPIEGELHALTER, AND C. BRAYNE. Setting the minimal metrically detectable change on disability rating scales. Arch. Phys. Med. Rehabil. 78:1305–1308. 1997.
26. HOPKINS, W.G. Measures of reliability in sports medicine and science. Sports Med. 30:375–381. 2000.
27. KEATING, J., AND T. MATYAS. Unreliable inferences from reliable measurements. Aust. Physiother. 44:5–10. 1998.
28. KEPPEL, G. Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.
29. KROLL, W. A note on the coefficient of intraclass correlation as an estimate of reliability. Res. Q. 33:313–316. 1962.
30. LAHEY, M.A., R.G. DOWNEY, AND F.E. SAAL. Intraclass correlations: there's more than meets the eye. Psychol. Bull. 93:586–595. 1983.
31. LIBA, M. A trend test as a preliminary to reliability estimation. Res. Q. 38:245–248. 1962.
32. LOONEY, M.A. When is the intraclass correlation coefficient misleading? Meas. Phys. Educ. Exerc. Sci. 4:73–78. 2000.
33. LUDBROOK, J. Statistical techniques for comparing measures and methods of measurement: A critical review. Clin. Exp. Pharmacol. Physiol. 29:527–536. 2002.
34. MCGRAW, K.O., AND S.P. WONG. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1:30–46. 1996.
35. MORROW, J.R., AND A.W. JACKSON. How "significant" is your reliability? Res. Q. Exerc. Sport 64:352–355. 1993.
36. NICHOLS, C.P. Choosing an intraclass correlation coefficient. Available at: www.spss.com/tech/stat/articles/whichicc.htm. Accessed 1998.
37. NITSCHKE, J.E., J.M. MCMEEKEN, H.C. BURRY, AND T.A. MATYAS. When is a change a genuine change? A clinically meaningful interpretation of grip strength measurements in healthy and disabled women. J. Hand Ther. 12:25–30. 1999.
38. NUNNALLY, J.C., AND I.H. BERNSTEIN. Psychometric Theory (3rd ed.). New York: McGraw-Hill, 1994.
39. OLDS, T. Five errors about error. J. Sci. Med. Sport 5:336–340. 2002.
40. PERKINS, D.O., R.J. WYATT, AND J.J. BARTKO. Penny-wise and pound-foolish: The impact of measurement error on sample size requirements in clinical trials. Biol. Psychiatry 47:762–766. 2000.
41. PORTNEY, L.G., AND M.P. WATKINS. Foundations of Clinical Research (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
42. ROEBROECK, M.E., J. HARLAAR, AND G.J. LANKHORST. The application of generalizability theory to reliability assessment: An illustration using isometric force measurements. Phys. Ther. 73:386–401. 1993.
43. ROUSSON, V., T. GASSER, AND B. SEIFERT. Assessing intrarater, interrater, and test-retest reliability of continuous measurements. Stat. Med. 21:3431–3446. 2002.
44. SAFRIT, M.J.E. Reliability Theory. Washington, DC: American Alliance for Health, Physical Education, and Recreation, 1976.
45. SHROUT, P.E. Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7:301–317. 1998.
46. SHROUT, P.E., AND J.L. FLEISS. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 36:420–428. 1979.
47. STRATFORD, P. Reliability: consistency or differentiating between subjects? [Letter]. Phys. Ther. 69:299–300. 1989.
48. STRATFORD, P.W., AND C.H. GOLDSMITH. Use of standard error as a reliability index of interest: An applied example using elbow flexor strength data. Phys. Ther. 77:745–750. 1997.
49. STREINER, D.L., AND G.R. NORMAN. Measurement Scales: A Practical Guide to Their Development and Use (2nd ed.). Oxford: Oxford University Press, 1995. pp. 104–127.
50. THOMAS, J.R., AND J.K. NELSON. Research Methods in Physical Activity (2nd ed.). Champaign, IL: Human Kinetics, 1990. pp. 352.
51. TRAUB, R.E., AND G.L. ROWLEY. Understanding reliability. Educ. Meas. Issues Pract. 10:37–45. 1991.
52. WALTER, S.D., M. ELIASZIW, AND A. DONNER. Sample size and optimal designs for reliability studies. Stat. Med. 17:101–110. 1998.
Acknowledgments
I am indebted to Lee Brown, Joel Cramer, Bryan Heiderscheit,
Terry Housh, and Bob Oppliger for their helpful comments on
drafts of the paper.
Address correspondence to Dr. Joseph P. Weir,
joseph.weir@dmu.edu.