# Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM.

**ABSTRACT** Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). In addition, the SEM, which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be one concerning the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the necessary variance components for the calculation of ICC values, are useful to determine if systematic error is present. If so, the measurement schedule should be modified (removing trials where learning and/or fatigue effects are present) to remove systematic error, and ICC equations that only consider random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference needed to be exhibited for one to be confident that a true change in performance of an individual has occurred.



Journal of Strength and Conditioning Research, 2005, 19(1), 231–240

© 2005 National Strength & Conditioning Association

Brief Review

QUANTIFYING TEST-RETEST RELIABILITY USING THE INTRACLASS CORRELATION COEFFICIENT AND THE SEM

JOSEPH P. WEIR

Applied Physiology Laboratory, Division of Physical Therapy, Des Moines University—Osteopathic Medical Center, Des Moines, Iowa 50312.


KEY WORDS. reproducibility, precision, error, consistency, SEM, intraclass correlation coefficient

INTRODUCTION

Reliability refers to the consistency of a test or measurement. For a seemingly simple concept, the quantifying of reliability and interpretation of the resulting numbers are surprisingly unclear in the biomedical literature in general (49) and in the sport sciences literature in particular. Part of this stems from the fact that reliability can be assessed in a variety of different contexts. In the sport sciences, we are most often interested in simple test-retest reliability; this is what Fleiss (22) refers to as a simple reliability study. For example, one might be interested in the reliability of 1 repetition maximum (1RM) squat measures taken on the same athletes over different days. However, if one is interested in the ability of different testers to get the same results from the same subjects on skinfold measurements, one is now interested in the interrater reliability. The quantifying of reliability in these different situations is not necessarily the same, and the decisions regarding how to calculate reliability in these different contexts have not been adequately addressed in the sport sciences literature. In this article, I focus on test-retest reliability (but not limited in the number of retest trials). In addition, I discuss data measured on a continuous scale.

Confusion also stems from the jargon used in the context of reliability, i.e., consistency, precision, repeatability, and agreement. Intuitively, these terms describe the same concept, but in practice some are operationalized differently. Notably, reliability and agreement are not synonymous (30, 49). Further, reliability, conceptualized as consistency, consists of both absolute consistency and relative consistency (44). Absolute consistency concerns the consistency of scores of individuals, whereas relative consistency concerns the consistency of the position or rank of individuals in the group relative to others. In the fields of education and psychology, the term reliability is operationalized as relative consistency and quantified using reliability coefficients called intraclass correlation coefficients (ICCs) (49). Issues regarding quantifying ICCs and their interpretation are discussed in the first half of this article. Absolute consistency, quantified using the SEM, is addressed in the second half of the article. In brief, the SEM is an indication of the precision of a score, and its use allows one to construct confidence intervals (CIs) for scores.

Another confusing aspect of reliability calculations is that a variety of different procedures, besides ICCs and SEM, have been used to determine reliability. These include the Pearson r, the coefficient of variation, and the LOA (Bland-Altman plots). The Pearson product moment correlation coefficient (Pearson r) was often used in the past to quantify reliability, but the use of the Pearson r is typically discouraged for assessing test-retest reliability (7, 9, 29, 33, 44); however, this recommendation is not universal (43). The primary, although not exclusive, weakness of the Pearson r is that it cannot detect systematic error. More recently, the limits of agreement (LOA) described by Bland and Altman (10) have come into vogue in the biomedical literature (2). The LOA will not be addressed in detail herein other than to point out that the procedure was developed to examine agreement between 2 different techniques of quantifying some variable (so-called method comparison studies, e.g., one could compare testosterone concentration using 2 different bioassays), not reliability per se. The use of LOA as an index of reliability has been criticized in detail elsewhere (26, 49). In this article, the ICC and SEM will be the focus.
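To make the Pearson r's blindness to systematic error concrete, here is a minimal sketch using made-up scores (not data from this article): when every retest score shifts by the same amount, the Pearson r remains a perfect 1.0.

```python
import numpy as np

# Hypothetical test-retest scores: trial 2 is exactly trial 1 plus 10 units,
# i.e., pure systematic error (a uniform shift) with no random error.
trial1 = np.array([100.0, 120.0, 140.0, 160.0, 180.0])
trial2 = trial1 + 10.0

# The Pearson r is insensitive to the additive shift: it is 1.0
# even though no subject reproduced their original score.
r = np.corrcoef(trial1, trial2)[0, 1]
print(round(r, 3))  # → 1.0
```

An ICC whose denominator includes the trials variance (e.g., model 2,1) would be penalized by this shift, which is exactly the distinction this article develops.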


Unfortunately, there is considerable confusion concerning both the calculation and interpretation of the ICC. Indeed, there are 6 common versions of the ICC (and others as well), and the choice of which version to use is not intuitively obvious. Similarly, the SEM, which is intimately related to the ICC, has useful applications that are not fully appreciated by practitioners in the movement sciences. The purposes of this article are to provide information on the choice and application of the ICC and to encourage practitioners to use the SEM in the interpretation of test data.

THE ICC

Reliability Theory

For a group of measurements, the total variance (σT²) in the data can be thought of as being due to true score variance (σt²) and error variance (σe²). Similarly, each observed score is composed of the true score and error (44). The theoretical true score of an individual reflects the mean of an infinite number of scores from a subject, whereas error equals the difference between the true score and the observed score (21). Sources of error include errors due to biological variability, instrumentation, error by the subject, and error by the tester. If we make a ratio of the σt² to the σT² of the observed scores, where σT² equals σt² plus σe², we have the following reliability coefficient:

R = σt² / (σt² + σe²). (1)
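As a numeric illustration of equation 1, with hypothetical variance components (assumed values, not data from this article), reliability is the share of observed-score variance carried by true scores:

```python
# Hypothetical variance components illustrating equation 1: reliability is
# the proportion of total observed-score variance due to true scores.
var_true = 900.0   # σt², true-score variance (assumed value)
var_error = 100.0  # σe², error variance (assumed value)

R = var_true / (var_true + var_error)
print(R)  # → 0.9
```

Shrinking the error variance toward 0 drives R toward 1.0; inflating it drives R toward 0, matching the verbal description above.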

The closer this ratio is to 1.0, the higher the reliability and the lower the σe². Since we do not know the true score for each subject, an index of the σt² is used based on between-subjects variability, i.e., the variance due to how subjects differ from each other. In this context, reliability (relative consistency) is formally defined (5, 21, 49) as follows:

reliability = between-subjects variability / (between-subjects variability + error). (2)

The reliability coefficient in Equation 2 is quantified by various ICCs. So although reliability is conceptually aligned with terms such as reproducibility, repeatability, and agreement, it is defined as above. The necessary variance estimates are derived from analysis of variance (ANOVA), where appropriate mean square values are recorded from the computer printout. Specifically, the various ICCs can be calculated from mean square values derived from a within-subjects, single-factor ANOVA (i.e., a repeated-measures ANOVA).

The ICC is a relative measure of reliability (18) in that it is a ratio of variances derived from ANOVA, is unitless, and is more conceptually akin to R² from regression (43) than to the Pearson r. The ICC can theoretically vary between 0 and 1.0, where an ICC of 0 indicates no reliability, whereas an ICC of 1.0 indicates perfect reliability. In practice, ICCs can extend beyond the range of 0 to 1.0 (30), although with actual data this is rare. The relative nature of the ICC is reflected in the fact that the magnitude of an ICC depends on the between-subjects variability (as shown in the next section). That is, if subjects differ little from each other, ICC values are small even if trial-to-trial variability is small. If subjects differ from each other a lot, ICCs can be large even if trial-to-trial variability is large. Thus, the ICC for a test is context specific (38, 51). As noted by Streiner and Norman (49), "There is literally no such thing as the reliability of a test, unqualified; the coefficient has meaning only when applied to specific populations." Further, it is intuitive that small differences between individuals are more difficult to detect than large ones, and the ICC is reflective of this (49).

Error is typically considered as being of 2 types: systematic error (e.g., bias) and random error (2, 39). (Generalizability theory expands sources of error to include various facets of interest but is beyond the scope of this article.) Total error reflects both systematic error and random error (imprecision). Systematic error includes both constant error and bias (38). Constant error affects all scores equally, whereas bias is systematic error that affects certain scores differently than others. For physical performance measures, the distinction between constant error and bias is relatively unimportant, and the focus here on systematic error is on situations that result in a unidirectional change in scores on repeated testing. In testing of physical performance, subjects may improve their test scores simply due to learning effects, e.g., performing the first test serves as practice for subsequent tests, or fatigue or soreness may result in poorer performance across trials. In contrast, random error refers to sources of error that are due to chance factors. Factors such as luck, alertness, attentiveness by the tester, and normal biological variability affect a particular score. Such errors should, in a random manner, both increase and decrease test scores on repeated testing. Thus, we can expand Equation 1 as follows:

can expand Equation 1 as follows:

2

?t

R ?

, (3)

2

t

2

se

and

2

re

? ? ?? ?

where is the systematic

It has been argued that systematic error is a concern

of validity and not reliability (12, 43). Similarly, system-

atic error (e.g., learning effects, fatigue) has been sug-

gested to be a natural phenomenon and therefore does

not contribute to unreliability per se in test-retest situa-

tions (43). Thus, there is a school of thought that suggests

that only random error should be assessed in reliability

calculations. Under this analysis, the error term in the

denominator will only reflect random error and not sys-

tematic error, increasing the size of reliability coeffi-

cients. The issue of inclusion of systematic error in the

determination of reliability coefficients is addressed in a

subsequent section.

is the random.

2

se

2

e

2

re

2

e

????

The Basic Calculations

The calculation of reliability starts with the performance of a repeated-measures ANOVA. This analysis performs 2 functions. First, the inferential test of mean differences across trials is an assessment of systematic error (trend). Second, all of the subsequent calculations can be derived from the output from this ANOVA. In keeping with the nomenclature of Keppel (28), the ANOVA that is used is of a single-factor, within-subjects (repeated-measures) design. Unfortunately, the language gets a bit tortured in many sources, because the different ICC models are referred to as either 1-way or 2-way models; what is important to keep in mind is that both the 1-way and 2-way ICC models can be derived from the same single-factor, within-subjects ANOVA.


TABLE 1. Example data set.*

| Trial A1 | Trial A2 | A2 − A1 | Trial B1 | Trial B2 | B2 − B1 |
|---:|---:|---:|---:|---:|---:|
| 146 | 140 | −6 | 166 | 160 | −6 |
| 148 | 152 | +4 | 168 | 172 | +4 |
| 170 | 152 | −18 | 160 | 142 | −18 |
| 90 | 99 | +9 | 150 | 159 | +9 |
| 157 | 145 | −12 | 147 | 135 | −12 |
| 156 | 153 | −3 | 146 | 143 | −3 |
| 176 | 167 | −9 | 156 | 147 | −9 |
| 205 | 218 | +13 | 155 | 168 | +13 |
| 156 ± 33 | 153 ± 33 | | 156 ± 8 | 153 ± 13 | |

* Bottom row is mean ± SD.

TABLE 2. Two-way analysis of variance summary table for data set A.*

| Source | df | SS | Mean square | F | p value |
|---|---:|---:|---|---:|---:|
| Between subjects | 7 | 14,689.8 | 2,098.4 (MSB: 1-way; MSS: 2-way) | 36.8 | |
| Within subjects | 8 | 430 | 53.75 (MSW) | | |
| Trials | 1 | 30.2 | 30.2 (MST) | 0.53 | 0.49 |
| Error | 7 | 399.8 | 57.1 (MSE) | | |
| Total | 15 | 15,119.8 | | | |

* MSB = between-subjects mean square; MSE = error mean square; MSS = subjects mean square; MST = trials mean square; MSW = within-subjects mean square; SS = sums of squares.

To illustrate the calculations, example data are presented in Table 1. ANOVA summary tables are presented in Tables 2 and 3, and the resulting ICCs are presented in Table 4. Focus on the first two columns of Table 1, which are labeled trial A1 and trial A2. As can be seen, there are 2 sets (columns) of scores, and each set has 8 scores. In this example, each of 8 subjects has provided a score in each set. Assume that each set of scores represents the subjects' scores on the 1RM squat across 2 different days (trials). A repeated-measures ANOVA is performed to primarily test whether the 2 sets of scores are significantly different from each other (i.e., do the scores systematically change between trials) and is summarized in Table 2. Equivalently, one could have used a paired t-test, since there were only 2 levels of trials. However, the ANOVA is applicable to situations with 2 or more trials and is consistent with the ICC literature in defining sources of variance for ICC calculations. Note that there are 3 sources of variability in Table 2: subjects, trials, and error. In a repeated-measures ANOVA such as this, it is helpful to remember that this analysis might be considered as having 2 factors: the primary factor of trials and a secondary factor called subjects (with a sample size of 1 subject per cell). The error term includes the interaction effect of trials by subjects. It is useful to keep these sources of variability in mind for 2 reasons. First, the 1-way and 2-way models of the ICC (6, 44) either collapse the variability due to trials and error together (1-way models) or keep them separate (2-way models). Note that the trials and error sources of variance, respectively, reflect the systematic and random sources of error in the reliability coefficient. These differences are illustrated in Table 2, where the df and sums of squares values for error in the 1-way model (within-subjects source) are simply the sum of the respective values for trials and error in the 2-way model.
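The variance partition just described can be reproduced directly from the Table 1 data. The sketch below (assuming the data set A scores as given) computes the sums of squares and mean squares of Table 2 from first principles; the printed table rounds the between-subjects SS to 14,689.8, while the exact value is 14,689.75.

```python
import numpy as np

# Data set A from Table 1 (rows = subjects, columns = trials); this
# reproduces the sums of squares and mean squares reported in Table 2.
scores = np.array([[146, 140],
                   [148, 152],
                   [170, 152],
                   [90,   99],
                   [157, 145],
                   [156, 153],
                   [176, 167],
                   [205, 218]], dtype=float)
n, k = scores.shape                      # 8 subjects, 2 trials
grand = scores.mean()

ss_between = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # subjects
ss_within  = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum()
ss_trials  = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_error   = ss_within - ss_trials       # subjects-by-trials interaction

ms_w = ss_within / (n * (k - 1))         # MSW
ms_t = ss_trials / (k - 1)               # MST
ms_e = ss_error / ((n - 1) * (k - 1))    # MSE
f_trials = ms_t / ms_e                   # inferential test for systematic error

print(round(ss_between, 2), ms_w, round(ms_e, 1), round(f_trials, 2))
# → 14689.75 53.75 57.1 0.53
```

The trials F of 0.53 (p = 0.49 with 1 and 7 df) matches Table 2, indicating no statistically significant systematic error in data set A.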

Second, unlike a between-subjects ANOVA where the "noise" due to different subjects is part of the error term, the variability due to subjects is now accounted for (due to the repeated testing) and therefore not a part of the error term. Indeed, for the calculation of the ICC, the numerator (the signal) reflects the variance due to subjects. Since the error term of the ANOVA reflects the interaction between subjects and trials, the error term is small in situations where all the subjects change similarly across test days. In situations where subjects do not change in a similar manner across test days (e.g., some subjects' scores increase, whereas others decrease), the error term is large. In the former situation, even small differences across test days, as long as they are consistent across all the subjects, can result in a statistically significant effect for trials. In this example, however, the effect for trials is not statistically significant (p = 0.49), indicating that there is no statistically significant systematic error in the data. It should be kept in mind, however, that the statistical power of the test of mean differences between trials is affected by sample size and random error. Small sample sizes and noisy data (i.e., high random error) will decrease power and potentially hide systematic error. Thus, an inferential test of mean differences alone is insufficient to quantify reliability. Further, the effect for trials ought to be evaluated with a more liberal α, since in this case the implications of a type 2 error are more severe than those of a type 1 error. In cases where systematic error is present, it may be prudent to change the measurement schedule (e.g., add trials if a learning effect is present or increase rest intervals if fatigue is present) to compensate for the bias.

Shrout and Fleiss (46) have presented 6 forms of the ICC. This system has taken hold in the physical therapy literature. However, the specific nomenclature of their system does not seem to be as prevalent in the exercise physiology, kinesiology, and sport science literature, which has instead ignored which model is used or focused on ICC terms that are centered on either 1-way or 2-way ANOVA models (6, 44). Nonetheless, the ICC models of


TABLE 3. Analysis of variance summary table for data set B.*

| Source | df | SS | Mean square | F | p value |
|---|---:|---:|---|---:|---:|
| Between subjects | 7 | 1,330 | 190 (MSB: 1-way; MSS: 2-way) | 3.3 | |
| Within subjects | 8 | 430 | 53.75 (MSW) | | |
| Trials | 1 | 30.2 | 30.2 (MST) | 0.53 | 0.49 |
| Error | 7 | 399.8 | 57.1 (MSE) | | |
| Total | 15 | 1,760 | | | |

* MSB = between-subjects mean square; MSE = error mean square; MSS = subjects mean square; MST = trials mean square; MSW = within-subjects mean square; SS = sums of squares.

Shrout and Fleiss (46) overlap with the 1-way and 2-way models presented by Safrit (44) and Baumgartner (6). Three general models of the ICC are present in the Shrout and Fleiss (46) nomenclature, which are labeled 1, 2, and 3. Each model can be calculated 1 of 2 ways. If the scores in the analysis are from single scores from each subject for each trial (or rater if assessing interrater reliability), then the ICC is given a second designation of 1. If the scores in the analysis represent the average of the k scores from each subject (i.e., the average across the trials), then the ICC is given a second designation of k. In this nomenclature then, an ICC with a model designation of 2,1 indicates an ICC calculated using model 2 with single scores. The use of these models is typically presented in the context of determining rater reliability (41). For model 1, each subject is assumed to be assessed by a different set of raters than other subjects, and these raters are assumed to be randomly sampled from the population of possible raters so that raters are a random effect. Model 2 assumes each subject was assessed by the same group of raters, and these raters were randomly sampled from the population of possible raters. In this case, raters are also considered a random effect. Model 3 assumes each subject was assessed by the same group of raters, but these particular raters are the only raters of interest, i.e., one does not wish to generalize the ICCs beyond the confines of the study. In this case, the analysis attempts to determine the reliability of the raters used by that particular study, and raters are considered a fixed effect.

The 1-way ANOVA models (6, 44) coincide with model 1,k for situations where scores are averaged and model 1,1 for single scores for a given trial (or rater). Further, ICC 1,1 coincides with the 1-way ICC model described by Bartko (3, 4), and ICC 1,k has also been termed the Spearman-Brown prediction formula (4). Similarly, ICC values derived from single and averaged scores calculated using the 2-way approach (6, 44) coincide with models 3,1 and 3,k, respectively. Calculations coincident with models 2,1 and 2,k were not reported by Baumgartner (6) or Safrit (44).

More recently, McGraw and Wong (34) expanded the Shrout and Fleiss (46) system to include 2 more general forms, each also with a single score or average score version, resulting in 10 ICCs. These ICCs have now been incorporated into SPSS statistical software starting with version 8.0 (36). Fortunately, 4 of the computational formulas of Shrout and Fleiss (46) also apply to the new forms of McGraw and Wong (34), so the total number of formulas is not different.

The computational formulas for the ICC models of Shrout and Fleiss (46) and McGraw and Wong (34) are summarized in Table 5. Unfortunately, it is not intuitively obvious how the computational formulas reflect the intent of equations 1 through 3. This stems from the fact that the computational formulas reported in most sources are derived from algebraic manipulations of basic equations where mean square values from ANOVA are used to estimate the various σ² values reflected in equations 1 through 3. To illustrate, the manipulations for ICC 1,1 (random-effects, 1-way ANOVA model) are shown herein. First, the computational formula for ICC 1,1 is as follows:

ICC 1,1 = (MSB − MSW) / [MSB + (k − 1)MSW], (4)

where MSB indicates the between-subjects mean square, MSW indicates the within-subjects mean square, and k is the number of trials (3, 46). The relevant mean square values can be found in Table 2. To relate this computational formula to equation 1, one must know that estimation of the appropriate σ² comes from expected mean squares from ANOVA. Specifically, for this model the expected MSB equals σe² plus kσs², whereas the expected MSW equals σe² (3); therefore, MSB equals MSW plus kσs². If from equation 1 we estimate σt² from between-subjects variance (σs²), then

ICC = σs² / (σs² + σe²). (5)

By algebraic manipulation (e.g., σs² = [MSB − MSW]/k) and substitution of the expected mean squares into equation 5, it can be shown that

ICC 1,1 = σs² / (σs² + σe²) = ([MSB − MSW]/k) / ([MSB − MSW]/k + MSW) = (MSB − MSW) / [MSB + (k − 1)MSW]. (6)

Similar derivations can be made for the other ICC models (3, 34, 46, 49) so that all ultimately relate to equation 1. Of note is that with the different ICC models (fixed vs. random effects, 1-way vs. 2-way ANOVA), the expected mean squares change and thus the computational formulas commonly found in the literature (30, 41) also change.
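A quick numerical check confirms that the computational formula (equation 4) and the variance-ratio form (equation 6) agree. The sketch below uses the Table 2 mean squares for data set A:

```python
# Check that equation 4 and equation 6 give the same ICC 1,1, using the
# Table 2 values for data set A: MSB = 2098.4, MSW = 53.75, k = 2 trials.
ms_b, ms_w, k = 2098.4, 53.75, 2

icc_eq4 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)   # computational formula

var_s = (ms_b - ms_w) / k    # estimated between-subjects variance, σs²
var_e = ms_w                 # estimated error variance, σe²
icc_eq6 = var_s / (var_s + var_e)                   # variance-ratio form

print(round(icc_eq4, 2), round(icc_eq6, 2))  # → 0.95 0.95
```

Both routes yield the 0.95 reported for ICC 1,1 in Table 4, confirming that the algebra in equation 6 is just a re-expression of equation 4.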

Choosing an ICC

Given the 6 ICC versions of Shrout and Fleiss (46) and the 10 versions presented by McGraw and Wong (34), the choice of ICC is perplexing, especially considering that


TABLE 4. ICC values for data sets A and B.*

| ICC type | Data set A | Data set B |
|---|---:|---:|
| 1,1 | 0.95 | 0.56 |
| 1,k | 0.97 | 0.72 |
| 2,1 | 0.95 | 0.55 |
| 2,k | 0.97 | 0.71 |
| 3,1 | 0.95 | 0.54 |
| 3,k | 0.97 | 0.70 |

* ICC = intraclass correlation coefficient.
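The Table 4 values can be reproduced from the mean squares of Tables 2 and 3 with the standard Shrout-Fleiss formulas. A sketch (n = 8 subjects, k = 2 trials; for both data sets MSW = 53.75, MST = 30.2, and MSE = 57.1, while the subjects mean square is 2,098.4 for data set A vs. 190 for data set B):

```python
# The six Shrout-Fleiss ICCs, computed from repeated-measures ANOVA mean
# squares (ms_b = MSB, ms_s = MSS, ms_w = MSW, ms_t = MST, ms_e = MSE).
def six_iccs(ms_b, ms_s, ms_w, ms_t, ms_e, n, k):
    return {
        "1,1": (ms_b - ms_w) / (ms_b + (k - 1) * ms_w),
        "1,k": (ms_b - ms_w) / ms_b,
        "2,1": (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_t - ms_e) / n),
        "2,k": (ms_s - ms_e) / (ms_s + (ms_t - ms_e) / n),
        "3,1": (ms_s - ms_e) / (ms_s + (k - 1) * ms_e),
        "3,k": (ms_s - ms_e) / ms_s,
    }

icc_a = six_iccs(2098.4, 2098.4, 53.75, 30.2, 57.1, n=8, k=2)  # data set A
icc_b = six_iccs(190.0, 190.0, 53.75, 30.2, 57.1, n=8, k=2)    # data set B
print({m: round(v, 2) for m, v in icc_a.items()})
```

Rounded to 2 decimals, data set A yields 0.95/0.97 across all models and data set B yields 0.54 to 0.72, matching Table 4 and illustrating how shrinking between-subjects variability (data set B) depresses every ICC even though the error terms are identical.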

TABLE 6. Example data set with systematic error.*

| Trial C1 | Trial C2 | C2 − C1 |
|---:|---:|---:|
| 146 | 161 | +15 |
| 148 | 162 | +14 |
| 170 | 189 | +19 |
| 90 | 100 | +10 |
| 157 | 175 | +18 |
| 156 | 171 | +15 |
| 176 | 195 | +19 |
| 205 | 219 | +14 |
| 156 ± 33 | 172 ± 35 | |

* Bottom row is mean ± SD.
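A quick way to confirm the systematic error in Table 6 is the trials test described earlier; with k = 2 it reduces to a paired t test on the difference scores. A sketch using the Table 6 values:

```python
import numpy as np

# Trial-to-trial differences for the Table 6 data (C2 - C1): every subject
# improved, the signature of systematic error (e.g., a learning effect).
c1 = np.array([146, 148, 170, 90, 157, 156, 176, 205], dtype=float)
c2 = np.array([161, 162, 189, 100, 175, 171, 195, 219], dtype=float)
d = c2 - c1

# Paired t statistic (equivalent to the trials F test when k = 2):
# mean difference divided by its standard error.
n = d.size
t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
print(round(d.mean(), 1), round(t, 1))  # → 15.5 14.3
```

With 7 df, a t of roughly 14 far exceeds any conventional critical value, so the upward shift of about 15 units would be flagged as statistically significant systematic error, unlike the nonsignificant trials effect for data sets A and B.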

TABLE 5. Intraclass correlation coefficient model summary table.*

| Shrout and Fleiss | Computational formula | McGraw and Wong | Model |
|---|---|---|---|
| 1,1 | (MSB − MSW) / [MSB + (k − 1)MSW] | 1 | 1-way random |
| 1,k | (MSB − MSW) / MSB | k | 1-way random |
| Use 3,1 | | C,1 | 2-way random |
| Use 3,k | | C,k | 2-way random |
| 2,1 | (MSS − MSE) / [MSS + (k − 1)MSE + k(MST − MSE)/n] | A,1 | 2-way random |
| 2,k | (MSS − MSE) / [MSS + (MST − MSE)/n] | A,k | 2-way random |
| 3,1 | (MSS − MSE) / [MSS + (k − 1)MSE] | C,1 | 2-way fixed |
| 3,k | (MSS − MSE) / MSS | C,k | 2-way fixed |
| Use 2,1 | | A,1 | 2-way fixed |
| Use 2,k | | A,k | 2-way fixed |

* Adapted from Shrout and Fleiss (46) and McGraw and Wong (34). Mean square abbreviations are based on the 1-way and 2-way analysis of variance illustrated in Table 2. For McGraw and Wong, A = absolute and C = consistency. MSB = between-subjects mean square; MSE = error mean square; MSS = subjects mean square; MST = trials mean square; MSW = within-subjects mean square; n = number of subjects; k = number of trials.

most of the literature deals with rater reliability not test-

retest reliability of physical performance measures. In a

classic paper, Brozek and Alexander (11) first introduced

the concept of the ICC to the movement sciences litera-

ture and detailed the implementation of an ICC for ap-

plication to test-retest analysis of motor tasks. Their co-

efficient is equivalent to model 3,1. Thus, one might use

ICC 3,1 with test-retest reliability where trials is substi-

tuted for raters. From the rater nomenclature above, if

one does not wish to generalize the reliability findings but

rather assert that in our hands the procedures are reli-

able, then ICC 3,1 seems like a logical choice. However,

this ICC does not include variance associated with sys-

tematic error and is in fact closely approximated by the

Pearson r (1, 43). Therefore, the criticism of the Pearson

r as an index of reliability holds as well for ICCs derived

from model 3. At the least, it needs to be established that

the effect for trials (bias) is trivial if reporting an ICC

derived from model 3. Use of effect size for the trials effect

in the ANOVA would provide information in this regard.

With respect to ICC 3,1, Alexander (1) notes that it ‘‘may

be regarded as an estimate of the value that would have

been obtained if the fluctuation [systematic error] had

been avoided.’’

In a more general sense, there are 4 issues to be ad-

dressed in choosing an ICC: (a) 1- or 2-way model, (b)

fixed- or random-effect model, (c) include or exclude sys-

tematic error in the ICC, and (d) single or mean score.

With respect to choosing a 1- or 2-way model, in a 1-way

model, the effect of raters or trials (replication study) is

not crossed with subjects, meaning that it allows for sit-

uations where all raters do not score all subjects (48).

Fleiss (22) uses the 1-way model for what he terms simple

replication studies. In this model, all sources of error are

lumped together into the MSW (Tables 2 and 3). In contrast, the 2-way models allow the error to be partitioned between systematic and random error. When systematic error is small, MSW from the 1-way model and the error mean square (MSE) from the 2-way models (reflecting random error) are similar, and the resulting ICCs are similar. This is true for both data sets A and B. When systematic error is substantial, MSW and MSE are disparate, as in

data set C (Tables 6 and 7). Two-way models require tri-

als or raters to be crossed with subjects (i.e., subjects pro-

vide scores for all trials or each rater rates all subjects).

For test-retest situations, the design dictates that trials

are crossed with subjects and therefore lend themselves

to analysis by 2-way models.

Regarding fixed vs. random effects, a fixed factor is

one in which all levels of the factor of interest (in this

WEIR

TABLE 7. Analysis of variance summary table for data set C.*

Source              df    SS        Mean square             F         p value
Between subjects     7    15,925    2275 (MSB: 1-way;
                                          MSS: 2-way)       482.58
Within subjects      8    994       124.25 (MSW)
  Trials             1    961.0     961.0 (MST)             203.85    <0.0001
  Error              7    33.0      4.71 (MSE)
Total               15    16,919

* MSB = between-subjects mean square; MSE = error mean square; MSS = subjects mean square; MST = trials mean square; MSW = within-subjects mean square; SS = sums of squares.

case trials) are included in the analysis and no attempt

at generalization of the reliability data beyond the con-

fines of the study is expected. Determining the reliability

of a test before using it in a larger study fits this descrip-

tion of fixed effect. A random factor is one in which the

levels of the factor in the design (trials) are but a sample

of the possible levels, and the analysis will be used to

generalize to other levels. For example, a study designed

to evaluate the test-retest reliability of the vertical jump

for use by other coaches (with similar athletes) would con-

sider the effect of trials to be a random effect. Both Shrout

and Fleiss (46) models 1 and 2 are random-effects models,

whereas model 3 is a fixed-effect model. From this dis-

cussion, for the 2-way models of Shrout and Fleiss (46),

the choice between model 2 and model 3 appears to hinge

on a decision regarding a random- vs. fixed-effects model.

However, models 2 and 3 also differ in their treatment of

systematic error. As noted previously, model 3 only con-

siders random error, whereas model 2 considers both ran-

dom and systematic error. This system does not include

a 2-way fixed-effects model that includes systematic error

and does not offer a 2-way random-effects model that only

considers random error. The expanded system of McGraw

and Wong (34) includes these options. In the nomencla-

ture of McGraw and Wong (34), the designation C refers

to consistency and A refers to absolute agreement. That

is, the C models consider only random error and the A

models consider both random and systematic error. As

noted in Table 5, no new computational formulas are re-

quired beyond those presented by Shrout and Fleiss (46).

Thus, if one were to choose a 2-way random-effects model

that only addressed random error, one would use equa-

tion 3,1 (or equation 3,k if the mean across k trials is the

criterion score). Similarly, if one were to choose a 2-way

fixed-effects model that addressed both systematic and

random error, equation 2,1 would be used (or 2,k). Ulti-

mately then, since the computational formulas do not dif-

fer between systems, the choice between using the Shrout

and Fleiss (46) equations from models 2 vs. 3 hinge on

decisions regarding inclusion or exclusion of systematic

error in the calculations. As noted by McGraw and Wong

(34), ‘‘the random-fixed effects distinction is in its effect

on the interpretation, but not calculation, of an ICC.’’
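The 2-way single-score coefficients from Table 5 can be sketched the same way (a minimal sketch with our own helper names, not code from the article). On the Table 6 data it reproduces the ICC 2,1 of 0.901 reported below and shows how much larger the coefficient becomes when only random error (equation 3,1) is considered.

```python
def two_way_ms(scores):
    """Subjects, trials, and error mean squares from a 2-way ANOVA
    (n subjects crossed with k trials, one score per cell)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    subj_means = [sum(row) / k for row in scores]
    trial_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_s = k * sum((m - grand) ** 2 for m in subj_means)
    ss_t = n * sum((m - grand) ** 2 for m in trial_means)
    ss_e = sum((x - grand) ** 2 for row in scores for x in row) - ss_s - ss_t
    return ss_s / (n - 1), ss_t / (k - 1), ss_e / ((n - 1) * (k - 1))

def icc_3_1(scores):
    """Random error only (Table 5): (MSS - MSE) / [MSS + (k - 1)MSE]."""
    ms_s, _, ms_e = two_way_ms(scores)
    k = len(scores[0])
    return (ms_s - ms_e) / (ms_s + (k - 1) * ms_e)

def icc_2_1(scores):
    """Random plus systematic error (Table 5)."""
    ms_s, ms_t, ms_e = two_way_ms(scores)
    n, k = len(scores), len(scores[0])
    return (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_t - ms_e) / n)

# Data set C from Table 6
data_c = [(146, 161), (148, 162), (170, 189), (90, 100),
          (157, 175), (156, 171), (176, 195), (205, 219)]
```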

Should systematic error be included in the ICC? First,

if the effect for trials is small, the systematic differences

between trials will be small, and the ICCs will be similar

to each other. This is evident in both the A and B data

sets (Tables 1 through 3). However, if the mean differ-

ences are large, then differences between ICCs are evi-

dent, especially between equation 3,1, which does not con-

sider systematic error, and equations 1,1 and 2,1, which

do consider systematic error. In this regard, the F test for

trials and the ICC calculations may give contradictory re-

sults from the same data. Specifically, it can be the case

that an ICC can be large (indicating good reliability),

whereas the ANOVA shows a significant trials effect. An

example is given in Tables 6 and 7. In this example, each

score in trial C1 was altered in trial C2 so that there was

a bias of +15 kg and a random component added to each score. The effect for trials was significant (F1,7 = 203.85, p < 0.001) and reflected a mean increase of 16 kg. For an

ANOVA to be significant, the effect must be large (in this

case, the mean differences between trials must be large),

the noise (error term) must be small, or both. The error

term is small when all subjects behave similarly across

test days. When this is the case, even small mean differ-

ences can be statistically significant. In this case, the sys-

tematic differences explain a significant amount of vari-

ability in the data. Despite the rather large systematic

error, the ICC values from equations 1,1; 2,1; and 3,1

were 0.896, 0.901, and 0.998, respectively. A cursory ex-

amination of just the ICC scores would suggest that the

test exhibited good reliability, especially using equation

3,1, which only reflects random error. However, an ap-

proximately 10% increase in scores from trial C1 to C2

would suggest otherwise. Thus, an analysis that only fo-

cuses on the ICC without consideration of the trials effect

is incomplete (31). If the effect for trials is significant, the

most straightforward approach is to develop a measure-

ment schedule that will attenuate systematic error (2,

50). For example, if learning effects are present, one

might add trials until a plateau in performance occurs.

Then the ICC could be calculated only on the trials in the

plateau region. The identification of such a measurement

schedule would be especially helpful for random-effects

situations where others might be using the test being

evaluated. For simplicity, all the examples here have

been with only 2 levels for trials. If a trials effect is sig-

nificant, however, 2 trials are insufficient to identify a

plateau. The possibility of a significant trials effect should

be considered in the design of the reliability study. For-

tunately, the ANOVA procedures require no modification

to accommodate any number of trials.
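The trials F test described above can be sketched as follows (the helper name is ours); on the Table 6 data it reproduces the F of 203.85 from Table 7.

```python
def trials_f(scores):
    """F ratio for the trials effect (MST / MSE) from the 2-way ANOVA."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    subj_means = [sum(row) / k for row in scores]
    trial_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_s = k * sum((m - grand) ** 2 for m in subj_means)
    ss_t = n * sum((m - grand) ** 2 for m in trial_means)
    ss_e = sum((x - grand) ** 2 for row in scores for x in row) - ss_s - ss_t
    ms_t = ss_t / (k - 1)
    ms_e = ss_e / ((n - 1) * (k - 1))
    # Compare the result to the F(k - 1, (n - 1)(k - 1)) critical value
    return ms_t / ms_e

# Data set C from Table 6: a large, systematic ~15-kg shift between trials
data_c = [(146, 161), (148, 162), (170, 189), (90, 100),
          (157, 175), (156, 171), (176, 195), (205, 219)]
```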

Interpreting the ICC

At one level, interpreting the ICC is fairly straightforward; it represents the proportion of variance in a set of scores that is attributable to true score variance (σ²t). An ICC of 0.95 means that an estimated 95% of the observed score variance is due to σ²t. The balance of the variance (1 − ICC = 5%) is attributable to error (51). However, how does one qualitatively evaluate the magnitude of an ICC and what can the quantity tell you? Some sources have attempted to delineate good, medium, and poor levels for the ICC, but there is certainly no consensus as to what constitutes a good ICC (45). Indeed, Charter and Feldt (15) argue that


‘‘it is not theoretically defensible to set a universal stan-

dard for test score reliability.’’ These interpretations are

further complicated by 2 factors. First, as noted herein,

the ICC varies, depending on which version of the ICC is

used. Second, the magnitude of the ICC is dependent on

the variability in the data (45). All other things being

equal, low levels of between-subjects variability will serve

to depress the ICC even if the differences between sub-

jects’ scores across test conditions are small. This is illus-

trated by comparing the 2 example sets of data in Table

1. Trials 1 and 2 of data sets A and B have identical mean

values and identical change scores between trials 1 and

2. They differ in the variability between subjects, with

greater between-subjects variability evident in data set A

as shown in the larger SDs. In Tables 2 and 3, the AN-

OVA tables have identical outcomes with respect to the

inferential test of the factor trials and have identical error

terms (since the between-subjects variability is not part

of the error term, as noted previously). Table 4 shows the

ICC values calculated using the 6 different models of

Shrout and Fleiss (46) on the A and B data sets. Clearly,

data set B, with the lower between-subjects variability,

results in smaller ICC values than data set A.
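The same point can be demonstrated with two small hypothetical data sets (all numbers invented for illustration) that share identical trial-to-trial differences but differ in the spread between subjects:

```python
def icc_1_1(scores):
    """ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) * MSW), as in equation 6."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    subj_means = [sum(row) / k for row in scores]
    ms_b = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    ms_w = sum((x - m) ** 2 for row, m in zip(scores, subj_means)
               for x in row) / (n * (k - 1))
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)

# Identical trial-to-trial differences (+2, -2, +3, -2) in both sets;
# only the between-subjects spread differs.
wide = [(100, 102), (120, 118), (140, 143), (160, 158)]
narrow = [(128, 130), (131, 129), (130, 133), (133, 131)]
```

With the heterogeneous (`wide`) subjects the ICC is near 1; with the homogeneous (`narrow`) subjects the identical measurement noise yields a far lower ICC.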

How then does one interpret an ICC? First, because

of the relationship between the ICC and between-subjects

variability, the heterogeneity of the subjects should be

considered. A large ICC can mask poor trial-to-trial con-

sistency when between-subjects variability is high. Con-

versely, a low ICC can be found even when trial-to-trial

variability is low if the between-subjects variability is

low. In this case, the homogeneity of the subjects means

it will be difficult to differentiate between subjects even

though the absolute measurement error is small. An ex-

amination of the SEM in conjunction with the ICC is

therefore needed (32). From a practical perspective, a giv-

en test can have different reliability, at least as deter-

mined from the ICC, depending on the characteristics of

the individuals included in the analysis. In the 1RM

squat, combining individuals of widely different capabil-

ities (e.g., wide receivers and defensive linemen in Amer-

ican football) into the same analysis increases between-

subjects variability and improves the ICC, yet this may

not be reflected in the expected day-to-day variation as

illustrated in Tables 1 through 4. In addition, the infer-

ential test for bias described previously needs to be con-

sidered. High between-subjects variability may result in

a high ICC even if the test for bias is statistically signif-

icant.

The relationship between between-subjects variability

and the magnitude of the ICC has been used as a criti-

cism of the ICC (10, 39). This is an unfair criticism, since

the ICC is used to provide information regarding infer-

ential statistical tests not to provide an index of absolute

measurement error. In essence, the ICC normalizes mea-

surement error relative to the heterogeneity of the sub-

jects. As an index of absolute reliability then, this is a

weakness and other indices (i.e., the SEM) are more in-

formative. As a relative index of reliability, the ICC be-

haves as intended.

What are the implications of a low ICC? First, mea-

surement error reflected in an ICC of less than 1.0 serves

to attenuate correlations (22, 38). The equation for this

attenuation effect is as follows:

rxy = r̂xy √(ICCx × ICCy),  (7)

where rxy is the observed correlation between x and y, r̂xy is the correlation between x and y if both were measured without error (i.e., the correlation between the true scores), and ICCx and ICCy are the reliability coefficients

for x and y, respectively. Nunnally and Bernstein (38)

note that the effect of measurement error on correlation

attenuation becomes minimal as ICCs increase above

0.80. In addition, reliability affects the power of statistical

tests. Specifically, the lower the reliability, the greater

the risk of type 2 error (14, 40). Fleiss (22) illustrates how

the magnitude of an ICC can be used to adjust sample

size and statistical power calculations (45). In short, low

ICCs mean that more subjects are required in a study for

a given effect size to be statistically significant (40). An

ICC of 0.60 may be perfectly fine if the resulting effect on

sample size and statistical power is within the logistical

constraints of the study. If, however, an ICC of 0.60

means that, for a required level of power, more subjects

must be recruited than is feasible, then 0.60 is not ac-

ceptable.
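Equation 7 is a one-liner in code (an illustrative sketch; the function name is ours), and it makes it easy to see how quickly attenuation fades as reliability rises above 0.80:

```python
import math

def attenuated_r(true_r, icc_x, icc_y):
    """Equation 7: the correlation observed after measurement error,
    given the error-free (true score) correlation and each test's ICC."""
    return true_r * math.sqrt(icc_x * icc_y)
```

For example, a true correlation of 0.8 between tests with ICCs of 0.9 and 0.8 is observed as roughly 0.68, whereas perfectly reliable tests leave it untouched.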

Although infrequently used in the movement sciences,

the ICC of test scores can be used in the setting and in-

terpretation of cut points for classification of individuals.

Charter and Feldt (15) show how the ICC can be used to

estimate the percentage of false-positive, false-negative,

true-positive, and true-negative results for a clinical clas-

sification scheme. Although the details of these calcula-

tions are beyond the scope of this article, it is worthwhile

to note that very high ICCs are required to classify indi-

viduals with a minimum of misclassification.

THE SEM

Because the general form of the ICC is a ratio of variance

due to differences between subjects (the signal) to the to-

tal variability in the data (the noise), the ICC is reflective

of the ability of a test to differentiate between different

individuals (27, 47). It does not provide an index of the

expected trial-to-trial noise in the data, which would be

useful to practitioners such as strength coaches. Unlike

the ICC, which is a relative measure of reliability, the

SEM provides an absolute index of reliability. Hopkins

(26) refers to this as the ‘‘typical error.’’ The SEM quan-

tifies the precision of individual scores on a test (24). The

SEM has the same units as the measurement of interest,

whereas the ICC is unitless. The interpretation of the

SEM centers on the assessment of reliability within in-

dividual subjects (45). The direct calculation of the SEM

involves the determination of the SD of a large number

of scores from an individual (44). In practice, a large number of scores is not typically collected, so the SEM is estimated. Most references estimate the SEM as follows:

SEM = SD√(1 − ICC),  (8)

where SD is the SD of the scores from all subjects (which can be determined from the ANOVA as √[SSTOTAL/(n − 1)]) and ICC is the reliability coefficient.

Note the similarity between the equation for the SEM

and standard error of estimate from regression analysis.

Since different forms of the ICC can result in different

numbers, the choice of ICC can substantively affect the

size of the SEM, especially if systematic error is present.

However, there is an alternative way of calculating the

SEM that avoids these uncertainties. The SEM can be

estimated as the square root of the mean square error

term from the ANOVA (20, 26, 48). Since this estimate of


the SEM has the advantage of being independent of the

specific ICC, its use would allow for more consistency in

interpreting SEM values from different studies. However,

the mean square error terms differ when using the 1-way

vs. 2-way models. In Table 2 it can be seen that using a

1-way model (22) would require the use of MSW (√53.75 = 7.3 kg), whereas use of a 2-way model would require use of MSE (√57 = 7.6 kg). Hopkins (26) argues that because the 1-way model combines influences of random

and systematic error together, ‘‘The resulting statistic is

biased high and is hard to interpret because the relative

contributions of random error and changes in the mean

are unknown.’’ He therefore suggests that the error term

from the 2-way model (MSE) be used to calculate SEM.

Note, however, that in this sample, the 1-way SEM is

smaller than the 2-way SEM. This is because the trials

effect is small. The high bias of the 1-way model is ob-

served when the trials effect is large (Table 7). The SEM

calculated using the MS error from the 2-way model

(√4.71 = 2.2 kg) is markedly lower than the SEM calculated using the 1-way model (√124.25 = 11.1 kg), since the SEM defined as √MSE only considers random er-

ror. This is consistent with the concept of a SE, which

defines noise symmetrically around a central value. This

points to the desirability of establishing a measurement schedule that is free of systematic variation.
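Both routes to the SEM can be sketched as follows (the helper names are ours). With the Table 7 error terms the sketch reproduces the 2.2- and 11.1-kg values above; with data set A's SD of 31.74 and ICC of 0.95, equation 8 gives roughly 7.1 kg, modestly different from the √57 ≈ 7.6 kg ANOVA estimate.

```python
import math

def sem_from_icc(sd, icc):
    """Equation 8: SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)

def sem_from_ms_error(ms_error):
    """SEM as the square root of the ANOVA error term (20, 26, 48)."""
    return math.sqrt(ms_error)
```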

Another difference between the ICC and SEM is that

the SEM is largely independent of the population from

which it was determined, i.e., the SEM ‘‘is considered to

be a fixed characteristic of any measure, regardless of the

sample of subjects under investigation’’ (38). Thus, the

SEM is not affected by between-subjects variability as is

the ICC. To illustrate, the MSE for the data in Tables 2 and 3 are equal (MSE = 57), despite large differences in between-subjects variability. The resulting SEM is the same for data sets A and B (√57 = 7.6 kg), yet they

have different ICC values (Table 4). The results are sim-

ilar when calculating the SEM using equation 8, even

though equation 8 uses the ICC in calculating the SEM,

since the effects of the SD and the ICC tend to offset each

other (38). However, the effects do not offset each other

completely, and use of equation 8 results in an SEM es-

timate that is modestly affected by between-subjects var-

iability (2).

The SEM is the SE in estimating observed scores (the

scores in your data set) from true scores (38). Of course,

our problem is just the opposite. We have the observed

scores and would like to estimate subjects’ true scores.

The SEM has been used to define the boundaries around

which we think a subject’s true score lies. It is often re-

ported (8, 17) that the 95% CI for a subject’s true score

can be estimated as follows:

T = S ± 1.96(SEM),  (9)

where T is the subject’s true score, S is the subject’s ob-

served score on the measurement, and 1.96 defines the

95% CI. However, strictly speaking this is not correct,

since the SEM is symmetrical around the true score, not

the observed score (13, 19, 24, 38), and the SEM reflects

the SD of the observed scores while holding the true score

constant. In lieu of equation 9, an alternate approach is

to estimate the subject’s true score and calculate an al-

ternate SE (reflecting the SD of true scores while holding

observed scores constant). Because of regression to the

mean, obtained scores (S) are biased estimators of true


scores (16, 19). Scores below the mean are biased down-

ward, and scores above the mean are biased upward. A

subject’s estimated true score (T) can be calculated as fol-

lows:

T = X̄ + ICC(d),  (10)

where d = S − X̄. To illustrate, consider data set A in

Table 1. With a grand mean of 154.5 and an ICC 3,1 of

0.95, an individual with an S of 120 kg would have a

predicted T of 154.5 + 0.95(120 − 154.5) = 121.8 kg.

Note that because the ICC is high, the bias is small (1.8

kg). The appropriate SE to define the CI of the true score,

which some have referred to as the standard error of es-

timate (13), is as follows (19, 38):

SEMTS = SD√[ICC(1 − ICC)].  (11)

In this example the value is 31.74√[0.95(1 − 0.95)] =

6.92, where 31.74 equals the SD of the observed scores

around the grand mean. The 95% CI for T is then 121.8

± 1.96(6.92), which defines a span of 108.2 to 135.4 kg.

The entire process, which has been termed the regression-

based approach (16), can be summarized as follows (24):

95% CI for T = X̄ + ICC(d) ± 1.96 SD√[ICC(1 − ICC)].  (12)

If one had simply used equation 9 using S and SEM,

the resulting interval would span 120 ± 1.96(7.6) = 105.1 to 134.9 kg. Note that the difference between the CIs is

small and that the CI width from equation 9 (29.8 kg) is

wider than that from equation 12 (27.2 kg). For all ICCs

less than 1.0, the CI width will be narrower from equation

12 than from equation 9 (16), but the differences shrink

as the ICC approaches 1.0 and as S approaches X̄ (24).
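The regression-based steps (equations 10 through 12) can be sketched as follows (the function name is ours; the numbers in the check are the data set A example from the text):

```python
import math

def true_score_ci(s, grand_mean, sd, icc, z=1.96):
    """Regression-based CI for a subject's true score (equations 10-12)."""
    t_hat = grand_mean + icc * (s - grand_mean)   # equation 10: shrink toward the mean
    se_ts = sd * math.sqrt(icc * (1 - icc))       # equation 11: SE of true scores
    return t_hat, t_hat - z * se_ts, t_hat + z * se_ts
```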


MINIMAL DIFFERENCES NEEDED TO BE

CONSIDERED REAL

The SEM is an index that can be used to define the dif-

ference needed between separate measures on a subject

for the difference in the measures to be considered real.

For example, if the 1RM of an athlete on one day is 155

kg and at some later time is 160 kg, are you confident

that the athlete really increased the 1RM by 5 kg or is

this difference within what you might expect to see in

repeated testing just due to the noise in the measure-

ment? The SEM can be used to determine the minimum

difference (MD) to be considered ‘‘real’’ and can be cal-

culated as follows (8, 20, 42):

MD = SEM × 1.96 × √2.  (13)

Once again the point is to construct a 95% CI, and the

1.96 value is simply the z score associated with a 95% CI.

(One may choose a different z score instead of 1.96 if a

more liberal or more conservative assessment is desired.)

But where does the √2 come from?

Why can’t we simply calculate the 95% CI for a sub-

ject’s score as we have done above? If the score is outside

that interval, then shouldn’t we be 95% confident that the

subject’s score has really changed? Indeed, this approach

has been suggested in the literature (25, 37). The key

here is that we now have 2 scores from a subject. Each of

these scores has a true component and an error compo-

nent. That is, both scores were measured with error, and

simply seeing if the second score falls outside the CI of

the first score does not account for the error in the second


score. What we really want here is an index based on the

variability of the difference scores. This can be quantified

as the SD of the difference scores (SDd). As it turns out,

when there are 2 levels of trials (as in the examples

herein), the SEM is equal to the SDd divided by √2 (17, 26):

SEM = SDd/√2.  (14)

Therefore, multiplying the SEM by √2 solves for the

SDd and then multiplying the SDd by 1.96 allows for the

construction of the 95% CI. Once the MD is calculated,

then any change in a subject’s score, either above or below

the previous score, greater than the MD is considered

real. More precisely, for all people whose differences on

repeated testing are at least greater than or equal to the

MD, 95% of them would reflect real differences. Using

data set A, the first subject has a trial A1 score of 146

kg. The SEM for the test is √57 = 7.6 kg. From equation 13, MD = 7.6 × 1.96 × √2 = 21.07 kg. Thus, a change

of at least 21.07 kg needs to occur to be confident, at the

95% level, that a change in 1RM reflects a real change

and not a difference that is within what might be reason-

ably expected given the measurement error of the 1RM

test.
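Equation 13 in code (the helper name is ours), reproducing the 21.07-kg minimal difference from the example above:

```python
import math

def minimal_difference(sem, z=1.96):
    """Equation 13: MD = SEM * z * sqrt(2); the sqrt(2) converts the SEM
    into the SD of difference scores (equation 14)."""
    return sem * z * math.sqrt(2)
```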

However, as with defining a CI for an observed score,

the process outlined herein for defining a minimal differ-

ence is not precisely accurate. As noted by Charter (13)

and Dudek (19), the SE of prediction (SEP) is the correct

SE to use in these calculations, not the SEM. The SEP is

calculated as follows:

SEP = SD√(1 − ICC²).  (15)

To define a 95% CI outside which one could be confi-

dent that a retest score reflects a real change in perfor-

mance, simply calculate the estimated true score (equa-

tion 10) plus or minus the SEP. To illustrate, consider

the same data as in the example in the previous para-

graph. From equation 10, we estimate the subject’s true

score (T) as T ? X¯? ICC (d) ? 154 ? 0.95 (146 ? 154.5)

? 146.4 kg. The SEP ? SD ? ?(1 ? ICC2) ? 31.74 ?

?(1 ? 0.952) ? 9.91. The resulting 95% CI is 146.4 ?

1.96 (9.91), which defines an interval from approximately

127 to 166 kg. Therefore, any retest score outside that

interval would be interpreted as reflecting a real change

in performance. As given in Table 1, the retest score of

140 kg is inside the CI and would be interpreted as a

change consistent with the measurement error of the test

and does not reflect a real change in performance. As be-

fore, use of a different z score in place of 1.96 will allow

for the construction of a more liberal or conservative CI.
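The SEP-based check can be sketched as follows (the function name is ours; it reproduces the conclusion above that a drop from 146 to 140 kg is within measurement error):

```python
import math

def real_change(baseline, retest, grand_mean, sd, icc, z=1.96):
    """Flag a retest outside the SEP-based CI of the estimated true score
    (equations 10 and 15) as a real change in performance."""
    t_hat = grand_mean + icc * (baseline - grand_mean)  # equation 10
    sep = sd * math.sqrt(1 - icc ** 2)                  # equation 15
    return abs(retest - t_hat) > z * sep
```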


OTHER CONSIDERATIONS

In this article, several considerations regarding ICC and

SEM calculations will not be addressed in detail, but brief

mention will be made here. First, assumptions of ANOVA

apply to these data. The most common assumption vio-

lated is that of homoscedasticity. That is, does the size of

the error correlate with the magnitude of the observed

scores? If the data exhibit homoscedasticity, the answer

is no. For physical performance measures, it is common

that the absolute error tends to be larger for subjects who

score higher (2, 26), e.g., the noise from repeated strength

testing of stronger subjects is likely to be larger than the

noise from weaker subjects. If the data exhibit heterosce-

dasticity, often a logarithmic transformation is appropri-

ate. Second, it is important to realize that ICC and SEM

values determined from sample data are estimates. As

such, it is instructive to construct CIs for these estimates.

Details of how to construct these CIs are addressed in

other sources (34, 35). Third, how many subjects are re-

quired to get adequate stability for the ICC and SEM cal-

culations? Unfortunately, there is no consensus in this

area. The reader is referred to other studies for further

discussion (16, 35, 52). Finally, reliability, as quantified

by the ICC, is not synonymous with responsiveness to

change (23). The MD calculation presented herein allows

one to evaluate a change score after the fact. However, a

small MD, in and of itself, is not a priori evidence that a

given test is responsive.
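A quick homoscedasticity screen consistent with the first point above can be sketched as follows (the helper name and the proportional-error data are our own; in practice one would also inspect a Bland-Altman style plot and, when error scales with magnitude, log-transform before computing the ICC and SEM):

```python
import math

def heteroscedasticity_r(trial1, trial2):
    """Pearson r between subject means and absolute trial differences;
    a clearly positive value suggests error grows with score magnitude."""
    means = [(a + b) / 2 for a, b in zip(trial1, trial2)]
    absd = [abs(a - b) for a, b in zip(trial1, trial2)]
    n = len(means)
    mx, my = sum(means) / n, sum(absd) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(means, absd))
    var_x = sum((x - mx) ** 2 for x in means)
    var_y = sum((y - my) ** 2 for y in absd)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical strength scores whose retest error scales with the score
t1 = [10, 20, 40, 80, 160]
t2 = [11, 22, 44, 88, 176]
```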

PRACTICAL APPLICATIONS

For a comprehensive assessment of reliability, a 3-layered

approach is recommended. First, perform a repeated-

measures ANOVA and cast the summary table as a 2-

way model, i.e., trials and error are separate sources of

variance. Evaluate the F ratio for the trials effect to ex-

amine systematic error. As noted previously, it may be

prudent to evaluate the effect for trials using a more liberal α level than the traditional 0.05 level. If the effect

for trials is significant (and the effect size is not trivial),

it is prudent to reexamine the measurement schedule for

influences of learning and fatigue. If 3 or more levels of

trials were included in the analysis, a plateau in perfor-

mance may be evident, and exclusion of only those levels

of trials not in the plateau region in a subsequent re-

analysis may be warranted. However, this exclusion of

trials needs to be reported. Under these conditions, where

systematic error is deemed unimportant, the ICC values

will be similar and reflect random error (imprecision).

However, it is suggested here that the ICC from equation

3,1 be used (Table 5), since it is most closely tied to the

MSE calculation of the SEM. Once the systematic error is

determined to be nonsignificant or trivial, interpret the

ICC and SEM within the analytical goals of your study

(2). Specifically, researchers interested in group-level re-

sponses can use the ICC to assess correlation attenuation,

statistical power, and sample size calculations. Practi-

tioners (e.g., coaches, clinicians) can use the SEM (and

associated SEs) in the interpretation of scores from in-

dividual athletes (CIs for true scores, assessing individual

change). Finally, although reliability is an important as-

pect of measurement, a test may exhibit reliability but

not be a valid test (i.e., it does not measure what it pur-

ports to measure).

REFERENCES

1.ALEXANDER, H.W. The estimation of reliability when several

trials are available. Psychometrika 12:79–99. 1947.

2.ATKINSON, G., AND A.M. NEVILL. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 26:217–238. 1998.

3.BARTKO, J.J. The intraclass reliability coefficient as a measure

of reliability. Psychol. Rep. 19:3–11. 1966.

4.BARTKO, J.J. On various intraclass correlation coefficients.

Psychol. Bull. 83:762–765. 1976.

5.BAUMGARTNER, T.A. Estimating reliability when all test trials

are administered on the same day. Res. Q. 40:222–225. 1969.

6.BAUMGARTNER, T.A. Norm-referenced measurement: reliabili-

ty. In: Measurement Concepts in Physical Education and Exer-

cise Science. M.J. Safrit and T.M. Woods, eds. Champaign, IL:

Human Kinetics, 1989. pp. 45–72.


7. BAUMGARTNER, T.A. Estimating the stability reliability of a score. Meas. Phys. Educ. Exerc. Sci. 4:175–178. 2000.
8. BECKERMAN, H., T.W. VOGELAAR, G.L. LANKHORST, AND A.L.M. VERBEEK. A criterion for stability of the motor function of the lower extremity in stroke patients using the Fugl-Meyer assessment scale. Scand. J. Rehabil. Med. 28:3–7. 1996.
9. BEDARD, M., N.J. MARTIN, P. KRUEGER, AND K. BRAZIL. Assessing reproducibility of data obtained with instruments based on continuous measurements. Exp. Aging Res. 26:353–365. 2000.
10. BLAND, J.M., AND D.G. ALTMAN. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1:307–310. 1986.
11. BROZEK, J., AND H. ALEXANDER. Components of variance and the consistency of repeated measurements. Res. Q. 18:152–166. 1947.
12. BRUTON, A., J.H. CONWAY, AND S.T. HOLGATE. Reliability: What is it and how is it measured? Physiotherapy 86:94–99. 2000.
13. CHARTER, R.A. Revisiting the standard error of measurement, estimate, and prediction and their application to test scores. Percept. Mot. Skills 82:1139–1144. 1996.
14. CHARTER, R.A. Effect of measurement error on tests of statistical significance. J. Clin. Exp. Neuropsychol. 19:458–462. 1997.
15. CHARTER, R.A., AND L.S. FELDT. Meaning of reliability in terms of correct and incorrect clinical decisions: The art of decision making is still alive. J. Clin. Exp. Neuropsychol. 23:530–537. 2001.
16. CHARTER, R.A., AND L.S. FELDT. The importance of reliability as it relates to true score CIs. Meas. Eval. Counseling Dev. 35:104–112. 2002.
17. CHINN, S. Repeatability and method comparison. Thorax 46:454–456. 1991.
18. CHINN, S., AND P.G. BURNEY. On measuring repeatability of data from self-administered questionnaires. Int. J. Epidemiol. 16:121–127. 1987.
19. DUDEK, F.J. The continuing misinterpretation of the standard error of measurement. Psychol. Bull. 86:335–337. 1979.
20. ELIASZIW, M., S.L. YOUNG, M.G. WOODBURY, AND K. FRYDAY-FIELD. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Phys. Ther. 74:777–788. 1994.
21. FELDT, L.S., AND M.E. MCKEE. Estimation of the reliability of skill tests. Res. Q. 29:279–293. 1958.
22. FLEISS, J.L. The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons, 1986.
23. GUYATT, G., S. WALTER, AND G. NORMAN. Measuring change over time: Assessing the usefulness of evaluative instruments. J. Chronic Dis. 40:171–178. 1987.
24. HARVILL, L.M. Standard error of measurement. Educ. Meas. Issues Pract. 10:33–41. 1991.
25. HEBERT, R., D.J. SPIEGELHALTER, AND C. BRAYNE. Setting the minimal metrically detectable change on disability rating scales. Arch. Phys. Med. Rehabil. 78:1305–1308. 1997.
26. HOPKINS, W.G. Measures of reliability in sports medicine and science. Sports Med. 30:375–381. 2000.
27. KEATING, J., AND T. MATYAS. Unreliable inferences from reliable measurements. Aust. Physiother. 44:5–10. 1998.
28. KEPPEL, G. Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.
29. KROLL, W. A note on the coefficient of intraclass correlation as an estimate of reliability. Res. Q. 33:313–316. 1962.
30. LAHEY, M.A., R.G. DOWNEY, AND F.E. SAAL. Intraclass correlations: There's more than meets the eye. Psychol. Bull. 93:586–595. 1983.

31. LIBA, M. A trend test as a preliminary to reliability estimation. Res. Q. 38:245–248. 1962.
32. LOONEY, M.A. When is the intraclass correlation coefficient misleading? Meas. Phys. Educ. Exerc. Sci. 4:73–78. 2000.
33. LUDBROOK, J. Statistical techniques for comparing measures and methods of measurement: A critical review. Clin. Exp. Pharmacol. Physiol. 29:527–536. 2002.
34. MCGRAW, K.O., AND S.P. WONG. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1:30–46. 1996.
35. MORROW, J.R., AND A.W. JACKSON. How "significant" is your reliability? Res. Q. Exerc. Sport 64:352–355. 1993.
36. NICHOLS, C.P. Choosing an intraclass correlation coefficient. Available at: www.spss.com/tech/stat/articles/whichicc.htm. Accessed 1998.
37. NITSCHKE, J.E., J.M. MCMEEKEN, H.C. BURRY, AND T.A. MATYAS. When is a change a genuine change? A clinically meaningful interpretation of grip strength measurements in healthy and disabled women. J. Hand Ther. 12:25–30. 1999.
38. NUNNALLY, J.C., AND I.H. BERNSTEIN. Psychometric Theory (3rd ed.). New York: McGraw-Hill, 1994.
39. OLDS, T. Five errors about error. J. Sci. Med. Sport 5:336–340. 2002.
40. PERKINS, D.O., R.J. WYATT, AND J.J. BARTKO. Penny-wise and pound-foolish: The impact of measurement error on sample size requirements in clinical trials. Biol. Psychiatry 47:762–766. 2000.
41. PORTNEY, L.G., AND M.P. WATKINS. Foundations of Clinical Research (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
42. ROEBROECK, M.E., J. HARLAAR, AND G.J. LANKHORST. The application of generalizability theory to reliability assessment: An illustration using isometric force measurements. Phys. Ther. 73:386–401. 1993.
43. ROUSSON, V., T. GASSER, AND B. SEIFERT. Assessing intrarater, interrater, and test-retest reliability of continuous measurements. Stat. Med. 21:3431–3446. 2002.
44. SAFRIT, M.J., ED. Reliability Theory. Washington, DC: American Alliance for Health, Physical Education, and Recreation, 1976.
45. SHROUT, P.E. Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7:301–317. 1998.
46. SHROUT, P.E., AND J.L. FLEISS. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86:420–428. 1979.
47. STRATFORD, P. Reliability: Consistency or differentiating between subjects? [Letter]. Phys. Ther. 69:299–300. 1989.
48. STRATFORD, P.W., AND C.H. GOLDSMITH. Use of standard error as a reliability index of interest: An applied example using elbow flexor strength data. Phys. Ther. 77:745–750. 1997.
49. STREINER, D.L., AND G.R. NORMAN. Measurement Scales: A Practical Guide to Their Development and Use (2nd ed.). Oxford: Oxford University Press, 1995. pp. 104–127.
50. THOMAS, J.R., AND J.K. NELSON. Research Methods in Physical Activity (2nd ed.). Champaign, IL: Human Kinetics, 1990. p. 352.
51. TRAUB, R.E., AND G.L. ROWLEY. Understanding reliability. Educ. Meas. Issues Pract. 10:37–45. 1991.
52. WALTER, S.D., M. ELIASZIW, AND A. DONNER. Sample size and optimal designs for reliability studies. Stat. Med. 17:101–110. 1998.

Acknowledgments

I am indebted to Lee Brown, Joel Cramer, Bryan Heiderscheit,

Terry Housh, and Bob Oppliger for their helpful comments on

drafts of the paper.

Address correspondence to Dr. Joseph P. Weir, joseph.weir@dmu.edu.
