Statistical Methods For Assessing
Measurement Error (Reliability) in
Variables Relevant to Sports Medicine
Greg Atkinson and Alan M. Nevill
Research Institute for Sport and Exercise Sciences, Liverpool John Moores University,
Liverpool, England
Contents
Abstract
1. Definition of Terms
1.1 Systematic Bias and Random Error
1.2 Heteroscedasticity and Homoscedasticity
2. Can a Measurement Tool be Significantly Reliable?
3. Analytical Goals
4. Statistical Methods for Assessing Reliability in Sports Medicine
4.1 Paired t-Test for Detection of Systematic Bias
4.2 Analysis of Variance for Detection of Systematic Bias
4.3 Pearson's Correlation Coefficient
5. Correlation and Relative Reliability
5.1 Implications of Poor Interpretation of Test-Retest Correlations
6. Intraclass Correlation
7. Other Methods Based on Correlation
7.1 Regression Analysis
8. Statistical Measures of Absolute Reliability
8.1 Standard Error of Measurement
8.2 Coefficient of Variation
8.3 Bland and Altman's 95% Limits of Agreement
9. Limits of Agreement and Analytical Goals
10. Discussion
Abstract

Minimal measurement error (reliability) during the collection of interval- and
ratio-type data is critically important to sports medicine research. The main com-
ponents of measurement error are systematic bias (e.g. general learning or fatigue
effects on the tests) and random error due to biological or mechanical variation.
Both error components should be meaningfully quantified for the sports physician
to relate the described error to judgements regarding ‘analytical goals’ (the re-
quirements of the measurement tool for effective practical use) rather than the
statistical significance of any reliability indicators.
Methods based on correlation coefficients and regression provide an indica-
tion of ‘relative reliability’. Since these methods are highly influenced by the range
of measured values, researchers should be cautious in: (i) concluding accept-
able relative reliability even if a correlation is above 0.9; (ii) extrapolating the
results of a test-retest correlation to a new sample of individuals involved in an
experiment; and (iii) comparing test-retest correlations between different reliabil-
ity studies.
Methods used to describe ‘absolute reliability’ include the standard error of
measurements (SEM), coefficient of variation (CV) and limits of agreement
(LOA). These statistics are more appropriate for comparing reliability between
different measurement tools in different studies. They can be used in multiple
retest studies from ANOVA procedures, help predict the magnitude of a ‘real’
change in individual athletes and be employed to estimate statistical power for a
repeated-measures experiment.
These methods vary considerably in the way they are calculated and their use
also assumes the presence (CV) or absence (SEM) of heteroscedasticity. Most
methods of calculating SEM and CV represent approximately 68% of the error
that is actually present in the repeated measurements for the ‘average’ individual
in the sample. LOA represent the test-retest differences for 95% of a population.
The associated Bland-Altman plot shows the measurement error schematically
and helps to identify the presence of heteroscedasticity. If there is evidence of
heteroscedasticity or non-normality, one should logarithmically transform the
data and quote the bias and random error as ratios. This allows simple compari-
sons of reliability across different measurement tools.
It is recommended that sports clinicians and researchers should cite and inter-
pret a number of statistical methods for assessing reliability. We encourage the
inclusion of the LOA method, especially the exploration of heteroscedasticity that
is inherent in this analysis. We also stress the importance of relating the results
of any reliability statistic to ‘analytical goals’ in sports medicine.
It is extremely important to ensure that the
measurements made as part of research or athlete
support work in sports medicine are adequately re-
liable and valid. The sports medic's dependence on
adequate measurements was recently mentioned in
reviews on the sports medicine subdisciplines of
biomechanics, physiology and psychology re-
search.[1-3] This multidisciplinary nature of sports
medicine means that a variety of different types of
data are collected by researchers. Nevertheless, the
most common measurements in sports medicine
are continuous and on an interval or ratio scale. For
example, body temperature measured in degrees
Celsius or whole body flexibility measured in cen-
timetres above or below the position of the feet
when elevated above ground are not theoretically
bounded by zero and are therefore considered to be
interval data.[4] On the other hand, it is impossible
to obtain values of muscle strength or body mass,
for example, that are lower than zero. Such vari-
ables would be measured on a ratio scale.[5] Both
types of data are considered continuous, since the
values may not merely be whole numbers, but
can be expressed to any number of decimal places
depending on the accuracy of the measurement
tool.[6]
Mainstream clinical tools may hold sufficient
reliability to detect the often large differences in
interval or ratio measurements that exist between
healthy and diseased patients. Formulae are now
available for clinicians to calculate, from continu-
ous measurements, the probability of 'discordant
classification' amongst patients.[7]
Nevertheless, laboratory measures of human per-
formance may need to be sensitive enough to dis-
tinguish between the smaller differences that exist
between elite and subelite athletes (the ability to
detect changes in performance, which may be very
small, but still meaningful to athletic performance).
For sports medicine support work, it is desirable
that a measurement tool be reliable enough to be
used on individual athletes. For example, a clini-
cian may need to know whether an improvement
in strength following an injury-rehabilitation pro-
gramme is real or merely due to measurement er-
ror. Researchers in sports medicine may need to
know the influence of measurement error on statis-
tical power and sample size estimation for experi-
ments. A full discussion of this latter issue is be-
yond the scope of this review but interested readers
are directed towards Bates et al.[8] and Dufek et
al.[9] who recently outlined the importance of data
reliability on statistical power (the ability to detect
real differences between conditions or groups).
The issue of which statistical test to employ for
the quantification of ‘good’ measurement has been
recently raised in the newsletter of the British As-
sociation of Sport and Exercise Sciences[10] and in
an Editorial of the Journal of Sports Sciences,[11] as
well as in other sources related to subdisciplines of
sport and exercise science.[12-15] Atkinson[10] and
Nevill[11] promoted the use of ‘95% limits of agree-
ment’[16] to supplement any analyses that are per-
formed in measurement studies. This generated
much discussion between sports scientists through
personal communication with respect to the choice
of statistics for assessing the adequacy of measure-
ments. This review is an attempt to communicate
these discussions formally.
1. Definition of Terms
Studies concerning measurement issues cover
all the sports medicine subdisciplines. The most
common topics involve the assessment of the reli-
ability and validity of a particular measurement
tool. Validity is, generally, the ability of the meas-
urement tool to reflect what it is designed to mea-
sure. This concept is not covered in great detail in
the present review (apart from a special type of
validity called ‘method comparison’, which is
mentioned in the discussion) mainly because of the
different interpretations and methods of assessing
validity amongst researchers. Detailed discussions
of validity issues can be found in the book edited
by Safrit and Wood.[17]
Reliability can be defined as the consistency of
measurements, or of an individual’s performance,
on a test; or ‘the absence of measurement error’.[17]
Realistically, some amount of error is always pres-
ent with continuous measurements. Therefore, re-
liability could be considered as the amount of
measurement error that has been deemed accept-
able for the effective practical use of a measure-
ment tool. Logically, it is reliability that should be
tested for first in a new measurement tool, since
it will never be valid if it is not adequately consis-
tent in whatever value it indicates from repeated
measurements. Terms that have been used inter-
changeably with ‘reliability’, in the literature, are
‘repeatability’, ‘reproducibility’, ‘consistency’,
‘agreement’, ‘concordance’ and ‘stability’.
Baumgartner[18] identified 2 types of reliability:
relative and absolute. Relative reliability is the de-
gree to which individuals maintain their position
in a sample with repeated measurements. This type
of reliability is usually assessed with some type of
correlation coefficient. Absolute reliability is the
degree to which repeated measurements vary for
individuals. This type of reliability is expressed
either in the actual units of measurement or as a
proportion of the measured values (dimensionless
ratio).
Baumgartner[18] also defined reliability in terms
of the source of measurement error. For example,
internal consistency reliability is the variability be-
tween repeated trials within a day. Researchers
should be careful in the interpretation of this type
of reliability, since the results might be influenced
by systematic bias due to circadian variation in per-
formance.[19] Stability reliability was defined as
the day-to-day variability in measurements. This is
the most common type of reliability analysis, al-
though it is stressed that exercise performance tests
may need more than one day between repeated
measurements to allow for bias due to inadequate
recovery. Objectivity is the degree to which differ-
ent observers agree on the measurements and is
sometimes referred to as rater reliability.[20] This
type of reliability assessment is relevant to meas-
urements that might be administered by different
clinicians over time.
These different definitions of reliability have lit-
tle impact on the present review, since they have all
been analysed with similar statistical methods in
the sports medicine literature. Nevertheless, a re-
searcher may be interested in examining the rela-
tive influence of these different types of reliability
within the same study. Generalisability theory (the
partitioning of measurement error due to different
sources) is appropriate for this type of analysis.
This review considers one of the base statistics [the
standard error of measurement (SEM)] for meas-
urement error that happens to be used in general-
isability theory, but does not cover the actual con-
cept itself. Interested readers are directed to
Morrow[21] for a fuller discussion and Roebroeck
et al.[22] for an example of the use of the theory in
a sports medicine application.
1.1 Systematic Bias and Random Error
Irrespective of the type of reliability that is as-
sessed (internal consistency, stability, objectivity),
there are 2 components of variability associated
with each assessment of measurement error. These
are systematic bias and random error. The sum total
of these components of variation is known as total
error.[23]
Systematic bias refers to a general trend for
measurements to be different in a particular direc-
tion (either positive or negative) between repeated
tests. There might be a trend for a retest to be higher
than a prior test due to a learning effect being pres-
ent. For example, Coldwells et al.[24] found a bias
due to learning effects for the measurement of back
strength using a portable dynamometer. Bias may
also be due to there being insufficient recovery be-
tween tests. In this case, a retest would show a
‘worse’ score than a prior test. It may be that, after
a large number of repeated tests, systematic bias
due to training effects (if the test is physically chal-
lenging) or transient increases in motivation be-
comes apparent. For example, Hickey et al.[25] found
that the final test in a series of 16km cycling time
trials was significantly better than the 3
prior tests measured on a week-to-week basis. Such
bias should be investigated if the test is to be admin-
istered many times as part of an experiment and
should be controlled by attempting to maximise
motivation on all the tests with individuals who are
already well trained.
The other component of variability between re-
peated tests is the degree of random error. Large
amounts of random differences could arise due to
inherent biological or mechanical variation, or in-
consistencies in the measurement protocol, e.g. not
controlling posture in a consistent way during
measurements of muscle strength.[24] Whilst such
obvious sources of error as protocol variation can
be controlled, the random error component is still
usually larger than that due to bias. Unfortunately,
the researcher can do relatively little to reduce ran-
dom error once the measurement tool has been pur-
chased, especially if it is due wholly to inherent
mechanical (instrument) variation. An important
issue here, therefore, is that the researcher could
compare magnitudes of random error between dif-
ferent pieces of equipment that measure the same
variable so that the ‘best’ measurement tool is pur-
chased. This denotes that, whatever the choice of a
statistic of measurement error, researchers investi-
gating the reliability of a measurement tool should
also be consistent in this choice (or provide a num-
ber of statistical analyses for global comparison
amongst future researchers).
1.2 Heteroscedasticity and
Homoscedasticity
One issue that is rarely mentioned in sport and
exercise reliability studies is how the measurement
error relates to the magnitude of the measured vari-
able. When the amount of random error increases
as the measured values increase, the data are said
to be heteroscedastic. Heteroscedastic data can also
show departures from a normal distribution (i.e.
positive skewness).[6] When there is no relation be-
tween the error and the size of the measured value,
the data are described as homoscedastic. Such char-
acteristics of the data influence how the described
error is eventually expressed and analysed.[26,27]
Homoscedastic errors can be expressed in the ac-
tual units of measurement, but heteroscedastic er-
rors should be expressed as ratios (although these
can be converted back into the units of measure-
ment by multiplying and dividing a particular mea-
sured value by the error ratio). With homoscedastic
errors, providing they are also normally distrib-
uted, the raw data can be analysed with conven-
tional parametric analyses, but heteroscedastic
data should be transformed logarithmically before
analysis or investigated with an analysis based on
ranks.
There could be practical research implications
of the presence of heteroscedastic errors in meas-
urements. Heteroscedasticity means that the indi-
viduals who score the highest values on a particular
test also show the greatest amount of measurement
error (in the units of measurement). It is also likely
that these high-scoring individuals show the small-
est changes (in the units of measurement) in re-
sponse to a certain experimental intervention.[28]
Therefore, in line with the discussions on measure-
ment error and statistical power referred to in the
introduction, it may be that the detection of small
but meaningful changes in sports medicine–related
variables measured on a ratio scale is particularly
difficult with individuals who score highly on
those particular variables.
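As a brief illustration of how this might be explored in practice, the sketch below (our own, with invented test-retest values) correlates the absolute test-retest differences with the pair means on both the raw and the logarithmic scale; a positive correlation on the raw scale that shrinks after log transformation is the classic heteroscedastic pattern:

```python
# A minimal sketch, not from the original paper: screening for
# heteroscedasticity by relating error size to measurement magnitude.
import numpy as np
from scipy import stats

# Invented test-retest data on a ratio scale (units arbitrary)
test1 = np.array([31.0, 42, 55, 63, 78, 90, 110, 135, 160, 200])
test2 = np.array([33.0, 40, 59, 60, 85, 84, 118, 125, 175, 215])

means = (test1 + test2) / 2
abs_diff = np.abs(test1 - test2)
r_raw, p_raw = stats.pearsonr(means, abs_diff)
print(f"raw scale: r = {r_raw:.2f} (p = {p_raw:.3f})")

# After log transformation the errors become ratios, so the
# magnitude-error relation should largely disappear.
log_means = (np.log(test1) + np.log(test2)) / 2
log_diff = np.abs(np.log(test1) - np.log(test2))
r_log, p_log = stats.pearsonr(log_means, log_diff)
print(f"log scale: r = {r_log:.2f} (p = {p_log:.3f})")
```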
2. Can a Measurement Tool be
Significantly Reliable?
The statistical philosophy for assessing agree-
ment between measurements can be considered to
be different from that surrounding the testing of
research hypotheses.[29,30] Indeed, the identifica-
tion of, and adherence to, a single statistical method
(or the citation of several different methods in a
paper on reliability) could be considered as more
important for measurement issues than it is for test-
ing hypotheses. There are several different statisti-
cal methods that can help examine a particular
hypothesis. For example, in a multifactorial exper-
iment involving comparisons of changes over time
between different treatments, one may employ
analysis of summary statistics[31] or multifactorial
analysis of variance (ANOVA) models[32] to test
hypotheses. The consideration of measurement er-
ror is a different concept, since one may not neces-
sarily be concerned with hypothesis testing, but the
correct, meaningful and consistent quantification
of variability between different methods or re-
peated tests. Coupled with this, the researcher would
need to arrive at the final decision as to whether a
measurement tool is reliable or not (whether the
measurement error is acceptable for practical use).
3. Analytical Goals
The above concept for reliability assessment ba-
sically entails the researcher relating measurement
error to ‘analytical goals’ rather than the signifi-
cance of hypothesis tests. The consideration of
analytical goals is routine in laboratory medi-
cine[33,34] but seems to have been neglected in sport
and exercise science.
One way of arriving at the acceptance of a cer-
tain degree of measurement error (attaining an an-
alytical goal), as already mentioned, is estimating
the implications of the measurement error on sam-
ple size estimation for experiments or on individ-
uals’ differences/changes. The present authors
were able to locate only 3 published reliability
studies relevant to sports science/medicine which
have calculated the influence of the described
measurement error on sample size estimation for
future research.[35-37] Hopkins[38] provides meth-
ods, based on test-retest correlations, in which re-
searchers might perform this extrapolation of
measurement error to sample size estimation. Sam-
ple size can also be estimated from absolute reli-
ability statistics such as the standard deviation
(SD) of test-retest differences.[6,39]
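As a hedged sketch of this latter calculation (the function name and example figures are ours; a two-sided alpha of 0.05 and 80% power are assumed, in the spirit of the normal-approximation approach described by Bland[39]):

```python
# A minimal sketch: sample size for a paired comparison from the SD of
# test-retest differences (normal approximation; our illustration).
from scipy.stats import norm

def paired_sample_size(sd_diff, delta, alpha=0.05, power=0.80):
    """Participants needed to detect a mean change 'delta' when the
    SD of test-retest differences is 'sd_diff' (hypothetical helper)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)          # 0.84 for 80% power
    return (z_alpha + z_beta) ** 2 * (sd_diff / delta) ** 2

# e.g. an SD of differences of 6.6 ml/kg/min (as in table II), aiming to
# detect a 5 ml/kg/min change: about 14 participants.
print(round(paired_sample_size(6.6, 5.0)))
```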
Researchers in sport science have, at least,
recognised that an analytical goal might not neces-
sarily be the same as the acceptance of significance
on a hypothesis test.[29] In the present review, by
considering each reliability statistic in turn, we aim
to highlight how an ‘acceptable’ level of measure-
ment error might still be falsely accepted, when
statistical criteria that are still not based on any
well-defined analytical goals are employed (e.g.
correlations >0.9, sample mean coefficients of
variation <10%). Such criteria are in common use
in the sport and exercise sciences.
4. Statistical Methods for Assessing
Reliability in Sports Medicine
Many statistical tests have been proposed in the
sport science literature for the appraisal of meas-
urement issues. This is illustrated in table I which
cites the different methods used in the ‘measure-
ment’ studies presented at the 1996 conference of
the American College of Sports Medicine. It is
stressed that some of these studies were ‘method
comparison’ (validity) studies, although the major-
ity investigated reliability issues. It can be seen that
the most common methods involve the use of hy-
pothesis tests (paired t-tests, ANOVA) and/or cor-
relation coefficients (Pearson’s, intraclass correla-
tion). Other methods cited in the literature involve
regression analysis, coefficient of variation (CV)
or various methods that calculate ‘percentage vari-
ation’. A little-quoted method in studies relevant to
sport science is the ‘limits of agreement’ technique
outlined by Bland and Altman in 1983[16,41] and
refined in later years.[42-44] In the following sec-
tions of this review, each statistical method for as-
sessing reliability will be considered using, where
possible, real data relevant to sports science and
medicine.
4.1 Paired t-Test for Detection of
Systematic Bias
This test would be used to compare the means
of a test and retest i.e. it tests whether there is any
statistically significant bias between the tests. Al-
though this is useful, it should not of course be
employed on its own as an assessment of reliability,
since the t-statistic provides no indication of ran-
dom variation between tests. Altman[30] and Bland
and Altman[42] stressed caution in the interpretation
of a paired t-test to assess reliability, since the de-
tection of a significant difference is actually de-
pendent on the amount of random variation be-
tween tests.
Specifically, because of the nature of the for-
mula employed to calculate the t-value, significant
systematic bias will be less likely to be detected if
it is accompanied by large amounts of random error
between tests. For example, a paired t-test was per-
formed on the data presented in table II to assess
repeatability of the ‘Fitech’ step test for predicting
maximal oxygen uptake (V̇O2max). The mean sys-
tematic bias between week 1 and week 2 of 1.5
ml/kg/min was not statistically significant (t(29) =
1.22, p = 0.234), a finding that has been used on its
own by some researchers (table I) to conclude that
a tool has acceptable measurement error. However,
if one examines the data from individual partici-
pants, it can be seen that there are differences be-
tween the 2 weeks of up to ±16 ml/kg/min (partic-
ipant 23 recorded 61 ml/kg/min in the first test but
only 45 ml/kg/min in the retest).
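For readers who wish to reproduce this step, a minimal sketch using scipy is shown below; the arrays hold the 30 test-retest pairs from table II, and the call reproduces the quoted statistics:

```python
# A minimal sketch of the paired t-test step on the table II data;
# reproduces t = 1.22, p = 0.234.
import numpy as np
from scipy import stats

week1 = np.array([31, 33, 42, 40, 63, 28, 43, 44, 68, 47,
                  47, 40, 43, 47, 58, 61, 45, 43, 58, 40,
                  48, 42, 61, 48, 43, 50, 39, 52, 42, 77], dtype=float)
week2 = np.array([27, 35, 47, 44, 63, 31, 54, 54, 68, 58,
                  48, 43, 45, 52, 48, 61, 52, 44, 48, 44,
                  47, 52, 45, 43, 52, 52, 40, 58, 45, 67], dtype=float)

t, p = stats.ttest_rel(week2, week1)  # positive t = retest higher on average
print(f"t = {t:.2f}, p = {p:.3f}")    # bias non-significant despite
                                      # individual differences of up to 16
```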
The possible compromising effect of large
amounts of random error on the results of the paired
t-test is further illustrated by applying it to the hy-
pothetical data in table III. With these data, a test-
retest t-value of zero would be obtained (p = 0.99),
which could be interpreted as excellent reliability,
even though there are very large random differ-
ences in the individual cases. With the use of a t-test
Table I. The various statistical methods used in repeatability and
validity studies presented at the 43rd meeting of the American
College of Sports Medicine[40]a

Type of analysis: number of studies
Hypothesis test for bias (i.e. paired t-test, ANOVA): 16
Pearson's correlation coefficient (r): 17
ICC: 3
Hypothesis test and Pearson's correlation coefficient (r): 11
Hypothesis test and ICC: 9
CV: 4
Absolute error: 7
Regression: 3
Total: 70b

a Validity studies as well as reliability investigations were in-
cluded in this literature search. The critique of the statistical
analyses in the present review may not necessarily apply to
validity examination.
b 5.6% of the total number of studies presented, 1256.
ANOVA = analysis of variance; CV = coefficient of variation; ICC =
intraclass correlation.
per se, very unreliable (relatively high random er-
ror) measurements would be concluded as very re-
liable (relatively small bias)! It should be noted
that the correlation between test and retest may not,
in all data sets, be a good indicator of the amount
of absolute random error present, which is the basis
of the denominator in the paired t-test equation (see
the discussion of correlation methods below).
The use of a t-test may still be recommended in
a measurement study that investigates a simple test
and retest, since it will detect large systematic bias
(relative to the random error), and the terms in the
formula for the t-value can be used in the calcula-
tion of measures of random error (e.g. limits of
agreement). Nevertheless, the researcher may need
to supplement this analysis with the consideration
Table II. Test-retest data for the Fitech step test predicting maximal oxygen consumption.a The data have been ranked to show that a high
correlation may not necessarily mean that individuals maintain their positions in a sample following repeated measurements (adequate relative
reliability)

Individual / Test 1 (ml/kg/min) / Test 1 (ranks) / Test 2 (ml/kg/min) / Test 2 (ranks) / Difference (ml/kg/min) / Absolute difference in ranks
1 31 2.0 27 1.0 –4 1.0
2 33 3.0 35 3.0 +2 0
3 42 9.0 47 13.5 +5 4.5
4 40 6.0 44 8.0 +4 2
5 63 28.0 63 28.0 0 0
6 28 1.0 31 2.0 +3 2
7 43 12.5 54 23.5 +11 11
8 44 15.0 54 23.5 +10 8.5
9 68 29.0 68 30.0 0 1
10 47 18.0 58 25.5 +11 7.5
11 47 18.0 48 16.0 +1 2
12 40 6.0 43 5.5 +3 0.5
13 43 12.5 45 11.0 +2 1.5
14 47 18.0 52 20.0 +5 2
15 58 24.5 48 16.0 –10 8.5
16 61 26.5 61 27.0 0 0.5
17 45 16.0 52 20.0 +7 4
18 43 12.5 44 8.0 +1 4.5
19 58 24.5 48 16.0 –10 8.5
20 40 6.0 44 8.0 +4 2
21 48 20.5 47 13.5 –1 7
22 42 9.0 52 20.0 +10 11
23 61 26.5 45 11.0 –16 15.5
24 48 20.5 43 5.5 –5 15
25 43 12.5 52 20.0 +9 7.5
26 50 22.0 52 20.0 +2 2
27 39 4.0 40 4.0 +1 0
28 52 23.0 58 25.5 +6 2.5
29 42 9.0 45 11.0 +3 2
30 77 30.0 67 29.0 –10 1
Mean (SD) 47.4 (10.9) 48.9 (9.4) +1.5 (6.6)
a Data obtained in a laboratory practical at Liverpool John Moores University. t = 1.22 (p = 0.234); r = 0.80 (p < 0.001); ICC = 0.88; rc =
0.78; sample CV = 7.6%; limits of agreement = –1.5 ± 12.9 ml/kg/min (0.97 ×/÷ 1.29 as a ratio).
CV = coefficient of variation; ICC = intraclass correlation; r = Pearson’s product-moment correlation; rc = concordance correlation; SD =
standard deviation; t = test statistic from t-test.
of an analytical goal. For example, the bias of 1.5
ml/kg/min for the data in table II represents about
3% of the grand mean V̇O2max of the sample. This
seems small in relation to the amount of random
error in these data (see section 8). Besides, a good
experiment would be designed to control for any
such bias (i.e. control groups/conditions). Never-
theless, it could be that the bias (probably due to
familiarisation in this case) is reduced if more re-
tests are conducted and examined for reliability.
This denotes the use of ANOVA procedures.
4.2 Analysis of Variance for Detection of
Systematic Bias
ANOVA with repeated measures (preferably
with a correction for ‘sphericity’)[32] has been used
for comparing more than one retest with a test.[32,45]
With appropriate a priori or post hoc multiple com-
parisons (e.g. Tukey tests), it can be used to assess
systematic bias between tests. However, the sole
use of ANOVA is associated with exactly the same
drawback as the paired t-test in that the detection
of systematic bias is affected by large random (re-
sidual) variation. Again it should be noted that a
correlation coefficient (intraclass in the case of
ANOVA) may not be as sensitive an indicator of
this random error as an examination of the residual
mean squared error itself in the ANOVA results
table (the calculation of an F-value for differences
between tests in a repeated measures ANOVA in-
volves variance due to tests and residual error. The
variance due to individuals is involved in the cal-
culation of an intraclass correlation but is 'parti-
tioned out’ of a repeated measures ANOVA hypoth-
esis test; see section 8).
As with the t-test, ANOVA is useful for detect-
ing large systematic errors and the mean squared
error term from ANOVA can be used in the calcu-
lation of indicators of absolute reliability.[39,46] An
important point in the use of a hypothesis test to
assess agreement, whether it be either a paired t-
test or ANOVA, is that if significant (or large
enough to be important) systematic bias is de-
tected, a researcher would need to adapt the meas-
urement protocol to remove the learning or fatigue
effect on the test (e.g. include more familiarisation
trials or increase the time between repeated meas-
urements, respectively). It is preferable that the
method should then be reassessed for reliability.[18]
An intuitive researcher may suspect that a test
would show some bias because of familiarisation.
It follows, therefore, that a reliability study may be
best planned to have multiple retests. The re-
searcher would then not need to go ‘back to the
drawing board’ but merely examine when the bias
between tests is considered negligible. The number
of tests performed before this decision is made
would be suggested as familiarisation sessions to a
future researcher. This concept is discussed in
greater detail by Baumgartner.[18]
4.3 Pearson’s Correlation Coefficient
The Pearson’s correlation coefficient has been
the most common technique for assessing reliabil-
ity. The idea is that if a high (>0.8) and statistically
significant correlation coefficient is obtained, the
equipment is deemed to be sufficiently reliable.[47]
Baumgartner[18] pointed out that correlation meth-
ods actually indicate the degree of relative reliabil-
ity. This is definitely conceptually useful, since a
researcher could, in theory, tell how consistently
the measurement tool distinguishes between indi-
viduals in a particular population. However, Bland
Table III. Hypothetical data from a reliability study comparing a test
and a retest of spinal flexibility. The sole use of a t-test on these data
would provide a t-value = 0 (p = 0.99), which may lead some
researchers to conclude good reliability when large random varia-
tion is evident

Test 1 (degrees) / Test 2 (degrees) / Difference (degrees)
1 10 +9
10 1 –9
2 20 +18
20 2 –18
3 30 +27
30 3 –27
4 40 +36
40 4 –36
Mean (SD): 13.8 (14.7) 13.8 (14.7) 0 (26.3)
SD = standard deviation.
and Altman[42] and Sale[48] considered the use of
the correlation coefficient as being inappropriate,
since, among other criticisms, it cannot, on its own,
assess systematic bias and it depends greatly on the
range of values in the sample.[49] The latter note of
caution in the use of test-retest correlation coeffi-
cients is the most important. For example, we have
already seen that there is substantial random vari-
ation among the individual data in table II, but if
correlation was used to examine this, it would be
concluded that the test has good repeatability (test-
retest correlation of r = 0.80, p < 0.001). Note that
the sample in table II is very varied in maximal
oxygen consumption (28 to 77 ml/kg/min).
In table IV, the same data from table II have been
manipulated to decrease the interindividual varia-
tion while retaining exactly the same level of
absolute reliability [indicated by the differences
column and the standard deviation (SD) of these
differences]. When Pearson’s r is calculated for
these data, it drops to a nonsignificant 0.27 (p >
0.05). This phenomenon suggests that researchers
should be extremely cautious in the 2 common pro-
cedures of: (i) extrapolating test-retest correla-
tions, which have been deemed acceptable to a new
and possibly more homogeneous sample of indi-
viduals (e.g. elite athletes); and (ii) comparing test-
retest r-values between different reliability studies
(e.g. Perrin[50]). To overcome these difficulties,
there are methods for correcting the correlation co-
efficient for interindividual variability.[51] Concep-
tually, this correction procedure would be similar
to the use of an indicator of absolute reliability;
these statistics are relatively unaffected by popula-
tion heterogeneity (see section 8).
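A small simulation (ours, not from the cited studies) makes the range dependence concrete: the same random error is added to a heterogeneous and a homogeneous sample of 'true' scores, and Pearson's r differs markedly:

```python
# Our illustration: identical random error yields a high r in a
# heterogeneous sample and a low r in a homogeneous one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
error = rng.normal(0, 5, size=30)      # identical measurement error

wide = rng.uniform(30, 80, size=30)    # heterogeneous 'true' scores
narrow = rng.uniform(45, 55, size=30)  # homogeneous 'true' scores

r_wide, _ = stats.pearsonr(wide, wide + error)
r_narrow, _ = stats.pearsonr(narrow, narrow + error)
print(f"heterogeneous sample: r = {r_wide:.2f}")    # close to 1
print(f"homogeneous sample:   r = {r_narrow:.2f}")  # much lower
```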
5. Correlation and Relative Reliability
Despite the above notes of caution when com-
paring correlation results, it could be argued that a
high correlation coefficient reflects adequate rela-
tive reliability for use of the measurement tool in
the particular population that has been investi-
gated. This seems sensible, since the more homo-
geneous a population is, the less the measurement
error would need to be in order to detect differences
Table IV. The same data as in table II but manipulated to give a
less heterogeneous sample (indicated by the test and retest sam-
ple standard deviations (SDs) being approximately half those in
table II). The data have exactly the same degree of agreement
(indicated by column of differences) between the test and retest as
the data presented in table IIa
Test 1 (ml/kg/min) Test 2 (ml/kg/min) Difference (ml/kg/min)
41 37 –4
43 45 2
42 47 5
40 44 4
43 43 0
48 51 3
43 54 11
44 54 10
48 48 0
47 58 11
47 48 1
40 43 3
43 45 2
47 52 5
58 48 –10
41 41 0
45 52 7
43 44 1
58 48 –10
40 44 4
48 47 –1
42 52 10
61 45 –16
48 43 –5
43 52 9
50 52 2
39 40 1
52 58 6
42 45 3
57 47 –10
Mean (SD)
46.1 (5.9) 47.6 (5.1) 1.5 (6.6)
a t = 1.22 (p = 0.234), r = 0.27 (p > 0.05), ICC = 0.43, rc = 0.28,
sample CV = 7.6%, limits of agreement = –1.5 ± 12.9
ml/kg/min (0.97 ×/÷ 1.29 as a ratio). Note that the results of
the correlation methods are very different from those calcu-
lated on the data from table II.
CV = coefficient of variation; ICC = intraclass correlation; r =
Pearson’s product-moment correlation; rc = concordance correla-
tion; t = test statistic from t-test.
between individuals within that population. Using
our examples, the correlation coefficients suggest
that relative reliability is worse for the data in table
IV than those in table II, since the former data is
more homogeneous and, therefore, it is more diffi-
cult to detect differences between individuals for
that given degree of absolute measurement error.
The use of correlation to assess this population-
specific relative reliability is quite informative but,
unfortunately, the ability of a high correlation co-
efficient to reflect an adequate consistency of
group positions in any one sample is also question-
able with certain data sets. For example, a re-
searcher may have the ‘analytical goal’ that the
V̇O2max test (table II) can be used as a performance
test to consistently rank athletes in a group. The
researcher may follow convention and deem that
this analytical goal has been accomplished, since a
highly significant test-retest correlation of 0.80
(p < 0.001) was obtained (in fact, it would be ex-
tremely difficult not to obtain a significant correla-
tion in a reliability study with the sort of sample
that is commonly used in studies on measurement
issues: males and females, individuals of varying
age with a wide range of performance abilities).
If one now examines, in table II, the actual rank-
ings of the sample based on the 2 tests using the
measurement tool, it can be seen that only 3 indi-
viduals maintained their positions in the group fol-
lowing the retest. Although the maintenance of the
exact same rank of individuals in a sample may be
a rather strict analytical goal for a measurement
tool in sports medicine (although this has not been
investigated), it should be noted that 4 individuals
in this highly correlated data-set actually moved
more than 10 positions following the retest com-
pared with the original test. In this respect, a
correlation coefficient based on ranks (e.g. Spear-
man’s) may be more informative for the quantifi-
cation and judgement of ‘relative reliability’. This
would have the added benefit of making no as-
sumptions on the shape of the data distribution and
being less affected by outliers in the data.[52]
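As an illustrative sketch, Spearman's rank correlation can be computed alongside Pearson's r for the table II data with scipy:

```python
# A minimal sketch: rank-based (Spearman) correlation alongside
# Pearson's r for the table II test-retest data.
from scipy import stats

test1 = [31, 33, 42, 40, 63, 28, 43, 44, 68, 47,
         47, 40, 43, 47, 58, 61, 45, 43, 58, 40,
         48, 42, 61, 48, 43, 50, 39, 52, 42, 77]
test2 = [27, 35, 47, 44, 63, 31, 54, 54, 68, 58,
         48, 43, 45, 52, 48, 61, 52, 44, 48, 44,
         47, 52, 45, 43, 52, 52, 40, 58, 45, 67]

rho, p_rho = stats.spearmanr(test1, test2)
r, p_r = stats.pearsonr(test1, test2)
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
print(f"Pearson r    = {r:.2f} (p = {p_r:.3f})")
```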
A rank correlation or a correlation on the logged
test-retest data is rarely used in reliability studies.
This is surprising given the high likelihood that
heteroscedasticity is present in data recorded on the
ratio scale.[53] The presence of such a characteristic
in the described error would mean that a conven-
tional correlation analysis on the raw data is not
really appropriate.[26] Taking this point further, a
reliability study employing both conventional cor-
relation analysis on the raw, untransformed data
(which assumes no evidence of heteroscedasticity)
and the CV statistic (which does assume hetero-
scedasticity is present) can be criticised somewhat
for mixing statistical ‘apples and oranges’.
5.1 Implications of Poor Interpretation of
Test-Retest Correlations
The above disparity between the results of cor-
relation analysis and the perceived reliability may
mean that there may be measurement tools in sports
medicine that have been concluded as reliable on
the basis of the correlation coefficient, but they will
not, in practical use, realise certain analytical
goals. For example, the majority of tools and pro-
tocols for the measurement of isokinetic muscle
strength have been tested for reliability with corre-
lation methods applied to heterogeneous data.
Most of these correlations are above 0.8.[50] Only
recently, with the emergence of more appropriate
analysis techniques, is it emerging that the repeat-
ability of these measurements is relatively poor at
faster isokinetic speeds.[54] Nevill and Atkinson[53]
examined the reliability of 23 common measure-
ment tools in sport and exercise science research.
The use of an absolute measure of reliability (ratio
limits of agreement) showed that there were con-
siderable differences in reliability between meas-
urement tools.
There are several other pieces of evidence which
support the lack of sensitivity of correlation for as-
sessing even relative reliability; Bailey et al.[55] and
Sarmandal et al.[56] assessed the reliability of sev-
eral clinical measures. Test-retest correlations
ranged from 0.89 to 0.98, but when a measure of
absolute reliability (limits of agreement) was re-
lated to the interindividual variation, the usefulness
of the measurement tools was questionable. Atkin-
son et al.[57] examined if the measurement error of
several performance tests was influenced by the
time of day that the measurements were obtained.
Test-retest correlations were consistently very high
at all times of day. Only when an absolute indicator
of reliability was examined did it become apparent
that random measurement error seemed to be
higher when data was collected at night. Otten-
bacher and Tomchek[15] also showed that the cor-
relation coefficient is not sensitive enough to detect
inadequate method comparison based on inter-
individual differences in a sample. It was found in
a data simulation study that a between-method cor-
relation only dropped from 0.99 to 0.98, even
though absolute reliability was altered to a degree
whereby it would affect the drawing of conclusions
from the measurements. The statistical implica-
tions of this study would apply equally to the as-
sessment of measurement error and relative reli-
ability.
It is clear that the concept of ‘relative reliability’
is useful and correlation analysis does provide
some indication of this. Interestingly, in clinical
chemistry, a statistical criterion for ‘relative reli-
ability’ is not a high correlation coefficient, but the
related measure of absolute reliability expressed as
a certain proportion of the interindividual vari-
ance.[33,34] Bailey et al.[55] and Sarmandal et al.[56]
adopted a similar approach when they related the
limits of agreement between 2 observers to popu-
lation percentile (or qualitative categories) charts.
Taking this stance, the ultimate analytical goal for
relative reliability would be that the measurement
error is smaller than the differences between indi-
viduals or between analytical goal-related popula-
tion centiles. It is recommended that statisticians
working in sport and exercise sciences tackle the
problem of defining an acceptable degree of rela-
tive reliability for practical use of a measurement
tool together with an investigation of the statistic
that is most sensitive for the assessment of relative
reliability. We suggest the employment of analyti-
cal simulations applied to reliability data sets[15] in
order to realise these aims.
It is possible to relate test-retest correlations to
analytical goals regarding adequate sample sizes
for experiments.[38,58] Interestingly, for the estima-
tion of sample sizes in repeated-measures experi-
ments, the correlation would be converted, mathe-
matically, to an absolute reliability statistic. Bland[39]
showed how the SD of the differences or residual
error (measures of absolute reliability) could be
obtained from a test-retest correlation coefficient
to estimate sample size. It is residual error, not the
correlation coefficient, that is the denominator in
‘repeated measures’ hypothesis tests and is there-
fore used in this type of statistical power estima-
tion.
6. Intraclass Correlation
Intraclass correlation (ICC) methods have be-
come a popular choice of statistics in reliability
studies, not least because they are the advised
methods in the 2 textbooks on research methodol-
ogy in sports science.[32,45] The most common
methods of ICC are based on the terms used in the
calculation of the F-value from repeated measures
ANOVA.[18] The main advantages of this statistic
over Pearson’s correlation are maintained to be that
the ICC is univariate rather than bivariate and it can
be used when more than one retest is being com-
pared with a test.[18] The ICC can also be calculated
in such a way that it is sensitive to the presence of
systematic bias in the data (there is an argument,
discussed in section 7, against the sole citation of
such an indicator of ‘total error’ which combines
both bias and random variation into a single coef-
ficient). In fact, there are at least 6 ways of calcu-
lating an ICC, all giving different results.[15,59]
Eliasziw et al.[60] discussed the choice of an appro-
priate ICC. The most important implication of this,
as Krebs[61] stressed, is that researchers must detail
exactly how this choice is made and how an ICC
has been calculated in a reliability study.
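As one hedged illustration, the consistency form often labelled ICC(3,1) can be assembled from two-way repeated-measures ANOVA mean squares as sketched below; the helper name is ours, and the other ICC forms combine the same mean squares differently, which is why the choice must be reported:

```python
# A sketch of one ICC variant, ICC(3,1), built from two-way
# repeated-measures ANOVA mean squares (our implementation).
import numpy as np

def icc_3_1(data):
    """data: 2-D array, rows = individuals, columns = trials."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_trial = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((data - grand) ** 2).sum() - ss_subj - ss_trial
    ms_subj = ss_subj / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

pairs = np.column_stack([[31, 33, 42, 40, 63, 28, 43, 44, 68, 47],
                         [27, 35, 47, 44, 63, 31, 54, 54, 68, 58]])
print(f"ICC(3,1) = {icc_3_1(pairs):.2f}")  # first 10 pairs of table II
```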
Whatever the type of ICC that is calculated, it
is suggested that, like Pearson’s r, an ICC close to
1 indicates ‘excellent’ reliability. Various catego-
ries of agreement based on the ICC, ranging from
‘questionable’ (0.7 to 0.8) to ‘high’ (>0.9), are
provided by Vincent.[32] The present authors were
unable to locate any reference in the sport and ex-
ercise science literature relating these ICC ‘cut-off’
points to any analytical goals for research. A more
informative approach would be to calculate confi-
dence intervals for a given ICC as detailed by Mor-
row and Jackson.[29]
The calculated ICC[45] of 0.88 for the data in
table II would suggest ‘good’ reliability of the
measurements. This is misleading, given that it has
already been seen that there are quite large test-
retest differences in some individuals and the rela-
tive reliability, by examining the stability of the
sample ranks, might not be sufficient for some an-
alytical goals. When the ICC is calculated on the
less heterogeneous data in table IV (same degree
of agreement as in data from table II), it drops to a
very poor 0.43. Therefore, it is apparent that the
ICC is prone to exactly the same constraints as
Pearson’s r, in that it includes the variance term for
individuals and is therefore affected by sample het-
erogeneity to such a degree that a high correlation
may still mean unacceptable measurement error for
some analytical goals.[62,63]
Myrer et al.[64] highlighted with a practical ex-
ample the difficulties in interpreting ICCs. Otten-
bacher and Tomchek[15] showed in data simula-
tions that an ICC never dropped below 0.94. This
occurred despite marked changes in the absolute
agreement between 2 methods of measurement and
whilst the sampling characteristics were control-
led. Quan and Shih[65] maintained that the ICC
should really only be employed when a fixed pop-
ulation of individuals can be well defined. We sup-
port the citation of the ICC in any reliability study
but believe it should not be employed as the sole
statistic and more work is needed to define accept-
able ICCs based on the realisation of definite ana-
lytical goals.
7. Other Methods Based on Correlation
In an effort to rectify a perceived problem with
Pearson’s correlation (that it is not sensitive to dis-
agreement between methods/tests due to system-
atic bias), Lin[66] introduced the ‘concordance cor-
relation coefficient’ (rc), which is the correlation
between the 2 readings that fall on the 45 degree
line through the origin (the line of identity on a
scatterplot). Nickerson[67] maintained that this sta-
tistic is exactly the same as one type of ICC that is
already used by researchers. First, this method is
again sensitive to sample heterogeneity.[68] The rc
for the heterogeneous data in table II is 0.78 com-
pared with 0.28 for the less heterogeneous (but
same level of agreement) data in table IV. Second,
although it may seem convenient to have a single
measure of agreement (one that is sensitive to both
bias and random error), it may be inconvenient in
practical terms when this ‘total error’ is cited on its
own, so the reader of the reliability study is left
wondering whether the measurement protocol
needs adapting to correct for bias or is associated
with high amounts of random variation.[68] This
possibility of ‘over-generalising’ the error, which
may constrain the practical solutions to this error,
also applies to both the type of ICC which includes
the between trials mean-squared-error term as well
as the mean squared residual term in its calcula-
tion[32,45] and the limits of agreement method if
bias and random error are not cited separately (see
section 8.3).
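A minimal sketch of Lin's rc (our implementation of the published formula, using population moments) is given below:

```python
# Lin's concordance correlation coefficient (rc): penalises systematic
# bias as well as random scatter about the line of identity.
import numpy as np

def concordance_cc(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    s_xy = np.cov(x, y, ddof=0)[0, 1]  # population covariance
    return 2 * s_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

test1 = [31, 33, 42, 40, 63, 28, 43, 44, 68, 47]
test2 = [27, 35, 47, 44, 63, 31, 54, 54, 68, 58]
print(f"rc = {concordance_cc(test1, test2):.2f}")
```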
7.1 Regression Analysis
This is another common method of analysis in
agreement studies but, like hypothesis tests and
correlation methods, it may be misleading in some
reliability assessments.[42,66] Conceptually, one is
not dealing with a predictor and a response vari-
able, which is the philosophy behind regression. In
addition, sample heterogeneity is, again, a possible
problem for extrapolation of the reliability analy-
sis; the R2 and regression analysis for the data in
table II are 0.64 and F = 49.01 (p < 0.0001), respec-
tively, thus, indicating ‘good’ reliability. For the
more homogeneous but equally agreeable data (in
terms of absolute reliability) in table IV, the R2 and
regression analysis are 0.08 and F = 2.54 (p > 0.10),
respectively, indicating very poor reliability.
For systematic bias, the null hypothesis that the
intercept of the regression line equals zero would
be tested. As with the t-test, a wide scatter of indi-
vidual differences may lead to a false acceptance
of this hypothesis (the conclusion that bias is not
significant, even though it may be large enough to
be important).
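A hedged sketch of this intercept test, using scipy's linregress (the intercept_stderr attribute assumes scipy 1.6 or later; the data are the first 10 pairs from table II), follows:

```python
# Testing whether the test-retest regression intercept differs from zero.
import numpy as np
from scipy import stats

test1 = np.array([31, 33, 42, 40, 63, 28, 43, 44, 68, 47], dtype=float)
test2 = np.array([27, 35, 47, 44, 63, 31, 54, 54, 68, 58], dtype=float)

res = stats.linregress(test1, test2)
t_int = res.intercept / res.intercept_stderr
p_int = 2 * stats.t.sf(abs(t_int), len(test1) - 2)
print(f"intercept = {res.intercept:.1f}, t = {t_int:.2f}, p = {p_int:.3f}")
# As with the paired t-test, wide scatter in individual differences
# inflates the intercept's standard error and can mask real bias.
```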
8. Statistical Measures of
Absolute Reliability
The most common methods of analysing abso-
lute reliability are the SEM and the CV. A little-
used statistic in sport and exercise sciences, which
could be considered to measure absolute reliabil-
ity, is the limits of agreement method. One aspect
that these statistics have in common is that they are
unaffected by the range of measurements. There-
fore, they all theoretically provide an indication of
the variability in repeated tests for specific individ-
uals, irrespective of where the individuals rank in
a particular sample. The general advantage of these
statistics over indicators of relative reliability is
that it is easier, both to extrapolate the results of
absolute reliability studies to new individuals and
to compare reliability between different measure-
ment tools. As discussed in sections 8.1 to 8.3,
these 3 statistics do seem to differ in the way abso-
lute reliability is expressed. They also make differ-
ent assumptions regarding the presence of hetero-
scedasticity (a positive relationship between the
degree of measurement error and the magnitude of
the measured value).
8.1 Standard Error of Measurement
One indicator of absolute reliability is the
‘standard error of measurement’.[45,60,69] The most
common way of calculating this statistic that is
cited in the sports science literature is by means of
the following equation:[18,45]
SEM = SD × √(1 − ICC)
where SEM = ‘standard error of measurement’, SD
= the sample standard deviation and ICC = the cal-
culated intraclass correlation coefficient. The use
of SD in the equation, in effect, partially ‘cancels
out’ the interindividual variation that was used to
in the calculation of the ICC. Nevertheless, the sta-
tistic (calculated this way) is still affected by sam-
ple heterogeneity (3.5 ml/kg/min for the data in
table II versus 2.8 ml/kg/min for the data with the
same SD of differences in table IV).
Stratford and Goldsmith[69] and Eliasziw et al.[60]
stated that SEM can be calculated from the square
root of the mean square error term in a repeated
measures ANOVA. This statistic would be totally
unaffected by the range of measured values. To add
to the confusion over the method of calculation,
Bland and Altman[43] called this statistic ‘the intra-
individual SD’. In addition to the differences in the
terminology, this latter calculation also seems to
give a slightly different result (4.7 ml/kg/min for
the data in table II and table IV) from that obtained
with the above equation for SEM based on the ICC.
The cause of this seems to lie in the type of ICC
that is employed (random error or random error +
bias). For the above calculations, we employed the
ICC without the bias error according to the meth-
ods of Thomas and Nelson.[45]
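The sketch below (our illustration, using the first 10 test-retest pairs from table II) computes both versions side by side: the equation-based SEM from the sample SD and the ICC, and the root mean square error from the ANOVA decomposition:

```python
# A minimal sketch of the two SEM calculations: SD * sqrt(1 - ICC)
# versus the square root of the ANOVA mean squared error.
import numpy as np

data = np.column_stack([[31, 33, 42, 40, 63, 28, 43, 44, 68, 47],
                        [27, 35, 47, 44, 63, 31, 54, 54, 68, 58]]).astype(float)
n, k = data.shape
grand = data.mean()
ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
ss_trial = n * ((data.mean(axis=0) - grand) ** 2).sum()
ss_err = ((data - grand) ** 2).sum() - ss_subj - ss_trial
ms_subj = ss_subj / (n - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

icc = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)
sem_icc = data.std(ddof=1) * np.sqrt(1 - icc)  # SEM = SD * sqrt(1 - ICC)
sem_anova = np.sqrt(ms_err)                    # the 'intra-individual SD'
print(f"SEM via ICC equation = {sem_icc:.1f} ml/kg/min")
print(f"SEM via ANOVA error  = {sem_anova:.1f} ml/kg/min")
```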
The statistic is expressed in the actual units of
measurement, which is useful since the smaller the
SEM the more reliable the measurements. The
SEM is also used as a ‘summary statistic’ in gen-
eralisability theory to investigate different sources
of variation in test scores.[22] Useful methods have
also been formulated to compare SEMs between
measurement tools.[69]
The question of ‘how does one know if a partic-
ular SEM statistic indicates adequate reliability?’
seems to be unanswered in the literature. Baumgart-
ner[18] showed how an SEM could be used to ascer-
tain whether the difference in measurements be-
tween 2 individuals is real or due to measurement
error. It was stated that ‘confidence bands’ based
on the SEM are formed around the individual
scores. If these bands do not overlap, it was main-
tained that the difference between the measure-
ments is real. However, researchers should be ex-
tremely cautious in following this advice, since the
SEM covers about 68% of the variability and not,
as Thomas and Nelson[45] discussed, 95%, which is
the conventional criterion used in confidence inter-
val comparisons. Eliasziw et al.[60] also discussed
the use of SEM to differentiate between real
changes and those due to measurement error and
suggested 1.96 × √2 × SEM which, interestingly, ap-
proximates to the limits of agreement statistic (see
section 8.3).
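As a worked illustration of this threshold (using the ANOVA-based SEM of 4.7 ml/kg/min quoted above for the table II data):

```python
# The 'real change' threshold 1.96 * sqrt(2) * SEM for an individual.
import math

sem = 4.7  # ml/kg/min, ANOVA-based SEM for the table II data
mdc95 = 1.96 * math.sqrt(2) * sem
print(f"threshold for a real individual change: {mdc95:.1f} ml/kg/min")
# about 13 ml/kg/min, close to the ±12.9 limits of agreement in table II
```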
Besides the lack of clarity over an acceptable
SEM, the use of this statistic is associated with sev-
eral assumptions. First, it is assumed that there is a
‘population’ of measurements for each individual
(the SEM actually approximates to the mean SD
for repeated measurements in individuals), and that
this population is normally distributed and that
there are no carry-over effects between the re-
peated tests. Payne[70] discussed these assumptions
in more detail. The use of the SEM also denotes
that heteroscedasticity is not present in the data, so
that it is appropriate only if the data are purely in-
terval in nature. Therefore, if for example an SEM
of 3.5 ml/kg/min is calculated, it is assumed that
this amount of absolute error is the same for indi-
viduals recording high values in the sample as
those scoring low values. Nevill and Atkinson[53]
have shown that this homoscedasticity is uncom-
mon in ratio variables relevant to sports medicine.
Practically, for researchers who are examining a
subsample of individuals who score highly on cer-
tain tests, the use of SEM may mislead them into
thinking that the measurement error is only a small
percentage of these scores (the measurement error
has been underestimated relative to the particular
sample that is examined). This denotes that, if
heteroscedasticity is present in data, the use of a
ratio statistic (e.g. CV) may be more useful to the
researchers.
8.2 Coefficient of Variation
The CV is common in biochemistry studies
where it is cited as a measure of the reliability of a
particular assay.[71] It is somewhat easier to per-
form multiple repeated tests in this field than it is
in studies on human performance. There are vari-
ous methods of calculating CV, but the simplest
way is with data from repeated measurements on a
single case, where the SD of the data is divided by
the mean and multiplied by 100.[48] An extension
of this on a sample of individuals is to calculate the
mean CV from individual CVs. The use of a dimen-
sionless statistic like the CV has great appeal, since
the reliability of different measurement tools can
be compared.[72] However, as discussed in detail by
Allison[73] and Yao and Sayre,[74] there may be cer-
tain limitations in the use of CV.
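A minimal sketch of these two basic calculations (assumed data, since no table values are reproduced here):

```python
import numpy as np

# Repeated measurements on a single case (assumed values).
repeats = np.array([50.1, 47.8, 52.3, 49.0, 51.2])
cv_single = repeats.std(ddof=1) / repeats.mean() * 100   # SD/mean x 100

# Extension to a sample: one CV per individual, then the mean CV.
sample = np.array([[50.1, 47.8],                # subjects x trials
                   [61.2, 58.9],
                   [38.7, 41.0]])
cv_each = sample.std(axis=1, ddof=1) / sample.mean(axis=1) * 100
print(f"single-case CV = {cv_single:.1f}%")
print(f"sample mean CV = {cv_each.mean():.1f}%")
```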
Researchers should be aware that the assump-
tion of normality for an assumed ‘population’ of
repeated tests applies to CV in the same way as
with the SEM. Detwiler et al.[75] discussed the dif-
ficulty of examining these assumptions for CV
with a small number of repeated measures. Unlike
SEM, CV methods apply to data in which the de-
gree of agreement between tests does depend on the
magnitude of the measured values. In other words,
the use of CV assumes that the largest test-retest
variation occurs in the individuals scoring the high-
est values on the test.[42] Although this charac-
teristic is probably very common with sports sci-
ence data on a ratio scale (see section 8.3),[53] it is
best if heteroscedasticity is actually explored and
quantified before assuming it is present. This ex-
ploration is not very common amongst sport sci-
ence researchers carrying out reliability studies.
Besides, there are reliability data sets which defi-
nitely should not be described by CV. For example,
a CV would be meaningless for data that can show
negative values (not bounded by zero), since the
use of the CV denotes that the measurement error
approximates zero for measured values that are
close to zero. This would not be so if zero values
were midway on a measurement scale (e.g. whole
body flexibility measures).
Another cautionary note on the use of CV cen-
tres around its practical meaning to researchers per-
forming experiments. Some scientists seem to have
chosen, quite arbitrarily, an analytical goal of the
CV being 10% or below.[76] This does not mean that
all variability between tests is always less than 10%
of the mean. A CV of 10% obtained on an individ-
ual actually means that, assuming the data are nor-
mally distributed, 68% of the differences between
tests lie within 10% of the mean of the data.[71]
Therefore, as with the SEM statistic, the variability
is not described for 32% of the individual differ-
ences. For example, if a test-retest CV of 10% was
obtained with a test of maximal oxygen consump-
tion and the grand sample mean of the 2 tests was
50 ml/kg/min, the CV of 10% might be considered
an indicator of acceptable agreement. Realistically,
there could be test-retest differences of greater than
10 ml/kg/min (20% of mean) in some individuals.
The criticism of CV that it is very rarely applied
to an analytical goal applies in particular to the
common situation in which means are calculated
from a sample of individual CVs. The true varia-
tion between tests may be underestimated for some
new individuals in this case. For example, the sam-
ple mean CV for the data in table II is 7.6%, which
could be used to indicate very good reliability. This
is unrealistic given that over a third of the sample
shows individual differences that can be calculated
to be greater than 13% of the respective means.
Sarmandal et al.[56] and Bailey et al.[55] also
showed with practical examples how mean CVs of
1.6 to 4% did not reflect adequate reliability for
some clinical measurements. It is probably more
informative if the sample SD of the repeated tests
is multiplied by 1.96 before being expressed as a
CV for each individual,[77] as this would cover 95%
of the repeated measurements. It is stressed, how-
ever, that if a sample mean CV is then calculated,
this may still not reflect the repeated test error for
all individuals, but only the ‘average individual’
(50% of the individuals in the sample). For this
reason, Quan and Shih[65] termed this statistic the
‘naive estimator’ of CV and suggested that it
should not be used. These researchers and oth-
ers[38,44] described more appropriate CV calcula-
tions based on the mean square error term (from
ANOVA) of logarithmically transformed data.
This is an important part of the last statistical
method that is to be discussed: the limits of agree-
ment technique.
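The contrast between the 'naive estimator' and an ANOVA-based CV from logged data can be sketched as follows (assumed data; the log-scale root mean square error is antilogged and expressed as a percentage deviation from 1):

```python
import numpy as np

# Assumed subjects x 2 trials, ratio-scale data.
scores = np.array([[45.0, 47.2], [52.1, 50.3], [38.4, 40.1],
                   [61.3, 63.8], [49.7, 48.9], [55.2, 57.0]])

# 'Naive estimator': the mean of the individual CVs.
cv_naive = (scores.std(axis=1, ddof=1) / scores.mean(axis=1)).mean() * 100

# CV from the ANOVA mean square error of the logged scores.
logs = np.log(scores)
n, k = logs.shape
gm = logs.mean()
ss_sub = k * ((logs.mean(axis=1) - gm) ** 2).sum()
ss_tri = n * ((logs.mean(axis=0) - gm) ** 2).sum()
mse = (((logs - gm) ** 2).sum() - ss_sub - ss_tri) / ((n - 1) * (k - 1))
cv_log = (np.exp(np.sqrt(mse)) - 1) * 100

print(f"naive mean CV = {cv_naive:.1f}%, log-ANOVA CV = {cv_log:.1f}%")
# the ANOVA-based value is typically slightly higher, as Quan and
# Shih observed
```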
8.3 Bland and Altman’s 95% Limits
of Agreement
Altman and Bland[41] recognised several of the
above limitations with these different forms of
analysis and introduced the method of ‘limits of
agreement’, an indicator of absolute reliability like
SEM and CV. The main difference between these
statistics seems to be that the limits of agreement
assume a population of individual test-retest differ-
ences. Chatburn[23] termed this type of statistic an
error interval. SEM and CV, as discussed above,
involve an assumed population of repeated meas-
urements around a ‘true value’ for each individual.
Chatburn[23] called this concept a tolerance inter-
val. Although there are differences here on the sta-
tistical philosophy, the present review is more con-
cerned with the practical use of these statistics.
The first step in the limits of agreement analysis
is to present and explore the test-retest data with a
Bland-Altman plot, which is the individual subject
differences between the tests plotted against the
respective individual means (it is a mistake to plot
the differences against the scores obtained for just
one of the tests).[78] An example of a Bland-Altman
plot using the data in table II is provided in figure
1. Using this plot rather than the conventional test-
retest scattergram, a rough indication of systematic
bias and random error is provided by examining
the direction and magnitude of the scatter around
the zero line, respectively. It is also important to
observe whether there is any heteroscedasticity in
the data (whether the differences depend on the
Fig. 1. A Bland-Altman plot for the data presented in table II [vertical axis: week 1 – week 2 (ml/kg/min); horizontal axis: mean V̇O2max (ml/kg/min); the bias line and the ±1.96 SD lines are marked]. The differences between the tests/methods are plotted against each individual’s mean for the 2 tests. The bias line and random error lines forming the 95% limits of agreement are also presented on the plot. Visual inspection of the data suggests that the differences are greater with the highest maximal oxygen uptake (V̇O2max) values. A similar plot can be formed from the results of analysis of variance (ANOVA) by plotting the residuals against the actual scores. SD = standard deviation.
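A minimal sketch of such a plot (assumed week 1/week 2 data, since table II is not reproduced in this section):

```python
import numpy as np
import matplotlib.pyplot as plt

week1 = np.array([45.0, 52.1, 38.4, 61.3, 49.7, 55.2, 42.8, 58.6])
week2 = np.array([47.2, 50.3, 40.1, 63.8, 48.9, 57.0, 41.5, 60.2])

means = (week1 + week2) / 2     # x-axis: individual means, not one test
diffs = week1 - week2           # y-axis: test-retest differences
bias = diffs.mean()
rand = 1.96 * diffs.std(ddof=1)

plt.scatter(means, diffs)
plt.axhline(bias, label='bias')
plt.axhline(bias + rand, linestyle='--', label='+1.96 SD')
plt.axhline(bias - rand, linestyle='--', label='-1.96 SD')
plt.xlabel('Mean of two tests (ml/kg/min)')
plt.ylabel('Week 1 - week 2 (ml/kg/min)')
plt.legend()
plt.show()
```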
magnitude of the mean). Heteroscedasticity can be
examined formally by plotting the absolute differ-
ences against the individual means (fig. 2) and cal-
culating the correlation coefficient (correlation is
appropriate here, since the alternative hypothesis is
that there is a relationship present). If heterosce-
dasticity is suspected, the analysis is more compli-
cated (see below).
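This check is easy to automate; a minimal sketch with assumed data follows (scipy's pearsonr returns the correlation and its p-value):

```python
import numpy as np
from scipy import stats

week1 = np.array([45.0, 52.1, 38.4, 61.3, 49.7, 55.2, 42.8, 58.6])
week2 = np.array([47.2, 50.3, 40.1, 63.8, 48.9, 57.0, 41.5, 60.2])

abs_diffs = np.abs(week1 - week2)
means = (week1 + week2) / 2

# A positive correlation suggests the error grows with the measured value.
r, p = stats.pearsonr(means, abs_diffs)
print(f"heteroscedasticity correlation: r = {r:.2f}, p = {p:.3f}")
```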
If the heteroscedasticity correlation is close to
zero and the differences are normally distributed,
one may proceed to calculate the limits of agree-
ment as follows. First, the SD of the differences
between test 1 and test 2 is calculated. The SD of
the differences of the data in table II is 6.6
ml/kg/min. This is then multiplied by 1.96 to ob-
tain the 95% random error component of 12.9
ml/kg/min (the 95th percentile is the way reliability
data should be presented according to the British
Standards Institution).[79] If there is no significant
systematic bias (identified by a paired t-test) then
there is a rationale for expressing the limits of
agreement as ± this value. However, one of the dis-
cussed drawbacks of the t-test was that significant
bias would not be detected if it is accompanied by
large random variation. One could quote the ran-
dom error with the bias to form the limits of agree-
ment, even if it is not statistically significant. For
the data in table II, since there is a slight bias of
–1.5 ml/kg/min, the limits of agreement are –14.4
to +11.4 ml/kg/min. Expressed this way, the limits
of agreement are actually a measure of ‘total error’
(bias and random error together). It is probably
more informative to researchers reading abstracts
of reliability studies if the bias and random error
components are cited separately, e.g. –1.5 ± 12.9
ml/kg/min.
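The whole calculation can be sketched as follows (assumed data; the values quoted in the text for table II, an SD of differences of 6.6 and a bias of –1.5 ml/kg/min, will not be reproduced exactly):

```python
import numpy as np
from scipy import stats

week1 = np.array([45.0, 52.1, 38.4, 61.3, 49.7, 55.2, 42.8, 58.6])
week2 = np.array([47.2, 50.3, 40.1, 63.8, 48.9, 57.0, 41.5, 60.2])
diffs = week1 - week2

bias = diffs.mean()                      # systematic bias
rand = 1.96 * diffs.std(ddof=1)          # 95% random error component
t, p = stats.ttest_rel(week1, week2)     # paired t-test for the bias

print(f"bias = {bias:.1f} ml/kg/min (paired t-test p = {p:.3f})")
print(f"95% limits of agreement: {bias:.1f} +/- {rand:.1f} ml/kg/min")
```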
It was stated earlier that CV methods should be
used only if the variability depends on the magni-
tude of the mean values (heteroscedasticity). If,
from the positive correlation between the absolute
differences and the individual means, there is
heteroscedasticity in the data, then Bland and Alt-
man[16] recommend the logarithmic (natural) trans-
formation of the data before the calculation of lim-
its of agreement. The final step would be to antilog
the data. Bland and Altman[16] provide a worked
example for this.
In the examination of heteroscedasticity, Nevill
and Atkinson[53] found that, if the correlation be-
tween absolute differences and individual means is
positive but not necessarily significant in a set of
data, it is usually beneficial to take logarithmic val-
ues when calculating the limits of agreement. For
example, there is very slight heteroscedasticity
present in the data from table II (fig. 2, r = 0.18,
p = 0.345). If logs are taken, this correlation is re-
duced to 0.01. Having taken logs of the measure-
ments from both weeks, the mean ± 95% limits of
agreement is calculated to be –0.0356 ± 0.257. Tak-
ing antilogs of these values the mean bias on the
ratio scale is 0.97 and the random error component
is now ×/÷ 1.29. Therefore, 95% of the ratios
should lie between 0.97 ×/÷ 1.29. If the sample of
differences is not normally distributed, which has
also been observed with some measurements rele-
vant to sports medicine,[53] the data would, again,
benefit from logarithmic transformation. The data
in table II are actually not normally distributed
(Anderson-Darling test), but after log transforma-
tion they follow normality. It follows that the pre-
vious tests for bias (paired t-test, ANOVA) should
have, strictly speaking, been performed on the log
transformed data. Nevertheless, this does not detract, in any way, from the points that were made regarding the use of these tests to detect bias in reliability studies.

Fig. 2. A plot of the absolute differences between the tests/methods and the individual means for the examination of heteroscedasticity in the data presented in table II (r = 0.18, p = 0.345) [vertical axis: absolute difference (ml/kg/min); horizontal axis: mean V̇O2max (ml/kg/min)]. This correlation is decreased to 0.01 when the data are logarithmically transformed. Therefore, there is evidence that the limits of agreement would be best expressed as ratios (the absolute measurement error is greater for the individuals who score highly on the test).
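The log-transform route to ratio limits of agreement can be sketched as follows (assumed data, so the result will not match the 0.97 ×/÷ 1.29 quoted above):

```python
import numpy as np

week1 = np.array([45.0, 52.1, 38.4, 61.3, 49.7, 55.2, 42.8, 58.6])
week2 = np.array([47.2, 50.3, 40.1, 63.8, 48.9, 57.0, 41.5, 60.2])

# Bias and random error on the natural log scale.
log_diffs = np.log(week1) - np.log(week2)
bias_log = log_diffs.mean()
rand_log = 1.96 * log_diffs.std(ddof=1)

# Antilog to express the limits as ratios.
print(f"ratio bias = {np.exp(bias_log):.2f}")
print(f"ratio random error = x/÷ {np.exp(rand_log):.2f}")
```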
9. Limits of Agreement and
Analytical Goals
The next step is the interpretation of the limits
of agreement. Some researchers[80] have concluded
acceptable measurement error by observing that
only a few of the test-retest differences fall outside
the 95% limits of agreement that were calculated
from those same differences. This is not how the
limits should be interpreted. Rather, it can be said
that for a new individual from the studied popula-
tion, it would be expected (an approximate 95%
probability) that the difference between any 2 tests
should lie within the limits of agreement. There-
fore, in the case of the Fitech test, we expect the
differences between the test and retest of an indi-
vidual from the particular population to lie be-
tween –14.4 and +11.4 ml/kg/min. Since there was
evidence that heteroscedasticity was present in the
Fitech data (the heteroscedasticity correlation re-
duced following logarithmic transformation of the
data), the limits are best represented by ratios.
From the ratio limits of agreement calculated
above (0.97 ×/÷ 1.29), it can be said that for any
individual from the population, assuming the bias
that is present (3%) is negligible, any 2 tests will
differ due to measurement error by no more than
29% either in a positive or negative direction (the
error is actually slightly greater in the positive than
the negative direction with true ratio data that are
heteroscedastic). It should be noted, as Bland[39]
observed, that this value is very similar to the value
of 27% calculated in an arguably simpler manner
from 100 × (1.96 × SD diff/grand mean) on the data
prior to logging, where ‘SD diff’ represents stand-
ard deviation of the differences between test and
retest and ‘grand mean’ represents (mean of test 1
+ mean of test 2)/2.
As discussed earlier, it is the task of the re-
searcher to judge, using analytical goals, whether
the limits of agreement are narrow enough for the
test to be of practical use. The comparison of reli-
ability between different measurement tools using
limits of agreement is, at present, difficult, since
there have been so few studies employing limits of
agreement for sports science measurements. With
respect to the Fitech test data, the limits of agree-
ment for reliability are very similar to those pub-
lished for the similar-in-principle Astrand-Ryhming
test of predicted maximal oxygen consumption.[81]
We would conclude that these tests are probably
not reliable enough to monitor the small changes
in maximal oxygen consumption that result from
increasing the training of an already athletic per-
son.[82] However, these predictive tests may detect
large differences in maximal oxygen consumption,
for example, after an initially sedentary person
performs a conditioning programme.[83] One could
arrive at a more conclusive decision of adequate
(or inadequate) reliability by using analytical goals
based on sample sizes for future experimental uses.
The SD of the differences (or the mean squared
residual in the case of ANOVA) can be used to
estimate sample sizes for repeated measures ex-
periments.[6] It would be clear, even without such
calculations, that the greater the random error com-
ponent of the limits of agreement, the more indi-
viduals would be needed in an experiment for a
given hypothesised experimental change. Alterna-
tively, the greater the random error indicated by the
limits of agreement, the larger the minimal detect-
able change would be for a given sample size in
an experiment. Zar[6] also provides calculations
for this issue of estimating minimal detectable
changes for measurement tools. One cannot judge
the magnitude of a correlation coefficient per se as
simply as this, since there is an ‘added factor’ of
interindividual variability in this statistic.
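A minimal sketch of how the SD of the differences feeds into these analytical goals (a normal-approximation shortcut is used here; Zar's t-based calculations are more exact):

```python
import numpy as np
from scipy import stats

sd_diff = 6.6        # SD of test-retest differences (ml/kg/min), from the text
delta = 5.0          # hypothesised experimental change (assumed value)
alpha, power = 0.05, 0.80

z_a = stats.norm.ppf(1 - alpha / 2)
z_b = stats.norm.ppf(power)

# Approximate sample size for a paired (repeated-measures) design.
n = int(np.ceil(((z_a + z_b) * sd_diff / delta) ** 2))
print(f"approximate sample size: n = {n}")

# Conversely, the minimal detectable change for a fixed sample size.
n_fixed = 20
mdc = (z_a + z_b) * sd_diff / np.sqrt(n_fixed)
print(f"minimal detectable change with n = {n_fixed}: {mdc:.1f} ml/kg/min")
```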
We have 3 comments on the use of limits of
agreement in sports science and medicine:
(1) Only recently[39] has the limits of agreement
method been applied to multiple retests using an
ANOVA approach. This is preferable for the in-
depth investigation of bias and also because the
examination of heteroscedasticity is enhanced (the
degrees of freedom are increased). The random
error component of the 95% limits of agreement is
calculated from
1.96 × √(2 × MSE)
where MSE is the mean squared error term from a
repeated measures ANOVA. Recently, Bland and
Altman[43,44] accepted that measurement error may
be expressed in relation to a ‘population’ of re-
peated tests in individuals, which is the basis of
SEM and CV. They calculated this from MSE,
which equates to one method of calculating the
SEM.[69] They did stress, however, the need to mul-
tiply this value by 1.96 in order to represent the
difference between measured and the ‘true’ value
for 95% of observations. For the example data in
table II, the square root of the MSE is 4.7 ml/kg/min so the ‘95%
SEM’ is 1.96 × 4.7 = ±9.2 ml/kg/min. For logged
data, one would antilog the square root of the MSE
from ANOVA and express this CV to the power of
1.96 to cover 95% of observations. This would be
1.097^1.96 = ×/÷ 1.20 for the data in table II when expressed as
a ratio.
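These ANOVA-based quantities can be sketched together (assumed subjects × trials data; the √2 multiplier converts the root MSE into the limits-of-agreement random error):

```python
import numpy as np

# Assumed data from a multiple-retest study (subjects x 3 trials).
scores = np.array([[45.0, 47.2, 46.1], [52.1, 50.3, 51.5],
                   [38.4, 40.1, 39.2], [61.3, 63.8, 62.0],
                   [49.7, 48.9, 50.4], [55.2, 57.0, 56.1]])
n, k = scores.shape

gm = scores.mean()
ss_sub = k * ((scores.mean(axis=1) - gm) ** 2).sum()
ss_tri = n * ((scores.mean(axis=0) - gm) ** 2).sum()
mse = (((scores - gm) ** 2).sum() - ss_sub - ss_tri) / ((n - 1) * (k - 1))

loa_random = 1.96 * np.sqrt(2 * mse)   # 95% LOA random error component
sem95 = 1.96 * np.sqrt(mse)            # '95% SEM'

print(f"95% LOA random error = +/- {loa_random:.1f} ml/kg/min")
print(f"95% SEM = +/- {sem95:.1f} ml/kg/min")
```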
Hopkins[38] cites a very similar statistic to Bland
and Altman’s 68% ratio CV of 1.097 (9.7%), al-
though it is calculated in a slightly different way
and always expressed as a percentage (±9.3% for
our example data). Note that both these methods of
calculating CV (from ANOVA) give slightly higher
values than the ‘naive estimator’ of the mean value
of 7.6% calculated from individual CVs. This
agrees with the observations of Quan and Shih.[65]
Note also that expressing a CV as ± percent rather
than as ×/÷ ratio may be misleading since a char-
acteristic of ratio data is that the range of error will
always be slightly less, below a given measured
value compared with the error above a measured
value. The calculation of ± CV implies, erron-
eously with true ratio data, that the error is of equal
magnitude either side of a particular measured
value.
(2) Because the calculated limits of agreement are
meant to be extrapolated to a given population, it
is recommended that a large sample size (n > 40)
is examined in any measurement study.[30] Bland
and Altman[16] also advise the calculation of the
standard errors of the limits of agreement to show
how precise they are in relation to the whole pop-
ulation. From these, confidence intervals can be cal-
culated, which may allow statistical meta-analysis
for comparison of limits of agreement between dif-
ferent studies.
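A sketch of these standard errors, using the approximation given in Bland and Altman's 1986 paper that the SE of each limit is about √(3s²/n), where s is the SD of the differences (assumed data):

```python
import numpy as np
from scipy import stats

diffs = np.array([-2.2, 1.8, -1.7, -2.5, 0.8, -1.8, 1.3, -1.6])  # assumed
n = len(diffs)
s = diffs.std(ddof=1)
bias = diffs.mean()
upper, lower = bias + 1.96 * s, bias - 1.96 * s

se_limit = np.sqrt(3 * s ** 2 / n)     # approximate SE of each limit
t_crit = stats.t.ppf(0.975, n - 1)

print(f"upper limit {upper:.1f}, 95% CI "
      f"({upper - t_crit * se_limit:.1f}, {upper + t_crit * se_limit:.1f})")
print(f"lower limit {lower:.1f}, 95% CI "
      f"({lower - t_crit * se_limit:.1f}, {lower + t_crit * se_limit:.1f})")
```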
(3) The reliability examples cited in Bland and
Altman’s work[16,39] appear not to consider that
bias can occur in repeated measurements.[84] Only
the method comparison (validity) examples incor-
porate the bias estimation in the limits of agree-
ment. This might be because the clinician is dealing
frequently with biological assays, which are not
affected by learning or fatigue of testing. Since
these effects are likely to influence measurements
of human performance, it is recommended that the
bias between repeated trials is always reported
(separately from the random error component) by
the sports scientist.
10. Discussion
This review has attempted to evaluate the most
common statistical methods for evaluating reliabil-
ity. In view of the importance of minimal measurement error to sports science research, and although one book on the subject has been published,[17] it
is surprising how neglected discussions on meas-
urement issues are in sports science and medicine.
An important point is that correlation methods
(including ICC) should be interpreted with caution
in such studies. This is a difficult notion to promote
given the popularity of judging a high correlation
as indicating adequate reliability. An implication
of the poor interpretation of correlation analyses is
that equipment used routinely in the sport and ex-
ercise sciences may have been erroneously con-
cluded as being sufficiently reliable (realising cer-
tain analytical goals for sports science use). It
would be sensible for researchers to reappraise the
results of test-retest correlations and supplement
this with the application of absolute indicators of
reliability. Ideally, a database should exist provid-
ing information on the reliability of every measure-
ment tool used routinely in sports medicine. This has
been attempted, using correlation, with isokinetic
muscle strength measurements.[50] At present, the
limits of agreement method has been applied most,
amongst sport science–relevant variables, to the
reliability and validity of adipose tissue measure-
ments.[85-89]
The present review has attempted to highlight
that some reliability statistics are cited in the sports
science literature without adequate investigation
of underlying assumptions. The important assump-
tion regarding the relationship between error and
the magnitude of the measured value is rarely ex-
plored by reliability researchers. It may be that
with some measurements, the variability decreases
instead of increases as the measured values in-
crease (negative heteroscedasticity). In this case,
the data might need to be transformed differently
before application of an absolute indicator of reli-
ability. Statisticians are currently working on such
problems.[42] It is imperative that the sports physi-
cian keeps abreast of the correct statistical solu-
tions to these issues. One practical recommenda-
tion is that future reliability studies include an
examination of how the measurement error relates
to the magnitude of the measured variables, irre-
spective of which type of absolute reliability sta-
tistic is employed (SEM, CV, limits of agreement).
The simplest way to do this is by plotting the cal-
culated residuals from ANOVA against the fitted
values and observing if the classic ‘funnelling’ of
heteroscedasticity is evident.
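A minimal sketch of this diagnostic (assumed data; the fitted values come from a simple subject-plus-trial model, as in a two-way repeated-measures ANOVA):

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.array([[45.0, 47.2], [52.1, 50.3], [38.4, 40.1],
                   [61.3, 63.8], [49.7, 48.9], [55.2, 57.0]])

# Fitted values and residuals from a subject + trial effects model.
fitted = (scores.mean(axis=1, keepdims=True)
          + scores.mean(axis=0, keepdims=True) - scores.mean())
residuals = scores - fitted

plt.scatter(fitted.ravel(), residuals.ravel())
plt.axhline(0, linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')   # a widening 'funnel' indicates heteroscedasticity
plt.show()
```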
One issue which Bland and Altman consistently
discuss in their work on measurement issues is that
of ‘method comparison’.[16,42] They maintain that
the disadvantages of many statistics used in reli-
ability studies also apply to studies investigating
whether different methods can be used interchange-
ably or whether a method agrees with a gold stand-
ard measurement tool. They propose that the use of
limits of agreement is also more appropriate in
these situations, which happen to be very common
in sports science as part of validity examina-
tions.[90] Obviously, such a use of limits of agree-
ment would be alien to the sports scientist who may
be accustomed to hypothesis tests, and regression
and correlation methods as part of this type of
validity analysis. An important issue such as this
should warrant further discussion amongst sports
science researchers.
To conclude, it seems ironic that the many sta-
tistics designed to assess agreement seem so incon-
sistent in their quantification of measurement error
and their interpretation amongst researchers for de-
ciding whether a measurement tool can be reliably
employed in future research work. In brief, there
are difficulties with relative reliability statistics
both in their interpretation and extrapolation of re-
sults to future research. There are also many differ-
ent methods of calculating the reliability statistic,
ICC. Moreover, the expression of absolute reliabil-
ity statistics differs to such an extent that one sta-
tistic (SEM) can be calculated in a way (SD × √(1 − ICC))
that makes it still sensitive to population heteroge-
neity (i.e. not a true indicator of absolute reliability
at all). There is also a general lack of exploration
of associated assumptions with absolute reliability
statistics and disagreement on the described pro-
portion of measurement error (68 vs 95%).
While statistics will never be more important
than a well designed reliability study itself, it is
sensible that there should be a standardised statis-
tical analysis for any reliability study involving
ratio or interval measurements. This may entail us-
ing several reliability statistics, so that different
researchers can interpret the one they are most ac-
customed to. To this end, we would suggest:
• The inclusion in any reliability study of an examination of the assumptions surrounding the choice of statistics, especially the presence or absence of heteroscedasticity.
• A full examination of any systematic bias in the measurements coupled with practical recommendations for future researchers on the number of pretest familiarisation sessions to employ and the advised recovery time between tests so that any bias due to fatigue is minimised.
• The inclusion of intraclass correlation analysis, but with full details of which type of ICC has been calculated and the citation of confidence intervals for the ICC. This analysis could be supplemented with an examination of relative reliability through the test-retest stability of sample ranks or the relation of the degree of absolute reliability to the interindividual or between-centile differences in a population. This is recommended even if a high ICC (>0.9) has been obtained.
• The citation of the most popular measures of absolute reliability, depending on whether heteroscedasticity is present (CV, ‘ratio limits of agreement’) or absent (SEM, ‘absolute limits of agreement’). It is preferable that these are calculated from the mean square error term in a repeated-measures ANOVA model. The described percentile of measurement error (68 or 95%) should also be stated.
• The arrival at an eventual decision of reliability (or not) based on the extrapolation of the measurement error to the realisation of ‘analytical goals’. These may include the effectiveness of use of the measurement tool on individual cases, a meaningful degree of relative reliability and the implications of the measurement error for sample size estimation in experiments.
Acknowledgements
We are grateful to Dr Don MacLaren for providing, from
his laboratory work, the data used to illustrate the various
statistical analyses in this review. The comments of Professor
Thomas Reilly, Dr Jim Waterhouse and Dr Richard Tong
were also appreciated during the preparation of this manu-
script.
References
1. Yeadon MR, Challis JH. The future of performance-related
sports biomechanics research. J Sports Sci 1994; 12: 3-32
2. Jakeman PM, Winter EM, Doust J. A review of research in
sports physiology. J Sports Sci 1994; 12: 33-60
3. Hardy L, Jones G. Current issues and future directions for per-
formance-related research in sport psychology. J Sports Sci
1994; 12: 61-92
4. Nevill AM. Statistical methods in kinanthropometry and exer-
cise physiology. In: Eston R, Reilly T, editors. Kinanthropo-
metry and exercise physiology laboratory manual. London:
E and FN Spon, 1996: 297-320
5. Safrit MJ. An overview of measurement. In: Safrit MJ, Wood
TM, editors. Measurement concepts in physical education and
exercise science. Champaign (IL): Human Kinetics, 1989: 3-20
6. Zar JH. Biostatistical analysis. London: Prentice Hall, 1996
7. Mathews JN. A formula for the probability of discordant clas-
sification in method comparison studies. Stat Med 1997; 16
(6): 705-10
8. Bates BT, Dufek JS, Davis HP. The effects of trial size on sta-
tistical power. Med Sci Sports Exerc 1992; 24 (9): 1059-65
9. Dufek JS, Bates BT, Davis HP. The effect of trial size and vari-
ability on statistical power. Med Sci Sports Exerc 1995; 27:
288-95
10. Atkinson G. [Letter]. British Association of Sports Sciences
Newsletter, 1995 Sep: 5
11. Nevill AM. Validity and measurement agreement in sports per-
formance [abstract]. J Sports Sci 1996; 14: 199
12. Ottenbacher KJ, Stull GA. The analysis and interpretation of
method comparison studies in rehabilitation research. Am J
Phys Med Rehab 1993; 72: 266-71
13. Hollis S. Analysis of method comparison studies. Ann Clin
Biochem 1996; 33: 1-4
14. Liehr P, Dedo YL, Torres S, et al. Assessing agreement between
clinical measurement methods. Heart Lung 1995; 24: 240-5
15. Ottenbacher KJ, Tomcheck SD. Measurement variation in
method comparison studies: an empirical examination. Arch
Phys Med Rehabil 1994; 75 (5): 505-12
16. Bland JM, Altman DG. Statistical methods for assessing agree-
ment between two methods of clinical measurement. Lancet
1986; I: 307-10
17. Safrit MJ, Wood TM, editors. Measurement concepts in physi-
cal education and exercise science. Champaign (IL): Human
Kinetics, 1989
18. Baumgartner TA. Norm-referenced measurement: reliability. In:
Safrit MJ, Wood TM, editors. Measurement concepts in phys-
ical education and exercise science. Champaign (IL): Human
Kinetics, 1989: 45-72
19. Atkinson G, Reilly T. Circadian variation in sports performance.
Sports Med 1996; 21 (4): 292-312
20. Morrow JR, Jackson AW, Disch JG, et al. Measurement and
evaluation in human performance. Champaign (IL): Human
Kinetics, 1995
21. Morrow JR. Generalizability theory. In: Safrit MJ, Wood TM,
editors. Measurement concepts in physical education and ex-
ercise science. Champaign (IL): Human Kinetics, 1989: 73-96
22. Roebroeck ME, Harlaar J, Lankhorst GJ. The application of
generalizability theory to reliability assessment: an illustra-
tion using isometric force measurements. Phys Ther 1993; 73
(6): 386-95
23. Chatburn RL. Evaluation of instrument error and method agree-
ment. Am Assoc Nurse Anesthet J 1996; 64 (3): 261-8
24. Coldwells A, Atkinson G, Reilly T. Sources of variation in back
and leg dynamometry. Ergonomics 1994; 37: 79-86
25. Hickey MS, Costill DL, McConnell GK, et al. Day-to-day vari-
ation in time trial cycling performance. Int J Sports Med 1992;
13: 467-70
26. Nevill A. Why the analysis of performance variables recorded
on a ratio scale will invariably benefit from a log transforma-
tion. J Sports Sci 1997; 15: 457-8
27. Bland JM, Altman DG. Transforming data. BMJ 1996; 312
(7033): 770
28. Schultz RW. Analysing change. In: Safrit MJ, Wood TM, edi-
tors. Measurement concepts in physical education and exer-
cise science. Champaign (IL): Human Kinetics, 1989: 207-28
29. Morrow JR, Jackson AW. How ‘significant’ is your reliability?
Res Q Exerc Sport 1993; 64 (3): 352-5
30. Altman DG. Practical statistics for medical research. London:
Chapman and Hall, 1991: 396-403
31. Mathews JNS, Altman DG, Campbell MJ, et al. Analysis of
serial measurements in medical research. BMJ 1990; 300:
230-5
32. Vincent J. Statistics in kinesiology. Champaign (IL): Human
Kinetics Books, 1994
33. Ross JW, Fraser MD. Analytical goals developed from the in-
herent error of medical tests. Clin Chem 1993; 39 (7): 1481-93
34. Fraser CG, Hyltoft Petersen P, et al. Setting analytical goals for
random analytical error in specific clinical monitoring situa-
tions. Clin Chem 1990; 36 (9): 1625-8
35. Zehr ER, Sale DG. Reproducibility of ballistic movement. Med
Sci Sports Exerc 1997; 29: 1383-8
36. Hofstra WB, Sont JK, Sterk PJ, et al. Sample size estimation in
studies monitoring exercise-induced bronchoconstriction in
asthmatic children. Thorax 1997; 52: 739-41
37. Schabort EJ, Hopkins WG, Hawley JA. Reproducibility of self-
paced treadmill performance of trained endurance runners.
Int J Sports Med 1998; 19: 48-51
38. Hopkins W. A new view of statistics. Internet site, 1997,
http://www.sportsci.org/resource/stats/index.html
39. Bland M. An introduction to medical statistics. Oxford: Univer-
sity Press, 1995
40. Proceedings of the 43rd Meeting of the American College of
Sports Medicine. Med Sci Sports Exerc 1996; 28: S1-211
41. Altman DG, Bland JM. Measurement in medicine: the analysis
of method comparison studies. Statistician 1983; 32: 307-17
42. Bland JM, Altman DG. Comparing two methods of clinical
measurement: a personal history. Int J Epidemiol 1995; 24
Suppl. 1: S7-14
43. Bland JM, Altman DG. Measurement error. BMJ 1996; 312
(7047): 1654
44. Bland JM, Altman DG. Measurement error proportional to the
mean. BMJ 1996; 313 (7049): 106
45. Thomas JR, Nelson JK. Research methods in physical activity.
Champaign (IL): Human Kinetics, 1990
46. Nevill AM, Atkinson G. Assessing measurement agreement (re-
peatability) between 3 or more trials [abstract]. J Sports Sci
1998; 16: 29
47. Coolican H. Research methods and statistics in psychology.
London: Hodder and Stoughton, 1994
48. Sale DG. Testing strength and power. In: MacDougall JD, Wen-
ger HA, Green HJ, editors. Physiological testing of the high
performance athlete. Champaign (IL): Human Kinetics,
1991: 21-106
49. Bates BT, Zhang S, Dufek JS, et al. The effects of sample size
and variability on the correlation coefficient. Med Sci Sports
Exerc 1996; 28 (3): 386-91
50. Perrin DH. Isokinetic exercise and assessment. Champaign
(IL): Human Kinetics, 1993
51. Glass GV, Hopkins KD. Statistical methods in education and
psychology. 2nd ed. Englewood Cliffs (NJ): Prentice-Hall,
1984
52. Estelberger W, Reibnegger G. The rank correlation coefficient:
an additional aid in the interpretation of laboratory data. Clin
Chim Acta 1995; 239 (2): 203-7
53. Nevill AM, Atkinson G. Assessing agreement between meas-
urements recorded on a ratio scale in sports medicine and
sports science. Br J Sports Med 1997; 31: 314-8
54. Atkinson G, Greeves J, Reilly T, et al. Day-to-day and circadian
variability of leg strength measured with the lido isokinetic
dynamometer. J Sports Sci 1995; 13: 18-9
55. Bailey SM, Sarmandal P, Grant JM. A comparison of three
methods of assessing inter-observer variation applied to
measurement of the symphysis-fundal height. Br J Obstet
Gynaecol 1989; 96 (11): 1266-71
56. Sarmandal P, Bailey SM, Grant JM. A comparison of three
methods of assessing inter-observer variation applied to ultra-
sonic fetal measurement in the third trimester. Br J Obstet
Gynaecol 1989; 96 (11): 1261-5
57. Atkinson G, Coldwells A, Reilly T, et al. Does the within-test
session variation in measurements of muscle strength depend
on time of day? [abstract] J Sports Sci 1997; 15: 22
58. Charter RA. Effect of measurement error on tests of statistical
significance. J Clin Exp Neuropsychol 1997; 19 (3): 458-62
59. Muller R, Buttner P. A critical discussion of intraclass correla-
tion coefficients. Stat Med 1994; 13 (23-24): 2465-76
60. Eliasziw M, Young SL, Woodbury MG, et al. Statistical meth-
odology for the concurrent assessment of inter-rater and
intra-rater reliability: using goniometric measurements as an
example. Phys Ther 1994; 74 (8): 777-88
61. Krebs DE. Declare your ICC type [letter]. Phys Ther 1986; 66:
1431
62. Atkinson G. A comparison of statistical methods for assessing
measurement repeatability in ergonomics research. In: Atkin-
son G, Reilly T, editors. Sport, leisure and ergonomics. Lon-
don: E and FN Spon, 1995: 218-22
63. Bland JM, Altman DG. A note on the use of the intraclass cor-
relation coefficient in the evaluation of agreement between
two methods of measurement. Comput Biol Med 1990; 20:
337-40
64. Myrer JW, Schulthies SS, Fellingham GW. Relative and abso-
lute reliability of the KT-2000 arthrometer for uninjured
knees. Testing at 67, 89, 134 and 178 N and manual maximum
forces. Am J Sports Med 1996; 24 (1): 104-8
65. Quan H, Shih WJ. Assessing reproducibility by the within-
subject coefficient of variation with random effects models.
Biometrics 1996; 52 (4): 1195-203
66. Lin LI-K. A concordance correlation coefficient to evaluate re-
producibility. Biometrics 1989; 45: 255-68
67. Nickerson CAE. A note on ‘A concordance correlation coeffi-
cient to evaluate reproducibility’. Biometrics 1997; 53: 1503-7
68. Atkinson G, Nevill A. Comment on the use of concordance
correlation to assess the agreement between two variables.
Biometrics 1997; 53: 775-7
69. Stratford PW, Goldsmith CH. Use of the standard error as a
reliability index of interest: an applied example using elbow
flexor strength data. Phys Ther 1997; 77 (7): 745-50
70. Payne RW. Reliability theory and clinical psychology. J Clin
Psychol 1989; 45 (2): 351-2
71. Strike PW. Statistical methods in laboratory medicine. Oxford:
Butterworth-Heinemann, 1991
72. Feltz CJ, Miller GE. An asymptotic test for the equality of coef-
ficients of variation from k populations. Stat Med 1996; 15
(6): 646-58
73. Allison DB. Limitations of coefficient of variation as index of
measurement reliability [editorial]. Nutrition 1993; 9 (6):
559-61
74. Yao L, Sayre JW. Statistical concepts in the interpretation of
serial bone densitometry. Invest Radiol 1994; 29 (10): 928-32
75. Detwiler JS, Jarisch W, Caritis SN. Statistical fluctuations in
heart rate variability indices. Am J Obstet Gynecol 1980; 136
(2): 243-8
76. Stokes M. Reliability and repeatability of methods for measur-
ing muscle in physiotherapy. Physiother Pract 1985; 1: 71-6
77. Bishop D. Reliability of a 1-h endurance performance test in
trained female cyclists. Med Sci Sports Exerc 1997; 29: 554-9
78. Bland JM, Altman DG. Comparing methods of measurement:
why plotting difference against the standard method is mis-
leading. Lancet 1995; 346 (8982): 1085-7
79. British Standards Institution. Precision of test methods I. Guide
for the determination of repeatability and reproducibility for a standard test
method. BS5497: Pt 1. London: BSI, 1979
80. de Jong JS, van Diest PJ, Baak JPA. In response [letter]. Lab
Invest 1996; 75 (5): 756-8
81. Wisen AG, Wohlfart B. A comparison between two exercise
tests on cycle; a computerised test versus the Astrand test. Clin
Physiol 1995; 15: 91-102
82. Wilmore JH, Costill DL. Physiology of sport and exercise.
Champaign (IL): Human Kinetics, 1994
83. Pollock ML. Quantification of endurance training programmes.
Exerc Sports Sci Rev 1973; 1: 155-88
84. Doyle JR, Doyle JM. Measurement error is that which we have
not yet explained. BMJ 1997; 314: 147-8
85. Schaefer F, Georgi M, Zieger A, et al. Usefulness of bioelectric
impedance and skinfold measurements in predicting fat-free
mass derived from total body potassium in children. Pediatr
Res 1994; 35: 617-24
86. Webber J, Donaldson M, Allison SP, et al. Comparison of
skinfold thickness, body mass index, bioelectrical impedance
analysis and x-ray absorptiometry in assessing body compo-
sition in obese subjects. Clin Nutr 1994; 13: 177-82
87. Fuller NJ, Sawyer MB, Laskey MA, et al. Prediction of body
composition in elderly men over 75 years of age. Ann Hum
Biol 1996; 23: 127-47
88. Gutin B, Litaker M, Islam S, et al. Body composition measure-
ment in 9-11 year old children by dual energy x-ray
absorptiometry, skinfold thickness measures and bioimped-
ance analysis. Am J Clin Nutr 1996; 63: 287-92
89. Reilly JJ, Wilson J, McColl JH, et al. Ability of bioelectric im-
pedance to predict fat-free mass in prepubertal children.
Pediatr Res 1996; 39: 176-9
90. Wood TM. The changing nature of norm-referenced validity. In:
Safrit MJ, Wood TM, editors, Measurement concepts in phys-
ical education and exercise science. Champaign (IL): Human
Kinetics, 1989: 23-44
Correspondence and reprints: Dr Greg Atkinson, Research
Institute for Sport and Exercise Sciences, Trueman Building,
Webster Street, Liverpool John Moores University, Liver-
pool L3 2ET, England.
E-mail: g.atkinson@livjm.ac.uk