Article

Quantifying Test-Retest Reliability Using the Intraclass Correlation Coefficient and the SEM

Applied Physiology Laboratory, Division of Physical Therapy, Des Moines University-Osteopathic Medical Center, Des Moines, Iowa 50312, USA.
The Journal of Strength and Conditioning Research (Impact Factor: 2.08). 03/2005; 19(1):231-40. DOI: 10.1519/15184.1
Source: PubMed

ABSTRACT

Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). In addition, the SEM, which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be one concerning the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the necessary variance components for the calculation of ICC values, are useful to determine if systematic error is present. If so, the measurement schedule should be modified (removing trials where learning and/or fatigue effects are present) to remove systematic error, and ICC equations that only consider random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference needed to be exhibited for one to be confident that a true change in performance of an individual has occurred.
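The distinction the abstract draws between ICC forms, and the route from the ICC's variance components to the SEM and the minimal difference, can be made concrete with a short calculation. The sketch below is not code from the article; the function name and the toy data matrix are illustrative assumptions. It follows the two-way ANOVA decomposition the review describes, taking the SEM as the square root of the error mean square and the minimal difference as SEM × 1.96 × √2.

```python
# Minimal sketch (hypothetical data) of ICC(2,1), ICC(3,1), SEM, and the minimal
# difference from a subjects x trials score matrix, via two-way ANOVA mean squares.
import numpy as np

def reliability_from_matrix(data):
    """data: n subjects (rows) x k trials (columns)."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand_mean = data.mean()
    ss_subjects = k * np.sum((data.mean(axis=1) - grand_mean) ** 2)
    ss_trials = n * np.sum((data.mean(axis=0) - grand_mean) ** 2)
    ss_error = np.sum((data - grand_mean) ** 2) - ss_subjects - ss_trials
    ms_subjects = ss_subjects / (n - 1)
    ms_trials = ss_trials / (k - 1)            # systematic (trial-to-trial) effect
    ms_error = ss_error / ((n - 1) * (k - 1))  # random error

    # ICC(2,1): systematic error kept in the denominator
    icc_2_1 = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_trials - ms_error) / n)
    # ICC(3,1): systematic error excluded from the denominator
    icc_3_1 = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)

    sem = np.sqrt(ms_error)            # SEM = square root of the error mean square
    md = sem * 1.96 * np.sqrt(2)       # minimal difference to be considered real
    return icc_2_1, icc_3_1, sem, md

# Toy example: 5 subjects, 3 trials (hypothetical scores)
scores = [[100, 104, 103], [88, 91, 90], [120, 118, 122],
          [95, 99, 97], [110, 112, 109]]
print(reliability_from_matrix(scores))
```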

  • Source
    • "Intraclass correlation coefficients (ICC) were calculated. The standard error of measurement, defined as the square root of the mean square within subjects error term using repeated measures analysis of variance, was determined and then used to calculate the minimal detectable difference (SEM x 1.96 x p 2).[16]Principal components analysis was conducted using SPSS (IBM Corp. Released 2013. IBM SPSS Statistics for Windows, Version 22.0. "
    ABSTRACT: Rett syndrome is a pervasive neurodevelopmental disorder associated with a pathogenic mutation on the MECP2 gene. Impaired movement is a fundamental component, and the Rett Syndrome Gross Motor Scale was developed to measure gross motor abilities in this population. The current study investigated the validity and reliability of the Rett Syndrome Gross Motor Scale. Video data showing gross motor abilities, supplemented with parent report data, was collected for 255 girls and women registered with the Australian Rett Syndrome Database, and the factor structure and relationships between motor scores, age and genotype were investigated. Clinical assessment scores for 38 girls and women with Rett syndrome who attended the Danish Center for Rett Syndrome were used to assess consistency of measurement. Principal components analysis enabled the calculation of three factor scores: Sitting, Standing and Walking, and Challenge. Motor scores were poorer with increasing age, and those with the p.Arg133Cys, p.Arg294* or p.Arg306Cys mutation achieved higher scores than those with a large deletion. The repeatability of clinical assessment was excellent (intraclass correlation coefficient for total score 0.99, 95% CI 0.93-0.98). The standard error of measurement for the total score was 2 points, and we would be 95% confident that a change of 4 points on the 45-point scale would be greater than within-subject measurement error. The Rett Syndrome Gross Motor Scale could be an appropriate measure of gross motor skills in clinical practice and clinical trials.
    Full-text · Article · Jan 2016 · PLoS ONE
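The excerpt above uses the minimal detectable difference formula discussed in this article, SEM × 1.96 × √2. As a reminder of where the √2 comes from (a standard derivation, not text from the citing study): a change score is the difference of two measurements, each carrying error of size SEM, so

\[
SD_{\mathrm{diff}} = \sqrt{SEM^2 + SEM^2} = SEM\sqrt{2},
\qquad
MDD_{95} = 1.96 \times SEM \times \sqrt{2}.
\]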
  • Source
    • "The grand mean is the mean of the means of each VPIT parameter . As agreement parameters (SDDs) are expressed on the actual scale of the assessments, they allow clinical interpretation of the results[34,36]. Furthermore, the SDD% can be used to compare test-retest reliability among tests[25]. We hypothesized that the SDD% is ≤ 54 % of the mean average values of the VPIT, asChen et al. (2009)found an SDD% of 54 % for the affected hand using the NHPT in patients with stroke[25]. "
    ABSTRACT: Background: Measuring arm and hand function of the affected side is vital in stroke rehabilitation. Therefore, the Virtual Peg Insertion Test (VPIT), an assessment combining virtual reality and haptic feedback during a goal-oriented task derived from the Nine Hole Peg Test (NHPT), was developed. This study aimed to evaluate (1) the concurrent validity of key outcome measures of the VPIT, namely the execution time and the number of dropped pegs, with the NHPT and Box and Block Test (BBT), and (2) the test-retest reliability of these parameters together with the VPIT's additional kinetic and kinematic parameters in patients with chronic stroke. The three tests were administered to 31 chronic patients with stroke in one session (concurrent validity), and the VPIT was retested in a second session 3-7 days later (test-retest reliability). Spearman rank correlation coefficients (ρ) were calculated for assessing concurrent validity, and intraclass correlation coefficients (ICCs) were used to determine relative reliability. Bland-Altman plots were drawn and the smallest detectable difference (SDD) was calculated to examine absolute reliability. Results: Of the 31 included patients, 11 were able to perform the VPIT solely with their affected arm, whereas 20 patients also had to use support from their unaffected arm. For n = 31, the VPIT showed low correlations with the NHPT (ρ = 0.31 for execution time (Tex [s]); ρ = 0.21 for number of dropped pegs (Ndp)) and BBT (ρ = -0.23 for number of transported cubes (Ntc); ρ = -0.12 for number of dropped cubes (Ndc)). The test-retest reliability for the parameters Tex [s], mean grasping force (Fg,go [N]), number of zero-crossings (Nzc [1/s], go/return) and mean collision force (Fc,mean [N]) was good to high, with ICCs ranging from 0.83 to 0.94. Fair reliability was found for Fg,return (ICC = 0.75) and trajectory error (Etraj,go [cm]) (0.70). Poor reliability was measured for Etraj,return [cm] (0.67) and Ndp (0.58). The SDDs were: Tex = 70.2 s; Ndp = 0.4 pegs; Fg,go/return = 3.5/1.2 N; Nzc,go/return = 0.2/1.8 zero-crossings per second; Etraj,go/return = 0.5/0.8 cm; Fc,mean = 0.7 N. Conclusions: The VPIT is a promising upper limb function assessment for patients with stroke, requiring other components of upper limb motor performance than the NHPT and BBT. The high intra-subject variation indicated that it is a demanding test for this stroke sample, which necessitates a thorough introduction to this assessment. Once familiar, the VPIT provides more objective and comprehensive measurements of upper limb function than conventional, non-computerized hand assessments.
    Full-text · Article · Jan 2016 · Journal of NeuroEngineering and Rehabilitation
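The excerpt above compares instruments through the SDD expressed as a percentage of the grand mean. Written out (standard definitions, not quoted from the citing study):

\[
SDD = 1.96 \times \sqrt{2} \times SEM,
\qquad
SDD\% = 100 \times \frac{SDD}{\bar{X}_{\mathrm{grand}}},
\]

so the SDD% of 54% that Chen et al. (2009) reported for the NHPT means the smallest change that test can distinguish from measurement error is roughly half of the average score.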
  • Source
    • "Consistent with findings on the neurocognitive sequelae of concussion for other measures, the literature has revealed moderate to large neurocognitive impairments within 1–3 days post-injury on ImPACT whether concussed athletes are compared to their own baselines (Iverson, Brooks, Collins, & Lovell, 2006; Iverson et al., 2003; McClincy, Lovell, Pardini, Collins, & Spore, 2006) or to non-injured controls (Schatz, Pardini, 1 It is worth mentioning that there has been debate about whether Pearson or ICCs are more appropriate for the estimation of test-retest reliability. Those who advocate for the use of ICCs tend to cite the statistic's ability to take into account systematic error (e.g., practice effects; Weir, 2005). Further complicating this debate is that numerous formulas for the ICC exist, some of which do not take into account systematic error. "
    ABSTRACT: Limited data exist comparing the performance of computerized neurocognitive tests (CNTs) for assessing sport-related concussion. We evaluated the reliability and validity of three CNTs (ANAM, Axon Sports/Cogstate Sport, and ImPACT) in a common sample. High school and collegiate athletes completed two CNTs each at baseline. Concussed (n=165) and matched non-injured control (n=166) subjects repeated testing within 24 hr and at 8, 15, and 45 days post-injury. Roughly a quarter of each CNT's indices had stability coefficients (M = 198-day interval) over .70. Group differences in performance were mostly moderate to large at 24 hr and small by day 8. The sensitivity of reliable change indices (RCIs) was best at 24 hr (67.8%, 60.3%, and 47.6% with one or more significant RCIs for ImPACT, Axon, and ANAM, respectively) but diminished to near the false positive rates thereafter. Across time, the CNTs' sensitivities were highest in those athletes who became asymptomatic within 1 day before neurocognitive testing but were similar to the tests' false positive rates when including athletes who became asymptomatic several days earlier. Test-retest reliability was similar among these three CNTs and below optimal standards for clinical use on many subtests. Analyses of group effect sizes, discrimination, and sensitivity and specificity suggested that the CNTs may add incrementally (beyond symptom scores) to the identification of clinical impairment within 24 hr of injury or within a short time period after symptom resolution but do not add significant value over symptom assessment later. The rapid clinical recovery course from concussion and modest stability probably jointly contribute to limited signal detection capabilities of neurocognitive tests outside a brief post-injury window. (JINS, 2016, 22, 24–37)
    Full-text · Article · Jan 2016 · Journal of the International Neuropsychological Society
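The footnoted point in the excerpt above, that ICC(2,1) accounts for systematic error while Pearson's r (and ICC(3,1)) does not, can be illustrated with simulated data. This is a hypothetical sketch, not an analysis from the cited study: a constant practice effect is added to the second trial, which leaves Pearson's r and ICC(3,1) essentially unchanged but lowers ICC(2,1).

```python
# Hypothetical simulation: a constant practice effect between two trials is
# invisible to Pearson's r and ICC(3,1) but penalizes ICC(2,1).
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2
true_score = rng.normal(100, 15, size=n)               # hypothetical true scores
trial1 = true_score + rng.normal(0, 5, size=n)         # random error only
trial2 = true_score + rng.normal(0, 5, size=n) + 8.0   # random error + practice effect
data = np.column_stack([trial1, trial2])               # n subjects x k trials

grand = data.mean()
ms_subj = k * np.sum((data.mean(axis=1) - grand) ** 2) / (n - 1)
ms_trial = n * np.sum((data.mean(axis=0) - grand) ** 2) / (k - 1)
ms_err = (np.sum((data - grand) ** 2) - (n - 1) * ms_subj
          - (k - 1) * ms_trial) / ((n - 1) * (k - 1))

pearson = np.corrcoef(trial1, trial2)[0, 1]
icc_3_1 = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)
icc_2_1 = (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err
                                + k * (ms_trial - ms_err) / n)
print(f"Pearson r = {pearson:.2f}  (blind to the +8 shift)")
print(f"ICC(3,1)  = {icc_3_1:.2f}  (random error only)")
print(f"ICC(2,1)  = {icc_2_1:.2f}  (lower, because the shift counts as error)")
```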

Questions & Answers about this publication

  • Paulo Roberto Garcia Lucareli added an answer in Physiotherapy:
    What is the minimal clinically important difference for shoulder ROM?

    Can someone help me with any reference about MCID for shoulder ROM?

    Paulo Roberto Garcia Lucareli

    Hi Bruno,

    Why don't you calculate the MCID based on your own sample?

    A good option, in my opinion, is the "minimal difference needed to be considered real" described in: Weir, J. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. Journal of Strength and Conditioning Research, v. 19, n. 1, p. 231-240, 2005 (see the worked example below this reply).

    https://www.researchgate.net/publication/8028009_Quantifying_test-retest_reliability_using_the_intraclass_correlation_coefficient_and_the_SEM

    Best
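
    Following up on the suggestion above, here is a worked example of the minimal difference (MD) from Weir (2005), with hypothetical shoulder ROM reliability numbers (an ICC of 0.90 and a between-subject SD of 10°, chosen purely for illustration):

    \[
    SEM = SD\sqrt{1 - ICC} = 10^\circ\sqrt{1 - 0.90} \approx 3.2^\circ,
    \qquad
    MD = SEM \times 1.96 \times \sqrt{2} \approx 8.8^\circ.
    \]

    On these assumed values, a patient's shoulder ROM would need to change by roughly 9° before one could be 95% confident that a real change, rather than measurement error, had occurred.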
