Issues for selection of outcome measures in stroke rehabilitation: ICF Participation
K. SALTER1, J.W. JUTAI1,3, R. TEASELL1,2, N.C. FOLEY1, J. BITENSKY1 & M. BAYLEY3
1Department of Physical Medicine and Rehabilitation, St. Joseph’s Health Care London, 2University of Western Ontario,
London, Ontario, Canada, and 3Neurorehabilitation Program, Toronto Rehabilitation Institute, Toronto, Ontario, Canada
(Accepted date August 2004)
Purpose. To evaluate the psychometric and administrative properties of outcome measures in the ICF Participation
category, which are used in stroke rehabilitation research and reported in the published literature.
Method. Critical review and synthesis of measurement properties for six commonly reported instruments in the stroke
rehabilitation literature. Each instrument was rated using the eight evaluation criteria proposed by the UK Health
Technology Assessment (HTA) programme. The instruments were also assessed for the rigour with which their reliability,
validity and responsiveness were reported in the published literature.
Results. Validity has been well reported for at least half of the measures reviewed. However, methods for reporting specific
measurement qualities of outcome instruments were inconsistent. Responsiveness of measures has not been well
documented. Of the three ICF categories, Participation seems to be most problematic with respect to: (a) lack of consensus
on the range of domains required for measurement in stroke; (b) much greater emphasis on health-related quality of life,
relative to subjective quality of life in general; (c) the inclusion of a mixture of measurements from all three ICF categories.
Conclusions. The reader is encouraged to examine carefully the nature and scope of outcome measurement used in
reporting the strength of evidence for improved participation associated with stroke rehabilitation. There is no consensus
regarding the most important indicators of successful involvement in a life situation and which ones best represent the
societal perspective of functioning. In particular, quality of life outcomes lack adequate conceptual frameworks to guide the
process of development and validation of measures.
Measuring the effectiveness of rehabilitation inter-
ventions is accepted as essential to good practice.
Van der Putten et al.  point out that measuring
the outcome of health care is a ‘central component
of determining therapeutic effectiveness and, there-
fore, the provision of evidence-based healthcare’.
Reliability, validity, and administrative burden are
properties of measurement instruments that affect
the credibility of the measurement process [2, 3]
and the reporting of research findings [4–6].
Remarkably few published rehabilitation outcome
studies appear to report these properties adequately
in defending their research designs and interpreting
their results.
Recently, there have been important advances in
compiling and publishing the best-available scientific
evidence examining the effectiveness of stroke
rehabilitation [7, 8]. However, there are limitations
to the successful transfer of the research results to
clinical practice and service delivery, in part due to a
lack of consensus on the selection of measures to best
address and balance the needs and values of
stakeholders in stroke rehabilitation, including patients and
healthcare decision makers. Ultimately, the comparison
of size and direction of treatment effects across
areas of stroke rehabilitation will be most mean-
ingfully interpreted when it is clear that comparable
approaches to outcome measurement have been
used. To enhance the clinical meaningfulness of
Correspondence: Department of Physical Medicine and Rehabilitation, St. Joseph’s Health Care London and University of Western Ontario, 801
Commissioners Road East, London (Ontario) N6C 5J1, Canada. E-mail: Katherine.Salter@sjhc.london.on.ca
Disability and Rehabilitation, 2005; 27(9): 507–528
ISSN 0963-8288 print/ISSN 1464-5165 online ª 2005 Taylor & Francis Group Ltd
the current evidence, this paper presents the best
available information on how outcome measures
might be classified and selected for use, based upon
their measurement qualities. For this purpose, we
have selected for review only some of the more
commonly used measures in stroke rehabilitation.
This paper is not intended to be a comprehensive
compendium of stroke outcome measures.
This paper attempts to describe how the ICF
[10, 11] conceptual framework can be used for
classifying outcome measures in stroke rehabilitation,
and summarize aspects of measurement theory that
are pertinent for evaluating measures. It also gives a
template presentation on the characteristics, application,
reliability, validity and other
qualities of commonly used measures in a format
for easy reference. For a more extensive discussion of
outcome measurement theory and properties in
physical rehabilitation, the reader is referred to Finch
et al. This paper will present only the information
most relevant for stroke rehabilitation.
Classification of stroke rehabilitation outcomes
To be effective, outcomes research requires a
systematic approach to describing outcomes and
classifying them meaningfully. The study and assess-
ment of stroke rehabilitation has sparked the
development of numerous outcome measures applic-
able to one or more of its dimensions. In attempting
to discuss some of the commonly used measures
available for use within the field of stroke rehabilita-
tion, it is useful to have guidelines available for
classifying these tools. The WHO International
Classification of Functioning, Disability and Health
(ICF) [10, 11] provides a multi-dimensional framework
for health and disability that is suited to the
classification of outcome instruments.
Originally published in 1980, the WHO frame-
work has undergone several revisions. In the most
recent version, the ICF framework [10, 11] identifies
three primary levels of human functioning—the body
or body part, the whole person, and the whole person
in relation to his/her social context. Outcomes may
be measured at any of these levels—Body functions/
structure (impairment); Activities (refers to the
whole person—formerly conceived as disability in
the old ICIDH framework) and Participation (for-
merly referred to as handicap) (Table I). Activity and
Participation are affected by environmental and
personal factors (referred to as contextual factors
within the ICF, Table I).
Outcome measures can also be conceived of as
falling along a continuum, moving from measure-
ments at the level of body function or structure to
those focused on participation and life satisfaction. It
becomes more difficult to attribute outcomes to
particular rehabilitation interventions as one moves
away from body structure toward participation, since
many variables other than the interventions might
account for changes observed [13, 14].
We reviewed the findings from a number of recent
studies that have examined the patterns of scale use
in various settings, both clinical [15–17] and
research [14, 18–21]. In the absence of an authoritative
list of recommended stroke rehabilitation measures, we
focused our review on scales with which most stroke
specialists would be familiar.
Table II presents 20 of the most popular instru-
ments from the stroke rehabilitation literature,
classified by ICF category by the primary author,
after consideration of the study author’s stated
purpose for the tool and the content of the
instrument’s items. This subjective component was
introduced because there is no published consensus
on how this kind of classification should proceed.
The classification was reviewed independently by the
co-authors. Table II reflects the consensus among
the paper’s authors.
If a classification is to be useful for scientific
research, the basic categories and concepts within it
need to be measurable, and their boundaries clear
and distinct. It is not yet clear from the research
evidence whether the three ICF categories com-
pletely fulfill these criteria. Nonetheless, when
applied to outcome assessment in stroke rehabilita-
tion the ICF conceptual framework can be used to
place outcome measures into one of the three
categories depending upon what it is they purport to measure.
Table I. ICF Definitions.
Old terminology | New terminology | Definition
Impairment | Body function/structure | Physiological functions of body systems, including psychological functions. Structures are anatomical parts or regions of the body and their components. Impairments are problems in body function or structure.
Disability | Activity | The execution of a task by an individual. Limitations in activity are defined as difficulties an individual might experience in completing a given activity.
Handicap | Participation | Involvement of an individual in a life situation. Restrictions to participation describe difficulties experienced by the individual in a life situation or role.
K. Salter et al.
It should be noted that linking existing measures
to the ICF is not a straightforward process [22–24].
Many existing measures include items that fall into
several ICF dimensions in addition to items that may
not be included in the ICF at all. Instruments
appearing in the Participation domain, for instance,
assess participation in life situations such as social
functioning or roles, but include the assessment of
elements of one or both of the Body Structure/
Function and Activities categories. While these
measures have been used to assess health-related
quality of life, it is not the intent of this paper to
define this construct or its assessment.
The present study was not intended as an attempt
to provide item by item mapping for each of the
identified measures. The ICF was used as a framework
within which measures were classified according
to the level of assessment they include furthest along a
continuum from body function, through activity to
participation. However, the process of developing
systematic approaches to establishing linkages be-
tween existing measures and the ICF is an important
one in the ongoing attempt to create an international
language and standard for measurement.
Evaluation criteria for outcome measures
While it is useful to have the ICF framework within
which to classify levels of outcome measures, it is
necessary to have a set of criteria to guide the
selection of outcome measures. Reliability, validity
and responsiveness have widespread usage and are
essential to the evaluation of outcome measures
[1, 14, 19, 25]. Finch et al.  provide a good
tutorial on the general issues for outcome measures.
The Health Technology Assessment (HTA) programme
examined 413 articles that focused on
methodological aspects of the use and development of
patient-based outcome measures. In their report, they
recommended the use of eight evaluation criteria.
Table III lists the criteria and gives a definition for
each one. It also identifies a recommended standard
for quantifying (rating) each criterion, where applic-
able, and how the ratings should be interpreted. The
criteria, including some additional considerations
described below, were applied to each of the outcome
measures reviewed in this paper.
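As an illustration only, the numeric standards given in Table III for validity correlations and for ROC analysis can be read as simple threshold rules. The sketch below is not part of the HTA criteria. The function names are hypothetical, and two choices are assumptions made here: the sign of a correlation is ignored, and values are rounded to two decimals so that the published bands (for example 0.31–0.59) cover every value.

```python
def rate_validity_correlation(r):
    """Rate a construct, convergent or concurrent validity correlation."""
    # Assumption: only the magnitude matters, compared at two decimals.
    magnitude = round(abs(r), 2)
    if magnitude >= 0.60:
        return "Excellent"
    if magnitude >= 0.31:
        return "Adequate"
    return "Poor"  # 0.30 or lower


def rate_auc(auc):
    """Rate the area under a ROC curve (AUC)."""
    value = round(auc, 2)
    if value >= 0.90:
        return "Excellent"
    if value >= 0.70:
        return "Adequate"
    return "Poor"  # below 0.70


# Sample inputs: three correlations reported for the EQ-5D in Table IV.
samples = [
    ("VAS pain scale vs. pain", 0.71),
    ("BI vs. self-care", -0.64),
    ("HADS mood vs. depression", 0.35),
]
for label, r in samples:
    print(label, rate_validity_correlation(r))
```

Such a rule only labels a single coefficient; the ratings reported in this paper also reflect reviewer judgement and consensus among the authors.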
Each measure reviewed in this paper was also
assessed for the thoroughness with which its relia-
bility, validity and responsiveness have been reported
in the literature. Standards for evaluation of rigour
were adapted from McDowell & Newell and
Andresen. The authors assessed rigour in the
manner described above for the other ratings, and
scored each instrument on each of the three proper-
ties as follows: +++ Excellent—most major forms of
testing reported; ++ Adequate—several studies and/
or several types of testing reported; + Poor—minimal
information is reported and/or few studies (other
than author’s); N/a—no information available. For
example, a rating of ‘+++’ (or excellent) for validity
meant that evidence has been presented demonstrating
excellent construct validity, according to the
standards provided, and in various forms including
convergent and discriminant validity.
In addition to the criteria outlined above, three
additional questions were considered. Has the
measure been used in a stroke population? Has the
measure been tested for use with proxy assessment?
What is the recommended time frame for measurement?
The primary author reviewed and rated each
instrument using these evaluative criteria. The
results were reviewed independently by the co-
authors. There were very few instances of disagree-
ment among raters, and they were never more than
one level apart in their evaluations. The results
presented in this paper reflect the authors’ consensus
on ratings after discussing all discordant ratings.
Has the measure been used in a stroke population?
Reliability and validity are not fixed qualities of
measures. They should be regarded as relative
Table II. Classification of outcome measures.
Body structure (impairments)
1. Beck Depression Inventory
2. Fugl-Meyer Assessment
3. Mini Mental State Examination
4. Modified Ashworth
5. Motor-free Visual Perception
Activities (limitations to activity; disability)
6. Barthel Index
7. Berg Balance Scale
8. Chedoke McMaster Stroke Assessment Scale
9. Functional Independence Measure (FIM)
10. Frenchay Activities Index
11. Modified Rankin Handicap Scale
12. Rivermead Motor Assessment
13. Rivermead Mobility Index
14. Timed-Up-and-Go (TUG)
Participation (barriers to participation; handicap)
15. EuroQol Quality of Life Scale (EQ-5D)
16. Medical Outcomes Study Short Form 36
17. Nottingham Health Profile
18. Sickness Impact Profile (stroke adapted)
19. Stroke Impact Scale
20. Stroke Specific Quality of Life
Table III. Evaluation criteria and standards.

1. Appropriateness
Definition: The match of the instrument to the purpose/question under study. One must determine what information is required and what use will be made of the information.
Standard: Depends upon the specific purpose for which the measurement is intended.

2. Reliability
Definition: Refers to the reproducibility and internal consistency of the instrument. Reproducibility addresses the degree to which the score is free from random error. Test-retest and inter-observer reliability both focus on this aspect of reliability and are commonly evaluated using correlation statistics including ICC, Pearson’s or Spearman’s coefficients and kappa coefficients (weighted or unweighted). Internal consistency assesses the homogeneity of the scale items. It is generally examined using split-half reliability or Cronbach’s alpha statistics. Item-to-item and item-to-scale correlations are also accepted methods.
Standard:
Test-retest or interobserver reliability (ICC; kappa statistics) [4, 86, 87].
Note: Fitzpatrick et al. recommend a minimum test-retest reliability of 0.90 if the measure is to be used to evaluate the ongoing progress of an individual.
Internal consistency (split-half or Cronbach’s α statistics).
Note: Fitzpatrick et al. caution that α values in excess of 0.90 may indicate redundancy.
Inter-item and item-to-scale correlation coefficients: adequate levels are between 0.3 and 0.9 (inter-item) and between 0.2 and 0.9 (item-to-scale) [26, 59].

3. Validity
Definition: Does the instrument measure what it purports to measure? Forms of validity include face, content, construct, and criterion. Concurrent, convergent or discriminative, and predictive validity are all considered to be forms of criterion validity. However, concurrent, convergent and discriminative validity all depend on the existence of a ‘gold standard’ to provide a basis for comparison. If no gold standard exists, they represent a form of construct validity based on the relationship to another measure.
Standard:
Construct/convergent and concurrent correlations: Excellent: ≥0.60, Adequate: 0.31–0.59, Poor: ≤0.30 [4, 26, 27, 88].
ROC analysis, AUC: Excellent: ≥0.90, Adequate: 0.70–0.89, Poor: <0.70.
There are no agreed-on standards by which to judge sensitivity and specificity as a validity index.

4. Responsiveness
Definition: Sensitivity to changes within patients over time (which might be indicative of therapeutic effects). Responsiveness is most commonly evaluated through correlation with other change scores, effect sizes, standardized response means, relative efficiency, sensitivity and specificity of change scores and ROC analysis. Assessment of possible floor and ceiling effects is included as they indicate limits to the range of detectable change beyond which no further improvement or deterioration can be detected.
Standard:
Sensitivity to change: Evidence of change in expected direction using methods such as standardized effect sizes; also, by way of standardized response means, ROC analysis of change scores (area under the curve, see above) or relative efficiency. Evidence of moderate/less change than expected. Weak evidence based solely on p-values (statistical significance) [4, 26, 27, 88].
Floor/ceiling effects: Excellent: no floor or ceiling effects. Adequate: floor and ceiling effects ≤20% of patients who attain either the minimum (floor) or maximum (ceiling) score. Poor: >20%.

5. Precision
Definition: Number of gradations or distinctions within the measurement, e.g. yes/no response vs. a 7-point Likert scale.
Standard: Depends on the precision required for the purpose of the measurement (e.g., classification, evaluation, prediction).
indicators of how well the instrument might function
within a given sample or for a given purpose [26, 28].
Responsiveness, too, may be condition or purpose
specific. Van der Putten et al., for example, in an
evaluation of the Barthel Index and Functional
Independence Measure, found both measures to be
equally responsive in terms of effect sizes when used
among stroke patients and patients with multiple
sclerosis. Within the stroke group, floor and ceiling
effects were within acceptable limits on both
measures. However, the authors point out that
within the MS patient group, there were larger
ceiling effects associated with the BI scores and the
scores from the FIM cognitive subscale. This,
coupled with the much smaller effect sizes noted
among MS patients, leads the authors to suggest that
these two instruments are better suited for use
among stroke patients. Therefore, it would seem
important for a measure to have been tested for use
in the population within which it will be applied.
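The responsiveness statistics mentioned above can be illustrated with a short sketch. It uses the common definitions of the standardized effect size (mean change divided by the standard deviation of baseline scores) and the standardized response mean (mean change divided by the standard deviation of the change scores), together with the percentage of patients at the floor or ceiling of a scale. The function names and the admission and discharge scores are hypothetical and are not taken from any study cited here.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the sample SD of the baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the sample SD of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def floor_ceiling_percent(scores, minimum, maximum):
    """Percentage of patients scoring at the scale minimum and maximum."""
    n = len(scores)
    floor = 100.0 * sum(1 for s in scores if s == minimum) / n
    ceiling = 100.0 * sum(1 for s in scores if s == maximum) / n
    return floor, ceiling


# Hypothetical scores on a 0-100 scale for five patients.
admission = [40, 55, 60, 35, 70]
discharge = [60, 70, 85, 50, 100]

print(effect_size(admission, discharge))
print(standardized_response_mean(admission, discharge))
print(floor_ceiling_percent(discharge, 0, 100))
```

Because both statistics depend on the variability of the sample, the same instrument can appear more or less responsive in different patient groups, which is the point made by the comparison of stroke and MS patients.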
Has the measure been tested for use with proxy assessment?
When assessment is conducted in such a way as to
require a form of self-report (e.g. interview or
questionnaire—in person, by telephone or by mail),
stroke survivors who have experienced significant
cognitive or speech and language deficits may be
excluded from assessment because of their inability
to complete it. In such cases, the use of a proxy
respondent becomes an important alternative source
of information. However, the use of proxy respon-
dents should be approached with caution.
Studies of proxy assessments report a tendency for
significant others, including family members, to
assess patients as more disabled than they appear
on other measures of functional disability, including
self-reported methods. This discrepancy becomes
more pronounced for patients with more impaired
levels of functioning [29–31]. Hachisuka et al. 
suggested that this discrepancy could be explained by
a difference in interpretation. Proxy respondents may
be rating actual, observable performance, while
patients may rate their perceived capability—what
they think they are capable of doing rather than what
they actually do.
Unfortunately, using a healthcare professional as a
substitute for the family member or significant other
as proxy does not solve this problem. A similar
discrepancy has been noted in ratings when using
healthcare professionals as proxy respondents though
in the opposite direction. They may tend to rate
patients higher than the patients themselves would
[30, 32]. It has been suggested that, in this case, the
discrepancy is due to a difference in frame of
reference. A healthcare professional may use a
different, more disabled group, as a reference norm
whereas the patient would only compare him/herself
to pre-stroke conditions.
What is the recommended timeframe for measurement?
The natural history of stroke presents problems in
assessment in that the rate and extent of change in
outcomes varies across the different levels of ICF
classification. The further one moves along the
outcome continuum from body structure toward
participation, the more time it may take to reach a
measurement end point; that is, participation within
a defined social context may take longer to stabilize
than the impaired body structure.
Jorgensen et al. demonstrated that maximal recovery
occurred, in most patients, within the first 13 weeks
following a stroke even though the time course of
both neurological and functional recovery was
strongly related to initial stroke severity. They
suggested that a valid prognosis of functional
recovery might be made within the first 6 months.
According to Mayo et al. , by 6 months post-
stroke, physical recovery is complete, for the most
part, with additional gains being a function of
Table III. (continued)

6. Interpretability
Definition: How meaningful are the scores? Are there consistent definitions and classifications for results? Are there norms available for comparison?
Standard: Jutai & Teasell point out these practical issues should not be separated from consideration of the values that underscore the selection of outcome measures. A brief assessment of practicality will accompany each summary.

7. Acceptability
Definition: How acceptable the scale is in terms of completion by the patient. Does it represent a burden? Can the assessment be completed by proxy, if necessary?

8. Feasibility
Definition: Extent of effort, burden, expense and disruption to staff/clinical care arising from the administration of the instrument.
learning, practice and confidence. Duncan et al. 
support this time frame for assessment of neurolo-
gical impairment and disability outcomes but suggest
that participation outcomes should not be measured
sooner than 6 months post-stroke, to provide the
opportunity for the patient’s social situation to
stabilize. They also suggest that assessments at the
time of discharge not be used as endpoint measure-
ments. They argue that variability in treatment
interventions and length of stay practices decreases
the comparative usefulness of this information.
Review of participation outcome measures
This paper is the final in a series of three, and deals
with the third level or category of the ICF classification
system: Participation.
The necessity for clearly defined boundaries
between categories of classification is most apparent
when one considers the ICF dimensions of Activity
and Participation. Given that the domains associated
with activity and participation are presented as a
single, neutral list with several classification options,
it is not surprising that questions have arisen
with regard to the validity of separating them into
distinct dimensions. While exploration of this issue is
ongoing, it is worth noting that Jette et al. 
recently provided empirical evidence of distinctly
differing concepts conforming to the dimensions of
activity and participation as defined within the ICF
and suggested that the participation domain may
represent more complex categories of ‘life behaviours’.
Granlund et al. (2004) suggest that the participation
dimension of the ICF can be used effectively to link
items on existing measures to the ICF, although they
do not attempt to define its relationship to activity.
Keeping in mind that the fit of a given instrument
within a single category is rarely perfect, measures
appearing in this section focus on the assessment of
Participation. As defined by the ICF [10, 11],
Participation is involvement in a life situation and
represents the societal perspective of functioning.
Participation restrictions, therefore, are problems an
individual may experience in involvement in life
situations or roles. If an activity limitation prevents
a person from attending school or being employed,
this is a participation restriction (handicap). In
contrast with the ICF Activities category, tasks
subsumed within the Participation level are relatively
complex, related to quality of life, performed
with others, more dependent upon environmental
influences, and assessed in the community by self or
caregivers. According to Perenboom and
Chorus , involvement in life situations includes
the concept of autonomy ‘even if one is not actually
doing things themselves’, and therefore, the assess-
ment of participation should include the fulfilment
of personal goals and societal roles rather than
performance-based indicators alone.
The EuroQol Quality of Life Scale (EQ-5D)
The EuroQol scale (EQ-5D) is a generic index
instrument, which was developed by a multi-country,
multi-disciplinary team and is used to value and
describe health states . The EQ-5D was intended
to be brief and simple to administer representing
little or no burden to the patient. It focuses on a core
set of generic, health-related quality of life items to
provide a broad, generic assessment. The EQ-5D
was intended to promote the collection of a common
data set for reference purposes or as a complement to
other, more comprehensive measures [27, 39–41].
The EQ-5D is a self-administered questionnaire, in
two parts. The first part contains a simple descriptive
profile of health in five dimensions (mobility, self-
care, usual activities, pain/discomfort and anxiety/
depression). Each dimension is represented by three
statements corresponding to three levels of difficulty
(no problems, some or moderate problems and extreme problems)
within that dimension. The respondent chooses the
statement within each dimension that is most applicable
to him/herself at the time of assessment.
Each dimension statement selected receives a
numerical rating of 1 (no problems), 2
(some or moderate problems) or 3 (extreme problems). These
ratings are combined such that each combination of
choices creates a 5-digit expression of a health state.
Theoretically, there are 243 such representations
possible. By applying scores from a standard set of
values, each of these health states can be transformed
into a utility value ranging from 0 (worst possible) to
1 (best possible). Standard weights or preferences
were derived from population data obtained using
time trade-off techniques [12, 42]. Values have been
elicited for health states in Canada, Denmark, Fin-
land, Germany, Japan, Netherlands, New Zealand,
Slovenia, Spain, Sweden, UK, US and Zimbabwe.
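The way five ratings combine into a 5-digit health state, and the count of 243 possible states (three levels on each of five dimensions, 3^5), can be sketched as follows. This is a minimal illustration rather than official EuroQol scoring code: the function and variable names are hypothetical, the dimension order follows the description above, and no utility weights are included because those come from published country-specific value sets.

```python
from itertools import product

# Dimension order as described in the text; levels are coded 1 to 3.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVELS = (1, 2, 3)


def encode_health_state(ratings):
    """Join one rating per dimension into a 5-digit health state string."""
    digits = []
    for dimension in DIMENSIONS:
        if dimension not in ratings:
            raise ValueError(f"missing rating for {dimension}")
        level = ratings[dimension]
        if level not in LEVELS:
            raise ValueError(f"rating for {dimension} must be 1, 2 or 3")
        digits.append(str(level))
    return "".join(digits)


# Every theoretically possible health state, from "11111" to "33333".
ALL_STATES = ["".join(map(str, combo))
              for combo in product(LEVELS, repeat=len(DIMENSIONS))]

example = encode_health_state({
    "mobility": 1,
    "self-care": 1,
    "usual activities": 2,
    "pain/discomfort": 2,
    "anxiety/depression": 3,
})
print(example, len(ALL_STATES))
```

A utility value would then be obtained by looking up the encoded state in the value set chosen for the population under study.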
Part 2 of the EQ-5D consists of a visual analogue
scale (VAS) on which respondents rate their current
state of health from 0 (worst imaginable) to 100 (best imaginable).
While the EQ-5D was originally designed for self-administration,
it can be administered by interview. It is brief to
complete and yields three types of information: a
profile indicating the extent of problems experienced
on each of five dimensions, a population-weighted
health index and a self-rated assessment of current
perceived health. The scale is in the public
domain and may be used without cost for the most
part. Restrictions on the use of the scale as well
as current information and references regarding
the EQ-5D are available from the website www.
The measurement properties of the EQ-5D are
summarized in Table IV.
Advantages. The EQ-5D is very short and simple.
High response rates have been reported (80%
and 80–86%). Reports of missing data are
mixed, although rates are relatively low overall [43, 45].
The scale also provides considerable flexibility.
Though designed as a self-completed instrument to
be administered by post, it can be administered in
face-to-face interviews and has been evaluated for
use with proxy respondents. In addition, the data can
be presented and used in three distinct forms; a
patient profile in five domains based on unweighted
responses, a health utility or index and an overall
rating of perceived health.
Limitations. The level of reliability reported would
suggest that the instrument may not be suitable for
use in serial assessments of individual patients. It
would be more appropriate for use in the study and
comparison of groups [44, 45].
Brazier et al.  reported missing data rates of
10% when using the EQ-5D in an elderly population
(mean age 80.1 years). This observation is supported
by Coast et al.  who demonstrated that the ability
to self-complete the EQ-5D is directly related to age
and cognitive function (p<0.0001). The authors
also report that the probability of requiring interview
administration to complete the scale increases from
11% at age 65 to 73% at age 85. This would increase
the costs associated with using the EQ-5D with older respondents.
While the scale has been assessed for use with
proxy respondents post stroke, Dorman et al. 
observed that reliability was consistently lower when
a proxy respondent completed the questionnaire on
the patient’s behalf. Levels of agreement between
proxy respondents and patients were acceptable for
mobility and self-care. However, the more subjective
the domain, the lower were the levels of agreement.
In the case of depression/anxiety, agreement was no
better than chance among the more severely affected
stroke survivors.
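Agreement "no better than chance" has a precise meaning when agreement is summarized with a kappa coefficient: kappa compares the observed proportion of agreement with the proportion expected if the two raters responded independently, so a value of 0 indicates chance-level agreement and 1 indicates perfect agreement. The sketch below computes the unweighted form of Cohen's kappa. The function name and the patient and proxy ratings are hypothetical, and this is not a reproduction of the analysis in the study cited above.

```python
from collections import Counter


def cohen_kappa(patient, proxy):
    """Unweighted Cohen's kappa for two equal-length lists of ratings."""
    if len(patient) != len(proxy) or not patient:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(patient)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(patient, proxy)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    patient_counts = Counter(patient)
    proxy_counts = Counter(proxy)
    categories = set(patient_counts) | set(proxy_counts)
    expected = sum(patient_counts[c] * proxy_counts[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings (levels 1 to 3) on a single EQ-5D dimension.
perfect = cohen_kappa([1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3])
chance_level = cohen_kappa([1, 1, 2, 2], [1, 2, 1, 2])
print(perfect, chance_level)
```

In the second example the raters agree on half of the patients, but that is exactly what their marginal frequencies would produce by chance, so kappa is 0.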
The health state valuations used in the EQ-5D
utility were derived from time trade-off techniques.
These techniques may be prone to biases and have
been shown to elicit lower values for minor and
major stroke than standard gamble techniques.
The ratings of methodological rigour associated with
evaluation of the measurement properties of the EQ-
5D are presented in Table X.
Interpretability. The EQ-5D uses population-based utility
weights (a set of empirically derived valuations) to
provide a standard set of utility values for the 5-digit
health state derived from the 5-domain index. These
weights are available for a large number of countries
and cultures. The health profile may also be
considered as an unweighted profile in 5-dimensions
and is accompanied by a rating of perceived health.
Acceptability. Although designed to be short and
simple, reports of missing data are mixed. Essink-
Bot et al.  report higher rates of missing data for
the EQ-5D than for the NHP or SF-36. However, its
simplicity and brevity remain an advantage for use
with stroke survivors. It has been evaluated for use
with proxy respondents although only the mobility
and self-care domains remain reliable.
Feasibility. The EQ-5D is designed as a self-completion
questionnaire that may be administered as a
postal survey or face-to-face interview. It requires no
special training to administer and both the scale itself
and supporting information are readily available.
Medical Outcomes Study Short Form 36
The Medical Outcomes Study Short Form 36 (SF-
36) is a generic health survey created as part of the
Medical Outcomes Study to assess health status in
the general population. It comprises 36
items drawn from the original 245 items generated
by that study [50, 51].
Items are organized into eight dimensions or
subscales; physical functioning, role limitations—
physical, bodily pain, social functioning, general
mental health, role limitations—emotional, vitality,
and general health perceptions. It also includes two
questions intended to estimate change in health
status over the past year. These two questions remain
separate from the eight subscales and are not scored.
With the exception of the general change in health
status questions, subjects are asked to respond with
reference to the past 4 weeks. An acute version of the
SF-36 refers to problems in the past week only.
The recommended scoring system uses a weighted
Likert system for each item. Items within subscales
are summed to provide a summed score for each
subscale or dimension. Each of the eight summed
scores is linearly transformed onto a scale from 0–
100 to provide a score out of 100 for each subscale.
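The linear transformation onto a 0–100 scale can be sketched as below. The text states only that the transformation is linear; the specific form used here, rescaling by the lowest and highest possible summed scores, is the usual way of doing this and is an assumption. The function name and the example subscale (10 items, each scored 1 to 3) are hypothetical illustrations rather than the published scoring algorithm.

```python
def transform_to_0_100(raw_sum, lowest_possible, highest_possible):
    """Linearly rescale a summed subscale score onto a 0-100 range."""
    if highest_possible <= lowest_possible:
        raise ValueError("highest_possible must exceed lowest_possible")
    if not lowest_possible <= raw_sum <= highest_possible:
        raise ValueError("raw_sum lies outside the possible range")
    score_range = highest_possible - lowest_possible
    return 100.0 * (raw_sum - lowest_possible) / score_range


# Hypothetical subscale: 10 items scored 1 to 3, so sums run from 10 to 30.
midpoint_score = transform_to_0_100(20, 10, 30)
print(midpoint_score)
```

Rescaling in this way lets subscales with different numbers of items and response options be reported on a common metric.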
In addition, a physical component (PCS) and mental
component score (MCS) can be derived from the
scale items. Standardized population data for several
Table IV. Measurement Properties of the Euroqol-5D.
– Test-retest: Hurst et al. reported the EQ-profile showed no significant change in any of 5 domains in self-reported
stable patients (p≥0.02), EQ-utility and EQ-VAS ICC=0.73 and 0.70 respectively, over 3 months and 0.78 and 0.85
over a 2-week retest interval; Dorman et al. reported κ=0.66 (usual activities) to 0.85 (mobility) for the index,
ICC=0.86 for the VAS self-rating of health status and ICC=0.83 for the health utility; Brazier et al. reported
r=0.53 (VAS) and r=0.67 (utility index) over a 6-month retest period.
– Construct validity: Patients reporting problems on EuroQol domains reported dysfunction on a standardized instrument
in that domain—OPCS locomotion related to mobility (r=0.61), BI to self-care (r=70.64), FAI to usual activities
(r=70.60), VAS pain scale to pain (r=0.71), HADS mood to anxiety (r=0.56) and depression(r=0.35)—median
scores on standard instruments were ordered appropriately when compared with EuroQol levels (p40.0002) ;
Hurst et al.  reported EQ-5D levels of self-care, pain and anxiety/depression scores related to corresponding mean
and median scores on standardized assessments (HAQ, Pain-VAS and HAD-mood; p50.001), EQ-utility and EQ-
VAS correlated with measures of disease activity (r=0.32 to 0.57) as well as subjective measures of mood (HAD,
r=-0.56 and -0.59) and pain (r=-0.73 and -0.63); Cup et al. reported EQ-5D correlated with standardized
functional measures—BI (0.7) and FAI (0.65); Johnson & Coons  reported increasing age related to 4/5 EQ
dimensions (all except anxiety/depression; p50.01) as hypothesized—employment status, education, household
income, marital status and presence of chronic medical problems all significantly related to EQ-5D dimension scores in
the expected direction (p<0.05).
– Construct validity (known groups): EuroQol scores on all dimension able to discriminate migraine sufferers from controls
(p40.03, ROC/AUC=0.50–0.59) and between groups of migraine sufferers based on absence from work 0 vs. 50.5
days; p50.0, ROC/AUC=0.54–0.70) ; EuroQol profiles distinguished between major stroke syndrome groups
and between groups based on baseline stroke severity—the EuroQol VAS ratings of overall health could also
discriminate groups based on severity (p50.05) ; Hurst et al.  reported increasing problems on all 5 EQ-5D
domains associated with increasing functional class in RA patients stratified by function (p50.001), EQ utilities
discriminated between all RA functional classes (p50.001), EQ-VAS discriminated between functional class 1,2 3
(p50.001) but not between 3 and 4 (more severe) ; Brazier et al.  reported EuroQol utility and VAS scores
distinguished groups based on recent visits to GP, hospital inpatient stays and longstanding illness (p50.05).
– Concurrent validity: Essink-Bot et al.  reported EQ dimensions correlated with corresponding COOP/WONCA
chart items—anxiety/depression with COOP feelings (r=0.83), usual activities with COOP daily activities (r=0.75)
and COOP social activities (r=0.61); EQ-5D dimension scores most closely related to corresponding COOP-
WONCA charts—physical chart to EQ mobility and self-care (r=0.39 and 0.34), feelings to EQ anxiety/depression
(r=0.70), daily activities with usual activities (r=0.59), pain to EQ pain (r=0.74) – COOP overall health related to
EQ-5D utility score (r=-0.53) and VAS rating (r=-0.65) as was COOP change in health (r=-0.35); Cup et
al.  reported EQ-5D correlated with SA-SIP30 (0.48); Dorman et al.  demonstrated mobility, self-care and
usual activities correlated most strongly with SF-36 physical functioning (r=0.57, 0.65 and 0.63, respectively), pain
correlated with SF-36 bodily pain (r=0.66)—but anxiety/depression domain of EuroQol was moderately correlated
with all SF-36 subscales—r=0.21 (mental health) to r=0.44 (general health)—VAS rating correlated most strongly to
SF-36 general health (r=0.66); EQ-VAS rating correlated with SF-12 PCS and MCS (r=0.55 and 0.41, respectively)
; Bosch & Hunink  found EQ-5D score (utility) correlated with HUI3 before and after treatment for
intermittent claudication (ICC=0.49–0.78), change in EQ-5D and HUI3 scores over time was significantly correlated
(ICC=0.30, p50.01)—change in EQ-5D scores correlated with change in SF-36 scores on all dimensions—
ICC=0.22 (energy) – 0.43 (pain); EQ-5D index correlated with SF-12 PCS (r=0.64) and MCS (r=0.52) and HUI3
(r=0.69)—the EQ-VAS rating correlated with SF-12 PCS (r=0.61), MCS (r=0.41) and HUI3 (r=0.56) as well as
with SF-12 self-perceived health (r=0.61) and the HUI overall health rating (r=0.70) .
– Construct Validity (convergent/divergent): Johnson & Coons  reported EQ mobility, self-care, usual activities and pain
related to SF-12 PCS (0.12–0.41) but not to SF-12 MCS (0.02–0.04) while EQ anxiety and depression related to SF-
12 MCS (0.40) but not to SF-12 PCS (0.03); Lubetkin & Gold  reported EQ-5D mobility correlated more with
HUI ambulation (0.59) and SF-12 PCS (0.49) than to HUI emotion (0.23) or MCS (0.24)—the EQ-5D pain/
discomfort dimension showed a similar pattern of correlation and EQ-5D anxiety/depression more strongly correlated
with MCS (0.48) and HUI emotion (0.55) than SF-12 PCS (0.23) or HUI ambulation (0.22).
– Examination of distribution of VAS ratings and EuroQol utility scores did not suggest problems with ceiling or floor
effects , examination of distribution of baseline scores prior to admission to early discharge programme revealed
minimum (no problem) scores in excess of 20% of patients on all dimensions but usual activities—maximum scoring
occurred in 47% of patients on the usual activities dimension (all other dimensions 520%)—distribution of EQ utility
scores and VAS ratings revealed no floor or ceiling effects ; in general population, 85% ceiling effect reported
(rating of no problem) for self-care, mobility and usual activity dimensions and 58.5% for pain dimensions—VAS
ceiling effect less pronounced ; Brazier et al.  reported no floor or ceiling effects.
– Hurst et al.  reported significant change on EQ-profile among RA patients reporting improvement in all domains
(p50.05) except anxiety/depression over a 3-month period, SRM for EQ-utility and EQ-VAS=0.71 and 0.70
respectively; EQ-5D scores showed significant improvement at 1, 3 and 12 months post-treatment (p50.01) .
Yes [43, 44, 48, 56, 78, 91].
countries are available for the SF-36. The
component scores have also been standardized with a
mean of 50 and standard deviation of 10.
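The scoring steps described above can be sketched as follows. This is a minimal illustration, not the official scoring algorithm: the function names, the example item range and the reference mean and standard deviation are assumptions chosen for the example, and the weights used to derive the PCS and MCS are not given in the text and are not reproduced here.

```python
def transform_to_0_100(raw_sum, lowest_possible, highest_possible):
    """Linearly rescale a summed subscale score onto a 0-100 scale."""
    if highest_possible <= lowest_possible:
        raise ValueError("highest_possible must exceed lowest_possible")
    if not lowest_possible <= raw_sum <= highest_possible:
        raise ValueError("raw_sum lies outside the possible range")
    score_range = highest_possible - lowest_possible
    return 100.0 * (raw_sum - lowest_possible) / score_range


def standardize_to_mean_50_sd_10(score, reference_mean, reference_sd):
    """Express a score on a metric with mean 50 and standard deviation 10.

    reference_mean and reference_sd would come from published population
    data; the values used below are hypothetical.
    """
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return 50.0 + 10.0 * (score - reference_mean) / reference_sd


# Hypothetical subscale: 10 items, each scored 1 to 3, so sums run 10 to 30.
subscale_score = transform_to_0_100(raw_sum=20, lowest_possible=10,
                                    highest_possible=30)

# Hypothetical reference population with mean 50 and SD 20 on this subscale.
standardized = standardize_to_mean_50_sd_10(60.0, reference_mean=50.0,
                                            reference_sd=20.0)
```

Because the transformation is linear, it changes the units of a subscale score but not the ordering of respondents or the relative spacing between them.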
The SF-36 questionnaire can be self-completed or
administered either in person or over the telephone
by a trained interviewer. It is considered simple to
administer and takes less than 10 min to complete.
Permission to use the instrument should be
obtained from the Medical Outcomes Trust, which
oversees the standardized administration of the SF-36
and will provide updates on administration and scoring.
The measurement properties of the SF-36 are
summarized in Table V.
Advantages The SF-36 is simple to administer.
Either form of administration takes less than
10 min to complete. As a self-completed,
mailed questionnaire, it has been shown to have
reasonably high response rates (83% [54, 55]; 75% to
83%; 85%; 82% overall and 69% for those
over age 85).
Limitations Higher rates of missing data have been
reported among older patients when using a self-
completed form of administration [46, 53, 54].
O’Mahoney et al. found item completion rates
to range from 66% to 96%. At the scale level,
complete data collection (amount required to compute
a scale score) ranged from 67% (role
limitations – emotional) to 97% (social functioning).
Walters et al. reported scale completion rates
among community dwelling older adults ranging
from 86.4% to 97.7% with all eight scales being
calculable for 72% of respondents. Dorman et al.
reported a proportion of missing data on the scale
level ranging from 2% (social functioning) to 16%
(role functioning – emotional). Given the lack of data
completeness found, postal administration of the SF-
O’Mahoney et al.  suggested that data
completeness may be indicative of respondent
acceptance and understanding of the survey. Hayes
et al.  noted that the most common items
missing on the self-completed questionnaire re-
ferred to work or to vigorous activity. Older
respondents identified these questions as pertinent
for much younger people and not relevant to their
own situation. In a qualitative assessment of the
physical functioning and general health perceptions
dimensions of the SF-36, Mallinson  noted
that the participants, who were all over the age of
65, tended to display signs of disengagement from
the interview process and some participants ex-
pressed concern relating to the relevance of the
questions. There was also considerable variation
noted in subjective interpretation of items and
most subjects used qualifying, contextual informa-
tion to clarify their responses to the interviewer. As
Mallinson points out, such individual issues of
subjective meaning and context are lost when the
questionnaire is scored.
The SF-36 does not lend itself to the generation of
an overall summary score. In scales using summed
Likert scale scores, information contained within
individual responses is lost in the total score—that is,
any given total score can be achieved in a variety of
ways from individual item responses. Hobart et
al. examined the use of the two-dimensional
model, which consists of a mental health component
(MCS) and physical health component (PCS).
These two component scales could account for only
60% of the variance in SF-36 scores suggesting a
significant loss of information when the 2-compo-
nent model is used.
The level of test-retest reliability reported in
stroke populations indicates that the SF-36 may not
be adequate for serial comparisons of individual
patients, but rather should be used for large group
comparisons only. Weinberger et al. also
Use by proxy?
– Test-retest reliability for EuroQol when completed by proxy was k=0.31 (mobility) to 0.61 (pain and usual activities),
VAS rating of overall ICC=0.74 and the assigned utility ICC=0.8 .
– Inter-rater reliability: Dorman et al.  reported agreement between scores on self-completed EQ vs. proxy-completed
EQ (ICC=0.53) for the VAS rating and for interview-completed EQ vs. proxy completed (ICC=0.32). Agreement on
domains of the EuroQol, was reported as k=0.38 (anxiety/depression) to k=0.57 (mobility and self-care) for self-
completed vs. proxy-completed EQ –when EQ was completed by interview, k=0.05 (depression/anxiety) to k=0.62
(self-care). Overall agreement between patient completed EQ (self-completed or interview) vs. proxy, k=0.30
(depression/anxiety) to k=0.64 (self-care); Agreement between subject and proxy responses on 5 domains ranged from
k=0.099 (pain/discomfort) to 0.601 (mobility) at baseline, k=0.439 (anxiety/depression) – 0.529 (self-care) at one
month follow-up and k=0.264 (anxiety/depression) to 0.598 (usual activities) at 4 months—agreement on the index
ranged from ICC=0.447 at baseline to 0.581 at 4 months—VAS rating agreements ranged from ICC=0.221 to 0.498.
Comparison of index change scores based on proxy vs. subject responses yielded ICC=0.287 (1 month vs. baseline) to
0.504 (4 month vs. baseline)—similar comparison of VAS rating change scores yielded ICC=0 to 0.044 .
Table V. Measurement Properties of the Medical Outcomes Study Short Form 36.
– Test-retest reliability: Brazier et al.  calculated correlation coefficients ranging from 0.6 (social functioning) to 0.81
(physical functioning). Mean differences ranged from 0.15 (social functioning) to 0.71 (mental health) with 91–98%
cases falling into the 95% CI (constructed as per Bland & Altman) ; lower values reported in stroke population of
0.28 (mental health) to 0.80 (social functioning)—reported substantial variability in individual responses, particularly for
role limitations—emotional ; Brazier et al.  reported r=0.28 (social functioning) to 0.70 (vitality) over a retest
period of 6 months.
– Internal Consistency: Brazier et al.  a50.80 for all subscales but social functioning (a=0.73). Reliability
coefficients=0.74 (social functioning)—0.93 (physical functioning); Anderson et al.  reported a of 0.6 (vitality) to
0.9 (physical functioning, bodily pain and role limitations—emotional). Four scales fell below 0.80; Brazier et al. 
reported a50.80 for all subscales but social functioning (0.56) and general health (0.66)—inter-item correlations
50.73 with the exception of social functioning (0.56) and general health (0.66); Essink-Bot et al.  reported a=0.76
(general health)—0.91 (physical functioning); Hobart et al.  found a of 0.68 (general health) and 0.70 (social
functioning) to 0.90 (physical functioning)—Correlations between 8 scales were lower than the reported alpha
coefficients; Hobart et al.  found item-own exceeded item-other correlations by 42.5 SE for 6 of 8 scales—social
functioning scale and general health scale did not (i.e. limited ability to distinguish constructs); Walters et al. 
reported a50.80 for all scales but social functioning (a=0.79).
– Construct validity (known groups): Patients diagnosed with 51 chronic physical problems, had lower scores on all
dimensions of the SF-36 except mental health, than healthy age-matched controls (p50.001). SF-36 scores distributed
as expected for sex, age, social class and use of health services ; SF-36 distinguished between groups based on
functional dependence vs. independence based on BI scores (p50.05 on all scales) and between groups based on
mental health vs. ill-health defined by GHQ-28 scores (p50.05 on all scales) ; Mayo et al.  reported SF-36
scores discriminated stroke survivors from age and gender-matched controls; Williams et al.  found the SF36 unable
to discriminate between groups based on patient self-report ratings of overall HRQOL (same, a little worse or a lot worse
than prestroke). SF-36 discriminated between age groups (575 years vs. 75+) on physical functioning, vitality and
change in health subscales (p40.006) and between groups based on setting (general practice vs. hospital outpatients) on
the physical function and role functioning—physical subscales (p=0.16) ; Essink-Bot et al.  reported SF-36 able
to discriminate between migraine sufferers and controls on all subscales (p50.01; ROC/AUC=0.54–0.67) and
between groups of migraine sufferers based on absence from work (0 vs. 50.5 days; p50.01, ROC/AUC=0.61–0.79);
Brazier et al.  reported SF-36 scores distinguished groups based on recent visits to GP, hospital inpatient stays and
longstanding illness (p50.05).
– Construct validity: Walters et al. reported significant relationships in hypothesized directions to support construct
validity among older adults: scores in all scales were reported to decrease as age increased (p<0.001); women
reported worse health than men on all scales even after adjusting for age (p<0.001); respondents who had recently
visited their physician reported poorer health on all scales (p<0.001); and people living alone had lower scores
(p<0.001) except on general health (p=0.02).
– Convergent and discriminant validity: Correlations of -0.41 (social functioning vs. social isolation) to -0.68 (vitality vs.
energy) between similar scales on the SF-36 and the Nottingham Health Profile were reported. Correlations between
dimensions less clearly related ranged from -0.18 (physical functioning vs. emotional reaction) to -0.53 (social
functioning vs. emotional reactions); Anderson et al. reported BI scores (in stroke survivors) strongly
associated (p50.001) with physical functioning and general health -Mental health on the GHQ28 most strongly
associated (p50.001) with the social functioning, role limitations—emotional and mental health scales of the SF-36;
Dorman et al.  reported SF-36 physical functioning subscale correlated most closely with mobility, self-care and
activities domains of EuroQol (r=0.57, 0.65 and 0.63) and less strongly with the EuroQol psychological domain
(0.34)—SF-36 bodily pain correlated with EuroQol pain domain (r=0.66) and moderately with all EuroQol domains—
role functioning, emotional correlated most closely with EuroQol psychological domain (r=0.43) and least with
EuroQol self care (r=0.24)—SF-36 mental health was not closely related to the psychological domain (r=0.21) or to
physical EuroQol domains (r=0.06–0.10)—SF-36 general health correlated with EuroQol overall HRQOL rating)
r=0.66; Lai et al.  reported r=0.55 between SF-36 physical functioning scale and BI.
– Predictive validity: McHorney  examined data from Medical Outcomes Study—reported general health perceptions
scale to be most predictive of death (death rate of patients in lowest quartile for SF-36 general health scale was three
times greater than for patients with SF-36 scores in the highest quartile), followed by scores in physical functioning.
Baseline physical functioning, role functioning-physical and pain scales were most predictive of hospitalizations and pain,
general health and vitality were most predictive of physician visits.
Responsiveness – Via item mapping—social functioning subscale limited assessment of number and difficulty of activities—demonstrated
marked ceiling effects—up to 60% for MRS grade 0) SF-36 physical function scale reported to have floor effects of 37%
and 100% for patients with MRS grades 4 and 5 ; Large ceiling effects reported for the role limitations—physical
(53%), bodily pain (43%), social functioning (67%) and role limitations—emotional scales (72%) – no floor effects over
7% were reported—scores for SF-36 physical functioning scale more uniformly distributed than BI scores suggesting
lower floor and ceiling effects than the BI ; Brazier et al.  reported floor effects in excess of 25% for role
limitations physical and emotional and ceiling effects 425% for social functioning and role limitations emotional and
questioned the usefulness of the SF-36 in the serial
evaluation of individuals given large reported abso-
lute differences in SF-36 scores obtained via
common modes of administration (face-to-face
interview, self administration and telephone inter-
view) over short testing intervals.
Low rates of agreement were reported between
proxy respondent and patient respondent ratings,
and test-retest reliability has also been shown to
be negatively affected by the use of proxy respon-
dents. While the use of a proxy may be the only
means by which to include data from more severely
affected stroke survivors, the subjective nature of the
SF-36 may make proxy use difficult or even
Summary—medical outcomes study short form 36
The ratings of methodological rigour associated with
evaluation of the measurement properties of the SF-
36 are presented in Table X.
Interpretability Use of scale scores and summary
component scores represents a loss of information
– Notable floor effects (role limitations—physical 59.1%; role limitations—emotional 19.9%) and ceiling effects (role
limitations—emotional 63.1%; social functioning 29.9%; bodily pain 25.6%) reported among ischemic stroke survivors
; substantial floor and ceiling effects reported by O’Mahoney et al. ; For face-to-face, telephone and self-
administration, Weinberger et al.  reported substantial floor effects for role-physical (440%) and role–emotional
(425%) subscales and ceiling effects for role-emotional (436%) and social functioning subscales (427%—for face-
to-face and self administration only); Walters et al.  reported substantial floor (30.9–61%) and ceiling effects across
all age groupings (65–69, 70–74, 75–79, 80–84 and 85+) in the role functioning physical (30.9–61% and 11.7–
38.6%) and role functioning—emotional (25.6–50.4% and 32.2–53.2%) as well as substantial ceiling effects in social
functioning and bodily pain (15–46.7% and 14.1–21.1%, respectively).
– Mossberg & McFarland  found SF-36 effect sizes (admission to outpatient rehabilitation to discharge)=0.48 (role
limitations – emotional) to 1.38 (bodily pain)—PCS (physical component) and MCS (mental component) effect
sizes=0.80 and 0.45 respectively.
Yes [44, 55, 56, 97–99].
Other formats – Mailed questionnaire: Hayes et al. found type/mode of administration was clearly related to completeness of data
(p<0.0001). For self-completion vs. in-person interview, percentage of missing items greater among the older
respondents (p<0.015). Time to complete survey not dependent upon mode of administration or age; 84% of the
respondents completed the assessment in 10 min or less; Walters et al. reported non-completion of the mailed
survey to be significantly related to increasing age (p<0.001).
– Face-to-face, self report and telephone interview: Weinberger et al.  reported internal consistency for all modes of
administration—face-to-face a=0.75–0.89; self a=0.77–0.93; telephone a=0.67–0.92. Mean test-retest correlations
for face-to-face, self and telephone modes were 0.80, 0.83 and 0.79. Between mode correlations were similar – face-to-
face vs. self (r=0.54–0.82), face-to-face vs. telephone (r=0.55–0.91). Correlations did not differ significantly by order
of administration. Despite short testing intervals, large absolute differences were reported on within mode and between
mode comparisons. Directional differences (over time 51 week) were significant on between mode comparisons on 4/8
subscales (physical function, social function, role-emotional and mental health) with face-to-face interviews producing
– Acute (1-week recall) version: Keller et al.  reported median inter-item correlations ranging from 0.43 (role-
emotional) to 0.78 (bodily pain)—a ranged from 0.59 (role- emotional) to 0.89) (physical functioning). Vitality, role
emotional and mental health a values fell below 0.80. Principal component analysis revealed the same 2-factor structure
as the standard version. The acute version displayed significant ceiling effects (420%) in 4 subscales (role-physical,
bodily pain, social functioning and role-emotional). There were no reported floor effects. Change scores for the acute
form (baseline to week 4) were more closely related to one-week change in disease severity than standard form scores.
For acute change scores 10/18 such comparisons reached significance.
– Dorman et al.  reported test-retest reliability better when the patient completed the forms than when completed by
proxy respondent. ICC’s ranged from 0.3 (mental health) to 0.81 (bodily pain/general health) when forms were patient-
completed vs. ICC of 0.24 (mental health) to 0.76 (social functioning) for proxy completion.
– Pierre et al.  demonstrated poor to moderate agreement between proxy and patient ratings. In a rehabilitation
setting, ICC’s=0.01 (social functioning) to 0.60 (vitality) for patient/health professional proxy pairings. For significant
others proxies/patients, ICC’s=-0.11 (mental health) to 0.58 (general health). In a day hospital setting and
professionals as proxies, ICC’s=0.09 (role physical)—0.45 (physical functioning)—with sig. others, ICC’s=0.01 (social
functioning)—0.71 (physical functioning). a=0.64–0.86 for the patient data, 0.76–0.90 for the health professional data
and 0.69–0.84 for the significant other data.
– Segal & Schall  reported ICC of 0.15 (role limitations—emotional) to 0.67 (physical functioning) for patient ratings
vs. proxy ratings.
and decreases potential clinical interpretability.
Standardized norms for several countries are avail-
able for the SF-36.
Acceptability Completion times are approximately
10 min for either self-completed or interview admi-
nistered questionnaires. Some items have been
questioned for their relevance to elderly populations.
The SF-36 has been studied for use by proxy;
however, reliability of the test decreased when proxy
respondents completed assessments.
Feasibility The SF-36 questionnaire can be administered
as a mail survey with reasonably high completion rates
reported; however, data obtained are more complete
when interview administration is used. Permission to
use the instrument and information regarding its administration and scoring should be obtained
from the Medical Outcomes Trust.
Nottingham Health Profile (NHP)
The Nottingham Health Profile (NHP) was designed
to be a brief, subjective measure of perceived health
encompassing the social and personal effects of illness
[62–65]. It was not intended to be a measure of
health-related quality of life or a means to identify
weights are intended to reflect the point of view of the
lay person and were derived from statements regard-
ing the effects of ill health collected from more than
The NHP consists of two parts. Part I contains 38
items grouped into six dimensions or subsections of
subjective health: physical mobility (8 items), pain (8
items), sleep (5 items), social isolation (5 items),
emotional reactions (9 items) and energy level (3
items). Each item is presented as a statement of a
potential problem. Respondents answer yes or no to
each statement according to whether or not they feel
the item applies to them at the present time. Each
statement carries with it a weight, based on perceived
severity of the item. Weights assigned to items in
each dimension total 100. If a statement is affirmed,
it is scored with its associated weight while negative
responses receive no score. All weighted responses
within a section are summed to give a total score for
that dimension out of 100. Higher scores correspond
to poorer perceived health status. Results from the
six dimensions should not be combined to provide a
total overall score.
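The Part I scoring rule described above can be sketched as follows. This is a minimal illustration only: the function name and the item weights are hypothetical placeholders, not the published NHP weights, and a small tolerance is assumed when checking that the weights in a dimension total 100.

```python
def score_nhp_dimension(responses, weights, tolerance=0.01):
    """Score one NHP Part I dimension from yes/no responses.

    responses: list of booleans, True where the statement was affirmed.
    weights:   list of severity weights for the same items, which
               should total 100 within the dimension.
    Returns the sum of the weights of affirmed items (0 to 100);
    higher scores correspond to poorer perceived health.
    """
    if len(responses) != len(weights):
        raise ValueError("one weight is required per item")
    if abs(sum(weights) - 100.0) > tolerance:
        raise ValueError("weights within a dimension should total 100")
    return float(sum(w for affirmed, w in zip(responses, weights) if affirmed))


# Hypothetical weights for a three-item dimension (e.g. energy level).
example_weights = [40.0, 35.0, 25.0]
example_score = score_nhp_dimension([True, False, True], example_weights)
```

Each dimension is scored separately in this sketch, consistent with the recommendation that the six dimension scores not be combined into a total.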
Part II contains seven items representing areas or
activities that may be influenced by the respondent’s
health: paid employment, jobs around the house,
social life, personal relationships, sex life, hobbies
and interests, and holidays. Respondents provide yes
or no answers as to whether each area is affected by
the respondent’s current state of health. Items in Part
II are not weighted. A score out of 7 is obtained by
adding together the number of positive responses.
Administration of Part II is optional.
The NHP is a self-reported assessment that may
be self-completed or administered by interview. It
takes approximately 10 min to complete. A user’s
manual as well as reference scores for healthy
people by age group, sex and social class are
available.
The measurement properties of the NHP are
summarized in Table VI.
Advantages The NHP is a simple and concise
measure. Reported completion times range from 5 to
15 min and, unless interview administration is neces-
sary, administrative burden is minimal [41, 68]. When
the NHP is used as a postal questionnaire, reported response rates
range from 68% to 93% [54, 65, 69]. Ebrahim et al.
reported low rates of missing data (4–7%).
The NHP has been widely used and extensively
studied. It was the first measure of perceived health
developed for use in Europe.
Limitations Overall, the NHP is a somewhat limited
as sensory deficits, incontinence, eating problems,
stigma, memory, intellectual ability, or financial
difficulty [66, 69]. It is a negative measure of health
assessing only the presence or absence of problems
and does not address the presence of positive out-
only of an absence of the problems presented on the
NHP and does not indicate a sense of well-being.
The statements comprising Part I reflect serious
problems and this may limit the usefulness of the
scale among less ill subjects. Given the prevalence of
ceiling effects (scoring ‘0’—no problems), the NHP
may not be suited for use in the general population or
among individuals experiencing only minor illnesses
or distress [41, 66, 68, 70].
The use of the weights provided with the scale
items has been criticized as being inappropriate and
confounded [71, 72]. In his 1991 study, Jenkinson
gave values of 0 (no) and 1 (yes) to responses,
summed the positive responses for each section and
expressed this summed total as a percentage. Scores
derived by this simplified method were very highly
correlated with results obtained using the traditional
weighted system (r=0.98; p<0.001), suggesting that
the use of weights may be unnecessary.
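The simplified, unweighted scoring approach described above, and its comparison with weighted scores, can be sketched as follows. The function names are hypothetical, and the Pearson correlation is written out by hand purely for illustration; no study data are reproduced here.

```python
from math import sqrt


def simplified_nhp_score(responses):
    """Unweighted section score: percentage of items answered yes.

    responses: list of booleans, True where the statement was affirmed.
    """
    if not responses:
        raise ValueError("at least one item response is required")
    affirmed = sum(1 for answered_yes in responses if answered_yes)
    return 100.0 * affirmed / len(responses)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with >= 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)
```

Given one list of weighted section scores and one list of simplified scores for the same respondents, `pearson_r` returns the kind of coefficient reported for the comparison of the two scoring methods.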
Part II is not well studied and most evaluative
research pertains to Part I only. This may be due to
its optional nature. The application of Part II may be
more limited than Part I as many of the items would
Table VI. Measurement Properties of the Nottingham Health Profile.
– Test-retest: For Part I, r=0.77 (energy) to 0.85 (sleep and physical mobility) reported among patients with osteoarthritis
and r=0.75 (emotional reactions) to 0.88 (pain) among patients with vascular disease. For Part II, reliability
coefficients=0.44 (hobbies/interests) to 0.86 (paid employment) among osteoarthritis patients and 0.55 (paid
employment) to 0.89 (family relationships) among vascular disease patients ; For Part I, correlations ranged from
r=0.44 (sleep) to 0.85 (emotion), most individual item kappa values were reported as showing moderate agreement
(k=0.41–0.60) but Bland-Altman repeatability coefficients in each section were in excess of 1/3 of the scale total
; Visser et al.  reported r=0.65 (pain) to 0.88 (part II); Bureau-Chalot et al.  reported ICC=0.45
(social isolation) to 0.83 (energy).
– Internal consistency: Bureau-Chalot et al.  reported a=0.63 (energy) to 0.85 (pain); Post et al.  reported
a=0.64 (social isolation) to 0.82 (pain)—overall a=0.87; Essink-Bot et al.  reported a=0.62 (energy) to 0.82
(pain) — correlations between NHP subscales ranged from 70.17 (physical ability and pain) to 0.62 (emotional
reaction and social isolation)
– Construct validity: (convergent/divergent) Ebrahim et al.  reported NHP emotional reaction scale correlated with
GHQ at six months post stroke (r=0.71)—supported by Jenkinson et al.  in RA and migraine sufferers (r=0.49);
NHP emotional reactions correlated most strongly with sleep, energy and social isolation in both RA and migraine
sufferers (r=0.25–0.57) and with GHQ (r=0.59 and 0.65)—emotional reaction did not correlate with mobility or
pain in either group (r=0.04–0.07) ; Brazier et al.  reported physical ability, pain, emotional reactions and
energy subsections correlated with corresponding dimensions of the SF36 (r=70.52 to 70.68)—though, social
isolation was less strongly related to social functioning on SF36 (r=70.41)—with this exception, relationships with
corresponding scales on the SF36 were stronger than with scales that did not correspond to NHP sections; Essink-Bot
et al.  reported NHP energy related to SF36 vitality (ICC=0.47), pain to both SF-36 bodily pain (ICC=0.43) and
physical functioning (ICC=0.69), emotional reaction to SF-36 RE (ICC=0.46) and MH (ICC=0.56), and physical
mobility to SF-36 physical functioning (ICC=0.67); Stansfeld et al.  reported NHP scale scores most strongly
related to corresponding SF36 scale score with the following exceptions—social functioning on SF36 was more
strongly related to emotional reactions (r=-0.4) than social isolation (r=-0.348), role limitations physical was most
strongly associated with energy level (r=-0.281) and SF36 pain to NHP physical mobility (r=-0.339) and energy
level (r=-0.342) than pain (r=-0.183). Factor analysis suggested NHP pain measures different aspects of pain than
SF36; NHP scales correlated with corresponding 0.SIP scales—physical ability most closely related to SIP68 somatic
autonomy (r=0.68) though not mobility (r=0.22), emotional reactions with SIP68 emotional stability (r=0.56); social
isolation with SIP68 social behaviour and emotional stability (r=0.35 and 0.41) .
– Construct validity (known groups): NHP scores on all sections discriminated between groups with varying degrees of
chronic illness and physical fitness (p<0.001); scores on each section of NHP distinguished between patients
grouped by frequency of consultation with a physician (p<0.01) and by length of absence from work (p<0.001); the
energy section distinguished between groups based on level of activity (p<0.05); NHP scores (all sections)
distinguished between stroke survivors and age-matched controls at 1 and 6 months post-stroke (p<0.01) and
between those able to walk vs. unable to walk at 1 and 6 months post-stroke (p<0.05); linear discriminant
function analysis was undertaken to determine if NHP scores could accurately classify RA vs. migraine sufferers; rates
of correct classification by ‘disease membership’ were 97% for RA vs. 100% for migraine sufferers; NHP
discriminated between RA and migraine sufferers on all dimensions (p<0.05) but emotional reactions; NHP
pain scores and Part II scores distinguished between stroke patients and controls (p=0.01); NHP scores
distinguished between migraine sufferers and controls on all subscales except sleep (p≤0.03; ROC/AUC=0.53–
0.59) and between groups of migraine sufferers based on absence from work (0 days vs. ≥0.5 days, p≤0.03, ROC/
AUC=0.57–0.62); NHP scores distinguished between spinal cord patients and rheumatic disease patients
(p<0.001) on all subscales except physical ability (p=0.1).
– Concurrent validity: Scores on all sections of NHP were related to ratings of perceived health over the past 6 months
(p<0.01) and at present (p<0.001); NHP scores (emotional reaction, social isolation, pain and energy) were
correlated with ratings of perceived health (r=0.34–0.38) and with ratings of well-being (r=0.30–0.61).
– Examination of response distributions revealed NHP scores to be highly skewed toward a ‘0’ score (ceiling effect where
0=no problems) on all dimensions when used with a general population sampling ; Post et al.  noted the NHP
to have serious ceiling effects where the median score for spinal cord patients was 0 on 4 of 6 subscales and on 2 of 6
subscales for rheumatic disease patients.
Yes [69, 105, 106].
Use by proxy?
In a study of dementia patients, Bureau-Chalot et al.  examined agreement between 1) subject and family proxy
scores (ranging from emotional reactions ICC=0.33 to physical ability ICC=0.57) as well as between 2) subject and
formal caregiver proxy scores (ranging from social isolation ICC=0.22 to physical ability ICC=0.48) and between 3)
family proxy scores and formal caregiver scores (social isolation ICC=0.20 to physical ability ICC=0.76)—
concordance was greater for the most objective domains (e.g. physical ability).
be inappropriate or irrelevant to a number of subject
populations, such as the elderly, unemployed or
disabled. It has been reported, subsequent to
further developmental work, that the authors no
longer recommend the use of Part II [41, 66].
Summary—Nottingham health profile
The ratings of methodological rigour associated with
evaluation of the measurement properties of the
NHP are presented in Table X.
Interpretability The NHP has been widely used in
Europe and extensively studied. A complete user’s
manual is available  as are population norms and
scores for individual patient groups .
Acceptability The NHP is short, simple and takes
little time to complete. High response rates and low
rates of missing data suggest that it is acceptable to
respondents. It has been tested for use with proxy
respondents; however, reported reliability was low.
Feasibility The test can be administered as either a
self-report questionnaire or interview and has been
used as a postal survey. The NHP is not suited for
use in the general population or with mildly-impaired
Stroke-Adapted Sickness Impact Profile
The Sickness Impact Profile (SIP) is a comprehensive
measure of health status originally intended for use in health
surveys, programme planning, policy formation and
in monitoring patient progress in terms of sickness
[73, 74]. It has become one of the more commonly
used generic instruments in the assessment of health-
related quality of life.
The major drawback in the use of the SIP may be
its length. It contains 136 items and can take more
than 30 min to complete. As such, it represents
considerable patient burden and may pose significant
administrative difficulty for both clinical and
research trial applications. A shorter version has now
been developed specifically for use in stroke outcome
research in order to overcome problems of
acceptability and feasibility associated with the longer SIP.
The Stroke-Adapted Sickness Impact Profile
(SA-SIP-30) was derived directly from the Sickness
Impact Profile. Van Straten et al. followed a
3-stage process to eliminate items and subscales of
little relevance to stroke survivors as well as those
items with the lowest levels of reliability [75, 76].
The end result is a scale comprising 30 items in
eight subscales (body care and movement, social
interaction, mobility, communication, emotional
behaviour, household management, alertness
behaviour and ambulation). Scale items are weighted to
reflect the relative importance of the item to health
status. Weights used in the SA-SIP-30 are the same
as those used in the parent version and were derived
by health professionals, students and members of a
group health plan .
Each item is a statement describing changes in
behaviour that reflect the impact of illness on some
aspect of daily life. Respondents are asked to mark
items most descriptive of themselves on a given day.
To score the SA-SIP-30, weights are applied to
marked items and summed for each subscale. This
summed subscale score is expressed as a percentage.
Higher scores are indicative of poorer health
outcome [12, 75, 78]. Subscale scores can be combined
to form two dimensions: physical (body care and
movement, ambulation, household management and
mobility) and psychosocial (alertness behaviour,
communication, social interaction and emotional
behaviour).
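The scoring rule described above can be illustrated with a minimal Python sketch. It assumes that expressing the summed score "as a percentage" means dividing the summed weights of marked items by the sum of all item weights in the subscale, which the text does not state explicitly. The item identifiers and weights shown are hypothetical placeholders, not the published SA-SIP-30 values.

```python
from typing import Dict, Iterable


def subscale_percent(weights: Dict[str, float], marked: Iterable[str]) -> float:
    """Weighted subscale score as a percentage of the maximum possible score.

    Assumption (not confirmed by the source): the denominator is the sum of
    all item weights in the subscale.
    """
    marked_set = set(marked)
    unknown = marked_set - set(weights)
    if unknown:
        raise ValueError(f"items not in this subscale: {sorted(unknown)}")
    max_total = sum(weights.values())
    if max_total <= 0:
        raise ValueError("subscale weights must sum to a positive value")
    endorsed = sum(weights[item] for item in marked_set)
    return 100.0 * endorsed / max_total


# Hypothetical weights for one subscale; real values come from the SIP manual.
example_weights = {"item_a": 5.0, "item_b": 3.0, "item_c": 2.0}
example_score = subscale_percent(example_weights, ["item_a", "item_c"])
print(example_score)  # 70.0 (higher = poorer health outcome)
```

Under the same assumption, a dimension or total score could be computed by pooling the weights of all contributing subscales before taking the percentage.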
No special equipment or training is required,
though a user’s manual and trainer’s manual are
available for the original SIP . Like the original
SIP, the SA-SIP-30 may be completed by interview
The measurement properties of the SA-SIP-30 are
summarized in Table VII.
Advantages The SA-SIP-30 is a much shorter and
simpler scale than the parent scale and is more
suitable for use in stroke outcome research .
Authors of the scale provide regression weights
to allow for the calculation of estimated SIP scores.
Maintaining much of the original subscale structure
of the SIP, these weights help facilitate
comparisons with studies using the original SIP-136. In
addition, van Straten et al. identified cut-off
scores representative of poor health. Patients with
scores >33 were reported to be ADL disabled,
unable to live independently, experiencing some
problems in self-care, mobility and in performing
their main activity, and reported low values for
health-related quality of life. Similar profiles were
observed for physical dimension scores >40, but
no cut-off values could be defined using the
psychosocial dimension.
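As an illustration only, the reported cut-offs (total score greater than 33, physical dimension score greater than 40) could be applied as in the sketch below. The function name is hypothetical, and treating either threshold as sufficient to flag a patient is an assumption made here for simplicity; the source reports the two cut-offs separately and defines none for the psychosocial dimension.

```python
from typing import Optional

TOTAL_CUTOFF = 33.0     # total SA-SIP-30 score, as reported in the text
PHYSICAL_CUTOFF = 40.0  # physical dimension score, as reported in the text


def exceeds_cutoff(total: Optional[float] = None,
                   physical: Optional[float] = None) -> bool:
    """Return True if any supplied score is strictly above its cut-off.

    Assumption: either threshold alone is enough to flag poor health.
    """
    if total is None and physical is None:
        raise ValueError("supply at least one score")
    if total is not None and total > TOTAL_CUTOFF:
        return True
    if physical is not None and physical > PHYSICAL_CUTOFF:
        return True
    return False


print(exceeds_cutoff(total=34.0))                  # True
print(exceeds_cutoff(total=33.0, physical=40.0))   # False
```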
Limitations In the process of creating the stroke-
adapted scale, items less relevant to stroke were
removed (i.e. applying to fewer than 10% of stroke
patients). However, no attempt was made to
supplement the scale with items or domains of potential
importance to stroke. The stroke-adapted version
does not assess pain, recreation, energy, general
health perceptions, overall quality of life or stroke
In examining weights associated with items
removed from the scale, van Straten et al. noted
that these deleted items carried high item weights
descriptive of more severe health states. The new
scale, therefore, may be less effective when used with
patients who have suffered a severe stroke and,
indeed, lower levels of agreement between scores
obtained with the SIP-136 and SA-SIP-30 were
reported among more severely ill stroke patients
than among healthier patients.
Total scores of the SA-SIP-30 appear to be derived
mostly from its physical dimension (66% for the
subscales of the physical dimension versus 25% for
the subscales of the psychosocial dimension). As
such, the SA-SIP-30 may represent a measure of
physical disability rather than the more
comprehensive constructs of health status or health-related
quality of life.
While the SA-SIP-30 is derived from an older,
well-established scale, there is relatively little
information available with regard to its measurement
properties or use. Most information originates from
the scale’s authors.
The ratings of methodological rigour associated with
evaluation of the measurement properties of the SA-
SIP-30 are presented in Table X.
Table VII. Measurement Properties of the Sickness Impact Profile (stroke adapted version).
– Internal consistency: van Straten et al. reported total α=0.85, psychosocial dimension α=0.78 and physical
dimension α=0.82; for individual subscales, α=0.54 (ambulation) and 0.57 (emotional behaviour) to 0.71 (mobility
and alertness behaviour); inter-item correlations ranged from 0.25 (emotional behaviour) to 0.47 (alertness behaviour).
– Construct validity: Principal component analysis identified two dimensions (physical and psychosocial), supporting
retention of the original dimension structure of the SIP; SA-SIP-30 scores explained 91% of variance in SIP
scores; moderate (as hypothesized) correlations of 0.50 and 0.68 were reported with the BI and Rankin scales
respectively; linear regression analysis revealed measures of physical disability (Barthel index, Rankin scale scores
and ADL-related dimensions of EuroQol) most closely associated with SA-SIP-30 scores, accounting for 36% (BI) to
53% (Rankin) of the variance in total SA-SIP-30 scores; SA-SIP-30 correlated with BI (r=−0.587), Rankin
(r=0.468), FAI (r=−0.426) and EuroQol (r=−0.483; Cup et al. 2003).
– Construct validity (known groups): SA-SIP-30 scores distinguished between patients with lacunar vs. cortical or
subcortical strokes on all subscales (p<0.01) except emotional behaviour (p=0.49) and mobility (p=0.07); however,
contrary to hypothesis, SA-SIP-30 scores could not distinguish patients with infratentorial vs. supratentorial lesions.
– Concurrent validity: Total SA-SIP-30 scores correlated with SIP scores (r=0.96)—subscale correlations ranged from
0.75 (emotional behaviour) to 0.90 (body care and movement) .
– Sensitivity/Specificity: van Straten et al. reported SA-SIP-30 cut-off scores for poor outcomes using criterion scores
from the BI, Rankin, EuroQol dimension scores and overall index; patients with SA-SIP-30 scores >33 were
ADL disabled (AUC=0.84), unable to live independently (AUC=0.90), experienced problems in mobility
(AUC=0.85) and self-care (AUC=0.88), and perceived their own health as poor (using EuroQol index score,
AUC=0.80). A similar profile was obtained for physical functioning scores >40; no cut-off values could be
determined for poor functioning on the psychosocial dimension.
– Van Straten et al.  noted a skewed distribution of total scores, physical and psychosocial dimensions toward
Use by proxy?
While the SA-SIP-30 has not been evaluated for use by proxy, its parent scale, the SIP-136 was evaluated for use with
proxy respondents among a stroke patient population by Sneeuw et al.  Reliability: The authors reported
ICC=0.77 for agreement between patient and proxy scores for the total score, 0.61 for the psychosocial dimension and
0.85 for the physical dimension. Subscale ICC values ranged from 0.47 (eating) to 0.82 (body care and movement)—
the majority ICC values of SIP subscales 50.70. There was a demonstrated bias for proxy respondents to rate patients
as having more limitations than the patients themselves—this bias increased as patients’ level of functioning decreased.
α=0.95 for the total scale (0.89 psychosocial and 0.93 physical dimension); α=0.45 (sleep and rest) to 0.88 (body
care and movement). Validity: Proxy-rated SIP scores were significantly associated (p<0.001) with the patient’s
Rankin grade in both communicative and non-communicative patients.
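The internal consistency coefficients reported in Table VII are Cronbach's alpha values. For readers unfamiliar with the statistic, the following is a minimal Python sketch of the standard formula, alpha = k/(k−1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The data layout (one row of item scores per respondent) and the example values are assumptions for illustration and are not drawn from any of the studies cited.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    rows: one sequence of item scores per respondent (hypothetical layout).
    Uses sample variances (n - 1 denominator) throughout.
    """
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("need at least two items, same count in every row")
    item_columns: List[tuple] = list(zip(*rows))
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Illustrative data only: two items that rise and fall together.
example = [[1, 1], [2, 2], [3, 3]]
print(cronbach_alpha(example))  # 1.0
```

Values closer to 1 indicate that items within a scale vary together, which is why subscales with few or heterogeneous items (such as ambulation, reported at 0.54) tend to show lower coefficients.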