Self- and surrogate-reported communication functioning in aphasia.
ABSTRACT PURPOSE: To evaluate the dimensionality and measurement invariance of the Aphasia Communication Outcome Measure (ACOM), a self- and surrogate-reported measure of communicative functioning in aphasia. METHODS: Responses to a large pool of items describing communication activities were collected from 133 community-dwelling persons with aphasia of ≥ 1 month post-onset and their associated surrogate respondents. These responses were evaluated using confirmatory and exploratory factor analysis. Chi-square difference tests of nested factor models were used to evaluate patient-surrogate measurement invariance and the equality of factor score means and variances. Association and agreement between self- and surrogate reports were examined using correlation and scatterplots of pairwise patient-surrogate differences. RESULTS: Three single-factor scales (Talking, Comprehension, and Writing) approximating patient-surrogate measurement invariance were identified. The variance of patient-reported scores on the Talking and Writing scales was lower than surrogate-reported variances on these scales. Correlations between self- and surrogate reports were moderate-to-strong, but there were significant disagreements in a substantial number of individual cases. CONCLUSIONS: Despite minimal bias and relatively strong association, surrogate reports of communicative functioning in aphasia are not reliable substitutes for self-reports by persons with aphasia. Furthermore, although measurement invariance is necessary for direct comparison of self- and surrogate reports, the costs of obtaining invariance in terms of scale reliability and content validity may be substantial. Development of non-invariant self- and surrogate report scales may be preferable for some applications.
Self- and Surrogate-Reported Communication
Functioning in Aphasia
Patrick J. Doyle
Geriatric Research Education and Clinical Center, VA Pittsburgh Healthcare System and
Department of Communication Science and Disorders, University of Pittsburgh, Pittsburgh, PA
William D. Hula
Geriatric Research Education and Clinical Center, VA Pittsburgh Healthcare System and
Department of Communication Science and Disorders, University of Pittsburgh, Pittsburgh, PA
412-954-4639 (office)
412-954-4629 (fax)
william.hula@va.gov
Shannon N. Austermann Hula
Research Service, VA Pittsburgh Healthcare System, Pittsburgh, PA
Clement A. Stone
Department of Psychology in Education, University of Pittsburgh, Pittsburgh, PA
Julie L. Wambaugh
Research Service, VA Salt Lake City Healthcare System and Department of Communication
Sciences and Disorders, University of Utah, Salt Lake City, UT
Katherine B. Ross
Audiology and Speech Pathology, Phoenix VA Healthcare System, Phoenix, AZ
James G. Schumacher
Audiology and Speech Pathology, VA Pittsburgh Healthcare System, Pittsburgh, PA
This work was supported by VA Rehabilitation Research & Development Merit Review Award C6098R,
Career Development Award 6210M, and the VA Pittsburgh Healthcare System Geriatric Research
Education and Clinical Center. An earlier version of this work was presented at the Clinical Aphasiology
Conference, May 27, 2010, Isle of Palms, SC, USA. The contents of this paper do not represent the views
of the Department of Veterans Affairs or the United States Government.
The original publication is available at www.springerlink.com:
http://www.springerlink.com/content/m340q360u1p76873/
Introduction
Aphasia is an acquired neurogenic impairment of language performance, usually resulting from
focal brain damage involving the dominant (usually left) hemisphere [1]. In most cases,
communication deficits are present in all input and output modalities (i.e., speaking,
understanding, reading, and writing), and they are disproportionate to any other cognitive
impairments that may be present [1]. The term aphasia specifically excludes motor speech
disorders resulting from muscle weakness or incoordination (e.g., dysarthria), as well as
communication impairments resulting from dementia, delirium, coma, or sensory loss [1]. Stroke
is the most common cause of aphasia [2], and approximately 20% of stroke survivors have
persisting aphasia [3]. The worldwide incidence and prevalence of aphasia are not known, but
there are currently estimated to be more than 1 million people living with the condition in the
United States [4]. The negative consequences of aphasia include psychosocial difficulties,
reduced functional independence, and diminished vocational opportunities.
The measurement of communication outcomes is critical to the care of patients with
aphasia and to the evaluation of stroke rehabilitation programs. In addition to traditional
performance-based and clinical indicators of communication functioning, increasing emphasis
has been placed on patient-centered assessments. Several patient-reported stroke outcome
assessments include sub-scales of communication functioning [5-7], and additional scales have
been developed specifically for patients with aphasia [8-12].
One issue that has concerned developers and users of these and related scales is the extent
to which stroke survivors in general, and stroke survivors with aphasia specifically, can provide
valid self reports of their own functioning [13-21]. This concern has led to the collection of
proxy¹ reports and their direct comparison with patients’ self reports [13-20]. It has also been
noted that proxy reports may constitute a valid perspective in their own right, regardless of their
correspondence with patients’ ratings [22, 23].
Stroke-specific studies that have included participants with aphasia are in agreement with
the more general literature that patient and proxy respondents demonstrate higher agreement on
ratings of more directly observable domains (e.g., physical function vs. energy) and that proxies
tend to rate patients as more limited than patients rate themselves [16-18]. In these studies, the
strength of association between patient and proxy reports, expressed as intraclass correlation
coefficients, has ranged from 0.50 to 0.70 for language and communication scales. Studies
specific to patients with aphasia have produced similar findings [13-15]. Some researchers in this
area have concluded that, in cases where patients with aphasia are unable to give valid self
reports, substitution with proxy reports is appropriate [13, 16]. Others have been more cautious
[14, 17].
One limitation of these patient-proxy comparison studies is that they have not evaluated
whether the scales in question have invariant measurement properties in the two groups.
Investigation of measurement invariance asks whether a scale measures the same construct in the
same way in two different populations. Questions of measurement invariance may be addressed
using latent variable modeling approaches to psychological measurement. Within this framework,
observed responses to test items are taken as indicators of unobserved (latent) constructs that are
¹ The term “proxy” has been used with two distinct meanings in the literature. Some authors have used the term to
refer to a person close to the patient who responds as he or she believes that the patient would respond [9, 16]. Others
have used the term to refer to a person close to the patient who provides his or her own assessment, without
considering how the patient might respond [12, 14]. In still other cases, the meaning is not clearly specified [13].
the actual objects of study [24]. Thus, a model relating the observed scores to the underlying
latent construct is necessary, and when group comparisons are made, it must be shown that this
model is structured similarly for the groups involved [25, 26]. Without demonstration of
invariance, between-group comparisons of means, variances, and covariances may be
confounded [25, 27, 28]. While investigations of measurement invariance in patient-reported
health-status assessment have frequently focused on cultural, ethnic, gender, and age differences
[29-35], the issue is equally applicable to potential differences in how patients and their proxies
use self-report scales.
A related issue concerns the underlying conceptual structure of communication
functioning. In order to evaluate measurement invariance, the structure of the latent variable in
question must first be established within a reference population. Among the many instruments
that have been developed to assess various aspects of functional communication in aphasia
[7-12, 22, 36-43], there is a general lack of a unifying conceptual structure [40] and much
variability in how the construct has been operationalized [44]. Some instruments propose
multiple subdomains of communication functioning that may be assessed individually or in
combination [36, 41], others provide only an overall score [22, 39], and still others have chosen
to measure communication as an undifferentiated aspect of general cognition [33, 34].
In this context, we have begun to develop a new self- and surrogate-reported² instrument
for measuring communication functioning in persons with aphasia: The Aphasia Communication
Outcome Measure (ACOM). Initial steps in developing the ACOM item pool were reported in a
prior paper [45]. In the present study, we asked the following questions: (1) Do items describing
² We use the term “surrogate” here to specify the second meaning of the word “proxy” discussed above in Footnote
1, i.e., a person close to the patient who provides his or her own assessment, without trying to respond as he or she
thinks that the patient would respond.
self- and surrogate-reported communication functioning in aphasia reflect a single
unidimensional scale? We plan to develop one or more communication functioning item banks
calibrated to an item response theory model [43]. Because the most easily applied item response
theory models assume unidimensionality, the present paper is focused on defining valid single-
factor scales. (2) Do self and surrogate ratings of communication functioning demonstrate
measurement invariance? That is, can they be interpreted and directly compared using a common
scale? (3) To what extent do self and surrogate ratings of communication functioning agree? (4)
Are persons with severe aphasia able to provide meaningful self-reports about their own
communication functioning?
Methods
Participants were 133 persons with aphasia (PWAs) and 133 surrogate respondents.
PWAs met the following inclusion criteria: diagnosis of aphasia ≥1 month post-onset;
community dwelling; self-reported normal pre-morbid speech-language function; pre-morbid
literacy with English as a first language; negative self-reported history of progressive
neurological disease, psychopathology, and substance abuse; ≥0.6 delayed/immediate ratio on
Arizona Battery for Communication Disorders of Dementia Story Retell [46]; ≤5 self-reported
depressive symptoms on the 15-item Geriatric Depression Rating Scale [47]; and Boston
Diagnostic Aphasia Exam severity rating ≥1. Surrogate (SUR) respondents met similar criteria,
except for the diagnosis of aphasia, and they reported weekly or more frequent contact with their
respective PWA both prior to and after aphasia onset. A subset of the PWAs (n = 116) was also
administered the Porch Index of Communicative Ability [48], a performance-based test of
communication impairment. Demographic and clinical characteristics of the sample are
summarized in Tables 1 and 2.
The initial ACOM item pool comprised 177 items describing various communication activities.
The content of the items is presented in Appendices A and B.
Participants were asked to rate on a 4-point scale (not at all, somewhat, mostly, completely) how
effectively the PWA performs each activity. “Effectively” was defined as “accomplishing what
you want to, without help, and without too much time or effort.” Respondents were also
permitted to indicate that they had no basis for rating a particular item, or that the PWA did not
do the activity in question for some reason other than his/her aphasia, in which cases the
responses were coded as missing data. For example, many surrogates indicated that they had no
basis for rating the item “get help in an emergency” because they had never observed their
partner do this, and many PWAs responded similarly because they had not experienced any
emergencies since the onset of their aphasia.
Responses from PWAs and surrogates were collected separately by trained research staff
using an interviewer-assisted administration format. Each item was displayed on a computer
screen in large font along with the stem “How effectively do you…” (for PWAs) or “How
effectively does your partner…” (for surrogates). The examiner read each item aloud and also
permitted the respondent to read it. The computer screen also displayed a vertical bar
representing the response categories with text labels. Participants were permitted to give their
responses verbally, by pointing to the screen, or by a combination of the two. In cases where there was any
uncertainty about the validity of the response, the examiner verified the response by verbally
repeating the item and the response back to the participant and also indicating the chosen
category on the screen.
Analyses and Results
Item Reduction
To address our research questions, we took a factor analytic approach, using Mplus version 5.2
[49] with the weighted least squares mean-and-variance-adjusted estimator. We began the
analysis by collapsing item response categories with < 10 observed responses in either the PWA
or SUR data with adjacent categories. For example, if the response category “completely” was
used for a particular item by fewer than ten PWA, we collapsed “completely” with “mostly” and
treated these two responses as the same for this particular item. Also, we excluded items with ≥
5% missing responses for either the PWA or SUR. Missing data were handled with pairwise
deletion. Items retained in the analyses described below (n = 101) are presented in Online
Resource 1. Items excluded by the missing data criterion (n = 73) are presented in Online
Resource 2.
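The category-collapsing rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the function name `collapse_sparse`, the 0-3 integer coding of the four response categories, and the choice to merge a sparse category with the next lower category (or upward, if it is the lowest) are all assumptions. The text specifies only the threshold of ten responses and merging with an adjacent category.

```python
from collections import Counter

def collapse_sparse(pwa, sur, n_categories=4, min_count=10):
    """Map original categories (0..n_categories-1) to collapsed codes.

    A category is merged with an adjacent one when it has fewer than
    min_count responses in EITHER the PWA or the SUR data. Missing
    responses are coded as None and ignored.
    """
    groups = [[c] for c in range(n_categories)]
    counts = [Counter(r for r in src if r is not None) for src in (pwa, sur)]

    def size(group, counter):
        return sum(counter[c] for c in group)

    while len(groups) > 1:
        sparse = [i for i, g in enumerate(groups)
                  if any(size(g, cnt) < min_count for cnt in counts)]
        if not sparse:
            break
        i = sparse[-1]                 # handle the highest sparse group first
        j = i - 1 if i > 0 else i + 1  # assumed rule: merge downward if possible
        lo, hi = min(i, j), max(i, j)
        groups[lo:hi + 1] = [groups[lo] + groups[hi]]

    return {c: new for new, g in enumerate(groups) for c in g}

# Hypothetical item: only 4 PWAs chose "completely" (coded 3).
pwa = [0] * 30 + [1] * 40 + [2] * 35 + [3] * 4
sur = [0] * 25 + [1] * 40 + [2] * 30 + [3] * 20
mapping = collapse_sparse(pwa, sur)
```

In this hypothetical example, "completely" is merged with "mostly" for both sources of report, mirroring the example given in the text.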
An initial attempt to fit the 101 retained items to a single-factor model yielded poor fit for
both the PWA and SUR data (Comparative Fit Index (CFI) < 0.9, Tucker-Lewis Index (TLI) <
0.95, and root mean square error of approximation (RMSEA) > 0.10). Next, we performed
separate exploratory factor analyses on the PWA and SUR data. A three-factor model provided
marginally adequate fit for both PWA (CFI = 0.949, TLI = 0.971, RMSEA = 0.074) and SUR
(CFI = 0.949, TLI = 0.979, RMSEA = 0.081).
The factors identified in these exploratory models defined coherent groupings of item
content and were predominantly consistent across the two sources of report. The item content
and salient loadings (>0.4) are presented in Online Resource 1. For both groups, the items that
loaded onto the first factor were primarily related to verbal expression (talking), with the second
and third factors related to writing (including typing) and comprehension (both auditory and
written), respectively. The factor correlation matrix, presented in Table 3, was similar across the
PWA and SUR samples.
Based on the above analysis, we selected three item subsets, henceforth referred to as
domains, based on the content groupings identified by the three factors that the PWA and SUR
participants had in common: Talking, Comprehension, and Writing. The subsequent analysis
steps were carried out separately for each domain, and included: item reduction, testing of
measurement invariance, and analysis of patient-surrogate agreement.
First, we fit a series of unidimensional confirmatory factor models separately for the
PWA and SUR items within each domain. When a one-factor model demonstrated poor fit, an
exploratory model was estimated and items with non-salient loadings on the primary factor were
excluded until adequate fit to a unidimensional model was achieved. We also inspected the
model modification indices provided by Mplus and excluded items that contributed substantially
to model misfit. We considered a model to have adequate fit when the following criteria were
met: CFI > 0.95, TLI > 0.95, RMSEA < 0.08, and weighted root mean square residual (WRMR)
< 1.0 [47].³ In excluding items based on the factor analysis results, we also attempted to retain
the largest possible groups of items with the most directly related content.
³ The CFI and TLI are measures of incremental or relative fit that compare the tested model to a null model, which
assumes that there are no relationships between any of the observed variables. They both adjust for model
complexity, the CFI with an expression that subtracts the model degrees of freedom from the model chi-square
value, while the TLI is based on the ratio of the chi-square to its degrees of freedom. CFI and TLI values of zero
indicate worst possible fit, while values close to 1 indicate relatively good fit. The RMSEA is a badness-of-fit
measure where a value of zero indicates best possible fit. It is based on the model chi-square, its degrees of freedom,
and the sample size. The WRMR is a newer statistic that measures the weighted average difference between the
observed and model-estimated population variances and covariances.
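For readers who want to see how the indices described in this footnote relate to the chi-square statistics, the following Python sketch implements the common textbook definitions of the CFI, TLI, and RMSEA, together with the adequacy criteria stated in the text. The function names are hypothetical. These formulas are not guaranteed to reproduce the values reported by Mplus under the mean-and-variance-adjusted estimator, which adjusts the chi-square and degrees of freedom; some programs also use N rather than N - 1 in the RMSEA denominator. The WRMR is taken here as an input rather than computed.

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index: tested model (m) versus null/baseline model (b)."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_m - df_m, chi2_b - df_b, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den

def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis Index, based on chi-square / df ratios."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)

def rmsea(chi2_m, df_m, n):
    """Root mean square error of approximation for sample size n."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

def adequate_fit(cfi_value, tli_value, rmsea_value, wrmr_value):
    """Criteria stated in the text: CFI > 0.95, TLI > 0.95, RMSEA < 0.08, WRMR < 1.0."""
    return (cfi_value > 0.95 and tli_value > 0.95
            and rmsea_value < 0.08 and wrmr_value < 1.0)
```

A model whose chi-square does not exceed its degrees of freedom yields CFI = 1 and RMSEA = 0 under these definitions, which is the sense in which those values indicate best possible fit.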
Starting with an initial set of 50 Talking items, we retained 24 items that fit a
unidimensional model for both sources of report. The content of the retained items was primarily
related to verbal conversation and social interaction, e.g., “tell people about yourself” and “start a
conversation with other people.” By contrast, much of the excluded item content related to de-
contextualized verbal performance, e.g., “say the names of clothing items,” and basic
communication, e.g., “say your name.” Item reduction for the Comprehension domain began
with 29 items. Ten items were retained in the final model, all of which described auditory
comprehension activities, e.g., “follow group conversation,” and “follow tv shows.” For the
Writing domain, item reduction began with 18 items. Fourteen items were retained in the final
factor model, including “write down a phone message,” and “write your name.”
Measurement Invariance
To evaluate measurement invariance for each scale, we tested a series of nested
confirmatory factor models [24, 25, 28], using the theta parameterization option in Mplus and the
DIFFTEST option for chi-square difference testing of nested models. Because of the potential
dependency between the PWA and SUR item pairs with identical content, we did not conduct a
traditional multiple group analysis, but instead treated the paired PWA and SUR responses as a
single case [50]. We specified a series of 2-factor models in which the PWA responses loaded on
the first factor and the SUR responses loaded on the second. In order to model the PWA-SUR
dependency, the errors for each item pair were permitted to covary. The first model tested in
each domain evaluated configural invariance, which requires that items respond to the same
factor(s) in both groups [24]. This model permitted item thresholds, factor loadings, and factor
variances to vary across the two groups [49]. Next, we evaluated weak and strong factorial
invariance in a single step. Weak invariance requires that factor loadings be equal across groups
and permits valid comparisons of estimated factor variances and covariances. Strong invariance
adds the constraint that item thresholds are equal for both groups and supports valid comparison
of estimated group means [24]. In this second step, we tested a model in which the factor
loadings and thresholds for each PWA-SUR item pair were constrained to be equal. Finally, we
evaluated strict factorial invariance, which adds the additional constraint that the residual
variance for each item must be equivalent in the two groups. When strict factorial invariance is
met, observed score variances and covariances may be validly compared, and additional support
for the validity of group mean comparisons is provided as well [24]. In each case, we used chi-
square difference testing to evaluate whether the added model constraints significantly (p < 0.05)
worsened fit.
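The sequence of nested models can be summarized in equation form. The notation below is an assumption introduced for illustration and does not appear in the original analyses: $y^{*}_{ig}$ is the latent response underlying item $i$ as rated by source $g$ ($P$ for PWA, $S$ for SUR), $\eta_g$ is the factor for that source, $\lambda_{ig}$ is the loading, $\tau_{ig,c}$ are the category thresholds, and $\varepsilon_{ig}$ is the residual.

```latex
\begin{align*}
  % Ordinal factor model for each source of report
  y^{*}_{ig} &= \lambda_{ig}\,\eta_{g} + \varepsilon_{ig},
  \qquad y_{ig} = c \ \text{ if } \ \tau_{ig,c} < y^{*}_{ig} \le \tau_{ig,c+1} \\
  % Paired design: residuals for the same item may covary across sources
  \operatorname{Cov}(\varepsilon_{iP}, \varepsilon_{iS}) &\ \text{freely estimated for each item pair } i \\[4pt]
  \text{Configural:}\quad & \text{same items load on } \eta_P \text{ and } \eta_S;\
     \lambda,\ \tau \text{ free in both sources} \\
  \text{Weak:}\quad & \lambda_{iP} = \lambda_{iS} \ \text{ for all } i \\
  \text{Strong:}\quad & \lambda_{iP} = \lambda_{iS}, \quad
     \tau_{iP,c} = \tau_{iS,c} \ \text{ for all } i, c \\
  \text{Strict:}\quad & \text{strong constraints plus }
     \operatorname{Var}(\varepsilon_{iP}) = \operatorname{Var}(\varepsilon_{iS}) \ \text{ for all } i
\end{align*}
```

Each model is nested within the one before it, so a significant chi-square difference indicates that the added equality constraints are not supported. As described above, the weak and strong constraints were imposed together in a single step.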
As shown in Table 4, the strong invariance model for the Talking scale was rejected.
Modification indices showed that the constraints on the factor loadings for two items, “speak to
family members and friends on the phone” and “ask questions to get information” were the
largest contributors to the significant chi-square difference test. We estimated a model in which
these constraints were relaxed, permitting the loadings for these items to be freely estimated
across patients and surrogates. This partial invariance model [24, 28] was tenable. Table 5
presents the results of measurement invariance testing for the Comprehension scale. The strong
and strict invariance models were both tenable. For the Writing scale, shown in Table 6, the
strong invariance model was rejected. Modification indices showed that the constraints on the
thresholds for the item “dial a telephone number” were the strongest contributors to misfit. A
model that estimated separate PWA and SUR thresholds for this item provided support for partial
strong invariance. A partial strict invariance model that maintained free estimation of the
thresholds for this item also showed adequate fit and a non-significant chi-square difference test.
Patient-Surrogate Agreement
Having established measurement invariance for the three scales, we evaluated
agreement between self and surrogate reports in three ways. First, we inspected the correlations
between the PWA and SUR factor scores for each scale. The correlations were 0.71, 0.50, and
0.89 for Talking, Comprehension, and Writing, respectively, suggesting moderate-to-strong
relationships between self and surrogate reports in each domain.
Second, we further constrained the restricted invariance factor models described above to
test the equality of the means and variances between self and surrogate reports. For the Talking
and Writing scales, the models specifying equal PWA and SUR means were tenable, but the
models specifying equal variances were not (see Tables 4 and 6). In both cases, the SUR
distribution had higher variance. For the Comprehension scale, there were no significant
differences between PWA and SUR means or variances.
To evaluate the magnitude of individual PWA-SUR differences and their relationship to
overall level of reported functioning, we constructed Bland-Altman plots for each domain [51].
These plots, displayed in Figure 1, show the PWA-SUR difference as a function of the
average of the PWA and SUR scores, which serves as an estimate of the true level of functioning.
For the Talking and Writing scales, there was a weak, but statistically significant negative
correlation between the PWA-SUR difference and the average. This suggests that for PWA with
lower reported functioning, SUR participants tended to underestimate ability relative to PWA,
and for PWA with higher reported functioning, SUR participants tended to overestimate ability
relative to PWA. We also used the estimated reliability for each scale (Talking: 0.94;
Comprehension: 0.86; Writing: 0.93) to compute the 95% CI about the assumption of a null
difference between individual PWA and SUR score pairs. These confidence intervals are shown
in Figure 1. Cases falling outside these intervals showed statistically significant disagreement at
p < 0.05. Thirty-three percent of PWA-SUR differences were significant on the Talking scale,
26% were significant on the Comprehension scale, and 15% were significant on the Writing
scale.
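The quantities plotted and tested in this section can be sketched as follows. The exact formula used for the confidence interval is not stated in the text, so this sketch assumes the conventional standard error of a difference between two scores with equal reliability, SD × sqrt(2 × (1 − reliability)), and assumes factor scores with a standard deviation of 1. All function names and the example scores are hypothetical.

```python
import math

def bland_altman_points(pwa, sur):
    """Return (average, PWA - SUR difference) for each patient-surrogate pair."""
    return [((p + s) / 2.0, p - s) for p, s in zip(pwa, sur)]

def null_difference_halfwidth(reliability, sd=1.0, z=1.96):
    """Half-width of the 95% interval around a null difference (assumed formula)."""
    sem = sd * math.sqrt(1.0 - reliability)   # standard error of measurement
    return z * math.sqrt(2.0) * sem           # two scores, equal reliability

def proportion_significant(pwa, sur, reliability, sd=1.0):
    """Share of pairs whose difference falls outside the null interval."""
    limit = null_difference_halfwidth(reliability, sd)
    diffs = [p - s for p, s in zip(pwa, sur)]
    return sum(abs(d) > limit for d in diffs) / len(diffs)

def pearson(x, y):
    """Pearson correlation, used here for difference versus average."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical factor scores for four pairs on the Talking scale.
pwa = [0.5, -1.0, 1.2, 0.0]
sur = [0.4, -2.0, 2.1, 0.1]
points = bland_altman_points(pwa, sur)
r = pearson([avg for avg, _ in points], [diff for _, diff in points])
share = proportion_significant(pwa, sur, reliability=0.94)
```

In this made-up example the surrogate scores are more extreme than the patient scores, so the difference correlates negatively with the average, which is the pattern described above for the Talking and Writing scales.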
Effects of Comprehension Impairment on Patient Responses
Finally, in order to evaluate whether comprehension impairment negatively affected
PWAs’ ability to provide meaningful responses, we conducted an additional series of factor
analyses. We included in these analyses only the 116 participants for whom we had PICA scores,
and we began by stratifying this sample into two sub-groups based on
comprehension performance. Specifically, we divided the sample into groups with severe (n =
39), and mild or moderate (n = 77) comprehension impairments based on the average of their
raw scores on the PICA auditory and reading comprehension subtests.
We then evaluated measurement invariance between the severely impaired sub-sample
and the remaining participants, using an approach similar to that described above. This analysis
was motivated by the hypothesis that if comprehension impairment prevented participants with
severe aphasia from understanding and validly responding to the questions, this should be
reflected in non-invariant parameter estimates for the severe group compared to the rest of the
sample. Put differently, if participants with severe aphasia were responding based on incorrect
understanding of the items, the items’ positions relative to one another on the latent trait scale
and the relative strength of their relationships to the latent trait should be affected. The major
difference between the present analyses and the analyses of PWA-SUR invariance described
above was that in this case the subsamples were independent, permitting us to conduct
traditional multiple group analyses in which only one factor for each scale was specified. Also,
for these analyses, we tested only configural, weak, and strong invariance, because tests of strict
invariance are not particularly relevant for this question.
The results of these analyses are presented in Table 7. For the Talking and
Comprehension scales, the chi-square difference tests were not significant, suggesting that
severity of comprehension impairment was not associated with reliable differences in factor
loadings or intercepts. For the Writing scale, the test was significant (p = 0.048). Inspection of
the modification indices revealed that the constrained intercepts for the item “communicate by
email” were the single largest contributor to model misfit. Participants with severe
comprehension impairment found this item to be harder (relative to the other items in the Writing
scale) than did the participants with mild-to-moderate comprehension impairment. With this
constraint relaxed, the chi-square difference test was no longer significant.
Discussion
This is the first investigation of agreement between patient and proxy reports of communication
functioning in aphasia that has demonstrated measurement invariance of the scales in question, a
necessary precondition for making the comparison. The first aim of this study was to evaluate
whether self and surrogate-reported communication functioning can be measured on the same
unidimensional scale. We conducted a series of exploratory and confirmatory factor analyses to
reduce a large initial item pool to form three single-factor scales: Talking, Comprehension, and
Writing. The Comprehension scale demonstrated full strict measurement invariance between self
and surrogate reports. The Talking and Writing scales demonstrated partial strict invariance, after
relaxing cross-group equality constraints on a small number of parameters in each model.
The second aim of this study was to evaluate the level of agreement between self- and
surrogate-reported communication functioning. Correlations between PWA and SUR factor
scores for Talking (0.71) and Comprehension (0.50) were moderately strong, while the
correlation between Writing scores was stronger (0.89). This replicates the previous finding,
noted above [13, 16, 17], that patients and proxies show better agreement on reports of
functioning in more directly observable domains. Finally, we evaluated whether aphasic
comprehension impairment prevented participants with severe aphasia from responding
meaningfully to the items. Factor analyses of the ACOM scales using participant sub-samples
stratified by severity of comprehension impairment suggested that even the participants with the
most severe aphasia understood the questions sufficiently well to provide meaningful and
coherently related responses.
Regarding self- and surrogate agreement, testing of nested confirmatory factor models in
each domain further suggested that there was no average bias for surrogates to over- or under-
report functioning relative to PWA. This finding contrasts with prior reports that proxies are
generally biased to report lower functioning and/or well-being [13, 14, 17]. We also found that
surrogate-reported scores had higher variance than self-reported scores in two domains, Talking
and Writing. The Bland-Altman plots presented in Figure 1 offer perspective on this finding.
They show a weak but significant tendency for surrogates to assign more extreme scores than
PWA in both domains. Thus, for PWA with lower ability in a given domain, SUR reports tended
to result in lower score estimates and for PWA with higher ability, SUR reports tended to result
in higher score estimates.