Psychological Assessment
The Stroop Test as a Measure of Performance Validity in
Adults Clinically Referred for Neuropsychological
Assessment
Laszlo A. Erdodi, Sanya Sagar, Kristian Seke, Brandon G. Zuccato, Eben S. Schwartz, and Robert
M. Roth
Online First Publication, February 22, 2018. http://dx.doi.org/10.1037/pas0000525
CITATION
Erdodi, L. A., Sagar, S., Seke, K., Zuccato, B. G., Schwartz, E. S., & Roth, R. M. (2018, February 22).
The Stroop Test as a Measure of Performance Validity in Adults Clinically Referred for
Neuropsychological Assessment. Psychological Assessment. Advance online publication.
http://dx.doi.org/10.1037/pas0000525
The Stroop Test as a Measure of Performance Validity in Adults Clinically
Referred for Neuropsychological Assessment
Laszlo A. Erdodi, Sanya Sagar, Kristian Seke,
and Brandon G. Zuccato
University of Windsor
Eben S. Schwartz
Waukesha Memorial Hospital, Waukesha, Wisconsin
Robert M. Roth
Geisel School of Medicine at Dartmouth/Dartmouth-Hitchcock Medical Center
This study was designed to develop performance validity indicators embedded within the Delis-Kaplan
Executive Function System (D-KEFS) version of the Stroop task. Archival data from a mixed clinical
sample of 132 patients (50% male; M_Age = 43.4; M_Education = 14.1) clinically referred for
neuropsychological assessment were analyzed. Criterion measures included the Warrington Recognition
Memory Test—Words and 2 composites based on several independent validity indicators. An age-corrected
scaled score ≤6 on any of the 4 trials reliably differentiated psychometrically defined credible and
noncredible response sets with high specificity (.87–.94) and variable sensitivity (.34–.71). An
inverted Stroop effect was less sensitive (.14–.29), but comparably specific (.85–.90) to invalid
performance. Aggregating the newly developed D-KEFS Stroop validity indicators further improved
classification accuracy. Failing the validity cutoffs was unrelated to self-reported depression or
anxiety. However, it was associated with elevated somatic symptom report. In addition to processing
speed and executive function, the D-KEFS version of the Stroop task can function as a measure of
performance validity. A multivariate approach to performance validity assessment is generally
superior to univariate models.
Public Significance Statement
The Stroop test can function as a performance validity indicator by identifying unusual patterns of
responding. Invalid performance was associated with higher levels of self-reported somatic symptoms.
Keywords: Stroop task, performance validity, embedded validity indicators
The validity of the neuropsychological evaluation hinges on the
examinees’ ability and willingness to demonstrate their typical
level of cognitive functioning (Bigler, 2015). Therefore, there is a
broad consensus within the profession that a thorough performance
validity assessment is an essential part of the examination (Bush,
Ruff, & Heilbronner, 2014; Chafetz et al., 2015; Heilbronner et al.,
2009). As a result, the administration of multiple, nonredundant
performance validity tests (PVTs) has become a widely accepted
practice standard (Boone, 2013; Larrabee, 2014).
Although stand-alone instruments are considered the gold stan-
dard for validity assessment (Green, 2013), embedded validity
indicators (EVIs) are increasing in popularity. EVIs are derived
from traditional neuropsychological tests originally designed to
measure cognitive ability, but were subsequently co-opted as PVTs.
Many EVIs have strong empirical support and a long presence in
the research literature. Some predate the most acclaimed stand-alone PVTs, such as those based on
verbal fluency (Hayward, Hall, Hunt, & Zubrick, 1987), digit span (Greiffenstein, Baker, & Gola,
1994), or symbol substitution (Trueblood, 1994) tasks.
EVIs have several advantages over stand-alone PVTs. First,
they allow clinicians to use multiple validity indicators without
adding new measures, resulting in significant savings in test ma-
terial and administration time. Compressing the battery also lowers
the demand on patients’ mental stamina, which is especially im-
portant when assessing individuals with complex medical and
psychiatric history (Lichtenstein, Erdodi, & Linnea, 2017). EVIs
may also be more resistant to coaching, as they are less likely to be
identified as PVTs than stand-alone instruments (Chafetz et al.,
2015; Schutte, Axelrod, & Montoya, 2015). Finally, they automat-
ically address concerns about the generalizability of the PVT
scores to the rest of the battery (Bigler, 2014). Overall, these
features enable EVIs to achieve the ideal of ongoing monitoring of
test-taking effort (Boone, 2009) without placing a significant ad-
ditional burden on either the examiner or examinee.
Laszlo A. Erdodi and Sanya Sagar, Department of Psychology, Univer-
sity of Windsor; Kristian Seke, Brain-Cognition-Neuroscience Program,
University of Windsor; Brandon G. Zuccato, Department of Psychology,
University of Windsor; Eben S. Schwartz, Waukesha Memorial Hospital,
Waukesha, Wisconsin; Robert M. Roth, Geisel School of Medicine at
Dartmouth/Dartmouth-Hitchcock Medical Center.
Correspondence concerning this article should be addressed to Laszlo A.
Erdodi, 168 Chrysler Hall South, 401 Sunset Avenue, Windsor, ON N9B
3P4, Canada. E-mail: lerdodi@gmail.com
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Psychological Assessment © 2018 American Psychological Association
2018, Vol. 1, No. 2, 000 1040-3590/18/$12.00 http://dx.doi.org/10.1037/pas0000525
The Stroop (1935) paradigm, of which there are many variants,
has the potential to function as an EVI. The task usually consists
of at least three trials (MacLeod & MacDonald, 2000). In the first
trial, the participant is asked to read a series of color words, printed
in black ink, as quickly as possible. In the second trial, the
participant is asked to look at a series of color squares, and name
the colors as quickly as possible. The third trial is the test of
interference and the evoker of the classic Stroop effect: the par-
ticipant is asked to look at a series of color words, printed in
incongruent ink colors, and name the color of the ink instead of
reading the word, as quickly as possible. For example, if the word
“red” is printed in green ink, the examinee is asked to say “green”
instead of “red.” Because reading words is more automatized than
naming ink colors, inhibiting the overlearned response requires
additional cognitive resources, which results in increased comple-
tion time relative to the word reading and color naming trials
(MacLeod & MacDonald, 2000).
The Stroop task within the Delis-Kaplan Executive Function
System (D-KEFS; Delis, Kaplan, & Kramer, 2001) includes a
fourth trial (inhibition/switching) designed to further increase the
cognitive load by requiring examinees to switch back and forth
between two sets of rules. On Trial 4, half of the words are
enclosed in boxes. The examinee is instructed to name the color of
the ink for free-standing items (as in Trial 3 of the classic Stroop
task), but read the word (rather than name the ink color) for items
inside a box. Trial 4 was meant to be more difficult than the
interference trial to capture more subtle executive deficits. How-
ever, the empirical evidence on this difficulty gradient is mixed
(Lippa & Davis, 2010).
The Stroop paradigm has been shown to be sensitive to neuro-
psychiatric conditions with executive dysfunction as a common
feature, such as traumatic brain injury (TBI; Larson, Kaufman,
Schmalfuss, & Perlstein, 2007; Schroeter et al., 2007) and
attention-deficit/hyperactivity disorder (ADHD; Lansbergen, Kenemans, & Van Engeland, 2007).
However, there is limited research
examining the utility of the Stroop paradigm as a measure of
noncredible performance. Arentsen and colleagues (2013) introduced validity cutoffs for the word
reading (≥66 s), color naming (≥93 s), and interference (≥191 s) trials in the Comalli Stroop
Test (Comalli, Wapner, & Werner, 1962). All of these cutoffs achieved specificity ≥.90 in a mixed
clinical population, with .29–.53 sensitivity.
A raw residual score (i.e., predicted score minus actual score) of ≥47 on the word reading trial
of the Stroop Color and Word Test (Golden & Freshwater, 2002) discriminated noncredible from
credible responders at .95 specificity and .29 sensitivity using Slick, Sherman, and Iverson's
(1999) criteria for malingered neurocognitive dysfunction (Guise, Thompson, Greve, Bianchini, &
West, 2014). Other studies (Egeland & Langfjaeran, 2007;
Osimani, Alon, Berger, & Abarbanel, 1997) have found that non-
credible performers may display slower overall reaction time (RT)
and an inverted Stroop effect (i.e., better performance on the
interference trial than the word reading or color naming trials).
While Osimani and colleagues (1997) did not perform signal
detection analyses, Egeland and Langfjaeran (2007) reported un-
acceptably low specificity (.59) for the inverted Stroop effect, even
though the majority of noncredible performers exhibited this vio-
lation of the difficulty gradient. Furthermore, the inverted Stroop
effect as an index of validity has not been replicated consistently
in the literature (Arentsen et al., 2013).
To our knowledge, the potential of the D-KEFS Stroop to
function as an EVI has not been investigated. Given its more
nuanced difficulty gradient because of the unique combined inhi-
bition/switching task (Trial 4), it may be particularly useful as a
measure of performance validity. The purpose of this study is to
examine the utility of the D-KEFS Stroop in differentiating cred-
ible and noncredible response sets in a clinical setting.
Method
Participants
Data were collected from a consecutive sequence of 132 patients
(50% male, 89.4% right-handed), clinically referred for neuropsy-
chological assessment at a northeastern academic medical center.
The vast majority of them (95%) were White, reflecting the
demographic composition of the region. Age (M = 43.4, SD = 16)
followed a bimodal distribution, with one peak around 20 years
and another around 55 years. Mean level of education was 14.1
years (SD = 2.8). Overall intellectual functioning was in the average range (M_FSIQ = 101.2,
SD_FSIQ = 16.4), as were scores on a single word reading test (M_WRAT-4 = 104.4, SD_WRAT-4 = 14.8).
The most common primary diagnosis was psychiatric (46.2%),
followed by TBI (35.6%), neurological disorders (14.4%) and
other medical conditions (3.8%). Within the psychiatric sub-
sample, most patients had been diagnosed with depression
(45.9%), followed by somatic (19.7%) and anxiety disorders
(13.1%). The majority of the TBI patients (81.1%) had sustained a
mild injury. Likewise, the average self-reported depression was in the mild range (M_BDI-II = 16.4,
SD_BDI-II = 14.8). Most patients (45.6%) scored in the minimal range (≤13), while 21.6% scored in
the mild (14–19), 17.6% scored in the moderate (20–28), and 15.2% scored in the severe (≥29) range
for self-reported depression.
Procedures
Data were collected through a retrospective chart review from
patients assessed between December 2012 and July 2014. The
main inclusion criterion was a complete administration of the
D-KEFS Stroop. The study was approved by the ethics board of
the hospital where the data were collected, and that of the univer-
sity where the research project was finalized. Relevant guidelines
regulating research with human participants were followed
throughout the study.
The names and abbreviation of tests administered are provided
in Table 1. The percentage of the sample with scores on each test
is also listed. A core battery of tests was administered to most
patients, while the rest of the instruments were selected based on
the specific referral question. Therefore, they vary from patient to
patient.
The main stand-alone PVT was Warrington's Recognition Memory Test—Words (RMT). Failure was defined
as an accuracy score ≤43 or a completion time ≥192 s (Erdodi, Tyson, et al., 2017). In addition, a
composite of 11 validity indicators labeled "Effort Index Eleven" (EI-11) was developed to provide
a comprehensive measure of performance validity (Erdodi, Abeare, et
al., 2017; Erdodi, Kirsch, Lajiness-O’Neill, Vingilis, & Medoff,
2014; Erdodi & Roth, 2017). The constituent PVTs were dichot-
omized into Pass (0) and Fail (1) along published cutoffs.
Some PVTs have multiple indicators; failing any indicator was
considered as failing the entire PVT (1). Failing multiple indi-
cators nested within the same measure was counted as a single
failure (1). Missing data were coded as Pass (0), although it
is recognized that this may increase error variance by potentially
misclassifying noncredible patients as credible.
The value of EI-11 is the sum of failures on its components.
Given the relatively large number of indicators, and that the most
liberal cutoffs were used to maximize sensitivity (see Table 2), the
EI-11 is prone to false positive errors by design. To correct for
that, the more conservative threshold of ≥3 independent PVT
Table 1
List of Tests Administered: Abbreviations, Scales, and Norms

Test name | Abbreviation | Norms | % ADM
Beck Depression Inventory, 2nd Edition | BDI-II | | 94.4
Beck Anxiety Inventory | BAI | | 69.7
California Verbal Learning Test, 2nd Edition | CVLT-II | Manual | 100.0
Complex Ideational Material | CIM | Heaton | 32.6
Conners' Continuous Performance Test, 2nd Edition | CPT-II | Manual | 78.8
Delis-Kaplan Executive Function System–Stroop | D-KEFS | Manual | 100.0
Finger Tapping Test | FTT | Heaton | 81.1
Letter and Category Fluency Test | FAS & Animals | Heaton | 84.1
Personality Assessment Inventory | PAI | Manual | 43.9
Recognition Memory Test–Words | RMT | | 100.0
Rey 15-Item Test | Rey-15 | | 81.8
Rey Complex Figure Test | RCFT | Manual | 96.2
Trail Making Test (A & B) | TMT (A & B) | Heaton | 56.8
Wechsler Adult Intelligence Scale, 4th Edition | WAIS-IV | Manual | 99.2
Wechsler Memory Scale, 4th Edition | WMS-IV | Manual | 99.2
Wide Range Achievement Test, 4th Edition | WRAT-4 | Manual | 83.3
Wisconsin Card Sorting Test | WCST | Manual | 91.7

Note. Heaton = Demographically adjusted norms published by Heaton, Miller, Taylor, and Grant (2004);
Manual = Normative data published in the technical manual; % ADM = Percent of the sample to which
the test was administered.
Table 2
Base Rates of Failure for EI-11 Components, Cutoffs, and References for Each Indicator

Test | BR_Fail | Indicator | Cutoff | Reference
Rey-15 | 10.6 | Free recall | ≤9 | Lezak, 1995; Boone et al., 2002
TMT | 15.9 | A + B (s) | ≥137 | Shura et al., 2016
Digit Span | 25.0 | RDS | ≤7 | Greiffenstein et al., 1994; Pearson, 2009
 | | ACSS | ≤6 | Axelrod et al., 2006; Spencer et al., 2013; Trueblood, 1994
 | | LDF | ≤4 | Heinly et al., 2005
WCST | 14.4 | FMS | ≥2 | Larrabee, 2003; Suhr & Boyer, 1999
 | | LRE | ≥1.9 | Greve et al., 2002; Suhr & Boyer, 1999
CIM | 6.1 | Raw | ≤9 | Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2016
 | | T-score | ≤29 | Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2016
LM_WMS-IV | 15.9 | I ACSS | ≤3 | Bortnik et al., 2010
 | | II ACSS | ≤4 | Bortnik et al., 2010
 | | Recognition | ≤20 | Bortnik et al., 2010; Pearson, 2009
VR_WMS-IV | 18.2 | Recognition | ≤4 | Pearson, 2009
CVLT-II | 12.9 | Hits_Recognition | ≤10 | Bauer et al., 2005; Greve et al., 2009; Wolfe et al., 2010
 | | FCR | ≤15 | Bauer et al., 2005; D. Delis (personal communication, May 10, 2012)
RCFT | 34.1 | Copy raw | ≤26 | Lu et al., 2003; Reedy et al., 2013
 | | 3-min raw | ≤9.5 | Lu et al., 2003; Reedy et al., 2013
 | | TP_Recognition | ≤6 | Lu et al., 2003; Reedy et al., 2013
 | | Atyp RE | ≥1 | Blaskewitz et al., 2009; Lu et al., 2003
FAS | 9.8 | T-score | ≤33 | Curtis et al., 2008; Sugarman & Axelrod, 2015
Animals | 16.7 | T-score | ≤33 | Hayward et al., 1987; Sugarman & Axelrod, 2015

Note. BR_Fail = Base rate of failure (% of the sample that failed one or more indicators within the
test); TMT = Trail Making Test; RDS = Reliable digit span; ACSS = Age-corrected scaled score;
LDF = Longest digit span forward; WCST = Wisconsin Card Sorting Test; FMS = Failure to maintain set;
UE = Unique errors; LRE = Logistical regression equation; CIM = Complex Ideational Material from the
Boston Diagnostic Aphasia Battery; WMS-IV = Wechsler Memory Scale, 4th Edition; LM = Logical Memory;
VR = Visual Reproduction; CVLT-II = California Verbal Learning Test, 2nd Edition; FCR = Forced choice
recognition; RCFT = Rey Complex Figure Test; TP_Recognition = Recognition true positives;
Atyp RE = Atypical recognition errors.
failures was used to define Fail on the EI-11. At the same time, to
maintain the purity of the credible group, Pass was defined as ≤1.
Hence, patients with EI-11 scores of two were considered Border-
line and excluded from the analyses (Erdodi & Roth, 2017; Erdodi,
Tyson, et al., 2017).
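The EI-11 scoring rules described above can be sketched in a few lines. The following is a minimal illustration, not the authors' scoring code, and the example patient profile is hypothetical:

```python
# Sketch of EI-11 scoring: each component PVT is dichotomized into
# Pass (0) / Fail (1); failing any indicator nested within a measure
# counts as a single failure; missing data are coded as Pass; the
# composite is the sum of failures, classified as Pass (<= 1),
# Borderline (= 2, excluded from analyses), or Fail (>= 3).

def score_component(indicator_results):
    """indicator_results: list of True/False (True = indicator failed),
    or None for a PVT that was not administered (coded as Pass)."""
    if indicator_results is None:
        return 0  # missing -> Pass (may misclassify noncredible patients)
    return 1 if any(indicator_results) else 0

def classify_ei11(component_results):
    """component_results: one entry per EI-11 component PVT."""
    total = sum(score_component(r) for r in component_results)
    if total <= 1:
        return total, "Pass"
    elif total == 2:
        return total, "Borderline"  # excluded from signal detection analyses
    return total, "Fail"

# Hypothetical patient: 11 components, two measures failed, one missing
results = [[False], [True, False], None, [False], [True],
           [False], [False], [False], [False], [False], [False]]
print(classify_ei11(results))  # -> (2, 'Borderline')
```

Coding missing components as Pass, as the text notes, biases the composite toward the credible side by design.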
Relying on a mixture of PVTs representing a wide range of
sensory modalities, cognitive domains, and testing paradigms is a
desirable feature of the EI-11, as it provides an ecologically valid
index of performance validity. However, this heterogeneity could
also become a source of error variance, especially when the pur-
pose of the instrument is to establish the credibility of the perfor-
mance on a specific test, and not on the overall neurocognitive
profile. The issue of modality-specificity as a confound in signal
detection analyses was raised as a theoretical concern (Leighton,
Weinborn, & Maybery, 2014) and has found empirical support
(Erdodi, Abeare, et al., 2017; Erdodi, Tyson, et al., 2017).
Therefore, because the D-KEFS Stroop is timed, another validity composite was developed based on
constituent PVTs that were based on processing speed measures, labeled "Erdodi Index Seven"
(EI-7_PSP). The unique feature of the EI-7_PSP is that instead of the traditional Pass/Fail
dichotomy, each of its components is coded on a 4-point scale ranging from zero (unequivocal Pass)
to three (unequivocal Fail), with one and two reflecting intermediate levels of failure (see Table
3). As such, the EI-7_PSP captures both the number and extent of PVT failures, recognizing the
underlying continuity in test-taking effort (Erdodi, Roth, Kirsch, Lajiness-O'Neill, & Medoff,
2014; Erdodi, Tyson, et al., 2016).
Because the practical demands of clinical classification require a dichotomous outcome, EI-7_PSP
scores ≤1 were defined as Pass, and ≥4 as Fail. EI-7_PSP values of two and three represent an
indeterminate range, as they could reflect either multiple failures at a liberal cutoff or a single
failure at the most conservative cutoff. As these performances are considered "near-passes" by some
(Bigler, 2012, 2015), patients with EI-7_PSP scores in this range were excluded from signal
detection analyses in the interest of obtaining diagnostically pure criterion groups, following
methodological guidelines established by previous researchers (Axelrod, Meyers, & Davis, 2014;
Greve & Bianchini, 2004). The majority of the sample (62.1%) performed in the passing range; only
15.9% scored ≥4 (see Table 4).
Data Analysis
Descriptive statistics (mean, SD, BR_Fail) are reported for the relevant variables. The main
inferential statistics were one-way analyses of variance (ANOVAs) and independent-sample t tests.
Effect size estimates are reported in Cohen's d and partial eta squared (η_p²). Classification
accuracy (sensitivity and specificity) was calculated using standard formulas (Grimes & Schulz,
2005). The emerging standard for specificity is ≥.90 (Boone, 2013), with the minimum acceptable
value at .84 (Larrabee, 2003).
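A minimal sketch of the standard classification accuracy computations (the 2 × 2 counts below are hypothetical, not taken from the study):

```python
# Standard formulas: sensitivity = TP / (TP + FN);
# specificity = TN / (TN + FP), where the reference PVT defines
# "true" credible/noncredible status and the candidate EVI is the test.

def sensitivity(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    return true_neg / (true_neg + false_pos)

# Hypothetical 2x2 table: 21 noncredible (reference Fail), 79 credible
tp, fn = 12, 9    # EVI flags 12 of the 21 noncredible patients
tn, fp = 71, 8    # EVI passes 71 of the 79 credible patients
print(round(sensitivity(tp, fn), 2))  # -> 0.57
print(round(specificity(tn, fp), 2))  # -> 0.9
```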
Results
One-way ANOVAs using the trichotomized EI-11 (Pass-Borderline-Fail) as the independent variable,
and the RMT accuracy score, completion time, and the EI-7_PSP scores as the dependent variables,
were statistically significant. Associated effect sizes were large (η_p²: .16–.23). Scores in the
Pass range were always significantly lower than scores in the Fail range. However, scores in the
Borderline range did not differ consistently from the other two classification ranges (see Table 5).
These analyses were repeated using the trichotomized EI-7_PSP (Pass-Borderline-Fail) as the
independent variable, and the RMT accuracy score, completion time, and the EI-11 scores as
dependent variables. All contrasts were significant with large effects (η_p²: .11–.34). As before,
scores in the Pass range were always significantly lower than scores in the Fail range, but scores
in the Borderline range did not differ consistently from the other two classification ranges (see
Table 6). Overall, these findings provide empirical support for eliminating participants with EI-11
and EI-7_PSP scores in the Borderline range when computing the classification accuracy of the
D-KEFS Stroop to minimize error variance and establish criterion groups with neurocognitive
profiles that are either clearly valid or invalid.
Mean D-KEFS Stroop age-corrected scaled scores (ACSS) were in the average range on all four trials.
However, they were significantly below the nominal mean of 10, with small to medium effect sizes
(Cohen's d: .25–.44). Skew and kurtosis were well within ±1.0 (see Table 7). However, visual
inspection revealed bimodal distributions with one peak in the impaired range and another in the
average-to-high average range.
Trial 1 ACSS ≤7 failed to clear the minimum threshold for specificity (.84; Larrabee, 2003)
against the RMT and EI-11. The ≤6 cutoff produced good combinations of sensitivity (.43–.71) and
specificity (.86–.94) against all three reference PVTs. Lowering the cutoff to ≤5 produced
negligible changes in classification accuracy. The more conservative ≤4 cutoff resulted in
predictable tradeoffs: improved specificity (.93–.99) at the expense of sensitivity (.29–.43).
Trial 2 ACSS ≤7 cleared the minimum threshold for specificity against the EI-11 and EI-7_PSP, but
fell short against the RMT. The ≤6 cutoff produced good combinations of sensitivity (.45–.62) and
specificity (.87–.91) against all three reference PVTs. Lowering the cutoff to ≤5 improved
specificity across all reference PVTs (.92–.96) with minimal loss in sensitivity (.38–.57). The
more conservative ≤4 cutoff produced excellent specificity (.96–.99) with relatively well-preserved
sensitivity (.33–.48).
Trial 3 ACSS ≤7 cleared the minimum threshold for specificity against the EI-11 and EI-7_PSP, but
once again, fell short of expectations against the RMT. Lowering the cutoff to ≤6 resulted in the
predictable tradeoffs, but still failed to reach minimum specificity against the RMT. Lowering the
cutoff to ≤5 improved specificity across all reference PVTs (.87–.99) with minimal loss in
sensitivity (.26–.62). The more conservative ≤4 cutoff produced excellent specificity (.94–1.00)
with acceptable sensitivity (.26–.52).
Trial 4 ACSS ≤7 failed to clear the minimum threshold for specificity against the RMT and EI-11.
The ≤6 cutoff produced acceptable combinations of sensitivity (.29–.48) and specificity (.84–.91)
against all three reference PVTs. Lowering the cutoff to ≤5 produced negligible changes in
classification accuracy. The more conservative ≤4 cutoff resulted in predictable tradeoffs:
improved specificity (.92–.96) at the expense of sensitivity (.21–.33).
To evaluate whether the pattern of performance across the trials can reveal invalid responding,
two additional derivative validity indices were examined: the Trials 4/3 raw score ratio, and the
(Trials 1 + 2)/(Trials 3 + 4) ACSS ratio. The former index is a
measure of the absolute difficulty gradient, as examinees are expected to take longer to finish
Trial 4, given its added cognitive load of switching between two sets of rules. Indeed, on average,
patients produced higher completion times on Trial 4 relative to Trial 3 (Trials 4/3 raw score
ratio: M = 1.17, SD = 0.32, range: 0.65–2.59). The distribution was bimodal, with the bulk of the
sample forming a bell-shaped distribution around the mean and a small group of positive outliers.
The latter index, a ratio of aggregated easy (Trials 1 and 2) versus hard (3 and 4) trials, is a
measure of the relative difficulty gradient. Norm-referencing (i.e., age-correction) is expected to
equalize the increase in task demands from the first two to the last two trials. As expected, the
overall (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio was close to 1.00: M = 1.05, SD = 0.44, range:
0.33–3.50. Again, the distribution was bimodal, with a bell-shaped majority around the mean and a
small group of positive outliers.
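Both derivative indices are simple ratios of trial scores. A minimal sketch (function and variable names are illustrative, and the example examinee is hypothetical):

```python
# Trials 4/3 raw score ratio: absolute difficulty gradient based on raw
# completion times; values well below 1 mean the examinee finished the
# harder inhibition/switching trial faster than the interference trial,
# i.e., an inverted Stroop effect.
def raw_ratio_4_3(time_trial3, time_trial4):
    return time_trial4 / time_trial3

# (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio: relative difficulty
# gradient; age-corrected scaled scores should roughly equalize the
# trials, so the ratio is expected to be close to 1.00.
def acss_ratio(acss1, acss2, acss3, acss4):
    return (acss1 + acss2) / (acss3 + acss4)

# Hypothetical examinee: slow on easy trials, relatively fast on hard ones
print(round(raw_ratio_4_3(time_trial3=80.0, time_trial4=68.0), 2))  # -> 0.85
print(round(acss_ratio(6, 6, 9, 8), 2))  # -> 0.71
```

Ratios well below 1.00 on either index violate the expected difficulty gradient, which is what the cutoffs examined below are designed to flag.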
A Trials 4/3 raw score ratio ≤0.90 cleared the minimum threshold for specificity against all
reference PVTs, but sensitivity was low (.14–.24). Lowering the cutoff to ≤0.85 resulted in
predictable tradeoffs, with notable increase in specificity (.95–.97), but further loss in
sensitivity (.10–.14). Lowering the cutoff to ≤0.80 produced negligible changes in classification
accuracy. The more conservative ≤0.75 cutoff produced excellent specificity (.98–1.00), but very
low sensitivity (.07–.14).
A (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio ≤0.80 failed to achieve minimum specificity against any
of the reference PVTs. Lowering the cutoff to ≤0.75 cleared the lower threshold for specificity
against all reference PVTs (.88–.89), with acceptable sensitivity (.26–.29). Lowering the cutoff
to ≤0.70 produced predictable tradeoffs: improved specificity (.92–.94) at the expense of
sensitivity (.21–.24). The more conservative ≤0.65 cutoff produced excellent specificity
(.98–1.00), but low sensitivity (.12–.24).
Finally, the effect of cumulative failures on independent D-KEFS Stroop validity indicators was
examined (see Table 8). Failing at least two of the six newly introduced embedded PVTs produced
good combinations of sensitivity (.61–.81) and specificity (.86–.87) against the EI-11 and
EI-7_PSP, but fell short of the minimum specificity standard against the RMT. Failing at least
three indicators cleared the specificity threshold against all three reference PVTs (.84–.93), at
the expense of sensitivity (.36–.57). Failing at least four indicators produced consistently high
specificity (.92–.99), with further loss in sensitivity (.29–.52). Failing five indicators (the
highest value observed) was associated with perfect specificity, but low sensitivity (.10–.14).
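The aggregation step can be sketched as a simple failure count across the six embedded indicators. The cutoffs below are one plausible set drawn from those examined above (ACSS ≤6 on each trial, Trials 4/3 ratio ≤0.90, ACSS ratio ≤0.75); the helper function and profile are illustrative, not the authors' code:

```python
# Count failures across the six D-KEFS Stroop EVIs, then compare the
# cumulative count against a multivariate threshold (e.g., >= 3 failures
# for a high-specificity decision, per the pattern reported above).

def stroop_evi_failures(acss, ratio_4_3, acss_ratio):
    """acss: ACSS for Trials 1-4; returns the number of EVIs failed (0-6)."""
    failures = sum(1 for s in acss if s <= 6)  # one EVI per trial
    failures += ratio_4_3 <= 0.90              # inverted raw-score gradient
    failures += acss_ratio <= 0.75             # inverted ACSS gradient
    return failures

# Hypothetical profile: impaired-range ACSS on three trials plus an
# inverted Stroop effect on the raw-score ratio
n = stroop_evi_failures(acss=[5, 6, 6, 8], ratio_4_3=0.88, acss_ratio=0.92)
print(n, n >= 3)  # -> 4 True
```

Counting failures rather than relying on any single cutoff is what implements the multivariate approach the article argues for.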
Given the high base rate of psychiatric disorders in the sample, we examined the relationship
between self-reported emotional distress and PVT failure. There was no difference in BDI-II scores
between patients who passed and those who failed the three reference PVTs and five of the newly
developed validity cutoffs embedded in the D-KEFS Stroop. Trial 4 ACSS ≤6 was an isolated
exception; failing this cutoff was associated with lower levels of depression (d = .42). Similarly,
no difference was found in BAI scores between patients who passed and those who failed the three
reference PVTs and four of the newly developed validity cutoffs embedded in the D-KEFS Stroop. The
exceptions were Trial 2 ACSS ≤6 and the (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio ≤0.75. In both
cases, failing the cutoff was associated with increased levels of anxiety (d: .41–.55).
To further explore the potential contribution of psychiatric symptoms to PVT failure, we performed
a series of t tests using Pass/Fail status on the PVTs as independent variables, and the PAI
clinical scales as dependent variables (see Table 9). All patients passed the validity cutoff on
the Negative Impression Management scale. The Somatic Concerns scale was the only scale with
significant contrasts. Effect sizes ranged from medium (d = .54) to large (d = .77). No difference
emerged against the EI-7_PSP and the derivative D-KEFS Stroop validity indices. Within the Somatic
Concerns scale, effect sizes were generally larger on the Conversion subscale (d: .50–1.06), but
again, contrasts involving the
Table 3
The Components of the EI-7_PSP With Base Rates of Failure Corresponding to Each Cutoff

EI-7_PSP component | 0 | 1 | 2 | 3
FTT number of failures | 0 | 1 | 2 |
  Base rate | 96.2 | .8 | 3.0 |
FAS T-scores | >33 | 32–33 | 28–31 | ≤27
  Base rate | 87.9 | 5.3 | 2.3 | 4.5
Animals T-scores | >33 | 25–33 | 21–24 | ≤20
  Base rate | 83.3 | 8.3 | 3.8 | 4.5
TMT A + B raw scores | <137 | 137–221 | 222–255 | ≥256
  Base rate | 84.1 | 10.6 | 2.3 | 3.0
CPT-II number of failures | 0 | 1 | 2 | ≥3
  Base rate | 66.7 | 15.2 | 5.3 | 12.9
WAIS-IV CD (ACSS) | >5 | 5 | 4 | ≤3
  Base rate | 90.2 | 1.5 | 5.3 | 3.0
WAIS-IV SS (ACSS) | >5 | 5 | 4 | ≤3
  Base rate | 85.6 | 6.8 | 4.5 | 3.0

Note. EI-7_PSP = "Erdodi Index Seven" based on processing speed measures; FTT Failures = Finger
Tapping Test, number of scores at ≤35 (men)/≤28 (women) dominant hand and ≤66 (men)/≤58 (women)
combined mean raw scores (Arnold et al., 2005; Axelrod, Meyers, & Davis, 2014); FAS = Letter
fluency T-score (Curtis et al., 2008; Sugarman & Axelrod, 2015); Animals = Category fluency T-score
(Sugarman & Axelrod, 2015); CPT-II Failures = Conners' Continuous Performance Test, 2nd Edition;
number of T-scores ≥70 on Omissions, Hit Reaction Time Standard Error, Variability and
Perseverations (Erdodi, Pelletier, & Roth, 2016; Erdodi, Roth, et al., 2014; Lange et al., 2013;
Ord, Boettcher, Greve, & Bianchini, 2010); WAIS-IV CD (ACSS) = Age-corrected scaled score on the
Coding subtest of the Wechsler Adult Intelligence Scale, 4th Edition (Erdodi, Abeare, et al., 2017;
Etherton et al., 2006; N. Kim et al., 2010; Trueblood, 1994); WAIS-IV SS (ACSS) = Age-corrected
scaled score on the Symbol Search subtest of the Wechsler Adult Intelligence Scale, 4th Edition
(Erdodi, Abeare, et al., 2017; Etherton et al., 2006; Trueblood, 1994).
Table 4
Frequency Distribution of the EI-7_PSP With Classification Ranges

| EI-7_PSP | f | % | Cumulative % | Classification |
|---|---|---|---|---|
| 0 | 61 | 46.2 | 46.2 | Pass |
| 1 | 21 | 15.9 | 62.1 | Pass |
| 2 | 12 | 9.1 | 71.2 | Borderline |
| 3 | 17 | 12.9 | 84.1 | Borderline |
| 4 | 5 | 3.8 | 87.9 | Fail |
| 5 | 2 | 1.5 | 89.4 | Fail |
| 6 | 4 | 3.0 | 92.4 | Fail |
| 7 | 3 | 2.3 | 94.7 | Fail |
| 8 | 0 | .0 | 94.7 | Fail |
| 9 | 1 | .8 | 95.5 | Fail |
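The component scoring in Table 3 and the classification ranges in Table 4 follow a simple aggregation rule: each measure is scored 0–3 against its cutoff bands, the scores are summed, and the total is mapped to Pass/Borderline/Fail. A minimal sketch of that logic, assuming two of the seven components for illustration (the cutoff bands are transcribed from Table 3; the function names and the example patient are our own, not part of the published index):

```python
def score_fas(t_score):
    """Letter fluency (FAS) T-score -> EI-7_PSP component value (Table 3 bands)."""
    if t_score <= 27:
        return 3
    if t_score <= 31:
        return 2
    if t_score <= 33:
        return 1
    return 0

def score_tmt_ab(raw):
    """Trail Making Test A+B combined raw score (seconds) -> component value."""
    if raw >= 256:
        return 3
    if raw >= 222:
        return 2
    if raw >= 137:
        return 1
    return 0

def classify_ei7(total):
    """Map the EI-7_PSP sum to the classification ranges in Table 4."""
    if total <= 1:
        return "Pass"
    if total <= 3:
        return "Borderline"
    return "Fail"

# Hypothetical patient: FAS T = 30 (scores 2), TMT A+B = 150 s (scores 1),
# all remaining components 0.
total = score_fas(30) + score_tmt_ab(150)
print(total, classify_ei7(total))  # -> 3 Borderline
```

The same banding pattern applies to the remaining five components listed in Table 3.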
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
STROOP TEST
derivative D-KEFS Stroop validity indices failed to reach significance. The Somatization subscale was only associated with failing the RMT (d = .54). Significant differences reemerged on the Health Concerns subscale, with effect sizes ranging from medium (d = .52) to large (d = .84). Contrasts involving the EI-7_PSP, D-KEFS Stroop Trial 2, and the derivative validity indices failed to reach significance.
Discussion
This study explored the potential of the D-KEFS Stroop to function as a PVT. A scaled score more than 1 SD below the normative mean on any of the four trials was a reliable indicator of psychometrically defined invalid performance. Violating the difficulty gradient (i.e., scoring better on difficult tasks than on easier ones) was also reliably associated with failure on reference PVTs. All six EVIs produced bimodal distributions with a distinct cluster of outliers in the range of noncredible impairment, indicating that valid and invalid performance may start to diverge at the level of descriptive statistics. Overall, results suggest that in addition to measuring basic processing speed and executive function, the D-KEFS Stroop is also an effective PVT. This finding is consistent with earlier investigations using different versions of the Stroop task (Arentsen et al., 2013; Egeland & Langfjaeran, 2007; Guise et al., 2014; Osimani et al., 1997).
Labeling a score that is only 1 SD below the mean as invalid may appear an extreme measure at first, as it implies that as many as 16% of the original normative sample demonstrated invalid performance. However, the practice is not without precedent. Shura, Miskey, Rowland, Yoash-Gantz, and Denning (2016) demonstrated that an ACSS ≤7 (Low Average) on Letter-Number Sequencing was a reliable indicator of noncredible responding. Moreover, Baker, Connery, Kirk, and Kirkwood (2014) found a recognition discriminability z-score of −0.5 (Average) to be the marker of invalid performance on the California Verbal Learning Test—Children's Version.
This phenomenon of the noncredible range of performance expanding into the traditional range of normal cognitive functioning has recently been labeled the "invalid-before-impaired paradox." Erdodi and Lichtenstein (2017) argued that this apparent psychometric anomaly has multiple possible explanations, one of which is that few (if any) normative samples are screened for invalid performance. Therefore, noncredible responding contaminates the scaling process used to establish ACSSs. In
Table 5
Results of One-Way ANOVAs on RMT and EI-7_PSP Scores Across EI-11 Classification Ranges

| Outcome measure | Statistic | EI-11 0–1 PASS (n = 76) | EI-11 2 BOR (n = 18) | EI-11 ≥3 FAIL (n = 38) | F | p | η²p | Significant post hocs |
|---|---|---|---|---|---|---|---|---|
| RMT_Accuracy | M | 47.4 | 44.5 | 42.2 | 12.1 | <.001 | .16 | PASS vs. BOR |
| | SD | 3.4 | 6.6 | 7.7 | | | | PASS vs. FAIL |
| RMT_Time | M | 130.0 | 147.3 | 191.3 | 11.1 | <.001 | .15 | PASS vs. FAIL |
| | SD | 63.4 | 52.9 | 74.4 | | | | BOR vs. FAIL |
| EI-7_PSP | M | 0.8 | 1.8 | 4.2 | 19.1 | <.001 | .23 | PASS vs. FAIL |
| | SD | 1.6 | 1.4 | 4.5 | | | | BOR vs. FAIL |

Note. Post hoc pairwise contrasts were computed using the least significant difference method; EI-11 = Effort Index Eleven; BOR = Borderline; RMT_Accuracy = Recognition Memory Test–Words (accuracy score); RMT_Time = Recognition Memory Test–Words (completion time in seconds); EI-7_PSP = "Erdodi Index Seven" based on processing speed measures; ANOVA = analysis of variance.
Table 6
Results of One-Way ANOVAs on RMT and EI-11 Scores Across EI-7_PSP Classification Ranges

| Outcome measure | Statistic | EI-7_PSP 0–1 PASS (n = 82) | EI-7_PSP 2–3 BOR (n = 29) | EI-7_PSP ≥4 FAIL (n = 21) | F | p | η²p | Significant post hocs |
|---|---|---|---|---|---|---|---|---|
| RMT_Accuracy | M | 47.0 | 43.3 | 42.7 | 8.08 | <.001 | .11 | PASS vs. BOR |
| | SD | 4.1 | 7.2 | 7.7 | | | | PASS vs. FAIL |
| RMT_Time | M | 123.2 | 189.9 | 207.9 | 21.5 | <.001 | .25 | PASS vs. BOR |
| | SD | 53.8 | 70.4 | 75.3 | | | | PASS vs. FAIL |
| EI-11 | M | 1.0 | 2.2 | 4.2 | 33.6 | <.001 | .34 | PASS vs. FAIL |
| | SD | 1.3 | 1.7 | 2.2 | | | | PASS vs. BOR |
| | | | | | | | | BOR vs. FAIL |

Note. Post hoc pairwise contrasts were computed using the least significant difference method; EI-7_PSP = "Erdodi Index Seven" based on processing speed measures; BOR = Borderline; RMT_Accuracy = Recognition Memory Test–Words (accuracy score); RMT_Time = Recognition Memory Test–Words (completion time in seconds); EI-11 = Effort Index Eleven; ANOVA = analysis of variance.
ERDODI, SAGAR, SEKE, ZUCCATO, SCHWARTZ, AND ROTH
turn, later research discovers that scores commonly interpreted as
within normal limits are in fact indicative of invalid performance.
They concluded that EVI cutoffs that reach into the range of
functioning traditionally considered intact provide valuable infor-
mation about the credibility of the response set and therefore,
should not be automatically discounted.
Within the D-KEFS Stroop, derivative validity indicators behaved differently both relative to reference PVTs and to single-trial validity cutoffs. First, the derivative validity indicators had consistently lower BR_Fail, suggesting that pattern violations are less common manifestations of noncredible responding than abnormally slow completion time. This finding is congruent with earlier reports that an inverted or absent Stroop effect does not occur in credible examinees; therefore, it is highly specific to invalid performance (Osimani et al., 1997). As a direct consequence of this, derivative validity indicators were generally less sensitive, which may also reflect the inconsistency in the literature regarding the inverted Stroop effect as an index of performance validity. Although some studies found that noncredible performers do better on more difficult trials, this pattern of performance failed to demonstrate adequate classification accuracy (Arentsen et al., 2013; Egeland & Langfjaeran, 2007).
Despite the variability in sample characteristics, methodology, version of the Stroop task, reference PVTs, and BR_Fail, our findings
Table 7
D-KEFS Stroop Age-Corrected Scaled Scores Across the Four Trials for the Entire Sample (N = 132)

| | Color Naming | Word Reading | Inhibition | Inhibition/Switching |
|---|---|---|---|---|
| Trial number | 1 | 2 | 3 | 4 |
| M | 8.6 | 9.2 | 9.0 | 9.1 |
| SD | 3.4 | 3.5 | 3.8 | 3.5 |
| Median | 10 | 10 | 9.5 | 10 |
| Skew | −.57 | −.66 | −.53 | −.63 |
| Kurtosis | .33 | .33 | .27 | .12 |
| Range | 1–15 | 1–15 | 1–15 | 1–15 |

Note. D-KEFS = Delis-Kaplan Executive Function System.
Table 8
Classification Accuracy of Validity Indicators Embedded in the D-KEFS Stroop Task Against Reference PVTs

Reference PVTs: RMT (n = 132, BR_Fail = 31.8), EI-11 (n = 114, BR_Fail = 33.3), EI-7_PSP (n = 103, BR_Fail = 20.4).

| D-KEFS Stroop | Cutoff | BR_Fail | RMT SENS | RMT SPEC | EI-11 SENS | EI-11 SPEC | EI-7_PSP SENS | EI-7_PSP SPEC |
|---|---|---|---|---|---|---|---|---|
| Trial 1 | ≤7 | 34.1 | .55 | .76 | .61 | .82 | .76 | .88 |
| | ≤6 | 23.5 | .43 | .86 | .53 | .92 | .71 | .94 |
| | ≤5 | 21.2 | .43 | .89 | .50 | .93 | .62 | .94 |
| | ≤4 | 13.6 | .29 | .93 | .37 | .96 | .43 | .99 |
| Trial 2 | ≤7 | 26.5 | .48 | .83 | .52 | .87 | .62 | .88 |
| | ≤6 | 23.5 | .45 | .87 | .45 | .88 | .62 | .91 |
| | ≤5 | 17.4 | .38 | .92 | .44 | .95 | .57 | .96 |
| | ≤4 | 13.6 | .33 | .96 | .39 | .98 | .48 | .99 |
| Trial 3 | ≤7 | 31.1 | .45 | .76 | .55 | .88 | .71 | .85 |
| | ≤6 | 22.0 | .31 | .82 | .42 | .91 | .62 | .93 |
| | ≤5 | 17.4 | .26 | .87 | .37 | .93 | .62 | .99 |
| | ≤4 | 12.1 | .26 | .94 | .32 | .97 | .52 | 1.00 |
| Trial 4 | ≤7 | 26.5 | .33 | .77 | .47 | .83 | .67 | .88 |
| | ≤6 | 19.7 | .29 | .84 | .34 | .87 | .48 | .91 |
| | ≤5 | 16.7 | .24 | .88 | .29 | .90 | .48 | .95 |
| | ≤4 | 12.1 | .21 | .92 | .21 | .92 | .33 | .96 |
| Trials 4/3 raw score | ≤.90 | 14.3 | .14 | .86 | .21 | .90 | .24 | .85 |
| | ≤.85 | 6.8 | .10 | .96 | .13 | .97 | .14 | .95 |
| | ≤.80 | 5.3 | .07 | .97 | .11 | .99 | .14 | .98 |
| | ≤.75 | 3.8 | .07 | .98 | .11 | .99 | .14 | .99 |
| Trials (1+2)/(3+4) ACSS | ≤.80 | 22.3 | .29 | .80 | .32 | .82 | .29 | .81 |
| | ≤.75 | 16.7 | .26 | .88 | .27 | .88 | .29 | .89 |
| | ≤.70 | 11.4 | .21 | .93 | .18 | .92 | .24 | .94 |
| | ≤.65 | 7.6 | .14 | .96 | .16 | .96 | .24 | .99 |
| Cumulative failures | ≥2 | 31.1 | .50 | .78 | .61 | .86 | .81 | .87 |
| | ≥3 | 22.0 | .36 | .84 | .45 | .92 | .57 | .93 |
| | ≥4 | 14.4 | .29 | .92 | .32 | .93 | .52 | .99 |
| | ≥5 | 3.0 | .10 | 1.00 | .11 | 1.00 | .14 | 1.00 |

Note. D-KEFS = Delis-Kaplan Executive Function System; PVT = performance validity test; RMT = Warrington Recognition Memory Test–Words [Pass: accuracy score ≥43 and time-to-completion ≤192″; Fail: accuracy score <43 or time-to-completion >192″ (Erdodi, Kirsch, et al., 2014; Erdodi, Tyson, et al., 2017; M. S. Kim et al., 2010)]; EI-11 = Effort Index Eleven [Pass ≤1; Fail ≥3 (Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2017)]; EI-7_PSP = "Erdodi Index Seven" based on processing speed measures [Pass ≤1; Fail ≥4 (Erdodi, Roth, et al., 2014; Erdodi, Tyson, et al., 2016, 2017)]; BR_Fail = base rate of failure (percentage); SENS = sensitivity; SPEC = specificity; Trial 1 = Color Naming age-corrected scaled score (ACSS); Trial 2 = Word Reading ACSS; Trial 3 = Inhibition ACSS (classic Stroop task); Trial 4 = Inhibition/Switching ACSS; Cumulative failures = number of validity indices failed (Trials 1–4 ACSS ≤6; Trials 4/3 raw score ratio ≤.90; Trials (1+2)/(3+4) ACSS ratio ≤.75).
are broadly consistent with the extant literature in that the inverted Stroop effect is more common in noncredible examinees but has limited discriminant power. Arentsen and colleagues (2013) note that the interference trial may be associated with poor specificity because, while most PVTs are designed to appear difficult but are in fact easy, the opposite is true for the Stroop task: the interference trial is actually difficult for most individuals.
As such, the inverted Stroop effect as an EVI follows the reverse logic of classic stand-alone PVTs: performing well on a difficult task, rather than performing poorly on an easy one, is meant to expose noncredible responding. Although the inverted Stroop effect seems less effective at separating valid and invalid response sets, it appears to tap different manifestations of noncredible performance. Therefore, it may provide valuable nonredundant information for the multivariate model of validity assessment (Boone, 2013; Larrabee, 2003).
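The two derivative indices can be computed directly from the trial scores. A sketch of that arithmetic, assuming completion times in seconds for the raw scores and ACSSs for the trial-level scores, with the cutoffs reported in Table 8 (function names and the example values are illustrative, not from the published data set):

```python
def trials_4_3_ratio(raw_t4, raw_t3):
    """Trial 4 / Trial 3 raw completion-time ratio. Values <= .90
    (Trial 4 finished markedly faster than the easier Trial 3)
    violate the difficulty gradient."""
    return raw_t4 / raw_t3

def easy_hard_acss_ratio(acss):
    """(Trial 1 + Trial 2) / (Trial 3 + Trial 4) on ACSSs. Values <= .75
    (better scores on the harder trials) flag an inverted Stroop effect."""
    return (acss[1] + acss[2]) / (acss[3] + acss[4])

acss = {1: 5, 2: 6, 3: 8, 4: 8}              # hypothetical profile
print(trials_4_3_ratio(54.0, 65.0) <= .90)   # True -> difficulty gradient violated
print(easy_hard_acss_ratio(acss) <= .75)     # True -> inverted Stroop effect
```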
An emergent finding of cross-validation analyses is the modality specificity of classification accuracy (Leighton et al., 2014). Of the three reference PVTs, one was a traditional stand-alone measure based on the forced-choice recognition paradigm (RMT), one was a composite measure based on the number of independent PVT failures (EI-11), and one was a composite of validity indicators specifically selected to match the target constructs in the Stroop task (EI-7_PSP). The four base trials of the D-KEFS Stroop and the Trials (1+2)/(3+4) ACSS ratio produced the best overall classification accuracy against the EI-7_PSP. The Trials 4/3 raw score ratio was a marginal exception, reiterating the divergence in the psychometric properties of the derivative validity indices. Nevertheless, all newly introduced D-KEFS Stroop based validity cutoffs had the highest sensitivity against the EI-7_PSP. In several cases, sensitivity values were double those observed against the RMT.
Table 9
Results of Independent t Tests Comparing Scores on PAI Somatization Scales as a Function of Passing or Failing PVTs

| PAI scale | Statistic | RMT | EI-11 | EI-7_PSP | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trials 4/3 | Trials (1+2)/(3+4) |
|---|---|---|---|---|---|---|---|---|---|---|
| | n | 58 | 49 | 47 | 58 | 58 | 58 | 58 | 58 | 58 |
| SOM | Pass M | 59.1 | 58.4 | 59.2 | 59.3 | 60.0 | 59.4 | 60.3 | 61.2 | 61.0 |
| | Pass SD | 11.1 | 11.8 | 11.8 | 11.7 | 11.8 | 11.7 | 12.0 | 13.0 | 12.5 |
| | Fail M | 67.4 | 65.9 | 65.2 | 70.0 | 69.9 | 69.2 | 70.3 | 61.9 | 63.3 |
| | Fail SD | 16.1 | 15.7 | 18.5 | 15.7 | 17.8 | 15.9 | 19.3 | 13.9 | 17.2 |
| | p | <.05 | <.05 | .11 | <.01 | <.05 | <.05 | <.05 | .44 | .33 |
| | d | .60 | .54 | — | .77 | .66 | .70 | .62 | — | — |
| SOM_CONV | Pass M | 56.4 | 52.7 | 56.1 | 55.6 | 56.4 | 55.7 | 57.3 | 58.2 | 57.2 |
| | Pass SD | 11.1 | 9.2 | 12.2 | 11.2 | 12.2 | 11.6 | 12.7 | 14.2 | 13.0 |
| | Fail M | 64.4 | 65.4 | 64.3 | 71.6 | 72.1 | 71.0 | 70.2 | 60.6 | 65.5 |
| | Fail SD | 19.5 | 17.8 | 19.0 | 18.1 | 18.8 | 17.9 | 22.1 | 14.6 | 20.1 |
| | p | <.05 | <.05 | <.05 | <.01 | <.01 | <.01 | <.05 | .31 | .07 |
| | d | .50 | .90 | .51 | 1.06 | .99 | 1.01 | .72 | — | — |
| SOM_SOM | Pass M | 57.6 | 59.9 | 58.5 | 58.7 | 58.9 | 58.5 | 59.2 | 59.9 | 60.0 |
| | Pass SD | 14.2 | 14.6 | 13.6 | 14.0 | 13.9 | 14.1 | 13.8 | 15.0 | 14.1 |
| | Fail M | 65.2 | 60.6 | 60.2 | 64.1 | 64.6 | 64.9 | 63.8 | 59.0 | 57.6 |
| | Fail SD | 13.9 | 15.2 | 19.0 | 16.3 | 17.7 | 15.4 | 20.6 | 12.6 | 17.4 |
| | p | <.05 | .43 | .37 | .14 | .15 | .10 | .23 | .44 | .33 |
| | d | .54 | — | — | — | — | — | — | — | — |
| SOM_H-CON | Pass M | 59.2 | 58.9 | 59.1 | 59.7 | 60.3 | 59.9 | 60.1 | 61.0 | 60.9 |
| | Pass SD | 9.6 | 10.8 | 10.1 | 10.4 | 10.3 | 10.3 | 10.7 | 10.8 | 10.8 |
| | Fail M | 65.8 | 65.4 | 65.0 | 66.9 | 65.6 | 66.1 | 69.3 | 61.3 | 61.9 |
| | Fail SD | 13.3 | 11.9 | 14.0 | 12.4 | 15.2 | 13.2 | 11.1 | 12.6 | 13.0 |
| | p | <.05 | <.05 | .07 | <.05 | .11 | <.05 | <.05 | .47 | .41 |
| | d | .57 | .57 | — | .63 | — | .52 | .84 | — | — |

Note. D-KEFS = Delis-Kaplan Executive Function System; PVT = performance validity test; PAI = Personality Assessment Inventory; SOM = Somatic Concerns scale; SOM_CONV = Conversion subscale; SOM_SOM = Somatization subscale; SOM_H-CON = Health Concerns subscale; RMT = Warrington Recognition Memory Test–Words [Pass: accuracy score ≥43 and time-to-completion ≤192″; Fail: accuracy score <43 or time-to-completion >192″ (Erdodi, Kirsch, et al., 2014; Erdodi, Tyson, et al., 2017; M. S. Kim et al., 2010)]; EI-11 = Effort Index Eleven [Pass ≤1; Fail ≥3 (Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2017)]; EI-7_PSP = "Erdodi Index Seven" based on processing speed measures [Pass ≤1; Fail ≥4 (Erdodi, Kirsch, et al., 2014; Erdodi, Tyson, et al., 2017; Erdodi, Abeare, et al., 2017)]; Trial 1 = Color Naming age-corrected scaled score (cutoff for failure ≤6); Trial 2 = Word Reading age-corrected scaled score (cutoff for failure ≤6); Trial 3 = Inhibition (classic Stroop task) age-corrected scaled score (cutoff for failure ≤6); Trial 4 = Inhibition/Switching age-corrected scaled score (cutoff for failure ≤6); Trials 4/3 = the ratio of Trial 4 and Trial 3 raw scores (cutoff for failure ≤.90); Trials (1+2)/(3+4) = the ratio of the sum of Trials 1 and 2 over the sum of Trials 3 and 4 age-corrected scaled scores (cutoff for failure ≤.75).
These findings resonate with earlier studies (Erdodi, Abeare, et al., 2017; Erdodi, Tyson, et al., 2017; Lichtenstein et al., 2017), and serve as a reminder that the choice of criterion measure can influence the perceived utility of the test being evaluated. In addition, they illustrate the importance of methodological pluralism both in cross-validating PVTs at the group level (Boone, 2013; Larrabee, 2014) and in determining the veracity of an individual response set (Larrabee, 2003, 2008; Vallabhajosula & van Gorp, 2001), as it can protect against instrumentation artifacts. Knowing that a new cutoff performs well against several different reference PVTs increases confidence in the reliability of its signal detection performance (Erdodi & Roth, 2017).
Combining the newly developed EVIs within the D-KEFS Stroop improved overall classification accuracy. Cutoffs based on cumulative failures produced superior signal detection profiles relative to individual EVIs at comparable BR_Fail, consistent with previous research (Larrabee, 2003, 2008). Even though the internal logic behind the practice of aggregating multiple validity indicators prioritizes sensitivity over specificity (Proto et al., 2014), at the appropriate cutoffs, multivariate models actually reduce false positive rates (Davis & Millis, 2014; Larrabee, 2014).
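The aggregation step itself is simple counting: each of the six EVIs is scored against its own cutoff, and the number of failures is compared with a multivariate threshold. A sketch using the cutoffs from Table 8 (the function name and the example profile are our own illustration):

```python
def count_stroop_evi_failures(acss, raw_t3, raw_t4):
    """Count failures across the six D-KEFS Stroop EVIs:
    the four trial-level ACSS cutoffs (<= 6), the Trials 4/3 raw
    completion-time ratio (<= .90), and the Trials (1+2)/(3+4)
    ACSS ratio (<= .75)."""
    failures = sum(acss[trial] <= 6 for trial in (1, 2, 3, 4))
    failures += (raw_t4 / raw_t3) <= .90
    failures += (acss[1] + acss[2]) / (acss[3] + acss[4]) <= .75
    return failures

# Hypothetical profile: low scores on the easy trials, relatively
# better scores and a faster time on the harder trials
acss = {1: 6, 2: 5, 3: 9, 4: 10}
n_fail = count_stroop_evi_failures(acss, raw_t3=60.0, raw_t4=50.0)
print(n_fail, n_fail >= 2)  # -> 4 True (>= 2 cumulative failures)
```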
Passing or failing the newly developed validity cutoffs within
the D-KEFS Stroop was largely unrelated to depression and anx-
iety, consistent with previous reports investigating the relationship
between depression and PVT failure (Considine et al., 2011; Rees,
Tombaugh, & Boulay, 2001). However, patients who failed the
reference PVTs and the newly introduced validity cutoffs in Trials
1– 4 of the D-KEFS Stroop reported higher levels of somatization
on the PAI, even though no systematic differences were observed
on any of the other clinical scales. This finding is consistent with
previous reports on the relationship between the somatization scale
of the PAI and PVT failures (Whiteside et al., 2010).
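The pass/fail group contrasts on the PAI scales are summarized with Cohen's d. A minimal sketch of the pooled-SD version of that statistic (the sample values are invented, not taken from Table 9):

```python
from statistics import mean, stdev

def cohens_d(group_pass, group_fail):
    """Cohen's d with pooled standard deviation (two independent groups)."""
    n1, n2 = len(group_pass), len(group_fail)
    s1, s2 = stdev(group_pass), stdev(group_fail)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group_fail) - mean(group_pass)) / pooled_sd

# Invented PAI T-scores for examinees who passed vs. failed a PVT
passers = [55, 60, 58, 62, 57]
failers = [66, 70, 64, 72, 68]
print(round(cohens_d(passers, failers), 2))
```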
In this study, we introduced a range of validity cutoffs for each of the four base trials of the D-KEFS Stroop, as well as two derivative validity indices, recognizing the need for flexible, population-specific cutoff scores (Bigler, 2015). To our knowledge, this is the first attempt to develop EVIs within the D-KEFS version of the Stroop task. In addition, we examined the relationship between PVT failures and self-reported psychiatric symptoms. The signal detection profiles of the new validity indicators across the engineered differences among the reference PVTs provided an opportunity to reflect on instrumentation artifacts as potential confounds in the cross-validation methodology used to calibrate new validity indices.
The results of the study should be interpreted in the context of
its limitations. The sample was geographically restricted and un-
usually high functioning for a clinical setting. However, the overall
intellectual functioning in our sample was comparable with previ-
ous research involving patients with neurological disorders from
the Northeastern United States (Blonder, Gur, Gur, Saykin, &
Hurtig, 1989; Erdodi, Pelletier, & Roth, 2016; Saykin et al., 1995).
In addition, the sample was diagnostically heterogeneous. There-
fore, it is unclear if the newly introduced cutoffs will perform
similarly across patients with different neuropsychiatric condi-
tions. Until replicated in different clinical populations, these cut-
offs should only be applied to patients with clinical characteristics
that are similar to the present sample, as they may be associated
with unacceptably high false positive error rates in examinees with
severe neurological conditions.
Further, as indeterminate cases were excluded from the analyses
to maximize the diagnostic purity of the criterion groups, this
practice may have inflated classification accuracy estimates. More-
over, the time or sequence of administration was not available for
the D-KEFS Stroop, even though these factors have been raised as
potential confounds in the clinical interpretation of cognitive tests
in general (Erdodi & Lajiness-O’Neill, 2014), and of PVT failures
specifically (Bigler, 2015). Finally, in the absence of data on
litigation status, the criterion groups (Valid/Invalid) were psycho-
metrically defined. Given that external incentive to appear im-
paired has been previously suggested as a relevant diagnostic
criterion for noncredible neurocognitive performance (Slick, Sher-
man, & Iverson, 1999), the newly introduced cutoffs would benefit
from cross-validation using known-group designs that incorporate
incentive status. As always, future research using different samples, diagnostic categories, and reference PVTs is needed to establish the generalizability of these findings.
References
Arentsen, T. J., Boone, K. B., Lo, T. T., Goldberg, H. E., Cottingham,
M. E., Victor, T. L.,...Zeller, M. A. (2013). Effectiveness of the
Comalli Stroop Test as a measure of negative response bias. The Clinical
Neuropsychologist, 27, 1060 –1076. http://dx.doi.org/10.1080/13854046
.2013.803603
Arnold, G., Boone, K. B., Lu, P., Dean, A., Wen, J., Nitch, S., & McPher-
son, S. (2005). Sensitivity and specificity of finger tapping test scores for
the detection of suspect effort. The Clinical Neuropsychologist, 19,
105–120. http://dx.doi.org/10.1080/13854040490888567
Axelrod, B. N., Fichtenberg, N. L., Millis, S. R., & Wertheimer, J. C.
(2006). Detecting incomplete effort with Digit Span from the Wechsler
Adult Intelligence Scale-Third Edition. The Clinical Neuropsychologist,
20, 513–523. http://dx.doi.org/10.1080/13854040590967117
Axelrod, B. N., Meyers, J. E., & Davis, J. J. (2014). Finger Tapping Test
performance as a measure of performance validity. The Clinical Neuro-
psychologist, 28, 876 – 888. http://dx.doi.org/10.1080/13854046.2014
.907583
Baker, D. A., Connery, A. K., Kirk, J. W., & Kirkwood, M. W. (2014).
Embedded performance validity indicators within the California Verbal
Learning Test, Children’s Version. The Clinical Neuropsychologist, 28,
116 –127. http://dx.doi.org/10.1080/13854046.2013.858184
Bauer, L., Yantz, C. L., Ryan, L. M., Warden, D. L., & McCaffrey, R. J.
(2005). An examination of the California Verbal Learning Test II to
detect incomplete effort in a traumatic brain-injury sample. Applied
Neuropsychology, 12, 202–207. http://dx.doi.org/10.1207/s15324826
an1204_3
Bigler, E. D. (2012). Symptom validity testing, effort, and neuropsycho-
logical assessment. Journal of the International Neuropsychological
Society, 18, 632– 640. http://dx.doi.org/10.1017/S1355617712000252
Bigler, E. D. (2014). Effort, symptom validity testing, performance validity
testing and traumatic brain injury. Brain Injury, 28, 1623–1638. http://
dx.doi.org/10.3109/02699052.2014.947627
Bigler, E. D. (2015). Neuroimaging as a biomarker in symptom validity
and performance validity testing. Brain Imaging and Behavior, 9, 421–
444. http://dx.doi.org/10.1007/s11682-015-9409-1
Blaskewitz, N., Merten, T., & Brockhaus, R. (2009). Detection of subop-
timal effort with the Rey Complex Figure Test and recognition trial.
Applied Neuropsychology, 16, 54 – 61. http://dx.doi.org/10.1080/
09084280802644227
Blonder, L. X., Gur, R. E., Gur, R. C., Saykin, A. J., & Hurtig, H. I. (1989).
Neuropsychological functioning in hemiparkinsonism. Brain and Cog-
nition, 9, 244 –257. http://dx.doi.org/10.1016/0278-2626(89)90034-1
Boone, K. B. (2009). The need for continuous and comprehensive sam-
pling of effort/response bias during neuropsychological examinations.
The Clinical Neuropsychologist, 23, 729 –741. http://dx.doi.org/10.1080/
13854040802427803
Boone, K. B. (2013). Clinical practice of forensic neuropsychology. New York, NY: Guilford Press.
Boone, K. B., Salazar, X., Lu, P., Warner-Chacon, K., & Razani, J. (2002).
The Rey 15-item recognition trial: A technique to enhance sensitivity of
the Rey 15-item memorization test. Journal of Clinical and Experimen-
tal Neuropsychology, 24, 561–573. http://dx.doi.org/10.1076/jcen.24.5
.561.1004
Bortnik, K. E., Boone, K. B., Marion, S. D., Amano, S., Ziegler, E.,
Cottingham, M. E.,...Zeller, M. A. (2010). Examination of various
WMS-III logical memory scores in the assessment of response bias. The
Clinical Neuropsychologist, 24, 344 –357. http://dx.doi.org/10.1080/
13854040903307268
Bush, S. S., Heilbronner, R. L., & Ruff, R. (2014). Psychological assess-
ment of symptom and performance validity, response bias, and malin-
gering: Official position of the Association of Psychological Advance-
ment in Psychological Injury and Law. Psychological Injury and Law, 7,
197–205. http://dx.doi.org/10.1007/s12207-014-9198-7
Chafetz, M. D., Williams, M. A., Ben-Porath, Y. S., Bianchini, K. J.,
Boone, K. B., Kirkwood, M. W.,...Ord, J. S. (2015). Official position
of the American Academy of Clinical Neuropsychology Social Security
Administration policy on validity testing: Guidance and recommenda-
tions for change. The Clinical Neuropsychologist, 29, 723–740. http://
dx.doi.org/10.1080/13854046.2015.1099738
Comalli, P. E., Jr., Wapner, S., & Werner, H. (1962). Interference effects
of Stroop color-word test in childhood, adulthood, and aging. The
Journal of Genetic Psychology, 100, 47–53. http://dx.doi.org/10.1080/
00221325.1962.10533572
Considine, C. M., Weisenbach, S. L., Walker, S. J., McFadden, E. M.,
Franti, L. M., Bieliauskas, L. A.,...Langenecker, S. A. (2011).
Auditory memory decrements, without dissimulation, among patients
with major depressive disorder. Archives of Clinical Neuropsychology,
26, 445– 453. http://dx.doi.org/10.1093/arclin/acr041
Curtis, K. L., Thompson, L. K., Greve, K. W., & Bianchini, K. J. (2008).
Verbal fluency indicators of malingering in traumatic brain injury:
Classification accuracy in known groups. The Clinical Neuropsycholo-
gist, 22, 930 –945. http://dx.doi.org/10.1080/13854040701563591
Davis, J. J., & Millis, S. R. (2014). Examination of performance validity
test failure in relation to number of tests administered. The Clinical
Neuropsychologist, 28, 199 –214. http://dx.doi.org/10.1080/13854046
.2014.884633
Delis, D. C., Kaplan, E., & Kramer, J. H. (2001). Delis-Kaplan executive
function system (D-KEFS). San Antonio, TX: Psychological Corpora-
tion.
Egeland, J., & Langfjaeran, T. (2007). Differentiating malingering from
genuine cognitive dysfunction using the Trail Making Test-ratio and
Stroop Interference scores. Applied Neuropsychology, 14, 113–119.
http://dx.doi.org/10.1080/09084280701319953
Erdodi, L. A., Abeare, C. A., Lichtenstein, J. D., Tyson, B. T., Kucharski,
B., Zuccato, B. G., & Roth, R. M. (2017). Wechsler Adult Intelligence
Scale-Fourth Edition (WAIS-IV) processing speed scores as measures of
noncredible responding: The third generation of embedded performance
validity indicators. Psychological Assessment, 29, 148 –157. http://dx
.doi.org/10.1037/pas0000319
Erdodi, L. A., Kirsch, N. L., Lajiness-O’Neill, R., Vingilis, E., & Medoff,
B. (2014). Comparing the Recognition Memory Test and the Word
Choice Test in a mixed clinical sample: Are they equivalent? Psycho-
logical Injury and Law, 7, 255–263. http://dx.doi.org/10.1007/s12207-
014-9197-8
Erdodi, L. A., & Lajiness-O’Neill, R. (2014). Time-related changes in
Conners’ CPT-II scores: A replication study. Applied Neuropsychology
Adult, 21, 43–50. http://dx.doi.org/10.1080/09084282.2012.724036
Erdodi, L. A., & Lichtenstein, J. D. (2017). Invalid before impaired: An
emerging paradox of embedded validity indicators. The Clinical Neuro-
psychologist. Advance online publication. http://dx.doi.org/10.1080/
13854046.2017.1323119
Erdodi, L. A., Pelletier, C. L., & Roth, R. M. (2016). Elevations on select
Conners’ CPT-II scales indicate noncredible responding in adults with
traumatic brain injury. Applied Neuropsychology: Adult, 22, 851– 858.
Erdodi, L., & Roth, R. (2017). Low scores on BDAE Complex Ideational
Material are associated with invalid performance in adults without
aphasia. Applied Neuropsychology: Adult, 24, 264 –274. http://dx.doi
.org/10.1080/23279095.2016.1154856
Erdodi, L. A., Roth, R. M., Kirsch, N. L., Lajiness-O’neill, R., & Medoff,
B. (2014). Aggregating validity indicators embedded in Conners’ CPT-II
outperforms individual cutoffs at separating valid from invalid perfor-
mance in adults with traumatic brain injury. Archives of Clinical Neu-
ropsychology, 29, 456 – 466. http://dx.doi.org/10.1093/arclin/acu026
Erdodi, L. A., Tyson, B. T., Abeare, C. A., Lichtenstein, J. D., Pelletier,
C. L., Rai, J. K., & Roth, R. M. (2016). The BDAE Complex Ideational
Material—A measure of receptive language or performance validity?
Psychological Injury and Law, 9, 112–120. http://dx.doi.org/10.1007/
s12207-016-9254-6
Erdodi, L. A., Tyson, B. T., Shahein, A. G., Lichtenstein, J. D., Abeare,
C. A., Pelletier, C. L.,...Roth, R. M. (2017). The power of timing:
Adding a time-to-completion cutoff to the Word Choice Test and Rec-
ognition Memory Test improves classification accuracy. Journal of
Clinical and Experimental Neuropsychology, 39, 369 –383. http://dx.doi
.org/10.1080/13803395.2016.1230181
Etherton, J. L., Bianchini, K. J., Heinly, M. T., & Greve, K. W. (2006).
Pain, malingering, and performance on the WAIS-III Processing Speed
Index. Journal of Clinical and Experimental Neuropsychology, 28,
1218 –1237. http://dx.doi.org/10.1080/13803390500346595
Golden, C., & Freshwater, S. (2002). A Manual for the Adult Stroop Color
and Word Test. Chicago, IL: Stoelting.
Green, P. (2013). Spoiled for choice: Making comparisons between forced-
choice effort tests. In K. B. Boone (Ed.), Clinical practice of forensic
neuropsychology. New York, NY: Guilford Press.
Greiffenstein, M. F., Baker, W. J., & Gola, T. (1994). Validation of
malingered amnesia measures with a large clinical sample. Psycholog-
ical Assessment, 6, 218 –224. http://dx.doi.org/10.1037/1040-3590.6.3
.218
Greve, K. W., & Bianchini, K. J. (2004). Setting empirical cut-offs on
psychometric indicators of negative response bias: A methodological
commentary with recommendations. Archives of Clinical Neuropsychol-
ogy, 19, 533–541. http://dx.doi.org/10.1016/j.acn.2003.08.002
Greve, K. W., Bianchini, K. J., Mathias, C. W., Houston, R. J., & Crouch,
J. A. (2002). Detecting malingered performance with the Wisconsin card
sorting test: A preliminary investigation in traumatic brain injury. The
Clinical Neuropsychologist, 16, 179 –191. http://dx.doi.org/10.1076/clin
.16.2.179.13241
Greve, K. W., Ord, J. S., Bianchini, K. J., & Curtis, K. L. (2009).
Prevalence of malingering in patients with chronic pain referred for
psychologic evaluation in a medico-legal context. Archives of Physical
Medicine and Rehabilitation, 90, 1117–1126. http://dx.doi.org/10.1016/
j.apmr.2009.01.018
Grimes, D. A., & Schulz, K. F. (2005). Refining clinical diagnosis with
likelihood ratios. The Lancet, 365, 1500 –1505. http://dx.doi.org/10
.1016/S0140-6736(05)66422-7
Guise, B. J., Thompson, M. D., Greve, K. W., Bianchini, K. J., & West, L.
(2014). Assessment of performance validity in the Stroop Color and
Word Test in mild traumatic brain injury patients: A criterion-groups
validation design. Journal of Neuropsychology, 8, 20 –33. http://dx.doi
.org/10.1111/jnp.12002
Hayward, L., Hall, W., Hunt, M., & Zubrick, S. R. (1987). Can localised
brain impairment be simulated on neuropsychological test profiles?
Australian and New Zealand Journal of Psychiatry, 21, 87–93. http://
dx.doi.org/10.3109/00048678709160904
Heaton, R. K., Miller, S. W., Taylor, M. J., & Grant, I. (2004). Revised comprehensive norms for an expanded Halstead-Reitan battery: Demographically adjusted neuropsychological norms for African American and Caucasian adults. Lutz, FL: Psychological Assessment Resources.
Heilbronner, R. L., Sweet, J. J., Morgan, J. E., Larrabee, G. J., Millis, S. R., & Conference Participants. (2009). American Academy of Clinical Neuropsychology Consensus Conference Statement on the neuropsychological assessment of effort, response bias, and malingering. The Clinical Neuropsychologist, 23, 1093–1129. http://dx.doi.org/10.1080/13854040903155063
Heinly, M. T., Greve, K. W., Bianchini, K. J., Love, J. M., & Brennan, A. (2005). WAIS digit span-based indicators of malingered neurocognitive dysfunction: Classification accuracy in traumatic brain injury. Assessment, 12, 429–444. http://dx.doi.org/10.1177/1073191105281099
Kim, M. S., Boone, K. B., Victor, T., Marion, S. D., Amano, S., Cottingham, M. E., ... Zeller, M. A. (2010). The Warrington Recognition Memory Test for words as a measure of response bias: Total score and response time cutoffs developed on “real world” credible and noncredible subjects. Archives of Clinical Neuropsychology, 25, 60–70. http://dx.doi.org/10.1093/arclin/acp088
Kim, N., Boone, K. B., Victor, T., Lu, P., Keatinge, C., & Mitchell, C. (2010). Sensitivity and specificity of a digit symbol recognition trial in the identification of response bias. Archives of Clinical Neuropsychology, 25, 420–428. http://dx.doi.org/10.1093/arclin/acq040
Lange, R. T., Iverson, G. L., Brickell, T. A., Staver, T., Pancholi, S., Bhagwat, A., & French, L. M. (2013). Clinical utility of the Conners’ Continuous Performance Test-II to detect poor effort in U.S. military personnel following traumatic brain injury. Psychological Assessment, 25, 339–352. http://dx.doi.org/10.1037/a0030915
Lansbergen, M. M., Kenemans, J. L., & van Engeland, H. (2007). Stroop interference and attention-deficit/hyperactivity disorder: A review and meta-analysis. Neuropsychology, 21, 251–262. http://dx.doi.org/10.1037/0894-4105.21.2.251
Larrabee, G. J. (2003). Detection of malingering using atypical performance patterns on standard neuropsychological tests. The Clinical Neuropsychologist, 17, 410–425. http://dx.doi.org/10.1076/clin.17.3.410.18089
Larrabee, G. J. (2008). Aggregation across multiple indicators improves the detection of malingering: Relationship to likelihood ratios. The Clinical Neuropsychologist, 22, 666–679. http://dx.doi.org/10.1080/13854040701494987
Larrabee, G. J. (2014). False-positive rates associated with the use of multiple performance and symptom validity tests. Archives of Clinical Neuropsychology, 29, 364–373. http://dx.doi.org/10.1093/arclin/acu019
Larson, M. J., Kaufman, D. A., Schmalfuss, I. M., & Perlstein, W. M. (2007). Performance monitoring, error processing, and evaluative control following severe TBI. Journal of the International Neuropsychological Society, 13, 961–971. http://dx.doi.org/10.1017/S1355617707071305
Leighton, A., Weinborn, M., & Maybery, M. (2014). Bridging the gap between neurocognitive processing theory and performance validity assessment among the cognitively impaired: A review and methodological approach. Journal of the International Neuropsychological Society, 20, 873–886. http://dx.doi.org/10.1017/S135561771400085X
Lezak, M. D. (1995). Neuropsychological assessment. New York, NY: Oxford University Press.
Lichtenstein, J. D., Erdodi, L. A., & Linnea, K. S. (2017). Introducing a forced-choice recognition task to the California Verbal Learning Test—Children’s Version. Child Neuropsychology, 23, 284–299.
Lippa, S. M., & Davis, R. N. (2010). Inhibition/switching is not necessarily harder than inhibition: An analysis of the D-KEFS color-word interference test. Archives of Clinical Neuropsychology, 25, 146–152. http://dx.doi.org/10.1093/arclin/acq001
Lu, P. H., Boone, K. B., Cozolino, L., & Mitchell, C. (2003). Effectiveness of the Rey-Osterrieth Complex Figure Test and the Meyers and Meyers recognition trial in the detection of suspect effort. The Clinical Neuropsychologist, 17, 426–440. http://dx.doi.org/10.1076/clin.17.3.426.18083
MacLeod, C. M., & MacDonald, P. A. (2000). Interdimensional interference in the Stroop effect: Uncovering the cognitive and neural anatomy of attention. Trends in Cognitive Sciences, 4, 383–391. http://dx.doi.org/10.1016/S1364-6613(00)01530-8
Ord, J. S., Boettcher, A. C., Greve, K. W., & Bianchini, K. J. (2010). Detection of malingering in mild traumatic brain injury with the Conners’ Continuous Performance Test-II. Journal of Clinical and Experimental Neuropsychology, 32, 380–387.
Osimani, A., Alon, A., Berger, A., & Abarbanel, J. M. (1997). Use of the Stroop phenomenon as a diagnostic tool for malingering. Journal of Neurology, Neurosurgery & Psychiatry, 62, 617–621. http://dx.doi.org/10.1136/jnnp.62.6.617
Pearson, N. C. S. (2009). Advanced clinical solutions for WAIS-IV and WMS-IV: Administration and scoring manual. San Antonio, TX: The Psychological Corporation.
Proto, D. A., Pastorek, N. J., Miller, B. I., Romesser, J. M., Sim, A. H., & Linck, J. F. (2014). The dangers of failing one or more performance validity tests in individuals claiming mild traumatic brain injury-related postconcussive symptoms. Archives of Clinical Neuropsychology, 29, 614–624. http://dx.doi.org/10.1093/arclin/acu044
Reedy, S. D., Boone, K. B., Cottingham, M. E., Glaser, D. F., Lu, P. H., Victor, T. L., ... Wright, M. J. (2013). Cross-validation of the Lu and colleagues (2003) Rey-Osterrieth Complex Figure Test effort equation in a large known-group sample. Archives of Clinical Neuropsychology, 28, 30–37. http://dx.doi.org/10.1093/arclin/acs106
Rees, L. M., Tombaugh, T. N., & Boulay, L. (2001). Depression and the Test of Memory Malingering. Archives of Clinical Neuropsychology, 16, 501–506. http://dx.doi.org/10.1093/arclin/16.5.501
Saykin, A. J., Stafiniak, P., Robinson, L. J., Flannery, K. A., Gur, R. C., O’Connor, M. J., & Sperling, M. R. (1995). Language before and after temporal lobectomy: Specificity of acute changes and relation to early risk factors. Epilepsia, 36, 1071–1077. http://dx.doi.org/10.1111/j.1528-1157.1995.tb00464.x
Schroeter, M. L., Ettrich, B., Schwier, C., Scheid, R., Guthke, T., & von Cramon, D. Y. (2007). Diffuse axonal injury due to traumatic brain injury alters inhibition of imitative response tendencies. Neuropsychologia, 45, 3149–3156. http://dx.doi.org/10.1016/j.neuropsychologia.2007.07.004
Schutte, C., Axelrod, B. N., & Montoya, E. (2015). Making sure neuropsychological data are meaningful: Use of performance validity testing in medicolegal and clinical contexts. Psychological Injury and Law, 8, 100–105. http://dx.doi.org/10.1007/s12207-015-9225-3
Shura, R. D., Miskey, H. M., Rowland, J. A., Yoash-Gantz, R. E., & Denning, J. H. (2016). Embedded performance validity measures with postdeployment veterans: Cross-validation and efficiency with multiple measures. Applied Neuropsychology: Adult, 23, 94–104. http://dx.doi.org/10.1080/23279095.2015.1014556
Slick, D. J., Sherman, E. M., & Iverson, G. L. (1999). Diagnostic criteria for malingered neurocognitive dysfunction: Proposed standards for clinical practice and research. The Clinical Neuropsychologist, 13, 545–561. http://dx.doi.org/10.1076/1385-4046(199911)13:04;1-Y;FT545
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Spencer, R. J., Axelrod, B. N., Drag, L. L., Waldron-Perrine, B., Pangilinan, P. H., & Bieliauskas, L. A. (2013). WAIS-IV reliable digit span is no more accurate than age-corrected scaled score as an indicator of invalid performance in a veteran sample undergoing evaluation for mTBI. The Clinical Neuropsychologist, 27, 1362–1372. http://dx.doi.org/10.1080/13854046.2013.845248
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643–662. http://dx.doi.org/10.1037/h0054651
Sugarman, M. A., & Axelrod, B. N. (2015). Embedded measures of performance validity using verbal fluency tests in a clinical sample. Applied Neuropsychology: Adult, 22, 141–146. http://dx.doi.org/10.1080/23279095.2013.873439
Suhr, J. A., & Boyer, D. (1999). Use of the Wisconsin Card Sorting Test in the detection of malingering in student simulator and patient samples. Journal of Clinical and Experimental Neuropsychology, 21, 701–708. http://dx.doi.org/10.1076/jcen.21.5.701.868
Trueblood, W. (1994). Qualitative and quantitative characteristics of malingered and other invalid WAIS-R and clinical memory data. Journal of Clinical and Experimental Neuropsychology, 16, 597–607. http://dx.doi.org/10.1080/01688639408402671
Vallabhajosula, B., & van Gorp, W. G. (2001). Post-Daubert admissibility of scientific evidence on malingering of cognitive deficits. Journal of the American Academy of Psychiatry and the Law, 29, 207–215.
Whiteside, D., Clinton, C., Diamonti, C., Stroemel, J., White, C., Zimberoff, A., & Waters, D. (2010). Relationship between suboptimal cognitive effort and the clinical scales of the Personality Assessment Inventory. The Clinical Neuropsychologist, 24, 315–325. http://dx.doi.org/10.1080/13854040903482822
Wolfe, P. L., Millis, S. R., Hanks, R., Fichtenberg, N., Larrabee, G. J., & Sweet, J. J. (2010). Effort indicators within the California Verbal Learning Test-II (CVLT-II). The Clinical Neuropsychologist, 24, 153–168. http://dx.doi.org/10.1080/13854040903107791
Received October 5, 2016
Revision received July 11, 2017
Accepted July 17, 2017
... Research over the last decades has considered a large number of validity measures, so-called performance validity tests (PVTs), which can be classified into stand-alone (freestanding) validity tests and validity indicators embedded within routine measures of neuropsychological functions (for an overview, see Boone, 2021). While stand-alone PVTs are considered the gold standard for producing high diagnostic accuracy in distinguishing credible and noncredible cognitive performance, embedded PVTs have certain advantages over stand-alone PVTs (Erdodi et al., 2014, 2018). Empirical evidence and current practice standards recommend the use of multiple PVTs to assess validity and stress the need to sample validity continuously throughout the assessment and across cognitive domains (e.g., Rhoads et al., 2021; Soble, 2021; Sweet et al., 2021). ...
... As a major limitation of the present study, it must be noted that only one PVT (i.e., the WMT) was used as the criterion to determine noncredible cognitive performance. The use of a single criterion PVT has been shown to distort accuracy estimates of the predictor PVT, especially when there is high congruence between predictor and criterion (see the domain-specificity hypothesis; Erdodi et al., 2018). In favor of the present study, the congruence between the cognitive domains of the predictor PVT (COG as an attention test) and the criterion PVT (WMT as a verbal memory test) was low, and the WMT is presumably one of the best-studied PVTs and a de facto gold standard in clinical and research settings. ...
Article
Full-text available
The assessment of performance validity is essential in any neuropsychological evaluation. However, relatively few measures exist that are based on attention performance embedded within routine cognitive tasks. The present study explores the potential value of a computerized attention test, the Cognitrone, as an embedded validity indicator in the neuropsychological assessment of early retirement claimants. Two hundred and sixty-five early retirement claimants were assessed with the Word Memory Test (WMT) and the Cognitrone. WMT scores were used as the independent criterion to determine performance validity. Speed and accuracy measures of the Cognitrone were submitted to receiver operating characteristic (ROC) analyses to classify group membership. The Cognitrone was sensitive in revealing attention deficits in early retirement claimants. Further, 54% (n = 143) of the individuals showed noncredible cognitive performance, whereas 46% (n = 122) showed credible cognitive performance. Individuals failing the performance validity assessment showed slower (AUC = 79.1%) and less accurate (AUC = 79.5%) attention performance than those passing the performance validity assessment. A compound score integrating speed and accuracy revealed incremental value, as indicated by AUC = 87.9%. Various cut scores are suggested, resulting in equal rates of 80% sensitivity and specificity (cut score = 1.297) or 69% sensitivity with 90% specificity (cut score = 0.734). The present study supports the sensitivity of the Cognitrone for the assessment of attention deficits in early retirement claimants and its potential value as an embedded validity indicator. Further research on different samples and with multidimensional criteria for determining invalid performance is required before clinical application can be suggested.
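As an illustration of how the accuracy statistics quoted in these abstracts are derived, the sketch below computes sensitivity, specificity, and the Mann-Whitney form of the AUC for a single cut score. The scores and the cutoff are invented for demonstration purposes and are not data from any of the studies cited here.

```python
def sensitivity_specificity(noncredible, credible, cut):
    """Classification accuracy of a 'score <= cut' invalidity rule,
    assuming lower scores indicate stronger evidence of noncredible
    performance (as with age-corrected scaled scores)."""
    sensitivity = sum(s <= cut for s in noncredible) / len(noncredible)
    specificity = sum(s > cut for s in credible) / len(credible)
    return sensitivity, specificity

def auc(noncredible, credible):
    """Area under the ROC curve via the Mann-Whitney formulation:
    the probability that a randomly drawn noncredible score falls
    below a randomly drawn credible score, counting ties as half."""
    pairs = [(n, c) for n in noncredible for c in credible]
    wins = sum(1.0 if n < c else 0.5 if n == c else 0.0 for n, c in pairs)
    return wins / len(pairs)

# Invented scaled scores, purely for illustration.
noncredible = [3, 4, 5, 6, 7]
credible = [7, 8, 9, 10, 11, 12]
sens, spec = sensitivity_specificity(noncredible, credible, cut=6)
```

Sweeping `cut` over the observed score range and recording each (sensitivity, 1 - specificity) pair traces the ROC curve from which "optimal" cut scores, like those reported above, are typically selected.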
... The Stroop task is one of the most favored neuropsychological tests for investigating cognitive impairments in attention and inhibitory control [31, 60-62]. The behavioral results in this paper confirm the claim that both the reaction times and accuracy rates are statistically different between healthy controls and patients with neuropsychiatric diseases, as seen in Tables 3 and 4 and in Figs. 4(a)-4(d). ...
... In fact, the classification accuracy of this metric between healthy controls and diseased subjects was 76.25%, as seen in Fig. 8(a). In a study by Erdodi et al. [60], the classification accuracy of inverted Stroop test metrics between healthy controls and patients clinically referred for neuropsychological assessment was found to be less sensitive (14% to 25%) but comparably specific (85% to 90%), whereas the findings in this study were contradictory, with very high sensitivity (100%) but lower specificity (77.6%) for this metric. Certainly there are many differences between that study and this one, especially in the choice of subject groups, the Stroop test employed, and the analysis parameters, but it is evident that behavioral parameters alone cannot yield high classification accuracies for neuropsychological assessment. ...
Article
Significance: Clinical use of fNIRS-derived features has long suffered from low sensitivity and specificity due to signal contamination from background systemic physiological fluctuations. We provide an algorithm to extract cognition-related features by eliminating the effect of background signal contamination, hence improving classification accuracy. Aim: To investigate the classification accuracy of an fNIRS-derived biomarker based on global efficiency (GE). To this end, fNIRS data were collected during a computerized Stroop task from healthy controls and patients with migraine, obsessive-compulsive disorder, and schizophrenia. Approach: Functional connectivity (FC) maps were computed from [HbO] time series data for neutral (N), congruent (C), and incongruent (I) stimuli using the partial correlation approach. Reconstruction of FC matrices with an optimal choice of principal components yielded two independent networks: the cognitive mode network (CM) and the default mode network (DM). Results: GE values computed for each FC matrix after applying principal component analysis (PCA) yielded strong statistical significance, leading to higher specificity and accuracy. A new index, the neurocognitive ratio (NCR), was computed by multiplying the cognitive quotient (CQ) by the ratio of the GE of CM to the GE of DM. When mean NCR values over all stimuli were computed, they showed high sensitivity (100%), specificity (95.5%), and accuracy (96.3%) for all subject groups. Conclusions: The mean NCR can reliably be used as a biomarker to improve the classification of healthy controls versus neuropsychiatric patients.
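For readers unfamiliar with the graph metric underlying this index: global efficiency is the mean inverse shortest-path length across all node pairs, and the NCR multiplies the cognitive quotient by the ratio of CM to DM efficiency, per the definition quoted in the abstract above. A minimal sketch, in which the distance matrix and input values are illustrative and shortest-path lengths are assumed precomputed:

```python
def global_efficiency(dist):
    """Mean of 1/d(i, j) over all ordered node pairs i != j, where
    dist[i][j] is the shortest path length between nodes i and j."""
    n = len(dist)
    total = sum(1.0 / dist[i][j]
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def neurocognitive_ratio(cq, ge_cm, ge_dm):
    # NCR = CQ * (GE of cognitive mode network / GE of default mode
    # network), following the definition quoted in the abstract.
    return cq * ge_cm / ge_dm

# Illustrative 3-node distance matrix (symmetric, zero diagonal).
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
ge = global_efficiency(dist)
```

In practice the distances would come from a shortest-path routine run over the (inverted) connectivity weights of each reconstructed FC matrix.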
... Stroop test (ST). An experimental trial-by-trial version of the ST (Erdodi et al., 2018; Jensen, 1965) was employed. Stimuli consisted of 48 congruent and incongruent colored (green, yellow, and blue) words. ...
Article
Almost 30% of ADHD adults do not respond to standard pharmaceuticals. Transcranial direct current stimulation (tDCS) is a method for modulation of cortical excitability. On the other hand, dialectical behavioral therapy (DBT) is a cognitive-behavioral approach that might be utilized for adults with ADHD. The effects of integration of these interventions are only beginning to be explored. In the present work, we used both subjective and objective measures to investigate the effects of tDCS, DBT, and the integration of the two in treating adult ADHD symptoms. A total of 80 adults with ADHD (63 females, 17 males) participated in the study and were grouped into control, DBT, tDCS, and combined groups. Based on the observed results, the combination of DBT and tDCS was significantly effective in improving the mentioned variables compared to administration of each method in isolation. The results are discussed in terms of neurophysiological and psychological aspects of treatment methods.
... The EFT is widely used to measure inhibition ability and selective attention [47]. The ST is commonly used to measure the inhibition of cognitive interference, processing speed, cognitive flexibility, and executive function [48]. The RVIPT was used to assess target sensitivity, decision-making, and sustained attention and alertness [49]. ...
Article
Full-text available
This study aimed to investigate the effects of caffeine on performances of simulated match, Wingate Anaerobic Test (WAnT), and cognitive function test of elite taekwondo athletes. Ten elite taekwondo athletes in Hong Kong volunteered to participate in two main trials in a randomized double-blinded crossover design. In each main trial, 1 h after consuming a drink with caffeine (CAF) or a placebo drink without caffeine (PLA), the participants completed two simulated taekwondo match sessions followed by the WAnT. The participants were instructed to complete three cognitive function tests, namely the Eriksen Flanker Test (EFT), Stroop Test, and Rapid Visual Information Processing Test, at baseline, before exercise, and immediately after the simulated matches. They were also required to wear functional near-infrared spectroscopy equipment during these tests. Before exercise, the reaction time in the EFT was shorter in the CAF trial than in the PLA trial (PLA: 494.9 ± 49.2 ms vs. CAF: 467.9 ± 38.0 ms, p = 0.035). In the WAnT, caffeine intake increased the peak power and mean power per unit of body weight (by approximately 13% and 6%, respectively, p = 0.018 & 0.042). The performance in the simulated matches was not affected by caffeine intake (p = 0.168). In conclusion, caffeine intake enhances anaerobic power and may improve certain cognitive functions of elite taekwondo athletes in Hong Kong. However, this may not be enough to improve the simulated match performance.
... In a mixed clinical sample of 234 adults referred for neuropsychological assessment, the Borderline range was significantly different from both Pass (i.e., stronger evidence of non-credible responding) and Fail (i.e., weaker evidence of non-credible responding). These findings are consistent with the results of previous (Erdodi & Rai, 2017; Erdodi, Sagar, et al., 2018; Erdodi, Seke, et al., 2017) and subsequent (Cutler et al., 2021; Dunn et al., 2021; Erdodi, Hurtubise, et al., 2020) investigations. ...
Article
Full-text available
This study was designed to examine the classification accuracy of the Erdodi Index (EI-5), a novel method for aggregating validity indicators that takes into account both the number and extent of performance validity test (PVT) failures. Archival data were collected from a mixed clinical/forensic sample of 452 adults referred for neuropsychological assessment. The classification accuracy of the EI-5 was evaluated against established free-standing PVTs. The EI-5 achieved a good combination of sensitivity (.65) and specificity (.97), correctly classifying 92% of the sample. Its classification accuracy was comparable to that of another free-standing PVT. An indeterminate range between Pass and Fail emerged as a legitimate third outcome of performance validity assessment, indicating that the underlying construct is an inherently continuous variable. Results support the use of the EI-model as a practical and psychometrically sound method of aggregating multiple embedded PVTs into a single-number summary of performance validity. Combining free-standing PVTs with the EI-5 resulted in a better separation between credible and non-credible profiles, demonstrating incremental validity. Findings are consistent with recent endorsements of a three-way outcome for PVTs (Pass, Borderline and Fail).
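The aggregation logic described in the abstract above (weighing both the number and the extent of PVT failures, with a Borderline band between Pass and Fail) can be sketched with a toy scorer. The point weights, cutoff pairs, and band thresholds below are hypothetical placeholders for illustration only, not the published EI-5 parameters.

```python
def ei_points(score, borderline_cut, fail_cut):
    """Grade one embedded PVT: 0 = pass, 1 = borderline miss,
    2 = clear failure (lower scores = worse performance).
    The cutoff values supplied are hypothetical."""
    if score <= fail_cut:
        return 2
    if score <= borderline_cut:
        return 1
    return 0

def ei_band(scores, cutoffs):
    """Sum the graded failures across all PVTs, then band the total
    into the three-way outcome endorsed in the abstract."""
    total = sum(ei_points(s, b, f) for s, (b, f) in zip(scores, cutoffs))
    if total >= 4:
        return "Fail"
    if total >= 2:
        return "Borderline"
    return "Pass"

# Same illustrative (borderline, fail) cutoff pair for five PVTs.
CUTS = [(6, 4)] * 5
```

The key design choice, mirroring the rationale in the abstract, is that one clear failure and two borderline misses can carry the same weight, so the index reflects the overall gradient of evidence rather than a simple count of dichotomous failures.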
... In the Stroop test [31,32] (duration: 2 min), the words "RED" or "GREEN" appeared in the central area of the screen, randomly shown in red or green. The participants were asked to press the "Z" button when the color of the word was red, and to press the "/" button when the color of the word was green, regardless of the meaning of the words. ...
Article
Full-text available
Fifteen participants were exposed to elevated carbon dioxide (CO2) concentrations in an enclosed environmental chamber to investigate the effects on their cognitive abilities. Three CO2 conditions (1500, 3500, and 5000 ppm) were achieved by a constant air supply and additional ultrapure CO2. All participants received the same exposure under each condition, during which they performed six cognitive tests evaluating human perception, attention, short-term working memory, risky decision-making, and executive ability. Generalized additive mixed effects model (GAMM) results showed no statistically significant differences in performance on the reaction time (RT) tests, the speed perception test, and the 2-back test. This suggests that elevated CO2 concentrations below 5000 ppm did not affect participants’ perception and short-term working memory. However, a significant increase in response time was observed in the visual search (VS) test, the balloon simulation risk test (BART), and the Stroop test at 5000 ppm compared to lower exposure concentrations. The slower responses reflected the detrimental effects of elevated CO2 concentrations on visual attention, risky decision-making, and executive ability. The findings suggest that CO2 concentrations should be controlled more tightly in enclosed workplaces where rapid response and operational safety are required.
... Of note, embedded validity indicators have also been derived from and validated using other Stroop paradigms, including the Comalli version (Arentsen et al., 2013) and the Delis-Kaplan Executive Function System Color-Word Interference test (Eglit et al., 2020; Erdodi et al., 2018), but these will not be reviewed here given that the focus of our article is the traditional SCWT. ...
Article
This study investigated the utility of four Stroop Color and Word Test (SCWT) indices, including the raw score and T score for the word reading (WR) and color naming (CN) trials, as embedded performance validity tests (PVTs) within a sample referred for evaluation of suspected or known attention-deficit/hyperactivity disorder (ADHD). Data were analyzed from a final sample of 317 patients consecutively referred for ADHD evaluation, which was divided into groups with invalid (n = 43; 14%) and valid neuropsychological test performance (n = 274; 86%). A subset of the valid group with confirmed ADHD diagnoses (n = 226; 71%) were also analyzed separately. Classification accuracy for the overall valid sample was in the acceptable range (AUCs = .757-.794), with optimal cut scores of WR raw ≤75 (54% sensitivity/90% specificity), WR T score ≤ 28 (54% sensitivity/88% specificity), CN raw ≤57 (42% sensitivity/90% specificity), and CN T score ≤ 30 (40% sensitivity/90% specificity). Classification accuracy was also in the acceptable range for the ADHD-confirmed subgroup (AUCs = .750-.790), with optimal cut scores of WR Raw ≤ 75 (54% sensitivity/89% specificity), WR T score ≤ 28 (54% sensitivity/87% specificity), CN Raw ≤ 57 (42% sensitivity/90% specificity), and CN T score ≤ 30 (40% sensitivity/90% specificity). These findings indicate that embedded PVTs derived from the SCWT, particularly those derived from the WR trial, are effective measures for determining validity status in samples with suspected or confirmed ADHD. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
Article
Full-text available
Objective The study was designed to expand on the results of previous investigations on the D-KEFS Stroop as a performance validity test (PVT), which produced diverging conclusions. Method The classification accuracy of previously proposed validity cutoffs on the D-KEFS Stroop was computed against four different criterion PVTs in two independent samples: patients with uncomplicated mild TBI (n = 68) and disability benefit applicants (n = 49). Results Age-corrected scaled scores (ACSSs) ≤6 on individual subtests often fell short of specificity standards. Making the cutoffs more conservative improved specificity, but at a significant cost to sensitivity. In contrast, multivariate models (≥3 failures at ACSS ≤6 or ≥2 failures at ACSS ≤5 on the four subtests) produced good combinations of sensitivity (.39-.79) and specificity (.85-1.00), correctly classifying 74.6-90.6% of the sample. A novel validity scale, the D-KEFS Stroop Index correctly classified between 78.7% and 93.3% of the sample. Conclusions A multivariate approach to performance validity assessment provides a methodological safeguard against sample- and instrument-specific fluctuations in classification accuracy, strikes a reasonable balance between sensitivity and specificity, and mitigates the invalid before impaired paradox.
Article
Introduction: Research has shown non-trivial base rates of noncredible symptom report and performance in the clinical evaluation of attention-deficit/hyperactivity disorder (ADHD) in adulthood. The goal of this study is to estimate and replicate base rates of symptom and performance validity test failure in the clinical evaluation of adult ADHD and derive prediction models based on routine clinical measures. Methods: This study reuses data of a previous publication of 196 adults seeking ADHD assessment and replicates the findings on an independent sample of 700 adults recruited in the same referral context. Measures of symptom and performance validity (one SVT, two PVTs) were applied to estimate base rates. Prediction models were developed using machine learning. Results: Both samples showed substantial rates of noncredible symptom report (one SVT failure: 35.7% - 36.6%), noncredible test performance (one PVT failure: 32.1% - 49.3%; two PVT failures: 18.9% - 27.3%), or both (each one SVT and PVT failure: 13.3% - 22.4%; one SVT and two PVT failures: 9.7% - 13.7%). Machine learning algorithms resulted in generally moderate to weak prediction models, with advantages of the reused sample compared to the independent replication sample. Associations between measures of symptom and performance validity were negligible to small. Conclusions: This study highlights the necessity to include measures of symptom and performance validity in the clinical evaluation of adult ADHD. Further, this study demonstrates the difficulty to characterize the group failing symptom or performance validity assessment.
Article
Today medical imagers can pinpoint the location of brain tumors and chart how far stroke damage has spread. However, there is another important, if less well-publicized profession – clinical neuropsychology – which develops, administers, and interprets diagnostic tests that allow for the assessment of the real-world consequences of these brain lesions and advises if there are prospects for remediation.
Article
Full-text available
Elevations on certain Conners’ CPT-II scales are known to be associated with invalid responding. However, scales and cutoffs vary across studies. In addition, the methodology behind developing performance validity tests (PVTs) has been challenged for mistaking true impairment for noncredible presentation. Using ability-based tests as a PVT makes clinicians especially vulnerable to this criticism. The present study examined the ability of the CPT-II to dissociate effort from impairment in 47 adults clinically referred for neuropsychological assessment. CPT-II scales previously identified as PVTs (Omissions, Commissions, Hit Reaction Time SE, Variability, and Perseverations) produced classification accuracies hovering around .50 sensitivity at .90 specificity. The subsample that failed these PVTs performed within normal range on other tests of working memory, processing speed, visual attention, and executive function. Results suggest that the select CPT-II based PVTs are sensitive to invalid responding, and are associated with depression and anxiety, but are unrelated to cognitive functioning.
Article
Full-text available
Introduction: The Recognition Memory Test (RMT) and Word Choice Test (WCT) are structurally similar, but psychometrically different. Previous research demonstrated that adding a time-to-completion cutoff improved the classification accuracy of the RMT. However, the contribution of WCT time-cutoffs to improving the detection of invalid responding has not been investigated. The present study was designed to evaluate the classification accuracy of time-to-completion on the WCT compared to the accuracy score and the RMT. Method: Both tests were administered to 202 adults (Mage = 45.3 years, SD = 16.8; 54.5% female) clinically referred for neuropsychological assessment in counterbalanced order as part of a larger battery of cognitive tests. Results: Participants obtained lower and more variable scores on the RMT (M = 44.1, SD = 7.6) than on the WCT (M = 46.9, SD = 5.7). Similarly, they took longer to complete the recognition trial on the RMT (M = 157.2 s, SD = 71.8) than the WCT (M = 137.2 s, SD = 75.7). The optimal cutoff on the RMT (≤43) produced .60 sensitivity at .87 specificity. The optimal cutoff on the WCT (≤47) produced .57 sensitivity at .87 specificity. Time-cutoffs produced comparable classification accuracies for both RMT (≥192 s; .48 sensitivity at .88 specificity) and WCT (≥171 s; .49 sensitivity at .91 specificity). They also identified an additional 6-10% of the invalid profiles missed by accuracy score cutoffs, while maintaining good specificity (.93-.95). Functional equivalence was reached at accuracy scores ≤43 (RMT) and ≤47 (WCT) or time-to-completion ≥192 s (RMT) and ≥171 s (WCT). Conclusions: Time-to-completion cutoffs are valuable additions to both tests. They can function as independent validity indicators or enhance the sensitivity of accuracy scores without requiring additional measures or extending standard administration time.
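The decision rule implied by these findings is a simple OR over two channels: flag a profile when either the accuracy score falls at or below its cutoff or time-to-completion meets or exceeds its cutoff. A sketch using the cutoffs reported in the abstract above:

```python
# Cutoffs reported in the abstract: RMT accuracy <= 43 or time >= 192 s;
# WCT accuracy <= 47 or time >= 171 s.
RMT_ACC_CUT, RMT_TIME_CUT = 43, 192
WCT_ACC_CUT, WCT_TIME_CUT = 47, 171

def flag_invalid(accuracy, seconds, acc_cut, time_cut):
    # Flag when EITHER channel trips: low accuracy OR slow completion.
    # The OR combination is what lets the time channel catch the extra
    # 6-10% of invalid profiles that the accuracy cutoff alone misses.
    return accuracy <= acc_cut or seconds >= time_cut
```

The cost of an OR rule is a higher false-positive rate than either channel alone, which is why the abstract reports the combined specificity (.93-.95) separately.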
Article
Full-text available
Scores on the Complex Ideational Material (CIM) were examined in reference to various performance validity tests (PVTs) in 106 adults clinically referred for neuropsychological assessment. The main diagnostic categories, reflecting a continuum between neurological and psychiatric disorders, were epilepsy, psychiatric disorders, postconcussive disorder, and psychogenic non-epileptic seizures. Cross-validation analyses suggest that in the absence of bona fide aphasia, a raw score ≤9 or T score ≤29 on the CIM is more likely to reflect non-credible presentation than impaired receptive language skills. However, these cutoffs may be associated with unacceptably high false positive rates in patients with longstanding, documented neurological deficits. Therefore, more conservative cutoffs (≤8/23) are recommended in such populations. Contrary to the widely accepted assumption that psychiatric disorders are unrelated to performance validity, results were consistent with the psychogenic interference hypothesis, suggesting that emotional distress increases the likelihood of PVT failures even in the absence of apparent external incentives to underperform on cognitive testing.
Article
Full-text available
Research suggests that select processing speed measures can also serve as embedded validity indicators (EVIs). The present study examined the diagnostic utility of Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV) subtests as EVIs in a mixed clinical sample of 205 patients medically referred for neuropsychological assessment (53.3% female, mean age = 45.1). Classification accuracy was calculated against 3 composite measures of performance validity as criterion variables. A PSI ≤79 produced a good combination of sensitivity (.23-.56) and specificity (.92-.98). A Coding scaled score ≤5 resulted in good specificity (.94-1.00), but low and variable sensitivity (.04-.28). A Symbol Search scaled score ≤6 achieved a good balance between sensitivity (.38-.64) and specificity (.88-.93). A Coding-Symbol Search scaled score difference ≥5 produced adequate specificity (.89-.91) but consistently low sensitivity (.08-.12). A 2-tailed cutoff on the Coding/Symbol Search raw score ratio (≤1.41 or ≥3.57) produced acceptable specificity (.87-.93), but low sensitivity (.15-.24). Failing ≥2 of these EVIs produced variable specificity (.81-.93) and sensitivity (.31-.59). Failing ≥3 of these EVIs stabilized specificity (.89-.94) at a small cost to sensitivity (.23-.53). Results suggest that processing speed based EVIs have the potential to provide a cost-effective and expedient method for evaluating the validity of cognitive data. Given their generally low and variable sensitivity, however, they should not be used in isolation to determine the credibility of a given response set. They also produced unacceptably high rates of false positive errors in patients with moderate-to-severe head injury. Combining evidence from multiple EVIs has the potential to improve overall classification accuracy.
Article
Full-text available
Complex Ideational Material (CIM) is a sentence comprehension task designed to detect pathognomonic errors in receptive language. Nevertheless, patients with apparently intact language functioning occasionally score in the impaired range. If these instances reflect poor test-taking effort, CIM has potential as a performance validity test (PVT). Indeed, in 68 adults medically referred for neuropsychological assessment, CIM was a reliable marker of psychometrically defined invalid responding. A raw score ≤9 or T-score ≤29 achieved acceptable combinations of sensitivity (.34-.40) and specificity (.82-.90) against two reference PVTs, and produced a zero overall false positive rate when scores on all available PVTs were considered. More conservative cutoffs (≤8/≤23) with higher specificity (.95-1.00) but lower sensitivity (.14-.17) may be warranted in patients with longstanding, documented neurological deficits. Overall, results indicate that in the absence of overt aphasia, poor performance on CIM is more likely to reflect invalid responding than true language impairment. Implications for the clinical interpretation of CIM are discussed.
Article
Full-text available
The importance of performance validity tests (PVTs) is increasingly recognized in pediatric neuropsychology. To date, research has focused on investigating whether PVTs designed for adults function similarly in children. The downward extension of adult cutoffs is counter-intuitive considering the robust effect of age-related changes in basic cognitive skills in children and adolescents. The purpose of this study was to examine the signal detection properties of a forced-choice recognition trial (FCR-C) for the California Verbal Learning Test - Children's Version. A total of 72 children aged 6-15 years (M = 11.1, SD = 2.6) completed the FCR-C as part of a larger neuropsychological assessment battery. Cross-validation analyses revealed that the FCR-C had good signal detection performance against reference PVTs. The first level of failure (≤14/15) produced the best combination of overall sensitivity (.31) and specificity (.87). A more conservative FCR-C cutoff (≤13) resulted in a predictable trade-off between sensitivity (.15) and specificity (.94), but also a net loss in discriminant power. Lowering the cutoff to ≤12 resulted in a slight improvement in specificity (.97) but further deterioration in sensitivity (.14). These preliminary findings suggest that the FCR-C has the potential to become the newest addition to a growing arsenal of pediatric PVTs.
Article
Full-text available
Objective: The American Academy of Clinical Neuropsychology (AACN) sought to provide independent expert guidance and recommendations concerning the use of validity testing in disability determinations. Method: A panel of contributors to the science of validity testing and its application to the disability process was charged with describing why the disability process for the Social Security Administration (SSA) needs improvement, and indicating the necessity for validity testing in disability exams. Results: This work showed how the determination of malingering is a probability proposition, described how different types of validity tests are appropriate, provided evidence concerning non-credible findings in children and low-functioning individuals, and discussed the appropriate evaluation of pain disorders typically seen outside of mental consultations. Conclusions: A scientific plan for validity assessment that additionally protects test security is needed in disability determinations and in research on classification accuracy of disability decisions.
Article
Full-text available
Embedded validity measures support comprehensive assessment of performance validity. The purpose of this study was to evaluate the accuracy of individual embedded measures and to reduce them to the most efficient combination. The sample included 212 postdeployment veterans (average age = 35 years, average education = 14 years). Thirty embedded measures were initially identified as predictors of Green’s Word Memory Test (WMT) and were derived from the California Verbal Learning Test-Second Edition (CVLT-II), Conners’ Continuous Performance Test-Second Edition (CPT-II), Trail Making Test, Stroop, Wisconsin Card Sorting Test-64, the Wechsler Adult Intelligence Scale-Third Edition Letter-Number Sequencing, Rey Complex Figure Test (RCFT), Brief Visuospatial Memory Test-Revised, and the Finger Tapping Test. Eight nonoverlapping measures with the highest area-under-the-curve (AUC) values were retained for entry into a logistic regression analysis. Embedded measure accuracy was also compared to cutoffs found in the existing literature. Twenty-one percent of the sample failed the WMT. Previously developed cutoffs for individual measures showed poor sensitivity (SN) in the current sample except for the CPT-II (Total Errors, SN = .41). The CVLT-II (Trials 1–5 Total) showed the best overall accuracy (AUC = .80). After redundant measures were statistically eliminated, the model included the RCFT (Recognition True Positives), CPT-II (Total Errors), and CVLT-II (Trials 1–5 Total) and increased overall accuracy compared with the CVLT-II alone (AUC = .87). The combination of just 3 measures from the CPT-II, CVLT-II, and RCFT provided the most accurate and efficient prediction of WMT performance.
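Several of the abstracts above converge on a multivariate approach: rather than relying on any single indicator, a profile is flagged only when it fails a minimum number of embedded validity indicators. A minimal sketch of that aggregation logic follows; the indicator names, cutoffs, and patient scores are hypothetical placeholders, not the actual measures or thresholds validated in these studies.

```python
# Illustrative sketch of multivariate performance validity screening:
# count failures across several embedded validity indicators (EVIs) and
# flag a profile only when it fails a minimum number of them.
# All indicator names, cutoffs, and scores below are hypothetical.

EVI_CUTOFFS = {
    "coding_ss": 5,         # scaled score <= 5 counts as a failure
    "symbol_search_ss": 6,  # scaled score <= 6 counts as a failure
    "psi": 79,              # index score <= 79 counts as a failure
}

def count_failures(profile):
    """Count how many EVIs in the profile fall at or below their cutoff."""
    return sum(profile[name] <= cutoff for name, cutoff in EVI_CUTOFFS.items())

def flag_invalid(profile, min_failures=2):
    """Flag the response set as non-credible if it fails >= min_failures EVIs."""
    return count_failures(profile) >= min_failures

# Hypothetical patient profile: fails two of the three indicators
patient = {"coding_ss": 5, "symbol_search_ss": 7, "psi": 76}
print(count_failures(patient), flag_invalid(patient))
```

Requiring two or more failures rather than one is what stabilizes specificity in the studies above: a single low score can reflect genuine impairment, while multiple independent failures are less plausibly explained that way.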