Psychological Assessment
The Stroop Test as a Measure of Performance Validity in
Adults Clinically Referred for Neuropsychological Assessment
Laszlo A. Erdodi, Sanya Sagar, Kristian Seke, Brandon G. Zuccato, Eben S. Schwartz, and Robert
M. Roth
Online First Publication, February 22, 2018.
Erdodi, L. A., Sagar, S., Seke, K., Zuccato, B. G., Schwartz, E. S., & Roth, R. M. (2018, February 22).
The Stroop Test as a Measure of Performance Validity in Adults Clinically Referred for
Neuropsychological Assessment. Psychological Assessment. Advance online publication.
The Stroop Test as a Measure of Performance Validity in Adults Clinically
Referred for Neuropsychological Assessment
Laszlo A. Erdodi, Sanya Sagar, Kristian Seke,
and Brandon G. Zuccato
University of Windsor
Eben S. Schwartz
Waukesha Memorial Hospital, Waukesha, Wisconsin
Robert M. Roth
Geisel School of Medicine at Dartmouth/Dartmouth-Hitchcock Medical Center
This study was designed to develop performance validity indicators embedded within the Delis-Kaplan
Executive Function Systems (D-KEFS) version of the Stroop task. Archival data from a mixed clinical
sample of 132 patients (50% male; M age = 43.4 years; M education = 14.1 years) clinically referred for neuropsychological
assessment were analyzed. Criterion measures included the Warrington Recognition Memory
Test—Words and 2 composites based on several independent validity indicators. An age-corrected scaled
score ≤6 on any of the 4 trials reliably differentiated psychometrically defined credible and noncredible
response sets with high specificity (.87–.94) and variable sensitivity (.34–.71). An inverted Stroop effect
was less sensitive (.14–.29), but comparably specific (.85–.90) to invalid performance. Aggregating the
newly developed D-KEFS Stroop validity indicators further improved classification accuracy. Failing the
validity cutoffs was unrelated to self-reported depression or anxiety. However, it was associated with
elevated somatic symptom report. In addition to processing speed and executive function, the D-KEFS
version of the Stroop task can function as a measure of performance validity. A multivariate approach
to performance validity assessment is generally superior to univariate models.
Public Significance Statement
The Stroop test can function as a performance validity indicator by identifying unusual patterns of
responding. Invalid performance was associated with higher levels of self-reported somatic symptoms.
Keywords: Stroop task, performance validity, embedded validity indicators
The validity of the neuropsychological evaluation hinges on the
examinees’ ability and willingness to demonstrate their typical
level of cognitive functioning (Bigler, 2015). Therefore, there is a
broad consensus within the profession that a thorough performance
validity assessment is an essential part of the examination (Bush,
Ruff, & Heilbronner, 2014; Chafetz et al., 2015; Heilbronner et al.,
2009). As a result, the administration of multiple, nonredundant
performance validity tests (PVTs) has become a widely accepted
practice standard (Boone, 2013; Larrabee, 2014).
Although stand-alone instruments are considered the gold stan-
dard for validity assessment (Green, 2013), embedded validity
indicators (EVIs) are increasing in popularity. EVIs are derived
from traditional neuropsychological tests originally designed to
measure cognitive ability, but were subsequently coopted as PVTs.
Many EVIs have strong empirical support and a long presence in
the research literature. Some predate the most acclaimed stand-alone
PVTs, such as those based on verbal fluency (Hayward, Hall,
Hunt, & Zubrick, 1987), digit span (Greiffenstein, Baker, & Gola,
1994), or symbol substitution (Trueblood, 1994) tasks.
EVIs have several advantages over stand-alone PVTs. First,
they allow clinicians to use multiple validity indicators without
adding new measures, resulting in significant savings in test ma-
terial and administration time. Compressing the battery also lowers
the demand on patients’ mental stamina, which is especially im-
portant when assessing individuals with complex medical and
psychiatric history (Lichtenstein, Erdodi, & Linnea, 2017). EVIs
may also be more resistant to coaching, as they are less likely to be
identified as PVTs than stand-alone instruments (Chafetz et al.,
2015; Schutte, Axelrod, & Montoya, 2015). Finally, they automat-
ically address concerns about the generalizability of the PVT
scores to the rest of the battery (Bigler, 2014). Overall, these
features enable EVIs to achieve the ideal of ongoing monitoring of
test-taking effort (Boone, 2009) without placing a significant ad-
ditional burden on either the examiner or examinee.
Laszlo A. Erdodi and Sanya Sagar, Department of Psychology, Univer-
sity of Windsor; Kristian Seke, Brain-Cognition-Neuroscience Program,
University of Windsor; Brandon G. Zuccato, Department of Psychology,
University of Windsor; Eben S. Schwartz, Waukesha Memorial Hospital,
Waukesha, Wisconsin; Robert M. Roth, Geisel School of Medicine at
Dartmouth/Dartmouth-Hitchcock Medical Center.
Correspondence concerning this article should be addressed to Laszlo A.
Erdodi, 168 Chrysler Hall South, 401 Sunset Avenue, Windsor, ON N9B
3P4, Canada. E-mail:
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Psychological Assessment © 2018 American Psychological Association
2018, Vol. 1, No. 2, 000 1040-3590/18/$12.00
The Stroop (1935) paradigm, of which there are many variants,
has the potential to function as an EVI. The task usually consists
of at least three trials (MacLeod & MacDonald, 2000). In the first
trial, the participant is asked to read a series of color words, printed
in black ink, as quickly as possible. In the second trial, the
participant is asked to look at a series of color squares, and name
the colors as quickly as possible. The third trial is the test of
interference and the evoker of the classic Stroop effect: the par-
ticipant is asked to look at a series of color words, printed in
incongruent ink colors, and name the color of the ink instead of
reading the word, as quickly as possible. For example, if the word
“red” is printed in green ink, the examinee is asked to say “green”
instead of “red.” Because reading words is more automatized than
naming ink colors, inhibiting the overlearned response requires
additional cognitive resources, which results in increased comple-
tion time relative to the word reading and color naming trials
(MacLeod & MacDonald, 2000).
The Stroop task within the Delis-Kaplan Executive Function
System (D-KEFS; Delis, Kaplan, & Kramer, 2001) includes a
fourth trial (inhibition/switching) designed to further increase the
cognitive load by requiring examinees to switch back and forth
between two sets of rules. On Trial 4, half of the words are
enclosed in boxes. The examinee is instructed to name the color of
the ink for free-standing items (as in Trial 3 of the classic Stroop
task), but read the word (rather than name the ink color) for items
inside a box. Trial 4 was meant to be more difficult than the
interference trial to capture more subtle executive deficits. How-
ever, the empirical evidence on this difficulty gradient is mixed
(Lippa & Davis, 2010).
The Stroop paradigm has been shown to be sensitive to neuro-
psychiatric conditions with executive dysfunction as a common
feature, such as traumatic brain injury (TBI; Larson, Kaufman,
Schmalfuss, & Perlstein, 2007; Schroeter et al., 2007) and
attention-deficit-hyperactivity disorder (ADHD; Lansbergen, Ken-
emans, & Van Engeland, 2007). However, there is limited research
examining the utility of the Stroop paradigm as a measure of
noncredible performance. Arentsen and colleagues (2013) introduced
validity cutoffs for the word reading (≥66 s), color naming
(≥93 s), and interference (≥191 s) trials in the Comalli Stroop
Test (Comalli, Wapner, & Werner, 1962). All of these cutoffs
achieved specificity ≥.90 in a mixed clinical population, with
.29–.53 sensitivity.
A raw residual score (i.e., predicted score minus actual score)
of ≤−47 on the word reading trial of the Stroop Color and Word
Test (Golden & Freshwater, 2002) discriminated noncredible from
credible responders at .95 specificity and .29 sensitivity using
Slick, Sherman, and Iverson’s (1999) criteria for malingered neu-
rocognitive dysfunction (Guise, Thompson, Greve, Bianchini, &
West, 2014). Other studies (Egeland & Langfjaeran, 2007;
Osimani, Alon, Berger, & Abarbanel, 1997) have found that non-
credible performers may display slower overall reaction time (RT)
and an inverted Stroop effect (i.e., better performance on the
interference trial than the word reading or color naming trials).
While Osimani and colleagues (1997) did not perform signal
detection analyses, Egeland and Langfjaeran (2007) reported un-
acceptably low specificity (.59) for the inverted Stroop effect, even
though the majority of noncredible performers exhibited this vio-
lation of the difficulty gradient. Furthermore, the inverted Stroop
effect as an index of validity has not been replicated consistently
in the literature (Arentsen et al., 2013).
To our knowledge, the potential of the D-KEFS Stroop to
function as an EVI has not been investigated. Given its more
nuanced difficulty gradient because of the unique combined inhi-
bition/switching task (Trial 4), it may be particularly useful as a
measure of performance validity. The purpose of this study is to
examine the utility of the D-KEFS Stroop in differentiating cred-
ible and noncredible response sets in a clinical setting.
Data were collected from a consecutive sequence of 132 patients
(50% male, 89.4% right-handed), clinically referred for neuropsy-
chological assessment at a northeastern academic medical center.
The vast majority of them (>95%) were White, reflecting the
demographic composition of the region. Age (M = 43.4, SD = 16)
followed a bimodal distribution, with one peak around 20 years
and another around 55 years. Mean level of education was 14.1
years (SD = 2.8). Overall intellectual functioning was in the
average range (M = 101.2, SD = 16.4), as were scores on
a single word reading test (M = 104.4, SD
The most common primary diagnosis was psychiatric (46.2%),
followed by TBI (35.6%), neurological disorders (14.4%) and
other medical conditions (3.8%). Within the psychiatric sub-
sample, most patients had been diagnosed with depression
(45.9%), followed by somatic (19.7%) and anxiety disorders
(13.1%). The majority of the TBI patients (81.1%) had sustained a
mild injury. Likewise, the average self-reported depression was in
the mild range (M = 16.4, SD = 14.8). The largest group of patients
(45.6%) scored in the minimal range (≤13), while 21.6% scored in
the mild (14–19), 17.6% scored in the moderate (20–28), and
15.2% scored in the severe (≥29) range for self-reported depression.
Data were collected through a retrospective chart review from
patients assessed between December 2012 and July 2014. The
main inclusion criterion was a complete administration of the
D-KEFS Stroop. The study was approved by the ethics board of
the hospital where the data were collected, and that of the univer-
sity where the research project was finalized. Relevant guidelines
regulating research with human participants were followed
throughout the study.
The names and abbreviations of the tests administered are provided
in Table 1. The percentage of the sample with scores on each test
is also listed. A core battery of tests was administered to most
patients, while the rest of the instruments were selected based on
the specific referral question. Therefore, they vary from patient to patient.
The main stand-alone PVT was Warrington’s Recognition
Memory Test—Words (RMT). Failure was defined as an accuracy
score of ≤43 or a completion time of ≥192 s (Erdodi, Tyson, et
al., 2017). In addition, a composite of 11 validity indicators labeled
“Effort Index Eleven” (EI-11) was developed to provide a com-
prehensive measure of performance validity (Erdodi, Abeare, et
al., 2017; Erdodi, Kirsch, Lajiness-O’Neill, Vingilis, & Medoff,
2014; Erdodi & Roth, 2017). The constituent PVTs were dichotomized
into Pass (= 0) and Fail (= 1) along published cutoffs.
Some PVTs have multiple indicators; failing any indicator was
considered as failing the entire PVT (= 1). Failing multiple indicators
nested within the same measure was counted as a single
failure (= 1). Missing data were coded as Pass (= 0), although it
is recognized that this may increase error variance by potentially
misclassifying noncredible patients as credible.
The value of EI-11 is the sum of failures on its components.
Given the relatively large number of indicators, and that the most
liberal cutoffs were used to maximize sensitivity (see Table 2), the
EI-11 is prone to false positive errors by design. To correct for
that, the more conservative threshold of ≥3 independent PVT
Table 1
List of Tests Administered: Abbreviations, Scales, and Norms
| Test name | Abbreviation | Norms | % ADM |
| Beck Depression Inventory, 2nd Edition | BDI-II | — | 94.4 |
| Beck Anxiety Inventory | BAI | — | 69.7 |
| California Verbal Learning Test, 2nd Edition | CVLT-II | Manual | 100.0 |
| Complex Ideational Material | CIM | Heaton | 32.6 |
| Conners’ Continuous Performance Test, 2nd Edition | CPT-II | Manual | 78.8 |
| Delis-Kaplan Executive Function System–Stroop | D-KEFS | Manual | 100.0 |
| Finger Tapping Test | FTT | Heaton | 81.1 |
| Letter and Category Fluency Test | FAS & Animals | Heaton | 84.1 |
| Personality Assessment Inventory | PAI | Manual | 43.9 |
| Recognition Memory Test–Words | RMT | — | 100.0 |
| Rey 15-Item Test | Rey-15 | — | 81.8 |
| Rey Complex Figure Test | RCFT | Manual | 96.2 |
| Trail Making Test (A & B) | TMT (A & B) | Heaton | 56.8 |
| Wechsler Adult Intelligence Scale, 4th Edition | WAIS-IV | Manual | 99.2 |
| Wechsler Memory Scale, 4th Edition | WMS-IV | Manual | 99.2 |
| Wide Range Achievement Test, 4th Edition | WRAT-4 | Manual | 83.3 |
| Wisconsin Card Sorting Test | WCST | Manual | 91.7 |
Note. Heaton = Demographically adjusted norms published by Heaton, Miller, Taylor, and Grant (2004);
Manual = Normative data published in the technical manual; % ADM = Percent of the sample to which the test
was administered.
Table 2
Base Rates of Failure for EI-11 Components, Cutoffs, and References for Each Indicator
| Test | BR (%) | Indicator | Cutoff | Reference |
| Rey-15 | 10.6 | Free recall | ≤9 | Lezak, 1995; Boone et al., 2002 |
| TMT | 15.9 | A + B (seconds) | ≥137 | Shura et al., 2016 |
| Digit Span | 25.0 | RDS | ≤7 | Greiffenstein et al., 1994; Pearson, 2009 |
| | | ACSS | ≤6 | Axelrod et al., 2006; Spencer et al., 2013; Trueblood, 1994 |
| | | LDF | ≤4 | Heinly et al., 2005 |
| WCST | 14.4 | FMS | ≥2 | Larrabee, 2003; Suhr & Boyer, 1999 |
| | | LRE | >1.9 | Greve et al., 2002; Suhr & Boyer, 1999 |
| CIM | 6.1 | Raw | ≤9 | Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2016 |
| | | T-score | ≤29 | Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2016 |
| LM (WMS-IV) | 15.9 | I ACSS | ≤3 | Bortnik et al., 2010 |
| | | II ACSS | ≤4 | Bortnik et al., 2010 |
| | | Recognition | ≤20 | Bortnik et al., 2010; Pearson, 2009 |
| VR (WMS-IV) | 18.2 | Recognition | ≤4 | Pearson, 2009 |
| CVLT-II | 12.9 | Hits | ≤10 | Bauer et al., 2005; Greve et al., 2009; Wolfe et al., 2010 |
| | | FCR | ≤15 | Bauer et al., 2005; D. Delis (personal communication, May 10, 2012) |
| RCFT | 34.1 | Copy raw | ≤26 | Lu et al., 2003; Reedy et al., 2013 |
| | | 3-min raw | ≤9.5 | Lu et al., 2003; Reedy et al., 2013 |
| | | TP Recognition | ≤6 | Lu et al., 2003; Reedy et al., 2013 |
| | | Atyp RE | ≥1 | Blaskewitz et al., 2009; Lu et al., 2003 |
| FAS | 9.8 | T-score | ≤33 | Curtis et al., 2008; Sugarman & Axelrod, 2015 |
| Animals | 16.7 | T-score | ≤33 | Hayward et al., 1987; Sugarman & Axelrod, 2015 |
Note. BR = Base rate of failure (% of the sample that failed one or more indicators within the test); TMT = Trail Making Test; RDS = Reliable digit
span; ACSS = Age-corrected scaled score; LDF = Longest digit span forward; WCST = Wisconsin Card Sorting Test; FMS = Failure to maintain set;
UE = Unique errors; LRE = Logistical regression equation; CIM = Complex Ideational Material from the Boston Diagnostic Aphasia Battery; WMS-IV =
Wechsler Memory Scale, 4th Edition; LM = Logical Memory; VR = Visual Reproduction; CVLT-II = California Verbal Learning Test, 2nd Edition;
FCR = Forced choice recognition; RCFT = Rey Complex Figure Test; TP Recognition = Recognition true positives; Atyp RE = Atypical recognition errors.
failures was used to define Fail on the EI-11. At the same time, to
maintain the purity of the credible group, Pass was defined as ≤1.
Hence, patients with EI-11 scores of two were considered Border-
line and excluded from the analyses (Erdodi & Roth, 2017; Erdodi,
Tyson, et al., 2017).
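The EI-11 scoring rules described above can be illustrated with a short sketch. This is not the authors' implementation: the function names and the example patient record are hypothetical. Each PVT is represented as a list of indicator-level outcomes (True = failed at the published cutoff), and None stands for a test that was not administered.

```python
from typing import Mapping, Optional, Sequence


def pvt_failed(indicators: Optional[Sequence[bool]]) -> int:
    """Dichotomize one PVT: 1 if any nested indicator was failed, else 0.

    Missing data (None) are coded as Pass (0), mirroring the rule in the text.
    """
    if indicators is None:
        return 0
    return int(any(indicators))


def ei11_score(pvts: Mapping[str, Optional[Sequence[bool]]]) -> int:
    """Sum PVT-level failures; several failed indicators in one PVT count once."""
    return sum(pvt_failed(ind) for ind in pvts.values())


def ei11_range(score: int) -> str:
    """Trichotomize the composite: <=1 Pass, 2 Borderline, >=3 Fail."""
    if score <= 1:
        return "Pass"
    if score == 2:
        return "Borderline"
    return "Fail"


# Hypothetical patient record (illustrative values only, not study data)
patient = {
    "Rey-15": [False],
    "Digit Span": [True, True, False],    # two failed indicators, counted once
    "WCST": None,                         # not administered, coded as Pass
    "RCFT": [True, False, False, False],
    "FAS": [True],
}
score = ei11_score(patient)
print(score, ei11_range(score))
```

Under these rules a patient with a score of two falls in the Borderline range and would be excluded from the signal detection analyses.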
Relying on a mixture of PVTs representing a wide range of
sensory modalities, cognitive domains, and testing paradigms is a
desirable feature of the EI-11, as it provides an ecologically valid
index of performance validity. However, this heterogeneity could
also become a source of error variance, especially when the pur-
pose of the instrument is to establish the credibility of the perfor-
mance on a specific test, and not on the overall neurocognitive
profile. The issue of modality-specificity as a confound in signal
detection analyses was raised as a theoretical concern (Leighton,
Weinborn, & Maybery, 2014) and has found empirical support
(Erdodi, Abeare, et al., 2017; Erdodi, Tyson, et al., 2017).
Therefore, because the D-KEFS Stroop is timed, another validity
composite was developed based on constituent PVTs that were
based on processing speed measures, labeled “Erdodi Index
Seven” (EI-7). The unique feature of the EI-7 is that instead
of the traditional Pass/Fail dichotomy, each of its components is
coded on a 4-point scale ranging from zero (unequivocal Pass) to
three (unequivocal Fail), with one and two reflecting intermediate
levels of failure (see Table 3). As such, the EI-7 captures both
the number and extent of PVT failures, recognizing the underlying
continuity in test-taking effort (Erdodi, Roth, Kirsch, Lajiness-O’Neill,
& Medoff, 2014; Erdodi, Tyson, et al., 2016).
Because the practical demands of clinical classification require
a dichotomous outcome, EI-7 scores ≤1 were defined as Pass,
and ≥4 as Fail. EI-7 values of two and three represent an
indeterminate range, as they could reflect either multiple failures at
a liberal cutoff or a single failure at the most conservative cutoff.
As these performances are considered “near-passes” by some
(Bigler, 2012, 2015), patients with EI-7 scores in this range
were excluded from signal detection analyses in the interest of
obtaining diagnostically pure criterion groups, following methodological
guidelines established by previous researchers (Axelrod,
Meyers, & Davis, 2014; Greve & Bianchini, 2004). The majority
of the sample (62.1%) performed in the passing range; only 15.9%
scored ≥4 (see Table 4).
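The graded coding of the EI-7 can be sketched as follows, using the cutoffs listed in Table 3. This is an illustrative reading of the table rather than the authors' code: the helper names and the example scores are hypothetical, and the component-level failure counts for the FTT and CPT-II are assumed to be computed beforehand.

```python
def level_low_is_worse(score, cut1, cut2, cut3):
    """Map a score to 0-3 when lower scores indicate failure.

    cut1, cut2, cut3 are the highest scores that earn levels 1, 2, and 3.
    """
    if score <= cut3:
        return 3
    if score <= cut2:
        return 2
    if score <= cut1:
        return 1
    return 0


def level_high_is_worse(score, cut1, cut2, cut3):
    """Map a score to 0-3 when higher scores (e.g., seconds) indicate failure."""
    if score >= cut3:
        return 3
    if score >= cut2:
        return 2
    if score >= cut1:
        return 1
    return 0


def ei7_score(ftt_failures, fas_t, animals_t, tmt_ab_seconds,
              cpt_failures, cd_acss, ss_acss):
    """Sum the seven graded components (cutoffs taken from Table 3)."""
    return sum([
        min(ftt_failures, 2),                                # FTT has no level 3
        level_low_is_worse(fas_t, 33, 31, 27),               # FAS T-score
        level_low_is_worse(animals_t, 33, 24, 20),           # Animals T-score
        level_high_is_worse(tmt_ab_seconds, 137, 222, 256),  # TMT A + B
        min(cpt_failures, 3),                                # CPT-II failures
        level_low_is_worse(cd_acss, 5, 4, 3),                # WAIS-IV Coding
        level_low_is_worse(ss_acss, 5, 4, 3),                # WAIS-IV Symbol Search
    ])


def ei7_range(score):
    """<=1 Pass, 2-3 Borderline, >=4 Fail."""
    if score <= 1:
        return "Pass"
    if score <= 3:
        return "Borderline"
    return "Fail"


# Hypothetical patient: FAS T = 30 (level 2), TMT A + B = 140 s (level 1),
# one CPT-II failure (level 1); all other components pass.
total = ei7_score(0, 30, 40, 140, 1, 8, 10)
print(total, ei7_range(total))
```

The example shows how several mild failures can add up to the Fail range even though no single component reaches its most conservative cutoff.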
Data Analysis
Descriptive statistics (mean, SD, BR) are reported for the
relevant variables. The main inferential statistics were one-way
analyses of variance (ANOVAs) and independent sample t tests.
Effect size estimates are reported in Cohen’s d and partial eta
squared (η²). Classification accuracy (sensitivity and specificity)
was calculated using standard formulas (Grimes & Schulz, 2005).
The emerging standard for specificity is ≥.90 (Boone, 2013), with
the minimum acceptable value at .84 (Larrabee, 2003).
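As a reminder of how sensitivity and specificity are obtained from a 2 × 2 classification table, the sketch below computes both from paired Pass/Fail outcomes. The function names and the example data are hypothetical; criterion outcomes coded as None stand for Borderline cases that are left out of the analysis.

```python
def classification_accuracy(index_fail, criterion_fail):
    """Return (sensitivity, specificity) of an index test against a criterion.

    index_fail: booleans, True = failed the cutoff under evaluation.
    criterion_fail: True = Fail, False = Pass, None = Borderline (excluded).
    """
    tp = fn = fp = tn = 0
    for idx, crit in zip(index_fail, criterion_fail):
        if crit is None:
            continue                      # Borderline cases are excluded
        if crit and idx:
            tp += 1                       # invalid profile, correctly flagged
        elif crit:
            fn += 1                       # invalid profile, missed
        elif idx:
            fp += 1                       # valid profile, wrongly flagged
        else:
            tn += 1                       # valid profile, correctly passed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def specificity_standard(specificity):
    """Label specificity against the .84 minimum and .90 emerging standard."""
    if specificity >= 0.90:
        return "meets emerging standard"
    if specificity >= 0.84:
        return "meets minimum"
    return "below minimum"


# Hypothetical outcomes for eight patients (not study data)
criterion = [True, True, True, False, False, False, False, None]
index = [True, False, True, False, False, True, False, True]
sens, spec = classification_accuracy(index, criterion)
print(round(sens, 2), round(spec, 2), specificity_standard(spec))
```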
One-way ANOVAs using the trichotomized EI-11 (Pass-Borderline-Fail)
as the independent variable, and the RMT accuracy
score, completion time, and the EI-7 scores as the dependent
variables, were statistically significant. Associated effect
sizes were large (η²: .16–.23). Scores in the Pass range were
always significantly lower than scores in the Fail range. However,
scores in the Borderline range did not differ consistently from the
other two classification ranges (see Table 5).
These analyses were repeated using the trichotomized EI-7
(Pass-Borderline-Fail) as the independent variable, and the RMT
accuracy score, completion time, and the EI-11 scores as dependent
variables. All contrasts were significant with large effects (η²:
.11–.34). As before, scores in the Pass range were always significantly
lower than scores in the Fail range, but scores in the
Borderline range did not differ consistently from the other two
classification ranges (see Table 6). Overall, these findings provide
empirical support for eliminating participants with EI-11 and EI-7
scores in the Borderline range when computing the classification
accuracy of the D-KEFS Stroop to minimize error variance
and establish criterion groups with neurocognitive profiles that are
either clearly valid or invalid.
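The reported effect sizes can be related to the F ratios through the standard identity η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the F values in Table 5. The degrees of freedom (2 and 129) are an assumption based on three groups and the full sample of 132; they are not stated in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover (partial) eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values from Table 5; df = (2, 129) is an assumption, see the note above
DF_EFFECT, DF_ERROR = 2, 129
for label, f_value in [("RMT accuracy", 12.1), ("RMT time", 11.1), ("EI-7", 19.1)]:
    print(label, round(partial_eta_squared(f_value, DF_EFFECT, DF_ERROR), 2))
```

With these assumed degrees of freedom, the recovered values agree with the tabled effect sizes to two decimals. In a one-way design, partial eta squared and eta squared coincide.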
Mean D-KEFS Stroop age-corrected scaled scores (ACSS) were
in the average range on all four trials. However, they were signif-
icantly below the nominal mean of 10, with small to medium effect
sizes (Cohen’s d: .25–.44). Skew and kurtosis were well within
±1.0 (see Table 7). However, visual inspection revealed bimodal
distributions with one peak in the impaired range and another in
the average-to-high average range.
Trial 1 ACSS ≤7 failed to clear the minimum threshold for
specificity (.84; Larrabee, 2003) against the RMT and EI-11.
The ≤6 cutoff produced good combinations of sensitivity (.43–.71)
and specificity (.86–.94) against all three reference PVTs.
Lowering the cutoff to ≤5 produced negligible changes in classification
accuracy. The more conservative ≤4 cutoff resulted in
predictable tradeoffs: improved specificity (.93–.99) at the expense
of sensitivity (.29–.43).
Trial 2 ACSS ≤7 cleared the minimum threshold for specificity
against the EI-11 and EI-7, but fell short against the RMT.
The ≤6 cutoff produced good combinations of sensitivity (.45–.62)
and specificity (.87–.91) against all three reference PVTs.
Lowering the cutoff to ≤5 improved specificity across all reference
PVTs (.92–.96) with minimal loss in sensitivity (.38–.57).
The more conservative ≤4 cutoff produced excellent specificity
(.96–.99) with relatively well-preserved sensitivity (.33–.48).
Trial 3 ACSS ≤7 cleared the minimum threshold for specificity
against the EI-11 and EI-7, but once again, fell short of expectations
against the RMT. Lowering the cutoff to ≤6 resulted in the
predictable tradeoffs, but still failed to reach minimum specificity
against the RMT. Lowering the cutoff to ≤5 improved specificity
across all reference PVTs (.87–.99) with minimal loss in sensitivity
(.26–.62). The more conservative ≤4 cutoff produced excellent
specificity (.94–1.00) with acceptable sensitivity (.26–.52).
Trial 4 ACSS ≤7 failed to clear the minimum threshold for
specificity against the RMT and EI-11. The ≤6 cutoff produced
acceptable combinations of sensitivity (.29–.48) and specificity
(.84–.91) against all three reference PVTs. Lowering the cutoff
to ≤5 produced negligible changes in classification accuracy. The
more conservative ≤4 cutoff resulted in predictable tradeoffs:
improved specificity (.92–.96) at the expense of sensitivity (.21–
To evaluate whether the pattern of performance across the trials
can reveal invalid responding, two additional derivative validity
indices were examined: the Trials 4/3 raw score ratio, and the
(Trials 1 + 2)/(Trials 3 + 4) ACSS ratio. The former index is a
measure of the absolute difficulty gradient, as examinees are
expected to take longer to finish Trial 4, given its added cognitive
load of switching between two sets of rules. Indeed, on average,
patients produced higher completion times on Trial 4 relative to
Trial 3 (Trials 4/3 raw score ratio: M = 1.17, SD = 0.32, range:
0.65–2.59). The distribution was bimodal, with the bulk of the
sample forming a bell-shaped distribution around the mean and a
small group of positive outliers. The latter index, a ratio of aggre-
gated easy (Trials 1 and 2) versus hard (3 and 4) trials, is a
measure of the relative difficulty gradient. Norm-referencing (i.e.,
age-correction) is expected to equalize the increase in task de-
mands from the first two to the last two trials. As expected, the
overall (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio was close to 1.00:
M = 1.05, SD = 0.44, range: 0.33–3.50. Again, the distribution
was bimodal, with a bell-shaped majority around the mean and a
small group of positive outliers.
A Trials 4/3 raw score ratio ≤0.90 cleared the minimum threshold
for specificity against all reference PVTs, but sensitivity was
low (.14–.24). Lowering the cutoff to ≤0.85 resulted in predictable
tradeoffs, with notable increase in specificity (.95–.97), but
further loss in sensitivity (.10–.14). Lowering the cutoff to ≤0.80
produced negligible changes in classification accuracy. The
more conservative ≤0.75 cutoff produced excellent specificity
(.98–1.00), but very low sensitivity (.07–.14).
A (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio ≤0.80 failed to
achieve minimum specificity against any of the reference PVTs.
Lowering the cutoff to ≤0.75 cleared the lower threshold for
specificity against all reference PVTs (.88–.89), with acceptable
sensitivity (.26–.29). Lowering the cutoff to ≤0.70 produced
predictable tradeoffs: improved specificity (.92–.94) at
the expense of sensitivity (.21–.24). The more conservative
≤0.65 cutoff produced excellent specificity (.98–1.00), but
low sensitivity (.12–.24).
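The two derivative indices reduce to simple ratios, sketched below. The function names and example scores are hypothetical, and the default cutoffs (≤0.90 and ≤0.75) are simply the first values reported above to clear the minimum specificity threshold, chosen here for illustration.

```python
def trials_4_3_raw_ratio(trial3_seconds, trial4_seconds):
    """Absolute difficulty gradient: Trial 4 completion time over Trial 3."""
    return trial4_seconds / trial3_seconds


def easy_hard_acss_ratio(acss):
    """Relative difficulty gradient: (Trials 1 + 2) / (Trials 3 + 4) ACSS."""
    t1, t2, t3, t4 = acss
    return (t1 + t2) / (t3 + t4)


def derivative_flags(trial3_seconds, trial4_seconds, acss,
                     raw_cutoff=0.90, acss_cutoff=0.75):
    """Flag ratios at or below the cutoffs (illustrative default values)."""
    return {
        "trials_4_3_raw": trials_4_3_raw_ratio(trial3_seconds, trial4_seconds) <= raw_cutoff,
        "easy_hard_acss": easy_hard_acss_ratio(acss) <= acss_cutoff,
    }


# Hypothetical examinee who is faster on Trial 4 than on Trial 3 and scores
# lower on the easy trials than on the hard trials
suspect = derivative_flags(60.0, 51.0, (4, 5, 10, 11))
# Hypothetical examinee showing the expected difficulty gradient
typical = derivative_flags(60.0, 72.0, (10, 10, 10, 10))
print(suspect, typical)
```

A low ratio on either index means the examinee did relatively better on the harder trials, which is the violation of the difficulty gradient that these indices are meant to capture.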
Finally, the effect of cumulative failures on independent
D-KEFS Stroop validity indicators was examined (see Table 8).
Failing at least two of the six newly introduced embedded PVTs
produced good combinations of sensitivity (.61–.81) and specificity
(.86–.87) against the EI-11 and EI-7, but fell short of the
minimum specificity standard against the RMT. Failing at least
three indicators cleared the specificity threshold against all three
reference PVTs (.84–.93), at the expense of sensitivity (.36–.57).
Failing at least four indicators produced consistently high specificity
(.92–.99), with further loss in sensitivity (.29–.52). Failing
five indicators (the highest value observed) was associated with
perfect specificity, but low sensitivity (.10–.14).
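Counting cumulative failures across the six embedded indicators can be sketched as below. The indicator labels and the example pattern are hypothetical, and the sketch takes Pass/Fail outcomes as given, leaving the choice of cutoff for each indicator to the caller.

```python
def count_failures(indicator_failed):
    """Number of embedded D-KEFS Stroop validity indicators failed."""
    return sum(bool(failed) for failed in indicator_failed.values())


def cumulative_failure_flags(indicator_failed, thresholds=(2, 3, 4, 5)):
    """For each threshold k, report whether at least k indicators were failed."""
    n_failed = count_failures(indicator_failed)
    return {k: n_failed >= k for k in thresholds}


# Hypothetical examinee failing three of the six indicators
examinee = {
    "trial1_acss": True,
    "trial2_acss": True,
    "trial3_acss": False,
    "trial4_acss": False,
    "trials_4_3_raw_ratio": True,
    "easy_hard_acss_ratio": False,
}
print(count_failures(examinee), cumulative_failure_flags(examinee))
```

Raising the required number of failures trades sensitivity for specificity, as the figures above show.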
Given the high base rate of psychiatric disorders in the sample,
we examined the relationship between self-reported emotional
distress and PVT failure. There was no difference in BDI-II scores
between patients who passed and those who failed the three
reference PVTs and five of the newly developed validity cutoffs
embedded in the D-KEFS Stroop. Trial 4 ACSS ≤6 was an
isolated exception; failing this cutoff was associated with lower
levels of depression (d = .42). Similarly, no difference was found
in BAI scores between patients who passed and those who failed
the three reference PVTs and four of the newly developed validity
cutoffs embedded in the D-KEFS Stroop. The exceptions were
Trial 2 ACSS ≤6 and (Trials 1 + 2)/(Trials 3 + 4) ACSS
ratio ≤0.75. In both cases, failing the cutoff was associated with
increased levels of anxiety (d: .41–.55).
To further explore the potential contribution of psychiatric
symptoms to PVT failure, we performed a series of t tests using
Pass/Fail status on the PVTs as independent variables, and the PAI
clinical scales as dependent variables (see Table 9). All patients
passed the validity cutoff on the Negative Impression Management
scale. The Somatic Concerns scale was the only scale with significant
contrasts. Effect sizes ranged from medium (d = .54) to large
(d = .77). No difference emerged against the EI-7 and the
derivative D-KEFS Stroop validity indices. Within the Somatic
Concerns scale, effect sizes were generally larger on the Conversion
subscale (d: .50–1.06), but again, contrasts involving the
Table 3
The Components of the EI-7 With Base Rates of Failure
Corresponding to Each Cutoff
| Component | 0 | 1 | 2 | 3 |
| FTT number of failures | 0 | 1 | 2 | — |
| Base rate | 96.2 | .8 | 3.0 | — |
| FAS T-scores | >33 | 32–33 | 28–31 | ≤27 |
| Base rate | 87.9 | 5.3 | 2.3 | 4.5 |
| Animals T-scores | >33 | 25–33 | 21–24 | ≤20 |
| Base rate | 83.3 | 8.3 | 3.8 | 4.5 |
| TMT A + B raw scores | <137 | 137–221 | 222–255 | ≥256 |
| Base rate | 84.1 | 10.6 | 2.3 | 3.0 |
| CPT-II number of failures | 0 | 1 | 2 | ≥3 |
| Base rate | 66.7 | 15.2 | 5.3 | 12.9 |
| WAIS-IV CD (ACSS) | >5 | 5 | 4 | ≤3 |
| Base rate | 90.2 | 1.5 | 5.3 | 3.0 |
| WAIS-IV SS (ACSS) | >5 | 5 | 4 | ≤3 |
| Base rate | 85.6 | 6.8 | 4.5 | 3.0 |
Note. EI-7 = “Erdodi Index Seven” based on processing speed measures;
FTT Failures = Finger tapping test, number of scores at ≤35
(men)/28 (women) dominant hand and ≤66 (men)/58 (women) combined
mean raw scores (Arnold et al., 2005; Axelrod, Meyers, & Davis, 2014);
FAS = Letter fluency T-score (Curtis et al., 2008; Sugarman & Axelrod,
2015); Animals = Category fluency T-score (Sugarman & Axelrod, 2015);
CPT-II Failures = Conners’ Continuous Performance Test, 2nd Edition;
number of T-scores >70 on Omissions, Hit Reaction Time Standard Error,
Variability and Perseverations (Erdodi, Pelletier, & Roth, 2016; Erdodi,
Roth, et al., 2014; Lange et al., 2013; Ord, Boettcher, Greve, & Bianchini,
2010); WAIS-IV CD (ACSS) = Age-corrected scaled score on the Coding
subtest of the Wechsler Adult Intelligence Scale—Fourth Edition (Erdodi,
Abeare, et al., 2017; Etherton et al., 2006; N. Kim et al., 2010; Trueblood,
1994); WAIS-IV SS (ACSS) = Age-corrected scaled score on the Symbol
Search subtest of the Wechsler Adult Intelligence Scale—Fourth Edition
(Erdodi, Abeare, et al., 2017; Etherton et al., 2006; Trueblood, 1994).
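Table 4
Frequency Distribution of the EI-7 With Classification Ranges
| EI-7 | f | % | Cumulative % | Classification |
| 0 | 61 | 46.2 | 46.2 | Pass |
| 1 | 21 | 15.9 | 62.1 | Pass |
| 2 | 12 | 9.1 | 71.2 | Borderline |
| 3 | 17 | 12.9 | 84.1 | Borderline |
| 4 | 5 | 3.8 | 87.9 | Fail |
| 5 | 2 | 1.5 | 89.4 | Fail |
| 6 | 4 | 3.0 | 92.4 | Fail |
| 7 | 3 | 2.3 | 94.7 | Fail |
| 8 | 0 | .0 | 94.7 | Fail |
| 9 | 1 | .8 | 95.5 | Fail |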
derivative D-KEFS Stroop validity indices failed to reach significance.
The Somatization subscale was only associated with failing
the RMT (d = .54). Significant differences reemerged on the
Health Concerns subscale, with effect sizes ranging from medium
(d = .52) to large (d = .84). Contrasts involving the EI-7,
D-KEFS Stroop Trial 2, and the derivative validity indices failed to
reach significance.
This study explored the potential of D-KEFS Stroop to function as
a PVT. A scaled score below 1 SD of the normative mean on any of
the four trials was a reliable indicator of psychometrically defined
invalid performance. Violating the difficulty gradient (i.e., scoring
better on difficult tasks than on easier tasks) was also reliably asso-
ciated with failure on reference PVTs. All six EVIs produced bimodal
distributions with a distinct cluster of outliers in the range of non-
credible impairment, indicating that valid and invalid performance
may start to diverge at the level of descriptive statistics. Overall,
results suggest that in addition to measuring basic processing speed
and executive function, the D-KEFS Stroop is also an effective PVT.
This finding is consistent with earlier investigations using different
versions of the Stroop task (Arentsen et al., 2013; Egeland & Lang-
fjaeran, 2007; Guise et al., 2014; Osimani et al., 1997).
Labeling a score that is only 1 SD below the mean as invalid
may appear an extreme measure at first, as it implies that as many
as 16% of the original normative sample demonstrated invalid
performance. However, the practice is not without precedent.
Shura, Miskey, Rowland, Yoash-Gantz, and Denning (2016) demonstrated
that an ACSS ≤7 (Low Average) on Letter-Number
Sequencing was a reliable indicator of noncredible responding.
Moreover, Baker, Connery, Kirk, and Kirkwood (2014) found a
recognition discriminability z-score of ≤ −0.5 (Average) to be the
marker of invalid performance on the California Verbal Learning
Test—Children’s Version.
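The 16% figure follows from the normal curve. The short Python sketch below illustrates the arithmetic, assuming normally distributed normative scores (scaled scores have M = 10 and SD = 3, so an ACSS of 7 corresponds to z = −1). Because scaled scores are discrete, actual proportions in a normative sample may differ somewhat from these theoretical values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Theoretical share of a normative sample scoring at or below each cutoff
share_at_or_below_minus_1_sd = standard_normal.cdf(-1.0)    # z <= -1.0 (e.g., ACSS <= 7)
share_at_or_below_minus_half_sd = standard_normal.cdf(-0.5)  # z <= -0.5

print(f"z <= -1.0: {share_at_or_below_minus_1_sd:.1%}")
print(f"z <= -0.5: {share_at_or_below_minus_half_sd:.1%}")
```

The second line shows why a z-score cutoff of −0.5 is even more striking: under the same assumption it would flag close to a third of a normative sample.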
This phenomenon of the noncredible range of performance
expanding into the traditional range of normal cognitive functioning
has been labeled the “invalid-before-impaired” paradox.
Erdodi and Lichtenstein (2017) argued that this
apparent psychometric anomaly has multiple possible explanations,
one of which is that few (if any) normative samples are
screened for invalid performance. Therefore, noncredible responding
contaminates the scaling process used to establish ACSSs. In
turn, later research discovers that scores commonly interpreted as
within normal limits are in fact indicative of invalid performance.
They concluded that EVI cutoffs that reach into the range of
functioning traditionally considered intact provide valuable information
about the credibility of the response set and, therefore,
should not be automatically discounted.

Table 5
Results of One-Way ANOVAs on RMT and EI-7 Scores Across EI-11 Classification Ranges

                         EI-11 range
Outcome             0–1 (PASS)   2 (BOR)   ≥3 (FAIL)   F      p       η²    Post hocs
RMT accuracy   M    47.4         44.5      42.2        12.1   <.001   .16   PASS vs. BOR; PASS vs. FAIL
               SD   3.4          6.6       7.7
RMT time       M    130.0        147.3     191.3       11.1   <.001   .15   PASS vs. FAIL; BOR vs. FAIL
               SD   63.4         52.9      74.4
EI-7           M    0.8          1.8       4.2         19.1   <.001   .23   PASS vs. FAIL; BOR vs. FAIL
               SD   1.6          1.4       4.5

Note. Post hoc pairwise contrasts were computed using the least significant difference method; EI-11 = Effort
Index Eleven; BOR = Borderline; RMT accuracy = Recognition Memory Test–Words (Accuracy score);
RMT time = Recognition Memory Test–Words (completion time in seconds); EI-7 = “Erdodi Index Seven”
based on processing speed measures; ANOVA = analysis of variance.

Table 6
Results of One-Way ANOVAs on RMT and EI-11 Scores Across EI-7 Classification Ranges

                         EI-7 range
Outcome             0–1 (PASS)   2–3 (BOR)   ≥4 (FAIL)   F      p       η²    Post hocs
RMT accuracy   M    47.0         43.3        42.7        8.08   <.001   .11   PASS vs. BOR; PASS vs. FAIL
               SD   4.1          7.2         7.7
RMT time       M    123.2        189.9       207.9       21.5   <.001   .25   PASS vs. BOR; PASS vs. FAIL
               SD   53.8         70.4        75.3
EI-11          M    1.0          2.2         4.2         33.6   <.001   .34   PASS vs. FAIL; PASS vs. BOR
               SD   1.3          1.7         2.2

Note. Post hoc pairwise contrasts were computed using the least significant difference method; EI-7 =
“Erdodi Index Seven” based on processing speed measures; BOR = Borderline; RMT accuracy = Recognition
Memory Test–Words (Accuracy score); RMT time = Recognition Memory Test–Words (Completion time in
seconds); EI-11 = Effort Index Eleven; ANOVA = analysis of variance.
Within the D-KEFS Stroop, derivative validity indicators be-
haved differently both relative to reference PVTs and to single-
trial validity cutoffs. First, the derivative validity indicators had
consistently lower BR, suggesting that pattern violations are
less common manifestations of noncredible responding than ab-
normally slow completion time. This finding is congruent with
earlier reports that an inverted or absent Stroop effect does not
occur in credible examinees; therefore, it is highly specific to
invalid performance (Osimani et al., 1997). As a direct conse-
quence of this, derivative validity indicators were generally less
sensitive, which may also reflect the inconsistency in the literature
regarding the inverted Stroop effect as an index of performance
validity. While some studies found that noncredible performers
perform better on more difficult trials, this pattern of performance
failed to demonstrate adequate classification accuracy (Arentsen et
al., 2013; Egeland & Langfjaeran, 2007).
Despite the variability in sample characteristics, methodology,
version of Stroop task, reference PVTs, and BR, our findings
are broadly consistent with the extant literature in that the inverted
Stroop effect is more common in noncredible examinees, but has
limited discriminant power. Arentsen and colleagues (2013) note
that the interference trial may be associated with poor specificity
because, whereas most PVTs are designed to appear difficult but are
in fact easy, the opposite is true for the Stroop task: the interference
trial is actually difficult for most individuals.

Table 7
D-KEFS Stroop Age-Corrected Scaled Scores Across the Four Trials for the Entire Sample (N = 132)

D-KEFS Stroop trial   Color naming   Word reading   Inhibition   Inhibition/switching
Number                1              2              3            4
M                     8.6            9.2            9.0          9.1
SD                    3.4            3.5            3.8          3.5
Median                10             10             9.5          10
Skew                  −.57           −.66           −.53         −.63
Kurtosis              −.33           −.33           −.27         −.12
Range                 1–15           1–15           1–15         1–15

Note. D-KEFS = Delis-Kaplan Executive Function System.

Table 8
Classification Accuracy of Validity Indicators Embedded in the D-KEFS Stroop Task Against Reference PVTs

                                                  RMT (n = 132)   EI-11 (n = 114)   EI-7
BR                                                31.8            33.3              20.4
Indicator                     Cutoff   BR         SENS   SPEC     SENS   SPEC       SENS   SPEC
Trial 1                       ≤7       34.1       .55    .76      .61    .82        .76    .88
                              ≤6       23.5       .43    .86      .53    .92        .71    .94
                              ≤5       21.2       .43    .89      .50    .93        .62    .94
                              ≤4       13.6       .29    .93      .37    .96        .43    .99
Trial 2                       ≤7       26.5       .48    .83      .52    .87        .62    .88
                              ≤6       23.5       .45    .87      .45    .88        .62    .91
                              ≤5       17.4       .38    .92      .44    .95        .57    .96
                              ≤4       13.6       .33    .96      .39    .98        .48    .99
Trial 3                       ≤7       31.1       .45    .76      .55    .88        .71    .85
                              ≤6       22.0       .31    .82      .42    .91        .62    .93
                              ≤5       17.4       .26    .87      .37    .93        .62    .99
                              ≤4       12.1       .26    .94      .32    .97        .52    1.00
Trial 4                       ≤7       26.5       .33    .77      .47    .83        .67    .88
                              ≤6       19.7       .29    .84      .34    .87        .48    .91
                              ≤5       16.7       .24    .88      .29    .90        .48    .95
                              ≤4       12.1       .21    .92      .21    .92        .33    .96
Trials 4/3                    ≤.90     14.3       .14    .86      .21    .90        .24    .85
(raw score)                   ≤.85     6.8        .10    .96      .13    .97        .14    .95
                              ≤.80     5.3        .07    .97      .11    .99        .14    .98
                              ≤.75     3.8        .07    .98      .11    .99        .14    .99
Trials (1 + 2)/(3 + 4)        ≤.80     22.3       .29    .80      .32    .82        .29    .81
(ACSS)                        ≤.75     16.7       .26    .88      .27    .88        .29    .89
                              ≤.70     11.4       .21    .93      .18    .92        .24    .94
                              ≤.65     7.6        .14    .96      .16    .96        .24    .99
Cumulative failures           ≥2       31.1       .50    .78      .61    .86        .81    .87
                              ≥3       22.0       .36    .84      .45    .92        .57    .93
                              ≥4       14.4       .29    .92      .32    .93        .52    .99
                              ≥5       3.0        .10    1.00     .11    1.00       .14    1.00

Note. D-KEFS = Delis-Kaplan Executive Function System; PVT = performance validity tests; RMT = Warrington Recognition Memory Test–Words
[Pass: accuracy score >43 and time-to-completion <192″; Fail: accuracy score ≤43 or time-to-completion ≥192″ (Erdodi, Kirsch, et al., 2014; Erdodi,
Tyson, et al., 2017; M. S. Kim et al., 2010)]; EI-11 = Effort Index Eleven [Pass ≤1; Fail ≥3 (Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2017)]; EI-7 =
“Erdodi Index Seven” based on processing speed measures [Pass ≤1; Fail ≥4 (Erdodi, Roth, et al., 2014; Erdodi, Tyson, et al., 2016, 2017)]; BR =
Base rate of failure (percentage); SENS = Sensitivity; SPEC = Specificity; Trial 1 = Color Naming age-corrected scaled score (ACSS); Trial 2 = Word
Reading ACSS; Trial 3 = Inhibition ACSS (classic Stroop task); Trial 4 = Inhibition/Switching ACSS; Cumulative Failures = Number of validity indices
failed (Trials 1–4 ACSS ≤6; Trials 4/3 raw score ratio ≤.90; Trials (1 + 2)/(3 + 4) ACSS ratio ≤.75).
As such, the inverted Stroop effect as an EVI follows the reverse
logic compared with classic stand-alone PVTs, as performing well
on a difficult task is meant to expose noncredible responding rather
than performing poorly on an easy task. Although the inverted
Stroop effect seems less effective at separating valid and invalid
response sets, it appears to tap different manifestations of noncred-
ible performance. Therefore, it may provide valuable nonredun-
dant information for the multivariate model of validity assessment
(Boone, 2013; Larrabee, 2003).
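To make the scoring logic concrete, the following minimal Python sketch applies the six cutoffs listed in the note to Table 8 (Trials 1–4 ACSS ≤6; Trials 4/3 raw score ratio ≤.90; Trials (1 + 2)/(3 + 4) ACSS ratio ≤.75) and counts cumulative failures. The class and function names are hypothetical, the example scores are made up for illustration, and the raw scores are assumed to be completion times; none of this is taken from the authors' own scoring materials.

```python
from dataclasses import dataclass

@dataclass
class StroopProfile:
    """Hypothetical container for one examinee's D-KEFS Stroop scores."""
    acss: tuple          # age-corrected scaled scores for Trials 1-4, in order
    raw_trial_3: float   # Trial 3 raw score (assumed to be completion time)
    raw_trial_4: float   # Trial 4 raw score (assumed to be completion time)

def validity_flags(profile, acss_cutoff=6, raw_ratio_cutoff=0.90, acss_ratio_cutoff=0.75):
    """Return the six pass/fail flags; True means the cutoff was failed."""
    t1, t2, t3, t4 = profile.acss
    return {
        "trial_1_acss": t1 <= acss_cutoff,
        "trial_2_acss": t2 <= acss_cutoff,
        "trial_3_acss": t3 <= acss_cutoff,
        "trial_4_acss": t4 <= acss_cutoff,
        # Derivative indices: both flag a violation of the difficulty gradient
        "trials_4_over_3_raw": profile.raw_trial_4 / profile.raw_trial_3 <= raw_ratio_cutoff,
        "trials_12_over_34_acss": (t1 + t2) / (t3 + t4) <= acss_ratio_cutoff,
    }

# Made-up examinee, for illustration only
examinee = StroopProfile(acss=(5, 6, 9, 10), raw_trial_3=70.0, raw_trial_4=60.0)
flags = validity_flags(examinee)
cumulative_failures = sum(flags.values())

print(flags)
print("Cumulative failures:", cumulative_failures)
```

The cutoffs are exposed as parameters because the study reports a range of cutoffs for each indicator rather than a single fixed value.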
An emergent finding of cross-validation analyses is the
modality-specificity of classification accuracy (Leighton et al.,
2014). Of the three reference PVTs, one was a traditional stand-
alone measure based on the forced choice recognition paradigm
(RMT), one was a composite measure based on the number of
independent PVT failures (EI-11), and one was a composite of
validity indicators specifically selected to match the target constructs
in the Stroop task (EI-7). The four base trials of the
D-KEFS Stroop and the (Trials 1 + 2)/(Trials 3 + 4) ACSS ratio
produced the best overall classification accuracy against the EI-7.
The Trials 4/3 raw score ratio was a marginal exception,
reiterating the divergence in the psychometric properties of the
derivative validity indices. Nevertheless, all newly introduced
D-KEFS Stroop based validity cutoffs had the highest sensitivity
against the EI-7. In several cases, sensitivity values were double
those observed against the RMT.
Table 9
Results of Independent t-Tests Comparing Scores on PAI Somatization Scales as a Function of Passing or Failing PVTs

                                                               D-KEFS Stroop trials
PAI scale          PVT outcome        RMT    EI-11   EI-7    Trial 1   Trial 2   Trial 3   Trial 4   Trials 4/3   Trials (1 + 2)/(3 + 4)
                   n                  58     49      47      58        58        58        58        58           58
SOM                Pass    M          59.1   58.4    59.2    59.3      60.0      59.4      60.3      61.2         61.0
                           SD         11.1   11.8    11.8    11.7      11.8      11.7      12.0      13.0         12.5
                   Fail    M          67.4   65.9    65.2    70.0      69.9      69.2      70.3      61.9         63.3
                           SD         16.1   15.7    18.5    15.7      17.8      15.9      19.3      13.9         17.2
                   p                  <.05   <.05    .11     <.01      <.05      <.05      <.05      .44          .33
                   d                  .60    .54     —       .77       .66       .70       .62       —            —
SOM: Conversion    Pass    M          56.4   52.7    56.1    55.6      56.4      55.7      57.3      58.2         57.2
                           SD         11.1   9.2     12.2    11.2      12.2      11.6      12.7      14.2         13.0
                   Fail    M          64.4   65.4    64.3    71.6      72.1      71.0      70.2      60.6         65.5
                           SD         19.5   17.8    19.0    18.1      18.8      17.9      22.1      14.6         20.1
                   p                  <.05   <.05    <.05    <.01      <.01      <.01      <.05      .31          .07
                   d                  .50    .90     .51     1.06      .99       1.01      .72       —            —
SOM: Somatization  Pass    M          57.6   59.9    58.5    58.7      58.9      58.5      59.2      59.9         60.0
                           SD         14.2   14.6    13.6    14.0      13.9      14.1      13.8      15.0         14.1
                   Fail    M          65.2   60.6    60.2    64.1      64.6      64.9      63.8      59.0         57.6
                           SD         13.9   15.2    19.0    16.3      17.7      15.4      20.6      12.6         17.4
                   p                  <.05   .43     .37     .14       .15       .10       .23       .44          .33
                   d                  .54    —       —       —         —         —         —         —            —
SOM: Health        Pass    M          59.2   58.9    59.1    59.7      60.3      59.9      60.1      61.0         60.9
Concerns                   SD         9.6    10.8    10.1    10.4      10.3      10.3      10.7      10.8         10.8
                   Fail    M          65.8   65.4    65.0    66.9      65.6      66.1      69.3      61.3         61.9
                           SD         13.3   11.9    14.0    12.4      15.2      13.2      11.1      12.6         13.0
                   p                  <.05   <.05    .07     <.05      .11       <.05      <.05      .47          .41
                   d                  .57    .57     —       .63       —         .52       .84       —            —

Note. D-KEFS = Delis-Kaplan Executive Function System; PVT = performance validity tests; PAI = Personality Assessment Inventory; SOM =
Somatic Concerns scale, with its Conversion, Somatization, and Health Concerns subscales; RMT =
Warrington Recognition Memory Test–Words [Pass: accuracy score >43 and time-to-completion <192″; Fail: accuracy score ≤43 or
time-to-completion ≥192″ (Erdodi, Kirsch, et al., 2014; Erdodi, Tyson, et al., 2017; M. S. Kim et al., 2010)]; EI-11 = Effort Index Eleven [Pass ≤1; Fail ≥3
(Erdodi & Roth, 2017; Erdodi, Tyson, et al., 2017)]; EI-7 = “Erdodi Index Seven” based on processing speed measures [Pass ≤1; Fail ≥4 (Erdodi,
Kirsch, et al., 2014; Erdodi, Tyson, et al., 2017; Erdodi, Abeare, et al., 2017)]; BR = Base rate of failure (percentage); SENS = Sensitivity; SPEC =
Specificity; Trial 1 = Color Naming age-corrected scaled score (cutoff for failure ≤6); Trial 2 = Word Reading age-corrected scaled score (cutoff for
failure ≤6); Trial 3 = Inhibition (classic Stroop task) age-corrected scaled score (cutoff for failure ≤6); Trial 4 = Inhibition/Switching age-corrected scaled
score (cutoff for failure ≤6); Trials 4/3 = The ratio of Trial 4 and Trial 3 raw scores (cutoff for failure ≤.90); Trials (1 + 2)/(3 + 4) = The ratio of the
sum of Trials 1 and 2 over the sum of Trials 3 and 4 age-corrected scaled scores (cutoff for failure ≤.75).
These findings resonate with earlier studies (Erdodi, Abeare, et
al., 2017; Erdodi, Tyson, et al., 2017; Lichtenstein et al., 2017),
and serve as a reminder that the choice of criterion measure can
influence the perceived utility of the test being evaluated. In
addition, they illustrate the importance of methodological
pluralism both in cross-validating PVTs at the group level
(Boone, 2013; Larrabee, 2014) and in determining the veracity of an
individual response set (Larrabee, 2003, 2008; Vallabhajosula & van Gorp,
2001), as it can protect against instrumentation artifacts. Knowing
that a new cutoff performs well against several different reference
PVTs increases confidence in the reliability of its signal detection
performance (Erdodi & Roth, 2017).
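The dependence on the criterion measure can be illustrated with a short Python sketch that converts the sensitivity and specificity values reported in Table 8 for Trial 1 ACSS ≤6 into positive likelihood ratios, using the standard formula LR+ = sensitivity / (1 − specificity). Likelihood ratios are not reported in the article for these cutoffs; the calculation and the function name are added here for illustration only.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)

# (sensitivity, specificity) of Trial 1 ACSS <= 6 against each reference PVT (Table 8)
trial_1_cutoff_6 = {
    "RMT": (0.43, 0.86),
    "EI-11": (0.53, 0.92),
    "EI-7": (0.71, 0.94),
}

lr_plus = {
    pvt: positive_likelihood_ratio(sens, spec)
    for pvt, (sens, spec) in trial_1_cutoff_6.items()
}

for pvt, value in lr_plus.items():
    print(f"{pvt}: LR+ = {value:.1f}")
```

The same cutoff yields a markedly larger likelihood ratio against the EI-7 than against the RMT, which is the pattern the paragraph above describes.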
Combining the newly developed EVIs within the D-KEFS
Stroop improved overall classification accuracy. Cutoffs based on
cumulative failures produced superior signal detection profiles
relative to individual EVIs at comparable BR, consistent with
previous research (Larrabee, 2003, 2008). Even though the internal
logic behind the practice of aggregating multiple validity indica-
tors prioritizes sensitivity over specificity (Proto et al., 2014), at
the appropriate cutoffs, multivariate models actually reduce false
positive rates (Davis & Millis, 2014; Larrabee, 2014).
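The reasoning behind this point can be sketched with a simple binomial model in Python. It assumes that the indicators are statistically independent and share the same false positive rate, which is a strong simplification (the six D-KEFS Stroop indicators are derived from the same four trials and are almost certainly correlated). The 10% false positive rate and the function name are hypothetical and chosen only for illustration.

```python
from math import comb

def prob_at_least_k_failures(n_indicators, k, false_positive_rate):
    """Probability that a credible examinee fails at least k of n independent indicators."""
    return sum(
        comb(n_indicators, j)
        * false_positive_rate ** j
        * (1 - false_positive_rate) ** (n_indicators - j)
        for j in range(k, n_indicators + 1)
    )

# Hypothetical scenario: six indicators, each with a 10% false positive rate
for k in (1, 2, 3):
    p = prob_at_least_k_failures(6, k, 0.10)
    print(f"P(at least {k} failures) = {p:.3f}")
```

Under these assumptions, treating any single failure as invalid inflates the false positive rate well above that of each individual indicator, whereas requiring two or more failures brings it back down sharply.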
Passing or failing the newly developed validity cutoffs within
the D-KEFS Stroop was largely unrelated to depression and anx-
iety, consistent with previous reports investigating the relationship
between depression and PVT failure (Considine et al., 2011; Rees,
Tombaugh, & Boulay, 2001). However, patients who failed the
reference PVTs and the newly introduced validity cutoffs in Trials
1–4 of the D-KEFS Stroop reported higher levels of somatization
on the PAI, even though no systematic differences were observed
on any of the other clinical scales. This finding is consistent with
previous reports on the relationship between the somatization scale
of the PAI and PVT failures (Whiteside et al., 2010).
In this study, we introduced a range of validity cutoffs for each
of the four base trials of the D-KEFS Stroop, as well as two
derivative validity indices, recognizing the need for flexible,
population-specific cutoff scores (Bigler, 2015). To our knowl-
edge, this is the first attempt to develop EVIs within the D-KEFS
version of the Stroop task. In addition, we examined the relation-
ship between PVT failures and self-reported psychiatric symp-
toms. The signal detection profiles of the new validity indicators
across the engineered differences among the reference PVTs pro-
vided an opportunity to reflect on the instrumentation artifacts as
potential confounds in the cross-validation methodology used to
calibrate new validity indices.
The results of the study should be interpreted in the context of
its limitations. The sample was geographically restricted and un-
usually high functioning for a clinical setting. However, the overall
intellectual functioning in our sample was comparable with previ-
ous research involving patients with neurological disorders from
the Northeastern United States (Blonder, Gur, Gur, Saykin, &
Hurtig, 1989; Erdodi, Pelletier, & Roth, 2016; Saykin et al., 1995).
In addition, the sample was diagnostically heterogeneous. There-
fore, it is unclear if the newly introduced cutoffs will perform
similarly across patients with different neuropsychiatric condi-
tions. Until replicated in different clinical populations, these cut-
offs should only be applied to patients with clinical characteristics
that are similar to the present sample, as they may be associated
with unacceptably high false positive error rates in examinees with
severe neurological conditions.
Further, as indeterminate cases were excluded from the analyses
to maximize the diagnostic purity of the criterion groups, this
practice may have inflated classification accuracy estimates. More-
over, the time or sequence of administration was not available for
the D-KEFS Stroop, even though these factors have been raised as
potential confounds in the clinical interpretation of cognitive tests
in general (Erdodi & Lajiness-O’Neill, 2014), and of PVT failures
specifically (Bigler, 2015). Finally, in the absence of data on
litigation status, the criterion groups (Valid/Invalid) were psycho-
metrically defined. Given that external incentive to appear im-
paired has been previously suggested as a relevant diagnostic
criterion for noncredible neurocognitive performance (Slick, Sher-
man, & Iverson, 1999), the newly introduced cutoffs would benefit
from cross-validation using known-group designs that incorporate
incentive status. As always, future research using different samples,
diagnostic categories, and reference PVTs is needed to
establish the generalizability of these findings.
Arentsen, T. J., Boone, K. B., Lo, T. T., Goldberg, H. E., Cottingham,
M. E., Victor, T. L.,...Zeller, M. A. (2013). Effectiveness of the
Comalli Stroop Test as a measure of negative response bias. The Clinical
Neuropsychologist, 27, 1060 –1076.
Arnold, G., Boone, K. B., Lu, P., Dean, A., Wen, J., Nitch, S., & McPher-
son, S. (2005). Sensitivity and specificity of finger tapping test scores for
the detection of suspect effort. The Clinical Neuropsychologist, 19,
Axelrod, B. N., Fichtenberg, N. L., Millis, S. R., & Wertheimer, J. C.
(2006). Detecting incomplete effort with Digit Span from the Wechsler
Adult Intelligence Scale-Third Edition. The Clinical Neuropsychologist,
20, 513–523.
Axelrod, B. N., Meyers, J. E., & Davis, J. J. (2014). Finger Tapping Test
performance as a measure of performance validity. The Clinical Neuro-
psychologist, 28, 876 – 888.
Baker, D. A., Connery, A. K., Kirk, J. W., & Kirkwood, M. W. (2014).
Embedded performance validity indicators within the California Verbal
Learning Test, Children’s Version. The Clinical Neuropsychologist, 28,
116 –127.
Bauer, L., Yantz, C. L., Ryan, L. M., Warden, D. L., & McCaffrey, R. J.
(2005). An examination of the California Verbal Learning Test II to
detect incomplete effort in a traumatic brain-injury sample. Applied
Neuropsychology, 12, 202–207.
Bigler, E. D. (2012). Symptom validity testing, effort, and neuropsycho-
logical assessment. Journal of the International Neuropsychological
Society, 18, 632– 640.
Bigler, E. D. (2014). Effort, symptom validity testing, performance validity
testing and traumatic brain injury. Brain Injury, 28, 1623–1638. http://
Bigler, E. D. (2015). Neuroimaging as a biomarker in symptom validity
and performance validity testing. Brain Imaging and Behavior, 9, 421–
Blaskewitz, N., Merten, T., & Brockhaus, R. (2009). Detection of subop-
timal effort with the Rey Complex Figure Test and recognition trial.
Applied Neuropsychology, 16, 54 – 61.
Blonder, L. X., Gur, R. E., Gur, R. C., Saykin, A. J., & Hurtig, H. I. (1989).
Neuropsychological functioning in hemiparkinsonism. Brain and Cog-
nition, 9, 244 –257.
Boone, K. B. (2009). The need for continuous and comprehensive sam-
pling of effort/response bias during neuropsychological examinations.
The Clinical Neuropsychologist, 23, 729 –741.
Boone, K. B. (2013). Clinical Practice of Forensic Neuropsychology. New
York, NY: Guilford Press.
Boone, K. B., Salazar, X., Lu, P., Warner-Chacon, K., & Razani, J. (2002).
The Rey 15-item recognition trial: A technique to enhance sensitivity of
the Rey 15-item memorization test. Journal of Clinical and Experimen-
tal Neuropsychology, 24, 561–573.
Bortnik, K. E., Boone, K. B., Marion, S. D., Amano, S., Ziegler, E.,
Cottingham, M. E.,...Zeller, M. A. (2010). Examination of various
WMS-III logical memory scores in the assessment of response bias. The
Clinical Neuropsychologist, 24, 344 –357.
Bush, S. S., Heilbronner, R. L., & Ruff, R. (2014). Psychological assess-
ment of symptom and performance validity, response bias, and malin-
gering: Official position of the Association of Psychological Advance-
ment in Psychological Injury and Law. Psychological Injury and Law, 7,
Chafetz, M. D., Williams, M. A., Ben-Porath, Y. S., Bianchini, K. J.,
Boone, K. B., Kirkwood, M. W.,...Ord, J. S. (2015). Official position
of the American Academy of Clinical Neuropsychology Social Security
Administration policy on validity testing: Guidance and recommenda-
tions for change. The Clinical Neuropsychologist, 29, 723–740. http://
Comalli, P. E., Jr., Wapner, S., & Werner, H. (1962). Interference effects
of Stroop color-word test in childhood, adulthood, and aging. The
Journal of Genetic Psychology, 100, 47–53.
Considine, C. M., Weisenbach, S. L., Walker, S. J., McFadden, E. M.,
Franti, L. M., Bieliauskas, L. A.,...Langenecker, S. A. (2011).
Auditory memory decrements, without dissimulation, among patients
with major depressive disorder. Archives of Clinical Neuropsychology,
26, 445– 453.
Curtis, K. L., Thompson, L. K., Greve, K. W., & Bianchini, K. J. (2008).
Verbal fluency indicators of malingering in traumatic brain injury:
Classification accuracy in known groups. The Clinical Neuropsycholo-
gist, 22, 930 –945.
Davis, J. J., & Millis, S. R. (2014). Examination of performance validity
test failure in relation to number of tests administered. The Clinical
Neuropsychologist, 28, 199 –214.
Delis, D. C., Kaplan, E., & Kramer, J. H. (2001). Delis-Kaplan executive
function system (D-KEFS). San Antonio, TX: Psychological Corpora-
Egeland, J., & Langfjaeran, T. (2007). Differentiating malingering from
genuine cognitive dysfunction using the Trail Making Test-ratio and
Stroop Interference scores. Applied Neuropsychology, 14, 113–119.
Erdodi, L. A., Abeare, C. A., Lichtenstein, J. D., Tyson, B. T., Kucharski,
B., Zuccato, B. G., & Roth, R. M. (2017). Wechsler Adult Intelligence
Scale-Fourth Edition (WAIS-IV) processing speed scores as measures of
noncredible responding: The third generation of embedded performance
validity indicators. Psychological Assessment, 29, 148 –157. http://dx
Erdodi, L. A., Kirsch, N. L., Lajiness-O’Neill, R., Vingilis, E., & Medoff,
B. (2014). Comparing the Recognition Memory Test and the Word
Choice Test in a mixed clinical sample: Are they equivalent? Psycho-
logical Injury and Law, 7, 255–263.
Erdodi, L. A., & Lajiness-O’Neill, R. (2014). Time-related changes in
Conners’ CPT-II scores: A replication study. Applied Neuropsychology
Adult, 21, 43–50.
Erdodi, L. A., & Lichtenstein, J. D. (2017). Invalid before impaired: An
emerging paradox of embedded validity indicators. The Clinical Neuro-
psychologist. Advance online publication.
Erdodi, L. A., Pelletier, C. L., & Roth, R. M. (2016). Elevations on select
Conners’ CPT-II scales indicate noncredible responding in adults with
traumatic brain injury. Applied Neuropsychology: Adult, 22, 851– 858.
Erdodi, L., & Roth, R. (2017). Low scores on BDAE Complex Ideational
Material are associated with invalid performance in adults without
aphasia. Applied Neuropsychology: Adult, 24, 264 –274. http://dx.doi
Erdodi, L. A., Roth, R. M., Kirsch, N. L., Lajiness-O’neill, R., & Medoff,
B. (2014). Aggregating validity indicators embedded in Conners’ CPT-II
outperforms individual cutoffs at separating valid from invalid perfor-
mance in adults with traumatic brain injury. Archives of Clinical Neu-
ropsychology, 29, 456 – 466.
Erdodi, L. A., Tyson, B. T., Abeare, C. A., Lichtenstein, J. D., Pelletier,
C. L., Rai, J. K., & Roth, R. M. (2016). The BDAE Complex Ideational
Material—A measure of receptive language or performance validity?
Psychological Injury and Law, 9, 112–120.
Erdodi, L. A., Tyson, B. T., Shahein, A. G., Lichtenstein, J. D., Abeare,
C. A., Pelletier, C. L.,...Roth, R. M. (2017). The power of timing:
Adding a time-to-completion cutoff to the Word Choice Test and Rec-
ognition Memory Test improves classification accuracy. Journal of
Clinical and Experimental Neuropsychology, 39, 369 –383. http://dx.doi
Etherton, J. L., Bianchini, K. J., Heinly, M. T., & Greve, K. W. (2006).
Pain, malingering, and performance on the WAIS-III Processing Speed
Index. Journal of Clinical and Experimental Neuropsychology, 28,
1218 –1237.
Golden, C., & Freshwater, S. (2002). A Manual for the Adult Stroop Color
and Word Test. Chicago, IL: Stoelting.
Green, P. (2013). Spoiled for choice: Making comparisons between forced-
choice effort tests. In K. B. Boone (Ed.), Clinical practice of forensic
neuropsychology. New York, NY: Guilford Press.
Greiffenstein, M. F., Baker, W. J., & Gola, T. (1994). Validation of
malingered amnesia measures with a large clinical sample. Psycholog-
ical Assessment, 6, 218 –224.
Greve, K. W., & Bianchini, K. J. (2004). Setting empirical cut-offs on
psychometric indicators of negative response bias: A methodological
commentary with recommendations. Archives of Clinical Neuropsychol-
ogy, 19, 533–541.
Greve, K. W., Bianchini, K. J., Mathias, C. W., Houston, R. J., & Crouch,
J. A. (2002). Detecting malingered performance with the Wisconsin card
sorting test: A preliminary investigation in traumatic brain injury. The
Clinical Neuropsychologist, 16, 179 –191.
Greve, K. W., Ord, J. S., Bianchini, K. J., & Curtis, K. L. (2009).
Prevalence of malingering in patients with chronic pain referred for
psychologic evaluation in a medico-legal context. Archives of Physical
Medicine and Rehabilitation, 90, 1117–1126.
Grimes, D. A., & Schulz, K. F. (2005). Refining clinical diagnosis with
likelihood ratios. The Lancet, 365, 1500 –1505.
Guise, B. J., Thompson, M. D., Greve, K. W., Bianchini, K. J., & West, L.
(2014). Assessment of performance validity in the Stroop Color and
Word Test in mild traumatic brain injury patients: A criterion-groups
validation design. Journal of Neuropsychology, 8, 20 –33. http://dx.doi
Hayward, L., Hall, W., Hunt, M., & Zubrick, S. R. (1987). Can localised
brain impairment be simulated on neuropsychological test profiles?
Australian and New Zealand Journal of Psychiatry, 21, 87–93. http://
Heaton, R. K., Miller, S. W., Taylor, M. J., & Grant, I. (2004). Revised
comprehensive norms for an expanded Halstead-Reitan battery: Demo-
graphically adjusted neuropsychological norms for African American
and Caucasian adults. Lutz, FL: Psychological Assessment Resources.
Heilbronner, R. L., Sweet, J. J., Morgan, J. E., Larrabee, G. J., & Millis,
S. R., & the Conference Participants. (2009). American Academy of
Clinical Neuropsychology Consensus Conference Statement on the neu-
ropsychological assessment of effort, response bias, and malingering.
The Clinical Neuropsychologist, 23, 1093–1129.
Heinly, M. T., Greve, K. W., Bianchini, K. J., Love, J. M., & Brennan, A.
(2005). WAIS digit span-based indicators of malingered neurocognitive
dysfunction: Classification accuracy in traumatic brain injury. Assess-
ment, 12, 429 – 444.
Kim, M. S., Boone, K. B., Victor, T., Marion, S. D., Amano, S., Cotting-
ham, M. E.,...Zeller, M. A. (2010). The Warrington Recognition
Memory Test for words as a measure of response bias: Total score and
response time cutoffs developed on “real world” credible and noncred-
ible subjects. Archives of Clinical Neuropsychology, 25, 60 –70. http://
Kim, N., Boone, K. B., Victor, T., Lu, P., Keatinge, C., & Mitchell, C.
(2010). Sensitivity and specificity of a digit symbol recognition trial in
the identification of response bias. Archives of Clinical Neuropsychol-
ogy, 25, 420 – 428.
Lange, R. T., Iverson, G. L., Brickell, T. A., Staver, T., Pancholi, S.,
Bhagwat, A., & French, L. M. (2013). Clinical utility of the Conners’
Continuous Performance Test-II to detect poor effort in U.S. military
personnel following traumatic brain injury. Psychological Assessment,
25, 339 –352.
Lansbergen, M. M., Kenemans, J. L., & van Engeland, H. (2007). Stroop
interference and attention-deficit/hyperactivity disorder: A review and
meta-analysis. Neuropsychology, 21, 251–262.
Larrabee, G. J. (2003). Detection of malingering using atypical perfor-
mance patterns on standard neuropsychological tests. The Clinical Neu-
ropsychologist, 17, 410 – 425.
Larrabee, G. J. (2008). Aggregation across multiple indicators improves
the detection of malingering: Relationship to likelihood ratios. The
Clinical Neuropsychologist, 22, 666 – 679.
Larrabee, G. J. (2014). False-positive rates associated with the use of
multiple performance and symptom validity tests. Archives of Clinical
Neuropsychology, 29, 364 –373.
Larson, M. J., Kaufman, D. A., Schmalfuss, I. M., & Perlstein, W. M.
(2007). Performance monitoring, error processing, and evaluative con-
trol following severe TBI. Journal of the International Neuropsycholog-
ical Society, 13, 961–971.
Leighton, A., Weinborn, M., & Maybery, M. (2014). Bridging the gap
between neurocognitive processing theory and performance validity
assessment among the cognitively impaired: A review and methodolog-
ical approach. Journal of the International Neuropsychological Society,
20, 873– 886.
Lezak, M. D. (1995). Neuropsychological assessment. New York, NY:
Oxford University Press.
Lichtenstein, J. D., Erdodi, L. A., & Linnea, K. S. (2017). Introducing a
forced-choice recognition task to the California Verbal Learning Test—
Children’s Version. Child Neuropsychology, 23, 284 –299.
Lippa, S. M., & Davis, R. N. (2010). Inhibition/switching is not necessarily
harder than inhibition: An analysis of the D-KEFS color-word interfer-
ence test. Archives of Clinical Neuropsychology, 25, 146 –152. http://dx
Lu, P. H., Boone, K. B., Cozolino, L., & Mitchell, C. (2003). Effectiveness
of the Rey-Osterrieth Complex Figure Test and the Meyers and Meyers
recognition trial in the detection of suspect effort. The Clinical Neuro-
psychologist, 17, 426 – 440.
MacLeod, C. M., & MacDonald, P. A. (2000). Interdimensional interfer-
ence in the Stroop effect: Uncovering the cognitive and neural anatomy
of attention. Trends in Cognitive Sciences, 4, 383–391.
Ord, J. S., Boettcher, A. C., Greve, K. J., & Bianchini, K. J. (2010).
Detection of malingering in mild traumatic brain injury with the Con-
ners’ Continuous Performance Test-II. Journal of Clinical and Experi-
mental Neuropsychology, 32(4), 380–387.
Osimani, A., Alon, A., Berger, A., & Abarbanel, J. M. (1997). Use of the
Stroop phenomenon as a diagnostic tool for malingering. Journal of
Neurology, Neurosurgery & Psychiatry, 62, 617– 621.
Pearson, N. C. S. (2009). Advanced clinical solutions for WAIS-IV and
WMS-IV: Administration and scoring manual. San Antonio, TX: The
Psychological Corporation.
Proto, D. A., Pastorek, N. J., Miller, B. I., Romesser, J. M., Sim, A. H., &
Linck, J. F. (2014). The dangers of failing one or more performance
validity tests in individuals claiming mild traumatic brain injury-related
postconcussive symptoms. Archives of Clinical Neuropsychology, 29,
614 – 624.
Reedy, S. D., Boone, K. B., Cottingham, M. E., Glaser, D. F., Lu, P. H.,
Victor, T. L.,...Wright, M. J. (2013). Cross validation of the Lu and
colleagues (2003) Rey-Osterrieth Complex Figure Test effort equation
in a large known-group sample. Archives of Clinical Neuropsychology,
28, 30 –37.
Rees, L. M., Tombaugh, T. N., & Boulay, L. (2001). Depression and the
test of memory malingering. Archives of Clinical Neuropsychology, 16,
Saykin, A. J., Stafiniak, P., Robinson, L. J., Flannery, K. A., Gur, R. C.,
O’Connor, M. J., & Sperling, M. R. (1995). Language before and after
temporal lobectomy: Specificity of acute changes and relation to early
risk factors. Epilepsia, 36, 1071–1077.
Schroeter, M. L., Ettrich, B., Schwier, C., Scheid, R., Guthke, T., & von
Cramon, D. Y. (2007). Diffuse axonal injury due to traumatic brain
injury alters inhibition of imitative response tendencies. Neuropsycho-
logia, 45, 3149 –3156.
Schutte, C., Axelrod, B. N., & Montoya, E. (2015). Making sure neuro-
psychological data are meaningful: Use of performance validity testing
in medicolegal and clinical contexts. Psychological Injury and Law, 8,
100 –105.
Shura, R. D., Miskey, H. M., Rowland, J. A., Yoash-Gantz, R. E., &
Denning, J. H. (2016). Embedded performance validity measures with
postdeployment veterans: Cross-validation and efficiency with multiple
measures. Applied Neuropsychology: Adult, 23, 94–104. http://dx.doi
Slick, D. J., Sherman, E. M., & Iverson, G. L. (1999). Diagnostic criteria
for malingered neurocognitive dysfunction: Proposed standards for clinical
practice and research. Clinical Neuropsychologist, 13, 545–561.
Spencer, R. J., Axelrod, B. N., Drag, L. L., Waldron-Perrine, B., Pangili-
nan, P. H., & Bieliauskas, L. A. (2013). WAIS-IV reliable digit span is
no more accurate than age corrected scaled score as an indicator of
invalid performance in a veteran sample undergoing evaluation for
mTBI. The Clinical Neuropsychologist, 27, 1362–1372. http://dx.doi
Stroop, J. R. (1935). Studies of interference in serial verbal reactions.
Journal of Experimental Psychology, 18, 643–662.
Sugarman, M. A., & Axelrod, B. N. (2015). Embedded measures of
performance validity using verbal fluency tests in a clinical sample.
Applied Neuropsychology: Adult, 22, 141–146.
Suhr, J. A., & Boyer, D. (1999). Use of the Wisconsin Card Sorting Test
in the detection of malingering in student simulator and patient samples.
Journal of Clinical and Experimental Neuropsychology, 21, 701–708.
Trueblood, W. (1994). Qualitative and quantitative characteristics of ma-
lingered and other invalid WAIS-R and clinical memory data. Journal of
Clinical and Experimental Neuropsychology, 16, 597–607. http://dx.doi
Vallabhajosula, B., & van Gorp, W. G. (2001). Post-Daubert admissibility
of scientific evidence on malingering of cognitive deficits. Journal of the
American Academy of Psychiatry and the Law, 29, 207–215.
Whiteside, D., Clinton, C., Diamonti, C., Stroemel, J., White, C., Zimber-
off, A., & Waters, D. (2010). Relationship between suboptimal cognitive
effort and the clinical scales of the Personality Assessment Inventory.
The Clinical Neuropsychologist, 24, 315–325.
Wolfe, P. L., Millis, S. R., Hanks, R., Fichtenberg, N., Larrabee, G. J., &
Sweet, J. J. (2010). Effort indicators within the California Verbal Learning
Test-II (CVLT-II). The Clinical Neuropsychologist, 24, 153–168.
Received October 5, 2016
Revision received July 11, 2017
Accepted July 17, 2017