Stereotype Threat and the Intellectual Test Performance of African Americans

Stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one's group. Studies 1 and 2 varied the stereotype vulnerability of Black participants taking a difficult verbal test by varying whether or not their performance was ostensibly diagnostic of ability, and thus, whether or not they were at risk of fulfilling the racial stereotype about their intellectual ability. Reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled). Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it. Study 4 showed that mere salience of the stereotype could impair Blacks' performance even when the test was not ability diagnostic. The role of stereotype vulnerability in the standardized test performance of ability-stigmatized groups is discussed.
ATTITUDES AND SOCIAL COGNITION
Claude M. Steele
Stanford University
Joshua Aronson
University of Texas, Austin
Not long ago, in explaining his career-long preoccupation with the American Jewish experience, the novelist Philip Roth said that it was not Jewish culture or religion per se that fascinated him, it was what he called the Jewish "predicament." This is an apt term for the perspective taken in the present research. It focuses on a social-psychological predicament that can arise from widely-known negative stereotypes about one's group. It is this: the existence of such a stereotype means that anything one does or any of one's features that conform to it make the stereotype more plausible as a self-characterization in the eyes of others, and perhaps even in one's own eyes. We call this predicament stereotype threat and argue that it is experienced, essentially, as a self-evaluative threat. In form, it is a predicament that can beset the members of any group about whom negative stereotypes exist. Consider the stereotypes elicited by the terms yuppie, feminist, liberal, or White male. Their prevalence in society raises the possibility for potential targets that the stereotype is true of them and, also, that other people will see them that way. When the allegations of the stereotype are importantly negative, this predicament may be self-threatening enough to have disruptive effects of its own.
Claude M. Steele, Department of Psychology, Stanford University; Joshua Aronson, School of Education, University of Texas, Austin. This research was supported by National Institutes of Health Grant MH51977, Russell Sage Foundation Grant 879.304, and by Spencer Foundation and James S. McDonnell Foundation postdoctoral fellowships, and its completion was aided by the Center for Advanced Study in the Behavioral Sciences.
We thank John Butner, Emmeline Chen, and Matthew McGlone for assistance and helpful comments on this research.
Correspondence concerning this article should be addressed to Claude M. Steele, Department of Psychology, Stanford University, Stanford, California 94305, or Joshua Aronson, School of Education, University of Texas, Austin, Texas 78712.
The present research examined the role these processes play in the intellectual test performance of African Americans. Our reasoning is this: whenever African American students perform an explicitly scholastic or intellectual task, they face the threat of confirming or being judged by a negative societal stereotype (a suspicion) about their group's intellectual ability and competence. This threat is not borne by people not stereotyped in this way. And the self-threat it causes, through a variety of mechanisms, may interfere with the intellectual functioning of these students, particularly during standardized tests. This is the principal hypothesis examined in the present research. But as this threat persists over time, it may have the further effect of pressuring these students to protectively disidentify with achievement in school and related intellectual domains. That is, it may pressure the person to define or redefine the self-concept such that school achievement is neither a basis of self-evaluation nor a personal identity. This protects the person against the self-evaluative threat posed by the stereotypes but may have the byproduct of diminishing interest, motivation, and, ultimately, achievement in the domain (Steele, 1992).
The anxiety of knowing that one is a potential target of prejudice and stereotypes has been much discussed: in classic social science (e.g., Allport, 1954; Goffman, 1963), popular books (e.g., Carter, 1991) and essays, as, for example, S. Steele's (1990) treatment of what he called racial vulnerability. In this last analysis, S. Steele made a connection between this experience and the school life of African Americans that has similarities to our own. He argued that after a lifetime of exposure to society's negative images of their ability, these students are likely to internalize an "inferiority anxiety," a state that can be aroused by a variety of race-related cues in the environment. This anxiety, in turn, can lead them to blame others for their troubles (for example, White racism), to underutilize available opportunities, and to generally form a victim's identity. These adaptations, in turn, the argument goes, translate into poor life success.
Journal of Personality and Social Psychology, 1995, Vol. 69, No. 5, 797-811
Copyright 1995 by the American Psychological Association, Inc. 0022-3514/95/$3.00
The present theory and research do not focus on the internalization of inferiority images or their consequences. Instead they focus on the immediate situational threat that derives from the broad dissemination of negative stereotypes about one's group: the threat of possibly being judged and treated stereotypically, or of possibly self-fulfilling such a stereotype. This threat can befall anyone with a group identity about which some negative stereotype exists, and for the person to be threatened in this way, he need not even believe the stereotype. He need only know that it stands as a hypothesis about him in situations where the stereotype is relevant. We focused on the stereotype threat of African Americans in intellectual and scholastic domains to provide a compelling test of the theory and because the theory, should it be supported in this context for this group, would have relevance to an important set of outcomes.
Gaps in school achievement and retention rates between White and Black Americans at all levels of schooling have been strikingly persistent in American society (e.g., Steele, 1992). Well publicized at the kindergarten through 12th grade level, recent statistics show that they persist even at the college level where, for example, the national drop-out rate for Black college students (the percentage who do not complete college within a 6-year window of time) is 70% compared to 42% for White Americans (American Council on Education, 1990). Even among those who graduate, their grades average two thirds of a letter grade lower than those of graduating Whites (e.g., Nettles, 1988). It has been most common to understand such problems as stemming largely from the socioeconomic disadvantage, segregation, and discrimination that African Americans have endured and continue to endure in this society, a set of conditions that, among other things, could produce racial gaps in achievement by undermining preparation for school.
Some evidence, however, questions the sufficiency of these explanations. It comes from the sizable literature examining racial bias in standardized testing. This work, involving hundreds of studies over several decades, generally shows that standardized tests predict subsequent school achievement as well for Black students as for White students (e.g., Cleary, Humphreys, Kendrick, & Wesman, 1975; Linn, 1973; Stanley, 1971). The slope of the lines regressing subsequent school achievement on entry-level standardized test scores is essentially the same for both groups. But embedded in this literature is another fact: At every level of preparation as measured by a standardized test (for example, the Scholastic Aptitude Test, or SAT), Black students with that score have poorer subsequent achievement (GPA, retention rates, time to graduation, and so on) than White students with that score (Jensen, 1980). This is variously known as the overprediction or underachievement phenomenon, because it indicates that, relative to Whites with the same score, standardized tests actually overpredict the achievement that Blacks will realize. Most important for our purposes, this evidence suggests that Black-White achievement gaps are not due solely to group differences in preparation. Blacks achieve less well than Whites even when they have the same preparation, and even when that preparation is at a very high level. Could this underachievement, in some part, reflect the stereotype threat that is a chronic feature of these students' schooling environments?
Research from the early 1960s, largely that of Irwin Katz and his colleagues (e.g., Katz, 1964) on how desegregation affected the intellectual performance of Black students, shows the sizable influence on Black intellectual performance of factors that can be interpreted as manipulations of stereotype threat. Katz, Roberts, and Robinson (1965), for example, found that Black participants performed better on an IQ subtest when it was presented as a test of eye-hand coordination (a nonevaluative and thus threat-negating test representation) than when it was said to be a test of intelligence. Katz, Epps, and Axelson (1964) found that Black students performed better on an IQ test when they believed their performance would be compared to other Blacks as opposed to Whites. But as evidence that bears on our hypothesis, this literature has several limitations. Much of the research was conducted in an era when American race relations were different in important ways than they are now. Thus, without their being replicated, the extent to which these findings reflect enduring processes of stereotype threat as opposed to the racial dynamics of a specific historical era is not clear. Also, this research seldom used White control groups. Thus it is difficult to know the extent to which some of the critical effects were mediated by the stereotype threat of Black students as opposed to processes experienced by any students.
Other research supports the present hypothesis by showing that factors akin to stereotype threat (that is, other factors that add self-evaluative threat to test taking or intellectual performance) are capable of disrupting that performance. The presence of observers or coactors, for example, can interfere with performance on mental tasks (e.g., Geen, 1985; Seta, 1982). Being a "token" member of a group (the sole representative of a social category) can inhibit one's memory for what is said during a group discussion (Lord & Saenz, 1985; Lord, Saenz, & Godfrey, 1987). Conditions that increase the importance of performing well (prizes, competition, and audience approval) have all been shown to impair performance of even motor skills (e.g., Baumeister, 1984). The stereotype threat hypothesis shares with these approaches the assumption that performance suffers when the situation redirects attention needed to perform a task onto some other concern: in the case of stereotype threat, a concern with the significance of one's performance in light of a devaluing stereotype.
For African American students, the act of taking a test purported to measure intellectual ability may be enough to induce this threat. But we assume that this is most likely to happen when the test is also frustrating. It is frustration that makes the stereotype (as an allegation of inability) relevant to their performance and thus raises the possibility that they have an inability linked to their race. This is not to argue that the stereotype is necessarily believed; only that, in the face of frustration with the test, it becomes more plausible as a self-characterization and thereby more threatening to the self. Thus for Black students who care about the skills being tested (that is, those who are identified with these skills in the sense of their self-regard being somewhat tied to having them), the stereotype loads the testing situation with an extra degree of self-threat, a degree not borne by people not stereotyped in this way. This additional threat, in turn, may interfere with their performance in a variety of ways: by causing an arousal that reduces the range of cues participants are able to use (e.g., Easterbrook, 1959), by diverting attention onto task-irrelevant worries (e.g., Sarason, 1972; Wine, 1971), or by causing an interfering self-consciousness (e.g., Baumeister, 1984) or overcautiousness (Geen, 1985). Or, through the ability-indicting interpretation it poses for test frustration, it could foster low performance expectations that would cause participants to withdraw effort (e.g., Bandura, 1977, 1986). Depending on the situation, several of these processes may be involved simultaneously or in alternation. Through these mechanisms, then, stereotype threat might be expected to undermine the standardized test performance of Black participants relative to White participants who, in this situation, do not suffer this added threat.
Study 1
Accordingly, Black and White college students in this experiment were given a 30-min test composed of items from the verbal Graduate Record Examination (GRE) that were difficult enough to be at the limits of most participants' skills. In the stereotype-threat condition, the test was described as diagnostic of intellectual ability, thus making the racial stereotype about intellectual ability relevant to Black participants' performance and establishing for them the threat of fulfilling it. In the non-stereotype-threat condition, the same test was described simply as a laboratory problem-solving task that was nondiagnostic of ability. Presumably, this would make the racial stereotype about ability irrelevant to Black participants' performance and thus preempt any threat of fulfilling it. Finally, a second nondiagnostic condition was included which exhorted participants to view the difficult test as a challenge. For practical reasons we were interested in whether stressing the challenge inherent in a difficult test might further increase participants' motivation and performance over what would occur in the nondiagnostic condition. The primary dependent measure in this experiment was participants' performance on the test adjusted for the influence of individual differences in skill level (operationalized as participants' verbal SAT scores).
We predicted that Black participants would underperform
relative to Whites in the diagnostic condition where there was
stereotype threat, but not in the two nondiagnostic condi-
tions—the non-diagnostic-only condition and the non-diagnos-
tic-plus-challenge condition—where this threat was presum-
ably reduced. In the non-diagnostic-challenge condition, we
also expected the additional motivation to boost the perfor-
mance of both Black and White participants above that ob-
served in the non-diagnostic-only condition. Several additional
measures were included to assess the effectiveness of the manip-
ulation and possible mediating states.
Method
Design and Participants
This experiment took the form of a 2 × 3 factorial design. The factors were race of the participant, Black or White, and a test description factor in which the test was presented as either diagnostic of intellectual ability (the diagnostic condition), as a laboratory tool for studying problem solving (the non-diagnostic-only condition), or as both a problem-solving tool and a challenge (the non-diagnostic-challenge condition). Test performance was the primary dependent measure. We recruited 117 male and female, Black and White Stanford undergraduates through campus advertisements which offered $10.00 for 1 hr of participation. The data from 3 participants were excluded from the analysis because they failed to provide their verbal SAT scores. This left a total of 114 participants randomly assigned to the three experimental conditions with the exception that we ensured an equal number of participants per condition.
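As a quick arithmetic check on this design, the error degrees of freedom for a between-subjects ANCOVA are conventionally the sample size minus the number of design cells minus the number of covariates. The sketch below is illustrative only: the function name `ancova_error_df` is hypothetical, and it assumes all six cells of the 2 × 3 design are estimated with verbal SAT as the single covariate.

```python
def ancova_error_df(n_participants: int, n_cells: int, n_covariates: int = 1) -> int:
    """Error df for a between-subjects ANCOVA: N - cells - covariates."""
    return n_participants - n_cells - n_covariates


race_levels = 2          # Black, White
description_levels = 3   # diagnostic, non-diagnostic-only, non-diagnostic-challenge
recruited = 117
excluded = 3             # participants who did not provide verbal SAT scores

analyzed = recruited - excluded
cells = race_levels * description_levels

# Prints the analyzed sample size, the number of cells, and the error df.
print(analyzed, cells, ancova_error_df(analyzed, cells))
```

Under these assumptions the result is 107, which agrees with the error degrees of freedom in the F and t tests reported for this study.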
Procedure
Participants who signed up for the experiment were contacted by telephone prior to their experimental participation and asked to provide their verbal and quantitative SAT scores, to rate their enjoyment of verbally oriented classes, and to provide background information (e.g., year in school, major, etc.). When participants arrived at the laboratory, the experimenter (a White man) explained that for the next 30 min they would work on a set of verbal problems in a format identical to the SAT exam, and end by answering some questions about their experience.
The participant was then given a page that stated the purpose of the
study, described the procedure for answering questions, stressed the
importance of indicating guessed answers (by a check), described the
test as very difficult and that they should expect not to get many of the
questions correct, and told them that they would be given feedback on
their performance at the end of the session. We included the informa-
tion about test difficulty to, as much as possible, equate participants'
performance expectations across the conditions. And, by acknowledg-
ing the difficulty of the test, we wanted to reduce the possibility that
participants would see the test as a miscalculation of their skills and
perhaps reduce their effort. This description was the same for all con-
ditions with the exception of several key phrases that comprised the
experimental manipulation.
Participants in the diagnostic condition were told that the study was
concerned with "various personal factors involved in performance on
problems requiring reading and verbal reasoning abilities." They were
further informed that after the test, feedback would be provided which "may be helpful to you by familiarizing you with some of your strengths and weaknesses" in verbal problem solving. As noted, participants in
all conditions were told that they should not expect to get many items
correct, and in the diagnostic condition, this test difficulty was justified
as a means of providing a "genuine test of your verbal abilities and lim-
itations so that we might better understand the factors involved in
both." Participants were asked to give a strong effort in order to "help
us in our analysis of your verbal ability."
In the non-diagnostic-only and non-diagnostic-challenge conditions, the description of the study made no reference to verbal ability. Instead, participants were told that the purpose of the research was to better understand the "psychological factors involved in solving verbal problems. . . ." These participants too were told that they would receive performance feedback, but it was justified as a means of familiarizing them "with the kinds of problems that appear on tests [they] may encounter in the future." In the non-diagnostic-only condition, the difficulty of the test was justified in terms of a research focus on difficult verbal problems and in the non-diagnostic-challenge condition it was justified as an attempt to provide "even highly verbal people with a mental challenge. . . ." Last, participants in both conditions were asked to give a genuine effort in order to "help us in our analysis of the problem solving process." As the experimenter left them to work on the test, to further differentiate the conditions, participants in the non-diagnostic-only condition were asked to try hard "even though we're not going to evaluate your ability." Participants in the non-diagnostic-challenge condition were asked to "please take this challenge seriously even though we will not be evaluating your ability."
Dependent Measures
The primary dependent measure was participants' performance on
30 verbal items, 27 of which were difficult items taken from GRE study
guides (only 30% of earlier samples had gotten these items correct)
and 3 difficult anagram problems. Both the total number correct and
an accuracy index of the number correct over the number attempted
were analyzed.
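The accuracy index described above is simply the number correct divided by the number attempted. A minimal sketch follows; the function name `accuracy_index` and the example counts are hypothetical and are not taken from the study's data.

```python
def accuracy_index(num_correct: int, num_attempted: int) -> float:
    """Proportion of attempted items answered correctly."""
    if num_attempted <= 0:
        raise ValueError("accuracy is undefined when no items were attempted")
    if not 0 <= num_correct <= num_attempted:
        raise ValueError("num_correct must lie between 0 and num_attempted")
    return num_correct / num_attempted


# Hypothetical participant: 12 correct out of 24 attempted on the 30-item test.
print(accuracy_index(12, 24))
```

Unlike the total number correct, this index does not penalize a participant for leaving items unanswered, which is why the two measures can diverge.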
Participants next completed an 18-item self-report measure of their current thoughts relating to academic competence and personal worth (e.g., "I feel confident about my abilities," "I feel self-conscious," "I feel as smart as others," etc.). These were measured on 5-point scales anchored by the phrases not at all (1) and extremely (5). Participants also completed a 12-item measure of cognitive interference frequently used in test anxiety research (Sarason, 1980) on which they indicated the frequency of several distracting thoughts during the exam (e.g., "I wondered what the experimenter would think of me," "I thought about how poorly I was doing," "I thought about the difficulty of the problems," etc.) by putting a number from 1 (never) to 5 (very often) next to each statement. Participants then rated how difficult and biased they considered the test on 15-point scales anchored by the labels not at all (1) and extremely (15). Next, participants evaluated their own performance by estimating the number of problems they correctly solved, and by comparing their own performance to that of the average Stanford student on a 15-point scale with the end points much worse (1) and much better (15). Finally, as a check on the manipulation, participants responded to the question:
The purpose of this experiment was to: (a) provide a genuine test
of my abilities in order to examine personal factors involved in
verbal ability; (b) provide a challenging test in order to examine
factors involved in solving verbal problems; (c) present you with
unfamiliar verbal problems to measure verbal learning.
Participants were asked to circle the appropriate response.
Results
Because there were no main or interactive effects of gender on
verbal test performance or the self-report measures, we col-
lapsed over this factor in all analyses.
Manipulation Check
Chi-square analyses performed on participants' responses to the postexperimental question about the purpose of the study revealed only an effect of condition, χ²(2) = 43.18, p < .001. Participants were more likely to believe the purpose of the experiment was to evaluate their abilities in the diagnostic condition (65%) than in the nondiagnostic condition (3%), or the challenge condition (11%).
Test Performance
The ANCOVA on the number of items participants got correct, using their self-reported SAT scores as the covariate (Black mean = 592, White mean = 632) revealed a significant condition main effect, F(2, 107) = 4.74, p < .02, with participants in the non-diagnostic-challenge condition performing higher than participants in the non-diagnostic-only and diagnostic conditions, respectively, and a significant race main effect, F(1, 107) = 5.22, p < .03, with White participants performing higher than Black participants.¹ The race-by-condition interaction did not reach conventional significance (p < .19). The adjusted condition means are presented in Figure 1.
[Figure 1. Mean test performance, Study 1. Series: Black subjects, White subjects. Conditions: diagnostic, nondiagnostic, challenge.]
If making the test diagnostic of ability depresses the perfor-
mance of Black students through stereotype threat, then their
performance should be lower in the diagnostic condition than
in either the non-diagnostic-only or non-diagnostic-challenge
conditions which presumably lessened stereotype threat, and it
should be lower than that of Whites in the diagnostic condition.
Bonferroni contrasts² with SATs as a covariate supported this reasoning by showing that Black participants in the diagnostic condition performed significantly worse than Black participants in either the nondiagnostic condition, t(107) = 2.88, p < .01, or the challenge condition, t(107) = 2.63, p < .01, as well as significantly worse than White participants in the diagnostic condition, t(107) = 2.64, p < .01.
But, as noted, the interaction testing the differential effect of test diagnosticity on Black and White participants did not reach significance. This may have happened, however, because an incidental pattern of means (Whites slightly outperforming Blacks in the non-diagnostic-challenge condition) undermined the overall interaction effect. To pursue a more sensitive test, we constructed a weighted contrast that compared the size of the race effect in the diagnostic condition with that in the nondiagnostic condition and assigned weights of zero to the White and Black non-diagnostic-challenge conditions. This analysis (including the use of SATs as a covariate) reached marginal significance, F(1, 107) = 3.27, p < .08. In sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance.
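The logic of such a weighted contrast can be sketched numerically. Because the adjusted means for number correct appear only in Figure 1, this illustration borrows the adjusted accuracy means reported in the Accuracy subsection instead. The weight values (+1, -1, 0) and the names `CELL_MEANS`, `WEIGHTS`, and `contrast_value` are assumptions made for illustration, not the authors' actual computation, and the sketch does not reproduce the reported F test, which also depends on the ANCOVA error term.

```python
# Adjusted mean accuracy per (race, condition) cell, from the Accuracy subsection.
CELL_MEANS = {
    ("Black", "diagnostic"): 0.420,
    ("White", "diagnostic"): 0.519,
    ("Black", "nondiagnostic"): 0.546,
    ("White", "nondiagnostic"): 0.518,
    ("Black", "challenge"): 0.490,
    ("White", "challenge"): 0.561,
}

# Assumed weights: (White - Black gap, diagnostic) minus (White - Black gap,
# nondiagnostic), with both challenge cells weighted zero.
WEIGHTS = {
    ("Black", "diagnostic"): -1,
    ("White", "diagnostic"): 1,
    ("Black", "nondiagnostic"): 1,
    ("White", "nondiagnostic"): -1,
    ("Black", "challenge"): 0,
    ("White", "challenge"): 0,
}


def contrast_value(means: dict, weights: dict) -> float:
    """Weighted sum of cell means; the weights must sum to zero."""
    if sum(weights.values()) != 0:
        raise ValueError("contrast weights must sum to zero")
    return sum(weights[cell] * means[cell] for cell in means)


print(round(contrast_value(CELL_MEANS, WEIGHTS), 3))
```

A positive value indicates that the White-minus-Black gap is larger in the diagnostic condition than in the nondiagnostic condition, which is the direction the hypothesis predicts.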
1 Because we did not warn participants to avoid guessing in these ex-
periments, we do not report the performance results in terms of the
index used by Educational Testing Service, which includes a correction
for guessing. This correction involves subtracting from the number cor-
rect, the number wrong adjusted for the number of response options for
each wrong item and dividing this by the number of items on the test.
Because 27 of our 30 items had the same number of response options (5), this correction amounts to adjusting the number correct almost
invariably by the same number. All analyses are the same regardless of
the index used.
2 All comparisons of adjusted means reported hereafter used the
Bonferroni procedure.
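The guessing correction described in Footnote 1 can be sketched as follows. The footnote gives the rule only in words; the penalty of one wrong answer divided by the number of response options minus one is the conventional formula-scoring rule and is assumed here. The function names and the example counts are hypothetical.

```python
def formula_score(num_correct: int, num_wrong: int, num_options: int = 5) -> float:
    """Number correct minus a guessing penalty of wrong / (options - 1)."""
    if num_options < 2:
        raise ValueError("an item needs at least two response options")
    return num_correct - num_wrong / (num_options - 1)


def formula_score_proportion(num_correct: int, num_wrong: int,
                             num_items: int, num_options: int = 5) -> float:
    """Corrected score divided by the number of items on the test."""
    return formula_score(num_correct, num_wrong, num_options) / num_items


# Hypothetical participant: 12 correct, 8 wrong, 10 omitted on a 30-item test.
print(formula_score(12, 8))
print(round(formula_score_proportion(12, 8, 30), 3))
```

When nearly every item has five options, the penalty is the same fixed fraction of the wrong answers for everyone, which is consistent with the footnote's remark that the analyses come out the same under either index.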
Accuracy
An ANCOVA on accuracy, the proportion correct of the number attempted, with SATs as the covariate, found that neither the condition main effect nor the interaction reached significance, although there was a marginally significant tendency for Black participants to evidence less accuracy, p < .10. This tendency was primarily due to Black participants in the diagnostic condition who had the lowest adjusted mean accuracy of any group in the experiment, .420. The adjusted means for the White diagnostic, White non-diagnostic-only, White non-diagnostic-challenge, Black non-diagnostic-only, and Black non-diagnostic-challenge conditions were .519, .518, .561, .546, and .490, respectively. Bonferroni tests revealed that Black participants in the diagnostic condition were reliably less accurate than Black participants in the non-diagnostic-only condition and White participants in the diagnostic condition, t(107) = 2.64, p < .01, and t(107) = 2.13, p < .05, respectively.
No condition or interaction effects reached significance for the number of items completed or the number of guesses participants recorded on the test (all Fs < 1). The overall means for these two measures were 22.9 and 4.1, respectively.
Self-Report Measures
There were no significant condition effects on the self-report measure of academic competence and personal worth or on the self-report measure of disruptive thoughts and feelings during the test. Analysis of participants' responses to the question about test bias yielded a main effect of race, F(1, 107) = 10.47, p < .001. Black participants in all conditions thought the test was more biased than White participants did.
Perceived Performance
Participants' estimates of how many problems they solved correctly and of how they compared to other participants both showed significant condition main effects, F(2, 106) = 7.91, p < .001, and F(2, 107) = 3.17, p < .05, respectively. Performance estimates were higher in the non-diagnostic-only condition (M = 11.81) than in either the diagnostic (M = 9.20) or non-diagnostic-challenge conditions (M = 8.15). Bonferroni tests showed that Black participants in the diagnostic condition (M = 4.89) saw their relative performance as poorer than Black participants in the non-diagnostic-only condition (M = 6.54), t(107) = 2.81, p < .01, and than Black participants in the non-diagnostic-challenge condition (M = 6.30), t(107) = 2.40, p < .02, while test description had no effect on the ratings of White participants. The overall mean was 5.86.
Discussion
With SAT differences statistically controlled, Black participants performed worse than White participants when the test was presented as a measure of their ability, but improved dramatically, matching the performance of Whites, when the test was presented as less reflective of ability. Nonetheless, the race-by-diagnosticity interaction testing this relationship reached only marginal significance, and then, only when participants from the non-diagnostic-challenge condition were excluded from the analysis. Thus there remained some question as to the reliability of this interaction.
We had also reasoned that stereotype threat might undermine performance by increasing interfering thoughts during the test. But the conditions affected neither self-evaluative thoughts nor thoughts about the self in the immediate situation (Sarason, 1980). Thus to further test the reliability of the predicted interaction and explore the mediation of the stereotype threat effect, we conducted a second experiment.
Study 2
We argued that the effect of stereotype threat on performance is mediated by an apprehension over possibly conforming to the negative group stereotype. Could this apprehension be detected as a higher level of general anxiety among stereotype-threatened participants? To test this possibility, participants in all conditions completed a version of the Spielberger State Anxiety Inventory (STAI) immediately after the test. This scale has been successfully used in other research to detect anxiety induced by evaluation apprehension (e.g., Geen, 1985). We also measured the amount of time they spent on each test item to learn whether greater anxiety was associated with more time spent answering items.
Method
Participants
Twenty Black and 20 White Stanford female undergraduates were
randomly assigned (with the exception of attaining equal cell sizes) to
either the diagnostic or the nondiagnostic conditions as described in
Study 1, yielding 10 participants per condition. Female participants
were used in this experiment because, due to other research going on,
we had considerably easier access to Black female undergraduates than
to Black male undergraduates. This decision was justified by the finding
of no gender differences in the first study, or, as it turned out, in any of
the subsequent studies reported in this article—all of which used both
men and women.
Procedure
This experiment used the same test used in Study 1, with several
exceptions: the final three anagram problems were deleted and the test
period was reduced from 30 to 25 min. Also, the test was presented on
a Macintosh computer (LCII). Participants controlled with the mouse
how long each item or item component was on the screen and could, at
their own pace, access whatever item material they wanted to see. The
computer recorded the amount of time the items, or item components,
were on the screen as well as the number of referrals between item com-
ponents (as in the reading comprehension items)—in addition to re-
cording participants' answers.
Following the exam, participants completed the STAI and the cognitive
interference measure described for Study 1. Also, on 11-point scales
(with end-points not at all and extremely) participants indicated the
extent to which they guessed when having difficulty, expended effort on
the test, persisted on problems, limited their time on problems, read
problems more than once, became frustrated and gave up, and felt that
the test was biased.
Results and Discussion
The ANCOVA performed on the number of items correctly
solved yielded a significant main effect of race, F(1, 35) = 10.04,
p < .01, qualified by a significant Race X Test Description
interaction, F(1, 35) = 8.07, p < .01. The mean SAT score
for Black participants was 603 and for White participants 655.
The adjusted means are presented in Figure 2. Planned con-
trasts on the adjusted scores revealed that, as predicted, Blacks
in the diagnostic condition performed significantly worse than
Blacks in the nondiagnostic condition, t(35) = 2.38, p < .02,
than Whites in the diagnostic condition, t(35) = 3.75, p < .001,
and than Whites in the nondiagnostic condition, t(35) = 2.34,
p < .025.
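The text does not spell out how the adjusted means were computed. As a reading aid, the conventional ANCOVA adjustment is sketched below, on the assumption that the standard single-covariate form applies here. The symbols are illustrative and not taken from the article: \bar{Y}_j is the unadjusted mean number of items solved in cell j, \bar{X}_j is that cell's mean SAT score, \bar{X} is the grand mean SAT score, and b_w is the pooled within-cell regression slope of items solved on SAT.

```latex
\[
  \bar{Y}_{j}^{\,\mathrm{adj}} \;=\; \bar{Y}_{j} \;-\; b_{w}\,\bigl(\bar{X}_{j} - \bar{X}\bigr)
\]
```

Under this form, a cell whose mean SAT score is below the grand mean has its performance mean adjusted upward when b_w is positive, and a cell above the grand mean is adjusted downward.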
For accuracy—the number correct over the number at-
tempted—a similar pattern emerged: Blacks in the diagnostic
condition had lower accuracy (M = .392) than Blacks in the
nondiagnostic condition (M = .490) or than Whites in either
the diagnostic condition (M = .485) or the nondiagnostic
condition (M = .435). The diagnosticity-by-race interaction
testing this pattern reached significance, F(1, 35) = 4.18, p < .05.
But the planned contrasts of the Black diagnostic condition
against the other conditions did not reach conventional signifi-
cance, although its contrasts with the Black nondiagnostic and
White diagnostic conditions were marginally significant, with
ps of .06 and .09, respectively.
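As a small illustration of the accuracy measure defined in this paragraph (number correct over number attempted), the following Python sketch computes it for one participant. The function name and the example counts are hypothetical and are not data from the study.

```python
def accuracy(num_correct: int, num_attempted: int) -> float:
    """Proportion of attempted items that were answered correctly."""
    if num_attempted <= 0:
        raise ValueError("accuracy is undefined when no items were attempted")
    if not 0 <= num_correct <= num_attempted:
        raise ValueError("num_correct must lie between 0 and num_attempted")
    return num_correct / num_attempted


# Hypothetical participant: 6 items correct out of 15 attempted.
example = accuracy(6, 15)
```

Because the denominator is the number of items attempted rather than the number of items on the test, a participant who works slowly but carefully can score high on accuracy while still solving few items, which is why the article reports the two measures separately.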
Blacks completed fewer items than Whites, F(1, 35) = 9.35,
p < .01, and participants in the diagnostic conditions tended
to complete fewer items than those in the nondiagnostic
conditions, F(1, 35) = 3.69, p < .07. The overall interaction did
not reach significance. But planned contrasts revealed that
Black participants in the diagnostic condition finished fewer
items (M = 12.38) than Blacks in the nondiagnostic condition
(M = 18.53), t(35) = 2.50, p < .02; than Whites in the
diagnostic condition (M = 20.93), t(35) = 3.39, p < .01; and than
Whites in the nondiagnostic condition (M = 21.45), t(35) =
3.60, p < .01.
These results establish the reliability of the diagnosticity-by-
race interaction for test performance that was marginally sig-
nificant in Study 1. They also reveal another dimension of the
effect of stereotype threat. Black participants in the diagnostic
condition completed fewer test items than participants in the other
conditions. Test diagnosticity impaired the rate, as well as
the accuracy, of their work. This is precisely the impairment
caused by evaluative pressures such as evaluation apprehen-
sion, test anxiety, and competitive pressure (e.g., Baumeister,
1984).
But one might ask why this did not happen in the near-
identical Study 1. Several factors may be relevant. First, the
most involved test items (reading comprehension items that
took several steps to answer) came first in the test. And second,
the test lasted 25 min in the present experiment whereas it
lasted 30 min in the first experiment. Assuming, then, that
stereotype threat slowed the pace of Black participants in the
diagnostic conditions of both experiments, this 5-min difference in
test period may have made it harder for these participants in the
present experiment to get past the early, involved items and
onto the more quickly answered items at the end of the test, a
possibility that may also explain the generally lower scores in
this experiment.
Figure 2. Mean test performance, Study 2 (Black and White subjects in the diagnostic and nondiagnostic conditions).
This view is reinforced by the ANCOVA (with SATs as a
covariate) on the average time spent on each of the first five test
items, the minimum number of items that all participants in
all conditions answered. A marginal effect of test presentation
emerged, F(1, 35) = 3.52, p < .07, but planned comparisons
showed that Black participants in the diagnostic condition
tended to be slower than participants in the other conditions.
On average they spent 94 s answering each of these items in
contrast to 71 s for Black participants in the nondiagnostic
condition, t(35) = 2.39, p < .05; 73 s for Whites in the diagnostic
condition, t(35) = 2.12, p < .05; and 71 s for Whites in the
nondiagnostic condition, t(35) = 2.37, p < .05. Like other
forms of evaluative pressure, stereotype threat causes an im-
pairment of both accuracy and speed of performance.
No differences were found on any of the remaining measures,
including self-reported effort, cognitive interference, or anxiety.
These measures may have been insensitive, or too delayed.
Nonetheless, we lack an important kind of evidence. We have
not shown that test diagnosticity causes in Black participants a
specific apprehension about fulfilling the negative group stereo-
type about their ability—the apprehension that we argue dis-
rupts their test performance. To examine this issue we con-
ducted a third experiment.
Study 3
Taking an intellectually diagnostic test and experiencing some
frustration with it, we have assumed, is enough to cause stereotype
threat for Black participants. In testing this reasoning, the present
experiment examines several specific propositions.
First, if taking or expecting to take a difficult, intellectually
diagnostic test makes Black participants feel threatened by a
specifically racial stereotype, then it might be expected to activate that
stereotype in their thinking and information processing. That is,
the racial stereotype, and perhaps also the self-doubts associated
with it, should be more cognitively activated for these participants
than for Black participants in the nondiagnostic condition or for
White participants in either condition (e.g., Dovidio, Evans, &
Tyler, 1986; Devine, 1989; Higgins, 1989). Accordingly, in testing
whether test diagnosticity arouses this state, the present experi-
ment measured the effect of conditions on the activation of this
stereotype and of related self-doubts about ability.
Second, if test diagnosticity makes Black participants apprehensive
about fulfilling and being judged by the racial stereotype,
then these participants, more than participants in the other
conditions, might be motivated to disassociate themselves
from the stereotype. Brent Staples, an African American editorialist
for the New York Times, offers an example of this in his
recent autobiography, Parallel Time. He describes beginning
graduate school at the University of Chicago and finding that as
he walked the streets of Hyde Park he made people uncomfortable.
They grouped more closely when he walked by, and some
even crossed the street to avoid him. He eventually realized that
in that urban context, dressed as a student, he was being perceived
through the lens of a race-class stereotype as a potentially
menacing Black man. To deflect this perception he learned a
trick; he would whistle Vivaldi. It worked. Upon hearing him
do this, people around him visibly relaxed and he felt out of
suspicion. If it is apprehension about being judged in light of the
racial stereotype that interferes with the performance of Black
participants in the diagnostic condition, then these participants,
like Staples, might be motivated to deflect such a perception by
showing that the broader racial stereotype is not applicable to
them. To test this possibility, the present experiment measured
the effect of conditions on participants' stated preferences for
such things as activities and styles of music, some of which were
stereotypic of African Americans.
Third, by adding to the normal evaluative risks of test perfor-
mance the further risk of self-validating the racial stereotype,
the diagnostic condition should also make Black participants
more apprehensive about their test performance. The present
experiment measured this apprehension as the degree to which
participants self-handicapped their expected performance, that
is, endorsed excuses for poor performance before the test.
The experiment took the form of a 2 X 3 design in which the
race of participants (African American or White) was crossed
with diagnostic, nondiagnostic, and control conditions. The
diagnostic and nondiagnostic conditions were the same as
those described for Study 2, while in the control condition par-
ticipants completed the critical dependent measures without
expecting to take a test of any sort. In the experimental conditions,
the dependent measures were administered immediately
after the diagnosticity instructions and just before the test was
ostensibly to be taken. These included measures of stereotype
activation, stereotype avoidance, and, as a measure of general
performance apprehension, participants' willingness to self-handicap.
Participants in this experiment never took the test.
The measures of stereotype activation and stereotype avoidance,
we felt, could activate the racial stereotype and stereotype
threat among Black participants in both the diagnostic
and nondiagnostic conditions, making performance results
difficult to interpret.
If test diagnosticity threatens Black participants with a
specifically racial stereotype, then Black participants in the
diagnostic condition, more than participants in the other conditions,
should show greater cognitive activation of the stereotype
and ability-related self-doubts, greater motivation to disassoci-
ate themselves from the stereotype, and greater performance
apprehension as indicated by the endorsement of self-handicap-
ping excuses.
Method
Participants
Thirty-five Black (9 male, 26 female) and 33 White (20 male, 13
female) Stanford undergraduates were randomly assigned to either a
diagnostic, nondiagnostic, or control condition, yielding from 10 to 12
participants per experimental group.
Procedure
A White male experimenter gave a booklet to participants as they
arrived that explained that the study was examining the relationship
between two types of cognitive processes: lexical access processing
(LAP) and higher verbal reasoning (HVR). They were told that they
would be asked to complete two tasks, one of which measured LAP
("the visual and recognition processing of words") and the other of
which measured HVR ("abstract reasoning about the meaning of
words"). Test diagnosticity was manipulated as in Study 1 with the
following written instructions to further differentiate the conditions:
Diagnostic: Because we want an accurate measure of your ability
in these domains, we want to ask you to try as hard as you can to
perform well on these tasks. At the end of the study, we can give
you feedback which may be helpful by pointing out your strengths
and weaknesses.
Nondiagnostic: Even though we are not evaluating your ability on
these tasks, we want to ask you to try as hard as you can to perform
well on these tasks. If you want to know more about your LAP
and HVR performance, we can give you feedback at the end of the
study.
Finally, participants were shown one sample item from the LAP (an
item of the same sort as used in the fragment completion task) and three
sample items from the HVR—difficult verbal GRE problems. The pur-
pose of the HVR sample items was to alert participants to the difficulty
of the test and the possibility of poor performance, thus occasioning the
relevance of the racial stereotype in the diagnostic condition.
Participants in the control condition arrived at the laboratory to find
a note on the door from the experimenter apologizing for not being
present. The note instructed them to complete a set of measures lying
on the desk in an envelope with the participant's name on it. The enve-
lope contained the LAP word fragment measure and the stereotype
avoidance measure (described below) with detailed instructions. No
mention of verbal ability evaluation was made.
Measures
Stereotype activation. Participants first performed a word-fragment
completion task, introduced as the "LAP task," versions of which have
been shown to measure the cognitive activation of constructs that are
either recently primed or self-generated (Gilbert & Hixon, 1991;
Tulving, Schacter, & Stark, 1982). The task was made up of 80 word
fragments with missing letters specified as blank spaces (e.g., _ _ C E).
Twelve of these fragments had as one possible solution a word reflecting
either a race-related construct or an image associated with African
Americans. The list was generated by having a group of 40 undergradu-
ates (White students from the introductory psychology pool) generate
a set of words that reflected the image of African Americans. From these
lists, the research team identified the 12 most common constructs (e.g.,
lower class, minority) and selected single words to represent those con-
structs on the task. For example, the word "race" was used to represent
the construct "concerned with race" on the task. Then, for each of the
words placed on the task, at least two letter spaces were omitted and the
word was checked again to determine whether other, non-stereotype-
related associations to the word stem were possible. Leaving at least
two letter spaces blank in each word fragment greatly unconstrains the
number of word completions possible for each fragment when com-
pared to leaving only one letter space blank. This reduces the chance of
ceiling effects in which virtually all participants would think of the
race-related fragment completion. The complete list was as follows:
_ _ C E (RACE); L A _ _ (LAZY); _ _ A C K (BLACK);
_ _ O R (POOR); C L _ S _ (CLASS); B R _ _ _ _ _ (BROTHER);
_ _ _ T E (WHITE); M I _ _ _ _ _ _ (MINORITY);
W E L _ _ _ _ (WELFARE); C O _ _ _ (COLOR); T O _ _ _ (TOKEN).
We included a fairly high number (12) of target fragments so that if
ceiling or floor effects occurred on some fragments it would be less likely
to damage the sensitivity of the overall measure. To reduce the chance
that participants would become aware of the racial nature of the target
fragments, they were spaced with at least three filler items between
them, and there were only two target fragments per page in the task
booklet. Participants were instructed to work quickly, spending no more
than 15 s on each item.
Self-doubt activation. Seven word fragments reflecting self-doubts
about competence and ability were included in the 80-item LAP task:
L O _ _ _ (LOSER); D U _ _ (DUMB); S H A _ _ (SHAME);
_ _ _ E R I O R (INFERIOR); F L _ _ _ (FLUNK); _ A R D
(HARD); W _ _ K (WEAK). These were generated by the research
team, and again included at least two blank letter spaces in each frag-
ment. As with the racial fragments, these were separated from one an-
other (and from the racial fragments) by at least three filler items.
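The fragment measures above can be illustrated with a short Python sketch. It checks whether a response is a valid completion of a fragment and counts how many responses equal the target word. The function names, the exact-match scoring rule, and the example responses are assumptions made for illustration; the article states only that the number of target fragments filled in with target completions was counted. The two fragments shown, C L _ S _ (CLASS) and _ A R D (HARD), come from the lists above.

```python
def fragment_matches(fragment: str, word: str) -> bool:
    """Return True if `word` is a valid completion of `fragment`.

    Underscores mark blank letter spaces. Spaces between characters are
    ignored, so "C L _ S _" and "CL_S_" are treated alike.
    """
    pattern = fragment.replace(" ", "").upper()
    candidate = word.strip().upper()
    if len(pattern) != len(candidate):
        return False
    return all(p == "_" or p == c for p, c in zip(pattern, candidate))


def count_target_completions(responses: dict, targets: dict) -> int:
    """Count fragments whose response equals the target word.

    `responses` maps fragment -> the participant's completion.
    `targets` maps fragment -> the target word for that fragment.
    Exact-match scoring is an assumption, not the article's stated rule.
    """
    count = 0
    for fragment, target in targets.items():
        answer = responses.get(fragment, "").strip().upper()
        if fragment_matches(fragment, answer) and answer == target.upper():
            count += 1
    return count


# Two fragments from the article, with hypothetical responses.
targets = {"C L _ S _": "CLASS", "_ A R D": "HARD"}
responses = {"C L _ S _": "close", "_ A R D": "hard"}
score = count_target_completions(responses, targets)
```

The first hypothetical response, "close", is a legitimate completion of C L _ S _ that is not the target word, which illustrates why leaving several letter spaces blank keeps a fragment from forcing the target completion.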
Stereotype avoidance. This measure asked participants to rate their
preferences for a variety of activities and to rate the self-descriptiveness
of various personality traits, some of which were associated with images
of African Americans and African American life. Participants in the
diagnostic and nondiagnostic conditions were told that these ratings
were taken to give us a better understanding of the underpinnings of
LAP and HVR processes. Control participants were told that these
measures were being taken to assess the typical interests and personality
traits of Stanford undergraduates. The measure contained 57 items
asking participants to rate the extent to which they enjoyed a number of
activities (e.g., pleasure reading, socializing, shopping, traveling, etc.),
types of music (e.g., jazz, rap music, classical music), sports (e.g.,
baseball, basketball, boxing), and finally, how they saw themselves
standing on various personality dimensions (e.g., extroverted, organized,
humorous, etc.). All ratings were made on 7-point Likert scales with 1
indicating the lowest preference or degree of trait descriptiveness.
Some of these
activities and traits were stereotypic of African Americans. For an item
to be selected as stereotypic, 65% of our pretest sample of 40 White
participants had to have generated the item when asked to list activities
and traits they believed to be stereotypic of African Americans. In the
activities category, the stereotype-relevant items were: "How much do
you enjoy sports?" and "How much do you enjoy being a lazy 'couch
potato'?" The stereotype-relevant music preference item was rap music;
the stereotype-relevant sports preference item was basketball; and the
stereotype-relevant trait ratings were lazy and aggressive/belligerent.
Participants also completed a brief demographic questionnaire
(asking their age, gender, major, etc.) just before they expected to begin
the test. As another measure of participants' motivation to distance
themselves from the stereotype, the second item of this questionnaire
gave them the option of recording their race. We reasoned that partici-
pants who wanted to avoid having their performance viewed through
the lens of a racial stereotype would be less willing to indicate their race.
Self-handicapping measure. This measure just preceded the demographic
questionnaire. The directions stated, "As you know, student life
is sometimes stressful, and we may not always get enough sleep, etc.
Such things can affect cognitive functioning, so it will be necessary to
ask how prepared you feel." Participants then indicated the number of
hours they slept the night before in addition to responding, on 7-point
scales (with 7 being the higher rating on these dimensions) to the
following questions: "How able to focus do you feel?"; "How much stress
have you been under lately?"; "How tricky/unfair do you typically find
standardized tests?"
Results
Stereotype Activation
A 2 (race) X 3 (condition: diagnostic, nondiagnostic, or
control) ANCOVA (with verbal SAT as the covariate: Black
mean = 581, White mean = 650) was performed on the number
of target word fragments filled in with stereotypic completions.
This analysis yielded significant main effects for both race, F(1,
61) = 13.77, p < .001, and for experimental condition, F(2, 61)
= 5.90, p < .005. These main effects, however, were qualified by
a significant Race X Condition interaction, F(2, 61) = 3.30, p
< .05. Figure 3 shows that, as expected, the diagnostic condition
significantly increased the number of race-related completions
of Black participants but not of White participants. Black par-
ticipants in the diagnostic condition produced more race-re-
lated completions (M = 3.70) than Black participants in the
nondiagnostic condition (M = 2.10), t(61) = 3.53, p < .001, or
for that matter, more than participants in any of the other
conditions, all ps < .05.
Self-Doubt Activation
It did the same for their self-doubts. The number of
self-doubt-related completions of self-doubt target fragments was
submitted to an ANCOVA (as described above), yielding a main
effect of experimental condition, F(2, 61) = 4.33, p < .02, and
a Race X Condition interaction, F(2, 61) = 3.34, p < .05. As
Figure 3 shows, Black participants in the diagnostic condition,
as predicted, generated the most self-doubt-related completions,
significantly more than Black participants in the nondiagnostic
condition, t(61) = 3.52, p < .001, and more than participants
in any of the other conditions as well, all ps < .05.
Stereotype Avoidance
The six preference and stereotype items described above
were summed to form an index of stereotype avoidance that
ranged from 6 to 42 with 6 indicating high avoidance and 42
indicating low avoidance (Cronbach's alpha = .65). When
these scores were submitted to the ANCOVA they yielded a
significant effect of condition, F(2, 61) = 4.73, p < .02, and a
significant Race X Condition interaction, F(2, 61) = 4.14, p
< .03. As can be seen in Figure 3, Black participants in the
diagnostic condition were the most avoidant of conforming to
stereotypic images of African Americans (M = 20.80), more
so than Black participants in the nondiagnostic condition (M
= 29.80), t(61) = 3.61, p < .001, and/or White participants
in either condition, all ps < .05.
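The index construction described in this paragraph (six ratings on 1 to 7 scales summed to a score from 6 to 42, with internal consistency reported as Cronbach's alpha) can be sketched in Python as follows. The function names and the ratings are hypothetical, and the alpha computation uses the standard textbook formula with sample variances; the article does not say how its value of .65 was computed.

```python
from statistics import variance


def avoidance_index(ratings):
    """Sum six 1-7 ratings; lower totals indicate greater avoidance."""
    if len(ratings) != 6 or any(not 1 <= r <= 7 for r in ratings):
        raise ValueError("expected six ratings on a 1-7 scale")
    return sum(ratings)


def cronbach_alpha(rows):
    """Cronbach's alpha for per-participant lists of item ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(totals)),
    where k is the number of items.
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings for four participants (not data from the study).
sample = [
    [2, 3, 2, 4, 1, 2],
    [5, 6, 5, 6, 4, 5],
    [3, 3, 4, 5, 2, 3],
    [6, 7, 6, 7, 5, 6],
]
indices = [avoidance_index(row) for row in sample]
alpha = cronbach_alpha(sample)
```

Because the six items are summed without reversal, a participant who gives low ratings to the stereotype-relevant activities and traits receives a low index, which is the direction the article labels as high avoidance.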
Indicating Race
Did the ability diagnosticity of the test affect participants'
tendency to indicate their race on the demographic question-
naire? Among Black participants in the diagnostic condition,
only 25% would indicate their race on the questionnaire,
whereas 100% of the participants in each of the other conditions
would do so. Using a 0/1 conversion of the response frequencies
(with 0 = refusal to indicate race and 1 = indication of race),
the standard ANCOVA performed on this measure revealed a
marginally significant effect of race, F(1, 61) = 3.86, p < .06, a
significant effect of condition, F(2, 61) = 3.40, p < .04, and a
significant Race X Condition interaction, F(1, 61) = 6.60, p
< .01, all due, of course, to the unique unwillingness of Black
participants in the diagnostic condition to indicate their race.
Figure 3. Indicators of stereotype threat. Panels: stereotype activation measure, self-doubt activation measure, and stereotype avoidance measure, each shown for the diagnostic, nondiagnostic, and control conditions.
Self-Handicapping
Four measures assessed participants' desire to claim impedi-
ments to performance. Because participants in the control con-
ditions did not complete this measure, these responses were
submitted to separate 2(race) X 2(diagnosticity) ANCOVAs.
Cell means are presented in Table 1. Framing the verbal tasks
as diagnostic of ability had significant effects on three of the
four measures. For the number of hours of sleep, the ANCOVA
yielded a significant effect of race, F(1, 39) = 8.22, p < .01, and
a significant effect of condition, F(1, 39) = 6.53, p < .02. These
effects were qualified by a significant Race X Condition
interaction, F(1, 39) = 4.1, p < .01. For participants' ratings of their
ability to focus, a similar result emerged: main effects of race,
F(1, 39) = 7.26, p < .02, and condition, F(1, 39) = 10.67, p < .01,
and a significant qualifying interaction, F(1, 39) = 5.73,
p < .03. And finally, the same pattern of effects emerged for
participants' ratings of how tricky or unfair they generally find
standardized tests to be: a race main effect, F(1, 39) = 13.24, p
< .001, a condition main effect, F(1, 39) = 13.42, p < .001,
and a marginally significant, qualifying interaction, F(1, 39) =
3.58, p < .07. No significant effects emerged on participants'
ratings of their current stress.
Discussion
We had assumed that presenting an intellectual test as diag-
nostic of ability would arouse a sense of stereotype threat in
Black participants. The present results dramatically support
this assumption. Compared to participants in the other condi-
tions—that is, Blacks in the nondiagnostic condition and
Whites in either condition—Black participants expecting to
take a difficult, ability-diagnostic test showed significantly
greater cognitive activation of stereotypes about Blacks, greater
cognitive activation of concerns about their ability, a greater
tendency to avoid racially stereotypic preferences, a greater
tendency to make advance excuses for their performance, and
finally, a greater reluctance to have their racial identity linked to
their performance even in the pedestrian way of recording it on
their questionnaires. Clearly the diagnostic instructions caused
these participants to experience a strong apprehension, a dis-
tinct sense of stereotype threat.
Table 1
Self-Handicapping Responses in Study 3
                      Diagnostic                        Nondiagnostic
Measure               Blacks (n = 12)  Whites (n = 11)  Blacks (n = 11)  Whites (n = 10)
Hours of sleep
Ability to focus