Does Stereotype Threat Affect Test Performance of Minorities and
Women? A Meta-Analysis of Experimental Evidence
Hannah-Hanh D. Nguyen
California State University, Long Beach
Ann Marie Ryan
Michigan State University
A meta-analysis of stereotype threat effects was conducted and an overall mean effect size of |.26| was
found, but true moderator effects existed. A series of hierarchical moderator analyses evidenced
differential effects of race- versus gender-based stereotypes. Women experienced smaller performance
decrements than did minorities when tests were difficult: mean ds = |.36| and |.43|, respectively. For
women, subtle threat-activating cues produced the largest effect, followed by moderately explicit and
blatant cues: ds = |.24|, |.18|, and |.17|, respectively; explicit threat-removal strategies were more
effective in reducing stereotype threat effects than subtle ones: ds = |.14| and |.33|, respectively. For
minorities, moderately explicit stereotype threat-activating cues produced the largest effect, followed by
blatant and subtle cues: ds = |.64|, |.41|, and |.22|, respectively; explicit removal strategies enhanced
stereotype threat effects compared with subtle strategies: ds = |.80| and |.34|, respectively. In addition,
stereotype threat affected moderately math-identified women more severely than highly math-identified
women: ds = |.52| and |.29|, respectively; low math-identified women suffered the least from stereotype
threat: d = |.11|. Theoretical and practical implications of these findings are discussed.
Keywords: stereotype threat effects, meta-analysis, cognitive ability test performance, gender gap in math scores, racial gap in test scores

Hannah-Hanh D. Nguyen, Department of Psychology, California State University, Long Beach; Ann Marie Ryan, Department of Psychology, Michigan State University.
This article was partly based on Hannah-Hanh D. Nguyen's dissertation research under Ann Marie Ryan's supervision. The meta-analysis was supported with Hannah-Hanh D. Nguyen's National Science Foundation Graduate Research Fellowship and Michigan State University Competitive Doctoral Enrichment Fellowship.
We thank the researchers who generously shared their work and additional statistical data, especially Gregory Walton. We also thank Irene Sze and Emily Harris for their coding assistance, Huy A. Le for his meta-analytic advice, and Linda Jackson, Neal Schmitt, and Remus Ilies for their committee guidance.
Correspondence concerning this article should be addressed to Hannah-Hanh D. Nguyen, Department of Psychology Room 319, California State University, Long Beach, 1250 Bellflower Boulevard, Long Beach, CA 90840. E-mail: hnguyen@csulb.edu
Journal of Applied Psychology, 2008, Vol. 93, No. 6, 1314-1334. Copyright 2008 by the American Psychological Association. 0021-9010/08/$12.00 DOI: 10.1037/a0012702
Since Steele and Aronson’s (1995) seminal experiments, the
research literature on stereotype threat effects on test performance
has steadily grown. According to the theory, stereotype threat
refers to the “predicament” in which members of a social group
(e.g., African Americans, women) “must deal with the possibility
of being judged or treated stereotypically, or of doing something
that would confirm the stereotype” (p. 401; Steele & Aronson,
1998). For instance, when stereotyped group members take stan-
dardized ability tests, such as in educational admission or employ-
ment selection contexts, their performance may be partially under-
mined when they encounter cues of a salient negative stereotype in
the testing environment (e.g., women are not good at math, or
ethnic minorities are inferior in intellectual abilities; Steele, Spen-
cer, & Aronson, 2002).
The social message that the theory of stereotype threat conveys
is powerful: Members of stigmatized social groups may be con-
stantly at risk of underperformance in testing contexts, and the
risks may be partially caused by situational factors (i.e., beyond
established factors such as poverty, parental style, socialization,
etc.; Steele, 1997). For a stigmatized social group, we adopt the
widely used definition of Crocker and Major (1989): belonging to
a social category about which others hold negative attitudes, ste-
reotypes, and beliefs. According to Devos and Banaji (2003), the
contribution of the stereotype threat theory and literature is that it
predicts (and empirically tests) the relationship between negative
in-group stereotypes and group members’ behavioral changes
(e.g., diminished task or test performance in a stereotyped evalu-
ative domain), not only attitudinal or affective changes. The
present meta-analytic study aims at investigating the extent to
which the activation of stereotype threat is detrimental to stereo-
typed test takers’ performance on cognitive ability tests by aggre-
gating the findings in stereotype threat experiments. In our pre-
sentation, we discuss how the laboratory conditions may or may
not generalize to employment testing contexts, and hence whether
or not any stereotype threat effects produced in these contexts
might occur in field settings of interest to organizational psychol-
ogists.
Stereotype Threat: Research Paradigm and Empirical
Evidence
The primary hypothesis of stereotype threat theory is perfor-
mance interference, or the prediction that stereotyped individuals
perform worse on an evaluative task (e.g., African Americans
taking a verbal ability test or women taking a mathematics test) in
a stereotype-threatening context than they would in a nonthreatening condition (see Steele, 1997; Steele et al., 2002). The basic
experimental paradigm involves randomly assigning members of a
stereotyped group to a control or threat condition and comparing
mean performance of the conditions. Researchers sometimes also
include a comparison group to whom the induced negative stereo-
type is not relevant (e.g., Whites, men). In a seminal experiment,
Steele and Aronson (1995; Experiment 1) assigned African Amer-
ican students to one of three conditions of stereotype threat where
they were administered a difficult verbal ability test (i.e., only 30%
of a pretest sample correctly solved the test). In the stereotype
threat condition, participants were told that the test was diagnostic
of their intellectual capability; in the other conditions, the test was
either described as a problem-solving task or no particular direc-
tions were given. Participants in the stereotype threat condition
correctly solved fewer test problems compared with those in other
conditions, supporting the performance interference hypothesis.
A majority of subsequent researchers replicated and extended
the stereotype threat effect on cognitive ability tests for African
American or Hispanic test takers (e.g., Cadinu, Maass, Frigerio,
Impagliazzo, & Latinotti, 2003; Dodge, Williams, & Blanton,
2001) and on difficult math ability tests for female test takers (e.g.,
Davies, Spencer, Quinn, & Gerhardstein, 2002; Schmader &
Johns, 2003; S. J. Spencer, Steele, & Quinn, 1999). However,
some researchers did not find support for the performance inter-
ference hypothesis (e.g., McFarland, Kemp, Viera, & Odin, 2003;
Oswald & Harvey, 2000–2001; Schneeberger & Williams, 2003;
Stricker & Ward, 2004). The mixed findings suggest moderating
effects for stereotype threat. Stereotype threat theory does propose
three moderators—stereotype relevance, domain identification,
and test difficulty—which are described in the next section and
investigated in this study.
Conceptual Moderators
Relevance of Stereotypes
Steele (1997) posited that the degree of stereotype threat effects
may vary depending on the relevance of a stereotype in the test
setting. For example, Shih, Pittinsky, and Ambady (1999) found
that, when encountering a stereotype threat-loaded situation (i.e.,
taking a math test), Asian American women tended to solve more
correct problems when the situational cues about their “Asian”
identity were accessible than when their “woman” identity was
made salient (mean differences were not statistically significant,
however). Researchers can activate stereotype threat by manipu-
lating the degree to which a stereotype is salient to individuals in
an evaluative testing context by either cuing test takers to the link
between a stereotype and a particular evaluative test (i.e., implicit
stereotype threat activation) or by declaring that members of a
social group tend to perform worse on the test than a comparison
group (i.e., explicit stereotype threat activation; see definitions and
examples in Table 1).
Stereotype theories in general predict that more implicit threat
cues would have a stronger negative effect on task performance
than explicit ones (see Bargh, 1997). Explicit stereotype threat
cues have been found to cause target test takers to overperform on
a test or task (e.g., Kray, Thompson, & Galinsky, 2001; McFar-
land, Kemp, et al., 2003), a phenomenon called a stereotype
reactance effect. According to Kray et al. (2001), when a negative
stereotype is blatantly and explicitly activated, it might be per-
ceived by test takers as a limit to their freedom and ability to
perform, thereby ironically invoking behaviors that are inconsis-
tent with the stereotype.
The implication is that a nonlinear relationship possibly exists
between stereotype threat-activating cues and performance. Stud-
ies using subtle stereotype threat-activating cues (i.e., via manip-
ulating the testing environment) might produce a larger effect size
than those using blatant cues (i.e., via spelling out a stereotype to
target test takers) because the stereotype might work on a subcon-
scious level and directly affect targets’ test performance (see Levy,
1996). Also, threatened individuals might consciously react
against a blatant stereotype. Studies using moderately explicit cues
might actually yield the greatest stereotype threat effects: When a
general message of subgroup differences in intellectual abilities is
explicitly conveyed to target test takers but the direction of these
differences is not specified and is instead left open for test takers’
interpretation, the stereotype might be direct enough to draw
targets’ attention, ambiguous enough to cause targets to engage in
detrimental off-task thinking (e.g., trying to figure out how the
message should be interpreted), but not so blatant as to motivate some
targets to “prove it wrong.” This distinction is
particularly important to consider if one wishes to determine if
laboratory stereotype threat effects generalize to employment test-
ing settings, where blatant or even moderately explicit cues are
unlikely to be present. If effects are not found with more subtle
cues, then one might question the applicability of this line of
research to employment testing contexts.
The relevance between a negative stereotype and a test can also
be refuted or removed to reduce observed stereotype threat effects,
either implicitly (e.g., by framing a test as a nondiagnostic task) or
explicitly (e.g., by disputing said group differences in test perfor-
mance; see examples in Table 2). Explicit stereotype threat-
removal strategies may serve as a catalyst to motivate individuals
to avoid being stereotyped; this motivation can in turn inhibit
negative stereotypes by shaping activated thoughts into actions
toward their goals (see S. J. Spencer, Fein, Strahan, & Zanna,
2005). In other words, explicitly making a stereotype less relevant
to a test context might alleviate stereotype threat effects more
effectively than implicit threat-removal strategies. This distinction
is also an important one to examine in determining the viability of
generalizing stereotype threat lab research to employment testing
contexts: Hiring organizations are unlikely to enact the more
explicit threat-removal strategies used in this line of research.
In this meta-analysis, we examine the type of stereotype acti-
vation cue—subtle, moderately explicit, or blatant—and the type
of threat-removal strategy (implicit vs. explicit) as potential mod-
erators.
Domain Identification
Stereotype threat theory proposes that only those who strongly
identify themselves with a domain with which there is a negative
group stereotype are susceptible to the threat of confirming the
group-based stigma because the strength of stereotype threat ef-
fects depends on “the degree to which one’s self-regard, or some
component of it, depends on the outcomes one experiences in the
domain” (p. 390; Steele et al., 2002). For example, only women
who identify with math would experience stereotype threat while
taking a math test (Cadinu et al., 2003, Study 1). That is, negative
stereotypes will not be threatening to individuals who do not care
about performing well in an area, as success in that domain plays
little role in their identity. Surprisingly, only a few studies directly
tested domain identification as a moderator of stereotype threat
effects, and the results were mixed (Aronson et al., 1999; Cadinu
et al., 2003; Leyens, Desert, Croizet, & Darcis, 2000; McFarland,
Lev-Arey, & Ziegert, 2003). In this meta-analysis, we examined
whether levels of individuals’ domain identification might influ-
ence the magnitude of stereotype threat effects.
Test Difficulty
Stereotype threat theory suggests that members of a stigmatized
social group are most likely to be threatened by a situational
stereotype threat cue when a test is challenging. Because the
cognitive demands of a difficult test will increase individuals’
mental workload, interference from a stereotype will be cogni-
tively more problematic when a test is challenging than when a test
does not require as much from the test takers’ resources (Steele &
Aronson, 1995; Steele et al., 2002). Some researchers selected
highly difficult intellectual ability tests to investigate stereotype
threat effects (e.g., Croizet et al., 2004; Gonzales, Blanton, &
Williams, 2002; Inzlicht & Ben-Zeev, 2003; McIntyre et al., 2003;
Schmader, 2002; Steele & Aronson, 1995), whereas other re-
searchers used moderately difficult tests (e.g., Dodge et al., 2001;
McKay, Doverspike, Bowen-Hilton, & Martin, 2002; J. L. Smith
& White, 2002; Stricker & Ward, 2004). The empirical evidence
for the moderating effects of test difficulty is mixed (O’Brien &
Crandall, 2003; S. J. Spencer et al., 1999; Stricker & Bejar, 2004).
Examining this moderator is also of importance to understanding
the potential generalizability of stereotype threat effects from laboratory to employment testing contexts, as employers often seek to use at least moderately difficult tests to increase selectivity.

Table 1
Stereotype Threat-Activating Cues

Blatant
Operational definition: The message involving a stereotype about a subgroup's inferiority in cognitive ability and/or ability performance is explicitly conveyed to test takers prior to their taking a cognitive ability test. The group-based negative stereotype becomes salient to test takers via a conscious mechanism.
Cues: Emphasizing the target subgroup's inferiority on tests (or the comparison subgroup's superiority), for example, stating that Whites tend to perform better than Blacks/Hispanics or that men tend to score higher than women; priming targets' group-based inferiority, for example, administering a stereotype threat questionnaire before tests or giving information favoring males before tests.
Example studies: Aronson et al. (1999); Cadinu et al. (2003); Schneeberger & Williams (2003); Bailey (2004); Seagal (2001).

Moderately explicit
Operational definition: The message of subgroup differences in cognitive ability and/or ability performance is conveyed directly to test takers in test directions or via the test-taking context, but the direction of these group differences is left open for test takers' interpretation. The group-based negative stereotype may become salient to test takers via a conscious mechanism.
Cues: Race/gender performance differences in general ability tests, for example, stating that generally men and women perform differently on standardized math tests; race/gender performance differences on the specific test, for example, stating that taking a specific math test produces gender differences, testing minorities' math ability on a White-normed or biased test, or stating that certain groups of people perform better than others on math exams.
Example studies: R. P. Brown & Pinel (2003); Edwards (2004); H. E. S. Rosenthal & Crisp (2006); Keller & Dauenheimer (2003); Pellegrini (2005); Tagler (2003).

Indirect and subtle
Operational definition: The message of subgroup differences in cognitive ability is not directly conveyed; instead, the context of tests, test takers' subgroup membership, or test-taking experience is manipulated. The group-based negative stereotype may become salient to test takers via an automatic and/or subconscious mechanism.
Cues: Race/gender priming, for example, making a race/gender inquiry prior to tests or race/gender priming by other means (e.g., a pretest questionnaire, a pretest task, a testing environment cue); emphasizing the test's diagnostic purpose, for example, labeling the test as a diagnostic test or stressing the evaluative nature of the test.
Example studies: Anderson (2001); Dinella (2004); Oswald & Harvey (2000-2001); Schmader & Johns (2003); Spicer (1999); Steele & Aronson (1995); Martin (2004); Marx & Stapel (2006); Ployhart et al. (2003); Prather (2005).

Note. In Walton and Cohen's (2003) meta-analysis, only two levels of classification were employed.

Table 2
Stereotype Threat-Removal Strategies

Explicit
Give a handout with information favoring women (Bailey, 2004)
State that a math test is free of gender bias (men = women) (R. P. Brown & Pinel, 2003)
State that Blacks perform better than Whites (Cadinu et al., 2003)
Educate subjects about the stereotype threat phenomenon (Guajardo, 2005)

Subtle
Describe a test as a problem-solving task, with no race inquiry before the task (Steele & Aronson, 1995)
State that test performance will not be assessed (Wout et al., n.d.)
Show television commercials with women in astereotypical roles, e.g., engineers (Davies et al., 2002)
Type of Stereotype
The theory of stereotype threat alludes to a universal reaction to
stereotype threat by members of any stigmatized social group,
implying that findings are generalizable from one stigmatized
social group to another (see Steele et al., 2002). In this meta-
analysis, we tested the viability of this assumption of universal
effects by examining whether the activation of stereotype threat
might differentially affect women and ethnic minority test takers in
testing contexts. Stereotype relevance might differ because in the
United States, where group differences are publicly acknowledged
and discussed, advances in employment and higher education for
minorities are affected by high-stakes testing (Sackett, Schmitt,
Ellingson, & Kabin, 2001). However, the effect of math test
performance on women’s advancement opportunities is not likely
to be as great, given that career opportunities that many women
choose are not affected by math scores (see Halpern et al., 2007,
for a review). Hence, a race-based stereotype regarding test per-
formance might be more salient to test takers, or might lead to
stronger reactions, than a gender-based math stereotype.
Prior Meta-Analysis
Although not their primary focus, Walton and Cohen (2003)
conducted a meta-analysis on stereotype threat effects experienced
by members of several stigmatized groups (e.g., ethnic minorities,
women, older adults, individuals of lower socioeconomic status).
They found a small overall effect size (mean d = |.29|, k = 43),
which was moderated by stereotype relevance and domain identi-
fication. Walton and Cohen did not examine test difficulty as a
moderator but opted instead to examine only studies that used a
difficult test or performance situation. Although Walton and Co-
hen’s results on stereotype threat effects are informative, there is
room for improvement on their approach. We replicated and ex-
tended their work in five ways: (a) examining test difficulty as a
moderator, (b) examining differential effects of different group-
based stereotypes, (c) using a more fine-grained classification in
examining stereotype-activating cues as a moderator, (d) consid-
ering nonindependent data points in a more appropriate manner,
and (e) including substantially more studies.
First, Walton and Cohen (2003) reported significant heteroge-
neity tests of the observed effect sizes for all meta-analytic find-
ings, meaning that there were other uninvestigated moderators that
may further explain the variance in their findings. Therefore, we
extended Walton and Cohen’s work to include test difficulty and
type of stereotype as potential moderators. Second, Walton and
Cohen meta-analytically cumulated effect sizes from studies with
various stereotype threats based on race, gender, age, or socioeco-
nomic status and with various outcome measures. In this meta-
analysis, we tested the viability of a universal stereotype threat
reaction, considering potential differential effects of two types of
stereotypes (race and gender) related to test performance.
Third, although Walton and Cohen (2003) did examine whether
subtle versus explicit ways of activating and/or attempting to
remove stereotype threat manipulations produced differing results,
we extended their work by using a more concretely defined and
detailed categorization for both threat activation and threat re-
moval to better address generalizability issues to employment
contexts. Fourth, Walton and Cohen’s treatment of nonindepen-
dent data points was nonstandard: Studies in the data set that
yielded hundreds of nonindependent data points each (i.e., identi-
cal or overlapping samples on multiple dependent measures; e.g.,
Stricker, 1998; Stricker & Ward, 1998) were given a weight of 0.5
in the effect size computation, but the reasons and/or implications
of such a treatment in regard to the variance estimation of effect
sizes (see Hunter & Schmidt, 1990) were neither explained nor
discussed. Finally, Walton and Cohen used a small data set of
experiments (k = 43) and the literature has grown substantially
since their study. Less than one half of the studies the researchers
meta-analyzed are related to the two group-based stereotypes of
interest in employment settings, and many additional studies were
included in the present meta-analysis.
In summary, to address these five extensions to Walton and
Cohen’s (2003) work, we conducted a hierarchical moderator
meta-analysis, with each of the stated conceptual moderators—test
difficulty, domain identification, and stereotype threat relevance
(i.e., activation cues and removal strategies)—meta-analyzed
across group-based stereotypes (i.e., race-based vs. gender-based).
Furthermore, the primary focus of Walton and Cohen’s meta-
analysis was that the variance in observed between-subgroups
mean test score differences (e.g., men vs. women, Black/Hispanic
vs. White) might be partially accounted for by the debilitated
performance of the target group members and partially by the
comparison group’s performance boost (stereotype lift). In this
study, we directly examined the estimates of such between-group
differences in test performance, considering potential differential
effects for different stereotypes.
Method
Literature Search
We conducted a bibliographic search of electronic databases
such as PsycINFO and PROQUEST using the combined keywords
of stereotype and threat as search parameters for journal articles
and dissertation abstracts dated between 1995 (the publication year
of the seminal article by Steele and Aronson) and April 2006. A
manual search was conducted by reviewing the reference lists of
key articles to find additional citations of unpublished articles. The
internet search engines of Google and Google Scholar were used to
search for unpublished empirical articles of interest and/or for
self-identified stereotype threat researchers. We sent the identified
researchers with available e-mail addresses a “cold” e-mail, re-
questing manuscripts and/or working papers. We also posted the
same request on various psychology list-servs. Furthermore, sev-
eral prominent researchers in the stereotype threat area were con-
tacted for unpublished manuscripts, in-press papers, as well as for
other additional sources of research data on stereotype threat
effects on cognitive ability test performance.
Inclusion Criteria
To be included in the data set, a research report first had to be
an experiment designed to test Steele and Aronson’s (1995)
within-subgroup performance interference hypothesis regarding
stereotyped minorities’ or women’s cognitive ability test perfor-
mance (quantitative, verbal, analytic, and/or nonverbal intelli-
gence). Empirical studies that drew inferences from the theory of
stereotype threat but were correlational or based on a different
research framework were excluded (e.g., Ben-Zeev, Fein, & Inz-
licht, 2005; Chung-Herrera, Ehrhart, Ehrhart, Hattrup, & Solamon,
2005; Cullen, Hardison, & Sackett, 2004; Good, Aronson, &
Inzlicht, 2003; Inzlicht & Ben-Zeev, 2000, 2003; Osborne, 2001a,
2001b; Roberson, Deitch, Brief, & Block, 2003).
Second, a report had to operationalize test performance as the
number of correct answers. (For studies that used a different index
of performance, such as a ratio of correct answers to attempted
problems, we converted these indexes from available reported
information or contacted study authors for the information.) Third,
an article had to yield precise statistics that were convertible to a
weighted effect size d (e.g., mean test performance differences
between women in a stereotype threat condition and those in a
control condition). Finally, a report had to be written in English or
could be translated into English.
Summary of the Meta-Analytic Data Set
The literature search initially identified a total of 151 published
and unpublished empirical reports on stereotype threat effects. Of
these reports, 75 were excluded because they did not meet one or
more inclusion criteria.1
The remaining 76 reports contained 116 primary studies; 67 of
which were from published peer-reviewed articles, and 65 of
which included a comparison sample (e.g., Whites or men). The
study database yielded a total of 8,277 data points from stereo-
typed groups and a total of 6,789 data points from comparison
groups. Table 3 presents an overview of the characteristics of
studies included in the full data set. Note that of the 43 primary
studies in Walton and Cohen’s (2003) meta-analysis, there were
only 24 studies overlapping with our data set. In other words,
about 79% of our data set (or k = 92) was nonoverlapping with that
in Walton and Cohen’s meta-analysis.
Treatment of Independent Data Points
When an article consisted of multiple single studies, we treated
each study as an independent source of effect size estimates. When
a single study included a fully replicated design across demo-
graphic subgroups (i.e., a conceptually equivalent but statistically
independent design), we treated the data as if they were values
from different studies. For example, means of cognitive ability test
scores from all ethnic subgroups (e.g., Hispanic Americans and
African Americans; Sawyer & Hollis-Sawyer, 2005) were statis-
tically independent.
Treatment of Nonindependent Data Points
To be sensitive to potential problems caused by nonindependent
data, for the eight studies with multiple measures of cognitive
ability in our data set, we used only one independent estimate of
effect size per study (i.e., an average effect size across cognitive
ability tests for all subsamples per study). Nonindependent data
points also occurred when the design of an experiment allowed for
multiple effect size estimates to be computed across study condi-
tions. For example, the research design of 14 studies in our data set
consisted of one stereotype threat-activated condition and at least
two or more stereotype threat-removed conditions and vice versa,
resulting in multiple mean effect estimates per study. Following
Webb and Sheeran’s (2006) practice, we used the largest mean
effect size estimate.
Treatment of Studies With a Control Condition
A nonexperimental (control) condition was defined as when a cognitive ability test was administered to test takers without any special directions. When a study design consisted of two conditions of stereotype threat manipulation (i.e., stereotype threat-activation, or STA, vs. stereotype threat-removal, or STR), the study contributed one effect size, dSTA-STR, to the data set. When a study design consisted of STA and control conditions, the study contributed one effect size, dSTA-Control. When all three conditions (STA, STR, control) were present in a study, the effect size dSTA-STR was chosen to be cumulated. Although this approach might result in an upward bias in interpreting the magnitude of stereotype threat effects across studies (i.e., an estimate of dSTA-STR might be larger than that of dSTA-Control), we erred on the side of optimizing the probability of detecting stereotype threat effects and supporting the theory tenets, given the important social implications of stereotype threat.
Treatment of Studies With Stereotype Threat ×
Moderator Designs
For primary studies employing either the design of Stereotype
Threat × Domain Identification or Stereotype Threat × Test
Difficulty (e.g., Anderson, 2001), we split these studies into two or
three independent subsamples according to the levels of domain
identification or test difficulty as defined by the researchers them-
selves. Each subsample contributed an independent estimate of
effect size to the database. For studies with a Stereotype Threat ×
Nontarget Moderator design (i.e., a moderating factor not investi-
gated in the present meta-analysis), we gathered relevant statistical
information across the stereotype threat conditions only.2
Treatment of Studies Where Gender Was Nested in Race
Schmader and Johns (2003, Study 2) and Stricker and Ward
(2004, Study 2) conducted studies where test takers’ gender was
nested within race/ethnicity subgroups (i.e., White men/women vs.
Latinos/Latinas vs. African American men/women). Because these
1The list of all excluded studies and reasons for exclusion is available
from Hannah-Hanh D. Nguyen upon request.
2An exception was Keller’s (2007) experiment, involving both domain
identification and test difficulty. This study was coded as five separate
substudies: two studies across levels of domain identification and three
studies across levels of test difficulty. However, to avoid a violation of the
independent error variance assumption, we cumulated only the estimates
from Stereotype Threat × Test Difficulty studies for the overall mean
effect size (because the domain identification subsets were nested within
the subsets of test difficulty studies; Hunter & Schmidt, 1990). Further,
each set of substudies across moderator levels contributed estimates to
respective moderator analyses of domain identification or of test difficulty.
Table 3
Overview of the Meta-Analysis Database: Characteristics of Included Studies (K = 116)
Columns: Study no. | Author | Study | Status | Stereotyped group | Sample size | Effect size | Comparison group included? | DI preselected?
1
2
3
4
5
6
7
8
Ambady et al. (2004)
Ambady et al. (2004)
Anderson (2001)
Aronson et al. (1999)
Aronson et al. (1999)
Aronson et al. (1999)
Bailey (2004)
J. L. Brown et al. (n.d.)
1 of 2
2 of 2
1 of 1
1 of 2
2 of 2
2B of 2
1 of 1
2 of 2
Published
Published
Unpublished
Published
Published
Published
Unpublished
Unpublished
Female undergrads
Female undergrads
Female undergrads
White undergrads
White undergrads
White undergrads
Female undergrads
African American
undergrads
African American
undergrads
Female undergrads
20
20
?0.57
?0.67
?0.96
?1.46
?0.99
?2.74
?0.09
?0.62
No
No
Yes
No
No
No
Yes
Yes
No
No
No
Yes
Yes
Yes
No
No
604
23
26
23
44
28
9R. P. Brown & Day (2006) 1 of 1 Published 340.38 YesNo
10 R. P. Brown & Josephs
(1999)
R. P. Brown & Josephs
(1999)
R. P. Brown & Pinel (2003)
Cadinu et al. (2003)
1 of 3 Published65
?0.09 YesNo
11 2 of 3PublishedFemale undergrads 35
?1.17Yes No
12
13
1 of 1
1 of 2
Published
Published
Female undergrads
Female undergrads
(Italian)
Female undergrads
(Italian)
African American
soldiers
Female undergrads
(Italian)
African American
undergrads
Female undergrads
African American
undergrads
Female undergrads
Female undergrads
Female high school
students
African American
undergrads
Female undergrads and
graduates
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female undergrads
Female secondary school
students (German)
Female secondary school
students (German)
Female secondary school
students (German)
Female secondary school
students (German)
Female undergrads
(German)
Female secondary school
students (German)
46
25
?0.53
?0.10
No
No
Yes
Yes
14Cadinu et al. (2003) 1B of 2Published 380.023 No Yes
15 Cadinu et al. (2003) 2 of 2Published 50
?0.19No No
16 Cadinu et al. (2005)1 of 1Published60 0.015 NoNo
17G. L. Cohen & Garcia (2005) 2 of 3Published 410.74 NoNo
18
19
Cotting (2003)
Cotting (2003)
1 of 1
1B of 1
Unpublished
Unpublished
51
55
?0.53
0.44
No
No
No
No
20
21
22
Davies et al. (2002)
Davies et al. (2002)
Dinella (2004)
1 of 2
2 of 2
1 of 1
Published
Published
Unpublished
25
34
?0.71
0.27
0.11
Yes
Yes
Yes
Yes
Yes
No232
23 Dodge et al. (2001)1 of 1Unpublished 93 0.045Yes No
24 Edwards (2004)1 of 1Unpublished79
?0.78 NoNo
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Elizaga & Markman (n.d.)
Foels (1998)
Foels (1998)
Foels (2000)
Ford et al. (2004)
Gamet (2004)
Gresky et al. (n.d.)
Gresky et al. (n.d.)
Guajardo (2005)
Guajardo (2005)
Harder (1999)
Harder (1999)
Johns et al. (2005)
Josephs et al. (2003)
Keller (2002)
1 of 1
1 of 1
1B of 1
1 of 1
2 of 2
1 of 1
1 of 1
1B of 1
1 of 2
2 of 2
1 of 2 (pilot)
2 of 2
1 of 1
1 of 1
1 of 1
Unpublished
Unpublished
Unpublished
Unpublished
Published
Unpublished
Unpublished
Unpublished
Unpublished
Unpublished
Unpublished
Unpublished
Published
Published
Published
145
33
32
71
31
51
23
37
56
30
36
19
46
39
37
?0.38
?0.78
?0.3
?0.7
?1.7
?1.51
?0.32
?0.45
0.03
?0.52
?0.66
?0.04
0.27
?0.79
?0.62
No
No
No
Yes
No
No
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
No
No
No
No
No
No
Yes
Yes
No
No
No
Yes
No
Yes
No
40Keller (2007) 1 of 1Published 19
?0.10Yes No
41Keller (2007) 1B of 1Published18
?0.95Yes No
42Keller (2007) 1C of 1Published 18
?1.11 YesNo
43Keller & Bless (n.d.) 2 of 3Unpublished 66
?0.57NoNo
44
Keller & Dauenheimer
(2003)
1 of 1Published 33
?0.38Yes No
(table continues)
45 Lewis (1998) 1 of 1 Unpublished African American
undergrads
Female undergrads
Female undergrads
African American
undergrads
African American
undergrads
Female undergrads
(Dutch)
Female undergrads
(Dutch)
Female undergrads
(Dutch)
Female undergrads
(Dutch)
Female undergrads
(Dutch)
Female undergrads
(Dutch)
Female undergrads
71
?0.12YesNo
46
47
48
Martens et al. (2006)
Martens et al. (2006)
Martin (2004)
1 of 2
2 of 2
2 of 2
Published
Published
Unpublished
22
38
?0.67
?0.11
0.54
No
Yes
No
Yes
No
No 100
49Martin (2004) 2B of 2Unpublished 102
?0.93 NoNo
50 Marx & Stapel (2005)1 of 1 Published 48
?1.07 Yes No
51 Marx & Stapel (2006)1 of 3Published 29
?1.22Yes No
52 Marx & Stapel (2006)3 of 3 Published28
?1.24 Yes No
53Marx et al. (2005) 3 of 4Published27 0.56 No No
54Marx et al. (2005) 3B of 4Published25
?0.16 No No
55Marx et al. (2005) 4 of 4Published 25
?0.14No No
56McFarland, Kemp, et al.
(2003)
McFarland, Lev-Arey, &
Ziegert (2003)
McIntyre et al. (2003)
McIntyre et al. (2003)
McIntyre et al. (2005)
McKay (1999)
1 of 1Unpublished 126
?0.035No No
571 of 1 PublishedAfrican American
undergrads
Female undergrads
Female undergrads
Female undergrads
African American
undergrads
African American
undergrads
Female undergrads
Female undergrads
Female undergrads
50
?0.22 Yes No
58
59
60
61
1 of 2
2 of 2
1 of 1
1 of 1
Published
Published
Published
Unpublished
116
74
81
103
?0.52
?0.49
?0.98
0.91
Yes
Yes
Yes
Yes
No
No
No
No
62 Nguyen et al. (2003)1 of 1 Published 800.05Yes No
63
64
65
Nguyen et al. (2004)
O’Brien & Crandall (2003)
Oswald & Harvey (2000–
2001)
Pellegrini (2005)
1 of 1
1 of 1
1 of 1
Unpublished
Published
Published
114
58
34
0.057
?0.305
?0.06
Yes
Yes
No
No
No
No
661 of 1 UnpublishedHispanic undergrads
(female)
Female undergrads
African American
undergrads
African American
undergrads
Female undergrads
Latino high school
students
Female undergrads
(British)
Female undergrads
(British)
Female undergrads
(British)
Mexican American
undergrads
Mexican American
undergrads
African American
undergrads
Hispanic undergrads
60
?1.03 No No
67
68
Philipp & Harton (2004)
Ployhart et al. (2003)
1 of 1
1 of 1
Unpublished
Published
38
48
?1.21
?0.59
Yes
Yes
No
No
69Ployhart et al. (2003)1B of 1Published 48
?0.57 Yes No
70
71
Prather (2005)
Rivadeneyra (2001)
1 of 1
1 of 1
Unpublished
Unpublished
114
116
?0.67
?0.96
No
No
No
No
72H. E. S. Rosenthal & Crisp
(2006)
H. E. S. Rosenthal & Crisp
(2006)
H. E. S. Rosenthal & Crisp
(2006)
Salinas (1998)
2 of 3 Published 24
?1.46 NoNo
733 of 3 Published 29
?0.99 No No
743B of 3Published 27
?2.74No No
75 1 of 2 Unpublished 27
?0.09 Yes No
76
Salinas (1998)
2 of 2Unpublished 56 0.38 Yes No
77Sawyer & Hollis-Sawyer
(2005)
Sawyer & Hollis-Sawyer
(2005)
Schimel et al. (2004)
Schmader (2002)
Schmader & Johns (2003)
Schmader & Johns (2003)
1 of 1Published 66
?0.09Yes No
781B of 1Published 47
?1.17 YesNo
79
80
81
82
2 of 3
1 of 1
1 of 3
2 of 3
Published
Published
Published
Published
Female undergrads
Female undergrads
Female undergrads
Latino American
undergrads
46
32
28
33
?0.53
?0.19
?0.62
?0.10
No
Yes
Yes
Yes
Yes
No
Yes
No
83
84
85
Schmader & Johns (2003)
Schmader et al. (2004)
Schneeberger & Williams
(2003)
Schultz et al. (n.d.)
3 of 3
2 of 2
1 of 1
Published
Published
Unpublished
Female undergrads
Female undergrads
Female undergrads
28
68
61
0.023
0.015
0.74
No
No
Yes
Yes
No
No
86 1 of 2 UnpublishedHispanic American
undergrads
Hispanic American
undergrads
African American and
Latino undergrads
Female undergrads
44
?0.533YesNo
87
Schultz et al. (n.d.)
2 of 2 Unpublished400.44 Yes No
88Seagal (2001)6 of 6 Unpublished101
?0.71YesNo
89
Sekaquaptewa & Thompson
(2002)
C. E. Smith & Hopkins
(2004)
J. L. Smith & White (2002)
J. L. Smith & White (2002)
S. J. Spencer et al. (1999)
S. L. Spencer (2005)
Spicer (1999)
1 of 1 Published80 0.27YesNo
901 of 1Published African American
undergrads
White undergrads (male)
Female undergrads
Female undergrads
Female undergrads
African American
undergrads
African American
undergrads
African American
undergrads
African American
undergrads
African American
undergrads
Female high school
students
Female high school
students
African American high
school students
Female high school
students
African American
undergrads
Female undergrads
Female undergrads
(Dutch)
Female undergrads
(Dutch)
White undergrads
(Australian)
Female undergrads
(Canadian)
African American
undergrads
Minority high school
students (Dutch)
Female undergrads
(Dutch)
African American
undergrads
African American
undergrads
African American
undergrads
African American
undergrads
160 0.11No No
91
92
93
94
95
1 of 2
2 of 2
2 of 3
1 of 1
2 of 2
Published
Published
Published
Unpublished
Unpublished
47
23
30
40
39
0.045
?0.78
?0.78
?0.38
?0.3
No
No
Yes
No
No
No
No
Yes
No
Yes
96Spicer (1999) 2B of 2 Unpublished 39
?0.7NoYes
97
Steele & Aronson (1995)
1 of 4 Published38
?1.7 YesNo
98
Steele & Aronson (1995)
2 of 4Published 20
?1.51 YesNo
99
Steele & Aronson (1995)
4 of 4 Published 22
?0.32 Yes No
100
Sternberg et al. (n.d.)
1 of 2Unpublished 27
?0.45Yes No
101
Sternberg et al. (n.d.)
2 of 2 Unpublished96 0.03Yes No
102
Stricker & Ward (2004)
1 of 2 Published122
?0.52Yes No
103
Stricker & Ward (2004)
1B of 2 Published730
?0.66Yes No
104
Stricker & Ward (2004)
2 of 2Published 468
?0.04 YesNo
105
106
Tagler (2003)
van Dijk et al. (n.d.)
1 of 1
1 of 1
Unpublished
Unpublished
136
38
0.27
?0.79
Yes
Yes
No
No
107van Dijk et al. (n.d.) 1B of 1Unpublished 38
?0.57 Yes No
108 von Hippel et al. (2005) 4 of 4 Published 56
?0.38 No No
109Walsh et al. (1999) 2 of 2Published 96
?0.62YesNo
110Walters (2000) 1 of 2 Unpublished49
?0.10No Yes
111Wicherts et al. (2005) 1 of 3 Published 138
?0.95YesNo
112 Wicherts et al. (2005)3 of 3 Published95
?1.11 Yes No
113Wout et al. (n.d.)1 of 4Unpublished 57
?0.12 NoNo
114 Wout et al. (n.d.)2 of 4Unpublished 29
?0.67No No
115 Wout et al. (n.d.)3 of 4Unpublished 24
?0.11 NoNo
116Wout et al. (n.d.) 4 of 4Unpublished 260.54 NoNo
Note. Studies presented in bold font (k = 24) are those that overlap with Walton and Cohen's (2003) data set. DI = domain identification.
a Published articles are those that appeared in peer-reviewed journal articles, including those in press; unpublished articles refer to dissertations, theses, conference papers, and working manuscripts.
studies mainly aimed at examining race-based stereotype threat
effects, only the effect sizes as a function of race/ethnicity and
stereotype threat activation contributed data points to the overall
meta-analytic data set.3
Treatment of Studies With Large Sample Sizes
There were a few studies with substantially larger sample sizes
than those in the majority of other studies in the meta-analysis
(e.g., Anderson, 2001; Stricker & Ward, 2004, Studies 1 and 2).
Meta-analytic results with and without the estimates of effect sizes
from these studies were similar so we report findings including all
studies.4
Coding of Studies
We coded levels of test difficulty (e.g., difficult, easy) and
domain identification based on investigators’ description and/or
evidence in source reports. When test takers’ domain identification
was on a continuous scale in some reports (e.g., Bailey, 2004;
Edwards, 2004; Ployhart, Ziegert, & McFarland, 2003), we did not
find sufficient statistical information to convert the data into cat-
egorical subgroup data.
For the type of stereotype threat-activation cues, we coded the
data on three levels: blatant, moderately explicit, and subtle (see
Table 1). A similar coding practice was used for the moderator of
stereotype threat-removal strategies (see Table 2). We coded the
condition of a cognitive ability test without any special directions
as a control condition, following Fisher’s (1925) definitions of
research groups.
Studies were also coded for demographic characteristics of
samples, such as whether the stereotype activated was based on
race or gender. Because stereotype threat manipulation and test
takers’ race/ethnicity or gender were correlated in many studies, a
series of hierarchical moderator analyses was needed to assess the
potential impact of confounding on the moderator analyses. To
accomplish this, we first broke down the stereotype threat effect
estimates for manipulation conditions by test takers’ race/ethnicity
or by gender, and then we undertook a moderator analysis by the
race/ethnicity of test takers (minorities vs. Whites) or by gender
(women vs. men) within the stereotype threat manipulation con-
ditions.
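For readers who want to mirror this breakdown computationally, a minimal pandas sketch is given below; the data frame, its column names, and the four illustrative rows are hypothetical rather than taken from the coding database. Splitting by type of stereotype first and by manipulation moderator second is what keeps the two from being confounded in the comparison.

```python
import pandas as pd

# Hypothetical study-level records; the column names and values are illustrative only
studies = pd.DataFrame({
    "stereotype":     ["race", "race", "gender", "gender"],
    "activation_cue": ["blatant", "subtle", "subtle", "moderately explicit"],
    "d":              [-0.41, -0.22, -0.24, -0.18],
    "n":              [120, 210, 300, 150],
})

# Break effect size estimates down by group-based stereotype first and then by the
# manipulation moderator, cumulating a sample-size-weighted mean d within each cell
cells = (studies.assign(dn=studies["d"] * studies["n"])
                .groupby(["stereotype", "activation_cue"])[["dn", "n"]].sum())
print(cells["dn"] / cells["n"])
```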
Coders and Agreement
Three coders (Hannah-Hanh D. Nguyen and two trained assis-
tants) coded target variables; each study was coded by at least two
coders and periodically cross-checked. Objective statistics and
continuous variables coded include sample size, variable cell
means and standard deviations, t-test values, and/or F-test values.
When there was insufficient statistical information to compute the
estimate of effect size for a study, coders tried to contact source
authors for additional information before marking the data as
missing. The interrater agreement rates for continuous and objec-
tive variables were between 91% and 100%; disagreements were
discussed and resolved. For categorical variables, we computed a
series of interrater agreement index kappas following Landis and
Koch’s (1977) rules. Kappa values ranged from 0.49 to 0.95,
indicating moderately good to very satisfactory interrater agree-
ment levels. The lower kappa values were associated mainly with
the classification of stereotype threat conditions given how re-
searchers might differ in labeling these conditions in the primary
studies (e.g., an STR condition might be referred to as a control
group). Disagreements were discussed and resolved.
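For the categorical codes, chance-corrected agreement of this kind can be computed with a standard Cohen's kappa routine, as in the sketch below; the two rater vectors are hypothetical examples, not the study's actual codes.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical cue-classification codes from two raters for ten studies
rater_1 = ["blatant", "subtle", "moderate", "subtle", "blatant",
           "moderate", "subtle", "subtle", "blatant", "moderate"]
rater_2 = ["blatant", "subtle", "moderate", "moderate", "blatant",
           "moderate", "subtle", "subtle", "blatant", "subtle"]

kappa = cohen_kappa_score(rater_1, rater_2)  # chance-corrected categorical agreement
print(f"Cohen's kappa = {kappa:.2f}")
```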
Meta-Analytic Procedure
We employed the meta-analysis procedure of Hunter and
Schmidt (1990, 2004) and conducted an overall meta-analysis to
cumulate findings across studies, as well as conducting separate
meta-analyses with subsets of studies to examine moderator ef-
fects. Specifically, we cumulated the average population effect
size δ (corrected for measurement error) and computed its variance,
var(δ), across studies, weighted by sample size, using the Meta-
Analysis of d-Values Using Artifact Distributions software pro-
gram (Schmidt & Le, 2005).
We converted descriptive statistics, t-test estimates, or F-test
estimates into the effect size Cohen’s d (i.e., mean difference
between cell means in standard score form) using Thalheimer and
Cook’s (2002) software program, which was based on Rosnow and
Rosenthal’s (1996) and Rosnow, Rosenthal, and Rubin’s (2000)
formulas. Reliability information on cognitive ability tests was
sporadically reported in the source reports; therefore, study effect
sizes could not be corrected individually for measurement error;
we used artifact distributions instead.
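The conversions underlying that step are standard; the following sketch (not the Thalheimer and Cook program itself) shows how a d value can be recovered from cell descriptives or from a t or one-degree-of-freedom F statistic for a two-group contrast.

```python
import math

def d_from_means(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d from cell means and standard deviations (pooled SD)."""
    pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def d_from_f(f, n1, n2, sign=1):
    """Cohen's d from a one-df F statistic (F = t**2); the sign of the mean
    difference must be supplied because F is always positive."""
    return sign * math.sqrt(f) * math.sqrt(1 / n1 + 1 / n2)
```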
Following Hunter and Schmidt's (1990, 2004) recommendation, to judge whether substantial variation due to moderators exists, we used the standard deviation, SDδ, estimated from var(δ) to construct the 90% credibility intervals (CrI) around δ as an index of true variance due to moderators (Whitener, 1990). When the credibility intervals were large (e.g., greater than 0.11; Koslowsky & Sagie, 1993) and overlapped zero, we interpreted them as indicating the presence of true moderators and inconclusive meta-analytic findings. V%, or the ratio of sampling error variance to the observed variance in the corrected effect size, was also calculated. When most of the observed variance is due to sampling error (i.e., V% ≥ 75%), it is less likely that a true moderator exists and explains the observed variance in effect sizes. Following recent meta-analytic practices (e.g., Roth, Bobko, & McFarland, 2005; Zhao & Seibert, 2006), we also reported 95% confidence intervals (CI; the likely amount of error in an estimate of a single value of mean effect size due to sampling error) in our meta-analyses. A 95% CI excluding zero means that we can be 95% confident that mean δ is not zero.
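A bare-bones illustration of this aggregation step, written from the uncorrected Hunter and Schmidt formulas, appears below; it omits the artifact-distribution correction for measurement error and uses a conventional two-sided multiplier for the credibility interval, so it will not reproduce the corrected values in Tables 4 and 5 exactly.

```python
import math

def aggregate_d(ds, ns):
    """Sample-size-weighted meta-analysis of observed d values
    (no artifact-distribution correction applied)."""
    total_n = sum(ns)
    k = len(ds)
    mean_d = sum(d * n for d, n in zip(ds, ns)) / total_n
    var_d = sum(n * (d - mean_d) ** 2 for d, n in zip(ds, ns)) / total_n  # observed variance
    n_bar = total_n / k
    var_e = (4.0 / n_bar) * (1.0 + mean_d ** 2 / 8.0)   # sampling-error variance
    var_res = max(var_d - var_e, 0.0)                   # residual ("true") variance
    pct_se = 100.0 * var_e / var_d if var_d > 0 else 100.0
    sd_res = math.sqrt(var_res)
    cri = (mean_d - 1.645 * sd_res, mean_d + 1.645 * sd_res)  # two-sided 90% credibility interval
    se_mean = math.sqrt(var_d / k)
    ci = (mean_d - 1.96 * se_mean, mean_d + 1.96 * se_mean)   # 95% confidence interval for mean d
    return {"mean_d": mean_d, "var_d": var_d, "var_e": var_e,
            "pct_sampling_error": pct_se, "cri_90": cri, "ci_95": ci}
```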
3Stricker and Ward (2004, Study 1) was an exception to this rule, as the
stereotype threat cues were both race-based and gender-based (i.e., race
and gender inquiries prior to tests). Therefore, it was conceptually sound to
code the outcomes of this study separately as a function of race or gender;
that means the study contributed some nonindependent estimates of effect
size to the data set. However, the proportion of these nonindependent data
points was not large in the data set (i.e., 842 data points altogether, or
10.7%).
4The results from meta-analytic data sets without large sample-size
studies are available from Hannah-Hanh D. Nguyen upon request.
Testing for Publication Bias
Our meta-analysis database consisted of a relatively balanced
number of published and unpublished reports (54.8% and 45.2%,
respectively). Nevertheless, fail-safe N analyses were conducted to
test a potential file-drawer bias in each meta-analysis. Hunter and
Schmidt (1990) provided a formula to calculate fail-safe N, which
indicates the number of missing studies with zero-effect size that
would have to exist to bring the mean effect size down to a specific
level. In the present review, mean dcritical was arbitrarily set to
0.10, which constitutes a negligible effect size (see J. Cohen,
1988). In the interest of space, we discussed file-drawer analyses
only where potential problems were indicated.
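One standard formulation consistent with that description is sketched below; the exact routine used for the tabled fail-safe N values may differ in detail.

```python
def fail_safe_n(mean_d, k, d_critical=0.10):
    """Number of unretrieved studies averaging d = 0 that would be needed to
    pull the weighted mean effect size down to d_critical."""
    return k * (abs(mean_d) - d_critical) / d_critical

# For example, an overall |mean d| of .26 across 116 effect sizes:
print(round(fail_safe_n(0.26, 116)))  # prints the number of hypothetical null studies
```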
Additionally, we used Light and Pillemer’s (1984) “funnel
graph” technique of plotting sample sizes versus effect sizes. In the
absence of bias, the plot should resemble a symmetrical inverted
funnel. There may be a problem of publication bias when there is
a cutoff of small effects for studies with a small sample size. In
other words, because only large effects reach statistical signifi-
cance in small samples, a publication bias or other types of
location biases are present when only large effects are reported by
studies with a small sample size (i.e., an asymmetrical and skewed
shape). Conversely, no bias is indicated if an exclusion of null
results is not visible on the funnel graph.
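A funnel graph of this kind can be produced directly from the study-level effect sizes and sample sizes; the following matplotlib sketch is illustrative only.

```python
import matplotlib.pyplot as plt

def funnel_plot(effect_sizes, sample_sizes):
    """Light and Pillemer (1984) style funnel graph: effect size (x) against
    sample size (y). A symmetric inverted funnel suggests little location bias;
    small-N studies reporting only large effects suggest publication bias."""
    plt.scatter(effect_sizes, sample_sizes, alpha=0.6)
    plt.axvline(0.0, linestyle="--", linewidth=1)  # reference line at d = 0
    plt.xlabel("Effect size (d)")
    plt.ylabel("Sample size (N)")
    plt.title("Funnel graph of study effect sizes")
    plt.show()
```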
Results
Within-Group Stereotype Threat Effects
Overall effect. Tables 4 and 5 present the results of our hierarchical moderator meta-analyses. The overall effect size was mean d = |.26| (K = 116, N = 7,964; see Table 4), which was comparable to the finding of mean d = |.29| in Walton and Cohen's (2003) study. However, the variance of effect sizes was nonzero (V% was about 26%), and the CrI shows that there was a 90% probability that the true effect size was between -0.85 and 0.29, a range of d values overlapping zero. These values indicate that true moderators existed.
Group-based stereotypes. We separately analyzed the mean effect sizes for studies with an ethnic/racial group-based stereotype of intellectual inferiority and studies with a gender-based stereotype of mathematical ability inferiority, something not considered in Walton and Cohen (2003). Lines 3 and 4 in Table 4 show differential stereotype threat effects in that the mean effect size was greater in the ethnicity/race-based stereotype subset than in the gender-based stereotype subset (mean ds = |.32| and |.21|, respectively). The 95% CIs, which excluded zero, indicate reliable effects. Although the subset variance values decreased compared with the variance of the entire set of d values, they were still
nonzero. There was a 90% probability that the true mean effect size of the race-based subset was between a zero-included range of -0.85 and 0.14, whereas there was a 90% probability that the true mean effect size of the gender-based subset was between a zero-included range of -0.82 and 0.37. Subset V% values were 33% and 25% for minorities and women subsets, respectively. Taken together, these values suggested further moderator meta-analyses.

Table 4
Hierarchical Moderator Analyses of Domain Identification and Test Difficulty

Variable | K | N | Mean d | Var d | Var e | Mean δ | Var δ | % var SE | V% | 90% CrI | 95% CI | Fail-safe N
Overall effect size | 116 | 7,964 | -.258 | .227 | .060 | -.281 | .198 | 26.26 | 26.33 | -0.85, 0.29 | -0.38, -0.16 | 415

Group-based stereotype
Race/ethnicity | 44 | 2,988 | -.324 | .186 | .060 | -.353 | .149 | 32.50 | 32.63 | -0.85, 0.14 | -0.49, -0.19 | 187
Female | 72 | 4,935 | -.208 | .241 | .059 | -.227 | .216 | 24.64 | 24.68 | -0.82, 0.37 | -0.35, -0.08 | 222

Test difficulty by group-based stereotype
Overall: Difficult | 48 | 2,161 | -.394 | .396 | .092 | -.429 | .361 | 23.20 | 23.29 | -1.20, 0.34 | -0.62, -0.02 | 237
Overall: Moderately difficult | 24 | 1,560 | -.190 | .153 | .063 | -.208 | .107 | 40.86 | 40.92 | -0.63, 0.21 | -0.38, -0.02 | 70
Overall: Easy | 9 | 308 | .083 | .199 | .119 | .091 | .095 | 59.74 | 59.74 | -0.30, 0.49 | -0.23, 0.40 | 2
Minority test takers: Difficult | 12 | 549 | -.425 | .157 | .091 | -.464 | .078 | 57.84 | 58.11 | -0.82, -0.11 | -0.71, -0.18 | 63
Minority test takers: Moderately difficult | 10 | 647 | -.181 | .073 | .063 | -.198 | .012 | 86.27 | 86.37 | -0.34, -0.06 | -0.38, 0.00 | 28
Female test takers: Difficult | 33 | 1,508 | -.363 | .500 | .090 | -.395 | .487 | 18.04 | 18.10 | -1.29, 0.50 | -0.66, -0.10 | 153
Female test takers: Moderately difficult | 13 | 890 | -.175 | .195 | .059 | -.191 | .162 | 30.34 | 30.38 | -0.71, 0.32 | -0.45, 0.09 | 36
Female test takers: Easy | 9 | 308 | .083 | .199 | .119 | .091 | .095 | 59.74 | 59.74 | -0.30, 0.49 | -0.23, 0.40 | 2

Domain identification by group-based stereotype (a)
Overall: High | 12 | 478 | -.316 | .210 | .103 | -.344 | .127 | 49.21 | 49.32 | -0.80, 0.11 | -0.63, -0.03 | 50
Overall: Medium | 9 | 313 | -.371 | .290 | .120 | -.404 | .203 | 41.00 | 41.10 | -0.98, 0.17 | -0.79, 0.01 | 42
Female test takers (b): High | 9 | 380 | -.287 | .201 | .097 | -.313 | .123 | 48.44 | 48.54 | -0.76, 0.14 | -0.63, 0.03 | 35
Female test takers (b): Medium | 6 | 212 | -.518 | .204 | .119 | -.565 | .100 | 58.29 | 58.59 | -0.97, -0.16 | -0.96, -0.12 | 37

Note. K = number of effect sizes (d values); N = total sample size; mean d = sample-size-weighted mean effect size; var d = sample-size-weighted observed variance of d values; var e = variance attributed to sampling error variance; mean δ = mean true effect size; var δ = true variance of effect sizes; % var SE = percent variance in observed d values due to sampling error variance; V% = percent variance accounted for in observed d values due to all corrected artifacts; 90% CrI = 90% credibility interval around mean δ; 95% CI = 95% confidence interval around mean δ; fail-safe N = number of missing studies averaging null findings that would be needed to bring mean d down to .10, from Hunter and Schmidt's (1990) effect size file-drawer analysis.
(a) Domain identification levels: High = strongly identified with academic or cognitive ability domains; Medium = moderately identified. (b) Only the subsets of female test takers were meta-analyzed here because there were insufficient race-based studies contributing effect size estimates (see Arthur et al., 2003).
Test difficulty. We next meta-analyzed test difficulty as a moderator across stereotypes. Table 4 shows that stereotype-threatened minorities underperformed relative to nonthreatened minorities to a greater extent when cognitive ability tests were highly difficult (mean d = |.43|, a reliable effect whose 95% CI excluded zero) than when tests were moderately difficult (mean d = |.18|, zero-included 95% CI). (There were no studies using easy tests to investigate stereotype threat effects among minority test takers.) The credibility intervals did not overlap zero, meaning that these findings were conclusive.
Similar to ethnic minority test takers, women underperformed when a math test was highly difficult (mean d = |.36|), more so than when a math test was moderately difficult (mean d = |.18|). However, when the test was easy, women tended to improve their test performance slightly (mean d = |.08|). Only for the high-difficulty subset did the 95% CI exclude zero, indicating a reliable effect. The V% and zero-included 90% CrIs indicate that true moderators still existed. The smaller file-drawer N values indicate that the findings on medium and low difficulty levels were not conclusive.
Domain identification. As shown in Table 4, there were no discernible differences in stereotype threat effects between highly and moderately domain-identified samples (mean ds = |.32| and |.37|, ks = 12 and 9, respectively). The results were inconsistent with those in Walton and Cohen's (2003) meta-analysis (i.e., mean d = |.68| for domain-identified participants vs. mean d = |.29| for those not identified). However, our hierarchical meta-analyses with studies on gender-based stereotypes showed a different pattern of findings: Highly math-identified women experienced smaller stereotype threat effects (mean d = |.29|, k = 9) than did moderately math-identified women (mean d = |.52|, k = 6). Only the 95% CI for the moderately identified subset excluded zero, indicating a reliable effect. Smaller V% values and zero-included CrIs indicate that these findings were inconclusive because of other moderators that may explain the variance in the data over and above study artifacts. Furthermore, lower fail-safe N values show that these meta-analytic findings might not be conclusive. There were only three studies of ethnic minorities in each level, so the analyses for minority subsets were not conducted (see Arthur, Bennett, Edens, & Bell, 2003).
Stereotype threat relevance: Threat-activating cues.
Table 5 shows that stereotype threat-activating cues affected minority test takers' test performance (mean d = |.30|) more than that of women (mean d = |.21|). However, Table 5 also shows that when the negative stereotype was based on race, the largest mean effect size was produced by moderately explicit threat-activating cues (mean d = |.64|) compared with other types of threat-activating cues (blatant cues: mean d = |.41|; subtle cues: mean d = |.22|). As the V% and CI values in Table 5 indicate, the findings for the race-based subset under blatant cue conditions were conclusive, but for the subtle cues subset further moderator analyses were still needed.
Table 5
Hierarchical Moderator Analyses of Stereotype Threat Relevance (Activation and Removal)

Variable                 K      N   Mean d  Var d  Var e  Mean δ  Var δ  % var SE     V%   90% CrI        95% CI         Fail-safe N

Stereotype threat-activating (STA) cues by group-based stereotype
Minority test takers
  Overall               38  2,724   −.295   .185   .057   −.322   .151     30.89   31.00  −0.82, 0.18    −0.47, −0.15      150
  Blatant                6    436   −.405   .077   .057   −.441   .024     73.37   73.86  −0.64, −0.24   −0.68, −0.16       30
  Moderately explicit    7    277   −.639   .058   .108   −.696   0          100     100  —              −0.89, −0.45       52
  Subtle                25  2,011   −.224   .201   .051   −.244   .179     25.10   25.16  −0.79, 0.30    −0.44, −0.03       81
Female test takers
  Overall               73  4,947   −.205   .240   .060   −.223   .214     25.06   25.10  −0.82, 0.37    −0.35, −0.08      223
  Blatant               22  1,279   −.172   .390   .070   −.188   .381     17.90   17.92  −0.98, 0.60    −0.47, 0.11        60
  Moderately explicit   20  1,138   −.184   .181   .072   −.201   .130     39.63   39.67  −0.66, 0.26    −0.41, 0.02        57
  Subtle                32  2,564   −.239   .193   .051   −.261   .169     26.42   26.49  −0.80, 0.27    −0.43, −0.07      108

Stereotype threat-removal (STR) strategies by group-based stereotype
Minority test takers
  Overall               30  1,661   −.415   .245   .075   −.452   .201     30.53   30.69  −1.03, 0.12    −0.65, −0.22      155
  Explicit               5    157   −.800   .053   .140   −.870   0          100     100  —              −1.09, −0.58       45
  Subtle                25  1,504   −.375   .248   .068   −.408   .213     27.62   27.75  −1.00, 0.18    −0.62, −0.16      119
Female test takers
  Overall               61  3,310   −.233   .337   .075   −.254   .311     22.34   22.38  −0.97, 0.46    −0.41, −0.07      203
  Explicit              31  1,626   −.135   .285   .078   −.147   .245     27.17   27.19  −0.78, 0.49    −0.35, 0.07        73
  Subtle                30  1,684   −.329   .368   .073   −.358   .350     19.88   19.95  −1.12, 0.40    −0.59, −0.36      129

Note. K = number of effect sizes (d values); N = total sample size; mean d = sample-size-weighted mean effect size; var d = sample-size-weighted observed variance of d values; var e = variance attributed to sampling error variance; mean δ = mean true effect size; var δ = true variance of effect sizes; % var SE = percent variance in observed d values due to sampling error variance; V% = percent variance accounted for in observed d values due to all corrected artifacts; 90% CrI = 90% credibility interval of mean δ; 95% CI = 95% confidence interval of mean δ; fail-safe N = number of missing studies averaging null findings that would be needed to bring mean d down to .10, from Hunter and Schmidt's (1990) effect size file-drawer analysis.
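The quantities defined in the note (mean d, var d, var e) follow the sample-size-weighted, bare-bones portion of the Hunter-Schmidt procedure. The following is a minimal Python sketch of those bare-bones computations only, without the artifact corrections the authors additionally apply; the function and variable names are illustrative, not from the article.

    def bare_bones_meta(ds, ns):
        """Sample-size-weighted bare-bones meta-analysis of d values.

        ds: observed d value from each study; ns: total N of each study.
        Returns the weighted mean d, observed variance, sampling-error
        variance, residual ("true") variance before artifact corrections,
        and the percentage of observed variance attributable to sampling error.
        """
        total_n = sum(ns)
        k = len(ds)
        mean_d = sum(n * d for d, n in zip(ds, ns)) / total_n
        var_d = sum(n * (d - mean_d) ** 2 for d, n in zip(ds, ns)) / total_n
        # Common Hunter-Schmidt approximation to the sampling-error variance
        # of d, evaluated at the average sample size across studies.
        avg_n = total_n / k
        var_e = (4.0 / avg_n) * (1.0 + mean_d ** 2 / 8.0)
        var_true = max(var_d - var_e, 0.0)
        pct_var_se = 100.0 * min(var_e / var_d, 1.0) if var_d > 0 else 100.0
        return mean_d, var_d, var_e, var_true, pct_var_se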
As shown in Table 5, the negative stereotype concerning women's mathematical ability yielded a different pattern of findings from the race-based stereotype. Studies using moderately explicit cues yielded a mean effect size (mean d = |.18|) comparable to that in studies using blatant cues (mean d = |.17|). The zero-included 95% CIs indicate nonreliable effects. Studies employing subtle stereotype threat cues yielded the largest mean effect size (mean d = |.24|), and the nonoverlapping 95% CI indicates a reliable effect. (The effect size differences among these subsets were trivial, though.) V% values and the 90% CrIs suggest that other moderators would further explain the variance in these d values to reach conclusive findings. Our findings show a more complex pattern than the results in Walton and Cohen's (2003) meta-analysis: Walton and Cohen had found that, overall, explicit stereotype threat activation produced a greater effect size (mean d = |.57|) than did activation that was not explicit (mean d = |.27|). Their findings were consistent with those found for minorities in the present study but inconsistent with those found for female test takers.
Stereotype threat relevance: Threat-removal strategies.
Table 5 shows that stereotype threat-removal strategies differentially affected minority test takers' performance (mean d = |.42|, nonoverlapping 95% CI) and female test takers' performance (mean d = |.23|, nonoverlapping 95% CI). Stereotype threat-removal strategies seemed to work better on women's math test performance than on minorities' test performance, at least at the mean level.
Hierarchical meta-analyses showed that minority test takers seemed to benefit more from subtle or indirect threat-removal strategies than from direct, explicit ones (i.e., smaller stereotype threat effects: mean d = |.38| and mean d = |.80|, respectively). Study artifacts explained all variance in the explicit-removal strategy subset of d values, indicating that this finding of interest was conclusive, although one should be cautious about generalizing this finding because there was a smaller fail-safe N of 45. However, study artifacts explained only 28% of the variance in the subtle removal-strategy subset of d values, and the 90% CrI overlapped zero, indicating true moderators and inconclusive findings.
Table 5 also reveals that female test takers benefited more from explicit stereotype threat-removal strategies (mean d = |.14|, zero-included 95% CI) than from subtle strategies (mean d = |.33|, nonoverlapping 95% CI). Low V% values and zero-overlapping 90% CrIs indicate the effects of other true moderators. Again, the pattern of Walton and Cohen's (2003) meta-analytic findings was more consistent with our result pattern for minorities than with that for women (i.e., Walton and Cohen found that studies explicitly removing stereotype threat produced greater stereotype threat effects, mean d = .45, than studies that did not, mean d = .20).
Supplemental Bias Analysis
As shown in Figure 1, the funnel plot for the full meta-analytic data set resembles a relatively symmetrical inverted funnel, indicating the absence of publication bias in the data set. Further, the relationship between effect size estimates and study sample sizes was positive and statistically significant (r = .23, p < .05). When four primary studies, each with a sample size larger than 200, were excluded from the data set (Anderson, 2001; Dinella, 2004; Stricker & Ward, 2004, Study 1B & Study 2), a similar pattern of findings was also found.
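A funnel plot of this kind simply scatters each study's effect size estimate against its sample size; roughly symmetric scatter that narrows as samples grow larger is taken as evidence against publication bias. The sketch below is illustrative only; the article's actual plotting procedure is not described, and the function and variable names are hypothetical.

    import matplotlib.pyplot as plt

    def funnel_plot(effect_sizes, sample_sizes):
        """Scatter per-study effect estimates (x) against sample sizes (y)."""
        plt.scatter(effect_sizes, sample_sizes)
        plt.axvline(0.0, linestyle="--")  # reference line at d = 0
        plt.xlabel("Effect estimate of stereotyped groups (d)")
        plt.ylabel("Sample size of stereotyped groups")
        plt.title("Funnel plot of stereotype threat effect estimates")
        plt.show()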
One additional question is whether studies that yielded either positive effect size estimates or estimates clustering around the zero point (k = 29, 25% of the data set) have characteristics that differ from studies in which the d values supported the hypothesis of performance interference (i.e., a negative effect size). Examining the general characteristics of samples in subsets of studies at different levels of effect size estimates, we found no clearly defining characteristics that might distinguish studies that found no stereotype threat effects or positive effects from studies that found the effects (see Footnote 5).
Between-Group Stereotype Threat Effects
As shown in Table 6, the overall between-group effect values increased from a mean effect size of d = |.44| in test-only, control conditions to a mean effect size of d = |.53| in stereotype threat-activated conditions. When interventions or threat-removal strategies were implemented, stereotyped test takers underperformed on cognitive ability tests compared with reference test takers (mean d = |.28|). The nonoverlapping 95% CIs indicate reliable effects. The 90% CrI values and V% estimates indicate true moderator effects. The zero-included CrIs for mean ds in stereotype threat-activated conditions and stereotype threat-removed conditions mean that these findings were not conclusive. The credibility interval for mean d in control conditions did not overlap zero, however.
Subsequent hierarchical meta-analyses across group-based stereotypes were conducted. As shown in Table 6, in control conditions, ethnic minority test takers underperformed compared with majority test takers: The between-group mean d was |.56|. The nonoverlapping 95% CI indicates a reliable effect. On the average, ethnic minority test takers' test scores were approximately at the 30th percentile of majority groups' mean test scores, which is relatively consistent with the literature (e.g., the overall mean standardized differences for g are 1.10 for the Black–White comparison and 0.72 for the Latino–White comparison; Roth, Bevier, Bobko, Switzer, & Tyler, 2001). Note that study artifacts explained all observed variance in the d values in this subset, suggesting that no further moderator analyses should be conducted for this subset.
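The percentile statements in this and the following paragraphs can be recovered from the tabled between-group d values under the usual assumption of approximately normal score distributions with comparable spread (the article does not spell out its conversion, so this is a reconstruction): the stereotyped group's mean falls at the Φ(−|d|) quantile of the reference group's distribution. A brief check in Python:

    from statistics import NormalDist

    def percentile_of_stereotyped_group_mean(d):
        """Percentile of the reference-group score distribution at which the
        stereotyped group's mean falls, assuming normality and similar spread."""
        return 100.0 * NormalDist().cdf(-abs(d))

    print(round(percentile_of_stereotyped_group_mean(0.56)))  # ~29, i.e., roughly the 30th percentile
    print(round(percentile_of_stereotyped_group_mean(0.26)))  # ~40th percentile (women vs. men, control)

The same conversion approximately reproduces the 25th, 34th, and 41st percentile figures reported below for the threat-activated and threat-removed conditions.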
Table 6 also reveals that, in control conditions, women under-
performed compared with men on mathematical ability tests; the
between-group mean effect size was mean d = |.26|, which is
consistent with the literature (see a review by Halpern et al., 2007).
The nonoverlapping 95% CI indicates a reliable effect. On aver-
age, women’s mean math test scores were approximately at the
40th percentile of men’s mean math test scores. Study artifacts
explained all of the variance in d values, suggesting no other
moderators for this subset. Although the fail-safe N value was not
very large (47), similar overall gender differences in math test performance were observed in the testing literature; therefore, a file-drawer problem was possible but not plausible.

Footnote 5. Whereas there was only one non-American sample in the "non-effect" group of studies (3.5%), there were 23 non-American samples (26.5%) in the "stereotype threat effect" group, suggesting that non-American authors (or American authors who used non-American samples) might be more likely to publish, in American journals, significant findings consistent with the hypothesis of performance interference than nonsignificant findings or findings that contradicted the hypothesis.
In stereotype threat-activated conditions, ethnic minority test
takers underperformed compared with majority test takers, and the
between-group mean effect size was mean d = |.67|. The nonover-
lapping 95% CI indicates a reliable effect. When stereotype threat
was activated, on average, ethnic minority test takers’ mean scores
were approximately at the 25th percentile of majority groups’
mean test scores, which was worse than the results in control
conditions. Study artifacts explained about 62% of the observed
variance in d values, suggesting that further moderator effects
should be investigated for this subset. The 90% CrI did not overlap
zero, indicating a conclusive result.
Also in stereotype threat-activated conditions, women underper-
formed compared with men on mathematical ability tests, and the
between-group mean effect was mean d = |.39|. The nonoverlap-
ping 95% CI indicates a reliable effect. In other words, when
stereotype threat was activated, women’s mean math test scores
were approximately at the 34th percentile of men’s mean scores.
Study artifacts explained only 26% of the variance in d, suggesting
other true moderator effects. The zero-included credibility interval
indicates an inconclusive finding.
In stereotype threat-removed conditions, ethnic minority test
takers underperformed compared with majority test takers; the
between-group mean effect size was mean d = |.38|, a reliable
effect based on the nonoverlapping 95% CI value. On average,
when stereotype threat effects were removed, ethnic minority test
takers’ mean test scores were approximately at the 34th percentile
of majority groups’ mean test scores. However, study artifacts
explained about 38% of the observed variance in the d values in
the subset findings, suggesting moderator effects. The zero-
included 90% CrI indicates that this finding was not conclusive.
Furthermore, in stereotype threat-removal conditions, women
underperformed compared with men on mathematical ability tests,
and the between-group mean effect size was mean d = |.23|, a
reliable effect based on the nonoverlapping 95% CI value. On the
average, women’s math test scores were approximately at the 41st
percentile of men’s mean math scores when stereotype threat-
removal strategies were implemented. Study artifacts explained
73% of the variance in d values, suggesting no other moderators.
The 90% CrI did not include zero, indicating a meaningful effect.
Discussion
In integrating more than 10 years of experimental research on
stereotype threat effects on stereotyped test takers’ cognitive abil-
ity test performance, we found that the overall performance of
stereotyped test takers might suffer from a situational stereotype
threat: Our overall effect size of |.26| was consistent with the
finding in Walton and Cohen (2003). However, there was consid-
erable variability in the effect sizes (i.e., one fourth of studies in
our data set showed zero or positive effects); therefore we exam-
ined several conceptual moderators (test difficulty, domain iden-
tification, and the activation or removal of stereotype relevance),
each of which was analyzed separately for race-based and gender-based stereotypes. We focus our discussion on the implications of these key moderating relationships.

Figure 1. The funnel graph of stereotype threat effects on target test takers' cognitive ability test performance (K = 116). The effect size estimates of stereotyped groups (−3.00 to 1.00) are plotted against study sample sizes (0 to 800); r = .23, p < .05.
Moderators
The theory of stereotype threat assumes a uniform pattern of
target reactions to the activation (or removal) of a salient negative
stereotype when an evaluative ability test is administered. Al-
though our meta-analytic findings suggest that under a situational
stereotype threat, both minority and women test takers tended to
perform poorly on different types of cognitive ability tests com-
pared with others under no threat, we also found that the observed
effects appeared to be greater among studies using a race/ethnicity-
based stereotype than among studies using a gender-based stereo-
type. Although these findings should not be directly interpreted
(because the zero-included credibility intervals required further
moderating tests), they lend initial credence to our proposition that
the type of group-based stereotype is an important moderator of
stereotype threat effects. In fact, most of our subsequent lower
order moderator analytic results further supported this proposition.
In terms of test difficulty, although both racial/ethnic-based and
gender-based stereotypes seemed to interact with this moderator in
a similar fashion (i.e., more difficult tests produced larger effect
sizes), stereotype threat effects were more severe for ethnic mi-
norities than for female test takers when a test was highly difficult.
One possible explanation is the methodological inconsistency in
how test difficulty is operationalized in the literature. For “very
difficult” math tests in gender-based studies, many researchers
selected a specific advanced type of standardized quantitative
ability test (e.g., GRE Calculus only; Aronson et al., 1999),
whereas other content domains were considered as constituting
“moderately difficult” or “easy” tests (e.g., algebra, trigonometry,
and geometry; O’Brien & Crandall, 2003; S. J. Spencer et al.,
1999). The construct of test difficulty might be confounded with
math subdomains. It remains unclear in these studies whether
stereotype threat effects were manifested at a high level of diffi-
culty or whether they were observed with certain types of math
ability problems (e.g., advanced calculus) but not with other types.
In some studies of race-based stereotypes, researchers reviewed
test score distributions or pilot-tested test items to see whether or
not a test was difficult enough for their sample (e.g., Steele &
Aronson, 1995; Stricker & Bejar, 2004; Wicherts, Dolan, & Hes-
sen, 2005). These practices also might induce variance beyond the
construct of difficulty. Also, the theory of stereotype threat sug-
gests stereotype threat effects will occur when the test is difficult
because of the cognitive demands of taking the test. However, one
might posit that difficult items can be seen by a test taker as
showing that the stereotype is true and affecting subsequent mo-
tivation. Future studies need to adopt a more consistent and ap-
propriate method to operationalize test difficulty and measure
potential mediating mechanisms to clarify these differential find-
ings for race and gender stereotypes—a good example is Stricker
and Bejar’s (2004) approach (reducing the difficulty of the same
types of test items). Furthermore, of interest would be studies that
examine samples that range in ability (and hence the test is more
difficult for some test takers than others) as well as studies that
contrast effects on easy versus hard items in the same test.
In terms of test takers’ domain identification, the lack of sub-
stantial direct investigation of ethnic minorities’ domain identifi-
cation rendered meta-analyses impossible. Similar to the overall
pattern of findings in Walton and Cohen (2003), we found that less
math-identified women did not suffer much from stereotype threat
in terms of math test performance compared with more strongly
math-identified women. However, in a departure from Walton and
Cohen, we found that moderately math-identified women were
surprisingly affected more severely by stereotype threat than
highly math-identified ones, which is inconsistent with the theory
tenet but may be suggestive of stereotype reactance among highly
identified individuals. Investigators might inadvertently lose infor-
mative data when implementing the strongest experimental design
by screening in only strongly identified individuals (e.g., math majors).
Table 6
Hierarchical Meta-Analytic Findings of Between-Group Mean Test Performance Across Stereotype Threat Levels

Condition                    K      N   Mean d  Var d   Var e  Mean δ  Var δ  % var SE     V%   90% CrI         95% CI         Fail-safe N

Control
  Overall                   23  3,620   −.440   .080    .030   −.480   .060     34.88   35.48  −0.79, −0.17    −0.61, −0.31      124
  Minority vs. majority a   10  1,695   −.564   .0195   .025   −.615   0          100     100  —               −0.71, −0.47       66
  Women vs. men             13  1,803   −.264   .025    .029   −.288   0          100     100  —               −0.38, −0.17       47
Stereotype threat-activating
  Overall                   62  5,937   −.530   .160    .040   −.580   .130     27.80   28.22  −1.05, −0.11    −0.69, −0.42      391
  Minority vs. majority a   23  2,498   −.686   .065    .039   −.747   .065     60.29   61.95  −0.97, −0.53    −0.86, −0.57      181
  Women vs. men             39  3,330   −.392   .186    .048   −.428   .163     26.08   26.27  −0.94, 0.09     −0.58, −0.24      192
Stereotype threat-removing
  Overall                   46  2,603   −.280   .130    .070   −.300   .070     54.76   54.90  −0.64, 0.04     −0.41, −0.17      175
  Minority vs. majority a   14    848   −.377   .182    .068   −.410   .135     37.42   37.60  −0.88, 0.06     −0.65, −0.13       67
  Women vs. men             32  1,765   −.232   .101    .074   −.252   .032     73.02   73.15  −0.48, −0.02    −0.37, −0.11      106

Note. K = number of effect sizes (d values); N = total sample size; mean d = sample-size-weighted mean effect size; var d = sample-size-weighted observed variance of d values; var e = variance attributed to sampling error variance; mean δ = mean true effect size; var δ = true variance of effect sizes; % var SE = percent variance in observed d values due to sampling error variance; V% = percent variance accounted for in observed d values due to all corrected artifacts; 90% CrI = 90% credibility interval of mean δ; 95% CI = 95% confidence interval of mean δ; fail-safe N = number of missing studies averaging null findings that would be needed to bring mean d down to .10, from Hunter and Schmidt's (1990) effect size file-drawer analysis.
a Five primary studies from the entire data set that used White test takers as the stereotyped group were excluded from these analyses (Aronson et al., 1999; J. L. Smith & White, 2002; von Hippel et al., 2005) so that only minority subgroups' d values were meta-analyzed.
Not only does future research need to expand to include
more studies on domain identification effects with ethnic minori-
ties, but researchers also need to include the full spectrum of
identification to more accurately determine effects.
Similar to test difficulty, the variance in domain identification
effects might be partially explained by the inconsistent operation-
alization of the construct in the literature. Test takers’ domain
identification was either directly assessed using self-report mea-
sures (e.g., R. P. Brown & Pinel, 2003; Spicer, 1999), indirectly
inferred from objective measures such as high standardized cog-
nitive ability test scores (e.g., Anderson, 2001; Quinn & Spencer,
2001; Schmader & Johns, 2003), or assessed with both approaches
(e.g., Davies et al., 2002; Harder, 1999). Defining individuals’
domain identification indirectly from their existing standardized
cognitive ability test scores might be problematic. The perfor-
mance interference hypothesis of stereotype threat would predict
that a negative stereotype might negatively affect target stereo-
typed individuals in a highly diagnostic testing situation. Pre-
screening participants based on their strong performance on prior
cognitive ability tests in the hope that these high performers would
subsequently underperform on another cognitive ability test might
result not only in a restriction of range but also in a circular
conceptualization of the construct. Future research needs to reach
a consensus on the operational definition of domain identification.
The pattern of differential evidence of stereotype threat effects
was most apparent when we considered the moderating effect of
stereotype threat relevance. As mentioned, the relevance of a
negative stereotype might be activated with threat cues or removed
with various strategies. We extended Walton and Cohen’s (2003)
work by categorizing stereotype threat-activating cues as blatant,
moderately explicit, or subtle. Stereotype threat theory implies that
the more explicit threat cues would produce a stronger stereotype
threat effect, and Walton and Cohen’s findings supported this
prediction. Consistent with this, for minority test takers, we found
that subtle stereotype threat cues produced smaller stereotype
threat effects compared with other conditions. However, we also
found that moderately explicit threat-activating cues produced a
greater mean effect size than blatant cues for minority test takers.
These interesting findings lend partial credence to the theory of
stereotype reactance, which posits that stereotyped individuals
might perceive a blatant negative stereotype as a limit to their
freedom and ability to perform, thereby ironically invoking behav-
iors that are inconsistent with the stereotype (see Kray et al., 2001).
In contrast, for female test takers, explicit threat-activation cues
(both blatant and moderate) generally produced smaller mean
effect sizes than subtle cues. The findings seemed to support
Levy’s (1996) position that explicit priming of a negative stereo-
type might produce a weaker effect than subtle priming because
the latter might bypass individuals’ conscious psychological mech-
anisms to directly affect task performance.
Before further discussing possible explanations for these differ-
ential outcomes, we should note our findings for the other side of
the coin: What happens when researchers actively removed the
link between a stereotype and an ability test? Explicit stereotype
threat-removal strategies were more effective than subtle ones in
reducing stereotype threat effects for women, supporting Shih et
al.’s (1999) notion that the direct activation of a positive in-group
stereotype (e.g., women are better on a specific math test than
men) might buffer the effect of the stereotype and even cause a
performance boost for some female test takers. However, for
ethnic minorities, explicit stereotype threat-removal strategies
counterintuitively led to stronger stereotype threat effects com-
pared with subtle strategies, a pattern consistent with the overall
moderator effect found by Walton and Cohen (2003). In other
words, actively removing stereotype threat seemed to be not as
effective for minority test takers.
Because studies with minorities and women rely on different
stereotypes and different dependent variables, it may seem unsur-
prising that different effects were found, even though prior re-
search has treated these as conceptually interchangeable manifes-
tations of stereotype threat. Further, the moderator of stereotype
relevance might affect test performance of women and minorities
via different mechanisms. For example, telling test takers that
ethnic minorities in general perform better than Whites on a certain
cognitive ability test (an explicit threat-removal strategy) might
actually introduce performance interference to the testing context.
In other words, direct and explicit statements might create a
performance pressure for these test takers: Should they do poorly,
they would not be able to confirm the positive in-group image.
Therefore, ethnic minorities’ test performance might suffer be-
cause of the same psychological mechanisms as experienced by
individuals of a “model minority” status (see Cheryan & Boden-
hausen, 2000). Explicit interventions aimed at refuting a negative
stereotype about minorities’ intellectual inferiority might backfire,
inadvertently worsening stereotype threat effects instead of allevi-
ating them.
On the other hand, it is possible that female test takers reacted
favorably to explicit threat-removals because they might not ex-
perience as much performance pressure. As a sociocultural factor,
an intellectual-based negative stereotype may carry more distress
for ethnic minority targets than a math-based stereotype for
women; hence, minorities might paradoxically underperform even
when encountering a refuting message, whereas female test takers
might take the threat removal message at face value. Although
most empirical efforts to understand why stereotype threat
effects generally take place have not been successful (see a review by
J. L. Smith, 2004), future research needs to focus on investigating
possible differential psychological mechanisms underlying group-
based targets’ reactions to stereotype threat, both activated and
removed.
Another possibility is that how stereotype relevance is activated
or removed would somehow tap into unique characteristics asso-
ciated with ethnicity and gender, invoking differential behavioral
reactions. Rejection sensitivity is defined “as a cognitive–affective
processing dynamic . . . whereby people anxiously expect, readily
perceive, and intensely react to rejection in situations in which
rejection is possible” (Mendoza-Denton, Downey, Purdie, Davis,
& Pietrzak, 2002, p. 897). Ethnic minorities have a lifetime history
of being subjected to group-based discrimination, mistreatment,
prejudice, and exclusion from salient domains (e.g., higher aca-
demic education, employment, and certification), either directly or
vicariously (see Essed, 1991). When the outcome is important and
where one would possibly experience rejection based on one’s
group membership (Higgins, 1996), minority test takers might
more readily recognize and/or interpret situational threat cues as
rejection cues than female test takers who might not have the same
life experiences (i.e., women successfully pursuing careers not
related to math; cf. Halpern et al., 2007). Future research may want