Does stereotype threat influence performance of girls in
stereotyped domains? A meta-analysis☆
Paulette C. Flore⁎, Jelte M. Wicherts
Tilburg University, The Netherlands
Article info
Article history:
Received 26 November 2013
Received in revised form 24 October 2014
Accepted 25 October 2014
Available online xxxx
Abstract
Although the effect of stereotype threat concerning women and mathematics has been subject to
various systematic reviews, none of them have been performed on the sub-population of children
and adolescents. In this meta-analysis we estimated the effects of stereotype threat on performance
of girls on math, science and spatial skills (MSSS) tests. Moreover, we studied publication
bias and four moderators: test difficulty, presence of boys, gender equality within countries, and
the type of control group that was used in the studies. We selected study samples when the
study included girls, samples had a mean age below 18 years, the design was (quasi-)experimental,
the stereotype threat manipulation was administered between-subjects, and the dependent
variable was a MSSS test related to a gender stereotype favoring boys. To analyze the 47 effect
sizes, we used random effects and mixed effects models. The estimated mean effect size equaled
−0.22 and significantly differed from 0. None of the moderator variables was significant; however,
there were several signs for the presence of publication bias. We conclude that publication
bias might seriously distort the literature on the effects of stereotype threat among schoolgirls.
We propose a large replication study to provide a less biased effect size estimate.
© 2014 Society for the Study of School Psychology. Published by Elsevier Ltd. All rights reserved.
Keywords:
Stereotype threat
Math/science test performance
Gender gap
Test anxiety
Publication bias
Meta-analysis
1. Introduction
Spencer, Steele, and Quinn (1999) first suggested that women's performance on mathematics tests could be disrupted by the
presence of a stereotype threat. This initial paper inspired many researchers to replicate the stereotype threat effect and expand the
theory by introducing numerous moderator variables and various dependent variables related to negative gender stereotypes, such
as tests of Mathematics, Science, and Spatial Skills (MSSS). This practice resulted in approximately one hundred research papers
and five meta-analyses (Nguyen & Ryan, 2008; Picho, Rodriguez, & Finnie, 2013; Stoet & Geary, 2012; Walton & Cohen, 2003;
Walton & Spencer, 2009). Although four of these systematic reviews (Nguyen & Ryan, 2008; Picho et al., 2013; Walton & Cohen,
2003; Walton & Spencer, 2009) confirmed the existence of a robust mean stereotype threat effect, some ambiguities regarding this
effect remain. For instance, it has been suggested (⁎Ganley et al., 2013; Stoet & Geary, 2012) that the stereotype threat literature is
subject to an excess of significant findings, which might be caused by publication bias (Ioannidis, 2005; Rosenthal, 1979), p-hacking
(i.e., using questionable research practices to obtain a statistically significant effect; Simonsohn, Nelson, & Simmons, 2013), or both
(Bakker, van Dijk, & Wicherts, 2012). A less controversial but nevertheless interesting issue is the age at which stereotype threat
begins to influence performance on MSSS tests: does stereotype threat already influence children's performance, or does this effect
only emerge during early adulthood? Both of these issues are addressed in this article by means of a meta-analysis of the stereotype
threat literature in the context of schoolgirls' MSSS test performance. We will introduce these topics by providing a general review of
the literature on stereotype threat and the onset of gender differences in the domains of MSSS.
☆ The preparation of this article was supported by grant numbers 016-125-385 and 406-12-137 from the Netherlands Organization for Scientific Research (NWO).
⁎ Corresponding author at: Department of Methodology and Statistics, Tilburg School of Behavioral and Social Sciences, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands.
E-mail address: P.C.Flore@tilburguniversity.edu (P.C. Flore).
Action Editor: Craig Albers
Journal of School Psychology xxx (2014) xxx–xxx
journal homepage: www.elsevier.com/locate/jschpsyc
Please cite this article as: Flore, P.C., & Wicherts, J.M., Does stereotype threat influence performance of girls in stereotyped
domains? A meta-analysis, Journal of School Psychology (2014), http://dx.doi.org/10.1016/j.jsp.2014.10.002
1.1. Stereotype threat
The effect of stereotype threat refers to the ramifications of an activated negative stereotype or an emphasized social identity
(Steele, 1997). Individuals who are members of a stigmatized group tend to perform worse on stereotype relevant tasks when
confronted with that negative stereotype (Steele & Aronson, 1995). In their seminal paper, Steele and Aronson (1995) focused on
ethnic minorities as stereotyped group. Later experiments showed similar effects for other stigmatized groups, including women in
the quantitative domain (e.g., Ambady, Paik, Steele, Owen-Smith, & Mitchell, 2004; Brown & Josephs, 1999; Oswald & Harvey,
2001; Schmader & Johns, 2003; Spencer et al., 1999). In these experiments, women were either assigned to a stereotype threat
condition, where they were exposed to a gender-related stereotype threat (e.g., a written statement that men perform better on
mathematics tests than women), or to a control condition, where they were not exposed to such a threat. When participants subsequently
completed a MSSS test (e.g., a mathematical test), women who were assigned to the stereotype threat condition averaged
lower scores than women who were assigned to the control condition (Ambady et al., 2004; Brown & Josephs, 1999; Oswald &
Harvey, 2001; Schmader & Johns, 2003; Spencer et al., 1999). The results of these studies were deemed important, because researchers
suspected that stereotype threat could be a driving force behind the decision of women to leave the science, technology, engineering,
and mathematics (STEM) fields (Cheryan & Plaut, 2010; Schmader, Johns, & Barquissau, 2004). These developments led to an expansion
of the stereotype threat literature, in which several moderator and mediator variables were studied.
Of all the studied moderator and mediator variables, we will summarize those variables that have been studied most frequently.
Item difficulty appears to moderate the effects of stereotype threat, with difficult items leading to stronger effects (Campbell & Collaer,
2009; O'Brien & Crandall, 2003; Spencer et al., 1999; Wicherts, Dolan, & Hessen, 2005). Test-takers who are strongly identified with
the relevant domain, in this case the domain of mathematics, science or spatial skills, appear to show stronger stereotype threat effects
(Cadinu, Maass, Frigerio, Impagliazzo, & Latinotti, 2003; Lesko & Corpus, 2006; Pronin, Steele, & Ross, 2004; Steinberg, Okun, & Aiken,
2012). Another theoretical moderator is gender identification; the effects of stereotype threat are generally more severe for women
who are highly gender-identified (Kiefer & Sekaquaptewa, 2007; Rydell, McConnell, & Beilock, 2009; Schmader, 2002; Wout, Danso,
Jackson, & Spencer, 2008). However, the latter results were contradicted in a Swedish study (Eriksson & Lindholm, 2007). Moreover,
the effects of stereotype threat appear stronger within a threatening environment (e.g., in the presence of men, or when negatively
stereotyped test-takers hold a minority status) compared to a safe environment (e.g., in the presence of women only, or when holding
a majority status; Gneezy, Niederle, & Rustichini, 2003; Inzlicht, Aronson, Good, & McKay, 2006; Inzlicht & Ben-Zeev, 2003;
Sekaquaptewa & Thompson, 2003). The presence of role models also appears to moderate the effect of stereotype threat, in such a
way that role models that contradict the stereotype (i.e., women who are good in mathematics or men who lack mathematical skills)
appear to protect females from the debilitating effects of stereotype threat on MSSS test performance (Elizaga & Markman, 2008; Marx
& Ko, 2012; Marx & Roman, 2002; McIntyre, Paulson, Taylor, Morin, & Lord, 2011; Taylor, Lord, McIntyre, & Paulson, 2011). Finally,
several researchers suggested that the stereotype threat effect is (partly) mediated by arousal (Ben-zeev, Fein, & Inzlicht, 2005),
anxiety and worries (Brodish & Devine, 2009; Ford, Ferguson, Brooks, & Hagadone, 2004; Gerstenberg, Imhoff, & Schmitt, 2012;
Osborne, 2001, 2007), or the occupation of working memory (Beilock, Rydell, & McConnell, 2007; Bonnot & Croizet, 2007; Rydell,
Rydell, & Boucher, 2010; Schmader & Johns, 2003).
The literature on the effects of stereotype threat has been summarized by five meta-analyses that covered heterogeneous subsets
of studies (Nguyen & Ryan, 2008; Picho et al., 2013; Stoet & Geary, 2012; Walton & Cohen, 2003; Walton & Spencer, 2009). These
broad-stroke meta-analyses estimated a small to medium significant effect before moderators were taken into account, with
standardized mean differences ranging from 0.24 (Picho et al., 2013) to 0.48 (Walton & Spencer, 2009). These findings seemed to confirm
that the effect is rather stable, although most of these meta-analyses reported heterogeneity in effect sizes (Picho et al., 2013;
Stoet & Geary, 2012; Walton & Cohen, 2003). In fact, the previous meta-analyses included diverse tests, settings, and stereotyped
groups, which makes it hard to pinpoint exactly why some studies show larger effects than others. Although these large scale
meta-analyses are interesting to portray an overall picture, a more homogeneous subset of studies is preferred when dealing with
specific questions, like the degree to which the stereotype threat related to gender also influences MSSS performance in schools.
Thus, we addressed this issue by selecting a specific stereotyped group and stereotype (i.e., women and their supposed inferior capacity
of solving mathematical or spatial tasks) and a specific age group (i.e., those younger than 18 years), which should result in a less
heterogeneous set of effect sizes. These design elements enabled us to describe the influence of stereotype threat on MSSS test
performance for females in critical periods of human development, namely childhood and adolescence.
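To make the quantities in this paragraph concrete, the following is a minimal sketch of how a standardized mean difference and a random-effects pooled estimate can be computed from summary statistics. It is an illustration only: the function names, the threat-minus-control sign convention, the large-sample variance approximation, and the DerSimonian-Laird estimator of between-study variance are assumptions made for this example, not the procedure reported by any of the cited meta-analyses.

```python
import math


def cohens_d(mean_threat, sd_threat, n_threat, mean_control, sd_control, n_control):
    """Standardized mean difference (threat minus control) using the pooled SD.

    A negative value means the threat group scored lower (assumed convention).
    """
    pooled_var = (
        (n_threat - 1) * sd_threat ** 2 + (n_control - 1) * sd_control ** 2
    ) / (n_threat + n_control - 2)
    return (mean_threat - mean_control) / math.sqrt(pooled_var)


def d_variance(d, n_threat, n_control):
    """Common large-sample approximation of the sampling variance of d."""
    n_total = n_threat + n_control
    return n_total / (n_threat * n_control) + d ** 2 / (2 * n_total)


def random_effects_mean(effects, variances):
    """Pooled estimate under a random-effects model.

    Uses the DerSimonian-Laird estimator (an assumption of this sketch).
    Returns (pooled mean, tau squared, Q), where Q is the heterogeneity statistic.
    """
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    mean = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return mean, tau2, q


if __name__ == "__main__":
    # Hypothetical summary statistics, invented for illustration only.
    d = cohens_d(9.0, 2.0, 20, 10.0, 2.0, 20)
    print(d, d_variance(d, 20, 20))
    print(random_effects_mean([0.0, 2.0], [0.5, 0.5]))
```

A Q statistic well above its degrees of freedom yields a positive between-study variance, which is the kind of heterogeneity in effect sizes that the earlier meta-analyses reported.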
1.2. Stereotype threat and children
Although the effects of stereotype threat on women were traditionally studied within adult populations (Spencer et al., 1999), multiple
studies over the last 15 years have been carried out with children and adolescents as participants (e.g., Ambady, Shih, Kim, &
Pittinsky, 2001; ⁎Keller & Dauenheimer, 2003). Studies on children and adolescents in schools contribute to the literature for at
least three reasons: (1) to find out at which age the stereotype threat effect actually emerges, (2) to study the stereotype threat effect
in the natural setting of the classroom instead of the laboratory setting, and (3) to address the question whether variables that
moderate the stereotype threat effect in adult samples similarly moderate the stereotype threat effect among children.
The primary research on stereotype threat with children as participants (i.e., studies that we included in our meta-analysis) roughly
shared a similar design, although the details of the designs varied somewhat. Typically, the studies were conducted by means of an
experiment or a quasi-experiment involving a stereotype threat condition and a control condition as predictor variable, sometimes in
combination with a third or fourth condition (⁎Cherney & Campbell, 2011; ⁎Picho & Stephens, 2012). These conditions were typically
designed as a between-subjects factor. Some variations exist in the implementation of the stereotype threat and control conditions.
The stereotype threat manipulation was administered either explicitly or implicitly. The explicit stereotype threat manipulation
usually involved a written or verbal statement that informed participants that the MSSS test they were about to complete produced
gender differences, whereas the implicit stereotype threat manipulations triggered the gender stereotype without explicitly mentioning
the gender gap. Further examples of the two types of stereotype threat manipulations are illustrated in Table 1. The control condition
was designed to either nullify or not nullify stereotype threat. In the nullified control condition the stereotype threat was
actively removed, generally by a written or verbal statement which informed participants that the MSSS test they were about to complete
did not produce gender differences, whereas in the non-nullified control condition no gender related information was provided.
Further examples of the two types of control conditions are illustrated in Table 2.
The outcome measures in studies of stereotype threat among schoolgirls to date were MSSS tests; most studies involved a mathematical
test properly adjusted to the age and ability level of the participants (e.g., ⁎Keller & Dauenheimer, 2003; ⁎Muzzatti & Agnoli,
2007). A few studies used the Mental Rotation Task (e.g., ⁎Moè & Pazzaglia, 2006; ⁎Neuburger, Jansen, Heil, & Quaiser-Pohl, 2012;
⁎Titze, Jansen, & Heil, 2010), which measured children's spatial abilities, a concept tightly linked to mathematics and gender stereotypes.
Remaining dependent variables were the performance on a physics test (⁎Marchand & Taasoobshirazi, 2012), a chemistry
comprehension test (⁎Good, Woodzicka, & Wingfield, 2010) or recall performance of a geometric figure (⁎Huguet & Régner, 2009).
These tests generally consisted of 10 to 40 questions.
Table 1
Types of stereotype threat manipulations.
Manipulation condition | Manipulation | Example | Examples of papers
Explicit | Verbal or written statement that boys are superior to girls on the test | “It [the test] is comprised of a collection of questions which have been shown to produce gender differences in the past. Male participants outperformed female participants.” | ⁎Cherney and Campbell (2011), ⁎Keller and Dauenheimer (2003)
Explicit | Verbal statement that boys are really good in the test | “Boys are really good at this game.” | ⁎Cimpian, Mu, and Erickson (2012)
Implicit | Participants filling out their gender | – | ⁎Stricker and Ward (2004)
Implicit | Visual depiction of a stereotypical situation | Showed pictures of male scientists/mathematicians | ⁎Good et al. (2010), ⁎Muzzatti and Agnoli (2007)
Implicit | Priming female identity | The story described a girl using a number of traits that were stereotypically feminine in participants' cultural context (e.g., long blond hair, blue eyes, and colorful clothes). | ⁎Tomasetto, Alparone, and Cadinu (2011)
Implicit | Framing the question as a geometric problem | – | ⁎Huguet and Régner (2007), ⁎Huguet and Régner (2009)
Table 2
Types of control conditions.
Control condition | Information | Example | Examples of papers
No Threat | No information given with regards to the relationship between gender and performance on the test | – | ⁎Delgado and Prieto (2008), ⁎Muzzatti and Agnoli (2007)
Nullified | Verbal or written statement that girls are superior to boys on the test | “It is comprised of a collection of questions which have been shown not to produce gender differences in the past. The average achievement of male participants was equal to the achievement of female participants.” | ⁎Cherney and Campbell (2011)
Nullified | Verbal or written statement that girls and boys perform equally well on the test | “In such tasks, boys and girls are equally skilled. Both have an equal ability to imagine how pictures and objects look when they are rotated. Therefore, such tasks are exactly equally difficult or easy for girls and boys.” | ⁎Neuburger et al. (2012)
Nullified | Education about the stereotype threat effect | “Research has shown that men perform better than women in this test and obtain higher scores. This superiority is caused by a gender stereotype, i.e., by a common belief in male superiority in spatial tasks, and has nothing to do with lack of ability.” | ⁎Moè (2009)
Nullified | Written description of a counter-stereotypical situation | “Marie was described as a successful student in math” | ⁎Bagès and Martinot (2011)
Nullified | Visual depiction of a counter-stereotypical situation | “Participants were randomly assigned to one of three experimental conditions by inviting them to color a picture, in which a girl correctly resolves the calculation whereas a boy fails to respond” | ⁎Galdi et al. (2014)
1.3. Developmental aspects of stereotype threat
The onset and development of the effects of stereotype threat on girls in mathematics throughout the life course is an interesting
issue; however, few solid conclusions have been reached (Aronson & Good, 2003; Jordan & Lovett, 2007). To explore possible theories
on how age might influence stereotype threat, we recollect the most important moderators that were identified in the research on
young adults and subsequently consider whether these could influence stereotype threat differently throughout the development
of children. The most important moderators among adults are gender identification, domain identification, stigma consciousness,
and beliefs about intelligence (Aronson & Good, 2003). Thus, women who strongly identify with both the academic domain of mathematics
(Cadinu et al., 2003; Lesko & Corpus, 2006; Pronin et al., 2004; Steinberg et al., 2012) and the female gender (Kiefer &
Sekaquaptewa, 2007; Rydell et al., 2009; Schmader, 2002; Wout et al., 2008) are expected to experience stronger performance decrements
compared to women who less strongly identify with those domains. Additionally, women who believe that the stereotypes
regarding women and mathematics are true (Schmader et al., 2004) and that mathematical ability is a stable and fixed characteristic
(Aronson & Good, 2003) are purported to show stronger stereotype threat effects. The current knowledge about the development of
these four traits can be used as guidance for the expectations of the impact of stereotype threat throughout different age groups
(Aronson & Good, 2003).
1.4. Gender identification
Gender identification is present at an early age. At the age of 3 years, a majority of children are able to correctly label their own
gender (Katz & Kofkin, 1997). A study on 3- to 5-year-olds (Martin & Little, 1990) showed that these children are not only able to
correctly label their gender and distinguish men from women but also prefer sex-typed toys that correspond to their gender (i.e., boys
preferring masculine sex-typed toys and girls preferring feminine sex-typed toys). When children reach the age of 6 to 7 years, they
master the concept of gender constancy and so understand that gender is stable over time and consistent (Bussey & Bandura, 1999).
Based on these studies one could argue that because gender identity is already stable at a young age, even young children are potentially
vulnerable to performance decrements caused by stereotype threat. However, Aronson and Good (2003) argued that although
children are already aware of their gender from an early age on, they do not form a coherent sense of the self until
adolescence, which would keep younger children from being vulnerable to stereotype threat.
1.5. Stigma consciousness
The studies on development of awareness of the stereotype (stigma consciousness) have shown mixed results. Various studies
showed that children believe that boys are either better in mathematics or are identified more strongly with the field of mathematics
compared to girls, for ages 6 to 11 (Cvencek, Meltzoff, & Greenwald, 2011; Eccles, Wigfield, Harold, & Blumenfeld, 1993; Lummis &
Stevenson, 1990) and ages 14 and 22 (Steffens & Jelenec, 2011). In Steffens and Jelenec (2011), older participants endorsed the stereotypes
more strongly than the younger participants. A meta-analysis on affects and attitudes concerning mathematics showed that
adolescents and young adults from different age groups (11 to 25 years old) all see mathematics more as a male domain (Hyde,
Fennema, Ryan, Frost, & Hopp, 1990). These gender stereotypes are also present in the classroom; teachers tend to see boys as
more competent in mathematics (Li, 1999), they expect mathematics to be more difficult for girls (Tiedemann, 2000), and they expect
that failure in mathematics for girls more likely originates from a lack of ability, whereas failure for boys originates from a lack of effort
(Fennema, Peterson, Carpenter, & Lubinski, 1990; Tiedemann, 2000). However, counterintuitive evidence regarding stigma consciousness
has also been found more recently: some studies failed to find convincing evidence that children explicitly believe in the traditional
stereotype (Ambady et al., 2001; Kurtz-Costes, Rowley, Harris-Britt, & Woods, 2008), other studies found that children believe
in non-traditional stereotypes (Martinot, Bagès, & Désert, 2012; Martinot & Désert, 2007), and another study found that teachers do
not hold stereotypical beliefs (Leedy, LaLonde, & Runk, 2003). Additionally, a more recent study found that when it comes to overall
academic competency 6- to 10-year-olds hold the stereotype that girls outperform boys (Hartley & Sutton, 2013), and these children
actually believe that adults hold those stereotypes as well. A stereotype threat manipulation addressing this stereotype actually negatively
influenced the performance of boys on a test that included different domains, including mathematics. Moreover, a longitudinal
study showed that over different grades, teachers either rated the girls in their classes significantly higher in mathematical ability than
boys, or rated girls and boys as roughly equivalent in mathematical ability, even when there was a significant gender gap in performance
on a mathematics test favoring males (Robinson & Lubienski, 2011). Some argue that this evidence against the stereotype regarding
mathematics and gender in recent studies might indicate that the gender stereotype as we know it is outdated (Martinot et al.,
2012). Also, relatively little research has addressed whether gender stereotypes are comparable over time (e.g., during the 1980s vs.
during the 2010s) or across different countries or smaller cultural units (as we addressed in the section Moderators).
1.6. Domain identification
Few studies have been conducted on the development of academic identification, or domain identification, in children (Aronson &
Good, 2003). A study by ⁎Keller (2007) on 15-year-olds indicated that domain identification moderated the effect of stereotype threat
on math performance. Specifically, girls in a stereotype threat condition who considered themselves as low identifiers in the mathematical
domain performed better on difficult math items, whereas girls who considered themselves as high identifiers in the mathematical
domain performed worse on difficult math items. Although little attention has been given to domain identification in the
context of stereotype threat and development, research on affect and attitude of girls towards mathematics over different age groups
could provide information on how domain identification might fluctuate. For instance, the gender gap of positive attitudes towards
and self-confidence in mathematics is virtually non-existent for children between the ages of 5 to 10 years but grows wider in
older age groups, with boys being more positive and self-confident than girls (Hyde et al., 1990). Thus, it seems that, generally, adolescent
girls have less confidence in and fewer positive attitudes towards mathematics compared to boys of their age, which might be
an indication that older girls also identify themselves less with the mathematical domain. In the context of stereotype threat, this pattern
of findings would lead us to expect that adolescent girls are actually less vulnerable to the effects of stereotype threat compared to
pre-teenage girls.
1.7. Beliefs about intelligence
The literature on beliefs about intelligence and academic ability describes rather straightforwardly how those beliefs change
throughout the development of children. Children younger than 7 years do not yet comprehend that intelligence and ability are personal
traits that are stable over time and that the role of effort in academic performance is limited (Droege & Stipek, 1993; Stipek &
Daniels, 1990). At this age, children confuse intelligence and ability with social–moral qualities: a good or nice person equals a
smart person and vice versa (Droege & Stipek, 1993; Heyman, Dweck, & Cain, 1992). Because young children do not yet see academic
abilities as fixed traits, they tend to be overly optimistic about their performances and overestimate their position on academic performances
relative to their classmates (Nicholls, 1979). When children reach the age of 7 or 8, their theories seem to shift, in such a
way that older children believe in more temporally constant abilities (Kinlaw & Kurtz-Costes, 2003). At this age, the children predict
more stable levels of intelligence (Dweck, 2002; Wigfield et al., 1997), and they believe less in the role of effort (Stipek & Daniels,
1990). Additionally, they are better able to distinguish ability from social or moral qualities (Droege & Stipek, 1993; Heyman et al.,
1992; Stipek & Daniels, 1990). As a consequence, beginning at approximately age 7 to 8 years, children are less optimistic and
more realistic about their future academic performances and their position within the classroom compared to their peers (Eccles
et al., 1989; Nicholls, 1979). These findings imply that stereotype threat would only have an effect on children who are at least 7 to
8 years old. If indeed these notions about abilities are crucial for stereotype threat, younger children most likely do not even see mathematical
ability as a fixed trait; hence, there would be little reason for them to feel threatened by stereotypes regarding mathematical
competency. In contrast, older children would have the capacity to understand that effort will not necessarily compensate for a lack of
ability and hence be susceptible to stereotype threat.
Although studies on the development of gender identity, stigma consciousness, and beliefs about intelligence seem to imply that
children below the age of 8 or 10 will probably not be influenced by stereotype threat, the line of evidence concerning these potential
age-related moderating variables we discussed here is indirect. That is, it is unclear whether moderators that were found to be relevant
for stereotype threat among young adults also are relevant among schoolgirls. In addition, the conclusion that children below the
age of 8 or 10 will probably not be influenced by stereotype threat is in contrast with the theory on domain identification, which
would actually predict the opposite. It is therefore important to collate all the evidence that speaks to the ages at which stereotype
threat effects among schoolgirls actually emerge. In our meta-analysis, we therefore (a) explored whether age is a moderator of
the stereotype threat effect among schoolgirls and (b) studied the moderators (at the level of studies) that are implicated in stereotype
theory as being relevant for stereotype threat.
1.8. Moderators
1.8.1. Test difficulty
In our meta-analyses we considered, in addition to the exploratory moderator of age, four confirmatory moderators on the basis of
theory and previous results (Nguyen & Ryan, 2008; Picho et al., 2013; Steele, 2010). The first moderator we hypothesized to have an
influence on the effect of stereotype threat is test difficulty. Studies on the adult population showed that test difficulty is an important
moderator (e.g., Nguyen & Ryan, 2008; Spencer et al., 1999). The moderation of test difficulty on the stereotype threat effect is often
explained in terms of arousal (Ben-zeev et al., 2005), although psychometric reasons may also play a role (Wicherts et al., 2005). Studies
showed that the stereotype threat effect appears to be mediated by arousal or anxiety (Ben-zeev et al., 2005; ⁎Delgado & Prieto,
2008; Gerstenberg et al., 2012; Osborne, 2001); thus, the more anxious or aroused participants are, the worse they will perform on
a mathematical test. Relatively difficult items are more threatening than easy items; therefore, they lead to a higher state of arousal,
which in turn will result in a larger gender gap in mathematical test performance (⁎Delgado & Prieto, 2008; O'Brien & Crandall, 2003).
These findings corresponded to traditional findings of social facilitation, which showed that arousal leads to diminished performance
on a difficult task, whereas arousal leads to enhanced performance when the task is well learned (Markus, 1978; Zajonc, 1965). The
moderating role of test anxiety might be explained by the fact that solving difficult questions requires a larger working memory capacity
than solving easy questions (Beilock et al., 2007). When worrying thoughts provoked by stereotype threat occupy part of the
working memory, solving a difficult question becomes problematic, whereas easy questions are still solvable because they do not require
a large working memory capacity (Eysenck & Calvo, 1992). This mechanism leads to score reduction for difficult tests but not for
easy tests. With the former in mind, we expected that the effect of stereotype threat would be stronger in studies that use a relatively
difficult test compared to studies that use a relatively easy test. We defined difficulty here as the degree to which those in the sample
answer items in the test correctly. Psychometrically advanced analyses that formally model the item difficulties are beyond the scope
of this meta-analysis because they require the raw data.
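The difficulty definition above (the degree to which the sample answers the test items correctly) can be illustrated with a short sketch. The function names and the example numbers are hypothetical, and the source does not state how this proportion was derived for each study; the second function simply assumes that a mean score and the number of items are available from a paper's summary statistics.

```python
def proportion_correct(item_scores):
    """Overall proportion of correctly answered items in a sample.

    item_scores: one list per test-taker, with 1 for a correct item and 0 otherwise.
    A higher proportion indicates an easier test for this sample.
    """
    total_items = sum(len(person) for person in item_scores)
    if total_items == 0:
        raise ValueError("no item responses supplied")
    return sum(sum(person) for person in item_scores) / total_items


def proportion_correct_from_summary(mean_score, n_items):
    """Same quantity from summary statistics: mean number correct over test length."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return mean_score / n_items


if __name__ == "__main__":
    # Hypothetical responses of two test-takers on a four-item test.
    print(proportion_correct([[1, 0, 1, 1], [0, 0, 1, 1]]))
    # Hypothetical summary: mean of 12 correct answers on a 20-item test.
    print(proportion_correct_from_summary(12.0, 20))
```

Because only raw proportions are used, this measure depends on the ability of the particular sample as well as on the items, which is why formal item-level models would need the raw data.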
1.8.2. Presence of boys
The second variable that we predicted to moderate the stereotype threat effect among schoolgirls is the absence or presence of boys
during test-taking. Several studies showed that female students tend to underperform on negatively stereotyped tasks in the presence
of male students who are working on the same task (Gneezy et al., 2003; Inzlicht & Ben-zeev, 2000; Inzlicht & Ben-Zeev, 2003; Picho
et al., 2013; Sekaquaptewa & Thompson, 2003). This effect might be explained by the salience of gender identity; gender becomes
more salient for women who hold the minority in a group than for women who are in a same-sex group (Cota & Dion, 1986;
Mcguire, Mcguire, & Winton, 1979). In turn, the heightened salience of gender identity might lead to stronger effects of stereotype
threat. People who hold a minority or token status within a group tend to suffer from cognitive deficits (Lord & Saenz, 1985), a phenomenon
that is even registered when women simply watch a gender unbalanced video of a conference in a mathematical domain
(Murphy, Steele, & Gross, 2007). The combination of both the activation of gender identity and reduced cognitive performance due
to social pressure caused by a minority status then leads to worse performance for women confronted with stereotype threat in a
mixed-gender setting. Thus, we predicted the stereotype threat effect among schoolgirls to be stronger in studies in which boys
were present during test administration, compared to studies in which no boys were present during test administration.
1.8.3. Cross-cultural gender equality
The third moderator we studied was cross-cultural gender equality, or the degree to which women are deemed equal to men in the several nations where the selected stereotype threat studies took place. Recent studies showed marked cross-cultural differences in the gender gap in mathematical performance across countries (Else-Quest, Hyde, & Linn, 2010; Mullis, Martin, Foy, & Arora, 2012; Organisation for Economic Co-operation and Development (OECD), 2010). In the cross-cultural study on 15-year-old students carried out by OECD (i.e., the Programme for International Student Assessment or PISA) within 65 countries, boys significantly outperformed girls on the mathematical test in 54% of the countries, whereas in 8% of the countries girls outperformed boys. In 38% of the countries, no significant difference between the two sex groups was found. Comparable are the Trends in International Mathematics and Science Study (TIMSS) studies (Mullis et al., 2012) on fourth graders within 50 countries, in which boys outperformed girls in 40% of the countries, girls outperformed boys in 8% of the countries, and no significant differences were found in 52% of the countries. However, the results of the TIMSS studies for eighth graders in 42 countries were different: in 31% of the countries, girls outperformed boys, while in only 17% of the countries, boys outperformed girls, and in 52% of the countries no significant differences emerged. Overall, the sex differences for the majority of countries were quite small. The differences between countries concerning the gender gap in mathematics were proposed to be associated with the gender equality and amount of stereotyping within countries (Else-Quest et al., 2010; Guiso, Monte, & Sapienza, 2008; Nosek et al., 2009). Some studies showed that gender equality is associated with the gender gap in mathematics for school aged children (Else-Quest et al., 2010; Guiso et al., 2008). Gender equality also has a negative relation with anxiety, and a positive relation with girls' self-concept and self-efficacy concerning the mathematical domain (Else-Quest et al., 2010). In addition, the gender gap in mathematical test performance could be predicted by cross-national differences in Implicit Association Test-scores on the gender–science relation (Nosek et al., 2009). Based on these results, we expected that the stereotype threat effect among schoolgirls would be stronger for studies conducted in countries with low levels of gender equality compared to countries with high levels of gender equality. To operationalize this variable, we used the Gender Gap Index (GGI; Hausmann, Tyson, & Zahidi, 2012), which is an index that incorporates economic participation, educational attainment, political empowerment, and health and survival of women relative to men. Higher scores on the GGI indicate a higher degree of gender equality. Geographical regions have been used before as moderator variable in the meta-analysis on stereotype threat and mathematical performance by Picho et al. (2013); however, they only studied regions within the United States of America.
1.8.4. Type of control condition
The last moderator we studied concerned the type of control condition participants were assigned to. Stereotype threat experiments involve the use of two or more conditions that differ in stereotype threat, such that conditions can be ranked by severity of stereotype threat. The condition that supposedly ranks lowest on stereotype threat severity is the control condition, which consists either of a situation where participants do not receive any gender related information (e.g., ⁎Delgado & Prieto, 2008; ⁎Muzzatti & Agnoli, 2007), or a so-called nullified control condition. This nullified control condition is designed to actively remove the stereotype threat, usually by informing test-takers that girls perform equally well as boys or even that girls outperform boys on the mathematical test (⁎Cherney & Campbell, 2011; ⁎Neuburger et al., 2012). There are indications that test-takers who are assigned to a nullified control condition outperform those who are assigned to a condition in which no additional information has been given (Campbell & Collaer, 2009; Smith & White, 2002; Walton & Cohen, 2003; Walton & Spencer, 2009). This effect is explained by the fact that whenever women are confronted with a MSSS test their gender identity already becomes salient by the well-known stereotype (Smith & White, 2002); giving no additional information would thus entail a form of implicit threat activation. Therefore, we expected the effect of stereotype threat among schoolgirls to be stronger in studies that involved a nullified control condition compared to studies that involved a control condition without additional information.
1.9. Publication bias and p-hacking
Although the existence of the stereotype threat effect seems widely accepted, there are some reasons to doubt whether the effect is as solid as it is often claimed to be. Based on recent published and unpublished studies that fail to replicate the effects of stereotype threat, ⁎Ganley et al. (2013) suggested that the literature on the stereotype threat effect in children might suffer from publication bias, a claim that had also been made for the wider stereotype threat literature involving females and mathematics (Stoet & Geary, 2012). Publication bias refers to the practice of primarily publishing articles in which significant results are shown, thus leaving the so-called null results in the file drawer (Ioannidis, 2005; Rosenthal, 1979; Sterling, 1959), a practice that can lead to serious inflations of estimated effect-sizes in meta-analyses (Bakker et al., 2012; Sutton, Duval, Tweedie, Abrams, & Jones, 2000).
According to Ioannidis (2005) a research field is particularly vulnerable to publication bias if the field (1) features studies with small sample sizes; (2) concerns small effect sizes; (3) focuses on a large number of relations; (4) involves studies with a large flexibility in design, definitions, and outcomes; (5) is popular and so features many studies; and (6) deals with topics relevant to financial or political interest. The field of stereotype threat is susceptible to publication bias, because all six characteristics are present to some extent in stereotype threat research. For instance, most studies (39 out of the 47 studies) have a total sample size smaller than 100; the averaged effect sizes found in the recent meta-analyses lie between 0.24 (Picho et al., 2013) and 0.45 (Walton & Spencer, 2009), which are classified as small to medium effect sizes¹ (Cohen, 1992); and the use of multiple dependent variables and covariates is common practice (Stoet & Geary, 2012), despite problems associated with covariate corrections (Wicherts, 2005). Furthermore, the design is often flexible with different kinds of manipulations, control conditions, and moderators. Moreover, the number of published studies attests to the popularity of the topic, and several stereotype threat researchers called for affirmative action based on their research (e.g., by means of a policy paper (Walton, Spencer, & Erman, 2013) or the Brief of Experimental Psychologists et al., 2012, for the case of Fisher vs. the University). With the former in mind, we expected to find indications of publication bias in our meta-analytic data set.
If we want to draw conclusions based on the outcomes of a meta-analysis, we assume that the outcomes of the included studies are reliable. Unfortunately the outcomes of some studies might be distorted due to questionable research practices (QRPs) in collection of data, reporting of results, and analysis of data. The term QRPs defines a broad set of decisions made by researchers that might positively influence the outcome of their studies. Four examples of frequently used QRPs are (1) failing to report all the dependent variables, (2) collecting extra data when the test statistic is not significant yet, (3) excluding data when it lowers the p-value of the test statistic, and (4) rounding down p-values (John, Loewenstein, & Prelec, 2012). The practice of using these QRPs with the purpose of obtaining a statistically significant effect is referred to as “p-hacking” (Simonsohn et al., 2013). p-Hacking can seriously distort the scientific literature because it enlarges the chance of a Type-I error (Simmons, Nelson, & Simonsohn, 2011), and it leads to inflated effect sizes in meta-analyses (Bakker et al., 2012). If many researchers who work within the same field invoke p-hacking, then an effect that does not exist at the population level might become established. Simonsohn et al. (2013) have developed the p-curve: a tool aimed to distinguish whether a field is infected by selective reporting, or whether results are truthfully reported. When most researchers within a field truthfully reported correct p-values, a distribution of statistically significant p-values should be right skewed (provided there is an actual effect in the population), whereas the distribution of p-values for a field in which researchers p-hack will be left skewed. With the p-curve, we can test whether it is likely that p-values within this field are p-hacked.
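The skew logic of the p-curve can be illustrated with a minimal Python sketch. It is a simplification, not the procedure used in this meta-analysis: it only counts significant p-values below and above .025 and applies a one-sided binomial test against a 50/50 split. The function names and the example p-values are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def p_curve_skew(p_values, alpha=0.05):
    """Split significant p-values at alpha / 2 and test each direction of skew."""
    significant = [p for p in p_values if p < alpha]   # the p-curve ignores p >= alpha
    n_low = sum(1 for p in significant if p < alpha / 2)
    n_high = len(significant) - n_low
    return {
        "n_significant": len(significant),
        "n_low": n_low,      # many low p-values: right skew (evidential value)
        "n_high": n_high,    # many high p-values: left skew (possible p-hacking)
        "p_right_skew": binomial_tail(n_low, len(significant)),
        "p_left_skew": binomial_tail(n_high, len(significant)),
    }

# Hypothetical p-values; the last one is not significant and is dropped.
example = [0.001, 0.004, 0.011, 0.020, 0.032, 0.047, 0.21]
result = p_curve_skew(example)
```

With so few significant results the binomial test has little power, so a sketch like this is best read as a description of the curve's shape rather than as a decisive test.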
2. Method
2.1. Search strategies
A literature search was conducted using the databases ABI/INFORM, PsycINFO, ProQuest, Web of Science (searched in March 2013), and ERIC (searched in January 2014). Combined, these five databases cover the majority of the psychological and educational literature. The keywords that we used in the literature search (in conjunction with the phrase “stereotype threat”, which needed to be present in the abstract) were “gender,” “math,” “performance,” or “mental rotation,” and “children,” “girls,” “women,” or “high school.” This search strategy resulted in several search strings that were connected by the search term “AND,” such as “ab(“stereotype threat”) AND children AND gender.” In addition two cited-reference searches on Web of Science were conducted; we targeted the oldest paper that we obtained from the first part of our literature search (Ambady et al., 2001) and the classical paper on stereotype threat and gender by Spencer et al. (1999). Additionally, we performed a more informal search on Google Scholar for which we used the same keywords as our other database searches. With this strategy we obtained two extra articles.
An important part of a meta-analysis is the search for unpublished studies or data (i.e., gray literature). We automatically searched parts of the gray literature by our search on Google Scholar and using databases PsycINFO, ERIC, and ProQuest; they do not only contain published papers but also dissertations and conference proceedings. Moreover, in order to find unpublished studies we used three additional strategies. First, we e-mailed the first authors of the included published papers with the question whether they possessed any unpublished data or were familiar with unpublished studies by other researchers. Second, we screened the abstracts of poster
¹ Although widely used, Cohen's rules of thumb for small, medium, and large effects may not be entirely appropriate here. Set against the typical effect sizes for gender differences in mathematics (e.g., d = 0.16, Hedges & Nowell, 1995), even a d of 0.1 for the stereotype threat effect among schoolgirls could be substantial in the sense that it may then explain a substantial part of the gender gap, all other things being equal. When considered in light of earlier meta-analyses of the stereotype threat effect, the same effect size estimate of d = 0.1 could be seen as small. The core issue for understanding the potential effect of publication bias is that stereotype threat effects are small in relation to the sample sizes typical for psychological research (Bakker et al., 2012), leading to underpowered studies.
presentations held at the last 10 conferences of the Society for Personality and Social Psychology (SPSP), selected those abstracts that mentioned stereotype threat and children, and e-mailed the first author that worked on the project in question. Finally, we posted an open call for data on both the SPSP forum (www.spsp.org) and the Social Psychology Network forum (www.socialpsychology.org). We did not receive any papers through the second and third strategies; however, we obtained seven responses through the first strategy, which provided us with five additional studies. Five authors indicated that they had no unpublished works. Ultimately, we included five effect sizes (11%) in the meta-analysis that were a product of unpublished studies. In our literature search, we obtained one Italian study (⁎Tomasetto, Matteucci, & Pansu, 2010) that was translated by the first author.
2.2. Inclusion criteria
We included study samples based on five criteria. First, we selected only those studies in which schoolgirls were included in the sample and where the gender stereotype threat was manipulated. We excluded studies that focused on only boys or studies that concerned another negatively stereotyped group (e.g., ethnic minorities in other ability domains). Second, because we focused on studies with children and adolescents, we disregarded those studies for which the average age within the sample was above 18. Third, we used experiments in which students were randomly assigned² to the stereotype threat condition or control condition. This constraint meant that we included neither correlational studies nor studies that failed to administer a viable stereotype threat. A viable threat was either accomplished using explicit cues that address the ramifications of the gender stereotype (e.g., “Women perform worse on this mathematical test”) or using implicit cues that are supposed to activate gender stereotypes (e.g., instructions to circle gender on a test form). Fourth, we included only studies for which the stereotype threat manipulation was treated as a between-subjects factor and thus excluded studies in which this variable was treated as a within-subjects factor. Fifth, the dependent variable had to be the score on a MSSS test. We coded the selected variables using the procedures described in the next section.
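The cluster correction mentioned in footnote 2 (Hedges, 2007) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: n is read as the (equal) cluster size, N as the total sample size, and S_T as the total standard deviation; the function name, argument names, and example numbers are hypothetical. The default ρ = .2 is the value the footnote reports.

```python
from math import sqrt

def cluster_corrected_d(mean_threat, mean_control, sd_total,
                        cluster_size, n_total, rho=0.2):
    """Shrink a standardized mean difference for cluster-level assignment.

    Applies the factor sqrt(1 - 2 * (n - 1) * rho / (N - 2)), assuming
    equal cluster sizes n and total sample size N.
    """
    d_uncorrected = (mean_threat - mean_control) / sd_total
    shrinkage = sqrt(1 - 2 * (cluster_size - 1) * rho / (n_total - 2))
    return d_uncorrected * shrinkage

# Hypothetical study: clusters of 11 pupils, 82 pupils in total.
d_adjusted = cluster_corrected_d(10.0, 12.0, 4.0, cluster_size=11, n_total=82)
```

Because the factor is always at most 1, the corrected effect size can only move toward zero; with small clusters relative to the total sample the change is slight.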
2.3. Coding procedures
The selection and coding of the independent and dependent variables was carried out following a number of rules. In some studies participants were assigned not only to a stereotype threat or control condition but also to an additional crossed factor. We treated the groups formed by the additional factor as different populations when this factor was a between-subjects factor.³ Whenever the additional factor was a within-subjects factor, we took only the level of the factor that, based on the existing theories of stereotype threat, would be expected to have the strongest effect. For instance, we selected a difficult over an easy test in one study (⁎Neuville & Croizet, 2007). The control condition consisted of either a nullified control condition or a control condition in which no information had been given regarding gender and performance. For studies that involved multiple types of control groups, we selected the control group in the following order: (1) a nullified control condition which described that no differences in performance on the mathematical test have been found, (2) a nullified control condition which described that girls perform better on the mathematical test, (3) a nullified control condition in which test-takers were informed that the sex differences in performance on the mathematical test are due to stereotype threat, (4) a nullified control condition that entailed a description or visualization of a stereotype inconsistent situation, and (5) a control condition in which no additional information had been given. In selecting the dependent variable performance on a MSSS test we used the following rules: we first selected a test administered after the threat manipulation over a test administered before the threat manipulation, subsequently we selected published cognitive tests over self-constructed cognitive tests, and finally we selected math tests over other tests (i.e., spatial tests, physics tests, geometrical recall tests, or chemistry tests). We coded performance on a MSSS test via the official scoring rule for the test; if this rule was not reported, we used the reported percentage of correct answers or alternatively the average sum score (i.e., the raw mean number of correct answers per condition).
In addition to the independent and the dependent variable, six other variables were coded. Test difficulty was coded by 1 minus the proportion of correct answers within the control group of girls in the study sample; thus, a more difficult test resulted in a higher score on this moderator variable. We calculated test difficulty using the data from the control group of girls only instead of the entire sample because some (but not all) studies included boys in their samples and the test difficulty needed to be comparable across samples. Additionally, we did not use the data of girls in the experimental group because the effect of stereotype threat would probably distort the actual difficulty. Presence of boys was coded with yes when boys were present during test administration or alternatively with no when boys were not present. The type of control group was coded with nullified whenever the control condition consisted of an active threat removal, whereas a control condition without such an active threat removal was coded as no information. Cross-cultural gender equality in the country where the study took place was coded by the country's score on the Gender Gap Index
² To correct for random assignment on the cluster level instead of the individual level, we used cluster correction for equal cluster sizes (Hedges, 2007), which was applied to five studies. Both corrected and uncorrected effect sizes are reported in Table 3. We based the adjustment of the effect size on the following formula:
\[ d_{T2} = \left( \frac{\bar{Y}^{T}_{\bullet\bullet} - \bar{Y}^{C}_{\bullet\bullet}}{S_{T}} \right) \sqrt{1 - \frac{2(n-1)\rho}{N-2}}. \]
The decision to use an intra-class correlation of ρ = .2 was guided by the paper of Hedges and Hedberg (2007), in which calculations of the intra-class correlation for a large sample of schools showed an average of ρ = .220. This number was rather stable across grades (kindergarten through the 12th grade); thus, we felt confident to round this number down and use it in our analysis.
³ In the experiment by ⁎Keller (2007), the factor domain identification was obtained by a median split based on the continuous variable domain identification that we were unable to duplicate. Therefore, we chose to calculate the effect size over the entire sample pooled together, ignoring the variable domain identification.
(Hausmann et al., 2012). The exploratory variable type of manipulation was coded by either explicit or implicit as indicated in Table 1. Age was coded by using the mean age in the entire sample; however, for papers that only reported an age range we took the midpoint of this range. Test difficulty, age, and cross-cultural gender equality were included as continuous moderators in the analysis, whereas presence of boys, type of control group, and type of manipulation were included as categorical moderators.
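The two continuous coding rules described above (test difficulty as 1 minus the proportion correct among control-group girls, and age as the sample mean or the midpoint of a reported range) can be written as a short Python sketch. The function names and example values are hypothetical and serve only to make the rules concrete.

```python
def code_difficulty(proportion_correct_control_girls):
    """Test difficulty: 1 minus the proportion of correct answers among
    girls in the control group, so harder tests get higher scores."""
    if not 0.0 <= proportion_correct_control_girls <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return 1.0 - proportion_correct_control_girls

def code_age(mean_age=None, age_range=None):
    """Age: the reported sample mean, or the midpoint of the reported range."""
    if mean_age is not None:
        return mean_age
    if age_range is None:
        return None  # treated as a missing value on this moderator
    low, high = age_range
    return (low + high) / 2

difficulty = code_difficulty(0.40)    # control girls answered 40% correctly
age = code_age(age_range=(13, 14))    # only a range was reported
```

Returning None for a missing age mirrors the handling of missing moderator values as missing data rather than as a default number.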
Whenever the papers provided insufficient information, we requested additional information from the authors via email. We sent the authors one reminder when they failed to respond. When we failed to obtain all information needed to calculate the effect size, we excluded the paper from that particular analysis. Missing pieces of information on moderator variables were treated as missing values, which were excluded pairwise from the analysis.
To ensure that the coding procedure would be as objective as possible, we developed a coding sheet.⁴ The coding process was first carried out by the first author. To assess inter-rater agreement, five variables (type of control condition, presence of boys, cross-cultural gender equality, age, and type of manipulation) were rescored by two independent raters for all studies except for unpublished studies that were not reported in paper form (k = 43). The inter-rater agreement was assessed by calculating Fleiss' exact kappa (Conger, 1980; Fleiss, 1971) for categorical variables and the two-way, agreement, unit-measures intraclass correlation (Hallgren, 2012; Shrout & Fleiss, 1979) for continuous variables using the R-package irr (Gamer, Lemon, Fellows, & Singh, 2012). Those measures reached satisfactory levels of agreement for the nominal variables type of control condition (Fleiss' exact κ = .76) and presence of boys (Fleiss' exact κ = .68) as well as for continuous variables cross-cultural gender equality (ICC = 1.00) and age (ICC = .96). Only the agreement for the variable type of manipulation was lower (Fleiss' exact κ = .10), indicating only slight agreement among the three coders. However, as the type of manipulation was used as an exploratory variable in this study and was, therefore, not our main focus, low agreement on this variable is not overly problematic. Disagreements in scoring were solved by selecting the modal response. The dependent variable “performance on a MSSS test” and the moderator variable “test difficulty” were not retrieved by multiple coders because for these variables too much information was not reported in the original articles and needed to be retrieved by e-mailing the authors.
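To make the agreement statistic more concrete, the following Python sketch computes the ordinary Fleiss (1971) kappa for several raters. Note the assumption: the analysis above reports Fleiss' exact kappa (Conger, 1980) from the R-package irr, which treats chance agreement somewhat differently, so this sketch would not necessarily reproduce the reported values. The function name and the example ratings are hypothetical.

```python
def fleiss_kappa(ratings):
    """Ordinary Fleiss kappa.

    ratings: one list per rated study, each holding the category label
    given by every rater (same number of raters for all studies).
    Undefined (division by zero) if every rating falls in one category.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(len(categories))]
    # Observed agreement per study, then averaged over studies.
    p_obs = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
             for row in counts]
    mean_obs = sum(p_obs) / n_subjects
    chance = sum(p * p for p in p_cat)
    return (mean_obs - chance) / (1 - chance)

# Three hypothetical coders rating four studies as nullified / no information.
codes = [["nullified", "nullified", "no info"],
         ["nullified", "no info", "no info"],
         ["nullified", "nullified", "nullified"],
         ["no info", "no info", "no info"]]
kappa = fleiss_kappa(codes)
```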
2.4. Statistical methods
We used Hedges's g (Hedges, 1981) as effect size estimator, which was calculated by means of the following formula:
\[ g = \frac{\bar{Y}^{\text{experimental}}_{\bullet\bullet} - \bar{Y}^{\text{control}}_{\bullet\bullet}}{S_{\text{pooled}}} \left( 1 - \frac{3}{4(n_1 + n_2) - 9} \right). \]
Thus, study samples with negative effect sizes denote the expected performance decrement due to stereotype threat, whereas positive effect sizes contradict our expectations. The model fitted to the data was the random effects model (for the analyses without moderators) and the mixed effects model (for the analyses with moderators) because we wanted both to explain systematic variance by adding multiple moderators as well as to generalize to the entire population of studies (Viechtbauer, 2010). A characteristic of these two methods is that effect sizes are automatically weighted by the inverse of the study's sampling variance. We have not weighted the effect sizes with regards to other quality indicators. We estimated these models with the R-package metafor (Viechtbauer, 2010) in R version 3.0.2.
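The Hedges's g formula given above can be sketched in Python as follows. The pooled standard deviation is not spelled out in the text, so the usual degrees-of-freedom-weighted form is assumed here; function names, argument names, and the example numbers are hypothetical.

```python
from math import sqrt

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation (assumed: weighted by n - 1 per group)."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def hedges_g(mean_exp, sd_exp, n_exp, mean_ctrl, sd_ctrl, n_ctrl):
    """Standardized mean difference with the small-sample correction
    1 - 3 / (4 * (n1 + n2) - 9)."""
    d = (mean_exp - mean_ctrl) / pooled_sd(sd_exp, n_exp, sd_ctrl, n_ctrl)
    correction = 1 - 3 / (4 * (n_exp + n_ctrl) - 9)
    return d * correction

# Hypothetical study: threat group scores lower than the control group.
g_example = hedges_g(mean_exp=10.0, sd_exp=4.0, n_exp=20,
                     mean_ctrl=12.0, sd_ctrl=4.0, n_ctrl=20)
```

Putting the experimental mean first means a negative value corresponds to lower scores under stereotype threat, in line with the sign convention stated above.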
When fitting the random effects model, we automatically assume that the population level effect size values vary and are normally distributed. In this case, it is considered good practice (Hunter & Schmidt, 2004; Whitener, 1990) to calculate a credibility interval around the average effect size (g) in addition to the more familiar confidence interval. We calculated the 95% credibility interval, which is an estimation of the boundaries in which 95% of values in the effect size distribution are expected to fall (Hunter & Schmidt, 2004). The boundaries of this interval are obtained using the standard deviation of the distribution of effect sizes (SD_ES), or more specifically adding and subtracting 1.96 times the SD_ES of g. In contrast, for the 95% confidence interval the standard error is used to obtain the boundaries around a single value of g. The confidence interval gives an indication of how the results can fluctuate due to sampling error, whereas the credibility interval gives an indication of the amount of heterogeneity in the distribution of effect sizes.
We estimated the amount of heterogeneity τ² with the restricted maximum likelihood estimator, which is the default in metafor (Viechtbauer, 2010) and an approximately unbiased estimator for the standardized mean difference (Viechtbauer, 2005). To address the issue of publication bias, we used several methods. First, we used three methods based on funnel plot asymmetry: the trim and fill method (Duval & Tweedie, 2000; Rothstein, 2007), the rank correlation test (Begg & Mazumdar, 1994), and Egger's test (Sterne & Egger, 2005). A combination of the three methods is desirable to obtain robust results because both the rank correlation test and Egger's test have low power when the number of studies in the analysis is small (Kepes, Banks, & Oh, 2012). To take tests into account that are not based on the funnel plot, we conducted Ioannidis and Trikalinos's exploratory test (2007), which compares the observed number of significant studies and the expected number of significant studies based on power calculations (see also Francis, 2013, 2014). Finally, we created a p-curve to have an indication of the practice of p-hacking within the field (Simonsohn et al., 2013). A p-curve consists of only statistically significant p-values within a set of studies. So the p-curve analysis includes only the 15 studies for which the mean scores of the experimental group and the control group significantly differed from each other (based on a t-test and α = .05). If the p-curve resembles a right skewed curve, this finding suggests that our set of findings has evidential value, whereas a left skewed curve suggests that some researchers have invoked p-hacking (Simonsohn et al., 2013).
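As an illustration of how a funnel plot asymmetry test works, the sketch below implements the classical form of Egger's regression in Python: each standardized effect (g divided by its standard error) is regressed on precision (1 divided by the standard error), and an intercept far from zero signals asymmetry. This is an assumption about the general technique, not a reproduction of the analysis reported here, which was run in metafor and may use a different parameterization. The function name and the data are hypothetical.

```python
from math import sqrt

def egger_regression(effects, std_errors):
    """Ordinary least squares of g / SE on 1 / SE.

    Needs at least three studies with non-identical standard errors.
    Returns the intercept (asymmetry), the slope (effect adjusted for
    small-study trends), and the standard error of the intercept.
    """
    y = [g / se for g, se in zip(effects, std_errors)]
    x = [1.0 / se for se in std_errors]
    k = len(y)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = sqrt(rss / (k - 2) * (1.0 / k + mean_x ** 2 / sxx))
    return {"intercept": intercept, "slope": slope, "se_intercept": se_intercept}

# Hypothetical funnel: less precise studies report more negative effects.
std_errors = [0.1, 0.2, 0.3, 0.4]
asymmetric = egger_regression([-0.2 - se for se in std_errors], std_errors)
symmetric = egger_regression([-0.2 for _ in std_errors], std_errors)
```

In the asymmetric example the intercept is −1 while the slope recovers the underlying effect of −0.2; in the symmetric example the intercept is zero. Turning the intercept into a significance test would additionally require a t distribution with k − 2 degrees of freedom.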
⁴ A list of excluded studies and the coding sheet are available upon request.
We pre-registered the hypotheses and inclusion criteria of our meta-analysis via the Open Science Framework (https://osf.io/bwupt/).
3. Results
Our literature search and the call for data yielded 972 papers that were further screened. Based on the inclusion criteria, 26 papers (i.e., studies) or unpublished reports were actually included in the meta-analysis, which resulted in 47 independent effect sizes (i.e., study samples). Additional information concerning the screening process is listed in Fig. 1. These 26 papers provided us with a wealth of new information because only 3 of these papers (12%) were also included in the most recent meta-analysis on this topic (Picho et al., 2013). The overlap with the four older meta-analyses is equal to or smaller than 12%. The total sample, obtained by simply adding all participants of the included studies, consisted of N = 3760 girls, of which n_ST = 1926 girls were assigned to the
Fig. 1. Flow-chart of the literature search. n = number of papers.
Table 3
Characteristics and statistics of studies included in the meta-analysis.
StudyAgeCountryStatus
Nga
CCBoys DifficultyGGIManipulation
AuthorsYearNo.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Agnoli, Altoè & Muzzatti
Agnoli, Altoè & Muzzatti
*Agnoli, Altoè & Pastro
*Agnoli, Altoè & Pastro
Bagès & Martinot
Bagès & Martinot
Cherney & Campbell
Cherney & Campbell
Cimpian, Mu, & Erickson
Delgado & Prieto
Galdi, Cadinu, & Tomasetto
⁎Galdi et al.
⁎Galdi et al.
⁎Galdi et al.
⁎Galdi et al.
⁎Galdi et al.
⁎Galdi et al.
Good et al.
Huguet & Régner
Huguet & Régner
Huguet & Régner
Huguet & Régner
Keller & Dauenheimer
Keller
Marchand &
Taasoobshirazi
*Moè
Moè
Moè
Moè & Pazzaglia
Muzzatti & Agnoli
Muzzatti & Agnoli
Muzzatti & Agnoli
Muzzatti & Agnoli
Muzzatti & Agnoli
Muzzatti & Agnoli
Muzzatti & Agnoli
Neuburger et al.
Neuville & Croizet
Picho & Stephens
Picho & Stephens
Stricker & Ward
Titze et al.
Tomasetto et al.
Tomasetto et al.
Tomasetto et al.
Tomasetto et al.
*Twamley
–
–
–
–
2011
2011
2011
2011
2012
2008
2013
2014
2014
2014
2014
2014
2014
2010
2009
2007
2007
2007
2003
2007
2012
1A of 1
1B of 1
1A of 1
1B of 1
1A of 1
1B of 1
1A of 1
1B of 1
2 of 2
1 of 1
1
1 of 3
2A of 3
2B of 3
3A of 3
3B of 3
3C of 3
1 of 1
1
1 of 2
2A of 2
2B of 2
1 of 1
1 of 1
1 of 1
10.92
12.92
14.01
16.03
10.58
10.58
16.02
16.02
5.98
15.5
6.47
13.5
12.5
13.5
9.5
13.5
17.5
14.81
12
12
12
12
15.7
15.9
16
Italy
Italy
Italy
Italy
France
France
USA
USA
USA
Spain
Italy
USA
USA
USA
USA
USA
USA
USA
France
France
France
France
Germany
Germany
USA
Unpub.
Unpub.
Unpub.
Unpub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
Pub.
38
59
41
49
63
59
0.199
0.028
−0.891
0.557
−0.705
−0.864
0.293
0.507
−0.656
−0.270 (−0.277)
−0.620
0.137
0.276
−0.158
0.165
0.141
−0.268
−0.693
−0.867
−0.742
0.010 (0.010)
−0.808 (−0.815)
−0.457
0.040
−0.576 (−0.581)
No information
No information
No information
No information
Nullified
Nullified
Nullified
Nullified
No information
No information
Nullified
Nullified
No information
No information
No information
No information
No information
No information
No information
No information
No information
No information
Nullified
Nullified
Nullified
Yes
Yes
Yes
Yes
Yes
Yes
No
Yes
No
Yes
No
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
No
No
Yes
Yes
Yes
Yes
.636
.668
.594
.500
.508
.552
.500
.370
.458
.365
NA
.620
.230
.360
.560
.550
.480
.782
.589
.538
.598
.578
.531
.705
.310
.673
.673
.673
.673
.698
.698
.737
.737
.737
.727
.673
.737
.737
.737
.737
.737
.737
.737
.698
.698
.698
.698
.763
.763
.737
Implicit
Implicit
Implicit
Implicit
Implicit
Implicit
Explicit
Explicit
Explicit
Explicit
Implicit
Explicit
Explicit
Explicit
Explicit
Explicit
Explicit
Implicit
Implicit
Implicit
Implicit
Implicit
Explicit
Explicit
Explicit
124
135
48
168
80
110
115
99
29
65
76
34
92
20
136
87
35
55
90
Table 3 (continued)
ID  Year  Study    Age    Country  Status  N    g                CC              Boys  Difficulty  GGI   Manipulation
26  2012  1 of 1   15.5   Italy    Pub.    49   −0.541           Nullified       Yes   .572        .673  Explicit
27  2009  1A of 1  17.97  Italy    Pub.    24   −0.497           Nullified       Yes   .643        .673  Explicit
28  2009  1B of 1  17.97  Italy    Pub.    23   −0.620           Nullified       Yes   .554        .673  Explicit
29  2006  1 of 2   17     Italy    Pub.    71   −0.266           Nullified       No    .582        .673  Explicit
30  2007  1A of 2  7.2    Italy    Pub.    35   0.047            No information  Yes   .509        .673  Implicit
31  2007  1B of 2  8.4    Italy    Pub.    68   0.230            No information  Yes   .663        .673  Implicit
32  2007  1C of 2  9.4    Italy    Pub.    64   0.132            No information  Yes   .610        .673  Implicit
33  2007  1D of 2  10.4   Italy    Pub.    42   −0.424           No information  Yes   .663        .673  Implicit
34  2007  2A of 2  8.2    Italy    Pub.    42   0.028            No information  Yes   .364        .673  Implicit
35  2007  2B of 2  10.2   Italy    Pub.    48   0.148            No information  Yes   .305        .673  Implicit
36  2007  2C of 2  13     Italy    Pub.    30   −1.197           No information  Yes   .325        .673  Implicit
37  2012  1 of 1   10.18  Germany  Pub.    72   −0.143           Nullified       Yes   .741        .763  Explicit
38  2007  1 of 1   7.3    France   Pub.    45   −0.639           No information  Yes   .200        .698  Implicit
39  2012  1A of 1  15.5   Uganda   Pub.    38   −0.744           No information  Yes   .330        .723  Explicit
40  2012  1B of 1  15.5   Uganda   Pub.    51   −0.135           No information  No    .390        .723  Explicit
41  2004  1 of 2   17.5   USA      Pub.    730  −0.160 (−0.160)  No information  Yes   .522        .737  Implicit
42  2010  1 of 1   10.47  Germany  Pub.    84   0.273            Nullified       Yes   .272        .763  Explicit
43  2010  1 of 1   15.59  Italy    Pub.    118  −0.125           Nullified       Yes   .338        .673  Implicit
44  2011  1A of 1  5.43   Italy    Pub.    33   −0.652           No information  No    NA          .673  Implicit
45  2011  1B of 1  6.05   Italy    Pub.    64   −0.339           No information  No    NA          .673  Implicit
46  2011  1C of 1  7.47   Italy    Pub.    27   −0.322           No information  No    NA          .673  Implicit
47  2009  1 of 1   11     USA      Unpub.  74   −0.252           No information  No    .730        .737  Implicit
Note. Status = published versus unpublished papers. N = N(threat condition) + N(control condition). CC = control condition. Boys = presence of boys (yes) or not (no). GGI = Gender Gap Index. NA indicates a cell with missing data.
a The primary number is the corrected effect size; the number in parentheses is the uncorrected effect size.
experimental condition and nC = 1834 girls were assigned to the control condition. The most important characteristics of the included
study samples are summarized in Table 3.
3.1. Overall effect
To estimate the overall effect size, we used a random effects model. In accordance with our hypothesis as well as the former
literature, we found a small average standardized mean difference, g = −0.22, z = −3.63, p < .001, CI95 = −0.34; −0.10, indicating that
girls who have been exposed to a stereotype threat on average score lower on the MSSS tests compared to girls who have not been
exposed to such a threat. Furthermore, we found a significant amount of heterogeneity using the restricted maximum likelihood
estimator, τ̂² = 0.10, Q(46) = 117.19, p < .001, CI95 = 0.04; 0.19, which indicates there is variability among the underlying population
effect sizes. This estimated heterogeneity accounts for a large share of the total variability, I² = 61.75%. The 95% credibility interval, an
estimation of the boundaries in which 95% of the true effect sizes are expected to fall, lies between −0.85 and 0.41 (Viechtbauer,
2010). This range constitutes a wide interval. The forest plot (Fig. 2) depicts the effect sizes against the precision with which each
effect was estimated.
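The two intervals reported above can be approximately recovered from the summary statistics alone. The following is a minimal sketch, not the authors' code: it assumes a normal approximation, backs the standard error out of the reported z value, and uses the less rounded values g = −0.2226 (footnote 6) and τ̂² = 0.1001 (Table 6). The function name `intervals` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def intervals(g, z, tau2, level=0.95):
    """Return (confidence interval, credibility interval) for a random effects mean."""
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
    se = abs(g / z)                       # standard error implied by the Wald z
    ci = (g - crit * se, g + crit * se)   # uncertainty about the mean effect only
    half = crit * sqrt(tau2 + se ** 2)    # adds the between-study variance tau2
    cri = (g - half, g + half)            # range expected to hold 95% of true effects
    return ci, cri


ci, cri = intervals(g=-0.2226, z=-3.63, tau2=0.1001)
print([round(x, 2) for x in ci])    # compare with the reported -0.34; -0.10
print([round(x, 2) for x in cri])   # compare with the reported -0.85; 0.41
```

The credibility interval is much wider than the confidence interval because it includes τ̂², the estimated variance of the true effects, rather than only the uncertainty about their mean.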
Fig. 2. The forest plot of included effect sizes. NA = missing value. RE model = Random Effects model. The observed outcome is the standardized mean difference
Hedges's g.
3.2. Moderator analyses
We submitted the data to separate mixed effects meta-regressions for each of the four moderators and used the REML estimator to
obtain the residual τ̂² (i.e., unexplained variance in underlying effect sizes). The results of the simple meta-regression analyses for each
moderator variable separately are presented in Table 4, where the variables presence of boys and control condition were treated as
categorical variables, and the remaining variables were treated as continuous variables. None of the moderators were statistically
significant. Additionally, the results for the multiple meta-regression, as given in Table 5, showed no statistically significant moderation,
QM(4) = 2.68, p = .61, τ̂² = .11, QE(38) = 95.59, p < .001. Additional exploratory analyses did not yield any statistically significant
explanation for differences between the effect sizes. The moderation of the exploratory variable age, QM(1) = 0.65, p = .42, τ̂² = .10,
QE(45) = 112.80, p < .001, did not turn out to be statistically significant, indicating that we found no evidence for systematic variation in
the magnitude of the effect sizes due to differences in age. Additionally, the exploratory variable type of manipulation, QM(1) = 3.16,
p = .08, τ̂² = .09, QE(45) = 103.87, p < .001, did not result in a statistically significant moderation either.
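The QM statistics are referred to a chi-square distribution with as many degrees of freedom as there are moderator coefficients. As a small illustration, the sketch below recomputes the p-values from the reported QM values using only the Python standard library. The helper `chi2_sf` is a hypothetical name, and it handles integer degrees of freedom only.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: start from Q(1, y) = exp(-y).
        a, q = 1.0, exp(-y)
    else:
        # Odd df: start from Q(1/2, y) = erfc(sqrt(y)).
        a, q = 0.5, erfc(sqrt(y))
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q


# Reported values: QM(4) = 2.68 (p = .61), QM(1) = 0.65 (p = .42), QM(1) = 3.16 (p = .08).
for qm, df in [(2.68, 4), (0.65, 1), (3.16, 1)]:
    print(f"QM({df}) = {qm}: p = {chi2_sf(qm, df):.2f}")
```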
3.3. Sensitivity analyses
To verify the robustness of our results (notably the estimated effect size), we ran several sensitivity analyses, as is recommended
for meta-analyses (Greenhouse & Iyengar, 2009). Specifically, we verified the robustness of our results with respect to the use of a
different statistical meta-analytic model, an alternative heterogeneity estimator, re-analyses of the random effects model using different
estimates of τ², diagnostic tests, and different subsets of effect sizes. First, in a fixed effects model, we also found a statistically
significant mean effect size of g = −0.16, z = −4.35, p < .001.⁵ Using the DerSimonian–Laird estimator yielded a similar effect size estimate
as the restricted maximum likelihood estimator, g = −0.22, z = −3.66, p < .001, CI95 = −0.34; −0.10, with roughly the same
amount of estimated heterogeneity, τ̂² = 0.10, Q(46) = 117.19, p < .001, CI95 = 0.04; 0.19. We also reran the original analysis
with three different amounts for τ̂²: the originally estimated τ̂², the upper bound of the confidence interval around τ̂², and the lower
bound of that interval. The results of these analyses are summarized in Table 6. Although the estimated effect sizes varied slightly, they all
were negative and differed significantly from zero.
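As a rough illustration of these checks, the sketch below computes the DerSimonian–Laird moment estimate of τ² and then re-pools the effects under several fixed values of τ². It is a simplified stand-in rather than the authors' analysis: the effect sizes and sampling variances are made-up placeholders, the function names are hypothetical, and REML estimation is not implemented.

```python
def dersimonian_laird_tau2(effects, variances):
    """Moment (DerSimonian-Laird) estimate of the between-study variance."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed effects mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q - (len(effects) - 1)) / c)  # truncated at zero


def pool(effects, variances, tau2):
    """Inverse-variance pooled estimate, standard error, and z for a given tau2."""
    weights = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum_w
    se = (1.0 / sum_w) ** 0.5
    return estimate, se, estimate / se


# Placeholder data, NOT the effect sizes analyzed in this meta-analysis.
effects = [-0.60, -0.10, 0.20, -0.45, 0.05, -0.30]
variances = [0.10, 0.04, 0.08, 0.12, 0.03, 0.06]

tau2_dl = dersimonian_laird_tau2(effects, variances)
for tau2 in (0.0, tau2_dl, 2 * tau2_dl):  # tau2 = 0 gives the fixed effects model
    g, se, z = pool(effects, variances, tau2)
    print(f"tau2 = {tau2:.4f}: g = {g:.2f}, SE = {se:.2f}, z = {z:.2f}")
```

Setting τ² to zero reproduces the fixed effects estimate, and larger values of τ² weight the studies more equally and widen the standard error, which is the pattern shown in Table 6.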
We also considered potential outliers by inspecting the studentized residuals, and found that the second study of ⁎Cherney and
Campbell (2011) displayed a studentized residual larger than 2. Running the analysis without this study gave an estimated effect
size of g = −0.24, z = −4.05, p < .001, which indicates that the estimated mean effect size is only slightly influenced by this
⁵ Although we report this analysis for the sake of robustness of the estimated effect size, we would not advocate interpreting this result due to the heterogeneity we
found among effect sizes.
Table 4
Results of the univariate mixed effects meta-regression per moderator.
Variable          k   N     Intercept  Slope coefficient  SE    z     p    95% CI       QE       τ²    QM    I²   R²
GGI               47  3760  −2.23      2.83               1.85  1.53  .13  −0.80; 6.46  107.33⁎  0.09  2.34  60%  .07
Boys (factor)     47  3760  −0.28      0.08               0.15  0.54  .59  −0.21; 0.36  117.08⁎  0.10  0.29  62%  0
Difficulty        43  3556  −0.43      0.45               0.42  1.09  .28  −0.37; 1.28  105.28⁎  0.10  1.18  63%  .02
Control (factor)  47  3760  −0.23      0.03               0.13  0.25  .80  −0.22; 0.29  115.17⁎  0.10  0.06  62%  0
⁎ p < .001.
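Several columns of Table 4 follow directly from the slope coefficient and its standard error. As a minimal sketch under a normal (Wald) approximation, which may differ slightly in rounding from the software output the table reports, the GGI row can be recomputed as follows. The function name `wald_summary` is hypothetical.

```python
from statistics import NormalDist


def wald_summary(slope, se, level=0.95):
    """Wald z, two-sided p, confidence interval, and one-df QM for one coefficient."""
    norm = NormalDist()
    crit = norm.inv_cdf(1 - (1 - level) / 2)
    z = slope / se
    p = 2 * (1 - norm.cdf(abs(z)))
    ci = (slope - crit * se, slope + crit * se)
    return {"z": z, "p": p, "ci": ci, "QM": z ** 2}  # QM equals z squared when df = 1


ggi = wald_summary(slope=2.83, se=1.85)  # slope and SE taken from the GGI row
print({k: (tuple(round(x, 2) for x in v) if k == "ci" else round(v, 2))
       for k, v in ggi.items()})
```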
Table 5
Results of the multivariate mixed effects meta-regression with four moderators included.
Variable          Slope coefficient  SE    z      p    95% CI
Intercept         −2.07              1.52  −1.36  .17  −5.06; 0.91
GGI               2.30               2.10  1.09   .27  −1.82; 6.41
Boys (factor)     −0.05              0.18  −0.27  .79  −0.39; 0.30
Difficulty        0.52               0.43  1.20   .23  −0.33; 1.37
Control (factor)  −0.03              0.14  0.22   .83  −0.24; 0.31
Table 6
Sensitivity analysis: estimating the effect using different amounts of heterogeneity.
τ̂²      g      SE    z      p
0.0447  −0.20  0.05  −4.06  <.001
0.1001  −0.22  0.06  −3.63  <.001
0.1940  −0.24  0.08  −3.10  .002
study. Finally, we created different subsets to see whether the effect is stable over different categories. We found a few differences
between some subsets: the estimated effect size was larger for samples with an implicit stereotype threat manipulation, g = −0.32,
z = −3.76, p < .001, k = 26, compared to samples with an explicit stereotype threat manipulation, g = −0.10, z = −1.20, p = .23,
k = 21, and samples gathered outside of the United States of America showed a stronger stereotype threat effect, g = −0.30, z =
−4.15, p < .001, k = 34, than samples gathered in the United States of America, g = −0.05, z = −0.48, p = .63, k = 13. Additionally,
we created subsets of younger (younger than 13 years) and older (13 years or older) participants; the estimated effect size was larger in
samples with younger students, g = −0.25, z = −2.92, p = .004, k = 25, than in samples with older students, g = −0.20, z = −2.19,
p = .03, k = 22. Using an alternative cut-off at the age of 10 yielded similar results (for younger students, g = −0.24, z = −2.06, p =
.04, k = 11, and for older students, g = −0.22, z = −3.07, p = .002, k = 36). These subset analyses are exploratory and should
be interpreted as such; however, they might be an inspiration for future research.
3.4. Excess of significance results
We used several methods to test for the presence of publication bias. First, we ran several tests on the funnel plot (see Fig. 3) to
assess funnel plot asymmetry. According to the estimations of the trim and fill method (Duval & Tweedie, 2000), the funnel plot
would be symmetric if 11 effect sizes were imputed on the right side of the funnel plot. Actual imputation of those missing
effect sizes (Duval & Tweedie, 2000) reduced the estimated effect size to g = −0.07, z = −1.10, p = .27, CI95 = −0.21; 0.06. Because
this altered effect size did not differ significantly from zero whereas our original effect size estimation of g = −0.22 did, this pattern is
a first indication that our results might be distorted by publication bias. Both Egger's test (Sterne & Egger, 2005; z = −3.25, p = .001)
and Begg and Mazumdar's (1994) rank correlation test, Kendall's τ = −.27, p = .01, indicated funnel plot asymmetry. This finding
indicates that imprecise study samples (i.e., study samples with a larger standard error) on average contribute to a more negative
effect than precise study samples. The relation between imprecise samples and the effect sizes is illustrated in Fig. 4 using a cumulative
meta-analysis sorted by the sampling variance of the samples (Borenstein, Hedges, Higgins, & Rothstein, 2009). This cumulative
process first carries out a “meta-analysis” on the sample with the smallest sampling variance and proceeds by adding the study with the smallest
remaining sampling variance and re-analyzing until all samples are included in the meta-analysis. The drifting trend of the estimated
effect sizes visualizes the effect that small imprecise study samples have on the estimations of the mean effect. We created subsets to
estimate the effects of large study samples (N ≥ 60) and small study samples (N < 60). We found a stronger effect in the subset of
smaller study samples, g = −0.34, z = −3.76, p < .001, CI95 = −0.52; −0.16, CrI95 = −0.96; 0.27, k = 24, and a small and
nonsignificant effect for the subset of larger study samples, g = −0.13, z = −1.63, p = .10, CI95 = −0.29; 0.03, CrI95 = −0.75; 0.49, k = 23.
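The cumulative procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: τ² is held fixed at a supplied value instead of being re-estimated at every step, the data are made-up placeholders chosen so that less precise samples show more negative effects, and the function name is hypothetical.

```python
def cumulative_meta_analysis(effects, variances, tau2=0.0):
    """Pooled estimate after adding samples one by one, most precise first."""
    order = sorted(range(len(effects)), key=lambda i: variances[i])
    sum_w = 0.0
    sum_wy = 0.0
    steps = []
    for i in order:
        w = 1.0 / (variances[i] + tau2)   # inverse-variance weight
        sum_w += w
        sum_wy += w * effects[i]
        steps.append((i, sum_wy / sum_w))  # (sample index, running estimate)
    return steps


# Placeholder data, NOT the effect sizes analyzed in this meta-analysis.
effects = [-0.55, -0.05, -0.70, -0.10, -0.30]
variances = [0.10, 0.01, 0.15, 0.02, 0.05]

for index, estimate in cumulative_meta_analysis(effects, variances):
    print(f"added sample {index}: cumulative g = {estimate:.3f}")
```

With data of this shape, the running estimate drifts away from zero as the imprecise samples enter, which is the kind of trend Fig. 4 is meant to visualize.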
Finally, Ioannidis and Trikalinos's exploratory test (Ioannidis & Trikalinos, 2007) showed that this meta-analysis contains more
statistically significant effects than would be expected based on the cumulative power of all study samples, χ²(1) = 8.50, p = .004.⁶ The
excess of statistically significant findings is another indicator of publication bias (Bakker et al., 2012; Francis, 2012). To check the
alternative explanation that the excess of statistically significant findings is due to the practice of p-hacking, we created a p-curve (Fig. 5)
using the online app from Simonsohn et al. (2013). The p-curve depicts the theoretical distribution of p-values when there is no effect
present (solid line), the theoretical distribution of p-values when an effect is present and the tests have 33% power (dotted line), and
the observed distribution of the significant p-values in our meta-analysis (dashed line). The observed distribution was right-skewed,
Fig. 3. The contour-enhanced funnel plot of included effect sizes. The observed outcome is the standardized mean difference Hedges's g.
[Funnel plot axes: x = Observed Outcome (−1.50 to 1.50); y = Standard Error (0.000 to 0.462). Contour regions: p > .10; .10 > p > .05; .05 > p > .01; p < .01.]
⁶ To calculate the cumulative power we used the estimated effect size obtained by the random effects model, |g| = 0.2226. Although we detect a significant difference
between the observed and expected significant study samples based on this effect size, the test is rather sensitive. For an effect size of 0.27, the test is no longer
statistically significant.
χ²(30) = 62.87, p < .001, which indicated that there is an effect present that is not simply the result of practices like p-hacking.⁷
Overall, most publication bias tests indicate that the estimated effect size is likely to be inflated.
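The logic of the excess-significance test can be illustrated with a short sketch. It compares the observed number of significant samples with the number expected from the summed power of all samples. The sketch rests on several assumptions that the text does not confirm: power is approximated with a two-sided normal test, the standard errors and the observed count are made-up placeholders, and the function name is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def excess_significance(effect, std_errors, observed_significant, alpha=0.05):
    """Chi-square(1) comparison of observed and expected significant results."""
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)
    powers = []
    for se in std_errors:
        ncp = abs(effect) / se  # expected z value if the true effect equals `effect`
        powers.append(norm.cdf(ncp - crit) + norm.cdf(-ncp - crit))
    n = len(std_errors)
    expected = sum(powers)      # expected number of significant samples
    diff = observed_significant - expected
    stat = diff ** 2 / expected + diff ** 2 / (n - expected)
    p = erfc(sqrt(stat / 2))    # upper tail of a chi-square with 1 df
    return expected, stat, p


# Placeholder standard errors and count, NOT the data of this meta-analysis.
std_errors = [0.30, 0.25, 0.35, 0.20, 0.28, 0.15, 0.32, 0.22]
expected, stat, p = excess_significance(0.2226, std_errors, observed_significant=4)
print(f"expected = {expected:.2f}, chi2(1) = {stat:.2f}, p = {p:.3f}")
```

Because the expected count depends on the effect size plugged in, the result is sensitive to that choice, which is the caveat raised in footnote 6.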
4. Discussion
Analyzing 15 years of stereotype threat literature with children or adolescents as test-takers, we found indications that girls
underperform on MSSS tests due to stereotype threat. Consistent with findings by Nguyen and Ryan (2008), Picho et al. (2013),
Walton and Cohen (2003), and Walton and Spencer (2009), we estimated a small effect of −0.22. The estimations of heterogeneity
indicated that there was a large share of heterogeneity among population effect sizes. We ran multiple sensitivity analyses, and most
of these tests indicated that the mean effect size is rather robust against fluctuations due to alternative decisions regarding the
analyses or the removal of influential studies. Yet our results failed to corroborate predictions drawn from stereotype threat theory with
regard to the moderating variables. None of the four variables (difficulty, presence of boys, type of control group, and cross-cultural
gender equality) significantly moderated the effect of stereotype threat. Exploratory analyses with moderators such as age or type of
manipulation did not yield significant moderation either. However, we did find some strong indications that publication bias is present in
the field of stereotype threat.
In future research, the exploratory variables age and type of manipulation deserve more attention. With regard to the variable
age, the effect of stereotype threat overall appears to be rather stable over different ages. However, surprisingly, the subset analyses
indicated that the estimated effect size for samples with children younger than 13 was slightly larger than the effect size for samples
with older children. An additional subset analysis on our data using only samples with early grade school children (i.e., younger than
8 years old) shows a relatively large estimated mean effect size, g = −0.48, z = −4.30, p < .001, k = 7. This outcome is rather
counterintuitive, because three theories on stereotype threat predict that very young children would not yet be sensitive to detrimental
effects of stereotypes: preadolescent children have not obtained a coherent sense of the self yet (Aronson & Good, 2003), young
children fail to understand that effort will not necessarily compensate for a lack of mathematical abilities (e.g., Droege & Stipek, 1993;
Stipek & Daniels, 1990), and older children endorse gender stereotypes more strongly than younger children (Steffens & Jelenec,
2011). The variable type of manipulation also deserves extra attention. Although type of manipulation did not have a statistically
significant effect on stereotype threat (p = .08), the intercoder agreement for this variable was suboptimal, and most likely the power for
the test of this variable is low. In other words, the circumstances under which we measured this variable were not ideal, and future
inspection of it might be valuable. Due to these issues, we conclude that the type of manipulation and age are variables that require
more attention in the stereotype threat literature.
Unfortunately the robustness of the stereotype threat effect can be questioned by the presence of publication bias. All three tests
based on funnel plot asymmetry—trim and fill (Duval & Tweedie, 2000), Egger's test (Sterne & Egger, 2005), and Begg and
Mazumdar's rank correlation test (Begg & Mazumdar, 1994)—indicated that publication bias was present. Additionally Ioannidis
and Trikalinos's (2007) exploratory test highlighted an excess of significant findings, which can be due to publication bias. These
⁷ The test for the left-skewed distribution is not statistically significant, χ²(30) = 18.24, p = .95.
Fig. 4. Cumulative meta-analysis sorted by the sampling variance of the studies. The overall estimate is the estimated average effect size.