Advances in Methods and Practices in Psychological Science
2018, Vol. 1(4) 443–490
© The Author(s) 2018
DOI: 10.1177/2515245918810225
Article reuse guidelines: sagepub.com/journals-permissions
www.psychologicalscience.org/AMPPS
Association for Psychological Science

Corresponding Author:
Richard A. Klein, LIP/PC2S, Université Grenoble Alpes, CS 40700, 38 058 Grenoble Cedex 9, France
E-mail: raklein22@gmail.com

Registered Replication Report
Many Labs 2: Investigating Variation in
Replicability Across Samples and Settings
Richard A. Klein1, Michelangelo Vianello2, Fred Hasselman3,4,
Byron G. Adams5,6, Reginald B. Adams, Jr.7, Sinan Alper8,
Mark Aveyard9, Jordan R. Axt10, Mayowa T. Babalola11,
Štěpán Bahník12, Rishtee Batra13, Mihály Berkics14,
Michael J. Bernstein15, Daniel R. Berry16, Olga Bialobrzeska17,
Evans Dami Binan18, Konrad Bocian19, Mark J. Brandt5, Robert Busching20,
Anna Cabak Rédei21, Huajian Cai22, Fanny Cambier23,24,
Katarzyna Cantarero25, Cheryl L. Carmichael26, Francisco Ceric27,28,
Jesse Chandler29,30, Jen-Ho Chang31,32, Armand Chatard33,34,
Eva E. Chen35, Winnee Cheong36, David C. Cicero37, Sharon Coen38,
Jennifer A. Coleman39, Brian Collisson40, Morgan A. Conway41,
Katherine S. Corker42, Paul G. Curran42, Fiery Cushman43,
Zubairu K. Dagona18, Ilker Dalgar44, Anna Dalla Rosa2,
William E. Davis45, Maaike de Bruijn5, Leander De Schutter46,
Thierry Devos47, Marieke de Vries3,48,49, Canay Doğulu50,
Nerisa Dozo51, Kristin Nicole Dukes52, Yarrow Dunham53,
Kevin Durrheim54, Charles R. Ebersole55, John E. Edlund56,
Anja Eller57, Alexander Scott English58, Carolyn Finck59,
Natalia Frankowska17, Miguel-Ángel Freyre57, Mike Friedman23,24,
Elisa Maria Galliani60, Joshua C. Gandi18, Tanuka Ghoshal61,
Steffen R. Giessner62, Tripat Gill63, Timo Gnambs64,65, Ángel Gómez66,
Roberto González67, Jesse Graham68, Jon E. Grahe69, Ivan Grahek70,
Eva G. T. Green71, Kakul Hai72, Matthew Haigh73, Elizabeth L. Haines74,
Michael P. Hall75, Marie E. Heffernan76, Joshua A. Hicks77, Petr Houdek78,
Jeffrey R. Huntsinger79, Ho Phi Huynh80, Hans IJzerman1, Yoel Inbar81,
Åse H. Innes-Ker82, William Jiménez-Leal59, Melissa-Sue John83,
Jennifer A. Joy-Gaba39, Roza G. Kamiloğlu84, Heather Barry Kappes85,
Serdar Karabati86, Haruna Karick17,18, Victor N. Keller87, Anna Kende88,
Nicolas Kervyn23,24, Goran Knežević89, Carrie Kovacs90, Lacy E. Krueger91,
German Kurapov92, Jamie Kurtz93, Daniël Lakens94, Ljiljana B. Lazarević95,
Carmel A. Levitan96, Neil A. Lewis, Jr.97, Samuel Lins98,
Nikolette P. Lipsey41, Joy E. Losee41, Esther Maassen99,
Angela T. Maitner9, Winfrida Malingumu100, Robyn K. Mallett79,
Satia A. Marotta101, Janko Međedović102,103, Fernando Mena-Pacheco104,
Taciano L. Milfont105, Wendy L. Morris106, Sean C. Murphy107,
Andriy Myachykov73, Nick Neave73, Koen Neijenhuijs108,109,
Anthony J. Nelson7, Félix Neto98, Austin Lee Nichols110, Aaron Ocampo104,
Susan L. O’Donnell111, Haruka Oikawa112, Masanori Oikawa112,
Elsie Ong113, Gábor Orosz114, Malgorzata Osowiecka17, Grant Packard63,
Rolando Pérez-Sánchez115, Boban Petrović103, Ronaldo Pilati87,
Brad Pinter7, Lysandra Podesta3,4, Gabrielle Pogge41,
Monique M. H. Pollmann116, Abraham M. Rutchick117, Patricio Saavedra118,
Alexander K. Saeri119, Erika Salomon120, Kathleen Schmidt121,
Felix D. Schönbrodt122, Maciej B. Sekerdej123, David Sirlopú27,
Jeanine L. M. Skorinko83, Michael A. Smith73, Vanessa Smith-Castro115,
Karin C. H. J. Smolders94, Agata Sobkow124, Walter Sowden125,
Philipp Spachtholz122, Manini Srivastava126, Troy G. Steiner7,
Jeroen Stouten127, Chris N. H. Street128, Oskar K. Sundfelt82,
Stephanie Szeto38, Ewa Szumowska123, Andrew C. W. Tang113,
Norbert Tanzer129, Morgan J. Tear119, Jordan Theriault130,
Manuela Thomae131, David Torres132, Jakub Traczyk124,
Joshua M. Tybur133, Adrienn Ujhelyi88, Robbie C. M. van Aert99,
Marcel A. L. M. van Assen99, Marije van der Hulst134,
Paul A. M. van Lange133, Anna Elisabeth van ’t Veer135,
Alejandro Vásquez-Echeverría136, Leigh Ann Vaughn137,
Alexandra Vázquez66, Luis Diego Vega104, Catherine Verniers138,
Mark Verschoor139, Ingrid P. J. Voermans4, Marek A. Vranka140,
Cheryl Welch93, Aaron L. Wichman141, Lisa A. Williams142,
Michael Wood131, Julie A. Woodzicka143, Marta K. Wronska19,
Liane Young144, John M. Zelenski145, Zeng Zhijia146, and
Brian A. Nosek55,147
1Laboratoire Inter-universitaire de Psychologie, Personnalité, Cognition, Changement Social (LIP/PC2S),
Université Grenoble Alpes; 2Department of Philosophy, Sociology, Education and Applied Psychology, University
of Padua; 3Behavioural Science Institute, Radboud University Nijmegen; 4School of Pedagogical and Educational
Sciences, Radboud University Nijmegen; 5Department of Social Psychology, Tilburg University; 6Department of
Industrial Psychology and People Management, University of Johannesburg; 7Department of Psychology, The
Pennsylvania State University; 8Department of Psychology, Yasar University; 9Department of International
Studies, American University of Sharjah; 10Center for Advanced Hindsight, Duke University; 11College of Business
and Economics, United Arab Emirates University; 12Department of Management, Faculty of Business
Administration, University of Economics, Prague; 13Erivan K. Haub School of Business, Saint Joseph’s University;
14Institute of Psychology, ELTE Eötvös Loránd University; 15Psychological and Social Sciences Program,
Pennsylvania State University Abington; 16Department of Psychology, California State University San Marcos;
17Warsaw Faculty of Psychology, SWPS University of Social Sciences and Humanities; 18Department of General
and Applied Psychology, University of Jos; 19Sopot Faculty of Psychology, SWPS University of Social Sciences
and Humanities; 20Department of Psychology, University of Potsdam; 21Centre for Languages and Literature, Lund
University; 22Institute of Psychology, Chinese Academy of Sciences; 23Louvain Research Institute in Management
and Organizations (LouRIM), Université catholique de Louvain; 24Center on Consumers and Marketing Strategy
(CCMS), Université catholique de Louvain; 25Social Behavior Research Centre, Wroclaw Faculty of Psychology,
SWPS University of Social Sciences and Humanities; 26Department of Psychology, Brooklyn College & Graduate
Center, CUNY; 27Facultad de Psicologia, Universidad del Desarrollo; 28Centro de Apego y Regulacion Emocional,
Universidad del Desarrollo; 29Institute for Social Research, University of Michigan; 30Mathematica Policy Research,
Princeton, New Jersey; 31Institute of Ethnology, Academia Sinica; 32Department of Psychology, National Taiwan
University; 33Department of Psychology, Poitiers University; 34CNRS Unité Mixte de Recherche 7295, Poitiers,
France; 35Division of Social Science, The Hong Kong University of Science and Technology; 36Department of
Psychology, HELP University; 37Department of Psychology, University of Hawaii at Manoa; 38Directorate of
Psychology and Public Health, University of Salford; 39Department of Psychology, Virginia Commonwealth
University; 40Department of Psychology, Azusa Pacific University; 41Department of Psychology, University of
Florida; 42Department of Psychology, Grand Valley State University; 43Department of Psychology, Harvard
University; 44Department of Psychology, Middle East Technical University; 45Department of Psychology,
Wittenberg University; 46Leadership and Human Resource Management, WHU – Otto Beisheim School of
Management; 47Department of Psychology, San Diego State University; 48Institute for Computing and Information
Sciences, Radboud University Nijmegen; 49Tilburg Institute for Behavioral Economics Research, Tilburg
University; 50Department of Psychology, Başkent University; 51School of Psychology, The University of
Queensland; 52Office of Institutional Diversity, Allegheny College; 53Department of Psychology, Yale University;
54School of Applied Human Sciences, University of KwaZulu-Natal; 55Department of Psychology, University of
Virginia; 56Department of Psychology, Rochester Institute of Technology; 57Facultad de Psicología, Universidad
Nacional Autónoma de México; 58Shanghai Intercultural Institute, Shanghai International Studies University;
59Departamento de Psicología, Universidad de los Andes, Colombia; 60Department of Political and Juridical
Sciences and International Studies, University of Padua; 61Department of Marketing and International Business,
Baruch College, CUNY; 62Department of Organisation and Personnel Management, Rotterdam School of
Management, Erasmus University; 63Lazaridis School of Business and Economics, Wilfrid Laurier University;
64Educational Measurement, Leibniz Institute for Educational Trajectories, Bamberg, Germany; 65Institute of
Education and Psychology, Johannes Kepler University Linz; 66Departamento de Psicología Social y de las
Organizaciones, Universidad Nacional de Educación a Distancia; 67Escuela de Psicología, Pontificia Universidad
Católica de Chile; 68Eccles School of Business, University of Utah; 69Psychology, Pacific Lutheran University;
70Department of Experimental Clinical and Health Psychology, Ghent University; 71Institute of Psychology,
Faculty of Social and Political Sciences, University of Lausanne; 72Amity Institute of Psychology and Allied
Sciences, Amity University; 73Department of Psychology, Northumbria University; 74Department of Psychology,
William Paterson University; 75Department of Psychology, University of Michigan; 76Smith Child Health Research,
Outreach, and Advocacy Center, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois;
77Department of Psychological & Brain Sciences, Texas A&M University; 78Department of Economics and
Management, Faculty of Social and Economic Studies, Jan Evangelista Purkyne University; 79Department of
Psychology, Loyola University Chicago; 80Department of Science and Mathematics, Texas A&M University-San
Antonio; 81Department of Psychology, University of Toronto Scarborough; 82Department of Psychology, Lund
University; 83Department of Social Science and Policy Studies, Worcester Polytechnic Institute; 84Department of
Psychology, University of Amsterdam; 85Department of Management, London School of Economics and Political
Science; 86Department of Business Administration, Istanbul Bilgi University; 87Department of Social and Work
Psychology, Institute of Psychology, University of Brasilia; 88Department of Social Psychology, ELTE Eötvös
Loránd University; 89Department of Psychology, Faculty of Philosophy, University of Belgrade; 90Department of
Work, Organizational and Media Psychology, Johannes Kepler University Linz; 91Department of Psychology &
Special Education, Texas A&M University-Commerce; 92International Victimology Institute Tilburg, Tilburg
University; 93Department of Psychology, James Madison University; 94School of Innovation Science, Eindhoven
University of Technology; 95Institute of Psychology, Faculty of Philosophy, University of Belgrade; 96Department
of Cognitive Science, Occidental College; 97Department of Communication, Cornell University; 98Department of
Psychology, University of Porto; 99Department of Methodology and Statistics, Tilburg University; 100Department of
Education Policy Planning and Administration, Faculty of Education, Open University of Tanzania; 101Department
of Occupational Therapy, Tufts University; 102Faculty of Media and Communications, Singidunum University;
103Institute of Criminological and Sociological Research, Belgrade, Serbia; 104Department of Psychology,
Universidad Latina de Costa Rica; 105Centre for Applied Cross-Cultural Research, Victoria University of
Wellington; 106Department of Psychology, McDaniel College; 107Melbourne School of Psychological Sciences, The
University of Melbourne; 108Department of Clinical, Neuro- and Developmental Psychology, Vrije Universiteit
Amsterdam; 109Amsterdam Public Health Research Institute, Amsterdam, The Netherlands; 110Department of
Psychology, University of Central Florida; 111Department of Psychology, George Fox University; 112Department of
Psychology, Doshisha University; 113Li Ka Shing Institute of Professional and Continuing Education (LiPACE), The
Open University of Hong Kong; 114Department of Psychology, Stanford University; 115Institute for Psychological
Research, University of Costa Rica; 116Department of Communication and Cognition, Tilburg University;
117Department of Psychology, California State University, Northridge; 118School of Psychology, University of
Sussex; 119BehaviourWorks Australia, Monash Sustainable Development Institute, Monash University;
120Department of Computer Science, University of Chicago; 121Department of Psychology, Southern Illinois
University Carbondale; 122Department of Psychology, Ludwig-Maximilians-Universität München; 123Institute of
Psychology, Jagiellonian University in Kraków; 124Wroclaw Faculty of Psychology, SWPS University of Social
Sciences and Humanities; 125Center for Military Psychiatry and Neuroscience, Walter Reed Army Institute of
Research, Silver Spring, Maryland; 126Faculty of Psychology and Educational Science and Physical Education,
University of Regensburg; 127Occupational & Organisational Psychology and Professional Learning, KU Leuven;
128Department of Psychology, University of Huddersfield; 129Institute of Psychology, University of Graz;
130Department of Psychology, Northeastern University; 131Department of Psychology, University of Winchester;
132Department of Psychology, Universidad de Iberoamerica; 133Department of Experimental and Applied
Psychology, Vrije Universiteit Amsterdam; 134Department of Obstetrics and Gynaecology, Erasmus MC,
Rotterdam, The Netherlands; 135Methodology and Statistics Unit, Institute of Psychology, Leiden University;
136Centro de Investigación Básica en Psicología, Universidad de la República; 137Department of Psychology,
Ithaca College; 138Institute of Psychology, Paris Descartes University - Sorbonne Paris Cité; 139Department of
Social Psychology, University of Groningen; 140Department of Psychology, Faculty of Arts, Charles University;
141Department of Psychological Science, Western Kentucky University; 142School of Psychology, University of
New South Wales; 143Department of Psychology, Washington and Lee University; 144Department of Psychology,
Boston College; 145Department of Psychology, Carleton University; 146Zhejiang University of Finance and
Economics; and 147Center for Open Science, Charlottesville, Virginia
Abstract
We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were
peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was
administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories.
Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided
evidence of a statistically significant effect in the same direction as the original finding. With a strict significance
criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-
powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded
effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings
and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%)
were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant
heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest
overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to
this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had
tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little
heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered
in lab versus online. Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized,
rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness
scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being
studied than to the sample or setting in which it was studied.
Keywords
social psychology, cognitive psychology, replication, culture, individual differences, sampling effects, situational
effects, meta-analysis, Registered Report, open data, open materials, preregistered
Received 9/17/17; Revision accepted 10/10/18
Suppose a researcher, Josh, conducts an experiment
and finds that academic performance is reduced among
participants who experience threat compared with
those in a control condition. Another researcher, Nina,
conducts the same study at her institution and finds no
effect. Person- and situation-based explanations of the
discrepancy may come to mind immediately: Nina may
have used a sample that differed in important ways
from Josh’s sample, and the situational context in Nina’s
lab might have differed in theoretically important but
nonobvious ways from the context in Josh’s lab. Both
explanations could be true. A less interesting, but real,
possibility is that one of the researchers made an error
in design or procedure that the other did not. Finally,
it is possible that the different results are a function of
sampling error: Nina’s result could be a false negative,
or Josh’s result could be a false positive. The present
research provides evidence toward understanding the
contribution of variation in samples and settings to
observed variation in psychological effects.
Accounting for Variation in Effects:
Person and Situation Variation, or
Sampling Error?
There is a body of research providing evidence that
experimental effects are influenced by variation in person
characteristics and experimental context (Lewin, 1936;
Ross & Nisbett, 1991). For example, people tend to attri-
bute behavior to characteristics of the person rather than
characteristics of the situation (e.g., Gilbert & Malone,
1995; Jones & Harris, 1967), but some evidence suggests
that this effect is stronger in Western than in Eastern
cultures (Miyamoto & Kitayama, 2002). A common model
of investigating psychological processes is to identify an
effect and then investigate moderating influences that
make the effect stronger or weaker. Therefore, when
similar experiments yield different outcomes, the readily
available conclusion is that a moderating influence
accounts for the difference. However, if effects vary less
across samples and settings than is assumed in the psy-
chological literature, then the assumptions of moderation
may be overapplied and the role of sampling error may
be underestimated.
If effects are highly variable across samples and set-
tings, then variation in effect sizes will routinely exceed
what would be expected to result from sampling error. In
this circumstance, the lack of consistency between Josh’s
and Nina’s results is unlikely to influence beliefs about
the original effect. Moreover, if there are many influential
factors, then it is difficult to isolate moderators and iden-
tify the conditions necessary to obtain the effect. In this
case, the lack of consistency between Josh’s and Nina’s
results might produce collective indifference—there are
just too many variables to know why there was a differ-
ence, so the different results produce no change in under-
standing of the phenomenon.
Alternatively, variations in effect sizes may not
exceed what would be expected to result from sampling
error. In this case, observed differences in effects do
not indicate moderating influences of sample or setting.
Rather, imprecision in estimation is the sole source of
variation and requires no causal explanation.
In the case of Josh’s and Nina’s results, it is not nec-
essarily easy to assess whether the inconsistency is due
to sampling error or moderation, especially if their stud-
ies had small samples (Morey & Lakens, 2016). With
small samples, Josh’s positive result and Nina’s null
result will likely have confidence intervals that overlap
each other, so that one can conclude little other than
that “more data are needed.”
The difference between these interpretations regard-
ing the source of the inconsistency is substantial, but
there is little direct evidence regarding the extent to
which persons and situations—samples and settings—
influence the size of psychological effects in general
(but see Coppock, in press; Krupnikov & Levine, 2014;
Mullinix, Leeper, Druckman, & Freese, 2015). The
default assumption is that psychological effects are
awash in interactions among many variables. The pres-
ent report follows up on initial evidence from the Many
Labs projects (Ebersole et al., 2016; Klein et al., 2014a).
The first Many Labs project (Klein et al., 2014a) repli-
cated 13 classic and contemporary psychological effects
with 36 different samples and settings (N = 6,344). The
results showed that (a) variation in sample and setting
had little impact on observed effect magnitudes; (b)
when there was variation in effect magnitude across
samples, it occurred in studies with large effects, not
in studies with small effects; and (c) overall, effect-size
estimates were more related to the effect studied than
to the sample or setting in which it was studied, includ-
ing the nation in which the data were collected and
whether they were collected in the lab or over the Web.
A limitation of the first Many Labs project is that it
included a small number of effects and there was no
reason to presume that they varied substantially across
samples and settings. It is possible that the included
effects are more robust and homogeneous than typical
behavioral phenomena, or that the populations were
more homogeneous than initially expected. The present
research substantially expanded the first Many Labs
study design by including (a) more effects, (b) some
effects that are presumed to vary across samples or
settings, (c) more labs, and (d) diverse samples. The
effects were not randomly selected, nor are they rep-
resentative, but they do cover a wide range of topics.
This study provides preliminary evidence for the extent
to which variation in effect magnitude is attributable to
sample and setting, as opposed to sampling error.
Other Influences on Observed Effects
Across systematic replication efforts in the social-
behavioral sciences, there is accumulating evidence that
replication of published effects is less frequent than
might be expected, and that replication effect sizes are
typically smaller than original effect sizes (Camerer et al.,
2016; Camerer et al., 2018; Ebersole et al., 2016; Klein
et al., 2014a; Open Science Collaboration, 2015). For
example, Camerer et al. (2018) successfully replicated
13 of 21 social science studies published in Science and
Nature. Among the failures to replicate, the average
effect size was approximately 0, but even among the
successful replications, the average effect size was about
75% of what was observed in the original experiments.
Failures to replicate can be due to errors in the replica-
tion or to unanticipated moderation by changes in sam-
ple and setting, as we investigated in the project reported
here. They can also occur because of pervasive low-
powered research plus publication bias that favors posi-
tive over negative results (Button et al., 2013; Cohen,
1962; Greenwald, 1975; Rosenthal, 1979) and because of
questionable research practices, such as p-hacking, that
can inflate the likelihood of obtaining false positives
(John, Loewenstein, & Prelec, 2012; Simmons, Nelson,
& Simonsohn, 2011). These other reasons for failure to
replicate, which can also contribute to replication effect
sizes being weaker than those originally observed, were
not investigated directly in the present research.
Origins of the Study Design
To obtain a list of candidate effects for this project, we
held a round of open nominations, inviting submission
of any effect that fit the defined criteria (see the Coor-
dinating Proposal, available at https://osf.io/uazdm/).
Those nominations were supplemented by ideas from
the project team and by suggestions received in
response to direct queries sent to independent experts
in psychological science.
The nominated studies were evaluated individually
on the following criteria: (a) feasibility of implementa-
tion through a Web browser, (b) brevity of study pro-
cedures (shorter procedures were desired), (c) number
of citations (more citations desired), (d) identifiability
of a meaningful two-condition experimental design or
simple correlation as the target of replication (with
experiments favored), (e) general interest value of the
effect, and (f) applicability to samples of adults. The
nominated studies were also evaluated collectively to
ensure diversity on several criteria. Specifically, we
wanted to include (a) both effects that had demon-
strated replicability across multiple samples and set-
tings and others that had not been examined across
multiple samples and settings,1 (b) both effects that
were known to be sensitive to sample or setting and
others for which variation was unknown or assumed
to be minimal, (c) both classic and contemporary
effects, (d) effects covering a broad range of topical
areas in social and cognitive psychology, (e) effects
observed in studies conducted by a variety of
research groups, and (f) effects that had been pub-
lished in diverse outlets.
More than 100 effects were nominated as potentially
fitting these criteria. A subset of the project team reviewed
these effects with the aim of maximizing the number of
included effects and the diversity of the total slate on
these criteria. No specific researcher’s work was selected
for replication because of beliefs or concerns about the
researcher or the effects he or she had reported, but
some topical areas and authors were included more than
once because they provided short, simple, interesting
effects that met the selection criteria.
Once an effect was selected for inclusion, a member
of the research team contacted the corresponding
author (if he or she was alive) to obtain original study
materials and get advice about adapting the procedure
for this use. In particular, original authors were asked
if there were moderators or other limitations to obtain-
ing the targeted result that would be useful for the team
to understand in advance and, perhaps, anticipate in
data collection.
In some cases, correspondence with the original
authors identified limitations of the selected effect that
reduced its applicability for the present design. In those
cases, we worked with the original authors to identify
alternative studies or decided to remove the effect
entirely from the selected set and replace it with one
of the available alternatives.
We split the studies into two slates that would require
about 30 min each for participants to complete. We
included 32 effects in total before peer review and pilot
testing. In only one instance did the original authors
express strong concerns about their effect being
included in this project. Because we make no claim
about the sample of studies being randomly selected
or representative, we removed that effect from the proj-
ect. With 31 effects remaining, we pilot-tested both
slates, with the authors and members of their labs as
participants, to ensure that each slate could be com-
pleted within 30 min. We observed that we underesti-
mated the time required for the tasks needed to test a
few effects. As a consequence, we had to remove three
effects (i.e., those originally reported by Ashton-James,
Maddux, Galinsky, & Chartrand, 2009; Srull & Wyer,
1979; and Todd, Hanko, Galinsky, & Mussweiler, 2011),
shorten or remove a few individual difference mea-
sures, and slightly reorganize the slates. The final set
comprised 28 effects, which were divided between the
slates to balance them on the criteria listed earlier and
to avoid substantial overlap in topics within a slate (for
a list of the effects in each slate, along with citation
counts for the original publications, see Table A1 in the
appendix).
Following the Registered Report model (Nosek &
Lakens, 2014), prior to data collection we submitted the
materials and protocols to formal peer review in a pro-
cess conducted by this journal’s Editor.
Disclosures
Preregistration
The accepted design was preregistered on the Open
Science Framework (OSF), at https://osf.io/ejcfw/.
Data, materials, and online resources
Comprehensive materials, data, and supplementary
information about the project are available at https://
osf.io/8cd4r/. Deviations from the preregistered descrip-
tion of the project and its implementation are recorded
in supplementary materials at https://osf.io/7mqba/.
Changes to analysis plans are noted with justification,
and results of the original and revised analytic
approaches are compared, in supplementary materials
at https://osf.io/4rbh9/. Table 1 provides a summary of
known differences from the original studies and changes
in the analysis plan. A guide to the data-analysis code
is available at https://manylabsopenscience.github.io/.
Measures
We report how we determined our sample size, all data
exclusions, all manipulations, and all measures in the
study.
Ethical approval
This research was conducted in accordance with the
Declaration of Helsinki and followed local requirements
for the institutional review board’s approval at each of
the data-collection sites.
Table 1. Summary of Differences From the Original Studies and Changes to the Preregistered Analysis Plan

1. Cardinal direction and socioeconomic status (Huang, Tse, & Cho, 2014)
   Known differences from the original study: Study was administered online rather than with paper and pencil, and the effect of the orientation difference was tested by using tablets at some sites.
   Change to analysis plan: None.

2. Structure promotes goal pursuit (Kay, Laurin, Fitzsimons, & Landau, 2014)
   Known differences from the original study: None known.
   Change to analysis plan: None.

3. Disfluency engages analytic processing (Alter, Oppenheimer, Epley, & Eyre, 2007)
   Known differences from the original study: Study was administered online rather than with paper and pencil.
   Change to analysis plan: None.

4. Moral foundations of liberals versus conservatives (Graham, Haidt, & Nosek, 2009)
   Known differences from the original study: The political-ideology item was changed to use regionally appropriate terms for the left and right in place of the U.S.-centric terms “liberal” and “conservative”; the analysis strategy was simplified.
   Change to analysis plan: None.

5. Affect and risk (Rottenstreich & Hsee, 2001)
   Known differences from the original study: The study was administered online, but the original study may have used paper and pencil.
   Change to analysis plan: None.

6. Consumerism undermines trust (Bauer, Wilkie, Kim, & Bodenhausen, 2012)
   Known differences from the original study: None known.
   Change to analysis plan: None.

7. Correspondence bias (Miyamoto & Kitayama, 2002)
   Known differences from the original study: The study was administered online rather than with paper and pencil; the names and location referred to in the materials were altered to be familiar to each sample; the essay prompt was changed to match the legal status of capital punishment in the nation; a minimum 10-s delay before advancing to the next task was added to increase the likelihood of reading the essay; the low-diagnosticity condition was removed.
   Change to analysis plan: None.

8. Disgust sensitivity predicts homophobia (Inbar, Pizarro, Knobe, & Bloom, 2009)
   Known differences from the original study: The 5-item Contamination Disgust subscale of the modern 25-item Disgust Scale–Revised (DS-R; Olatunji et al., 2007) was used instead of the original 8-item measure.
   Change to analysis plan: None.

9. Influence of incidental anchors on judgment (Critcher & Gilovich, 2008)
   Known differences from the original study: The study was administered online rather than with paper and pencil, and the effect of this difference was tested by using paper and pencil at 11 sites; markets were matched to the location of data collection; the pictures of the smartphones were updated.
   Change to analysis plan: None.

10. Social value orientation and family size (Van Lange, Otten, De Bruin, & Joireman, 1997)
   Known differences from the original study: The study was administered online rather than with paper and pencil; social value orientation was measured with a modern scale instead of the original categorical measure.
   Change to analysis plan: None.

11. Trolley Dilemma 1: principle of double effect (Hauser, Cushman, Young, Jin, & Mikhail, 2007)
   Known differences from the original study: A subset of the scenarios was used.
   Change to analysis plan: Fisher’s exact test was used instead of chi-square, to obtain two-sided results in which negative values indicated an effect opposite the original.

12. Sociometric status and well-being (Anderson, Kraus, Galinsky, & Keltner, 2012)
   Known differences from the original study: The high- and low-socioeconomic-status conditions were removed.
   Change to analysis plan: None.

13. False consensus: supermarket scenario (Ross, Greene, & House, 1977)
   Known differences from the original study: The study was administered online, but the original study likely used paper and pencil.
   Change to analysis plan: None.

14. False consensus: traffic-ticket scenario (Ross et al., 1977)
   Known differences from the original study: The study was administered online, but the original study likely used paper and pencil.
   Change to analysis plan: None.

15. Vertical position and power (Giessner & Schubert, 2007)
   Known differences from the original study: The salary of the hypothetical manager was converted to local currency and adjusted to be relevant for each sample.
   Change to analysis plan: None.

16. Effect of framing on decision making (Tversky & Kahneman, 1981)
   Known differences from the original study: The study was administered online, but the original study likely used paper and pencil; dollar amounts were adjusted, and consumer items were replaced to be appropriate for 2014; currency was converted and adjusted to be relevant for each sample.
   Change to analysis plan: Fisher’s exact test was used instead of chi-square, to obtain two-sided results in which negative values indicated an effect opposite the original.

17. Trolley Dilemma 2: principle of double effect (Hauser et al., 2007)
   Known differences from the original study: A subset of the scenarios was used.
   Change to analysis plan: Fisher’s exact test was used instead of chi-square, to obtain two-sided results in which negative values indicated an effect opposite the original.

18. Reluctance to tempt fate (Risen & Gilovich, 2008)
   Known differences from the original study: The study was administered online, but the original study likely used paper and pencil; the condition in which the protagonist was not the participant was removed.
   Change to analysis plan: None.

19. Construing actions as choices (Savani, Markus, Naidu, Kumar, & Berlia, 2010)
   Known differences from the original study: The study was administered online, but the original study may have used paper and pencil; a separate effect size was estimated for each sample.
   Change to analysis plan: Asymptotic rather than exact, noncentral confidence intervals were calculated.

20. Preferences for formal versus intuitive reasoning (Norenzayan, Smith, Kim, & Nisbett, 2002)
   Known differences from the original study: Participants categorized objects by selecting from a multiple-choice list; random assignment to condition was balanced (assignment in the original study was 2/3:1/3); the practice trial was removed.
   Change to analysis plan: None.

21. Less-is-better effect (Hsee, 1998)
   Known differences from the original study: The study was administered online, but the original study may have used paper and pencil; currency was converted and adjusted to be relevant for each sample.
   Change to analysis plan: None.

22. Moral typecasting (Gray & Wegner, 2009)
   Known differences from the original study: The study was administered online, but the original study may have used paper and pencil.
   Change to analysis plan: None.

23. Moral violations and desire for cleansing (Zhong & Liljenquist, 2006)
   Known differences from the original study: The study was administered online rather than with paper and pencil; participants typed rather than hand-copied an adapted version of the story; the study was purported to be measuring both personality and typing speed.
   Change to analysis plan: None.

24. Assimilation and contrast effects in question sequences (Schwarz, Strack, & Mai, 1991)
   Known differences from the original study: The study was administered online rather than with paper and pencil.
   Change to analysis plan: None.

25. Effect of choosing versus rejecting on relative desirability (Shafir, 1993)
   Known differences from the original study: The study was administered online rather than with paper and pencil; the order in which the two parents were presented was not counterbalanced.
   Change to analysis plan: Effect size was estimated directly from the key z test rather than with a logistic regression model.

26. Priming “heat” increases belief in global warming (Zaval, Keenan, Johnson, & Weber, 2014)
   Known differences from the original study: The original study began with a question about the current temperature followed by a 10-min delay; this question and the delay were dropped from the replication.
   Change to analysis plan: Participants who made errors in sentence unscrambling were excluded on the recommendation of the original authors.

27. Perceived intentionality for side effects (Knobe, 2003)
   Known differences from the original study: The study was administered online, but the original study may have used paper and pencil; the dependent variable was changed from a “yes”/“no” response to a 7-point agreement scale.
   Change to analysis plan: None.

28. Directionality and similarity (Tversky & Gati, 1978)
   Known differences from the original study: The study was administered online, but the original study likely used paper and pencil; nations were updated (Ceylon to Sri Lanka, West Germany to Germany, and U.S.S.R. to Russia).
   Change to analysis plan: Additional mixed models were conducted (see the supplemental information at https://osf.io/4rbh9/).

Note: Additional descriptions and supplementary analyses are available in Supplementary Notes (https://osf.io/4rbh9/). Full descriptions of known
differences from the original studies are provided in the preregistered protocol at https://osf.io/ejcfw/; for example, the protocol makes note of
additional experimental conditions and outcome variables that were part of the original studies but not included in the replications. Differences
from the original studies were suggested by the original authors or reviewed and approved during peer review. In all cases, the replication
samples and settings differed from the original studies. These differences included the fact that the studies were administered sequentially in a
slate in the replication project. The order effect is evaluated directly in the Results section.

Method

Participants

An open invitation to participate as a data-collection
site in Many Labs 2 was issued in early 2014. To be
eligible for inclusion, labs had to agree to administer
their assigned study procedure to at least 80 partici-
pants and to collect data from as many as was feasible.
Labs decided to stop data collection on the basis of
their access to participants and time constraints. None
had opportunity to observe the outcomes prior to the
conclusion of data collection. All contributors who met
the design and data-collection requirements received
authorship on this final report. Upon completion of
data collection, there were 125 total samples (64 for
Slate 1 and 61 for Slate 2; 15 sites collected data for
both slates), and the cumulative sample size was 15,305
(mean n = 122.44, median = 99, SD = 92.71, range =
16–841).
For 79 samples, data were collected in person (typi-
cally in the lab, though tasks were completed on the
Internet), and for 46 samples, data collection was
entirely Web based. Thirty-nine of the samples were
from the United States, and the 86 others were from
Australia (n = 2); Austria (n = 2); Belgium (n = 2); Brazil
(n = 1); Canada (n = 4); Chile (n = 3); China (n = 5);
Colombia (n = 1); Costa Rica (n = 2); the Czech Repub-
lic (n = 3); France (n = 2); Germany (n = 4); Hong
Kong, China (n = 3); Hungary (n = 1); India (n = 5);
Italy (n = 1); Japan (n = 1); Malaysia (n = 1); Mexico
(n = 1); The Netherlands (n = 9); New Zealand (n = 2);
Nigeria (n = 1); Poland (n = 6); Portugal (n = 1); Serbia
(n = 3); South Africa (n = 3); Spain (n = 2); Sweden
(n = 1); Switzerland (n = 1); Taiwan (n = 1); Tanzania
(n = 2); Turkey (n = 3); the United Arab Emirates (n =
2); the United Kingdom (n = 4); and Uruguay (n = 1).
Details about each site of data collection are available
at https://osf.io/uv4qx/.
Of the participants who responded to demographics
questions in Slate 1, 34.5% were men, 64.4% were
women, 0.3% selected “other,” and 0.8% selected “prefer
not to answer.” The average age for Slate 1 participants
(after excluding responses greater than “100”) was 22.37
(SD = 7.09). Of the participants in Slate 2, 35.9% were
men, 62.9% were women, 0.4% selected “other,” and
0.8% selected “prefer not to answer.” The average age
for Slate 2 participants (after excluding responses
greater than “100”) was 23.34 (SD = 8.28). Variation in
demographic characteristics across the samples is docu-
mented at https://osf.io/g3bza/.
Procedure
The tasks were administered over the Internet for pur-
poses of standardization across locations. At some loca-
tions, participants completed the survey in a lab or
room on computers or tablets, whereas in other loca-
tions, participants completed the survey entirely online
at their own convenience. Surveys were created in
Qualtrics software (qualtrics.com), and a unique link
to run the studies was sent to each data-collection team
so that we could track the origin of data. Each site was
assigned an identifier. These identifiers can be found
under the “source” variable in the public data set (avail-
able at https://osf.io/8cd4r/).
Data were deposited to a central database and ana-
lyzed together. Each team created a video simulation
of study administration to illustrate the features of the
data-collection setting. Labs that used a language other
than English completed a translation of the study mate-
rials and then a back-translation to check that the origi-
nal meaning was retained (cf. Brislin, 1970). Labs
decided themselves the language that was appropriate
for their sample and adapted materials so that the con-
tent would be appropriate for their sample (e.g., some
labs edited monetary units).
Labs were assigned to slates so as to maximize the
national diversity for both slates. If there was only one
lab in a given country, it was randomly assigned to a
slate using a tool available at random.org. If there was
more than one lab for a country, the labs were also
randomly assigned to slates using a tool available at
random.org, but with the constraint that the labs were
evenly distributed across slates as closely as possible
(e.g., two labs in each slate if there were four labs in
that country). Near the beginning of data collection, we
recruited some additional Asian sites specifically for
Slate 1 to increase its sample diversity. The slates were
administered by a single experiment script that began
with informed consent, next presented the appropriate
tasks in an order that was fully randomized across par-
ticipants, then presented the individual difference mea-
sures in randomized order, and closed with demographics
measures and debriefing (see Table A2 in the appendix
for a list of the demographic, data-quality, and individual
difference measures included, with citation counts).
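
The constrained randomization just described can be sketched in a few lines. This is a hypothetical reconstruction for illustration only: the actual assignments were drawn with a tool available at random.org, and the function name, inputs, and seed below are our own assumptions.

```python
import random

def assign_slates(labs_by_country, seed=None):
    """Hypothetical sketch: split each country's labs across two slates as
    evenly as possible; a lone lab in a country goes to either slate at random."""
    rng = random.Random(seed)
    assignment = {}
    for country, labs in labs_by_country.items():
        shuffled = list(labs)
        rng.shuffle(shuffled)            # random order within the country
        start = rng.randrange(2)         # random starting slate for this country
        for i, lab in enumerate(shuffled):
            # alternating after a shuffle keeps slate counts within 1 per country
            assignment[lab] = 1 + (i + start) % 2
    return assignment

# Example: four labs in one country split two per slate; a lone lab is random.
print(assign_slates({"NL": ["lab_a", "lab_b", "lab_c", "lab_d"], "JP": ["lab_e"]}))
```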
Demographics
Demographic information was collected so that we
could characterize each sample and explore possible
moderation. Participants were free to decline to answer
any question.
Age. Participants noted their age in years in an open-
response box.
Sex. Participants selected “male,” “female,” “other,” or
“prefer not to answer” to indicate their biological sex.
Race-ethnicity. Participants indicated their race-ethnicity
by selecting from a drop-down menu populated with
options determined by the lead researcher for each site.
Participants could also select “other” or write an open
response. Note that response items were not standard-
ized, as different countries have very different conceptu-
alizations of race and ethnicity.
Cultural origins. Three items assessed cultural origins.
Each used a drop-down menu populated by a list of
countries or territories and an “other” option with an
open-response box. The three items were as follows: (a)
“In which country/region were you born?”; (b) “In which
country/region was your primary caregiver (e.g., parent,
grandparent) born?”; and (c) “If you had a second pri-
mary caregiver, in which country/region was he or she
born?”
Hometown. All participants were asked to indicate their
hometown (“What is the name of your home town/city?”)
in an open-response box. This item was included for
possible future examination as a potential moderator of
Huang, Tse, and Cho’s (2014) effect.
Location of wealth in hometown. Another item asked,
“Where do wealthier people live in your home town/
city?” The response options were “north,” “south,” and
“neither.” This item was included as a potential moderator
of Huang et al.’s (2014) effect and appeared in Slate 1
only.
Political ideology. Participants rated their political ideol-
ogy on a scale with response options of “strongly left-wing,”
“moderately left-wing,” “slightly left-wing,” “moderate,”
“slightly right-wing,” “moderately right-wing,” and “strongly
right-wing.” Instructions were adapted for each country to
ensure this measure’s relevance to the local context. For
example, the U.S. instructions read: “Please rate your politi-
cal ideology on the following scale. In the United States,
‘liberal’ is usually used to refer to left-wing and ‘conserva-
tive’ is usually used to refer to right-wing.”
Education. Participants reported their educational attain-
ment in response to a single item, “What is the highest
educational level that you have attained?” The response
scale was as follows: 1 = no formal education, 2 = com-
pleted primary/elementary school, 3 = completed secondary
school/high school, 4 = some university/college, 5 = com-
pleted university/college degree, 6 = completed advanced
degree.
Socioeconomic status. Socioeconomic status (SES) was
measured with the ladder technique (Adler et al., 1994).
Participants used a ladder with 10 steps to indicate their
standing in the community with which they most identi-
fied relative to other people in that community. On the
ladder, 1 indicated people having the lowest standing in
the community, and 10 referred to people having the
highest standing. Previous research demonstrated that
this item has good convergent validity with objective cri-
teria of individual social status and also good construct
validity with regard to several psychological and physio-
logical health indicators (e.g., Adler, Epel, Castellazzo, &
Ickovics, 2000; S. Cohen et al., 2008). This ladder was also
used as one of the items for Anderson, Kraus, Galinsky,
and Keltner’s (2012, Study 3) effect in Slate 1. Participants
in that slate answered the ladder item as part of the mate-
rials for that effect and did not receive the item a second
time.
Data quality
Recent research on careless responding or insufficient
effort in responding has suggested that there is a need
to refine implementation of established scales embed-
ded in data collection to check for aberrant response
patterns (Huang et al., 2014; Meade & Craig, 2012). As
a check on data quality, we included two items at the
end of the study, just prior to the demographic items.
The first item asked participants, “In your honest opin-
ion, should we use your data in our analyses in this
study?” and had “yes” and “no” as response options
(Meade & Craig, 2012). The second item was an instruc-
tional manipulation check (Oppenheimer, Meyvis, &
Davidenko, 2009), in which an ostensibly simple demo-
graphic question (“Where are you completing this
study?”) was preceded by a long block of text that
contained, in part, alternative instructions for partici-
pants to follow to demonstrate that they were paying
attention (“Instead, simply check all four boxes and
then press ‘continue’ to proceed to the next screen”).
Individual difference measures
The following individual difference measures were
included to allow future tests of effect-size moderation.
Cognitive reflection. The cognitive-reflection task (CRT;
Frederick, 2005) assesses individuals’ ability to suppress
an intuitive (wrong) response in favor of a deliberative
(correct) answer. The items on the original CRT are
widely known, and the measure is vulnerable to practice
effects (Chandler, Mueller, & Paolacci, 2014). Therefore,
we used an updated version that is logically equivalent
and correlates highly with the items on the original CRT
(Finucane & Gullion, 2010). The three items are (a) “If it
takes 2 nurses 2 minutes to measure the blood pressure
of 2 patients, how long would it take 200 nurses to mea-
sure the blood pressure of 200 patients?”; (b) “Soup and
salad cost $5.50 in total. The soup costs a dollar more
than the salad. How much does the salad cost?”; and (c)
“Sally is making tea. Every hour, the concentration of the
tea doubles. If it takes 6 hours for the tea to be ready,
how long would it take for the tea to reach half of the
final concentration?” Also, we constrained the total time
available to answer the three questions to 75 s. This likely
lowered overall performance on average, as it was some-
what less time than some participants took in pretesting.
Subjective well-being. Subjective well-being was mea-
sured with a single item: “All things considered, how sat-
isfied are you with your life as a whole these days?” The
response scale ranged from 1, dissatisfied, to 10, satisfied.
Similar items have been included in numerous large-scale
social surveys (cf. Veenhoven, 2009) and have shown sat-
isfactory reliability (e.g., Lucas & Donnellan, 2012) and
validity (Cheung & Lucas, 2014; Oswald & Wu, 2010;
Sandvik, Diener, & Seidlitz, 1993).
Global self-esteem. Global self-esteem was measured
using the Single-Item Self-Esteem Scale (SISE; Robins, Hendin,
& Trzesniewski, 2001), which was designed as an alterna-
tive to the Rosenberg (1965) Self-Esteem Scale. The SISE
consists of a single item: “I have high self-esteem.” Partici-
pants respond on a 5-point Likert scale, ranging from 1,
not very true of me, to 5, very true of me. Robins et al.
reported that the SISE has strong convergent validity with
the Rosenberg Self-Esteem Scale among adults (rs rang-
ing from .70 to .80) and that the SISE and Rosenberg Self-
Esteem Scale have similar predictive validity.
Big Five personality. The five basic traits of human per-
sonality (Goldberg, 1981)—conscientiousness, agreeable-
ness, neuroticism (emotional stability), openness (intellect),
and extraversion—were measured with the Ten-Item Per-
sonality Inventory (Gosling, Rentfrow, & Swann, 2003). Each
trait was assessed with two items answered on response
scales from 1, disagree strongly, to 7, agree strongly. The five
scales have satisfactory retest reliability (cf. Gnambs, 2014)
and substantial convergent validity with longer Big Five
instruments (e.g., Ehrhart et al., 2009; Gosling et al., 2003;
Rojas & Widiger, 2014).
Mood. There exist many assessments of mood. We selected
the single item from G. L. Cohen et al. (2007): “How would
you describe your mood right now?” The response options
are as follows: 1 = extremely bad, 2 = bad, 3 = neutral, 4 =
good, 5 = extremely good.
Disgust sensitivity. To measure disgust sensitivity, we
used the Contamination Disgust subscale of the Disgust
Scale–Revised (DS-R; Olatunji et al., 2007), a 25-item revi-
sion of the original Disgust Sensitivity Scale (Haidt,
McCauley, & Rozin, 1994). The subscales of the DS-R
were determined by factor analysis. The Contamination
Disgust subscale includes 5 items related to concerns
about bodily contamination. Because of length consider-
ations, this subscale was included only in Slate 1, for
Inbar, Pizarro, Knobe, and Bloom’s (2009, Study 1) effect.
No part of the DS-R appeared in Slate 2.
The 28 Effects
Before presenting the main results for heterogeneity
across samples and settings, we discuss each of the 28
selected effects. For each effect, we summarize the main
idea of the original research, provide the sample size,
and present the inferential test and effect size that were
the target for replication. Then, we summarize the
aggregate result of the replication. For these aggregate
tests, we pooled the data of all available samples, ignor-
ing sample origin. An aggregate result was labeled con-
sistent with the original finding if the effect was
statistically significant and in the same direction as in
the original study. The vast majority of the original stud-
ies were conducted in a Western, educated, industrial-
ized, rich, democratic (i.e., WEIRD) society (Henrich,
Heine, & Norenzayan, 2010). For the four original stud-
ies that focused on cultural differences, we present the
replication results such that positive effect sizes cor-
respond to the direction of the effect that had been
observed in the original WEIRD sample. Our main rep-
lication result is the aggregate effect size regardless of
cultural context. Whether effects varied by setting (or
cultural context more generally) was examined in the
heterogeneity analyses reported in the Results section.
Heterogeneity was assessed using the Q, tau, and I²
measures (Borenstein, Hedges, Higgins, & Rothstein,
2009). If there was opportunity to test the original cul-
tural difference with similar samples, we did so, and
these additional results are reported in this section. If
the original authors anticipated moderating influences
that could affect comparison of the original and replica-
tion effect sizes, then we also report those analyses.
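
Because the Q, tau, and I² statistics carry the main heterogeneity conclusions, a minimal sketch of how they are conventionally computed may help; the project's released analysis code (linked above) is authoritative, and the function below is our illustration using the standard DerSimonian-Laird moment estimator, assuming per-sample effect estimates y with sampling variances v.

```python
import numpy as np

def heterogeneity(y, v):
    """Cochran's Q, tau, and I^2 for per-sample effect estimates y
    with sampling variances v (DerSimonian-Laird moment estimator)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                              # inverse-variance weights
    y_bar = np.sum(w * y) / np.sum(w)        # fixed-effect weighted mean
    Q = np.sum(w * (y - y_bar) ** 2)         # Cochran's Q
    df = len(y) - 1
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)            # between-sample variance
    I2 = 100.0 * max(0.0, (Q - df) / Q) if Q > 0 else 0.0
    return Q, np.sqrt(tau2), I2
```

On this convention, tau is expressed on the same scale as the effect size itself, which is why tau values near .10 and above .20 are read as slight and moderate heterogeneity, respectively.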
Readers interested in the global results of this repli-
cation project may skip this long section detailing each
individual replication and proceed to the section pre-
senting the systematic meta-analyses testing variation
by sample and setting.
Slate 1
1. Cardinal direction and socioeconomic status (Huang
et al., 2014, Study 1a). People in the United States and
Hong Kong have different demographic knowledge that
may shape their metaphoric association between valence
and cardinal direction (north vs. south). One hundred
eighty participants from the United States and Hong Kong
participated in Huang et al.’s (2014) Study 1a. They were
presented with a blank map of a fictional city and were
randomly assigned to indicate on the map where either a
high-SES or a low-SES person might live. There was an
interaction between SES (high vs. low) and population
(United States vs. Hong Kong), F(1, 176) = 20.39, MSE =
5.63, p < .001, ηp² = .10, d = 0.68, 95% confidence inter-
val (CI) = [0.38, 0.98]. U.S. participants expected the
high-SES person to live further north (M = 0.98, SD =
1.85) than the low-SES person (M = −0.69, SD = 2.19),
t(78) = 3.69, p < .001, d = 0.83, 95% CI = [0.37, 1.28]. Con-
versely, Hong Kong participants expected the low-SES
person to live further north (M = 0.63, SD = 2.75) than
the high-SES person (M = −0.92, SD = 2.47), t(98) =
−2.95, p = .004, d = −0.59, 95% CI = [−0.99, −0.19]. The
authors explained that wealth in Hong Kong is concen-
trated in the south of the city, and wealth in cities in the
United States is more commonly concentrated in the
north of the city. As a consequence, members of these
cultures differ in their assumptions about the concentra-
tion of wealth in fictional cities.
Replication. The coordinates of participants’ clicks on
the fictional map were recorded (x, y) from the top left
of the image and then recentered in the analysis such
that clicks in the north half of the map were positive
and clicks in the southern half of the map were negative.
Across all samples (N = 6,591), participants in the high-
SES condition (M = 11.70, SD = 84.31) selected a further
north location than did participants in the low-SES con-
dition (M = −22.70, SD = 88.78), t(6554.05) = 16.12, p =
2.15e−57, d = 0.40, 95% CI = [0.35, 0.45].
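
As an illustration of the analysis style used throughout this section, the sketch below recenters click coordinates and computes a Welch t test together with Cohen's d and an approximate 95% CI. It is our reconstruction under stated assumptions (pooled-SD d, normal-approximation CI for d), not the project's analysis code.

```python
import numpy as np
from scipy import stats

def recenter_north(y_pixels, map_height):
    # Clicks were recorded from the top-left corner, so larger y means
    # further south; recenter so that clicks in the north half are positive.
    return map_height / 2.0 - np.asarray(y_pixels, float)

def welch_t_and_cohens_d(group_a, group_b):
    """Welch's two-sided t test plus pooled-SD Cohen's d with an
    approximate 95% CI (illustrative sketch, not the project's code)."""
    x, y = np.asarray(group_a, float), np.asarray(group_b, float)
    t, p = stats.ttest_ind(x, y, equal_var=False)      # Welch correction
    n1, n2 = len(x), len(y)
    pooled_sd = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
                        / (n1 + n2 - 2))
    d = (x.mean() - y.mean()) / pooled_sd
    se_d = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return t, p, d, (d - 1.96 * se_d, d + 1.96 * se_d)
```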
As suggested by the original authors, the focal test
for replicating the effect they found for Western par-
ticipants was completed by selecting only those par-
ticipants, across all samples, who indicated that wealth
tended to be in the north in their hometown. These
participants expected the high-SES person to live fur-
ther north (M = 43.22, SD = 84.43) than the low-SES
person (M = −40.63, SD = 84.99), t(1692) = 20.36, p =
1.24e−82, d = 0.99, 95% CI = [0.89, 1.09]. This result is
consistent with the hypothesis that people reporting
that wealthier people tend to live in the north in their
hometown also guess that wealthier people will tend
to live in the north in a fictional city, and the effect was
substantially larger than that in the sample as a whole.
Follow-up analyses. The original study compared Hong
Kong and U.S. participants. In the replication, Hong Kong
participants expected the high-SES person to live further
south (M = −37.44, SD = 84.29) than the low-SES person
(M = 12.43, SD = 95.03), t(140) = −3.30, p = .001, d =
−0.55, 95% CI = [−0.89, −0.22]. U.S. participants expected
the high-SES person to live further north (M = 41.55, SD =
80.73) than the low-SES person (M = −42.63, SD = 82.41),
t(2199) = 24.20, p = 6.53e−115, d = 1.03, 95% CI = [0.94,
1.12]. This result is consistent with the original finding
that cultural differences in perceived location of wealth
in a fictional city correlated with location of wealth in
participants’ hometown.
Most participants completed the items for this study
on a vertically oriented monitor display as opposed to
a paper survey on a desk, as in the original study. The
original authors suggested a priori that this difference
might be important because associations between “up”
and “good” or between “down” and “bad” might inter-
fere with any associations with “north” and “south.” At
10 data-collection sites (n = 582), we assigned some
participants to complete Slate 1 on Microsoft Surface
tablets resting horizontally on a table. Among the par-
ticipants using the horizontal tablets, those who said
that wealth tended to be in the north in their hometown
(n = 156) expected the high-SES person to live further
north (M = 38.66, SD = 80.43) than the low-SES person
(M = −43.92, SD = 80.32), t(154) = 6.38, p = 1.95e−09,
d = 1.03, 95% CI = [0.69, 1.36]. By comparison, within
this horizontal-tablet group, participants who said that
wealth tended to be in the south in their hometown
(n = 87) expected the high-SES person to live further
south (M = −33.58, SD = 72.89) than the low-SES person
(M = −4.11, SD = 88.33), t(85) = −1.63, p = .11, d = −0.36,
95% CI = [−0.79, 0.08]. The effect sizes for just these
subsamples were very similar to the effect sizes for the
whole sample, which suggests that the orientation of
the display did not moderate this effect.
2. Structure promotes goal pursuit (Kay, Laurin,
Fitzsimons, & Landau, 2014, Study 2). In Study 2 of
Kay etal. (2014), 67 participants generated what they felt
was their most important goal. They then read one of two
scenarios in which a natural event (leaves growing on
trees) was described as being a structured or random
event. For example, in the structured condition, a sen-
tence read, “The way trees produce leaves is one of the
many examples of the orderly patterns created by nature
. . . ,” but in the random condition, the corresponding
sentence read, “The way trees produce leaves is one of
the many examples of the natural randomness that sur-
rounds us. . . .” Next, participants answered three ques-
tions about their most important goal, on a scale from 1,
not very, to 7, extremely. The first item measured the sub-
jective value of the goal, and the other two items mea-
sured willingness to pursue that goal. Participants
exposed to a structured event (M = 5.26, SD = 0.88) were
more willing to pursue their goal compared with those
exposed to a random event (M = 4.72, SD = 1.32), t(65) =
2.00, p = .05, d = 0.49, 95% CI = [0.001, 0.973]. In the
overall replication sample (N = 6,506), participants
exposed to a structured event (M = 5.48, SD = 1.45) were
not significantly more willing to pursue their goal com-
pared with those exposed to a random event (M = 5.51,
SD = 1.39), t(6498.63) = −0.94, p = .35, d = −0.02, 95%
CI = [−0.07, 0.03]. This result does not support the hypoth-
esis that willingness to pursue goals is higher after expo-
sure to structured as opposed to random events.
3. Disfluency engages analytic processing (Alter,
Oppenheimer, Epley, & Eyre, 2007, Study 4). In Study
4, Alter et al. (2007) investigated whether a deliberate,
analytic processing style can be activated by incidental
disfluency cues that suggest task difficulty. Forty-one par-
ticipants attempted to solve syllogisms presented in either
a hard-to-read or an easy-to-read font. The hard-to-read
font served as an incidental induction of disfluency. Par-
ticipants in the hard-to-read-font condition correctly
answered more of the moderately difficult syllogisms (64%)
than did participants in the easy-to-read-font condition (42%),
t(39) = 2.01, p = .051, d = 0.63, 95% CI = [−0.004, 1.25].
Replication. The original study focused on the two
moderately difficult syllogisms among the six adminis-
tered. Our analysis strategy was sensitive to potential dif-
ferences across samples in ability to solve the syllogisms.
We first determined which ones were moderately diffi-
cult for participants by excluding within each sample any
syllogisms that were answered correctly by fewer than
25% of participants or more than 75% of participants in
the two conditions combined. The remaining syllogisms
were used to calculate mean syllogism performance for
each participant.
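As an illustration of this filtering rule, the Python sketch below keeps, within each sample, only the syllogisms answered correctly by between 25% and 75% of participants (conditions combined) and then scores each participant on the retained items; the column names are hypothetical.

# Within each sample, drop syllogisms that were too easy (> 75% correct)
# or too hard (< 25% correct), then compute each participant's mean
# performance on the remaining items. Column names are hypothetical.
import pandas as pd

def moderately_difficult(sample_df):
    # sample_df: one row per participant x syllogism, binary 'correct' column
    rates = sample_df.groupby("syllogism")["correct"].mean()
    keep = rates[(rates >= 0.25) & (rates <= 0.75)].index
    return sample_df[sample_df["syllogism"].isin(keep)]

def score_participants(df):
    filtered = df.groupby("sample", group_keys=False).apply(moderately_difficult)
    return filtered.groupby("participant")["correct"].mean()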
As in Alter et al.'s (2007) experiment, the easy-to-
read font was 12-point black Myriad Web font, and the
hard-to-read font was 10-point 10% gray italicized
Myriad Web font. For a direct comparison with the
original effect size, the original authors suggested that
only English in-lab samples be used for two reasons:
First, we could not adequately control for online par-
ticipants “zooming in” on the page or otherwise making
the font more readable, and second, we anticipated
having to substitute the font in some translated versions
because the original font (Myriad Web) might not sup-
port all languages.2 In this subsample (N = 2,580), the
number of syllogisms answered correctly by partici-
pants in the hard-to-read-font condition (M = 1.10,
SD = 0.88) was similar to the number answered cor-
rectly by participants in the easy-to-read-font condition
(M = 1.13, SD = 0.91), t(2578) = −0.79, p = .43, d = −0.03,
95% CI = [−0.08, 0.01]. In a secondary analysis that mir-
rored the original, we used performance on the same
two syllogisms Alter et al. (2007) focused on. Again,
the number of syllogisms answered correctly by partici-
pants in the hard-to-read-font condition (M = 0.80,
SD = 0.79) was similar to the number answered correctly
by participants in the easy-to-read-font condition (M =
0.84, SD = 0.81), t(2578) = −1.19, p = .23, d = −0.05,
95% CI = [−0.12, 0.03].3 These results do not support
the hypothesis that syllogism performance is higher
when the font is harder to read; the difference between
conditions was slightly in the opposite direction and
not distinguishable from zero (d = −0.03, 95% CI =
[−0.08, 0.01], vs. original d = 0.64).
Follow-up analyses. In the aggregate replication sam-
ple (N = 6,935), the number of syllogisms answered cor-
rectly was similar in the hard-to-read-font condition (M =
1.03, SD = 0.86) and the easy-to-read-font condition (M =
1.06, SD = 0.87), t(6933) = −1.37, p = .17, d = −0.03, 95%
CI = [−0.08, 0.01]. Finally, in the whole sample, an analy-
sis using the same two syllogisms that Alter et al. (2007)
did showed that participants in the hard-to-read-font
condition answered about as many syllogisms correctly
(M = 0.75, SD = 0.76) as participants in the easy-to-read-
font condition (M = 0.79, SD = 0.77), t(6933) = −2.07, p =
.039, d = −0.05, 95% CI = [−0.097, −0.003]. These follow-
up analyses do not qualify the conclusion from the focal
tests.
4. Moral foundations of liberals versus conserva-
tives (Graham, Haidt, & Nosek, 2009, Study 1). People
on the political left (liberal) and political right (conservative)
have distinct policy preferences and may also have different
moral intuitions and principles. In Graham et al.’s (2009)
Study 1, 1,548 participants across the ideological spec-
trum rated whether different concepts, such as “purity”
and “fairness,” were relevant for deciding whether some-
thing was right or wrong. Items that emphasized concerns
of harm or fairness (individualizing foundations) were
deemed more relevant for moral judgment by the political
left than by the political right (r = −.21, d = −0.43, 95%
CI = [−0.55, −0.32]), whereas items that emphasized con-
cerns for the in-group, authority, or purity (binding foun-
dations) were deemed more relevant for moral judgment
by the political right than by the political left (r = .25, d =
0.52, 95% CI = [0.40, 0.63]).4 Participants rated the rele-
vance to moral judgment of 15 items (3 for each founda-
tion) in a randomized order on a 6-point scale from not at
all relevant to extremely relevant.
Replication. The primary target of replication was the
relationship between political ideology and the binding
foundations. In the aggregate sample (N = 6,966), items
that emphasized concerns for the in-group, authority, or
purity were deemed more relevant for moral judgment
by the political right than by the political left (r = .14,
p = 6.05e−34, d = 0.29, 95% CI = [0.25, 0.34], q = 0.15,
95% CI = [0.12, 0.17]). This result is consistent with the
hypothesis that binding foundations are perceived as more
morally relevant by members of the political right than by
members of the political left. The overall effect size was
smaller than the original (d = 0.29, 95% CI = [0.25, 0.34],
vs. original d = 0.52).
Follow-up analyses. The relationship between politi-
cal ideology and the individualizing foundations was a
secondary replication target. In the aggregate sample
(N = 6,970), items that emphasized concerns of harm or
fairness were deemed more relevant for moral judgment
by the political left than by the political right (r = −.13,
p = 2.54e−29, d = −0.27, 95% CI = [−0.32, −0.22], q = −0.13,
95% CI = [−0.16, −0.11]). This result is consistent with the
hypothesis that individualizing foundations are perceived
as more morally relevant by members of the political left
than by members of the political right. The overall effect
size was smaller than the original result (d = −0.27, 95%
CI = [−0.32, −0.22], vs. original d = −0.43).
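The d and q values accompanying these correlations are consistent, to within rounding of r, with the standard conversions d = 2r/√(1 − r²) and q = arctanh(r), the Fisher r-to-z transform. A minimal Python sketch, using the reported correlations as inputs:

# Convert a correlation to Cohen's d, and to Cohen's q via Fisher's
# r-to-z transform (q is a difference of z values; against a reference
# of r = 0 it reduces to arctanh(r)). Inputs are the reported rs.
import numpy as np

def r_to_d(r):
    return 2 * r / np.sqrt(1 - r ** 2)

def r_to_q(r):
    return np.arctanh(r)

print(r_to_d(0.14), r_to_q(0.14))    # ~0.28 and ~0.14; the reported 0.29
                                     # and 0.15 reflect the unrounded r
print(r_to_d(-0.13), r_to_q(-0.13))  # ~-0.26 and ~-0.13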
5. Affect and risk (Rottenstreich & Hsee, 2001, Study
1). In this experiment, 40 participants chose whether
they would prefer an affectively attractive option (a kiss
from a favorite movie star) or a financially attractive
option ($50). In one condition, participants made the
choice imagining a low probability (1%) of getting the
outcome. In the other condition, participants imagined
that the outcome was certain, and they just needed to
choose between the options. When the outcome was
unlikely, 70% of participants preferred the affectively
attractive option; when the outcome was certain, 35%
preferred the affectively attractive option. The difference
between conditions was significant, χ2(1, N = 40) = 4.91,
p = .0267, d = 0.74, 95% CI = [< 0.001, 1.74]. This result
supported the hypothesis that positive affect has greater
influence on judgments about uncertain outcomes than
on judgments about definite outcomes.
In the aggregate replication sample (N = 7,218),
when the outcome was unlikely, 47% of participants
preferred the affectively attractive choice, and when
the outcome was certain, 51% preferred the affectively
attractive choice. The difference was significant, p =
.002, odds ratio (OR) = 0.87, d = −0.08, 95% CI = [−0.13,
−0.03], but in the direction opposite the prediction of
the hypothesis (i.e., that affectively attractive choices
are more preferred when they are uncertain rather than
definite). The overall effect was much smaller than in
the original study and in the opposite direction (d =
−0.08, 95% CI = [−0.13, −0.03], vs. original d = 0.74).
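The conversion from an odds ratio to Cohen's d used throughout this section is consistent with the standard logit method, d = ln(OR) × √3/π. A minimal Python check against values reported in this article:

# Logit-method conversion from an odds ratio to Cohen's d:
# d = ln(OR) * sqrt(3) / pi. The printed values reproduce the
# d estimates reported alongside ORs in this section.
import math

def or_to_d(odds_ratio):
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

print(round(or_to_d(0.87), 2))   # -0.08 (Effect 5)
print(round(or_to_d(11.54), 2))  # 1.35 (Effect 11)
print(round(or_to_d(2.06), 2))   # ~0.40 (Effect 16)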
6. Consumerism undermines trust (Bauer, Wilkie,
Kim, & Bodenhausen, 2012, Study 4). Bauer et al.
(2012) examined whether being in a consumer mind-set
would reduce trust in other people. In their Study 4, 77
participants read about a hypothetical water-conservation
dilemma in which they were involved. They were randomly
assigned to either a condition that referred to them and
other people in the scenario as “consumers” or a condi-
tion that referred to them and other people in the sce-
nario as “individuals” (control condition). Participants in
the consumer condition reported less trust that other peo-
ple would conserve water (M = 4.08, SD = 1.56; scale from
1, not at all, to 7, very much) compared with participants
in the control condition (M = 5.33, SD = 1.30), t(76) = 3.86,
p = .001, d = 0.87, 95% CI = [0.41, 1.34].
Replication. In the aggregate replication sample (N =
6,608), participants in the consumer condition reported
slightly less trust that other people would conserve water
(M = 3.92, SD = 1.44) compared with participants in the
control condition (M = 4.10, SD = 1.45), t(6606) = 4.93,
p = 8.62e−7, d = 0.12, 95% CI = [0.07, 0.17]. This result is
consistent with the hypothesis that people have lower
trust in others when they think of those others as con-
sumers rather than as individuals. The overall effect size
was much smaller than in the original experiment (d =
0.12, 95% CI = [0.07, 0.17], vs. original d = 0.87).
Follow-up analyses. The original experiment and the
replication examined the effect of the priming manipula-
tion on four additional dependent variables. Compared
with the original study, the replication showed weaker
effects in the same direction for (a) participants’ feelings
of responsibility for the crisis (original d = 0.47; repli-
cation d = 0.10, 95% CI = [0.05, 0.15]), (b) participants’
feelings of obligation to cut water usage (original d =
0.29; replication d = 0.08, 95% CI = [0.03, 0.13]), (c) par-
ticipants’ perception of other people as partners (original
d = 0.53; replication d = 0.12, 95% CI = [0.07, 0.16]), and
(d) participants’ judgments about how much less water
other people should use (original d = 0.25; replication
d = 0.01, 95% CI = [−0.04, 0.06]).
7. Correspondence bias (Miyamoto & Kitayama,
2002, Study 1). Miyamoto and Kitayama (2002) exam-
ined whether Americans would be more likely than Japa-
nese to show a bias toward ascribing to an actor an
attitude corresponding to the actor’s behavior, a phenom-
enon referred to as correspondence bias (Jones & Harris,
1967). In their Study 1, 49 Japanese and 58 American
undergraduates learned that they would read a university
student’s essay about the death penalty and infer the stu-
dent’s true attitude toward the issue. The essay was either
in favor of or against the death penalty, and it was
designed to be diagnostic or not very diagnostic of a
strong attitude. After reading the essay, participants
learned that the student had been assigned which posi-
tion to argue. Then, participants estimated the essay writ-
er’s actual attitude toward capital punishment and the
extent to which they thought the student’s behavior was
constrained by the assignment.
Controlling for perceived constraint, analyses com-
pared perceived attitudes of the writer who wrote in
favor of capital punishment and the writer who wrote
against it (rating scale from 1, against capital punish-
ment, to 15, supports capital punishment). American
participants perceived a large difference between the
actual attitude of the essay writer who had been
assigned to write a pro-capital-punishment essay (M =
10.82, SD = 3.47) and the writer who had been assigned
to write an anti-capital-punishment essay (M = 3.30,
SD = 2.62), t(27) = 6.66, p < .001, d = 2.47, 95% CI =
[1.46, 3.49]. Japanese participants perceived less of a
difference in actual attitudes (M = 9.27, SD = 2.88, and
M = 7.02, SD = 3.06, respectively), t(23) = 1.84, p = .069,
d = 0.74, 95% CI = [−0.12, 1.59].
Replication. In the aggregate replication sample (N =
7,197), controlling for perceived constraint, participants per-
ceived a difference in actual attitudes between the essay writer
who had been assigned to write a pro-capital-punishment
essay (M = 10.98, SD = 3.69) and the essay writer who had
been assigned to write an anti-capital-punishment essay
(M = 4.45, SD = 3.51), F(2, 7194) = 3,042.00, p < 2.2e−16,
d = 1.82, 95% CI = [1.76, 1.87]. This finding is consistent
with the correspondence-bias hypothesis: Participants
inferred the essay writer’s attitude, in part, on the basis of
the writer’s observed behavior. Whether the magnitude of
this effect varies cross-culturally was examined in tests dis-
cussed in the Results section.
Follow-up analyses. Results for the primary replication
analysis showed that participants estimated the writer’s
true attitude toward capital punishment to be similar to
the position that the writer was assigned to defend. Par-
ticipants also expected that the writers would express
attitudes consistent with the position to which they were
assigned if given the opportunity to talk freely about
capital punishment (pro–capital punishment: M = 10.17,
SD = 3.84; anti–capital punishment: M = 4.96, SD = 3.61),
t(7187) = 59.44, p = 2.2e−16, d = 1.40, 95% CI = [1.35, 1.45].
Two possible moderators were included in the
design: perceived attitude of the average student in the
writer’s country (tailored to be the same as the partici-
pant’s country) and perceived persuasiveness of the
essay. In the aggregate replication sample (N = 7,211),
controlling for perceived constraint, we did not observe
an interaction between condition and perceived attitude
of the average student in the writer’s country on estima-
tions of the writer’s true attitude toward capital punish-
ment, t(7178) = 0.55, p = .58, d = 0.013, 95% CI = [−0.03,
0.06]. We did, however, observe an interaction between
condition and perceived persuasiveness of the essay
on estimations of the writer’s true attitude toward capi-
tal punishment, t(7170) = 16.25, p = 2.3e−58, d = 0.38,
95% CI = [0.34, 0.43]. The effect of condition on estima-
tions of the writer’s true attitude toward capital punish-
ment was stronger for higher levels of perceived
persuasiveness of the essay.
8. Disgust sensitivity predicts homophobia (Inbar
et al., 2009, Study 1). Behaviors that are deemed mor-
ally wrong may be judged as more intentional than
behaviors without moral implications (Knobe, 2006).
Thus, people who judge the portrayal of gay sexual activ-
ity in the media as intentional may view homosexuality
as morally reprehensible. In Inbar et al.'s (2009) Study 1,
44 participants read a vignette about a director’s action
and judged him as more intentional (scale from 1, not at
all, to 7, definitely) when he was described as encourag-
ing gay kissing (M = 4.36, SD = 1.51) than when he was
described more generally as encouraging kissing (M =
2.91, SD = 2.01), β = 0.41, t(39) = 3.39, p = .002, r = .48.
Disgust sensitivity was positively related to judgments of
intentionality in the gay-kissing condition, β = 0.79, t(19) =
4.49, p = .0003, r = .72, but not in the kissing condition, β =
−0.20, t(19) = −0.88, p = .38, r = .20. The correlation was
stronger in the gay-kissing condition than in the kissing
condition, z = 2.11, p = .03, q = 0.70, 95% CI = [0.05, 1.36].
The authors concluded that individuals who are more
prone to disgust are more likely to interpret encourage-
ment of gay kissing as intentional, which indicates that
they intuitively disapprove of homosexuality.
Replication. The relationship between disgust sensitiv-
ity and intentionality ratings was the target of our direct
replication. In the aggregate replication sample (N =
7,117), participants did not judge the director’s action as
more intentional when he encouraged gay kissing (M =
3.48, SD = 1.87) than when he encouraged kissing (M =
3.51, SD = 1.84), t(7115) = −0.74, p = .457, d = −0.02, 95%
CI = [−0.06, 0.03]. Greater disgust sensitivity was related
to judgments of greater intentionality in both the gay-
kissing condition, r = .12, p = 1.2e−13, and the kissing
condition, r = .07, p = 2.48e−5. The correlation in the
gay-kissing condition was similar to the correlation in the
kissing condition, z = 2.62, p = .02, q = 0.05, 95% CI =
[0.01, 0.10]. These data are inconsistent with the original
finding that disgust sensitivity and perceived intentional-
ity are more strongly related when people consider gay
kissing than when they consider kissing in general, and
the effect size was much smaller than the original effect
size (q = 0.05, 95% CI = [0.01, 0.10], vs. original q = 0.70).
Disgust sensitivity was very weakly related to perceived
intentionality, and there was no mean difference in per-
ceived intentionality between the gay-kissing and kissing
conditions.
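The comparison of the two condition-wise correlations is a test of the difference between independent correlations using Fisher's r-to-z transformation, with Cohen's q as the effect size. A minimal Python sketch follows; the per-condition ns here are rough assumptions of about half the aggregate sample.

# Compare two independent correlations: Fisher-transform each r, take
# the difference (Cohen's q), and test it against its standard error.
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)  # Fisher's r-to-z
    q = z1 - z2                              # Cohen's q
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z_stat = q / se
    p = 2 * stats.norm.sf(abs(z_stat))       # two-tailed p value
    ci = (q - 1.96 * se, q + 1.96 * se)
    return q, z_stat, p, ci

# Replication rs from the text; ns assumed to split the sample evenly
print(compare_correlations(0.12, 3559, 0.07, 3558))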
Follow-up analyses. The original study included two
other outcome measures based on responses to yes/no
questions. These were examined as secondary replica-
tions following the same analysis strategy as for inten-
tionality. First, disgust sensitivity was only slightly more
related to responses to “Is there anything wrong with
homosexual men French kissing in public?” (r = −.20,
p < 2.2e−16) than to responses to “Is there anything wrong
with couples French kissing in public?” (r = −.16, p <
2.2e−16; z = −1.66, p = .096, q = −0.04, 95% CI = [−0.09,
0.01]). Second, disgust sensitivity was only slightly more
related to answers to “Was it wrong of the director to
make a video that he knew would encourage homosex-
ual men to French kiss in public?” (r = .27, p < 2.2e−16)
than to “Was it wrong of the director to make a video
that he knew would encourage couples to French kiss in
public?” (r = .22, p < 2.2e−16; z = 2.28, p = .02, q = 0.05,
95% CI = [0.01, 0.10]).
9. Influence of incidental anchors on judgment
(Critcher & Gilovich, 2008, Study 2). In Critcher and
Gilovich’s (2008) Study 2, 207 participants predicted the
relative popularity of a new cell phone in the U.S. and
European marketplaces. In one condition, the smart-
phone was called the P97; in the other condition, the
smartphone was called the P17. Participants in the P97
condition estimated that a greater percentage of the new
phone’s sales would be in the United States (M = 58.1%,
SD = 19.6%) compared with participants in the P17 con-
dition (M = 51.9%, SD = 21.7%), t(197.5) = 2.12, p = .03,
d = 0.30, 95% CI = [0.02, 0.58]. This result supported the
hypothesis that judgment can be influenced by incidental
anchors in the environment. The mere presence of a high
or low number in the name of the cell phone influenced
estimates of sales of the phone.
Replication. In the aggregate replication sample (N =
6,826), participants’ estimates of the percentage of sales
the new phone would garner in their region as opposed
to a foreign market were approximately the same in the
P97 condition (M = 49.87%, SD = 21.86%) as in the P17
condition (M = 48.98%, SD = 22.14%), t(6824) = 1.68, p =
.09, d = 0.04, 95% CI = [−0.01, 0.09]. This result does not
support the hypothesis that sales estimates are influenced
by incidental anchors. The effect size was in the same
direction as the original effect size, but much smaller
(d = 0.04, 95% CI = [−0.01, 0.09], vs. original d = 0.30) and
indistinguishable from zero.
Follow-up analyses. The original authors administered
this experiment with paper and pencil, rather than on a
computer, to avoid the possibility that the numeric keys
on the keyboard might serve as primes. We administered
this task with paper and pencil at 11 sites. At these sites
(N = 1,112), participants in the P97 condition estimated
that the new phone’s percentage of sales in their region
would be slightly smaller (M = 53.02%, SD = 20.15%)
compared with participants in the P17 condition (M =
53.28%, SD = 20.17%), t(1110) = −0.22, p = .83, d = −0.01,
95% CI = [−0.13, 0.10]. This difference was in the direc-
tion opposite the direction of the original finding, but not
reliably different from zero.
10. Social value orientation and family size (Van
Lange, Otten, De Bruin, & Joireman, 1997, Study 3). Van
Lange et al. (1997) proposed that social value orienta-
tions (SVOs) are rooted in social interaction experiences,
and that the number of one’s siblings is one variable that
influences such experiences. In one of four studies (Study
3), they examined the association between SVO and fam-
ily size, thereby providing a test of two competing
hypotheses. One hypothesis states that in larger families,
resources have to be shared more frequently, and this
facilitates cooperation and the development of a proso-
cial orientation. Another hypothesis, rooted in group-size
effects, states that greater family size may undermine
trust and expected cooperation from other people, and
may therefore inhibit the development of prosocial orien-
tation. In Study 3, 631 participants reported how many
siblings they had and completed an SVO measure called
the Triple-Dominance Measure, which identified them as
prosocial people, individualists, or competitors. An anal-
ysis of variance (ANOVA) revealed a significant differ-
ence in SVO across these groups, F(2, 535) = 4.82, p = .01.
Prosocial people had more siblings (M = 2.03, SD = 1.56)
than individualists (M = 1.63, SD = 1.00) and competitors
(M = 1.71, SD = 1.35), ds = 0.287, 95% CI = [0.095, 0.478],
and 0.210, 95% CI = [−0.045, 0.465], respectively. Planned
comparisons of the number of siblings revealed a signifi-
cant contrast between prosocial people, on the one hand,
and individualists and competitors, on the other, F(1,
535) = 9.14, p = .003, d = 0.19, 95% CI = [< 0.01, 0.47].
The original demonstration used a measure of SVO
with three categorical values. In discussion with the
original first author, an alternative measure, the SVO
slider (Murphy, Ackermann, & Handgraaf, 2011), was
identified as a useful replacement to yield a continuous
distribution of scores. Thus, the replication focused
only on the observed direct positive correlation between
prosocial orientation and number of siblings. In the
aggregate replication sample (N = 6,234), number of
siblings was not related to prosocial orientation (r =
−.02, 95% CI = [−0.04, 0.01], p = .18). This result does
not support the hypothesis that having more siblings
is positively related to prosocial orientation. Direct
comparison of effect sizes was not possible because of
the change in the SVO measure, but the replication
effect size was near zero.
11. Trolley Dilemma 1: principle of double effect
(Hauser, Cushman, Young, Jin, & Mikhail, 2007,
Scenarios 1 and 2). According to the principle of dou-
ble effect, an act that harms other people is more morally
permissible if the act is a foreseen side effect rather than
the means to the greater good. Hauser et al. (2007) com-
pared participants’ reactions to two scenarios to test
whether their judgments followed this principle. In the
foreseen-side-effect scenario, a person on an out-of-control
train changed the train’s trajectory so that the train killed
one person instead of five. In the greater-good scenario, a
person pushed a fat man in front of a train, killing him, to
save five people. Whereas 89% of participants judged the
action in the foreseen-side-effect scenario as permissible
(95% CI = [87%, 91%]), only 11% of participants in the
greater-good scenario judged it as permissible (95% CI =
[9%, 13%]). The difference between the percentages was
significant, χ2(1, N = 2,646) = 1,615.96, p < .001, w = .78,
d = 2.50, 95% CI = [2.22, 2.86]. Thus, the results provided
evidence for the principle of double effect.
Replication. In the aggregate replication sample (N =
6,842 after removing participants who responded in less
than 4 s), 71% of participants judged the action in the
foreseen-side-effect scenario as permissible, but only 17%
of participants in the greater-good scenario judged it as
permissible. The difference between the percentages was
significant, p = 2.2e−16, OR = 11.54, d = 1.35, 95% CI =
[1.28, 1.41]. The replication results were consistent with
the double-effect hypothesis, and the effect was about
half the magnitude of the original (d = 1.35, 95% CI =
[1.28, 1.41], vs. original d = 2.50).
Follow-up analyses. Variations of the trolley problem
are well known. The original authors suggested that the
effect may be weaker for participants who have previ-
ously been exposed to this sort of task. We included an
additional item assessing participants’ prior knowledge of
the task. Among the 3,069 participants reporting that they
were not familiar with the task, Cohen’s d was 1.47, 95%
CI = [1.38, 1.57]; among the 4,107 who reported being
familiar with the task, Cohen’s d was 1.20, 95% CI = [1.12,
1.28]. This suggests moderation by task familiarity, but the
effect was very strong regardless of familiarity.
12. Sociometric status and well-being (Anderson
et al., 2012, Study 3). Anderson et al. (2012) examined
the relationships among sociometric status (SMS), SES,
and subjective well-being. According to the authors, SMS
refers to interpersonal wealth, whereas SES refers to fiscal
wealth. Study 3 examined whether SMS has stronger ties
than SES to well-being. In a 2 × 2 between-participants
design, 228 Mechanical Turk participants were presented
with descriptions of people who were either relatively
high or relatively low on either SES or SMS and then
made upward or downward social comparisons (e.g.,
participants in the high-SMS condition imagined and
compared themselves with a low-SMS person). Participants
then wrote about what it would be like to interact
with such people and reported their subjective
well-being. Results showed a significant 2 × 2 interaction,
F(1, 224) = 4.73, p = .03. Participants in the high-SMS
condition had higher subjective well-being than those in
the low-SMS condition, t(115) = 3.05, p = .003, d = 0.57,
95% CI = [0.20, 0.93], but there were no differences
between the two SES conditions, t(109) = 0.06, p = .96,
d = 0.01.
For replication, we used only the high- and low-SMS
conditions and excluded the high- and low-SES condi-
tions because they showed no differences in the origi-
nal study. In the aggregate replication sample (N =
6,905), participants in the high-SMS condition (M =
−0.01, SD = 0.67) had slightly lower subjective well-
being than those in the low-SMS condition (M = 0.01,
SD = 0.66; scores were standardized and averaged),
t(6903) = −1.76, p = .08, d = −0.04, 95% CI = [−0.09,
0.004]. This result did not support the hypothesis that
subjective well-being is higher for participants exposed
to descriptions of higher SMS. The effect was small in
magnitude, much smaller than the original effect, and
in the opposite direction (d = −0.04, 95% CI = [−0.09,
0.004], vs. original d = 0.57).
13. False consensus: supermarket scenario (Ross,
Greene, & House, 1977, Study 1). People perceive a
false consensus regarding how common their own
responses are among other people (Ross et al., 1977).
Thus, estimates of the prevalence of a particular belief,
opinion, or behavior are biased in the direction of the
perceiver’s belief, opinion, or behavior. In Study 1, Ross
et al. presented 320 college undergraduates with one of
four hypothetical events that culminated in a clear dichot-
omous choice of action. Participants first estimated what
percentage of their peers would choose each option and
then indicated their own choice. For each of the four
scenarios, participants who chose the first option, com-
pared with those who chose the second, believed that a
higher percentage of other people would choose the first
option (M = 65.7% vs. 48.5%), F(1, 312) = 49.1, p < .001,
d = 0.79, 95% CI = [0.56, 1.02]. A later meta-analysis sug-
gested that this effect is robust and moderate in size
across a variety of paradigms (r = .31; Mullen et al., 1985).
This study was replicated in Slate 1 and Slate 2 using
different scenarios. In Slate 1, participants were pre-
sented with the supermarket vignette, which had shown
a significant effect in the original study, F(1, 78) = 17.7,
d = 0.99, 95% CI = [0.24, 2.29]. All participants who
provided percentage estimates between 0 and 100 and
responded to all three items were included in the analy-
sis. In the aggregate replication sample (N = 7,205),
participants who chose the first option, compared with
those who chose the second, believed that a higher
percentage of other people would choose the first
option (M = 69.19% vs. 43.35%), t(6420.77) = 49.93,
p < 2.2e−16, d = 1.18, 95% CI = [1.13, 1.23]. This result
is consistent with the hypothesis that participants’
choices are positively correlated with their perception
of the percentage of other people who would make the
same choice.
Slate 2
14. False consensus: traffic-ticket scenario (Ross et al.,
1977, Study 1). In Slate 2, participants were presented
with the traffic-ticket vignette, which had shown a signifi-
cant effect in Ross et al.'s (1977) Study 1 (see the previ-
ous paragraph for a description of that study), F(1, 78) =
12.8, d = 0.80, 95% CI = [0.22, 1.87]. All participants who
provided percentage estimates between 0 and 100 and
who responded to all three items were included in the
replication analysis. In the aggregate replication sample
(N = 7,827), participants who chose the first option, com-
pared with those who chose the second, believed that a
higher percentage of other people would choose the first
option (M = 72.48% vs. 48.76%), t(6728.25) = 41.74, p <
2.2e−16, d = 0.95, 95% CI = [0.90, 1.00]. This result is con-
sistent with the hypothesis that participants’ choices are
positively correlated with their perception of the percent-
age of other people who would make the same choice.
15. Vertical position and power (Giessner & Schubert,
2007, Study 1a). In Giessner and Schubert's (2007) Study
1a, 64 participants formed an impression of a manager
on the basis of a few pieces of information, including an
organization chart with a vertical line connecting the
manager on top with his team below. Participants had
been randomly assigned to one of two conditions in
which the line was either short (2 cm) or long (7 cm).
After being presented with the information, participants
indicated their agreement with statements that the man-
ager was dominant, had a strong leader personality, was
self-confident, had considerable control in the company,
and had high status in the company (scale from 1, totally
disagree, to 7, totally agree). Responses were averaged to
create a rating of the manager’s power. Participants in the
long-line condition (M = 5.01, SD = 0.60) perceived the
manager to have greater power than did participants in
the short-line condition (M = 4.62, SD = 0.81), t(62) =
2.20, p = .03, d = 0.55, 95% CI = [0.05, 1.05]. This result
was interpreted as showing that people associate higher
vertical position with greater power.
In the aggregate replication sample (N = 7,890), par-
ticipants in the long-line condition (M = 4.97, SD = 1.09)
and participants in the short-line condition (M = 4.93,
SD = 1.07) perceived the manager to have similar levels
of power, t(7888) = 1.40, p = .16, d = 0.03, 95% CI =
[−0.01, 0.08]. This result does not support the hypoth-
esis that perceived power is higher with greater vertical
distance. The replication effect was in the same direc-
tion as, but much smaller than, the original (d = 0.03,
95% CI = [−0.01, 0.08], vs. original d = 0.55).
16. Effect of framing on decision making (Tversky &
Kahneman, 1981, Study 10). In Tversky and Kahneman’s
(1981) Study 10, 181 participants considered a scenario in
which they were buying two items, one relatively cheap
($15) and one relatively costly ($125). Ninety-three par-
ticipants were assigned to a condition in which the cheap
item could be purchased for $5 less by going to a different
branch of the store 20 min away. Eighty-eight participants
were instead assigned to a condition in which the costly
item could be purchased for $5 less at the other branch.
Therefore, the total cost for the two items and the cost
savings for traveling to the other branch were the same in
the two conditions. Participants were more likely to say
that they would go to the other branch when the cheap
item was on sale (68%) than when the costly item was on
sale (29%; z = 5.14, p = 7.4e−7, OR = 4.96, 95% CI = [2.55,
9.90]). This suggests that the decision of whether to travel
was influenced by the base cost of the discounted item
rather than the total cost.
For the replication, in consultation with one of the
original authors, we adjusted dollar amounts to be more
appropriate for 2014 (i.e., when the replication study
was conducted). The stimuli were also replaced with
consumer items that were relevant in 2014 and plausibly
sold by a single salesperson (a ceramic vase and a wall
hanging). In the aggregate replication sample (N =
7,228), participants were more likely to say that they
would go to the other branch when the cheap item was
on sale (49%) than when the costly item was on sale
(32%; p = 1.01e−50, d = 0.40, 95% CI = [0.35, 0.45]; OR =
2.06, 95% CI = [1.87, 2.27]). These results are consistent
with the hypothesis that the base cost of a discounted
item influences willingness to travel, though the effect
was less than half the size of the original (OR = 2.06,
95% CI = [1.87, 2.27], vs. original OR = 4.96).
17. Trolley Dilemma 2: principle of double effect
(Hauser et al., 2007, Study 1, Scenarios 3 and 4). In
Slate 2, participants were presented with the Ned and
Oscar scenarios from Hauser et al.'s (2007) Study 1 (for a
description of the original study, see Effect 11 in Slate 1).
In the original study, 72% of the participants judged the
action in the foreseen-side-effect (Oscar) scenario as per-
missible (95% CI = [69%, 74%]), and 56% of the participants
judged the action in the greater-good (Ned) scenario as
permissible (95% CI = [53%, 59%]). The difference between
the percentages was significant, χ2(1, N = 2,612) = 72.35,
p < .001, w = .17, d = 0.34, 95% CI = [0.26, 0.42].
Replication. In the aggregate replication sample (N =
7,923), after participants who responded in less than 4
s were removed, 64% of participants judged the action
in the foreseen-side-effect scenario as permissible, and
53% of participants in the greater-good scenario judged
it as permissible. The difference between the percent-
ages was significant (p = 4.66e−23, OR = 1.58, d = 0.25,
95% CI = [0.20, 0.30]). These results are consistent with
the principle of double effect, though the effect size was
somewhat smaller in the replication compared with the
original study (d = 0.25, 95% CI = [0.20, 0.30], vs. original
d = 0.34).
Follow-up analyses. Again, we included an additional
item assessing participants’ prior knowledge of the task.
Among the 3,558 participants reporting that they were not
familiar with the task, Cohen’s d was 0.27, 95% CI = [0.20,
0.34]; among the 4,297 who were familiar with the task,
Cohen’s d was 0.24, 95% CI = [0.17, 0.30]. In this case,
familiarity did not moderate the observed effect size.
18. Reluctance to tempt fate (Risen & Gilovich, 2008,
Study 2). Risen and Gilovich (2008) explored the belief
that tempting fate increases bad outcomes. They tested
whether people judge the likelihood of a negative out-
come to be higher when they have imagined themselves
or a classmate tempting fate, compared with when they
have imagined themselves or a classmate not tempting
fate. One hundred twenty participants read a scenario in
which either they or a classmate (“Jon”) tempted fate (by
not reading before class) or did not tempt fate (by com-
ing to class prepared). Participants then estimated how
likely it was that the protagonist (themselves or Jon)
would be called on by the professor (scale from 1, not at
all likely, to 10, extremely likely). The predicted main
effect emerged, as participants judged the likelihood
of being called on to be higher when the protagonist
had tempted fate (M = 3.43, SD = 2.34) than when the
protagonist had not tempted fate (M = 2.53, SD =
2.24), t(116) = 2.15, p = .034, d = 0.39, 95% CI = [0.03,
0.75].
Replication. The original study design included both
self and other scenarios (i.e., the protagonist was either
the participant or a classmate), but no self-other differ-
ences were found. With the original authors’ approval, we
limited the replication study to the two self conditions. In
the aggregate replication sample (N = 8,000), participants
judged the likelihood of being called on to be higher
when they had tempted fate (M = 4.58, SD = 2.44) than
when they had not tempted fate (M = 4.14, SD = 2.45),
t(7998) = 8.08, p = 7.70e−16, d = 0.18, 95% CI = [0.14, 0.22].
This is consistent with the hypothesis that people believe
tempting fate increases the likelihood of a negative out-
come, though the effect size was less than half the effect
size in the original study (d = 0.18, 95% CI = [0.14, 0.22],
vs. original d = 0.39).
For the key confirmatory test, the original authors
suggested that the sample should include only under-
graduate students, given the nature of the scenarios. In
that subsample (N = 4,599), participants judged the
likelihood of being called on to be higher when they
had tempted fate (M = 4.61, SD = 2.42) than when they
had not tempted fate (M = 4.07, SD = 2.36), t(4597) =
7.57, p = 4.4e−14, d = 0.22, 95% CI = [0.17, 0.28]. The
observed effect size (0.22) was very similar to what was
observed with the whole sample (0.18).
Follow-up analyses. During peer review of our design
and analysis plan, gender was suggested as a possible
moderator of the effect. Using the undergraduate sub-
sample, we conducted a 2 × 2 ANOVA with condition
and gender as factors. In addition to the main effect of
condition, there was a main effect of gender, F(1, 4524) =
31.80, p = 1.81e−8, d = 0.17, 95% CI = [0.09, 0.25]; females
judged the likelihood of being called on to be higher
than males did. There was also a very weak interaction of
condition and gender, F(1, 4524) = 5.10, p = .024, d =
0.07, 95% CI = [0.04, 0.13].
19. Construing actions as choices (Savani, Markus,
Naidu, Kumar, & Berlia, 2010, Study 5). Savani et al.
(2010) examined cultural asymmetry in people’s con-
strual of behavior as choices. In their Study 5, 218 partici-
pants (90 Americans, 128 Indians) were randomly assigned
to recall either personal actions or interpersonal actions
and then to indicate whether the actions constituted
choices. In a logistic hierarchical linear model with con-
strual of choice as the dependent measure, culture and
condition (personal or interpersonal actions) as partici-
pant-level predictors, and importance of the decision as
a trial-level covariate, the authors found no main effect of
condition across cultures, β = −0.13, OR = 0.88, d = 0.08,
t(101) = 0.71, p = .48. Among Americans, there was no
difference between the proportion of personal actions
construed as choices (M = .83, SD = .15) and the propor-
tion of interpersonal actions construed as choices (M =
.82, SD = .14), t(88) = 0.39, p = .65, d = 0.04. However,
Indians were less likely to construe personal actions as
choices (M = .61, SD = .26) than to construe interpersonal
actions as choices (M = .71, SD = .26), t(126) = −3.69, p =
.0002, d = −0.65, 95% CI = [−1.01, −0.30].
Replication. For the replication, we conducted a hier-
archical logistic regression analysis with choice (binary)
as the dependent variable, importance of the decision
(ordered categorical) as a trial-level covariate nested within
participants, and condition (categorical) as a participant-
level factor. The effect of interest was the odds of an action
being construed as a choice, depending on the partici-
pant’s condition, controlling for the reported importance
of the action.
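The registered model is a hierarchical logistic regression, which the following Python sketch only approximates: as a simplified stand-in, it fits an ordinary logistic regression of choice on condition and importance with standard errors clustered by participant to respect the trial-within-participant nesting. The column names are hypothetical, and this is not the registered multilevel specification.

# Simplified stand-in for the hierarchical logistic regression: a plain
# logit model with participant-clustered standard errors. Column names
# (choice, condition, importance, participant) are hypothetical.
import statsmodels.formula.api as smf

def fit_choice_model(df):
    # df: one row per recalled action; choice is 0/1, condition is a
    # participant-level factor, importance is a trial-level rating
    model = smf.logit("choice ~ condition * importance", data=df)
    return model.fit(cov_type="cluster",
                     cov_kwds={"groups": df["participant"]})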
After excluding participants who performed the task
outside of university labs, as recommended by the origi-
nal authors, and those who did not respond to all
choice and importance-of-choice questions (remaining
N = 3,506), we found a significant main effect of condi-
tion (β = −0.43, SE = 0.03, z = −12.54, p < 2e−16, d =
−0.24, 95% CI = [−0.27, −0.21]). Additional exploratory
analyses revealed a significant interaction between con-
dition and importance of the decision (β = −0.08, SE =
0.02, z = −4.23, p = 2.37e−5). Participants were less likely
to construe personal actions as choices (M = .74, SD =
.44) than to construe interpersonal actions as choices
(M = .82, SD = .39), and this effect was stronger at
higher ratings of the importance of the choice. This
small effect (d = −0.24, 95% CI = [−0.27, −0.21]) differed
from the original null effect (d = 0.04) among Ameri-
cans and was in the same direction as but smaller than
the original effect among Indians (d = −0.65); note, however,
that the present sample was highly diverse.
For the key confirmatory test of the original result
among Indians, we selected participants from university
labs in India who responded to all choice and importance-
of-choice questions (N = 122). In this subsample, we
found no main effect of condition (β = −0.06, SE = 0.17,
z = −0.34, p = .73, d = −0.03, 95% CI = [−0.18, 0.11])
and a significant interaction between condition and
importance of the decision (β = 0.35, SE = 0.09, z =
3.79, p = 1.0e−4, d = 0.19, 95% CI = [0.05, 0.34]). Indian
participants were equally likely to construe personal
actions (M = .63, SD = .48) and interpersonal actions
(M = .63, SD = .48) as choices. Though there was a
significant main effect in the full sample, the absence
of a significant main effect in this subsample, control-
ling for importance, is inconsistent with the original
finding that Indians are less likely to construe personal
actions than interpersonal actions as choices. There was
an interaction between condition and rating of the
importance of the choice, with a pattern similar to that
in the full sample. This moderation was not reported
in the original article.
Follow-up analyses. The original authors suggested
that only university samples should be included in the
main analyses, so those are the results we report in the
previous paragraph. In follow-up analyses of the whole
sample, after excluding only participants who did not
respond to all choice and importance-of-choice ques-
tions (remaining N = 5,882), we found a significant effect
of condition (β = −0.33, SE = 0.03, z = −11.54, p < 2.0e−16,
d = −0.18, 95% CI = [−0.21, −0.16]) and a significant inter-
action between condition and importance of the choice
(β = −0.06, SE = 0.014, z = −4.46, p = 8.04e−6, d = −0.03,
95% CI = [−0.06, −0.01]). In the whole sample, partic-
ipants were less likely to construe personal actions as
choices (M = .74, SD = .44) than to construe interpersonal
actions as choices (M = .79, SD = .40), and this effect was
stronger at higher ratings of the importance of the choice.
20. Preferences for formal versus intuitive reasoning
(Norenzayan, Smith, Kim, & Nisbett, 2002, Study 2).
The way people living in the West think may be more rule
based than the way people living in East Asia think. Fifty-
two European Americans (27 men, 25 women), 52 Asian
Americans (28 men, 24 women), and 53 East Asians (27
men, 26 women) were randomly assigned to either a
classification-judgment condition (decide “which group
the target object belongs to”; two thirds of the sample) or a
similarity-judgment condition (decide “which group the tar-
get object is most similar to”; one third of the sample).
All participants categorized targets into two alterna-
tive groups. Each stimulus set consisted of two targets
and two groups of four exemplars each. Each of the
two target stimuli was presented separately with the
two groups. All the exemplars in each group had a
particular feature in common with each other and with
one of the targets but shared a family resemblance, and
no single common feature, with the other target, in a
counterbalanced design (see Fig. 1). When asked
“which group the target object belongs to,” European
American and East Asian participants preferred to clas-
sify on the basis of a rule (M = 69% of responses for
European Americans; M = 70% of responses for East
Asians) rather than family resemblance, F(1, 100) =
44.40, p < .001, r = .55. When asked “which group the
target object is more similar to,” European Americans
gave many more responses based on the unidimen-
sional rule (M = 69%) than on family resemblance (M =
31%), t(17) = 3.68, p = .002, d = 1.65, 95% CI = [0.59,
2.67]. In contrast, East Asians gave fewer rule-based
responses (M = 41%) than family-resemblance-based
responses (M = 59%), t(17) = −2.09, p = .05, d = −0.93,
95% CI = [−1.85, 0.01]. The responses of Asian Americans
were intermediate, with participants indicating no pref-
erence for the unidimensional rule (M = 46%) over fam-
ily resemblance (M = 54%), t < 1.
Replication. For the replication, we preregistered a
plan to compare the percentage of rule-based responses
between the belong-to and similar-to conditions. In the
original study, European Americans showed no differ-
ence between these conditions (d = 0.00, 95% CI = [−0.15,
0.15]), but East Asians were more likely to give rule-
based responses in the belong-to condition than in the
similar-to condition (d = 0.67, 95% CI = [0.52, 0.81]). Note
that we planned a comparison between the experimental
conditions, whereas Norenzayan et al. (2002) focused their
analysis and theoretical interest on comparisons between
cultural groups within each experimental condition.
We computed the percentage of rule-based responses
for each participant and then tested whether the mean
percentages for the two experimental conditions were
equal, using a t test for independent samples. In the
aggregate replication sample (N = 7,396), participants
who were asked “which group the target object belongs
to” were more likely to classify on the basis of a rule
(M = 64%, SD = 25%) than on the basis of family resem-
blance (M = 36%, SD = 25%), and participants who were
asked “which group the target object is more similar to”
were more likely to classify on the basis of family resem-
blance (M = 56%, SD = 21%) than on the basis of a rule
(M = 44%, SD = 21%). The likelihood of using a rule was
higher in the belong-to condition compared with the
similar-to condition, t(7227.59) = 37.05, p = 3.04e−275,
d = 0.86, 95% CI = [0.81, 0.91]. This pattern was in the
same direction as the original aggregate result, and the
effect size was somewhat larger: People were more likely
to categorize on the basis of a rule when they considered
what group the target belonged to and more likely to
categorize on the basis of family resemblance when they
considered what group the target was similar to.5
Follow-up analyses. We identified a priori that this
effect and the one reported by Tversky and Gati (1978)
both involved similarity judgments and thus that the
order of these study materials in Slate 2 might be par-
ticularly relevant. We tested whether Norenzayan et al.'s
(2002) effect was moderated by whether its materials
came before or after Tversky and Gati’s and observed
very weak moderation by task order, t(7392) = 2.34, p =
.02, d = 0.05, 95% CI = [0.01, 0.10].
21. Less-is-better effect (Hsee, 1998, Study 1). Hsee
(1998) demonstrated the less-is-better effect, wherein a
less expensive gift can be perceived as more generous
than a more expensive gift when the less expensive gift is
a high-priced item compared with other items in its cate-
gory, and the more expensive item is a low-priced item
compared with other items in its category. In Hsee’s Study
1, 83 participants imagined that they were about to study
abroad and had received a goodbye gift from a friend. In
one condition, participants imagined receiving a $45 scarf
bought in a store where the prices of scarves ranged from
$5 to $50. In the other condition, participants imagined
receiving a $55 coat bought in a store where the prices of
coats ranged from $50 to $500. Participants in the scarf
condition considered their gift giver significantly more
generous (M = 5.63; scale from 0, not generous at all, to
6, extremely generous) than did those in the coat condi-
tion (M = 5.00), t(82) = 3.13, p = .002, d = 0.69, 95% CI =
[0.24, 1.13], despite the gift being objectively less expensive.
Fig. 1. Examples of targets and groups used in the replication of Norenzayan, Smith, Kim, and Nisbett's (2002) Study 2. Each of the two target objects in each set was presented separately with the two groups in order to achieve a counterbalanced design. For the flowers, the defining feature was the stem length; for the geometric figures, it was the topmost string.

In the replication, the dollar values were approximately adjusted for inflation. We converted the amounts to local currencies at sites where U.S. dollars would be
relatively unfamiliar to participants. In the aggregate
replication sample (N = 7,646), participants in the scarf
condition considered their gift giver significantly more
generous (M = 5.50, SD = 0.89) than did those in the
coat condition (M = 4.61, SD = 1.34), t(6569.67) = 34.20,
p = 4.5e−236, d = 0.78, 95% CI = [0.74, 0.83]. This result
is consistent with the less-is-better effect, and the effect
size was slightly larger than in the original demonstra-
tion (d = 0.78, 95% CI = [0.74, 0.83], vs. original d =
0.69).
22. Moral typecasting (Gray & Wegner, 2009, Study
1a). Gray and Wegner (2009) examined the attribution
of intentionality and responsibility as a function of per-
ceived moral agency—the ability to direct and control
one’s moral decisions. In their Study 1a, 69 participants
read about an event involving a person high on moral
agency (an adult man) and a person low on moral agency
(a baby). In one condition, the man knocked over a tray
of glasses, which resulted in harm to the baby. In the
other condition, the baby knocked over the tray of
glasses, which resulted in harm to the man. Participants
then rated the degree to which the person who commit-
ted the act was responsible, how intentional the act was,
and how much pain was felt by the victim (scales from 1
to 7). The adult man (M = 5.29, SD = 1.86) was evaluated
as more responsible for committing the act than was the
baby (M = 3.86, SD = 1.64), t(68) = 3.32, p = .001, d =
0.80, 95% CI = [0.31, 1.29]. Likewise, the adult man (M =
4.05, SD = 2.05) was rated as acting more intentionally
than the baby (M = 3.07, SD = 1.55), t(68) = 2.20, p = .03,
d = 0.53. Finally, when on the receiving end of the act,
the adult man (M = 4.63, SD = 1.15) was viewed as feel-
ing less pain compared with the baby (M = 5.76, SD =
1.55), t(68) = 3.49, p = .001, d = 0.85.
Replication. The effect of condition on perceived
responsibility was identified as the primary relationship
for replication. In the aggregate replication sample (N =
8,002), the adult man (M = 5.41, SD = 1.63) was evaluated
as more responsible for committing the act than the baby
(M = 3.77, SD = 1.79), t(7913.89) = 42.62, p < 3.32e−285,
d = 0.95, 95% CI = [0.91, 1.00]. This result is consistent
with the hypothesis that an adult’s perceived responsibil-
ity for harming a baby is greater than a baby’s perceived
responsibility for harming an adult. The effect size in the
replication was slightly larger than the original result (d =
0.95, 95% CI = [0.91, 1.00], vs. original d = 0.80).
Follow-up analyses. There were two additional depen-
dent variables for secondary analysis: perceived inten-
tionality and pain felt by the victim. The adult man (M =
3.62, SD = 1.89) was rated as acting more intentionally
than the baby (M = 2.73, SD = 1.64), t(7864.62) = 22.51,
p = 8.3e−109, d = 0.50, 95% CI = [0.46, 0.55]. And, when on
the receiving end of the act, the adult man (M = 4.66, SD =
1.25) was viewed as feeling less pain compared with the
baby (M = 5.44, SD = 1.25), t(7989) = 27.54, p = 1.5e−159,
d = 0.62, 95% CI = [0.57, 0.66].
23. Moral violations and desire for cleansing (Zhong
& Liljenquist, 2006, Study 2). Zhong and Liljenquist
(2006) investigated whether moral violations can induce
a desire for cleansing. In their Study 2, under the guise of
a study on the relationship between personality and
handwriting, 27 participants hand-copied a first-person
account of an ethical act (helping a coworker) or unethi-
cal act (sabotaging a coworker). Then, participants rated
the desirability of five cleansing products and five non-
cleansing products (scale from 1, not at all, to 7, very
much). Participants who copied the unethical account
(M = 4.95, SD = 0.84) reported that the cleansing prod-
ucts were more desirable than did participants who cop-
ied the ethical account (M = 3.75, SD = 1.32), F(1, 25) =
6.99, p = .01, d = 1.02, 95% CI = [0.39, 2.44]. There was no
difference between the unethical (M = 3.85, SD = 1.21)
and ethical (M = 3.91, SD = 1.03) conditions in ratings of
noncleansing products, F(1, 25) = 0.02, p = .89, d = 0.05.
Replication. The effect of interest for replication was
whether condition affected ratings of the cleansing prod-
ucts. In the aggregate replication sample (N = 7,001), after
participants who copied less than half of the first-person
account were removed, participants who copied the
unethical account (M = 3.95, SD = 1.43) and those who
copied the ethical account (M = 3.95, SD = 1.45) rated the
cleansing products as similarly desirable, t(6999) = −0.11,
p = .91, d = 0.00, 95% CI = [−0.05, 0.04]. This result is not
consistent with the hypothesis that copying an account
of an unethical action increases the desirability of cleans-
ing products compared with copying an account of an
ethical action.
Follow-up analyses. The original study revealed no dif-
ference by condition in ratings of noncleansing products.
In the replication, a 2 (condition) × 2 (type of product)
linear mixed-effects model with participant as a random
effect yielded no interaction, t(6999) = −0.57, p = .57,
d = −0.01, 95% CI = [−0.06, 0.03]. Moreover, there was no
difference between the ethical (M = 3.12, SD = 1.08) and
unethical (M = 3.11, SD = 1.05) conditions in ratings of
noncleansing products, t(6999) = 0.63, p = .53, d = 0.02,
95% CI = [−0.03, 0.06].
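The follow-up model is straightforward to specify in R. A minimal sketch, assuming a long-format data frame d (hypothetical) with columns rating, condition (ethical vs. unethical), product (cleansing vs. noncleansing), and id (participant); the degrees-of-freedom method may differ from the one behind the reported t values:

library(lme4)
library(lmerTest)  # adds p values for the fixed effects
# 2 (condition) x 2 (type of product) model with participant as a
# random effect; the condition:product interaction is the test of interest.
m <- lmer(rating ~ condition * product + (1 | id), data = d)
summary(m)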
24. Assimilation and contrast effects in question
sequences (Schwarz, Strack, & Mai, 1991, Study 1). In
this study, 100 participants answered a question about
life satisfaction in a specific domain, “How satisfied are
you with your relationship?” and a question about life
satisfaction in general, “How satisfied are you with your
life-as-a-whole?” Participants were randomly assigned to
the order in which they answered the specific and general
questions. When the specific question was asked first, the
correlation between the responses to the two questions
was strong (r = .67, p < .05). When the specific question
was asked second, the correlation between the responses
was weaker (r = .32, p < .05). The difference between
these correlations was significant, z = 2.32, p < .01, q =
0.48, 95% CI = [0.07, 0.88].
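This comparison of independent correlations can be reproduced with Fisher's r-to-z transformation, with Cohen's q as the difference between the transformed correlations. A minimal R sketch using the original values; the even 50/50 split of the 100 participants is an assumption made for illustration:

r1 <- .67; n1 <- 50  # specific question asked first (assumed split)
r2 <- .32; n2 <- 50  # specific question asked second (assumed split)
q  <- atanh(r1) - atanh(r2)          # Cohen's q; reproduces ~0.48
se <- sqrt(1/(n1 - 3) + 1/(n2 - 3))  # SE of the difference in z scores
z  <- q / se                         # test statistic; reproduces ~2.32
p  <- 2 * pnorm(-abs(z))             # two-tailed p value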
The authors suggested that the specific-first condi-
tion made the relationship more accessible, so that
participants were more likely to incorporate informa-
tion about their relationship when evaluating their life
satisfaction more generally. Because responses to the
two items were linked by the accessibility of relation-
ship information, they were correlated. In contrast, in
the specific-second condition, relationship satisfaction
was not necessarily accessible when participants evalu-
ated their overall life satisfaction, so they could draw
on any number of different areas to generate their
response to the general question. Thus, the correlation
between responses to the two items was weaker than
in the specific-first condition.
Replication. In the aggregate replication sample (N =
7,460), when the specific question was asked first, the
correlation between the responses to the two questions
was moderate (r = .38). When the specific question was
asked second, the correlation between the responses was
slightly stronger (r = .44). The difference between these
correlations was significant, z = −3.03, p = .002, q = −0.07,
95% CI = [−0.12, −0.02]. The replication effect was in the
direction opposite that of the original effect, and the rep-
lication effect size was much smaller than the original
result (q = −0.07, 95% CI = [−0.12, −0.02], vs. q = 0.48).
Follow-up analysis. In the original procedure, no other
measures preceded the questions. This particular effect
concerns the influence of question context, so it is reason-
able to presume that task order will have an impact on it.
Therefore, the data for the most direct comparison with
the original were provided by the sites where this task was
administered first in the slate. In that subsample (N = 470),
when the specific question was asked first, the correlation
between the responses to the two questions was moderate
(r = .41). When the specific question was asked second,
the correlation between the responses was the same (r =
.41). The difference between these correlations was not
significant, z = 0.01, p = .99, q = 0.00, 95% CI = [−0.18, 0.18].
25. Effect of choosing versus rejecting on relative
desirability (Shafir, 1993, Study 1). In this study, 170
participants imagined that they were on the jury of a
custody case and had to choose between two parents.
One of the parents had both more strongly positive and
more strongly negative characteristics (the extreme par-
ent) than the other parent (the average parent). Partici-
pants were randomly assigned to either decide to award
custody to one parent or decide to deny custody to one
parent. Participants were more likely to both award (64%)
and deny (55%) custody to the extreme parent than to
the average parent, and the sum of these probabilities
was significantly greater than 100%, z = 2.48, p = .013,
d = 0.35, 95% CI = [−0.04, 0.68]. This finding was consistent
with the hypothesis that negative features are weighted
more strongly than positive features when people are
rejecting options, and positive features are weighted more
strongly than negative features when people are selecting
options (Shafir, 1993).
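The critical test asks whether the probability of awarding custody to the extreme parent plus the probability of denying custody to that parent departs from the 100% expected if choosing and rejecting were complementary. A minimal R sketch of a z test of this form, assuming for illustration an even 85/85 split of the original 170 participants; the reported z = 2.48 depends on the exact cell sizes and variance formula, so this sketch only approximates it:

p_award <- .64; n_award <- 85  # P(award custody to extreme parent)
p_deny  <- .55; n_deny  <- 85  # P(deny custody to extreme parent)
excess <- p_award + p_deny - 1  # departure from complementarity
se <- sqrt(p_award * (1 - p_award) / n_award +
           p_deny  * (1 - p_deny)  / n_deny)
z <- excess / se                # roughly 2.5, near the reported 2.48
p <- 2 * pnorm(-abs(z))         # two-tailed p value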
Replication. In the aggregate replication sample (N = 7,901), par-
ticipants were less likely to both award (45.5%) and
deny (47.6%) custody to the extreme parent than to the
average parent, and the sum of these probabilities
(93%) was significantly smaller than the 100% one
would expect if choosing and rejecting were comple-
mentary, z = −6.10, p = 1.1e−9, d = −0.13, 95% CI =
[−0.18, −0.09]. This result was small in magnitude and
in the direction opposite that of the original finding,
and it is incompatible with the hypothesis that negative
features are weighted more strongly when people are
rejecting options and positive features are weighted
more strongly when people are selecting options.
26. Priming “heat” increases belief in global warm-
ing (Zaval, Keenan, Johnson, & Weber, 2014, Study
3a). Zaval et al. (2014) investigated how beliefs in cli-
mate change could be influenced by immediately avail-
able information about temperature. In their Study 3a,
300 Mechanical Turk workers reported their beliefs about
global warming after completing one of three scrambled-
sentence tasks; one task primed the concept of “heat,”
another primed the concept of “cold,” and the third had
no theme (control condition). There was a significant
effect of condition on both belief in global warming, F(2,
288) = 3.88, p = .02, and concern about it, F(2, 288) =
4.74, p = .01, controlling for demographic and actual-
temperature data. Post hoc pairwise comparisons revealed
that on a 4-point scale (from 1, not at all convinced, to 4,
completely convinced), participants in the heat-priming
condition expressed stronger belief (M = 2.7, SD = 1.1) in
global warming than did both participants in the cold-
priming condition (M = 2.4, SD = 1.1), t(191) = 1.9, p =
.06, d = 0.27, 95% CI = [0.05, 0.49], and participants in the
control condition (M = 2.3, SD = 1.1), t(193) = 2.23, p =
.03, d = 0.37, 95% CI = [0.14, 0.59]. Likewise, participants
in the heat-priming condition expressed greater concern
(M = 2.4, SD = 1.0) about global warming than did both
participants in the cold-priming condition (M = 2.1, SD =
1.0; scale from 1, not at all worried, to 4, completely wor-
ried), t(191) = 2.15, p = .03, d = 0.31, 95% CI = [0.03, 0.59],
and participants in the control condition (M = 2.1, SD =
1.0), t(193) = 2.23, p = .02, d = 0.31, 95% CI = [0.02, 0.59].
Replication. For the direct replication, the mean dif-
ference in concern about global warming between the
heat- and cold-priming conditions was evaluated. In the
aggregate replication sample, after participants who made
errors in the sentence-unscrambling task were excluded
(remaining N = 4,204), participants in the heat-priming con-
dition (M = 2.47, SD = 0.90) and participants in the cold-
priming condition (M = 2.50, SD = 0.89) expressed similar
levels of concern about global warming, t(4202) = −1.09,
p = .27, d = −0.03, 95% CI = [−0.09, 0.03]. This result is
not consistent with the hypothesis that temperature prim-
ing alters concern about global warming. The effect was
small, much weaker than the original finding, and in the
opposite direction (d = −0.03, 95% CI = [−0.09, 0.03], vs.
original d = 0.31).
Translations of the scrambled-sentence task may
have disrupted the effectiveness of the manipulation.
Therefore, the most direct comparison with the original
effect size was provided by the sites where the test was
administered in English only. In this subsample (N =
2,939), participants in the heat-priming condition (M =
2.40, SD = 0.90) also expressed similar concern about
global warming compared with participants in the cold-
priming condition (M = 2.44, SD = 0.89), t(2937) = −0.18,
p = .24, d = −0.04, 95% CI = [−0.12, 0.03].
Follow-up analyses. Belief in global warming was
included as a secondary dependent variable. In the aggre-
gate replication sample (N = 4,212), participants in the
heat-priming condition (M = 3.25, SD = 0.84) and partici-
pants in the cold-priming condition (M = 3.25, SD = 0.82)
expressed similar belief in global warming, t(4210) = 0.50,
p = .62, d = 0.00, 95% CI = [−0.06, 0.06]. In the subsample of
participants who took the test in English, participants in the
heat-priming condition (M = 3.25, SD = 0.86) and partici-
pants in the cold-priming condition (M = 3.23, SD = 0.85)
also expressed similar belief in global warming, t(2940) =
1.40, p = .16, d = 0.02, 95% CI = [−0.05, 0.09]. Neither of
these follow-up analyses was consistent with the original
study’s finding that temperature priming influenced belief
in global warming.
27. Perceived intentionality for side effects (Knobe,
2003, Study 1). Knobe (2003) investigated whether help-
ful and harmful side effects are differentially perceived as
being intended. Consider, for example, an agent who
knows that his or her behavior will have a particular side
effect, but does not care whether the side effect does or
does not occur. If the agent chooses to go ahead with the
behavior and the side effect occurs, do people believe
that the agent brought about the side effect intentionally?
Knobe had participants read vignettes about such situa-
tions and found that participants were more likely to
believe the agent brought about the side effect intention-
ally when the side effect was harmful compared with
when it was helpful. Eighty-two percent of participants in
the harmful-side-effect condition said that the agent
brought about the side effect intentionally, whereas 23%
of those in the helpful-side-effect condition said that the
agent brought about the side effect intentionally, χ²(1,
N = 78) = 27.2, p < .001, d = 1.45, 95% CI = [0.79, 2.77].
Also, ratings of the blame deserved by agents who
brought about harmful side effects were higher than rat-
ings of the praise deserved by agents who brought about
helpful side effects (scales from 1 to 7), t(120) = 8.4, p <
.001, d = 1.55, 95% CI = [1.14, 1.95]. The total amount of
blame or praise attributed to the agent was associated
with belief that the agent brought about the side effect
intentionally, r(120) = .53, p < .001, d = 0.63, 95% CI =
[0.26, 0.99].
Replication. For the direct replication, ratings of inten-
tionality in the harmful- and helpful-side-effect conditions
were compared using a 7-point scale rather than a dichot-
omous judgment. In the aggregate replication sample
(N = 7,982), participants in the harmful-side-effect condition
(M = 5.34, SD = 1.94) said that the agent brought about the
side effect intentionally to a greater extent than did partici-
pants in the helpful-side-effect condition (M = 2.17, SD =
1.69), t(7843.86) = 78.11, p < 1.68e−305, d = 1.75, 95% CI =
[1.70, 1.80]. This is consistent with the original result, and
the effect was somewhat stronger in the replication (d =
1.75, 95% CI = [1.70, 1.80], vs. original d = 1.45).
Follow-up analyses. Blame and praise ratings were
assessed as a secondary replication. Ratings of the blame
deserved by agents who brought about harmful side
effects were higher (M = 6.03, SD = 1.26) than ratings of
the praise deserved by agents who brought about help-
ful side effects (M = 2.54, SD = 1.60), t(7553.82) = 108.15,
p < 1.68e−305, d = 2.42, 95% CI = [2.36, 2.48]. This is also
consistent with the original result, and the effect size is
notably larger (2.42 vs. 1.55).
28. Directionality and similarity (Tversky & Gati,
1978, Study 2). Tversky and Gati (1978) investigated
the relationship between directionality and similarity. In
their Study 2, 144 participants made 21 similarity ratings
of country pairs in which one country (e.g., the United
States) was shown in a pretest to be more prominent
than the other (e.g., Mexico). In a between-participants
manipulation, the pair was presented with either the
more prominent country first (e.g., United States-Mexico)
or the less prominent country first (e.g., Mexico-United
States). Two counterbalanced versions of the survey were
created such that the more prominent country and the
less prominent country were presented first “about an
equal number of times” (p. 87). Results indicated that
participants’ similarity ratings were higher when less
prominent countries were displayed first than when more
prominent countries were displayed first, t(153) = 2.99,
p = .003, d = 0.48, 95% CI = [0.16, 0.80], and that higher
similarity ratings were given to the version of each pair
that listed the more prominent country second, t(20) =
2.92, p = .001, d = 0.64, 95% CI = [0.16, 1.10].
A follow-up study (N = 46) with the same design
examined ratings of differences rather than similarities.
Results were consistent with the first study: Participants’
difference ratings were higher when the more promi-
nent countries were displayed first than when the less
prominent countries were displayed first, t(45) = 2.24,
p < .05, d = 0.66, 95% CI = [0.06, 1.25], and higher dif-
ference ratings were given to the version of each pair
that listed the more prominent country first, t(20) =
2.72, p < .01, d = 0.59, 95% CI = [0.12, 1.05].
Replication. For the replication, participants were ran-
domly assigned to one of the two counterbalanced ver-
sions of the survey and were randomly assigned to rate
either similarities or differences between the two coun-
tries in each pair. Following the design of the original
studies, we considered the participants who provided
similarity and difference judgments to be two indepen-
dent samples. Therefore, each site had about half as
much data for its critical test as for the tests of the other
27 effects. The similarity ratings were the primary focus
for direct replication, and the difference ratings were
examined in a secondary analysis.
For each participant in the aggregate similarities
sample (N = 3,549), we created an asymmetry score,
calculated as the average similarity rating when the
prominent country appeared second minus the average
similarity rating when the prominent country appeared
first. Across participants, the asymmetry score was not
different from zero, t(3548) = 0.60, p = .55, d = 0.01,
95% CI = [−0.02, 0.04]; the order of presentation of more
and less prominent countries did not influence evalu-
ations of their similarity. In addition, we observed that
the average similarity ratings in one counterbalancing
condition were 8.78 (SD = 2.44) and 8.84 (SD = 2.43)
when the more prominent country was presented first
and second, respectively, whereas the corresponding
average similarity ratings in the other counterbalancing
condition were higher, M = 10.14 (SD = 2.42) and M =
10.09 (SD = 2.44), respectively. In summary, there was no
evidence of the key effect of country order (prominent
country first vs. second), and similarity ratings were
different between the counterbalancing conditions, a
procedural effect.
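A minimal R sketch of the asymmetry-score computation, assuming a long-format data frame ratings (hypothetical, with columns id, rating, and prominent indicating whether the more prominent country appeared "first" or "second" in the pair):

# Per-participant asymmetry score: mean similarity rating when the
# prominent country appears second minus the mean when it appears first.
m_first  <- tapply(ratings$rating[ratings$prominent == "first"],
                   ratings$id[ratings$prominent == "first"], mean)
m_second <- tapply(ratings$rating[ratings$prominent == "second"],
                   ratings$id[ratings$prominent == "second"], mean)
asym <- m_second[names(m_first)] - m_first  # align participants by id
t.test(asym, mu = 0)  # one-sample t test of mean asymmetry against zero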
Then, we reproduced the original by-item analysis.
Participants’ similarity ratings were nearly identical
when the less prominent country was displayed first
(M = 9.42, SD = 2.61) and when the more prominent
country was displayed first (M = 9.43, SD = 2.57),
t(20) = −0.29, p = .78, d = −0.04, 95% CI = [−0.35, 0.26].
Thus, the overall replication effect size was near zero,
and the effect was in the direction opposite the original
findings.
Follow-up analyses. We conducted the same analy-
ses on the difference ratings (N = 3,582). The asymmetry
score was not different from zero, t(3581) = 1.70, p = .09,
d = 0.03, 95% CI = [−0.004, 0.061]; the order of presenta-
tion of more and less prominent countries did not influ-
ence evaluations of their difference.
The by-item analysis showed that participants' dif-
ference ratings were very similar when the more promi-
nent country was displayed first (M = 11.19, SD = 2.54)
compared with when the less prominent country was
displayed first (M = 11.25, SD = 2.54), t(20) = 1.1, p =
.29, d = 0.17, 95% CI = [−0.14, 0.47].
Order effects in general are reported in the Results
section. As noted earlier, we identified a priori that this
effect and Norenzayan et al.’s (2002) effect both
involved similarity judgments and thus that the order
of these study materials might be particularly relevant.
We compared whether the asymmetry score for Tversky
and Gati’s (1978) effect was moderated by whether the
measures for Norenzayan et al.'s effect appeared before
or after, and observed no moderation for the primary
similarities test, t(3547) = −0.48, p = .63, d = −0.02, 95%
CI = [−0.08, 0.05], or for the secondary differences test,
t(3580) = −0.23, p = .82, d = −0.01, 95% CI = [−0.07,
0.06].
Results
For each of the 28 effects, Table 2 presents the original
study’s effect size (with 95% CI), the median effect size
for the replication samples, and the weighted mean of
the replication effect sizes (with 95% CI) after pooling
the data of all the samples. It also shows the percentage
of samples in which the null hypothesis was rejected
and the effect was in the expected direction, the per-
centage of samples in which the null hypothesis was
rejected and the effect was in the unexpected direction,
and the percentage of samples in which the null
hypothesis was not rejected. The effects are ordered
from the largest global replication effect size consistent
with the original study, at the top of the table, to the
largest opposite-direction effect, at the bottom.
Table 2. Summary of Effect Sizes and Results of Significance Tests Across Replication Samples for Each of the 28 Effects

Each row gives the original study's effect size [95% CI], the median effect size across replication samples, the global (pooled) effect size [95% CI], and the percentage of samples in which significance testing (p < .05) yielded a negative estimated effect, a nonsignificant effect, and a positive estimated effect.

Effect | Original study's effect size | Median effect size | Global effect size | % negative | % nonsignificant | % positive

Cohen's q effect size
Disgust sensitivity predicts homophobia (Inbar, Pizarro, Knobe, & Bloom, 2009) | 0.70 [0.05, 1.36] | 0.03 | 0.05 [0.01, 0.10] | 3.39 | 93.22 | 3.39
Assimilation and contrast effects in question sequences (Schwarz, Strack, & Mai, 1991) | 0.48 [0.07, 0.88] | −0.06 | −0.07 [−0.12, −0.02] | 5.08 | 91.53 | 3.39

Cohen's d effect size
Correspondence bias (Miyamoto & Kitayama, 2002), WEIRD samples | 2.47 [1.46, 3.49] | 1.78 | 1.81 [1.75, 1.88] | 0.00 | 0.00 | 100.00
Correspondence bias (Miyamoto & Kitayama, 2002), less WEIRD samples | 0.74 [−0.12, 1.59] | 1.86 | 1.84 [1.74, 1.94] | 0.00 | 0.00 | 100.00
Perceived intentionality for side effects (Knobe, 2003) | 1.45 [0.79, 2.77] | 1.94 | 1.75 [1.70, 1.80] | 0.00 | 5.08 | 94.92
Trolley Dilemma 1: principle of double effect (Hauser, Cushman, Young, Jin, & Mikhail, 2007) | 2.50 [2.22, 2.86] | 1.42 | 1.35 [1.28, 1.41] | 0.00 | 0.00 | 100.00
False consensus: supermarket scenario (Ross, Greene, & House, 1977) | 0.99 [0.24, 2.29] | 1.08 | 1.18 [1.13, 1.23] | 0.00 | 0.00 | 100.00
Moral typecasting (Gray & Wegner, 2009) | 0.80 [0.31, 1.29] | 1.04 | 0.95 [0.91, 1.00] | 0.00 | 5.00 | 95.00
False consensus: traffic-ticket scenario (Ross et al., 1977) | 0.80 [0.22, 1.87] | 0.89 | 0.95 [0.90, 1.00] | 0.00 | 6.67 | 93.33
Preferences for formal versus intuitive reasoning (Norenzayan, Smith, Kim, & Nisbett, 2002), WEIRD samples | 0.00 [−0.15, 0.15] | 0.95 | 0.95 [0.90, 1.00] | 0.00 | 2.33 | 97.67
Preferences for formal versus intuitive reasoning (Norenzayan et al., 2002), less WEIRD samples | 0.67 [0.52, 0.81] | 0.50 | 0.56 [0.46, 0.65] | 0.00 | 42.86 | 57.14
Less-is-better effect (Hsee, 1998) | 0.69 [0.24, 1.13] | 0.86 | 0.78 [0.74, 0.83] | 0.00 | 10.53 | 89.47
Cardinal direction and socioeconomic status (Huang, Tse, & Cho, 2014), WEIRD samples | 0.83 [0.37, 1.28] | 0.66 | 0.55 [0.49, 0.61] | 4.35 | 30.43 | 65.22
Cardinal direction and socioeconomic status (Huang et al., 2014), less WEIRD samples | −0.59 [−0.99, −0.19] | −0.10 | 0.03 [−0.05, 0.13] | 5.56 | 83.33 | 11.11
Effect of framing on decision making (Tversky & Kahneman, 1981) | 1.08 [0.71, 1.45] | 0.38 | 0.40 [0.35, 0.45] | 0.00 | 54.55 | 45.45
Moral foundations of liberals versus conservatives (Graham, Haidt, & Nosek, 2009) | 0.52 [0.40, 0.63] | 0.23 | 0.29 [0.25, 0.34] | 0.00 | 75.00 | 25.00
Trolley Dilemma 2: principle of double effect (Hauser et al., 2007) | 0.34 [0.26, 0.42] | 0.22 | 0.25 [0.20, 0.30] | 0.00 | 81.67 | 18.33
Reluctance to tempt fate (Risen & Gilovich, 2008) | 0.39 [0.03, 0.75] | 0.23 | 0.18 [0.14, 0.22] | 1.69 | 72.88 | 25.42
Consumerism undermines trust (Bauer, Wilkie, Kim, & Bodenhausen, 2012) | 0.87 [0.41, 1.34] | 0.16 | 0.12 [0.07, 0.17] | 1.85 | 87.04 | 11.11
Influence of incidental anchors on judgment (Critcher & Gilovich, 2008) | 0.30 [0.02, 0.58] | 0.00 | 0.04 [−0.01, 0.09] | 3.39 | 91.53 | 5.08
Vertical position and power (Giessner & Schubert, 2007) | 0.55 [0.05, 1.05] | 0.01 | 0.03 [−0.01, 0.08] | 1.69 | 94.92 | 3.39
Directionality and similarity (Tversky & Gati, 1978) | 0.48 [0.16, 0.80] | 0.03 | 0.01 [−0.02, 0.04] | 2.04 | 97.96 | 0.00
Moral violations and desire for cleansing (Zhong & Liljenquist, 2006) | 1.02 [0.39, 2.44] | 0.00 | 0.00 [−0.05, 0.04] | 0.00 | 94.23 | 5.77
Structure promotes goal pursuit (Kay, Laurin, Fitzsimons, & Landau, 2014) | 0.49 [0.001, 0.973] | −0.02 | −0.02 [−0.07, 0.03] | 0.00 | 100.00 | 0.00
Social value orientation and family size (Van Lange, Otten, De Bruin, & Joireman, 1997) | 0.19 [< 0.001, 0.47] | 0.06 | −0.03 [−0.08, 0.02] | 0.00 | 98.15 | 1.85
Priming “heat” increases belief in global warming (Zaval, Keenan, Johnson, & Weber, 2014) | 0.31 [0.03, 0.59] | 0.00 | −0.03 [−0.09, 0.03] | 5.36 | 89.29 | 5.36
Disfluency engages analytic processing (Alter, Oppenheimer, Epley, & Eyre, 2007) | 0.63 [−0.004, 1.25] | −0.07 | −0.03 [−0.08, 0.01] | 1.52 | 96.97 | 1.52
Sociometric status and well-being (Anderson, Kraus, Galinsky, & Keltner, 2012) | 0.57 [0.20, 0.93] | −0.05 | −0.04 [−0.09, −0.004] | 0.00 | 94.92 | 5.08
Affect and risk (Rottenstreich & Hsee, 2001) | 0.74 [< 0.001, 1.74] | −0.06 | −0.08 [−0.13, −0.03] | 3.33 | 95.00 | 1.67
Effect of choosing versus rejecting on relative desirability (Shafir, 1993) | 0.35 [−0.04, 0.68] | −0.04 | −0.13 [−0.18, −0.09] | 18.97 | 79.31 | 1.72
Construing actions as choices (Savani, Markus, Naidu, Kumar, & Berlia, 2010), WEIRD samples | 0.08 [−0.33, 0.50] | −0.24 | −0.21 [−0.23, −0.18] | 46.51 | 53.49 | 0.00
Construing actions as choices (Savani et al., 2010), less WEIRD samples | −0.65 [−1.01, −0.30] | −0.14 | −0.12 [−0.16, −0.08] | 28.57 | 71.43 | 0.00

Note: Numbers inside brackets are 95% confidence intervals. For the original effect sizes, we calculated the confidence intervals using cell sample sizes when they were available and assumed equal distribution across conditions when they were not available. For original studies that observed a difference between a sample from a WEIRD (i.e., Western, educated, industrialized, rich, and democratic) culture and a sample from a particular less WEIRD culture, we present summary results for WEIRD and all less WEIRD samples separately to avoid potentially misrepresenting replication success within subsamples. Figure 2 plots the distribution of effect sizes across all samples for each of the 28 effects included in this replication project.
For original studies that had shown cultural differences,
we present results separately for cultures with WEIRD
scores above the mean (classified as “WEIRD” samples)
and those with WEIRD scores below the mean (classi-
fied as “less WEIRD” samples), to avoid aggregating
results when effects might be anticipated in some sam-
ples but not others (see the next section for an expla-
nation of how WEIRD scores were calculated). (In these
cases, the effects are ordered according to the global
replication effect size in the WEIRD samples.) However,
the differences observed between samples in the origi-
nal research may not be expected to be replicated in
our comparisons of aggregated cultural contexts. There-
fore, we avoid drawing conclusions about replication
of original cultural differences beyond what we have
already discussed in reporting the findings for indi-
vidual studies and what we discuss later in presenting
our exploratory cultural comparisons.
Overall, after we adjusted for multiple comparisons,
the replications for 14 of the 28 effects (50%) showed
significant evidence in the same direction as the origi-
nal finding, 1 replication provided evidence that was
weakly consistent with the original (4%),6 and 13
replications (46%) yielded a null effect or evidence in
the direction opposite the original finding.7 Larger
aggregate effects tended to have a higher percentage
of significant positive results than smaller aggregate
effects, as would be expected given the power of the
individual samples to detect the observed aggregate
effect size. For 8 of the supported effects, 89% to 100%
of the individual samples had significant results, and
for the other 6, 11% to 46% of the individual samples
had significant results. As would be expected, for effects
that were null in the aggregate, there were occasional
significant results both in the original finding’s direction
and in the opposite direction, but more than 90% of
the individual samples typically showed a null effect.
Most observed pooled effect sizes (21 of 28; 75%) were
smaller than the original findings in WEIRD samples,
but some (7 of 28; 25%) were larger.
Figure 2 provides a summary illustration of the 28
studies including (a) estimates of the aggregate effect
sizes, (b) the effect-size estimate for each individual
sample, and (c) the original studies’ effect-size estimates
(results for samples from WEIRD and less WEIRD cul-
tures are identified separately for the 4 original studies
that had samples from two cultures). A figure showing
separate distributions for WEIRD and less WEIRD repli-
cation samples is available at https://osf.io/5yzn8/.
[Figure 2 omitted: one horizontal row per effect; the x-axis is effect-size r (Cohen's q for the top two rows).]
Fig. 2. Effect-size distributions for the 28 effects. The effect size for each replication sample is plotted as a short vertical line; the aggregate estimates are plotted as longer, thick vertical lines. Results for samples with fewer than 15 participants because of exclusions are not plotted, and some samples were excluded because of errors in administration. A detailed accounting of all exclusions is available at https://manylabsopenscience.github.io/ML2_data_cleaning. Positive effect sizes indicate effects consistent with the direction of the original findings in the original Western samples. Original effect sizes are indicated by the gray-filled triangles. If the original study had a cultural comparison, the gray triangle shows the result for the WEIRD (Western, educated, industrialized, rich, and democratic) sample, and the open triangle shows the results for the less WEIRD sample. Note that for the top two rows of the figure, effect sizes were calculated as Cohen's q (the estimate of the difference between two correlations); all other effect sizes were calculated as r. SES = socioeconomic status; SVO = social value orientation; SMS = sociometric status.
Variation across samples and settings
Our central interest was the variation in effect estimates
across all samples and settings. In a linear mixed model
with samples and studies as random effects, we compared
the intraclass correlation (ICC) of samples across effects
(ICC = .782), which was quite large, with the intraclass
correlation of effects across samples (ICC = .004), which
was near zero. In other words, to predict effect sizes
across the 28 findings and dozens of samples studied in
this project, it is very useful to know the effect in question
and barely useful to know the sample in which it was
being studied.
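A minimal R sketch of this variance decomposition, assuming a long-format data frame d (hypothetical) with one standardized effect estimate per sample-by-study combination in columns es, sample, and study:

library(lme4)
# Crossed random effects for sample and study; each ICC is that
# component's share of the total variance.
m  <- lmer(es ~ 1 + (1 | sample) + (1 | study), data = d)
vc <- as.data.frame(VarCorr(m))
setNames(vc$vcov / sum(vc$vcov), vc$grp)  # study and sample shares, per the values reported above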
Next, we examined whether specific effects were sen-
sitive to variation in sample or setting. For each of the
28 replication studies, we examined variability in effect
sizes using a random-effects meta-analysis (with
restricted maximum likelihood as the estimator for
between-study variance) and established heterogeneity
estimates—Q, I2, and tau—to determine if the amount
of variability across samples exceeded that expected as
a result of measurement error (see Table 3). Because
the study procedures were nearly identical (except for
language translations) across the individual studies,
variation exceeding measurement error was likely to be
due to effects of sample or setting and interactions
between sample and materials. Eleven of the 28 effects
(39%) showed significant heterogeneity according to the
Q test (p < .001). Notably, of those showing such vari-
ability, the effect sizes for 8 were among the 10 largest
effect sizes. Only one of the nonsignificant replication
effects (i.e., replication of Van Lange et al., 1997) showed
significant heterogeneity according to the Q test.
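These statistics are the standard outputs of a random-effects meta-analysis. A minimal sketch with the metafor package (named in the note to Table 3), assuming vectors yi and vi holding one effect estimate and its sampling variance per sample for a single effect:

library(metafor)
res <- rma(yi = yi, vi = vi, method = "REML")  # REML random-effects model
res$QE          # Cochran's Q: test of variability beyond sampling error
res$I2          # I^2: percentage of total variability due to heterogeneity
sqrt(res$tau2)  # tau: estimated SD of true effect sizes across samples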
The I² statistic indicated substantial heterogeneity for
some of the tests; 10 (36%) showed at least medium
heterogeneity (I² ≥ 50%), and 2 showed heterogeneity
larger than 75% (tests of Huang et al., 2014, and Knobe,
2003; see Table 3). Note, however, that estimation of
heterogeneity is rather imprecise, as evidenced by the
many large confidence intervals for I², particularly for
the cases with low estimates of heterogeneity. Fifteen
of the I² confidence intervals had a lower bound of 0%.
Also, the I² statistic increases as sample size increases,
so the large samples in this project may explain the
large I² statistics that were observed (Rücker, Schwarzer,
Carpenter, & Schumacher, 2008). As in the first Many
Labs project (Klein et al., 2014a), heterogeneity was
greater for large effects than for small effects. The
Spearman rank-order correlation between aggregate
effect size and I² was .56.
Finally, as estimated with tau, only 1 effect (replication
of Huang et al., 2014) showed substantial standard
deviation among effect sizes (.24), and 8 others showed
modest heterogeneity near .10. Most of the effects, 19
of 28 (68%), showed near zero heterogeneity as esti-
mated by tau. Thus, according to this test, there was
modest evidence of heterogeneity overall, and when it
was observed in individual effects, it was quite small.
Table 3 summarizes the tests of moderation by lab
versus online setting. After we adjusted for multiple
comparisons, just one result showed a significant dif-
ference between lab and online samples (replication of
Zhong & Liljenquist, 2006). For this effect, the overall
result was not different from zero, and approximately
95% of the individual samples showed null effects.
These results suggest the need for some caution in
concluding that effects are moderated by whether data
are collected in the lab or online.
For exploratory cultural comparisons, we computed
a WEIRDness (Henrich et al., 2010) score for each sam-
ple based on its country of origin, using public country
rankings. Western countries were given a score of 1,
and Eastern countries were given a score of 0. Devel-
oped countries were given a score of 1, and emerging
countries were given a score of 0. The list of developed
countries was obtained from the United Nations (United
Nations, Department of Economic and Social Affairs,
Development Policy and Analysis Division, 2014). Scores
for education, industrialization, and democratization
were taken from the United Nations’ Education Index
(Education Index, 2017) and Industrial Development
Report (United Nations Industrial Development Orga-
nization, 2015) and from the Global Democracy Rank-
ing (Campbell, Pölzlbauer, Barth, & Pölzlbauer, 2015).
We then computed a global WEIRDness score for each
sample by taking the mean across its scores. Details on
the computation and specific links to the country rank-
ings are available at https://osf.io/b7qrt/. Samples from
countries with WEIRDness scores above the mean across
samples were categorized as WEIRD (Slate 1: n = 42;
Slate 2: n = 44), and samples from countries with
WEIRDness scores below the mean were categorized
as “less WEIRD” (Slate 1: n = 22; Slate 2: n = 17; see
Fig. 3 for the distribution of WEIRDness scores).
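A minimal R sketch of the score for a single sample; the component values are illustrative placeholders rather than values from the cited rankings, and rescaling the education, industrialization, and democracy indices to the 0-to-1 range is an assumption:

# WEIRDness = mean of five 0-to-1 component scores for the sample's country.
western        <- 1     # 1 = Western, 0 = Eastern
educated       <- 0.89  # UN Education Index (already on a 0-1 scale)
industrialized <- 0.75  # industrialization score, rescaled to 0-1 (assumed)
rich           <- 1     # 1 = developed, 0 = emerging (UN classification)
democratic     <- 0.80  # democracy ranking score, rescaled to 0-1 (assumed)
weirdness <- mean(c(western, educated, industrialized, rich, democratic))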
Table 3 also presents heterogeneity statistics for com-
parisons of the WEIRD and less WEIRD cultures. For
13 of the 14 replication effects that were reliable and
in the same direction as the effects in the original stud-
ies, the effect was observed with similar magnitude in
the WEIRD and the less WEIRD samples. The only
exception was Huang et al.'s (2014) effect; in this case,
the WEIRD samples showed an effect in the same direc-
tion as the original WEIRD sample, and the less WEIRD
samples showed no overall effect. Both showed wide
variability across samples. This result is relatively con-
sistent with the original study, in which Hong Kong and
U.S. participants showed effects in opposite directions,
presumably because of observed between-sample dif-
ferences in whether wealthy people tended to live in
the north or south. It is likely that there is wide vari-
ability in whether wealthy people tend to live in the
Table 3. Results of Heterogeneity Tests for Each of the 28 Effects

For each effect, the table gives the global effect size (ES) and then the heterogeneity estimates tau, Q (df in parentheses), p, and I² [95% CI] for all samples (no moderators), followed by the corresponding tau, Q, p, and I² for the tests of moderation by WEIRD versus less WEIRD samples and by lab versus online samples.

Effect | Global ES(a) | All samples: tau, Q (df), p, I² | WEIRD vs. less WEIRD: tau, Q, p, I² | Lab vs. online: tau, Q, p, I²

Cohen's q effect size
Disgust sensitivity predicts homophobia (Inbar, Pizarro, Knobe, & Bloom, 2009) | 0.05 | .00, Q(58) = 55.80, p = .56, I² = 3% [0%, 30%] | 0.00, Q = 2.89, p = .09, I² = 0% [0%, 29%] | 0.00, Q = 0.18, p = .67, I² = 5% [0%, 31%]
Assimilation and contrast effects in question sequences (Schwarz, Strack, & Mai, 1991) | −0.07 | .10, Q(58) = 60.39, p = .39, I² = 15% [0%, 33%] | 0.10, Q = 0.61, p = .44, I² = 17% [0%, 35%] | 0.10, Q = 0.00, p = .97, I² = 16% [0%, 34%]

Cohen's d effect size
Correspondence bias (Miyamoto & Kitayama, 2002) | 1.82 | .00, Q(57) = 235.65, p < .001, I² = 65% [46%, 73%] | 0.00, Q = 1.47, p = .23, I² = 64% [45%, 72%] | 0.00, Q = 2.83, p = .09, I² = 64% [45%, 74%]
Perceived intentionality for side effects (Knobe, 2003) | 1.75 | .14, Q(58) = 631.72, p < .001, I² = 93% [92%, 97%] | 0.10, Q = 26.43, p < .001*, I² = 91% [87%, 95%] | 0.14, Q = 2.55, p = .11, I² = 93% [91%, 97%]
Trolley Dilemma 1: principle of double effect (Hauser, Cushman, Young, Jin, & Mikhail, 2007) | 1.35 | .10, Q(58) = 131.24, p < .001, I² = 54% [32%, 66%] | 0.10, Q = 4.80, p = .03, I² = 51% [27%, 64%] | 0.10, Q = 0.13, p = .72, I² = 55% [32%, 67%]
False consensus: supermarket scenario (Ross, Greene, & House, 1977) | 1.18 | .00, Q(58) = 65.54, p = .23, I² = 16% [0%, 41%] | 0.00, Q = 3.36, p = .07, I² = 12% [0%, 38%] | 0.00, Q = 0.26, p = .61, I² = 18% [0%, 43%]
Moral typecasting (Gray & Wegner, 2009) | 0.95 | .10, Q(59) = 203.30, p < .001, I² = 73% [62%, 83%] | 0.10, Q = 6.02, p = .01, I² = 71% [58%, 81%] | 0.10, Q = 0.52, p = .47, I² = 71% [59%, 82%]
False consensus: traffic-ticket scenario (Ross et al., 1977) | 0.95 | .00, Q(57) = 100.19, p < .001, I² = 43% [18%, 62%] | 0.00, Q = 0.00, p = .97, I² = 44% [19%, 63%] | 0.00, Q = 0.17, p = .68, I² = 46% [21%, 65%]
Preferences for formal versus intuitive reasoning (Norenzayan, Smith, Kim, & Nisbett, 2002) | 0.86 | .10, Q(56) = 156.75, p < .001, I² = 66% [54%, 81%] | 0.10, Q = 20.58, p < .001*, I² = 55% [36%, 73%] | 0.10, Q = 0.69, p = .41, I² = 67% [55%, 81%]
Less-is-better effect (Hsee, 1998) | 0.78 | .10, Q(56) = 158.41, p < .001, I² = 65% [49%, 77%] | 0.10, Q = 4.68, p = .03, I² = 63% [46%, 75%] | 0.10, Q = 1.69, p = .19, I² = 65% [49%, 77%]
Effect of framing on decision making (Tversky & Kahneman, 1981) | 0.40 | .00, Q(54) = 55.20, p = .43, I² = 6% [0%, 36%] | 0.00, Q = 1.46, p = .23, I² = 3% [0%, 37%] | 0.00, Q = 0.20, p = .66, I² = 7% [0%, 38%]
Cardinal direction and socioeconomic status (Huang, Tse, & Cho, 2014) | 0.40 | .24, Q(63) = 626.26, p < .001, I² = 89% [84%, 92%] | 0.22, Q = 13.01, p < .001*, I² = 87% [81%, 91%] | 0.24, Q = 1.64, p = .20, I² = 89% [84%, 92%]
Moral foundations of liberals versus conservatives (Graham, Haidt, & Nosek, 2009) | 0.29 | .09, Q(59) = 175.26, p < .001, I² = 64% [49%, 75%] | 0.09, Q = 0.25, p = .62, I² = 65% [49%, 75%] | 0.09, Q = 1.26, p = .26, I² = 65% [49%, 76%]
Reluctance to tempt fate (Risen & Gilovich, 2008) | 0.18 | .00, Q(58) = 87.82, p = .01, I² = 36% [6%, 54%] | 0.00, Q = 1.61, p = .20, I² = 34% [3%, 53%] | 0.00, Q = 0.53, p = .47, I² = 37% [7%, 55%]
Trolley Dilemma 2: principle of double effect (Hauser et al., 2007) | 0.25 | .00, Q(59) = 60.40, p = .42, I² = 12% [0%, 33%] | 0.00, Q = 0.90, p = .34, I² = 10% [0%, 34%] | 0.00, Q = 0.14, p = .71, I² = 11% [0%, 31%]
Consumerism undermines trust (Bauer, Wilkie, Kim, & Bodenhausen, 2012) | 0.12 | .00, Q(53) = 63.78, p = .15, I² = 12% [0%, 49%] | 0.00, Q = 0.04, p = .85, I² = 14% [0%, 50%] | 0.00, Q = 0.30, p = .58, I² = 15% [0%, 51%]
Influence of incidental anchors on judgment (Critcher & Gilovich, 2008) | 0.04 | .00, Q(58) = 64.88, p = .25, I² = 6% [0%, 43%] | 0.00, Q = 0.11, p = .75, I² = 8% [0%, 44%] | 0.00, Q = 1.17, p = .28, I² = 4% [0%, 41%]
Social value orientation and family size (Van Lange, Otten, De Bruin, & Joireman, 1997) | −0.03 | .00, Q(53) = 103.56, p < .001, I² = 50% [28%, 68%] | 0.00, Q = 1.15, p = .28, I² = 50% [28%, 68%] | 0.00, Q = 1.15, p = .28, I² = 49% [26%, 67%]
Moral violations and desire for cleansing (Zhong & Liljenquist, 2006) | 0.00 | .00, Q(51) = 65.59, p = .08, I² = 22% [0%, 52%] | 0.00, Q = 1.17, p = .28, I² = 21% [0%, 52%] | 0.00, Q = 9.15, p < .001*, I² = 4% [0%, 46%]
Vertical position and power (Giessner & Schubert, 2007) | 0.03 | .00, Q(58) = 62.87, p = .31, I² = 3% [0%, 42%] | 0.00, Q = 0.00, p = .96, I² = 5% [0%, 43%] | 0.00, Q = 6.19, p = .01, I² = 4% [0%, 35%]
Directionality and similarity (Tversky & Gati, 1978) | 0.01 | .00, Q(48) = 15.33, p = .99, I² = 0% [0%, 0%] | 0.00, Q = 0.42, p = .52, I² = 0% [0%, 0%] | 0.00, Q = 0.12, p = .73, I² = 0% [0%, 0%]
Sociometric status and well-being (Anderson, Kraus, Galinsky, & Keltner, 2012) | −0.04 | .00, Q(58) = 55.09, p = .58, I² = 2% [0%, 30%] | 0.00, Q = 0.83, p = .36, I² = 2% [0%, 30%] | 0.00, Q = 3.21, p = .07, I² = 0% [0%, 16%]
Priming “heat” increases belief in global warming (Zaval, Keenan, Johnson, & Weber, 2014) | −0.03 | .10, Q(46) = 72.96, p = .01, I² = 37% [8%, 63%] | 0.10, Q = 0.76, p = .38, I² = 37% [8%, 63%] | 0.10, Q = 0.50, p = .48, I² = 40% [11%, 64%]
Structure promotes goal pursuit (Kay, Laurin, Fitzsimons, & Landau, 2014) | −0.02 | .00, Q(51) = 33.95, p = .97, I² = 0% [0%, 2%] | 0.00, Q = 3.10, p = .08, I² = 0% [0%, 0%] | 0.00, Q = 2.06, p = .15, I² = 0% [0%, 0%]
Disfluency engages analytic processing (Alter, Oppenheimer, Epley, & Eyre, 2007) | −0.03 | .00, Q(65) = 59.46, p = .67, I² = 0% [0%, 27%] | 0.00, Q = 1.38, p = .24, I² = 0% [0%, 27%] | 0.00, Q = 0.91, p = .34, I² = 0% [0%, 21%]
Effect of choosing versus rejecting on relative desirability (Shafir, 1993) | −0.13 | .00, Q(40) = 51.67, p = .10, I² = 26% [0%, 52%] | 0.00, Q = 0.55, p = .46, I² = 26% [0%, 53%] | 0.00, Q = 0.14, p = .71, I² = 25% [0%, 50%]
Affect and risk (Rottenstreich & Hsee, 2001) | −0.08 | .00, Q(59) = 50.75, p = .77, I² = 0% [0%, 21%] | 0.00, Q = 0.28, p = .60, I² = 0% [0%, 22%] | 0.00, Q = 0.31, p = .58, I² = 0% [0%, 25%]
Construing actions as choices (Savani, Markus, Naidu, Kumar, & Berlia, 2010) | −0.18 | .00, Q(56) = 155.49, p < .001, I² = 64% [47%, 76%] | 0.00, Q = 3.69, p = .05, I² = 62% [43%, 74%] | 0.00, Q = 0.61, p = .44, I² = 65% [48%, 77%]

Note: Numbers inside brackets are 95% confidence intervals. For the tests of moderation, df for Q is 1. Bonferroni correction for multiple comparisons suggests alpha = .004 (Slate 1) and alpha = .003 (Slate 2) as the criteria for statistical significance. Asterisks indicate significant moderation by the type of sample. Random-effects meta-analyses were conducted using the R package metafor (Viechtbauer, 2010). Between-study variance was estimated using restricted maximum likelihood.
(a) This column presents the global effect size (ES) for each tested effect. This information is also presented in Table 4, but it is included here as well so that these values can be easily compared with the estimates of tau.
north or south of the many different settings within our
WEIRD and less WEIRD samples, and that this produced
the high observed variability. Among the 14 effects that
were null in the aggregate or in the direction opposite
the effect in the original WEIRD sample, there was little
evidence for the original finding in most WEIRD and less
WEIRD samples. However, in the case of Savani et al.'s
(2010) effect, both the WEIRD and the less WEIRD rep-
lication samples showed effects that were in the direction
of the effect in the original Indian sample.
Ultimately, just three effects (those originally reported
by Huang et al., 2014; Knobe, 2003; and Norenzayan
et al., 2002) showed significant evidence for moderation
by WEIRDness after correction for multiple compari-
sons. However, for Norenzayan et al.'s effect, the cul-
tural difference was the inverse of the original result,
though we note that the original study did not have a
theoretical commitment regarding the cultural differ-
ence in the condition effect that we tested. Norenzayan
et al. focused on rule-based responses across conditions
and predicted that their European American sample
would show greater rule-based responses than the East
Asian sample within each condition (see note 5).
Influence of task order
The order in which tasks are presented could moderate
effect sizes. Across a 30-min session, effects may
weaken if participants tire or if earlier procedures inter-
fere with later procedures. We did not observe this
pattern in prior Many Labs investigations with the same
design (Ebersole et al., 2016; Klein et al., 2014a), but
task order remains a potential moderator. Order of
administration in the current project was randomized,
so we were able to test this possibility directly. Figure
4 shows the magnitude of each effect for each position
in which it was presented in its slate (i.e., from 1, pre-
sented first, to 13 or 15, presented last). For each of the
28 effects, Table 4 shows the aggregate effect size, the
effect size when the study was administered first, and
the effect size when the study was administered last.
Across the 28 findings, we observed little systematic
evidence that effects were stronger (or weaker) when
administered first compared with last. Also, there was
no evidence of linear, quadratic, or cubic trends by task
order (for analytic details, see https://osf.io/z8dqs/).
Examination of all task positions for all 28 findings
revealed that the aggregate effect size fell outside of
the 95% CI for 29 of the 394 estimates (7.4%), a percent-
age that is not much different from what would be
expected by chance (5%). Also, the distribution of the
significant effects appears to have been relatively ran-
dom across effects and positions (Fig. 4).
Authors of four of the original articles (Alter et al.,
2007; Giessner & Schubert, 2007; Miyamoto & Kitayama,
2002; Schwarz et al., 1991) noted a priori that their find-
ings might be sensitive to order of administration. How-
ever, there was no evidence for systematic variation in
magnitude by task order for any of these effects. It is
still possible that there are specific order effects, such
as when a particular procedure immediately precedes
another particular procedure; but these analyses confirm
that the findings, in the aggregate, are robust to task
order and, particularly, that task order cannot account
for observation of null effects for any of the nonreplicated
results.

[Figure 3 omitted: a histogram of WEIRDness scores (x-axis) by number of samples (y-axis).]
Fig. 3. Histogram of the WEIRDness (Western, educated, industrialized, rich, democratic) scores of the samples.

[Figure 4 omitted: mean effect size (r, or Cohen's q for the top two effects) by presentation order, 1 through 15, for each of the 28 effects; points are marked by whether the 95% CI of the effect size at that position contains the aggregate effect size.]
Fig. 4. Effect-size estimates for each study for each position in which it was presented. For comparison, the aggregate effect size for each of the 28 effects is presented at the right. Error bars represent 95% confidence intervals (CIs). SES = socioeconomic status; SVO = social value orientation; SMS = sociometric status.

Table 4. Comparison of Each Study's Global Effect Size With Its Effect Size When the Study Was Administered First and Last in Its Slate

Effect | Global effect size | Effect size in first position | Effect size in last position

Cohen's q effect size
Disgust sensitivity predicts homophobia (Inbar, Pizarro, Knobe, & Bloom, 2009) | 0.05 [0.01, 0.10] | 0.01 [−0.16, 0.18] | −0.06 [−0.23, 0.11]
Assimilation and contrast effects in question sequences (Schwarz, Strack, & Mai, 1991) | −0.07 [−0.12, −0.02] | −0.06 [−0.23, 0.12] | −0.13 [−0.29, 0.03]

Cohen's d effect size
Correspondence bias (Miyamoto & Kitayama, 2002) | 1.82 [1.76, 1.87] | 1.88 [1.68, 2.07] | 1.63 [1.43, 1.84]
Perceived intentionality for side effects (Knobe, 2003) | 1.75 [1.70, 1.80] | 1.47 [1.27, 1.66] | 1.82 [1.31, 2.03]
Trolley Dilemma 1: principle of double effect (Hauser, Cushman, Young, Jin, & Mikhail, 2007) | 1.35 [1.28, 1.41] | 1.57 [1.33, 1.81] | 1.21 [0.98, 1.44]
False consensus: supermarket scenario (Ross, Greene, & House, 1977) | 1.18 [1.13, 1.23] | 1.22 [1.05, 1.39] | 1.12 [0.93, 1.30]
Moral typecasting (Gray & Wegner, 2009) | 0.95 [0.91, 1.00] | 1.07 [0.88, 1.26] | 1.20 [1.01, 1.39]
False consensus: traffic-ticket scenario (Ross et al., 1977) | 0.95 [0.90, 1.00] | 1.05 [0.88, 1.21] | 0.93 [0.75, 1.11]
Preferences for formal versus intuitive reasoning (Norenzayan, Smith, Kim, & Nisbett, 2002) | 0.86 [0.81, 0.91] | 0.69 [0.52, 0.87] | 0.71 [0.53, 0.89]
Less-is-better effect (Hsee, 1998) | 0.78 [0.74, 0.83] | 0.75 [0.56, 0.93] | 0.85 [0.66, 1.03]
Effect of framing on decision making (Tversky & Kahneman, 1981) | 0.40 [0.35, 0.45] | 0.47 [0.26, 0.68] | 0.41 [0.21, 0.62]
Cardinal direction and socioeconomic status (Huang, Tse, & Cho, 2014) | 0.40 [0.35, 0.45] | 0.31 [0.13, 0.49] | 0.35 [0.17, 0.52]
Moral foundations of liberals versus conservatives (Graham, Haidt, & Nosek, 2009) | 0.29 [0.25, 0.34] | 0.47 [0.30, 0.65] | 0.31 [0.14, 0.49]
Reluctance to tempt fate (Risen & Gilovich, 2008) | 0.18 [0.14, 0.22] | 0.12 [−0.05, 0.29] | 0.42 [0.25, 0.60]
Trolley Dilemma 2: principle of double effect (Hauser et al., 2007) | 0.25 [0.20, 0.30] | 0.20 [0.002, 0.41] | 0.24 [0.04, 0.44]
Consumerism undermines trust (Bauer, Wilkie, Kim, & Bodenhausen, 2012) | 0.12 [0.07, 0.17] | 0.03 [−0.16, 0.21] | 0.14 [−0.03, 0.32]
Influence of incidental anchors on judgment (Critcher & Gilovich, 2008) | 0.04 [−0.01, 0.09] | 0.09 [−0.08, 0.27] | 0.05 [−0.12, 0.22]
Social value orientation and family size (Van Lange, Otten, De Bruin, & Joireman, 1997) | −0.03 [−0.08, 0.02] | −0.08 [−0.26, 0.10] | −0.11 [−0.30, 0.08]
Moral violations and desire for cleansing (Zhong & Liljenquist, 2006) | 0.00 [−0.05, 0.04] | 0.01 [−0.18, 0.20] | 0.17 [−0.02, 0.36]
Vertical position and power (Giessner & Schubert, 2007) | 0.03 [−0.01, 0.08] | 0.01 [−0.18, 0.19] | −0.02 [−0.20, 0.15]
Directionality and similarity (Tversky & Gati, 1978) | 0.01 [−0.02, 0.04] | 0.13 [−0.01, 0.26] | −0.01 [−0.14, 0.12]
Sociometric status and well-being (Anderson, Kraus, Galinsky, & Keltner, 2012) | −0.04 [−0.09, 0.005] | −0.08 [−0.26, 0.10] | 0.13 [−0.04, 0.30]
Priming “heat” increases belief in global warming (Zaval, Keenan, Johnson, & Weber, 2014) | −0.03 [−0.09, 0.03] | 0.08 [−0.15, 0.30] | −0.11 [−0.35, 0.14]
Structure promotes goal pursuit (Kay, Laurin, Fitzsimons, & Landau, 2014) | −0.02 [−0.07, 0.03] | −0.01 [−0.18, 0.17] | 0.10 [−0.08, 0.27]
Disfluency engages analytic processing (Alter, Oppenheimer, Epley, & Eyre, 2007) | −0.03 [−0.08, 0.01] | −0.02 [−0.20, 0.16] | −0.10 [−0.28, 0.07]
Effect of choosing versus rejecting on relative desirability (Shafir, 1993) | −0.13 [−0.18, −0.09] | −0.08 [−0.25, 0.09] | −0.20 [−0.40, −0.03]
Affect and risk (Rottenstreich & Hsee, 2001) | −0.08 [−0.13, −0.02] | −0.12 [−0.30, 0.07] | −0.03 [−0.20, 0.14]
Construing actions as choices (Savani, Markus, Naidu, Kumar, & Berlia, 2010) | −0.18 [−0.21, −0.16] | −0.15 [−0.24, −0.06] | −0.20 [−0.29, −0.11]

Note: Numbers inside brackets are 95% confidence intervals. Last position was 13 for Slate 1 effects and 15 for Slate 2 effects. The “Global” column presents the overall effect sizes ignoring task position.
Discussion
With protocols that were peer reviewed in advance, we
conducted preregistered replications of 28 published
results, collecting data from 125 samples, including
thousands of participants from locations around the
world. According to the conventional criterion for sta-
tistical significance (p < .05), 15 (54%) of the replica-
tions provided significant evidence for an effect in the
same direction as the original. According to a strict
significance criterion (p < .0001), 14 (50%) of the 28
replications still provided such evidence—a reflection
of the extremely high-powered design (for Inbar etal.,
2009, the replication p value was .02). Seven (25%) of
the replications had effect sizes (Cohen’s d or q) larger
than the original, and 21 (75%) had effect sizes smaller
than the original. In the WEIRD samples, the median
Cohen’s d was 0.60 for the original findings and 0.15
for the replications—a substantial decline (Open Sci-
ence Collaboration, 2015).8 Sixteen replications (57%)
had small effect sizes (< 0.20), and 9 (32%) replication
effects were in the direction opposite that of the origi-
nal finding. Three of the latter (i.e., replications of
Rottenstreich & Hsee, 2001; Schwarz et al., 1991; and
Shafir, 1993) had an aggregate replication effect size
that was significantly in the opposite direction accord-
ing to the p < .05 criterion, but only 1 (the replication
of Shafir, 1993) was significant at the p < .0001 level.
There is no simple decision rule for declaring success
or failure in replication or for detecting positive results
(Benjamin et al., 2018; Camerer et al., 2018; Open Sci-
ence Collaboration, 2015). In Table 5, we summarize
the replication success for each of the 28 global effects
according to five possible criteria. Each criterion evalu-
ates whether the observed effect size would be consid-
ered statistically significant under different conditions.
Two criteria used the replication data as reported in
this article and applied either a loose significance cri-
terion (p < .05) or a strict significance criterion (p <
.0001). According to these criteria, the success rate was
54% or 50%, respectively. Other approaches considered
what the p value would have been if the effect size we
observed had been obtained with different sample
sizes: the original study’s sample size, a sample 2.5
times the original study’s sample size (Simonsohn,
2015), or 50 participants per group—a reasonably large
sample compared with historical trends (Fraley &
Vazire, 2014). With the significance criterion set to p <
.05 for all three cases, the success rate was 41%, 44%,
or 35%, respectively. Ten of the effects (36%) were suc-
cessfully replicated according to all the criteria that
could be applied, and 13 (46%) were unsuccessfully
replicated according to all the criteria that could be
applied.9 Five findings (18%) varied in replication suc-
cess depending on the criterion used, usually because
the replication effect size was substantially smaller than
the original effect size.
The final column in Table 5 indicates the sample size
that would be needed to have 80% power to detect each
original effect given the observed global effect size and
alpha of .05. Effects that were highly replicable across
all the criteria were relatively large in magnitude and
would be relatively efficient to investigate (i.e., they
would require modest sample sizes: Ns from 12 to 54
except for one N of 200 and another of 506). Effects
that were inconsistently replicated across criteria would
need more substantial sample sizes to study efficiently
(Ns from 200 to 2,184). Effects that were in the same
direction as in the original study but too weak for us
to reject the null hypothesis of no effect with our large
samples would need massive samples to reject the null
hypothesis (Ns from 6,283 to 313,958). Finally, for the
10 findings that had replication effect sizes of 0 or
replication effects that were in the direction opposite
the original, the null hypothesis cannot be rejected no
matter what sample size is used.
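The final column can be reproduced with a standard power calculation. A minimal R sketch for the moral-typecasting replication (global d = 0.95), which recovers the total N of 38 shown in Table 5:

# Sample size per group for 80% power at alpha = .05, two-sample t test,
# given the observed global effect size (Cohen's d = 0.95).
pwr <- power.t.test(delta = 0.95, sd = 1, sig.level = .05, power = .80)
ceiling(pwr$n) * 2  # round up and double for two groups: 38 participants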
The high proportion of failures to replicate, despite
our extremely large samples, is consistent with the accu-
mulating evidence from other systematic replication stud-
ies (Camerer et al., 2016; Camerer et al., 2018; Ebersole
et al., 2016; Klein et al., 2014a; Open Science Collabora-
tion, 2015). We cannot identify whether these results are
due to errors in replication design, p-hacking in original
studies, or publication bias in which positive results are
selected despite pervasive low-powered research. How-
ever, it is notable that surveys and prediction markets in
which researchers predicted and bet on whether these
original effects would be replicated were effective at pre-
dicting replication success. For example, the correlation
between market price and replication success for the
Many Labs 2 studies was .755. These results are reported
in a separate article (Forsell et al., in press) and replicate
other studies using prediction markets and surveys to
predict replication success (Camerer et al., 2016; Camerer
et al., 2018; Dreber et al., 2016). In any case, these find-
ings provide further justification for improving the trans-
parency of research (Miguel et al., 2014; Nosek et al.,
2015), preregistering studies to make all findings discov-
erable even if they are not published, and preregistering
analysis plans to make clear the distinction between con-
firmatory tests and exploratory discoveries (Nosek, Eber-
sole, DeHaven, & Mellor, 2018; Wagenmakers, Wetzels,
Borsboom, van der Maas, & Kievit, 2012).
Table 5. Summary of Replication Success and Failure According to Different Criteria

Effect | Original N | Replication N | Replication's global effect size | Test used to detect the effect | Replication N, p < .05 (a) | Replication N, p < .0001 (a) | Original N, p < .05 (a) | 2.5× original N, p < .05 (a) | 50 per group, p < .05 (a) | Minimum N for 80% power, alpha = .05 (b)
Correspondence bias (Miyamoto & Kitayama, 2002) | 107 | 7,197 | 1.82 | General linear model (main effect) | < 1e−10 | < 1e−10 | 4.65e−9 | < 1e−10 | < 1e−10 | 12
Perceived intentionality for side effects (Knobe, 2003) | 78 | 7,982 | 1.75 | Welch's two-sample t test | < 1e−10 | < 1e−10 | < 1e−10 | < 1e−10 | < 1e−10 | 14
Trolley Dilemma 1: principle of double effect (Hauser, Cushman, Young, Jin, & Mikhail, 2007) | 2,646 | 6,842 | 1.35 | Two-tailed Fisher's exact test | < 1e−10 | < 1e−10 | < 1e−10 | < 1e−10 | — | 20
False consensus: supermarket scenario (Ross, Greene, & House, 1977) | 80 | 7,205 | 1.18 | Welch's two-sample t test | < 1e−10 | < 1e−10 | 6.98e−6 | < 1e−10 | 5.18e−8 | 26
Moral typecasting (Gray & Wegner, 2009) | 69 | 8,002 | 0.95 | Welch's two-sample t test | < 1e−10 | < 1e−10 | 1.86e−4 | 3.06e−9 | 6.55e−6 | 38
False consensus: traffic-ticket scenario (Ross et al., 1977) | 80 | 7,827 | 0.95 | Welch's two-sample t test | < 1e−10 | < 1e−10 | 6.06e−5 | 2.04e−10 | 6.64e−6 | 38
Preferences for formal versus intuitive reasoning (Norenzayan, Smith, Kim, & Nisbett, 2002) | 157 | 7,396 | 0.86 | Welch's two-sample t test | < 1e−10 | < 1e−10 | < 1e−10 | < 1e−10 | 3.92e−5 | 46
Less-is-better effect (Hsee, 1998) | 83 | 7,646 | 0.78 | Welch's two-sample t test | < 1e−10 | < 1e−10 | 6.18e−4 | 5.76e−8 | 1.69e−4 | 54
Effect of framing on decision making (Tversky & Kahneman, 1981) | 181 | 7,228 | 0.40 | Two-tailed Fisher's exact test | < 1e−10 | < 1e−10 | .031 | 6.29e−4 | — | 200
Cardinal direction and socioeconomic status (Huang, Tse, & Cho, 2014) | 180 | 6,591 | 0.40 | Welch's two-sample t test | < 1e−10 | < 1e−10 | .080 | 5.47e−3 | .0498 | 200
Moral foundations of liberals versus conservatives (Graham, Haidt, & Nosek, 2009) | 1,209 | 6,966 | 0.29 | Fisher's r-to-z test (1 correlation) | < 1e−10 | < 1e−10 | 4.12e−7 | < 1e−10 | .318 | 376
Trolley Dilemma 2: principle of double effect (Hauser et al., 2007) | 2,612 | 7,923 | 0.25 | Two-tailed Fisher's exact test | < 1e−10 | < 1e−10 | 4.85e−8 | < 1e−10 | — | 506
Reluctance to tempt fate (Risen & Gilovich, 2008) | 120 | 8,000 | 0.18 | Two-sample t test | < 1e−10 | < 1e−10 | .325 | .119 | .369 | 972
Consumerism undermines trust (Bauer, Wilkie, Kim, & Bodenhausen, 2012) | 77 | 6,608 | 0.12 | Two-sample t test | < 1e−10 | < 1e−10 | .594 | .399 | .546 | 2,184
Disgust sensitivity predicts homophobia (Inbar, Pizarro, Knobe, & Bloom, 2009) | 44 | 7,117 | 0.05 | Fisher's r-to-z test (2 correlations) | .024 | .024 | .871 | .788 | .794 | 6,283
Influence of incidental anchors on judgment (Critcher & Gilovich, 2008) | 200 | 6,826 | 0.04 | Two-sample t test | .092 | .092 | .773 | .649 | .839 | 19,626
Vertical position and power (Giessner & Schubert, 2007) | 64 | 7,890 | 0.03 | Two-sample t test | .162 | .162 | .900 | .842 | .875 | 34,886
Directionality and similarity (Tversky & Gati, 1978) | 144 | 3,549 | 0.01 | One-sample t test | .550 | .550 | .973 | .983 | .953 | 313,958
Moral violations and desire for cleansing (Zhong & Liljenquist, 2006) | 27 | 7,001 | 0.00 | Two-sample t test | .910 | .910 | .994 | .991 | .989 | NA
Structure promotes goal pursuit (Kay, Laurin, Fitzsimons, & Landau, 2014) | 67 | 6,506 | −0.02 | Welch's two-sample t test | .347 | .347 | .924 | .880 | .907 | NA
Social value orientation and family size (Van Lange, Otten, De Bruin, & Joireman, 1997) | 536 | 6,234 | −0.03 | Fisher's r-to-z test (1 correlation) | .183 | .183 | .697 | .537 | .908 | NA
Disfluency engages analytic processing (Alter, Oppenheimer, Epley, & Eyre, 2007) | 41 | 6,935 | −0.03 | Two-sample t test | .171 | .171 | .917 | .868 | .870 | NA
Priming "heat" increases belief in global warming (Zaval, Keenan, Johnson, & Weber, 2014) | 192 | 4,204 | −0.03 | Two-sample t test | .274 | .274 | .816 | .712 | .866 | NA
Sociometric status and well-being (Anderson, Kraus, Galinsky, & Keltner, 2012) | 116 | 6,905 | −0.04 | Two-sample t test | .079 | .079 | .820 | .719 | .833 | NA
Assimilation and contrast effects in question sequences (Schwarz, Strack, & Mai, 1991) | 100 | 7,460 | −0.07 | Fisher's r-to-z test (2 correlations) | .002 | .002 | .734 | .583 | .734 | NA
Affect and risk (Rottenstreich & Hsee, 2001) | 40 | 7,218 | −0.08 | Two-tailed Fisher's exact test | .002 | .002 | .831 | .735 | — | NA
Effect of choosing versus rejecting on relative desirability (Shafir, 1993) | 170 | 7,901 | −0.13 | One-sample z test | 5.47e−10 | 5.47e−10 | .186 | .079 | .314 | NA
Construing actions as choices (Savani, Markus, Naidu, Kumar, & Berlia, 2010) | 218 | 5,882 | −0.18 | Generalized linear mixed model with a binomial (logit) link (main effect) | 8.04e−06 | 8.04e−06 | — | — | — | NA
Number of successful replications | | | | | 15 | 14 | 11 | 12 | 8 |
Number of unsuccessful replications | | | | | 13 | 14 | 16 | 15 | 16 |
Success rate | | | | | 54% | 50% | 41% | 44% | 33% |

Note: The effects are ordered by the global effect size of the replication, with the largest effect size first. Negative effect sizes indicate effects in the direction opposite that observed in the original WEIRD (Western, educated, industrialized, rich, democratic) sample. If another effect size was used in the original study (e.g., correlation, odds ratio, proportion), that value was transformed to Cohen's d; two effect sizes (Inbar et al., 2009; Schwarz et al., 1991) are on a different metric (Cohen's q).
(a) These columns present p values calculated on the basis of the observed global effect size in the replication study. For the first two criteria, p values were calculated using the sample size of the replication; for the third, fourth, and fifth criteria, p values were calculated under the assumption of other sample sizes (the original study's sample size, a sample 2.5 times the original study's sample size, or 50 participants per group). A p value counts as a successful replication under a criterion only if it falls below that criterion's threshold and the global replication effect is in the same direction as the original (i.e., the effect size is positive). Replication success for Savani et al.'s (2010) effect could not be determined using three of the criteria because a p value could not be computed for the test used. Replication success according to the 50-participants-per-group criterion could not be determined when the test was a two-tailed Fisher's exact test (four findings), because this would require making strong assumptions about how the sample was distributed in the 2 × 2 frequency table.
(b) This column shows the sample size needed for 80% power to detect a significant effect in the same direction as the original, given the observed global effect size and alpha = .05. Power analyses were conducted using the Cohen's d and Cohen's q values for the replication effect sizes. "NA" indicates that the global effect was in the direction opposite the original finding.
The main purpose of this investigation was to assess variability in effect sizes by sample and setting. It is reasonable to expect that many psychological phenomena are moderated by variation in sample, setting, or procedural details, and that this variation may impact reproducibility (Henrich et al., 2010; Klein et al., 2014a, 2014b; Markus & Kitayama, 1991; Schwarz & Strack, 2014; Van Bavel, Mende-Siedlecki, Brady, & Reinero, 2016). However, we found a very strong correlation of samples across effects (ICC = .782) and a nearly zero correlation of effects across samples (ICC = .004). As one would expect, knowing the effect being studied provides much more information about effect size than does knowing the sample being studied. Just 11 of the 28 effects (39%) showed significant heterogeneity according to the Q test, and most of these 11 were among the effects with the largest overall magnitude. Only one of the near-zero replication effects (Van Lange et al., 1997) showed significant heterogeneity with the Q test. In other words, if no effect was observed overall, there was also very little evidence for heterogeneity among the samples.
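For concreteness, one simple estimator of such intraclass correlations is the one-way random-effects ICC(1), applied twice to a matrix of effect-size estimates, once with effect as the grouping factor and once with sample as the grouping factor. The sketch below (placeholder data; not necessarily the exact estimator behind the values above) illustrates the computation:

```python
# A minimal sketch of a one-way random-effects ICC(1) for a
# samples x effects matrix of effect-size estimates; this is one common
# estimator, not necessarily the exact one used for the values above.
import numpy as np

def icc1(groups: np.ndarray) -> float:
    """ICC(1) for a (n_groups, k_measurements) matrix (Shrout-Fleiss style)."""
    n, k = groups.shape
    grand_mean = groups.mean()
    ms_between = k * ((groups.mean(axis=1) - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((groups - groups.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# es[i, j] = effect-size estimate for effect i in sample j (placeholder
# random data here; with the real data, grouping by effect yields the
# high ICC and grouping by sample the near-zero ICC reported above).
es = np.random.default_rng(0).normal(size=(28, 60))
print("grouped by effect:", icc1(es))
print("grouped by sample:", icc1(es.T))
```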
The I² statistic indicated at least medium heterogeneity across samples for approximately one third of all the effects studied (36%), but almost all the I² estimates had high uncertainty (i.e., wide confidence intervals). Taken at face value, the I² statistics in Table 3 indicate that heterogeneity in the samples was high for some of the findings, even when there was little evidence for an effect. For example, for Zaval et al. (2014), the main effect was not distinguishable from zero, and 89% of the individual samples showed nonsignificant effects, which is close to expectation if the samples were drawn from a null distribution, and yet the I² was 37%. However, even if the average effect size is 0 and a majority of the results are null, there can be strong heterogeneity as measured with I² (see https://osf.io/frbuv/ for an explanation). I² compares variability in the dependent variable across samples with variability within samples. With increasing power, I² will tend toward 100% if there is any evidence for heterogeneity, no matter how small the effect. Thus, our I² estimates likely reflect our extremely large sample sizes rather than the amount of heterogeneity in absolute terms.
By comparison, the estimates for tau in Table 3 indicate a small standard deviation in effect sizes for all studies except one (tau = .24 for Huang et al., 2014).
In fact, 19 of the 28 effects (68%) had an estimated tau
near 0, an indication of minimal heterogeneity, and 8
(29%) had an estimated tau near .10, an indication of
a small amount of heterogeneity. It is not so surprising
that this was the case for the effects that failed to be
replicated globally, but it was also occasionally the
case for successful replications. More important, even
among the successful replications, when heterogeneity
was observed with tau, it was relatively weak. As a
consequence, at least for the variation investigated here,
heterogeneity across samples does not provide much
explanatory power for failures to replicate.
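As a point of reference, the following sketch implements the standard DerSimonian-Laird versions of the three statistics discussed above (Cochran's Q, I², and tau), where yi denotes the per-sample effect estimates and vi their sampling variances; the project's meta-analyses may differ in detail. The toy simulation also illustrates the large-sample behavior noted above: with many high-powered samples, I² can be sizable even when the pooled effect is near zero and tau is small.

```python
# A minimal sketch (standard DerSimonian-Laird estimators; the project's
# meta-analyses may differ in detail) of the heterogeneity statistics
# discussed above: Cochran's Q, I^2, and tau.
import numpy as np

def heterogeneity(yi: np.ndarray, vi: np.ndarray) -> dict:
    """yi: per-sample effect estimates; vi: their sampling variances."""
    w = 1.0 / vi                          # inverse-variance weights
    y_bar = (w * yi).sum() / w.sum()      # pooled (fixed-effect) estimate
    q = (w * (yi - y_bar) ** 2).sum()     # Cochran's Q
    df = len(yi) - 1
    c = w.sum() - (w ** 2).sum() / w.sum()
    tau2 = max(0.0, (q - df) / c)         # DL between-sample variance
    i2 = max(0.0, (q - df) / q) * 100     # % of variation due to heterogeneity
    return {"Q": q, "I2": i2, "tau": tau2 ** 0.5}

# Toy data: 60 large samples, true pooled effect 0, small true tau (.10).
rng = np.random.default_rng(1)
vi = rng.uniform(0.005, 0.02, size=60)    # large samples -> small variances
yi = rng.normal(0.0, np.sqrt(vi + 0.10 ** 2))
print(heterogeneity(yi, vi))              # I^2 is sizable despite a ~0 pooled effect
```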
Our estimates of average effect size and effect-size
heterogeneity may have been affected by imperfect
reliabilities of the instruments that measured the out-
come variables. For instance, Hunter and Schmidt
(1990) showed how imperfect reliabilities attenuate
effect-size estimates and suggested correcting for these
imperfections when estimating effect size. As imperfect
reliabilities were not corrected for in the original studies
or our replications, systematic differences in effect-size
estimates between the original and replication studies
cannot be explained by imperfect reliabilities, unless
the measurement instruments were systematically much
less reliable in the replications than in the original stud-
ies; we have no evidence that this is the case. Differ-
ences across labs in reliabilities of measurement
instruments may also result in overestimation of effect-
size heterogeneity in cases of a true nonzero effect size.
Insofar as these differences existed in the current inves-
tigation, our results likely overestimate heterogeneity,
as our analyses did not take imperfect reliabilities of
variables into account.
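The correction Hunter and Schmidt proposed is simple to state; the sketch below is purely illustrative, since, as noted, no such correction was applied in the original studies or in our replications:

```python
# A minimal sketch of the Hunter-Schmidt disattenuation idea referenced
# above (illustrative only; no such correction was applied in this project).
from math import sqrt

def correct_r(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Correlation corrected for measurement error in both variables."""
    return r_obs / sqrt(rel_x * rel_y)

def correct_d(d_obs: float, rel_y: float) -> float:
    """Standardized mean difference corrected for outcome unreliability."""
    return d_obs / sqrt(rel_y)

# E.g., an observed d of 0.30 on an outcome with reliability .70 implies a
# true-score d of about 0.36; a lab with lower reliability would show a
# smaller observed effect, mimicking between-lab heterogeneity.
print(correct_d(0.30, 0.70))  # ~0.359
```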
For 12 of the 28 effects, moderators or sample char-
acteristics that may be necessary to observe the effect
were identified a priori by the original authors or other
experts during the Registered Report review process.
These effect-specific analyses were reported in the sec-
tion on the individual effects. For 7 of those 12, the
pooled result was null or in the direction opposite the
original; for the other 5, the pooled results showed
evidence for the original finding. Evidence consistent
with the hypothesized moderation was obtained for just
1 of the 12 effects (8% of the total; Hauser et al., 2007:
Trolley Dilemma 1), and weak or partial evidence was
obtained for 2 (17%; Miyamoto & Kitayama, 2002; Risen
& Gilovich, 2008). For the other 9 (75%), there was little
evidence that narrowing the data sets to the samples
and settings deemed most relevant affected the likeli-
hood of observing the effects or the original effect
magnitudes. This does not mean that moderating effects
do not occur, but it may mean that psychological theory
is not yet advanced enough to predict them reliably.
Another possible moderating influence was task
order. Participants completed their slate of 13 or 15
effects in a randomized order. It was possible that per-
formance on tasks completed later in the sequence
would be affected by tasks completed earlier in the
sequence, either because of the specific content of the
tasks or because of interference, fatigue, or other order-
related influences (Ferguson, Carter, & Hassin, 2014;
Kahneman, 2016; Schnall, 2014). Contrary to this pre-
diction, we observed little evidence for systematic order
effects for the 28 findings investigated. This replicates
the lack of evidence for task-order effects observed in
Many Labs 1 (Klein et al., 2014a) and Many Labs 3 (Ebersole et al., 2016). Across 51 total replication tests (28 reported here; 13 in Klein et al., 2014a; and 10 in Ebersole et al., 2016), we have observed little evidence
for reliable effects of task order. The idea that whether
a study comes first in a sequence, in the middle, or at
the end has an impact on the magnitude of the observed
effect is appealing but, so far, unsupported.
The same is true for effects of administration in lab
versus online. Since the Internet became a source for
behavioral research, there has been interest in the
degree to which lab and online results are consistent
with one another (Birnbaum, 2004; Dandurand, Shultz,
& Onishi, 2008; Hilbig, 2016). As is the case for task
order, across Many Labs projects we have observed little
evidence for an effect of mode of administration. There
may be conditions under which it is consequential
whether a study is administered in the lab versus online,
but we have not observed meaningful evidence for such
an impact.
Finally, we included an exploratory analysis of the
moderating influence of cultural differences between
WEIRD and less WEIRD samples. We sampled from 125
highly heterogeneous sources (39 U.S. samples and 86
samples from 35 other countries and territories) to
maximize the possibility of observing variation in
effects based on sample characteristics. Ultimately, we
found compelling evidence for differences between our
WEIRD and less WEIRD samples for just three effects
(those originally reported by Huang etal., 2014; Knobe,
2003; and Norenzayan etal., 2002).
However, our approach characterized cultural differ-
ences at the most general level possible—a dichotomy
of WEIRDness—and ignored the rich diversity within
that categorization. The distribution of WEIRDness
scores was such that the WEIRD samples were highly
similar in WEIRDness, and the less WEIRD samples
varied substantially in WEIRDness. Figure 3 illustrates
the highly skewed distribution. Countries with scores
above 0.70 were categorized as WEIRD, and the rest
were categorized as less WEIRD. Our summary analyses
also did not address the possibility of highly specific
regional variations, such as differences between U.S.
and British samples, nor did they examine why differ-
ences were observed. Nor did these analyses investigate
many interesting sampling moderators available in this
data set, such as individual differences, gender, and
ethnicity. Some moderating influences could be evalu-
ated using the present data set; testing others will
require new data collections. Also, a true examination
of WEIRDness would need to more deliberately vary
sampling across each of the WEIRD dimensions. Further
analyses of the present data set may inspire hypotheses
to be tested in future studies.
Implications
It is practically a truism that the human behavior
observed in psychological studies is contingent on the
cultural and personal characteristics of the participants
under study and the setting in which they are studied.
The depth with which this idea is embedded in present
psychological theorizing is illustrated by the appeals to
“hidden moderators” as explanations of failures to rep-
licate when there have been no empirical tests of
whether such moderators are operative (Baumeister &
Vohs, 2016; Crisp, Miles, & Husnu, 2014; Gilbert, King,
Pettigrew, & Wilson, 2016; Ramscar, Shaoul, & Baayen,
2015; Schwarz & Clore, 2016; Stroebe & Strack, 2014;
Van Bavel et al., 2016). The present study suggests that
dismissing failures to replicate as a consequence of
such moderators without conducting independent tests
of the hypothesized moderators is unwise. Collectively,
we observed some evidence for effect-specific hetero-
geneity, particularly for relatively large effects; occa-
sional evidence for cultural variation; and little evidence
for moderation by procedural factors, such as task order
and lab versus online administration.
There have been a variety of failures to replicate
effects that were quite large in the original investigation
(e.g., Doyen, Klein, Pichon, & Cleeremans, 2012; Hagger
et al., 2016; Hawkins, Fitzgerald, & Nosek, 2015; Johnson,
Cheung, & Donnellan, 2014). If effects are highly con-
tingent on the sample and setting, then they could be
large and easily detected in some samples and negligible
in other samples. We did not observe this. Rather, evi-
dence for moderation or heterogeneity was mostly
observed in the large, consistently detectable effects.
Further, we observed some heterogeneity between
samples, but a priori predictions (e.g., original authors’
predictions of moderating influences) and prior findings
(e.g., previously observed cultural differences) were
minimally successful in accounting for it. For the effects
tested in Many Labs 2 at least, it appears that the cumu-
lative evidence base has not yet matured enough for
moderating influences to be predicted reliably. Simul-
taneously, there is accumulating evidence that research-
ers can predict the likelihood that an effect of interest
will be replicated (Camerer et al., 2016; Camerer et al., 2018; Dreber et al., 2016; Forsell et al., in press).
For many multistudy investigations, a common tem-
plate is to identify an effect in a first study and then
report evidence for a variety of moderating influences
in follow-up studies. A pessimistic interpretation would
suggest that this template may be a consequence of
practices that inflate the likelihood of false positives.
Consider the context, in which positive results are per-
ceived as more publishable than negative results
(Greenwald, 1975) and common analytic practices may
inadvertently increase the likelihood of obtaining false
positives (Simmons et al., 2011). In a program of
research, researchers might eventually obtain a signifi-
cant result for a simple effect and call that Study 1. In
follow-up studies, they might fail to observe the original
effect and then initiate a search for moderators. Such
post hoc searches necessarily increase the likelihood
of false positives, but finding one may simultaneously
reinforce belief in the original effect despite failure to
replicate it. That is, identifying a moderator may feel
like unpacking the phenomenon and explaining why
the main effect “failed.”
An ironic consequence is that the identification of a
moderator may simultaneously increase confidence in
an effect and decrease its credibility. Investigating mod-
erating influences is much harder than presently appre-
ciated in practice. A 2 × 2 × 2 ANOVA has a nominal
false positive rate of approximately .30 for one or more
of its seven tests (1 − .95⁷), yet correcting for multiple
tests in multivariate analyses is rare (Cramer et al.,
2016). Also, typical study designs are woefully under-
powered for studying moderation (Frazier, Tix, &
Barron, 2004; McClelland, 1997), perhaps because
researchers intuitively overestimate the power of vari-
ous research designs (Bakker, Hartgerink, Wicherts, &
van der Maas, 2016). The combination of low power
and lack of correction for multiple tests means that
every study offers ample opportunity for “detecting”
moderating influences that are not there.
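The familywise arithmetic is easy to check for any number of nominally independent tests:

```python
# Familywise Type I error for m independent tests at a given alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m.
def familywise_alpha(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

print(familywise_alpha(7))  # ~0.302: the seven tests of a 2 x 2 x 2 ANOVA
print(familywise_alpha(3))  # ~0.143: even a 2 x 2 design's three tests
```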
Ultimately, the main implication of the present find-
ings is a plain one: It is not sufficient to presume that
moderating influences account for observed variation
in a phenomenon. Cultural, sample, or procedural vari-
ation could be a reasonable hypothesis as an account
for differences in observed effects, but it is not a cred-
ible hypothesis until it survives a confirmatory test
(Nosek etal., 2018).
Limitations
The present study has the strength of data collected
from very large samples from a wide variety of settings.
Nevertheless, the generalizability of these results to
other psychological findings is unknown. Fifty percent
of the original effects we tested were reproduced,
which is roughly consistent with the rate of replication
success in other large-scale investigations (Camerer
et al., 2016; Camerer et al., 2018; Ebersole et al., 2016; Klein et al., 2014a; Open Science Collaboration, 2015).
However, the findings selected for this investigation
were not a random sample of any definable population,
nor did they constitute a large sample. It may be sur-
prising that just 50% of the findings were reproduced
under the circumstances of this project (original materi-
als, peer review in advance, extremely high power,
multiple samples), but that does not mean that 50% of
all findings in psychology will be reproduced, or fail
to be reproduced, under similar circumstances.
This study has the advantage over prior work of including many tests and large samples, which allowed relatively precise estimation. Nevertheless, the
failures to replicate do not necessarily mean that the
tested hypotheses are incorrect. The lack of an effect
may be limited to the particular procedural conditions
of the test. Future theory and evidence will need to
account for why the effects were not observed in these
circumstances if they are replicable in others. Con-
versely, the successful replications add substantial pre-
cision for effect-size estimation and extend the
generalizability of those phenomena across a variety of
samples and settings.
Data availability
The amassed data set is very rich for exploring the
individual effects, potential interactions between spe-
cific effects, and alternate ways to estimate heterogene-
ity and analyze the aggregate data. Our analysis plan
focused on the big picture and not, for example, explor-
ing potential moderating influences on each of the indi-
vidual effects. These would be worthy analyses, but they
are beyond the scope of a single report. Follow-up
investigations using these data could provide substantial
additional insight. For the accompanying Commentaries
solicited by Advances in Methods and Practices in Psy-
chological Science, we leveraged the extremely high-
powered design of this study to demonstrate the
productive interplay of exploratory and confirmatory
analysis strategies. Commenters received a third of the
data set for analysis. Upon completion of an exploratory
analysis, the analytic scripts were registered and applied
to the holdout data for a mostly confirmatory test (Nosek
et al., 2018). The analysts’ decisions could have been
influenced by advance observation of the summary
results in this article, but use of the holdout sample
reduced other potential biasing influences. Finally, the
full data set (plus the portions used for the exploratory-
confirmatory Commentaries) and all study materials are
available at https://osf.io/8cd4r/ so that other teams can
use them for their own investigations.
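In outline, the split-then-register workflow resembles the following sketch (hypothetical code, not the project's scripts; the actual partitioned files are available at https://osf.io/8cd4r/):

```python
# A minimal sketch (hypothetical, not the project's scripts) of the
# exploratory/confirmatory split described above: freeze a random third
# of the data for exploration and hold out the remainder.
import pandas as pd

def split_exploratory_holdout(df: pd.DataFrame, frac: float = 1 / 3, seed: int = 42):
    """Return (exploratory, holdout) partitions with a reproducible split."""
    exploratory = df.sample(frac=frac, random_state=seed)
    holdout = df.drop(exploratory.index)
    return exploratory, holdout

# Analysts develop and register a script on `exploratory`, then run the
# registered script, unchanged, exactly once on `holdout`.
```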
Conclusion
Our results suggest that variation across samples, set-
tings, and procedures has modest explanatory power
for understanding variation in the 28 effects included
in this project. These results do not indicate that mod-
erating influences never occur. Rather, they suggest that
hypothesizing a moderator to account for observed
differences in results between contexts is not equivalent
to testing moderation with new data. The Many Labs
paradigm allows testing across a broad range of con-
texts to probe the variability of psychological effects
across samples. Such an approach is particularly valu-
able for understanding the extent to which given psy-
chological findings represent general features of the
human mind.
Appendix

Table A1. Included Effects, With Citation Counts

Effect | Description of effect and original publication | Citation count (a)

Slate 1
1 | Cardinal direction and socioeconomic status: Huang, Tse, and Cho (2014, Study 1a) | 6
2 | Structure promotes goal pursuit: Kay, Laurin, Fitzsimons, and Landau (2014, Study 2) | 53
3 | Disfluency engages analytic processing: Alter, Oppenheimer, Epley, and Eyre (2007, Study 4) | 743
4 | Moral foundations of liberals versus conservatives: Graham, Haidt, and Nosek (2009, Study 1) | 2,064
5 | Affect and risk: Rottenstreich and Hsee (2001, Study 1) | 756
6 | Consumerism undermines trust: Bauer, Wilkie, Kim, and Bodenhausen (2012, Study 4) | 165
7 | Correspondence bias: Miyamoto and Kitayama (2002, Study 1) | 149
8 | Disgust sensitivity predicts homophobia: Inbar, Pizarro, Knobe, and Bloom (2009, Study 1) | 453
9 | Influence of incidental anchors on judgment: Critcher and Gilovich (2008, Study 2) | 151
10 | Social value orientation and family size: Van Lange, Otten, De Bruin, and Joireman (1997, Study 3) | 1,145
11 | Trolley Dilemma 1: principle of double effect: Hauser, Cushman, Young, Jin, and Mikhail (2007, Scenarios 1 and 2) | 687
12 | Sociometric status and well-being: Anderson, Kraus, Galinsky, and Keltner (2012, Study 3) | 250
13 | False consensus: supermarket scenario: Ross, Greene, and House (1977, Study 1) | 2,965

Slate 2
14 | False consensus: traffic-ticket scenario: Ross et al. (1977, Study 1) | 2,965
15 | Vertical position and power: Giessner and Schubert (2007, Study 1a) | 261
16 | Effect of framing on decision making: Tversky and Kahneman (1981, Study 10) | 17,970
17 | Trolley Dilemma 2: principle of double effect: Hauser et al. (2007, Scenarios 3 and 4) | 687
18 | Reluctance to tempt fate: Risen and Gilovich (2008, Study 2) | 121
19 | Construing actions as choices: Savani, Markus, Naidu, Kumar, and Berlia (2010, Study 5) | 139
20 | Preferences for formal versus intuitive reasoning: Norenzayan, Smith, Kim, and Nisbett (2002, Study 2) | 497
21 | Less-is-better effect: Hsee (1998, Study 1) | 370
22 | Moral typecasting: Gray and Wegner (2009, Study 1a) | 250
23 | Moral violations and desire for cleansing: Zhong and Liljenquist (2006, Study 2) | 1,000
24 | Assimilation and contrast effects in question sequences: Schwarz, Strack, and Mai (1991, Study 1) | 475
25 | Effect of choosing versus rejecting on relative desirability: Shafir (1993, Study 1) | 605
26 | Priming "heat" increases belief in global warming: Zaval, Keenan, Johnson, and Weber (2014, Study 3a) | 133
27 | Perceived intentionality for side effects: Knobe (2003, Study 1) | 847
28 | Directionality and similarity: Tversky and Gati (1978, Study 2) | 695

(a) The citation counts come from Google Scholar on November 6, 2018.
Action Editor
Daniel J. Simons served as action editor for this article.
Author Contributions
F. Hasselman, R. A.