
Estimating the reproducibility of psychological science


Abstract

Empirically analyzing empirical evidence: One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al. describe the replication of 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study. Science, this issue: 10.1126/science.aac4716
RESEARCH ARTICLE SUMMARY
PSYCHOLOGY
Estimating the reproducibility of
psychological science
Open Science Collaboration*
INTRODUCTION: Reproducibility is a defin-
ing feature of science, but the extent to which
it characterizes current research is unknown.
Scientific claims should not gain credence
because of the status or authority of their
originator but by the replicability of their
supporting evidence. Even research of exem-
plary quality may have irreproducible empir-
ical findings because of random or systematic
error.
RATIONALE: There is concern about the rate
and predictors of reproducibility, but limited
evidence. Potentially problematic practices in-
clude selective reporting, selective analysis, and
insufficient specification of the conditions nec-
essary or sufficient to obtain the results. Direct
replication is the attempt to recreate the con-
ditions believed sufficient for obtaining a pre-
viously observed finding and is the means of
establishing reproducibility of a finding with
new data. We conducted a large-scale, collab-
orative effort to obtain an initial estimate of
the reproducibility of psychological science.
RESULTS: We conducted replications of 100
experimental and correlational studies pub-
lished in three psychology journals using high-
powered designs and original materials when
available. There is no single standard for eval-
uating replication success. Here, we evaluated
reproducibility using significance and P values,
effect sizes, subjective assessments of replica-
tion teams, and meta-analysis of effect sizes.
The mean effect size (r) of the replication effects
(Mr = 0.197, SD = 0.257) was half the magnitude of
the mean effect size of the original effects
(Mr = 0.403, SD = 0.188), representing a
substantial decline. Ninety-seven percent of orig-
inal studies had significant results (P < .05).
Thirty-six percent of replications had signifi-
cant results; 47% of origi-
nal effect sizes were in the
95% confidence interval
of the replication effect
size; 39% of effects were
subjectively rated to have
replicated the original re-
sult; and if no bias in original results is as-
sumed, combining original and replication
results left 68% with statistically significant
effects. Correlational tests suggest that repli-
cation success was better predicted by the
strength of original evidence than by charac-
teristics of the original and replication teams.
CONCLUSION: No single indicator sufficient-
ly describes replication success, and the five
indicators examined here are not the only
ways to evaluate reproducibility. Nonetheless,
collectively these results offer a clear conclu-
sion: A large portion of replications produced
weaker evidence for the original findings de-
spite using materials provided by the original
authors, review in advance for methodologi-
cal fidelity, and high statistical power to detect
the original effect sizes. Moreover, correlational
evidence is consistent with the conclusion that
variation in the strength of initial evidence
(such as original P value) was more predictive
of replication success than variation in the
characteristics of the teams conducting the
research (such as experience and expertise).
The latter factors certainly can influence rep-
lication success, but they did not appear to do
so here.
Reproducibility is not well understood be-
cause the incentives for individual scientists
prioritize novelty over replication. Innova-
tion is the engine of discovery and is vital for
a productive, effective scientific enterprise.
However, innovative ideas become old news
fast. Journal reviewers and editors may dis-
miss a new test of a published idea as un-
original. The claim that "we already know this"
belies the uncertainty of scientific evidence.
Innovation points out paths that are possible;
replication points out paths that are likely;
progress relies on both. Replication can in-
crease certainty when findings are reproduced
and promote innovation when they are not.
This project provides accumulating evidence
for many findings in psychological research
and suggests that there is still more work to
do to verify whether we know what we think
we know.
The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: nosek@virginia.edu
Cite this article as Open Science Collaboration, Science 349,
aac4716 (2015). DOI: 10.1126/science.aac4716
[Figure: Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.]
RESEARCH ARTICLE
PSYCHOLOGY
Estimating the reproducibility of
psychological science
Open Science Collaboration*
Reproducibility is a defining feature of science, but the extent to which it characterizes
current research is unknown. We conducted replications of 100 experimental and correlational
studies published in three psychology journals using high-powered designs and original
materials when available. Replication effects were half the magnitude of original effects,
representing a substantial decline. Ninety-seven percent of original studies had statistically
significant results. Thirty-six percent of replications had statistically significant results; 47%
of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of
effects were subjectively rated to have replicated the original result; and if no bias in original
results is assumed, combining original and replication results left 68% with statistically
significant effects. Correlational tests suggest that replication success was better predicted by
the strength of original evidence than by characteristics of the original and replication teams.
Reproducibility is a core principle of scien-
tific progress (1–6). Scientific claims should
not gain credence because of the status or
authority of their originator but by the
replicability of their supporting evidence.
Scientists attempt to transparently describe the
methodology and resulting evidence used to sup-
port their claims. Other scientists agree or dis-
agree whether the evidence supports the claims,
citing theoretical or methodological reasons or
by collecting new evidence. Such debates are
meaningless, however, if the evidence being
debated is not reproducible.
Even research of exemplary quality may have
irreproducible empirical findings because of ran-
dom or systematic error. Direct replication is
the attempt to recreate the conditions believed
sufficient for obtaining a previously observed
finding (7,8)andisthemeansofestablishing
reproducibility of a finding with new data. A
direct replication may not obtain the original
result for a variety of reasons: Known or un-
known differences between the replication and
original study may moderate the size of an ob-
served effect, the original result could have been
a false positive, or the replication could produce
a false negative. False positives and false nega-
tives provide misleading information about effects,
and failure to identify the necessary and suffi-
cient conditions to reproduce a finding indicates
an incomplete theoretical understanding. Direct
replication provides the opportunity to assess
and improve reproducibility.
There is plenty of concern (9–13) about the
rate and predictors of reproducibility but limited
evidence. In a theoretical analysis, Ioannidis es-
timated that publishing and analytic practices
make it likely that more than half of research
results are false and therefore irreproducible (9).
Some empirical evidence supports this analysis.
In cell biology, two industrial laboratories re-
ported success replicating the original results of
landmark findings in only 11 and 25% of the
attempted cases, respectively (10,11). These num-
bers are stunning but also difficult to interpret
because no details are available about the studies,
methodology, or results. With no transparency, the
reasons for low reproducibility cannot be evaluated.
Other investigations point to practices and
incentives that may inflate the likelihood of
obtaining false-positive results in particular or
irreproducible results more generally. Poten-
tially problematic practices include selective
reporting, selective analysis, and insufficient
specification of the conditions necessary or suf-
ficient to obtain the results (12–23). We were in-
spired to address the gap in direct empirical
evidence about reproducibility. In this Research
Article, we report a large-scale, collaborative ef-
fort to obtain an initial estimate of the reproduc-
ibility of psychological science.
Method
Starting in November 2011, we constructed a
protocol for selecting and conducting high-
quality replications (24). Collaborators joined
the project, selected a study for replication from
the available studies in the sampling frame, and
were guided through the replication protocol.
The replication protocol articulated the process
of selecting the study and key effect from the
available articles, contacting the original authors
for study materials, preparing a study protocol
and analysis plan, obtaining review of the pro-
tocol by the original authors and other members
within the present project, registering the pro-
tocol publicly, conducting the replication, writ-
ing the final report, and auditing the process and
analysis for quality control. Project coordinators
facilitated each step of the process and main-
tained the protocol and project resources. Repli-
cation materials and data were required to be
archived publicly in order to maximize transpar-
ency, accountability, and reproducibility of the
project (https://osf.io/ezcuj).
In total, 100 replications were completed by
270 contributing authors. There were many dif-
ferent research designs and analysis strategies
in the original research. Through consultation
with original authors, obtaining original mate-
rials, and internal review, replications maintained
high fidelity to the original designs. Analyses con-
verted results to a common effect size metric [cor-
relation coefficient (r)] with confidence intervals
(CIs). The units of analysis for inferences about
reproducibility were the original and replication
study effect sizes. The resulting open data set
provides an initial estimate of the reproducibility
of psychology and correlational data to support
development of hypotheses about the causes of
reproducibility.
Sampling frame and study selection
We constructed a sampling frame and selection
process to minimize selection biases and maxi-
mize generalizability of the accumulated evi-
dence. Simultaneously, to maintain high quality,
within this sampling frame we matched indi-
vidual replication projects with teams that had
relevant interests and expertise. We pursued a
quasi-random sample by defining the sampling
frame as 2008 articles of three important psy-
chology journals: Psychological Science (PSCI),
Journal of Personality and Social Psychology
(JPSP), and Journal of Experimental Psychol-
ogy: Learning, Memory, and Cognition (JEP:
LMC). The first is a premier outlet for all psy-
chological research; the second and third are
leading disciplinary-specific journals for social
psychology and cognitive psychology, respec-
tively [more information is available in (24)].
These were selected a priori in order to (i) pro-
vide a tractable sampling frame that would not
plausibly bias reproducibility estimates, (ii) en-
able comparisons across journal types and sub-
disciplines, (iii) fit with the range of expertise
available in the initial collaborative team, (iv) be
recent enough to obtain original materials, (v) be
old enough to obtain meaningful indicators of ci-
tation impact, and (vi) represent psychology sub-
disciplines that have a high frequency of studies
that are feasible to conduct at relatively low cost.
The first replication teams could select from a
pool of the first 20 articles from each journal,
starting with the first article published in the
first 2008 issue. Project coordinators facilitated
matching articles with replication teams by in-
terests and expertise until the remaining arti-
cles were difficult to match. If there were still
interested teams, then another 10 articles from
one or more of the three journals were made
available from the sampling frame. Further,
project coordinators actively recruited teams
from the community with relevant experience
for particular articles. This approach balanced
competing goals: minimizing selection bias by
*All authors with their affiliations appear at the end of this paper.
Corresponding author. E-mail: nosek@virginia.edu
having only a small set of articles available at a
time and matching studies with replication
teams' interests, resources, and expertise.
By default, the last experiment reported in
each article was the subject of replication. This
decision established an objective standard for
study selection within an article and was based
on the intuition that the first study in a multiple-
study article (the obvious alternative selection
strategy) was more frequently a preliminary
demonstration. Deviations from selecting the
last experiment were made occasionally on
the basis of feasibility or recommendations of
the original authors. Justifications for deviations
were reported in the replication reports, which
were made available on the Open Science Frame-
work (OSF) (http://osf.io/ezcuj). In total, 84 of
the 100 completed replications (84%) were of
the last reported study in the article. On aver-
age, the to-be-replicated articles contained 2.99
studies (SD = 1.78) with the following distribu-
tion: 24 single study, 24 two studies, 18 three
studies, 13 four studies, 12 five studies, 9 six or
more studies. All following summary statistics
refer to the 100 completed replications.
For the purposes of aggregating results across
studies to estimate reproducibility, a key result
from the selected experiment was identified as
the focus of replication. The key result had to be
represented as a single statistical inference test
or an effect size. In most cases, that test was a
t test, F test, or correlation coefficient. This effect
was identified before data collection or analysis
and was presented to the original authors as part
of the design protocol for critique. Original au-
thors occasionally suggested that a different
effect be used, and by default, replication teams
deferred to original authors' judgments. None-
theless, because the single effect came from a
single study, it is not necessarily the case that
the identified effect was central to the overall
aims of the article. In the individual replication
reports and subjective assessments of replica-
tion outcomes, more than a single result could
be examined, but only the result of the single
effect was considered in the aggregate analyses
[additional details of the general protocol and
individual study methods are provided in the
supplementary materials and (25)].
In total, there were 488 articles in the 2008
issues of the three journals. One hundred fifty-
eight of these (32%) became eligible for selec-
tion for replication during the project period,
between November 2011 and December 2014.
From those, 111 articles (70%) were selected by a
replication team, producing 113 replications. Two
articles had two replications each (supplementary
materials), and 100 of those (88%) replications
were completed by the project deadline for in-
clusion in this aggregate report. After being
claimed, some studies were not completed be-
cause the replication teams ran out of time or
could not devote sufficient resources to com-
pleting the study. By journal, replications were
completed for 39 of 64 (61%) articles from
PSCI, 31 of 55 (56%) articles from JPSP, and 28
of 39 (72%) articles from JEP:LMC.
The most common reasons for failure to match
an article with a team were feasibility constraints
for conducting the research. Of the 47 articles
from the eligible pool that were not claimed, six
(13%) had been deemed infeasible to replicate
because of time, resources, instrumentation, de-
pendence on historical events, or hard-to-access
samples. The remaining 41 (87%) were eligible
but not claimed. These often required specialized
samples (such as macaques or people with autism),
resources (such as eye tracking machines or func-
tional magnetic resonance imaging), or knowl-
edge making them difficult to match with teams.
Aggregate data preparation
Each replication team conducted the study, an-
alyzed their data, wrote their summary report,
and completed a checklist of requirements for
sharing the materials and data. Then, indepen-
dent reviewers and analysts conducted a project-
wide audit of all individual projects, materials,
data, and reports. A description of this review is
available on the OSF (https://osf.io/xtine). More-
over, to maximize reproducibility and accuracy,
the analyses for every replication study were re-
produced by another analyst independent of
the replication team using the R statistical pro-
gramming language and a standardized analytic
format. A controller R script was created to re-
generate the entire analysis of every study and
recreate the master data file. This R script, avail-
able at https://osf.io/fkmwg, can be executed to
reproduce the results of the individual studies. A
comprehensive description of this reanalysis pro-
cess is available publicly (https://osf.io/a2eyg).
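As an illustration only, the controller-script pattern described above can be sketched in a few lines of R. The directory layout, file names, and the assumption that each study script returns a summary row are invented for this sketch; the authoritative version is the script archived at the OSF link above.

    # Illustrative controller pattern (not the project's actual script):
    # run each study's analysis script in its own environment and stack the
    # summary rows it returns into a master data file.
    scripts <- list.files("replications", pattern = "\\.R$", full.names = TRUE)
    rows    <- lapply(scripts, function(f) source(f, local = new.env())$value)
    master  <- do.call(rbind, rows)
    write.csv(master, "master_data.csv", row.names = FALSE)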
Measures and moderators
We assessed features of the original study and
replication as possible correlates of reproduc-
ibility and conducted exploratory analyses to
inspire further investigation. These included
characteristics of the original study such as the
publishing journal; original effect size, P value,
and sample size; experience and expertise of the
original research team; importance of the effect,
with indicators such as the citation impact of
the article; and rated surprisingness of the ef-
fect. We also assessed characteristics of the rep-
lication such as statistical power and sample size,
experience and expertise of the replication team,
independently assessed challenge of conducting
an effective replication, and self-assessed quality
of the replication effort. Variables such as the P
value indicate the statistical strength of evidence
given the null hypothesis, and variables such as
effect "surprisingness" and expertise of the team
indicate qualities of the topic of study and the
teams studying it, respectively. The master data file,
containing these and other variables, is available
for exploratory analysis (https://osf.io/5wup8).
It is possible to derive a variety of hypotheses
about predictors of reproducibility. To reduce the
likelihood of false positives due to many tests, we
aggregated some variables into summary indica-
tors: experience and expertise of original team,
experience and expertise of replication team, chal-
lenge of replication, self-assessed quality of repli-
cation, and importance of the effect. We had no a
priori justification to give some indicators stron-
ger weighting over others, so aggregates were
created by standardizing [mean (M) = 0, SD = 1]
the individual variables and then averaging to
create a single index. In addition to the publish-
ing journal and subdiscipline, potential moder-
ators included six characteristics of the original
study and five characteristics of the replication
(supplementary materials).
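For instance, the aggregation into a single index can be written as below; "master" and the component column names are hypothetical placeholders, not the actual variable names in the data file.

    # Build a summary indicator: standardize each component variable
    # (M = 0, SD = 1) and average the standardized scores.
    components <- c("orig_team_publications", "orig_team_citations", "orig_team_seniority")
    z          <- scale(master[, components])       # column-wise standardization
    master$original_team_expertise <- rowMeans(z, na.rm = TRUE)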
Publishing journal and subdiscipline
Journals' different publishing practices may re-
sult in a selection bias that covaries with repro-
ducibility. Articles from three journals were made
available for selection: JPSP (n = 59 articles), JEP:
LMC (n = 40 articles), and PSCI (n = 68 articles).
From this pool of available studies, replications
were selected and completed from JPSP (n = 32
studies), JEP:LMC (n = 28 studies), and PSCI
(n = 40 studies) and were coded as representing
cognitive (n = 43 studies) or social-personality
(n = 57 studies) subdisciplines. Four studies that
would ordinarily be understood as developmental
psychology because of studying children or infants
were coded as having a cognitive or social em-
phasis. Reproducibility may vary by subdiscipline
in psychology because of differing practices. For
example, within-subjects designs are more com-
mon in cognitive than social psychology, and
these designs often have greater power to detect
effects with the same number of participants.
Statistical analyses
There is no single standard for evaluating rep-
lication success (25). We evaluated reproducibil-
ity using significance and P values, effect sizes,
subjective assessments of replication teams, and
meta-analyses of effect sizes. All five of these
indicators contribute information about the re-
lations between the replication and original find-
ing and the cumulative evidence about the effect
and were positively correlated with one another
(r ranged from 0.22 to 0.96, median r = 0.57).
Results are summarized in Table 1, and full details
of analyses are in the supplementary materials.
Significance and P values
Assuming a two-tailed test and significance or
α level of 0.05, all test results of original and
replication studies were classified as statistically
significant (P ≤ 0.05) and nonsignificant
(P > 0.05). However, original studies that inter-
preted nonsignificant P values as significant
were coded as significant (four cases, all with
P values < 0.06). Using only the nonsignificant
P values of the replication studies and applying
Fisher's method (26), we tested the hypothesis
that these studies had "no evidential value" (the
null hypothesis of zero-effect holds for all these
studies). We tested the hypothesis that the pro-
portions of statistically significant results in the
original and replication studies are equal using
the McNemar test for paired nominal data and
calculated a CI of the reproducibility parame-
ter. Second, we compared the central tendency
of the distribution of P values of original and
replication studies using the Wilcoxon signed-
rank test and the t test for dependent samples.
For both tests, we only used study-pairs for which
both P values were available.
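Both tests are standard and easy to reproduce in R. The sketch below is illustrative only, not the project's analysis code; the input vectors are hypothetical placeholders.

    # Fisher's method: combine the k nonsignificant replication P values to test
    # the joint null that all of those studies have zero true effect.
    fisher_method <- function(p) {
      stat <- -2 * sum(log(p))                       # ~ chi-squared with 2k df under the null
      df   <- 2 * length(p)
      c(chisq = stat, df = df, p = pchisq(stat, df, lower.tail = FALSE))
    }

    # McNemar test on the paired significant/nonsignificant outcomes
    # (orig_sig and rep_sig are logical vectors of the same length).
    mcnemar.test(table(orig_sig, rep_sig))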
Effect sizes
We transformed effect sizes into correlation co-
efficients whenever possible. Correlation coeffi-
cients have several advantages over other effect
size measures, such as Cohen's d. Correlation
coefficients are bounded, well known, and there-
fore more readily interpretable. Most impor-
tant for our purposes, analysis of correlation
coefficients is straightforward because, after ap-
plying the Fisher transformation, their standard
error is only a function of sample size. Formulas
and code for converting test statistics z, F, t, and
χ² into correlation coefficients are provided in
the appendices at http://osf.io/ezum7. To be able
to compare and analyze correlations across study-
pairs, the original study's effect size was coded
as positive; the replication study's effect size was
coded as negative if the replication study's effect
was opposite to that of the original study.
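As an illustration of this property, the sketch below converts a t statistic to r and builds a confidence interval on the Fisher z scale; the numbers are invented, and the project's own conversion code is the version linked above.

    # Convert t(df) to r, apply the Fisher transformation, and form a 95% CI.
    t_to_r <- function(t, df) sqrt(t^2 / (t^2 + df))
    r  <- t_to_r(2.5, 28)                    # hypothetical t(28) = 2.5
    z  <- atanh(r)                           # Fisher transformation
    se <- 1 / sqrt(30 - 3)                   # SE of z depends only on N (here N = 30)
    tanh(z + c(-1, 1) * qnorm(0.975) * se)   # CI limits back-transformed to the r scale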
We compared effect sizes using four tests. We
compared the central tendency of the effect size
distributions of original and replication studies
using both a paired two-sample t test and the
Wilcoxon signed-rank test. Third, we computed
the proportion of study-pairs in which the effect
of the original study was stronger than in the
replication study and tested the hypothesis that
this proportion is 0.5. For this test, we included
findings for which effect size measures were
available but no correlation coefficient could be
computed (for example, if a regression coefficient
was reported but not its test statistic). Fourth,
we calculated "coverage," or the proportion of
study-pairs in which the effect of the original
study was in the CI of the effect of the replication
study, and compared this with the expected pro-
portion using a goodness-of-fit χ² test. We carried
out this test on the subset of study pairs in which
both the correlation coefficient and its standard
error could be computed [we refer to this data set
as the meta-analytic (MA) subset]. Standard errors
could only be computed if test statistics were r, t,
or F(1, df2). The expected proportion is the sum
over expected probabilities across study-pairs. The
test assumes the same population effect size for
original and replication study in the same study-pair.
For those studies that tested the effect with
F(df1 > 1, df2) or χ², we verified coverage using
other statistical procedures (computational details
are provided in the supplementary materials).

Table 1. Summary of reproducibility rates and effect sizes for original and replication studies overall and by journal/discipline. df/N refers to the information on which the test of the effect was based (for example, df of t test, denominator df of F test, sample size - 3 of correlation, and sample size for z and χ²). Four original results had P values slightly higher than 0.05 but were considered positive results in the original article and are treated that way here. Exclusions (explanation provided in supplementary materials, A3) are "replications P < 0.05" (3 original nulls excluded; n = 97 studies); "mean original and replication effect sizes" (3 excluded; n = 97 studies); "meta-analytic mean estimates" (27 excluded; n = 73 studies); "percent meta-analytic (P < 0.05)" (25 excluded; n = 75 studies); and "percent original effect size within replication 95% CI" (5 excluded; n = 95 studies). Columns: replications P < 0.05 in original direction (count, percent); mean (SD) original effect size; median original df/N; mean (SD) replication effect size; median replication df/N; average replication power; meta-analytic mean (SD) estimate; percent meta-analytic (P < 0.05); percent original effect size within replication 95% CI; percent subjective "yes" to "Did it replicate?".

Overall              35/97 (36%)   0.403 (0.188)   54     0.197 (0.257)    68   0.92   0.309 (0.223)   68   47   39
JPSP, social          7/31 (23%)   0.29 (0.10)     73     0.07 (0.11)     120   0.91   0.138 (0.087)   43   34   25
JEP:LMC, cognitive   13/27 (48%)   0.47 (0.18)     36.5   0.27 (0.24)      43   0.93   0.393 (0.209)   86   62   54
PSCI, social          7/24 (29%)   0.39 (0.20)     76     0.21 (0.30)     122   0.92   0.286 (0.228)   58   40   32
PSCI, cognitive       8/15 (53%)   0.53 (0.20)     23     0.29 (0.35)      21   0.94   0.464 (0.221)   92   60   53

Table 2. Spearman's rank-order correlations of reproducibility indicators with summary original and replication study characteristics. Effect size difference computed after converting r to Fisher's z. df/N refers to the information on which the test of the effect was based (for example, df of t test, denominator df of F test, sample size - 3 of correlation, and sample size for z and χ²). Four original results had P values slightly higher than 0.05 but were considered positive results in the original article and are treated that way here. Exclusions (explanation provided in supplementary materials, A3) are "replications P < .05" (3 original nulls excluded; n = 97 studies); "effect size difference" (3 excluded; n = 97 studies); "meta-analytic mean estimates" (27 excluded; n = 73 studies); and "percent original effect size within replication 95% CI" (5 excluded; n = 95 studies). Columns: replications P < 0.05 in original direction; effect size difference; meta-analytic estimate; original effect size within replication 95% CI; subjective "yes" to "Did it replicate?".

Original study characteristics
  Original P value                               0.327   0.057   0.468   0.032   0.260
  Original effect size                           0.304   0.279   0.793   0.121   0.277
  Original df/N                                  0.150   0.194   0.502   0.221   0.185
  Importance of original result                  0.105   0.038   0.205   0.133   0.074
  Surprising original result                     0.244   0.102   0.181   0.113   0.241
  Experience and expertise of original team      0.072   0.033   0.059   0.103   0.044
Replication characteristics
  Replication P value                            0.828   0.621   0.614   0.562   0.738
  Replication effect size                        0.731   0.586   0.850   0.611   0.710
  Replication power                              0.368   0.053   0.142   0.056   0.285
  Replication df/N                               0.085   0.224   0.692   0.257   0.164
  Challenge of conducting replication            0.219   0.085   0.301   0.109   0.151
  Experience and expertise of replication team   0.096   0.133   0.017   0.053   0.068
  Self-assessed quality of replication           0.069   0.017   0.054   0.088   0.055
Meta-analysis combining original and
replication effects
We conducted fixed-effect meta-analyses using
the R package metafor (27) on Fisher-transformed
correlations for all study-pairs in subset MA and
on study-pairs with the odds ratio as the depen-
dent variable. The number of times the CI of all
these meta-analyses contained 0 was calculated.
For studies in the MA subset, estimated effect
sizes were averaged and analyzed by discipline.
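A minimal sketch of this step with metafor is shown below for a single original/replication pair; the correlations and sample sizes are hypothetical, and the archived analysis scripts are the authoritative implementation.

    library(metafor)
    # Fixed-effect meta-analysis on Fisher-transformed correlations.
    dat <- escalc(measure = "ZCOR", ri = c(0.40, 0.20), ni = c(100, 150))
    fit <- rma(yi, vi, data = dat, method = "FE")
    predict(fit, transf = transf.ztor)   # combined estimate and CI on the r scale;
                                         # the question is whether the CI includes 0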
Subjective assessment of "Did it replicate?"
In addition to the quantitative assessments of
replication and effect estimation, we collected
subjective assessments of whether the replica-
tion provided evidence of replicating the origi-
nal result. In some cases, the quantitative data
anticipate a straightforward subjective assess-
ment of replication. For more complex designs,
such as multivariate interaction effects, the quan-
titative analysis may not provide a simple inter-
pretation. For subjective assessment, replication
teams answered "yes" or "no" to the question,
"Did your results replicate the original effect?"
Additional subjective variables are available for
analysis in the full data set.
Analysis of moderators
We correlated the five indicators evaluating
reproducibility with six indicators of the origi-
nal study (original P value, original effect size,
original sample size, importance of the effect,
surprising effect, and experience and expertise
of original team) and seven indicators of the
replication study (replication P value, replica-
tion effect size, replication power based on orig-
inal effect size, replication sample size, challenge
of conducting replication, experience and exper-
tise of replication team, and self-assessed qual-
ity of replication) (Table 2). As follow-up, we did
the same with the individual indicators compris-
ing the moderator variables (tables S3 and S4).
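Each cell of Table 2 is a rank-order correlation of this kind. A sketch with hypothetical column names standing in for the master data file:

    # Spearman correlation between one reproducibility indicator and one
    # original-study characteristic ("master" and the column names are placeholders).
    cor.test(as.numeric(master$replication_significant), master$original_p_value,
             method = "spearman", exact = FALSE)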
Results
Evaluating replication effect against null
hypothesis of no effect
A straightforward method for evaluating repli-
cation is to test whether the replication shows a
statistically significant effect (P < 0.05) with the
same direction as the original study. This di-
chotomous vote-counting method is intuitively
appealing and consistent with common heu-
ristics used to decide whether original studies
"worked." Ninety-seven of 100 (97%) effects
from original studies were positive results (four
had P values falling a bit short of the 0.05
criterion, P = 0.0508, 0.0514, 0.0516, and 0.0567,
but all of these were interpreted as positive
effects). On the basis of only the average rep-
lication power of the 97 original, significant ef-
fects [M = 0.92, median (Mdn) = 0.95], we would
expect approximately 89 positive results in the
replications if all original effects were true and
accurately estimated; however, there were just
35 [36.1%; 95% CI = (26.6%, 46.2%)], a significant
reduction [McNemar test, χ²(1) = 59.1, P < 0.001].
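The expected count quoted above follows directly from the reported power estimates; a small sketch of that arithmetic, using only the summary numbers given here, is:

    # With 97 true effects and mean replication power of 0.92, about 89
    # significant replications would be expected; 35 were observed.
    k     <- 97
    power <- 0.92
    power * k          # expected number of positive results (~89)
    prop.test(35, k)   # one way to put a 95% CI on the observed rate of 35/97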
A key weakness of this method is that it treats
the 0.05 threshold as a bright-line criterion be-
tween replication success and failure (28). It
could be that many of the replications fell just
short of the 0.05 criterion. The density plots of
P values for original studies (mean P value =
0.028) and replications (mean P value = 0.302)
are shown in Fig. 1, left. The 64 nonsignificant
P values for replications were distributed widely.
When there is no effect to detect, the null distribu-
tion of P values is uniform. This distribution de-
viated slightly from uniform with positive skew,
however, suggesting that at least one replication
could be a false negative, χ²(128) = 155.83, P =
0.048. Nonetheless, the wide distribution of P
values suggests against insufficient power as the
only explanation for failures to replicate. A scat-
terplot of original compared with replication
study P values is shown in Fig. 2.
Evaluating replication effect against
original effect size
A complementary method for evaluating repli-
cation is to test whether the original effect size is
within the 95% CI of the effect size estimate from
the replication. For the subset of 73 studies in
which the standard error of the correlation could
be computed, 30 (41.1%) of the replication CIs
contained the original effect size (significantly
lower than the expected value of 78.5%, P<
0.001) (supplementary materials). For 22 studies
using other test statistics [F(df1 > 1, df2) and χ²],
68.2% of CIs contained the effect size of the
original study. Overall, this analysis suggests a
47.4% replication success rate.
This method addresses the weakness of the
first test that a replication in the same direction
and a P value of 0.06 may not be significantly
different from the original result. However, the
method will also indicate that a replication fails
when the direction of the effect is the same but
the replication effect size is significantly smaller
than the original effect size (29). Also, the repli-
cation "succeeds" when the result is near zero but
not estimated with sufficiently high precision to
be distinguished from the original effect size.
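As an illustration of this criterion, the check below computes a replication 95% CI on the Fisher z scale and asks whether a hypothetical original correlation falls inside it; all numbers are invented for the example.

    # Does the original effect size fall inside the replication's 95% CI?
    r_orig <- 0.45                      # hypothetical original correlation
    r_rep  <- 0.18; n_rep <- 120        # hypothetical replication result
    ci <- tanh(atanh(r_rep) + c(-1, 1) * qnorm(0.975) / sqrt(n_rep - 3))
    r_orig >= ci[1] & r_orig <= ci[2]   # TRUE counts as success on this criterion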
Comparing original and replication
effect sizes
Comparing the magnitude of the original and
replication effect sizes avoids special emphasis
on P values.

[Fig. 1. Density plots of original and replication P values and effect sizes. (A) P values. (B) Effect sizes (correlation coefficients). Lowest quantiles for P values are not visible because they are clustered near zero.]

Overall, original study effect sizes
(M = 0.403, SD = 0.188) were reliably larger than
replication effect sizes (M = 0.197, SD = 0.257),
Wilcoxon's W = 7137, P < 0.001. Of the 99 studies
for which an effect size in both the original and
replication study could be calculated (30), 82
showed a stronger effect size in the original
study (82.8%; P < 0.001, binomial test) (Fig. 1,
right). Original and replication effect sizes were
positively correlated (Spearman's r = 0.51, P <
0.001). A scatterplot of the original and replica-
tion effect sizes is presented in Fig. 3.
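The three comparisons reported in this paragraph correspond to standard R tests; the sketch below assumes paired vectors of original and replication correlations with hypothetical names.

    # r_original and r_replication are paired, study-level effect size vectors.
    wilcox.test(r_original, r_replication, paired = TRUE)             # central tendency
    binom.test(sum(r_original > r_replication), length(r_original))   # direction of difference
    cor.test(r_original, r_replication, method = "spearman")          # association between pairs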
Combining original and replication effect
sizes for cumulative evidence
The disadvantage of the descriptive comparison
of effect sizes is that it does not provide information
about the precision of either estimate or resolution
of the cumulative evidence for the effect. This is
often addressed by computing a meta-analytic es-
timate of the effect sizes by combining the original
and replication studies (28). This approach weights
each study by the inverse of its variance and uses
these weighted estimates of effect size to estimate
cumulative evidence and precision of the effect.
Using a fixed-effect model, 51 of the 75 (68%) ef-
fects for which a meta-analytic estimate could be
computed had 95% CIs that did not include 0.
One qualification about this result is the pos-
sibility that the original studies have inflated
effect sizes due to publication, selection, report-
ing, or other biases (9, 12–23). In a discipline with
low-powered research designs and an emphasis
on positive results for publication, effect sizes
will be systematically overestimated in the pub-
lished literature. There is no publication bias in
the replication studies because all results are
reported. Also, there are no selection or reporting
biases because all were confirmatory tests based
on pre-analysis plans. This maximizes the inter-
pretability of the replication Pvalues and effect
estimates. If publication, selection, and reporting
biases completely explain the effect differences,
then the replication estimates would be a better
estimate of the effect size than would the meta-
analytic and original results. However, to the
extent that there are other influences, such as
moderation by sample, setting, or quality of rep-
lication, the relative bias influencing original and
replication effect size estimation is unknown.
Subjective assessment of
"Did it replicate?"
In addition to the quantitative assessments of rep-
lication and effect estimation, replication teams
provided a subjective assessment of replication
success of the study they conducted. Subjective
assessments of replication success were very sim-
ilar to significance testing results (39 of 100 suc-
cessful replications), including evaluating "success"
for two null replications when the original study
reported a null result and "failure" for a P < 0.05
replication when the original result was a null.
Correlates of reproducibility
The overall replication evidence is summarized in
Table 1 across the criteria described above and
then separately by journal/discipline. Considering
significance testing, reproducibility was stronger
in studies and journals representing cognitive psy-
chology than social psychology topics. For exam-
ple, combining across journals, 14 of 55 (25%) of
social psychology effects replicated by the P<
0.05 criterion, whereas 21 of 42 (50%) of cogni-
tive psychology effects did so. Simultaneously, all
journals and disciplines showed substantial and
similar [χ²(3) = 2.45, P = 0.48] declines in effect
size in the replications compared with the original
studies. The difference in significance testing re-
sults between fields appears to be partly a function
of weaker original effects in social psychology
studies, particularly in JPSP, and perhaps of the
greater frequency of high-powered within-subjects
manipulations and repeated measurement designs
in cognitive psychology as suggested by high power
despite relatively small participant samples. Fur-
ther, the type of test was associated with repli-
cation success. Among original, significant effects,
23 of the 49 (47%) that tested main or simple
effects replicated at P < 0.05, but just 8 of the 37
(22%) that tested interaction effects did.
Correlations between reproducibility indica-
tors and characteristics of replication and orig-
inal studies are provided in Table 2. A negative
correlation of replication success with the orig-
inal study P value indicates that the initial
strength of evidence is predictive of reproduc-
ibility. For example, 26 of 63 (41%) original studies
with P < 0.02 achieved P < 0.05 in the replication,
whereas 6 of 23 (26%) that had a P value be-
tween 0.02 < P < 0.04 and 2 of 11 (18%) that had
a P value > 0.04 did so (Fig. 2). Almost two thirds
(20 of 32, 63%) of original studies with P < 0.001
had a significant P value in the replication.
Larger original effect sizes were associated
with greater likelihood of achieving P < 0.05 (r =
0.304) and a greater effect size difference between
original and replication (r = 0.279). Moreover,
replication power was related to replication suc-
cess via significance testing (r = 0.368) but not
with the effect size difference between original
and replication (r = 0.053). Comparing across
indicators, surprisingness of the original effect
and the challenge of conducting the replication
were related to replication success for some
indicators. Surprising effects were
less reproducible, as were effects for which it was
more challenging to conduct the replication.
Last, there was little evidence that perceived
importance of the effect, expertise of the original
or replication teams, or self-assessed quality of
the replication accounted for meaningful variation
in reproducibility across indicators.

[Fig. 2. Scatterplots of original study and replication P values for three psychology journals. Data points scaled by power of the replication based on original study effect size. Dotted red lines indicate the P = 0.05 criterion. Subplot below shows P values from the range between the gray lines (P = 0 to 0.005) in the main plot above.]

Replication
success was more consistently related to the
original strength of evidence (such as original
P value, effect size, and effect tested) than to
characteristics of the teams and implementation
of the replication (such as expertise, quality, or
challenge of conducting study) (tables S3 and S4).
Discussion
No single indicator sufficiently describes repli-
cation success, and the five indicators examined
here are not the only ways to evaluate reproduc-
ibility. Nonetheless, collectively, these results
offer a clear conclusion: A large portion of repli-
cations produced weaker evidence for the
original findings (31) despite using materials
provided by the original authors, review in
advance for methodological fidelity, and high
statistical power to detect the original effect sizes.
Moreover, correlational evidence is consistent
with the conclusion that variation in the strength
of initial evidence (such as original P value) was
more predictive of replication success than was
variation in the characteristics of the teams con-
ducting the research (such as experience and
expertise). The latter factors certainly can influ-
ence replication success, but the evidence is that
they did not systematically do so here. Other
investigators may develop alternative indicators
to explore further the role of expertise and qual-
ity in reproducibility on this open data set.
Insights on reproducibility
It is too easy to conclude that successful replica-
tion means that the theoretical understanding of
the original finding is correct. Direct replication
mainly provides evidence for the reliability of a
result. If there are alternative explanations for
the original finding, those alternatives could like-
wise account for the replication. Understanding
is achieved through multiple, diverse investiga-
tions that provide converging support for a the-
oretical interpretation and rule out alternative
explanations.
It is also too easy to conclude that a failure to
replicate a result means that the original evi-
dence was a false positive. Replications can fail
if the replication methodology differs from the
original in ways that interfere with observing
the effect. We conducted replications designed
to minimize a priori reasons to expect a dif-
ferent result by using original materials, en-
gaging original authors for review of the designs,
and conducting internal reviews. Nonetheless,
unanticipated factors in the sample, setting, or
procedure could still have altered the observed
effect magnitudes (32).
More generally, there are indications of cul-
tural practices in scientific communication that
may be responsible for the observed results. Low-
power research designs combined with publication
bias favoring positive results together produce
a literature with upwardly biased effect sizes
(14, 16, 33, 34). This anticipates that replication
effect sizes would be smaller than original studies
on a routine basis, not because of differences
in implementation but because the original study
effect sizes are affected by publication and report-
ing bias, and the replications are not. Consistent
with this expectation, most replication effects were
smaller than original results, and reproducibility
success was correlated with indicators of the
strength of initial evidence, such as lower original
P values and larger effect sizes. This suggests pub-
lication, selection, and reporting biases as plausible
explanations for the difference between original
and replication effects. The replication studies sig-
nificantly reduced these biases because replication
preregistration and pre-analysis plans ensured
confirmatory tests and reporting of all results.
The observed variation in replication and
original results may reduce certainty about the
statistical inferences from the original studies
but also provides an opportunity for theoretical
innovation to explain differing outcomes, and
then new research to test those hypothesized
explanations. The correlational evidence, for ex-
ample, suggests that procedures that are more
challenging to execute may result in less re-
producible results, and that more surprising
original effects may be less reproducible than
less surprising original effects. Further, system-
atic, repeated replication efforts that fail to iden-
tify conditions under which the original finding
can be observed reliably may reduce confidence
in the original finding.
Implications and limitations
The present study provides the first open, system-
atic evidence of reproducibility from a sample of
studies in psychology. We sought to maximize
generalizability of the results with a structured
process for selecting studies for replication. How-
ever, the extent to which these findings extend
to the rest of psychology or other disciplines is
unknown. In the sampling frame itself, not all
articles were replicated; in each article, only one
study was replicated; and in each study, only one
statistical result was subject to replication. More
resource-intensive studies were less likely to be
included than were less resource-intensive studies.
Although study selection bias was reduced by
the sampling frame and selection strategy, the
impact of selection bias is unknown.
We investigated the reproducibility rate of psy-
chology not because there is something special
about psychology, but because it is our discipline.
Concerns about reproducibility are widespread
across disciplines (9–21). Reproducibility is not well
understood because the incentives for individual
scientists prioritize novelty over replication (20).
If nothing else, this project demonstrates that it
is possible to conduct a large-scale examination
of reproducibility despite the incentive barriers.
Here, we conducted single-replication attempts of
many effects, obtaining broad-and-shallow evidence.
These data provide information about reprodu-
cibility in general but little precision about in-
dividual effects in particular. A complementary
narrow-and-deep approach is characterized by
the Many Labs replication projects (32). In those,
many replications of single effects allow precise
estimates of effect size but result in general-
izability that is circumscribed to those individual
effects. Pursuing both strategies across disciplines,
such as the ongoing effort in cancer biology (35),
would yield insight about common and distinct
challenges and may cross-fertilize strategies so as
to improve reproducibility.

[Fig. 3. Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.]
Because reproducibility is a hallmark of
credible scientific evidence, it is tempting to
think that maximum reproducibility of origi-
nal results is important from the onset of a line
of inquiry through its maturation. This is a
mistake. If initial ideas were always correct,
then there would hardly be a reason to conduct
research in the first place. A healthy discipline
will have many false starts as it confronts the
limits of present understanding.
Innovation is the engine of discovery and is
vital for a productive, effective scientific enter-
prise. However, innovative ideas become old
news fast. Journal reviewers and editors may
dismiss a new test of a published idea as un-
original. The claim that "we already know this"
belies the uncertainty of scientific evidence. De-
ciding the ideal balance of resourcing innova-
tion versus verification is a question of research
efficiency. How can we maximize the rate of
research progress? Innovation points out paths
that are possible; replication points out paths
that are likely; progress relies on both. The ideal
balance is a topic for investigation itself. Scientific
incentives (funding, publication, or awards) can
be tuned to encourage an optimal balance in the
collective effort of discovery (36,37).
Progress occurs when existing expectations
are violated and a surprising result spurs a new
investigation. Replication can increase certainty
when findings are reproduced and promote
innovation when they are not. This project pro-
vides accumulating evidence for many findings
in psychological research and suggests that there
is still more work to do to verify whether we
know what we think we know.
Conclusion
After this intensive effort to reproduce a sample
of published psychological findings, how many
of the effects have we established are true? Zero.
And how many of the effects have we established
are false? Zero. Is this a limitation of the project
design? No. It is the reality of doing science, even
if it is not appreciated in daily practice. Humans
desire certainty, and science infrequently provides
it. As much as we might wish it to be otherwise, a
single study almost never provides definitive reso-
lution for or against an effect and its explanation.
The original studies examined here offered tenta-
tive evidence; the replications we conducted offered
additional, confirmatory evidence. In some cases,
the replications increase confidence in the reli-
ability of the original results; in other cases, the
replications suggest that more investigation is
needed to establish the validity of the original
findings. Scientific progress is a cumulative pro-
cess of uncertainty reduction that can only suc-
ceed if science itself remains the greatest skeptic
of its explanatory claims.
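To make the idea of cumulative uncertainty reduction concrete, consider combining an original correlation with its replication in a simple fixed-effect meta-analysis of Fisher-transformed correlations. The sketch below is illustrative only; the project's reported meta-analytic estimates were computed with the R metafor package (27), and the correlations and sample sizes here are invented.

```python
# Illustrative sketch only: a fixed-effect combination of an original and a
# replication correlation via the Fisher z transformation. The correlations
# and sample sizes below are hypothetical.
import math

def combine_correlations(studies):
    """studies: list of (r, n) pairs; returns the pooled r and its 95% CI."""
    weights = [n - 3 for _, n in studies]        # weight = inverse variance of Fisher z = n - 3
    zs = [math.atanh(r) for r, _ in studies]     # Fisher z transform of each correlation
    z_pooled = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = 1 / math.sqrt(sum(weights))             # standard error of the pooled z
    lo, hi = math.tanh(z_pooled - 1.96 * se), math.tanh(z_pooled + 1.96 * se)
    return math.tanh(z_pooled), (lo, hi)

# Hypothetical example: original r = .40 with n = 50; replication r = .20 with n = 120.
r_pooled, (lo, hi) = combine_correlations([(0.40, 50), (0.20, 120)])
print(f"pooled r = {r_pooled:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because the replication is weighted by its (typically larger) sample size, the pooled estimate is pulled toward the replication effect, and the confidence interval narrows relative to either study alone.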
The present results suggest that there is room
to improve reproducibility in psychology. Any
temptation to interpret these results as a defeat
for psychology, or science more generally, must
contend with the fact that this project demon-
strates science behaving as it should. Hypothe-
ses abound that the present culture in science
may be negatively affecting the reproducibility
of findings. An ideological response would dis-
count the arguments, discredit the sources, and
proceed merrily along. The scientific process is
not ideological. Science does not always provide
comfort for what we wish to be; it confronts us
with what is. Moreover, as illustrated by the Trans-
parency and Openness Promotion (TOP) Guidelines
(http://cos.io/top) (37), the research community
is taking action already to improve the quality and
credibility of the scientific literature.
We conducted this project because we care
deeply about the health of our discipline and
believe in its promise for accumulating knowl-
edge about human behavior that can advance
the quality of the human condition. Reproduc-
ibility is central to that aim. Accumulating evi-
dence is the scientific community's method of
self-correction and is the best available option
for achieving that ultimate goal: truth.
REFERENCES AND NOTES
1. C. Hempel, Maximal specificity and lawlikeness in probabilistic explanation. Philos. Sci. 35, 116–133 (1968). doi: 10.1086/288197
2. C. Hempel, P. Oppenheim, Studies in the logic of explanation. Philos. Sci. 15, 135–175 (1948). doi: 10.1086/286983
3. I. Lakatos, in Criticism and the Growth of Knowledge, I. Lakatos, A. Musgrave, Eds. (Cambridge Univ. Press, London, 1970), pp. 170–196.
4. P. E. Meehl, Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychol. Inq. 1, 108–141 (1990). doi: 10.1207/s15327965pli0102_1
5. J. R. Platt, Strong inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146, 347–353 (1964). doi: 10.1126/science.146.3642.347; pmid: 17739513
6. W. C. Salmon, in Introduction to the Philosophy of Science, M. H. Salmon, Ed. (Hackett Publishing Company, Indianapolis, 1999), pp. 7–41.
7. B. A. Nosek, D. Lakens, Registered reports: A method to increase the credibility of published results. Soc. Psychol. 45, 137–141 (2014). doi: 10.1027/1864-9335/a000192
8. S. Schmidt, Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev. Gen. Psychol. 13, 90–100 (2009). doi: 10.1037/a0015108
9. J. P. A. Ioannidis, Why most published research findings are false. PLOS Med. 2, e124 (2005). doi: 10.1371/journal.pmed.0020124; pmid: 16060722
10. C. G. Begley, L. M. Ellis, Drug development: Raise standards for preclinical cancer research. Nature 483, 531–533 (2012). doi: 10.1038/483531a; pmid: 22460880
11. F. Prinz, T. Schlange, K. Asadullah, Believe it or not: How much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712–713 (2011). doi: 10.1038/nrd3439-c1; pmid: 21892149
12. M. McNutt, Reproducibility. Science 343, 229 (2014). doi: 10.1126/science.1250475; pmid: 24436391
13. H. Pashler, E.-J. Wagenmakers, Editors' introduction to the special section on replicability in psychological science: A crisis of confidence? Perspect. Psychol. Sci. 7, 528–530 (2012). doi: 10.1177/1745691612465253; pmid: 26168108
14. K. S. Button et al., Power failure: Why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013). doi: 10.1038/nrn3475; pmid: 23571845
15. D. Fanelli, "Positive" results increase down the hierarchy of the sciences. PLOS ONE 5, e10068 (2010). doi: 10.1371/journal.pone.0010068; pmid: 20383332
16. A. G. Greenwald, Consequences of prejudice against the null hypothesis. Psychol. Bull. 82, 1–20 (1975). doi: 10.1037/h0076157
17. G. S. Howard et al., Do research literatures give correct answers? Rev. Gen. Psychol. 13, 116–121 (2009). doi: 10.1037/a0015468
18. J. P. A. Ioannidis, M. R. Munafò, P. Fusar-Poli, B. A. Nosek, S. P. David, Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends Cogn. Sci. 18, 235–241 (2014). doi: 10.1016/j.tics.2014.02.010; pmid: 24656991
19. L. K. John, G. Loewenstein, D. Prelec, Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532 (2012). doi: 10.1177/0956797611430953; pmid: 22508865
20. B. A. Nosek, J. R. Spies, M. Motyl, Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7, 615–631 (2012). doi: 10.1177/1745691612459058; pmid: 26168121
21. R. Rosenthal, The file drawer problem and tolerance for null results. Psychol. Bull. 86, 638–641 (1979). doi: 10.1037/0033-2909.86.3.638
22. P. Rozin, What kind of empirical research should we publish, fund, and reward?: A different perspective. Perspect. Psychol. Sci. 4, 435–439 (2009). doi: 10.1111/j.1745-6924.2009.01151.x; pmid: 26158991
23. J. P. Simmons, L. D. Nelson, U. Simonsohn, False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011). doi: 10.1177/0956797611417632; pmid: 22006061
24. Open Science Collaboration, An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspect. Psychol. Sci. 7, 657–660 (2012). doi: 10.1177/1745691612462588; pmid: 26168127
25. Open Science Collaboration, in Implementing Reproducible Computational Research (A Volume in The R Series), V. Stodden, F. Leisch, R. Peng, Eds. (Taylor & Francis, New York, 2014), pp. 299–323.
26. R. A. Fisher, Theory of statistical estimation. Math. Proc. Camb. Philos. Soc. 22, 700–725 (1925). doi: 10.1017/S0305004100009580
27. W. Viechtbauer, Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36, 1–48 (2010).
28. S. L. Braver, F. J. Thoemmes, R. Rosenthal, Continuously cumulating meta-analysis and replicability. Perspect. Psychol. Sci. 9, 333–342 (2014). doi: 10.1177/1745691614529796; pmid: 26173268
29. U. Simonsohn, Small telescopes: Detectability and the evaluation of replication results. Psychol. Sci. 26, 559–569 (2015). doi: 10.1177/0956797614567341; pmid: 25800521
30. D. Lakens, Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Front. Psychol. 4, 863 (2013). doi: 10.3389/fpsyg.2013.00863; pmid: 24324449
31. J. Lehrer, The truth wears off: Is there something wrong with the scientific method? The New Yorker, 52–57 (2010).
32. R. Klein et al., Investigating variation in replicability: A "many labs" replication project. Soc. Psychol. 45, 142–152 (2014). doi: 10.1027/1864-9335/a000178
33. J. Cohen, The statistical power of abnormal-social psychological research: A review. J. Abnorm. Soc. Psychol. 65, 145–153 (1962). doi: 10.1037/h0045186; pmid: 13880271
34. T. D. Sterling, Publication decisions and their possible effects on inferences drawn from tests of significance, or vice versa. J. Am. Stat. Assoc. 54, 30–34 (1959).
35. T. M. Errington et al., An open investigation of the reproducibility of cancer biology research. eLife 3, e04333 (2014). doi: 10.7554/eLife.04333; pmid: 25490932
36. J. K. Hartshorne, A. Schachner, Tracking replicability as a method of post-publication open evaluation. Front. Comput. Neurosci. 6, 8 (2012). doi: 10.3389/fncom.2012.00008; pmid: 22403538
37. B. A. Nosek et al., Promoting an open research culture. Science 348, 1422–1425 (2015). doi: 10.1126/science.aab2374; pmid: 26113702
ACKNOWLEDGMENTS
In addition to the coauthors of this manuscript, there were many
volunteers who contributed to project success. We thank
D. Acup, J. Anderson, S. Anzellotti, R. Araujo, J. D. Arnal, T. Bates,
R. Battleday, R. Bauchwitz, M. Bernstein, B. Blohowiak, M. Boffo,
E. Bruneau, B. Chabot-Hanowell, J. Chan, P. Chu, A. Dalla Rosa,
B. Deen, P. DiGiacomo, C. Dogulu, N. Dufour, C. Fitzgerald,
A. Foote, A. Garcia, E. Garcia, C. Gautreau, L. Germine, T. Gill,
L. Goldberg, S. D. Goldinger, H. Gweon, D. Haile, K. Hart,
F. Hjorth, J. Hoenig, Å. Innes-Ker, B. Jansen, R. Jersakova, Y. Jie,
Z. Kaldy, W. K. Vong, A. Kenney, J. Kingston, J. Koster-Hale,
A. Lam, R. LeDonne, D. Lumian, E. Luong, S. Man-pui, J. Martin,
A. Mauk, T. McElroy, K. McRae, T. Miller, K. Moser, M. Mullarkey,
A. R. Munoz, J. Ong, C. Parks, D. S. Pate, D. Patron,
H. J. M. Pennings, M. Penuliar, A. Pfammatter, J. P. Shanoltz,
E. Stevenson, E. Pichler, H. Raudszus, H. Richardson, N. Rothstein,
T. Scherndl, S. Schrager, S. Shah, Y. S. Tai, A. Skerry,
M. Steinberg, J. Stoeterau, H. Tibboel, A. Tooley, A. Tullett,
C. Vaccaro, E. Vergauwe, A. Watanabe, I. Weiss, M. H. White II,
P. Whitehead, C. Widmann, D. K. Williams, K. M. Williams, and H. Yi.
Also, we thank the authors of the original research that was the
subject of replication in this project. These authors were generous
with their time, materials, and advice for improving the quality of
each replication and identifying the strengths and limits of the
outcomes. The authors of this work are listed alphabetically. This
project was supported by the Center for Open Science and the
Laura and John Arnold Foundation. The authors declare no
financial conflict of interest with the reported research.
The Open Science Collaboration
Alexander A. Aarts,
1
Joanna E. Anderson,
2
Christopher J. Anderson,
3
Peter R. Attridge,
4,5
Angela Attwood,
6
Jordan Axt,
7
Molly Babel,
8
Štěpán Bahník,
9
Erica Baranski,
10
Michael Barnett-Cowan,
11
Elizabeth Bartmess,
12
Jennifer Beer,
13
Raoul Bell,
14
Heather Bentley,
5
Leah Beyan,
5
Grace Binion,
5,15
Denny Borsboom,
16
Annick Bosch,
17
Frank A. Bosco,
18
Sara D. Bowman,
19
Mark J. Brandt,
20
Erin Braswell,
19
Hilmar Brohmer,
20
Benjamin T. Brown,
5
Kristina Brown,
5
Jovita Brüning,
21,22
Ann Calhoun-Sauls,
23
Shannon P. Callahan,
24
Elizabeth Chagnon,
25
Jesse Chandler,
26,27
Christopher R. Chartier,
28
Felix Cheung,
29,30
Cody D. Christopherson,
31
Linda Cillessen,
17
Russ Clay,
32
Hayley Cleary,
18
Mark D. Cloud,
33
Michael Cohn,
12
Johanna Cohoon,
19
Simon Columbus,
16
Andreas Cordes,
34
Giulio Costantini,
35
Leslie D. Cramblet Alvarez,
36
Ed Cremata,
37
Jan Crusius,
38
Jamie DeCoster,
7
Michelle A. DeGaetano,
5
Nicolás Della Penna,
39
Bobby den Bezemer,
16
Marie K. Deserno,
16
Olivia Devitt,
5
Laura Dewitte,
40
David G. Dobolyi,
7
Geneva T. Dodson,
7
M. Brent Donnellan,
41
Ryan Donohue,
42
Rebecca A. Dore,
7
Angela Dorrough,
43,44
Anna Dreber,
45
Michelle Dugas,
25
Elizabeth W. Dunn,
8
Kayleigh Easey,
6
Sylvia Eboigbe,
5
Casey Eggleston,
7
Jo Embley,
46
Sacha Epskamp,
16
Timothy M. Errington,
19
Vivien Estel,
47
Frank J. Farach,
48,49
Jenelle Feather,
50
Anna Fedor,
51
Belén Fernández-Castilla,
52
Susann Fiedler,
44
James G. Field,
18
Stanka A. Fitneva,
53
Taru Flagan,
13
Amanda L. Forest,
54
Eskil Forsell,
45
Joshua D. Foster,
55
Michael C. Frank,
56
Rebecca S. Frazier,
7
Heather Fuchs,
38
Philip Gable,
57
Jeff Galak,
58
Elisa Maria Galliani,
59
Anup Gampa,
7
Sara Garcia,
60
Douglas Gazarian,
61
Elizabeth Gilbert,
7
Roger Giner-Sorolla,
46
Andreas Glöckner,
34,44
Lars Goellner,
43
Jin X. Goh,
62
Rebecca Goldberg,
63
Patrick T. Goodbourn,
64
Shauna Gordon-McKeon,
65
Bryan Gorges,
19
Jessie Gorges,
19
Justin Goss,
66
Jesse Graham,
37
James A. Grange,
67
Jeremy Gray,
29
Chris Hartgerink,
20
Joshua Hartshorne,
50
Fred Hasselman,
17,68
Timothy Hayes,
37
Emma Heikensten,
45
Felix Henninger,
69,44
John Hodsoll,
70,71
Taylor Holubar,
56
Gea Hoogendoorn,
20
Denise J. Humphries,
5
Cathy O.-Y. Hung,
30
Nathali Immelman,
72
Vanessa C. Irsik,
73
Georg Jahn,
74
Frank Jäkel,
75
Marc Jekel,
34
Magnus Johannesson,
45
Larissa G. Johnson,
76
David J. Johnson,
29
Kate M. Johnson,
37
William J. Johnston,
77
Kai Jonas,
16
Jennifer A. Joy-Gaba,
18
Heather Barry Kappes,
78
Kim Kelso,
36
Mallory C. Kidwell,
19
Seung Kyung Kim,
56
Matthew Kirkhart,
79
Bennett Kleinberg,
16,80
Goran Knežević,
81
Franziska Maria Kolorz,
17
Jolanda J. Kossakowski,
16
Robert Wilhelm Krause,
17
Job Krijnen,
20
Tim Kuhlmann,
82
Yoram K. Kunkels,
16
Megan M. Kyc,
33
Calvin K. Lai,
7
Aamir Laique,
83
Daniël Lakens,
84
Kristin A. Lane,
61
Bethany Lassetter,
85
Ljiljana B. Lazarević,
81
Etienne P. LeBel,
86
Key Jung Lee,
56
Minha Lee,
7
Kristi Lemm,
87
Carmel A. Levitan,
88
Melissa Lewis,
89
Lin Lin,
30
Stephanie Lin,
56
Matthias Lippold,
34
Darren Loureiro,
25
Ilse Luteijn,
17
Sean Mackinnon,
90
Heather N. Mainard,
5
Denise C. Marigold,
91
Daniel P. Martin,
7
Tylar Martinez,
36
E.J. Masicampo,
92
Josh Matacotta,
93
Maya Mathur,
56
Michael May,
44,94
Nicole Mechin,
57
Pranjal Mehta,
15
Johannes Meixner,
21,95
Alissa Melinger,
96
Jeremy K. Miller,
97
Mallorie Miller,
63
Katherine Moore,
42,98
Marcus Möschl,
99
Matt Motyl,
100
Stephanie M. Müller,
47
Marcus Munafo,
6
Koen I. Neijenhuijs,
17
Taylor Nervi,
28
Gandalf Nicolas,
101
Gustav Nilsonne,
102,103
Brian A. Nosek,
7,19
Michèle B. Nuijten,
20
Catherine Olsson,
50,104
Colleen Osborne,
7
Lutz Ostkamp,
75
Misha Pavel,
62
Ian S. Penton-Voak,
6
Olivia Perna,
28
Cyril Pernet,
105
Marco Perugini,
35
R. Nathan Pipitone,
36
Michael Pitts,
89
Franziska Plessow,
99,106
Jason M. Prenoveau,
79
Rima-Maria Rahal,
44,16
Kate A. Ratliff,
107
David Reinhard,
7
Frank Renkewitz,
47
Ashley A. Ricker,
10
Anastasia Rigney,
13
Andrew M. Rivers,
24
Mark Roebke,
108
Abraham M. Rutchick,
109
Robert S. Ryan,
110
Onur Sahin,
16
Anondah Saide,
10
Gillian M. Sandstrom,
8
David Santos,
111,112
Rebecca Saxe,
50
René Schlegelmilch,
44,47
Kathleen Schmidt,
113
Sabine Scholz,
114
Larissa Seibel,
17
Dylan Faulkner Selterman,
25
Samuel Shaki,
115
William B. Simpson,
7
H. Colleen Sinclair,
63
Jeanine L. M. Skorinko,
116
Agnieszka Slowik,
117
Joel S. Snyder,
73
Courtney Soderberg,
19
Carina Sonnleitner,
117
Nick Spencer,
36
Jeffrey R. Spies,
19
Sara Steegen,
40
Stefan Stieger,
82
Nina Strohminger,
118
Gavin B. Sullivan,
119
Thomas Talhelm,
7
Megan Tapia,
36
Anniek te Dorsthorst,
17
Manuela Thomae,
72,120
Sarah L. Thomas,
7
Pia Tio,
16
Frits Traets,
40
Steve Tsang,
121
Francis Tuerlinckx,
40
Paul Turchan,
122
Milan Valášek,
105
Anna E. van 't Veer,
20,123
Robbie Van Aert,
20
Marcel van Assen,
20
Riet van Bork,
16
Mathijs van de Ven,
17
Don van den Bergh,
16
Marije van der Hulst,
17
Roel van Dooren,
17
Johnny van Doorn,
40
Daan R. van Renswoude,
16
Hedderik van Rijn,
114
Wolf Vanpaemel,
40
Alejandro Vásquez Echeverría,
124
Melissa Vazquez,
5
Natalia Velez,
56
Marieke Vermue,
17
Mark Verschoor,
20
Michelangelo Vianello,
59
Martin Voracek,
117
Gina Vuu,
7
Eric-Jan Wagenmakers,
16
Joanneke Weerdmeester,
17
Ashlee Welsh,
36
Erin C. Westgate,
7
Joeri Wissink,
20
Michael Wood,
72
Andy Woods,
125,6
Emily Wright,
36
Sining Wu,
63
Marcel Zeelenberg,
20
Kellylynn Zuni
36
1
Open Science Collaboration, Nuenen, Netherlands.
2
Defense
Research and Development Canada, Ottawa, ON, Canada.
3
Department of Psychology, Southern New Hampshire University,
Schenectady, NY 12305, USA.
4
Mercer School of Medicine, Macon,
GA 31207, USA.
5
Georgia Gwinnett College, Lawrenceville, GA
30043, USA.
6
School of Experimental Psychology, University
of Bristol, Bristol BS8 1TH, UK.
7
University of Virginia,
Charlottesville, VA 22904, USA.
8
University of British Columbia,
Vancouver, BC V6T 1Z4 Canada.
9
Department of Psychology II,
University of Würzburg, Würzburg, Germany.
10
Department of
Psychology, University of California, Riverside, Riverside,
CA 92521, USA.
11
University of Waterloo, Waterloo, ON N2L 3G1,
Canada.
12
University of California, San Francisco, San Francisico,
CA 94118, USA.
13
Department of Psychology, University of
Texas at Austin, Austin, TX 78712, USA.
14
Department of
Experimental Psychology, Heinrich Heine University Düsseldorf,
Düsseldorf, Germany.
15
Department of Psychology, University of
Oregon, Eugene, OR 97403, USA.
16
Department of Psychology,
University of Amsterdam, Amsterdam, Netherlands.
17
Radboud
University Nijmegen, Nijmegen, Netherlands.
18
Virginia
Commonwealth University, Richmond, VA 23284, USA.
19
Center
for Open Science, Charlottesville, VA 22902, USA.
20
Department
of Social Psychology, Tilburg University, Tilburg, Netherlands.
21
Humboldt University of Berlin, Berlin, Germany.
22
Charité
Universitätsmedizin Berlin, Berlin, Germany.
23
Belmont Abbey
College, Belmont, NC 28012, USA.
24
Department of Psychology,
University of California, Davis, Davis, CA 95616, USA.
25
University
of Maryland, Washington, DC 20012, USA.
26
Institute for Social
Research, University of Michigan, Ann Arbor, MI 48104, USA.
27
Mathematica Policy Research, Washington, DC 20002, USA.
28
Ashland University, Ashland, OH 44805, USA.
29
Michigan State
University, East Lansing, MI 48824, USA.
30
Department of
Psychology, University of Hong Kong, Pok Fu Lam, Hong Kong.
31
Department of Psychology, Southern Oregon University, Ashland,
OR 97520, USA.
32
College of Staten Island, City University of New
York, Staten Island, NY 10314, USA.
33
Department of Psychology,
Lock Haven University, Lock Haven, PA 17745, USA.
34
University of
Göttingen, Göttingen, Germany.
35
University of Milan-Bicocca,
Milan, Italy.
36
Department of Psychology, Adams State University,
Alamosa, CO 81101, USA.
37
University of Southern California,
Los Angeles, CA 90089, USA.
38
University of Cologne, Cologne,
Germany.
39
Australian National University, Canberra, Australia.
40
University of Leuven, Leuven, Belgium.
41
Texas A & M University,
College Station, TX 77845, USA.
42
Elmhurst College, Elmhurst, IL
60126, USA.
43
Department of Social Psychology, University of Siegen,
Siegen, Germany.
44
Max Planck Institute for Research on Collective
Goods, Bonn, Germany.
45
Department of Economics, Stockholm
School of Economics, Stockholm, Sweden.
46
School of Psychology,
University of Kent, Canterbury, Kent, UK.
47
University of Erfurt,
Erfurt, Germany.
48
University of Washington, Seattle, WA 98195,
USA.
49
Prometheus Research, New Haven, CT 06510, USA.
50
Department of Brain and Cognitive Sciences, Massachusetts
Institute of Technology, Cambridge, MA 02139, USA.
51
Parmenides Center for the Study of Thinking, Munich,
Germany.
52
Universidad Complutense de Madrid, Madrid, Spain.
53
Department of Psychology, Queen's University, Kingston, ON,
Canada.
54
Department of Psychology, University of Pittsburgh,
Pittsburgh, PA 15260, USA.
55
Department of Psychology, University
of South Alabama, Mobile, AL 36688, USA.
56
Stanford University,
Stanford, CA 94305, USA.
57
Department of Psychology, University
of Alabama, Tuscaloosa, AL 35487, USA.
58
Carnegie Mellon
University, Pittsburgh, PA 15213, USA.
59
Department FISPPA
Applied Psychology Unit, University of Padua, Padova, Italy.
60
Universidad Nacional De Asunción, Asunción, Paraguay.
61
Bard College, Providence, RI 02906, USA.
62
Northeastern
University, Boston, MA 02115, USA.
63
Department of Counseling
and Educational Psychology, Mississippi State University,
Mississippi State, MS 39762, USA.
64
School of Psychology,
University of Sydney, Sydney, Australia.
65
Hampshire College,
Amherst, MA 01002, USA.
66
Colorado State University-Pueblo,
Pueblo, CO 81001, USA.
67
School of Psychology, Keele University,
Keele, Staffordshire, UK.
68
Behavioral Science Institute, Nijmegen,
Netherlands.
69
University of Koblenz-Landau, Landau, Germany.
70
Department of Biostatistics, Institute of Psychiatry, Psychology,
and Neuroscience, IHR Biomedical Research Centre for
Mental Health, South London, London, UK.
71
Maudsley NHS
Foundation Trust, King's College London, London, UK.
72
Department of Psychology, University of Winchester, Winchester,
UK.
73
University of Nevada, Las Vegas, Las Vegas, NV 89154, USA.
74
Institute for Multimedia and Interactive Systems, University of
Lübeck, Lübeck, Germany.
75
Institute of Cognitive Science,
University of Osnabrück, Osnabrück, Germany.
76
University of
Birmingham, Northampton, Northamptonshire NN1 3NB, UK.
77
University of Chicago, Chicago, IL 60615, USA.
78
London School
of Economics and Political Science, London WC2A 2AE, UK.
79
Loyola University Maryland, Baltimore, MD 21210, USA.
80
Department of Security and Crime Science, University College
London, London WC1H 9EZ, UK.
81
Department of Psychology,
University of Belgrade, Belgrade, Serbia.
82
Department of
Psychology, University of Konstanz, Konstanz, Germany.
83
Open
Science Collaboration, Saratoga, CA, USA.
84
School of Innovation
Sciences, Eindhoven University of Technology, Eindhoven, Netherlands.
85
Department of Psychology, University of Iowa, Iowa City, IA
52242, USA.
86
Department of Psychology, Western University,
London, ON N6A 5C2, Canada.
87
Department of Psychology,
Western Washington University, Bellingham, WA 98225, USA.
88
Department of Cognitive Science, Occidental College, Los
Angeles, CA 90041, USA.
89
Department of Psychology, Reed
College, Portland, OR 97202, USA.
90
Department of Psychology
and Neuroscience, Dalhousie University, Halifax, NS, Canada.
91
Renison University College at University of Waterloo, Waterloo,
ON N2l 3G4, Canada.
92
Department of Psychology, Wake Forest
University, Winston-Salem, NC 27109, USA.
93
Counseling and
Psychological Services, California State University, Fullerton,
Fullerton, CA 92831, USA.
94
University of Bonn, Bonn, Germany.
95
University of Potsdam, Potsdam, Germany.
96
School of
Psychology, University of Dundee, Dundee, Scotland.
97
Willamette
University, Salem, OR 97301, USA.
98
Arcadia University, Glenside,
PA 19038, USA.
99
Department of Psychology, Technische
Universität Dresden, Dresden, Germany.
100
Department of
Psychology, University of Illinois at Chicago, Chicago, IL 60607,
USA.
101
William and Mary, Williamsburg, VA 23185, USA.
102
Stockholm University, Stockholm, Sweden.
103
Karolinska
Institute, Stockholm, Sweden.
104
Center for Neural Science,
New York University, New York, NY 10003, USA.
105
Centre for
Clinical Brain Sciences, The University of Edinburgh, Edinburgh
EH16 4SB, UK.
106
Department of Neurology, Beth Israel
Deaconess Medical Center, Harvard Medical School, Boston, MA
02215, USA.
107
Department of Psychology, University of Florida,
Gainesville, FL 32611, USA.
108
Department of Psychology, Wright
State University, Dayton, OH 45435, USA.
109
California State
University, Northridge, Northridge, CA 91330, USA.
110
Kutztown
University of Pennsylvania, Kutztown, PA 19530, USA.
111
Department of Psychology, Universidad Autónoma de Madrid,
Madrid, Spain.
112
IE Business School, Madrid, Spain.
113
Department
of Social Sciences, University of Virginia's College at Wise, Wise,
VA 24230, USA.
114
University of Groningen, Groningen,
Netherlands.
115
Department of Behavioral Sciences, Ariel University,
Ariel, 40700 Israel.
116
Department of Social Science, Worchester
Polytechnic Institute, Worchester, MA 01609, USA.
117
Department of Basic Psychological Research and Research
Methods, University of Vienna, Vienna, Austria.
118
Duke University,
Durham, NC 27708, USA.
119
Centre for Research in Psychology,
Behavior and Achievement, Coventry University, Coventry CV1 5FB,
UK.
120
The Open University, Buckinghamshire MK7 6AA, UK.
121
City University of Hong Kong, Shamshuipo, KLN, Hong Kong.
122
Jacksonville University, Jacksonville, FL 32211, USA.
123
TIBER
(Tilburg Institute for Behavioral Economics Research), Tilburg,
Netherlands.
124
Universidad de la República Uruguay, Montevideo
11200, Uruguay.
125
University of Oxford, Oxford, UK.
SUPPLEMENTARY MATERIALS
www.sciencemag.org/content/349/6251/aac4716/suppl/DC1
Materials and Methods
Figs. S1 to S7
Tables S1 to S4
References (38–41)
29 April 2015; accepted 28 July 2015
10.1126/science.aac4716
Science advances on a foundation of trusted discoveries. Reproducing an experiment is one important approach that scientists use to gain confidence in their conclusions. Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible. Because confidence in results is of paramount importance to the broad scientific community, we are announcing new initiatives to increase confidence in the studies published in Science . For preclinical studies (one of the targets of recent concern), we will be adopting recommendations of the U.S. National Institute of Neurological Disorders and Stroke (NINDS) for increasing transparency. * Authors will indicate whether there was a pre-experimental plan for data handling (such as how to deal with outliers), whether they conducted a sample size estimation to ensure a sufficient signal-to-noise ratio, whether samples were treated randomly, and whether the experimenter was blind to the conduct of the experiment. These criteria will be included in our author guidelines.