Meta-Psychology, 2019, vol 3, MP.2018.843
https://doi.org/10.15626/MP.2018.843
Article type: Original Article
Published under the CC-BY 4.0 license
Open data: Not relevant
Open materials: Not relevant
Open and reproducible analysis: Not relevant
Open reviews and editorial process: Yes
Preregistration: Not relevant
Edited by: Rickard Carlsson
Reviewed by: Nuijten, M. & Schimmack, U.
All supplementary files can be accessed at the OSF project page:
https://doi.org/10.17605/OSF.IO/Q56E8
A Brief Guide to Evaluate Replications
Etienne P. LeBel
KU Leuven
Irene Cheung
Huron University College
Wolf Vanpaemel
KU Leuven
Lorne Campbell
Western University
The importance of replication is becoming increasingly appreciated; however, considerably less consensus exists about how to evaluate the design and results of replications. We make concrete recommendations on how to evaluate replications with more nuance than is currently typical in the literature. We highlight six study characteristics that are crucial for evaluating replications: replication method similarity, replication differences, investigator independence, method/data transparency, analytic result reproducibility, and auxiliary hypotheses' plausibility evidence. We also recommend a more nuanced approach to statistically interpreting replication results at the individual-study and meta-analytic levels, and propose clearer language to communicate replication results.
Keywords: transparency, replicability, direct replication, evaluating replications, reproducibility
There is growing consensus in the psychology
community regarding the fundamental scientific
value and importance of replication. Considerably
less consensus, however, exists about how to
evaluate the design and results of replication
studies. In this article, we make concrete recommendations on how to evaluate replications with more nuance than is currently typical in the literature. These recommendations aim to maximize the likelihood that replication results are interpreted in a fair and principled manner.
We propose a two-stage approach. The first stage involves considering and evaluating six crucial study
characteristics (the first three specific to replication
studies with the last three relevant for any study): (1)
replication method similarity, (2) replication
differences, (3) investigator independence, (4)
method/data transparency, (5) analytic result
reproducibility, and (6) auxiliary hypotheses’
plausibility evidence. Second, and assuming sound
study characteristics, we recommend more
nuanced ways to interpret replication results at the
individual-study and meta-analytic levels. Finally,
we propose the use of clearer and less ambiguous
language to more effectively communicate the
results of replication studies.
These recommendations are directly based on
curating N = 1,127 replications (as of August 2018)
available at Curate Science (CurateScience.org), a
web platform that organizes and tracks the
transparency and replications of published findings
in the social sciences (LeBel, McCarthy, Earp, Elson,
& Vanpaemel, 2018). This is the largest known meta-scientific effort to evaluate and interpret replication results of studies across a wide and heterogeneous set of study types, designs, and methodologies.
Author note: We thank the editor Rickard Carlsson and reviewers Michèle Nuijten and Ulrich Schimmack for valuable feedback on an earlier version of this article. We also thank Chiel Mues for copyediting our manuscript. Correspondence concerning this article should be addressed to Etienne P. LeBel, Quantitative Psychology and Individual Differences Unit, KU Leuven, Tiensestraat 102 - Box 3713, Leuven, Belgium, 3000. Email: etienne.lebel@gmail.com
Replication-Specific Study Characteristics
When evaluating replication studies, the
following three study characteristics are of crucial
importance:
1. Methodological similarity.
A first aspect is whether a replication study
employed a sufficiently similar methodology to the
original study (i.e., at minimum, used the same
operationalizations for the independent and
dependent variables, as in “close replications”; LeBel
et al., 2018). This is required because only such
replications can cast doubt upon an original
hypothesis (assuming sound auxiliary hypotheses,
see section below), and hence, in principle, falsify a
hypothesis (LeBel, Berger, Campbell, & Loving, 2017;
Pashler & Harris, 2012). Studies that are not
sufficiently similar can only speak to the
generalizability -- but not replicability -- of a
phenomenon under study, and should therefore be
treated as "generalizability studies" rather than
“replication studies”. Such studies are sometimes
called "conceptual replications", but this is a
misnomer given that it is more accurate to
conceptualize such studies as "extensions" rather
than replications (LeBel et al., 2017; Zwaan, Etz,
Lucas, & Donnellan, 2017).
2. Replication differences.
A second aspect to carefully consider is whether
there are any study design characteristics that
differed from the comparison original study. These differences are important to consider regardless of whether they were within or beyond a researcher's control (LeBel et al., 2018). Such differences are critical to consider
because they help the community begin to
understand the replicability and generalizability of
an effect. Consistent positive replication evidence
across replications with minor design differences
suggests an effect is likely robust across those
design differences. On the other hand, for
inconsistent replication evidence, such differences
may provide initial clues regarding potential
boundary conditions of an effect.
3. Investigator independence.
A final important consideration is the degree of
independence between the replication investigators
and researchers who conducted the original study.
This is important to consider to mitigate against the
problem of “correlated investigators” (Rosenthal,
1991) whereby non-independent investigators may
be more susceptible to confirmation biases given
vested interest in an effect (although preregistration
and other transparent practices can alleviate these
issues; see next section).
General Study Characteristics
When evaluating studies in general, the following
three study characteristics are important to
consider.
1. Study transparency.
Sufficient transparency is required to allow
comprehensive scrutiny of how any study was
conducted. Sufficient transparency means posting
the experimental materials and underlying data in a
readable format (e.g., with a codebook) on a public
repository (criteria for earning open materials and
open data badges, respectively; Kidwell et al., 2016)
and following the relevant reporting standards for
the type of study and methodology used (e.g.,
CONSORT reporting standard for experimental
studies; Schulz, Altman, & Moher, 2010). If a study is
not reported with sufficient transparency, it cannot
be properly scrutinized. The findings from such a
study are consequently of little value because the
target hypothesis was not tested in a sufficiently
falsifiable manner. Preregistering a study (which publicly records data collection, processing, and analysis plans prior to data collection) offers even
more transparency and limits researcher degrees of
freedom (assuming that the preregistered
procedure was actually followed).
2. Analytic result reproducibility.
For any study, it is also important to consider
whether a study’s primary result (or set of results) is
analytically reproducible. That is, whether a study’s
primary result can be successfully reproduced
(within a certain margin of error) from the raw or
transformed data (this is, of course, contingent on the data actually being available, whether publicly, as in the case of "open data", or otherwise).
If analytic reproducibility is confirmed, then our
confidence in a study’s reported results is boosted
(and ideally results can also be confirmed to be
robust across alternative justifiable data-analytic
choices; Steegen, Tuerlinckx, Gelman, & Vanpaemel,
2016). If analytic reproducibility is not confirmed
and/or if discrepancies are detected, then our
confidence should be reduced and this should be
taken into account when interpreting a study’s
results.
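To make this check concrete, the following is a minimal Python sketch (not taken from the article) of an analytic reproducibility check: it recomputes a primary correlation from a hypothetical open data file and compares it with the value reported in the paper within a small tolerance. The file name, column names, reported value, and tolerance are all placeholder assumptions.

```python
# Minimal sketch of an analytic reproducibility check. "open_data.csv", the column
# names, and the reported value are hypothetical placeholders.
import pandas as pd
from scipy import stats

REPORTED_R = 0.34   # correlation reported in the (hypothetical) paper
TOLERANCE = 0.005   # allowance for rounding in the published report

data = pd.read_csv("open_data.csv")
r, p = stats.pearsonr(data["predictor"], data["outcome"])

if abs(r - REPORTED_R) <= TOLERANCE:
    print(f"Analytically reproducible: recomputed r = {r:.3f} matches reported r = {REPORTED_R}")
else:
    print(f"Discrepancy: recomputed r = {r:.3f} vs. reported r = {REPORTED_R}")
```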
3. Auxiliary hypotheses.
Finally, for any study, researchers should
consider all available evidence regarding how
plausible it is that the relevant auxiliary hypotheses,
needed to test the substantive hypothesis at hand,
were true (LeBel et al., 2018). Auxiliary hypotheses
include, for example, the psychometric validity of
the measuring instruments, and the sound
realizations of experimental conditions (Meehl,
1990). This can be done by examining reported
evidence of positive controls or evidence that a
replication sample had the ability to detect some
effect (e.g., replicating a past known effect;
manipulation check evidence). These considerations
are particularly crucial when interpreting null
results so that one can rule out more mundane
reasons for not having detected a signal (e.g., fatal
experimenter or data processing errors; though
such fatal errors can also sometimes cause false
positive results).
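As an illustration of the kind of evidence discussed above, the following Python sketch (with entirely hypothetical numbers) shows two simple checks: a sensitivity analysis indicating the smallest standardized effect a replication sample could detect with 80% power, and a positive-control test of a manipulation check. The sample sizes, means, and data are placeholders, not values from any study cited here.

```python
# Hypothetical auxiliary-hypothesis checks for a replication reporting a null result.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# (a) Sensitivity analysis: smallest standardized effect detectable with 80% power
#     (alpha = .05, two-sided) given a hypothetical n = 100 per group.
detectable_d = TTestIndPower().solve_power(effect_size=None, nobs1=100, alpha=0.05,
                                           power=0.80, ratio=1.0, alternative="two-sided")
print(f"Smallest detectable effect at 80% power: d = {detectable_d:.2f}")

# (b) Positive control: did the manipulation check separate the conditions?
rng = np.random.default_rng(1)                # fake data, for illustration only
check_treatment = rng.normal(5.5, 1.0, 100)   # hypothetical manipulation-check ratings
check_control = rng.normal(4.5, 1.0, 100)
t, p = stats.ttest_ind(check_treatment, check_control)
print(f"Manipulation check: t = {t:.2f}, p = {p:.4f}")
```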
Nuanced Statistical Interpretation and Language
Once these six study characteristics have been
evaluated and taken into account, we recommend
statistical approaches to interpret the results of a
replication study at the individual-study and meta-
analytic levels that are more nuanced than current practice. We then propose the use of
clearer language to communicate replication
results.
Footnote 1: The ES estimate precision of an original study is not currently accounted for because the vast majority of original studies in the legacy literature do not report 95% CIs (and CIs most often cannot be calculated because insufficient information is reported). In the rare cases where CIs are reported, they are typically so wide (given the underpowered nature of the legacy literature) that ES estimates are not statistically falsifiable in practical terms. Once it becomes the norm in the field to report highly precise ES estimates, however, it will become possible and desirable to account for original study ES estimate precision when statistically interpreting replication results.
Statistical interpretation: Individual-study level.
At the individual-study level, we recommend that
the following three distinct statistical aspects of a replication result be considered: (1) whether a signal was detected, (2) the consistency of the replication effect size (ES) relative to the original study ES, and (3) the precision of the replication ES estimate relative to that of the original study.
Such considerations yield the following replication
outcome categories for the situation where an
original study detected a signal (see Figure 1, Panel
A, for visual depictions of these distinct scenarios)1:
1. Signal consistent: replication ES 95%
confidence interval (CI) excludes 0 and
includes original ES point estimate (Panel A
replication scenario #1; e.g., Chartier’s, 2015,
Reproducibility Project: Psychology [RPP]
#31 replication result of McCrea’s, 2008
Study 5; see Table 1 in the Appendix for
details of Chartier's, 2015 RPP #31 replication
and subsequently cited replication
examples).
2. Signal inconsistent: replication ES 95% CI
excludes 0 but also excludes original ES point
estimate. Three subcategorizations exist
within this outcome category:
a. Signal inconsistent, larger (same
direction): replication ES is larger and in
same direction as original ES (Panel A
replication scenario #2; e.g., Veer et al.’s,
2015, RPP #36 replication result of
Armor et al.’s, 2008 Study 1).
b. Signal inconsistent, smaller (same
direction): replication ES is smaller and
in same direction as original ES (Panel A
replication scenario #3; e.g., Ratliff’s,
2015, RPP #26 replication result of
Fischer et al.’s, 2008 Study 4).
c. Signal inconsistent, opposite
direction/pattern: replication ES is in
opposite direction (or reflects an
inconsistent pattern) relative to the
original ES direction/pattern (Panel A
replication scenario #4; e.g., Earp et al.'s, 2014 Study 3 replication result of Zhong & Liljenquist's, 2006 Study 2).
3. No signal consistent: replication ES 95% CI
includes 0 but also includes original ES point
estimate (Panel A replication scenario #5;
e.g., Hull et al.’s, 2002 Study 1b replication
result of Bargh et al.’s, 1996 Study 2a).
4. No signal inconsistent: replication ES 95%
CI includes 0 but excludes original ES point
estimate (Panel A replication scenario #6;
e.g., LeBel & Campbell’s, 2013 Study 1
replication result of Vess’, 2012 Study 1).
Figure 1. Distinct hypothetical outcomes of a replication study based on considering three statistical aspects of a replication result: (1) whether a signal was detected, (2) consistency of replication effect size (ES) relative to an original study, and (3) the precision of replication ES estimate relative to ES estimate precision in an original study. Outcomes are separated for situations where an original study detected a signal (Panel A) versus did not detect a signal (Panel B).
In cases where a replication effect size estimate
was less precise than the original (i.e., the
replication ES confidence interval is wider than the
original), which can occur when a replication uses a
smaller sample size and/or when the replication
sample exhibits higher variability, we propose the
label "less precise" be used to warn readers that such a replication result should only be interpreted meta-
analytically (Panel A replication scenario #7; e.g.,
Schuler & Wanke’s, 2016 Study 2 replication result of
Caruso et al.’s, 2013 Study 2).
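To illustrate how these criteria are applied, the following Python sketch restates them in code using two rows from Table 1 in the Appendix (Chartier, 2015/McCrea, 2008 and Schuler & Wänke, 2016/Caruso et al., 2013), treating the ± values as 95% CI half-widths. It is an illustration of the verbal criteria above, not an analysis script from the article.

```python
# The Panel A criteria restated in code, using two rows from Table 1 in the Appendix.
# The ± values are treated as 95% CI half-widths.

def ci(point, half_width):
    return point - half_width, point + half_width

# Chartier (2015, RPP #31) / McCrea (2008) Study 5:
# original r = .34 ± .35, replication r = .29 ± .24
rep_lo, rep_hi = ci(0.29, 0.24)               # replication 95% CI: [.05, .53]
signal = not (rep_lo <= 0 <= rep_hi)          # CI excludes 0 -> a signal was detected
consistent = rep_lo <= 0.34 <= rep_hi         # CI includes the original ES point estimate
print(signal and consistent)                  # True: "signal - consistent" (scenario #1)

# Schuler & Wänke (2016) Study 2 / Caruso et al. (2013) Study 2:
# original d = .43 ± .30, replication d = -.09 ± .39
less_precise = 0.39 > 0.30                    # replication CI wider than original CI
print(less_precise)                           # True: interpret meta-analytically only (scenario #7)
```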
In the situation where an original study did not
detect a signal, such considerations yield the
following replication outcome categories (see Figure
1, Panel B, for visual depictions of these distinct
scenarios):
1. No signal consistent: replication ES 95%
confidence interval (CI) includes 0 and
includes original ES point estimate (Panel B
replication scenario #1; e.g., Selterman et
al.’s, 2015, RPP #29 replication result of
Eastwick & Finkel’s, 2008 Study 1).
2. No signal consistent (less precise):
replication ES 95% confidence interval (CI)
includes 0 and includes original ES point
estimate, but replication ES estimate is less
precise than in original study (Panel B
replication scenario #2; no replication is yet
known to fall under this scenario).
3. Signal consistent: replication ES 95%
confidence interval (CI) excludes 0 but
includes original ES point estimate (Panel B replication scenario #3; e.g., Roebke & Penna's
2015, RPP #76 replication result of Couture
et al.'s, 2008 Study 1).
4. Signal inconsistent: replication ES 95%
confidence interval (CI) excludes 0 and
excludes original ES point estimate. Two subcategorizations exist within this outcome
category:
a. Signal inconsistent, positive effect:
replication ES involves a positive effect
(Panel B replication scenario #4; e.g.,
Cohn’s, 2015, RPP #45 replication result
of Ranganath & Nosek’s, 2008 Study 1).
b. Signal inconsistent, negative effect:
replication ES involves a negative effect
(Panel B replication scenario #5; no replication is yet known to fall under this scenario).
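The following sketch consolidates the two classification schemes above (Panel A and Panel B) into a single hypothetical function. The function name, inputs (ES point estimates and 95% CI half-widths, as in Table 1 of the Appendix), and label strings are our own illustration of the categories described in the text, not code from Curate Science.

```python
# A sketch consolidating the Panel A and Panel B schemes into one hypothetical function.
# Inputs are ES point estimates and 95% CI half-widths; the labels follow the text.
def classify_replication(orig_es, orig_moe, rep_es, rep_moe):
    rep_lo, rep_hi = rep_es - rep_moe, rep_es + rep_moe
    orig_signal = not (orig_es - orig_moe <= 0 <= orig_es + orig_moe)
    rep_signal = not (rep_lo <= 0 <= rep_hi)
    consistent = rep_lo <= orig_es <= rep_hi      # replication CI includes original ES
    suffix = " (less precise)" if rep_moe > orig_moe else ""

    if orig_signal:  # Panel A: original study detected a signal
        if rep_signal and consistent:
            return "signal - consistent" + suffix
        if rep_signal:  # signal detected, but inconsistent with the original ES
            if rep_es * orig_es < 0:
                return "signal - inconsistent, opposite direction" + suffix
            size = "larger" if abs(rep_es) > abs(orig_es) else "smaller"
            return f"signal - inconsistent, {size} (same direction)" + suffix
        return ("no signal - consistent" if consistent
                else "no signal - inconsistent") + suffix

    # Panel B: original study did not detect a signal
    if not rep_signal:
        # the text lists only the consistent case here; the check keeps the logic explicit
        return ("no signal - consistent" if consistent
                else "no signal - inconsistent") + suffix
    if consistent:
        return "signal - consistent" + suffix
    direction = "positive" if rep_es > 0 else "negative"
    return f"signal - inconsistent, {direction} effect" + suffix

# e.g., LeBel & Campbell (2013) vs. Vess (2012) Study 1 (Table 1):
# original d = .60 ± .55, replication d = .03 ± .27
print(classify_replication(0.60, 0.55, 0.03, 0.27))   # "no signal - inconsistent"
```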
From this perspective, the proposed improved language to describe a replication study under Panel A replication scenario #6 would be: "We report a replication study of effect X. No signal was detected and the effect size was inconsistent with the original one." This terminology contrasts favorably with several ambiguous or unclear terms that are currently commonly used to describe replication results (e.g., "unsuccessful", "failed", "failure to replicate", "non-replication").
The terms “unsuccessful” or “failed” (or “failure to
replicate”) are ambiguous: was it the replication
methodology or the replication result that was
unsuccessful or failed (with similar logic applied to
the ambiguous term “non-replication”)? The terms
“unsuccessful” or “failed” are also problematic
because of the implicit message conveyed that
something was “wrong” with the replication. For
example, though the “small telescope approach”
(Simonsohn, 2015) was an improvement over the
prior simplistic standard of considering a replication
p < .05 as “successful” and p > .05 as “unsuccessful”,
the approach nonetheless uses ambiguous language
that does not actually describe a replication result
(e.g., “uninformative” vs. “informative failure to
replicate”). Instead, the terminology we propose
offers unambiguous and descriptively accurate
language, stating both whether a signal was
detected and the consistency of the replication ES
estimate relative to the original study. The proposed
nuanced approach to statistically interpreting
replication evidence improves the clarity of the
language to describe and communicate replication
results.
Statistical interpretation: Meta-analytic level.
Interpreting the outcomes of a set of replication
studies can proceed in two ways: an informal
approach, when only a few replications are available,
and a more quantitative meta-analytic approach
when several replications are available for a specific
operationalization of an effect. The informal approach considers whether the replications consistently detect a signal and whether each replication ES is consistent (i.e., of similar magnitude) with the ES point estimate from the original study (Panel A replication scenario #1).
Under this situation, one could informally say that
an effect is “replicable.” When several replications
are available, a more quantitative meta-analytic
approach can be taken: an effect can be considered
“replicable” when the meta-analytic ES estimate
excludes zero and is consistent with the original ES
point estimate (also replication scenario #1, see
Panel A Figure 1; see also Mathur & VanderWeele,
2018).
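As a minimal illustration of this meta-analytic check, the sketch below pools several hypothetical replication effect sizes with a simple fixed-effect (inverse-variance) model and then asks whether the pooled 95% CI excludes zero and contains the original ES point estimate. The numbers are invented and the pooling model is a simplification; it is not the approach of Mathur and VanderWeele (2018).

```python
# Minimal fixed-effect (inverse-variance) pooling of hypothetical replication ESs,
# followed by the "replicable" check described in the text.
import numpy as np

original_es = 0.34                             # hypothetical original ES point estimate
rep_es = np.array([0.29, 0.25, 0.31])          # hypothetical replication effect sizes
rep_se = np.array([0.12, 0.10, 0.15])          # their standard errors

weights = 1 / rep_se**2
pooled = np.sum(weights * rep_es) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se

replicable = (lo > 0 or hi < 0) and (lo <= original_es <= hi)
print(f"Pooled ES = {pooled:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], 'replicable' = {replicable}")
```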
Conclusion
It is important to note that replicability should be
seen as a minimum requirement for scientific
progress rather than an arbiter of truth. Replicability
ensures that a research community avoids going
down blind alleys chasing after anomalous results
that emerged due to chance, noise, or other
unknown errors. However, when adjudicating the
replicability of an effect, it is important to keep in
mind that an effect that does not appear to be
replicable does not necessarily mean the tested
hypothesis is false: It is always possible that an effect
is replicable via alternative methods or
operationalizations and/or that there were
problems with some of the auxiliary hypotheses
(e.g., invalid measurement, or unclear instructions,
etc.). This possibility, however, should not be
exploited: eventually one must consider the value of
continued testing of a hypothesis across different
operationalizations and contexts. Conversely, an
effect that appears replicable does not necessarily
mean the tested hypothesis is true: A replicable
effect may not necessarily reflect a valid and/or
generalizable effect (e.g., a replicable effect may
simply reflect a measurement artifact and/or may
not generalize to other methods, populations, or
contexts).
The recommendations advocated in this article
are based on curating over one thousand
replications at Curate Science (as of August 2018).
These recommendations have been applied to each replication in its database, including employing our suggested language to describe the outcome of each curated replication. It is
expected, however, that these recommendations
will evolve over time as additional replications, from
an even wider set of studies, are curated and
evaluated (indeed, as of September 2018,
approximately 1,800 replications are in the queue to
be curated at Curate Science). Consequently, these
recommendations should be seen as a starting point for the research community to more accurately evaluate replication results, as we gradually learn more sophisticated approaches to interpreting them. We hope, however, that our
proposed recommendations will be a stepping stone
in this direction and consequently accelerate psychology's path toward becoming a more cumulative and valid science.
Appendix
Table 1. Known published replication results that fall under the distinct hypothetical replication outcomes depicted in Figure 1 (when available). Effect sizes are shown as point estimate ± 95% CI half-width.

Effect | Original study | Original ES | Replication ES | Replication study | Outcome

Signal detected in original study
self-handicapping effect | McCrea (2008) Study 5 | r = .34 ± .35 | r = .29 ± .24 | Chartier (2015, RPP #31) | signal - consistent
prescribed optimism effect | Armor, Massey et al. (2008) Study 1 | r = .68 ± .10 | r = .76 ± .06 | Veer et al. (2015, RPP #36) | signal - inconsistent, larger
selective exposure information quantity effect | Fischer, Schulz-Hardt et al. (2008) Study 4 | r = .50 ± .21 | r = .22 ± .16 | Ratliff (2015, RPP #26) | signal - inconsistent, smaller
Macbeth effect | Zhong & Liljenquist (2006) Study 2 | r = .45 ± .31 | r = -.11 ± .11 | Earp et al. (2014) Study 3 | signal - inconsistent, opposite direction
… effect | Bargh et al. (1996) Study 2a | d = 1.02 ± .76 | d = .53 ± .63 | Hull et al. (2002) Study 1b | no signal - consistent
attachment warm food effect | Vess (2012) Study 1 | d = .60 ± .55 | d = .03 ± .27 | LeBel & Campbell (2013) Study 1 | no signal - inconsistent
money priming effect | Caruso et al. (2013) Study 2 | d = .43 ± .30 | d = -.09 ± .39 | Schuler & Wänke (2016) Study 2 | no signal - inconsistent (less precise)

No signal detected in original study
earning prospect predicts romantic interest effect | Eastwick & Finkel (2008) Study 1 | r = .14 ± .16 | r = .03 ± .11 | Selterman, Chagnon et al. (2015, RPP #29) | no signal - consistent
Hebb repetition effect revisited | Couture, Lafond, & Tremblay (2008) Study 1 | r = .35 ± .38 | r = .27 ± .24 | Roebke & Penna (2015, RPP #76) | signal - consistent
generalization occurs immediately effect | Ranganath & Nosek (2008) Study 1 | r = .00 ± .08 | r = .11 ± .04 | Cohn (2015, RPP #45) | signal - inconsistent, larger
References
Armor, D. A., Massey, C., & Sackett, A. M. (2008).
Prescribed optimism: Is it right to be wrong
about the future? Psychological Science, 19,
329-331. doi:10.1111/j.1467-9280.2008.02089.x
Bargh, J. A., Chen, M., & Burrows, L. (1996).
Automaticity of social behavior: Direct effects
of trait construct and stereotype activation on
action. Journal of Personality and Social
Psychology, 71(2), 230-244. doi:10.1037/0022-
3514.71.2.230
Chartier, C. R., & Perna, O. (2015). Replication of
“Self-handicapping, excuse making, and
counterfactual thinking: Consequences for
self-esteem and future motivation.” by SM
McCrea (2008, Journal of Personality and
Social Psychology). Retrieved from
https://osf.io/ytxgr/ (Reproducibility Project:
Psychology Study #31)
Cohn, M. A. (2015). Replication of “Implicit Attitude
Generalization Occurs Immediately; Explicit
Attitude Generalization Takes Time”
(Ranganath & Nosek, 2008). Retrieved from:
https://osf.io/9xt25/ (Reproducibility Project:
Psychology Study #45)
Caruso, E. M., Vohs, K. D., Baxter, B., & Waytz, A.
(2013). Mere exposure to money increases
endorsement of free-market systems and
social inequality. Journal of Experimental
Psychology: General, 142, 301-306.
doi:10.1037/a0029288
Couture, M., Lafond, D., & Tremblay, S. (2008).
Learning correct responses and errors in the
Hebb repetition effect: Two faces of the same
coin. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 34, 524-532.
doi:10.1037/0278-7393.34.3.524
Earp, B. D., Everett, J. A. C., Madva, E. N., & Hamlin,
J. K. (2014). Out, damned spot: Can the
"Macbeth effect" be replicated? Basic and
Applied Social Psychology, 36, 91-98.
doi:10.1080/01973533.2013.856792
Eastwick, P. W., & Finkel, E. J. (2008). Sex
differences in mate preferences revisited: Do
people know what they initially desire in a
romantic partner? Journal of Personality and
Social Psychology, 94, 245-264.
doi:10.1037/0022-3514.94.2.245
Fischer, P., Schulz-Hardt, S., & Frey, D. (2008).
Selective exposure and information quantity:
How different information quantities moderate
decision makers' preference for consistent and
inconsistent information. Journal of Personality
and Social Psychology, 94, 231-244.
doi:10.1037/0022-3514.94.2.231
Hull, J., Slone, L., Meteyer, K., & Matthews, A.
(2002). The nonconsciousness of self-
consciousness. Journal of Personality and Social
Psychology, 83, 406-424. doi:10.1037//0022-
3514.83.2.406
Kidwell, M., Lazarevic, L., Baranski, E., Hardwicke,
T., Piechowski, S., Falkenberg, L., . . . Nosek, B.
(2016). Badges to acknowledge open practices:
A simple, low-cost, effective method for
increasing transparency. Plos Biology, 14,
e1002456. doi:10.1371/journal.pbio.1002456
LeBel, E. P., & Campbell, L. (2013). Heightened
sensitivity to temperature cues in individuals
with high anxious attachment: Real or elusive
phenomenon? Psychological Science, 24, 2128-
2130. doi:10.1177/0956797613486983
LeBel, E., Berger, D., Campbell, L., & Loving, T.
(2017). Falsifiability is not optional. Journal of
Personality and Social Psychology, 113, 696-696.
doi:10.1037/pspi0000117
LeBel, E. P., McCarthy, R., Earp, B., Elson, M. &
Vanpaemel, W. (2018). A Unified Framework to
Quantify the Credibility of Scientific Findings.
Advances in Methods and Practices in
Psychological Science, 1(3), 389-402.
Mathur, M. B., & VanderWeele, T. J. (2018, May 7).
New statistical metrics for multisite replication
projects [Preprint].
https://doi.org/10.31219/osf.io/w89s5
McCrea, S. M. (2008). Self-handicapping, excuse
making, and counterfactual thinking:
Consequences for self-esteem and future
motivation. Journal of Personality and Social
Psychology, 95, 274-292.
http://dx.doi.org/10.1037/0022-3514.95.2.274
Meehl, P. E. (1990). Why summaries of research on
psychological theories are often
uninterpretable. Psychological Reports, 66, 195-
244. doi:10.2466/PRO.66.1.195-244
Pashler, H., & Harris, C. R. (2012). Is the replicability
crisis overblown? Three arguments examined.
Perspectives on Psychological Science, 7, 531-536. doi:10.1177/1745691612463401
Ranganath, K. A., & Nosek, B. A. (2008). Implicit
attitude generalization occurs immediately;
explicit attitude generalization takes time.
Psychological Science, 19, 249-254.
doi:10.1111/j.1467-9280.2008.02076.x
A BRIEF GUIDE TO EVALUATE REPLICATIONS
Ratliff, K. A. (2015). Replication of Fischer, Schulz-
Hardt, and Frey (2008). Retrieved from:
https://osf.io/5afur/ (Reproducibility Project:
Psychology Study #26)
Roebke, M., & Penna, N. D. (2015). Replication of
“Learning correct responses and errors in the
Hebb repetition effect: two faces of the same
coin” by M Couture, D Lafond, S Tremblay
(2008, Journal of Experimental Psychology:
Learning, Memory, and Cognition). Retrieved
from: https://osf.io/qm5n6/ (Reproducibility
Project: Psychology Study #76)
Rosenthal, R. (1991). Applied Social Research
Methods: Meta-analytic procedures for social
research. Thousand Oaks, CA: SAGE. doi:
10.4135/9781412984997
Schuler, J., & Wänke, M. (2016). A fresh look on
money priming: Feeling privileged or not
makes a difference. Social Psychological and
Personality Science, 7, 366-373.
doi:10.1177/1948550616628608
Schulz, K. F., Altman, D. G., & Moher, D., for the
CONSORT Group. (2010).
CONSORT 2010 statement: Updated guidelines
for reporting parallel group randomised trials.
BMJ, 340, 698-702. doi:10.1136/bmj.c332
Selterman, D. F., Chagnon, E., & Mackinnon, S.
(2015). Replication of: Sex Differences in Mate
Preferences Revisited: Do People Know What
They Initially Desire in a Romantic Partner? by
Paul Eastwick & Eli Finkel (2008, Journal of
Personality and Social Psychology). Retrieved
from: https://osf.io/5pjsn/ (Reproducibility
Project: Psychology Study #29)
Simonsohn, U. (2015). Small telescopes:
Detectability and the evaluation of replication
results. Psychological Science, 26, 559-569.
http://dx.doi.org/10.1177/0956797614567341
Steegen, S., Tuerlinckx, F., Gelman, A., &
Vanpaemel, W. (2016). Increasing transparency
through a multiverse analysis. Perspectives on
Psychological Science, 11, 702-712.
doi:10.1177/1745691616658637
van 't Veer, A., Lassetter, B., Brandt, M. J., & Mehta, P.
H. (2015). Replication of "Prescribed optimism: Is
it right to be wrong about the future?" by David A.
Armor, Cade Massey & Aaron M. Sackett (2008,
Psychological Science). Retrieved from:
https://osf.io/8u5v2/ (Reproducibility
Project: Psychology Study #36)
Vess, M. (2012). Warm thoughts: Attachment
anxiety and sensitivity to temperature cues.
Psychological Science, 23, 472-474.
doi:10.1177/0956797611435919
Zhong, C., & Liljenquist, K. (2006). Washing away
your sins: Threatened morality and physical
cleansing. Science, 313, 1451-1452.
doi:10.1126/science.1130726
Zwaan, R., Etz, A., Lucas, R., & Donnellan, M. (2017).
Making replication mainstream. Behavioral and
Brain Sciences, 1-50.
doi:10.1017/S0140525X17001972