What did the OSC replication initiative reveal
about the crisis in psychology?
Brian D. Earp1
An open review of the draft paper entitled “Replication initiatives will not
salvage the trustworthiness of psychology” by James C. Coyne
Submitted to BMC Psychology Editorial Office: 8 February, 2016. Note: minor, primarily
stylistic issues in the current document have been improved since the official submitted version
that is available at http://www.biomedcentral.com/2050-7283/4/28/prepub. Please cite--and
refer to--the current document.
Abstract & introduction
In his draft paper, James C. Coyne argues that replication initiatives will not salvage the
trustworthiness of psychology, due to various limitations inherent in almost any (actually
feasible) replication program; instead, he believes that the bulk of attention should be directed at
Questionable Research Practices (QRPs), editorial and publication biases, sloppy statistical
reasoning, perverse incentives in the reward structure of science, and so on. This is as opposed to
so-called “direct” replications in particular, especially when it comes to certain sub-fields of the
discipline in which such replications would be hard to conduct, much less interpret. I agree with
many of Professor Coyne’s points, and have made similar arguments elsewhere. However, I
think that his discussion of the apparent implications of the now-famous Open Science
Collaboration (OSC) paper published in Science (purporting to show that more than half of a
sample of 100 psychology studies “failed to replicate” when carried out by independent labs) is
flawed in a number of ways. I argue, among other things, that the informational value of the OSC
paper is much lower than many people seem to think.
Key words: replication, estimating the reproducibility of psychological science, OSC, p-values
1 University of Oxford and Visiting Scholar, The Hastings Center
This is an open peer review of a submitted paper. It may be cited as:
Earp, B. D. (2016). What did the OSC replication initiative reveal about the crisis in
psychology? An open review of the draft paper entitled, “Replication initiatives will not salvage
the trustworthiness of psychology” by James C. Coyne. BMC Psychology, 4(28), 1-19.
Available at https://www.academia.edu/21711738/Open_review_of_the_draft_paper_entitled_Replication_initiatives_will_not_salvage_the_trustworthiness_of_psychology_by_James_C._Coyne.
Theoretical/substantive comments
Page 3, line 33. Referring to the OSC paper, the author writes: “Overall results of the project
demonstrated that within this sample of studies … most positive findings proved false or
exaggerated.” I would argue that the word “demonstrated” here is far too strong. Perhaps
“suggested” would be OK, but in my view, the author is giving too much credit to the OSC paper
for definitively showing anything at all. I will explain what I mean in some detail here, because I
think that this is an important issue that has gone largely unnoticed in the academic and public
discussion(s) of the now-famous OSC publication.
Here is the problem. The OSC conducted exactly one replication attempt of each of 100 studies.
In a couple of cases that I have scrutinized personally, these single replication attempts were,
unfortunately, not particularly well designed. For example, in the reported attempt to replicate at
least one study (I won’t go into the details here as I may write this critique up separately), the
replicating scientists recruited fewer participants than were involved in the original study, thus
reducing their power to detect an effect (if one existed), based on a naïve assumption—built into
their power analysis—that the initially reported effect size was accurate. This is a naïve
assumption because we have good reason to think that initially reported effect sizes are
frequently biased high, as the OSC authors themselves acknowledge in their paper. As they state:
“One qualification about [our] result is the possibility that the original studies have inflated effect
sizes due to publication, selection, reporting, or other biases. In a discipline with low-powered
research designs and an emphasis on positive results for publication, effect sizes will be
systematically overestimated in the published literature” (2015, p. aac4716-5; see also my
discussion of this point in Earp, Everett, Madva, & Hamlin, 2014; and see Button et al., 2013).
Therefore, all else being equal, it is typically better to recruit more participants for a replication
study than were involved in the original experiment, as opposed to fewer, if the idea is to have
adequate power (Earp et al., 2014).
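To make the power point concrete, here is a minimal sketch (my own illustration, with made-up numbers; none are taken from any actual OSC study) of how a replication planned around an inflated published effect size ends up underpowered for the true effect:

```python
# Sketch: why powering a replication on the originally reported effect size
# is naive when published effects are inflated. All numbers are hypothetical.
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample test."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

def achieved_power(d, n, alpha=0.05):
    """Approximate power of a two-sample test with n per group at true effect d."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(d * sqrt(n / 2) - z_a)

reported_d, true_d = 0.50, 0.35  # hypothetical inflated vs. true effect size
n = n_per_group(reported_d)      # replication planned for 80% power
print(n)                                    # 63 per group
print(round(achieved_power(true_d, n), 2))  # ~0.50: a coin flip, not 80%
```

In this illustrative setup, doubling the planned sample would restore roughly the intended 80% power, which is the sense in which recruiting more participants than the original study, rather than fewer, is the safer default (Earp et al., 2014).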
That is just one example of an apparently poorly designed replication study (how many others
have similar flaws I have not yet determined). The fact that this study got different apparent
results from the initial study, therefore, tells us almost nothing at all about the validity of the
original findings. But let us just assume that the other 99 replications were perfectly designed
and flawlessly conducted. Nevertheless, we still cannot draw any definitive conclusions about
what these replication efforts entail with respect to the validity of the original reported findings.
To see why this is the case, imagine the following. Suppose that we take just one of the original
studies from the OSC project, and we try to replicate it—not once, but 100 times. And assume
that we manage to do this perfectly, under ideal conditions. What we would end up with, if all
goes according to plan, is a distribution of p-values, as well as a distribution of effect size
estimates, both of which should be at least roughly centered around whatever the “true” values
for those parameters are. In real life, however, we don’t have the full distribution. What we have
instead, in most cases, is just a single reported p-value, and a single reported effect size estimate
(i.e., from the initial published study). How do we know where, on the idealized distribution
from our thought experiment, these values are likely to be coming from? We don’t know for sure,
but we can guess that they are coming from the higher end. In part, this is because of the well-known publication bias in favor of “significant” effects, and especially “impressive-looking” ones: hence, as just explained—and as the author of the present submission himself appears to appreciate (see, e.g., page 6, lines 40-45)—the mere fact that the finding has actually been published suggests that our estimate is probably inflated.

[Footnote 2: Please note that I will be referring to a number of papers of mine throughout the rest of this review. Since this is an open review, I will make no pretense of having failed to develop a particular perspective on this debate, and I hope that the author (and other readers) can forgive me for citing so much of my own work. It saves me the trouble of having to re-articulate everything here.]
But this doesn’t mean that there is no underlying evidence for the original effect (i.e., it doesn’t
mean that the effect itself has not been replicated, in the sense of having been shown not actually
to exist), nor that the “true” p-value, even if it is different from what was originally reported,
would be guaranteed to be non-significant (this is setting aside, for the sake of this review,
longstanding and rather heated debates about what we can actually infer from “significant” p-
values—and from the null hypothesis significance testing procedure generally; see, e.g.,
Trafimow & Rice, 2009). Indeed, based on running an original study just one more time, as was
the case with the OSC project, it would actually be reasonable for us to expect that the next p-
value we generate will in fact be larger than the one that was originally reported, and the next
effect size, smaller, based on the “winner’s curse” phenomenon (see, e.g., Button et al., 2013)
and the principle of regression toward the mean.
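This expectation is easy to verify by simulation. The following toy sketch (my own, with hypothetical numbers, using a deliberately simplified one-sample z-test rather than a model of any actual OSC study) shows that even when the underlying effect is perfectly real, a single replication of a “significant” published result will usually yield a larger p-value and a smaller effect size estimate:

```python
# Toy simulation (mine, not the OSC's): even for a REAL effect, a single
# replication of a published (i.e., 'significant') result will usually show
# a larger p-value and a smaller effect size estimate, simply because of
# selection for significance plus regression toward the mean.
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
Z = NormalDist()
true_d, n = 0.3, 30  # a modest but real effect, studied with low power

def one_study():
    """Return (effect size estimate, two-sided p) from a one-sample z-test."""
    d_hat = random.gauss(true_d, 1 / sqrt(n))  # sampling error in the estimate
    z = d_hat * sqrt(n)
    return d_hat, 2 * (1 - Z.cdf(abs(z)))

p_went_up = d_went_down = published = 0
while published < 5000:
    d1, p1 = one_study()
    if p1 >= 0.05:
        continue  # publication filter: only 'significant' originals get in
    published += 1
    d2, p2 = one_study()  # one direct replication of the published study
    p_went_up += p2 > p1
    d_went_down += d2 < d1

print(p_went_up / published)    # well above 0.5: p-values usually rise
print(d_went_down / published)  # well above 0.5: estimates usually shrink
```

Note that in this toy world the effect genuinely exists in every single case, yet “larger p, smaller estimate” is still the typical outcome of a one-off replication.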
Thus, getting a smaller effect size estimate the second time around is only a “failure to replicate”
in the trivial sense of failing to replicate the exact same effect size estimate. But we shouldn’t
expect to get a “replication” in that sense anyway! As far as I can tell, what people are really
interested in knowing is: “Is there an effect here, and is it a meaningful one?” The answer to the
first question, it should be clear, is not “no” just because the size of the effect may be smaller
than what was originally estimated. And the answer to the second question depends upon, among
other things, what precise effect size (or effect size range) is theoretically or practically
meaningful—something that many researchers in psychology and other disciplines unfortunately
fail to specify—as well as what the actual effect size is. Alas, however, a single replication study
can’t tell us either of those things. Instead, to have any real degree of confidence in this regard,
we need to run many replications of the experiment. Then, over time—assuming that the
replications were of sufficiently high quality as well as adequately powered, etc.—we would be
in an increasingly better position to make a rational judgment about whether the original reported
finding was “real” (or just statistical noise) and, if it was/is real, what its likely effect size is.
David Trafimow and I make this argument in Earp and Trafimow (2015).
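The contrast can be sketched in a few lines (again with made-up numbers, not OSC data): a single replication’s effect size estimate can easily land far from the true value, while pooling many replications homes in on it.

```python
# Sketch: one replication gives a noisy effect size estimate; pooling many
# replications converges on the true value. Hypothetical numbers throughout.
import random
from math import sqrt
from statistics import mean

random.seed(2)
true_d, n = 0.3, 30
se = 1 / sqrt(n)  # standard error of each study's effect size estimate

estimates = [random.gauss(true_d, se) for _ in range(100)]  # 100 replications
single = estimates[0]     # all that a one-off replication attempt can offer
pooled = mean(estimates)  # what the full distribution of replications offers

print(round(se, 3))                    # a single estimate can miss by ~0.18 (1 SE)
print(round(abs(pooled - true_d), 3))  # the pooled estimate sits very close
```

In practice the studies would be combined meta-analytically with appropriate weights; the simple mean here is just the plainest illustration of why accumulation, rather than any single study, is what settles the question.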
Now, if I may return to my point earlier about “underlying” evidence. Let us just say that we do
get a smaller effect size estimate the second time we run a study. Does that mean that the
evidence in favor of the existence of the original reported finding or phenomenon has so much as
“gone down?” We don’t actually know. In part, this is because p-values and effect size estimates
are not direct measures of evidence, as Veronica Vieland has argued (see generally: Vieland,
2001; Vieland, 2006; Vieland & Hodge, 2011). But even if we had a good measure of evidence,
talk of “failed replication” on the basis of a single follow-up study (with apparently different
results from the original) could still be hard to justify. As Professor Vieland recently explained to
me in an email (personal correspondence, January 29th, 2016):
(1) Observing weak evidence in favor of something after having already seen strong
evidence in favor of that thing does not weaken the evidence; the evidence may not
change by much but it certainly doesn’t go down. We can summarize this by saying that
evidence itself accumulates rather than averaging, and in the absence of a way to properly
measure and rigorously accumulate evidence, we can’t tell from the OSC paper whether
this is what’s happening (in some cases) or whether the evidence really is going down.
(2) Moreover, in the context of “winner’s curse” types of set-ups, e.g., when published
results are likely to favor smaller p-values and larger effect size estimates, regression to
the mean is going to tend to send both in the “wrong” direction (bigger for the p-value
and smaller for the estimate) regardless of whether the effect is real or not. Thus, in
this type of setting, tracking the p-value and/or estimate simply cannot be used to infer
whether the evidence has gone up or down, let alone whether the originally reported
effect is real or not.
In short, a single replication of a study (or of 100 separate studies)—as was attempted in the case
of the OSC project—actually provides us with a lot less information than many people seem to
think about the validity of the original finding (or the sheer existence of the effect in question), at
least in terms of a case-by-case analysis. Indeed, just as a single report of a finding in the primary
literature should not convince us that an effect is “real,” neither should a single replication
attempt in the follow-up literature convince us that the effect is “false” or “exaggerated” (see,
again, Earp & Trafimow, 2015, as well as Earp & Everett, 2015, for more on these issues).
All that being said, I should acknowledge that a colleague of mine, in personal correspondence
(January 28th, 2016), has pushed back against the general thrust of my argument. He writes: “An
argument about why the OSC article might be a little better in terms of informational value than
you’ve characterized might be made as follows. First, I agree with you completely that a single
replication that apparently ‘fails’ does not convincingly show that there was a flaw (much less
malfeasance) in the conduct of the original experiment, nor even that the originally reported
results are wrong or inaccurate. But I think a different conclusion might be a bit more difficult to
dispense with. Imagine 100 studies and a single replication of each, as in the OSC article. If we
imagine that each study, whether original or replication, is randomly selected from a distribution
of hypothetical replications, we would expect the average effect size for the set of original
studies to approximately equal the average effect size of the set of replication studies. That is,
some replication studies should result in stronger effect sizes than their corresponding original
ones, as well as the reverse. However, the average effect size in the OSC paper was considerably
less in the replication studies. This violation from what we might expect can be considered to
indicate a problem in psychology publications, one of which is publication bias against ‘negative’
findings as you’ve already stated. Also, as an argument against what I just said, it should be
pointed out that the probability of getting what the OSC got, given improper practices, is not the
same thing as the probability of improper practices given what the OSC got! Therefore, again, it
really is difficult to know what to conclude from the article.”
A second colleague provided a similar analysis (personal correspondence, February 7th, 2016):
“You are absolutely right about the limited conclusions that should be drawn from the OSC
replication effort, in relation to any one original study. But we need to make a clear distinction
between conclusions about any individual study, and the whole set. The fact that the replication
effect sizes were on average approximately half the original effect sizes is a strong result, and
strong evidence that something amiss is going on—most likely, as you say, selective publication
and/or QRPs in the original set of studies [note that this possibility is discussed at length by the
OSC authors]. But we have no idea which of the 100 original studies, and very little idea, even
roughly, of what proportion of the original studies were in fact biased. In short, the whole OSC
paper does give strong evidence of some bias in the original set, notwithstanding that the precise
extent and nature of the bias cannot be determined.”
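Their aggregate argument can be illustrated with a toy model (mine, with hypothetical numbers; the OSC’s own analyses are of course more sophisticated): without selection, originals and replications agree on average, but filtering originals for “significance” drives the replication average well below the original average.

```python
# Toy model of the aggregate argument (hypothetical numbers, not OSC data):
# with no selection, originals and replications agree on average; with a
# 'significance' filter on the originals, replications average far lower.
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(3)
Z = NormalDist()
n = 30
se = 1 / sqrt(n)

def estimate(true_d):
    """One study's effect size estimate, with sampling error."""
    return random.gauss(true_d, se)

# A literature mixing null effects with modest real ones.
pairs = []
for _ in range(20000):
    true_d = random.choice([0.0, 0.3])
    pairs.append((estimate(true_d), estimate(true_d)))  # (original, replication)

def is_published(o):
    """Positive and 'significant' at p < .05, two-sided."""
    return o > 0 and 2 * (1 - Z.cdf(abs(o) * sqrt(n))) < 0.05

pub = [(o, r) for o, r in pairs if is_published(o)]

all_o, all_r = mean(o for o, _ in pairs), mean(r for _, r in pairs)
pub_o, pub_r = mean(o for o, _ in pub), mean(r for _, r in pub)

print(round(all_o, 2), round(all_r, 2))  # unselected: roughly equal
print(round(pub_o, 2), round(pub_r, 2))  # selected: original mean much higher
```

Note that in this toy world the aggregate gap appears even though every “published” original is an honest estimate, which fits the second colleague’s point: the gap is strong evidence of bias in the set, without identifying which individual studies are at fault.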
Fair enough. Now, I do agree with the author of the present article that, whatever we can infer
(or not) from replication studies or replication initiatives, we do have many at least partially
independent reasons to crack down on questionable research practices, sloppy experiments, bad
statistics, unjustified inferences, extrapolations of data beyond the environments in which they
were collected, publication bias, the file drawer problem, crony peer review, faulty editorial
practices, and so on. In other words, whether or not some kind of systematic replication initiative
is even (a) feasible, or (b) helpful (assuming that it could be accomplished), we should still be
very concerned about the amount of poor-quality research being published, and we should try to
reform the reward structure of professional science to encourage fewer publications (of higher
quality), as opposed to the avalanche of largely uninterpretable research (e.g., Meehl, 1990)
currently being published, including in “prestigious” journals (see Earp, in press). If we were
successful in that regard, we would have a lot less need for large scale replication initiatives in
the first place. So in this respect I think there is a great deal of concordance between the author
and myself.
Page 4, line 8. The author writes that tests of excess significance “may themselves be biased.”
Again, there are many different types of bias, so just saying “biased” doesn’t mean very much.
What kind of bias? Biased how?
Page 4, line 25. I am sympathetic with the author’s feelings about the defensive and even
obstructionist reactions of some researchers to this new “replication” era, but the rhetoric here
(“obviously,” “with impunity,” etc.) seems too hot—more suited to a blog post than a formal
paper. Similarly, “hostile” in the following sentence seems unproductively harsh; and in any
event, no citations or references are given actually illustrating such behavior (though it
undoubtedly exists). This is such a contentious, politicized area, that I think more tempered
language, and taking the “high ground” in terms of making step-by-step arguments (rather than
“calling out” other researchers for supposedly “obviously” believing they can get away with
murder) would be more productive in the long run.
Page 4, line 48. This issue of a disproportionate focus on social psychology studies in replication
attempts is discussed in Earp and Everett (2015).
Page 5, line 11. Really excellent point about how hard it would be to recruit subjects from
sensitive populations (e.g., babies) for “mere” replication studies.
Page 5, line 32. The issue of replications being low-status, and especially costly to carry out if
one is an early-career researcher is discussed at length in Everett and Earp (2015).
Page 10, line 27. Problems with peer review have been well-documented and well-discussed by
many authors, in particular Richard Smith (e.g., 2006). Some key citations would be appropriate
here to support the author’s claim that peer review is “undependable.” In case it’s helpful, I have
recently collected some relevant citations in Earp (in press).
Page 11, line 22. This discussion of direct vs. conceptual replication is a little bit facile, in my
opinion. There is actually a theoretical problem with not having direct replications (depending
upon your aims), because deliberate changes in study design could violate auxiliary assumptions
that are still implicit (i.e., not yet identified), such that “indirect” replications might not tell us
very much about the reliability of the original finding. For more on the philosophy of direct vs.
conceptual replication and making generalizable statements over variable conditions, see Earp
and Trafimow (2015) as well as Trafimow and Earp (in press).
That said, I do agree that fetishizing direct replication over more conceptual replication is
problematic, as the author (in my view) discusses very nicely in this passage. Nevertheless, it
seems to me that more needs to be said about when and under what conditions direct vs. more
indirect replications are useful (and toward what particular ends). If an original researcher finds a
very narrowly-achievable effect (using the bug-killing paradigm, for example, as the author
discusses), and then generalizes this to a range of cases that haven’t actually been tested, that is
clearly a problem; but what kind of replication effort is appropriate depends on the question.
Specifically, if you want to “check” whether the original finding, as captured by some narrow
outcome variable, is actually a good/reliable effect, then a direct replication is needed. By
contrast, if you want to see whether the paradigm extends to a new case, you should probably
still do a direct/exact replication to make sure the original finding can be repeated, but then—
yes—it is likely to be worthwhile to try to systematically change components of the study design
to see whether the effect holds over a range of conditions (see Earp & Trafimow, 2015).
The issue is when you do indirect/conceptual replications without also doing direct replications
(see Earp & Trafimow, 2015). Specifically, when such indirect replication attempts apparently
fail, the replicating researcher is likely to say: “Well, it’s probably just my fault because I
changed certain things about the experiment and therefore didn’t do an exact replication” (see,
e.g., Earp et al., 2014). Accordingly, this researcher might simply abandon the problem and put
the replication data in her proverbial file drawer (due, again, to the perverse incentives created by
the existence of publication bias in favor of “significant” effects; e.g., Greenwald, 1975). If she
had run a direct replication first, however, being careful not to change anything that would be at
least plausibly theoretically relevant—and she still didn’t find evidence of the effect—then she
would be more likely to wonder whether the original effect was illusory after all, and would thus
be more likely to look into the matter further. I discuss this issue at some length in the
concluding section of Earp et al. (2014), re: the “Macbeth” effect, relying largely on Harris et al.
(2013).
Page 12 line 14. Again, extensive discussion of the theoretical issues involved in testing the
generalizability of claims across variations can be found in Earp and Trafimow (2015). I think
this section would be improved by engaging more specifically with some of the ideas presented
there. But it is up to the author.
Page 12, lines 53-53. The author writes that the results in question were “demonstrated [to be]
untrustworthy.” Again, for the reasons discussed above, there is a sense in which this claim is too
strong. Depending upon how it is interpreted, it does not necessarily follow from what the OSC
paper found.
Page 13, line 28. Again, “not replicated” is a strong term. What does the author mean by
“replicate” – to achieve the exact same p-value and/or effect size estimate? To achieve values
within a certain range? What range? Unless the author is clear about what counts as a “successful”
replication, we have no way of knowing whether his notion of “not replicated” is meaningful. To
repeat: getting a different p-value or effect size estimate (especially in a single, one-off
replication study) is not sufficient to show even that the underlying evidence for the original
reported effect has “gone down.” If the p-value goes from .01 to .90, that’s probably a red flag
and we definitely want to ask what’s going on there. If the effect size goes from huge to zero,
same thing. But unless we articulate what we would expect to be reasonable for a “successful”
replication (and it would be a stretch to think that this should be “the exact same p-value or smaller,
or the exact same effect size or bigger”), then we need to be much more careful about saying “not
replicated.” I should just note that the OSC authors do a pretty good job of discussing different
ways of interpreting “replication,” and they do test a handful of plausible contenders.

[Footnote 3: Though note that a colleague writes (personal correspondence, February 7th, 2016), “I find p-values totally unhelpful in this whole discussion. You are correct that they are very likely to vary, even with perfect replications. … Even the example you give of the p-value shifting from .01 to .90 may hardly be strong evidence that the two studies were estimating different population effect sizes.”]
Page 13, line 39. There has been a lot of work critiquing impact factors (e.g., Seglen, 1998),
arguing that they are easily game-able, are not necessarily good indicators of the quality of the
research appearing in a given journal, etc. Some of this work should arguably be cited here, as it
is consistent with the author’s apparent viewpoint. That said, I worry that this passage could be
read as suggesting that “high impact factor” journals should accept replication studies because
this would lend more “status” to replication studies, when in fact—as the author himself most
likely believes—this link between perceived status and having a high impact factor is part of the
overall problem. In other words, there is a risk that the argument in this passage will simply
reinforce the notion that having a “high impact factor” is a worthwhile marker of status, and that
researchers should care about whether their work (whether a replication study or original)
appears in a so-called “vanity” journal.
Page 19, generally. In the abstract it seemed that a discussion of how social media can serve as a
form of post-publication peer review to identify poor quality studies was going to take place, and
yet here, in the concluding paragraph, the idea gets a very cursory mention. Could this be filled
out and perhaps illustrated with examples? Maybe give concrete recommendations for how this
tool could be used more effectively?
Minor comments re: typos, grammatical errors, etc.
Page 3, line 28. The author refers to the “extraordinary multi-institution effort” organized by the
Open Science Collaboration: Psychology (OSC), but qualifies this with the phrase “even if
biased.” But biased in what particular way? The author could certainly say: “even if biased (for
reasons to be discussed)” or something like that; but simply slotting in the word “biased” here
with no further explanation of what kind of bias is actually meant makes this sentence less
informative than it could be.
Page 3, line 18. If, as the author claims, replication initiatives are widely seen this way (i.e., as
“the great hope for salvaging the trustworthiness of psychology”), then perhaps a few key
citations of people/organizations that have actually advanced such a perspective would be in
order.
Page 2, header. The header reads: “replication machines will not salvage.” This is a fragment and
doesn’t make grammatical sense; also “machines” is not used like this elsewhere throughout the
paper. A better header is needed.
Page 2, abstract. The abstract is excellent, and draws me right into the paper. However, it
highlights the role of social media to expose faulty research practices, etc., whereas this topic is
given only a sentence or two in the actual paper. That should be reconciled one way or the other
(see above).
Page 3, line 23. The phrase “accomplished in selecting now” makes no sense as far as I can see.
Page 3, line 44. The phrase “this unreliability of the frustrating” is nonsensical; I think “has been
frustrating” is meant. Then it says, “There been various” but it should be: “There have been
various.” Then later in the sentence: “overall excess statistically” should be “overall excess of
statistically.” I can’t promise to flag all such grammatical errors and typos as I go through
(because that should really be done by a copy editor), but I must say that I am surprised by the
sheer number of such writing errors in this manuscript.
Page 3, line 59. Could page numbers be given for quotes?
Page 4, line 16/17. The phrase “or smaller” should probably be in parentheses in order to help
make reading this sentence easier. And, again, I wish to point out that simply getting a smaller
effect size the second time around doesn’t tell us very much at all. Multiple replications are
needed.
Page 4, line 33. The word “suspect” has several different meanings, and those differences make
it hard to parse this. I think “be suspicious of” would be more clear/ less ambiguous.
Page 5, line 35. The phrase “contribution and is” is ungrammatical. I think the “and” needs to be
deleted.
Page 6, line 16. The period is missing at the end of the sentence after “extreme outlier.”
Page 7, line 9. The phrase “summoned to support of” should be “in support of.”
Page 7, line 32. What does “correlational observational are” mean? I believe this is a
grammatical error.
Page 8, line 6. The closing quotation mark is missing after “factor” and a period is missing at the
end of the sentence after “big mush.”
Page 8, line 35. The underline under “not” is too long.
Page 8, line 50. What is “OPS” … is this OSC? Maybe I just missed something.
Page 10, line 14. The “of” after “replication” should be deleted.
Page 11, line 22. The word “seeming” should be “seemingly” and then the next line down
“procedures can be considerable” should be “make a considerable.”
Page 11, line 56. The phrase “based on studies undergraduates” should be “on studies of
undergraduates.”
Page 12, line 6. The phrase “perspective that psychology” should be “perspective that a
psychology.
Page 12, line 32. There are two periods at the end of the sentence.
Page 13, line 18. The word “characterizes” should I think be “characterizing.
Page 13, line 36. “The” should be added before American Psychological Association and the
APS, here and hereafter.
Page 14, line 42. There is an extra space after “Nelson.”
Page 15, line 11. What does “which the OCS sample” mean? I think “sampled” is meant.
Page 15, line 27. I think it should be “agendas” or else “a strong institutional.” Then two lines
down, if it’s “agenda” then it needs to be “remains” and then “endless and waste the effort of”
needs to be “endless and a waste of the effort of.
Page 16, line 48. I think “papers author’s” should be “paper’s author’s.”
Page 17, line 14. “CONSORT, set a” should be “CONSORT, had set a.”
Page 17, line 34/5. The phrase “physical health in the prestigious journal” should be: “physical
health in a prestigious journal.”
Page 17, line 55. The phrase “both of the journals in which the published” is ungrammatical, and
then the dashes are uneven: the first one is long; the second one short and not centered.
Page 18, line 24. There is an extra space in: [38 ].
Page 18, line 34/5. I think “has promise” should be deleted so that this makes sense.
Page 18, line 45. “Postscript” should be bolded.
Page 18, line 53. I think “incentized” should be “incentivized.
Page 20, generally. There should be quotation marks around the actual Brembs quote or else the
text should be indented, and a page number should be given if that applies.
Page 21, line 41 and Page 24, line 12. There are missing reference numbers. Where is reference
8? And 37?
Page 24, line 10. The formatting for reference 36 is completely off. There is an extra close
bracket, and it is indented.
Overall comment and recommendation
I am strongly sympathetic with many of the points raised by the author. Moreover, I think that
his perspective is valuable, and that the field generally should have the opportunity to learn from
it. Therefore, I believe that a version of this paper should be published. That said, there are a
number of substantive theoretical points that I think need to be improved upon or at least
clarified (and also, for some reason, a very large number of typos and ungrammatical sentences
that will need to be fixed before the paper could go to print).
Acknowledgements
Thank you to Professors Geoff Cumming, Veronica Vieland, Stuart Firestein, and two other
colleagues who wish to remain anonymous for providing feedback on my review and/or
discussing these ideas with me. None of what I’ve written should be taken as an expression of
their views unless I’ve specifically cited them; I also take personal responsibility for any errors
that remain in this document. Please note that a few typos and other minor cosmetic issues have
been addressed since the formal submission of this review.
References cited in this review
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò,
M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience.
Nature Reviews Neuroscience, 14(5), 365-376.
Earp, B. D. (in press). Mental shortcuts [unabridged version]. Hastings Center Report. Available
ahead of print at
https://www.researchgate.net/publication/292148550_Mental_shortcuts_unabridged.
Earp, B. D. (2015, September 12). Psychology is not in crisis? Depends on what you mean by
“crisis.” Huffington Post. Available at http://www.huffingtonpost.com/brian-earp/psychology-is-
not-in-crisis_b_8077522.html.
Earp, B. D., & Everett, J. A. C. (2015, October 26) How to fix psychology’s replication crisis.
The Chronicle of Higher Education. Available at
http://www.academia.edu/17321035/How_to_fix_psychologys_replication_crisis.
Earp, B. D., Everett, J. A. C., Madva, E. N., & Hamlin, J. K. (2014). Out, damned spot: Can the
“Macbeth Effect” be replicated? Basic and Applied Social Psychology, 36(1), 91-98.
Earp, B. D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in
social psychology. Frontiers in Psychology, 6(621), 1-11.
Everett, J. A. C., & Earp, B. D. (2015). A tragedy of the (academic) commons: Interpreting the
replication crisis in psychology as a social dilemma for early-career researchers. Frontiers in
Psychology, 6(1152), 1-4.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological
Bulletin, 82(1), 1-20.
Harris, C. R., Coburn, N., Rohrer, D., & Pashler, H. (2013). Two failures to replicate high-
performance-goal priming effects. PLoS ONE, 8, e72467. doi:10.1371/journal.pone.0072467.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often
uninterpretable. Psychological Reports, 66(1), 195-244.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science.
Science, 349(6251), aac4716-1-aac4716-8.
Seglen, P. O. (1998). Citation rates and journal impact factors are not suitable for evaluation of
research. Acta Orthopaedica Scandinavica, 69(3), 224-229.
Smith, R. (2006). Peer review: a flawed process at the heart of science and journals. Journal of
the Royal Society of Medicine, 99(4), 178-182.
Trafimow, D., & Earp, B. D. (in press). Badly specified theories are not responsible for the
replication crisis in social psychology. Theory & Psychology. Available ahead of print at
https://www.researchgate.net/publication/284625607_Badly_specified_theories_are_not_respons
ible_for_the_replication_crisis_in_social_psychology.
Trafimow, D., & Rice, S. (2009). A test of the null hypothesis significance testing procedure
correlation argument. The Journal of General Psychology, 136(3), 261-270.
Vieland, V. J. (2001). The replication requirement. Nature Genetics, 29(3), 244-245.
Vieland, V. J. (2006). Thermometers: something for statistical geneticists to think about. Human
Heredity, 61(3), 144-156.
Vieland, V. J., & Hodge, S. E. (2011). Measurement of evidence and evidence of measurement.
Statistical Applications in Genetics and Molecular Biology, 10(1), 1-11.