What did the OSC replication initiative reveal
about the crisis in psychology?
Brian D. Earp1
An open review of the draft paper entitled “Replication initiatives will not
salvage the trustworthiness of psychology” by James C. Coyne
Submitted to BMC Psychology Editorial Office: 8 February, 2016. Note: minor, primarily
stylistic issues in the current document have been improved since the official submitted version
that is available at http://www.biomedcentral.com/2050-7283/4/28/prepub. Please cite, and refer to, the current document.
Abstract & introduction
In his draft paper, James C. Coyne argues that replication initiatives will not salvage the
trustworthiness of psychology, due to various limitations inherent in almost any (actually
feasible) replication program; instead, he believes that the bulk of attention should be directed at
Questionable Research Practices (QRPs), editorial and publication biases, sloppy statistical
reasoning, perverse incentives in the reward structure of science, and so on. This is as opposed to so-called "direct" replications in particular, especially when it comes to certain sub-fields of the discipline in which such replications would be hard to conduct, much less interpret. I agree with
many of Professor Coyne’s points, and have made similar arguments elsewhere. However, I
think that his discussion of the apparent implications of the now-famous Open Science
Collaboration (OSC) paper published in Science (purporting to show that more than half of a
sample of 100 psychology studies “failed to replicate” when carried out by independent labs) is
flawed in a number of ways. I argue, among other things, that the informational value of the OSC
paper is much lower than many people seem to think.
Key words: replication, estimating the reproducibility of psychological science, OSC, p-values
1 University of Oxford and Visiting Scholar, The Hastings Center
This is an open peer review of a submitted paper. It may be cited as:
Earp, B. D. (2016). What did the OSC replication initiative reveal about the crisis in
psychology? An open review of the draft paper entitled, “Replication initiatives will not salvage
the trustworthiness of psychology” by James C. Coyne. BMC Psychology, 4(28), 1-19.
Available at
https://www.academia.edu/21711738/Open_review_of_the_draft_paper_entitled_Replication_i
nitiatives_will_not_salvage_the_trustworthiness_of_psychology_by_James_C._Coyne.
Theoretical/substantive comments
Page 3, line 33. Referring to the OSC paper, the author writes: “Overall results of the project
demonstrated that within this sample of studies … most positive findings proved false or
exaggerated.” I would argue that the word “demonstrated” here is far too strong. Perhaps
“suggested” would be OK, but in my view, the author is giving too much credit to the OSC paper
for definitively showing anything at all. I will explain what I mean in some detail here, because I
think that this is an important issue that has gone largely unnoticed in the academic and public
discussion(s) of the now-famous OSC publication.
Here is the problem. The OSC conducted exactly one replication attempt of each of 100 studies.
In a couple of cases that I have scrutinized personally, these single replication attempts were,
unfortunately, not particularly well designed. For example, in the reported attempt to replicate at
least one study (I won’t go into the details here as I may write this critique up separately), the
replicating scientists recruited fewer participants than were involved in the original study, thus
reducing their power to detect an effect (if one existed), based on a naïve assumption—built into
their power analysis—that the initially reported effect size was accurate. This is a naïve
assumption because we have good reason to think that initially reported effect sizes are
frequently biased high, as the OSC authors themselves acknowledge in their paper. As they state:
“One qualification about [our] result is the possibility that the original studies have inflated effect
sizes due to publication, selection, reporting, or other biases. In a discipline with low-powered
research designs and an emphasis on positive results for publication, effect sizes will be
systematically overestimated in the published literature” (2015, p. aac4716-5; see also my
discussion2 of this point in Earp, Everett, Madva, & Hamlin, 2014; and see Button et al., 2013).
Therefore, all else being equal, it is typically better to recruit more participants for a replication
study than were involved in the original experiment, as opposed to fewer, if the idea is to have
adequate power (Earp et al., 2014).
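To make this concrete, here is a minimal simulation sketch in Python. The numbers are purely hypothetical (they are not drawn from the OSC data or from the study in question): suppose the published effect size is d = 0.5 but the true effect is only d = 0.25, and the replication team sizes its sample to achieve 80% power against the published estimate.

```python
# Minimal sketch (hypothetical numbers, not taken from the OSC data or any
# particular study): power a replication on an inflated published effect size
# and check its actual power against a smaller "true" effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

d_published = 0.50   # assumed (inflated) effect size reported in the original paper
d_true = 0.25        # assumed "true" effect underlying the finding
n_per_group = 64     # roughly what a standard power analysis prescribes for
                     # d = 0.5, alpha = .05 (two-sided), 80% power

n_sims = 20_000
significant = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(d_true, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    significant += (p < 0.05)

print(f"Nominal power against the published d = {d_published}: 0.80")
print(f"Simulated power against the true d = {d_true}:   {significant / n_sims:.2f}")  # ~0.3
```

Under these assumptions, the replication's real power is closer to 30% than to the nominal 80%, so a "failed" replication would hardly be surprising even if the original effect were genuine, just smaller than reported.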
That is just one example of an apparently poorly designed replication study (how many others
have similar flaws I have not yet determined). The fact that this study got different apparent
results from the initial study, therefore, tells us almost nothing at all about the validity of the
original findings. But let us just assume that the other 99 replications were perfectly designed
and flawlessly conducted. Nevertheless, we still cannot draw any definitive conclusions about
what these replication efforts entail with respect to the validity of the original reported findings.
To see why this is the case, imagine the following. Suppose that we take just one of the original
studies from the OSC project, and we try to replicate it—not once, but 100 times. And assume
that we manage to do this perfectly, under ideal conditions. What we would end up with, if all
goes according to plan, is a distribution of p-values, as well as a distribution of effect size
estimates, both of which should be at least roughly centered around whatever the “true” values
for those parameters are. In real life, however, we don’t have the full distribution. What we have
instead, in most cases, is just a single reported p-value, and a single reported effect size estimate
(i.e., from the initial published study). How do we know where, on the idealized distribution
from our thought experiment, these values are likely to be coming from? We don’t know for sure,
but we can guess that they are coming from the higher end. In part, this is because of the well-
known publication bias in favor of “significant” effects, and especially “impressive-looking”
ones: hence, as just explained, and as the author of the present submission himself appears to appreciate (see, e.g., page 6, lines 40-45), the mere fact that the finding has actually been published suggests that our estimate is probably inflated.

2 Please note that I will be referring to a number of papers of mine throughout the rest of this review. Since this is an open review, I will make no pretense of having failed to develop a particular perspective on this debate, and I hope that the author (and other readers) can forgive me for citing so much of my own work. It saves me the trouble of having to re-articulate everything here.
But this doesn’t mean that there is no underlying evidence for the original effect (i.e., it doesn’t
mean that the effect itself has not been replicated, in the sense of having been shown not actually
to exist), nor that the “true” p-value, even if it is different from what was originally reported,
would be guaranteed to be non-significant (this is setting aside, for the sake of this review,
longstanding and rather heated debates about what we can actually infer from “significant” p-
values—and from the null hypothesis significance testing procedure generally; see, e.g.,
Trafimow & Rice, 2009). Indeed, based on running an original study just one more time, as was
the case with the OSC project, it would actually be reasonable for us to expect that the next p-
value we generate will in fact be larger than the one that was originally reported, and the next
effect size, smaller, based on the “winner’s curse” phenomenon (see, e.g., Button et al., 2013)
and the principle of regression toward the mean.
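A rough simulation may help to illustrate this expectation (the parameters below are hypothetical, and the code sketches the general statistical point rather than the OSC's actual procedure): if we keep only "original" studies that come out significant and then run a single same-sized direct replication of each, the replication's effect size estimate is usually smaller and its p-value usually larger, whether the underlying effect is modest or entirely absent.

```python
# Sketch of the expectation described above (hypothetical parameters, not the
# OSC procedure): condition on an "original" study coming out significant,
# then run one same-sized direct replication and compare.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30          # assumed per-group sample size of a low-powered study
n_sims = 20_000

def one_study(d_true):
    a = rng.normal(d_true, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd, p   # Cohen's d estimate and p-value

for d_true in (0.3, 0.0):   # a modest real effect, and no effect at all
    effect_shrank, p_grew = [], []
    for _ in range(n_sims):
        d1, p1 = one_study(d_true)
        if p1 < 0.05 and d1 > 0:              # "published" original: significant, positive
            d2, p2 = one_study(d_true)        # a single direct replication
            effect_shrank.append(d2 < d1)
            p_grew.append(p2 > p1)
    print(f"true d = {d_true}: replication effect smaller in {np.mean(effect_shrank):.0%} "
          f"of cases; replication p-value larger in {np.mean(p_grew):.0%}")
```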
Thus, getting a smaller effect size estimate the second time around is only a “failure to replicate”
in the trivial sense of failing to replicate the exact same effect size estimate. But we shouldn’t
expect to get a “replication” in that sense anyway! As far as I can tell, what people are really
interested in knowing is: “Is there an effect here, and is it a meaningful one?” The answer to the
first question, it should be clear, is not “no” just because the size of the effect may be smaller
than what was originally estimated. And the answer to the second question depends upon, among
other things, what precise effect size (or effect size range) is theoretically or practically
meaningful—something that many researchers in psychology and other disciplines unfortunately
fail to specify—as well as what the actual effect size is. Alas, however, a single replication study
can’t tell us either of those things. Instead, to have any real degree of confidence in this regard,
we need to run many replications of the experiment. Then, over time—assuming that the
replications were of sufficiently high quality as well as adequately powered, etc.—we would be
in an increasingly better position to make a rational judgment about whether the original reported
finding was “real” (or just statistical noise) and, if it was/is real, what its likely effect size is.
David Trafimow and I make this argument in Earp and Trafimow (2015).
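As a quick, admittedly idealized illustration of why accumulation matters (all parameters below are assumptions of mine, nothing more): pooling estimates across a growing number of equal-sized replications narrows the uncertainty around the effect size roughly in proportion to the square root of the number of studies, which is exactly what a single follow-up study cannot provide.

```python
# Illustrative sketch (assumed parameters): pooling more and more equal-sized
# replications shrinks the uncertainty around the effect size estimate, which
# no single follow-up study can do.
import numpy as np

rng = np.random.default_rng(2)
d_true, n = 0.25, 40                 # hypothetical true effect and per-group n
se_single = np.sqrt(2 / n)           # approx. standard error of one study's estimate

for k in (1, 5, 20, 50):             # number of replications pooled
    # raw mean differences; with the population SD fixed at 1 they approximate d
    estimates = [rng.normal(d_true, 1.0, n).mean() - rng.normal(0.0, 1.0, n).mean()
                 for _ in range(k)]
    pooled = np.mean(estimates)
    half_width = 1.96 * se_single / np.sqrt(k)   # equal-n fixed-effect approximation
    print(f"k = {k:2d} replications: pooled estimate = {pooled:.2f}, "
          f"95% CI half-width = {half_width:.2f}")
```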
Now, if I may return to my point earlier about “underlying” evidence. Let us just say that we do
get a smaller effect size estimate the second time we run a study. Does that mean that the
evidence in favor of the existence of the original reported finding or phenomenon has so much as
“gone down?” We don’t actually know. In part, this is because p-values and effect size estimates
are not direct measures of evidence, as Veronica Vieland has argued (see generally: Vieland,
2001; Vieland, 2006; Vieland & Hodge, 2011). But even if we had a good measure of evidence,
talk of “failed replication” on the basis of a single follow-up study (with apparently different
results from the original) could still be hard to justify. As Professor Vieland recently explained to
me in an email (personal correspondence, January 29th, 2016):
(1) Observing weak evidence in favor of something after having already seen strong
evidence in favor of that thing does not weaken the evidence; the evidence may not
change by much but it certainly doesn’t go down. We can summarize this by saying that
evidence itself accumulates rather than averaging, and in the absence of a way to properly
measure and rigorously accumulate evidence, we can’t tell from the OSC paper whether
this is what’s happening (in some cases) or whether the evidence really is going down.
(2) Moreover, in the context of “winner’s curse” types of set-ups, e.g., when published
results are likely to favor smaller p-values and larger effect size estimates, regression to
the mean is going to tend to send both in the “wrong” direction (bigger for the p-value
and smaller for the estimate) regardless of whether the effect is real or not. Thus, in
this type of setting, tracking the p-value and/or estimate simply cannot be used to infer
whether the evidence has gone up or down, let alone whether the originally reported
effect is real or not.
In short, a single replication of a study (or of 100 separate studies)—as was attempted in the case
of the OSC project—actually provides us with a lot less information than many people seem to
think about the validity of the original finding (or the sheer existence of the effect in question), at
least in terms of a case by case analysis. Indeed, just as a single report of a finding in the primary
literature should not convince us that an effect is “real,” neither should a single replication
attempt in the follow-up literature convince us that the effect is “false” or “exaggerated” (see,
again, Earp & Trafimow, 2015, as well as Earp & Everett, 2015, for more on these issues).
All that being said, I should acknowledge that a colleague of mine, in personal correspondence
(January 28th, 2016), has pushed back against the general thrust of my argument. He writes: “An
argument about why the OSC article might be a little better in terms of informational value than
you’ve characterized might be made as follows. First, I agree with you completely that a single
replication that apparently ‘fails’ does not convincingly show that there was a flaw (much less
malfeasance) in the conduct of the original experiment, nor even that the originally reported
results are wrong or inaccurate. But I think a different conclusion might be a bit more difficult to
dispense with. Imagine 100 studies and a single replication of each, as in the OSC article. If we
imagine that each study, whether original or replication, is randomly selected from a distribution
of hypothetical replications, we would expect the average effect size for the set of original
studies to approximately equal the average effect size of the set of replication studies. That is,
some replication studies should result in stronger effect sizes than their corresponding original
ones, as well as the reverse. However, the average effect size in the OSC paper was considerably
less in the replication studies. This violation from what we might expect can be considered to
indicate a problem in psychology publications, one of which is publication bias against ‘negative’
findings as you’ve already stated. Also, as an argument against what I just said, it should be
pointed out that the probability of getting what the OSC got, given improper practices, is not the
same thing as the probability of improper practices given what the OSC got! Therefore, again, it
really is difficult to know what to conclude from the article.”
A second colleague provided a similar analysis (personal correspondence, February 7th, 2016):
“You are absolutely right about the limited conclusions that should be drawn from the OSC
replication effort, in relation to any one original study. But we need to make a clear distinction
between conclusions about any individual study, and the whole set. The fact that the replication
effect sizes were on average approximately half the original effect sizes is a strong result, and
strong evidence that something amiss is going on—most likely, as you say, selective publication
and/or QRPs in the original set of studies [note that this possibility is discussed at length by the
OSC authors]. But we have no idea which of the 100 original studies, and very little idea, even
roughly, of what proportion of the original studies were in fact biased. In short, the whole OSC
paper does give strong evidence of some bias in the original set, notwithstanding that the precise
extent and nature of the bias cannot be determined.”
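To see my colleagues' point in miniature, here is a hedged simulation sketch (hypothetical parameters, with a crude significance filter standing in for publication bias; nothing here is calibrated to the actual OSC sample): when only significant "originals" are retained, their average effect estimate is inflated, and the average across one unselected replication of each comes out markedly lower, even though the data are generated honestly and no QRPs are involved.

```python
# Sketch of the point about averages (hypothetical parameters; a crude
# significance filter stands in for publication bias): filter the "originals"
# on significance, replicate each once without any filter, and compare means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_studies = 30, 100

def estimate(d):
    a = rng.normal(d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    return a.mean() - b.mean(), p    # population SD is 1, so this approximates d

originals, replications = [], []
while len(originals) < n_studies:
    d = rng.uniform(0.0, 0.5)                 # assumed true effect for this study
    d1, p1 = estimate(d)
    if p1 < 0.05 and d1 > 0:                  # only significant originals get "published"
        d2, _ = estimate(d)                   # one unselected direct replication
        originals.append(d1)
        replications.append(d2)

print(f"mean original effect estimate:    {np.mean(originals):.2f}")
print(f"mean replication effect estimate: {np.mean(replications):.2f}")  # markedly lower
```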
Fair enough. Now, I do agree with the author of the present article that, whatever we can infer
(or not) from replication studies or replication initiatives, we do have many at least partially
independent reasons to crack down on questionable research practices, sloppy experiments, bad
statistics, unjustified inferences, extrapolations of data beyond the environments in which they
were collected, publication bias, the file drawer problem, crony peer review, faulty editorial
practices, and so on. In other words, whether or not some kind of systematic replication initiative
is even (a) feasible, or (b) helpful (assuming that it could be accomplished), we should still be
very concerned about the amount of poor-quality research being published, and we should try to
reform the reward structure of professional science to encourage fewer publications (of higher
quality), as opposed to the avalanche of largely uninterpretable research (e.g., Meehl, 1990)
currently being published, including in “prestigious” journals (see Earp, in press). If we were
successful in that regard, we would have a lot less need for large scale replication initiatives in
the first place. So in this respect I think there is a great deal of concordance between the author
and myself.
Page 4, line 8. The author writes that tests of excess significance “may themselves be biased.”
Again, there are many different types of bias, so just saying “biased” doesn’t mean very much.
What kind of bias? Biased how?
Page 4, line 25. I am sympathetic with the author’s feelings about the defensive and even
obstructionist reactions of some researchers to this new “replication” era, but the rhetoric here
(“obviously,” “with impunity,” etc.) seems too hot—more suited to a blog post than a formal
paper. Similarly, “hostile” in the following sentence seems unproductively harsh; and in any
event, no citations or references are given actually illustrating such behavior (though it
undoubtedly exists). This is such a contentious, politicized area, that I think more tempered
language, and taking the “high ground” in terms of making step-by-step arguments (rather than
“calling out” other researchers for supposedly “obviously” believing they can get away with
murder) would be more productive in the long run.
Page 4, line 48. This issue of a disproportionate focus on social psychology studies in replication
attempts is discussed in Earp and Everett (2015).
Page 5, line 11. Really excellent point about how hard it would be to recruit subjects from
sensitive populations (e.g., babies) for “mere” replication studies.
Page 5, line 32. The issue of replications being low-status, and especially costly to carry out if
one is an early-career researcher is discussed at length in Everett and Earp (2015).
Page 10, line 27. Problems with peer review have been well-documented and well-discussed by
many authors, in particular Richard Smith (e.g., 2006). Some key citations would be appropriate
here to support the author’s claim that peer review is “undependable.” In case it’s helpful, I have
recently collected some relevant citations in Earp (in press).
Page 11, line 22. This discussion of direct vs. conceptual replication is a little bit facile, in my
opinion. There is actually a theoretical problem with not having direct replications (depending
upon your aims), because deliberate changes in study design could violate auxiliary assumptions
that are still implicit (i.e., not yet identified), such that “indirect” replications might not tell us
very much about the reliability of the original finding. For more on the philosophy of direct vs.
conceptual replication and making generalizable statements over variable conditions, see Earp
and Trafimow (2015) as well as Trafimow and Earp (in press).
That said, I do agree that fetishizing direct replication over more conceptual replication is
problematic, as the author (in my view) discusses very nicely in this passage. Nevertheless, it
seems to me that more needs to be said about when and under what conditions direct vs. more
indirect replications are useful (and toward what particular ends). If an original researcher finds a
very narrowly-achievable effect (using the bug-killing paradigm, for example, as the author
discusses), and then generalizes this to a range of cases that haven't actually been tested, that is
clearly a problem; but what kind of replication effort is appropriate depends on the question.
Specifically, if you want to “check” whether the original finding, as captured by some narrow
outcome variable, is actually a good/reliable effect, then a direct replication is needed. By
contrast, if you want to see whether the paradigm extends to a new case, you should probably
still do a direct/exact replication to make sure the original finding can be repeated, but then—
yes—it is likely to be worthwhile to try to systematically change components of the study design
to see whether the effect holds over a range of conditions (see Earp & Trafimow, 2015).
The issue is when you do indirect/conceptual replications without also doing direct replications
(see Earp & Trafimow, 2015). Specifically, when such indirect replication attempts apparently
fail, the replicating researcher is likely to say: “Well, it’s probably just my fault because I
changed certain things about the experiment and therefore didn’t do an exact replication” (see,
e.g., Earp et al., 2014). Accordingly, this researcher might simply abandon the problem and put
the replication data in her proverbial file drawer (due, again, to the perverse incentives created by
the existence of publication bias in favor of “significant” effects; e.g., Greenwald, 1975). If she
had run a direct replication first, however, being careful not to change anything that would be at
least plausibly theoretically relevant—and she still didn’t find evidence of the effect—then she
would be more likely to wonder whether the original effect was illusory after all, and would thus
be more likely to look into the matter further. I discuss this issue at some length in the
concluding section of Earp et al. (2014), re: the “Macbeth” effect, relying largely on Harris et al.
(2013).
Page 12, line 14. Again, extensive discussion of the theoretical issues involved in testing the
generalizability of claims across variations can be found in Earp and Trafimow (2015). I think
this section would be improved by engaging more specifically with some of the ideas presented
there. But it is up to the author.
Page 12, lines 53-53. The author writes that the results in question were “demonstrated [to be]
untrustworthy.” Again, for the reasons discussed above, there is a sense in which this claim is too
strong. Depending upon how it is interpreted, it does not necessarily follow from what the OSC
paper found.
Page 13, line 28. Again, “not replicated” is a strong term. What does the author mean by
“replicate” – to achieve the exact same p-value and/or effect size estimate? To achieve values
within a certain range? What range? Unless the author is clear about what counts as a “successful”
replication, we have no way of knowing whether his notion of “not replicated” is meaningful. To
repeat: getting a different p-value or effect size estimate (especially in a single, one-off replication study) is not sufficient to show even that the underlying evidence for the original reported effect has "gone down." If the p-value goes from .01 to .90, that's probably a red flag and we definitely want to ask what's going on there.3 If the effect size goes from huge to zero, same thing. But unless we articulate what we would expect to be reasonable for a "successful" replication (and it would be a stretch to think that this should be "the exact same p-value or smaller, or the exact same effect size or bigger"), then we need to be much more careful about saying "not replicated." I should just note that the OSC authors do a pretty good job of discussing different ways of interpreting "replication," and they do test a handful of plausible contenders.

3 Though note that a colleague writes (personal correspondence, February 7th, 2016), "I find p-values totally unhelpful in this whole discussion. You are correct that they are very likely to vary, even with perfect replications. … Even the example you give of the p-value shifting from .01 to .90 may hardly be strong evidence that the two studies were estimating different population effect sizes."
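To illustrate the colleague's caution in the footnote above, here is a small simulation sketch (the effect size and sample size are assumptions for illustration only): even exact replications of a perfectly real effect, run under identical conditions, produce p-values scattered across nearly the entire range from 0 to 1.

```python
# Small illustration of the footnoted point (assumed effect size and sample
# size): exact replications of a real effect still yield widely varying p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
d_true, n, n_sims = 0.4, 30, 10_000

pvals = []
for _ in range(n_sims):
    a = rng.normal(d_true, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    pvals.append(p)

pvals = np.array(pvals)
print(f"share of p-values below .05: {np.mean(pvals < 0.05):.2f}")
print(f"share of p-values above .50: {np.mean(pvals > 0.50):.2f}")
print(f"5th to 95th percentile:      {np.percentile(pvals, 5):.3f} to {np.percentile(pvals, 95):.3f}")
```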
Page 13, line 39. There has been a lot of work critiquing impact factors (e.g., Seglen, 1998),
arguing that they are easily game-able, are not necessarily good indicators of the quality of the
research appearing in a given journal, etc. Some of this work should arguably be cited here, as it
is consistent with the author’s apparent viewpoint. That said, I worry that this passage could be
read as suggesting that “high impact factor” journals should accept replication studies because
this would lend more “status” to replication studies, when in fact—as the author himself most
likely believes—this link between perceived status and having a high impact factor is part of the
overall problem. In other words, there is a risk that the argument in this passage will simply
reinforce the notion that having a “high impact factor” is a worthwhile marker of status, and that
researchers should care about whether their work (whether a replication study or original)
appears in a so-called “vanity” journal.
Page 19, generally. In the abstract it seemed that a discussion of how social media can serve as a
form of post-publication peer review to identify poor quality studies was going to take place, and
yet here, in the concluding paragraph, the idea gets a very cursory mention. Could this be filled
out and perhaps illustrated with examples? Maybe give concrete recommendations for how this
tool could be used more effectively?
Minor comments re: typos, grammatical errors, etc.
Page 3, line 28. The author refers to the “extraordinary multi-institution effort” organized by the
Open Science Collaboration: Psychology (OSC), but qualifies this with the phrase “even if
biased.” But biased in what particular way? The author could certainly say: “even if biased (for
reasons to be discussed)” or something like that; but simply slotting in the word “biased” here
with no further explanation of what kind of bias is actually meant makes this sentence less
informative than it could be.
Page 3, line 18. If, as the author claims, replication initiatives are widely seen this way (i.e., as
“the great hope for salvaging the trustworthiness of psychology”), then perhaps a few key
citations of people/organizations that have actually advanced such a perspective would be in
order.
Page 2, header. The header reads: “replication machines will not salvage.” This is a fragment and
doesn’t make grammatical sense; also “machines” is not used like this elsewhere throughout the
paper. A better header is needed.
Page 2, abstract. The abstract is excellent, and draws me right into the paper. However, it
highlights the role of social media to expose faulty research practices, etc., whereas this topic is
given only a sentence or two in the actual paper. That should be reconciled one way or the other
(see above).
Page 3, line 23. The phrase “accomplished in selecting now” makes no sense as far as I can see.
Page 3, line 44. The phrase “this unreliability of the frustrating” is nonsensical; I think “has been
frustrating” is meant. Then it says, “There been various” but it should be: “There have been
various.” Then later in the sentence: “overall excess statistically” should be “overall excess of
statistically.” I can’t promise to flag all such grammatical errors and typos as I go through
(because that should really be done by a copy editor), but I must say that I am surprised by the
sheer number of such writing errors in this manuscript.
Page 3, line 59. Could page numbers be given for quotes?
Page 4, line 16/17. The phrase “or smaller” should probably be in parentheses in order to help
make reading this sentence easier. And, again, I wish to point out that simply getting a smaller
effect size the second time around doesn’t tell us very much at all. Multiple replications are
needed.
Page 4, line 33. The word “suspect” has several different meanings, and those differences make
it hard to parse this. I think "be suspicious of" would be more clear/less ambiguous.
Page 5, line 35. The phrase “contribution and is” is ungrammatical. I think the “and” needs to be
deleted.
Page 6, line 16. The period is missing at the end of the sentence after “extreme outlier.”
Page 7, line 9. The phrase “summoned to support of” should be “in support of.”
Page 7, line 32. What does “correlational observational are” mean? I believe this is a
grammatical error.
Page 8, line 6. The closing quotation mark is missing after “factor” and a period is missing at the
end of the sentence after “big mush.”
Page 8, line 35. The underline under “not” is too long.
Page 8, line 50. What is “OPS” … is this OSC? Maybe I just missed something.
Page 10, line 14. The “of” after “replication” should be deleted.
Page 11, line 22. The word “seeming” should be “seemingly” and then the next line down
“procedures can be considerable” should be “make a considerable.”
Page 11, line 56. The phrase “based on studies undergraduates” should be “on studies of
undergraduates.”
Page 12, line 6. The phrase “perspective that psychology” should be “perspective that a
psychology.”
Page 12, line 32. There are two periods at the end of the sentence.
Page 13, line 18. The word “characterizes” should I think be “characterizing.”
Page 13, line 36. “The” should be added before American Psychological Association and the
APS, here and hereafter.
Page 14, line 42. There is an extra space after “Nelson.”
Page 15, line 11. What does “which the OCS sample” mean? I think “sampled” is meant.
Page 15, line 27. I think it should be “agendas” or else “a strong institutional.” Then two lines
down, if it’s “agenda” then it needs to be “remains” and then “endless and waste the effort of”
needs to be “endless and a waste of the effort of.”
Page 16, line 48. I think “papers author’s” should be “paper’s author’s.”
Page 17, line 14. “CONSORT, set a” should be “CONSORT, had set a.”
Page 17, line 34/5. The phrase “physical health in the prestigious journal” should be: “a”
prestigious journal.
Page 17, line 55. The phrase “both of the journals in which the published” is ungrammatical, and
then the dashes are uneven: the first one is long; the second one short and not centered.
Page 18, line 24. There is an extra space in: [38 ].
Page 18, line 34/5. I think “has promise” should be deleted so that this makes sense.
Page 18, line 45. “Postscript” should be bolded.
Page 18, line 53. I think “incentized” should be “incentivized.”
Page 20, generally. There should be quotation marks around the actual Brembs quote or else the
text should be indented, and a page number should be given if that applies.
Page 21, line 41 and Page 24, line 12. There are missing reference numbers. Where is reference
8? And 37?
Page 24, line 10. The formatting for reference 36 is completely off. There is an extra close
bracket, and it is indented.
Overall comment and recommendation
I am strongly sympathetic with many of the points raised by the author. Moreover, I think that
his perspective is valuable, and that the field generally should have the opportunity to learn from
it. Therefore, I believe that a version of this paper should be published. That said, there are a
number of substantive theoretical points that I think need to be improved upon or at least
clarified (and also, for some reason, a very large number of typos and ungrammatical sentences
that will need to be fixed before the paper could go to print).
Acknowledgements
Thank you to Professors Geoff Cumming, Veronica Vieland, Stuart Firestein, and two other
colleagues who wish to remain anonymous for providing feedback on my review and/or
discussing these ideas with me. None of what I’ve written should be taken as an expression of
their views unless I’ve specifically cited them; I also take personal responsibility for any errors
that remain in this document. Please note that a few typos and other minor cosmetic issues have
been addressed since the formal submission of this review.
References cited in this review
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò,
M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience.
Nature Reviews Neuroscience, 14(5), 365-376.
Earp, B. D. (in press). Mental shortcuts [unabridged version]. Hastings Center Report. Available
ahead of print at
https://www.researchgate.net/publication/292148550_Mental_shortcuts_unabridged.
Earp, B. D. (2015, September 12). Psychology is not in crisis? Depends on what you mean by
“crisis.” Huffington Post. Available at http://www.huffingtonpost.com/brian-earp/psychology-is-
not-in-crisis_b_8077522.html.
Earp, B. D., & Everett, J. A. C. (2015, October 26) How to fix psychology’s replication crisis.
The Chronicle of Higher Education. Available at
http://www.academia.edu/17321035/How_to_fix_psychologys_replication_crisis.
Earp, B. D., Everett, J. A. C., Madva, E. N., & Hamlin, J. K. (2014). Out, damned spot: Can the
“Macbeth Effect” be replicated? Basic and Applied Social Psychology, 36(1), 91-98.
Everett, J. A. C. & Earp, B. D. (2015). A tragedy of the (academic) commons: Interpreting the
replication crisis in psychology as a social dilemma for early-career researchers. Frontiers in
Psychology, 6(1152), 1-4.
Earp, B. D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in
social psychology. Frontiers in Psychology, 6(621), 1-11.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological
Bulletin, 82(1), 1-20.
Harris, C. R., Coburn, N., Rohrer, D., & Pashler, H. (2013). Two failures to replicate high-performance-goal priming effects. PLoS ONE, 8, e72467. doi:10.1371/journal.pone.0072467.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often
uninterpretable. Psychological Reports, 66(1), 195-244.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science.
Science, 349(6251), aac4716-1-aac4716-8.
Seglen, P. O. (1998). Citation rates and journal impact factors are not suitable for evaluation of
research. Acta Orthopaedica Scandinavica, 69(3), 224-229.
Smith, R. (2006). Peer review: a flawed process at the heart of science and journals. Journal of
the Royal Society of Medicine, 99(4), 178-182.
Trafimow, D. & Earp, B. D. (in press). Badly specified theories are not responsible for the
replication crisis in social psychology. Theory & Psychology. Available ahead of print at
https://www.researchgate.net/publication/284625607_Badly_specified_theories_are_not_respons
ible_for_the_replication_crisis_in_social_psychology
Trafimow, D., & Rice, S. (2009). A test of the null hypothesis significance testing procedure
correlation argument. The Journal of General Psychology, 136(3), 261-270.
Vieland, V. J. (2001). The replication requirement. Nature Genetics, 29(3), 244-245.
Vieland, V. J. (2006). Thermometers: something for statistical geneticists to think about. Human
Heredity, 61(3), 144-156.
Vieland, V. J., & Hodge, S. E. (2011). Measurement of evidence and evidence of measurement.
Statistical Applications in Genetics and Molecular Biology, 10(1), 1-11.