A Simple Re-Analysis Overturns a “Failure to Replicate” and
Highlights an Opportunity to Improve Scientific Practice:
Commentary on Li and Bates (in press)
Carol S. Dweck
Stanford University
David S. Yeager
University of Texas at Austin
Preparation of this manuscript was supported by the William T. Grant Foundation, the National Science
Foundation under grant number HRD 1761179, and by the National Institute of Child Health and Human
Development (Grant No. R01HD084772-01 and Grant No. P2C-HD042849, to the
Population Research Center [PRC] at The University of Texas at Austin). The content is solely the
responsibility of the authors and does not necessarily represent the official views of the National Institutes
of Health or the National Science Foundation.
An Opportunity to Improve Replication Practices
In recent years the field has improved the standards for replicators to follow to help
ensure that a replication attempt, whether it succeeds or fails, will be informative. When these
standards are not followed, false claims can result and opportunities to learn are missed, which
could undermine the larger scientific enterprise and hinder the accumulation of knowledge. In
the case addressed here, Li and Bates’ (in press) attempt to replicate Mueller and Dweck’s
(1998) findings on the effects of ability versus effort praise on post-failure performance, the
replicating authors did not follow best practices in the design or analysis of the study. Correcting
even the simplest deviations from standard procedures yielded a clear replication of the original
results. Li and Bates’ data therefore provided one of the strongest possible types of evidence in
support of Mueller and Dweck’s (1998) findings: an independent replication by a researcher who
is on record being skeptical of the phenomenon. The present paper highlights the wisdom of
upholding the field’s rigorous standards for replication research. It also highlights the importance
of moving beyond yes/no thinking in replication studies and toward an approach that values
collaboration, generalization, and the systematic identification of boundary conditions.
Keywords: Replication, meta-science, attribution theory, motivation.
A Simple Re-Analysis Overturns a “Failure to Replicate” and
Highlights an Opportunity to Improve Scientific Practice:
Commentary on Li and Bates (in press)
In this paper we re-analyze data from a recent replication study (Li & Bates, in press).
We illustrate how replicator degrees of freedom can obscure the true conclusions of a
replication study’s data. Replicator degrees of freedom are defined as (a) flexibility in the
analysis of replication studies that can lead to the presentation of supportive evidence as though
it was unsupportive and (b) flexibility in the design of replication studies that can weaken or
otherwise obscure a replicable result (Bryan, Yeager, & O’Brien, in press). We do so because
publication of misleading replication results can threaten the integrity of psychology’s important
recent efforts to increase the role of replication studies in the scientific process.
Science seeks to discover effects that are “robust, replicable, and generalizable”
(Association for Psychological Science, 2014). Replication studies, in which researchers collect
new data to retest a hypothesis using the same procedures, provide one important means for
reaching this scientific ideal. The consensus view is that a replication study should not be judged
by whether or not it provides support for an initial hypothesis, but rather by how well-conducted
the replication study was (see, e.g., Srivastava, 2012). The reason is that well-conducted
replication studies which fail to find supportive evidence for a hypothesis serve the valuable
purpose of reducing confidence in potentially false research conclusions. Likewise, independent
successful replications serve a valuable purpose of demonstrating that an original study’s results
were robust and replicable.
When replication studies are of insufficient quality, however, it becomes difficult to
accumulate scientific knowledge. One reason is that it is difficult for original authors to
accept the results of failed replication studies if replicating authors were able to “null hack” the
results through flexibility in the design and analysis of the study (see Protzko, 2018). Indeed,
there has been growing attention to the possibility that replicating authors, not just original
authors, face a conflict between accurate reporting of results and the professional incentives of
media attention, publication in high-profile outlets, and research grants that come from “failures
to replicate” high-profile findings (Bryan et al., in press). Results confirming what is already
known may be of considerably less interest than results that overturn well-cited findings
(see, e.g., the "Proteus phenomenon," Ioannidis & Trikalinos, 2005), and so the incentive to
null-hack may be every bit as strong as the incentive that initial authors may face to “p-hack.”
Unchecked null-hacking undermines scientists’ confidence in the replication enterprise.
It is therefore a threat to scientific progress more generally, because it reduces the effectiveness
of one highly important means for clarifying the scientific record: the replication study. As
such, null-hacking in replication studies deserves serious attention.
To protect the integrity of the replication enterprise, professional societies such as APS
have developed and disseminated clear scientific standards. Nearly all of the recent, high-profile,
multi-investigator replication studies have followed these standards (e.g., Klein et al., 2018). These
standards are intended to dramatically constrain several researcher degrees of freedom.
For instance, registered replication reports (Association for Psychological Science, 2014)
and a growing number of similar efforts require transparency and rigor with respect to study
procedures and statistical analyses. The standard for replication study procedures involves:
contact with original authors (and possibly other authors who are experts in a sub-field) to
develop replication study procedures; several rounds of review and revision; perhaps review of
procedures by independent experts, especially in the case of disagreements between parties; and,
after “locking” the procedures but before data collection, voicing of any concerns or objections
about the study procedures. The standard for statistical analysis involves:
careful consultation of the original studies’ analysis plan; consultation with original authors to
select the hypothesis test that replicates the original study; pooling all of the new data to produce
a highly-powered statistical test; pre-registration of this analysis plan prior to analyzing the data;
and pre-registering the conclusions that would be drawn depending on the results.
For the most part these standards should cause original authors to accept the conclusions
of a failed replication, barring a later discovery of flaws in the replication study’s design (see
Luttrell, Petty, & Xu, 2017; Schooler, 2014). When replication studies do not follow these
standards, however, the threats to scientific progress noted previously loom large.
The present paper. Here we present data from a recent replication attempt (Li & Bates,
in press) that speaks to the wisdom of the field’s standards. We show that a study which did not
follow the standards just listed resulted in the publication of misleading conclusions. More
specifically, a study which claimed to find “no” and “null” evidence in support of a hypothesis in
fact provided some of the strongest kind of evidence in support of it: an independent replication
by authors who are on record as being skeptical of the effect.
Before describing the replication study and our re-analysis, it is important to be clear that
we are interested parties in this debate. One of us is an original author of the effect under
investigation, and the other has published about the broader phenomenon. We have tried to limit
and be transparent about our own degrees of freedom, as we explain, and we have made our data
and syntax publicly available so that readers can draw their own
conclusions as well.
It is important to remember that replicating authors, too, should not be presumed to be
disinterested in the results of a study, because of the incentives for publishing null results noted
previously. And although there has been some debate in psychology over whose p-values should
be accorded greater validity (the replicating authors’ or those of the authors who re-analyze
replication study data), these debates have almost exclusively focused on cases in which
replicating authors followed a public, pre-approved, and transparent pre-analysis plan. In the
case of Li and Bates’ (in press) studies, however, the standards for open and transparent pre-
analysis plans were not followed. So our analyses should not be considered any more or less
“exploratory” than the replicating authors’.
The Present Replication Debate
In 1998, Mueller and Dweck published a paper entitled “Praise for intelligence can
undermine children’s motivation and performance.” The paper reported six studies on the effects
of ability praise vs. effort praise, and four of the studies examined how each form of praise given
after a success would affect children’s task performance after a subsequent failure (see Figure 1).
On the first trial of the task, children were given a set of moderately challenging
problems after which they were given success feedback and praise. In the intelligence praise
condition, they were told: “You got __ right. That’s a really good score. You must be smart at
these problems.” In the effort praise condition, they were told “You got __ right. That’s a really
good score. You must have worked hard at these problems.” And in the comparison (outcome
praise) condition, they were told: “You got __ right. That’s a really good score.”—with no
further attribution. Following this success trial, children experienced a failure trial, on which
they got very few, if any, problems correct and were given feedback to that effect. The goal was
to determine how this failure would affect their performance on the next problem set as a
function of the praise they were given. Thus, on the third trial, students received problems that
were matched in difficulty to those on the first trial, and the changes in performance from Trial 1
to Trial 3 constituted the main dependent variable in these four studies: “This process yielded a
measure of post-failure performance.” (Mueller & Dweck, 1998, p. 36). Although there were
other dependent variables in the studies, this was the key variable, the one primarily focused on
in the replication test discussed below, and the one we will focus on in this paper.
Figure 1. Study procedures from Mueller and Dweck (1998).
Mueller and Dweck reasoned that by providing ability vs. effort attributions for the
students’ initial success, children might be more likely to make ability vs. effort attributions for
their subsequent failure. Prior research had suggested that ability versus effort attributions for
failure were associated with different effects (more debilitating versus more enhancing) on
subsequent performance (Diener & Dweck, 1978; Weiner & Kukla, 1970), leading Mueller and
Dweck (1998) to say: “Therefore, we hypothesized that children receiving intelligence praise
would show…worse task performance after failure than children praised for effort” (p. 35).
Across the four studies that assessed post-failure performance, Mueller and Dweck found
consistent evidence that intelligence praise led to worse post-failure performance than did effort
praise. The results were quite robust with respect to statistical significance, and seemed unlikely
to be distorted by p-hacking (Simonsohn, Nelson, & Simmons, 2014), because the statistical
models were simple, involving only a simple comparison of conditions, and were kept consistent
across the studies. Furthermore, the p values were in most cases well below the .05 threshold.
Unsurprisingly, then, a p-curve analysis of the four experiments in Mueller and Dweck (1998) which
tested the focal hypothesis provided no evidence of p-hacking. The studies had “evidential
value” (continuous test Z = -4.09, p < .001), and the model estimated that the original studies had 94%
power (see Appendix 1). In short, overall Mueller and Dweck (1998) reported rather robust
evidence for the difference between intelligence and effort praise on post-failure performance.
Replication by Li and Bates (in press). Li & Bates (in press) sought to determine
whether they could replicate the Mueller and Dweck findings, prioritizing the main outcome of
post-failure performance. In three studies, conducted with students in Northeast China, they
administered problems of a similar type (although, notably, not of a similar difficulty) to those of
Mueller and Dweck. They followed a roughly similar procedure: they gave the students effort or
ability praise after Trial 1, administered more difficult problems on Trial 2 (but not, as we show
later, “failure” problems), and then, as Mueller and Dweck had done, compared Trial 3
performance to Trial 1 performance to create the main dependent variable.
In Studies 2 and 3, they also introduced a new “active control” condition, in which a fixed
mindset statement was made and was followed by a statement about effort: “You can’t change
your basic ability, but you work at things, and that’s how we get hard things done.” Finally, in
some studies, they added some difficult problems at the end (Trial 4) and reported performance
on these problems. (Since Trial 4 is not a replication of any published protocol, we do not
discuss it further.)

Footnote: The Li and Bates (in press) experiment was conducted in Chinese, but the authors do not report pilot
tests of the meaning of the key terms (e.g., “smart at that”) and have not posted back-translations of their protocols,
survey items, or text. So readers cannot evaluate whether the materials replicated the meaning of the same terms in
the Mueller and Dweck (1998) studies. Indeed, there is some ambiguity about what praise Li and Bates actually
delivered (the authors list different praise statements on page 2 and page 6 of the manuscript).
For Study 1, Li and Bates report a significant benefit of effort over ability praise,
supporting the Mueller and Dweck findings: performance on trial 3 versus trial 1 was superior
for the effort-praised children compared to the ability-praised children. Li and Bates then
reduced statistical power for Studies 2 and 3. They found no difference between these groups in
Study 2, but also no effect for their active control condition. There was a nonsignificant trend in
the Mueller and Dweck direction in Study 3, as well as a significant effect of the active control.
On the basis of their findings, Li & Bates conclude that there is no support for the
Mueller and Dweck (1998) effects, sometimes phrased as no evidence for the effects of the
“mindset manipulation” on performance after failure (note that there was never any mindset
manipulation in Mueller and Dweck because no manipulation included any information about the
malleability of intelligence). In the present paper we answer three major questions:
1. Did Li & Bates (in press) fail to replicate the Mueller and Dweck (1998) findings?
Applying the standards prevalent in the field, we find that Li and Bates replicated the
Mueller and Dweck findings.

2. Could a seemingly less-robust level of statistical significance across studies plausibly be
attributed to a difference in study procedures? Here, we provide evidence that there was
no failure trial (on Trial 2) that was close to the one in Mueller and Dweck.

3. What is the meaning and import of Li and Bates’ “active control” condition? Li and
Bates argue that a significant effect of an active control condition in one of the studies
allows them to reject the positive evidence in favor of Mueller and Dweck. We argue that
this represents a confusion of interpretation with replication.
Re-Analysis of Li and Bates (in press)
Overall replication of Mueller & Dweck (1998). Li and Bates collected data from 624
participants across three studies; 429 of those were randomly assigned to effort or intelligence
praise conditions, which were described by the authors as a replication of Mueller and Dweck’s
(1998) procedures (the remaining participants were assigned to a new, active control condition
that does not constitute a replication because it tests a novel hypothesis; we return to it below).
As noted, in the absence of a pre-analysis plan on Li and Bates’ (in press) part, it is
advisable to rely on published standard operating procedures for replication studies. Typically,
when multiple studies are described as direct replications, analysts pool the data from all
available samples for the primary analysis. The aggregate result is deemed consistent with the
original paper when it is statistically significant and in the same direction as the original study
(see, e.g., Klein et al., 2018).
In the present case, this standard procedure is especially wise, because Li and Bates’
individual studies were under-powered. Simonsohn (2015) argued that, for a replication study to
provide conclusive evidence against a previous study’s result, the replication study needs at
least 2.5 times the sample size of the research it is replicating, which the Li and Bates studies
did not have.
Furthermore, independently interpreting differences in statistical significance
across studies that are, on their own, under-powered is called a “Gelman and Stern error”
(Gelman & Stern, 2006). This is especially likely to be a threat to validity in the present case,
since Study 1 in Li and Bates reported a replication of Mueller and Dweck at p < .05, but the
authors then reduced their statistical power for Studies 2 and 3 (and reported p values > .05).

Footnote: Insufficient power for a replication is clear when comparing the sample sizes of the effort and intelligence
conditions in Study 1 of Mueller & Dweck (1998) to the sample sizes for the same conditions in any individual study
in Li and Bates (in press), or when comparing the totality of the data in Mueller & Dweck (1998) to the totality of
the data in Li and Bates (in press).
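The Gelman and Stern point can be made concrete with a small numeric sketch (the z values below are invented for illustration, not taken from the studies): two equally precise studies can straddle the .05 threshold even though the difference between their estimates is nowhere near significant.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical, equally precise studies: one crosses p < .05, the other does not.
z1, z2 = 2.2, 1.4
p1, p2 = two_sided_p(z1), two_sided_p(z2)

# The test statistic for the DIFFERENCE between the two estimates
# (the variance of a difference of two independent unit-variance estimates is 2).
z_diff = (z1 - z2) / math.sqrt(2)
p_diff = two_sided_p(z_diff)

print(f"study 1: p = {p1:.3f}; study 2: p = {p2:.3f}; difference: p = {p_diff:.2f}")
```

Declaring a "failure to replicate" from the second study alone would commit exactly the error Gelman and Stern describe: the two studies do not differ significantly from each other.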
Looking at all of the data across the three Li and Bates studies yielded a replication of the
key Mueller and Dweck effect on post-failure performance. There was a significant overall
comparison of effort praise and ability praise, t(475) = 2.045, p = .041, in an OLS regression
predicting the outcome of the difference in performance between Trial 3 and Trial 1, with praise
condition and no covariates in the model. (In their syntax, Li and Bates standardized (z-scored)
the outcome within each study, and we did the same here; thus the pooled analysis is roughly,
but not exactly, equivalent to a fixed-effects meta-analysis).
This straightforward analysis
solves the puzzle of why Li and Bates (in press) reported conclusions that seemed to contradict
the a priori prediction that an original effect which showed a largely-supportive p-curve analysis
would be likely to replicate in a future study (e.g. Simonsohn, Nelson, & Simmons, 2019): Li
and Bates’ (in press) studies were seemingly under-powered.
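The pooling procedure just described (standardize the outcome within each study, stack the data, and run a single condition comparison) can be sketched as follows. The data are synthetic: the per-study sample sizes and the 0.3 SD effect are invented for illustration and are not Li and Bates' values.

```python
import math
import random
import statistics

def zscore(xs):
    """Standardize a list of values to mean 0, SD 1."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    return [(x - m) / s for x in xs]

def pooled_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Simulated Trial 3 minus Trial 1 difference scores for three hypothetical studies.
random.seed(1)
studies = []
for n in (80, 70, 70):  # invented per-condition sample sizes
    effort = [random.gauss(0.3, 1.0) for _ in range(n)]   # small simulated benefit of effort praise
    ability = [random.gauss(0.0, 1.0) for _ in range(n)]
    studies.append((effort, ability))

# Standardize the outcome within each study, then stack across studies.
pooled_effort, pooled_ability = [], []
for effort, ability in studies:
    z = zscore(effort + ability)
    pooled_effort.extend(z[:len(effort)])
    pooled_ability.extend(z[len(effort):])

t = pooled_t(pooled_effort, pooled_ability)
print(f"pooled t = {t:.2f} on {len(pooled_effort) + len(pooled_ability) - 2} df")
```

Because the standardization uses each study's own mean and SD, the stacked comparison is close to (though, as noted above, not identical to) a fixed-effects meta-analysis of the three studies.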
We note that our simple pooled analysis refrains from exercising additional researcher
degrees of freedom above and beyond what was done by Li and Bates; the only difference was
that we stacked the data. However, several alternative analysis methods are reasonable to
consider, in part because they have been used in past efforts to assess the replicability of original
effects. These include: using a multi-level mixed effects model (with participants nested within
studies), computing a meta-analysis of the three studies (using fixed or random effects), and
including covariates (any combination of age, sex, Trial 1 performance, and Trial 2 performance;
see below for a justification of controlling for Trial 2 performance). Covariates, in particular,
seem wise in a replication test because they can reduce error variance and maximize the
statistical precision (and therefore informativeness) of the replication test. A public and
transparent analysis plan almost certainly would have included at least some statistical controls,
in part because increased precision is important when replications are insufficiently powered.

Footnote: We acknowledge that there are alternative methods for assessing a replication study’s results (Hedges & Schauer,
2018). Here we limit ourselves to Li and Bates’ (in press) published claims of “null” and “no” evidence. Their
conclusions were based exclusively on statistical significance tests, and since we are evaluating them, so are ours.
To explore the impact of model specification degrees of freedom on the conclusion that
the Li and Bates data replicated the main Mueller and Dweck conclusion, we followed
recommendations set forth by Bryan et al. (in press). Bryan et al. (in press) recommend
conducting “specification curve” analysis, which involves estimating null hypothesis tests for all
reasonable and unique combinations of the model specification choices, and reporting them all.
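To illustrate how a specification grid of this size can arise, here is a hypothetical enumeration; the estimator names and covariate set below are our illustrative guesses, not the authors' exact choices. Five estimator variants crossed with all 16 subsets of four candidate covariates yields 80 unique specifications.

```python
from itertools import chain, combinations

# Hypothetical specification grid (illustrative; not the exact list used in the analysis).
estimators = ["pooled_ols", "mixed_effects", "fixed_effects_meta",
              "random_effects_meta", "ancova"]
covariates = ["age", "sex", "trial1_score", "trial2_score"]

def all_subsets(items):
    """Every subset of `items`, from the empty set to the full set."""
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

# One specification = one estimator paired with one covariate subset.
specs = [(est, subset) for est in estimators for subset in all_subsets(covariates)]
print(len(specs))  # 5 estimators x 16 covariate subsets = 80 specifications
```

A specification curve analysis then fits every entry in `specs` to the same data and reports the full distribution of estimates and p values, rather than a single hand-picked model.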
We estimated all 80 unique combinations of the specification options described above;
these also yielded evidence consistent with the original Mueller and Dweck finding. The
interquartile range for p values was p = .009 to p = .034, and fully 79 (99%) of the alternative
model specifications yielded p values less than .05 (see the publicly posted syntax and model
output). A “Bayesian causal forest” analysis, reported in a footnote, supported the same
conclusion.

Footnote: Any model degree of freedom or covariate choice was implemented identically across all three studies.

Footnote: Bryan et al. (in press) also recommend a second, conservative analysis: a “Bayesian causal forest” (BCF) approach
(Hahn, Murray, & Carvalho, 2017). A primary advantage of BCF over typical regression methods is that it does
not treat all possible covariate specifications as though they were equally plausible; instead, it uses machine learning
to select an optimal covariate function, but uses Bayesian priors to shrink the estimated treatment effect size toward
zero, which makes it conservative. The Bayesian causal forest algorithm found a very high posterior probability that the
effect size in the Li and Bates data was greater than zero (.97, or 32:1 odds). This corresponds to strong evidence
against the null hypothesis (and in favor of the Mueller & Dweck, 1998, conclusion), because the Li and Bates data
were strong enough to greatly shift the posterior distribution away from a conservative prior probability of .5 (1:1
odds). A final benefit of the BCF analysis is that, because it involves posterior probabilities (not frequentist p-values),
it sidesteps the concern that specification curve analysis yields invalid p-values by using the same data many times.
The BCF syntax is publicly posted.
Given these results, it is puzzling why the Li and Bates (in press) paper included strong
claims of failed replication, such as these (we note again that Li and Bates called effort praise
“growth mindset,” but, as can be seen above, there was no growth mindset manipulation):
“The results did not support any effect of growth mindset on children’s post-failure
performance” (p. 19).
“We found little or no support for the idea that growth mindsets are beneficial for
children’s responses to failure” (p. 13).
The present paper reveals “null outcomes” (p. 27).
“We found no evidence that a growth mindset condition improved children’s
performance on cognitive tests following failure” (p. 28).
Differences from the original study procedures. Despite strong overall support for the
central conclusion of the Mueller and Dweck (1998) paper, the evidence presented by the Li and
Bates (in press) replication appeared to be less robust than the original paper. Since p-hacking in
the original studies is an unlikely explanation for the difference (see Appendix 1), it seems
fruitful to explore potential differences in the study procedures.
We are aware that post-hoc critiques about replication study procedures are perhaps the
most common responses to replication attempts among original authors (Van Bavel, Mende-
Siedlecki, Brady, & Reinero, 2016; Gilbert, King, Pettigrew, & Wilson, 2016; Luttrell et al.,
2017), and that they are the most likely to be dismissed out of hand. When replication studies
involve pre-approved study protocols, especially when these are vetted by the original authors, it
is far easier to ignore post-hoc concerns about replication procedures. But, as noted, the field’s
standards were not followed here.
Furthermore, there are clear cases in which deviations from original study procedures
have been key to the failure of a replication test. For instance, a protocol for a registered
replication report for the verbal overshadowing effect famously deviated from the original
published paper in a way that suppressed the researchers’ ability to detect the phenomenon
(Schooler, 2014). When the deviation was corrected and new data were collected, the
replicating authors successfully detected the verbal overshadowing effect. Thus, concerns about
deviations from original study procedures are not exclusively unfounded complaints but are (at
least some of the time) responsible for misleading claims of a failure to replicate.
As noted earlier, the purpose of the Mueller and Dweck studies was to determine how
praise for success would affect children’s reactions to failure. The reason is that Mueller
and Dweck hypothesized that students praised for high ability (after success) would infer that
they have low ability after failure; this inference should undermine motivation. Without a failure
experience, there would be no reason for praised students to infer that they lack ability. For this
reason, Mueller and Dweck included a strong failure trial after students were praised, on which
students got very few, or even no, problems correct (Trial 2). Mueller and Dweck emphasized
the critical nature of this failure trial; the word “failure” was used in their paper 119 times.
With this background, we explored the possibility that the Li and Bates studies differed
from the Mueller & Dweck (1998) studies in their inclusion of a true failure trial. A true failure
trial is what would allow the effects of failure on subsequent performance to be tested. The Li
and Bates studies did in fact differ in this respect.
The mean for the Mueller and Dweck (1998) Trial 2 appears on page 38 of the paper (and
can be calculated from the publicly posted data), and can be compared to the data from
the three Li and Bates studies. Student scores on the supposed failure trial in Li and Bates were
nearly two and a half standard deviations (2.43 SD) higher than in Mueller and Dweck:
M(M&D) = 1.60 problems solved, SD(M&D) = 1.27; M(L&B) = 4.68 problems solved, SD(L&B) = 2.32. This
was a massive difference; indeed, the distributions of scores on the key failure trial are clearly
not the same (see Figure 2).
Figure 2. Li and Bates did not create a failure trial that came close to matching the original
Mueller and Dweck studies.
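The reported 2.43 SD gap can be checked directly from the summary statistics above, scaling the difference in means by the original study's standard deviation:

```python
# Summary statistics quoted above.
m_md, sd_md = 1.60, 1.27   # Mueller & Dweck (1998), Trial 2
m_lb = 4.68                # Li & Bates, pooled Trial 2

# Express the Li & Bates "failure" trial mean in units of the original study's SD.
gap_in_md_sds = (m_lb - m_md) / sd_md
print(round(gap_in_md_sds, 2))  # → 2.43
```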
Another way to think of the failure trial scores is as a manipulation check. Of all of the
critiques of a replication study, perhaps the least controversial is that the replication should pass
a manipulation check (Frank, 2017). In the present case, a successful manipulation of failure
would mean that students did about as well on the problems as they would be expected to do if
they were guessing randomly. In the Mueller and Dweck (1998) data, failure trial scores were no
different from the expected score under random guessing (1.6 problems out of 10), t(122) = .51,
p = .610. But students vastly outperformed chance in the Li and Bates data, t(622) = 32.43, p < .001,
suggesting that the study did not pass a manipulation check. In fact, 89% of the Li and Bates
participants got more problems right than expected by chance alone.

Footnote: Li and Bates (in press) may have used problems that were similar to those used by Mueller and Dweck (1998), but
replication does not mean simply re-using the same materials, particularly when conducting research in a
different culture or among students with different ability levels (see Frank, 2017). The replicating researchers must
pilot the problems to make sure that they recreate the psychological conditions under which the original effect
emerged, in this case by creating a failure trial for their participants. Communicating with the original authors could
have made this clear, which is why the field’s standards for replication should have been followed in this case.
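The chance-guessing comparisons can be reproduced approximately from the summary statistics above. We use a normal approximation for the p-value (adequate at these degrees of freedom) and take n as the reported df plus 1. Note that recomputing from the rounded means gives a Mueller and Dweck t of about 0 rather than the published .51 (which was computed from raw data), and a Li and Bates t near, but not exactly equal to, the published 32.43; the small discrepancies reflect rounding.

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """One-sample t statistic for H0: population mean equals mu0."""
    return (mean - mu0) / (sd / math.sqrt(n))

def normal_two_sided_p(t):
    """Two-sided p-value via the normal approximation (fine at large df)."""
    return math.erfc(abs(t) / math.sqrt(2))

chance = 1.6  # expected score under random guessing (1.6 problems out of 10)

# Rounded summary statistics quoted above; n = reported df + 1.
t_md = one_sample_t(1.60, 1.27, 123, chance)  # Mueller & Dweck failure trial
t_lb = one_sample_t(4.68, 2.32, 623, chance)  # Li & Bates "failure" trial

print(f"M&D vs. chance: t = {t_md:.2f}; L&B vs. chance: t = {t_lb:.2f}")
```

The contrast is stark: one sample is indistinguishable from chance-level performance, while the other exceeds chance by more than thirty standard errors.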
What this means is that the conditions for testing the Mueller and Dweck hypothesis
(the conditions laid out in the original published report) were in large part absent. The effects of
failure on subsequent performance as a function of praise condition could not be fully assessed.
In this context, it is not surprising that the Li and Bates studies produced less consistent
significance-test results than Mueller and Dweck. In fact, it is remarkable that they replicated
the Mueller and Dweck results at all.
Taking the insufficient failure trial into account (on an exploratory basis) might also
resolve the puzzle of why Study 2 in Li and Bates (in press) seemed to show such weak evidence
relative to Studies 1 and 3. Study 2’s participants did significantly better during the “failure” trial
(Trial 2) than the participants in the other two studies (Studies 1 and 3: M = 4.45, SD = 2.30;
Study 2: M = 5.09, SD = 2.31), t(476) = 2.764, p = .006. In fact, the number of problems solved on the
failure trial in Li and Bates’ Study 2 was roughly equivalent to the number of problems solved on the Mueller and Dweck success trials. Thus, the one Li and Bates study that did not yield a behavioral effect of praise was the least effective at creating a failure experience.

The present analysis of failure trial differences (relative to Mueller & Dweck, 1998) also invalidates the use of the “small telescopes” argument to challenge the statistical power of the original Mueller and Dweck (1998) studies. As Simonsohn (2015) stated, “differences in materials, populations, and measures lead to differences in the true effect under study” (p. 567), and the differences in the Li and Bates procedures were, as we showed, quite large.
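Cross-study contrasts like the Study 2 versus Studies 1 and 3 comparison above can be recomputed from published summary statistics alone. A minimal pooled-SD sketch, shown with round illustrative numbers (the per-study sample sizes behind the reported t(476) are not restated here):

```python
import math

def pooled_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Two-sample t statistic (pooled SD) from reported means, SDs,
    and sample sizes -- all that a published table provides."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df  # pooled variance
    t = (m2 - m1) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df

# Sanity check with round numbers: two groups of 50 with means 0 and 1
# and SD = 1 in each group give t = 1 / sqrt(2/50) = 5 exactly, df = 98.
t, df = pooled_t_from_summary(0.0, 1.0, 50, 1.0, 1.0, 50)
```

This kind of summary-statistics re-computation is exactly what makes published means and SDs valuable for later re-analysis.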
What about the “active control” condition? As we noted above, Li and Bates included
what they call an “active control” condition in Studies 2 and 3. The students in that condition,
after trial 1, received the statement: “You can't change your basic ability, but you work at things,
and that's how we get hard things done.” Li and Bates (in press) included this condition to
provide an alternative interpretation for the Mueller and Dweck (1998) findings. That is, their
inclusion of a “fixed mindset” statement before an effort statement was meant to rule out a
“growth mindset” as the explanation of any positive effects of effort praise.
Li and Bates’ (in press) logic is flawed in two respects, and it is worth pointing these
flaws out so that the field can guard against them more successfully. First, Li and Bates claim
that if their new active control condition yields significant results, they can declare that there is
no evidence in support of the original findings. But, as suggested earlier, this represents a
confusion between replication and mechanism. Replications ask: Did the same conditions yield
the same results? The testing of a new mechanism or interpretation, even if successful, does not nullify findings that did, in fact, replicate.
Second, the alternative condition proposed by Li and Bates does not, in fact, support an
alternative explanation for effort praise. Part of the confusion comes from Li and Bates’ incorrect statements that the Mueller and Dweck praise studies were mindset studies. In mindset studies, participants receive explicit instruction about the malleability of ability (Yeager et al., 2016). In contrast, in Mueller and Dweck (1998), mindsets (then called “theories of intelligence”) were simply measured as dependent variables in two studies, because Mueller and Dweck (1998) hypothesized that the different forms of praise may orient children toward different mindsets. But it was attributions, not mindsets, that were the object of manipulation.

One cannot definitively answer the question of what the Li and Bates data would have looked like had there been a true failure trial. The best we can do is try to correct for the extremity of the Study 2 non-failure by equating the three studies with respect to the failure trial and asking whether the results of the three studies come more into line with each other when that is done. One can do so by controlling for Trial 2 performance, which we did in the alternative specification exploration described above and posted online. A primary limitation of such an analysis, of course, is that Trial 2 occurred after the praise. This is why it would have been ideal for Li and Bates simply to have included a true failure trial in the experiment, rather than our having to adjust for the failure trial statistically. And yet the Bayesian causal forest analysis, reported in footnote 4, uses machine-learning techniques to deal with potential confounding from a post-treatment covariate, and it yielded evidence supportive of the Mueller and Dweck (1998) conclusion.
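The covariate adjustment discussed above (controlling for Trial 2 performance when comparing praise conditions) can be roughly sketched by residualizing the outcome on Trial 2 and then comparing conditions. This is only an illustrative stand-in for the regression and Bayesian causal forest analyses actually reported, and the data below are toy, noise-free numbers:

```python
def residualize(y, x):
    """Residuals of y after a simple linear regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def adjusted_condition_difference(outcome, trial2, condition):
    """Mean outcome difference between conditions after partialling out
    Trial 2 performance (a rough stand-in for the reported adjustment)."""
    resid = residualize(outcome, trial2)
    g1 = [r for r, c in zip(resid, condition) if c == 1]
    g0 = [r for r, c in zip(resid, condition) if c == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

# Toy noise-free data: outcome = 2 * Trial2 + a 1-point condition effect,
# with Trial 2 scores balanced across the two conditions.
trial2 = list(range(1, 11)) * 2
condition = [0] * 10 + [1] * 10
outcome = [2 * t2 + c for t2, c in zip(trial2, condition)]
diff = adjusted_condition_difference(outcome, trial2, condition)
```

In this constructed example the adjusted difference recovers the 1-point condition effect exactly; with real post-treatment covariates, as the footnote notes, such adjustment is only approximate.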
As a reminder, students in the intelligence praise condition were told, “You must be smart at these problems,” whereas students in the effort praise condition were told, “You
must have worked hard at these problems.” These statements are giving them the explanation, or
the attribution, for their success; the statements are not telling them anything about the nature of
the ability. As further support for this, the word “attribution(s)” was used 64 times in the Mueller
and Dweck article; the mindset-related words used at the time (such as lay theory, implicit
theory, entity theory, or incremental theory) were used a total of 8 times.
Notably, the Li and Bates active control (“You can't change your basic ability, but you
work at things, and that's how we get hard things done.”) also presents an effort attribution for
success. Therefore, rather than presenting an alternative interpretation, if effective it would
present further confirmation of the effects of effort attributions. Thus, there is no basis for
rejecting the Mueller and Dweck interpretation as invalid. Incidentally, when pooling all the data
across Studies 2 and 3, Li and Bates’ active control condition actually showed no significant
effect, t(286) = 1.371, p = .171. So there is no basis for making strong conclusions either way
regarding the active control condition.
Mueller and Dweck (1998) has sometimes been cited or portrayed in the past as demonstrating an antecedent of mindset or as relevant to mindsets. But, strictly speaking, it is not a “mindset study” because it does not contain a mindset manipulation. Therefore, any inferences about the relevance of the results from the attribution manipulation in Li and Bates (or Mueller & Dweck, 1998) to the broader question of mindset interventions are unwarranted.
In the present case, the primary Mueller and Dweck results replicated, and the “new”
interpretation would have been seen as consistent with the old one, had it been significant
overall, in that both tested effort attributions. This is another reason why it is important for replicators to communicate with the original authors. The parties can then transparently decide, in advance, what would constitute a test of a disputed theory and agree to the interpretation whatever the study shows (see, e.g., the protocols for “adversarial collaborations”; Bateman, Kahneman, Munro, Starmer, & Sugden, 2005).
General Discussion
The present re-analysis showed that simply pooling all the data described as a direct
replication of Mueller and Dweck (1998) yielded a significant replication, as did dozens of
justifiable alternative analyses. At one level this result is not surprising given the robust evidence
in the original paper. Indeed, our re-analysis (but not the analysis reported by Li and Bates) conforms to the expectations that would be set by a strongly supportive p-curve analysis (see Appendix 1), because p-curve results are known to predict replicability (e.g., Simonsohn, Nelson,
& Simmons, 2019). At another level, however, the Li and Bates results are rather surprising
given the potentially critical differences in methods and cultural differences in the study
population, no reported piloting, and insufficient transparency about the translation and
appropriateness of the language used in the protocols and questionnaires.
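The p-curve logic invoked above can be illustrated with a toy computation. This crude sketch (a simple exact binomial test for right skew among significant p values) is our illustration of the intuition, not the Simonsohn, Nelson, and Simmons implementation, and the p values below are hypothetical:

```python
import math

def pcurve_binomial(p_values, alpha=0.05):
    """Crude p-curve check. Under the null of no true effect, significant
    p values are uniform on (0, alpha), so each has a 50% chance of
    falling below alpha/2; a surplus of very small p values (right skew)
    suggests evidential value. Returns the number of significant results,
    how many fall below alpha/2, and the one-sided exact binomial tail."""
    sig = [p for p in p_values if p < alpha]
    k = sum(1 for p in sig if p < alpha / 2)
    n = len(sig)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return n, k, tail

# Hypothetical p values for illustration only:
n, k, tail = pcurve_binomial([.001, .003, .004, .011, .02, .041, .28])
```

Here six of the seven p values are significant, five of those six sit below .025, and the binomial tail probability quantifies how surprising that skew would be under the null.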
Despite this supportive evidence, Li and Bates’ (in press) paper is nevertheless
disappointing, and the ways in which it is disappointing can lead to improvements in the field’s
replication practices. First, it is disappointing to see strong claims of null effects in the article when those claims were not warranted. Next, it was disappointing that an “active control” was given a prominent
place in Li and Bates’ (in press) paper when it had no significant effect overall and would not
have overturned Mueller and Dweck’s interpretation even if it had. And it is disappointing to see
an inadequate application of methodological standards to guard against replicator degrees of
freedom in the design and analysis of a study. All of these limitations are addressable by simply
upholding well-known standards for rigorous research.
But, even more importantly, it is disappointing any time an opportunity to deepen the
field’s knowledge about a phenomenon is missed. Li and Bates (in press) could have conducted
an informative replication (really a generalization test) in a very different culture. They could
have asked whether and how the results might differ in China and they could have formed and
tested hypotheses about what those differences might be. More than that, they could have
proposed and tested mediators and moderators of any effects to add to our knowledge of when,
where, and why the praise effects are found or not found. Instead, the study was couched as a
replication purporting to overturn “mindset theory”—a theory that was never tested in Mueller
and Dweck (1998) or in the present study.
Indeed, leading statisticians have argued that posing simple yes/no questions (“is the
effect real?”) is at the root of the very replication crisis that independent replication studies are
supposed to correct. As Gelman (2014) has stated, we should “move away from is-it-there-or-is-
it-not-there to a more helpful, contextually informed perspective” (p. 5). This means it is a
missed opportunity when replication studies do not help us understand the conditions under
which an effect will appear. Since all psychological phenomena have boundary conditions, it is a
service to the field to add to knowledge of what they are. This kind of information can then
greatly inform practice in the real world.
Critically, knowledge of boundary conditions is weak when it is post-hoc, as it will be
when studies use the approach implemented by Li and Bates (in press). Knowledge of boundary
conditions is much stronger when samples, protocols, and analysis plans are designed, from the
start, to identify heterogeneous conditions under which an effect appears more strongly or more
weakly (cf. the National Study of Learning Mindsets, Yeager et al., 2019).
As an aside, many questions remain about effort praise, and they are worthy of future
research. For example, effort praise can backfire if a student is praised on an easy task (since this can imply that the adult thinks they have low ability; Meyer, 1992). Thus, we do
not see effort praise as a panacea or an unadulterated good. Instead we typically recommend that
“process praise” in the real world include not just praise for effort, but also praise for taking on
challenges, trying new strategies, persisting, and seeking appropriate help or learning resources.
In this context, we also note that in the Mueller and Dweck praise work, the effort praise did not
have the starring role. Rather, it was the intelligence praise (a form of “person praise”) that was
featured in the title of the paper and that provided the most novel, counterintuitive findings in the paper: the findings that praising intelligence could undermine later (post-failure) performance.
All of this underscores the point that when replicators and original authors collaborate in
good faith there is a unique opportunity for potentially important new knowledge to be
generated, thus furthering the goals of science. When they do not collaborate, this shortchanges both the parties involved and the field as a whole. We therefore emphasize the emerging vision
of science (both original studies and replication studies) as a collaborative endeavor that seeks to
expand knowledge and not just make a yes-no judgment about a body of work. At its best,
science explores mechanisms, finds boundary conditions, and informs practice in an open-
minded and transparent way. And we call upon the field to support this exciting vision of
replications so that we are all served by open science as it was meant to be.
Association for Psychological Science. (2014). Registered Replication Reports. Retrieved
August 27, 2019, from Association for Psychological Science website:
Bateman, I., Kahneman, D., Munro, A., Starmer, C., & Sugden, R. (2005). Testing competing
models of loss aversion: An adversarial collaboration. Journal of Public Economics,
89(8), 1561–1580.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual
sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences,
113(23), 6454–6459.
Bryan, C., Yeager, D. S., & O’Brien, J. (in press). Replicator degrees of freedom allow
publication of misleading “failures to replicate.” Proceedings of the National Academy of
Sciences of the United States of America.
Diener, C. I., & Dweck, C. S. (1978). An analysis of learned helplessness: Continuous changes
in performance, strategy, and achievement cognitions following failure. Journal of
Personality and Social Psychology, 36(5), 451.
Frank, M. (2017, February 15). Damned if you do, damned if you don’t. Retrieved September 1,
2019, from Babies Learning Language website:
Gelman, A. (2014). The connection between varying treatment effects and the crisis of
unreplicable research: A Bayesian perspective. Journal of Management.
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not
itself statistically significant. The American Statistician, 60(4), 328–331.
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the
reproducibility of psychological science.” Science, 351(6277), 1037.
Hahn, P. R., Murray, J. S., & Carvalho, C. (2017). Bayesian regression tree models for causal
inference: Regularization, confounding, and heterogeneous effects. arXiv:1706.09523.
Hedges, L. V., & Schauer, J. M. (2018). Statistical analyses for studying replication: Meta-
analytic perspectives. Psychological Methods. Advance online publication.
Ioannidis, J. P. A., & Trikalinos, T. A. (2005). Early extreme contradictory estimates may appear
in published research: The Proteus phenomenon in molecular genetics research and
randomized trials. Journal of Clinical Epidemiology, 58(6), 543–549.
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., … Nosek, B.
A. (2018). Many Labs 2: Investigating variation in replicability across samples and
settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
Li, Y., & Bates, T. C. (in press). You can’t change your basic ability, but you work at things, and
that’s how we get hard things done: Testing the role of growth mindset on response to
setbacks, educational attainment, and cognitive ability. Journal of Experimental
Psychology: General.
Luttrell, A., Petty, R. E., & Xu, M. (2017). Replicating and fixing failed replications: The case of
need for cognition and argument quality. Journal of Experimental Social Psychology, 69.
Meyer, W.-U. (1992). Paradoxical effects of praise and criticism on perceived ability. European
Review of Social Psychology, 3(1), 259–283.
Mueller, C. M., & Dweck, C. S. (1998). Praise for intelligence can undermine children’s
motivation and performance. Journal of Personality and Social Psychology, 75(1), 33–52.
Protzko, J. (2018). Null-hacking, a lurking problem in the open science movement. PsyArXiv.
Schooler, J. W. (2014). Turning the lens of science on itself: Verbal overshadowing,
replication, and metascience. Perspectives on Psychological Science, 9(5), 579–584.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer.
Journal of Experimental Psychology: General, 143(2), 534–547.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2019). P-curve won’t do your laundry, but it
will distinguish replicable from non-replicable findings in observational research:
Comment on Bruns & Ioannidis (2016). PLOS ONE, 14(3), e0213454.
Srivastava, S. (2012, September 27). A Pottery Barn rule for scientific journals. Retrieved
August 27, 2019, from The Hardest Science website:
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results.
Psychological Science, 26(5), 559–569.
Weiner, B., & Kukla, A. (1970). An attributional analysis of achievement motivation. Journal of
Personality and Social Psychology, 15(1), 1–20.
Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C., … Dweck,
C. S. (2019). A national experiment reveals where a growth mindset improves
achievement. Nature, 573, 364–369.
Yeager, D. S., Romero, C., Paunesku, D., Hulleman, C. S., Schneider, B., Hinojosa, C., …
Dweck, C. S. (2016). Using design thinking to improve psychological interventions: The
case of the growth mindset during the transition to high school. Journal of Educational
Psychology, 108(3), 374–391.
Appendix 1. P-curve analysis of Mueller & Dweck (1998).