rsos.royalsocietypublishing.org

Research

Cite this article: Ingre M, Nilsonne G. 2018 Estimating statistical power, posterior probability and publication bias of psychological research using the observed replication rate. R. Soc. open sci. 5: 181190. http://dx.doi.org/10.1098/rsos.181190

Received: 25 July 2018
Accepted: 3 August 2018

Subject Category: Psychology and cognitive neuroscience
Subject Areas: psychology/statistics
Keywords: selection bias, prior probability, falsification

Author for correspondence: Michael Ingre (e-mail: michael.ingre@gmail.com)

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.4215926.
Estimating statistical power, posterior probability and publication bias of psychological research using the observed replication rate

Michael Ingre (1,2) and Gustav Nilsonne (1,3,4)

(1) Department of Clinical Neuroscience, Karolinska Institutet, Solna, Sweden
(2) Institute for Globally Distributed Open Research and Education (IGDORE), Sweden
(3) Stress Research Institute, Stockholm University, Stockholm, Sweden
(4) Department of Psychology, Stanford University, Stanford, CA 94305, USA

ORCID: MI, 0000-0003-0678-4494; GN, 0000-0001-5273-0150

© 2018 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
In this paper, we show how Bayes’ theorem can be used to better understand the implications of the 36% reproducibility rate of published psychological findings reported by the Open Science Collaboration. We demonstrate a method to assess publication bias and show that the observed reproducibility rate was not consistent with an unbiased literature. We estimate a plausible range for the prior probability of this body of research, suggesting expected statistical power in the original studies of 48–75%, producing (positive) findings that were expected to be true 41–62% of the time. Publication bias was large, assuming a literature with 90% positive findings, indicating that negative evidence was expected to have been observed 55–98 times before one negative result was published. These findings imply that even when studied associations are truly NULL, we expect the literature to be dominated by statistically significant findings.
1. Introduction
The Open Science Collaboration (OSC) reported that 36% of
published positive findings in experimental psychology were
successfully replicated in independent attempts [1]. This finding is
interesting in itself as an indicator of the reproducibility of
published findings in psychology; however, it is also an important
data point that can be used together with other information to
assess publication bias, statistical power and even the posterior
probability of findings published in the psychological literature.
Another important set of observations concerns the proportion of positive findings in the literature. A series of observations
spanning five decades have indicated that more than 90% of published studies in psychology reported
positive findings, where the authors’ hypothesis was supported by data [2– 4]. A similar observation
was made by the OSC, where 97% of the original studies they replicated supported the proposed
hypothesis with a ‘statistically significant’ association [1].
There are many sources of bias in research. In the following analysis, we take advantage of the fact
that the OSC performed direct replications of the original studies, where the design, methods, materials,
study population and statistical analysis of the result were reproduced as closely to the original studies as
possible. This means that many methodological biases have been controlled and cannot explain
differences in the outcome between original studies and replications.
A large class of biases related to the process of publishing was not accounted for by the replications.
In the replications, only one test of the hypothesis was performed and the finding was reported
regardless of the result; however, the original studies had to make it through peer-review and were
subject to editorial policies that have been suggested to favour novel and positive findings [5],
creating selection bias in the published literature. Knowledge of this bias may also have caused
researchers to adapt their strategy when observing a negative result: they may have put negative
findings in the file drawer and looked for positive results in another study, or they may have tried to
repeatedly observe different results in the same study until they found one that was positive. The first
strategy creates bias that is generally known as the file drawer problem [6], and the latter is usually
referred to as selective reporting, HARKing [7] or p-hacking [8– 10]. They all produce a similar
selection bias where observed negative evidence is suppressed in favour of reporting positive findings.
We can estimate the collective magnitude of this publication bias by comparing the observed
proportion of positive findings in the published literature with the proportion of positive findings expected after a single test in the original studies that were replicated by the OSC.
In this paper, we show how the observed reproducibility and proportion of positive findings in the
literature can be used to better understand meta-properties of published psychological research. We
demonstrate a mathematical solution that can be used to assess expected statistical power, posterior
probability and publication bias of published psychological research. We aim to answer the following
questions:
— What are the properties of research that leads to 90% positive findings?
— What is the expected reproducibility of research with 90% positive findings?
— Is the observed 36% reproducibility rate consistent with an unbiased literature?
— What does the observed reproducibility suggest about the prior probability of the tested hypotheses, the statistical power of the studies, the posterior probability of the original findings and the extent of publication bias?
In the first part of our analysis, we use a naive approach. This analysis produces simple linear equations
that are valid for a single study; but when they are applied to a group of studies in the literature, they
assume that all studies have identical statistical power and tend to produce biased estimates when
there is large variance in statistical power. However, when statistical power is assumed to be very
high (i.e. more than 90%), there is little room for variance in power to influence the result, and the
naive calculations approximate more complex solutions. The second part of our analysis takes
variance in statistical power between studies into account, in order to produce more ecological
estimates of the published literature.
The mathematical exercises presented here were performed in R [11], and the source code needed to
reproduce all findings is available as the electronic supplementary material, appendix and on GitHub
(https://github.com/micing/publication_bias_psychology).
2. What are the properties of research that leads to 90%
positive findings?
The concept of prior probability from Bayesian theory [12] describes the probability that a hypothesis is
true before it has been tested on data. When considering a large number of hypotheses, prior
probability can also be understood as the proportion of hypotheses that are true a priori, that is, before
they have been tested on data. The prior probability of an individual hypothesis can be small and
close to zero, for example, in massively exploratory studies where vast amounts of data are searched
to try to find the few true associations that may exist; or it can be large and close to one, in
theoretically motivated confirmatory research with prior empirical support. We will use theta (θ) to denote prior probability.
We also need to consider the probability that a study testing a true hypothesis will produce positive
evidence. This is generally known as statistical power within a NULL hypothesis significance testing
(NHST) paradigm and is calculated from the type 2 error rate: 1 − β. Finally, we need to consider the test’s type 1 error rate that describes the probability of observing positive evidence when the hypothesis is false, which we will assume to be α = 0.05 in this text unless stated otherwise.
The probability of observing true-positive evidence is calculated by multiplying the prior probability with the statistical power of the study (equation (2.1)). The probability of observing false-positive evidence is the type 1 error rate multiplied by the prior probability that the hypothesis is false (equation (2.2)). Added together, they describe the total probability of observing positive evidence (equation (2.3)).

\[ P_{\mathrm{true}} = \theta(1-\beta), \tag{2.1} \]
\[ P_{\mathrm{false}} = \alpha(1-\theta) \tag{2.2} \]
\[ \text{and} \quad P_{\mathrm{total}} = \theta(1-\beta) + \alpha(1-\theta). \tag{2.3} \]
If a hypothesis is true a priori, we cannot observe false-positive evidence, and the probability of observing positive evidence reduces to the statistical power (1 − β). This shows that one way to produce 90% positive findings is to only test true hypotheses with 90% power. Another way to produce close to 90% positive findings is to run studies with perfect power (100%) on hypotheses of which 90% are true a priori. It should be noted that in such a situation we would actually expect to observe 90.5% positive evidence, because we would also observe a small number of type 1 errors when the hypothesis is false, as described by equation (2.2). The smallest prior that can produce 90% expected positive evidence is 89.5%, assuming perfect power. Thus, it is possible to produce an unbiased literature with more than 90% positive findings when the underlying research tests hypotheses that are more than 90% true a priori in studies with more than 90% statistical power.
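A minimal R sketch of equation (2.3) illustrates these two scenarios; the function and its arguments are our own shorthand for the quantities defined above.

```r
# Expected proportion of positive findings (equation (2.3)):
# true positives, theta * power, plus false positives, alpha * (1 - theta).
p_total <- function(theta, power, alpha = 0.05) {
  theta * power + alpha * (1 - theta)
}

p_total(theta = 1.000, power = 0.90)  # only true hypotheses, 90% power: 0.90
p_total(theta = 0.895, power = 1.00)  # smallest prior with perfect power: ~0.90
p_total(theta = 0.900, power = 1.00)  # 90% true priors, perfect power: ~0.905
```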
3. What is the expected reproducibility of research with 90%
positive findings?
To calculate the reproducibility of a positive research finding, we first need to calculate the probability of
such a finding to be true (rather than a type 1 error). We can do this by applying Bayes’ theorem [12] in
order to calculate the posterior probability:
\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \tag{3.1} \]
In this equation, we calculate the conditional probability of A given B. If we replace A with a hypothesis, and B with observing positive evidence, we can calculate the posterior probability of a hypothesis given that we have observed positive evidence. The numerator then describes the probability of observing positive evidence given that the hypothesis is true, which is statistical power, multiplied by the prior probability of the hypothesis; and this is precisely P_true that we defined earlier in equation (2.1). The denominator is the total probability of observing positive evidence, which is P_total defined by equation (2.3). Thus, we merely need to take the ratio P_true/P_total defined by equations (2.1)–(2.3), to complete a formulation of Bayes’ theorem that can be used to estimate the posterior probability of a hypothesis after observing positive evidence from NHST:
\[ \hat{\theta} = \frac{\theta(1-\beta)}{\theta(1-\beta) + \alpha(1-\theta)}. \tag{3.2} \]
When we know the posterior probability and statistical power, it is easy to calculate the probability of a positive finding being reproduced (R) in an identical independent study. Equation (2.3) above already showed how to calculate the probability of observing positive evidence, but in this case we substitute the assumed prior probability (θ) with the posterior probability (θ̂) of the finding:
\[ R = \hat{\theta}(1-\beta) + \alpha(1-\hat{\theta}). \tag{3.3} \]
As discussed above, with a perfect prior and 90% power, we would observe 90% positive findings
that are all true; the reproducibility of such a finding in an identical study is the same as the statistical
power 90%. At the other end of the spectrum, we find the smallest prior able to produce 90%
expected positive evidence at 89.5%, assuming perfect power; and applying equations (3.2) and (3.3)
indicates a posterior probability and reproducibility of such research at 99.4%. Thus, the expected
reproducibility of research producing more than 90% positive evidence falls in the range of 90–100%.
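The 99.4% figure can be verified directly from equations (3.2) and (3.3); a short R sketch, with function names chosen here for illustration:

```r
# Posterior probability of a hypothesis after a positive finding (equation (3.2))
posterior <- function(theta, power, alpha = 0.05) {
  theta * power / (theta * power + alpha * (1 - theta))
}

# Probability that a positive finding is reproduced in an identical study (equation (3.3))
reproducibility <- function(theta, power, alpha = 0.05) {
  post <- posterior(theta, power, alpha)
  post * power + alpha * (1 - post)
}

posterior(theta = 0.895, power = 1)        # ~0.994
reproducibility(theta = 0.895, power = 1)  # ~0.994
reproducibility(theta = 1, power = 0.9)    # 0.90: all findings true, power limits R
```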
4. Is the observed 36% reproducibility consistent with an
unbiased literature?
We can use the above information to create a tentative statistical test of bias of the published literature.
A binomial test on the observed reproducibility rate of 36% (95% CI 27–46%; n = 97) reported by the OSC indicates strong evidence (p < 10⁻¹⁵) that the replication studies were not drawn from a literature with 90% reproducibility. This conservative test, assuming the lower bound of reproducibility that is expected in an unbiased literature with 90% positive evidence, and identical power in the replication studies, indicates publication bias in the OSC sample, supporting the observation made in the original report of a right-skewed funnel plot [1].
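The test can be reproduced with a one-sided exact binomial test in base R; the count of 35 significant replications out of 97 is assumed here to match the reported 36% rate, and the null value 0.90 is the lower bound of reproducibility expected in an unbiased literature with 90% positive findings.

```r
# Exact binomial test of the observed replication count against 90% reproducibility
binom.test(x = 35, n = 97, p = 0.90, alternative = "less")
# The p-value falls far below 10^-15, rejecting a 90% long-run reproducibility rate.
```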
5. Incorporating reproducibility into Bayesian calculations
One complication with applying Bayes’ theorem (equation (3.2)) is that it is based on several unknown
variables. We usually have a good idea of the type 1 error rate that is applied in research, but prior
probability and statistical power are often elusive. We can sometimes make informed guesses [13] and
calculate the posterior probability, as illustrated above, but with three unknown variables, statistical power (1 − β), prior (θ) and posterior (θ̂), there is only a limited amount of information we can extract from data. We want to reduce the number of unknown variables to only two, so that we can learn more useful information.
A first step in this process is to form a system of equations based on equations (3.2) and (3.3), so that
we can incorporate the observed reproducibility into our calculations (equation (5.1)):
\[
\left\{
\begin{aligned}
\hat{\theta} &= \frac{\theta(1-\beta)}{\theta(1-\beta) + \alpha(1-\theta)}\\
R &= \hat{\theta}(1-\beta) + \alpha(1-\hat{\theta})
\end{aligned}
\right. \tag{5.1}
\]
If we knew the type 1 error rate (α) and the probability of a positive finding being reproduced in an identical study (R), equation (5.1) would have only two unknowns (β and θ) and we could solve it to find the statistical power (1 − β) needed for any assumed prior (θ).
6. Accounting for variance in statistical power
So far, we have used a naive approach that is valid for a single hypothesis tested in identical studies, but
when applied to a group of studies published in the literature, it assumes that all studies have identical
statistical power, which is not plausible in general. Equations (6.1) and (6.2) below take variance into account by integrating the result over a probability density function (f) with mean μ_β, describing the distribution of statistical power (1 − β) between studies. Assuming that we know the type 1 error rate (α) and prior probability (θ) of the research, these equations produce the expected posterior probability (equation (6.1)) and the expected reproducibility (equation (6.2)) of the research; the complement of the mean of f is also the expected statistical power (1 − μ_β):
\[ E[\hat{\theta}] = \int_0^1 f(\beta)\,\frac{\theta(1-\beta)}{\theta(1-\beta) + \alpha(1-\theta)}\,\mathrm{d}\beta \tag{6.1} \]
\[ \text{and} \quad E[R] = E[\hat{\theta}](1-\beta) + \alpha(1 - E[\hat{\theta}]). \tag{6.2} \]
Statistical power is a function of the true effect size and the sample size of the study and does not
have a well-defined sample distribution. Empirical studies based on a large number of meta-analyses
rsos.royalsocietypublishing.org R. Soc. open sci. 5: 181190
4
indicate a bimodal distribution of power in the published literature, where a large proportion of studies
have either very low or very high power [14,15]. We digitized the data on three research areas (somatic,
psychiatric and neurological) presented by Dumas-Mallet et al. [14, figs 1 and 2] (see electronic
supplementary material) and found that expected power was approximately in the range of 30– 39%
with variance 0.09 –0.12. When only significant meta-analyses were considered, as an attempt to
remove most true NULL associations, bimodality was reduced and expected power increased to about
42–51% with variance 0.08 –0.11. We used these estimates as a starting point to find suitable
distribution functions.
Figure 1 shows six distributions based on the Beta distribution function. The Beta distribution is defined by two shape parameters, and the mean is calculated as μ = s₁/(s₁ + s₂). Figure 1a,c,e shows Beta distributions defined only by a single shape parameter (s), and the mean (μ) is used to calculate the second shape parameter. Panels in figure 1b,d,f are defined similarly, but describe bimodal distributions, calculated as the weighted average of two separate Beta distributions with fixed location means. The distribution that most closely matches the variances observed by Dumas-Mallet et al. is shown in figure 1c (s = 1/2), and we used it to model likely estimates. A range of alternative variances was defined between a smaller variance (figure 1a) and a larger variance (figure 1b). The distribution in figure 1f was used to model extreme variance.
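The sketch below shows one way to construct such distributions in R: a Beta density on power parametrized by a single shape parameter s and a target mean m, and a bimodal mixture of two such densities. The mixture weighting and the use of the full (0, 1) range are our own simplifying assumptions, so the variances only approximate those in figure 1.

```r
# Beta density on statistical power with shape1 = s and mean m, using
# mu = s1 / (s1 + s2)  =>  s2 = s * (1 - m) / m.
dpower <- function(p, s, m) dbeta(p, shape1 = s, shape2 = s * (1 - m) / m)

# Bimodal variant: mixture of two Beta densities with fixed means lo and hi
# (locations as in figure 1b,d); the mixing weight w is chosen to give mean m.
dpower_bimodal <- function(p, s, m, lo = 0.145, hi = 0.905) {
  w <- (hi - m) / (hi - lo)
  w * dpower(p, s, lo) + (1 - w) * dpower(p, s, hi)
}

# Exact variance of the mean-parametrized Beta: m(1 - m) / (s1 + s2 + 1)
beta_var <- function(s, m) m * (1 - m) / (s + s * (1 - m) / m + 1)
beta_var(s = 1/2, m = 0.525)  # about 0.128 for the s = 1/2 distribution
```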
[Figure 1. Beta distributions used to model variance in statistical power. (a,c,e) Beta distributions defined by a single shape parameter and a mean that was used to calculate the second shape parameter: μ = s₁/(s₁ + s₂). The shape parameters are s = 1 (a), s = 1/2 (c) and s = 1/3 (e). (b,d,f) Bimodal distributions that are also parametrized with a single shape parameter and a mean, describing the weighted average of two Beta distributions with fixed location means at the 10th and 90th percentile of the distribution (i.e. power = 0.145 and 0.905) for (b,d) (s = 1 and s = 2) and at the 5th and 95th percentile (power = 0.0975 and 0.9525) for (f) (s = 1). Panel (c) (s = 1/2) was used to model variance for the likely range, and alternative variances were modelled between (a) (small variance) and (b) (large variance). Extreme variance was modelled using the distribution in (f).]
7. Defining replication statistical power to solve the equations
In the discussion below, we use subscripts (o and r) to separate statistical power of the original studies (1 − β_o) defined in equation (6.1) from power in replication studies (1 − β_r) defined in equation (6.2).

The OSC determined the replication sample sizes from power analyses based on the reported effect sizes of the original studies. Such estimates are known to be inflated in the presence of publication bias [16] and cannot be used in our calculations. Data downloaded from the OSC GitHub repository [17] show that 70% of replications were designed with a larger sample than the original study, 10% had the same sample size and 20% were smaller than the original study, indicating that statistical power was on average higher in the replication studies. This information can be used to calculate upper and lower bounds of statistical power. The lower bound assumes identical power in original and replication studies, i.e. β_r = β_o, and the upper bound assumes perfect power in the replication studies, β_r = 0. This reduces the number of unknowns to only two, and when power can be expected to be higher in the replication studies, it defines the boundaries of a range in which the true value must fall.

We can attempt a more precise approximation of power based on the median degrees of freedom of original studies (d.f. = 54) and replication studies (d.f. = 68) reported by the OSC. The observed median effect size in the replication studies (r = 0.2) is likely to be attenuated by the presence of NULL associations in data, and the observed effect size in the original studies (r = 0.4) is likely to be inflated by publication bias; thus, the true effect size is likely to fall between these two estimates. Calculating statistical power for the range 0.2 < r < 0.4 shows that a median-sized replication study added approximately 6–10% statistical power (10% at midpoint: r = 0.3) compared with the original study. This estimate gives an approximation of the increase in statistical power we can expect for the replication studies and allows for a likely range of power to be defined between β_r = β_o − 0.06 and β_r = β_o − 0.10. This gives two additional applications of equations (6.1) and (6.2) with only two unknown variables that define a range in which the true value is likely to fall.
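The 6–10% figure can be approximated with a standard power calculation for a correlation test. The sketch below uses the Fisher z approximation with n = d.f. + 2, which may differ slightly from the exact routine used for the published estimate.

```r
# Approximate power of a two-sided test of a correlation r at significance
# level alpha, using the Fisher z transformation.
power_r <- function(r, n, alpha = 0.05) {
  z  <- atanh(r) * sqrt(n - 3)   # expected value of the test statistic
  zc <- qnorm(1 - alpha / 2)     # two-sided critical value
  pnorm(z - zc) + pnorm(-z - zc)
}

# Median original study (d.f. = 54, n = 56) vs. median replication (d.f. = 68, n = 70)
sapply(c(0.2, 0.3, 0.4), function(r) power_r(r, 70) - power_r(r, 56))
# roughly +0.07, +0.10 and +0.06 additional power, i.e. about 6-10%
```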
8. Implications of observed reproducibility on prior probability of the
tested hypotheses, statistical power of the studies, posterior probability
of the original findings and publication bias
Assuming a true reproducibility rate of R = 0.36 (equations (3.3) and (6.2)) as reported by the OSC, a type 1 error rate of α_o = 0.05 for the original research (equations (3.2) and (6.1)), and, because the OSC used two-tailed tests of directional hypotheses, α_r = 0.025 in the replication studies (equations (3.3) and (6.2)), together with the four different conditions of statistical power discussed above (β_r = β_o, β_r = 0, β_r = β_o − 0.06 and β_r = β_o − 0.10), we have only two unknown variables left (θ and β_o) and we can solve these equations to calculate the expected statistical power (1 − μ_β) for any assumed prior probability (θ).
Equations (6.1) and (6.2) were solved as a system of two simultaneous equations using an optimizer,
applying several different distributions of statistical power (figure 1). Equation (5.1) was solved
analytically to represent the extreme boundary of zero variance in power. Solving these equations
produced the expected statistical power of the research together with the corresponding expectation of
the posterior probability of the original findings. We then applied equation (2.3) to calculate the
expected probability of observing positive findings and compared that estimate with the approximately 90% positive findings that have been observed in the literature, in order to assess publication bias. These results are summarized in figure 2 and the complete solution is presented in
the electronic supplementary material.
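An outline of this solve, built from the quantities defined earlier, is sketched below: for an assumed prior θ it searches for the mean power of the original studies that makes the expected reproducibility (equations (6.1) and (6.2)) equal the target of 0.36, assuming a mean-parametrized Beta distribution of power and +8% power in the replications. These simplifications differ from the published solution (which also covers the bimodal, zero and extreme variance conditions; see the released code), so the output is illustrative only.

```r
# Expected reproducibility for prior theta and mean original-study power mu,
# assuming power follows a Beta distribution with shape1 = s and mean mu,
# and that replication power is mu + 0.08 (midpoint of the likely range).
expected_R <- function(theta, mu, s = 1/2, alpha_o = 0.05, alpha_r = 0.025) {
  E_post <- integrate(function(p) {               # equation (6.1)
    dbeta(p, s, s * (1 - mu) / mu) * theta * p / (theta * p + alpha_o * (1 - theta))
  }, lower = 0, upper = 1)$value
  E_post * min(mu + 0.08, 1) + alpha_r * (1 - E_post)  # equation (6.2)
}

# Find the expected original-study power consistent with R = 0.36 for a given prior
solve_power_mean <- function(theta, target = 0.36) {
  uniroot(function(mu) expected_R(theta, mu) - target,
          interval = c(0.10, 0.80))$root
}

solve_power_mean(theta = 0.10)
```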
The findings presented in figure 2 give insights into a plausible range of prior probabilities of tested hypotheses in psychology. Figure 2a shows that the prior probability of the underlying research was not likely to be θ < 0.025, because that would imply better than perfect expected power of the original research; and our suggested likely range, assuming +6–10% power in the replication studies, does not extend to θ < 0.027, because it would imply better than perfect power in the replications.

The prior was also unlikely to be smaller than θ < 0.05; while the lower bound of the power estimate at this prior fell at 50%, it is based on the implausible assumption of perfect power in the replications. The likely range suggests 73–75% expected power, which is quite optimistic, because such large statistical power has been indicated only for larger than medium effect sizes in psychological research [18,19]. A restricted range of priors was defined as 0.05 < θ < 0.20, which indicated expected power between 48 and 75%, and we assume this to be a plausible range in which the true prior is likely to fall.
Assuming smaller or larger alternative variances only marginally changed these estimates, to fall
between 45 and 76% expected power. The range between zero and extreme variance brings expected
power to 42– 77%.
With higher assumed prior probabilities, the posterior probability of the original research goes up, and statistical power has to come down to be consistent with the reported reproducibility rate of 36%. Assuming that one out of 10 tested hypotheses in this research were true a priori (θ = 0.1), the posterior of the original findings was expected at 52% and the reason that the OSC could only replicate 36% is explained by 67% power in the replication studies. In addition, power in the original research was 59%, and 1.2% of the replications were expected to report type 1 errors.

For the full range of plausible priors 0.05 < θ < 0.20, the expected proportion of true-positive findings in the original studies fell between 42 and 62%. The alternative variances increased the range to 41–65% and assuming zero to extreme variance increased it further to 41–69%.
[Figure 2. Expected statistical power and expected posterior probability of the original research replicated by the OSC (a,b) together with the expected proportion of observed positive evidence and the corresponding publication bias of the research (c,d), assuming a reproducibility rate of 36% and a literature with 90% positive evidence. The estimates were based on equations (6.1) and (6.2) for the range 0.025 < θ < 0.975 of assumed prior probabilities. The plots assume α_o = 0.05 in the original studies and α_r = 0.025 in the replication studies. The likely range assumes replication studies at 6–10% more statistical power than original studies, and that power in original studies followed a Beta distribution with shape parameter s = 1/2 (figure 1c). The alternative variances describe a range between a Beta distribution with shape parameter s = 1 (figure 1a) for smaller variance, and a bimodal distribution (figure 1b) for larger variance. The extreme variance estimate is based on a bimodal distribution (figure 1f) and the zero variance estimate is based on equation (5.1). Outer boundaries were calculated assuming anything from zero to extreme variance, and that statistical power in the replication studies fell between the power of the original studies and perfect power. X-axes of all plots and the y-axis of the publication bias plot (d) are on the log scale. Panels: (a) expected statistical power of the underlying research; (b) expected posterior probability of the original findings; (c) expected probability of observing positive evidence in the underlying research; (d) odds of observed negative evidence to be suppressed from publication; x-axes show the assumed prior probability.]
The most striking observation in figure 2 was the estimate of publication bias. Figure 2c indicates the expected proportion of positive evidence observed in the original studies to be between 8 and 14% for plausible priors; and this is also the distribution we would expect to observe in an unbiased literature. Assuming extreme variance brings this estimate up to a maximum of 15%. Figure 2d shows this estimate rescaled to odds of suppressing negative evidence in a literature with 90% positive evidence; even the lower bound of this estimate, above which the true estimate must fall if our assumptions hold, indicates that negative evidence was expected to be observed more than 16 times before one instance was published, over the whole range of priors plotted in figure 2. For the likely range and more plausible priors, 0.05 < θ < 0.20, we see an even more pronounced bias, indicating that negative evidence was likely to have been observed 55–98 times before one instance was published. Alternative variances suggest 53–99 times and the range between zero and extreme variance indicates 52–100. The lower end of the conservative outer bound fell in the range 49–94 for plausible priors.
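The rescaling from the expected proportion of positive evidence (figure 2c) to the odds of suppression (figure 2d) amounts to comparing the unbiased negative-to-positive ratio with the 1:9 ratio implied by a 90%-positive literature. The sketch below restates this step; the two example proportions are approximate values within the 8–14% range, chosen to show how odds of the reported magnitude arise.

```r
# Odds that an observed negative result is suppressed, given an expected
# proportion of positive evidence P and a published literature that is 90%
# positive: observed negatives per positive, (1 - P) / P, divided by
# published negatives per positive, 1/9.
suppression_odds <- function(P, positive_share = 0.90) {
  ((1 - P) / P) * (positive_share / (1 - positive_share))
}

suppression_odds(P = c(0.084, 0.14))  # roughly 98 and 55
```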
9. Assuming reproducibility rates other than 36%
So far, we have assumed the reproducibility rate to be 36%, which was the point estimate reported by the
OSC. However, this is an estimate with uncertainty as indicated by the outer bounds of the 95%
confidence intervals at 27 and 46% reproducibility. In figure 3, we expand our analysis to other
plausible reproducibility rates, based on different confidence intervals of the OSC estimate. The analysis assumes +8% power in the replication studies to reflect the midpoint of our previous estimates and uses the same variance assumption as the likely range in figure 2. The different lines represent reproducibility rates at the outer bounds of the 50, 75 and 95% confidence intervals.

In general, assuming larger true reproducibility rates increased expected power and posterior probability of the research, but the expected proportion of positive evidence was only marginally affected. Assuming conservatively that the true reproducibility rate falls at the upper end of the 95% confidence interval (R = 46%), we expect to observe positive evidence 9–16% of the time for plausible priors (i.e. 0.05 < θ < 0.20), and the publication bias estimate indicates that negative evidence was suppressed 48–90 times before one instance was published, assuming a literature with 90% positive evidence.
10. Discussion
In this paper, we show how Bayes’ theorem can be used to better understand implications of the
observed 36% reproducibility rate of published psychological findings that was reported by the OSC
[1]. We demonstrated a method to assess publication bias and performed a tentative test indicating
that the observed reproducibility rate was not consistent with an unbiased literature. We presented a
mathematical solution and used it to estimate plausible ranges of expected statistical power, posterior
probability, probability to observe positive evidence and publication bias of the underlying research.
We used Bayes’ theorem to calculate the expected (marginal) posterior probability assuming a known
prior probability of the hypothesis, in order to solve a system of equations and find the expected
statistical power needed to produce an expected reproducibility. Our solution produced the
expectation after a large number of trials and does not allow for proper confidence (or credible)
intervals to be computed. This differs from a full Bayesian model that makes explicit assumptions of
prior distributions in order to estimate the posterior distribution of the parameters from the raw data
[20] and reflects the limitations of using summary statistics for the analysis.
To perform these exercises, we made several assumptions: we assumed a prior probability that was
independent of the statistical power of the studies testing the hypotheses; furthermore, we assumed that
published research was reproducible 36% of the time [1], that replication studies had 6–10% better power
than the original studies, that the variance in statistical power was similar to observations made in meta-meta-analyses [14] and that the literature presents 90% positive findings supporting the authors’
hypothesis [1–4]. The validity of our likely estimates depends on the validity of these assumptions.
However, we also produced estimates for a range of plausible reproducibility rates and estimates
based on alternative variances. In addition, we calculated outer boundaries that are valid for a range
from zero to extreme variance and only assumed that the replication studies had higher power than the
original studies.
The results showed that a long-term reproducibility rate of 36% is not consistent with a prior smaller than θ < 0.025, because it would imply better than perfect expected statistical power of the research. The
prior was also unlikely to be smaller than θ < 0.05, because it would imply more than 73% expected power of the original research, which is an optimistic assumption. We suggest a plausible prior somewhere in the range 0.05 < θ < 0.20, indicating expected statistical power at 48–75%. We found that 42–62% of the original findings were expected to be true and that the reproducibility rate observed by the OSC was lower due to less than perfect power in the replications. Publication bias was large, assuming a literature with 90% positive findings, indicating that negative evidence was expected to be observed approximately 55–98 times before one negative result was published. Estimates of publication bias were robust and only marginally affected even assuming extreme variance, and assuming true replication rates up to 46%, representing the upper limit of the 95% confidence interval of the reproducibility estimate reported by the OSC.
Another analysis of the OSC data by Johnson et al. [20] focused on observed effect sizes and was
restricted to the subsample for which a correlation (r) with standard errors could be derived (73/100
studies). This subsample had 71 positive findings and the observed reproducibility rate was 41%. The
authors estimated approximately 93% true NULL hypotheses in this research, i.e. θ = 0.07. Furthermore, they estimated α_o = 0.052 and that both original studies and replication studies had 75% power, to arrive at an estimated posterior of 37/71 = 52% of the original positive findings. They also
indicated that approximately 700 hypothesis tests were performed to produce the 71 positive and two negative published findings in the sample, suggesting that more than 600 negative findings had been observed in the process.

[Figure 3. Expected statistical power and expected posterior probability of the original research replicated by the OSC (a,b) together with the expected proportion of observed positive evidence and the corresponding publication bias of the research (c,d), assuming different true replication rates and a literature with 90% positive evidence. The estimates were based on equations (6.1) and (6.2) assuming +8% power in the replication studies to reflect the midpoint of the likely range presented in figure 2. The variance of statistical power was also the same, assuming a Beta distribution with s = 1/2 (figure 1c). The lines describe the reported reproducibility rate (36%) and estimates at the outer limits of the 50, 75 and 95% confidence intervals (i.e. R = 0.27, 0.30, 0.32, 0.36, 0.40, 0.42, 0.46). Panels: (a) expected statistical power of the underlying research; (b) expected posterior probability of the original findings; (c) expected probability of observing positive evidence in the underlying research; (d) odds of observed negative evidence to be suppressed from publication; x-axes show the assumed prior probability.]
Our analysis was based on the observed reproducibility (36%) for the full sample of positive findings replicated by the OSC. Also, because the replications were designed with larger sample sizes on average, we did not assume identical power in the replication studies and original studies and used that assumption only for the outer boundary. If we were to accept the prior suggested by Johnson et al. [20] (i.e. θ = 0.07), expected statistical power would be estimated in the range 65–67% in the original studies and 73–75% in the replication studies. The expected posterior of the original findings would be 46–47%. In addition, the expected proportion of positive evidence observed in the original studies was approximately 9.3%, suggesting that 97/0.093 = 1043 studies were needed to produce the 97 positive and three negative findings that were published and subsequently replicated by the OSC; this means that negative evidence was observed approximately 88 times before a negative finding was published, assuming a literature with 90% positive findings.
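The arithmetic behind these numbers can be restated in a few lines, assuming 9.3% expected positive evidence and a 90%-positive published literature:

```r
P <- 0.093                               # expected proportion of positive evidence
studies <- 97 / P                        # ~1043 studies behind 97 published positives
negatives_observed  <- studies * (1 - P) # ~946 negative results observed
negatives_published <- 97 / 9            # ~10.8 negatives in a 90%-positive literature
negatives_observed / negatives_published # ~88 observed negatives per published one
```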
The prior (θ = 0.07) suggested by Johnson et al. [20] implies more than 65% expected power of the original research. Such high power has been indicated for larger than medium (r = 0.3) effect sizes in psychological research [18,19] and is larger than empirical estimates of median power observed in other fields [14,15]. Considering that the median effect size observed in the replication studies by the OSC was only r = 0.2, assuming such high power is optimistic, but not implausible due to the likely attenuation of this estimate from the presence of NULL associations in data. We propose a plausible prior somewhere in the range 0.05 < θ < 0.20, corresponding to expected statistical power in the range 48–75% of the original studies. However, for completeness we presented results for the full range of assumed priors, so that readers can investigate the implications of assumptions that fall outside of our suggested range.
Applying Bayes’ theorem in this way has important implications: it assumes that hypotheses are either true or false, and such binary hypothesis testing has been criticized [21]. Indeed, it can be argued that there are no truly zero associations in observational data. If we assume that no associations are truly zero, but we are not interested in making inferences from very small true effect sizes, p-values from NULL hypothesis significance testing (NHST) would be biased with inflated type 1 errors. In addition, we may conclude that any (non-directional) hypothesis is necessarily true, giving a trivial prior probability of θ = 1. However, we should recognize that these are not limitations of binary hypothesis testing per se, but rather limitations of how specific hypotheses are formulated and tested. It is possible to define a different ‘NULL’ hypothesis, with a mean other than zero, to protect inferences from true effect sizes of ‘trivial’ magnitudes [22] and make the prior more informative in observational studies at θ < 1. Also, binary NHST is not inherently problematic in true experimental designs (with randomization), because we can then assume associations in data that are truly NULL. In the present analysis, we have assumed the same position on binary NHST as the publishing authors of the original studies that were replicated by the OSC, and the limitations discussed above apply similarly to how they would apply to the original studies.
The most crucial estimate used in our analysis was the observed reproducibility rate of 36% reported
by the OSC [1]. Reproducibility is a complicated concept with many different facets, in particular in
psychology and the social sciences; some ‘true’ findings may not be possible to replicate in a different
time, social or cultural context, because the underlying meaning of the constructs used to design the
study or define the variables may have changed. The underlying theory may still be valid but needs
to be adapted to the new environment, and this has been proposed as an argument against the
validity of direct replication of a study’s methods on an independent sample [23]. But, from a more
general scientific perspective, it can be seen as a flaw in the formulated theory and the methods
defined to test it: Science needs to be verifiable to stand out from other types of claims and should
have some generalizability to be a useful source of knowledge; thus, important context needs to be
included when formulating a scientific theory or hypothesis. Another factor to consider is poorly
described methods in the original study that may impact the success rate in replications, but this is
essentially the same problem. If the study report did not present sufficient information to accurately replicate the methods, how can it be properly understood and evaluated by readers?
Reproducibility may have been impaired because of mistakes made by the replicating team of
researchers; however, this does not seem to be a major risk in the OSC study. The study was pre-registered and performed by well-motivated researchers under more or less public scrutiny; the team
was in frequent contact with authors of the original studies to obtain material and information about
the design and procedure of their studies; and they employed a system of internal reviews of all
studies to ensure quality. Our findings show that assuming a larger true reproducibility rate of this
research implies larger statistical power and posterior probability of the original findings, but
estimates of publication bias were only marginally affected. In addition, any potential mistakes that
may have lowered the reproducibility rate below its true value are part of the overall type 2 error rate (β) in equations (3.3) and (6.2) and can be seen as a reduction of ‘statistical power’ in the replication studies below what we have nominally assumed. It seems unlikely that this would pose a problem large enough to invalidate the lower bound of the estimate used in this study, assuming power to be identical in the original and replication studies.
Studies eligible for replication by the OSC were selected from three prestigious journals in
experimental psychology. Approximately one-third of the total sample was never submitted for
replication, mostly because these studies were deemed infeasible to replicate, for example,
because they required special samples, knowledge or equipment. This introduces uncertainty and
potential bias in the reproducibility estimate; it is possible that the more specialized or complicated
designs would have worse (or, less likely, better) reproducibility. Thus, the reproducibility rate
estimated by the OSC is an estimate representative of the two-thirds most accessible research in
three well-regarded journals in experimental psychology; and might not generalize to psychology
in general.
Data from other scientific fields suggest a less pronounced focus on positive evidence, with 70– 90%
significant findings supporting the authors’ hypothesis [3,4], but even worse reproducibility rates in the
range 11– 24% in certain fields [24,25]. This suggests that while all estimates presented here may not
generalize, publication bias may still be of similar magnitude in other fields; but specific fields with a
higher proportion of published negative evidence and, to some extent, with a higher demonstrated
reproducibility [26] are likely to be less affected by publication bias.
One should recognize that most findings suppressed from publication describe NULL effects that
many may find uninformative or not interesting [23], but the fact that they are never published makes
it more likely that similar studies are performed repeatedly by independent researchers; and
eventually one will become ‘significant’ by chance, dramatically increasing its chance of being
published. Thus, the fact that such a large portion of negative evidence was suppressed from
publication not only represents a serious threat to the veracity of published positive evidence; it also
means that false theories that have been published may never become ‘falsified’ in the literature [5]
and that researchers are likely to spend time and resources testing hypotheses that should already
have been rejected.
We estimated the expected magnitude of total bias related to publishing findings in psychological
journals. This bias is produced at many different stages in the research process, and we cannot say
how much is related to editorial decisions to reject publication, researchers putting negative findings
in the file drawer [6], selective reporting, HARKing [7] or p-hacking [8– 10]. Our metric assumes
independent observations; however, in the case of repeated observations in a single study, we expect
observations to be correlated. Thus, our estimate would tend to be conservative with respect to actual
observations made in data, because correlated observations provide less new information than
independent observations.
Publication bias may be the single most important problem to solve in order to increase the efficiency
of the scientific project and bring the veracity of published research to higher standards. The implications
of suppressing more than 55 negative observations for each one published should not be underestimated.
With α = 0.05, we expect a significant finding by chance for every 20 observations made on random data.
Thus, our results suggest that even when studied associations are truly NULL, the literature will be
dominated by statistically significant findings.
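A back-of-the-envelope sketch makes this concrete, under the simplifying assumption that essentially all significant results are published while only about one in 55–98 observed negatives is:

```r
# Share of significant findings in the published literature if every tested
# association were truly NULL, given alpha = 0.05 and the estimated odds of
# suppressing observed negative evidence.
null_literature_share <- function(suppression_odds, alpha = 0.05) {
  published_pos <- alpha                           # significant results assumed published
  published_neg <- (1 - alpha) / suppression_odds  # one negative published per 'odds' observed
  published_pos / (published_pos + published_neg)
}

null_literature_share(c(55, 98))  # about 74% and 84% significant findings under the NULL
```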
Data accessibility. All the code needed to reproduce the findings presented in this paper is available as an electronic
supplementary material, appendix.
Authors’ contributions. M.I. conceived of the study, performed the analysis, interpreted results and drafted the paper. G.N.
participated in interpreting results and writing the paper. Both authors approved of the final manuscript before
publication.
Competing interests. We have no competing interests.
Funding. This research received no external funding.
Acknowledgements. We thank Prof. Anna Dreber Almenberg for the comments on a preprint version of this paper. We also
thank an anonymous reviewer, whose comments helped us improve our work.
References
1. Open Science Collaboration. 2015 Estimating the reproducibility of psychological science. Science 349, aac4716. (doi:10.1126/science.aac4716)
2. Sterling TD. 1959 Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J. Am. Stat. Assoc. 54, 30–34.
3. Sterling TD, Rosenbaum WL, Weinkam JJ. 1995 Publication decisions revisited: the effect of the outcome of statistical tests on the decision to publish and vice versa. Am. Stat. 49, 108–112.
4. Fanelli D. 2010 ‘Positive’ results increase down the hierarchy of the sciences. PLoS ONE 5, e10068. (doi:10.1371/journal.pone.0010068)
5. Ferguson CJ, Heene M. 2012 A vast graveyard of undead theories: publication bias and psychological science’s aversion to the null. Perspect. Psychol. Sci. 7, 555–561. (doi:10.1177/1745691612459059)
6. Rosenthal R. 1979 The file drawer problem and tolerance for null results. Psychol. Bull. 86, 638. (doi:10.1037/0033-2909.86.3.638)
7. Kerr NL. 1998 HARKing: hypothesizing after the results are known. Pers. Soc. Psychol. Rev. 2, 196–217. (doi:10.1207/s15327957pspr0203_4)
8. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. 2015 The extent and consequences of p-hacking in science. PLoS Biol. 13, e1002106. (doi:10.1371/journal.pbio.1002106)
9. Bruns SB, Ioannidis JPA. 2016 p-Curve and p-hacking in observational research. PLoS ONE 11, e0149144. (doi:10.1371/journal.pone.0149144)
10. Simmons JP, Nelson LD, Simonsohn U. 2011 False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366. (doi:10.1177/0956797611417632)
11. R Core Team. 2017 R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. See https://www.R-project.org/.
12. Puga JL, Krzywinski M, Altman N. 2015 Bayes’ theorem. Nat. Methods 12, 277. (doi:10.1038/nmeth.3335)
13. Dreber A, Pfeiffer T, Almenberg J, Isaksson S, Wilson B, Chen Y, Nosek BA, Johannesson M. 2015 Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl Acad. Sci. USA 112, 15 343–15 347. (doi:10.1073/pnas.1516179112)
14. Dumas-Mallet E, Button KS, Boraud T, Gonon F, Munafò MR. 2017 Low statistical power in biomedical science: a review of three human research domains. R. Soc. open sci. 4, 160254. (doi:10.1098/rsos.160254)
15. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. 2013 Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376. (doi:10.1038/nrn3475)
16. Yarkoni T. 2009 Big correlations in little studies: inflated fMRI correlations reflect low statistical power: commentary on Vul et al. (2009). Perspect. Psychol. Sci. 4, 294–298. (doi:10.1111/j.1745-6924.2009.01127.x)
17. 2017 rpp. GitHub. https://github.com/CenterForOpenScience/rpp.
18. Szucs D, Ioannidis JPA. 2017 Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol. 15, e2000797. (doi:10.1371/journal.pbio.2000797)
19. Rossi JS. 1990 Statistical power of psychological research: what have we gained in 20 years? J. Consult. Clin. Psychol. 58, 646. (doi:10.1037/0022-006X.58.5.646)
20. Johnson VE, Payne RD, Wang T, Asher A, Mandal S. 2017 On the reproducibility of psychological science. J. Am. Stat. Assoc. 112, 1–10. (doi:10.1080/01621459.2016.1240079)
21. Cohen J. 1994 The earth is round (p < .05). Am. Psychol. 49, 997. (doi:10.1037/0003-066X.49.12.997)
22. Ingre M. 2013 Why small low-powered studies are worse than large high-powered studies and how to protect against ‘trivial’ findings in research: comment on Friston (2012). Neuroimage 81, 496–498. (doi:10.1016/j.neuroimage.2013.03.030)
23. Stroebe W, Strack F. 2014 The alleged crisis and the illusion of exact replication. Perspect. Psychol. Sci. 9, 59–71. (doi:10.1177/1745691613514450)
24. Prinz F, Schlange T, Asadullah K. 2011 Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712. (doi:10.1038/nrd3439-c1)
25. Begley CG, Ellis LM. 2012 Drug development: raise standards for preclinical cancer research. Nature 483, 531–533. (doi:10.1038/483531a)
26. Camerer CF et al. 2016 Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436. (doi:10.1126/science.aaf0918)
... Therefore, there is a need for replication studies on the effects of a physical activity intervention on cognition with participants from different nationalities. Additionally, another study has indicated the difficulty of estimating the true robustness and effect size of the cognitive enhancement effect by surveying the published literature [54]. Publication bias has been implicated in decreasing the efficiency of the science and bringing the credibility of published research to lower standards [54]. ...
... Additionally, another study has indicated the difficulty of estimating the true robustness and effect size of the cognitive enhancement effect by surveying the published literature [54]. Publication bias has been implicated in decreasing the efficiency of the science and bringing the credibility of published research to lower standards [54]. Hence, replication studies are needed to combat the publication bias and low diversity in the scientific literature. ...
Article
Full-text available
Research suggests that physical activity can be used as an intervention to increase cognitive function. Yet, there are competing views on the cognitive effects of physical activity and it is not clear what level of consensus exists among researchers in the field. The purpose of this study was two-fold: Firstly, to quantify the scientific consensus by focusing on the relationship between physical activity and cognitive function. Secondly, to investigate if there is a gap between the public’s and scientists’ interpretations of scientific texts on this topic. A two-phase study was performed by including 75 scientists in the first phase and 15 non-scientists in the second phase. Participants were asked to categorize article abstracts in terms of endorsement of the effect of physical activity on cognitive function. Results indicated that there was a 76.1% consensus that physical activity has positive cognitive effects. There was a consistent association between scientists’ and non-scientists’ categorizations, suggesting that both groups perceived abstracts in a similar fashion. Taken together, this study provides the first analysis of its kind to evaluate the level of consensus in almost two decades of research. The present data can be used to inform further research and practice.
... An example of one recommendation is for Editors to introduce a policy for results-blind review, which involves submitting a full manuscript for review that excludes the results section. This practice might reduce Editor and reviewers' positive results bias, and promote a stronger focus on the theoretical and practical impact of the study itself rather than whether or not the findings were statistically significant (see Ingre & Nilsonne, 2018). Other important recommendations for journals discussed by Aguinis et al. (2020) include: (1) requiring a pre-registration for every study, (2) publicly sharing the data, code, and materials for all studies, (3) creating a review track that includes Registered Reports, (4) providing an online archive for each journal article, such as the Open Science Framework, for authors to post their study materials, (5) creating a best-paper award that is based on the use of open science criteria, and (6) providing access to open science training for all research stakeholders. ...
Article
Full-text available
The replication crisis has stimulated researchers around the world to adopt open science research practices intended to reduce publication bias and improve research quality. Open science practices include study pre-registration, open data, open access, and avoiding methods that can lead to publication bias and low replication rates. Although gambling studies uses similar research methods as behavioral research fields that have struggled with replication, we know little about the uptake of open science research practices in gambling-focused research. We conducted a scoping review of 500 recent (1/1/2016–12/1/2019) studies focused on gambling and problem gambling to examine the use of open science and transparent research practices. Our results showed that a small percentage of studies used most practices: whereas 54.6% (95% CI: [50.2, 58.9]) of studies used at least one of nine open science practices, each practice’s prevalence was: 1.6% for pre-registration (95% CI: [0.8, 3.1]), 3.2% for open data (95% CI: [2.0, 5.1]), 0% for open notebook, 35.2% for open access (95% CI: [31.1, 39.5]), 7.8% for open materials (95% CI: [5.8, 10.5]), 1.4% for open code (95% CI: [0.7, 2.9]), and 15.0% for preprint posting (95% CI: [12.1, 18.4]). In all, 6.4% (95% CI: [4.6, 8.9]) of the studies included a power analysis and 2.4% (95% CI: [1.4, 4.2]) were replication studies. Exploratory analyses showed that studies that used any open science practice, and open access in particular, had higher citation counts. We suggest several practical ways to enhance the uptake of open science principles and practices both within gambling studies and in science more generally.
... It should also be noted that the findings were almost exclusively positive, that is, the authors managed to demonstrate a hypothesized effect. This might suggest a selection bias where failed attempts to demonstrate an effect have been put in the file drawer (Ingre & Nilsonne, 2018;Open Science Collaboration, 2015). Thus, ideally, the findings reviewed here should be further explored and replicated before being recommended as reliable communication strategies. ...
... control. First, there is a file drawer problem and it is unclear how many attempts at replication remain unpublished because findings did not replicate and were not statistically significant in the replication attempt (Ingre & Nilsonne, 2018). Second, RCTs are expensive and conducting a direct replication RCT is not likely to be feasible. ...
Article
Objective: The high rate of statistically significant findings in the sciences that do not replicate in a new sample has been described as a "replication crisis." Few replication attempts have been conducted in studies of alcohol use disorder (AUD), and the best method for determining whether a finding replicates has not been explored. The goal of the current study was to conduct direct replications within a multisite AUD randomized controlled trial and to test a range of replication metrics. Method: We used data from a large AUD clinical trial (Project Matching Alcoholism Treatments to Client Heterogeneity [Project MATCH], n = 1,726) to simulate direct replication attempts. We examined associations between drinking intensity and negative alcohol-related consequences (Model 1), sex differences in drinking intensity (Model 2), and reductions in drinking following treatment (Model 3). We treated each of the 11 data collection sites as unique studies, such that each subsample was treated as an "original" study and the remaining 10 subsamples were viewed as "replication" studies. Replicability metrics included the consistency of statistical significance, overlapping confidence intervals, and consistency of the direction of the effect. We also tested effect replication and heterogeneity using meta-analysis. Results: We observed between 0% and 100% replicability across the replicability metrics depending on which subsample was treated as the "original" study. Meta-analyses indicated results were more similar across subsamples with no significant heterogeneity for Models 1 and 2. Conclusions: We recommend researchers focus on effect sizes and use meta-analysis to evaluate the level of replicability. We also encourage direct replication attempts and sharing of data and code to facilitate direct replication. (PsycInfo Database Record (c) 2021 APA, all rights reserved).
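As an illustration of how the three replicability metrics named above can be computed, here is a minimal Python sketch using simulated site-level effect estimates rather than the Project MATCH data; the number of sites, the effect size and the per-site sample size are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.25               # assumed standardized effect
n_sites, n_per_site = 11, 150    # assumed design
se = 1 / np.sqrt(n_per_site)     # approximate standard error per site

est = rng.normal(true_effect, se, n_sites)   # simulated per-site estimates

def ci(e, z=1.96):
    return e - z * se, e + z * se

for original in range(n_sites):
    o, reps = est[original], np.delete(est, original)
    sig_o = abs(o / se) > 1.96
    same_sig = np.mean((np.abs(reps / se) > 1.96) == sig_o)                       # metric 1: significance
    lo, hi = ci(o)
    overlap = np.mean([(ci(r)[0] <= hi) and (ci(r)[1] >= lo) for r in reps])      # metric 2: CI overlap
    same_dir = np.mean(np.sign(reps) == np.sign(o))                               # metric 3: direction
    print(f"site {original + 1:2d} as 'original': significance {same_sig:.0%}, "
          f"CI overlap {overlap:.0%}, direction {same_dir:.0%}")

Treating each simulated site in turn as the "original" study and the remaining ten as "replications" mirrors the leave-one-out logic described in the abstract.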
... The present study demonstrates a high probability that a statistically significant effect from a traditional zero-order significance test of a predictor X on an outcome Y, while adjusting for a possible confounder Z, is false, i.e. that it falls prey to a type 1 error and has low positive predictive value (PPV). This finding is consistent with a body of literature demonstrating limited reproducibility in empirical sciences [8][9][10][11][12]. We further show that this increased risk can be mitigated by calculating the expected adjusted effect and requiring the observed adjusted effect to differ significantly from this expected effect (AEAE test), rather than directly interpreting a significant deviation from zero. ...
Article
Full-text available
Objective: The present simulation study aimed to assess positive predictive value (PPV) and negative predictive value (NPV) for our newly introduced Accounting for Expected Adjusted Effect test (AEAE test) and compare it to PPV and NPV for a traditional zero-order significance test. Results: The AEAE test exhibited greater PPV compared to a traditional zero-order significance test, especially with a strong true adjusted effect, low prior probability, high degree of confounding, large sample size, high reliability in the measurement of predictor X and outcome Y, and low reliability in the measurement of confounder Z. The zero-order significance test, on the other hand, exhibited higher NPV, except for some combinations of high degree of confounding and large sample size, or low reliability in the measurement of Z and high reliability in the measurement of X/Y, in which case the zero-order significance test can be completely uninformative. Taken together, the findings demonstrate desirable statistical properties for the AEAE test compared to a traditional zero-order significance test.
... It can also be noted that quite a few published studies, including the meta-analysis by Kim (2005), found no support for the threshold hypothesis. This, together with the high estimated rate of publication bias in psychological research (Ingre & Nilsonne, 2018) and the results of the present simulation, could be taken to indicate that there is, for the moment, more evidence speaking against than in favor of a true threshold-like association between intelligence and creativity. ...
Article
Full-text available
According to the intelligence-creativity threshold hypothesis, there should be a positive association between intelligence and creative potential up to a certain point, the threshold, after which a further increase in intelligence should have no association with creativity. In the present simulation study, the measured intelligence and creativity of virtual subjects were affected by their true abilities as well as a disturbance factor that varied in magnitude between subjects. The results indicate that the hypothesized threshold-like association could be due to some disturbing factor, for example, low motivation, illness, or linguistic confusion, that varies between individuals and that affects both measured intelligence and measured creativity, especially if the actual association between intelligence and creativity is weak. This, together with previous negative findings, calls the validity of the intelligence-creativity threshold hypothesis into question.
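The mechanism described in this abstract is easy to demonstrate. The following is a minimal Python sketch under assumed parameters (independent true abilities, an exponentially distributed shared disturbance, a cut-off at the 75th percentile of measured intelligence); it is only an illustration of how a shared disturbance can produce a threshold-like pattern, not the authors' simulation code.

import numpy as np

rng = np.random.default_rng(11)
n = 100_000
true_iq = rng.normal(size=n)
true_creativity = rng.normal(size=n)          # independent of true_iq
disturbance = rng.exponential(1.0, size=n)    # e.g., illness, low motivation, confusion

measured_iq = true_iq - disturbance
measured_creativity = true_creativity - disturbance

threshold = np.quantile(measured_iq, 0.75)    # assumed "high intelligence" cut-off
below, above = measured_iq < threshold, measured_iq >= threshold
r_below = np.corrcoef(measured_iq[below], measured_creativity[below])[0, 1]
r_above = np.corrcoef(measured_iq[above], measured_creativity[above])[0, 1]
print(f"correlation below threshold: {r_below:.2f}")
print(f"correlation above threshold: {r_above:.2f}")

Even though the true abilities are unrelated in this sketch, the shared disturbance induces a clearly positive correlation below the cut-off and a much weaker one above it, mimicking a threshold.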
Article
Is what is beautiful good and more accurately understood? Lorenzo et al. (2010) explored this question and found that more attractive targets (as per consensus) were judged more positively and accurately. Perceivers’ specific (idiosyncratic) ratings of targets’ attractiveness were also related to more positive and accurate impressions, but the latter was only true for highly consensually attractive targets. With a larger sample (N=547), employing a round-robin study design, we aimed to replicate and extend these findings by 1) using a more reliable accuracy criterion, 2) using a direct measure of positive personality impressions, and 3) exploring attention as a potential mechanism of these links. We found that targets’ consensual attractiveness was not significantly related to the positivity or the accuracy of impressions. Replicating the original findings, idiosyncratic attractiveness was related to more positive impressions. The association between idiosyncratic attractiveness and accuracy was again dependent on consensual attractiveness, but here, idiosyncratic attractiveness was associated with lower accuracy for less consensually attractive targets. Perceivers’ attention helped explain these associations. These results partially replicate the original findings while also providing new insight: What is beautiful to the beholder is good but is less accurately understood if the target is consensually less attractive.
Article
A number of methodologists have recently argued that it is inadvisable or even improper to use the same data for exploration (discovering effects) and for confirmation (validating the existence of effects). This has led to suggestions of a two-phased strategy: running an exploratory study (Phase 1) and then performing a Phase 2 validation/confirmation study (ideally pre-registered) that tests just the strongest effect(s) to emerge from Phase 1. Using simulations we ask a simple question: how does this phased strategy compare with the simpler alternative of running “one big study” that combines exploration and confirmation? At any given alpha level, two figures of merit trade off against each other, with the two-phased strategy offering lower power and greater positive predictive value (PPV). However, a closer comparison of the results shows that the “big study” option is strictly dominant in the sense that for any given alpha level used in the two-phased strategy, there is some alpha level for which the “big study” approach yields better power and better PPV. Bonferroni correction for multiple comparisons does not affect this result. The implications and their important limitations are discussed.
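A minimal Python sketch of this kind of comparison, under simplifying assumptions that differ from the study above (all phase-1 significant effects are carried to phase 2 rather than only the strongest; normal test statistics; arbitrary prior probability, effect size and sample size), is shown below. It illustrates only the power/PPV trade-off at a fixed alpha, not the dominance result, and it is not the authors' simulation code. It assumes numpy and scipy are available.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_hyp, prior_true, d, n = 10_000, 0.10, 0.4, 50   # hypotheses, prior, effect size, n per group per phase

is_true = rng.random(n_hyp) < prior_true
ncp = np.where(is_true, d * np.sqrt(n / 2), 0.0)  # noncentrality of a two-sample z test

alpha = 0.05
crit = norm.ppf(1 - alpha / 2)

# Two-phased: discover in phase 1, then confirm in an independent phase 2 (both at alpha)
z1, z2 = rng.normal(ncp, 1.0), rng.normal(ncp, 1.0)
two_phase = (np.abs(z1) > crit) & (np.abs(z2) > crit)

# One big study: both samples pooled, so the expected z is sqrt(2) times larger
z_big = rng.normal(ncp * np.sqrt(2), 1.0)
big = np.abs(z_big) > crit

for label, sel in [("two-phased", two_phase), ("one big study", big)]:
    power = sel[is_true].mean()   # share of true effects detected
    ppv = is_true[sel].mean()     # share of detections that are true
    print(f"{label:13s}: power = {power:.2f}, PPV = {ppv:.2f}")

At the same alpha the two-phased strategy ends up with lower power but higher PPV, which is the trade-off described in the abstract.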
Article
Proponents of the research credibility movement make a number of recommendations to enhance research rigour in psychology. These represent positive advances and can enhance replicability in clinical psychological science. This article evaluates whether there are any risks associated with this movement. We argue that there is the potential for research credibility principles to stifle innovation and exacerbate type II error, but only if they are applied too rigidly and beyond their intended scope by funders, journals and scientists. We outline ways to mitigate these risks. Further, we discuss how research credibility issues need to be situated within broader concerns about research waste. A failure to optimise the process by which basic science findings are used to inform the development of novel treatments (the first translational gap) and effective treatments are then implemented in real-world settings (the second translational gap) are also significant sources of research waste in depression. We make some suggestions about how to better cross these translational gaps.
Article
Full-text available
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 recently published cognitive neuroscience and psychology papers. The reported median effect size was D = 0.93 (interquartile range: 0.64-1.46) for nominally statistically significant results and D = 0.24 (0.11-0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement over the past half-century. This is because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
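The false report probability referred to above follows directly from Bayes' theorem given a prior probability, power and alpha. A minimal Python sketch, using the median power estimates quoted in the abstract (0.12, 0.44, 0.73) and an assumed range of priors, is shown below; it is not the authors' code.

alpha = 0.05

def frp(prior, power, alpha=alpha):
    # P(effect is null | statistically significant result)
    return alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)

for prior in (0.1, 0.25, 0.5):
    row = ", ".join(f"power {p:.2f}: FRP {frp(prior, p):.2f}" for p in (0.12, 0.44, 0.73))
    print(f"prior {prior:.2f} -> {row}")

With low priors and the low powers typical of the literature, the false report probability easily exceeds 50%, as the abstract states.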
Article
Full-text available
Author summary Biomedical science, psychology, and many other fields may be suffering from a serious replication crisis. In order to gain insight into some factors behind this crisis, we have analyzed statistical information extracted from thousands of cognitive neuroscience and psychology research papers. We established that the statistical power to discover existing relationships has not improved during the past half century. A consequence of low statistical power is that research studies are likely to report many false positive findings. Using our large dataset, we estimated the probability that a statistically significant finding is false (called false report probability). With some reasonable assumptions about how often researchers come up with correct hypotheses, we conclude that more than 50% of published findings deemed to be statistically significant are likely to be false. We also observed that cognitive neuroscience studies had higher false report probability than psychology studies, due to smaller sample sizes in cognitive neuroscience. In addition, the higher the impact factors of the journals in which the studies were published, the lower was the statistical power. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
Article
Full-text available
Studies with low statistical power increase the likelihood that a statistically significant finding represents a false positive result. We conducted a review of meta-analyses of studies investigating the association of biological, environmental or cognitive parameters with neurological, psychiatric and somatic diseases, excluding treatment studies, in order to estimate the average statistical power across these domains. Taking the effect size indicated by a meta-analysis as the best estimate of the likely true effect size, and assuming a threshold for declaring statistical significance of 5%, we found that approximately 50% of studies have statistical power in the 0–10% or 11–20% range, well below the minimum of 80% that is often considered conventional. Studies with low statistical power appear to be common in the biomedical sciences, at least in the specific subject areas captured by our search strategy. However, we also observe evidence that this depends in part on research methodology, with candidate gene studies showing very low average power and studies using cognitive/behavioural measures showing high average power. This warrants further investigation.
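A minimal Python sketch of the kind of power calculation underlying such a review is shown below: it takes a hypothetical meta-analytic effect size as the best estimate of the true effect and computes, under a normal approximation, the power a two-group study of a given size had to detect it at the 5% level. The effect sizes and sample sizes are illustrative assumptions, not values from the cited review.

from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    # Approximate power of a two-sided, two-sample test of standardized effect d
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality under the alternative
    return 1 - norm.cdf(z_crit - ncp)

# Hypothetical meta-analytic effect sizes and typical per-group sample sizes
for d, n in [(0.2, 30), (0.2, 100), (0.5, 30), (0.5, 100)]:
    print(f"d = {d}, n = {n} per group: power ~ {approx_power(d, n):.2f}")

For small effects and the sample sizes common in primary studies, this calculation lands well below the conventional 80% target.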
Article
Full-text available
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a re-analysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested non-null effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of non-reproducibility. The results of this re-analysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false.
Article
Full-text available
The p-curve, the distribution of statistically significant p-values of published studies, has been used to make inferences on the proportion of true effects and on the presence of p-hacking in the published literature. We analyze the p-curve for observational research in the presence of p-hacking. We show by means of simulations that even with minimal omitted-variable bias (e.g., unaccounted confounding), p-curves based on true effects and p-curves based on null effects with p-hacking cannot be reliably distinguished. We also demonstrate this problem using as a practical example the evaluation of the effect of malaria prevalence on economic growth between 1960 and 1996. These findings call into question recent studies that use the p-curve to infer that most published research findings are based on true effects in the medical literature and in a wide range of disciplines. p-values in observational research may need to be empirically calibrated to be interpretable with respect to the commonly used significance threshold of 0.05. Violations of randomization in experimental studies may also result in situations where the use of p-curves is similarly unreliable.
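The core point, that omitted-variable bias can make a null causal effect produce a p-curve that looks like a true effect, can be illustrated with a minimal Python sketch; the effect and confounding strengths are assumed for illustration and this is not the authors' simulation code.

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(5)
n_studies, n_obs = 2_000, 50

def pvals(confounded):
    ps = []
    for _ in range(n_studies):
        z = rng.normal(size=n_obs)                       # unobserved confounder
        if confounded:
            x = 0.5 * z + rng.normal(size=n_obs)         # X associated with Z
            y = 0.5 * z + rng.normal(size=n_obs)         # Y caused by Z, not by X
        else:
            x = rng.normal(size=n_obs)
            y = 0.3 * x + rng.normal(size=n_obs)         # genuine effect of X
        ps.append(linregress(x, y).pvalue)               # regress Y on X, omitting Z
    return np.array(ps)

for label, p in [("true effect", pvals(False)), ("null + confounding", pvals(True))]:
    sig = p[p < 0.05]
    print(f"{label:18s}: {len(sig) / len(p):.0%} significant, "
          f"{np.mean(sig < 0.025):.0%} of significant p-values below .025")

Both scenarios yield right-skewed distributions of significant p-values, so the shape of the p-curve alone cannot tell a true effect from a confounded null effect.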
Article
Full-text available
The data include measures collected for the two experiments reported in “False-Positive Psychology” [1], where listening to a randomly assigned song made people feel younger (Study 1) or actually be younger (Study 2). These data are useful because they illustrate how flexibility in data collection, analysis, and reporting of results inflates false positive rates. The data are also useful for educational purposes.
Article
This article considers a practice in scientific communication termed HARKing (Hypothesizing After the Results are Known). HARKing is defined as presenting a post hoc hypothesis (i.e., one based on or informed by one's results) in one's research report as if it were, in fact, an a priori hypothesis. Several forms of HARKing are identified, and survey data are presented that suggest that at least some forms of HARKing are widely practiced and widely seen as inappropriate. I identify several reasons why scientists might HARK. Then I discuss several reasons why scientists ought not to HARK. It is conceded that the question of whether HARKing's costs exceed its benefits is a complex one that ought to be addressed through research, open discussion, and debate. To help stimulate such discussion (and for those such as myself who suspect that HARKing's costs do exceed its benefits), I conclude the article with some suggestions for deterring HARKing.
Article
The reproducibility of scientific findings has been called into question. To contribute data about reproducibility in economics, we replicate 18 studies published in the American Economic Review and the Quarterly Journal of Economics in 2011-2014. All replications follow predefined analysis plans publicly posted prior to the replications, and have a statistical power of at least 90% to detect the original effect size at the 5% significance level. We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original. The reproducibility rate varies between 67% and 78% for four additional reproducibility indicators, including a prediction market measure of peer beliefs.
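The sample-size planning described above (at least 90% power to detect the original effect size at the 5% level) can be sketched as follows in Python, using a normal approximation for a two-group comparison and illustrative effect sizes rather than those of the replicated studies.

import math
from scipy.stats import norm

def n_per_group(d, power=0.90, alpha=0.05):
    # Approximate per-group n for a two-sided, two-sample test of effect size d
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"original effect d = {d}: n = {n_per_group(d)} per group for 90% power")

The smaller the original effect size, the larger the replication sample needed to meet the 90% power criterion.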
Article
Concerns about a lack of reproducibility of statistically significant results have recently been raised in many fields, and it has been argued that this lack comes at substantial economic costs. We here report the results from prediction markets set up to quantify the reproducibility of 44 studies published in prominent psychology journals and replicated in the Reproducibility Project: Psychology. The prediction markets predict the outcomes of the replications well and outperform a survey of market participants' individual forecasts. This shows that prediction markets are a promising tool for assessing the reproducibility of published scientific results. The prediction markets also allow us to estimate probabilities for the hypotheses being true at different testing stages, which provides valuable information regarding the temporal dynamics of scientific discovery. We find that the hypotheses being tested in psychology typically have low prior probabilities of being true (median, 9%) and that a "statistically significant" finding needs to be confirmed in a well-powered replication to have a high probability of being true. We argue that prediction markets could be used to obtain speedy information about reproducibility at low cost and could potentially even be used to determine which studies to replicate to optimally allocate limited resources into replications.
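The closing point of this abstract mirrors the Bayesian argument of the present paper: with a low prior, a single significant finding moves the posterior probability only so far, while a significant well-powered replication moves it much further. A minimal Python sketch, assuming illustrative power values of 0.50 for the original study and 0.90 for the replication and the median prior of 9% quoted above, is shown below; it is not the prediction-market analysis itself.

alpha = 0.05

def posterior(prior, power, alpha=alpha):
    # P(hypothesis true | statistically significant result), by Bayes' theorem
    return power * prior / (power * prior + alpha * (1 - prior))

prior = 0.09
after_original = posterior(prior, power=0.50)               # modestly powered original study
after_replication = posterior(after_original, power=0.90)   # well-powered replication
print(f"prior: {prior:.2f}")
print(f"after significant original: {after_original:.2f}")
print(f"after significant replication: {after_replication:.2f}")

Under these assumptions the posterior rises to roughly 0.50 after the original finding and above 0.90 only after a significant well-powered replication, which is the pattern the abstract describes.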