There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
PLoS Medicine | 0696
Open access, freely available online
August 2005 | Volume 2 | Issue 8 | e124
Why Most Published Research Findings Are False
John P. A. Ioannidis

The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviation: PPV, positive predictive value
John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail:
Competing Interests: The author has declared that no competing interests exist.
DOI: 10.1371/journal.pmed.0020124

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy are seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings
Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R⁄(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R⁄(R − βR + α). A research finding is thus more likely true than false if (1 − β)R > α. Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 − β)R > 0.05.
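As a minimal sketch of the single-study relationship PPV = (1 − β)R⁄(R − βR + α), the snippet below evaluates the PPV and the "more likely true than false" condition (1 − β)R > α. The function names and the example inputs are illustrative assumptions, not part of the essay.

```python
def ppv(power: float, R: float, alpha: float = 0.05) -> float:
    """Post-study probability that a claimed finding is true.

    Single study, no bias: PPV = (1 - beta) * R / (R - beta * R + alpha),
    where power = 1 - beta and R is the pre-study odds.
    """
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


def more_likely_true_than_false(power: float, R: float, alpha: float = 0.05) -> bool:
    """Condition (1 - beta) * R > alpha, which is equivalent to PPV > 0.5."""
    return power * R > alpha


if __name__ == "__main__":
    # Hypothetical (power, R) pairs chosen only for illustration.
    for power, R in [(0.80, 1.0), (0.80, 0.1), (0.20, 0.1), (0.20, 0.01)]:
        print(power, R, round(ppv(power, R), 3), more_likely_true_than_false(power, R))
```

With 80% power and R = 1:1 the PPV is about 0.94, whereas with 20% power and R = 1:10 it is about 0.29, so the claim is more likely false than true.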
What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true. We will try to model these two factors in the context of similar 2 × 2 tables.
First, let us define bias as the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced. Let u be the proportion of probed analyses that would not have been “research findings,” but nevertheless end up presented and reported as such, because of bias. Bias should not be confused with chance variability that causes some findings to be false by chance even though the study design, data, analysis, and presentation are perfect. Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias. We may assume that u does not depend on whether a true relationship exists or not. This is not an unreasonable assumption, since typically it is impossible to know which relationships are indeed true. In the presence of bias (Table 2), one gets PPV = ([1 − β]R + uβR)⁄(R + α − βR + u − uα + uβR), and PPV decreases with increasing u, unless 1 − β ≤ α, i.e., 1 − β ≤ 0.05 for most situations. Thus, with increasing bias, the chances that a research finding is true diminish considerably. This is shown for different levels of power and for different pre-study odds in Figure 1.
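The effect of bias can be sketched numerically. The function below assumes the bias-adjusted form PPV = ([1 − β]R + uβR)⁄(R + α − βR + u − uα + uβR), which is what the "Yes" row of Table 2 sums to; the function name and the example values of u are illustrative assumptions.

```python
def ppv_with_bias(power: float, R: float, u: float, alpha: float = 0.05) -> float:
    """PPV when a proportion u of would-be non-findings is reported as findings.

    Numerator:   (1 - beta) * R + u * beta * R
    Denominator: R + alpha - beta * R + u - u * alpha + u * beta * R
    """
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical bias levels for a study with 80% power and 1:1 pre-study odds.
    for u in (0.0, 0.1, 0.5):
        print(u, round(ppv_with_bias(0.80, 1.0, u), 2))
```

For 80% power and R = 1:1, the PPV falls from about 0.94 with no bias to about 0.85 at u = 0.10 and about 0.63 at u = 0.50.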
Conversely, true research findings may occasionally be annulled because of reverse bias. For example, with large measurement errors relationships are lost in noise [12], or investigators use data inefficiently or fail to notice statistically significant relationships, or there may be conflicts of interest that tend to “bury” significant findings [13]. There is no good large-scale empirical evidence on how frequently such reverse bias may occur across diverse research fields. However, it is probably fair to say that reverse bias is not as common. Moreover, measurement errors and inefficient use of data are probably becoming less frequent problems, since measurement error has decreased with technological advances in the molecular era and investigators are becoming increasingly sophisticated about their data. Regardless, reverse bias may be modeled in the same way as bias above. Also, reverse bias should not be confused with chance variability that may lead to missing a true relationship because of chance.
Testing by Several Independent Teams
Several independent teams may be addressing the same sets of research questions. As research efforts are globalized, it is practically the rule that several research teams, often dozens of them, may probe the same or similar questions. Unfortunately, in some areas, the prevailing mentality until now has been to focus on isolated discoveries by single teams and interpret research experiments in isolation. An increasing number of questions have at least one study claiming a research finding, and this receives unilateral attention. The probability that at least one study, among several done on the same question, claims a statistically significant research finding is easy to estimate. For n independent studies of equal power, the 2 × 2 table is shown in Table 3: $\mathrm{PPV} = R(1 - \beta^n)/(R + 1 - [1 - \alpha]^n - R\beta^n)$ (not considering bias). With increasing number of independent studies, PPV tends to decrease, unless 1 − β < α, i.e., typically 1 − β < 0.05. This is shown for different levels of power and for different pre-study odds in Figure 2. For n studies of different power, the term $\beta^n$ is replaced by the product of the terms $\beta_i$ for i = 1 to n, but inferences are similar.
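A short sketch of the multiple-studies case, assuming the equal-power form PPV = R(1 − βⁿ)⁄(R + 1 − [1 − α]ⁿ − Rβⁿ), in which a "finding" means that at least one of n independent studies reaches significance. The function name and the chosen values of n are illustrative assumptions.

```python
def ppv_multiple_studies(power: float, R: float, n: int, alpha: float = 0.05) -> float:
    """PPV when at least one of n independent, equally powered studies is significant.

    P(at least one significant | true relationship) = 1 - beta**n
    P(at least one significant | no relationship)   = 1 - (1 - alpha)**n
    """
    beta = 1.0 - power
    numerator = R * (1.0 - beta**n)
    denominator = R + 1.0 - (1.0 - alpha) ** n - R * beta**n
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical setting: 80% power, 1:1 pre-study odds, growing number of teams.
    for n in (1, 5, 10):
        print(n, round(ppv_multiple_studies(0.80, 1.0, n), 2))
```

With n = 1 the expression reduces to the single-study PPV (about 0.94 here); it drops to about 0.82 for n = 5 and about 0.71 for n = 10.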
A practical example is shown in Box 1. Based on the above considerations, one may deduce several interesting corollaries about the probability that a research finding is indeed true.
Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true. Small sample size means smaller power and, for all functions above, the PPV for a true research finding decreases as power decreases towards 1 − β = 0.05. Thus, other factors being equal, research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized) [14] than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller) [15].
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. Power is also related to the effect size. Thus research findings are more likely true in scientific fields with large effects, such as the impact of smoking on cancer or cardiovascular disease (relative risks 3–20), than in scientific fields where postulated effects are small, such as genetic risk factors for multigenetic diseases (relative risks 1.1–1.5) [7]. Modern epidemiology is increasingly obliged to target smaller effect sizes [16]. Consequently, the proportion of true research findings is expected to decrease. In the same line of thinking, if the true effect sizes are very small in a scientific field, this field is likely to be plagued by almost ubiquitous false positive claims. For example, if the majority of true genetic or nutritional determinants of complex diseases confer relative risks less than 1.05, genetic or nutritional epidemiology would be largely utopian endeavors.

Table 1. Research Findings and True Relationships
Research Finding   True Relationship: Yes   True Relationship: No   Total
Yes                c(1 − β)R/(R + 1)        cα/(R + 1)              c(R + α − βR)/(R + 1)
No                 cβR/(R + 1)              c(1 − α)/(R + 1)        c(1 − α + βR)/(R + 1)
Total              cR/(R + 1)               c/(R + 1)               c
DOI: 10.1371/journal.pmed.0020124.t001

Table 2. Research Findings and True Relationships in the Presence of Bias
Research Finding   True Relationship: Yes          True Relationship: No       Total
Yes                (c[1 − β]R + ucβR)/(R + 1)      (cα + uc[1 − α])/(R + 1)    c(R + α − βR + u − uα + uβR)/(R + 1)
No                 (1 − u)cβR/(R + 1)              (1 − u)c(1 − α)/(R + 1)     c(1 − u)(1 − α + βR)/(R + 1)
Total              cR/(R + 1)                      c/(R + 1)                   c
DOI: 10.1371/journal.pmed.0020124.t002
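The expected counts of Table 1 can be written out directly, which also shows where the PPV expression comes from: it is the "Yes/Yes" cell divided by the total of the "Yes" row. The sketch below is illustrative only; the function names, the dictionary keys, and the example field (c = 1,000 probed relationships with R = 1:4) are assumptions.

```python
def expected_table(c: float, R: float, power: float, alpha: float = 0.05) -> dict:
    """Expected cell counts of Table 1 for c probed relationships with pre-study odds R."""
    beta = 1.0 - power
    n_true = c * R / (R + 1.0)  # relationships that truly exist
    n_null = c / (R + 1.0)      # relationships that do not exist
    return {
        "true_positive": n_true * (1.0 - beta),   # c(1 - beta)R/(R + 1)
        "false_positive": n_null * alpha,         # c*alpha/(R + 1)
        "false_negative": n_true * beta,          # c*beta*R/(R + 1)
        "true_negative": n_null * (1.0 - alpha),  # c(1 - alpha)/(R + 1)
    }


def ppv_from_table(table: dict) -> float:
    """PPV = true positives divided by all claimed findings."""
    claimed = table["true_positive"] + table["false_positive"]
    return table["true_positive"] / claimed


if __name__ == "__main__":
    table = expected_table(c=1000, R=0.25, power=0.80)
    print(table)
    print(ppv_from_table(table))
```

In this hypothetical field about 160 claimed findings are true and about 40 are false, giving a PPV of about 0.8, the same value the closed-form expression (1 − β)R⁄(R − βR + α) yields.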
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. As shown above, the post-study probability that a finding is true (PPV) depends a lot on the pre-study odds (R). Thus, research findings are more likely true in confirmatory designs, such as large phase III randomized controlled trials, or meta-analyses thereof, than in hypothesis-generating experiments. Fields considered highly informative and creative given the wealth of the assembled and tested information, such as microarrays and other high-throughput discovery-oriented research [4,8,17], should have extremely low PPV.
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. Flexibility increases the potential for transforming what would be “negative” results into “positive” results, i.e., bias, u. For several research designs, e.g., randomized controlled trials [18–20] or meta-analyses [21,22], there have been efforts to standardize their conduct and reporting. Adherence to common standards is likely to increase the proportion of true findings. The same applies to outcomes. True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised (e.g., scales for schizophrenia outcomes) [23]. Similarly, fields that use commonly agreed, stereotyped analytical methods (e.g., Kaplan-Meier plots and the log-rank test) [24] may yield a larger proportion of true findings than fields where analytical methods are still under experimentation (e.g., artificial intelligence methods) and only “best” results are reported. Regardless, even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials [25]. Simply abolishing selective publication would not make this problem go away.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. Conflicts of interest and prejudice may increase bias, u. Conflicts of interest are very common in biomedical research [26], and typically they are inadequately and sparsely reported [26,27]. Prejudice may not necessarily have financial roots. Scientists in a given field may be prejudiced purely because of their belief in a scientific theory or commitment to their own findings. Many otherwise seemingly independent, university-based studies may be conducted for no other reason than to give physicians and researchers qualifications for promotion or tenure. Such nonfinancial conflicts may also lead to distorted reported results and interpretations. Prestigious investigators may suppress via the peer review process the appearance and dissemination of findings that refute their findings, thus condemning their field to perpetuate false dogma. Empirical evidence on expert opinion shows that it is extremely unreliable [28].
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. This seemingly paradoxical corollary follows because, as stated above, the PPV of isolated findings decreases when many teams of investigators are involved in the same field. This may explain why we occasionally see major excitement followed rapidly by severe disappointments in fields that draw wide attention. With many teams working on the same field and with massive experimental data being produced, timing is of the essence in beating competition. Thus, each team may prioritize pursuing and disseminating its most impressive “positive” results. “Negative” results may become attractive for dissemination only if some other team has found a “positive” association on the same question. In that case, it may be attractive to refute a claim made in some prestigious journal. The term Proteus phenomenon has been coined to describe this phenomenon of rapidly alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics.

Table 3. Research Findings and True Relationships in the Presence of Multiple Studies
Research Finding   True Relationship: Yes       True Relationship: No              Total
Yes                $cR(1 - \beta^n)/(R + 1)$    $c(1 - [1 - \alpha]^n)/(R + 1)$    $c(R + 1 - [1 - \alpha]^n - R\beta^n)/(R + 1)$
No                 $cR\beta^n/(R + 1)$          $c(1 - \alpha)^n/(R + 1)$          $c([1 - \alpha]^n + R\beta^n)/(R + 1)$
Total              $cR/(R + 1)$                 $c/(R + 1)$                        $c$
DOI: 10.1371/journal.pmed.0020124.t003

Figure 1. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Levels of Bias, u
Panels correspond to power of 0.20, 0.50, and 0.80.
DOI: 10.1371/journal.pmed.0020124.g001
These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields
In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable standardization of laboratory and statistical methods, outcomes, and reporting thereof to minimize bias.

Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n
Panels correspond to power of 0.20, 0.50, and 0.80.
DOI: 10.1371/journal.pmed.0020124.g002

Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = $10^{-4}$, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = $10^{-4}$. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only $12 \times 10^{-4}$.
Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only $4.4 \times 10^{-4}$. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only $1.5 \times 10^{-4}$, hardly any higher than the probability we had before any of this extensive research was undertaken!
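The single-study and bias figures quoted in Box 1 can be reproduced with a few lines. This is a sketch under the Box's stated inputs (R = 10/100,000, 60% power, α = 0.05, u = 0.10); the function names are illustrative assumptions, and the multiple-teams figure is not recomputed here.

```python
def ppv(power: float, R: float, alpha: float = 0.05) -> float:
    """Single study, no bias: (1 - beta) * R / (R - beta * R + alpha)."""
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


def ppv_with_bias(power: float, R: float, u: float, alpha: float = 0.05) -> float:
    """Bias-adjusted PPV with bias proportion u."""
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


R = 10 / 100_000                 # pre-study odds from Box 1
prior = R / (R + 1)              # pre-study probability
post = ppv(power=0.60, R=R)      # post-study probability, no bias
post_biased = ppv_with_bias(power=0.60, R=R, u=0.10)

if __name__ == "__main__":
    print(f"pre-study probability : {prior:.2e}")
    print(f"post-study, no bias   : {post:.2e} ({post / prior:.1f}-fold increase)")
    print(f"post-study, u = 0.10  : {post_biased:.2e}")
```

The no-bias value comes out at roughly 12 × 10⁻⁴, a 12-fold increase over the pre-study probability, and the biased value at roughly 4.4 × 10⁻⁴, in line with the Box.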

Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias
As shown, the majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings. Let us suppose that in a research field there are no true findings at all to be discovered. History of science teaches us that scientific endeavor has often in the past wasted effort in fields with absolutely no yield of true scientific information, at least based on our current understanding. In such a “null field,” one would ideally expect all observed effect sizes to vary by chance around the null in the absence of bias. The extent that observed findings deviate from what is expected by chance alone would be simply a pure measure of the prevailing bias.
For example, let us suppose that no nutrients or dietary patterns are actually important determinants for the risk of developing a specific tumor. Let us also suppose that the scientific literature has examined 60 nutrients and claims all of them to be related to the risk of developing this tumor with relative risks in the range of 1.2 to 1.4 for the comparison of the upper to lower intake tertiles. Then the claimed effect sizes are simply measuring nothing else but the net bias that has been involved in the generation of this scientific literature. Claimed effect sizes are in fact the most accurate estimates of the net bias. It even follows that between “null fields,” the fields that claim stronger effects (often with accompanying claims of medical or public health importance) are simply those that have sustained the worst biases.
For fields with very low PPV, the few true relationships would not distort this overall picture much. Even if a few relationships are true, the shape of the distribution of the observed effects would still yield a clear measure of the biases involved in the field. This concept totally reverses the way we view scientific results. Traditionally, investigators have viewed large and highly significant effects with excitement, as signs of important discoveries. Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research. They should lead investigators to careful critical thinking about what might have gone wrong with their data, analyses, and results.
Of course, investigators working in any field are likely to resist accepting that the whole field in which they have spent their careers is a “null field.” However, other lines of evidence, or advances in technology and experimentation, may lead eventually to the dismantling of a scientific field. Obtaining measures of the net bias in one field may also be useful for obtaining insight into what might be the range of bias operating in other fields where similar analytical methods, technologies, and conflicts may be operating.

How Can We Improve the Situation?
Is it unavoidable that most research findings are false, or can we improve the situation? A major problem is that it is impossible to know with 100% certainty what the truth is in any research question. In this regard, the pure “gold” standard is unattainable. However, there are several approaches to improve the post-study probability.
Better powered evidence, e.g., large studies or low-bias meta-analyses, may help, as it comes closer to the unknown “gold” standard. However, large studies may still have biases and these should be acknowledged and avoided. Moreover, large-scale evidence is impossible to obtain for all of the millions and trillions of research questions posed in current research. Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive. Large-scale evidence is also particularly indicated when it can test major concepts rather than narrow, specific questions. A negative finding can then refute not only a specific proposed claim, but a whole field or considerable portion thereof. Selecting the performance of large-scale studies based on narrow-minded criteria, such as the marketing promotion of a specific drug, is largely wasted research. Moreover, one should be cautious that extremely large studies may be more likely to find a formally statistically significant difference for a trivial effect that is not really meaningfully different from the null [32–34].
Second, most research questions are addressed by many teams, and it is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence. Diminishing bias through enhanced research standards and curtailing of prejudices may also help. However, this may require a change in scientific mentality that might be difficult to achieve. In some research designs, efforts may also be more successful with upfront registration of studies, e.g., randomized trials [35]. Registration would pose a challenge for hypothesis-generating research. Some kind of registration or networking of data collections or investigators within fields may be more feasible than registration of each and every hypothesis-generating experiment. Regardless, even if we do not see a great deal of progress with registration of studies in other fields, the principles of developing and adhering to a protocol could be more widely borrowed from randomized controlled trials.

Table 4. PPV of Research Findings for Various Combinations of Power (1 − β), Ratio of True to Not-True Relationships (R), and Bias (u)
1 − β   R         u      Practical Example                                                        PPV
0.80    1:1       0.10   Adequately powered RCT with little bias and 1:1 pre-study odds           0.85
0.95    2:1       0.30   Confirmatory meta-analysis of good-quality RCTs                          0.85
0.80    1:3       0.40   Meta-analysis of small inconclusive studies                              0.41
0.20    1:5       0.20   Underpowered, but well-performed phase I/II RCT                          0.23
0.20    1:5       0.80   Underpowered, poorly performed phase I/II RCT                            0.17
0.80    1:10      0.30   Adequately powered exploratory epidemiological study                     0.20
0.20    1:10      0.30   Underpowered exploratory epidemiological study                           0.12
0.20    1:1,000   0.80   Discovery-oriented exploratory research with massive testing             0.0010
0.20    1:1,000   0.20   As in previous example, but with more limited bias (more standardized)   0.0015
The estimated PPVs (positive predictive values) are derived assuming α = 0.05 for a single study.
RCT, randomized controlled trial.
DOI: 10.1371/journal.pmed.0020124.t004
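The PPV for each scenario of Table 4 can be recomputed from its power, pre-study odds, and bias. The sketch below assumes the bias-adjusted PPV expression with α = 0.05 for a single study; the function name, the `SCENARIOS` list, and the shortened scenario labels are illustrative assumptions.

```python
def ppv_with_bias(power: float, R: float, u: float, alpha: float = 0.05) -> float:
    """Bias-adjusted PPV: ([1 - beta]R + u*beta*R) / (R + alpha - beta*R + u - u*alpha + u*beta*R)."""
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# (power, R, u, shortened label) for the nine rows of Table 4.
SCENARIOS = [
    (0.80, 1 / 1, 0.10, "Adequately powered RCT, little bias"),
    (0.95, 2 / 1, 0.30, "Confirmatory meta-analysis of good-quality RCTs"),
    (0.80, 1 / 3, 0.40, "Meta-analysis of small inconclusive studies"),
    (0.20, 1 / 5, 0.20, "Underpowered, well-performed phase I/II RCT"),
    (0.20, 1 / 5, 0.80, "Underpowered, poorly performed phase I/II RCT"),
    (0.80, 1 / 10, 0.30, "Adequately powered exploratory epidemiological study"),
    (0.20, 1 / 10, 0.30, "Underpowered exploratory epidemiological study"),
    (0.20, 1 / 1000, 0.80, "Discovery-oriented research, massive testing"),
    (0.20, 1 / 1000, 0.20, "Same, with more limited bias"),
]

results = [ppv_with_bias(power, R, u) for power, R, u, _ in SCENARIOS]

if __name__ == "__main__":
    for (power, R, u, label), value in zip(SCENARIOS, results):
        print(f"{power:.2f}  R={R:<8.4g} u={u:.2f}  PPV={value:.4f}  {label}")
```

Only the first two scenarios exceed a PPV of 50%, which matches the essay's statement that an adequately powered trial with 1:1 pre-study odds is true about 85% of the time while exploratory designs fare far worse.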
Finally, instead of chasing statistical significance, we should improve our understanding of the range of R values (the pre-study odds) where research efforts operate [10]. Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship. Speculated high R values may sometimes then be ascertained. As described above, whenever ethically acceptable, large studies with minimal bias should be performed on research findings that are considered relatively established, to see how often they are indeed confirmed. I suspect several established “classics” will fail the test.
Nevertheless, most new discoveries will continue to stem from hypothesis-generating research with low or very low pre-study odds. We should then acknowledge that statistical significance testing in the report of a single study gives only a partial picture, without knowing how much testing has been done outside the report and in the relevant field at large. Despite a large statistical literature for multiple testing corrections [37], usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding. Even if determining this were feasible, this would not inform us about the pre-study odds. Thus, it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs. The wider field may yield some guidance for estimating this probability for the isolated research project. Experiences from biases detected in other neighboring fields would also be useful to draw upon. Even though these assumptions would be considerably subjective, they would still be very useful in interpreting research claims and putting them in context.
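To make the point about unreported testing concrete, the following sketch extends the PPV calculation to n independent studies of equal power probing the same relationship, where a finding is claimed if at least one of them reaches significance. It assumes the relation PPV = R(1 − β^n) / (R + 1 − [1 − α]^n − Rβ^n) from the framework earlier in this essay and ignores bias. The function name `ppv_multi`, the assumed values of R, and the values of n are hypothetical choices for illustration, not estimates for any real field.

```python
def ppv_multi(R, n, alpha=0.05, power=0.8):
    """PPV when n independent, equally powered studies address the same question
    and a finding is claimed if at least one is significant (bias ignored)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    beta = 1.0 - power
    true_hits = R * (1.0 - beta ** n)       # true relationships with at least one positive
    false_hits = 1.0 - (1.0 - alpha) ** n   # null relationships with at least one positive
    return true_hits / (true_hits + false_hits)


# Sweep a few assumed pre-study odds and amounts of testing (illustrative values).
for R in (1.0, 0.1, 0.01):
    row = ", ".join(f"n={n}: {ppv_multi(R, n):.3f}" for n in (1, 5, 10))
    print(f"assumed R = {R}: {row}")
```

Because n is rarely known in practice, a sweep of this kind is best read as a sensitivity check on one's assumptions rather than as a correction that can be applied to a published result.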
References
1. Ioannidis JP, Haidich AB, Lau J (2001) Any casualties in the clash of randomised and observational evidence? BMJ 322: 879–880.
2. Lawlor DA, Davey Smith G, Kundu D, Bruckdorfer KR, Ebrahim S (2004) Those confounded vitamins: What can we learn from the differences between observational versus randomised trial evidence? Lancet 363: 1724–
3. Vandenbroucke JP (2004) When are observational studies as credible as randomised trials? Lancet 363: 1728–1731.
4. Michiels S, Koscielny S, Hill C (2005) Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 365: 488–492.
5. Ioannidis JPA, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG (2001) Replication validity of genetic association studies. Nat Genet 29: 306–309.
6. Colhoun HM, McKeigue PM, Davey Smith G (2003) Problems of reporting genetic associations with complex outcomes. Lancet 361: 865–872.
7. Ioannidis JP (2003) Genetic associations: False or true? Trends Mol Med 9: 135–138.
8. Ioannidis JPA (2005) Microarrays and molecular research: Noise discovery? Lancet 365: 454–455.
9. Sterne JA, Davey Smith G (2001) Sifting the evidence—What’s wrong with significance tests. BMJ 322: 226–231.
10. Wacholder S, Chanock S, Garcia-Closas M, El ghormli L, Rothman N (2004) Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J Natl Cancer Inst 96: 434–442.
11. Risch NJ (2000) Searching for genetic determinants in the new millennium. Nature 405: 847–856.
12. Kelsey JL, Whittemore AS, Evans AS, Thompson WD (1996) Methods in observational epidemiology, 2nd ed. New York: Oxford U Press. 432 p.
13. Topol EJ (2004) Failing the public health—Rofecoxib, Merck, and the FDA. N Engl J Med 351: 1707–1709.
14. Yusuf S, Collins R, Peto R (1984) Why do we need some large, simple randomized trials? Stat Med 3: 409–422.
15. Altman DG, Royston P (2000) What do we mean by validating a prognostic model? Stat Med 19: 453–473.
16. Taubes G (1995) Epidemiology faces its limits. Science 269: 164–169.
17. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531–537.
18. Moher D, Schulz KF, Altman DG (2001) The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 357: 1191–1194.
19. Ioannidis JP, Evans SJ, Gotzsche PC, O’Neill RT, Altman DG, et al. (2004) Better reporting of harms in randomized trials: An extension of the CONSORT statement. Ann Intern Med 141: 781–788.
20. International Conference on Harmonisation E9 Expert Working Group (1999) ICH Harmonised Tripartite Guideline. Statistical principles for clinical trials. Stat Med 18: 1905–
21. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, et al. (1999) Improving the quality of reports of meta-analyses of randomised controlled trials: The QUOROM statement. Quality of Reporting of Meta-analyses. Lancet 354: 1896–1900.
22. Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson GD, et al. (2000) Meta-analysis of observational studies in epidemiology: A proposal for reporting. Meta-analysis of Observational Studies in Epidemiology (MOOSE) group. JAMA 283: 2008–2012.
23. Marshall M, Lockwood A, Bradley C, Adams C, Joy C, et al. (2000) Unpublished rating scales: A major source of bias in randomised controlled trials of treatments for schizophrenia. Br J Psychiatry 176: 249–252.
24. Altman DG, Goodman SN (1994) Transfer of technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA 272: 129–132.
25. Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG (2004) Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. JAMA 291: 2457–2465.
26. Krimsky S, Rothenberg LS, Stott P, Kyle G (1998) Scientific journals and their authors’ financial interests: A pilot study. Psychother Psychosom 67: 194–201.
27. Papanikolaou GN, Baltogianni MS, Contopoulos-Ioannidis DG, Haidich AB, Giannakakis IA, et al. (2001) Reporting of conflicts of interest in guidelines of preventive and therapeutic interventions. BMC Med Res Methodol 1: 3.
28. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC (1992) A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA 268: 240–248.
29. Ioannidis JP, Trikalinos TA (2005) Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol 58:
30. Ntzani EE, Ioannidis JP (2003) Predictive ability of DNA microarrays for cancer outcomes and correlates: An empirical assessment. Lancet 362: 1439–1444.
31. Ransohoff DF (2004) Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 4: 309–314.
32. Lindley DV (1957) A statistical paradox. Biometrika 44: 187–192.
33. Bartlett MS (1957) A comment on D.V. Lindley’s statistical paradox. Biometrika 44:
34. Senn SJ (2001) Two cheers for P-values. J Epidemiol Biostat 6: 193–204.
35. De Angelis C, Drazen JM, Frizelle FA, Haug C, Hoey J, et al. (2004) Clinical trial registration: A statement from the International Committee of Medical Journal Editors. N Engl J Med 351:
36. Ioannidis JPA (2005) Contradicted and initially stronger effects in highly cited clinical research. JAMA 294: 218–228.
37. Hsueh HM, Chen JJ, Kodell RL (2003) Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat 13: 675–689.