
The ‘File Drawer’ Problem and Tolerance for Null Results


Abstract

For any given research area, one cannot tell how many studies have been conducted but never reported. The extreme view of the "file drawer problem" is that journals are filled with the 5% of the studies that show Type I errors, while the file drawers are filled with the 95% of the studies that show nonsignificant results. Quantitative procedures for computing the tolerance for filed and future null results are reported and illustrated, and the implications are discussed.
Psychological Bulletin
1979, Vol. 86, No. 3, 638-641

Robert Rosenthal
Harvard University

Both behavioral researchers and statisticians have long suspected that the studies published in the behavioral sciences are a biased sample of the studies that are actually carried out (Bakan, 1967; McNemar, 1960; Smart, 1964; Sterling, 1959). The extreme view of this problem, the "file drawer problem," is that the journals are filled with the 5% of the studies that show Type I errors, while the file drawers back at the lab are filled with the 95% of the studies that show nonsignificant (e.g., p > .05) results.

In the past there was very little one could do to assess the net effect of studies, tucked away in file drawers, that did not make the magic .05 level (Rosenthal & Gaito, 1963, 1964). Now, however, although no definitive solution to the problem is available, one can establish reasonable boundaries on the problem and estimate the degree of damage to any research conclusion that could be done by the file drawer problem.

This advance in our ability to cope with the file drawer is an outgrowth of the increasing interest of behavioral scientists in summarizing bodies of research literature systematically and quantitatively, both with respect to significance levels (Rosenthal, 1969, 1976, 1978) and with respect to effect-size estimation (Hall, 1978; Rosenthal, 1969, 1976; Rosenthal & Rosnow, 1975; Smith & Glass, 1977; Glass, Note 1). One hopes that this interest in summarizing entire research domains will lead to an improvement in bookkeeping so that eventually all results will be recorded both with an estimate of effect size (e.g., r or d; Cohen, 1977) and with the level of significance obtained, or more practically, with the standard normal deviate (Z) that corresponds to the obtained p (Rosenthal, 1978).¹ Future appraisals of research domains of the type found in Psychological Bulletin should give estimates of overall effect sizes and significance levels; these estimates of overall significance can provide a basis for coping with the file drawer problem.

Preparation of this article was supported in part by the National Science Foundation. I would like to thank Judith A. Hall and Donald B. Rubin for their valuable improvements of an earlier version of this article.

Requests for reprints should be sent to Robert Rosenthal, Department of Psychology and Social Relations, Harvard University, 33 Kirkland Street, Cambridge, Massachusetts 02138.

Tolerance for Future Null Results

Given any systematic quantitative review of the literature bearing on a particular hy-

¹ Standard normal deviates (Z) can be found by various methods, of which the following three are most often useful: (a) Obtain the exact p associated with the test statistic (e.g., t, F, or χ²) and find the Z associated with that p in tables of the normal distribution; (b) if the effect size r or phi is given or can be computed, Z can be estimated by $r N^{1/2}$; (c) if the effect size d is given or can be computed, Z can be estimated by $[d^2/(d^2 + 4)]^{1/2} N^{1/2}$.

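
The three conversions listed in Footnote 1 can be sketched in a few lines of Python. This is only an illustration, not code from the article: the function names are hypothetical, the p value in (a) is assumed to be one-tailed (the footnote does not say), and the table lookup is replaced by the standard library's inverse normal CDF. Conversions (b) and (c) follow the footnote's expressions, Z ≈ r·√N and Z ≈ [d²/(d² + 4)]^(1/2)·√N.

```python
from math import sqrt
from statistics import NormalDist  # available in Python 3.8+


def z_from_p(p_one_tailed: float) -> float:
    """(a) Z whose upper-tail area under the standard normal equals p.

    Assumption: p is one-tailed; the footnote does not specify this.
    """
    return NormalDist().inv_cdf(1.0 - p_one_tailed)


def z_from_r(r: float, n: int) -> float:
    """(b) Z estimated as r * sqrt(N)."""
    return r * sqrt(n)


def z_from_d(d: float, n: int) -> float:
    """(c) Z estimated as sqrt(d^2 / (d^2 + 4)) * sqrt(N).

    The square root discards the sign of d, so the result is non-negative.
    """
    return sqrt(d * d / (d * d + 4.0)) * sqrt(n)


if __name__ == "__main__":
    print(z_from_p(0.05))      # Z for a one-tailed p of .05
    print(z_from_r(0.30, 100)) # Z from r = .30 with N = 100
    print(z_from_d(0.50, 64))  # Z from d = .50 with N = 64
```

Conversions (b) and (c) agree with each other for positive d, because d²/(d² + 4) is the square of the r implied by d under this approximation. For a negative d, the direction of the effect would have to be restored by hand.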
Copyright 1979 by the American Psychological Association, Inc. 0033-2909/79/8603-0638$00.75