Examining the generalizability of research findings from archival data

Andrew Delios^a,1,2, Elena Giulia Clemente^b,1, Tao Wu^c,1, Hongbin Tan^d, Yong Wang^e, Michael Gordon^f, Domenico Viganola^g, Zhaowei Chen^a, Anna Dreber^b,h, Magnus Johannesson^b, Thomas Pfeiffer^f, Generalizability Tests Forecasting Collaboration^3, and Eric Luis Uhlmann^i,1,2
Edited by Simine Vazire, The University of Melbourne, Melbourne, Australia; received November 8, 2021; accepted June 8, 2022 by Editorial Board
Member Mark Granovetter.
This initiative examined systematically the extent to which a large set of archival research findings generalizes across contexts. We repeated the key analyses for 29 original strategic management effects in the same context (direct reproduction) as well as in 52 novel time periods and geographies; 45% of the reproductions returned results matching the original reports, as did 55% of tests in different spans of years and 40% of tests in novel geographies. Some original findings were associated with multiple new tests. Reproducibility was the best predictor of generalizability: for the findings that proved directly reproducible, 84% emerged in other available time periods and 57% emerged in other geographies. Overall, only limited empirical evidence emerged for context sensitivity. In a forecasting survey, independent scientists were able to anticipate which effects would find support in tests in new samples.
research reliability | generalizability | archival data | reproducibility | context sensitivity
Do research investigations in the social sciences reveal regularities in individual and collective behavior that we can expect to hold across contexts? Are they more akin to case studies, capturing a particular place and moment in time? Or are they something in between, capturing patterns that emerge reliably in some conditions but are absent or reversed in others depending on moderating factors, which may yet await discovery?
Social scientists, like their counterparts in more established fields such as chemistry, physics, and biology, strive to uncover predictable regularities about the world. However, psychology, economics, management, and related fields have become embroiled in controversies as to whether the claimed discoveries are reliable (1–11). When reading a research report, is it sensible to assume the finding is a true positive rather than a false positive (12, 13)? Additionally, if evidence was obtained from another context (e.g., a different culture or a different time period), is it reasonable to extract lessons for the situations and choices of intellectual and practical interest to you?
These issues of research reliability and context sensitivity are increasingly intertwined. One common counterexplanation for evidence that a scientific finding is not as reliable as initially expected is that it holds in the original context but not in some other contexts, for example due to cultural differences or changes in situations over time (14–19). Taken to the extreme, however, this explanation converts research reports into case studies with little to say about other populations and situations, such that findings and theories are rendered unfalsifiable (11, 20, 21). The multilaboratory replication efforts thus far suggest that experimental laboratory effects either generally hold across samples, including those in different nations, or consistently fail to replicate across sites (22–26). We suggest that the generalizability of archival findings is likewise worthy of systematic investigation (27–29).
Ways of Knowing
Experimental and observational studies represent two of the major ways by which social scientists attempt to study the world quantitatively (30). An experiment is uniquely advantaged to establish causal relationships, but a host of variables (e.g., corporate strategies, financial irregularities, workplace injuries, abusive workplace supervision, sexual harassment) cannot be manipulated experimentally either ethically or pragmatically (31). In contrast, an archival or observational dataset (henceforth referred to as archival) allows for assessing the strength of association between variables of interest in an ecologically valid setting (e.g., harassment complaints and work performance over many years).
Large-scale replication projects reveal that many effects from behavioral experiments do not readily emerge in independent laboratories using the same methods and materials but new observations (22–24, 32–35). No similar initiative has systematically retested
Significance
The extent to which results from complex datasets generalize across contexts is critically important to numerous scientific fields as well as to practitioners who rely on such analyses to guide important strategic decisions. Our initiative systematically investigated whether findings from the field of strategic management would emerge in new time periods and new geographies. Original findings that were statistically reliable in the first place were typically obtained again in novel tests, suggesting surprisingly little sensitivity to context. For some social scientific areas of inquiry, results from a specific time and place can be a meaningful guide as to what will be observed more generally.
Author contributions: A. Delios, E.G.C., T.W., H.T., Y.W.,
M.G., D.V., Z.C., A. Dreber, M.J., T.P., and E.L.U.
designed research; A. Delios, E.G.C., T.W., H.T., Y.W.,
M.G., D.V., Z.C., A. Dreber, M.J., T.P., and E.L.U.
performed research; A. Delios, E.G.C., T.W., H.T., Y.W.,
M.G., D.V., Z.C., A. Dreber, M.J., T.P., and E.L.U.
analyzed data; G.T.F.C. performed forecasting; and
A. Delios, E.G.C., T.W., A. Dreber, and E.L.U. wrote the
paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission. S.V. is a guest
editor invited by the Editorial Board.
Copyright © 2022 the Author(s). Published by PNAS.
This open access article is distributed under Creative
Commons Attribution-NonCommercial-NoDerivatives
License 4.0 (CC BY-NC-ND).
1. A. Delios, E.G.C., T.W., and E.L.U. contributed equally to this work.
2. To whom correspondence may be addressed. Email: andrew@nus.edu.sg or eric.luis.uhlmann@gmail.com.
3. A complete list of the Generalizability Tests Forecasting Collaboration authors can be found in SI Appendix.
This article contains supporting information online at http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2120377119/-/DCSupplemental.
Published July 19, 2022.
archival findings in novel contexts using new data. Yet, there is little reason to assume that archival findings are inherently more reliable than experiments (36–39). A great deal of archival data are unavailable for reanalysis due to confidentiality concerns, nondisclosure agreements with private companies, loss, and investigator unwillingness (35, 40–46). Independent researchers have encountered substantial difficulties reproducing results from the statistics reported in the article (3) and, when available, from the same data and code (47–55). Efforts to crowdsource the analysis of complex archival sources, assigning the same research question and dataset to numerous independent scientists, indicate that defensible yet subjective analytic choices have a large impact on the reported results (56–60).
Experimental and archival research could differ more in targetability for reexamination with new observations rather than in their inherent soundness. In other words, it is typically easier for independent scientists to target experiments for repetition in new samples than it is for many archival studies. Although it is straightforward to conduct a simple experiment again with a new population (e.g., a different university subject pool), this is not feasible for many archival findings. For example, if the executive who granted access to data has left the firm, it may no longer be possible to sample employment data from a specific company for a new span of years, and other companies may collect different information about their employees, thus rendering the datasets noncomparable. Thus, although it is at this point clear that many experimental findings do not readily emerge again when the same method and analyses are repeated using new observations (10, 34), this key aspect of the reliability of archival findings remains as yet unknown.
Forms of Research Reliability
Distinct types of research reliability are often conflated, especially across diverse methodologies and fields where different standards may prevail (27, 35, 61–68). Drawing on existing frameworks, we refer to verifying research results using the same dataset and analytic approach as a direct reproduction, relying on the original data and employing alternative specifications as a robustness test, and repeating the original analyses on a new set of data (e.g., separate time period, different country) as a direct replication or generalizability test depending on the implications of the results for the original finding. Different aspects of research reliability can be examined in tandem: for example, sampling new data and carrying out many defensible analytic approaches at the same time.
The notion of a generalizability test captures the expectation that universality is incredibly unlikely (69) and that findings from a complex dataset with a host of interrelated variables may not emerge in new contexts for reasons that are theoretically informative. Unlike chemical reactions or the operation of physical laws, social behaviors ought to vary meaningfully between populations and time periods, in some cases for reasons that are not yet fully understood. For example, the effectiveness of a specific corporate strategy likely changes over time as economic circumstances shift and probably varies across cultural, political, and institutional settings. If a true positive finding among Korean manufacturers does not emerge in a sample of US pharmaceutical firms, then the line of inquiry has been fruitfully extended to a new context, allowing for an assessment of the generality vs. context specificity of strategic choices by firms (70–73). It is scientifically interesting if an empirical pattern generally holds. It is also scientifically interesting if it does not.
This distinction between a generalizability test and direct replication is theory laden. In both cases, the same methodology and statistical analyses are repeated on a new sample. However, a failed replication casts doubt on the original finding (74), whereas a generalizability test can only fail to extend it to a new context. Importantly, the line of division between a generalizability test and replication does not lie between archival datasets and behavioral experiments. Some efforts to repeat past behavioral experiments may occur in a sufficiently different context such that inconsistent results do not reflect negatively on the original research [e.g., repeating the Ultimatum Game experiment among the Machiguenga of the Peruvian Amazon (75)]. Likewise, tests of the same empirical predictions in two different complex datasets (e.g., the personnel records of two companies) can occur with a strong theoretical expectation of consistent findings. The present initiative provides a test of generalizability, not replicability, because the targeted field of international strategic management theoretically expects high levels of context sensitivity and was in fact founded on this principle (76).
The Present Research
We leveraged a longitudinal dataset from the field of international strategic management to examine systematically if findings from a given span of years emerge in different time periods and geographies. We also carried out a direct reproduction of each original study, or in other words, we conducted the same analysis on the same set of observations (27, 67). The present initiative, therefore, focused on the reproducibility and generalizability, but not robustness, of a set of archival findings, leveraging a single large dataset that was the basis for all tests.
The dataset on foreign direct investment by Japanese firms was originally constructed by the first author from various sources and subsequently leveraged for scores of academic publications by numerous researchers. Our set of 29 target articles consisted of those publications for which no major new data collection by the present author team was required to conduct this metascientific investigation. For each published article, the original authors selected a subsample by time period or geography from within the larger dataset. As such, the portions of the larger dataset not used by the original authors were sufficient to conduct our generalizability tests. In many cases, further years of data accumulated after the publication of the original article, allowing for time extensions to subsequent years. Inclusion and exclusion decisions were made prior to conducting any analyses, such that the final set of findings was selected blind to the consequences for overall rates of reproducibility and generalizability. The reproduction repeated the sampling procedure and analyses from the original article. The generalizability tests (67) utilized the same analytic approach but different sets of observations from those in the original investigation, and thus attempted to extend the findings to new contexts.
Previous metascientific investigations have examined whether results from complex datasets can be reproduced using the statistics from the original report (3), with the same data and code (47, 49), with the same data yet alternative specifications (56–60, 77–82), and with improvements on the original analyses and an expanded dataset including both the original and new observations (8). In only a few cases have the identical analyses been repeated in new samples to probe the generalizability of the findings (28, 83–86).
Closest to the present initiative in both topic and general
approach is the important 2016 special issue of the Strategic
Management Journal (62), which reexamined a small set of
influential published findings varying the research method and/or sampling approach. In these cases, it is difficult to distinguish whether discrepancies in results are due to changes in the analyses or context since both were altered. Further, since no direct reproductions were carried out (i.e., same analyses on the same data), we have no sense of whether inconsistent results are failed extensions or failures of reproducibility. The present research constitutes a systematic and simultaneous test of the reproducibility and generalizability of a large set of archival findings.
It also remains unknown if scientists are generally optimistic, pessimistic, or fairly accurate about whether findings generalize to new situations. Prior forecasting studies find that, based solely on a research abstract or set of study materials, academics are fairly effective at anticipating whether a research hypothesis will be confirmed in an upcoming experiment (e.g., refs. 32, 33, 74, and 87–91). We extend this line of metascientific investigation to include forecasts about the results of archival analyses, examining whether scientists can anticipate the generalizability of such findings across contexts.
Methods
Generalizability Study.
Sample of original findings. We first identified all the refereed articles that used an international strategic management dataset initially built by the first author (A. Delios). These research articles are based on longitudinal, multihost country data on Japanese foreign direct investment. The two main data sources used to assemble the single larger dataset are Kaigai Shinshutsu Kigyou Souran-Kuni Betsu and the Nikkei Economic Electronic Databank System. This single larger dataset, used for all reproduction and generalizability tests, assembled disparate variables together to facilitate testing empirical hypotheses regarding the strategic decisions of international companies. Our initial sample of articles consisted of 112 studies published in 33 management journals.
Our only further selection criterion was whether the reproduction and generalizability tests could be carried out without a major new data collection effort by the present project team. We made the a priori decision to focus on 29 papers (Table 1 and SI Appendix, Table S7-14 have details) based on the accessibility of the original data as well as additional data necessary to conduct generalizability tests. Hence, for some tests, we collected additional data from open sources, such as the World Bank, the United Nations, and other organizations and institutes. This final set of 29 papers appeared in prominent outlets, including Strategic Management Journal (5), Academy of Management Journal (1), Organization Science (1), Administrative Science Quarterly (1), and Journal of International Business Studies (5), among others. The impact factors of the journals ranged from 1.03 to 11.82, with a median of 7.49 and a mean of 6.99 (SD = 2.87). The papers have had a pronounced impact on the field of strategic management, with citation counts ranging from 16 to 2,910, with a median of 163 and a mean of 411.79 (SD = 582.83). SI Appendix, supplement 1 has a more detailed overview of these article-level characteristics.
That the present first author built the basic dataset creates a unique opportunity; unlike other metascientific investigations, we avoid the selection bias introduced when original authors decline requests for data and other key materials. Although more complete, our sample frame is also narrower and does not allow us to make strong claims about the entire strategic management literature compared with sampling representatively. At the same time, we provide an empirical assessment of what the generalizability rate of a set of archival findings to new time periods and geographies can look like.
Analysis copiloting and consultations with original authors. Each reproducibility and generalizability test was carried out by two analysis copilots (92) who worked independently; then, they compared results and contacted the original authors for feedback as needed. Thus, many individual specifications received a form of peer review from the original authors, specifically an analytic review. Original authors were asked to give feedback on the reproduction of their published research, and this specification was then repeated for all available further time periods and geographies to test generalizability. In other words, original authors were not allowed input into the sampling approach for the new tests, only on the analytic approach used for both the reproduction and generalizability tests. SI Appendix, supplement 2 has a detailed overview of this process, and SI Appendix, Table S7-15 shows how discrepancies between copilots were resolved.
We did not preregister each specific reproduction and generalizability test because the copilots simply repeated the specification described by the original authors on all available time periods and geographies in the dataset that had sufficient data. Thus, the methods and results sections of the 29 original papers served as our analysis plans, with the only added constraint of data availability. We conducted tests in all possible alternative geographies and time periods with sample sizes comparable with the original published report. We had to forgo testing generalizability to nations and spans of years with inadequate numbers of observations or for which key variables were unavailable entirely.
Forecasting Survey. Following previous efforts (91, 93), we asked independent scientists (n = 238) recruited via social media advertisements to attempt to predict the outcomes of the generalizability tests while blinded to the results. Each forecaster was provided with the original article's title; abstract; full text, including the original sample size and all associated analyses; the key statistical test from the paper; and a narrative summary of the focal finding and attempted to predict both its direct reproducibility and generalizability to different time periods. We asked forecasters to assign probabilities that results would be statistically significant in the same direction as the original study for original positive results and probabilities that results would be nonsignificant for original nonsignificant results. We did not ask forecasters to predict effect sizes given the complex results of many original studies (e.g., an inverted U-shaped relationship between the number of expatriate employees and international joint venture performance), which we believed would prove difficult to mentally convert into effect sizes. Future research should examine forecasts about context sensitivity using more granular predictions focused on effect sizes, ideally using target studies with simple designs and results (e.g., two-condition behavioral experiments).
We did not ask forecasters to predict the generalizability of the original findings to other geographies given the limited number of geographic extension tests possible with the available data. When multiple time extension tests had been carried out for the same original finding, just one generalizability result of similar length to the original time period was selected as a target for forecasting. Sample sizes were by design roughly equivalent between original studies and generalizability tests. SI Appendix, supplement 3 contains the complete forecasting survey items, and SI Appendix, supplement 4 has the preregistered analysis plan (https://osf.io/t987n).
Results
One key effect from each of the 29 original papers was subjected to a direct reproduction. We also carried out 42 time extension tests and 10 geographic extension tests. A subset of original effects was subjected to multiple generalizability tests (for example, a backward time extension [previous decade], forward time extension [subsequent decade], and geographic extension [new countries and territories]), resulting in a total of 52 generalizability tests for 29 original effects. Table 2 and SI Appendix, Tables S7-2 and S7-17–S7-21 summarize the results of a set of research reproducibility criteria. These include whether the original, reproduction, and generalizability results are in the same direction and whether the effect is statistically significant in the individual generalizability tests, aggregating across all available generalizability tests and aggregating across all available data, including both reproduction and generalizability tests (34, 94, 95). We did not test for differences between original and generalizability test effect sizes because there was not enough statistical information in many of the published research reports to calculate the former.
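To make the direction and significance criteria concrete, the following minimal sketch (Python, with illustrative names and thresholds; it is not the project's analysis code) classifies a single new test against an original finding:

```python
# Illustrative sketch: classify a reproduction or generalizability test against an
# original finding using the criteria described above -- same direction of effect,
# and statistical significance at P < 0.05 in that direction.
from dataclasses import dataclass

@dataclass
class TestResult:
    estimate: float   # signed effect estimate (e.g., regression coefficient)
    p_value: float    # two-sided P value for that estimate

def same_direction(original: TestResult, new: TestResult) -> bool:
    """True if the new estimate has the same sign as the original estimate."""
    return (original.estimate > 0) == (new.estimate > 0)

def supports_original(original: TestResult, new: TestResult, alpha: float = 0.05) -> bool:
    """True if the new test is significant at alpha AND in the original direction.
    For originally nonsignificant findings the criterion is instead a nonsignificant
    new result, as described in the text."""
    return new.p_value < alpha and same_direction(original, new)

# Example: an original positive coefficient and a weaker but significant new estimate.
original = TestResult(estimate=0.42, p_value=0.003)
time_extension = TestResult(estimate=0.18, p_value=0.021)
print(same_direction(original, time_extension))     # True
print(supports_original(original, time_extension))  # True
```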
P value thresholds are arbitrary and can be misleading. Nonsignificant effects are not necessarily nonexistent; they simply do not meet the cutoff for supporting the prediction. Further, the power of the new tests limits the generalizability rate. There are two types of effect sizes for 15 of 29 findings for which we are able to conduct sensitivity power analyses (SI Appendix, Table S7-25).
Table 1. Overview of focal findings examined in the generalizability initiative

No. | Focal effect | New span of years and/or geography
1 | An inverted U shape between a region's formal institutional diversity and the likelihood of MNEs to enter a country within this region | Time: 1996–2001, 2008–2010
2 | A negative relationship between the statutory tax rate of a country and the probability of locating a plant in that country | Time: 1979–1989, 2000–2010
3 | An inverted U-shaped curve between a firm's number of prior foreign subsidiaries and its number of subsequent foreign subsidiaries in a country | Time: 1995–2010
4 | A positive relationship between the timing of a subsidiary entering a market and the profitability of the subsidiary | Time: 1987–2001; geography: India, South Korea, SE Asian countries
5 | An inverted U shape between the number of the subsidiaries of other MNEs in a host country and the likelihood of setting a subsidiary by an MNE in the same host country | Time: 1978–1989, 2000–2009
6 | A positive relationship between a foreign investing firm's assets specificity and that firm's ownership position in its foreign investment | Time: 1989, 1992, 1996, 1999; geography: China mainland, Taiwan, South Korea, etc.
7 | A positive relationship between a multinational firm's intangible assets and the survival chances of the firm's foreign subsidiaries | Time: 1982–1991, 1989–1998
8 | A positive relationship between the percentage equity ownership and the use of expatriates | Time: 1992, 1995, 1999; geography: Brazil, European countries, SE Asian countries, etc.
9 | A negative relationship between a country's political hazards and the probability of locating a plant in that country | Time: 1983–1989, 1988–1994, 1992–1998
10 | A moderating effect (weakening) of a firm's experience in politically hazardous countries on the negative relationship between a country's political hazards and the rates of FDI entry into that country | Time: 1970–1989, 1962–1980, 1962–1989
11 | A positive relationship between timing of foreign market entry and subsidiary survival | Time: 1981–1994
12 | A negative relationship between foreign equity ownership and the mortality of the subsidiary | Time: 1998–2009
13 | An inverted U relationship between expatriate deployment and IJV performance | Time: 2000–2010; geography: China
14 | A moderating effect (strengthening) of the ratio of expatriates in a foreign subsidiary on the positive relationship between the level of the parent firm's technological knowledge and the subsidiary's short-term performance | Time: 1994–1999
15 | A positive relationship between the institutional distance between the home country and the host country of a subsidiary and the likelihood of the subsidiary general managers being parent country nationals | Time: 1998, 2000
16 | A negative relationship between the speed of subsequent subsidiary establishment and the performance of the subsidiary | Time: 2001–2010, 1989–2010; geography: India, South Korea, SE Asian countries
17 | A positive relationship between the use of ethnocentric staffing policies as compared with polycentric staffing policies and the performance of the firm's international ventures | Time: 1990, 1992, 1996
18 | A moderating effect (weakening) of exporting activities on the relationship between FDI activities and performance | Time: 1989–2000
19 | A positive relationship between the level of exporting activities and an SME's growth | Time: 1989–2000
20 | A positive relationship between the frequency of using an entry mode in prior entries and its likelihood of using the same entry mode in subsequent entries | Time: 1999–2003; geography: China, South Korea, Brazil, India, SE Asian countries
21 | A positive relationship between a subsidiary's location in Shanghai (economically oriented city) relative to Beijing (politically oriented city) and its survival rate | Time: 1986–2010; geography: Vietnam (Hanoi vs. Ho Chi Minh)
22 | A moderating effect (weakening) of a foreign parent's host country experience on the positive relationship between having a local partner and the joint venture's performance | Time: 1990, 1994; geography: China mainland, South Korea, India
23 | A moderating effect (weakening) of subsidiary age on the relationship between cultural distance and ownership control (or expatriate staffing ratios) | Time: 2010
24 | A positive relationship between the likelihood of joint ventures established by other Japanese firms and the likelihood of entering by joint ventures | Time: 1992, 1994, 1998, 2000
25 | A negative relationship between parent firms' size asymmetry and the IJV's performance and survival | Time: 2001, 2002, 2003
26 | A positive relationship between the difficulty of alliance performance measurement and the likelihood of escalation | Time: 1990–1996, 1996–2002; geography: European countries
27 | A positive relationship between the proliferation of FDI opportunities and the use of IJVs as compared with WOSs | Time: 1985–1993
28 | A moderating effect (strengthening) of a firm's Ricardian rent creation focus on the negative relationship between asset retrenchment and postretrenchment performance | Time: 1986–1991, 1998–2001
29 | A moderating effect (strengthening) of ownership level on the relationship between business relatedness and subsidiary performance | Time: 1994, 1998; geography: India, South Korea, SE Asian countries

Details on research designs and variable operationalizations for each focal effect are in SI Appendix, Table S7-14. FDI, foreign direct investment; IJV, international joint venture; MNE, multinational enterprise; SE Asian, Southeast Asian; SME, small and medium enterprise; WOS, wholly owned subsidiary.
Among all the tests with eta square as the effect size (11 of 29), the effect sizes detectable with 80% power range from close to 0 to 0.0633 (mean = 0.0066; median = 0.0019). Among all the tests with Cox coefficient as the effect size (4 of 29), the effect sizes detectable with 80% power range from -0.6478 to -0.0292 (mean = -0.1402; median = -0.0695).
SI Appendix, Table S7-23 summarizes the power of the associated generalizability tests to detect the effect size from the subset of reproducible original studies, with a mean of 0.66 for the individual generalizability tests and 0.69 for the pooled tests. These power levels should be kept in mind when interpreting the generalizability rates, which will be necessarily imperfect.
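A sensitivity power analysis of this kind can be sketched as follows (Python; the degrees of freedom are hypothetical and the noncentrality convention λ = f²(u + v + 1) follows Cohen, so this is an illustration of the calculation rather than the exact procedure behind SI Appendix, Table S7-25):

```python
# Minimal sketch of a sensitivity power analysis for an F test (assumed setup).
# We solve for the smallest Cohen's f^2 detectable with 80% power, then convert
# it to a (partial) eta-squared via eta^2 = f^2 / (1 + f^2).
from scipy.stats import f as f_dist, ncf
from scipy.optimize import brentq

def power_f_test(f2: float, df_num: int, df_den: int, alpha: float = 0.05) -> float:
    """Power of the F test for Cohen's f^2, using lambda = f^2 * (u + v + 1)."""
    crit = f_dist.ppf(1 - alpha, df_num, df_den)
    lam = f2 * (df_num + df_den + 1)          # noncentrality parameter (Cohen's convention)
    return 1.0 - ncf.cdf(crit, df_num, df_den, lam)

def detectable_eta_squared(df_num: int, df_den: int,
                           power: float = 0.80, alpha: float = 0.05) -> float:
    """Solve for the f^2 that yields the target power; return the implied eta^2."""
    f2 = brentq(lambda x: power_f_test(x, df_num, df_den, alpha) - power, 1e-8, 10.0)
    return f2 / (1.0 + f2)

# Example with hypothetical degrees of freedom (one focal predictor, n of about 4,000):
print(round(detectable_eta_squared(df_num=1, df_den=3990), 5))
```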
Parallel Bayesian analyses assessed whether the effect was supported or contradicted or if the evidence was unclear in each reproduction test, in the aggregated generalizability tests, and leveraging all available data (Table 2 and SI Appendix, Tables S7-19 and S7-20). These statistical criteria were supplemented by a subjective assessment from the project team as to whether the results of the new analyses supported the effect. More detailed descriptions of the analyses related to each effect are available in SI Appendix, supplement 5, and further alternative approaches to calculating reproducibility and generalizability rates are presented in SI Appendix, supplement 7. The code, data, and other supporting information are at https://osf.io/nymev/.
Frequentist Analyses Using the P < 0.05 Criterion. Following on past research (47–55), we likewise find a low absolute reproducibility rate for published findings, even when employing the same analysis on the same data and consulting with the original authors for clarifications and guidance. After corresponding with the original authors, we were ultimately able to directly reproduce 45% of the original set of 29 findings using the same analysis and sampling approach. We believe that one likely contributor is that, lacking access to the original code, we constructed new code based on the methods sections of the published articles (68), and subtle but important details regarding the original specification may have been omitted from the research report. This calls for improved reporting, code and data transparency, and analytic reviews by journals prepublication (35, 96).
Of much greater theoretical interest, 55% of findings (23 of 42) emerged again when tested in a distinct time period from that of the original research, and 40% of findings (4 of 10) proved generalizable to a new national context. It may seem surprising that the cross-temporal generalizability rate was directionally higher than the reproducibility rate, but the two are not directly comparable. Reproducibility is calculated at the paper level (one reproduction test per article), whereas generalizability is at the test level, and a single paper can have multiple time and geographic extension tests. This paper-level vs. finding-level distinction is only
Table 2. Research reliability criteria

No. | Same direction (Repro / Pooled gen / All data) | Statistically significant (Repro / Pooled gen / All data) | Bayesian tests (Repro / Pooled gen / All data) | Subjective assessment
1 | No / No / Yes | No / Yes / Yes | Unclear / Unclear / Unclear | No
2 | Yes / Yes / Yes | No / Yes / Yes | Confirmed / Unclear / Confirmed | Yes
3 | Yes / Yes / Yes | Yes / Yes / Yes | Disconfirmed / Confirmed / Confirmed | Yes
4 | Yes / Yes / Yes | Yes / Yes / Yes | Confirmed / Confirmed / Confirmed | Yes
5 | Yes / Yes / Yes | No / No / No | Confirmed / Confirmed / Confirmed | Yes
6 | Yes / Yes / Yes | Yes / No / No | Unclear / Unclear / Unclear | No
7 | No / No / No | Yes / Yes / Yes | Confirmed / Confirmed / Confirmed | No
8 | Yes / Yes / Yes | Yes / Yes / Yes | Disconfirmed / Disconfirmed / Disconfirmed | No
9 | Yes / Yes / Yes | No / Yes / Yes | Unclear / Confirmed / Confirmed | No
10 | No / Yes / Yes | No / Yes / No | Confirmed / Confirmed / Unclear | No
11 | No / No / No | No / No / No | Confirmed / Confirmed / Confirmed | No
12 | Yes / Yes / Yes | Yes / Yes / Yes | Unclear / Unclear / Unclear | Yes
13 | Yes / No / No | No / No / No | Unclear / Unclear / Confirmed | No
14 | Yes / No / No | No / No / No | Unclear / Unclear / Unclear | No
15 | Yes / Yes / Yes | No / Yes / Yes | Confirmed / Confirmed / Confirmed | Yes
16 | Yes / Yes / Yes | No / Yes / Yes | Confirmed / Confirmed / Disconfirmed | No
17 | Yes / Yes / Yes | Yes / Yes / Yes | Unclear / Unclear / Unclear | Yes
18 | No / No / No | No / No / No | Confirmed / Confirmed / Disconfirmed | No
19 | Yes / Yes / Yes | No / No / No | Unclear / Unclear / Unclear | No
20 | Yes / Yes / Yes | Yes / Yes / Yes | Confirmed / Confirmed / Unclear | Yes
21 | Yes / Yes / Yes | Yes / Yes / Yes | Confirmed / Confirmed / Confirmed | Yes
22 | Yes / No / No | No / No / No | Confirmed / Confirmed / Confirmed | No
23 | Yes / Yes / Yes | Yes / Yes / Yes | Confirmed / Confirmed / Confirmed | Yes
24 | Yes / Yes / Yes | Yes / Yes / Yes | Confirmed / Confirmed / Confirmed | Yes
25 | No / No / No | No / No / Yes | Disconfirmed / Disconfirmed / Disconfirmed | Yes
26 | Yes / Yes / Yes | No / Yes / Yes | Confirmed / Confirmed / Confirmed | Yes
27 | Yes / No / Yes | Yes / No / No | Unclear / Unclear / Unclear | No
28 | No / No / No | Yes / No / No | Confirmed / Disconfirmed / Disconfirmed | Yes
29 | Yes / No / Yes | No / No / No | Unclear / Unclear / Unclear | No

Repro refers to the reproduction test. Pooled gen refers to pooling all time and geographic extension data for a given effect. All data refers to pooling all data used in the reproduction and generalizability tests for an effect. For comparisons of effect direction, yes means the new result and the original effect are in the same direction. For tests of statistical significance, yes means the effect is statistically significant at P < 0.05. Five tests (papers 25 to 29) were nonsignificant in the original report. Confirmed means that the effect is supported from a Bayesian perspective at a Bayes factor greater than three. Disconfirmed means that the effect is contradicted from a Bayesian perspective at a Bayes factor less than 0.33. For the subjective assessment, yes means that the present research team believes that the effect was supported.
one possible explanation for an admittedly surprising pattern of
results. What is clear is that reproducibility does not set an upper
limit on generalizability.
As analyzed in greater depth in SI Appendix, supplement 7, although they are conceptually orthogonal, reproducibility and generalizability are empirically associated (r = 0.50, P < 0.001) (Fig. 1). In a multivariable logistic regression, the odds ratio of generalizing was much greater (e^3.66 = 38.86) if a paper was reproducible (β = 3.66, P = 0.001). For the subset of reproducible findings, the cross-temporal generalizability rate was 84% (16 of 19), and the cross-national generalizability rate was 57% (4 of 7); in contrast, for findings we were unable to directly reproduce, the cross-temporal generalizability was only 30% (7 of 23), and cross-national generalizability was 0% (0 of 3). This suggests that if a strategic management research finding can be obtained once again in the same data, it has an excellent chance of generalizing to other time periods and is also more likely than not to extend to new geographies. Indeed, the generalizability rates for reproducible findings are about as high as could be realistically achieved given the imperfect power of the new tests (SI Appendix, Tables S7-16, S7-23, and S7-25).
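As a quick arithmetic check on the reported association (a restatement of the model output above, not an additional analysis), the odds ratio is simply the exponentiated logistic regression coefficient:

\[
\log\frac{\Pr(\text{generalizes})}{1-\Pr(\text{generalizes})} = \beta_0 + \beta_1\,\mathbb{1}[\text{reproducible}] + \cdots,
\qquad
\mathrm{OR} = e^{\beta_1} = e^{3.66} \approx 38.9 .
\]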
Although speculative, different indices of research reliability may cluster together due to properties of the phenomenon, the research practices of the scientist, or both. Some reliable true positives should be obtainable again in the same data and true across nations and time periods (35). Also, good research practices, like ensuring that one's findings are computationally reproducible, could in turn predict success in replications and extensions by other investigators using new data.
Overall, 35% of findings were both reproducible and generalizable, 45% were neither reproducible nor generalizable, 10% were reproducible but not generalizable, and 10% were generalizable but not reproducible. Thus, in a small subset of cases, the key scientific prediction was supported in a new context (i.e., different time period or nation) but, surprisingly, was not found again in the original data. This suggests that the originally reported results are less reliable than hoped in that they did not reproduce when the same analyses were repeated on the same data. Yet, at the same time, the underlying ideas have merit and find support in other observations. Analogous patterns have emerged in experimental replication projects: for example, when targeted findings fail to replicate in the population in which they are theoretically expected to exist (97) but are obtained in more culturally distant populations (98). This underscores the point that the determinants of research reliability are not yet fully understood (99–102).
[Figure 1: effect size estimates and sample sizes for each numbered finding, shown in three panels by type of estimate (Type 1: eta-squared; Type 2: coefficient; Type 3: hazard/odds ratio), with separate markers for reproductions and generalizability tests at P < .05 vs. P >= .05; see the caption below.]
Fig. 1. Reproductions and generalizability tests for 29 strategic management findings. Results of the generalizability tests initiative are presented separately by type of effect size estimate (eta square, coefficient, hazard or odds ratio). The leftmost column is the numeric indicator for the original finding (1 to 29) (Table 1 has detailed descriptions). The central column depicts the effect size estimates for the reproductions (same data, same analysis) and generalizability tests (different time period and/or geography, same analysis). Generalizability test estimates are based on pooled data across all new tests. Triangles (reproductions) and circles (generalizability tests) are a solid color if the effect was statistically significant at P < 0.05. Findings 25 to 29 were nonsignificant in the original report. The two rightmost columns display the sample sizes for each analysis.
Even an original finding that is a false positive [e.g., due to P hacking (13)] should in principle be reproducible from the same data (35). Thus, "reproducible but not generalizable" sets an overly liberal criterion for context sensitivity, making it even more noteworthy that so few findings fell into this category. To provide a more conservative test, we further separated the subset of 20 of 29 original findings with multiple generalizability tests based on whether all generalizability tests were statistically significant (40%), all generalizability tests were not significant (35%), or some generalizability tests were significant and others were not (25%). Given the limitations of significance thresholds, we quantify the variability of the effect sizes in the generalizability tests using I square, Cochran's Q, and tau square for the same subset of 20 studies (SI Appendix, Table S7-22); 50% of the studies have nonnegligible unexplained heterogeneity (I square > 25%): 15% at the high level (I square > 75%), 15% at the moderate level (50% < I square < 75%), and 20% at the low level (25% < I square < 50%). Taken together, the results argue against massive context sensitivity for this set of archival findings, consistent with the prior results for experiments replicated across different geographic settings (24, 25). At the same time, it should be noted that larger numbers of novel tests of each effect are needed to estimate heterogeneity precisely (25), and thus, more research is needed before drawing strong conclusions on this point.
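For reference, the heterogeneity statistics named above can be computed from per-test estimates and standard errors as in the following sketch (Python, DerSimonian-Laird estimator, made-up inputs; the actual values are reported in SI Appendix, Table S7-22):

```python
# Illustrative sketch of Cochran's Q, tau^2, and I^2 for a set of generalizability
# test estimates. The input values below are made up for demonstration only.
import numpy as np

def heterogeneity(estimates, std_errors):
    """Fixed-effect weights w_i = 1/se_i^2; DerSimonian-Laird tau^2; I^2 = max(0, (Q - df)/Q)."""
    y = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    mu_fe = np.sum(w * y) / np.sum(w)                 # fixed-effect pooled estimate
    q = np.sum(w * (y - mu_fe) ** 2)                  # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                     # DerSimonian-Laird tau^2
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, tau2, i2

# Example with three hypothetical generalizability-test estimates:
q, tau2, i2 = heterogeneity([0.30, 0.12, 0.45], [0.08, 0.10, 0.09])
print(f"Q = {q:.2f}, tau^2 = {tau2:.4f}, I^2 = {i2:.1f}%")
```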
Journal impact factor, University of Texas at Dallas and Financial Times listing of the journal, and article-level citation counts were not significantly correlated with reproducibility or generalizability (SI Appendix, supplement 7; SI Appendix, Table S7-3 has more details). Consistent with past research relying on large samples (103, 104), the present small-sample investigation finds no evidence that traditional indices of academic prestige serve as meaningful signals of the reliability of findings. However, these tests had observed power as low as 0.16 (SI Appendix, Tables S7-5 and S7-8), such that we are only able to provide limited evidence of absence. More systematic investigations are needed regarding the predictors of generalizable research outcomes. Our results are most appropriate for inclusion in a later meta-analysis of the relationships between indicators of research reliability and academic prestige.
Further Research Reliability Criteria. As seen in Table 2, 76% of reproductions and 62% of generalizability tests were in the same direction as the original result aggregating across all new data, 59% of generalizability tests were statistically significant (P < 0.05) aggregating across all new data, and 59% of effects were significant (P < 0.05) leveraging all available data (i.e., from reproductions and generalizability tests combined). Bayesian analyses indicated that 55% of reproductions supported the effect, 10% provided contrary evidence, and 34% were inconclusive. Pooling all generalizability data, 55% of effects were supported; 10% were contradicted; and for 34% of effects, the evidence was unclear. Note that in a number of the above cases, the percentages for different tests match, but the distributions over studies are different. The Bayesian results underscore that, especially given the imperfect power of our tests, failure to reach statistical significance can reflect mixed rather than disconfirmatory evidence. Indeed, only a few original findings were actively contradicted by the results of the present initiative.
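Restating the Bayesian decision rule used in Table 2 in equation form (with BF10 denoting the Bayes factor for the effect over the null; this is a summary of the stated thresholds, not an additional criterion):

\[
\mathrm{BF}_{10} = \frac{p(\text{data}\mid H_1)}{p(\text{data}\mid H_0)},\qquad
\text{verdict} =
\begin{cases}
\text{Confirmed} & \text{if } \mathrm{BF}_{10} > 3,\\
\text{Disconfirmed} & \text{if } \mathrm{BF}_{10} < 1/3,\\
\text{Unclear} & \text{otherwise.}
\end{cases}
\]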
Forecasting Survey. We find a robust and statistically significant relationship between forecasts and observed results of both generalizability tests (β = 0.409, P < 0.001) and the pooled sample of predictions (β = 0.162, P < 0.001). For the forecasts and observed results for direct reproducibility tests, we find a small but positive and significant relationship (β = 0.059, P = 0.010), which is, however, not robust to alternative specifications. In particular, this association is no longer statistically significant when aggregating forecasters' predictions and when excluding certain original results (SI Appendix, supplement 6 has a more detailed description of the robustness tests).
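The type of forecast-outcome association reported here can be illustrated with a simple regression of binary test outcomes on forecasters' probability judgments (Python sketch using simulated data and a linear probability model; the preregistered specification in SI Appendix, supplement 4 may differ):

```python
# Illustrative sketch: associate forecasters' probability judgments with the observed
# binary outcomes of the generalizability tests. Simulated data, not the project's analysis.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
forecasts = rng.uniform(0.2, 0.9, size=200)     # forecasted P(test supports the finding)
outcomes = rng.binomial(1, forecasts)           # observed success (1) or failure (0)

X = sm.add_constant(forecasts)                  # intercept + forecast probability
model = sm.OLS(outcomes, X).fit(cov_type="HC1") # linear probability model, robust SEs
print(model.params[1], model.pvalues[1])        # slope (beta) and its P value
```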
In addition, forecasters were significantly more accurate at anticipating generalizability relative to reproducibility (mean of the differences = 0.092, P < 0.001). The overall generalizability rate predicted by the crowd of forecasters (57%) was comparable with the observed generalizability rate for the subset of findings included in the forecasting survey (55%), with no significant difference (z = 0.236, P = 0.813). However, the forecasted reproducibility rate (71%) was significantly higher than the observed reproducibility rate (45%; z = 2.729, P = 0.006). Whether a finding will emerge again when the same analyses are repeated on the same data may be challenging to predict since this is contingent on unobservable behaviors from the original researchers, such as annotated code, data archiving, and questionable research practices. Theories about whether a finding is true or not may be less useful since even false positives should in principle be reproducible. In contrast, predictions regarding generalizability may rely primarily on theories about the phenomenon in question. SI Appendix, supplement 6 contains a more detailed report of the results of the forecasting survey.
Discussion
The present initiative leveraged a longitudinal database to examine if a set of archival findings generalizes to different time periods and geographies from the original investigation. Providing a systematic assessment of research generalizability for an area of scientific inquiry is the primary contribution of this 6-year-long metascientific initiative. In our frequentist analyses using the P < 0.05 criterion for statistical significance, 55% of the original findings regarding strategic decisions by corporations extended to alternative time periods, and 40% extended to separate geographic areas.
In the accompanying direct reproductions, 45% of findings emerged again using the same analyses and observations as in the original report. One potential reason the reproducibility rate is directionally lower than the generalizability rate is that the former is at the paper level and the latter is at the test level; regardless, because of this, they are not directly comparable. More meaningfully, reproducibility was empirically correlated with generalizability; of the directly reproducible findings, 84% generalized to other time periods and 57% generalized to other nations and territories. In a forecasting survey, scientists proved overly optimistic about direct reproducibility, predicting a reproducibility rate of 71%, yet were accurate about cross-temporal generalizability, anticipating a success rate of 57% that closely aligned with the realized results.
Although an initial investigation, our research suggests that a substantial number of findings from archival datasets, particularly those that are statistically reliable (i.e., reproducible) to begin with (68), may in fact generalize to other settings (62). Overall, only limited evidence of context sensitivity emerged. The project conclusions were robust to the use of different approaches to quantifying context sensitivity and a suite of frequentist and Bayesian criteria for research reliability. Findings that hold more broadly can serve as building blocks for general theories and also, as heuristic guides for practitioners (22–24). Of course, other empirical patterns can be circumscribed based
on time period, geography, or both. In such cases, additional auxiliary assumptions (105–107) may be needed to specify the moderating conditions in which the original theoretical predictions hold and do not hold (35, 70–73).
Building on this and other recent investigations (28, 62, 84), more research is needed that repeats archival analyses in alternative time periods, populations, and geographies whenever feasible. Recent years have witnessed an increased emphasis on repeating behavioral experiments in new contexts (10, 23, 24, 32–34). Such empirical initiatives are needed for archival research in management, sociology, economics, and other fields (27, 62, 66, 67), such as the ongoing Systematizing Confidence in Open Research and Evidence project (100–102) and the newly launched Institute for Replication (https://i4replication.org/) that focuses on economics and political science. This moves the question of the reliability of archival findings beyond whether the results can be reproduced using the same code and data (49, 68) or survive alternative analytic approaches (60, 81). Rather, generalizability tests seek to extend the theory to novel contexts. Even when an attempt to generalize fails, the individual and collective wisdom of the scientific community can be put to work revising theoretical assumptions and, in some cases, identifying meaningful moderators for further empirical testing (108).
Data Availability. Code, data, and other supporting information have been
deposited on the Open Science Framework (https://osf.io/t987n/) (109).
ACKNOWLEDGMENTS. This research project benefitted from Ministry of Education (Singapore) Tier 1 Grant R-313-000-131-115 (to A. Delios), National Science Foundation of China Grants 72002158 (to H.T.) and 71810107002 (to H.T.), grants from the Knut and Alice Wallenberg Foundation (to A. Dreber) and the Marianne and Marcus Wallenberg Foundation (through a Wallenberg Scholar grant; to A. Dreber), Austrian Science Fund (FWF) Grant SFB F63 (to A. Dreber), grants from the Jan Wallander and Tom Hedelius Foundation (Svenska Handelsbankens Forskningsstiftelser; to A. Dreber), and a Research & Development (R&D) research grant from Institut Européen d'Administration des Affaires (INSEAD) (to E.L.U.). Dmitrii Dubrov, of the G.T.F.C., was supported by the National Research University Higher School of Economics (HSE University) Basic Research Program.
Author affiliations: ^a Department of Strategy and Policy, National University of Singapore, 119245 Singapore; ^b Department of Economics, Stockholm School of Economics, Stockholm, 113 83 Sweden; ^c School of Management and Economics and Shenzhen Finance Institute, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen 518000, China; ^d Advanced Institute of Business, Tongji University, Shanghai 200092, China; ^e School of Management, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; ^f New Zealand Institute for Advanced Study, Massey University, Auckland 0745, New Zealand; ^g Global Indicators Department, Development Economics Vice Presidency, World Bank Group, Washington, DC 20433, USA; ^h Department of Economics, University of Innsbruck, 6020 Innsbruck, Austria; and ^i Department of Organizational Behaviour, Institut Européen d'Administration des Affaires (INSEAD), Singapore 138676
1. H. Aguinis, W. F. Cascio, R. S. Ramani, Science's reproducibility and replicability crisis: International business is not immune. J. Int. Bus. Stud. 48, 653–663 (2017).
2. M. Baker, First results from psychology's largest reproducibility test: Crowd-sourced effort raises nuanced questions about what counts as replication. Nature, 10.1038/nature.2015.17433 (2015).
3. D. D. Bergh, B. M. Sharp, H. Aguinis, M. Li, Is there a credibility crisis in strategic management research? Evidence on the reproducibility of study findings. Strateg. Organ. 15, 423–436 (2017).
4. J. Bohannon, Psychology. Replication effort provokes praise – and 'bullying' charges. Science 344, 788–789 (2014).
5. F. A. Bosco, H. Aguinis, J. G. Field, C. A. Pierce, D. R. Dalton, HARKing's threat to organizational research: Evidence from primary and meta-analytic sources. Person. Psychol. 69, 709–750 (2016).
6. G. Francis, Replication, statistical consistency, and publication bias. J. Math. Psychol. 57, 153–169 (2013).
7. A. Gelman, E. Loken, The statistical crisis in science. Am. Sci. 102, 460–465 (2014).
8. K. Hou, C. Xue, L. Zhang, Replicating anomalies. Rev. Financ. Stud. 33, 2019–2133 (2020).
9. K. R. Murphy, H. Aguinis, HARKing: How badly can cherry picking and question trolling produce bias in published results? J. Bus. Psychol. 34, 1–17 (2019).
10. B. A. Nosek et al., Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73, 719–748 (2021).
11. R. A. Zwaan, A. Etz, R. E. Lucas, M. B. Donnellan, Making replication mainstream. Behav. Brain Sci. 41, e120 (2017).
12. J. P. Ioannidis, Why most published research findings are false. PLoS Med. 2, e124 (2005).
13. J. P. Simmons, L. D. Nelson, U. Simonsohn, False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011).
14. D. T. Gilbert, G. King, S. Pettigrew, T. D. Wilson, Comment on "Estimating the reproducibility of psychological science." Science 351, 1037 (2016).
15. M. Ramscar, Learning and the replicability of priming effects. Curr. Opin. Psychol. 12, 80–84 (2016).
16. N. Schwarz, F. Strack, Does merely going through the same moves make for a "direct" replication? Concepts, contexts, and operationalizations. Soc. Psychol. 45, 305–306 (2014).
17. W. Stroebe, F. Strack, The alleged crisis and the illusion of exact replication. Perspect. Psychol. Sci. 9, 59–71 (2014).
18. J. J. Van Bavel, P. Mende-Siedlecki, W. J. Brady, D. A. Reinero, Contextual sensitivity in scientific reproducibility. Proc. Natl. Acad. Sci. U.S.A. 113, 6454–6459 (2016).
19. Y. Inbar, Association between contextual dependence and replicability in psychology may be spurious. Proc. Natl. Acad. Sci. U.S.A. 113, E4933–E4934 (2016).
20. D. J. Simons, The value of direct replication. Perspect. Psychol. Sci. 9, 76–80 (2014).
21. D. J. Simons, Y. Shoda, D. S. Lindsay, Constraints on generality (COG): A proposed addition to all empirical papers. Perspect. Psychol. Sci. 12, 1123–1128 (2017).
22. C. R. Ebersole et al., Many Labs 3: Evaluating participant pool quality across the academic semester via replication. J. Exp. Soc. Psychol. 67, 68–82 (2016).
23. R. A. Klein et al., Investigating variation in replicability: A "many labs" replication project. Soc. Psychol. 45, 142–152 (2014).
24. R. A. Klein et al., Many Labs 2: Investigating variation in replicability across sample and setting. Adv. Methods Pract. Psychol. Sci. 1, 443–490 (2018).
25. A. Olsson-Collentine, J. M. Wicherts, M. A. L. M. van Assen, Heterogeneity in direct replications in psychology and its association with effect size. Psychol. Bull. 146, 922–940 (2020).
26. C. R. Ebersole et al., Many Labs 5: Testing pre-data-collection peer review as an intervention to increase replicability. Adv. Methods Pract. Psychol. Sci. 3, 309–331 (2020).
27. J. Freese, D. Peterson, Replication in social science. Annu. Rev. Sociol. 43, 147–165 (2017).
28. C. J. Soto, Do links between personality and life outcomes generalize? Testing the robustness of trait-outcome associations across gender, age, ethnicity, and analytic approaches. Soc. Psychol. Personal. Sci. 12, 118–130 (2021).
29. T. D. Stanley, E. C. Carter, H. Doucouliagos, What meta-analyses reveal about the replicability of psychological research. Psychol. Bull. 144, 1325–1346 (2018).
30. J. E. McGrath, "Dilemmatics: The study of research choices and dilemmas" in Judgment Calls in Research, J. E. McGrath, R. A. Kulka, Eds. (Sage, New York, NY, 1982), pp. 179–210.
31. T. D. Cook, D. T. Campbell, Quasi-Experimentation: Design and Analysis Issues for Field Settings (Houghton Mifflin, Boston, MA, 1979).
32. C. F. Camerer et al., Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436 (2016).
33. C. F. Camerer et al., Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644 (2018).
34. Open Science Collaboration, Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
35. National Academies of Sciences, Engineering, and Medicine, Reproducibility and Replicability in Science (The National Academies Press, Washington, DC, 2019).
36. J. D. Angrist, J. S. Pischke, The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. J. Econ. Perspect. 24, 3–30 (2010).
37. A. Brodeur, N. Cook, A. Heyes, Methods matter: P-hacking and publication bias in causal analysis in economics. Am. Econ. Rev. 110, 3634–3660 (2020).
38. G. Christensen, E. Miguel, Transparency, reproducibility, and the credibility of economics research. J. Econ. Lit. 56, 920–980 (2018).
39. E. E. Leamer, Let's take the con out of econometrics. Am. Econ. Rev. 73, 31–43 (1983).
40. J. Cochrane, Secret data (2015). johnhcochrane.blogspot.co.uk/2015/12/secret-data.html?m=1.
Accessed 17 May 2022.
41. E. Gibney, R. Van Noorden, Scientists losing data at a rapid rate. Nature, 10.1038/nature.2013.14416 (2013).
42. T. E. Hardwicke, J. P. A. Ioannidis, Populating the Data Ark: An attempt to retrieve, preserve, and
liberate data from the most highly-cited psychology and psychiatry articles. PLoS One 13,
e0201856 (2018).
43. W. Vanpaemel, M. Vermorgen, L. Deriemaecker, G. Storms, Are we wasting a good crisis? The
availability of psychological research data after the storm. Collabra 1,15 (2015).
44. J. M. Wicherts, D. Borsboom, J. Kats, D. Molenaar, The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726–728 (2006).
45. R. P. Womack, Research data in core journals in biology, chemistry, mathematics, and physics. PLoS One 10, e0143460 (2015).
46. C. Young, A. Horvath, Sociologists need to be better at replication (2015). https://orgtheory.wordpress.com/2015/08/11/sociologists-need-to-be-better-at-replication-a-guest-post-by-cristobal-young/. Accessed 17 May 2022.
47. A. C. Chang, P. Li, "Is economics research replicable? Sixty published papers from thirteen journals say 'usually not'" (Finance and Economics Discussion Series 2015-083, Board of Governors of the Federal Reserve System, Washington, DC, 2015).
48. N. Janz, Leading journal verifies articles before replication – so far, all replications failed (2015). https://politicalsciencereplication.wordpress.com/2015/05/04/leading-journal-verifies-articles-before-publication-so-far-all-replications-failed/. Accessed 17 May 2022.
49. B. D. McCullough, K. A. McGeary, T. D. Harrison, Lessons from the JMCB archive. J. Money Credit Bank. 38, 1093–1107 (2006).
50. R. Minocher, S. Atmaca, C. Bavero, B. Beheim, Reproducibility of social learning research declines exponentially over 63 years of publication. PsyArXiv [Preprint] (2020). https://psyarxiv.com/4nzc7/ (Accessed 17 May 2022).
51. R. L. Andrew et al., Assessing the reproducibility of discriminant function analyses. PeerJ 3, e1137 (2015).
52. K. J. Gilbert et al., Recommendations for utilizing and reporting population genetic analyses: The reproducibility of genetic clustering using the program STRUCTURE. Mol. Ecol. 21, 4925–4930 (2012).
53. J. P. Ioannidis et al., Repeatability of published microarray gene expression analyses. Nat. Genet. 41, 149–155 (2009).
54. J. H. Stagge et al., Assessing data availability and research reproducibility in hydrology and water
resources. Sci. Data 6, 190030 (2019).
55. V. Stodden, J. Seiler, Z. Ma, An empirical analysis of journal policy effectiveness for computational reproducibility. Proc. Natl. Acad. Sci. U.S.A. 115, 2584–2589 (2018).
56. J. A. Bastiaansen et al., Time to get personal? The impact of researchers' choices on the selection of treatment targets using the experience sampling methodology. J. Psychosom. Res. 137, 110211 (2020).
57. N. Breznau, E. M. Rinke, A. Wuttke, Replication data for "Inside irredentism: A global empirical analysis." Harvard Dataverse. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X88LYH. Accessed 17 May 2022.
58. M. Schweinsberg et al., Radical dispersion of effect size estimates when independent scientists operationalize and test the same hypothesis with the same data. Organ. Behav. Hum. Decis. Process. 165, 228–249 (2021).
59. R. Silberzahn, E. L. Uhlmann, Crowdsourced research: Many hands make tight work. Nature 526, 189–191 (2015).
60. R. Silberzahn et al., Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Adv. Methods Pract. Psychol. Sci. 1, 337–356 (2018).
61. M. Baker, Muddled meanings hamper efforts to fix reproducibility crisis: Researchers tease out different definitions of a crucial scientific term. Nature, 10.1038/nature.2016.20076 (2016).
62. R. A. Bettis, C. E. Helfat, J. M. Shaver, The necessity, logic, and forms of replication. Strateg. Manage. J. 37, 2193–2203 (2016).
63. S. N. Goodman, D. Fanelli, J. P. A. Ioannidis, What does research reproducibility mean? Sci.
Transl. Med. 8, 341ps12 (2016).
64. E. P. LeBel, R. McCarthy, B. Earp, M. Elson, W. Vanpaemel, A unified framework to quantify the credibility of scientific findings. Adv. Methods Pract. Psychol. Sci. 1, 389–402 (2018).
65. D. T. Lykken, Statistical significance in psychological research. Psychol. Bull. 70, 151–159 (1968).
66. E. W. K. Tsang, K. M. Kwan, Replication and theory development in organizational science: A critical realist perspective. Acad. Manage. Rev. 24, 759–780 (1999).
67. M. Clemens, The meaning of failed replications: A review and proposal. J. Econ. Surv. 31, 326–342 (2015).
68. J. M. Hofman et al., Expanding the scope of reproducibility research through data analysis replications. Organ. Behav. Hum. Decis. Process. 164, 192–202 (2021).
69. A. Norenzayan, S. J. Heine, Psychological universals: What are they and how can we know? Psychol. Bull. 131, 763–784 (2005).
70. N. Cartwright, The Dappled World: A Study in the Boundaries of Science (Cambridge University
Press, Cambridge, United Kingdom, 1999).
71. W. J. McGuire, The yin and yang of progress in social psychology: Seven koan. J. Pers. Soc. Psychol. 26, 446–456 (1973).
72. W. J. McGuire, "A contextualist theory of knowledge: Its implications for innovations and reform in psychological research" in Advances in Experimental Social Psychology, L. Berkowitz, Ed. (Academic Press, Cambridge, MA, 1983), vol. 16, pp. 1–47.
73. H. A. Walker, B. P. Cohen, Scope statements: Imperatives for evaluating theory. Am. Sociol. Rev. 50, 288–301 (1985).
74. A. Dreber et al., Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl. Acad. Sci. U.S.A. 112, 15343–15347 (2015).
75. J. Henrich, Does culture matter in economic behavior? Ultimatum game bargaining among the Machiguenga of the Peruvian Amazon. Am. Econ. Rev. 90, 973–979 (2000).
76. L. A. Dau, G. D. Santangelo, A. van Witteloostuijn, Replication studies in international business. J. Int. Bus. Stud. 53, 215–230 (2022).
77. R. Botvinik-Nezer et al., Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
78. A. Orben, A. K. Przybylski, The association between adolescent well-being and digital technology use. Nat. Hum. Behav. 3, 173–182 (2019).
79. U. Simonsohn, J. P. Simmons, L. D. Nelson, Specification curve analysis. Nat. Hum. Behav. 4, 1208–1214 (2020).
80. D. Smerdon, H. Hu, A. McLennan, W. von Hippel, S. Albrecht, Female chess players show typical stereotype-threat effects: Commentary on Stafford. Psychol. Sci. 31, 756–759 (2020).
81. S. Steegen, F. Tuerlinckx, A. Gelman, W. Vanpaemel, Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11, 702–712 (2016).
82. J. Muñoz, C. Young, We ran 9 billion regressions: Eliminating false positives through computational model robustness. Sociol. Methodol. 48, 1–33 (2018).
83. C. J. Soto, How replicable are links between personality traits and consequential life outcomes? The life outcomes of personality replication project. Psychol. Sci. 30, 711–727 (2019).
84. M. K. Forbes, A. G. C. Wright, K. E. Markon, R. F. Krueger, Evidence that psychopathology symptom networks have limited replicability. J. Abnorm. Psychol. 126, 969–988 (2017).
85. D. Borsboom et al., False alarm? A comprehensive reanalysis of "Evidence that psychopathology symptom networks have limited replicability" by Forbes, Wright, Markon, and Krueger (2017). J. Abnorm. Psychol. 126, 989–999 (2017).
86. M. K. Forbes, A. G. Wright, K. E. Markon, R. F. Krueger, Quantifying the reliability and replicability of psychopathology network characteristics. Multivariate Behav. Res. 56, 224–242 (2019).
87. S. DellaVigna, D. G. Pope, Predicting experimental results: Who knows what? J. Polit. Econ. 126, 2410–2456 (2018).
88. O. Eitan et al., Is scientific research politically biased? Systematic empirical tests and a forecasting tournament to address the controversy. J. Exp. Soc. Psychol. 79, 188–199 (2018).
89. E. Forsell et al., Predicting replication outcomes in the Many Labs 2 study. J. Econ. Psychol. 75,
102117 (2019).
90. M. Gordon, D. Viganola, A. Dreber, M. Johannesson, T. Pfeiffer, Predicting replicability-Analysis of
survey and prediction market data from large-scale forecasting projects. PLoS One 16, e0248780
(2021).
91. J. F. Landy et al., Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychol. Bull. 146, 451–479 (2020).
92. C. L. S. Veldkamp, M. B. Nuijten, L. Dominguez-Alvarez, M. A. L. M. Van Assen, J. M. Wicherts, Statistical reporting errors and collaboration on statistical analyses in psychological science. PLoS One 9, e114876 (2014).
93. W. Tierney et al., A creative destruction approach to replication: Implicit work and sex morality
across cultures. J. Exp. Soc. Psychol. 93, 104060 (2021).
94. M. Schweinsberg et al., The pipeline project: Pre-publication independent replications of a single laboratory's research pipeline. J. Exp. Soc. Psychol. 66, 55–67 (2016).
95. J. Verhagen, E. J. Wagenmakers, Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol. Gen. 143, 1457–1475 (2014).
96. J. Sakaluk, A. Williams, M. Biernat, Analytic review as a solution to the problem of misreporting statistical results in psychological science. Perspect. Psychol. Sci. 9, 652–660 (2014).
97. A. Moon, S. S. Roeder, A secondary replication attempt of stereotype susceptibility (Shih, Pittinsky, & Ambady, 1999). Soc. Psychol. 45, 199–201 (2014).
98. C. E. Gibson, J. Losee, C. Vitiello, A replication attempt of stereotype susceptibility (Shih, Pittinsky, & Ambady, 1999): Identity salience and shifts in quantitative performance. Soc. Psychol. 45, 194–198 (2014).
99. A. Altmejd et al., Predicting the replicability of social science lab experiments. PLoS One 14,
e0225826 (2019).
100. M. Gordon et al., Are replication rates the same across academic fields? Community forecasts from the DARPA SCORE programme. R. Soc. Open Sci. 7, 200566 (2020).
101. D. Viganola et al., Using prediction markets to predict the outcomes in the Defense Advanced Research Projects Agency's next-generation social science programme. R. Soc. Open Sci. 8, 181308 (2021).
102. N. Alipourfard et al., Systematizing confidence in open research and evidence (SCORE). SocArXiv [Preprint] (2021). https://osf.io/preprints/socarxiv/46mnb/ (Accessed 17 May 2022).
103. B. Brembs, Prestigious science journals struggle to reach even average reliability. Front. Hum.
Neurosci. 12, 37 (2018).
104. U. Schimmack, Journal replicability rankings (2018). https://replicationindex.com/2018/12/29/2018-replicability-rankings/. Accessed 17 May 2022.
105. T. S. Kuhn, The Structure of Scientific Revolutions (University of Chicago Press, Chicago, IL, ed. 1, 1962).
106. I. Lakatos, "Falsification and the methodology of scientific research programmes" in Can Theories Be Refuted?, S. G. Harding, Ed. (Synthese Library, Springer, Dordrecht, the Netherlands, 1976), vol. 81, pp. 205–259.
107. K. Popper, The Logic of Scientific Discovery (Routledge, London, United Kingdom, 2002).
108. H. Aguinis, R. S. Ramani, W. F. Cascio, Methodological practices in international business research: An after-action review of challenges and solutions. J. Int. Bus. Stud. 51, 1593–1608 (2020).
109. A. Delios et al., Examining the generalizability of research findings from archival data. Open Science Framework. https://osf.io/t987n/. Deposited 11 November 2020.