Guidance for Multi-Analyst Studies
Authors:
Balazs Aczel1, Barnabas Szaszi1, Gustav Nilsonne2,3, Olmo R. van den Akker4, Casper J.
Albers5, Marcel A. L. M. van Assen4,6, Jojanneke A. Bastiaansen7,8, Dan Benjamin9,10, Udo
Boehm11, Rotem Botvinik-Nezer12, Laura F. Bringmann5, Niko A. Busch13, Emmanuel
Caruyer14, Andrea M. Cataldo15,16, Nelson Cowan17, Andrew Delios18, Noah N. N. van
Dongen11, Chris Donkin19, Johnny B. van Doorn11, Anna Dreber20,21, Gilles Dutilh22, Gary F.
Egan23, Morton Ann Gernsbacher24, Rink Hoekstra5, Sabine Hoffmann25, Felix
Holzmeister21, Juergen Huber21, Magnus Johannesson20, Kai J. Jonas26, Alexander T.
Kindel27, Michael Kirchler21, Yoram K. Kunkels7, D. Stephen Lindsay28, Jan-Francois
Mangin29,30, Dora Matzke11, Marcus R. Munafò31, Ben R. Newell19, Brian A. Nosek32,33,
Russell A. Poldrack34, Don van Ravenzwaaij5, Jörg Rieskamp35, Matthew J. Salganik27,
Alexandra Sarafoglou11, Tom Schonberg36, Martin Schweinsberg37, David Shanks38, Raphael
Silberzahn39, Daniel J. Simons40, Barbara A. Spellman33, Samuel St-Jean41,42, Jeffrey J.
Starns43, Eric L. Uhlmann44, Jelte Wicherts4, Eric-Jan Wagenmakers11
Affiliations:
1ELTE, Eotvos Lorand University, Budapest, Hungary, 2Karolinska Institutet, Stockholm,
Sweden, 3Stockholm University, Stockholm, Sweden, 4Tilburg University, Tilburg, The
Netherlands, 5University of Groningen, Groningen, The Netherlands, 6Utrecht University,
Utrecht, The Netherlands, 7University of Groningen, University Medical Center Groningen,
Groningen, The Netherlands, 8Friesland Mental Health Care Services, Leeuwarden, The
Netherlands, 9University of California Los Angeles, Los Angeles, CA, USA, 10National
Bureau of Economic Research, Cambridge, MA, USA, 11University of Amsterdam,
Amsterdam, The Netherlands, 12Dartmouth College, Hanover, NH, USA, 13University of
Münster, Münster, Germany, 14University of Rennes, CNRS, Inria, Inserm, Rennes, France,
15McLean Hospital, Belmont, MA, USA, 16Harvard Medical School, Boston, MA, USA,
17Department of Psychological Sciences, University of Missouri, MO, USA, 18National
University of Singapore, Singapore, 19University of New South Wales, Sydney, Australia,
20Stockholm School of Economics, Stockholm, Sweden, 21University of Innsbruck,
Innsbruck, Austria, 22University Hospital Basel, Basel, Switzerland, 23Monash University,
Melbourne, Victoria, Australia, 24University of Wisconsin-Madison, Madison, WI, USA,
25Ludwig-Maximilians-University, Munich, Germany, 26Maastricht University, Maastricht,
The Netherlands, 27Princeton University, Princeton, NJ, USA, 28University of Victoria,
Victoria, Canada, 29Université Paris-Saclay, Paris, France, 30Neurospin, CEA, France,
31University of Bristol, Bristol, UK, 32Center for Open Science, USA, 33University of
Virginia, Charlottesville, USA, 34Stanford University, Stanford, USA, 35University of Basel,
Basel, Switzerland, 36Tel Aviv University, Tel Aviv, Israel, 37ESMT Berlin, Germany,
38University College London, London, UK, 39University of Sussex, Brighton, UK,
40University of Illinois at Urbana-Champaign, USA, 41University of Alberta, Edmonton,
Canada, 42Lund University, Lund, Sweden, 43University of Massachusetts Amherst, USA,
44INSEAD, Singapore
*Correspondence: aczel.balazs@ppk.elte.hu and szaszi.barnabas@ppk.elte.hu
Standfirst:
We present consensus-based guidance for conducting and documenting multi-analyst
studies. We discuss why broader adoption of the multi-analyst approach will strengthen the
robustness of results and conclusions in empirical sciences.
The Unknown Fragility of Reported Conclusions
Typically, empirical results hinge on analytical choices made by a single data analyst
or team of authors, with limited independent, external input. This makes it uncertain whether
the reported conclusions are robust to justifiable alternative analytical strategies (Fig.1).
Studies in the social and behavioural sciences lend themselves to a multitude of justifiable
analyses. Empirical investigations require many analytical decisions, and the underlying
theoretical framework rarely imposes strong restrictions on how the data should be
preprocessed and modelled.
Fig.1 Example of a reported sequence of analysis choices (black line, leading to conclusion
A) shown as a subset of alternative plausible analysis paths (grey lines). In the left panel, all
plausible paths support conclusion A; in the right panel, most plausible paths support
conclusion B. This illustrates that without reporting the outcomes from alternative paths, it
remains unknown whether or not the conclusion is robust to justifiable alternative analytical
strategies.
As an example, the journal Surgery published two articles1,2 a few months apart that
used the same dataset and answered the same question: Does the use of a retrieval bag during
laparoscopic appendectomy reduce surgical site infections? Two reasonable, but different,
analyses were applied (with notable differences in analytical choices including inclusion and
exclusion criteria, outcome measures, sample sizes, and covariates). As a result of the
different analytical choices, the two articles reached opposite conclusions, one finding that
the use of a retrieval bag reduced infections, and the other that it did not3. This example
illustrates how independent analysis of the same data (in this case, unplanned) can reach
different, yet justifiable conclusions.
In this article, we describe how a multi-analyst approach can evaluate the impact of
alternative analyses on reported results and conclusions. In addition, we provide consensus-based guidance to help researchers prepare, conduct, and report multi-analyst studies.
Exploring Analytical Robustness with the Multi-Analyst Method
The robustness of results and conclusions can be studied by evaluating many distinct
analysis options simultaneously (e.g., vibration of effects4 or multiverse analysis5) or by
involving multiple analysts who independently analyse the same data6–13. Rather than
exhaustively evaluating all plausible analyses, the multi-analyst method examines analytical
choices that are deemed most appropriate by independent analysts.
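To make the contrast concrete, the following minimal Python sketch (not part of the original guidance; the data and the set of analysis choices are hypothetical) enumerates a handful of justifiable specifications for a single question, in the spirit of a multiverse analysis, and checks whether the estimated effect keeps the same sign across them. In a multi-analyst study, the specifications would instead be chosen independently by the co-analysts.

```python
import itertools
import random
import statistics

# Hypothetical illustration: one research question ("is x positively related
# to y?") admits several justifiable analysis choices. A multiverse-style
# check runs them all; a multi-analyst study would let independent analysts
# pick the choice they each deem most appropriate.

random.seed(2021)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.15 * xi + random.gauss(0, 1) for xi in x]
y[0], y[1] = 6.0, -5.5  # two extreme observations

def pearson(a, b):
    """Plain Pearson correlation, computed from scratch."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

def ranks(values):
    """Convert values to ranks (no tie handling; fine for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

outlier_rules = {"keep all": float("inf"), "drop |y| > 4": 4.0, "drop |y| > 3": 3.0}
transforms = {"raw values": lambda a: a, "rank-transform": ranks}

estimates = []
for (rule, cutoff), (tname, tfun) in itertools.product(
        outlier_rules.items(), transforms.items()):
    kept = [(xi, yi) for xi, yi in zip(x, y) if abs(yi) <= cutoff]
    xs, ys = zip(*kept)
    est = pearson(tfun(list(xs)), tfun(list(ys)))
    estimates.append(est)
    print(f"{rule:14s} + {tname:14s} -> r = {est:+.3f}")

same_sign = all(e > 0 for e in estimates) or all(e < 0 for e in estimates)
print("Conclusion robust across these specifications:", same_sign)
```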
Botvinik-Nezer et al.10, for example, asked 70 teams to test the same hypotheses using
the same functional magnetic resonance imaging dataset. They found that no two teams
followed the same data preprocessing steps or analysis strategies in their analyses, resulting
in substantial variability in their conclusions. This and other multi-analyst initiatives6–13
highlight how findings can vary depending on the judgment of the analyst.
Use and Benefits of the Multi-Analyst Method
Although the multi-analyst approach is new to many researchers, it has been in use
since the 19th century. A prominent example is the cuneiform competition14, which may be
viewed as a precursor to the modern multi-analyst method. In 1857, the Royal Asiatic Society
asked four scholars to independently translate a previously unseen inscription to verify that
the ancient Assyrian language had been deciphered correctly. The almost perfect overlap
between the solutions indicated that “they have Truth for their basis” (p. 4)14.
The central idea from this cuneiform competition can be applied to 21st century data
analysis with several benefits (Box 1). With even a few co-analysts, the multi-analyst
approach can be informative about the robustness of results and conclusions. When the
results of independent data analyses converge, more confidence in the conclusions is
warranted. When the results diverge, confidence falters, and scientists can examine the
reasons for these discrepancies. With enough co-analysts, it is possible to estimate the
variability among analysis strategies and identify factors explaining this variability.
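As a hypothetical illustration of this last point (again not part of the original guidance), the short Python sketch below summarises invented effect estimates and conclusions from six co-analysts, reporting the spread of the estimates and the share of analysts supporting each conclusion, which is the kind of descriptive summary a lead team might compute.

```python
import statistics
from collections import Counter

# Invented example input: each tuple is (co-analyst ID, standardized effect
# estimate, stated conclusion). In a real project these would come from the
# co-analysts' submitted reports.
submissions = [
    ("analyst_01", 0.21, "effect present"),
    ("analyst_02", 0.35, "effect present"),
    ("analyst_03", 0.05, "inconclusive"),
    ("analyst_04", 0.18, "effect present"),
    ("analyst_05", -0.02, "no effect"),
    ("analyst_06", 0.27, "effect present"),
]

effects = [e for _, e, _ in submissions]
conclusions = Counter(c for _, _, c in submissions)

# Variability of the estimates across analysis strategies.
print(f"n analyses      : {len(effects)}")
print(f"median estimate : {statistics.median(effects):+.2f}")
print(f"range           : {min(effects):+.2f} to {max(effects):+.2f}")
print(f"SD of estimates : {statistics.stdev(effects):.2f}")

# Share of analysts supporting each conclusion.
for conclusion, count in conclusions.most_common():
    print(f"{conclusion:15s}: {count / len(submissions):.0%}")
```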
Box 1
Benefits of the Multi-Analyst Approach
Converging conclusions increase confidence in the analytical robustness of a finding
Diverging conclusions decrease confidence in the analytical robustness of a finding and
prompt an examination of the reasons for the divergence
Identifies a key source of uncertainty, namely the extent to which the results and
conclusions depend on the analytic preferences of the analyst
Establishes the variability of results as a function of analytical choices
With analysts from multiple disciplines, the approach stimulates cross-pollination of
analysis strategies that otherwise might remain isolated within research subcultures
Diminishes or eliminates the analyst’s potential preference toward the hypotheses, since no
individual analysis is likely to determine the conclusions
Multi-Analyst Guidance
The multi-analyst approach is rarely used in empirical research, but many disciplines
could benefit from its broader adoption. Implementing a multi-analyst study involves
practical challenges that might discourage researchers from pursuing it. To help researchers overcome these challenges, we provide consensus-based guidance for preparing, conducting, and reporting multi-analyst studies.
To develop this guidance, we recruited an expert panel of 50 social and behavioural
scientists (all co-authors on this paper) with experience in organising multi-analyst projects or
expertise in research methodology. In a first survey we gathered their conceptual and
practical insights about this approach. These responses were used to create a draft of our
guidance. Next, the draft was iteratively improved by the expert panel, following a
preregistered ‘reactive-Delphi’ expert consensus procedure. The final draft was
independently rated by the members of the panel to ensure that each of the approved items
satisfied our preset criteria for a sufficiently high level of support. The consensus procedure concluded after a single round, resulting in a guide that represents a consensus among the experts. Of course, other experts might have different views, and we welcome feedback. For the survey materials, a list of panel members, and the details of the consensus procedure, see the Supplementary Information.
The guidance includes 10 Recommended Practices (Table 1) and a Practical
Considerations document that supports these practices (see Supplementary Information). Both
the practices and considerations address the five main stages of a multi-analyst project: (1)
Recruiting co-analysts; (2) Providing the dataset, research questions, and research tasks; (3)
Conducting the independent analyses; (4) Processing the results; and (5) Reporting the
methods and results. To further assist researchers in documenting multi-analyst projects, we
also provide a modifiable Reporting Template that incorporates the elements of our guide.
Table 1
Recommended Practices for the Main Stages of the Multi-Analyst Method

Recruiting Co-analysts
1. Determine a minimum target number of co-analysts and outline clear eligibility criteria before recruiting co-analysts. We recommend that the final report justifies why these choices are adequate to achieve the study goals.
2. When recruiting co-analysts, inform them about (a) their tasks and responsibilities; (b) the project code of conduct (e.g., confidentiality/non-disclosure agreements); (c) the plans for publishing the research report and presenting the data, analyses, and conclusions; (d) the conditions for an analysis to be included in or excluded from the study; (e) whether their names will be publicly linked to the analyses; (f) the co-analysts’ rights to update or revise their analyses; (g) the project time schedule; and (h) the nature and criteria of compensation (e.g., authorship).

Providing the Dataset, Research Questions, and Research Tasks
3. Provide the datasets accompanied by a codebook that contains a comprehensive explanation of the variables and the datafile structure.
4. Ensure that co-analysts understand any restrictions on the use of the data, including issues of ethics, privacy, confidentiality, or ownership.
5. Provide the research questions (and any theoretically derived hypotheses that should be tested) without communicating the lead team’s preferred analysis choices or expectations about the conclusions.

Conducting the Independent Analyses
6. To ensure independence, we recommend that co-analysts do not communicate with each other about their analyses until after all initial reports have been submitted. In general, it should be clearly explained why and at what stage co-analysts are allowed to communicate about the analyses (e.g., to detect errors or call attention to outlying data points).

Processing the Results
7. Require co-analysts to share with the lead team their results, the analysis code with explanatory comments (or a detailed description of their point-and-click analyses), their conclusions, and an explanation of how their conclusions follow from their results.
8. The lead team makes the commented code, results, and conclusions of all non-withdrawn analyses publicly available before or at the same time as submitting the research report.

Reporting the Methods and Results
9. The lead team should report the multi-analyst process of the study, including (a) the justification for the number of co-analysts; (b) the eligibility criteria and recruitment of co-analysts; (c) how co-analysts were given the data sets and research questions; (d) how the independence of analyses was ensured; (e) the numbers of and reasons for withdrawals and omissions of analyses; (f) whether the lead team conducted an independent analysis; (g) how the results were processed; (h) the summary of the results of co-analysts; and (i) the limitations and potential biases of the study.
10. Data management should follow the FAIR principles15, and the research report should be transparent about access to the data and code for all analyses16.
Caveats and Conclusions
The present work does not cover all aspects of multi-analyst projects. For instance,
the multi-analyst approach outlined here entails the independent analysis of one or more
datasets, but it should be acknowledged that other crowdsourced analysis approaches might
not require such independence of the analyses. Also, we emphasize that this consensus-based
guidance is a first step towards the broader adoption of the multi-analyst approach in
empirical research; we hope and expect that our recommendations will be developed further.
This guidance document aims to facilitate adoption of the multi-analyst approach in
empirical research. We believe that the scientific benefits greatly outweigh the extra logistics
required, especially for projects with great scientific or societal impact. With a systematic
exploration of the analytical space, we can assess whether the reported results and conclusions depend on the chosen analytical strategy. Such exploration takes us beyond the tip of
the epistemic iceberg that results from a single data analyst executing a single statistical
analysis.
References
1. Fields, A. C. et al. Does retrieval bag use during laparoscopic appendectomy reduce postoperative infection? Surgery 165, 953–957 (2019).
2. Turner, S. A., Jung, H. S. & Scarborough, J. E. Utilization of a specimen retrieval bag during laparoscopic appendectomy for both uncomplicated and complicated appendicitis is not associated with a decrease in postoperative surgical site infection rates. Surgery 165, 1199–1202 (2019).
3. Childers, C. P. & Maggard-Gibbons, M. Same Data, Opposite Results?: A Call to Improve Surgical Database Research. JAMA Surg. 156, 219–220 (2021).
4. Patel, C. J., Burford, B. & Ioannidis, J. P. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. J. Clin. Epidemiol. 68, 1046–1058 (2015).
5. Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. Increasing Transparency Through a Multiverse Analysis. Perspect. Psychol. Sci. 11, 702–712 (2016).
6. Bastiaansen, J. A. et al. Time to get personal? The impact of researchers’ choices on the selection of treatment targets using the experience sampling methodology. J. Psychosom. Res. 137, 110211 (2020).
7. Dongen, N. N. N. van et al. Multiple Perspectives on Inference for Two Simple Statistical Scenarios. Am. Stat. 73, 328–339 (2019).
8. Salganik, M. J. et al. Measuring the predictability of life outcomes with a scientific mass collaboration. Proc. Natl. Acad. Sci. 117, 8398–8403 (2020).
9. Silberzahn, R. et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Adv. Methods Pract. Psychol. Sci. 1, 337–356 (2018).
10. Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
11. Dutilh, G. et al. The Quality of Response Time Data Inference: A Blinded, Collaborative Assessment of the Validity of Cognitive Models. Psychon. Bull. Rev. 26, 1051–1069 (2019).
12. Fillard, P. et al. Quantitative evaluation of 10 tractography algorithms on a realistic diffusion MR phantom. NeuroImage 56, 220–234 (2011).
13. Starns, J. J. et al. Assessing theoretical conclusions with blinded inference to investigate a potential inference crisis. Adv. Methods Pract. Psychol. Sci. 2, 335–349 (2019).
14. Rawlinson, H. S., Talbot, F., Hincks, E. & Oppert, J. Inscription of Tiglath Pileser I, King of Assyria, BC 1150, as translated by H. Rawlinson, Fox Talbot, Dr Hincks, and Dr Oppert. Publ. R. Asiat. Soc. Lond.: J. W. Parker & Son (1857).
15. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
16. Aczel, B. et al. A consensus-based transparency checklist. Nat. Hum. Behav. 4, 4–6 (2020).
Competing interests
B.A.N. is Executive Director of the Center for Open Science, a nonprofit technology
and culture change organization with a mission to increase openness, integrity, and
reproducibility of research. The other authors declare no competing interests.
Data and materials availability
All anonymized data as well as the survey materials are publicly shared on the Open
Science Framework page of the project: https://osf.io/4zvst/. Our methodology and data-
analysis plan were preregistered. The preregistration document can be accessed at:
https://osf.io/dgrua.
Funding
This research was not funded. A.S. was supported by a talent grant from the Netherlands Organisation for Scientific Research (NWO; 406-17-568). R.B.-N. is an
Awardee of the Weizmann Institute of Science Israel National Postdoctoral Award Program
for Advancing Women in Science. B.A.N. was supported by grants from the John Templeton
Foundation, Templeton World Charity Foundation, Templeton Religion Trust, and Arnold
Ventures. S.St-J. is supported by the Natural Sciences and Engineering Research Council of
Canada (NSERC) [funding reference number BP5462832020] and the Fonds de recherche
du Québec - Nature et technologies (FRQNT) [Dossier 290978]. J.M.W. and O.R.v.d.A. were
supported by a Consolidator Grant (IMPROVE) from the European Research Council (ERC;
grant no. 726361). Y.K.K. was supported by a grant from the European Research Council
(ERC) under the European Union’s Horizon 2020 research and innovation programme (ERC-
CoG-2015; No 681466 to M. Wichers). D.v.R. was supported by a Dutch scientific
organization VIDI fellowship grant (016.Vidi.188.001). L.F.B. was supported by a Dutch
scientific organization VENI fellowship grant (Veni 191G.037). M.J.S. was supported by the
U.S. National Science Foundation (1760052).
Author contributions
Conceptualization: B.A., B.S., G.N., and E.-J.W.; Methodology: B.A., B.S., G.N., and E.-
J.W.; Project Administration: B.A.; Supervision: E.-J.W.; Writing - Original Draft
Preparation: B.A., B.S., G.N., and E.-J.W.; Writing - Review & Editing: B.A., B.S., G.N.,
O.R.v.d.A., C.J.A., M.A.L.M.v.A., J.A.B., D.B., U.B., R.B.-N., L.F.B., N.B., E.C., A.M.C.,
N.C., A. Delios, N.N.N.v.D., C.D., J.B.v.D., A. Dreber, G.D., G.F.E., M.A.G., R.H., S.H.,
F.H., J.H., M.J., K.J.J., A.T.K., M.K., Y.K.K., D.S.L., J.-F.M., D.M., M.R.M., B.R.N.,
B.A.N., R.A.P., D.v.R., J.R., M.J.S., A.S., T.S., M.S., D.S., R.S., D.J.S., B.A.S., S.St-J.,
J.J.S., E.L.U., J.W., and E.-J.W.
Supplementary Information
Recommended Practices and Practical Considerations for Multi-Analyst Projects:
https://osf.io/uvwgy/
Reporting Template: https://osf.io/h9mgy/
Supplementary Methods: https://osf.io/gjz2r/