Figure - available from: BMC Biology
Violin plot of Box-Cox transformed deviation from meta-analytic mean $\overline{Z}_r$ as a function of categorical peer rating. Grey points for each rating group denote model-estimated marginal mean deviation, and error bars denote the 95% CI of the estimate. A: Blue tit dataset. B: Eucalyptus dataset.


Source publication
Article
Full-text available
Although variation in effect sizes and predicted values among studies of similar phenomena is inevitable, such variation far exceeds what might be produced by sampling error alone. One possible explanation for variation among results is differences among researchers in the decisions they make regarding statistical analyses. A growing array of studi...

Citations

... Consequently, Pitogo and colleagues reanalysed the same dataset and found slight variations and differences compared to our outcomes; in particular, they indicated a decline in observed species occurrence records with increasing distance from conflict events, which is the opposite of our findings. These differences may have been due to the variability in our analytical choices and decisions (for example, Botvinik-Nezer et al. 13, Gould et al. 14, and Silberzahn et al. 15). Firstly, Pitogo and colleagues treated taxonomic groups as a random effect rather than a fixed effect, which does not align with our primary goal and is different from our analysis. ...
Article
Full-text available
Sociopolitical conflicts have significant but often overlooked impacts on biodiversity. In our reply, we reaffirm key findings from our previous work and directly address the Matters Arising raised by Pitogo and colleagues. Additionally, we present fine-scale analyses that further support our original conclusions. We emphasise the need for continued research to fully unravel the complex relationship between conflict and environmental impacts.
... To move forward, broadening participation requires recognizing other human-nature relationships, which is different from the tradition of seeing nature as something hierarchically inferior, available to be acquired, and transformed freely and as an object to be economically exploited for human benefit (Arriagada Oyarzún and Zambra Álvarez, 2019). There are other views (Fanon, 1968; Haraway, 2015; Islas-Vargas, 2020), values (Navarro and Gutiérrez, 2018; Tilot et al., 2021), and possibilities for different scientific paradigms (Funtowicz and Ravetz, 1993; Kuhn, 1962). ...
... For the author, normal science refers to the routine puzzle-solving practice in which the scientific practice delves between two conceptual revolutions. In this normal state, uncertainties are managed automatically, values are unspoken, and foundational problems are unheard of (Funtowicz and Ravetz, 1993). By comparison, in post-normal science, the centrality comes from the understanding that "facts are uncertain, values are in dispute, the stakes high and decisions urgent" (Funtowicz and Ravetz, 1993). ...
... In this normal state, uncertainties are managed automatically, values are unspoken, and foundational problems are unheard of (Funtowicz and Ravetz, 1993). By comparison, in post-normal science, the centrality comes from the understanding that "facts are uncertain, values are in dispute, the stakes high and decisions urgent" (Funtowicz and Ravetz, 1993). This perspective enhances the idea of stakeholder consultation in broader and deeper participation, expanding the usual understanding of stakeholders, commonly restricted to a group of specialists from the scientific community (plus a few "relevant" decision-makers), to the broad community (namely, expanded peer community) based on the justification of shared risks to the globalized civilization. ...
Article
Full-text available
The transition from the current fossil fuel-based economy toward one that relies on renewable sources of energy allegedly will require a set of minerals for manufacturing batteries that store this energy and power electric devices. Deep seabed mining (DSM) is an economic activity that has the potential to fill these material requirements as it relies on collecting rich mineral resources from the bottom of the ocean. This activity brings enormous challenges to regulation and potentially irreversible impacts on a large scale. In addition, the seabed is considered a common heritage of humankind, and therefore, questions of distributions of burdens and profits also emerge. We build on the premise of social justice, legitimacy, and participatory processes to discuss six perspectives that should be considered while dealing with DSM. We claim that DSM should be seen through a wicked problem lens, acknowledging the limits of ignorance squared, inside a scientific paradigm open to the possibility of a post-normal science. Participation should center on recognizing plural rationalities, ensuring justice and capabilities, and actively including the global South. We conclude that DSM's legitimacy can be enhanced by following these six perspective guidelines.
... However, even that is not a panacea, as discovered in a recent study where 174 independent teams were given the same data and the same research question, yet there was substantial heterogeneity among findings with some showing results with opposite associations with the outcome variable. 13 The cohort in the second national analysis was approximately double the cohort for the first national analysis for both T1D (77 392 patients vs 38 523) and T2D (836 532 vs 448 829). The difference between these cohorts was the addition of the SGSS dataset to identify more COVID-19 positive tests. ...
Article
Full-text available
Objectives: To assess the degree to which we can replicate a study between a regional and a national database of electronic health record data in the UK. The original study examined the risk factors associated with hospitalisation following COVID-19 infection in people with diabetes. Design: A replication of a retrospective cohort study. Setting: Observational electronic health record data from primary and secondary care sources in the UK. The original study used data from a large, urbanised region (Greater Manchester Care Record, Greater Manchester, UK; 2.8 m patients). This replication study used a national database covering the whole of England, UK (NHS England's Secure Data Environment service for England, accessed via the BHF Data Science Centre's CVD-COVID-UK/COVID-IMPACT Consortium; 54 m patients). Participants: Individuals with a diagnosis of type 1 diabetes or type 2 diabetes prior to a positive COVID-19 test result. The matched controls (3:1) were individuals who had a positive COVID-19 test result, but who did not have a diagnosis of diabetes on the date of their positive COVID-19 test result. Matching was done on age at COVID-19 diagnosis, sex and approximate date of COVID-19 test. Primary and secondary outcome measures: Hospitalisation within 28 days of a positive COVID-19 test. Results: We found that many of the effect sizes did not show a statistically significant difference, but that some did. Where effect sizes were statistically significant in the regional study, then they remained significant in the national study and the effect size was the same direction and of similar magnitude. Conclusions: There is some evidence that the findings from studies in smaller regional datasets can be extrapolated to a larger, national setting. However, there were some differences, and therefore replication studies remain an essential part of healthcare research.
... The between-analyst variability in conclusions is substantial and has been shown to arise across a wide range of empirical disciplines (e.g., neuroscience, [4–6]; psychology, [3, 7–10]; social science, [2, 11]; medical sciences/epidemiology, [12–14]; biology, [15]; and economics/finance, [16, 17]). Moreover, the variability does not appear to be a result of suboptimal analytic choices [e.g., 2]. ...
Article
Full-text available
The vast majority of empirical research articles report a single primary analysis outcome that is the result of a single analysis plan, executed by a single analysis team (usually the team that also designed the experiment and collected the data). However, recent many-analyst projects have demonstrated that different analysis teams generally adopt a unique approach and that there exists considerable variability in the associated conclusions. There appears to be no single optimal statistical analysis plan, and different plausible plans need not lead to the same conclusion. A high variability in outcomes signals that the conclusions are relatively fragile and dependent on the specifics of the analysis plan. Crucially, without multiple teams analyzing the data, it is difficult to gauge the extent to which the conclusions are robust. We have recently proposed that empirical articles of particular scientific interest or societal importance are accompanied by two or three short reports that summarize the results of alternative analyses conducted by independent experts [F. Bartoš et al., Nat. Hum. Behav. (2025)]. In order to showcase the practical feasibility and epistemic benefits of this approach we have founded the Journal of Robustness Reports, which is dedicated to publishing short reanalyses of empirical findings. This editorial describes the scope and the workflow of the Journal of Robustness Reports including the type and format of the published articles. We hope that the Journal of Robustness Reports will help make reanalyses of published findings the norm across the empirical sciences.
... For example, the structure of complex systems can be represented quite differently in the mental models of individual system experts (Levy et al., 2018; Schaffernicht & Groesser, 2024; van den Broek et al., 2023). This diversity of perspective means modellers are unavoidably partial in making the many decisions involved in developing models (such as system boundary, variables, and data), with studies demonstrating how differences in models and decisions by modellers lead to differing model predictions for the same question (Holländer et al., 2014), and how different research teams can draw highly variable conclusions from the same set of data (Gould et al., 2025). The plurality of ways to model a system and the consequent variability in possible impacts on different people at different scales highlights the importance of careful thinking about the ethical dimensions of the modelling process, i.e., the awareness by the modeller that modelling cannot be separated from its social and political uses and impacts, even in the case of biophysical and socio-environmental models (Picavet, 2009). ...
... As case study 5 highlights, unscrupulous use of biophysical model outputs may still result in disadvantage to a person or community. Case study 4 identifies the idea of equifinality, where multiple valid models may represent the same biophysical systems (Beven, 2006), which is linked to the example discussed in the introduction where hundreds of researchers created different, heterogeneous models from the same datasets (Gould et al., 2025). These issues fall into the category of exogenous ethical considerations, whether in a planning sense (deciding which is the best or most appropriate model(s) to use), or from a philosophical sense (knowing which model(s) to use and why that decision was made). ...
Article
Full-text available
In recent years there has been an increasing emphasis on including humans when modelling socio-environmental systems. However, it is crucial that we remain mindful of the impacts that the decisions made during model development or analysis can have on people or nature as modelling is not an impartial process. Responsible modelling requires us to consider the broader societal implications of our work, therefore, modellers should consider a range of ethical concerns, often found beyond those prescribed through institutionally mandated ethical approval processes. Herein we examine the ethical dimensions of six socio-environmental case studies using the principles of credibility, legitimacy, and salience, encompassing the modelling process from conception to delivery and beyond. We also discuss the results from an interdisciplinary workshop held with experienced modellers to co-produce a list of ethical dimensions that modellers would ideally engage with when conducting a modelling project. Based on our findings, we have developed a set of recommendations to: i) support modellers in ensuring their modelling practice is underpinned by ethical reflection, ii) guide end-users of model outputs when selecting and repurposing those outputs, and iii) identify means by which institutions can support responsible modelling practices. Engaging with ethical dimensions in the process of modelling is critical for building trust with stakeholders, therefore enhancing the credibility, legitimacy, and salience of the models and research.
... For transparency and to reduce the risk of false-positive findings (Forstmeier et al., 2017), we preregistered our hypotheses, the study design and our analytical approach with the Open Science Framework (OSF) before running the odour discrimination experiment (Schlatmann et al., 2021). The usefulness of preregistrations in reducing post hoc researcher flexibility (researcher degrees of freedom; Simmons et al., 2011) in statistical analysis is well illustrated by a recent study in which over two hundred ecologists investigated the same data sets to test given hypotheses (Gould et al., 2025). While the majority of analysts came to qualitatively similar conclusions, statistical estimates varied markedly among them depending on the analysis choices they made and even qualitative conclusions were sometimes divergent (Gould et al., 2025). ...
... The usefulness of preregistrations in reducing post hoc researcher flexibility (researcher degrees of freedom; Simmons et al., 2011) in statistical analysis is well illustrated by a recent study in which over two hundred ecologists investigated the same data sets to test given hypotheses (Gould et al., 2025). While the majority of analysts came to qualitatively similar conclusions, statistical estimates varied markedly among them depending on the analysis choices they made and even qualitative conclusions were sometimes divergent (Gould et al., 2025). ...
Article
Full-text available
Olfactory kin discrimination occurs in many animal taxa, but its potential contribution to commonly observed kin-biased behaviours in birds has rarely been tested. In a previous odour discrimination experiment, 7-day-old blue tit, Cyanistes caeruleus, nestlings showed stronger begging responses to olfactory cues from conspecific nestlings from other nests than from their own. The authors hypothesized olfaction to mediate kin-biased sibling competition in nests with varying relatedness due to extrapair paternity. In the present study, we aimed to test this hypothesis. We therefore replicated the previous experiment with a crucial modification: we cross-fostered two nestlings of each brood the day after hatching. This allowed us to test for olfactory kin discrimination when nestmates differed in relatedness (due to being cross-fostered) but not in familiarity. We ascertained the relatedness of nestlings using genetic parentage assignment. We preregistered our research plan with the Open Science Framework (OSF) to increase research transparency and reduce researcher degrees of freedom. We found that nestlings did not differ in their begging responses to related versus unrelated (cross-fostered) nestmates' odours, indicating that nestlings do not discriminate kin from nonkin odours when these are both familiar. Moreover, in an exploratory analysis, cross-fostered nestlings did not differ in survival or size from their non-cross-fostered nestmates shortly before fledging, indicating that the presence of unrelated individuals did not affect the distribution of parental care in the nest. In conclusion, we found no evidence for olfactory kin discrimination in begging blue tit nestlings.
... Many-analysts studies have been conducted in microeconomics (Borjas and Breznau 2024), finance (Menkveld et al. 2024), religion (Hoogeveen et al. 2023), neuroimaging (Botvinik-Nezer et al. 2020), political science (Breznau et al. 2021), machine learning (W. Chen and Cummings 2024), ecology and evolutionary biology (Gould et al. 2023), psychology (Boehm et al. 2018; Bastiaansen et al. 2020; Schweinsberg et al. 2021), and medical informatics (Ostropolets et al. 2023), among others. ... reasonable but divergent choices (Bryan, Yeager, and O'Brien 2019). ...
... Finally, peer review may increase agreement if there is an option to revise, although if instead outside evaluation is used as a measure of researcher quality, peer review scores do not necessarily predict outlier results (Menkveld et al. 2024;Gould et al. 2023). ...
Preprint
Full-text available
We use a rigorous three-stage many-analysts design to assess how different researcher decisions (specifically data cleaning, research design, and the interpretation of a policy question) affect the variation in estimated treatment effects. A total of 146 research teams each completed the same causal inference task three times: first with few constraints, then using a shared research design, and finally with pre-cleaned data in addition to a specified design. We find that even when analyzing the same data, teams reach different conclusions. In the first stage, the interquartile range (IQR) of the reported policy effect was 3.1 percentage points, with substantial outliers. Surprisingly, the second stage, which restricted research design choices, exhibited a slightly higher IQR (4.0 percentage points), largely attributable to imperfect adherence to the prescribed protocol. By contrast, the final stage, featuring standardized data cleaning, narrowed variation in estimated effects, achieving an IQR of 2.4 percentage points. Reported sample sizes also displayed significant convergence under more restrictive conditions, with the IQR dropping from 295,187 in the first stage to 29,144 in the second, and effectively zero by the third. Our findings underscore the critical importance of data cleaning in shaping applied microeconomic results and highlight avenues for future replication efforts.
... Despite the violation of the model assumptions, we use random-effects meta-analyses to estimate τ and H as an approximation of the analytical heterogeneity for the sampled multianalyst studies. A recent multianalyst study in biology (72) also used this method to estimate the heterogeneity of results across analysts. The estimates reported in Fig. 2 should be interpreted cautiously since relying on the random-effects meta-analytic model will underestimate heterogeneity for correlated observations (as the within-study variation will be lower in the case of dependent observations). ...
Article
Full-text available
A typical empirical study involves choosing a sample, a research design, and an analysis path. Variation in such choices across studies leads to heterogeneity in results that introduce an additional layer of uncertainty, limiting the generalizability of published scientific findings. We provide a framework for studying heterogeneity in the social sciences and divide heterogeneity into population, design, and analytical heterogeneity. Our framework suggests that after accounting for heterogeneity, the probability that the tested hypothesis is true for the average population, design, and analysis path can be much lower than implied by nominal error rates of statistically significant individual studies. We estimate each type's heterogeneity from 70 multilab replication studies, 11 prospective meta-analyses of studies employing different experimental designs, and 5 multianalyst studies. In our data, population heterogeneity tends to be relatively small, whereas design and analytical heterogeneity are large. Our results should, however, be interpreted cautiously due to the limited number of studies and the large uncertainty in the heterogeneity estimates. We discuss several ways to parse and account for heterogeneity in the context of different methodologies.
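
The citing excerpt above mentions using random-effects meta-analysis to estimate τ and H as measures of heterogeneity across analysts. As a rough illustration only, the sketch below computes both with the DerSimonian-Laird moment estimator, which is one common choice and is assumed here; the excerpt does not say which estimator the authors used. The function name `dersimonian_laird` and the input numbers are hypothetical, and H is taken as sqrt(Q / (k - 1)), which may differ from the definition in the cited article.

```python
import math


def dersimonian_laird(effects, variances):
    """Estimate between-result heterogeneity (tau, H) from effect sizes
    and their sampling variances with the DerSimonian-Laird estimator.

    Assumption: results are treated as independent, which the excerpt
    notes is violated when analysts share one dataset.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two effects with matching variances")

    # Inverse-variance (fixed-effect) weights and pooled mean.
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the pooled mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # Moment estimator of the between-result variance, truncated at zero.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)

    return {"tau": math.sqrt(tau2), "H": math.sqrt(q / df), "Q": q}


# Hypothetical effect sizes from three analysts, equal sampling variance.
result = dersimonian_laird([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(result)
```

An H near 1 suggests the spread of results is about what sampling error alone would produce, while larger values point to additional heterogeneity. As the excerpt cautions, estimates of this kind tend to understate heterogeneity when the observations are correlated.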