Figure - available from: Nature
Voxel overlap
Maps showing at each voxel the proportion of teams (out of n = 65 teams) that reported significant activations in their thresholded statistical map, for each hypothesis (labelled H1–H9), thresholded at 10% (that is, voxels with no colour were significant in fewer than 10% of teams). + or − refers to the direction of effect; gain or loss refers to the effect being tested; and equal indifference (EI) or equal range (ER) refers to the group being examined or compared. Hypotheses 1 and 3, as well as hypotheses 2 and 4, share the same statistical maps as they relate to the same contrast and experimental group but different regions (see Extended Data Table 1). Images can be viewed at https://identifiers.org/neurovault.collection:6047.
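The thresholding logic described in the caption is simple to reproduce. Below is a minimal sketch (the `overlap_map` helper and the toy 1-D "volume" are mine for illustration; the actual analysis operated on 3-D NIfTI volumes in MNI space): stack the teams' binarized maps, take the voxel-wise mean, and blank out voxels below the 10% cutoff.

```python
import numpy as np

def overlap_map(thresholded_maps, min_prop=0.10):
    """Proportion of teams reporting a significant voxel.

    thresholded_maps: array of shape (n_teams, *volume_shape),
    binarized (1 = significant, 0 = not significant).
    Voxels below `min_prop` are set to NaN ("no colour" in the figure).
    """
    maps = np.asarray(thresholded_maps, dtype=float)
    prop = maps.mean(axis=0)          # fraction of teams per voxel
    prop[prop < min_prop] = np.nan    # hide voxels under the 10% cutoff
    return prop

# toy example: 5 "teams", 4 "voxels"
teams = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
])
print(overlap_map(teams))  # voxel 3 falls below the 10% cutoff and becomes NaN
```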


Source publication
Article
Full-text available
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses [1]. The flexibility of analytical approaches is exempl...

Citations

... Researchers need to make a multitude of choices, which may affect the outcomes of their study (Simmons et al. 2011). When different research teams analyze the same research question using the same data set, considerable heterogeneity in both study designs and outcomes can be observed (Botvinik-Nezer et al. 2020; Breznau et al. 2022; Huntington-Klein et al. 2021; Silberzahn et al. 2018). As researchers face incentives to search for and to report statistically significant findings, particularly those that conform with prior beliefs or sensational findings and large effect sizes (Ioannidis 2005), they may consciously or unconsciously use the flexibility in possible research designs to obtain desired study outcomes (Bruns and Kalthaus 2020). ...
Preprint
    Researchers have incentives to search for and selectively report findings that appear to be statistically significant and/or conform to prior beliefs. Such selective reporting practices, including p-hacking and publication bias, can lead to a distorted set of results being published, potentially undermining the process of knowledge accumulation and evidence-based decision making. We take stock of the state of empirical research in the environmental sciences using 67,947 statistical tests obtained from 547 meta-analyses. We find that 59% of the p-values that were reported as significant are not actually expected to be statistically significant. The median power of these tests is between 6% and 12%, the lowest yet identified for any discipline. Only 8% of tests are adequately powered, with statistical power of 80% or more. Exploratory regressions suggest that increased statistical power and the use of experimental research designs reduce the extent of selective reporting. Differences between subfields can mostly be explained by methodological differences. To improve the environmental sciences evidence base, researchers should pay more attention to statistical power, but incentives for selective reporting may remain even with adequate statistical power. Ultimately, a paradigm shift towards open science is needed to ensure the reliability of published empirical research.
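Power figures like those quoted above can be checked against the standard noncentral-t formula for a two-sample t-test. The sketch below is illustrative (the function name is mine; it uses SciPy's `nct` distribution, not the authors' code) and shows why a small effect at typical sample sizes lands in the low power range the authors report.

```python
from math import sqrt
from scipy.stats import t, nct

def ttest_power(d, n1, n2, alpha=0.05):
    """Power of a two-sided two-sample t-test for standardized effect size d."""
    df = n1 + n2 - 2
    ncp = d * sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
    tcrit = t.ppf(1 - alpha / 2, df)      # two-sided critical value
    # probability mass of the noncentral t beyond the rejection bounds
    return (1 - nct.cdf(tcrit, df, ncp)) + nct.cdf(-tcrit, df, ncp)

# a small effect (d = 0.2) with 30 subjects per group is badly underpowered,
# while a large effect (d = 0.8) clears the conventional 80% threshold
print(round(ttest_power(0.2, 30, 30), 3))
print(round(ttest_power(0.8, 30, 30), 3))
```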
    ... Independent replications are needed as they represent a fundamental part of science, leading to greater confidence in previously reported findings [13,14]. This is particularly relevant to neuroimaging studies due to their large analytical variability [15], and to the field of psychiatry suffering from the known heterogeneity problem [16]. Nevertheless, engaging in replication studies is often undervalued, can be difficult to publish, and has few direct incentives for researchers [17]. ...
    Article
    Objective: The weak link between subjective symptom-based diagnostic methods for posttraumatic psychopathology and objectively measured neurobiological indices forms a barrier to the development of effective personalized treatments. To overcome this problem, recent studies have aimed to stratify psychiatric disorders by identifying consistent subgroups based on objective neural markers. Along these lines, a promising 2021 study by Stevens et al. identified distinct brain-based biotypes associated with different longitudinal patterns of posttraumatic symptoms. Here, the authors conducted a conceptual nonexact replication of that study using a comparable data set from a multimodal longitudinal study of recent trauma survivors. Methods: A total of 130 participants (mean age, 33.61 years, SD=11.21; 48% women) admitted to a general hospital emergency department following trauma exposure underwent demographic, clinical, and neuroimaging assessments 1, 6, and 14 months after trauma. All analyses followed the pipeline outlined in the original study and were conducted in collaboration with its authors. Results: Task-based functional MRI conducted 1 month posttrauma was used to identify four clusters of individuals based on profiles of neural activity reflecting threat and reward reactivity. These clusters were not identical to the previously identified brain-based biotypes and were not associated with prospective symptoms of posttraumatic psychopathology. Conclusions: Overall, these findings suggest that the original brain-based biotypes of trauma resilience and psychopathology may not generalize to other populations. Thus, caution is warranted when attempting to define subtypes of psychiatric vulnerability using neural indices before treatment implications can be fully realized. Additional replication studies are needed to identify more stable and generalizable neuroimaging-based biotypes of posttraumatic psychopathology.
    ... It is difficult to collect large scale and well characterized randomized ET and control subjects, and collection and analysis of disparate cohorts with small sample sizes limits valid hypothesis testing and interpretation of findings. Apart from the difficulties of collecting large scale well characterized randomized ET and control subjects, the disagreements between imaging studies may also arise from the complexity and flexibility of the neuroimaging processing pipelines and the statistical models [31][32][33] . We refer to the study of robustness in findings resulting from various pipelines and statistical models as "Methods sensitivity analysis". ...
    Article
    Full-text available
    Essential tremor (ET) is the most prevalent movement disorder with poorly understood etiology. Some neuroimaging studies report cerebellar involvement whereas others do not. This discrepancy may stem from underpowered studies, differences in statistical modeling, or variation in magnetic resonance imaging (MRI) acquisition and processing. To resolve this, we investigated cerebellar structural differences using a local advanced ET dataset augmented by matched controls from PPMI and ADNI. We tested the hypothesis of cerebellar involvement using three neuroimaging biomarkers: VBM, gray/white matter volumetry and lobular volumetry. Furthermore, we assessed the impacts of statistical models and segmentation pipelines on results. Results indicate that the detected cerebellar structural changes vary with methodology. Significant reduction of right cerebellar gray matter and increase of left cerebellar white matter were the only two biomarkers consistently identified by multiple methods. Results also show substantial volumetric overestimation from SUIT-based segmentation, partially explaining previous literature discrepancies. This study suggests that current estimation of cerebellar involvement in ET may be overemphasized in MRI studies and highlights the importance of methods sensitivity analysis for results interpretation. ET datasets with large sample sizes and replication studies are required to improve our understanding of the regional specificity of cerebellar involvement in ET. Protocol registration: The stage 1 protocol for this Registered Report was accepted in principle on 21 March 2022. The protocol, as accepted by the journal, can be found at: https://doi.org/10.6084/m9.figshare.19697776.
    ... Finally, decoding accuracy might depend (to some extent) on the choice of analysis parameters, such as the type of cross-validation procedure and the voxel-level threshold that provides the basis for cluster-level correction [29]. To evaluate to what extent our results were sensitive to these choices, we performed additional analyses using a different type of cross-validation procedure and a different voxel-level threshold. ...
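The sensitivity of decoding accuracy to cross-validation choices mentioned in this excerpt can be demonstrated on synthetic data. The sketch below uses a hand-rolled nearest-centroid decoder and k-fold split (all names and the toy data are mine, not from the cited study) to show how the accuracy estimate shifts as the number of folds changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_cv(X, y, n_folds):
    """Accuracy of a nearest-centroid decoder under k-fold cross-validation."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for test in folds:
        train = np.setdiff1d(idx, test)
        # class centroids estimated from the training folds only
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        for i in test:
            pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
            correct += pred == y[i]
    return correct / len(y)

# toy "voxel pattern" data: 40 trials x 50 voxels, weak class signal,
# classes interleaved so every contiguous fold contains both classes
y = np.tile([0, 1], 20)
X = rng.normal(size=(40, 50)) + 0.3 * y[:, None]
for k in (2, 5, 10):   # the accuracy estimate shifts with the CV scheme
    print(k, round(nearest_centroid_cv(X, y, k), 3))
```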
    Article
    Full-text available
    Number symbols, such as Arabic numerals, are cultural inventions that have transformed human mathematical skills. Although their acquisition is at the core of early elementary education in children, it remains unknown how the neural representations of numerals emerge during that period. It is also unclear whether these relate to an ontogenetically earlier sense of approximate quantity. Here, we used multivariate fMRI adaptation coupled with within- and between-format machine learning to probe the cortical representations of Arabic numerals and approximate nonsymbolic quantity in 89 children either at the beginning (age 5) or four years into formal education (age 8). Although the cortical representations of both numerals and nonsymbolic quantities expanded from age 5 to age 8, these representations also segregated with learning and development. Specifically, a format-independent neural representation of quantity was found in the right parietal cortex, but only for 5-year-olds. These results are consistent with the so-called symbolic estrangement hypothesis, which argues that the relation between symbolic and nonsymbolic quantity weakens with exposure to formal mathematics in children.
    ... heterogeneous results (Borghi & Gulick, 2018; Botvinik-Nezer et al., 2020; Gronenschild et al., 2012), and it is practically impossible to standardize acquisition protocols and preprocessing procedures across all research centers and application domains. Routine procedures, such as providing access to raw data, modifying analysis configurations and integrating new workflows, all become significantly more difficult to support with every increase in scale. ...
    Preprint
    Magnetic resonance imaging (MRI) is a powerful tool for non-invasive imaging of the human body. However, the quality and reliability of MRI data can be influenced by various factors, such as hardware and software configurations, image acquisition protocols, and preprocessing techniques. In recent years, large-scale neuroimaging datasets have taken an increasingly prominent role in neuroscientific research. The advent of publicly available and standardized repositories has enabled researchers to combine data from multiple sources to explore a wide range of scientific inquiries. This increase in scale allows the study of phenomena with smaller effect sizes over a more diverse sample and with greater statistical power. Beyond the variability inherent to acquiring data across sites, the preprocessing and feature-generation steps implemented in different labs introduce an additional layer of variability that may influence subsequent statistical procedures. In this study, we show that differences in the configuration of surface reconstruction from anatomical MRI using FreeSurfer result in considerable changes to the estimated anatomical features. In addition, we demonstrate the effect these differences have on within-subject similarity and on the performance of basic prediction tasks based on the derived anatomical features. Our results show that although FreeSurfer may be provided with either a T2w or a FLAIR scan for the same purpose of improving pial surface estimation (relative to using the mandatory T1w scan alone), the two configurations have distinctly different effects. Our findings further indicate that although within-subject scan similarity and the performance of a range of models for the prediction of sex and age are significantly affected, they are not significantly improved by either of the enhanced configurations. These results demonstrate the large extent to which elementary and sparsely reported differences in preprocessing workflow configurations influence the derived brain features. They underline the importance of optimizing preprocessing procedures on the basis of experimental results before their distribution and subsequent standardization and harmonization efforts across public datasets. In addition, preprocessing configurations should be carefully reported and included in any downstream analytical workflows, to account for any variation originating from such differences. Finally, other representations of the raw data should be explored and studied to provide a more robust framework for data aggregation and sharing.
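One way to quantify the within-subject similarity effect described in this abstract is to compare correlations of a subject's feature vector across two pipeline configurations against correlations between different subjects. The following toy sketch uses simulated "anatomical features" (all names and data are illustrative, not FreeSurfer output):

```python
import numpy as np

rng = np.random.default_rng(1)

def within_between_similarity(feats_a, feats_b):
    """Mean within-subject vs between-subject correlation of feature
    vectors derived under two pipeline configurations."""
    n = len(feats_a)
    within = [np.corrcoef(feats_a[i], feats_b[i])[0, 1] for i in range(n)]
    between = [np.corrcoef(feats_a[i], feats_b[j])[0, 1]
               for i in range(n) for j in range(n) if i != j]
    return float(np.mean(within)), float(np.mean(between))

# toy data: 10 "subjects" x 60 "regional features"; config B adds
# more measurement noise on top of each subject's true anatomy
truth = rng.normal(size=(10, 60))
feats_cfg_a = truth + 0.1 * rng.normal(size=truth.shape)
feats_cfg_b = truth + 0.4 * rng.normal(size=truth.shape)  # noisier config
w, b = within_between_similarity(feats_cfg_a, feats_cfg_b)
print(round(w, 2), round(b, 2))  # within-subject similarity should exceed between
```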
    ... In one report, it is estimated that it generally takes two or three researchers working on a project based on data from public datasets roughly 6 to 9 months to download, process, and prepare the data for analysis (Horien et al., 2021). This period of time is also fertile ground for the introduction of more degrees of freedom to the analytical workflow (as well as human error), severely undermining the reproducibility of the results (Botvinik-Nezer et al., 2020;Maier-Hein et al., 2017). To better facilitate large-scale neuroimaging research, dedicated data and analysis management solutions are required. ...
    Preprint
    The goal of this article is to present "The Labbing Project"; a novel neuroimaging data aggregation and preprocessing web application built with Django and VueJS. Neuroimaging data can be complex and time-consuming to work with, especially for researchers with limited programming experience. This web application aims to streamline the process of aggregating and preprocessing neuroimaging data by providing an intuitive, user-friendly interface that allows researchers to upload, organize, and preprocess their data with minimal programming requirements. The application utilizes Django, a popular Python web framework, to create a robust and scalable platform that can handle large volumes of data and accommodate the needs of a diverse user base. This robust infrastructure is complemented by a user-friendly VueJS frontend application, supporting commonplace data querying and extraction tasks. By automating common data processing tasks, this web application aims to save researchers time and resources, enabling them to focus on their research rather than data management.
    ... Perhaps reviews motivating primary research are systematic, perhaps not; perhaps they follow a particular system, or another; we simply do not know. Just as data-analytic decisions for the same hypothesis test vary widely and cause heterogeneity in conclusions about data (Botvinik-Nezer et al., 2020; Schweinsberg et al., 2021), variation in how the literature is searched and coded will yield heterogeneous conclusions about the same body of research. Without documenting this research step, we are unable to evaluate the comprehensiveness and quality of a literature review, and readers are unable to learn which review practices are superior to others. ...
    ... A systematic literature review we conducted showed that research practices related to research questions and problems (Step 1 in Figure 1) ... Open science practices make transparency the default choice (Klein et al., 2018) for the steps in the scientific research process, from sharing materials that inform study design choices (Landy et al., 2020) and raw data (Simonsohn, 2013) to data analyses (Botvinik-Nezer et al., 2020; Schweinsberg et al., 2021). Open science practices could also be implemented when reviewing the literature in primary research: scholars could share the search terms they used and the search results, along with coding criteria and the results of this coding process, on an online repository such as OSF. ...
    Article
    Full-text available
    Four validity types evaluate the approximate truth of inferences communicated by primary research. However, current validity frameworks ignore the truthfulness of empirical inferences that are central to research problem statements. Problem statements contrast a review of past research with other knowledge that extends, contradicts, or calls into question specific features of past research. Authors communicate empirical inferences, or quantitative judgments, about the frequency (e.g., “few,” “most”) and variability (e.g., “on the one hand, on the other hand”) in their reviews of existing theories, measures, samples, or results. We code a random sample of primary research articles and show that 83% of the quantitative judgments in our sample are vague and of non-transparent origin, making it difficult to assess their validity. We review the validity threats of current practices. We propose that documenting the literature search and how it was coded, along with quantification, facilitates more precise judgments and makes their origin transparent. This practice enables research questions that are more closely tied to the existing body of knowledge and allows for more informed evaluations of the contribution of primary research articles, their design choices, and how they advance knowledge. We discuss potential limitations of our proposed framework.
    ... dbICC: distance-based intra-class correlation. This work also presents a general framework for evaluating the reliability and validity of image processing pipelines. In recent years, there have been many concerns about the reproducibility of neuroimaging studies (Botvinik-Nezer et al., 2020; Elliott et al., 2020; Marek et al., 2022; Masouleh et al., 2019; Noble et al., 2019). Some researchers have proposed that standardization of pipelines could improve reproducibility across studies, because using different pipelines would lead to different results and conclusions. ...
    Article
    Full-text available
    Magnetic resonance imaging (MRI) has been one of the primary instruments to measure the properties of the human brain non-invasively in vivo. MRI data generally needs to go through a series of processing steps (i.e., a pipeline) before statistical analysis. Currently, the processing pipelines for multi-modal MRI data are still rare, in contrast to single-modal pipelines. Furthermore, the reliability and validity of the output of the pipelines are critical for the MRI studies. However, the reliability and validity measures are not available or adequate for almost all pipelines. Here, we present PhiPipe, a multi-modal MRI processing pipeline. PhiPipe could process T1-weighted, resting-state BOLD, and diffusion-weighted MRI data and generate commonly used brain features in neuroimaging. We evaluated the test-retest reliability of PhiPipe's brain features by computing intra-class correlations (ICC) in four public datasets with repeated scans. We further evaluated the predictive validity by computing the correlation of brain features with chronological age in three public adult lifespan datasets. The multivariate reliability and predictive validity of the PhiPipe results were also evaluated. The results of PhiPipe were consistent with previous studies, showing comparable or better reliability and validity when compared with two popular single-modality pipelines, namely DPARSF and PANDA. The publicly available PhiPipe provides a simple-to-use solution to multi-modal MRI data processing. The accompanied reliability and validity assessments could help researchers make informed choices in experimental design and statistical analysis. Furthermore, this study provides a framework for evaluating the reliability and validity of image processing pipelines.
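The test-retest reliability assessment described here rests on the intra-class correlation. As a sketch, ICC(3,1) (two-way mixed, consistency) can be computed directly from its ANOVA definition; the function and toy data below are illustrative, not PhiPipe code.

```python
import numpy as np

def icc_3_1(data):
    """ICC(3,1), two-way mixed, consistency: (MS_R - MS_E) / (MS_R + (k-1) MS_E).
    `data` is an (n_subjects, k_sessions) matrix of one brain feature."""
    data = np.asarray(data, float)
    n, k = data.shape
    grand = data.mean()
    # between-subject (row) mean square
    ms_r = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    # residual mean square after removing subject and session effects
    sse = ((data - data.mean(axis=1, keepdims=True)
                 - data.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_e = sse / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

# toy test-retest data: 8 subjects scanned twice, with measurement noise
rng = np.random.default_rng(2)
truth = rng.normal(size=8)
scans = np.column_stack([truth + 0.2 * rng.normal(size=8),
                         truth + 0.2 * rng.normal(size=8)])
print(round(icc_3_1(scans), 2))  # an ICC near 1 indicates good reliability
```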
    ... While predefined analysis pipelines help to reduce error-prone manual interventions to a minimum, they also have the advantage of reducing the number of analytical degrees of freedom available to a user (Carp 2012). This constraint on flexibility is important, as it helps to control the variability in data processing and analysis (Botvinik-Nezer et al. 2020). fMRIPrep showed a clear need for such analysis-agnostic approaches and was therefore chosen to provide much of the groundwork for fMRIflows. ...
    Article
    Full-text available
    How functional magnetic resonance imaging (fMRI) data are analyzed depends on the researcher and the toolbox used. It is not uncommon that the processing pipeline is rewritten for each new dataset. Consequently, code transparency, quality control and objective analysis pipelines are important for improving reproducibility in neuroimaging studies. Toolboxes such as Nipype and fMRIPrep have documented the need for and interest in automated pre-processing analysis pipelines. Recent developments in data-driven models, combined with high-resolution neuroimaging datasets, have strengthened the need not only for a standardized preprocessing workflow, but also for a reliable and comparable statistical pipeline. Here, we introduce fMRIflows: a consortium of fully automatic neuroimaging pipelines for fMRI analysis, which performs standard preprocessing, as well as 1st- and 2nd-level univariate and multivariate analyses. In addition to the standardized pre-processing pipelines, fMRIflows provides flexible temporal and spatial filtering to account for datasets with increasingly high temporal resolution and to help appropriately prepare data for advanced machine learning analyses, improving signal decoding accuracy and reliability. This paper first describes fMRIflows' structure and functionality, then explains its infrastructure and access, and lastly validates the toolbox by comparing it to other neuroimaging processing pipelines such as fMRIPrep, FSL and SPM. This validation was performed on three datasets with varying temporal sampling and acquisition parameters to prove its flexibility and robustness. fMRIflows is a fully automatic fMRI processing pipeline which uniquely offers univariate and multivariate single-subject and group analyses as well as pre-processing.
    ... Reproducibility in this context refers to the ability to corroborate results of a previous study by conducting new experiments with the same experimental design but collecting new and independent data sets. Reproducibility checks are common in fields like physics (CERN, 2018), but rarer in biological disciplines such as neuroscience and pharmacotherapy, which are increasingly facing a "reproducibility crisis" (Bespalov et al., 2016;Bespalov and Steckler, 2018;Botvinik-Nezer et al., 2020). Even though a high risk of failure to repeat experiments between laboratories is an inherent part of developing innovative therapies, some risks can be greatly reduced and avoided by adherence to evidence-based research practices using clearly identified measures to improve research rigor (Vollert et al., 2020;Bespalov et al., 2021;Emmerich et al., 2021). ...