Gael Varoquaux

National Institute for Research in Computer Science and Control | INRIA · Parietal

PhD

About

268
Publications
126,335
Reads
63,228
Citations
Additional affiliations
September 2005 - September 2008
French National Centre for Scientific Research
Position
  • PhD Student

Publications (268)
Preprint
Full-text available
The field of automatic biomedical image analysis crucially depends on robust and meaningful performance metrics for algorithm validation. Current metric usage, however, is often ill-informed and does not reflect the underlying domain interest. Here, we present a comprehensive framework that guides researchers towards choosing performance metrics in...
Article
Full-text available
Previous literature has focused on predicting a diagnostic label from structural brain imaging. Since subtle changes in the brain precede cognitive decline in healthy and pathological aging, our study predicts future decline as a continuous trajectory instead. Here, we tested whether baseline multimodal neuroimaging data improve the prediction of f...
Article
Full-text available
Associating brain systems with mental processes requires statistical analysis of brain activity across many cognitive processes. These analyses typically face a difficult compromise between scope—from domain-specific to system-level analysis—and accuracy. Using all the functional Magnetic Resonance Imaging (fMRI) statistical maps of the largest dat...
Article
Full-text available
Background As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values. These large databases are well suited to train machine learning models, e.g., for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative—rather than generativ...
Article
Full-text available
MRI has been extensively used to identify anatomical and functional differences in Autism Spectrum Disorder (ASD). Yet, many of these findings have proven difficult to replicate because studies rely on small cohorts and are built on many complex, undisclosed, analytic choices. We conducted an international challenge to predict ASD diagnosis from MR...
Article
Full-text available
Research in computer analysis of medical images bears many promises to improve patients’ health. However, a number of systematic challenges are slowing down the progress of the field, from limitations of the data, such as biases, to research incentives, such as optimizing for publication. In this paper we review roadblocks to developing and assessi...
Preprint
Full-text available
State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words. We present a simp...
Article
Full-text available
Background: With increasing data sizes and more easily available computational methods, neurosciences rely more and more on predictive modeling with machine learning, e.g., to extract disease biomarkers. Yet, a successful prediction may capture a confounding effect correlated with the outcome instead of brain features specific to the outcome of inte...
Preprint
BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use dis...
Preprint
BACKGROUND As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use disc...
Article
Full-text available
The analysis of brain-imaging data requires complex processing pipelines to support findings on brain function or pathologies. Recent work has shown that variability in analytical decisions, small amounts of noise, or computational environments can lead to substantial differences in the results, endangering the trust in conclusions. We explored the...
Article
Full-text available
Background: Biological aging is revealed by physical measures, e.g., DNA probes or brain scans. In contrast, individual differences in mental function are explained by psychological constructs, e.g., intelligence or neuroticism. These constructs are typically assessed by tailored neuropsychological tests that build on expert judgement and require ca...
Preprint
Full-text available
High-quality data accumulation is now becoming ubiquitous in the health domain. There is increasing opportunity to exploit rich data from normal subjects to improve supervised estimators in specific diseases with notorious data scarcity. We demonstrate that low-dimensional embedding spaces can be derived from the UK Biobank population dataset and u...
Article
Full-text available
Purpose: The Coronavirus disease 2019 (COVID-19) has led to an unparalleled influx of patients. Prognostic scores could help optimize healthcare delivery, but most of them have not been comprehensively validated. We aim to externally validate existing prognostic scores for COVID-19. Methods: We used “COVID-19 Evidence Alerts” (McMaster University) to...
Article
Full-text available
Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can under...
Article
Full-text available
As the global health crisis unfolded, many academic conferences moved online in 2020. This move has been hailed as a positive step towards inclusivity in its attenuation of economic, physical, and legal barriers and effectively enabled many individuals from groups that have traditionally been underrepresented to join and participate. A number of st...
Preprint
Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can under...
Preprint
How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful lear...
Preprint
Full-text available
While a randomized controlled trial (RCT) readily measures the average treatment effect (ATE), this measure may need to be shifted to generalize to a different population. Standard estimators of the target population treatment effect are based on the distributional shift in covariates, using inverse propensity sampling weighting (IPSW) or modeling...
Article
Full-text available
Cognitive brain imaging is accumulating datasets about the neural substrate of many different mental processes. Yet, most studies are based on few subjects and have low statistical power. Analyzing data across studies could bring more statistical power; yet the current brain-imaging analytic framework cannot be used at scale as it requires casting...
Preprint
Full-text available
Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we reviewed several problems related to choosing datasets, methods, evaluation metrics, and public...
Article
Full-text available
In brain imaging, decoding is widely used to infer relationships between brain and cognition, or to craft brain-imaging biomarkers of pathologies. Yet, standard decoding procedures do not come with statistical guarantees, and thus do not give confidence bounds to interpret the pattern maps that they produce. Indeed, in whole-brain decoding settings...
Article
Full-text available
Functional magnetic resonance imaging (fMRI) has opened the possibility to investigate how brain activity is modulated by behavior. Most studies so far are bound to one single task, in which functional responses to a handful of contrasts are analyzed and reported as a group average brain map. Contrariwise, recent data‐collection efforts have starte...
Preprint
Full-text available
Biomedical entity linking aims to map biomedical mentions, such as diseases and drugs, to standard entities in a given knowledge base. The specific challenge in this context is that the same biomedical entity can have a wide range of names, including synonyms, morphological variations, and names with different word orderings. Recently, BERT-based m...
Preprint
Full-text available
With increasing data availability, treatment causal effects can be evaluated across different datasets, both randomized trials and observational studies. Randomized trials isolate the effect of the treatment from that of unwanted (confounding) co-occurring effects. But they may be applied to limited populations, and thus lack external validity. On th...
Article
Full-text available
We present an extension of the Individual Brain Charting dataset, a high spatial-resolution, multi-task, functional Magnetic Resonance Imaging dataset, intended to support the investigation of the functional principles governing cognition in the human brain. The concomitant data acquisition from the same 12 participants, in the same environment, al...
Preprint
Full-text available
Background: Biological aging is revealed by physical measures, e.g., DNA probes or brain scans. Instead, individual differences in mental function are explained by psychological constructs, e.g., intelligence or neuroticism. These constructs are typically assessed by tailored neuropsychological tests that build on expert judgement and require car...
Article
Full-text available
We leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about Coronavirus Disease 2019 (COVID-19). To do this, we formed an international consortium (4CE) of 96 hospitals across five countries (www.covidclinical.net). Contributors utilized the Informatics for Integrating...
Article
Full-text available
We simultaneously revisited the ADI-R and ADOS with a comprehensive data-analytics strategy. Here, the combination of pattern analysis algorithms and an extensive data resource (n=266 patients aged 7 to 49 years) allowed identifying coherent clinical constellations in and across ADI-R and ADOS assessments widespread in clinical practice. Our clust...
Preprint
Full-text available
The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent...
Article
Full-text available
Population imaging markedly increased the size of functional-imaging datasets, shedding new light on the neural basis of inter-individual differences. Analyzing these large data entails new scalability challenges, computational and statistical. For this reason, brain images are typically summarized in a few signals, for instance reducing voxel-leve...
Preprint
Full-text available
Objective: To assess the clinical effectiveness of oral hydroxychloroquine (HCQ) with or without azithromycin (AZI) in preventing death or leading to hospital discharge. Design: Retrospective cohort study. Setting: An analysis of data from electronic medical records and administrative claim data from the French Assistance Publique - Hopitaux de Paris...
Preprint
Full-text available
Cognitive decline occurs in healthy and pathological aging, and both may be preceded by subtle changes in the brain - offering a basis for cognitive predictions. Previous work has largely focused on predicting a diagnostic label from structural brain imaging. Our study broadens the scope of applications to cognitive decline in healthy aging by pred...
Article
Full-text available
In the 20th century, evidence-based medicine has put clinical practice on much more solid ground. For instance, randomized clinical trials have provided strong evidence on useful interventions, thanks to double-blind treatment application and tests for treatment associations with clinical outcomes. However, precision medicine in the 21st century st...
Article
Full-text available
Electrophysiological methods, that is M/EEG, provide unique views into brain health. Yet, when building predictive models from brain data, it is often unclear how electrophysiology should be combined with other neuroimaging methods. Information can be redundant, useful common representations of multimodal data may not be obvious and multimodal data...
Article
Statistical models usually require vector representations of categorical variables, using for instance one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture morphological information in their representation....
Article
Full-text available
Predicting biomedical outcomes from Magnetoencephalography and Electroencephalography (M/EEG) is central to applications like decoding, brain-computer-interfaces (BCI) or biomarker development and is facilitated by supervised machine learning. Yet most of the literature is concerned with classification of outcomes defined at the event-level. Here,...
Article
Full-text available
Electrophysiological methods, i.e., M/EEG, provide unique views into brain health. Yet, when building predictive models from brain data, it is often unclear how electrophysiology should be combined with other neuroimaging methods. Information can be redundant, useful common representations of multimodal data may not be obvious and multimodal data co...
Preprint
Full-text available
We leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about Coronavirus Disease 2019 (COVID-19). To do this, we formed an international consortium (4CE) of 96 hospitals across 5 countries (www.covidclinical.net). Contributors utilized the Informatics for Integrating Bi...
Preprint
Full-text available
Population imaging markedly increased the size of functional-imaging datasets, shedding new light on the neural basis of inter-individual differences. Analyzing these large data entails new scalability challenges, computational and statistical. For this reason, brain images are typically summarized in a few signals, for instance reducing voxel-leve...
Article
Full-text available
Reaching a global view of brain organization requires assembling evidence on widely different mental processes and mechanisms. The variety of human neuroscience concepts and terminology poses a fundamental challenge to relating brain imaging results across the scientific literature. Existing meta-analysis methods perform statistical tests on sets o...
Preprint
Full-text available
Reaching a global view of brain organization requires assembling evidence on widely different mental processes and mechanisms. The variety of human neuroscience concepts and terminology poses a fundamental challenge to relating brain imaging results across the scientific literature. Existing meta-analysis methods perform statistical tests on sets o...
Article
Importance: Great interest exists in identifying methods to predict neuropsychiatric disease states and treatment outcomes from high-dimensional data, including neuroimaging and genomics data. The goal of this review is to highlight several potential problems that can arise in studies that aim to establish prediction. Observations: A number of neuro...
Preprint
Full-text available
Electrophysiological methods, i.e., M/EEG, provide unique views into brain health. Yet, when building predictive models from brain data, it is often unclear how electrophysiology should be combined with other neuroimaging methods. Information can be redundant, useful common representations of multimodal data may not be obvious and multimodal data co...
Preprint
Full-text available
Predicting biomedical outcomes from Magnetoencephalography and Electroencephalography (M/EEG) is central to applications like decoding, brain-computer-interfaces (BCI) or biomarker development and is facilitated by supervised machine learning. Yet most of the literature is concerned with within-subject classification. Here, we focus on predicting c...
Preprint
Full-text available
We simultaneously revisited the ADI-R and ADOS with a comprehensive data-analytics strategy. Here, the combination of pattern analysis algorithms and an extensive data resource (n=266 patients aged 7 to 49 years) allowed identifying coherent clinical constellations in and across ADI-R and ADOS assessments widespread in clinical practice. The colle...
Article
Full-text available
The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data-sharing resources that have been develop...
Preprint
Magnetoencephalography and electroencephalography (M/EEG) can reveal neuronal dynamics non-invasively in real-time and are therefore appreciated methods in medicine and neuroscience. Recent advances in modeling brain-behavior relationships have highlighted the effectiveness of Riemannian geometry for summarizing the spatially correlated time-series...
Article
Functional connectomes reveal biomarkers of individual psychological or clinical traits. However, there is great variability in the analytic pipelines typically used to derive them from rest-fMRI cohorts. Here, we consider a specific type of study, using predictive models on the edge weights of functional connectomes, for which we highlight the b...
Article
Full-text available
Estimating covariances from functional Magnetic Resonance Imaging at rest (r-fMRI) can quantify interactions between brain regions. Also known as brain functional connectivity, it reflects inter-subject variations in behavior and cognition, and characterizes neuropathologies. Yet, with noisy and short time-series, as in r-fMRI, covariance estimatio...
Data
Distribution of terms in our database. (PDF)
Data
Maps for consensus between forward and reverse. Left: maps for the different inferences on the “place” concept. Right: the overlaid inferences for this concept. The consensus singles out the PPA for the “place” concept. (TIFF)
Data
Prediction scores for different methods. AUC (area under the curve) of the ROC. OD: ontology decoding, LOG: logistic, NB: Naive Bayes. Left: leave-subject-out cross-validation, Right: leave-study-out cross-validation. (TIFF)
Data
Prediction scores for different methods. AUC (area under the curve) of the ROC curve. OD: ontology decoding, LOG: logistic regression, NB: Naive Bayes, NS: NeuroSynth. The OD (ontology decoding) method performs very well (chance is at 0.5), including when predicting to new studies. Leave-subject-out cross-validation schemes tend to display a higher p...