Elias Chaibub Neto

Sage Bionetworks

PhD

About

Publications: 114
Reads: 26,964
Citations: 4,426
Additional affiliations
July 2011 - present
Sage Bionetworks
Position
  • Senior Researcher
Description
  • Statistician
Education
August 2004 - April 2010
University of Wisconsin–Madison
Field of study
  • PhD - Statistics


Publications (114)
Article
Full-text available
Healthcare researchers are increasingly utilizing smartphone sensor data as a scalable and cost-effective approach to studying individualized health-related behaviors in real-world settings. However, to develop reliable and robust digital behavioral signatures that may help in the early prediction of the individualized disease trajectory and future...
Article
Objective: Psoriatic disease remains underdiagnosed and undertreated. We developed and validated a suite of novel, smartphone sensor-based assessments that can be self-administered to measure cutaneous and musculoskeletal signs and symptoms of psoriatic disease. Methods: Participants with psoriasis, psoriatic arthritis, or healthy controls were rec...
Preprint
Full-text available
Non-parametric two-sample tests based on energy distance or maximum mean discrepancy are widely used statistical tests for comparing multivariate data from two populations. While these tests enjoy desirable statistical properties, their test statistics can be expensive to compute as they require the computation of 3 distinct Euclidean distance (or...
Preprint
Full-text available
Healthcare researchers are increasingly utilizing smartphone sensor data as a scalable and cost-effective approach to studying individualized health-related behaviors in real-world settings. However, to develop reliable and robust digital behavioral signatures that may help predict disease trajectory early and future prognosis, there is a critical...
Article
Full-text available
Background: The two-way partial AUC has been recently proposed as a way to directly quantify partial area under the ROC curve with simultaneous restrictions on the sensitivity and specificity ranges of diagnostic tests or classifiers. The metric, as originally implemented in the tpAUC R package, is estimated using a nonparametric estimator based on...
Preprint
Full-text available
In the field of statistical disclosure control, the tradeoff between data confidentiality and data utility is measured by comparing disclosure risk and information loss metrics. Distance based metrics such as the mean absolute error (MAE), mean squared error (MSE), mean variation (IL1), and its scaled alternative (IL1s) are popular information loss...
Article
Full-text available
We propose a counterfactual approach to train “causality-aware” predictive models that are able to leverage causal information in static anticausal machine learning tasks (i.e., prediction tasks where the outcome influences the inputs). In applications plagued by confounding, the approach can be used to generate predictions that are free from the i...
Article
Full-text available
Ideally, a patient’s response to medication can be monitored by measuring changes in performance of some activity. In observational studies, however, any detected association between treatment (“on-medication” vs “off-medication”) and the outcome (performance in the activity) might be due to confounders. In particular, causal inferences at the pers...
Preprint
Full-text available
Background: Psoriasis and psoriatic arthritis are common immune-mediated inflammatory conditions that primarily affect the skin, joints and entheses and can lead to significant disability and worsening quality of life. Although early recognition and treatment can prevent the development of permanent damage, psoriatic disease remains underdiagnosed a...
Article
Full-text available
Remote health assessments that gather real-world data (RWD) outside clinic settings require a clear understanding of appropriate methods for data collection, quality assessment, analysis and interpretation. Here we examine the performance and limitations of smartphones in collecting RWD in the remote mPower observational study of Parkinson’s diseas...
Article
Full-text available
Consumer wearables and sensors are a rich source of data about patients’ daily disease and symptom burden, particularly in the case of movement disorders like Parkinson’s disease (PD). However, interpreting these complex data into so-called digital biomarkers requires complicated analytical approaches, and validating these biomarkers requires suffi...
Preprint
Linear residualization is a common practice for confounding adjustment in machine learning (ML) applications. Recently, causality-aware predictive modeling has been proposed as an alternative causality-inspired approach for adjusting for confounders. The basic idea is to simulate counterfactual data that is free from the spurious associations gener...
Preprint
In health related machine learning applications, the training data often corresponds to a non-representative sample from the target populations where the learners will be deployed. In anticausal prediction tasks, selection biases often make the associations between confounders and the outcome variable unstable across different target environments....
Article
Full-text available
There are many approaches to maintaining wellness, ranging from taking a simple vacation to attending highly structured wellness retreats, which typically regulate the attendee's personal time and activities. In a healthy English-speaking cohort of 112 women and men (aged 30–80 years), this study examined the effects of participating in either a 6-day...
Article
Full-text available
While the past decade has seen meaningful improvements in clinical outcomes for multiple myeloma patients, a subset of patients does not benefit from current therapeutics for unclear reasons. Many gene expression-based models of risk have been developed, but each model uses a different combination of genes and often involves assaying many genes mak...
Preprint
Causal modeling has been recognized as a potential solution to many challenging problems in machine learning (ML). While counterfactual thinking has been leveraged in ML tasks that aim to predict the consequences of actions/interventions, it has not yet been applied to more traditional/static supervised learning tasks, such as the prediction of lab...
Article
Full-text available
IMPORTANCE: Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives. OBJECTIVE: To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased eval...
Article
Full-text available
Importance: Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives. Objective: To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased eva...
Article
Full-text available
Digital technologies such as smartphones are transforming the way scientists conduct biomedical research using real-world data. Several remotely-conducted studies have recruited thousands of participants over a span of a few months. Unfortunately, these studies are hampered by substantial participant attrition, calling into question the representat...
Preprint
Full-text available
Mobile health, the collection of data using wearables and sensors, is a rapidly growing field in health research with many applications. Deriving validated measures of disease and severity that can be used clinically or as outcome measures in clinical trials, referred to as digital biomarkers, has proven difficult. In part due to the complicated an...
Preprint
While counterfactual thinking has been used in ML tasks that aim to predict the consequences of different actions, policies, and interventions, it has not yet been leveraged in more traditional/static supervised learning tasks, such as the prediction of discrete labels in classification tasks or continuous responses in regression problems. Here, we...
Article
Full-text available
Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repe...
Preprint
Machine learning practice is often impacted by confounders. Confounding can be particularly severe in remote digital health studies where the participants self-select to enter the study. While many different confounding adjustment approaches have been proposed in the literature, most of these methods rely on modeling assumptions, and it is unclear...
Preprint
Full-text available
Digital technologies such as smartphones are transforming the way scientists conduct biomedical research using real-world data. Several remotely-conducted studies have recruited thousands of participants over a span of a few months. Unfortunately, these studies are hampered by substantial participant attrition, calling into question the representat...
Preprint
Full-text available
While the past decade has seen meaningful improvements in clinical outcomes for multiple myeloma patients, a subset of patients does not benefit from current therapeutics for unclear reasons. Many gene expression-based models of risk have been developed, but each model uses a different combination of genes and often involves assaying many genes making...
Conference Paper
Machine learning applications are often plagued with confounders that can impact the generalizability of the learners. In clinical settings, demographic characteristics often play the role of confounders. Confounding is especially problematic in remote digital health studies where the participants self-select to enter the study, thereby making it d...
Article
Full-text available
The effectiveness of most cancer targeted therapies is short-lived. Tumors often develop resistance that might be overcome with drug combinations. However, the number of possible combinations is vast, necessitating data-driven approaches to find optimal patient-specific treatments. Here we report AstraZeneca's large drug combination dataset, consis...
Preprint
Full-text available
Clinical machine learning applications are often plagued with confounders that can impact the generalizability and predictive performance of the learners. Confounding is especially problematic in remote digital health studies where the participants self-select to enter the study, thereby making it challenging to balance the demographic characterist...
Conference Paper
Current clinimetrics assessment of Parkinson's disease (PD) is insensitive, episodic, subjective, and provider-centered. Ubiquitous technologies such as smartphones promise to fundamentally change PD assessments. To enable frequent remote assessment of PD tremor severity, here we present a 39-month smartphone research study in a real-world setting...
Preprint
Clinical machine learning applications are often plagued with confounders that are clinically irrelevant, but can still artificially boost the predictive performance of the algorithms. Confounding is especially problematic in mobile health studies run "in the wild", where it is challenging to balance the demographic characteristics of participants...
Article
The roles played by learning and memorization represent an important topic in deep learning research. Recent work on this subject has shown that the optimization behavior of DNNs trained on shuffled labels is qualitatively different from DNNs trained with real labels. Here, we propose a novel permutation approach that can differentiate memorization...
Preprint
Full-text available
The effectiveness of most cancer targeted therapies is short-lived since tumors evolve and develop resistance. Combinations of drugs offer the potential to overcome resistance; however, the number of possible combinations is vast, necessitating data-driven approaches to find optimal treatments tailored to a patient’s tumor. AstraZeneca carried out 11...
Article
Full-text available
Recently, Saeb et al. (2017) showed that, in diagnostic machine learning applications, having data of each subject randomly assigned to both training and test sets (record-wise data split) can lead to massive underestimation of the cross-validation prediction error, due to the presence of "subject identity confounding" caused by the classifier's abi...
Article
Full-text available
Purpose: Docetaxel has a demonstrated survival benefit for patients with metastatic castration-resistant prostate cancer (mCRPC); however, 10% to 20% of patients discontinue docetaxel prematurely because of toxicity-induced adverse events, and the management of risk factors for toxicity remains a challenge. Patients and methods: The comparator a...
Conference Paper
Mental health conditions are now amongst the top five burdensome diseases in the US. Disparities in access to services and health outcomes vary due to several factors including socioeconomic status, shortage of mental health professionals, stigma and the linguistic gap between providers and non-English-speaking minority populations. This study explo...
Article
Full-text available
In this work we provide a couple of contributions to the analysis of longitudinal data collected by smartphones in mobile health applications. First, we propose a novel statistical approach to disentangle personalized treatment and "time-of-the-day" effects in observational studies. Under the assumption of no unmeasured confounders, we show how to...
Article
Full-text available
Mindboggle (http://mindboggle.info) is an open source brain morphometry platform that takes in preprocessed T1-weighted MRI data and outputs volume, surface, and tabular data containing label, feature, and shape information for further analysis. In this article, we document the software and demonstrate its use in studies of shape variation in healt...
Data
Mindboggle flowchart. Nipype automatically generates a flow diagram of the processing steps when running Mindboggle. (PDF)
Data
Mindboggle output directory tree. (PDF)
Data
Tables of shape differences between scans and between hemispheres. (PDF)
Data
Variance components analysis of the shapes of 62 cortical regions in 101 human brains. (PDF)
Article
Background: Improvements to prognostic models in metastatic castration-resistant prostate cancer have the potential to augment clinical trial design and guide treatment strategies. In partnership with Project Data Sphere, a not-for-profit initiative allowing data from cancer clinical trials to be shared broadly with researchers, we designed an ope...
Article
Full-text available
Human genome-wide association studies (GWAS) have shown that genetic variation at >130 gene loci is associated with type 2 diabetes (T2D). We asked if the expression of the candidate T2D-associated genes within these loci is regulated by a common locus in pancreatic islets. Using an obese F2 mouse intercross segregating for T2D, we show that the ex...
Data
Regional association plots for NFAT and fasting insulin in human GWAS. Association (-log10 P-value) to fasting insulin levels for SNPs near NFATC1 (A) and NFATC2 (B). Plots were generated using LocusZoom [13] and data provided in [14]. Color scale shows correlation (r2) between the SNP with the strongest association within the plotted region (lead...
Data
T2D GWAS islet eQTLs and plasma insulin QTL that are conditional on Nfatc2. Heat maps show the linkage for plasma insulin and the islet eQTLs for Nfatc2 and 54 transcripts for genes identified in human GWAS that are associated with Type 2 Diabetes (T2D). Linkage data was obtained from an F2 intercross between diabetes resistant (B6) and diabetes-su...
Data
Genotype dependence of T2D GWAS trans-eQTLs on Chr 2. Expression of GWAS gene candidates in islets of 491 F2 mice. For each gene, mice are grouped according to genotype at the peak locus of the respective eQTL; homozygous B6 (B6:B6), heterozygous (B6:BTBR), or homozygous BTBR (BTBR:BTBR). The expression of 26 GWAS gene candidates increased (A) in r...
Data
Genotype dependence of expression of Nfatc2 in pancreatic islets. Expression of Nfatc2 in pancreatic islets of 491 F2 mice. Mice are grouped according to their genotype at ~168.4 Mb on Chr 2 (rs3024096), the marker position closest to the maximum LOD (~70) of the cis-eQTL for Nfatc2. At this position, mice were homozygous B6 (B6:B6, N = 127), heter...
Data
Sequence comparison of mouse and human Nfatc1 and Nfatc2. Amino acid sequences of mouse and human proteins for equivalent isoforms of Nfatc2 (A) and Nfatc1 (B) were aligned using Clustal Omega. For Nfatc2, we used isoforms A (NP_035029.2) and C (NP_775114.1) for mouse and human, respectively. For Nfatc1, we used isoforms 1 (NP_058071.2) and I (NP_...
Data
Expression of the NFAT gene family in mouse islets transduced with adenoviruses. Normalized RNA-sequencing values for the NFAT gene family in mouse islets 48 hr after transduction with adenoviruses containing GFP, ca-Nfatc1 or ca-Nfatc2 (A). Average expression values (± S.E.M., N = 5) are shown for each gene/virus combination. Western blot analysis...
Data
Comparison of GWAS genes that were regulated by NFAT in mouse and human islets. Heat maps illustrate the change in the expression of T2D-associated GWAS gene candidates in mouse (left) and human (right) islets replotted from Fig 6 as the average Z-score for each transcript; N = 5 for mouse and N = 3 for human. (TIF)
Data
The overexpression of ca-Nfatc2 promotes cell cycle progression and not DNA damage repair pathways in mouse islets. Immunocytochemistry of mouse islets for (A) Ki67, (B) pHH3 (S10) and (C) 53BP1 following Ad-LacZ (control) and Ad-ca-Nfatc2 transduction (72 hr). To identify β-cells, islets were stained for insulin. All islets were exposed to BrdU (1...
Data
Additional cell cycle regulatory genes that were differentially regulated by ca-Nfatc1 in mouse and human islets. The regulation of expression for cell cycle genes is illustrated in mouse (A) and human (B) islets following overexpression of either ca-Nfatc1 or ca-Nfatc2. The data is plotted as the log2 fold-change in expression relative to that mea...
Data
Gene candidates linked to T2D risk loci from human GWAS, and their mouse homologues. Separate tabs list: Tab 1, ~300 entries in the GWAS catalog for genomic loci associated with the disease/trait “Type 2 diabetes”; Tab 2, distinct genomic loci and their associated P-values; and Tab 3, candidate human genes reported for the loci, with accompanying m...
Data
Summary scores for islet transcription factor cis-eQTLs on chromosome 2 for conditional dependence on T2D GWAS gene candidates. Table summarizing conditional dependence of T2D GWAS gene candidates on islet Chr 2 cis-eQTLs for genes annotated as playing a role in "transcription" or "DNA binding" (https://david.ncifcrf.gov/). For each cis-eQTL, the s...
Data
Isoform-specific expression of the NFAT gene family in mouse islets. Islets from B6 mice were used for deep RNA-sequencing to determine isoform-specific expression of all genes. The table shows the expression level and relative proportion of each isoform for the NFAT gene family. Expression values for all genes are available at GEO submission GSE736...
Data
Islet eQTLs for T2D GWAS candidates in mouse islets. Excel sheet lists the 205 eQTLs for GWAS candidate genes that were identified genome-wide in islets from ~500 B6:BTBR-F2 obese mice. Genomic positions for the genes and their eQTLs are shown. Cis is defined as an eQTL that occurred within 2.5 cM (~5 Mbp) of the genomic position of the correspond...
Data
Donor information for human islet preparations. For several islet preparations, multiple studies were conducted, which are listed in the final column labeled “Experiments”. Values that are missing are not known. (PDF)
Data
Quantitative real time PCR primers used for gene expression measurements in human islets. (XLSX)
Data
Normalized expression values for all genes from RNA-sequencing of mouse islets following overexpression of GFP, ca-Nfatc1 or ca-Nfatc2. Excel spreadsheet contains normalized expression values for all genes (Tab 1) identified in mouse islets 48 hr after overexpression of GFP, ca-Nfatc1 or ca-Nfatc2 (N = 5 ea.). Posterior probabilities (PP) for diffe...
Preprint
Full-text available
Mindboggle (http://mindboggle.info) is an open source brain morphometry platform that takes in preprocessed T1-weighted MRI data and outputs volume, surface, and tabular data containing label, feature, and shape information for further analysis. In this article, we document the software and demonstrate its use in studies of shape variation in hea...
Chapter
A major challenge in biomedical research is to identify causal relationships among genotypes, phenotypes, and clinical outcomes from high-dimensional measurements. Causal networks have been widely used in systems genetics for modeling gene regulatory systems and for identifying causes and risk factors of diseases. In this chapter, we describe funda...
Article
Full-text available
Nature Communications 7: Article number: 12460 10.1038/ncomms12460 (2016); Published: 23 August 2016; Updated: 10 October 2016. The HTML version of this Article incorrectly duplicated the authors S.
Article
Full-text available
Rheumatoid arthritis (RA) affects millions world-wide. While anti-TNF treatment is widely used to reduce disease progression, treatment fails in ∼one-third of patients. No biomarker currently exists that identifies non-responders before treatment. A rigorous community-based assessment of the utility of SNP data for predicting anti-TNF treatment eff...
Data
Supplementary Figures 1-6, Supplementary Tables 1-4, Supplementary Note 1 and Supplementary References
Article
Rheumatoid arthritis (RA) affects millions world-wide. While anti-TNF treatment is widely used to reduce disease progression, treatment fails in ∼one-third of patients. No biomarker currently exists that identifies non-responders before treatment. A rigorous community-based assessment of the utility of SNP data for predicting anti-TNF treatment eff...
Article
Full-text available
Over-fitting is a dreaded foe in challenge-based competitions. Because participants rely on public leaderboards to evaluate and refine their models, there is always the danger they might over-fit to the holdout data supporting the leaderboard. The recently published Ladder algorithm aims to address this problem by preventing the participants from e...
Article
Full-text available
Clinical trials traditionally employ blinding as a design mechanism to reduce the influence of placebo effects. In practice, however, it can be difficult or impossible to blind study participants and unblinded trials are common in medical research. Here we show how instrumental variables can be used to quantify and disentangle treatment and placebo...
Article
Mobile health studies can leverage longitudinal sensor data from smartphones to guide the application of personalized medical interventions. These studies are particularly appealing due to their ability to attract a large number of participants. In this paper, we argue that the adoption of an instrumental variable approach for randomized trials wit...
Article
Full-text available
Identifying accurate biomarkers of cognitive decline is essential for advancing early diagnosis and prevention therapies in Alzheimer's disease. The Alzheimer's disease DREAM Challenge was designed as a computational crowdsourced project to benchmark the current state-of-the-art in predicting cognitive outcomes in Alzheimer's disease based on high...
Article
Full-text available
Current measures of health and disease are often insensitive, episodic, and subjective. Further, these measures generally are not designed to provide meaningful feedback to individuals. The impact of high-resolution activity data collected from mobile phones is only beginning to be explored. Here we present data from mPower, a clinical observationa...
Article
We propose hypothesis tests for detecting dopaminergic medication response in Parkinson disease patients, using longitudinal sensor data collected by smartphones. The processed data is composed of multiple features extracted from active tapping tasks performed by the participant on a daily basis, before and after medication, over several months. Ea...
Article
Full-text available
Biological functions are carried out by groups of interacting molecules, cells or tissues, known as communities. Membership in these communities may overlap when biological components are involved in multiple functions. However, traditional clustering methods detect non-overlapping communities. These detected communities may also be unstable and di...
Article
Full-text available
DREAM challenges are community competitions designed to advance computational methods and address fundamental questions in systems biology and translational medicine. Each challenge asks participants to develop and apply computational methods to either predict unobserved outcomes or to identify unknown model parameters given a set of training data....