Jason H Moore

University of Pennsylvania | UP · Perelman School of Medicine

Ph.D.

About

904
Publications
105,987
Reads
32,691
Citations
Introduction
Director, Institute for Quantitative Biomedical Sciences (iQBS); Director, Graduate Program in Quantitative Biomedical Sciences (QBS); Associate Director, Norris-Cotton Cancer Center (NCCC); Editor-in-Chief, BioData Mining
Additional affiliations
August 2004 - February 2015
Dartmouth College
Position
  • Director, Institute for Quantitative Biomedical Sciences

Publications (904)
Article
Full-text available
Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to...
Article
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments....
Article
Full-text available
Background: Alzheimer’s disease (AD) is a complex neurodegenerative disorder and the most common type of dementia. AD is characterized by a decline of cognitive function and brain atrophy, and is highly heritable with estimated heritability ranging from 60 to 80%. The most straightforward and widely used strategy to identify AD genetic basi...
Article
Full-text available
The opioid epidemic continues to contribute to loss of life through overdose and significant social and economic burdens. Many individuals who develop problematic opioid use (POU) do so after being exposed to prescribed opioid analgesics. Therefore, it is important to accurately identify and classify risk factors for POU. In this review, we discuss...
Article
ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI's graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previou...
Article
Full-text available
Integrating data across institutions can improve learning efficiency. To integrate data efficiently while protecting privacy, we propose A one-shot, summary-statistics-based, Distributed Algorithm for fitting Penalized (ADAP) regression models across multiple datasets. ADAP utilizes patient-level data from a lead site and incorporates the first-ord...
Preprint
When seeking a predictive model in biomedical data, one often has more than a single objective in mind, e.g., attaining both high accuracy and low complexity (to promote interpretability). We investigate herein whether multiple objectives can be dynamically tuned by our recently proposed coevolutionary algorithm, SAFE (Solution And Fitness Evolutio...
Preprint
We recently highlighted a fundamental problem recognized to confound algorithmic optimization, namely, conflating the objective with the objective function. Even when the former is well defined, the latter may not be obvious, e.g., in learning a strategy to navigate a maze to find a goal (objective), an effective objective function to...
Preprint
We have recently presented SAFE -- Solution And Fitness Evolution -- a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions. We showed that SAFE was successful at evolving solutions within a robotic maze domain. Herein we present an i...
Preprint
Modifying standard gradient boosting by replacing the embedded weak learner in favor of a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages -- between 2--5 -- to a symbolic regressor, statistically significant improvements can often be attain...
Article
Full-text available
Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates acr...
Article
Full-text available
The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, the cardiovascular researcher and healthcare professional are challenged to understand the opportunities as well as the lim...
Preprint
Accurate disease risk stratification can lead to more precise and personalized prevention and treatment of diseases. As an important component to disease risk, genetic risk factors can be utilized as an early and stable predictor for disease onset. Recently, the polygenic risk score (PRS) method has combined the effects from hundreds to millions of...
Preprint
In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (Deep learning for Toxicology), an interpretation framework for knowledge-guid...
Article
Full-text available
Background: Gene set enrichment analysis (GSEA) uses gene-level univariate associations to identify gene set-phenotype associations for hypothesis generation and interpretation. We propose that GSEA can be adapted to incorporate SNP and gene-level interactions. To this end, gene scores are derived by Relief-based feature importance algorithms that e...
Preprint
Objective: For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. Materials and...
Article
Full-text available
Quantitative Structure-Activity Relationship (QSAR) modeling is a common computational technique for predicting chemical toxicity, but a lack of new methodological innovations has impeded QSAR performance on many tasks. We show that contemporary QSAR modeling for predictive toxicology can be substantially improved by incorporating semantic graph da...
Article
Scientific innovation has long been heralded as the collaborative effort of many people, groups, and studies to drive forward research. However, the traditional peer review process relies on reviewers acting in a silo to critically judge research. As research becomes more cross-disciplinary, finding reviewers with appropriate expertise to provide feed...
Article
Full-text available
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations ba...
Article
Full-text available
Semantic GP is a promising branch of GP that introduces semantic awareness during genetic evolution to improve various aspects of GP. This paper presents a new Semantic GP approach based on Dynamic Target (SGP-DT) that divides the search problem into multiple GP runs. The evolution in each run is guided by a new (dynamic) target based on the residu...
Article
We present AddGBoost, a gradient boosting-style algorithm, wherein the decision tree is replaced by a succession of (possibly) stronger learners, which are optimized via a state-of-the-art hyperparameter optimizer. Through experiments over 90 regression datasets we show that AddGBoost emerges as the top performer for 33% (with 2 stages) up to 42% (...
Article
Multimodal neuroimaging data can provide complementary information that a single modality cannot about neurodegenerative diseases such as Alzheimer's disease (AD). Deep Generalized Canonical Correlation Analysis (DGCCA) is able to learn a shared feature representation from different views of data by applying non‐linear transformation using neural n...
Article
Brain imaging genetics is an emerging research topic in the study of Alzheimer’s disease (AD). The conventional approach, such as canonical correlation analysis (CCA), has been widely used to identify imaging genetic associations. A deep learning model has recently been proposed to better understand the roots of the complex association between imag...
Article
The advances in technologies for acquiring brain imaging and high-throughput genetic data allow the researcher to access a large amount of multi-modal data. Although the sparse canonical correlation analysis is a powerful bi-multivariate association analysis technique for feature selection, we are still facing major challenges in integrating multi-...
Article
Full-text available
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning me...
Article
Full-text available
Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. Results: This rel...
Article
Full-text available
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during early pandemic, we used electronic health records (EHR) from 338 participat...
Article
Full-text available
Aims: Enhanced risk stratification of patients with aortic stenosis (AS) is necessary to identify patients at high risk for adverse outcomes, and may allow for better management of patient subgroups at high risk of myocardial damage. The objective of this study was to identify plasma biomarkers and multimarker profiles associated with adverse outc...
Preprint
Full-text available
Motivation: Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for...
Preprint
Full-text available
Quantitative Structure-Activity Relationship (QSAR) modeling is the most common computational technique for predicting chemical toxicity, but a lack of methodological innovations in QSAR have led to underwhelming performance. We show that contemporary QSAR modeling for predictive toxicology can be substantially improved by incorporating semantic gr...
Article
Full-text available
Environmental disasters are anthropogenic catastrophic events that affect health. Famous disasters include the Seveso disaster and the Fukushima-Daiichi nuclear meltdown, which had disastrous health consequences. Traditional methods for studying environmental disasters are costly and time-intensive. We propose the use of electronic health records (...
Preprint
Full-text available
Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14...
Article
Full-text available
Machine Learning (ML) approaches are increasingly being used in biomedical applications. Important challenges of ML include choosing the right algorithm and tuning the parameters for optimal performance. Automated ML (AutoML) methods, such as Tree-based Pipeline Optimization Tool (TPOT), have been developed to take some of the guesswork out of ML t...
Preprint
We ascertain and compare the performances of AutoML tools on large, highly imbalanced healthcare datasets. We generated a large dataset using historical administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six d...
Article
Background: The electronic health record (EHR) has become increasingly ubiquitous. At the same time, health professionals have been turning to this resource for access to data that is needed for the delivery of health care and for clinical research. There is little doubt that the EHR has made both of these functions easier than earlier days when w...
Preprint
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification...
Preprint
Full-text available
Biclustering is a technique for detecting meaningful patterns in tabular data. It is also one of the fields in which evolutionary algorithms have risen to the very top in terms of speed and accuracy. In this short paper we summarize the results of porting one of the leading evolutionary-based biclustering methods, EBIC, to Julia, an emerging high-end p...
Conference Paper
Full-text available
In the multi-class classification problem GP plays an important role when combined with other non-GP classifiers. However, when GP performs the actual classification (without relying on other classifiers) its classification accuracy is low. This is especially true when the number of classes is high. In this paper, we present DTC, a GP classifier th...
Chapter
Socio-cognitive computing is a paradigm developed over the last several years; it consists of introducing into metaheuristics mechanisms inspired by inter-individual learning and cognition. It was successfully applied in hybridizing ACO and PSO metaheuristics. In this paper we have followed our previous experiences in order to hybridize the acclaime...
Article
Full-text available
Assumptions are made about the genetic model of single nucleotide polymorphisms (SNPs) when choosing a traditional genetic encoding: additive, dominant, and recessive. Furthermore, SNPs across the genome are unlikely to demonstrate identical genetic models. However, running SNP-SNP interaction analyses with every combination of encodings raises the...
Article
Full-text available
Automated machine learning (AutoML) and artificial neural networks (ANNs) have revolutionized the field of artificial intelligence by yielding incredibly high-performing models to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists on when to use one versus the other. Furthermore, relatively few tools exi...
Preprint
Full-text available
Biclustering is a data mining technique which searches for local patterns in numeric tabular data with main application in bioinformatics. This technique has shown promise in multiple areas, including development of biomarkers for cancer, disease subtype identification, or gene-drug interactions among others. In this paper we introduce EBIC.JL - an...
Article
The Translational Machine (TM) is a machine learning (ML)‐based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population s...
Article
Modifying standard gradient boosting by replacing the embedded weak learner in favor of a strong(er) one, we present SyRBo: symbolic-regression boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages—between 2 and 5—to a symbolic regressor, statistically significant improvements can often be attained....
Preprint
Full-text available
Machine Learning (ML) approaches are increasingly being used in biomedical applications. Important challenges of ML include choosing the right algorithm and tuning the parameters for optimal performance. Automated ML (AutoML) methods, such as Tree-based Pipeline Optimization Tool (TPOT), have been developed to take some of the guesswork out of ML t...
Preprint
Full-text available
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population level polygenic risk score (PRS) can explain phenotypic variation among geographic populations ba...
Article
Full-text available
Coincident with the tsunami of COVID-19-related publications, there has been a surge of studies using real-world data, including those obtained from the electronic health record (EHR). Unfortunately, several of these high-profile publications were retracted because of concerns regarding the soundness and quality of the studies and th...
Article
Full-text available
Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method—used to produce the final ensemble. In this paper, focus...
Article
Full-text available
Introduction: The Consortium for Clinical Characterization of COVID-19 by EHR (4CE) is an international collaboration addressing COVID-19 with federated analyses of electronic health record (EHR) data. Objective: We sought to develop and validate a computable phenotype for COVID-19 severity. Methods: Twelve 4CE sites participated. First we dev...
Article
Full-text available
Purpose: POAG is the leading cause of irreversible blindness in African Americans. In this study, we quantitatively assess the association of autosomal ancestry with POAG risk in a large cohort of self-identified African Americans. Methods: Subjects recruited to the Primary Open-Angle African American Glaucoma Genetics (POAAGG) study were classi...
Preprint
Full-text available
OBJECTIVE: Neurological complications can worsen outcomes in COVID-19. We defined the prevalence of a wide range of neurological conditions among patients hospitalized with COVID-19 in geographically diverse multinational populations. METHODS: Using electronic health record (EHR) data from 348 participating hospitals across 6 countries and 3 contin...
Article
Full-text available
Background: Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer’s, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or mode...
Article
Full-text available
Increasingly, clinical phenotypes with matched genetic data from bio-bank linked electronic health records (EHRs) have been used for pleiotropy analyses. Thus far, pleiotropy analysis using individual-level EHR data has been limited to data from one site. However, it is desirable to integrate EHR data from multiple sites to improve the detection po...
Article
Translational bioinformatics (TBI) is focused on the integration of biomedical data science and informatics. This combination is extremely powerful for scientific discovery as well as translation into clinical practice. Several topics where TBI research is at the leading edge are 1) the clinical utility of polygenic risk scores, 2) data integration...
Preprint
Full-text available
Objectives: To perform an international comparison of the trajectory of laboratory values among hospitalized patients with COVID-19 who develop severe disease and identify optimal timing of laboratory value collection to predict severity across hospitals and regions. Design: Retrospective cohort study. Setting: The Consortium for Clinical Characteri...
Preprint
Objective: Electronic health records (EHRs) can improve patient care by enabling systematic identification of patients for targeted decision support. But, this requires scalable learning of computable phenotypes. To this end, we developed the feature engineering automation tool (FEAT) and assessed it in targeting screening for the under-diagnosed, u...
Preprint
Objective: Environmental disasters are anthropogenic catastrophic events that affect health. Famous disasters include the Chernobyl and Fukushima-Daiichi nuclear meltdowns, which had disastrous health consequences. Traditional methods for studying environmental disasters are costly and time-intensive. We propose the use of Electronic Health Records...
Article
Full-text available
One of the challenges with urgent evaluation of patients with acute respiratory distress syndrome (ARDS) in the emergency room (ER) is distinguishing between cardiac vs infectious etiologies for their pulmonary findings. We conducted a retrospective study with the collected data of 171 ER patients. ER patient classification for cardiac and infectio...
Preprint
Full-text available
PMLB (Penn Machine Learning Benchmark) is an open-source data repository containing a curated collection of datasets for evaluating and comparing machine learning (ML) algorithms. Compiled from a broad range of existing ML benchmark collections, PMLB synthesizes and standardizes hundreds of publicly available datasets from diverse sources such as t...
Article
Full-text available
Papers describing software are an important part of computational fields of scientific research. These “software papers” are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. Here, we discuss 10 specific rules for writing soft...