
Jason H Moore, Ph.D.
University of Pennsylvania | UP · Perelman School of Medicine
About
904
Publications
105,987
Reads
32,691
Citations
Introduction
Director, Institute for Quantitative Biomedical Sciences (iQBS)
Director, Graduate Program in Quantitative Biomedical Sciences (QBS)
Associate Director, Norris-Cotton Cancer Center (NCCC)
Editor-in-Chief, BioData Mining
Additional affiliations
August 2004 - February 2015
Publications (904)
Objectives
Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.
Materials and Methods
We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to...
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments....
Background
Alzheimer’s disease (AD) is a complex neurodegenerative disorder and the most common type of dementia. AD is characterized by a decline of cognitive function and brain atrophy, and is highly heritable, with estimated heritability ranging from 60 to 80%. The most straightforward and widely used strategy to identify AD genetic basi...
The opioid epidemic continues to contribute to loss of life through overdose and significant social and economic burdens. Many individuals who develop problematic opioid use (POU) do so after being exposed to prescribed opioid analgesics. Therefore, it is important to accurately identify and classify risk factors for POU. In this review, we discuss...
ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI's graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previou...
Integrating data across institutions can improve learning efficiency. To integrate data efficiently while protecting privacy, we propose A one-shot, summary-statistics-based, Distributed Algorithm for fitting Penalized (ADAP) regression models across multiple datasets. ADAP utilizes patient-level data from a lead site and incorporates the first-ord...
When seeking a predictive model in biomedical data, one often has more than a single objective in mind, e.g., attaining both high accuracy and low complexity (to promote interpretability). We investigate herein whether multiple objectives can be dynamically tuned by our recently proposed coevolutionary algorithm, SAFE (Solution And Fitness Evolutio...
We recently highlighted a fundamental problem recognized to confound algorithmic optimization, namely, conflating the objective with the objective function. Even when the former is well defined, the latter may not be obvious, e.g., in learning a strategy to navigate a maze to find a goal (objective), an effective objective function to...
We have recently presented SAFE -- Solution And Fitness Evolution -- a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions. We showed that SAFE was successful at evolving solutions within a robotic maze domain. Herein we present an i...
Modifying standard gradient boosting by replacing the embedded weak learner in favor of a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages, between 2 and 5, to a symbolic regressor, statistically significant improvements can often be attain...
Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates acr...
The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, the cardiovascular researcher and healthcare professional are challenged to understand the opportunities as well as the lim...
Accurate disease risk stratification can lead to more precise and personalized prevention and treatment of diseases. As an important component to disease risk, genetic risk factors can be utilized as an early and stable predictor for disease onset. Recently, the polygenic risk score (PRS) method has combined the effects from hundreds to millions of...
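The PRS method described above combines per-variant effect sizes into a single weighted sum of risk-allele counts. A minimal sketch of that computation, with hypothetical variant names and effect sizes (not values from any study):

```python
# Minimal polygenic risk score (PRS) sketch: a weighted sum of risk-allele
# counts. Variant IDs and effect sizes are hypothetical illustrations.

def polygenic_risk_score(genotypes, effect_sizes):
    """genotypes: dict mapping variant -> minor-allele count (0, 1, or 2);
    effect_sizes: dict mapping variant -> per-allele effect (e.g. log odds)."""
    return sum(effect_sizes[v] * genotypes.get(v, 0) for v in effect_sizes)

person = {"rsA": 2, "rsB": 0, "rsC": 1}             # hypothetical allele counts
weights = {"rsA": 0.10, "rsB": -0.05, "rsC": 0.20}  # hypothetical effect sizes

print(round(polygenic_risk_score(person, weights), 2))  # 2*0.10 + 0 + 1*0.20 = 0.4
```

Real PRS pipelines add steps this sketch omits, such as variant pruning and effect-size shrinkage across hundreds to millions of variants.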
In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (Deep learning for Toxicology), an interpretation framework for knowledge-guid...
Background
Gene set enrichment analysis (GSEA) uses gene-level univariate associations to identify gene set-phenotype associations for hypothesis generation and interpretation. We propose that GSEA can be adapted to incorporate SNP and gene-level interactions. To this end, gene scores are derived by Relief-based feature importance algorithms that e...
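The Relief family of algorithms referenced above scores features by contrasting each instance with its nearest same-class and different-class neighbors, which lets the score capture interactions that univariate tests miss. A minimal sketch of the classic Relief update, not the authors' code:

```python
# Illustrative Relief-style feature weighting: for each instance, compare it
# with its nearest hit (same class) and nearest miss (different class);
# features that separate classes accumulate positive weight.

def relief_weights(X, y):
    """X: list of numeric feature vectors; y: class labels.
    Returns one weight per feature: higher = more class-relevant."""
    n_features = len(X[0])
    w = [0.0] * n_features

    def dist(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for i, xi in enumerate(X):
        hits = [x for j, x in enumerate(X) if j != i and y[j] == y[i]]
        misses = [x for j, x in enumerate(X) if y[j] != y[i]]
        hit = min(hits, key=lambda x: dist(xi, x))
        miss = min(misses, key=lambda x: dist(xi, x))
        for f in range(n_features):
            w[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])
    return w

# Toy data: feature 0 tracks the class perfectly, feature 1 is noise.
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
print(w[0] > w[1])  # True: the informative feature gets the larger weight
```

Variants such as ReliefF average over k neighbors and handle multi-class labels; the structure of the update is the same.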
Objective
For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information.
Materials and...
Quantitative Structure-Activity Relationship (QSAR) modeling is a common computational technique for predicting chemical toxicity, but a lack of new methodological innovations has impeded QSAR performance on many tasks. We show that contemporary QSAR modeling for predictive toxicology can be substantially improved by incorporating semantic graph da...
Scientific innovation has long been heralded as the collaborative effort of many people, groups, and studies to drive forward research. However, the traditional peer review process relies on reviewers acting in a silo to critically judge research. As research becomes more cross-disciplinary, finding reviewers with appropriate expertise to provide feed...
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations ba...
Semantic GP is a promising branch of GP that introduces semantic awareness during genetic evolution to improve various aspects of GP. This paper presents a new Semantic GP approach based on Dynamic Target (SGP-DT) that divides the search problem into multiple GP runs. The evolution in each run is guided by a new (dynamic) target based on the residu...
We present AddGBoost, a gradient boosting-style algorithm, wherein the decision tree is replaced by a succession of (possibly) stronger learners, which are optimized via a state-of-the-art hyperparameter optimizer. Through experiments over 90 regression datasets we show that AddGBoost emerges as the top performer for 33% (with 2 stages) up to 42% (...
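The stagewise scheme described above, where each stage fits a learner to the residuals of the running prediction, can be sketched as follows. This is an illustrative reimplementation in the spirit of AddGBoost, not the authors' code; the per-stage learner here is a plain least-squares line fit rather than a hyperparameter-optimized strong learner.

```python
# Boosting-style stagewise fitting: each stage models the residuals left by
# the previous stages, and the final prediction is the sum over stages.

class StagewiseBooster:
    def __init__(self, n_stages=3):
        self.n_stages = n_stages
        self.stages = []  # (slope, intercept) per stage

    def fit(self, xs, ys):
        residuals = list(ys)
        for _ in range(self.n_stages):
            slope, intercept = self._fit_line(xs, residuals)
            self.stages.append((slope, intercept))
            # Each stage sees only what the earlier stages failed to explain.
            residuals = [r - (slope * x + intercept) for x, r in zip(xs, residuals)]
        return self

    def predict(self, x):
        return sum(a * x + b for a, b in self.stages)

    @staticmethod
    def _fit_line(xs, ys):
        # Ordinary least-squares fit of y = slope*x + intercept.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var = sum((x - mx) ** 2 for x in xs)
        slope = cov / var if var else 0.0
        return slope, my - slope * mx

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
model = StagewiseBooster(n_stages=2).fit(xs, ys)
print(model.predict(1.5))  # 4.0 for this linear target
```

Swapping the line fit for a stronger, tuned regressor at each stage is the step that distinguishes the AddGBoost idea from standard tree-based gradient boosting.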
Multimodal neuroimaging data can provide complementary information about neurodegenerative diseases such as Alzheimer's disease (AD) that a single modality cannot. Deep Generalized Canonical Correlation Analysis (DGCCA) is able to learn a shared feature representation from different views of data by applying non‐linear transformation using neural n...
Brain imaging genetics is an emerging research topic in the study of Alzheimer’s disease (AD). The conventional approach, such as canonical correlation analysis (CCA), has been widely used to identify imaging genetic associations. A deep learning model has recently been proposed to better understand the roots of the complex association between imag...
The advances in technologies for acquiring brain imaging and high-throughput genetic data allow the researcher to access a large amount of multi-modal data. Although the sparse canonical correlation analysis is a powerful bi-multivariate association analysis technique for feature selection, we are still facing major challenges in integrating multi-...
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning me...
Motivation
Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows.
Results
This rel...
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during early pandemic, we used electronic health records (EHR) from 338 participat...
Aims:
Enhanced risk stratification of patients with aortic stenosis (AS) is necessary to identify patients at high risk for adverse outcomes, and may allow for better management of patient subgroups at high risk of myocardial damage. The objective of this study was to identify plasma biomarkers and multimarker profiles associated with adverse outc...
Motivation
Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for...
Quantitative Structure-Activity Relationship (QSAR) modeling is the most common computational technique for predicting chemical toxicity, but a lack of methodological innovations in QSAR have led to underwhelming performance. We show that contemporary QSAR modeling for predictive toxicology can be substantially improved by incorporating semantic gr...
Environmental disasters are anthropogenic catastrophic events that affect health. Famous disasters include the Seveso disaster and the Fukushima-Daiichi nuclear meltdown, which had disastrous health consequences. Traditional methods for studying environmental disasters are costly and time-intensive. We propose the use of electronic health records (...
Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14...
Machine Learning (ML) approaches are increasingly being used in biomedical applications. Important challenges of ML include choosing the right algorithm and tuning the parameters for optimal performance. Automated ML (AutoML) methods, such as Tree-based Pipeline Optimization Tool (TPOT), have been developed to take some of the guesswork out of ML t...
We ascertain and compare the performances of AutoML tools on large, highly imbalanced healthcare datasets. We generated a large dataset using historical administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six d...
Background:
The electronic health record (EHR) has become increasingly ubiquitous. At the same time, health professionals have been turning to this resource for access to data that is needed for the delivery of health care and for clinical research. There is little doubt that the EHR has made both of these functions easier than earlier days when w...
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification...
Biclustering is a technique of detecting meaningful patterns in tabular data. It is also one of the fields in which evolutionary algorithms have risen to the very top in terms of speed and accuracy. In this short paper we summarize the results of porting one of the leading evolutionary-based biclustering methods, EBIC, to Julia, an emerging high-end p...
In the multi-class classification problem GP plays an important role when combined with other non-GP classifiers. However, when GP performs the actual classification (without relying on other classifiers) its classification accuracy is low. This is especially true when the number of classes is high. In this paper, we present DTC, a GP classifier th...
Socio-cognitive computing is a paradigm developed over the last several years; it introduces into metaheuristics mechanisms inspired by inter-individual learning and cognition. It was successfully applied in hybridizing ACO and PSO metaheuristics. In this paper we have followed our previous experiences in order to hybridize the acclaime...
Assumptions are made about the genetic model of single nucleotide polymorphisms (SNPs) when choosing a traditional genetic encoding: additive, dominant, and recessive. Furthermore, SNPs across the genome are unlikely to demonstrate identical genetic models. However, running SNP-SNP interaction analyses with every combination of encodings raises the...
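The three classical SNP encodings named above map the same genotype (a minor-allele count of 0, 1, or 2) to different numeric features, which is why the choice of encoding bakes a genetic-model assumption into the analysis. A minimal sketch:

```python
# The three classical SNP genetic encodings. Input is the minor-allele
# count (0 = homozygous major, 1 = heterozygous, 2 = homozygous minor).

def encode_snp(allele_count, model):
    if model == "additive":    # effect scales linearly with allele count
        return allele_count
    if model == "dominant":    # any copy of the minor allele has the effect
        return 1 if allele_count >= 1 else 0
    if model == "recessive":   # only two copies of the minor allele count
        return 1 if allele_count == 2 else 0
    raise ValueError(f"unknown genetic model: {model}")

for m in ("additive", "dominant", "recessive"):
    print(m, [encode_snp(g, m) for g in (0, 1, 2)])
# additive [0, 1, 2]
# dominant [0, 1, 1]
# recessive [0, 0, 1]
```

Because the heterozygote is treated differently under each model, a SNP-SNP interaction analysis run under one fixed encoding can miss interactions visible under another, which is the combinatorial burden the abstract refers to.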
Automated machine learning (AutoML) and artificial neural networks (ANNs) have revolutionized the field of artificial intelligence by yielding incredibly high-performing models to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists on when to use one versus the other. Furthermore, relatively few tools exi...
Biclustering is a data mining technique which searches for local patterns in numeric tabular data with main application in bioinformatics. This technique has shown promise in multiple areas, including development of biomarkers for cancer, disease subtype identification, or gene-drug interactions among others. In this paper we introduce EBIC.JL - an...
The Translational Machine (TM) is a machine learning (ML)‐based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population s...
Modifying standard gradient boosting by replacing the embedded weak learner in favor of a strong(er) one, we present SyRBo: symbolic-regression boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages—between 2 and 5—to a symbolic regressor, statistically significant improvements can often be attained....
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population level polygenic risk score (PRS) can explain phenotypic variation among geographic populations ba...
Unstructured:
Coincident with the tsunami of COVID-19-related publications, there has been a surge of studies using real-world data, including those obtained from the electronic health record (EHR). Unfortunately, several of these high-profile publications were retracted because of concerns regarding the soundness and quality of the studies and th...
Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method—used to produce the final ensemble. In this paper, focus...
Introduction:
The Consortium for Clinical Characterization of COVID-19 by EHR (4CE) is an international collaboration addressing COVID-19 with federated analyses of electronic health record (EHR) data.
Objective:
We sought to develop and validate a computable phenotype for COVID-19 severity.
Methods:
Twelve 4CE sites participated. First we dev...
Purpose:
Primary open-angle glaucoma (POAG) is the leading cause of irreversible blindness in African Americans. In this study, we quantitatively assess the association of autosomal ancestry with POAG risk in a large cohort of self-identified African Americans.
Methods:
Subjects recruited to the Primary Open-Angle African American Glaucoma Genetics (POAAGG) study were classi...
OBJECTIVE: Neurological complications can worsen outcomes in COVID-19. We defined the prevalence of a wide range of neurological conditions among patients hospitalized with COVID-19 in geographically diverse multinational populations.
METHODS: Using electronic health record (EHR) data from 348 participating hospitals across 6 countries and 3 contin...
Background
Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer’s, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or mode...
Increasingly, clinical phenotypes with matched genetic data from bio-bank linked electronic health records (EHRs) have been used for pleiotropy analyses. Thus far, pleiotropy analysis using individual-level EHR data has been limited to data from one site. However, it is desirable to integrate EHR data from multiple sites to improve the detection po...
Translational bioinformatics (TBI) is focused on the integration of biomedical data science and informatics. This combination is extremely powerful for scientific discovery as well as translation into clinical practice. Several topics where TBI research is at the leading edge are 1) the clinical utility of polygenic risk scores, 2) data integration...
Objectives
To perform an international comparison of the trajectory of laboratory values among hospitalized patients with COVID-19 who develop severe disease and identify optimal timing of laboratory value collection to predict severity across hospitals and regions.
Design
Retrospective cohort study.
Setting
The Consortium for Clinical Characteri...
Objective
Electronic health records (EHRs) can improve patient care by enabling systematic identification of patients for targeted decision support. But, this requires scalable learning of computable phenotypes. To this end, we developed the feature engineering automation tool (FEAT) and assessed it in targeting screening for the under-diagnosed, u...
Objective
Environmental disasters are anthropogenic catastrophic events that affect health. Famous disasters include the Chernobyl and Fukushima-Daiichi nuclear meltdowns, which had disastrous health consequences. Traditional methods for studying environmental disasters are costly and time-intensive. We propose the use of Electronic Health Records...
One of the challenges with urgent evaluation of patients with acute respiratory distress syndrome (ARDS) in the emergency room (ER) is distinguishing between cardiac vs infectious etiologies for their pulmonary findings. We conducted a retrospective study with the collected data of 171 ER patients. ER patient classification for cardiac and infectio...
PMLB (Penn Machine Learning Benchmark) is an open-source data repository containing a curated collection of datasets for evaluating and comparing machine learning (ML) algorithms. Compiled from a broad range of existing ML benchmark collections, PMLB synthesizes and standardizes hundreds of publicly available datasets from diverse sources such as t...
Papers describing software are an important part of computational fields of scientific research. These “software papers” are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. Here, we discuss 10 specific rules for writing soft...