
Jason H MooreUniversity of Pennsylvania | UP · Perelman School of Medicine
Jason H Moore
Ph.D.
About
947
Publications
128,981
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
37,697
Citations
Introduction
Director, Institute for Quantitative Biomedical Sciences (iQBS)
Director, Graduate Program in Quantitative Biomedical Sciences (QBS)
Associate Director, Norris-Cotton Cancer Center (NCCC)
Editor-in-Chief, BioData Mining
Additional affiliations
August 2004 - February 2015
Publications
Publications (947)
Large database sources, such as the National Health and Nutrition Examination Survey (NHANES), while being a great utility for epidemiological studies, pose challenges for machine learning due to data heterogeneity, varied sample sizes, missing values/outliers and variations in data collection and interpretation requiring thorough data-quality asse...
In response to the escalating global obesity crisis and its associated health and financial burdens, this paper presents a novel methodology for analyzing longitudinal weight loss data and assessing the effectiveness of financial incentives. Drawing from the Keep It Off trial—a three-arm randomized controlled study with 189 participants—we examined...
Alzheimer’s disease (AD) leads to irreversible cognitive decline, with Mild Cognitive Impairment (MCI) as its prodromal stage. Early detection of AD and related dementia is crucial for timely treatment and slowing disease progression. However, classifying cognitive normal (CN), MCI, and AD subjects using machine learning models faces class imbalanc...
Motivation
Biomedical and healthcare domains generate vast amounts of complex data that can be challenging to analyze using machine learning tools, especially for researchers without computer science training.
Results
Aliro is an open-source software package designed to automate machine learning analysis through a clean web interface. By infusing...
There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heavi...
Background and Objectives:
The two most common neurodegenerative diseases are Alzheimer's disease (AD) and Parkinson's disease (PD), both related to age and affect millions of people across the world, especially as life expectancy increases in certain countries. Here, we explore the potential predictiveness of the genetic risk of AD and PD separate...
Motivation: . Genome-Wide Association Studies (GWAS) commonly assume phenotypic and genetic homogeneity that is not
present in complex conditions. We designed Transformative Regression Analysis of Combined Effects (TRACE), a GWAS
methodology that better accounts for clinical phenotype heterogeneity and identifies gene-by-environment (GxE) interacti...
In computational toxicology, prediction of complex endpoints has always been challenging, as they often involve multiple distinct mechanisms. State-of-the-art models are either limited by low accuracy, or lack of interpretability due to their black-box nature. Here, we introduce AIDTox, an interpretable deep learning model which incorporates curate...
The introduction of large language models (LLMs) that allow iterative “chat” in late 2022 is a paradigm shift that enables generation of text often indistinguishable from that written by humans. LLM-based chatbots have immense potential to improve academic work efficiency, but the ethical implications of their fair use and inherent bias must be con...
Investigating the relationship between genetic variation and phenotypic traits is a key issue in quantitative genetics. Specifically for Alzheimer's disease, the association between genetic markers and quantitative traits remains vague while, once identified, will provide valuable guidance for the study and development of genetics-based treatment a...
Machine learning (ML) models trained for triggering clinical decision support (CDS) are typically either accurate or interpretable but not both. Scaling CDS to the panoply of clinical use cases while mitigating risks to patients will require many ML models be intuitively interpretable for clinicians. To this end, we adapted a symbolic regression me...
Statistical epistasis has been studied extensively because of its potential to provide evidence for genetic interactions for phenotypes, but there have been methodological limitations to its exhaustive, widespread application. We present new algorithms for the interaction coefficients for standard regression models for epistasis that permit many va...
Background
Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning appro...
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWASs). Standard GWASs are well-powered to interrogate additive models; however, new approaches are required for invesigating other modes of inheritance such as domina...
In many evolutionary computation systems, parent selection methods can affect, among other things, convergence to a solution. In this paper, we present a study comparing the role of two commonly used parent selection methods in evolving machine learning pipelines in an automated machine learning system called Tree-based Pipeline Optimization Tool (...
BACKGROUND
As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer’s Disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture heterogeneous biomedical knowledge that is central to the disease’s etiology and response to drugs. We designed the Alz...
Background
Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-conf...
In many evolutionary computation systems, parent selection methods can affect, among other things, convergence to a solution. In this paper, we present a study comparing the role of two commonly used parent selection methods in evolving machine learning pipelines in an automated machine learning system called Tree-based Pipeline Optimization Tool (...
Background
Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning appro...
The primary efforts of disease and epidemiological research can be divided into two areas: identifying the causal mechanisms and utilizing important variables for risk prediction. The latter is generally perceived as a more obtainable goal due to the vast number of readily available tools and the faster pace of obtaining results. However, the lower...
Artificial Intelligence (AI) in medicine stands at the cusp of revolutionizing clinician reasoning and decision-making. Since its foundational years in the mid-20th century, the progression of medical AI has seen considerable advancements, concurrently grappling with various challenges. Early attempts of AI showcased immense potential, yet faced hu...
The selection and tuning of feature selection, feature engineering, and classification or regression algorithms is a major challenge in machine learning, affecting both beginners and experts. Automated machine learning (AutoML) offers a solution by automating the creation of machine learning pipelines, eliminating the guesswork associated with a ma...
Brain imaging genetics examines associations between imaging quantitative traits (QTs) and genetic factors such as single nucleotide polymorphisms (SNPs) to provide important insights into the pathogenesis of Alzheimer’s diseases (AD). Given the high dimensionality, the individual level SNP‐QT signals typically have small effect sizes and are hard...
Investigating the relationship between genetic variation and phenotypic traits is a key issue in quantitative genetics. Specifically for Alzheimer's disease, the association between genetic markers and quantitative traits remains vague while, once identified, will provide valuable guidance for the study and development of genetic-based treatment ap...
Automated machine learning (AutoML) algorithms have grown in popularity due to their high performance and flexibility to adapt to different problems and data sets. With the increasing number of AutoML algorithms, deciding which would best suit a given problem becomes increasingly more work. Therefore, it is essential to use complex and challenging...
Objectives
Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.
Materials and Methods
We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to...
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial to determine their scope of application. Here, we introduce the Diverse and Generative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of ML algorithms for classification of binary outcom...
Evolutionary multi-agent systems (EMASs) are very good at dealing with difficult, multi-dimensional problems, their efficacy was proven theoretically based on analysis of the relevant Markov-Chain based model. Now the research continues on introducing autonomous hybridization into EMAS. This paper focuses on a proposed hybrid version of the EMAS, a...
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWAS). Standard GWAS are well-powered to interrogate additive models; however, new approaches are required to investigate other modes of inheritance such as dominance...
In computational toxicology, prediction of complex endpoints has always been challenging, as they often involve multiple distinct mechanisms. State-of-the-art models are either limited by low accuracy, or lack of interpretability due to their black-box nature. Here we introduce AIDTox, an interpretable deep learning model which incorporates curated...
The rapid increase of interest in, and use of, artificial intelligence (AI) in computer applications has raised a parallel concern about its ability (or lack thereof) to provide understandable, or explainable, output to users. This concern is especially legitimate in biomedical contexts, where patient safety is of paramount importance. This positio...
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning me...
Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive, fibrosing interstitial pneumonia of unknown etiology. The role of genetic risk factors has been the focus of numerous studies probing for associations of genetic variants with IPF. We aimed to determine whether single-nucleotide polymorphisms (SNPs) of four candidate genes are associate...
Evolutionary multi-agent systems (EMASs) are very good at dealing with difficult, multi-dimensional problems, their efficacy was proven theoretically based on analysis of the relevant Markov-Chain based model. Now the research continues on introducing autonomous hybridization into EMAS. This paper focuses on a proposed hybrid version of the EMAS, a...
Brain imaging genetics examines associations between imaging quantitative traits (QTs) and genetic factors such as single nucleotide polymorphisms (SNPs) to provide important insights into the pathogenesis of Alzheimer’s disease (AD). The individual level SNP-QT signals are high dimensional and typically have small effect sizes, making them hard to...
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments....
Background
Alzheimer’s disease (AD) is a complex neurodegenerative disorder and the most common type of dementia. AD is characterized by a decline of cognitive function and brain atrophy, and is highly heritable with estimated heritability ranging from 60 to 80 $$\%$$ % . The most straightforward and widely used strategy to identify AD genetic basi...
Objective
For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information.
Materials and...
In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (deep learning for toxicology), an interpretation framework for knowledge-guided...
The opioid epidemic continues to contribute to loss of life through overdose and significant social and economic burdens. Many individuals who develop problematic opioid use (POU) do so after being exposed to prescribed opioid analgesics. Therefore, it is important to accurately identify and classify risk factors for POU. In this review, we discuss...
ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI's graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previou...
Integrating data across institutions can improve learning efficiency. To integrate data efficiently while protecting privacy, we propose A one-shot, summary-statistics-based, Distributed Algorithm for fitting Penalized (ADAP) regression models across multiple datasets. ADAP utilizes patient-level data from a lead site and incorporates the first-ord...
When seeking a predictive model in biomedical data, one often has more than a single objective in mind, e.g., attaining both high accuracy and low complexity (to promote interpretability). We investigate herein whether multiple objectives can be dynamically tuned by our recently proposed coevolutionary algorithm, SAFE (Solution And Fitness Evolutio...
We recently highlighted a fundamental problem recognized to confound algorithmic optimization, namely, \textit{conflating} the objective with the objective function. Even when the former is well defined, the latter may not be obvious, e.g., in learning a strategy to navigate a maze to find a goal (objective), an effective objective function to \tex...
We have recently presented SAFE -- Solution And Fitness Evolution -- a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions. We showed that SAFE was successful at evolving solutions within a robotic maze domain. Herein we present an i...
Modifying standard gradient boosting by replacing the embedded weak learner in favor of a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages -- between 2--5 -- to a symbolic regressor, statistically significant improvements can often be attain...
Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates acr...
The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, the cardiovascular researcher and healthcare professional are challenged to understand the opportunities as well as the lim...
Accurate disease risk stratification can lead to more precise and personalized prevention and treatment of diseases. As an important component to disease risk, genetic risk factors can be utilized as an early and stable predictor for disease onset. Recently, the polygenic risk score (PRS) method has combined the effects from hundreds to millions of...
In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox ( D eep learning for Tox icology), an interpretation framework for knowledge-guid...
Background
Gene set enrichment analysis (GSEA) uses gene-level univariate associations to identify gene set-phenotype associations for hypothesis generation and interpretation. We propose that GSEA can be adapted to incorporate SNP and gene-level interactions. To this end, gene scores are derived by Relief-based feature importance algorithms that e...
Objective
For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information.
Materials and...
Quantitative Structure-Activity Relationship (QSAR) modeling is a common computational technique for predicting chemical toxicity, but a lack of new methodological innovations has impeded QSAR performance on many tasks. We show that contemporary QSAR modeling for predictive toxicology can be substantially improved by incorporating semantic graph da...
Scientific innovation has long been heralded the collaborative effort of many people, groups, and studies to drive forward research. However, the traditional peer review process relies on reviewers acting in a silo to critically judge research. As research becomes more cross-disciplinary, finding reviewers with appropriate expertise to provide feed...
The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations ba...
Semantic GP is a promising branch of GP that introduces semantic awareness during genetic evolution to improve various aspects of GP. This paper presents a new Semantic GP approach based on Dynamic Target (SGP-DT) that divides the search problem into multiple GP runs. The evolution in each run is guided by a new (dynamic) target based on the residu...
We present AddGBoost, a gradient boosting-style algorithm, wherein the decision tree is replaced by a succession of (possibly) stronger learners, which are optimized via a state-of-the-art hyperparameter optimizer. Through experiments over 90 regression datasets we show that AddGBoost emerges as the top performer for 33% (with 2 stages) up to 42% (...
Background
Multimodal neuroimaging data can provide complementary information that a single modality cannot about neurodegenerative diseases such as Alzheimer's disease (AD). Deep Generalized Canonical Correlation Analysis (DGCCA) is able to learn a shared feature representation from different views of data by applying non‐linear transformation usi...
Background
Brain imaging genetics is an emerging research topic in the study of Alzheimer’s disease (AD). The conventional approach, such as canonical correlation analysis (CCA), has been widely used to identify imaging genetic associations. A deep learning model has recently been proposed to better understand the roots of the complex association b...
The advances in technologies for acquiring brain imaging and high-throughput genetic data allow the researcher to access a large amount of multi-modal data. Although the sparse canonical correlation analysis is a powerful bi-multivariate association analysis technique for feature selection, we are still facing major challenges in integrating multi-...
Motivation
Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows.
Results
This rel...
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during early pandemic, we used electronic health records (EHR) from 338 participat...
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during early pandemic, we used electronic health records (EHR) from 338 participat...