
Jason H Moore
- Ph.D.
- Managing Director at University of Pennsylvania
About
1,004 Publications
148,308 Reads
42,015 Citations
Introduction
Director, Institute for Quantitative Biomedical Sciences (iQBS)
Director, Graduate Program in Quantitative Biomedical Sciences (QBS)
Associate Director, Norris-Cotton Cancer Center (NCCC)
Editor-in-Chief, BioData Mining
Current institution
Additional affiliations
August 2004 - February 2015
Publications (1,004)
Lexicase selection is a successful parent selection method in genetic programming that has outperformed other methods across multiple benchmark suites. Unlike other selection methods that require explicit parameters to function, such as tournament size in tournament selection, lexicase selection does not. However, if evolutionary parameters like po...
Motivation
LLMs like GPT-4, despite their advancements, often produce hallucinations and struggle with integrating external knowledge effectively. While Retrieval-Augmented Generation (RAG) attempts to address this by incorporating external information, it faces significant challenges such as context length limitations and imprecise vector similari...
Background
The amyloid‐tau‐neurodegeneration (ATN) framework provides a valuable model for comprehending the pathophysiology and progression of Alzheimer’s disease (AD). However, the relationships among these three characteristics, and the genetic interactions with them, are complex and not fully understood. Here, we use neuroimaging‐derived quantitative trait...
Background
Alzheimer's disease (AD) is a complex neurodegenerative disorder that has impacted millions of people worldwide. Identifying different risk groups converting to AD during the mild cognitive impairment (MCI) stage and determining their genetic basis would be immensely valuable for drug discovery and subsequent clinical treatment. Previous...
Given the complexity and multifactorial nature of Alzheimer's disease, investigating potential drug-gene targets is imperative for developing effective therapies and advancing our understanding of the underlying mechanisms driving the disease. We present an explainable ML model that integrates the role and impact of gene interactions to drive the g...
Background
Epistasis, the phenomenon where the effect of one gene (or variant) is masked or modified by one or more other genes, significantly contributes to the phenotypic variance of complex traits. Traditionally, epistasis has been modeled using the Cartesian epistatic model, a multiplicative approach based on standard statistical regression. Ho...
Importance
Machine learning for augmented screening of perinatal mood and anxiety disorders (PMADs) requires thorough consideration of clinical biases embedded in electronic health records (EHRs) and rigorous evaluations of model performance.
Objective
To mitigate bias in predictive models of PMADs trained on commonly available EHRs.
Design, Sett...
Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper delves into the methodologies for feature selection (FS) within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy,...
Background
The additive model of inheritance assumes that heterozygotes (Aa) are exactly intermediate with respect to homozygotes (AA and aa). While this model is commonly used in single-locus genetic association studies, significant deviations from additivity are well-documented and contribute to phenotypic variance across many traits and systems. T...
Growing evidence suggests that social determinants of health (SDoH), a set of nonmedical factors, affect individuals' risks of developing Alzheimer's disease (AD) and related dementias. Nevertheless, the etiological mechanisms underlying such relationships remain largely unclear, mainly due to difficulties in collecting relevant information. This s...
Background
Epistasis, the interaction between genetic loci where the effect of one locus is influenced by one or more other loci, plays a crucial role in the genetic architecture of complex traits. However, as the number of loci considered increases, the investigation of epistasis becomes exponentially more complex, making the selection of key feat...
The long-term complications of COVID-19, known as the post-acute sequelae of SARS-CoV-2 infection (PASC), significantly burden healthcare resources. Quantifying the demand for post-acute healthcare is essential for understanding patients’ needs and optimizing the allocation of valuable medical resources for disease management. Driven by this need,...
Acute myelogenous leukemia (AML) is a common blood cancer marked by heterogeneity in disease and diverse genetic abnormalities. Additional therapies are needed as the 5-year survival remains below 30%. Trametinib is a mitogen-activated extracellular signal-regulated kinase (MEK) inhibitor that is widely used in solid tumors and also in tumors with...
Bladder cancer shows distinct sex-related patterns, with male patients experiencing significantly higher incidence and female patients facing worse survival outcomes. In this paper, we aimed to address the lack of understanding of the biological mechanisms responsible for this sex-based divergence through an integrative analysis using bladder cance...
Background: The investigation of epistasis becomes increasingly complex as more loci are considered due to the exponential expansion of possible interactions. Consequently, selecting key features that influence epistatic interactions is crucial for effective downstream analyses. Recognizing this challenge, this study investigates the efficiency of...
Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper delves into the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hyp...
GPT-4, as the most advanced version of OpenAI’s large language models, has attracted widespread attention, rapidly becoming an indispensable AI tool across various areas. This includes its exploration by scientists for diverse applications. Our study focused on assessing GPT-4’s capabilities in generating text, tables, and diagrams for biomedical r...
Automated machine learning streamlines the task of finding effective machine learning pipelines by automating model training, evaluation, and selection. Traditional evaluation strategies, like cross-validation (CV), generate one value that averages the accuracy of a pipeline's predictions. This single value, however, may not fully describe the gene...
Determining the fundamental characteristics that define a face as "feminine" or "masculine" has long fascinated anatomists and plastic surgeons, particularly those involved in aesthetic and gender-affirming surgery. Previous studies in this area have relied on manual measurements, comparative anatomy, and heuristic landmark-based feature extraction...
Motivation
Answering and solving complex problems using a large language model (LLM) given a certain domain such as biomedicine is a challenging task that requires both factual consistency and logic, and LLMs often suffer from some major limitations, such as hallucinating false or irrelevant information, or being influenced by noisy data. These iss...
The authors emphasize diversity, equity, and inclusion in STEM education and artificial intelligence (AI) research, focusing on LGBTQ+ representation. They discuss the challenges faced by queer scientists, educational resources, the implementation of National AI Campus, and the notion of intersectionality. The authors hope to ensure supportive and...
The progress of precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. With the continued expansion of modalities, scales, and sources of clinical datasets, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of d...
Background
Epistasis, the phenomenon where the effect of one gene (or variant) is masked or modified by one or more other genes, can significantly contribute to the observed phenotypic variance of complex traits. To date, it has been generally assumed that genetic interactions can be detected using a Cartesian, or multiplicative, interaction model...
Purpose
Epistasis, the interaction between two or more genes, is integral to the study of genetics and is present throughout nature. Yet, it is seldom fully explored as most approaches primarily focus on single-locus effects, partly because analyzing all pairwise and higher-order interactions requires significant computational resources. Furthermor...
In this chapter, we present a new implementation of the popular Tree-Based Pipeline Optimization Tool (TPOT). This new implementation, called TPOT2, was rebuilt from the ground up to be more modular, easier to maintain, and easier to expand. TPOT2 comes with new features and optimizations, such as a more flexible graph-based representation of Sciki...
Contemporary analyses focused on a limited number of clinical and molecular biomarkers have been unable to accurately predict clinical outcomes in pancreatic ductal adenocarcinoma. Here we describe a precision medicine platform known as the Molecular Twin consisting of advanced machine-learning models and use it to analyze a dataset of 6,363 clinic...
Genome-wide association studies (GWAS) have been instrumental in identifying genetic associations for various diseases and traits. However, uncovering genetic underpinnings among traits beyond univariate phenotype associations remains a challenge. Multi-phenotype associations (MPA), or genetic pleiotropy, offer important insights into shared genes...
Natural language processing techniques are having an increasing impact on clinical care from patient, clinician, administrator, and research perspectives. Applications include, among others, automated generation of clinical notes and discharge letters, medical term coding for billing, medical chatbots both for patients and clinicians, data enrichment in the identifica...
One of the central challenges of machine learning is the selection of methods for feature selection, feature engineering, and classification or regression algorithms for building an analytics pipeline. This is true for both novices and experts. Automated machine learning (AutoML) has emerged as a useful approach to generate machine learning pipelin...
This work demonstrates the use of cluster analysis in detecting fair and unbiased novel discoveries. Given a sample population of elective spinal fusion patients, we identify two overarching subgroups driven by insurance type. The Medicare group, associated with lower socioeconomic status, exhibited an over-representation of negative risk factors....
The concept of a digital twin came from the engineering, industrial, and manufacturing domains to create virtual objects or machines that could inform the design and development of real objects. This idea is appealing for precision medicine where digital twins of patients could help inform healthcare decisions. We have developed a methodology for g...
The following sections are included: Introduction to the workshop; Workshop Presenters.
Background
Previous genetic studies of Alzheimer’s Disease (AD) utilized features derived from florbetapir (AV45) PET imaging (Lee et al., BMC Genomics 85(2022)). Genetic analysis of florbetaben (FBB) PET FreeSurfer‐defined ROI‐specific SUVrs provided by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) can extend this work by nominating ROIs...
Large database sources, such as the National Health and Nutrition Examination Survey (NHANES), while being a great utility for epidemiological studies, pose challenges for machine learning due to data heterogeneity, varied sample sizes, missing values/outliers and variations in data collection and interpretation requiring thorough data-quality asse...
In response to the escalating global obesity crisis and its associated health and financial burdens, this paper presents a novel methodology for analyzing longitudinal weight loss data and assessing the effectiveness of financial incentives. Drawing from the Keep It Off trial—a three-arm randomized controlled study with 189 participants—we examined...
Alzheimer’s disease (AD) leads to irreversible cognitive decline, with Mild Cognitive Impairment (MCI) as its prodromal stage. Early detection of AD and related dementia is crucial for timely treatment and slowing disease progression. However, classifying cognitively normal (CN), MCI, and AD subjects using machine learning models faces class imbalanc...
Motivation
Biomedical and healthcare domains generate vast amounts of complex data that can be challenging to analyze using machine learning tools, especially for researchers without computer science training.
Results
Aliro is an open-source software package designed to automate machine learning analysis through a clean web interface. By infusing...
There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heavi...
Background and Objectives:
The two most common neurodegenerative diseases are Alzheimer's disease (AD) and Parkinson's disease (PD); both are related to age and affect millions of people across the world, especially as life expectancy increases in certain countries. Here, we explore the potential predictiveness of the genetic risk of AD and PD separate...
Motivation
Genome-Wide Association Studies (GWAS) commonly assume phenotypic and genetic homogeneity that is not present in complex conditions. We designed Transformative Regression Analysis of Combined Effects (TRACE), a GWAS methodology that better accounts for clinical phenotype heterogeneity and identifies gene-by-environment (GxE) interacti...
In computational toxicology, prediction of complex endpoints has always been challenging, as they often involve multiple distinct mechanisms. State-of-the-art models are either limited by low accuracy, or lack of interpretability due to their black-box nature. Here, we introduce AIDTox, an interpretable deep learning model which incorporates curate...
The introduction of large language models (LLMs) that allow iterative “chat” in late 2022 is a paradigm shift that enables generation of text often indistinguishable from that written by humans. LLM-based chatbots have immense potential to improve academic work efficiency, but the ethical implications of their fair use and inherent bias must be con...
Investigating the relationship between genetic variation and phenotypic traits is a key issue in quantitative genetics. Specifically for Alzheimer's disease, the association between genetic markers and quantitative traits remains vague but, once identified, will provide valuable guidance for the study and development of genetics-based treatment a...
Machine learning (ML) models trained for triggering clinical decision support (CDS) are typically either accurate or interpretable but not both. Scaling CDS to the panoply of clinical use cases while mitigating risks to patients will require many ML models be intuitively interpretable for clinicians. To this end, we adapted a symbolic regression me...
Statistical epistasis has been studied extensively because of its potential to provide evidence for genetic interactions for phenotypes, but there have been methodological limitations to its exhaustive, widespread application. We present new algorithms for the interaction coefficients for standard regression models for epistasis that permit many va...
Background
Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning appro...
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWASs). Standard GWASs are well-powered to interrogate additive models; however, new approaches are required for investigating other modes of inheritance such as domina...
In many evolutionary computation systems, parent selection methods can affect, among other things, convergence to a solution. In this paper, we present a study comparing the role of two commonly used parent selection methods in evolving machine learning pipelines in an automated machine learning system called Tree-based Pipeline Optimization Tool (...
Background
As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer’s Disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture heterogeneous biomedical knowledge that is central to the disease’s etiology and response to drugs. We designed the Alz...
Background
As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture relationships central to the disease’s etiology and response to drugs.
Objective
We designed the Alzheimer’s Knowledge B...
Background
Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-conf...
The primary efforts of disease and epidemiological research can be divided into two areas: identifying the causal mechanisms and utilizing important variables for risk prediction. The latter is generally perceived as a more obtainable goal due to the vast number of readily available tools and the faster pace of obtaining results. However, the lower...
Artificial Intelligence (AI) in medicine stands at the cusp of revolutionizing clinician reasoning and decision-making. Since its foundational years in the mid-20th century, the progression of medical AI has seen considerable advancements, concurrently grappling with various challenges. Early attempts at AI showcased immense potential, yet faced hu...
The selection and tuning of feature selection, feature engineering, and classification or regression algorithms is a major challenge in machine learning, affecting both beginners and experts. Automated machine learning (AutoML) offers a solution by automating the creation of machine learning pipelines, eliminating the guesswork associated with a ma...
Brain imaging genetics examines associations between imaging quantitative traits (QTs) and genetic factors such as single nucleotide polymorphisms (SNPs) to provide important insights into the pathogenesis of Alzheimer’s disease (AD). Given the high dimensionality, the individual level SNP‐QT signals typically have small effect sizes and are hard...
Investigating the relationship between genetic variation and phenotypic traits is a key issue in quantitative genetics. Specifically for Alzheimer's disease, the association between genetic markers and quantitative traits remains vague but, once identified, will provide valuable guidance for the study and development of genetic-based treatment ap...
Automated machine learning (AutoML) algorithms have grown in popularity due to their high performance and flexibility to adapt to different problems and data sets. With the increasing number of AutoML algorithms, deciding which would best suit a given problem becomes increasingly more work. Therefore, it is essential to use complex and challenging...
Objectives
Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.
Materials and Methods
We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to...
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial to determine their scope of application. Here, we introduce the Diverse and Generative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of ML algorithms for classification of binary outcom...
Evolutionary multi-agent systems (EMASs) are very good at dealing with difficult, multi-dimensional problems; their efficacy has been proven theoretically through analysis of the relevant Markov-chain-based model. Research now continues on introducing autonomous hybridization into EMAS. This paper focuses on a proposed hybrid version of the EMAS, a...
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWAS). Standard GWAS are well-powered to interrogate additive models; however, new approaches are required to investigate other modes of inheritance such as dominance...
In computational toxicology, prediction of complex endpoints has always been challenging, as they often involve multiple distinct mechanisms. State-of-the-art models are either limited by low accuracy, or lack of interpretability due to their black-box nature. Here we introduce AIDTox, an interpretable deep learning model which incorporates curated...
The rapid increase of interest in, and use of, artificial intelligence (AI) in computer applications has raised a parallel concern about its ability (or lack thereof) to provide understandable, or explainable, output to users. This concern is especially legitimate in biomedical contexts, where patient safety is of paramount importance. This positio...
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning me...
Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive, fibrosing interstitial pneumonia of unknown etiology. The role of genetic risk factors has been the focus of numerous studies probing for associations of genetic variants with IPF. We aimed to determine whether single-nucleotide polymorphisms (SNPs) of four candidate genes are associate...
Brain imaging genetics examines associations between imaging quantitative traits (QTs) and genetic factors such as single nucleotide polymorphisms (SNPs) to provide important insights into the pathogenesis of Alzheimer’s disease (AD). The individual level SNP-QT signals are high dimensional and typically have small effect sizes, making them hard to...
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments....
Background
Alzheimer’s disease (AD) is a complex neurodegenerative disorder and the most common type of dementia. AD is characterized by a decline of cognitive function and brain atrophy, and is highly heritable with estimated heritability ranging from 60 to 80%. The most straightforward and widely used strategy to identify AD genetic basi...
Objective
For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information.
Materials and...
In drug development, a major reason for attrition is the lack of understanding of cellular mechanisms governing drug toxicity. The black-box nature of conventional classification models has limited their utility in identifying toxicity pathways. Here we developed DTox (deep learning for toxicology), an interpretation framework for knowledge-guided...
The opioid epidemic continues to contribute to loss of life through overdose and significant social and economic burdens. Many individuals who develop problematic opioid use (POU) do so after being exposed to prescribed opioid analgesics. Therefore, it is important to accurately identify and classify risk factors for POU. In this review, we discuss...
ComptoxAI is a new data infrastructure for computational and artificial intelligence research in predictive toxicology. Here, we describe and showcase ComptoxAI's graph-structured knowledge base in the context of three real-world use-cases, demonstrating that it can rapidly answer complex questions about toxicology that are infeasible using previou...
Integrating data across institutions can improve learning efficiency. To integrate data efficiently while protecting privacy, we propose A one-shot, summary-statistics-based, Distributed Algorithm for fitting Penalized (ADAP) regression models across multiple datasets. ADAP utilizes patient-level data from a lead site and incorporates the first-ord...
When seeking a predictive model in biomedical data, one often has more than a single objective in mind, e.g., attaining both high accuracy and low complexity (to promote interpretability). We investigate herein whether multiple objectives can be dynamically tuned by our recently proposed coevolutionary algorithm, SAFE (Solution And Fitness Evolutio...
We recently highlighted a fundamental problem recognized to confound algorithmic optimization, namely, conflating the objective with the objective function. Even when the former is well defined, the latter may not be obvious, e.g., in learning a strategy to navigate a maze to find a goal (objective), an effective objective function to ...
We have recently presented SAFE (Solution And Fitness Evolution), a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions. We showed that SAFE was successful at evolving solutions within a robotic maze domain. Herein we present an i...
Modifying standard gradient boosting by replacing the embedded weak learner with a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages (between 2 and 5) to a symbolic regressor, statistically significant improvements can often be attain...
Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates acr...
Questions
Question (1)
Replication has become the gold standard in genetic association studies. In fact, it has become so set in stone that it is difficult to publish a paper without it. We published the following paper a few years ago showing that power to replicate under an epistasis model can drop from more than 80% to less than 20% with very small changes in allele frequencies in replication data. This paper shows, using simulation, that lack of replication can provide misleading evidence in support of the null hypothesis of no association. It is interesting to note that multiple top genetics journals refused to review the paper and several geneticists from major GWAS consortia told us explicitly not to publish it because it might 'confuse' people. I am worried that too much emphasis is being placed on statistical replication. The real value of any genetic association is whether someone is convinced to spend the money and time to experimentally validate a finding. Replication certainly helps but is not the only piece of evidence that should be considered.
Greene CS, Penrod NM, Williams SM, Moore JH. Failure to replicate a genetic association may provide important clues about genetic architecture. PLoS One. 2009 Jun 2;4(6):e5639. PubMed PMID: 19503614