Kristin P. Bennett

  • PhD University of Wisconsin-Madison
  • Professor at Rensselaer Polytechnic Institute

About

223
Publications
60,765
Reads
11,723
Citations
Current institution
Rensselaer Polytechnic Institute
Current position
  • Professor


Publications (223)
Preprint
Full-text available
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline fe...
Article
Full-text available
MortalityMinder enables healthcare researchers, providers, payers, and policy makers to gain actionable insights into where and why premature mortality rates due to all causes, cancer, cardiovascular disease, and deaths of despair rose between 2000 and 2017 for adults aged 25–64. MortalityMinder is designed as an open-source web-based visualization...
Article
We propose a survival analysis approach for discovering and characterizing user behavior and risks for lending protocols in decentralized finance (DeFi). We demonstrate how to gather and prepare DeFi transaction data for survival analysis. We illustrate our approach using transactions in Aave, one of the largest lending protocols. We develop a DeFi...
Chapter
The emerging decentralized financial ecosystem (DeFi) is comprised of numerous protocols, one type being lending protocols. People make transactions in lending protocols, each of which is attributed to a specific blockchain address which could represent an externally-owned account (EOA) or a smart contract. Using Aave, one of the largest lending pr...
Article
Full-text available
Disparities in healthcare access and utilization associated with demographic and socioeconomic status hinder advancement of health equity. Thus, we designed a novel equity-focused approach to quantify variations of healthcare access/utilization from the expectation in national target populations. We additionally applied survey-weighted logistic reg...
Chapter
As concerns have grown about bias in ML models, the field of ML fairness has expanded considerably beyond classification. Researchers now propose fairness metrics for regression, but unlike classification there is no literature review of regression fairness metrics and no comprehensive resource to define, categorize, and compare them. To address th...
Article
Full-text available
Background: Clinical decision support systems have been widely deployed to guide healthcare decisions on patient diagnosis, treatment choices, and patient management through evidence-based recommendations. These recommendations are typically derived from clinical practice guidelines created by clinical specialties or healthcare organizations. Althou...
Article
Streams of irregularly occurring events are commonly modeled as a marked temporal point process. Many real-world datasets such as e-commerce transactions and electronic health records often involve events where multiple event types co-occur, e.g. multiple items purchased or multiple diseases diagnosed simultaneously. In this paper, we tackle multi-...
Article
Full-text available
Introduction and aims: Dietary Rational Gene Targeting (DRGT) is a therapeutic dietary strategy that uses healthy dietary agents to modulate the expression of disease-causing genes back toward the normal. Here we use the DRGT approach to (1) identify human studies assessing gene expression after ingestion of healthy dietary agents with an emphasis o...
Preprint
Full-text available
Disparities in healthcare access and utilization associated with demographic and socioeconomic status hinder advancement of health equity. Thus, we designed a novel equity-focused approach to quantify variations of healthcare access/utilization from the expectation in national target populations. We additionally applied survey-weighted logistic reg...
Chapter
We propose a decentralized finance (DeFi) survival analysis approach for discovering and characterizing user behavior and risks in lending protocols. We demonstrate how to gather and prepare DeFi transaction data for survival analysis. We demonstrate our approach using transactions in AAVE, one of the largest lending protocols. We develop a DeFi su...
Article
Full-text available
Randomized clinical trial (RCT) studies are the gold standard for scientific evidence on treatment benefits to patients. RCT outcomes may not be generalizable to clinical practice if the trial population is not representative of the patients for which the treatment is intended. Specifically, enrollment plans may not adequately include groups of pat...
Article
Full-text available
Circadian rhythms broadly regulate physiological functions by tuning oscillations in the levels of mRNAs and proteins to the 24-hour day/night cycle. Globally assessing which mRNAs and proteins are timed by the clock necessitates accurate recognition of oscillations in RNA and protein data, particularly in large omics data sets. Tools that employ f...
Article
Full-text available
Objectives: We hypothesize that identification of healthy whole foods that modulate disease-causing gene expression back toward the normal is a low-cost, healthy, and readily-translatable alternative and/or complementary approach to costly and sometimes toxic pharmaceutical drugs. Our objectives are (1) to identify human studies assessing gene expre...
Conference Paper
The way people respond to messaging from public health organizations on social media can provide insight into public perceptions on critical health issues, especially during a global crisis such as COVID-19. It could be valuable for high-impact organizations such as the US Centers for Disease Control and Prevention (CDC) or the World Health Organiz...
Article
This paper reports on Data Analytics Research (DAR), a course-based undergraduate research experience (CURE) in which undergraduate students conduct data analysis research on open real-world problems for industry, university, and community clients. We describe how DAR, offered by the Mathematical Sciences Department at Rensselaer Polytechnic Instit...
Preprint
Full-text available
Circadian rhythms broadly regulate physiological functions by tuning oscillations in the levels of mRNAs and proteins to the 24-hour day/night cycle. Globally assessing which mRNAs and proteins are timed by the clock necessitates accurate recognition of oscillations in RNA and protein data, particularly in large omics data sets. Tools that employ f...
Preprint
Full-text available
The way people respond to messaging from public health organizations on social media can provide insight into public perceptions on critical health issues, especially during a global crisis such as COVID-19. It could be valuable for high-impact organizations such as the US Centers for Disease Control and Prevention (CDC) or the World Health Organiz...
Article
Access to private medical data is restricted due to privacy laws, hindering research and real-world use. Synthetic data generation provides a viable solution by generating data with high utility and privacy protection without releasing the real data. Healthcare data records are often longitudinal in nature, being affected by covariates like age, ge...
Preprint
This paper evaluates synthetically generated healthcare data for biases and investigates the effect of fairness mitigation techniques on utility-fairness. Privacy laws limit access to health data such as Electronic Medical Records (EMRs) to preserve patient privacy. Albeit essential, these laws hinder research reproducibility. Synthetic data is a v...
Article
Full-text available
Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically-generated healthcare data solve this problem by preserving privacy and ena...
Article
Full-text available
Objective: We help identify subpopulations underrepresented in randomized clinical trial (RCT) cohorts with respect to national, community-based or health system target populations by formulating population representativeness of RCTs as a machine learning (ML) fairness problem, deriving new representation metrics, and deploying them in easy-to-und...
Preprint
Full-text available
Objective: We formulate population representativeness of randomized clinical trials (RCTs) as a machine learning (ML) fairness problem, derive new representation metrics, and deploy them in visualization tools which help users identify subpopulations that are underrepresented in RCT cohorts with respect to national, community-based or health system...
Preprint
Full-text available
This study examines how social determinants associated with COVID-19 mortality change over time. Using US county-level data from July 5 and December 28, 2020, the effect of 19 high-risk factors on COVID-19 mortality rate was quantified at each time point with negative binomial mixed models. Then, these high-risk factors were used as controls in two...
Article
In this exploratory study, we scrutinize a database of over one million tweets collected from March to July 2020 to illustrate public attitudes towards mask usage during the COVID-19 pandemic. We employ natural language processing, clustering and sentiment analysis techniques to organize tweets relating to mask-wearing into high-level themes, then...
Chapter
Generating synthetic data represents an attractive solution for creating open data, enabling health research and education while preserving patient privacy. We reproduce the research outcomes obtained on two previously published studies, which used private health data, using synthetic data generated with a method that we developed, called HealthGAN...
Article
Motivation: Circadian rhythms are approximately 24 hour endogenous cycles that control many biological functions. To identify these rhythms, biological samples are taken over circadian time and analyzed using a single omics type, such as transcriptomics or proteomics. By comparing data from these single omics approaches, it has been shown that trans...
Chapter
Medical data is rarely made publicly available due to high de-identification costs and risks. Access to such data is highly regulated due to its sensitive nature. These factors impede the development of data-driven advancements in the healthcare domain. Synthetic medical data which can maintain the utility of the real data while simultaneously pre...
Preprint
Full-text available
This study examines social determinants associated with disparities in COVID-19 mortality rates in the United States. Using county-level data, 42 negative binomial mixed models were used to evaluate the impact of social determinants on COVID-19 outcome. First, to identify proper controls, the effect of 24 high-risk factors on COVID-19 mortality rate...
Preprint
Full-text available
In this exploratory study, we scrutinize a database of over 1 million tweets collected across the first five months of 2020 to draw conclusions about public attitudes towards the preventative measure of mask usage during the COVID-19 pandemic. In recent months, a body of literature has emerged to suggest the robustness of trends in online activity...
Article
We propose a machine learning driven approach to derive insights from observational healthcare data to improve public health outcomes. Our goal is to simultaneously identify patient subpopulations with differing health risks and to find those risk factors within each subpopulation. We develop two supervised mixture of experts models: a Supervised G...
Preprint
Motivation: Circadian rhythms are approximately 24 hour endogenous cycles that control many biological functions. To identify these rhythms, biological samples are taken over circadian time and analyzed using a single omics type, such as transcriptomics or proteomics. By comparing data from these single omics approaches, it has been shown that tran...
Article
Full-text available
We develop metrics for measuring the quality of synthetic health data for both education and research. We use novel and existing metrics to capture a synthetic dataset’s resemblance, privacy, utility and footprint. Using these metrics, we develop an end-to-end workflow based on our generative adversarial network (GAN) method, HealthGAN, that create...
Preprint
Full-text available
Synthetic medical data which preserves privacy while maintaining utility can be used as an alternative to real medical data, which has privacy costs and resource constraints associated with it. At present, most models focus on generating cross-sectional health data which is not necessarily representative of real data. In reality, medical data is lo...
Chapter
Full-text available
Treatment recommendations within Clinical Practice Guidelines (CPGs) are largely based on findings from clinical trials and case studies, referred to here as research studies, that are often based on highly selective clinical populations, referred to here as study cohorts. When medical practitioners apply CPG recommendations, they need to understan...
Conference Paper
Full-text available
Circadian rhythms are 24-hour biological cycles that control daily molecular rhythms in many organisms. The cellular elements that fall under the regulation of the clock are often studied through the use of omics-scale data sets gathered over time to determine how circadian regulation impacts cellular physiology. Previously, we created the ECHO (Ex...
Article
Full-text available
Motivation: Time courses utilizing genome scale data are a common approach to identifying the biological pathways that are controlled by the circadian clock, an important regulator of organismal fitness. However, the methods used to detect circadian oscillations in these datasets are not able to accommodate changes in the amplitude of the oscillat...
Preprint
Full-text available
Treatment recommendations within Clinical Practice Guidelines (CPGs) are largely based on findings from clinical trials and case studies, referred to here as research studies, that are often based on highly selective clinical populations, referred to here as study cohorts. When medical practitioners apply CPG recommendations, they need to understan...
Preprint
Full-text available
Treatment recommendations within Clinical Practice Guidelines (CPGs) are largely based on findings from clinical trials and case studies, referred to here as research studies, that are often based on highly selective clinical populations, referred to here as study cohorts. When medical practitioners apply CPG recommendations, they need to understan...
Preprint
Motivation: Time courses utilizing genome scale data are a common approach to identifying the biological pathways that are controlled by the circadian clock, an important regulator of organismal fitness. However, the methods used to detect circadian oscillations in these datasets are not able to accommodate changes in the amplitude of the oscillatio...
Article
We consider the problem in precision health of grouping people into subpopulations based on their degree of vulnerability to a risk factor. These subpopulations cannot be discovered with traditional clustering techniques because their quality is evaluated with a supervised metric: the ease of modeling a response variable for observations within the...
Conference Paper
This paper builds on the results of the ESANN 2019 conference paper "Privacy Preserving Synthetic Health Data" [16], which develops metrics for assessing privacy and utility of synthetic data and models. The metrics laid out in the initial paper show that utility can still be achieved in synthetic data while maintaining both privacy of the model an...
Conference Paper
Full-text available
We develop a semantics-driven, automated approach for dynamically performing rigorous scientific studies. This framework may be applied to a wide variety of data and study types; here, we demonstrate its suitability for conducting retrospective cohort studies using publicly available population health data. The goal is to identify risk factors that...
Article
Full-text available
Increased understanding of developmental disorders of the brain has shown that genetic mutations, environmental toxins and biological insults typically act during developmental windows of susceptibility. Identifying these vulnerable periods is a necessary and vital step for safeguarding women and their fetuses against disease causing agents during...
Article
Feature selection is of great importance for two possible scenarios: (1) prediction, i.e., improving (or minimally degrading) the predictions of a target variable while discarding redundant or uninformative features and (2) discovery, i.e., identifying features that are truly dependent on the target and may be genuine causes to be determined in exp...
Preprint
Full-text available
One primary task of population health analysis is the identification of risk factors that, for some subpopulation, have a significant association with some health condition. Examples include finding lifestyle factors associated with chronic diseases and finding genetic mutations associated with diseases in precision health. We develop a combined se...
Chapter
Full-text available
With the rapid advancements in cancer research, the information that is useful for characterizing disease, staging tumors, and creating treatment and survivorship plans has been changing at a pace that creates challenges when physicians try to remain current. One example involves increasing usage of biomarkers when characterizing the pathologic pro...
Preprint
We consider the problem in precision health of grouping people into subpopulations based on their degree of vulnerability to a risk factor. These subpopulations cannot be discovered with traditional clustering techniques because their quality is evaluated with a supervised metric: the ease of modeling a response variable over observations within th...
Preprint
Full-text available
We present a new "grey-box" approach to anomaly detection in smart manufacturing. The approach is designed for tools run by control systems which execute recipe steps to produce semiconductor wafers. Multiple streaming sensors capture trace data to guide the control systems and for quality control. These control systems are typically PI controllers...
Preprint
Full-text available
With the rapid advancements in cancer research, the information that is useful for characterizing disease, staging tumors, and creating treatment and survivorship plans has been changing at a pace that creates challenges when physicians try to remain current. One example involves increasing usage of biomarkers when characterizing the pathologic pro...
Article
We consider the problem in regression analysis of identifying subpopulations that exhibit different patterns of response, where each subpopulation requires a different underlying model. Unlike statistical cohorts, these subpopulations are not known a priori; thus, we refer to them as cadres. When the cadres and their associated models are interpret...
Conference Paper
Circadian rhythms are endogenous cycles of approximately 24 hours reinforced by external cues such as light. These cycles are typically modeled as harmonic oscillators with fixed amplitude peaks. Using experimental data measuring global gene transcription in Neurospora crassa over 48 hours in the dark (i.e. with external cues removed), we demonst...
Conference Paper
By treating the end-to-end data science workflow as data itself and through the conceptual modeling of the goals and functional intent of the data analyst, the entire process of data analytics becomes open and accessible to the powerful tools of artificial intelligence, machine learning, statistics, and data mining. We examine the fundamental quest...
Conference Paper
Infection by Mycobacterium tuberculosis complex (MTB) produces either active tuberculosis (TB) disease or latent infections without symptoms with about a 10% lifetime risk of developing disease. We hypothesize that MTB lineages may have different latent reactivation phenotypes, and that these different phenotypes would be reflected in different dis...
Conference Paper
New NIH grants require establishing scientific rigor, i.e. applicants must provide evidence of strict application of the scientific method to ensure robust and unbiased experimental design, methodology, analysis, interpretation and reporting of results. Researchers must transparently report experimental details so others may reproduce and extend fi...
Article
Electronic Healthcare Records (EHRs) have the potential to improve healthcare quality and to decrease costs by providing quality metrics, discovering actionable insights, and supporting decision-making to improve future outcomes. Within the United States Medicaid Program, rates of recidivism among emergency department (ED) patients serve as metrics...
Conference Paper
ChaLearn is organizing the Automatic Machine Learning (AutoML) contest for the IJCNN 2015, which challenges participants to solve classification and regression problems without any human intervention. Participants' code is automatically run on the contest servers to train and test learning machines. However, there is no obligation to submit code. H...
Article
Full-text available
We develop a novel approach for incorporating expert rules into Bayesian networks for classification of Mycobacterium tuberculosis complex (MTBC) clades. The proposed knowledge-based Bayesian network (KBBN) treats sets of expert rules as prior distributions on the classes. Unlike prior knowledge-based support vector machine approaches which require...
Article
Computational methods that can identify CYP-mediated sites of metabolism (SOMs) of drug-like compounds have become required tools for early stage lead optimization. In recent years, methods that combine CYP binding site features with CYP/ligand binding information have been sought in order to increase the prediction accuracy of such hybrid models o...
Article
Full-text available
We propose a novel optimization-based approach to embedding heterogeneous high-dimensional data characterized by a graph. The goal is to create a two-dimensional visualization of the graph structure such that edge-crossings are minimized while preserving proximity relations between nodes. This paper provides a fundamentally new approach for address...
Article
Full-text available
Biomarkers of Mycobacterium tuberculosis complex (MTBC) mutate over time. Among the biomarkers of MTBC, spacer oligonucleotide type (spoligotype) and mycobacterium interspersed repetitive unit (MIRU) patterns are commonly used to genotype clinical MTBC strains. In this study, we present an evolution model of spoligotype rearrangements using MIRU pa...
Article
Full-text available
In this study, we present host-pathogen associations of tuberculosis by incorporating genetic proximity between MTBC strains, spatial proximity between TB patients, and time into domain knowledge via Unified Biclustering Framework (UBF). We simultaneously factorize multiple sources of information in various forms and obtain biclusters which represe...
Article
This paper formulates a set of rules to classify genotypes of the Mycobacterium tuberculosis complex (MTBC) into major lineages using spoligotypes and MIRU-VNTR results. The rules synthesize prior literature that characterizes lineages by spacer deletions and variations in the number of repeats seen at locus MIRU24 (alias VNTR2687). A tool that eff...
Article
Full-text available
Signal timing information is important in signal operations and signal/arterial performance measurement. Such information, however, may not be available for wide areas. This imposes difficulty, particularly for real-time signal/arterial performance measurement and traffic information provisions that have received much attention recently. We study...
Article
Full-text available
The successful application of Support Vector Machines (SVMs), kernel methods and other statistical machine learning methods requires selection of model parameters based on estimates of the generalization error. This paper presents a novel approach to systematic model selection through bilevel optimization. We show how modelling tasks for widely us...
Article
Full-text available
We propose a novel approach for incorporating prior knowledge into the online binary support vector classification problem. An existing advice-taking approach, when prior knowledge is in the form of polyhedral knowledge sets in input space of data, is via knowledge-based support vector machines (KB-SVMs). We adopt the formalism of passive-aggres...
Article
RS-Predictor is a tool for creating pathway-independent, isozyme-specific, site of metabolism (SOM) prediction models using any set of known cytochrome P450 (CYP) substrates and metabolites. Until now, the RS-Predictor method was only trained and validated on CYP 3A4 data, but in the present study, we report on the versatility of the RS-Predictor mode...
Article
Least-squares fitting of the Hill equation to quantitative high-throughput screening (qHTS) assays results in frequent unsatisfactory fits. We learn and exploit prior knowledge to improve the Hill fitting in a nonlinear regression method called domain knowledge fitter (DK-fitter). This paper formulates and solves DK-fitter for 44 public qHTS data s...
Article
Full-text available
Biomarkers of Mycobacterium tuberculosis complex (MTBC) mutate over time. Among the biomarkers of MTBC, spacer oligonucleotide type (spoligotype) and Mycobacterium Interspersed Repetitive Unit (MIRU) patterns are commonly used to genotype clinical MTBC strains. In this study, we present an evolution model of spoligotype rearrangements using MIRU pa...
Chapter
Full-text available
Why Use Descriptors? Challenges Faced by Molecular Descriptors when Applied to Biological Systems. Examples of Descriptors for Biological Systems and Applications to Modeling Problems. Summary and Conclusions. References.
Conference Paper
Full-text available
DNA fingerprints of Mycobacterium tuberculosis complex bacteria (MTBC) are routinely gathered from tuberculosis (TB) patient isolates for all TB patients in the United States to support TB tracking and control efforts, but few tools are available for visualizing and discovering host-pathogen relationships. We present a new visualization approach, h...
Article
We present a bundle algorithm for multiple-instance classification and ranking. These frameworks yield improved models on many problems possessing special structure. Multiple-instance loss functions are typically nonsmooth and nonconvex, and current algorithms convert these to smooth nonconvex optimization problems that are solved iteratively. Insp...
Article
Full-text available
This paper introduces two types of nonsmooth optimization methods for selecting model hyperparameters in primal SVM models based on cross-validation. Unlike common grid search approaches for model selection, these approaches are scalable both in the number of hyperparameters and number of data points. Taking inspiration from linear-time primal SVM...
Article
We propose a novel approach to drawing graphs that simultaneously optimizes two criteria (i) preserving proximity relations as measured by some embedding objective, and (ii) minimizing edge-crossings, to create a clear representation of the underlying graph structure. Frequently, the nodes of the graph represent objects that have their own intrinsi...
Article
In this study we explore publicly available web tools designed to use molecular epidemiological data to extract information that can be employed for the effective tracking and control of tuberculosis (TB). The application of molecular methods for the epidemiology of TB complements traditional approaches used in public health. DNA fingerprinting meth...
Article
Full-text available
This paper presents regression models obtained from a process of blind prediction of peptide binding affinity from provided descriptors for several distinct datasets as part of the 2006 Comparative Evaluation of Prediction Algorithms (COEPRA) contest. This paper finds that kernel partial least squares, a nonlinear partial least squares (PLS) algori...
