Michael Steinbach
University of Minnesota | UMN · Department of Computer Science and Engineering
Ph.D. Computer Science
187 Publications
127,192 Reads
24,260 Citations
Additional affiliations
September 1998 - present
Education
September 1998 - October 2005
Publications (187)
We present a knowledge-guided machine learning (KGML) framework for modeling multi-scale processes, and study its performance in the context of streamflow forecasting in hydrology. Specifically, we propose a novel hierarchical recurrent neural architecture that factorizes the system dynamics at multiple temporal scales and captures their interactio...
Applying Deep Learning (DL) models to graphical causal learning has brought outstanding effectiveness and efficiency but is still far from widespread use in domain sciences. In research on Electronic Healthcare Records (EHR), we observe that some confounding bias inherently exists in the causally formed data, which DL cannot automatically adjust for. T...
In many environmental applications, recurrent neural networks (RNNs) are often used to model physical variables with long temporal dependencies. However, due to mini-batch training, temporal relationships between training segments within the batch (intra-batch) as well as between batches (inter-batch) are not considered, which can lead to limited p...
Streamflow prediction is a long‐standing hydrologic problem. Development of models for streamflow prediction often requires incorporation of catchment physical descriptors to characterize the associated complex hydrological processes. Across different scales of catchments, these physical descriptors also allow models to extrapolate hydrologic infor...
There is a growing consensus that solutions to complex science and engineering problems require novel methodologies that are able to integrate traditional physics-based modeling approaches with state-of-the-art machine learning (ML) techniques. This paper provides a structured overview of such techniques. Application-centric objective areas for whi...
Objective:
Hospital-acquired infections (HAIs) are associated with significant morbidity, mortality, and prolonged hospital length of stay. Risk prediction models based on pre- and intraoperative data have been proposed to assess the risk of HAIs at the end of the surgery, but the performance of these models lags behind HAI detection models based o...
Major societal and environmental challenges involve complex systems that have diverse multi-scale interacting processes. Consider, for example, how droughts and water reserves affect crop production and how agriculture and industrial needs affect water quality and availability. Preventive measures, such as delaying planting dates and adopting new a...
Objective
The association of body mass index (BMI) and all-cause mortality is controversial, frequently referred to as a paradox. Whether the cause is metabolic factors or statistical biases is still controversial. We assessed the association of BMI and all-cause mortality considering a wide range of comorbidities and baseline mortality risk.
Meth...
Diseases can show different courses of progression even when patients share the same risk factors. Recent studies have revealed that the use of trajectories, the order in which diseases manifest throughout life, can be predictive of the course of progression. In this study, we propose a novel computational method for learning disease trajectories f...
Physics-based models are often used to study engineering and environmental systems. The ability to model these systems is the key to achieving our future environmental sustainability and improving the quality of human life. This article focuses on simulating lake water temperature, which is critical for understanding the impact of changing climate...
Causal inference is a powerful statistical methodology for explanatory analysis and individualized treatment effect (ITE) estimation, a prominent causal inference task that has become a fundamental research problem. ITE estimation, when performed naively, tends to produce biased estimates. To obtain unbiased estimates, counterfactual information is...
This paper proposes a physics-guided machine learning approach that combines advanced machine learning models and physics-based models to improve the prediction of water flow and temperature in river networks. We first build a recurrent graph network model to capture the interactions among multiple segments in the river network. Then we present a p...
Importance
Clinical domain knowledge about diseases and their comorbidities, severity, treatment pathways, and outcomes can facilitate diagnosis, enhance preventive strategies, and help create smart evidence-based practice guidelines.
Objective
To introduce a new representation of patient data called disease severity hierarchy that leverages domai...
In this manuscript, we provide a structured and comprehensive overview of techniques to integrate machine learning with physics-based modeling. First, we provide a summary of application areas for which these approaches have been applied. Then, we describe classes of methodologies used to construct physics-guided machine learning models and hybrid...
Background:
The ubiquity of electronic health records (EHR) offers an opportunity to observe trajectories of laboratory results and vital signs over long periods of time. This study assessed the value of risk factor trajectories available in the electronic health record to predict incident type 2 diabetes.
Study design and methods:
Analysis was...
Physics-based models of dynamical systems are often used to study engineering and environmental systems. Despite their extensive use, these models have several well-known limitations due to simplified representations of the physical processes being modeled or challenges in selecting appropriate parameters. While state-of-the-art machine learning mo...
Our aging population increasingly suffers from multiple chronic diseases simultaneously, necessitating the comprehensive treatment of these conditions. Finding the optimal set of drugs for a combinatorial set of diseases is a combinatorial pattern exploration problem. Association rule mining is a popular tool for such problems, but the requirement...
The rapid growth of data in water resources has created new opportunities to accelerate knowledge discovery with the use of advanced deep learning tools. Hybrid models that integrate theory with state‐of‐the‐art empirical techniques have the potential to improve predictions while remaining true to physical laws. This paper evaluates the Process‐Gui...
Genetic interactions have been reported to underlie phenotypes in a variety of systems, but the extent to which they contribute to complex disease in humans remains unclear. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions, but existing methods for identifying them from GWAS data tend to foc...
Different analytic techniques operate optimally with different types of data. As the use of EHR-based analytics expands to newer tasks, data will have to be transformed into different representations, so the tasks can be optimally solved. We classified representations into broad categories based on their characteristics, and proposed a new knowledg...
The ability to assess data quality is essential for secondary use of EHR data and an automated Healthcare Data Quality Framework (HDQF) can be used as a tool to support a healthcare organization's data quality initiatives. Use of a general purpose HDQF provides a method to assess and visualize data quality to quickly identify areas for improvement....
This paper proposes a physics-guided recurrent neural network model (PGRNN) that combines RNNs and physics-based models to leverage their complementary strengths and improve the modeling of physical processes. Specifically, we show that a PGRNN can improve prediction accuracy over that of physical models, while generating outputs consistent with ph...
In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong li...
In this paper, we introduce a novel framework for combining scientific knowledge within physics-based models and recurrent neural networks to advance scientific discovery in many dynamical systems. We will first describe the use of outputs from physics-based models in learning a hybrid-physics-data model. Then, we further incorporate physical knowl...
Objectives:
To specify when delays of specific 3-hour bundle Surviving Sepsis Campaign guideline recommendations applied to severe sepsis or septic shock become harmful and impact mortality.
Design:
Retrospective cohort study.
Setting:
One health system composed of six hospitals and 45 clinics in a Midwest state from January 01, 2011, to July...
Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling...
The true onset time of a disease, particularly slow-onset diseases like Type 2 diabetes mellitus (T2DM), is rarely observable in electronic health records (EHRs). However, it is critical for analysis of time to events and for studying sequences of diseases. The aim of this study is to demonstrate a method for estimating the onset time of such disea...
(Article has been published in ACM Computing Surveys. Please find it at: https://dl.acm.org/citation.cfm?id=3127881)
The continuously increasing cost of the US healthcare system has received significant attention. Central to the ideas aimed at curbing this trend is the use of technology, in the form of the mandate to implement electronic health re...
The continuously increasing cost of the US healthcare system has received significant attention. Central to the ideas aimed at curbing this trend is the use of technology, in the form of the mandate to implement electronic health records (EHRs). EHRs consist of patient information such as demographics, medications, laboratory test results, diagnosi...
Methods:
Propensity score matching was used to select subsets of the cohorts with balanced covariates. Bootstrapping was performed to build distributions of the measured difference in rates/means. The effect of the sepsis intervention was evaluated for all patients, and High and Low Risk subgroups for illness severity. A separate analysis was perf...
Sepsis incidents doubled from 2000 through 2008, and hospitalizations for these diagnoses have increased by 70%. The use of the Surviving Sepsis Campaign (SSC) guidelines can lead to earlier diagnosis and treatment; however, the effectiveness of the SSC guidelines in preventing complications for this population is unclear. The overall purpose...
Introduction:
Pulmonary complications due to infection and Idiopathic Pneumonia Syndrome (IPS), a non-infectious lung injury in the hematopoietic stem cell transplant (HSCT) recipients, are frequent causes of transplant-related mortality and morbidity. Our objective was to characterize the global bronchoalveolar lavage fluid (BALF) protein express...
Disease progression models, statistical models that assess a patient's risk of diabetes progression, are popular tools in clinical practice for prevention and management of chronic conditions. Most, if not all, models currently in use are based on gold standard clinical trial data. The relatively small sample size available from clinical trial limi...
Electronic Health Records (EHRs) consist of patient information such as demographics, medications, laboratory test results, diagnosis codes and procedures. Mining EHRs could lead to improvement in patient healthcare management as EHRs contain detailed information related to disease prognosis for large patient populations. We hypothesize that pati...
Mobility is critical for self-management. Understanding factors associated with improvement in mobility during home healthcare can help nurses tailor interventions to improve mobility outcomes and keep patients safely at home.
The aims were to (a) identify patient and support system factors associated with mobility improvement during home care, (b)...
Non-small cell lung cancers (NSCLCs) harbor thousands of passenger events that hide genetic drivers. Even highly recurrent events in NSCLC, such as mutations in PTEN, EGFR, KRAS, and ALK, are only detected in, at most, 30% of patients. Thus, many unidentified low-penetrant events are causing a significant portion of lung cancers. To detect low-pene...
Diabetes Mellitus (DM) is a prevalent and costly disease and, if not managed effectively, it leads to complications in almost every body system. Evidence-based guidelines for prevention and management of DM exist, but they ignore the trajectory along which the disease developed. With the implementation of electronic health records (EHRs), sufficientl...
Multiple myeloma (MM) is the second-most common hematopoietic malignancy in the United States accounting for 1% of all cancers and 10% of all hematologic malignancies. Despite recent improvements in treatment strategies including the emergence of proteasome inhibitors (PIs) as effective chemotherapeutic agents, MM still remains difficult to cure wi...
In this manuscript, we present connectivity cluster analysis (CoCA), a novel computational framework that takes advantage of the structure of brain networks to magnify reproducible signals and quash noise. Resting state functional Magnetic Resonance Imaging (fMRI) data that is used in estimating functional brain networks is often noisy, leading to...
To adequately address climate change, we need novel data-science methods that account for the spatiotemporal and physical nature of climate phenomena. Only then will we be able to move from statistical analysis to scientific insights.
Acute Respiratory Distress Syndrome (ARDS) continues to have a high mortality. Currently, there are no biomarkers that provide reliable prognostic information to guide clinical management or stratify risk among clinical trial participants. The objective of this study was to probe the bronchoalveolar lavage fluid (BALF) proteome to identify proteins...
The Human Leukocyte Antigen (HLA) genes are some of the most studied genes on the genome. This is due to their importance in bone marrow and solid organ transplantation, as well as their strong associations with many autoimmune, infectious, and inflammatory diseases. As such, they can be a highly valuable asset to clinicians and researchers for eluci...
Advances in earth observation technologies have led to the acquisition of vast volumes of accurate, timely and reliable environmental data which encompass a multitude of information about the land, ocean and atmosphere of the planet. Earth science sensor datasets capture multiple facets of information about natural processes and human activities th...
Histiocytic sarcoma is a rare, aggressive neoplasm that responds poorly to therapy. Histiocytic sarcoma is thought to arise from macrophage precursor cells via genetic changes that are largely undefined. To improve our understanding of the etiology of histiocytic sarcoma we conducted a forward genetic screen in mice using the Sleeping Beauty transp...
Mining patterns from electronic health-care records (EHR) can potentially lead to better and more cost-effective treatments. We aim to find the groups of ICD-9 diagnosis codes from EHRs that can predict the improvement of urinary incontinence of home health care (HHC) patients and also are interpretable to domain experts. In this paper, we propose...
The focus of this paper is to address the problem of discovering groups of time series that share similar behavior in multiple small intervals of time. This problem has two characteristics: i) There are exponentially many combinations of time series that need to be explored to find these groups, ii) The groups of time series of interest need to ha...
Personal genome assembly is a critical process when studying tumor genomes and other highly divergent sequences. The accuracy of downstream analyses, such as RNA-seq and ChIP-seq, can be greatly enhanced by using personal genomic sequences rather than standard references. Unfortunately, reads sequenced from these types of samples often have a heter...
In this paper, we study the problem of anomaly detection with application to aviation systems. We propose a framework for detecting precursors to aviation safety incidents due to human factors based on Hidden Semi-Markov Models (HSMM). We investigate HSMMs due to their inherent ability to model durations in addition to modeling latent state transitio...
The Human Leukocyte Antigen (HLA) gene system plays a crucial role in hematopoietic stem cell transplantation, where patients and donors are matched with respect to their HLA genes in order to maximize the chances of a successful transplant. It is the most polymorphic region of the human genome with some of the strongest associations with autoimmun...
Rationale: In rodent model systems, the sequential changes in lung morphology resulting from hyperoxic injury are well characterized, and are similar to changes in human acute respiratory distress syndrome (ARDS). In the injured lung, alveolar type two (AT2) epithelial cells play a critical role in restoring the normal alveolar structure. Thus charact...
Neuropsychiatric disorders such as schizophrenia, bipolar disorder and Alzheimer's disease are major public health problems. However, despite decades of research, we currently have no validated prognostic or diagnostic tests that can be applied at an individual patient level. Many neuropsychiatric diseases are due to a combination of alterations th...
Pressure dipoles are important long distance climate phenomena (teleconnection) characterized by pressure anomalies of the opposite polarity appearing at two different locations at the same time. Such dipoles have been proven important for understanding and explaining the variability in climate in many regions of the world, e.g. the El Niño Souther...