Joyce Ho

About

100
Publications
9,395
Reads
1,645
Citations

Publications (100)
Preprint
Full-text available
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a blackbo...
Preprint
Full-text available
Background Atrial fibrillation (AF) is one of the most common types of cardiac arrhythmias, often leading to serious health issues such as stroke, heart failure, and higher mortality rates. Its global impact is rising due to aging populations and growing comorbidities, creating an urgent need for more effective treatment methods. AF ablation, a key...
Preprint
Full-text available
Background: Atrial fibrillation (AF) ablation is an effective treatment for reducing episodes and improving quality of life in patients with AF. However, in some patients there are only modest long-term AF-free rates after AF ablation. There is a need to address the limited benefits some patients experience by developing predictive algorithms to im...
Preprint
Full-text available
Pressure injury (PI) detection is challenging, especially in dark skin tones, due to the unreliability of visual inspection. Thermography has been suggested as a viable alternative as temperature differences in the skin can indicate impending tissue damage. Although deep learning models have demonstrated considerable promise toward reliably detecti...
Preprint
Full-text available
Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tack...
Preprint
Full-text available
Pressure injury (PI) detection is challenging, especially in dark skin tones, due to the unreliability of visual inspection. Thermography may serve as a viable alternative as temperature differences in the skin can indicate impending tissue damage. Although deep learning models hold considerable promise toward reliably detecting PI, existing work f...
Preprint
Full-text available
Incidence of postoperative atrial fibrillation (POAF) after cardiac surgery remains high and is associated with adverse patient outcomes. Risk scoring tools have been developed to predict POAF, yet discrimination performance remains moderate. Machine learning (ML) models can achieve better performance but may exhibit performance heterogeneity acros...
Article
Full-text available
Incidence of hospital-acquired pressure injury, a key indicator of nursing quality, is directly proportional to adverse outcomes, increased hospital stays, and economic burdens on patients, caregivers, and society. Thus, predicting hospital-acquired pressure injury is important. Prediction models use structured data more often than unstructured not...
Preprint
Full-text available
There has been a rapid growth in biomedical literature, yet capturing the heterogeneity of the bibliographic information of these articles remains relatively understudied. Although graph mining research via heterogeneous graph neural networks has taken center stage, it remains unclear whether these approaches capture the heterogeneity of the PubMed...
Article
Full-text available
To determine the hepatitis C virus (HCV) care cascade among persons who were born during 1945 to 1965 and received outpatient care on or after January 2014 at a large academic healthcare system. Deidentified electronic health record data in an existing research database were analyzed for this study. Laboratory test results for HCV antibody and HCV...
Preprint
Full-text available
Due to patient privacy protection concerns, machine learning research in healthcare has been undeniably slower and more limited than in other application domains. High-quality, realistic, synthetic electronic health records (EHRs) can be leveraged to accelerate methodological developments for research purposes while mitigating privacy concerns associate...
Article
Full-text available
Aims: Various cardiovascular risk prediction models have been developed for patients with type 2 diabetes mellitus. Yet few models have been validated externally. We perform a comprehensive validation of existing risk models on a heterogeneous population of patients with type 2 diabetes using secondary analysis of electronic health record data. M...
Preprint
Full-text available
Nonlinear acceleration methods are powerful techniques to speed up fixed-point iterations. However, many acceleration methods require storing a large number of previous iterates and this can become impractical if computational resources are limited. In this paper, we propose a nonlinear Truncated Generalized Conjugate Residual method (nlTGCR) whose...
Conference Paper
With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extrac...
Conference Paper
Full-text available
Many modern machine learning algorithms such as generative adversarial networks (GANs) and adversarial training can be formulated as minimax optimization. Gradient descent ascent (GDA) is the most commonly used algorithm due to its simplicity. However, GDA can converge to non-optimal minimax points. We propose a new minimax optimization framework,...
Preprint
Full-text available
Tensor factorization has received increasing interest due to its intrinsic ability to capture latent factors in multi-dimensional data with many applications such as recommender systems and Electronic Health Records (EHR) mining. PARAFAC2 and its variants have been proposed to address irregular tensors where one of the tensor modes is not aligned,...
Preprint
BACKGROUND Patients develop pressure injuries (PIs) in the hospital owing to low mobility, exposure to localized pressure, circulatory conditions, and other predisposing factors. Over 2.5 million Americans develop PIs annually. The Center for Medicare and Medicaid considers hospital-acquired PIs (HAPIs) as the most frequent preventable event, and t...
Conference Paper
Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where the high-dimensional Electronic Health Records (EHRs) with patients' history of medical procedures, medications, diagnoses, lab tests, etc., are converted to meaningful and interpretable medica...
Conference Paper
Representation learning on static graph-structured data has shown a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, in which the edges are often changing over time. The embeddings of such temporal networks should encode both graph-structured information and the t...
Preprint
Full-text available
Many modern machine learning algorithms such as generative adversarial networks (GANs) and adversarial training can be formulated as minimax optimization. Gradient descent ascent (GDA) is the most commonly used algorithm due to its simplicity. However, GDA can converge to non-optimal minimax points. We propose a new minimax optimization framework,...
Preprint
Full-text available
Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where the high-dimensional Electronic Health Records (EHRs) with patients' history of medical procedures, medications, diagnoses, lab tests, etc., are converted to meaningful and interpretable medica...
Preprint
Representation learning on static graph-structured data has shown a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, in which the edges are often changing over time. The embeddings of such temporal networks should encode both graph-structured information and the t...
Chapter
Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered the most challenging and decisive stage in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement, while manual annotation consumes extensive human effort. F...
Article
Objectives This study aimed to compare the concordance of pressure injury (PI) site, stage, and count documented in electronic health records (EHRs); explore if PI count during each patient hospitalization is consistent based on PI site or stage count in the diagnosis or chart event records; and examine if discrepancies in PI count were associated...
Conference Paper
There is an increased adoption of electronic health record systems by a variety of hospitals and medical centers. This provides an opportunity to leverage automated computer systems in assisting healthcare workers. One of the least utilized but richest sources of patient information is the unstructured clinical text. In this work, we develop CATAN, a c...
Conference Paper
To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local contex...
Chapter
Full-text available
Sequential pattern mining can be used to extract meaningful sequences from electronic health records. However, conventional sequential pattern mining algorithms that discover all frequent sequential patterns can incur a high computational cost and be susceptible to noise in the observations. Approximate sequential pattern mining techniques have been in...
Conference Paper
To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local contex...
Article
Existing tensor factorization methods assume that the input tensor follows some specific distribution (e.g., Poisson, Bernoulli, and Gaussian), and solve the factorization by minimizing some empirical loss functions defined based on the corresponding distribution. However, this approach suffers from several drawbacks: 1) In reality, the underlying distribution...
Article
From electronic health records (EHRs), the relationship between patients' conditions, treatments, and outcomes can be discovered and used in various healthcare research tasks such as risk prediction. In practice, EHRs can be stored in one or more data warehouses, and mining from distributed data sources becomes challenging. Another challenge arises...
Chapter
Full-text available
Samples with ground truth labels may not always be available in numerous domains. While learning from crowdsourcing labels has been explored, existing models can still fail in the presence of sparse, unreliable, or differing annotations. Co-teaching methods have shown promising improvements for computer vision problems with noisy labels by employin...
Conference Paper
Full-text available
Modern healthcare systems knitted by a web of entities (e.g., hospitals, clinics, pharmacy companies) are collecting a huge volume of healthcare data from a large number of individuals with various medical procedures, medications, diagnosis, and lab tests. To extract meaningful medical concepts (i.e., phenotypes) from such higher-arity relational h...
Conference Paper
Generating a novel and optimized molecule with desired chemical properties is an essential part of the drug discovery process. Failure to meet one of the required properties can frequently lead to failure in a clinical test which is costly. In addition, optimizing these multiple properties is a challenging task because the optimization of one prope...
Preprint
Full-text available
Samples with ground truth labels may not always be available in numerous domains. While learning from crowdsourcing labels has been explored, existing models can still fail in the presence of sparse, unreliable, or diverging annotations. Co-teaching methods have shown promising improvements for computer vision problems with noisy labels by employin...
Chapter
Information in many real-world applications is inherently multi-modal, sequential and characterized by a variety of missing values. Existing imputation methods mainly focus on the recurrent dynamics in one modality while ignoring the complementary property from other modalities. In this paper, we propose a novel method called cross-modal memory fus...
Chapter
Full-text available
Mining massive spatio-temporal data can help a variety of real-world applications such as city capacity planning, event management, and social network analysis. The tensor representation can be used to capture the correlation between space and time and simultaneously exploit the latent structure of the spatial and temporal patterns in an unsupervis...
Article
Hospital-acquired pressure ulcer injury (PUI) is a primary nursing quality metric, reflecting the caliber of nursing care within a hospital. Prior studies have used the Braden scale and structured data from the electronic health records to detect/predict PUI while the informative unstructured clinical notes have not been used. We propose automated...
Conference Paper
Hospital-acquired pressure ulcer injury (PUI) is a primary nursing quality metric, reflecting the caliber of nursing care within a hospital. Prior studies have used the Braden scale and structured data from the electronic health records to detect/predict PUI while the informative unstructured clinical notes have not been used. We propose automated...
Preprint
Full-text available
Generating a novel and optimized molecule with desired chemical properties is an essential part of the drug discovery process. Failure to meet one of the required properties can frequently lead to failure in a clinical test which is costly. In addition, optimizing these multiple properties is a challenging task because the optimization of one prope...
Article
Full-text available
Background Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing for data analysis. Doing this manually is labor-intensive, slow and subject to error. In this study, we propose an automation framework for extracting and processing unstructured data. Method...
Preprint
Existing tensor factorization methods assume that the input tensor follows some specific distribution (e.g., Poisson, Bernoulli, and Gaussian), and solve the factorization by minimizing some empirical loss functions defined based on the corresponding distribution. However, this approach suffers from several drawbacks: 1) In reality, the underlying distributions...
Conference Paper
Full-text available
Binary data with one-class missing values are ubiquitous in real-world applications. They can be represented by irregular tensors with varying sizes in one dimension, where value one means presence of a feature while zero means unknown (i.e., either presence or absence of a feature). Learning accurate low-rank approximations from such binary irregu...
Preprint
Mining massive spatio-temporal data can help a variety of real-world applications such as city capacity planning, event management, and social network analysis. The tensor representation can be used to capture the correlation between space and time and simultaneously exploit the latent structure of the spatial and temporal patterns in an unsupervis...
Preprint
Mining social media content for tasks such as detecting personal experiences or events suffers from lexical sparsity, insufficient training data, and inventive lexicons. To reduce the burden of creating extensive labeled data and improve classification performance, we propose to perform these tasks in two steps: 1. Decomposing the task into domain-...
Conference Paper
Systematic review (SR) is an essential process to identify, evaluate, and summarize the findings of all relevant individual studies concerning health-related questions. However, conducting an SR is labor-intensive, as identifying relevant studies is a daunting process that entails multiple researchers screening thousands of articles for relevance. I...
Conference Paper
Real-world predictive models in healthcare should be evaluated in terms of discrimination, the ability to differentiate between high and low risk events, and calibration, or the accuracy of the risk estimates. Unfortunately, calibration is often neglected and only discrimination is analyzed. Calibration is crucial for personalized medicine as they...
Article
Analyzing healthcare data poses several challenges including the limited number of samples, missing measurements, noisy labels, and heterogeneous data types. Tree-based boosting is well-suited for modeling such data as it is insensitive to data types and missingness. Moreover, Stochastic Gradient TreeBoost is often found in many winning solutions i...
Preprint
Full-text available
Phenotyping electronic health records (EHR) focuses on defining meaningful patient groups (e.g., heart failure group and diabetes group) and identifying the temporal evolution of patients in those groups. Tensor factorization has been an effective tool for phenotyping. Most of the existing works assume either a static patient representation with ag...
Conference Paper
Tensor factorization has been demonstrated as an efficient approach for computational phenotyping, where massive electronic health records (EHRs) are converted to concise and meaningful clinical concepts. While distributing the tensor factorization tasks to local sites can avoid direct data sharing, it still requires the exchange of intermediary re...
Conference Paper
A vast amount of biomedical literature is generated and digitized every year. As a result, there is a growing need to develop methods for discovering, accessing, and sharing knowledge from medical literature. Keyphrase extraction is the task of summarizing a text by identifying the key concepts. The keyphrases can be single-word or multi-word linguistic u...
Preprint
Tensor factorization has been demonstrated as an efficient approach for computational phenotyping, where massive electronic health records (EHRs) are converted to concise and meaningful clinical concepts. While distributing the tensor factorization tasks to local sites can avoid direct data sharing, it still requires the exchange of intermediary re...
Preprint
Full-text available
Predicting drug-target interactions (DTI) is an essential part of the drug discovery process, which is an expensive process in terms of time and cost. Therefore, reducing DTI cost could lead to reduced healthcare costs for a patient. In addition, a precisely learned molecule representation in a DTI model could contribute to developing personalized...
Article
Full-text available
Distributed semantic representation of biomedical text can be beneficial for text classification, named entity recognition, query expansion, human comprehension, and information retrieval. Despite the success of high-quality vector space models such as Word2Vec and GloVe, they only provide unigram word representations and the semantics for multi-wo...
Article
The epidemiology of cardiovascular disease (CVD) complications in people with diabetes is changing with the increasing prevalence of presentations other than coronary heart disease (CHD) (e.g., heart failure (HF), cardiomyopathy (CM)). Existing CVD risk estimators such as the Framingham Risk Score (FRS), SCORE, and UKPDS Risk Engine primarily asses...
Conference Paper
Full-text available
In the past few decades, there has been rapid growth in the quantity and variety of healthcare data. These large sets of data are usually high dimensional (e.g., patients, their diagnoses, and medications to treat their diagnoses) and cannot be adequately represented as matrices. Thus, many existing algorithms cannot analyze them. To accommodate these...
Chapter
Tensor factorization is a methodology that is applied in a variety of fields, ranging from climate modeling to medical informatics. A tensor is an n-way array that captures the relationship between n objects. These multiway arrays can be factored to study the underlying bases present in the data. Two challenges arising in tensor factorization are 1...
Article
Unstructured data from electronic health records hold potential for improving predictive models for health outcomes. Efforts to extract structured information from the unstructured data used text mining methodologies, such as topic modeling and sentiment analysis. However, such methods do not account for abbreviations. Nursing notes have valuable i...
Article
The rapid growth of electronic health records (EHRs) facilitates the use of clinical pathways, an actionable plan for patients which is represented as sequences of diagnostic records ordered by visit dates. We propose to extract discriminative and representative clinical pathways from EHRs using sequential pattern mining. However, existing sequenti...
Article
Estimating length of stay of intensive care unit patients is crucial to reducing health care costs. This can help physicians intervene at the right time to prevent adverse outcomes for the patients. Moreover, resource allocation can be optimized to ensure appropriate hospital staff levels. Yet the length of stay prediction is very hard, as physicia...
Conference Paper
Full-text available
Estimating length of stay of intensive care unit patients is crucial to reducing health care costs. This can help physicians intervene at the right time to prevent adverse outcomes for the patients. Moreover, resource allocation can be optimized to ensure appropriate hospital staff levels. Yet the length of stay prediction is very hard, as physicia...
Article
A computational phenotype is a set of clinically relevant and interesting characteristics that describe patients with a given condition. Various machine learning methods have been proposed to derive phenotypes in an automatic, high-throughput manner. Among these methods, computational phenotyping through tensor factorization has been shown to produ...
Conference Paper
PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with the varying number of medical encounters over time. Despite recent improvements on unconstrained PARAFAC2, its model factors are usually dense and sensiti...
Preprint
Full-text available
It has been recently shown that sparse, nonnegative tensor factorization of multi-modal electronic health record data is a promising approach to high-throughput computational phenotyping. However, such approaches typically do not leverage available domain knowledge while extracting the phenotypes; hence, some of the suggested phenotypes may not map...
Preprint
Full-text available
Stochastic Gradient TreeBoost is often found in many winning solutions in public data science challenges. Unfortunately, the best performance requires extensive parameter tuning and can be prone to overfitting. We propose PaloBoost, a Stochastic Gradient TreeBoost model that uses novel regularization techniques to guard against overfitting and is r...
Article
Full-text available
Background Researchers are developing methods to automatically extract clinically relevant and useful patient characteristics from raw healthcare datasets. These characteristics, often capturing essential properties of patients with common medical conditions, are called computational phenotypes. Being generated by automated or semiautomated, data-d...
Article
Full-text available
PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is jointly modeling treatments across a set of patients with varying number of medical encounters, where the alignment of events in time bears no clinical meaning, and it may also be impossible to align them...
Preprint
PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with the varying number of medical encounters over time. Despite recent improvements on unconstrained PARAFAC2, its model factors are usually dense and sensiti...
Conference Paper
Full-text available
Extracting patterns and deriving insights from spatio-temporal data finds many target applications in various domains, such as in urban planning and computational sustainability. Due to their inherent capability of simultaneously modeling the spatial and temporal aspects of multiple instances, tensors have been successfully used to analyze such spa...
Article
Full-text available
As the adoption of Electronic Healthcare Records has grown, the need to transform manual processes that extract and characterize medical data into automatic and high-throughput processes has also grown. Recently, researchers have tackled the problem of automatically extracting candidate phenotypes from EHR data. Since these phenotypes are usually g...
Conference Paper
Predicting and preventing cardiac arrest is one of the biggest challenges of contemporary cardiology, as a patient's survival depends on the effectiveness of the emergency response teams. While black-box models have been shown to have better predictive accuracies for cardiac risk stratification, early warning scoring systems are more prominent in the hos...
Conference Paper
We propose gamAID, an exploratory, supervised nonnegative tensor factorization method that iteratively extracts phenotypes from tensors constructed from medical count data. Using data from diabetic patients who later on get diagnosed with chronic kidney disorder (CKD) as well as diabetic patients who do not receive a CKD diagnosis, we demonstrate t...
Conference Paper
In the realm of data-driven clinical research, medical concepts, or phenotypes, are used to serve as indicators for patient clusters of interest. Often, studies will use groups of algorithmically generated phenotypes (feature groups) to predict the occurrence of heart disease, diabetes, and other conditions. When these groups are algorithmically ge...
Article
Full-text available
The increased availability of electronic health records (EHRs) has spearheaded the initiative for precision medicine using data-driven approaches. Essential to this effort is the ability to identify patients with certain medical conditions of interest from simple queries on EHRs, or EHR-based phenotypes. Existing rule-based phenotyping approaches...
Article
Full-text available
In many healthcare settings, intuitive decision rules for risk stratification can help effective hospital resource allocation. This paper introduces a novel variant of decision tree algorithms that produces a chain of decisions, not a general tree. Our algorithm, $\alpha$-Carving Decision Chain (ACDC), sequentially carves out "pure" subsets of the...
Conference Paper
https://www.researchgate.net/profile/Jin-Mann_Lin/publications?sorting=recentlyAdded&page=2
Article
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical resea...
Article
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to phenotypes, or medical concepts, tha...
Conference Paper
Electronic health records (EHRs) are becoming an increasingly important source of patient information. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts...
Article
Sepsis and septic shock are common and potentially fatal conditions that often occur in intensive care unit (ICU) patients. Early prediction of patients at risk for septic shock is therefore crucial to minimizing the effects of these complications. Potential indications for septic shock risk span a wide range of measurements, including physiologica...
Article
We model the temporal symptomatic characteristics of 171 cardiac arrest patients in Intensive Care Units. The temporal and feature dependencies in the data are illustrated using a mixture of matrix normal distributions. We found that the cardiac arrest temporal signature is best summarized with six hours of data prior to cardiac arrest events, and its...
Article
Full-text available
Multiple sclerosis (MS) is a chronic autoimmune disease that affects the central nervous system. The progression and severity of MS vary by individual, but it is generally a disabling disease. Although medications have been developed to slow the disease progression and help manage symptoms, MS research has yet to result in a cure. Early diagnosis...
Preprint
Multiple sclerosis (MS) is a chronic autoimmune disease that affects the central nervous system. The progression and severity of MS vary by individual, but it is generally a disabling disease. Although medications have been developed to slow the disease progression and help manage symptoms, MS research has yet to result in a cure. Early diagnosis...
Conference Paper
ICU patients are vulnerable to in-ICU morbidities and mortality, making accurate systems for identifying at-risk patients a necessity for improving clinical care. Here, we present an improved model for predicting in-hospital mortality using data collected from the first 48 hours of a patient's ICU stay. We generated predictive features for each pat...
Article
Full-text available
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 113-116). The proliferation of mobile devices and their tendency to present information proactively has led to an increase in device generated interruptions experienced by users. These interrup...
Conference Paper
The potential for sensor-enabled mobile devices to proactively present information when and where users need it ranks among the greatest promises of ubiquitous computing. Unfortunately, mobile phones, PDAs, and other computing devices that compete for the user's attention can contribute to interruption irritability and feelings of information overl...
Article
Full-text available
Sepsis and septic shock are potentially fatal complications that frequently occur in intensive care unit patients. The ability to predict which patients are at risk for sepsis and septic shock is therefore crucial to limiting the effects of these complications. Potential indications for sepsis risk are scattered in a wide range of clinical measurem...
