
Yanshan Wang, PhD
University of Pittsburgh | Pitt · Department of Health Information Management
Clinical natural language processing, machine learning, deep learning, AI in medicine @Pitt.
190
Publications
48,007
Reads
2,845
Citations
Introduction
Assistant Professor of Biomedical Informatics at the Mayo Clinic. Ten years of research experience in Natural Language Processing, Information Retrieval, Machine Learning, Data Science, and Artificial Intelligence. Now dedicated to applications in the biomedical and clinical domains.
Education
March 2015 - March 2018
Mayo Clinic
Field of study
- Biomedical NLP, IR, ML
Publications (190)
Background
Neural word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications as they provide vector representations of words capturing the semantic properties of words and the linguistic relationship between words. Many biomedical applications use different textual resources (e.g., Wikipedia and biomedical ar...
Background
With the rapid adoption of electronic health records (EHRs), it is desirable to harvest information and knowledge from EHRs to support automated systems at the point of care and to enable secondary use of EHRs for clinical and translational research. One critical component used to facilitate the secondary use of EHR data is the informat...
In the era of digitalization, information retrieval (IR), which retrieves and ranks documents from large collections according to users’ search queries, has been popularly applied in the biomedical domain. Building patient cohorts using electronic health records (EHRs) or searching literature for topics of interest are some IR use cases. Meanwhile,...
The prediction of a stock market direction may serve as an early recommendation system for short-term investors and as an early financial distress warning system for long-term shareholders. In this paper, we propose an empirical study on the Korean and Hong Kong stock market with an integrated machine learning framework that employs Principal Compo...
Objective: To automatically create large labeled training datasets and reduce the efforts of feature engineering for training accurate machine learning models for clinical information extraction. Materials and Methods: We propose a distant supervision paradigm empowered by deep representation for extracting information from clinical text. In this p...
In 2020, the U.S. Department of Defense officially disclosed a set of ethical principles to guide the use of Artificial Intelligence (AI) technologies on future battlefields. Despite stark differences, there are core similarities between the military and medical service. Warriors on battlefields often face life-altering circumstances that require q...
Objective: This study aims to develop natural language processing (NLP) algorithms to extract physical rehabilitation exercise information from clinical notes of post-stroke patients. Methods: We identified a cohort of patients diagnosed with stroke at the University of Pittsburgh Medical Center and retrieved their clinical notes that co...
The emergence of generative Large Language Models (LLMs) emphasizes the need for accurate and efficient prompting approaches. LLMs are often applied in Few-Shot Learning (FSL) contexts, where tasks are executed with minimal training data. FSL has become popular in many Artificial Intelligence (AI) subdomains, including AI for health. Rare diseases,...
ChatGPT has gained remarkable traction since its inception in November 2022. However, when applied in clinical settings, it can generate inaccurate responses, ignore existing guidelines, and lack reasoning. This study introduces ChatGPT-CARE, a tool that integrates clinical practice guidelines with ChatGPT, focusing on COVID-...
Rehabilitation research focuses on determining the components of a treatment intervention, the mechanism of how these components lead to recovery and rehabilitation, and ultimately the optimal intervention strategies to maximize patients' physical, psychologic, and social functioning. Traditional randomized clinical trials that study and establish...
Health literacy is the central focus of Healthy People 2030, the fifth iteration of the U.S. national goals and objectives. People with low health literacy usually have trouble understanding health information, following post-visit instructions, and using prescriptions, which results in worse health outcomes and serious health disparities. In this...
Strategy training is a multidisciplinary rehabilitation approach that teaches skills to reduce disability among those with cognitive impairments following a stroke. Strategy training has been shown in randomized, controlled clinical trials to be a more feasible and efficacious intervention for promoting independence than traditional rehabilitation...
Objective: To pre-train fair and unbiased patient representations from Electronic Health Records (EHRs) using a novel weighted loss function that reduces bias and improves fairness in deep representation learning models. Methods: We defined a new loss function, called weighted loss function, in the deep representation learning model to balance the...
A human decision-maker benefits the most from an AI assistant that corrects for their biases. For problems such as generating interpretation of a radiology report given findings, a system predicting only highly likely outcomes may be less useful, where such outcomes are already obvious to the user. To alleviate biases in human decision-making, it i...
Developing clinical natural language systems based on machine learning and deep learning is dependent on the availability of large-scale annotated clinical text datasets, most of which are time-consuming to create and not publicly available. The lack of such annotated datasets is the biggest bottleneck for the development of clinical NLP systems. Z...
Closely associated with aging and age-related disorders, cellular senescence (CS) is the inability of cells to proliferate due to accumulated unrepaired cellular damage and irreversible cell cycle arrest. Senescent cells are characterized by their senescence-associated secretory phenotype that overproduces inflammatory and catabolic factors that ha...
This paper presents an evaluation of HealthPrompt, a prompt-based zero-shot clinical text classification framework. The lack of publicly available datasets and the expensive data annotation in the clinical domain make traditional NLP models difficult to train. To overcome this issue, HealthPrompt utilizes Pre-trained Language Models (PLMs) and...
Background: Clinical information retrieval (IR) plays a vital role in modern healthcare by facilitating efficient access and analysis of medical literature for clinicians and researchers. This scoping review aims to offer a comprehensive overview of the current state of clinical IR research and identify gaps and potential opportunities for future s...
Large language models (LLMs) are increasingly used for clinical Natural Language Processing (NLP) applications but are considered black-box models with little explanation for their predictions. In this study, we propose a framework that generates counterfactuals using multiple perturbations and uses local logistic regression to explain the decision...
Age-associated back pain arising from intervertebral disc degeneration (IDD) is one of the major causes of chronic disorders. Research supporting the role of cellular senescence (CS) in driving IDD is rapidly growing, which calls for a systematic review to organize the current literature findings. However, the traditional approach of searching and s...
Physical rehabilitation plays a crucial role in the recovery process of post-stroke patients. By personalizing therapies for patients leveraging predictive modeling and electronic health records (EHRs), healthcare providers can make the rehabilitation process more efficient. Before predictive modeling can provide decision support for the assignment...
Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Info...
BACKGROUND
Clinical Natural Language Processing (NLP) has become an emerging technology in healthcare that leverages a large amount of free-text data in electronic health records (EHRs) to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Recently, deep learning has achieved state-of-the-a...
Background
Natural language processing (NLP) has become an emerging technology in health care that leverages a large amount of free-text data in electronic health records to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Recently, deep learning has achieved state-of-the-art performance...
Purpose: The applications of machine learning (ML) to healthcare have enhanced clinical capabilities through the analysis of large, complex data sets. UPMC HCC has begun a project to utilize ML to leverage retrospective Real World Data to predict the clinical course of patients who received immunotherapy for lung cancer. To achieve this goal, the fi...
Clinical Natural Language Processing (NLP) has become an emerging technology in healthcare that leverages a large amount of free-text data in electronic health records (EHRs) to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Deep learning has achieved state-of-the-art performance in man...
Semantic textual similarity (STS) in the clinical domain helps improve diagnostic efficiency and produce concise texts for downstream data mining tasks. However, given the high degree of domain knowledge involved in clinical text, it remains challenging for general language models to infer implicit medical relationships behind clinical sentences and...
Background
Since no effective therapies exist for Alzheimer’s disease (AD), prevention has become more critical through lifestyle status changes and interventions. Analyzing electronic health records (EHRs) of patients with AD can help us better understand lifestyle’s effect on AD. However, lifestyle information is typically stored in clinical narr...
Dementia is one of the most prevalent health problems in the aging population. Despite the significant number of people affected, dementia diagnoses are often significantly delayed, missing opportunities to maximize life quality. Early identification of older adults at high risk for dementia may help to maximize current quality of life and to impro...
Computational drug repurposing methods adapt artificial intelligence (AI) algorithms for the discovery of new applications of approved or investigational drugs. Among the heterogeneous datasets, electronic health record (EHR) datasets provide rich longitudinal and pathophysiological data that facilitate the generation and validation of drug repur...
Generating a summary from findings has recently been explored (Zhang et al., 2018, 2020) in note types such as radiology reports, which are typically short. In this work, we focus on echocardiogram notes, which are longer and more complex than previous note types. We formally define the task of echocardiography conclusion generation (Echo...
Alzheimer’s Disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is tha...
Deep learning algorithms are dependent on the availability of large-scale annotated clinical text datasets. The lack of such publicly available datasets is the biggest bottleneck for the development of clinical Natural Language Processing (NLP) systems. Zero-Shot Learning (ZSL) refers to the use of deep learning models to classify instances from new...
Personal Health Literacy (PHL) is defined as "the degree to which individuals have the ability to find, understand, and use information and services to inform health-related decisions and actions for themselves and others." New definitions of PHL focus on consumers' ability to use the information and make well-informed decisions. According to the...
Purpose:
Rural populations are disproportionately affected by the COVID-19 pandemic. We characterized urban-rural disparities in patient portal messaging utilization for COVID-19, and, of those who used the portal during its early stage in the Midwest.
Methods:
We collected over 1 million portal messages generated by midwestern Mayo Clinic patie...
Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for...
Background:
During the Coronavirus Disease 2019 (COVID-19) pandemic, patient portals and their message platforms allowed remote access to healthcare. Utilization patterns in patient messaging during the COVID-19 crisis have not been studied thoroughly. In this work, we propose to characterize patients and their use of asynchronous virtual care for...
BACKGROUND
During the COVID-19 pandemic, patient portals and their message platforms allowed remote access to health care. Utilization patterns in patient messaging during the COVID-19 crisis have not been studied thoroughly. In this work, we propose characterizing patients and their use of asynchronous virtual care for COVID-19 via a retrospective...
Background
Several social determinants of health (SDoH) have been associated with the onset of major depressive disorder (MDD). However, prior studies largely focused on individual SDoH and thus less is known about the relative importance (RI) of SDoH variables, especially in older adults. Given that risk factors for MDD may differ across the lifes...
Background: There is growing evidence that social and behavioral determinants of health (SBDH) have a substantial effect on a wide range of health outcomes. Electronic health records (EHRs) have been widely employed to conduct observational studies in the age of artificial intelligence (AI). However, there has been limited review into how to make...
Major depressive disorder (MDD) is a prevalent psychiatric disorder that is associated with a significant healthcare burden worldwide. Phenotyping of MDD can help early diagnosis and consequently may have significant advantages in patient management. In prior research, MDD phenotypes have been extracted from structured Electronic Health Records (EHR)...
Mental health concerns, such as suicidal thoughts, are frequently documented by providers in clinical notes, as opposed to structured coded data. In this study, we evaluated weakly supervised methods for detecting “current” suicidal ideation from unstructured clinical notes in electronic health record (EHR) systems. Weakly supervised machine learni...
There is growing evidence showing the significant role of social determinants of health (SDOH) in a wide variety of health outcomes. In the era of artificial intelligence (AI), electronic health records (EHRs) have been widely used to conduct observational studies. However, how to make the best of SDOH information from EHRs is yet to be studied. In...
Dietary supplements (DSs) have been widely used in the U.S. and evaluated in clinical trials as potential interventions for various diseases. However, many clinical trials face challenges in recruiting enough eligible patients in a timely fashion, causing delays or even early termination. Using electronic health records to find eligible patients wh...
Since no effective therapies exist for Alzheimer's disease (AD), prevention has become more critical through lifestyle factor changes and interventions. Analyzing electronic health records (EHR) of patients with AD can help us better understand lifestyle's effect on AD. However, lifestyle information is typically stored in clinical narratives. Thus...
Introduction
Racially and ethnically diverse minorities often experience the disease burden of sexually transmitted infections or diseases (STD) more often than their White counterparts. Yet, little is known about the connection of STD systematic discrimination, racism, and social and behavioral determinants. In addition, few or no details exist relat...
Coronavirus Disease 2019 has emerged as a significant global concern, triggering harsh public health restrictions in a successful bid to curb its exponential growth. As discussion shifts towards relaxation of these restrictions, there is significant concern of second-wave resurgence. The key to managing these outbreaks is early detection and interv...
Background
Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semanti...
Background
Chronic pain affects more than 20% of adults in the United States and is associated with substantial physical, mental, and social burden. Clinical text contains rich information about chronic pain, but no systematic appraisal has been performed to assess the electronic health record (EHR) narratives for these patients. A formal content a...
Background: Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniq...
Objective:
The 2019 National Natural Language Processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task track 3 focused on medical concept normalization (MCN) in clinical records. This track aimed to assess the state of the art in identifying and matching salient medical concepts to a controlled vocabulary. In this paper, we...
Defining patient-to-patient similarity is essential for the development of precision medicine in clinical care and research. Conceptually, the identification of similar patient cohorts appears straightforward; however, universally accepted definitions remain elusive. Simultaneously, an explosion of vendors and published algorithms has emerged and...
BACKGROUND
As a risk factor for many diseases, family history captures both shared genetic variations and living environments among family members. Though there are several systems focusing on family history extraction (FHE) using natural language processing (NLP) techniques, the evaluation protocol of such systems has not been standardized.
OBJEC...
Background
As a risk factor for many diseases, family history (FH) captures both shared genetic variations and living environments among family members. Though there are several systems focusing on FH extraction using natural language processing (NLP) techniques, the evaluation protocol of such systems has not been standardized.
Objective
The n2c2...
BACKGROUND
Semantic textual similarity (STS) is a common task in the general English domain to assess the degree to which the underlying semantics of two text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the STS task in the clinical domain that attempts to measure the degree of semantic equivalence betwee...