Yanshan Wang

University of Pittsburgh | Pitt · Department of Health Information Management

PhD
Clinical natural language processing, machine learning, deep learning, AI in medicine @Pitt.

About

190
Publications
48,007
Reads
2,845
Citations
Introduction
Assistant Professor of Biomedical Informatics at the Mayo Clinic. Ten years’ research experience in Natural Language Processing, Information Retrieval, Machine Learning, Data Science, and Artificial Intelligence. Now dedicated to applications in the biomedical and clinical domains.
Education
March 2015 - March 2018
Mayo Clinic
Field of study
  • Biomedical NLP, IR, ML

Publications (190)
Article
Full-text available
Background Neural word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications as they provide vector representations of words capturing the semantic properties of words and the linguistic relationship between words. Many biomedical applications use different textual resources (e.g., Wikipedia and biomedical ar...
Article
Full-text available
Background With the rapid adoption of electronic health records (EHRs), it is desirable to harvest information and knowledge from EHRs to support automated systems at the point of care and to enable secondary use of EHRs for clinical and translational research. One critical component used to facilitate the secondary use of EHR data is the informat...
Article
In the era of digitalization, information retrieval (IR), which retrieves and ranks documents from large collections according to users’ search queries, has been popularly applied in the biomedical domain. Building patient cohorts using electronic health records (EHRs) or searching literature for topics of interest are some IR use cases. Meanwhile,...
Article
Full-text available
The prediction of a stock market direction may serve as an early recommendation system for short-term investors and as an early financial distress warning system for long-term shareholders. In this paper, we propose an empirical study on the Korean and Hong Kong stock market with an integrated machine learning framework that employs Principal Compo...
Article
Full-text available
Objective: To automatically create large labeled training datasets and reduce the efforts of feature engineering for training accurate machine learning models for clinical information extraction. Materials and Methods: We propose a distant supervision paradigm empowered by deep representation for extracting information from clinical text. In this p...
Article
Full-text available
In 2020, the U.S. Department of Defense officially disclosed a set of ethical principles to guide the use of Artificial Intelligence (AI) technologies on future battlefields. Despite stark differences, there are core similarities between the military and medical service. Warriors on battlefields often face life-altering circumstances that require q...
Preprint
Objective This study aims to develop natural language processing (NLP) algorithms to extract physical rehabilitation exercise information from clinical notes of post-stroke patients. Methods We identified a cohort of patients diagnosed with stroke at the University of Pittsburgh Medical Center and retrieved their clinical notes that co...
Preprint
Full-text available
The emergence of generative Large Language Models (LLMs) emphasizes the need for accurate and efficient prompting approaches. LLMs are often applied in Few-Shot Learning (FSL) contexts, where tasks are executed with minimal training data. FSL has become popular in many Artificial Intelligence (AI) subdomains, including AI for health. Rare diseases,...
Preprint
Full-text available
ChatGPT has gained remarkable traction since its inception in November 2022. However, its limitations in clinical settings include generating inaccurate responses, ignoring existing guidelines, and lacking reasoning. This study introduces ChatGPT-CARE, a tool that integrates clinical practice guidelines with ChatGPT, focusing on COVID-...
Preprint
In 2020, the U.S. Department of Defense officially disclosed a set of ethical principles to guide the use of Artificial Intelligence (AI) technologies on future battlefields. Despite stark differences, there are core similarities between the military and medical service. Warriors on battlefields often face life-altering circumstances that require q...
Article
Rehabilitation research focuses on determining the components of a treatment intervention, the mechanism of how these components lead to recovery and rehabilitation, and ultimately the optimal intervention strategies to maximize patients' physical, psychologic, and social functioning. Traditional randomized clinical trials that study and establish...
Article
Health literacy is the central focus of Healthy People 2030, the fifth iteration of the U.S. national goals and objectives. People with low health literacy usually have trouble understanding health information, following post-visit instructions, and using prescriptions, which results in worse health outcomes and serious health disparities. In this...
Article
Strategy training is a multidisciplinary rehabilitation approach that teaches skills to reduce disability among those with cognitive impairments following a stroke. Strategy training has been shown in randomized, controlled clinical trials to be a more feasible and efficacious intervention for promoting independence than traditional rehabilitation...
Preprint
Objective: To pre-train fair and unbiased patient representations from Electronic Health Records (EHRs) using a novel weighted loss function that reduces bias and improves fairness in deep representation learning models. Methods: We defined a new loss function, called weighted loss function, in the deep representation learning model to balance the...
Preprint
Full-text available
A human decision-maker benefits the most from an AI assistant that corrects for their biases. For problems such as generating interpretation of a radiology report given findings, a system predicting only highly likely outcomes may be less useful, where such outcomes are already obvious to the user. To alleviate biases in human decision-making, it i...
Article
Developing clinical natural language systems based on machine learning and deep learning is dependent on the availability of large-scale annotated clinical text datasets, most of which are time-consuming to create and not publicly available. The lack of such annotated datasets is the biggest bottleneck for the development of clinical NLP systems. Z...
Article
Full-text available
Closely associated with aging and age-related disorders, cellular senescence (CS) is the inability of cells to proliferate due to accumulated unrepaired cellular damage and irreversible cell cycle arrest. Senescent cells are characterized by their senescence-associated secretory phenotype that overproduces inflammatory and catabolic factors that ha...
Preprint
Full-text available
Rehabilitation research focuses on determining the components of a treatment intervention, the mechanism of how these components lead to recovery and rehabilitation, and ultimately the optimal intervention strategies to maximize patients' physical, psychologic, and social functioning. Traditional randomized clinical trials that study and establish...
Preprint
Full-text available
This paper presents an evaluation of HealthPrompt, a prompt-based zero-shot clinical text classification framework. The lack of publicly available datasets and the expensive data annotation in the clinical domain make traditional NLP models difficult to train. To overcome this issue, HealthPrompt utilizes Pre-trained Language Models (PLMs) and...
Preprint
Full-text available
Background: Clinical information retrieval (IR) plays a vital role in modern healthcare by facilitating efficient access and analysis of medical literature for clinicians and researchers. This scoping review aims to offer a comprehensive overview of the current state of clinical IR research and identify gaps and potential opportunities for future s...
Conference Paper
Full-text available
Large language models (LLMs) are increasingly used for clinical Natural Language Processing (NLP) applications but are considered black-box models with little explanation for their predictions. In this study, we propose a framework that generates counterfactuals using multiple perturbations and uses local logistic regression to explain the decision...
Preprint
Full-text available
Large language models (LLMs) are increasingly used for clinical Natural Language Processing (NLP) applications but are considered black-box models with little explanation for their predictions. In this study, we propose a framework that generates counterfactuals using multiple perturbations and uses local logistic regression to explain the decision...
Preprint
Full-text available
Age-associated back pain arising from intervertebral disc degeneration (IDD) is one of the major causes of chronic disorders. Research supporting the role of cellular senescence (CS) in driving IDD is rapidly growing, which requires a systematic review to organize the current literature findings. However, the traditional approach of searching and s...
Preprint
Physical rehabilitation plays a crucial role in the recovery process of post-stroke patients. By personalizing therapies for patients leveraging predictive modeling and electronic health records (EHRs), healthcare providers can make the rehabilitation process more efficient. Before predictive modeling can provide decision support for the assignment...
Article
Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Info...
Preprint
BACKGROUND Clinical Natural Language Processing (NLP) has become an emerging technology in healthcare that leverages a large amount of free-text data in electronic health records (EHRs) to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Recently, deep learning has achieved state-of-the-a...
Article
Full-text available
Background Natural language processing (NLP) has become an emerging technology in health care that leverages a large amount of free-text data in electronic health records to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Recently, deep learning has achieved state-of-the-art performance...
Research
Full-text available
PURPOSE The applications of machine learning (ML) to healthcare have enhanced clinical capabilities through the analysis of large, complex data sets. UPMC HCC has begun a project to utilize ML to leverage retrospective Real World Data to predict the clinical course of patients who received immunotherapy for lung cancer. To achieve this goal, the fi...
Preprint
Strategy training is a multidisciplinary rehabilitation approach that teaches skills to reduce disability among those with cognitive impairments following a stroke. Strategy training has been shown in randomized, controlled clinical trials to be a more feasible and efficacious intervention for promoting independence than traditional rehabilitation...
Preprint
Health literacy is the central focus of Healthy People 2030, the fifth iteration of the U.S. national goals and objectives. People with low health literacy usually have trouble understanding health information, following post-visit instructions, and using prescriptions, which results in worse health outcomes and serious health disparities. In this...
Preprint
Full-text available
Clinical Natural Language Processing (NLP) has become an emerging technology in healthcare that leverages a large amount of free-text data in electronic health records (EHRs) to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Deep learning has achieved state-of-the-art performance in man...
Preprint
Full-text available
Semantic textual similarity (STS) in the clinical domain helps improve diagnostic efficiency and produce concise texts for downstream data mining tasks. However, given the high degree of domain knowledge involved in clinical text, it remains challenging for general language models to infer implicit medical relationships behind clinical sentences and...
Article
Full-text available
Background Since no effective therapies exist for Alzheimer’s disease (AD), prevention has become more critical through lifestyle status changes and interventions. Analyzing electronic health records (EHRs) of patients with AD can help us better understand lifestyle’s effect on AD. However, lifestyle information is typically stored in clinical narr...
Chapter
Full-text available
Dementia is one of the most prevalent health problems in the aging population. Despite the significant number of people affected, dementia diagnoses are often significantly delayed, missing opportunities to maximize life quality. Early identification of older adults at high risk for dementia may help to maximize current quality of life and to impro...
Article
Full-text available
Computational drug repurposing methods adapt Artificial intelligence (AI) algorithms for the discovery of new applications of approved or investigational drugs. Among the heterogeneous datasets, electronic health records (EHRs) datasets provide rich longitudinal and pathophysiological data that facilitate the generation and validation of drug repur...
Conference Paper
Generating a summary from findings has been recently explored (Zhang et al., 2018, 2020) in note types such as radiology reports that are typically short. In this work, we focus on echocardiogram notes, which are longer and more complex than previous note types. We formally define the task of echocardiography conclusion generation (Echo...
Preprint
Full-text available
Alzheimer’s Disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is tha...
Preprint
Full-text available
Deep learning algorithms are dependent on the availability of large-scale annotated clinical text datasets. The lack of such publicly available datasets is the biggest bottleneck for the development of clinical Natural Language Processing (NLP) systems. Zero-Shot Learning (ZSL) refers to the use of deep learning models to classify instances from new...
Preprint
Full-text available
Personal Health Literacy (PHL) is defined as "the degree to which individuals have the ability to find, understand, and use information and services to inform health-related decisions and actions for themselves and others". New definitions of PHL focus on consumers' ability to use the information and make well-informed decisions. According to the...
Article
Full-text available
Purpose: Rural populations are disproportionately affected by the COVID-19 pandemic. We characterized urban-rural disparities in patient portal messaging utilization for COVID-19, and, of those who used the portal during its early stage in the Midwest. Methods: We collected over 1 million portal messages generated by midwestern Mayo Clinic patie...
Article
Full-text available
Random forest is considered as one of the most successful machine learning algorithms, which has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for...
Preprint
Full-text available
Dementia is one of the most prevalent health problems in the aging population. Despite the significant number of people affected, dementia diagnoses are often significantly delayed, missing opportunities to maximize life quality. Early identification of older adults at high risk for dementia may help to maximize current quality of life and to impro...
Article
Full-text available
Background: During the Coronavirus Disease 2019 (COVID-19) pandemic, patient portals and their message platforms allowed remote access to healthcare. Utilization patterns in patient messaging during the COVID-19 crisis have not been studied thoroughly. In this work, we propose to characterize patients and their use of asynchronous virtual care for...
Preprint
BACKGROUND During the COVID-19 pandemic, patient portals and their message platforms allowed remote access to health care. Utilization patterns in patient messaging during the COVID-19 crisis have not been studied thoroughly. In this work, we propose characterizing patients and their use of asynchronous virtual care for COVID-19 via a retrospective...
Article
Full-text available
Background Several social determinants of health (SDoH) have been associated with the onset of major depressive disorder (MDD). However, prior studies largely focused on individual SDoH and thus less is known about the relative importance (RI) of SDoH variables, especially in older adults. Given that risk factors for MDD may differ across the lifes...
Article
Full-text available
Background: There is growing evidence that social and behavioral determinants of health (SBDH) play a substantial role in a wide range of health outcomes. Electronic health records (EHRs) have been widely employed to conduct observational studies in the age of artificial intelligence (AI). However, there has been limited review into how to make...
Preprint
Major depressive disorder (MDD) is a prevalent psychiatric disorder that is associated with significant healthcare burden worldwide. Phenotyping of MDD can help early diagnosis and consequently may have significant advantages in patient management. In prior research MDD phenotypes have been extracted from structured Electronic Health Records (EHR)...
Article
Mental health concerns, such as suicidal thoughts, are frequently documented by providers in clinical notes, as opposed to structured coded data. In this study, we evaluated weakly supervised methods for detecting “current” suicidal ideation from unstructured clinical notes in electronic health record (EHR) systems. Weakly supervised machine learni...
Preprint
Full-text available
There is growing evidence showing the significant role of social determinants of health (SDOH) in a wide variety of health outcomes. In the era of artificial intelligence (AI), electronic health records (EHRs) have been widely used to conduct observational studies. However, how to make the best of SDOH information from EHRs is yet to be studied. In...
Article
Full-text available
Dietary supplements (DSs) have been widely used in the U.S. and evaluated in clinical trials as potential interventions for various diseases. However, many clinical trials face challenges in recruiting enough eligible patients in a timely fashion, causing delays or even early termination. Using electronic health records to find eligible patients wh...
Preprint
Full-text available
Since no effective therapies exist for Alzheimer's disease (AD), prevention has become more critical through lifestyle factor changes and interventions. Analyzing electronic health records (EHR) of patients with AD can help us better understand lifestyle's effect on AD. However, lifestyle information is typically stored in clinical narratives. Thus...
Preprint
Full-text available
There is growing evidence showing the significant role of social determinants of health (SDOH) in a wide variety of health outcomes. In the era of artificial intelligence (AI), electronic health records (EHRs) have been widely used to conduct observational studies. However, how to make the best of SDOH information from EHRs is yet to be studied. In...
Preprint
Full-text available
Introduction Racially and ethnically diverse minorities often experience the disease burden of sexually transmitted infections or diseases (STD) more often than their White counterparts. Yet, little is known about the connection of STD systematic discrimination, racism, and social and behavioral determinants. Plus, little to no detail exists relat...
Article
Coronavirus Disease 2019 has emerged as a significant global concern, triggering harsh public health restrictions in a successful bid to curb its exponential growth. As discussion shifts towards relaxation of these restrictions, there is significant concern of second-wave resurgence. The key to managing these outbreaks is early detection and interv...
Article
Full-text available
Background Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semanti...
Article
Full-text available
Background Chronic pain affects more than 20% of adults in the United States and is associated with substantial physical, mental, and social burden. Clinical text contains rich information about chronic pain, but no systematic appraisal has been performed to assess the electronic health record (EHR) narratives for these patients. A formal content a...
Article
Full-text available
Background: Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniq...
Article
Objective: The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task track 3, focused on medical concept normalization (MCN) in clinical records. This track aimed to assess the state of the art in identifying and matching salient medical concepts to a controlled vocabulary. In this paper, we...
Preprint
Full-text available
Dietary supplements (DSs) have been widely used in the U.S. and evaluated in clinical trials as potential interventions for various diseases. However, many clinical trials face challenges in recruiting enough eligible patients in a timely fashion, causing delays or even early termination. Using electronic health records to find eligible patients wh...
Article
Full-text available
Defining patient-to-patient similarity is essential for the development of precision medicine in clinical care and research. Conceptually, the identification of similar patient cohorts appears straightforward; however, universally accepted definitions remain elusive. Simultaneously, an explosion of vendors and published algorithms have emerged and...
Preprint
BACKGROUND As a risk factor for many diseases, family history captures both shared genetic variations and living environments among family members. Though there are several systems focusing on family history extraction (FHE) using natural language processing (NLP) techniques, the evaluation protocol of such systems has not been standardized. OBJEC...
Article
Full-text available
Background As a risk factor for many diseases, family history (FH) captures both shared genetic variations and living environments among family members. Though there are several systems focusing on FH extraction using natural language processing (NLP) techniques, the evaluation protocol of such systems has not been standardized. Objective The n2c2...
Preprint
Full-text available
BACKGROUND Semantic textual similarity (STS) is a common task in the general English domain to assess the degree to which the underlying semantics of two text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the STS task in the clinical domain that attempts to measure the degree of semantic equivalence betwee...