About
Publications: 69
Reads: 11,776
Citations: 1,517
Publications (69)
Background
Differentiating between Crohn’s disease (CD) and intestinal tuberculosis (ITB) with endoscopy is challenging. We aim to perform more accurate endoscopic diagnosis between CD and ITB by building a trustworthy AI differential diagnosis application.
Methods
A total of 1271 electronic health record (EHR) patients who had undergone colonosco...
Objective:
Disease knowledge graphs have emerged as a powerful tool for AI, enabling the connection, organization, and access to diverse information about diseases. However, the relations between disease concepts are often distributed across multiple data formats, including plain language and incomplete disease knowledge graphs. As a result, extra...
Electronic health records (EHR) contain vast biomedical knowledge and are rich resources for developing precise medicine systems. However, due to privacy concerns, little high-quality EHR data is accessible to researchers, which hinders the advancement of methodologies. Recent research has explored using generative modelling methods to syn...
Liver cancer is a common malignant tumor, and its clinical stage is closely related to the clinical treatment and prognosis of patients. Currently, the BCLC staging system revised by the BCLC group of the University of Barcelona is the globally recognized staging system for liver cancer. However, with the deepening of related research, the current stag...
Entities lie at the heart of biomedical natural language understanding, and the biomedical entity linking (EL) task remains challenging due to fine-grained and diverse concept names. Generative methods achieve remarkable performance in general domain EL with less memory usage while requiring expensive pre-training. Previous biomedical EL m...
Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance, while understudied. Approaching natural language understandin...
Term clustering is important in biomedical knowledge graph construction. Using similarities between term embeddings is helpful for term clustering. State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonyms and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings provid...
Biomedical knowledge graphs (BioMedKGs) are essential infrastructures for biomedical and healthcare big data and artificial intelligence (AI), facilitating natural language processing, model development, and data exchange. For many decades, these knowledge graphs have been built via expert curation, which can no longer catch up with the speed of to...
Objective: Disease knowledge graphs are a way to connect, organize, and access disparate information about diseases with numerous benefits for artificial intelligence (AI). To create knowledge graphs, it is necessary to extract knowledge from multimodal datasets in the form of relationships between disease concepts and normalize both concepts and r...
Entity alignment (EA) merges knowledge graphs (KGs) by identifying the equivalent entities in different graphs, which can effectively enrich knowledge representations of KGs. However, in practice, different KGs often include dangling entities whose counterparts cannot be found in the other graph, which limits the performance of EA methods. To impro...
Knowledge graph integration typically suffers from widespread dangling entities that have no alignment across knowledge graphs (KGs). The dangling entity set is unavailable in most real-world scenarios, and manually mining entity pairs that share the same meaning is labor-intensive. In this paper, we propose a nov...
We present PMC-Patients, a dataset consisting of 167k patient notes with 3.1M relevant article annotations and 293k similar patient annotations. The patient notes are extracted by identifying certain sections from case reports in PubMed Central, and those with at least CC BY-NC-SA license are re-distributed. Patient-article relevance and patient-pa...
Automatic Question Answering (QA) has been successfully applied in various domains such as search engines and chatbots. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access, and understand complex biomedical knowledge. There have been tremendous developments of BQA in the past two decades, whi...
Objective
This paper aims to propose knowledge-aware embedding, a critical tool for medical term normalization.
Methods
We develop CODER (Cross-lingual knowledge-infused medical term embedding) via contrastive learning based on a medical knowledge graph (KG) named the Unified Medical Language System, and similarities are calculated utilizing both...
The medical automatic diagnosis system aims to imitate human doctors in the real diagnostic process. This task is formulated as a sequential decision-making problem with symptom inquiring and disease diagnosis. In recent years, many researchers have used reinforcement learning methods to handle this task. However, most recent works neglected to dis...
BACKGROUND
Differentiating between Crohn’s disease (CD) and intestinal tuberculosis (ITB) has long been an important and challenging problem in clinical practice. Endoscopy is an essential examination for a timely and accurate diagnosis but the results can be confusing and rely heavily on the experience of the clinician.
OBJECTIVE
We aim to perfor...
Existing neural machine translation systems have achieved near human-level performance in the general domain for some languages, but the lack of parallel corpora poses a key problem in specific domains. In the biomedical domain, parallel corpora are less accessible. This work presents a new unsupervised sentence alignment method and explores features in...
Question Answering (QA) is a benchmark Natural Language Processing (NLP) task where models predict the answer for a given question using related documents, images, knowledge bases and question-answer pairs. Automatic QA has been successfully applied in various domains like search engines and chatbots. However, for specific domains like biomedicine,...
We propose a novel medical term embedding method named CODER, which stands for mediCal knOwledge embeDded tErm Representation. CODER is designed for medical term normalization by providing close vector representations for terms that represent the same or similar concepts with multi-language support. CODER is trained on top of BERT (Devlin et al., 2...
Background:
Differentiating between ulcerative colitis (UC), Crohn's disease (CD) and intestinal tuberculosis (ITB) using endoscopy is challenging. We aimed to realize automatic differential diagnosis among these diseases through machine learning algorithms.
Methods:
A total of 6399 consecutive patients (5128 UC, 875 CD and 396 ITB) who had unde...
Objective: Medical relations are the core components of medical knowledge graphs that are needed for healthcare artificial intelligence. However, the requirement of expert annotation by conventional algorithm development processes creates a major bottleneck for mining new relations. In this paper, we present Hi-RES, a framework for high-throughput...
Objective
Artificial intelligence in healthcare increasingly relies on relations in knowledge graphs for algorithm development. However, many important relations are not well covered in existing knowledge graphs. We aim to develop a novel long-distance relation extraction algorithm that leverages the article section structure and is trained with bo...
Objective
This study aims at realizing unsupervised term discovery in Chinese electronic health records (EHRs) by using the word segmentation technique. The existing supervised algorithms do not perform satisfactorily in the case of EHRs, as annotated medical data are scarce. We propose an unsupervised segmentation method (GTS) based on the graph p...
Risk of intracranial aneurysm rupture could be affected by geometric features of intracranial aneurysms and the surrounding vasculature in a location specific manner. Our goal is to investigate the morphological characteristics associated with ruptured posterior communicating artery (PCoA) aneurysms, as well as patient factors associated with the m...
Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles of Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the size of data easy to work with. However, due to th...
Background: Differentiating between ulcerative colitis (UC), Crohn’s disease (CD) and intestinal tuberculosis (ITB) using endoscopy is challenging. We aimed to realize automatic differential diagnosis among these diseases through machine learning algorithms.
Methods: A total of 6399 consecutive patients (5128 UC, 875 CD and 396 ITB) who had undergo...
Objective
Accurate coding is critical for medical billing and electronic medical record (EMR)-based research. Recent research has been focused on developing supervised methods to automatically assign International Classification of Diseases (ICD) codes from clinical notes. However, supervised approaches rely on ICD code data stored in the hospital...
Objective:
Electronic health records linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated high-throughput phenotyping method integrating International Classification of Disease...
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variatio...
Iron and its derivatives play a significant role in various physiological and biochemical pathways, and are influenced by a wide variety of inflammatory, infectious, and immunological disorders. We hypothesized that iron and its related factors play a role in intracranial aneurysm pathophysiology and investigated if serum iron values are associated...
Objective
Electronic health records (EHR) linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated high-throughput phenotyping method integrating International Classification of Dis...
Objective: Phenotyping algorithms can efficiently and accurately identify patients with a specific disease phenotype and construct electronic health records (EHR)-based cohorts for subsequent clinical or genomic studies. Previous studies have introduced unsupervised EHR-based feature selection methods that yielded algorithms with high accuracy. How...
The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule‐based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These si...
Objective:
To determine the association between ruptured saccular aneurysms and aspirin use/aspirin dose.
Methods:
Four thousand seven hundred one patients who were diagnosed at the Massachusetts General Hospital and Brigham and Women's Hospital between 1990 and 2016 with 6,411 unruptured and ruptured saccular intracranial aneurysms were evaluat...
While cocaine use is thought to be associated with aneurysmal rupture, it is not known whether heroin use increases the risk of rupture in patients with non-mycotic saccular aneurysms. Our goal was to investigate the association between heroin and cocaine use and the rupture of saccular non-mycotic aneurysms. The medical records of 4701 patients wi...
Background:
Geometric factors of intracranial aneurysms and surrounding vasculature could affect the risk of aneurysm rupture. However, large-scale assessments of morphological parameters correlated with intracranial aneurysm rupture in a location-specific manner are scarce.
Objective:
To investigate the morphological characteristics associated...
Background and purpose:
Both low serum calcium and magnesium levels have been associated with the extent of bleeding in patients with intracerebral hemorrhage, suggesting hypocalcemia- and hypomagnesemia-induced coagulopathy as a possible underlying mechanism. We hypothesized that serum albumin-corrected total calcium and magnesium levels are asso...
Objective:
Standard approaches for large scale phenotypic screens using electronic health record (EHR) data apply thresholds, such as ≥2 diagnosis codes, to define subjects as having a phenotype. However, the variation in the accuracy of diagnosis codes can impair the power of such screens. Our objective was to develop and evaluate an approach whi...
Background and Purpose: Growing evidence from experimental animal models and clinical studies suggests the protective effect of statin use against rupture of intracranial aneurysms; however, results from large studies detailing the relationship between intracranial aneurysm rupture and total cholesterol, HDL (high-density lipoprotein), LDL (low-dens...
We propose a new approach to the Chinese word segmentation problem that considers the sentence as an undirected graph, whose nodes are the characters. One can use various techniques to compute the edge weights that measure the connection strength between characters. Spectral graph partition algorithms are used to group the characters and achieve wo...
Alcohol consumption may be a modifiable risk factor for rupture of intracranial aneurysms. Our aim is to evaluate the association between ruptured aneurysms and alcohol consumption, intensity, and cessation. The medical records of 4701 patients with 6411 radiographically confirmed intracranial aneurysms diagnosed at the Brigham and Women’s Hospital...
Background:
Genetic studies of neuropsychiatric disease strongly suggest an overlap in liability. There are growing efforts to characterize these diseases dimensionally rather than categorically, but the extent to which such dimensional models correspond to biology is unknown.
Methods:
We applied a newly developed natural language processing met...
Background:
Relying on diagnostic categories of neuropsychiatric illness obscures the complexity of these disorders. Capturing multiple dimensional measures of neuropathology could facilitate the clinical and neurobiological investigation of cognitive and behavioral phenotypes.
Methods:
We developed a natural language processing-based approach t...
Background and purpose:
Previous studies have suggested a protective effect of diabetes mellitus on aneurysmal subarachnoid hemorrhage risk. However, reports are inconsistent, and objective measures of hyperglycemia in these studies are lacking. Our aim was to investigate the association between aneurysmal subarachnoid hemorrhage and antihyperglyc...
Objective:
Electronic health record (EHR)-based phenotyping infers whether a patient has a disease based on the information in his or her EHR. A human-annotated training set with gold-standard disease status labels is usually required to build an algorithm for phenotyping based on a set of predictive features. The time intensiveness of annotation...
Objective:
Although smoking is a known risk factor for intracranial aneurysm (IA) rupture, the exact relationship between IA rupture and smoking intensity and duration, as well as duration of smoking cessation, remains unknown.
Methods:
In this case-control study, we analyzed 4,701 patients with 6,411 IAs diagnosed at the Brigham and Women's Hos...
Objective
Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selectio...
A common practice in predictive medicine is to use current study data to construct a stratification procedure, which groups subjects according to baseline information and forms stratum-specific prevention or intervention strategies. A desirable stratification scheme would not only have small intra-stratum variation but also have a clinically meanin...
Objective:
To use natural language processing (NLP) in conjunction with the electronic medical record (EMR) to accurately identify patients with cerebral aneurysms and their matched controls.
Methods:
ICD-9 and Current Procedural Terminology codes were used to obtain an initial data mart of potential aneurysm patients from the EMR. NLP was then...
The migration of imaging reports to electronic medical record systems holds great potential in terms of advancing radiology research and practice by leveraging the large volume of data continuously being updated, integrated, and shared. However, there are significant challenges as well, largely due to the heterogeneity of how these data are formatt...
Polycystic ovary syndrome (PCOS) is a heterogeneous disorder because of the variable criteria used for diagnosis. Therefore, International Classification of Diseases 9 (ICD-9) codes may not accurately capture the diagnostic criteria necessary for large scale PCOS identification. We hypothesized that use of electronic medical records text and data w...
Objective: Analysis of narrative (text) data from electronic health records (EHRs) can improve population-scale phenotyping for clinical and genetic research. Currently, selection of text features for phenotyping algorithms is slow and laborious, requiring extensive and iterative involvement by domain experts. This paper introduces a method to devel...
TEACHING POINTS
1. Natural Language Processing (NLP) as a method for automatic data extraction has been applied to radiology research in a limited scope to date.
2. The following current NLP technologies are used in radiology: pattern matching, machine learning, coupled with linguistic and statistical approaches. We will describe each with illustrat...
In this paper we describe an efficient tool based on natural language processing for classifying the detailed state of pulmonary embolism (PE) recorded in CT pulmonary angiography reports. The classification tasks include: PE present vs. absent, acute PE vs. others, central PE vs. others, and subsegmental PE vs. others. Statistical learning algorithm...
Background
Electronic Medical Records (EMR) use clinical data to enable large-scale clinical studies. We created an EMR cohort of type 2 diabetes (T2D) patients from a large academic hospital system, to enable risk stratification of T2D patients at population scale. We hypothesize that natural language processing of narrative EMR data (e.g., physic...
PURPOSE
To develop and test a Natural Language Processing (NLP) algorithm that analyzes clinical reports of CT Pulmonary Angiography (CTPA) for the diagnoses of pulmonary embolism (PE), the chronicity of PE when present, and the location of the most proximal filling defect considered positive for PE.
METHOD AND MATERIALS
The final CTPA reports for...
In this paper, we briefly introduce MiniNLP, a natural language processing library for clinical narratives. MiniNLP is an experiment with our ideas on efficient and effective medical language processing. We introduce the overall design of MiniNLP and its major components, and show its performance in real projects.