
Chunhua Weng
- PhD
- Professor (Associate) at Columbia University
About
325
Publications
54,182
Reads
8,408
Citations
Introduction
Skills and Expertise
Current institution
Additional affiliations
January 2002 - June 2005
September 2005 - June 2007
Publications (325)
Pragmatic clinical trials (PCTs) evaluate interventions in real-world settings, often using electronic health records (EHRs) for efficient data collection. We report on the challenges in performing EHR analysis of health-care provider orders in a PCT within the eMERGE consortium, which investigates the impact of reporting genome-informed risk asses...
Many factors, including environmental and genetic variables, contribute to Colorectal Cancer (CRC) risk. Some of these risk factors may share underlying genetics with CRC. We investigated potential shared genetics by performing a Phenome-wide association study (PheWAS) with a multi-ancestry CRC polygenic risk score (PRS). The discovery cohort (N=42...
While just-in-time interventions (JITIs) have effectively targeted common health behaviors, individuals often have unique needs to intervene in personal undesirable actions that can negatively affect physical, mental, and social well-being. We present WatchGuardian, a smartwatch-based JITI system that empowers users to define custom interventions f...
Kidney dysfunction is a major cause of mortality, but its genetic architecture remains elusive. In this study, we conducted a multiancestry genome-wide association study in 2.2 million individuals and identified 1026 (97 previously unknown) independent loci. Ancestry-specific analysis indicated an attenuation of newly identified signals on common v...
This study reports a comprehensive environmental scan of the generative AI (GenAI) infrastructure in the national network for clinical and translational science across 36 institutions supported by the CTSA Program led by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) in the United States....
This study presents a convincing analysis of the effects of covariates, such as age, sex, socioeconomic status, or biomarker levels, on the predictive accuracy of polygenic scores for body mass index; the work is further supported by important approaches for improving prediction accuracy by accounting for such covariates across a variety of associa...
Objective
Extracting PICO elements—Participants, Intervention, Comparison, and Outcomes—from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entit...
Objective: Extracting PICO elements -- Participants, Intervention, Comparison, and Outcomes -- from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PIC...
While holding great promise for improving and facilitating healthcare, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a...
We report the findings of a genome-wide association study (GWAS) meta-analysis of endometriosis consisting of a large portion (31%) of non-European samples across 14 biobanks worldwide as part of the Global Biobank Meta-Analysis Initiative (GBMI). We identified 45 significant loci using a wide phenotype definition, seven of which are previously...
Patients with rare diseases often experience prolonged diagnostic delays. Ordering appropriate genetic tests is crucial yet challenging, especially for general pediatricians without genetic expertise. Recent American College of Medical Genetics (ACMG) guidelines embrace early use of exome sequencing (ES) or genome sequencing (GS) for conditions lik...
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matc...
Objective
This study aims to automate the prediction of Mini-Mental State Examination (MMSE) scores, a widely adopted standard for cognitive assessment in patients with Alzheimer’s disease, using natural language processing (NLP) and machine learning (ML) on structured and unstructured EHR data.
Materials and Methods
We extracted demographic data,...
Introduction
Clinical research is critical for healthcare advancement, but participant recruitment remains challenging. Clinical research professionals (CRPs; e.g., clinical research coordinator, research assistant) perform eligibility prescreening, ensuring adherence to study criteria while upholding scientific and ethical standards. This study in...
This study reports a comprehensive environmental scan of the generative AI (GenAI) infrastructure in the national network for clinical and translational science across 36 institutions supported by the Clinical and Translational Science Award (CTSA) Program led by the National Center for Advancing Translational Sciences (NCATS) of the National Insti...
Endometriosis is a complex and heterogeneous condition affecting 10% of reproductive-age women, and yet, it often goes undiagnosed for several years. Limited observed heritability (7%) of large genetic association studies may be attributable to underlying heterogeneity of disease mechanisms. Therefore, we conducted this study to investigate genetic...
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance fall...
Apart from ancestry, personal or environmental covariates may contribute to differences in polygenic score (PGS) performance. We analyzed effects of covariate stratification and interaction on body mass index (BMI) PGS (PGSBMI) across four cohorts of European (N=491,111) and African (N=21,612) ancestry. Stratifying on binary covariates and quinti...
Introduction
Genomic medicine holds transformative potential for personalized nephrology care; however, its clinical integration poses challenges. Automated clinical decision support (CDS) systems in the electronic health record (EHR) offer a promising solution but have shown limited impact. This study aims to glean practical insights into nephrolo...
Background
Endometriosis affects 10% of reproductive-age women, and yet, it goes undiagnosed for 3.6 years on average after symptom onset. Despite large GWAS meta-analyses (N > 750,000), only a few dozen causal loci have been identified. We hypothesized that the challenges in identifying causal genes for endometriosis stem from heterogeneity acros...
Skin cancer mortality rates continue to rise, and survival analysis is increasingly needed to understand who is at risk and what interventions improve outcomes. However, current statistical methods are limited by inability to synthesize multiple data types, such as patient genetics, clinical history, demographics, and pathology and reveal significa...
INTRODUCTION: Cutaneous T-cell lymphoma (CTCL) is a heterogeneous group of non-Hodgkin lymphomas. Mycosis fungoides (MF), the most common CTCL subtype, frequently mimics eczema or psoriasis. Genetic variation, lack of definitive CTCL biomarkers and similarity to benign inflammatory conditions results in diagnostic delay. Artificial intelligence (AI...
Objectives
Extracting PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method, PICOX, to extract overlapping PICO entities.
Materials and Methods
PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then, it uses a mu...
Background
Alzheimer’s disease and related dementias (ADRD) affect over 55 million people globally. Current clinical trials suffer from low recruitment rates, a challenge potentially addressable via natural language processing (NLP) technologies that help researchers effectively identify eligible clinical trial participants.
Objective
This study investigate...
Objective:
Large language models (LLMs) like ChatGPT are powerful algorithms that have been shown to produce human-like text from input data. A number of potential clinical applications of this technology have been proposed and evaluated by biomedical informatics experts. However, few have surveyed healthcare providers for their opinions about whe...
Background
Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major disease manifestations of SLE for organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus ne...
Research on polygenic risk scores (PRSs) for common, genetically complex chronic diseases aims to improve health-related predictions, tailor risk-reducing interventions, and improve health outcomes. Yet, the study and use of PRSs in clinical settings raise equity, clinical, and regulatory challenges that can be greater for individuals from historic...
Objective
To automate scientific claim verification using PubMed abstracts.
Materials and Methods
We developed CliVER, an end-to-end scientific Claim VERification system that leverages retrieval-augmented techniques to automatically retrieve relevant clinical trial abstracts, extract pertinent sentences, and use the PICO framework to support or re...
Polygenic risk scores (PRSs) have improved in predictive performance, but several challenges remain to be addressed before PRSs can be implemented in the clinic, including reduced predictive performance of PRSs in diverse populations, and the interpretation and communication of genetic results to both providers and patients. To address these challe...
Chronic kidney disease (CKD) is determined by an interplay of monogenic, polygenic, and environmental risks. Autosomal dominant polycystic kidney disease (ADPKD) and COL4A-associated nephropathy (COL4A-AN) represent the most common forms of monogenic kidney diseases. These disorders have incomplete penetrance and variable expressivity, and we hypot...
African Americans have a significantly higher risk of developing chronic kidney disease, especially focal segmental glomerulosclerosis, than European Americans. Two coding variants (G1 and G2) in the APOL1 gene play a major role in this disparity. While 13% of African Americans carry the high-risk recessive genotypes, only a fraction of these ind...
Rare disease patients often endure prolonged diagnostic odysseys and may still remain undiagnosed for years. Selecting the appropriate genetic tests is crucial to lead to timely diagnosis. Phenotypic features offer great potential for aiding genomic diagnosis in rare disease cases. We see great promise in effective integration of phenotypic informa...
At the turn of this century deCODE genetics (https://www.decode.com/) proposed an ambitious new paradigm in human genetic research that entailed genotyping an entire population that had existing digitized health data. In 2007, the Wellcome Trust Case Control Consortium provided empirical evidence that such a dataset could be successfully repurposed...
Objective:
Developing targeted, culturally competent educational materials is critical for participant understanding of engagement in a large genomic study that uses computational pipelines to produce genome-informed risk assessments.
Materials and methods:
Guided by the Smerecnik framework that theorizes understanding of multifactorial genetic...
Background
Randomized clinical trials (RCT) are the foundation for medical advances, but participant recruitment remains a persistent barrier to their success. This retrospective data analysis aims to (1) identify clinical trial features associated with successful participant recruitment measured by accrual percentage and (2) compare the characteri...
Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot med...
Objective
Outcomes are important clinical study information. Despite progress in automated extraction of PICO (Population, Intervention, Comparison, and Outcome) entities from PubMed, rarely are these entities encoded by standard terminology to achieve semantic interoperability. This study aims to evaluate the suitability of the Unified Medical Lan...
Black Americans have a significantly higher risk of developing chronic kidney disease (CKD), especially focal segmental glomerulosclerosis (FSGS), than European Americans. Two coding variants (G1 and G2) in the APOL1 gene play a major role in this disparity. While 13% of Black Americans carry the high-risk recessive genotypes, only a fraction of th...
Apart from ancestry, personal or environmental covariates may contribute to differences in polygenic score (PGS) performance. We analyzed effects of covariate stratification and interaction on body mass index (BMI) PGS (PGSBMI) across four cohorts of European (N=491,111) and African (N=21,612) ancestry. Stratifying on binary covariates and quintile...
This reproducibility study presents an algorithm to weigh in race distribution data of clinical research study samples when training biomedical embeddings. We extracted 12,864 PubMed abstracts published between January 1st, 2000 and January 1st, 2022 and weighed them based on the race distribution data extracted from their corresponding clinical tr...
Participant recruitment continues to be a challenge to the success of randomized controlled trials, resulting in increased costs, extended trial timelines and delayed treatment availability. Literature provides evidence that study design features (e.g., trial phase, study site involvement) and trial sponsor are significantly associated with recruit...
Polygenic risk scores (PRS) have improved in predictive performance supporting their use in clinical practice. Reduced predictive performance of PRS in diverse populations can exacerbate existing health disparities. The NHGRI-funded eMERGE Network is returning a PRS-based genome-informed risk assessment to 25,000 diverse adults and children. We ass...
Creating a sustainable model for clinical data infrastructure requires the inclusion of key stakeholders, harmonization of their needs and constraints, integration with data governance considerations, conforming to FAIR principles while maintaining data safety and data quality, and maintaining financial health for contributing organizations and par...
With its seeming competence to mimic human responses, ChatGPT, an emerging AI-powered chatbot, has spurred great interest. This study aims to explore the role of ChatGPT in synthesizing medication literature and compare it with a hybrid summarization system. We tested ten medications' effectiveness with reference to their definitions and descriptio...
Importance
Chronic kidney disease (CKD) is a genetically complex disease determined by an interplay of monogenic, polygenic, and environmental risks. The most common forms of monogenic kidney disorders include autosomal dominant polycystic kidney disease (ADPKD), caused by mutations in the PKD1 or PKD2 genes, and COL4A-associated nephropathy (COL4A...
Background: Chronic kidney disease (CKD) is a genetically complex disease determined by an interplay of monogenic, polygenic, and environmental risks. Most forms of monogenic kidney diseases have incomplete penetrance and variable expressivity. It is presently unknown if some of the variability in penetrance can be attributed to polygenic factors....
With the burgeoning development of computational phenotypes, it is increasingly difficult to identify the right phenotype for the right tasks. This study uses a mixed-methods approach to develop and evaluate a novel metadata framework for retrieval of and reusing computational phenotypes. Twenty active phenotyping researchers from 2 large research...
Objective:
Feasible, safe, and inclusive eligibility criteria are crucial to successful clinical research recruitment. Existing expert-centered methods for eligibility criteria selection may not be representative of real-world populations. This paper presents a novel model called OPTEC (OPTimal Eligibility Criteria) based on the Multiple Attribute...
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWASs). Standard GWASs are well-powered to interrogate additive models; however, new approaches are required for investigating other modes of inheritance such as domina...
Objective:
To develop a computable representation for medical evidence and to contribute a gold standard dataset of annotated randomized controlled trial (RCT) abstracts, along with a natural language processing (NLP) pipeline for transforming free-text RCT evidence in PubMed into the structured representation.
Materials and methods:
Our represe...
Background:
Participant recruitment is a barrier to successful clinical research. One strategy to improve recruitment is to conduct eligibility prescreening, a resource-intensive process where clinical research staff manually reviews electronic health records data to identify potentially eligible patients. Criteria2Query (C2Q) was developed to add...
Background
Cardiometabolic diseases are highly comorbid, but their relationship with female‐specific or overwhelmingly female‐predominant health conditions (breast cancer, endometriosis, pregnancy complications) is understudied. This study aimed to estimate the cross‐trait genetic overlap and influence of genetic burden of cardiometabolic traits on...
The electronic Medical Records and Genomics (eMERGE) Network assessed the feasibility of deploying portable phenotype rule-based algorithms with natural language processing (NLP) components added to improve performance of existing algorithms using electronic health records (EHRs). Based on scientific merit and predicted difficulty, eMERGE selected...
Objective:
The aim of this study was to analyze a publicly available sample of rule-based phenotype definitions to characterize and evaluate the variability of logical constructs used.
Materials and methods:
A sample of 33 preexisting phenotype definitions used in research that are represented using Fast Healthcare Interoperability Resources and...
Immunoglobulin A (IgA) mediates mucosal responses to food antigens and the intestinal microbiome and is involved in susceptibility to mucosal pathogens, celiac disease, inflammatory bowel disease, and IgA nephropathy. We performed a genome-wide association study of serum IgA levels in 41,263 individuals of diverse ancestries and identified 20 genom...
Objective:
High BMI is associated with many comorbidities and mortality. This study aimed to elucidate the overall clinical risk of obesity using a genome- and phenome-wide approach.
Methods:
This study performed a phenome-wide association study of BMI using a clinical cohort of 736,726 adults. This was followed by genetic association studies us...
Background: Genomic Disorders (GDs) are associated with many comorbid outcomes, including chronic kidney disease (CKD). Identification of GDs has diagnostic utility.
Methods: We examined the prevalence of GDs among participants in the Chronic Kidney Disease in Children (CKiD) cohort II (N=248), Chronic Renal Insufficiency Cohort study (CRIC, N = 3,...
Leveraging linkage disequilibrium (LD) patterns as representative of population substructure enables the discovery of additive association signals in genome-wide association studies (GWAS). Standard GWAS are well-powered to interrogate additive models; however, new approaches are required to investigate other modes of inheritance such as dominance...
Objective:
To identify and characterize clinical subgroups of hospitalized COVID-19 patients.
Materials and methods:
Electronic health records of hospitalized COVID-19 patients at NewYork-Presbyterian/Columbia University Irving Medical Center were temporally sequenced and transformed into patient vector representations using Paragraph Vector mod...
Although individually rare, collectively more than 7,000 rare diseases affect about 10% of patients. Each of the rare diseases impacts the quality of life for patients and their families, and incurs significant societal costs. The low prevalence of each rare disease causes formidable challenges in accurately diagnosing and caring for these patients...
Objective
The free-text Condition data field in the ClinicalTrials.gov is not amenable to computational processes for retrieving, aggregating and visualizing clinical studies by condition categories. This paper contributes a method for automated ontology-based categorization of clinical studies by their conditions.
Materials and Methods
Our method...
Objective
To design and evaluate an interactive data quality (DQ) characterization tool focused on fitness-for-use completeness measures to support researchers’ assessment of a dataset.
Materials and Methods
Design requirements were identified through a conceptual framework on DQ, literature review, and interviews. The prototype of the tool was de...
Diagnosis for rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data...
Objective
Analyze a publicly available sample of rule-based phenotype definitions to characterize and evaluate the types of logical constructs used.
Materials & Methods
A sample of 33 phenotype definitions used in research and published to the Phenotype KnowledgeBase (PheKB), that are represented using Fast Healthcare Interoperability Resources (F...
Chronic kidney disease (CKD) is a common complex condition associated with high morbidity and mortality. Polygenic prediction could enhance CKD screening and prevention; however, this approach has not been optimized for ancestrally diverse populations. By combining APOL1 risk genotypes with genome-wide association studies (GWAS) of kidney function,...
Clinical and epidemiological studies have shown that circulatory system diseases and nervous system disorders often co-occur in patients. However, genetic susceptibility factors shared between these disease categories remain largely unknown. Here, we characterized pleiotropy across 107 circulatory system and 40 nervous system traits using an ensemb...
While the PICO framework is widely used by clinicians for clinical question formulation when querying the medical literature, it does not have the expressiveness to explicitly capture medical findings based on any standard. In addition, findings extracted from the literature are represented as free-text, which is not amenable to computation. This r...
Complex interventions are ubiquitous in healthcare. A lack of computational representations and information extraction solutions for complex interventions hinders accurate and efficient evidence synthesis. In this study, we manually annotated and analyzed 3,447 intervention snippets from 261 randomized clinical trial (RCT) abstracts and developed a...
The rapid growth of clinical trials launched in recent years poses significant challenges for accurate and efficient trial search. Keyword-based clinical trial search engines require users to construct effective queries, which can be a difficult task given complex information needs. In this study, we present an interactive clinical trial search int...
Electronic healthcare records data promises to improve the efficiency of patient eligibility screening, which is an important factor in the success of clinical trials and observational studies. To bridge the sociotechnical gap in cohort identification by end-users, who are clinicians or researchers unfamiliar with underlying EHR databases, we previ...
Bidirectional recurrent neural networks (RNNs) have improved performance on various natural language processing tasks and have recently been used for diagnosis prediction. The advantages of general bidirectional RNNs, however, do not readily carry over to the diagnosis prediction task. In this study, we present a simple way to efficiently apply bidirectional RNN fo...
Anecdotally, 38.5% of clinical outcome descriptions in randomized controlled trial publications contain complex text. Existing terminologies are insufficient to standardize outcomes and their measures, temporal attributes, quantitative metrics, and other attributes. In this study, we analyzed the semantic patterns in the outcome text in a sample of...
Background
As genomic sequencing moves closer to clinical implementation, there has been an increasing acceptance of returning incidental findings to research participants and patients for mutations in highly penetrant, medically actionable genes. A curated list of genes has been recommended by the American College of Medical Genetics and Genomics...
Importance:
Knowledge about the spectrum of diseases associated with hereditary cancer syndromes may improve disease diagnosis and management for patients and help to identify high-risk individuals.
Objective:
To identify phenotypes associated with hereditary cancer genes through a phenome-wide association study.
Design, setting, and participan...
Objective
To combine machine efficiency and human intelligence for converting complex clinical trial eligibility criteria text into cohort queries.
Materials and Methods
Criteria2Query (C2Q) 2.0 was developed to enable real-time user intervention for criteria selection and simplification, parsing error correction, and concept mapping. The accuracy...
The identification of delirium in electronic health records (EHRs) remains difficult due to inadequate assessment or under-documentation. The purpose of this research is to present a classification model that identifies delirium using retrospective EHR data. Delirium was confirmed with the Confusion Assessment Method for the Intensive Care Unit. Ag...
Accurate disease risk stratification can lead to more precise and personalized prevention and treatment of diseases. As an important component to disease risk, genetic risk factors can be utilized as an early and stable predictor for disease onset. Recently, the polygenic risk score (PRS) method has combined the effects from hundreds to millions of...
Cardiometabolic diseases are highly comorbid, but their relationship with female-specific health conditions (breast cancer, endometriosis, pregnancy complications) is understudied. Using electronic health record data from 71,008 ancestrally diverse females, we examined relationships between 23 obstetrical/gynecological conditions and 4 cardiometabo...
Cognitive impairment is a defining feature of neurological disorders such as Alzheimer's disease (AD), one of the leading causes of disability and mortality in the elderly population. Assessing cognitive impairment is important for diagnostic, clinical management, and research purposes. The Folstein Mini-Mental State Examination (MMSE) is the most...
Questions
Question (1)
What are the well-known theories in this area?