Emily Getzen’s research while affiliated with University of Pennsylvania and other places


Publications (27)


An Overview of Large Language Models for Statisticians
  • Preprint

February 2025 · 157 Reads

Wenlong Ji · Weizhe Yuan · Emily Getzen · [...] · Linjun Zhang

Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI), exhibiting remarkable capabilities across diverse tasks such as text generation, reasoning, and decision-making. While their success has primarily been driven by advances in computational power and deep learning architectures, emerging problems -- in areas such as uncertainty quantification, decision-making, causal inference, and distribution shift -- require a deeper engagement with the field of statistics. This paper explores potential areas where statisticians can make important contributions to the development of LLMs, particularly those that aim to engender trustworthiness and transparency for human users. Thus, we focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation. We also consider possible roles for LLMs in statistical analysis. By bridging AI and statistics, we aim to foster a deeper collaboration that advances both the theoretical foundations and practical applications of LLMs, ultimately shaping their role in addressing complex societal challenges.


Implications of mappings between International Classification of Diseases clinical diagnosis codes and Human Phenotype Ontology terms

November 2024 · 131 Reads · JAMIA Open

Objective: Integrating electronic health record (EHR) data with other resources is essential in rare disease research due to low disease prevalence. Such integration depends on the alignment of the ontologies used for data annotation. The International Classification of Diseases (ICD) is used to annotate clinical diagnoses, while the Human Phenotype Ontology (HPO) is used to annotate phenotypes. Although these ontologies overlap in the biomedical entities they describe, the extent to which they are interoperable is unknown. We investigate how well aligned these ontologies are and whether such alignments facilitate EHR data integration.
Materials and Methods: We conducted an empirical analysis of the coverage of mappings between ICD and HPO. We interpret this mapping coverage as a proxy for how easily clinical data can be integrated with research ontologies such as HPO. We quantify how exhaustively ICD codes are mapped to HPO by analyzing mappings in the Unified Medical Language System (UMLS) Metathesaurus. We analyze the proportion of ICD codes mapped to HPO within a real-world EHR dataset.
Results and Discussion: Our analysis revealed that only 2.2% of ICD codes have direct mappings to HPO in UMLS. Within our EHR dataset, less than 50% of ICD codes have mappings to HPO terms. ICD codes that are used frequently in EHR data tend to have mappings to HPO; ICD codes that represent rarer medical conditions are seldom mapped.
Conclusion: We find that interoperability between ICD and HPO via UMLS is limited. While other mapping sources could be incorporated, there are no established conventions for what resources should be used to complement UMLS.
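The coverage analysis described above can be sketched in a few lines. Everything below is invented for illustration: the ICD-to-HPO pairs stand in for what a real UMLS Metathesaurus crosswalk would provide, and the toy record list stands in for an EHR extract.

```python
from collections import Counter

# Hypothetical one-to-one ICD-10-CM -> HPO mappings, standing in for a
# UMLS-derived crosswalk. These pairs are illustrative, not real UMLS content.
ICD_TO_HPO = {
    "R51": "HP:0002315",   # headache
    "R05": "HP:0012735",   # cough
    "I10": "HP:0000822",   # hypertension
}

def mapping_coverage(icd_codes):
    """Return (share of distinct codes mapped, share of code occurrences mapped).

    The second number is usually higher: frequently used codes tend to have
    HPO mappings, while rare-disease codes often do not.
    """
    counts = Counter(icd_codes)
    if not counts:
        return 0.0, 0.0
    distinct = sum(1 for c in counts if c in ICD_TO_HPO) / len(counts)
    occurrences = sum(n for c, n in counts.items()
                      if c in ICD_TO_HPO) / sum(counts.values())
    return distinct, occurrences

# Toy EHR extract: common codes repeat, a rare unmapped code appears once.
records = ["I10", "I10", "I10", "R05", "R51", "Q87.8"]
by_code, by_occurrence = mapping_coverage(records)
```

On this toy extract, occurrence-weighted coverage exceeds distinct-code coverage, mirroring the abstract's finding that frequent codes are mapped while rare ones are not.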


Realizing the potential of social determinants data in EHR systems: A scoping review of approaches for screening, linkage, extraction, analysis, and interventions

October 2024 · 70 Reads · 8 Citations · Journal of Clinical and Translational Science

Background: Social determinants of health (SDoH), such as socioeconomics and neighborhoods, strongly influence health outcomes. However, the current state of standardized SDoH data in electronic health records (EHRs) is lacking, a significant barrier to research and care quality.
Methods: We conducted a PubMed search using "SDOH" and "EHR" Medical Subject Headings terms, analyzing included articles across five domains: 1) SDoH screening and assessment approaches, 2) SDoH data collection and documentation, 3) use of natural language processing (NLP) for extracting SDoH, 4) SDoH data and health outcomes, and 5) SDoH-driven interventions.
Results: Of 685 articles identified, 324 underwent full review. Key findings include implementation of tailored screening instruments, census and claims data linkage for contextual SDoH profiles, NLP systems extracting SDoH from notes, associations between SDoH and healthcare utilization and chronic disease control, and integrated care management programs. However, variability across data sources, tools, and outcomes underscores the need for standardization.
Discussion: Despite progress in identifying patient social needs, further development of standards, predictive models, and coordinated interventions is critical for SDoH-EHR integration. Additional database searches could strengthen this scoping review. Ultimately, widespread capture, analysis, and translation of multidimensional SDoH data into clinical care is essential for promoting health equity.


Fig. 2: Performance metrics for Convalesco's highest scoring submission. The calibration curves and area under the receiver operator curves from Convalesco's highest scoring submission. Each sub-graph shows individual model performances from Convalesco's submission. The "Main Model" is the model that was evaluated and scored for the L3C evaluation. Model 100 includes only 100 temporal features, Model 36 includes just the top 36 temporal features, and Model Z includes the same 100 temporal features but excludes racial information and data contributor identifiers. (a) The calibration curves from the model on the Hold Out Testing dataset. (b) The calibration curves from the model on the Two Site Testing dataset. (c) The calibration curves from the model on the level 3 post-challenge Limited Testing dataset. (d) The receiver operator curves from the model on the Hold Out Testing dataset. (e) The receiver operator curves from the model on the Two Site Testing dataset. (f) The receiver operator curves from the model on the level 3 post-challenge Limited Testing dataset. While the model wasn't well calibrated to the Hold Out testing dataset, the model generalized well to two out of sample datasets from separate data contributing partners and improved further after re-training and evaluation on the level 3 limited dataset.
Fig. 3: Interpretability dashboard from Convalesco's submission. The chart represents a prototype patient risk timeline. The top graph shows the single-event contributions toward the predicted PASC risk at Day 28. The risk change was calculated based on the difference between the final prediction and the hypothetical risk using all data except one event. Only a subsample of events are shown. The bottom chart shows the day-by-day predictions of cumulative risk based on events prior to the day.
Crowd-sourced machine learning prediction of long COVID using data from the National COVID Cohort Collaborative

September 2024 · 33 Reads · 1 Citation · EBioMedicine

Background: While many patients recover fully from SARS-CoV-2 infection, others report experiencing symptoms for weeks or months after their acute COVID-19 ends, even developing new symptoms weeks after infection. These long-term effects are called post-acute sequelae of SARS-CoV-2 (PASC) or, more commonly, Long COVID. The overall prevalence of Long COVID is currently unknown, and tools are needed to help identify patients at risk of developing it.
Methods: A working group of the Rapid Acceleration of Diagnostics-radical (RADx-rad) program, comprised of individuals from various NIH institutes and centers, in collaboration with REsearching COVID to Enhance Recovery (RECOVER), developed and organized the Long COVID Computational Challenge (L3C), a community challenge aimed at incentivizing the broader scientific community to develop interpretable and accurate methods for identifying patients at risk of developing Long COVID. From August 2022 to December 2022, participants developed Long COVID risk prediction algorithms using the National COVID Cohort Collaborative (N3C) data enclave, a harmonized data repository from over 75 healthcare institutions across the United States (U.S.).
Findings: Over the course of the challenge, 74 teams designed and built 35 Long COVID prediction models using the N3C data enclave. The top 10 teams all scored above a 0.80 Area Under the Receiver Operator Curve (AUROC), with the highest scoring model achieving a mean AUROC of 0.895. Included in the top submission was a visualization dashboard that built timelines for each patient, updating the risk of a patient developing Long COVID in response to clinical events.
Interpretation: As a result of L3C, federal reviewers identified multiple machine learning models that can be used to identify patients at risk of developing Long COVID. Many of the teams used approaches in their submissions which can be applied to future clinical prediction questions.
Funding: Research reported in this RADx® Rad publication was supported by the National Institutes of Health. Timothy Bergquist, Johanna Loomba, and Emily Pfaff were supported by Axle Subcontract: NCATS-STSS-P00438.



Mining for Health: A Comparison of Word Embedding Methods for Analysis of EHRs Data

June 2024 · 4 Reads · 7 Citations

Electronic health records (EHRs), routinely collected as part of healthcare delivery, offer great promise for advancing precision health. At the same time, they present significant analytical challenges. In EHRs, data for individual patients are collected at irregular time intervals and with varying frequencies; they include both structured and unstructured data. Advanced statistical and machine learning methods have been developed to tackle these challenges, for example, for predicting diagnoses earlier and more accurately. One powerful tool for extracting useful information from EHRs data is word embedding algorithms, which represent words as vectors of real numbers that capture the words’ semantic and syntactic similarities. Learning embeddings can be viewed as automated feature engineering, producing features that can be used for predictive modeling of medical events. Methods such as Word2Vec, BERT, FastText, ELMo, and GloVe have been developed for word embedding, but there has been little work on re-purposing these algorithms for the analysis of structured medical data. Our work seeks to fill this important gap. We extended word embedding methods to embed (structured) medical codes from a patient’s entire medical history, and used the resultant embeddings to build prediction models for diseases. We assessed the performance of multiple embedding methods in terms of predictive accuracy and computation time using the Medical Information Mart for Intensive Care (MIMIC) database. We found that using Word2Vec, FastText, and GloVe algorithms yield comparable models, while more recent contextual embeddings provide marginal further improvement. Our results provide insights and guidance to practitioners regarding the use of word embedding methods for the analysis of EHR data.
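The core move of treating a patient's code history as a "sentence" can be illustrated with a count-based stand-in for Word2Vec: codes that appear in similar contexts end up with similar vectors. The ICD-style histories below are invented; a real analysis would train Word2Vec/FastText/GloVe on sequences from a database such as MIMIC.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(histories, window=2):
    """Build a context co-occurrence vector per code from patient sequences.

    A count-based proxy for Word2Vec-style embeddings: each code is
    represented by the counts of codes seen within `window` positions of it.
    """
    vecs = defaultdict(lambda: defaultdict(int))
    for seq in histories:
        for i, code in enumerate(seq):
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if i != j:
                    vecs[code][seq[j]] += 1
    return {c: dict(ctx) for c, ctx in vecs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented histories: E11.9 (diabetes) and I10 (hypertension) share contexts
# with N18.3 (chronic kidney disease); M54.5 (low back pain) does not.
histories = [
    ["E11.9", "I10", "N18.3"],
    ["I10", "E11.9", "N18.3"],
    ["M54.5", "S39.012"],
]
vecs = cooccurrence_vectors(histories)
```

Here `cosine(vecs["E11.9"], vecs["I10"])` comes out higher than the similarity to the unrelated code, which is the property the learned embeddings exploit as features for downstream disease prediction.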


Figure 3. Consistency scores (higher is better) for DISCRET and a black-box model (TransTEE) combined with a post-hoc explainer. Our results confirm that DISCRET produces faithful explanations, and importantly, show that post-hoc explanations are rarely faithful, as evidenced by low consistency scores across datasets.
Figure 5. The curve of ATE errors on test split of IHDP by DISCRET
Figure 6. Frequency of the outcome values on Uganda dataset
DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation

June 2024 · 23 Reads

Designing faithful yet accurate AI models is challenging, particularly in the field of individual treatment effect estimation (ITE). ITE prediction models deployed in critical settings such as healthcare should ideally be (i) accurate, and (ii) provide faithful explanations. However, current solutions are inadequate: state-of-the-art black-box models do not supply explanations, post-hoc explainers for black-box models lack faithfulness guarantees, and self-interpretable models greatly compromise accuracy. To address these issues, we propose DISCRET, a self-interpretable ITE framework that synthesizes faithful, rule-based explanations for each sample. A key insight behind DISCRET is that explanations can serve dually as database queries to identify similar subgroups of samples. We provide a novel RL algorithm to efficiently synthesize these explanations from a large search space. We evaluate DISCRET on diverse tasks involving tabular, image, and text data. DISCRET outperforms the best self-interpretable models and has accuracy comparable to the best black-box models while providing faithful explanations. DISCRET is available at https://github.com/wuyinjun-1993/DISCRET-ICML2024.
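The key insight — that a rule-based explanation can double as a database query selecting similar samples — can be sketched as follows. The dataset, rule, and effect estimator are invented for illustration; the actual DISCRET system learns such rules with reinforcement learning over a large search space.

```python
# Toy tabular dataset: each row is a sample with covariates, a treatment
# indicator, and an observed outcome. All values are invented.
rows = [
    {"age": 34, "severity": 2, "treated": 1, "outcome": 0.9},
    {"age": 37, "severity": 2, "treated": 0, "outcome": 0.4},
    {"age": 71, "severity": 5, "treated": 1, "outcome": 0.3},
    {"age": 68, "severity": 5, "treated": 0, "outcome": 0.2},
]

OPS = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b, "==": lambda a, b: a == b}

def query(rule, data):
    """Treat the explanation as a query: keep rows satisfying every
    (feature, op, value) predicate."""
    return [r for r in data
            if all(OPS[op](r[feat], val) for feat, op, val in rule)]

def subgroup_ite(rule, data):
    """Estimate the treatment effect inside the subgroup the rule selects,
    as the treated-minus-control mean outcome difference."""
    sub = query(rule, data)
    treated = [r["outcome"] for r in sub if r["treated"]]
    control = [r["outcome"] for r in sub if not r["treated"]]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)

# "age <= 50 AND severity <= 3" explains a young, low-severity sample:
# the rule is both the explanation and the lookup that grounds the estimate.
rule = [("age", "<=", 50), ("severity", "<=", 3)]
effect = subgroup_ite(rule, rows)
```

Because the prediction is computed directly from the rows the rule retrieves, the explanation is faithful by construction, rather than reverse-engineered post hoc.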


On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

May 2024 · 14 Reads · 2 Citations

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.
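The preference-collapse phenomenon can be seen numerically in a simplified two-response setting, assuming a uniform reference policy and Bradley-Terry rewards (all numbers below are invented). KL-regularized RLHF has the closed-form optimum π(y) ∝ π_ref(y)·exp(r(y)/β), which sharpens toward the majority response as β shrinks, whereas the preference-matching target keeps the policy equal to the reward model's preference distribution.

```python
from math import exp, log

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    e = [exp(s - m) for s in scores]
    z = sum(e)
    return [x / z for x in e]

# Two candidate responses; rewards chosen so that the Bradley-Terry
# preference distribution is 70% for A and 30% for B.
rewards = [log(0.7), log(0.3)]

def kl_rlhf_policy(rewards, beta, ref=None):
    """Closed-form optimum of KL-regularized RLHF:
    pi(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    ref = ref or [1.0 / len(rewards)] * len(rewards)
    return softmax([log(p) + r / beta for p, r in zip(ref, rewards)])

preference = softmax(rewards)                 # PM RLHF target: [0.7, 0.3]
collapsed = kl_rlhf_policy(rewards, beta=0.1) # minority nearly vanishes
```

With β = 0.1 the minority response's probability drops well below 1%, illustrating the collapse the paper's PM regularizer (the negative log of the policy's own probability) is designed to prevent.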


Figure 1: Medical visits occur at irregular time intervals and varying frequencies.
Figure 2: Illustration of zero-padding and LLM-based imputation: (a) BioClinicalBert embeddings are generated based on existing notes and zero-vectors are used for missing notes. (b) The LLM is given existing notes and asked to generate text for the missing notes. BioClinicalBert embeddings are calculated for both existing and generated text.
Performance metrics of supervised learning methods by temporal harmonization method
Quantifying improvements in performance for patients with more or less missing data
Filling the gaps: leveraging large language models for temporal harmonization of clinical text across multiple medical visits for clinical prediction

May 2024 · 42 Reads

Electronic health records offer great promise for early disease detection, treatment evaluation, information discovery, and other important facets of precision health. Clinical notes, in particular, may contain nuanced information about a patient's condition, treatment plans, and history that structured data may not capture. As a result, and with advancements in natural language processing, clinical notes have been increasingly used in supervised prediction models. To predict long-term outcomes such as chronic disease and mortality, it is often advantageous to leverage data occurring at multiple time points in a patient's history. However, these data are often collected at irregular time intervals and varying frequencies, thus posing an analytical challenge. Here, we propose the use of large language models (LLMs) for robust temporal harmonization of clinical notes across multiple visits. We compare multiple state-of-the-art LLMs in their ability to generate useful information during time gaps, and evaluate performance in supervised deep learning models for clinical prediction.
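The two harmonization strategies compared in Figure 2 — zero-padding versus LLM-generated text for missing visits — can be sketched on a regular monthly grid. The embedding and note generator below are toy stand-ins: in the paper, embeddings come from BioClinicalBERT and `generate_note` would be an LLM prompted with the existing notes.

```python
EMBED_DIM = 4

def toy_embed(text):
    """Deterministic toy text embedding (stand-in for BioClinicalBERT)."""
    v = [0.0] * EMBED_DIM
    for i, ch in enumerate(text):
        v[i % EMBED_DIM] += ord(ch) / 1000.0
    return v

def generate_note(prev_notes):
    """Placeholder for LLM imputation: echo the latest available note.
    A real LLM would be asked to write plausible text for the gap."""
    return prev_notes[-1] if prev_notes else ""

def harmonize(notes_by_month, horizon, strategy="zero"):
    """Align notes to a monthly grid of length `horizon`.

    strategy="zero": zero-vector for months without a note.
    strategy="llm":  embed generated text for those months instead.
    """
    timeline, seen = [], []
    for month in range(horizon):
        if month in notes_by_month:
            seen.append(notes_by_month[month])
            timeline.append(toy_embed(notes_by_month[month]))
        elif strategy == "zero":
            timeline.append([0.0] * EMBED_DIM)
        else:
            timeline.append(toy_embed(generate_note(seen)))
    return timeline

# Invented patient: notes exist at months 0 and 3, gaps elsewhere.
notes = {0: "stable angina, started statin", 3: "chest pain resolved"}
zero_padded = harmonize(notes, horizon=5, strategy="zero")
llm_imputed = harmonize(notes, horizon=5, strategy="llm")
```

Either timeline can then feed a fixed-length sequence model; the comparison in the paper asks whether the generated gap-filling text improves downstream prediction over zero-padding.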



Citations (17)


... Development of tools plays a crucial role in addressing SDoH by providing innovative solutions to identify, measure, and analyze the complex SDoH factors influencing health outcomes. The papers featured in this special issue highlight a range of notable tools and methodologies designed to tackle various aspects of SDoH [17][18][19][20][21][22][23][24]. For example, German et al. [21] presented an interactive visualization tool that integrates clinical, sociodemographic, and environmental data to enhance understanding of health disparities in diabetes care and outcomes. ...

Reference:

Advancing social determinants of health research and practice: Data, tools, and implementation
Realizing the potential of social determinants data in EHR systems: A scoping review of approaches for screening, linkage, extraction, analysis, and interventions

Journal of Clinical and Translational Science

... Along the same lines, CRNs are often well-positioned to foster multidisciplinary collaboration [20]. They have streamlined contracting and collaboration processes that expedite research while reducing regulatory burden, and can also provide an efficient vehicle for patient recruitment into prospective cohort studies and clinical trials. ...

Crowd-sourced machine learning prediction of long COVID using data from the National COVID Cohort Collaborative

EBioMedicine

... However, for large and complex datasets this model cannot yield promising effectiveness. Similarity-aware diffusion Model-Based Imputation (SADI) [25] leverages diffusion models and self-attention to effectively impute missing values in EHRs. Current models typically rely on correlations between time points and features, which is effective for data with strong correlations, such as Intensive Care Unit (ICU) data. ...

SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data
  • Citing Article
  • May 2024

... Time spent building word embeddings can affect NLP model training. More effective word embedding methods reduce training time, speeding model development [Getzen,22]. Model training efficiency and iteration speed for best performance and deployment depend on neural network fitting time. ...

Mining for Health: A Comparison of Word Embedding Methods for Analysis of EHRs Data
  • Citing Chapter
  • June 2024

... Recent LLMs have shown great potential in generative applications especially its superior zero-and few-shot performance [30,32,33,[68][69][70][71][72][73]. Despite this, the generated content can be unfaithful, inconsistent and biased [23,[74][75][76][77][78]. We plan to thoroughly evaluate LLMs and extend to Ascle in the future, ensuring that their application truly benefits biomedical researchers and healthcare professionals. ...

Realizing the Potential of Social Determinants Data: A Scoping Review of Approaches for Screening, Linkage, Extraction, Analysis and Interventions

... CT is useful for the diagnosis of bone metastases and for estimating the extent of bone destruction [124][125][126][127]. Biomechanical CT (BCT) is a radiomic technique that measures BMD and bone strength from CT scans. Among men diagnosed with metastatic hormone-sensitive prostate cancer, BCT assessments were strongly correlated with DXA and predicted subsequent pathologic fracture [128]. CT-based rigidity analysis has a small influence on perceived fracture risk [129][130]. ...

Validation of Biomechanical Computed Tomography for Fracture Risk Classification in Metastatic Hormone-sensitive Prostate Cancer
  • Citing Article
  • November 2023

European Urology Oncology

... 17,[21][22][23][24][25][26] This limitation may stem from the inherently non-stationary nature of temporal timestamps in EHRs, wherein their relevance and influence in modeling health outcomes dynamically shift in relation to the exposure or treatment of interest. 27 Post-acute sequelae of SARS-CoV-2 (PASC) represent a health outcome that can be examined within the framework of recurring exposures and outcomes. In this context, individuals may experience multiple COVID-19 infections, with PASC potentially associated with a specific infection episode. ...

Characterization of long COVID temporal sub-phenotypes by distributed representation learning from electronic health record data: a cohort study
  • Citing Article
  • October 2023

EClinicalMedicine


Potential pitfalls in the use of real-world data for studying long COVID
  • Citing Article
  • April 2023

Nature Medicine

... EHRs comprise multiple data types with distinct characteristics. For instance, clinical events like ICD codes are high-dimensional and sequential, while time series data are relatively low-dimensional but feature irregular sampling and significant missingness, which is often informative and not random [3,4]. Additionally, patients may have multiple visits or admissions, with data across visits being correlated. ...

Informative Missingness: What can we learn from patterns in missing laboratory data in the electronic health record?
  • Citing Article
  • February 2023

Journal of Biomedical Informatics

... Third, exploiting this method leads to a streamlined medical workflow. When healthcare providers have timely and proper data, they can make accurate medical decisions, ultimately improving patient outcomes [28]. ...

Mining for Equitable Health: Assessing the Impact of Missing Data in Electronic Health Records
  • Citing Article
  • January 2023

Journal of Biomedical Informatics