Tianxi Cai
Harvard University | Harvard · Department of Biostatistics

Publications: 416
Reads: 37,510
Citations: 17,234

Publications (416)
Article
Full-text available
Introduction We measured and compared five individual surrogate markers—change from baseline to 1 year after randomization in hemoglobin A1c (HbA1c), fasting glucose, 2-hour postchallenge glucose, triglyceride–glucose (TyG) index, and homeostatic model assessment of insulin resistance (HOMA-IR)—in terms of their ability to explain a treatment...
Preprint
BACKGROUND Cohort studies contain rich clinical data across large and diverse patient populations that are a common source of observational data for clinical research. Because large scale cohort studies are both time and resource intensive, one alternative is to harmonize data from existing cohorts through multi-cohort studies. Given differences in...
Preprint
Full-text available
Background: Long COVID, characterized as post-acute sequelae of SARS-CoV-2 (PASC), has no universal clinical case definition. Recent efforts have focused on understanding long COVID symptoms, and electronic health record (EHR) data provide a unique resource for understanding this condition. The introduction of the International Classification of Dis...
Preprint
Though electronic health record (EHR) systems are a rich repository of clinical information with large potential, the use of EHR-based phenotyping algorithms is often hindered by inaccurate diagnostic records, the presence of many irrelevant features, and the requirement for a human-labeled training set. In this paper, we describe a knowledge-drive...
Article
Background Characterizing Post-Acute Sequelae of COVID (SARS-CoV-2 Infection), or PASC, has been challenging due to the multitude of sub-phenotypes, temporal attributes, and definitions. Scalable characterization of PASC sub-phenotypes can enhance screening capacities, disease management, and treatment planning. Methods We conducted a retrospective...
Article
Importance: The US Food and Drug Administration (FDA) is building a national postmarketing surveillance system for medical devices, moving to a "total product life cycle" approach whereby more limited premarketing data are balanced with postmarketing surveillance to capture rare adverse events and long-term safety issues. Objective: To assess th...
Preprint
Full-text available
Genome-wide association studies (GWAS) have underrepresented individuals from non-European populations, impeding progress in characterizing the genetic architecture and consequences of health and disease traits. To address this, we present a population-stratified phenome-wide GWAS followed by a multi-population meta-analysis for 2,068 traits derive...
Article
Full-text available
Background Many patients with rheumatoid arthritis (RA) require a trial of multiple biologic disease-modifying anti-rheumatic drugs (bDMARDs) to control their disease. With the availability of several bDMARD options, the history of bDMARDs may provide an alternative approach to understanding subphenotypes of RA. The objective of this study was to d...
Article
Objective: Disease knowledge graphs have emerged as a powerful tool for AI, enabling the connection, organization, and access to diverse information about diseases. However, the relations between disease concepts are often distributed across multiple data formats, including plain language and incomplete disease knowledge graphs. As a result, extra...
Article
Objective: Electronic health records (EHR), containing detailed longitudinal clinical information on a large number of patients and covering broad patient populations, open opportunities for comprehensive predictive modeling of disease progression and treatment response. However, since EHRs were originally constructed for administrative purposes n...
Preprint
Due to the increasing adoption of electronic health records (EHR), large scale EHRs have become another rich data source for translational clinical research. Despite its potential, deriving generalizable knowledge from EHR data remains challenging. First, EHR data are generated as part of clinical care with data elements too detailed and fragmented...
Article
Full-text available
Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is el...
Preprint
Full-text available
Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data imposes significant challenges for feature rep...
Preprint
Full-text available
Electronic health record (EHR) data are increasingly used to support real-world evidence (RWE) studies. Yet its ability to generate reliable RWE is limited by the lack of readily available precise information on the timing of clinical events such as the onset time of heart failure. We propose a LAbel-efficienT incidenT phEnotyping (LATTE) algorithm...
Article
Full-text available
Background: Rheumatoid arthritis (RA) shares genetic variants with other autoimmune conditions, but existing studies test the association of RA variants with a pre-defined set of phenotypes. The objective of this study was to perform a large-scale, systemic screen to determine phenotypes that share genetic architecture with RA to inform our u...
Article
The development of phenotypes using electronic health records is a resource-intensive process. Therefore, the cataloging of phenotype algorithm metadata for reuse is critical to accelerate clinical research. The Department of Veterans Affairs (VA) has developed a standard for phenotype metadata collection which is currently used in the VA phenomics...
Article
Growing evidence has shown that applying machine learning models to large clinical data sources may exceed clinician performance in suicide risk stratification. However, many existing prediction models either suffer from "temporal bias" (a bias that stems from using case-control sampling) or require training on all available patient visit data. Her...
Preprint
Full-text available
The International Classification of Diseases (ICD)-10 code (U09.9) for post-acute sequelae of COVID-19 (PASC) was introduced in October of 2021. As researchers seek to leverage this billing code for research purposes in large scale real-world studies of PASC, it is of utmost importance to understand the functional use of the code by healthcare prov...
Preprint
Surrogate variables in electronic health records (EHR) play an important role in biomedical studies due to the scarcity or absence of chart-reviewed gold standard labels, under which supervised methods using only labeled data perform poorly. Meanwhile, synthesizing multi-site EHR data is crucial for powerful and generalizable statistical lea...
Article
Full-text available
Motivation: Predicting molecule-disease indications and side effects is important for drug development and pharmacovigilance. Comprehensively mining molecule-molecule, molecule-disease and disease-disease semantic dependencies can potentially improve prediction performance. Methods: We introduce a Multi-Modal REpresentation Mapping Approach to P...
Article
Background: In electronic health records, patterns of missing laboratory test results could capture patients' course of disease as well as reflect clinicians' concerns or worries for possible conditions. These patterns are often understudied and overlooked. This study aims to identify informative patterns of missingness among laboratory data colle...
Preprint
Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-wor...
Article
Full-text available
Purpose In young adults (18 to 49 years old), investigation of the acute respiratory distress syndrome (ARDS) after severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection has been limited. We evaluated the risk factors and outcomes of ARDS following infection with SARS-CoV-2 in a young adult population. Methods A retrospective coho...
Preprint
Synthesizing information from multiple data sources is critical to ensure knowledge generalizability. Integrative analysis of multi-source data is challenging due to the heterogeneity across sources and data-sharing constraints due to privacy concerns. In this paper, we consider a general robust inference framework for federated meta-learning of da...
Article
Full-text available
Background While acute kidney injury (AKI) is a common complication in COVID-19, data on post-AKI kidney function recovery and the clinical factors associated with poor kidney function recovery are lacking. Methods A retrospective multi-centre observational cohort study comprising 12,891 hospitalized patients aged 18 years or older with a diagnosis...
Article
There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with the increased risk of new-onset type II diabetes (T2D). Nevertheless, to date, there is no robust evidence as to whether and what kind of populations are indeed vulnera...
Preprint
Full-text available
While randomized controlled trials (RCTs) are the gold-standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data (RWD) has been vital in post-approval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of RWD is electronic...
Article
The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with shorter follow‐up time or less cost. However, previous work has demonstrated potential heterogeneity in the utility of a surrogate marker. When such heterogeneity exists, existing methods that use the surrogate...
Article
Full-text available
While there exist numerous methods to identify binary phenotypes (e.g., COPD) using electronic health record (EHR) data, few exist to ascertain the timings of phenotype events (e.g., COPD onset or exacerbations). Estimating event times could enable more powerful use of EHR data for longitudinal risk modeling, including survival analysis. Here we intr...
Preprint
Full-text available
Network analysis has been a powerful tool to unveil relationships and interactions among a large number of objects. Yet its effectiveness in accurately identifying important node-node interactions is challenged by the rapidly growing network size, with data being collected at an unprecedented granularity and scale. Common wisdom to overcome such hi...
Preprint
Genomic data are increasingly incorporated into high-throughput approaches such as the Phenome-Wide Association Study (PheWAS) to query potential effects of targeted therapies. Genetic variants, such as the interleukin-6 receptor (IL6R) genetic variant rs2228145 (Asp358Ala), have been identified with a downstream effect similar to the drug, e.g.,...
Preprint
Motivated by increasing pressure for decision makers to shorten the time required to evaluate the efficacy of a treatment such that treatments deemed safe and effective can be made publicly available, there has been substantial recent interest in using an earlier or easier to measure surrogate marker, $S$, in place of the primary outcome, $Y$. To v...
Preprint
The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with shorter follow-up time or less cost. However, previous work has demonstrated potential heterogeneity in the utility of a surrogate marker. When such heterogeneity exists, existing methods that use the surrogate...
Preprint
Full-text available
The development of phenotypes using electronic health records is a resource intensive process. Therefore, the cataloging of phenotype algorithm metadata for reuse is critical to accelerate clinical research. The Department of Veterans Affairs Office of Research and Development has developed a phenomics knowledgebase library, CIPHER (Centralized Int...
Preprint
Full-text available
In this work, we propose a semi-supervised triply robust inductive transfer learning (STRIFLE) approach, which integrates heterogeneous data from label rich source population and label scarce target population to improve the learning accuracy in the target population. Specifically, we consider a high dimensional covariate shift setting and employ t...
Article
Objective Electronic Health Record (EHR) based phenotyping is a crucial yet challenging problem in the biomedical field. Though clinicians typically determine patient-level diagnoses via manual chart review, the sheer volume and heterogeneity of EHR data renders such tasks challenging, time-consuming, and prohibitively expensive, thus leading to a...
Article
Objective For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. Materials and...
Article
Background: There is limited data on comparative risk of infections with various biologic agents in older adults with inflammatory bowel diseases (IBD). Aim: We aimed to assess the comparative safety of biologic agents in older IBD patients with varying comorbidity burden. Methods: We used data from a large, national commercial insurance plan...
Article
Full-text available
Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and meth...
Article
Objective The growing availability of electronic health records (EHR) data opens opportunities for integrative analysis of multi-institutional EHR to produce generalizable knowledge. A key barrier to such integrative analyses is the lack of semantic interoperability across different institutions due to coding differences. We propose a Multiview Inc...
Article
Full-text available
Importance: Temporal shifts in clinical knowledge and practice need to be adjusted for in treatment outcome assessment in clinical evidence. Objective: To use electronic health record (EHR) data to (1) assess the temporal trends in treatment decisions and patient outcomes and (2) emulate a randomized clinical trial (RCT) using EHR data with prop...
Preprint
Full-text available
Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, they are prohibited from sharing due to privacy constraints. Besides, heterogeneity exists in different sites. As a r...
Article
Objective Accurately assigning phenotype information to individual patients via computational phenotyping using Electronic Health Records (EHRs) has been seen as the first step towards enabling EHRs for precision medicine research. Chart review labels annotated by clinical experts, also known as “gold standard” labels, are essential for the develop...
Article
Full-text available
Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates acr...
Article
Nested case control (NCC) is a sampling method widely used for developing and evaluating risk models with expensive biomarkers on large prospective cohort studies. In a typical NCC design, biomarker values are obtained on a subcohort, where cases consist of all the events (subjects who experience the event during the follow‐up). However, when the n...
Article
Full-text available
The risk profiles of post-acute sequelae of COVID-19 (PASC) have not been well characterized in multi-national settings with appropriate controls. We leveraged electronic health record (EHR) data from 277 international hospitals representing 414,602 patients with COVID-19, 2.3 million control patients without COVID-19 in the inpatient and outpatien...
Preprint
Full-text available
There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with the increased risk of new-onset type II diabetes (T2D). However, because existing clinical studies with limited sample sizes often suffer from selection bias issues, there is no r...
Preprint
Full-text available
Background In electronic health records, patterns of missing laboratory test results could capture patients’ course of disease as well as reflect clinician’s concerns or worries for possible conditions. These patterns are often understudied and overlooked. This study aims to characterize the patterns of missingness among laboratory data collected a...
Article
In this retrospective cohort study of 94,595 SARS-CoV-2 positive cases, we developed and validated an algorithm to assess the association between COVID-19 severity and long-term complications (stroke, myocardial infarction, pulmonary embolism/deep vein thrombosis, heart failure, and mortality). COVID-19 severity was associated with a greater risk o...
Article
Objective The use of electronic health records (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings) and traditional natural language processing (NLP) techniques cannot differentia...
Preprint
Full-text available
Purpose In young adults (18 to 49 years old), investigation of the acute respiratory distress syndrome (ARDS) after severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection has been limited. We evaluated the risk factors and outcomes of ARDS following infection with SARS-CoV-2 in a young adult population. Methods A retrospective coho...
Article
Identifying effective and valid surrogate markers to make inference about a treatment effect on long‐term outcomes is an important step in improving the efficiency of clinical trials. Replacing a long term outcome with short term and/or cheaper surrogate markers can potentially shorten study duration and reduce trial costs. There is a sizable stati...
Article
In many contemporary applications, large amounts of unlabelled data are readily available while labelled examples are limited. There has been substantial interest in semi‐supervised learning (SSL) which aims to leverage unlabelled data to improve estimation or prediction. However, current SSL literature focuses primarily on settings where labelled...
Article
Objectives The pathogenesis of intracranial aneurysms is multifactorial and includes genetic, environmental, and anatomic influences. We aimed to identify image-based morphological parameters that were associated with middle cerebral artery (MCA) bifurcation aneurysms. Materials and methods We evaluated three-dimensional morphological parameters o...
Article
Leveraging large-scale electronic health record (EHR) data to estimate survival curves for clinical events can enable more powerful risk estimation and comparative effectiveness research. However, use of EHR data is hindered by a lack of direct event time observations. Occurrence times of relevant diagnostic codes or target disease mentions in clin...
Preprint
Full-text available
Objective For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. Materials and...
Article
Full-text available
Background Tofacitinib and inflammatory bowel disease (IBD) have been associated with increased risks for thromboembolic and cardiovascular events, but drug attributable risk is unknown. Methods We conducted a retrospective cohort study in a US claims database. We identified patients with IBD by International Classification of Disease (ICD) codes,...
Article
Introduction The comparative safety of therapies is important to inform relative positioning within the therapeutic algorithm. Tumor necrosis factor α antagonists (anti-TNF) are associated with an increased risk of infections. Whether there is a similar increase with ustekinumab (UST) or tofacitinib has not been established. Methods We identified...
Article
Objective To assess changes in international mortality rates and laboratory recovery rates during hospitalisation for patients hospitalised with SARS-CoV-2 between the first wave (1 March to 30 June 2020) and the second wave (1 July 2020 to 31 January 2021) of the COVID-19 pandemic. Design, setting and participants This is a retrospective cohort s...
Article
The risk profiles of post-acute sequelae of COVID-19 (PASC) have not been well characterized in multi-national settings with appropriate controls. We leveraged electronic health record (EHR) data from 277 international hospitals representing 414,602 patients with COVID-19, 2.3 million control patients without COVID-19 in the inpatient and outpatien...
Article
Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent effo...
Article
In studies that require long-term and/or costly follow-up of participants to evaluate a treatment, there is often interest in identifying and using a surrogate marker to evaluate the treatment effect. While several statistical methods have been proposed to evaluate potential surrogate markers, available methods generally do not account for or addre...
Article
Full-text available
The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the...
Article
Full-text available
In “International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study” (J Med Internet Res 2021 Oct 11;23(10):e31400), two errors were noted. In the originally published paper, equal contribution of the last three authors was not noted. This has been corrected to add a note of equal contributio...
Article
Full-text available
Importance As disease-modifying treatment options for multiple sclerosis increase, comparisons of the options based on real-world evidence may guide clinical decision-making. Objective To compare the relapse outcomes between 2 pairs of disease-modifying treatments: dimethyl fumarate vs fingolimod and natalizumab vs rituximab. Design, Setting, and...
Preprint
Authorship Correction: International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study In “International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study” (J Med Internet Res 2021 Oct 11;23(10):e31400. doi: 10.2196/31400),...
Article
Readily available proxies for time of disease onset such as time of the first diagnostic code can lead to substantial risk prediction error if performing analyses based on poor proxies. Due to the lack of detailed documentation and labor intensiveness of manual annotation, it is often only feasible to ascertain for a small subset the current status...
Preprint
Full-text available
A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Both treatment information and response often cannot be accurately captured by readily available EHR features and require labor inten...
Article
Full-text available
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during early pandemic, we used electronic health records (EHR) from 338 participat...
Article
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during early pandemic, we used electronic health records (EHR) from 338 participat...
Article
Objective: Large amounts of health data are becoming available for biomedical research. Synthesizing information across databases may capture more comprehensive pictures of patient health and enable novel research studies. When no gold standard mappings between patient records are available, researchers may probabilistically link records from sepa...