Xihong Lin

Harvard University | Harvard · Department of Biostatistics

About

487
Publications
44,923
Reads
25,356
Citations

Publications (487)
Article
Full-text available
Background Lung cancer and tobacco use pose significant global health challenges, necessitating a comprehensive translational roadmap for improved prevention strategies such as cancer screening and tobacco treatment, which are currently under-utilised. Polygenic risk scores (PRSs) may further motivate health behaviour change in primary care for lun...
Preprint
Full-text available
Linear mixed-effects models (LMMs) and ridge regression are commonly applied in genetic association studies to control for population structure and sample-relatedness. To control for sample-relatedness, the existing methods use empirical genetic relatedness matrices (GRM) either explicitly or conceptually. This works well with mostly homogeneous po...
Article
Full-text available
The role of rare non-coding variation in complex human phenotypes is still largely unknown. To elucidate the impact of rare variants in regulatory elements, we performed a whole-genome sequencing association analysis for height using 333,100 individuals from three datasets: UK Biobank (N = 200,003), TOPMed (N = 87,652) and All of Us (N = 45,445). W...
Article
Motivation Functional Annotation of genomic Variants Online Resources (FAVOR) offers multi-faceted, whole genome variant functional annotations, which are essential for Whole Genome and Exome Sequencing (WGS/WES) analysis and the functional prioritization of disease-associated variants. A versatile chatbot designed to facilitate informative interpre...
Article
Inflammation biomarkers can provide valuable insight into the role of inflammatory processes in many diseases and conditions. Sequencing-based analyses of such biomarkers can also serve as an exemplar of the genetic architecture of quantitative traits. To evaluate the biological insight that can be provided by a multi-ancestry, whole-genome based...
Article
Full-text available
Background Although the polygenic risk score (PRS) has emerged as a promising tool for predicting cancer risk from genome-wide association studies (GWAS), the individual-level accuracy of lung cancer PRS and the extent of its impact on subsequent clinical applications remain largely unexplored. Methods Lung cancer PRSs and confidence/credible i...
Article
Full-text available
The COVID-19 pandemic influenced emotional experiences globally. We examined daily positive and negative affect between May/June 2020 and February 2021 (N = 151,049; 3,509,982 observations) using a convenience sample from a national mobile application-based survey that asked for daily affect reports. Four questions were examined: (1) How did people...
Article
BACKGROUND Individuals with type 2 diabetes (T2D) have an increased risk of coronary artery disease (CAD), but questions remain about the underlying pathology. Identifying which CAD loci are modified by T2D in the development of subclinical atherosclerosis (coronary artery calcification [CAC], carotid intima-media thickness, or carotid plaque) may...
Preprint
Full-text available
The role of rare non-coding variation in complex human phenotypes is still largely unknown. To elucidate the impact of rare variants in regulatory elements, we performed a whole-genome sequencing association analysis for height using 333,100 individuals from three datasets: UK Biobank (N=200,003), TOPMed (N=87,652) and All of Us (N=45,445). We perf...
Preprint
Full-text available
Large-scale whole-genome sequencing (WGS) studies have improved our understanding of the contributions of coding and noncoding rare variants to complex human traits. Leveraging association effect sizes across multiple traits in WGS rare variant association analysis can improve statistical power over single-trait analysis, and also detect pleiotropi...
Preprint
Full-text available
Inflammation biomarkers can provide valuable insight into the role of inflammatory processes in many diseases and conditions. Sequencing-based analyses of such biomarkers can also serve as an exemplar of the genetic architecture of quantitative traits. To evaluate the biological insight that can be provided by a multi-ancestry, whole-genome based...
Preprint
Full-text available
Long non-coding RNAs (lncRNAs) are known to perform important regulatory functions. Large-scale whole genome sequencing (WGS) studies and new statistical methods for variant set tests now provide an opportunity to assess the associations between rare variants in lncRNA genes and complex traits across the genome. In this study, we used high-coverage...
Article
Full-text available
SARS-CoV-2 vaccines are useful tools to combat the Coronavirus Disease 2019 (COVID-19) pandemic, but vaccine reluctance threatens these vaccines’ effectiveness. To address COVID-19 vaccine reluctance and ensure equitable distribution, understanding the extent of and factors associated with vaccine acceptance and uptake is critical. We report the re...
Article
Full-text available
Large biobank-scale whole genome sequencing (WGS) studies are rapidly identifying a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- or tr...
Article
Full-text available
Blood lipids are heritable modifiable causal factors for coronary artery disease. Despite well-described monogenic and polygenic bases of dyslipidemia, limitations remain in discovery of lipid-associated alleles using whole genome sequencing (WGS), partly due to limited sample sizes, ancestral diversity, and interpretation of clinical significance....
Article
Full-text available
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We pro...
Preprint
Large-scale whole genome sequencing (WGS) studies and biobanks are rapidly generating a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- o...
Article
Full-text available
To identify new susceptibility loci to lung cancer among diverse populations, we performed cross-ancestry genome-wide association studies in European, East Asian and African populations and discovered five loci that have not been previously reported. We replicated 26 signals and identified 10 new lead associations from previously reported loci. Rar...
Article
Full-text available
The genetic determinants of fasting glucose (FG) and fasting insulin (FI) have been studied mostly through genome arrays, resulting in over 100 associated variants. We extended this work with high-coverage whole genome sequencing analyses from fifteen cohorts in NHLBI’s Trans-Omics for Precision Medicine (TOPMed) program. Over 23,000 non-diabetic i...
Article
Introduction: Obstructive sleep apnea (OSA) is a common disorder associated with increased risk for cardiovascular disease, diabetes, and premature mortality. There is strong clinical and epidemiologic evidence supporting the importance of genetic factors influencing OSA, but limited data implicating specific genes. Methods: Leveraging high dep...
Article
We developed the STAAR WDL workflow to facilitate the analysis of rare variants in whole genome sequencing association studies. The open-access STAAR workflow written in the workflow description language (WDL) allows a user to perform rare variant testing for both gene-centric and genetic region approaches, enabling genome-wide, candidate, and cond...
Article
Full-text available
Large scale screening is a critical tool in the life sciences, but is often limited by reagents, samples, or cost. An important recent example is the challenge of achieving widespread COVID-19 testing in the face of substantial resource constraints. To tackle this challenge, screening methods must efficiently use testing resources. However, given t...
Article
Amidst the continuing spread of COVID-19, real-time data analysis and visualization remain critical for the general public to track the pandemic's impact and to inform policy making by officials. Multiple metrics permit the evaluation of the spread, infection, and mortality of infectious diseases. For example, numbers of new cases and deaths provide ea...
Article
Full-text available
Analyses of data from genome-wide association studies on unrelated individuals have shown that, for human traits and diseases, approximately one-third to two-thirds of heritability is captured by common SNPs. However, it is not known whether the remaining heritability is due to the imperfect tagging of causal variants by common SNPs, in particular...
Article
Full-text available
Allele frequency estimates in admixed populations, such as Hispanics/Latinos, rely on the sample’s specific admixture composition and thus may differ between two seemingly similar populations. However, ancestry-specific allele frequencies, i.e., pertaining to the ancestral populations of an admixed group, may be particularly useful for prioritizing...
Article
Attempts to identify and prioritize functional DNA elements in coding and non-coding regions, particularly through use of in silico functional annotation data, continue to increase in popularity. However, specific functional roles can vary widely from one variant to another, making it challenging to summarize different aspects of variant function w...
Article
Sample sizes vary substantially across tissues in the Genotype‐Tissue Expression (GTEx) project, where considerably fewer samples are available from certain inaccessible tissues, such as the substantia nigra (SSN), than from accessible tissues, such as blood. This severely limits power for identifying tissue‐specific expression quantitative trait l...
Article
Full-text available
Obstructive sleep apnea (OSA) is a common disorder associated with increased risk of cardiovascular disease and mortality. Iron and heme metabolism, implicated in ventilatory control and OSA comorbidities, was associated with OSA phenotypes in recent admixture mapping and gene enrichment analyses. However, its causal contribution was unclear. In th...
Article
Full-text available
Background The difference between an individual's chronological and DNA methylation predicted age (DNAmAge), termed DNAmAge acceleration (DNAmAA), can capture life-long environmental exposures and age-related physiological changes reflected in methylation status. Several studies have linked DNAmAA to morbidity and mortality, yet its relationship wi...
Article
Full-text available
Background Sleep-disordered breathing is a common disorder associated with significant morbidity. The genetic architecture of sleep-disordered breathing remains poorly understood. Through the NHLBI Trans-Omics for Precision Medicine (TOPMed) program, we performed the first whole-genome sequence analysis of sleep-disordered breathing. Methods The s...
Preprint
Full-text available
Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare variants' (RVs) associations with complex human traits. Variant set analysis is a powerful approach to study RV association, and a key component of it is constructing RV sets for analysis. However, existing methods have limited ability to define analysis units in th...
Preprint
Full-text available
Obstructive sleep apnea (OSA) is a common disorder associated with increased risk of cardiovascular disease and mortality. Iron and heme metabolism, implicated in ventilatory control and OSA comorbidities, was associated with OSA phenotypes in recent admixture mapping and gene enrichment analyses. However, its causal contribution was unclear. In th...
Preprint
Full-text available
Plasma lipids are heritable modifiable causal factors for coronary artery disease, the leading cause of death globally. Despite the well-described monogenic and polygenic bases of dyslipidemia, limitations remain in discovery of lipid-associated alleles using whole genome sequencing, partly due to limited sample sizes, ancestral diversity, and inte...
Article
Modeling infectious disease dynamics has been critical throughout the COVID-19 pandemic. Of particular interest are the incidence, prevalence, and effective reproductive number (Rt). Estimating these quantities is challenging due to under-ascertainment, unreliable reporting, and time lags between infection, onset, and testing. We propose a Multile...
Preprint
Full-text available
We developed the STAAR WDL workflow to facilitate the analysis of rare variants in whole genome sequencing association studies. The open-access STAAR workflow written in the workflow description language (WDL) allows a user to perform rare variant testing for both gene-centric and genetic region approaches, enabling genome-wide, candidate, and cond...
Article
Full-text available
Background/objectives Neck circumference, an index of upper airway fat, has been suggested to be an important measure of body-fat distribution with unique associations with health outcomes such as obstructive sleep apnea and metabolic disease. This study aims to study the genetic bases of neck circumference. Methods We conducted a multi-ethnic gen...
Article
There is little evidence on the short-term impact of fine particulate matter (PM2.5) on renal health, and the potential interactions and various influences of PM2.5 components on renal health have not been examined. We investigated whether short-term (≤28 days) ambient PM2.5 and 15 PM2.5 components were associated with serum uric acid (SUA), blood...
Article
Full-text available
Background Identifying county-level characteristics associated with high coronavirus disease 2019 (COVID-19) burden can help allow for data-driven, equitable allocation of public health intervention resources and reduce burdens on health care systems. Methods Synthesizing data from various government and nonprofit institutions for all 3142 United States (...
Article
Purpose Acute respiratory distress syndrome (ARDS) is accompanied by a dysfunctional immune-inflammatory response following lung injury, including during coronavirus disease 2019 (COVID-19). Limited causal biomarkers exist for ARDS development. We sought to identify novel genetic susceptibility targets for ARDS to focus further investigation on thei...
Article
Full-text available
Air pollution, especially fine particulate matter (PM2.5), may impair cognitive performance [1–3], but its short-term impact is poorly understood. We investigated the short-term association of PM2.5 with the cognitive performances of 954 white males measured as global cognitive function and Mini-Mental State Examination (MMSE) scores and further explo...
Article
In genome-wide epigenetic studies, it is of great scientific interest to assess whether the effect of an exposure on a clinical outcome is mediated through DNA methylations. However, statistical inference for causal mediation effects is challenged by the fact that one needs to test a large number of composite null hypotheses across the whole epigen...
Preprint
Full-text available
SARS-CoV-2 vaccines are powerful tools to combat the COVID-19 pandemic, but vaccine hesitancy threatens these vaccines' effectiveness. To address COVID-19 vaccine hesitancy and ensure equitable distribution, understanding the extent of and factors associated with vaccine hesitancy is critical. We report the results of a large nationwide study condu...
Article
Full-text available
The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenot...
Article
Full-text available
Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise co...
Preprint
Full-text available
Attempts to identify and prioritize functional DNA elements in coding and noncoding regions, particularly through use of in silico functional annotation data, continue to increase in popularity. However, specific functional roles may vary widely from one variant to another, making it challenging to summarize different aspects of variant function. H...
Preprint
The genetic determinants of fasting glucose (FG) and fasting insulin (FI) have been studied mostly through genome and exome arrays, resulting in over 100 associated variants. We extended this work with a high-coverage whole genome sequencing (WGS) analysis from fifteen cohorts in the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. More t...
Preprint
Full-text available
Molecular QTLs (xQTLs) are widely studied to identify functional variation and possible mechanisms underlying genetic associations with diseases. Larger xQTL sample sizes are critical to help identify causal variants, improve predictive models, and increase power to detect rare associations. This will require scalable and accurate methods for analy...
Preprint
Full-text available
Missing data are prevalent in the Genotype-Tissue Expression (GTEx) project, where measurements from certain inaccessible tissues, such as the substantia nigra (SSN), are available at much smaller sample sizes than those from accessible tissues, such as blood. This severely limits power for identifying tissue-specific expression quantitative trait...
Article
Testing a global hypothesis for a set of variables is a fundamental problem in statistics with a wide range of applications. A few well-known classical tests include Hotelling’s T² test, the F-test, and the empirical Bayes based score test. These classical tests, however, are not robust to the signal strength and could have a substantial loss...
Preprint
Full-text available
Racial and ethnic disparities in COVID-19 outcomes reflect the unequal burden experienced by vulnerable communities in the United States (US). Proposed explanations include socioeconomic factors that influence how people live, work, and play, and pre-existing comorbidities. It is important to assess the extent to which observed US COVID-19 racial a...
Article
Full-text available
It is believed that genetic factors play a large role in the development of many cognitive and neurological processes; however, epidemiological evidence for the genetic basis of childhood neurodevelopment is very limited. Identification of the genetic polymorphisms associated with early-stage neurodevelopment will help elucidate biological mechanis...
Preprint
Full-text available
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We dev...
Preprint
Lung cancer is the leading cause of cancer death worldwide. Genome-wide association studies have revealed genetic risk factors, highlighting the role of smoking, family history, telomere regulation, and DNA damage-repair in lung cancer etiology. Many studies have focused on a single ethnic group to avoid confounding from variability in allele frequ...
Preprint
In genome-wide epigenetic studies, it is of great scientific interest to assess whether the effect of an exposure on a clinical outcome is mediated through DNA methylations. However, statistical inference for causal mediation effects is challenged by the fact that one needs to test a large number of composite null hypotheses across the whole epigen...
Article
Clinical trial results have recently demonstrated that inhibiting inflammation by targeting the interleukin-1β pathway can offer a significant reduction in lung cancer incidence and mortality, highlighting a pressing and unmet need to understand the benefits of inflammation-focused lung cancer therapies at the genetic level. While numerous genome-w...
Article
We consider in this paper detection of signal regions associated with disease outcomes in whole genome association studies. Gene- or region-based methods have become increasingly popular in whole genome association analysis as a complementary approach to traditional individual variant analysis. However, these methods test for the association betwee...
Article
Full-text available
Despite the widespread implementation of public health measures, coronavirus disease 2019 (COVID-19) continues to spread in the United States. To facilitate an agile response to the pandemic, we developed How We Feel, a web and mobile application that collects longitudinal self-reported survey responses on health, behaviour and demographics. Here,...
Conference Paper
Genome-wide association studies (GWAS) have revealed susceptible genetic risk factors for lung cancer, highlighting the role of smoking, family history, and DNA damage repair genes in disease etiology. Many studies have focused on European populations; however, lung cancer is a leading cause of cancer incidence and mortality around the world. Previ...
Article
Full-text available
Importance Chronic obstructive pulmonary disease (COPD) is a critical public health burden. The neutrophil to lymphocyte ratio (NLR), an inflammation biomarker, has been associated with COPD morbidity and mortality; however, its associations with lung function decline and COPD development are poorly understood. Objective To explore the association...
Article
Full-text available
Background: DNA methylation at the fifth position of cytosine (5mC) is a common epigenetic alteration affecting a range of cellular processes. In recent years, 5-hydroxymethylcytosine (5hmC), an oxidized form of 5mC, has attracted broad interest as a potential biomarker for lung cancer diagnosis and survival. Methods: We analyzed the epigenome-wide...
Article
Whole-genome sequencing (WGS) can improve assessment of low-frequency and rare variants, particularly in non-European populations that have been underrepresented in existing genomic studies. The genetic determinants of C-reactive protein (CRP), a biomarker of chronic inflammation, have been extensively studied, with existing genome-wide association...
Article
Average arterial oxyhemoglobin saturation during sleep (AvSpO2S) is a clinically relevant measure of physiological stress associated with sleep-disordered breathing, and this measure predicts incident cardiovascular disease and mortality. Using high-depth whole-genome sequencing data from the National Heart, Lung, and Blood Institute (NHLBI) Trans-...
Article
Full-text available
Cross-ratio is an important local measure of the strength of dependence among correlated failure times. If a covariate is available, it may be of scientific interest to understand how the cross-ratio varies with the covariate as well as time components. Motivated by the Tremin study, where the dependence between age at a marker event reflecting ear...
Preprint
Sleep-disordered breathing (SDB) is a common disorder associated with significant morbidity. Through the NHLBI Trans-Omics for Precision Medicine (TOPMed) program we report the first whole-genome sequence analysis of SDB. We identified 4 rare gene-based associations with SDB traits in 7,988 individuals of diverse ancestry and 4 replicated common va...
Article
Full-text available
Study Objectives Daytime sleepiness is a consequence of inadequate sleep, sleep–wake control disorder, or other medical conditions. Population variability in prevalence of daytime sleepiness is likely due to genetic and biological factors as well as social and environmental influences. DNA methylation (DNAm) potentially influences multiple health o...
Article
Full-text available
Background: Exploring the associations of air pollution and weather variables with blood leukocyte distribution is critical to understand the impacts of environmental exposures on the human immune system. Objectives: As previous analyses have been mainly based on data from cell counters, which might not be feasible in epidemiologic studies inclu...