Nature Genetics

Published by Springer Nature
Online ISSN: 1546-1718
Print ISSN: 1061-4036
Recent publications
Telomere length in humans is associated with lifespan and severe diseases, yet the genetic determinants of telomere length remain incompletely defined. Here we performed genome-wide CRISPR–Cas9 functional telomere length screening and identified thymidine (dT) nucleotide metabolism as a limiting factor in human telomere maintenance. Targeted genetic disruption using CRISPR–Cas9 revealed multiple telomere length control points across the thymidine nucleotide metabolism pathway: decreasing dT nucleotide salvage via deletion of the gene encoding nuclear thymidine kinase (TK1) or de novo production by knockout of the thymidylate synthase gene (TYMS) decreased telomere length, whereas inactivation of the deoxynucleoside triphosphohydrolase-encoding gene SAMHD1 lengthened telomeres. Remarkably, supplementation with dT alone drove robust telomere elongation by telomerase in cells, and thymidine triphosphate stimulated telomerase activity in a substrate-independent manner in vitro. In induced pluripotent stem cells derived from patients with genetic telomere biology disorders, dT supplementation or inhibition of SAMHD1 promoted telomere restoration. Our results demonstrate a critical role of thymidine metabolism in controlling human telomerase and telomere length, which may be therapeutically actionable in patients with fatal degenerative diseases.
After severe heart injury, fibroblasts are activated and proliferate excessively to form scarring, leading to decreased cardiac function and eventually heart failure. It is unknown, however, whether cardiac fibroblasts are heterogeneous with respect to their degree of activation, proliferation and function during cardiac fibrosis. Here, using dual recombinase-mediated genetic lineage tracing, we find that endocardium-derived fibroblasts preferentially proliferate and expand in response to pressure overload. Fibroblast-specific proliferation tracing revealed highly regional expansion of activated fibroblasts after injury, whose pattern mirrors that of endocardium-derived fibroblast distribution in the heart. Specific ablation of endocardium-derived fibroblasts alleviates cardiac fibrosis and reduces the decline of heart function after pressure overload injury. Mechanistically, Wnt signaling promotes activation and expansion of endocardium-derived fibroblasts during cardiac remodeling. Our study identifies endocardium-derived fibroblasts as a key fibroblast subpopulation accounting for severe cardiac fibrosis after pressure overload injury and as a potential therapeutic target against cardiac fibrosis.
Analysis flowchart
For common variant analyses across 65 quantitative traits, we performed GWAS among UK Biobank samples who were unrelated to individuals with WES data (‘out-sample’), GWAS among UK Biobank samples with WES data (‘in-sample’) and GWAS among all UK Biobank samples (‘total’). From each GWAS, we constructed PGSs using clumping-and-thresholding methods and using PRS-CS¹¹. We then performed exome-wide testing of rare variants within the WES samples, using models without PGSs and adjusting for various PGSs. LOF and missense variants were used to assess rare variant yields, while synonymous variants were used to assess inflation and false-positive rates. In the flowchart, blue boxes describe steps revolving around common variant analyses and PGS construction, while the red boxes highlight steps involving rare variant analyses. λGC, inflation factor computed as observed χ² at the median over the expected under the null hypothesis.
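The definition of λGC in this legend can be written out as a short calculation. The sketch below is only an illustration, not the authors' pipeline: it assumes two-sided P values from 1-degree-of-freedom tests, and the function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a 1-d.f. chi-square distribution under the null (about 0.455).
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def chi2_from_p(p):
    """Convert a two-sided P value (0 < p <= 1) to a 1-d.f. chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def lambda_gc(p_values):
    """Median observed chi-square divided by the median expected under the null."""
    observed = [chi2_from_p(p) for p in p_values]
    return median(observed) / EXPECTED_MEDIAN_CHI2
```

Values close to 1 indicate little inflation; values well above 1 suggest that test statistics are systematically larger than expected under the null.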
PGS adjustment improves discovery yield in analysis of rare deleterious variants
a, Bar charts for the improvement in deleterious RVAT yield after PGS adjustment at different α levels, expressed in percentage relative to the no-PGS (reference) model. b, Violin plots for the difference (δ) in significance of tests from deleterious RVATs, comparing models with PGS versus models without PGS. Here, the δ in P values (on the −log10 scale) is displayed for tests reaching P < 2.6 × 10⁻⁶ (Methods). The P values and distributions are derived from two-sided paired Wilcoxon signed-rank tests (where n gene–trait pairs equals 263, 270, 258, 260, 265 and 278 from left to right), while the d̅ values plotted above the violins are derived from two-sided paired t-tests (after removing outliers). The left plot shows all results, while the right plot is capped at y = 10 for clarity. Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers. c, Quantile–quantile plots for PGS-adjusted RVAT of the phenotype height. The left plot shows expected versus observed P values for the model with no PGS adjustment, while the second and third plots show results for PGSlead-SNP (out-sample) and PGSCS (out-sample), respectively. Exome-wide significant genes are annotated with gene names; genes highlighted in bold were only identified after PGS adjustment. d̅, estimated paired group difference; δ, difference; α, significance cutoff. Ref, reference model (that is, the baseline model without inclusion of PGS).
PGS adjustment does not increase false-positive rates or genomic inflation in the analysis of rare synonymous variants
a, Boxplots for per-trait association rate from synonymous RVATs at different α levels across the 65 traits. A median of 18,060 genes were analyzed per trait. b, Violin plots for genomic inflation factors for exome-wide RVATs of synonymous variants across the 65 traits. c, Violin plots for difference (δ) in significance of tests from synonymous variant RVATs, comparing models with PGSs versus models without PGSs. Here, the δ in P values (on the −log10 scale) is displayed for tests reaching P < 0.05 (Methods), with the contributing n gene–trait pairs equaling 75,044, 77,524, 75,838, 89,792, 77,784 and 85,187 (from left to right). Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5 × interquartile range; points, outliers.
With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.
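One way to see why covariate adjustment can raise power is a standard linear-regression approximation, sketched below. It is not taken from the paper. It assumes the rare variant burden is independent of the PGS and has a small effect, so that the PGS only removes residual trait variance; the function name and the R² values are hypothetical.

```python
def expected_chi2_gain(r2_pgs):
    """Approximate factor by which the expected association chi-square grows
    when a covariate explaining a fraction r2_pgs of trait variance is added.

    Residual variance shrinks from 1 to (1 - r2_pgs), so the noncentrality
    of the test scales by 1 / (1 - r2_pgs) under the stated assumptions.
    """
    if not 0.0 <= r2_pgs < 1.0:
        raise ValueError("r2_pgs must lie in [0, 1)")
    return 1.0 / (1.0 - r2_pgs)


# Hypothetical values: a PGS explaining 20% of trait variance.
gain = expected_chi2_gain(0.20)
```

Under these assumptions, a score explaining a larger share of trait variance yields a larger gain, which would be consistent with benefits varying across traits and PGS construction methods.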
Individuals of admixed ancestries (for example, African Americans) inherit a mosaic of ancestry segments (local ancestry) originating from multiple continental ancestral populations. This offers the unique opportunity of investigating the similarity of genetic effects on traits across ancestries within the same population. Here we introduce an approach to estimate correlation of causal genetic effects (r_admix) across local ancestries and analyze 38 complex traits in African-European admixed individuals (N = 53,001) to observe very high correlations (meta-analysis r_admix = 0.95, 95% credible interval 0.93–0.97), much higher than correlation of causal effects across continental ancestries. We replicate our results using regression-based methods from marginal genome-wide association study summary statistics. We also report realistic scenarios where regression-based methods yield inflated heterogeneity-by-ancestry due to ancestry-specific tagging of causal effects, and/or polygenicity. Our results motivate genetic analyses that assume minimal heterogeneity in causal effects by ancestry, with implications for the inclusion of ancestry-diverse individuals in studies.
Malignant pleural mesothelioma (MPM) is an aggressive cancer with rising incidence and challenging clinical management. Through a large series of whole-genome sequencing data, integrated with transcriptomic and epigenomic data using multiomics factor analysis, we demonstrate that the current World Health Organization classification only accounts for up to 10% of interpatient molecular differences. Instead, the MESOMICS project paves the way for a morphomolecular classification of MPM based on four dimensions: ploidy, tumor cell morphology, adaptive immune response and CpG island methylator profile. We show that these four dimensions are complementary, capture major interpatient molecular differences and are delimited by extreme phenotypes that, in the case of the interdependent tumor cell morphology and adaptive immune response, reflect tumor specialization. These findings unearth the interplay between MPM functional biology and its genomic history, and provide insights into the variations observed in the clinical behavior of patients with MPM.
Endometriosis is a common condition associated with debilitating pelvic pain and infertility. A genome-wide association study meta-analysis, including 60,674 cases and 701,926 controls of European and East Asian descent, identified 42 genome-wide significant loci comprising 49 distinct association signals. Effect sizes were largest for stage 3/4 disease, driven by ovarian endometriosis. Identified signals explained up to 5.01% of disease variance and regulated expression or methylation of genes in endometrium and blood, many of which were associated with pain perception/maintenance (SRP14/BMF, GDAP1, MLLT10, BSN and NGF). We observed significant genetic correlations between endometriosis and 11 pain conditions, including migraine, back and multisite chronic pain (MCP), as well as inflammatory conditions, including asthma and osteoarthritis. Multitrait genetic analyses identified substantial sharing of variants associated with endometriosis and MCP/migraine. Targeted investigations of genetically regulated mechanisms shared between endometriosis and other pain conditions are needed to aid the development of new treatments and facilitate early symptomatic intervention.
Following severe liver injury, when hepatocyte-mediated regeneration is impaired, biliary epithelial cells (BECs) can transdifferentiate into functional hepatocytes. However, the subset of BECs with such facultative tissue stem cell potential, as well as the mechanisms enabling transdifferentiation, remains elusive. Here we identify a transitional liver progenitor cell (TLPC), which originates from BECs and differentiates into hepatocytes during regeneration from severe liver injury. By applying a dual genetic lineage tracing approach, we specifically labeled TLPCs and found that they are bipotent, as they either differentiate into hepatocytes or re-adopt BEC fate. Mechanistically, Notch and Wnt/β-catenin signaling orchestrate BEC-to-TLPC and TLPC-to-hepatocyte conversions, respectively. Together, our study provides functional and mechanistic insights into transdifferentiation-assisted liver regeneration.
Gastric cancer is among the most common malignancies worldwide, characterized by geographical, epidemiological and histological heterogeneity. Here, we report an extensive, multiancestral landscape of driver events in gastric cancer, involving 1,335 cases. Seventy-seven significantly mutated genes (SMGs) were identified, including ARHGAP5 and TRIM49C. We also identified subtype-specific drivers, including PIGR and SOX9, which were enriched in the diffuse subtype of the disease. SMGs also varied according to Epstein–Barr virus infection status and ancestry. Non-protein-truncating CDH1 mutations, which are characterized by in-frame splicing alterations, targeted localized extracellular domains and uniquely occurred in sporadic diffuse-type cases. In patients with gastric cancer with East Asian ancestry, our data suggested a link between alcohol consumption or metabolism and the development of RHOA mutations. Moreover, mutations with potential roles in immune evasion were identified. Overall, these data provide comprehensive insights into the molecular landscape of gastric cancer across various subtypes and ancestries.
Women with germline BRCA1 mutations (BRCA1+/mut) have increased risk for hereditary breast cancer. Cancer initiation in BRCA1+/mut is associated with premalignant changes in breast epithelium; however, the role of the epithelium-associated stromal niche during BRCA1-driven tumor initiation remains unclear. Here we show that the premalignant stromal niche promotes epithelial proliferation and mutant BRCA1-driven tumorigenesis in trans. Using single-cell RNA sequencing analysis of human preneoplastic BRCA1+/mut and noncarrier breast tissues, we show distinct changes in epithelial homeostasis including increased proliferation and expansion of basal-luminal intermediate progenitor cells. Additionally, BRCA1+/mut stromal cells show increased expression of pro-proliferative paracrine signals. In particular, we identify pre-cancer-associated fibroblasts (pre-CAFs) that produce protumorigenic factors including matrix metalloproteinase 3 (MMP3), which promotes BRCA1-driven tumorigenesis in vivo. Together, our findings demonstrate that precancerous stroma in BRCA1+/mut may elevate breast cancer risk through the promotion of epithelial proliferation and an accumulation of luminal progenitor cells with altered differentiation.
Current risk assessment and treatment strategies for venous thromboembolism (VTE) consider genetic factors only in a limited way. New work shows a more pervasive role of common variants in VTE risk, inspiring genetic predictors that surpass and complement individual clinical risk factors and monogenic thrombophilia testing.
ATPase activity-competent mSWI/SNF complexes are essential for SARS-CoV-2 infection
a, Schematic of the three mSWI/SNF family complexes, cBAF, PBAF and ncBAF, with subunits colored according to scores in the Vero E6 SARS-CoV-2 CRISPR–Cas9 screen. The average proviral z-scores for each complex are shown. Complex-specific scores represent the sum of two complex-specific subunits, one core subunit and one reader subunit. b, Bar graph depicting the percentage of mNeonGreen-expressing Vero E6 cells (control cells or those with polyclonal CRISPR-mediated knockout of shared or unique mSWI/SNF subunits or ACE2) after infection by icSARS-CoV-2-mNG at an MOI of 1. c, Immunoblot performed in SMARCA4 knockout Vero E6 cells reconstituted with empty vector, WT SMARCA4 or SMARCA4 ATPase-dead mutant (K785R). d, SMARCA4 knockout-complemented and WT Vero E6 cells were infected with icSARS-CoV-2-mNG at an MOI of 1. Infected cells were imaged via fluorescence microscopy (left); mNeonGreen-expressing cell frequency was measured 2 d after infection (right). e, SMARCA4 knockout-complemented and WT Vero E6 cells were infected with SARS-CoV-2 at an MOI of 0.1. Virus titer was measured by plaque assay. PFU, plaque-forming unit. f, SMARCA4 knockout-complemented and WT Vero E6 cells were infected with SARS-CoV-2 (left), HKU5-SARS-CoV-1-S (middle) and MERS-CoV (right) at an MOI of 0.2. Cell viability relative to a mock-infected control was measured 3 d after infection with CellTiter-Glo (CTG). RLUs, relative light units. g, SMARCA4 knockout-complemented and WT Vero E6 cells were infected with VSV pseudovirus (VSVpp): VSVpp-VSV-G; VSVpp-SARS-CoV-2-S (left), VSVpp-SARS-CoV-1-S (middle) and VSVpp-MERS-CoV-S (right). Luciferase relative to the VSVpp-VSV-G control was measured 1 d after infection. Data in b and d–g were analyzed by one-way ANOVA with Tukey’s multiple comparison test. The mean ± s.e.m. are shown. **P < 0.01, ***P < 0.001, NS, not significant. n = 3 biological replicates. Data in c are representative of one of three independent experiments.
Top-ranked sites of ATPase-active BAF complex occupancy and DNA accessibility include ACE2
a, Immunoblot performed on nuclear extract (input) and anti-V5 immunoprecipitates from SMARCA4 knockout Vero E6 cells expressing either empty vector, WT V5-SMARCA4 or V5-SMARCA4 K785R. b, Heatmap and clustering analysis performed on the merged BAF complex (SMARCA4, SMARCC1 and ARID1A), H3K27ac occupancies (n = 1) and ATAC–seq (n = 2 biological replicates) peaks performed in Vero E6 cells rescued with the conditions shown in a, grouped into three clusters. c, Metaplots of SMARCA4 occupancy (C&T) and ATAC–seq peaks at WT SMARCA4-dependent sites (cluster 3). d, Distance to TSS distribution of C&T and ATAC–seq merged peaks for all conditions across clusters 1–3 from b. e, Cumulative distribution function plot reflecting genes nearest to SMARCC1 gained sites in SMARCA4 knockout cells rescued with WT SMARCA4 versus empty vector in cluster 3 from b; the top one-tenth fraction reflects genes associated with the top changed sites; sites highlighted in red indicate genes that scored as proviral determinants in the CRISPR screen. f, ATAC–seq and C&T tracks at the ACE2 locus in SMARCA4 knockout Vero E6 cells rescued with empty vector, WT SMARCA4 or SMARCA4 K785R. g, Reads per kilobase of transcript per million reads mapped (RPKM) levels for ACE2 in Vero E6 cells across the conditions shown (n = 2 biological replicates). The P values shown were calculated in edgeR using a quasi-likelihood negative binomial test. h, Volcano plots reflecting gene expression changes (RNA-seq) (n = 2 biological replicates) between the conditions shown. i, Overexpression of hACE2 in SMARCA4 knockout Vero E6 cells. j, VSVpp-based pseudovirus entry assay and plaque assays in WT Vero E6 cells and SMARCA4 knockout cells rescued with human ACE2. Data in a and i are representative of one of three independent experiments. Data in j were analyzed using a one-way ANOVA with Tukey’s multiple comparisons test. The mean ± s.e.m. are shown. **P < 0.01, ***P < 0.001, n = 3 biological replicates. 
The dashed line indicates the limit of detection.
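Panel g reports expression as RPKM, which the legend spells out as reads per kilobase of transcript per million reads mapped. A minimal sketch of that normalization follows; the function name and the example counts are hypothetical and are not values from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (length in kb) / (mapped reads in millions).
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000-bp transcript, 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)
```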
HNF1A–BAF complex binding cooperates with high motif density at the ACE2 locus to regulate ACE2 expression
a, Transcription factor motif enrichment analysis at BAF-gained sites (cluster 3). b, Transcription factor motif enrichment at BAF-occupied gained sites after rescue of Vero E6 SMARCA4 knockout cells with WT SMARCA4 plotted against log2 fold change of the transcript levels of the transcription factors (empty vector versus WT SMARCA4 conditions). HNF1A and HNF1B are circled in red. c, Immunoblot of HNF1A/B across WT Vero E6 and SMARCA4 knockout cells rescued with empty vector, WT SMARCA4 or SMARCA4 K785R. d, ACE2 expression in HNF1A and HNF1B knockout Vero E6 cells measured by RT–qPCR (left) and immunoblot (right). e, WT and HNF1A/B knockout Vero E6 cells were infected with icSARS-CoV-2-mNG at an MOI of 1. The frequency of infected cells was measured using mNeonGreen expression 2 d after infection. f, Vero E6 cells were infected with SARS-CoV-2 at an MOI of 0.1. Virus production was measured by plaque assays. g, HNF1A and HNF1B knockout Vero E6 cells were infected with SARS-CoV-2 pseudovirus. Luciferase relative to VSVpp-VSV-G control was measured 1 d after infection. h, Coimmunoprecipitation of endogenous SMARCA4 and HNF1A in nuclear extracts isolated from Vero E6 cells. i, HNF1 dimerization and association studies in WT and SMARCA2/4 double-knockout HEK 293T cells. j, Heatmap of SMARCA4 and SMARCC1 merged C&T (n = 1) and ATAC–seq (n = 2) peaks in control and HNF1A knockout Vero E6 cells, divided into three clusters. k, Bar graph depicting the fraction of sites with an HNF1 motif near cluster A (lost sites), cluster B (gained sites) and cluster C (unchanged sites) from j. l, Normalized gene rank of genes closest to cluster A sites plotted against the number of HNF1 motifs per gene at cluster A sites; selected genes within the top 10% of sites regulated by HNF1A are shown. m, C&T and ATAC–seq tracks at the ACE2 locus in control and HNF1A knockout Vero E6 cells. The data in c, d, h and i are representative of one of three independent experiments. 
The data in d–g were analyzed using a one-way ANOVA with Tukey’s multiple comparisons test. The mean ± s.e.m. are shown. ***P < 0.001, n = 3 biological replicates.
Small-molecule inhibition of the SMARCA4 ATPase of mSWI/SNF complexes downregulates ACE2 expression and blocks SARS-CoV-2 infection
a, Top: chemical structures of the mSWI/SNF SMARCA4/2 ATPase inhibitors, Comp12 and Comp14. Bottom: three-dimensional structure highlighting Comp12 docked in the ATPase site of the SMARCA2/4 ATPase subunit (Protein Data Bank ID: 6EG2). b, Vero E6 cells were treated with 1.25 μM Comp12 for the indicated times. ACE2 mRNA and protein levels were measured using RT–qPCR and immunoblot, respectively. c, Vero E6 cells were pretreated with Comp12 inhibitors for 2 d and then infected with SARS-CoV-2 at an MOI of 0.2. Cell viability was measured 3 d after infection. d, Vero E6 cells were pretreated with 1.25 and 2.5 μM Comp12 for 2 d and then infected with SARS-CoV-2 pseudovirus. The ACE2 antibody was preincubated with cells for 1 h before infection as a positive control. Luciferase relative to VSVpp-VSV-G control was measured 1 d after infection. e, SARS-CoV-2 production in Comp12 pretreated Vero E6 cells with the indicated concentrations of Comp12 for 2 d. f, ACE2 transcript and protein levels in Comp12-treated Vero E6, Huh7.5 and Calu-3 cells for 2 d at 1.25 and 2.5 μM. g, ACE2 transcript and protein levels in inhibitor and degrader-treated Vero and Huh7.5 cells for 2 d at 1.25 and 2.5 μM. h, Vero E6 and Huh7.5 cells were pretreated with the indicated inhibitors and/or degraders at 2.5 μM for 2 d and then infected with icSARS-CoV-2-mNG. The frequency of infected cells was measured by mNeonGreen expression. i, Vero E6 cells were pretreated with 2.5 μΜ Comp12 for 2 d and then infected with the indicated SARS-CoV-2 variants at an MOI of 0.2. Cell viability was measured 3 d after infection. j, Vero E6 or Calu-3 cells were pretreated with 2.5 μΜ of Comp12 for 2 d and then infected with the SARS-CoV-2 WA1 and E802D viruses. Virus production was measured by plaque assays 1 d after infection. Data in b–j were analyzed using a one-way ANOVA with Tukey’s multiple comparisons test. The mean ± s.e.m. are shown. *P < 0.05, **P < 0.01, ***P < 0.001, n = 3 biological replicates.
SMARCA4 is required for ACE2 expression and sarbecovirus susceptibility in primary human cells
a, Schematic of SMARCA4/2 ATPase inhibitor treatment and virus infection in primary HBECs. b–e, HBECs were pretreated with 2.5 and 5 μΜ Comp12 for 4 d and then infected with SARS-CoV-2 (b), HKU5-SARS1-S (c), MERS-CoV (d) and IAV (e). Virus replication was measured by plaque assay and/or RT–qPCR. f, HBECs were pretreated with 2.5 μΜ Comp12 for 4 d and then infected with SARS-CoV-2 WA1 or E802D virus at an MOI of 0.5. The virus titer was measured using plaque assay. Remdesivir was added right after infection. g, ACE2 expression was measured using RT–qPCR; SARS-CoV-2 replication was measured using a plaque assay after virus infection in HBECs pretreated with 2.5 μΜ of the indicated compounds for 4 d. h, hPSC-derived pneumocyte-like cells were pretreated with Comp12 for 2 d and then infected with SARS-CoV-2 at an MOI of 0.1. i,j, Infectivity was measured by the accumulation of viral nucleoprotein in the nucleus of the cells 2 d after infection. ACE2 expression and SARS-CoV-2 infection were measured in HIEs (i) and MIEs (j) pretreated with 2.5 μΜ of the indicated compounds for 3 d except remdesivir, which was added right after virus infection. k, Model depicting the mechanism of mSWI/SNF complex-mediated regulation of ACE2 expression and SARS-CoV-2 entry. In b, c, d, f, g and i, the dashed line indicates the limit of detection. Data in b–j were analyzed using a one-way ANOVA with Tukey’s multiple comparisons test. The mean ± s.e.m. are shown. **P < 0.01, ***P < 0.001, ****P < 0.0001, n = 3 biological replicates.
Identification of host determinants of coronavirus infection informs mechanisms of viral pathogenesis and can provide new drug targets. Here we demonstrate that mammalian SWItch/Sucrose Non-Fermentable (mSWI/SNF) chromatin remodeling complexes, specifically canonical BRG1/BRM-associated factor (cBAF) complexes, promote severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and represent host-directed therapeutic targets. The catalytic activity of SMARCA4 is required for mSWI/SNF-driven chromatin accessibility at the ACE2 locus, ACE2 expression and virus susceptibility. The transcription factors HNF1A/B interact with and recruit mSWI/SNF complexes to ACE2 enhancers, which contain high HNF1A motif density. Notably, small-molecule mSWI/SNF ATPase inhibitors or degraders abrogate angiotensin-converting enzyme 2 (ACE2) expression and confer resistance to SARS-CoV-2 variants and a remdesivir-resistant virus in three cell lines and three primary human cell types, including airway epithelial cells, by up to 5 logs. These data highlight the role of mSWI/SNF complex activities in conferring SARS-CoV-2 susceptibility and identify a potential class of broad-acting antivirals to combat emerging coronaviruses and drug-resistant variants.
Overview of immune selection calculation using SOPRANO
a, Estimates can be performed at the cohort or at the single individual level. b, In each case, it is possible to estimate immune selection on a single HLA allele (that is, HLA-A0201), a generic combination of HLA alleles (proto-HLA) or the germline-specific HLA-immunopeptidome. c, Immune selection determines the evolutionary trajectories of clonal growth; fully immune-edited tumors with strong immune selection signals can transit towards fully immune-escaped tumors where signals are absent. d, Toy model of mixing immune-edited and immune-escaped tumors. It is possible to estimate immune dN/dS by aggregating all mutations and generate a single cohort estimate or to estimate a distribution of values per patient. In both cases, we hypothesize that mixing escaped with edited tumors leads to loss of immune selection signals reflected by immune dN/dS values closer to one (depicted as red dashed lines in the figure).
Immune dN/dS landscape across multiple tumor types
a,b, Immune dN/dS (ON-dN/dS/OFF-dN/dS ratio) in multiple tumor types using either a curated HLA-A0201 based immunopeptidome (a) or a proto-HLA consisting of the most common HLA haplotypes in the population (b). Error bars indicate the 95% confidence interval around the point estimate obtained with SOPRANO, and the number of samples for each tumor type is described in Supplementary Table 1. c, Proto-HLA immune dN/dS from SOPRANO versus proto-HLA normalized HBMR reported previously²⁸. d,e, Immune dN/dS on HLA-A0201 (d) and HBMR on the proto-HLA versus median CD8 T cell infiltration, including microsatellite-unstable (MSI)-rich tumors (e). f,g, Immune dN/dS on HLA-A0201 (f) and HBMR on the proto-HLA versus median CD8 T cell infiltration, excluding MSI-rich tumors (g). We assumed that MSI-rich tumors are also escape-rich tumors. P values and correlation coefficients were calculated using Pearson’s correlation (two-sided t-test). Gray shaded areas represent error bands indicating the 95% confidence interval. Red dashed lines indicate neutral dN/dS at one. h, log2 dN/dS versus −log10(P value) of selected escape genes (Supplementary Table 3) using missense and truncating mutations. i, Linear mixed model using dN/dS as the dependent variable and all immune metrics as independent variables. x axis shows the B coefficients. Model selection using the AIC revealed that immune dN/dS is strongly associated with the median abundance of CD8 T cells. No immune subpopulation was significantly associated with OFF-dN/dS. For ON-dN/dS, adjusted R² = 0.927, F-statistic = 18.8 on 10 and 4 degrees of freedom, P = 0.00617. For OFF-dN/dS, adjusted R² = 0.898, F-statistic = 13.3 on 10 and 4 degrees of freedom, P = 0.0117 (significance codes are described as ‘***’ for P < 0.001, ‘**’ P < 0.01, ‘*’ P < 0.05, and NS, for not significant P > 0.05). APC, antigen presenting cell; FDR, false discovery rate; pDC, plasmacytoid dendritic cell. R indicates Pearson r correlation coefficient.
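The ON/OFF ratio named in panels a and b can be illustrated with a simplified count-based calculation. This sketch is an assumption-laden toy, not SOPRANO itself: the published tool may apply corrections (for example, for mutational context) that are not reproduced here, and the class name, function names and all counts below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionCounts:
    """Mutation and site counts for one region (ON or OFF the immunopeptidome)."""
    nonsyn_mutations: int
    syn_mutations: int
    nonsyn_sites: float
    syn_sites: float


def dn_ds(region):
    """Nonsynonymous rate per site divided by synonymous rate per site."""
    dn = region.nonsyn_mutations / region.nonsyn_sites
    ds = region.syn_mutations / region.syn_sites
    return dn / ds


def immune_dn_ds(on_target, off_target):
    """ON-dN/dS divided by OFF-dN/dS; values below 1 suggest depletion of
    nonsynonymous mutations inside the immunopeptidome."""
    return dn_ds(on_target) / dn_ds(off_target)


# Hypothetical counts chosen only to make the arithmetic easy to follow.
on = RegionCounts(nonsyn_mutations=30, syn_mutations=20, nonsyn_sites=3000, syn_sites=1000)
off = RegionCounts(nonsyn_mutations=300, syn_mutations=100, nonsyn_sites=30000, syn_sites=10000)
ratio = immune_dn_ds(on, off)
```

With these toy counts ON-dN/dS is 0.5 and OFF-dN/dS is 1, so the ratio falls below the neutral value of one drawn as a red dashed line in the figure.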
Immune dN/dS analysis of a curated set of individuals from three tumor types
Analysis of a curated set of individuals from three MSI-rich tumor types. Patient-specific analysis of primary untreated colorectal (CRC), stomach (STAD) and uterine/endometrial cancer (UCEC) (n = 879) with annotated escape mechanisms obtained from Lakatos et al.³². a,b, TMB (a) and immune dN/dS (b) for different subtypes of cancers, including MSS escaped− (MSS−, n = 130) and escaped+ (MSS+, n = 144), microsatellite-unstable escaped− (MSI−, n = 53) and escaped+ (MSI+, n = 125) and POLE mutants (n = 38). c, Immune dN/dS versus TMB for immune-escaped and immune-edited MSS groups using all mutations (MSS−, n = 107, MSS+, n = 133). d, Immune dN/dS comparison between escaped− and escaped+ tumors using clonal (MSS−, n = 93, MSS+, n = 94) or all mutations (MSS−, n = 130, MSS+, n = 144). e, Immune dN/dS comparison between clonal versus all mutations in escaped− and escaped+ MSS tumors. Reported P values are from a paired two-sample two-sided Wilcoxon signed rank test after multiple test correction using the Holm method. f, Relationship between immune dN/dS and the reported CD8 T cell infiltration for escaped− and escaped+ in MSS and MSI tumors. Boxplots represent the median, 25th percentile and 75th percentile, and whiskers correspond to 1.5 times the interquartile range. CRC, circle; STAD, triangle; UCEC, square. For two-sample comparisons, P values were calculated using a nonparametric two-sided Mann–Whitney U test. For linear correlations, P values and coefficients were calculated using Pearson’s correlation (two-sided t-test). Red dashed lines indicate immune neutral dN/dS = 1.
Analysis of the Hartwig Medical Foundation metastatic cohort under immunotherapy
a, A total of 308 patients with clinical response to immunotherapy based on RECIST1.1, including responders (R) and nonresponders (NR). Patients were primarily treated with ipilimumab (Ipi), nivolumab (Nivo), pembrolizumab (Pembro) or a combination of ipilimumab plus nivolumab (Ipi/Nivo). b, Cohort immune dN/dS for responders (n = 79) and nonresponders (n = 229) using a common HLA-A0201 immunopeptidome reveals immune dN/dS lower than 1 for nonresponders, consistent with a low overall tumor antigenicity. c, Comparison of individual immune dN/dS for responders (median = 1.05, n = 67) and nonresponders (median = 0.77, n = 154) using patient-specific HLA immunopeptidomes (P = 0.034). d, Proportion of escaped+ (NR n = 81, R n = 48) and escaped− (NR n = 148, R n = 31) tumors classified by clinical response. Responders were enriched in genetic escape mechanisms (χ² P = 8 × 10⁻⁵). e, TMB for escaped+ (NR+ n = 69, R+ n = 43) and escaped− tumors (NR− n = 85, R− n = 43) classified by response status shows that escaped+ tumors have significantly more mutations than immune-edited tumors. f, Patient-specific immune dN/dS revealed a significant depletion of nonsynonymous mutations only in the immunopeptidome of escaped− nonresponders (NR−, median = 0.69, P = 0.0012). One-sample one-sided Wilcoxon signed rank test with mu = 1 (NR− n = 85, NR+ n = 69, R− n = 24, R+ n = 43). Boxplots represent the median, 25th percentile and 75th percentile, and whiskers correspond to 1.5 times the interquartile range. g, Cohort dN/dS for driver genes (196 genes from Martincorena et al.¹⁴), escape genes (Supplementary Table 3), all the exome (global dN/dS) and the immunopeptidome (immune dN/dS). All error bars include the point estimate plus the 95% confidence interval calculated using the SOPRANO package. h, Kaplan–Meier curves of overall survival for responders and nonresponders grouped by escape status (log rank P = 1.4 × 10⁻⁶). i, Kaplan–Meier curves of overall survival for individuals classified based on immune dN/dS and escape status (−, escaped−; +, escaped+) (log rank P = 0.0025). Red dashed lines indicate immune-neutral dN/dS = 1.
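As a worked check of the enrichment reported in panel d, the counts given there (escaped+: NR 81, R 48; escaped−: NR 148, R 31) can be put through a Pearson χ² test on a 2 × 2 table. The sketch below assumes a test without continuity correction, which the legend does not specify; the function name is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its P value on 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a 1-d.f. chi-square: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: nonresponders, responders; columns: escaped+, escaped- (counts from panel d).
chi2_stat, p_value = chi2_2x2(81, 148, 48, 31)
```

Under this assumption the statistic is about 15.6 and the P value about 8 × 10⁻⁵, in line with the value quoted in the legend.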
Impact of immune dN/dS on outcome for ICI therapy in a metastatic melanoma cohort
a, A clinically annotated cohort of 48 patients with sequencing data before (Pre) and after (On) receiving ICIs was obtained from Riaz et al.³⁹. b, dN/dS distributions for nonresponders (n = 36) and responders (n = 12) before and after therapy. We estimated OFF-dN/dS (left), ON-dN/dS (middle) and immune dN/dS (right) using each patient’s six HLA alleles. P values were calculated using a two-sided Wilcoxon rank sum test and corrected using the Benjamini–Hochberg procedure. c, Mutations and their prevalence in genes classified as escaped in the Riaz cohort. Individuals were also classified according to: homozygosity status (Hom, homozygous; Het, heterozygous), somatic loss of HLA heterozygosity (N, no; Y, yes) and their immune dN/dS category (N, neutral; L, low; H, high). d, Pre-therapy immune dN/dS for escaped− and escaped+ cohorts classified as responders (R) and nonresponders (NR). Escaped− nonresponders were the only group with cohort immune dN/dS less than 1. All escaped+ tumors before therapy displayed immune dN/dS equal to 1. e, Immune dN/dS distribution for randomized escaped− tumors. The point estimate for escaped− nonresponders was significantly lower than the mean of randomized immune dN/dS values (exact P = 0.019). f, Kaplan–Meier curves of overall survival between high-TMB and low-TMB patients. High-TMB patients had significantly longer overall survival (log-rank P = 0.031) than low-TMB individuals. g, Kaplan–Meier curves of overall survival between immune neutral (escape+ and neutral immune dN/dS) and immune edited (escape− and low-immune dN/dS). The association between overall survival and immune dN/dS was more significant than with TMB (log-rank P = 0.015). Red dashed lines indicate immune-neutral dN/dS = 1.
In cancer, evolutionary forces select for clones that evade the immune system. Here we analyzed >10,000 primary tumors and 356 immune-checkpoint-treated metastases using immune dN/dS, the ratio of nonsynonymous to synonymous mutations in the immunopeptidome, to measure immune selection in cohorts and individuals. We classified tumors as immune edited when antigenic mutations were removed by negative selection and immune escaped when antigenicity was covered up by aberrant immune modulation. Only in immune-edited tumors was immune predation linked to CD8 T cell infiltration. Immune-escaped metastases experienced the best response to immunotherapy, whereas immune-edited patients did not benefit, suggesting a preexisting resistance mechanism. Similarly, in a longitudinal cohort, nivolumab treatment removes neoantigens exclusively in the immunopeptidome of nonimmune-edited patients, the group with the best overall survival response. Our work uses dN/dS to differentiate between immune-edited and immune-escaped tumors, measuring potential antigenicity and ultimately helping predict response to treatment.
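As an illustration of the quantity described above, the following is a minimal sketch of a dN/dS ratio restricted to a region such as the immunopeptidome. It is not the SOPRANO implementation used in the study; the function names, the per-site normalization and the simple cutoff at 1 are assumptions made for illustration only (the study assesses departure from 1 with confidence intervals and statistical tests).

```python
def dnds(n_nonsyn, n_syn, sites_nonsyn, sites_syn):
    """Nonsynonymous-to-synonymous rate ratio, normalized per available site.

    n_nonsyn / n_syn: observed mutation counts inside the region of interest.
    sites_nonsyn / sites_syn: number of sites where such a mutation could occur.
    """
    if n_syn == 0 or sites_nonsyn == 0 or sites_syn == 0:
        raise ValueError("dN/dS is undefined without synonymous mutations or sites")
    dn = n_nonsyn / sites_nonsyn
    ds = n_syn / sites_syn
    return dn / ds


def classify(immune_dnds, has_escape_alteration, cutoff=1.0):
    """Toy classification (assumed rule): escape alterations take precedence,
    otherwise a ratio below the neutral value of 1 is read as immune editing."""
    if has_escape_alteration:
        return "immune escaped"
    if immune_dnds < cutoff:
        return "immune edited"
    return "neutral, no escape detected"


# Hypothetical counts: 30 nonsynonymous mutations over 300 sites,
# 20 synonymous mutations over 100 sites.
ratio = dnds(30, 20, 300, 100)
label = classify(ratio, has_escape_alteration=False)
```

A ratio below 1 in this sketch means nonsynonymous mutations are depleted relative to the synonymous background, which is the signal the abstract interprets as negative selection by the immune system.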
Zygotic genome activation (ZGA) is a critical postfertilization step that promotes totipotency and allows different cell fates to emerge in the developing embryo. MERVL (murine endogenous retrovirus-L) is transiently upregulated at the two-cell stage during ZGA. Although MERVL expression is widely used as a marker of totipotency, the role of this retrotransposon in mouse embryogenesis remains elusive. Here, we show that full-length MERVL transcripts, but not encoded retroviral proteins, are essential for accurate regulation of the host transcriptome and chromatin state during preimplantation development. Both knockdown and CRISPRi-based repression of MERVL result in embryonic lethality due to defects in differentiation and genomic stability. Furthermore, transcriptome and epigenome analysis revealed that loss of MERVL transcripts led to retention of an accessible chromatin state at, and aberrant expression of, a subset of two-cell-specific genes. Taken together, our results suggest a model in which an endogenous retrovirus plays a key role in regulating host cell fate potential.
Epigenetic reprogramming in the germline contributes to the erasure of epigenetic inheritance across generations in mammals but remains poorly characterized in plants. Here we profiled histone modifications throughout Arabidopsis male germline development. We find that the sperm cell has widespread apparent chromatin bivalency, which is established by the acquisition of H3K27me3 or H3K4me3 at pre-existing H3K4me3 or H3K27me3 regions, respectively. These bivalent domains are associated with a distinct transcriptional status. Somatic H3K27me3 is generally reduced in sperm, while dramatic loss of H3K27me3 is observed at only ~700 developmental genes. The incorporation of the histone variant H3.10 facilitates the establishment of sperm chromatin identity without a strong impact on resetting of somatic H3K27me3. Vegetative nuclei harbor thousands of specific H3K27me3 domains at repressed genes, while pollination-related genes are highly expressed and marked by gene body H3K4me3. Our work highlights putative chromatin bivalency and restricted resetting of H3K27me3 at developmental regulators as key features in plant pluripotent sperm.
Pearl millet is an important cereal crop worldwide and shows superior heat tolerance. Here, we developed a graph-based pan-genome by assembling ten chromosomal genomes with one existing assembly adapted to different climates worldwide and captured 424,085 genomic structural variations (SVs). Comparative genomics and transcriptomics analyses revealed the expansion of the RWP-RK transcription factor family and the involvement of endoplasmic reticulum (ER)-related genes in heat tolerance. The overexpression of one RWP-RK gene led to enhanced plant heat tolerance and transactivated ER-related genes quickly, supporting the important roles of RWP-RK transcription factors and ER system in heat tolerance. Furthermore, we found that some SVs affected the gene expression associated with heat tolerance and SVs surrounding ER-related genes shaped adaptation to heat tolerance during domestication in the population. Our study provides a comprehensive genomic resource revealing insights into heat tolerance and laying a foundation for generating more robust crops under the changing climate.
Study overview. a, Discovery meta-analysis. *For signals present in more than one trait, the signal is only counted once (for the most significant trait). b, Pathway analyses, GRS analyses and PheWAS studies.
Lung-function impairment underlies chronic obstructive pulmonary disease (COPD) and predicts mortality. In the largest multi-ancestry genome-wide association meta-analysis of lung function to date, comprising 580,869 participants, we identified 1,020 independent association signals implicating 559 genes supported by ≥2 criteria from a systematic variant-to-gene mapping framework. These genes were enriched in 29 pathways. Individual variants showed heterogeneity across ancestries, age and smoking groups, and collectively as a genetic risk score showed strong association with COPD across ancestry groups. We undertook phenome-wide association studies for selected associated variants as well as trait and pathway-specific genetic risk scores to infer possible consequences of intervening in pathways underlying lung function. We highlight new putative causal variants, genes, proteins and pathways, including those targeted by existing drugs. These findings bring us closer to understanding the mechanisms underlying lung function and COPD, and should inform functional genomics experiments and potentially future COPD therapies. Multi-ancestry genome-wide association analyses and systematic variant-to-gene mapping strategies implicate new genes and pathways influencing lung function and chronic obstructive pulmonary disease risk.
Global enrichment in 80 panel genes under strong constraint (pLI > 0.9). a, Case-control enrichment of rare (minor allele count ≤ 5) protein-truncating, missense and synonymous variants in all ancestries combined. The PGC3SEQ results were derived from 11,580 individuals with SCZ and 10,555 controls and are shown in red/orange. We conducted the same analysis in the SCHEMA samples (shown in gray; 19,108 cases and 18,001 controls) that we had access to for comparison. b, Ancestry-stratified rare variant (MAF < 0.1%) enrichment in the meta-analysis of PGC3SEQ and SCHEMA (29,381 cases and
Schizophrenia (SCZ) is a chronic mental illness and among the most debilitating conditions encountered in medical practice. A recent landmark SCZ study of the protein-coding regions of the genome identified a causal role for ten genes and a concentration of rare variant signals in evolutionarily constrained genes¹. This recent study, like most other large-scale human genetics studies, was mainly composed of individuals of European (EUR) ancestry, and the generalizability of the findings in non-EUR populations remains unclear. To address this gap, we designed a custom sequencing panel of 161 genes selected based on the current knowledge of SCZ genetics and sequenced a new cohort of 11,580 SCZ cases and 10,555 controls of diverse ancestries. Replicating earlier work, we found that cases carried a significantly higher burden of rare protein-truncating variants (PTVs) among evolutionarily constrained genes (odds ratio = 1.48; P = 5.4 × 10⁻⁶). In meta-analyses with existing datasets totaling up to 35,828 cases and 107,877 controls, this excess burden was largely consistent across five ancestral populations. Two genes (SRRM2 and AKAP11) were newly implicated as SCZ risk genes, and one gene (PCLO) was identified as shared by individuals with SCZ and those with autism. Overall, our results lend robust support to the rare allelic spectrum of the genetic architecture of SCZ being conserved across diverse human populations. Targeted sequencing finds a higher burden of rare protein-truncating variants in constrained genes among schizophrenia cases of diverse ancestries. Meta-analyses with existing datasets show that this excess burden is consistent across five ancestral populations.
Multi-omic profiling of lesions at autopsy reveals a plethora of resistance mechanisms present within individual patients with ovarian cancer. This highlights the extreme challenge faced in treating end-stage disease and underscores the need for new methods of early detection and intervention.
High-grade serous ovarian cancer (HGSC) is frequently characterized by homologous recombination (HR) DNA repair deficiency and, while most such tumors are sensitive to initial treatment, acquired resistance is common. We undertook a multiomics approach to interrogate molecular diversity in end-stage disease, using multiple autopsy samples collected from 15 women with HR-deficient HGSC. Patients had polyclonal disease, and several resistance mechanisms were identified within most patients, including reversion mutations and HR restoration by other means. We also observed frequent whole-genome duplication and global changes in immune composition with evidence of immune escape. This analysis highlights diverse evolutionary changes within HGSC that evade therapy and ultimately overwhelm individual patients.
Identification of therapeutic targets from genome-wide association studies (GWAS) requires insights into downstream functional consequences. We harmonized 8,613 RNA-sequencing samples from 14 brain datasets to create the MetaBrain resource and performed cis- and trans-expression quantitative trait locus (eQTL) meta-analyses in multiple brain region- and ancestry-specific datasets (n ≤ 2,759). Many of the 16,169 cortex cis-eQTLs were tissue-dependent when compared with blood cis-eQTLs. We inferred brain cell types for 3,549 cis-eQTLs by interaction analysis. We prioritized 186 cis-eQTLs for 31 brain-related traits using Mendelian randomization and co-localization including 40 cis-eQTLs with an inferred cell type, such as a neuron-specific cis-eQTL (CYP24A1) for multiple sclerosis. We further describe 737 trans-eQTLs for 526 unique variants and 108 unique genes. We used brain-specific gene-co-regulation networks to link GWAS loci and prioritize additional genes for five central nervous system diseases. This study represents a valuable resource for post-GWAS research on central nervous system diseases.
Implementation and benchmarking of network-based augmentation of GWAS
a, Edge and node counts of the combined interactome and its components. OTAR is the Open Targets combined physical protein interaction network that is provided via a Neo4j Graph Database. b, Graphic representation of some L2G components: SNP-to-gene distance, data from QTLs and variant effect predictions. The integration of information into the L2G score has been described previously¹¹. c, Graphical representation of the network-based approach: network propagation of the initial input, clustering using a random walker to find gene communities and scoring of modules using the distribution of PageRank score. KS, Kolmogorov–Smirnov. d, Number of starting genes linked to traits, grouped in therapeutic areas. In the violin plot, the red dots represent the median, the limits of the thick line correspond to quartiles 1 and 3 (25% and 75% of the distribution) and the limits of the thin line are 1.5× the interquartile range. e, Benchmarking of the method, using as a starting signal genes from the Open Targets Genetics portal with a L2G score >0.5. AUC values are calculated using as positive hits the DISEASE database, with increasing cutoff values for its gene-to-trait score (Methods), as well as clinical trials data from the ChEMBL database (clinical phase II or higher). We also re-calculated the AUC values and determined Z-scores reflecting the deviation in AUCs relative to those observed after randomization of the list of true positives (TPs). In the boxplots, the middle lines represent the median, the limits of the box are quartiles 1 and 3 and the whiskers represent 1.5× the interquartile range.
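The network propagation step mentioned in panel c can be illustrated with a personalized PageRank (PPR) computed by power iteration. This is a minimal sketch on a toy undirected graph, not the pipeline used in the study; the damping factor, iteration count, node labels and function name are assumptions.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iters=100):
    """Propagate seed weights over an undirected interaction network.

    edges: iterable of (node, node) pairs.
    seeds: dict mapping seed nodes to non-negative starting weights
           (for example, GWAS-linked genes); other nodes start at 0.
    Returns a dict of scores that sum to 1.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)

    total = sum(seeds.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed must be present in the network")
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}

    score = dict(restart)
    for _ in range(iters):
        # With probability (1 - damping) the walker jumps back to a seed.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for u in nodes:
            share = damping * score[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        score = nxt
    return score


# Toy chain A - B - C - D with a single hypothetical seed gene "A".
ppr = personalized_pagerank([("A", "B"), ("B", "C"), ("C", "D")], {"A": 1.0})
```

In this toy chain, nodes closer to the seed receive more of the propagated signal than distant ones, which is the property that lets propagation scores expand a list of candidate genes.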
Trait–trait genetic and functional similarities determined from network expansion of GWAS data
a, Tree showing the Manhattan distance between all traits, using the full PPR score. Hierarchical clustering was performed using a cutoff of h = 1, leading to 54 clusters, colored depending on the predominant EFO ancestry term. The right-hand panel is a barplot showing the 54 clusters with the frequencies for the predominant EFO ancestry terms and a heatmap showing the counts for ChEMBL targets and drugs. The text label next to each cluster corresponds to the second most predominant EFO terms that, on average, label 35% of the traits within the clusters that have a text label. b, Examples of traits grouped using the Manhattan distance, extracted from the tree in a. CSF, colony-stimulating factor; Ig, immunoglobulin; LDL, low-density lipoprotein.
Multitrait gene module associations for studies of shared biological processes and drug-repurposing opportunities
a, Heatmap showing the overlap between gene modules across traits. Traits were clustered using hierarchical clustering (Methods) and subgroups were defined by a cutoff of 0.6 average correlation coefficient. A module was considered the same across different traits when most genes are in common (Jaccard index > 0.7). Significant trait–module relations are marked in yellow or pink, with yellow indicating modules overrepresented in one of the subgroups of traits (one-sided Fisher’s exact test, adjusted P < 0.05) and pink otherwise. The heatmap in the right-hand panel shows the number of genes in modules from each subgroup of traits that are drug targets (phase III or higher, ChEMBL database), linked with clinical variants (ClinVar database) or with mouse KO phenotypes (International Mouse Phenotyping Consortium database). b, Barplot showing the number of traits linked with the top six most pleiotropic gene modules. The GOBP description is based on the results of a GOBP enrichment test (Methods). c, Simplified heatmap of the clusters in a concerning bone-related and fasciitis traits. The represented network includes genes from the modules indicated in blue letters and the represented interactions have been filtered for visualization (Methods). Blue nodes, relevant mouse KO phenotypes; green nodes, diseases with clinical variants enriched in this gene module; red nodes, drugs in clinical trials. Genes linked to blue, green or yellow nodes have the linked mouse phenotypes, clinical variants in the linked disease or are targets of the linked drug. Genes that are the targets of drugs in clinical trials have yellow nodes. GWAS-linked genes (L2G score > 0.5) have borders colored in an orange to red gradient (count of GWAS-linked traits). d, Simplified heatmap of one of the clusters in a concerning allergic reactions (node and edge color code are the same as in c). In this case, two modules were merged to build the interaction network in the right-hand panel.
mRNA, messenger RNA; SRP, signal recognition particle.
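The rule in panel a for treating two modules as the same (Jaccard index > 0.7) can be sketched as follows. The gene labels and function names are hypothetical; only the index definition and the 0.7 threshold come from the caption.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_module(genes_a, genes_b, threshold=0.7):
    """Two modules count as the same when their Jaccard index exceeds the threshold."""
    return jaccard(genes_a, genes_b) > threshold


# Hypothetical modules found for two different traits.
module_trait1 = {"G1", "G2", "G3", "G4"}
module_trait2 = {"G1", "G2", "G3", "G4", "G5"}
module_trait3 = {"G4", "G5", "G6"}

shared_12 = same_module(module_trait1, module_trait2)  # 4 shared of 5 total
shared_13 = same_module(module_trait1, module_trait3)  # 1 shared of 6 total
```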
Gene module analysis of autoimmune diseases
a, Heatmap showing the overlap between gene modules across traits (color-coded as in Fig. 3a,c,d). The GOBP description is based on the results of a GOBP enrichment test (one-sided Fisher’s exact test, BH adjustment, Methods). The heatmap in the right-hand panel shows the gene set enrichment analysis carried out on the expression data from different tissues extracted from Human Protein Atlas (HPA) for the gene modules in blue (two-sided Kolmogorov–Smirnov test, Methods). After BH adjustment for multiple testing, the P value of the test was log transformed and given a positive value if the median distribution for the foreground was higher than the background and a negative value if it was lower. b, Shared modules as a network, nodes are gene modules associated with different immune-related traits colored blue or red for the two trait subgroups; edges represent a high degree of overlap at the gene level (Jaccard index > 0.7). Gene modules linked to different traits are given in black circles. Gene modules are linked with the yellow node ‘ChEMBL-drugs’ when they contain targets for drugs in clinical trials (phases III and IV, ChEMBL); linked with green nodes when they are enriched in genes with clinical variants for a given disease; and linked with purple nodes when they are enriched for the corresponding KO phenotypes (one-sided Fisher’s exact test, adjusted P < 0.05). c, Network corresponding to genes found in gene modules enriched for Type I interferon (INF) signaling, phospholipase C-activating GPCR signaling, neutrophil activation (integrins) and protein kinase A (PKA) activity. Edge filtering, node and edge colors are the same as in Fig. 3c,d.
An IBD-specific network is enriched for likely causal genes
a, Curated IBD seed genes (N = 37) tend to have a higher network propagation score (PPR percentile) than other genes within 200 kb at the same loci. b, Genes selected by high Open Targets L2G score also tend to have high PPR percentile, highlighting network evidence as complementary to typical locus features. In the boxplots, the middle lines represent the median, the limits of the box are quartiles 1 and 3 and the whiskers represent 1.5× the interquartile range. c, Genome-wide, genes with low P-value SNPs within 10 kb are enriched for high PPR percentile (one-sided Fisher’s exact test). Data are presented as the mean ± s.d.
Interacting proteins tend to have similar functions, influencing the same organismal traits. Interaction networks can be used to expand the list of candidate trait-associated genes from genome-wide association studies. Here, we performed network-based expansion of trait-associated genes for 1,002 human traits showing that this recovers known disease genes or drug targets. The similarity of network expansion scores identifies groups of traits likely to share an underlying genetic and biological process. We identified 73 pleiotropic gene modules linked to multiple traits, enriched in genes involved in processes such as protein ubiquitination and RNA processing. In contrast to gene deletion studies, pleiotropy as defined here captures specifically multicellular-related processes. We show examples of modules linked to human diseases enriched in genes with known pathogenic variants that can be used to map targets of approved drugs for repurposing. Finally, we illustrate the use of network expansion scores to study genes at inflammatory bowel disease genome-wide association study loci, and implicate inflammatory bowel disease-relevant genes with strong functional and genetic support.
In the context of climate change, drought is one of the most limiting factors that influence crop production. Maize, as a major crop, is highly vulnerable to water deficit, which causes significant yield loss. Thus, identification and utilization of drought-resistant germplasm are crucial for the genetic improvement of the trait. Here we report on a high-quality genome assembly of a prominent drought-resistant genotype, CIMBL55. Genomic and genetic variation analyses revealed that 65 favorable alleles of 108 previously identified drought-resistant candidate genes were found in CIMBL55, which may constitute the genetic basis for its excellent drought resistance. Notably, ZmRtn16, encoding a reticulon-like protein, was found to contribute to drought resistance by facilitating the vacuole H⁺-ATPase activity, which highlights the role of vacuole proton pumps in maize drought resistance. The assembled CIMBL55 genome provided a basis for genetic dissection and improvement of plant drought resistance, in support of global food security.
Characteristics of obesity and WHRadjBMI genetic risk
a, LDSC-SEG for WHRadjBMI (left) and obesity characteristics (right), measuring the enrichment of traits near genes with tissue-specific expression. The P values are based on standard error and represent the contribution of tissue-specific expression to trait heritability after correction for baseline annotations. The bars are colored by tissue category. CR, cardiorespiratory; MS, musculoskeletal; SAT, subcutaneous adipose tissue. b, xCell-derived cell type enrichment predictions, normalized to generate proportions and averaged across adipose cell types (preadipocyte and adipocyte). The bars are colored by tissue category. c, Genetic correlation and heritability for sex-specific GWASs conducted for obesity and WHRadjBMI (for obesity, n = ~681,275 individuals; for WHRadjBMI, n = 694,649 individuals). The bars are colored by trait. The data represent correlation or heritability estimates ± s.e. The genetic correlation is high between the sexes for obesity and lower for WHRadjBMI. The sexual dimorphism of WHRadjBMI is partially driven by increased heritability of WHRadjBMI in women. Welch’s two-sided t-test was used to evaluate within-trait heritability between sexes (WHR P = 3.3 × 10⁻⁴; obesity P = 0.16). F, female; M, male.
Genes identified by TWAS
a, Overlap between TWAS-identified gene sets (pie chart; top) and the top 10 results for each TWAS. Results show some known and some novel genes associated with obesity and WHR. Significance based on permutation distribution with subsequent Bonferroni correction. Data are colored by sex and trait. b, Sexual dimorphism of WHRadjBMI TWAS results. Every gene that was tested in both male and female TWASs is displayed. The data are colored by sex. c, rs1534696 eQTL effect on SNX10 in women and men. The data are colored by rs1534696 genotype. The P values for the significance of the differences in log2[transcripts per million (tpm)] values between sexes at each rs1534696 genotype were obtained via Student’s two-sided t-test (for females, there were 60 A/A participants, 91 C/A participants and 42 C/C participants; for males, there were 54 A/A participants, 96 C/A participants and 43 C/C participants). The box plot whiskers represent the minimum (1st percentile) and maximum (99th percentile) points of the data, the box bounds represent the first and third quartiles (25th and 75th percentiles) and the center line represents the median of the data (50th percentile). FC, fold change. d, Sexual dimorphism of eQTLs (left) and GWASs (right) for female WHR. The data are colored by data type (eQTL versus GWAS). Each dot represents the top eQTL or GWAS for a female WHR TWAS gene. The GWAS betas are normalized by sex-specific trait heritability (h²). The left panel represents P values of each variant as a sex-specific eQTL; the right represents the P value of each variant in a WHRadjBMI sex-specific GWAS. The variants show sexual dimorphism in GWASs, but not as eQTLs. e, GWAS locus near SNX10 in women (left) and men (right). rs1534696 is sexually dimorphic for WHRadjBMI.
Regulatory network of female WHRadjBMI
a, MPRA results. The outside ring shows female WHRadjBMI TWAS genes (n = 91). The blue ring shows variants that had significant enhancer activity (n = 426). The red ring shows variants that modulated the activity of significant enhancers (n = 58). There were n = 1,455 nonsignificant variants. Significance was determined by one-sided Mann–Whitney U-test and subsequent FDR correction. b, Transcription factor (TF) motifs enriched in enhancer-modulating variants relative to nonsignificant enhancer sequences. Transcription factors contributing to common adipose motifs are highlighted in green. The P values represent the probability of finding a degree of enrichment by chance, based on HOMER significance testing based on a null binomial distribution of motifs in the genome and FDR multiple comparisons correction. c, Conservation of MPRA sequences by significance status and the presence of Alu repeat elements in MPRA sequences. The data are colored by MPRA significance status. Conservation significance was determined by Student’s two-sided t-test between average conservation rates across MPRA sequences. The data represent mean conservation rates ± s.e. (for the enhancer-modulating variant versus enhancer, P = 1.81 × 10⁻¹⁷; for the enhancer versus nonsignificant variant, P = 2.22 × 10⁻²). The Alu proportion significance was calculated via chi-squared test comparing observed proportions of MPRA variants in Alu elements with those expected by chance in MPRA variants (the enhancer-modulating variant overlap was 44/58 sequences, the enhancer overlap was 148/368 sequences and the nonsignificant overlap was 287/1433 sequences). d, Average enhancer activities of MPRA sequences with and without Alu elements. The data are colored by MPRA significance status. Significance was determined by t-test comparing average normalized activities between variants within or not within Alu elements per MPRA significance level. 
e, Density plot reflecting the presence of enriched adipogenesis motifs within the first monomer of Alu elements in MPRA sequences.
Effects of the rs1534696 allele on human primary adipocytes
a, ABC-predicted interactions between the SNX10 promoter and gene body loci in differentiating primary adipocytes. b, LipocyteProfiler method. c, LipocyteProfiler features across primary human adipocyte differentiation. The data are colored by feature class. Larger dots represent features that differ significantly between AA and CC/CA genotype groups at each time point for adipocytes from subcutaneous and visceral depots. Significance is defined as P < 0.05 and FDR q < 0.05. The y axes represent the P value of significance of each feature and the x axes represent the effect size, represented as the t statistic, of each feature. The dashed lines represent the P value significance threshold. d, LipocyteProfiler features significant in day 14 subcutaneous adipocytes. The data are colored by feature class. The x axis shows significant features grouped by cellular organelle staining target. Organelle targets are visualized and consist of: (1) mitochondria; (2) intracellular lipids; (3) DNA; and (4) actin cytoskeleton and Golgi membrane. The features shown are collapsed across redundant features (for example, different measures of cell shape). For subcutaneous adipocytes, n = 20 CC/CA and n = 11 AA. For visceral adipocytes, n = 23 CC/CA and n = 12 AA.
SNX10-mediated inhibition of adipogenesis
a, Oil Red O staining (left) and reverse transcription quantitative real-time PCR quantification (right) representing lipid accumulation after adipocyte differentiation in hMSCs upon SNX10 shRNA introduction. The Oil Red O images are representative of samples from that condition. The data represent mean expression ± s.d. (n = 2 technical replicates). The data are colored by shRNA experiment. mRNA, messenger RNA. b, DEXA scans from female and male control mice (top) and female and male mmΔSnx10Adipoq mice. c,d, Body weight (c) and body fat mass (d) of control and mmΔSnx10Adipoq mice after HFD administration (n = 6 female control mice (CF), n = 5 female knockout mice (KOF), n = 3 male control mice (CM) and n = 6 male knockout mice (KOM); pairwise t-test, P < 0.01). The asterisks represent the results of two-sided Student’s t-tests. The box plot whiskers represent the minimum (1st percentile) and maximum (99th percentile) points of data, the box bounds represent the first and third quartiles (25th and 75th percentiles) and the center lines represent medians of the data (50th percentile). The data are colored by sex and SNX10 genotype. NS, not significant.
Obesity-associated morbidity is exacerbated by abdominal obesity, which can be measured as the waist-to-hip ratio adjusted for the body mass index (WHRadjBMI). Here we identify genes associated with obesity and WHRadjBMI and characterize allele-sensitive enhancers that are predicted to regulate WHRadjBMI genes in women. We found that several waist-to-hip ratio-associated variants map within primate-specific Alu retrotransposons harboring a DNA motif associated with adipocyte differentiation. This suggests that a genetic component of adipose distribution in humans may involve co-option of retrotransposons as adipose enhancers. We evaluated the role of the strongest female WHRadjBMI-associated gene, SNX10, in adipose biology. We determined that it is required for human adipocyte differentiation and function and participates in diet-induced adipose expansion in female mice, but not males. Our data identify genes and regulatory mechanisms that underlie female-specific adipose distribution and mediate metabolic dysfunction in women.
Variant cohort details
a,b, Phenotypes associated with 88 experimentally verified clinical splicing variants (a) and position of the 76/88 variants that are SNVs relative to the ESs (b). c, Nature of 148 unannotated splicing events (mis-splicing) induced by the 88 variants. IR = intron retention.
Unannotated splicing events seen in 300K-RNA
a, Exon-skipping events are evidenced by split-reads spanning nonconsecutive exons within the transcript. Splice sites (GT/AG motifs) shown in bold and black are those for which events are being ranked. b, Cryptic activation events are evidenced by split-reads spanning: (i) an annotated acceptor and an unannotated donor or (ii) an annotated donor and an unannotated acceptor. c, Example showing the Top-4* events for NM_130786 (A1BG) exon 2 donor (g.58353291). Exon/intron lengths are not drawn to scale. Arc thickness corresponds to event rank. d, One hundred percent (119/119) of exon-skipping and cryptic activation events detected across 88 variants are present in 300K-RNA, and 92% are in the Top-4* events for their respective splice site. e, Percent of the 119 true-positive events detected within random subsets of the 335,663 source specimens in 300K-RNA. Gray dots show proportion across 20 random samples; blue line shows mean proportions with LOESS smoothing. f, Top-1* and Top-2* events around the splice sites affected by our 88 variants typically occur in mutually exclusive specimens, with both events seen, on average, in only 5% of total samples where either event was detected. Internal lines of boxplot denote the median value, and the lower and upper limits of the boxes represent 25th and 75th percentiles. Whiskers extend to the largest and smallest values within 1.5× the IQR. An asterisk indicates our filter for events involving skipping one or two exons and cryptic activation within 600 nt of the annotated splice site.
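The Top-N* selection described in this caption can be sketched as a filter followed by a ranking. The filter (skipping of one or two exons, cryptic activation within 600 nt) is taken from the caption; ranking events by the number of specimens in which they were observed is an assumption made for this sketch, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str              # "exon_skipping" or "cryptic"
    exons_skipped: int = 0  # used for exon-skipping events
    distance_nt: int = 0    # offset from the annotated splice site, for cryptic events
    n_specimens: int = 0    # specimens in which the event was seen (assumed ranking key)


def passes_filter(event, max_exons=2, max_distance=600):
    """Keep skipping of one or two exons and cryptic sites within 600 nt."""
    if event.kind == "exon_skipping":
        return 1 <= event.exons_skipped <= max_exons
    if event.kind == "cryptic":
        return abs(event.distance_nt) <= max_distance
    return False


def top_n(events, n=4):
    """Return the n most frequently observed events that pass the filter."""
    kept = [e for e in events if passes_filter(e)]
    return sorted(kept, key=lambda e: e.n_specimens, reverse=True)[:n]


# Hypothetical events recorded around one annotated splice site.
events = [
    Event("exon_skipping", exons_skipped=1, n_specimens=500),
    Event("cryptic", distance_nt=-12, n_specimens=900),
    Event("cryptic", distance_nt=750, n_specimens=5000),          # beyond 600 nt
    Event("exon_skipping", exons_skipped=3, n_specimens=4000),    # more than two exons
    Event("cryptic", distance_nt=40, n_specimens=20),
]
top2 = top_n(events, n=2)
```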
300K-RNA event rankings across tissues and data sources
a, Heatmap showing the proportion of mis-splicing events* with the same event rank in each GTEx tissue subtype, as compared to all GTEx tissue subtypes combined—for the 88 splice sites affected by our cohort of variants. Only tissues with ≥100 GTEx samples are shown. The Top-1* event in individual tissues is concordant with the Top-1* event in ‘all GTEx tissues’ for ≥80% of splice sites. b, Concordance of top-ranked mis-splicing events* in GTEx versus SRA. The Top-1* event in GTEx is the Top-1* event in SRA for 80/88 (91%) splice sites in our cohort. c, Spearman correlation of all mis-splicing events* across 98,810 annotated splice sites in clinically relevant Mendelian disease genes (Methods) in each GTEx tissue subtype versus GTEx overall. Only tissues with ≥100 GTEx samples are shown. Black, clinically accessible tissues. d, Upset plot²⁶ showing 300K-RNA Top-4* across clinically relevant Mendelian disease genes (four events per splice site, n = 383,677 in total) versus Top-4* specific to four clinically accessible tissues in GTEx. Twenty percent (77,461/383,677) of all 300K-RNA Top-4* across clinically relevant Mendelian disease genes are captured as Top-4* events among all four clinically accessible tissues (blood, fibroblasts, EBV-LCL and muscle). An asterisk indicates filtering to skipping one or two exons and cryptic activation within 600 nt of the annotated splice site.
Comparison of 300K-RNA Top-4* with SpliceAI
a–c, Custom interpretive rules applied to SpliceAI Δ-scores to predict the nature of mis-splicing. Δ-scores below 0.001 were excluded as our applied threshold for no predicted impact on splicing (threshold shown with gray dashed lines). Heights of red lines denote example Δ-scores that predict mis-splicing events according to our rules. a, Single-exon skipping is predicted if both splice sites flanking the exon have donor and acceptor loss Δ-scores above threshold, and double-exon skipping was inferred if the splice site of the upstream or downstream intron also had donor loss or acceptor loss Δ-score above threshold. b, Intron retention was predicted if both splice sites flanking an intron had donor loss and acceptor loss Δ-score above threshold. c, Cryptic activation was predicted by donor gain or acceptor gain Δ-scores above threshold for any unannotated donor or acceptor. d, Example showing SpliceAI predictions of exon skipping and cryptic activation in case number 6. e, Sensitivity and PPV of 300K-RNA and SpliceAI for exon-skipping and cryptic-activation prediction at different thresholds, for the 86/88 variants that can be scored by SpliceAI. Points on the 300K-RNA curve (blue) show metrics when using Top-1*, Top-2*, Top-3*, Top-4*, etc. events as predictions of the nature of mis-splicing. Points on the SpliceAI curve (red) show metrics at Δ-scores that predict the same number of exon skipping and cryptic activation as 300K-RNA Top-1*, Top-2*, Top-3*, Top-4* and so on. f, 300K-RNA and SpliceAI predictions of exon skipping (seen/not seen in RNA studies across 86 cases). g, 300K-RNA and SpliceAI predictions of cryptic splice-site activation (seen/not seen in RNA studies across 86 cases). Dashed lines indicate the threshold of Top-4* and SpliceAI Δ-score ≥ 0.011 identified in e. Black dots, mis-splicing events seen in RNA studies but not meeting the Δ-score threshold of 0.001. 
An asterisk indicates filtering to events that skip one or two exons and cryptic activation within 600 nt of the annotated splice site. TP, true positives; FN, false negatives; FP, false positives.
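The interpretive rules in panels a and c, and the metrics in panel e, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names and input shapes are hypothetical, and a Δ-score equal to the 0.001 threshold is treated as passing because the caption excludes only scores below it.

```python
THRESHOLD = 0.001  # Δ-scores below this are treated as no predicted impact


def predicts_single_exon_skipping(acceptor_loss, donor_loss, threshold=THRESHOLD):
    """Panel a: both splice sites flanking the exon must show a loss Δ-score."""
    return acceptor_loss >= threshold and donor_loss >= threshold


def predicted_cryptic_sites(gain_scores, annotated_positions, threshold=THRESHOLD):
    """Panel c: any unannotated position with a gain Δ-score at or above threshold.

    gain_scores: dict mapping genomic position -> donor or acceptor gain Δ-score
                 (hypothetical input format).
    """
    return sorted(
        pos for pos, score in gain_scores.items()
        if score >= threshold and pos not in annotated_positions
    )


def sensitivity_and_ppv(tp, fn, fp):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)
```

For example, a gain Δ-score at an annotated splice site is ignored by `predicted_cryptic_sites`, because cryptic activation is defined in the caption only for unannotated donors or acceptors.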
Even for essential splice-site variants that are almost guaranteed to alter mRNA splicing, no current method can reliably predict whether exon-skipping, cryptic activation or multiple events will result, greatly complicating clinical interpretation of pathogenicity. Strikingly, ranking the four most common unannotated splicing events across 335,663 reference RNA-sequencing (RNA-seq) samples (300K-RNA Top-4) predicts the nature of variant-associated mis-splicing with 92% sensitivity. The 300K-RNA Top-4 events correctly identify 96% of exon-skipping events and 86% of cryptic splice sites for 140 clinical cases subject to RNA testing, showing higher sensitivity and positive predictive value than SpliceAI. Notably, RNA re-analyses showed we had missed 300K-RNA Top-4 events for several clinical cases tested before the development of this empirical predictive method. Simply, mis-splicing events that happen around a splice site in RNA-seq data are those most likely to be activated by a splice-site variant. The SpliceVault web portal allows users easy access to 300K-RNA for informed splice-site variant interpretation and classification.
SCR is required for Sox2 expression in epiblast cells
a, CHi-C 1D interaction frequency heatmap in WT mES cells (top). Black arrowhead points to the center of the Sox2–SCR interaction and this corner signal overlaps with CTCF binding suggesting the formation of a CTCF-mediated loop. Publicly available ChIP-seq of RAD21, CTCF and NIPBL as well as CUT&RUN of H3K27ac in mES cells are shown at the bottom. CTCF motif orientation (red and blue arrowheads) is shown for significant CTCF motifs (Q < 0.05) as detected by FIMO. Shaded box shows deleted region in SCR∆ mice. b, qPCR analysis of Sox2 expression in blastocysts at E3.5 was done using the ∆∆CT method and Gapdh as a reference. Sox2 expression was calculated by comparing it to the median of all analyzed WT embryos. Each dot represents a single blastocyst. The number of biologically independent blastocysts (n) analyzed of each genotype is shown in the legend. Boxplots show minimum, maximum, median, first and third quartiles. A Wilcoxon two-sided test was performed to assess statistical significance. Het, heterozygous; Hom, homozygous. c, E6.5–E7.5 embryos were stained for GATA4, T and SOX2. Eight of nine SCR∆ homozygotes showed arrested development shortly after implantation and failed to initiate gastrulation as shown by T expression. Eight of eight WT and heterozygotes displayed correct pattern of T expression. Scale bars, 80 µm.
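The ∆∆CT calculation referred to in panel b can be sketched as below. The sketch assumes the standard form of the method (relative expression = 2^(−∆∆CT), which presumes roughly 100% amplification efficiency) and assumes that the calibrator is the median ∆CT of the WT embryos; the caption says only that expression was compared to the WT median, so that detail and the function names are assumptions.

```python
from statistics import median


def delta_ct(ct_target, ct_reference):
    """∆CT: target gene CT minus reference gene CT (for example Sox2 minus Gapdh)."""
    return ct_target - ct_reference


def relative_expression(sample_dct, wt_dcts):
    """Relative expression of one sample against the median WT ∆CT.

    ∆∆CT = ∆CT(sample) - median ∆CT(WT); expression = 2 ** (-∆∆CT).
    """
    calibrator = median(wt_dcts)
    return 2.0 ** -(sample_dct - calibrator)


wt = [5.0, 6.0, 7.0]                   # hypothetical WT ∆CT values
mutant = relative_expression(7.0, wt)  # one cycle later than the WT median
```

A sample whose ∆CT is one cycle higher than the WT median therefore has half the WT expression level.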
SCR activates Sox2 expression independently of CTCF
a, CHi-C 1D interaction frequency heatmaps in homozygotic CTCF∆(C2–C4) and CTCF∆(C5) mES cells, compared to WT. Rectangles represent the Sox2–SCR interaction. Insulation scores for 5-kb windows in this region are shown below publicly available CTCF and H3K27ac enrichment tracks. Lower scores represent higher insulation. b, Differential CHi-C interaction frequency heatmap. Red signal represents interactions occurring at higher frequency in mutant cell lines compared to control and the blue shows interactions of lower frequency. Dashed lines represent the Sox2–SCR domain as detected in WT control cells. c, Virtual 4C plots using the Sox2 and SCR viewpoints. Region surrounding viewpoint was removed from the analysis. Dashed lines highlight SCR in the Sox2 viewpoint (top) and Sox2 in the SCR viewpoint (bottom). Virtual 4C signal is shown as the average of the two replicates in 5-kb overlapping windows. Colored dots represent regions of statistically significant difference compared to WT (adj. P < 0.01). Black arrowhead indicates region of the highest intensity of the Sox2–SCR interaction, which overlaps with the SCR–CTCF motif. d, qPCR analysis of Sox2 expression in blastocysts at E3.5 was done using the ∆∆CT method and Gapdh as a reference. Sox2 expression was calculated by comparing it to the median of all analyzed WT embryos. Each dot represents a single blastocyst. The number of biologically independent blastocysts (n) analyzed of each genotype is shown below the plot. Boxplots show minimum, maximum, median, first and third quartiles. A Wilcoxon two-sided test was performed to assess statistical significance.
SCR can activate Sox2 across CTCF-mediated insulation
a, CHi-C 1D interaction frequency heatmaps in homozygotic mES cells of the CTCFi3×⁻, CTCFi3×⁺, CTCFi18×⁺ and CTCFi3×⁻;3×⁺ strains compared to WT. Rectangles show the Sox2–SCR interaction. Inserted CTCF motif orientation and position in each mutant are shown below the plots. Insulation scores for 5-kb windows are shown below publicly available CTCF and H3K27ac tracks. Dashed lines below heatmap show CTCF insertion sites. b, Virtual 4C plots using the Sox2 and SCR viewpoints. Dashed lines highlight SCR in the Sox2 viewpoint (top) and Sox2 in the SCR viewpoint (bottom). Region surrounding viewpoint was removed from analysis. Virtual 4C signal is shown as the average of the two replicates in 5-kb overlapping windows. Colored dots represent regions of statistically significant difference compared to WT using a Wald test and after correction for multiple comparisons (Q < 0.01). c, qPCR analysis of Sox2 expression in blastocysts at E3.5 was done using the ∆∆CT method and Gapdh as a reference. Sox2 expression was compared to the median of all WT embryos. Each dot represents a blastocyst and a Wilcoxon two-sided test assessed statistical significance. The number of biologically independent blastocysts (n) analyzed of each genotype is shown below the plot. Boxplots show minimum, maximum, median, first and third quartiles. d, IF of blastocysts with antibodies targeting GATA6, NANOG and SOX2. Each dot in the quantification plots represents the signal intensity of a single cell normalized by the cell with the highest intensity in heterozygotes. The number of biologically independent blastocysts (n) analyzed of each genotype is shown below the plot. Boxplots show minimum, maximum, median, first and third quartiles. A Wilcoxon two-sided test was performed to assess statistical significance. Scale bars, 10 µm.
DNE also induces Sox2 across CTCF-mediated insulation
a, CHi-C 1D interaction frequency heatmap in E11.5 heads. Data of WT ES cells are shown for comparison (bottom). Insets on the right show 2D interaction heatmaps highlighting interactions between regions surrounding Sox2 and DNE. CTCF ChIP-seq publicly available data were obtained from in vitro differentiated neural progenitor cells. RAD21 ChIP-seq was performed on E11.5 heads isolated as for CHi-C. 2D insets show the same tracks on the x and y axes but at different locations. Rectangles represent the Sox2–DNE interaction, arrowheads represent loops with CTCF downstream of DNE established by transgene insertions, black arrow represents loops with CTCF upstream of Sox2 established by transgene insertions, white arrow represents loops in ES cells between CTCF upstream of Sox2 and CTCF downstream of DNE, white brackets in the insets highlight the region between DNE and CTCF. Dashed triangle in head CHi-C represents the Sox2–SCR interaction domain detected in ES cells. b, Virtual 4C plots using Sox2 and DNE viewpoints using 5-kb overlapping windows and signal are shown as an average of the two replicates of each genotype. Region surrounding viewpoint was removed from the analysis. Insets show signal at DpnII fragments. c, qPCR analysis of Sox2 expression in NPCs was done using the standard curve dilution method and Eef2 as a reference. Each dot represents a technical replicate and three individual mutant cell lines (n) were analyzed. A Wilcoxon two-sided test assessed the statistical significance by comparing WT to all mutant clones combined. d, qPCR analysis of Sox2 expression in E11.5 midbrains was done using the ∆∆CT method and Gapdh as a reference. Sox2 expression was compared to the median of all WT embryos. Each dot represents one embryo and a Wilcoxon two-sided test assessed statistical significance. The number of biologically independent embryos (n) analyzed for each genotype is shown in the legend. 
Boxplots show minimum, maximum, median, first and third quartiles. e, IF of E9.5–10.5 embryos stained with an antibody targeting SOX2. Three homozygotes of each line were stained and imaged together with two WT littermates. Scale bars, 500 µm.
CTCF loops can completely insulate Sox2 from its AFG-specific enhancers
a, IF of E9.5–10.5 embryos stained with an antibody targeting SOX2. Bracket highlights the AFG. First four images were taken with a dissection microscope and the two on the right with a confocal microscope. Three homozygotes of each line were stained and imaged together with two WT littermates. Scale bars, 150 µm. b, IF with antibodies targeting SOX2 and NKX2.1 using dissected E13.5 AFG-derived tissues. Tr, trachea; Es, esophagus; Lu, lungs; St, stomach. Six embryos of each genotype were stained and imaged. Scale bars, 500 µm. c, qPCR analysis of Sox2 expression in stomach at E13.5 was done using the ∆∆CT method and Gapdh as a reference. Sox2 expression was calculated by comparing it to the median of all analyzed WT embryos. Each dot represents a single embryo. The number of embryos (n) analyzed for each genotype is shown below the plot. Boxplots show minimum, maximum, median, first and third quartiles. A Wilcoxon two-sided test was performed to assess statistical significance. d, GFP expression in AFG-derived organs dissected from E15.5 fetuses originating from crosses between CTCFi3×⁻;3×⁺ and Sox2GFP heterozygous mice. Three embryos of each genotype were imaged. Scale bars, 500 µm. e, CHi-C 1D interaction frequency heatmaps in WT E11.5 heads (top) and GFP⁺ cells from E11.5–12.5 AFG-derived tissues dissected from Sox2GFP heterozygotes (bottom). Publicly available CTCF, ATAC-seq and H3K27ac enrichment data from different tissues are shown below the heatmaps. Green arrowheads indicate putative regulatory elements with tissue-specific activity in AFG derivatives. Insets show a 2D interaction heatmap where y axis shows region centered around Sox2 and x axis shows region around a CTCF motif downstream of DNE. For the E11.5 head inset, CTCF data from NPCs are shown on both axes. For AFG-derived tissues, CTCF from stomach is shown on x axis and from lungs on the y axis. White bracket highlights the DNE and CTCF regions. 
In the AFG, the signal is restricted to the CTCF motifs and not to DNE.
How enhancers activate their distal target promoters remains incompletely understood. Here we dissect how CTCF-mediated loops facilitate and restrict such regulatory interactions. Using an allelic series of mouse mutants, we show that CTCF is neither required for the interaction of the Sox2 gene with distal enhancers, nor for its expression. Insertion of various combinations of CTCF motifs, between Sox2 and its distal enhancers, generated boundaries with varying degrees of insulation that directly correlated with reduced transcriptional output. However, in both epiblast and neural tissues, enhancer contacts and transcriptional induction could not be fully abolished, and insertions failed to disrupt implantation and neurogenesis. In contrast, Sox2 expression was undetectable in the anterior foregut of mutants carrying the strongest boundaries, and these animals fully phenocopied loss of SOX2 in this tissue. We propose that enhancer clusters with a high density of regulatory activity can better overcome physical barriers to maintain faithful gene expression and phenotypic robustness. Genetic manipulation of the Sox2 locus in mice shows that gene activation by distal enhancers does not require CTCF-mediated loops and can occur across ectopic CTCF-mediated boundaries. The ability to bypass CTCF boundaries varies with their insulation strength and the tissue-specific enhancers responsible for activation.
Attention-deficit hyperactivity disorder (ADHD) is a prevalent neurodevelopmental disorder with a major genetic component. Here, we present a genome-wide association study meta-analysis of ADHD comprising 38,691 individuals with ADHD and 186,843 controls. We identified 27 genome-wide significant loci, highlighting 76 potential risk genes enriched among genes expressed particularly in early brain development. Overall, ADHD genetic risk was associated with several brain-specific neuronal subtypes and midbrain dopaminergic neurons. In exome-sequencing data from 17,896 individuals, we identified an increased load of rare protein-truncating variants in ADHD for a set of risk genes enriched with probable causal common variants, potentially implicating SORCS3 in ADHD by both common and rare variants. Bivariate Gaussian mixture modeling estimated that 84–98% of ADHD-influencing variants are shared with other psychiatric disorders. In addition, common-variant ADHD risk was associated with impaired complex cognition such as verbal reasoning and a range of executive functions, including attention. Genome-wide analyses identify 27 loci associated with attention-deficit hyperactivity disorder and provide insights into its genetic architecture in relation to other psychiatric disorders and cognitive traits.
Schematic description of the TESLA method
TESLA uses meta-regression to model phenotypic effect estimates as functions of the PCs of genome-wide allele frequencies from each cohort. For a given gene expression prediction model generated from an eQTL dataset, we use TESLA to more accurately estimate phenotypic effects, then use them to perform TWASs and attain optimal power. We also performed fine mapping and enrichment analysis using the TESLA results (which we call eTESLA).
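The general idea of meta-regression on ancestry PCs can be sketched in Python as follows. This is not the TESLA implementation: it is a simplified inverse-variance-weighted regression of per-cohort effect estimates on a single covariate (one PC), and all names and numbers are hypothetical.

```python
def weighted_meta_regression(betas, ses, pcs):
    """Fit beta_k = intercept + slope * pc_k with weights 1 / se_k**2.

    betas: per-cohort effect estimates
    ses:   their standard errors
    pcs:   one allele-frequency PC value per cohort
    Returns (intercept, slope).
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    # Weighted means of the covariate and the effect estimates.
    x_bar = sum(w * x for w, x in zip(weights, pcs)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, betas)) / w_sum
    # Weighted sums of squares and cross-products.
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, pcs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, pcs, betas))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Three hypothetical cohorts whose effects lie exactly on 0.1 + 0.5 * PC.
intercept, slope = weighted_meta_regression(
    betas=[-0.4, 0.1, 1.1], ses=[0.1, 0.2, 0.1], pcs=[-1.0, 0.0, 2.0]
)
```

A nonzero slope in such a fit indicates that the phenotypic effect varies with ancestry, which is the kind of heterogeneity the caption says the method accommodates.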
Manhattan plot for multi-tissue TESLA results using GTEx for CigDay phenotype
For each chromosome, we labeled the fine-mapped genes with posterior inclusion probability (PIP) > 0.9 (with P < 2.5 × 10⁻⁶). If more than ten genes were significant for a chromosome, only the ten genes with the largest PIP values were labeled. The Manhattan plots for other traits can be found in Supplementary Fig. 2; the smoking initiation trait has a large number of fine-mapped signals, so the same limit of ten genes per chromosome applies. All P values are two sided.
Key addiction-related pathways are ubiquitously enriched with TESLA hits in multiple brain tissues
We displayed TESLA enrichment P values (two sided) across 13 GTEx brain tissues using radar plots. a,b, The enrichment of TESLA hits for cigarettes per day for the dopaminergic synapse pathways (a) and the behavioral response to nicotine pathways (b). Gridlines in the radar plots indicate different levels of statistical significance. Each spoke represents a brain tissue and the length of the spoke represents the −log10(P) of enrichment. Brain tissues with significant enrichment P values after multiple testing corrections are shown in red. CC, cellular component; BP, biological process.
Different brain tissues are enriched with distinct pathways
We used REVIGO to reduce redundant GO terms and facilitate the visualization of enrichment results. We highlighted three brain regions (that is, cortex, substantia nigra and cerebellum) with distinct patterns of enrichment. For brain cortex, one GO term (relaxation of smooth muscle) accounts for 98.3% of the pathways enriched with TWAS hits, whereas, for substantia nigra and cerebellum, a diverse set of GO terms was enriched with TWAS hits. The brain figures are generated by R package ggseg⁴³.
TESLA identified substantially more loci and new loci than FE-TWASs, RE-TWASs and EURO-TWASs using GTEx data and PrediXcan weights
Most transcriptome-wide association studies (TWASs) so far focus on European ancestry and lack diversity. To overcome this limitation, we aggregated genome-wide association study (GWAS) summary statistics, whole-genome sequences and expression quantitative trait locus (eQTL) data from diverse ancestries. We developed a new approach, TESLA (multi-ancestry integrative study using an optimal linear combination of association statistics), to integrate an eQTL dataset with a multi-ancestry GWAS. By exploiting shared phenotypic effects between ancestries and accommodating potential effect heterogeneities, TESLA improves power over other TWAS methods. When applied to tobacco use phenotypes, TESLA identified 273 new genes, up to 55% more compared with alternative TWAS methods. These hits and subsequent fine mapping using TESLA point to target genes with biological relevance. In silico drug-repurposing analyses highlight several drugs with known efficacy, including dextromethorphan and galantamine, and new drugs such as muscle relaxants that may be repurposed for treating nicotine addiction. A multi-ancestry transcriptome-wide association study using an optimal linear combination of association statistics provides insights into tobacco use biology and suggests opportunities for drug repurposing.
Burdens and mutational signatures in normal human small intestinal crypts
a, SBS burden versus age, showing regression lines for the three different sectors of the small intestine. Regression lines were estimated using linear mixed-effects models. Error bands represent 95% confidence interval for the fixed effect of age. Colors indicate biopsy regions, with orange, green and blue representing duodenum, ileum and jejunum, respectively. Shapes indicate whether the donor has a celiac history or not. Crosses indicate donors with a celiac history, and dots indicate donors without a celiac history. b, ID burden versus age, showing regression lines for the three different sectors of the small intestine. c, The proportion of mutations in each crypt attributed to each SBS mutational signature (arranged by ascending age). Signatures are color coded as indicated on the right.
APOBEC mutagenesis on phylogenetic trees
Phylogenies of small intestine crypts with mutational signature annotation. Branch lengths correspond to SBS burdens. Signature exposures are color coded below the trees. a, Phylogeny of PD41851, an individual with frequent APOBEC mutagenesis exhibiting SBS1, SBS5 and SBS18 with SBS2/SBS13 in 8 of 11 crypts. b, Phylogeny of PD41853, exhibiting SBS1, SBS5 and SBS18 with SBS2/SBS13 in 1 of 11 crypts. c, Phylogeny of PD43953, a 4-year-old child exhibiting SBS1, SBS5, SBS18 and SBS2/SBS13. d, Phylogeny of PD41852 with APOBEC mutagenesis detected before a crypt fission event at ~25 years of age. e, Phylogeny of PD43401 with APOBEC mutagenesis detected after a crypt fission event at ~30 years of age. f, Phylogeny of PD43403 with APOBEC mutagenesis detected after a crypt fission event at ~30 years of age.
Spatial distribution of APOBEC-positive crypts
APOBEC-positive crypts and their surrounding crypts before microdissection, with their SBS mutational spectra. Signature exposures are color coded at the top left. Red dots, APOBEC-positive crypts. Blue dots, APOBEC-negative crypts that have been sequenced. Gray rectangles on the mutational spectra mark characteristic peaks of SBS2/SBS13. a, PD43401, an individual with one APOBEC-positive crypt whose neighboring crypts are all negative. b, PD52487, with an APOBEC-negative crypt (PD52487b_lo0005) between APOBEC-positive crypts. c, PD45778, with an APOBEC-negative crypt (PD45778b_lo0002) between APOBEC-positive crypts.
Local mutation clusters (kataegis) in the normal small intestine
a,b, Two examples of crypts with kataegis. Types of SBSs are indicated by the color code below. Red arrowheads show the location of kataegis. a, Rainfall plot of a crypt from PD28690 showing a mutation cluster on chromosome 16. b, Rainfall plot of a crypt from PD28690 showing mutation clusters on chromosomes 7, 14 and 22.
APOBEC1/APOBEC3A/APOBEC3B expression across normal tissues
Bulk tissue APOBEC gene expression from the HPA project consensus dataset³⁸. Image credit: HPA. a, APOBEC1 bulk tissue gene expression. Duodenum nTPM = 29, small intestine nTPM = 26 and colon nTPM = 1.1. b, APOBEC3A bulk tissue gene expression. Duodenum nTPM = 1.4, small intestine nTPM = 0.8 and colon nTPM = 1.5. c, APOBEC3B bulk tissue gene expression. Duodenum nTPM = 7.3, small intestine nTPM = 4.3 and colon nTPM = 12.1.
APOBEC mutational signatures SBS2 and SBS13 are common in many human cancer types. However, the stimulus for this mutagenesis, when it occurs in the progression from normal to cancer cell and the APOBEC enzymes responsible are incompletely understood. Here we whole-genome sequenced 342 microdissected normal epithelial crypts from the small intestines of 39 individuals and found that SBS2/SBS13 mutations were present in 17% of crypts, more frequent than in most other normal tissues. Crypts with SBS2/SBS13 often had immediate crypt neighbors without SBS2/SBS13, suggesting that the underlying cause of SBS2/SBS13 is cell-intrinsic. APOBEC mutagenesis occurred in an episodic manner throughout the human lifespan, including in young children. APOBEC1 mRNA levels were very high in the small intestine epithelium, but low in the large intestine epithelium and other tissues. The results suggest that the high levels of SBS2/SBS13 in the small intestine are collateral damage from APOBEC1 fulfilling its physiological function of editing APOB mRNA. Whole-genome sequencing of healthy human epithelial crypts from the small intestines of 39 individuals highlights APOBEC enzymes as a common contributor to the overall mutational burden in this tissue.
Precision medicine promises to transform healthcare for groups and individuals through early disease detection, refining diagnoses and tailoring treatments. Analysis of large-scale genomic–phenotypic databases is a critical enabler of precision medicine. Although Asia is home to 60% of the world’s population, many Asian ancestries are under-represented in existing databases, leading to missed opportunities for new discoveries, particularly for diseases most relevant for these populations. The Singapore National Precision Medicine initiative is a whole-of-government 10-year initiative aiming to generate precision medicine data of up to one million individuals, integrating genomic, lifestyle, health, social and environmental data. Beyond technologies, routine adoption of precision medicine in clinical practice requires social, ethical, legal and regulatory barriers to be addressed. Identifying driver use cases in which precision medicine results in standardized changes to clinical workflows or improvements in population health, coupled with health economic analysis to demonstrate value-based healthcare, is a vital prerequisite for responsible health system adoption. This Perspective article discusses Singapore’s efforts to implement a National Precision Medicine Strategy through the integration of genomic, clinical and lifestyle data of up to one million Singaporean individuals.
Effect size distribution and pleiotropic associations
a, Relationship between VTE risk alleles and risk allele frequency. OR values for VTE of the 93 lead variants in the VTE GWAS meta-analysis (81,190 cases and 1,419,671 controls) plotted against risk allele frequency. Variants with OR > 1.3 are annotated. The color of each point indicates whether the locus is known (gray) or novel (red). Note that variants SERPINC1 and P2RX3 are represented by a single point, due to similar risk allele frequency and effect size. b, Effect sizes for DVT (x axis) and PE (y axis) for the 93 lead VTE variants. Points represent effect estimates (log(OR), measure of center) and error bars represent 95% CI, estimated using GWAS meta-analysis data from deCODE, UKB, FinnGen and the CHB-CVDC and DBDS. Summary data from the four cohorts were combined using inverse-variance-weighted fixed-effects meta-analysis. c, Heatmap showing nominally significant associations (P < 0.05) between lead variants at the 93 VTE risk loci and 24 selected blood traits. Blood cell traits (n ≤ 870,000) were derived from GWAS summary statistics while data on coagulation factors (n ~36,000) were derived from proteomics data. Bonferroni significant associations, after correcting for testing of 24 traits and 93 lead variants (P < 2.2 × 10⁻⁵), are marked by an asterisk (*). P values (two-sided) were derived from linear regression models. Coloring represents effect estimate (beta) for each respective trait, all oriented according to the VTE risk allele. Red and blue indicate an increase or decrease, respectively, in the trait. APTT, activated partial thromboplastin time; PAI-1, plasminogen activator inhibitor-1.
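The inverse-variance-weighted fixed-effects pooling and the Bonferroni threshold mentioned in panels b and c can be sketched in Python as follows. The function names and the input values are hypothetical; only the formulas are standard.

```python
import math


def ivw_fixed_effects(log_ors, ses):
    """Pool per-cohort log(OR) estimates with weights 1 / se**2.

    Returns (pooled log(OR), pooled standard error).
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / w_sum
    pooled_se = math.sqrt(1.0 / w_sum)
    return pooled, pooled_se


def ci95(estimate, se):
    """Approximate 95% confidence interval under a normal approximation."""
    return estimate - 1.96 * se, estimate + 1.96 * se


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


pooled, pooled_se = ivw_fixed_effects([0.2, 0.4], [0.1, 0.1])
threshold = bonferroni_threshold(0.05, 24 * 93)  # 24 traits × 93 lead variants
```

With 24 traits and 93 lead variants the threshold is 0.05 / 2,232, which is consistent with the P < 2.2 × 10⁻⁵ quoted in the caption.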
Gene prioritization
Displayed are genome-wide significant loci mapped to candidate genes based on at least two lines of evidence, namely: PoPS, effect on gene expression (eQTL), plasma protein levels (pQTL) and predicted effects on protein function (coding). The first column shows whether the gene was associated with PoPS z > 1; the second column shows whether there was evidence for colocalization (posterior probability > 0.75) between lead variant and eQTL variant; the third column shows whether the lead variant or variant in high LD (r² > 0.8) was associated with plasma protein levels; the fourth column shows whether the lead variant or variant in high LD (r² > 0.8) was protein coding.
Phenome-wide associations between PRSVTE and selected cardiometabolic traits, autoimmune disorders, malignancies and biomarkers in the UKB
a, Association between PRSVTE and binary traits. b, Association between PRSVTE and quantitative traits. Data are presented as beta (points, measure of center) and 95% CI (error bars) per 1 s.d. increase in PRSVTE. Beta estimates, CI and two-sided P values were calculated using either logistic (a) or linear regression (b). Significant associations are highlighted in red (P < 0.001 (0.05 / 49 traits)). Point size corresponds to the level of significance. Abbreviations, effect estimates, P values and sample sizes are shown in Supplementary Table 17. COPD, chronic obstructive pulmonary disease; AAA, abdominal aortic aneurysm; ALAT, alanine aminotransferase.
PRSVTE and risk of VTE in the UKB
a, Phenotypic variance (R²) explained by four different PRSs: 5-SNP PRS by de Haan et al.³⁸, 297-SNP PRS by Klarin et al.¹², 93-SNP PRS (this study) and ~1.1 million-SNP PRS (this study). b, OR of VTE per 1 s.d. increase in PRS according to the four PRSs. Points refer to OR (measure of center) and error bars represent 95% CI. The model included 23,723 cases of VTE (prevalent and incident) and 412,717 controls. Logistic regression models were used, adjusted for age, sex and four PCs. c, Risk of VTE according to polygenic and monogenic carrier status. OR, 95% CIs and two-sided P values were obtained using logistic regression models adjusted for age, sex and first four PCs. For PRS analyses, each percentile was compared with the rest of the population. For monogenic risk assessment, noncarriers were set as the reference group. The second column shows the number of carriers and noncarriers for each comparison, while the third column shows the number of VTE events among carriers and noncarriers. Black squares indicate adjusted OR and horizontal lines represent 95% CI. d, Plot showing predictive performance of PRS in relation to known demographic and clinical risk factors. Points represent AUC (measure of center), the triangle the benchmark model (age, sex and PCs) and error bars 95% CI.
Ten-year risk of VTE according to PRS and clinical and genetic risk factors
a, Ten-year cumulative risk curves showing interplay between PRS and F2 (rs1799963) carrier status in individuals without previous VTE. b, Ten-year cumulative risk curves showing interplay between PRS and F5 (rs6025) carrier status in individuals without previous VTE. a,b, Population average was defined as individuals with a PRS between the 10th and 90th percentile. c, Ten-year risk of incident VTE according to combinations of age group, sex (male/female), obesity (yes/no), regular exercise (yes/no), smoking (yes/no) and PRS.
We report a genome-wide association study of venous thromboembolism (VTE) incorporating 81,190 cases and 1,419,671 controls sampled from six cohorts. We identify 93 risk loci, of which 62 are previously unreported. Many of the identified risk loci are at genes encoding proteins with functions converging on the coagulation cascade or platelet function. A VTE polygenic risk score (PRS) enabled effective identification of both high- and low-risk individuals. Individuals within the top 0.1% of PRS distribution had a VTE risk similar to homozygous or compound heterozygous carriers of the variants G20210A (c.*97G>A) in F2 and p.R534Q in F5. We also document that F2 and F5 mutation carriers in the bottom 10% of the PRS distribution had a risk similar to that of the general population. We further show that PRS improved individual risk prediction beyond that of genetic and clinical risk factors. We investigated the extent to which venous and arterial thrombosis share clinical risk factors using Mendelian randomization, finding that some risk factors for arterial thrombosis were directionally concordant with VTE risk (for example, body mass index and smoking) whereas others were discordant (for example, systolic blood pressure and triglyceride levels).
Top-cited authors
Mark Daly
  • Massachusetts General Hospital
Goncalo Abecasis
Christian Gieger
  • Helmholtz Zentrum München Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Tim Spector
  • King's College London
Dorret I Boomsma
  • Vrije Universiteit Amsterdam