Hilary K. Finucane’s research while affiliated with Massachusetts General Hospital and other places


Publications (212)


Variant scoring performance across selection regimes depends on variant-to-gene and gene-to-disease components
  • Preprint

September 2024 · 9 Reads

Hilary K Finucane · Sophie Parsa · Jeremy Guez · [...]

Variant scoring methods (VSMs) aid in the interpretation of coding mutations and their potential impact on health, but their evaluation in the context of human genetics applications remains inconsistent. Here, we describe GeneticsGym, a systematic approach to evaluating the real-world impact of VSMs on human genetic analysis across selection regimes. We show that the relative performance of VSMs varies across the spectrum of variant impact, as well as by gene function, and that both variant-to-gene and gene-to-disease components contribute.


Widespread naturally variable human exons aid genetic interpretation

September 2024 · 6 Reads

Most mammalian genes undergo alternative splicing. The splicing of some exons has been acquired or lost in specific mammalian lineages, but differences in splicing within the human population are poorly characterized. Using GTEx tissue transcriptomes from 838 individuals, we identified 56,415 exons which are included in mRNAs in some individuals but entirely excluded from others, which we term 'naturally variable exons' (NVEs). NVEs impact three quarters of protein-coding genes, occur at all population frequencies, and are often absent from reference annotations. NVEs are more abundant in genes depleted of genetic loss-of-function mutations and aid in the interpretation of causal genetic variants. Genetic variants modulate the splicing of many NVEs, and 5'UTR and coding-region NVEs are often associated with increased and decreased gene expression, respectively. Together, our findings characterize abundant splicing variation in the human population, with implications for a range of human genetic analyses.


Functional dissection of complex and molecular trait variants at single nucleotide resolution
  • Preprint
  • File available

May 2024 · 85 Reads · 8 Citations

Identifying the causal variants and mechanisms that drive complex traits and diseases remains a core problem in human genetics. The majority of these variants have individually weak effects and lie in non-coding gene-regulatory elements where we lack a complete understanding of how single nucleotide alterations modulate transcriptional processes to affect human phenotypes. To address this, we measured the activity of 221,412 trait-associated variants that had been statistically fine-mapped using a Massively Parallel Reporter Assay (MPRA) in 5 diverse cell-types. We show that MPRA is able to discriminate between likely causal variants and controls, identifying 12,025 regulatory variants with high precision. Although the effects of these variants largely agree with orthogonal measures of function, only 69% can plausibly be explained by the disruption of a known transcription factor (TF) binding motif. We dissect the mechanisms of 136 variants using saturation mutagenesis and assign impacted TFs for 91% of variants without a clear canonical mechanism. Finally, we provide evidence that epistasis is prevalent for variants in close proximity and identify multiple functional variants on the same haplotype at a small, but important, subset of trait-associated loci. Overall, our study provides a systematic functional characterization of likely causal common variants underlying complex and molecular human traits, enabling new insights into the regulatory grammar underlying disease risk.


Pan-UK Biobank GWAS improves discovery, analysis of genetic architecture, and resolution into ancestry-enriched effects

March 2024 · 233 Reads · 56 Citations

Large biobanks, such as the UK Biobank (UKB), enable massive phenome by genome-wide association studies that elucidate genetic etiology of complex traits. However, individuals from diverse genetic ancestry groups are often excluded from association analyses due to concerns about population structure introducing false positive associations. Here, we generate mixed model associations and meta-analyses across genetic ancestry groups, inclusive of a larger fraction of the UKB than previous efforts, to produce freely-available summary statistics for 7,271 traits. We build a quality control and analysis framework informed by genetic architecture. Overall, we identify 14,676 significant loci in the meta-analysis that were not found in the European genetic ancestry group alone, including novel associations for example between CAMK2D and triglycerides. We also highlight associations from ancestry-enriched variation, including a known pleiotropic missense variant in G6PD associated with several biomarker traits. We release these results publicly alongside FAQs that describe caveats for interpretation of results, enhancing available resources for interpretation of risk variants across diverse populations.


Establishing the TeloHAEC CRISPRi model and Perturb-seq details
a, Enrichment of CAD heritability in TeloHAEC enhancers, from Stratified Linkage Disequilibrium Score Regression analysis (S-LDSC, see Methods), where enrichment is the percentage of heritability explained by variants in enhancers (%heritability), divided by the percentage of variants in enhancers (%SNPs). Enhancers in TeloHAEC (treated under the indicated conditions) were identified from ATAC-seq and H3K27ac ChIP-seq data (n = 6 for control ATAC, 3 for IL-1β, TNFα or VEGF ATAC, 4 for control ChIP, and 2 for IL-1β, TNFα or VEGF ChIP) by the Activity-by-Contact model. Error bars: standard error around the enrichment estimate, calculated by S-LDSC using jackknife (which resamples the data used for calculating heritability enrichment). P-values were calculated using the S-LDSC method²⁸, and FDR by the Benjamini-Hochberg method. *: FDR < 0.05, with specific FDR values of: Ctrl; 0.037, IL-1β; 0.015, TNFα; 0.020 and VEGF; 0.041. Full S-LDSC results can be found in Supplementary Table 27. b, Scatter density plot of human right coronary artery endothelial cell single cell RNA-seq pseudobulk gene expression (from⁶⁹) versus teloHAEC pseudobulk gene expression, for genes perturbed in this study. 2,107 of the 2,285 perturbed genes are expressed at TPM > 1 in healthy or diseased RCAECs. R and p-values from two sided Pearson correlation test. c, As in b, but for the 41 V2G2P genes. d, Heatmap of gene expression (log10 TPM) of the 41 V2G2P genes in diseased RCAECs and in teloHAEC. All but one gene, FBN2, is expressed at > 1 TPM in RCAECs. e, FACS showing dox inducibility of KRAB-dCas9-IRES-BFP in TeloHAEC, after sorting but before the screen. Left panels: gating for viable individual cells. Right panels: counts of gated cells by fluorescence intensity in the BFP/PB450 channel. f, BFP channel counts of cells grown in parallel and concurrently with cells for the Perturb-seq screen. 
After expansion to 120 M cells, transduction, selection and 5-day doxycycline treatment, 92% of cells remain BFP positive. g, Cumulative distribution fraction for duplication levels of unique CBC-UMI-Guide combinations in deeply-sequenced dialout libraries (“unique UMIs”, red) or all guide reads (blue) versus duplication level. Requiring four duplicates (dotted line) eliminates 90% of CBC-UMI-guide combinations (likely PCR chimaeras), while retaining >85% of total guide reads. h, UMIs for top guide per CBC. Arrow: the chosen 4 UMI threshold. i, Counts of singlets (1 gRNA, black bar), doublets (2) and higher multimers, as well as cells with no guide called (0), at the chosen thresholds of 4 UMIs for the top guide and 4 or more fold fewer for the next most frequent guide. j, Histogram of counts of singlet cells per target. Dotted line: average. k, As in j, but for singlet cells per guide. l, Read UMI counts for all transcripts per cell by singlet/multiplet status. The median UMI count for doublets was 37% more than singlets. Assuming that droplets with two cells will have double the number of reads as singlets, this suggests 37% of doublets are due to two cells (9.3% of cells with guides) while the remainder (15.7% of cells with guides) are due to two guides in one cell, very close to the expectation from the infection MOI of 15%. n = 352686, 214449, 79744, 19195 and 5345 cells with 0, 1, 2, 3, or 4 guides, respectively. Boxplot centre line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. m, Distribution of knockdown efficiency across target genes (log2 expression in cells containing guideRNAs targeting the gene versus in cells containing negative control guideRNAs). Grey line: all targeted genes. Yellow and red lines: genes expressed at >30 and >300 TPM, respectively. Red dotted vertical line: 40% knockdown (average for 300 + TPM target genes). 
n, Distribution of fitness effects across all guideRNAs (log2 ratio of guide frequency in singlet cells from the Perturb-seq experiment after 5 days of CRISPRi induction compared to guide frequency in the original guideRNA library). Guides targeting common essential genes (red) were depleted more frequently than guideRNAs targeting other genes. o, Relative count frequencies for the number of nominally significant differentially expressed (DE) genes per perturbed target (log2 of genes with raw p < 0.01, and fold change >1.15 from EdgeR DE analysis), for the indicated subclasses of targets. p, Volcano plot showing log2 (# DE genes for target)/(avg. # DE genes for non-expressed controls) versus -log10 FDR (capped at 100). Significance was assessed by two sided binomial test versus DE gene counts for the 48 perturbed non-expressed control genes. Right: symbols for target genes with the strongest effects. 10.7% of all targets had a significant effect on transcription (FDR < .05 increased DE gene count), including 31.9% of common essential genes, and 9.0% of other genes. q, Percent of perturbations that have a significant transcriptional effect in Perturb-seq, as defined by either (i) “DE Genes”, as per p, or (ii) “DE Programs”: perturbations that lead to significant changes in program expression by MAST package⁵⁰ with 10X lane correction (FDR < 0.05), by each indicated class: Permuted Controls (statistical tests performed on randomly drawn cells with negative control or safe-targeting guides), Expressed (>=1 TPM in TeloHAEC bulk RNA-seq), Low or No Expression (<1 TPM), Common essential (as identified in DepMap¹¹⁹), TeloHAEC Proliferation (showing fitness effects, as per n, of +/−15%, FDR < 0.05), Gene near CAD GWAS signals (expressed genes nearby any CAD GWAS signal, see Methods: Defining variants in CAD GWAS signals), Gene near IBD signals (perturbed expressed genes nearby 10 selected IBD GWAS signals, with no genes overlapping those for CAD signals).
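The doublet arithmetic in panel l of the caption above can be retraced with a short sketch. It assumes, as the caption does, that a droplet containing two cells yields twice the UMIs of a singlet, so that a fraction f of two-cell droplets among doublets raises the expected UMI count by a factor of 1 + f. The caption reports a median rather than a mean, so this is only an approximation. The function and variable names are illustrative and do not come from the paper.

```python
# Cell counts by number of guides called, copied from the caption (panel l).
cells_by_guides = {0: 352686, 1: 214449, 2: 79744, 3: 19195, 4: 5345}

# Reported excess of UMIs in doublets relative to singlets (37%).
umi_excess = 0.37


def split_doublets(counts, excess):
    """Split doublets into two-cell droplets and two-guide single cells.

    Assumption: a two-cell droplet has 2x the UMIs of a singlet, so the
    fraction of doublets that are two-cell droplets equals the UMI excess.
    """
    # Denominator: all cells with at least one guide called.
    with_guides = sum(n for k, n in counts.items() if k >= 1)
    doublet_frac = counts[2] / with_guides
    two_cells = excess * doublet_frac            # two cells in one droplet
    two_guides = (1.0 - excess) * doublet_frac   # two guides in one cell
    return doublet_frac, two_cells, two_guides


doublet_frac, two_cells, two_guides = split_doublets(cells_by_guides, umi_excess)
print(f"doublets among cells with guides: {doublet_frac:.1%}")
print(f"two cells in one droplet:         {two_cells:.1%}")
print(f"two guides in one cell:           {two_guides:.1%}")
```

Under these assumptions the two components come out close to the 9.3% and 15.7% quoted in the caption, up to rounding.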
QC metrics for single cells, and selection of number of components for cNMF
a, UMAPs showing number of UMIs per cell (left), percent ribosomal genes detected per cell (middle), percent mitochondrial genes detected per cell (right). b, UMAPs showing cells from each of the twenty 10X lanes. The differences in clustering along the UMAP_2 axis indicate a technical batch effect between 10X lanes. c, Cumulative distribution function (CDF) plot of the maximum absolute value of Pearson correlation between cNMF component expression in cells and batch. Dotted line: the R ≥ 0.15 threshold used to call programs associated with batch. d, Gene set enrichment analysis for GO terms among co-regulated genes, as a function of the number of components in the cNMF model (K). y-axis: the number of unique GO terms enriched across all programs for a given K. e, Number of unique motifs enriched among the promoters (top) or enhancers (bottom) of co-regulated genes across all components, as a function of K. f, Number of unique perturbations that have a significant effect (FDR < 0.05) on one or more programs, as a function of K. g, Model-based evaluation of the choice of K. Stability of the components over 100 NMF runs (top) and element-wise square of error (bottom, see Methods). h, Quantile-quantile plot for effects of perturbations on program expression. X-axis: expected uniform distribution. Y-axis: –log10 p-value computed from MAST package⁵⁰. Red: p-value < 0.05.
Catalogue of gene programs
a, Correlation heatmap of cNMF components. Colour: Pearson’s correlation of log2 fold-change in component expression across all perturbed genes. b, 50 programs ordered by variance explained (see Methods). c, 50 programs ordered by endothelial-cell specificity score — that is, the degree to which the co-regulated genes in the program are specifically expressed in endothelial cells versus in other cell types from FANTOM5 CAGE data (see Methods). Red line: z-score corresponding to top 10% of genes most specifically expressed in endothelial cells. d, Effects of selected regulators on the 13 endothelial-cell-specific programs. Heatmap: log2 fold-change in component expression in perturbation versus control. Top: 16 regulators shared between multiple endothelial cell-specific programs. Bottom: the 4 significant regulators (experiment-wide FDR < 0.05) per program with the most specific effects on that program relative to other endothelial-cell-specific programs. e, Programs ordered by number of enriched transcription factor motifs (see Methods). Grey: promoters. Blue: enhancers. Some programs only have enrichment for motifs in promoters. Some programs showed enrichment of distinct motifs in enhancers versus promoters, such as Program 47 (Angiogenesis, GATA2), with promoter enrichment in WT1 and EGR2 motifs, and enhancer enrichment in GATA2 and PRDM6 motifs. 
Among the programs with few or no enriched transcription factor motifs, we identified other likely proximal regulatory mechanisms: Program 17 expressed genes whose promoters were marked by H3K27me3 in endothelial cells (see also panel k), and the most significant regulator of this program was SUZ12, a component of the complex (PRC2) that writes this histone modification; and Program 16 pointed to a potential RNA surveillance program, since 40% of its program genes were noncoding RNAs (panel l), and its regulators included a component of the RNA exosome (EXOSC5) and the chromatin remodeller INO80E, which has previously been shown to regulate a subset of noncoding transcripts in yeast¹²⁰ (see also Supplementary Table 12). f, Annotations for an example program: 15. Left: top 10 program co-regulated genes. Middle, top: motifs enriched in promoters of the 300 program co-regulated genes. Middle, bottom: Gene Ontology terms enriched in the 300 program co-regulated genes. Right: volcano plot of the effects of regulators on cNMF component 15 genes. Program 15 (Flow response, KLF2) appeared to correspond to a canonical endothelial cell response to laminar shear stress defined by the known flow-responsive transcription factor KLF2: the program was highly enriched for KLF motifs in promoters; included known flow-responsive genes such as KRT18/19, NOS3, and KLF2 itself; and was significantly reduced by perturbations to MAP2K5 (MEK5), a kinase known to activate the signaling pathway upstream of KLF2³⁵,¹²¹. g, Log2 fold change in expression of programs 28 versus 47 for each perturbed gene relative to controls. Program 28 (Tip cell, migration) includes co-regulated genes that mark tip cell specification during sprouting angiogenesis (ESM1, RHOC, PLAUR), and Program 47 (Angiogenesis, GATA) includes co-regulated genes that are enriched in GATA2 & TAL1 motifs and that include NRP2, a co-receptor for VEGF-A, previously shown to act downstream of GATA2¹²². 
Blue, red, and purple mark genes that are regulators of Program 28, Program 47, or both programs, respectively. Note that regulators that affect both programs do so in opposite directions. h, Perturbations ordered by the number of regulated programs. Red: top 10 perturbed genes. i, Programs ordered by the number of regulators. Blue: endothelial-cell-specific programs. The top 3 programs, by number of regulators, are labelled. j, 131 perturbed genes that are regulators of at least one endothelial-cell-specific program, ordered by the number of such programs that they regulate. Top 10 regulators are labelled, and include genes known to have important functions in ECs such as EGFL7 and ITGB1BP1/ICAP1²⁶,²⁷. k, Average H3K27me3 ChIP-seq signal in co-regulated gene promoters. The top program is Program 17 (Polycomb targets). See legend to e for more details. N = 50 programs. Boxplot centre line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, all data points. l, Percent of noncoding RNA genes in program co-regulated genes. The top program is Program 16 (ncRNA & antisense RNAs). See legend to e for more details. N = 50 programs. Boxplot centre line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, all data points. m, Same as Fig. 1d, but with transcription factor motif sequence logo. The enriched motifs have distinct sequence logos.
Annotations for CAD-associated programs: 8, 35, 39, 47, 48
Left panels, top 10 program co-regulated genes. Program Specificity z-scores are the cNMF marker gene coefficients, indicating how specific this gene is to this program, relative to other programs (see Methods). Middle left panels, top: top 5 motifs enriched in the promoters or enhancers of the program co-regulated genes; bottom: top 5 GO terms enriched in program co-regulated genes. Middle right panels, regulators of the program. Volcano plot shows effects of all perturbed genes on program expression. Red: FDR < 0.05. Labelled: top two significant regulators in each direction, plus CCM2 and TLNRD1. Right panels, UMAP of program expression in a subset of cells (24,000, randomly selected).
Prioritization of CAD-associated programs and candidate CAD genes
a, Using MAGMA to prioritize gene programs enriched for CAD heritability (linking variants to program genes and 50 kb of flanking sequence, see Methods). Barplots show beta regression coefficient (left) and –log10 FDR (Benjamini-Hochberg adjusted enrichment p-value, right). Programs are ordered separately by beta or FDR value. Dotted line: FDR = 0.05. b, Using S-LDSC to prioritize gene programs enriched for CAD heritability (linking variants in endothelial cell chromatin accessible regions to genes within 50 Kb, see Methods). Barplots show enrichment (left) and –log10 FDR (Benjamini-Hochberg adjusted enrichment p-value, right). N = 300 (co-regulated program genes ranked by z-score coefficient, for each program). Error bars: standard error around the enrichment estimate, calculated by S-LDSC using jackknife (which resamples the data used for calculating heritability enrichment). P-values were calculated using the S-LDSC method²⁸, and FDR by the Benjamini-Hochberg method. *: FDR < 0.05. Dotted lines: 1 fold enrichment (left), or FDR 0.05 (right). c, CAD-associated V2G2P genes are ranked highly by an independent gene prioritization method, the Polygenic priority score (PoPS). For each of the 43 CAD GWAS signals including a CAD-associated V2G2P gene, we ranked nearby genes based on their PoPS scores. Red: 39 CAD-associated V2G2P genes (two genes, EXOC3L2 and PECAM1, were not assigned scores by PoPS). Grey: all other nearby genes. p-value: two-sided Mann-Whitney U-test. d, Contingency table of PoPS and distance-to-TSS ranks for the 39 CAD-associated V2G2P genes. (two CAD-associated V2G2P genes were not assigned scores by PoPS). 
e, Odds ratios of variants in lipid-associated (N = 1,181) or non-lipid-associated (N = 3,313) CAD GWAS signals in (i) ATAC peaks in endothelial cells (N = 373,630 unique non-overlapping non-promoter features from 11 epigenomic datasets in ECs, see Methods), (ii) ABC enhancers in endothelial cells (N = 47,112 unique non-overlapping non-promoter features from 11 epigenomic datasets in ECs), (iii) coding sequences (N = 189,232 unique non-overlapping non-promoter features), or (iv) all three categories combined (N = 519,046 unique non-overlapping non-promoter features), compared to background variants (all SNPs from 1000 Genomes, excluding lipid-associated or non-lipid-associated CAD GWAS variants, N = 9,955,208 or N = 9,953,076, respectively; see Methods). Odds ratios were calculated as ((CAD variants within the indicated genomic features)/(all background variants within these features))/((CAD variants outside of these features)/(all background variants outside of these features)), and significance assessed by application of a two-sided Fisher’s exact test to the contingency table of these data, with columns = CAD variants versus background variants and rows = inside features versus outside features. Error bars: 95% confidence interval. *: FDR < 0.05. Specific FDR values, from top to bottom, were 1.1e-4, 3.3e-33, 1.5e-8, 3.2e-6, 0.39, 6.0e-32, 0.011, 7.5e-31. Dotted line: odds ratio of 1. f, sc-linker prioritization for 60 EC Perturb-seq gene programs, ranked by z-score. The ranking of programs was similar to V2G2P analysis, but none of the programs reached significance. g, Precision/Recall (PR) plot for V2G2P and seven prior approaches to prioritize CAD locus genes. Recall: the fraction of the eight “gold standard” genes (with strong prior evidence for endothelial cell-specific roles in CAD) detected by each method. Precision: [number of “gold standard” genes called]/[number of genes called within these gold standard loci]. Red: V2G2P. Blue: other studies that prioritized CAD GWAS genes in endothelial cells.
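The odds ratio and two-sided Fisher's exact test described in panel e can be sketched in plain Python as follows. The contingency-table layout follows the caption (columns: CAD versus background variants; rows: inside versus outside features). The counts are small hypothetical numbers chosen for illustration, not data from the paper, and the function names are assumptions. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is a common convention but not necessarily the exact implementation the authors used.

```python
from math import comb


def odds_ratio(cad_in, bg_in, cad_out, bg_out):
    """(CAD inside / background inside) / (CAD outside / background outside)."""
    return (cad_in / bg_in) / (cad_out / bg_out)


def fisher_exact_two_sided(cad_in, bg_in, cad_out, bg_out):
    """Two-sided Fisher's exact test on a 2x2 table with fixed margins."""
    row_in = cad_in + bg_in          # variants inside the features
    col_cad = cad_in + cad_out       # all CAD variants
    total = cad_in + bg_in + cad_out + bg_out
    denom = comb(total, row_in)

    def prob(x):
        # Hypergeometric probability of x CAD variants inside the features.
        return comb(col_cad, x) * comb(total - col_cad, row_in - x) / denom

    lowest = max(0, row_in + col_cad - total)
    highest = min(row_in, col_cad)
    p_obs = prob(cad_in)
    # Sum tables at most as probable as the observed one (small tolerance
    # guards against floating-point ties).
    p = sum(q for q in map(prob, range(lowest, highest + 1))
            if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Hypothetical toy counts, NOT taken from the paper.
toy = dict(cad_in=8, bg_in=2, cad_out=1, bg_out=5)
toy_or = odds_ratio(**toy)
toy_p = fisher_exact_two_sided(**toy)
print(f"odds ratio = {toy_or:.1f}, two-sided p = {toy_p:.4f}")
```

With counts in the millions, as in the caption, a library routine would be a more practical choice than exact integer binomials; the sketch is only meant to make the definition concrete.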


Convergence of coronary artery disease genes onto endothelial cell programs

February 2024 · 237 Reads · 43 Citations

Nature

Linking variants from genome-wide association studies (GWAS) to underlying mechanisms of disease remains a challenge¹⁻³. For some diseases, a successful strategy has been to look for cases in which multiple GWAS loci contain genes that act in the same biological pathway¹⁻⁶. However, our knowledge of which genes act in which pathways is incomplete, particularly for cell-type-specific pathways or understudied genes. Here we introduce a method to connect GWAS variants to functions. This method links variants to genes using epigenomics data, links genes to pathways de novo using Perturb-seq and integrates these data to identify convergence of GWAS loci onto pathways. We apply this approach to study the role of endothelial cells in genetic risk for coronary artery disease (CAD), and discover 43 CAD GWAS signals that converge on the cerebral cavernous malformation (CCM) signalling pathway. Two regulators of this pathway, CCM2 and TLNRD1, are each linked to a CAD risk variant, regulate other CAD risk genes and affect atheroprotective processes in endothelial cells. These results suggest a model whereby CAD risk is driven in part by the convergence of causal genes onto a particular transcriptional pathway in endothelial cells. They highlight shared genes between common and rare vascular diseases (CAD and CCM), and identify TLNRD1 as a new, previously uncharacterized member of the CCM signalling pathway. This approach will be widely useful for linking variants to functions for other common polygenic diseases.


Improving fine-mapping by modeling infinitesimal effects

November 2023 · 95 Reads · 23 Citations

Nature Genetics

Fine-mapping aims to identify causal genetic variants for phenotypes. Bayesian fine-mapping algorithms (for example, SuSiE, FINEMAP, ABF and COJO-ABF) are widely used, but assessing posterior probability calibration remains challenging in real data, where model misspecification probably exists, and true causal variants are unknown. We introduce replication failure rate (RFR), a metric to assess fine-mapping consistency by downsampling. SuSiE, FINEMAP and COJO-ABF show high RFR, indicating potential overconfidence in their output. Simulations reveal that nonsparse genetic architecture can lead to miscalibration, while imputation noise, nonuniform distribution of causal variants and quality control filters have minimal impact. Here we present SuSiE-inf and FINEMAP-inf, fine-mapping methods modeling infinitesimal effects alongside fewer larger causal effects. Our methods show improved calibration, RFR and functional enrichment, competitive recall and computational efficiency. Notably, using our methods’ posterior effect sizes substantially increases polygenic risk score accuracy over SuSiE and FINEMAP. Our work improves causal variant identification for complex traits, a fundamental goal of human genetics.


Cell-type-specific Alzheimer’s disease polygenic risk scores (ADPRS)
a A schematic of cell-type-specific PRS derivation. b An UpSetR plot of cell-type-specific gene sets used to define cell-type-specific ADPRS. Each cell-type-specific gene set includes genes within the top 10% of cell-type specificity (n = 1343). Each row of the matrix represents each cell-type-specific gene set, and each column of the matrix represents an intersection of one or more sets. Gene sets in each intersection were indicated by filled black circles connected by a black vertical line. The vertical bar graph on the top shows the number of genes in each intersection. The 15 most frequent intersections were visualized. c, d Correlation matrix among cell-type-specific ADPRS (c ROSMAP; d A4). Pearson’s correlation coefficient was positive for all pairs, and the square of Pearson’s correlation coefficient (R²) between pairs of cell types was visualized. Source data are provided as a Source Data file. Ast astrocytes, Ex excitatory neurons, In inhibitory neurons, Mic microglia, Oli oligodendrocytes, Opc oligodendroglial progenitor cells, snRNA-Seq single nucleus RNA-sequencing.
Association of cell-type-specific AD polygenic risk scores in ROSMAP
a Association of cell-type-specific ADPRSs with the odds of AD with dementia (case: n = 538, control: n = 248). b Association of cell-type-specific ADPRSs with overall Aβ burden. c Association of cell-type-specific ADPRSs with overall diffuse plaque (DP) burden. d Association of cell-type-specific ADPRSs with overall neuritic plaque (NP) burden. e Association of cell-type-specific ADPRSs with overall paired-helical-filament tau (PHFtau) burden. f Association of cell-type-specific ADPRSs with overall neurofibrillary tangle (NFT) burden. g Association of cell-type-specific ADPRSs with cognitive decline (the slope of annual change in antemortem measures of global cognitive composite). For a–g, the y-axis indicates -log10(p-value) of each association, the black solid horizontal line indicates the p-value corresponding to multiple comparisons-corrected statistical significance (FDR = 0.025), and the black dashed horizontal line indicates p = 0.05. The p-values (two-sided, unadjusted) are from regression models (a logistic regression, b–g linear regression) adjusting for APOE ε4, APOE ε2, age at death, sex, genotyping platform, years of education (only for a and g), and the first three genotyping principal components. The effect size of each statistically significant PRS-trait association was quantified with ΔR² (difference of Nagelkerke’s R² (a) or adjusted R² (b–g) between the linear models with and without the given PRS term; b–g) and indicated above the bar graph. Also, see Supplementary Tables 3–9 for further details. Source data are provided as a Source Data file. h Association of Mic-ADPRS (x-axis) and the proportion of activated microglia (PAM, y-axis). On y-axis, residual PAM values adjusting for covariates (APOE ε4, APOE ε2, age at death, sex, genotyping platform, and the first three genotyping principal components) were shown. T-statistics and p-values from linear regression (adjusting for the covariates) are shown. Created with Biorender.com. Act. Mic. 
activated microglia, All full autosomal genome, Ast astrocytes, Ex excitatory neurons, In inhibitory neurons, Mic microglia, Oli oligodendrocytes, Opc oligodendroglial progenitor cells.
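The ΔR² effect size mentioned in the caption above (the difference in adjusted R² between linear models with and without the PRS term) can be illustrated with a small sketch. It uses the standard adjusted R² formula. The R² values, sample size, and covariate count below are hypothetical and are not taken from the study, and the function names are illustrative.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def delta_adjusted_r2(r2_with, r2_without, n_obs, n_covariates):
    """Adjusted R^2 gained by adding one PRS term to a covariate-only model."""
    full = adjusted_r2(r2_with, n_obs, n_covariates + 1)   # covariates + PRS
    base = adjusted_r2(r2_without, n_obs, n_covariates)    # covariates only
    return full - base


# Hypothetical inputs: 1,000 participants, 8 covariates, raw R^2 of 0.30
# with the PRS term and 0.28 without it.
example_delta = delta_adjusted_r2(0.30, 0.28, n_obs=1000, n_covariates=8)
print(f"delta adjusted R^2 = {example_delta:.4f}")
```

Because the adjustment penalizes the extra predictor, the adjusted difference is slightly smaller than the raw difference in R².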
Causal mediation analyses and structural equation modeling of cell-type-specific ADPRS and AD endophenotypes
a DP partially mediates Ast-ADPRS—NP association (n = 1452). b NP mediates most of the Ast-ADPRS—NFT association (n = 1474), and the direct effect of Ast-ADPRS on NFT is not significant. c NP partially mediates Mic-ADPRS—tau association (n = 1474). d NFT partially mediates Mic-ADPRS—cognitive decline association (n = 1392). The model included NP burden as a covariate. All models in a–d are linear models adjusted for age, sex, education (for cognitive decline slope), APOE ε2 count, APOE ε4 count, genotype batch, and first three genotype principal components, and non-parametric bootstrapping (n = 10,000 iterations) were used to derive empiric two-sided p-values and 95% confidence intervals for the average causal mediated effect, average direct effect, and proportion mediated. Also, see Supplementary Table 16 for further details. e Structural equation modeling (SEM) shows a probable causal relationship between cell-type-specific ADPRS and AD endophenotypes. Black solid arrows indicate phenotype-phenotype associations and red solid arrows indicate genotype-phenotype associations. All depicted associations were nominally significant (p < 0.05). Numbers adjacent to each arrow indicate completely standardized solutions (relative strength of the effect). Model fit metrics indicate an excellent model fit. All linear associations in this SEM are adjusted for age, sex, education (for cognitive decline slope), APOE ε2 count, APOE ε4 count, genotype batch, and first three genotype principal components. Created with Biorender.com. CFI comparative fit index, DP diffuse plaque, Nobs number of observations (participants), Nparameter number of model parameters, NP neuritic plaque, n.s. not significant, RMSEA root mean square error of approximation, SRMR standardized root mean square residual, TLI Tucker Lewis Index.
Association of cell-type-specific AD polygenic risk scores in A4
a Association of cell-type-specific ADPRSs with cortical Aβ (florbetapir PET cortical composite SUVR). b Association of cell-type-specific ADPRSs with temporal lobe tau (flortaucipir PET temporal lobe composite SUVR). c Association of cell-type-specific ADPRSs with hippocampal volume (HV). d Association of cell-type-specific ADPRSs with screening Preclinical Alzheimer Cognitive Composite (PACC). The y-axis indicates -log10(p-value) of each association. The black solid horizontal line indicates the p-value corresponding to statistical significance (FDR = 0.025), and the black dashed horizontal line indicates p = 0.05. The p-values (two-sided, unadjusted) are from linear regression models adjusting for APOE ε4, APOE ε2, age, sex, intracranial volume (only for c), years of education (only for d), and the first three genotyping principal components. Effect size of each statistically significant PRS-trait association was quantified with ΔR² (difference of adjusted R² between the linear models with and without the given PRS term) and indicated above the bar graph. Also, see Supplementary Tables 18, 22, 25, and 26 for further details. Source data are provided as a Source Data file. Created with Biorender.com. SUVR standardized uptake value ratio.
Cell-type-specific Alzheimer’s disease polygenic risk scores are associated with distinct disease processes in Alzheimer’s disease

November 2023 · 183 Reads · 20 Citations

Many of the Alzheimer’s disease (AD) risk genes are specifically expressed in microglia and astrocytes, but how and when the genetic risk localizing to these cell types contributes to AD pathophysiology remains unclear. Here, we derive cell-type-specific AD polygenic risk scores (ADPRS) from two extensively characterized datasets and uncover the impact of cell-type-specific genetic risk on AD endophenotypes. In an autopsy dataset spanning all stages of AD (n = 1457), the astrocytic ADPRS affected diffuse and neuritic plaques (amyloid-β), while microglial ADPRS affected neuritic plaques, microglial activation, neurofibrillary tangles (tau), and cognitive decline. In an independent neuroimaging dataset of cognitively unimpaired elderly (n = 2921), astrocytic ADPRS was associated with amyloid-β, and microglial ADPRS was associated with amyloid-β and tau, connecting cell-type-specific genetic risk with AD pathology even before symptom onset. Together, our study provides human genetic evidence implicating multiple glial cell types in AD pathophysiology, starting from the preclinical stage.


Figure 2. Benchmarking predictions of enhancer-gene regulatory interactions. a, We benchmarked the performance of predictive models for enhancer-gene regulatory connections on three different prediction tasks: 1) linking enhancers to CRISPR-validated target genes, 2) enrichment of putative regulatory variants from fine-mapped eQTL and GWAS datasets, and 3) linking variants to putative causal target genes from fine-mapped GWAS datasets. b, Precision-recall curves showing the performance of predictive models at predicting experimental results of CRISPRi data in K562 cells. Combined CRISPRi data was assembled by combining element-gene pairs from the re-analyzed Nasser et al., 2021 [6], Gasperini et al., 2019 [24] and Schraivogel et al., 2020 [23] datasets (10,375 tested element-gene pairs, 472 positives). Curves represent continuous predictors with the dashed vertical line indicating performance at a threshold corresponding to 70% recall. Single dots represent performance of binary predictors. c, Performance (AUPRC) of quantitative predictors as a function of distance to TSS. Color legend from panel d applies. Error bars represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). d, Enrichment-recall curves showing the ratio of fine-mapped distal noncoding eQTLs with a PIP > 0.5 in EBV-transformed lymphocytes in predicted enhancers compared to distal noncoding common variants (enrichment, y-axis) versus fraction of variants overlapping enhancers linked to the correct gene in
An encyclopedia of enhancer-gene regulatory interactions in the human genome

November 2023

·

297 Reads

·

38 Citations

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.


Uniform re-processing of all datasets
(A) The number of studies, datasets, donors and samples in the previous publication (R3) and current version of the eQTL Catalogue (R6). (B) Number of genes with at least one significant eQTL (‘eGenes’) on the X chromosome as a function of dataset sample size. Red points indicate datasets for which the X chromosome genotypes were unavailable. (C) The number of eGenes identified in each dataset for the five molecular traits (gene expression, exon expression, transcript usage, txrevise event usage, and Leafcutter splice-junction usage). Datasets newly added since release 3 have been highlighted.
Visualisation of a splicing QTL detected in the CYP2R1 gene
(A) RNA-seq read coverage across the CYP2R1 gene in GTEx transverse colon tissue stratified by the genotype of the lead sQTL variant (chr11_14855172_G_A). All introns have been shortened to 50 nt with wiggleplotr [29] to make variation in exonic read coverage easier to see. (B) Effect sizes and 95% confidence intervals of the lead sQTL variant on the expression level of individual exons (or exonic parts) of CYP2R1. Associations significant at FDR ≤ 1% are shown in dark blue. (C) The top two rows show the MANE Select [30] reference transcript and all annotated exons of CYP2R1, respectively. The remaining rows show the txrevise [5] event annotations used for sQTL mapping. The short version of exon 4 (between dashed lines) is only present in annotated nonsense-mediated decay (NMD) transcripts.
Sharing of significantly colocalised signals with vitamin D
(A) Number of colocalised signals detected by the different molecular QTL quantification methods and sharing between them. (B) Number of colocalised signals assigned to empirical functional consequence (eQTL, sQTL, puQTL, apaQTL or ambiguous) and sharing structure between them. (C) Number of independent colocalised signals associated with either a single target gene or multiple target genes in each functional consequences group. eQTL—expression QTL, sQTL—splicing QTL, puQTL—promoter usage QTL, apaQTL—alternative polyadenylation QTL.
eQTL Catalogue 2023: New datasets, X chromosome QTLs, and improved detection and visualisation of transcript-level QTLs

September 2023

·

82 Reads

·

19 Citations

The eQTL Catalogue is an open database of uniformly processed human molecular quantitative trait loci (QTLs). We are continuously updating the resource to further increase its utility for interpreting genetic associations with complex traits. Over the past two years, we have increased the number of uniformly processed studies from 21 to 31 and added X chromosome QTLs for 19 compatible studies. We have also implemented Leafcutter to directly identify splice-junction usage QTLs in all RNA sequencing datasets. Finally, to improve the interpretability of transcript-level QTLs, we have developed static QTL coverage plots that visualise the association between the genotype and average RNA sequencing read coverage in the region for all 1.7 million fine mapped associations. To illustrate the utility of these updates to the eQTL Catalogue, we performed colocalisation analysis between vitamin D levels in the UK Biobank and all molecular QTLs in the eQTL Catalogue. Although most GWAS loci colocalised both with eQTLs and transcript-level QTLs, we found that visual inspection could sometimes be used to distinguish primary splicing QTLs from those that appear to be secondary consequences of large-effect gene expression QTLs. While these visually confirmed primary splicing QTLs explain just 6/53 of the colocalising signals, they are significantly less pleiotropic than eQTLs and identify a prioritised causal gene in 4/6 cases.



Citations (73)


... Published and preprint studies suggest that among the most scalable experimental tools available to verify predictions are MPRAs, which can probe up to hundreds of thousands of sequences 20,96,97 . The ability to synthesize these sequences de novo offers extensive opportunities for rigorous validation, both by testing variants of natural sequences and by testing completely novel sequences. ...

Reference:

Predicting gene expression from DNA sequence using deep learning models
Functional dissection of complex and molecular trait variants at single nucleotide resolution

... Our functional analysis was augmented using RNA-Seq data from the MAGE dataset 38 , which provides RNA-seq data for 140 SAS individuals. For our GWAS and Biobank analysis, we used the standard T2T-CHM13 GWAS catalog and the phenotype manifests from the Pan-UK Biobank 42 ...

Pan-UK Biobank GWAS improves discovery, analysis of genetic architecture, and resolution into ancestry-enriched effects

... However, interpreting clusters based solely on pleiotropy can be challenging when biological processes are partially overlapping or produce similar multi-trait signatures. Epigenomic annotations provide valuable information about disease biology [30][31][32][33][34] , and have been used to retrospectively interpret clustering results [20][21][22] , but have not been directly integrated into clustering models. Furthermore, no quantitative metric has been proposed to evaluate clustering performance. ...

Convergence of coronary artery disease genes onto endothelial cell programs

Nature

... To assess the utility of the predicted functional scores, we adopt the methodology described by Avsec et al. [14] , focusing on putative eQTL causal variants. Specifically, we utilize the SuSiE [27] fine-mapped results across multiple tissues provided by the PigGTEx consortium [28] . We include only tissues with at least 500 variants having a posterior inclusion probability (PIP) exceeding 0.9 in a credible causal set, resulting in 13 tissue-specific causal variant datasets. ...

Improving fine-mapping by modeling infinitesimal effects

Nature Genetics

... This suggests that Aβ contributes to the effect of tau in AD. Moreover, several phenotypes are associated with the Aβ-tau interaction, such as astrocyte reactivity, blood pressure, vascular burden, and microglia [11][12][13][14][15]; however, the detailed mechanisms are not fully understood. Various cellular perturbations have been identified through single-nucleus RNA sequencing (snRNA-seq) in AD [16][17][18][19][20][21][22]. ...

Cell-type-specific Alzheimer’s disease polygenic risk scores are associated with distinct disease processes in Alzheimer’s disease

... For comparison, we used ENCODEs rE2G enhancer-gene link data 15 from 352 cell and tissue types to map target genes for more accessible UL enhancers. We annotated 3616 target genes in at least 2/3 UL subclasses, and 1160 out of 1925 previously annotated closest genes were found also in the ENCODE enhancer-gene link data, providing confidence in the regulatory link between the more accessible UL enhancers and annotated closest target genes. ...

An encyclopedia of enhancer-gene regulatory interactions in the human genome

... For instance, the eQTLGen Consortium investigated the genetic regulation of blood gene expression in 31,684 individuals worldwide (94). Both the eQTL Catalogue and QTLbase2 have uniformly processed and curated various types of human molQTL, including eQTL, splicing QTL, and TF binding QTL (95)(96)(97). ...

eQTL Catalogue 2023: New datasets, X chromosome QTLs, and improved detection and visualisation of transcript-level QTLs

... We interpreted the GWAS results using a series of gene mapping approaches (positional, eQTL and 3D chromatin-mapping) in combination with the MAGMA gene-based association test, gene prioritization (using PoPs) 47 and TWAS. CSE1L was . ...

Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases

Nature Genetics

... The inclusion of data based on heterogeneous phenotypic assessment associates with a decay of SNP-h² estimates 16,17 . This has led to a call for approaches with uniform phenotyping, for example by adherence to clinical diagnostic criteria. ...

Polygenic risk prediction: why and when out-of-sample prediction R2 can exceed SNP-based heritability
  • Citing Article
  • June 2023

The American Journal of Human Genetics

... The MPO enzyme, mainly released by activated neutrophils, is characterized by powerful prooxidative and proinflammatory properties (42). The immune cell-specific genes PRKAG3, ADAMTSL5, FBXO47, P2RY10, and EGFL6 have been linked to neuroinflammatory processes (43). P2RY10 is involved in microglial activation (43), which plays a crucial role in AD progression. ...

Cell-type-specific Alzheimer's disease polygenic risk scores are associated with distinct disease processes in Alzheimer's disease