Gonçalo R Abecasis

University of Michigan, Ann Arbor, Michigan, United States

Publications (332) · 4,831.21 Total impact

  • ABSTRACT: Although analysis pipelines have been developed to use RNA-seq to identify long non-coding RNAs (lncRNAs), inference of their biological and pathological relevance remains a challenge. As a result, most transcriptome studies of autoimmune disease have only assessed protein-coding transcripts. We used RNA-seq data from 99 lesional psoriatic, 27 uninvolved psoriatic, and 90 normal skin biopsies, and applied computational approaches to identify and characterize expressed lncRNAs. We detect 2,942 previously annotated and 1,080 novel lncRNAs which are expected to be skin specific. Notably, over 40% of the novel lncRNAs are differentially expressed and the proportions of differentially expressed transcripts among protein-coding mRNAs and previously annotated lncRNAs are lower in psoriasis lesions versus uninvolved or normal skin. We find that many lncRNAs, in particular those that are differentially expressed, are co-expressed with genes involved in immune-related functions, and that novel lncRNAs are enriched for localization in the epidermal differentiation complex. We also identify distinct tissue-specific expression patterns and epigenetic profiles for novel lncRNAs, some of which are shown to be regulated by cytokine treatment in cultured human keratinocytes. Together, our results implicate many lncRNAs in the immunopathogenesis of psoriasis and provide a resource for lncRNA studies in other autoimmune diseases.
    Genome Biology 12/2015; 16(1):24. DOI:10.1186/s13059-014-0570-4 · 10.47 Impact Factor
  • ABSTRACT: Human leukocyte antigen (HLA) genes confer substantial risk for autoimmune diseases on a log-additive scale. Here we speculated that differences in autoantigen-binding repertoires between a heterozygote's two expressed HLA variants might result in additional non-additive risk effects. We tested the non-additive disease contributions of classical HLA alleles in patients and matched controls for five common autoimmune diseases: rheumatoid arthritis (n_cases = 5,337), type 1 diabetes (T1D; n_cases = 5,567), psoriasis vulgaris (n_cases = 3,089), idiopathic achalasia (n_cases = 727) and celiac disease (n_cases = 11,115). In four of the five diseases, we observed highly significant, non-additive dominance effects (rheumatoid arthritis, P = 2.5 × 10^-12; T1D, P = 2.4 × 10^-10; psoriasis, P = 5.9 × 10^-6; celiac disease, P = 1.2 × 10^-87). In three of these diseases, the non-additive dominance effects were explained by interactions between specific classical HLA alleles (rheumatoid arthritis, P = 1.8 × 10^-3; T1D, P = 8.6 × 10^-27; celiac disease, P = 6.0 × 10^-100). These interactions generally increased disease risk and explained moderate but significant fractions of phenotypic variance (rheumatoid arthritis, 1.4%; T1D, 4.0%; celiac disease, 4.1%) beyond a simple additive model.
    Nature Genetics 08/2015; DOI:10.1038/ng.3379 · 29.65 Impact Factor
  • Emily Y Chew · Michael L Klein · Traci E Clemons · Elvira Agrón · Gonçalo R Abecasis
    Ophthalmology 08/2015; 122(8):e46-e47. DOI:10.1016/j.ophtha.2015.01.023 · 6.17 Impact Factor
  • Matthew Flickinger · Goo Jun · Gonçalo R Abecasis · Michael Boehnke · Hyun Min Kang
    ABSTRACT: DNA sample contamination is a frequent problem in DNA sequencing studies and can result in genotyping errors and reduced power for association testing. We recently described methods to identify within-species DNA sample contamination based on sequencing read data, showed that our methods can reliably detect and estimate contamination levels as low as 1%, and suggested strategies to identify and remove contaminated samples from sequencing studies. Here we propose methods to model contamination during genotype calling as an alternative to removal of contaminated samples from further analyses. We compare our contamination-adjusted calls to calls that ignore contamination and to calls based on uncontaminated data. We demonstrate that, for moderate contamination levels (5%-20%), contamination-adjusted calls eliminate 48%-77% of the genotyping errors. For lower levels of contamination, our contamination correction methods produce genotypes nearly as accurate as those based on uncontaminated data. Our contamination correction methods are useful generally, but are particularly helpful for sample contamination levels from 2% to 20%.
    The American Journal of Human Genetics 07/2015; DOI:10.1016/j.ajhg.2015.07.002 · 10.99 Impact Factor
  • ABSTRACT: DNA sequencing identifies common and rare genetic variants for association studies, but studies typically focus on variants in nuclear DNA and ignore the mitochondrial genome. In fact, analyzing variants in mitochondrial DNA (mtDNA) sequences presents special problems, which we resolve here with a general solution for the analysis of mtDNA in next-generation sequencing studies. The new program package comprises 1) an algorithm designed to identify mtDNA variants (i.e., homoplasmies and heteroplasmies), incorporating sequencing error rates at each base in a likelihood calculation and allowing allele fractions at a variant site to differ across individuals; and 2) an estimation of mtDNA copy number in a cell directly from whole-genome sequencing data. We also apply the methods to DNA sequence from lymphocytes of ~2,000 SardiNIA Project participants. As expected, mothers and offspring share all homoplasmies but a lesser proportion of heteroplasmies. Both homoplasmies and heteroplasmies show 5-fold higher transition/transversion ratios than variants in nuclear DNA. Also, heteroplasmy increases with age, though on average only ~1 heteroplasmy reaches the 4% level between ages 20 and 90. In addition, we find that mtDNA copy number averages ~110 copies/lymphocyte and is ~54% heritable, implying substantial genetic regulation of the level of mtDNA. Copy numbers also decrease modestly but significantly with age, and females on average have significantly more copies than males. The mtDNA copy numbers are significantly associated with waist circumference (p-value = 0.0031) and waist-hip ratio (p-value = 2.4 × 10^-5), but not with body mass index, indicating an association with central fat distribution. To our knowledge, this is the largest population analysis to date of mtDNA dynamics, revealing the age-imposed increase in heteroplasmy, the relatively high heritability of copy number, and the association of copy number with metabolic traits.
    PLoS Genetics 07/2015; 11(7):e1005306. DOI:10.1371/journal.pgen.1005306 · 8.17 Impact Factor
  • Chaolong Wang · Xiaowei Zhan · Liming Liang · Gonçalo R Abecasis · Xihong Lin
    ABSTRACT: Accurate estimation of individual ancestry is important in genetic association studies, especially when a large number of samples are collected from multiple sources. However, existing approaches developed for genome-wide SNP data do not work well with modest amounts of genetic data, such as in targeted sequencing or exome chip genotyping experiments. We propose a statistical framework to estimate individual ancestry in a principal component ancestry map generated by a reference set of individuals. This framework extends and improves upon our previous method for estimating ancestry using low-coverage sequence reads (LASER 1.0) to analyze either genotyping or sequencing data. In particular, we introduce a projection Procrustes analysis approach that uses high-dimensional principal components to estimate ancestry in a low-dimensional reference space. Using extensive simulations and empirical data examples, we show that our new method (LASER 2.0), combined with genotype imputation on the reference individuals, can substantially outperform LASER 1.0 in estimating fine-scale genetic ancestry. Specifically, LASER 2.0 can accurately estimate fine-scale ancestry within Europe using either exome chip genotypes or targeted sequencing data with off-target coverage as low as 0.05×. Under the framework of LASER 2.0, we can estimate individual ancestry in a shared reference space for samples assayed at different loci or by different techniques. Therefore, our ancestry estimation method will accelerate discovery in disease association studies not only by helping model ancestry within individual studies but also by facilitating combined analysis of genetic data from multiple sources.
    The American Journal of Human Genetics 05/2015; 96(6). DOI:10.1016/j.ajhg.2015.04.018 · 10.99 Impact Factor
  • ABSTRACT: Importance: Neuroticism is a pervasive risk factor for psychiatric conditions. It genetically overlaps with major depressive disorder (MDD) and is therefore an important phenotype for psychiatric genetics. The Genetics of Personality Consortium has created a resource for genome-wide association analyses of personality traits in more than 63,000 participants (including MDD cases). Objective: To identify genetic variants associated with neuroticism by performing a meta-analysis of genome-wide association results based on 1000 Genomes imputation; to evaluate whether common genetic variants as assessed by single-nucleotide polymorphisms (SNPs) explain variation in neuroticism by estimating SNP-based heritability; and to examine whether SNPs that predict neuroticism also predict MDD. Design, Setting, and Participants: Genome-wide association meta-analysis of 30 cohorts with genome-wide genotype, personality, and MDD data from the Genetics of Personality Consortium. The study included 63,661 participants from 29 discovery cohorts and 9786 participants from a replication cohort. Participants came from Europe, the United States, or Australia. Analyses were conducted between 2012 and 2014. Main Outcomes and Measures: Neuroticism scores harmonized across all 29 discovery cohorts by item response theory analysis, and clinical MDD case-control status in 2 of the cohorts. Results: A genome-wide significant SNP was found on 3p14 in MAGI1 (rs35855737; P = 9.26 × 10^-9 in the discovery meta-analysis). This association was not replicated (P = .32), but the SNP was still genome-wide significant in the meta-analysis of all 30 cohorts (P = 2.38 × 10^-8). Common genetic variants explain 15% of the variance in neuroticism. Polygenic scores based on the meta-analysis of neuroticism in 27 cohorts significantly predicted neuroticism (1.09 × 10^-12 < P < .05) and MDD (4.02 × 10^-9 < P < .05) in the 2 other cohorts. Conclusions and Relevance: This study identifies a novel locus for neuroticism. The variant is located in a known gene that has been associated with bipolar disorder and schizophrenia in previous studies. In addition, the study shows that neuroticism is influenced by many genetic variants of small effect that are either common or tagged by common variants. These genetic variants also influence MDD. Future studies should confirm the role of the MAGI1 locus for neuroticism and further investigate the association of MAGI1 and the polygenic association to a range of other psychiatric disorders that are phenotypically correlated with neuroticism.
    JAMA Psychiatry 05/2015; 72(7). DOI:10.1001/jamapsychiatry.2015.0554 · 12.01 Impact Factor
  • ABSTRACT: Despite progress in identifying genes associated with breast cancer, many more risk loci exist. Genome-wide association analyses in genetically homogeneous populations, such as that of Sardinia (Italy), could represent an additional approach to detect low-penetrance alleles. We performed a genome-wide association study comparing 1431 Sardinian patients with non-familial, BRCA1/2-mutation-negative breast cancer to 2171 healthy Sardinian blood donors. DNA was genotyped using GeneChip Human Mapping 500K Arrays or Genome-Wide Human SNP Arrays 6.0. To increase genomic coverage, genotypes of additional SNPs were imputed using data from HapMap Phase II. After quality control filtering of genotype data, 1367 cases (9 men) and 1658 controls (1156 men) were analyzed on a total of 2,067,645 SNPs. Overall, 33 genomic regions (67 candidate SNPs) were associated with breast cancer risk at the p < 10^-6 level. Twenty of these regions contained defined genes, including one already associated with breast cancer risk: TOX3. Lowering the threshold for preliminary significance to p < 10^-5, we identified 11 additional SNPs in FGFR2, a well-established breast cancer-associated gene. Ten candidate SNPs were selected, excluding those already associated with breast cancer, for technical validation as well as replication in 1668 samples from the same population. Only SNP rs345299, located in intron 1 of VAV3, remained suggestively associated (p = 1.16 × 10^-5), but it did not associate with breast cancer risk in pooled data from two large, mixed-population cohorts. This study indicated the role of TOX3 and FGFR2 as breast cancer susceptibility genes in BRCA1/2-wild-type breast cancer patients from the Sardinian population.
    BMC Cancer 05/2015; 15(1):383. DOI:10.1186/s12885-015-1392-9 · 3.32 Impact Factor
  • ABSTRACT: Genome-wide gene expression quantitative trait loci (eQTL) mapping has focused on single-nucleotide polymorphisms and has helped interpret findings from disease mapping studies. The functional effect of structural variants, especially short insertions and deletions (indels), has not been well investigated. Here we impute 1,380,133 indels based on the latest 1000 Genomes Project panel into three eQTL data sets from multiple tissues. Imputation of indels increased power by 9.9% and identified indel-specific eQTLs for 325 genes. We find that introns and the vicinities of UTRs are more enriched for indel eQTLs and that 3.6% (single-tissue) to 9.2% (multi-tissue) of previously identified eSNPs were taggers of eindels. Functional analyses identify epigenetic marks, gene ontology categories and disease GWAS loci affected by SNP and indel eQTLs showing tissue-consistent or tissue-specific effects. This study provides new insights into the underlying genetic architecture of gene expression across tissues and a new resource for interpreting the function of structural variants associated with diseases and traits.
    Nature Communications 05/2015; 6:6821. DOI:10.1038/ncomms7821 · 10.74 Impact Factor
  • ABSTRACT: Psoriasis is a chronic autoimmune disease with complex genetic architecture. Previous genome-wide association studies (GWAS) and a recent meta-analysis using Immunochip data have uncovered 36 susceptibility loci. Here, we extend our previous meta-analysis of European ancestry by refined genotype calling and imputation and by the addition of 5,033 cases and 5,707 controls. The combined analysis, consisting of over 15,000 cases and 27,000 controls, identifies five new psoriasis susceptibility loci at genome-wide significance (P < 5 × 10^-8). The newly identified signals include two that reside in intergenic regions (1q31.1 and 5p13.1) and three residing near PLCL2 (3p24.3), NFKBIZ (3q12.3) and CAMK2G (10q22.2). We further demonstrate that NFKBIZ is a TRAF3IP2-dependent target of IL-17 signalling in human skin keratinocytes, thereby functionally linking two strong candidate genes. These results further integrate the genetics and immunology of psoriasis, suggesting new avenues for functional analysis and improved therapies.
    Nature Communications 05/2015; 6:7001. DOI:10.1038/ncomms8001 · 10.74 Impact Factor
  • Goo Jun · Mary Kate Wing · Gonçalo R Abecasis · Hyun Min Kang
    ABSTRACT: The analysis of next-generation sequencing data is computationally and statistically challenging because of massive data volumes and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires fewer computational resources than current alternatives. Experiments with whole genome and exome targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.
    Genome Research 04/2015; 25(6). DOI:10.1101/gr.176552.114 · 13.85 Impact Factor
  • ABSTRACT: Abbreviations: LD, linkage disequilibrium; MHC, major histocompatibility complex; SNP, single-nucleotide polymorphism
    Journal of Investigative Dermatology 04/2015; 135:1177-1180. DOI:10.1038/jid.2014.517 · 6.37 Impact Factor
  • ABSTRACT: Background: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing. Results: Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals. Conclusions: We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants.
    BMC Bioinformatics 03/2015; 16(1). DOI:10.1186/s12859-015-0489-0 · 2.67 Impact Factor
  • ABSTRACT: Normal thyroid function is essential for health, but its genetic architecture remains poorly understood. Here, for the heritable thyroid traits thyrotropin (TSH) and free thyroxine (FT4), we analyse whole-genome sequence data from the UK10K project (N = 2,287). Using additional whole-genome sequence and deeply imputed data sets, we report meta-analysis results for common variants (MAF ≥ 1%) associated with TSH and FT4 (N = 16,335). For TSH, we identify a novel variant in SYN2 (MAF = 23.5%, P = 6.15 × 10^-9) and a new independent variant in PDE8B (MAF = 10.4%, P = 5.94 × 10^-14). For FT4, we report a low-frequency variant near B4GALT6/SLC25A52 (MAF = 3.2%, P = 1.27 × 10^-9) tagging a rare TTR variant (MAF = 0.4%, P = 2.14 × 10^-11). All common variants explain ≥ 20% of the variance in TSH and FT4. Analysis of rare variants (MAF < 1%) using sequence kernel association testing reveals a novel association with FT4 in NRG1. Our results demonstrate that increased coverage in whole-genome sequence association studies identifies novel variants associated with thyroid function.
    Nature Communications 03/2015; 6. DOI:10.1038/ncomms6681 · 10.74 Impact Factor
  • Gonçalo R Abecasis
    The American Journal of Human Genetics 03/2015; 96(3):363-6. DOI:10.1016/j.ajhg.2015.02.006 · 10.99 Impact Factor
  • ABSTRACT: Advances in exome sequencing and the development of exome genotyping arrays are enabling explorations of association between rare coding variants and complex traits. To ensure power for these rare variant analyses, a variety of association tests that group variants by gene or functional unit have been proposed. Here, we extend these tests to family-based studies. We develop family-based burden tests, variable frequency threshold tests and sequence kernel association tests. Through simulations, we compare the performance of different tests. We describe situations where family-based studies provide greater power than studies of unrelated individuals to detect rare variants associated with moderate to large changes in trait values. Broadly speaking, we find that when sample sizes are limited and only a modest fraction of all trait-associated variants can be identified, family samples are more powerful. Finally, we illustrate our approach by analyzing the relationship between coding variants and levels of high-density lipoprotein (HDL) cholesterol in 11,556 individuals from the HUNT and SardiNIA studies, demonstrating association for coding variants in the APOC3, CETP, LIPC, LIPG, and LPL genes and illustrating the value of family samples, meta-analysis, and gene-level tests. Our methods are implemented in freely available C++ code.
    Genetic Epidemiology 03/2015; 39(4). DOI:10.1002/gepi.21892 · 2.95 Impact Factor
  • ABSTRACT: Normal thyroid function is essential for health, but its genetic architecture remains poorly understood. Here, for the heritable thyroid traits thyrotropin (TSH) and free thyroxine (FT4), we analyse whole-genome sequence data from the UK10K project (N = 2,287). Using additional whole-genome sequence and deeply imputed data sets, we report meta-analysis results for common variants (MAF ≥ 1%) associated with TSH and FT4 (N = 16,335). For TSH, we identify a novel variant in SYN2 (MAF = 23.5%, P = 6.15 × 10^-9) and a new independent variant in PDE8B (MAF = 10.4%, P = 5.94 × 10^-14). For FT4, we report a low-frequency variant near B4GALT6/SLC25A52 (MAF = 3.2%, P = 1.27 × 10^-9) tagging a rare TTR variant (MAF = 0.4%, P = 2.14 × 10^-11). All common variants explain ≥ 20% of the variance in TSH and FT4. Analysis of rare variants (MAF < 1%) using sequence kernel association testing reveals a novel association with FT4 in NRG1. Our results demonstrate that increased coverage in whole-genome sequence association studies identifies novel variants associated with thyroid function.
    Nature Communications 03/2015; 6:7172. DOI:10.1038/ncomms8172 · 10.74 Impact Factor
  • Adrian Tan · Gonçalo R Abecasis · Hyun Min Kang
    ABSTRACT: A genetic variant can be represented in the Variant Call Format (VCF) in multiple different ways. Inconsistent representation of variants between variant callers and analyses will magnify discrepancies between them and complicate variant filtering and duplicate removal. We present a software tool, vt normalize, that normalizes the representation of genetic variants in VCF. We formally define variant normalization as the consistent representation of genetic variants in an unambiguous and concise way and derive a simple general algorithm to enforce it. We demonstrate the inconsistent representation of variants across existing sequence analysis tools and show that our tool facilitates integration of diverse variant types and call sets. Availability: The source code is available for download at http://github.com/atks/vt. More detailed documentation is available at http://genome.sph.umich.edu/wiki/Variant_Normalization. Contact: hmkang@umich.edu.
    Bioinformatics 02/2015; 31(13). DOI:10.1093/bioinformatics/btv112 · 4.62 Impact Factor
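
The variant normalization idea in the abstract above (one unambiguous, concise representation per variant) can be illustrated with a minimal Python sketch. It assumes a left-align-and-trim formulation: drop a base shared at the right end of all alleles, extend one reference base to the left whenever an allele becomes empty, and finally trim shared leading bases. The function name `normalize_variant`, its argument layout, and the toy reference string are hypothetical choices for illustration; this is not the vt source code and does not handle full VCF records.

```python
def normalize_variant(ref_seq, pos, alleles):
    """Left-align and trim one variant (illustrative sketch, not vt itself).

    ref_seq : reference sequence as a string (hypothetical toy input)
    pos     : 1-based position of the first base of the alleles
    alleles : list of allele strings, reference allele first
    Returns (new_pos, new_alleles).
    """
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")
    alleles = list(alleles)

    changed = True
    while changed:
        changed = False
        # Right-trim: every allele is non-empty and ends in the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An empty allele is not allowed: extend all alleles one base left.
        if any(a == "" for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            left_base = ref_seq[pos - 1]  # convert 1-based pos to 0-based index
            alleles = [left_base + a for a in alleles]
            changed = True

    # Left-trim: drop shared leading bases while every allele keeps >= 1 base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles


# Two different representations of the same CA deletion in a CA repeat
# collapse to the same normalized form.
print(normalize_variant("GGGCACACACAGGG", 8, ["CAC", "C"]))    # (3, ['GCA', 'G'])
print(normalize_variant("GGGCACACACAGGG", 2, ["GGCA", "GG"]))  # (3, ['GCA', 'G'])
```

Keeping the position 1-based mirrors the VCF convention, so the output can be compared directly with records in a VCF file; a production tool would additionally need to handle variants at the very start of a chromosome rather than raising an error.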

Publication Stats

64k Citations
4,831.21 Total Impact Points


  • 2002–2015
    • University of Michigan
      • Department of Biostatistics
      • Center for Statistical Genetics
      Ann Arbor, Michigan, United States
  • 2010–2014
    • National Institute on Aging
      • Laboratory of Personality and Cognition (LPC)
      Baltimore, Maryland, United States
    • National Research Council
      • Institute of Neurogenetics and Neuropharmacology IRGB
      Rome, Lazio, Italy
    • Christian-Albrechts-Universität zu Kiel
      • Institute of Clinical Molecular Biology
      Kiel, Schleswig-Holstein, Germany
    • Broad Institute of MIT and Harvard
      • Program in Medical and Population Genetics
      Cambridge, Massachusetts, United States
    • University of Exeter
      • Peninsula College of Medicine and Dentistry
      Exeter, England, United Kingdom
  • 2013
    • Harvard Medical School
      • Department of Medicine
      Boston, Massachusetts, United States
  • 2004–2013
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States
  • 2012
    • University of Chicago
      • Department of Human Genetics
      Chicago, Illinois, United States
    • McGill University
      • Department of Epidemiology, Biostatistics and Occupational Health
      Montréal, Quebec, Canada
  • 2000–2012
    • University of Oxford
      • Wellcome Trust Centre for Human Genetics
      Oxford, England, United Kingdom
  • 2008
    • MRC Mitochondrial Biology Unit
      Cambridge, England, United Kingdom
    • Boston Children's Hospital
      Boston, Massachusetts, United States
  • 2007
    • Imperial College London
      London, England, United Kingdom
  • 2005
    • University of Rome Tor Vergata
      • Dipartimento di Biopatologia e Diagnostica per Immagini
      Rome, Lazio, Italy
    • Cold Spring Harbor Laboratory
      Cold Spring Harbor, New York, United States