Using population admixture to help complete maps of the human genome.

[1] Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. [2] Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA. [3] Division of Nephrology, Department of Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, Massachusetts, USA. [4] Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
Nature Genetics (Impact Factor: 29.65). 02/2013; DOI: 10.1038/ng.2565
Source: PubMed

ABSTRACT Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces using the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning 4 million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified 8 new large interchromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed at the RNA level and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.

  • ABSTRACT: Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: regions for which hypothetical reads of length 1 kb cannot be uniquely mapped with zero-mismatch alignment, considering both the forward and reverse strands. The respective collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplications, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplications or transposable elements, and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements and very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project.
    Computational Biology and Chemistry 08/2014; 53. · 1.60 Impact Factor
  • ABSTRACT: Genetic association studies in recently admixed populations offer exciting opportunities to identify novel variants underlying phenotypic diversity. At the same time, genetic heterogeneity resulting from population admixture has to be accounted for to ensure the validity of association tests. The whole-genome sequence data and the genome-wide single-nucleotide polymorphism chip data for Mexican American individuals provided by Genetic Analysis Workshop 18 (GAW18) present a unique opportunity to evaluate and compare methods for the statistical analysis of admixed genetic data. We summarize here the five contributions from the GAW18 working group on admixture mapping and adjusting for admixture. Although group members considered a variety of research topics, the general theme was inference and consideration of ancestry admixture in genetic analyses. The topics considered can be grouped into three categories: (1) global and local ancestry inference and estimation, (2) association and admixture mapping, and (3) genotype imputation in admixed samples. We describe the approaches that were used and the most relevant findings from each contribution. We also provide insight into the strengths and limitations of the state-of-the-art methods considered for genetic analyses in admixed populations.
    Genetic Epidemiology 09/2014; 38 Suppl 1:S5-S12. · 2.95 Impact Factor
  • ABSTRACT: A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA, including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.
    Genome Research 11/2014; · 13.85 Impact Factor
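
The first abstract in the list above defines an unmappable region in terms of reads that cannot be uniquely mapped with zero mismatches on either strand. A minimal Python sketch of that idea on a toy sequence follows. It is an illustration only, not the cited authors' pipeline: the function names, the exhaustive window counting, and the treatment of self-reverse-complementary windows are all assumptions, and a real analysis would use 1 kb reads against GRCh37/hg19 with a dedicated aligner or index.

```python
from collections import Counter
from typing import List

# Translation table for complementing DNA bases (assumes an A/C/G/T-only alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def unmappable_starts(genome: str, read_len: int) -> List[int]:
    """Return start positions whose read_len window has more than one
    exact-match placement in `genome`, counting both strands.

    Hypothetical helper for illustration; memory use is proportional to
    genome length times read_len, so it only suits toy inputs.
    """
    n_windows = len(genome) - read_len + 1
    windows = [genome[i:i + read_len] for i in range(max(n_windows, 0))]

    placements = Counter()
    for w in windows:
        placements[w] += 1  # forward-strand placement
        rc = reverse_complement(w)
        if rc != w:
            # A read equal to rc would also align here on the reverse strand.
            # Windows that are their own reverse complement are skipped so the
            # same locus is not counted twice (an assumption of this sketch).
            placements[rc] += 1

    return [i for i, w in enumerate(windows) if placements[w] > 1]


if __name__ == "__main__":
    toy_genome = "AAAACCCCTTTTG"  # made-up sequence, not real data
    print(unmappable_starts(toy_genome, read_len=4))
```

In the toy sequence, the window AAAA is flagged because its reverse complement TTTT occurs elsewhere, which is why both strands have to be considered.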
