Using population admixture to help complete maps of the human genome

[1] Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. [2] Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA. [3] Division of Nephrology, Department of Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, Massachusetts, USA. [4] Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
Nature Genetics (Impact Factor: 29.35). 02/2013; 45(4). DOI: 10.1038/ng.2565
Source: PubMed


Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces using the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning 4 million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified 8 new large interchromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed at the RNA level and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.
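
The general idea of localizing a sequence through admixture can be illustrated with a toy sketch. This is not the authors' pipeline; it assumes that per-individual allele dosages for a variant on an unplaced scaffold and per-individual local-ancestry calls at candidate loci are already available. The names `pearson`, `best_locus`, and the locus labels are hypothetical. In an admixed population, the allele should track local ancestry most closely at the scaffold's true location, so the sketch simply picks the locus with the strongest correlation.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no signal
    return cov / (vx * vy) ** 0.5


def best_locus(dosage, local_ancestry):
    """Return (locus, r) for the locus whose ancestry best tracks the dosage.

    dosage: allele counts (0, 1, 2) per individual for a variant on an
        unplaced scaffold (hypothetical input).
    local_ancestry: dict mapping a locus label to per-individual counts
        (0, 1, 2) of chromosomes inherited from one ancestral population.
    """
    scores = {locus: pearson(dosage, anc) for locus, anc in local_ancestry.items()}
    locus = max(scores, key=lambda name: abs(scores[name]))
    return locus, scores[locus]


# Toy data for eight admixed individuals (illustrative values only).
local_ancestry = {
    "chr1:pericentromere": [0, 1, 2, 2, 1, 0, 1, 2],
    "chr6:arm": [2, 0, 1, 0, 2, 1, 1, 0],
}
dosage = [0, 1, 2, 2, 1, 0, 1, 1]

locus, r = best_locus(dosage, local_ancestry)
print(locus, round(r, 3))
```

A real analysis would scan the whole genome, combine evidence across many variants per scaffold, and assess statistical significance; the sketch only shows the core signal being exploited.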



Available from: Cynthia C Morton, Jan 12, 2015
  • Source
    • "eference assembly, such as GPRIN2 and DUSP22, have 20 and 56 HNR variants, respectively, while high Vst genes such as PDE4DIP have 267 HNR variants. The gene with the most HNR variants (N = 618) is PRIM2 that is part of interchromosomal duplications of Chromosomes 6 and 3 and represents cryptic SDs in the GRCh37 reference genome (Genovese et al. 2013). Additionally, two regions that were incorrectly represented in GRCh37 and subsequently resolved in GRCh38 using the CHM1 derived BAC library, SRGAP2 (Dennis et al. 2012) and IGH (Watson et al. 2013), both had high counts of HNR variants (39 and 54, respectively) providing additional support for the hypothesis that het"
    ABSTRACT: A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.
    Genome Research 11/2014; 24(12). DOI:10.1101/gr.180893.114 · 14.63 Impact Factor
  • Source
    • "The term mappability as it is used in the current paper refers to NGS short-read based mappability (SMRT sequencing is not discussed here); it refers to mappability of a read to a reference genome (not assemblability in the absence of a reference genome (Kinsford et al., 2010; Bradnam et al., 2013)); we ignore the failure to map a read due to the gaps or unsequenced regions in the reference genome (the N's in the reference genome) (Genovese et al., 2013); we refer to a specific length scale (e.g. read length k = 1000) as read length will dramatically change the nature of mappability; and it is assumed that reads are aligned to a genomic region only if there is an exact match (zero-mismatch alignment)."
    ABSTRACT: Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: hypothetical reads with length 1kb which cannot be uniquely mapped with zero-mismatch alignment for the described regions, considering both the forward and reverse strand. The respective collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplication, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplication or transposable elements and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements, very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project.
    Computational Biology and Chemistry 08/2014; 53. DOI:10.1016/j.compbiolchem.2014.08.015 · 1.12 Impact Factor
  • Source
    • "Other crucial applications have included pharmacogenomics; for example, in a recent study, Native American ancestry was significantly associated with the risk of relapse in children suffering from acute lymphoblastic leukemia (Yang et al. 2011). In addition to these traditional applications, in more recent years, local ancestry inference methods have also found applications in other settings such as localizing sequences of unknown location from the human reference genome (Genovese et al. 2013), studying recombination rate variation (Hinch et al. 2011; Wegmann et al. 2011), inferring natural selection (Tang et al. 2007; Jin et al. 2012), making demographic inferences (Bryc et al. 2010; Johnson et al. 2011; Kidd et al. 2012) and in joint association and admixture mapping to boost the power to detect disease linked genes and variants (Pasaniuc et al. 2011; Shriner et al. 2011)."
    ABSTRACT: Ancestry inference is a frequently encountered problem and has many applications such as forensic analyses, genetic association studies, and personal genomics. The main goal of ancestry inference is to identify an individual's population of origin based on our knowledge of natural populations. Because both self-reported ancestry in humans and the sampling location of an organism can be inaccurate for this purpose, the use of genetic markers can facilitate accurate and reliable inference of an individual's ancestral origins. At a higher level, there are two different paradigms in ancestry inference: global ancestry inference, which tries to compute the genome-wide average of the population contributions, and local ancestry inference, which tries to identify the regional ancestry of a genomic segment. In this mini review, I describe the numerous approaches that are currently available for both kinds of ancestry inference from population genomic datasets. I first describe the general ideas underlying such inference methods and their relationship to one another. Then, I describe practical applications in which inference of ancestry has proven useful. Lastly, I discuss challenges and directions for future research work in this area.
    Frontiers in Genetics 06/2014; 5. DOI:10.3389/fgene.2014.00204
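
The distinction between global and local ancestry drawn in the last abstract above can be made concrete with a minimal sketch. It assumes local-ancestry calls are already available as genomic segments with half-open coordinates; the function name `global_ancestry` and the population labels are hypothetical, and no particular inference method from the review is implied. Global ancestry is then just the length-weighted average of the local calls.

```python
from collections import defaultdict


def global_ancestry(segments):
    """Length-weighted ancestry proportions from local-ancestry segments.

    segments: iterable of (start, end, population) tuples with half-open
        coordinates [start, end); labels are illustrative assumptions.
    """
    totals = defaultdict(int)
    for start, end, population in segments:
        totals[population] += end - start
    covered = sum(totals.values())
    if covered == 0:
        return {}  # nothing to average over
    return {pop: length / covered for pop, length in totals.items()}


# One toy haplotype: local ancestry switches twice along 2,000 positions.
segments = [(0, 600, "AFR"), (600, 1000, "EUR"), (1000, 2000, "AFR")]
proportions = global_ancestry(segments)
print(proportions)
```

The same local calls that feed this average are what localization and admixture-mapping analyses use position by position, which is why the two paradigms are closely related.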