Article

Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions

Department of Genome Sciences, University of Washington School of Medicine, Seattle, USA.
Nature Methods (Impact Factor: 25.95). 05/2010; 7(5):365-71. DOI: 10.1038/nmeth.1451
Source: PubMed

ABSTRACT The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18-37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.

  • ABSTRACT: We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries composed of 9,216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to over 1 megabase. These pools are 'sub-haploid', in that the lengths of the fragments contained in each pool sum to approximately 5 to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate 'joins' are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by 8- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing mid-range contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
    Genome Research 10/2014; 24(12). DOI:10.1101/gr.178319.114 · 13.85 Impact Factor
  • ABSTRACT: A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA, including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.
    Genome Research 11/2014; 24(12). DOI:10.1101/gr.180893.114 · 13.85 Impact Factor
  • ABSTRACT: Chromosomal abnormalities, including microdeletions and microduplications, have long been associated with abnormal developmental outcomes. Early discoveries relied on a common clinical presentation and the ability to detect chromosomal abnormalities by standard karyotype analysis or specific assays such as fluorescence in situ hybridization. Over the past decade, the development of novel genomic technologies has allowed more comprehensive, unbiased discovery of microdeletions and microduplications throughout the human genome. The ability to quickly interrogate large cohorts using chromosome microarrays and, more recently, next-generation sequencing has led to the rapid discovery of novel microdeletions and microduplications associated with disease, including very rare but clinically significant rearrangements. In addition, the observation that some microdeletions are associated with risk for several neurodevelopmental disorders contributes to our understanding of shared genetic susceptibility for such disorders. Here, we review current knowledge of microdeletion/duplication syndromes, with a particular focus on recurrent rearrangement syndromes. Expected final online publication date for the Annual Review of Genomics and Human Genetics Volume 15 is September 01, 2014. Please see http://www.annualreviews.org/catalog/pubdates.aspx for revised estimates.
    Annual Review of Genomics and Human Genetics 04/2014; DOI:10.1146/annurev-genom-091212-153408 · 9.13 Impact Factor