Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions

Department of Genome Sciences, University of Washington School of Medicine, Seattle, USA.
Nature Methods (Impact Factor: 32.07). 05/2010; 7(5):365-71. DOI: 10.1038/nmeth.1451
Source: PubMed


The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18-37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
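
As a rough illustration of the breakpoint-based genotyping idea described in the abstract, the sketch below calls a genotype from two read counts: reads that map across the insertion breakpoint and reads that map across the uninterrupted reference site. It is not the authors' implementation. The function name `genotype_insertion`, the minimum read count, and the allele-fraction thresholds are all assumptions chosen only for this example.

```python
def genotype_insertion(ins_reads: int, ref_reads: int, min_reads: int = 4,
                       het_low: float = 0.2, het_high: float = 0.8) -> str:
    """Call an insertion genotype from breakpoint-spanning read counts.

    ins_reads: reads mapped across the insertion breakpoint (insertion allele).
    ref_reads: reads mapped across the intact reference site (empty allele).
    The thresholds are illustrative assumptions, not published values.
    """
    total = ins_reads + ref_reads
    if total < min_reads:
        return "./."  # too few informative reads to make a call
    ins_fraction = ins_reads / total
    if ins_fraction < het_low:
        return "0/0"  # homozygous for the reference (no insertion)
    if ins_fraction > het_high:
        return "1/1"  # homozygous for the insertion
    return "0/1"      # heterozygous


if __name__ == "__main__":
    for counts in [(0, 12), (6, 7), (15, 0), (1, 1)]:
        print(counts, genotype_insertion(*counts))
```

A real pipeline would also need to account for mapping quality, coverage depth and sequencing error, which this sketch ignores.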



Available from: Rajinder Kaul, Oct 14, 2015
  • Source
    • "The intersection of their results in NA12878, yields 132K calls, of which 55K are nonreference. Our SV validation data is 169 insertions and deletions called from alignment of finished fosmid sequence [Kidd et al., 2010a,b]. "
    ABSTRACT: Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating germline variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on these benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single-nucleotide polymorphism, indel and structural variant calling algorithms. Availability and implementation: We provide free and open access online to the SMaSH tool kit, along with detailed documentation, at
    Bioinformatics 10/2013; 30(19). DOI:10.1093/bioinformatics/btu345 · 4.98 Impact Factor
  • Source
    • "[Fig. 1 flowchart steps: map those unalignable transcript contigs to Celera and HuRef; align those missing gene fragments and unalignable transcript contigs to chimpanzee, macaca and mouse genomes to investigate their conservation; annotate those missing gene fragments and missing transcript contigs with NCBI nr database and the protein family database Pfam.] Fig. 1 Strategies of identifying and characterizing genes missing from the human reference genome. To more comprehensively identify and characterize the missing gene sequences of the NCBI human reference genome build 37.2, we integrated two different strategies: (1) genome-wide comparison coupled with genome-guided transcriptome reconstruction; (2) de novo transcriptome assembly and then align the assembled transcript contigs onto the genomes [...] conducted on human genome-wide comparison (Khaja et al. 2006; Kidd et al. 2010; Li et al. 2010). To generate non-redundant and non-overlapping alignments for each chromosome, we sorted the alignment results of each chromosome in descending order and iteratively compared each hit with the lower scoring alignments. "
    ABSTRACT: The human reference genome is still incomplete and a number of gene sequences are missing from it. The approaches to uncover them, the reasons causing their absence and their functions are less explored. Here, we comprehensively identified and characterized the missing genes of human reference genome with RNA-Seq data from 16 different human tissues. By using a combined approach of genome-guided transcriptome reconstruction coupled with genome-wide comparison, we uncovered 3.78 and 2.37 Mb transcribed regions in the human genome assemblies of Celera and HuRef either missed from their homologous chromosomes of NCBI human reference genome build 37.2 or partially or entirely absent from the reference. We further identified a significant number of novel transcript contigs in each tissue from de novo transcriptome assembly that are unalignable to NCBI build 37.2 but can be aligned to at least one of the genomes from Celera, HuRef, chimpanzee, macaca or mouse. Our analyses indicate that the missing genes could result from genome misassembly, transposition, copy number variation, translocation and other structural variations. Moreover, our results further suggest that a large portion of these missing genes are conserved between human and other mammals, implying their important biological functions. In total, 1,233 functional protein domains were detected in these missing genes. Collectively, our study not only provides approaches for uncovering the missing genes of a genome, but also proposes the potential reasons causing genes missed from the genome and highlights the importance of uncovering the missing genes of incomplete genomes.
    Human Genetics 04/2013; 132(8). DOI:10.1007/s00439-013-1300-9 · 4.82 Impact Factor
  • Source
    • "Thus, the reference sequence has become essential for clinical applications, and is used to determine alleles for risk, protection, or treatment-specific response in human disease [1]. Yet, the current reference sequence, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete [2], [3]. "
    ABSTRACT: Data from the 1000 genomes project (1KGP) and Complete Genomics (CG) have dramatically increased the numbers of known genetic variants and challenge several assumptions about the reference genome and its uses in both clinical and research settings. Specifically, 34% of published array-based GWAS studies for a variety of diseases utilize probes that overlap unanticipated single nucleotide polymorphisms (SNPs), indels, or structural variants. Linkage disequilibrium (LD) block length depends on the numbers of markers used, and the mean LD block size decreases from 16 kb to 7 kb, when HapMap-based calculations are compared to blocks computed from 1KGP data. Additionally, when 1KGP and CG variants are compared, 19% of the single nucleotide variants (SNVs) reported from common genomes are unique to one dataset; likely a result of differences in data collection methodology, alignment of reads to the reference genome, and variant-calling algorithms. Together these observations indicate that current research resources and informatics methods do not adequately account for the high level of variation that already exists in the human population and significant efforts are needed to create resources that can accurately assess personal genomics for health, disease, and predict treatment outcomes.
    PLoS ONE 07/2012; 7(7):e40294. DOI:10.1371/journal.pone.0040294 · 3.23 Impact Factor
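
The Human Genetics excerpt above describes producing non-overlapping alignments per chromosome by sorting hits in descending order and comparing each hit with the lower-scoring ones. A minimal sketch of one way to do this is given below. It assumes the sort key is the alignment score and that an overlapping lower-scoring hit is simply discarded; the excerpt states neither detail, and the `Hit` type and `select_non_overlapping` function are hypothetical names used only here.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    score: float  # alignment score (assumed sort key)


def select_non_overlapping(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the best-scoring hits that do not overlap each other."""
    kept: List[Hit] = []
    # Visit hits from highest to lowest score.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # Keep the hit only if it is disjoint from every hit already kept.
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    # Return the survivors in coordinate order.
    return sorted(kept, key=lambda h: h.start)


if __name__ == "__main__":
    example = [Hit(0, 100, 50.0), Hit(50, 150, 80.0), Hit(150, 200, 10.0)]
    for hit in select_non_overlapping(example):
        print(hit)
```

This would be run once per chromosome. Trimming overlapping hits instead of discarding them is an equally plausible reading of the excerpt.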