Error and Error Mitigation in Low-Coverage Genome Assemblies

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America.
PLoS ONE (Impact Factor: 3.53). 02/2011; 6(2):e17034. DOI: 10.1371/journal.pone.0017034
Source: PubMed

ABSTRACT The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1-4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
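
The idea that error is well predicted by quality scores can be illustrated with a minimal Python sketch. This is not the authors' SEM procedure: the function names, the Q20 threshold, and the choice to mask with "N" are assumptions made for illustration only. The one fixed fact used is the standard Phred relation, in which a quality score Q corresponds to an error probability of 10^(-Q/10).

```python
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Standard Phred relation: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: Sequence[int], min_q: int = 20) -> str:
    """Replace every base whose quality is below min_q with 'N'.

    min_q=20 is a hypothetical threshold, not a value taken from the paper.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))


def expected_errors_per_kb(quals: Sequence[int]) -> float:
    """Expected number of errors per 1000 bases implied by the quality scores."""
    if not quals:
        return 0.0
    total = sum(phred_to_error_prob(q) for q in quals)
    return 1000 * total / len(quals)


if __name__ == "__main__":
    seq = "ACGT"
    quals = [30, 10, 25, 5]
    print(mask_low_quality(seq, quals))
    print(expected_errors_per_kb(quals))
```

Masking of this kind trades a loss of some correct bases (over-correction) for the removal of a share of likely errors, which is the trade-off the abstract describes in general terms.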

Available from: Manolis Kellis, Aug 21, 2015
    • "The annotated assembly with 6.51× coverage is now available at the National Center for Biotechnology Information (NCBI), University of California Santa Cruz (UCSC), and Ensembl. The rabbit chosen by the Broad Institute for sequencing was obtained from Covance in 2004 and was shown to have low heterozygosity rate compared to non-inbred New Zealand White (NZW) rabbits from the same source (Lindblad-Toh et al. 2011; Hubisz et al. 2011). DNA of the same animal was used both for the ∼7× coverage sequencing project analyzed here and for a previous low-coverage ∼2× sequencing project (accession AAGW01000000). "
    ABSTRACT: We report on the analyses of genes encoding immunoglobulin heavy and light chains in the rabbit 6.51× whole genome assembly. This OryCun2.0 assembly confirms previous mapping of the duplicated IGK1 and IGK2 loci to chromosome 2 and the IGL lambda light chain locus to chromosome 21. The most frequently rearranged and expressed IGHV1 that is closest to IG DH and IGHJ genes encodes rabbit VHa allotypes. The partially inbred Thorbecke strain rabbit used for whole-genome sequencing was homozygous at the IGK but heterozygous with the IGHV1a1 allele in one of 79 IGHV-containing unplaced scaffolds and IGHV1a2, IGHM, IGHG, and IGHE sequences in another. Some IGKV, IGLV, and IGHA genes are also in other unplaced scaffolds. By fluorescence in situ hybridization, we assigned the previously unmapped IGH locus to the q-telomeric region of rabbit chromosome 20. An approximately 3-Mb segment of human chromosome 14 including IGH genes predicted to map to this telomeric region based on synteny analysis could not be located on assembled chromosome 20. Unplaced scaffold chrUn0053 contains some of the genes that comparative mapping predicts to be missing. We identified discrepancies between previous targeted studies and the OryCun2.0 assembly and some new BAC clones with IGH sequences that can guide other studies to further sequence and improve the OryCun2.0 assembly. Complete knowledge of gene sequences encoding variable regions of rabbit heavy, kappa, and lambda chains will lead to better understanding of how and why rabbits produce antibodies of high specificity and affinity through gene conversion and somatic hypermutation.
    Immunogenetics 08/2013; 65. DOI:10.1007/s00251-013-0722-9 · 2.49 Impact Factor
    • "MBE encode the N-terminal or C-terminal part of a protein (Hubisz et al. 2011; Thompson et al. 2011; Prosdocimi et al. 2012). We thus repeated our analyses of indel accumulation and of indel density association with sequence divergence, with excluding such terminal indels. "
    ABSTRACT: Insertions and deletions (indels) in protein-coding genes are important sources of genetic variation. Their role in creating new proteins may be especially important after gene duplication. However, little is known about how indels affect the divergence of duplicate genes. We here study thousands of duplicate genes in five fish (teleost) species with completely sequenced genomes. The ancestor of these species has been subject to a fish-specific genome duplication (FSGD) event that occurred approximately 350 Ma. We find that duplicate genes contain at least 25% more indels than single-copy genes. These indels accumulated preferentially in the first 40 my after the FSGD. A lack of widespread asymmetric indel accumulation indicates that both members of a duplicate gene pair typically experience relaxed selection. Strikingly, we observe a 30-80% excess of deletions over insertions that is consistent for indels of various lengths and across the five genomes. We also find that indels preferentially accumulate inside loop regions of protein secondary structure and in regions where amino acids are exposed to solvent. We show that duplicate genes with high indel density also show high DNA sequence divergence. Indel density, but not amino acid divergence, can explain a large proportion of the tertiary structure divergence between proteins encoded by duplicate genes. Our observations are consistent across all five fish species. Taken together, they suggest a general pattern of duplicate gene evolution in which indels are important driving forces of evolutionary change.
    Molecular Biology and Evolution 04/2012; 29(10):3005-22. DOI:10.1093/molbev/mss108 · 14.31 Impact Factor
    • "A potential drawback to RRL methods is the large number of sequencing reads originating from random shearing rather than the restriction digest; however, we show that, given a reference sequence, these reads can be used to produce fully-resequenced organellar genomes. High-throughput sequencing methods may also have high error rates, which are especially unreliable at low sequencing depth (Harismendy et al. 2009; Hubisz et al. 2011). "
    ABSTRACT: Connecting broad-scale patterns of genetic variation and population structure to genetic diversity on a landscape is a key step towards understanding historical processes of migration and adaptation. New genomic approaches can be used to increase the resolution of phylogeographic studies while reducing locus sampling effects and circumventing ascertainment bias. Here, we use a novel approach based on high-throughput sequencing to characterize genetic diversity in complete chloroplast genomes and >10,000 nuclear loci in switchgrass, at continental and landscape scales. Switchgrass is a North American tallgrass species, which is widely used in conservation and perennial biomass production, and shows strong ecotypic adaptation and population structure across the continental range. We sequenced 40.9 billion base pairs from 24 individuals from across the species' range and 20 individuals from the Indiana Dunes. Analysis of plastome sequence revealed 203 variable SNP sites that define eight haplogroups, which are differentiated by 4-127 SNPs and confirmed by patterns of indel variation. These include three deeply divergent haplogroups, which correspond to the previously described lowland-upland ecotypic split and a novel upland haplogroup split that dates to the mid-Pleistocene. Most of the plastome haplogroup diversity present in the northern switchgrass range, including in the Indiana Dunes, originated in the mid- or upper Pleistocene prior to the most recent postglacial recolonization. Furthermore, a recently colonized landscape feature (approximately 150 ya) in the Indiana Dunes contains several deeply divergent upland haplogroups. Nuclear markers also support a deep lowland-upland split, followed by limited gene flow, and show extensive gene flow in the local population of the Indiana Dunes.
    Molecular Ecology 11/2011; 20(23):4938-52. DOI:10.1111/j.1365-294X.2011.05335.x · 6.49 Impact Factor