Error and Error Mitigation in Low-Coverage Genome Assemblies

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America.
PLoS ONE (Impact Factor: 3.23). 02/2011; 6(2):e17034. DOI: 10.1371/journal.pone.0017034
Source: PubMed


The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1-4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
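
The abstract notes that error is well predicted by quality scores and that low-quality bases can be masked. The following is a minimal sketch of that general idea, not the authors' SEM method: it assumes Phred-scaled quality scores (error probability 10^(-Q/10)), and the function names and the threshold of 20 are hypothetical choices made for illustration only.

```python
def mask_low_quality(seq, quals, min_q=20, mask_char="N"):
    """Replace bases whose quality score is below min_q with mask_char.

    min_q=20 is an illustrative threshold, not a value taken from the paper.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    return "".join(
        base if q >= min_q else mask_char for base, q in zip(seq, quals)
    )


def expected_errors_per_kb(quals):
    """Expected errors per 1000 bases, assuming Phred-scaled scores."""
    if not quals:
        return 0.0
    # Phred definition: error probability p = 10 ** (-Q / 10)
    probs = [10 ** (-q / 10) for q in quals]
    return 1000 * sum(probs) / len(probs)


if __name__ == "__main__":
    seq = "ACGTAC"
    quals = [40, 35, 10, 30, 5, 40]
    print(mask_low_quality(seq, quals))
    print(expected_errors_per_kb([30] * 100))
```

Under the Phred assumption, a uniform quality of 30 corresponds to about 1 expected error per kilobase, which is at the low end of the 1-4 errors per kilobase range reported in the abstract.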

Available from: Manolis Kellis,
  • Source
    • "The utility of such public data, produced by others and from several different researchers, relies heavily on the assumption of high data quality [3]. However, although much has been accomplished in terms of minimizing their prevalence, sequence errors are still an important issue for both Sanger and next generation sequencing data [4–7]. "
    ABSTRACT: Genetic data can provide a powerful tool for those interested in the biology, management and conservation of wildlife, but also lead to erroneous conclusions if appropriate controls are not taken at all steps of the analytical process. This particularly applies to data deposited in public repositories such as GenBank, whose utility relies heavily on the assumption of high data quality. Here we report on an in-depth reassessment and comparison of GenBank and chromatogram mtDNA sequence data generated in a previous study of Baltic grey seals. By re-editing the original chromatogram data we found that approximately 40% of the grey seal mtDNA haplotype sequences posted in GenBank contained errors. The re-analysis of the edited chromatogram data yielded overall similar results and conclusions as the original study. However, a significantly different outcome was observed when using the uncorrected dataset based on the GenBank haplotypes. We therefore suggest disregarding the existing GenBank data and instead using the correct haplotypes reported here. Our study serves as an illustrative example reiterating the importance of quality control through every step of a research project, from data generation to interpretation and submission to an online repository. Errors conducted in any step may lead to biased results and conclusions, and could impact management decisions.
    PLoS ONE 08/2013; 8(8):e72853. DOI:10.1371/journal.pone.0072853 · 3.23 Impact Factor
  • Source
    • "The annotated assembly with 6.51× coverage is now available at the National Center for Biotechnology Information (NCBI), University of California Santa Cruz (UCSC), and Ensembl. The rabbit chosen by the Broad Institute for sequencing was obtained from Covance in 2004 and was shown to have low heterozygosity rate compared to non-inbred New Zealand White (NZW) rabbits from the same source (Lindblad-Toh et al. 2011; Hubisz et al. 2011). DNA of the same animal was used both for the ∼7× coverage sequencing project analyzed here and for a previous low-coverage ∼2× sequencing project (accession AAGW01000000). "
    ABSTRACT: We report on the analyses of genes encoding immunoglobulin heavy and light chains in the rabbit 6.51× whole genome assembly. This OryCun2.0 assembly confirms previous mapping of the duplicated IGK1 and IGK2 loci to chromosome 2 and the IGL lambda light chain locus to chromosome 21. The most frequently rearranged and expressed IGHV1 that is closest to IG DH and IGHJ genes encodes rabbit VHa allotypes. The partially inbred Thorbecke strain rabbit used for whole-genome sequencing was homozygous at the IGK but heterozygous with the IGHV1a1 allele in one of 79 IGHV-containing unplaced scaffolds and IGHV1a2, IGHM, IGHG, and IGHE sequences in another. Some IGKV, IGLV, and IGHA genes are also in other unplaced scaffolds. By fluorescence in situ hybridization, we assigned the previously unmapped IGH locus to the q-telomeric region of rabbit chromosome 20. An approximately 3-Mb segment of human chromosome 14 including IGH genes predicted to map to this telomeric region based on synteny analysis could not be located on assembled chromosome 20. Unplaced scaffold chrUn0053 contains some of the genes that comparative mapping predicts to be missing. We identified discrepancies between previous targeted studies and the OryCun2.0 assembly and some new BAC clones with IGH sequences that can guide other studies to further sequence and improve the OryCun2.0 assembly. Complete knowledge of gene sequences encoding variable regions of rabbit heavy, kappa, and lambda chains will lead to better understanding of how and why rabbits produce antibodies of high specificity and affinity through gene conversion and somatic hypermutation.
    Immunogenetics 08/2013; 65(10). DOI:10.1007/s00251-013-0722-9 · 2.23 Impact Factor
  • Source
    • "A significant amount of sequence data for various organisms has accumulated in databases; however, not all of the information on the genes of interest to researchers is accurate [1,2]. At the same time, researchers are faced with the necessity of cloning the particular genes or non-coding genomic regions to establish or verify their functional role. "
    ABSTRACT: Background: Molecular cloning of DNA fragments >5 kbp is still a complex task. When no genomic DNA library is available for the species of interest, and direct PCR amplification of the desired DNA fragment is unsuccessful or results in an incorrect sequence, molecular cloning of a PCR-amplified region of the target sequence and assembly of the cloned parts by restriction and ligation is an option. Assembled components of such DNA fragments can be connected together by ligating the compatible overhangs produced by different restriction endonucleases. However, designing the corresponding cloning scheme can be a complex task that requires a software tool to generate a list of potential connection sites. Findings: The BIOF program presented here analyzes DNA fragments for all available restriction enzymes and provides a list of potential sites for ligation of DNA fragments with compatible overhangs. The cloning scheme, which is called modular assembly cloning (MAC), is aided by the BIOF program. MAC was tested on a practical dataset, namely, two non-coding fragments of the translation elongation factor 1 alpha gene from Chinese hamster ovary cells. The individual fragment lengths exceeded 5 kbp, and direct PCR amplification produced no amplicons. However, separation of the target fragments into smaller regions, with downstream assembly of the cloned modules, resulted in both target DNA fragments being obtained with few subsequent steps. Conclusions: Implementation of the MAC software tool and the experimental approach adopted here has great potential for simplifying the molecular cloning of long DNA fragments. This approach may be used to generate long artificial DNA fragments such as in vitro spliced cDNAs.
    BMC Research Notes 06/2012; 5(1):303. DOI:10.1186/1756-0500-5-303