GAGE: A critical evaluation of genome assemblies and assembly algorithms

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Genome Research (Impact Factor: 14.63). 12/2011; 22(3):557-67. DOI: 10.1101/gr.131383.111
Source: PubMed


New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
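The abstract's third conclusion concerns "statistics on contiguity" without naming one. A common such statistic is N50, used here as an assumed example. The sketch below shows how it is typically computed from a list of contig lengths; the function name `n50` and the sample lengths are hypothetical and are not taken from the study's data or code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L of the contig at which, walking from the longest
    contig to the shortest, the running total first reaches at least half
    of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bp) for two made-up assemblies.
assembly_a = [80, 70, 50, 40, 30, 20]
assembly_b = [100, 100, 10, 10]

print(n50(assembly_a))
print(n50(assembly_b))
```

Note that N50 is computed from lengths alone, so a misjoined contig raises it just as a correct one would. This is consistent with the abstract's point that contiguity and correctness need to be assessed separately.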



Available from: Steven Salzberg, Sep 17, 2015
    • "The genome is large, around 17 Gbp and is predominantly made up of repeat elements [17,18]. Sugarcane varieties have smaller genomes, around 10 Gbp [8,19], but most are hybrids of two species, Saccharum spontaneum and Saccharum officinarum, S. officinarum being an octoploid with 2n = 80 chromosomes and S. spontaneum demonstrating varying ploidy (5–16x) and 2n chromosome number ranging from 40 to 128. The complexity of the bread wheat and sugarcane genomes makes producing reference genome assemblies a challenge."
    ABSTRACT: Background: There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data, which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, with variable success. Results: We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from bread wheat and sugarcane genomes. Conclusions: We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.
    Preview · Article · Dec 2016 · Plant Methods
    • "set has inspired several benchmarking approaches, such as the Assemblathons (Earl et al. 2011; Bradnam et al. 2013), Gage (Salzberg et al. 2012; Magoc et al. 2013) and most recently CAMI (http://www.camichallenge.org)."
    ABSTRACT: Whole genome shotgun sequencing of multi species communities using only a single library layout is commonly used to assess taxonomic and functional diversity of microbial assemblages. Here we investigate to what extent such metagenome skimming approaches are applicable for in-depth genomic characterizations of eukaryotic communities, e.g. lichens. We address how to best assemble a particular eukaryotic metagenome skimming data set, what pitfalls can occur, and what genome quality can be expected from this data. To facilitate a project specific benchmarking, we introduce the concept of twin sets, simulated data resembling the outcome of a particular metagenome sequencing study. We show that the quality of genome reconstructions depends essentially on assembler choice. Individual tools, including the metagenome assemblers Omega and MetaVelvet, are surprisingly sensitive to low and uneven coverages. In combination with the routine of assembly parameter choice to optimize the assembly N50 size, these tools can preclude an entire genome from the assembly. In contrast, MIRA, an all-purpose overlap assembler, and SPAdes, a multi-sized de Bruijn graph assembler, facilitate a comprehensive view on the individual genomes across a wide range of coverage ratios. Testing assemblers on a real-world metagenome skimming data set from the lichen Lasallia pustulata demonstrates the applicability of twin sets for guiding method selection. Furthermore, it reveals that the assembly outcome for the photobiont Trebouxia sp. falls behind the a-priori expectation given the simulations. Although the underlying reasons remain unclear, this highlights that further studies on this organism require special attention during sequence data generation and downstream analysis.
    Full-text · Article · Sep 2015 · Molecular Ecology Resources
    • "additional complications inherent in obtaining genomes using these approaches (Dick et al. 2010; Albertsen et al. 2013). The quality of isolate genomes has traditionally been evaluated using assembly statistics such as N50 (Salzberg et al. 2012; Gurevich et al. 2013), while single cell and metagenomic studies have relied on the presence and absence of universal single-copy 'marker' genes for estimating genome completeness (Wrighton et al. 2012; Haroon et al. 2013; Rinke et al. 2013; Sharon et al. 2013). However, the accuracy of this completeness estimate has not been evaluated and the approach is likely to be limited by both the uneven distribution of universal marker genes across a genome and their low number, typically accounting for <10% of all genes (Sharon and Banfield 2013)."
    ABSTRACT: Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. While this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of 'marker' genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate, single cell and metagenome derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination, and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
    Full-text · Article · May 2015 · Genome Research