GAGE: A critical evaluation of genome assemblies and assembly algorithms

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Genome Research 12/2011; 22(3):557–567. DOI: 10.1101/gr.131383.111


New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
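The third conclusion above, that contiguity statistics are poorly correlated with correctness, is easier to appreciate with the statistic in hand: N50 is the length of the shortest contig such that contigs at least that long cover half of the total assembly span (GAGE also reports a corrected N50, computed after breaking contigs at misassemblies). A minimal Python sketch; the function name and example lengths are ours, not from the paper:

```python
# Minimal sketch of the N50 contiguity statistic.
# Example values below are invented for illustration.

def n50(contig_lengths):
    """Length of the shortest contig such that contigs at least this
    long cover >= 50% of the total assembly span."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # empty assembly

print(n50([500, 400, 300, 200, 100]))  # -> 400
```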

Cited by:
    • "All rights reserved. set has inspired several benchmarking approaches, such as the Assemblathons (Earl et al. 2011; Bradnam et al. 2013), Gage (Salzberg et al. 2012; Magoc et al. 2013) and most recently CAMI (http://www.camichallenge .org). "
    ABSTRACT: Whole-genome shotgun sequencing of multi-species communities using only a single library layout is commonly used to assess the taxonomic and functional diversity of microbial assemblages. Here we investigate to what extent such metagenome skimming approaches are applicable for in-depth genomic characterization of eukaryotic communities, e.g., lichens. We address how best to assemble a particular eukaryotic metagenome skimming data set, what pitfalls can occur, and what genome quality can be expected from these data. To facilitate project-specific benchmarking, we introduce the concept of twin sets: simulated data resembling the outcome of a particular metagenome sequencing study. We show that the quality of genome reconstructions depends essentially on assembler choice. Individual tools, including the metagenome assemblers Omega and MetaVelvet, are surprisingly sensitive to low and uneven coverage. Combined with the routine practice of choosing assembly parameters to maximize N50 (a pitfall sketched after this entry), these tools can exclude an entire genome from the assembly. In contrast, MIRA, an all-purpose overlap assembler, and SPAdes, a multi-sized de Bruijn graph assembler, provide a comprehensive view of the individual genomes across a wide range of coverage ratios. Testing assemblers on real-world metagenome skimming data from the lichen Lasallia pustulata demonstrates the applicability of twin sets for guiding method selection. It also reveals that the assembly outcome for the photobiont Trebouxia sp. falls short of the a priori expectation given the simulations. Although the underlying reasons remain unclear, this highlights that further studies on this organism require special attention during sequence data generation and downstream analysis.
    Molecular Ecology Resources 09/2015; DOI: 10.1111/1755-0998.12463
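A toy illustration of the N50-optimization pitfall described in the abstract above. The k-mer sizes and all numbers are invented, not taken from the paper:

```python
# Toy illustration (invented numbers): picking a k-mer size purely by
# N50 can discard the assembly that actually recovers a low-coverage
# member of the community.

candidates = {
    # k-mer size: (assembly N50 in bp, fraction of low-coverage genome recovered)
    21: (15_000, 0.95),
    31: (40_000, 0.88),
    41: (90_000, 0.40),  # longest contigs, but the minor genome is mostly lost
}

best_by_n50 = max(candidates, key=lambda k: candidates[k][0])
best_by_recovery = max(candidates, key=lambda k: candidates[k][1])

print(best_by_n50)       # -> 41 (the N50-optimal parameter)
print(best_by_recovery)  # -> 21 (the parameter that keeps the whole community)
```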
    • "additional complications inherent in obtaining genomes using these approaches (Dick et al. 2010; Albertsen et al. 2013). The quality of isolate genomes has traditionally been evaluated using assembly statistics such as N50 (Salzberg et al. 2012; Gurevich et al. 2013), while single cell and metagenomic studies have relied on the presence and absence of universal single-copy 'marker' genes for estimating genome completeness (Wrighton et al. 2012; Haroon et al. 2013; Rinke et al. 2013; Sharon 65 et al. 2013). However, the accuracy of this completeness estimate has not been evaluated and the approach is likely to be limited by both the uneven distribution of universal marker genes across a genome and their low number, typically accounting for <10% of all genes (Sharon and Banfield 2013). "
    ABSTRACT: Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. While this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of 'marker' genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree, together with information about the collocation of these genes (a simplified sketch of the marker-gene logic follows this entry). We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate, single-cell, and metagenome-derived genomes. CheckM provides accurate estimates of genome completeness and contamination and outperforms existing approaches. Using CheckM, we identify a diverse range of errors currently affecting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. To facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
    Genome Research 05/2015; 25(7). DOI: 10.1101/gr.186072.114
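As promised above, a simplified sketch of the marker-gene reasoning behind completeness and contamination estimates. This is our toy reduction of the idea, not CheckM's actual algorithm (CheckM additionally uses lineage-specific marker sets and gene collocation); all names and values below are illustrative:

```python
# Toy sketch: estimate completeness and contamination from the presence
# and copy number of single-copy marker genes. Simplified relative to CheckM.

from collections import Counter

def completeness_contamination(found_markers, expected_markers):
    """found_markers:    marker-gene IDs detected in the draft genome
                         (a marker may appear more than once)
       expected_markers: set of single-copy markers expected for the lineage"""
    counts = Counter(m for m in found_markers if m in expected_markers)
    completeness = len(counts) / len(expected_markers)
    # Extra copies of single-copy markers suggest contamination.
    extra = sum(c - 1 for c in counts.values())
    contamination = extra / len(expected_markers)
    return completeness, contamination

# Example: 3 of 4 expected markers found, one of them twice.
comp, cont = completeness_contamination(
    ["rpoB", "gyrB", "recA", "recA"], {"rpoB", "gyrB", "recA", "dnaK"})
print(f"completeness={comp:.0%} contamination={cont:.0%}")  # -> 75% / 25%
```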
    • "We performed de novo assembly of the short reads in the program Abyss v. 1.3 (Simpson et al. 2009). Based on previous studies (Salzberg et al. 2012; Briscoe et al. 2013) and preliminary results (S. Baxter, pers. "
    ABSTRACT: Müllerian mimicry among Neotropical Heliconiini butterflies is an excellent example of natural selection, associated with the diversification of a large continental-scale radiation. Some of the processes driving the evolution of mimicry rings are likely to generate incongruent phylogenetic signals across the assemblage and thus pose a challenge for systematics. We use a data set of 22 mitochondrial and nuclear markers from 92% of species in the tribe, obtained by Sanger sequencing and de novo assembly of short-read data, to re-examine the phylogeny of Heliconiini with both supermatrix and multispecies coalescent approaches, characterize the patterns of conflicting signal, and compare how well various methodological approaches reflect the heterogeneity across the data. Despite the large extent of reticulate signal and strong conflict between markers, nearly identical topologies are consistently recovered by most of the analyses, although the supermatrix approach failed to reflect the underlying variation in the history of individual loci. However, the supermatrix represents a useful approximation where multiple rare species represented by short sequences can be incorporated easily. The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification-rate increase driven by the dramatic environmental changes in the Neotropics over the past 23 million years, or of changes caused by diversity-dependent effects on the rate of diversification. We find that the rate of diversification has increased on the branch leading to the presently most species-rich genus, Heliconius, but the change occurred gradually and cannot be unequivocally attributed to a specific environmental driver. Our study provides a comprehensive comparison of philosophically distinct species-tree reconstruction methods and offers insights into the diversification of an important insect radiation in the most biodiverse region of the planet.
    Systematic Biology 01/2015; 64(3). DOI: 10.1093/sysbio/syv007