Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315-327 (2010)

J. Craig Venter Institute, Rockville, MD 20850-3343, USA.
Genomics (Impact Factor: 2.28). 03/2010; 95(6):315-27. DOI: 10.1016/j.ygeno.2010.03.001
Source: PubMed


The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
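
As a rough illustration of the de Bruijn graph approach named in the abstract, the following minimal sketch breaks reads into k-mers and links each k-mer's (k-1)-base prefix to its (k-1)-base suffix, then walks the graph while the path is unambiguous. The function names, the toy reads, and the choice of k = 4 are illustrative assumptions and are not taken from any of the reviewed packages. The sketch also ignores sequencing errors, reverse complements, and coverage, all of which real assemblers must handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


def extend_contig(graph, start):
    """Walk forward from `start` for as long as the path is unambiguous."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        # Stop at a merge point or when the walk would loop back on itself.
        if indegree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free reads sampled from the toy sequence "ATGGCGTGCA".
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
toy_graph = build_de_bruijn(toy_reads, k=4)
toy_contig = extend_contig(toy_graph, "ATG")
```

Because the toy sequence has no repeated 3-mers, the graph is a single path and the walk recovers the whole sequence. A repeat longer than k - 1 bases would create a branch, and the walk would stop there.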

Available from: Granger Sutton
  • Source
    • "). The number of read-to-read comparisons and the storing of this information quickly exceed the memory available on even very large memory machines. A series of more memory-efficient methods based on de Bruijn graphs have been developed to tackle this assembly problem (Pevzner et al., 2001) and reviewed in (Pop, 2009; Miller et al., 2010). Due to the increased cost-effectiveness, and to a lesser extent, the throughput of the newer, next-generation sequencing platforms, the number of shotgun metagenome projects in the microbiology field has surged. Today, thousands of projects are underway, exploring systems of low complexity, such as acid mine drainage (T"
    ABSTRACT: Metagenomic investigations hold great promise for informing the genetics, physiology, and ecology of environmental microorganisms. Current challenges for metagenomic analysis are related to our ability to connect the dots between sequencing reads, their population of origin, and their encoding functions. Assembly-based methods reduce dataset size by extending overlapping reads into larger contiguous sequences (contigs), providing contextual information for genetic sequences that does not rely on existing references. These methods, however, tend to be computationally intensive and are again challenged by sequencing errors as well as by genomic repeats. While numerous tools have been developed based on these methodological concepts, they present confounding choices and training requirements to metagenomic investigators. To help with accessibility to assembly tools, this review also includes an IPython Notebook metagenomic assembly tutorial. This tutorial has instructions for execution on any operating system using Amazon Elastic Compute Cloud and guides users through downloading, assembly, and mapping reads to contigs of a mock microbiome metagenome. Despite its challenges, metagenomic analysis has already revealed novel insights into many environments on Earth. As software, training, and data continue to emerge, metagenomic data access and its discoveries will continue to grow.
    Full-text · Article · Jul 2015 · Frontiers in Microbiology
  • Source
    • "To deal with the sheer number of reads generated by today's high-throughput NGS platforms, many of these new assembly tools utilize “Kmers” (words of length K) and de Bruijn graphs, as the method of choice for generating assembled contiguous sequence fragments (contigs). Each assembler yields different results (contigs and associated information), with some capable of generating ordered contigs if mate pair libraries are available [2]. Additionally, the results of any assembler can vary when altering any of a number of parameters, such as Kmer size selection, expected coverage, coverage cutoff, edge trimming or other tool-specific options [3]."
    ABSTRACT: Assembly of metagenomic samples is a very complex process, with algorithms designed to address sequencing platform-specific issues (read length, data volume, and/or community complexity) while also faced with genomes that differ greatly in nucleotide compositional biases and in abundance. To address these issues, we have developed a post-assembly process: MetaGenomic Assembly by Merging (MeGAMerge). We compare this process to the performance of several assemblers, using both real and in silico generated samples of different community composition and complexity. MeGAMerge consistently outperforms individual assembly methods, producing larger contigs with an increased number of predicted genes, without replication of data. MeGAMerge contigs are supported by read mapping and contig alignment data, when using synthetically derived and real metagenomic data, as well as by gene prediction analyses and similarity searches. MeGAMerge is a flexible method that generates improved metagenome assemblies, with the ability to accommodate upcoming sequencing platforms, as well as present and future assembly algorithms.
    Full-text · Article · Oct 2014 · Scientific Reports
  • Source
    • "Current approaches to assemble transcriptomes de novo from short read data are predominantly based on initially identifying contiguous sequence by creating a de-Bruijn graph of overlapping k-mers [1]–[3], [7], [8]. Factors that affect de-Bruijn graph assembly include the degree of heterozygosity, repeats in the underlying sequence, and the sequencing error rate [9], [10]. Algorithms have been developed with strategies in mind to deal with these challenges."
    ABSTRACT: Background: Perennial ryegrass is a highly heterozygous outbreeding grass species used for turf and forage production. Heterozygosity can affect de-Bruijn graph assembly, making de novo transcriptome assembly of species such as perennial ryegrass challenging. Creating a reference transcriptome from a homozygous perennial ryegrass genotype can circumvent the challenge of heterozygosity. The goals of this study were to perform RNA-sequencing on multiple tissues from a highly inbred genotype to develop a reference transcriptome. This was complemented with RNA-sequencing of a highly heterozygous genotype for SNP calling. Result: De novo transcriptome assembly of the inbred genotype created 185,833 transcripts with an average length of 830 base pairs. Within the inbred reference transcriptome, 78,560 predicted open reading frames were found, of which 24,434 were predicted as complete. Functional annotation found 50,890 transcripts with a BLASTp hit from the Swiss-Prot non-redundant database, 58,941 transcripts with a Pfam protein domain and 1,151 transcripts encoding putative secreted peptides. To evaluate the reference transcriptome we targeted the high-affinity K+ transporter gene family and found multiple orthologs. Using the longest unique open reading frames as the reference sequence, 64,242 single nucleotide polymorphisms were found. One thousand sixty one open reading frames from the inbred genotype contained heterozygous sites, confirming the high degree of homozygosity. Conclusion: Our study has developed an annotated, comprehensive transcriptome reference for perennial ryegrass that can aid in determining genetic variation, expression analysis, genome annotation, and gene mapping.
    Full-text · Article · Aug 2014 · PLoS ONE
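
The memory argument in the first excerpt and the k-mer definition in the second can be made concrete with a small calculation. The sketch below assumes a naive all-against-all read comparison, which is a simplification (practical overlap-based assemblers use indexing to avoid testing every pair), and contrasts it with the set of distinct k-mers that a de Bruijn graph would store. The function names and the toy genome are hypothetical and chosen only for illustration.

```python
def naive_pair_count(num_reads):
    """Number of unordered read pairs in an all-against-all comparison."""
    return num_reads * (num_reads - 1) // 2


def distinct_kmers(reads, k):
    """Set of distinct words of length k found in the reads."""
    return {
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    }


# Candidate pairs grow quadratically with the number of reads:
# 10**6 reads -> 499_999_500_000 pairs.
pairs_for_million_reads = naive_pair_count(10**6)

# Hypothetical toy genome, sampled with one error-free 8-base read per start position.
toy_genome = "ACGTTGCAAGGCTTACGGAT"
toy_reads = [toy_genome[i:i + 8] for i in range(len(toy_genome) - 8 + 1)]

# However many reads cover the genome, error-free reads contribute no k-mers
# beyond those already present in the genome itself.
kmers_from_reads = distinct_kmers(toy_reads, 5)
kmers_from_genome = distinct_kmers([toy_genome], 5)
```

In this idealized setting, adding coverage raises the pair count but not the k-mer set. Sequencing errors break that bound in practice, since each error can introduce new k-mers.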