ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads.

Broad Institute of MIT and Harvard, Charles Street, Cambridge, MA 02141, USA.
Genome biology (Impact Factor: 10.47). 10/2009; 10(10):R103. DOI: 10.1186/gb-2009-10-10-r103
Source: PubMed

ABSTRACT We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).

  • [Show abstract] [Hide abstract]
    ABSTRACT: Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide if one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species. By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply a component of a first pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes.
    Computational Biology and Chemistry 08/2014; 53. · 1.60 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance.
    BMC Bioinformatics 08/2014; 15(1):281. · 2.67 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: To assess the potential of different types of sequence data, combined with de novo and hybrid assembly approaches to improve existing draft genome sequences.
    Bioinformatics 06/2014; · 4.62 Impact Factor

Full-text (2 Sources)

Available from
May 29, 2014