ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads.

Broad Institute of MIT and Harvard, Charles Street, Cambridge, MA 02141, USA.
Genome Biology (Impact Factor: 10.47). 10/2009; 10(10):R103. DOI: 10.1186/gb-2009-10-10-r103
Source: PubMed

ABSTRACT We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).
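The accuracy figure quoted above (the fraction of 10-kb stretches that were perfect) can be illustrated with a minimal sketch. This is one plausible reading of the metric, not the authors' procedure: it assumes contigs are cut into non-overlapping windows and that a window counts as perfect when it occurs exactly in the reference on either strand. The function names, the exact-substring test, and the handling of leftover bases are all assumptions made for illustration; circular genomes and alignment-based matching are ignored.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(complement)[::-1]


def perfect_window_fraction(contigs, reference, window=10_000):
    """Fraction of non-overlapping windows that match the reference exactly.

    Assumption: a window is "perfect" if it, or its reverse complement,
    is an exact substring of the reference. Trailing bases shorter than
    one window are not counted.
    """
    ref = reference.upper()
    total = 0
    perfect = 0
    for contig in contigs:
        contig = contig.upper()
        # Step through the contig one full window at a time.
        for start in range(0, len(contig) - window + 1, window):
            chunk = contig[start:start + window]
            total += 1
            if chunk in ref or reverse_complement(chunk) in ref:
                perfect += 1
    return perfect / total if total else 0.0


if __name__ == "__main__":
    # Toy data with a 5-base window instead of 10 kb.
    toy_reference = "ACGTTGCAAGGCTTAACCGG"
    toy_contigs = ["ACGTTGCAAG", "ACGTAGCAAG"]
    print(perfect_window_fraction(toy_contigs, toy_reference, window=5))
```

With the default `window=10_000` the function reports the share of 10-kb chunks that are error-free under these assumptions; a real evaluation would normally align chunks to the reference rather than rely on exact substring search.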

  • ABSTRACT: Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, can require complex decision-making. A key parameter that must be supplied to typical next-generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide whether one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127 bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species. By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply a component of a first pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes.
    Computational Biology and Chemistry 08/2014; 53. DOI:10.1016/j.compbiolchem.2014.08.014 · 1.60 Impact Factor
  • ABSTRACT: Bioinformatics experiments usually require efficient computational systems that streamline the data processing. Recent advances in high-throughput technologies have been expanding the experimental scenario. The result is an avalanche of data that is difficult to manage, turning the biological sciences from a data-poor discipline into a data-rich one. Furthermore, next-generation sequencing (NGS) technologies, created to sequence very long DNA pieces at low cost, are widely used to generate biological data. Unfortunately, bioinformatics tools have not adapted their algorithms and computational techniques to deal with this data explosion. Therefore, the integration of biological data, as a product of those technological advances, is far from being a solved task, although it is one of the most important and basic elements of bioinformatics research and/or Systems Biology projects. Hence, in this thesis, we developed a biological data integration framework (JBioWH) that has a modular design for the integration of the most important biological databases. The framework comprises a Java API for external use, a desktop client and a webservices application. This system has been supplying integrative data for many bioinformatics projects. Also, a program (Taxoner) was developed to identify taxonomies by mapping NGS reads to a comprehensive sequence database. As a result of alterations to the indexing used, this pipeline is fast enough to run evaluations on a single PC and is highly sensitive; it can therefore be adapted to analysis problems such as detecting pathogens in human samples. Finally, a workflow for DNA sequence comparison is presented. This workflow is applied either to create a marker database for taxonomy binning or just to obtain unique DNA segments among a group of target sequences. It is based on a set of in-house developed programs that include JBioWH and Taxoner. All the programs developed are freely available through the Google Code Platform.
    10/2014, Degree: Doctor of Philosophy (Ph.D.), Bioinformatics, Supervisor: Prof. Sándor Pongor
  • ABSTRACT: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance.
    BMC Bioinformatics 08/2014; 15(1):281. DOI:10.1186/1471-2105-15-281 · 2.67 Impact Factor
