Finished bacterial genomes from shotgun sequence data

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Genome Research (Impact Factor: 14.63). 07/2012; 22(11). DOI: 10.1101/gr.141515.112
Source: PubMed


Exceptionally accurate genome reference sequences have proven to be of great value to microbial researchers. Thus, to date, about 1800 bacterial genome assemblies have been "finished" at great expense with the aid of manual laboratory and computational processes that typically iterate over a period of months or even years. By applying a new laboratory design and new assembly algorithm to 16 samples, we demonstrate that assemblies exceeding finished quality can be obtained from whole-genome shotgun data and automated computation. Cost and time requirements are thus dramatically reduced.

Available from: Dariusz Przybylski
  • Source
    • "To enable the use of the ALLPATHS-LG genome assembler, we built two specialized Illumina libraries: a "fragment library" with paired-end 300 bp reads (i.e. 2 x 300 bp) and a "jumping library" with mate-pair reads with an average insert size of approximately 6.5 kb. Briefly, ALLPATHS-LG first joins paired-end reads from the fragment library that overlap to create longer reads, from which it builds a de Bruijn graph to construct contigs; the longer insert jumping library is then incorporated into the de Bruijn graph to scaffold the contigs, resolve repeats, and flatten the graph (Ribeiro et al. 2012). Since all Saccharomyces genomes contain Ty retrotransposons that are approximately 6 kb, duplicate gene families, and several other large repeats, a long-read or long-insert scaffolding strategy is critical to providing physical evidence that spans gaps to order and orient contigs."
    ABSTRACT: The dramatic phenotypic changes that occur in organisms during domestication leave indelible imprints on their genomes. Although many domesticated plants and animals have been systematically compared to their wild genetic stocks, the molecular and genomic processes underlying fungal domestication have received less attention. Here we present a nearly complete genome assembly for the recently described yeast species Saccharomyces eubayanus and compare it to the genomes of multiple domesticated alloploid hybrids of S. eubayanus x S. cerevisiae (S. pastorianus syn. S. carlsbergensis), which are used to brew lager-style beers. We find that the S. eubayanus subgenomes of lager-brewing yeasts have experienced increased rates of evolution since hybridization, and that certain genes involved in metabolism may have been particularly affected. Interestingly, the S. eubayanus subgenome underwent an especially strong shift in selection regimes, consistent with more extensive domestication of the S. cerevisiae parent prior to hybridization. In contrast to recent proposals that lager-brewing yeasts were domesticated following a single hybridization event, the radically different neutral site divergences between the subgenomes of the two major lager yeast lineages strongly favor at least two independent origins for the S. cerevisiae x S. eubayanus hybrids that brew lager beers. Our findings demonstrate how this industrially important hybrid has been domesticated along similar evolutionary trajectories on multiple occasions. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
    Molecular Biology and Evolution 08/2015; DOI:10.1093/molbev/msv168 · 9.11 Impact Factor
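The excerpt above describes two steps: joining overlapping paired-end reads into longer reads, then building a de Bruijn graph from them. The following is a minimal illustrative sketch of those two ideas only; it is not the ALLPATHS-LG implementation. The function names, the exact-match overlap rule, and the assumption that the second read has already been reverse-complemented are all hypothetical simplifications.

```python
from collections import defaultdict


def merge_overlapping_pair(fwd, rev_comp, min_overlap):
    """Join two reads when a suffix of `fwd` exactly matches a prefix of `rev_comp`.

    Assumption: `rev_comp` is the second read of the pair, already
    reverse-complemented so both reads are on the same strand.
    Returns the merged read, or None if no overlap of at least
    `min_overlap` bases exists. The longest overlap is tried first.
    """
    longest = min(len(fwd), len(rev_comp))
    for n in range(longest, max(min_overlap, 1) - 1, -1):
        if fwd[-n:] == rev_comp[:n]:
            return fwd + rev_comp[n:]
    return None


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer becomes an edge between
    its (k-1)-base prefix and its (k-1)-base suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# Toy usage: merge one read pair, then index the merged read with k = 4.
merged = merge_overlapping_pair("ACGTAC", "TACGGA", min_overlap=3)
graph = build_de_bruijn([merged], k=4)
```

In this toy input the node `ACG` has two outgoing edges, which is the kind of branch a repeat creates and that longer-insert evidence is needed to resolve.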
  • Source
    • "To date a few algorithms have been released that are capable of upgrading PacBio CLR data with high accuracy data from CCS or short read NGS data, among which PacBioToCA [9] and LSC [10]. These are further incorporated into hybrid assembly methods such as Celera [11], MIRA [12] and ALLPATHS-LG [13]. Even though promising results have been obtained, the error-correction step with short reads requires a sufficient read length (>75 bp) and sequencing depth, as well as large computational demands."
    ABSTRACT: Background: The recent introduction of the Pacific Biosciences RS single molecule sequencing technology has opened new doors to scaffolding genome assemblies in a cost-effective manner. The long read sequence information is promised to enhance the quality of incomplete and inaccurate draft assemblies constructed from Next Generation Sequencing (NGS) data. Results: Here we propose a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone. On a test set comprising six bacterial draft genomes, assembled using either a single Illumina MiSeq or Roche 454 library, we show that even a 50× coverage of uncorrected PacBio RS long reads is sufficient to drastically reduce the number of contigs. Comparisons to the AHA scaffolder indicate our strategy is better capable of producing (nearly) complete bacterial genomes. Conclusions: The current work describes our SSPACE-LongRead software which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow to scaffold genomes in a fast and reliable manner.
    BMC Bioinformatics 06/2014; 15(1):211. DOI:10.1186/1471-2105-15-211 · 2.58 Impact Factor
  • Source
    • "MP libraries are capable of resolving repetitive regions and structural variants while increasing the accuracy and size of assembled contigs (Ribeiro et al., 2012). Short reads could be best assembled through de Bruijn Graph (DBG) assembly approach (Miller et al., 2010)."
    ABSTRACT: Motivation: To assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Results: Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. Availability and implementation: All assembly tools except CLC Genomics Workbench are freely available under GNU General Public License. Contact: Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 06/2014; 30(19). DOI:10.1093/bioinformatics/btu391 · 4.98 Impact Factor
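The abstract above ranks assemblies by summary statistics such as the number of contigs and N50. As a reminder of what N50 measures, here is a minimal sketch using the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name and the example lengths are illustrative assumptions, not taken from the cited study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Toy usage with made-up contig lengths (total 20, so half is 10).
example_n50 = n50([2, 3, 4, 5, 6])
```

A higher N50 for the same total size means the assembly is concentrated in fewer, longer contigs, which is why it is reported alongside the contig count.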