
The MaSuRCA genome assembler

August 2013
Bioinformatics 29(21)


DOI:10.1093/bioinformatics/btt476


Authors:

Guillaume Marcais

Carnegie Mellon University

Michael Roberts

University of Maryland, College Park


Second-generation sequencing technologies produce high coverage of the genome by short reads at a very low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this paper we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms very large numbers of paired-end reads into a much smaller number of longer "super-reads." The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced "mazurka"). We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two data sets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. Aleksey Zimin, alekseyz@ipst.umd.edu.

Reads 1, 2 and 3 yield the same super-read. Reads are depicted by the black solid lines. Dashed lines represent the k-mer extensions starting from k-mers in the reads 1, 2 and 3. The super-read is depicted by the thick solid line. All reads that extend to the same super-read are replaced by that super-read


An example of a read whose super-read has two k-unitigs. Read R contains k-mers M 1 and M 2 on its ends. M 1 and M 2 each belong to k-unitigs K 1 and K 2 , respectively. K-unitigs K 1 and K 2 are shown in blue, and the matching k-mers M 1 and M 2 are shown in red and green. K 1 and K 2 overlap by k-1 bases. We extend read R on both ends producing a super-read, also depicted in blue. A super-read can consist of one k-unitig or can contain many k-unitigs


Comparison of the assemblies of mouse chromosome 16 using Illumina-only data (top three rows) and MaSuRCA using a mixture of Illumina data and long Sanger reads (bottom)


Content may be subject to copyright.

Content uploaded by Steven Salzberg


Vol. 29 no. 21 2013, pages 2669–2677

BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btt476

Genome analysis Advance Access publication August 29, 2013

The MaSuRCA genome assembler

Aleksey V. Zimin, Guillaume Marçais, Daniela Puiu, Michael Roberts, Steven L. Salzberg and James A. Yorke [1,3,4]

[1] Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742, USA, [2] Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA, [3] Department of Mathematics and [4] Department of Physics, University of Maryland, College Park, MD 20742, USA

Associate Editor: John Hancock

ABSTRACT

Motivation: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer ‘super-reads’. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced ‘mazurka’).

Results: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.

Availability: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year.

Contact: alekseyz@ipst.umd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on May 20, 2013; revised on August 6, 2013; accepted on August 9, 2013

1 INTRODUCTION

Following the creation of draft versions of the human genome in 2001, many small and large genomes were sequenced using first-generation (i.e. Sanger) sequencing technology, with read lengths exceeding 800 bp. More recently, a variety of second-generation sequencing (SGS) technologies have appeared with read lengths ranging from 50 to 400 bp. The lowest-cost sequencers today produce 100-bp reads at a cost many thousands of times lower than Sanger sequencing. New assembly methods have been developed in response to the challenge of short-read assembly, and they have steadily improved in recent years. Despite this progress, determining the sequence of a genome is far from a solved problem. Virtually all assemblies published today are ‘draft’ genomes with varying levels of quality, containing many gaps and assembly errors that present significant problems for scientists who rely on these genomes for downstream analysis. This article reports progress in assembling genomes facilitated by a new approach to genome assembly. First, we briefly describe the two general approaches that have been used for assembly of whole-genome shotgun sequencing data.

Overlap–layout–consensus (OLC) assembly. Briefly, the OLC paradigm first attempts to compute all pairwise overlaps between reads, using sequence similarity to determine overlap. Then an OLC algorithm creates a layout, which is an alignment of all overlapping reads. From this layout, the algorithm extracts a consensus sequence by scanning the multiread alignment, column by column. Most assemblers for Sanger sequencing data, including Celera Assembler (Miller et al., 2008; Myers et al., 2000), PCAP (Huang, 2003), Arachne (Batzoglou et al., 2002) and Phusion (Mullikin and Ning, 2003), are based on the OLC approach.

Two of the main benefits of the OLC approach are flexibility with respect to read lengths and robustness to sequencing errors. To improve the likelihood that apparent overlaps are real (and not repeat-induced), OLC algorithms typically require them to exceed some minimum length, e.g. the Celera Assembler requires overlaps of 40 bp or longer, allowing for a small error rate (1–2%) in the overlapping region.

To compensate for shorter read length and lack of uniformity in coverage, SGS de novo assembly projects typically generate 100 times as many reads as Sanger-sequencing projects; e.g. the original human (Lander et al., 2001; Venter et al., 2001) and mouse (Mouse Genome Sequencing Consortium et al., 2002) projects generated 35 million reads each, whereas recent human sequencing projects (Li et al., 2010) generated 3–4 billion reads. The de Bruijn graph approach avoids the pairwise overlap computation entirely, which is one reason why it has become the leading method for SGS assembly.

*To whom correspondence should be addressed.

at Milton S. Eisenhower Library/ Johns Hopkins University on September 17, 2015http://bioinformatics.oxfordjournals.org/Downloaded from

The de Bruijn graph approach. The de Bruijn graph assembly algorithm was pioneered by Pevzner et al. (Idury and Waterman, 1995; Pevzner, 1989), who first implemented these ideas in the Euler assembler (Pevzner et al., 2001). Although Euler was designed for Sanger reads, the same general framework has been adopted recently by programs for assembling SGS data, and for Illumina read data in particular. Recently developed assemblers that use the de Bruijn strategy include Allpaths-LG (Gnerre et al., 2010), SOAPdenovo (Li et al., 2008), Velvet (Zerbino and Birney, 2008), EULER-SR (Chaisson and Pevzner, 2008) and ABySS (Simpson et al., 2009).

This approach begins by creating a de Bruijn graph from the read data, as follows. For a fixed value k, every substring of length k (a k-mer) from every read is assigned to a directed edge in a graph connecting nodes A and B. Nodes A and B correspond to the first and last k-1 nucleotides of the original k-mer. Any path through the graph that visits every edge exactly once, formally known as an Eulerian path, forms a draft assembly of the reads. In practice, these graphs are complex with many intersecting cycles, and many alternative Eulerian paths, and therefore creating the graph is merely the first small step in creating a good draft assembly. Complete assembly requires incorporating mate pair information into the graph and attempting to disentangle the many complex cycles created by repetitive sequences. Because the k-mers are shorter than reads, the graph contains less information than the reads, so the reads need to be retained for later use in disambiguating paths in the graph.
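The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration and not code from any of the cited assemblers: the function name de_bruijn_edges is hypothetical, reverse complements are ignored, and repeated k-mers are collapsed into a single edge rather than kept with their multiplicity.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to.

    Every distinct k-mer in the reads contributes one directed edge from
    its first k-1 bases to its last k-1 bases.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; the shared k-mer CGTC yields a single edge.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_edges(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Collapsing duplicate k-mers keeps the sketch short; an assembler that relies on k-mer counts (for example, to separate errors from true sequence) would store a count per edge instead.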

The main benefit of this approach is its computational efficiency, which it gains from the fact that the immense number of overlaps is not computed. The main drawbacks are loss of k-mer adjacency information in the graph and spurious branching caused by errors in the data.

Super-reads, a new alternative. In this article, we propose a third paradigm for assembly of short-read data, based on the creation of what we call super-reads. The aim is to create a set of super-reads that contains all of the sequence information present in the original reads despite the fact that there are far fewer super-reads than original reads. For the ideal error-free case, see the Theorem below.

The basic concept of super-reads is to extend each original read forwards and backwards, base by base, as long as the extension is unique. The concept can be explained as follows. We create a k-mer count look-up table (using an efficient hash table) to determine quickly how many times each k-mer occurs in our reads. Given a k-mer found at the end of a read, there are four possible k-mers that could be the next k-mer in a genome’s sequence: these are the strings formed by appending A, C, G or T to the last k-1 bases in the read. Our algorithm looks up which of these k-mers occur in the table. If only one of the four possible k-mers occurs, we say the read has a unique following k-mer and we append that base to the read. We continue until the read can no longer be extended uniquely; i.e. there is more than one possible continuing base, or we have reached a dead end and no base is permissible. We perform this extension on both the 3′ and 5′ ends of the read. The new longer string is called a super-read. Many reads extend to the same super-read as shown in Figure 1. Notice that if two reads have an interior difference by even one base (as would occur, for example, if they derived from two non-identical repeats or from two divergent haplotypes), then they will generate distinct super-reads. Of course super-reads can easily be computed using a de Bruijn graph. The point is that once the super-reads are created, they, together with mate pairs that connect super-reads, collectively replace the de Bruijn graph. Incorporation of mate-pair information is carried out using the OLC assembly step described below. The two most important properties of the super-read data computation are as follows:

• each of the original reads is contained in a super-read (so no information has been lost); and

• many of the original reads yield the same super-read, so using super-reads leads to a vastly reduced dataset.
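The extension rule described above can be illustrated with a short Python sketch. It is a simplified reading of the text, not the MaSuRCA implementation: all function names are hypothetical, a plain Counter stands in for the efficient hash table, reverse-complement k-mers are ignored, k is assumed to be at least 2, and the max_len cap is an added safeguard so that a circular sequence cannot be extended forever.

```python
from collections import Counter

BASES = "ACGT"


def count_kmers(reads, k):
    """Build the k-mer count look-up table for a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seq, counts, k, max_len=100_000):
    """Append bases while exactly one of the four candidate k-mers occurs."""
    while len(seq) < max_len:
        tail = seq[-(k - 1):]
        candidates = [b for b in BASES if counts.get(tail + b, 0) > 0]
        if len(candidates) != 1:  # dead end or more than one continuation
            break
        seq += candidates[0]
    return seq


def extend_left(seq, counts, k, max_len=100_000):
    """Mirror image of extend_right for the other end of the read."""
    while len(seq) < max_len:
        head = seq[:k - 1]
        candidates = [b for b in BASES if counts.get(b + head, 0) > 0]
        if len(candidates) != 1:
            break
        seq = candidates[0] + seq
    return seq


def super_read(read, counts, k):
    """Extend a read uniquely in both directions."""
    return extend_left(extend_right(read, counts, k), counts, k)


# Toy example: error-free 7-bp reads tiling a repeat-free 13-bp sequence.
genome = "ATGGCGTACCTGA"
reads = [genome[i:i + 7] for i in range(len(genome) - 6)]
counts = count_kmers(reads, k=4)
super_reads = {super_read(r, counts, k=4) for r in reads}
print(len(reads), "reads ->", len(super_reads), "super-read")
```

In this toy case every read extends to the same string, so seven reads collapse into one super-read, which is the reduction effect the two properties above describe.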

Hundreds of times fewer super-reads than reads. MaSuRCA uses a modified version of the CABOG assembler (Miller et al., 2008) for the overlap-based assembly following super-read construction. In creating its fundamental unit, the unitig, CABOG uses only ‘maximal’ reads, i.e. reads that are not proper substrings of other larger reads. In principle, this could cause assembly errors, but in practice they seem to be rare. Because of this practice, we carry out one extra step: the only super-reads we use are maximal super-reads, i.e. those that are not exact substrings of another super-read. We then assemble the maximal super-reads along with other available data including mate pairs with the modified CABOG assembler. The ‘other data’ include jumping libraries and, possibly, 454 read data, Sanger read data and mate pairs.
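The maximality filter described above amounts to discarding every string that occurs inside another one. The following Python sketch shows the idea under simplifying assumptions: the function name maximal_strings is hypothetical, the quadratic scan is only suitable for small inputs, and reverse complements are not considered.

```python
def maximal_strings(strings):
    """Keep only strings that are not exact substrings of another string."""
    # Process longest first: any superstring of s is at least as long as s,
    # and if that superstring was itself dropped, the string that contains
    # it has already been kept and contains s as well.
    unique = sorted(set(strings), key=len, reverse=True)
    kept = []
    for s in unique:
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


candidates = ["ACGTAC", "CGTA", "TTTT", "ACGTAC", "TT"]
print(sorted(maximal_strings(candidates)))
```

A production implementation would presumably use an index (for example, a suffix structure or k-mer hashing) rather than pairwise substring tests.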

We observe that the coverage of the genome by maximal super-reads typically varies from 2× to 3×, independent of whether the raw read coverage is 50×, 100× or even higher. Note that each heterozygous single nucleotide polymorphism increases the number of super-reads. For a haploid genome, super-reads will tend towards 2× coverage, whereas for highly heterozygous diploid genomes the super-read coverage may be closer to 4×. In the two example genomes described in the Results section, Rhodobacter sphaeroides and Mus musculus, the reads outnumber the maximal super-reads by factors of 400 and 300, respectively. The N50 lengths for the super-reads themselves are 3314 and 2241 bp, respectively. MaSuRCA automatically chooses the k-mer size for creating super-reads, and in these two cases, k is 33 and 69, respectively.

The following Theorem lays the theoretical foundation for the equivalence of assemblies made from the original reads and from the super-reads for the case of perfect error-free reads.

Fig. 1. Reads 1, 2 and 3 yield the same super-read. Reads are depicted by the black solid lines. Dashed lines represent the k-mer extensions starting from k-mers in the reads 1, 2 and 3. The super-read is depicted by the thick solid line. All reads that extend to the same super-read are replaced by that super-read


The super-reads theorem for the ideal case of perfect (error-free) reads. To understand the underpinnings of the super-reads approach, we consider the simplest case. Here we ignore mate pairs. The above construction of super-reads is based on a fixed k-mer size, so for clarity we can call them k-super-reads. The genome is a collection of strings (chromosomes) and loops (plasmids or organelles) with a four-letter alphabet. To avoid end effects, we assume that all DNA in the genome is circular, as is often the case for bacteria, but we shall still speak of their ‘substrings’. A string (read or super-read) is called perfect if it is identical to a substring of the genome. Such a substring of the genome together with its coordinates is called a placement. A string may have multiple placements. We say that a set of strings R is k-perfect with respect to a genome G if (i) every base of the genome G is covered by some placement, and (ii) adjacent placements overlap by at least k bases.

When a set of reads is k-perfect, we can distill the information in the reads into the usually much smaller set of k-super-reads. The following result says that k-super-reads contain all of the information in the reads.

THEOREM. Assume a set of reads is k-perfect for some genome G. Then the corresponding set of k-super-reads has the same property.

In other words, the set of super-reads contains all of the information in the reads, and they have introduced no errors.

The proof follows from the construction of super-reads. Because each read is contained in a super-read, no data are lost. At the same time, if the original read data are inadequate for deducing what the genome is, then so are the super-read data. If both flanks of some copy of the repeat were not covered in the read set, there would be no maximal k-super-read that could be placed at that copy of the repeat.

In practice reads are not perfect, and because the super-reads can only represent the information in the original reads, there will always be some super-reads that contain errors that were in the original reads. The task of the assembly algorithm used downstream of super-reads is to detect and correct most of the errors and create a mostly correct assembly. Assemblers have long been designed to do exactly that, because reads used in assembly projects were never assumed to be perfect. The MaSuRCA assembler benefits from the advanced assembly techniques in the CABOG assembler for creating contigs and scaffolds from super-reads.

2 RESULTS

Here we report on the application of the MaSuRCA assembler version 2.0 to the assembly of two organisms: the bacterium R.sphaeroides str. 2.4.1 (Rhodobacter) and chromosome 16 of M.musculus lineage B6 (mouse). Note that MaSuRCA 2.0 is a new release of a system that was formerly known as MSR-CA.

Choice of genomes. These genomes were chosen for several reasons. First, they represent two widely different challenges: a small clonal genome and a much larger diploid one. Second, the finished sequence is available for both genomes, which allows us to evaluate correctness of the assemblies. Third, one of the assemblers with which we compared MaSuRCA, Allpaths-LG, requires that data are generated according to a preset recipe, which was followed for both the Rhodobacter and mouse datasets. Fourth, for both of these genomes, up to 9× coverage by long (Sanger) reads is available. This allows us to show how the MaSuRCA assembler can benefit from combining Illumina data with additional relatively low (1×–4×) genome coverage by long reads (LR). We also compare the performance of MaSuRCA with the performance of CABOG run on only the 9× Sanger data for the bacterial dataset. Although we used Sanger sequencing technology for LR, similar improvements in assembly results can be achieved using the latest Roche 454 sequencing technology. We do not require the LR to have mate pairs, and we did not use mate-pair information for the Sanger reads in our experiments on the mouse chromosome assembly.

We also chose the R.sphaeroides bacterium because it has a high GC content, 68%, which makes it challenging to sequence. Illumina technology is much less effective in high GC-content regions. In particular, even though we used 45× overall genome coverage, short (100–200 bp) windows whose GC content is above 80% or below 20% may be covered by only one or two reads. This makes assembly of high- or low-GC genomes from Illumina data particularly difficult, and frequently results in fractured assemblies.

Both datasets that we chose had Illumina data generated by the Broad Institute sequencing center and had high-quality libraries with tightly controlled fragment lengths, as shown in the Supplementary Material. Such low deviations from the target library size may not be typical for all sequencing centers and genome projects. The MaSuRCA assembler uses a modified version of the CABOG assembler for contiging and scaffolding, and in practice it will produce good assemblies with libraries whose standard deviations are up to 20% of the library mean.

One of the popular genomes used in evaluations of genome assemblers is Escherichia coli (Simpson et al., 2009). There is ample sequencing data available for the E.coli genome, and the finished sequence is also available. However, we decided against using this genome for our evaluations because E.coli is relatively easy to assemble, because of its low repeat content, and thus does not provide a stern test of an assembly algorithm: most assemblers do reasonably well on this genome. We also decided against evaluations using artificially generated ‘faux’ data. We do not know of any working and published technique that comes close to simulating the whole spectrum of errors and biases that are present in real-life sequencing data. Thus an assembler that performs well on simulated data may perform poorly on real data. See, e.g. Tables 1 and 2 in Luo et al. (2012), where the SOAPdenovo2 assembler performed better than Allpaths-LG on the faux Assemblathon 1 data, but was much worse on the real data for both bacterial datasets.

Choice of assembly programs for comparison. We chose to compare the performance of MaSuRCA with the two most popular large-scale genome assemblers: Allpaths-LG and SOAPdenovo. We decided against doing more comparisons to avoid repetition of the results of studies done in the recent GAGE evaluations. One can judge the relative performance of other popular assemblers from that project (Salzberg et al., 2012, http://gage.cbcb.umd.edu). The original GAGE assembly comparison (Salzberg et al., 2012) compared the following assembly programs:

• ABySS (Simpson et al., 2009)

• ALLPATHS-LG (Gnerre et al., 2011)

• Bambus2 (Koren et al., 2011)

• CABOG (Miller et al., 2008)

• MSR-CA (now renamed MaSuRCA 1.0)

• SGA (Simpson and Durbin, 2012)

• SOAPdenovo (Luo et al., 2012)

• Velvet (Zerbino and Birney, 2008)

The best performers in GAGE were AllPaths-LG and SOAPdenovo. Hence, we have included those two programs for comparison with MaSuRCA 2.0. For a more recent comparison one should see the GAGE-B competition (Magoc et al., 2013). It reports on assemblies of 12 bacterial genomes by ABySS, CABOG, MaSuRCA 1.8.3, Mira v3.4.0 (Chevreux et al., 2004), SOAPdenovo, SPAdes v2.3.0 (Bankevich et al., 2012) and Velvet. MaSuRCA 1.8.3 produced the best assemblies for the majority of the 12 species. SPAdes did well, especially on 250-bp reads, but is not designed for larger genomes. Allpaths-LG was not used in that competition because it requires two libraries, whereas this test had only one library per species.

Evaluating the assemblies. We evaluated the performance of the assemblers using two separate techniques. We evaluated the contigs using the recently published Quast 2.1 software (Gurevich et al., 2013). In the tables below, we report the contig sizes in terms of NGA50 reported by Quast. The NGA50 size is defined as the value N such that 50% of the finished sequence is contained in contigs whose alignments to the finished sequence are of size N or larger. Note that NGA50 differs from N50 in that N50 is defined by the total size of the assembled contigs, whereas NGA50 is defined by the actual size of the genome itself. If the assembly size is close to the true genome size, then N50 and NGA50 are roughly equivalent.
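The difference between the two statistics comes down to which total the 50% threshold refers to. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, the input to nga50 is taken to be a list of alignment lengths that some aligner has already produced, and Quast's own procedure may differ in details such as how contigs are split at misassemblies.

```python
def length_at_half(lengths, total):
    """Largest N such that pieces of length >= N sum to at least total/2."""
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n
    return 0  # the pieces cover less than half of `total`


def n50(contig_lengths):
    """N50: the threshold is half of the total assembled length."""
    return length_at_half(contig_lengths, sum(contig_lengths))


def nga50(aligned_lengths, genome_size):
    """NGA50: the threshold is half of the finished genome size."""
    return length_at_half(aligned_lengths, genome_size)


lengths = [50, 40, 30, 20, 10]           # toy lengths, arbitrary units
print(n50(lengths))                      # assembly total is 150
print(nga50(lengths, genome_size=200))   # genome is larger than the assembly
```

With these toy numbers the assembly is smaller than the genome, so the genome-based statistic is the lower of the two; when the two totals coincide, both functions return the same value.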

Quast does not report the scaffold statistics in the way we would prefer to look at them. In evaluating the scaffolding, we look for the correct order and orientation of the contigs and contiguity of the coverage, allowing for reasonable gaps (shorter than the longest mate pair). Thus we evaluated the scaffolds separately by mapping them to the finished sequence using Nucmer (Kurtz et al., 2004). We then clustered the matches of each scaffold to the finished sequence based on proximity of the matches in terms of finished sequence coordinates. Within each cluster of the matches of the scaffold to the finished sequence, we required that the matches be in the same order in terms of finished sequence and scaffold match coordinates (no rearrangements), that they have the same orientation and that the distance between consecutive matches be smaller than 40 kb (the size of the longest library) for the mouse genome and 3.5 kb for the bacterium. Then we counted the number of clusters and the number of scaffolds. The number of scaffold misassemblies is the difference between these two numbers. We defined NGA50 for the set of scaffolds as the value N such that 50% of the finished sequence is spanned by clusters where the span of each individual cluster is of size N or larger.
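One way to read the clustering rule above is as a single pass over the matches of each scaffold, opening a new cluster whenever order, orientation or distance is violated. The Python sketch below follows that reading; it is an assumption-laden illustration, not the script used for the paper. The Match record, its field names and the exact gap computation are hypothetical, and matches are assumed to have been parsed from the aligner output beforehand.

```python
from typing import NamedTuple


class Match(NamedTuple):
    scaf_start: int  # start of the match on the scaffold
    ref_name: str    # finished-sequence name (chromosome or plasmid)
    ref_start: int   # start of the match on the finished sequence
    ref_end: int     # end of the match on the finished sequence
    strand: str      # "+" or "-"


def same_cluster(prev, cur, max_gap):
    """True if `cur` continues `prev`: same sequence, same orientation,
    consistent order and a gap smaller than max_gap."""
    if prev.ref_name != cur.ref_name or prev.strand != cur.strand:
        return False
    if cur.strand == "+":
        in_order = cur.ref_start > prev.ref_start
        gap = cur.ref_start - prev.ref_end
    else:
        in_order = cur.ref_start < prev.ref_start
        gap = prev.ref_start - cur.ref_end
    return in_order and gap < max_gap


def count_clusters(matches, max_gap):
    """Number of clusters formed by the matches of one scaffold."""
    ordered = sorted(matches, key=lambda m: m.scaf_start)
    clusters = 0
    for i, m in enumerate(ordered):
        if i == 0 or not same_cluster(ordered[i - 1], m, max_gap):
            clusters += 1
    return clusters


def scaffold_misassemblies(matches_by_scaffold, max_gap):
    """Clusters minus scaffolds, counting only scaffolds with matches."""
    aligned = [m for m in matches_by_scaffold.values() if m]
    return sum(count_clusters(m, max_gap) for m in aligned) - len(aligned)


toy = {
    "scaffold1": [
        Match(0, "chr", 0, 1000, "+"),
        Match(1200, "chr", 1100, 2000, "+"),
        Match(2500, "chr", 90000, 91000, "+"),  # far away: new cluster
    ],
    "scaffold2": [Match(0, "chr", 5000, 6000, "-")],
}
print(scaffold_misassemblies(toy, max_gap=3500))
```

Here scaffold1 splits into two clusters and scaffold2 forms one, so three clusters from two scaffolds give one misassembly.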

Bacteria genome assembly. For the first comparison, we chose two Illumina datasets: (i) a paired-end library (i.e. PE), in which reads were generated from both ends of 180-bp DNA fragments (SRA accession SRR081522), and (ii) a ‘jumping’ library in which paired ends were sequenced from 3600-bp fragments (SRA accession SRR034528). We randomly down-sampled both libraries to 45× genome coverage. For LR we used 59 211 Sanger reads, with average length 772 bp, from the National Center for Biotechnology Information (NCBI) Trace Archive entry for R.sphaeroides str. 2.4.1 (Choudhary et al., 2007). The Sanger reads provided 10× genome coverage. For our experiments, we used randomly down-sampled LR datasets of 1×, 2× and 4× coverage. The data are summarized in Supplementary Table S1.

The parameters used for the Allpaths-LG, MaSuRCA and SOAPdenovo2 assemblies are described in the Supplementary Material. Because the creation of super-reads is critical in the MaSuRCA design, we first present the analysis of the number and correctness of the super-reads.

For this dataset, MaSuRCA reduced the original 2 050 868 paired-end reads to 5168 super-reads, a reduction by a factor of almost 400. In addition to those, MaSuRCA submits to the modified Celera Assembler 18 970 linking mates (PE pairs where the two reads ended up in two different super-reads, see Methods). Note that all linking mates were contained in super-reads. Thus we transformed 2 050 868 original reads into a total of 24 138 (= 18 970 + 5168) reads with a 77-fold reduction in the number of reads. The N50 size of the super-reads was 3314 bp, the minimum size was 33 bp, and the longest super-read was 13 283 bp. The total amount of sequence in super-reads was 9 138 989 bp, 2× coverage of the genome.

Table 1. Comparison of the assemblies of R.sphaeroides

Assembler/             Quast NGA50    Quast misassemblies   NGA50 scaffold   Scaffold
data type              contig (kb)    in contigs            (Mb)             misassemblies/Mb
Allpaths-LG            41.5           15                    3.2              0.2
SOAPdenovo2            17.5           5                     0.067            0.9
MaSuRCA                41.4           13                    3.1              0.7
Assemblies including some Long Read (LR) data
CABOG LR only          52.7           12                    1.5              0.4
MaSuRCA + 1× LR        63.9           21                    3.2              0.7
MaSuRCA + 2× LR        87.2           17                    3.2              0.2
MaSuRCA + 4× LR        228.4          20                    3.2              0.2

Note: The best value for each column is shown in boldface. All assemblies used Illumina data. The size of the finished reference sequence for this genome was 4 603 067 bp and the largest chromosome is about 3.2 Mb. All assemblies had <1 misassembly/1 Mb of scaffold sequence.

Table 2. Comparison of the assemblies of mouse chromosome 16 using Illumina-only data (top three rows) and MaSuRCA using a mixture of Illumina data and long Sanger reads (bottom)

Assembler              Quast contig   Quast contig          NGA50 scaffold   Scaffold
                       NGA50 (kb)     misassemblies         (kb)             misassemblies/Mb
Allpaths-LG            28             175                   261              0.03
SOAPdenovo2            8              369                   1828             0.17
MaSuRCA                56             283                   3445             0.19
Assemblies including some Long Read (LR) data
MaSuRCA + 1× LR        70             256                   4472             0.04
MaSuRCA + 2× LR        82             248                   3704             0.21
MaSuRCA + 4× LR        102            246                   4511             0.21

Note: The best value for each column is shown in boldface. The total size of the finished sequence of this chromosome was 98 319 150 bp. All assemblies generated accurate scaffolds, but the number of the contig misassemblies differs significantly.

To determine how well the (maximal) super-reads agreed with the genome, we mapped the super-reads to the finished sequence with Nucmer (Kurtz et al., 2004) using a k-mer seed size of 15 (parameters: -l 15 -c 32 --maxmatch). The total number of bases in the super-reads that matched the finished sequence was 9 106 770 in 4845 super-reads. A total of 99.2% of the matching bases were in at least one of 4673 super-reads that matched with at least 99% identity over at least 99% of their length. The remaining 323 super-reads that did not match the finished sequence contained 32 219 bp of sequence, and their maximum size was 150 bp. We examined the reads that were used to produce the non-matching super-reads and could not find a match to the genome of length 32 bp in any of these reads. It is likely that these reads primarily contained adapter sequences with errors or other contaminants.

Table 1 shows the comparison of the performance of the MaSuRCA assembler with the others on the R.sphaeroides dataset. The MaSuRCA assembler using only Illumina data performs on par with Allpaths-LG, with nearly identical NGA50 sizes, two fewer contig errors and two more scaffold errors. All scaffold errors were in small scaffolds whose sizes were well below the N50 scaffold size and this did not influence the NGA50 scaffold size. Moreover, the performance of MaSuRCA on Illumina data alone is comparable with the performance of CABOG on only the Sanger (long-read) data.

Although SOAPdenovo2 had the smallest number of contig errors (5), its contigs were significantly smaller than those produced by the other assemblers. As we introduced additional coverage by LR into the mix, the assemblies produced by the MaSuRCA assembler became superior in contiguity to those of all other assemblers (bottom three rows of Table 1). In particular, the contig NGA50 value increased from 41.4 to 63.9 kb with just 1× Sanger data, and to 228 kb with deeper 4× Sanger data. We note that neither SOAPdenovo2 nor Allpaths-LG allows for mixed datasets of this type.

Mouse genome assembly. To save time and to allow for more detailed examination of the results, we created a restricted dataset for a single chromosome of the mouse genome, chromosome 16 (Mmu16). We downloaded the same data for the mouse genome as was used in the evaluation of Allpaths-LG (Gnerre et al., 2011), which are available from the NCBI SRA under the study Mouse_B6_Genome_on_Illumina. These sequences were generated from mouse strain C57BL/6J, the same strain used for the finished mouse sequence (Mouse Genome Sequencing Consortium et al., 2002).

We mapped the reads to the finished sequence for the entire

mouse genome using Bowtie2 (Langmead and Salzberg, 2012),

allowing up to five best hits of identical quality for each read.

We then extracted the reads whose best hit either for the read or

for its mate was in chromosome 16. We also downloaded the

original Sanger reads from the NCBI Trace Archive (Mouse Genome

Sequencing Consortium et al., 2002), and mapped them against

the finished sequence. MaSuRCA does not require the LR to be

mated, and we excluded mate-pair information for these reads

during assembly. Supplementary Table S2 lists the mouse

datasets used in our experiments.

From the paired-end dataset containing 50 million reads, the

super-reads module of MaSuRCA produced 297 279 super-reads

containing 210 839 005 bp, with an N50 size of 2241 bases. The

reads outnumber the (maximal) super-reads by a factor of over

300. The original 45 coverage by the 101-bp paired-end reads

reduced to just over 2 coverage by super-reads. In addition, the

super-reads module output 940 390 linking mates from the PE

library; these are paired reads that link together two super-reads.

Thus we reduced 50 M reads to ~1.24 M super-reads and linking

mates, a 40-fold reduction. After mapping the super-read

sequences to the finished sequence using Nucmer, we found

that 209 017 737 bp in 284 179 super-reads matched Mmu16. Of

these matching bases, 98% were contained in at least one of the

258 927 super-reads that had at least 99% identity to Mmu16

over at least 99% of the super-read’s length.
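As a quick check of the arithmetic above, the following sketch recomputes the reduction factor from the counts reported in this paragraph. The variable names are ours and purely illustrative, and the 50 million figure is the rounded read count given in the text.

```python
# Counts reported in the text for the Mmu16 paired-end library.
input_reads = 50_000_000   # approximate number of paired-end reads (rounded)
super_reads = 297_279      # maximal super-reads
linking_mates = 940_390    # linking mates from the PE library

# Total number of sequences handed to the downstream assembler.
reduced_total = super_reads + linking_mates

# Fold reduction relative to the original read count.
fold_reduction = input_reads / reduced_total

print(f"super-reads + linking mates: {reduced_total:,}")
print(f"fold reduction: {fold_reduction:.1f}")
```

The sum is about 1.24 million sequences, which is consistent with the roughly 40-fold reduction stated above.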

Results for the mouse assemblies are provided in Table 2. Not

unexpectedly, the Mmu16 dataset was more challenging than the

bacterial genome. For assembly with Illumina-only data, the

NGA50 contig size for the MaSuRCA assembly was twice as large

as that of the Allpaths-LG assembly, whereas the number

of errors was 62% larger. SOAPdenovo2 produced small contigs

with a large number of errors. The MaSuRCA assembler pro-

duced the largest scaffolds, with NGA50 more than an order of

magnitude larger than the Allpaths-LG scaffolds and almost

twice as large as the SOAPdenovo2 scaffolds.

MaSuRCA produced progressively larger and more accurate

contigs as LR were added into the mix. Additional 4× LR coverage

almost doubled the N50 contig size while reducing the

number of contig misassemblies by 13%. The LR data did not

have any mate pairs by design, so we did not expect a significant

improvement in scaffolding; however, the scaffolds

improved as well. We note that for each run the same set of

super-reads, jumping library reads and linking mates went into

the CABOG assembler; the only difference between runs was in

the number of LR. As we introduced more LR, the number of

assembly errors decreased, whereas the contig N50 size increased

significantly.

3 METHODS

The key idea driving the development of the MaSuRCA assembler is to

reduce the complexity of the data by transforming the high coverage

(typically 50–100 or deeper) of the paired-end reads into 2–3 coverage

by fairly long (maximal) super-reads. The reduced data could then be

efficiently assembled by a modified Celera Assembler. Here we

describe some of the modules in MaSuRCA. In this section, we simply

say super-read for maximal super-reads, as non-maximal super-reads are

not used in assembly.

QuorUM error correction. We use the QuorUM error corrector due to

its stability and high performance (Marçais et al., 2013). One may

substitute another error corrector, such as Quake (Kelley et al., 2010), as long

as the output is processed in such a way that read names are preserved

and the mates are reported together.

Creation of k-unitigs. Extending each read base by base is computa-

tionally inefficient. Therefore, we use error-corrected reads to construct k-

unitigs as follows. We create a k-mer count look-up table, using a hash

constructed by the Jellyfish program (Marçais and Kingsford, 2011),

which allows us to determine quickly how many times each k-mer

occurs in the reads. Given any k-mer, there are four possible k-mers

that may follow it, one for each possible extension by A, C, G or T. We

look up how often each of these occurs, and if we find only one of these

four extensions, we say that our original k-mer has a unique following

k-mer. We similarly check whether there is a unique preceding k-mer,

analogously defined.
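The look-up described above can be illustrated with a small in-memory table. The following is a minimal sketch, not the Jellyfish implementation: it assumes a plain Python Counter in place of the Jellyfish hash, counts k-mers on the forward strand only (reverse complements are ignored), and all function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def following_kmers(kmer, counts):
    """Return the k-mers present in the table that can follow `kmer`."""
    candidates = [kmer[1:] + base for base in "ACGT"]
    return [c for c in candidates if counts.get(c, 0) > 0]


def preceding_kmers(kmer, counts):
    """Return the k-mers present in the table that can precede `kmer`."""
    candidates = [base + kmer[:-1] for base in "ACGT"]
    return [c for c in candidates if counts.get(c, 0) > 0]


def unique_following(kmer, counts):
    """Return the unique following k-mer, or None if there are zero or several."""
    found = following_kmers(kmer, counts)
    return found[0] if len(found) == 1 else None


def unique_preceding(kmer, counts):
    """Return the unique preceding k-mer, or None if there are zero or several."""
    found = preceding_kmers(kmer, counts)
    return found[0] if len(found) == 1 else None


if __name__ == "__main__":
    table = count_kmers(["ACGTT", "CGTTA"], 3)
    print(unique_following("ACG", table))   # the only observed extension of ACG
    print(unique_preceding("TTA", table))   # the only observed predecessor of TTA
```

Each query costs at most four table look-ups, which is why a fast k-mer count table makes this step cheap.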

We call a k-mer simple if it has a unique preceding k-mer and a unique

following k-mer. A k-unitig is a string of maximal length such that every

k-mer in it is simple except for the first and the last. There are alternative

names for quite similar concepts, such as ‘unipaths’ in Allpaths-LG

(Gnerre et al., 2011). By construction, no k-mer can belong to more

than one k-unitig. (But note that a k-unitig itself can occur at multiple

sites in the genome, as will happen for exact repeats.) Hence, if a k-mer

from a k-unitig U is encountered in the genome, then the whole sequence

of the k-unitig U must appear at that location in the genome. We will use

this property of k-unitigs when creating super-reads.
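To make the definition concrete, here is a minimal sketch of k-unitig construction on a toy input. It is an illustration under stated assumptions rather than the MaSuRCA code: k-mers are taken from the forward strand only, every observed k-mer is trusted (no count threshold), and an extension from one k-mer to the next is accepted only when the next k-mer is the unique follower and the current k-mer is its unique predecessor. We assume this stricter two-sided rule because it guarantees that each k-mer lands in exactly one k-unitig. All function names are hypothetical.

```python
def successors(kmer, kmers):
    """k-mers in the set that can follow `kmer`."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """k-mers in the set that can precede `kmer`."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def unique_link_forward(kmer, kmers):
    """Next k-mer if the link kmer -> next is unambiguous in both directions."""
    nxt = successors(kmer, kmers)
    if len(nxt) == 1 and len(predecessors(nxt[0], kmers)) == 1:
        return nxt[0]
    return None


def unique_link_backward(kmer, kmers):
    """Previous k-mer if the link prev -> kmer is unambiguous in both directions."""
    prv = predecessors(kmer, kmers)
    if len(prv) == 1 and len(successors(prv[0], kmers)) == 1:
        return prv[0]
    return None


def build_k_unitigs(reads, k):
    """Return the k-unitigs of the reads as a list of strings."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    used, unitigs = set(), []
    for start in sorted(kmers):
        if start in used:
            continue
        # Walk backward to the left end of the k-unitig containing `start`.
        left, seen = start, {start}
        while True:
            prv = unique_link_backward(left, kmers)
            if prv is None or prv in seen:
                break
            seen.add(prv)
            left = prv
        # Walk forward from the left end, collecting k-mers along the way.
        path, on_path = [left], {left}
        while True:
            nxt = unique_link_forward(path[-1], kmers)
            if nxt is None or nxt in on_path:
                break
            path.append(nxt)
            on_path.add(nxt)
        used.update(path)
        # Spell the path: first k-mer plus the last base of each later k-mer.
        unitigs.append(path[0] + "".join(km[-1] for km in path[1:]))
    return unitigs


if __name__ == "__main__":
    # Two toy reads that share a prefix and then diverge.
    print(build_k_unitigs(["ATGGCGTA", "ATGGCGCC"], 4))
```

On this toy input the shared prefix forms one k-unitig and each branch forms another, and neighbouring k-unitigs share exactly k-1 bases at the branch point.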

Super-reads from paired-end reads. Because no k-mer can belong to

more than one k-unitig, if a read has a k-mer that occurs in a k-unitig,

the read and the k-unitig can be aligned to one another. In the simplest

case, when a read is a substring of a k-unitig, then that k-unitig is the

read’s super-read. In other cases, we use individual reads to merge the

k-unitigs that overlap them into a single longer super-read. Figure 2

shows an example of this process, in which a read R is extended to a

super-read that consists of two k-unitigs that overlap by k-1 bases.

K-mers M1 and M2 belong to R and also to k-unitigs K1 and K2,

which may extend beyond the ends of R. Two reads that differ even at

only one base will map and be extended by different sets of k-unitigs. The

k-unitigs by construction can be connected by the exact k-1 end sequence

overlaps. We call them k-unitig overlaps.
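The single-read case described above can be sketched as follows. This is a simplified illustration, not the MaSuRCA implementation: it assumes the k-unitigs are already available as strings, works on the forward strand only, gives up on any read containing a k-mer that is absent from all k-unitigs, and uses hypothetical function names.

```python
def index_unitigs(unitigs, k):
    """Map each k-mer to the id of the single k-unitig that contains it."""
    index = {}
    for uid, seq in enumerate(unitigs):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = uid
    return index


def unitig_path_for_read(read, index, k):
    """Ordered list of k-unitig ids visited by the k-mers of the read."""
    path = []
    for i in range(len(read) - k + 1):
        uid = index.get(read[i:i + k])
        if uid is None:
            return None  # the read has a k-mer that is in no k-unitig
        if not path or path[-1] != uid:
            path.append(uid)
    return path


def merge_path(path, unitigs, k):
    """Join k-unitigs along a path; neighbours must overlap by exactly k-1 bases."""
    seq = unitigs[path[0]]
    for uid in path[1:]:
        nxt = unitigs[uid]
        if seq[-(k - 1):] != nxt[:k - 1]:
            raise ValueError("k-unitigs do not overlap by k-1 bases")
        seq += nxt[k - 1:]
    return seq


def super_read(read, unitigs, k):
    """Extend a read to the concatenation of the k-unitigs it touches."""
    index = index_unitigs(unitigs, k)
    path = unitig_path_for_read(read, index, k)
    if not path:
        return None
    return merge_path(path, unitigs, k)


if __name__ == "__main__":
    unitigs = ["ATGGCG", "GCGCC", "GCGTA"]
    print(super_read("GGCGTA", unitigs, 4))  # spans two k-unitigs
    print(super_read("TGGC", unitigs, 4))    # lies inside one k-unitig
```

In the first call the read touches two k-unitigs that share k-1 = 3 bases, so its super-read is their merge; in the second call the read is a substring of a single k-unitig, which is then its super-read.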

If the reads are paired, we examine each pair of reads (we also call

them mated reads or mate pairs) and map each read to the k-unitigs, and

then look for a unique path of k-unitigs connected by k-unitig overlaps

that connects the two reads. If we find such a path, then we extend both

paired-end reads to a new super-read, created by merging the k-unitigs on

this unique path. Often this process of creating a super-read from a pair

fails, e.g. when there is a repeated sequence or a gap in coverage between

the mates. In this case, we form a super-read for each of the mates separately,

and the mate pair is submitted to the assembler as a linking mate

pair along with the corresponding super-reads.

Jumping library filter. Although not required for MaSuRCA, long

‘jumping’ libraries (i.e. in which the reads of each pair are several kilobases

apart) are often part of genome sequencing projects, where they provide

valuable long-range connectivity data for the scaffolding. However, these

libraries sometimes contain reads that are chimeric, i.e. they derive from

two distant parts of the genome, or they may be mis-oriented because of

problems in library construction. In particular, some jumping libraries

use a circularization protocol, in which the DNA fragment is circularized

to bring together its ends. The resulting read pairs face outward, meaning

that the opposite ends of the original DNA fragment occur on the 3′ ends

of the two reads. Misoriented mates from these libraries are ‘innies’ that

originated from one ‘side’ of the original circle that did not contain a

junction site. These ‘innies’ should be treated as regular short insert (300–

400 bp) paired-end reads. Because these kinds of errors create many prob-

lems in the scaffolding phase, it is imperative that most of them be

removed before giving the data to the assembly program.

We use QuorUM error correction and the super-reads procedures to

perform library cleaning. First, during error correction we create a k-mer

database from the paired-end reads only, excluding the jumping libraries.

Note that circularized reads include a linker sequence, and the concaten-

ation of the linker and the source DNA represents sequence that should not

appear in the genome. If one of the reads in a jumping library mate pair

contains a junction site (identified by the linker), then k-mers that span the

junction site in that read will not be found in the k-mer database. The error

corrector then trims the read at the junction site.

We then create super-reads from the jumping library reads. Here we

introduce a modification to the algorithm that accepts any path of over-

lapping k-unitigs in building a super-read. By doing aggressive joining, we

are able to identify most of the non-junction ‘innie’ pairs, because they

end up in the same super-read. Next we look for redundant jumping

library pairs, where the same DNA fragment was amplified before circu-

larization, producing two or more pairs of reads that represent the same

fragment. Because the assembly process assumes that each mate pair is an

independent sample from the genome, such redundant pairs can lead to

assembly errors later. We examine the positions of reads in the resulting

super-reads and look for redundant pairs, where both sets of forward and

reverse reads end up in the same one or two super-reads with the same

offset. We reduce such pairs to a single copy.
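The de-duplication step can be pictured as keying each jumping-library pair on where its two reads land. The sketch below is only an illustration under assumptions: the record layout (pair id, super-read id and offset for each mate) and the function name are hypothetical, and the real filter may also consider orientation or tolerate small offset differences.

```python
def deduplicate_pairs(placements):
    """Keep one pair per distinct placement.

    `placements` is an iterable of tuples
    (pair_id, fwd_super_read, fwd_offset, rev_super_read, rev_offset).
    Returns the ids of the pairs that are kept, in input order.
    """
    seen = set()
    kept = []
    for pair_id, fwd_sr, fwd_off, rev_sr, rev_off in placements:
        key = (fwd_sr, fwd_off, rev_sr, rev_off)
        if key in seen:
            continue  # same super-reads and same offsets: redundant copy
        seen.add(key)
        kept.append(pair_id)
    return kept


if __name__ == "__main__":
    placements = [
        ("pair1", "sr7", 120, "sr9", 40),
        ("pair2", "sr7", 120, "sr9", 40),   # duplicate of pair1
        ("pair3", "sr7", 121, "sr9", 40),   # different offset, kept
    ]
    print(deduplicate_pairs(placements))
```

Because both mates must agree in super-read and offset, independent fragments that merely start near each other are unlikely to be collapsed.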

Contiging and scaffolding with the CABOG assembler. We keep track

of the number of reads that generated each maximal super-read and the

positions of those reads in the maximal super-read, thus allowing us to

precisely report positions of all reads in the assembly. After creating the

super-reads, we assemble the data with a modified version of the CABOG

assembler (Miller et al., 2008, 2010), updated to allow long super-reads as

well as short reads in the same assembly. We supply the following four

types of data to CABOG:

(1) super-reads

(2) linking mates

(3) cleaned and de-duplicated jumping library mate pairs (if available)

(4) other available LR

We note that the modified version of CABOG 6.1 used in MaSuRCA

is not capable of supporting the long high-error-rate reads generated by

the PacBio technology.

CABOG uses read coverage statistics (Myers et al., 2000) to distinguish

between unique and repetitive regions of the assembly. Each super-read

typically represents multiple reads, and we keep track of how many reads

belong to each super-read and the position of these reads. We modified

CABOG to incorporate these data, using counts of all the original reads

in its computation of coverage statistics. This major modification

allowed us to use CABOG to assemble super-read data together with

Illumina, 454 and Sanger reads.

Gap filling. The final major step in the MaSuRCA assembler is gap

filling. This step is aimed at filling gaps in scaffolds that are relatively

short and do not contain complicated repetitive structures. Such gaps

may occur because the assembler software over-trimmed the reads in

the error correction step, or because it misestimated the coverage statistics

for some contigs or because of other reasons. Our gap-filling technique is

again centered on our super-reads algorithm.

Fig. 2. An example of a read whose super-read has two k-unitigs. Read R contains k-mers M1 and M2 on its ends. M1 and M2 belong to k-unitigs K1 and K2, respectively. K-unitigs K1 and K2 are shown in blue, and the matching k-mers M1 and M2 are shown in red and green. K1 and K2 overlap by k-1 bases. We extend read R on both ends, producing a super-read, also depicted in blue. A super-read can consist of one k-unitig or can contain many k-unitigs.

For each gap in each scaffold, we create faux 100-bp paired-end reads

from the ends of the contigs surrounding the gap. We call these contig-

end reads. From each pair of contig-end reads, we then use the 21-mers

whose counts are globally below a threshold of 1000 (to avoid highly

repetitive sequences) to pull in the original uncorrected reads where

either the read or its mate contains one of those k-mers. From this set,

we create a local bin of reads corresponding to the gap in question. We

then create k-unitigs from the reads in the bin for k ranging from 17 to 85,

and see if we can create a unique super-read that contains both contig-end

reads. If we are able to achieve that for some value of k, then the resulting

super-read is used as a local patch of sequence to fill the gap. We found

that, depending on the genome, 10–20% of the gaps in the scaffolds can

be filled using this approach.
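The read-recruitment part of this procedure can be sketched as follows. This is a minimal illustration, not the MaSuRCA code: it assumes the global k-mer counts are available as a dictionary, ignores reverse complements, and uses hypothetical names. The subsequent loop over k from 17 to 85 and the search for a unique super-read are not shown.

```python
def kmers_of(seq, k):
    """Set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(contig_end_reads, read_pairs, global_counts, k=21, max_count=1000):
    """Build the local bin of read pairs for one gap.

    A pair is recruited if either mate shares with the contig-end reads
    a k-mer whose global count is below `max_count`.
    """
    # Seed k-mers: taken from the contig-end reads, excluding repetitive ones.
    seeds = set()
    for end_read in contig_end_reads:
        for kmer in kmers_of(end_read, k):
            if global_counts.get(kmer, 0) < max_count:
                seeds.add(kmer)
    # Pull in every pair where the read or its mate contains a seed k-mer.
    read_bin = []
    for read, mate in read_pairs:
        if kmers_of(read, k) & seeds or kmers_of(mate, k) & seeds:
            read_bin.append((read, mate))
    return read_bin


if __name__ == "__main__":
    # Toy example with k=3 and a repeat threshold of 2 so it fits on screen.
    counts = {"ACG": 5, "CGT": 1}
    pairs = [("TCGTA", "GGGG"), ("AACGA", "TTTT"), ("GGGG", "CCGTT")]
    print(recruit_reads(["ACGT"], pairs, counts, k=3, max_count=2))
```

In the toy run the k-mer ACG is treated as repetitive and ignored, so only the pairs containing CGT are placed in the bin.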

4 DISCUSSION

We began this project when we were faced with the prospect of

assembling a 20+ Gbp pine tree genome with perhaps 15+ billion

Illumina reads. That was far larger than anything that had

been assembled. Along the way, we have found that our philosophy

of reducing Illumina reads to super-reads is useful. We discuss

possible shortcomings and problems of our approach, as well as

data problems that can result in a poor assembly if the user does

not address them.

We have mentioned the GAGE B study of bacterial genomes,

in which MaSuRCA was declared highly effective. At the end of

this section, we list larger genomes that have been assembled by

MaSuRCA and are publicly available.

Overall evaluation. In Tables 1 and 2, the Quast NGA50 contig

size and the NGA50 scaffold size can be viewed loosely as N50

sizes after the contigs and scaffolds have been broken at each

major misassembly. When comparing two assemblies, if after

breaking at errors the N50 contig or scaffold size is doubled in

one assembly compared with the other while introducing fewer

than twice as many errors, we believe the doubling is justified.
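The loose reading of NGA50 given above can be illustrated with a short sketch. It is not the Quast computation, which works on aligned blocks and uses the reference genome length; here we simply assume that misassembly positions are known, split contigs at those positions and recompute N50. The function names and the toy lengths are hypothetical.

```python
def n50(lengths, total=None):
    """Length L such that pieces of length >= L cover at least half of `total`.

    If `total` is omitted, the sum of the lengths is used (plain N50);
    passing a genome size instead gives an NG50-style value.
    """
    ordered = sorted(lengths, reverse=True)
    half = (total if total is not None else sum(ordered)) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # the pieces cover less than half of `total`


def break_at(length, breakpoints):
    """Split one contig of the given length at internal breakpoint positions."""
    cuts = [0] + sorted(breakpoints) + [length]
    return [b - a for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    contigs = [100, 80, 50, 30]
    print(n50(contigs))                      # N50 before breaking
    broken = break_at(100, [60]) + [80, 50, 30]
    print(n50(broken))                       # N50 after breaking the largest contig
```

In this toy case a single error in the largest contig lowers the N50 from 80 to 60, which is the kind of penalty the NGA50 columns of Tables 1 and 2 reflect.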

In Table 1 on Illumina data only, we view Allpaths as doing

slightly better than MaSuRCA on scaffolds and both did significantly

better than SOAPdenovo. Note that the scaffold NGA50 values

for Allpaths and MaSuRCA are about three-fourths of the size of

the genome, indicating that unlike SOAPdenovo, they both got

the biggest chromosome in a correct scaffold. The fact that

MaSuRCA has long scaffolds 3 Mb after breaking at errors,

and the errors occur at roughly 1 Mb in spacing indicates that

the errors lie near the ends of the big scaffold or in small scaf-

folds (in this case they all were in small scaffolds), so that the

overall size of the scaffolds is not severely impacted by breaking

at errors. When significant amounts of long read data are avail-

able, MaSuRCA makes use of that resource and does better. Its

contig sizes rise dramatically and the scaffold error rates drop.

For R.sphaeroides the high GC content of the genome results

in greater variability in the read coverage because of biases

present in Illumina sequencing technology. This case shows

that good assemblies are still possible even for high (or low)

GC genomes.

Table 2 is a better test of scaffolding, as the scaffolds are not

approaching the size of the genome. MaSuRCA’s scaffolds are

roughly 13 times larger than Allpaths’ while introducing only

about 6 times as many errors. Errors seem inevitable unless

contigs and scaffolds are built conservatively and remain small.

SOAPdenovo’s assembly suffers from small contigs. Again,

adding LR improves the assembly significantly.

It is clear that even if assembler A is significantly better than

assembler B on a collection of genome datasets, A may do worse

than B on some datasets (Magoc et al., 2013). Here we have

chosen datasets for which the PE mate pairs overlap each other,

as is required by Allpaths. In our limited experience, MaSuRCA

assemblies are better if multiple PE libraries are used, varying

the fragment length. Generally, a jumping library should also be

available, such as one built from 3 kb (or longer) fragments.

Error correction. Error correction greatly simplifies the de

Bruijn graph and typically results in larger k-unitigs and thus

larger super-reads. Our algorithm works best on error-corrected

reads, but is not tied to a particular error correction technique. In

the MaSuRCA software package, we use the QuorUM error

correction algorithm (Marçais et al., 2013). However,

one can substitute other techniques, such as Quake (Kelley et al.,

2010) or Hammer (Medvedev et al., 2011).

Data problems. A variety of problems with the input reads and

libraries can reduce the quality of an assembly. One of the most

common issues is mislabeled or poorly size-selected fragment

libraries. For example, we have encountered jumping libraries

identified as 8 Kb (made from 8 Kb fragments), but later found

that the sequences include a mixture of pairs created from 2 to

8 Kb fragments. Similar problems often arise with long-distance

paired reads. Various explanations have been offered for this

type of error, but regardless of the source, the misidentified

mate pairs create difficulties when the assembler tries to place

them 8 Kb apart in the assembly. An examination of the assem-

bly may reveal the problem, at which point it can be corrected

and the assembly can be restarted. We have observed libraries

that were designed to be longer than 5 Kb but were composed

entirely of 2 Kb fragments. Another problem that arises

with current technology is that the forward reads might be of

excellent quality, but their mates (which are created in a separate

run) are of far lower quality. We encountered one dataset where

some of the libraries had so many errors that the assembly was

better when made without those libraries. For example, when

using 454 paired-end data, if the wrong linker sequence is pro-

vided to the assembler, the assembly will be severely fragmented.

In general, severe fragmentation of an assembly is an indication

of some kind of data error, which in turn requires a form of ‘data

debugging’ to fix the errors and restart the assembly. No list of

possible data errors will be complete.

A data diagnostic, U/k. Before running an assembler, one

should evaluate the quality of the input data with any tools

available. One strategy that we have found useful is to count

the number of unique k-mers in the reads. Given a project

with deep coverage, e.g. 30× or higher, any k-mer that occurs

just once in the set of reads almost certainly contains at least one

error. [This is the insight used by the Quake error corrector

(Kelley et al., 2010)]. We can compare the number of unique

k-mers in forward and reverse reads as a means of evaluating

the quality of the reverse reads.

We can also use k-mer counts to estimate the real error rate in

the read data, as follows. A sequencing error in the middle of a

read is likely to result in k unique k-mers, because every k-mer

containing the error will be unique. If the average number of

unique k-mers per read is U, then U/k is a lower-bound estimate

of the average number of sequencing errors per read in the data.

This estimate ignores the fact that an error near the end of a read

will result in fewer erroneous k-mers, and it does not take into

account cases when there are two or more errors per k-mer. The

U/k value should be used as a minimum fitness criterion for the

input read data: if the estimated number of errors is >2–3 for

100-bp reads, then it is likely that there was a problem in the

sequencing run (current Illumina technology usually has an error

rate below 1%). It may be more effective to ignore or redo a run

with a high error rate than to use it for assembly.
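The U/k diagnostic is simple enough to compute directly. The following is a minimal sketch under assumptions: it holds all k-mer counts in a Python Counter (a real dataset would need a dedicated k-mer counter such as Jellyfish), counts forward-strand k-mers only, and the function name is ours.

```python
from collections import Counter


def u_over_k(reads, k):
    """Estimate a lower bound on the mean number of sequencing errors per read.

    U is the average number of k-mers per read that occur exactly once in
    the whole read set; the estimate returned is U / k.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    unique_kmers = sum(1 for c in counts.values() if c == 1)
    u = unique_kmers / len(reads)
    return u / k


if __name__ == "__main__":
    # Three error-free copies of a toy read and one copy with a single
    # substitution in the middle (C -> T at position 6).
    reads = ["AAAACCCCGGGG"] * 3 + ["AAAACCTCGGGG"]
    print(u_over_k(reads, 4))
```

In the toy input the single substitution creates exactly k = 4 unique k-mers, so the estimate is one error per four reads, matching the true rate. Running the same function separately on forward and reverse reads gives the per-direction comparison mentioned above.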

Polymorphic genomes. Differences between the two copies of

homologous chromosomes in a diploid genome can increase the

number of super-reads 2-fold. This does not usually constitute

a problem as long as subsequent assembly steps handle the poly-

morphic super-reads. The Celera Assembler (CABOG) will

attempt to combine polymorphic regions that differ by up to

6%. If the haplotype divergence rate is higher, it will result in

a fragmented assembly, where many scaffolds will terminate in

regions of haplotype difference. This occurs because, even

though the mate pairs may suggest that two scaffolds represent-

ing two haplotypes should be merged, the contigs within those

scaffolds will not align sufficiently well, and therefore the scaf-

folder will not make the merge. In this case, the assembly can be

post-processed to split the haplotypes and create scaffolds repre-

senting both heterozygous chromosomes.

Available genomes assembled by MaSuRCA. MaSuRCA has

been used to assemble de novo a variety of genomes, sometimes

improving on published genomes using added data, sometimes

creating the first publicly available draft genome for the species.

Below is a partial list of genomes that were recently assembled

with MaSuRCA, including the types of read data used for each

project:

• Loblolly pine, Pinus taeda, a 22 Gbp genome, draft assembly using Illumina data only, in collaboration with the Pinerefseq consortium.

• Indian cow, Bos indicus, 454/Illumina mixed data, in collaboration with USDA-ARS.

• Rhesus macaque, Macaca mulatta, Sanger/Illumina mixed data, in collaboration with University of Nebraska.

• Water buffalo, Bubalus bubalis, 454/Illumina mixed data, in collaboration with USDA-ARS and CASPUR, Italy.

• Domestic cat, Felis catus, Sanger/454/Illumina mixed data, in collaboration with Washington University.

• Philippine tarsier, Tarsius syrichta, Sanger/Illumina mixed data, in collaboration with Washington University.

• Fire ant, Wasmannia auropunctata, 454/Illumina mixed data, in collaboration with OIST, Japan.

• Stalk-eyed fly, Teleopsis dalmanni, 454/Illumina mixed data, in collaboration with University of Maryland.

ACKNOWLEDGEMENTS

We thank Kristian Stevens (University of California-Davis) for

his helpful comments. We also thank one of the referees for

noticing an important error in the original statement of the

theorem.

Funding: National Research Initiative competitive grants 2009-

35205-05209 and 2008-04049 from the United States Department

of Agriculture National Institute of Food and Agriculture (in

part); National Institutes of Health grants R01-HG002945 and

R01-HG006677.

Conflict of Interest: none declared.

REFERENCES

Bankevich,A. et al. (2012) SPAdes: a new genome assembly algorithm and its

applications to single-cell sequencing. J. Comput. Biol., 19, 455–477.

Batzoglou,S. et al. (2002) ARACHNE: a whole-genome shotgun assembler. Genome

Res., 12, 177–189.

Chaisson,M.J. and Pevzner,P.A. (2008) Short read fragment assembly of bacterial

genomes. Genome Res., 18, 324–330.

Choudhary,M. et al. (2007) Genome analyses of three strains of Rhodobacter

sphaeroides: evidence of rapid evolution of chromosome II. J. Bacteriol., 189,

1914–1921.

Chevreux,B. et al. (2004) Using the miraEST assembler for reliable and automated

mRNA transcript assembly and SNP detection in sequenced ESTs. Genome

Res., 14, 1147–1159.

Gnerre,S. et al. (2011) High-quality draft assemblies of mammalian genomes

from massively parallel sequence data. Proc. Natl Acad. Sci. USA, 108,

1513–1518.

Gurevich,A. et al. (2013) QUAST: quality assessment tool for genome assemblies.

Bioinformatics, 29, 1072–1075.

Huang,X. et al. (2003) PCAP: a whole-genome assembly program. Genome Res., 13,

2164–2170.

Idury,R.M. and Waterman,M.S. (1995) A new algorithm for DNA sequence

assembly. J. Comput. Biol., 2, 291–306.

Kelley,D.R. et al. (2010) Quake: quality-aware detection and correction of sequencing

errors. Genome Biol., 11, R116.

Koren,S. et al. (2011) Bambus 2: scaffolding metagenomes. Bioinformatics, 27,

2964–2971.

Kurtz,S. et al. (2004) Versatile and open software for comparing large genomes.

Genome Biol., 5, R12.

Lander,E. et al. (2001) Initial sequencing and analysis of the human genome.

Nature, 409, 860–921.

Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2.

Nat. Methods, 9, 357–359.

Li,R. et al. (2008) SOAP: short oligonucleotide alignment program. Bioinformatics,

24, 713–714.

Li,R. et al. (2010) De novo assembly of human genomes with massively parallel

short read sequencing. Genome Res., 20, 265–272.

Luo,R. et al. (2012) SOAPdenovo2: an empirically improved memory-efficient

short-read de novo assembler. GigaScience, 1, 18.

Magoc,T. et al. (2013) GAGE-B: an evaluation of genome assemblers for bacterial

organisms. Bioinformatics, 29, 1718–1725.

Marçais,G. et al. (2013) QuorUM: an error corrector for Illumina reads. arXiv.org.

Marçais,G. and Kingsford,C. (2011) A fast, lock-free approach for efficient parallel

counting of occurrences of k-mers. Bioinformatics, 27, 764–770.

Medvedev,P. et al. (2011) Error correction of high-throughput sequencing datasets

with non-uniform coverage. Bioinformatics, 27, i137–i141.

Miller,J.R. et al. (2008) Aggressive assembly of pyrosequencing reads with mates.

Bioinformatics, 24, 2818–2824.

Miller,J.R. et al. (2010) Assembly algorithms for next-generation sequencing data.

Genomics, 95, 315–327.

Mouse Genome Sequencing Consortium et al. (2002) Initial sequencing and

comparative analysis of the mouse genome. Nature, 420, 520–562.

Mullikin,J.C. and Ning,Z. (2003) The Phusion assembler. Genome Res., 13, 81–90.

Myers,G. et al. (2000) A whole genome assembly of Drosophila. Science, 287,

2196–2204.

Pevzner,P.A. (1989) l-Tuple DNA sequencing: computer analysis. J. Biomol. Struct.

Dyn., 7, 63–73.

Pevzner,P.A. et al. (2001) An Eulerian path approach to DNA fragment assembly.

Proc. Natl Acad. Sci. USA, 98, 9748–9753.

Salzberg,S.L. et al. (2012) GAGE: a critical evaluation of genome assemblies and

assembly algorithms. Genome Res., 22, 557–567.

Simpson,J.T. et al. (2009) ABySS: a parallel assembler for short read sequence data.

Genome Res., 19, 1117–1123.

Simpson,J.T. and Durbin,R. (2012) Efficient de novo assembly of large genomes

using compressed data structures. Genome Res., 22, 549–556.

Venter,J.C. et al. (2001) The sequence of the human genome. Science, 291,

1304–1351.

Zerbino,D.R. and Birney,E. (2008) Velvet: algorithms for de novo short read as-

sembly using de Bruijn graphs. Genome Res., 18, 821–829.

2677

The MaSuRCA genome assembler

at Milton S. Eisenhower Library/ Johns Hopkins University on September 17, 2015http://bioinformatics.oxfordjournals.org/Downloaded from

Online Supplementary Material

Data

September 2015

Aleksey V. Zimin · Guillaume Marcais · Daniela Puiu · Michael Roberts · James Yorke

Download

Benchmarking of bioinformatics tools for the hybrid de novo assembly of human whole-genome sequencing data

Preprint

Full-text available

May 2024

Accurate and complete de novo assembled genomes sustain variant identification and catalyze the discovery of new genomic features and biological functions. However, accurate and precise de novo assemblies of large and complex genomes remains a challenging task. Long-read sequencing data alone or in hybrid mode combined with more accurate short-read sequences facilitate the de novo assembly of genomes. A number of software exists for de novo genome assembly from long-read data although specific performance comparisons to assembly human genomes are lacking. Here we benchmarked 11 different pipelines including four long-read only assemblers and three hybrid assemblers, combined with four polishing schemes for de novo genome assembly of a human reference material sequenced with Oxford Nanopore Technologies and Illumina. In addition, the best performing choice was validated in a non-reference routine laboratory sample. Software performance was evaluated by assessing the quality of the assemblies with QUAST, BUSCO, and Merqury metrics, and the computational costs associated with each of the pipelines were also assessed. We found that Flye was superior to all other assemblers, especially when relying on Ratatosk error-corrected long-reads. Polishing improved the accuracy and continuity of the assemblies and the combination of two rounds of Racon and Pilon achieved the best results. The assembly of the non-reference sample showed comparable assembly metrics as those of the reference material. Based on the results, a complete optimal analysis pipeline for the assembly, polishing, and contig curation developed on Nextflow is provided to enable efficient parallelization and built-in dependency management to further advance in the generation of high-quality and chromosome-level human assemblies.

Drivers of genomic differentiation landscapes in populations of disparate ecological and geographical settings within mainland Apis cerana

Article

Full-text available

May 2024
MOL ECOL

Elucidating the evolutionary processes that drive population divergence can enhance our understanding of the early stages of speciation and inform conservation manage- ment decisions. The honeybee Apis cerana displays extensive population divergence, providing an informative natural system for exploring these processes. The mainland lineage A. cerana includes several peripheral subspecies with disparate ecological and geographical settings radiated from a central ancestor. Under this evolutionary frame- work, we can explore the patterns of genome differentiation and the evolutionary models that explain them. We can also elucidate the contribution of non-genomic spa- tiotemporal mechanisms (extrinsic features) and genomic mechanisms (intrinsic fea- tures) that influence these genomic differentiation landscapes. Based on 293 whole genomes, a small part of the genome is highly differentiated between central–periph- eral subspecies pairs, while low and partial parallelism partly reflects idiosyncratic responses to environmental differences. Combined elements of recurrent selection and speciation-with-gene-flow models generate the heterogeneous genome land- scapes. These elements weight differently between central-island and other central– peripheral subspecies pairs, influenced by glacial cycles superimposed on different geomorphologies. Although local recombination rates exert a significant influence on patterns of genomic differentiation, it is unlikely that low-recombination rates regions were generated by structural variation. In conclusion, complex factors including geo- graphical isolation, divergent ecological selection and non-uniform genome features have acted concertedly in the evolution of reproductive barriers that could reduce gene flow in part of the genome and facilitate the persistence of distinct populations within mainland lineage of A. cerana.

The phased Solanum okadae genome and Petota pangenome analysis of 23 other potato wild relatives and hybrids

Article

Full-text available

May 2024

Potato is an important crop in the genus Solanum section Petota. Potatoes are susceptible to multiple abiotic and biotic stresses and have undergone constant improvement through breeding programs worldwide. Introgression of wild relatives from section Petota with potato is used as a strategy to enhance the diversity of potato germplasm. The current dataset contributes a phased genome assembly for diploid S. okadae, and short read sequences and de novo assemblies for the genomes of 16 additional wild diploid species in section Petota that were noted for stress resistance and were of interest to potato breeders. Genome sequence data for three additional genomes representing polyploid hybrids with cultivated potato, and an additional genome from non-tuberizing S. etuberosum, which is outside of section Petota, were also included. High quality short reads assemblies were achieved with genome sizes ranging from 575 to 795 Mbp and annotations were performed utilizing transcriptome sequence data. Genomes were compared for presence/absence of genes and phylogenetic analyses were carried out using plastome and nuclear sequences.

A New Species of Scymnus (Coleoptera, Coccinellidae) from Pakistan with Mitochondrial Genome and Its Phylogenetic Implications

Article

May 2024

In this study, a new species of the subgenus Pullus belonging to the Scymnus genus from Pakistan, Scymnus (Pullus) cardi sp. nov., was described and illustrated, with information on its distribution, host plants, and prey. Additionally, the completed mitochondrial genome (mitogenome) of the new species using high-throughput sequencing technology was obtained. The genome contains the typical 37 genes (13 protein-coding genes, two ribosomal RNAs, and 22 transfer RNAs) and a non-coding control region, and is arranged in the same order as that of the putative ancestor of beetles. The AT content of the mitogenome is approximately 85.1%, with AT skew and GC skew of 0.05 and −0.43, respectively. The calculated values of relative synonymous codon usage (RSCU) determine that the codon UUA (L) has the highest frequency. Furthermore, we explored the phylogenetic relationship among 59 representatives of the Coccinellidae using Bayesian inference and maximum likelihood methods, the results of which strongly support the monophyly of Coccinellinae. The phylogenetic results positioned Scymnus (Pullus) cardi in a well-supported clade with Scymnus (Pullus) loewii and Scymnus (Pullus) rubricaudus within the genus Scymnus and the tribe Scymnini. The mitochondrial sequence of S. (P.) cardi will contribute to the mitochondrial genome database and provide helpful information for the identification and phylogeny of Coccinellidae.

Genomic characterisation and ecological distribution of Mantoniella tinhauana: a novel Mamiellophycean green alga from the Western Pacific

Article

May 2024

Mamiellophyceae are dominant marine algae in much of the ocean, the most prevalent genera belonging to the order Mamiellales: Micromonas, Ostreococcus and Bathycoccus, whose genetics and global distributions have been extensively studied. Conversely, the genus Mantoniella, despite its potential ecological importance, remains relatively under-characterised. In this study, we isolated and characterised a novel species of Mamiellophyceae, Mantoniella tinhauana, from subtropical coastal waters in the South China Sea. Morphologically, it resembles other Mantoniella species; however, a comparative analysis of the 18S and ITS2 marker genes revealed its genetic distinctiveness. Furthermore, we sequenced and assembled the first genome of Mantoniella tinhauana, uncovering significant differences from previously studied Mamiellophyceae species. Notably, the genome lacked any detectable outlier chromosomes and exhibited numerous unique orthogroups. We explored gene groups associated with meiosis, scale and flagella formation, shedding light on species divergence, yet further investigation is warranted. To elucidate the biogeography of Mantoniella tinhauana, we conducted a comprehensive analysis using global metagenomic read mapping to the newly sequenced genome. Our findings indicate this species exhibits a cosmopolitan distribution with a low-level prevalence worldwide. Understanding the intricate dynamics between Mamiellophyceae and the environment is crucial for comprehending their impact on the ocean ecosystem and accurately predicting their response to forthcoming environmental changes.

Catenovulum adriaticum sp. nov., isolated from algae in the harbour of Susak, Croatia

Article

May 2024

The use of algae as feedstock for industrial purposes, such as in bioethanol production, is desirable. During a search for new agarolytic marine bacteria, a novel Gram-stain-negative, strictly aerobic, and agarolytic bacterium, designated as TS8ᵀ, was isolated from algae in the harbour of the island of Susak, Croatia. The cells were rod-shaped and motile. The G+C content of the sequenced genome was 38.6 mol%. Growth was observed at 11–37 °C, with 0.5–13 % (w/v) NaCl, and at pH 6.0–9.0. The main fatty acids were summed feature 3 (C16:1 ω6c and/or C16:1 ω7c), summed feature 8 (C18:1 ω7c and/or C18:1 ω6c), and C16:0. The main respiratory quinone was ubiquinone-8. The major polar lipids were phosphatidylethanolamine and phosphatidylglycerol. Analysis of 16S rRNA gene sequences indicated that the newly isolated strain belongs to the genus Catenovulum. Based on 16S rRNA gene sequence data, strain TS8ᵀ is closely related to Catenovulum sediminis D2ᵀ (95.7 %), Catenovulum agarivorans YM01ᵀ (95.0 %), and Catenovulum maritimum Q1ᵀ (93.2 %). Digital DNA–DNA hybridization values between TS8ᵀ and the other Catenovulum strains were below 25 %. Based on genotypic, phenotypic, and phylogenetic data, strain TS8ᵀ represents a new species of the genus Catenovulum, for which the name Catenovulum adriaticum sp. nov. is proposed. The type strain is TS8ᵀ (=DSM 114830ᵀ =NCIMB 15451ᵀ).

Complete genome sequences of 30 bacterial species from a synthetic community

Article

May 2024

We present complete genome sequences from 30 bacterial species that can be used to construct defined synthetic communities that stably form in the laboratory under controlled conditions.

CAREx: context-aware read extension of paired-end sequencing data

Article

May 2024
BMC BIOINFORMATICS

Background: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. Results: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to 99% for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. Conclusion: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx.

Effects of different assembly strategies on gene annotation in activated sludge

Article

May 2024
ENVIRON RES

Gamma-rays induced genome wide stable mutations in cowpea deciphered through whole genome sequencing

Article

Apr 2024
INT J RADIAT BIOL

Purpose: Gamma rays are the most widely exploited physical mutagen in plant mutation breeding. They are known to be involved in the development of more than 60% of global cowpea (Vigna unguiculata (L.) Walp.) mutant varieties. Nevertheless, the nature and type of genome-wide mutations induced by gamma rays have not been studied in cowpea and therefore, the present investigation was undertaken. Materials and methods: Genomic DNAs from three stable gamma rays-induced mutants (large seed size, small seed size and disease resistant mutant) of cowpea cultivar 'CPD103' in M6 generation along with its progenitor were used for Illumina-based whole-genome resequencing. Results: Gamma rays induced a relatively higher frequency (88.9%) of single base substitutions (SBSs) with an average transition to transversion ratio (Ti/Tv) of 3.51 in M6 generation. A > G transitions, including its complementary T > C transitions, predominated the transition mutations, while all four types of transversion mutations were detected with frequencies over 6.5%. Indels (small insertions and deletions) constituted about 11% of the total induced variations, wherein small insertions (6.3%) were relatively more prominent than small deletions (4.8%). Among the indels, single-base indels and, in particular, those involving A/T bases showed a preponderance, albeit indels of up to three bases were detected in low proportions. Distributed across all 11 chromosomes, only a fraction of SBSs (19.45%) and indels (20.2%) potentially altered the encoded amino acids/peptides. The inherent mutation rate induced by gamma rays in cowpea was observed to be in the order of 1.4 × 10⁻⁷ per base pair in M6 generation. Conclusion: Gamma-rays with a greater tendency to induce SBSs and, to a lesser extent, indels could be efficiently and effectively exploited in cowpea mutation breeding.

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Article

Jan 2012

Initial sequencing and analysis of the human genome

Article

Feb 2001

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

QUAST: quality assessment tool for genome assemblies

Conference Paper

Dec 2013

Background / Purpose: The challenge of de novo genome assembly has led to a diversity of algorithms being created. The results they produce vary significantly, which leads to the question of comparing assembly accuracy. Several attempts to address this problem have been recently published, however reproducing experiments on a variety of datasets is still proving to be problematic. Main conclusion: We introduce QUAST, a web tool made to compare all sides of assemblies in a very convenient way. User uploads contigs, and the tool builds a comparison table and several plots. The tool evaluates both with a given reference genome and without one. A console version of QUAST is also available.

QuorUM: An Error Corrector for Illumina Reads

Article

Jul 2013
PLOS ONE

Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
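The coverage arithmetic in the QuorUM abstract above can be checked with a short sketch. This is only an illustration under simplifying assumptions that are not in the source: every genome position is covered by exactly 100 reads, and each base call is wrong independently with probability 0.01. The function names are hypothetical and are not part of QuorUM.

```python
def expected_errors_per_base(coverage: float, error_rate: float) -> float:
    """Expected number of erroneous base calls among the reads covering one position.

    Assumes each covering read miscalls the base independently with
    probability `error_rate` (a simplification, not QuorUM's error model).
    """
    return coverage * error_rate


def prob_any_error(coverage: int, error_rate: float) -> float:
    """Probability that at least one covering read miscalls a given position."""
    return 1.0 - (1.0 - error_rate) ** coverage


if __name__ == "__main__":
    # 100x coverage at the advertised 1% error rate
    print(expected_errors_per_base(100, 0.01))
    print(prob_any_error(100, 0.01))
```

Under these assumptions the expected count is one erroneous base call per genome position, which matches the abstract's "on average" statement, while the chance that a given position is hit by at least one error is roughly 63%.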

GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms

Article

May 2013
BIOINFORMATICS

Motivation: A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods. Results: We evaluated the ability of multiple genome assembly programs to assemble bacterial genomes from a single, deep-coverage library. For our comparison, we chose bacterial species spanning a wide range of GC content and measured the contiguity and accuracy of the resulting assemblies. We compared the assemblies produced by this very high-coverage, one-library strategy to the best assemblies created by two-library sequencing, and we found that remarkably good bacterial assemblies are possible with just one library. We also measured the effect of read length and depth of coverage on assembly quality and determined the values that provide the best results with current algorithms. Contact: salzberg@jhu.edu Supplementary information: Supplementary data are available at Bioinformatics online.

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Article

Dec 2012

There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and optimizes for large genome. Benchmark using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive to other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower during the point of largest memory consumption.

QUAST: Quality assessment tool for genome assemblies

Article

Feb 2013
BIOINFORMATICS

Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST, a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website. Availability: http://bioinf.spbau.ru/quast. Supplementary information: Supplementary data are available at Bioinformatics online.

The sequence of the human genome

Article

SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

Article

Apr 2012
J COMPUT BIOL

The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.

Initial sequencing and comparative analysis of the mouse genome

Article

Jan 2002
