Do it yourself guide to genome assembly
Bilal Wajid and Erchin Serpedin
Corresponding author. Bilal Wajid, Department of Electrical and Computer Engineering at Texas A&M University (TAMU), College Station, TX, USA.
Tel.: 001-956-326-0348; Fax: 001-956-326-2439; E-mail: bilalwajidabbas@hotmail.com
Abstract
Bioinformatics skills required for genome sequencing often represent a significant hurdle for many researchers working in
computational biology. This humble effort highlights the significance of genome assembly as a research area, focuses on its
need to remain accurate, provides details about the characteristics of the raw data, examines some key metrics, emphasizes
some tools and draws attention to a generic tutorial with example data that outlines the whole pipeline for next-generation
sequencing. The article concludes by pointing out some major future research problems.
Key words: genome assembly; next-generation sequencing; comparative assembly; de novo assembly; de Bruijn graphs;
Eulerian path
Introduction
The art of genome assembly involves taking millions, if not billions, of smaller fragments, called ‘reads’, and
assembling them to form a cohesive pattern, called the sequence. The reads themselves are strings of nucleotides
{A, C, G, T}. They vary in length, and their characteristics are specific to the sequencing platform from which
they are derived. Some standard sequencing platforms are 454 GS by Roche; MiSeq, HiSeq and NextSeq by Illumina;
and Ion Torrent and Ion Proton by Life Technologies, as listed in Table 1.
This contribution aims to serve as a resource for researchers in the area of genome assembly via next-generation
sequencing, as well as a guide for scientists new to the field. Section I highlights the relation of genome
assembly to other key areas within computational biology, with emphasis on its need to report results accurately.
Section II discusses raw data, including the Sequence Read Archive (SRA) and the FASTA and FASTQ file formats. It
also provides details of some essential software tools and key hardware requirements. Section III provides
particulars on how to filter and correct raw data to determine the ‘right set’ of reads for the assembly.
Section IV answers the key question of how one can assemble a genome oneself. Section V reviews some essential
metrics needed to evaluate the assembly. Finally, Sections VI and VII offer considerations on some future goals.
To facilitate a better understanding of this research area, the Supplementary Section provides suitable examples
with real data that help reinforce the concepts.
Step 1: understanding the need to remain ‘true-to-life’
It is imperative in a naturalistic drawing that the image be as close to reality as possible. Imagine a painter
drawing a realistic picture and later asking his student to draw a copy from the original image. If the student
in turn requests his friend to make
Bilal Wajid received his B.Sc. (Hons) degree in Electrical Engineering from the University of Engineering &
Technology (UET), Lahore, Pakistan, in 2007 and his M.Sc. degree in Electrical Engineering from UET, Lahore,
Pakistan, in 2009. He is currently a PhD student in the Department of Electrical and Computer Engineering at
Texas A&M University (TAMU), College Station, TX. He is also teaching as visiting faculty at Texas A&M
International University, Laredo, TX, 78043. He has previously taught at the University of Engineering and
Technology (UET), Lahore, UET Kala Shah Kaku, TAMU and Duke University.
Erchin Serpedin (F’13) received the specialization degree in signal processing and transmission of information
from Ecole Superieure d’Electricite (SUPELEC), Paris, France, in 1992, the M.Sc. degree from the Georgia Institute
of Technology, Atlanta, in 1992, and the Ph.D. degree from the University of Virginia, Charlottesville, in January
1999. He is currently a professor in the Department of Electrical and Computer Engineering at Texas A&M
University, College Station. He is the author of 2 research monographs, 1 textbook, more than 100 journal papers
and 170 conference papers and has served as editor for a dozen journals, including IEEE Transactions on
Information Theory, IEEE Transactions on Communications, IEEE Transactions on Signal Processing, Signal Processing
(Elsevier), EURASIP Journal on Advances in Signal Processing, Physical Communication (Elsevier) and EURASIP
Journal on Bioinformatics and Systems Biology. He is currently serving as Editor-in-Chief for EURASIP Journal on
Bioinformatics and Systems Biology, an online journal published by Springer. His research interests include
statistical signal processing, information theory, bioinformatics and genomics. He is an IEEE Fellow.
© The Author 2014. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com
Briefings in Functional Genomics, 2014, 1–9
doi: 10.1093/bfgp/elu042
Letter to the Editor
Briefings in Functional Genomics Advance Access published November 11, 2014
Downloaded from http://bfgp.oxfordjournals.org/ at Texas A&M College Station on January 7, 2015
another copy from his work, any defect in the original image will only get multiplied, as each drawing is derived
from the ones that preceded it. The same concept applies equally within the genome assembly framework.
A number of research domains in bioinformatics draw their conclusions from the sequence itself. A sequence that
has not been reported accurately can affect subsequent downstream analyses, multiplying any defects in the
conclusions that are based on the assembled sequence. Research studies show that sequencing errors do affect the
perceived diversity in molecular surveys [1–3], such as gene genealogies [4], metagenomic gene prediction [5] and
16S rRNA-based studies [6]. Therefore, simply assembling all the reads into one contiguous sequence, a contig, is
not enough. It is crucial to ensure that the reported sequence indeed resembles what is truly present in the
cell. Some common hurdles are low-coverage areas, false-positive read-read alignments, false-negative alignments,
poor sequence quality, polymorphisms and repeated regions of the genome.
Step 2: know from where to begin
‘Practice by drawing things large, as if equal in representation and reality. In small drawings every large
weakness is easily hidden; in the large, the smallest weakness is easily seen.’
—Leon Battista Alberti.
The purpose of this and the following sections is to teach the reader how to sketch. The aim is not to immerse
the reader completely in the art of genome assembly but rather to provide an outline that one can use to master
the area. Just as any masterpiece requires a canvas and a pencil to begin, a complete genome sequencing procedure
requires a computer: a 64-bit machine running a UNIX-like operating system such as Mac OS X or Linux (e.g.
Ubuntu), with a minimum of 16 GB of RAM, is recommended. To facilitate researchers’ work, the authors of this
article have produced an environment necessary to build one’s genome, in the form of ‘Genobuntu: a Genome
Assembly Ubuntu Package’ (http://sourceforge.net/projects/genobuntu/,
http://people.tamu.edu/bilalwajidabbas/Genobuntu.html). Genobuntu is a software package containing more than 70
software tools and packages oriented toward next-generation sequencing. It supports a wide range of tools,
including pre-assembly tools, genome assemblers, post-assembly tools, commonly used biological tools and example
script files for different assembly pipelines (http://sourceforge.net/projects/genobuntu/).
The exercise starts by downloading raw data, principally the read files, from the SRA [7]. Data are stored in
‘.sra’ format and can be converted into the necessary FASTA and FASTQ formats using the SRA toolkit
(http://eutils.ncbi.nih.gov/Traces/sra/?view=software); see Figure 1 and Steps 1, 2 and 3 in the
Supplementary Section.
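As a rough illustration of the conversion step, the sketch below builds a command line for `fastq-dump`, the SRA toolkit program that writes FASTQ files from a run accession. It is only a sketch under stated assumptions: the helper names `fastq_dump_command` and `convert_run` are hypothetical, the accession is the run used later in Table 5, and the exact steps of the Supplementary Section may differ.

```python
import shutil
import subprocess


def fastq_dump_command(accession, split_paired=True):
    """Build the argument list for the SRA toolkit's fastq-dump program."""
    cmd = ["fastq-dump"]
    if split_paired:
        # --split-files writes the mates of a paired run to separate files.
        cmd.append("--split-files")
    cmd.append(accession)
    return cmd


def convert_run(accession):
    """Run fastq-dump if it is installed; return True when it was executed."""
    if shutil.which("fastq-dump") is None:
        return False  # toolkit not found on PATH, so nothing is executed
    subprocess.run(fastq_dump_command(accession), check=True)
    return True


# Show the command without executing it (hypothetical helper, real accession).
print(" ".join(fastq_dump_command("SRR001657")))
```

Calling `convert_run("SRR001657")` would download and convert the run only when the toolkit is installed, which is why the example merely prints the command.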
The FASTQ file uses four lines to represent the sequence and its quality:
@SRR123.321 Example length=30
GATTTGGGGTTCACTGCAGTATGGGGCAAA
+SRR123.321 Example length=30
!"*((((***+))PSG++)(a--?).1***
Line 1: ‘@’ followed by the sequence identifier, similar to the FASTA format
Line 2: sequence line(s)
Line 3: ‘+’ followed by the sequence identifier (may be left blank)
Line 4: ASCII encoding of the quality values
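The four-line layout can be read with a few lines of Python. The following is a minimal sketch, not a full parser: it assumes that each sequence and each quality string occupies exactly one line, and the names `FastqRecord` and `read_fastq`, as well as the two short records, are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with one line per sequence/quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up records; the '+' line may repeat the identifier or be left blank.
example = io.StringIO("@read1\nACGTN\n+\nII#5!\n@read2\nGGCC\n+read2\nIIII\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```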
The ASCII characters encode the quality values (Q-values), which are log-transformed error probabilities.
Q-values are numerical values signifying the quality of each base call and are evaluated separately for each
platform [8,9]. For example, for Illumina the formula is as follows [8]:

$$Q_{\mathrm{illumina}} = -10 \log_{10}\left(\frac{P_e}{1 - P_e}\right),$$

where $P_e$ is the probability of identifying a base incorrectly. For Sanger and other platforms, the formula is
as follows [8]:

$$Q_{\mathrm{PHRED}} = -10 \log_{10}(P_e).$$

$Q_{\mathrm{PHRED}}$ and $Q_{\mathrm{illumina}}$ can be converted into one another using the relations [8,10]:

$$Q_{\mathrm{PHRED}} = 10 \log_{10}\left(10^{Q_{\mathrm{illumina}}/10} + 1\right), \qquad Q_{\mathrm{illumina}} = 10 \log_{10}\left(10^{Q_{\mathrm{PHRED}}/10} - 1\right).$$
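The relation between $P_e$ and the two quality scales can be checked numerically. The sketch below implements the two definitions and the conversions that follow from eliminating $P_e$ between them; the function names are illustrative only and do not come from any particular library.

```python
import math


def phred_from_error(p_e):
    """Q_PHRED = -10 log10(P_e)."""
    return -10.0 * math.log10(p_e)


def error_from_phred(q_phred):
    """Invert the Phred definition: P_e = 10^(-Q/10)."""
    return 10.0 ** (-q_phred / 10.0)


def illumina_from_error(p_e):
    """Q_illumina = -10 log10(P_e / (1 - P_e)), the odds-based score."""
    return -10.0 * math.log10(p_e / (1.0 - p_e))


def phred_from_illumina(q_illumina):
    """Q_PHRED = 10 log10(10^(Q_illumina/10) + 1)."""
    return 10.0 * math.log10(10.0 ** (q_illumina / 10.0) + 1.0)


def illumina_from_phred(q_phred):
    """Q_illumina = 10 log10(10^(Q_PHRED/10) - 1); requires Q_PHRED > 0."""
    return 10.0 * math.log10(10.0 ** (q_phred / 10.0) - 1.0)


p_e = 0.01  # one error in 100 base calls
print(phred_from_error(p_e))     # Phred score for this error probability
print(illumina_from_error(p_e))  # slightly lower on the odds-based scale
```

The two scales nearly coincide for high-quality bases and diverge only when $P_e$ is large, which is why the distinction matters mostly for poor base calls.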
PHRED scores are the standard for representing sequencing base quality scores, as shown in Table 2. The use of
ASCII characters not only encodes log-probabilities by rounding them off to the nearest integer values but is
also inherently convenient from a computational perspective, as can be inferred from Table 3. In terms of
ASCII-encoded Q-values, the following characters depict increasing order of quality (ASCII), from left to right
(http://en.wikipedia.org/wiki/FASTQ_format):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
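Decoding a quality string amounts to subtracting a platform-specific offset from each character code. The sketch below assumes the offsets implied by Table 3 (33 for Sanger and Illumina 1.8, 64 for Illumina 1.3 and for the older encoding that starts at ASCII 59); the function name `decode_quality` is hypothetical.

```python
def decode_quality(quality, offset=33):
    """Convert an ASCII-encoded quality string into a list of integer Q-values."""
    return [ord(ch) - offset for ch in quality]


# Sanger / Illumina 1.8 style (offset 33).
print(decode_quality("!+5?I"))           # [0, 10, 20, 30, 40]

# Illumina 1.3 style (offset 64).
print(decode_quality("@Jh", offset=64))  # [0, 10, 40]
```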
Table 1. Sequencing platforms
Platform                       Company                        Resource
Ion Torrent and Ion Proton     Life Technologies              http://lifetechnologies.com
454 GS                         Roche                          http://454.com/
GENIUS                         GenapSys                       http://genapsys.com/
NanoTag sequencer              Genia                          http://geniachip.com/
GnuBIO platform                GnuBIO system                  http://gnubio.com/
(a)                            Lasergen                       http://lasergen.com/
(a)                            Nabsys                         http://nabsys.com/
PACBIO RS II                   Pacific Biosciences            http://pacificbiosciences.com/
MinION and GridION             Oxford Nanopore Technologies   https://nanoporetech.com/
MiSeq, HiSeq and NextSeq       Illumina                       http://illumina.com/
Sequencing By Xpansion (SBX)   Stratos Genomics               http://stratosgenomics.com/
Optipore sequencing            Noblegen Biosciences           http://noblegenbio.com/
(a) At the time of submission of this article, both Lasergen and Nabsys were working on a new platform.
Step 3: filtering/correcting low-quality reads
‘There are only 3 colors, 10 digits, and 7 notes; it is what
we do with them that’s important’
—Jim Rohn.
As one must search to find brilliant colors, one must investigate to find the right set of reads if one is to
pursue the correct pattern, the correct sequence. Looking at all the reads does not help. One has to select the
best set of reads, trim low-quality ends and collapse identical reads. A simplistic way of doing so is to remove
all reads that contain the base N. An improved approach is to remove low-quality reads. Assuming that each base
is independent of all others, the overall quality of a read of length $r$ is
$P_{qual} = \prod_{i=1}^{r} (1 - P_{i,e})$, where $P_{i,e}$ is the $P_e$ of the $i$th base and can be derived
from the Q-value shown above. If $P_{qual} < q$, where $q$ is a user-defined parameter, the read may be removed
from further processing [11–15]. An enhanced approach is to match reads against known ribosomal and
heterochromatin DNA and to remove those that match, as the assembly could be improved by ignoring these
repetitive DNA elements [16]. To go further, one may even try to correct low-quality reads. The authors of this
article have provided a tutorial on how to filter low-quality reads, with suitable examples, in the Supplementary
Section. The interested reader is referred to Steps 4, 5 and 6 in the Supplementary Section.
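The filtering steps described in this section can be sketched as follows. This is an illustrative sketch rather than the procedure of any particular tool: it assumes Phred-scaled qualities with an ASCII offset of 33, and the function names, the threshold and the toy reads are all made up for the example.

```python
def read_quality(quality, offset=33):
    """P_qual: product over all bases of (1 - P_e), with P_e = 10^(-Q/10)."""
    p_qual = 1.0
    for ch in quality:
        q = ord(ch) - offset
        p_qual *= 1.0 - 10.0 ** (-q / 10.0)
    return p_qual


def trim_low_quality_end(sequence, quality, min_q=20, offset=33):
    """Remove trailing bases whose Q-value is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def filter_reads(reads, threshold):
    """Drop reads containing N, reads with P_qual < threshold, and duplicates."""
    kept, seen = [], set()
    for sequence, quality in reads:
        if "N" in sequence.upper():
            continue  # simplistic filter: ambiguous base present
        if read_quality(quality) < threshold:
            continue  # improved filter: overall read quality too low
        if sequence in seen:
            continue  # collapse identical reads, keeping the first copy
        seen.add(sequence)
        kept.append((sequence, quality))
    return kept


reads = [("ACGT", "IIII"), ("ACNT", "IIII"), ("ACGT", "IIII"), ("GGCC", "!!!!")]
print(filter_reads(reads, threshold=0.9))        # [('ACGT', 'IIII')]
print(trim_low_quality_end("ACGTA", "IIII#"))    # ('ACGT', 'IIII')
```

Matching reads against ribosomal or heterochromatin DNA and correcting low-quality reads are not shown, as they require an aligner and an error model, respectively.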
Step 4: assembling the sequence
‘Painting is damned difficult - you always think you’ve
got it, but you haven’t.’
—Paul Cezanne.
Holding the brush in hand and looking at the canvas, the least one can do is have a picture in mind before making
the strokes. One must either opt for abstract art, where one’s emotions run wild and dictate what one paints, or
one may choose to paint scenery, in which case looking at the scene helps a lot.
Genome assemblers may be broadly divided into reference-assisted assemblers (comparative assemblers) and de novo
assemblers. Reference-assisted assembly is more like painting scenery. The landscape in the painting may look a
little different and the terrain need not be the same, but having the scenery in front of you still makes the job
relatively simple. It is common to enlist the assistance of a reference sequence for the assembly of a target
genome, even though the target
Figure 1. SRA, Read Data: ERR028217 is the run number, whereas 12521 is the read number. As this is an example of paired data, the read has the .1 extension (12521.1),
whereas its reverse complement has the .3 extension (12521.3). Furthermore, underneath the read are each of its individual bases with their associated Q-values.
Table 2. Phred quality scores are logarithmically linked to error probabilities (http://en.wikipedia.org/wiki/Phred_quality_score)
Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10 000                          99.99%
50                    1 in 100 000                         99.999%
60                    1 in 1 000 000                       99.9999%
Table 3. Evolution of quality scores and their corresponding ASCII encoding
Platform         Quality scores          ASCII (decimal)   ASCII (characters)
Sanger           Phred: 0 to 93          33 to 126         !"# ... |}~
Illumina (1.0)   -5 to 62                59 to 126         ;<= ... |}~
Illumina (1.3)   Phred: 0 to 62          64 to 126         @AB ... |}~
Illumina (1.8)   Phred+33: 33 to 126     33 to 126         !"# ... |}~
sequence may have many structural variations, copy number variations and single-nucleotide polymorphisms relative
to the reference. Now imagine assembling billions of reads in the right order without a reference and, in
addition, ensuring that the quality of the assembly is good. When one is compelled to assemble a sequence without
a reference, the task is referred to as de novo assembly.
‘No one is an artist unless he carries his picture in his
head before painting it, and is sure of his method and
composition.’
—Claude Monet
Carrying a picture in one’s mind, that is, using a reference sequence for assembly, is usually preceded by the
question of how one can determine which reference sequence is optimal. The simplest approach is to count the
number of reads that align to each candidate reference sequence; the optimal reference is the one onto which most
reads align. A more sophisticated approach is to use the minimum description length (MDL) principle. The MDL
framework takes into consideration both the length of the reference and the number of reads that align to the
reference to evaluate a ‘code length’. The optimal reference is the one that has the smallest code length
[10,17–21]. A yet more specific technique, applied to Tuberculosis strains, is ‘spoligotyping’. Spoligotyping is
like fingerprinting. Similar strains of Tuberculosis share the same repeat units, which act like fingerprints.
Therefore, an optimal reference should have the same fingerprint as the genome being assembled [22–26].
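The simplest of these criteria, counting aligned reads, can be sketched in a few lines. The sketch deliberately replaces a real read aligner with exact substring matching on either strand, which is an assumption made only to keep the example self-contained; the candidate names, sequences and reads are invented.

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA string over {A, C, G, T}."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_aligned_reads(reference, reads):
    """Count reads that occur exactly in the reference, on either strand."""
    return sum(
        1
        for read in reads
        if read in reference or reverse_complement(read) in reference
    )


def pick_reference(candidates, reads):
    """Return the name of the candidate onto which most reads align."""
    return max(candidates, key=lambda name: count_aligned_reads(candidates[name], reads))


candidates = {"refA": "ACGTACGTTTGACC", "refB": "ACGTTTTTTTGGCC"}  # toy references
reads = ["ACGTAC", "TTGACC", "GGTCAA"]  # the last read lies on the reverse strand
print(pick_reference(candidates, reads))  # refA
```

The MDL criterion and spoligotyping mentioned above need additional modelling and are not covered by this sketch.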
As far as the various assemblers are concerned, the aim of this contribution is not to compare different
assemblers or to search for the best or the fastest one; rather, the aim is to present some common assembly
frameworks (also called pipelines). Reviews that explain and compare various genome assemblers may be found in
[20,27,28]. Once the optimal reference has been selected, the reference and the set of reads are presented to a
comparative assembly pipeline. The comparative assembly pipeline consists of three components: ‘alignment’,
‘layout’ and ‘consensus’, as described graphically in Figure 2. The word ‘pipeline’ indicates how a set of
interconnected methods come together to transform the raw data into the novel genome. The authors have provided
example shell files for two such comparative assemblers, Maq and MIB. In addition, Step 9 in the ‘Introductory
Tutorial to Genome Assembly’ explains how to do comparative assembly using MIB with the help of appropriate
examples. It is important to note that reference-assisted assembly works only in the presence of a suitable
reference sequence. The reference eases the job of the assembler, as the relative placement of the reads and the
contigs is already established by the reference. Given a closely matched reference sequence, the task is further
reduced to simply identifying the variations between the reference and the target sequence and incorporating the
differences in the target sequence [29,30]. Determining translocations, however, is not easy and requires careful
analysis, as the order of some of the contigs in the target sequence may differ from that in the reference
sequence. In the absence of a reference sequence, one has to move toward de novo assembly. According to GOLD
(Genomes OnLine Database), as of 31 December 2013, 277 archaeal, 11 775 bacterial and 312 eukaryal genomes had
been sequenced, which still leaves many unique genomes waiting to be sequenced. There are many elegant solutions
for de novo assembly, as illustrated in Figure 3.
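The three components of the comparative pipeline, alignment, layout and consensus, can be illustrated with a toy example. The sketch below is not how Maq or MIB work internally: it assumes gap-free alignment by exhaustive search with a bounded number of mismatches, forward-strand reads only, and a simple majority vote per position; all names and sequences are invented.

```python
from collections import Counter


def best_position(reference, read, max_mismatches):
    """Alignment: return the start with the fewest mismatches, or None."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return None if best is None else best[0]


def comparative_assembly(reference, reads, max_mismatches=1):
    """Layout and consensus: pile up aligned reads and take the majority base."""
    columns = [Counter() for _ in reference]
    for read in reads:
        start = best_position(reference, read, max_mismatches)
        if start is None:
            continue  # read does not align well enough and is ignored
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    # Positions without any read coverage are reported as N.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


reference = "GATTACAGCCTA"
# Reads sampled from a target that differs from the reference at one position.
reads = ["GATTAC", "TTACTG", "ACTGCC", "TGCCTA"]
print(comparative_assembly(reference, reads))  # GATTACTGCCTA
```

The consensus carries the target's variant (T instead of A at the seventh base), which is the sense in which the task reduces to identifying differences from the reference.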
For instance, a greedy approach works by taking the locally optimal choice at each stage in the hope of finding
the global optimum [31]. It does so by taking an unassembled read and extending it using the best overlapping
read on its 3′ end. It continues until no overlapping reads are found, at which point it repeats the same process
in the other direction by extending the contig at its 5′ end [20].
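A minimal sketch of this greedy extension is given below. It assumes error-free, forward-strand reads and exact suffix-prefix overlaps, which real greedy assemblers do not require; the function names, the minimum overlap and the toy reads are illustrative choices.

```python
def overlap(left, right, min_len):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_contig(reads, min_overlap=3):
    """Grow one contig from the first read: 3' end first, then 5' end."""
    pool = list(reads)
    contig = pool.pop(0)
    # Extend the 3' end with the best-overlapping unused read.
    while pool:
        best = max(pool, key=lambda r: overlap(contig, r, min_overlap))
        k = overlap(contig, best, min_overlap)
        if k == 0:
            break
        contig += best[k:]
        pool.remove(best)
    # Repeat in the other direction to extend the 5' end.
    while pool:
        best = max(pool, key=lambda r: overlap(r, contig, min_overlap))
        k = overlap(best, contig, min_overlap)
        if k == 0:
            break
        contig = best[:len(best) - k] + contig
        pool.remove(best)
    return contig, pool


contig, unused = greedy_contig(["GTACCT", "ATGCGTA", "CCTGA"])
print(contig)  # ATGCGTACCTGA
```

Because each step commits to the locally best overlap, repeats can mislead this strategy, which is one reason the graph-based approaches discussed next are widely used.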
As one may see in Figure 3, most of the algorithms use the ‘overlap-layout-consensus’ paradigm. The paradigm
starts by forming an ‘overlap’ structure by joining all the reads with their respective overlapping reads. Next,
a ‘layout’ is established by searching for a single path from the beginning, the root, to the end, the leaf, that
traverses all the reads. This is the point where one encounters most challenges. In graph-theoretic terms, each
read represents a node and an overlap is depicted by an edge; ideally, therefore, there should be only one graph,
in which a single path from the root to the leaf represents the entire sequence. In reality, however, one obtains
not one but multiple disjoint graphs, where each graph depicts a contig. Furthermore, each graph is plagued with
many branches and loops. Branches that are small may be discarded, whereas longer branches compete with one
another to serve as representatives for the contig. Loops portray repeat regions, so one must decide how many
times the repeats
Figure 2. Comparative assembly: Reads are aligned to a reference sequence. The alignment process may allow one or more mismatches between each individual read
and the reference sequence. The alignment of the reads generates a layout. Based on majority base call, the layout produces a consensus sequence.
should be placed in the final assembly. Assemblers therefore spend a significant amount of time resolving such
potential hazards, in multiple ways, as depicted in Figure 4 [20,28,32,33].
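The ‘overlap’ stage and the observation that reads fall into several disjoint graphs can be illustrated as follows. This is a toy sketch under simplifying assumptions (exact suffix-prefix overlaps, forward strand only, a quadratic all-against-all comparison); the function names and reads are invented, and none of the simplification steps of Figure 4 are implemented.

```python
from itertools import permutations


def suffix_prefix_overlap(left, right, min_len):
    """Longest proper suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)) - 1, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge a -> b stores the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


def connected_components(graph):
    """Group reads that are linked by overlaps, ignoring edge direction."""
    neighbours = {node: set(out) for node, out in graph.items()}
    for node, out in graph.items():
        for target in out:
            neighbours[target].add(node)
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


reads = ["ATGCGT", "GCGTAC", "GTACCT", "TTTAAA", "TAAAGG"]
graph = build_overlap_graph(reads)
print(len(connected_components(graph)))  # 2 disjoint graphs, i.e. 2 candidate contigs
```

Nodes with more than one outgoing edge in `graph` would correspond to the branches discussed above; the toy reads contain none.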
‘From now on, I’ll connect the dots my own way.’
—Bill Watterson.
Having completed the contigs, one must travel further to ‘connect the dots’, a process called scaffolding.
Scaffolding aims not only to connect the contigs to elongate them but also to order them. In other words,
scaffolding defines which contig comes first and which contig comes next in relation to the whole sequence. The
process uses forward and reverse reads to link distinct contigs [20,21].
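The ordering aspect of scaffolding can be sketched as below. The sketch makes strong assumptions that a real scaffolder such as SSPACE does not: read pairs are taken as already mapped, contig orientation is taken as known, insert sizes are ignored, and a link is accepted only when supported by a minimum number of pairs. The function name, the `min_support` parameter and the contig labels are hypothetical.

```python
from collections import Counter


def scaffold_order(links, min_support=2):
    """Order contigs from read-pair links.

    `links` holds one (left_contig, right_contig) tuple per read pair whose
    forward read maps to left_contig and whose reverse read maps to right_contig.
    """
    support = Counter((a, b) for a, b in links if a != b)
    successor, has_predecessor = {}, set()
    # Accept the best-supported links first; each contig gets at most one
    # successor and at most one predecessor.
    for (a, b), count in support.most_common():
        if count < min_support:
            continue
        if a in successor or b in has_predecessor:
            continue
        successor[a] = b
        has_predecessor.add(b)
    scaffolds = []
    for start in successor:
        if start in has_predecessor:
            continue  # not the first contig of a chain
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c1", "c3")]
print(scaffold_order(links))  # [['c1', 'c2', 'c3']]
```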
The authors have provided example shell script files for the assembly pipelines of SHARCGS, QSRA, IDBA, SSAKE,
VCAKE, ABySS, Velvet and MAQ. In addition, the ‘Introductory Tutorial on Genome Assembly’ contains an example of
de novo assembly using VCAKE, Velvet and IDBA and of scaffolding using SSPACE; see Steps 7, 8, 9, 10 and 11 in
the Supplementary Section.
Step 5: evaluating an assembly
Similar to any painting that may be both scaled and evaluated based on some objective criteria like time, effort,
theme, color scheme, proportion and detail (http://www.wikihow.com/Evaluate-Paintings), one may find that
evaluating an assembly requires careful analysis as well. Table 4 illustrates some of the commonly used assembly
metrics/statistics, whereas Table 5
Figure 3. Common assembly algorithms grouped according to their working schemes.
serves as an example of how to compare different assemblies using these quality statistics. Furthermore, Steps 12
and 13 in the Supplementary Section also highlight suitable examples. For a more thorough and in-depth
discussion, the reader is directed to reference [27].
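The N50 family of metrics is easy to compute directly. The sketch below reproduces the worked example of Table 4, six contigs of {13, 6, 5, 3, 2, 1} Mb; the function name `n_stat` and its parameters are illustrative, and the genome size of 40 Mb used for NG50 is an arbitrary value chosen for the example.

```python
def n_stat(lengths, fraction=0.5, genome_size=None):
    """Length of the contig at which the running total reaches the threshold.

    With genome_size=None the threshold is a fraction of the assembled size
    (N50, N75, N90); with a known or estimated genome size it yields NG50 etc.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return None  # the contigs never reach the threshold (possible for NG50)


contigs = [13, 6, 5, 3, 2, 1]  # contig lengths in Mb, as in Table 4
print(n_stat(contigs))                  # 6
print(n_stat(contigs, fraction=0.75))   # 5
print(n_stat(contigs, genome_size=40))  # 5
```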
Considerations
‘Without continual growth and progress, such words as improvement, achievement, and success have no meaning.’
—Benjamin Franklin.
Genome assembly is evolving. As it matures, the sheer need and utility of this research area is forcing it to
encompass the critical aspects of ‘reproducibility’, ‘accessibility’, ‘transparency’, ‘scalability’ and
‘simplicity’. More and more genomes are being published in which the authors provide access to raw data and
details of the assembler together with the settings employed to derive those sequences. Scalability of the
algorithms comes into question when one tries to sequence eukaryotes. A number of algorithms are being
parallelized, with careful attention being given to Hadoop and MapReduce architectures [35–39]. GATK is a
MapReduce framework that
Figure 4. Graph simplification techniques: (A-1) Ambiguous paths; (A-2) Pulling apart operation: the resultant graph is divided into four possible paths. (B-1) Simplistic
path; (B-2) Removing intermediate nodes: nodes that have indegree = outdegree = 1 are collapsed to form one giant node, also referred to as a ‘unitig’. (C-1)
Unnecessary edges; (C-2) Removing edges: an edge between two nodes is removed if there is an intermediate node between them that connects them simplistically.
(D-1) Loop; (D-2) Disambiguation: the loop edge is unrolled and integrated in the continuous edge from left to right. (E-1) Shorter paths are shown encircled; (E-2)
Removing tips: a tip is defined as a chain of nodes that is disconnected at one end. Tips are removed if they are shorter than t, where t is a user-defined parameter.
Furthermore, if there is a longer/common path, it will also trigger a tip’s removal.
Table 4. Some common statistics used in evaluating the quality of an assembly. For each statistic, ↑ indicates that higher is better and ↓ implies that less is better.
Metric: Description
N50: Quantifies the typical (length-weighted median) contig length of an assembly. Suppose a sequence ‘A’ has six contigs with a total assembly size of 30 Mb. Their lengths are {13, 6, 5, 3, 2, 1} Mb, arranged in decreasing order. Adding the first two {13, 6} gives 19 Mb, which exceeds 50% of the total assembled size of 30 Mb. The N50 would then be 6 Mb, the length of the last contig added when crossing the 50% threshold of the total assembled size of the genome (↑).
NG50: The length of the scaffold at which 50% of the genome length is covered. Here the length of the genome is either known or predicted [27] (↑).
Accuracy: The genome is considered accurate if 90% of the bases have at least 5× read coverage (↑).
Continuity: Similar to N50, there are other metrics like N75 and N90, where one identifies the length of the scaffold crossing the 75% and 90% thresholds of the total assembled size of the genome. An assembly is considered to have continuity provided its N90 > 5 Kb.
Choppiness: The average contig length should be >5000 bases (5 Kb). Otherwise, the assembly would be considered to have too many chops or pieces and would need to be redrafted to contain fewer segments (↑).
Number of genes: The assembly that identifies most of the known genes in the organism is considered the better assembly. See [27] for details of highly conserved core eukaryotic genes (↑).
Number of gaps in the assembly: REAPR, a software tool, uses paired-read information to find errors in assemblies by aligning a subset of reads from short-insert libraries onto the scaffolds. These alignments help determine scaffolding errors [34] (↓).
Validity: What fraction of the assembly (set of scaffold sequences) can be validated by the reference sequence [27]. If the scaffolds provided by the assembly cover >90% of the actual genome, then the draft assembly is considered complete (↑).
Scaffold statistics: Longest scaffold (↑): typically, the greater the length of the largest contig, the better the assembly. Similar is the case of the shortest scaffold (↑). Number of scaffolds (↓): typically, an assembly that has fewer scaffolds would be better than one that has more scaffolds. For instance, the best assembly would be a continuous genome with no segments, which would therefore have only one scaffold. Number of scaffolds > X nt (↑), percentage of scaffolds > X nt (↑), where X is a user-defined length; NG50 scaffold length (↑); LG50 scaffold count (↓): how many scaffolds are counted in reaching the NG50 threshold. Total scaffold length as a percentage of estimated genome size (the closer to 100% the better). All the above depict the quality of the assembly.
Contig statistics: Longest contig (↑); shortest contig (↑); total size of contigs (↑); number of contigs > X nt (↑) (‘nt’ stands for non-redundant nucleotide); percentage of contigs > X nt, where X is a user-defined length; NG50 contig length (↑); LG50 contig count (↓). Percentage of assembly in scaffolded contigs (↑): contigs may be joined into scaffolds or remain unscaffolded. This metric indicates how much of the assembly is represented by scaffolded contigs. The opposite would be percentage of assembly in unscaffolded contigs (↓).
functions in a parallel fashion on one system, but does not work
in parallel on multiple systems [40,41].
‘Simplicity is the ultimate sophistication.’
—Leonardo da Vinci.
In terms of simplicity, considerable work is needed. The success of the Windows operating system is one example
of how software simplicity helps. Windows gave the layman an opportunity to perform nontrivial tasks with the
click of a mouse, providing it with a monopoly in the market. Therefore, for any product to be successful, it
should be simple to operate [42]. Genome assemblers require a decent set of skills not only to operate but even
to install. This is because many of them assume a considerable number of dependencies that need to be installed
on the system beforehand. Genobuntu enables simplicity in this regard, as it helps to install and learn many of
the common tools used in research (https://sourceforge.net/projects/genobuntu/).
Conclusion: the two brush strokes
‘The Chinese use two brush strokes to write the word
‘crisis.’ One brush stroke stands for danger; the other
for opportunity. In a crisis, be aware of the danger—but
recognize the opportunity.’
—John F. Kennedy.
This contribution discussed the art of genome assembly from a qualitative standpoint, detailing the significance
of this research area with attention to ensuring that the assembled genome is ‘true-to-life’. The article
highlighted the FASTA and FASTQ file formats and showed how to use Q-values to filter and correct low-quality
reads. Necessary tools, like Genobuntu, were introduced together with example assembly pipelines in the form of
script files and assembly tutorials. Useful metrics that help determine the quality of one’s work were
elaborated, along with considerations on the future of genome assembly from the perspective of ‘reproducibility’,
‘accessibility’, ‘transparency’, ‘scalability’ and ‘simplicity’. Ultimately, one must recognize that the
opportunities in this research area are immense. We are still far from having a hardware and software support
mechanism that extracts meaningful results and helps translate the information derived from genome assembly into
efficient therapies. However, as most of the genomes on the planet are yet to be sequenced, this research area
will remain fresh for many years to come.
Table 5. Reads were derived from the run ‘SRR001657’ from the SRA. A ↑ indicates that higher is better and a ↓ implies that less is better.
S. No  Assembly metrics               VCAKE        QSRA        IDBA             MIB
       Version                        ver. 1       ver. 1      ver. 1           ver. 1
1      No. of contigs ↓               156 834      76 899      34 775           1
2      Length of largest contig ↑     4195         5832        5285             6 261 358
3      N50 ↑                          92           132         176              6 261 358
4      N75 ↑                          297          66          112              6 261 358
5      N90 ↑                          155          44          83               6 261 358
6      NG50 ↑                         227          159         148              6 261 358
7      NG75 ↑                         140          89          84               6 261 358
8      Contigs N50 ↑                  26 985       14 907      8860             1
9      Contigs > 200 bp ↑             10 428       7550        6904             1
10     Mean ↑                         76.312       98.71       157.81           6 261 358
11     Median ↑                       45           61          118              6 261 358
12     Sum of the contig lengths (a)  11 968 268   7 590 560   5 487 859        6 261 358
13     Coverage ↑                     6            10          14               13
14     Runtime (hours) ↓              2            0.25        0.1              31
15     Memory used (GB) ↓             1.3          1.6         0.5              8
16     Parameters used                -e 20,       -u 17,      -mink 17,        s1=1000,
                                      -k 33,       -k 33,      -maxk 33,        s2=100000,
                                      -o 34,       -o 34,      -step 1,         Match 4,
                                      -n 17,       -l 16,      -min_count 2,    Mismatch -5,
                                      -t 5,        -t 3,       -min_contig 34   Gap 0
                                      -m 16,       -c 0.6
                                      -v 3
(a) Should be as close as possible to the length of the target sequence.
The assembly was conducted using four assemblers and compared using standard metrics computed with the program ‘assembly statistics’.
Pseudomonas aeruginosa UCBPP-PA14 was used as a reference sequence by MIB for the assembly of PAb1. To facilitate reproducibility of
results, the version number of each assembler and the assembly parameters used have also been provided. In the assembly above, VCAKE,
QSRA and IDBA are de novo assemblers, whereas MIB is a comparative assembler. Ideally, one would prefer a single contig that represents
the entire target sequence; however, this is rarely the case. MIB’s output does show one giant contig with good coverage (13) and a
length close to the size of the target sequence (6.7 million bases in length); however, this is only possible in the presence of a
reference sequence, in this case UCBPP-PA14, and this is rare. In the absence of a reference sequence (which is common), one has to
resort to de novo assemblies, as comparative assemblers simply do not work. Comparing only the de novo assemblies, IDBA has the highest
coverage (14), the least number of contigs (34 775), the best N50 (176) and the best mean contig length (157.8). However, collectively,
the contigs reported by IDBA leave about a million bases of the target sequence unreported. On the other hand, among de novo assemblies,
VCAKE has the highest number of contigs larger than 200 base pairs (bp) (10 428) and the highest NG50 (227) and NG75 (140), but for some
reason all its contigs collectively report almost twice the length of the target sequence (about 12 million bases). This clearly shows a
huge degree of overlap among the contigs, or perhaps some of the contigs reported by VCAKE are simply redundant. Nevertheless, comparing
different assemblies is a difficult task, one that requires careful analysis and an area where one may have to resort to using inputs
from all the assemblies to report a good target sequence.
Supplementary Data
Supplementary data are available online at http://bib.oxfordjournals.org/.
Acknowledgement
B.W. would like to extend special thanks to his mother for providing
the necessary inspiration to write and fund the article
and to his father, who checked the article and provided suitable suggestions.
Addendum: The ‘Do it Yourself Guide to Genome Assembly:
Supplementary Section’ is available with this article.
Genobuntu is available at http://people.tamu.edu/bilalwajidabbas/Genobuntu.html
and http://sourceforge.net/projects/genobuntu/.
Funding
This paper has been partially funded by the Qatar National
Research Fund-National Priorities Research Program grant 09-874-3-235.
References
1. Dickie IA. Insidious effects of sequencing errors on perceived diversity in molecular surveys. New Phytol 2010;188:916–8.
2. Medinger R, Nolte V, Pandey RV et al. Diversity in a hidden world: potential and limitation of next-generation sequencing for surveys of molecular diversity of eukaryotic microorganisms. Mol Ecol 2010;19:32–40.
3. Kunin V, Engelbrektson A, Ochman H et al. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 2010;12:118–23.
4. Clark AG and Whittam TS. Sequencing errors and molecular evolutionary analysis. Mol Biol Evol 1992;9:744–52.
5. Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 2009;10:520.
6. Schloss PD, Gevers D and Westcott SL. Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One 2011;6:e27310.
7. Leinonen R, Sugawara H and Shumway M. The sequence read archive. Nucleic Acids Res 2011;39:D19–21.
8. Cock P, Fields C, Goto N et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010;38:1767–71.
9. Deorowicz S and Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011;27:860–2.
10. Wajid B, Nounou M, Nounou H et al. Gibbs-beca: Gibbs sampling and Bayesian estimation for comparative assembly. MIC-BEN 2013;3:1.
11. Patel R and Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012;7:e30619.
12. Yuan B. Mapping Next Generation Sequence Reads. 2010. http://jura.wi.mit.edu/bio/education/hot_topics/shortRead_mapping/Mapping_HTseq.pdf.
13. Mane S, Modise T and Sobral B. Analysis of high-throughput sequencing data. Methods Mol Biol 2011;678:1–11.
14. Hannon G. Fastx-toolkit. 2010. http://hannonlab.cshl.edu/fastx_toolkit/.
15. Goecks J, Nekrutenko A, Taylor J et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86.
16. Myers E, Sutton G, Delcher A et al. A whole-genome assembly of Drosophila. Science 2000;287:2196.
17. Wajid B and Serpedin E. Minimum description length based selection of reference sequences for comparative assemblers. GENSIPS 2011:230–3.
18. Wajid B, Aramayo R and Serpedin E. Exploring minimum description length and probabilistic distributions of the reference sequences for comparative assembly of genomes. Proceedings of the International Conference GSP, 2011.
19. Wajid B, Serpedin E, Nounou M et al. Optimal reference sequence selection for genome assembly using minimum description length principle. EURASIP J Bioinform Syst Biol 2012;1:1–11.
20. Wajid B and Serpedin E. Review of general algorithmic features for genome assemblers for next generation sequencers. Genomics Proteomics Bioinformatics 2012;10:58–73.
21. Wajid B and Serpedin E. Supplementary information section: Review of general algorithmic features for genome assemblers for next generation sequencers. 2011. https://sourceforge.net/projects/genobuntu/.
22. Streicher E, Victor T, Van Der Spuy G et al. Spoligotype signatures in the Mycobacterium tuberculosis complex. J Clin Microbiol 2007;45:237–40.
23. Haddad N, Ostyn A, Karoui C et al. Spoligotype diversity of Mycobacterium bovis strains isolated in France from 1979 to 2000. J Clin Microbiol 2001;39:3623–32.
24. Sola C, Filliol I, Gutierrez M et al. Spoligotype database of Mycobacterium tuberculosis: biogeographic distribution of shared types and epidemiologic and phylogenetic perspectives. Emerg Infect Diseases 2001;7:390.
25. Duarte E, Domingos M, Amado A et al. Spoligotype diversity of Mycobacterium bovis and Mycobacterium caprae animal isolates. Vet Microbiol 2008;130:415–21.
26. Nivin B, Driscoll J, Glaser T et al. Use of spoligotype analysis to detect laboratory cross-contamination. Infect Control Hosp Epidemiol 2000;21:525–7.
27. Bradnam KR, Fass JN, Alexandrov A et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2013;2:1–31.
28. Miller J, Koren S and Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010;95:315–27.
29. Wajid B, Serpedin E, Nounou H et al. MIB: a comparative assembly processing pipeline. In: Genomic Signal Processing and Statistics (GENSIPS), 2012 IEEE International Workshop on 2-4 Dec. 2012, Washington, DC. IEEE, 2012, 86–9.
30. Wajid B, Ekti AR, Noor A et al. Supersonic MIB. In: Genomic Signal Processing and Statistics (GENSIPS), 2013 IEEE International Workshop on 17-19 Nov. 2013, Houston, TX. IEEE, 2013, 86–7.
Key Points
• Most of the genome assemblers are based on graph
theory. To ensure that the genome being assembled is
‘true-to-life’, genome assemblers adopt a series of elaborate
steps to simplify the graph structures associated
with contigs.
• An introductory tutorial on how to do genome assembly
is provided with suitable real examples in the
Supplementary Section.
• Genobuntu Package supports pre-assembly tools, genome
assemblers and post-assembly tools as well as
commonly used biological software.
31. Cormen T, Leiserson C, Rivest R et al. Introduction to Algorithms, Vol. 7. Cambridge: MIT Press, 1976, 1162–71.
32. Meader S, Hillier L, Locke D et al. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res 2010;20:675.
33. Alkan C, Sajjadian S and Eichler E. Limitations of next-generation genome sequence assembly. Nat Methods 2010;8:61–5.
34. Hunt M, Kikuchi T, Sanders M et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol 2013;14:R47.
35. White T. Hadoop: the Definitive Guide. Sebastopol: O’Reilly, 2012.
36. Zomaya A. Parallel Computing for Bioinformatics and Computational Biology. New York City: Wiley Online Library, 2006.
37. Talbi E and Zomaya A. Grid Computing for Bioinformatics and Computational Biology, Vol. 1. John Wiley & Sons, 2008.
38. Augen J. Bioinformatics in the Post-genomic Era: Genome, Transcriptome, Proteome, and Information-based Medicine. Boston: Addison-Wesley Professional, 2004.
39. Chen Y. Bioinformatics Technologies. New York: Springer-Verlag Inc, 2005.
40. McKenna A, Hanna M, Banks E et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.
41. Hou H, Zhao F, Zhou L et al. MagicViewer: integrated solution for next-generation sequencing data visualization and genetic variation detection and annotation. Nucleic Acids Res 2010;38:W732–6.
42. De Bono E. Simplicity. New York: Viking, 1998.