Do it yourself guide to genome assembly
Bilal Wajid and Erchin Serpedin
Corresponding author. Bilal Wajid, Department of Electrical and Computer Engineering at Texas A&M University (TAMU), College Station, TX, USA.
Tel.: 001-956-326-0348; Fax: 001-956-326-2439; E-mail:
Bioinformatics skills required for genome sequencing often represent a significant hurdle for many researchers working in
computational biology. This humble effort highlights the significance of genome assembly as a research area, focuses on its
need to remain accurate, provides details about the characteristics of the raw data, examines some key metrics, emphasizes
some tools and draws attention to a generic tutorial with example data that outlines the whole pipeline for next-generation
sequencing. The article concludes by pointing out some major future research problems.
Key words: genome assembly; next-generation sequencing; comparative assembly; de novo assembly; de-Bruijn graphs;
Eulerian path
The art of genome assembly involves taking millions, if not billions, of smaller fragments, called ‘reads’, and assembling them together to form a cohesive pattern, called the sequence. The reads themselves are strings of nucleotides {A, C, G, T}. They vary in length and are specific to the sequencing platform from which they are derived. Some standard sequencing platforms are the 454 GS by Roche; MiSeq, HiSeq and NextSeq by Illumina; and Ion Torrent and Ion Proton by Life Technologies, as listed in Table 1.
This contribution aims to serve as a resource for researchers working on genome assembly via next-generation sequencing, as well as a guide for scientists new to the field. Section I highlights the relation of genome assembly to other key areas within computational biology, with emphasis on its need to report results accurately. Section II discusses raw data, including the Sequence Read Archive (SRA) and the FASTA and FASTQ file formats. It also provides details of some essential software tools and key hardware requirements. Section III provides particulars on how to filter and correct raw data to determine the ‘right set’ of reads for the assembly. Section IV answers the key question of how one can assemble a genome oneself. Section V reviews some essential metrics needed to evaluate the assembly. Finally, Sections VI and VII consider some future goals. To facilitate a better understanding of this research area, the Supplementary Section provides suitable examples with real data that help reinforce the concepts.
Step 1: understanding the need to remain accurate
It is imperative in a naturalistic drawing that the image be as close to reality as possible. Imagine a painter drawing a realistic picture and later asking his student to draw a copy from the original image. If the student in turn requests his friend to make
Bilal Wajid received his B.Sc. Hons, Electrical Engineering degree from University of Engineering & Technology (UET), Lahore, Pakistan, in 2007 and his
M.Sc., Electrical Engineering degree from UET, Lahore, Pakistan, in 2009. He is currently a PhD. student in Department of Electrical and Computer
Engineering at Texas A&M University (TAMU), College Station, TX. He is also teaching as a visiting faculty at Texas A&M International University, Laredo,
TX, 78043. He has taught previously at University of Engineering and Technology (UET), Lahore and UET Kala shah Kaku, TAMU and DUKE University.
Erchin Serpedin (F’13) received the specialization degree in signal processing and transmission of information from Ecole Superieure DElectricite
(SUPELEC), Paris, France, in 1992, the M.Sc. degree from the Georgia Institute of Technology, Atlanta, in 1992, and the Ph.D. degree from the University of
Virginia, Charlottesville, in January 1999. He is currently a professor in the Department of Electrical and Computer Engineering at Texas A&M University,
College Station. He is the author of 2 research monographs, 1 textbook, more than a 100 journal papers and 170 conference papers and has served editor
for a dozen of journals, including IEEE Transactions on Information Theory, IEEE Transactions on Communications, IEEE Transactions on Signal
Processing, Signal Processing (Elsevier), EURASIP Journal on Advances in Signal Processing, Physical Communication (Elsevier) and EURASIP Journal on
Bioinformatics and Systems Biology. He is currently serving as Editor-in-Chief for Eurasip Journal on Bioinformatics and Systems Biology, an online jour-
nal edited by Springer. His research interests include statistical signal processing, information theory, bioinformatics and genomics. He is an IEEE Fellow.
© The Author 2014. Published by Oxford University Press. All rights reserved. For permissions, please email:
Briefings in Functional Genomics, 2014, 1–9
doi: 10.1093/bfgp/elu042
Letter to the Editor
Briefings in Functional Genomics Advance Access published November 11, 2014
at Texas A&M College Station on January 7, 2015 from
another copy from his work, any defect in the original image will only get multiplied, as each drawing is derived from the one that preceded it. The same concept is equally applicable within the genome assembly framework.
A number of research domains in bioinformatics draw conclusions from the sequence itself. A sequence that has not been reported accurately could affect subsequent downstream analyses, which would only multiply any defects in the conclusions that were based on the assembled sequence. Research studies show that sequencing errors do affect the perceived diversity in molecular surveys [1–3], such as gene genealogies [4], metagenomic gene prediction [5] and 16S rRNA-based studies [6]. Therefore, simply assembling all the reads into one contiguous sequence, a contig, is not enough. It is crucial to ensure that the reported sequence indeed resembles what is truly present in the cell. Some common hurdles are low-coverage areas, false-positive read-read alignments, false-negative alignments, poor sequence quality, polymorphisms and repeated regions of the genome.
Step 2: know from where to begin
‘Practice by drawing things large, as if equal in represen-
tation and reality. In small drawings every large weak-
ness is easily hidden; in the large, the smallest weakness
is easily seen.’
—Leon Battista Alberti.
The purpose of this and the next set of sections is to teach the reader how to sketch. The aim is not to immerse the reader completely in the art of genome assembly but rather to provide an outline that one can use to master the area. Just as any masterpiece requires a canvas and a pencil to begin, a complete genome sequencing procedure requires a 64-bit computer running a UNIX-like operating system such as Mac OS X or Linux (e.g. Ubuntu); a minimum of 16 GB of RAM is recommended. To facilitate researchers’ work, the authors of this article have produced an environment necessary to build one’s genome, in the form of ‘Genobuntu: a Genome Assembly Ubuntu Package’ (Genobunu.html). Genobuntu is a software package containing more than 70 software tools and packages oriented toward next-generation sequencing. It supports a wide range of tools, including pre-assembly tools, genome assemblers, post-assembly tools, commonly used biological tools and example script files for different assembly pipelines (
The exercise starts by downloading raw data, principally the read files, from the SRA [7]. Data are present in ‘.sra’ format and can be converted into the necessary FASTA and FASTQ formats using the SRA toolkit (=software); see Figure 1 and Steps 1, 2 and 3 in the Supplementary Section.
The FASTQ file uses four lines to represent the sequence and its quality:
    @SRR123.321 Example length=30
    GATTTGGGGTTCACTGCAGTA
    +SRR123.321 Example length=30
    !"*((((***+))PSG++)(a--?).1***
Line 1: ‘@’ followed by the sequence identifier.
Line 2: the sequence line(s), similar to the FASTA format.
Line 3: ‘+’ followed by the sequence identifier (may be left blank).
Line 4: ASCII encoding of the quality values.
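The four-line layout described above can be read with a few lines of Python. The following is a minimal sketch, not the tutorial's own tooling: the names `FastqRecord` and `parse_fastq` are hypothetical, and it assumes that each record occupies exactly four lines (no line-wrapped sequences).

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # nucleotides
    quality: str     # ASCII-encoded quality values, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no wrapped lines)."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None or header == "":
            return  # end of input
        sequence = next(stripped, "")
        plus = next(stripped, "")
        quality = next(stripped, "")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Illustrative record (made-up data, eight bases and eight quality characters).
example = ["@read1 length=8", "GATTTGGG", "+", "!''*(((("]
records = list(parse_fastq(example))
print(records[0].identifier, records[0].sequence)
```

To read a file, the same function can be given an open file handle, since it only iterates over lines.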
The ASCII characters encode log-probabilities in the form of quality values (Q-values). Q-values are numerical values signifying the quality of each base call and are evaluated separately for each platform [8,9]. For example, for Illumina the formula is as follows [8]:
$$Q_{\text{illumina}} = -10 \log_{10}\left(\frac{P_e}{1 - P_e}\right),$$
where $P_e$ is the probability of identifying a base incorrectly. For Sanger and other platforms, the formula is as follows [8]:
$$Q_{\text{PHRED}} = -10 \log_{10}(P_e).$$
$Q_{\text{PHRED}}$ and $Q_{\text{illumina}}$ can be converted into one another using the relation [8,10]:
$$Q_{\text{illumina}} = 10 \log_{10}\left(10^{Q_{\text{PHRED}}/10} - 1\right).$$
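The relations between the error probability and the two quality scales can be checked numerically. The sketch below follows the definitions in [8], where the PHRED score is -10 log10 of the error probability and the Illumina (Solexa-style) score is -10 log10 of the error odds. The function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error(q_phred: float) -> float:
    """Error probability P_e implied by a PHRED score."""
    return 10 ** (-q_phred / 10)


def error_to_phred(p_e: float) -> float:
    """PHRED score: -10 * log10(P_e)."""
    return -10 * math.log10(p_e)


def error_to_q_illumina(p_e: float) -> float:
    """Odds-based score: -10 * log10(P_e / (1 - P_e)); requires 0 < P_e < 1."""
    return -10 * math.log10(p_e / (1 - p_e))


def phred_to_q_illumina(q_phred: float) -> float:
    """Direct conversion: 10 * log10(10**(Q_PHRED / 10) - 1); requires Q_PHRED > 0."""
    return 10 * math.log10(10 ** (q_phred / 10) - 1)


# Both routes from PHRED 20 to the odds-based score should agree.
via_probability = error_to_q_illumina(phred_to_error(20))
direct = phred_to_q_illumina(20)
print(round(via_probability, 3), round(direct, 3))
```

The two scales nearly coincide for high-quality bases and diverge only as the error probability grows, which is why the distinction matters mainly for low-quality calls.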
PHRED scores are the standard for representing base quality in sequencing, as shown in Table 2. The use of ASCII characters not only encodes log-probabilities by rounding them off to the nearest integer values but is also convenient from a computational perspective, as can be inferred from Table 3. In terms of ASCII-encoded Q-values, the following characters depict increasing order of quality (ASCII), from left to right (
Table 1. Sequencing platforms
Platform | Company
Ion Torrent and Ion Proton | Life Technologies
454 GS | Roche
NanoTag sequencer | Genia
GnuBIO platform | GnuBIO system
PACBIO RS II | Pacific Biosciences
MinION and GridION | Oxford Nanopore Technologies
MiSeq, HiSeq and NextSeq | Illumina
Sequencing By Xpansion (SBX) | Strato Genomics Technology
Optipore sequencing | Noblegen Biosciences
At the time of submission of this article, both Lasergen and Nabsys were working on new platforms.
Step 3: filtering/correcting low-quality reads
‘There are only 3 colors, 10 digits, and 7 notes; it is what
we do with them that’s important’
—Jim Rohn.
As one must search to find brilliant colors, one must investigate to find the right set of reads if one is to pursue the correct pattern, the correct sequence. Looking at all the reads does not help. One has to filter out the best set of reads, trim low-quality ends and collapse identical reads. A simplistic way of doing so is to remove all reads that contain the base N. An improved approach is to remove low-quality reads. Assuming that each base is independent of all others, the overall quality of a read of length $r$ is $P_{\text{qual}} = \prod_{i=1}^{r} (1 - P_{i,e})$, where $P_{i,e}$ is the $P_e$ of the $i$th base and can be derived from the Q-value shown above. If $P_{\text{qual}} < q$, where $q$ is a user-defined parameter, the read may be removed from further processing [11–15]. An enhanced approach is to match reads against known ribosomal and heterochromatin DNA and to remove those that match, as the assembly could be improved by ignoring these repetitive DNA elements [16]. To go further, one may even try to correct low-quality reads. The authors of this article have provided a tutorial on how to filter low-quality reads, with suitable examples, in the Supplementary Section. The interested reader is recommended to consult Steps 4, 5 and 6 in the Supplementary Section.
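The filtering rules described in this step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used in the Supplementary Section: the function names are hypothetical, quality strings are assumed to use the Phred+33 encoding, and trimming of low-quality ends and read correction are left out. The read quality is taken as the product of the per-base probabilities of a correct call.

```python
from math import prod  # available from Python 3.8
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (sequence, ASCII-encoded quality string)


def p_qual(quality: str, offset: int = 33) -> float:
    """Probability that every base is correct, assuming independent bases."""
    return prod(1 - 10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def filter_reads(reads: Iterable[Read], q: float, offset: int = 33) -> List[Read]:
    """Drop reads containing N, reads with P_qual < q, and duplicate sequences."""
    kept: List[Read] = []
    seen = set()
    for sequence, quality in reads:
        if "N" in sequence.upper():
            continue  # simplistic filter: ambiguous base present
        if p_qual(quality, offset) < q:
            continue  # low overall read quality
        if sequence in seen:
            continue  # collapse identical reads, keeping the first copy
        seen.add(sequence)
        kept.append((sequence, quality))
    return kept


# Made-up reads: 'I' encodes Q40 and '!' encodes Q0 under Phred+33.
reads = [("ACGT", "IIII"), ("ACGN", "IIII"), ("ACGT", "IIII"), ("TTTT", "!!!!")]
print(filter_reads(reads, q=0.9))
```

The threshold `q` plays the role of the user-defined parameter in the text; a stricter value discards more reads in exchange for higher confidence in those that remain.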
Step 4: assembling the sequence
‘Painting is damned difficult - you always think you’ve
got it, but you haven’t.’
—Paul Cezanne.
Holding the brush in hand and looking at the canvas, the least one can do is have a picture in one’s mind before making the strokes. One must either opt for abstract art, where one’s emotions run wild and dictate what one paints, or one may choose to paint a scenery, in which case looking at the scene helps a lot.
Genome assemblers may be broadly divided into reference-assisted assemblers (comparative assemblers) and de novo assemblers. Reference-assisted assembly is more like painting a scenery. The landscapes in the painting may look a little different, and the terrains need not be the same, but having a scenery in front of you still makes the job relatively simpler. It is common to consider the assistance of a reference sequence for the assembly of a target genome, even though the target
Figure 1. SRA, Read Data: ERR028217 is the Run number, whereas 12521 is the read number. As this is an example of paired data, the read has .1 extension (12521.1)
whereas, its reverse complement has .3 extension (12521.3). Furthermore, underneath the read are each of its individual bases with their associated Q-values.
Table 2. Phred quality scores are logarithmically linked to error probabilities (
Phred quality score | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10 000 | 99.99%
50 | 1 in 100 000 | 99.999%
60 | 1 in 1 000 000 | 99.9999%
Table 3. Evolution of Quality scores and their corresponding ASCII
Format | Quality scores | ASCII range | ASCII characters
Sanger | Phred: 0 to 93 | 33 to 126 | !"# ... {|}~
Illumina (1.0) | -5 to 62 | 59 to 126 | ;<= ... {|}~
Illumina (1.3) | Phred: 0 to 62 | 64 to 126 | @AB ... {|}~
Illumina (1.8) | Phred+33: 33 to 126 | 33 to 126 | !"# ... {|}~
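The ASCII ranges in Table 3 translate directly into a decoding rule: subtract the format's offset from each character code. The sketch below is illustrative only; `decode_quality` and `guess_offset` are hypothetical names, and the guess is a heuristic based on the ranges above (character codes below 59 occur only in the 33-offset formats), so it cannot distinguish the formats when a file contains only high-quality characters.

```python
from typing import Iterable, List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string into integer Q-values."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality_strings: Iterable[str]) -> int:
    """Heuristic: any character below ASCII 59 implies an offset of 33.

    Otherwise 64 is returned as a guess; the input may still be a
    33-offset file that happens to contain only high-quality bases.
    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(ch) for quality in quality_strings for ch in quality)
    return 33 if lowest < 59 else 64


print(decode_quality("!5I"))             # Sanger / Illumina 1.8 style
print(decode_quality("@T", offset=64))   # Illumina 1.3 style
print(guess_offset(["!5I", "IIII"]))
```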
sequence may have a lot of structural variations, copy number variations and single-nucleotide polymorphisms relative to the reference. Imagine assembling billions of reads in the right order without a reference and, in addition, ensuring that the quality of the assembly is good. Should one be compelled to assemble a sequence without a reference, such a task is referred to as de novo assembly.
‘No one is an artist unless he carries his picture in his
head before painting it, and is sure of his method and
—Claude Monet
Carrying a picture in one’s mind, or using a reference sequence for assembly, is usually preceded by the question of how one can determine which reference sequence is optimal. The simplest approach is to count the number of reads that align to each candidate reference sequence; the optimal reference is the one to which the most reads align. A more sophisticated approach is to use the minimum description length (MDL) principle. The MDL framework takes into consideration both the length of the reference and the number of reads that align to the reference to evaluate a ‘code length’. The optimal reference is the one that has the smallest code length [10,17–21]. Yet another, more specific technique, applied to Tuberculosis strains, is ‘spoligotyping’. Spoligotyping is like fingerprinting. Similar strains of Tuberculosis share the same repeat units, where the repeat units act like fingerprints. Therefore, an optimal reference should have the same fingerprint as the genome being assembled [22–26].
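The simplest selection rule above, counting the reads that align to each candidate, can be illustrated with a toy example. The sketch below is an assumption-laden stand-in for a real aligner: a read is counted as aligned only if it occurs exactly in the reference or in its reverse complement, and all names (`count_aligned`, `pick_reference`) and sequences are made up. It does not implement the MDL code length or spoligotyping.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(_COMPLEMENT)[::-1]


def count_aligned(reads: List[str], reference: str) -> int:
    """Count reads found exactly on either strand (toy substitute for alignment)."""
    other_strand = reverse_complement(reference)
    return sum(1 for read in reads if read in reference or read in other_strand)


def pick_reference(reads: List[str], references: Dict[str, str]) -> str:
    """Return the name of the candidate reference with the most aligned reads."""
    return max(references, key=lambda name: count_aligned(reads, references[name]))


candidates = {"refA": "ACGTACGTTTGA", "refB": "GGGGCCCCAAAA"}
reads = ["ACGT", "GTTT", "CCCC"]
print(pick_reference(reads, candidates))
```

In practice the counts would come from a read aligner that tolerates mismatches, and ties between candidates would need an explicit rule.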
As far as the various assemblers are concerned, the aim of this contribution is not to compare different assemblers, or to search for the best or the fastest one; rather, the aim is to present some common assembly frameworks (also called pipelines). Some reviews that explain and compare various genome assemblers may be found in [20,27,28]. Once the optimal reference has been selected, the reference and the set of reads are presented to a comparative assembly pipeline. The comparative assembly pipeline consists of three components: ‘alignment’, ‘layout’ and ‘consensus’, as described graphically in Figure 2. The word ‘pipeline’ indicates how a set of interconnected methods come together to transform the raw data into the novel genome. The authors have provided the example shell files of two such comparative assemblers, Maq and MIB. In addition, Step 9 in the ‘Introductory Tutorial to Genome Assembly’ explains how to do comparative assembly using MIB with the help of appropriate examples. It is important to note that reference-assisted assembly works only in the presence of a suitable reference sequence. The reference eases the job of the assembler, as the relative placement of the reads and the contigs is already established by the reference. Given a closely matched reference sequence, the task is further reduced to simply identifying the variations between the reference and the target sequence and incorporating the differences in the target sequence [29,30]. However, determining translocations is not easy and requires careful analysis, as the order of some of the contigs may be different in the target sequence as opposed to the reference sequence. In the absence of a reference sequence, one has to move toward de novo assembly. According to GOLD (Genomes OnLine Database), as of 31 December 2013, 277 archaeal, 11 775 bacterial and 312 eukaryal genomes have been sequenced, which still leaves many unique genomes waiting to be sequenced. There are many elegant solutions for de novo assembly, as illustrated in Figure 3.
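The three components of the comparative pipeline (alignment, layout and consensus) can be mimicked on a toy scale. The following is a minimal sketch, not the method used by Maq or MIB: reads are placed at the ungapped position with the fewest mismatches, the layout is a pile of bases per reference position, and the consensus is the majority base. Positions with no coverage are reported as ‘N’, which is an assumption made here for illustration. All names and sequences are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def best_position(read: str, reference: str, max_mismatches: int = 1) -> Optional[int]:
    """Alignment: leftmost ungapped position with the fewest mismatches, if any."""
    best = None  # (start, mismatches)
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return None if best is None else best[0]


def comparative_consensus(reads: List[str], reference: str, max_mismatches: int = 1) -> str:
    """Layout and consensus: pile up aligned bases and take the majority call."""
    piles = [Counter() for _ in reference]
    for read in reads:
        start = best_position(read, reference, max_mismatches)
        if start is None:
            continue  # read does not align within the mismatch budget
        for offset, base in enumerate(read):
            piles[start + offset][base] += 1
    return "".join(pile.most_common(1)[0][0] if pile else "N" for pile in piles)


# The target differs from the reference by a single base (A -> T at index 4).
reference = "ACGTAGCTTCAG"
reads = ["ACGTTG", "GTTGCT", "TTGCTT", "GCTTCAG"]
print(comparative_consensus(reads, reference))
```

Because every read in this example carries the variant base, the majority vote replaces the reference base with the one observed in the target.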
For instance, a greedy approach works by taking the locally optimal choice at each stage in the hope of finding the global optimum [31]. It does so by taking an unassembled read and extending it using the best overlapping read on its 3′ end. It continues until no overlapping reads are found, at which point it repeats the same process in the other direction by extending the contig at its 5′ end [20].
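The greedy extension just described can be sketched in a few lines. This toy version is not the algorithm of any particular assembler: it takes the first read as the seed, treats the ‘best’ overlapping read as the one with the longest exact suffix-prefix overlap, ignores sequencing errors and reverse complements, and uses hypothetical names throughout. It expects a non-empty list of reads.

```python
from typing import List, Tuple


def overlap(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_contig(reads: List[str], min_overlap: int = 3) -> Tuple[str, List[str]]:
    """Extend a seed read at its 3' end, then at its 5' end; return contig and leftovers."""
    unused = list(reads)
    contig = unused.pop(0)  # seed read
    # Extend toward the 3' end until no read overlaps.
    while unused:
        best = max(unused, key=lambda read: overlap(contig, read, min_overlap))
        size = overlap(contig, best, min_overlap)
        if size == 0:
            break
        contig += best[size:]
        unused.remove(best)
    # Repeat in the other direction, extending the 5' end.
    while unused:
        best = max(unused, key=lambda read: overlap(read, contig, min_overlap))
        size = overlap(best, contig, min_overlap)
        if size == 0:
            break
        contig = best[:len(best) - size] + contig
        unused.remove(best)
    return contig, unused


contig, leftover = greedy_contig(["GTACGTT", "CGTTAGC", "ATGCGTA"])
print(contig, leftover)
```

Being greedy, the procedure can commit to a wrong overlap inside a repeat, which is one reason the graph-based approaches discussed next are preferred for larger genomes.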
As one may see in Figure 3, most of the algorithms use the ‘overlap-layout-consensus’ paradigm. The paradigm starts by forming an ‘overlap’ structure by joining all the reads with their respective overlapping reads. Next, a ‘layout’ is established by searching for a single path from the beginning, the root, to the end, the leaf, that traverses all the reads. This is the point where one encounters most challenges. In graph-theoretic terms, each read represents a node and an overlap is depicted by an edge; therefore, ideally there should be only one graph, where a single path traversal from the root to the leaf represents the entire sequence. In reality, however, one obtains not one but multiple disjoint graphs, where each graph depicts a contig. Furthermore, each graph is plagued with many branches and loops. Branches that are small may be discarded, whereas longer branches compete with one another to serve as representatives for the contig. Loops portray repeat regions, so one must decide how many times the repeats
Figure 2. Comparative assembly: Reads are aligned to a reference sequence. The alignment process may allow one or more mismatches between each individual read
and the reference sequence. The alignment of the reads generates a layout. Based on majority base call, the layout produces a consensus sequence.
should be placed in the final assembly. Nevertheless, assemblers do spend a significant amount of time resolving these potential hazards in multiple ways, as depicted in Figure 4 [20,28,
‘From now on, I’ll connect the dots my own way.’
—Bill Watterson.
Having completed the contigs, one must travel further to ‘connect the dots’, a process called scaffolding. Scaffolding aims not only to connect the contigs and so elongate them but also to order them. In other words, scaffolding defines which contig comes first and which contig comes next in relation to the whole sequence. The process uses forward and reverse reads to link distinct contigs [20,21].
The authors have provided example shell script files for the assembly pipelines of SHARCGS, QSRA, IDBA, SSAKE, VCAKE, ABySS, Velvet and MAQ. In addition, the ‘Introductory Tutorial on Genome Assembly’ contains an example of de novo assembly using VCAKE, Velvet and IDBA and scaffolding using SSPACE; see Steps 7, 8, 9, 10 and 11 in the Supplementary Section.
Step 5: evaluating an assembly
Similar to any painting that may be both scaled and evaluated
based on some objective criteria like time, effort, theme, color
scheme, proportion and detail (
Evaluate-Paintings), one may find that evaluating an assembly
requires careful analysis as well. Table 4 illustrates some of the
commonly used assembly metrics/statistics, whereas Table 5
Figure 3. Common assembly algorithms grouped in accordance to their working schemes.
serves as an example of how to compare different assemblies using these quality statistics. Furthermore, Steps 12 and 13 in the Supplementary Section also highlight suitable examples. For a more thorough and in-depth discussion, the reader is directed to reference [27].
‘Without continual growth and progress, such words as im-
provement, achievement, and success have no meaning.’
—Benjamin Franklin.
Genome assembly is evolving. As it matures, the sheer need and utility of this research area is forcing it to encompass the critical aspects of ‘reproducibility’, ‘accessibility’, ‘transparency’, ‘scalability’ and ‘simplicity’. More and more genomes are being published in which the authors provide access to the raw data and details of the assembler, with the settings employed to derive those sequences. Scalability of the algorithms comes into question when one tries to sequence eukaryotes. A number of algorithms are being parallelized, with careful attention being given to Hadoop and MapReduce architectures [35–39]. GATK is a MapReduce framework that
Figure 4. Graph simplification techniques: (A-1) Ambiguous paths; (A-2) Pulling apart operation: the resultant graph is divided into four possible paths. (B-1) Simplistic path; (B-2) Removing intermediate nodes: nodes that have indegree = outdegree = 1 are collapsed to form one giant node, also referred to as a ‘unitig’. (C-1) Unnecessary edges; (C-2) Removing edges: an edge between two nodes is removed if there is an intermediate node between them that connects them simplistically. (D-1) Loop; (D-2) Disambiguation: the loop edge is unrolled and integrated in the continuous edge from left to right. (E-1) Shorter paths are shown encircled; (E-2) Removing tips: a tip is defined as a chain of nodes that is disconnected at one end. Tips are removed if they are shorter than t, where t is a user-defined parameter. Furthermore, if there is a longer/common path, it will also trigger a tip’s removal.
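The collapsing of intermediate nodes in panel B of Figure 4 can be illustrated by listing the maximal non-branching paths of a directed graph. The sketch below is one common way of doing this and is not taken from any specific assembler: the graph is a plain dictionary from node to successors, each returned path starts and ends at a node that is not 1-in-1-out (so branch nodes are shared between paths), and isolated cycles are ignored. The names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def non_branching_paths(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Collapse chains of nodes with indegree = outdegree = 1 into single paths."""
    indegree = defaultdict(int)
    nodes = set(graph)
    for successors in graph.values():
        for successor in successors:
            indegree[successor] += 1
            nodes.add(successor)

    def is_intermediate(node: str) -> bool:
        return indegree[node] == 1 and len(graph.get(node, [])) == 1

    paths = []
    for node in sorted(nodes):
        if is_intermediate(node):
            continue  # intermediate nodes are absorbed into a path started elsewhere
        for successor in graph.get(node, []):
            path = [node, successor]
            while is_intermediate(path[-1]):
                path.append(graph[path[-1]][0])
            paths.append(path)
    return paths


# A -> B -> C, then C branches to D and E.
graph = {"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": [], "E": []}
print(non_branching_paths(graph))
```

Each returned path corresponds to one ‘unitig’ candidate; tip removal and loop disambiguation (panels D and E) would be separate passes over the simplified graph.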
Table 4. Some common statistics used in evaluating the quality of an assembly
Metric | Description
N50 | A length-weighted summary of contig size: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches 50% of the total assembled size. Suppose an assembly ‘A’ has six contigs of {13, 6, 5, 3, 2, 1} Mb, arranged in decreasing order, for a total assembly size of 30 Mb. Adding the first two, {13, 6}, gives 19 Mb, which exceeds 50% of the total assembled size (15 Mb). The N50 is therefore 6 Mb, the length of the contig that crosses the 50% threshold (↑).
NG50 | The length of the scaffold at which 50% of the genome length is covered. Here the length of the genome is either known or predicted [27] (↑).
Accuracy | The genome is considered accurate if 90% of the bases have at least 5× read coverage (↑).
Continuity | Similar to N50, there are other metrics such as N75 and N90, where one identifies the length of the scaffold crossing the 75% and 90% thresholds of the total assembled size of the genome. An assembly is considered to have continuity provided its N90 > 5 kb.
Choppiness | The average contig length should be > 5000 bases (5 kb). Otherwise, the assembly would be considered to have too many chops or pieces and would need to be redrafted to contain fewer segments (↑).
Number of genes | The assembly that identifies most of the known genes in the organism is considered the better assembly. See [27] for details of highly conserved core eukaryotic genes (↑).
Number of gaps in the assembly | REAPR, a software tool, uses paired-read information to find errors in assemblies by aligning a subset of reads from short-insert libraries onto the scaffolds. These alignments help determine scaffolding errors [34] (↓).
Validity | What fraction of the assembly (set of scaffold sequences) can be validated by the reference sequence [27]. If the scaffolds provided by the assembly cover > 90% of the actual genome, then the draft assembly is considered complete (↑).
Scaffold statistics | Longest scaffold (↑): typically, the greater the length of the largest contig, the better the assembly. Similar is the case of the shortest scaffold (↑). Number of scaffolds (↓): typically, an assembly that has fewer scaffolds is better than one that has more. For instance, the best assembly would be a continuous genome with no segments, which would therefore have only one scaffold. Number of scaffolds > X nt (↑); percentage of scaffolds > X nt (↑), where X is a user-defined length; NG50 scaffold length (↑); LG50 scaffold count (↓): how many scaffolds are counted in reaching the NG50 threshold. Total scaffold length as a percentage of estimated genome size (the closer to 100% the better). All the above depict the quality of the assembly.
Contig statistics | Longest contig (↑); shortest contig (↑); total size of contigs (↑); number of contigs > X nt (↑) (‘nt’ stands for non-redundant nucleotide); percentage of contigs > X nt, where X is a user-defined length; NG50 contig length (↑); LG50 contig count (↓). Percentage of assembly in scaffolded contigs (↑): contigs may be joined into scaffolds or remain unscaffolded. This metric indicates how much of the assembly is represented by scaffolded contigs. The opposite would be percentage of assembly in unscaffolded contigs (↓).
For each statistic, ↑ indicates that higher is better and ↓ implies that less is better.
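The N50 worked example in Table 4 can be reproduced with a short function. This is a minimal sketch using the common convention that the threshold is reached when the running total is greater than or equal to the target fraction; the function name `nx` and its `total` argument (used to obtain NG50 from a known or estimated genome size) are hypothetical, and the genome size of 40 Mb in the usage lines is made up.

```python
from typing import Iterable, Optional


def nx(lengths: Iterable[float], fraction: float = 0.5,
       total: Optional[float] = None) -> Optional[float]:
    """Length of the contig at which the running total reaches `fraction` of the total.

    With `total=None` the assembled size is used (N50, N75, N90, ...).
    Passing a genome size as `total` gives the NG-style statistic instead.
    Returns None if the contigs never reach the target.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * (sum(ordered) if total is None else total)
    running = 0.0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return None


contigs_mb = [13, 6, 5, 3, 2, 1]      # the example from Table 4, in Mb
print(nx(contigs_mb))                 # N50
print(nx(contigs_mb, 0.75))           # N75
print(nx(contigs_mb, 0.5, total=40))  # NG50 for an assumed 40 Mb genome
```

Because N50 depends only on the assembled size, an assembly with redundant contigs can report a flattering value; comparing it with NG50 gives a quick sanity check.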
functions in a parallel fashion on one system, but does not work
in parallel on multiple systems [40,41].
‘Simplicity is the ultimate sophistication.’
—Leonardo da Vinci.
In terms of simplicity, considerable work is needed. The success of the Windows operating system is one example of how software simplicity helps. Windows gave the layman an opportunity to perform nontrivial tasks with the click of a mouse, which gave it a monopoly in the market. Therefore, for any product to be successful, it should be simple to operate [42]. Genome assemblers require a decent set of skills not only to operate but even to install. This is because many of them assume a considerable number of dependencies that need to be installed on the system beforehand. Genobuntu enables simplicity in this regard, as it helps to install and learn many of the common tools used in research (
Conclusion: the two brush strokes
‘The Chinese use two brush strokes to write the word
‘crisis.’ One brush stroke stands for danger; the other
for opportunity. In a crisis, be aware of the danger—but
recognize the opportunity.’
—John F. Kennedy.
This contribution discussed the art of genome assembly from a qualitative standpoint, detailing the significance of this research area with attention to ensuring that the genome being assembled is ‘true-to-life’. The article highlighted the FASTA and FASTQ file formats and showed how to use Q-values to filter and correct low-quality reads. Necessary tools, like Genobuntu, with example assembly pipelines in the form of script files and assembly tutorials, were introduced. Useful metrics that help determine the quality of one’s work were elaborated, along with considerations on the future of genome assembly from the perspective of ‘reproducibility’, ‘accessibility’, ‘transparency’, ‘scalability’ and ‘simplicity’. Ultimately, one must recognize that the opportunities in this research area are immense. We are still far from having hardware and software support that would help turn the information derived from genome assembly into efficient therapies. However, as most of the genomes on the planet are yet to be sequenced, this research area will remain fresh for many years to come.
Table 5. Reads were derived from the run ‘SRR001657’ from the SRA
S. No | Assembly metrics | VCAKE | QSRA | IDBA | MIB
 | Ver. | ver. 1 | ver. 1 | ver. 1 | ver. 1
1 | No. of Contigs ↓ | 156 834 | 76 899 | 34 775 | 1
2 | Length of Largest Contig ↑ | 4195 | 5832 | 5285 | 6 261 358
3 | N50 ↑ | 92 | 132 | 176 | 6 261 358
4 | N75 ↑ | 297 | 66 | 112 | 6 261 358
5 | N90 ↑ | 155 | 44 | 83 | 6 261 358
6 | NG50 ↑ | 227 | 159 | 148 | 6 261 358
7 | NG75 ↑ | 140 | 89 | 84 | 6 261 358
8 | Contigs N50 ↑ | 26 985 | 14 907 | 8860 | 1
9 | Contigs > 200 bp ↑ | 10 428 | 7550 | 6904 | 1
10 | Mean ↑ | 76.312 | 98.71 | 157.81 | 6 261 358
11 | Median ↑ | 45 | 61 | 118 | 6 261 358
12 | Sum of the contig lengths (should be as close as possible to the length of the target sequence) | 11 968 268 | 7 590 560 | 5 487 859 | 6 261 358
13 | Coverage ↑ | 6 | 10 | 14 | 13
14 | Runtime (hours) ↓ | 2 | 0.25 | 0.1 | 31
15 | Memory used (GB) ↓ | 1.3 | 1.6 | 0.5 | 8
16 | Parameters used | -e 20, -k 33, -o 34, -n 17, -t 5, -m 16, -v 3 | -u 17, -k 33, -o 34, -l 16, -t 3, -c 0.6 | -mink 17, -maxk 33, -step 1, -min_count 2, -min_contig 34 | s1=1000, s2=100000, Match 4, Mismatch -5, Gap 0
A ↑ indicates that higher is better and a ↓ implies that less is better.
The assembly was conducted using four assemblers and compared using standard metrics computed by the program ‘assembly statistics’. Pseudomonas aeruginosa UCBPP-PA14 was used as the reference sequence by MIB for the assembly of PAb1. To facilitate reproducibility of results, the version number of each assembler, along with the parameters used for the assembly, has also been provided. In the assembly above, VCAKE, QSRA and IDBA are de novo assemblers, whereas MIB is a comparative assembler. Ideally, one would prefer a single contig that represents the entire target sequence; however, this is rarely the case. MIB’s output does show one giant contig with good coverage (13) and a length close to the size of the target sequence (6.7 million bases in length); however, this is only possible in the presence of a reference sequence, in this case UCBPP-PA14, and this is rare. Furthermore, in the absence of a reference sequence (which is common), one has to resort to de novo assemblies, as comparative assemblers simply do not work. Comparing only the de novo assemblies, IDBA has the highest coverage (14), the least number of contigs (34 775), the best N50 (176) and the best mean contig length (157.8). However, collectively the contigs reported by IDBA fail to report about a million bases of the target sequence. On the other hand, among de novo assemblies, VCAKE has the highest number of contigs larger than 200 base pairs (bp) (10 428) and the highest NG50 (227) and NG75 (140), but for some reason all its contigs collectively report almost twice the length of the target sequence (about 12 million bases). This clearly shows a huge degree of overlap among the contigs, or perhaps some of the contigs reported by VCAKE are simply redundant. Nevertheless, comparing different assemblies is a difficult task, one that requires careful analysis, and an area where one may have to resort to using inputs from all the assemblies to report a good target sequence.
Supplementary Data
Supplementary data are available online at http://bib.
B.W. would like to extend special thanks to his mother for providing the necessary inspiration to write and fund the article and to his father who checked and provided suitable suggestions for the article.
Addendum: The ‘Do it Yourself Guide to Genome Assembly—
Supplementary Section’ is available with this article.
Genobuntu is available at (
bilalwajidabbas/Genobuntu.html) and (
This paper has been partially funded by the Qatar National
Research Fund-National Priorities Research Program grant 09-
References
1. Dickie IA. Insidious effects of sequencing errors on perceived diversity in molecular surveys. N Phytol 2010;
2. Medinger R, Nolte V, Pandey RV et al. Diversity in a hidden world: potential and limitation of next-generation sequencing for surveys of molecular diversity of eukaryotic microorganisms. Mol Ecol 2010;19:32–40.
3. Kunin V, Engelbrektson A, Ochman H et al. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 2010;
4. Clark AG and Whittam TS. Sequencing errors and molecular evolutionary analysis. Mol Biol Evol 1992;9:744–52.
5. Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 2009;10:520.
6. Schloss PD, Gevers D and Westcott SL. Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PloS One 2011;6:e27310.
7. Leinonen R, Sugawara H and Shumway M. The sequence read archive. Nucleic Acids Res 2011;39:D19–21.
8. Cock P, Fields C, Goto N et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010;38:1767–71.
9. Deorowicz S and Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011;27:860–2.
10. Wajid B, Nounou M, Nounou H et al. Gibbs-beca: Gibbs sampling and Bayesian estimation for comparative assembly. MIC-BEN 2013;3:1.
11. Patel R and Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PloS One
12. Yuan B. Mapping Next Generation Sequence Reads. 2010. http://
13. Mane S, Modise T and Sobral B. Analysis of high-throughput sequencing data. Methods Mol Biol 2011;678:1–11.
14. Hannon G. FASTX-Toolkit. 2010.
15. Goecks J, Nekrutenko A, Taylor J et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86.
16. Myers E, Sutton G, Delcher A et al. A whole-genome assembly of Drosophila. Science 2000;287:2196.
17. Wajid B and Serpedin E. Minimum description length based selection of reference sequences for comparative assemblers. GENSIPS 2011:230–3.
18. Wajid B, Aramayo R and Serpedin E. Exploring minimum description length and probabilistic distributions of the reference sequences for comparative assembly of genomes. Proceedings of the International Conference GSP, 2011.
19. Wajid B, Serpedin E, Nounou M et al. Optimal reference sequence selection for genome assembly using minimum description length principle. EURASIP J Bioinform Syst Biol
20. Wajid B and Serpedin E. Review of general algorithmic features for genome assemblers for next generation sequencers. Genomics Proteomics Bioinformatics 2012;10:58–73.
21. Wajid B and Serpedin E. Supplementary information section: Review of general algorithmic features for genome assemblers for next generation sequencers. 2011. https://
22. Streicher E, Victor T, Van Der Spuy G et al. Spoligotype signatures in the Mycobacterium tuberculosis complex. J Clin Microbiol 2007;45:237–40.
23. Haddad N, Ostyn A, Karoui C et al. Spoligotype diversity of Mycobacterium bovis strains isolated in France from 1979 to 2000. J Clin Microbiol 2001;39:3623–32.
24. Sola C, Filliol I, Gutierrez M et al. Spoligotype database of Mycobacterium tuberculosis: biogeographic distribution of shared types and epidemiologic and phylogenetic perspectives. Emerg Infect Diseases 2001;7:390.
25. Duarte E, Domingos M, Amado A et al. Spoligotype diversity of Mycobacterium bovis and Mycobacterium caprae animal isolates. Vet Microbiol 2008;130:415–21.
26. Nivin B, Driscoll J, Glaser T et al. Use of spoligotype analysis to detect laboratory cross-contamination. Infect Control Hosp Epidemiol 2000;21:525–7.
27. Bradnam KR, Fass JN, Alexandrov A et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2013;2:1–31.
28. Miller J, Koren S and Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010;95:315–27.
29. Wajid B, Serpedin E, Nounou H et al. MiB: a comparative assembly processing pipeline. In: Genomic Signal Processing and Statistics (GENSIPS), 2012 IEEE International Workshop on 2-4 Dec. 2012, Washington, DC. IEEE, 2012, 86–9.
30. Wajid B, Ekti AR, Noor A et al. Supersonic MiB. In: Genomic Signal Processing and Statistics (GENSIPS), 2013 IEEE International Workshop on 17-19 Nov. 2013, Houston, TX. IEEE, 2013, 86–7.
Key Points
Most genome assemblers are based on graph theory. To ensure that the genome being assembled is 'true-to-life', genome assemblers adopt a series of elaborate steps to simplify the graph structures associated with contigs.
An introductory tutorial on how to do genome assembly is provided with suitable real examples in the Supplementary Section.
The Genobuntu package supports pre-assembly tools, genome assemblers and post-assembly tools, as well as commonly used biological software.
31. Cormen T, Leiserson C, Rivest R et al. Introduction to Algorithms, Vol. 7. Cambridge: MIT Press, 1976, 1162–71.
32. Meader S, Hillier L, Locke D et al. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res 2010;20:675.
33. Alkan C, Sajjadian S and Eichler E. Limitations of next-generation genome sequence assembly. Nat Methods 2010;8:61–5.
34. Hunt M, Kikuchi T, Sanders M et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol 2013;14:R47.
35. White T. Hadoop: the Definitive Guide. Sebastopol: O’Reilly, 2012.
36. Zomaya A. Parallel Computing for Bioinformatics and Computational Biology. New York City: Wiley Online Library, 2006.
37. Talbi E and Zomaya A. Grid Computing for Bioinformatics and Computational Biology, Vol. 1. John Wiley & Sons, 2008.
38. Augen J. Bioinformatics in the Post-genomic Era: Genome, Transcriptome, Proteome, and Information-based Medicine. Boston: Addison-Wesley Professional, 2004.
39. Chen Y. Bioinformatics Technologies. New York: Springer-Verlag Inc, 2005.
40. McKenna A, Hanna M, Banks E et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;
41. Hou H, Zhao F, Zhou L et al. MagicViewer: integrated solution for next-generation sequencing data visualization and genetic variation detection and annotation. Nucleic Acids Res
42. De Bono E. Simplicity. New York: Viking, 1998.