All content in this area was uploaded by Yongchao Liu on Nov 04, 2016
OpenGraphAssembly: abstract, modularize and parallelize
fundamental building blocks for graph-based genome assembly∗
Yongchao Liu
School of Computational Science & Engineering
Georgia Institute of Technology
Email: yliu@cc.gatech.edu
URL: http://cc.gatech.edu/~yliu
Abstract
This white paper aims to build open graph fundamental building block (GFBB) sub-
programs for de novo assembly and other related applications in Bioinformatics. All these
subprograms will form an OpenGraphAssembly platform associated with open-source and
commercial licenses (dual-licensing mode recommended). We will investigate and abstract
the common low-level primitive operations for graph-based de novo genome assemblers, and
further modularize and parallelize them on high-performance computing (HPC) architectures
such as accelerators and clusters. By providing a standard set of application programming
interfaces (APIs), along with in-depth theoretical analysis and highly optimized portable
code, it is expected that OpenGraphAssembly would facilitate the advancement in the field
of graph-based de novo assembly, and further promote the standardization, and even
commercialization, of various graph algorithms. Contributions to this research effort are welcome!
OpenGraphAssembly
Low-cost, high-throughput next-generation sequencing (NGS) technologies have significantly
altered the landscape of whole-genome sequencing (WGS), and provide an opportunity for
scientists to initiate WGS projects for almost any organism, including those whose genomes span
billions of base pairs, e.g. human [1] and white spruce [2]. Basically, such technologies redundantly
sample a genome: they first fragment it and then extract sequence reads from the genomic
fragments in a single-end or paired-end fashion, so that the genome is finally represented as a
collection of sequence reads with a certain (usually high) coverage. De
novo genome assembly aims to reconstruct the original genomes (of known or unknown species)
from the collection of sequence reads, but faces many challenges such as (i) short read
lengths, (ii) sequencing errors, (iii) the large number of reads, (iv) long runtime, and (v) large
memory requirements. With the ever-increasing throughput and ever-decreasing
cost of NGS technologies, we are entering the era of big sequencing data, and
de novo assembly is therefore expected to become more challenging. For instance, the newest Illumina
HiSeq X Ten system [3] is able to sequence over 18,000 human genomes in a single year at a price
of $1,000 per genome, breaking the thousand-dollar human genome barrier.
Modern genome assemblers are built on two major graph models: the overlap graph and the de Bruijn graph.
The overlap graph follows the well-known overlap-layout-consensus (OLC) paradigm, which starts
by identifying all pairs of reads that overlap sufficiently well (typically suffix-prefix overlaps),
and then organizing this overlapping information into a graph with a vertex per read and an
∗This white paper was originally submitted to apply for Research Scientist Development Grants of Georgia
Institute of Technology in June 2015.
Y. Liu White Paper
edge per overlapped read pair. This graph structure can be used to develop complex assembly
algorithms which may take into account the global relationship between reads [4]. The overlap
graph can be further simplified by a transitive edge reduction procedure, leading
to a new graph structure called the string graph. Transitive reduction removes both contained
reads and reducible edges, i.e. it keeps only irreducible edges, while faithfully preserving the features
of the original overlap graph. Nonetheless, in order to assemble NGS datasets with hundreds
of millions or billions of reads, overlap graphs still face major bottlenecks, such
as slow overlap finding and large graph size. Recently, Simpson and Durbin [5] broke through
these bottlenecks by employing the FM-index [6] to find all the irreducible edges in O(N) time for a
collection of N reads. In this way, both the overlap finding and transitive reduction procedures
can be effectively replaced by a single irreducible edge finding procedure. This breakthrough makes
the overlap graph feasible for NGS genome assembly and has led to the development of SGA [5],
Fermi [7] and Pasqual [8].
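The overlap graph and its transitive reduction can be illustrated with a minimal Python sketch. It rests on several simplifying assumptions that are not part of the cited assemblers: reads are error-free and forward-strand only, overlaps are exact and found by naive all-pairs comparison (not by the FM-index method of [5]), and the function names are hypothetical. An edge a->c is treated as reducible when some read b places c at the same offset from a through a->b and b->c.

```python
from itertools import permutations


def longest_overlap(a, b, min_overlap):
    """Length of the longest proper suffix of a equal to a prefix of b.

    Returns 0 when no overlap of at least min_overlap (>= 1) exists.
    """
    for length in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_overlap):
    """Naive overlap graph: one vertex per read, one edge per overlapping pair."""
    unique = sorted(set(reads))
    # Contained reads (substrings of another read) carry no extra information.
    kept = [r for r in unique if not any(r != s and r in s for s in unique)]
    edges = {}
    for a, b in permutations(kept, 2):
        overlap = longest_overlap(a, b, min_overlap)
        if overlap:
            edges[(a, b)] = overlap
    return kept, edges


def transitive_reduction(edges):
    """Keep only irreducible edges of the overlap graph."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)
    reducible = set()
    for (a, c), overlap_ac in edges.items():
        offset_ac = len(a) - overlap_ac  # start of c relative to start of a
        for b in successors.get(a, []):
            if b == c or (b, c) not in edges:
                continue
            offset_via_b = (len(a) - edges[(a, b)]) + (len(b) - edges[(b, c)])
            if offset_via_b == offset_ac:  # a->b->c spells the same layout
                reducible.add((a, c))
                break
    return {edge: ov for edge, ov in edges.items() if edge not in reducible}
```

For the reads ATGCCG, GCCGTA and CGTAAC with a minimum overlap of 2, the sketch finds three edges and then drops ATGCCG->CGTAAC, because the path through GCCGTA already implies it.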
The de Bruijn graph can be considered a special case of the overlap graph. Instead of modeling
reads, the de Bruijn graph models the relationship between k-mers (a k-mer is a string of k
nucleotides): it extracts all of the distinct k-mers from the input reads, creates
one vertex per k-mer, and constructs edges by finding suffix-prefix overlaps between k-mers. A
directed edge is created if the suffix of one k-mer exactly overlaps the prefix of another k-mer
by k−1 nucleotides. In this graph structure, reads are re-represented as paths
through the graph, i.e. as ordered sequences of k-mers. In contrast to the overlap graph, increasing
the coverage of genomes (by increasing the number of reads or read lengths) may not affect
the number of vertices in the de Bruijn graph, which thus naturally copes with the high redundancy of
reads. Moreover, this graph can be constructed in linear time and has polynomial-time
solutions for finding optimal tours [7]. All of these merits make the de Bruijn graph more attractive for
addressing the assembly problem for large quantities of reads. At present, the de Bruijn graph
dominates the design of NGS read assemblers.
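The construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming error-free, forward-strand reads; the function name is hypothetical, and edges are inferred purely from (k−1)-nucleotide suffix-prefix overlaps between the distinct k-mers, as in the description above.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each distinct k-mer to the sorted list of its successor k-mers.

    An edge u -> v exists when the last k-1 nucleotides of u equal the
    first k-1 nucleotides of v.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Index k-mers by their (k-1)-prefix so each suffix lookup is one dict access.
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)

    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmers}
```

Because the vertex set is the set of distinct k-mers, passing the same reads several times (i.e. raising the coverage) returns an identical graph, which is the redundancy property noted above.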
In practice, both overlap-graph- and de-Bruijn-graph-based de novo assemblers are compute-intensive
and memory-intensive, especially for large genomes such as the white spruce genome [2]. To
reduce runtime and alleviate pressure on memory, one popular and affordable
approach is to use distributed-memory compute clusters. The Message Passing Interface (MPI)
programming model is the most popular for distributed computing, and the Berkeley Unified Parallel
C (UPC) programming model is also emerging. For the overlap graph, there is no parallel
assembler designed for cluster computing, to the best of my knowledge. The three aforementioned
assemblers, i.e. SGA, Fermi and Pasqual, are all parallelized using multithreading and thus
only work on shared-memory computers. For the de Bruijn graph, a few parallel assemblers,
including Ray [9], ABySS [10], YAGA [11], PASHA [12] (developed by myself) and parallel
Meraculous [13], have been designed to run on clusters. Among these assemblers, the first
four are programmed using MPI and the last one using a combination of MPI and UPC.
[Figure: architectural overview of the solution stack. Layers shown:
Graph Fundamental Building Blocks (GFBB): a library of modularized and parallelized graph primitive operations;
Programming Models: C++, Cilk Plus, CUDA, MPI, Pthreads, OpenMP, SIMD vectorization, etc.;
HPC Hardware: clusters, multi-core CPUs, accelerators (e.g. SSE, AVX, GPUs, Intel Xeon Phis);
Graph-based Applications: single-genome assembly, meta-genome assembly, SNV calling, etc.]
Although the overlap graph and the de Bruijn graph differ in terms of graph representation
and specific assembly approaches, they intrinsically share the same assembly pipeline,
which typically works in five major stages: (i) constructing the graph by finding overlaps (e.g.
read overlaps for the overlap graph and k-mer overlaps for the de Bruijn graph), (ii) simplifying
the graph structure (e.g. reducible edge removal for the overlap graph and tip removal for the
de Bruijn graph), (iii) compacting the graph structure by compressing linear chains of
unambiguously connected vertices, (iv) generating longer contiguous
sequences (contigs) by path traversal, and (v) joining contigs into scaffolds using the orientation
and distance information from paired-end reads. These common procedures and similar opera-
tions have inspired me to think about the feasibility of abstracting and modularizing a standard
set of primitive operations, which can serve as fundamental building blocks to facilitate the
design of new graph-based de novo assemblers. In addition, after investigating the existing
assemblers, we have also observed a set of common primitive operations shared by them,
e.g. traversing the graph to find linear chains, collapsing linear chains to reduce graph size, and
finding a greedy or optimal path (by minimizing or maximizing some objective function) between
vertices. Interestingly, no research effort has ever been devoted to abstracting, modularizing
and standardizing such primitive operations. As a result, each assembler carries its own
variants of these operations, which overlap in function but differ in their level of code
optimization and maturity. This diversity not only wastes development effort, but also hinders the
vendor community from providing highly optimized and tuned (free or commercial) products to
meet the needs of related applications.
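As an illustration of what one such primitive might look like, the following Python sketch finds and collapses linear chains in a directed graph. It is a hypothetical, sequential example and not an interface defined by this white paper. It assumes the graph is a dictionary in which every vertex appears as a key mapped to its list of successors, and it treats an edge u->v as unambiguous when v is the only successor of u and u is the only predecessor of v.

```python
def compact_linear_chains(graph):
    """Partition the vertices into maximal chains of unambiguous edges.

    graph: dict mapping every vertex to a list of its successors.
    Returns a list of chains, each a list of vertices in path order.
    """
    preds = {v: [] for v in graph}
    for u, succs in graph.items():
        for v in succs:
            preds[v].append(u)

    def mergeable(u, v):
        # u -> v can be collapsed: no branching out of u and none into v.
        return graph[u] == [v] and preds[v] == [u]

    chains, seen = [], set()
    for start in sorted(graph):
        # Skip vertices that belong to the interior or end of some chain.
        if len(preds[start]) == 1 and mergeable(preds[start][0], start):
            continue
        chain = [start]
        seen.add(start)
        while len(graph[chain[-1]]) == 1 and mergeable(chain[-1], graph[chain[-1]][0]):
            chain.append(graph[chain[-1]][0])
            seen.add(chain[-1])
        chains.append(chain)

    # Whatever is left lies on isolated cycles; cut each cycle at one vertex.
    for start in sorted(graph):
        if start in seen:
            continue
        chain = [start]
        seen.add(start)
        nxt = graph[start][0]
        while nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = graph[nxt][0]
        chains.append(chain)
    return chains


def spell(chain):
    """Sequence spelled by a chain of k-mers overlapping by k-1 nucleotides."""
    return chain[0] + "".join(kmer[-1] for kmer in chain[1:])
```

On a small de Bruijn graph in which ACG->CGT->GTT is linear and GTT branches to TTA and TTC, the sketch yields three chains that spell ACGTT, TTA and TTC. A parallel version would need to decide how chains that cross partition boundaries are merged, which this sketch does not address.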
Based on the aforementioned observations and discussions, this white paper aims to build
open graph fundamental building block (GFBB) subprograms for de novo assembly, and other
applications like SNV calling [14], in bioinformatics and computational biology. We will investi-
gate and abstract the common low-level primitive operations for overlap-graph- and de-Bruijn-
graph-based assemblers, and further modularize and parallelize them on high-performance com-
puting (HPC) architectures consisting of clusters, multi-core CPUs, accelerators (e.g. SSE,
AVX, GPUs and Intel Xeon Phis), or any combination of these. The figure above illustrates the
architectural overview of our solution stack. By providing a standard set of APIs, along with
in-depth theoretical analysis and highly optimized portable code, it is expected that this library
would help to advance the field of graph-based de novo assembly, and further promote
the standardization, and even commercialization, of various graph algorithms, e.g. algorithms on
Bayesian networks.
References
[1] Wang J et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456(7218):
60-65
[2] Birol I et al. (2013) Assembling the 20 Gb white spruce (Picea glauca) genome from whole-
genome shotgun sequencing data, Bioinformatics, 29: 1492-1497
[3] Illumina HiSeq X Ten System (2015) http://www.illumina.com/systems/hiseq-x-sequencing-system.html
[4] Nagarajan N and Pop M (2013) Sequence assembly demystified. Nat Rev Genet. 14(3):157-
167
[5] Simpson JT and Durbin R (2012) Efficient de novo assembly of large genomes using com-
pressed data structures. Genome Res., 22(3): 549-556
[6] Ferragina P and Manzini G (2005) Indexing compressed text. J. ACM, 52:4
[7] Li H (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo
assembly. Bioinformatics, 28(14): 1838-1844
[8] Liu X et al. (2013) Pasqual: parallel techniques for next generation genome sequence assem-
bly. IEEE Transactions on Parallel and Distributed Systems, 24 (5) : 977-986
[9] Miller JR et al. (2010) Assembly algorithms for next-generation sequencing data. Genomics,
95 (6): 315-327
[10] Simpson JT et al. (2009) ABySS: a parallel assembler for short read sequence data. Genome
Res., 19(6): 1117-1123
[11] Jackson BG et al. (2010) Parallel de novo assembly of large genomes from high-throughput
short reads. 2010 IEEE International Symposium on Parallel & Distributed Processing, 1-10
[12] Liu Y et al. (2011) Parallelized short read assembly of large genomes using de Bruijn graphs.
BMC Bioinformatics, 12(1): 354
[13] Georganas E et al. (2014) Parallel de Bruijn graph construction and traversal for de novo
genome assembly. Proceedings of the International Conference for High Performance Com-
puting, Networking, Storage and Analysis, 437-448
[14] Iqbal Z et al. (2012) De novo assembly and genotyping of variants using colored de Bruijn
graphs. Nat Genet., 44(2):226-232