OpenGraphAssembly: abstract, modularize and parallelize fundamental building blocks for graph-based genome assembly *

Yongchao Liu
School of Computational Science & Engineering
Georgia Institute of Technology
Email: yliu@cc.gatech.edu
URL: http://cc.gatech.edu/~yliu

* This white paper was originally submitted to apply for Research Scientist Development Grants of the Georgia Institute of Technology in June 2015.
Abstract
This white paper aims to build open graph fundamental building block (GFBB) subprograms for de novo assembly and other related applications in bioinformatics. These subprograms will form an OpenGraphAssembly platform released under both open-source and commercial licenses (a dual-licensing model is recommended). We will investigate and abstract the common low-level primitive operations of graph-based de novo genome assemblers, and further modularize and parallelize them on high-performance computing (HPC) architectures such as accelerators and clusters. By providing a standard set of application programming interfaces (APIs), along with in-depth theoretical analysis and highly optimized portable code, OpenGraphAssembly is expected to facilitate advances in graph-based de novo assembly and to promote the standardization, and even commercialization, of various graph algorithms. You are welcome to join this research effort!
OpenGraphAssembly
Low-cost, high-throughput next-generation sequencing (NGS) technologies have significantly altered the landscape of whole-genome sequencing (WGS) and provide an opportunity for scientists to initiate WGS projects for almost any organism, including those whose genomes span billions of base pairs, e.g. human [1] and pine tree [2]. Such technologies redundantly sample a genome: they first fragment it, then extract sequence reads from the genomic fragments in a single-end or paired-end fashion, and finally represent the genome as a collection of reads at a certain (usually high) coverage. De novo genome assembly aims to reconstruct the original genomes (of known or unknown species) from this collection of sequence reads, but faces many challenges, such as (i) short read lengths, (ii) sequencing errors, (iii) the huge number of reads, (iv) long runtimes, and (v) large memory requirements. With the ever-increasing throughput and ever-decreasing sequencing cost of NGS technologies, we are entering the era of big sequencing data, and de novo assembly is therefore expected to become even more challenging. For instance, the newest Illumina HiSeq X Ten system [3] is able to sequence over 18,000 human genomes in a single year at a price of $1,000 per genome, successfully breaking the thousand-dollar human genome barrier.
In modern genome assemblers, the overlap graph and the de Bruijn graph are the two major prototypes. The overlap graph follows the well-known overlap-layout-consensus (OLC) paradigm, which starts by identifying all pairs of reads that overlap sufficiently well (typically via suffix-prefix overlaps) and then organizes this overlap information into a graph with one vertex per read and one
edge per overlapped read pair. This graph structure can be used to develop sophisticated assembly algorithms that take into account the global relationships between reads [4]. The overlap graph can be further simplified by a transitive edge reduction procedure, leading to a new graph structure called the string graph. This transitive reduction removes both contained reads and reducible edges, i.e. it keeps only irreducible edges, yet still faithfully preserves the features of the original overlap graph. Nonetheless, for assembling NGS datasets with hundreds of millions or billions of reads, overlap graphs still face major bottlenecks, such as slow overlap finding and large graph size. Recently, Simpson and Durbin [5] broke through these bottlenecks by employing the FM-index [6] to find all the irreducible edges in O(N) time for a collection of N reads. In this way, both the overlap finding and transitive reduction procedures can be effectively replaced by the irreducible edge finding procedure. This breakthrough makes the overlap graph feasible for NGS genome assembly and has led to the development of SGA [5], Fermi [7] and Pasqual [8].
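To make the overlap-finding step concrete, the following C++ sketch tests a single suffix-prefix overlap between two reads using a naive string comparison; the function name and the toy reads are hypothetical, and this illustrates only the basic operation that SGA, Fermi and Pasqual accelerate with FM-index queries, not their actual implementations.

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// Illustrative sketch only: return the length of the longest suffix of 'a'
// that exactly matches a prefix of 'b', requiring at least 'min_overlap'
// characters. Assemblers such as SGA replace this kind of pairwise test with
// FM-index queries to avoid the quadratic all-pairs cost.
static int suffix_prefix_overlap(const std::string& a, const std::string& b,
                                 int min_overlap) {
    int max_len = static_cast<int>(std::min(a.size(), b.size()));
    for (int len = max_len; len >= min_overlap; --len) {
        if (a.compare(a.size() - len, len, b, 0, len) == 0) {
            return len;  // longest overlap is found first
        }
    }
    return 0;  // no sufficiently long overlap
}

int main() {
    // Two toy reads sharing a 5-bp suffix-prefix overlap ("CGTTG").
    std::string r1 = "ACGTACGTTG";
    std::string r2 = "CGTTGAAACC";
    std::cout << "overlap length = " << suffix_prefix_overlap(r1, r2, 4) << "\n";
    return 0;
}
```

Applying such a test to all read pairs costs quadratic time in the number of reads, which is precisely the bottleneck that irreducible edge finding over an FM-index removes.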
The de Bruijn graph can be considered a special case of the overlap graph. Instead of modeling reads, the de Bruijn graph models the relationships between k-mers: it extracts all of the distinct k-mers (a k-mer is a string of k nucleotides) from the input reads, creates one vertex per distinct k-mer, and constructs edges by finding suffix-prefix overlaps between k-mers. A directed edge is created if the suffix of one k-mer has an exact overlap of k-1 nucleotides with the prefix of another k-mer. In this graph structure, reads are re-represented as paths through the graph, i.e. ordered sequences of k-mers. Compared to the overlap graph, increasing the genome coverage (by adding more reads or lengthening them) does not necessarily increase the number of vertices in the de Bruijn graph, which therefore naturally copes with the high redundancy of reads. Moreover, this graph can be constructed in linear time and admits polynomial-time solutions for finding optimal tours [7]. All of these merits make the de Bruijn graph attractive for assembling large quantities of reads, and it now dominates the design of NGS read assemblers.
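As a concrete illustration of these definitions, the following C++ sketch builds a toy de Bruijn graph: each distinct k-mer becomes a vertex, and a directed edge connects consecutive k-mers of a read, which by construction overlap by exactly k-1 nucleotides. The data structures and names are hypothetical simplifications; real assemblers additionally handle reverse complements, sequencing errors and compact k-mer encodings.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Illustrative sketch only: a de Bruijn graph as a map from each distinct
// k-mer (vertex) to the set of k-mers it points to (successors).
using DeBruijnGraph = std::unordered_map<std::string, std::unordered_set<std::string>>;

static DeBruijnGraph build_de_bruijn(const std::vector<std::string>& reads, size_t k) {
    DeBruijnGraph graph;
    for (const std::string& read : reads) {
        if (read.size() < k) continue;
        for (size_t i = 0; i + k <= read.size(); ++i) {
            std::string kmer = read.substr(i, k);
            graph.emplace(kmer, std::unordered_set<std::string>{});  // ensure vertex exists
            if (i > 0) {
                // The previous k-mer's (k-1)-suffix equals this k-mer's (k-1)-prefix,
                // so add a directed edge between the two vertices.
                graph[read.substr(i - 1, k)].insert(kmer);
            }
        }
    }
    return graph;
}

int main() {
    DeBruijnGraph g = build_de_bruijn({"ACGTACG", "CGTACGT"}, 4);
    for (const auto& [kmer, succs] : g)
        for (const auto& next : succs)
            std::cout << kmer << " -> " << next << "\n";
    return 0;
}
```

Each read of the toy input corresponds to a path through the printed edges, mirroring the re-representation of reads described above.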
In practice, both overlap-graph- and de-Bruijn-graph-based de novo assemblers are compute-intensive and memory-intensive, especially for large genomes such as the pine genome [2]. To reduce runtime and alleviate the memory pressure, one popular and affordable approach is to use distributed-memory compute clusters. The Message Passing Interface (MPI) is the most popular programming model for distributed computing, and the Berkeley Unified Parallel C (UPC) programming model is also emerging. For the overlap graph, to the best of my knowledge, there is no parallel assembler designed for cluster computing: the three aforementioned assemblers, i.e. SGA, Fermi and Pasqual, are all parallelized using multithreading and thus only run on shared-memory computers. For the de Bruijn graph, a few parallel assemblers, including Ray [9], ABySS [10], YAGA [11], PASHA [12] (developed by myself) and parallel Meraculous [13], have been designed to run on clusters. Among these assemblers, the first four are programmed using MPI and the last one using a combination of MPI and UPC.
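As a rough illustration of how such distributed-memory assemblers partition work, the following MPI sketch hashes each locally extracted k-mer to an owner rank, which is the typical first step of distributed de Bruijn graph construction. The code is a hypothetical simplification and is not taken from any of the assemblers listed above; a real implementation would additionally exchange the k-mers (e.g. with MPI_Alltoallv) and build a distributed hash table.

```cpp
#include <mpi.h>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const size_t k = 5;
    // In a real run each rank would read its own slice of the input file;
    // these toy reads stand in for that local slice.
    std::vector<std::string> local_reads = {"ACGTACGTAC", "GGGTACGTTT"};

    // Count how many local k-mers are destined for each owner rank,
    // where the owner is chosen by hashing the k-mer string.
    std::vector<long> dest_counts(nprocs, 0);
    std::hash<std::string> hasher;
    for (const std::string& read : local_reads)
        for (size_t i = 0; i + k <= read.size(); ++i)
            ++dest_counts[hasher(read.substr(i, k)) % nprocs];

    // Aggregate on rank 0: how many k-mers each owner rank would receive.
    std::vector<long> recv_totals(nprocs, 0);
    MPI_Reduce(dest_counts.data(), recv_totals.data(), nprocs, MPI_LONG,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int r = 0; r < nprocs; ++r)
            std::cout << "rank " << r << " owns " << recv_totals[r] << " k-mers\n";

    MPI_Finalize();
    return 0;
}
```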
[Figure: architectural overview of the solution stack. Graph Fundamental Building Blocks (GFBB): a library of modularized and parallelized graph primitive operations. Programming models: C++, Cilk Plus, CUDA, MPI, Pthreads, OpenMP, SIMD vectorization, etc. HPC hardware: clusters, multi-core CPUs, accelerators (e.g. SSE, AVX, GPUs, Intel Xeon Phis). Graph-based applications: single-genome assembly, meta-genome assembly, SNV calling, etc.]
Although the overlap graph and the de Bruijn graph differ in graph representation and in their specific assembly approaches, they intrinsically share the same assembly pipeline, which typically consists of five major stages: (i) constructing the graph by finding overlaps (e.g. read overlaps for the overlap graph and k-mer overlaps for the de Bruijn graph), (ii) simplifying the graph structure (e.g. reducible-edge removal for the overlap graph and tip removal for the de Bruijn graph), (iii) compacting the graph by compressing linear chains of unambiguously connected vertices, (iv) generating longer contiguous
sequences (contigs) by path traversal, and (v) joining contigs into scaffolds using the orientation and distance information from paired-end reads. These common procedures and similar operations have inspired me to consider the feasibility of abstracting and modularizing a standard set of primitive operations that can serve as fundamental building blocks for the design of new graph-based de novo assemblers. Moreover, after investigating the existing assemblers, we have observed a set of common primitive operations shared by them, e.g. traversing the graph to find linear chains, collapsing linear chains to reduce graph size, and finding a greedy or optimal path (by minimizing or maximizing some objective function) between vertices. Interestingly, no research effort has yet been devoted to abstracting, modularizing and standardizing such primitive operations. This has led to diversity at the level of primitive operations, with overlapping variants at different levels of code optimization and maturity. This diversity not only wastes developer effort, but also hinders the vendor community from providing highly optimized and tuned (free or commercial) products that meet the needs of related applications.
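To sketch what one such modularized primitive could look like, the following C++ code finds linear chains of unambiguously connected vertices in a small directed graph, the structures that stage (iii) of the pipeline compacts. All type and function names are hypothetical illustrations rather than a committed API design; a production GFBB implementation would operate on compact vertex encodings and run in parallel.

```cpp
#include <iostream>
#include <unordered_map>
#include <vector>

// Illustrative sketch only: a directed graph as a map from vertex to successors.
using Graph = std::unordered_map<int, std::vector<int>>;

// Return maximal chains v0 -> v1 -> ... -> vn in which every internal step is
// unambiguous (the current vertex has one successor, and that successor has
// exactly one incoming edge).
static std::vector<std::vector<int>> find_linear_chains(const Graph& g) {
    std::unordered_map<int, int> indeg;
    std::unordered_map<int, int> pred;  // unique predecessor when indeg == 1
    for (const auto& [v, succs] : g) {
        indeg.emplace(v, 0);
        for (int w : succs) { ++indeg[w]; pred[w] = v; }
    }
    std::vector<std::vector<int>> chains;
    for (const auto& [v, succs] : g) {
        // Skip vertices that merely continue a chain started elsewhere.
        bool is_continuation =
            indeg[v] == 1 && g.count(pred[v]) && g.at(pred[v]).size() == 1;
        if (is_continuation) continue;
        std::vector<int> chain = {v};
        int cur = v;
        // Extend while the current vertex has a single successor whose only
        // incoming edge comes from the current vertex.
        while (g.count(cur) && g.at(cur).size() == 1 && indeg[g.at(cur)[0]] == 1) {
            cur = g.at(cur)[0];
            chain.push_back(cur);
        }
        if (chain.size() > 1) chains.push_back(chain);
    }
    return chains;
}

int main() {
    // Toy graph: 1 -> 2 -> 3 with a branch at 3 (3 -> 4 and 3 -> 5).
    Graph g = {{1, {2}}, {2, {3}}, {3, {4, 5}}, {4, {}}, {5, {}}};
    for (const auto& chain : find_linear_chains(g)) {
        for (int v : chain) std::cout << v << ' ';
        std::cout << '\n';
    }
    return 0;
}
```

Under the proposed GFBB abstraction, such a primitive would be exposed behind a standard API and reused by the graph-compaction stage of any graph-based assembler.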
Based on the aforementioned observations, this white paper aims to build open graph fundamental building block (GFBB) subprograms for de novo assembly, and for other applications such as SNV calling [14], in bioinformatics and computational biology. We will investigate and abstract the common low-level primitive operations of overlap-graph- and de-Bruijn-graph-based assemblers, and further modularize and parallelize them on high-performance computing (HPC) architectures consisting of clusters, multi-core CPUs, accelerators (e.g. SSE, AVX, GPUs and Intel Xeon Phis), or any combination thereof. The figure above shows the architectural overview of our solution stack. By providing a standard set of APIs, along with in-depth theoretical analysis and highly optimized portable code, it is expected that this library will help advance the field of graph-based de novo assembly, and further promote the standardization, and even commercialization, of various graph algorithms, e.g. Bayesian networks.
References
[1] Wang J et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456(7218):
60-65
[2] Birol I et al. (2013) Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics, 29: 1492-1497
[3] Illumina HiSeq X Ten System (2015) http://www.illumina.com/systems/hiseq-x-sequencing-system.html
[4] Nagarajan N and Pop M (2013) Sequence assembly demystified. Nat Rev Genet. 14(3):157-
167
[5] Simpson JT and Durbin R (2012) Efficient de novo assembly of large genomes using com-
pressed data structures. Genome Res., 22(3): 549-556
[6] Ferragina P and Manzini G (2005) Indexing compressed text. J. ACM, 52:4
[7] Li H (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28(14): 1838-1844
[8] Liu X et al. (2013) Pasqual: parallel techniques for next generation genome sequence assembly. IEEE Transactions on Parallel and Distributed Systems, 24(5): 977-986
[9] Miller JR et al. (2010) Assembly algorithms for next-generation sequencing data. Genomics,
95 (6): 315-327
[10] Simpson JT et al. (2009) ABySS: a parallel assembler for short read sequence data. Genome
Res., 19(6): 1117-1123
[11] Jackson BG et al. (2010) Parallel de novo assembly of large genomes from high-throughput
short reads. 2010 IEEE International Symposium on Parallel & Distributed Processing, 1-10
[12] Liu Y et al. (2011) Parallelized short read assembly of large genomes using de Bruijn graphs.
BMC Bioinformatics, 12(1): 354
[13] Georganas E et al. (2014) Parallel de Bruijn graph construction and traversal for de novo genome assembly. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 437-448
[14] Iqbal Z et al. (2012) De novo assembly and genotyping of variants using colored de Bruijn
graphs. Nat Genet., 44(2):226-232
Article
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.