Conference Paper

Konnector: Connecting paired-end reads using a bloom filter de Bruijn graph

... Optionally, it also extends those fragment sequences in 3' and 5' directions, as long as the extensions are unambiguous. The tool builds on our earlier implementation [1] that filled in the bases of the sequence gap between read pairs by navigating a de Bruijn graph [2]. Konnector represents a de Bruijn graph using a Bloom filter [3], a probabilistic and memory-efficient data structure. ...
... To evaluate Konnector v2.0, we performed a comparison with several other read-elongation tools: ELOPER [8], GapFiller [9], the MaSuRCA super-reads module [5], and the previously published version of Konnector [1]. ELOPER v1.2 (ELOngation of Paired-End Reads) [8] operates by calculating gapped overlaps between read pairs, where a gapped overlap requires simultaneous overlap of both reads across two read pairs. ...
... The previously published version of Konnector [1] uses the same concept for connecting read pairs as Konnector v2.0, but does not include the sequence extension or duplicate filtering logic. Its output format is most similar to GapFiller, in the sense that it generates one fragment-length sequence for each successfully connected read pair. ...
Article
Full-text available
Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. Konnector uses a probabilistic and memory-efficient data structure called Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
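The central idea of this abstract, querying a Bloom filter for k-mer membership instead of storing an explicit de Bruijn graph, can be sketched in a few lines of Python. The sketch below is not Konnector's implementation; the BloomFilter class, the hash construction, the toy k value and the breadth-first search bound are all illustrative assumptions.

```python
import hashlib
from collections import deque

K = 5  # toy k-mer length; Konnector typically uses a much larger k

class BloomFilter:
    """Minimal Bloom filter standing in for Konnector's cascading variant."""
    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size, self.num_hashes = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _slots(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for s in self._slots(item):
            self.bits[s // 8] |= 1 << (s % 8)

    def __contains__(self, item):
        return all(self.bits[s // 8] >> (s % 8) & 1 for s in self._slots(item))

def load_kmers(reads, bf):
    # Populate the filter with the k-mer spectrum of the read set.
    for r in reads:
        for i in range(len(r) - K + 1):
            bf.add(r[i:i + K])

def bridge(left_flank, right_flank, bf, max_steps=200):
    """BFS over the implicit de Bruijn graph: the successors of a k-mer are
    its four single-base extensions whose membership the filter confirms.
    Returns a sequence running from the left flank's last k-mer through the
    right flank's first k-mer, or None if no path is found in time."""
    start, goal = left_flank[-K:], right_flank[:K]
    queue, seen = deque([(start, start)]), {start}
    while queue and max_steps:
        max_steps -= 1
        kmer, path = queue.popleft()
        if kmer == goal:
            return path
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in bf and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None

reads = ["ACGTACGGT", "CGGTTACCA", "TACCAGGAT"]
bf = BloomFilter()
load_kmers(reads, bf)
print(bridge("ACGTA", "AGGAT", bf))  # ACGTACGGTTACCAGGAT
```

Each k-mer has at most four candidate successors, so one membership query per candidate is enough to walk the graph; Bloom filter false positives can introduce spurious branches, which is one reason Konnector bounds its search and uses a cascading filter.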
... A common gap filling instance is the reconstruction of the missing sequence between contigs, once their relative order and distance is known from the scaffolding step [1], [9], [14], [21], [23], [29]. Another one is filling the missing sequence between the two reads in a paired-end read [18], [28], [36], [40], [42]. In projects with a known reference, gap filling is applied when assembling novel long insertions in a donor, given length estimates derived from mate-pair alignments to the reference [20]. ...
... Due to the computational complexity of this latter problem, Wetzel et al. [40] fill a gap with the heuristic criterion of finding a unique shortest s-t path only. To our knowledge, Konnector [36] is the only method dealing with multiple s-t paths in a systematic way (Konnector underlies Sealer [21], which is a gap filler of scaffolds). It exhaustively enumerates all s-t paths in a de Bruijn graph (up to a user defined threshold), and attempts filling a gap only when there are at most a given number of paths (Sealer's manual recommends 10). ...
... The algorithm is based on finding all sub-paths present in all s-t paths of an unweighted DAG constructed from the assembly graph. As opposed to [21], [36], this avoids exhaustively enumerating all paths, thus scaling to instances (such as in Fig. 1) with an arbitrary number of paths. ...
Article
Gap filling has emerged as a natural sub-problem of many de novo genome assembly projects. The gap filling problem generally asks for an $s-t$ path in an assembly graph whose length matches the gap length estimate. Several methods have addressed it, but only few have focused on strategies for dealing with multiple gap filling solutions and for guaranteeing reliable results. Such strategies include reporting only unique solutions, or exhaustively enumerating all filling solutions and heuristically creating their consensus. Our main contribution is a new method for reliable gap filling: filling gaps with those sub-paths common to all gap filling solutions. We call these partial solutions safe, following the framework of (Tomescu and Medvedev, RECOMB 2016). We give an efficient safe algorithm running in $O(dm)$ time and space, where $d$ is the gap length estimate and $m$ is the number of edges of the assembly graph. To show the benefits of this method, we implemented this algorithm for the problem of filling gaps in scaffolds. Our experimental results on bacterial and on conservative human assemblies show that, on average, our method can retrieve over 73% more safe and correct bases as compared to previous methods, with a similar precision.
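The O(dm) bound in this abstract comes from dynamic programming over (vertex, distance) pairs rather than path enumeration. Here is a minimal sketch of that idea under simplifying assumptions (an unweighted DAG and an exact gap length d); it is not the authors' implementation, and safe_vertices reports only whole vertices rather than the paper's safe sub-paths.

```python
from collections import defaultdict

def safe_vertices(edges, s, t, d):
    """Vertices lying on *every* s-t path of length exactly d in a DAG.
    fwd[i][v] counts length-i paths s->v and bwd[j][v] counts length-j paths
    v->t, so the number of length-d s-t paths through v is
    sum_i fwd[i][v] * bwd[d-i][v]; v is safe iff that equals the total.
    Total work is O(d * m), matching the flavor of the paper's bound."""
    adj, radj, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
        nodes.update((u, v))
    fwd = [defaultdict(int) for _ in range(d + 1)]
    bwd = [defaultdict(int) for _ in range(d + 1)]
    fwd[0][s] = bwd[0][t] = 1
    for i in range(1, d + 1):
        for v in nodes:
            fwd[i][v] = sum(fwd[i - 1][u] for u in radj[v])
            bwd[i][v] = sum(bwd[i - 1][w] for w in adj[v])
    total = fwd[d][t]
    if total == 0:
        return set()
    return {v for v in nodes
            if sum(fwd[i][v] * bwd[d - i][v] for i in range(d + 1)) == total}

# Two length-3 paths (s-a-c-t and s-b-c-t) share only s, c and t.
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
print(sorted(safe_vertices(edges, "s", "t", 3)))  # ['c', 's', 't']
```

Because no paths are enumerated, the running time is independent of the number of s-t paths, which is the point of the safe approach on instances with exponentially many fillings.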
... To address this need, we developed Sealer, a resource-efficient gap-filling software. Sealer uses an assembly utility within the ABySS package, called Konnector [12], as its engine to close intra-scaffold gaps. We demonstrate the scalability of Sealer on the white spruce (P. ...
... Sealer ignores size discrepancies between gaps and newly introduced sequences, since gap sizes are often estimated from fragment library distributions and assemblers do not generally provide confidence intervals for every region of Ns. Despite this, large expansions of the assembly are unlikely due to decreasing gap-closing yield of Konnector as the gap size increases [12]. Below are further details on these three steps. ...
... Generally speaking, k-mers varying in size from 60 to 220 bp were all suited to close gaps in the human draft assembly, and gaps of equivalent sizes tend to close in a k-independent manner, with a slight constriction of the gap length distribution with decreasing k (Additional file 5: Figure S3B). This is not surprising, since Konnector achieves maximum efficiency on fragments < 1 kbp [12]. On a practical note, we recommend, whenever possible, exploring a wide range of k, typically from the read length L down to k = 40, which is the practical lower limit of k for Konnector. ...
Article
Full-text available
BACKGROUND: While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes. RESULTS: Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8 % and 13.8 % of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively - a feat that is not possible with other leading tools with the breadth of data used in our study. CONCLUSION: Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release .
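Sealer's outer loop, as described in this abstract and the citing snippets above, extracts the flanks of each N-run in a scaffold and retries the local assembly across a range of k values. The sketch below captures only that control flow; try_connect is a hypothetical stand-in for a Konnector-style graph search, not Sealer's API, and the constants are illustrative.

```python
import re

def try_connect(left, right, reads, k):
    # Hypothetical placeholder: a real implementation would run a
    # Konnector-style Bloom filter de Bruijn graph search between the
    # flanks (see the sketch after the Konnector v2.0 abstract above).
    return None

def close_gaps(scaffold, reads, ks=(90, 80, 70, 60, 50, 40), flank=100):
    """Sealer-style control flow (illustrative only): for every run of Ns,
    extract the flanking sequences and retry the local assembly at
    decreasing k until some value bridges the gap."""
    out, pos = [], 0
    for gap in re.finditer("N+", scaffold):
        out.append(scaffold[pos:gap.start()])
        left = scaffold[max(0, gap.start() - flank):gap.start()]
        right = scaffold[gap.end():gap.end() + flank]
        filled = None
        for k in ks:  # scan k from near the read length down to ~40
            filled = try_connect(left, right, reads, k)
            if filled is not None:
                break
        out.append(filled if filled is not None
                   else scaffold[gap.start():gap.end()])
        pos = gap.end()
    out.append(scaffold[pos:])
    return "".join(out)

# With the placeholder search, gaps are simply left untouched.
print(close_gaps("ACGTACGTNNNNNNTTGACCGA", reads=[]))
```

Retrying at several k values is what lets different gap sizes and repeat structures close: large k resolves repeats, small k tolerates low coverage.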
... (c) The WS77111 draft assembly was used for second-stage long-range rescaffolding of PG29 V3 informed by whole-scaffold alignments to WS77111 V1. Gaps were closed with the scalable gap filler Sealer, an application of the de Bruijn graph Bloom filter assembler Konnector (Vandervalk et al., 2014). The black, green and blue rectangles depict example PG29, WS77111 scaffolds and a transcript (cDNA) sequence, in this order. ...
... To use this latent information, we developed Sealer (https://github.com/bcgsc/abyss/tree/sealer-prelease), a high-throughput sequence-finishing tool, using the computational engine of the paired-end read connecting utility, Konnector (Vandervalk et al., 2014). Sealer is currently the only high-throughput sequence finishing tool that scales up to large (≥3 Gbp) genome assemblies (https://github.com/bcgsc/abyss/tree/sealerprelease ...
... Flanking nucleotides (2 × 100 bp) are extracted from those regions while respecting the strand direction (5′→3′) on the sequence immediately downstream of each gap. Each flanking sequence pair is used as input to Konnector, a de novo PE assembler with a memory-efficient de Bruijn graph representation using a Bloom filter (Vandervalk et al., 2014). Instead of populating a two-level cascading Bloom filter with the input flank sequences, we use next-generation WGS reads, and populate the filter at a range of k-values, typically k = 30 to k = L/2, where k is the k-mer length and L the read length. ...
Article
White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation. This article is protected by copyright. All rights reserved.
... Thus, constructing longer reads by merging paired-end reads has been used as a common strategy prior to identifying SSR motifs (van der Gaag et al., 2016; Hoogenboom et al., 2017; Ganschow et al., 2018). Several paired-end read merging algorithms have been proposed in recent years, including FLASH (Magoc and Salzberg, 2011), leeHom (Renaud et al., 2014), PEAR (Zhang et al., 2013), BBMerge (Bushnell et al., 2017), Konnector (Vandervalk et al., 2014), OverlapPER (Oliveira et al., 2018), COPE (Liu et al., 2012), and XORRO (Dickson and Gloor, 2013). There also exist approaches and tools, such as SSRs-pipeline (Miller et al., 2013) and RAD-seq-Assembly-Microsatellite (Xue et al., 2017), which integrate paired-end read merging and SSR mining into a single pipeline. ...
Article
Full-text available
Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions, and many non-model species have no complete genome at all. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. While the most commonly used NGS platforms on the market (e.g. the Illumina platform) generally provide short paired-end reads, merging overlapping read pairs has become a common preprocessing step prior to the identification of SSR loci. This poses a big-data analysis challenge for traditional stand-alone tools, which must merge short read pairs and identify SSRs from large-scale data. Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented on top of two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big-data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read-pair merging and SSR mining on very large-scale DNA sequence data. Conclusions: The excellent performance of BigFiRSt derives mainly from the use of Hadoop big data technology to merge read pairs and mine SSRs in parallel, in a distributed fashion on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological big data era.
... Lastly, resource-efficient gap-filling software named Sealer [11] has been designed to close gaps in scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures, and is claimed to scale to large, gigabase-sized genomes. It uses an assembly utility within the ABySS package, called Konnector [12], as its engine to close intra-scaffold gaps. The Konnector utility takes the flanking sequence pairs, along with a set of reads with a high level of coverage redundancy, as inputs, and runs with a range of k-mer lengths to connect the flanking gap sequences. ...
Preprint
Full-text available
Motivation: Advances in sequencing technologies have led to the sequencing of genomes of a multitude of organisms. However, draft genomes of many of these organisms contain a large number of gaps due to repeats in genomes, low sequencing coverage and limitations in sequencing technologies. Although there exist several tools for filling gaps, many of these do not utilize all information relevant to gap filling. Results: Here, we present a probabilistic method for filling gaps in draft genome assemblies using second-generation reads, based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors. Our method is based on the expectation-maximization (EM) algorithm, unlike the graph-based methods adopted in the literature. Experiments on real biological datasets show that this novel approach can fill large portions of gaps with a small number of errors and misassemblies compared with other state-of-the-art gap-filling tools. Availability and implementation: The method is implemented in C++ in a software tool named "Filling Gaps by Iterative Read Distribution (Figbird)", which is available at: https://github.com/SumitTarafder/Figbird.
... This cascade requires 30% to 40% less memory than Chikhi and Rizk's method [399]. Other authors have used Bloom filters to implement de Bruijn graphs for pan-genomics [224] and to connect paired-end reads [439]. A redesign of the ABySS scheme was recently implemented using Bloom filters [234]. ...
Preprint
Full-text available
Various graphs such as web or social networks may contain up to trillions of edges. Compressing such datasets can accelerate graph processing by reducing the amount of I/O accesses and the pressure on the memory subsystem. Yet, selecting a proper compression method is challenging as there exist a plethora of techniques, algorithms, domains, and approaches in compressing graphs. To facilitate this, we present a survey and taxonomy on lossless graph compression that is the first, to the best of our knowledge, to exhaustively analyze this domain. Moreover, our survey does not only categorize existing schemes, but also explains key ideas, discusses formal underpinning in selected works, and describes the space of the existing compression schemes using three dimensions: areas of research (e.g., compressing web graphs), techniques (e.g., gap encoding), and features (e.g., whether or not a given scheme targets dynamic graphs). Our survey can be used as a guide to select the best lossless compression scheme in a given setting.
... At this stage the pbf is discarded and the sbf is used as input for the downstream steps of ChopStitch. Although not all k-mers with a multiplicity of one are error k-mers, and there are error k-mers with a multiplicity of two or more, this cascading approach is very useful, as has been demonstrated before (Jackman et al., 2017; Vandervalk et al., 2014, 2015; Salikhov et al., 2014). We also note that, as a result of using the cascading Bloom filter, genic regions with a WGSS read coverage of less than two would be omitted from our analyses. ...
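The two-level cascade described in this snippet is easy to state precisely: a k-mer goes into the primary filter on first sight and is promoted to the secondary filter when seen again. A minimal sketch follows, with Python sets standing in for the actual bit-array Bloom filters, so the false-positive behaviour of the real structure is not modelled.

```python
class CascadingBloom:
    """Two-level cascading filter: pbf records k-mers seen at least once;
    a k-mer re-observed while already in pbf is promoted to sbf, so sbf
    approximates the set of k-mers with multiplicity >= 2."""
    def __init__(self):
        self.pbf, self.sbf = set(), set()

    def insert(self, kmer):
        if kmer in self.pbf:
            self.sbf.add(kmer)
        else:
            self.pbf.add(kmer)

def solid_kmers(reads, k):
    # Keep only k-mers seen twice or more, filtering most sequencing errors.
    cbf = CascadingBloom()
    for r in reads:
        for i in range(len(r) - k + 1):
            cbf.insert(r[i:i + k])
    return cbf.sbf  # the pbf can now be discarded, as in ChopStitch

reads = ["ACGTACGT", "ACGTACGA", "TTTTACGT"]
print(sorted(solid_kmers(reads, 4)))  # ['ACGT', 'CGTA', 'GTAC', 'TACG']
```

The design rests on the coverage argument quoted above: true genomic k-mers recur across overlapping reads, while most error k-mers are singletons and never leave the primary filter.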
Article
Sequencing studies on non-model organisms often interrogate both genomes and transcriptomes with massive amounts of short sequences. Such studies require de novo analysis tools and techniques, when the species and closely related species lack high quality reference resources. For certain applications such as de novo annotation, information on putative exons and alternative splicing may be desirable. Here we present ChopStitch, a new method for finding putative exons de novo and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also accounts for base substitutions in transcript sequences that may be derived from sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are represented as splice graphs in DOT output format.
... Sealer (Paulino et al., 2015): The program was developed to be applied to large genomes, although it can be applied to small prokaryotic genomes as well. The main algorithm consists of selecting nucleotides in the assembly that flank the gap regions, followed by a local assembly by Konnector (Vandervalk et al., 2014) using data from paired-end reads. Konnector, which takes the paired-end reads and generates pseudo-long reads by combining a Bloom filter with a de Bruijn graph, is distributed alongside recent versions of ABySS (Simpson et al., 2009). ...
Article
Full-text available
The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations of short-read sequencing platforms, most of these newly sequenced genomes have remained "drafts", incomplete representations of the whole genetic content. Previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even a bacterial one, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize genome assemblies and facilitate the finishing process. The present review explores some free (in many cases, open source) tools that are available to facilitate genome finishing.
... A total of 2,076,594,300 bp was generated by Illumina HiSeq sequencing with the TruSeq SBS kit, representing approximately 770× coverage of the gdw1 strain, whose genome size is approximately 2.60 Mb. The cleaned reads were assembled with SOAPdenovo v2.04 and GapCloser v1.12 to obtain the final draft genome, which comprises 48 scaffolds totalling 2,548,085 bp, with a total G+C content of 67.68%, 2,543,572 bp in large scaffolds, a largest scaffold of 312,666 bp, an N50 of 101,891 bp, and an N90 of 28,358 bp (6). Gene prediction was carried out using Glimmer 3.02 (http://ccb.jhu.edu/software/glimmer/index.shtml). ...
Article
Full-text available
Here, we report the draft genome sequence of Leifsonia xyli subsp. xyli strain gdw1, the causal agent of ratoon stunting disease of sugarcane, isolated from the stem of Badila sugarcane at the Guangdong Key Laboratory for Crops Genetic Improvement (Guangzhou, China). The de novo genome of Leifsonia xyli subsp. xyli was assembled into 48 scaffolds spanning 2.6 Mbp, with a G+C content of 67.68% and 2,838 coding sequences.
Article
Full-text available
Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream analyses, and it is appealing to present larger numbers as metrics of assembly performance. However, scaffolds are highly prone to errors, especially when generated using short reads, which can directly result in inflated assembly statistics. Here we provide the first independent evaluation of scaffolding tools for second-generation sequencing data. We find large variations in the quality of results depending on the tool and dataset used. Even extremely simple test cases of perfect input, constructed to elucidate the behavior of each algorithm, produced some surprising results. We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output. However, at least 10% of joins remain unidentified when using real data. The scaffolders vary in their usability, speed and number of correct and missed joins made between contigs. Results from real data highlight opportunities for further improvements of the tools. Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our datasets. However, the quality of the results is highly dependent on the read mapper and genome complexity.
Article
Full-text available
The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (> 30 GB). We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
Article
Full-text available
Second-generation sequencing technologies produce high coverage of the genome by short reads at a very low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this paper we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms very large numbers of paired-end reads into a much smaller number of longer "super-reads." The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced "mazurka"). We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two data sets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. Contact: Aleksey Zimin, alekseyz@ipst.umd.edu.
Article
Full-text available
White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and providing genomics resources for this commercially valuable tree will help improve forest management and conservation efforts. Sequencing and assembling the large and highly repetitive spruce genome, though, pushes the boundaries of the current technology. Here, we describe a whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software. We report a 20.8 gigabase pair (Gbp) draft genome in 4.9 million scaffolds, with a scaffold N50 of 20,356 bp. We demonstrate how recent improvements in the sequencing technology, especially increasing read lengths and paired-end reads from longer fragments, have a major impact on the assembly contiguity. We also note that scalable bioinformatics tools are instrumental in providing rapid draft assemblies. Availability: The Picea glauca genome sequencing and assembly data are available through NCBI (Accession#: ALWZ0100000000; PID: PRJNA83435): http://www.ncbi.nlm.nih.gov/bioproject/83435. Contact: ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes. Benchmarks using the Assemblathon1 and GAGE datasets show that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower at the point of peak memory usage.
Article
Full-text available
Motivation: Scaffolding is the process of ordering and orienting contigs produced during genome assembly. Accurate scaffolding is essential for finishing draft assemblies, as it facilitates the costly and laborious procedures needed to fill in the gaps between contigs. Conventional formulations of the scaffolding problem are intractable, and most scaffolding programs rely on heuristic or approximate solutions, with potentially exponential running time. Results: We present SCARPA, a novel scaffolder, which combines fixed-parameter tractable and bounded algorithms with Linear Programming to produce near-optimal scaffolds. We test SCARPA on real datasets in addition to a simulated diploid genome and compare its performance with several state-of-the-art scaffolders. We show that SCARPA produces longer or similar length scaffolds that are highly accurate compared with other scaffolders. SCARPA is also capable of detecting misassembled contigs and reports them during scaffolding. Availability: SCARPA is open source and available from http://compbio.cs.toronto.edu/scarpa.
Article
Full-text available
Motivation: The boost of next-generation sequencing technologies provides us with an unprecedented opportunity for elucidating genetic mysteries, yet the short-read length hinders us from better assembling the genome from scratch. New protocols now exist that can generate overlapping pair-end reads. By joining the 3' ends of each read pair, one is able to construct longer reads for assembling. However, effectively joining two overlapped pair-end reads remains a challenging task. Result: In this article, we present an efficient tool called Connecting Overlapped Pair-End (COPE) reads, to connect overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10 and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads. Availability and implementation: COPE is implemented in C++ and is freely available as open-source code at ftp://ftp.genomics.org.cn/pub/cope. Contact: twlam@cs.hku.hk or luoruibang@genomics.org.cn
Article
Full-text available
Background: Next-generation sequencing technologies are able to provide high genome coverage at a relatively low cost. However, due to limited read length (from 30 bp up to 200 bp), specific bioinformatics problems have become even more difficult to solve. De novo assembly with short reads, for example, is more complicated for at least two reasons: first, the overall amount of "noisy" data to cope with has increased and, second, as the read length decreases the number of unresolvable repeats grows. Our work aims to address the root of the problem by providing a pre-processing tool capable of producing (in silico) longer and highly accurate sequences from a collection of next-generation sequencing reads. Results: In this paper a seed-and-extend local assembler is presented. The kernel algorithm is a loop that, starting from a read used as seed, keeps extending it using heuristics whose main goal is to produce a collection of error-free, longer sequences. In particular, GapFiller carefully detects reliable overlaps and clusters similar reads in order to reconstruct the missing part between the two ends of the same insert. Our tool's output has been validated on 24 experiments using both simulated and real paired-read datasets. The output sequences are declared correct when the seed's mate is found. In the experiments performed, GapFiller was able to extend high percentages of the processed seeds and find their mates, with a false positive rate that turned out to be nearly negligible. Conclusions: GapFiller, starting from a sufficiently high short-read coverage, is able to produce high coverage of accurate longer sequences (from 300 bp up to 3500 bp). The procedure for performing safe extensions, together with the mate-found check, turned out to be a powerful criterion to guarantee contig correctness. GapFiller has further potential, as it could be applied in a number of different scenarios, including post-processing validation of insertion/deletion detection pipelines, pre-processing routines on datasets for de novo assembly pipelines, or in any hierarchical approach designed to assemble, analyse or validate pools of sequences.
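The seed-and-extend loop this abstract describes can be illustrated with a small greedy sketch: collect the reads that overlap the contig's 3' end and extend by one base only when the vote is unanimous. This is an illustration of the general technique, not GapFiller's algorithm; the overlap length, vote rule and cutoffs are arbitrary choices.

```python
from collections import Counter

def extend_seed(seed, reads, overlap=4, max_length=50):
    """Greedy seed-and-extend sketch: reads whose text contains the current
    3'-end 'tail' vote on the next base; extend only while the vote is
    unanimous, a crude stand-in for 'reliable overlaps'."""
    contig = seed
    while len(contig) < max_length:
        tail = contig[-overlap:]
        votes = Counter()
        for r in reads:
            pos = r.find(tail)
            if pos != -1 and pos + overlap < len(r):
                votes[r[pos + overlap]] += 1
        if not votes:
            break  # no read extends the contig
        base, count = votes.most_common(1)[0]
        if count < sum(votes.values()):
            break  # conflicting evidence: stop rather than guess
        contig += base
    return contig

reads = ["GGACGTAC", "CGTACCTA", "TACCTAGG"]
print(extend_seed("GGACG", reads))  # GGACGTACCTAGG
```

Stopping on the first conflict mirrors the "safe extension" criterion of the abstract, trading contig length for correctness.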
Article
Full-text available
The discovery of genomic structural variants (SVs) at high sensitivity and specificity is an essential requirement for characterizing naturally occurring variation and for understanding pathological somatic rearrangements in personal genome sequencing data. Of particular interest are integrated methods that accurately identify simple and complex rearrangements in heterogeneous sequencing datasets at single-nucleotide resolution, as an optimal basis for investigating the formation mechanisms and functional consequences of SVs. We have developed an SV discovery method, called DELLY, that integrates short insert paired-ends, long-range mate-pairs and split-read alignments to accurately delineate genomic rearrangements at single-nucleotide resolution. DELLY is suitable for detecting copy-number variable deletion and tandem duplication events as well as balanced rearrangements such as inversions or reciprocal translocations. DELLY, thus, enables to ascertain the full spectrum of genomic rearrangements, including complex events. On simulated data, DELLY compares favorably to other SV prediction methods across a wide range of sequencing parameters. On real data, DELLY reliably uncovers SVs from the 1000 Genomes Project and cancer genomes, and validation experiments of randomly selected deletion loci show a high specificity. DELLY is available at www.korbel.embl.de/software.html tobias.rausch@embl.de.
Article
Full-text available
Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
Article
Full-text available
De novo assembly is a commonly used application of next-generation sequencing experiments. The ultimate goal is to puzzle millions of reads into one complete genome, although draft assemblies usually result in a number of gapped scaffold sequences. In this paper we propose an automated strategy, called GapFiller, to reliably close gaps within scaffolds using paired reads. The method shows good results on both bacterial and eukaryotic datasets, allowing only few errors. As a consequence, the amount of additional wetlab work needed to close a genome is drastically reduced. The software is available at http://www.baseclear.com/bioinformatics-tools/.
Article
Full-text available
The next-generation high-throughput sequencing technologies, especially from Illumina, have been widely used in re-sequencing and de novo assembly studies. However, no existing software can yet simulate Illumina reads with realistic error and quality distributions and coverage bias, which would be very useful in relevant software development and in the study design of sequencing projects. We provide a software package, pIRS (profile-based Illumina pair-end reads simulator), which simulates Illumina reads with empirical Base-Calling and GC%-depth profiles trained from real re-sequencing data. The error and quality distributions as well as coverage bias patterns of reads simulated using pIRS fit the properties of real sequencing data better than those of existing simulators. In addition, pIRS also comes with a tool to simulate heterozygous diploid genomes. pIRS is written in C++ and Perl, and is freely available at ftp://ftp.genomics.org.cn/pub/pIRS/.
Conference Paper
Full-text available
The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we propose a new protocol called "Summary Cache"; each proxy keeps a summary of the URLs of cached documents of each participating proxy and checks these summaries for potential hits before sending any queries. Two factors contribute to the low overhead: the summaries are updated only periodically, and the summary representations are economical --- as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that compared to the existing Internet Cache Protocol (ICP), Summary Cache reduces the number of inter-cache messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, and eliminates 30% to 95% of the CPU overhead, while at the same time maintaining almost the same hit ratio as ICP. Hence Summary Cache enables cache sharing among a large number of proxies.
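The "8 bits per entry" figure trades directly against the false-positive rate through the standard Bloom filter formula, which a few lines can verify. The constants below are a back-of-the-envelope check, not numbers from the paper.

```python
from math import exp, log

def bloom_fpr(bits_per_entry, num_hashes):
    # Standard approximation: p = (1 - e^(-k * n/m))^k, with m/n bits per
    # entry and k hash functions.
    return (1 - exp(-num_hashes / bits_per_entry)) ** num_hashes

bits = 8                  # the "as low as 8 bits per entry" summaries
k = round(bits * log(2))  # optimal hash count is (m/n) * ln 2, about 5.5
print(k, round(bloom_fpr(bits, k), 4))  # 6 hashes, roughly a 2% FP rate
```

A false positive here only costs a wasted inter-proxy query, which is why such a compact summary is acceptable; the same asymmetry is what makes Bloom filters attractive for k-mer spectra in the assembly tools above.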
Article
Full-text available
Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome. We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds. The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash. t.magoc@gmail.com.
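The overlap-and-merge operation FLASH performs can be sketched as a scan over candidate overlap lengths, scoring each by mismatch fraction. This is a simplified illustration, not FLASH's scoring function; the thresholds and the tie-breaking rule are assumptions.

```python
def revcomp(seq):
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def merge_pair(fwd, rev, min_overlap=5, max_mismatch_frac=0.25):
    """Slide the reverse-complemented mate over the forward read's 3' end
    and keep the overlap length with the lowest mismatch fraction."""
    mate = revcomp(rev)
    best = None
    for ov in range(min_overlap, min(len(fwd), len(mate)) + 1):
        a, b = fwd[-ov:], mate[:ov]
        frac = sum(x != y for x, y in zip(a, b)) / ov
        if frac <= max_mismatch_frac and (best is None or frac < best[1]):
            best = (ov, frac)
    if best is None:
        return None  # fragment too long for the read pair to overlap
    return fwd + mate[best[0]:]

r1 = "ACGTTGCAAGGT"           # forward read
r2 = revcomp("GCAAGGTTCCAT")  # mate, sequenced from the other strand
print(merge_pair(r1, r2))     # ACGTTGCAAGGTTCCAT
```

Merging only works when the fragment is shorter than twice the read length; Konnector's graph traversal exists precisely for the complementary case, where a gap remains between the reads.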
Article
Full-text available
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
Article
Full-text available
De novo assembly tools play a main role in reconstructing genomes from next-generation sequencing (NGS) data and usually yield a number of contigs. Using paired-read sequencing data it is possible to assess the order, distance and orientation of contigs and combine them into so-called scaffolds. Although the latter process is a crucial step in finishing genomes, scaffolding algorithms are often built-in functions in de novo assembly tools and cannot be independently controlled. We here present a new tool, called SSPACE, which is a stand-alone scaffolder of pre-assembled contigs using paired-read data. Its main features are: a short runtime, multiple library input of paired-end and/or mate-pair datasets, and possible contig extension with unmapped sequence reads. SSPACE shows promising results on both prokaryote and eukaryote genomic test sets, where the number of initial contigs was reduced by at least 75%. Availability: www.baseclear.com/bioinformatics-tools/. Contact: walter.pirovano@baseclear.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
An accurate genome sequence of a desired species is now a pre-requisite for genome research. An important step in obtaining a high-quality genome sequence is to correctly assemble short reads into longer sequences accurately representing contiguous genomic regions. Current sequencing technologies continue to offer increases in throughput, and corresponding reductions in cost and time. Unfortunately, the benefit of obtaining a large number of reads is complicated by sequencing errors, with different biases being observed with each platform. Although software is available to assemble reads for each individual system, no procedure has been proposed for high-quality simultaneous assembly based on reads from a mix of different technologies. In this paper, we describe a parallel short-read assembler, called Ray, which has been developed to assemble reads obtained from a combination of sequencing platforms. We compared its performance to other assemblers on simulated and real datasets. We used a combination of Roche/454 and Illumina reads to assemble three different genomes. We showed that mixing sequencing technologies systematically reduces the number of contigs and the number of errors. Because of its open nature, this new tool will hopefully serve as a basis to develop an assembler that can be of universal utilization (availability: http://deNovoAssembler.sf.Net/). For online Supplementary Material, see www.liebertonline.com.
Article
Full-text available
We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to reference-based assembly methods.
Article
Full-text available
New generation sequencing technologies producing increasingly complex datasets demand new efficient and specialized sequence analysis algorithms. Often, it is only the 'novel' sequences in a complex dataset that are of interest and the superfluous sequences need to be removed. A novel algorithm, fast and accurate classification of sequences (FACS), is introduced that can accurately and rapidly classify sequences as belonging or not belonging to a reference sequence. FACS was first optimized and validated using a synthetic metagenome dataset. An experimental metagenome dataset was then used to show that FACS achieves comparable accuracy to BLAT and SSAHA2 but is at least 21 times faster in classifying sequences. Source code for FACS, the Bloom filters and the MetaSim dataset used is available at http://facs.biotech.kth.se. The Bloom::Faster 1.6 Perl module can be downloaded from CPAN at http://search.cpan.org/~palvaro/Bloom-Faster-1.6/. Contact: henrik.stranneheim@biotech.kth.se; joakiml@biotech.kth.se. Supplementary data are available at Bioinformatics online.
Article
Full-text available
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods with which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including insertion-deletions (indels), inversions and translocations. We examined BreakDancer's performance in simulation, in comparison with other methods and in analyses of a sample from an individual with acute myeloid leukemia and of samples from the 1,000 Genomes trio individuals. BreakDancer sensitively and accurately detected indels ranging from 10 base pairs to 1 megabase pair that are difficult to detect via a single conventional approach.
Article
Full-text available
There is a strong demand in the genomic community to develop effective algorithms to reliably identify genomic variants. Indel detection using next-gen data is difficult, and identification of long structural variations is extremely challenging. We present Pindel, a pattern growth approach, to detect breakpoints of large deletions and medium-sized insertions from paired-end short reads. We use both simulated reads and real data to demonstrate the efficiency of the computer program and the accuracy of the results. The binary code and a short user manual can be freely downloaded from http://www.ebi.ac.uk/~kye/pindel/. Contact: k.ye@lumc.nl; zn1@sanger.ac.uk.
Article
Full-text available
The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is approximately 10-20x faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. http://maq.sourceforge.net.
Article
Full-text available
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source (http://bowtie.cbcb.umd.edu).
Article
Full-text available
Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs > or =100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.
Article
Full-text available
New DNA sequencing technologies deliver data at dramatically lower costs but demand new analytical methods to take full advantage of the very short reads that they produce. We provide an initial, theoretical solution to the challenge of de novo assembly from whole-genome shotgun "microreads." For 11 genomes of sizes up to 39 Mb, we generated high-quality assemblies from 80x coverage by paired 30-base simulated reads modeled after real Illumina-Solexa reads. The bacterial genomes of Campylobacter jejuni and Escherichia coli assemble optimally, yielding single perfect contigs, and larger genomes yield assemblies that are highly connected and accurate. Assemblies are presented in a graph form that retains intrinsic ambiguities such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. For both C. jejuni and E. coli, this assembly graph is a single edge encompassing the entire genome. Larger genomes produce more complicated graphs, but the vast majority of the bases in their assemblies are present in long edges that are nearly always perfect. We describe a general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads.
Article
BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at http://github.com/lh3/bwa. Contact: hengli@broadinstitute.org
Article
Motivation: Paired-end sequencing resulting in gapped short reads is commonly used for de novo genome assembly. Assembly methods use paired-end sequences in a two-step process, first treating each read-end independently, only later invoking the pairing to join the contiguous assemblies (contigs) into gapped scaffolds. Here, we present ELOPER, a pre-processing tool for pair-end sequences that produces a better read library for assembly programs. Results: ELOPER proceeds by simultaneously considering both ends of paired reads generating elongated reads. We show that ELOPER theoretically doubles read-lengths while halving the number of reads. We provide evidence that pre-processing read libraries using ELOPER leads to considerably improved assemblies as predicted from the Lander-Waterman model. Availability: http://sourceforge.net/projects/eloper Supplementary information: Supplementary data are available at Bioinformatics online.
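ELOPER's notion of a gapped overlap, the simultaneous overlap of both reads across two read pairs, can be made concrete with a small sketch. It assumes equal insert sizes and same-orientation mates for simplicity, and is not the published algorithm.

```python
def suffix_prefix_overlap(a, b, min_len=4):
    """Longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    for ov in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-ov:] == b[:ov]:
            return ov
    return 0

def gapped_overlap(pair_a, pair_b, min_len=4):
    """A gapped overlap requires both ends to overlap simultaneously: the
    same relative shift must align a1 with b1 and a2 with b2."""
    a1, a2 = pair_a
    b1, b2 = pair_b
    ov1 = suffix_prefix_overlap(a1, b1, min_len)
    ov2 = suffix_prefix_overlap(a2, b2, min_len)
    # with identical insert sizes, both ends must share one shift
    if ov1 and ov2 and len(a1) - ov1 == len(a2) - ov2:
        return a1 + b1[ov1:], a2 + b2[ov2:]  # elongated read pair
    return None

pa = ("ACGGTTCA", "TTGACCGA")
pb = ("TTCAGGAC", "CCGATGGT")
print(gapped_overlap(pa, pb))
# ('ACGGTTCAGGAC', 'TTGACCGATGGT')
```

Requiring agreement at both ends is what lets ELOPER elongate reads more safely than single-end overlaps would: a chance overlap on one end is unlikely to be confirmed by the mate.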
Article
Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
Article
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
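The FM-index machinery this abstract builds on is classic and can be demonstrated compactly: construct the Burrows-Wheeler transform of the text, then count pattern occurrences by backward search. This is a textbook toy under naive construction, not SGA's engineered index.

```python
def build_fm_index(text):
    """Toy FM-index: suffix array, C table and occurrence counts.
    Real implementations avoid this O(n^2 log n) suffix sort."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # i-1 == -1 wraps to the sentinel row
    chars = sorted(set(text))
    C, total = {}, 0
    for c in chars:  # C[c] = number of characters in text smaller than c
        C[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in chars}
    for i, b in enumerate(bwt):
        for c in chars:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return sa, C, occ

def count_matches(fm, pattern):
    """Backward search: maintain the suffix array interval [lo, hi) of
    rows prefixed by the growing suffix of the pattern."""
    sa, C, occ = fm
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

fm = build_fm_index("ACGTACGT")
print(count_matches(fm, "ACG"))  # 2
```

Each character of the pattern costs O(1) table lookups, which is the property that lets overlap computation on billions of reads stay practical.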
Article
M. S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, 1995.
Article
The output of a genome assembler generally comprises a collection of contiguous DNA sequences (contigs) whose relative placement along the genome is not defined. A procedure called scaffolding is commonly used to order and orient these contigs using paired read information. This ordering of contigs is an essential step when finishing and analyzing the data from a whole-genome shotgun project. Most recent assemblers include a scaffolding module; however, users have little control over the scaffolding algorithm or the information produced. We thus developed a general-purpose scaffolder, called Bambus, which affords users significant flexibility in controlling the scaffolding parameters. Bambus was used recently to scaffold the low-coverage draft dog genome data. Most significantly, Bambus enables the use of linking data other than that inferred from mate-pair information. For example, the sequence of a completed genome can be used to guide the scaffolding of a related organism. We present several applications of Bambus: support for finishing, comparative genomics, analysis of the haplotype structure of genomes, and scaffolding of a mammalian genome at low coverage. Bambus is available as an open-source package from our Web site.
Article
We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
J. T. Simpson and R. Durbin, "Efficient de novo assembly of large genomes using compressed data structures," Genome Research, vol. 22, pp. 549-556, Mar 2012.
[5] N. Donmez and M. Brudno, "SCARPA: scaffolding reads with practical algorithms," Bioinformatics, vol. 29, pp. 428-434, Feb 2013.
[6] M. Boetzer, C. V. Henkel, H. J. Jansen, D. Butler, and W. Pirovano, "Scaffolding pre-assembled contigs using SSPACE," Bioinformatics, vol. 27, pp. 578-579, Feb 2011.
[7] M. Pop, D. S. Kosack, and S. L. Salzberg, "Hierarchical scaffolding with Bambus," Genome Research, vol. 14, pp. 149-159, Jan 2004.
[8] K. Chen, J. W. Wallis, M. D. McLellan, D. E. Larson, J. M.
I. Birol, A. Raymond, S. D. Jackman, S. Pleasance, R. Coope, G. A. Taylor, M. M. Yuen, C. I. Keeling, D. Brand, B. P. Vandervalk, H. Kirk, P. Pandoh, R. A. Moore, Y. Zhao, A. J. Mungall, B. Jaquish, A. Yanchuk, C. Ritland, B. Boyle, J. Bousquet, K. Ritland, J. Mackay, J. Bohlmann, and S. J. Jones, "Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data," Bioinformatics, May 2013.
[25] A. Cornish-Bowden, "Nomenclature for incompletely specified bases in nucleic-acid sequences: recommendations 1984," Nucleic Acids Research, vol. 13, pp. 3021-3030, 1985.
[26] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," J Mol Biol, vol. 147, pp. 195-197, Mar 1981.
... Maguire, C. Hartl, A. A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M. Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, and M. J. Daly, "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nat Genet, vol. 43, p. 8, 2011.
[33] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, and G. McVean, "De novo assembly and genotyping of variants using colored de Bruijn graphs," Nature Genetics, vol. 44, p. 7, 2012.
H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," arXiv preprint, 2013.
[29] M. Hunt, C. Newbold, M. Berriman, and T. D. Otto, "A comprehensive evaluation of assembly scaffolding tools," Genome Biol, vol. 15, p. R42, 2014.
[30] M. Boetzer and W. Pirovano, "Toward almost closed genomes with GapFiller," Genome Biol, vol. 13, p. R56, 2012.
... Prabhu, A. Tam, Y. Zhao, R. A. Moore, M. Hirst, M. A. Marra, S. J. M. Jones, P. A. Hoodless, and I. Birol, "De novo assembly and analysis of RNA-seq data," Nature Methods, vol. 7, p. 4, 2010.
[35] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: a scalable wide-area web cache sharing protocol," IEEE/ACM Transactions on Networking (TON), vol. 8, p. 13, 2000.
B. Liu, J. Yuan, S.-M. Yiu, Z. Li, Y. Xie, Y. Chen, Y. Shi, H. Zhang, Y. Li, T.-W. Lam, and R. Luo, "COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly," Bioinformatics, vol. 28, pp. 2870-2874, Nov 2012.
[14] D. H. Silver, S. Ben-Elazar, A. Bogoslavsky, and I. Yanai, "ELOPER: elongation of paired-end reads as a pre-processing tool for improved de novo genome assembly," Bioinformatics, vol. 29, pp. 1455-1457, Jun 2013.