Article

Spaced seed data structures


Abstract

This past decade, genome sciences have benefitted from rapid advances in DNA sequencing technologies, and development of efficient algorithms for processing short nucleotide sequences played a key role in enabling their uptake in the field. In particular, reassembly of human genomes (de novo or reference-guided) from short DNA sequence reads had a substantial impact on health research. De novo assembly of a genome is essential in the absence of a reference genome sequence of a species. It is also gaining traction even when one is available, due to the utility of the method to resolve ambiguous or rearranged genomic regions with high specificity. With commercial high-throughput sequencing technologies increasing their throughput and their read lengths, the de Bruijn graph (DBG) paradigm used by many assembly algorithms needs to be revisited. DBG uses a table of k-mers, sequences of length k base pairs derived from the reads, and their k-1 base pair overlaps to assemble sequences. Despite longer k-mers unlocking longer genomic features for assembly, associated increases in memory usage and other compute resources are tradeoffs that limit the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we introduce three data structure designs for paired k-mers, or spaced seeds, each addressing memory and run time constraints imposed by longer reads. In spaced seeds, a fixed distance separates k-mer pairs, providing increased sequence specificity with increased distance, while keeping memory usage low. Further, we describe a data structure based on Bloom filters that would be suitable to implicitly store spaced seeds, and would be tolerant to sequencing errors. Building on the spaced seeds Bloom filter, we describe a data structure for tracking the frequencies of observed spaced seeds. We expect the data structure designs we introduce in this study to have broad applications in genomics research, with niche applications in genome, transcriptome and metagenome assemblies, and in read error correction.
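The spaced seed concept described above pairs two k-mers separated by a fixed gap. As a rough illustration (not the authors' implementation), the short Python sketch below extracts such pairs from a read; the function name and the k and gap values are made up for the example.

# Illustrative sketch (not the paper's implementation): extract spaced seeds,
# i.e. pairs of k-mers separated by a fixed gap, from a nucleotide read.

def spaced_seeds(read, k=4, gap=6):
    """Yield (left k-mer, right k-mer) pairs; the right k-mer starts
    gap bases after the left one ends, so each seed spans 2*k + gap bases."""
    span = 2 * k + gap
    for i in range(len(read) - span + 1):
        left = read[i:i + k]
        right = read[i + k + gap:i + span]
        yield left, right

if __name__ == "__main__":
    for pair in spaced_seeds("ACGTACGTTGCAACGTTGCA", k=4, gap=6):
        print(pair)

Because only the two k-mers are stored and the gap bases are skipped, each seed covers 2k + gap bases of sequence context while storing only 2k bases, which is one way to read the memory/specificity trade-off the abstract describes.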


... Recent efforts to minimize the genome assembly resource footprint have led to the implementation of several memory-efficient assemblers (Simpson and Durbin, 2010; Conway and Bromage, 2011; Chikhi and Rizk, 2012; Ye et al., 2012), but usually at the expense of time and accuracy. We have been preoccupied with the scale problem for some time (Simpson et al., 2009) and have recently outlined and presented the theory behind assembly by spaced seeds, a redesign of the traditional k-mer that, even in current data structure implementations, has the potential for a more than twofold speed-up and a fourfold reduction in memory without compromising assembly correctness (Birol et al., 2014). Here we introduce re-scaffolded, improved draft genome assemblies V3 and V4 of the western white spruce PG29 genome, first assembled in 2013 (Birol et al., 2013), and the additional genome assembly of the eastern white spruce genotype WS77111. ...
Article
White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.
Article
Full-text available
The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (> 30 GB). We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
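The central idea, representing the graph implicitly with a membership structure and recovering edges by querying the four possible one-base extensions of a k-mer, can be sketched as follows. This is a simplified illustration rather than Minia's code: a Python set stands in for the Bloom filter, and the false-positive removal structure the paper adds is omitted.

# Sketch of implicit de Bruijn graph traversal via membership queries:
# store k-mers in a set (standing in for a Bloom filter) and enumerate
# successors by testing the four possible one-base extensions.
# Function names are illustrative, not Minia's API.

def kmers(reads, k):
    for r in reads:
        for i in range(len(r) - k + 1):
            yield r[i:i + k]

def successors(kmer, membership):
    """Return k-mers reachable from `kmer` by one (k-1)-overlap step."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in membership]

if __name__ == "__main__":
    reads = ["ACGTACG", "CGTACGT"]
    table = set(kmers(reads, k=4))
    print(successors("ACGT", table))  # ['CGTA']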
Article
Full-text available
Neuroblastoma is a malignancy of the developing sympathetic nervous system that often presents with widespread metastatic disease, resulting in survival rates of less than 50%. To determine the spectrum of somatic mutation in high-risk neuroblastoma, we studied 240 affected individuals (cases) using a combination of whole-exome, genome and transcriptome sequencing as part of the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative. Here we report a low median exonic mutation frequency of 0.60 per Mb (0.48 nonsilent) and notably few recurrently mutated genes in these tumors. Genes with significant somatic mutation frequencies included ALK (9.2% of cases), PTPN11 (2.9%), ATRX (2.5%, and an additional 7.1% had focal deletions), MYCN (1.7%, causing a recurrent p.Pro44Leu alteration) and NRAS (0.83%). Rare, potentially pathogenic germline variants were significantly enriched in ALK, CHEK2, PINK1 and BARD1. The relative paucity of recurrent somatic mutations in neuroblastoma challenges current therapeutic strategies that rely on frequently altered oncogenic drivers.
Article
Full-text available
Despite recent progress, computational tools that identify gene fusions from next-generation whole transcriptome sequencing data are often limited in accuracy and scalability. Here, we present a software package, BreakFusion, that combines the strength of reference alignment followed by read-pair analysis and de novo assembly to achieve a good balance in sensitivity, specificity and computational efficiency. Availability: http://bioinformatics.mdanderson.org/main/BreakFusion Contact: kchen3@mdanderson.org; lding@genome.wustl.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Primary triple-negative breast cancers (TNBCs), a tumour type defined by lack of oestrogen receptor, progesterone receptor and ERBB2 gene amplification, represent approximately 16% of all breast cancers. Here we show in 104 TNBC cases that at the time of diagnosis these cancers exhibit a wide and continuous spectrum of genomic evolution, with some having only a handful of coding somatic aberrations in a few pathways, whereas others contain hundreds of coding somatic mutations. High-throughput RNA sequencing (RNA-seq) revealed that only approximately 36% of mutations are expressed. Using deep re-sequencing measurements of allelic abundance for 2,414 somatic mutations, we determine for the first time-to our knowledge-in an epithelial tumour subtype, the relative abundance of clonal frequencies among cases representative of the population. We show that TNBCs vary widely in their clonal frequencies at the time of diagnosis, with the basal subtype of TNBC showing more variation than non-basal TNBC. Although p53 (also known as TP53), PIK3CA and PTEN somatic mutations seem to be clonally dominant compared to other genes, in some tumours their clonal frequencies are incompatible with founder status. Mutations in cytoskeletal, cell shape and motility proteins occurred at lower clonal frequencies, suggesting that they occurred later during tumour progression. Taken together, our results show that understanding the biology and therapeutic responses of patients with TNBC will require the determination of individual tumour clonal genotypes.
Article
Full-text available
As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
Article
Full-text available
High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values. We present a software package named Oases designed to heuristically assemble RNA-seq reads in the absence of a reference genome, across a broad spectrum of expression values and in the presence of alternative isoforms. It achieves this by using an array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events and the efficient merging of multiple assemblies. It was tested on human and mouse RNA-seq data and is shown to improve significantly on the Trans-ABySS and Trinity de novo transcriptome assemblers. Oases is freely available under the GPL license at www.ebi.ac.uk/~zerbino/oases/.
Article
Full-text available
A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
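For readers unfamiliar with the structure, the following is a minimal textbook-style Bloom filter sketch (not code from the survey): k bit positions are derived per item by double hashing, insertion sets them, and a query reports membership only if all are set, so false positives are possible but false negatives are not. With n items, m bits and k hash functions, the expected false-positive rate is approximately (1 - e^(-kn/m))^k. The sizes and hashing scheme below are illustrative only.

# Minimal textbook Bloom filter sketch (illustrative sizes and hashing scheme):
# k bit positions per item via double hashing; false positives possible,
# false negatives impossible.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 16, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("ACGTACGT")
    print("ACGTACGT" in bf, "TTTTTTTT" in bf)  # True, almost certainly False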
Article
Full-text available
Follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLBCL) are the two most common non-Hodgkin lymphomas (NHLs). Here we sequenced tumour and matched normal DNA from 13 DLBCL cases and one FL case to identify genes with mutations in B-cell NHL. We analysed RNA-seq data from these and another 113 NHLs to identify genes with candidate mutations, and then re-sequenced tumour and matched normal DNA from these cases to confirm 109 genes with multiple somatic mutations. Genes with roles in histone modification were frequent targets of somatic mutation. For example, 32% of DLBCL and 89% of FL cases had somatic mutations in MLL2, which encodes a histone methyltransferase, and 11.4% and 13.4% of DLBCL and FL cases, respectively, had mutations in MEF2B, a calcium-regulated gene that cooperates with CREBBP and EP300 in acetylating histones. Our analysis suggests a previously unappreciated disruption of chromatin biology in lymphomagenesis.
Article
Full-text available
Gene fusions created by somatic genomic rearrangements are known to play an important role in the onset and development of some cancers, such as lymphomas and sarcomas. RNA-Seq (whole transcriptome shotgun sequencing) is proving to be a useful tool for the discovery of novel gene fusions in cancer transcriptomes. However, algorithmic methods for the discovery of gene fusions using RNA-Seq data remain underdeveloped. We have developed deFuse, a novel computational method for fusion discovery in tumor RNA-Seq data. Unlike existing methods that use only unique best-hit alignments and consider only fusion boundaries at the ends of known exons, deFuse considers all alignments and all possible locations for fusion boundaries. As a result, deFuse is able to identify fusion sequences with demonstrably better sensitivity than previous approaches. To increase the specificity of our approach, we curated a list of 60 true positive and 61 true negative fusion sequences (as confirmed by RT-PCR), and have trained an adaboost classifier on 11 novel features of the sequence data. The resulting classifier has an estimated value of 0.91 for the area under the ROC curve. We have used deFuse to discover gene fusions in 40 ovarian tumor samples, one ovarian cancer cell line, and three sarcoma samples. We report herein the first gene fusions discovered in ovarian cancer. We conclude that gene fusions are not infrequent events in ovarian cancer and that these events have the potential to substantially alter the expression patterns of the genes involved; gene fusions should therefore be considered in efforts to comprehensively characterize the mutational profiles of ovarian cancer transcriptomes.
Article
Full-text available
Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
Article
Full-text available
We present a pipeline, SVMerge, to detect structural variants by integrating calls from several existing structural variant callers, which are then validated and the breakpoints refined using local de novo assembly. SVMerge is modular and extensible, allowing new callers to be incorporated as they become available. We applied SVMerge to the analysis of a HapMap trio, demonstrating enhanced structural variant detection, breakpoint refinement, and a lower false discovery rate. SVMerge can be downloaded from http://svmerge.sourceforge.net.
Article
Full-text available
Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.
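The core Bayesian step, scoring candidate haplotypes by combining a prior with per-read likelihoods, can be illustrated with a toy calculation. This is not Dindel's actual model (which also accounts for mapping errors and homopolymer-dependent indel error rates); the haplotypes, reads and error rate below are made up.

# Toy illustration of Bayesian haplotype scoring (not Dindel's model):
# P(H | reads) is proportional to P(H) * product over reads of P(r | H),
# with P(r | H) based on mismatch counts under a per-base error rate.

def read_likelihood(read, hap, err=0.01):
    # Naive exact-offset placement; keep the best-scoring alignment.
    best = 0.0
    for i in range(len(hap) - len(read) + 1):
        mism = sum(a != b for a, b in zip(read, hap[i:i + len(read)]))
        best = max(best, (err ** mism) * ((1 - err) ** (len(read) - mism)))
    return best

def posterior(reads, haplotypes, prior=None):
    prior = prior or {h: 1.0 / len(haplotypes) for h in haplotypes}
    joint = {}
    for h in haplotypes:
        p = prior[h]
        for r in reads:
            p *= read_likelihood(r, h)
        joint[h] = p
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    candidate = "ACGTACTACGT"   # reference with one base deleted
    reads = ["GTACTACG", "ACGTACTA"]
    print(posterior(reads, [reference, candidate]))  # favors the deletion haplotype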
Article
Full-text available
We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to reference-based assembly methods.
Article
Full-text available
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods with which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including insertion-deletions (indels), inversions and translocations. We examined BreakDancer's performance in simulation, in comparison with other methods and in analyses of a sample from an individual with acute myeloid leukemia and of samples from the 1,000 Genomes trio individuals. BreakDancer sensitively and accurately detected indels ranging from 10 base pairs to 1 megabase pair that are difficult to detect via a single conventional approach.
Article
Full-text available
ClinSeq is a pilot project to investigate the use of whole-genome sequencing as a tool for clinical research. By piloting the acquisition of large amounts of DNA sequence data from individual human subjects, we are fostering the development of hypothesis-generating approaches for performing research in genomic medicine, including the exploration of issues related to the genetic architecture of disease, implementation of genomic technology, informed consent, disclosure of genetic information, and archiving, analyzing, and displaying sequence data. In the initial phase of ClinSeq, we are enrolling roughly 1000 participants; the evaluation of each includes obtaining a detailed family and medical history, as well as a clinical evaluation. The participants are being consented broadly for research on many traits and for whole-genome sequencing. Initially, Sanger-based sequencing of 300-400 genes thought to be relevant to atherosclerosis is being performed, with the resulting data analyzed for rare, high-penetrance variants associated with specific clinical traits. The participants are also being consented to allow the contact of family members for additional studies of sequence variants to explore their potential association with specific phenotypes. Here, we present the general considerations in designing ClinSeq, preliminary results based on the generation of an initial 826 Mb of sequence data, the findings for several genes that serve as positive controls for the project, and our views about the potential implications of ClinSeq. The early experiences with ClinSeq illustrate how large-scale medical sequencing can be a practical, productive, and critical component of research in genomic medicine.
Article
Full-text available
Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. The transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66,921 contigs 100 bp or longer, with a maximum contig length of 10,951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. The assembler is implemented in C++; the parallel version uses Open MPI. The ABySS-Explorer tool is implemented in Java using the Java Universal Network/Graph framework. Contact: ibirol@bcgsc.ca.
Article
Full-text available
The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. We implemented the Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with the Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is approximately 10-20x faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net.
Article
Full-text available
Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs > or =100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.
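The contig N50 reported above is the standard assembly summary statistic: the length L such that contigs of length at least L account for at least half of the assembled bases. A minimal sketch of that computation follows; the contig lengths are made up.

# Minimal sketch of the N50 statistic quoted in the ABySS abstract.

def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

if __name__ == "__main__":
    print(n50([100, 200, 300, 400, 500]))  # 400: 500 + 400 = 900 >= 1500 / 2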
Article
Full-text available
Massively parallel sequencing instruments enable rapid and inexpensive DNA sequence data production. Because these instruments are new, their data require characterization with respect to accuracy and utility. To address this, we sequenced a Caenorhabditis elegans N2 Bristol strain isolate using the Solexa Sequence Analyzer, and compared the reads to the reference genome to characterize the data and to evaluate coverage and representation. Massively parallel sequencing facilitates strain-to-reference comparison for genome-wide sequence variant discovery. Owing to the short-read-length sequences produced, we developed a revised approach to determine the regions of the genome to which short reads could be uniquely mapped. We then aligned Solexa reads from C. elegans strain CB4858 to the reference, and screened for single-nucleotide polymorphisms (SNPs) and small indels. This study demonstrates the utility of massively parallel short read sequencing for whole genome resequencing and for accurate discovery of genome-wide polymorphisms.
Conference Paper
Full-text available
We address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because a text T[1,u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the k-th order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1,p], the opportunistic data structure allows to search for the occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known (Grossi and Vitter, 2000); on compressible data our solution improves the succinct suffix array of (Grossi and Vitter, 2000) and the classical suffix tree and suffix array data structures either in space or in query time or both. We also study our opportunistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool (Manber and Wu, 1994). The result is an indexing tool which achieves sublinear space and sublinear query time complexity.
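The query operation behind such compressed full-text indexes is backward search over the Burrows-Wheeler transform. The sketch below illustrates it with naively computed C[] and rank() tables; a real implementation of the structure described above would store these in compressed form to approach the O(H_k(T)) space bound, so this is only a didactic approximation.

# Illustrative backward-search sketch over a BWT (not the paper's succinct
# structures): C[] and rank() are computed naively here.

def bwt(text):
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Return the number of occurrences of `pattern` in the original text."""
    c, total = {}, 0
    for ch in sorted(set(bwt_str)):       # C[ch] = count of characters < ch
        c[ch] = total
        total += bwt_str.count(ch)

    def rank(ch, i):                      # occurrences of ch in bwt_str[:i]
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in c:
            return 0
        lo = c[ch] + rank(ch, lo)
        hi = c[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    b = bwt("ACGTACGTACGT")
    print(backward_search(b, "ACGT"))  # 3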
Article
Genomic profiling has identified a subtype of high-risk B-progenitor acute lymphoblastic leukemia (B-ALL) with alteration of IKZF1, a gene expression profile similar to BCR-ABL1-positive ALL and poor outcome (Ph-like ALL). The genetic alterations that activate kinase signaling in Ph-like ALL are poorly understood. We performed transcriptome and whole genome sequencing on 15 cases of Ph-like ALL and identified rearrangements involving ABL1, JAK2, PDGFRB, CRLF2, and EPOR, activating mutations of IL7R and FLT3, and deletion of SH2B3, which encodes the JAK2-negative regulator LNK. Importantly, several of these alterations induce transformation that is attenuated with tyrosine kinase inhibitors, suggesting the treatment outcome of these patients may be improved with targeted therapy.
Article
One of the key advances in genome assembly that has led to a significant improvement in contig lengths has been improved algorithms for utilization of paired reads (mate-pairs). While in most assemblers, mate-pair information is used in a post-processing step, the recently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pair information directly in the assembly graph structure. However, the PDBG approach faces difficulties when the variation in the insert sizes is high. To address this problem, we first transform mate-pairs into edge-pair histograms that allow one to better estimate the distance between edges in the assembly graph that represent regions linked by multiple mate-pairs. Further, we combine the ideas of mate-pair transformation and PDBGs to construct new data structures for genome assembly: pathsets and pathset graphs.
Article
Eugene Myers in his string graph paper suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method on 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. In the methodological aspects, we propose FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches and one-pass construction of unitigs from an FMD-index. http://github.com/lh3/fermi
Article
In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency. The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods. In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods. Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
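The space/error trade-off Bloom analyzes is usually summarized by the standard false-positive formula for n stored items, m bits and k hash functions. The short calculation below plugs in illustrative numbers; the parameter values are not from the paper.

# Back-of-the-envelope Bloom filter sizing: false-positive rate
# p ≈ (1 - e^(-k*n/m))^k, minimized when k ≈ (m/n) * ln 2.
# The values of n and m below are illustrative only.
import math

n = 1_000_000                              # items stored
m = 10 * n                                 # 10 bits per item
k = round(m / n * math.log(2))             # optimal number of hash functions
p = (1 - math.exp(-k * n / m)) ** k        # expected false-positive rate

print(f"k = {k}, false-positive rate ≈ {p:.4f}")  # k = 7, ≈ 0.0082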
Article
Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.
Article
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
Article
Oligodendroglioma is characterized by unique clinical, pathological, and genetic features. Recurrent losses of chromosomes 1p and 19q are strongly associated with this brain cancer but knowledge of the identity and function of the genes affected by these alterations is limited. We performed exome sequencing on a discovery set of 16 oligodendrogliomas with 1p/19q co-deletion to identify new molecular features at base-pair resolution. As anticipated, there was a high rate of IDH mutations: all cases had mutations in either IDH1 (14/16) or IDH2 (2/16). In addition, we discovered somatic mutations and insertions/deletions in the CIC gene on chromosome 19q13.2 in 13/16 tumours. These discovery set mutations were validated by deep sequencing of 13 additional tumours, which revealed seven others with CIC mutations, thus bringing the overall mutation rate in oligodendrogliomas in this study to 20/29 (69%). In contrast, deep sequencing of astrocytomas and oligoastrocytomas without 1p/19q loss revealed that CIC alterations were otherwise rare (1/60; 2%). Of the 21 non-synonymous somatic mutations in 20 CIC-mutant oligodendrogliomas, nine were in exon 5 within an annotated DNA-interacting domain and three were in exon 20 within an annotated protein-interacting domain. The remaining nine were found in other exons and frequently included truncations. CIC mutations were highly associated with oligodendroglioma histology, 1p/19q co-deletion, and IDH1/2 mutation (p < 0.001). Although we observed no differences in the clinical outcomes of CIC mutant versus wild-type tumours, in a background of 1p/19q co-deletion, hemizygous CIC mutations are likely important. We hypothesize that the mutant CIC on the single retained 19q allele is linked to the pathogenesis of oligodendrogliomas with IDH mutation. Our detailed study of genetic aberrations in oligodendroglioma suggests a functional interaction between CIC mutation, IDH1/2 mutation, and 1p/19q co-deletion.
Article
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Article
Since the identification of fetal lymphocytes in maternal blood in 1969, investigators have endeavored to develop genetics-based noninvasive prenatal diagnostics (NIPD) (1). A robust noninvasive approach would augment or potentially supplant amniocentesis and chorionic villus sampling, which, although gold standards, carry a risk of fetal loss. Despite considerable efforts, the use of fetal cells for NIPD has never reached clinical implementation because of their paucity (on the order of a few cells per milliliter of maternal blood) and concerns of fetal cell persistence in the maternal circulation between pregnancies (2). A new avenue was opened in 1997 by the discovery of circulating cell-free fetal DNA in maternal blood (3). Cell-free fetal DNA constitutes between 5% and 10% of the total DNA in maternal plasma and increases during gestation. It rapidly clears from the circulation postpartum, a feature that enhances its attractiveness as a per-pregnancy–specific analyte. Several clinical applications based on the analysis of cell-free fetal DNA have been developed. These assays include determining fetal Rh D status in Rh D-negative women (4), sex in sex-linked disorders (3)(5), and detection of paternally inherited autosomal recessive and dominant mutations (6). In the context of these successes, however, there remains the outstanding challenge of the use of cell-free fetal DNA for the detection of chromosomal aneuploidy, in particular trisomies 21, 18, and 13. A cell-free fetal DNA–based approach for chromosomal aneuploidy must overcome several technical hurdles. By virtue of being cell free, fetal DNA in the maternal plasma cannot be easily analyzed by traditional visualization methods such as fluorescence in situ hybridization. The high “background” of maternal DNA (cell free or from residual cells) further complicates analysis by diluting the fetal genetic information. To address these issues, investigators have worked to identify fetal-specific molecular markers. The origin of circulating cell-free …
Article
New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
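The mapping quality introduced here is a phred-scaled error probability, MAPQ = -10 log10 P(the read is mapped to the wrong position). A toy conversion between the two scales is shown below; the probability values are made up and this is not MAQ's estimation procedure.

# Phred-scaled mapping quality, as defined in the MAQ abstract:
# MAPQ = -10 * log10 P(read is mapped to the wrong position).
# The cap used when p == 0 is an arbitrary choice for this sketch.
import math

def mapq(p_wrong):
    return round(-10 * math.log10(p_wrong)) if p_wrong > 0 else 60

def p_wrong(mapq_value):
    return 10 ** (-mapq_value / 10)

print(mapq(0.001))   # 30: one misplaced read in a thousand
print(p_wrong(20))   # 0.01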
Article
We describe the third generation of the CAP sequence assembly program. The CAP3 program includes a number of improvements and new features. The program has a capability to clip 5' and 3' low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward-reverse constraints to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward-reverse constraints.
Article
For the last twenty years, fragment assembly has been dominated by the “overlap-layout-consensus” algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing and it is not clear whether they are the best choice for large-scale assemblies. Although the “overlap-layout-consensus” approach proved to be useful in assembling clones, it faces difficulties in genomic assemblies: the existing algorithms make assembly errors even in bacterial genomes. We abandoned the “overlap-layout-consensus” approach in favour of a new Eulerian Superpath approach that outperforms the existing algorithms for genomic fragment assembly (Pevzner et al., 2001, in Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB-01), 256–26). In this paper we describe our new EULER-DB algorithm that, similarly to the Celera assembler, takes advantage of clone-end sequencing by using the double-barreled data. However, in contrast to the Celera assembler, EULER-DB does not mask repeats but uses them instead as a powerful tool for contig ordering. We also describe a new approach for the Copy Number Problem: “How many times is a given repeat present in the genome?”. For long nearly-perfect repeats this question is notoriously difficult and some copies of such repeats may be “lost” in genomic assemblies. We describe our EULER-CN algorithm for the Copy Number Problem that proved to be successful in difficult sequencing projects. Contact: ppevzner@ucsd.edu
Article
We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
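A toy version of the graph Velvet and related assemblers build can make the description concrete: nodes are (k-1)-mers, reads contribute k-mer edges between overlapping nodes, and an unambiguous path spells a contig. The sketch below is illustrative only and omits everything that makes Velvet practical (error removal, read-pair use, graph compaction).

# Toy de Bruijn graph sketch in the spirit of the description above
# (not Velvet's implementation): nodes are (k-1)-mers, edges are k-mers.
from collections import defaultdict

def build_graph(reads, k):
    graph = defaultdict(list)
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # (k-1)-mer overlap edge
    return graph

def walk(graph, start):
    """Extend a contig from `start` while the path is unambiguous."""
    contig, node = start, start
    while len(set(graph.get(node, []))) == 1:
        node = graph[node][0]
        contig += node[-1]
        if len(contig) > 1000:   # simple guard against cycles
            break
    return contig

if __name__ == "__main__":
    g = build_graph(["ACGTTGCA", "GTTGCATT"], k=4)
    print(walk(g, "ACG"))  # ACGTTGCATT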
Article
We describe a block-sorting, lossless data compression algorithm, and our implementation of that algorithm. We compare the performance of our implementation with widely available data compressors running on the same hardware. The algorithm works by applying a reversible transformation to a block of input text. The transformation does not itself compress the data, but re-orders it to make it easy to compress with simple algorithms such as move-to-front encoding. Our algorithm achieves speed comparable to algorithms based on the techniques of Lempel and Ziv, but obtains compression close to the best statistical modelling techniques. The size of the input block must be large (a few kilobytes) to achieve good compression.
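The second step mentioned above, move-to-front encoding, is what turns the grouped output of the block-sorting transform into a stream dominated by small integers that simple coders compress well. A minimal sketch follows; the input string is the BWT produced in the earlier FM-index sketch, and the code is illustrative rather than the authors' implementation.

# Move-to-front encoding of a block-sorted (BWT) string: because the BWT
# groups identical characters, the output is dominated by zeros.

def move_to_front(text):
    alphabet = sorted(set(text))
    out = []
    for ch in text:
        i = alphabet.index(ch)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))  # move the symbol to the front
    return out

print(move_to_front("TTT$AAACCCGGG"))  # [4, 0, 0, 1, 2, 0, 0, 3, 0, 0, 4, 0, 0]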
K. G. Roberts, R. D. Morin, J. Zhang, M. Hirst, Y. Zhao, X. Su, S.-C. Chen, D. Payne-Turner, M. L. Churchman, R. C. Harvey, X. Chen, C. Kasap, C. Yan, J. Becksfort, R. P. Finney, D. T. Teachey, S. L. Maude, K. Tse, R. Moore, S. Jones, K. Mungall, I. Birol, M. N. Edmonson, Y. Hu, K. E. Buetow, I. M. Chen, W. L. Carroll, L. Wei, J. Ma, M. Kleppe, R. L. Levine, G. Garcia-Manero, E. Larsen, N. P. Shah, M. Devidas, G. Reaman, M. Smith, S. W. Paugh, W. E. Evans, S. A. Grupp, S. Jeha, C.-H. Pui, D. S. Gerhard, J. R. Downing, C. L. Willman, M. Loh, S. P. Hunger, M. A. Marra, and C. G. Mullighan, "Genetic Alterations Activating Kinase and Cytokine Receptor Signaling in High-Risk Acute Lymphoblastic Leukemia," Cancer Cell, vol. 22, no. 2, pp. 153-166, Aug. 14, 2012.
Fast gapped-read alignment with Bowtie 2
B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nat Methods, vol. 9, no. 4, pp. 357-359, Apr. 2012.
Space/Time Tradeoffs in Hash Coding With Allowable Errors
B. H. Bloom, "Space/Time Tradeoffs in Hash Coding With Allowable Errors," Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
CAP3: A DNA sequence assembly program
X. Huang and A. Madan, "CAP3: A DNA sequence assembly program," Genome Res, vol. 9, no. 9, pp. 868-877, Sep. 1999.