BFAST: An Alignment Tool for Large Scale Genome Resequencing

Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA.
PLoS ONE (Impact Factor: 3.53). 11/2009; 4(11):e7767. DOI: 10.1371/journal.pone.0007767
Source: PubMed

ABSTRACT The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation.
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
We compare BFAST to a selection of large-scale alignment tools -- BLAT, MAQ, SHRiMP, and SOAP -- in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at (
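
The two-stage approach described in the abstract (index lookup for candidate locations, then gapped local alignment) can be illustrated with a minimal Python sketch. This is not BFAST's implementation: the mask strings, the function names, the scoring values (match 2, mismatch -1, linear gap -2), and the padding around each candidate are all assumptions chosen for illustration, and the toy reference is invented.

```python
from collections import defaultdict

def masked_key(seq, start, mask):
    """Bases of seq under the '1' positions of mask, beginning at start."""
    return "".join(seq[start + i] for i, m in enumerate(mask) if m == "1")

def build_index(reference, mask):
    """Map every masked key in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(mask) + 1):
        index[masked_key(reference, pos, mask)].append(pos)
    return index

def candidate_locations(read, indexes):
    """Union of implied read start positions over all (mask, index) pairs."""
    candidates = set()
    for mask, index in indexes:
        for offset in range(len(read) - len(mask) + 1):
            for pos in index.get(masked_key(read, offset, mask), ()):
                candidates.add(pos - offset)
    return candidates

def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Local alignment score and the window index just past the best alignment.

    Uses a linear gap penalty (an assumption made to keep the sketch short).
    """
    best_score, best_end = 0, 0
    prev = [0] * (len(window) + 1)
    for i in range(1, len(read) + 1):
        cur = [0] * (len(window) + 1)
        for j in range(1, len(window) + 1):
            step = match if read[i - 1] == window[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best_score:
                best_score, best_end = cur[j], j
        prev = cur
    return best_score, best_end

def align_read(read, reference, indexes, pad=3):
    """Return (best score, reference end coordinate) over all candidate windows."""
    best = (0, None)
    for start in sorted(candidate_locations(read, indexes)):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score, end = smith_waterman(read, reference[lo:hi])
        if score > best[0]:
            best = (score, lo + end)
    return best

if __name__ == "__main__":
    # Toy reference and masks, both hypothetical.
    reference = "TTGACCGATAGCTAGGCATCCGTAAGCTTGCAAT"
    masks = ["1111", "11011"]
    indexes = [(m, build_index(reference, m)) for m in masks]
    # Read copied from reference[10:22] with one substituted base.
    print(align_read("GCTAGTCATCCG", reference, indexes))
```

Using more than one mask means a read still produces seed hits when an error falls on a position that one mask ignores or that another mask's window does not cover; the gapped alignment step then scores each candidate window, which is what lets small insertions and deletions be recovered.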

  •
    ABSTRACT: Background: Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders with high heritability. Recent findings support a highly heterogeneous and complex genetic etiology including rare de novo and inherited mutations or chromosomal rearrangements as well as double or multiple hits. Methods: We performed whole-exome sequencing (WES) and blood cell transcriptome analysis by RNAseq in a subset of male patients with idiopathic ASD (n = 36) in order to identify causative genes, transcriptomic alterations, and susceptibility variants. Results: We detected likely monogenic causes in seven cases: five de novo (SCN2A, MED13L, KCNV1, CUL3, and PTEN) and two inherited X-linked variants (MAOA and CDKL5). Transcriptomic analyses allowed the identification of intronic causative mutations missed by the usual filtering of WES and revealed functional consequences of some rare mutations. These included aberrant transcripts (PTEN, POLR3C), deregulated expression in 1.7% of mutated genes (that is, SEMA6B, MECP2, ANK3, CREBBP), allele-specific expression (FUS, MTOR, TAF1C), and non-sense-mediated decay (RIT1, ALG9). The analysis of rare inherited variants showed enrichment in relevant pathways such as PI3K-Akt signaling and axon guidance. Conclusions: Integrative analysis of WES and blood RNAseq data has proven to be an efficient strategy to identify likely monogenic forms of ASD (19% in our cohort), as well as additional rare inherited mutations that can contribute to ASD risk in a multifactorial manner. Blood transcriptomic data, besides validating 88% of expressed variants, allowed the identification of missed intronic mutations and revealed functional correlations of genetic variants, including changes in splicing, expression levels, and allelic expression.
    Molecular Autism 12/2015; 6(1). DOI:10.1186/s13229-015-0017-0 · 5.49 Impact Factor
  •
    ABSTRACT: Background: Exome sequencing has become a popular method to evaluate undirected mutagenesis experiments in mice. However, the most suitable mouse strain for the biological model may be relatively distant from the standard mouse reference genome. For pinpointing causative variants, a matching reference with gene annotations is essential, but not always readily available. Results: We present an approach that allows murine Ensembl annotations to be used on alternative mouse strain assemblies. We resolved ENU-induced mutation screening for 8 phenotypic mutant lines generated on a C3HeB/FeJ background by aligning the sequences against the closely related, but unannotated, C3H/HeJ reference. Variants occurring in all strains were filtered out as specific to the C3HeB/FeJ strain but unrelated to mutagenesis. Variants occurring exclusively in all individuals of one mutant line and matching the inheritance model were selected as mutagenesis-related. These variants were annotated with gene and exon names lifted over from the standard murine reference mm9 to C3H/HeJ using megablast. For each mutant line, we could restrict the results to exonic variants in between 1 and 23 genes. Conclusions: The presented method of exonic annotation lift-over proved to be a valuable tool in the search for mutagenesis-derived coding genomic variants and the assessment of genotype-phenotype relationships.
    BMC Genomics 05/2015; 16(1). DOI:10.1186/s12864-015-1548-7 · 4.04 Impact Factor
  •
    ABSTRACT: One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When queries are too many and/or targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as in an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through The software is released under the British Columbia Cancer Agency License (BCCA), and is free for academic use.
    PLoS ONE 04/2015; 10(4):e0126409. DOI:10.1371/journal.pone.0126409 · 3.53 Impact Factor