Conference Paper

Procrastination Leads to Efficient Filtration for Local Multiple Alignment

DOI: 10.1007/11851561_12
Conference: Algorithms in Bioinformatics, 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006, Proceedings
Source: DBLP


We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes O(wN) memory and O(wN log wN) time where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from
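
To illustrate idea (1), the following Python sketch shows why a palindromic seed pattern lets a single pass over the forward strand find matches on both strands. It is not the procrastAligner implementation: the pattern 1101011, the function names, and the canonical-key step (keeping the lexicographically smaller of a seed word and its reverse complement) are assumptions chosen for illustration only.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_word(window, pattern):
    """Keep only the characters of `window` where `pattern` has a '1'."""
    return "".join(c for c, p in zip(window, pattern) if p == "1")


def seed_matches(seq, pattern):
    """Group window start positions by a strand-independent seed key.

    Because the pattern is palindromic, applying it to the reverse
    complement of a window gives the reverse complement of the seed word.
    Using min(word, reverse_complement(word)) as the key (an illustrative
    choice) therefore puts forward and reverse-strand occurrences together.
    """
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    w = len(pattern)
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        word = seed_word(seq[i:i + w], pattern)
        rc = reverse_complement(word)
        if word <= rc:
            table[word].append((i, "+"))
        else:
            table[rc].append((i, "-"))
    # Only keys seen at least twice are matches.
    return {key: hits for key, hits in table.items() if len(hits) > 1}


def by_decreasing_multiplicity(matches):
    """Order seed matches so that the most frequent keys come first."""
    return sorted(matches.items(), key=lambda kv: len(kv[1]), reverse=True)


if __name__ == "__main__":
    motif = "ACCGTAG"
    seq = motif + "TTTTT" + reverse_complement(motif)
    for key, hits in by_decreasing_multiplicity(seed_matches(seq, "1101011")):
        print(key, hits)
```

Sorting the groups by size, as in by_decreasing_multiplicity, mirrors idea (2) only loosely: it orders the seed matches, but does not perform the chaining or the procrastination step described in the abstract.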



Available from: Nicole T Perna
    • "Once a list of multimatches has been generated, we utilize an efficient chaining and filtration algorithm to identify overlapping and nested chains of multimatches. The chaining and filtration algorithm has been described in previous work [40]. In order to process each region of sequence O(1) times, matches are prioritized for chaining in order of decreasing |M_i|. "
    ABSTRACT: Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from:
    Article · Apr 2009 · IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
    • "To improve on contiguous seeds used in BLASTN, general patterns of conservation have been proposed as seeds for sequence alignment in recent years (Califano and Rigoutsos, 1995; Ma et al., 2002; Brejová et al., 2005; Schwager, 1983; Kent, 2002). Different seeds are also used as anchor point in whole-genome and multiple sequence alignments (Batzoglou et al., 2000; Brudno et al., 2003; Darling et al., 2006). Good spaced seeds improve tremendously the sensitivity of seed alignment while keeping speed unchanged (Ma et al., 2002). "
    ABSTRACT: Transition seeds exhibit a good tradeoff between sensitivity and specificity for homology search in both coding and non-coding regions. However, identifying good transition seeds is intractable. This work studies the hit probability of high-order seed patterns. Based on our theoretical results, we propose an efficient method for ranking transition seeds for seed design and list good seeds in different Bernoulli sequence models.
    Article · Jan 2009 · Journal of computational biology: a journal of computational molecular cell biology
    • "Recently, Nagarajan et al. (2006), demonstrated that it can also aid in designing more sensitive motif-finders. Other areas where such analysis has been found to be valuable include repeat finding (Darling et al., 2006) and assessing the reliability of alignments (Sadreyev and Grishin, 2004). Consequently, reliable programs to assess the significance of ungapped multiple alignments can be a valuable general-purpose tool in a bioinformatician's toolbox. "
    ABSTRACT: As was shown in Nagarajan et al. (2005), commonly used approximations for assessing the significance of multiple alignments can be very inaccurate. To address this, we present here the FAST package, an open-source collection of programs and libraries for efficiently and reliably computing the significance of ungapped local alignments. We also describe other potential applications in Bioinformatics where these programs can be adapted for significance testing. AVAILABILITY: The FAST package includes C++ implementations of various algorithms that can be used as stand-alone programs or as a library of subroutines. The package and a web-server for some of the programs are available at
    Article · Mar 2008 · Bioinformatics