Conference Paper

Procrastination Leads to Efficient Filtration for Local Multiple Alignment.

DOI: 10.1007/11851561_12
Conference: Algorithms in Bioinformatics, 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006, Proceedings
Source: DBLP

ABSTRACT We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes O(wN) memory and O(wN log wN) time where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from http://gel.ahabs.wisc.edu/procrastination.
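
As an illustration of idea (1), the following is a minimal Python sketch of how a palindromic spaced seed can index both DNA strands with a single key. It is not taken from procrastAligner: the pattern "1101011", the helper names, and the choice of the lexicographically smaller masked word as the canonical key are all assumptions made for this example. The point it shows is that when the pattern reads the same in both directions, masking the reverse complement of a window gives the reverse complement of the masked window, so a word and its reverse-complement occurrence fall into the same bucket.

```python
# Illustrative sketch only; names and the seed pattern are hypothetical,
# not procrastAligner's actual implementation.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(pattern: str) -> bool:
    """A seed pattern of '1' (match) and '0' (don't care) is palindromic
    if it reads the same forwards and backwards."""
    return pattern == pattern[::-1]


def masked_word(window: str, pattern: str) -> str:
    """Keep characters at '1' positions; replace don't-care positions with '.'."""
    return "".join(c if p == "1" else "." for c, p in zip(window, pattern))


def canonical_key(window: str, pattern: str) -> str:
    """Strand-independent key: the smaller of the masked forward word and
    the masked reverse-complement word (an assumed canonical form)."""
    forward = masked_word(window, pattern)
    reverse = masked_word(reverse_complement(window), pattern)
    return min(forward, reverse)


def seed_index(seq: str, pattern: str) -> dict:
    """Map each canonical key to the list of window start positions in seq."""
    if not is_palindromic(pattern):
        raise ValueError("pattern must be palindromic to match both strands")
    index = {}
    width = len(pattern)
    for start in range(len(seq) - width + 1):
        key = canonical_key(seq[start:start + width], pattern)
        index.setdefault(key, []).append(start)
    return index


if __name__ == "__main__":
    pattern = "1101011"  # hypothetical palindromic seed
    # "TGACCGT" is the reverse complement of "ACGGTCA".
    seq = "ACGGTCA" + "TT" + "TGACCGT"
    for key, positions in seed_index(seq, pattern).items():
        if len(positions) > 1:
            print(key, positions)
```

In this sketch, keys with more than one position are candidate seed matches, and the length of each position list corresponds to the match multiplicity that the abstract uses to order seed extension.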

Available from: Nicole T Perna, Jun 30, 2015
  • Source
    ABSTRACT: Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.
    IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 04/2009; 6(2):180-9. DOI:10.1109/TCBB.2009.9 · 1.54 Impact Factor
  • Source
    ABSTRACT: Transition seeds exhibit a good tradeoff between sensitivity and specificity for homology search in both coding and non-coding regions. However, identifying good transition seeds is intractable. This work studies the hit probability of high-order seed patterns. Based on our theoretical results, we propose an efficient method for ranking transition seeds for seed design and list good seeds in different Bernoulli sequence models.
    Journal of computational biology: a journal of computational molecular cell biology 01/2009; 15(10):1295-313. DOI:10.1089/cmb.2007.0209 · 1.67 Impact Factor
  • Source
    ABSTRACT: As was shown in Nagarajan et al. (2005), commonly used approximations for assessing the significance of multiple alignments can be very inaccurate. To address this, we present here the FAST package, an open-source collection of programs and libraries for efficiently and reliably computing the significance of ungapped local alignments. We also describe other potential applications in bioinformatics where these programs can be adapted for significance testing. AVAILABILITY: The FAST package includes C++ implementations of various algorithms that can be used as stand-alone programs or as a library of subroutines. The package and a web-server for some of the programs are available at www.cs.cornell.edu/~keich/FAST.
    Bioinformatics 03/2008; 24(4):577-8. DOI:10.1093/bioinformatics/btm594 · 4.62 Impact Factor