Conference Paper

Procrastination Leads to Efficient Filtration for Local Multiple Alignment.

DOI: 10.1007/11851561_12 Conference: Algorithms in Bioinformatics, 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006, Proceedings
Source: DBLP

ABSTRACT We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low-multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes O(wN) memory and O(wN log wN) time, where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat-rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from
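The first of these ideas can be made concrete. Below is a minimal sketch, assuming a hypothetical 7-column palindromic pattern, of why a palindromic spaced seed lets one index serve both DNA strands: extracting the seed positions from a window's reverse complement yields exactly the reverse complement of the key extracted from the window, so taking the lexicographically smaller of the two gives a strand-independent key. The pattern and function names here are illustrative, not taken from procrastAligner.

```python
# Illustrative sketch: a palindromic spaced seed matches both DNA
# strands with a single index. SEED is a hypothetical pattern, not
# one of the paper's actual seeds.

SEED = "1101011"            # palindromic: the pattern reads the same reversed
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]

def seed_key(window):
    """Extract the characters at the seed's '1' positions."""
    return "".join(c for c, m in zip(window, SEED) if m == "1")

def canonical_key(window):
    """Because the seed pattern is palindromic, the key of the reverse
    complement equals the reverse complement of the key, so the
    lexicographic minimum of the two is strand-independent."""
    k = seed_key(window)
    return min(k, revcomp(k))

w = "ACGTACG"
# The palindromic-seed property: extraction commutes with revcomp.
assert seed_key(revcomp(w)) == revcomp(seed_key(w))
# Hence one canonical key indexes a window and its reverse complement.
assert canonical_key(w) == canonical_key(revcomp(w))
```

With a non-palindromic pattern the first assertion fails, which is why matching both strands would otherwise require scanning twice or maintaining two indexes.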

  • Source
    ABSTRACT: Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from:
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2009; 6(2):180-9. · 2.25 Impact Factor
  • Source
    ABSTRACT: The Double Cut and Join is an operation acting locally at four chromosomal positions without regard to chromosomal context. This chapter discusses its application and the resulting menu of operations for genomes consisting of arbitrary numbers of circular chromosomes, as well as for a general mix of linear and circular chromosomes. In the general case the menu includes: inversion, translocation, transposition, formation and absorption of circular intermediates, conversion between linear and circular chromosomes, block interchange, fission, and fusion. This chapter discusses the well-known edge graph and its dual, the adjacency graph, recently introduced by Bergeron et al. Step-by-step procedures are given for constructing and manipulating these graphs. Simple algorithms are given in the adjacency graph for computing the minimal DCJ distance between two genomes and finding a minimal sorting; and use of an online tool (Mauve) to generate synteny blocks and apply DCJ is described.
    Methods in Molecular Biology 02/2008; 452:385-416. · 1.29 Impact Factor
  • Source
    ABSTRACT: During evolution, large-scale genome rearrangements shuffle the order of homologous genome sequences ("synteny blocks") across species. Some years ago, a controversy erupted in genome rearrangement studies over whether rearrangements recur, causing breakpoints to be reused. We investigate this controversial issue using the human-mouse-rat synteny blocks reported by Bourque et al. and a series of synteny blocks we generated using Mauve at resolutions ranging from coarse to very fine-scale. We conducted analyses to test how resolution affects the traditional measure of the breakpoint reuse rate. We found that the inversion-based breakpoint reuse rate is low at fine-scale synteny block resolution and that it rises and eventually falls as synteny block resolution decreases. By analyzing the cycle structure of the breakpoint graph of the human-mouse-rat synteny blocks for human-mouse and comparing it with theoretically derived distributions for random genome rearrangements, we showed that the implied genome rearrangements at each level of resolution become more "random" as synteny block resolution diminishes. At the highest synteny block resolutions, the Hannenhalli-Pevzner inversion distance deviates from the Double Cut and Join distance, possibly due to small-scale transpositions or simply due to the inclusion of erroneous synteny blocks. At synteny block resolutions as coarse as the Bourque et al. blocks, we show that the breakpoint graph cycle structure has already converged to the pattern expected for a random distribution of synteny blocks. The inferred breakpoint reuse rate depends on synteny block resolution in human-mouse genome comparisons. At fine-scale resolution, the cycle structure of the transformation appears less random than at coarse resolution. Small synteny blocks may contain critical information for accurate reconstruction of genome rearrangement history and parameters.
    BMC Bioinformatics 01/2011; 12 Suppl 9:S1. · 3.02 Impact Factor
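The HMM posterior-decoding step described in the first related abstract above can be sketched in miniature: a two-state HMM (H = homologous, U = unrelated) scored over alignment columns with the standard forward-backward recursion. All parameters below are illustrative placeholders, not Repeatoire's trained values, and columns are simplified to match/mismatch booleans rather than nucleotide pairs.

```python
# Illustrative two-state HMM posterior decoding (forward-backward).
# Parameters are made up for the sketch; Repeatoire derives emission
# frequencies from a time-reversible substitution matrix instead.

STATES = ("H", "U")
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.95, "U": 0.05},
         "U": {"H": 0.05, "U": 0.95}}
# Emission probability of a column being a match (True) or mismatch.
EMIT = {"H": {True: 0.9, False: 0.1},
        "U": {True: 0.25, False: 0.75}}

def posterior_homology(columns):
    """Return P(state = H | all columns) for each alignment column."""
    n = len(columns)
    # Forward pass: fwd[i][s] = P(columns[0..i], state_i = s)
    fwd = [{} for _ in range(n)]
    for s in STATES:
        fwd[0][s] = START[s] * EMIT[s][columns[0]]
    for i in range(1, n):
        for s in STATES:
            fwd[i][s] = EMIT[s][columns[i]] * sum(
                fwd[i - 1][t] * TRANS[t][s] for t in STATES)
    # Backward pass: bwd[i][s] = P(columns[i+1..] | state_i = s)
    bwd = [{} for _ in range(n)]
    for s in STATES:
        bwd[n - 1][s] = 1.0
    for i in range(n - 2, -1, -1):
        for s in STATES:
            bwd[i][s] = sum(TRANS[s][t] * EMIT[t][columns[i + 1]]
                            * bwd[i + 1][t] for t in STATES)
    z = sum(fwd[n - 1][s] for s in STATES)   # total likelihood
    return [fwd[i]["H"] * bwd[i]["H"] / z for i in range(n)]

# A run of matches scores as likely homologous; trailing mismatches
# pull the posterior down, which is the cue for trimming the extension.
cols = [True, True, True, False, False, False, False, False]
post = posterior_homology(cols)
```

Thresholding this posterior (rather than the raw alignment score) is what lets the method trim away gapped extensions into unrelated sequence.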
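The adjacency-graph DCJ computation described in the second related abstract admits a compact sketch for the circular case, where the DCJ distance between two genomes on the same n genes reduces to n minus the number of cycles in the adjacency graph. The extremity encoding (gene head/tail pairs) follows the Bergeron et al. formulation the chapter discusses; the code itself is an illustrative reconstruction, not taken from the chapter.

```python
# Illustrative DCJ distance for single circular chromosomes:
# d_DCJ = n - c, where c counts cycles in the adjacency graph.
# Gene g is represented by extremities (g, 't') and (g, 'h').

def adjacencies(circular_order):
    """Adjacency set of one circular chromosome of signed genes."""
    ext = []
    for g in circular_order:
        if g > 0:
            ext += [(g, "t"), (g, "h")]    # forward gene: tail then head
        else:
            ext += [(-g, "h"), (-g, "t")]  # reversed gene: head then tail
    adjs = set()
    n = len(ext)
    for i in range(1, n, 2):               # pair each exit with the next entry
        adjs.add(frozenset((ext[i], ext[(i + 1) % n])))
    return adjs

def dcj_distance(a, b):
    """n minus the number of cycles in the adjacency graph of a and b."""
    A, B = adjacencies(a), adjacencies(b)
    # Each extremity has exactly one neighbour via A and one via B;
    # alternating between them traces the adjacency-graph cycles.
    nb_a = {x: (set(adj) - {x}).pop() for adj in A for x in adj}
    nb_b = {x: (set(adj) - {x}).pop() for adj in B for x in adj}
    seen, cycles = set(), 0
    for start in nb_a:
        if start in seen:
            continue
        cycles += 1
        x, use_a = start, True
        while x not in seen:
            seen.add(x)
            x = nb_a[x] if use_a else nb_b[x]
            use_a = not use_a
    return len(a) - cycles

assert dcj_distance([1, 2, 3], [1, 2, 3]) == 0   # identical genomes
assert dcj_distance([1, 2, 3], [1, -2, 3]) == 1  # one inversion
```

Linear chromosomes add odd/even path bookkeeping to the cycle count, which is the general formula the chapter walks through.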
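The "traditional measure of the breakpoint reuse rate" in the third related abstract is commonly taken to be r = 2d/b (Pevzner and Tesler), where d is the inferred rearrangement distance and b the number of breakpoints; r ranges from 1 (each rearrangement creates two fresh breakpoints) up to 2 (maximal reuse). A minimal sketch, with illustrative numbers rather than the paper's measurements:

```python
# Illustrative breakpoint reuse rate r = 2d / b for a signed
# permutation compared against the identity genome.

def breakpoints(perm):
    """Count breakpoints of a signed permutation against the identity,
    after adding framing elements 0 and n+1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    # A pair of neighbours is conserved iff it is (i, i+1); anything
    # else is a breakpoint.
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)

def reuse_rate(distance, perm):
    """Pevzner-Tesler reuse rate: 2 * distance / breakpoints."""
    return 2 * distance / breakpoints(perm)

# Inverting the block [2, 3] in the identity [1, 2, 3, 4] yields
# [1, -3, -2, 4]: one inversion, two breakpoints, so r = 1.0 (no reuse).
assert breakpoints([1, -3, -2, 4]) == 2
assert reuse_rate(1, [1, -3, -2, 4]) == 1.0
```

The abstract's point is that this measured rate is not intrinsic: it varies systematically with how finely the synteny blocks are resolved.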
