Sequence Embedding for Fast Construction of Guide Trees for Multiple Sequence Alignment

UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland.
Algorithms for Molecular Biology (Impact Factor: 1.46). 05/2010; 5(1):21. DOI: 10.1186/1748-7188-5-21
Source: PubMed


The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N^2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from
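
The embedding idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code: the k-mer distance below is a simple stand-in, the helper names (`kmer_distance`, `embed`) are hypothetical, and the default seed count of roughly (log2 N)^2 is an assumption made for illustration. Each sequence is represented by its distances to a small set of seed sequences, so only N x t distances are computed rather than all N^2 pairs, and sequences can then be compared through their embedded vectors.

```python
import math
import random
from collections import Counter


def kmer_counts(seq, k=2):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=2):
    """Stand-in distance: 1 - (shared k-mers / k-mers in the shorter sequence)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection
    denom = min(len(a), len(b)) - k + 1
    return 1.0 - shared / denom if denom > 0 else 1.0


def embed(seqs, n_seeds=None, k=2, rng=None):
    """Embed each sequence as its vector of distances to t seed sequences.

    Only len(seqs) * t distances are computed instead of all pairs.
    """
    n = len(seqs)
    if n == 0:
        return [], []
    rng = rng or random.Random(0)
    if n_seeds is None:
        # Assumption for this sketch: about (log2 N)^2 seeds, capped at N.
        n_seeds = min(n, max(1, round(math.log2(n) ** 2)))
    seeds = rng.sample(range(n), n_seeds)
    vectors = [[kmer_distance(s, seqs[j], k) for j in seeds] for s in seqs]
    return vectors, seeds


def embedded_distance(u, v):
    """Euclidean distance between two embedded vectors (cheap to compute)."""
    return math.dist(u, v)
```

The embedded vectors could then be handed to any standard clustering routine (for example k-means or UPGMA) to build a guide tree; that step is omitted here.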


Available from: Gordon Blackshields
    • "However, the construction of initial guide-trees for producing alignments generally follows an alignment-free approach. For example MUSCLE [Edgar, 2004a,b] uses k-mer distances with UPGMA or Neighbor-Joining to produce guide trees, whereas Clustal Omega [Sievers et al., 2011] uses a low dimensional geometric embedding based on k-mers [Blackshields et al., 2010] and k-means or UPGMA as a clustering algorithm. Thus even though tree inference is typically performed with model-based statistical methods, the initial step is built on heuristic ideas, with no evolutionary model in use. "
    ABSTRACT: Frequencies of $k$-mers in sequences are sometimes used as a basis for inferring phylogenetic trees without first obtaining a multiple sequence alignment. We show that a standard approach of using the squared-Euclidean distance between $k$-mer vectors to approximate a tree metric can be statistically inconsistent. To remedy this, we derive model-based distance corrections for orthologous sequences without gaps, which lead to consistent tree inference. The identifiability of model parameters from $k$-mer frequencies is also studied. Finally, we report simulations showing the corrected distance out-performs many other $k$-mer methods, even when sequences are generated with an insertion and deletion process. These results have implications for multiple sequence alignment as well, since $k$-mer methods are usually the first step in constructing a guide tree for such algorithms.
    Full-text · Article · Nov 2015
    • "Other advantages of alignment-free genome comparison are that they can work with unassembled reads (Song et al., 2012) and are not affected by genome rearrangements. Alignment-free methods are also used to construct guide trees for progressive multiple alignment (Katoh et al., 2002; Edgar, 2004; Blackshields et al., 2010). This could crucially improve the runtime of multiple alignment algorithms, since calculating guide trees becomes the most time consuming step in progressive alignment if the number of sequences grows. "
    ABSTRACT: MOTIVATION: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well known problem with these methods is that neighbouring word matches are far from independent. RESULTS: To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Our program is freely available at CONTACT:
    Full-text · Article · Apr 2014 · Bioinformatics
    • "Besides parallelization, there is still much space for improvement in the field of multiple sequence alignment in performance. E.g., CLUSTAL OMEGA implemented a modified version of mBed [27], which produced fast and accurate guide trees, and managed to reduce computational time and memory requirements to finish the alignment of large datasets. Apart from performance, there is also much room for accuracy improvements, as some results presented in this study were still far from the BAliBASE reference alignments. "
    ABSTRACT: Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program's algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution. Our results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, being CLUSTALW the least RAM memory demanding program. Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments. 
    T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are specially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by a higher number of programs in a near future, which will certainly improve performance significantly.
    Full-text · Article · Mar 2014 · Algorithms for Molecular Biology