Sequence Embedding for Fast Construction of Guide Trees for Multiple Sequence Alignment

UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland.
Algorithms for Molecular Biology (Impact Factor: 1.46). 05/2010; 5(1):21. DOI: 10.1186/1748-7188-5-21
Source: PubMed


The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from
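The embedding idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k-mer distance, the random choice of reference sequences, and the names `kmer_distance`, `embed`, and `distance_counts` are all assumptions made for illustration. Each sequence is mapped to a vector of its distances to a small set of reference sequences, so only N × (number of references) distances are computed instead of N(N-1)/2.

```python
import random
from collections import Counter


def kmer_counts(seq, k=2):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=2):
    """Hypothetical alignment-free distance (an assumption, not the paper's metric):
    1 - (shared k-mers) / (k-mers in the shorter sequence)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection of k-mer counts
    denom = min(len(a), len(b)) - k + 1
    return 1.0 - shared / denom if denom > 0 else 1.0


def embed(seqs, n_refs, k=2, seed=0):
    """Map each sequence to its vector of distances to n_refs reference sequences.
    References are picked at random here purely for illustration."""
    rng = random.Random(seed)
    refs = rng.sample(seqs, n_refs)
    return [[kmer_distance(s, r, k) for r in refs] for s in seqs]


def distance_counts(n, n_refs):
    """Distance evaluations needed: (full pairwise matrix, embedding)."""
    return n * (n - 1) // 2, n * n_refs


# Toy input; real use would involve thousands of protein or DNA sequences.
seqs = ["ACGTACGTAC", "ACGTACGTTC", "TTGGCCAATT",
        "TTGGCCAAGT", "ACGAACGTAC", "TTGGCGAATT"]
vectors = embed(seqs, n_refs=3)
for s, v in zip(seqs, vectors):
    print(s, [round(x, 2) for x in v])

# Cost comparison for N = 10,000 with an assumed 177 reference sequences.
print(distance_counts(10_000, 177))
```

The resulting vectors can be passed to any standard clustering routine that works on points in a vector space. The saving only appears for large N: for the six toy sequences the embedding actually needs more distance evaluations than the full matrix.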



Available from: Gordon Blackshields
    • "Other advantages of alignment-free genome comparison are that they can work with unassembled reads (Song et al., 2012) and are not affected by genome rearrangements. Alignment-free methods are also used to construct guide trees for progressive multiple alignment (Katoh et al., 2002; Edgar, 2004; Blackshields et al., 2010). This could crucially improve the runtime of multiple alignment algorithms, since calculating guide trees becomes the most time-consuming step in progressive alignment if the number of sequences grows."
    ABSTRACT: MOTIVATION: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well known problem with these methods is that neighbouring word matches are far from independent. RESULTS: To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Our program is freely available at CONTACT:
    Bioinformatics 04/2014; 30(14). DOI:10.1093/bioinformatics/btu177 · 4.98 Impact Factor
    • "Besides parallelization, there is still much space for improvement in the field of multiple sequence alignment in performance. E.g., CLUSTAL OMEGA implemented a modified version of mBed [27], which produced fast and accurate guide trees, and managed to reduce computational time and memory requirements to finish the alignment of large datasets. Apart from performance, there is also much room for accuracy improvements, as some results presented in this study were still far from the BAliBASE reference alignments."
    ABSTRACT: Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program's algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution. Our results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, with CLUSTALW being the least memory-demanding. Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are recommended for better accuracy of multiple sequence alignments.
    T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are especially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by a higher number of programs in the near future, which will certainly improve performance significantly.
    Algorithms for Molecular Biology 03/2014; 9(1):4. DOI:10.1186/1748-7188-9-4 · 1.46 Impact Factor
    • "The probabilities of coiled-coil formation for wild-type and mutant Ndc80 were predicted using Paircoil2 (McDonnell et al. 2006). To perform sequence alignments, NDC80 from S. cerevisiae and its orthologs in Saccharomyces bayanus, S. kudriavzevii, S. mikatae (Scannell et al. 2011), Lachancea (Kluyveromyces) thermotolerans, Kluyveromyces lactis, and Debaryomyces hansenii ( were translated using Transeq (Rice et al. 2000) and then aligned using Clustal-O (Blackshields et al. 2010). The similarity score was plotted for each position using Plotcon (Rice et al. 2000) with a window size of 21 bp. "
    ABSTRACT: During mitosis, kinetochores physically link chromosomes to the dynamic ends of spindle microtubules. This linkage depends on the Ndc80 complex, a conserved and essential microtubule-binding component of the kinetochore. As a member of the complex, the Ndc80 protein forms microtubule attachments through a calponin homology domain. Ndc80 is also required for recruiting other components to the kinetochore and responding to mitotic regulatory signals. While the calponin homology domain has been the focus of biochemical and structural characterization, the function of the remainder of Ndc80 is poorly understood. Here, we utilized a new approach that couples high-throughput sequencing to a saturating linker-scanning mutagenesis screen in Saccharomyces cerevisiae. We identified domains in previously uncharacterized regions of Ndc80 that are essential for its function in vivo. We show that a helical hairpin adjacent to the calponin homology domain influences microtubule binding by the complex. Furthermore, a mutation in this hairpin abolishes the ability of the Dam1 complex to strengthen microtubule attachments made by the Ndc80 complex. Finally, we defined a C-terminal segment of Ndc80 required for tetramerization of the Ndc80 complex in vivo. This unbiased mutagenesis approach can be generally applied to genes in S. cerevisiae to identify functional properties and domains.
    Genetics 07/2013; 195(1). DOI:10.1534/genetics.113.152728 · 5.96 Impact Factor