Sequence embedding for fast construction of guide trees for multiple sequence alignment.
ABSTRACT The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
- SourceAvailable from: PubMed Central[show abstract] [hide abstract]
ABSTRACT: The distance matrix computed from multiple alignments of homologous sequences is widely used by distance-based phylogenetic methods to provide information on the evolution of protein families. This matrix can also be visualized in a low dimensional space by metric multidimensional scaling (MDS). Applied to protein families, MDS provides information complementary to the information derived from tree-based methods. Moreover, MDS gives a unique opportunity to compare orthologous sequence sets because it can add supplementary elements to a reference space. The R package bios2mds (from BIOlogical Sequences to MultiDimensional Scaling) has been designed to analyze multiple sequence alignments by MDS. Bios2mds starts with a sequence alignment, builds a matrix of distances between the aligned sequences, and represents this matrix by MDS to visualize a sequence space. This package also offers the possibility of performing K-means clustering in the MDS derived sequence space. Most importantly, bios2mds includes a function that projects supplementary elements (a.k.a. "out of sample" elements) onto the space defined by reference or "active" elements. Orthologous sequence sets can thus be compared in a straightforward way. The data analysis and visualization tools have been specifically designed for an easy monitoring of the evolutionary drift of protein sub-families. The bios2mds package provides the tools for a complete integrated pipeline aimed at the MDS analysis of multiple sets of orthologous sequences in the R statistical environment. In addition, as the analysis can be carried out from user provided matrices, the projection function can be widely used on any kind of data.BMC Bioinformatics 06/2012; 13:133. · 3.02 Impact Factor
- [show abstract] [hide abstract]
ABSTRACT: Class A G-protein-coupled receptors (GPCRs) constitute the largest family of transmembrane receptors in the human genome. Understanding the mechanisms which drove the evolution of such a large family would help understand the specificity of each GPCR sub-family with applications to drug design. To gain evolutionary information on class A GPCRs, we explored their sequence space by metric multidimensional scaling analysis (MDS). Three-dimensional mapping of human sequences shows a non-uniform distribution of GPCRs, organized in clusters that lay along four privileged directions. To interpret these directions, we projected supplementary sequences from different species onto the human space used as a reference. With this technique, we can easily monitor the evolutionary drift of several GPCR sub-families from cnidarians to humans. Results support a model of radiative evolution of class A GPCRs from a central node formed by peptide receptors. The privileged directions obtained from the MDS analysis are interpretable in terms of three main evolutionary pathways related to specific sequence determinants. The first pathway was initiated by a deletion in transmembrane helix 2 (TM2) and led to three sub-families by divergent evolution. The second pathway corresponds to the differentiation of the amine receptors. The third pathway corresponds to parallel evolution of several sub-families in relation with a covarion process involving proline residues in TM2 and TM5. As exemplified with GPCRs, the MDS projection technique is an important tool to compare orthologous sequence sets and to help decipher the mutational events that drove the evolution of protein families.PLoS ONE 01/2011; 6(4):e19094. · 3.73 Impact Factor
- [show abstract] [hide abstract]
ABSTRACT: MOTIVATION: Recent developments in sequence alignment software have made possible multiple sequence alignments of over 100,000 sequences in reasonable times. At present there are no systematic analyses concerning the scalability of the alignment quality as the number of aligned sequences is increased. RESULTS: We benchmarked a wide range of widely used MSA packages using a selection of protein families with some known structures and find that the accuracy of such alignments decreases markedly as the number of sequences grows. This is more or less true of all packages and protein families. The phenomenon is mostly due to the accumulation of alignment errors, rather than problems in guide-tree construction. This is partly alleviated by using iterative refinement or selectively adding sequences. The average accuracy of progressive methods by comparison with structure-based benchmarks can be improved by incorporating information derived from high-quality structural alignments of sequences with solved structures. This suggests that the availability of high quality curated alignments will have to complement algorithmic and/or software developments in the long term. AVAILABILITY: Benchmark data used in this study is available at http://www.clustal.org/omega/homfam-20110613-25.tar.gz and http://www.clustal.org/omega/bali3fam-26.tar.gz. CONTACT: firstname.lastname@example.org.Bioinformatics 02/2013; · 5.47 Impact Factor