Article

Sequence embedding for fast construction of guide trees for multiple sequence alignment

Algorithms for Molecular Biology 01/2010; DOI:http://www.doaj.org/doaj?func=openurl&genre=article&issn=17487188&date=2010&volume=5&issue=1&spage=21
Source: DOAJ

ABSTRACT

Background

The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences, which takes memory and time proportional to N² for N sequences. Once N exceeds roughly 10,000, this cost becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
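
To make the quadratic growth concrete, the following sketch counts the pairwise distances and the storage a dense matrix would need. The function name `full_matrix_cost` and the choice of 8 bytes per entry (one double-precision value) are assumptions made for illustration; they do not come from the paper.

```python
def full_matrix_cost(n, bytes_per_entry=8):
    """Return (unordered pairs, bytes for a dense n x n matrix).

    bytes_per_entry=8 assumes one double-precision float per distance;
    this is an illustrative assumption, not a figure from the paper.
    """
    pairs = n * (n - 1) // 2             # distances that must be computed
    dense_bytes = n * n * bytes_per_entry  # storage if the full matrix is kept
    return pairs, dense_bytes


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        pairs, dense_bytes = full_matrix_cost(n)
        print(f"N={n:,}: {pairs:,} pairs, {dense_bytes / 1e9:.2f} GB dense")
```

Under these assumptions, a tenfold increase in N raises both the number of distance calculations and the storage by roughly a factor of one hundred.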

Results

In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
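
The general idea can be sketched as a reference-point embedding: each sequence is represented by its distances to a small set of reference sequences, so only N × t expensive distances are computed for t references instead of all N(N−1)/2 pairs. This is a minimal illustration and not the authors' implementation. The names `kmer_distance`, `embed` and `euclidean`, the random choice of references, and the toy k-mer distance standing in for an expensive sequence distance are all assumptions.

```python
import math
import random


def kmer_distance(a, b, k=2):
    """Toy stand-in for an expensive distance: 1 - Jaccard similarity of k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(sequences, n_refs, distance=kmer_distance, seed=0):
    """Map each sequence to its vector of distances to n_refs reference sequences.

    References are picked at random here (an assumption for illustration).
    Only len(sequences) * n_refs distance calls are made.
    """
    rng = random.Random(seed)
    refs = rng.sample(sequences, n_refs)
    return [[distance(s, r) for r in refs] for s in sequences]


def euclidean(u, v):
    """Cheap distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTTTTT"]
vectors = embed(seqs, n_refs=2)

if __name__ == "__main__":
    for s, v in zip(seqs, vectors):
        print(s, [round(x, 3) for x in v])
    print("embedded distance 0-1:", round(euclidean(vectors[0], vectors[1]), 3))
```

Clustering can then operate on the short vectors, where distances are cheap to compute, rather than on the sequences themselves.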

Conclusions

We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.

Keywords

clustering large numbers
clusterings
computation time
computed
embedding
embedding methods
full distance matrix
guide trees
individual distance calculations
large multiple alignments
memory requirements
multiple alignment
multiple sequence alignment methods
N² for N sequences
pair-wise distances
requires memory
sequences
Source code
time proportional