Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques

Department of Computer Architecture and Computer Technology, Department of Applied Mathematics, University of Granada (UGR), 18071 Granada, Medical Genome Project, Andalusian Human Genome Sequencing Centre (CASEGH), 41092 Seville and Chromatin and Disease Group, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet, Barcelona 08907, Spain.
Nucleic Acids Research (Impact Factor: 9.11). 10/2012; 41(1). DOI: 10.1093/nar/gks919
Source: PubMed


Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics, supporting other important
tasks such as structure prediction, biological function analysis or next-generation sequencing. However, current MSA algorithms
do not always provide consistent solutions, since alignment becomes increasingly difficult when dealing with low-similarity
sequences. It is well known that these algorithms depend directly on specific features of the sequences, which strongly influence
alignment accuracy. Many MSA tools have been designed recently, but it is not possible to know in advance which one
is the most suitable for a particular set of sequences. In this work, we analyze some of the most widely used algorithms in
the literature and their dependence on several features. A novel intelligent algorithm based on least squares support
vector machines is then developed to predict how accurate each alignment could be, depending on the analyzed features. This
algorithm is applied to a dataset of 2180 MSAs. The proposed system first estimates the accuracy of the possible alignments.
The most promising methodologies are then selected to align each set of sequences. Since only one selected algorithm
is run, the computational time is not excessively increased.
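
The predict-then-select idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the features, the kernel, or the aligners involved, so the two-dimensional feature vectors, the RBF kernel, the hyperparameters `gamma` and `sigma`, the aligner names `aligner_A` and `aligner_B`, and the synthetic accuracy scores are all assumptions made purely for illustration. The sketch trains one least squares SVM regressor per hypothetical aligner and then picks the aligner with the highest predicted accuracy, so that only that aligner would need to be run.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


class LSSVMRegressor:
    """Least squares SVM regression, fitted by solving one linear system."""

    def __init__(self, gamma=100.0, sigma=0.5):
        self.gamma = gamma  # regularization weight (assumed value)
        self.sigma = sigma  # RBF kernel width (assumed value)

    def fit(self, X, y):
        n = len(y)
        K = rbf_kernel(X, X, self.sigma)
        # Linear system of the LS-SVM dual:
        # [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = K + np.eye(n) / self.gamma
        rhs = np.concatenate(([0.0], y))
        sol = np.linalg.solve(A, rhs)
        self.b, self.alpha, self.X = sol[0], sol[1:], X
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X, self.sigma) @ self.alpha + self.b


def select_aligner(models, features):
    """Predict the accuracy of every aligner and return the best one."""
    x = np.atleast_2d(np.asarray(features, dtype=float))
    scores = {name: float(m.predict(x)[0]) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


def train_demo_models(seed=0, n=60):
    """Fit one regressor per hypothetical aligner on synthetic data."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))  # made-up sequence-set features
    # Made-up accuracy scores: A does well when feature 0 is high, B when low.
    targets = {"aligner_A": X[:, 0], "aligner_B": 1.0 - X[:, 0]}
    return {name: LSSVMRegressor().fit(X, y) for name, y in targets.items()}


models = train_demo_models()
best, scores = select_aligner(models, [0.9, 0.5])
print(best, scores)
```

Only the aligner returned by `select_aligner` would then be executed on the sequence set, which reflects the abstract's point that prediction avoids running every tool.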


Available from: Francisco Manuel Ortuño Guzmán, May 15, 2014
  • ABSTRACT: Uncovering new functional relationships between proteins is currently one of the major goals of biological studies. For this task, integrating evidence from heterogeneous data sources by means of machine learning methodologies has been shown to be an effective way of providing a complete genome-wide functional network and more accurate inferences of new functional associations. This work presents a new framework for Artificial Neural Networks (ANNs) applied to predicting functional relationships between proteins through the integration of evidence from heterogeneous data sources. The development of this methodology is motivated by the problems that arise when applying ANNs to this kind of problem, namely the computational cost of the ANN optimization process due to the nature of the data (a large number of instances and high dimensionality). The method selects smaller, representative (non-random) subsets of the original data set for the ANN optimization process, which reduces the amount of training data and, consequently, the computational cost. Moreover, because the subsets are not only smaller but also representative of the original data set, the method (i) avoids repeating the optimization process several times with different random subsets of data, a practice commonly used to obtain a reliable and fair evaluation of an ANN's prediction accuracy, and (ii) benefits the learning procedure by reducing overfitting, thereby improving prediction ability.
    Article · Dec 2013 · Neurocomputing
  • ABSTRACT: Aligning multiple nucleotide sequences is a prerequisite for many if not most comparative sequence analyses in evolutionary biology. These alignments are often recognized as representing the homology relations of the aligned nucleotides, but this is a necessary requirement only for phylogenetic analyses. Unfortunately, existing computer programs for sequence alignment are not based explicitly on detecting the homology of nucleotides, and so there is a notable gap in the existing bioinformatics repertoire. If homology is the goal, then current alignment procedures may be more art than science. To resolve this issue, I present a simple conceptual scheme relating the traditional criteria for homology to the features of nucleotide sequences. These relations can then be used as optimization criteria for nucleotide sequence alignments. I point out the way in which current computer programs for multiple sequence alignment relate to these criteria, noting that each of them usually implements only one criterion. This explains the apparent dissatisfaction with computerized sequence alignment in phylogenetics, as any program that truly tried to produce alignments based on homology would need to simultaneously optimize all of the criteria.
    Article · Feb 2015 · Systematic Botany
  • ABSTRACT: Multiple sequence alignments (MSAs) are currently one of the most powerful procedures in bioinformatics, providing additional information useful to other techniques such as biological function analysis, structure prediction or next-generation sequencing. Nevertheless, current MSA methodologies produce quite different alignments for the same set of sequences, depending on particular biological features of those sequences. For this reason, selecting a suitable tool for aligning a specific set of sequences is an important task which has not yet been completely solved. In this work, we propose a hierarchical algorithm of several binary classifiers based on support vector machines (SVMs) to predict "a priori" the MSA tool which will provide the most accurate alignment. Firstly, a set of heterogeneous biological features related to each set of sequences is retrieved from well-known databases. Subsequently, the most significant features for each specific aligner are included in the classifier for that aligner. Finally, the SVM classifiers are combined to decide the most suitable method according to the quality of each classification. This procedure was assessed with the BAliBASE v3.0 benchmark and compared against other similar tools, namely AlexSys and PAcAlCI.
    Article · May 2015 · Current Bioinformatics