Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques

Department of Computer Architecture and Computer Technology, Department of Applied Mathematics, University of Granada (UGR), 18071 Granada, Medical Genome Project, Andalusian Human Genome Sequencing Centre (CASEGH), 41092 Seville and Chromatin and Disease Group, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet, Barcelona 08907, Spain.
Nucleic Acids Research (Impact Factor: 9.11). 10/2012; 41(1). DOI: 10.1093/nar/gks919
Source: PubMed


Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics, underpinning other important tasks such as structure prediction, biological function analysis or next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, because alignment becomes increasingly difficult for low-similarity sequences. It is widely known that these algorithms depend directly on specific features of the sequences, which significantly influence alignment accuracy. Many MSA tools have been designed recently, but it is not possible to know in advance which one is most suitable for a particular set of sequences. In this work, we analyze some of the most widely used algorithms in the literature and their dependence on several features. A novel intelligent algorithm based on least squares support vector machines (LS-SVM) is then developed to predict how accurate each alignment could be, depending on its analyzed features. The algorithm is applied to a dataset of 2180 MSAs. The proposed system first estimates the accuracy of the possible alignments. The most promising methodologies are then selected to align each set of sequences. Since only one selected algorithm is run, the computational time does not increase excessively.
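
The predict-then-select idea in the abstract can be illustrated with a minimal sketch. It assumes one LS-SVM regressor per alignment tool, an RBF kernel, and a plain numeric feature vector per sequence set; the class name `LSSVMRegressor`, the helper `select_aligner`, the hyperparameters `gamma` and `sigma`, and the tool labels are all hypothetical and are not taken from the paper.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


class LSSVMRegressor:
    """Least squares SVM regression (illustrative, not the authors' code).

    Training solves the linear system
        [ 0   1^T          ] [ b     ]   [ 0 ]
        [ 1   K + I/gamma  ] [ alpha ] = [ y ]
    and prediction is f(x) = sum_i alpha_i * k(x, x_i) + b.
    """

    def __init__(self, gamma=100.0, sigma=0.5):
        self.gamma = gamma  # regularization weight (assumed value)
        self.sigma = sigma  # RBF kernel width (assumed value)

    def fit(self, X, y):
        n = X.shape[0]
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = rbf_kernel(X, X, self.sigma) + np.eye(n) / self.gamma
        rhs = np.concatenate(([0.0], y))
        sol = np.linalg.solve(A, rhs)
        self.b_ = sol[0]
        self.alpha_ = sol[1:]
        self.X_ = X
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X_, self.sigma) @ self.alpha_ + self.b_


def select_aligner(models, features):
    """Return the tool with the highest predicted accuracy, plus all predictions.

    `models` maps a (hypothetical) tool name to a fitted regressor;
    `features` is a 1-D feature vector describing one set of sequences.
    """
    preds = {
        name: float(model.predict(features[None, :])[0])
        for name, model in models.items()
    }
    best = max(preds, key=preds.get)
    return best, preds


if __name__ == "__main__":
    # Synthetic data only: one feature, two made-up tools with opposite trends.
    X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
    models = {
        "tool_a": LSSVMRegressor().fit(X, X[:, 0]),
        "tool_b": LSSVMRegressor().fit(X, 1.0 - X[:, 0]),
    }
    print(select_aligner(models, np.array([0.9])))
```

Only the selected tool would then be run on the sequences, which is why the prediction step adds little to the total running time.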

Available from: Francisco Manuel Ortuño Guzmán, May 15, 2014
  • ABSTRACT: Uncovering new functional relationships between proteins is currently one of the major goals of biological studies. For this task, integrating evidence from heterogeneous data sources by means of machine learning methodologies has been shown to be an effective way of providing a complete genome-wide functional network and more accurate inferences of new functional associations. This work presents a new framework for Artificial Neural Networks (ANNs) that predicts functional relationships between proteins by integrating evidence from heterogeneous data sources. The development of this methodology is motivated by the problems that arise when applying ANNs to this kind of problem, namely the computational cost of the ANN optimization process due to the nature of the data (a large number of instances and high dimensionality). The method selects smaller, representative (non-random) subsets of the original data set for the ANN optimization process, reducing the amount of training data and, consequently, the computational cost. Moreover, because the subsets are not only smaller but also representative of the original data set, the method (i) avoids repeating the optimization process several times with different random subsets of data, which is commonly done to obtain a reliable and fair evaluation of an ANN's prediction accuracy, and (ii) benefits the learning procedure by reducing overfitting, thereby improving prediction ability.
    Neurocomputing 12/2013; 121. DOI:10.1016/j.neucom.2012.11.040 · 2.08 Impact Factor
  • ABSTRACT: Aligning multiple nucleotide sequences is a prerequisite for many if not most comparative sequence analyses in evolutionary biology. These alignments are often recognized as representing the homology relations of the aligned nucleotides, but this is a necessary requirement only for phylogenetic analyses. Unfortunately, existing computer programs for sequence alignment are not based explicitly on detecting the homology of nucleotides, and so there is a notable gap in the existing bioinformatics repertoire. If homology is the goal, then current alignment procedures may be more art than science. To resolve this issue, I present a simple conceptual scheme relating the traditional criteria for homology to the features of nucleotide sequences. These relations can then be used as optimization criteria for nucleotide sequence alignments. I point out the way in which current computer programs for multiple sequence alignment relate to these criteria, noting that each of them usually implements only one criterion. This explains the apparent dissatisfaction with computerized sequence alignment in phylogenetics, as any program that truly tried to produce alignments based on homology would need to simultaneously optimize all of the criteria.
    Systematic Botany 02/2015; 40(1). DOI:10.1600/036364415X686305 · 1.23 Impact Factor