Algorithms for Molecular Biology (ALGORITHM MOL BIOL)
Description
Algorithms for Molecular Biology is an open access, peer-reviewed online journal that encompasses all aspects of algorithms and software tools for molecular biology and genomics. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.
- Impact factor: 1.35
- Website: Algorithms for Molecular Biology website
- Other titles: AMB
- ISSN: 1748-7188
- OCLC: 65532456
- Material type: Document, Periodical, Internet resource
- Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper
Publications in this journal
Article: An enhancement of binary particle swarm optimization for gene selection in classifying cancer classes.
ABSTRACT: BACKGROUND: Gene expression data could substantially aid the development of efficient cancer diagnosis and classification platforms. Recently, many researchers have analyzed gene expression data using diverse computational intelligence methods to select a small subset of informative genes for cancer classification. Many computational methods have difficulty selecting small subsets because of the small number of samples compared to the huge number of genes (high dimension), irrelevant genes, and noisy genes. METHODS: We propose an enhanced binary particle swarm optimization to select small subsets of informative genes that are significant for cancer classification. Particle speed, a rule, and a modified sigmoid function are introduced in the proposed method to increase the probability that the bits in a particle's position are zero. The method was empirically applied to a suite of ten well-known benchmark gene expression data sets. RESULTS: The proposed method proved superior to previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also requires less computational time than BPSO. Algorithms for Molecular Biology 04/2013; 8(1):15.
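As a rough illustration of the conventional BPSO baseline the abstract compares against, the sketch below updates one particle: velocities are real-valued, and each bit of the position is resampled with probability given by a sigmoid of its velocity. The parameter values (`w`, `c1`, `c2`, `v_max`) and the function names are assumptions for illustration only; the enhanced method's particle speed, rule, and modified sigmoid are not specified in the abstract and are not reproduced here.

```python
import math
import random


def sigmoid(v):
    """Standard logistic function used by conventional BPSO."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(position, velocity, pbest, gbest, rng,
                    w=0.9, c1=2.0, c2=2.0, v_max=4.0):
    """One conventional BPSO step for a single particle.

    position, pbest, gbest are lists of 0/1 bits (1 = gene selected);
    velocity is a list of floats. All parameter defaults are illustrative.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        # Pull the velocity toward the personal and global best bits.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        # Clamp the velocity so the sigmoid does not saturate.
        v = max(-v_max, min(v_max, v))
        # The bit becomes 1 with probability sigmoid(v).
        bit = 1 if rng.random() < sigmoid(v) else 0
        new_velocity.append(v)
        new_position.append(bit)
    return new_position, new_velocity
```

In this baseline a velocity of zero gives each bit an even chance of being selected; shifting the sigmoid so that fewer bits are set is the kind of change the abstract describes, though its exact form is not given there.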
Article: LocARNAscan: Incorporating thermodynamic stability in sequence and structure-based RNA homology search.
ABSTRACT: BACKGROUND: The search for distant homologs has become an important issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task? RESULTS: Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can optionally be added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA's algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query. We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence. CONCLUSIONS: Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high-throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side. AVAILABILITY: Source code of the free software LocARNAscan 1.0 and supplementary data are available at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan. Algorithms for Molecular Biology 04/2013; 8(1):14.
Article: Unrooted unordered homeomorphic subtree alignment of RNA trees.
ABSTRACT: We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem (HSA), and present a new algorithm which applies to several modes, combining global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that for input trees T and S, our algorithm has an O(n_T n_S + min(d_T, d_S) L_T L_S) time complexity, where n_T, L_T and d_T are the number of nodes, the number of leaves, and the maximum node degree in T, respectively (satisfying d_T ≤ L_T ≤ n_T), and similarly for n_S, L_S and d_S with respect to the tree S. This improves the time complexity of previous algorithms for less general variants of the problem. In order to obtain this time bound for HSA, we developed new algorithms for a generalized variant of the Min-Cost Bipartite Matching problem (MCM), as well as for two derivatives of this problem, entitled All-Cavity-MCM and All-Pairs-Cavity-MCM. For two input sets of size n and m, where n ≤ m, MCM and both its cavity derivatives are solved in O(n^3 + nm) time, without the use of priority queues (e.g. Fibonacci heaps) or other complex data structures. This gives the first cubic time algorithm for All-Pairs-Cavity-MCM, and improves the running times of the MCM and All-Pairs-Cavity-MCM problems in the unbalanced case where n ≪ m. We implemented the algorithm (in all modes mentioned above) as a graphical software tool which computes and displays similarities between secondary structures of RNA given as input, and applied it in a preliminary experiment in which we ran all-against-all inter-family pairwise alignments of RNase P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches. Source code and a web interface for our tool can be found at http://www.cs.bgu.ac.il/~negevcb/FRUUT. Algorithms for Molecular Biology 04/2013; 8(1):13.
Article: Reconciliation and local gene tree rearrangement can be of mutual profit.
ABSTRACT: BACKGROUND: Reconciliation methods compare gene trees and species trees to recover evolutionary events such as duplications, transfers and losses explaining the history and composition of genomes. It is well known that gene trees inferred from molecular sequences can be partly erroneous due to incorrect sequence alignments as well as phylogenetic reconstruction artifacts such as long branch attraction. In practice, this leads reconciliation methods to overestimate the number of evolutionary events. Several methods have been proposed to circumvent this problem, by collapsing the unsupported edges and then resolving the obtained multifurcating nodes, or by directly rearranging the binary gene trees. Yet these methods have been defined for models of evolution accounting only for duplications and losses, i.e. they cannot be applied to handle prokaryotic gene families. RESULTS: We propose a reconciliation method accounting for gene duplications, losses and horizontal transfers, that specifically takes into account the uncertainties in gene trees by rearranging their weakly supported edges. Rearrangements are performed on edges having a low confidence value, and are accepted whenever they improve the reconciliation cost. We prove useful properties of the dynamic programming matrix used to compute reconciliations, which allow speeding up the tree space exploration when rearrangements are generated by Nearest Neighbor Interchange (NNI) edit operations. Experiments on synthetic data show that gene trees modified by such NNI rearrangements are closer to the correct simulated trees and lead to better event predictions on average. Experiments on real data demonstrate that the proposed method leads to a decrease in the reconciliation cost and the number of inferred events. Finally, on a dataset of 30k gene families, this reconciliation method shows a ranking of prokaryotic phyla by transfer rates identical to that proposed by a different approach dedicated to transfer detection [BMCBIOINF 11:324, 2010, PNAS 109(13):4962-4967, 2012]. CONCLUSIONS: Prokaryotic gene trees can now be reconciled with their species phylogeny while accounting for the uncertainty of the gene tree. More accurate and more precise reconciliations are obtained with respect to previous parsimony algorithms not accounting for such uncertainties [LNCS 6398:93-108, 2010, BIOINF 28(12): i283-i291, 2012]. Software implementing the method is freely available at http://www.atgc-montpellier.fr/Mowgli/. Algorithms for Molecular Biology 04/2013; 8(1):12.
Article: Incompatible quartets, triplets, and characters.
ABSTRACT: We study a long-standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: There exists a function f(r) such that, for any set C of r-state characters, C is compatible if and only if every subset of f(r) characters of C is compatible. We show that for every r ≥ 2, there exists an incompatible set C of Ω(r^2) r-state characters such that every proper subset of C is compatible. This improves the previous lower bound of f(r) ≥ r given by Meacham (1983), and f(4) ≥ 5 given by Habib and To (2011). For the case when r = 3, Lam, Gusfield and Sridhar (2011) recently showed that f(3) = 3. We give an independent proof of this result and completely characterize the sets of pairwise compatible 3-state characters by a single forbidden intersection pattern. Our lower bound on f(r) is proven via a result on quartet compatibility that may be of independent interest: For every n ≥ 4, there exists an incompatible set Q of Ω(n^2) quartets over n labels such that every proper subset of Q is compatible. We show that such a set of quartets can have size at most 3 when n = 5, and at most O(n^3) for arbitrary n. We contrast our results on quartets with the case of rooted triplets: For every n ≥ 3, if R is an incompatible set of more than n-1 triplets over n labels, then some proper subset of R is incompatible. We show this bound is tight by exhibiting, for every n ≥ 3, a set of n-1 triplets over n taxa such that R is incompatible, but every proper subset of R is compatible. Algorithms for Molecular Biology 04/2013; 8(1):11.
Article: Gene Ontology consistent protein function prediction: the FALCON algorithm applied to six eukaryotic genomes.
ABSTRACT: Gene Ontology (GO) is a hierarchical vocabulary for the description of biological functions and locations, often employed by computational methods for protein function prediction. Due to the structure of GO, function predictions can be self-contradictory. For example, a protein may be predicted to belong to a detailed functional class, but not to a broader class that, due to the vocabulary structure, includes the predicted one. We present a novel discrete optimization algorithm called Functional Annotation with Labeling CONsistency (FALCON) that resolves such contradictions. The GO is modeled as a discrete Bayesian Network. For any given input of GO term membership probabilities, the algorithm returns the most probable GO term assignments that are in accordance with the Gene Ontology structure. The optimization is done using the Differential Evolution algorithm. Performance is evaluated on simulated and also real data from Arabidopsis thaliana, showing improvement compared to related approaches. We finally applied the FALCON algorithm to obtain genome-wide function predictions for six eukaryotic species based on data provided by the CAFA (Critical Assessment of Function Annotation) project. Algorithms for Molecular Biology 03/2013; 8(1):10.
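The kind of contradiction described above can be made concrete with a small check: a predicted term is inconsistent if any of its ancestors in the hierarchy is not also predicted. The sketch below only detects such contradictions; it does not implement FALCON's Bayesian network or Differential Evolution optimization. The term names and the `parents` mapping are hypothetical placeholders, not real GO identifiers.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inconsistencies(predicted, parents):
    """List (term, missing_ancestor) pairs that contradict the hierarchy."""
    violations = []
    for term in sorted(predicted):
        for ancestor in sorted(ancestors(term, parents)):
            if ancestor not in predicted:
                violations.append((term, ancestor))
    return violations


# Hypothetical three-level hierarchy: term_c is a child of term_b,
# which is a child of term_a; term_d is another child of term_a.
parents = {"term_c": ["term_b"], "term_b": ["term_a"], "term_d": ["term_a"]}
violations = inconsistencies({"term_a", "term_c"}, parents)
```

Here `term_c` is predicted without its parent `term_b`, so the check reports that single pair; adding `term_b` to the predicted set makes the prediction consistent.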
Article: Coexpression and coregulation analysis of time-series gene expression data in estrogen-induced breast cancer cell.
ABSTRACT: BACKGROUND: Estrogen is a chemical messenger that has an influence on many breast cancers as it helps cells to grow and divide. These cancers are often known as estrogen responsive cancers, in which estrogen receptor occupies the surface of the cells. The successful treatment of breast cancers requires understanding gene expression, identifying tumor markers, acquiring knowledge of cellular pathways, etc. In this paper we introduce our proposed triclustering algorithm delta-TRIMAX, which aims to find genes that are coexpressed over a subset of samples across a subset of time points. Here we introduce a novel mean-squared residue for such 3D datasets. Our proposed algorithm yields triclusters that have a mean-squared residue score below a threshold delta. RESULTS: We have applied our algorithm to one simulated dataset and one real-life dataset. The real-life dataset is a time-series dataset from an estrogen-induced breast cancer cell line. To establish the biological significance of genes belonging to the resultant triclusters we have performed gene ontology, KEGG pathway and transcription factor binding site (TFBS) enrichment analysis. Additionally, we represent each resultant tricluster by computing its eigengene and verify whether its eigengene is also differentially expressed at early, middle and late estrogen responsive stages. We also identified hub-genes for each resultant tricluster and verified whether the hub-genes are found to be associated with breast cancer. Through our analysis CCL2, CD47, NFIB, BRD4, HPGD, CSNK1E, NPC1L1, PTEN, PTPN2 and ADAM9 are identified as hub-genes which are already known to be associated with breast cancer. The other genes that have also been identified as hub-genes might be associated with breast cancer or estrogen responsive elements. The TFBS enrichment analysis also reveals that transcription factor POU2F1 binds to the promoter region of ESR1, which encodes estrogen receptor alpha. Transcription factor E2F1 binds to the promoter regions of coexpressed genes MCM7, ANAPC1 and WEE1. CONCLUSIONS: Thus our integrative approach provides insights into breast cancer prognosis. Algorithms for Molecular Biology 03/2013; 8(1):9.
Article: Resolving spatial inconsistencies in chromosome conformation measurements.
ABSTRACT: BACKGROUND: Chromosome structure is closely related to its function, and Chromosome Conformation Capture (3C) is a widely used technique for exploring spatial properties of chromosomes. 3C interaction frequencies are usually associated with spatial distances. However, the raw data from 3C experiments is an aggregation of interactions from many cells, and the spatial distances of any given interaction are uncertain. RESULTS: We introduce a new method for filtering 3C interactions that selects subsets of interactions that obey metric constraints of various strictness. We demonstrate that, although the problem is computationally hard, near-optimal results are often attainable in practice using well-designed heuristics and approximation algorithms. Further, we show that, compared with a standard technique, this metric filtering approach leads to (a) subgraphs with higher statistical significance, (b) lower embedding error, (c) lower sensitivity to initial conditions of the embedding algorithm, and (d) structures with better agreement with light microscopy measurements. Our filtering scheme is applicable for a strict frequency-to-distance mapping and a more relaxed mapping from frequency to a range of distances. CONCLUSIONS: Our filtering method for 3C data considers both metric consistency and statistical confidence simultaneously, resulting in lower-error embeddings that are biologically more plausible. Algorithms for Molecular Biology 03/2013; 8(1):8.
Article: Ultrametric networks: a new tool for phylogenetic analysis.
ABSTRACT: BACKGROUND: The large majority of optimization problems related to the inference of distance-based trees used in phylogenetic analysis and classification is known to be intractable. One noted exception is found within the realm of ultrametric distances. The introduction of ultrametric trees in phylogeny was inspired by a model of evolution driven by the postulate of a molecular clock, now dismissed, whereby phylogeny could be represented by a weighted tree in which the sum of the weights of the edges separating any given leaf from the root is the same for all leaves. Both molecular clocks and rooted ultrametric trees fell out of fashion as credible representations of evolutionary change. At the same time, ultrametric dendrograms have shown good potential for purposes of classification insofar as they have proven to provide good approximations for additive trees. Most of these approximations are still intractable, but the problem of finding the nearest ultrametric distance matrix to a given distance matrix with respect to the L∞ distance has long been known to be solvable in polynomial time, the solution being embodied in any minimum spanning tree for the weighted graph subtending to the matrix. RESULTS: This paper expands this subdominant ultrametric perspective by studying ultrametric networks, consisting of the collection of all edges involved in some minimum spanning tree. It is shown that, for a graph with n vertices, the construction of such a network can be carried out by a simple algorithm in optimal time O(n^2), which is faster by a factor of n than the direct adaptation of the classical O(n^3) paradigm by Warshall for computing the transitive closure of a graph. This algorithm, called UltraNet, will be shown to be easily adapted to compute relaxed networks and to support the introduction of artificial points to reduce the maximum distance between vertices in a pair. Finally, a few experiments will be discussed to demonstrate the applicability of subdominant ultrametric networks. AVAILABILITY: http://www.dei.unipd.it/~ciompin/main/Ultranet/Ultranet.html. Algorithms for Molecular Biology 03/2013; 8(1):7.
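The link between minimum spanning trees and ultrametrics mentioned above can be sketched with a standard construction: the subdominant ultrametric assigns to each pair of points the heaviest edge on the path joining them in a minimum spanning tree. The code below is a minimal illustration of that construction using Prim's algorithm on a dense matrix in O(n^2) time; it is not the UltraNet algorithm, and the function name is an assumption.

```python
def subdominant_ultrametric(d):
    """Subdominant ultrametric of a symmetric distance matrix `d`.

    u[i][j] is the largest edge weight on the minimum-spanning-tree path
    between i and j. Prim's algorithm on a dense matrix gives O(n^2) time.
    """
    n = len(d)
    u = [[0.0] * n for _ in range(n)]
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest edge linking each vertex to the tree
    parent = [-1] * n           # tree endpoint of that cheapest edge
    if n:
        best[0] = 0.0
    order = []                  # vertices already in the tree
    for _ in range(n):
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        p = parent[v]
        for x in order:
            # The tree path v..x is the new edge (v, p) followed by p..x.
            u[v][x] = u[x][v] = max(best[v], u[p][x])
        in_tree[v] = True
        order.append(v)
        for j in range(n):
            if not in_tree[j] and d[v][j] < best[j]:
                best[j] = d[v][j]
                parent[j] = v
    return u


example = subdominant_ultrametric([[0, 1, 4], [1, 0, 2], [4, 2, 0]])
```

For the three-point example the spanning tree uses the edges of weight 1 and 2, so the distance of 4 between the first and third points is lowered to 2, the heaviest edge on the tree path between them.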
Article: DCJ-Indel sorting revisited.
ABSTRACT: BACKGROUND: The introduction of the double cut and join operation (DCJ) caused a flurry of research into the study of multichromosomal rearrangements. However, little of this work has incorporated indels (i.e., insertions and deletions of chromosomes and chromosomal intervals) into the calculation of genomic distance functions, with the exception of Braga et al., who provided a linear time algorithm for the problem of DCJ-indel sorting. Although their algorithm only takes linear time, its derivation is lengthy and depends on a large number of possible cases. RESULTS: We note the simple idea that a deletion of a chromosomal interval can be viewed as a DCJ that creates a new circular chromosome. This framework will allow us to amortize indels as DCJs, which in turn permits the application of the classical breakpoint graph to obtain a simplified indel model that still solves the problem of DCJ-indel sorting in linear time via a more concise formulation that relies on the simpler problem of DCJ sorting. Furthermore, we can extend this result to fully characterize the solution space of DCJ-indel sorting. CONCLUSIONS: Encoding indels as DCJ operations offers a new insight into why the problem of DCJ-indel sorting is not ultimately any more difficult than that of sorting by DCJs alone. There is still room for research in this area, most notably the problem of sorting when the cost of indels is allowed to vary with respect to the cost of a DCJ and we demand a minimum cost transformation of one genome into another. Algorithms for Molecular Biology 03/2013; 8(1):6.
Article: Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles?
ABSTRACT: Previous studies show that the same type of bond lengths and angles fit Gaussian distributions well with small standard deviations on high resolution protein structure data. The mean values of these Gaussian distributions have been widely used as ideal bond lengths and angles in bioinformatics. However, we are not aware of any research done to evaluate how accurately we can model protein structures with dihedral angles and ideal bond lengths and angles. Here, we introduce the protein structure idealization problem. We focus on the protein backbone structure idealization. We describe a fast O(nm/ε) dynamic programming algorithm to find an idealized protein backbone structure that is approximately optimal according to our scoring function. The scoring function evaluates not only the free energy, but also the similarity with the target structure. Thus, the idealized protein structures found by our algorithm are guaranteed to be protein-like and close to the target protein structure. We have implemented our protein structure idealization algorithm and idealized the high resolution protein structures with low sequence identities of the CULLPDB_PC30_RES1.6_R0.25 data set. We demonstrate that idealized backbone structures always exist with small changes and significantly better free energy. We also applied our algorithm to refine protein pseudo-structures determined in NMR experiments. Algorithms for Molecular Biology 02/2013; 8(1):5.
Article: Configurable pattern-based evolutionary biclustering of gene expression data.
ABSTRACT: BACKGROUND: Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties. RESULTS: Here, we present the first biclustering algorithm in which it is possible to particularize several bicluster features in terms of different objectives. This can be done by tuning the specified features in the algorithm or also by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation on the use of expression patterns, being able to recognize both shifting and scaling patterns either simultaneously or not. Evolutionary computation has been chosen as the search strategy, hence the name of our proposal, Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns). CONCLUSIONS: We have conducted experiments on both synthetic and real datasets demonstrating Evo-Bexpa's ability to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa's performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performance of our algorithm, whose results have been biologically validated through Gene Ontology. Algorithms for Molecular Biology 02/2013; 8(1):4.
Article: A mixed integer linear programming model to reconstruct phylogenies from single nucleotide polymorphism haplotypes under the maximum parsimony criterion.
ABSTRACT: BACKGROUND: Phylogeny estimation from aligned haplotype sequences has attracted increasing attention in recent years due to its importance in the analysis of many fine-scale genetic data. Its application fields range from medical research, to drug discovery, to epidemiology, to population dynamics. The literature on molecular phylogenetics proposes a number of criteria for selecting a phylogeny from among plausible alternatives. Usually, such criteria can be expressed by means of objective functions, and the phylogenies that optimize them are referred to as optimal. One of the most important estimation criteria is parsimony, which states that the optimal phylogeny T* for a set H of n haplotype sequences over a common set of variable loci is the one that satisfies the following requirements: (i) it has the shortest length and (ii) it is such that, for each pair of distinct haplotypes h_i, h_j ∈ H, the sum of the edge weights belonging to the path from h_i to h_j in T* is not smaller than the observed number of changes between h_i and h_j. Finding the most parsimonious phylogeny for H involves solving an optimization problem, called the Most Parsimonious Phylogeny Estimation Problem (MPPEP), which is NP-hard in many of its versions. RESULTS: In this article we investigate a recent version of the MPPEP that arises when input data consist of single nucleotide polymorphism haplotypes extracted from a population of individuals on a common genomic region. Specifically, we explore the prospects for improving on the implicit enumeration strategy used in previous work using a novel problem formulation and a series of strengthening valid inequalities and preliminary symmetry breaking constraints to more precisely bound the solution space and accelerate implicit enumeration of possible optimal phylogenies. We present the basic formulation and then introduce a series of provable valid constraints to reduce the solution space. We then prove that these constraints can often lead to significant reductions in the gap between the optimal solution and its non-integral linear programming bound relative to the prior art, as well as often substantially faster processing of moderately hard problem instances. CONCLUSION: We provide an indication of the conditions under which such an optimal enumeration approach is likely to be feasible, suggesting that these strategies are usable for relatively large numbers of taxa, although with stricter limits on numbers of variable sites. The work thus provides methodology suitable for provably optimal solution of some harder instances that resist all prior approaches. Algorithms for Molecular Biology 01/2013; 8(1):3.
Article: Using graph model to find transcription factor modules: the hitting set problem and an exact algorithm.
ABSTRACT: Systematically perturbing a cellular system and monitoring the effects of the perturbations on gene expression provide a powerful approach to study signal transduction in gene expression systems. A critical step of revealing a signal transduction pathway regulating gene expression is to identify transcription factors transmitting signals in the system. In this paper, we address the task of identifying modules of cooperative transcription factors based on results derived from systems-biology experiments at two levels: First, a graph algorithm is developed to identify a minimum set of cooperative TFs that covers the differentially expressed genes under each systematic perturbation. Second, using a clique-finding approach, modules of TFs that tend to consistently cooperate together under various perturbations are further identified. Our results indicate that this approach is capable of identifying many known TF modules based on the individual experiment; thus we provide a novel graph-based method of identifying context-specific and highly reused TF-modules. Algorithms for Molecular Biology 01/2013; 8(1):2.
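The first level described above, finding a minimum set of TFs that covers all differentially expressed genes, can be illustrated on toy data by exhaustive search over subsets of increasing size. This brute-force sketch is exponential in the number of TFs and is not the exact algorithm of the paper; the TF and gene names and the `targets` mapping are hypothetical.

```python
from itertools import combinations


def minimum_cover(targets, genes):
    """Smallest set of TFs whose targets together include every gene.

    `targets` maps a TF name to the set of genes it regulates.
    Returns None if no combination of TFs covers all genes.
    """
    genes = set(genes)
    tfs = sorted(targets)
    # Try subsets of size 0, 1, 2, ... so the first hit is a minimum cover.
    for size in range(len(tfs) + 1):
        for combo in combinations(tfs, size):
            covered = set().union(*(targets[tf] for tf in combo))
            if genes <= covered:
                return set(combo)
    return None


# Hypothetical TF-to-gene map for one perturbation experiment.
targets = {"TF1": {"g1", "g2"}, "TF2": {"g2", "g3"}, "TF3": {"g3"}}
cover = minimum_cover(targets, {"g1", "g2", "g3"})
```

No single TF covers all three genes in this example, so the search moves on to pairs and returns the first pair that does.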
Article: The Difficulty Of Protein Structure Alignment Under The RMSD.
ABSTRACT: BACKGROUND: Protein structure alignment is often modeled as the largest common point set (LCP) problem based on the Root Mean Square Deviation (RMSD), a measure commonly used to evaluate structural similarity. In the problem, each residue is represented by the coordinate of the Cα atom, and a structure is modeled as a sequence of 3D points. Out of two such sequences, one is to find two equal-sized subsequences of the maximum length, and a bijection between the points of the subsequences which gives an RMSD within a given threshold. The problem is considered to be difficult in terms of time complexity, but the reasons for its difficulty are not well understood. Improving this time complexity is considered important in protein structure prediction and structural comparison, where the task of comparing very numerous structures is commonly encountered. RESULTS: To study why the LCP problem is difficult, we define a natural variant of the problem, called the minimum aligned distance (MAD). In the MAD problem, the length of the subsequences to obtain is specified in the input; and instead of fulfilling a threshold, the RMSD between the points of the two subsequences is to be minimized. Our results show that the difficulty of the two problems does not lie solely in the combinatorial complexity of finding the optimal subsequences, or in the task of superimposing the structures. By placing a limit on the distance between consecutive points, and assuming that the points are specified as integral values, we show that both problems are equally difficult, in the sense that they are reducible to each other. In this case, both problems can be exactly solved in polynomial time, although the time complexity remains high. CONCLUSIONS: We showed insights and techniques which we hope will lead to practical algorithms for the LCP problem for protein structures. The study identified two important factors in the problem's complexity: (1) The lack of a limit in the distance between the consecutive points of a structure; (2) The arbitrariness of the precision allowed in the input values. Both issues are of little practical concern for the purpose of protein structure alignment. When these factors are removed, the LCP problem is as hard as that of minimizing the RMSD (MAD problem), and can be solved exactly in polynomial time. Algorithms for Molecular Biology 01/2013; 8(1):1.
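To make the RMSD criterion concrete, the sketch below evaluates one candidate alignment: given two sequences of 3D points and a list of index pairs (the bijection between the chosen subsequences), it computes the RMSD over the paired points and compares it with a threshold. It deliberately omits the superposition step (finding the rigid motion that minimizes the RMSD) and the search over subsequences, which are the hard parts discussed in the abstract. All names are illustrative assumptions.

```python
import math


def rmsd(points_a, points_b):
    """Root mean square deviation between equally long lists of 3D points."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(points_a, points_b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(points_a))


def alignment_within_threshold(struct_a, struct_b, pairs, threshold):
    """Check one candidate alignment, given as (index_in_a, index_in_b) pairs.

    No superposition is performed: the coordinates are compared as given.
    """
    chosen_a = [struct_a[i] for i, _ in pairs]
    chosen_b = [struct_b[j] for _, j in pairs]
    return rmsd(chosen_a, chosen_b) <= threshold


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (9.0, 9.0, 9.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(a[:2], b)
```

In the example the first pair of points coincides and the second differs by 2 along one axis, so the RMSD over the two pairs is the square root of 2.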
Article: Invariant based quartet puzzling.
ABSTRACT: BACKGROUND: First proposed by Cavender and Felsenstein, and Lake, invariant based algorithms for phylogenetic reconstruction were widely dismissed by practicing biologists because invariants were perceived to have limited accuracy in constructing trees based on DNA sequences of reasonable length. Recent developments by algebraic geometers have led to the construction of lists of invariants which have been demonstrated to be more accurate on small sequences, but were limited in that they could only be used for trees with small numbers of taxa. We have developed and tested an invariant based quartet puzzling algorithm which is accurate and efficient for biologically reasonable data sets. RESULTS: We found that our algorithm outperforms Maximum Likelihood based quartet puzzling on data sets simulated with low to medium evolutionary rates. For faster rates of evolution, invariant based quartet puzzling is reasonable but less effective than maximum likelihood based puzzling. CONCLUSIONS: This is a proof of concept algorithm which is not intended to replace existing reconstruction algorithms. Rather, the conclusion is that when seeking solutions to a new wave of phylogenetic problems (super tree algorithms, gene vs. species tree, mixture models), invariant based methods should be considered. This article demonstrates that invariants are a practical, reasonable and flexible source for reconstruction techniques. Algorithms for Molecular Biology 12/2012; 7(1):35.
Article: Alignment-free phylogeny of whole genomes using underlying subwords.
ABSTRACT: BACKGROUND: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. METHODS: In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of "independent" subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding counting shared subwords multiple times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. RESULTS: The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. Algorithms for Molecular Biology 12/2012; 7(1):34. -
Article: The space of phylogenetic mixtures for equivariant models.
ABSTRACT: BACKGROUND: The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration. RESULTS: Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures. CONCLUSIONS: The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection. Algorithms for Molecular Biology 11/2012; 7(1):33. -
Article: Towards a practical O(n log n) phylogeny algorithm.
ABSTRACT: Recently, we have identified a randomized quartet phylogeny algorithm that has O(n log n) runtime with high probability, which is asymptotically optimal. Our algorithm has high probability of returning the correct phylogeny when quartet errors are independent and occur with known probability, and when the algorithm uses a guide tree on O(log log n) taxa that is correct with high probability. In practice, none of these assumptions is correct: quartet errors are positively correlated and occur with unknown probability, and the guide tree is often error prone. Here, we bring our work out of the purely theoretical setting. We present a variety of extensions which, while only slowing the algorithm down by a constant factor, make its performance nearly comparable to that of neighbour-joining, which requires Θ(n³) runtime in existing implementations. Our results suggest a new direction for quartet-based phylogenetic reconstruction that may yield striking speed improvements at minimal accuracy cost. An early prototype implementation of our software is available at http://www.cs.uwaterloo.ca/~jmtruszk/qtree.tar.gz. Algorithms for Molecular Biology 11/2012; 7(1):32.
Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.
Related Journals
Scientific Reports
ISSN: 2045-2322
Endocrinology
Endocrine Society; HighWire Press
ISSN: 1945-7170, Impact factor: 4.46
IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics: a publication of the IEEE S...
Institute of Electrical and...
ISSN: 1941-0492, Impact factor: 3.01
International Journal of Coal Preparation and Utilization
ISSN: 1939-2699, Impact factor: 0.29
PLoS ONE
Public Library of Science, Public...
ISSN: 1932-6203, Impact factor: 4.09
Breast Cancer
Springer Verlag
ISSN: 1880-4233, Impact factor: 1.36
The Journal of Toxicological Sciences
ISSN: 1880-3989, Impact factor: 1.52
Gene
Elsevier
ISSN: 1879-0038, Impact factor: 2.34