Article

On the Structure of RNA Branching Polytopes

Abstract

The prevalent method for RNA secondary structure prediction for a single sequence is free energy minimization based on the nearest neighbor thermodynamic model (NNTM). One of the least well-developed parts of the model is the energy function assigned to the multibranch loops. Parametric analysis can be performed to elucidate the dependence of the prediction on the branching parameters used in the NNTM. Since the objective function is linear, this boils down to analyzing the normal fans of the branching polytopes. Here we show that, because of the way the multibranch loops are scored under the NNTM, certain branching patterns are possible for all sequences. We do this by characterizing the dominant parts of the parameter space obtained by looking at the relevant section of the normal fan; we conclude that the structures normally found in nature are obtained for a relatively small set of parameters.

... We provide here a mathematical motivation for generating alternative predictions based on a parametric analysis of RNA branching Drellich et al. (2017); Barrera-Cruz et al. (2018); Poznanović et al. (2020, 2021). Using methods from geometric combinatorics Drellich et al. (2017), it is possible to identify all optimal predictions under any parameterization of the entropic branching penalty. ...
... Prior results had characterized the infinite regions theoretically Barrera-Cruz et al. (2018) and both types computationally Poznanović et al. (2020). It was shown that the cones are quite "thin" in the b dimension Poznanović et al. (2020), but also that this does not make a statistically significant difference to the prediction accuracy Poznanović et al. (2021). ...
... As seen in Figure 2, this transformation eliminates the 1/3 skew in the original (a, c) plane. Results from Barrera-Cruz et al. (2018), appropriately reinterpreted, still hold. As before, all sequences have a unique region (0, 0, 0, w_0) with the minimum number of junctions, the minimum (total or excess) branching, and the minimum residual free energy w_0 over all such structures. ...
Preprint
Prior results for tRNA and 5S rRNA demonstrated that secondary structure prediction accuracy can be significantly improved by modifying the parameters in the multibranch loop entropic penalty function. However, for reasons not well understood at the time, the scale of improvement possible across both families was well below the level for each family when considered separately. We resolve this dichotomy here by showing that each family has a characteristic target region geometry, which is distinct from the other and significantly different from their own dinucleotide shuffles. This required a much more efficient approach to computing the necessary information from the branching parameter space, and a new theoretical characterization of the region geometries. The insights gained point strongly to considering multiple possible secondary structures generated by varying the multiloop parameters. We provide proof-of-principle results that this significantly improves prediction accuracy across all 8 additional families in the Archive II benchmarking dataset.
... This paper contributes to the literature on finding bijections between various combinatorial objects and RNA structures. See the following references [1,2,5-10] for other bijections between certain trees and RNA secondary structures. ...
... In 2020, Evans [4] resolved two open problems presented by Rudra [25] by establishing explicit bijections among L** linear trees, NSE** lattice walks, and RNA secondary structures. See the following references [1,2,5-10] for other bijections between certain trees and RNA secondary structures. The motivation for the bijections is that the given combinatorial objects may provide insight into the prediction of RNA secondary structures. ...
Article
The leftmost column entries of RNA arrays I and II count the RNA numbers that are related to RNA secondary structures from molecular biology. RNA secondary structures sometimes have mutations and wobble pairs. Mutations are random changes that occur in a structure, and wobble pairs are known as non-Watson–Crick base pairs. We used topics from RNA combinatorics and Riordan array theory to establish connections among combinatorial objects related to linear trees, lattice walks, and RNA arrays. In this paper, we establish interesting new explicit bijections (one-to-one correspondences) involving certain subclasses of linear trees, lattice walks, and RNA secondary structures. We provide an interesting generalized lattice walk interpretation of RNA array I. In addition, we provide a combinatorial interpretation of RNA array II as RNA secondary structures with n bases and k base-point mutations where ω of the structures contain wobble base pairs. We also establish an explicit bijection between RNA structures with mutations and wobble bases and a certain subclass of lattice walks.
... By focusing on the overall arrangement of edges/helices and vertices/loops, mathematical results have provided insight into the challenge of designing RNA sequences with a particular branching structure [10], configurations which minimize loop energy costs [1,2], and a parametric analysis of the branching entropy approximation [13]. This work has led both to a better understanding of RNA prediction accuracy [3,20,21] and to some new combinatorics [12]. ...
Article
The branching of an RNA molecule is an important structural characteristic yet difficult to predict correctly, especially for longer sequences. Using plane trees as a combinatorial model for RNA folding, we consider the thermodynamic cost, known as the barrier height, of transitioning between branching configurations. Using branching skew as a coarse energy approximation, we characterize various types of paths in the discrete configuration landscape. In particular, we give sufficient conditions for a path to have both minimal length and minimal branching skew. The proofs offer some biological insights, notably the potential importance of both hairpin stability and domain architecture to higher resolution RNA barrier height analyses.
... Branching polytopes were introduced in [23], and are the foundation of this parametric analysis of the NNTM branching parameters [17,24]. For a given RNA sequence, its branching polytope is a 4D geometric object which encloses points, called branching signatures, that correspond to all the different possible secondary structures. ...
Article
Minimum free energy prediction of RNA secondary structures is based on the Nearest Neighbor Thermodynamics Model. While such predictions are typically good, the accuracy can vary widely even for short sequences, and the branching thermodynamics are an important factor in this variance. Recently, the simplest model for multiloop energetics—a linear function of the number of branches and unpaired nucleotides—was found to be the best. Subsequently, a parametric analysis demonstrated that per family accuracy can be improved by changing the weightings in this linear function. However, the extent of improvement was not known due to the ad hoc method used to find the new parameters. Here we develop a branch-and-bound algorithm that finds the set of optimal parameters with the highest average accuracy for a given set of sequences. Our analysis shows that the previous ad hoc parameters are nearly optimal for tRNA and 5S rRNA sequences on both training and testing sets. Moreover, cross-family improvement is possible but more difficult because competing parameter regions favor different families. The results also indicate that restricting the unpaired nucleotide penalty to small values is warranted. This reduction makes analyzing longer sequences using the present techniques more feasible.
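The parametric setting described in this abstract can be illustrated with a minimal Python sketch. It assumes, as a labeling choice made here and not taken from the abstract, that each candidate structure is summarized by a signature (x, y, z, w): the number of multiloops, the unpaired nucleotides in them, the branches, and the residual energy. The total energy is then the linear function a·x + b·y + c·z + w. The signature values below are hypothetical and do not come from any real sequence.

```python
from typing import List, NamedTuple, Sequence


class Signature(NamedTuple):
    """Hypothetical summary of one secondary structure (assumed meaning of fields)."""
    x: int    # number of multiloops
    y: int    # unpaired nucleotides in multiloops
    z: int    # branching helices in multiloops
    w: float  # residual free energy not affected by the parameters


def energy(sig: Signature, a: float, b: float, c: float) -> float:
    """Linear multiloop scoring: a per loop, b per unpaired base, c per branch."""
    return a * sig.x + b * sig.y + c * sig.z + sig.w


def optimal_signatures(signatures: Sequence[Signature],
                       a: float, b: float, c: float,
                       tol: float = 1e-9) -> List[Signature]:
    """Return every signature attaining the minimum energy for (a, b, c)."""
    if not signatures:
        return []
    energies = [energy(s, a, b, c) for s in signatures]
    best = min(energies)
    return [s for s, e in zip(signatures, energies) if e - best <= tol]


# Made-up signatures for illustration only.
SIGNATURES = [
    Signature(0, 0, 0, -10.0),
    Signature(1, 4, 3, -20.0),
    Signature(2, 9, 7, -30.0),
]
```

With these made-up values, a small loop penalty such as (a, b, c) = (1, 0, 1) selects the most branched signature, while a large one such as (10, 0, 1) selects the unbranched one. This is the kind of parameter dependence that a search over parameter regions has to account for.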
... Now, though, the regions may be bounded as well as unbounded. The arrangement of unbounded regions in the (a, 0, c, 1) plane has a characteristic pattern, first illustrated in [32] and now fully described [34] for all fixed b. ...
Preprint
Prediction of RNA base pairings yields insight into molecular structure, and therefore function. The most common methods predict an optimal structure under the standard thermodynamic model. One component of this model is the equation which governs the cost of branching, where three or more helical "arms" radiate out from a multiloop (also known as a junction). The multiloop initiation equation has three parameters; changing those values can significantly alter the predicted structure. We give a complete analysis of the prediction accuracy, stability, and robustness for all possible parameter combinations for a diverse set of tRNA sequences, and also for 5S rRNA. We find that the accuracy can often be substantially improved on a per sequence basis. However, simultaneous improvement within families, and most especially between families, remains a challenge.
Article
The structure of an RNA sequence encodes information about its biological function. Dynamic programming algorithms are often used to predict the conformation of an RNA molecule from its sequence alone, and adding experimental data as auxiliary information improves prediction accuracy. This auxiliary data is typically incorporated into the nearest neighbor thermodynamic model by converting the data into pseudoenergies. Here, we look at how much of the space of possible structures auxiliary data allows prediction methods to explore. We find that for a large class of RNA sequences, auxiliary data shifts the predictions significantly. Additionally, we find that predictions are highly sensitive to the parameters which define the auxiliary data pseudoenergies. In fact, the parameter space can typically be partitioned into regions where different structural predictions predominate.
Article
Nearest neighbor parameters for estimating the folding energy changes of RNA secondary structures are used in structure prediction and analysis. Despite their widespread application, a comprehensive analysis of the impact of each parameter on the precision of calculations had not been conducted. To identify the parameters with greatest impact, a sensitivity analysis was performed on the 291 parameters that compose the 2004 version of the free energy nearest neighbor rules. Perturbed parameter sets were generated by perturbing each parameter independently. Then the effect of each individual parameter change on predicted base-pair probabilities and secondary structures as compared to the standard parameter set was observed for a set of sequences including structured ncRNA, mRNA and randomized sequences. The results identify for the first time the parameters with the greatest impact on secondary structure prediction, and the subset which should be prioritized for further study in order to improve the precision of structure prediction. In particular, bulge loop initiation, multibranch loop initiation, AU/GU internal loop closure and AU/GU helix end parameters were particularly important. An analysis of parameter usage during folding free energy calculations of stochastic samples of secondary structures revealed a correlation between parameter usage and impact on structure prediction precision.
Article
Questions in computational molecular biology generate various discrete optimization problems, such as DNA sequence alignment and RNA secondary structure prediction. However, the optimal solutions are fundamentally dependent on the parameters used in the objective functions. The goal of a parametric analysis is to elucidate such dependencies, especially as they pertain to the accuracy and robustness of the optimal solutions. Techniques from geometric combinatorics, including polytopes and their normal fans, have been used previously to give parametric analyses of simple models for DNA sequence alignment and RNA branching configurations. Here, we present a new computational framework, and proof-of-principle results, which give the first complete parametric analysis of the branching polytopes for real RNA sequences.
Article
A simplified (two-base) version of the problem of planar folding of long chains (e.g., RNA and DNA biomolecules) is formulated as a matching problem. The chain is prescribed as a loop or circular sequence of letters A and B, n units long. A matching here means a set of A-B base pairings or matches obeying a planarity condition: no two matches may cross each other if drawn on the interior of the loop. Also, no two adjacent letters may be matched. We present a dynamic programming algorithm requiring O(n^3) steps and O(n^2) storage which computes the size of the maximum matching for the given A-B base sequence and which also allows reconstructing a particular folded form of the original string which realizes the maximum matching size. The algorithm can be adapted to deal with sequences with larger alphabets and with weighted matchings. An algorithm is also presented for a modified problem closer to the biochemical problem of interest: We demand that every match must be adjacent to another match, forcing groups of two or more parallel matches. Some results on the expected maximum matching size are presented. As n → ∞, at least 80% of the vertices can be matched on the average on an A-B string of size n. We briefly discuss the practical application of the algorithm by using contracted versions of very long molecules with a preliminary block construction. A maximum matching is presented for the J-gene of the ϕX174 DNA virus. We conclude by stating some problems requiring further study.
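The recurrence behind such a dynamic program can be sketched in Python as follows. This is a simplified illustration, not the paper's algorithm: it assumes a linear (not circular) string, so the first and last letters are not treated as adjacent, and the function name `max_matching` is invented here. It keeps the two constraints stated in the abstract: matches may not cross, and adjacent letters may not be matched.

```python
from typing import Callable


def ab_pair(p: str, q: str) -> bool:
    """Two letters may match only if one is A and the other is B."""
    return {p, q} == {"A", "B"}


def max_matching(seq: str, can_pair: Callable[[str, str], bool] = ab_pair) -> int:
    """Size of a maximum non-crossing matching with no adjacent letters matched.

    dp[i][j] holds the best matching size on seq[i..j] (inclusive).
    Runs in O(n^3) time and O(n^2) space.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Intervals with j - i < 2 cannot contain a match, so they stay 0.
    for span in range(2, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unmatched
            # case 2: j matches some k with k <= j - 2 (never adjacent)
            for k in range(i, j - 1):
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1]
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]
```

For example, `max_matching("AAABBB")` evaluates to 2: the outer two A-B matches nest, while the innermost A and B are adjacent and therefore cannot be matched. Passing a different `can_pair` function is one way to explore the larger alphabets mentioned in the abstract.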
Article
Background: Accurate and efficient RNA secondary structure prediction remains an important open problem in computational molecular biology. Historically, advances in computing technology have enabled faster and more accurate RNA secondary structure predictions. Previous parallelized prediction programs achieved significant improvements in runtime, but their implementations were not portable from niche high-performance computers or easily accessible to most RNA researchers. With the increasing prevalence of multi-core desktop machines, a new parallel prediction program is needed to take full advantage of today’s computing technology. Findings: We present here the first implementation of RNA secondary structure prediction by thermodynamic optimization for modern multi-core computers. We show that GTfold predicts secondary structure in less time than UNAfold and RNAfold, without sacrificing accuracy, on machines with four or more cores. Conclusions: GTfold supports advances in RNA structural biology by reducing the timescales for secondary structure prediction. The difference will be particularly valuable to researchers working with lengthy RNA sequences, such as RNA viral genomes.
Article
The Nearest Neighbor Database (NNDB, http://rna.urmc.rochester.edu/NNDB) is a web-based resource for disseminating parameter sets for predicting nucleic acid secondary structure stabilities. For each set of parameters, the database includes the set of rules with descriptive text, sequence-dependent parameters in plain text and html, literature references to experiments and usage tutorials. The initial release covers parameters for predicting RNA folding free energy and enthalpy changes.
Article
This paper presents a new computer method for folding an RNA molecule that finds a conformation of minimum free energy using published values of stacking and destabilizing energies. It is based on a dynamic programming algorithm from applied mathematics, and is much more efficient, faster, and can fold larger molecules than procedures which have appeared up to now in the biological literature. Its power is demonstrated in the folding of a 459 nucleotide immunoglobulin γ1 heavy chain messenger RNA fragment. We go beyond the basic method to show how to incorporate additional information into the algorithm. This includes data on chemical reactivity and enzyme susceptibility. We illustrate this with the folding of two large fragments from the 16S ribosomal RNA of Escherichia coli.
Article
An improved dynamic programming algorithm is reported for RNA secondary structure prediction by free energy minimization. Thermodynamic parameters for the stabilities of secondary structure motifs are revised to include expanded sequence dependence as revealed by recent experiments. Additional algorithmic improvements include reduced search time and storage for multibranch loop free energies and improved imposition of folding constraints. An extended database of 151,503 nt in 955 structures determined by comparative sequence analysis was assembled to allow optimization of parameters not based on experiments and to test the accuracy of the algorithm. On average, the predicted lowest free energy structure contains 73% of known base-pairs when domains of fewer than 700 nt are folded; this compares with 64% accuracy for previous versions of the algorithm and parameters. For a given sequence, a set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs. Experimental constraints, derived from enzymatic and flavin mononucleotide cleavage, improve the accuracy of structure predictions.
Article
Comparative analysis of RNA sequences is the basis for the detailed and accurate predictions of RNA structure and the determination of phylogenetic relationships for organisms that span the entire phylogenetic tree. Underlying these accomplishments are very large, well-organized, and processed collections of RNA sequences. This data, starting with the sequences organized into a database management system and aligned to reveal their higher-order structure, and patterns of conservation and variation for organisms that span the phylogenetic tree, has been collected and analyzed. This type of information can be fundamental for and have an influence on the study of phylogenetic relationships, RNA structure, and the melding of these two fields. We have prepared a large web site that disseminates our comparative sequence and structure models and data. The four major types of comparative information and systems available for the three ribosomal RNAs (5S, 16S, and 23S rRNA), transfer RNA (tRNA), and two of the catalytic intron RNAs (group I and group II) are: (1) Current Comparative Structure Models; (2) Nucleotide Frequency and Conservation Information; (3) Sequence and Structure Data; and (4) Data Access Systems. This online RNA sequence and structure information, the result of extensive analysis, interpretation, data collection, and computer program and web development, is accessible at our Comparative RNA Web (CRW) Site http://www.rna.icmb.utexas.edu. In the future, more data and information will be added to these existing categories, new categories will be developed, and additional RNAs will be studied and presented at the CRW Site.
Article
The abbreviated name, ‘mfold web server’, describes a number of closely related software applications available on the World Wide Web (WWW) for the prediction of the secondary structure of single stranded nucleic acids. The objective of this web server is to provide easy access to RNA and DNA folding and hybridization software to the scientific community at large. By making use of universally available web GUIs (Graphical User Interfaces), the server circumvents the problem of portability of this software. Detailed output, in the form of structure plots with or without reliability information, single strand frequency plots and ‘energy dot plots’, are available for the folding of single sequences. A variety of ‘bulk’ servers give less information, but in a shorter time and for up to hundreds of sequences at once. The portal for the mfold web server is http://www.bioinfo.rpi.edu/applications/mfold. This URL will be referred to as ‘MFOLDROOT’.
Article
A detailed understanding of an RNA's correct secondary and tertiary structure is crucial to understanding its function and mechanism in the cell. Free energy minimization with energy parameters based on the nearest-neighbor model and comparative analysis are the primary methods for predicting an RNA's secondary structure from its sequence. Version 3.1 of Mfold has been available since 1999. This version contains an expanded sequence dependence of energy parameters and the ability to incorporate coaxial stacking into free energy calculations. We test Mfold 3.1 by performing the largest and most phylogenetically diverse comparison of rRNA and tRNA structures predicted by comparative analysis and Mfold, and we use the results of our tests on 16S and 23S rRNA sequences to assess the improvement between Mfold 2.3 and Mfold 3.1. The average prediction accuracy for a 16S or 23S rRNA sequence with Mfold 3.1 is 41%, while the prediction accuracies for the majority of 16S and 23S rRNA structures tested are between 20% and 60%, with some having less than 20% prediction accuracy. The average prediction accuracy was 71% for 5S rRNA and 69% for tRNA. The majority of the 5S rRNA and tRNA sequences have prediction accuracies greater than 60%. The prediction accuracy of 16S rRNA base-pairs decreases exponentially as the number of nucleotides intervening between the 5' and 3' halves of the base-pair increases. Our analysis indicates that the current set of nearest-neighbor energy parameters in conjunction with the Mfold folding algorithm are unable to consistently and reliably predict an RNA's correct secondary structure. For 16S or 23S rRNA structure prediction, Mfold 3.1 offers little improvement over Mfold 2.3. However, the nearest-neighbor energy parameters do work well for shorter RNA sequences such as tRNA or 5S rRNA, or for larger rRNAs when the contact distance between the base-pairs is less than 100 nucleotides.
Article
The classic algorithms of Needleman-Wunsch and Smith-Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). To process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces that are suitable for Needleman-Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. As the number of alignment programs applied on a whole genome scale continues to increase, so does the disagreement in their results. The alignments produced by different programs vary greatly, especially in non-coding regions of eukaryotic genomes where the biologically correct alignment is hard to find. Parametric alignment is one possible remedy. This methodology resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. This alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters.
Article
Randomly shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence. In many cases, biologists need sophisticated shuffling tools that preserve not only the counts of distinct letters but also higher-order statistics such as doublet counts, triplet counts, and, in general, k-let counts. We present a sequence analysis tool (named uShuffle) for generating uniform random permutations of biological sequences (such as DNAs, RNAs, and proteins) that preserve the exact k-let counts. The uShuffle tool implements the latest variant of the Euler algorithm and uses Wilson's algorithm in the crucial step of arborescence generation. It is carefully engineered and extremely efficient. The uShuffle tool achieves maximum flexibility by allowing arbitrary alphabet size and let size. It can be used as a command-line program, a web application, or a utility library. Source code in C, Java, and C#, and integration instructions for Perl and Python are provided. The uShuffle tool surpasses existing implementation of the Euler algorithm in both performance and flexibility. It is a useful tool for the bioinformatics community.
Article
Some new sequences are introduced which satisfy quadratic recurrence rules similar to those satisfied by the classical Catalan numbers and the less well-known Motzkin numbers. For each sequence the general term is expressed as a sum of products of Catalan numbers and generalized Fibonacci numbers. In addition, first-order asymptotic formulae are given for the most interesting cases.
Article
The optimal alignment or the weighted minimum edit distance between two DNA or amino acid sequences for a given set of weights is computed by classical dynamic programming techniques, and is widely used in molecular biology. However, in DNA and amino acid sequences there is considerable disagreement about how to weight matches, mismatches, insertions/deletions (indels or spaces), and gaps. Parametric sequence alignment is the problem of computing the optimal-valued alignment between two sequences as a function of variable weights for matches, mismatches, spaces, and gaps. The goal is to partition the parameter space into regions (which are necessarily convex) such that in each region one alignment is optimal throughout and such that the regions are maximal for this property. In this paper we are primarily concerned with the structure of this convex decomposition, and secondarily with the complexity of computing the decomposition. The most striking results are the following: For the special case where only matches, mismatches, and spaces are counted, and where spaces are counted throughout the alignment, we show that the decomposition is surprisingly simple: all regions are infinite; there are at most n^(2/3) regions; the lines that bound the regions are all of the form β = c + (c + 0.5)α; and the entire decomposition can be found in O(knm) time, where k is the actual number of regions, and n < m are the lengths of the two strings. These results were found while implementing a large software package for parametric sequence analysis, and in turn have led to faster algorithms for those tasks. A conference version of this paper first appeared in [10].
Article
RNA structure is hierarchical. Secondary structure contacts, i.e. the canonical base pair contacts, are generally stronger and form faster than the tertiary structure. Therefore, RNA secondary structures can be predicted independently of tertiary structure prediction. Furthermore, the stability of a given RNA secondary structure can be quantified using nearest neighbor free energy parameters. These parameters are the basis of a number of free energy minimization algorithms that predict RNA secondary structure for either a single sequence or multiple sequences. This article reviews the progress of RNA secondary structure prediction by free energy minimization and describes many of the algorithms that have been developed.
Article
Motivated by recent work in parametric sequence alignment, we study the parameter space for scoring RNA folds and construct an RNA polytope. A vertex of this polytope corresponds to RNA secondary structures with common branching. We use this polytope and its normal fan to study the effect of varying three parameters in the free energy model that are not determined experimentally. Our results indicate that variation of these specific parameters does not have a dramatic effect on the structures predicted by the free energy model. We additionally map a collection of known RNA secondary structures to the RNA polytope.
Article
Current algorithms can find optimal alignments of two nucleic acid or protein sequences, often by using dynamic programming. While the choice of algorithm penalty parameters greatly influences the quality of the resulting alignments, this choice has been done in an ad hoc manner. In this work, we present an algorithm to efficiently find the optimal alignments for all choices of the penalty parameters. It is then possible to systematically explore these alignments for those with the most biological or statistical interest. Several examples illustrate the method.
Article
Computing the similarity between two ordered trees has applications in RNA secondary structure comparison, genetics and chemical structure analysis. Alignment of trees is one of the proposed measures. Similar to pair-wise sequence comparison, there is often disagreement about how to weight matches, mismatches, indels and gaps when we compare two trees. For sequence comparison, parametric sequence alignment tools have been developed. They allow users to see explicitly and completely the effect of parameter choices on the optimal sequence alignments. A similar tool for aligning two ordered trees is required in practice. We develop a parametric tool for aligning two ordered trees that allows users to see the effect of parameter choices on the optimal alignment of trees. Our contributions include: (1) develop a parametric tool for aligning two ordered trees; (2) design an efficient algorithm for aligning two ordered trees with gap penalties that runs in O(n^2 deg^2) time, where n is the number of nodes in the trees and deg is the degree of the trees; and (3) reduce the space of the algorithm from O(n^2 deg^2) to O(n log n · deg^2). The software is available at http://www.cs.cityu.edu.hk/~lwang/software/ParaTree
Article
One of the major successes in computational biology has been the unification, by using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied to these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.
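As a rough illustration of what a "geometric version of the sum-product algorithm" can look like, the sketch below implements two polytope operations in the plane: the convex hull of a union, standing in for addition, and the Minkowski sum, standing in for multiplication. This pairing is the usual reading of polytope propagation, but the abstract above does not spell it out, so treat it as an assumption; the function names `hull`, `poly_sum`, and `poly_product` are hypothetical, and the code is restricted to two dimensions for brevity.

```python
def hull(points):
    """Convex hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def poly_sum(P, Q):
    """Stand-in for '+': convex hull of the union of two vertex lists."""
    return hull(P + Q)


def poly_product(P, Q):
    """Stand-in for '*': Minkowski sum of two vertex lists."""
    return hull([(p[0] + q[0], p[1] + q[1]) for p in P for q in Q])


# Two segments whose Minkowski sum is the unit square.
print(poly_product([(0, 0), (1, 0)], [(0, 0), (0, 1)]))
# The interior-of-edge point (1, 0) is dropped by the hull of the union.
print(poly_sum([(0, 0), (2, 0)], [(1, 0), (1, 1)]))
```

Replacing the arithmetic in a dynamic-programming recursion with these two operations yields a polytope whose vertices correspond to the solutions that are optimal for some choice of parameters. Forming all pairwise sums before taking the hull is quadratic in the number of vertices, which is acceptable for a sketch but not for large inputs.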
Article
RNA secondary structure is often predicted from sequence by free energy minimization. Over the past two years, advances have been made in the estimation of folding free energy change, the mapping of secondary structure and the implementation of computer programs for structure prediction. The trends in computer program development are: efficient use of experimental mapping of structures to constrain structure prediction; use of statistical mechanics to improve the fidelity of structure prediction; inclusion of pseudoknots in secondary structure prediction; and use of two or more homologous sequences to find a common structure.
Statistics on RNA branching polytopes: accuracy, stability, robustness, and other characteristics
  • Fidel Barrera-Cruz
  • Christine Heitsch
  • Svetlana Poznanović
Fidel Barrera-Cruz, Christine Heitsch, and Svetlana Poznanović. Statistics on RNA branching polytopes: accuracy, stability, robustness, and other characteristics. In preparation.