[Show abstract][Hide abstract] ABSTRACT: Abstract We describe four novel algorithms, RNAhairpin, RNAmloopNum, RNAmloopOrder, and RNAmloopHP, which compute the Boltzmann partition function for global structural constraints-respectively for the number of hairpins, the number of multiloops, maximum order (or depth) of multiloops, and the simultaneous number of hairpins and multiloops. Given an RNA sequence of length n and a user-specified integer 0 ≤ K ≤ n, RNAhairpin (resp. RNAmloopNum and RNAmloopOrder) computes the partition functions Z(k) for each 0 ≤ k ≤ K in time O(K(2)n(3)) and space O(Kn(2)), while RNAmloopHP computes the partition functions Z(m, h) for 0 ≤ mm ≤ M multiloops and 0 ≤ h ≤ H hairpins, with run time O(M(2)H(2)n(3)) and space O(MHn(2)). In addition, programs such as RNAhairpin (resp. RNAmloopHP) sample from the low-energy ensemble of structures having h hairpins (resp. m multiloops and h hairpins), for given h, m. Moreover, by using the fast Fourier transform (FFT), RNAhairpin and RNAmloopNum have been improved to run in time O(n(4)) and space O(n(2)), although this improvement is not possible for RNAmloopOrder. We present two applications of the novel algorithms. First, we show that for many Rfam families of RNA, structures sampled from RNAmloopHP are more accurate than the minimum free-energy structure; for instance, sensitivity improves by almost 24% for transfer RNA, while for certain ribozyme families, there is an improvement of around 5%. Second, we show that the probabilities p(k)=Z(k)/Z of forming k hairpins (resp. multiloops) provide discriminating novel features for a support vector machine or relevance vector machine binary classifier for Rfam families of RNA. Our data suggests that multiloop order does not provide any significant discriminatory power over that of hairpin and multiloop number, and since these probabilities can be efficiently computed using the FFT, hairpin and multiloop formation probabilities could be added to other features in existent noncoding RNA gene finders. Our programs, written in C/C++, are publicly available online at: http://bioinformatics.bc.edu/clotelab/RNAparametric .
Journal of computational biology: a journal of computational molecular cell biology 02/2014; 21(3). DOI:10.1089/cmb.2013.0148 · 1.74 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss, RNAshapes and RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detect general (i.e., not family specific) entire riboswitches (both aptamer and expression platform) with accuracy. Thus, the development of additional algorithms to detect conformational switches seems important, especially since the difference in free energy between the two metastable secondary structures may be as large as 15-20 kcal/mol. It has recently emerged that RNA secondary structure can be more accurately predicted by computing the maximum expected accuracy (MEA) structure, rather than the minimum free energy (MFE) structure.
Given an arbitrary RNA secondary structure S₀ for an RNA nucleotide sequence a = a₁,..., a(n), we say that another secondary structure S of a is a k-neighbor of S₀, if the base pair distance between S₀ and S is k. In this paper, we prove that the Boltzmann probability of all k-neighbors of the minimum free energy structure S₀ can be approximated with accuracy ε and confidence 1 - p, simultaneously for all 0 ≤ k < K, by a relative frequency count over N sampled structures, provided that N>N(ε,p,K)=Φ⁻¹(p/2K)²/4ε², where Φ(z) is the cumulative distribution function (CDF) for the standard normal distribution. We go on to describe the algorithm RNAborMEA, which for an arbitrary initial structure S₀ and for all values 0 ≤ k < K, computes the secondary structure MEA(k), having maximum expected accuracy over all k-neighbors of S₀. Computation time is O(n³ · K²), and memory requirements are O(n² · K). We analyze a sample TPP riboswitch, and apply our algorithm to the class of purine riboswitches.
The approximation of RNAbor by sampling, with rigorous bound on accuracy, together with the computation of maximum expected accuracy k-neighbors by RNAborMEA, provide additional tools toward conformational switch detection. Results from RNAborMEA are quite distinct from other tools, such as RNAbor, RNAshapes and paRNAss, hence may provide orthogonal information when looking for suboptimal structures or conformational switches. Source code for RNAborMEA can be downloaded from http://sourceforge.net/projects/rnabormea/ or http://bioinformatics.bc.edu/clotelab/RNAborMEA/.
[Show abstract][Hide abstract] ABSTRACT: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding for protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to solve this problem based on codon-shuffling have been limited to exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations.
We present a novel O(N(log N)2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under- represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various size, including ChIP-seq data for the transcription factors NRSF and GABP.
CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.
[Show abstract][Hide abstract] ABSTRACT: An RNA secondary structure is locally optimal if there is no lower energy structure that can be obtained by the addition or removal of a single base pair, where energy is defined according to the widely accepted Turner nearest neighbor model. Locally optimal structures form kinetic traps, since any evolution away from a locally optimal structure must involve energetically unfavorable folding steps. Here, we present a novel, efficient algorithm to compute the partition function over all locally optimal secondary structures of a given RNA sequence. Our software, RNAlocopt runs in O(n3) time and O(n2) space. Additionally, RNAlocopt samples a user-specified number of structures from the Boltzmann subensemble of all locally optimal structures. We apply RNAlocopt to show that (1) the number of locally optimal structures is far fewer than the total number of structures--indeed, the number of locally optimal structures approximately equal to the square root of the number of all structures, (2) the structural diversity of this subensemble may be either similar to or quite different from the structural diversity of the entire Boltzmann ensemble, a situation that depends on the type of input RNA, (3) the (modified) maximum expected accuracy structure, computed by taking into account base pairing frequencies of locally optimal structures, is a more accurate prediction of the native structure than other current thermodynamics-based methods. The software RNAlocopt constitutes a technical breakthrough in our study of the folding landscape for RNA secondary structures. For the first time, locally optimal structures (kinetic traps in the Turner energy model) can be rapidly generated for long RNA sequences, previously impossible with methods that involved exhaustive enumeration. Use of locally optimal structure leads to state-of-the-art secondary structure prediction, as benchmarked against methods involving the computation of minimum free energy and of maximum expected accuracy. Web server and source code available at http://bioinformatics.bc.edu/clotelab/RNAlocopt/.
PLoS ONE 01/2011; 6(1):e16178. DOI:10.1371/journal.pone.0016178 · 3.23 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Given an RNA sequence and two designated secondary structures A, B, we describe a new algorithm that computes a nearly optimal folding pathway from A to B. The algorithm, RNAtabupath, employs a tabu semi-greedy heuristic, known to be an effective search strategy in combinatorial optimization. Folding pathways, sometimes called routes or trajectories, are computed by RNAtabupath in a fraction of the time required by the barriers program of Vienna RNA Package. We benchmark RNAtabupath with other algorithms to compute low energy folding pathways between experimentally known structures of several conformational switches. The RNApathfinder web server, source code for algorithms to compute and analyze pathways and supplementary data are available at http://bioinformatics.bc.edu/clotelab/RNApathfinder.
Nucleic Acids Research 03/2010; 38(5):1711-22. DOI:10.1093/nar/gkp1054 · 9.11 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: We describe several dynamic programming segmentation algorithms to segment RNA secondary and tertiary structures into distinct domains. For this purpose, we consider fitness functions that variously depend on (i) base pairing probabilities in the Boltzmann low energy ensemble of structures, (ii) contact maps inferred from 3-dimensional structures, and (iii) Voronoi tessellation computed from 3-dimensional structures. Segmentation algorithms include a direct dynamic programming method, previously discovered by Bellman and by Finkelstein and Roytberg, as well as two novel algorithms - a parametric algorithm to compute the optimal segmentation into k classes, for each value k, and an algorithm that simultaneously computes the optimal segmentation of all subsegments. Since many non-coding RNA gene finders scan the genome by a moving window method, reporting high-scoring windows, we apply structural segmentation to determine the most likely 5' and 3' boundaries of precursor microRNAs. When tested on all precursor microRNAs of length at most 100 nt from the Rfam database, benchmarking studies indicate that segmentation determines the 5' boundary with discrepancy (absolute value of difference between predicted and real boundaries) having mean -0.640 (stdev 15.196) and the 3' boundary with discrepancy having mean -0.266 (stdev. 17.415). This yields a sensitivity of 0.911 and positive predictive value of 0.906 for determination of exact boundaries of precursor microRNAs within a window of approximately 900 nt. Additionally, by comparing the manual segmentation of Jaeger et al. with our optimal structural segmentation of 16S and 16S-like rRNA of E. coli, rat mitochondria, Halobacterium volcanii, and Chlamydomonas reinhardii chloroplast into 4 segments, we establish the usefulness of (automated) structural segmentation in decomposing large RNA structures into distinct domains.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 01/2010; 15:57-68.
[Show abstract][Hide abstract] ABSTRACT: RNA shapes, introduced by Giegerich et al. (2004), provide a useful classification of the branching complexity for RNA secondary structures. In this paper, we derive an exact value for the asymptotic number of RNA shapes, by relying on an elegant relation between non-ambiguous, context-free grammars, and generating functions. Our results provide a theoretical upper bound on the length of RNA sequences amenable to probabilistic shape analysis (Steffen et al., 2006; Voss et al., 2006), under the assumption that any base can basepair with any other base. Since the relation between context-free grammars and asymptotic enumeration is simple, yet not well-known in bioinformatics, we give a self-contained presentation with illustrative examples. Additionally, we prove a surprising 1-to-1 correspondence between pi-shapes and Motzkin numbers.
[Show abstract][Hide abstract] ABSTRACT: DIAL (dihedral alignment) is a web server that provides public access to a new dynamic programming algorithm for pairwise 3D structural alignment of RNA. DIAL achieves quadratic time by performing an alignment that accounts for (i) pseudo-dihedral and/or dihedral angle similarity, (ii) nucleotide sequence similarity and (iii) nucleotide base-pairing similarity.
DIAL provides access to three alignment algorithms: global (Needleman–Wunsch), local (Smith–Waterman) and semiglobal (modified to yield motif search). Suboptimal alignments are optionally returned, and also Boltzmann pair probabilities Pr(ai,bj) for aligned positions ai , bj from the optimal alignment. If a non-zero suboptimal alignment score ratio is entered, then the semiglobal alignment algorithm may be used to detect structurally similar occurrences of a user-specified 3D motif. The query motif may be contiguous in the linear chain or fragmented in a number of noncontiguous regions.
The DIAL web server provides graphical output which allows the user to view, rotate and enlarge the 3D superposition for the optimal (and suboptimal) alignment of query to target. Although graphical output is available for all three algorithms, the semiglobal motif search may be of most interest in attempts to identify RNA motifs. DIAL is available at http://bioinformatics.bc.edu/clotelab/DIAL.
Nucleic Acids Research 05/2007; 35(Web Server issue):659-668. DOI:10.1093/nar/gkm334 · 9.11 Impact Factor