ABSTRACT: Recent experimental evidence has shown that ribonucleic acid (RNA) plays a greater role in the cell than previously thought. An ensemble of RNA sequences believed to contain signals at the structure level can be exploited to detect functional motifs common to all or a portion of those sequences. We present here a general framework for analyzing multiple RNA secondary structures. A family of related RNA structures may be analyzed using statistical regression methods. In this work, we extend our previously developed algorithm, Seed, which exhaustively explores the search space of RNA sequence and structure motifs. We introduce several objective functions based on thermodynamic free energy and information content to discriminate native folds from the rest. We assume that the variation across the various scores can be represented by a statistical model. Regression analysis assigns a separate weight to each score, allowing one to emphasize or compensate for the variance that differs across the scores. A statistical model can be formulated using techniques from regression analysis to obtain a template or scoring model that is able to identify putative functional regions in RNA sequences. We show that thermodynamics-based regression models are effective at relating the variation of scores obtained from different functions. The models can generally identify motifs with high measures of specificity and positive predictive value to known motifs. A good scoring method makes it possible to eliminate invalid motifs, thereby reducing the size of the hypothesis space.
Electrical and Computer Engineering, 2006. CCECE '06. Canadian Conference on; 06/2006
ABSTRACT: The identification of a consensus RNA motif often consists of finding a conserved secondary structure with minimum free energy in an ensemble of aligned sequences. However, an alignment is often difficult to obtain without prior structural information, hence the need for tools to automate this process.
We present an algorithm called Seed to identify all the conserved RNA secondary structure motifs in a set of unaligned sequences. The search space is defined as the set of all the secondary structure motifs inducible from a seed sequence. A general-to-specific search finds all the motifs that are conserved. Suffix arrays are used to enumerate efficiently all the biological palindromes as well as to match RNA secondary structure expressions. We assessed the ability of this approach to uncover known structures using four datasets. The enumeration of the motifs relies only on the secondary structure definition and conservation, therefore allowing for the independent evaluation of scoring schemes. Twelve simple objective functions based on free energy were evaluated for their potential to discriminate native folds from the rest.
Our evaluation shows that (1) support and exclusion constraints are sufficient to make an exhaustive search of the secondary structure space feasible; (2) the search space induced from a seed sequence contains known motifs; and (3) simple objective functions, consisting of a combination of the free energy of matching sequences, can generally identify motifs with high positive predictive value and sensitivity to known motifs.
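As an illustration of the two evaluation measures named above, the following minimal sketch computes positive predictive value and sensitivity for a predicted structure against a known one. It assumes that both structures are given in dot-bracket notation and that agreement is counted per base pair; the abstract does not state how the measures were computed, and the function names `parse_dot_bracket` and `ppv_and_sensitivity` are hypothetical.

```python
def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j) encoded by a dot-bracket string."""
    stack = []
    pairs = set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def ppv_and_sensitivity(predicted, known):
    """Compare two structures at the base-pair level (an assumption).

    PPV         = TP / (TP + FP): fraction of predicted pairs that are known.
    Sensitivity = TP / (TP + FN): fraction of known pairs that are predicted.
    """
    predicted_pairs = parse_dot_bracket(predicted)
    known_pairs = parse_dot_bracket(known)
    true_positives = len(predicted_pairs & known_pairs)
    ppv = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    sensitivity = true_positives / len(known_pairs) if known_pairs else 0.0
    return ppv, sensitivity


if __name__ == "__main__":
    # Toy structures, invented for illustration only.
    known = "((((...))))"
    predicted = ".(((...)))."
    print(ppv_and_sensitivity(predicted, known))
```

In this toy case every predicted pair is also a known pair, while one known pair is missed, so the prediction has a high positive predictive value but a lower sensitivity.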
ABSTRACT: Related RNA sequences believed to contain signals at the sequence and structure level can be exploited to detect common motifs. Finding these similar structural features can provide substantial information as to which parts of the sequence are functional. For several decades, free energy minimization has been the most popular method for structure prediction. However, limitations of the free energy models as well as time complexity have prompted us to look for alternative approaches. We therefore investigate another paradigm, minimum description length (MDL) encoding, for evaluating the significance of consensus motifs. Here, we evaluate motifs generated by Seed using the description length as a selection criterion. The MDL scoring method was tested on four data sets. We found that the scoring method produces results competitive with those predicted with lowest free energy. The top-ranked motifs have high measures of positive predictive value to known motifs.
Proceedings of the 2006 International Conference on Bioinformatics & Computational Biology, BIOCOMP'06, Las Vegas, Nevada, USA, June 26-29, 2006; 01/2006