Conference Paper

RNA Pseudoknot Folding through Inference and Identification Using TAGRNA

Abstract

Studying the structure of RNA sequences is an important problem that helps in understanding the functional properties of RNA. Pseudoknots are one type of RNA structure that was long ignored because of the high computational complexity involved, but that has received a lot of attention lately. Pseudoknot structures have functional importance since they appear, for example, in viral genome RNAs and ribozyme active sites. In this paper, we present a folding framework, TAGRNAInf, for RNA structures that supports pseudoknots. Our approach is based on learning TAGRNA grammars from training data with structural information. The inferred grammars are used to identify sequences with structures analogous to those in the training set and to generate a folding for these sequences. We present experimental results and comparisons with other known pseudoknot folding approaches.

Article
Formal grammars have been employed in biology to solve various important problems. In particular, grammars have been used to model and predict RNA structures. Two such grammars are Simple Linear Tree Adjoining Grammars (SLTAGs) and Extended SLTAGs (ESLTAGs). The performance of techniques that employ grammatical formalisms depends critically on the efficiency of the underlying parsing algorithms. In this paper we present efficient algorithms for parsing SLTAGs and ESLTAGs. Our algorithm for SLTAG parsing takes O(min{m, n^4}) time and O(min{m, n^4}) space, where m is the number of entries that will ever be made in the matrix M (that is normally used by TAG parsing algorithms). Our algorithm for ESLTAG parsing takes O(n min{m, n^4}) time and O(min{m, n^4}) space. We show that these algorithms perform better in practice than the algorithms of Uemura et al. [19].
Article
A simplified (two-base) version of the problem of planar folding of long chains (e.g., RNA and DNA biomolecules) is formulated as a matching problem. The chain is prescribed as a loop or circular sequence of letters A and B, n units long. A matching here means a set of A-B base pairings or matches obeying a planarity condition: no two matches may cross each other if drawn on the interior of the loop. Also, no two adjacent letters may be matched. We present a dynamic programming algorithm requiring O(n^3) steps and O(n^2) storage which computes the size of the maximum matching for the given A-B base sequence and which also allows reconstructing a particular folded form of the original string which realizes the maximum matching size. The algorithm can be adapted to deal with sequences with larger alphabets and with weighted matchings. An algorithm is also presented for a modified problem closer to the biochemical problem of interest: we demand that every match must be adjacent to another match, forcing groups of two or more parallel matches. Some results on the expected maximum matching size are presented. As n → ∞, at least 80% of the vertices can be matched on the average on an A-B string of size n. We briefly discuss the practical application of the algorithm by using contracted versions of very long molecules with a preliminary block construction. A maximum matching is presented for the J-gene of the ϕX174 DNA virus. We conclude by stating some problems requiring further study.
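The matching formulation in this abstract can be illustrated with a short sketch. The code below is not the authors' algorithm as published. It is a minimal interval dynamic program written under stated assumptions: pairs are allowed only between different letters, neighbouring positions may not pair, and when `circular=True` the first and last positions are also treated as neighbours. The function name `max_planar_matching` and its parameters are hypothetical.

```python
def max_planar_matching(seq, circular=True):
    """Return (size, pairs) of a maximum non-crossing A-B matching.

    Illustrative sketch only: O(n^3) time, O(n^2) storage.
    """
    n = len(seq)
    if n < 2:
        return 0, []

    def can_pair(i, j):
        # Assumption: only unlike letters pair (A with B).
        if seq[i] == seq[j]:
            return False
        # Adjacent letters may not be matched.
        if j - i < 2:
            return False
        # On a loop, the first and last letters are adjacent too.
        if circular and i == 0 and j == n - 1:
            return False
        return True

    # best[i][j] = maximum matching size within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(2, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]  # case: position j stays unmatched
            for k in range(i, j - 1):  # case: j is matched with k
                if can_pair(k, j):
                    left = best[i][k - 1] if k > i else 0
                    value = max(value, left + 1 + best[k + 1][j - 1])
            best[i][j] = value

    # Traceback: recover one matching that realizes the maximum.
    pairs = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i < 2:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - 1):
            if can_pair(k, j):
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return best[0][n - 1], sorted(pairs)
```

For example, `max_planar_matching("AAABBB")` finds two nested matches; a third is impossible because the innermost match of any non-crossing perfect matching would join adjacent letters.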
Article
Grammatical inference, an important field of syntactic pattern recognition, is finding wider acceptance in many practical applications like computational biology. In this work we show the use of grammatical inference techniques in identifying pseudoknots in the RNA secondary structures. Identification of RNA secondary structure is among the few structure identification problems that can be solved satisfactorily in polynomial time and data. We propose an Infer-Test model to identify the pseudoknots. This model uses the Terminal Distinguishable Even Linear Language inferencing technique to identify pseudoknots in the RNA secondary structures.
Chapter
We discuss the problem of algorithmic discovery of patterns common to sets of sequences and its applications to computational biology. We formulate a three step paradigm for pattern discovery, which is based on choosing the hypothesis space, designing the function rating a pattern in respect to the given sequences, and developing an algorithm finding the highest rating patterns. We give some examples of implementing this paradigm, and present experimental results of discovering new patterns in sets of biosequences. In these experiments the sets of given sequences are noisy, that is, many of the sequences given as belonging to the family, actually do not belong to the family. Nevertheless our algorithms have been able to identify biologically sound patterns. In particular we present novel results of discovering transcription factor binding sites from the complete set of over 6000 sequences, taken from the yeast genome upstream to the potential genes.
Article
Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented. They are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are.
Conference Paper
Tree Adjoining Grammar (TAG) is a formalism for natural language grammars. Some of the basic notions of TAG's were introduced in [Joshi, Levy, and Takahashi 1975] and by [Joshi, 1983]. A detailed investigation of the linguistic relevance of TAG's has been carried out in [Kroch and Joshi, 1985]. In this paper, we will describe some new results for TAG's, especially in the following areas: (1) parsing complexity of TAG's, (2) some closure results for TAG's, and (3) the relationship to Head grammars.
Article
Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of tRNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. Results show that after having been trained on as few as 20 tRNA sequences from only two tRNA subfamilies (mitochondrial and cytoplasmic), the model can discern general tRNA from similar-length RNA sequences of other kinds, can find secondary structure of new tRNA sequences, and can produce multiple alignments of large sets of tRNA sequences. Our results suggest potential improvements in the alignments of the D- and T-domains in some mitochondrial tRNAs that cannot be fit into the canonical secondary structure.
Article
The 3′ terminal forty nucleotides of tobraviral RNAs readily fold into a tertiary structure resembling that of tymo- and tobamoviral RNAs. The latter RNAs possess a tRNA-like structure at their 3′ end that is recognized by a number of tRNA-specific enzymes (Rietveld et al. (1984), EMBO J. 3, 2613–2619). Characteristic of their aminoacyl acceptor arm is the presence of a so-called pseudoknot, which we now also find in a corresponding position at the 3′ terminus of TRV RNA2 (PSS strain). The nucleotide sequences of all tobraviral RNAs analysed so far indicate that they all possess a similar 3′ terminal structure. A domain resembling the anticodon arm of canonical tRNA is not readily recognizable. TRV RNA2 can be adenylated with CTP, ATP and tRNA nucleotidyl transferase. It is unable, however, to accept any of the twenty common amino acids when incubated with ATP and aminoacyl-tRNA synthetases from wheat germ or yeast. We conclude that TRV RNA contains a tRNA-like structure which, in contrast to the tymo- and tobamoviral tRNA-like structures, cannot be aminoacylated. It is therefore unlikely that aminoacylation of plant viral RNAs with a tRNA-like structure is a prerequisite for viral RNA replication.
Article
This paper presents a new computer method for folding an RNA molecule that finds a conformation of minimum free energy using published values of stacking and destabilizing energies. It is based on a dynamic programming algorithm from applied mathematics, and is much more efficient, faster, and can fold larger molecules than procedures which have appeared up to now in the biological literature. Its power is demonstrated in the folding of a 459 nucleotide immunoglobulin γ 1 heavy chain messenger RNA fragment. We go beyond the basic method to show how to incorporate additional information into the algorithm. This includes data on chemical reactivity and enzyme susceptibility. We illustrate this with the folding of two large fragments from the 16S ribosomal RNA of Escherichia coli.
Article
We describe a statistical method to determine if a pair of columns in a multiple alignment of a homologous family of RNA sequences shows evidence of being base paired. The method makes explicit use of a given phylogenetic tree for the sequences in the alignment. It is tested on a multiple alignment of 16S rRNA sequences with good results. Introduction and Overview of Methods: Most present techniques for RNA secondary structure prediction are based either on energy minimization or on comparative sequence analysis. Energy minimization methods have had less success on large RNA molecules [1 Jacobson93] [2 Zuker-91] [3 Zuker-84] [4 Tinoco-71], so comparative sequence analysis is the method of choice here [5 Han-93] [6 Le-91]. Until now, comparative sequence methods have either required substantial manual intervention [7 James89] [8 Woese-83], or were more fully automated, but overlooked information about the phylogenetic relationships among the sequences in the RNA multiple ...
Article
RNA molecules are sequences of nucleotides that serve as more than mere intermediaries between DNA and proteins, e.g., as catalytic molecules. Computational prediction of RNA secondary structure is among the few structure prediction problems that can be solved satisfactorily in polynomial time. Most work has been done to predict structures that do not contain pseudoknots. Allowing pseudoknots introduces modeling and computational problems. In this paper we consider the problem of predicting RNA secondary structures with pseudoknots based on free energy minimization. We first give a brief comparison of energy-based methods for predicting RNA secondary structures with pseudoknots. We then prove that the general problem of predicting RNA secondary structures containing pseudoknots is NP-complete for a large class of reasonable models of pseudoknots.
Article
The 5'-untranslated leader region of human immunodeficiency virus type 1 (HIV-1) RNA contains multiple signals that control distinct steps of the viral replication cycle such as transcription, reverse transcription, genomic RNA dimerization, splicing, and packaging. It is likely that fine tuned coordinated regulation of these functions is achieved through specific RNA-protein and RNA-RNA interactions. In a search for cis-acting elements important for the tertiary structure of the 5'-untranslated region of HIV-1 genomic RNA, we identified, by ladder selection experiments, a short stretch of nucleotides directly downstream of the poly(A) signal that interacts with a nucleotide sequence located in the matrix region. Confirmation of the sequence of the interacting sites was obtained by partial or complete inhibition of this interaction by antisense oligonucleotides and by nucleotide substitutions. In the wild type RNA, this long range interaction was intramolecular, since no intermolecular RNA association was detected by gel electrophoresis with an RNA mutated in the dimerization initiation site and containing both sequences involved in the tertiary interaction. Moreover, the functional importance of this interaction is supported by its conservation in all HIV-1 isolates as well as in HIV-2 and simian immunodeficiency virus. Our results raise the possibility that this long range RNA-RNA interaction might be involved in the full-length genomic RNA selection during packaging, repression of the 5' polyadenylation signal, and/or splicing regulation.
Article
tmRNA (also known as 10Sa RNA or SsrA) plays a central role in an unusual mode of translation, whereby a stalled ribosome switches from a problematic mRNA to a short reading frame within tmRNA during translation of a single polypeptide chain. Research on the mechanism, structure and biology of tmRNA is served by the tmRNA Website, a collection of sequences for tmRNA and the encoded proteolysis-inducing peptide tags, alignments, careful documentation and other information; the URL is http://www.indiana.edu/~tmrna. Four pseudoknots are usually present in each tmRNA, so the database is rich with information on pseudoknot variability. Since last year it has doubled (227 tmRNA sequences as of September 2001), a sequence alignment for the tmRNA cofactor SmpB has been included, and genomic data for Clostridium botulinum has revealed a group I (subgroup IA3) intron interrupting the tmRNA T-loop.
Article
Minimized trans-acting HDV ribozyme systems consisting of three (Rz-3) and two (Rz-2) RNA strands were prepared and their folding conformations were analyzed by NMR spectroscopy. The guanosine residues in one of the enzyme components of Rz-3 were labeled with 13C and 15N. Imino proton signals were assigned by analysis of NOESY and HSQC spectra. The results are consistent with the nested double pseudoknot model, which contains novel base pairs (P1.1), as observed in the crystal structure of a genomic HDV ribozyme. The NOE connectivities suggest an additional G:G pair at the bottom of P1.1 and at the top of P4. The effects of temperature and Mg2+ ions on base pairs for Rz-3 were examined. The temperature variation experiment on Rz-3 showed that P3 is the most stable and that P1.1 is as stable as P1 and P2. The imino proton signals of the G:U pair at the bottom of P1 and the top of P1.1, which are close to the cleavage site, showed the largest changes upon Mg2+ titration of Rz-3. The results suggest that the catalytic Mg2+ ion binds to the pocket formed by P1 and L3.
Article
MicroRNAs (miRNAs) are small noncoding RNA gene products about 22 nt long that are processed by Dicer from precursors with a characteristic hairpin secondary structure. Guidelines are presented for the identification and annotation of new miRNAs from diverse organisms, particularly so that miRNAs can be reliably distinguished from other RNAs such as small interfering RNAs. We describe specific criteria for the experimental verification of miRNAs, and conventions for naming miRNAs and miRNA genes. Finally, an online clearinghouse for miRNA gene name assignments is provided by the Rfam database of RNA families.
Article
Pseudoknots have generally been excluded from the prediction of RNA secondary structures because they are difficult to model. Although several dynamic programming algorithms exist for the prediction of pseudoknots using thermodynamic approaches, they are neither reliable nor efficient. On the other hand, comparative methods are more reliable, but are often done in an ad hoc manner and require expert intervention. Maximum weighted matching, an algorithm for pseudoknot prediction with comparative analysis, suffers from low prediction accuracy in many cases. Here we present an algorithm, iterated loop matching, for reliably and efficiently predicting RNA secondary structures including pseudoknots. The method can utilize either thermodynamic or comparative information or both, and is thus able to predict pseudoknots for both aligned and individual sequences. We have tested the algorithm on a number of RNA families. Using 8-12 homologous sequences, the algorithm correctly identifies more than 90% of base-pairs for short sequences and 80% overall. It correctly predicts nearly all pseudoknots and produces very few spurious base-pairs for sequences without pseudoknots. Comparisons show that our algorithm is both more sensitive and more specific than the maximum weighted matching method. In addition, our algorithm has high prediction accuracy on individual sequences, comparable with the PKNOTS algorithm, while using much less computational resources. The program has been implemented in ANSI C and is freely available for academic use at http://www.cse.wustl.edu/~zhang/projects/rna/ilm/ Supplementary information: http://www.cse.wustl.edu/~zhang/projects/rna/ilm/
Article
The general problem of RNA secondary structure prediction under the widely used thermodynamic model is known to be NP-complete when the structures considered include arbitrary pseudoknots. For restricted classes of pseudoknots, several polynomial time algorithms have been designed, where the O(n^6) time and O(n^4) space algorithm by Rivas and Eddy is currently the best available program. We introduce the class of canonical simple recursive pseudoknots and present an algorithm that requires O(n^4) time and O(n^2) space to predict the energetically optimal structure of an RNA sequence, possibly containing such pseudoknots. Evaluation against a large collection of known pseudoknotted structures shows the adequacy of the canonization approach and our algorithm. RNA pseudoknots of medium size can now be predicted reliably as well as efficiently by the new algorithm.
Article
Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three kingdoms of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.
Article
We have solved the three-dimensional crystal structure of the stem-loop II motif (s2m) RNA element of the SARS virus genome to 2.7-Å resolution. SARS and related coronaviruses and astroviruses all possess a motif at the 3' end of their RNA genomes, called the s2m, whose pathogenic importance is inferred from its rigorous sequence conservation in an otherwise rapidly mutable RNA genome. We find that this extreme conservation is clearly explained by the requirement to form a highly structured RNA whose unique tertiary structure includes a sharp 90° kink of the helix axis and several novel longer-range tertiary interactions. The tertiary base interactions create a tunnel that runs perpendicular to the main helical axis whose interior is negatively charged and binds two magnesium ions. These unusual features likely form interaction surfaces with conserved host cell components or other reactive sites required for virus function. Based on its conservation in viral pathogen genomes and its absence in the human genome, we suggest that these unusual structural features in the s2m RNA element are attractive targets for the design of anti-viral therapeutic agents. Structural genomics has sought to deduce protein function based on three-dimensional homology. Here we have extended this approach to RNA by proposing potential functions for a rigorously conserved set of RNA tertiary structural interactions that occur within the SARS RNA genome itself. Based on tertiary structural comparisons, we propose the s2m RNA binds one or more proteins possessing an oligomer-binding-like fold, and we suggest a possible mechanism for SARS viral RNA hijacking of host protein synthesis, both based upon observed s2m RNA macromolecular mimicry of a relevant ribosomal RNA fold.
Article
The discoveries of microRNAs and riboswitches, among others, have shown functional RNAs to be biologically more important and genomically more prevalent than previously anticipated. We have developed a general comparative genomics method based on phylogenetic stochastic context-free grammars for identifying functional RNAs encoded in the human genome and used it to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs. At a loose threshold for acceptance, this search resulted in a set of 48,479 candidate RNA structures. This screen finds a large number of known functional RNAs, including 195 miRNAs, 62 histone 3'UTR stem loops, and various types of known genetic recoding elements. Among the highest-scoring new predictions are 169 new miRNA candidates, as well as new candidate selenocysteine insertion sites, RNA editing hairpins, RNAs involved in transcript auto regulation, and many folds that form singletons or small functional RNA families of completely unknown function. While the rate of false positives in the overall set is difficult to estimate and is likely to be substantial, the results nevertheless provide evidence for many new human functional RNAs and present specific predictions to facilitate their further characterization.
Article
Visualizing RNA secondary structures and pseudoknot structures is essential to bioinformatics systems that deal with RNA structures. However, many bioinformatics systems use heterogeneous data structures and incompatible software components, so integration of software components (including a visualization component) into a system can be hindered by incompatibilities between the components of the system. This paper presents an XML web service and web application program for visualizing RNA secondary structures with pseudoknots. Experimental results show that the PseudoViewer web service and web application are useful for resolving many problems with incompatible software components as well as for visualizing large-scale RNA secondary structures with pseudoknots of any type. The web service and web application are available at http://pseudoviewer.inha.ac.kr/.
Article
Genomic variations deep in the intronic regions of pre-mRNA molecules are increasingly reported to affect splicing events. However, there is no general explanation why apparently similar variations may have either no effect on splicing or cause significant splicing alterations. In this work we have examined the structural architecture of pseudoexons previously described in ATM and CFTR patients. The ATM case derives from the deletion of a repressor element and is characterized by an aberrant 5′ss selection despite the presence of better alternatives. The CFTR pseudoexon instead derives from the creation of a new 5′ss that is used while a nearby pre-existing donor-like sequence is never selected. Our results indicate that RNA structure is a major splicing regulatory factor in both cases. Furthermore, manipulation of the original RNA structures can lead to pseudoexon inclusion following the exposure of unused 5′ss that are already present in the wild-type intronic sequences but are prevented from being recognized because of their location in RNA stem structures. Our data show that intrinsic structural features of introns must be taken into account to understand the mechanism of pseudoexon activation in genetic diseases. Our observations may help to improve diagnostics prediction programmes and eventual therapeutic targeting.
Article
On the basis of the discovery of enzymatic activities in certain RNAs (in E. coli during the maturation of tRNAs, and in Tetrahymena with an exon of a self-splicing rRNA), the author postulates a self-replicating system originally composed solely of RNA molecules.
Article
In this paper, we are concerned with identifying a subclass of tree adjoining grammars (TAGs) that is suitable for the application to modeling and predicting RNA secondary structures. The goal of this paper is twofold: For the purpose of applying to the RNA secondary structure prediction problem, we first introduce a special subclass of TAGs and develop a fast parsing algorithm for the subclass, together with some of its language theoretic characterizations. Then, based on the algorithm, we develop a prediction system and demonstrate the effectiveness of the system by presenting some experimental results obtained from biological data, where free energy evaluation selection for parse trees is incorporated into the algorithm.
Article
In this paper, a tree generating system called a tree adjunct grammar is described and its formal properties are studied, relating them to the tree generating systems of Brainerd (Information and Control 14 (1969), 217–231) and Rounds (Mathematical Systems Theory 4 (1970), 257–287) and to the recognizable sets and local sets discussed by Thatcher (Journal of Computer and System Sciences 1 (1967), 317–322; 4 (1970), 339–367) and Rounds. The linguistic relevance of these systems is also briefly discussed.
Conference Paper
Studying the structure of RNA sequences is an important problem that helps in understanding the functional properties of RNA. Pseudoknots are one type of RNA structure that cannot be modeled with Context Free Grammars (CFG) because they exhibit crossing dependencies. Pseudoknot structures have functional importance since they appear, for example, in viral genome RNAs and ribozyme active sites. Tree Adjoining Grammar (TAG) is one example of a grammatical model that is more expressive than CFG and has the capability of dealing with crossing dependencies. In this paper, we describe a new inference algorithm for TAGRNA, a sub-model of TAG. We also introduce an RNA structure identification framework, TAGRNAInf, within which the TAGRNA inference algorithm constitutes the core of the training phase. We present the results of using the proposed framework for identifying RNA sequences with pseudoknot structures. Our results outperform those reported in [14] for the same problem, which employs a different grammatical formalism.
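The crossing dependencies mentioned in this abstract can be made concrete with a small check. The sketch below is illustrative only and is not part of TAGRNAInf. It assumes a structure is given as a list of base-pair index tuples, and reports every two pairs (i, j) and (k, l) with i < k < j < l, which is the pattern a nested (context-free) structure never contains. The function names `crossing_pairs` and `has_pseudoknot` are hypothetical.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return all couples of base pairs that cross each other.

    Each base pair is a tuple of two sequence positions. Two pairs
    (i, j) and (k, l) cross when i < k < j < l.
    """
    # Normalize so that each pair is (smaller, larger) and the list is sorted;
    # after sorting, the first pair of any couple starts no later than the second.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    return [
        ((i, j), (k, l))
        for (i, j), (k, l) in combinations(normalized, 2)
        if i < k < j < l
    ]


def has_pseudoknot(pairs):
    """True when at least two base pairs cross (a non-nested structure)."""
    return bool(crossing_pairs(pairs))
```

A plain hairpin such as `[(0, 9), (1, 8), (2, 7)]` yields no crossings, whereas two interleaved stems such as `[(0, 10), (1, 9), (5, 15), (6, 14)]` do, because each pair of the first stem crosses each pair of the second.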
Article
This paper shows simple dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. For a basic version of the problem (i.e., maximizing the number of base pairs), this paper presents an O(n^4) time exact algorithm and an O(n^(4−δ)) time approximation algorithm. The latter one outputs, for most RNA sequences, a secondary structure in which the number of base pairs is at least 1−ε of the optimal, where ε, δ are any constants satisfying 0 < ε, δ < 1. Several related results are shown too.
Article
This paper presents an unsupervised inference method for determining the higher-order structure from sequence data. The method is general, but in this paper it is applied to nucleic acid sequences in determining the secondary (2-D) and tertiary (3-D) structure of the macromolecule. The method evaluates position - position interdependence of the sequence using an information measure known as expected mutual information. The expected mutual information is calculated for each pair of positions and the chi-square test is used to screen statistically significant position pairs. In the calculation of expected mutual information, an unbiased probability estimator is used to overcome the problem associated with zero observation in conserved sites. A selection criterion based on known structural constraints of the strongest interdependent position pairs is applied yielding position pairs most indicative of secondary and tertiary interactions. The method has been tested using tRNA and 5S rRNA sequences with very good results.
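The position-pair interdependence described in this abstract can be sketched as follows. This is not the paper's estimator: it uses the plain plug-in mutual information between two alignment columns, without the unbiased probability estimator or the chi-square screening mentioned above. It assumes the alignment is a list of equal-length strings; the function name `column_mutual_information` is hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, a, b):
    """Plug-in mutual information (in bits) between columns a and b.

    alignment: list of equal-length sequences (strings).
    Assumption: simple frequency estimates, no bias correction.
    """
    n = len(alignment)
    count_a = Counter(seq[a] for seq in alignment)
    count_b = Counter(seq[b] for seq in alignment)
    count_ab = Counter((seq[a], seq[b]) for seq in alignment)

    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written with raw counts to limit rounding.
        ratio = c * n / (count_a[x] * count_b[y])
        mi += p_xy * log2(ratio)
    return mi
```

With the toy alignment `["GAAC", "CAAG", "AAAU", "UAAA"]`, columns 0 and 3 covary perfectly and score 2 bits, while the conserved columns 1 and 2 score 0 bits. This is why fully conserved sites carry no covariation signal under this measure.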
Article
The complete sequence of the 3454 nt of RNA 2 of the Ahlum isolate of beet soil-borne furovirus (BSBV) has been determined starting with two short stretches of cloned cDNA. Unknown parts of the sequence were amplified by means of RT-PCR techniques using combinations of specific and random primers. BSBV RNA 2 is more similar in its genetic organization to potato mop top virus (PMTV) RNA 3 than to any other furoviral RNA, although it is more than 1100 nt longer. Its 3'-end, unlike that of PMTV RNA 3, has the potential to fold into a tRNA-like structure. It contains one large open reading frame for a readthrough protein with a molecular mass of 104 kDa (104K protein) which is interrupted internally by an amber stop codon terminating the coding region for a protein of 19 kDa (19K), most likely the viral coat protein (CP). The readthrough domain of the 104K protein is much larger than that of PMTV, but the N- and C-proximal portions of these domains are similar for the two viruses. No serological relationships were found between the particles of the two viruses, although more than 50% of the amino acid sequences of the putative CPs are identical.
Article
We describe a dynamic programming algorithm for predicting optimal RNA secondary structure, including pseudoknots. The algorithm has a worst case complexity of O(N^6) in time and O(N^4) in storage. The description of the algorithm is complex, which led us to adopt a useful graphical representation (Feynman diagrams) borrowed from quantum field theory. We present an implementation of the algorithm that generates the optimal minimum energy structure for a single RNA sequence, using standard RNA folding thermodynamic parameters augmented by a few parameters describing the thermodynamic stability of pseudoknots. We demonstrate the properties of the algorithm by using it to predict structures for several small pseudoknotted and non-pseudoknotted RNAs. Although the time and memory demands of the algorithm are steep, we believe this is the first algorithm to be able to fold optimal (minimum energy) pseudoknotted RNAs with the accepted RNA thermodynamic model.
Article
PseudoBase is a database containing structural, functional and sequence data related to RNA pseudoknots. It can be reached at http://wwwbio.LeidenUniv.nl/~Batenburg/PKB.html. This page will direct the user to a retrieval page from where a particular pseudoknot can be chosen, or to a submission page which enables the user to add pseudoknot information to the database or to an informative page that elaborates on the various aspects of the database. For each pseudoknot, 12 items are stored, e.g. the nucleotides of the region that contains the pseudoknot, the stem positions of the pseudoknot, the EMBL accession number of the sequence that contains this pseudoknot and the support that can be given regarding the reliability of the pseudoknot. Access is via a small number of steps, using 16 different categories. The development process was done by applying the evolutionary methodology for software development rather than by applying the methodology of the classical waterfall model or the more modern spiral model.
Article
Nucleic acid secondary structure models usually exclude pseudoknots due to the difficulty of treating these nonnested structures efficiently in structure prediction and partition function algorithms. Here, the standard secondary structure energy model is extended to include the most physically relevant pseudoknots. We describe an O(N^5) dynamic programming algorithm, where N is the length of the strand, for computing the partition function and minimum energy structure over this class of secondary structures. Hence, it is possible to determine the probability of sampling the lowest energy structure, or any other structure of particular interest. This capability motivates the use of the partition function for the design of DNA or RNA molecules for bioengineering applications.
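The link between the partition function and "the probability of sampling the lowest energy structure" can be shown on a toy example. The sketch below does not implement the O(N^5) dynamic program; it assumes the structures have already been enumerated and simply applies Boltzmann weighting, p(s) = exp(-E(s)/RT) / Q. The function name, the energy values, and the default temperature are assumptions.

```python
from math import exp

# Gas constant in kcal/(mol*K), the unit system commonly used for
# nucleic acid folding energies.
GAS_CONSTANT_KCAL = 0.0019872


def boltzmann_probabilities(energies_kcal, temperature_k=310.15):
    """Return the equilibrium probability of each structure.

    `energies_kcal` lists free energies (kcal/mol) of an explicitly
    enumerated set of structures; real algorithms sum over the full
    ensemble with dynamic programming instead of enumeration.
    """
    rt = GAS_CONSTANT_KCAL * temperature_k
    weights = [exp(-energy / rt) for energy in energies_kcal]
    partition_function = sum(weights)
    return [w / partition_function for w in weights]


# Toy ensemble (assumed energies): one stable fold, one weaker fold,
# and the open chain at 0 kcal/mol.
toy_energies = [-3.0, -1.0, 0.0]
toy_probabilities = boltzmann_probabilities(toy_energies)
print(toy_probabilities)
```

The lowest-energy structure always gets the largest share, but how large that share is depends on how many competing structures lie close to it in energy, which is the information a minimum-energy prediction alone does not provide.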
Article
Bioinformatics is an active research area aimed at developing intelligent systems for analysis in molecular biology. Many methods based on formal language theory, statistical theory, and learning theory have been developed for modeling and analyzing biological sequences such as DNA, RNA, and proteins. In particular, grammatical inference methods are expected to find grammatical structures hidden in biological sequences. In this article, we give an overview of a series of our grammatical approaches to biological sequence analysis and related research, and focus on learning stochastic grammars from biological sequences and predicting their functions based on the learned stochastic grammars.