-
ABSTRACT: BACKGROUND: The search for distant homologs has become an important issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task? RESULTS: Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can be optionally added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA's algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query.
We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence. CONCLUSIONS: Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high-throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side. AVAILABILITY: Source code of the free software LocARNAscan 1.0 and supplementary data are available at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan.
Algorithms for Molecular Biology 04/2013; 8(1):14. · 1.35 Impact Factor
-
ABSTRACT: Recent genome-wide computational screens that search for conservation of RNA secondary structure in whole genome alignments (WGAs) have predicted thousands of structural noncoding RNAs (ncRNAs). The sensitivity of such approaches, however, is limited due to their reliance on sequence-based whole-genome aligners, which regularly misalign structural ncRNAs. This suggests that many more structural ncRNAs may remain undetected. Structure-based alignment, which could increase the sensitivity, has been prohibitive for genome-wide screens due to its extreme computational costs. Breaking this barrier, we present the pipeline REAPR (RE-Alignment for Prediction of structural ncRNA), which efficiently realigns whole genomes based on RNA sequence and structure, thus allowing us to boost the performance of de novo ncRNA predictors, such as RNAz. Key to the pipeline's efficiency is the development of a novel banding technique for multiple RNA alignment. REAPR significantly outperforms the widely used predictors RNAz and EvoFold in genome-wide screens; in direct comparison to the most recent RNAz screen on D. melanogaster, REAPR predicts twice as many high-confidence ncRNA candidates. Moreover, modENCODE RNA-Seq experiments confirm a substantial number of its predictions as transcripts. REAPR's advancement of de novo structural characterization of ncRNAs complements the identification of transcripts from rapidly accumulating RNA-Seq data.
Genome Research 01/2013; · 13.61 Impact Factor
-
ABSTRACT: Noncoding RNAs have emerged as key players in the cell. Understanding their surprisingly diverse range of functions is challenging for experimental and computational biology. Here, we review computational methods to analyze noncoding RNAs. The topics covered include basic and advanced techniques to predict RNA structures, annotation of noncoding RNAs in genomic data, mining RNA-seq data for novel transcripts and prediction of transcript structures, computational aspects of microRNAs, and database resources. doi: 10.1002/wrna.1134
WIREs RNA 09/2012; 3(6):759-78.
-
ABSTRACT: Due to recent algorithmic progress, tools for the gold standard of comparative RNA analysis, namely Sankoff-style simultaneous alignment and folding, are now readily applicable. Such approaches, however, compare RNAs with respect to a simultaneously predicted, single, nested consensus structure. To make multiple alignment of RNAs available in cases where this limitation of the standard approach is critical, we introduce a web server that provides a complete and convenient interface to the RNA structure alignment tool 'CARNA'. This tool uniquely supports RNAs with multiple conserved structures per RNA and aligns pseudoknots intrinsically; these features are highly desirable for aligning riboswitches, RNAs with conserved folding pathways, or pseudoknots. We represent structural input and output information as base pair probability dot plots; this provides great flexibility in the input, ranging from fixed structures to structure ensembles, and enables immediate visual analysis of the results. In contrast to conventional Sankoff-style approaches, 'CARNA' optimizes all structural similarities in the input simultaneously, for example across an entire RNA structure ensemble. Even compared with already costly Sankoff-style alignment, 'CARNA' solves an intrinsically much harder problem by applying advanced, constraint-based, algorithmic techniques. Although 'CARNA' is specialized to the alignment of RNAs with several conserved structures, its performance on RNAs in general is on par with state-of-the-art general-purpose RNA alignment tools, as we show in a Bralibase 2.1 benchmark. The web server is freely available at http://rna.informatik.uni-freiburg.de/CARNA.
Nucleic Acids Research 06/2012; 40(Web Server issue):W49-53. · 8.03 Impact Factor
-
ABSTRACT: Current genomic screens for noncoding RNAs (ncRNAs) predict a large number of genomic regions containing potential structural ncRNAs. The analysis of these data requires highly accurate prediction of ncRNA boundaries and discrimination of promising candidate ncRNAs from weak predictions. Existing methods struggle with these goals because they rely on sequence-based multiple sequence alignments, which regularly misalign RNA structure and therefore do not support identification of structural similarities. To overcome this limitation, we compute columnwise and global reliabilities of alignments based on sequence and structure similarity; we refer to these structure-based alignment reliabilities as STARs. The columnwise STARs of alignments, or STAR profiles, provide a versatile tool for the manual and automatic analysis of ncRNAs. In particular, we improve the boundary prediction of the widely used ncRNA gene finder RNAz by a factor of 3 from a median deviation of 47 to 13 nt. Post-processing RNAz predictions, LocARNA-P's STAR score allows much stronger discrimination between true- and false-positive predictions than RNAz's own evaluation. The improved accuracy, in this scenario increased from AUC 0.71 to AUC 0.87, significantly reduces the cost of successive analysis steps. The ready-to-use software tool LocARNA-P produces structure-based multiple RNA alignments with associated columnwise STARs and predicts ncRNA boundaries. We provide additional results, a web server for LocARNA/LocARNA-P, and the software package, including documentation and a pipeline for refining screens for structural ncRNA, at http://www.bioinf.uni-freiburg.de/Supplements/LocARNA-P/.
RNA 03/2012; 18(5):900-14. · 5.09 Impact Factor
-
ABSTRACT: The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires matching sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs.
We present a novel method and readily applicable software for time-efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows base pairing information to be exploited for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods.
The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator.
BMC Bioinformatics 05/2011; 12:214. · 2.75 Impact Factor
-
Sushmita Roy,
Jason Ernst,
Peter V. Kharchenko,
Pouya Kheradpour,
Nicolas Negre,
Matthew L. Eaton,
Jane M. Landolin,
Christopher A. Bristow,
Lijia Ma,
Michael F. Lin, [......],
Koen J. T. Venken,
Huaien Wang,
Robert White,
Jared Wilkening,
Aarron T. Willingham,
Chris Zaleski,
Zheng Zha,
Dayu Zhang,
Yongjun Zhao,
Jennifer Zieba
ABSTRACT: To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
Science 12/2010; 330(6012):1787-1797. · 31.20 Impact Factor
-
Sushmita Roy,
Jason Ernst,
Peter V Kharchenko,
Pouya Kheradpour,
Nicolas Negre,
Matthew L Eaton,
Jane M Landolin,
Christopher A Bristow,
Lijia Ma,
Michael F Lin, [......],
Brian Oliver,
Peter J Park,
Susan E Celniker,
Steven Henikoff,
Gary H Karpen,
Eric C Lai,
David M MacAlpine,
Lincoln D Stein,
Kevin P White,
Manolis Kellis
ABSTRACT: To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
Science 12/2010; 330(6012):1787-97. · 31.20 Impact Factor
-
ABSTRACT: The Freiburg RNA tools web server integrates three tools for the advanced analysis of RNA in a common web-based user interface. The tools IntaRNA, ExpaRNA and LocARNA support the prediction of RNA-RNA interaction, exact RNA matching and alignment of RNA, respectively. The Freiburg RNA tools web server and the software packages of the stand-alone tools are freely accessible at http://rna.informatik.uni-freiburg.de.
Nucleic Acids Research 05/2010; 38(Web Server issue):W373-7. · 8.03 Impact Factor
-
ABSTRACT: Prediction and alignment of RNA pseudoknot structures are NP-hard. Nevertheless, several efficient prediction algorithms by dynamic programming have been proposed for restricted classes of pseudoknots. We present a general scheme that yields an efficient alignment algorithm for arbitrary such classes. Moreover, we show that such an alignment algorithm benefits from the class restriction in the same way as the corresponding structure prediction algorithm does. We look at six of these classes in greater detail. The time and space complexity of the alignment algorithm is increased by only a linear factor over the respective prediction algorithm. For five of the classes, no efficient alignment algorithms were known. For the sixth, most general class, we improve the previously best complexity of O(n^5 m^5) time to O(nm^6), where n and m denote sequence lengths. Finally, we apply our fastest algorithm with O(nm^4) time and O(nm^2) space to comparative de-novo pseudoknot prediction.
Journal of computational biology: a journal of computational molecular cell biology 03/2010; 17(3):429-42. · 1.69 Impact Factor
-
Principles and Practice of Constraint Programming - CP 2010 - 16th International Conference, CP 2010, St. Andrews, Scotland, UK, September 6-10, 2010. Proceedings; 01/2010
-
Algorithms in Bioinformatics, 10th International Workshop, WABI 2010, Liverpool, UK, September 6-8, 2010. Proceedings; 01/2010
-
ABSTRACT: Although many RNA molecules contain pseudoknots, computational prediction of pseudoknotted RNA structure is still in its infancy due to high running time and space consumption implied by the dynamic programming formulations of the problem.
In this paper, we introduce sparsification to significantly speed up the dynamic programming approaches for pseudoknotted RNA structure prediction, which also lowers the space requirements. Although sparsification has been applied to a number of RNA-related structure prediction problems in the past few years, we provide the first application of sparsification to pseudoknotted RNA structure prediction specifically and to handling gapped fragments more generally, a problem with a much more complex recursive structure than others to which sparsification has been applied. We analyse how to sparsify four pseudoknot structure prediction algorithms, among them the most general method available (the Rivas-Eddy algorithm) and the fastest one (the Reeder-Giegerich algorithm). In all algorithms the number of "candidate" substructures to be considered is reduced.
Our experimental results on the sparsified Reeder-Giegerich algorithm suggest a linear speedup over the unsparsified implementation.
Algorithms for Molecular Biology 01/2010; 5:39. · 1.35 Impact Factor
-
Research in Computational Molecular Biology, 14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, April 25-28, 2010. Proceedings; 01/2010
-
ABSTRACT: Lattice protein models, such as the Hydrophobic-Polar (HP) model, are a common abstraction to enable exhaustive studies on structure, function, or evolution of proteins. A main issue is the high number of optimal structures, resulting from the hydrophobicity-based energy function applied. We introduce an equivalence relation on protein structures that correlates to the energy function. We discuss the efficient enumeration of optimal representatives of the corresponding equivalence classes and the application of the results. Comment: Published in Proceedings of the Fifth Workshop on Constraint Based Methods for Bioinformatics (WCB09), 2009, 9 pages
10/2009;
-
ABSTRACT: Accurate comparative analysis tools for low-homology proteins remain a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise sequence alignment tools in the most difficult low sequence homology case and improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families. partiFold-Align is available at http://partiFold.csail.mit.edu.
05/2009: pages 339-355;
-
ABSTRACT: Studies on proteins are often restricted to highly simplified models to face the immense computational complexity of the associated problems. Constraint-based protein structure prediction (CPSP) tools is a package of very fast algorithms for ab initio optimal structure prediction and related problems in 3D HP-models [cubic and face centered cubic (FCC)]. Here, we present CPSP-web-tools, an interactive online interface of these programs for their immediate use. They include the first method for the direct prediction of optimal energies and structures in 3D HP side-chain models. This newest extension of the CPSP approach is described here for the first time. AVAILABILITY AND IMPLEMENTATION: Free access at http://cpsp.informatik.uni-freiburg.de
Bioinformatics 02/2009; 25(5):676-7. · 5.47 Impact Factor
-
ABSTRACT: Specific functions of ribonucleic acid (RNA) molecules are often associated with different motifs in the RNA structure. The key feature that forms such an RNA motif is the combination of sequence and structure properties. In this article, we introduce a new RNA sequence-structure comparison method which maintains exact matching substructures. Existing common substructures are treated as whole units while variability is allowed between such structural motifs. Based on a fast detectable set of overlapping and crossing substructure matches for two nested RNA secondary structures, our method ExpaRNA (exact pattern of alignment of RNA) computes the longest collinear sequence of substructures common to two RNAs in O(H·nm) time and O(nm) space, where H < n·m for real RNA structures. Applied to different RNAs, our method correctly identifies sequence-structure similarities between two RNAs.
We have compared ExpaRNA with two other alignment methods that work with given RNA structures, namely RNAforester and RNA_align. The results are in good agreement, but can be obtained in a fraction of running time, in particular for larger RNAs. We have also used ExpaRNA to speed up state-of-the-art Sankoff-style alignment tools like LocARNA, and observe a tradeoff between quality and speed. However, we get a speedup of 4.25 even in the highest quality setting, where the quality of the produced alignment is comparable to that of LocARNA alone.
The presented algorithm is implemented in the program ExpaRNA, which is available from our website (http://www.bioinf.uni-freiburg.de/Software).
Bioinformatics 02/2009; 25(16):2095-102. · 5.47 Impact Factor
-
Research in Computational Molecular Biology, 13th Annual International Conference, RECOMB 2009, Tucson, AZ, USA, May 18-21, 2009. Proceedings; 01/2009
-
ABSTRACT: The principles of protein folding and evolution pose problems of very high inherent complexity. Often these problems are tackled using simplified protein models, e.g. lattice proteins. The CPSP-tools package provides programs to solve exactly and completely the problems typical of studies using 3D lattice protein models. Among the tasks addressed are the prediction of (all) globally optimal and/or suboptimal structures as well as sequence design and neutral network exploration.
In contrast to stochastic approaches, which are not capable of answering many fundamental questions, our methods are based on fast, non-heuristic techniques. The resulting tools are designed for high-throughput studies of 3D-lattice proteins utilising the Hydrophobic-Polar (HP) model. The source bundle is freely available.
The CPSP-tools package is the first set of exact and complete methods for extensive, high-throughput studies of non-restricted 3D-lattice protein models. In particular, our package deals with cubic and face centered cubic (FCC) lattices.
BMC Bioinformatics 02/2008; 9:230. · 2.75 Impact Factor