ABSTRACT: MOTIVATION: The application of Next-Generation Sequencing (NGS) technologies to RNAs directly extracted from a community of organisms yields a mixture of fragments characterizing both coding and non-coding types of RNAs. The task of distinguishing among these and further categorizing the families of messenger RNAs and ribosomal RNAs is an important step for examining gene expression patterns of an interactive environment and the phylogenetic classification of the constituting species. RESULTS: We present SortMeRNA, a new software tool designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data. It is capable of handling large sets of reads and sorting out all fragments matching the rRNA database with high sensitivity and low running time. AVAILABILITY: http://bioinfo.lifl.fr/RNA/sortmerna CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT: RNA locally optimal secondary structures provide a concise and exhaustive description of all possible secondary structures of a given RNA sequence, and hence a very good representation of the RNA folding space. In this paper, we present an efficient algorithm that computes all locally optimal secondary structures for any folding model that takes into account the stability of helical regions. This algorithm is implemented in a software tool called regliss that runs on a publicly accessible web server.
Journal of computational biology: a journal of computational molecular cell biology 10/2012; 19(10):1120-33. · 1.69 Impact Factor
ABSTRACT: The pairwise comparison of RNA secondary structures is a fundamental problem, with direct application in mining databases for annotating putative noncoding RNA candidates in newly sequenced genomes. An increasing number of software tools are available for comparing RNA secondary structures, based on different models (such as ordered trees or forests, arc annotated sequences, and multilevel trees) and computational principles (edit distance, alignment). We describe here the website BRASERO that offers tools for evaluating such software tools on real and synthetic datasets.
ABSTRACT: The annotation of noncoding RNA genes remains a major bottleneck in genome sequencing projects. Most genome sequences released today still come with sets of tRNAs and rRNAs as the only annotated RNA elements, ignoring hundreds of other RNA families. We have developed a web environment that is dedicated to noncoding RNA (ncRNA) prediction, annotation, and analysis and allows users to run a variety of tools in an integrated and flexible manner. This environment offers complementary ncRNA gene finders and a set of tools for the comparison, visualization, editing, and export of ncRNA candidates. Predictions can be filtered according to a large set of characteristics. Based on this environment, we created a public website located at http://RNAspace.org. It accepts genomic sequences up to 5 Mb, which permits online annotation of a complete bacterial genome or a small eukaryotic chromosome. The project is hosted as a SourceForge project (http://rnaspace.sourceforge.net/).
ABSTRACT: We describe a theoretical unifying framework to express the comparison of RNA structures, which we call alignment hierarchy. This framework relies on the definition of common supersequences for arc-annotated sequences and encompasses the main existing models for RNA structure comparison based on trees and arc-annotated sequences with a variety of edit operations. It also gives rise to edit models that have not been studied yet. We provide a thorough analysis of the alignment hierarchy, including a new polynomial-time algorithm and an NP-completeness proof. The polynomial-time algorithm involves biologically relevant edit operations such as pairing or unpairing nucleotides. It has been implemented in software called gardenia, which is available at the Web server http://bioinfo.lifl.fr/RNA/gardenia.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 07/2010; 7(2):309-22. · 2.25 Impact Factor
ABSTRACT: DNA-binding transcription factors (TFs) play a central role in transcription regulation, and computational approaches that help in elucidating complex mechanisms governing this basic biological process are of great use. In this perspective, we present the TFM-Explorer web server that is a toolbox to identify putative TF binding sites within a set of upstream regulatory sequences of genes sharing some regulatory mechanisms. TFM-Explorer finds local regions showing overrepresentation of binding sites. Accepted organisms are human, mouse, rat, chicken and drosophila. The server employs a number of features to help users to analyze their data: visualization of selected binding sites on genomic sequences, and selection of cis-regulatory modules. TFM-Explorer is available at http://bioinfo.lifl.fr/TFM.
Nucleic Acids Research 07/2010; 38(Web Server issue):W286-92. · 8.81 Impact Factor
ABSTRACT: Gene prediction is an essential step in understanding the genome of a species once it has been sequenced. For that, a promising direction in current research on gene finding is a comparative genomics approach. In this paper, we present a novel approach to identifying evolutionarily conserved protein-coding sequences in genomes. The method takes advantage of the specific substitution pattern of coding sequences together with the consistency of reading frames. It has been implemented in software called PROTEA. Large-scale experimentation shows good results. PROTEA is intended to be a useful complement to existing tools based on homology search or statistical properties of the sequences.
International Journal of Data Mining and Bioinformatics 02/2009; 3(2):160-76. · 0.39 Impact Factor
ABSTRACT: Position Weight Matrices are broadly used probabilistic motif models. In this paper, we address the problem of identifying and characterizing potential overlaps between occurrences of such a motif. It has useful applications to the statistics of the number of occurrences, and to weighted pattern matching with an extension of the well-known Knuth-Morris-Pratt algorithm.
Language and Automata Theory and Applications, Third International Conference, LATA 2009, Tarragona, Spain, April 2-8, 2009. Proceedings; 01/2009
ABSTRACT: The divergent domain D8 of the large ribosomal RNA is very variable and extended in vertebrates compared to other eukaryotes. We provide data from 31 species of echinoderms and present the first comparative analysis of the D8 in nonvertebrate deuterostomes. In addition, we obtained 16S mitochondrial DNA sequences for the sea urchin taxa and analyzed single-strand conformation polymorphism (SSCP) of D8 in several populations within the species complex Echinocardium cordatum. A common secondary structure supported by compensatory substitutions and indels is inferred for echinoderms. Variation mostly arises at the tip of the longest stem (D8a), and the most variable taxa also display the longest and most stable D8. The most stable variants are the only ones displaying bulges in the terminal part of the stem, suggesting that selection, rather than maximizing stability of the D8 secondary structure, maintains it in a given range. Striking variation in D8 evolutionary rates was evidenced among sea urchins, by comparison with both 16S mitochondrial DNA and paleontological data. In Echinocardium cordatum and Strongylocentrotus pallidus and S. droebachiensis, belonging to very distant genera, the increase in D8 evolutionary rate is extreme. Their highly stable D8 secondary structures rule out the possibility of pseudogenes. These taxa are the only ones in which interspecific hybridization was reported. We discuss how evolutionary rates may be affected in nuclear relative to mitochondrial genes after hybridization, by selective or mutational processes such as gene silencing and concerted evolution.
Journal of Molecular Evolution 11/2008; 67(5):539-50. · 2.15 Impact Factor
ABSTRACT: MAGNOLIA is a new software tool for the multiple alignment of nucleic acid sequences, which are recognized to be hard to align. The idea is that the multiple alignment process should be improved by taking into account the putative function of the sequences. In this perspective, MAGNOLIA is especially designed for sequences that are intended to be either protein-coding or structural RNAs. It extracts information from the similarities and differences in the data, and searches for a specific evolutionary pattern between sequences before aligning them. The alignment step then incorporates this information to achieve higher accuracy. The website is available at http://bioinfo.lifl.fr/magnolia.
Nucleic Acids Research 08/2008; 36(Web Server issue):W14-8. · 8.81 Impact Factor
ABSTRACT: Gene prediction is an essential step in understanding the genome of a species once it has been sequenced. For that, a promising direction in current research on gene finding is a comparative genomics approach. In this paper, we present a novel approach to identifying evolutionarily conserved protein-coding sequences in genomes. The method takes advantage of the specific substitution pattern of coding sequences together with the consistency of reading frames. It has been implemented in a software called Protea. Large-scale experimentation shows good results. Protea is intended to be a useful complement to existing tools based on homology search or statistical properties of the sequences.
Bioinformatics and Biomedicine, 2007. BIBM 2007. IEEE International Conference on; 12/2007
ABSTRACT: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. Using PWMs requires knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.
The contribution of this paper is twofold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in software called TFM-PVALUE, which is freely available.
We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
Algorithms for Molecular Biology 02/2007; 2:15. · 1.61 Impact Factor
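The two problems stated in the abstract above (score to P-value, and P-value to score threshold) can be illustrated with the plain dynamic-programming approach that the abstract attributes to existing methods. The following is a minimal sketch, not the TFM-PVALUE algorithm: it assumes integer matrix scores, an independent background model, and a PWM given as a list of per-position dictionaries. The function names and the toy matrix are hypothetical.

```python
from fractions import Fraction

def score_distribution(pwm, background):
    """Return {score: probability} over all words of length len(pwm).

    pwm: list of dicts, one per position, mapping letter -> integer score.
    background: dict mapping letter -> probability (positions independent).
    """
    dist = {0: Fraction(1)}
    for column in pwm:
        nxt = {}
        for score, prob in dist.items():
            for letter, s in column.items():
                total = score + s
                nxt[total] = nxt.get(total, 0) + prob * background[letter]
        dist = nxt
    return dist

def p_value(pwm, background, threshold):
    """Probability that a background word scores at least `threshold`."""
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= threshold)

def score_threshold(pwm, background, target_p):
    """Smallest score whose P-value is at most target_p, or None."""
    dist = score_distribution(pwm, background)
    tail, best = 0, None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > target_p:
            break
        best = s
    return best

# Hypothetical two-column matrix with a uniform background.
toy_pwm = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 1, "C": 1, "G": -1, "T": 0},
]
uniform = {letter: Fraction(1, 4) for letter in "ACGT"}

print(p_value(toy_pwm, uniform, 2))                   # 3/16
print(score_threshold(toy_pwm, uniform, Fraction(1, 5)))  # 2
```

Exact fractions keep the toy example free of rounding error. With real-valued matrices the number of distinct scores can grow very quickly, which is the difficulty the abstract describes.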
ABSTRACT: RNA genes are ubiquitous in the cell and are involved in a number of biochemical processes. Because there is a close relationship between function and structure, software tools that predict the secondary structure of noncoding RNAs from the base sequence are very helpful. In this article, we focus our attention on the inference of conserved secondary structure for a group of homologous RNA sequences. We present the caRNAc software, which enables the analysis of families of homologous sequences without prior alignment. The method relies both on comparative analysis and thermodynamic information.
ABSTRACT: Sequence comparison is widely used to help discover novel non-coding RNAs in newly sequenced genomes. In this context, Blast-like homology search tools are of great interest. We show here that the use of software based on "spaced seeds" has a positive impact on non-coding RNA identification.
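As a rough illustration of what a spaced seed is: the seed is a binary pattern in which `1` marks a position that must match and `0` a position that is ignored, so a hit can survive a mismatch that would break a contiguous seed of the same weight. The sketch below is an assumption-laden toy, not the software evaluated in the abstract; the function names, seed patterns, and sequences are all hypothetical.

```python
from collections import defaultdict

def seed_key(seq, start, seed):
    """Letters of seq under the '1' positions of seed, starting at start."""
    return "".join(seq[start + k] for k, c in enumerate(seed) if c == "1")

def seed_hits(query, target, seed):
    """Return all (query_pos, target_pos) pairs that agree on the seed."""
    span = len(seed)
    index = defaultdict(list)
    # Index every window of the target by its seed key.
    for j in range(len(target) - span + 1):
        index[seed_key(target, j, seed)].append(j)
    hits = []
    # Look up every window of the query in the index.
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(query, i, seed), []):
            hits.append((i, j))
    return hits

query = "ACGTACGA"
target = "ACCTACGA"   # differs from query at position 2 only

# Both seeds have weight 3; only the spaced one tolerates the mismatch
# at position 2 when anchoring at position 0.
print((0, 0) in seed_hits(query, target, "1101"))
print((0, 0) in seed_hits(query, target, "111"))
```

In a Blast-like tool each hit would then be extended into a full alignment; that step is omitted here.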
ABSTRACT: We describe a linear-time algorithm for comparing two similar ordered rooted trees with node labels. The method for comparing trees is the usual tree edit distance. We show that an optimal mapping that uses at most k insertions or deletions can then be constructed in O(nk^3) time, where n is the size of the trees. The approach is inspired by the Zhang–Shasha algorithm for tree edit distance in combination with an adequate pruning of the search space based on the tree edit graph.
Journal of Discrete Algorithms 01/2007; 5:696-705.
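For readers unfamiliar with the tree edit distance used in the abstract above, the following minimal sketch computes it with the textbook recursion on rightmost roots and memoization. It is not the pruned O(nk^3) algorithm of the paper and not the Zhang–Shasha implementation. It assumes unit costs for insertion, deletion, and relabeling, and trees encoded as nested tuples `(label, (child, ...))`; the encoding and function name are hypothetical.

```python
from functools import lru_cache

def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between two ordered, labeled, rooted trees."""

    @lru_cache(maxsize=None)
    def dist(f, g):
        # f and g are forests: tuples of trees, each tree = (label, children).
        if not f and not g:
            return 0
        if not f:
            _, kids = g[-1]
            return dist(f, g[:-1] + kids) + 1        # insert rightmost root of g
        if not g:
            _, kids = f[-1]
            return dist(f[:-1] + kids, g) + 1        # delete rightmost root of f
        (lv, kv), (lw, kw) = f[-1], g[-1]
        return min(
            dist(f[:-1] + kv, g) + 1,                # delete rightmost root of f
            dist(f, g[:-1] + kw) + 1,                # insert rightmost root of g
            dist(kv, kw) + dist(f[:-1], g[:-1]) + int(lv != lw),  # map roots
        )

    return dist((t1,), (t2,))

leaf = lambda label: (label, ())
a_bc = ("a", (leaf("b"), leaf("c")))
a_bd = ("a", (leaf("b"), leaf("d")))
a_b = ("a", (leaf("b"),))

print(tree_edit_distance(a_bc, a_bd))  # 1 (relabel c -> d)
print(tree_edit_distance(a_bc, a_b))   # 1 (delete c)
```

The recursion is only practical for small trees; the point of the paper is precisely that bounding the number of insertions and deletions by k lets most of this search space be pruned.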
ABSTRACT: Identifying cis-regulatory elements is crucial to understanding gene expression, which highlights the importance of the computational detection of overrepresented transcription factor binding sites (TFBSs) in coexpressed or coregulated genes. However, this is a challenging problem, especially when considering higher eukaryotic organisms.
We have developed a method, named TFM-Explorer, that searches for locally overrepresented TFBSs in a set of coregulated genes, which are modeled by profiles provided by a database of position weight matrices. The novelty of the method is that it takes advantage of spatial conservation in the sequence and supports multiple species. The efficiency of the underlying algorithm and its robustness to noise allow weak regulatory signals to be detected in large heterogeneous data sets.
TFM-Explorer provides an efficient way to predict TFBS overrepresentation in related sequences. Promising results were obtained in a variety of examples in human, mouse, and rat genomes. The software is publicly available at http://bioinfo.lifl.fr/TFM-Explorer.
ABSTRACT: We describe a new unifying framework to express comparison of arc-annotated sequences, which we call alignment of arc-annotated sequences. We first prove that this framework encompasses main existing models, which allows us to deduce complexity results for several cases from the literature. We also show that this framework gives rise to new relevant problems that have not been studied yet. We provide a thorough analysis of these novel cases by proposing two polynomial time algorithms and an NP-completeness proof. This leads to an almost exhaustive study of alignment of arc-annotated sequences.
13th String Processing and Information Retrieval (SPIRE'06). 01/2006;
ABSTRACT: This paper addresses the problem of multiple pattern matching for motifs encoded by Position Weight Matrices. We first present an algorithm that uses a multi-index table to preprocess the set of motifs, allowing a dramatic decrease in computation time. We then show how to take advantage of similar motifs to avoid useless computations.
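To make the problem statement concrete, here is the naive baseline that such an algorithm improves on: score every window of the sequence against every matrix and report those reaching a per-motif threshold. This is only a sketch of the problem, not the multi-index table method of the paper; the function name, the matrix, and the threshold are hypothetical.

```python
def scan(sequence, motifs, thresholds):
    """Return (motif_name, position, score) for each window scoring >= threshold.

    motifs: dict name -> PWM, where a PWM is a list of {letter: score} dicts.
    thresholds: dict name -> minimum score for a reported occurrence.
    """
    hits = []
    for name, pwm in motifs.items():
        width = len(pwm)
        for pos in range(len(sequence) - width + 1):
            score = sum(col[sequence[pos + k]] for k, col in enumerate(pwm))
            if score >= thresholds[name]:
                hits.append((name, pos, score))
    return hits

# Hypothetical two-column matrix.
motifs = {
    "m": [
        {"A": 2, "C": 0, "G": 0, "T": -1},
        {"A": 1, "C": 1, "G": -1, "T": 0},
    ]
}
print(scan("TAACG", motifs, {"m": 3}))  # [('m', 1, 3), ('m', 2, 3)]
```

The cost of this baseline grows with the product of sequence length, number of motifs, and motif width, which is what preprocessing the motif set aims to reduce.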