Discovery of Regulatory Elements is Improved by a Discriminatory Approach

The Bioinformatics Centre, Department of Biology and the Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen, Denmark.
PLoS Computational Biology (Impact Factor: 4.62). 11/2009; 5(11):e1000562. DOI: 10.1371/journal.pcbi.1000562
Source: PubMed


A major goal in post-genome biology is the complete mapping of the gene regulatory networks for every organism. Identification of regulatory elements is a prerequisite for realizing this ambitious goal. A common problem is finding regulatory patterns in promoters of a group of co-expressed genes, but contemporary methods are challenged by the size and diversity of regulatory regions in higher metazoans. Two key issues are the small amount of information contained in a pattern compared to the large promoter regions and the repetitive characteristics of genomic DNA, both of which lead to "pattern drowning". We present a new computational method for identifying transcription factor binding sites in promoters using a discriminatory approach with a large negative set encompassing a significant sample of the promoters from the relevant genome. The sequences are described by a probabilistic model, and the most discriminatory motifs are identified by maximizing the probability of the sets given the motif model and prior probabilities of motif occurrences in both sets. Due to the large number of promoters in the negative set, an enhanced suffix array is used to improve speed and performance. Using our method, we demonstrate higher accuracy than the best contemporary methods, high robustness when extending the length of the input sequences, and a strong correlation between our objective function and the correct solution. Using a large background set of real promoters instead of a simplified model leads to higher discriminatory power and markedly reduces the need for repeat masking, a common pre-processing step for other pattern finders.
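The discriminatory idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it swaps the probabilistic motif model for plain k-mer counting and the enhanced suffix array for Python dictionaries. The function names, the pseudocount, and the toy sequences are all assumptions introduced for illustration. Each k-mer is scored by the log ratio of the fraction of positive sequences containing it to the fraction of negative (background) sequences containing it.

```python
from collections import Counter
from math import log


def kmers_in(seq, k):
    """Return the set of distinct k-mers occurring in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sequence_counts(seqs, k):
    """For every k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update(kmers_in(s, k))
    return counts


def discriminative_kmers(positive, negative, k=6, pseudocount=1.0):
    """Rank k-mers seen in the positive set by a smoothed log ratio of
    occurrence frequencies (positive vs. negative). Hypothetical helper,
    not the objective function used in the paper."""
    pos = sequence_counts(positive, k)
    neg = sequence_counts(negative, k)
    n_pos, n_neg = len(positive), len(negative)
    scored = []
    for word, c_pos in pos.items():
        p = (c_pos + pseudocount) / (n_pos + 2 * pseudocount)
        q = (neg.get(word, 0) + pseudocount) / (n_neg + 2 * pseudocount)
        scored.append((word, log(p / q)))
    # Highest score first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda t: (-t[1], t[0]))
    return scored


# Toy data (invented): three "promoters" sharing TGACGTCA, plus a
# repetitive background that a naive overrepresentation test would reward.
positive = ["AATGACGTCATT", "CCTGACGTCAGG", "GGTGACGTCACC"]
negative = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "ACACACACACAC", "GTGTGTGTGTGT"]

for word, score in discriminative_kmers(positive, negative)[:3]:
    print(word, round(score, 3))
```

Because the score compares against real background sequences rather than a simplified background model, a k-mer that is frequent everywhere (for example, a repeat) receives a low score even without masking, which mirrors the motivation given in the abstract.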



Available from: Ole Winther
  • "Most use a fixed set of sequences and identify motifs that are overrepresented in this set compared to a Markov chain background model (Gibbs Sampler [13], MEME [14], and Weeder [15]). Other methods do discriminative analysis, where the goal is to identify motifs that are over-represented in a positive set compared to a negative or background set of sequences (DEME [16] and [17]). However, often we are dealing with transcriptome-wide measurements of gene expression, and a priori it is difficult to set a natural cut-off that defines the positive (or negative) set."
    ABSTRACT: Background: Post-transcriptional regulation of gene expression by small RNAs and RNA binding proteins is of fundamental importance in the development of complex organisms, and dysregulation of regulatory RNAs can influence the onset and progression of many diseases and is a potential target for their treatment. Post-transcriptional regulation by small RNAs is mediated through partial complementary binding to messenger RNAs, leaving nucleotide signatures or motifs throughout the entire transcriptome. Computational methods for discovery and analysis of sequence motifs in high-throughput mRNA expression profiling experiments are becoming increasingly important tools for the identification of post-transcriptional regulatory motifs and the inference of the regulators and their targets. Results: cWords is a method designed for regulatory motif discovery in differential case–control mRNA expression datasets. We have improved the algorithms and statistical methods of cWords, resulting in a speed gain of at least a factor of 100 over the previous implementation. On a benchmark dataset of 19 microRNA (miRNA) perturbation experiments, cWords showed equal or better performance than two comparable methods, miReduce and Sylamer. We have developed rigorous motif clustering and visualization that accompany the cWords analysis for more intuitive and effective data interpretation. To demonstrate the versatility of cWords, we show that it can also be used for identification of potential siRNA off-target binding. Moreover, cWords analysis of an experiment profiling mRNAs bound by Argonaute ribonucleoprotein particles discovered endogenous miRNA binding motifs. Conclusions: cWords is an unbiased, flexible and easy-to-use tool designed for regulatory motif discovery in differential case–control mRNA expression datasets. cWords is based on rigorous statistical methods that demonstrate comparable or better performance than other existing methods. Rich visualization of results promotes intuitive and efficient interpretation of data. cWords is available as a stand-alone open-source program on GitHub and as a web-service at:
    Silence 05/2013; 4(1):2. DOI:10.1186/1758-907X-4-2
  • ABSTRACT: Motif discovery is an important bioinformatics problem for deciphering gene regulation. Numerous sequence-based approaches have been proposed that employ motif models (evaluation functions) designed by human specialists, but their performance on benchmarks is so unsatisfactory that the underlying sequence information seems to have been fully exploited already and the approach doomed. However, we have found that even a simple modified representation achieves high performance on a challenging benchmark, implying potential for sequence-based motif discovery. We therefore raise the problem of learning motif evaluation functions. We employ genetic programming (GP), which has the potential to evolve human-competitive models. We take advantage of a terminal set containing specialist-model-like components and have tried three fitness functions. The results reveal both great challenges and great potential. No learnt model performs universally well on the challenging benchmark; one reason may be that the data are not appropriate for sequence-based motif discovery. However, when applied to different widely tested datasets, the same models achieve performance comparable to existing approaches based on specialist models. The study calls for further novel GP work on learning effective evaluation models at different levels, from strict to loose, that exploit sequence information for motif discovery: quantitative functions, cardinal rankings, and learning feasibility classifications.
    Genetic and Evolutionary Computation Conference, GECCO 2010, Proceedings, Portland, Oregon, USA, July 7-11, 2010; 01/2010
  • ABSTRACT: Cap analysis gene expression (CAGE) is a method to identify the 5' ends of transcripts, allowing the discovery of new promoters and the quantification of gene activity. Because they combine promoter locations with their expression levels, CAGE data are essential for annotation-agnostic studies of regulatory gene networks. However, CAGE requires large amounts of input RNA, which usually are not obtainable from highly refined samples such as tissue microdissections or subcellular fractions. The nanoCAGE method can capture the 5' ends of transcripts from as little as 10 ng of total RNA and takes advantage of the capacity of current sequencers to produce longer (50-100 bp) reads. The method prepares cap-selected cDNAs ready for direct sequencing of their 5' ends (optionally mate-paired with the 3' end), which can provide information about downstream sequences. This protocol describes how to prepare nanoCAGE libraries from as little as 50 ng of total RNA within two working days. The libraries can be sequenced using an Illumina sequencer Genome Analyzer IIX [corrected] with a level of sensitivity 1000 times higher than CAGE.
    Cold Spring Harbor Protocols 02/2011; 2011(1):pdb.prot5559. DOI:10.1101/pdb.erratum2011_01 · 4.63 Impact Factor