Discovery of Regulatory Elements is Improved by a Discriminatory Approach

The Bioinformatics Centre, Department of Biology and the Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen, Denmark.
PLoS Computational Biology (Impact Factor: 4.83). 11/2009; 5(11):e1000562. DOI: 10.1371/journal.pcbi.1000562
Source: PubMed

ABSTRACT A major goal in post-genome biology is the complete mapping of the gene regulatory networks for every organism. Identification of regulatory elements is a prerequisite for realizing this ambitious goal. A common problem is finding regulatory patterns in promoters of a group of co-expressed genes, but contemporary methods are challenged by the size and diversity of regulatory regions in higher metazoans. Two key issues are the small amount of information contained in a pattern compared to the large promoter regions and the repetitive characteristics of genomic DNA, which both lead to "pattern drowning". We present a new computational method for identifying transcription factor binding sites in promoters using a discriminatory approach with a large negative set encompassing a significant sample of the promoters from the relevant genome. The sequences are described by a probabilistic model and the most discriminatory motifs are identified by maximizing the probability of the sets given the motif model and prior probabilities of motif occurrences in both sets. Due to the large number of promoters in the negative set, an enhanced suffix array is used to improve speed and performance. Using our method, we demonstrate higher accuracy than the best of contemporary methods, high robustness when extending the length of the input sequences, and a strong correlation between our objective function and the correct solution. Using a large background set of real promoters instead of a simplified model leads to higher discriminatory power and markedly reduces the need for repeat masking, a common pre-processing step for other pattern finders.

  • ABSTRACT: Motif discovery is an important bioinformatics problem for deciphering gene regulation. Numerous sequence-based approaches have been proposed employing human specialist motif models (evaluation functions), but performance on benchmarks is so unsatisfactory that the underlying sequence information seems to have already been exhausted, leaving such approaches doomed. However, we have found that even a simple modified representation still achieves considerably high performance on a challenging benchmark, implying potential for sequence-based motif discovery. Thus we raise the problem of learning motif evaluation functions. We employ genetic programming (GP), which has the potential to evolve human-competitive models. We take advantage of a terminal set containing specialist-model-like components and have tried three fitness functions. The results show both great challenges and great potential. No learnt model performs universally well on the challenging benchmark; one reason may be the appropriateness of the data for sequence-based motif discovery. However, when applied to different widely tested datasets, the same models achieve performance comparable to existing approaches based on specialist models. The study calls for further novel GP to learn effective evaluation models at different levels, from strict to loose, for exploiting sequence information in motif discovery, namely quantitative functions, cardinal rankings, and learning feasibility classifications.
    Genetic and Evolutionary Computation Conference, GECCO 2010, Proceedings, Portland, Oregon, USA, July 7-11, 2010; 01/2010
  • ABSTRACT: Cap analysis gene expression (CAGE) is a method to identify the 5' ends of transcripts, allowing the discovery of new promoters and the quantification of gene activity. Combining promoter location and their expression levels, CAGE data are essential for annotation-agnostic studies of regulatory gene networks. However, CAGE requires large amounts of input RNA, which usually are not obtainable from highly refined samples such as tissue microdissections or subcellular fractions. The nanoCAGE method can capture the 5' ends of transcripts from as little as 10 ng of total RNA and takes advantage of the capacity of current sequencers to produce longer (50-100 bp) reads. The method prepares cap-selected cDNAs ready for direct sequencing of their 5' ends (optionally mate-paired with the 3' end) that can provide information about downstream sequences. This protocol describes how to prepare nanoCAGE libraries from as little as 50 ng of total RNA within two working days. The libraries can be sequenced using an Illumina sequencer Genome Analyzer IIX [corrected] with a level of sensitivity 1000 times higher than CAGE.
    Cold Spring Harbor Protocols 02/2011; 2011(1):pdb.prot5559. DOI:10.1101/pdb.erratum2011_01
  • ABSTRACT: CompleteMOTIFs (cMOTIFs) is an integrated web tool developed to facilitate systematic discovery of overrepresented transcription factor binding motifs from high-throughput chromatin immunoprecipitation experiments. Comprehensive annotations and Boolean logic operations on multiple peak locations enable users to focus on genomic regions of interest for de novo motif discovery using tools such as MEME, Weeder and ChIPMunk. The pipeline incorporates a scanning tool for known motifs from the TRANSFAC and JASPAR databases, and performs an enrichment test using local or precalculated background models that significantly improve the motif scanning result. Furthermore, using the cMOTIFs pipeline, we demonstrated that multiple transcription factors could cooperatively bind upstream of important stem cell differentiation regulators. AVAILABILITY:
    Bioinformatics 03/2011; 27(5):715-7. DOI:10.1093/bioinformatics/btq707