A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites

Department of Biochemistry, Stanford University, CA 94305, USA.
Nucleic Acids Research (Impact Factor: 9.11). 02/2006; 34(20):5730-9. DOI: 10.1093/nar/gkl585
Source: PubMed

ABSTRACT Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif or not, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but how similar it is to specific, known k-mers in the motif. We elucidate the reasons why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs.

Download full-text


Available from: Douglas L. Brutlag, Sep 18, 2014
  • Source
    • "While requiring perfect conservation across many genomes is of limited use, the increased power enables approaches that account for artifacts in sequencing, assembly and alignment, and tolerate diverged, missing, or moved motif instances. Our BLS measure is more generally applicable to PWMs (Stormo 2000), to more complex models of regulatory motifs that account for dependencies between individual motif positions (Naughton et al. 2006; Yada et al. 1998), and to more advanced rules for miRNA-target recognition that for example score the contribution of the 3'pairing energy (Brennecke et al. 2005; Stark et al. 2003). "
    [Show abstract] [Hide abstract]
    ABSTRACT: Gene expression is regulated pre- and post-transcriptionally via cis-regulatory DNA and RNA motifs. Identification of individual functional instances of such motifs in genome sequences is a major goal for inferring regulatory networks yet has been hampered due to the motifs' short lengths that lead to many chance matches and poor signal-to-noise ratios. In this paper, we develop a general methodology for the comparative identification of functional motif instances across many related species, using a phylogenetic framework that accounts for the evolutionary relationships between species, allows for motif movements, and is robust against missing data due to artifacts in sequencing, assembly, or alignment. We also provide a robust statistical framework for evaluating motif confidence, which enables us to translate evolutionary conservation into a confidence measure for each motif instance, correcting for varying motif length, composition, and background conservation of the target regions. We predict targets of fly transcription factors and miRNAs in alignments of 12 recently sequenced Drosophila species. When compared to extensive genome-wide experimental data, predicted targets are of high quality, matching and surpassing ChIP-chip microarrays and recovering miRNA targets with high sensitivity. The resulting regulatory network suggests significant redundancy between pre- and post-transcriptional regulation of gene expression.
    Genome Research 01/2008; 17(12):1919-31. DOI:10.1101/gr.7090407 · 13.85 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Position weight matrices (PMWs) are simple models commonly used in motif-finding algorithms to identify short functional elements, such as cis-regulatory motifs, on genes. When few experimentally verified motifs are available, estimation of the PWM may be poor. The resultant PWM may not reliably discriminate a true motif from a false one. While experimentally identifying such motifs remains time-consuming and expensive, low-resolution binding data from techniques such as ChIP-on-chip and ChIP-PET have become available. We propose a novel but simple method to improve a poorly estimated PWM using ChIP data. Starting from an existing PWM, a set of ChIP sequences, and a set of background sequences, our method, GAPWM, derives an improved PWM via a genetic algorithm that maximizes the area under the receiver operating characteristic (ROC) curve. GAPWM can easily incorporate prior information such as base conservation. We tested our method on two PMWs (Oct4/Sox2 and p53) using three recently published ChIP data sets (human Oct4, mouse Oct4 and human p53). GAPWM substantially increased the sensitivity/specificity of a poorly estimated PWM and further improved the quality of a good PWM. Furthermore, it still functioned when the starting PWM contained a major error. The ROC performance of GAPWM compared favorably with that of MEME and others. With increasing availability of ChIP data, our method provides an alternative for obtaining high-quality PWMs for genome-wide identification of transcription factor binding sites. The C source code and all data used in this report are available at Supplementary data are available at Bioinformatics online.
    Bioinformatics 06/2007; 23(10):1188-94. DOI:10.1093/bioinformatics/btm080 · 4.62 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. Availability and Implementation: The code is publicly available at Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2009; 25(12):i321-9. DOI:10.1093/bioinformatics/btp230 · 4.62 Impact Factor
Show more