A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites

Department of Biochemistry, Stanford University, CA 94305, USA.
Nucleic Acids Research (Impact Factor: 9.11). 02/2006; 34(20):5730-9. DOI: 10.1093/nar/gkl585
Source: PubMed


Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper, we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but on how similar it is to specific, known k-mers in the motif. We elucidate the reasons why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs.
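
The contrast drawn in the abstract, between scoring a candidate against a whole-motif model and comparing it to individual known k-mers, can be illustrated with a minimal sketch. This is not the MotifScan algorithm, whose details are not given in the abstract. The function names, the pseudocount, the uniform background, the Hamming-distance similarity measure, and the toy sites are all assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length known sites.

    Assumes a uniform background and a fixed pseudocount (illustrative choices).
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pssm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return pssm


def pssm_score(pssm, kmer):
    """Sum per-position log-odds; each position is treated independently."""
    return sum(column[base] for column, base in zip(pssm, kmer))


def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


def nearest_site_distance(sites, kmer):
    """Distance from a candidate to the closest specific known k-mer."""
    return min(hamming(site, kmer) for site in sites)


# Hypothetical toy motif made of two clusters of identical k-mers.
sites = ["ACGTAC", "ACGTAC", "TGCATG", "TGCATG"]
real = "ACGTAC"     # a known site
chimera = "ACGATG"  # first half from one cluster, second half from the other

pssm = build_pssm(sites)
real_score = pssm_score(pssm, real)
chimera_score = pssm_score(pssm, chimera)
real_distance = nearest_site_distance(sites, real)
chimera_distance = nearest_site_distance(sites, chimera)
```

On this toy data the PSSM assigns the chimera the same score as a known site, because every base it contains is common at its position, while the nearest-known-k-mer distance separates the two (0 mismatches for the known site, 3 for the chimera). A graph-based method built on this idea would presumably connect k-mers whose distance falls under some threshold; that threshold and the graph construction are left out here because the abstract does not specify them.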

Available from: Douglas L. Brutlag, Sep 18, 2014
    • "By considering set of highly conserved coding sequences upstream region across many strains is reliable true upstream region. The upstream sequence data has been represented by position-specific scoring matrix (PSSM) [12], [13], [14], [15], [16], and has limitations of considering the nucleotide positions to be independent of each other. There are however several indications that this is not the case both within genomes, in the sense that each nucleotide in the upstream sequence is independent of the neighboring nucleotides [3], [7], and between the genomes of different species or strains [17]. "
    ABSTRACT: The upstream region of coding genes is important for several reasons, for instance locating transcription factor, binding sites, and start site initiation in genomic DNA. Motivated by a recently conducted study, where multivariate approach was successfully applied to coding sequence modeling, we have introduced a partial least squares (PLS) based procedure for the classification of true upstream prokaryotic sequence from background upstream sequence. The upstream sequences of conserved coding genes over genomes were considered in analysis, where conserved coding genes were found by using pan-genomics concept for each considered prokaryotic species. PLS uses position specific scoring matrix (PSSM) to study the characteristics of upstream region. Results obtained by PLS based method were compared with Gini importance of random forest (RF) and support vector machine (SVM), which is much used method for sequence classification. The upstream sequence classification performance was evaluated by using cross validation, and suggested approach identifies prokaryotic upstream region significantly better to RF ( ) and SVM ( ). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 05/2015; 12(3):560-567. DOI:10.1109/TCBB.2014.2366146 · 1.44 Impact Factor
    • "Treating each position as contributing to uptake independently may have oversimplified the true uptake bias, so we next considered whether the bases at different positions in a fragment interacted to determine its probability of uptake, as has been found for some transcription factor binding sites (48–51). Pairwise interactions contributing to DNA uptake were detected by examining whether recovered reads with non-consensus bases at specific ‘focal’ positions had altered base compositions at other positions. "
    ABSTRACT: Some naturally competent bacteria exhibit both a strong preference for DNA fragments containing specific 'uptake sequences' and dramatic overrepresentation of these sequences in their genomes. Uptake sequences are often assumed to directly reflect the specificity of the DNA uptake machinery, but the actual specificity has not been well characterized for any bacterium. We produced a detailed analysis of Haemophilus influenzae's uptake specificity, using Illumina sequencing of degenerate uptake sequences in fragments recovered from competent cells. This identified an uptake motif with the same consensus as the motif overrepresented in the genome, with a 9 bp core (AAGTGCGGT) and two short flanking T-rich tracts. Only four core bases (GCGG) were critical for uptake, suggesting that these make strong specific contacts with the uptake machinery. Other core bases had weaker roles when considered individually, as did the T-tracts, but interaction effects between these were also determinants of uptake. The properties of genomic uptake sequences are also constrained by mutational biases and selective forces acting on USSs with coding and termination functions. Our findings define constraints on gene transfer by natural transformation and suggest how the DNA uptake machinery overcomes the physical constraints imposed by stiff highly charged DNA molecules.
    Nucleic Acids Research 06/2012; 40(17):8536-49. DOI:10.1093/nar/gks640 · 9.11 Impact Factor
    • "Several motif detection algorithms work based on designing hard constraints on features associated with motifs, like distance to transcription start site (TSS) (Sinha et al., 2008). Recently, there has been a number of works in the literature that focus on refining predictive models for individual TFBS by using a wide range of features that have been shown to correlate well with regulatory regions in general and with TFBSs in particular, without necessarily modeling the CRM structure (Narlikar et al., 2007; Naughton et al., 2006; Pudimat et al., 2004; Sharon and Segal, 2007). Using biologically motivated features like presence or absence of CpG islands, nucleosome sites, and helical structures, they appear to be able to significantly outperform models based on PWM motif representation alone. "
    ABSTRACT: Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. Availability and Implementation: The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html. Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2009; 25(12):i321-9. DOI:10.1093/bioinformatics/btp230 · 4.98 Impact Factor