A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites

Department of Biochemistry, Stanford University, CA 94305, USA.
Nucleic Acids Research (Impact Factor: 8.81). 02/2006; 34(20):5730-9. DOI: 10.1093/nar/gkl585
Source: PubMed

ABSTRACT Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but on how similar it is to specific, known k-mers in the motif. We explain why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs.
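The contrast drawn in the abstract, between scoring a k-mer against a whole-motif model and comparing it with individual known k-mers, can be illustrated with a minimal Python sketch. This is not the MotifScan algorithm: the abstract does not specify how the graph is built or how similarity is measured, so the Hamming distance, the `max_distance` threshold, the pseudocount, the uniform background, and all function names below are assumptions made purely for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict of base -> score per position).

    Assumes equal-length sites over ACGT and a uniform background;
    both are simplifying assumptions for this sketch.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pssm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return pssm


def pssm_score(pssm, kmer):
    """Sum per-position log-odds scores; positions are treated independently."""
    return sum(column[base] for column, base in zip(pssm, kmer))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def nearest_known_distance(sites, kmer):
    """Distance from the candidate to its closest known binding site."""
    return min(hamming(site, kmer) for site in sites)


def is_motif_member(sites, kmer, max_distance=1):
    """Hypothetical decision rule: accept a candidate only if it lies within
    max_distance mismatches of at least one specific known k-mer."""
    return nearest_known_distance(sites, kmer) <= max_distance


if __name__ == "__main__":
    # Toy motif with two clusters of similar k-mers (invented data).
    known_sites = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG"]
    pssm = build_pssm(known_sites)
    for candidate in ["AAAAAA", "AAACCC", "AAAAAC"]:
        print(
            candidate,
            round(pssm_score(pssm, candidate), 3),
            nearest_known_distance(known_sites, candidate),
            is_motif_member(known_sites, candidate),
        )
```

In this toy data the chimeric candidate `AAACCC` receives the same PSSM score as the known site `AAAAAA`, because the matrix scores each position independently, yet it is three mismatches away from every known k-mer and is rejected by the neighbor-based rule. That is the kind of nucleotide dependency a per-position model cannot represent.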


Available from: Douglas L. Brutlag, Sep 18, 2014
  • ABSTRACT: Human endogenous retroviruses (HERVs) have been found to act as etiological cofactors in several chronic diseases, including cancer, autoimmunity and neurological dysfunction. The immunosuppressive domain (ISD) is a conserved region of the transmembrane protein (TM) encoded by the envelope gene (env) of retroviruses. In vitro and in vivo evidence has shown that retroviral TM is highly immunosuppressive and that a synthetic peptide (CKS-17) with homology to the ISD inhibits immune function. The ISD is a potential pathogenic element in HERVs. However, fewer than one hundred HERV ISDs have been annotated so far, and general-purpose domain prediction software does not achieve sufficient accuracy for ISDs specifically. In this paper, a computational model is proposed to identify ISDs in HERVs based on genome sequences only. It has a classification accuracy of 97.9% in a jackknife test. 117 HERV families were scanned with the model, and 1002 new putative ISDs were predicted and annotated in the human chromosomes. The model can also be used to search for ISDs in human T-lymphotropic virus (HTLV), simian T-lymphotropic virus (STLV) and murine leukemia virus (MLV) because of the evolutionary relationship between endogenous and exogenous retroviruses. Furthermore, software named ISDTool has been developed to facilitate the application of the model. Datasets and the software involved in the paper are all available at
    Computational Biology and Chemistry 02/2014; 49C:45-50. DOI:10.1016/j.compbiolchem.2014.02.001 · 1.60 Impact Factor
  • ABSTRACT: Eugenol-O-methyltransferase (EOMT) catalyzes the conversion of eugenol to methyleugenol in one of the final steps of the phenylpropanoid pathway. There are no comprehensive reports on comparative EOMT gene expression and developmental stage-specific accumulation of phenylpropenes in Ocimum tenuiflorum. Seven chemotypes, rich in eugenol and methyleugenol, were selected by assessment of volatile metabolites through multivariate data analysis. Isoeugenol accumulated at higher levels during the juvenile stage (36.86 ng g⁻¹), but fell sharply during the preflowering (8.04 ng g⁻¹), flowering (2.29 ng g⁻¹) and postflowering stages (0.17 ng g⁻¹), whereas methyleugenol content in fresh tissue gradually increased from the juvenile stage (12.25 ng g⁻¹) up to preflowering (16.35 ng g⁻¹) and then decreased at flowering (7.13 ng g⁻¹) and postflowering (5.95 ng g⁻¹). Extreme variations in free intracellular and alkali-hydrolysable cell wall-released phenylpropanoid compounds were observed at different developmental stages. Analyses of EOMT genomic and cDNA sequences revealed an 843 bp open reading frame and the presence of a 90 bp intron. The translated proteins had eight catalytic domains, the two major ones being the dimerisation superfamily and the methyltransferase_2 superfamily. A validated 3D structure of the EOMT protein was also determined. The chemotype Ot7 had a reduced reading frame that lacked both dimerisation domains and one of the two protein-kinase-phosphorylation sites; this was also reflected in reduced accumulation of methyleugenol compared to other chemotypes. EOMT transcripts showed enhanced expression in the juvenile stage that increased further during preflowering but decreased at flowering and further at postflowering. The expression patterns may be compared and correlated with the amounts of eugenol/isoeugenol and methyleugenol at different developmental stages of all chemotypes.
    Molecular Biology Reports 01/2014; DOI:10.1007/s11033-014-3035-7 · 1.96 Impact Factor