The XXmotif web server for eXhaustive, weight matriX-based motif discovery in nucleotide sequences

Gene Center, Department of Biochemistry, and Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians-Universität (LMU) München, Feodor-Lynen-Straße 25, 81377 Munich, Germany.
Nucleic Acids Research (Impact Factor: 9.11). 06/2012; 40(Web Server issue):W104-9. DOI: 10.1093/nar/gks602
Source: PubMed

ABSTRACT The discovery of regulatory motifs enriched in sets of DNA or RNA sequences is fundamental to the analysis of a great variety of functional genomics experiments. These motifs usually represent binding sites of proteins or non-coding RNAs, which are best described by position weight matrices (PWMs). We have recently developed XXmotif, a de novo motif discovery method that is able to directly optimize the statistical significance of PWMs. XXmotif can also score conservation and positional clustering of motifs. The XXmotif server provides (i) a list of significantly overrepresented motif PWMs with web logos and E-values; (ii) a graph with color-coded boxes indicating the positions of selected motifs in the input sequences; (iii) a histogram of the overall positional distribution for selected motifs and (iv) a page for each motif with all significant motif occurrences, their P-values for enrichment, conservation and localization, their sequence contexts and coordinates. Free access:
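For readers unfamiliar with PWMs, the following is a minimal sketch of how a log-odds position weight matrix can be built from aligned binding sites and used to score windows of a sequence. It is not XXmotif's algorithm, which optimizes the statistical significance of PWMs directly; it only illustrates the representation the abstract refers to. The function names, the uniform background frequency of 0.25, the pseudocount of 0.5, and the example sites are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Hypothetical helper: assumes a uniform background and a fixed
    pseudocount; real tools estimate both from the data.
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in ALPHABET
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Illustrative data only: four made-up sites resembling a TATAAT-like box.
example_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
example_pwm = build_pwm(example_sites)
best_score, best_start = best_hit(example_pwm, "GGCGTATAATGCC")
```

A positive score means the window resembles the motif more than the background; deciding whether such a score is statistically significant is the harder problem that methods like XXmotif address.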

  • Source
    ABSTRACT: Cellular regulation mechanisms that involve proteins and other active molecules interacting with specific targets often involve the recognition of sequence patterns. Short sequence elements on DNA, RNA and proteins play a central role in mediating such molecular recognition events. Studies that focus on measuring and investigating sequence-based recognition processes make use of statistical and computational tools that support the identification and understanding of sequence motifs. We present a new web application, named DRIMust, freely accessible through the website for de novo motif discovery services. The DRIMust algorithm is based on the minimum hypergeometric statistical framework and uses suffix trees for an efficient enumeration of motif candidates. DRIMust takes as input ranked lists of sequences in FASTA format and returns motifs that are over-represented at the top of the list, where the determination of the threshold that defines top is data driven. The resulting motifs are presented individually with an accurate P-value indication and as a Position Specific Scoring Matrix. Comparing DRIMust with other state-of-the-art tools demonstrated significant advantage to DRIMust, both in result accuracy and in short running times. Overall, DRIMust is unique in combining efficient search on large ranked lists with rigorous P-value assessment for the detected motifs.
    Nucleic Acids Research 05/2013; 41(Web Server issue). DOI:10.1093/nar/gkt407 · 9.11 Impact Factor
  • Source
    ABSTRACT: Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167-80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at
    Nucleic Acids Research 06/2013; 41(Web Server issue). DOI:10.1093/nar/gkt519 · 9.11 Impact Factor
  • Source
    ABSTRACT: In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually. We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text. The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at The data is available for online browsing and download.
    PLoS ONE 10/2013; 8(10):e77848. DOI:10.1371/journal.pone.0077848 · 3.53 Impact Factor