GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery

Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA.
Journal of computational biology: a journal of computational molecular cell biology (Impact Factor: 1.67). 02/2009; 16(2):317-29. DOI: 10.1089/cmb.2008.16TT
Source: PubMed

ABSTRACT Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated "ChIP" sequences with embedded known P53 binding sites. The major advantage of GADEM is its computational efficiency on large ChIP datasets compared to competitors. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6-40 bp) and characteristics in these datasets containing from 0.5 to >13 million nucleotides with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at (
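
The step in which a spaced dyad becomes a starting PWM can be illustrated with a minimal Python sketch. This is not the GADEM implementation: the function names, the 0.85 probability given to each word letter, and the uniform 0.25 columns used for the spacer are assumptions chosen for illustration only. The sketch builds one PWM column per position of word1, spacer, and word2, and computes a simple relative-entropy score against a uniform background.

```python
import math

BASES = "ACGT"


def dyad_to_pwm(word1, spacer_len, word2, core_prob=0.85):
    """Build a starting PWM from a spaced dyad (word1, spacer, word2).

    Hypothetical helper: core_prob is an assumed value, not taken
    from the GADEM paper. Each column is a dict base -> probability.
    """
    rest = (1.0 - core_prob) / 3.0
    pwm = []
    # Word positions: the observed letter gets most of the probability mass.
    for base in word1:
        pwm.append({b: (core_prob if b == base else rest) for b in BASES})
    # Spacer positions: no preference, so every base is equally likely.
    for _ in range(spacer_len):
        pwm.append({b: 0.25 for b in BASES})
    for base in word2:
        pwm.append({b: (core_prob if b == base else rest) for b in BASES})
    return pwm


def information_content(pwm, background=0.25):
    """Sum of per-column relative entropy (bits) against a uniform background."""
    return sum(
        p * math.log2(p / background)
        for column in pwm
        for p in column.values()
        if p > 0
    )


if __name__ == "__main__":
    start = dyad_to_pwm("ACGT", 3, "TTGA")
    print(len(start), "columns")
    print("information content (bits):", information_content(start))
```

In this sketch the spacer columns contribute zero bits, so the score is driven entirely by the two words; an EM refinement such as the one the abstract describes would then be free to sharpen or flatten any column.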

    • "For motif discovery, we employed the rGADEM package (v.1.0.1) [37], which is available through Bioconductor [38], with the default parameter (P-value < 0.0002) [39] with DNA sequences from −1,000 bp to +1,000 bp relative to TSSs of the candidate pancRNA-bearing genes. We calculated the observed frequencies of “CCGCCG” or “CGGCGG” sequences from −1,000 bp to +1,000 bp relative to TSSs of the candidate pancRNA-bearing genes with sliding window of width 100 bp. "
    ABSTRACT: The majority of non-coding RNAs (ncRNAs) involved in mRNA metabolism in mammals have been believed to downregulate the corresponding mRNA expression level in a pre- or post-transcriptional manner by forming short or long ncRNA-mRNA duplex structures. Information on non-duplex-forming long ncRNAs is now also rapidly accumulating. To examine the directional properties of transcription at the whole-genome level, we performed directional RNA-seq analysis of mouse and chimpanzee tissue samples. We found that there is only about 1% of the genome where both the top and bottom strands are utilized for transcription, suggesting that RNA-RNA duplexes are not abundantly formed. Focusing on transcription start sites (TSSs) of protein-coding genes revealed that a significant fraction of them contain switching-points that separate antisense- and sense-biased transcription, suggesting that head-to-head transcription is more prevalent than previously thought. More than 90% of head-to-head type promoters contain CpG islands. Moreover, CCG and CGG repeats are significantly enriched in the upstream regions and downstream regions, respectively, of TSSs located in head-to-head type promoters. Genes with tissue-specific promoter-associated ncRNAs (pancRNAs) show a positive correlation between the expression of their pancRNA and mRNA, which is in accord with the proposed role of pancRNA in facultative gene activation, whereas genes with constitutive expression generally lack pancRNAs. We propose that single-stranded ncRNA resulting from head-to-head transcription at GC-rich sequences regulates tissue-specific gene expression.
    BMC Genomics 01/2014; 15(1):35. DOI:10.1186/1471-2164-15-35 · 4.04 Impact Factor
    • "We then used a custom Python code to extract the sequences from the GRCh37 assembly stored locally. Next, we predicted the locations of the CTCF binding sites in the sequences using the GADEM software [24] with a CTCF position weight matrix (PWM) derived previously [24] (see in Additional file 1: Table S3). We declared a subsequence a CTCF binding site when its PWM score exceeded the score corresponding to the p-value cutoff of 0.0005. "
    ABSTRACT: A protein may bind to its target DNA sites constitutively, i.e., regardless of cell type. Intuitively, constitutive binding sites should be biologically functional. A prerequisite for understanding their functional relevance is knowing all their locations for a protein of interest. Genome-wide discovery of constitutive binding sites requires robust and efficient computational methods to integrate results from numerous binding experiments. Such methods are lacking, however. To locate constitutive binding sites for a protein using ChIP-seq data for that protein from multiple cell lines, we developed a method, T-KDE, which combines a binary range tree with a kernel density estimator. Using 132 CTCF (CCCTC-binding factor) ChIP-seq datasets, we showed that the number of constitutive sites identified by T-KDE is robust to the choice of tuning parameter and that T-KDE identifies binding site locations more accurately than a binning approach. Furthermore, T-KDE can identify constitutive sites that are missed by a motif-based approach either because a bound site failed to reach the motif significance cutoff or because the peak sequence scanned was too short. By studying sites declared constitutive by T-KDE but not by the motif-based approach, we discovered two new CTCF motif variants. Using ENCODE data on 22 transcription factors (TF) in 132 cell lines, we identified constitutive binding sites for each TF and provide evidence that, for some TFs, they may be biologically meaningful. T-KDE is an efficient and effective method to predict constitutive protein binding sites using ChIP-seq peaks from multiple cell lines. Besides constitutive binding sites for a given protein, T-KDE can identify genomic "hot spots" where several different proteins bind and, conversely, cell-type-specific sites bound by a given protein.
    BMC Genomics 01/2014; 15(1):27. DOI:10.1186/1471-2164-15-27 · 4.04 Impact Factor
    • "build mm10). A putative ERE sequence with the position weight matrix (PWM) constructed from 48 experimentally identified EREs (15 bp in length) was scanned using GADEM software (Jin et al. 2004; Li 2009). CpGs were identified using EpiDesigner software ( "
    ABSTRACT: Diethylstilbestrol (DES) is a synthetic estrogen that is associated with adverse effects on reproductive organs. DES-induced toxicity of the mouse seminal vesicle (SV) is mediated by ERα with altered expression of seminal vesicle secretory protein IV (Svs4) and lactoferrin (Ltf) genes. We examined a role for nuclear receptor activity in association with DNA methylation and altered gene expression. We used the neonatal DES exposure mouse model to examine DNA methylation patterns via bisulfite conversion sequencing in WT and αERKO SVs. DNA methylation status at 4 specific CpGs (-160, -237, -306 and -367) in the Svs4 gene promoter changes during mouse development from methylated to unmethylated, and DES prevents this change at 10 weeks of age in WT SV. DES alters the methylation status from methylated to unmethylated at 2 specific CpGs (-449 and -459) of the Ltf gene promoter. Alterations in DNA methylation of Svs4 and Ltf were not observed in αERKO SV, suggesting that changes of methylation status at these CpGs are ERα dependent. The methylation status associates with the level of gene expression. In addition, gene expression of three epigenetic modifiers, DNMT3A, MBD2, and HDAC2, increased after DES exposure in WT SV. DES-induced hormonal toxicity results from altered gene expression of Svs4 and Ltf associated with changes in DNA methylation that are mediated by ERα. Alterations in gene expression of DNMT3A, MBD2 and HDAC2 after DES exposure may be involved in mediating the changes in methylation status in the SVs of male mice.
    Environmental Health Perspectives 12/2013; 122(3). DOI:10.1289/ehp.1307351 · 7.98 Impact Factor