
JOURNAL OF COMPUTATIONAL BIOLOGY

Volume 16, Number 2, 2009

© Mary Ann Liebert, Inc.

Pp. 317–329

DOI: 10.1089/cmb.2008.16TT

GADEM: A Genetic Algorithm Guided Formation

of Spaced Dyads Coupled with an EM Algorithm

for Motif Discovery

LEPING LI

ABSTRACT

Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP

dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally

feasible using current methods that employ probabilistic models. We propose an

efficient method, GADEM, which combines spaced dyads and an expectation-maximization

(EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads

are prioritized by their degree of overrepresentation in the input sequence data. Spaced

dyads are converted into starting position weight matrices (PWMs). GADEM then employs

a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs,

to guide the evolution of a population of spaced dyads toward one whose entropy scores

are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified

significance threshold are declared motifs. GADEM performed comparably with MEME on

500 sets of simulated “ChIP” sequences with embedded known P53 binding sites. The

major advantage of GADEM is its computational efficiency on large ChIP datasets compared

to competitors. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to

30 motifs of various lengths were identified in each dataset. Remarkably, without any prior

motif information, the expected known motif (e.g., P53 in P53 data) was identified every

time. GADEM discovered motifs of various lengths (6–40 bp) and characteristics in these

datasets containing from 0.5 to >13 million nucleotides with run times of 5 to 96 h. GADEM

can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for

de novo motif discovery in large-scale genome-wide data. The GADEM software is available

at www.niehs.nih.gov/research/resources/software/GADEM/.

Key words: ChIP, de novo motif discovery, expectation-maximization, genetic algorithm, k-mer,

spaced dyad.

Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North

Carolina.


1. INTRODUCTION

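Recently, genome-wide location analyses have been carried out for proteins such as OCT4

(Boyer et al., 2005), P53 (Wei et al., 2006), ERα (Carroll et al., 2006; Lin et al., 2007), c-Myc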

(Zeller et al., 2006), CTCF (Kim et al., 2007), FOXP3 (Zheng et al., 2007), NRSF (Johnson et al., 2007),

STAT1 (Robertson et al., 2007), and for histone markers (Bernstein et al., 2006; Lee et al., 2006; Mikkelsen

et al., 2007; Pan et al., 2007; Schones et al., 2008; Wang et al., 2008). One goal of these studies is to

discover short functional elements such as cis-elements embedded in these sites that are a few hundred

to tens of thousands of nucleotides long. Computational tools for de novo motif discovery in such massive

data are needed.

During the last decade or so, many de novo motif discovery methods have been developed (Bailey and

Elkan, 1994; Buhler and Tompa, 2002; Down and Hubbard, 2005; Elemento et al., 2007; Eden et al.,

2007; Hertz and Stormo, 1999; Linhart et al., 2008; Liu et al., 2001, 2002; Pavesi et al., 2001; Pevzner

and Szu, 2000; Roth et al., 1998; Sinha and Tompa, 2002; Sumazin et al., 2005; Thijs et al., 2001;

van Helden et al., 2000). These methods fall into two categories: word enumeration and local search.

The performance of many algorithms has recently been assessed (Tompa et al., 2005). Word-enumeration

techniques count the number of occurrences of a motif, defined as a string of letters {a, c, g, t}, sometimes

with degenerate letters (e.g., y = c or t and r = a or g), of a certain length (e.g., 6–20) in the sequence data. When

no degenerate letters are used in the motif profile/model, a subsequence is considered an instance of the

motif when the number of mismatches between the subsequence and the motif is less than a threshold.

The motifs are then rank-ordered based on their overrepresentation; thus, these approaches guarantee the

global optimum, e.g., producing motifs with the highest overrepresentation. Many methods in this group

have been developed. For instance, Consensus (Hertz and Stormo, 1999) first uses each k-mer as the

first sequence of an alignment matrix, which is then updated further. The PROJECTION

algorithm (Buhler and Tompa, 2002) projects every l-mer in the input data into a smaller space by hashing.

Other methods in this category include WINNOWER (Pevzner and Szu, 2000), spaced dyads (van Helden

et al., 2000; Li et al., 2002), Weeder (Pavesi et al., 2001), MITRA (Eskin and Pevzner, 2002), YMF

(yeast motif finder) (Sinha and Tompa, 2002), DWE (Sumazin et al., 2005), Drim (Eden et al., 2007), and

FIRE (Elemento et al., 2007). Recently, Xie et al. (2007) enumerated a list of candidate k-mers (12–22

nucleotides) and counted the number of matching instances in a set of conserved noncoding elements in

the human genome.

Perhaps more widely used approaches employ local search techniques such as EM and Gibbs sampling

(for recent reviews, see Jensen et al. [2004] and van Nimwegen [2007]). Unlike word-enumeration, the

local search techniques use the PWM as the motif model. Initially, these models are either pre-defined or

randomly specified. The models are then updated by an iterative process until convergence. Local search

methods include MEME (Bailey and Elkan, 1994), AlignACE (Roth et al., 1998), BioProspector (Liu

et al., 2001), motifSampler (Thijs et al., 2001), MDScan (Liu et al., 2002), NestedMICA (Down and

Hubbard, 2005) and fdrMotif (Li et al., 2008).

One advantage of the local search methods is that initial motif models are iteratively updated. On the

other hand, the search space can be very large. Consequently, one of the main challenges for the local

search techniques is how to obtain starting positions for the local search algorithms. For instance, MEME

converts each subsequence of length, w, into a letter probability matrix and uses it as the starting point for

its EM algorithm. Only one step of EM is carried out for each starting matrix. The resulting best models

are subjected to full EM. Motifs with the strongest statistical significances (E-values) are reported. This

approach almost guarantees good starting positions for the EM algorithm. However, examining all possible

starting positions for various motif widths w (e.g., 6–30) for a large dataset is computationally too costly

to be practically useful. Here we present an efficient method that combines existing algorithms to identify

good starting positions for an EM algorithm for unbiased motif discovery in large-scale datasets.

Our method begins by counting the number of matching instances of all k-mers (k = 3, 4, 5, 6) in the

data. For instance, there are $4^3 = 64$ possible tri-nucleotides (3-mers). For each k, the k-mers are

rank-ordered based on their overrepresentation. The top-ranked k-mers for all four values of k are subsequently used

as the words for the spaced dyads (Li et al., 2002; van Helden et al., 2000). The top-ranked words may

be viewed as “seeds” for a motif. Unlike the word-enumeration methods, we do not count the numbers

of matching instances of the spaced dyads in the data. Instead, we convert the spaced dyads into letter



probability matrices (similar to MEME), which, in turn, are used as the initial models for a local search

technique via an EM algorithm. Thus, one might regard our method as a hybrid of the word enumeration

and local search techniques. Similar hybrid methods have been proposed. For instance, Eskin (2004)

developed the MITRA-PSSM algorithm that combined an efficient branch and bound algorithm for finding

consensus patterns and a local search algorithm.

A spaced dyad consists of two words separated by a certain number of spacers (unspecified bases),

d. If we choose $N_k$ top-ranked k-mers as the possible words and up to d spacers for the spaced dyads, the

number of possible spaced dyads constructed from these building blocks can be large. In theory, there are

$(N_3 + N_4 + N_5 + N_6) \times (d + 1) \times (N_3 + N_4 + N_5 + N_6)$ possible spaced dyads when both words of the spaced

dyads can independently come from any of the four k-mer groups and there are 0 to d spacers between the

words. Subjecting all of them (after conversion to probability matrices) to EM is impractical. Therefore,

we employ a genetic algorithm (GA) (Goldberg, 1989) to guide the formation of spaced dyads so that only

a small fraction of the spaced dyads are examined while finding most or all motifs. A GA is very effective

in searching high-dimensional space and has been used in many optimization/search problems. Earlier,

Wei and Jensen (2006) proposed a GA-based approach, GAME, to evolve motifs from randomly generated

starting motifs. We refer to our method as GADEM (Genetic Algorithm guided formation of spaced Dyads

coupled with EM for Motif identification).

2. METHODS

2.1. Overview

GADEM employs a genetic algorithm (GA) to guide the formation of a “population” of spaced dyads.

Each spaced dyad is converted into a letter probability matrix, which, in turn, serves as the starting PWM

for an EM algorithm. The EM-optimized PWM is then used to scan for binding sites in the data. A

subsequence of the length of the PWM is declared a binding site when the p-value of its PWM score

is less than or equal to a pre-specified threshold (e.g., $2.5 \times 10^{-5}$). The significance (E-value) of the

alignment of the binding sites (referred to as a motif) is then computed and the logarithm of the E-

value is used as the fitness score for the spaced dyad from which the motif is derived. A GA is used to

“evolve” the spaced dyads in the population through several generations (e.g., five). The resulting unique

motifs with fitness values less than or equal to a pre-specified cutoff are reported, and the corresponding

binding sites in the original sequences are subsequently masked. The above procedure is repeated until

no further motifs can be found. A workflow of the GADEM algorithm is shown in Figure 1. Details are

given below.

2.2. Spaced dyads

A spaced dyad consists of two words that are separated by spacers (Li et al., 2002; van Helden et al.,

2000). Let D denote a spaced dyad, $D = a_1\text{-}n_x\text{-}a_2$, where $a_1$ and $a_2$ are the first and second words of the

dyad, n is a string of unspecified nucleotides, x is the number of them (width of spacer), $x = 0, 1, 2, \ldots, d$, and

d is the pre-specified maximal width of the spacer (e.g., d = 10). We limit the words to 3–6 letters in

length, consisting of only {a, c, g, t}. Enumerating all possible words and spacer widths would generate

$(4^3 + 4^4 + 4^5 + 4^6) \times 11 \times (4^3 + 4^4 + 4^5 + 4^6) \approx 3.3 \times 10^8$ spaced dyads. Evaluating all of them for large

datasets is impractical and unnecessary. We consider fewer but retain a broad range of possibilities by

using a selected subset of the words in conjunction with GA.
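The size of this search space can be checked with a few lines of Python. This is only an illustrative sketch; the function name `count_spaced_dyads` and its arguments are hypothetical and not part of the GADEM software.

```python
def count_spaced_dyads(words_per_k, max_spacer):
    """Count dyads when both words come from the pooled word list
    and the spacer width x ranges over 0..max_spacer."""
    n_words = sum(words_per_k.values())
    return n_words * (max_spacer + 1) * n_words

# All possible words of length 3 to 6 over {a, c, g, t}, with d = 10.
all_words = {k: 4 ** k for k in (3, 4, 5, 6)}
print(count_spaced_dyads(all_words, 10))  # 325529600, i.e. about 3.3e8
```

The same function applies unchanged to any smaller word set, since only the pooled number of words and the maximal spacer width enter the count.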

2.3. Top-ranked k-mers

We count all possible short DNA words (tri-, tetra-, penta-, and hexa-nucleotides; collectively called

k-mers, k = 3, 4, 5, 6) in the input sequence data with self-overlapping ones discarded. For instance,

“aacaa” in “aacaacaa” is only counted once. The k-mers are then rank-ordered by their z-scores, calculated

as $z(a) = \frac{c(a) - c_{\mathrm{exp}}(a)}{\mathrm{std}_{\mathrm{est}}(a)}$, where $c(a)$ is the number of counts observed for k-mer a, $c_{\mathrm{exp}}(a)$ is the expected


FIG. 1. Flowchart of the GADEM algorithm. The algorithm is divided into three parts: formation of spaced dyads (left large

box), GA (center large box) and motif declaration (right box). The three parts constitute one cycle of GADEM. GADEM

automatically carries out several such cycles until no further motifs with E-values below a pre-specified threshold can

be found. For each GADEM cycle, the steps in the blue box are repeated for a user-specified number of generations

(indexed by g, g = 0 at the beginning of GA), whereas the steps in the red and green boxes are carried out only

once for each GADEM cycle. GADEM begins by enumerating the matching instances of all k-mers (k = 3, 4, 5, 6).

For each k, the words are rank-ordered based on their z-scores. This results in four groups of top-ranked k-mers.

A spaced dyad is formed by randomly choosing two words ($a_1$ and $a_2$) independently from any of the four groups

and a randomly chosen width between 0 and d (e.g., d = 10). In the GA stage, a “population” indexed by r (e.g.,

$r = 1, \ldots, 5$) of such spaced dyads is generated. The r spaced dyads are converted into r position weight matrices

(PWMs), $\theta$. The PWMs are subjected to a user-specified number (e.g., 40) of steps of EM or until they converge. The

score distribution of the integerized form of $\hat{\theta}$ is computed. The same integerized $\hat{\theta}$ is also used to scan for binding

sites in the data. A subsequence of length w is declared a binding site when the p-value of its PWM score is below

a threshold (e.g., $\le 2.5 \times 10^{-5}$). The entropy score of the aligned binding sites (motif) is computed and the logarithm

of its statistical significance (E-value) is used as the fitness score for the spaced dyad from which the motif is derived.

Next, all except the best performing spaced dyad(s) (with the lowest E-value) in the population are subjected to either

mutation or crossover operations. This process (blue box) is repeated until the maximal number of generations (e.g., 5)

has been reached.

number of counts for a, and $\mathrm{std}_{\mathrm{est}}(a)$ is an estimate of the standard deviation of occurrences of a, based

on the background {a,c,g,t} distribution estimated from the entire data, assuming independence between

positions. The higher the z-score, the more likely the k-mer is enriched in the data and present in motif(s).

Of course, the top-ranked words of different length will generally overlap. Treating them as unique words,

however, allows flexible combinations of word length and spacer length to provide a rich set of spaced

dyads as initial motifs for the EM. Let $N_k$ be the number of top-ranked k-mers; we arbitrarily set $N_k$ to

at most 20, 40, 60, and 100 (only those with a z-score at or above 6.0 are considered), for k = 3, 4,

5, 6, respectively. These settings appear to favor a larger proportion of short k-mers over long k-mers.

However, this bias is lessened by the high dependency (overlapping) among the top-ranked k-mers. This

reduces the number of possible spaced dyads to $220 \times 11 \times 220 \approx 5.3 \times 10^5$. Subjecting all (after

being converted into PWMs, see below) to an EM algorithm is still computationally formidable. An

intelligent method is needed to subject only a subset of them to EM without significantly compromising


the results. For this reason, we adopt an effective stochastic search algorithm, GA, to guide formation of

the spaced dyads.
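The k-mer counting and ranking described in this section can be sketched as follows. This is a minimal illustration, not the GADEM implementation: the function names are hypothetical, the sequences are assumed to contain only lowercase a, c, g, t, and the standard deviation is estimated with a simple binomial approximation, which is an assumption because the text does not specify the estimator.

```python
from collections import Counter
from math import sqrt

def count_kmers(seqs, k):
    """Count k-mers, discarding self-overlapping occurrences of a word."""
    counts = Counter()
    for seq in seqs:
        next_free = {}  # word -> first start position allowed again
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if i >= next_free.get(word, 0):
                counts[word] += 1
                next_free[word] = i + k
    return counts

def kmer_zscores(seqs, k):
    """z(a) = (c(a) - c_exp(a)) / std_est(a) under an iid background."""
    base_counts = Counter("".join(seqs))
    total = sum(base_counts.values())
    bg = {b: base_counts[b] / total for b in "acgt"}
    n_pos = sum(max(len(s) - k + 1, 0) for s in seqs)
    zscores = {}
    for word, c in count_kmers(seqs, k).items():
        p = 1.0
        for b in word:
            p *= bg[b]
        expected = n_pos * p
        sd = sqrt(n_pos * p * (1.0 - p))  # binomial approximation (assumed)
        if sd > 0:
            zscores[word] = (c - expected) / sd
    return zscores

z = kmer_zscores(["acgacgacgacgt"], 3)
top = sorted(z, key=z.get, reverse=True)  # rank-ordered words, best first
```

Keeping, for each k, the leading entries of `top` whose z-score is at or above 6.0 would give the candidate word groups used to build spaced dyads.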

2.4. Genetic algorithm

2.4.1. Initialization. For each word ($a_1$ or $a_2$) in a spaced dyad, we first randomly choose a k, k = 3,

4, 5, 6, with equal probability, from which a word in the k-mer group will be selected. Next, a word from

the $N_k$ top-ranked k-mers is chosen with probability that is proportional to its z-score. Both $a_1$ and $a_2$

are chosen independently. The width, x, is randomly chosen between 0 and d with equal probability. A

“population” of such spaced dyads (indexed by r, e.g., $r = 1, \ldots, 100$) is generated. An example spaced

dyad would be gggcnnnnnnntttgca, where $a_1$ = gggc, x = 7, and $a_2$ = tttgca.
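The initialization step can be sketched in Python as below. The data layout (`top_kmers` as a dictionary from k to a list of (word, z-score) pairs), the function names, and the example words and z-scores are all hypothetical and chosen only for illustration.

```python
import random

def sample_dyad(top_kmers, max_spacer, rng):
    """Draw one spaced dyad (a1, x, a2).

    top_kmers: {k: [(word, zscore), ...]} with positive z-scores.
    """
    def pick_word():
        k = rng.choice(sorted(top_kmers))       # k chosen with equal probability
        words, zs = zip(*top_kmers[k])
        return rng.choices(words, weights=zs, k=1)[0]  # proportional to z-score

    a1, a2 = pick_word(), pick_word()           # chosen independently
    x = rng.randint(0, max_spacer)              # spacer width uniform on 0..d
    return a1, x, a2

def dyad_to_string(a1, x, a2):
    """Write a dyad as word, x unspecified bases 'n', word."""
    return a1 + "n" * x + a2

rng = random.Random(1)
top_kmers = {4: [("gggc", 9.5), ("catg", 7.2)], 6: [("tttgca", 8.1)]}
population = [sample_dyad(top_kmers, 10, rng) for _ in range(100)]
```

Passing an explicit `random.Random` instance keeps a run reproducible, which is convenient when comparing GA settings.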

2.4.2. Fitness evaluation. Fitness evaluation consists of several steps. First, the spaced dyads in the

population are converted into initial PWMs. Second, the PWMs are iteratively updated by an EM algorithm

using all or a subset of randomly selected sequences. Third, the updated PWMs are used to scan for binding

sites in the entire data. Fourth, the relative entropy score of the aligned binding sites is computed and the

logarithm of its statistical significance (E-value) is used as the fitness score for the spaced dyad. Details

of each step are described below.

Step i. Spaced dyad to PWM. Each spaced dyad in the population is converted into a PWM as follows. In each

column corresponding to a letter of $a_1$ or $a_2$, the cell for that letter is assigned 1 and the other cells 0; in each

column corresponding to a spacer, all four cells are assigned 1. A small pseudo count

(e.g., 0.01) is added to each cell containing zero. Each column is then standardized to sum to 1.0. Other

assignments such as that from MEME can also be considered.
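A minimal Python sketch of this conversion is given below. The function name `dyad_to_pwm` and the column-major list layout (one list of four probabilities per motif position, in the order a, c, g, t) are illustrative assumptions.

```python
def dyad_to_pwm(a1, x, a2, pseudo=0.01):
    """Convert a spaced dyad into a starting PWM (list of columns)."""
    bases = "acgt"
    pwm = []
    for ch in a1 + "n" * x + a2:
        # 1 for the word letter (or for every base in a spacer), 0 otherwise
        col = [1.0 if ch in ("n", b) else 0.0 for b in bases]
        # add the pseudo count to zero cells, then standardize the column
        col = [v if v > 0 else pseudo for v in col]
        total = sum(col)
        pwm.append([v / total for v in col])
    return pwm

pwm = dyad_to_pwm("gggc", 7, "tttgca")
```

With a pseudo count of 0.01, a word column places probability 1/1.03 on its letter, and each spacer column is uniform.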

Step ii. EM algorithm. We wish to find binding site locations and the base probabilities using only the

sequence data and the initial PWM. We use an EM algorithm described in Lawrence and Reilly (1990)

for this purpose. We conveniently assume that the positions within a sequence are mutually independent,

i.e., a sequence follows a product of multinomial distributions. Details of the EM algorithm can be found

in supplementary material and in Lawrence and Reilly (1990) and Li et al. (2008).

Applying EM to all sequences can be computationally costly. GADEM can use all sequences or only a randomly

selected subset in the EM steps. For genome-wide data with thousands to tens of thousands of

sequences, a 25% to 50% sample should be adequate for obtaining a good estimate of the PWM.
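To make the EM update concrete, the sketch below performs one iteration under a deliberately simplified model: exactly one site per sequence, a uniform prior over start positions, and a fixed background. These simplifications, the function name `em_step`, and the pseudo count are assumptions for illustration; the algorithm actually used is described in the cited references and the supplementary material.

```python
def em_step(seqs, pwm, bg, pseudo=0.01):
    """One EM iteration for a one-site-per-sequence motif model (simplified).

    pwm: list of columns, each [p_a, p_c, p_g, p_t]; bg: [p_a, p_c, p_g, p_t].
    """
    idx = {b: i for i, b in enumerate("acgt")}
    w = len(pwm)
    counts = [[0.0] * 4 for _ in range(w)]
    for seq in seqs:
        # E-step: motif-versus-background likelihood ratio at each start
        ratios = []
        for i in range(len(seq) - w + 1):
            r = 1.0
            for l in range(w):
                b = idx[seq[i + l]]
                r *= pwm[l][b] / bg[b]
            ratios.append(r)
        total = sum(ratios)
        if total == 0:
            continue  # sequence shorter than the motif, or no support
        # M-step accumulation: expected base counts weighted by the posterior
        for i, r in enumerate(ratios):
            post = r / total
            for l in range(w):
                counts[l][idx[seq[i + l]]] += post
    new_pwm = []
    for col in counts:
        col = [c + pseudo for c in col]
        s = sum(col)
        new_pwm.append([c / s for c in col])
    return new_pwm

seqs = ["ttacgttt", "ggacgtgg", "ccacgtcc", "aaacgtaa"]
start = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
updated = em_step(seqs, start, [0.25] * 4)
```

Running the EM steps on a random subset amounts to passing a sample of `seqs` (for example, one drawn with `random.sample`) instead of the full list.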

Step iii. Binding site declaration. The EM-derived PWM, $\hat{\theta}$, is then log-transformed and multiplied

by a scale factor, α (e.g., α = 200), followed by rounding the real numbers to their closest integers

[$\hat{\theta} = (\mathrm{int})(\alpha \times \hat{\theta})$]. We compute the exact score distribution of the integerized $\hat{\theta}$ using the probability

generating functions method of Staden (1989). The same integerized $\hat{\theta}$ is subsequently used to scan for

binding sites in the data. A subsequence is declared a binding site when the p-value of the observed PWM

score sum is less than or equal to a threshold (e.g., $2.5 \times 10^{-5}$).
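The integerization and the exact score distribution can be sketched as follows. Multiplying per-column probability generating functions is equivalent to convolving the per-column score distributions, which is what the dictionary-based loop below does. The natural logarithm, the use of Python's `round`, and all function names are assumptions made for illustration.

```python
from math import log

def integerize(pwm, scale=200):
    """Log-transform a PWM, scale it, and round to the closest integers."""
    return [[round(scale * log(p)) for p in col] for col in pwm]

def score_distribution(int_pwm, bg):
    """Exact distribution of the PWM score sum under an iid background."""
    dist = {0: 1.0}
    for col in int_pwm:
        nxt = {}
        for s, pr in dist.items():
            for score, pb in zip(col, bg):
                nxt[s + score] = nxt.get(s + score, 0.0) + pr * pb
        dist = nxt
    return dist

def score_cutoff(dist, alpha):
    """Smallest score s with P(S >= s) <= alpha, or None if there is none."""
    tail, cutoff = 0.0, None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > alpha:
            break
        cutoff = s
    return cutoff

int_pwm = integerize([[0.7, 0.1, 0.1, 0.1]])
dist = score_distribution(int_pwm, [0.25] * 4)
```

Because the scores are integers, the number of distinct score sums stays bounded by the score range, so the distribution remains tractable for motifs of realistic width. A subsequence would then be declared a site when its integer score is at or above `score_cutoff(dist, alpha)`.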

Step iv. Fitness evaluation—E-value. The binding sites (motif) are aligned and the log likelihood ratio

(llr) score (Stormo, 2000) of the alignment is computed as follows,


$$\mathrm{llr} = M \sum_{l=1}^{w} \sum_{b \in \{a,c,g,t\}} f_{l,b} \log\left(\frac{f_{l,b}}{p_b}\right),$$

where M is the number of binding sites in the alignment, $f_{l,b}$ is the frequency of base b at position l of

the alignment, and $p_b$ is the background frequency of base b, computed from the entire data. Here again we

assume that the letters in a sequence are independent and identically distributed (iid) multinomial random

variables. These scores are not directly comparable for different M and w. To assess the significance of the

llr score, one needs to compute its p-value, that is, the probability of observing an llr score or higher under

the null hypothesis that the distribution of the letters in each column follows an independent multinomial

distribution. The background {a,c,g,t} distribution is estimated from the entire data. Methods for computing

the p-value of llr score have been proposed and discussed (Bailey and Gribskov, 1998; Hertz and Stormo,

1999; Nagarajan et al., 2005). GADEM adopts the approach of Bailey and Gribskov (1998) as implemented

in MEME (Bailey and Elkan, 1994).
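The llr score itself (not its p-value or E-value) can be computed directly from an alignment of sites, as in the sketch below. The function name `llr_score` is hypothetical, the natural logarithm is an assumption, and bases absent from a column are skipped, which corresponds to treating 0 log 0 as 0.

```python
from collections import Counter
from math import log

def llr_score(sites, bg):
    """llr = M * sum over columns l and bases b of f_lb * log(f_lb / p_b).

    sites: equal-length aligned binding sites; bg: {base: background frequency}.
    """
    m = len(sites)
    w = len(sites[0])
    total = 0.0
    for l in range(w):
        column = Counter(site[l] for site in sites)
        for base, n in column.items():
            f = n / m
            total += f * log(f / bg[base])
    return m * total

uniform = {b: 0.25 for b in "acgt"}
score = llr_score(["acgt", "acgt", "acgt", "acgt"], uniform)
```

A perfectly conserved column contributes log(1/p_b) per site, while a column matching the background contributes nothing, which illustrates why the raw score grows with both M and w and must be converted to a significance value before motifs are compared.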