
A Novel Bayesian DNA Motif Comparison Method for

Clustering and Retrieval

Naomi Habib1,2., Tommy Kaplan1,2., Hanah Margalit2, Nir Friedman1*

1School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel, 2Department of Molecular Genetics and Biotechnology, Faculty of Medicine,

The Hebrew University, Jerusalem, Israel

Abstract

Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has

been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor

and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve

coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering of redundant motifs, and

attributing the motifs to transcription factors by retrieval of similar motifs from previously characterized motif libraries. Both

tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on

Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of

the two motifs and their dissimilarity to the background distribution. We demonstrate the use of the new comparison

method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our

results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the

resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This

pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from

multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide

transcription factor location data in S. cerevisiae successfully identified DNA motifs in a manner that is as good as semi-

automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-

specific preferences of transcription factors.

Citation: Habib N, Kaplan T, Margalit H, Friedman N (2008) A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval. PLoS Comput Biol 4(2):

e1000010. doi:10.1371/journal.pcbi.1000010

Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America

Received July 19, 2007; Accepted January 24, 2008; Published February 29, 2008

Copyright: © 2008 Habib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits

unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study was supported by grants from the Israeli Science Foundation administered by the Israeli Academy of Sciences and Humanities (HM), The

Israeli Cancer Research Foundation (HM), the Human Frontiers Science Program (NF), and the National Institutes of Health (NF).

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: nir@cs.huji.ac.il

. These authors contributed equally to this manuscript.

Introduction

Transcription initiation is modulated by transcription factors

that recognize sequence-specific binding sites in regulatory

regions. The organization of binding sites around a gene specifies

which factors can bind to it and where, and consequently

determines to what extent the gene is transcribed under different

conditions. To understand this regulatory mechanism, one must

specify the DNA binding preferences of transcription factors.

These preferences are usually characterized by a motif that

summarizes the commonalities among the binding sites of a

transcription factor [1]. Multiple tools were developed for finding

motifs (e.g., [2–5]); however, there are several problems in

interpreting their output. Typically these algorithms output

multiple results which require filtering and scoring. Moreover,

different motif discovery methods have complementary successes,

and therefore it is beneficial to apply multiple methods

simultaneously and collate their results [6]. In addition, the motif

discovery algorithms frequently produce a redundant output and

the transcription factor that binds each motif is usually unknown.

As similar motifs may represent binding sites of the same factor,

eliminating this redundancy is essential for elucidating the true

transcriptional regulatory program. The general strategy is thus to

cluster similar motifs and merge motifs within each cluster to

create a library of non-redundant motifs [6] (Figure 1B). Next, in

order to interpret the meaning of the discovered motifs, they are

compared to databases of previously characterized motifs

(Figure 1C). In large-scale experiments, where the motif output

set is very large, the tasks of scoring, merging and identifying

motifs need to be automated. To address both the clustering and

the retrieval challenges, we need an accurate and sensitive method

for comparing DNA motifs.

In the literature there is an ongoing discussion regarding the

best representation of DNA motifs [1,7–10]. Here we use a

Position Frequency Matrix (PFM), which has the benefits of being

relatively simple yet flexible. A PFM is a matrix of positions in the

binding site versus nucleotide preferences, where each row

represents one residue and each column represents the nucleotide

count at each position in a set of aligned binding sites. This

representation assumes that the choice of nucleotides at different

positions is independent of all other positions.
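As a minimal illustration of the PFM representation described above, the following Python sketch counts nucleotides per position in a set of aligned binding sites. The function name `build_pfm` and the example sites are hypothetical and are not taken from the paper.

```python
NUCLEOTIDES = "ACGT"

def build_pfm(sites):
    """Count nucleotides per position in equal-length aligned binding sites.

    Returns a dict mapping each nucleotide (one row per residue) to a list
    of counts, one entry (column) per position in the binding site.
    """
    if not sites:
        raise ValueError("need at least one binding site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {nt: [0] * length for nt in NUCLEOTIDES}
    for site in sites:
        for pos, nt in enumerate(site.upper()):
            pfm[nt][pos] += 1
    return pfm

# Hypothetical aligned sites, used only to exercise the function.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pfm = build_pfm(example_sites)
```

Each column sums to the number of sites, which is what makes the counts usable as sufficient statistics later on.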

To compare two PFMs, we need to consider all possible

alignments between them. Given two aligned PFMs, we utilize the

position-independence assumption to decompose the similarity

score into a sum of the similarities of single aligned positions.

PLoS Computational Biology | www.ploscompbiol.org1 2008 | Volume 4 | Issue 2 | e1000010


Several similarity scores can be used to compare a pair of aligned

positions. One approach uses the Pearson correlation coefficient

(e.g., [11,12]). This measure, however, might inappropriately

capture similarities between probabilities (Figure 2 and Figure S1).

Alternative approaches are based on similarity between two

distributions. This can be a metric distance, such as the Euclidean

distance [13] or an information-theoretic measure, such as the

Jensen-Shannon divergence [14]. While these latter distances do

not have the artifacts of the Pearson correlation, they equally

weight positions with similar nucleotide distributions that are

specific (e.g., a strong preference for an A) and similar positions

that are non-informative (e.g., identical to the background

distribution) (Figure 2 and Figure S1). It is important to

differentiate between these two situations: Two positions whose

similarity is due to a resemblance to the background distribution

are less relevant to motif similarity, as they do not contribute to

sequence-specific binding of proteins [15,16]. In this work we use

this intuition to develop a novel method for comparing and

merging DNA motifs, based on Bayesian probabilistic reasoning.

We define a new similarity score that combines the statistical

similarity between the motifs with their dissimilarity to the

background distribution. To calculate this score we estimate the

probabilities of DNA nucleotides in each position of the DNA

motif, by a Bayesian estimator with a Dirichlet mixture prior

[17,18] to model the multi-modal nucleotide distribution at

different binding site positions.

We use this motif similarity score to identify similar motifs that represent binding sites of the same factor, and to cluster motifs. For the latter we devised a hierarchical agglomerative clustering procedure that is based on our motif similarity score.

Our results show that the new method outperforms other

alternatives in accuracy and sensitivity in both the clustering and

retrieval tasks.

Using our new similarity score and the clustering method based

upon it, we developed a large-scale analysis pipeline of DNA motif

sets. This pipeline is designed for analysis following concurrent

motif search by a combination of methods (using the TAMO

package [19]). The goal is to process the set of DNA motifs into a

set of reliable non-redundant motifs. We use our method to

identify sets of DNA motifs from a large-scale ChIP-chip assay in

S. cerevisiae [13]. This allows us to examine how transcription

factors alter their DNA binding preferences under various

environmental conditions and elucidate mechanisms for condition-specific preferences.

Results

A Novel DNA Motif Similarity Score

Our goal is to determine whether two DNA motifs represent the

same binding preferences. Since the less informative positions in a

motif do not contribute to sequence-specific binding of proteins,

we developed a similarity score that measures the similarity

between two DNA motifs, while taking into account their

dissimilarity from the background distribution.

We now develop the details of the score. We can view DNA

motifs as a list of binding sites from which the nucleotide

distribution at each position is estimated. This view allows us to

perform statistical evaluations. We assume that each binding site

was sampled independently from a common distribution over

nucleotides, which satisfies the position independence properties

(in correspondence with the motif PFM representation described

above). Then, we can evaluate the likelihood ratio of different

source distributions of the sampled binding sites. In practice, we

keep only the sufficient statistics allowing us to evaluate the likelihood

of the binding sites. These sufficient statistics are the counts of each

nucleotide in each position, represented as a PFM.

Our score is composed of two components: the first measures

whether the two motifs were generated from a common

distribution, while the second reflects the distance of that common

distribution from the background (see Methods). Our Bayesian

Likelihood 2-Component (BLiC) score for comparing motifs m1 and m2 is:

$$
\mathrm{BLiC\ score} = \log\frac{\Pr(m_1, m_2 \mid \text{common source})}{\Pr(m_1, m_2 \mid \text{independent source})} + \log\frac{\Pr(m_1, m_2 \mid \text{common source})}{\Pr(m_1, m_2 \mid \text{background})} \tag{1}
$$

Under the position independence assumption, the score decomposes into a sum of local position scores. More precisely, our likelihood-based score measures the probability of the nucleotide counts in each position of the motif given a source distribution. For two aligned positions in the compared motifs, let n1 and n2 be the corresponding positions (count vectors) in the two motifs; the similarity score is then:

$$
\begin{aligned}
\mathrm{BLiC}_{\mathrm{position}} &= \log\frac{P(n^1, n^2 \mid \hat{P}^{1,2})}{P(n^1 \mid \hat{P}^{1})\, P(n^2 \mid \hat{P}^{2})} + \log\frac{P(n^1, n^2 \mid \hat{P}^{1,2})}{P(n^1, n^2 \mid P^{bg})} \\
&= 2 \sum_{y \in NT} \left(n^1_y + n^2_y\right) \log \hat{P}^{1,2}_y - \left( \sum_{y \in NT} n^1_y \log \hat{P}^{1}_y + \sum_{y \in NT} n^2_y \log \hat{P}^{2}_y + \sum_{y \in NT} \left(n^1_y + n^2_y\right) \log P^{bg}_y \right)
\end{aligned} \tag{2}
$$

where $\hat{P}^{1}$, $\hat{P}^{2}$ and $\hat{P}^{1,2}$ are the estimators for the source distribution of $n^1$, $n^2$ and the common source distribution, respectively, $P^{bg}$ is the background nucleotide distribution, and NT = {A,C,G,T}.

Author Summary

Regulation of gene expression plays a central role in the activity of living cells and in their response to internal (e.g., cell division) or external (e.g., stress) stimuli. Key players in determining gene-specific regulation are transcription factors that bind sequence-specific sites on the DNA, modulating the expression of nearby genes. To understand the regulatory program of the cell, we need to identify these transcription factors, when they act, and on which genes. Transcription regulatory maps can be assembled by computational analysis of experimental data, by discovering the DNA recognition sequences (motifs) of transcription factors and their occurrences along the genome. Such an analysis usually results in a large number of overlapping motifs. To reconstruct regulatory maps, it is crucial to combine similar motifs and to relate them to transcription factors. To this end we developed an accurate fully-automated method, termed BLiC, based upon an improved similarity measure for comparing DNA motifs. By applying it to genome-wide data in yeast, we identified the DNA motifs of transcription factors and their putative target genes. Finally, we analyze motifs of transcription factors that alter their target genes under different conditions, and show how cells adjust their regulatory program in response to environmental changes.

A Novel Motif Comparison Method

Since the source distribution is unknown, we must estimate it

from the nucleotide counts at each position. We used a Bayesian

estimator, where a priori knowledge and the number of samples

were integrated into the estimation process. We considered two

alternative priors. The first is a standard Dirichlet prior [20], which

is conjugate to the multinomial distribution, enabling us to

compute the estimations efficiently (see Methods). However with

this prior we cannot model our prior knowledge that a position in

a DNA motif tends to have specific preference to one or more

nucleotides. Such prior knowledge can be described with a Dirichlet

mixture prior [17,18], which represents a prior that consists of

several ‘‘typical’’ distributions. Specifically, we used a five-

component mixture prior, with four components representing an

informative distribution, giving high probability for a single

nucleotide: A, C, G, or T. The fifth component represents the

uniform distribution (see Methods).
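To make the position score of Equation 2 concrete, here is a minimal Python sketch. It assumes the simpler of the two priors (a single symmetric Dirichlet prior) and uses the posterior-mean estimator with a pseudocount `alpha`; the value `alpha=1.0`, the uniform background, and the function names are illustrative assumptions, not the parameters used in the paper. The five-component Dirichlet mixture prior is not implemented here.

```python
import math

UNIFORM_BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumed background, order A, C, G, T

def dirichlet_estimate(counts, alpha=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet prior (assumed alpha)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def blic_position(n1, n2, background=UNIFORM_BACKGROUND, alpha=1.0):
    """Position score of Equation 2 for two count vectors over A, C, G, T."""
    pooled = [a + b for a, b in zip(n1, n2)]
    p1 = dirichlet_estimate(n1, alpha)
    p2 = dirichlet_estimate(n2, alpha)
    p12 = dirichlet_estimate(pooled, alpha)
    score = 0.0
    for y in range(4):
        score += 2 * pooled[y] * math.log(p12[y])      # common source, counted twice
        score -= n1[y] * math.log(p1[y])               # independent source, motif 1
        score -= n2[y] * math.log(p2[y])               # independent source, motif 2
        score -= pooled[y] * math.log(background[y])   # background model
    return score

same_informative = blic_position([10, 0, 0, 0], [10, 0, 0, 0])
same_uniform = blic_position([3, 3, 2, 2], [3, 3, 2, 2])
different = blic_position([10, 0, 0, 0], [0, 10, 0, 0])
```

Under these assumptions, two identical informative columns score well above two identical near-uniform columns, which is the behavior the score is designed to have.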

Figure 1. Overview of the challenges in DNA motif analysis. (A) Identifying DNA binding motifs: Applying motif discovery algorithms to a group of related DNA sequences leads to the identification of putative transcription factor DNA binding sites. These algorithms output a set of DNA motifs, which are frequently redundant. To infer the correct transcription regulation map from the discovered motif set, it is crucial to reduce this redundancy and to relate the discovered motifs to known ones. (B) Reducing redundancy by clustering and merging motifs: A redundant set of DNA motifs can be reduced by clustering the motifs into groups of related ones and merging the motifs within each cluster. In this example, a redundant set of 16 DNA motifs (a partial output of several motif search algorithms) is clustered and merged to a final set consisting of three DNA motifs. (C) Relating motifs to known factors: The transcription factors that bind the newly discovered DNA motifs can be revealed based on similarities to previously defined motifs. In this example, comparison of a newly discovered motif to four known motifs reveals high similarity to the Gcn4 binding motif. From this comparison the transcription factor that binds the motif is identified with high probability.

doi:10.1371/journal.pcbi.1000010.g001

In the above discussion we assumed that the motifs are aligned, but in practice we compare unaligned motifs. Thus, we defined the similarity score for two motifs as the score of the best possible alignment (without gaps) between them, including the reverse complement alignment.
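The search over ungapped alignments, including the reverse complement, can be sketched as follows. This is an illustration under stated assumptions: only overlapping columns are scored, a hypothetical `min_overlap` parameter bounds the shift, and the position score uses a single symmetric Dirichlet prior with a uniform background. None of these choices is confirmed by the text above. Motifs are lists of count columns in A, C, G, T order, and `consensus_motif` is a helper invented for the example.

```python
import math

BACKGROUND = (0.25, 0.25, 0.25, 0.25)  # assumed uniform background

def estimate(counts, alpha=1.0):
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def position_score(n1, n2, alpha=1.0):
    """Equation 2 for one pair of aligned columns (simple Dirichlet prior assumed)."""
    pooled = [a + b for a, b in zip(n1, n2)]
    p1, p2, p12 = estimate(n1, alpha), estimate(n2, alpha), estimate(pooled, alpha)
    return sum(
        2 * pooled[y] * math.log(p12[y])
        - n1[y] * math.log(p1[y])
        - n2[y] * math.log(p2[y])
        - pooled[y] * math.log(BACKGROUND[y])
        for y in range(4)
    )

def reverse_complement(motif):
    # Reverse the column order; reversing [A, C, G, T] swaps A<->T and C<->G.
    return [col[::-1] for col in reversed(motif)]

def best_alignment(m1, m2, min_overlap=2):
    """Return (score, offset, is_reverse_complement) of the best ungapped alignment.

    offset is the index in m1 that column 0 of the (possibly reverse
    complemented) second motif is aligned to; it may be negative.
    """
    best = (-math.inf, 0, False)
    for is_rc, other in ((False, m2), (True, reverse_complement(m2))):
        for offset in range(-(len(other) - min_overlap), len(m1) - min_overlap + 1):
            start, stop = max(0, offset), min(len(m1), offset + len(other))
            if stop - start < min_overlap:
                continue
            score = sum(position_score(m1[i], other[i - offset]) for i in range(start, stop))
            if score > best[0]:
                best = (score, offset, is_rc)
    return best

def consensus_motif(sequence, count=10):
    """Hypothetical helper: a motif whose columns each hold `count` copies of one base."""
    return [[count if nt == base else 0 for nt in "ACGT"] for base in sequence]

core = consensus_motif("TGAC")
print(best_alignment(consensus_motif("ATGACG"), core))
print(best_alignment(core, reverse_complement(core)))
```

Because the total is a sum over overlapping columns, longer overlaps can accumulate higher scores; this is one reason the scores need the statistical calibration discussed in the text.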

In addition, we need statistical calibration of the similarity

scores, since a high similarity score might be due to chance events

[21,22]. In particular, when comparing a single motif against

motifs of different lengths, the probability of similarity by chance

depends on the query motif and the length of the target. To

circumvent these problems we use the p-value of the similarity

score, which is computed empirically for each query against the

distribution of scores of random motifs of a given length (see

Methods and Figure 3).
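The empirical calibration can be sketched as below. The add-one correction, the default of 1000 random motifs, and the caller-supplied `score` and `random_motif` functions are assumptions made for illustration; the text does not specify how the random motifs are generated.

```python
def empirical_p_value(observed_score, null_scores):
    """Fraction of null scores at least as high as the observed score.

    Uses an add-one correction (an assumption) so the estimate is never zero.
    """
    if not null_scores:
        raise ValueError("need at least one null score")
    at_least_as_high = sum(1 for s in null_scores if s >= observed_score)
    return (at_least_as_high + 1) / (len(null_scores) + 1)

def calibrate(query, target, score, random_motif, n_random=1000):
    """P-value of score(query, target) against random motifs of the target's length.

    `score` and `random_motif` are supplied by the caller (hypothetical
    interfaces): score(query, motif) returns a similarity, and
    random_motif(length) returns a random motif with that many positions.
    """
    null_scores = [score(query, random_motif(len(target))) for _ in range(n_random)]
    return empirical_p_value(score(query, target), null_scores)
```

Building the null distribution separately for each query and target length is what removes the dependence on motif length noted in the text.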

Clustering motifs.

An important application of motif

similarity scores is clustering. There are many clustering

methods [23] that can be applied to motifs. Here we consider

one of the simplest and most straightforward clustering procedures

where we combined a similarity score, such as our BLiC score,

within a hierarchical agglomerative clustering algorithm. In each

iteration, the algorithm computes the similarity between all pairs

of motifs and then merges the most similar pair into a new motif

based on the best alignment between the two motifs (see Figure 1).

It then replaces the two original motifs by the new motif. These

iterations are performed until we are left with a single motif. The

order of merge operations results in a tree, where the leaves are the

initial motifs, and inner nodes correspond to merged motifs that

represent all motifs in the relevant sub-tree (see Figure 4A). We

stress that this procedure is different than standard hierarchical

clustering (such as UPGMA clustering). Since we merge motifs to

create a new one, the similarity of the merged motif to another is

different from the average similarity of each of the original motifs

to that third motif.
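The merge-based procedure can be sketched generically. In the Python sketch below, `similarity` and `merge` are caller-supplied functions (hypothetical interfaces), and the toy example uses single-column count vectors with a squared-distance similarity instead of the BLiC score, purely to keep the example short.

```python
from itertools import combinations

def agglomerate(motifs, similarity, merge):
    """Repeatedly merge the most similar pair until a single motif remains.

    Returns (tree, history): the tree is a nested tuple of leaf indices, and
    history lists (left_subtree, right_subtree, score) for every merge.
    """
    if not motifs:
        raise ValueError("need at least one motif")
    nodes = [(index, motif) for index, motif in enumerate(motifs)]
    history = []
    while len(nodes) > 1:
        # Similarities are recomputed against merged motifs in every iteration,
        # which is what distinguishes this from average-linkage clustering.
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda pair: similarity(nodes[pair[0]][1], nodes[pair[1]][1]),
        )
        (tree_i, motif_i), (tree_j, motif_j) = nodes[i], nodes[j]
        history.append((tree_i, tree_j, similarity(motif_i, motif_j)))
        merged = ((tree_i, tree_j), merge(motif_i, motif_j))
        nodes = [node for k, node in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0], history

# Toy stand-ins (assumptions): one-column motifs, negative squared distance
# between frequencies as similarity, and count addition as the merge.
def toy_similarity(a, b):
    fa = [x / sum(a) for x in a]
    fb = [x / sum(b) for x in b]
    return -sum((x - y) ** 2 for x, y in zip(fa, fb))

def toy_merge(a, b):
    return [x + y for x, y in zip(a, b)]

toy_motifs = [[9, 1, 0, 0], [8, 2, 0, 0], [0, 0, 2, 8], [1, 1, 0, 8]]
tree, history = agglomerate(toy_motifs, toy_similarity, toy_merge)
```

Each iteration rescans all pairs, so this simple version does cubic work in the number of motifs; caching pairwise scores for untouched motifs would be a natural refinement.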

We use the clustering tree to distill a large group of motifs

into a concise non-redundant set, by splitting the tree into a

subset of clusters, each representing a group of redundant

motifs by choosing a frontier in the tree (see Figure 4A and

Methods).

Figure 2. Problematic aspects of previous motif similarity scores. (A) Distinguishing between informative and non-informative positions:

Two pairs of aligned motifs are demonstrated, both of which having three identical positions and two different ones. While the identical positions in

the first pair (left) are non-informative, the identical positions in the second pair (right) are informative. The desired similarity score should distinguish

between these two types of similarities and assign a higher score to pair number 2. The nucleotide distributions are visualized so that the height of

each nucleotide is proportional to its probability (see a real life example in Figure S1). (B) Problematic aspects of motif similarity scores: The similarity

score of two position frequency matrices (PFMs) decomposes into the sum of similarities of single aligned positions, due to the common position-

independence assumption in the model. Here we present the similarity scores for various pairs of positions in DNA motifs according to several

similarity functions, in addition to the desired score (scores are normalized to an arbitrary scale of −1 to 1). The nucleotide distribution in each position is

visualized as in (A) (the height of each nucleotide is proportional to its probability). As shown here, all scores (Pearson correlation, Jensen-Shannon

divergence, and Euclidean distance) do not reflect the ‘‘true’’ similarity between two distributions or cannot differ between informative and uniform

background positions. Specifically, position 1 should get a higher score than position 2, but the Pearson correlation scores for these positions are

equal. Position 3 should get the lowest possible score, yet the Pearson correlation does not capture this. Both in positions 1 and 4 identical

distributions are compared, but the informative position 1 should get a higher score than position 4. However, all three methods fail to obtain this.

Both positions 4 and 5 analyze nearly-uniform distributions. While in position 4 two identical distributions are compared, in position 5 there are small

variations, which alter the order of nucleotides. As we show, Pearson correlation grades position 5 substantially lower than position 4.

doi:10.1371/journal.pcbi.1000010.g002


Comprehensive Evaluation of Similarity Scores

We set out to compare our similarity score to existing ones in

the literature, in the context of both motif comparison and

clustering. We use two different data sets.

The first data set, which we refer to as ‘‘Yeast’’ is a synthetic one

where we know the true labeling of motifs and use it to benchmark

different procedures by relating their results with the underlying

truth. To generate synthetic motifs in a realistic manner that

reflects true binding properties of transcription factors, we use the

genome-wide catalogue of transcription factor binding locations in

S. cerevisiae [13]. This catalogue has high-confidence binding sites

(based on combination of experimental assays with evolutionary

conservation considerations). From these, we selected nine

transcription factors to represent different binding patterns (in

terms of inner arrangements of informative positions and length).

From the binding sites of each factor we sampled sets of binding

sites, and from each set generated a motif (see Figure 3A). For each

factor we generated noisy motifs that differ in their quality. To do

so, we varied the number of binding sites (sizes of 5, 15 or 35) and

the coverage of the motif (full site, its beginning, middle, or end).

We repeated this for each motif 20 times, creating a set of 240

random motifs for each of the nine transcription factors.
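The count of 240 motifs per factor follows from 3 sample sizes × 4 coverage types × 20 repeats. The sketch below reproduces that bookkeeping in Python. Sampling with replacement, the half-length windows for partial coverage, and all function names are assumptions made for illustration; the text does not specify these details.

```python
import random
from itertools import product

SAMPLE_SIZES = (5, 15, 35)
COVERAGES = ("full", "beginning", "middle", "end")
REPEATS = 20

def window(coverage, length):
    """Column range for a coverage type (half-length windows are an assumption)."""
    half = max(1, length // 2)
    if coverage == "full":
        return 0, length
    start = {"beginning": 0, "middle": (length - half) // 2, "end": length - half}[coverage]
    return start, start + half

def sample_motif(sites, size, coverage, rng):
    """Sample `size` sites (with replacement, an assumption) and count nucleotides."""
    chosen = [rng.choice(sites) for _ in range(size)]
    lo, hi = window(coverage, len(sites[0]))
    return [[sum(site[pos] == nt for site in chosen) for nt in "ACGT"] for pos in range(lo, hi)]

def generate_motifs(sites, seed=0):
    """Return (size, coverage, motif) for every size, coverage, and repeat."""
    rng = random.Random(seed)
    return [
        (size, coverage, sample_motif(sites, size, coverage, rng))
        for size, coverage, _ in product(SAMPLE_SIZES, COVERAGES, range(REPEATS))
    ]

# Hypothetical binding sites, used only to exercise the functions.
toy_sites = ["TGACTCAT", "TGAGTCAT", "TGACTCAC"]
motifs = generate_motifs(toy_sites)
print(len(motifs))
```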

The second data set, which we refer to as ‘‘Structural’’, was

compiled by Mahony et al. [24]. Their evaluation is based on

structural information. Since structurally related transcription

factors often have similar DNA-binding preferences, the best

match to a given motif is expected to be a motif associated with a

member of the same structural class. Mahony et al. compiled a

data set that contains the motifs of the families with four or more

profiles in JASPAR [25].

Using these two data sets we compared different possible

similarity scores for DNA motifs. Specifically, we compared the

Pearson correlation coefficient; the information-theory based

Jensen-Shannon divergence; the Euclidean distance; and our

BLiC score.

Motif comparison evaluation: identifying similar motifs. We evaluated the sensitivity and specificity of motif

similarity scoring methods by comparing all possible pairs of motifs

from the data sets described above, and testing whether pairs that

have high similarity indeed were generated from the same source.

In the ‘‘Yeast’’ data set we call a pair true if the two motifs were generated from binding locations of the same transcription factor, and in the ‘‘Structural’’ data set we call a pair true if the motifs are of factors from the same structural class. For each motif pair, if the similarity is statistically significant we label it a positive pair, and otherwise a negative one. We compared this prediction

to the label of the pair, and calculated the sensitivity and specificity

for each p-value threshold to create ROC curves (Figure 3B and

3C and Figure S2). Comparing the ROC curves of our score to

those of previously suggested scores we see that the BLiC score

outperformed all other scores throughout the range of possible

sensitivity/specificity tradeoffs on both data sets.
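The ROC construction described above can be sketched in a few lines of Python. A pair is predicted positive when its p-value is at or below the threshold; the function name and the toy inputs are hypothetical.

```python
def roc_points(p_values, labels):
    """Return (threshold, false positive rate, true positive rate) per threshold.

    labels[i] is True when pair i truly shares a source (same factor or same
    structural class). Sensitivity is the TPR and specificity is 1 - FPR.
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true pair and one false pair")
    points = []
    for threshold in sorted(set(p_values)):
        tp = sum(1 for p, label in zip(p_values, labels) if p <= threshold and label)
        fp = sum(1 for p, label in zip(p_values, labels) if p <= threshold and not label)
        points.append((threshold, fp / negatives, tp / positives))
    return points

# Toy p-values and labels, for illustration only.
points = roc_points([0.001, 0.01, 0.2, 0.5], [True, True, False, True])
```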

The construction of the ‘‘Yeast’’ data set allows examining

different parameters that make the task more challenging. We do

so by restricting the number of binding sites or by checking

whether the motif is partial or not. Using a smaller number of sites

results in higher variability among motifs of the same factor, and

using partial coverage means smaller overlap between compared

motifs; see Figure S2. These results show that as the task becomes

harder all the methods have reduced success rate: for 5% False

Positive Rate (FPR), the True Positive Rates (TPR) vary from 65%

(for partial overlapping motifs from samples of size 5) to 99% (for

the motif with different offsets compared to the full length motifs

from sample of size 35). Nonetheless, using our score improves the


retrieval rates substantially in most tasks; for example, when

looking at sub-motifs with partial overlap from samples of size 35,

for 5% FPR using the BLiC score leads to 80% TPR, compared to

62% with the Euclidean distance or 57% with the Pearson

Correlation (see Figure 3B). For some tasks, such as comparing the

motifs of different offsets to the full length motifs, our method did

not show statistically significant improvement (see Figure S2).

Comparing our two alternative priors, the Dirichlet prior versus

the Dirichlet-Mixture prior, our results show that the more complex

prior, which better models the nucleotide distribution in binding

sites, leads to better results as the number of samples decreases (see

Figure S2). When the number of samples is larger, the two priors

result in similar performance.

Motif clustering evaluation: reducing the redundancy. To further evaluate the accuracy of the different

similarity scores we used these scores in clustering motifs from the

two data sets. For this, we used the hierarchical agglomerative

clustering algorithm described above. We then examined whether

clusters consisted of motifs that are considered similar (either from

the same factor in the ‘‘Yeast’’ data set, or the same structural

family in the ‘‘Structural’’ data set). Examining the cluster

hierarchy at different levels of granularity we get a tradeoff

curve between two criteria, the True Positive Rate (TPR) of all

clusters, and the number of clusters; see Figure 4. The results show

that the BLiC score outperformed the other similarity scores in the

‘‘Yeast’’ data set and is better than other similarity scores in the

‘‘Structural’’ data set.

As in the motif comparison evaluation, we can perform the

clustering evaluation on various subsets of the ‘‘Yeast’’ data set (see

Figure 4 and Figure S3). From these results we see that in harder

tasks, all methods have reduced success rates. Using our score

improves the clustering rates significantly when clustering all the

motifs or different subsets of motifs as described above; for

example, when looking at all motifs from sample sets of size 15,

using our BLiC score we reach 95% TPR with less than 14

clusters, while all others do not get more than 57% TPR (see

Figure 4B).


Large-Scale DNA Motif Analyses

Motif analysis pipeline. To facilitate analysis of many motifs we developed an automatic motif analysis pipeline, based on our BLiC score. This is a three-step method for processing and integrating large-scale data of newly discovered DNA motifs into coherent and reliable sets of non-redundant motifs. The inputs for this procedure are multiple groups of co-regulated DNA sequences, and the output is a set of non-redundant motifs and a ranking of their relevance for each of the input groups (Figure 5). The three steps of the pipeline include:

Step 1: Motif searching and filtering

We begin by applying complementary motif discovery algorithms to each group of sequences. This is done using the TAMO

package [19]. Then, the newly discovered motifs undergo an initial

filtration according to their abundance among the group of

sequences (see Methods).

Step 2: Clustering and merging motifs

The integrated sets of motifs (from all input groups) are

clustered and merged to create a non-redundant set. First, the

discovered motifs for each group are clustered and merged

separately. Then, motifs from all groups are assembled, clustered

and merged. After each stage of clustering, a subset of refined

motifs is automatically chosen based on the clustering tree (see

Methods).
