c-REDUCE: Incorporating sequence conservation to detect motifs
that correlate with expression
Katerina Kechris*1 and Hao Li2,3
Address: 1Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, 4200 East Ninth Avenue,
B-119, Denver, CO 80262, USA, 2Department of Biochemistry and Biophysics, UCSF, 1700 4th Street, San Francisco, CA 94143, USA and 3Center
for Theoretical Biology, Peking University, Beijing 100871, PR China
Email: Katerina Kechris* - firstname.lastname@example.org; Hao Li - email@example.com
* Corresponding author
Background: Computational methods for characterizing novel transcription factor binding sites
search for sequence patterns or "motifs" that appear repeatedly in genomic regions of interest.
Correlation-based motif finding strategies are used to identify motifs that correlate with expression
data and do not rely on promoter sequences from a pre-determined set of genes.
Results: In this work, we describe a method for predicting motifs that combines the correlation-
based strategy with phylogenetic footprinting, where motifs are identified by evaluating
orthologous sequence regions from multiple species. Our method, c-REDUCE, can account for
variability at a motif position inferred from evolutionary information. c-REDUCE has been tested
on ChIP-chip data for yeast transcription factors and on gene expression data in Drosophila.
Conclusion: Our results indicate that utilizing sequence conservation information in addition to
correlation-based methods improves the identification of known motifs.
An important problem in genome annotation is the iden-
tification and characterization of functional elements.
These elements include transcription factor binding sites
(TFBS), which are short, degenerate sequences that appear
frequently in the genome. The interactions between tran-
scription factors (TFs) and their respective binding sites
are critical for regulating gene expression. To characterize
binding sequences for a TF, computational methods
search for sequence patterns or "motifs" that appear
repeatedly in genomic regions of interest (for a recent
review, see ).
For many motif-finding methods, it is necessary to input
upstream sequences from a set of genes (e.g., genes that
have been identified as co-expressed from a microarray
gene expression analysis), with the assumption that a
common motif is shared by the sequences (e.g., [2,3]).
However, the upstream sequences of some genes in this
set may not contain an occurrence of the shared motif, and
genes whose upstream sequences do contain the motif
may be missing from the co-expressed set. To address
these weaknesses, correlation-based motif finding methods
have been developed that do not rely on a pre-determined
set of genes based either on co-expression (e.g., [2,3]) or
on over-representation of motifs as in . Using all genes
from a single experiment, oligos in a specified length
range are enumerated in
their upstream sequences and tested for significant
correlation with expression values or genome-wide location
measurements for a particular TF. The correlation-based
motif finding approach was introduced in the "Regulatory
Element Detection Using Correlation with Expression"
(REDUCE) software using a linear regression framework
and has since been adapted in several ways, including
the use of motif scores instead of oligo counts,
probabilistic representations of motifs, binary indicators
for word occurrences and flexible non-linear
regression functions [9,10].

Published: 28 November 2008
BMC Bioinformatics 2008, 9:506 doi:10.1186/1471-2105-9-506
Received: 21 May 2008
Accepted: 28 November 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/506
© 2008 Kechris and Li; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
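
As a rough illustration of the correlation-based idea described above, the following sketch counts every oligo of a fixed length in each gene's upstream sequence and ranks the oligos by the Pearson correlation between those counts and the expression values. It is not the REDUCE implementation: the function names, the toy sequences and expression values, and the use of a plain Pearson correlation (equivalent to ranking single-motif linear regression fits) are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def count_oligos(seq, k):
    """Count every overlapping oligo of length k in one upstream sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_oligos(upstream, expression, k):
    """Rank oligos by |correlation| between per-gene counts and expression."""
    genes = sorted(upstream)
    counts = {g: count_oligos(upstream[g], k) for g in genes}
    y = [expression[g] for g in genes]
    oligos = set().union(*counts.values())
    scored = [(o, pearson([counts[g][o] for g in genes], y)) for o in oligos]
    # Strongest correlation first; ties broken alphabetically.
    return sorted(scored, key=lambda t: (-abs(t[1]), t[0]))


# Hypothetical toy data: genes carrying GATTCGA have higher expression.
upstream = {
    "g1": "TTGATTCGAAC",
    "g2": "CCGATTCGATT",
    "g3": "ACACACACACA",
    "g4": "TGTGTGTGTGT",
}
expression = {"g1": 2.1, "g2": 1.9, "g3": -0.2, "g4": 0.1}
top = rank_oligos(upstream, expression, k=7)[:3]
```

Because every gene contributes to the correlation, no pre-selected set of co-expressed genes is needed. A real analysis would also assess significance and fit several motifs jointly, which this sketch omits.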
An alternative motif-finding strategy, relying on the avail-
ability of complete genomes from related species, has
made it possible to search for putative TFBS in evolution-
arily conserved sequences. It has been shown that for
closely related species, where reasonable alignment of the
orthologous promoter sequences can be achieved, the
binding sites for many TFs are evolutionarily conserved.
Different computational methods have been developed
that vary in the number and diversity of species investi-
gated, in search strategies, i.e. genome-wide (e.g., [11,12])
versus gene sets (e.g., ), in whether they use known
transcription factor motifs (e.g., ) or predict motifs de
novo (e.g., ), in how they integrate inter-species con-
servation with intra-species conservation (e.g., ), in
whether the alignment of the motif occurrences across
species is required (e.g., ) and in whether global align-
ments in orthologous sequences are necessary .
In summary, there are numerous motif finding methods
that fall into several different classes, including the
correlation-based and sequence-conservation-based classes
reviewed above. Because each strategy has been successful
on its own, in this work we describe a new method for
predicting motifs that combines the two.
Due to the variability in TF-DNA interactions, TFBS are
characterized by motifs containing degenerate positions.
For example, the second position in the consensus TFBS
for the yeast transcription factor OPI1 (GRTTCGA) can be
A or G, which is denoted by the IUPAC symbol R. At a
functional TFBS, the possible substitutions at a position
may be observed in aligned sequences from multiple spe-
cies. For example, an OPI1 functional site may be fully
conserved across species (as GATTCGA or GGTTCGA) or
exhibit A or G at the second position for different species.
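
The OPI1 example above can be made concrete with a small matcher that checks a site against a degenerate consensus. The mapping is the standard IUPAC nucleotide code; the function name `matches` is a hypothetical helper introduced here for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the bases each symbol allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif, site):
    """Return True if every base of `site` is allowed by the IUPAC `motif`."""
    if len(motif) != len(site):
        return False
    return all(base in IUPAC[symbol] for symbol, base in zip(motif, site))


# Both conserved forms of the site match the consensus GRTTCGA,
# while a C at the degenerate second position does not.
ok_a = matches("GRTTCGA", "GATTCGA")
ok_g = matches("GRTTCGA", "GGTTCGA")
bad_c = matches("GRTTCGA", "GCTTCGA")
```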
To search for degenerate motifs, we have developed an
adaptation of the correlation-based algorithm REDUCE
 called conservation-REDUCE (c-REDUCE). In c-
REDUCE, a multiple species alignment is generated and
then translated into a consensus pattern using degenerate
nucleotide symbols that capture the variation at each
position across species. All oligos, including those with
degenerate symbols, are then evaluated for significant cor-
relation. By using multiple species data, we can identify
motifs that may be missed by REDUCE, which only exam-
ines sequences from a single species and requires exactly
the same oligo in different sequences.
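
The translation step described above can be sketched as follows: each column of a multiple species alignment is collapsed into the IUPAC symbol for the set of bases observed in it, and oligos are then read off the resulting consensus. This is a minimal sketch rather than the c-REDUCE code. The function names are hypothetical, and the treatment of gaps (a column containing a gap becomes "-" and any oligo spanning it is skipped) is an assumption, since the text here does not specify it.

```python
# Standard IUPAC nucleotide codes mapped to the bases each symbol allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of observed bases -> IUPAC symbol.
BASES_TO_SYMBOL = {frozenset(bases): sym for sym, bases in IUPAC.items()}


def consensus(alignment):
    """Collapse equal-length aligned sequences column by column.

    Columns holding a gap or any non-ACGT character become "-"
    (an assumption made for this sketch).
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        BASES_TO_SYMBOL.get(frozenset(column), "-")
        for column in zip(*alignment)
    )


def consensus_oligos(alignment, k):
    """List the length-k oligos of the consensus, skipping gapped windows."""
    cons = consensus(alignment)
    windows = (cons[i:i + k] for i in range(len(cons) - k + 1))
    return [w for w in windows if "-" not in w]


# Hypothetical three-species alignment around an OPI1-like site.
alignment = ["TGATTCGAC", "TGGTTCGAC", "TGATTCGAT"]
cons = consensus(alignment)
oligos = consensus_oligos(alignment, 7)
```

The oligos produced this way, including those with degenerate symbols, could then be tested for correlation in the same way as exact oligos from a single species.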
An alternative method for identifying degenerate motifs is
fast-REDUCE (f-REDUCE) , which was developed for
single species data and identifies degenerate motifs
through an enumerative approach. However, enumera-
tion of degenerate motifs can become very costly as the
length of the motif and number of degenerate positions
increases. In contrast, c-REDUCE reduces the search space
of degenerate motifs by taking into account the variability
at a position inferred from evolutionary information.
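
To give a sense of why exhaustive enumeration becomes costly, the sketch below counts the IUPAC patterns of a given length with at most a given number of degenerate positions, and compares this with the number of oligos that can be read from a single consensus sequence. The full 15-symbol IUPAC alphabet and the function names are assumptions for illustration; the alphabet and pruning actually used by f-REDUCE are not specified here.

```python
from math import comb

N_BASES = 4        # A, C, G, T
N_DEGENERATE = 11  # R, Y, S, W, K, M, B, D, H, V, N


def n_motifs(length, max_degenerate):
    """Number of IUPAC patterns of `length` with at most
    `max_degenerate` degenerate positions."""
    return sum(
        comb(length, j) * N_BASES ** (length - j) * N_DEGENERATE ** j
        for j in range(min(length, max_degenerate) + 1)
    )


def n_consensus_oligos(consensus_length, length):
    """Upper bound on the oligos read from one consensus sequence."""
    return max(consensus_length - length + 1, 0)


exact_7mers = n_motifs(7, 0)          # 4**7 exact oligos
one_degenerate = n_motifs(7, 1)       # allow one degenerate position
fully_degenerate = n_motifs(7, 7)     # every position may be degenerate
per_consensus = n_consensus_oligos(600, 7)
```

The enumerated space grows exponentially with motif length and with the number of degenerate positions, whereas the number of candidate oligos taken from aligned sequences grows only linearly with the amount of sequence examined.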
In summary, c-REDUCE benefits from the use of conser-
vation in two ways. First, it predicts degenerate motifs, but
reduces the search space by only focusing on naturally
occurring degeneracies that appear across multiple spe-
cies. Second, by examining sequences from multiple spe-
cies, it will discount chance matches of a motif in a single
species if the match has a highly degenerate consensus
sequence in the multiple species alignment. The degener-
acy of the consensus, reflecting random mutations in
other species, makes a functional TFBS at that position
less likely. We evaluate how well our method predicts
transcription factor binding site motifs on ChIP-chip
(chromatin immunoprecipitation on microarray) data in
yeast and gene expression data in Drosophila. We find
that the conservation-based and correlation-based
approaches perform better in combination than they do
individually.
c-REDUCE applied to yeast data
c-REDUCE was first applied to the 78 genome-wide loca-
tion data sets of 37 TFs where six other methods failed to
identify the motif specified in the literature for that TF.
These six alternative methods were applied to sets of
sequences that were determined to be significantly
enriched for TF binding. Two of the six methods also
incorporated sequence conservation. In comparison, c-
REDUCE uses upstream sequences from the entire set of
genes and incorporates conservation information. The
results for both c-REDUCE and f-REDUCE (degenerate
motifs but without conservation) are displayed in Tables
1 and 2. For 18 of the 37 transcription factors, c-REDUCE
identified the specified motif in at least one of the condi-
tions, while f-REDUCE discovered the correct motif for 10
transcription factors. In many cases, both programs were
successful, but f-REDUCE often discovered a shorter or
more degenerate motif than c-REDUCE. Neither program
is suitable for finding long motifs with dimer patterns.
Therefore, some of the missed cases were for TFs
such as GAL80, with the motif CGGn(11)CCG.
26. Ward LD, Bussemaker HJ: Predicting functional transcription
factor binding through alignment-free and affinity-based
analysis of orthologous promoter sequences. Bioinformatics
(Oxford, England) 2008, 24(13):i165-171.
27. Kawahara Y, Imanishi T: A genome-wide survey of changes in
protein evolutionary rates across four closely related species
of Saccharomyces sensu stricto group. BMC Evolutionary Biology
28. Gaunt MW, Miles MA: An Insect Molecular Clock Dates the
Origin of the Insects and Accords with Palaeontological and
Biogeographic Landmarks. Mol Biol Evol 2002, 19(5):748-761.
29. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties
and weight matrix choice. Nucleic Acids Res 1994,
30. Markstein M, Zinzen R, Markstein P, Yee KP, Erives A, Stathopoulos
A, Levine M: A regulatory code for neurogenic gene expression
in the Drosophila embryo. Development 2004,
31. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C,
Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining
tens of millions of expression profiles – database and tools
update. Nucleic Acids Res 2007, 35(suppl_1):D760-765.
32. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM,
Haussler D: The human genome browser at UCSC. Genome
Res 2002, 12(6):996-1006.
33. Wilson RJ, Goodman JL, Strelets VB: FlyBase: integration and
improvements to query tools. Nucleic Acids Res 2008:D588-593.