Analysis of overrepresented motifs in human core
promoters reveals dual regulatory roles of YY1
Hualin Xi,1 Yong Yu,1 Yutao Fu,1 Jonathan Foley,2 Anason Halees,1
and Zhiping Weng1,2,3
1Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA; 2Department of Biomedical Engineering,
Boston University, Boston, Massachusetts 02215, USA
A set of 723 high-quality human core promoter sequences was compiled and analyzed for overrepresented motifs.
Besides the two well-characterized core promoter motifs (TATA and Inr), several known motifs (YY1, Sp1, NRF-1,
NRF-2, CAAT, and CREB) and one potentially new motif (motif8) were found. Interestingly, YY1 and motif8 mostly
reside immediately downstream from the transcription start site (TSS). In particular, the YY1 motif occurs primarily in genes with 5′-UTRs
shorter than 40 base pairs (bp), and its location coincides with the translation start site. We verified that the YY1
motif is bound by YY1 in vitro. We then performed a detailed analysis of YY1 chromatin immunoprecipitation data
from a whole-genome human promoter microarray (ChIP-chip) and found that the YY1-bound promoters identified in
HeLa cells were highly enriched in the YY1 motif. Moreover, the motif overlapped with the translation start sites
on the plus strand of one group of genes, many with short 5′-UTRs, and with the transcription start sites on the minus
strand of another distinct group of genes; together, the two groups of genes accounted for the majority of the
YY1-bound promoters in the ChIP-chip data. Furthermore, the first group of genes was highly enriched in the
functional categories of ribosomal proteins and nuclear-encoded mitochondrial proteins. We suggest that the YY1
motif plays a dual role in both transcription and translation initiation of these genes. We also discuss the
evolutionary advantages of housing a transcriptional element inside the transcript in terms of the migration of these
genes in the human genome.
[Supplemental material is available online at www.genome.org.]
The core promoter, consisting of ∼100 bp flanking the transcrip-
tion start site (TSS), plays an essential role in transcriptional ini-
tiation. It facilitates the assembly of the transcription initiation
complex around the TSS. The TATA box is the best-characterized
motif in this region (Smale and Kadonaga 2003). Its proper po-
sitioning is required to determine the starting point of the tran-
scription; however, many promoters do not contain a TATA
(Smale 1997), and the transcription initiation mechanisms for
these promoters are not yet well understood. Several other core
promoter motifs have been identified, with Initiator (Inr) being
the best-studied example (Smale and Baltimore 1989; Javahery et
al. 1994). It has the consensus YYAN(T/A)YY and is functionally
similar to TATA in facilitating TFIID (TBP) binding. Besides Inr, a
downstream promoter element called DPE was found in Dro-
sophila. It occurs frequently in TATA-less promoters and appears
to function cooperatively with Inr (Burke and Kadonaga 1996).
Most recently, a new downstream core promoter element was
identified by analyzing overrepresented motifs in Drosophila core
promoter sequences and later verified experimentally (Ohler et
al. 2002; Lim et al. 2004). Downstream core promoter motifs
occur frequently in Drosophila; however, their occurrence in hu-
mans remains to be seen. In humans, Sp1 and CAAT box have
been reported in several TATA-less promoters as the regulatory
elements for transcription initiation (Huber et al. 1998; Manto-
Typical mammalian transcription regulatory motifs span
only 6–12 bp and are degenerate. When the total length of the
input sequences is large, it is difficult to distinguish the real
regulatory motifs in them from random short sequences. Fortunately,
core promoters consist of only a short stretch of sequences flank-
ing the transcription start sites. This drastically reduces the se-
quence search space. Recent efforts in large-scale sequencing of
human full-length cDNAs have provided a unique opportunity
to identify the precise positions of TSSs. In this study, we ex-
tracted high-quality human core promoter sequences (defined as
−70 to +50 bp around the TSS throughout this study) from the
Database of Transcription Start Sites (DBTSS, http://dbtss.hgc.jp)
and analyzed them with the motif-finding program MEME (Bai-
ley and Elkan 1994; Bailey et al. 1997). TATA was found in only
26% of these promoters. In addition to TATA and Inr, we iden-
tified several overrepresented motifs. Interestingly, some of these
motifs, although generally overrepresented in the entire set of
core promoters, are underrepresented in TATA-containing pro-
moters. With all of the overrepresented motifs identified in this
study, >50% of the TATA-less core promoters can be accounted
for, indicating an improved understanding of the general tran-
scriptional initiation mechanism.
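As an illustration of the window definition used above, the following Python sketch extracts the −70 to +50 region around a TSS from a chromosome sequence. It is not the DBTSS pipeline: the function names, the 0-based TSS index convention, and the treatment of the minus strand are assumptions made for this example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N is kept as N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def core_promoter(chrom_seq: str, tss: int, strand: str,
                  upstream: int = 70, downstream: int = 50) -> str:
    """Extract the core promoter window around a TSS (hypothetical helper).

    Assumptions: `tss` is the 0-based index, on the forward strand of
    `chrom_seq`, of the +1 nucleotide; there is no position 0, so -1 is the
    base immediately upstream of +1. The result is always returned in the
    orientation of the transcript and has length upstream + downstream.
    """
    if strand == "+":
        # -70..-1 are indices tss-70..tss-1; +1..+50 are tss..tss+49.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, downstream means decreasing forward index.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends beyond the sequence")
    window = chrom_seq[start:end]
    return window if strand == "+" else reverse_complement(window)


# Toy example: 70 upstream bases, a G at +1, then 49 downstream bases.
toy = "A" * 70 + "G" + "C" * 49
window = core_promoter(toy, tss=70, strand="+")
print(len(window), window[70])
```

Returning every window in transcript orientation means that a position-specific analysis can treat index 70 of any extracted sequence as the +1 nucleotide, regardless of the strand the gene lies on.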
One identified motif was particularly interesting. It matched
the YY1-binding consensus; however, it occurred downstream
from the TSS and often overlapped with the translation start site.
We demonstrated experimentally that it could be recognized by
YY1 in vitro. The similarity between this YY1 motif and the
Kozak sequence (Kozak 1984; Kozak 1987b) led us to question
whether the YY1 motif was merely a special Kozak sequence and
its ability to bind YY1 was coincidental. Two observations we
made suggest otherwise: The YY1 motif was evolutionarily more
conserved than Kozak, and most genes with the YY1 motif had
extremely short 5′-UTRs (<40 bp). In addition, the raw experimental
data on YY1 chromatin immunoprecipitation with a
whole-genome human promoter microarray (ChIP-chip) were
made available to us (B. Ren, pers. comm.). We performed a detailed
analysis of the ChIP-chip data and uncovered a series of
striking parallels between the aforementioned results on the YY1
motif and those on the YY1 ChIP-chip data. Both consistently
suggested two distinct binding modes of YY1: one on the plus
strand downstream of the TSS and the other on the minus strand
upstream of the TSS, with the former often overlapping the translation
start site and the latter the transcription start site. We supply
multiple lines of computational evidence to argue for distinct regulatory
mechanisms for the two binding modes and discuss the
evolutionary implications of the dominant downstream-of-TSS
mode. As little is known about downstream core promoter elements
in humans, our finding could represent a
new mechanism for transcription regulation in eukaryotes.
E-mail firstname.lastname@example.org; fax (617) 353-6766.
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5754707.
Freely available through the Genome Research Open Access option.
17:798–806 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07; www.genome.org
We first analyzed overrepresented motifs found in core promot-
ers in terms of their consensus, positional specificity, co-
occurrence with TATA, and evolutionary conservation. Then,
one overrepresented motif, YY1, was characterized in great detail
by using both experimental and computational approaches.
Overrepresented core promoter motifs
Based on the quality assessment of TSS
mapping in DBTSS (see Supplemental
Fig. 1), 723 core promoter sequences
with >15 mapped cDNA sequences were
selected for MEME motif analysis. Inr,
known to be located at the +1 position,
was between –70 and +30 in 90% of
these 723 sequences, indicating accurate
mapping of the TSSs. The top 15 scoring
motifs found by MEME were recorded.
Among them, five highly degenerate
motifs were omitted from further analysis
(Supplemental Fig. 3). The remaining
10 motifs are listed in Table 1. The posi-
tion-specific scoring matrices (PSSMs) of
these motifs were compared with all ma-
trices in the TRANSFAC database by us-
ing the MALIGN algorithm (Haverty et
al. 2004a). Nine motifs matched TRANS-
FAC matrices, while motif8 appeared to
be novel (the first column of Table 1).
The known motifs included core pro-
moter motifs (TATA, Inr) and several
other well-studied motifs (CAAT box,
Sp1, NRF-1, NRF-2 [also known as
GABP], and CREB). The second-best
scoring motif matched the previously re-
ported YY1-binding profile (Shrivastava
and Calame 1994) in the reverse direction.
All 10 motifs showed positional
specificity with respect to the TSS (Ta-
ble 1). TATA and Inr, as expected, showed extremely strong po-
sitional specificity. Similar levels of positional specificity were
also observed for YY1 and motif8 and both peaked immediately
downstream from the TSS. Co-occurrence of these motifs with
TATA was analyzed in the 723 promoter sequences (Supplemen-
tal Table 1). YY1, NRF-1, NRF-2, and motif8 showed significantly
lower-than-expected co-occurrence with TATA. In the extreme
case of NRF-2, only two cases of co-occurrence were observed as
opposed to the expected 13 cases. Thus, these motifs might assist
transcription initiation in TATA-less genes. The statistics for
other motifs, including CAAT and CREB, were not significant,
possibly due to their lower abundance in the sequence set. All 10
motifs found in humans were also overrepresented in a set of
1849 high-quality mouse core promoter sequences from DBTSS
with >10 mapped cDNA sequences. As conservation often sug-
gests functional importance, overrepresentation observed in
both humans and the mouse supports the functional relevance of
these motifs in core promoters.
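The comparison of observed and expected co-occurrence can be sketched as follows, assuming that the expectation is computed under independence and, as one possible test, that depletion is assessed with a hypergeometric lower tail. The test actually used for Supplemental Table 1 is not stated in this section. The counts below (188 TATA-containing promoters, i.e., about 26% of 723, and 50 NRF-2-containing promoters) are illustrative values chosen to give an expectation of about 13; they are not taken from the paper.

```python
from math import comb


def expected_cooccurrence(n_total: int, n_a: int, n_b: int) -> float:
    """Expected number of promoters carrying both motifs under independence."""
    return n_a * n_b / n_total


def hypergeom_lower_tail(k: int, n_total: int, n_a: int, n_b: int) -> float:
    """P(X <= k), where X is the number of promoters carrying both motifs.

    Model (an assumption of this sketch): the n_b promoters with motif B are
    drawn without replacement from n_total promoters, n_a of which carry
    motif A. A small value indicates fewer co-occurrences than expected.
    """
    lowest = max(0, n_b - (n_total - n_a))   # smallest possible overlap
    highest = min(k, n_a, n_b)               # cap at the largest possible overlap
    favourable = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i)
        for i in range(lowest, highest + 1)
    )
    return favourable / comb(n_total, n_b)


# Illustrative (hypothetical) counts, see the text above.
n_total, n_tata, n_nrf2, observed = 723, 188, 50, 2
print(expected_cooccurrence(n_total, n_tata, n_nrf2))   # about 13
print(hypergeom_lower_tail(observed, n_total, n_tata, n_nrf2))
```

Exact integer arithmetic with `math.comb` avoids the rounding problems of multiplying many small probabilities, at the cost of working with large intermediate integers, which is acceptable for a set of a few hundred promoters.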
The YY1 motif overlaps with the Kozak sequence
Of the 723 high-quality DBTSS promoters, 54 contained the YY1
motif. For 44 of them, the YY1 motif overlapped with the trans-
lation start site, with the ATG nucleotides in the motif coinciding
with the initiation codon. The sequence surrounding a transla-
tion start site, commonly referred to as the Kozak sequence
(Kozak 1984; 1987b), facilitates the assembly of the ribosome
around the mRNA molecule during translation initiation. De-
Table 1. Summary of overrepresented motifs identified from the MEME analysis
Known motifs are indicated in parentheses. If a motif occurs multiple times in the same promoter, the
position of the best match was used in calculating the positional distribution.
(a) Number of promoters containing the motif in the 10,577 unique DBTSS human promoters.
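The best-match rule in the table note can be illustrated with a short Python sketch that scans a promoter with a log-odds matrix and reports only the highest-scoring position. The matrix construction (pseudocount of 1, uniform background), the function names, and the toy four-column motif are assumptions for this example; they are not the PSSMs or scoring scheme of the study.

```python
from math import log2


def pssm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn per-column base counts into log-odds scores (hypothetical helper)."""
    pssm = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        pssm.append({
            base: log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def best_match(seq, pssm):
    """Return (score, start) of the best-scoring window, or None if none fits.

    Ties are resolved in favour of the leftmost window; windows containing
    characters other than A, C, G, or T are skipped.
    """
    width = len(pssm)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(column[base] for column, base in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def to_tss_coordinate(index, upstream=70):
    """Map a 0-based index in a -70..+50 window to a TSS-relative position.

    There is no position 0: index 69 is -1 and index 70 is +1.
    """
    offset = index - upstream
    return offset + 1 if offset >= 0 else offset


# Toy motif (not the YY1 matrix) placed at index 72 of a 120-bp window.
toy_pssm = pssm_from_counts([{"A": 8}, {"T": 8}, {"G": 8}, {"G": 8}])
toy_seq = "C" * 72 + "ATGG" + "C" * 44
score, start = best_match(toy_seq, toy_pssm)
print(start, to_tss_coordinate(start))
```

Keeping one position per promoter means that each promoter contributes exactly once to the positional distribution, so promoters with several weak matches do not dominate the histogram.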
Kozak, M. 1987a. An analysis of 5′-noncoding sequences from 699
vertebrate messenger RNAs. Nucleic Acids Res. 15: 8125–8148.
Kozak, M. 1987b. At least six nucleotides preceding the AUG initiator
codon enhance translation in mammalian cells. J. Mol. Biol.
Lavie, L., Maldener, E., Brouha, B., Meese, E.U., and Mayer, J. 2004. The
human L1 promoter: Variable transcription initiation sites and a
major impact of upstream flanking sequence on promoter activity.
Genome Res. 14: 2253–2260.
Li, W.W., Hsiung, Y., Wong, V., Galvin, K., Zhou, Y., Shi, Y., and Lee,
A.S. 1997. Suppression of grp78 core promoter element-mediated
stress induction by the dbpA and dbpB (YB-1) cold shock domain
proteins. Mol. Cell. Biol. 17: 61–68.
Lim, C.Y., Santoso, B., Boulay, T., Dong, E., Ohler, U., and Kadonaga,
J.T. 2004. The MTE, a new core promoter element for transcription
by RNA polymerase II. Genes & Dev. 18: 1606–1617.
Mantovani, R. 1998. A survey of 178 NF-Y binding CCAAT boxes.
Nucleic Acids Res. 26: 1135–1143.
Ohler, U., Liao, G.C., Niemann, H., and Rubin, G.M. 2002.
Computational analysis of core promoters in the Drosophila genome.
Genome Biol. 3: RESEARCH0087.
Riggs, K.J., Saleque, S., Wong, K.K., Merrell, K.T., Lee, J.S., Shi, Y., and
Calame, K. 1993. Yin-yang 1 activates the c-myc promoter. Mol. Cell.
Biol. 13: 7487–7495.
Safrany, G. and Perry, R.P. 1995. The relative contributions of various
transcription factors to the overall promoter strength of the mouse
ribosomal protein L30 gene. Eur. J. Biochem. 230: 1066–1072.
Shi, Y., Seto, E., Chang, L.S., and Shenk, T. 1991. Transcriptional
repression by YY1, a human GLI-Kruppel-related protein, and relief
of repression by adenovirus E1A protein. Cell 67: 377–388.
Shi, Y., Lee, J.S., and Galvin, K.M. 1997. Everything you have ever
wanted to know about Yin Yang 1. Biochim. Biophys. Acta
Shrivastava, A. and Calame, K. 1994. An analysis of genes regulated by
the multi-functional transcriptional regulator Yin Yang-1. Nucleic
Acids Res. 22: 5151–5155.
Smale, S.T. 1997. Transcription initiation from TATA-less promoters
within eukaryotic protein-coding genes. Biochim. Biophys. Acta
Smale, S.T. and Baltimore, D. 1989. The “initiator” as a transcription
control element. Cell 57: 103–113.
Smale, S.T. and Kadonaga, J.T. 2003. The RNA polymerase II core
promoter. Annu. Rev. Biochem. 72: 449–479.
Smit, A.F. 1996. The origin of interspersed repeats in the human
genome. Curr. Opin. Genet. Dev. 6: 743–748.
Smith, E., Meyerrose, T.E., Kohler, T., Namdar-Attar, M., Bab, N., Lahat,
O., Noh, T., Li, J., Karaman, M.W., Hacia, J.G., et al. 2005. Leaky
ribosomal scanning in mammalian genomes: Significance of histone
H4 alternative translation in vivo. Nucleic Acids Res. 33: 1298–1308.
Srinivasan, L. and Atchison, M.L. 2004. YY1 DNA binding and PcG
recruitment requires CtBP. Genes & Dev. 18: 2596–2601.
Thomas, M.J. and Seto, E. 1999. Unlocking the mechanisms of
transcription factor YY1: Are chromatin modifying enzymes the key?
Gene 236: 197–208.
Wei, C.L., Wu, Q., Vega, V.B., Chiu, K.P., Ng, P., Zhang, T., Shahab, A.,
Yong, H.C., Fu, Y., Weng, Z., et al. 2006. A global map of p53
transcription-factor binding sites in the human genome. Cell
Received July 13, 2006; accepted in revised form January 9, 2007.