Genome Biology 2007, 8:R23
Moses et al. 2007, Volume 8, Issue 2, Article R23
Clustering of phosphorylation site recognition motifs can be
exploited to predict the targets of cyclin-dependent kinase
Alan M Moses, Jean-Karim Hériché and Richard Durbin
Address: Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1HH, UK.
Correspondence: Alan M Moses. Email: email@example.com
© 2007 Moses et al.; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cyclin-dependent kinase target prediction
A novel computational strategy is used to predict cyclin-dependent kinase targets by exploiting the propensity of their recognition motifs to occur in clusters on substrate proteins.
Protein kinases are critical to cellular signalling and post-translational gene regulation, but their
biological substrates are difficult to identify. We show that cyclin-dependent kinase (CDK)
consensus motifs are frequently clustered in CDK substrate proteins. Based on this, we introduce
a new computational strategy to predict the targets of CDKs and use it to identify new biologically
interesting candidates. Our data suggest that regulatory modules may exist in protein sequence as
clusters of short sequence motifs.
Protein kinases are ubiquitous components of cellular signal-
ling networks . A relatively well understood example is the
network that controls progression of the cell cycle, where cyc-
lin-dependent kinases (CDKs) couple with various cyclins
over the cell cycle to regulate critical processes [2-4]. Despite
their biological and medical importance, relatively few direct,
in vivo targets of these kinases have been identified conclu-
sively, because experimental techniques are difficult and time
consuming [1,5]. With the availability of databases of protein
sequences, computational methods provide an alternative
Kinase substrates often have short, degenerate sequence
motifs surrounding the phosphorylated residue . Putative
target residues can be predicted by searching for matches to
the consensus for a particular kinase. For example, CDK sub-
strates often contain S/T-P-X-R/K where X represents any
amino acid, and S/T represents the phosphorylated serine or
threonine [9,10]. Because of the low specificity of the CDK
consensus, however, databases of protein sequences are
expected to contain large numbers of matches by chance.
Therefore, many of the matches in protein sequences are
likely to be false-positive predictions. Consistent with this,
when 553 Saccharomyces cerevisiae proteins with at least
one match to the CDK consensus were tested in a high-
throughput kinase assay, only 32% (178) were found to be
substrates . Furthermore, in some cases characterized
CDK substrates are phosphorylated at residues matching only
a minimal consensus S/T-P ; considering these weak
matches would probably lead to even larger numbers of false
Characterized CDK targets may be phosphorylated at multi-
ple residues (for instance, see the report by Lees and cowork-
ers ). Recent studies of several CDK target proteins in S.
cerevisiae have shown that these multiple phosphorylations
can regulate stability , protein interaction [14,15], or
Published: 22 February 2007
Genome Biology 2007, 8:R23 (doi:10.1186/gb-2007-8-2-r23)
Received: 29 September 2006
Revised: 16 January 2007
Accepted: 22 February 2007
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2007/8/2/R23
localization . Motivated by these observations, we pro-
pose an alternative computational strategy to identify sub-
strates of CDKs; instead of attempting to predict individual
phosphorylation sites, we search for proteins that contain
high densities of strong and weak consensus matches that are
closely spaced in the primary amino acid sequence. (We refer
to this close spacing as 'clustering', and this should not be
confused with clustering of multivariate data.)
Taking advantage of the results of a high-throughput study
, we show statistically that CDK1 targets in S. cerevisiae
contain multiple closely spaced consensus matches and we
develop computational methods to identify such proteins. We
also find that these clusters tend to occur in disordered or
unfolded regions near the termini of the protein. We show
that it is possible to predict proteins that are likely to be tar-
gets of CDKs in S. cerevisiae by searching for proteins that
contain clustered matches to the CDK consensus. We also
show that human CDK targets are enriched for proteins that
contain clustered consensus matches and, by searching
human cell cycle genes, we predict several putative CDK tar-
gets, including the human orthologs of Schizosaccharomyces
pombe CDC5 (CDC5L) and S. cerevisiae Cdc20p (CDC20).
Finally, we examine co-clustering of the CDK consensus
motifs with the 'cy' or RXL motif , which is known to be
important in determining which CDK-cyclin complex will
phosphorylate a given substrate.
Targets of Cdk1p in S. cerevisiae contain clusters of
matches to the CDK consensus
CDK substrates in S. cerevisiae are often phosphorylated at
multiple serine or threonine residues, some of which match
the full (henceforth 'strong') consensus S/T-P-X-R/K,
whereas others match a minimal (henceforth 'weak') consen-
sus S/T-P. For example, the amino-terminal region of Cdc6p
(Figure 1b) is a direct target of Cdk1p (also known as Cdc28p),
and contains three strong matches and one weak match to the CDK consensus.
In order to test whether these observations could be used to
predict new substrates, we first compared the number of
matches of each motif per residue in a set of 12 Cdk1p targets
known from low-throughput biochemical and genetic experi-
ments (compiled by Ubersax and coworkers ; henceforth
referred to as 'known' targets; see Table 1 and Figure 1a) with
the number in the genome. We find a highly significant, more
than ninefold enrichment of the strong consensus (Figure 2a,
left side) but not for a scrambled version (P-R/K-X-S/T) of
the consensus (Figure 2a, right side), indicating that the
enrichment is not due to simple compositional effects. For the
weak consensus (after masking the strong consensus), we
also find enrichment over the genome and not for a scrambled
consensus (after masking the weak and strong consensus),
but it is less striking (less than twofold; Figure 2b).
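The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' scripts: the regular expressions encode the strong (S/T-P-X-R/K), weak (S/T-P) and scrambled (P-R/K-X-S/T) patterns named in the text, whereas the function names, the '#' masking character and the idea of passing sequences as plain strings are assumptions made for the example.

```python
import re

# Patterns taken from the text; '.' stands for X (any amino acid).
STRONG = re.compile(r"[ST]P.[RK]")            # S/T-P-X-R/K
WEAK = re.compile(r"[ST]P")                   # S/T-P
SCRAMBLED_STRONG = re.compile(r"P[RK].[ST]")  # P-R/K-X-S/T


def find_matches(seq):
    """Return (strong_positions, weak_positions), 0-based.

    Weak matches are searched only after masking strong matches,
    so a strong site is never also counted as a weak site.
    """
    strong = [m.start() for m in STRONG.finditer(seq)]
    # '#' is an arbitrary masking character (assumption of this sketch).
    masked = STRONG.sub(lambda m: "#" * len(m.group()), seq)
    weak = [m.start() for m in WEAK.finditer(masked)]
    return strong, weak


def matches_per_residue(seqs, pattern):
    """Number of pattern matches divided by the total number of residues."""
    n_matches = sum(len(pattern.findall(s)) for s in seqs)
    n_residues = sum(len(s) for s in seqs)
    return n_matches / n_residues


def fold_enrichment(target_seqs, genome_seqs, pattern):
    """Ratio of match density in a target set to the density in the genome."""
    return (matches_per_residue(target_seqs, pattern)
            / matches_per_residue(genome_seqs, pattern))
```

Running `fold_enrichment` once with `STRONG` and once with `SCRAMBLED_STRONG` on the same two sets of sequences gives the kind of comparison described above; a statistical test of the difference would still have to be added separately.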
Because we were concerned that the discovery of the known
targets may have been biased by the observation that they
contained many matches to the strong consensus, we also
computed these frequencies for the 18 proteins out of a set of
198 randomly chosen genes from S. cerevisiae identified as
Cdk1p targets in a high-throughput assay  (henceforth
referred to as 'unbiased positives'; see Table 1). We found
similar results in this unbiased positive set, although the
enrichment of strong matches was just under fourfold in this
case and the enrichment of weak matches was less than 1.5-
fold (Figure 2). That the fold enrichment is somewhat less in
this set is consistent with some of the enrichment in the
known set being due to bias in their discovery, but also with
some false-positive findings being picked up in the kinase
assay. Nevertheless, this rules out the possibility that the
enrichment of matches in bona fide CDK substrates is only
the result of a bias.
Examination of phosphorylated residues in CDK target pro-
teins reveals that they are often found 'clustered' in one region
of the primary amino acid sequence (Figure 1). We sought to
test whether this apparent clustering was due simply to a uni-
form overall enrichment of consensus matches in these pro-
teins, or whether it was a preference for the consensus
matches to occur near each other. We modeled the number of
residues until a strong or weak match was identified using a
bivariate geometric distribution (see Materials and methods,
below). We then performed a likelihood ratio test (LRT)
between the hypothesis (H1) that the spacings were drawn
Table 1. CDK target sets used in this study
Set | Ascertainment | Total | Definition of target | Positives
randomly chosen proteins
All S. cerevisiae proteins containing two or more matches to the 'strong' CDK consensus
All S. cerevisiae proteins containing one match to the 'strong' CDK consensus and
exhibiting cell cycle regulated transcription
Low-throughput experimental characterization
Score > 2 in high-throughput assay
Score > 2 in high-throughput assay
'1cc' 137 Score > 2 in high-throughput assay 32
Four cyclin-dependent kinase (CDK) target sets from Saccharomyces cerevisiae. Note that only the high-throughput data contain 'negatives'. The 'strong' CDK consensus is
S/T-P-X-R/K, where X represents any amino acid.
dues after the final match, we combine them with the residues
before the first match. This also ensures that a given set of
matches has the same probability regardless of where it
occurs in the protein (relative to the start). Another technical
issue with the application of geometric models to proteins is
that the decision to begin 'counting' the residues from the
amino-terminus or 'left' end is arbitrary; we could equally
well have started from the carboxyl-terminus or 'right' end.
We confirmed that this makes little difference; counting from
'right' to 'left' gave qualitatively very similar results.
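The wrap-around treatment of the residues after the final match can be made concrete with a short Python sketch. The exact form of the geometric model is defined in a part of the Methods not reproduced here, so the likelihood below is an assumption: at each residue a strong match occurs with probability f_strong, a weak match with probability f_weak, and otherwise counting continues. The function names and the 0-based position convention are also hypothetical.

```python
import math


def wrapped_spacings(positions, length):
    """Spacing (in residues) up to and including each match.

    The residues after the final match are added to the residues
    before the first match, so the spacings always sum to `length`.
    """
    positions = sorted(positions)
    if not positions:
        return []
    tail = length - 1 - positions[-1]          # residues after final match
    spacings = [positions[0] + 1 + tail]       # first spacing absorbs the tail
    spacings += [b - a for a, b in zip(positions, positions[1:])]
    return spacings


def geometric_loglik(matches, length, f_strong, f_weak):
    """Log-likelihood of matches under the assumed two-type geometric model.

    `matches` is a list of (position, kind) with kind 'strong' or 'weak'.
    """
    matches = sorted(matches)
    spacings = wrapped_spacings([p for p, _ in matches], length)
    log_miss = math.log1p(-(f_strong + f_weak))
    log_hit = {"strong": math.log(f_strong), "weak": math.log(f_weak)}
    return sum((d - 1) * log_miss + log_hit[kind]
               for d, (_, kind) in zip(spacings, matches))
```

Under this sketch, shifting every match by the same offset within the protein leaves the spacings, and therefore the likelihood, unchanged, which is the property noted in the text.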
To use these geometric models for hypothesis testing we proceed
as follows. The single-component multivariate geometric model
has two parameters (the densities of strong and weak
matches, f), whereas the two-component mixture has five
(two sets of densities, f1 and f2, and a mixing parameter π). We
note that these models are nested; the single-component
model (H0) is a two-component mixture where the parame-
ters for the two components are constrained to be equal (f1 =
f2). Because the likelihood in the single-component case (H0)
is independent of the mixing parameter, π, there is a three-parameter
difference between the two hypotheses. We therefore expect the
likelihood ratio statistic to follow a χ² distribution with three degrees
of freedom. To verify this, we randomly permuted the
positions of the consensus matches in the 'known' set 100
times, and computed the likelihood ratio statistic for the
comparison of the two models; we found reasonable agree-
ment with expectation (data not shown).
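As a rough illustration of the test just described, the following Python sketch converts two maximized log-likelihoods into a likelihood ratio statistic and a P value from the χ² distribution with three degrees of freedom, using the closed-form survival function available for odd degrees of freedom. The helper for permuting match positions reflects one possible reading of the permutation check (random distinct positions, shuffled match types); all function names are hypothetical.

```python
import math
import random


def chi2_sf_3df(x):
    """Survival function P(X > x) of a chi-squared variable with 3 d.f."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


def likelihood_ratio_test(loglik_h0, loglik_h1):
    """Return (statistic, P value) for nested models differing by 3 parameters."""
    stat = 2.0 * (loglik_h1 - loglik_h0)
    return stat, chi2_sf_3df(stat)


def permute_match_positions(length, kinds, rng):
    """Place the same number of matches at random distinct positions.

    `kinds` is the list of match types ('strong' or 'weak') in the protein;
    the types are shuffled along with the positions (an assumption).
    """
    positions = sorted(rng.sample(range(length), len(kinds)))
    shuffled = list(kinds)
    rng.shuffle(shuffled)
    return list(zip(positions, shuffled))
```

Repeating the model fit on many permuted proteins and comparing the resulting statistics with `chi2_sf_3df` is one way to check the null distribution empirically.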
We compute SLR for each protein as follows. For each protein,
we obtain the set of matches (their positions and type, strong
or weak) and compute the likelihood under the following: Hbg
(assuming the matches were randomly drawn from the
genome frequencies); Hc (fitting the mixture using the EM
algorithm described above, but keeping the background component
set to the genome frequencies); and Hns (as for Hc, but
additionally constraining the frequency of strong matches in
the cluster component to be less than or equal to the background
frequency). We combine these likelihoods as described in Results
(above). As before, we run the EM algorithm from five random
starting points for each protein.
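The idea of fitting a cluster component against a fixed background, with several random restarts, can be illustrated with a deliberately simplified Python sketch. It is not the model used in the paper: it assumes a single match type, treats the spacings between matches as independent draws from a two-component geometric mixture, and does not implement the Hns constraint or the SLR score. All names and default values are hypothetical.

```python
import math
import random


def geom_logpmf(d, f):
    """log P(D = d) for a geometric distribution on d = 1, 2, ..."""
    return (d - 1) * math.log1p(-f) + math.log(f)


def _log_joint(d, pi, f_cluster, f_bg):
    """Log weighted probability of spacing d under each component."""
    a = math.log(pi) + geom_logpmf(d, f_cluster)
    b = math.log1p(-pi) + geom_logpmf(d, f_bg)
    return a, b


def mixture_loglik(spacings, pi, f_cluster, f_bg):
    """Log-likelihood of the spacings under the two-component mixture."""
    total = 0.0
    for d in spacings:
        a, b = _log_joint(d, pi, f_cluster, f_bg)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def fit_cluster_component(spacings, f_bg, n_starts=5, n_iter=200, seed=0):
    """EM with random restarts; the background density f_bg stays fixed.

    Returns (log-likelihood, pi, f_cluster) for the best restart.
    """
    rng = random.Random(seed)
    eps = 1e-6
    best = None
    for _ in range(n_starts):
        pi = rng.uniform(0.1, 0.9)
        f_cluster = rng.uniform(0.05, 0.5)
        for _ in range(n_iter):
            # E-step: probability that each spacing came from the cluster.
            resp = []
            for d in spacings:
                a, b = _log_joint(d, pi, f_cluster, f_bg)
                m = max(a, b)
                ea, eb = math.exp(a - m), math.exp(b - m)
                resp.append(ea / (ea + eb))
            # M-step: update the mixing weight and the cluster density only.
            weight = sum(resp)
            weighted_len = sum(r * d for r, d in zip(resp, spacings))
            pi = min(max(weight / len(spacings), eps), 1 - eps)
            f_cluster = min(max(weight / max(weighted_len, 1e-300), eps), 1 - eps)
        ll = mixture_loglik(spacings, pi, f_cluster, f_bg)
        if best is None or ll > best[0]:
            best = (ll, pi, f_cluster)
    return best
```

Keeping the restart with the highest likelihood guards against EM settling in a poor local optimum, which is the usual motivation for using several random starting points.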
Position and foldedness of maximal cluster
We identified the optimal cluster using SBN as described
above. To compute the position of the cluster, we calculated
the distance between the start of the protein and the start of
the cluster, and between the end of the cluster and the end of
the protein. We then took the minimum of these divided by
the length of the protein to be the position. We computed the
foldedness as If = 2.785 × H - |R| - 1.51 [21,22], where H is the
average hydropathy  per residue and |R| is the absolute
value of the charge (at pH 7.0) per residue in the cluster.
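The two quantities defined above can be computed as in the following Python sketch. The formula If = 2.785 × H - |R| - 1.51 is taken from the text and the hydropathy values are the Kyte-Doolittle scale. Several details are assumptions of this sketch rather than statements from the paper: hydropathy is rescaled to the range 0 to 1, charge at pH 7.0 is approximated as +1 for K and R and -1 for D and E (histidine treated as neutral), and cluster coordinates are 0-based with an exclusive end.

```python
# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Approximate charges at pH 7.0 (assumption: histidine is neutral).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def fold_index(segment):
    """If = 2.785 * H - |R| - 1.51 for a sequence segment.

    H is the mean hydropathy rescaled to [0, 1] (assumption) and
    |R| is the absolute net charge per residue.
    """
    residues = [aa for aa in segment.upper() if aa in KD]
    if not residues:
        raise ValueError("segment contains no standard amino acids")
    h = sum((KD[aa] + 4.5) / 9.0 for aa in residues) / len(residues)
    r = abs(sum(CHARGE.get(aa, 0.0) for aa in residues)) / len(residues)
    return 2.785 * h - r - 1.51


def cluster_position(cluster_start, cluster_end, protein_length):
    """Distance from the cluster to the nearer terminus, as a fraction of length.

    `cluster_start` is 0-based and `cluster_end` is exclusive.
    """
    to_start = cluster_start
    to_end = protein_length - cluster_end
    return min(to_start, to_end) / protein_length
```

With this convention, negative values of `fold_index` indicate segments predicted to be unfolded, and a `cluster_position` near 0 indicates a cluster close to one of the termini.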
We submitted the yeast protein sequences to the batch Scan-
site  using low stringency, which yielded 12,134 Cdc2
matches in 4,048 of the 5,889 yeast proteins. We then took
the best (lowest) score for each of those proteins.
Additional data files
The following additional data are available with the online
version of this paper. Additional data file 1 contains the S. cer-
evisiae proteins with SLR above 3.5, and the human CDK tar-
gets and cell-cycle proteins with associated SLR scores. Scripts
to calculate SLR and SBN are available on AMM's website .
We gratefully acknowledge Dr David Morgan for providing data and for
stimulating discussions. We thank Dr José Jimenéz for discussions and
assistance with online resources, Dr Seth Grant for interesting discussions,
Dr Noboru Komiyama for help with protein unfolding, and Dr Avril Cogh-
lan for helpful comments on the manuscript. This work was supported by
a Sanger Institute Postdoctoral Fellowship to AMM. RD and the Wellcome
Trust Sanger Institute are supported by the Wellcome Trust.
1. Johnson SA, Hunter T: Kinomics: methods for deciphering the
kinome. Nat Methods 2005, 2:17-25.
2. Norbury CJ, Nurse P: Control of the higher eukaryote cell cycle
by p34cdc2 homologues. Biochim Biophys Acta 1989, 989:85-95.
3. Nasmyth K: Control of the yeast cell cycle by the Cdc28 protein
kinase. Curr Opin Cell Biol 1993, 5:166-179.
4. Murray AW: Cyclin-dependent kinases: regulators of the cell
cycle and more. Chem Biol 1994, 1:191-195.
5. Manning BD, Cantley LC: Hitting the target: emerging technologies
in the search for kinase substrates. Sci STKE 2002,
6. Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S:
Prediction of post-translational glycosylation and phosphorylation
of proteins from the amino acid sequence. Proteomics
7. Kobe B, Kampmann T, Forwood JK, Listwan P, Brinkworth RI:
Substrate specificity of protein kinases and computational
prediction of substrates. Biochim Biophys Acta 2005, 1754:200-209.
8. Kreegipuu A, Blom N, Brunak S, Jarv J: Statistical analysis of
protein kinase specificity determinants. FEBS Lett 1998, 430:45-50.
9. Songyang Z, Blechner S, Hoagland N, Hoekstra MF, Piwnica-Worms
H, Cantley LC: Use of an oriented peptide library to determine
the optimal substrates of protein kinases. Curr Biol 1994,
10. Endicott JA, Noble ME, Tucker JA: Cyclin-dependent kinases:
inhibition and substrate recognition. Curr Opin Struct Biol 1999,
11. Ubersax JA, Woodbury EL, Quang PN, Paraz M, Blethrow JD, Shah K,
Shokat KM, Morgan DO: Targets of the cyclin-dependent kinase
Cdk1. Nature 2003, 425:859-864.
12. Nash P, Tang X, Orlicky S, Chen Q, Gertler FB, Mendenhall MD,
Sicheri F, Pawson T, Tyers M: Multisite phosphorylation of a
CDK inhibitor sets a threshold for the onset of DNA
replication. Nature 2001, 414:514-521.
13. Lees JA, Buchkovich KJ, Marshak DR, Anderson CW, Harlow E: The
retinoblastoma protein is phosphorylated on multiple sites
by human cdc2. EMBO J 1991, 10:4279-4290.
14. Mimura S, Seki T, Tanaka S, Diffley JF: Phosphorylation-dependent
binding of mitotic cyclins to Cdc6 contributes to DNA
replication control. Nature 2004, 431:1118-1123.
15. Tak YS, Tanaka Y, Endo S, Kamimura Y, Araki H: A CDK-catalysed
regulatory phosphorylation for formation of the DNA replication
complex Sld2-Dpb11. EMBO J 2006, 25:1987-1996.
16. Liku ME, Nguyen VQ, Rosales AW, Irie K, Li JJ: CDK phosphoryla-
tion of a novel NLS-NES module distributed between two
subunits of the Mcm2-7 complex prevents chromosomal
rereplication. Mol Biol Cell 2005, 16:5026-5039.
Adams PD, Sellers WR, Sharma SK, Wu AD, Nalin CM, Kaelin WG
Jr: Identification of a cyclin-cdk2 recognition motif present in
substrates and p21-like cyclin-dependent kinase inhibitors.
Mol Cell Biol 1996, 16:6623-6633.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids Cambridge: Cambridge
University Press; 1998.
Bailey TL, Gribskov M: Combining evidence using p values:
application to sequence homology searches. Bioinformatics
Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obra-
dovic Z, Dunker AK: The importance of intrinsic disorder for
protein phosphorylation. Nucleic Acids Res 2004, 32:1037-1049.
Uversky VN, Gillespie JR, Fink AL: Why are 'natively unfolded'
proteins unstructured under physiologic conditions? Proteins
Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg EH, Man O,
Beckmann JS, Silman I, Sussman JL: FoldIndex: a simple tool to
predict whether a given protein sequence is intrinsically
unfolded. Bioinformatics 2005, 21:3435-3438.
Yaffe MB, Leparc GG, Lai J, Obata T, Volinia S, Cantley LC: A motif-
based profile scanning approach for genome-wide prediction
of signaling pathways. Nat Biotechnol 2001, 19:348-353.
Lee MG, Nurse P: Complementation used to clone a human
homologue of the fission yeast cell cycle control gene cdc2.
Nature 1987, 327:31-35.
Elledge SJ, Spottswood MR: A new human p34 protein kinase,
CDK2, identified by complementation of a cdc28 mutation
in Saccharomyces cerevisiae, is a homolog of Xenopus Eg1.
EMBO J 1991, 10:2653-2659.
Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal
M, Cameron S, Martin DM, Ausiello G, Brannetti B, Costantini , et al.:
ELM server: a new resource for investigating short functional
sites in modular eukaryotic proteins. Nucleic Acids Res 2003,
Esashi F, Christ N, Gannon J, Liu Y, Hunt T, Jasin M, West SC: CDK-
dependent phosphorylation of BRCA2 as a regulatory mech-
anism for recombinational repair. Nature 2005, 434:598-604.
Favreau C, Worman HJ, Wozniak RW, Frappier T, Courvalin JC: Cell
cycle-dependent phosphorylation of nucleoporins and
nuclear pore membrane protein Gp210. Biochemistry 1996,
Stukenberg PT, Lustig KD, McGarry TJ, King RW, Kuang J, Kirschner
MW: Systematic identification of mitotic phosphoproteins.
Curr Biol 1997, 7:338-348.
D'Angiolella V, Mari C, Nocera D, Rametti L, Grieco D: The spindle
checkpoint requires cyclin-dependent kinase activity. Genes
Dev 2003, 17:2520-2525.
Rankin S, Ayad NG, Kirschner MW: Sororin, a substrate of the
anaphase-promoting complex, is required for sister chroma-
tid cohesion in vertebrates. Mol Cell 2005, 18:185-200.
Stewart S, Fang G: Anaphase-promoting complex/cyclosome
controls the stability of TPX2 during mitotic exit. Mol Cell Biol
Mailand N, Diffley JF: CDKs promote DNA replication origin
licensing in human cells by protecting Cdc6 from APC/C-
dependent proteolysis. Cell 2005, 122:915-926.
Loog M, Morgan DO: Cyclin specificity in the phosphorylation
of cyclin-dependent kinase substrates. Nature 2005,
Ochoa-Espinosa A, Small S: Developmental mechanisms and cis-
regulatory codes. Curr Opin Genet Dev 2006, 16:165-170.
Small S, Blair A, Levine M: Regulation of even-skipped stripe 2 in
the Drosophila embryo. EMBO J 1992, 11:4047-4057.
Levine M, Davidson EH: Gene regulatory networks for
development. Proc Natl Acad Sci USA 2005, 102:4936-4942.
Markstein M, Levine M: Decoding cis-regulatory DNAs in the
Drosophila genome. Curr Opin Genet Dev 2002, 12:601-606.
Luo KQ, Elsasser S, Chang DC, Campbell JL: Regulation of the
localization and stability of Cdc6 in living yeast cells. Biochem
Biophys Res Commun 2003, 306:851-859.
Jans DA, Hubner S: Regulation of protein transport to the
nucleus: central role of phosphorylation. Physiol Rev 1996,
Jans DA, Ackermann MJ, Bischoff JR, Beach DH, Peters R: p34cdc2-
mediated phosphorylation at T124 inhibits nuclear import of
SV-40 T antigen proteins. J Cell Biol 1991, 115:1203-1212.
Moll T, Tebb G, Surana U, Robitsch H, Nasmyth K: The role of
phosphorylation and the CDC28 protein kinase in cell cycle-
regulated nuclear import of the S. cerevisiae transcription
factor SWI5. Cell 1991, 66:743-758.
Wilmes GM, Archambault V, Austin RJ, Jacobson MD, Bell SP, Cross
FR: Interaction of the S-phase cyclin Clb5 with an 'RXL' dock-
ing sequence in the initiator protein Orc6 provides an origin-
localized replication control switch. Genes Dev 2004,
Lenz P, Swain PS: An entropic mechanism to generate highly
cooperative and specific
phosphorylations. Curr Biol 2006, 16:2150-2155.
Oehlen LJ, Cross FR: Potential regulation of Ste20 function by
the Cln1-Cdc28 and Cln2-Cdc28 cyclin-dependent protein
kinases. J Biol Chem 1998, 273:25089-25097.
Costanzo M, Nishikawa JL, Tang X, Millman JS, Schub O, Breitkreuz
K, Dewar D, Rupes I, Andrews B, Tyers M: CDK activity antago-
nizes Whi5, an inhibitor of G1/S transcription in yeast. Cell
D'Amours D, Amon A: At the interface between signaling and
executing anaphase: Cdc14 and the FEAR network. Genes Dev
Peng J, Gygi SP: Proteomics: the move to mixtures. J Mass
Spectrom 2001, 36:1083-1091.
Budovskaya YV, Stephan JS, Deminoff SJ, Herman PK: An evolution-
ary proteomics approach identifies substrates of the cAMP-
dependent protein kinase. Proc Natl Acad Sci USA 2005,
Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y,
Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharo-
myces Genome Database. Nucleic Acids Res 1998, 26:73-80.
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T,
Cuff J, Curwen V, Down T: The Ensembl genome database
project. Nucleic Acids Res 2002, 30:38-41.
Mardia KV, Kent JT, Bibby JM: Multivariate Analysis London: Academic
Kyte J, Doolittle RF: A simple method for displaying the hydro-
pathic character of a protein. J Mol Biol 1982, 157:105-132.
Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: proteome-wide
prediction of cell signaling interactions using short sequence
motifs. Nucleic Acids Res 2003, 31:3635-3641.
Alan Moses' Research [http://www.sanger.ac.uk/~am8/]
Ihaka R, Gentleman R: R: a language for data analysis and
graphics. J Comput Graph Stat 1996, 5:299-314.