NATURE BIOTECHNOLOGY VOLUME 23 NUMBER 11 NOVEMBER 2005
An iterative statistical approach to the
identification of protein phosphorylation
motifs from large-scale data sets
Daniel Schwartz & Steven P Gygi
With the recent exponential increase in protein phosphorylation
sites identified by mass spectrometry, a unique opportunity
has arisen to understand the motifs surrounding such sites.
Here we present an algorithm designed to extract motifs
from large data sets of naturally occurring phosphorylation
sites. The methodology relies on the intrinsic alignment of
phospho-residues and the extraction of motifs through iterative
comparison to a dynamic statistical background. Results show
the identification of dozens of novel and known phosphorylation
motifs from recently published serine, threonine and tyrosine
phosphorylation studies. When applied to a linguistic data set
to test the versatility of the approach, the algorithm successfully
extracted hundreds of language motifs. This method, in addition
to shedding light on the consensus sequences of identified
and as yet unidentified kinases and modular protein domains,
may also eventually be used as a tool to determine potential
phosphorylation sites in proteins of interest.
As research in molecular biology moves forward, it has become increasingly
clear that few cellular processes are unaffected by protein phosphorylation.
Protein degradation, localization and conformation as well
as protein/protein interactions are only some of the functions in which
protein phosphorylation has been implicated1,2. Furthermore, protein
phosphorylation levels are central to our current understanding of cell
division and signal transduction pathways in both normal and diseased
cell states3. Yet, relatively little is known about the majority of protein
kinases in the human proteome. Only approximately one-tenth of the
estimated 500–600 human protein serine, threonine and tyrosine kinases
have known consensus sequences for their sites of phosphorylation4.
Even when consensus sequences are known, in vivo protein substrates
are often lacking.
To date, the task of understanding kinase recognition sequences has
progressed mainly by a ‘kinase-driven’ approach whereby a kinase of
interest is incubated with a combinatorial peptide library and ATP.
Edman degradation of the phosphorylated peptides, which have been
enriched using a ferric column, leads to the creation of a position-weight
matrix of the data and hence the consensus sequence5. Though the
kinase-driven approach has had much success in identifying optimal
kinase consensus sequences and substrates, it has suffered from the fact
that optimal in vitro binding is often kinetically unfavorable in the cel-
lular environment, thus leading to motifs that are rarely found in the
Here we present an attempt to start with known biologically phosphor-
ylated substrates from unknown kinases and discover motifs through
a ‘substrate-driven’ approach. In the past, the low number of localized
phosphorylation sites cited in the literature made substrate-driven
approaches to determining kinase consensus motifs difficult. However,
refinements of several affinity-based strategies such as immunoaffinity6,
immobilized metal affinity chromatography (IMAC)7 and strong cation
exchange (SCX) chromatography8, coupled with the enabling technology
of tandem mass spectrometry, have more than doubled the number
of phosphorylation sites identified in the past year alone, with several
studies reporting from several hundred to several thousand sites6,8–13.
Two of these recently published large-scale mass spectrometry stud-
ies were chosen as test sets for our motif-building algorithm. The first
study used SCX for the enrichment of phosphopeptides from HeLa
cell nuclei, resulting in the elucidation of 1,594 unique phosphoserine
and 195 unique phosphothreonine sites8. The second study used an
antiphosphotyrosine antibody to enrich for phosphorylated tyrosine
residues in pervanadate-treated Jurkat cells (151 sites), cells express-
ing constitutively active NPM-ALK fusion kinase (237 sites) and cells
expressing constitutively active Src kinase (185 sites)6.
Overview of the method
A schematic of the motif extraction algorithm is shown in Figure 1. The
method commences with the establishment of two parallel sequence
data sets: the phosphorylated peptide data set from which motifs will
be built, and a peptide data set used for background probability calcula-
tions. Next, the two data sets are converted into position-weight matrices
of equal dimensions whereby each matrix contains information on the
frequency of all residues at the six positions upstream and downstream
of the phosphorylation site. Using the information encoded in these
two matrices, a third matrix, the binomial probability matrix, is created.
Specifically, this matrix contains the probability of observing s or more
occurrences of residue x at position j (taken from the phosphorylation
Department of Cell Biology, 240 Longwood Ave., Harvard Medical School,
Boston, Massachusetts 02115, USA. Correspondence should be addressed to
Published online 4 November 2005; doi:10.1038/nbt1146
© 2005 Nature Publishing Group http://www.nature.com/naturebiotechnology
fractional percentage of residue x at position j in the current background
matrix. The result was calculated using the pbinom function in the Math::CDF
PERL module. The function could not calculate probabilities below
10⁻¹⁶. Since each recursive iteration of the algorithm chose the residue/position
pair with the lowest binomial probability, if more than one pair had
probabilities of 10⁻¹⁶, then the pair with the greater frequency in the data set
matrix was selected.
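
The selection step described above can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation: the function names (count_matrix, binomial_tail, best_pair) are hypothetical, the binomial tail is computed directly in log space rather than with Math::CDF, and the floor of 10⁻¹⁶ is applied explicitly as an assumption that mirrors the limit reported for pbinom. Here n is taken to be the number of sequences in the phosphorylation data set and p the background fraction of the residue at that position.

```python
from collections import Counter
from math import exp, lgamma, log

FLOOR = 1e-16  # assumed floor, mirroring the limit reported for pbinom


def count_matrix(peptides):
    """Per-position residue counts for equal-length, centre-aligned peptides."""
    width = len(peptides[0])
    counts = [Counter() for _ in range(width)]
    for peptide in peptides:
        for j, residue in enumerate(peptide):
            counts[j][residue] += 1
    return counts


def binomial_tail(s, n, p):
    """P(X >= s) for X ~ Binomial(n, p), summed in log space and floored."""
    if s <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return FLOOR
    total = 0.0
    for k in range(s, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1.0 - p))
        total += exp(log_pmf)
    return min(1.0, max(total, FLOOR))


def best_pair(foreground, background):
    """Return (residue, position index, probability, count) of the most
    significant residue/position pair; ties go to the higher count."""
    fg = count_matrix(foreground)
    bg = count_matrix(background)
    n, m = len(foreground), len(background)
    centre = len(foreground[0]) // 2  # the aligned central residue is skipped
    candidates = []
    for j, column in enumerate(fg):
        if j == centre:
            continue
        for residue, s in column.items():
            p = bg[j][residue] / m  # background fraction of residue at j
            candidates.append((binomial_tail(s, n, p), -s, j, residue))
    prob, neg_s, j, residue = min(candidates)
    return residue, j, prob, -neg_s
```

Sorting on the tuple (probability, negative count) breaks every tie in favour of the more frequent pair, which includes the case in which several pairs sit at the floor.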
Motif scores. Despite the statistical significance of every motif extracted, heuristic
scores for the motifs were calculated as the sum of the negative log of the binomial
probabilities used to generate the motifs (equation (2) below),
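
As a small illustration of this score, the sketch below sums the negative logarithms of the binomial probabilities chosen at each step of building one motif. The function name motif_score is hypothetical, and the use of base-10 logarithms is an assumption, since the text does not state the base.

```python
from math import log10


def motif_score(step_probabilities):
    """Sum of -log10(p) over the binomial probabilities that built a motif.

    Base 10 is an assumption; the text only says "negative log".
    """
    return sum(-log10(p) for p in step_probabilities)
```

Under this assumption, a motif built from two steps with probabilities 10⁻¹⁶ and 10⁻⁶ would score approximately 22.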
Linguistic analysis. Using the analytical framework previously created by
Bussemaker et al.14, text from the first ten chapters of Moby Dick by Herman
Melville15 with random characters inserted between words was retrieved from
http://www.physics.rockefeller.edu/siggia/projects/mobydick/. By taking a sliding
13-character window, the text was then transformed into a matrix of all 13-character
strings, thus constituting the background data set. From this background
data set, 26 subsets were created, each being centered on a different letter of the
alphabet. Using the background data set and each of the subsets, the motif-building
methodology (with P < 10⁻⁶, and occurrences ≥ 10) was carried out 26 times,
thus yielding motifs centered on every letter of the alphabet (Supplementary
Tables 1–4 online).
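
The construction of the background and the letter-centered subsets can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the sketch assumes the text has already been lowercased and contains no whitespace.

```python
from collections import defaultdict
from string import ascii_lowercase

WINDOW = 13  # six characters on either side of the central character


def sliding_windows(text, width=WINDOW):
    """All contiguous substrings of the given width (the background set)."""
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def subsets_by_centre(windows):
    """Group windows by their central character, keeping only a-z centres."""
    groups = defaultdict(list)
    for window in windows:
        centre = window[len(window) // 2]
        if centre in ascii_lowercase:
            groups[centre].append(window)
    return dict(groups)
```

Each subset then plays the role of the phosphorylation data set, with the full list of windows serving as the background.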
Comparison to other algorithms. To compare our algorithm to other motif
discovery tools, we input our synthetically generated list of 300 proteins and the
manually curated phosphorylation list containing 298 13-mers to four websites:
Pratt at http://www.ebi.ac.uk/pratt/ with parameters C% = 2%, PL = 13, PN =
50, PX = 5, FN = 2, and FL = 1; TEIRESIAS at http://cbcsrv.watson.ibm.com/Tspd.html
with option ‘exact discovery’ and parameters L = 2 or 3, W = 5 and
K = 2; eMOTIF at http://motif.stanford.edu/emotif/emotif-maker.html with
a 10% match threshold; Gibbs motif sampler at
http://bayesweb.wadsworth.org/cgi-bin/gibbs.9.pl?data_type=protein with number of patterns = 5, max
sites per sequences = 1, motif width = 5, estimated total sites = 40.
Public access to algorithm. Access to the algorithm will be available through a
website currently under construction at http://motif-X.med.harvard.edu/, which
will allow users to input their sequence data and adjust the various algorithm
parameters to retrieve motif results.
Programming and sequence logos. All programming and analysis was done using
the PERL programming language on a Linux workstation (2.2 GHz microproces-
sor with 1.5 GB RAM). Sequence logos were generated online using Weblogo21
Note: Supplementary information is available on the Nature Biotechnology website.
The authors thank John Rush and Cell Signaling Technology for providing access to
the tyrosine phosphorylation data sets prior to their publication. Additionally, D.S.
wishes to thank Michael Chou for assistance with the Moby Dick analysis as well as
numerous stimulating conversations regarding the algorithm and critical reading of
the manuscript. This work was supported in part by National Institutes of Health
grant HG03456 (S.P.G.).
COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.
Published online at http://www.nature.com/naturebiotechnology/
Reprints and permissions information is available online at http://npg.nature.com/
1. Schlessinger, J. & Lemmon, M.A. SH2 and PTB domains in tyrosine kinase signaling.
Sci. STKE 2003, RE12 (2003).
2. Ang, X.L. & Wade Harper, J. SCF-mediated protein degradation and cell cycle control.
Oncogene 24, 2860–2870 (2005).
3. Pawson, T. & Scott, J.D. Protein phosphorylation in signaling—50 years and counting.
Trends Biochem. Sci. 30, 286–290 (2005).
4. Obenauer, J.C., Cantley, L.C. & Yaffe, M.B. Scansite 2.0: Proteome-wide prediction of cell
signaling interactions using short sequence motifs. Nucleic Acids Res. 31, 3635–3641
5. Manning, B.D. & Cantley, L.C. Hitting the target: emerging technologies in the search
for kinase substrates. Sci. STKE 2002, PE49 (2002).
6. Rush, J. et al. Immunoaffinity profiling of tyrosine phosphorylation in cancer cells. Nat.
Biotechnol. 23, 94–101 (2005).
7. Ficarro, S.B. et al. Phosphoproteome analysis by mass spectrometry and its application
to Saccharomyces cerevisiae. Nat. Biotechnol. 20, 301–305 (2002).
8. Beausoleil, S.A. et al. Large-scale characterization of HeLa cell nuclear phosphoproteins.
Proc. Natl. Acad. Sci. USA 101, 12130–12135 (2004).
9. Collins, M.O. et al. Proteomic analysis of in vivo phosphorylated synaptic proteins.
J. Biol. Chem. 280, 5972–5982 (2005).
10. Ballif, B.A., Villen, J., Beausoleil, S.A., Schwartz, D. & Gygi, S.P. Phosphoproteomic
analysis of the developing mouse brain. Mol. Cell. Proteomics 3, 1093–1101 (2004).
11. Gruhler, A. et al. Quantitative phosphoproteomics applied to the yeast pheromone signal-
ing pathway. Mol. Cell. Proteomics 4, 310–327 (2005).
12. Nuhse, T.S., Stensballe, A., Jensen, O.N. & Peck, S.C. Phosphoproteomics of the
Arabidopsis plasma membrane and a new phosphorylation site database. Plant Cell 16,
13. Loyet, K.M., Stults, J.T. & Arnott, D. Mass spectrometric contributions to the practice of
phosphorylation site mapping through 2003: a literature review. Mol. Cell. Proteomics
4, 235–245 (2005).
14. Bussemaker, H.J., Li, H. & Siggia, E.D. Building a dictionary for genomes: identification
of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA 97,
15. Melville, H. Moby-Dick, or, The whale (Signet Classic, New York, 1998).
16. Diella, F. et al. Phospho.ELM: a database of experimentally verified phosphorylation sites
in eukaryotic proteins. BMC Bioinformatics 5, 79 (2004).
17. Rigoutsos, I. & Floratos, A. Combinatorial pattern discovery in biological sequences: The
TEIRESIAS algorithm. Bioinformatics 14, 55–67 (1998).
18. Jonassen, I., Collins, J.F. & Higgins, D.G. Finding flexible patterns in unaligned protein
sequences. Protein Sci. 4, 1587–1595 (1995).
19. Thompson, W., Rouchka, E.C. & Lawrence, C.E. Gibbs Recursive Sampler: finding tran-
scription factor binding sites. Nucleic Acids Res. 31, 3580–3585 (2003).
20. Nevill-Manning, C.G., Wu, T.D. & Brutlag, D.L. Highly specific protein sequence motifs
for genome analysis. Proc. Natl. Acad. Sci. USA 95, 5865–5871 (1998).
21. Schneider, T.D. & Stephens, R.M. Sequence logos: a new way to display consensus
sequences. Nucleic Acids Res. 18, 6097–6100 (1990).
22. Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. WebLogo: a sequence logo
generator. Genome Res. 14, 1188–1190 (2004).
23. Boucher, L., Ouzounis, C.A., Enright, A.J. & Blencowe, B.J. A genome-wide survey of RS
domain proteins. RNA 7, 1693–1701 (2001).
24. Fujimoto, J. et al. Characterization of the transforming activity of p80, a hyperphosphory-
lated protein in a Ki-1 lymphoma cell line with chromosomal translocation t(2;5). Proc.
Natl. Acad. Sci. USA 93, 4181–4186 (1996).
25. Iuchi, S. Three classes of C2H2 zinc finger proteins. Cell. Mol. Life Sci. 58, 625–635
26. Songyang, Z. & Cantley, L.C. Recognition and specificity in protein tyrosine kinase-medi-
ated signalling. Trends Biochem. Sci. 20, 470–475 (1995).
27. Branch, D.R. & Mills, G.B. pp60c-src expression is induced by activation of normal
human T lymphocytes. J. Immunol. 154, 3678–3685 (1995).
28. Shin, N.Y. et al. Subsets of the major tyrosine phosphorylation sites in Crk-associated sub-
strate (CAS) are sufficient to promote cell migration. J. Biol. Chem. 279, 38331–38337
29. Yates, J.R. III, Eng, J.K. & McCormack, A.L. Mining genomes: correlating tandem mass
spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal.
Chem. 67, 3202–3210 (1995).
\[ \mathrm{Score}(\mathrm{motif}) = \sum -\log\left(P_{\mathrm{binomial}}\right) \qquad (2) \]