BioMed Central
BMC Bioinformatics
Open Access
Research article
Combining specificity determining and conserved residues
improves functional site prediction
Olga V Kalinina*1,2, Mikhail S Gelfand†1 and Robert B Russell†2
Address: 1EMBL Heidelberg, Meyerhofstrasse 1, 69117 Heidelberg, Germany and 2Institute for Information Transmission Problems RAS, Bolshoi
Karenty pereulok 19, Moscow, 127994, Russia
Email: Olga V Kalinina* - kalinina@embl.de; Mikhail S Gelfand - gelfand@iitp.ru; Robert B Russell - russell@embl.de
* Corresponding author †Equal contributors
Abstract
Background: Predicting the location of functionally important sites from protein sequence and/
or structure is a long-standing problem in computational biology. Most current approaches make
use of sequence conservation, assuming that amino acid residues conserved within a protein family
are most likely to be functionally important. Most often these approaches do not consider many
residues that act to define specific sub-functions within a family, or they make no distinction
between residues important for function and those more relevant for maintaining structure (e.g. in
the hydrophobic core). Many protein families bind and/or act on a variety of ligands, meaning that
conserved residues often only bind a common ligand sub-structure or perform general catalytic
activities.
Results: Here we present a novel method, SDPsite, for functional site prediction based on
identification of conserved positions, as well as those responsible for determining ligand
specificity. We define specificity-determining positions (SDPs) as those occupied by residues
that are conserved within sub-groups of proteins in a family having a common specificity but
that differ between groups, and are thus likely to account for specific recognition events. We
benchmark the approach on enzyme families of known 3D structure with bound substrates, and find
that in nearly all families residues predicted by SDPsite are in contact with the bound
substrate, and that the addition of SDPs significantly improves functional site prediction
accuracy. We apply SDPsite to various families of proteins containing known three-dimensional
structures, but lacking clear functional annotations, and discuss several illustrative examples.
Conclusion: The results suggest a better means to predict functional details for the thousands of
protein structures determined prior to a clear understanding of molecular function.
Published: 9 June 2009
BMC Bioinformatics 2009, 10:174 doi:10.1186/1471-2105-10-174
Received: 23 January 2009
Accepted: 9 June 2009
This article is available from: http://www.biomedcentral.com/1471-2105/10/174
© 2009 Kalinina et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Structural genomics and the increased pace of structure
determination by X-ray and NMR approaches make
methods to predict protein function from 3D structure of
continuing importance. Proteins of known structure and
unknown function are normally subjected to a battery of
comparisons to find proteins adopting similar folds
(DALI [1], SSAP [2], CE [3] and others) or containing
recurring active-site residue constellations (SPASM [4],
PINTS [5], Catalytic Site Atlas [6]). Proteins of similar
structure can provide functional hints, since proteins very
often share structural and functional similarities in the
absence of sequence similarity. Similarities restricted to
the active site (i.e. in the absence of overall fold similarity)
are typically less revealing, but can sometimes suggest
the presence of convergently evolved catalytic machinery
(e.g. the peptidase catalytic triad [7]) or binding sites
for particular metals or ligands [8].
When comparative approaches fail to identify a clear sim-
ilarity, or if such similarities are ambiguous – for instance,
suggesting a possible weak functional similarity requiring
confirmation – additional functional hints can come from
analyses of the protein structure, similar structures, and
what is typically a large collection of homologous
sequences from the set of genomes now available. There
are now several methods that exploit sequence conservation
to identify putative functional sites in proteins,
including ConSurf [9], approaches based on identification
of 3D clusters of conserved residues [10], the evolutionary
trace (ET) approach [11,12], correlated mutations
[13], prediction of 3D motifs correlated with function
[14], the Jensen-Shannon entropy approach [15], and
algorithms based on contrasting global and local similarity
matrices that interpret locality in terms of sequence [16]
or structure [17]. The approaches differ in design, but
share a unifying theme of using conserved amino acids,
together with structural constraints such as location on
the protein surface, as indicators of likely functional
importance.
Some of these techniques place special emphasis on
using a protein structure for prediction. A number
of methods identify interaction hot spots on different
kinds of interaction interfaces [18-20]. Other methods
concentrate on predicting pockets in protein structures, as
these are possible ligand-binding sites [21-23], sometimes
supplementing them with annotation derived from other
available sources [24].
In contrast to those, several approaches, including our
own [25], have attempted to exploit protein sequence
alignments to determine those residues likely to confer
specificity for a particular sub-function in a protein family
[25-41]. Although they differ in algorithmic details, all
approaches aim to use the statistics of a multiple sequence
alignment to identify positions that correlate well with
sub-families that account for a certain specific function.
Sub-families are either explicitly given in advance (for
instance taken from gene or protein annotation) or are
predicted by the algorithm.
Here, we extend our previously derived approach for iden-
tifying specificity determining residues [25] to the prob-
lem of predicting protein functional sites (SDPsite). We
combine routine predictions of conserved residues with
those for specificity determinants, and use structural
information to identify spatial clusters of the predicted
important residues. SDPsite differs from structure-based
methods in that the major part of the prediction is derived
from the protein sequences, so in theory the method can
also be applied in the absence of structural information.
The structure-based step filters out some of the predicted
positions, leaving only the most reliable predictions,
which can be useful in the design of experimental studies.
To test our method on a large scale, we developed two
benchmark datasets of diverse enzyme families, using the
Enzyme Classification (EC) system. Enzymes are the sim-
plest class of proteins to benchmark as their functional
annotations are well specified in databases. However, they
are not necessarily representative of other protein func-
tions that are less discretely characterized by precise cata-
lytic machinery (e.g. protein recognition modules, etc.).
In the absence of a reliable source of functional annotations
for non-enzymes, we previously tested the presented
approach on two examples for which reasonable experimental
data are available: the LacI family of bacterial
transcription factors and subtilisin-like proteases [42]. On
these examples, SDPsite results have a sensitivity between
0.06 and 0.17, specificity between 0.43 and 0.75 and false
positive rate between 0.007 and 0.05. Thus, in these
anecdotal cases, SDPsite seems to miss many truly functional
amino acids, but the predictions it does make are reliable.
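As a minimal sketch of how two of these figures can be obtained, the snippet below computes sensitivity and false positive rate from a set of predicted positions and a set of experimentally known functional positions. It assumes the standard confusion-matrix definitions, which this section does not spell out; the function name and the toy positions are illustrative and are not taken from the paper.

```python
def site_prediction_metrics(predicted, functional, n_positions):
    """Return (sensitivity, false positive rate) for one protein.

    predicted   -- iterable of predicted residue positions
    functional  -- iterable of positions known to be functional
    n_positions -- total number of positions considered in the protein
    """
    predicted, functional = set(predicted), set(functional)
    tp = len(predicted & functional)   # predicted and truly functional
    fp = len(predicted - functional)   # predicted but not functional
    fn = len(functional - predicted)   # functional but missed
    tn = n_positions - tp - fp - fn    # correctly left unpredicted
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return sensitivity, false_positive_rate


# Toy example with made-up positions (not data from the paper).
sens, fpr = site_prediction_metrics(
    predicted={3, 7, 10},
    functional={3, 4, 5, 7, 8, 9, 11, 12, 13, 14},
    n_positions=100,
)
print(f"sensitivity = {sens:.2f}, false positive rate = {fpr:.3f}")
```

A predictor of this kind returns few positions, so a low sensitivity can coexist with a very low false positive rate, which is the pattern reported above.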
We also predicted the functional sites in 124 unannotated
structures derived from Structural Genomics efforts. For
the benchmark datasets, the success rate of our method
(SDPsite) was 96–100%. We were then able to make con-
fident functional site predictions for ~76% of a set of fam-
ilies lacking functional annotation.
Results
1. Testing SDPsite on a benchmark set of enzymes with
bound cognate ligands
We previously tested SDPsite on a number of protein fam-
ilies with known sites and compared the performance
with several other approaches [42]. The results encour-
aged us to predict functionally important sites in poorly
characterized protein families. Structural genomics
projects now provide up to 20% of the annual growth of the
Protein Data Bank (PDB) and greater coverage than ever
before of the space of protein structures [43]. This has led
to the current situation where hundreds of protein fami-
lies include a protein with a known 3D structure, but no
available functional information. These families are a per-
fect target for functional site prediction methods that use
both sequence and structural information.
However, before applying our approach to families lack-
ing functional information, we needed to benchmark the
approach on a set of protein families that are well-charac-
terised in terms of function. To do this we considered
enzyme protein families from the Pfam database, for
which there is a functional characterization scheme in the
Enzyme Classification (EC) number system. This classifi-
cation consists of four numbers denoting a hierarchical
system that delineates enzyme function. We focused on
families containing EC numbers differing in the last
number, which normally accounts for the substrate specif-
icity. Since protein families generally correspond to a sin-
gle functional or structural domain, complications can
arise for multi-domain proteins that correspond to a sin-
gle EC number. We inspected the families manually to
ensure that the catalytic operation for each EC number did
indeed correspond to the domain considered. Thus each
considered protein domain corresponds to a single EC
number, catalyzes only one reaction (or one class of
reactions), and presumably has only one active site.
To assess performance of SDPsite on these Pfam enzyme
families, we generated two benchmark datasets. The first
consisted of families containing proteins with at least two
EC numbers differing in the last position; the second con-
sisted of families containing only a single EC number. We
refer to these datasets as diverse and homogeneous in the
sections that follow. The rationale is that when one is
predicting function and/or specificity, one does not know in
advance whether or not there are multiple specificities in
the family. The two datasets mimic these two situations.
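The assignment of a family to one of the two datasets can be sketched as below. This is an illustrative reading of the criterion stated above (at least two EC numbers differing in the last position versus a single EC number); the function name and the "other" category are assumptions, and the manual filtering steps described in the Methods are not reproduced.

```python
from collections import defaultdict


def classify_family(ec_numbers):
    """Classify a family as 'diverse', 'homogeneous' or 'other'.

    ec_numbers -- iterable of EC strings such as "2.3.1.9"
    """
    distinct = {tuple(ec.split(".")) for ec in ec_numbers}
    if len(distinct) == 1:
        return "homogeneous"          # a single EC number in the family
    # Group complete EC numbers by their first three fields and collect
    # the distinct values seen in the fourth (substrate-level) field.
    last_fields = defaultdict(set)
    for ec in distinct:
        if len(ec) == 4:
            last_fields[ec[:3]].add(ec[3])
    if any(len(fields) >= 2 for fields in last_fields.values()):
        return "diverse"              # two ECs differ only in the last field
    return "other"                    # e.g. ECs differ at a higher level only


# EC numbers listed for Thiolase_N (PF00108) in Table 1.
print(classify_family(["2.3.1.9", "2.3.1.16", "2.3.1.176"]))
```

Families whose EC numbers differ only in the first three fields fall into neither dataset under this reading, which is why the sketch returns a third label for them.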
For all families we computed both specificity determining
positions (SDPs) and conserved positions (CPs), then
mapped them onto a 3D structure of one of the proteins
of the family and extracted a portion of the two sets that
forms a compact spatial cluster, as described previously
[42] and in the Methods section. We designed several dis-
tance measures to assess the quality of the predictions.
These were: 1) the minimal distance from the residues of
each of the predicted sets (SDPs, CPs, best cluster) to a
bound ligand; 2) the average distance to the ligand; 3) the
diameter of the set; and 4) the average distance between
residues of the set. We performed a Mann-Whitney test to
assess the statistical significance of the best derived clus-
ter, i.e. we tested if the set of the amino acid residues in the
best cluster is significantly closer to the ligand than all res-
idues in the protein. These data are given in Additional file
1.
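The four distance measures can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each residue is represented by a single coordinate (for example a C-alpha atom or a centroid) and that the residue-to-ligand distance is the smallest distance to any ligand atom. The exact atom-level definition is given in the Methods and may differ.

```python
from itertools import combinations
from math import dist


def cluster_measures(residues, ligand_atoms):
    """Compute the four measures for one predicted set of residues.

    residues     -- list of (x, y, z) coordinates, one per residue (assumed)
    ligand_atoms -- list of (x, y, z) coordinates of the bound ligand
    """
    # Distance of each residue to the nearest ligand atom.
    to_ligand = [min(dist(r, a) for a in ligand_atoms) for r in residues]
    # All residue-residue distances within the set.
    pairwise = [dist(a, b) for a, b in combinations(residues, 2)]
    return {
        "min_to_ligand": min(to_ligand),
        "avg_to_ligand": sum(to_ligand) / len(to_ligand),
        "diameter": max(pairwise, default=0.0),
        "avg_pairwise": sum(pairwise) / len(pairwise) if pairwise else 0.0,
    }


# Toy coordinates in Å (made up for illustration).
measures = cluster_measures(
    residues=[(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)],
    ligand_atoms=[(0.0, 0.0, 2.0)],
)
print(measures)
```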
We considered predictions to be successful if minimal dis-
tances were smaller than 5 Å and average distances smaller
than 10 Å. We selected these thresholds based on inspec-
tion of known binding sites, and found that they capture
characteristics of typical binding sites, which are normally
15–20 Å in diameter and typically some of the amino
acids of the cluster contact the ligand directly. A small
minimal distance combined with a large average distance
means that the cluster is too sparse and does not define the
active site well enough, although part of it is close to the
ligand and might be functional. Generally there is no
correlation between either the diameter or the average
distance within a predicted set of residues and the set's
proximity to the active site.
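The success criteria stated above translate into a simple decision rule. The sketch below uses the 5 Å and 10 Å thresholds from the text; the function name and the label "sparse" for the intermediate case are illustrative choices, not terminology from the paper.

```python
MIN_DISTANCE_CUTOFF = 5.0    # Å, threshold on the minimal distance
AVG_DISTANCE_CUTOFF = 10.0   # Å, threshold on the average distance


def assess_cluster(min_distance, avg_distance):
    """Label a predicted cluster from its distances to the bound ligand."""
    if min_distance >= MIN_DISTANCE_CUTOFF:
        return "unsuccessful"    # no residue of the cluster is near the ligand
    if avg_distance >= AVG_DISTANCE_CUTOFF:
        return "sparse"          # partly near the ligand, but too spread out
    return "successful"          # both thresholds satisfied


# Dataset-wide averages for the diverse dataset, λ = 0.5 (Table 2).
print(assess_cluster(4.22, 8.89))
```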
As might be expected, predicted SDPs tend to be more
sparsely distributed in the structure than the more
compactly distributed CPs. The best clusters are the
tightest, as expected from the way they are constructed,
although their minimal distances suggest that they are
sometimes further from the active site (even when the
average distance is similar to that of the CPs). We discuss
these observations in more detail below.
a. The diverse dataset: protein families with at least two distinct EC
numbers
Application of all the filters described in the Methods
section yielded 26 Pfam families (Table 1). SDPsite was
applied in different ways, either ignoring SDPs, thus
mimicking the standard conservation-based approaches, or
including them when constructing the best cluster. When
including SDPs, we gave them either twice the weight of
the CPs (λ = 0.5, λ being the relative weight of a CP to an
SDP) or the same weight (λ = 1) (Fig. 1). For details on the
choice of the λ parameter, see Methods. In all but one of
the considered families (Carboxylesterase, see below) at
least one predictor performs well, although in the
Asparaginase_2 family the average distance is slightly
higher than 10 Å. This means that the best clusters are
located in enzyme catalytic sites, and some of the residues
are in direct contact with the ligand. This result is
significant (p < 0.01) for all but four families, one of which
is the Carboxylesterase family; for the other three the best
cluster contains positions accounting for intersubunit
contacts. The resulting implications for quaternary structure
are discussed below.
Average values for all the above measures are given in
Tables 2 and 3. Note that SDPs contribute to identification
of the active site, although they lead to prediction of a
more dispersed cluster. When no SDPs are predicted (last
column), the average distance is smaller, because CPs
form a more compact cluster in the active pocket. Unlike
other methods that attempt to predict functional sites
solely using the conservation of surface residues, SDPsite
predicts a number of additional positions of potential
importance. In 11 out of 26 families the best cluster is sig-
nificantly (p-value < 0.01 in a Mann-Whitney test) closer
to the ligand in both λ = 0.5 and λ = 1 scenarios, and in
20 out of 26 in at least one of them, whereas the CPs-only
based prediction succeeded in only 15 families (Table 4).
This indicates that the inclusion of SDPs in the prediction
often leads to a significant improvement.
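The significance test used here can be illustrated with a small, self-contained one-sided Mann-Whitney test. The sketch below is an assumption-laden simplification: it uses the normal approximation without tie or continuity corrections, so it only approximates what a statistics package would report, and the distances are made up.

```python
from math import erf, sqrt


def mann_whitney_less(x, y):
    """One-sided Mann-Whitney test that values in x tend to be smaller than in y.

    Returns (U, p), where U counts pairs with x_i > y_j (ties count 0.5)
    and p comes from the normal approximation (no tie correction).
    """
    n1, n2 = len(x), len(y)
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = 0.5 * (1 + erf(z / sqrt(2)))   # lower tail of the standard normal
    return u, p


# Toy distances to the ligand in Å: best cluster versus all residues.
cluster_distances = [1.0, 2.0, 3.0]
all_residue_distances = [float(d) for d in range(10, 30)]
u_stat, p_value = mann_whitney_less(cluster_distances, all_residue_distances)
print(u_stat, p_value < 0.01)
```

A small p-value indicates that residues of the best cluster lie closer to the ligand than residues of the protein as a whole, which is the sense in which a prediction is called significant in this section.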
Table 1: Statistics of the benchmark datasets: diverse dataset, two or more EC numbers per family
Family ID | Family name | # sequences | Alignment length | ECs | PDB | Bound ligand equivalent to natural substrate/product
PF00108 | Thiolase_N | 22 | 291 | 2.3.1.9, 2.3.1.16, 2.3.1.176 | 1NL7 | Coenzyme A
PF00128 | Alpha-amylase | 54 | 673 | 2.4.1.4, 2.4.1.7, 3.2.1.10, 3.2.1.20, 3.2.1.70, 3.2.1.98, 3.2.1.93, 3.2.1.141, 5.4.99.16, 5.4.99.15 | 2D3N | Glucose
PF00135 | COesterase | 129 | 889 | 3.1.1.1, 3.1.1.3, 3.1.1.7, 3.1.1.8, 3.1.1.13, 3.1.1.59 | 1P0M | Choline ion
PF00215 | OMPdecase | 92 | 402 | 4.1.1.23, 4.1.1.85 | 2CZE | Uridine-5'-monophosphate
PF00278 | Orn_DAP_Arg_deC | 55 | 220 | 4.1.1.17, 4.1.1.18, 4.1.1.19, 4.1.1.20 | 1TWI | Lysine
PF00293 | NUDIX | 205 | 314 | 2.7.7.1, 3.6.1.13, 3.6.1.17, 3.6.1.52, 3.6.1.52, 5.3.3.2 | 2DSC | Adenosine-5-diphosphoribose
PF00348 | Polyprenyl_synt | 16 | 289 | 2.5.1.10, 2.5.1.29 | 2F8Z | Zoledronic acid, 3-methylbut-3-enyl trihydrogen diphosphate
PF00351 | Biopterin_H | 6 | 332 | 1.14.16.1, 1.14.16.2, 1.14.16.4 | 1MMK | 5,6,7,8-tetrahydrobiopterin, beta(2-thienyl)alanine
PF00579 | tRNA-synt_1b | 41 | 402 | 6.1.1.1, 6.1.1.2 | 1WQ4 | Tyrosine
PF00583 | Acetyltransf_1 | 244 | 150 | 2.3.1.1, 2.3.1.4, 2.3.1.48, 2.3.1.57, 2.3.1.59, 2.3.1.82, 2.3.1.87, 2.3.1.88, 2.3.1.128 | 1TIQ | Coenzyme A
PF00590 | TP_methylase | 22 | 247 | 2.1.1.98, 2.1.1.107, 2.1.1.130, 2.1.1.131, 2.1.1.132, 2.1.1.133, 2.1.1.152, 2.1.1.151, 4.2.1.75, 4.99.1.4 | 1S4D | S-adenosyl-L-homocysteine
PF00755 | Carn_acyltransf | 22 | 867 | 2.3.1.6, 2.3.1.7, 2.3.1.21, 2.3.1.137 | 1NDI | Coenzyme A
PF00871 | Acetate_kinase | 12 | 405 | 2.7.2.1, 2.7.2.7, 2.7.2.15 | 1TUY | Adenosine-5'-diphosphate
PF00896 | Mtap_PNP | 13 | 288 | 2.4.2.1, 2.4.2.28 | 1V48 | 9-(5,5-difluoro-5-phosphonopentyl)guanine
PF00962 | A_deaminase | 17 | 475 | 3.5.4.4, 3.5.4.6 | 1NDZ | 1-((1r)-1-(hydroxymethyl)-3-(6-((3-(1-methyl-1h-benzimidazol-2-yl)propanoyl)amino)-1h-indol-1-yl)propyl)-1h-imidazole-4-carboxamide
PF01048 | PNP_UDP_1 | 16 | 276 | 2.4.2.1, 2.4.2.3, 2.4.2.28, 3.2.2.4, 3.2.2.9 | 1PK7 | Adenosine
PF01112 | Asparaginase_2 | 7 | 365 | 3.5.1.1, 3.5.1.26 | 1SEO | Aspartic acid
PF01135 | PCMT | 9 | 232 | 2.1.1.77, 2.1.1.36 | 1R18 | S-adenosyl-L-homocysteine
PF01202 | SKI | 100 | 263 | 2.7.4.3, 2.7.1.12, 2.7.4.14, 2.7.1.71, 4.2.3.4 | 1WE2 | Adenosine-5'-diphosphate
PF01234 | NNMT_PNMT_TEMT | 7 | 289 | 2.1.1.1, 2.1.1.28, 2.1.1.49 | 2AN4 | S-adenosyl-L-homocysteine
PF01467 | CTP_transf_2 | 66 | 302 | 2.7.7.1, 2.7.7.3, 2.7.7.14, 2.7.7.15, 2.7.7.18, 2.7.7.39 | 1N1D | [Cytidine-5'-phosphate] glycerylphosphoric acid ester
PF01712 | dNK | 14 | 174 | 1.6.99.3, 2.7.1.21, 2.7.1.74, 2.7.1.76, 2.7.1.113, 2.7.1.145 | 2A2Z | Uridine-5'-diphosphate, 2'-deoxycytidine
PF02274 | Amidinotransf | 32 | 455 | 2.1.4.1, 3.5.3.6, 3.5.3.18 | 2A9G | Arginine
PF03061 | 4HBT | 153 | 102 | 3.1.2.2, 3.1.2.23 | 1LO7 | 2-oxyglutaric acid, 2-aminoethanesulfonic acid
PF03171 | 2OG-FeII_Oxy | 147 | 183 | 1.14.11.2, 1.14.11.4, 1.14.11.7, 1.14.11.9, 1.14.11.11, 1.14.11.13, 1.14.11.19, 1.14.11.20, 1.14.11.23, 1.14.11.26, 1.14.17.4, 1.14.20.1, 1.21.3.1 | 2FDJ | 4-hydroxyphenacyl coenzyme A
PF03414 | Glyco_transf_6 | 6 | 341 | 2.4.1.87, 2.4.1.40 | 1LZJ | Succinic acid
For instance, in the Protein-L-isoaspartate(D-aspartate)
O-methyltransferase (PCMT, PF01135) and Thioesterase
(4HBT, PF03061) families, the CP-based cluster is located
far from the active site, whereas addition of SDPs rescues
the prediction, leading to the correct site. In the
C-terminal domain of Pyridoxal-dependent decarboxylases
(Orn_DAP_Arg_deC, PF00278), bacterial transferase
hexapeptide (Hexapep, PF00132) and Asparaginase
(Asparaginase_2, PF01112) families, SDPs rescue the
cluster for λ = 0.5. In contrast, for the Shikimate kinase family
(SKI, PF01202) a heavier reliance on SDPs skews the
prediction, whereas more equal consideration of SDPs and
CPs, or of CPs only, performs considerably better.
The Carboxylesterase (COesterase; PF00135) family is the
only clear failure of the method, i.e. its active site is not
identified by either variant of the method. Even the cata-
lytic triad, Ser198, His438 and Glu197 (numbering from
ChlE_Human), is not among either SDPs or CPs. The fact
that the catalytic residues are not conserved in the align-
ment is puzzling. This could be because the alignment
from Pfam contains many uncharacterized paralogs from
C. elegans and D. melanogaster, which could perform a
different function or be non-functional. Indeed, catalytic
residues are often substituted in these proteins: Ser198 to
alanine, asparagine, glycine or valine, His438 to asparag-
ine, glutamic acid, leucine, lysine, tyrosine or valine, and
Glu197 to asparagine, glutamic acid, glutamine, histidine,
isoleucine, proline, threonine, tryptophan or tyrosine.
Such changes mean that these residues are ignored in the
prediction procedure, and highlight the need for some
caution when building alignments to predict function.
We overlook details of quaternary structure when making
predictions, and this can have interesting consequences,
as discussed previously (e.g. ref. [25]). For instance, for
the Thiolase N-terminal domain (Thiolase_N, PF00108)
family, we found the minimal and the average distance to
be rather large. From the structure of a protein from this
family (biosynthetic thiolase from Z. ramigera, 1NL7), it is
evident that the best cluster is located on the contact inter-
face between two subunits of a dimer. Indeed, the mini-
mal and the average distance to the second subunit are
2.73 Å and 6.90 Å, respectively. The second best cluster is,
however, in the active pocket with the distances below the
threshold. The family of Gcn5-related acetyltransferases
(Acetyltransf_1, PF00583) is a similar case: for λ = 0.5 the
best cluster is located on the subunit contact interface and
the second best in the active site, whereas for λ = 1 the
order is reversed. This highlights the need to consider
quaternary structure explicitly when making and
interpreting predictions using this or similar approaches.
A natural question is how well the predicted grouping of
the sequences agrees with the EC numbers of the proteins
considered. For most families there was a good agree-
ment, with EC numbers segregating into discrete branches
of the trees derived from the alignments. There were two
Figure 1: Assessment of the prediction quality for the diverse dataset. In each plot, the green and the blue bars represent
SDPsite predictions with λ = 0.5 and λ = 1, respectively. Yellow bars represent prediction based solely on conserved positions. (a)
Minimal distance from the best cluster to the bound ligand. (b) Average distance from residues of the best cluster to the bound
ligand. (c) Significance of the average distance.
Table 2: Averages for the best cluster over all families from the datasets (Å)
Measure | Diverse, λ = 0.5 | Diverse, λ = 1 | Diverse, no SDPs | Homogeneous, λ = 0.5 | Homogeneous, λ = 1 | Homogeneous, no SDPs
Minimal distance | 4.22 | 5.37 | 4.56 | 3.81 | 2.95 | 4.84
Average distance | 8.89 | 9.95 | 7.14 | 9.83 | 8.39 | 8.47
Table 3: Sensitivity and false positive rate over all families from the datasets
Measure | Diverse, λ = 0.5 | Diverse, λ = 1 | Diverse, no SDPs | Homogeneous, λ = 0.5 | Homogeneous, λ = 1 | Homogeneous, no SDPs | Combined, λ = 0.5 | Combined, λ = 1 | Combined, no SDPs
Sensitivity | 0.13 | 0.14 | 0.06 | 0.13 | 0.16 | 0.07 | 0.14 | 0.15 | 0.07
False positive rate | 0.03 | 0.02 | 0.008 | 0.05 | 0.04 | 0.01 | 0.04 | 0.03 | 0.01
families where proteins with one EC number would split
between two groups that contain proteins with other EC
numbers (alpha-amylase, PF00128, and polyprenyl syn-
thetase families, PF00348). For both, the same enzymatic
activity seems to evolve independently on two separate
branches of the phylogenetic tree.
b. The homogeneous dataset: protein families with a single EC
number
The 18 families with a single EC number are listed in
Table 5. Again, we applied SDPsite with λ = 0.5, λ = 1 and
without prediction of SDPs (Fig. 2). For all studied
families except Eukaryotic phosphomannomutases (PMM,
PF03332), at least one of these variants places the best
cluster in the active site of the enzyme according to the
described criteria. For 9 out of 18, the best cluster
identified by each of the three procedures is located in the
active site. These results are significant (p < 0.01) for all
families except Adenylylsulphate kinases (discussed below).
The remaining nine families, for which active sites were
not identified by all variants of SDPsite, can be split into
four categories: (1) λ = 0.5 fails (Thymidylate synthases,
Thymidylat_synt, PF00303; thymidine kinases from
Herpesviridae, Herpes_TK, PF00693); (2) both λ = 0.5 and λ
= 1 fail (GTP cyclohydrolases II, GTP_cyclohydro2,
PF00925; Phosphoenolpyruvate carboxykinases,
PEPCK_ATP, PF01293; Adenylylsulphate kinases,
APS_kinase, PF01583); (3) CP-only prediction fails
(Queuine tRNA-ribosyltransferases, TGT, PF01702;
Pyruvate formate lyases, PFL, PF02901); and (4) at least two of
the three fail (oxygenase domain of Nitric oxide
synthases, NO_synthase, PF02898; Eukaryotic
phosphomannomutases, PMM, PF03332).
SDPs are expected to perform worse for this dataset,
because in theory there are no specificity differences
within each family as defined by the EC system. However,
CP-based clusters significantly skew the predictions in two
cases out of eighteen. One explanation of this observation
could be inaccuracy or ambiguity in the assignment of the
EC numbers, which would mean that we have identified
some real differences in specificity. It might also be that
a substitution to an amino acid with similar properties
(e.g. size, polarity, charge) occurred, which can perform
an equivalent enzymatic role. Another possibility is
functional convergence to a single specificity at the
molecular level: SDPs indicate different residues that distant
species or distant paralogs evolved to perform the same
function (e.g. [44]). Alternatively, as discussed below,
SDPs may indicate differences specific to certain
phylogenetic clades.
Similarly to the diverse dataset, there are two families for
which the best cluster is located on the subunit contact
interface: Thymidylate synthases and Adenylylsulphate
kinases. The minimal and average distances to the second
subunit of the dimer are 3.30 Å and 5.73 Å for Thymi-
dylate synthases and 2.65 Å and 6.55 Å for Adenylylsul-
phate kinases. For both, the second best cluster is in the
active site.
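The minimal and average distances quoted above could be computed along the following lines. This is a minimal sketch, not the authors' implementation: it assumes that the distance of a residue to a target (the ligand or the second subunit) is the smallest atom-atom distance, which this section does not specify, and the function names and toy coordinates are hypothetical.

```python
import math

def residue_to_target_distance(residue_atoms, target_atoms):
    """Smallest atom-atom distance between one residue and the target atoms."""
    return min(math.dist(a, b) for a in residue_atoms for b in target_atoms)

def cluster_distances(cluster, target_atoms):
    """Return (minimal, average) distance of a residue cluster to the target.

    cluster: list of residues, each a list of (x, y, z) atom coordinates in Å.
    target_atoms: list of (x, y, z) coordinates of the ligand or second subunit.
    """
    per_residue = [residue_to_target_distance(r, target_atoms) for r in cluster]
    return min(per_residue), sum(per_residue) / len(per_residue)

# Toy coordinates in Å, made up for illustration only.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
cluster = [
    [(0.0, 3.0, 0.0), (0.0, 4.0, 0.0)],  # residue 1: closest atom is 3.0 Å away
    [(1.5, 0.0, 7.0)],                   # residue 2: 7.0 Å away
]
min_d, avg_d = cluster_distances(cluster, ligand)
print(min_d, avg_d)
```

In practice the coordinates would be read from the PDB entry of the family representative; the same function applies whether the target is a bound ligand or the atoms of a neighbouring subunit.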
For two families, all three versions of SDPsite produce
poor results. For the oxygenase domain of Nitric oxide
synthases the average distance of the residues from the
best cluster to the ligand is between 10 and 15 Å for all
three versions, and for Eukaryotic phosphomannomutases
it even exceeds 15 Å in two out of three versions.
However, for both families, the second best cluster is
located in the active site. In the case of the oxygenase
Table 4: Assessment of SDPsite versions, diverse dataset.
Family    SDPsite (λ = 0.5 AND λ = 1)  SDPsite (λ = 0.5 OR λ = 1)  SDPsite (no SDPs)
PF00108   -                            -                           +
PF00128   -                            -                           +
PF00135   -                            -                           -
PF00215   +                            +                           -
PF00278   -                            +                           -
PF00293   +                            +                           +
PF00348   +                            +                           +
PF00351   -                            +                           +
PF00579   +                            +                           +
PF00583   -                            -*                          -
PF00590   +                            +                           -
PF00755   +                            +                           +
PF00871   +                            +                           +
PF00896   -                            +                           +
PF00962   -                            +                           +
PF01048   -                            +                           -
PF01112   -                            +                           -
PF01135   +                            +                           -
PF01202   -                            +                           +
PF01234   -                            +                           +
PF01467   -*                           -                           -*
PF01712   +                            +                           +
PF02274   +                            +                           +
PF03061   -*                           -*                          -
PF03171   +                            +                           -
PF03414   -                            +                           +
Accuracy  0.42 (11/26)                 0.78 (20/26)                0.58 (15/26)
'+': successful prediction (average distance to ligand <10 Å, p-value in Mann-Whitney test against all amino acids of the protein < 0.01), '-*':
moderately successful prediction (average distance to ligand <10 Å, p-value in Mann-Whitney test against all amino acids of the protein < 0.1), '-'
otherwise.
domain of Nitric oxide synthases the alignment contains
only 8 sequences, which split into phylogenetic groups of
insects and vertebrates. These groups are too distant, and
it seems that the method cannot remove the phylogenetic
trace completely, thus producing many SDPs that are
probably not functionally important. A more representa-
tive set of sequences may improve the predictions and
probably place the best cluster in the right position. For
the Eukaryotic phosphomannomutases, the best cluster is
located in the core domain close to the magnesium ion,
which is a part of the active site, but distant from the sub-
strate in the open conformation of the protein.
The means of the minimal and the average distances to
the ligand and the significance of the average distance are
summarized in Table 2. It is not surprising that giving
SDPs and CPs equal weight (λ = 1) leads to better results
than λ = 0.5 for this dataset: since all the proteins have the
same EC number, one might not expect to find anything
accounting for differences in specificity among these pro-
teins. As discussed above, the identified SDPs could,
instead, reflect different ways that different groups of pro-
teins evolved to perform the same function. Again, build-
ing the best cluster from only CPs does not improve the
prediction quality, and all the distances used as the per-
formance measures are larger than for λ = 1. This is
another indication that proteins, even though they have
the same function and specificity, have evolved quite dif-
ferent ways to perform it, and this must be taken into
account when predicting functional sites.
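The success criteria used in the legends of Tables 4 and 6 can be written down compactly. The sketch below only encodes those thresholds (average distance to ligand below 10 Å, Mann-Whitney p-value below 0.01 or 0.1); the function names are hypothetical, and the p-value is assumed to have been computed beforehand by comparing the best-cluster residues against all amino acids of the protein.

```python
def classify_prediction(avg_distance, p_value):
    """Map one prediction to the symbols used in the assessment tables."""
    if avg_distance < 10.0 and p_value < 0.01:
        return "+"    # successful prediction
    if avg_distance < 10.0 and p_value < 0.1:
        return "-*"   # moderately successful prediction
    return "-"        # otherwise

def accuracy(symbols):
    """Fraction of families with a fully successful ('+') prediction."""
    return sum(s == "+" for s in symbols) / len(symbols)

# Made-up (average distance in Å, p-value) pairs for four hypothetical families.
examples = [(6.5, 0.001), (6.5, 0.05), (12.0, 0.001), (8.0, 0.004)]
symbols = [classify_prediction(d, p) for d, p in examples]
print(symbols, accuracy(symbols))
```

Only '+' counts towards the accuracy row of the tables, which is why a '-*' entry is treated as a failure in `accuracy`.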
In contrast to the diverse dataset, the success rate is similar
for different versions of SDPsite (Table 6). In 10 out of 18
families the best cluster identified by taking into account
both CPs and SDPs is significantly closer to the ligand (p-
value < 0.01 in Mann-Whitney test), which indicates that
specificity determinants predicted for homogeneous fam-
ilies, even when they do not illuminate the binding of a
specific ligand, play some other important role in their
function. In this regard, it is important to stress that this
dataset is only homogeneous as defined by EC numbers,
and it is well established that these do not always
uniquely define molecular function [45]. In other words,
Table 5: Statistics of the benchmark datasets: homogeneous dataset, strictly one EC number per family
Family ID  Family name      # sequences  Alignment length  EC          PDB   Bound ligand equivalent to natural substrate/product
PF00303    Thymidylat_synt  19           384               2.1.1.45    2G8O  2'-deoxyuridine 5'-monophosphate, 10-propargyl-5,8-dideazafolic acid
PF00693    Herpes_TK        15           305               2.7.1.21    1VTK  Adenosine-5'-diphosphate, thymidine-5'-phosphate
PF00925    GTP_cyclohydro2  16           193               3.5.4.25    2BZ0  Phosphomethylphosphonic acid guanylate ester
PF01014    Uricase          17           196               1.7.3.3     2FXL  1-(2,5-dioxo-2,5-dihydro-1h-imidazol-4-yl)urea
PF01227    GTP_cyclohydroI  16           107               3.5.4.16    1A8R  Guanosine-5'-triphosphate
PF01293    PEPCK_ATP        12           495               4.1.1.49    1YTM  Adenosine-5'-triphosphate, oxalic acid
PF01583    APS_kinase       20           166               2.7.1.25    1M7G  Adenosine-5'-phosphosulfate, adenosine-5'-diphosphate-2',3'-vanadate
PF01656    CbiA             80           372               6.3.3.3     1A82  Adenosine-5'-triphosphate, 7,8-diamino-nonanoic acid
PF01702    TGT              13           256               2.4.2.29    1Q2S  9-deazaguanine
PF01747    ATP-sulfurylase  19           397               2.7.7.4     1G8H  Adenosine-5'-phosphosulfate, pyrophosphate 2-
PF02110    HK               9            282               2.7.1.50    1ESQ  Adenosine-5'-triphosphate, 4-methyl-5-hydroxyethylthiazole phosphate
PF02223    Thymidylate_kin  26           209               2.7.4.9     1E9E  Adenosine-5'-diphosphate, thymidine-5'-phosphate
PF02277    DBI_PRT          28           398               2.4.2.21    1L5L  7-alpha-d-ribofuranosyl-purine-5'-phosphate, nicotinic acid
PF02353    CMAS             12           304               2.1.1.79    1KPI  S-adenosyl-L-homocysteine
PF02569    Pantoate_ligase  7            311               6.3.2.1     2A86  Adenosine monophosphate, beta-alanine
PF02898    NO_synthase      8            374               1.14.13.39  1Q2O  L-n(omega)-nitroarginine-2,4-L-diaminobutyric amide
PF02901    PFL              11           734               2.3.1.54    1MZO  Pyruvic acid
PF03332    PMM              9            248               5.4.2.8     2FUE  Alpha-d-mannose 1-phosphate
Figure 2: Assessment of the prediction quality for the homogeneous dataset. Color code as in Fig. 1. (a) Minimal distance from
the best cluster to the bound ligand. (b) Average distance from residues of the best cluster to the bound ligand. (c) Significance
of the average distance.
substantial sequence and structure diversity is possible
even among sets of proteins sharing the same EC number.
For example, the alignment of the queuine tRNA-ribosyl-
transferase family contains 9 bacterial and 4 archaeal pro-
teins. When predicting SDPs, SDPsite clearly divides the
family into these two groups. A closer analysis of the pre-
dicted SDPs reveals that some positions that are conserved
within the two groups but differ between them bind sub-
strate or tRNA in the bacterial enzyme from Zymomonas
mobilis [46]. For example, Cys 158 binds the queuine precursor
(substituted by proline in archaea), and Arg 286 binds
tRNA (substituted by leucine).
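The distinction drawn above, between positions conserved across the whole family and positions conserved within each group but different between groups, can be illustrated with a deliberately simplified sketch. It assumes strict conservation and a grouping that is already known; SDPsite itself predicts both the groups and the positions statistically, so this is not its algorithm. The function name and the toy sequences are hypothetical.

```python
def classify_columns(groups):
    """Split alignment columns into conserved positions and SDP-like positions.

    groups: dict mapping a group name to a list of aligned sequences
            (all sequences have the same length, '-' marks a gap).
    Returns (cps, sdps) as lists of 0-based column indices.
    """
    length = len(next(iter(groups.values()))[0])
    cps, sdps = [], []
    for col in range(length):
        # Set of residues observed in this column, one set per group.
        per_group = [{seq[col] for seq in seqs} for seqs in groups.values()]
        if any(len(s) != 1 or "-" in s for s in per_group):
            continue  # variable or gapped within at least one group
        consensus = {next(iter(s)) for s in per_group}
        if len(consensus) == 1:
            cps.append(col)   # same residue in every group
        else:
            sdps.append(col)  # conserved within groups, not identical across them
    return cps, sdps

# Toy alignment, made up to mimic a Cys->Pro and an Arg->Leu substitution.
groups = {
    "bacteria": ["GCDRA", "GCERA", "GCDRA"],
    "archaea": ["GPDLS", "GPELT"],
}
cps, sdps = classify_columns(groups)
print(cps, sdps)
```

With more than two groups this criterion only requires that the groups do not all share the same residue, which is one of several possible readings of "differ between groups".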
c. Overall performance in the benchmark and guidelines for
predictions
We analyzed the performance of SDPsite by calculating,
for the diverse, homogeneous and combined datasets, sensitivity
(the ratio of the number of true positives to the
number of true positives plus false negatives) and false
positive rate (ratio of number of false positives to the
number of false positives plus true negatives) (Table 3).
For a perfect prediction, sensitivity should be close to 1
and the false positive rate close to 0. As a gold standard set
of residues in active sites, we considered all amino acids
located within a distance of 10 Å from the bound ligand.
Note that not all these residues are functionally impor-
tant, which makes the reported false positive rate lower
than it would be if we had perfect information on real
functional importance of all residues in the proteins con-
sidered.
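The two measures defined above can be computed per protein as follows. This is a minimal sketch with hypothetical names and made-up residue numbers; the gold standard is taken, as in the text, to be every residue within 10 Å of the bound ligand, and the per-residue distances are assumed to be available already.

```python
def sensitivity_and_fpr(predicted, gold, all_residues):
    """Return (sensitivity, false positive rate) for one protein.

    predicted: residues in the predicted functional site
    gold: residues within 10 Å of the bound ligand
    all_residues: every residue of the protein
    """
    predicted, gold, all_residues = set(predicted), set(gold), set(all_residues)
    tp = len(predicted & gold)
    fn = len(gold - predicted)
    fp = len(predicted - gold)
    tn = len(all_residues - predicted - gold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return sensitivity, fpr

# Made-up distances (Å) from each residue of a 20-residue protein to the ligand.
distance_to_ligand = {i: (4.0 if i <= 5 else 15.0) for i in range(1, 21)}
gold = {r for r, d in distance_to_ligand.items() if d <= 10.0}
predicted = {1, 2, 3, 6}

sens, fpr = sensitivity_and_fpr(predicted, gold, distance_to_ligand.keys())
print(sens, fpr)
```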
In all datasets the inclusion of SDPs in the predictions
leads to a higher sensitivity. There is no significant differ-
ence between the diverse and homogeneous datasets,
though this is probably due to the diverse nature of the
underlying EC and sequence data as mentioned above:
functional diversity is also likely present in the homoge-
neous dataset making SDPs beneficial to the prediction of
functional sites.
As discussed above for the phosphomannomutase family,
proteins often undergo conformational changes upon
binding ligands, which means that any method tested on
structures in complex with a ligand might unfairly profit
from the use of a bound structure, which will differ from
the unbound or apo form of the protein. To test for this
effect, we identified 16 apo protein structures (of 26) from
the diverse dataset and 14 (of 18) from the homogeneous
dataset and found no significant difference in the pre-
dicted residues. This effectively means that the method is
relatively insensitive to minor conformational changes
that occur upon binding. It is worth noting, however, that
the enzyme sites considered here might not be represent-
ative of other types of interactions that can lead to more
substantial conformational re-arrangements (e.g. protein-
protein interactions), or contain much larger functional
sites than the tight constellations of functional residues
normally found in enzymes.
The results from the above benchmark provide a guide
for how to interpret predictions, which we used for the
unannotated families below. Based on inspection of the
Table 6: Assessment of SDPsite versions, homogeneous dataset.
Family    SDPsite (λ = 0.5 AND λ = 1)  SDPsite (λ = 0.5 OR λ = 1)  SDPsite (no SDPs)
PF00303   -                            -                           +
PF00693   -                            +                           -
PF00925   -                            -                           +
PF01014   +                            +                           +
PF01227   +                            +                           -
PF01293   -                            -                           +
PF01583   -                            -                           +
PF01656   +                            +                           +
PF01702   +                            +                           -
PF01747   +                            +                           +
PF02110   +                            +                           +
PF02223   +                            +                           +
PF02277   +                            +                           +
PF02353   +                            +                           -
PF02569   +                            +                           +
PF02898   -                            +                           -
PF02901   +                            +                           -
PF03332   -                            +                           -
Accuracy  0.61 (11/18)                 0.78 (14/18)                0.55 (10/18)
'+': successful prediction (average distance to ligand <10 Å, p-value in Mann-Whitney test against all amino acids of the protein < 0.01), '-' otherwise.
results, we found the ratio of the total number of predicted
SDPs and CPs to the length of the alignment (i.e.
(SDP+CP)/length) to be a useful measure of performance. For
29 of 44 families this ratio was below 0.2. For 17 of these
29 families all three versions of SDPsite predict the best
cluster with an average distance to ligand less than 10 Å,
and for 8 of these the p-value is < 0.01 in all three versions
of the method. However, the set of families with successful
predictions is not enriched in families with a low
(SDP+CP)/length ratio compared to all predictions. We
also calculated the ratio of the number of residues in the
best cluster to the length of the alignment (best cluster/length)
and applied a cutoff of 0.1. For 35 of the 44 families
(80%) the best cluster/length ratio is below 0.1,
whereas among the families for which the average distance is <10
Å and the p-value is < 0.01 for all three versions of SDPsite, this
fraction is 10 out of 12 (83%). Thus it is practical first to
consider predictions with a low best cluster/length ratio. Still,
as we discuss below, even more dispersed predictions can
lead to interesting insights.
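The two ratios and their cutoffs described above amount to simple arithmetic. The sketch below uses hypothetical function names and a made-up family; only the cutoffs (0.2 and 0.1) and the fractions 35/44 and 10/12 come from the text.

```python
def density_ratios(n_sdp, n_cp, best_cluster_size, alignment_length):
    """Return ((SDP+CP)/length, best cluster/length) for one family."""
    total_ratio = (n_sdp + n_cp) / alignment_length
    cluster_ratio = best_cluster_size / alignment_length
    return total_ratio, cluster_ratio

def below_cutoffs(n_sdp, n_cp, best_cluster_size, alignment_length,
                  total_cutoff=0.2, cluster_cutoff=0.1):
    """Return two booleans: is each ratio below its cutoff?"""
    total_ratio, cluster_ratio = density_ratios(
        n_sdp, n_cp, best_cluster_size, alignment_length)
    return total_ratio < total_cutoff, cluster_ratio < cluster_cutoff

# Made-up family: 12 SDPs, 25 CPs, 15 residues in the best cluster, 250 columns.
total_ratio, cluster_ratio = density_ratios(12, 25, 15, 250)
flags = below_cutoffs(12, 25, 15, 250)
print(total_ratio, cluster_ratio, flags)

# Fractions reported in the text, as rounded percentages.
all_families_pct = round(100 * 35 / 44)   # families with best cluster/length < 0.1
successful_pct = round(100 * 10 / 12)     # same, among fully successful families
print(all_families_pct, successful_pct)
```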
As noted above, sometimes the best cluster can be situated
on the inter-subunit contact interface, and the second best
cluster in the active site. No clear strategy can be suggested
to distinguish these situations in the absence of prior
knowledge. In practice one should analyze the composi-
tion of the best cluster for the presence of amino acids typ-
ical for enzyme active sites (potential electron donors/
acceptors) or for protein-protein contact interface (hydro-
phobic patches, etc.). These considerations, however, tend
to be family-specific and thus cannot be included into a
prediction algorithm intended for general use.
It is worth noting that SDPsite performs best when the
alignment is sufficiently long, the sequences are suffi-
ciently diverse in terms of sequence identity and the phy-
logenetic tree is biologically sound. As a guideline, it
performs best with alignments of at least 50 positions,
and at least 10 proteins with identities between 30 and
90%. The diversity of possible sequence sets and functions
for which SDPsite is applicable is great. From our
benchmark, we were unable to distinguish between a single-EC
and a multiple-EC family in a blind test. This is
probably the major drawback of the method: in the absence
of additional information, one cannot conclude whether
the identified SDPs pinpoint the real functional diversity
within the family, or simply reflect the phylogenetic trace.
It is impossible to say which of the three functional site
prediction approaches is generally best. In practice, one
should always run all three approaches, and then interpret
based on what, if anything, is known about the function,
and how the functional sites look on the structure.
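The alignment guidelines given above (at least 50 positions, at least 10 proteins, identities between 30 and 90%) can be checked before running a prediction. The sketch below is one possible reading of those guidelines: it assumes identity is measured pairwise over ungapped columns and flags any pair outside the range. The function names are hypothetical and the check is not part of SDPsite.

```python
from itertools import combinations

def pairwise_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def check_alignment(seqs, min_len=50, min_seqs=10, id_range=(30.0, 90.0)):
    """Return a list of guideline violations (an empty list means none found)."""
    if not seqs:
        return ["alignment is empty"]
    issues = []
    if len(seqs[0]) < min_len:
        issues.append(f"alignment has {len(seqs[0])} positions, fewer than {min_len}")
    if len(seqs) < min_seqs:
        issues.append(f"alignment has {len(seqs)} sequences, fewer than {min_seqs}")
    low, high = id_range
    identities = [pairwise_identity(a, b) for a, b in combinations(seqs, 2)]
    outside = sum(not (low <= i <= high) for i in identities)
    if outside:
        issues.append(
            f"{outside} of {len(identities)} pairs fall outside {low}-{high}% identity")
    return issues

# Two made-up aligned sequences that are 90% identical.
toy = ["ACDEFGHIKL", "ACDEFGHIKV"]
issues_default = check_alignment(toy)
issues_relaxed = check_alignment(toy, min_len=10, min_seqs=2)
print(issues_default, issues_relaxed)
```

A check of this kind says nothing about whether the phylogenetic tree is biologically sound, which still has to be judged separately.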
2. Application to protein families lacking functional
annotation
We focused on 193 Pfam families that included at least
one protein with resolved 3D structure, and where all
structures come from Structural Genomics Projects. After
removing families with fewer than 6 sequences in the
Pfam seed alignment, we were left with 124 (Additional
file 2), for which potential functionally important sites
could be identified. Of these, 54 families include poten-
tial or proven enzymes; 5 are transcription factors; 3 are
involved in translation; 15 participate in various cellular
processes in a fashion that is not completely understood;
and the functional role for 47 is unknown. The full
description of the predictions is available from the SDPsite
web-site. We calculated the (SDP+CP)/length and best cluster/length
ratios for these families. For 50 of 124 families
the (SDP+CP)/length ratio was below 0.2 and the predicted
cluster lies on the surface or in a pocket of the protein
structure. For another 44 families predictions were
weaker, but acceptable: the (SDP+CP)/length ratio exceeds 0.2,
but the cluster still lies in a pocket or on the surface. For
the remaining 30, no reasonable predictions were made.
Taken together, 76% of the predictions seem to provide a
reasonable hint about the location of the actual func-
tional site of the protein. The best cluster/length ratio is
below 0.1 in 91 of 124 (73%) families. This fraction is
lower than for the benchmark dataset, which can be due
to the fact that the considered dataset includes proteins of
various functions as opposed to the benchmark that con-
sists solely of enzymes. The predictions made for these 91
families are the most promising candidates for experi-
mental validation. Five promising examples are discussed
below. These examples were chosen according to the crite-
ria above, and our own visual inspection of the results.
a. YCII-related domain (PF03795)
This family was first identified during an analysis of the
Streptomyces coelicolor genome by a domain hunting proc-
ess [47]. The authors confined the annotation to a remark
that it is "probably enzymatic". The study of this domain
continued, when the first (and to date only) structure of a
protein from this family, HI0828 from Haemophilus influ-
enzae, was solved (PDB ID 1mwq, [48]). The protein was
shown to adopt a ferredoxin-like α/β-fold, and a catalytic
mechanism involving a histidine-aspartate pair was pro-
posed. However, biochemical assays have to-date failed to
suggest a substrate for the enzyme.
SDPsite identifies a small cluster of potentially important
residues (Fig. 3a), located in a pocket, which contains a
coordinated ZnCl3, a ligand supposed to play a role in the
catalysis. The method splits the family into three specifi-
city groups (Fig. 3b). As the family includes a number of
paralogues from the same species, which fall into different
specificity groups, the identified sub-groups might reflect
the real specificity differences within the family. The pre-
dicted SDPs might thus account for binding of different
substrates in the active pocket.
b. CobW/HypB/UreG, nucleotide-binding domain (PF02492), and
Cobalamin synthesis protein cobW C-terminal domain (PF07683)
These two protein families represent the two domains of
CobW protein, a hypothetical protein, involved in cobala-
min biosynthesis and possibly required to store cobalt
ions as an intermediary between the cobalt transport and
chelation systems [49]. The N-terminal part of CobW is a
member of PF02492, which also contains nucleotide-
binding domains of HypB and UreG, GTPases involved in
binding of nickel to apoenzymes [50,51]. The only known
3D structure in both families belongs to a hypothetical
protein YjiA from E. coli [31] (PDB ID 1nij). YjiA is a
homolog of CobW and has the same domain structure,
but a much shorter linker between the domains, and also
lacks the histidines required for metal binding, and thus
cannot serve as a metal repository. YjiA is believed to be a
GTP-dependent regulator; however, its biological role is
unclear [52]. The N-terminal domain of YjiA (which corresponds
to PF02492) has a fold typical of P-loop NTP-binding
proteins, and a number of motifs responsible for
GTP binding can be found in it. The C-terminal domain
has a ferredoxin-like fold.
SDPsite identified two potential functionally important
sites, one for each domain (Fig. 4). Both clusters are
exposed to solvent and the cluster in the N-terminal
domain is located close to the possible nucleotide-bind-
ing pocket, identified by Walker A and B motifs, as dis-
cussed in ref. [49] (arrow in Fig. 4). The arrangement of
the clusters suggests their possible role in communication
between the two domains. Indeed, in the structure used
the minimal distance between residues of the N-terminal
and the C-terminal domain cluster is 5.5 Å. As the
domains are rather mobile relative to each other due to
the flexible linker, under certain conditions the two clus-
ters could potentially interact directly.
c. PHP domain (PF02811)
This family corresponds to a catalytic domain with a phos-
phoesterase activity found both as a stand-alone protein
and fused to DNA polymerase domains. PHP domains are
often found in the N-terminal part of bacterial DNA
polymerase III α subunit and in the C-terminal part of
DNA polymerase of the X family in some archaea. In this
role, the PHP domain is proposed to hydrolyze pyrophos-
phate, shifting reaction equilibrium to polymerization
[53]. The family also includes a number of tyrosine-protein
phosphatases and histidinol phosphatases and many
uncharacterized proteins. 3D structures are available for
two proteins of the family, both with unknown function:
YcdX from E. coli and tm0559 from Thermotoga maritima.
Figure 3: A. Structure of HI0828 from Haemophilus influenzae (1mwq). SDPs are marked yellow, CPs are marked orange, best
cluster is shown in spheres. Cl ions are shown in green, Zn ions in brown. B. Phylogenetic tree of the YCII-related
domain (PF03795) family. The predicted specificity groups are shown as gray ovals.
We mapped the predicted positions on the structure of
YcdX from E. coli [54] (PDB code 1m68) (Fig. 5), since the
predicted clusters in all available structures are formed by
equivalent residues (data not shown). YcdX is usually
present as a trimer in solution, each monomer possessing
its own catalytic site [54] (Fig. 5A, B and 5C show a closer
view of one monomer of the complex). A cluster of three
zinc ions is located in a cleft of the structure, indicating a
possible location of the active site (marked with an arrow
in Fig. 5A). The predicted cluster of functionally impor-
tant residues lies close to the zinc cluster and has a layered
form: a more compact layer of CPs and a fuzzier layer of
SDPs. CPs might represent the catalytic core of the active
site and SDPs, a less spatially defined recognition periph-
ery. Some SDPs even protrude to the back of the mono-
mer, where they can participate in the ligand recognition
in another active site.
It was previously noted [53] that proteins of the PHP fam-
ily appear as single domain proteins or as domains in
multi-domain proteins involved in a variety of cellular
processes, possibly exhibiting diverse specificity. Indeed,
one can find proteins annotated as "DNA polymerase III
subunit alpha", "DNA-dependent DNA polymerase fam-
ily X", "Histidinol-phosphatase", "Tyrosine-protein phos-
phatase" and "Protein trpH" in this family. Many of the
family members are still uncharacterized. Along with the
identification of a cluster of possible functionally impor-
tant residues, SDPsite extracts a number of specificity
groups, which are shown on the phylogenetic tree of the
family in Fig. 6. Proteins with similar annotation always
fall into the same group, and, with a certain degree of caution,
this annotation can be transferred onto other proteins of
the same group. For example, YcdX falls into the same
group with a sequence from Methanobacterium thermoau-
totrophicum (O26650), which is annotated as a DNA-
Figure 4: Structure of YjiA from E. coli (1nij). N-terminal domain is shown in pink, linker is shown in light green, C-terminal domain
is shown in light blue. SDPs are marked yellow, CPs are marked orange in the N-terminal domain and cyan and magenta,
respectively, in the C-terminal domain, best cluster is shown in spheres. The red arrow indicates the position of the
nucleotide-binding pocket.