Prediction of TF target sites based on atomistic models of protein-DNA complexes.
ABSTRACT The specific recognition of genomic cis-regulatory elements by transcription factors (TFs) plays an essential role in the regulation of coordinated gene expression. Studying the mechanisms determining binding specificity in protein-DNA interactions is thus an important goal. Most current approaches for modeling TF specific recognition rely on the knowledge of large sets of cognate target sites and consider only the information contained in their primary sequence.
Here we describe a structure-based methodology for predicting sequence motifs starting from the coordinates of a TF-DNA complex. Our algorithm combines information regarding the direct and indirect readout of DNA into an atomistic statistical model, which is used to estimate the interaction potential. We first measure the ability of our method to correctly estimate the binding specificities of eight prokaryotic and eukaryotic TFs that belong to different structural superfamilies. Secondly, the method is applied to two homology models, finding that sampling of interface side-chain rotamers remarkably improves the results. Thirdly, the algorithm is compared with a reference structural method based on contact counts, obtaining comparable predictions for the experimental complexes and more accurate sequence motifs for the homology models.
Our results demonstrate that atomic-detail structural information can be feasibly used to predict TF binding sites. The computational method presented here is universal and might be applied to other systems involving protein-DNA recognition.
BioMed Central
BMC Bioinformatics
Open Access
Research article
Prediction of TF target sites based on atomistic models of
protein-DNA complexes
Vladimir Espinosa Angarica*1,2,3, Abel González Pérez4, Ana T Vasconcelos5,
Julio Collado-Vides3 and Bruno Contreras-Moreira*3,6,7
Address: 1Departamento de Bioquímica y Biología Molecular y Celular, Facultad de Ciencias, Universidad de Zaragoza. Pedro Cerbuna 12, 50009
Zaragoza, España, 2Instituto de Biocomputación y Física de Sistemas Complejos, Universidad de Zaragoza. Corona de Aragón 42 Edificio
Cervantes, 50009 Zaragoza, España, 3Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma
de México. Av. Universidad s/n., Colonia Chamilpa 62210, Cuernavaca, Morelos, México, 4Centro Nacional de Bioinformática. Industria y San
José, Capitolio Nacional, CP 10200, Habana Vieja, Ciudad de la Habana, Cuba, 5Laboratório Nacional de Computação Científica. Av. Getulio
Vargas 333, Quitandinha, CEP 25651-075, Petrópolis, Rio de Janeiro, Brasil, 6Laboratory of Computational Biology, Estación Experimental de
Aula Dei, Consejo Superior de Investigaciones Científicas, Av. Montañana 1.005. 50059 Zaragoza, España and 7Fundación ARAID, Paseo María
Agustín 36, Zaragoza, España
Email: Vladimir Espinosa Angarica* - vespinosa@gmail.com; Abel González Pérez - abel@cosude.org; Ana T Vasconcelos - atrv@lncc.br;
Julio Collado-Vides - collado@ccg.unam.mx; Bruno Contreras-Moreira* - bcontreras@eead.csic.es
* Corresponding authors
Published: 16 October 2008
BMC Bioinformatics 2008, 9:436 doi:10.1186/1471-2105-9-436
Received: 12 May 2008
Accepted: 16 October 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/436
© 2008 Angarica et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
The specific recognition of genomic cis-regulatory elements
by nucleic acid binding proteins is of critical importance
for many vital processes such as DNA replication
and repair, mRNA translation and transcriptional regulation.
Specific TF-DNA interactions, as well as protein-protein
contacts established between closely located TFs at
promoter regions, are key determinants of the concerted
expression of genes in response to external stimuli. The
task of determining the relationships among TFs and their
targets has been widely addressed experimentally [1,2]
and computationally [3,4]. Of great relevance and use in
recent years are computational approaches, based on the
construction of statistical models that characterize the
DNA-binding preferences of TFs, which are in turn used to
scan genomic sequences in an effort to identify new putative
binding sites [5,6].
The probabilistic models most commonly used in such
computational approaches are position weight matrices
(PWMs) obtained from multiple alignments of known
binding sites. This approach is limited to TFs with a suffi-
cient number of experimentally identified binding sites,
for which reliable statistical models may be built. An alter-
native approach would be to predict DNA operator sites
which are compatible with the mode of binding of a given
TF [7]. This approximation would depend on two key
components: i) the knowledge of the protein and DNA
residue positions involved in binding at the spatial level
and ii) a method to evaluate the compatibility of different
DNA bases and amino acids to interact [8,9].
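The sequence-based PWM approach described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the pseudocount, the uniform background of 0.25 and the toy sites are all assumptions made for illustration. It builds a log-odds matrix from aligned sites and slides it along a sequence.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned, equal-length sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm

def score_site(pwm, site):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))

def scan(pwm, sequence):
    """Score every window of the PWM width; returns (offset, score) pairs."""
    width = len(pwm)
    return [(i, score_site(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

# Toy aligned sites (hypothetical, not taken from any database).
toy_sites = ["TACCCA", "TACCCA", "TACGCA", "TTCCCA"]
toy_pwm = build_pwm(toy_sites)
best_offset, best_score = max(scan(toy_pwm, "GGTACCCAGG"), key=lambda hit: hit[1])
```

The sketch also shows the limitation noted in the text: with only a handful of known sites the column frequencies are dominated by the pseudocount, so the resulting model is unreliable.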
The computational analysis of available protein-DNA
complexes has resulted in several papers describing the
characteristics of the amino acid-base interactions that
determine binding specificity [10-12], the different types
of readout mechanisms involved in DNA recognition
[13,14] and the evolutionary conservation of residues
located at contact interfaces [15-17]. Direct readout is
associated with recognition through contacts established
between atoms from amino acid side-chains and nitrogenous
bases. Indirect readout, on the other hand, is mediated by
the contribution of residues of the proteins and DNA
which are not in direct contact and conformational
changes undergone by DNA upon protein binding
[14,18]. These reports show that both direct and indirect
readouts significantly contribute to specific protein-DNA
recognition.
With respect to direct readout, it is estimated that about
two-thirds of all protein-DNA interactions are van der
Waals contacts which do not generally confer sequence
specificity, with the exception of the hydrophobic interactions
involving the C7 atom of thymine [12]. In contrast,
hydrogen bonds are the major source of specific
interaction contacts. Two-thirds of the hydrogen bonds
formed between amino acids and bases are bidentate or
complex interactions that provide high specificity. Non-classical
C-H···O hydrogen bonds have also
been found at protein-DNA contact interfaces [19], although
their energetic contributions to recognition are not yet fully
understood. These interactions are inherently weak
individually [20], which means that cumulative effects are
indispensable to significantly account for specificity, as
occurs with hydrophobic links.
Water mediated bridges, though common in interaction
interfaces, are mostly used as gap-fillers [12] and are also
engaged in buffering unfavorable electrostatic repulsions
between interacting atoms at the interface [21]. Electro-
static interactions, most of which are established between
the protein main chain and the sugar/phosphate back-
bone, do not significantly contribute to specific recogni-
tion, although they play important roles in the transition
from the unspecific to the specific complex [22].
To date just a few reports have been published aiming at
finding putative TF target sequences using structural infor-
mation. There have been some attempts to apply physical
potential functions – i.e., in the form of atomic force fields
– to estimate the energy of interaction and the relative
contribution of direct and indirect readout mechanisms
[18,23,24]. However, these physical models do not
appear to significantly outperform simpler statistical
methods [25]. Examples of these methods are approaches
oriented to extracting structural information from family-
wise comparisons, building a statistical model of the
interface derived from crystallographic structures and
binding sites of proteins that belong to the same family
[26,27]. Other procedures infer statistical potentials of
interaction at the residue-base level from datasets of
known complexes [7,9,28-30], which are used to examine
the compatibility between the protein and its putative
sites. However, there is some evidence suggesting that the
performance of these methods may be limited by the sim-
plicity of their interaction potentials, calculated at the
level of Cα atoms for protein residues. Moreover, the
increasing number of high-resolution structures for pro-
tein-DNA complexes opens a new door towards a more
detailed study of their contact interfaces. Indeed a recent
report confirms that atomic details improve the ability of
physical potential functions to predict sequence-specific
protein-DNA interactions [25].
In this work we constructed position weight matrices that
capture the binding specificity of transcription factors,
based on information extracted from the Protein Data
Bank (PDB). Three atomic preference matrices, for hydro-
gen bonds, water-mediated hydrogen bonds and hydro-
phobic interactions, were derived from a non-redundant
training set of 210 complexes annotated in the PDB [31].
These matrices were used to make explicit atomistic repre-
sentations of hydrogen and hydrophobic bonds as well as
the contribution of water molecules at interfaces, which
gives us the opportunity to score the direct readout. This
contribution is combined with empirical estimations of
DNA deformation, in order to calculate a potential of
binding that includes both direct and indirect readout
contributions. We evaluated the performance of our algo-
rithm in a set of 4 bacterial and 4 eukaryotic TFs which
have been co-crystallized bound to DNA, and in most
cases our results proved to be as good as or better than
those obtained with the structure-based cumulative con-
tact method by Morozov and Siggia [32]. In addition, two
TF homology models were analyzed in detail and used to
predict their DNA binding motif, after sampling side-
chain rotamers at their contact interfaces. In this case the
results we obtained were significantly better than those
returned by the reference method, which indicates that
our algorithm could be suitably used to study TFs of
unknown structure starting from structural models. We
also discuss the strengths and limitations of our approach,
which might potentially be used for TFs with few or no
experimentally characterized binding sites.
Results
Protein-DNA interface atomic contacts and interaction
preferences
Starting from a non-redundant training set of crystallo-
graphic protein-DNA complexes culled from the PDB, we
constructed atomic frequency matrices for hydrogen
bonds, water-mediated hydrogen bonds and hydrophobic
contacts. A close inspection of the information embodied
in these frequency matrices (see Table 1 and Additional
file 1) reveals some interesting features of protein-DNA
interfaces. For instance, as claimed in previous reports, we
found that arginine is the major source of hydrogen
bonds, with a marked preference to interact with guanine
[12]. However, in the present study we broke this preference
down into pairs of interacting atoms, finding that groups
NH1 and NH2 of arginine establish 88% of the hydrogen
bonds with atoms N7 and O6 from guanine. Histidine
also showed a marked preference towards guanine, with
atom NE2 interacting in 17 out of 35 hydrogen bonds
with atoms O6 and N7 from the nitrogenous base. Overall,
we found 860 interface hydrogen bonds in our training
set.
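As an illustration of how an atomic frequency matrix of this kind can be tallied and queried, the following minimal sketch counts contacts per atom pair and computes the share of a residue's bonds that go to selected base atoms. It assumes the contacts have already been extracted as tuples (in the paper they are reported by HBPLUS). The function names and the toy contact list are hypothetical and do not reproduce the training-set counts.

```python
from collections import defaultdict

def build_contact_matrix(contacts):
    """Count contacts per (residue, residue atom, base, base atom) pair.

    `contacts` is an iterable of 4-tuples; this input format is an assumption.
    """
    matrix = defaultdict(int)
    for residue, res_atom, base, base_atom in contacts:
        matrix[(residue, res_atom, base, base_atom)] += 1
    return dict(matrix)

def atom_share(matrix, residue, res_atom, base, base_atoms):
    """Fraction of all bonds made by `residue` that link `res_atom`
    to any of `base_atoms` on `base`."""
    total = sum(n for (r, _, _, _), n in matrix.items() if r == residue)
    selected = sum(n for (r, a, b, ba), n in matrix.items()
                   if r == residue and a == res_atom
                   and b == base and ba in base_atoms)
    return selected / total if total else 0.0

# Toy data only: 8 histidine bonds, 4 of them from NE2 to guanine O6/N7.
toy_contacts = ([("HIS", "NE2", "G", "O6")] * 3
                + [("HIS", "NE2", "G", "N7")]
                + [("HIS", "ND1", "A", "N7")] * 4)
toy_matrix = build_contact_matrix(toy_contacts)
share = atom_share(toy_matrix, "HIS", "NE2", "G", {"O6", "N7"})
```

Statements such as "17 out of 35 hydrogen bonds" in the text are ratios of this form, computed on the real matrix.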
The landscape of hydrophobic interactions is quite different
from that of hydrogen bonds. As previously
reported [9,12], the main discriminatory group regarding
specific recognition through hydrophobic interactions is
the methyl group of thymine. Accordingly, the C7 group
accounted for 20% of all 1010 hydrophobic interactions
found in our training dataset, being the main source of
contacts for all amino acids when compared with the
other nitrogenous bases. With the exception of C7, we found
no obvious interaction preferences between amino acids
and bases. This means that the signal-to-noise ratio of the
calculated preferences is lower than the one observed for
hydrogen bonds, and that these preferences are therefore less
informative. Thus, in our subsequent analyses, we decided to
take into account only the contribution of the methyl group of
thymine during the evaluation of the interface between a
TF and a given DNA sequence.
Table 1: Hydrogen bonds atomic interaction frequency matrix
Columns (base atoms): T: O2, N3, O4; C: O2, N3, N4; A: N7, N6, N3; G: N7, O6, N2, N3
Rows (amino acid side-chain atoms): R NE, R NH1, R NH2, K NZ, S OG, T OG1, N OD1, N ND2, Q OE1, Q NE2, H ND1, H NE2, Y OH, E OE1, E OE2, D OD1, D OD2, C SG, M SD, W NE1
Cell counts, listed in the order in which they were extracted (row and column positions are not preserved in this copy):
60 0 0 0 0 1 1 0 0 0 0 0 0 1 2 0 3 0 0 0 3 2 8 9 2 4 0 4 9 14 9 2 2
0 3 0 3 0 1 3 0 0 0 0 0 0 0 0 4 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 5 9 8 0 4 0 0 1 1 0 2 0 6 4 2 0 0 0 0 0 1 1 0 8 7 7 2 1
0 2 0 3 0 1 1 0 0 0 0 0 0 0 13 48 62 33 7 3 0 4 0 1 2 5 2 0 0 0 0 0 0
0 11 39 80 38 9 3 0 11 0 6 5 12 2 0 0 0 0 0 0 0 0 0 0 0 3 2 6 0 4 0 0 1
3 0 1 2 5 0 0 0 1 4 3 2 1 0 0 7 0 5 0 1 1 0 0 0 0 0 0 1 21 15 15 4 2
0 6 0 2 1 3 1 0 0 0 0 0 0 1 23 0 17 0 0 0 3 1 2 1 1 2 0 0 11 0 7 0 2
4 0 0 0 0 1 0 1 20 0 16 0 0 1 0 0 0 0 0 0 0 16 11 10 13 0 0 0
Hydrogen bonds reported by HBPLUS [46] between pairs of atoms in the side-chains of amino acids and bases in the set of 210 protein-DNA
complexes extracted from PDB were computed. Rows correspond to the hydrogen bond donor/acceptor atoms from the amino acids and columns
to the hydrogen bond donor/acceptor atoms from bases (T, C, A, G) with their corresponding PDB atom codes. Each cell indicates the number of
contacts found for a given pair of atoms.
We also generated frequency matrices for water-mediated
hydrogen bonds, finding a total of 482 atomic interac-
tions. The overall observed interaction propensities are
quite similar to those in the hydrogen bond matrix, which
is probably related to the gap filling role of water that sup-
ports the existence of long-distance hydrogen bonds [21].
In addition, the buffering role of water allows the forma-
tion of electrostatically unfavorable hydrogen bonds, add-
ing new interaction propensities absent in the hydrogen
bond matrix. For instance, we found eight water-mediated
hydrogen bonds between arginine side-chain nitrogen
groups and cytosine N4.
A survey of the quality of the atomic interaction matrices
The redundancy of the data in the PDB and the bias
towards specific protein folds considerably complicate
efforts to exploit the wealth of this database,
limiting the scope of the results. In order to overcome
these problems, we decided to avoid redundancy as much
as possible in our training set, while taking care to retain
a sufficiently informative set of PDB entries. We also
planned a thorough bootstrap assay to estimate the quality
of the atomic preferences extracted from the training
dataset.
In this bootstrap experiment we resampled with replacement a
thousand subsets of entries, randomly excluding a considerable
fraction of the entries included in the initial training
set, as described in Methods. By leaving out almost a
third of the total entries, we measured the degree of overtraining
due to residual redundancy in the initial set, as
well as the statistical relevance of the interaction preferences
estimated in the matrix building process. A straightforward
way of addressing this issue is to check
whether the matrices constructed from the truncated
training sets score the direct readout well both for the
entries used to build the corresponding bootstrap matrices
and for the crystallographic complexes that were excluded
in the resampling step.
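The resampling idea can be sketched as follows. This is a plain bootstrap (as many draws with replacement as there are entries), which is an assumption: the exact scheme used in the paper is the one given in its Methods section, and the function names here are hypothetical. A plain bootstrap leaves out about 37% of the entries on average, so it only approximates the "almost a third" quoted above.

```python
import random

def bootstrap_split(entries, rng):
    """Draw len(entries) items with replacement.

    Returns (sampled, excluded): the resampled training entries and the
    entries that were never drawn, which can serve as a held-out set.
    """
    sampled = [rng.choice(entries) for _ in entries]
    drawn = set(sampled)
    excluded = [entry for entry in entries if entry not in drawn]
    return sampled, excluded

def mean_excluded_fraction(entries, replicates=1000, seed=0):
    """Average fraction of entries left out across bootstrap replicates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        _, excluded = bootstrap_split(entries, rng)
        total += len(excluded) / len(entries)
    return total / replicates

# 210 placeholder identifiers standing in for the training complexes.
toy_entries = [f"entry_{i:03d}" for i in range(210)]
fraction_left_out = mean_excluded_fraction(toy_entries)
```

In each replicate, matrices would be rebuilt from `sampled` and then used to score both `sampled` and `excluded`, which is the comparison summarized in Figure 1.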
As may be seen in Figure 1, the exclusion of 30% of the
entries in the training set does not significantly affect the
discrimination capability of the matrices. Here we compare
the scores obtained when evaluating the bootstrapped
(1A, 1C) and excluded (1B, 1D) datasets with the bootstrap
matrices (abscissa) against those obtained with the generic (1A,
1B) or shuffled matrices (1C, 1D). The scatter
plots and the regression analysis show that the scores obtained
for the bootstrapped and excluded datasets
with the bootstrap matrices correlate well (R^2 =
0.90 and R^2 = 0.62, respectively) with the values obtained
when scoring those same datasets with the generic matrices
(Figure 1A, B). As anticipated, the correlation
disappears when the analysis is done with the shuffled
matrices (Figure 1C, D).
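For reference, the slope, intercept and R^2 of such a regression can be obtained with an ordinary least-squares fit. The sketch below is a generic implementation, not the authors' analysis script, and the data points are made up for illustration.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Made-up mean scores for illustration only.
slope, intercept, r2 = linear_fit_r2([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```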
Direct and indirect readout mechanisms in protein-DNA
complexes
The information embodied in the atomic interaction
matrices presented in the previous section can be used to
estimate the contribution of direct readout in protein-
DNA recognition. However, it has been shown that a
mechanism of indirect readout also plays a relevant role
in this process, at least for some TFs, and was therefore
considered in this work. In particular, we modeled indi-
rect readout as the cost of threading a nucleotide sequence
into a fixed DNA backbone. In order to integrate both
direct and indirect recognition mechanisms, we designed
a saturating mutation strategy, which is the kernel of the
DNAPROT algorithm. Briefly, the algorithm iteratively
evaluates the interaction potential of a given TF as the
docked nucleotide sequence mutates in a 4^N space, for a
DNA duplex of length N (see Figure 2 for a flowchart of
the algorithm). This renders a structure-based position
weight matrix that can be used to scan genomic
sequences. A full description is provided in the Methods
section, but it is important to note that direct and indirect
readout scores are linearly combined by means of a defor-
mation weight, D. As D gets larger, the relevance of indi-
rect readout increases.
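The saturating mutation step can be sketched as below, using the linear combination shown in the flowchart of Figure 2, readout = D·indirect + (1-D)·direct. This is a simplified illustration and not the DNAPROT implementation: it assumes one substitution at a time (4 bases at each of N positions), and the two scoring callables are hypothetical stand-ins for the atomic interaction matrices and the DNA deformation estimate.

```python
BASES = "ACGT"

def saturating_mutation_matrix(template, direct_score, indirect_score, D=0.5):
    """Score every single-base substitution of `template`.

    `direct_score(seq, pos)` and `indirect_score(seq, pos)` are hypothetical
    callables returning the direct and indirect readout scores of the mutated
    sequence at position `pos`. Returns one {base: score} dict per position.
    """
    if not 0.0 <= D <= 1.0:
        raise ValueError("D must lie between 0 and 1")
    matrix = []
    for pos in range(len(template)):
        column = {}
        for base in BASES:
            mutant = template[:pos] + base + template[pos + 1:]
            column[base] = (D * indirect_score(mutant, pos)
                            + (1.0 - D) * direct_score(mutant, pos))
        matrix.append(column)
    return matrix

# Toy scorers (assumptions): direct readout rewards a preferred base per
# position, indirect readout mildly favors A/T steps.
PREFERRED = "TACCCA"

def toy_direct(seq, pos):
    return 1.0 if seq[pos] == PREFERRED[pos] else 0.0

def toy_indirect(seq, pos):
    return 0.5 if seq[pos] in "AT" else 0.0

toy_matrix = saturating_mutation_matrix("GGGGGG", toy_direct, toy_indirect, D=0.4)
```

With D = 0 the matrix reflects direct readout only, and with D = 1 it reflects indirect readout only, which is the range explored in the ROC analysis.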
A study of the performance of our method is shown in Fig-
ure 3, where it can be observed that the indirect readout
contribution to the ability of correctly scoring cognate
sites of CRP and NarL is critical, since the predictive
performance increases with large D values, as revealed by the
increasing Areas Under the Curve (AUCs) in the ROC
plots. These outcomes agree with previous reports in the
literature claiming the central role of deformation energy
in the site recognition mechanism of CRP [33] and NarL
[34], which bend DNA 90° and 42° respectively. PurR
and DnaA display an opposite trend, since their largest
AUCs are obtained with D values less than 0.4. While
PurR is known to bend DNA with an angle of 45° [35],
the reported bending angle of DnaA is just 28° [36]. This
means that although in the first two cases the contribution
of deformation can be considered essential, in the cases of
PurR and DnaA optimal predictive values are obtained
with intermediate values of D, also including an impor-
tant contribution of direct readout.
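As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen cognate site scores higher than a randomly chosen random sequence (the Mann-Whitney formulation, with ties counted as one half). It is a generic illustration rather than the procedure used to draw Figure 3, and the score lists are invented.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    scores higher, counting ties as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Invented scores: cognate sites versus random sequences.
auc = roc_auc([7.1, 6.4, 5.9], [5.9, 3.2, 2.8, 1.0])
```

An AUC of 1.0 means every cognate site outranks every random sequence, while 0.5 corresponds to the no-discrimination diagonal of the ROC plot.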
The subtle changes observed for the AUC in the ROC plots
of Figure 3 in fact correspond to significant variations in
the classification performance of our method. This can be
seen in the results reported in Table 2 for the scoring of
known PurR binding sites. The comparison of the PWMs
obtained with D values of 0.4 and 1.0, those for which we
obtained the best and worst performance in Figure 3C,
reveals that setting D to its optimum value causes the
p-values of cognate sites to be reduced by one or two orders
of magnitude. This improvement corresponds to a reduction
of 10^3-10^4 false positives extracted along with true
sites in a search set of 10^6 bp. It is worth noting that, using
the optimum D value in the construction of the model for
this TF and setting the sensitivity to 80%, the number of
false positive sites recovered with the true sites is as low as
190 in a genome-sized search set.
Scoring crystallographic protein-DNA complexes with the
DNAPROT algorithm
In this section we present the results of scoring a diverse
collection of protein-DNA complexes solved by X-ray crys-
tallography. The test set includes eight transcription fac-
tors bound to cognate DNA sequences, of which four are
prokaryotic and the other four eukaryotic, belonging to
eight different superfamilies according to SCOP [37]. In
this experiment we took the coordinates of each of these
interfaces and computed structural position weight matrices
(PWMs) that approximate their binding specificity.
Two types of PWMs are calculated here: i) readout matrices,
derived using the DNAPROT algorithm outlined in
Figure 2, and ii) cumulative contact matrices, obtained by
simply counting the contacts between protein side-chains
and nitrogenous bases, as previously described by Morozov
and Siggia [32].
Figure 1
Scatter plot and regression analysis of the scoring capability of bootstrap matrices against generic and shuffled
matrices. In the top diagrams, the mean score of bootstrap matrices (abscissa) is plotted against the mean score of generic
matrices (ordinate) for PDB entries that make up the bootstrapped (A) and excluded (B) datasets. In the bottom graphs, the
mean score of bootstrap matrices is plotted against the mean score of shuffled matrices, when computed with entries that are
part of the bootstrapped (C) and excluded (D) datasets.
Linear fits shown in the panels: A (Entries): y = 0.898x + 0.420, r2 = 0.905; B (Excluded): y = 0.543x + 2.912, r2 = 0.624; C (Entries): y = -0.306x - 1.952, r2 = 0.04; D (Excluded): y = -0.242x - 2.535, r2 = 0.05.
These PWMs can then be compared with cognate matrices
built from experimentally determined binding sites, by
means of local alignments [38]. Further details are pro-
vided in the Methods section, but it is relevant to note that
readout PWMs are computed using knowledge-based
atomic potentials, whereas cumulative contact PWMs are
calculated with the assumption that the consensus DNA
sequence is part of the PDB complex. Figure 4 includes the
sequence logos derived from the computed PWMs; the
expectation values of PWM comparisons are shown in
Table 3, together with details of the test set.
The PWMs generated by DNAPROT starting from crystallographic
structures are comparable to or better than those
produced with the methodology of cumulative contacts
by Morozov and Siggia [32]. This is clear in the results of
Table 3, where, with the exception of PurR and LEU3, the
statistical significance of our readout matrices is higher or
of the same order as that of the contact matrices. An
inspection of Figure 4 also shows that for the three
prokaryotic regulatory proteins, CRP, NarL and MetJ, the
sequence logos obtained by both methods are quite simi-
Figure 2
Flowchart of the DNAPROT algorithm. Starting from the set of protein-DNA complexes included in the PDB, we filter
the entries using the criteria described in the Methods section to eliminate redundancy. a) The culled training set is used to
derive atomic matrices that capture the interaction preferences at binding interfaces. Taking as input the Cartesian coordinates
of a TF-DNA complex with N complementary base pairs, DNAPROT mutates one by one all 4N nucleotides in the template. c)
During the saturating mutation assay each mutation is scored in terms of direct – i.e., using the atomic PWMs built in step a) –
and indirect – i.e., by estimating the deformation cost of DNA upon mutation, as described in step b) – readout and the com-
bined scores are used to fill a position weight matrix. A sequence logo might be calculated from the structure-based PWM by
stacking the best B oligonucleotides, usually 50.
Flowchart elements: PDB; Filtering; Selected protein-DNA complexes; Atomic interaction matrices (H-bond, Hydrophobic, Water-mediated H-bond); Indirect readout estimation; TF-DNA complex; Binding potential estimation over {T,C,G,A} at each position, with readout(N) = D·indirect(N) + (1-D)·direct(N); structure-based PWM; sequence logo (best B sites).
Page 7
BMC Bioinformatics 2008, 9:436http://www.biomedcentral.com/1471-2105/9/436
Page 7 of 18
(page number not for citation purposes)
ROC plots of cognate site recovery in a set of random sequences for NarL, CRP, PurR and DnaA
Figure 3
ROC plots of cognate site recovery in a set of random sequences for NarL, CRP, PurR and DnaA. Sensitivity
(ordinate) and corrected specificity (abscissa) for the cognate binding sites recovery are plotted against in an assay in which the
cognate binding sites are rescued from a dataset of non-sense random sequences, as explained in the Methods section. In this
assay the W parameter is fixed to 1 while D, the linear weight of deformation costs assigned to indirect readout, is variable
over its domain in the ROC plot analysis for NarL (A), CRP (B), PurR (C) and DnaA (D). In these charts the Area Under the
Curve (AUC) is reported for each curve, corresponding to different assays with variable values of D. The PDB identifiers are
NarL [1je8], CRP [1cgp], PurR [2pua] and DnaA [1j1v].
AB
CD
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4
1 - Specificity
0.6 0.8 1
Sensitivity
D = 0.0 (AUC = 0.983)
D = 0.2 (AUC = 0.990)
D = 0.4 (AUC = 0.991)
D = 0.6 (AUC = 0.997)
D = 0.8 (AUC = 1.000)
D = 1.0 (AUC = 1.000)
No Discrimination
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4
1 - Specificity
0.6 0.8 1
Sensitivity
D = 0.0 (AUC = 0.943)
D = 0.2 (AUC = 0.956)
D = 0.4 (AUC = 0.967)
D = 0.6 (AUC = 0.970)
D = 0.8 (AUC = 0.984)
D = 1.0 (AUC = 0.981)
No Discrimination
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4
1 - Specificity
0.6 0.8 1
Sensitivity
D = 0.0 (AUC = 0.977)
D = 0.2 (AUC = 0.988)
D = 0.4 (AUC = 0.988)
D = 0.6 (AUC = 0.987)
D = 0.8 (AUC = 0.986)
D = 1.0 (AUC = 0.962)
No Discrimination
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4
1 - Specificity
0.6 0.8 1
Sensitivity
D = 0.0 (AUC = 0.991)
D = 0.2 (AUC = 0.992)
D = 0.4 (AUC = 0.989)
D = 0.6 (AUC = 0.988)
D = 0.8 (AUC = 0.970)
D = 1.0 (AUC = 0.823)
No Discrimination
Page 8
BMC Bioinformatics 2008, 9:436http://www.biomedcentral.com/1471-2105/9/436
Page 8 of 18
(page number not for citation purposes)
lar. However, except for the case of MetJ, for which the sig-
nificance of the PWMs generated by both methods is of
the same order of magnitude, for CRP and NarL the matri-
ces generated by our algorithm are two orders of magni-
tude better. On the contrary, although the logo of PurR is
correct, the E-value it gets when compared with the
expected logo is several orders of magnitude worse, as a
consequence of failing to identify 5 consensus nucle-
otides. The results for our selection of eukaryotic TFs show
close similarities for the statistical significance of the
PWMs generated by both methods. However, for RAP1
sampling of interface rotamers caused an improvement of
the readout matrix that outperforms the contact matrix by
one order of magnitude, (see Table 3 for parenthesized E-
values). Something similar happened for the matrix gen-
erated for PurR, though in this case the improvement of
three orders of magnitude of the significance is still worse
than the value of the contacts PWM.
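The scoring step of the Figure 2 flowchart can be illustrated with a short sketch. It substitutes each of the four nucleotides at every position of a template site, scores each substitution with readout = D·indirect + (1-D)·direct, and collects the 4 x N scores in a matrix. The functions `toy_direct` and `toy_indirect` and their numbers are hypothetical placeholders introduced here; they are not the DNAPROT potentials, which are derived from atomic interaction matrices and DNA deformation costs.

```python
from typing import Callable, Dict, List

BASES = "ACGT"
ScoreFn = Callable[[str, int], float]  # (mutated site, mutated position) -> score


def readout_matrix(template: str, direct: ScoreFn, indirect: ScoreFn,
                   d_weight: float) -> Dict[str, List[float]]:
    """Score all 4N single-nucleotide substitutions of `template`.

    Each score is d_weight * indirect + (1 - d_weight) * direct, the linear
    combination shown in the Figure 2 flowchart.
    """
    if not 0.0 <= d_weight <= 1.0:
        raise ValueError("d_weight must lie in [0, 1]")
    matrix = {base: [0.0] * len(template) for base in BASES}
    for pos in range(len(template)):
        for base in BASES:
            mutant = template[:pos] + base + template[pos + 1:]
            matrix[base][pos] = (d_weight * indirect(mutant, pos)
                                 + (1.0 - d_weight) * direct(mutant, pos))
    return matrix


# Hypothetical scoring functions, for illustration only.
def toy_direct(site: str, pos: int) -> float:
    """Pretend the protein contacts favour the bases of TACCCA."""
    return 1.0 if site[pos] == "TACCCA"[pos] else 0.0


def toy_indirect(site: str, pos: int) -> float:
    """Pretend A/T base pairs are cheaper to deform than G/C ones."""
    return 0.5 if site[pos] in "AT" else 0.0


pwm = readout_matrix("TACCCA", toy_direct, toy_indirect, d_weight=0.4)
# Highest-scoring base at each position of the toy matrix.
consensus = "".join(max(BASES, key=lambda b: pwm[b][i]) for i in range(6))
```

With these toy functions the highest-scoring base at each position reproduces the template TACCCA. How the raw scores are normalised into PWM weights is not specified in this sketch.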
Footprinting comparative models of protein-DNA
complexes with DNAPROT
In this section we further test the ability of DNAPROT rea-
dout matrices to capture the binding specificity of TF-
structural models. Here we generated homology models
for two transcription factors: FNR, a global regulatory pro-
tein in E. coli, and Giant, a regulatory protein involved in
the early development of D. melanogaster. As in the previ-
ous section, we also computed cumulative contact PWMs
for comparison purposes, and we evaluated both struc-
ture-based matrices by aligning them to cognate matrices
built from operator sites reported in the literature. We
now present the results of both models in more detail.
The protein sequence of FNR can be confidently aligned
to the sequence of E. coli CRP (PDB id: 1cgp; alignment
coverage = 71%) with TFmodeller [39], despite an overall
sequence identity of just 22%, providing a reliable frame
for comparative modeling. However, several residue sub-
stitutions occur at the binding interface, suggesting that
Table 2: Scoring of PurR cognate binding sites using DNAPROT

                      D = 1.0               D = 0.4
Binding site          Score    p-value      Score    p-value
ACGAAACCGTTTGCGT       3.93    1.49E-04      5.68    6.42E-06
ACGAAAACGTTTGCGC       3.84    1.98E-04      5.65    7.02E-06
GCGGAAACGTTTTCCT       3.37    8.15E-04      5.48    1.16E-05
AGGAAAACGGTTGCGT       3.26    1.11E-03      5.41    1.42E-05
CGGAAAACGTTTGCGT       3.22    1.24E-03      5.40    1.46E-05
AAGAAAACGTTTGCGT       3.13    1.59E-03      5.27    2.12E-05
TTGAAATCGTTTGCAT       2.51    7.73E-03      5.06    3.81E-05
ACGCACACGTTTGCGT       3.75    2.62E-04      4.80    7.69E-05
AGGCAAACGTTTACCT       3.27    1.08E-03      4.58    1.37E-04
ACGCAAACGATTACCT       3.00    2.26E-03      4.41    2.10E-04
GCGTAACCGATTGCAT       2.91    2.87E-03      4.37    2.32E-04
TCGCAAACGTTTGCTT       2.80    3.80E-03      4.35    2.44E-04
GAGCAAACGTTTCCAC       2.26    1.37E-02      4.14    4.08E-04
TCGTTCTCTTTTGCCT       0.80    1.92E-01      1.77    4.22E-02
CGGCCAGTTTTTGCAG      -0.43    7.45E-01      0.35    2.58E-01

The known binding sites of PurR reported in RegulonDB [49] were
scored using DNAPROT, maintaining W fixed to 1 and setting D to
1.0 and 0.4. These values of D correspond to the parameterizations
that render the worst and the best PWMs for this transcription
factor, respectively. The first column corresponds to the binding sites,
the second and fourth columns report the scores obtained with
DNAPROT and the third and fifth columns show the p-values of the
scores obtained in a site recovery assay in a background set of 10^6
random sequences.
Table 3: Comparison of cumulative contact and readout position weight matrices for 4 prokaryotic (top) and 4 eukaryotic (bottom)
transcription factors

SCOP v1.73 superfamily                                   TF [PDB id]     Resolution (Å)  Robs   E-value contacts  E-value readout
Winged helix                                             CRP [1cgp]      3               0.24   7.93E-03          3.76E-05
C-terminal domain of the bipartite response regulators   NarL [1je8]     2.12            0.23   3.58E-05          7.01E-07
lambda repressor-like DNA-binding domains                PurR [2pua]     2.9             0.16   4.33E-15          5.51E-01 (7.58E-04)
ribbon-helix-helix                                       MetJ [1cma]     2.8             0.22   1.22E-01          3.30E-01
Zn2/Cys6 DNA-binding domain                              LEU3 [2er8]     2.85            0.23   4.97E-06          5.52E-05
HLH, helix-loop-helix DNA-binding domain                 PHO4 [1a0a]     2.8             0.23   3.57E-07          3.97E-07
Homeodomain-like                                         RAP1 [1ign]     2.25            0.21   5.52E-03          1.89E-02 (6.40E-04)
C2H2 and C2HC zinc fingers                               ZIF268 [1aay]   1.6             0.19   7.93E-14          1.99E-14

In the first column we include the name of the structural superfamily of each TF according to SCOP [37]. In the second column the name of each
TF is followed by the PDB identifier of the structure used, enclosed in brackets. The following two columns contain the resolution and the R-value
of the crystals, respectively. The last two columns report the E-values resulting from STAMP [38] local alignments between the structure-based
PWMs obtained with the method of Morozov and Siggia [32] (E-value contacts) or our approach (E-value readout) and the corresponding cognate PWMs,
extracted from the literature [43,49,53]. Parenthesized E-values were calculated after sampling interface rotamers with DNAPROT.
Figure 4
Stacked sequence logos for PWMs derived from cumulative contact (top) and readout position weight matrices
(bottom). Four prokaryotic (A: CRP, B: NarL, C: PurR, D: MetJ) and 4 eukaryotic (E: LEU3, F: PHO4, G: RAP1, H:
ZIF268) transcription factors were included. See Table 3 for more details.
[Sequence logos for panels A to H; vertical axis in bits.]
the readout might be changing from the template to the
model. Indeed we found that four out of ten interface res-
idues are mutated in the dimeric model, as annotated in
Table 4. For this reason we sampled different rotamers for
the side-chains of the interface residues and selected those
that yielded best readout scores using the DNAPROT algo-
rithm. As shown in Figure 5, this limited sampling was
enough to improve the readout PWM, increasing the
number of correct consensus nucleotides in the sequence
logo from six to nine. This improvement of the model is
also related to an increase of the statistical significance of
the readout matrix of one order of magnitude with respect
to the cumulative contact matrix, as shown in Table 4. A
further improvement is also possible by deriving a contact
matrix from the refined homology model, which brings
the E-value of the contact matrix down to 1.46E-04. The
multiple alignment, Figure 5D, also suggests that the
interface differences inferred from the comparative model
correlate with the sequence conservation of the DNA-
binding domain among FNR and CRP homologous
sequences.
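The refinement described above, sampling side-chain rotamers at the interface and keeping those with the best readout score, could be organised as in the following sketch. It assumes that residues can be optimised one at a time in a single greedy pass and that a higher score is better; both are simplifications introduced here. The residue names, rotamer labels and the `toy_score` function are hypothetical stand-ins for a real rotamer library and for DNAPROT scoring.

```python
from typing import Callable, Dict, List

# A model maps each interface residue to its currently chosen rotamer label.
Model = Dict[str, str]


def refine_interface(start: Model, rotamers: Dict[str, List[str]],
                     score: Callable[[Model], float]) -> Model:
    """Greedy one-pass refinement: keep a rotamer only if it raises the score."""
    best = dict(start)
    for residue, candidates in rotamers.items():
        for candidate in candidates:
            trial = dict(best)
            trial[residue] = candidate
            if score(trial) > score(best):
                best = trial
    return best


# Hypothetical scoring: pretend two specific rotamers form favourable contacts.
def toy_score(model: Model) -> float:
    favourable = {("RES_A", "r2"): 1.5, ("RES_B", "r1"): 2.0}
    return sum(favourable.get(item, 0.0) for item in model.items())


initial = {"RES_A": "r0", "RES_B": "r0"}
library = {"RES_A": ["r0", "r1", "r2"], "RES_B": ["r0", "r1", "r2"]}
refined = refine_interface(initial, library, toy_score)
```

A greedy pass ignores couplings between neighbouring side chains; a combinatorial or iterative search would be needed where such couplings matter.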
The second study of a homology model was performed
with Giant, a eukaryotic leucine zipper transcription factor
that has been studied previously [24]. This protein can
be modeled based on human CEBPB (PDB id: 1gu4), with
a sequence alignment that covers 92% of the DNA-bind-
ing domain. Although the overall sequence identity is
only 13%, six out of 12 interface residues, as labeled by
TFmodeller, are conserved. The first evaluated Giant
model yielded a PWM that captures only four correct con-
sensus positions, as seen in Figure 5. As in the case of FNR,
we refined the binding interface of Giant by sampling
rotamers and selecting those with best readout scores. The
refined model yields a structure-based PWM that includes
eight consensus nucleotides, improving on the preliminary
model by two orders of magnitude in terms of E-value
(see Table 4). This sampled model also suggests a
different set of interface residues responsible for DNA
sequence discrimination. Figure 5D shows an alignment
of Giant and CEBPB homologous sequences that seem to
cluster in two different modes of binding to DNA.
Relative contribution of atomic interactions at DNA
recognition interfaces
The DNAPROT algorithm models the direct readout
mechanism by labeling hydrogen bonds, hydrophobic
interactions and water-mediated hydrogen bonds at the
interface, which are evaluated by means of log-likelihood
preference matrices. Therefore, the data generated for the
eight TFs presented in Table 3 can be further analyzed
with the aim of calculating the relative contribution of
each of these atomic interactions to DNA recognition. As
shown in Table 5, we found four cases in which recogni-
tion is mediated exclusively by hydrogen bonds (CRP,
LEU3, PHO4 and RAP1), while the remaining TFs seem to
employ two or more types of atomic interactions to drive
DNA binding. Of these, perhaps the most interesting cases
are NarL, which has a large hydrophobic contribution,
and ZIF268, with 30% of the interface score contributed
by water-mediated hydrogen bonds. Repeating these cal-
culations on the set of non-redundant complexes, we
found that 78% of DNAPROT interface scores are contrib-
uted by hydrogen bonds, while water-mediated hydrogen
bonds account for 16% and hydrophobic interactions are
responsible for 6% of the total. Table 5B shows a more
detailed analysis of NarL and ZIF268, in which structure-
based PWMs derived from only one type of atomic inter-
actions are compared with the same cognate PWMs in
Table 3. These results show that explicitly combining
the different types of interaction, each with its relative
contribution, improves the predictive performance
of DNAPROT.
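As an illustration of the kind of bookkeeping behind Table 5A, the sketch below sums per-contact scores by interaction type and reports the share of each type in the total interface score. The contact list and its scores are invented for the example, and the function name is hypothetical; the shares are easiest to interpret when all scores are positive, as assumed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def contribution_fractions(contacts: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return the share of the summed interface score due to each interaction type."""
    totals: Dict[str, float] = defaultdict(float)
    for kind, score in contacts:
        totals[kind] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("interface score is zero; fractions are undefined")
    return {kind: value / grand_total for kind, value in totals.items()}


# Hypothetical interface: (interaction type, log-likelihood score) per contact.
toy_contacts = [
    ("h_bond", 1.2), ("h_bond", 0.8), ("h_bond", 1.0),
    ("hydrophobic", 0.5),
    ("water_h_bond", 1.5),
]
fractions = contribution_fractions(toy_contacts)
```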
Discussion
The field of computational prediction of regulatory sites
in genomic sequences has been dominated by methodol-
ogies relying on prior knowledge of experimentally char-
Table 4: Comparison of cumulative contact and readout position weight matrices derived from two comparative models for the FNR
and Giant transcription factors

SCOP v1.73 superfamily   TF [PDB id]    Interface identity  Resol. (Å)  Robs   E-value contacts       E-value readout
Winged helix             FNR [1cgp]     6/10                3           0.24   1.40E-01 [1.46E-04]    6.03E-01 (6.35E-02)
Leucine zipper domain    Giant [1gu4]   6/12                1.8         0.23   7.10E-03               3.65E-02 (4.52E-04)

In the first column we include the name of the structural superfamily of each TF according to SCOP [37]. The name of each model is followed by
the PDB identifier of the template used to guide model building, enclosed in brackets. The following three columns contain the ratio of contact
interface identity, the resolution and the R-value of the crystals, respectively. The last two columns report the E-values resulting from STAMP [38]
local alignments between the structure-based PWMs obtained with the method of Morozov and Siggia (E-value contacts) or our approach
(E-value readout) and the corresponding cognate PWMs, extracted from the literature [49,54]. Parenthesized E-values were calculated after sampling
interface rotamers, while bracketed E-values correspond to cumulative contact PWMs computed over the refined DNAPROT complex.
Figure 5
Stacked sequence logos for comparative models of FNR and Giant. A: cumulative contact logo. B: readout logo. C:
logo derived from cognate sites reported in the literature. D: multiple sequence alignment of DNA-binding interface of homologous
sequences for the modeled sequences and the templates used in modeling. Residues marked with asterisks participate in
specific DNA recognition. See Table 4 for more details.
[Sequence logos for panels A to C of each model; vertical axis in bits.]

Panel D, Giant (1gu4), alignment positions 250 to 290:
                                ....|....|....|....|....|....|....|....|....|....|
CEBPG_HUMAN                     GGGKAVAPSKQSKKSSPM----DRNSDEYRQRRERNNMAVKKSRLKSKQK
CEBPB_CHICK                     GAGGYSGPPAGKNKPKKCV---DKHSDEYKLRRERNNIAVRKSRDKAKMR
CEBPB_HUMAN                     YAGAAPAPSQVKSKAKKTV---DKHSDEYKIRRERNNIAVRKSRDKAKMR
CEBPB_HUMAN_interface (1GU4)    ................................*..*.......*......
GIANT_DROME_interface (model)   ................................*...*...*.........
GIANT_DROME                     SSSSNLANATAANSGISSG--SQVKDAAYYERRRKNNAAAKKSRDRRRIK
CES2_CAEEL                      FSSPQRSPSRKMSVPIPE----EKKDSAYFERRRKNNDAAKRSRDARRQK
DBP_HUMAN                       FSEEELKPQPIMKKARKIQVPEEQKDEKYWSRRYKNNEAAKRSRDARRLK
HLF_HUMAN                       FSEEELKPQPMIKKARKVFIPDDLKDDKYWARRRKNNMAAKRSRDARRLK
TEF_CHICK                       FTEEDLKPQPMIKKAKKVFVPDEQKDEKYWTRRKKNNVAAKRSRDARRLK

Panel D, FNR (1cgp), alignment positions 160 to 210:
                                ..|....|....|....|....|....|....|....|....|....|....|.
CLP_XANCP                       SH-PQGTQLRVSRQELARLVGCSREMAGRVLKKLQADGLLHARGKT-VVLYGTR
VFR_PSEAE                       TH-PDGMQIKITRQEIGRIVGCSREMVGRVLKSLEEQGLVHVKGKT-MVVFGTR
CRP_ECOLI                       TH-PDGMQIKITRQEIGQIVGCSRETVGRILKMLEDQNLISAHGKT-IVVYGTR
CRP_interface (1CGP)            .......................**...*.........................
FNR_interface (model)           ......................*.*..**.........................
FNR_ECOLI                       GFSPREFRLTMTRGDIGNYLGLTVETISRLLGRFQKSGMLAVKGKY-ITIENND
FNRL_RHOS4                      SNGPMTFDLPLTREEMADYLGLTLETVSRQVSALKRDGVIALEGKRHVIVTDFA
BTR_BORPE                       GYSSTEFVLRMSREEIGNYLGLTLETVSRLFSRFGREGLIRINQRE-VRLIDLP
FNRA_PSEST                      GFSANQFRLPMSRNEIGNYLGLAVETVSRVFTRFQQNGLLEAEGKE-VRILDSI
FNR_HAEIN                       GFSAREFRLTMTRGDIGNYLGLTVETISRLLGRFQKLGVISVQGKY-ITINDLN
acterized binding sites [6,40]. Only a limited group of
studies has attempted to generate structural representations
of the sequence recognition interface, using approaches
that somehow approximate the physical processes
involved in DNA recognition [9,24,25]. Some of these
studies have dissected binding interfaces at the residue
level [9,13,27,41], while others use atomic-based descrip-
tions [24,25,42]. Though some efforts employ force fields
for the calculations, most of these methodologies rely on
pre-compiled interaction preferences. Moreover, there are
also conceptually simpler approaches that rely on count-
ing contacts at the interface [16,17,32] that generally work
under the assumption that the nucleotide sequence cap-
tured in a crystallographic protein-DNA complex is the
consensus. These contact-based approaches, particularly
the cumulative-contact method of Morozov and Siggia,
promise to be successful in binding site searches at the
genomic scale if the structure of a bound TF is at hand.
Extraction of protein-DNA atomic interaction preferences
from the PDB
The DNAPROT algorithm introduced here incorporates
an atom-based representation of protein-DNA interfaces
and explicitly integrates both direct and indirect readout.
Though the thermodynamic contribution of interface des-
olvation and conformational entropy of interacting
groups have been considered in some approaches using
complex formulations [23,24], our algorithm, outlined in
Figure 2, is concerned with binding specificity and does
not account for those contributions, like other related
reports [18]. However, our methodology attempts to
approximate these phenomena by including explicit rep-
resentations of binding interface water molecules and
side-chain rotamer sampling of interacting amino acids,
which have important implications in our results.
The hydrogen bond and hydrophobic interaction matri-
ces used by DNAPROT were derived from a non-redun-
dant set of 210 complexes, naturally surpassing the
training sets used in preceding studies, as more structural
data is now available at the PDB. An important assump-
tion behind the construction of this data set is that rules
governing specific recognition of DNA by proteins are
generally due to the conformational restrictions imposed
by the double helix. Therefore, we chose to collect a com-
prehensive training set instead of using a superfamily-spe-
cific or a TF-exclusive set, in which the scantiness of
information would have weakened the derived statistical
models. In addition, we built an atomic contact matrix
that explicitly accounts for water-mediated hydrogen
bonds in protein-DNA interfaces, which constitutes a
novel contribution of this work. The data obtained in the
construction of our contact propensity matrices, such as
the hydrogen bond matrix in Table 1, constitute an update
of the pioneering work by Mandel-Gutfreund et al. [10] and
Luscombe et al. [12] as we found similar trends in the
atomic interaction propensities in our enlarged dataset. In
addition, this work presents quantitative arguments (see
Table 5: Contribution of hydrogen bonds, hydrophobic contacts and water-mediated hydrogen bonds to specific DNA recognition in
terms of interface binding score (A) and STAMP E-value of structure-based PWMs (B)

A) Fraction of interface binding score contributed by:
TF        H-bonds   Hydrophobic contacts   Water-mediated H-bonds   # atomic interactions
CRP       1.00      0.00                   0.00                     8
PurR      0.76      0.24                   0.00                     6
MetJ      0.89      0.00                   0.11                     6
NarL      0.61      0.34                   0.05                     11
LEU3      1.00      0.00                   0.00                     4
PHO4      1.00      0.00                   0.00                     7
RAP1      1.00      0.00                   0.00                     12
ZIF268    0.70      0.00                   0.30                     21

B) E-value of PWM derived from:
TF        H-bonds     Hydrophobic contacts   Water-mediated H-bonds   All atomic interactions
NarL      1.049E-02   4.837E-04              -                        7.010E-07
ZIF268    1.359E-11   -                      6.957E-05                1.998E-14

A) Transcription factors in Table 3 were further analyzed by assessing the fraction of the interface binding score, obtained after summing all
log-likelihood atomic scores, contributed by each interaction type. The number of atomic interactions found by the DNAPROT protocol in the native
PDB structure is indicated in the rightmost column. B) A further analysis of two notable examples, NarL with an important hydrophobic
contribution and ZIF268 with a remarkable contribution of water-mediated hydrogen bonds, in which structurally derived PWMs that consider
exclusively H-bonds, hydrophobic contacts or water-mediated H-bonds are compared with default DNAPROT PWMs, which integrate all three
atomic interactions.
Table 5 for details), which demonstrate the importance of
including hydrophobic interactions and water-mediated hydrogen
bonds in the list of interactions that contribute to specific
recognition of nucleotide sequences.
Assessment of the quality of the atomic interaction
matrices
Keeping in mind the problem of data bias in the PDB, we
carried out a bootstrap test to assess the reliability
of the information contained in our contact matrices.
The exclusion of almost a third of the entries in the
original training set yielded atomic interaction matrices
with only a slightly reduced ability to correctly evaluate
infrequent atomic contacts. These results confirm that the
preference matrices derived in this work are biologically
significant and that the atomic associations found are not
significantly affected by overtraining nor by redundancy,
since bootstrap matrices scored the excluded training sets
almost as well as the bootstrapped ones. In contrast,
the negative control, obtained by repeating
this analysis with shuffled matrices (two bottom
charts in Figure 1), shows completely scattered distributions
with no correlation. These outcomes support the
consistency of the interaction preferences extracted from
the training set used to build our matrices. They also cor-
roborate that the inclusion of new protein-DNA com-
plexes, regularly deposited in the PDB, will not cause
considerable variations of the interaction preferences,
except for very unusual atomic contacts.
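The assessment discussed above can be pictured as follows: derive a preference matrix from part of the complexes and check how well it scores the contacts of the complexes that were left out. The sketch below is a hold-out variant of that idea. It assumes a log-odds form with pseudocounts for the preference matrix, which may differ from the definition used for the DNAPROT matrices, and the atom-type labels and contact lists are invented toy data.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Tuple

Contact = Tuple[str, str]  # (protein atom type, DNA atom type)


def log_odds_matrix(contacts: List[Contact], pseudo: float = 1.0) -> Dict[Contact, float]:
    """log(observed / expected) for every atom-type pair, with pseudocounts."""
    pair_counts = Counter(contacts)
    prot = Counter(p for p, _ in contacts)
    dna = Counter(d for _, d in contacts)
    total = len(contacts) + pseudo * len(prot) * len(dna)
    matrix = {}
    for p in prot:
        for d in dna:
            observed = (pair_counts[(p, d)] + pseudo) / total
            expected = ((prot[p] + pseudo * len(dna)) / total) \
                * ((dna[d] + pseudo * len(prot)) / total)
            matrix[(p, d)] = math.log(observed / expected)
    return matrix


def mean_score(matrix: Dict[Contact, float], contacts: List[Contact]) -> float:
    """Average matrix score over a list of contacts; unseen pairs count as 0."""
    return sum(matrix.get(c, 0.0) for c in contacts) / len(contacts)


def holdout_scores(complexes: List[List[Contact]], fraction_out: float = 1 / 3,
                   seed: int = 0) -> Tuple[float, float]:
    """Train on the kept complexes; return mean scores of kept and held-out contacts."""
    rng = random.Random(seed)
    shuffled = complexes[:]
    rng.shuffle(shuffled)
    n_out = max(1, round(len(shuffled) * fraction_out))
    held_out, kept = shuffled[:n_out], shuffled[n_out:]
    kept_contacts = [c for cx in kept for c in cx]
    out_contacts = [c for cx in held_out for c in cx]
    matrix = log_odds_matrix(kept_contacts)
    return mean_score(matrix, kept_contacts), mean_score(matrix, out_contacts)


# Toy data: six identical hypothetical complexes with two favoured pairings.
toy_complexes = [[("ARG_NH", "G_O6")] * 3 + [("ASN_ND2", "A_N7")] * 3
                 for _ in range(6)]
kept_mean, held_out_mean = holdout_scores(toy_complexes)
```

Because the toy complexes are identical, the held-out contacts score exactly as well as the training contacts; with real data the gap between the two means would indicate how much the matrix depends on the excluded entries.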
Evaluation of direct/indirect readout contributions to
protein-DNA recognition
The rationale for using atomic potentials is that many
interface contacts are either complex or bidentate [12],
and therefore cannot be properly accounted for by resi-
due-based approaches. In those cases the contact pairing
ratio is not 1:1, as the same amino acid might be in con-
tact with two base steps simultaneously or binding a given
nitrogen base through multiple groups. These drawbacks
could be resolved by using explicit atomistic representa-
tions of the mode of binding, which have been the goal of
a group of papers in this field [18,25,32]. However, using
atomic potentials requires high-quality atomic structures
of protein complexes. In addition, atomic-detail
approaches such as DNAPROT might be more affected by
cases in which side-chains rearrange upon mutation of the
bound nucleotides [43], which is the reason that it might
be necessary to sample interface rotamers while doing
comparative modeling, as explained in the Results sec-
tion. Further sampling, including backbone movements,
might be required in certain cases, as already envisaged by
Havranek et al. [23]. In fact, previous works have already
recognized that homologous protein-DNA interfaces
change their docking geometry as their sequences diverge
[30,44].
Besides the contribution of the direct readout to sequence
recognition, we also considered in our model a mecha-
nism of indirect readout, as sequence-specific DNA defor-
mation has been identified as key to DNA recognition for
many transcription factors [33,34]. The algorithm pre-
sented in this work follows previous efforts that model
indirect readout as the cost of deforming a DNA duplex
[13,14,24]. Both readout mechanisms, shown to be criti-
cal for specific recognition in Figure 3, are linearly inte-
grated into a single binding potential. The weights of both
direct and indirect readout can be tuned for different tran-
scription factors according to their docking mode, as the
examples depicted in Figure 3 imply. This observation
suggests that the performance of the DNAPROT protocol
can be improved using previous biological knowledge
and this is certainly a desirable property. These data also
suggest that each TF might have its own balance
between direct and indirect readout, although we cannot
exclude the possibility that this value is a property of its
structural superfamily. Moreover, data in Table 2 gives fur-
ther insight into the impact that correct weighting could
have on genomic scale TF-binding-site prediction assays,
as the number of false positives might be considerably
reduced in order to obtain a more reliable set of predic-
tions.
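To illustrate how the weight D might be tuned, the following sketch computes the area under the ROC curve for several values of D from separate direct and indirect scores of cognate and random sites. The AUC is obtained from its rank interpretation, that is, the probability that a cognate site outscores a random one, with ties counted as one half. The score lists are invented and the function names are hypothetical; they are not part of DNAPROT.

```python
from typing import Dict, List, Sequence


def roc_auc(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def combine(direct: Sequence[float], indirect: Sequence[float], d: float) -> List[float]:
    """readout = d * indirect + (1 - d) * direct, applied site by site."""
    return [d * i + (1.0 - d) * s for s, i in zip(direct, indirect)]


def auc_by_weight(direct_pos: Sequence[float], indirect_pos: Sequence[float],
                  direct_neg: Sequence[float], indirect_neg: Sequence[float],
                  weights: Sequence[float]) -> Dict[float, float]:
    """Compute the AUC of cognate-site recovery for each weight D."""
    return {d: roc_auc(combine(direct_pos, indirect_pos, d),
                       combine(direct_neg, indirect_neg, d))
            for d in weights}


# Invented scores: three cognate sites and four random sequences.
aucs = auc_by_weight(
    direct_pos=[2.0, 1.5, 1.0], indirect_pos=[1.0, 0.2, 0.8],
    direct_neg=[0.5, 1.2, 0.1, 0.3], indirect_neg=[0.9, 0.1, 0.3, 0.2],
    weights=[0.0, 0.5, 1.0],
)
best_d = max(aucs, key=aucs.get)
```

With these toy numbers an intermediate weight separates the two classes better than either readout term alone. This mirrors the qualitative behaviour discussed for Figure 3 but carries no quantitative meaning.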
Assaying the predictive potential of DNAPROT in
crystallographic structures and homology models
The predictive power of DNAPROT, evaluated in Tables 3
and 4 and Figures 4 and 5, suggests that our readout
PWMs are, with one exception, comparable to or better
than those generated by a well established reference meth-
odology, the cumulative contact PWM proposed by Moro-
zov and Siggia [32] and related to previous reports
[16,17]. It is worth reminding the reader that, unlike
the reference method, DNAPROT does not assume that
the nucleotide sequence in the input PDB complex is the
consensus; rather, it performs an in silico mutagenesis
assay and evaluates 4N sequences using a readout func-
tion. As already mentioned, the conceptually simpler
cumulative contact approach does not consider the contri-
bution of indirect readout. This might explain the better
results obtained with our method for CRP and NarL
shown in Table 3, two TFs known to have an important
contribution of indirect readout [33,34]. In contrast, the
example of PurR shows that the DNAPROT formulation
of hydrogen bonds and hydrophobic interactions does
not fully capture the array of atomic contacts at the inter-
face, as the obtained readout sequence logo misses 4 con-
sensus positions that might correspond to a different type
of interaction. Despite this fact, the predicted PWM cor-
rectly identifies the remaining consensus positions except
one, and the statistical significances obtained in the site
recovery assay of the cognate binding sites of this TF
depicted in Table 2 support the predictive competence of the
matrix built.
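The p-values of Table 2 come from scoring cognate sites against a background of random sequences. A minimal sketch of such an empirical p-value is given below, assuming uniformly random background sequences and an additive PWM score; the toy matrix, the default of 10,000 background sequences (the assay in Table 2 used 10^6) and the add-one correction are choices made for this example only.

```python
import random
from typing import Dict, List

BASES = "ACGT"


def pwm_score(site: str, pwm: Dict[str, List[float]]) -> float:
    """Sum the per-position weights of a site."""
    return sum(pwm[base][pos] for pos, base in enumerate(site))


def empirical_p_value(site: str, pwm: Dict[str, List[float]],
                      n_random: int = 10_000, seed: int = 0) -> float:
    """Fraction of random sequences scoring at least as well as `site`."""
    rng = random.Random(seed)
    target = pwm_score(site, pwm)
    hits = 0
    for _ in range(n_random):
        background = "".join(rng.choice(BASES) for _ in range(len(site)))
        if pwm_score(background, pwm) >= target:
            hits += 1
    return (hits + 1) / (n_random + 1)  # add-one correction avoids p = 0


# Toy PWM that rewards the bases of TACCCA with weight 1 and all others with 0.
toy_pwm = {b: [1.0 if b == c else 0.0 for c in "TACCCA"] for b in BASES}
p_consensus = empirical_p_value("TACCCA", toy_pwm)
p_poor = empirical_p_value("GGGGGG", toy_pwm)
```

The resolution of an empirical p-value is limited by the size of the background set, so very small p-values require correspondingly large numbers of random sequences.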
Notwithstanding the promising results obtained for crys-
tallographic structures, the most valuable application of
the methodology presented in this paper is found in the
exercise of comparative modeling. Two examples were
modeled here, a well-characterized prokaryotic regulator
(FNR) and a eukaryotic factor previously studied in a
related article [24]. Both examples demonstrate that
rotamer sampling at the interface is necessary for obtain-
ing optimal results, and we found that the readout formu-
lation presented here can be effective in selecting the best
rotamers. As shown in Table 4 and Figure 5, our method-
ology yielded better results than those obtained with the
cumulative contact method for two homology models.
The reason for this could be that comparative models usu-
ally contain errors in the assigned position of protein side-
chains and, most importantly, they do not necessarily
contain the consensus DNA sequence. A combination of
in silico saturating DNA mutagenesis and interface side-
chain sampling allows the DNAPROT algorithm to par-
tially overcome these problems. Not only is sampling pos-
itive in modeling tasks, but we also found that the
crystallographic structures of PurR and RAP1 yielded bet-
ter readout PWMs after resampling their interface side-
chain rotamers.
Although the quality of the structures used by our meth-
odology is of primary relevance, we could not find a clear
correlation between R-factor, resolution or experimental
techniques for entries in the PDB and the outcomes of our
procedure. Nevertheless, we often encountered protein-
DNA complexes where only a few hydrogen bonds or
hydrophobic interactions can be identified using standard
bond geometries, yielding only partial DNA motifs. In
such cases we found that the cumulative contact method
seems to be less sensitive to the quality of structural data.
Further work needs to be done to explore whether side-
chain sampling, and even backbone sampling, can help
circumvent the limitations that data quality imposes on
the performance of readout PWMs.
Evaluation of the contribution of different interaction
types to the recognition process
The atomistic foundation of our methodology also gives
us the possibility of exploring the relative contribution of
each interaction type to the DNA recognition process. The
analysis of a non-redundant set of more than 200 complexes
shows that water-mediated hydrogen bonds are the
second largest source of specific interactions at contact interfaces,
which raises a warning regarding the usual exclusion of
these interactions in structural studies trying to model
protein-DNA interaction interfaces. The exclusion of
water overlooks a wide group of highly informative con-
tacts, mainly long-distance and unfavorable hydrogen
bonds [21] that constitute novel pairings absent from the
hydrogen bond preference matrix. The example of ZIF268
analyzed in Table 5B, shows that an important part of the
information content of the PWM obtained for this TF cor-
responds to water-mediated interactions, and the inclu-
sion of this information considerably increases the
statistical significance of the readout matrix. Something
similar occurs for hydrophobic interactions: despite being
relatively infrequent, their contribution to
binding is sometimes central. This is the case for the matrix
obtained for NarL, in which hydrophobic links account
for most of the information content of the structurally-
derived PWM, despite being the least frequent interface
atomic interaction. In this last example, without a reliable
inclusion of those infrequent interactions, our model for
this TF would have been worthless.
Conclusion
In summary, our results suggest that the DNAPROT algorithm,
together with the set of atomic interaction matrices
obtained in this work, has useful applications. The
matrices contain biologically meaningful information
that confirms and enriches previous reports at the atomic
level of interaction. In addition to the uses presented in
this paper, these matrices could be used to estimate
specificity of binding [45] or as a guide for crystallographic
studies of protein-DNA complexes. With respect to the
algorithm, previous work by Morozov and Siggia [32]
demonstrated that cumulative contact PWMs could be
used as informative priors in the task of scanning genomic
sequences. In this work we found that our algorithm out-
performs the aforementioned method for homology
models of TFs and displays a comparable performance
with crystallographic structures. This result supports the relevance
of the statistical models that can be generated with
DNAPROT, and preliminary tests scanning genome-sized
sequence sets with the model built for PurR confirm
this expectation. Overall, our study adds new insights to
the challenging problem of estimating protein-DNA bind-
ing specificities from structural complexes alone.
Methods
Construction of frequency and weight matrices for
protein-DNA atomic interactions
A set of 210 protein-DNA complexes was extracted from
the Protein Data Bank [31] (release February 29th 2008)
and was used to compute the atomic preferences of inter-
action between proteins and DNA that drive specific rec-
ognition. All these complexes were solved by X-ray
crystallography to a resolution ≤ 3 Å. We started from the
weekly updated clusters of sequence similarity disclosed
as part of the PDB derived data, rejecting entries with pro-
tein chains shorter than 30 amino acids or DNA chains
shorter than 10 bases. Entries having fewer than four Cα atoms
within 12 Å from atoms N1/N9 of nitrogen bases were
also excluded. Whenever multiple entries of the same pro-
tein were found, the one with the best resolution was
taken.
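The selection rules described above can be illustrated with a minimal sketch. The `Entry` record, its field names and the identifiers used in the usage note are hypothetical and are not part of the original pipeline; we also assume that the chain-length criteria apply to the shortest chain of each type in an entry. The Cα proximity criterion is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pdb_id: str                 # hypothetical identifier
    protein: str                # label grouping entries of the same protein
    resolution: float           # X-ray resolution in angstroms
    min_protein_chain_len: int  # shortest protein chain (amino acids)
    min_dna_chain_len: int      # shortest DNA chain (bases)


def select_entries(entries, max_resolution=3.0, min_protein=30, min_dna=10):
    """Apply the resolution and chain-length filters, then keep the
    best-resolved entry for each protein."""
    best = {}
    for e in entries:
        if e.resolution > max_resolution:
            continue
        if e.min_protein_chain_len < min_protein:
            continue
        if e.min_dna_chain_len < min_dna:
            continue
        current = best.get(e.protein)
        # Lower resolution value means better-resolved structure.
        if current is None or e.resolution < current.resolution:
            best[e.protein] = e
    return sorted(e.pdb_id for e in best.values())
```

For instance, given two entries of the same protein solved at 2.8 Å and 2.1 Å, only the 2.1 Å entry would be retained.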
To circumvent redundancy in the estimation of atomic
contact preferences we considered only protein chains
from complexes sharing less than 50% sequence identity
with any other protein chain in the dataset. This cut-off
approximately marks the point at which the geometry of
the contact interface of protein-DNA complexes with similar
amino acid sequences starts to diverge [30]. A second filter
was used to remove complexes with more than 70%
atomic contacts identical to other complexes from the
same SCOP [37] superfamily. This is necessary because
members with a sequence identity close to
the 50% cutoff can still have very similar
DNA interfaces – i.e., with similar structural geometries
and involving the same atoms. As the number of atomic
interactions in protein-DNA complexes tends to be small,
this second filter effectively removes identical or nearly
identical interfaces within superfamilies.
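As an illustration of the second filter, the following sketch removes interfaces that share more than 70% of their atomic contacts with an already accepted complex of the same superfamily. The way overlap is measured here (fraction of the candidate's contacts found in the accepted complex), the greedy order of acceptance and all names are assumptions made for this example; the text does not specify these details.

```python
def contact_overlap(candidate, reference):
    """Fraction of the candidate's contacts that also occur in the reference."""
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)


def filter_redundant_interfaces(entries, max_overlap=0.70):
    """entries: list of (entry_id, superfamily, contacts), where contacts is a
    set of (protein_atom, base_atom) pairs. Entries are visited in the given
    order, so earlier entries take priority over later ones."""
    kept = []
    for entry_id, superfamily, contacts in entries:
        redundant = any(
            kept_sf == superfamily
            and contact_overlap(contacts, kept_contacts) > max_overlap
            for _, kept_sf, kept_contacts in kept
        )
        if not redundant:
            kept.append((entry_id, superfamily, contacts))
    return [entry_id for entry_id, _, _ in kept]
```

Only complexes within the same superfamily are compared, so identical contact sets in unrelated folds are both retained.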
Hydrogen bonds, water-mediated hydrogen bonds and
hydrophobic contacts were calculated using a modified
version of the program HBPLUS [46]. We set the input
parameters of the program to the default distance and
angle restrictions for the estimation of hydrogen bonds
(H-A distance < 2.7 Å; D-A distance < 3.35 Å; D-H-A angle >
90° and H-A-AA angle > 90°, where AA is the atom attached
to the hydrogen acceptor atom). For hydrophobic contacts
we considered atoms CB, CG, CG1, CG2, CD1, CD2,
CE and CZ in proteins and all carbon atoms of nitrogen
bases, including only contact distances in the range of
3.0–3.9 Å in order to exclude sterically impossible pair-
ings. This resulted in three atomic frequency matrices such
as the one reported in Table 1 (refer to Additional file 1 for
the matrices generated for hydrophobic and water-medi-
ated hydrogen bonds and a complete list of the PDB
entries used to construct them).
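The distance criterion for hydrophobic contacts can be sketched as follows. This is not the modified HBPLUS code; the function name, the coordinate representation and the rule used to recognize base carbons by atom name (a name starting with "C" and carrying no sugar prime or asterisk) are assumptions made for illustration.

```python
import math

# Protein carbon atoms considered for hydrophobic contacts in the text.
HYDROPHOBIC_PROTEIN_ATOMS = {"CB", "CG", "CG1", "CG2", "CD1", "CD2", "CE", "CZ"}


def is_base_carbon(atom_name):
    """Assumed naming rule: base carbons start with 'C' and carry no
    prime or asterisk, which would mark a sugar atom."""
    return atom_name.startswith("C") and "'" not in atom_name and "*" not in atom_name


def is_hydrophobic_contact(protein_atom, protein_xyz, base_atom, base_xyz,
                           d_min=3.0, d_max=3.9):
    """Return True if the atom pair satisfies the 3.0-3.9 angstrom window."""
    if protein_atom not in HYDROPHOBIC_PROTEIN_ATOMS:
        return False
    if not is_base_carbon(base_atom):
        return False
    distance = math.dist(protein_xyz, base_xyz)
    return d_min <= distance <= d_max
```

The lower bound discards pairs that are too close to be sterically plausible, while the upper bound limits the count to atoms in van der Waals proximity.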
The atomic contact specific frequency matrices were con-
verted to weight matrices by using a modified version of
the expression described by Hertz and Stormo [5]:
where Wij is the log-likelihood interaction probability of
atom i from the amino acid and atom j from the nitrogen
base, nij is the number of specific contacts observed for
atom i of the amino acid and j of nitrogen base. In addition,
pab is the number of expected contacts for amino acid a
and base b given their corresponding abundance in the
training dataset, normalized by the number of donor/acceptor
atoms in both the amino acid and the nitrogen base,
and Na is the total number of contacts observed for amino
acid a with all four nitrogen bases. Matrices of this
kind assume complete independence for the statistical
contribution of each position to the final score of the
sequence, which corresponds to the simplest and most
popular model of PWMs [5,6]. Using this model the partial
contributions of all interacting atoms may be summed
up to approximate the binding energy of the site. By taking
into consideration hydrophobic and hydrogen bond
contacts, we can infer the direct readout interaction poten-
tial of a given DNA sequence threaded into a crystallo-
graphic complex. Note that there are other possible ways
to account for atomic contacts [42].
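The conversion from frequencies to weights, and the additive scoring of an interface, can be sketched as follows. We assume here that the expression takes the Hertz and Stormo form W_ij = ln[(n_ij + p_ab) / ((N_a + 1) p_ab)]; the function names, the dictionary layout and the choice of scoring unlisted contacts as zero are assumptions of this example.

```python
import math


def log_odds_weight(n_ij, p_ab, n_a):
    """Weight for one atom pair, assuming
    W_ij = ln[(n_ij + p_ab) / ((N_a + 1) * p_ab)].

    n_ij: observed contacts between protein atom i and base atom j
    p_ab: expected contacts for amino acid a and base b (must be > 0)
    n_a:  total contacts of amino acid a with all four bases
    """
    return math.log((n_ij + p_ab) / ((n_a + 1) * p_ab))


def interface_score(contacts, weights):
    """Sum the weights of all contacts, treating them as independent.
    Pairs absent from the weight table contribute 0.0 (an assumption)."""
    return sum(weights.get(pair, 0.0) for pair in contacts)
```

With no observations at all (n_ij = 0 and N_a = 0) the weight is zero, so an unobserved amino acid neither favors nor penalizes any base.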
Bootstrap analysis of the training set used in the
construction of the atomic interaction matrices
Starting from the 210 PDB entries in the initial set, we randomly
sampled 1000 partial training sets, each excluding 30%
of the entries. Accordingly, a thousand "bootstrap" matrices
for hydrogen bonds, water-mediated hydrogen bonds
and hydrophobic contacts were generated, which were
contrasted with the three "generic" matrices – i.e., those built with
all entries. These bootstrap matrices were used to evaluate
the binding interface of all the PDB entries in the partial
training set as well as the entries of the excluded dataset –
i.e., those that were randomly excluded in the sampling
process. Interface scores were normalized by the number
of protein-DNA contacts in the entry. In order to follow
the progress of the experiment a population mean value
was computed for each bootstrapped and excluded data-
set.
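The sampling scheme can be sketched as below. We assume that each partial training set is drawn without replacement, so that every entry falls either in the training set or in the excluded set; the function names and the fixed random seed are illustrative only.

```python
import random


def bootstrap_splits(entries, n_samples=1000, excluded_fraction=0.30, seed=0):
    """Yield (training, excluded) lists, one pair per sampled partial set."""
    rng = random.Random(seed)
    n_excluded = round(len(entries) * excluded_fraction)
    for _ in range(n_samples):
        excluded = set(rng.sample(entries, n_excluded))
        training = [e for e in entries if e not in excluded]
        yield training, sorted(excluded)


def normalized_score(total_score, n_contacts):
    """Interface score divided by the number of protein-DNA contacts."""
    return total_score / n_contacts if n_contacts else 0.0


def population_mean(scores):
    """Mean normalized score over one training or excluded dataset."""
    return sum(scores) / len(scores) if scores else 0.0
```

For 210 entries each split holds 147 training entries and 63 excluded entries, and the two population means can be compared across the sampled sets.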
As a negative control, we used randomly shuffled matrices
constructed from the generic matrices, in which the value
corresponding to the interaction probability of a particu-
lar pair of atoms is randomly interchanged with any other
from the matrix. The final shuffled matrices retained the
general characteristics of the generic matrices – e.g., the
information content and the maximum possible score for
the consensus sequence – but lost all the information
about the amino acid-nitrogen base atomic interaction
preferences extracted from the original dataset.
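A minimal sketch of such a negative control is given below. It permutes the values of a rectangular matrix among all its cells, so the set of values is preserved while their assignment to atom pairs is randomized. The function name, the list-of-rows layout and the seed are assumptions of this example.

```python
import random


def shuffle_matrix(matrix, seed=0):
    """Return a copy of `matrix` (a list of equal-length rows) whose values
    have been randomly reassigned to cells."""
    rng = random.Random(seed)
    values = [value for row in matrix for value in row]
    rng.shuffle(values)
    n_cols = len(matrix[0])
    # Rebuild rows of the original width from the shuffled values.
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
```

Because only the positions change, the largest weight in the shuffled matrix equals the largest weight in the generic matrix.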
The DNAPROT algorithm for sequence threading and
estimation of interaction potential
DNAPROT is a computer program designed to thread
DNA sequences into the structure of a given protein-DNA
complex with the purpose of estimating their interaction
potential. DNAPROT is written in C++ and Perl and
makes use of 3DNA [47] and a modified version of
HBPLUS [46]. Threading is performed by substituting the
nitrogen bases of nucleotides found in the crystallo-
graphic complex – which can be tentatively termed as
"native" – by those in the input sequence, maintaining the
\[
W_{ij} = \ln\left(\frac{n_{ij} + p_{ab}}{(N_a + 1)\,p_{ab}}\right)
\]
Vladimir Espinosa Angarica |