Structural features that predict real-value fluctuations of globular proteins.
ABSTRACT It is crucial to consider dynamics for understanding the biological function of proteins. We used a large number of molecular dynamics (MD) trajectories of nonhomologous proteins as references and examined static structural features of proteins that are most relevant to fluctuations. We examined the correlation of individual structural features with fluctuations and further investigated effective combinations of features for predicting the real value of residue fluctuations using support vector regression (SVR). It was found that some structural features have higher correlation than crystallographic B-factors with fluctuations observed in MD trajectories. Moreover, SVR that uses combinations of static structural features showed accurate prediction of fluctuations, with an average Pearson's correlation coefficient of 0.669 and a root mean square error of 1.04 Å. This correlation coefficient is higher than the one observed in predictions by the Gaussian network model (GNM). An advantage of the developed method over GNM is that the former predicts the real value of fluctuation. The results help improve our understanding of relationships between protein structure and fluctuation. Furthermore, the developed method provides a convenient, practical way to predict fluctuations of proteins using easily computed static structural features of proteins.
Michal Jamroz,1,2 Andrzej Kolinski,1 and Daisuke Kihara2,3,4*
1 Laboratory of Theory of Biopolymers, Faculty of Chemistry, University of Warsaw, Pasteura 1, 02-093 Warszawa, Poland
2 Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907
3 Department of Computer Science, College of Science, Purdue University, West Lafayette, Indiana 47907
4 Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, Indiana 47907
Thanks to worldwide efforts in structural genomics,1–3 we now know
over 75,000 protein tertiary structures.4 This number is only a small
fraction of the number of known protein sequences. Computational
methods can predict structures for more than half of newly
sequenced proteins with sufficiently high accuracy by means of
template-based modeling.5–8 For some of the remaining proteins, it is
possible to predict their structures de novo if they are small and
structurally simple.9–14 Thus, the problem of protein structure prediction
is gradually being solved in practice, and it may be completely solved in
the near future. For the most difficult (and ‘‘atypical’’) cases of
monomeric structures, and to a much larger extent for the plethora of
possible protein–protein (protein–nucleic acid, protein–carbohydrate, etc.)
complexes, structure prediction will nevertheless remain a challenging task
for decades.9,15–17 The knowledge of protein tertiary structures facilitates fast
developments in various branches of molecular medicine and
biotechnology.18,19 It is, however, becoming more and more obvious that to
understand the underlying molecular mechanisms of life, we need to see
biomolecules ‘‘in action.’’
Protein dynamics, resulting from a specific flexibility of their structures,
has drawn much attention recently in both theoretical and experimental
molecular biology. Studies of the dynamics of protein structures and their
assemblies are important for understanding the mechanisms of protein
function in various cellular processes,20,21 in particular, ligand binding,
enzymatic reactions,22 conformational diseases,23 and protein–protein
interaction.24 The understanding of protein flexibility is also important
for practical applications such as the development of computer-aided
methods of enzyme design25,26 and drug development.27
In X-ray protein crystallography, which determines the Cartesian
coordinates of atoms in proteins, uncertainties/fluctuations of atomic positions
are provided in the form of B-factors.28 The B-factor measures the
mobility of atoms, but it also reflects some inherent aspects of crystallographic
Grant sponsor: EU European Regional Development Fund (Foundation for Polish Science MPD Pro-
gramme); Grant sponsor: National Science Foundation; Grant number: IIS0915801; Grant sponsor:
National Institutes of Health; Grant numbers: R01GM075004, R01GM097528; Grant sponsor: National
Science Foundation; Grant numbers: DMS0800568, EF0850009
*Correspondence to: Daisuke Kihara, Department of Biological Sciences, College of Science, Purdue
University, West Lafayette, IN 47907. E-mail: firstname.lastname@example.org.
Received 2 December 2011; Revised 3 January 2012; Accepted 11 January 2012
Published online 27 January 2012 in Wiley Online Library (wileyonlinelibrary.com).
Proteins 2012; 80:1425–1435.
© 2012 Wiley Periodicals, Inc.
Key words: protein flexibility; protein dynamics; support vector regression; molecular dynamics; fluctuation prediction.
techniques. Moreover, fluctuations estimated by B-factors
are influenced by the molecular environment of the crystal
structure. Protein mobility in solution could differ
qualitatively from that in a crystal. Eastman et al.29
showed that B-factors are an accurate measure of fluctuations
for stable parts of proteins, but significantly underestimate
motion in flexible regions. Somewhat more
straightforward measures of structure fluctuations can
be derived from nuclear magnetic resonance (NMR)
experiments, although the resulting estimates can be flawed
by various limitations of actual measurements and by the
computational schemes of their interpretation.30–33
Therefore, these methods do not fully reflect the actual
fluctuations of proteins.
Molecular dynamics (MD) is the most straightforward
method for theoretical studies of dynamic aspects of
molecular systems. Because of the progress in computing
technology, it is now practical to simulate protein
systems on a timescale of tens of nanoseconds. Nevertheless,
such simulations remain costly. At a significantly
lower computational cost, the
motion of a protein can be approximated by the normal
mode analysis of a harmonic model of proteins.34
Another possibility is to use simulations with coarse-grained
representations of protein structures. A simple
approach is the Gaussian Network Model (GNM) and
its derivatives.35–38 Long-time simulation at an intermediate
resolution can be achieved using simplified
protein models such as UNRES39 and CABS.40 These
models enable a low-resolution study of dynamics (or
stochastic dynamics) on timescales a few orders of
magnitude longer than is possible with all-atom MD.41–44
A weak point of studying dynamics using coarse-grained
models is a lack of straightforward scaling between the
models’ time and the real time. Thus, all-atom MD
simulations should always be used as a reference for
A number of computational methods for predicting
protein fluctuations have been developed, but
almost all of them evaluated their prediction results
mainly in comparison with the crystallographic B-factor
of proteins. As discussed earlier, the B-factor does not
fully capture the mobility of proteins in solution. As we
show in this work, the fluctuations observed in MD and
the B-factor correlate rather poorly, as was also
concluded in a previous work.29
There is a series of works that use GNM or its variants
for predicting B-factors of proteins.35,38,45,46
Micheletti et al.47 extended GNM by adding Cβ atoms
(βGM). The fluctuations of residues predicted by βGM
were compared to the fluctuations from the MD simulation
of HIV-1 protease. The self-consistent pair contact
probability method, which is similar in spirit to
GNM, was used to predict fluctuations and compared
with B-factors.48 Zhou and coworkers49 developed an
all-atom mean-field model to predict fluctuations.
Structural features of proteins that can indicate fluctuations
represented by B-factors have also been investigated.
These features include solvent accessibility of residues,50
distance from a residue to the center of mass of the
protein,51 eigenvectors of the square distance matrix,52 and
predicted local fragment structures.53 An alternative
direction pursued was to predict B-factors from protein
sequences. Machine-learning methods, such as the Support
Vector Machine,54,55 the random forest algorithm,56 or
an artificial neural network,57 were used to predict
fluctuations using sequence information and structural
features that can be predicted from sequences, such as the
secondary structure and the accessible surface area of residues.
In this work, we used support vector regression (SVR)
to investigate the relationship between protein structure
and dynamics. We used various structural characteristics
as well as structure fluctuation profiles predicted by
GNM as input for SVR. The target reference is the
dynamics observed in long MD simulations for a
representative set of 592 globular proteins. To the best of our
knowledge, this is the first time that protein fluctuations
have been investigated on such a large dataset of MD
simulations. In this context, we also analyzed differences
between protein dynamics deduced from the B-factors and the
in-solvent dynamics computed by MD simulations. A
more practical purpose of this work is to provide a fast
(essentially instantaneous in comparison with MD) and
reliable method that can be used for predicting
fluctuations of protein structures. Unlike the existing works
mentioned earlier, we predict the real value of residue
fluctuations rather than simply showing correlation between
predicted and actual fluctuation values. Remarkably, our
method predicts fluctuations accurately, with an
average error of less than 1.1 Å. The correlation coefficient
of our prediction with the actual fluctuations observed in
MD simulations is higher than that of GNM. We also
found that some of the static structural features, such as
residue contact number, have higher correlation with the
residue fluctuation in MD simulation than B-factors do.
The developed software for predicting fluctuation, named
flexPred, has been made freely available for the academic community.
MATERIALS AND METHODS
Dataset of molecular dynamics trajectories
The molecular dynamics (MD) trajectories of proteins
were obtained from the MoDEL database (Molecular Dynamics
Extended Library).58 Of 1897 entries in the database, the
following entries were discarded: trajectories for protein
structures solved by NMR, those which include more
than one protein chain in the simulation, and trajectories
for proteins whose length differ from the corresponding
entries in the Protein Data Bank (PDB).4These MD
M. Jamroz et al.
trajectories were computed using AMBER,59
GROMACS,60 or NAMD61 force fields. If more than one
simulation is available for a protein, we used the first
one with an earlier entry date in the database. The
MoDEL trajectory files were uncompressed with the
PCASuite software.62 Eight hundred and thirty-seven
trajectories remained after this filtering process. From this
subset, we removed redundant proteins using the PISCES
server63 with a sequence identity cutoff of 35%. The
final number of trajectories is 592. This dataset contains
proteins from all main classes in the CATH database64:
111 proteins in the α class (18.75%), 149 proteins in the
β class (25.17%), 256 in the αβ class (43.24%), and 76
in the few secondary structure class (12.84%). The length
of the proteins ranges from 21 to 994 residues (Fig. 1).
The simulation time was 10 ns for most of the proteins
(96.11%), while the rest of the proteins had shorter
trajectories: 5 (0.33%), 2 (2.36%), and 1 ns (0.5%), and
one protein each with 6.5, 6.0, 5.5, and 4.5 ns.
Definition of fluctuation
The fluctuation of amino acid residue i is defined in
two ways. It can be defined as the root mean square
deviation (RMSD) from the mean position of an atom in an MD
trajectory:
$$\Delta R_i = \sqrt{\frac{1}{T}\sum_{j=1}^{T}\left(x_i(t_j) - \langle x_i \rangle\right)^2} \qquad (1)$$
where $x_i(t_j)$ is the Cartesian coordinates of the Cα atom
of residue i at time $t_j$ in the trajectory, T is the number
of time frames in the trajectory, and $\langle x_i \rangle$ is the average
position of the Cα atom of residue i in the trajectory.
We also used the coordinates in the PDB file as the reference:
$$\Delta R_i = \sqrt{\frac{1}{T}\sum_{j=1}^{T}\left(x_i(t_j) - x_i^{\mathrm{ref}}\right)^2} \qquad (2)$$
where $x_i^{\mathrm{ref}}$ is the coordinates of the Cα atom of residue
i in the PDB file. The distance of residue positions is
computed after superimposing the PDB structure on
each frame. If alternative positions of the atom are
recorded in the PDB files, the first position of the atom
was used. As shown in Figure 2, these two definitions
give similar, but not identical, fluctuations of residues.
The correlation coefficient of the two fluctuation values
is 0.86. The fluctuation value is smaller when the mean
of a trajectory is used as the reference [Eq. (1)] in almost
all the cases (99.9%). Unless noted, we use the second
definition of fluctuation [Eq. (2)] in the results that will
be shown below, because we compare the fluctuations
from MD with B-factors and GNM, both of which are
attributed to PDB structures.
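The two definitions of fluctuation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frames are assumed to be already superimposed, the function name `rmsf` and the toy coordinates are hypothetical, and nothing here is taken from the flexPred source code.

```python
import math

def rmsf(frames, reference=None):
    """Per-residue fluctuation from superimposed C-alpha frames.

    frames: list of frames; each frame is a list of (x, y, z) tuples.
    reference: one frame used as the reference position [Eq. (2)];
    if None, the trajectory mean position is used instead [Eq. (1)].
    """
    n_frames = len(frames)
    n_res = len(frames[0])
    if reference is None:
        reference = [
            tuple(sum(f[i][k] for f in frames) / n_frames for k in range(3))
            for i in range(n_res)
        ]
    result = []
    for i in range(n_res):
        # Sum of squared displacements of residue i over all frames.
        sq = sum(
            sum((f[i][k] - reference[i][k]) ** 2 for k in range(3))
            for f in frames
        )
        result.append(math.sqrt(sq / n_frames))
    return result

# Toy trajectory (made-up numbers): residue 0 is fixed, residue 1 moves along x.
frames = [
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
]
pdb_like = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]  # hypothetical reference structure

fluct_mean = rmsf(frames)           # relative to the trajectory mean, Eq. (1)
fluct_ref = rmsf(frames, pdb_like)  # relative to the reference, Eq. (2)
```

In this toy case the fluctuation relative to the mean is never larger than the one relative to the external reference, which is the behavior reported for almost all residues in the dataset.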
Structural features of proteins
We considered the following static protein structural features:
1. B-factor (temperature factor).28 The B-factor reflects
dynamic motion, the static disorder of the atom in
the crystal structure, and also errors in model building.
The B-factor values are taken from the PDB file.
2. Square of the distance between a residue and the pro-
tein center of mass, which is defined as follows:
Figure 1. Histogram of the length of proteins in the dataset. There are in total 592 proteins.
Figure 2. Average fluctuations of proteins in MD trajectories using two
definitions. x values show fluctuations of residues relative to the crystal
structures of proteins in the PDB [Eq. (2)], while y values are
fluctuations relative to the mean structure of each MD trajectory [Eq. (1)].
Predicting Protein Fluctuation
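$$D_i^2 = \left| x_i - x_{\mathrm{cm}} \right|^2 \qquad (3)$$
where $x_i$ is the position of the Cα atom of residue i and $x_{\mathrm{cm}}$ is the
protein center of mass. A previous work showed that this parameter has good
correlation with the B-factor.51,52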
3. Residue contact number, which is defined as the number
of surrounding residues whose Cα atom is closer
than a cutoff distance. The contact number was also
shown to correlate well with the B-factor.65,66
4. Number of hydrophobic/hydrophilic residue contacts,
where the number of residue contacts is separately
counted for hydrophobic and hydrophilic residues.
Hydrophobic/hydrophilic residues are those which
have a positive/negative value on the Kyte–Doolittle
scale.
5. Solvent accessible surface area (Å²). This parameter
is defined as the water-exposed surface of a residue. We
used the DSSP program68 to compute the accessible
surface area of amino acids, which is then normalized
by the value in the tripeptide with glycines on
both sides of the target amino acid residue.69
6. Residue depth, which is defined as the distance of the
Cα atom or the average distance of all the atoms in a
residue to the closest water molecule.70 Protein
surface was computed with the MSMS program.71 The
hsexpo program was used to compute residue depth.72
7. Lower/upper half-sphere exposure of a residue,72 which
is defined as the number of contacts within a
half-sphere of a radius of 13 Å centered at either the Cα or
the Cβ atom of the residue. The sphere is divided into
halves by a plane perpendicular to the Cα–Cβ vector.
8. Secondary structure. Each residue is classified into one of
eight classes, that is, seven secondary structure types
defined by DSSP68 or other.
9. Fluctuations predicted by the GNM.35,36 GNM is a
coarse-grained model, where Cα atoms are connected
by springs. GNM has been used for investigating
protein dynamics including the prediction of B-factor
values of proteins.38 We downloaded GNM codes from
the Jernigan laboratory
te.edu/). Fluctuations were computed with a residue
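contact distance cutoff of 16 Å (ref. 73) and without using a
cutoff.38 Residue contacts in a protein are represented
as the Kirchhoff matrix in GNM:
$$\Gamma_{ij} = \begin{cases} -1 & \text{if } i \neq j \text{ and } r_{ij} \le r_c \\ 0 & \text{if } i \neq j \text{ and } r_{ij} > r_c \\ -\sum_{k \neq i} \Gamma_{ik} & \text{if } i = j \end{cases} \qquad (4)$$
where $r_{ij}$ is the distance between two atoms, i and j, and
$r_c$ (= 16 Å) is the cut-off value. GNM without cutoff uses
the following modified Kirchhoff matrix:
$$\Gamma_{ij} = \begin{cases} -r_{ij}^{-2} & \text{if } i \neq j \\ -\sum_{k \neq i} \Gamma_{ik} & \text{if } i = j \end{cases} \qquad (5)$$
In both methods, the average fluctuation of residue
i over time is defined by
$$\langle (\Delta R_i)^2 \rangle = C \left[ \Gamma^{-1} \right]_{ii} \qquad (6)$$
where C is a constant.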
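As an illustration of feature 9, the GNM fluctuation profile can be sketched with the Python standard library alone. This is not the Jernigan laboratory code used in this work: the helper names are hypothetical, the inverse-square weights for the no-cutoff variant are an assumption, the network is assumed to be connected, and the constant C is left as a free `scale` argument.

```python
import math

def kirchhoff(coords, cutoff=None):
    """Kirchhoff matrix from C-alpha coordinates.

    With a cutoff, off-diagonal entries are -1 for pairs within the cutoff
    and 0 otherwise; without one, they are -1/r_ij**2 (assumed form).
    Each diagonal entry is minus the sum of the off-diagonal row entries.
    """
    n = len(coords)
    gamma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            if cutoff is None:
                w = -1.0 / (r * r)
            else:
                w = -1.0 if r <= cutoff else 0.0
            gamma[i][j] = gamma[j][i] = w
    for i in range(n):
        gamma[i][i] = -sum(gamma[i][j] for j in range(n) if j != i)
    return gamma

def invert(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [row[n:] for row in m]

def gnm_fluctuations(coords, cutoff=None, scale=1.0):
    """Mean-square fluctuations: scale times the diagonal of pinv(Gamma)."""
    n = len(coords)
    gamma = kirchhoff(coords, cutoff)
    # For a connected network the only zero mode is the uniform vector, so
    # pinv(Gamma) = inv(Gamma + J/n) - J/n, where J is the all-ones matrix.
    shifted = [[gamma[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    return [scale * (inv[i][i] - 1.0 / n) for i in range(n)]

# Three-residue toy chain (made-up coordinates, 3.8 Å spacing).
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
msf = gnm_fluctuations(chain, cutoff=5.0)
```

For this toy chain the two terminal residues get the same, larger value than the middle residue. Because the Kirchhoff matrix is singular, the pseudo-inverse is used in place of a plain inverse; real proteins would normally be handled with a linear algebra library rather than this small Gauss-Jordan routine.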
Support vector regression
We combined the structural features listed above to pre-
dict fluctuations using support vector regression (SVR).
The LIBSVM package74 with Gaussian kernels was used.
Because it was not feasible to test all the possible
combinations of features, features were added or changed one at a
time starting from the one which has the largest
correlation coefficient with residue fluctuation. We performed
fivefold cross validation using the dataset of trajectories.
The default set of parameters in libsvm, C = 64.0, γ = 1,
and ε = 0.5, was used, which was shown to perform best
among others tested in the first few feature combinations
in the fivefold cross validation (data not shown).
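The one-feature-at-a-time search described above can be sketched as a greedy forward selection. The sketch below is only a schematic under stated assumptions: `make_folds` and `forward_selection` are hypothetical helpers, the scoring callable stands in for a cross-validated SVR run, and the toy scores are made-up integers, not results from this work. The actual procedure also changed features by hand, which this simple loop does not reproduce.

```python
def make_folds(n_items, k=5):
    """Assign item indices 0..n_items-1 to k folds in round-robin order."""
    return [list(range(start, n_items, k)) for start in range(k)]

def forward_selection(candidates, start, score):
    """Greedy forward selection of features.

    candidates: all feature names; start: initial feature list;
    score: callable(feature_list) -> number, higher is better
    (in practice this would be a cross-validated correlation).
    Returns the visited (feature_tuple, score) pairs in order.
    """
    selected = list(start)
    history = [(tuple(selected), score(selected))]
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        best = max(remaining, key=lambda c: score(selected + [c]))
        if score(selected + [best]) <= history[-1][1]:
            break  # no remaining feature improves the score
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), score(selected)))
    return history

# Made-up additive scores, used only to exercise the loop.
gains = {"C(16)": 50, "D2": 10, "B": 5, "Sec": -2}

def toy_score(features):
    return sum(gains[f] for f in features)

history = forward_selection(list(gains), ["C(16)"], toy_score)
folds = make_folds(7, k=5)
```

Splitting by protein index, as `make_folds` does, keeps all residues of one protein in the same fold, which is one reasonable way to set up cross validation over a dataset of trajectories.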
Evaluation of fluctuations prediction
Pearson’s correlation coefficient was used to examine
how well individual features or predicted fluctuations
correlated with actual fluctuations in the MD trajectories.
Average correlation coefficients were computed using all
the trajectories in the dataset.
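In addition, the error of predicted fluctuations was
quantified as the RMSD to the reference trajectory fluctuations:
$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\Delta R_i^{\mathrm{pred}} - \Delta R_i^{\mathrm{actual}}\right)^2} \qquad (7)$$
where N is the length of the protein, $\Delta R_i^{\mathrm{pred}}$ is the predicted
fluctuation, and $\Delta R_i^{\mathrm{actual}}$ is the actual fluctuation [Eq. (2)] of residue i.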
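The two evaluation measures can be written compactly as follows. This is a minimal sketch with hypothetical function names and made-up numbers; it only illustrates why both measures are reported.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rms_error(pred, actual):
    """Root mean square difference between predicted and actual values."""
    n = len(pred)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / n)

# Made-up fluctuation profiles (Å): same shape, shifted by 0.5 Å.
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [1.5, 2.5, 3.5, 4.5]

r = pearson(predicted, actual)
err = rms_error(predicted, actual)
```

Here the correlation is perfect while the RMS error is 0.5 Å, so a method can rank residues correctly and still miss the real values; this is the gap that the RMS measure is meant to expose.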
Availability of the developed program
The program for predicting the fluctuation of residues
in a protein structure is made freely available for the aca-
demic community at http://kiharalab.org/flexPred/. Both
the web server and the source code written in Python are
available. It takes a PDB file of a query protein for input
data and outputs a predicted fluctuation value for each
residue. The computational time for a protein is typically
within a couple of seconds to 20 s depending on the
length of the protein.
RESULTS AND DISCUSSION
The relationships between structural features and resi-
due fluctuations are examined in several aspects. First,
we compare the correlation coefficient of individual static
structural features with actual fluctuations. Then, we
explore different combinations of features to make accu-
rate prediction of fluctuations using SVR. Then, the ac-
curacy of the fluctuation prediction by SVR and by
GNM is further examined. Finally, we also consider the
structural variation of models by NMR in comparison
with prediction as well as the fluctuations observed in MD trajectories.
Correlation of static structural features of
proteins with fluctuations
In Table I, we compared the correlation coefficient of
individual structural features with the fluctuation of
residues observed in the MD trajectories. Eight different
distance cutoff values, from 6 to 22 Å, were used for the
residue contact number. The top of the table shows the
correlation of the B-factor (0.484). Interestingly, several
static structural features, namely, the distance to the
center of mass and the contact number computed with the
cutoff of 12–22 Å, have higher correlation with
the fluctuations than the B-factor. Among the static
features, the largest correlation coefficients were observed
for the residue contact number (15 and 16 Å). These
results indicate that the motion of chains in the MD
trajectories is better captured by the coarse-grained
topological structures of proteins than by the B-factor.
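The two static features that stand out here, the residue contact number and the squared distance to the center of mass, are simple to compute from Cα coordinates. The following sketch uses hypothetical function names and made-up coordinates, and it assumes equal masses so that the center of mass reduces to the plain centroid of the Cα atoms.

```python
import math

def contact_numbers(coords, cutoff=16.0):
    """Number of other C-alpha atoms closer than `cutoff` (Å) to each residue."""
    return [
        sum(1 for j, b in enumerate(coords)
            if j != i and math.dist(a, b) < cutoff)
        for i, a in enumerate(coords)
    ]

def squared_distance_to_center(coords):
    """Squared distance of each C-alpha atom from the centroid.

    Assumption: equal masses, so the center of mass is the centroid.
    """
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return [sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords]

# Made-up coordinates on a line; the last residue is far from the others.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
cn = contact_numbers(coords, cutoff=6.0)
d2 = squared_distance_to_center(coords)
```

In this toy example the isolated residue has no contacts and the largest squared distance, which matches the intuition that loosely packed, peripheral residues tend to fluctuate more.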
As a reference, we also show the correlation of the
fluctuations predicted by GNM (bottom rows of Table I).
GNM showed higher correlation than the other structural
features. Note that GNM actually simulates dynamic
motion of protein structures; thus, it has a different na-
ture from the other static features compared in the table.
Consistent with the previous work by Yang et al.,38
GNM without a distance cutoff showed higher
correlation than GNM with a distance cutoff.
Because the residue contact number (with a 16 Å
cutoff) and the square of distance to the center of mass
showed the two largest correlation coefficients among the
static structure features examined, we used these two
features as the basis for combinations of input features for
training SVR in the next section.
SVR models for predicting residue
fluctuation using static structure features
Next, we used SVR to predict the fluctuation of resi-
due positions in the MD trajectories using various com-
binations of static structural features. Fluctuation predic-
tions by GNM (at the bottom of Table I) were not
included as features. Fivefold cross validation was
performed, in which SVR parameters were trained on
four-fifths of the dataset, while prediction was made for the
remaining one-fifth of the dataset. This procedure was
repeated five times to make predictions for all data in the
dataset. Starting from the combination of the residue
contact number (with a 16 Å cutoff) and the square of
distance to the center of mass, which are the two features
that showed the highest correlation with fluctuations
(Table I), 17 different feature combinations were tested
by adding one feature at a time (Table II).
Among the 17 feature combinations examined, all
except for two (feature sets 1 and 17) showed
higher correlation with actual fluctuations than GNM
(Table I). The largest correlation coefficient, 0.669, was
achieved for feature set 15, which uses the residue
contact numbers with different distance cutoffs. In terms
of average RMS, all the feature combinations predicted
residue fluctuations within an RMS of 1.1 Å, ranging
from 1.042 to 1.092 Å. The smallest RMS was achieved
for feature sets 6, 7, 12, 13, and 14, which combine
the residue contact numbers, the square distance from
the center of mass, and the B-factor. Sets 6 and 7
Table I. Correlation Coefficients Between Structural Features and Fluctuations
Distance to center of mass
Square of distance to center of mass
Contact number (cutoff 6 Å)
Contact number (8 Å)
Contact number (12 Å)
Contact number (15 Å)
Contact number (16 Å)
Contact number (18 Å)
Contact number (20 Å)
Contact number (22 Å)
Accessible Surface Area (ASA) (c)
Residue depth (residue mean) (d)
Residue depth (Cα)
Half upper sphere exposure (Cα) (e)
Half lower sphere exposure (Cα)
Half upper sphere exposure (Cβ)
Half lower sphere exposure (Cβ)
Prediction by GNM (cutoff 16 Å) (f)
Prediction by GNM (no cutoff)
P-value < 0.05
The largest correlation coefficients among the static structural features are highlighted in bold.
(a) The number of proteins that have a significant correlation coefficient to the fluctuations (with P-value < 0.05) is counted. The total number of trajectories (proteins) is 592.
(b) The average value calculated only for the subset of proteins with P-value < 0.05 is shown in the parentheses.
(c) Accessible surface area (Å²) of amino acid residues without normalization. The next row is the correlation with the normalized accessible surface area.
(d) The residue depth computed as the average distance for each atom in the residue and the distance for the Cα atom (next row).
(e) The lower/upper half-sphere exposure of a residue using the Cα or the Cβ atom to determine the position of the plane which cuts the sphere in half.
(f) Fluctuations predicted by GNM [Eq. (6)].
additionally used information about the secondary struc-
ture. The RMS and the average correlation coefficients
(Table II) correlate moderately with a correlation coeffi-
cient of 0.627 (Fig. 3). Figure 4 shows the distribution of
the average correlation coefficients between predicted and
actual fluctuations [Fig. 4(A)] and the average RMS [Fig.
4(B)] for each protein, which were predicted using
feature set 12. Remarkably, the fluctuations of the majority (70%) of
proteins were predicted within an RMS of 1.0 Å. The
strong advantage of the developed SVR models is that
Table II. Summary of Fluctuation Prediction Using SVR Models with Different Feature Combinations
Number of proteins with P-value < 0.05 (%)
Average corr. coeff. (b)
C(16), D2, B
C(16), D2, B, C(18)
C(16), D2, B, C(18), Sec
C(16), D2, B, C(18), Res-type
C(16), D2, B, C(18), Sec, C(12)
C(16), D2, B, C(18), Sec, C(12), C(8)
C(16), D2, C(18), C(12), C(8), C(6)
C(16), D2, B, C(18), C(12), C(8), C(6)
C(16), D2, B, C(18), C(12), C(8), C(6), Sec
C(16), D2, B, C(18), C(12), C(8), C(6), Acc
C(16), D2, B, C(18), C(12), C(8), C(6), C(20)
C(16), D2, B, C(18), C(12), C(8), C(6), C(20), C(22)
C(16), D2, B, C(18), C(12), C(8), C(6), C(15), C(20), C(22)
C(16), B, C(18), C(12), C(8), C(6), C(20), C(22)
C(16), C(18), C(12), C(8), C(6), C(15), C(20), C(22)
C(16), B, C(18), C(12), C(8), C(6), C(20), C(22), HP
The largest correlation coefficients among the static structural features are highlighted in bold.
(a) C(x), the residue contact number with x Å distance cutoff; B, B-factor; D2, square of the distance between the Cα atom and the protein center of mass; Sec, the secondary structure; Acc, normalized accessible surface area; HP, the number of hydrophilic/hydrophobic contacts; Res-type, amino acid type of residues.
(b) The average correlation coefficients between predicted and actual fluctuations. Values calculated only for the subset of proteins that have significant correlation with P-value < 0.05 are shown in the parentheses.
(c) The RMS [Eq. (7)] was averaged over all the proteins in the dataset.
Figure 3. The average correlation coefficient and RMS of predicted and actual
fluctuations. Predictions were made with SVR using 17 different feature
combinations (Table II).
Figure 4. Distribution of (A) correlation coefficients and (B) RMS (Å) of predicted
and actual fluctuations computed for 592 proteins in the dataset.
they predict the real value of fluctuation, unlike GNM,
which predicts only the relative magnitude of residue
fluctuations that need to be rescaled to obtain actual values.
Incorporating dynamic features to SVR
We further investigated whether adding GNM as an
input feature can improve fluctuations prediction with
SVR. We used ⟨(ΔR_i)²⟩ for the fluctuations from GNM
[Eq. (6)] without a distance cutoff, because it has higher
correlation with the actual fluctuations than
does. To each of the feature sets examined in Table II, we
added ⟨(ΔR_i)²⟩ predicted by GNM and performed
fivefold cross validation. The resulting fluctuation prediction
with and without GNM was compared in terms of the
correlation coefficient [Fig. 5(A)] and the RMS [Fig.
5(B)] with the actual fluctuations.
Adding GNM to the feature set slightly improved
the RMS of the predicted fluctuations [Fig.
5(B)] except for one case (feature set 12), lowering RMS
on average by 0.010 Å. However, a small but consistent
deterioration of the correlation coefficient was observed [Fig.
5(A)] when GNM was added. The average decrease in
the correlation coefficient is 0.013. Thus, GNM did not
make a significant contribution to improving fluctuation prediction.
Comparison of SVR model prediction results
with B-factor fluctuation values
In Figure 6, we show four examples of actual and pre-
dicted fluctuations as well as fluctuations derived from
the B-factors. For residue i with a B-factor of Bi, the fluc-
tuation is defined as
The fluctuations from the B-factor were also rescaled to achieve a smaller RMS with the actual fluctuations (i.e., fluctuations from MD trajectories) according to Eq. (9), which uses the maximum and the minimum values of the actual fluctuations and the maximum and the minimum fluctuation values computed from B-factor values [Eq. (8)] in the protein. α is a weighting factor explored from 0.1 to 1.0 with an interval of 0.1 to seek a smaller RMS for the actual fluctuations (Table III). In Figure 6, α is set to 1.0 for the plots of “Fluctuation from B-factor, rescaled.” Note that this rescaling obviously changes the RMS but does not change the correlation coefficient with the actual fluctuation. The actual fluctuations in the MD trajectories are defined by Eq. (2), and predictions were made using feature set 15 in Table II. The right panel of
Comparison of the prediction performance with and without using
GNM as a feature. $\langle(\Delta R_i)^2\rangle$ predicted by GNM was added to each SVR
feature set listed in Table II. (A) Average correlation coefficient; (B)
average RMS predicted by SVR with and without $\langle(\Delta R_i)^2\rangle$ from GNM.
Predicting Protein Fluctuation
Examples of predicted fluctuations in comparison with B-factor-derived fluctuations and MD simulation fluctuations. Left panels show the values
of fluctuations: red, fluctuations observed in the MD trajectories; green, predicted fluctuations; dotted blue line, fluctuations computed from
B-factors; dotted magenta line, rescaled fluctuations from B-factors (α = 1.0). The correlation coefficients and RMS are summarized in Table III.
Right-hand panels show the magnitude of fluctuations in a color scale with blue indicating lower fluctuations and red for higher fluctuations. A, B,
1mof; C, D, 1dq3; E, F, 1gpc; G, H, 1a1x.
each protein visualizes the magnitude of the actual fluctuations on a color scale from blue to red, with blue indicating small and red indicating large fluctuations.
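For readers who want to derive fluctuations from B-factors themselves, the sketch below uses the standard isotropic Debye-Waller relation, $B = (8\pi^2/3)\langle u^2 \rangle$. Whether Eq. (8) of this work reports the mean-square or the root-mean-square displacement is not assumed here; the function name is hypothetical and returns the root-mean-square value in Å.

```python
import math

def bfactor_to_rms_fluctuation(b_factor):
    """Convert an isotropic B-factor (Å^2) to an RMS displacement (Å).

    Uses the standard relation B = (8 * pi^2 / 3) * <u^2>, so that
    sqrt(<u^2>) = sqrt(3 * B / (8 * pi^2)).
    """
    if b_factor < 0:
        raise ValueError("B-factor must be non-negative")
    return math.sqrt(3.0 * b_factor / (8.0 * math.pi ** 2))

# Made-up B-factors (Å^2) for a short stretch of residues.
b_values = [12.5, 18.0, 35.2, 60.1]
fluctuations = [bfactor_to_rms_fluctuation(b) for b in b_values]
```

Because the conversion is monotonic, residues keep the same rank order of flexibility; only the scale changes.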
The first example, retrovirus coat protein (PDB ID: 1mof) [Fig. 6(A,B)], exhibits large fluctuations at the two termini and at the end of the long helix. Prediction by SVR captured the fluctuating residues and their magnitude fairly well, with a correlation coefficient of 0.80 and an RMS of 1.55 Å. The fluctuations derived from the B-factor have a lower correlation with the actual fluctuations (correlation coefficient of 0.69) and a larger RMS of 1.91 Å even after rescaling. In the second example [Fig. 6(C,D)], homing endonuclease PI-PfuI (PDB ID: 1dq3), the overall fluctuation is not large but shows high peaks at loop regions. The predicted fluctuations have a correlation coefficient of 0.81, while the fluctuations from the B-factor have a moderate correlation of 0.50. The third example, DNA-binding protein gp32 (PDB ID: 1gpc) [Fig. 6(E,F)], has the largest fluctuation at the loop of residues 150–160 and fluctuations of over 3 Å at the other loop regions, which are captured well by the prediction. The predicted fluctuations have a correlation coefficient of 0.78 and a small RMS of 1.04 Å. In contrast, the correlation of the fluctuations from the B-factor is 0.55, with a larger RMS of 1.93 Å. The last example, MTCP-1 (PDB ID: 1a1x) [Fig. 6(G,H)], is a β-barrel protein with a long loop at residues 50–60. Relatively large fluctuations were observed at the N-terminus and at the loop regions that connect β-strands (e.g., residues 35–40), which are well predicted. The overall RMS of the prediction is 0.79 Å, and the correlation coefficient with the actual fluctuations is 0.82, better than the fluctuations derived from B-factors.
Consistent with Table I, the fluctuations from B-factors correlate only moderately with the actual fluctuations. Fluctuations computed from B-factors using Eq. (8) always have a larger RMS than the SVR prediction. The agreement of the fluctuations from B-factors can be improved if they are rescaled individually for each protein, as shown in the second column from the right in Table III; however, the value of the optimal scaling factor α differs from protein to protein and thus cannot be known beforehand. In contrast, our prediction by SVR has a significantly higher correlation with the actual fluctuations, and it predicts the real value of the fluctuations satisfactorily without any rescaling.
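The point that a linear rescaling changes the RMS but not the correlation coefficient, and that the best α has to be searched per protein, can be illustrated with the sketch below. The exact form of Eq. (9) is not reproduced in this text, so the `rescale` function is a hypothetical min-max rescaling weighted by α, chosen only for illustration; all numbers are made up.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def rms(pred, actual):
    """Root mean square deviation between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def rescale(b_fluct, actual, alpha):
    """Hypothetical alpha-weighted min-max rescaling (NOT the paper's Eq. 9).

    Maps the B-factor-derived range onto alpha times the range of the
    actual fluctuations, starting from the actual minimum.
    """
    b_min, b_max = min(b_fluct), max(b_fluct)
    a_min, a_max = min(actual), max(actual)
    scale = alpha * (a_max - a_min) / (b_max - b_min)
    return [a_min + scale * (f - b_min) for f in b_fluct]

def best_alpha(b_fluct, actual):
    """Scan alpha = 0.1, 0.2, ..., 1.0 and return (alpha, rms) with the smallest RMS."""
    candidates = [round(0.1 * k, 1) for k in range(1, 11)]
    scored = [(a, rms(rescale(b_fluct, actual, a), actual)) for a in candidates]
    return min(scored, key=lambda t: t[1])

# Made-up fluctuations (Å): MD-derived values and B-factor-derived values.
actual = [0.8, 1.0, 2.5, 1.2, 0.9, 3.0]
b_fluct = [0.5, 0.6, 0.9, 0.7, 0.5, 0.8]
alpha, err = best_alpha(b_fluct, actual)
```

Note that the search needs the actual fluctuations as input, which is why a per-protein optimal α cannot be chosen in advance for a protein without an MD trajectory.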
MD fluctuations and fluctuations from NMR
The MoDEL database also contains simulations of protein structures determined by NMR. We selected 140 nonredundant protein structures determined by NMR that contain more than 10 models in their PDB files. Redundancy was removed by considering sequence identity according to the PISCES database.63 Using the 140 proteins, we compared fluctuations observed in the NMR models, MD trajectories, and the predicted fluctuations. The results are summarized in Table IV. The fluctuation prediction was carried out using feature set 16, which does not contain the B-factor term (NMR structures do not have B-factors).
The prediction has a significant correlation (0.739) with the structural variation of the models derived from NMR. Interestingly, the correlation coefficient between the prediction and NMR is the highest of the three pairs, the other two being prediction versus MD and NMR versus MD.
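One common way to quantify the structural variation across NMR models is the per-residue root mean square fluctuation around the mean position, sketched below. This is an assumption for illustration: how the fluctuation of the NMR models was defined in this work is not stated in this passage, and the sketch assumes one representative atom per residue (e.g., Cα) with models already superimposed.

```python
import math

def rmsf_from_models(models):
    """Per-residue RMS fluctuation (Å) across an ensemble of models.

    `models` is a list of models; each model is a list of (x, y, z)
    coordinates, one per residue, in the same residue order. The models
    are assumed to be superimposed beforehand.
    """
    n_models = len(models)
    n_res = len(models[0])
    result = []
    for i in range(n_res):
        coords = [m[i] for m in models]
        mean = [sum(c[k] for c in coords) / n_models for k in range(3)]
        msd = sum(
            sum((c[k] - mean[k]) ** 2 for k in range(3)) for c in coords
        ) / n_models
        result.append(math.sqrt(msd))
    return result

# Made-up two-residue, two-model ensemble: residue 0 is fixed,
# residue 1 moves by 2 Å along x between the models.
toy_models = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
toy_rmsf = rmsf_from_models(toy_models)
```

The resulting per-residue profile can then be compared with MD-derived or predicted fluctuations using the same correlation coefficient as elsewhere in this work.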
We used a large number of MD trajectories of nonho-
mologous proteins as references and examined static
structural features of the proteins that are most relevant
Correlation Coefficients and RMS of the Four Example Predictions
Correlation coefficient RMS (Å)
B-factor, rescaled α = 1.0a
B-factor, rescaled (α)b
The data correspond to the plots in the left panels of Figure 6.
aFluctuations computed from B-factor were rescaled with α = 1.0 in Eq. (9). This value corresponds to the curve “Fluctuation from B-factor, rescaled” in Figure 6.
bFluctuations computed from B-factor were rescaled with the weight factor α [Eq. (9)] ranging from 0.1 to 1.0 with an interval of 0.1. The smallest RMS obtained
is shown together with the α value used in parentheses.
Comparison of Fluctuations of NMR Models, MD, and Our Prediction
NMR versus MD
NMR versus prediction
MD versus prediction
P-value < 0.05 (%)
One hundred and forty nonredundant proteins in the MoDEL database whose
structures were determined by NMR were used.
to fluctuations. We examined the correlation of individual structural features with fluctuations and then investigated effective combinations of features for SVR to predict the real value of the fluctuation of residues. The main findings of this work are summarized as follows. First, two types of structural features, the distance to the center of mass of the protein and the residue contact number, showed a higher correlation coefficient with fluctuations than the B-factor does. Combinations of static features used as input for SVR achieved accurate prediction of fluctuations with a correlation coefficient of 0.67 and an RMS of 1.042 Å. This correlation coefficient is higher than that of GNM with the actual fluctuations. Our method also predicts the structural variation of NMR models well. The current study demonstrates that the flexibility of proteins is inherently coded in coarse-grained static protein structural features, even more than in the crystallographic B-factors. Thus, protein motion is determined by its static structure, which is coded by its sequence; this could be considered an extension of Anfinsen’s dogma.75 Indeed, a series of studies on GNM has also demonstrated that the motion of a protein is determined by its structure. However, the current work further shows that static structural features can predict the real value of fluctuations, which GNM has not been shown to be able to do. As the importance of protein dynamics for biological function has become more widely recognized, the prediction method we developed also has practical value in wide areas of biology and biotechnology.
The authors thank Jordi Camps (Centre Nacional
d’Anàlisis Genòmica, Spain) and Tim Meyer (Institute
for Research in Biomedicine, Spain) for help with the
PCAsuite software and the MoDEL database.
1. Chandonia JM, Brenner SE. The impact of structural genomics:
expectations and outcomes. Science 2006;311:347–351.
2. Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of
structural genomics initiatives: an analysis of solved target struc-
tures. J Mol Biol 2005;348:1235–1260.
3. Westbrook J, Feng Z, Chen L, Yang H, Berman HM. The Protein
Data Bank and structural genomics. Nucleic Acids Res 2003;31:489–
4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig
H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic
Acids Res 2000;28:235–242.
5. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi
A, Marti-Renom M, Karchin R, Webb BM, Eramian D, Shen MY,
Kelly L, Melo F, Sali A. MODBASE: a database of annotated com-
parative protein structure models and associated resources. Nucleic
Acids Res 2006;34:D291–D295.
6. Kihara D, Skolnick J. Microbial genomes have over 72% structure
assignment by the threading algorithm PROSPECTOR_Q. Proteins
7. Zhang Y. Progress and challenges in protein structure prediction.
Curr Opin Struct Biol 2008;18:342–348.
8. Chen H, Kihara D. Effect of using suboptimal alignments in tem-
plate-based protein structure prediction. Proteins 2011;79:315–334.
9. Das R, Baker D. Macromolecular modeling with rosetta. Annu Rev
10. Bradley P, Misura KM, Baker D. Toward high-resolution de novo
structure prediction for small proteins. Science 2005;309:1868–1871.
11. Kihara D, Lu H, Kolinski A, Skolnick J. TOUCHSTONE: an ab ini-
tio protein structure prediction method that uses threading-based
tertiary restraints. Proc Natl Acad Sci USA 2001;98:10125–10130.
12. Kihara D, Zhang Y, Lu H, Kolinski A, Skolnick J. Ab initio protein
structure prediction on a genomic scale: application to the Myco-
plasma genitalium genome. Proc Natl Acad Sci USA 2002;99:5993–
13. Borreguero JM, Skolnick J. Benchmarking of TASSER in the ab ini-
tio limit. Proteins 2007;68:48–56.
14. Trojanowski S, Rutkowska A, Kolinski A. TRACER: a new approach
to comparative modeling that combines threading with free-space
conformational sampling. Acta Biochim Polym 2010;57:125–133.
15. Venkatraman V, Sael L, Kihara D. Potential for protein surface
shape analysis using spherical harmonics and 3D Zernike descrip-
tors. Cell Biochem Biophys 2009;54:23–32.
16. Puton T, Kozlowski L, Tuszynska I, Rother K, Bujnicki JM. Compu-
tational methods for prediction of protein-RNA interactions.
J Struct Biol, DOI: 10.1016/j.jsb.2011.10.001.
17. Ritchie DW. Recent progress and future directions in protein-pro-
tein docking. Curr Protein Pept Sci 2008;9:1–15.
18. Hillisch A, Pineda LF, Hilgenfeld R. Utility of homology models in
the drug discovery process. Drug Discov Today 2004;9:659–669.
19. Takeda-Shitaka M, Takaya D, Chiba C, Tanaka H, Umeyama H.
Protein structure prediction in structure based drug design. Curr
Med Chem 2004;11:551–558.
20. Teilum K, Olsen JG, Kragelund BB. Functional aspects of protein
flexibility. Cell Mol Life Sci 2009;66:2231–2247.
21. Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh
JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nis-
sen MS, Reeves R, Kang C, Kissinger CR, Bailey RW, Griswold MD,
Chiu W, Garner EC, Obradovic Z. Intrinsically disordered protein.
J Mol Graph Model 2001;19:26–59.
22. Hammes GG, Benkovic SJ, Hammes-Schiffer S. Flexibility, diversity,
23. Chiti F, Dobson CM. Amyloid formation by globular proteins
under native conditions. Nat Chem Biol 2009;5:15–22.
24. Zacharias M. Accounting for conformational changes during pro-
tein-protein docking. Curr Opin Struct Biol 2010;20:180–186.
25. Mandell DJ, Kortemme T. Backbone flexibility in computational
protein design. Curr Opin Biotechnol 2009;20:420–428.
26. Lassila JK. Conformational diversity and computational enzyme
design. Curr Opin Chem Biol 2010;14:676–682.
27. Lill MA. Efficient incorporation of protein flexibility and dynamics
into molecular docking simulations. Biochemistry 2011;50:6157–6169.
28. Debye P. Interferenz von Röntgenstrahlen und Wärmebewegung.
Annal Phys 1913;348:49–92.
29. Eastman P, Pellegrini M, Doniach S. Protein flexibility in solution
and in crystals. J Chem Phys 1999;110:10141–10152.
30. Ishima R, Torchia DA. Protein dynamics from NMR. Nat Struct
31. Baldwin AJ, Kay LE. NMR spectroscopy brings invisible protein
states into focus. Nat Chem Biol 2009;5:808–814.
32. Nilges M, Habeck M, O’Donoghue SI, Rieping W. Error distribu-
tion derived NOE distance restraints. Proteins 2006;64:652–664.
33. Chalaoux FR, O’Donoghue SI, Nilges M. Molecular dynamics and
accuracy of NMR structures: effects of error bounds and data re-
moval. Proteins 1999;34:453–463.
34. Brooks B, Karplus M. Harmonic dynamics of proteins: normal
modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc
Natl Acad Sci USA 1983;80:6571–6575.
enzyme catalysis. Biochemistry
35. Haliloglu T, Bahar I, Erman B. Gaussian dynamics of folded pro-
teins. Phys Rev Lett 1997;79:3090–3093.
36. Tirion MM. Large amplitude elastic motions in proteins from a sin-
gle-parameter, atomic analysis. Phys Rev Lett 1996;77:1905–1908.
37. Bahar I, Erman B, Haliloglu T, Jernigan RL. Efficient characteriza-
tion of collective motions and interresidue correlations in proteins
by low-resolution simulations. Biochemistry 1997;36:13512–13523.
38. Yang L, Song G, Jernigan RL. Protein elastic network models and
the ranges of cooperativity. Proc Natl Acad Sci USA 2009;106:
39. Liwo A, Oldziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga
HA. A united-residue force field for off-lattice protein-structure
simulations. I. Functional forms and parameters of long-range side-
chain interaction potentials from protein crystal data. J Comp
40. Kolinski A. Protein modeling and structure prediction with a
reduced representation. Acta Biochim Polym 2004;51:349–371.
41. He Y, Liwo A, Weinstein H, Scheraga HA. PDZ binding to the BAR
domain of PICK1 is elucidated by coarse-grained molecular dynam-
ics. J Mol Biol 2011;405:298–314.
42. Kmiecik S, Kolinski A. Characterization of protein-folding pathways
by reduced-space modeling. Proc Natl Acad Sci USA 2007;104:
43. Kmiecik S, Kolinski A. Folding pathway of the b1 domain of pro-
tein G explored by multiscale modeling. Biophys J 2008;94:726–736.
44. Kmiecik S, Kolinski A. Simulation of chaperonin effect on protein
folding: a shift from nucleation-condensation to framework mecha-
nism. J Am Chem Soc 2011;133:10283–10289.
45. Kondrashov DA, Cui Q, Phillips GN Jr. Optimization and evalua-
tion of a coarse-grained model of protein motion using X-ray crys-
tal data. Biophys J 2006;91:2760–2767.
46. Lin TL, Song G. Generalized spring tensor models for protein fluc-
tuation dynamics and conformation changes. BMC Struct Biol
2010;10 (Suppl 1):S3.
47. Micheletti C, Carloni P, Maritan A. Accurate and efficient descrip-
tion of protein vibrational dynamics: comparing molecular dynam-
ics and Gaussian models. Proteins 2004;55:635–645.
48. Canino LS, Shen T, McCammon JA. Changes in flexibility upon
binding: application of the self-consistent pair contact probability
method to protein-protein interactions. J Chem Phys 2002;117:
49. Pandey BP, Zhang C, Yuan X, Zi J, Zhou Y. Protein flexibility pre-
diction by an all-atom mean-field statistical theory. Protein Sci
50. Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L. On the rela-
tion between residue flexibility and local solvent accessibility in
proteins. Proteins 2009;76:617–636.
51. Shih CH, Huang SW, Yen SC, Lai YL, Yu SH, Hwang JK. A simple
way to compute protein dynamics without a mechanical model.
52. Kloczkowski A, Jernigan RL, Wu Z, Song G, Yang L, Kolinski A,
Pokarowski P. Distance matrix-based approach to protein structure
prediction. J Struct Funct Genom 2009;10:67–81.
53. Bornot A, Etchebest C, De Brevern AG. Predicting protein flexibility
through the prediction of local structures. Proteins 2011;79:
54. Gu J, Gribskov M, Bourne PE. Wiggle-predicting functionally flexi-
ble regions from primary sequence. PLoS Comput Biol 2006;2:e90.
55. Chen P, Wang B, Wong HS, Huang DS. Prediction of protein
B-factors using multi-class bounded SVM. Protein Pept Lett 2007;
56. Hirose S, Yokota K, Kuroda Y, Wako H, Endo S, Kanai S, Noguchi T.
Prediction of protein motions from amino acid sequence and its appli-
cation to protein-protein interaction. BMC Struct Biol 2010;10:20.
57. Schlessinger A, Rost B. Protein flexibility and rigidity predicted
from sequence. Proteins 2005;61:115–126.
58. Meyer T, D’Abramo M, Hospital A, Rueda M, Ferrer-Costa C, Perez
A, Carrillo O, Camps J, Fenollosa C, Repchevsky D, Gelpi JL,
Orozco M. MoDEL (Molecular Dynamics Extended Library): a
database of atomistic molecular dynamics trajectories. Structure
59. Case DA, Cheatham TE III, Darden T, Gohlke H, Luo R, Merz KM Jr,
Onufriev A, Simmerling C, Wang B, Woods RJ. The Amber biomolec-
ular simulation programs. J Comput Chem 2005;26:1668–1688.
60. Hess B, Kutzner C, van der Spoel D, Lindahl E. GROMACS 4: algorithms
for highly efficient, load-balanced, and scalable molecular
simulation. J Chem Theory Comput 2008;4:435–447.
61. Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E,
Chipot C, Skeel RD, Kale L, Schulten K. Scalable molecular dynam-
ics with NAMD. J Comput Chem 2005;26:1781–1802.
62. Meyer T, Ferrer-Costa C, Perez A, Rueda M, Bidon-Chanal A,
Luque FJ, Laughton CA, Orozco M. Essential dynamics: a tool for
efficient trajectory compression and management. J Chem Theory
63. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling
server. Bioinformatics 2003;19:1589–1591.
64. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thorn-
ton JM. CATH—a hierarchic classification of protein domain struc-
tures. Structure 1997;5:1093–1108.
65. Lin CP, Huang SW, Lai YL, Yen SC, Shih CH, Lu CH, Huang CC,
Hwang JK. Deriving protein dynamical properties from weighted
protein contact number. Proteins 2008;72:929–935.
66. Halle B. Flexibility and packing in proteins. Proc Natl Acad Sci
67. Kyte J, Doolittle RF. A simple method for displaying the hydro-
pathic character of a protein. J Mol Biol 1982;157:105–132.
68. Kabsch W, Sander C. Dictionary of protein secondary structure:
pattern recognition of hydrogen-bonded and geometrical features.
69. Miller S, Janin J, Lesk AM, Chothia C. Interior and surface of
monomeric proteins. J Mol Biol 1987;196:641–656.
70. Chakravarty S, Varadarajan R. Residue depth: a novel parameter for
the analysis of protein structure and stability. Structure 1999;7:
71. Sanner M, Olson AJ, Spehner JC. Fast and robust computation of
molecular surfaces. Proceedings of 11th ACM Symposium on Computational
Geometry, Vancouver, BC, Canada; 1995. pp C6–C7.
72. Hamelryck T. An amino acid has two sides: a new 2D measure pro-
vides a different view of solvent exposure. Proteins 2005;59:38–48.
73. Kundu S, Melton JS, Sorensen DC, Phillips GN Jr. Dynamics of
proteins in crystals: comparison of experiment with simple models.
Biophys J 2002;83:723–732.
74. Chang C-C, Lin C-J. LIBSVM: a library for support vector
machines. ACM Trans Intell Syst Technol 2011;2:27:1–27:27.
75. Anfinsen CB. Principles that govern the folding of protein chains.