Journal of the Korean Physical Society, Vol. 46, No. 3, March 2005, pp. 625–630
Identification of the Protein Native Structure by Using a
Sequence-Dependent Feature in Contact Maps
Jaewoon Jung
and Hie-Tae Moon
Department of Physics, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon 305-701
Jooyoung Lee
School of Computational Sciences, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722
(Received 12 October 2004)
We present a new approach for fold recognition to identify the native and the near-native pro-
tein structures among decoy structures by using pair-wise contact potentials between amino acid
residues. For a given protein structure, a new scoring function is defined as the difference between
the contact energy for its native sequence and the average contact energy for random sequences of
the same contact map. We have tested the new scoring function for the various decoy sets available
in the literature and have found that the new scoring function is more useful than the original con-
tact energy, especially for decoy sets where the total number of contacts from the native structure
is similar to those from the decoy conformations. From this observation, we conclude that the more
native-like the structure is, the more likely that it distinguishes the native sequence from random
sequences. We demonstrate that, for a given contact potential, a simple, but more efficient, new
scoring function can be constructed.
PACS numbers: 87.14.Ee, 87.15.By, 87.15.Cc
Keywords: Protein folding, Native structure, Decoy structure
I. INTRODUCTION
Determination of the three-dimensional structure of a
protein from its amino-acid sequence is a major unsolved
problem in structural biology [2,3]. One way to achieve
this goal is to develop a scoring function that can dis-
tinguish the native structure of a protein from a large
number of decoy conformations. For this reason many
scoring functions, including empirical contact energies,
have been investigated [4–15,18–21]. The traditional at-
tempts for empirical contact energies are typically based
on a residue-residue contact function obtained by using
a quasi-chemical approximation [4–7]. The basic idea of
this method is to investigate pairing frequencies between
two amino acids, observed from various native structures
in the Protein Data Bank, normalized against those ex-
pected from random pairing. Other contact energy func-
tions are also obtained by optimizing interaction poten-
tials so that the energy of the native structure becomes
lower than the energy of competing decoy structures [9]
and/or by maximizing the energy gap between the native
state and the decoy states normalized by the energy variance of decoy states [14]. Most of these energy functions are knowledge-based potentials, and indications are that they are correlated with atomic potentials [16].
Current address: Department of Chemistry, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon 305-701
Corresponding author; e-mail: jwjung@kaist.ac.kr
These previous efforts are focused on developing pair-
wise contact potentials that are able to discriminate the
native states of a set of proteins from many structural
decoys. In this work, we propose a different kind of ap-
proach where we adopt the pair-wise contact potential
developed by Miyazawa and Jernigan (MJ) [6]. The MJ
contact potential has been quite successful [6,17]. The
new scoring function is defined as the difference between
the original energy function and the average energy ob-
tained from random sequences.
In the present work, we examine the performance of the new scoring function in fold recognition for various decoy sets. In particular, we investigate whether the new scoring function is more useful than the original MJ contact energy in identifying native structures among many decoys. We find that the correlation between the new scoring function and the RMSD (root mean square deviation) measured from the native structure is stronger than the correlation between the original contact energy and the RMSD.
II. METHOD
1. Structure Comparison
To compare two structures, the RMSD (root mean square deviation) is used. The RMSD between two structures $a$ and $b$ is defined as

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{1 \le i \le N} |\mathbf{r}_{ai} - \mathbf{r}_{bi}|^2}{N}}, \tag{1}$$

where structures $a$ and $b$ are superimposed so that the value of the RMSD becomes minimum, $\mathbf{r}$ denotes the coordinates of the alpha carbon atoms, and $N$ is the number of amino acids.
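As an illustration of Eq. (1), the following is a minimal sketch of an RMSD evaluation after optimal superposition. The text only states that the two structures are superimposed so that the RMSD is minimal; the use of the Kabsch algorithm (via NumPy's SVD) is an assumption made here, and the function name `rmsd_after_superposition` is hypothetical.

```python
import numpy as np


def rmsd_after_superposition(coords_a, coords_b):
    """Return the minimum RMSD between two (N, 3) arrays of C-alpha coordinates.

    The minimization over rigid-body motions uses the Kabsch algorithm.
    This choice is an assumption; the text does not name a method.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two coordinate arrays of identical shape (N, 3)")

    # Remove the translation by centering both structures on their centroids.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)

    # Kabsch: the SVD of the 3x3 covariance matrix yields the optimal rotation.
    covariance = a_centered.T @ b_centered
    u, _, vt = np.linalg.svd(covariance)

    # Guard against an improper rotation (a reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T

    # Rotate structure a onto structure b and evaluate Eq. (1).
    a_rotated = a_centered @ rotation.T
    squared_distances = ((a_rotated - b_centered) ** 2).sum(axis=1)
    return float(np.sqrt(squared_distances.sum() / len(a)))
```

A rigidly rotated and translated copy of a structure should give an RMSD of essentially zero, whereas its mirror image generally should not, because reflections are excluded from the superposition.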
2. Contact Energy of a Protein
Before considering a new scoring function, we describe the definition of the contact energy of a protein. The contact energy of a protein is defined as

$$E_k = \sum_{i,j} \Delta_{i,j} B(a_i, a_j). \tag{2}$$

In this equation, $\Delta_{i,j} = 1$ if the amino acids at positions $i$ and $j$ are in contact; $\Delta_{i,j} = 0$ otherwise. A contact between amino acids $i$ and $j$ is defined to exist if their side-chain centroids are within 6.5 Å [5–7]. $B(a_i, a_j)$ is the pair-wise contact energy between amino acids of types $a_i$ and $a_j$.
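The contact energy of Eq. (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each contacting pair is counted once (i < j), the optional `min_separation` parameter for skipping residues close along the chain is hypothetical, and `TOY_TABLE` holds made-up energies for a two-letter alphabet rather than MJ values.

```python
import math
from itertools import combinations

CONTACT_CUTOFF = 6.5  # angstroms; centroid-distance cutoff quoted in the text


def find_contacts(centroids, cutoff=CONTACT_CUTOFF, min_separation=1):
    """List residue pairs (i, j), i < j, whose side-chain centroids are in contact.

    min_separation is a hypothetical knob: pairs closer than this along the
    chain are skipped. The default of 1 keeps every pair.
    """
    contacts = []
    for i, j in combinations(range(len(centroids)), 2):
        if j - i < min_separation:
            continue
        if math.dist(centroids[i], centroids[j]) <= cutoff:
            contacts.append((i, j))
    return contacts


def pair_energy(table, type_a, type_b):
    """Look up B(a_i, a_j) in a symmetric table keyed by sorted residue types."""
    key = (type_a, type_b) if type_a <= type_b else (type_b, type_a)
    return table[key]


def contact_energy(sequence, contacts, table):
    """Sum the pair energies over all contacts, counting each pair once."""
    return sum(pair_energy(table, sequence[i], sequence[j]) for i, j in contacts)


# Made-up energies for illustration only; these are NOT the MJ parameters.
TOY_TABLE = {("H", "H"): -2.0, ("H", "P"): -1.0, ("P", "P"): 0.0}
```

For three centroids placed 5 Å apart on a line, only the two neighboring pairs are within the cutoff, so the sequence `"HPH"` picks up two H-P contacts from the toy table.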
3. Calculation of New Scoring Functions
We assume that the probability that a protein adopts structure $X$ and sequence $S$ follows the Boltzmann distribution

$$P(X, S) \propto \exp(-\beta E(X, S)). \tag{3}$$

Then, the probability that a structure $X_k$ and the native sequence $S_N$ are selected together becomes

$$P(X_k, S_N) \propto \exp(-\beta E(X_k, S_N)) = \exp(-\beta (E(X_k, S_N) - \langle E \rangle_k)) \exp(-\beta \langle E \rangle_k), \tag{4}$$

where $\langle E \rangle_k$ is the average contact energy when the native sequence is replaced by random sequences. It should be noted that the energy contribution $\langle E \rangle_k$ is independent of the sequence $S_N$ and depends only on the structure $X_k$.

The probability $P(X_k, S_N)$ is considered to have two parts. One is the sequence- and structure-dependent $E - \langle E \rangle_k$, and the other is the sequence-independent, structure-only-dependent $\langle E \rangle_k$. Since the sequence-independent term $\langle E \rangle_k$ plays the role of estimating only the total number of contacts for the given structure $X_k$, we assume that it is not as important as the sequence- and structure-dependent $E - \langle E \rangle_k$. Finally, we assume that

$$\exp(-\beta (E(X_N, S_N) - \langle E \rangle_N)) > \exp(-\beta (E(X_k, S_N) - \langle E \rangle_k)), \quad k \neq N. \tag{5}$$

Then, the following inequality is satisfied:

$$E(X_N, S_N) - \langle E \rangle_N < E(X_k, S_N) - \langle E \rangle_k. \tag{6}$$

From this, the new scoring function is defined as $E - \langle E \rangle$.
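The scoring function can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text does not say how the random sequences are generated, so shuffling the native sequence (which preserves its composition) is an assumption made here; the default of 1000 random sequences follows the number quoted in the caption of Fig. 2; and all function names are hypothetical.

```python
import random


def contact_energy(sequence, contacts, table):
    """Sum pair energies over contacts; table is keyed by sorted residue types."""
    total = 0.0
    for i, j in contacts:
        a, b = sequence[i], sequence[j]
        total += table[(a, b) if a <= b else (b, a)]
    return total


def mean_random_energy(sequence, contacts, table, n_random=1000, seed=0):
    """Average contact energy over random sequences placed on a fixed contact map.

    Assumption: random sequences are shuffles of the given sequence.
    """
    rng = random.Random(seed)
    residues = list(sequence)
    total = 0.0
    for _ in range(n_random):
        rng.shuffle(residues)
        total += contact_energy(residues, contacts, table)
    return total / n_random


def new_score(sequence, contacts, table, n_random=1000, seed=0):
    """Contact energy of the sequence minus the random-sequence average."""
    native = contact_energy(sequence, contacts, table)
    return native - mean_random_energy(sequence, contacts, table, n_random, seed)
```

If every pair energy in the table is the same, the random-sequence average equals that energy times the number of contacts and the score vanishes. This is the sense in which the average term only measures how many contacts a structure has, while the difference isolates the sequence-dependent part.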
III. RESULTS AND DISCUSSION
Figure 1 shows the relationship between the RMSD measured from the native structure and the original MJ scoring function for the protein 1ctf in the 4-state reduced decoys. Figure 2 corresponds to the results obtained by using the new scoring function. Each data point represents a structure in the decoy set, and the point with RMSD = 0 corresponds to the native X-ray structure. From these figures, we observe that the new scoring function $E - \langle E \rangle$ has a higher correlation with the RMSD than the original contact energy $E$.

To compare the performances of $E$ and $E - \langle E \rangle$ in more detail, we calculated the Z-scores of the native structures, the correlations between the RMSD and the scoring function, and the ranks of the native structures in various decoy sets. The results are summarized in Tables 1-3.

Fig. 1. Relationship between the original MJ contact energy $E$ and the RMSD. The energy is the sum of pair-wise MJ contact potentials when each structure is mounted on a native sequence. Here, the correlation is 0.34.

Fig. 2. Relationship between $E - \langle E \rangle$ and the RMSD. $\langle E \rangle$ is the average of the sum of pair-wise contact potentials calculated from 1000 random sequences. The correlation is 0.58.

Table 1. Z-scores calculated using the original MJ contact energy $E$ and the modified scoring function $E - \langle E \rangle$. Each protein entry and each average is given as A/B, where A and B correspond to the Z-scores of $E$ and $E - \langle E \rangle$, respectively. Each comparison is given as A/B, where A and B correspond to the number of proteins for which $E$ is superior and the number of proteins for which $E - \langle E \rangle$ is superior, respectively.

Decoy set: 4-state reduced (average 3.31/3.47, comparison 3/4)
  1ctf       1r69       1sn3       2cro       3icb       4pti       4rxn
  3.60/3.40  4.52/4.37  2.40/3.03  4.05/4.36  2.12/2.48  3.60/3.33  2.91/3.33

Decoy set: fisa (average 3.74/2.34, comparison 4/0)
  1fc2       1hdd-C     2cro       4icb
  1.59/0.01  3.07/2.24  4.33/2.73  5.98/4.40

Decoy set: fisa casp3 (average 3.36/1.76, comparison 5/0)
  1bg8-A     1bl0       1eh2       1jwe       smd3
  3.27/2.19  1.77/0.17  2.92/2.32  4.77/2.01  4.09/2.43

Decoy set: lattice ssfit (average 2.91/4.99, comparison 1/3)
  1beo       1ctf       1fca       1nkl
  2.67/5.02  3.39/5.49  3.08/3.03  2.48/6.42

Decoy set: lmds (average 2.64/0.77, comparison 10/0)
  1b0n-B     1bba       1ctf       1dtk       1fc2       1igd       1shf-A
  1.91/0.05  0.16/1.63  3.79/3.31  3.99/0.67  3.64/6.43  3.45/3.34  2.47/1.16
  2cro       2ovo       4pti
  7.37/4.25  3.06/2.12  4.19/0.98

Decoy set: semfold (average 2.27/2.77, comparison 2/3)
  1ctf       1eh2       1khm       1nkl       1pgb
  2.89/2.51  3.62/4.22  1.60/2.65  1.58/2.90  1.66/1.55

Decoy set: hg structal (average 1.50/1.85, comparison 5/23)
  1ash       1bab-B     1col-A     1cpc-A     1ecd       1emy       1flp
  2.96/3.22  1.45/1.71  4.79/4.84  3.55/3.94  1.81/1.78  0.99/1.42  2.40/2.42
  1gdm       1hbg       1hbh-A     1hbh-B     1hda-A     1hda-B     1hlb
  2.65/2.53  2.16/1.87  0.89/1.05  0.96/1.35  0.94/1.87  1.89/1.85  0.83/2.17
  1hlm       1hsy       1ith-A     1mba       1mbs       1myg-A     1myj-A
  2.69/0.73  0.44/1.54  1.71/1.92  2.36/2.40  0.75/0.26  1.34/1.81  1.38/1.71
  1myt       2dhb-A     2dhb-B     2lhb       2pgh-A     2pgh-B     4sdh-A
  1.81/2.15  1.53/1.81  0.36/1.06  1.50/1.54  1.31/1.56  0.45/1.45  2.93/1.93

Decoy set: ig structal (average 0.50/0.61, comparison 1/58)
  1acy       1baf       1bbd       1bbj       1dbb       1dfb       1dvf
  0.74/0.69  0.71/0.10  0.38/0.77  0.37/1.29  1.30/0.30  1.02/0.11  0.20/0.40
  1eap       1fai       1fbi       1fgv       1fig       1flr       1for
  0.74/0.40  0.16/0.44  0.85/0.18  0.82/0.20  1.50/0.80  0.20/0.49  2.57/0.02
  1fpt       1frg       1fvc       1fvd       1gaf       1ggi       1gig
  1.31/0.22  1.13/1.23  0.12/0.43  0.04/0.87  1.85/0.06  0.18/1.05  0.96/1.08
  1hil       1hkl       1iai       1ibg       1igc       1igf       1igi
  0.16/0.37  1.63/0.17  1.45/0.38  0.22/0.17  0.44/0.68  0.63/0.77  0.36/0.44
  1igm       1ikf       1ind       1jel       1jhl       1kem       1mam
  0.68/0.40  0.58/0.76  0.08/1.58  1.02/0.33  0.01/0.79  0.65/1.13  0.40/0.81
  1mcp       1mlb       1mrd       1nbv       1ncb       1ngq       1nmb
  0.26/0.44  0.67/0.74  0.56/0.39  0.29/0.77  0.09/0.75  0.42/0.75  0.65/0.69
  1nsn       1opg       1plg       1rmf       1tet       1ucb       1vfa
  2.31/0.38  0.32/0.01  0.08/0.89  1.54/0.27  0.72/0.31  0.69/0.55  0.54/0.58
  1vge       1yuh       2cgr       2fb4       2fbj       2gfb       3hfl
  0.67/0.40  1.03/0.13  0.09/1.35  0.68/1.82  0.60/0.80  0.11/0.35  0.78/0.30
  3hfm       6fab       7fab
  0.38/1.04  0.62/0.75  0.06/1.74

Decoy set: ig structal hires (average 0.21/0.71, comparison 0/18)
  1dvf       1fgv       1flr       1fvc       1gaf       1hil       1ind
  0.24/0.37  0.61/0.27  0.26/0.52  0.03/0.46  1.45/0.07  0.02/0.46  0.04/1.23
  1kem       1mlb       1nbv       1opg       1vfa       1vge       2cgr
  0.59/1.02  0.44/0.77  0.23/0.65  0.40/0.01  0.38/0.56  0.32/0.55  0.22/1.31
  2fb4       2fbj       6fab       7fab
  0.48/1.35  0.52/0.78  0.46/0.75  0.07/1.70

Total comparison: 36/109

Table 2. Correlations with the RMSD calculated using the original MJ contact energy $E$ and the modified scoring function $E - \langle E \rangle$. Each protein entry and each average is given as A/B, where A and B correspond to the correlations with the RMSD of $E$ and $E - \langle E \rangle$, respectively. Each comparison is given as A/B, where A and B correspond to the number of proteins for which $E$ is superior and the number of proteins for which $E - \langle E \rangle$ is superior, respectively.

Decoy set: 4-state reduced (average 0.28/0.50, comparison 0/7)
  1ctf       1r69       1sn3       2cro       3icb       4pti       4rxn
  0.34/0.58  0.21/0.50  0.19/0.38  0.39/0.57  0.43/0.68  0.16/0.29  0.27/0.49

Decoy set: fisa (average 0.19/0.25, comparison 1/3)
  1fc2       1hdd-C     2cro       4icb
  0.22/0.35  0.17/0.31  0.18/0.20  0.17/0.12

Decoy set: fisa casp3 (average 0.19/0.09, comparison 4/1)
  1bg8-A     1bl0       1eh2       1jwe       smd3
  0.26/0.19  0.38/0.40  0.26/0.23  0.12/0.22  0.19/0.14

Decoy set: lattice ssfit (average 0.02/0.02, comparison 3/1)
  1beo       1ctf       1fca       1nkl
  0.04/0.02  0.06/0.04  0.02/0.03  0.01/0.04

Decoy set: lmds (average 0.08/0.02, comparison 8/2)
  1b0n-B     1bba       1ctf       1dtk       1fc2       1igd       1shf-A
  0.13/0.29  0.03/0.11  0.18/0.09  0.19/0.08  0.03/0.14  0.13/0.12  0.07/0.06
  2cro       2ovo       4pti
  0.11/0.02  0.17/0.21  0.06/0.06

Decoy set: semfold (average 0.06/0.06, comparison 1/3)
  1ctf       1khm       1nkl       1pgb
  0.09/0.10  0.08/0.04  0.02/0.04  0.04/0.06

Decoy set: hg structal (average 0.66/0.72, comparison 5/23)
  1ash       1bab-B     1col-A     1cpc-A     1ecd       1emy       1flp
  0.50/0.51  0.82/0.85  0.67/0.53  0.69/0.60  0.63/0.67  0.62/0.74  0.55/0.71
  1gdm       1hbg       1hbh-A     1hbh-B     1hda-A     1hda-B     1hlb
  0.78/0.85  0.46/0.60  0.81/0.81  0.78/0.82  0.85/0.84  0.84/0.89  0.52/0.57
  1hlm       1hsy       1ith-A     1mba       1mbs       1myg-A     1myj-A
  0.05/0.12  0.62/0.70  0.60/0.72  0.67/0.78  0.57/0.58  0.71/0.77  0.73/0.81
  1myt       2dhb-A     2dhb-B     2lhb       2pgh-A     2pgh-B     4sdh-A
  0.65/0.71  0.89/0.86  0.77/0.89  0.47/0.56  0.92/0.90  0.83/0.86  0.60/0.81

Decoy set: ig structal (average 0.38/0.44, comparison 7/49)
  1acy       1baf       1bbd       1bbj       1dbb       1dfb       1dvf
  0.49/0.57  0.55/0.51  0.39/0.49  0.44/0.50  0.47/0.54  0.37/0.40  0.48/0.50
  1eap       1fai       1fbi       1fgv       1fig       1flr       1for
  0.33/0.38  0.44/0.51  0.36/0.44  0.44/0.49  0.31/0.42  0.43/0.50  0.32/0.49
  1fpt       1frg       1fvc       1fvd       1gaf       1ggi       1gig
  0.40/0.49  0.54/0.60  0.17/0.09  0.50/0.58  0.38/0.43  0.49/0.52  0.36/0.33
  1hil       1hkl       1iai       1ibg       1igc       1igf       1igi
  0.51/0.58  0.37/0.44  0.45/0.54  0.22/0.21  0.53/0.55  0.53/0.58  0.20/0.23
  1igm       1ikf       1ind       1jel       1jhl       1kem       1mam
  0.43/0.55  0.33/0.36  0.39/0.43  0.36/0.45  0.40/0.36  0.45/0.52  0.17/0.27
  1mcp       1mlb       1mrd       1nbv       1ncb       1ngq       1nmb
  0.42/0.57  0.41/0.46  0.19/0.26  0.42/0.49  0.53/0.54  0.34/0.42  0.05/0.01
  1nsn       1opg       1plg       1rmf       1tet       1ucb       1vfa
  0.32/0.52  0.45/0.45  0.47/0.52  0.44/0.50  0.43/0.56  0.58/0.58  0.18/0.25
  1vge       1yuh       2cgr       2fb4       2fbj       2gfb       3hfl
  0.01/0.13  0.16/0.13  0.42/0.57  0.44/0.54  0.42/0.42  0.23/0.19  0.02/0.12
  3hfm       6fab       7fab
  0.45/0.50  0.45/0.52  0.47/0.48

Decoy set: ig structal hires (average 0.36/0.50, comparison 1/17)
  1dvf       1fgv       1flr       1fvc       1gaf       1hil       1ind
  0.47/0.55  0.45/0.60  0.60/0.61  0.11/0.05  0.29/0.46  0.53/0.65  0.23/0.47
  1kem       1mlb       1nbv       1opg       1vfa       1vge       2cgr
  0.47/0.65  0.38/0.53  0.36/0.49  0.35/0.41  0.09/0.25  0.10/0.08  0.43/0.72
  2fb4       2fbj       6fab       7fab
  0.41/0.66  0.55/0.57  0.39/0.62  0.49/0.66

Total comparison: 30/106
Table 3. Ranks of the native structure calculated using the original MJ contact energy $E$ and the modified scoring function $E - \langle E \rangle$. Each protein entry is given as A/B, where A and B correspond to the rank of the native structure calculated by $E$ and $E - \langle E \rangle$, respectively. Each comparison is given as A/B, where A and B correspond to the number of proteins for which $E$ is superior and the number of proteins for which $E - \langle E \rangle$ is superior, respectively.

Decoy set: 4-state reduced (comparison 0/3)
  1ctf       1r69       1sn3       2cro       3icb       4pti       4rxn
  1/1        1/1        3/1        1/1        10/2       1/1        2/1

Decoy set: fisa (comparison 3/0)
  1fc2       1hdd-C     2cro       4icb
  30/263     2/10       1/3        1/1

Decoy set: fisa casp3 (comparison 5/0)
  1bg8-A     1bl0       1eh2       1jwe       smd3
  1/17       42/552     2/28       1/33       1/10

Decoy set: lattice ssfit (comparison 0/3)
  1beo       1ctf       1fca       1nkl
  13/1       2/1        1/1        13/1

Decoy set: lmds (comparison 6/0)
  1b0n-B     1bba       1ctf       1dtk       1fc2       1igd       1shf-A
  17/261     281/482    1/1        1/57       501/501    1/1        4/52
  2cro       2ovo       4pti
  1/1        1/8        1/62

Decoy set: semfold (comparison 2/2)
  1ctf       1eh2       1khm       1nkl       1pgb
  27/54      2/2        1245/68    654/24     562/696

Decoy set: hg structal (comparison 0/14)
  1ash       1bab-B     1col-A     1cpc-A     1ecd       1emy       1flp
  1/1        3/1        1/1        1/1        1/1        6/4        1/1
  1gdm       1hbg       1hbh-A     1hbh-B     1hda-A     1hda-B     1hlb
  1/1        1/1        9/9        8/2        6/1        1/1        7/1
  1hlm       1hsy       1ith-A     1mba       1mbs       1myg-A     1myj-A
  30/22      9/3        1/1        1/1        25/19      5/2        5/2
  1myt       2dhb-A     2dhb-B     2lhb       2pgh-A     2pgh-B     4sdh-A
  2/1        3/1        15/2       2/2        3/3        13/1       1/1

Decoy set: ig structal (comparison 0/59)
  1acy       1baf       1bbd       1bbj       1dbb       1dfb       1dvf
  51/10      52/34      46/7       46/1       58/22      55/36      28/18
  1eap       1fai       1fbi       1fgv       1fig       1flr       1for
  54/20      29/18      54/28      53/32      58/4       30/17      59/39
  1fpt       1frg       1fvc       1fvd       1gaf       1ggi       1gig
  58/29      7/2        42/20      35/5       59/40      29/1       8/4
  1hil       1hkl       1iai       1ibg       1igc       1igf       1igi
  41/24      58/35      58/18      27/2       50/12      14/10      41/17
  1igm       1ikf       1ind       1jel       1jhl       1kem       1mam
  51/22      46/8       33/1       56/51      36/4       51/1       45/4
  1mcp       1mlb       1mrd       1nbv       1ncb       1ngq       1nmb
  39/19      53/4       54/21      27/6       43/3       47/6       50/9
  1nsn       1opg       1plg       1rmf       1tet       1ucb       1vfa
  59/20      45/40      32/4       57/50      54/24      54/13      48/12
  1vge       1yuh       2cgr       2fb4       2fbj       2gfb       3hfl
  52/20      56/28      33/1       13/2       51/8       31/20      56/25
  3hfm       6fab       7fab
  23/2       50/6       36/1

Decoy set: ig structal hires (comparison 0/18)
  1dvf       1fgv       1flr       1fvc       1gaf       1hil       1ind
  10/7       17/12      9/4        15/6       19/13      12/7       11/1
  1kem       1mlb       1nbv       1opg       1vfa       1vge       2cgr
  17/1       16/3       12/4       17/14      16/4       16/5       9/1
  2fb4       2fbj       6fab       7fab
  6/2        15/3       18/1       12/1

Total comparison: 16/99

First, the Z-scores are shown in Table 1. For each scoring function $f$ ($E$ and $E - \langle E \rangle$), the Z-score is defined as $Z = (f_N - \langle f \rangle)/\sigma$, where $f_N$ is the scoring function of the native structure, $\langle f \rangle$ is the average of the scoring function measured from decoy structures, and $\sigma$ is the standard deviation of the scoring function in decoy structures. A large value of the Z-score indicates that the native structure can be distinguished well from the decoy structures. Thus, we want a scoring function that has a large Z-score. In Table 1, $E$ performs better than $E - \langle E \rangle$ for proteins in the fisa, the fisa casp3, and the lmds decoy sets, and $E - \langle E \rangle$ performs better than $E$ for most proteins in the lattice ssfit, the hg structal, the ig structal, and the ig structal hires decoy sets.
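A minimal sketch of the Z-score computation follows. It takes the spread of the decoy scores to be their population standard deviation, which is an assumption, and the function name `z_score` is hypothetical. With the formula as written and a score for which lower is better, a well-separated native structure gives a negative value; since Table 1 lists positive numbers, it presumably reports the magnitude or the opposite sign convention, which is also an assumption here.

```python
import statistics


def z_score(native_score, decoy_scores):
    """Return (f_N - <f>) / sigma for one protein and its decoy set.

    Assumption: sigma is the population standard deviation of the decoy scores.
    """
    mean = statistics.fmean(decoy_scores)
    sigma = statistics.pstdev(decoy_scores)
    if sigma == 0.0:
        raise ValueError("decoy scores have zero spread; Z-score is undefined")
    return (native_score - mean) / sigma
```

For example, `abs(z_score(e_native, e_decoys))` would give a quantity for which larger means better separation of the native structure, where `e_native` and `e_decoys` are hypothetical inputs holding either $E$ or $E - \langle E \rangle$ values.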
In Table 2, the correlations between each scoring function and the RMSD (from the native structure) are shown. Even for the fisa, the fisa casp3, and the lmds proteins, $E$ does not seem to have a higher correlation than $E - \langle E \rangle$, whereas $E - \langle E \rangle$ is superior to $E$ for the 4-state reduced, hg structal, ig structal, and ig structal hires sets. This means that $E - \langle E \rangle$ is better than $E$ in selecting native-like structures. It should be noted that the contact map of the native structure does not determine the native structure in a unique fashion; i.e., the reconstruction of the native structure from its contact map is not straightforward. However, contact maps are constructed from predetermined decoy structures, and the task of identifying the correct contact map of the native structure from among these decoy structures is important. Therefore, a good scoring function should show a good correlation with the RMSD measured from the native structure. When the correlation is high, a native-like structure is more likely to be identified as a native fold. Table 3 shows the ranks of native structures. As in Table 1, $E$ is superior to $E - \langle E \rangle$ for the fisa, the fisa casp3, and the lmds proteins, but for the other sets, $E - \langle E \rangle$ is superior.
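The other two measures, the correlation with the RMSD and the rank of the native structure, can be sketched as follows. The text does not specify which correlation coefficient is used, so the Pearson coefficient is an assumption; the rank convention (rank 1 is the lowest score, ties do not push the native structure down) is also an assumption, and both function names are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def native_rank(native_score, decoy_scores):
    """Rank of the native structure when scores are sorted in ascending order."""
    return 1 + sum(1 for score in decoy_scores if score < native_score)
```

With per-decoy lists of RMSD values and scores, `pearson_correlation(rmsds, scores)` gives the kind of number tabulated in Table 2, and `native_rank` gives the kind tabulated in Table 3.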
Taking Tables 1-3 together, $E$ seems to be better only for the fisa, the fisa casp3, and the lmds proteins, and $E - \langle E \rangle$ is better for the other sets. The reason that $E$ is better for fisa, fisa casp3, and lmds is as follows: For most proteins where $E$ performs better than $E - \langle E \rangle$, the native structures contain more contacts than the decoy structures do. That is, for these proteins, the native structures can be identified by considering only the total number of contacts, and the characteristics of $E$ are not a discriminating factor. The remaining cases where $E - \langle E \rangle$ did not perform better than $E$ involve very small chains (fewer than 45 amino acids). In summary, for decoy sets where the total number of contacts from the native structure is more or less similar to the total numbers of contacts from decoy structures, $E - \langle E \rangle$ performs consistently better than $E$ based on the Z-score, the correlation with the RMSD, and the rank of the native structure.
For the set of 4-state reduced decoys, we investigated the difference between $E$ and $E - \langle E \rangle$. These decoys are generated by keeping most of the native conformation fixed in its native form [1]; therefore, their conformations have evenly distributed RMSD values. The set of 4-state reduced decoys has many near-native conformations, and the RMSDs are well distributed at low and high values. For these 4-state reduced decoys, the difference in performance between $E$ and $E - \langle E \rangle$ in Table 1 and Table 3 is not significant, but Table 2 shows that $E - \langle E \rangle$ has a higher correlation with the RMSD than $E$ does for all proteins. This indicates that $E - \langle E \rangle$ could be more useful in finding native-like structures.
IV. CONCLUSION
For a given contact energy, we introduce a new scoring function, the difference between the original contact energy and the average contact energy calculated from random sequences. The new scoring function is shown to perform better than the original contact energy for decoy sets where the decoy structures have total numbers of contacts similar to that of the native structure. Out of 145 proteins from 9 decoy sets, the new scoring function is shown to be more useful for about 75 % of those proteins. These results suggest a better approach to distinguishing the native structure from decoy structures.
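As a quick arithmetic check of the "about 75 %" figure, the sketch below divides the total comparison counts quoted in Tables 1-3 by the 145 proteins. Which table the figure refers to is not stated in the text; reading it against the Table 1 total (109 of 145) is an assumption made here, and the names in the code are hypothetical.

```python
N_PROTEINS = 145  # number of proteins quoted in the conclusion

# Total comparison counts as printed in the tables:
# (proteins where E is superior, proteins where E - <E> is superior).
TABLE_TOTALS = {
    "Z-score (Table 1)": (36, 109),
    "correlation (Table 2)": (30, 106),
    "rank (Table 3)": (16, 99),
}


def fraction_favoring_new_score(new_wins, n_proteins=N_PROTEINS):
    """Fraction of proteins for which the new scoring function is superior."""
    return new_wins / n_proteins


if __name__ == "__main__":
    for measure, (_, new_wins) in TABLE_TOTALS.items():
        print(f"{measure}: {fraction_favoring_new_score(new_wins):.1%}")
```

The Z-score total gives 109/145, or roughly 75 %, while the correlation and rank totals give somewhat lower fractions because ties are not counted for either scoring function.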
ACKNOWLEDGMENTS
This work was supported by the Ministry of Science
and Technology (Jung & Moon) and by grant No. R01-
2003-000-11595-0 (Lee) from the Basic Research Pro-
gram of the Korean Science & Engineering Foundation.
REFERENCES
[1] B. Park and M. Levitt, J. Mol. Biol. 258, 367 (1996).
[2] C. Anfinsen, Science 181, 223 (1973).
[3] C. Branden and J. Tooze, Introduction to Protein Structure (New York, Freedman, 1991).
[4] S. Tanaka and H. Scheraga, Macromolecules 9, 945 (1976).
[5] S. Miyazawa and R. L. Jernigan, Macromolecules 18, 534 (1985).
[6] S. Miyazawa and R. L. Jernigan, J. Mol. Biol. 256, 623 (1996).
[7] S. Miyazawa and R. L. Jernigan, Proteins: Struct. Funct. Genet. 34, 49 (1999).
[8] D. Hinds and M. Levitt, Proc. Natl. Acad. Sci. USA 89, 2536 (1992).
[9] D. Tobi, G. Shafran, N. Linial and R. Elber, Proteins: Struct. Funct. Genet. 40, 71 (2000).
[10] J. Skolnick, A. Kolinski and A. Ortiz, Proteins: Struct. Funct. Genet. 38, 3 (2000).
[11] I. Bahar and R. L. Jernigan, J. Mol. Biol. 266, 195 (1997).
[12] E. Huang, S. Subbiah and M. Levitt, J. Mol. Biol. 252, 709 (1995).
[13] E. Huang, S. Subbiah, J. Tsai and M. Levitt, J. Mol. Biol. 257, 716 (1996).
[14] L. Mirny and E. Shakhnovich, J. Mol. Biol. 264, 1164 (1996).
[15] B. Park and M. Levitt, J. Mol. Biol. 266, 831 (1997).
[16] D. Mohanty, B. N. Dominy, A. Kolinski, C. L. Brooks and J. Skolnick, Proteins: Struct. Funct. Genet. 35, 447 (1999).
[17] E. I. Shakhnovich, Phys. Rev. Lett. 72, 3907 (1994).
[18] J. Lee, S. Y. Kim and J. Lee, J. Korean Phys. Soc. 44, 594 (2004).
[19] J. Sim, S. Y. Kim, A. Yoo and J. Lee, J. Korean Phys. Soc. 44, 611 (2004).
[20] M. Heo, S. Kim, E. J. Moon, M. Cheon, K. Chung and I. Chang, J. Korean Phys. Soc. 44, 1571 (2004).
[21] M. Cheon, M. Heo, E. J. Moon, S. Kim, K. Chung, I. Chang and H. Kim, J. Korean Phys. Soc. 44, 550 (2004).
Article
Full-text available
We perform protein structure prediction by combining a hybrid energy function, fragment assembly, and double optimization. In the hybrid energy function, all the backbone atoms are described explicitly, but the side-chain is modeled as a few interaction centers in order to reduce computational costs. We reduce the search space by using a fragment assembly method, where the local structure of the backbone is obtained from a structural database using similarity of sequence features, and only the global tertiary packing of fragments is determined by minimizing the energy. The structure with the minimum energy is obtained using double optimization, where a combination of backbone fragments with minimum energy is obtained using the conformational space annealing (CSA) method, and the optimal side-chains for a given backbone structure are obtained using simulated annealing. We show the feasibility of our method by performing test predictions on two proteins, 1bdd and 1e0l, that belong to distinct structural classes.
Article
We present a new four-body knowledge based potential for recognizing the native state of proteins from their misfolded states. This potential was extracted from a large set of protein structures determined by X-ray crystallography using the BetaMol, a software based on the recent theory of the beta-complex (β-complex) and quasi-triangulation of the Voronoi diagram of spheres. This geometric construct reflects the size difference among atoms in their full Euclidean metric; property not accounted for in a typical 3D Delaunay triangulation. The ability of this potential to identify the native conformation over a large set of decoys was evaluated. Experiments show that this potential outperforms a potential constructed with a classical Delaunay triangulation in decoy discrimination tests. The addition of a statistical hydrogen bond potential to our four-body potential allows a significant improvement in the decoy discrimination, in such a way that we are able to predict successfully the native structure in 90% of cases. © Proteins 2013;. © 2013 Wiley Periodicals, Inc.
Article
Full-text available
Protein structure prediction is a great challenge in molecular biophysics and bioinformatics. Most approaches to structure prediction use known structure information from the Protein Data Bank (PDB). In these approaches, it is most crucial to find a homologous protein (template) from the PDB to a query sequence and to align the query sequence to the template sequence. We propose a profile-profile alignment method based on the cosine similarity criterion, and combine this with a sequence-profile alignment, the secondary structure prediction of the query protein, and the experimental secondary structure of the template protein. Our method, which we call combined alignment, provides good results for the 1107 query-template pairs of the SCOP database and the CASP5 target proteins. They show that combined alignment significantly improves the recognition of distant homology.
Article
Full-text available
We introduce a novel approach to the study of the folding of proteins whose native structures are already known. We use an off-lattice atomistic potential energy. The parameters of the potential energy are simultaneously optimized for several proteins. The low-lying local-energy minima for these proteins are found by conformational space annealing. The parameters are modified in such a way that the native-like conformations are energetically more favored than the others. After the parameter optimization, one set of the parameters is obtained for the proteins. We then investigate Monte Carlo dynamics of these proteins by using this optimized potential energy. Our work is dis-tinguished from earlier work in the literature, where folding was achieved by using simplified models such as lattice models. We apply our method to four proteins: betanova, 1fsd, 1vii, and 1bdd, and observe that at appropriate temperatures they fold into their native structure, starting from various non-native states. In all cases, rapid collapse is followed by a subsequent folding process, that takes place on a longer timescale. We also observe that for all proteins at low temperatures, the prob-ability distributions of various quantities such as RMSD depend on initial conformations, showing their glassy behavior. At higher temperatures, this non-ergodic glassy behavior disappears. The results provide new insights into the folding mechanism, which is controlled not only by thermody-namic factors but also by kinetic factors. The way a protein folds into its native structure is also determined by the convergence point of early folding trajectories, which cannot be obtained from the free-energy surface.
Article
Full-text available
The relationship between the unfolding pseudo free energies of reduced and detailed atomic models of the GCN4 leucine zipper is examined. Starting from the native crystal structure, a large number of conformations ranging from folded to unfolded were generated by all-atom molecular dynamics unfolding simulations in an aqueous environment at elevated temperatures. For the detailed atomic model, the pseudo free energies are obtained by combining the CHARMM all-atom potential with a solvation component from the generalized Born, surface accessibility, GB/SA, model. Reduced model energies were evaluated using a knowledge-based potential. Both energies are highly correlated. In addition, both show a good correlation with the root mean square deviation, RMSD, of the backbone from native. These results suggest that knowledge-based potentials are capable of describing at least some of the properties of the folded as well as the unfolded states of proteins, even though they are derived from a database of native protein structures. Since only conformations generated from an unfolding simulation are used, we cannot assess whether these potentials can discriminate the native conformation from the manifold of alternative, low-energy misfolded states. Nevertheless, these results also have significant implications for the development of a methodology for multiscale modeling of proteins that combines reduced and detailed atomic models. Proteins 1999;35:447–452. © 1999 Wiley-Liss, Inc.
Article
The design and construction of a global protein energy function which can recognize the native folds of all representative proteins of different classes with a low sequence homology has been one of the important issues and a formidable task in protein science. We used perceptron learning and protein threading to construct a one-body score function of proteins, which could recognize simultaneously the native folds of 1,006 training proteins covering all available representative proteins in a sequence homology with less than 30% between them. When the score parameters for the 1,006 training proteins were subject to a threading test, 370 (96.9%) native folds of the 382 new distinct proteins were recognized compared to the previous score parameters obtained using 387 training proteins, which recognized 190 (89.2%) native folds of the 213 new proteins. We performed an analysis of the score parameters by using a singular value decomposition and a self-organizing map to elucidate the biological clustering characters of 20 amino acids. The self-organizing map analysis of the new score-parameters revealed better biological clustering of 20 amino acids, which agreed with their known properties. The same analysis for the previous score-parameters could not provide such a result because the 387 training proteins employed before did not fully cover all representative proteins of different classes of a sequence homology with less than 30% between them. We illuminated the marked difference in the new score-parameters that relative to previous score-parameters, not only performed better in recognizing the native folds of new distinct proteins but also captured better the biological clustering characters of amino acids.
Article
Recent attempts to construct a global pairwise contact energy function of amino acids for proteins have not succeeded in stabilizing the native states of many proteins simultaneously. In this paper, we show that the systematic inclusion of the local environments of the amino acids in the design of such a function leads to success in designing a global protein energy function. We design and construct two kinds of pairwise contact energy functions by considering either the secondary structures or the hydrophobicities (solvation) of the amino acids and by using perceptron learning and protein threading. These can stabilize all native states of 1,006 proteins simultaneously with 30% homology. When these two energy functions are subject to a threading test on 382 new distinct proteins, the energy function with the secondary structure information can stabilize 300 (78.5%) proteins out of 382 proteins whereas the energy function with the hydrophobicity information can stabilize 367 (96%) proteins. This illustrates the critical role played by the hydrophobicity of amino acids in stabilizing the essential structures of proteins. Both the hydrophobicity and the secondary structure are important to assess the protein structure, and the impact of the hydrophobicity is elucidated in this work through the process of designing global pairwise contact energies for proteins. We expect that the simultaneous inclusion of the hydrophobicity, the secondary structure, and other local environments, such as the polarity and the structures of neighboring of amino acids, will enable us to design better protein energy functions by using perceptron learning and protein threading.
Article
Effective interresidue contact energies for proteins in solution are estimated from the numbers of residue-residue contacts observed in crystal structures of globular proteins by means of the quasi-chemical approximation with an approximate treatment of the effects of chain connectivity. Employing a lattice model, each residue of a protein is assumed to occupy a site in a lattice and vacant sites are regarded to be occupied by an effective solvent molecule whose size is equal to the average size of a residue. A basic assumption is that the average characteristics of residue-residue contacts formed in a large number of protein crystal structures reflect actual differences of interactions among residues, as if there were no significant contribution from the specific amino acid sequence in each protein as well as intraresidue and short-range interactions. Then, taking account of the effects of the chain connectivity only as imposing a limit to the size of the system, i.e., the number of lattice sites or the number of effective solvent molecules in the system, the system is regarded to be the mixture of unconnected residues and effective solvent molecules. The quasi-chemical approximation, that contact pair formation resembles a chemical reaction, is applied to this system to obtain formulas that relate the statistical averages of the numbers of contacts to the contact energies. The number of effective solvent molecules for each protein is chosen to yield the total number of residue-residue contacts equal to its expected value for the hypothetical case of hard sphere interactions among residues and effective solvent molecules; the expected number of residue-residue contacts at this condition has been crudely estimated by means of a freely jointed chain distribution and an expansion originating in hard sphere interactions. 
Each residue is represented by the center of its side chain atom positions, and contacts among residues and effective solvent molecules are defined to be those pairs within 6.5 Å, a distance that has been chosen on the basis of the observed radial distribution of residues; nearest-neighbor pairs along a chain are explicitly excluded in counting contacts. Coordination numbers, for each type of residue as well as for solvent molecules, are estimated from the mean volume of each type of residue and used to evaluate the numbers of residue-solvent and solvent-solvent contacts from the numbers of residue-residue contacts. The estimated values of contact energies have reasonable residue-type dependences, reflecting residue distributions in protein crystals; nonpolar-residue-in and polar-residue-out are seen as well as the segregation of those residue groups. In addition, there is a linear relationship between the average contact energies for nonpolar residues and their hydrophobicities reported by Nozaki and Tanford; however, the magnitudes on average are about twice as large. The relevance of the results to protein folding and other applications is discussed.
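The contact definition described above can be sketched as follows. This is a minimal illustration, assuming that the side-chain centers are already available as 3D coordinates in ångströms and that "within 6.5 Å" includes the boundary. The function name and the toy coordinates are hypothetical.

```python
import math

CUTOFF = 6.5  # contact distance in angstroms, as quoted in the abstract


def contact_pairs(centers, cutoff=CUTOFF):
    """Return residue index pairs (i, j), i < j, whose centers lie within cutoff.

    Nearest neighbors along the chain (j == i + 1) are excluded, so the
    inner loop starts at i + 2.
    """
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 2, len(centers)):
            if math.dist(centers[i], centers[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Hypothetical side-chain centers for a five-residue toy chain.
toy_centers = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
    (20.0, 0.0, 0.0),
]
print(contact_pairs(toy_centers))  # [(0, 2), (0, 3), (1, 3)]
```

The returned pair list is one simple representation of a contact map; residue-solvent contacts, which the abstract derives from coordination numbers, are not modeled here.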
Article
A method is presented for the derivation of knowledge-based pair potentials that corrects for the various compositions of different proteins. The resulting statistical pair potential is more specific than that derived from previous approaches, as assessed by gapless threading results. Additionally, a methodology is presented that interpolates between statistical potentials, when no homologous examples to the protein of interest are in the structural database used to derive the potential, and a Go-like potential (in which native interactions are favorable and all nonnative interactions are not), when homologous proteins are present. For cases in which no protein exceeds 30% sequence identity, pairs of weakly homologous interacting fragments are employed to enhance the specificity of the potential. In gapless threading, the mean z score improves from −10.4 for the best statistical pair potential to −12.8 when the local sequence similarity, fragment-based pair potentials are used. Examination of the ab initio structure prediction of four representative globular proteins consistently reveals a qualitative improvement in the yield of structures in the 4 to 6 Å rmsd from native range when the fragment-based pair potential is used relative to that when the quasichemical pair potential is employed. This suggests that such protein-specific potentials provide a significant advantage relative to generic quasichemical potentials. Proteins 2000;38:3–16. ©2000 Wiley-Liss, Inc.
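The abstract reports z scores without defining them. A common convention, assumed here, is the native energy minus the mean decoy energy, divided by the standard deviation of the decoy energies, so that a more negative value means a better-separated native state. The sketch below uses the population standard deviation and hypothetical toy energies; the paper may use a different convention.

```python
import statistics


def z_score(native_energy, decoy_energies):
    """z = (E_native - mean(E_decoys)) / std(E_decoys), population std assumed."""
    mean = statistics.fmean(decoy_energies)
    std = statistics.pstdev(decoy_energies)
    if std == 0.0:
        raise ValueError("decoy energies have zero spread; z score is undefined")
    return (native_energy - mean) / std


# Hypothetical energies: the native lies three standard deviations below the decoy mean.
print(z_score(-3.0, [-1.0, 1.0]))  # -3.0
```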
Article
In a previous paper, a hypothesis for protein folding was proposed in which the native structure is formed by a three-step mechanism: (A) formation of ordered backbone structures by short-range interactions, (B) formation of small contact regions by medium-range interactions, and (C) association of the small contact regions into the native structure by long-range interactions. In this paper the empirical interaction parameters, used as a measure of the medium- and long-range interactions (the standard free energy, $\Delta G^{\circ}_{k,l}$, of formation of a contact between amino acids of species k and l) that include the role of the solvent (water) and determine the conformation of a protein in steps B and C, are evaluated from the frequency of contacts in the x-ray structures of native proteins. The numerical values of $\Delta G^{\circ}_{k,l}$ for all possible pairs of the 20 naturally occurring amino acids are presented. Contacts between highly nonpolar side chains of amino acids such as Ile, Phe, Trp, and Leu are shown quantitatively to be stable. On the contrary, contacts involving polar side chains of amino acids such as Ser, Asp, Lys, and Glu are significantly less stable. While this implies, in a quantitative manner, that it is generally more favorable for nonpolar groups to lie in the interior of the protein molecule and for the polar side chains to be exposed to the solvent (water) rather than to form contacts with other amino acids, many exceptions to this generalization are observed.