Recovery of Protein Structure from Contact Maps
Prediction of a protein's structure from its amino acid sequence is a key issue in molecular biology. While dynamics, performed in the space of two-dimensional contact maps, eases the necessary conformational search, it may also lead to maps that do not correspond to any real three-dimensional structure. To remedy this, an efficient procedure is needed to reconstruct three-dimensional conformations from their contact maps. We present an efficient algorithm to recover the three-dimensional structure of a protein from its contact map representation. We show that when a physically realizable map is used as target, our method generates a structure whose contact map is essentially similar to the target. Furthermore, the reconstructed and original structures are similar up to the resolution of the contact map representation. Next, we use nonphysical target maps, obtained by corrupting a physical one; in this case, our method essentially recovers the underlying physical map and structure. Hence, our algorithm will help to fold proteins, using dynamics in the space of contact maps. Finally, we investigate the manner in which the quality of the recovered structure degrades when the number of contacts is reduced. The procedure is capable of assigning quickly and reliably a three-dimensional structure to a given contact map. It is well suited for use in parallel with dynamics in contact map space to project a contact map onto its closest physically allowed structural counterpart.
arXiv:cond-mat/9705211v1 [cond-mat.soft] 21 May 1997
, Edo Kussell
and Eytan Domany
Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
Background: Prediction of a protein's structure from its amino acid sequence is a key issue in molecular biology. While dynamics, performed in the space of two-dimensional contact maps, eases the necessary conformational search, it may also lead to maps that do not correspond to any real three-dimensional structure. To remedy this, an efficient procedure is needed to reconstruct 3D conformations from their contact maps. Such a procedure is relevant also for interpretation of NMR and X-ray scattering experiments.
Results: We present an efficient algorithm to recover the three-dimensional structure of a protein from its contact map representation. First we show that when a physically realizable map is used as target, our method generates a structure whose contact map is essentially similar to the target. Furthermore, the reconstructed and original structures are similar up to the resolution of the contact map representation. Next we use non-physical target maps, obtained by corrupting a physical one; in this case our method essentially recovers the underlying physical map and structure. Hence our algorithm will help to fold proteins, using dynamics in the space of contact maps. Finally we investigate the manner in which the quality of the recovered structure degrades when the number of contacts is reduced.
Conclusion: The procedure is capable of assigning quickly and reliably a 3D structure to a given contact map. It is well suited to be used in parallel with dynamics in contact map space to project a contact map onto its closest physically allowed structural counterpart.
Considerable effort has been devoted to finding ways to predict a protein's structure from its known amino acid sequence $A = (a_1, \ldots, a_N)$. The contact map of a protein is a particularly useful representation of its structure [1,2]. For a protein of N residues the contact map is an N × N matrix S, whose elements are $S_{ij} = 1$ if residues i and j are in contact and $S_{ij} = 0$ otherwise. "Contact" can be defined in various ways: for example, in a recent publication Mirny and Domany (MD) defined $S_{ij} = 1$ when a pair of heavy (all but hydrogen) atoms, one from amino acid i and one from j, can be found whose distance is less than 4.5 Å. Secondary structures are easily detected from the contact map. Alpha helices appear as thick bands along the main diagonal, since they involve contacts between one amino acid and its four successors. Parallel or anti-parallel beta sheets appear as thin bands, parallel or anti-parallel to the main diagonal. On the other hand, the overall tertiary structure is not easily discerned. The main idea of MD was to use this representation to perform a search, executed in the space of possible contact maps S, for a fixed sequence A, to identify maps of low "energy" H(A, S). They defined the energy H(A, S) as the negative logarithm of the probability that structures whose contact map is S occur for a protein with the sequence A; therefore a map of low energy corresponds to a highly probable structure.
One of the most problematic aspects of their work was the fact that by performing an unconstrained search in the space of contact maps, i.e. freely flipping matrix elements from 1 to 0 and vice versa, one obtains maps of very low energy which have no physical meaning, since they do not correspond to realizable conformations of a polypeptide chain. To overcome this problem, MD introduced heuristic restrictions on the possible changes one is allowed to make to a contact map, arguing that if one starts with a physically realizable map, the moves allowed by these restrictive dynamic rules will generate maps that also are physically realizable. Even though their heuristic rules did seem to modify the dynamics in the desired way, there is no rigorous proof that indeed one is always left in the physical subspace, there is no clear evidence that the resulting rules are not too restrictive and, finally, the need to start with a physical fold, copied from a protein of known structure, may bias the ensuing search and get it stuck in some local minimum of the energy.
The aim of the present publication is to present a method to overcome these difficulties. The idea is to provide a test, which can be performed "on line" and in parallel with the dynamics in the space of contact maps, which will "project" any map onto a nearby one which is guaranteed to be in the subspace of physically realizable maps. That is, for a given target contact map S, we search for a conformation that a "string of beads" can take, such that the contact map $S'$ of our string is similar (or close) to S. Needless to say, the contact map associated with a string of beads is, by definition, physical.
This particular aim highlights the difference between what we are trying to accomplish and existing methods [2,4–8], that use various forms of distance geometry to predict a 3D structure of a real protein from its contact map. Rather than being concerned with obtaining a structure that is close to a real crystallographic one, we want mainly to check whether a map S is physically possible or not and, if not, to propose some $S'$ which is physical and, at the same time, is not too different from S. The method has to be fast enough to run in parallel with the search routine (that uses H(A, S) to identify candidate maps S of low energy). Another important requirement is to be able to recover contacts that do not belong to secondary structure elements and may be located far from the map's diagonal. Such contacts are important to nail down the elusive global fold of the protein. We believe that the main advantage of performing a dynamic search in the space of contact maps is the ease with which such contacts can be introduced, whereas creating them in a molecular dynamics or Monte Carlo procedure of a real polypeptide chain involves coherent moves of large sections of the molecule, moves which take a very long time to perform. To make sure that this advantage is preserved, our method must be able to efficiently find such conformations, if they are possible, once a new target contact has been proposed.
We are currently working on combining the method presented here with dynamics in the space of contact maps. The results of the combined procedure will be presented in a future publication.
The plan of the paper is as follows.
1. We give a detailed explanation of the method.
2. We show how it works on native maps of proteins with the number of residues ranging from N=56 to
N=581. Success of the algorithm is measured in terms of the number of contacts recovered and the
root mean square displacements of the recovered 3D structures from the native ones.
3. We study the answers given by our algorithm when it faces the task of finding a structure, using an
unphysical contact map as its target. As the first test, we added and removed contacts at random
to a physical map and found that the reconstructed structure did not change by much, i.e. we still
could reconstruct the underlying physical structure. As a second test we used the constrained dynamic
rules proposed by MD; starting from an experimental contact map, we obtained a new map by a
denaturation-renaturation computer experiment. Since the MD rules are heuristic, this map is not
guaranteed to be physical. Our reconstruction method projects a non-physical map onto one that is
close to it and physically allowed.
4. We discuss the extent to which the quality of the structure, obtained from a map, gets degraded when
the number of given contacts is reduced. This issue has considerable importance beyond the scope of
our present study, since experimental data (disulfide bridge determination, crosslinking studies, NMR)
often are available only for a small number of contacts. Clearly, the more contacts one has, the smaller
is the number of possible conformations of a chain that are consistent with the constraints contained
in the contact map. The question we address is when this reduction of possible conformations
suffices to define the corresponding structure with satisfactory accuracy.
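The first test in item 3 (adding and removing contacts at random) can be illustrated with a minimal sketch. It assumes the map is stored as a symmetric list of lists of 0/1 values; the function names `corrupt_map` and `map_difference`, and the `min_sep` parameter that leaves the diagonal and nearest neighbours untouched, are hypothetical choices made for this illustration and are not taken from the paper.

```python
import random

def corrupt_map(S, n_flips, min_sep=2, seed=0):
    """Return a copy of the symmetric 0/1 map S with n_flips random entries flipped.

    Flipping a 0 adds a contact, flipping a 1 removes one.
    min_sep (an assumption of this sketch) skips pairs with |i - j| < min_sep.
    """
    rng = random.Random(seed)
    n = len(S)
    pairs = [(i, j) for i in range(n) for j in range(i + min_sep, n)]
    noisy = [row[:] for row in S]
    for i, j in rng.sample(pairs, n_flips):
        noisy[i][j] = noisy[j][i] = 1 - noisy[i][j]
    return noisy

def map_difference(S1, S2):
    """Number of residue pairs (i < j) on which two maps disagree."""
    n = len(S1)
    return sum(S1[i][j] != S2[i][j] for i in range(n) for j in range(i + 1, n))

if __name__ == "__main__":
    empty = [[0] * 6 for _ in range(6)]
    noisy = corrupt_map(empty, n_flips=3)
    print(map_difference(empty, noisy))
```

A count such as `map_difference` gives one simple way to compare the map of a reconstructed chain with either the corrupted target or the underlying physical map.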
In this work we adopt a widely used definition of contact: two amino acids, $a_i$ and $a_j$, are in contact if their distance $d(a_i, a_j)$ is less than a certain threshold $d_t$. The distance $d(a_i, a_j)$ is defined as
$$d(a_i, a_j) = |r_i - r_j|, \qquad (1)$$
where $r_i$ and $r_j$ are the coordinates of the $C_\alpha$ atoms of amino acids i and j.
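As a minimal sketch of this definition, the following computes a contact map from a list of C-alpha coordinates. The function name `contact_map`, the convention of leaving the diagonal at 0, and the threshold value 9.0 used in the example are assumptions made for illustration only.

```python
import math

def contact_map(coords, threshold):
    """Build the N x N contact map from C-alpha coordinates.

    Residues i and j are in contact when their distance is below threshold.
    The diagonal is left at 0 (a convention chosen for this sketch).
    """
    n = len(coords)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                S[i][j] = S[j][i] = 1
    return S

if __name__ == "__main__":
    # Four beads on a straight line, 3.8 apart; the threshold is illustrative.
    line = [(3.8 * k, 0.0, 0.0) for k in range(4)]
    for row in contact_map(line, threshold=9.0):
        print(row)
```

In the straight-line example, beads separated by one or two bonds fall below the threshold, while the two end beads do not.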
The algorithm is divided into two parts. The first part, growth, consists of adding one monomer at a time, i.e. a step by step growth of the chain. The second part, adaptation, is a refinement of the structure, obtained as a result of the growth stage, by local moves. In both stages, to bias the dynamics, we introduce cost functions defined on the basis of the contact map. Such cost functions contain only geometric constraints, and do not resemble the true energetics of the polypeptide chain.
We first describe the growth stage of our procedure.
1. Single monomer addition.
Suppose we have grown i − 1 monomers and we want to add point i to the chain. To place it, as shown in Fig. 1, we generate at random $N_t$ trial positions $r_i^{(j)}$ (typically $N_t$ = …),
$$r_i^{(j)} = r_{i-1} + r^{(j)}, \qquad (2)$$
where $j = 1, \ldots, N_t$ and $r^{(j)}$ is a vector whose direction is selected from a uniform distribution, whereas its length is distributed normally with average $r_0$ and variance σ. Since in our representation monomers identify the $C_\alpha$ positions, we took $r_0$ = 3.79 Å and σ = 0.04 Å. We assign a probability $p^{(j)}$ to each trial in the following way. For each trial point $r_i^{(j)}$ we calculate the contacts that it has (see eq. (1)) with the previously positioned points $r_1, \ldots, r_{i-1}$. Contacts that should be present, according to the given contact map, are encouraged and contacts that should not be there are discouraged according to a cost function $E_g$, to be specified below. One out of the $N_t$ trials is chosen according to the probability
$$p^{(j)} = \frac{1}{Z} \exp\left(-E_g^{(j)} / T_g\right), \qquad (3)$$
where the normalization factor is given by
$$Z = \sum_{j=1}^{N_t} \exp\left(-E_g^{(j)} / T_g\right). \qquad (4)$$
The notation for the cost function $E_g$ and for the parameter $T_g$ that guide the growth are chosen in the spirit of the Rosenbluth method to suggest their reminiscence to energy and temperature, respectively.
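The single-step procedure can be sketched as follows, under several stated assumptions. The trial is selected with a Boltzmann-like probability proportional to exp(−E/T), which is how the energy and temperature analogy is read here. The cost `mismatch_cost` is a hypothetical stand-in (one unit per contact that disagrees with the target map), not the cost function of the paper, which is specified later. The value σ = 0.04 is treated as a standard deviation, and all function names and the number of trials are illustrative.

```python
import math
import random

def random_bond(rng, r0=3.79, sigma=0.04):
    """Vector with uniformly random direction and normally distributed length."""
    # Normalizing a 3D Gaussian vector yields a direction uniform on the sphere.
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            break
    length = rng.gauss(r0, sigma)  # sigma used as a standard deviation (assumption)
    return [length * x / norm for x in v]

def mismatch_cost(trial, placed, target_row, threshold):
    """Hypothetical cost: number of contacts that disagree with the target map."""
    cost = 0
    for k, r in enumerate(placed):
        in_contact = math.dist(trial, r) < threshold
        if in_contact != bool(target_row[k]):
            cost += 1
    return cost

def add_monomer(placed, target_row, threshold, n_trials, temperature, rng):
    """Pick one of n_trials positions with probability proportional to exp(-E/T).

    Returns the chosen position and its normalized probability.
    """
    last = placed[-1]
    trials = [[a + b for a, b in zip(last, random_bond(rng))]
              for _ in range(n_trials)]
    costs = [mismatch_cost(t, placed, target_row, threshold) for t in trials]
    e_min = min(costs)  # shifting by the minimum leaves the probabilities unchanged
    weights = [math.exp(-(c - e_min) / temperature) for c in costs]
    z = sum(weights)
    probs = [w / z for w in weights]
    j = rng.choices(range(n_trials), weights=probs)[0]
    return trials[j], probs[j]

if __name__ == "__main__":
    rng = random.Random(1)
    position, p = add_monomer([[0.0, 0.0, 0.0]], [1], 9.0, 10, 1.0, rng)
    print(position, p)
```

Here `target_row` holds the target-map entries between the new monomer and each monomer already placed. Subtracting the minimum cost before exponentiating is only a numerical safeguard against underflow at low temperature.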
2. Chain growth.
The step by step growth presented in the previous section optimizes the position of successive amino acids along the sequence. The main difficulty in the present method is that the single step of the growing chain has no information on the contacts that should be realized many steps (or monomers) ahead. To solve this problem, we carry out several attempts (typically 10) to reconstruct the structure, choosing the best one. In practice, this is done as follows.
For each attempt, when position $r_i$ is chosen for monomer i according to Eq. (3), its probability $p_i$ is accumulated in the weight
$$W_i = \prod_{k=1}^{i} p_k. \qquad (5)$$
When we have reached the end of the chain we store the weight $W_N$. The trial chain with the highest $W_N$ is chosen.
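The selection among several growth attempts can be sketched as below. This is an illustration under assumptions: the weight of an attempt is taken to be the product of the probabilities of the chosen positions (accumulated here as a sum of logarithms to avoid underflow for long chains), and the single-step routine is passed in as a callable. The names `grow_chain`, `best_of_attempts` and `toy_step` are hypothetical, and `toy_step` is a placeholder that does not look at any contact map.

```python
import math
import random

def grow_chain(n_monomers, step, rng):
    """Grow one chain; return it with the log of its accumulated weight.

    step(chain, i, rng) must return (position, probability) for monomer i.
    """
    chain = [[0.0, 0.0, 0.0]]
    log_w = 0.0
    for i in range(1, n_monomers):
        position, p = step(chain, i, rng)
        chain.append(position)
        log_w += math.log(p)  # log of the product of chosen probabilities
    return chain, log_w

def best_of_attempts(n_monomers, step, n_attempts=10, seed=0):
    """Run several independent growth attempts and keep the highest weight."""
    rng = random.Random(seed)
    best_chain, best_log_w = None, -math.inf
    for _ in range(n_attempts):
        chain, log_w = grow_chain(n_monomers, step, rng)
        if log_w > best_log_w:
            best_chain, best_log_w = chain, log_w
    return best_chain, best_log_w

def toy_step(chain, i, rng):
    """Placeholder step: extend along x and report a random probability in (0, 1]."""
    x, y, z = chain[-1]
    return [x + 3.79, y, z], 1.0 - rng.random()

if __name__ == "__main__":
    chain, log_w = best_of_attempts(5, toy_step)
    print(len(chain), log_w)
```

In a real reconstruction, `step` would be replaced by a routine that scores trial positions against the target contact map.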
3. Cost function.
The probabilities in Eq. (3) are calculated using the following cost function: