# Recovery of Protein Structure from Contact Maps

**Abstract**

Prediction of a protein's structure from its amino acid sequence is a key issue in molecular biology. While dynamics, performed in the space of two-dimensional contact maps, eases the necessary conformational search, it may also lead to maps that do not correspond to any real three-dimensional structure. To remedy this, an efficient procedure is needed to reconstruct three-dimensional conformations from their contact maps.
We present an efficient algorithm to recover the three-dimensional structure of a protein from its contact map representation. We show that when a physically realizable map is used as target, our method generates a structure whose contact map is essentially similar to the target. Furthermore, the reconstructed and original structures are similar up to the resolution of the contact map representation. Next, we use nonphysical target maps, obtained by corrupting a physical one; in this case, our method essentially recovers the underlying physical map and structure. Hence, our algorithm will help to fold proteins using dynamics in the space of contact maps. Finally, we investigate the manner in which the quality of the recovered structure degrades when the number of contacts is reduced.
The procedure is capable of assigning quickly and reliably a three-dimensional structure to a given contact map. It is well suited for use in parallel with dynamics in contact map space to project a contact map onto its closest physically allowed structural counterpart.

arXiv:cond-mat/9705211v1 [cond-mat.soft] 21 May 1997


Michele Vendruscolo¹, Edo Kussell² and Eytan Domany¹

¹ Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel

² Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel

Background: Prediction of a protein's structure from its amino acid sequence is a key issue in molecular biology. While dynamics, performed in the space of two-dimensional contact maps, eases the necessary conformational search, it may also lead to maps that do not correspond to any real three-dimensional structure. To remedy this, an efficient procedure is needed to reconstruct 3D conformations from their contact maps. Such a procedure is also relevant for the interpretation of NMR and X-ray scattering experiments.

Results: We present an efficient algorithm to recover the three-dimensional structure of a protein from its contact map representation. First we show that when a physically realizable map is used as target, our method generates a structure whose contact map is essentially similar to the target. Furthermore, the reconstructed and original structures are similar up to the resolution of the contact map representation. Next we use non-physical target maps, obtained by corrupting a physical one; in this case our method essentially recovers the underlying physical map and structure. Hence our algorithm will help to fold proteins using dynamics in the space of contact maps. Finally we investigate the manner in which the quality of the recovered structure degrades when the number of contacts is reduced.

Conclusion: The procedure is capable of assigning quickly and reliably a 3D structure to a given contact map. It is well suited to be used in parallel with dynamics in contact map space to project a contact map onto its closest physically allowed structural counterpart.

I. INTRODUCTION

Considerable effort has been devoted to finding ways to predict a protein's structure from its known amino acid sequence $A = (a_1, a_2, \ldots, a_N)$. The contact map of a protein is a particularly useful representation of its structure [1,2]. For a protein of $N$ residues the contact map is an $N \times N$ matrix $S$, whose elements are $S_{i,j} = 1$ if residues $i$ and $j$ are in contact and $S_{i,j} = 0$ otherwise. "Contact" can be defined in various ways: for example, in a recent publication Mirny and Domany (MD) [3] set $S_{i,j} = 1$ when a pair of heavy (all but hydrogen) atoms, one from amino acid $i$ and one from $j$, can be found whose distance is less than 4.5 Å. Secondary structures are easily detected from the contact map. Alpha helices appear as thick bands along the main diagonal, since they involve contacts between one amino acid and its four successors. The signatures of parallel or anti-parallel beta sheets are thin bands, parallel or anti-parallel to the main diagonal. On the other hand, the overall tertiary structure is not easily discerned. The main idea of MD was to use this representation to perform a search, executed in the space of possible contact maps $S$ for a fixed sequence $A$, to identify maps of low "energy" $H(A, S)$. They defined the energy $H(A, S)$ as the negative logarithm of the probability that structures whose contact map is $S$ occur for a protein with the sequence $A$; therefore a map of low energy corresponds to a highly probable structure.
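In symbols, the verbal definition of the MD energy can be written as below. The conditional-probability notation $P(S \mid A)$ and the choice of the natural logarithm are notational assumptions made here for illustration; they are not taken from MD.

$$H(A, S) = -\ln P(S \mid A)$$

With this form, for a fixed sequence $A$, $H(A, S_1) < H(A, S_2)$ holds exactly when $P(S_1 \mid A) > P(S_2 \mid A)$, which is the sense in which a low-energy map is a highly probable one.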

One of the most problematic aspects of their work was the fact that by performing an unconstrained search in the space of contact maps, i.e. freely flipping matrix elements from 1 to 0 and vice versa, one obtains maps of very low energy which have no physical meaning, since they do not correspond to realizable conformations of a polypeptide chain. To overcome this problem, MD introduced heuristic restrictions on the possible changes one is allowed to make to a contact map, arguing that if one starts with a physically realizable map, the moves allowed by these restrictive dynamic rules will generate maps that are also physically realizable. Even though their heuristic rules did seem to modify the dynamics in the desired way, three problems remain: there is no rigorous proof that one indeed always stays in the physical subspace; there is no clear evidence that the resulting rules are not too restrictive; and, finally, the need to start with a physical fold, copied from a protein of known structure, may bias the ensuing search and get it stuck in some local minimum of the energy.

The aim of the present publication is to present a method to overcome these difficulties. The idea is to provide a test, which can be performed "on line" and in parallel with the dynamics in the space of contact maps, which will "project" any map onto a nearby one which is guaranteed to be in the subspace of physically realizable maps. That is, for a given target contact map $S$, we search for a conformation that a "string of beads" can take, such that the contact map $S'$ of our string is similar (or close) to $S$. Needless to say, the contact map associated with a string of beads is, by definition, physical.

This particular aim highlights the difference between what we are trying to accomplish and existing methods [2,4–8], which use various forms of distance geometry [9] to predict a 3D structure of a real protein from its contact map. Rather than being concerned with obtaining a structure that is close to a real crystallographic one, we want mainly to check whether a map $S$ is physically possible or not and, if not, to propose some $S'$ which is physical and, at the same time, not too different from $S$. The method has to be fast enough to run in parallel with the search routine (that uses $H(A, S)$ to identify candidate maps $S$ of low energy). Another important requirement is to be able to recover contacts that do not belong to secondary structure elements and may be located far from the map's diagonal. Such contacts are important to nail down the elusive global fold of the protein. We believe that the main advantage of performing a dynamic search in the space of contact maps is the ease with which such contacts can be introduced, whereas creating them in a molecular dynamics or Monte Carlo procedure on a real polypeptide chain involves coherent moves of large sections of the molecule, moves which take a very long time to perform. To make sure that this advantage is preserved, our method must be able to efficiently find such conformations, if they are possible, once a new target contact has been proposed.

We are currently working on combining the method presented here with dynamics in the space of contact maps. The results of the combined procedure will be presented in a future publication.

The plan of the paper is as follows.

1. We give a detailed explanation of the method.

2. We show how it works on native maps of proteins with the number of residues ranging from $N = 56$ to $N = 581$. Success of the algorithm is measured in terms of the number of contacts recovered and the root mean square displacements of the recovered 3D structures from the native ones.

3. We study the answers given by our algorithm when it faces the task of finding a structure, using an unphysical contact map as its target. As the first test, we added and removed contacts at random to a physical map and found that the reconstructed structure did not change by much, i.e. we could still reconstruct the underlying physical structure. As a second test we used the constrained dynamic rules proposed by MD; starting from an experimental contact map, we obtained a new map by a denaturation-renaturation computer experiment. Since the MD rules are heuristic, this map is not guaranteed to be physical. Our reconstruction method projects a non-physical map onto one that is close to it and physically allowed.

4. We discuss the extent to which the quality of the structure obtained from a map gets degraded when the number of given contacts is reduced. This issue has considerable importance beyond the scope of our present study, since experimental data (disulfide bridge determination, crosslinking studies, NMR) often are available only for a small number of contacts. Clearly, the more contacts one has, the smaller is the number of possible conformations of a chain that are consistent with the constraints contained in the contact map. The question we address is when this reduction of possible conformations suffices to define the corresponding structure with satisfactory accuracy.

II. METHOD

In this work we adopt a widely used definition of contact: two amino acids, $a_i$ and $a_j$, are in contact if their distance $d(a_i, a_j)$ is less than a certain threshold $d_t$. The distance $d(a_i, a_j)$ is defined as

$$d(a_i, a_j) = |\mathbf{r}_i - \mathbf{r}_j|, \tag{1}$$

where $\mathbf{r}_i$ and $\mathbf{r}_j$ are the coordinates of the $C_\alpha$ atoms of amino acids $i$ and $j$.
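The contact definition of Eq. (1) can be illustrated with a minimal Python sketch, assuming NumPy. The function name `contact_map`, the toy coordinates and the threshold value `d_t = 8.0` are illustrative assumptions; the text above does not state a value for the threshold. Setting the diagonal to zero is also a convention chosen here.

```python
import numpy as np

def contact_map(coords, d_t):
    """Contact map S for C-alpha coordinates `coords` of shape (N, 3).

    S[i, j] = 1 when |r_i - r_j| < d_t, following Eq. (1).
    The diagonal is set to 0 (assumption: a residue is not its own contact).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # all pairwise difference vectors
    dist = np.linalg.norm(diff, axis=-1)             # N x N distance matrix
    S = (dist < d_t).astype(int)
    np.fill_diagonal(S, 0)
    return S

# Toy example: four beads on a straight line, 3.8 apart (illustrative values only).
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
S = contact_map(coords, d_t=8.0)
print(S)
```

Because the map is built from actual coordinates, it is symmetric and, in the language of the Introduction, physical by construction.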

The algorithm is divided into two parts. The first part, growth, consists of adding one monomer at a time, i.e. a step-by-step growth of the chain. The second part, adaptation, is a refinement of the structure obtained as a result of the growth stage, by local moves. In both stages, to bias the dynamics, we introduce cost functions defined on the basis of the contact map. Such cost functions contain only geometric constraints, and do not resemble the true energetics of the polypeptide chain.

A. Growth

We ﬁrst describe the growth stage of our procedure.

1. Single monomer addition.

[Fig. 1: Single monomer addition. Here $N_t = 5$.]

Suppose we have grown $i - 1$ monomers and we want to add point $i$ to the chain. To place it, as shown in Fig. 1, we generate at random $N_t$ trial positions (typically $N_t = 10$),

$$\mathbf{r}_i^{(j)} = \mathbf{r}_{i-1} + \mathbf{r}^{(j)}, \tag{2}$$

where $j = 1, \ldots, N_t$ and $\mathbf{r}^{(j)}$ is a vector whose direction is selected from a uniform distribution, whereas its length is distributed normally with average $r_a$ and variance $\sigma$. Since in our representation monomers identify the $C_\alpha$ positions, we took $r_a = 3.79$ and $\sigma = 0.04$. We assign a probability $p^{(j)}$ to each trial in the following way. For each trial point $\mathbf{r}_i^{(j)}$ we calculate the contacts that it has (see Eq. (1)) with the previously positioned points $\mathbf{r}_1, \ldots, \mathbf{r}_{i-1}$. Contacts that should be present, according to the given contact map, are encouraged and contacts that should not be there are discouraged according to a cost function $E_g$ that will be specified below. One out of the $N_t$ trials is chosen according to the probability

$$p^{(j)} = \frac{e^{-E_g^{(j)}/T_g}}{Z}, \tag{3}$$

where the normalization factor is given by

$$Z = \sum_{j=1}^{N_t} e^{-E_g^{(j)}/T_g}. \tag{4}$$

The notation for the cost function $E_g$ and for the parameter $T_g$ that guide the growth is chosen in the spirit of the Rosenbluth method [10], to suggest their resemblance to energy and temperature, respectively.
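The trial generation of Eq. (2) and the selection rule of Eqs. (3) and (4) can be sketched in Python as follows, assuming NumPy. All function names are hypothetical. The costs are passed in by the caller, since the cost function $E_g$ is only specified later in the text. Two further assumptions are made: $\sigma$ is used as the standard deviation of the bond length (the text calls it the variance), and uniform directions are drawn by normalizing isotropic Gaussian vectors.

```python
import numpy as np

def trial_positions(r_prev, n_trials=10, r_a=3.79, sigma=0.04, rng=None):
    """Generate N_t trial positions r_i^(j) = r_{i-1} + r^(j), as in Eq. (2)."""
    rng = np.random.default_rng() if rng is None else rng
    # Uniform directions on the sphere: normalize isotropic Gaussian vectors.
    directions = rng.normal(size=(n_trials, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Normally distributed lengths; sigma is treated as a standard deviation (assumption).
    lengths = rng.normal(loc=r_a, scale=sigma, size=n_trials)
    return np.asarray(r_prev, dtype=float) + lengths[:, None] * directions

def selection_probabilities(costs, T_g):
    """p^(j) = exp(-E_g^(j) / T_g) / Z, as in Eqs. (3) and (4)."""
    costs = np.asarray(costs, dtype=float)
    # Subtracting the minimum cost avoids overflow and cancels in the ratio.
    weights = np.exp(-(costs - costs.min()) / T_g)
    return weights / weights.sum()

def choose_trial(costs, T_g, rng=None):
    """Pick one trial index j with probability p^(j); return (j, p^(j))."""
    rng = np.random.default_rng() if rng is None else rng
    p = selection_probabilities(costs, T_g)
    j = int(rng.choice(len(p), p=p))
    return j, p[j]

# Usage with made-up costs (illustrative values only).
rng = np.random.default_rng(0)
trials = trial_positions(np.zeros(3), rng=rng)
p = selection_probabilities([0.0, 1.0, 2.0], T_g=1.0)
j, p_j = choose_trial([0.0, 1.0, 2.0], T_g=1.0, rng=rng)
```

A small $T_g$ concentrates the probability on the lowest-cost trial, while a large $T_g$ makes the choice nearly uniform over the $N_t$ trials.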

2. Chain growth.

The step-by-step growth presented in the previous section optimizes the position of successive amino acids along the sequence. The main difficulty in the present method is that a single step of the growing chain has no information on the contacts that should be realized many steps (or monomers) ahead. To solve this problem, we carry out several attempts (typically 10) to reconstruct the structure and choose the best one. In practice, this is done as follows.

For each attempt, when position $\mathbf{r}^{(j)}$ is chosen for monomer $i$ according to Eq. (3), its probability is accumulated in the weight

$$W_i = \prod_{k=1}^{i} p_k^{(j)}. \tag{5}$$

When we have reached the end of the chain we store the weight $W_N$. The trial chain with the highest $W_N$ is chosen.
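The bookkeeping of Eq. (5) and the choice of the best attempt can be sketched as follows, assuming NumPy. This is an illustration under stated assumptions, not the implementation used in this work. The step function `toy_step` is hypothetical and uses a placeholder cost (the $z$ coordinate of the trial) instead of the cost function $E_g$. The first monomer is fixed at the origin and contributes no factor to the weight. The bond length is held fixed at 3.79 for brevity.

```python
import numpy as np

def toy_step(chain, rng, n_trials=10, T_g=1.0):
    """Hypothetical single-monomer step with a placeholder cost (not E_g)."""
    d = rng.normal(size=(n_trials, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)   # uniform random directions
    trials = chain[-1] + 3.79 * d                   # fixed bond length for brevity
    costs = trials[:, 2]                            # placeholder cost: prefer low z
    w = np.exp(-(costs - costs.min()) / T_g)
    p = w / w.sum()                                 # Eqs. (3) and (4)
    j = rng.choice(n_trials, p=p)
    return trials[j], p[j]

def grow_chain(n_monomers, step, rng):
    """Grow one chain and accumulate log W_N, the log of the product in Eq. (5)."""
    chain = [np.zeros(3)]                           # first monomer at the origin (assumption)
    log_w = 0.0
    for _ in range(n_monomers - 1):
        pos, p = step(chain, rng)
        chain.append(pos)
        log_w += np.log(p)
    return np.array(chain), log_w

def best_attempt(n_attempts, n_monomers, step, rng):
    """Run several growth attempts and keep the one with the highest weight."""
    attempts = [grow_chain(n_monomers, step, rng) for _ in range(n_attempts)]
    return max(attempts, key=lambda a: a[1])

rng = np.random.default_rng(1)
chain, log_w = best_attempt(n_attempts=10, n_monomers=20, step=toy_step, rng=rng)
```

The sketch sums logarithms rather than multiplying probabilities: for long chains the product $W_N$ can underflow in floating point, while comparing $\ln W_N$ selects the same attempt.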

3. Cost function.

The probabilities in Eq. (3) are calculated using the following cost function:

E

g

(j)

=

i−1

X