
Completion of autobuilt protein models using a database of protein fragments

April 2012

68(Pt 4):328-35

DOI:10.1107/S0907444911039655

Source
PubMed

License
CC BY 2.0

Authors:

Kevin Cowtan

The University of York

Two developments in the process of automated protein model building in the Buccaneer software are presented. A general-purpose library for protein fragments of arbitrary size is described, with a highly optimized search method allowing the use of a larger database than in previous work. The problem of assembling an autobuilt model into complete chains is discussed. This involves the assembly of disconnected chain fragments into complete molecules and the use of the database of protein fragments in improving the model completeness. Assembly of fragments into molecules is a standard step in existing model-building software, but the methods have not received detailed discussion in the literature.


research papers

328 doi:10.1107/S0907444911039655 Acta Cryst. (2012). D68, 328–335

Acta Crystallographica Section D

Biological

Crystallography

ISSN 0907-4449

Completion of autobuilt protein models using a

database of protein fragments

Kevin Cowtan

Department of Chemistry, University of York,

Heslington, York YO10 5DD, England

Correspondence e-mail:

cowtan@ysbl.york.ac.uk


Received 2 June 2011

Accepted 27 September 2011

1. Background

This paper outlines two developments of relevance to the

problem of automated protein model building. The initial

application of these techniques in the Buccaneer model-

building software is presented. The developments are the

following.

(i) An efﬁciently searchable database of protein fragments

which may be used for diverse purposes including the conversion

of a Cα trace to main-chain models, the building of

missing loops and termini, and the correction of residue

insertions and deletions. This library has been implemented

for loop building in the Coot software (Emsley et al., 2010), as

well as for applications in automated model building described

here.

(ii) The automated ‘tidying’ of a fragmentary autobuilt

protein model, with the aim of reducing the manual effort

required to complete the model. Automated model building

sometimes leads to models which may consist of multiple

disconnected fragments, especially at low resolution or when

disordered loop regions are not visible in the electron density.

These fragments must be assembled into one or more mole-

cules, which may involve the application of symmetry opera-

tors and cell translations to some of the fragments. In the case

of noncrystallographic symmetry (NCS) it is also necessary to

assign the fragments to different copies of the molecule.

1.1. Use of databases of protein fragments

The use of databases of protein fragments in the determi-

nation and validation of atomic models is well established in

both manual and automated model building.

Kleywegt & Jones (1996) described the use of pentapeptide

fragments in the program OOPS for the validation of the

protein backbone trace.

Jones & Thirup (1986) used a database of pentapeptides

in the reconstruction of a main-chain trace from Cα positions

alone, although Payne (1993) claimed better results using

force ﬁelds. Esnouf (1997) used a library of 16 533 hexapep-

tide fragments in the same way to obtain main-chain coordi-

nates which matched the reﬁned X-ray structure to a high

precision.

Terwilliger (2003) employed a library of tripeptide frag-

ments to extend existing fragments of protein chain by adding

additional residues at the N- or C-terminus and Sheldrick

(2010) used tripeptides to ﬁnd initial protein fragments.

Joosten et al. (2008) used a library of pentapeptide fragments

in a similar way to build missing loops in protein structures.

The development described here recognizes the success of

these methods and describes an efﬁcient method for building

and searching a library of protein fragments of arbitrary

length (bounded by some chosen value). The database is

optimized for very fast homology searches, allowing the use of

a much larger database than in previous work. The use of a

much larger database also provides the potential to perform

searches restricted by residue-type ﬁlters without compro-

mising coverage beyond usefulness.

1.2. Tidying and completion of protein models

Automated model building typically produces as an inter-

mediate result a set of protein-chain fragments, some of which

may have been docked into the protein sequence. Ultimately,

these will need to be assembled into molecules. A problem

arises in determining how the fragments are connected to one

another. When protein molecules are tightly packed together

the molecule boundaries may not be obvious, and as a result

it is possible to link fragments which belong not to the same

chain but rather to symmetry-related chains. If the density for

the link is obvious, this step may be performed by automatic or

manual model completion; however, this is often not the case.

The problem becomes more complex in the case of non-

crystallographic symmetry (NCS). In this case, the fragments

must also be assigned to the correct NCS copy of the molecule,

as well as to the correct asymmetric unit. The problem may be

further complicated in the case of hetero-oligomers (protein

complexes consisting of heterogeneous sequences), although

this is mainly a bookkeeping problem.

Various approaches to model tidying are implemented in

the main automated model-building packages [for example,

ARP/wARP (Cohen et al., 2004) and RESOLVE (Terwilliger,

2003)], with the details varying according to the model-

building algorithm and the information available; however,

the details have not been widely discussed in the literature.

This paper presents the model-tidying steps implemented in

the Buccaneer software from v.1.5.

1.3. The Buccaneer software for automated model building

The Buccaneer software is used for automatic interpretation

of protein structures on the basis of the electron-density map

(Cowtan, 2006, 2008). The calculation is iterative, with

multiple cycles of model building interspersed with occasional

refinement steps using REFMAC (Murshudov et al., 2011) to

improve the current model and electron density. The steps

involved in a single cycle of model building are as follows.

(i) Finding Cα atoms: candidate Cα positions are located by

searching the electron density for likely features.

(ii) Growing fragments: the candidate Cα atoms (or input

chains) are grown by adding residues at either end, guided by

the electron density and constrained by the allowed region of

the Ramachandran plot.

(iii) Joining fragments: overlapping fragments are joined to

make longer chains.

(iv) Linking fragments: nearby N- and C-termini are

examined to see if they can be linked by inserting one or two

additional residues.

(v) Assigning sequence: likelihood comparison between the

density of each residue in the work structure and the density

from residues of a reference structure is used to identify

the likelihood of each residue being of a particular type.

Comparison with the known sequence allows longer fragments

to be matched to the sequence.

(vi) Correcting sequence: insertions and deletions in the

model as identiﬁed in the sequence-assignment step are

corrected by rebuilding to add or delete a residue where

possible.

(vii) Filtering fragments in poor density: residues which

have not been docked into the sequence and are in poor

density are removed.

(viii) Building NCS: any NCS relationships found in the

model are used to extend existing chains by combining all of

the NCS-related chains.

(ix) Pruning fragments: fragments which provide incon-

sistent interpretations of the same electron density are

examined. The poorer fragment is removed.

(x) Rebuilding: side-chain atoms and carbonyl O atoms are

added to the model.

This process is repeated over several cycles. In subsequent

cycles, the finding step is modified to preferentially find Cα

positions which are in regions where no model is present.

2. A library of protein fragments

A library of real protein fragments of arbitrary length is

employed to interpret electron density and correct existing

models. In order to support both interactive graphical model

building (where users demand immediate feedback) and

automated model building (where many possible model

fragments may need to be tested to match a particular

feature), it must be possible to perform a very rapid search for

fragments containing some atoms matching a desired confor-

mation.

For example, to fit the main-chain atoms to a Cα trace

the database will be searched for all six-peptide fragments

matching the Cα atoms surrounding a particular peptide bond

and the peptide atoms from the middle peptide of the

best-fitting fragment will be used to provide the main-chain atoms

for that peptide group. Similarly, to build a missing loop in a

protein structure a search will be performed for all fragments

for which the initial and final pairs of Cα atoms in the fragment

may be superimposed on the last two Cα atoms before the

break and the first two Cα atoms after the break.

A library has therefore been constructed using the 500 well

refined protein structures of the Richardsons’ ‘Top 500’ database

(Lovell et al., 2003), excluding residues for which the

temperature factors of the Cα atoms exceed 40 Å². This

provides a database of 106 295 amino acids in 1327 continuous

fragments. For each amino acid, the residue type and the

coordinates of the N, Cα and C atoms are stored (in turn

providing sufficient information to locate the Cβ and O

atoms). The entire database is stored as a single list of amino-

acid records.

The most frequent type of search which will be performed

on the database is to ﬁnd all fragments for which some

(possibly discontinuous) set of Cα atoms superpose well on the

Cα atoms of some search fragment. The search fragment is in

turn provided as a list of amino-acid records, with null records

inserted as placeholders to represent residues for which the

location is unknown. Thus, to search for a missing loop of four

residues, an eight-residue search fragment is constructed from

the two residues before the missing loop, four null residues

and the two residues after the missing loop.
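The construction of a loop search fragment can be sketched as follows. This is a minimal illustration only: the function name `loop_search_fragment` and the use of `None` as the null record are assumptions made here, not the representation used in Buccaneer, and the coordinates are invented.

```python
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]


def loop_search_fragment(before: List[Coord], after: List[Coord],
                         gap: int) -> List[Optional[Coord]]:
    """Return the known C-alpha positions flanking a gap, with None
    placeholders standing in for the residues whose location is unknown."""
    return list(before) + [None] * gap + list(after)


# Two residues before the break, four missing residues, two after it.
frag = loop_search_fragment(
    before=[(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    after=[(9.0, 5.0, 0.0), (12.8, 5.0, 0.0)],
    gap=4,
)
```

The resulting list has eight entries, of which only the first two and last two contribute to any distance comparison.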

Performing a least-squares superposition for every frag-

ment in the database would be computationally demanding, so

an initial pre-selection phase is performed to produce a subset

of fragments which may be good matches to the search frag-

ment. This pre-selection involves a computationally cheaper

distance-matrix score.

In order to minimize the computational overhead, distance

matrices for the search fragment and for the database are

precalculated. For the search fragment, a triangular matrix is

calculated with the first row giving the distances from the first

Cα to the remaining n − 1, the second row the distances from

the second Cα to the remaining n − 2 and so on. The columns

of this matrix correspond to the diagonals of the upper

triangle of a conventional distance matrix (illustrated in Fig. 1).

If an atom is missing, the distance is set to a negative ﬂag

value.

For the database of n residues, an n × 20 rectangular

‘running distance matrix’ is pre-calculated, with each row

giving distances from the first Cα to the following 20, thus

representing fragments of up to 21 residues. This is illustrated

in Fig. 2 for a reduced width of six residues. Any distances

which span chain boundaries are set to the ﬂag value.

In order to identify a set of possibly matching fragments, all

that needs to be done is to compare the non-missing values in

the fragment distance matrix to the corresponding values

obtained by starting from each row of the database distance

matrix in turn. A sum of squared differences is used to identify

likely matches.
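The pre-selection score described above might be sketched as below. This is an illustration under stated assumptions rather than the Buccaneer implementation: the names `fragment_matrix`, `running_matrix` and `ssd_score`, the flag value of -1.0 and the use of `None` for missing atoms are all hypothetical, and the search fragment is assumed to be no longer than the matrix width plus one.

```python
import math

FLAG = -1.0  # negative flag value marking a missing or invalid distance


def fragment_matrix(frag):
    """Triangular matrix: row i holds the distances from atom i to atoms
    i+1, i+2, ... of the search fragment (FLAG if either atom is missing)."""
    rows = []
    for i, a in enumerate(frag):
        rows.append([FLAG if a is None or b is None else math.dist(a, b)
                     for b in frag[i + 1:]])
    return rows


def running_matrix(coords, chain_ids, width=20):
    """Rectangular matrix: row i holds the distances from residue i to
    residues i+1 .. i+width, with FLAG past the end or across a chain end."""
    n = len(coords)
    rows = []
    for i in range(n):
        row = []
        for k in range(1, width + 1):
            j = i + k
            same_chain = j < n and chain_ids[j] == chain_ids[i]
            row.append(math.dist(coords[i], coords[j]) if same_chain else FLAG)
        rows.append(row)
    return rows


def ssd_score(frag_rows, db_rows, start):
    """Sum of squared differences over the non-missing fragment distances
    for the database window beginning at row `start`; None if invalid."""
    if start + len(frag_rows) > len(db_rows):
        return None
    total = 0.0
    for i, row in enumerate(frag_rows):
        for k, d in enumerate(row):
            if d == FLAG:
                continue  # unknown atom in the search fragment
            ref = db_rows[start + i][k]
            if ref == FLAG:
                return None  # window spans a chain boundary
            total += (d - ref) ** 2
    return total
```

Because both matrices are indexed by row and offset, scoring a window needs no coordinate arithmetic at search time, only look-ups and subtractions.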

To further optimize the calculation, the sum-of-squares

calculation may be terminated early as soon as the sum

exceeds a threshold value. The threshold value is controlled by

a parameter which determines how many matches will be

returned and is updated regularly by sorting the current list of

matches, truncating to the desired number and setting the

threshold to the value of the worst remaining match.
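A sketch of the early-termination scheme follows. It assumes, for simplicity, that each candidate has already been reduced to a flat list of distances to compare with the target; the names `capped_ssd` and `top_matches` and the `resort_every` parameter are hypothetical and are not taken from Buccaneer.

```python
def capped_ssd(a, b, limit):
    """Sum of squared differences, abandoning the candidate (returning
    None) as soon as the running sum exceeds `limit`."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > limit:
            return None
    return total


def top_matches(target, candidates, keep, resort_every=8):
    """Return the `keep` best (score, index) pairs. The threshold is
    tightened whenever the match list has grown by `resort_every` entries."""
    limit = float("inf")
    matches = []
    for idx, cand in enumerate(candidates):
        score = capped_ssd(target, cand, limit)
        if score is None:
            continue
        matches.append((score, idx))
        if len(matches) >= keep + resort_every:
            matches.sort()
            del matches[keep:]        # truncate to the desired number
            limit = matches[-1][0]    # score of the worst remaining match
    matches.sort()
    return matches[:keep]
```

Any candidate rejected by the threshold is already worse than `keep` candidates seen earlier, so the result is the same as a full sort; only the cost changes.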

The limitation of the distance-matrix score is that the

distance matrix of a set of coordinates is invariant under

inversion of these coordinates through a centre of symmetry,

and so the initial search also returns fragments which are the

inverse of the search fragment. The resulting list of candidate

fragments must therefore be re-scored using a full least-

squares superposition and r.m.s. difference calculation. The

resulting list is resorted according to the r.m.s. difference.
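The invariance that makes the second pass necessary can be demonstrated with a few lines of code. The sketch below is illustrative only: the coordinates are invented, and the signed-volume test is used here merely as a simple handedness indicator; it is not the least-squares superposition used for the actual re-scoring.

```python
import math
from itertools import combinations


def distances(points):
    """All pairwise distances, i.e. the content of a distance matrix."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def signed_volume(p0, p1, p2, p3):
    """Triple product of the edge vectors from p0; its sign flips when
    the coordinates are inverted through a centre of symmetry."""
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p0[i] for i in range(3)]
    w = [p3[i] - p0[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


frag = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (5.5, 4.0, 3.7)]
inverse = [(-x, -y, -z) for x, y, z in frag]

same_distances = all(math.isclose(a, b)
                     for a, b in zip(distances(frag), distances(inverse)))
opposite_hand = signed_volume(*frag) * signed_volume(*inverse) < 0
```

Both flags are true for this example: the distance-matrix score cannot tell the two fragments apart, whereas any coordinate-based comparison can.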

Figure 1

Running distance-matrix representation of a single fragment, where D_{i,j}

is the distance between the ith and jth Cα atoms. The shaded cells are those

available for loop fitting using only two Cα atoms at each end of the

fragment.

Figure 2

Running distance-matrix representation of the protein-chain database,

where D_{i,j} is the distance between the ith and jth Cα atoms. The shaded

cells are those which would be used to score the fit of a search fragment

against a particular range in the database.

For some purposes it may be desirable to restrict the search

to fragments for which the sequence obeys some criterion, for

example to take into account the different main-chain

conformations which can occur around Gly or Pro. This is

achieved by allowing a mask of 20 binary digits to be set for

each position in the search fragment, indicating which of the

20 amino-acid types are allowed to appear at that position in

the fragment. This provides an additional restriction on the

search results which may be evaluated by simple logical

operations.
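The residue-type mask lends itself to a short example. The sketch below assumes one-letter residue codes and an arbitrary bit order; neither is specified in the text, and the helper names `mask` and `sequence_allowed` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # bit order is an assumption
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}
ANY = (1 << 20) - 1                    # all 20 residue types allowed


def mask(allowed):
    """Build a 20-bit mask from a string of allowed one-letter codes."""
    m = 0
    for aa in allowed:
        m |= BIT[aa]
    return m


def sequence_allowed(sequence, masks):
    """True if every residue type is permitted at its position; each
    test is a single bitwise AND."""
    return all(BIT[aa] & m for aa, m in zip(sequence, masks))


# Require Gly at position 2 and forbid Pro at position 3.
masks = [ANY, mask("G"), ANY & ~mask("P")]
```

With these masks the sequence AGA is accepted, while AAA and AGP are rejected.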

3. Automated model tidying

The steps employed in the completion of the atomic model in

the current version of Buccaneer are as follows.

(i) The various fragments built by the chain-tracing and

sequence-docking algorithms are grouped into discontinuous

chains using a scoring function that rewards compactness and

penalizes sequence duplication. This removes a tedious

manual step of assigning chain IDs and renaming the resulting

chain fragments by hand.

(ii) Where there are discontinuities (or breaks) in the

resulting chains, an attempt is made to fix these discontinuities

by pruning any overlap and placing a fragment from a stored

database of protein fragments across the gap.

The steps involved in the grouping of fragments into chains

are described in detail in §§3.1, 3.2 and 3.3. The correction of

breaks is discussed in §3.4. These steps are inserted between

steps (ix) and (x) of the workflow described in §1.3.

3.1. Grouping fragments into chains

The process of grouping fragments into chains involves

assigning a chain identiﬁer to each fragment such that the

fragments which make up a single chain all have the same

chain identiﬁer. Furthermore, the resulting fragments may

need to be transformed by the application of crystallographic

symmetry elements to form a compact molecule.

In the simplest case of a single sequence with no noncrys-

tallographic symmetry (NCS), the process of allocating chain

identiﬁers is simply a matter of separating a set of fragments

which comprise a single complete chain from those which are

incorrectly built or sequenced (however, the remaining frag-

ments are retained with dummy chain identiﬁers in case they

contain correctly located but wrongly sequenced residues).

The general case involves two additional layers of

complexity. Firstly, there may be multiple copies of the

molecule in the asymmetric unit. In this case, multiple chains

with different chain identiﬁers must be built and each frag-

ment must be allocated to one of the chains in such a way as

to build several compact molecules. Secondly, in the case of

a hetero-complex there may be multiple distinct sequences

involved.

The basic steps of the calculation are as follows.

(i) In the case where multiple sequences are present, those

fragments which have been docked to one of the sequences

are sorted according to which sequence was used. Each

sequence is then considered in turn and the following steps are

applied to all the fragments belonging to that sequence.

(ii) A set of ‘seed’ fragments are identiﬁed by the method

described in §3.2, including one fragment from each NCS copy

of the molecule. The fragments are chosen such that they all

incorporate some common range of sequence numbers and

thus must belong to distinct copies of the molecule. The

selection of this range is made in such a way as to maximize

the number of NCS copies identiﬁed, subject to the validation

criteria described below.

(iii) The seed fragments are then grown by successively

adding an additional fragment to a seed by the method

described in §3.3. Each fragment is scored for its geometrical

proximity to each seed (taking into account crystallographic

symmetry) and penalized for any sequence overlap with that

seed. The fragment which obtains the highest score to be

docked to a seed is then added to that seed. The calculation

repeats until all fragments have been assigned or the highest

score fails to reach a threshold.

Steps (ii) and (iii) are repeated for each sequence until all

sequences have been considered. The fragments are then

assembled into chains by grouping all the fragments sharing a

chain identiﬁer in order of sequence number. In some cases,

sequence numbers of grouped fragments may overlap; in this

case, insertion codes are used to ensure that each residue is

uniquely identiﬁed.

3.2. Identification of seed fragments

The identiﬁcation of ‘seed’ fragments is performed as

follows. Firstly, a matrix is constructed whose order is the

number of fragments under consideration. The matrix is used

to store ﬂags identifying which fragments overlap. For each

pair of sequences, the number of residues of overlap is iden-

tiﬁed. If the overlap exceeds 12 residues and the overlapped

regions have similar conformations, the number of overlapped

residues is stored in the matrix. (In this context, a similar

conformation is identiﬁed by the least-squares superposition

of the best-matched 50% of the overlapped Cα coordinates

having an r.m.s. difference of less than 1 Å.)

A depth-ﬁrst permutation search is then performed to

identify the largest subset of fragments all of which overlap.

There will usually be multiple equal solutions; in this case, the

set is chosen for which the total number of residues in the

overlapping fragments is the greatest.

At ﬁrst glance the algorithm is computationally expensive,

since potentially 2^n sets must be considered, where n is the

number of fragments. In practice, the number of overlapped

sequences does not signiﬁcantly exceed the number of NCS

copies and depth searches may be terminated early if they

cannot match the current best solution; thus, in practice the

computational cost of this step is negligible.
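The seed search can be sketched as a depth-first search for the largest set of mutually overlapping fragments, with ties broken by total residue count. This is a minimal illustration: the function name, the form of the `overlap` matrix (a positive entry where two fragments overlap with similar conformations) and the pruning rule are assumptions consistent with the description above, not the Buccaneer source.

```python
def largest_overlapping_set(overlap, lengths):
    """Return the indices of the largest subset of fragments in which
    every pair overlaps; equal-sized subsets are ranked by the total
    number of residues they contain."""
    best = {"key": (0, 0), "set": []}

    def extend(chosen, candidates):
        key = (len(chosen), sum(lengths[i] for i in chosen))
        if key > best["key"]:
            best["key"], best["set"] = key, list(chosen)
        # Terminate early: this branch cannot even equal the best size.
        if len(chosen) + len(candidates) < best["key"][0]:
            return
        for pos, i in enumerate(candidates):
            # Keep only later candidates that overlap the new member, so
            # every set built here is mutually overlapping by construction.
            rest = [j for j in candidates[pos + 1:] if overlap[i][j] > 0]
            extend(chosen + [i], rest)

    extend([], list(range(len(lengths))))
    return best["set"]
```

Since the candidate list shrinks to the mutual neighbours at every level, the search depth is bounded by the size of the largest overlapping set, which in practice is close to the number of NCS copies.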

The fragments thus selected contain the same sequence of

residues in a similar conformation and thus can be assumed to

be different NCS copies of one part of the structure. Each of

the selected seed fragments is therefore allocated a different

chain identiﬁer and becomes the core of that chain.

3.3. Allocation of additional fragments to the chains

This step is performed iteratively. Every unallocated frag-

ment is considered and the score is calculated for adding that


fragment to each chain. The highest scoring chain/fragment

combination is selected and the fragment is added to that

chain. This will affect all subsequent scores for that chain and

therefore the calculation is then repeated from the start.

The scoring function rewards geometrical compactness and

penalizes sequence inconsistencies as follows. Each Cα atom

within 5 Å of a Cα atom which has already been allocated to a

given chain provides a score of +1 for adding the fragment to

that chain. Each residue which has been docked into sequence

with a sequence number clashing with a residue already

allocated to a given chain provides a score of −2 for adding

the fragment to that chain.

In this way, fragments which are in close contact with an existing

chain but which do not cover the same range of sequence are

added to that chain. The process continues until no positive

scores remain.
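The scoring function and the iterative allocation can be sketched as follows. The sketch is a simplification under stated assumptions: crystallographic symmetry is ignored, each residue is represented by a hypothetical `(sequence number or None, Cα coordinate)` pair, and the names `add_score` and `allocate` are invented for illustration.

```python
import math


def add_score(fragment, chain, radius=5.0, clash_penalty=2):
    """+1 for each fragment C-alpha within `radius` of a C-alpha already
    in the chain, -2 for each docked residue whose sequence number is
    already present in the chain."""
    near = sum(1 for _, p in fragment
               if any(math.dist(p, q) <= radius for _, q in chain))
    used = {s for s, _ in chain if s is not None}
    clashes = sum(1 for s, _ in fragment if s is not None and s in used)
    return near - clash_penalty * clashes


def allocate(chains, fragments):
    """Repeatedly add the best-scoring fragment to its chain, rescoring
    from the start each time, until no positive score remains."""
    chains = [list(c) for c in chains]
    pool = list(fragments)
    while pool and chains:
        score, ci, fi = max((add_score(f, c), ci, fi)
                            for ci, c in enumerate(chains)
                            for fi, f in enumerate(pool))
        if score <= 0:
            break
        chains[ci].extend(pool.pop(fi))
    return chains, pool
```

A fragment that lies beside a chain but duplicates its sequence numbers scores negatively and is left unallocated, which is the behaviour that separates NCS copies from one another.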

3.4. Correction of chain breaks

Often it will occur that there are gaps in the trace of the

protein chain. These most commonly occur for one of two

reasons.

(i) Flexible surface loops for which the electron density is

poor.

(ii) Mistracings where the chain trace has left the chain

(often following a side chain or disulﬁde bridge) and the chain

trace is then continued in a subsequent fragment.

For the gap to be corrected, any wrongly traced residues

(e.g. following a side chain or disulﬁde bridge) must ﬁrst be

removed by pruning back at least enough residues to remove

any duplicated sequence numbers from the ends of the two

fragments (multiple choices about how many residues to

prune from each end are possible and additional pruning may

be required to eliminate all mistraced residues, so multiple

prunings are tested) and then selecting a fragment from a

database of protein-chain fragments to bridge the gap.

Note that caution is required in this step. Earlier in the

Buccaneer calculation an attempt is made to link spatially

proximal N- and C-termini without regard to sequence.

Sometimes these linkages are made incorrectly. However, this

mistake is not serious, because when docking the resulting

chain to the sequence the two parts of the joined chain will

usually dock to different places in the sequence, at which point

the error can be corrected by breaking the chain again. When

linking chains on the basis of previously assigned sequences,

the use of the sequence to validate the link is no longer

available, so mistakes introduced at this stage will never be

corrected. As a result, it was found to be necessary to limit the

maximum length of the bridging fragment to six amino acids

(i.e. two amino acids overlap with each chain and a maximum

of two amino acids of gap). Longer missing loops must still

be built manually. Since the errors arise from the presence of

wrongly sequenced fragments which occur early in the model-

building process when the fragments are short, this constraint

should probably be relaxed to allow longer loops to be built

once the model is approaching completion, at which point

errors become less likely.

3.5. Additional applications of the fragment database

Two existing steps in the Buccaneer calculation were also

rewritten to make use of the fragment database. The ‘linking’

step (joining nearby N- and C-termini irrespective of

sequence) and ‘correction’ step (correcting insertions and

deletions by rebuilding one or three residues with two resi-

dues) both made use of a routine for building a loop of two

residues by searching over allowed Ramachandran angles.

Both of these steps have been replaced by an equivalent

implementation using the fragment database.

4. Results

Some preliminary results are presented here on the applicability

of the fragment database and on the automated model-

tidying features in the Buccaneer software.

4.1. Coverage as a function of fragment size in the fragment

database

To investigate the usefulness of the fragment database, an

exhaustive search was performed to test for a given fragment

length how well each fragment in the database can be repre-

sented by some other fragment from the database.

Each possible fragment of the chosen length was extracted

from the database in turn and used as a search model to ﬁnd

other similar fragments. In every case the best-ﬁtting fragment

will be the original fragment, so the best ﬁt is discarded and

the second-best match is used. Two statistics are calculated

for the matching fragment: the r.m.s. deviation between the

Cα-atom positions and those of the search fragment, and the

distance between the worst-matching Cα atom and the corresponding

atom in the search fragment. This calculation was

performed for fragments of six, nine and 12 residues (as would

be used in ﬁtting missing loops of two, ﬁve and eight residues,

respectively).
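The two statistics and the tail proportion derived from them can be sketched as below. The sketch assumes that the matching fragment has already been superposed on the search fragment, so that corresponding Cα atoms can be compared directly; the function names are hypothetical and the numbers in the usage lines are invented, not results from this work.

```python
import math


def rms_and_worst(a, b):
    """R.m.s. deviation and largest single-atom deviation between two
    equal-length lists of already-superposed C-alpha coordinates."""
    d = [math.dist(p, q) for p, q in zip(a, b)]
    return math.sqrt(sum(x * x for x in d) / len(d)), max(d)


def tail_fraction(values, cutoff):
    """Proportion of search fragments whose statistic is worse than
    (greater than) `cutoff`."""
    return sum(1 for v in values if v > cutoff) / len(values)


rms, worst = rms_and_worst([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                           [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
fraction = tail_fraction([0.2, 0.4, 1.3, 0.9], cutoff=1.0)
```

Evaluating `tail_fraction` over a range of cut-offs gives one curve of a tail plot.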

The results are shown in Fig. 3 as tail plots showing the

proportion of the search fragments for which the difference

from the database fragment is no worse than a given value.

The r.m.s. deviations are worse than 1.0 Å for 0.04% of six-residue

fragments, 5% of nine-residue fragments and 38% of

12-residue fragments. Given that a signiﬁcant proportion of

the fragments in the database will be in very similar helical

or strand conformations, this suggests that the library will be

of limited use for 12-residue fragments except for common

motifs.

Similarly, the worst deviating atom has a displacement of

worse than 1.5 Å for 0.05% of six-residue fragments, 5% of

nine-residue fragments and 36% of 12-residue fragments.

(Note the change in distance criterion compared with the

previous data.) This again suggests that 12-residue fragments

will be of less use, since automated reﬁnement is likely to

struggle to correct errors of this magnitude.

As a result, the database provides effectively complete

coverage for fragments of up to six residues or for loop ﬁtting

over only two missing residues. (This case was previously

handled by a simple Ramachandran search; however, the


database approach has the advantage of providing a computationally

cheaper sampling of conformation space which

increases in density as the frequency of that conformation

increases.)

For missing loops of intermediate length (3–6 residues), the

database will provide good loop conformations in a subset of

cases where the loop happens to match one in the database

and so will catch common turn motifs, for example. For longer

loops, the database is likely to be useful only in less frequent

cases. However, this approach has been shown to have good

success by Choi & Deane (2010) for loops of up to 20 residues

with a larger database of structures.

4.2. Automated model tidying in the Buccaneer software

The model-tidying procedure was applied to the same 55

test structures used in Cowtan (2008) and is detailed in the

supplementary material of that paper; the data were obtained

from the JCSG (Joint Center for Structural Genomics, 2006).

Of the resulting models, 29 contained fragments which were

grouped into chains by the tidying algorithm. Some of these

structures included multiple NCS copies of the structure and

therefore the total number of chains assembled was 50.

Each of the 50 tidied chains was examined to determine the

proportion of the chain corresponding to a single molecule in

the ﬁnal structure. As the model becomes more complete, the

assignment becomes easier, so these proportions are tabulated

along with the completeness of the chain in Table 1.

In every case where the chain is at least 60% complete, at

least 80% is correctly assigned to a single molecule and in 44

of 48 such cases the assignment is entirely correct or correct

apart from a few trailing residues. For the two cases where the

completeness is less than 50%, the grouping of fragments into

chains is rather less accurate.

The case of the 1vlu A chain (as labelled by Buccaneer; this

is actually the B chain in the deposited structure) is shown in

Fig. 4, in which 91% of the chain has been built but only 83%

of the residues built correspond to a single molecule. In this

case the deposited model contains chain breaks and the

Buccaneer model shows chain breaks in similar positions. The

disconnected range of residues 331–391 has been placed at the

wrong end of the molecule. It is probable that the error could

have been corrected in this speciﬁc case by adding a term

rewarding proximity of sequence number to the scoring

function; however, this was not tested because in the experi-

ence of the author the incorrect linking of chains across

protein contacts is a signiﬁcant problem in the early stages of

building and this problem is likely to be exacerbated by such a

change.

4.3. Application of the fragment database in the Buccaneer

software

The usefulness of the fragment database in automated

building was tested by rewriting two existing steps of the

Buccaneer calculation to make use of the database and by

adding a new loop-building step using the database, as

described in §4.2. The results of these changes were tested

individually and in combination.

Table 1

Reliability of the model-tidying algorithm as measured by the proportion

of each autobuilt chain corresponding to a single chain in the deposited

structure.

Structure (chain)    Proportion belonging to a single chain (%)    Chain completeness (%)
1vjn (A)             72                                            49
1zej (B)             75                                            46
1z85 (A)             81                                            90
1vlu (B)             81                                            73
1vlu (A)             83                                            91
1zej (A)             92                                            69
1vr8 (A)             95                                            99
1vp7 (C)             97                                            100
1vk3 (A)             98                                            90
41 cases             100                                           62–100

Figure 3

Tail plot of the proportion of search fragments for which the fit of the

best-matching fragment is worse than a given criterion for different

fragment lengths. (a) R.m.s. deviation of Cα positions between the best

database fragment and the search fragment; (b) maximum deviation of

any Cα positions between the best database fragment and the search

fragment.

The results of the model-building calculation are rather

sensitive to changes in the algorithm or input data, so to

determine whether each change made an improvement

multiple model-building runs were used. For each of the 55

test structures used in Cowtan (2008) ten model-building runs

were performed using ten different sets of free reﬂections for

both model building and refinement. The change in the set of

reﬂections used to calculate the initial map is sufﬁcient to

signiﬁcantly alter the results of the ﬁrst model-building step

and the differences propagate to subsequent cycles.

The percentage of the model built and correctly sequenced (measured by the percentage of residues built with the correct residue type and with the Cα within 1.9 Å of the correct position) was averaged over the 550 runs to obtain a score for this method.
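The scoring criterion above can be illustrated with a minimal Python sketch. It is not the code used in the paper: the function name, the input layout (lists of residue type and Cα coordinate) and the choice of the deposited model as the denominator are assumptions made for illustration, and crystallographic symmetry is ignored by assuming that both models are already in the same frame.

```python
import math

def percent_built_and_sequenced(reference, built, cutoff=1.9):
    """Percentage of reference residues reproduced in the built model.

    reference, built: lists of (residue_type, (x, y, z)) for C-alpha atoms.
    A reference residue counts as correct if some built residue has the
    same residue type and a C-alpha within `cutoff` angstroms of it.
    Hypothetical helper; the denominator (reference length) is an assumption.
    """
    if not reference:
        return 0.0
    matched = 0
    for ref_type, ref_xyz in reference:
        for built_type, built_xyz in built:
            if built_type == ref_type and math.dist(ref_xyz, built_xyz) <= cutoff:
                matched += 1
                break  # count each reference residue at most once
    return 100.0 * matched / len(reference)

def mean_score(scores):
    """Average the per-run percentages (550 runs in the paper) into one score."""
    return sum(scores) / len(scores)
```

Averaging the per-run values with `mean_score` corresponds to the single score per method described in the text.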

Furthermore, the entire set of calculations was then repeated using lower resolution data. For these calculations, the data resolution was truncated by 0.4 Å, the B factor was increased by 20 Å² and the density-modification step (using the Parrot software; Cowtan, 2010) was rerun on the truncated data. The resolutions of the original data sets vary over the range 1.4–3.2 Å and the truncated data over the range 1.8–3.6 Å.
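One way to picture this degradation of the data is the following Python sketch. The text does not say how the B-factor increase was applied, so the sketch assumes the standard Debye–Waller form, in which a structure-factor amplitude at spacing d is scaled by exp(−ΔB/4d²). The function name and the (d-spacing, amplitude) input layout are hypothetical.

```python
import math

def truncate_and_blur(reflections, original_limit, delta_res=0.4, delta_b=20.0):
    """Lower the effective resolution of a set of reflections.

    reflections: list of (d_spacing, amplitude), d_spacing in angstroms.
    original_limit: high-resolution limit of the original data (angstroms).
    Reflections finer than original_limit + delta_res are discarded and the
    remaining amplitudes are damped by an extra B factor of delta_b (A^2).
    Assumed procedure for illustration only.
    """
    new_limit = original_limit + delta_res
    degraded = []
    for d_spacing, amplitude in reflections:
        if d_spacing < new_limit:
            continue  # beyond the truncated resolution limit
        # Debye-Waller factor for amplitudes: exp(-B * sin^2(theta) / lambda^2)
        # with sin(theta) / lambda = 1 / (2 d)
        scale = math.exp(-delta_b / (4.0 * d_spacing ** 2))
        degraded.append((d_spacing, amplitude * scale))
    return new_limit, degraded
```

With the default values, a data set extending to 1.4 Å would be cut at 1.8 Å, which matches the ranges quoted above.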

The results of these calculations are shown in Table 2. The first step modified (‘link’) is the linking of chain fragments irrespective of sequence [step (iii) in the Buccaneer calculation], the next (‘correct’) is the correction of insertions and deletions during sequencing [step (v) in the Buccaneer calculation]. These steps were previously performed using an exhaustive search over allowed Ramachandran angles, in the first case to build a link of up to two residues and in the second to rebuild a stretch of either one or three residues with two residues. Finally, a new loop-building step was added, similar to the ‘link’ step but performed after the sequence has been assigned to the chains. Unlike the ‘link’ step, the loop-building step may prune an arbitrary number of residues from either chain to bring similarly numbered residues into proximity.

The updated link step makes minimal difference to the amount of model built, but does provide a speed benefit over the previous (Ramachandran search) implementation. The updated correct step gives a small improvement in the amount of model built, although the difference is comparable to the noise among different runs. The loop-building step shows no significant improvement in the proportion built. It is a recurring problem in the development of the model-building algorithm that the improvements are marginal and hard to distinguish from noise, even with the large number of test runs. However, in each of four cases where only the correct step is changed the results always improve, suggesting that this result is significant.
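The comparison behind this argument can be laid out explicitly. The sketch below takes the four pairs of ‘percentage built’ values from Table 2 that differ only in the correct step (this pairing is one reading of the ‘four cases’ and is an assumption) and evaluates a simple one-sided sign test as a rough guide. The sign test is an illustration added here, not an analysis performed in the paper.

```python
from math import comb

# Percentage built from Table 2: (previous correct step, database correct step).
# The pairing of rows is an assumption about which four cases are meant.
PAIRS = {
    "full resolution, original link": (86.2, 86.5),
    "full resolution, DB link": (86.2, 86.6),
    "truncated resolution, original link": (75.1, 76.2),
    "truncated resolution, DB link": (75.5, 76.4),
}

def sign_test_p(n_improved, n_total):
    """One-sided probability of at least n_improved improvements out of
    n_total comparisons if improvement and deterioration were equally likely."""
    favourable = sum(comb(n_total, k) for k in range(n_improved, n_total + 1))
    return favourable / 2 ** n_total

improved = sum(1 for before, after in PAIRS.values() if after > before)
p_value = sign_test_p(improved, len(PAIRS))
```

Under these assumptions all four comparisons improve, which corresponds to a chance of 1 in 16 if the change had no effect on the direction of the result.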


Table 2
Proportion of models built and correctly sequenced with different building strategies; results are averaged over 550 runs on 55 structures.
Values in parentheses are standard deviations across the ten runs of 55 structures.

                                    Full resolution                  Truncated resolution
Method                              Percentage built  No. of chains  Percentage built  No. of chains
Original version                    86.2 (0.6)        8.7 (0.4)      75.1 (1.0)        13.0 (0.4)
DB for link                         86.2 (0.5)        8.6 (0.6)      75.5 (0.9)        12.8 (0.5)
DB for correct                      86.5 (0.6)        8.7 (0.3)      76.2 (1.3)        12.9 (0.7)
DB for loop build                   86.1 (0.4)        7.4 (0.4)      75.3 (0.9)        11.4 (0.4)
DB for link, correct                86.6 (0.7)        8.6 (0.3)      76.4 (0.6)        12.6 (0.6)
DB for link, correct, loop build    86.6 (0.7)        7.3 (0.3)      76.5 (0.9)        11.1 (0.7)

Figure 4

Partially incorrect assembly of the model for 1vlu from multiple

fragments. The wrongly positioned region is shown in black (a) in the

Buccaneer model and (b) in the deposited structure.

However, the benefit of the loop-building step can be seen in the connectivity of the model, which helps when it comes to finishing the model by hand. The number of fragments in the output model gives an indication of what is happening. For the original version, the average number of fragments over the 550 autobuilt models is 8.7; when the loop-building step is added, this reduces to 7.4 (similar changes are seen when combining the loop-building step with the other new steps and when the resolution is truncated). A reduction in the number of fragments without a reduction in the proportion built implies an improvement in connectivity. The implication is that the loop-building step is most commonly dealing with cases where chains are coming into close proximity but failing to meet (and possibly branching down side chains) rather than true loop-building problems where there are missing residues.

To summarize, using the fragment database for the link step reduces the computational overhead, using the fragment database for the correct step provides a small improvement in completeness and using the fragment database for loop building provides a significant improvement in connectivity.

4.4. Other applications of the fragment library

The fragment library has also been used in the implementation of a loop-building tool, Sloop, which is capable of building short missing loops in incomplete protein models. As noted above, the usefulness of this tool varies according to whether the loop concerned happens to conform to an existing motif.

A tool for converting a Cα trace into a main-chain (polyalanine) trace has also been implemented. The results show similar high levels of accuracy to those of Esnouf (1997). The program has not been released owing to the availability of many other tools for this task; however, the source code is available from the author on request.

The use of the library for the building and validation of motifs in the Coot graphical model-building and validation software (Emsley et al., 2010) is under development.

4.5. Discussion

The tidying of fragments into chains is an important element of an automated model-building calculation, principally because it reduces the manual intervention required later in the structure-solution process. The technique described here is reliable when the completeness of the model is good and is completely general with respect to NCS and hetero-complexes, without requiring knowledge of the number of copies of a given sequence present in the asymmetric unit.

The protein-fragment database is capable of reproducing the various functionalities implemented by previous authors, with the efficient search algorithm allowing the use of a larger database than in previous implementations. Some preliminary applications have been explored and a range of future applications are planned, including the following.

(i) Use of the loop-building code to build longer loops when the model is nearly complete. This may be in a single step, or possibly using the stepwise approach of Joosten et al. (2008) where a suitable large fragment is not found in the library.
(ii) Use of the fragment library to rebuild regions of the chain where residue type influences geometry, in particular in the vicinity of Gly and Pro residues.
(iii) Testing the use of a subset of the fragment library to replace the current Ramachandran search in the chain-growing step in Buccaneer, in a manner similar to that of Terwilliger (2003).
(iv) Use of the fragment library to provide validation scores in the manner of Jones & Thirup (1986) in the Coot software.
(v) Extension of the fragment-database concept to handle nucleotides.

The author would like to thank the JCSG data archive for providing a source of well curated test data. This work was supported by the BBSRC through grant BB/F0202281.

References

Choi, Y. & Deane, C. M. (2010). Proteins, 78, 1431–1440.
Cohen, S. X., Morris, R. J., Fernandez, F. J., Ben Jelloul, M., Kakaris, M., Parthasarathy, V., Lamzin, V. S., Kleywegt, G. J. & Perrakis, A. (2004). Acta Cryst. D60, 2222–2229.
Cowtan, K. (2006). Acta Cryst. D62, 1002–1011.
Cowtan, K. (2008). Acta Cryst. D64, 83–89.
Cowtan, K. (2010). Acta Cryst. D66, 470–478.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
Esnouf, R. M. (1997). Acta Cryst. D53, 665–672.
Joint Center for Structural Genomics (2006). JCSG Data Archive. http://www.jcsg.org/datasets-info.shtml.
Jones, T. A. & Thirup, S. (1986). EMBO J. 5, 819–822.
Joosten, K., Cohen, S. X., Emsley, P., Mooij, W., Lamzin, V. S. & Perrakis, A. (2008). Acta Cryst. D64, 416–424.
Kleywegt, G. J. & Jones, T. A. (1996). Acta Cryst. D52, 829–832.
Lovell, S. C., Davis, I. W., Arendall, W. B., de Bakker, P. I., Word, J. M., Prisant, M. G., Richardson, J. S. & Richardson, D. C. (2003). Proteins, 50, 437–450.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Payne, P. W. (1993). Protein Sci. 2, 315–324.
Sheldrick, G. M. (2010). Acta Cryst. D66, 479–485.
Terwilliger, T. C. (2003). Acta Cryst. D59, 38–44.

