research papers
328 doi:10.1107/S0907444911039655 Acta Cryst. (2012). D68, 328–335
Acta Crystallographica Section D
Biological
Crystallography
ISSN 0907-4449
Completion of autobuilt protein models using a
database of protein fragments
Kevin Cowtan
Department of Chemistry, University of York,
Heslington, York YO10 5DD, England
Correspondence e-mail:
cowtan@ysbl.york.ac.uk
Two developments in the process of automated protein model
building in the Buccaneer software are presented. A general-
purpose library for protein fragments of arbitrary size is
described, with a highly optimized search method allowing the
use of a larger database than in previous work. The problem of
assembling an autobuilt model into complete chains is
discussed. This involves the assembly of disconnected chain
fragments into complete molecules and the use of the database
of protein fragments in improving the model completeness.
Assembly of fragments into molecules is a standard step in
existing model-building software, but the methods have not
received detailed discussion in the literature.
Received 2 June 2011
Accepted 27 September 2011
1. Background
This paper outlines two developments of relevance to the
problem of automated protein model building. The initial
application of these techniques in the Buccaneer model-building
software is presented. The developments are the
following.
(i) An efficiently searchable database of protein fragments
which may be used for diverse purposes including the conversion
of a Cα trace to main-chain models, the building of
missing loops and termini, and the correction of residue
insertions and deletions. This library has been implemented
for loop building in the Coot software (Emsley et al., 2010), as
well as for applications in automated model building described
here.
(ii) The automated ‘tidying’ of a fragmentary autobuilt
protein model, with the aim of reducing the manual effort
required to complete the model. Automated model building
sometimes leads to models which may consist of multiple
disconnected fragments, especially at low resolution or when
disordered loop regions are not visible in the electron density.
These fragments must be assembled into one or more mole-
cules, which may involve the application of symmetry opera-
tors and cell translations to some of the fragments. In the case
of noncrystallographic symmetry (NCS) it is also necessary to
assign the fragments to different copies of the molecule.
1.1. Use of databases of protein fragments
The use of databases of protein fragments in the determi-
nation and validation of atomic models is well established in
both manual and automated model building.
Kleywegt & Jones (1996) described the use of pentapeptide
fragments in the program OOPS for the validation of the
protein backbone trace.
Jones & Thirup (1986) used a database of pentapeptides
in the reconstruction of a main-chain trace from Cα positions
alone, although Payne (1993) claimed better results using
force fields. Esnouf (1997) used a library of 16 533 hexapep-
tide fragments in the same way to obtain main-chain coordi-
nates which matched the refined X-ray structure to a high
precision.
Terwilliger (2003) employed a library of tripeptide frag-
ments to extend existing fragments of protein chain by adding
additional residues at the N- or C-terminus and Sheldrick
(2010) used tripeptides to find initial protein fragments.
Joosten et al. (2008) used a library of pentapeptide fragments
in a similar way to build missing loops in protein structures.
The development described here recognizes the success of
these methods and describes an efficient method for building
and searching a library of protein fragments of arbitrary
length (bounded by some chosen value). The database is
optimized for very fast homology searches, allowing the use of
a much larger database than in previous work. The use of a
much larger database also provides the potential to perform
searches restricted by residue-type filters without compro-
mising coverage beyond usefulness.
1.2. Tidying and completion of protein models
Automated model building typically produces as an inter-
mediate result a set of protein-chain fragments, some of which
may have been docked into the protein sequence. Ultimately,
these will need to be assembled into molecules. A problem
arises in determining how the fragments are connected to one
another. When protein molecules are tightly packed together
the molecule boundaries may not be obvious, and as a result
it is possible to link fragments which belong not to the same
chain but rather to symmetry-related chains. If the density for
the link is obvious, this step may be performed by automatic or
manual model completion; however, this is often not the case.
The problem becomes more complex in the case of non-
crystallographic symmetry (NCS). In this case, the fragments
must also be assigned to the correct NCS copy of the molecule,
as well as to the correct asymmetric unit. The problem may be
further complicated in the case of hetero-oligomers (protein
complexes consisting of heterogeneous sequences), although
this is mainly a bookkeeping problem.
Various approaches to model tidying are implemented in
the main automated model-building packages [for example,
ARP/wARP (Cohen et al., 2004) and RESOLVE (Terwilliger,
2003)], with the details varying according to the model-
building algorithm and the information available; however,
the details have not been widely discussed in the literature.
This paper presents the model-tidying steps implemented in
the Buccaneer software from v.1.5.
1.3. The Buccaneer software for automated model building
The Buccaneer software is used for automatic interpretation
of protein structures on the basis of the electron-density map
(Cowtan, 2006, 2008). The calculation is iterative, with
multiple cycles of model building interspersed with occasional
refinement steps using REFMAC (Murshudov et al., 2011) to
improve the current model and electron density. The steps
involved in a single cycle of model building are as follows.
(i) Finding Cα atoms: candidate Cα positions are located by
searching the electron density for likely features.
(ii) Growing fragments: the candidate Cα atoms (or input
chains) are grown by adding residues at either end, guided by
the electron density and constrained by the allowed region of
the Ramachandran plot.
(iii) Joining fragments: overlapping fragments are joined to
make longer chains.
(iv) Linking fragments: nearby N- and C-termini are
examined to see if they can be linked by inserting one or two
additional residues.
(v) Assigning sequence: likelihood comparison between the
density of each residue in the work structure and the density
from residues of a reference structure is used to identify
the likelihood of each residue being of a particular type.
Comparison with the known sequence allows longer fragments
to be matched to the sequence.
(vi) Correcting sequence: insertions and deletions in the
model as identified in the sequence-assignment step are
corrected by rebuilding to add or delete a residue where
possible.
(vii) Filtering fragments in poor density: residues which
have not been docked into the sequence and are in poor
density are removed.
(viii) Building NCS: any NCS relationships found in the
model are used to extend existing chains by combining all of
the NCS-related chains.
(ix) Pruning fragments: fragments which provide incon-
sistent interpretations of the same electron density are
examined. The poorer fragment is removed.
(x) Rebuilding: side-chain atoms and carbonyl O atoms are
added to the model.
This process is repeated over several cycles. In subsequent
cycles, the finding step is modified to preferentially find Cα
positions which are in regions where no model is present.
2. A library of protein fragments
A library of real protein fragments of arbitrary length is
employed to interpret electron density and correct existing
models. In order to support both interactive graphical model
building (where users demand immediate feedback) and
automated model building (where many possible model
fragments may need to be tested to match a particular
feature), it must be possible to perform a very rapid search for
fragments containing some atoms matching a desired confor-
mation.
For example, to fit the main-chain atoms to a Cα trace
the database will be searched for all six-peptide fragments
matching the Cα atoms surrounding a particular peptide bond
and the peptide atoms from the middle peptide of the best-fitting
fragment will be used to provide the main-chain atoms
for that peptide group. Similarly, to build a missing loop in a
protein structure a search will be performed for all fragments
for which the initial and final pairs of Cα atoms in the fragment
may be superimposed on the last two Cα atoms before the
break and the first two Cα atoms after the break.
A library has therefore been constructed using the 500 well
refined protein structures of the Richardsons’ ‘Top 500’ database
(Lovell et al., 2003), excluding residues for which the
temperature factors of the Cα atoms exceed 40 Å². This
provides a database of 106 295 amino acids in 1327 continuous
fragments. For each amino acid, the residue type and the
coordinates of the N, Cα and C atoms are stored (in turn
providing sufficient information to locate the Cβ and O
atoms). The entire database is stored as a single list of amino-
acid records.
The most frequent type of search which will be performed
on the database is to find all fragments for which some
(possibly discontinuous) set of Cα atoms superpose well on the
Cα atoms of some search fragment. The search fragment is in
turn provided as a list of amino-acid records, with null records
inserted as placeholders to represent residues for which the
location is unknown. Thus, to search for a missing loop of four
residues, an eight-residue search fragment is constructed from
the two residues before the missing loop, four null residues
and the two residues after the missing loop.
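As a minimal sketch of how such a search fragment might be laid out, the known residues on either side of the break can be combined with null placeholders. The function name `loop_search_fragment` and the coordinates are hypothetical, and only Cα positions are held here rather than full amino-acid records.

```python
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]


def loop_search_fragment(before: List[Coord], after: List[Coord],
                         gap: int) -> List[Optional[Coord]]:
    """Known Cα positions before the break, `gap` null placeholders,
    then known Cα positions after the break."""
    return list(before) + [None] * gap + list(after)


# Hypothetical coordinates, for illustration only.
before = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
after = [(9.0, 5.0, 0.0), (12.8, 5.0, 0.0)]
fragment = loop_search_fragment(before, after, gap=4)
```

For a missing loop of four residues this gives the eight-entry search fragment described above, with entries 2 to 5 left as `None`.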
Performing a least-squares superposition for every frag-
ment in the database would be computationally demanding, so
an initial pre-selection phase is performed to produce a subset
of fragments which may be good matches to the search frag-
ment. This pre-selection involves a computationally cheaper
distance-matrix score.
In order to minimize the computational overhead, distance
matrices for the search fragment and for the database are
precalculated. For the search fragment, a triangular matrix is
calculated with the first row giving the distances from the first
Cα to the remaining n − 1, the second row the distances from
the second Cα to the remaining n − 2 and so on. The columns
of this matrix correspond to the diagonals of the upper
triangle of a conventional distance matrix (illustrated in Fig. 1).
If an atom is missing, the distance is set to a negative flag
value.
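The triangular layout for the search fragment can be sketched as follows. This is an illustration rather than the Buccaneer code: the flag value of -1.0 and the function name are assumptions, and the search fragment is taken to be a list of Cα coordinates with `None` standing for a residue of unknown location.

```python
import math
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]
FLAG = -1.0  # assumed negative flag value for an unavailable distance


def fragment_distance_rows(ca: List[Optional[Coord]]) -> List[List[float]]:
    """Row i holds the distances from Cα i to Cα i+1, i+2, ..., n-1."""
    n = len(ca)
    rows = []
    for i in range(n - 1):
        row = []
        for j in range(i + 1, n):
            if ca[i] is None or ca[j] is None:
                row.append(FLAG)  # one of the two atoms is missing
            else:
                row.append(math.dist(ca[i], ca[j]))
        rows.append(row)
    return rows


# Hypothetical four-residue fragment with the third residue unknown.
example = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), None, (3.0, 4.0, 0.0)]
rows = fragment_distance_rows(example)
```

Column k of every row then holds a separation of k + 1 residues, which is what allows a direct comparison against the running distance matrix of the database.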
For the database of n_db residues, an n_db × 20 rectangular
‘running distance matrix’ is pre-calculated, with each row
giving distances from the first Cα to the following 20, thus
representing fragments of up to 21 residues. This is illustrated
in Fig. 2 for a reduced width of six residues. Any distances
which span chain boundaries are set to the flag value.
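A sketch of the running distance matrix under simple assumptions is given below. The database is represented as a flat list of Cα coordinates, and a parallel list `fragment_id` (a hypothetical name) marks which continuous fragment each residue belongs to; the flag value of -1.0 is also an assumption.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]
FLAG = -1.0  # assumed negative flag value


def running_distance_matrix(ca: List[Coord], fragment_id: List[int],
                            width: int = 20) -> List[List[float]]:
    """Row i holds distances from Cα i to Cα i+1 ... i+width, flagged
    where the pair runs off the list or spans a chain boundary."""
    n = len(ca)
    matrix = []
    for i in range(n):
        row = []
        for k in range(1, width + 1):
            j = i + k
            if j < n and fragment_id[j] == fragment_id[i]:
                row.append(math.dist(ca[i], ca[j]))
            else:
                row.append(FLAG)
        matrix.append(row)
    return matrix


# Hypothetical database of five residues in two continuous fragments,
# using a reduced width of two for readability.
db_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
db_fragment = [0, 0, 0, 1, 1]
db_matrix = running_distance_matrix(db_ca, db_fragment, width=2)
```

The check on `fragment_id` relies on each continuous fragment occupying a contiguous run of the list, as in the single list of amino-acid records described above.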
In order to identify a set of possibly matching fragments, all
that needs to be done is to compare the non-missing values in
the fragment distance matrix to the corresponding values
obtained by starting from each row of the database distance
matrix in turn. A sum of squared differences is used to identify
likely matches.
To further optimize the calculation, the sum-of-squares
calculation may be terminated early as soon as the sum
exceeds a threshold value. The threshold value is controlled by
a parameter which determines how many matches will be
returned and is updated regularly by sorting the current list of
matches, truncating to the desired number and setting the
threshold to the value of the worst remaining match.
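The pre-selection with early termination might look like the sketch below. All names are hypothetical, the refresh interval for the threshold (here twice the number of matches kept) is an assumption, and the search fragment is assumed to be no longer than the database matrix width plus one.

```python
import math
from typing import List, Optional, Tuple

FLAG = -1.0  # assumed negative flag value


def score_at(frag_rows: List[List[float]], db_matrix: List[List[float]],
             start: int, threshold: float) -> Optional[float]:
    """Sum of squared distance differences, or None if the sum exceeds
    the threshold or a required database distance is flagged."""
    total = 0.0
    for i, row in enumerate(frag_rows):
        for k, d in enumerate(row):
            if d == FLAG:
                continue  # atom missing from the search fragment
            ref = db_matrix[start + i][k]
            if ref == FLAG:
                return None  # spans a chain boundary in the database
            total += (d - ref) ** 2
            if total > threshold:
                return None  # early termination
    return total


def preselect(frag_rows: List[List[float]], db_matrix: List[List[float]],
              n_keep: int) -> List[Tuple[float, int]]:
    """Return up to n_keep (score, start row) pairs, best first."""
    threshold = math.inf
    matches: List[Tuple[float, int]] = []
    n_frag = len(frag_rows) + 1  # residues in the search fragment
    for start in range(len(db_matrix) - n_frag + 1):
        score = score_at(frag_rows, db_matrix, start, threshold)
        if score is None:
            continue
        matches.append((score, start))
        if len(matches) >= 2 * n_keep:
            # Sort, truncate and tighten the threshold to the worst survivor.
            matches.sort()
            del matches[n_keep:]
            threshold = matches[-1][0]
    matches.sort()
    return matches[:n_keep]


# Hypothetical one-dimensional database with Cα atoms at x = 0, 1, 2, 4, 7
# (matrix width 2) and a three-residue search fragment at x = 0, 2, 5.
db_matrix = [[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [3.0, FLAG], [FLAG, FLAG]]
frag_rows = [[2.0, 5.0], [3.0]]
best = preselect(frag_rows, db_matrix, n_keep=1)
top_two = preselect(frag_rows, db_matrix, n_keep=2)
```

Because a candidate is only abandoned once its partial sum is already worse than the current worst retained match, the early exit does not change which matches are returned.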
The limitation of the distance-matrix score is that the
distance matrix of a set of coordinates is invariant under
inversion of these coordinates through a centre of symmetry,
and so the initial search also returns fragments which are the
inverse of the search fragment. The resulting list of candidate
fragments must therefore be re-scored using a full least-squares
superposition and r.m.s. difference calculation. The
resulting list is resorted according to the r.m.s. difference.
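The inversion ambiguity can be demonstrated with a few lines of code. The sketch below does not reproduce the least-squares superposition used for re-scoring; it only shows that a fragment and its inverse share the same distance matrix, using a signed volume (scalar triple product) as a simple stand-in to show that the two hands nevertheless differ. The coordinates are hypothetical.

```python
import math


def distance_matrix(points):
    """Full matrix of pairwise distances."""
    return [[math.dist(p, q) for q in points] for p in points]


def signed_volume(a, b, c, d):
    """Scalar triple product of (b-a), (c-a), (d-a); the sign flips
    when the coordinates are inverted through a centre of symmetry."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (6.0, 4.0, 3.5)]
inverted = [(-x, -y, -z) for x, y, z in fragment]

same_distances = all(
    math.isclose(d1, d2)
    for row1, row2 in zip(distance_matrix(fragment), distance_matrix(inverted))
    for d1, d2 in zip(row1, row2)
)
volume = signed_volume(*fragment)
volume_inverted = signed_volume(*inverted)
```

Here `same_distances` is true while the two signed volumes have opposite signs, which is why a distance-based score alone cannot reject the inverted fragment.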
For some purposes it may be desirable to restrict the search
to fragments for which the sequence obeys some criterion, for
example to take into account the different main-chain
conformations which can occur around Gly or Pro. This is
Figure 1
Running distance-matrix representation of a single fragment, where D_ij is
the distance between the ith and jth Cα atoms. The shaded cells are those
available for loop fitting using only two Cα atoms at each end of the
fragment.
Figure 2
Running distance-matrix representation of the protein-chain database,
where D_i,j is the distance between the ith and jth Cα atoms. The shaded
cells are those which would be used to score the fit of a search fragment
against a particular range in the database.
achieved by allowing a mask of 20 binary digits to be set for
each position in the search fragment, indicating which of the
20 amino-acid types are allowed to appear at that position in
the fragment. This provides an additional restriction on the
search results which may be evaluated by simple logical
operations.
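A residue-type filter of this kind can be sketched with integer bit masks. The ordering of the 20 amino-acid types and all names below are assumptions made for illustration.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 types
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}
ANY = (1 << 20) - 1  # all 20 bits set: any residue type allowed


def mask_of(allowed: str) -> int:
    """Build a 20-bit mask from one-letter residue codes."""
    mask = 0
    for aa in allowed:
        mask |= BIT[aa]
    return mask


def sequence_allowed(sequence: str, masks: List[int]) -> bool:
    """True if every residue type is permitted at its position."""
    if len(sequence) != len(masks):
        return False
    return all(BIT[aa] & mask for aa, mask in zip(sequence, masks))


# Hypothetical three-residue filter: anything, Gly only, anything but Pro.
masks = [ANY, mask_of("G"), ANY & ~mask_of("P")]
```

With these masks, `sequence_allowed("AGA", masks)` is true, while `"AAA"` fails at the second position and `"AGP"` fails at the third; each position costs a single bitwise AND.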
3. Automated model tidying
The steps employed in the completion of the atomic model in
the current version of Buccaneer are as follows.
(i) The various fragments built by the chain-tracing and
sequence-docking algorithms are grouped into discontinuous
chains using a scoring function that rewards compactness and
penalizes sequence duplication. This removes a tedious
manual step of assigning chain IDs and renaming the resulting
chain fragments by hand.
(ii) Where there are discontinuities (or breaks) in the
resulting chains, an attempt is made to fix these discontinuities
by pruning any overlap and placing a fragment from a stored
database of protein fragments across the gap.
The steps involved in the grouping of fragments into chains
are described in detail in §§3.1, 3.2 and 3.3. The correction of
breaks is discussed in §3.4. These steps are inserted between
steps (ix) and (x) of the workflow described in §1.3.
3.1. Grouping fragments into chains
The process of grouping fragments into chains involves
assigning a chain identifier to each fragment such that the
fragments which make up a single chain all have the same
chain identifier. Furthermore, the resulting fragments may
need to be transformed by the application of crystallographic
symmetry elements to form a compact molecule.
In the simplest case of a single sequence with no noncrys-
tallographic symmetry (NCS), the process of allocating chain
identifiers is simply a matter of separating a set of fragments
which comprise a single complete chain from those which are
incorrectly built or sequenced (however, the remaining frag-
ments are retained with dummy chain identifiers in case they
contain correctly located but wrongly sequenced residues).
The general case involves two additional layers of
complexity. Firstly, there may be multiple copies of the
molecule in the asymmetric unit. In this case, multiple chains
with different chain identifiers must be built and each frag-
ment must be allocated to one of the chains in such a way as
to build several compact molecules. Secondly, in the case of
a hetero-complex there may be multiple distinct sequences
involved.
The basic steps of the calculation are as follows.
(i) In the case where multiple sequences are present, those
fragments which have been docked to one of the sequences
are sorted according to which sequence was used. Each
sequence is then considered in turn and the following steps are
applied to all the fragments belonging to that sequence.
(ii) A set of ‘seed’ fragments are identified by the method
described in §3.2, including one fragment from each NCS copy
of the molecule. The fragments are chosen such that they all
incorporate some common range of sequence numbers and
thus must belong to distinct copies of the molecule. The
selection of this range is made in such a way as to maximize
the number of NCS copies identified, subject to the validation
criteria described below.
(iii) The seed fragments are then grown by successively
adding an additional fragment to a seed by the method
described in §3.3. Each fragment is scored for its geometrical
proximity to each seed (taking into account crystallographic
symmetry) and penalized for any sequence overlap with that
seed. The fragment which obtains the highest score to be
docked to a seed is then added to that seed. The calculation
repeats until all fragments have been assigned or the highest
score fails to reach a threshold.
Steps (ii) and (iii) are repeated for each sequence until all
sequences have been considered. The fragments are then
assembled into chains by grouping all the fragments sharing a
chain identifier in order of sequence number. In some cases,
sequence numbers of grouped fragments may overlap; in this
case, insertion codes are used to ensure that each residue is
uniquely identified.
3.2. Identification of seed fragments
The identification of ‘seed’ fragments is performed as
follows. Firstly, a matrix is constructed whose order is the
number of fragments under consideration. The matrix is used
to store flags identifying which fragments overlap. For each
pair of sequences, the number of residues of overlap is iden-
tified. If the overlap exceeds 12 residues and the overlapped
regions have similar conformations, the number of overlapped
residues is stored in the matrix. (In this context, a similar
conformation is identified by the least-squares superposition
of the best-matched 50% of the overlapped Cα coordinates
having an r.m.s. difference of less than 1 Å.)
A depth-first permutation search is then performed to
identify the largest subset of fragments all of which overlap.
There will usually be multiple equal solutions; in this case, the
set is chosen for which the total number of residues in the
overlapping fragments is the greatest.
At first glance the algorithm is computationally expensive,
since potentially 2^n sets must be considered, where n is the
number of fragments. In practice, the number of overlapped
sequences does not significantly exceed the number of NCS
copies and depth searches may be terminated early if they
cannot match the current best solution; thus, in practice the
computational cost of this step is negligible.
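The seed search can be sketched as a depth-first search for the largest set of mutually overlapping fragments, with ties broken by total residue count and branches abandoned when they cannot match the current best. The function name, the matrix layout (0 for no validated overlap) and the example values are assumptions; this is an illustration rather than the Buccaneer implementation.

```python
from typing import List


def largest_overlapping_set(overlap: List[List[int]],
                            lengths: List[int]) -> List[int]:
    """Indices of the largest set of fragments that all overlap one another;
    ties are broken by the total number of residues in the set."""
    best_set: List[int] = []
    best_key = (0, 0)  # (number of fragments, total residues)

    def search(chosen: List[int], candidates: List[int]) -> None:
        nonlocal best_set, best_key
        # Terminate early if this branch cannot match the best set size.
        if len(chosen) + len(candidates) < len(best_set):
            return
        key = (len(chosen), sum(lengths[i] for i in chosen))
        if key > best_key:
            best_set, best_key = list(chosen), key
        for pos, c in enumerate(candidates):
            # Keep only later candidates that overlap the fragment just chosen.
            rest = [d for d in candidates[pos + 1:] if overlap[c][d] > 0]
            search(chosen + [c], rest)

    search([], list(range(len(overlap))))
    return best_set


# Hypothetical overlap matrix for five fragments (residues of overlap).
overlap = [
    [0, 20, 15, 0, 0],
    [20, 0, 30, 14, 0],
    [15, 30, 0, 0, 0],
    [0, 14, 0, 0, 25],
    [0, 0, 0, 25, 0],
]
lengths = [30, 40, 50, 100, 100]
seeds = largest_overlapping_set(overlap, lengths)
```

In this example fragments 0, 1 and 2 all overlap one another, so they would be taken as seeds for three NCS copies even though fragments 3 and 4 are individually longer.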
The fragments thus selected contain the same sequence of
residues in a similar conformation and thus can be assumed to
be different NCS copies of one part of the structure. Each of
the selected seed fragments is therefore allocated a different
chain identifier and becomes the core of that chain.
3.3. Allocation of additional fragments to the chains
This step is performed iteratively. Every unallocated frag-
ment is considered and the score is calculated for adding that
fragment to each chain. The highest scoring chain/fragment
combination is selected and the fragment is added to that
chain. This will affect all subsequent scores for that chain and
therefore the calculation is then repeated from the start.
The scoring function rewards geometrical compactness and
penalizes sequence inconsistencies as follows. Each Cα atom
within 5 Å of a Cα atom which has already been allocated to a
given chain provides a score of +1 for adding the fragment to
that chain. Each residue which has been docked into sequence
with a sequence number clashing with a residue already
allocated to a given chain provides a score of −2 for adding
the fragment to that chain.
In this way, fragments which are in close contact with an existing
chain but which do not cover the same stretch of sequence are
added to that chain. The process continues until no positive
scores remain.
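The scoring and iterative allocation can be sketched as follows. The data layout (each residue as a sequence number, or `None` if undocked, plus a Cα coordinate) and the function names are assumptions, at least one seed chain is assumed to exist, and crystallographic symmetry is ignored for brevity although the method described here takes it into account.

```python
import math
from typing import List, Optional, Tuple

Residue = Tuple[Optional[int], Tuple[float, float, float]]


def add_score(fragment: List[Residue], chain: List[Residue],
              radius: float = 5.0) -> int:
    """+1 per Cα within `radius` of the chain, -2 per clashing sequence number."""
    chain_numbers = {seq for seq, _ in chain if seq is not None}
    score = 0
    for seq, xyz in fragment:
        if any(math.dist(xyz, other) < radius for _, other in chain):
            score += 1
        if seq is not None and seq in chain_numbers:
            score -= 2
    return score


def allocate(chains: List[List[Residue]],
             fragments: List[List[Residue]]) -> List[List[Residue]]:
    """Repeatedly add the best-scoring fragment to its chain, re-scoring after
    each addition, until no positive score remains. Returns the leftovers."""
    unallocated = list(fragments)
    while unallocated:
        score, ci, fi = max(
            ((add_score(f, c), ci, fi)
             for fi, f in enumerate(unallocated)
             for ci, c in enumerate(chains)),
            key=lambda item: item[0],
        )
        if score <= 0:
            break
        chains[ci].extend(unallocated.pop(fi))
    return unallocated


# Hypothetical seeds and fragments laid out along the x axis.
chains = [
    [(1, (0.0, 0.0, 0.0)), (2, (3.8, 0.0, 0.0))],
    [(1, (100.0, 0.0, 0.0)), (2, (103.8, 0.0, 0.0))],
]
frag_a = [(3, (7.6, 0.0, 0.0)), (4, (11.4, 0.0, 0.0))]
frag_b = [(1, (101.0, 3.0, 0.0))]   # close to chain 1 but clashes with residue 1
frag_c = [(5, (15.2, 0.0, 0.0))]    # only close to chain 0 once frag_a is added
leftover = allocate(chains, [frag_a, frag_b, frag_c])
```

The example shows why the scores are recalculated after every addition: `frag_c` scores zero against every chain at first and only becomes attachable once `frag_a` has extended the first chain.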
3.4. Correction of chain breaks
Often it will occur that there are gaps in the trace of the
protein chain. These most commonly occur for one of two
reasons.
(i) Flexible surface loops for which the electron density is
poor.
(ii) Mistracings where the chain trace has left the chain
(often following a side chain or disulfide bridge) and the chain
trace is then continued in a subsequent fragment.
For the gap to be corrected, any wrongly traced residues
(e.g. following a side chain or disulfide bridge) must first be
removed by pruning back at least enough residues to remove
any duplicated sequence numbers from the ends of the two
fragments (multiple choices about how many residues to
prune from each end are possible and additional pruning may
be required to eliminate all mistraced residues, so multiple
prunings are tested) and then selecting a fragment from a
database of protein-chain fragments to bridge the gap.
Note that caution is required in this step. Earlier in the
Buccaneer calculation an attempt is made to link spatially
proximal N- and C-termini without regard to sequence.
Sometimes these linkages are made incorrectly. However, this
mistake is not serious, because when docking the resulting
chain to the sequence the two parts of the joined chain will
usually dock to different places in the sequence, at which point
the error can be corrected by breaking the chain again. When
linking chains on the basis of previously assigned sequences,
the use of the sequence to validate the link is no longer
available, so mistakes introduced at this stage will never be
corrected. As a result, it was found to be necessary to limit the
maximum length of the bridging fragment to six amino acids
(i.e. two amino acids overlap with each chain and a maximum
of two amino acids of gap). Longer missing loops must still
be built manually. Since the errors arise from the presence of
wrongly sequenced fragments which occur early in the model-
building process when the fragments are short, this constraint
should probably be relaxed to allow longer loops to be built
once the model is approaching completion, at which point
errors become less likely.
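One way to enumerate the prunings to be tested is sketched below. It is a hypothetical illustration under stated assumptions: residues are identified by sequence number only, up to `max_prune` residues may be removed from each end, and a pruning is kept when it leaves no duplicated sequence numbers and a gap of at most two residues, so that the bridging fragment of two anchor residues per side plus the gap does not exceed six residues. Whether a gap of zero is admitted is also an assumption.

```python
from typing import List, Tuple


def candidate_prunings(last_seq: int, first_seq: int, max_gap: int = 2,
                       max_prune: int = 4) -> List[Tuple[int, int, int]]:
    """Return (prune_a, prune_b, gap) triples, where prune_a residues are
    removed from the C-terminal end of the first fragment (ending at
    last_seq), prune_b from the N-terminal end of the second fragment
    (starting at first_seq), and gap residues are then missing between them."""
    out = []
    for prune_a in range(max_prune + 1):
        for prune_b in range(max_prune + 1):
            gap = (first_seq + prune_b) - (last_seq - prune_a) - 1
            if 0 <= gap <= max_gap:
                out.append((prune_a, prune_b, gap))
    return out


# Hypothetical case: the first fragment ends at residue 52 and the second
# starts at residue 51, so residues 51 and 52 are duplicated.
prunings = candidate_prunings(last_seq=52, first_seq=51)
```

Each candidate would then be tried with a database search for a bridging fragment, keeping whichever pruning gives the best fit.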
3.5. Additional applications of the fragment database
Two existing steps in the Buccaneer calculation were also
rewritten to make use of the fragment database. The ‘linking’
step (joining nearby N- and C-termini irrespective of
sequence) and ‘correction’ step (correcting insertions and
deletions by rebuilding one or three residues with two resi-
dues) both made use of a routine for building a loop of two
residues by searching over allowed Ramachandran angles.
Both of these steps have been replaced by an equivalent
implementation using the fragment database.
4. Results
Some preliminary results are presented here on the applicability
of the fragment database and on the automated model-tidying
features in the Buccaneer software.
4.1. Coverage as a function of fragment size in the fragment
database
To investigate the usefulness of the fragment database, an
exhaustive search was performed to test for a given fragment
length how well each fragment in the database can be repre-
sented by some other fragment from the database.
Each possible fragment of the chosen length was extracted
from the database in turn and used as a search model to find
other similar fragments. In every case the best-fitting fragment
will be the original fragment, so the best fit is discarded and
the second-best match is used. Two statistics are calculated
for the matching fragment: the r.m.s. deviation between the
Cα-atom positions and those of the search fragment, and the
distance between the worst-matching Cα atom and the corresponding
atom in the search fragment. This calculation was
performed for fragments of six, nine and 12 residues (as would
be used in fitting missing loops of two, five and eight residues,
respectively).
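The two statistics, and the proportion of fragments exceeding a given cutoff, can be sketched as below. The function names are hypothetical and the two coordinate sets are assumed to have been superposed already.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd_and_worst(a: Sequence[Coord], b: Sequence[Coord]) -> Tuple[float, float]:
    """R.m.s. deviation and largest single-atom deviation between two
    already superposed sets of Cα positions of equal length."""
    deviations = [math.dist(p, q) for p, q in zip(a, b)]
    rmsd = math.sqrt(sum(d * d for d in deviations) / len(deviations))
    return rmsd, max(deviations)


def fraction_worse(values: List[float], cutoff: float) -> float:
    """Proportion of values that exceed the cutoff."""
    return sum(v > cutoff for v in values) / len(values)


# Hypothetical two-atom example: one atom matches, one is displaced by 2 Å.
rmsd, worst = rmsd_and_worst([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                             [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
share = fraction_worse([0.2, 0.5, 1.2, 1.6], cutoff=1.0)
```

Evaluating `fraction_worse` over a range of cutoffs for all second-best matches would give the kind of curve plotted for each fragment length.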
The results are shown in Fig. 3 as tail plots showing the
proportion of the search fragments for which the difference
from the database fragment is no worse than a given value.
The r.m.s. deviations are worse than 1.0 Å for 0.04% of six-residue
fragments, 5% of nine-residue fragments and 38% of
12-residue fragments. Given that a significant proportion of
the fragments in the database will be in very similar helical
or strand conformations, this suggests that the library will be
of limited use for 12-residue fragments except for common
motifs.
Similarly, the worst deviating atom has a displacement
greater than 1.5 Å for 0.05% of six-residue fragments, 5% of
nine-residue fragments and 36% of 12-residue fragments.
(Note the change in distance criterion compared with the
previous data.) This again suggests that 12-residue fragments
will be of less use, since automated refinement is likely to
struggle to correct errors of this magnitude.
As a result, the database provides effectively complete
coverage for fragments of up to six residues or for loop fitting
over only two missing residues. (This case was previously
handled by a simple Ramachandran search; however, the
database approach has the advantage of providing a computationally
cheaper sampling of conformation space which
increases in density as the frequency of that conformation
increases.)
For missing loops of intermediate length (3–6 residues), the
database will provide good loop conformations in a subset of
cases where the loop happens to match one in the database
and so will catch common turn motifs, for example. For longer
loops, the database is likely to be useful only in less frequent
cases. However, this approach has been shown to have good
success by Choi & Deane (2010) for loops of up to 20 residues
with a larger database of structures.
4.2. Automated model tidying in the Buccaneer software
The model-tidying procedure was applied to the same 55
test structures used in Cowtan (2008) and is detailed in the
supplementary material of that paper; the data were obtained
from the JCSG (Joint Center for Structural Genomics, 2006).
Of the resulting models, 29 contained fragments which were
grouped into chains by the tidying algorithm. Some of these
structures included multiple NCS copies of the structure and
therefore the total number of chains assembled was 50.
Each of the 50 tidied chains was examined to determine the
proportion of the chain corresponding to a single molecule in
the final structure. As the model becomes more complete, the
assignment becomes easier, so these proportions are tabulated
along with the completeness of the chain in Table 1.
In every case where the chain is at least 60% complete, at
least 80% is correctly assigned to a single molecule and in 44
of 48 such cases the assignment is entirely correct or correct
apart from a few trailing residues. For the two cases where the
completeness is less than 50%, the grouping of fragments into
chains is rather less accurate.
The case of the 1vlu A chain (as labelled by Buccaneer; this
is actually the B chain in the deposited structure) is shown in
Fig. 4, in which 91% of the chain has been built but only 83%
of the residues built correspond to a single molecule. In this
case the deposited model contains chain breaks and the
Buccaneer model shows chain breaks in similar positions. The
disconnected range of residues 331–391 has been placed at the
wrong end of the molecule. It is probable that the error could
have been corrected in this specific case by adding a term
rewarding proximity of sequence number to the scoring
function; however, this was not tested because in the experi-
ence of the author the incorrect linking of chains across
protein contacts is a significant problem in the early stages of
building and this problem is likely to be exacerbated by such a
change.
4.3. Application of the fragment database in the Buccaneer
software
The usefulness of the fragment database in automated
building was tested by rewriting two existing steps of the
Buccaneer calculation to make use of the database and by
adding a new loop-building step using the database, as
described in §4.2. The results of these changes were tested
individually and in combination.
Table 1
Reliability of the model-tidying algorithm as measured by the proportion
of each autobuilt chain corresponding to a single chain in the deposited
structure.

Structure (chain)   Proportion belonging to a single chain (%)   Chain completeness (%)
1vjn (A)            72                                           49
1zej (B)            75                                           46
1z85 (A)            81                                           90
1vlu (B)            81                                           73
1vlu (A)            83                                           91
1zej (A)            92                                           69
1vr8 (A)            95                                           99
1vp7 (C)            97                                           100
1vk3 (A)            98                                           90
41 cases            100                                          62–100
Figure 3
Tail plot of the proportion of search fragments for which the fit of the
best-matching fragment is worse than a given criterion for different
fragment lengths. (a) R.m.s. deviation of Cα positions between the best
database fragment and the search fragment; (b) maximum deviation of
any Cα positions between the best database fragment and the search
fragment.
The results of the model-building calculation are rather
sensitive to changes in the algorithm or input data, so to
determine whether each change made an improvement
multiple model-building runs were used. For each of the 55
test structures used in Cowtan (2008) ten model-building runs
were performed using ten different sets of free reflections for
both model building and refinement. The change in the set of
reflections used to calculate the initial map is sufficient to
significantly alter the results of the first model-building step
and the differences propagate to subsequent cycles.
The percentage of the model built and correctly sequenced
(measured by the percentage of residues built with the correct
residue type and with the Cα within 1.9 Å of the correct
position) was averaged over the 550 runs to obtain a score for
this method.
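A sketch of this score for a single run is given below. The matching rule is an assumption made for illustration: a residue of the reference structure counts as correctly built when some built residue of the same type has its Cα within the cutoff, and the percentage is taken over the reference residues. The names and coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Residue = Tuple[str, Tuple[float, float, float]]  # (residue type, Cα position)


def percent_built(built: List[Residue], reference: List[Residue],
                  cutoff: float = 1.9) -> float:
    """Percentage of reference residues matched by a built residue of the
    same type with its Cα within `cutoff` Å."""
    correct = 0
    for res_type, xyz in reference:
        if any(b_type == res_type and math.dist(b_xyz, xyz) <= cutoff
               for b_type, b_xyz in built):
            correct += 1
    return 100.0 * correct / len(reference)


# Hypothetical four-residue reference and three-residue autobuilt model.
reference = [("ALA", (0.0, 0.0, 0.0)), ("GLY", (3.8, 0.0, 0.0)),
             ("SER", (7.6, 0.0, 0.0)), ("LEU", (11.4, 0.0, 0.0))]
built = [("ALA", (0.5, 0.0, 0.0)),   # correct type, 0.5 Å away
         ("GLY", (3.8, 2.5, 0.0)),   # correct type, but 2.5 Å away
         ("THR", (7.6, 0.0, 0.0))]   # correct place, wrong type
score = percent_built(built, reference)
```

Only the alanine satisfies both conditions here, so one of the four reference residues counts as built and correctly sequenced.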
Furthermore, the entire set of calculations was then repeated
using lower resolution data. For these calculations, the
data resolution was truncated by 0.4 Å, the B factor was
increased by 20 Å² and the density-modification step (using
the Parrot software; Cowtan, 2010) was rerun on the truncated
data. The resolutions of the original data sets vary over the
range 1.4–3.2 Å and the truncated data over the range 1.8–3.6 Å.
The results of these calculations are shown in Table 2. The
first step modified (‘link’) is the linking of chain fragments
irrespective of sequence [step (iv) in the Buccaneer calculation],
the next (‘correct’) is the correction of insertions and
deletions during sequencing [step (vi) in the Buccaneer calculation].
These steps were previously performed using an
exhaustive search over allowed Ramachandran angles, in the
first case to build a link of up to two residues and in the second
to rebuild a stretch of either one or three residues with two
residues. Finally, a new loop-building step was added, similar
to the ‘link’ step but performed after the sequence has been
assigned to the chains. Unlike the ‘link’ step, the loop-building
step may prune an arbitrary number of residues from either
chain to bring similarly numbered residues into proximity.
The updated link step makes minimal difference to the
amount of model built, but does provide a speed benefit over
the previous (Ramachandran search) implementation. The
updated correct step gives a small improvement in the amount
of model built, although the difference is comparable to the
noise among different runs. The loop-building step shows no
significant improvement in the proportion built. It is a recur-
ring problem in the development of the model-building
algorithm that the improvements are marginal and hard to
distinguish from noise, even with the large number of test runs.
However, in each of four cases where only the correct step is
changed the results always improve, suggesting that this result
is significant.
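As a rough illustration of this argument, a one-sided sign test can be applied to the four comparisons. No formal test is reported in the text, so this is only a sketch. The four pairs are one reading of Table 2 (percentage built without and with the database 'correct' step, at full and truncated resolution, with and without the database 'link' step), and the test treats the comparisons as independent although they share the same test structures.

```python
from math import comb

# (without DB correct, with DB correct) percentages read from Table 2;
# which rows form the "four cases" is an assumption of this sketch.
PAIRS = [
    (86.2, 86.5),  # original vs. DB for correct, full resolution
    (75.1, 76.2),  # original vs. DB for correct, truncated resolution
    (86.2, 86.6),  # DB for link vs. DB for link, correct, full resolution
    (75.5, 76.4),  # DB for link vs. DB for link, correct, truncated
]


def sign_test_p(pairs):
    """One-sided sign-test p-value for 'after > before'.

    Ties are discarded; under the null hypothesis an improvement and a
    deterioration are equally likely.
    """
    n = sum(1 for before, after in pairs if after != before)
    wins = sum(1 for before, after in pairs if after > before)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


p_value = sign_test_p(PAIRS)  # (1/2)**4 = 0.0625
```

Four improvements out of four would arise by chance one time in 16, which is suggestive rather than conclusive at a conventional 0.05 threshold.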
Table 2
Proportion of models built and correctly sequenced with different
building strategies; results are averaged over 550 runs on 55 structures.
Values in parentheses are standard deviations across the ten runs of 55
structures.

                                  Full resolution            Truncated resolution
                                  Percentage    No. of       Percentage    No. of
Method                            built         chains       built         chains
Original version                  86.2 (0.6)    8.7 (0.4)    75.1 (1.0)    13.0 (0.4)
DB for link                       86.2 (0.5)    8.6 (0.6)    75.5 (0.9)    12.8 (0.5)
DB for correct                    86.5 (0.6)    8.7 (0.3)    76.2 (1.3)    12.9 (0.7)
DB for loop build                 86.1 (0.4)    7.4 (0.4)    75.3 (0.9)    11.4 (0.4)
DB for link, correct              86.6 (0.7)    8.6 (0.3)    76.4 (0.6)    12.6 (0.6)
DB for link, correct, loop build  86.6 (0.7)    7.3 (0.3)    76.5 (0.9)    11.1 (0.7)
Figure 4
Partially incorrect assembly of the model for 1vlu from multiple
fragments. The wrongly positioned region is shown in black (a) in the
Buccaneer model and (b) in the deposited structure.
However, the benefit of the loop-building step can be seen
in the connectivity of the model, which is a benefit when it
comes to finishing the model by hand. The number of frag-
ments in the output model gives an indication of what is
happening. For the original version, the average number of
fragments over the 550 autobuilt models is 8.7; when the loop-
building step is added, this reduces to 7.4 (similar changes are
seen when combining the loop-building step with the other
new steps and when the resolution is truncated). A reduction
in the number of fragments without a reduction in the
proportion built implies an improvement in connectivity. The
implication is that the loop-building step is most commonly
dealing with cases where chains are coming into close proxi-
mity but failing to meet (and possibly branching down side
chains) rather than true loop-building problems when there
are missing residues.
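The connectivity measure can be illustrated by counting fragments in a Cα trace and flagging chain ends that approach one another without joining. This is a minimal sketch and not Buccaneer's implementation: both distance cut-offs are assumed values (the first is chosen to lie slightly above the roughly 3.8 Å spacing of consecutive Cα atoms) and the function names are hypothetical.

```python
import math

MAX_CA_CA = 4.2  # Å; assumed limit for two consecutive C-alphas to be bonded
MAX_GAP = 6.0    # Å; assumed limit for a 'near miss' between chain ends


def split_into_fragments(ca_trace, max_step=MAX_CA_CA):
    """Split an ordered list of C-alpha coordinates at every chain break."""
    fragments = []
    for ca in ca_trace:
        if fragments and math.dist(fragments[-1][-1], ca) <= max_step:
            fragments[-1].append(ca)
        else:
            fragments.append([ca])
    return fragments


def near_miss_joins(fragments, max_gap=MAX_GAP):
    """Pairs (i, j) where the end of fragment i is close to the start of j."""
    joins = []
    for i, first in enumerate(fragments):
        for j, second in enumerate(fragments):
            if i != j and math.dist(first[-1], second[0]) <= max_gap:
                joins.append((i, j))
    return joins
```

Pairs reported by `near_miss_joins` correspond to chains that come into close proximity but fail to meet. For scale, the change in the average number of fragments from 8.7 to 7.4 is a reduction of about 15%.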
To summarize, using the fragment database for the link step
reduces the computational overhead, using the fragment
database for the correct step provides a small improvement
in completeness and using the fragment database for loop
building provides a significant improvement in connectivity.
4.4. Other applications of the fragment library
The fragment library has also been used in the imple-
mentation of a loop-building tool, Sloop, which is capable of
building short missing loops in incomplete protein models. As
noted above, the usefulness of this tool varies according to
whether the loop concerned happens to conform to an existing
motif.
A tool for converting a Cα trace into a main-chain (polyalanine)
trace has also been implemented. The results show
similar high levels of accuracy to those of Esnouf (1997). The
program has not been released owing to the availability of
many other tools for this task; however, the source code is
available from the author on request.
The use of the library for the building and validation of
motifs in the Coot graphical model-building and validation
software (Emsley et al., 2010) is under development.
4.5. Discussion
The tidying of fragments into chains is an important
element of an automated model-building calculation, princi-
pally because it reduces the manual intervention required
later in the structure-solution process. The technique
described here is reliable when the completeness of the model
is good and is completely general with respect to NCS and
hetero-complexes, without requiring knowledge of the
number of copies of a given sequence present in the asym-
metric unit.
The protein-fragment database is capable of reproducing
the various functionalities implemented by previous authors,
with the efficient search algorithm allowing the use of a larger
database than in previous implementations. Some preliminary
applications have been explored and a range of future appli-
cations are planned, including the following.
(i) Use of the loop-building code to build longer loops when
the model is nearly complete. This may be in a single step, or
possibly using the stepwise approach of Joosten et al. (2008)
where a suitable large fragment is not found in the library.
(ii) Use of the fragment library to rebuild regions of the
chain where residue type influences geometry, in particular in
the vicinity of Gly and Pro residues.
(iii) Testing the use of a subset of the fragment library
to replace the current Ramachandran search in the chain-
growing step in Buccaneer, in a manner similar to that of
Terwilliger (2003).
(iv) Use of the fragment library to provide validation scores
in the manner of Jones & Thirup (1986) in the Coot software.
(v) Extension of the fragment-database concept to handle
nucleotides.
The author would like to thank the JCSG data archive for
providing a source of well curated test data. This work was
supported by the BBSRC through grant BB/F0202281.
References
Choi, Y. & Deane, C. M. (2010). Proteins, 78, 1431–1440.
Cohen, S. X., Morris, R. J., Fernandez, F. J., Ben Jelloul, M., Kakaris,
M., Parthasarathy, V., Lamzin, V. S., Kleywegt, G. J. & Perrakis, A.
(2004). Acta Cryst. D60, 2222–2229.
Cowtan, K. (2006). Acta Cryst. D62, 1002–1011.
Cowtan, K. (2008). Acta Cryst. D64, 83–89.
Cowtan, K. (2010). Acta Cryst. D66, 470–478.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta
Cryst. D66, 486–501.
Esnouf, R. M. (1997). Acta Cryst. D53, 665–672.
Joint Center for Structural Genomics (2006). JCSG Data Archive.
http://www.jcsg.org/datasets-info.shtml.
Jones, T. A. & Thirup, S. (1986). EMBO J. 5, 819–822.
Joosten, K., Cohen, S. X., Emsley, P., Mooij, W., Lamzin, V. S. &
Perrakis, A. (2008). Acta Cryst. D64, 416–424.
Kleywegt, G. J. & Jones, T. A. (1996). Acta Cryst. D52, 829–832.
Lovell, S. C., Davis, I. W., Arendall, W. B., de Bakker, P. I., Word,
J. M., Prisant, M. G., Richardson, J. S. & Richardson, D. C. (2003).
Proteins, 50, 437–450.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner,
R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011).
Acta Cryst. D67, 355–367.
Payne, P. W. (1993). Protein Sci. 2, 315–324.
Sheldrick, G. M. (2010). Acta Cryst. D66, 479–485.
Terwilliger, T. C. (2003). Acta Cryst. D59, 38–44.