Conformation Dependence of Backbone Geometry in Proteins

Donald S. Berkholz*, Maxim V. Shapovalov+, Roland L. Dunbrack Jr.+, and P. Andrew

Karplus*

* Department of Biochemistry and Biophysics, Oregon State University, 2011 ALS, Corvallis OR

97331, USA

+ Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia PA

19111, USA

Summary

Protein structure determination and predictive modeling have long been guided by the paradigm that

the peptide backbone has a single, context-independent ideal geometry. Both quantum-mechanics

calculations and empirical analyses have shown this is an incorrect simplification in that backbone

covalent geometry actually varies systematically as a function of the Φ and Ψ backbone dihedral

angles. Here, we use a nonredundant set of ultrahigh-resolution protein structures to define these

conformation-dependent variations. The trends have a rational, structural basis that can be explained

by avoidance of atomic clashes or optimization of favorable electrostatic interactions. To facilitate

adoption of this new paradigm, we have created a conformation-dependent library of covalent bond

lengths and bond angles and shown that it has improved accuracy over existing methods without any

additional variables to optimize. Protein structures derived from crystallographic refinement

and from predictive modeling both stand to benefit from incorporation of the new paradigm.

Introduction

Structural details at the 0.1 Å scale guide our understanding of enzyme catalysis, how mutations

cause disease, and what makes a good inhibitor and potential drug. Since the work of Pauling

and Corey (1951), protein model building at all levels has been guided by the assumption that

the peptide backbone has a certain ideal geometry independent of context (Figure 1). This

paradigm underlies the restraints used to guide protein structure refinement (e.g., Evans,

2007) and is also the basis of the rigid-geometry approximation used to simplify model building

in the most successful structure-prediction packages such as Rosetta and I-TASSER (Rohl et

al., 2004; Zhang, 2009). The rigid-geometry approximation uses fixed bond lengths and angles,

leaving torsion angles as the only variables needed to define the structure. Ideal target values

for the peptide backbone have varied little over the years, and a set of values most recently

updated in 1999 (EH; Engh and Huber, 1991; Engh and Huber, 2001) is commonly used (Figure

1).

Experimentally derived crystal structures at all but the highest resolutions reflect the influence

of the single-value ideal-geometry paradigm that is applied in the form of geometric restraints.

However, strong evidence exists that this paradigm is flawed. Quantum-mechanics calculations

and empirical analyses of high-resolution protein structures from over a decade ago suggested

Corresponding author: P. Andrew Karplus, karplusp@science.oregonstate.edu, Tel.: 541-737-3200; Fax: 541-737-0481.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers

we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting

proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could

affect the content, and all legal disclaimers that apply to the journal pertain.

NIH Public Access

Author Manuscript

Structure. Author manuscript; available in PMC 2010 October 14.

Published in final edited form as:

Structure. 2009 October 14; 17(10): 1316–1325. doi:10.1016/j.str.2009.08.012.

NIH-PA Author Manuscript

that the concept of a single, context-independent ideal value for backbone bond angles and

lengths was wrong (Schäfer et al., 1995; Karplus, 1996). Instead, both approaches showed that

backbone covalent geometry varies systematically as a function of the conformation of the

backbone torsion angles. The systematic conformation dependence of ideal geometry was most

notable for the N-Cα-C bond angle (NCαC) that varied by 8.8°, from 105.7° to 114.5° (Karplus,

1996). Similarly, systematic distortions of geometry are known to occur for classically

disallowed but experimentally observed conformations (e.g., Gunasekaran 1996,

Ramakrishnan 2007). And finally, particularly intriguing has been the observation that at

increasingly higher resolution, protein structures are in progressively worse agreement with

the supposedly “ideal” values (e.g., Longhi et al., 1998). This observation resulted in a recent

literature debate about how to adjust the target values used for geometric restraints and how

heavily to weight them (Jaskolski et al., 2007a; Tickle, 2007; Jaskolski et al., 2007b; Stec,

2007). We contributed to this debate with the suggestion that the root of the problem is not

simply a matter of incorrect ideal target values or weights but instead is a matter of an incorrect

paradigm in that ideal geometry should be a function, not a single value (Karplus et al.,

2008).

With the explosion of protein structures solved at 1.0 Å resolution or better, the time is ripe to

extend the earlier analysis (Karplus, 1996) and more accurately determine the nature and extent

of the systematic variations of peptide geometry with conformation. To accomplish this, we

created a nonredundant database of atomic-resolution structures that has nearly 20,000

residues. Here, we use this database to analyze conformation-dependent trends in backbone

geometry in all bond angles and lengths. We also show that accounting for these trends has the

potential to improve both crystallographic refinement and homology modeling.

Results and Discussion

Data Source and Analysis Strategy

To accurately characterize the nature and extent of conformation-dependent variations in

geometry, we used a data set of 16,682 well-defined three-residue segments from 108 diverse

protein chains determined at 1.0 Å resolution or better (see Experimental Procedures). A three-

residue segment includes all of the atoms in two complete peptide units, and the data set

included the bond lengths and bond angles for the peptide units uniquely identified by whether

they mostly involve atoms from residue −1, 0, or +1 in the three-residue segment (Figure 1).

Based on previous work (Karplus, 1996) indicating distinct geometric behavior of Gly, Pro,

the β-branched residues Ile and Val (Thr behaves more like a general residue because of

stabilizing sidechain-backbone hydrogen bonds) and residues preceding proline (prePro), we

carried out separate statistical analyses for those five groups. The data set used here included

1,379 Gly, 639 Pro, 511 general prePro (644 before exclusion of Gly/Pro/Ile/Val), 1,822 Ile/

Val, and 10,921 general residues (the 16 other residue types taken together). All prePro residues

are excluded from the other classes. As seen in Figure 2, these residues were distributed in

Φ,Ψ as has been seen for many well-filtered data sets (Karplus, 1996; Kleywegt and Jones,

1996; Lovell et al., 2003). Figure 2 also provides the shorthand nomenclature we will use for

certain regions of the Ramachandran plot.

We analyzed these results to visualize and to document the Φ,Ψ-dependent variations in bond

lengths and angles. Our approach was to use kernel-regression methods to smooth the data and

to produce continuously variable functions for each parameter (see Experimental Procedures).

The figures and tables in this paper are based on the kernel-regression analysis and only include

regions of the Ramachandran plot having an observation density of at least 0.03 residues/

degree² (i.e., 3 residues in a 10° × 10° area) and a finite standard error of the mean.

Ubiquitous, Systematic, Φ,Ψ-Dependent Variations Exist in Peptide Geometry

Bond angles: The data reveal that for general residues, all 15 bond angles in the two peptides

adjacent to the central residue vary systematically with Φ and Ψ (Figure 3 and Table 1). The

most prominent observation is that the variations do not occur only in rare outlier

conformations, but they occur throughout even the most populated areas of the plot for all

residue types (Figure 3, S1–S4). Consistent with the lower-resolution analysis (Karplus,

1996), NCαC varies the most (6.5°), but four other angles also vary by ≥5°. An important

difference from the results of the earlier study is that the conformation-dependent standard

deviations of the bond angles are about half what was seen previously (Karplus, 1996), ranging

from 1.3°–1.8° (Table 1). These are also substantially smaller than the standard deviations of

~2.5° used for the single ideal values defined by Engh and Huber (1991) based on small-

molecule structures. It is notable that ultrahigh-resolution crystal structures are generally

refined using geometric restraints that do not match the local averages, so the narrow (small

σ) distributions cannot be an artifact of the restraints used. Interestingly, the variations in the

averages are 2–4 times the standard deviations (Table 1), implying that current modeling

restraints would work to wrongly pull angles away from their actual optimal values in many

regions. Dramatically, the distributions at the extremes can even be completely non-

overlapping because of the small standard deviations (Figure 4). The standard errors of the

Φ,Ψ-dependent means (i.e., σ/√N) for bond angles are less than 0.5° in nearly all regions and

typically less than 0.2° in the highly populated regions (Figures S5–S9)—however, errors

should be considered when examining averages for the lowest-populated edges and other

regions, such as the prePro region for general residues. In comparison, the 2°–7° ranges seen

for the expected values are 10–50 times greater than their uncertainties. This shows that the

variations are well-determined and backbone geometry in no way obeys the single ideal value

paradigm.
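The standard-error argument above can be illustrated with a short calculation. The sketch below computes σ/√N and the number of standard errors spanned by a range of means. The inputs are illustrative choices within the ranges quoted in the text, not values taken from Table 1, and the function names are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for n observations with standard deviation sigma."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)

def range_to_error_ratio(value_range, sigma, n):
    """How many standard errors the range of the local means spans."""
    return value_range / standard_error(sigma, n)

# Illustrative inputs only: sigma = 1.5 degrees, 56 residues behind a local mean,
# and a 6.5 degree range of means across the Ramachandran plot.
sem = standard_error(1.5, 56)
ratio = range_to_error_ratio(6.5, 1.5, 56)
print(f"SEM = {sem:.2f} deg, range/SEM = {ratio:.0f}")
```

A ratio well above 1 is what makes the variation in the means well-determined even when individual distributions overlap.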

Bond lengths: In the 1996 study, the resolution of the data did not allow reliable visualization

of bond-length variations. Here at atomic resolution, systematic Φ,Ψ-dependent trends are now

visible in bond lengths (Figure 5) but the variation ranges (0.01 Å–0.02 Å) are only on par with

the standard deviations (0.012 Å–0.016 Å), meaning the distributions are highly overlapping.

The standard errors of the mean are smaller (~0.002 Å), so the variations in the means seen

are nevertheless significant (~10-fold larger). Given that the standard deviations are on par

with the expected coordinate accuracy, we hypothesize that the true underlying bond lengths

are distributed more narrowly and thus will require still higher resolution analyses to determine

accurately. Because of this limitation and the expectation that, because of the very small

distances involved, the bond-length variations will have little impact on modeling accuracy,

we will not further describe the bond-length trends here. Nevertheless, we suspect the variations

involved will be chemically informative (e.g., Esposito et al., 2000; Figure 5).

Variations are Correlated with Local Interactions

Comparison of conformation-dependent trends across the two sequential peptide units reveals

that the trends are largely locally influenced. For each of the seven angles associated with the

central residue, the range is larger than the range for the same angle associated with the previous

or subsequent residue (Table 1). For instance, N−1Cα−1C−1 and N+1Cα+1C+1 have ranges of

5.5° and 3.0°, whereas NCαC has a range of 6.5°. This implies that the angles in Table 1

associated with residues −1 and +1 show highly local effects, being more influenced by the

Φ,Ψ values of their respective residues than the Φ,Ψ values of residue 0 (the central residue).

For modeling purposes, it makes sense to assign the “ideal” target values for all seven of these

angles based on Φ,Ψ of the central residue.

Furthermore, among these seven angles, additional evidence of the dominance of local effects

is seen as each angle is influenced mostly by the single closest torsion angle, whether it is Φ

or Ψ. Starting at the N-terminal end, C−1NCα is heavily Φ-dependent as is seen in the vertical

pattern of variation, then the Cα-centered angles are a mixture, displaying diagonal patterning,

and the angles at the C-terminal end, such as CαCN+1, have Ψ-dependent horizontal patterning.

Even among the Cα-centered angles, NCαCβ shows enhanced dependence on Φ and CβCαC

shows enhanced dependence on Ψ. This extreme locality agrees with much prior work noting

that local steric interactions are critical factors in determining observed conformational and

secondary-structure preferences (e.g., Dunbrack and Karplus, 1994; Baldwin and Rose,

1999).

Comparison of Trends with Quantum Mechanics

As noted in the introduction, quantum-mechanical (QM) calculations of isolated alanine

peptides (Jiang et al., 1997; Yu et al., 2001) also produce conformation-dependent trends in

bond angles and bond lengths. The QM calculations are computationally intensive and they

have only been carried out at 30° resolution in Φ,Ψ (Jiang et al., 1997; Yu et al., 2001), making

detailed features of the trends unavailable. Beyond a certain level, the method and basis set

used in QM calculations is unimportant to this analysis because they produce trends on the

same scale with a nearly constant offset (Yu et al., 2001). As was reported by Karplus

(1996), the QM results have similar trends, but now it is apparent that QM results show larger

deviations, ranging farther both positively and negatively than experimental protein structures.

For example, the empirical deviations from the central value for NCαC are roughly 70% of the

calculated deviations. Additionally, QM calculations show serious discrepancies in some less

populated regions, such as a false global maximum for O−1C−1N in Lδ (Figures 2 and 3). The

mis-scaling seen in QM-calculated angles has been suggested by others to be caused by a lack

of long-distance structural effects (Jiang et al., 1997; Yu et al., 2001; Feig, 2008). However,

if that were the case, comparison of residues in secondary structure versus those in loops should

show this same difference, but Karplus (1996) did not see a difference, and here we confirm

that observation (Figures S10–S11). One potential underlying cause is the difference between

a protein environment and vacuum rather than a long-distance effect caused by repeating

secondary structure, but the reason that calculations in small peptides fail to predict the correct

details of conformation-dependent geometry for proteins is uncertain.

Local Variations Make Structural Sense

The bond-angle trends for five classes of residues for all Φ,Ψ possibilities comprise a massive

amount of information that cannot be exhaustively described in the context of this article. A

survey of the results, however, reveals a general principle that the observed trends in geometry

make structural sense in terms of accommodating local steric and electrostatic interactions,

extending the rationale for observed conformations proposed by Ho et al. (2003). In Karplus

(1996), the behavior of NCαC in the well-populated α, β, and δ regions (Figure 2) was

rationalized in these terms, including the proposal of a π-peptide interaction in the δ region

optimized by the opening of NCαC (see Figure 8 of Karplus, 1996). Instead of rehashing those

observations, here we present four illustrative examples of Φ,Ψ regions with significant

distortions. The conformations are shown in Figure 2, the relevant bond-angle values can be

seen in Figure 3, and the specific collisions being ameliorated are illustrated in Figure 6.

In the Lα/Lδ region, non-Gly residues are disfavored because when using single ideal values

for bond angles and lengths, there is a close-contact collision between O−1 and CβH. As Φ

increases, this collision becomes worse. The conformation-dependent trends show that these

conformations become accessible by a systematic increase in O−1C−1N, C−1NCα, and

NCαCβ that opens the ring between O−1 and Cβ. At the extreme tip of the region near (+90°,

0°), these angles open compared to the EH values (Figure 1) by 0.4°, 4.3°, and 2.8°,

respectively, to increase the O−1…Cβ distance from 2.59 Å to 2.85 Å. Although this change

in distance is small, as are others described in this section, they can make large energetic

differences by transforming unfavorable atomic clashes to close contacts.

The II′ region is adopted by the i+1 residue of type II′ turns, a tight turn with a hydrogen bond

between O−1 and N+2H. In this conformation, Cβ is quite close to both O−1 and N+1, which

results in this region being unfavorable for nonglycine residues. Under the rigid-geometry

approximation, the entire region should be disallowed because of this clash, but distortions in

covalent geometry allow it to be accessible. The variations seen in Figure 3 show that the

distortions relative to EH values (Figure 1) include a large opening in CβCαC (5.9°) as well as

opening of CαCN+1 (3.3°) to reduce the Cβ…N+1 clash. This also reduces the O−1…Cβ clash,

where the CβCαC distortion acts like opening jaws to move Cβ away from O−1. The extreme

bond openings are enabled by a closing of NCαC (2.5°), CαCO (1.8°), and OCN+1 (2.0°). The

Cβ…N+1 distance increases from 2.65 Å to 2.71 Å, and the O−1…Cβ distance increases from

3.06 Å to 3.09 Å.

Left of the δ region is a Ramachandran-allowed but sparsely populated region. The primary

clash is between HN and HN+1. This clash is prevented through a combination of distortions

relative to EH values: the dominant increases are in NCαC (3.5°) and CαCN+1 (2.8°) that both

exhibit their extreme values (Figure 3), coupled with a decrease in CαCO (2.0°). The combined

effect is to open and twist a nearly planar ring between NH and N+1H to prevent a van der

Waals overlap by increasing the HN…HN+1 distance from 1.78 Å to 1.92 Å and the N…

N+1 distance from 2.66 Å to 2.76 Å.

As a final example, we illustrate the importance of treating prePro as a special residue type.

Preproline residues are classically disallowed in the α region, yet they are experimentally

observed with low populations (Hurley et al., 1992). The primary clash occurs between N and

Cδ+1 with a secondary clash between CβH and Cδ+1H (Figure 6). To alleviate this clash, the

Pro ring bends away from the prePro residue through increases in NCαC (2.0°), CβCαC (2.4°),

and CαCN+1 (3.3°), enabled by decreases in CαCO (2.3°), OCN+1 (2.6°), and CN+1Cα+1 (3.8°).

In comparison to calculations by Hurley et al. (1992) that suggested a single, very large

deviation of 8.5° in CβCαC, here we observe that the distortions have diffused across all of the

angles between the sterically hindered atoms. These distortions increase the N…Cδ+1 distance

from 2.65 Å to 2.85 Å and the CβH…Cδ+1H distance from 1.86 Å to 1.90 Å to reduce the van

der Waals overlap. CN+1Cδ+1 was not included in the database, but we expect it also opens to

further alleviate the collision.

A 10°-Resolution Conformation-Dependent Library—With the knowledge of these

systematic trends comes the possibility of leveraging them to improve the accuracy of

crystallographic refinement and homology modeling. To provide a convenient form in which

the documented systematic variations can be used in modeling applications, we created a

binned conformation-dependent library (CDL) for distribution. Similar to the approach taken

by Karplus (1996), we divided Φ,Ψ space into 1296 10° × 10° bins and calculated the averages

and standard deviations for each bin for each of the five residue-type categories (Gly, Pro,

prePro, Ile/Val, General). This first-generation CDL (v1.0), available from the authors or at

http://proteingeometry.sourceforge.net/, uses a simple precalculated lookup table that accepts

conformations and returns the appropriate target value for each bond angle and length. When

considering crystallographic refinement and homology modeling, it is important to note that

using more accurate CDL values in place of EH values does not increase the number of variable

parameters used in the modeling.
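The lookup step can be sketched as follows. This is a minimal illustration of indexing a 36 × 36 grid of 10° × 10° bins, not the distributed CDL code; the table layout, the function names, and the numbers in the usage note are hypothetical.

```python
import math

BIN_WIDTH = 10           # degrees
BINS_PER_AXIS = 36       # 36 x 36 = 1296 bins cover phi,psi space

def wrap_angle(angle):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def bin_index(phi, psi):
    """Return (i, j) indices of the 10 x 10 degree bin containing phi, psi."""
    i = int(math.floor((wrap_angle(phi) + 180.0) / BIN_WIDTH)) % BINS_PER_AXIS
    j = int(math.floor((wrap_angle(psi) + 180.0) / BIN_WIDTH)) % BINS_PER_AXIS
    return i, j

def lookup(table, residue_class, phi, psi, default=None):
    """Fetch the (mean, sd) target for one angle, or `default` if the bin is empty.

    `table` is assumed to be a dict keyed by (residue_class, i, j).
    """
    return table.get((residue_class, *bin_index(phi, psi)), default)
```

For example, with a hypothetical table `{("General", 12, 14): (111.0, 1.5)}`, the call `lookup(table, "General", -60.0, -40.0)` returns that entry. Because the table is fixed before modeling begins, the lookup introduces no adjustable parameters.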

Conformation-Dependent Angles are More Accurate—A variety of simple control

calculations can be carried out to show that the CDL is an improvement over the single-value

paradigm (EH values) and even context-dependent values derived from molecular mechanics

(MM) force fields. Because an MM force field allows bond angles and lengths to vary with

conformation, it could in theory provide better conformation-dependent values than the

empirical approach.

As one simple assessment, we compared how well the NCαC values in a 1.15 Å ribonuclease

structure (PDB code 1rge; Sevcik et al., 1996) matched with EH values, the CDL, and bond-

angle values from the structure after minimization using a molecular mechanics force field (see

Experimental Procedures). As seen in Figure 7, the conformation-dependent library

outperforms both the single ideal value and molecular mechanics. Importantly, the CDL

produces more angles with very close (<1°) agreement with the reference crystal structure as

well as fewer extremely large deviations. In terms of modeling accuracy, there appears to be

no downside to using the CDL.

For a more thorough comparison of the CDL with EH values, we compared how well each

matched the NCαC values for the set of protein structures used to generate the CDL, with each

protein jackknifed out during its comparison. Averaged over the whole data set, the median

deviation from the native bond angles for the EH single-value paradigm was 1.5°/residue and

the median deviation for the CDL dropped to 1.1°/residue. This amounts to an improvement

of ~25% in NCαC accuracy relative to the old paradigm.
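A jackknifed comparison of this kind can be sketched as below. The sketch assumes that the library value for a residue is the mean of its Φ,Ψ bin computed with that residue's own protein left out. The data layout and function name are hypothetical, and the actual comparison may differ in detail.

```python
from collections import defaultdict
from statistics import median

def jackknife_deviations(observations, single_ideal):
    """observations: list of (protein_id, bin_key, angle) tuples.

    Returns (median |angle - single_ideal|, median |angle - leave-one-out bin mean|).
    Residues whose bin is empty once their own protein is removed are skipped.
    """
    total = defaultdict(float)      # sum of angles per bin, all proteins
    count = defaultdict(int)
    own_total = defaultdict(float)  # sum of angles per (protein, bin)
    own_count = defaultdict(int)
    for protein, key, angle in observations:
        total[key] += angle
        count[key] += 1
        own_total[(protein, key)] += angle
        own_count[(protein, key)] += 1

    dev_single, dev_library = [], []
    for protein, key, angle in observations:
        n_others = count[key] - own_count[(protein, key)]
        if n_others == 0:
            continue  # no independent data for this bin
        loo_mean = (total[key] - own_total[(protein, key)]) / n_others
        dev_single.append(abs(angle - single_ideal))
        dev_library.append(abs(angle - loo_mean))
    return median(dev_single), median(dev_library)
```

Leaving each protein out of its own target values keeps the comparison from rewarding the library for memorizing the structures it is tested on.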

To understand the impact this difference could have upon protein modeling, coordinates for

each jackknifed structure were rebuilt from torsion and bond angles using EH or CDL values.

Holmes and Tsai (2004) have shown that the replacement of experimental bond angles with

ideal ones while holding Φ and Ψ fixed shifts coordinates by an average of 6 Å (unnormalized

by protein length), and this limits model-building accuracy. Here, applying the same approach,

we find that the median Cα RMSD100 (normalized to the length of a 100-residue protein) from

the native structure for EH values was 3.23 Å, and for CDL values it was 2.85 Å. The CDL

produced a significant improvement in the Cα RMSD100 of ~0.4 Å over the old single-value

paradigm.
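Rebuilding coordinates from internal geometry comes down to repeatedly placing one atom from three preceding atoms, a bond length, a bond angle, and a torsion angle. The sketch below shows one standard way to do this placement. It is offered as an illustration under that assumption, not as the procedure used for the numbers above, and all names are hypothetical.

```python
import math

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    norm = math.sqrt(dot(u, u))
    return (u[0] / norm, u[1] / norm, u[2] / norm)

def place_atom(a, b, c, bond, angle_deg, torsion_deg):
    """Place atom d so that |c-d| = bond, angle b-c-d = angle_deg,
    and torsion a-b-c-d = torsion_deg."""
    theta = math.radians(angle_deg)
    tau = math.radians(torsion_deg)
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal to the a-b-c plane
    m = cross(n, bc)                 # bc, m, n form a right-handed local frame
    x = -bond * math.cos(theta)
    y = bond * math.sin(theta) * math.cos(tau)
    z = bond * math.sin(theta) * math.sin(tau)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def torsion(a, b, c, d):
    """Signed torsion angle a-b-c-d in degrees."""
    b0 = sub(a, b)
    b1 = unit(sub(c, b))
    b2 = sub(d, c)
    v = sub(b0, tuple(dot(b0, b1) * t for t in b1))
    w = sub(b2, tuple(dot(b2, b1) * t for t in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))
```

Because each placed atom becomes a reference for the next, a small error in a bond angle near the start of the chain is carried into every later atom, which is why single-value angles can shift coordinates by several ångströms over a whole protein.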

Potential Applications: Crystallographic Refinement and Homology Modeling—

To assess the potential impact of accounting for Φ,Ψ-dependent variations upon X-ray crystal

structures at various resolutions, we evaluated how much the experimental NCαC values

deviated from those in the CDL as a function of resolution (Figure 8). To avoid bias, none of

the structures used in the survey were used in the generation of the CDL. Analysis of the data

shows that for structures solved at near 1 Å resolution, the RMSD of NCαC from the CDL is

~1.6°. This matches well with the standard deviation seen in the CDL for this angle and serves

as an effective validation of the CDL. Additionally, the small standard deviation of the RMSDs

at this resolution shows that each of the individual high-resolution structures is well-described

by the CDL. Already at a resolution of 1.5 Å, normally considered very high resolution, the

match of NCαC values to the CDL is nearly twice as poor as for the 1.0 Å resolution structures.

This loss of accuracy became steadily more pronounced in lower-resolution structures, rising

to nearly 4° at 3.0 Å resolution. We conclude that by using the CDL, high-, medium-, and low-

resolution structures could all be improved. We suspect that at resolutions worse than 3 Å, the

CDL would have less impact because dihedral angles would be less reliable.

To understand the potential benefit of accounting for Φ,Ψ-dependent geometry variations in

predictive modeling of protein structure, we carried out a test using the Rosetta modeling

program (Rohl et al., 2004). A standard control calculation for homology modeling is to ask

how far a crystal structure moves from the experimental structure when minimized by the force

field. This provides a lower limit on how accurately a structure can be predicted (e.g., Bradley

et al., 2005). For our test, we performed a series of 100 Monte Carlo energy minimizations

starting with different random seeds using both native and “ideal” bond geometries for two

ultrahigh-resolution protein structures: ribonuclease chain A at 1.15 Å resolution (PDB code

1rge; Sevcik et al., 1996; Figure 9) and the PDZ domain of syntenin at 0.73 Å (PDB code 1r6j;

Kang et al., 2004; data not shown). “Native” geometry refers to the bond lengths and angles

as seen in the crystal structure. As seen in Figure 9A, minimizations using the “native” bond

geometry instead of the idealized geometry resulted in better convergence (tighter grouping)

and allowed the minimized structure to be about 30% closer to the true structure (~0.6 Å vs

~0.9 Å). One notable feature is that the improved behavior occurs despite the force field’s

optimization based on the traditional “ideal” geometry values. We conclude from this that the

use of the rigid-geometry approximation with standard single ideal values limits modeling

accuracy substantially. Thus, it is worthwhile to adapt modeling programs to account for the

new conformation-dependent geometry paradigm.

To pinpoint exactly where in the structure the improvements occurred, we calculated the

deviations between the crystal structure and the energy-minimized structures using native

versus ideal geometry (Figure 9B). As an indication of the variation that can occur for this

protein in two environments, the deviations with chain B from the same structure are also

shown. The largest differences between EH and experimental geometry occur in loops rather

than regular secondary structure (Figure 9B). This meets the expectation that the largest

systematic deviations from single ideal values should occur in parts of the protein with less

frequently observed, more diverse Φ,Ψ values. The reason is that the most highly

populated regions dominate the global averages, resulting in the illusion of single ideal values

assumed in EH, whereas more conformationally diverse, less populated regions contribute less

to the global average. Importantly, the two loops that were highly improved by using

experimental geometry are at the active site of the protein, so inaccuracies in how they are

modeled would significantly degrade the ability of this mock homology model to provide

insight.

Outlook—The studies here show that the dependence of backbone geometry on conformation

is unmistakably real, significant, and systematic, and it has a rational structural basis. These

systematic distortions in covalent geometry additionally explain how some conformations are

accessible to amino-acid residues despite being theoretically disallowed by modeling based on

single ideal values for backbone geometry. Extending these studies to the conformation

dependence of the ω and χ1 torsion angles will be described elsewhere. The conformation-

dependent library we derived from the database represents the first step toward implementing

the new paradigm of “ideal-geometry functions.” With its much-improved agreement to

ultrahigh-resolution crystal structures, the ideal-geometry functions provide an intellectually

satisfying resolution to the debate among crystallographers as to what ideal values should be

used during refinement. Also, because the ideal-geometry functions captured in the CDL are

simply a highly enlarged set of immutable ideal values, their use in the place of single ideal

values represents no increase in algorithmic complexity. Use of the CDL thus offers the

potential for improved modeling accuracy in a wide variety of experimentally based and

predictive modeling applications without increasing the risk of overfitting.

Experimental Procedures

Data Set Construction

A Protein Geometry Database being developed in our laboratory (Berkholz et al., submitted)

was used to generate our data set of atomic-resolution geometry information. To optimally

analyze Φ,Ψ-dependent geometry trends, the data set must be large but also have independent

and accurate information about geometry. The plethora of new atomic-resolution protein

structures allowed us to use stringent criteria for independence and accuracy, yet still have

sufficient observations for reasonable statistics. To ensure independence, we used the

PDBSelect (Hobohm and Sander, 1994) list from March 2006 to choose protein chains with

less than 90% sequence identity to any other chain in the data set. To ensure high accuracy,

we only used structures determined at 1.0 Å or better. At this resolution, we estimate Φ and

Ψ dihedral angle accuracy to be better than 3° (see next paragraph). Also, as in Karplus

(1996), to ensure that individual residues used were well-resolved, we required that all residues

in a five-residue segment be well-ordered, having B-factors <25 Å² for the mainchain

average, the sidechain average, and Cγ; alternative conformations were discarded.

To estimate the experimental uncertainty in Φ and Ψ for 1 Å resolution structures, we chose

to use a straightforward, empirical approach—randomize and re-refine a test structure multiple

times and then examine the spread of the dihedral angles among the structures. Specifically,

we applied 10 coordinate randomizations with a mean shift of 0.2 Å using phenix.pdbtools

(Adams et al., 2002) to the coordinates of glutathione reductase at 0.95 Å resolution (PDB ID:

3dk9; Berkholz et al., 2008) and re-refined each in SHELXL (Sheldrick, 2008). Dihedral

RMSDs for the vast majority of residues were between 1°–2°. The 90th percentile of the per-

residue RMSDs in both Φ and Ψ was 2.2°, and the RMSD values of the per-residue RMSDs

for Φ and Ψ were 1.7° and 2.4°, respectively.
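The spread of a dihedral angle across re-refined models can be summarized as sketched below. The sketch assumes the per-residue RMSD is taken about the circular mean of the angle, with differences wrapped to ±180°, and that percentiles are interpolated linearly. The text does not specify these details, and the function names are hypothetical.

```python
import math

def angular_difference(a, b):
    """Smallest signed difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def circular_mean(angles):
    """Mean direction of a list of angles given in degrees."""
    s = sum(math.sin(math.radians(x)) for x in angles)
    c = sum(math.cos(math.radians(x)) for x in angles)
    return math.degrees(math.atan2(s, c))

def per_residue_rmsd(angles):
    """RMS deviation of one residue's dihedral about its circular mean across models."""
    mean = circular_mean(angles)
    return math.sqrt(sum(angular_difference(x, mean) ** 2 for x in angles) / len(angles))

def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
```

Wrapping the differences matters for residues near ±180°, where a naive subtraction would report a spread of nearly 360° for two almost identical conformations.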

Kernel Regression for the Bond Lengths and Bond Angles

The data value of any structural parameter a of residue i (or of the left or right neighbor of

residue i) may be expressed:

where m is a regression function, and ε are random Gaussian-distributed errors with mean 0

and σ=1:

In these expressions, E is the expectation value of a and Var is the variance of a.

To obtain an estimate of m and ν, we use a zeroth-order or Nadaraya-Watson kernel regression

(Nadaraya, 1964) by summing over N data points:

The latter is Var(a|φ,ψ), an estimate of the heteroscedastic data variance as a function of φ and

ψ.
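The estimator equations are likewise not reproduced in this copy. The standard Nadaraya-Watson form with a product kernel, which matches the description of a sum over N data points giving a mean and a heteroscedastic variance, would read as follows; the notation is an assumption.

```latex
\hat{m}(\varphi,\psi) =
  \frac{\sum_{i=1}^{N} K(\varphi-\varphi_i)\,K(\psi-\psi_i)\,a_i}
       {\sum_{i=1}^{N} K(\varphi-\varphi_i)\,K(\psi-\psi_i)}

\hat{\nu}(\varphi,\psi) =
  \frac{\sum_{i=1}^{N} K(\varphi-\varphi_i)\,K(\psi-\psi_i)\,
        \bigl(a_i-\hat{m}(\varphi,\psi)\bigr)^{2}}
       {\sum_{i=1}^{N} K(\varphi-\varphi_i)\,K(\psi-\psi_i)}
```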

The functions K are kernels that weight the data points based on how far away they are from

the query (φ,ψ) value. Since φ and ψ are angles, we use the product of two von Mises kernel

functions (Mardia and Zemroch, 1975).

Structure. Author manuscript; available in PMC 2010 October 14.

At large values of κ, these functions behave very similarly to Gaussian distributions, except

that they are periodic. We investigated several values of κ and plotted the resulting regressions

as a function of φ and ψ. We empirically chose a value of κ=50 to produce distributions that

varied smoothly with φ and ψ in a reasonable way.
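A zeroth-order kernel regression with a product of von Mises kernels can be sketched as below. The von Mises density is exp(κ cos x)/(2π I0(κ)); because the estimators are ratios of kernel-weighted sums, the normalizing constant cancels and is omitted here. The function names and the data layout are hypothetical, and this is an illustration of the technique rather than the authors' code.

```python
import math

KAPPA = 50.0  # concentration parameter chosen empirically in the text

def von_mises_weight(delta_deg, kappa=KAPPA):
    """Unnormalized von Mises kernel for an angular offset in degrees.

    exp(kappa * (cos(d) - 1)) equals exp(kappa * cos(d)) up to a constant
    factor and cannot overflow; the constant cancels in the ratios below.
    """
    return math.exp(kappa * (math.cos(math.radians(delta_deg)) - 1.0))

def kernel_regression(phi, psi, data, kappa=KAPPA):
    """Estimate mean and variance of a parameter at the query (phi, psi).

    data: iterable of (phi_i, psi_i, a_i) tuples, angles in degrees.
    Returns (m_hat, v_hat).
    """
    weights = []
    values = []
    for phi_i, psi_i, a_i in data:
        w = von_mises_weight(phi - phi_i, kappa) * von_mises_weight(psi - psi_i, kappa)
        weights.append(w)
        values.append(a_i)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no data carry weight at this query point")
    m_hat = sum(w * a for w, a in zip(weights, values)) / total
    v_hat = sum(w * (a - m_hat) ** 2 for w, a in zip(weights, values)) / total
    return m_hat, v_hat
```

Because the kernel depends on the offset only through its cosine, it is periodic by construction, so data at φ = 179° contribute correctly to a query at φ = −179°.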

The φ,ψ map is not uniformly populated by data points, each of them representing a single

residue backbone conformation. Therefore, for the unpopulated regions of the map, the kernel

regression analysis generates non-local estimates of m and ν. A query point (φ,ψ) at which we

estimate the expectation and variance of a can be surrounded by a circle of effective radius r, equal

to half the bandwidth b of the kernel function K. We can count the effective number of data

points, Neff, within the radius r of any query point. These points will have an impact on

the estimated local values of m and ν.

We define the bandwidth b(κ) as the diameter of the circle centered on the query point (φ0,ψ0)

within which the von Mises kernel function integrates to 68.2% (the integral of the

normal distribution PDF within one standard deviation of its center):

The bandwidth of the von Mises kernel at κ=50 is approximately 16°.
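The quoted bandwidth can be checked numerically. The sketch below assumes the 68.2% integral is taken over a single one-dimensional von Mises kernel, so that the bandwidth is the full width of the central interval; this is an interpretation, adopted because it reproduces a value close to 16° at κ=50. The function names are hypothetical.

```python
import math

def _integral(limit, kappa, steps=4000):
    """Midpoint-rule integral of exp(kappa * (cos t - 1)) over [0, limit] (radians)."""
    h = limit / steps
    return h * sum(
        math.exp(kappa * (math.cos((k + 0.5) * h) - 1.0)) for k in range(steps)
    )

def bandwidth_deg(kappa, target=0.682):
    """Full width (degrees) of the central interval holding `target` of the mass."""
    total = _integral(math.pi, kappa)  # half of the full circle; symmetry cancels
    lo, hi = 0.0, math.pi
    for _ in range(40):  # bisection on the half-width
        mid = 0.5 * (lo + hi)
        if _integral(mid, kappa) / total < target:
            lo = mid
        else:
            hi = mid
    # The half-width is (lo + hi) / 2, so the full width is lo + hi.
    return math.degrees(lo + hi)

if __name__ == "__main__":
    print(round(bandwidth_deg(50.0), 1))
```

For large κ the von Mises kernel approaches a Gaussian with standard deviation of about 1/√κ radians, which is roughly 8° at κ=50; twice that value gives the same estimate.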

In order to depict the trends of m̂(φ,ψ) and v̂(φ,ψ), we only plot their estimates at φ,ψ grid points

where Neff(φ,ψ) ≥ 3 within a circle with a diameter equal to the bandwidth b(κ=50) = 16°.

In the sparsely populated areas of the φ,ψ map, the threshold of at least 3 data points within the

effective bandwidth may lead to estimates with high standard errors of the mean (SEM). We

calculated an estimate of SEM, as

It is important to analyze the trends of m and ν as a function of φ,ψ together with

SEM(a|φ,ψ). The values of SEM indicate the significance of the trend in the more sparsely

populated areas.
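The counting of Neff and the plotting threshold can be sketched as follows. The SEM expression is not reproduced in this copy of the text, so the form used here, the square root of the estimated variance divided by Neff, is an assumption based on the usual definition of a standard error of the mean. The Euclidean distance in wrapped (φ,ψ) offsets and all function names are also illustrative choices.

```python
import math

def wrap(delta_deg):
    """Wrap an angular offset in degrees to [-180, 180)."""
    return (delta_deg + 180.0) % 360.0 - 180.0

def n_eff(phi, psi, points, radius_deg=8.0):
    """Count data points within radius_deg of the query (half of the 16 degree bandwidth)."""
    count = 0
    for phi_i, psi_i in points:
        if math.hypot(wrap(phi - phi_i), wrap(psi - psi_i)) <= radius_deg:
            count += 1
    return count

def sem_estimate(variance, neff, min_points=3):
    """Assumed SEM: sqrt(variance / Neff); None where the grid point would not be plotted."""
    if neff < min_points:
        return None
    return math.sqrt(variance / neff)
```

A grid point with fewer than three neighbors returns None, mirroring the rule that estimates are plotted only where Neff is at least 3.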

Creation of the Binned Conformation-Dependent Library

To create a binned conformation-dependent library (CDL) for each residue class, averages and

standard deviations were calculated in 10° × 10° bins in Φ,Ψ. The results were stored in a set

of files, one per residue class. Python scripts provide an interface to the CDL, allowing easy

retrieval of the conformation-dependent values when given a residue name and conformation.

Additional tools building upon this simple interface are also part of the distributed code,

including a tool that will compare the bond angles and lengths in any PDB coordinate file with

CDL values, EH values, or another PDB coordinate file. The CDL and accessory tools are

available under an open-source license from http://proteingeometry.sourceforge.net/.
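The binning step can be illustrated with a short sketch. It is not the distributed CDL interface: the function names, the in-memory dictionary, and the use of the sample standard deviation are assumptions made for illustration only.

```python
import math
from collections import defaultdict

BIN_WIDTH = 10  # degrees, as in the 10 x 10 degree bins described above

def bin_index(phi, psi):
    """Lower corner (degrees) of the bin containing (phi, psi), angles wrapped to [-180, 180)."""
    def corner(angle):
        wrapped = (angle + 180.0) % 360.0 - 180.0
        return int(math.floor(wrapped / BIN_WIDTH) * BIN_WIDTH)
    return corner(phi), corner(psi)

def build_library(observations):
    """observations: iterable of (phi, psi, value). Returns {bin: (mean, sd, count)}."""
    groups = defaultdict(list)
    for phi, psi, value in observations:
        groups[bin_index(phi, psi)].append(value)
    library = {}
    for key, values in groups.items():
        n = len(values)
        mean = sum(values) / n
        # Sample standard deviation; undefined for a single observation, reported as 0.0.
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1)) if n > 1 else 0.0
        library[key] = (mean, sd, n)
    return library

def lookup(library, phi, psi):
    """Return (mean, sd, count) for the bin containing (phi, psi), or None if empty."""
    return library.get(bin_index(phi, psi))
```

In practice one such table would be built per residue class and per bond length or angle, and a caller would fall back to a single-value target where a bin is empty.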

Molecular Mechanics Calculations

Molecular mechanics-derived context-dependent values for bond angles for two test cases

(PDB codes 1rge (Sevcik et al., 1996) and 1r6j (Kang et al., 2004)) were generated using the

following protocol: the structures were minimized in CHARMM (Brooks et al., 1983) using

the parm_all22_prot force field with the CMAP correction (MacKerell, 2004) using the GBMV

implicit solvent model (Lee et al., 2003). The protocol used cycles of 100 steps of steepest-descent

minimization with heavy-atom restraints of 5, 3, 1, and 0 × atomic mass kcal/mol/Å².

Following the last cycle (which had no restraints), 1000 steps of adopted basis Newton-Raphson

minimization were performed, and the typical gradient RMS was about 0.05 kcal/mol/Å.

CDL Assessments

Building Ideal Models and Analysis of Nonbonded Interactions—Ideal peptides

with EH or CDL backbone geometry were built using PyRosetta

(http://graylab.jhu.edu/~sid/pyrosetta/), Python bindings to Rosetta (Rohl et al., 2004). To

account for the length dependence of RMSD calculations (e.g., Holmes and Tsai, 2004), we

linearly rescaled all RMSDs to those of 100-residue proteins using the EH RMSDs and the

assumption that the fit of RMSD versus length passes through the origin. Based on the resulting linear fit of EH RMSDs versus length,

we calculated a scaling factor of (0.0332519/100)/(0.0332519/length). To

understand the structural basis of variations between these theoretical peptides, van der Waals

clashes were visually analyzed using the Coot (Emsley and Cowtan, 2004) interface to

MolProbity (Davis et al., 2007).

Crystal Structure NCαC Angles—Nonredundant structures with a 25% sequence-identity

threshold were taken from PDBSelect (Hobohm and Sander, 1994). Among these, 50 structures

were selected from each of five resolution ranges: 1.0–1.1 Å, 1.5–1.6 Å, 2.0–2.1 Å, 2.5–2.6

Å, 3.0–3.1 Å. For each residue in these structures, we then calculated the difference between the

observed NCαC angle and the CDL value. These differences were used to calculate the per-structure RMSDs,

which were then used to calculate averages, standard deviations, and standard errors of the

mean for each of the five resolution shells.
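The two-level summary described above, residue differences to a per-structure RMSD and then per-structure RMSDs to shell statistics, can be sketched as follows. The function names are hypothetical, and the sample (n−1) standard deviation is an assumption.

```python
import math

def rmsd(differences):
    """Root-mean-square of per-residue (observed - CDL) angle differences for one structure."""
    return math.sqrt(sum(d * d for d in differences) / len(differences))

def shell_statistics(per_structure_rmsds):
    """Mean, sample standard deviation, and SEM of the RMSDs in one resolution shell."""
    n = len(per_structure_rmsds)
    mean = sum(per_structure_rmsds) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in per_structure_rmsds) / (n - 1))
    sem = sd / math.sqrt(n)
    return mean, sd, sem
```

With 50 structures per shell, the SEM is the shell standard deviation divided by √50, so it is about one-seventh of the standard deviation.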

Supplementary Material

Refer to Web version on PubMed Central for supplementary material.

Acknowledgments

We thank Charles L. Brooks III (University of Michigan) for performing the molecular-mechanics minimizations used

in this study. We additionally thank the David Baker lab (University of Washington at Seattle), in particular Srivatsan

Raman, James Thompson, and Elizabeth Kellogg, for their help with Rosetta. We thank Jeffrey Gray (Johns Hopkins

University) for providing PyRosetta, the Python bindings to Rosetta. We thank Lothar Schäfer (University of Arkansas)

for providing a database of QM-calculated dipeptides and an extrapolation program to obtain values for conformation-

dependent bond angles and lengths. This work was supported in part by NIH grant R01-GM083136 (to PAK), NSF

grant MCB-9982727 (to PAK), and NIH grant P20-GM76222 (to RLD).

References

Adams PD, Grosse-Kunstleve RW, Hung LW, Ioerger TR, McCoy AJ, Moriarty NW, Read RJ,

Sacchettini JC, Sauter NK, Terwilliger TC. PHENIX: building new software for automated

crystallographic structure determination. Acta Crystallogr D Biol Crystallogr 2002;58:1948–1954.

[PubMed: 12393927]

Baldwin RL, Rose GD. Is protein folding hierarchic? I. Local structure and peptide folding. Trends

Biochem Sci 1999;24:26–33. [PubMed: 10087919]

Berkholz DS, Faber HR, Savvides SN, Karplus PA. Catalytic cycle of human glutathione reductase near

1 Å resolution. J Mol Biol 2008;382:371–384. [PubMed: 18638483]

Bradley P, Misura KMS, Baker D. Toward high-resolution de novo structure prediction for small proteins.

Science 2005;309:1868–1871. [PubMed: 16166519]

Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M.

CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J

Comput Chem 1983;4:187–217.

Corey RB, Donohue J. Interatomic distances and bond angles in the polypeptide chain of proteins. J Am

Chem Soc 1950;72:2899–2900.

Dunbrack RL, Karplus M. Conformational analysis of the backbone-dependent rotamer preferences of

protein sidechains. Nat Struct Biol 1994;1:334–340. [PubMed: 7664040]

Emsley P, Cowtan K. Coot: model-building tools for molecular graphics. Acta Crystallogr D Biol

Crystallogr 2004;60:2126–2132. [PubMed: 15572765]

Engh RA, Huber R. Accurate bond and angle parameters for X-ray protein structure refinement. Acta

Crystallogr A Found Crystallogr 1991;47:392–400.

Engh RA, Huber R. International Tables for Crystallography. In: Rossmann MG, Arnold E, editors.

International Tables for Crystallography. Dordrecht, The Netherlands: Kluwer Academic Publishers;

2001. p. 382–392.

Esposito L, Vitagliano L, Zagari A, Mazzarella L. Experimental evidence for the correlation of bond

distances in peptide groups detected in ultrahigh-resolution protein structures. Protein Eng

2000;13:825–828. [PubMed: 11239081]

Evans PR. An introduction to stereochemical restraints. Acta Crystallogr D Biol Crystallogr 2007;63:58–

61. [PubMed: 17164527]

Feig M. Is alanine dipeptide a good model for representing the torsional preferences of protein backbones?

J Chem Theory Comput 2008;4:1555–1564.

Gunasekaran K, Ramakrishnan C, Balaram P. Disallowed Ramachandran conformations of amino acid

residues in protein structures. J Mol Biol 1996;264:191–198. [PubMed: 8950277]

Ho BK, Thomas A, Brasseur R. Revisiting the Ramachandran plot: Hard-sphere repulsion, electrostatics,

and H-bonding in the α-helix. Protein Sci 2003;12:2508–2522. [PubMed: 14573863]

Hobohm U, Sander C. Enlarged representative set of protein structures. Protein Sci 1994;3:522–524.

[PubMed: 8019422]

Hollingsworth SA, Berkholz DS, Karplus PA. On the occurrence of linear groups in proteins. Protein Sci

2009;18:1321–1325. [PubMed: 19472372]

Holmes JB, Tsai J. Some fundamental aspects of building protein structures from fragment libraries.

Protein Sci 2004;13:1636–1650. [PubMed: 15152094]

Hurley JH, Mason DA, Matthews BW. Flexible-geometry conformational energy maps for the amino

acid residue preceding a proline. Biopolymers 1992;32:1443–1446. [PubMed: 1457725]

Jaskolski M, Gilski M, Dauter Z, Wlodawer A. Numerology versus reality: a voice in a recent dispute.

Acta Crystallogr D Biol Crystallogr 2007a;63:1282–1283. [PubMed: 18084076]

Jaskolski M, Gilski M, Dauter Z, Wlodawer A. Stereochemical restraints revisited: how accurate are

refinement targets and how much should protein structures be allowed to deviate from them? Acta

Crystallogr D Biol Crystallogr 2007b;63:611–620. [PubMed: 17452786]

Jiang X, Yu C, Cao M, Newton SQ, Paulus EF, Schäfer L. φ/ψ-Torsional dependence of peptide backbone

bond-lengths and bond-angles: comparison of crystallographic and calculated parameters. J Mol

Struct 1997;403:83–93.

Kang BS, Devedjiev Y, Derewenda U, Derewenda ZS. The PDZ2 domain of syntenin at ultra-high

resolution: bridging the gap between macromolecular and small molecule crystallography. J Mol

Biol 2004;338:483–493. [PubMed: 15081807]

Karplus PA. Experimentally observed conformation-dependent geometry and hidden strain in proteins.

Protein Sci 1996;5:1406–1420. [PubMed: 8819173]

Karplus P, Shapovalov M, Dunbrack R Jr, Berkholz DS. A forward-looking suggestion for resolving the

stereochemical restraints debate: ideal geometry functions. Acta Crystallogr D Biol Crystallogr

2008;64:335–336. [PubMed: 18323629]

Kleywegt GJ, Jones TA. Phi/psi-chology: Ramachandran revisited. Structure 1996;4:1395–1400.

[PubMed: 8994966]

Laskowski RA, Chistyakov VV, Thornton JM. PDBsum more: new summaries and analyses of the known

3D structures of proteins and nucleic acids. Nucleic Acids Res 2005;33:D266–D268.

Lee MS, Feig M, Salsbury FR, Brooks CL. New analytic approximation to the standard molecular volume

definition and its application to generalized Born calculations. J Comput Chem 2003;24:1348–1356.

[PubMed: 12827676]

Longhi S, Czjzek M, Cambillau C. Messages from ultrahigh resolution crystal structures. Curr Opin

Struct Biol 1998;8:730–737. [PubMed: 9914254]

Lovell SC, Davis IW, Arendall WB III, de Bakker PIW, Word JM, Prisant MG, Richardson JS,

Richardson DC. Structure validation by Cα geometry: φ, ψ and Cβ deviation. Proteins: Struct Func

Genet 2003;50:437–450.

MacKerell AD. Empirical force fields for biological macromolecules: overview and issues. J Comput

Chem 2004;25:1584–1604. [PubMed: 15264253]

Mardia KV, Zemroch PJ. Algorithm AS 86: The Von Mises distribution function. Applied Statistics

1975;24:268–272.

Nadaraya E. On estimating regression. Theory of Probability and its Applications 1964;9:141–142.

Pauling L, Corey RB, Branson HR. The structure of proteins; two hydrogen-bonded helical configurations

of the polypeptide chain. Proc Natl Acad Sci USA 1951;37:205–211. [PubMed: 14816373]

Ramakrishnan C, Lakshmi B, Kurien A, Devipriya D, Srinivasan N. Structural compromise of disallowed

conformations in peptide and protein structures. Protein Pept Lett 2007;14:672–682. [PubMed:

17897093]

Rohl CA, Strauss CEM, Misura KMS, Baker D. Protein structure prediction using Rosetta. Methods

Enzymol 2004;383:66–93. [PubMed: 15063647]

Schäfer L, Cao M. Predictions of protein backbone bond distances and angles from first principles. J Mol

Struct 1995;333:201–208.

Sevcik J, Dauter Z, Lamzin VS, Wilson KS. Ribonuclease from Streptomyces aureofaciens at Atomic

Resolution. Acta Crystallogr D Biol Crystallogr 1996;52:327–344. [PubMed: 15299705]

Sheldrick GM. A short history of SHELX. Acta Crystallogr A Found Crystallogr 2008;64:112–122.

Stec B. Comment on Stereochemical restraints revisited: how accurate are refinement targets and how

much should protein structures be allowed to deviate from them? by Jaskolski, Gilski, Dauter and

Wlodawer (2007). Acta Crystallogr D Biol Crystallogr 2007;63:1113–1114. [PubMed: 17881830]

Tickle IJ. Experimental determination of optimal root-mean-square deviations of macromolecular bond

lengths and angles from their restrained ideal values. Acta Crystallogr D Biol Crystallogr

2007;63:1274–1281. [PubMed: 18084075]

Yu CH, Norman MA, Schäfer L, Ramek M, Peeters A, van Alsenoy C. Ab initio conformational analysis

of N-formyl L-alanine amide including electron correlation. J Mol Struct 2001;567:361–374.

Zhang Y. Protein structure prediction: when is it useful? Curr Opin Struct Biol 2009;19:145–155.

[PubMed: 19327982]

Figure 1.

Evolution of the ideal values for backbone geometry used in the single-value paradigm. A

central residue (residue 0) is shown with atoms from residues −1 and +1 that contribute to its

two adjacent peptide units. For each of the seven bond angles associated with residue 0, three

ideal values from earlier work are shown from oldest (top) to most recent (bottom). They are

from Corey and Donohue (1950), Engh and Huber (1991), and Engh and Huber (2001). Most

refinement and modeling programs use one of the Engh and Huber sets or a slight variation on

them. Rotatable bonds defining the backbone torsion angles Φ and Ψ are indicated. Figure

created with Inkscape.

Figure 2.

Protein backbone conformations of non-Gly residues. This Ramachandran plot is colored by

empirical observation density in atomic-resolution proteins. Labels indicate regions of

particular interest (Karplus, 1996; Lovell et al., 2003; Hollingsworth et al., 2009). Coloring

uses a logarithmic function to allow lower- and higher-population regions to be seen

simultaneously. Observation density was calculated using kernel regressions (see

Experimental Procedures). Unlabeled versions of this plot and another for only Gly residues

are available as supplementary material (Figures S12 and S13). Figure created with Inkscape.

Figure 3.

Conformation-dependent variation in bond angles of general residues as a function of the

Φ,Ψ of the central residue. A Ramachandran plot is shown for each backbone bond angle in

the two peptide units surrounding the central residue of the tripeptide. The seven unique peptide

bond angles are associated with either residue −1, 0, or +1 based on which residue contributes

at least two atoms to the angle. Φ and Ψ in each plot, however, refer to the central residue, 0.

Within each plot, colors indicate averages ranging from the global minimum (blue) to the global

maximum (red) as calculated using kernel regressions (see Experimental Procedures). The

global minima and maxima are provided in each plot. Figure created with Matlab.
