EXPLORING PROTEIN FOLDING TRAJECTORIES USING
D. RUSSEL and L. GUIBAS
Computer Science Department
353 Serra Mall
Stanford, CA 94305, USA
We describe the 3-D structure of a protein using geometric spanners — geometric
graphs with a sparse set of edges where paths approximate the $n^2$ inter-atom
distances. The edges in the spanner pick out important proximities in the structure,
labeling a small number of atom pairs or backbone region pairs as being of primary
interest. Such compact multiresolution views of proximities in the protein can
be quite valuable, allowing, for example, easy visualization of the conformation
over the entire folding trajectory of a protein and segmentation of the trajectory.
These visualizations allow one to easily detect formation of secondary and tertiary
structures as the protein folds.
There has been extensive work on visualizing the 3-D structure of proteins
in ways that attempt to make certain aspects of the structure more
apparent. For example, commonly used software packages such as RasMol ,
, or SPV , among others, permit visualizations via
hard-sphere models, stick models, and ribbon models that emphasize differ-
ent aspects of the protein surface or secondary structure. Even more abstract
visualizations have been used as a tool for understanding intra-molecular prox-
imities, including contact maps and distance matrix images . None of these
approaches work very well, however, if the goal is to visualize proteins in mo-
tion and not just their static conformations.
Large corpora of molecular trajectories are becoming available through
efforts such as Folding@Home  where molecular simulations are carried
out on distributed networks of many thousands of computers. There is an
increasing need to compare, classify, summarize, and organize the space of such
protein trajectories with an eye toward advancing our understanding of protein
folding by studying their ensemble behaviors. Most currently used methods for
understanding such data revolve around computing a few summary statistics
for each conformation, such as radius of gyration or number of native contacts
and watching how these evolve during each trajectory. More closely related to our
work, the chemical distance, a statistic of an adjacency graph of the amino acids, was
used to differentiate folded and unfolded states. In this paper we explore the
use of a richer and more abstract representation of the protein structure, based
on spanners, which makes the task of understanding and exploring the space
of protein motions easier.
Our basic idea is to take the continuous folding process and map it to
a more discrete combinatorial representation. This representation focuses on
higher-level geometric proximities that tend to form and be more stable over
time rather than atom coordinates or specific aspects of secondary/tertiary
structure. Specifically, we look at the formation of proximities between differ-
ent parts of the protein across a range of scales, and track the changes of such
proximities over time. Our more abstract description of the folding process is
in terms of ‘proximity events’ — when certain proximities are formed or de-
stroyed. Together, these characterize the folding process in a qualitative way
and capture the important aspects of the trajectory, the sequence of conforma-
tions adopted by a protein in a particular folding path. Just as an algebraic
topologist captures the essence of the connectivity of a continuous space in a
few discrete invariants (the homology groups), we aim to capture the signifi-
cant conformational changes during motion through a discrete representation
of proximities that form and break.
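
As a minimal sketch of the 'proximity event' idea, assume each conformation has already been reduced to a set of edges (i,j) over backbone indices. The events between two consecutive frames are then the set differences of their edge sets. The function name `proximity_events` and the sample edges are hypothetical, and the sketch ignores the smoothing described later in the paper.

```python
def proximity_events(edges_before, edges_after):
    """Return (formed, destroyed) edge sets between two consecutive frames.

    Each edge is a pair of backbone indices; pairs are normalized so that
    (i, j) and (j, i) denote the same proximity.
    """
    before = {tuple(sorted(e)) for e in edges_before}
    after = {tuple(sorted(e)) for e in edges_after}
    formed = after - before      # proximities present only in the later frame
    destroyed = before - after   # proximities present only in the earlier frame
    return formed, destroyed


# Hypothetical edge sets for two frames of a trajectory.
frame0 = [(0, 3), (3, 6), (2, 10)]
frame1 = [(0, 3), (3, 6), (6, 9)]
formed, destroyed = proximity_events(frame0, frame1)
print("formed:", sorted(formed), "destroyed:", sorted(destroyed))
```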
We use geometric spanners to accomplish this goal. Starting from an
abstract graph with weights on its edges, a spanner is a sparse subgraph (in
the sense of having a number of edges roughly proportional to the number of
vertices), such that all edges in the full graph can be well approximated by
paths in the spanner (in the sense that the sum of the weights of edges of the
path in the spanner is very close to the weight of the original graph edge). In
the geometric setting the vertices in the original graph are points each pair
of which is connected by an edge with weight equal to the Euclidean distance
between the corresponding pair of points. The quality of the approximation
can be controlled by varying the number of edges in the spanner.
Note that spanners are at once generalizations of contact maps as well
as compressions of distance matrices. One can think of a spanner as a mul-
tiresolution contact map that allows an approximate reconstruction of the full
distance matrix (and therefore the full 3-D structure as well).
We propose to use these combinatorial structures as a tool for capturing
the important proximities of a protein conformation and, in this paper, for
comparing and visualizing sequences of protein conformations from molecular
trajectories. Key properties of the spanner that facilitate these goals include:
• Spanners are proximity based — this parallels proteins where local in-
teractions determine the behavior.
• Spanners are discrete — they have a combinatorial structure whose de-
scription does not include any geometric coordinates.
• Spanners are controllable — we can produce descriptors that more loosely
or more tightly capture the shape of the protein, converging to distance
matrices as the approximation gets tighter.
• Spanners are uniform — there is only one type of combinatorial element,
namely an edge. This makes comparison, processing and display simpler.
• Spanners can be made smooth — small changes in the protein confor-
mation generally result in few large changes in the spanner, enabling
tracking of the spanner structure over time.
• Spanners are local — the combinatorial features, edges, are affected by a
small subset of the total point set. This means that changes in one part
of the protein do not generally affect the spanner edges in other parts.
As a result the edges can be assigned semantic meaning based on their
endpoints, rather than on larger regions of the protein.
We use our spanners to investigate the folding of the protein BBA5 
using simulation data produced by the Folding@Home project. The
spanners enable us to produce diagrams which show the formation (and some-
times dissolution) of secondary and tertiary structure during a whole folding
trajectory and allow us to segment these trajectories into logical parts. We
expect that the spanner approach will provide a valuable toolkit for the un-
derstanding and visualization of protein trajectories.
In the next sections we describe how we construct and smooth our spanner-
based representation and how we use it to visualize trajectories. Then we
discuss how we have used spanners to try to understand the folding of
BBA5. Finally we mention other promising applications of our spanner based
2 Representing Proteins Using Spanners
We first provide a more rigorous definition of a geometric spanner. Let P be a
set of points in $\mathbb{R}^3$, Euclidean three-space, and G be a Euclidean graph on P
(graph whose vertices are points from P and whose edge weights are Euclidean
distances between the endpoints of the edge). For a parameter s > 1, known
as the stretch factor, G is a spanner for P if, for all pairs of points i and j
in P with Euclidean coordinates $p_i$ and $p_j$, $\pi_G(i,j) \le s\,\|p_i p_j\|$, where $\pi_G(i,j)$
denotes the shortest path distance between i and j in the graph G. Thus,
the spanner represents the quadratic number of interpoint distances in P by
the much sparser set of edges in G. There is a vast literature on spanners
that we will not attempt to review here in any detail; many different spanner
constructions are possible. It has been shown that for s arbitrarily close to 1,
there exist spanners whose number of edges is proportional to the size of P.
The reader is referred to a number of survey papers for background material
and additional references [1, 6].
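
To make the definition concrete, the following sketch measures the stretch actually achieved by a given Euclidean graph: the largest ratio of shortest-path distance to Euclidean distance over all point pairs. The graph is a spanner with stretch factor s exactly when this value is at most s. The function names and the unit-square example are illustrative assumptions, not part of the original text, and the points are assumed to be distinct.

```python
import heapq
import math


def shortest_paths(adj, source):
    """Dijkstra: shortest-path distance from source to every vertex."""
    dist = [math.inf] * len(adj)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def stretch(points, edges):
    """Largest ratio pi_G(i,j) / ||p_i p_j|| over all pairs (inf if disconnected)."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i, j in edges:
        w = math.dist(points[i], points[j])  # edge weight = Euclidean distance
        adj[i].append((j, w))
        adj[j].append((i, w))
    worst = 1.0
    for i in range(n):
        d = shortest_paths(adj, i)
        for j in range(i + 1, n):
            worst = max(worst, d[j] / math.dist(points[i], points[j]))
    return worst


# Hypothetical example: corners of a unit square.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(stretch(square, [(0, 1), (1, 2), (2, 3)]))          # three sides only
print(stretch(square, [(0, 1), (1, 2), (2, 3), (0, 3)]))  # all four sides
```

With only three sides, corners 0 and 3 are at Euclidean distance 1 but graph distance 3, so the graph is a 3-spanner at best. Adding the fourth side leaves the diagonals as the worst pairs.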
For simplicity, we only use the backbone atoms of the protein. This allows
us to meaningfully identify each atom by its index along the backbone, so an
edge i,j connects the ith and jth atoms in the backbone. We can identify the
edge i,j with the point (i,j) where i < j. We define the distance between two
edges $i_0,j_0$ and $i_1,j_1$ as the $L_1$ distance between the points $(i_0,j_0)$ and $(i_1,j_1)$,
namely $|i_0 - i_1| + |j_0 - j_1|$. We will write it $d(i_0,j_0,i_1,j_1)$. Two edges are close
if there is a small $L_1$ distance between the corresponding points. The
length (as opposed to weight) of an edge, l(i,j), is defined as j − i. Throughout
the section s will designate the stretch factor. An s-spanner is a spanner with
stretch factor s.
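
The two index-based quantities just defined can be written down directly. This is a small sketch; the function names `edge_distance` and `edge_length` and the sample edges are chosen here for illustration only.

```python
def edge_distance(e0, e1):
    """L1 distance d(i0, j0, i1, j1) between two edges viewed as points (i, j)."""
    (i0, j0), (i1, j1) = e0, e1
    return abs(i0 - i1) + abs(j0 - j1)


def edge_length(e):
    """Length l(i, j) = j - i: separation along the backbone, not Euclidean weight."""
    i, j = e
    return j - i


# Hypothetical edges between backbone atoms, each with i < j.
print(edge_distance((3, 10), (5, 9)))  # |3 - 5| + |10 - 9|
print(edge_length((3, 10)))            # 10 - 3
```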
We use what is known in the literature as the ‘greedy’ spanner. Its computation
is conceptually very simple: starting with a graph G initially containing only the
points P, test each of the $\binom{|P|}{2}$ interpoint candidate edges for inclusion, ordered
from shortest to longest. For each candidate edge i,j, check if $s\,\|p_i p_j\| < \pi_G(i,j)$.
If so, add the edge i,j to G. We call this test the inclusion test.
The algorithm runs in $O(n^3)$ time, where n = |P|, due to the quadratic number of edges and
the worst case linear time required to evaluate the inclusion test.
This greedy spanner construction has been shown to have asymptotically
optimal complexity (number of edges) and weight (the sum of the lengths of the
edges) as well as good practical complexity and weight. Having low weight is
important in our context since we want the spanner to consist of as many short
edges as possible in order to capture local interactions. Euclidean spanners can
also be produced in $O(n \log^2 n)$ time with the same asymptotic edge count
and weight bounds, although we have not implemented such methods.
If implemented naively, performing the inclusion test for long edges dominates
the running time as it requires a nearly linear time graph search for each
of these edges. However, such long edges are extremely unlikely to be in a
spanner of a packed protein. If we maintain an upper bound on the graph
distance, $d_G(i,j) \ge \pi_G(i,j)$, between each pair of points i,j, then we can quickly
eliminate any candidate edge for which $s\,\|p_i p_j\| > d_G(i,j)$. We can similarly
prune many of the paths while searching.
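
The greedy construction and its inclusion test can be sketched as follows. This is a simplified illustration, not the implementation used for the results in this paper. It prunes the graph search with a cutoff at $s\,\|p_i p_j\|$, which is one simple way to discard paths that cannot affect the test, and it omits the cached upper bounds $d_G(i,j)$. The function names and the example points are assumptions made for the sketch.

```python
import heapq
import math
from itertools import combinations


def graph_distance(adj, src, dst, cutoff):
    """Dijkstra from src to dst, ignoring paths longer than cutoff.

    Returns the shortest-path length if it is at most cutoff, else inf.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd <= cutoff and nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf


def greedy_spanner(points, s):
    """Greedy s-spanner: consider candidate edges from shortest to longest."""
    n = len(points)
    adj = [[] for _ in range(n)]
    edges = []
    candidates = sorted(
        combinations(range(n), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    for i, j in candidates:
        w = math.dist(points[i], points[j])
        # Inclusion test: add (i, j) only if no existing path is short enough.
        if s * w < graph_distance(adj, i, j, cutoff=s * w):
            adj[i].append((j, w))
            adj[j].append((i, w))
            edges.append((i, j))
    return edges


# Hypothetical example: four collinear points; the chain already spans all pairs.
line = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(greedy_spanner(line, 2.0))
```

Because the search stops expanding any path longer than the cutoff, long candidate edges that are already well approximated are rejected without exploring the whole graph.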
These upper bounds can be maintained lazily as graph searches are per-
formed. To tighten the upper bounds and further accelerate the process, it is
advantageous to periodically compute the all-pairs shortest path distances in
the current spanner (an $O(n^2)$ process). In addition, we guide the search using
the Euclidean distance as a lower bound on the graph distance to bias our
search direction, as in the graph search algorithm A*. Using these heuristics,
the 2-spanner of an 800 atom backbone of a protein can be computed in about
The kinetic spanner proposed in  is a possible alternative. It can be
cheaply maintained as the underlying points move around. However, it is
non-canonical, making comparison between trajectories tricky, and it has more long
edges, to which it is hard to assign biological meaning.
2.2 Spanners of Proteins
Figure 1 shows spanners computed using different expansion factors for a single
protein and gives an estimate of the number of spanner edges per point for
typical proteins in their native state.
Figure 1: Example spanners and average edges per point for various expansion factors. We
mostly use 2-3 spanners for our computations and visualizations as spanners below 2 get
Secondary structure creates very well defined patterns in the spanner. If
each edge is visualized as a point (i,j), then α helices appear as a sequence of
points (i + kq,j + kq) where k is a counter variable and q is a stepsize which
depends on the expansion factor. For expansion factors between 2 and 3 the
step size is 3; the edges are just longer when the expansion factor is larger.
For an expansion factor of 1.5, the step size is still 3 but there are several edges
leaving from each of the points. β hairpins appear as series of points heading in
an orthogonal direction to helices, namely, (i+kq,j−kq). Here q is 2 for 2-spanners
and rises to 4 or 6 for 3-spanners (depending on how the hairpin twists). Both
patterns are shown in Figure 2.
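
The two patterns can be illustrated by generating the idealized point sequences and classifying the step between consecutive edges. This is a sketch under the assumption of perfectly regular patterns; real spanners are noisier. The function names, starting indices and labels are hypothetical.

```python
def helix_pattern(i, j, q, count):
    """Idealized alpha-helix pattern: points (i + k*q, j + k*q), k = 0..count-1."""
    return [(i + k * q, j + k * q) for k in range(count)]


def hairpin_pattern(i, j, q, count):
    """Idealized beta-hairpin pattern: points (i + k*q, j - k*q), k = 0..count-1."""
    return [(i + k * q, j - k * q) for k in range(count)]


def classify_step(e0, e1):
    """Label the direction of the step from edge e0 to edge e1 in the (i, j) plane."""
    di, dj = e1[0] - e0[0], e1[1] - e0[1]
    if di != 0 and di == dj:
        return "helix-like"    # both endpoints advance together
    if di != 0 and di == -dj:
        return "hairpin-like"  # endpoints move toward (or away from) each other
    return "other"


helix = helix_pattern(0, 4, q=3, count=3)
hairpin = hairpin_pattern(10, 20, q=2, count=3)
print(helix, classify_step(helix[0], helix[1]))
print(hairpin, classify_step(hairpin[0], hairpin[1]))
```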