A Spatial Indexing Approach for Protein Structure Modeling
Wendy Osborn
Department of Mathematics and Computer Science
University of Lethbridge
Lethbridge, Alberta
T1K 3M4 Canada
wendy.osborn@uleth.ca
Abstract
This paper explores the use of spatial indices for the modeling and retrieval of protein structures. With two existing spatial indices, a preliminary framework for protein structure modeling that uses a spatial index is proposed. It provides a novel technique for modeling. In addition, it provides additional flexibility with respect to modeling granularity and structure manipulation. It is expected that this modeling approach will lead to new ways of analyzing protein structures.
1. Introduction
Protein structure analysis is an exciting and challenging research area in bioinformatics [1]. Many repositories exist today that maintain three-dimensional protein structures and provide tools for search and retrieval. The first such repository is the Protein Data Bank (PDB) [4]. It remains a very popular source of information. The PDB currently maintains over 20,000 protein structures. The current representation of the three-dimensional structure in PDB is very inflexible and archaic [1]. Therefore, it is important to explore how three-dimensional protein structures are modeled for future analysis.
One promising approach is in the area of spatial databases [11]. Spatial indexing [5, 11] provides a technique for modeling and retrieval of data based on its location in multidimensional space. It also provides a framework that can be used in conjunction with existing protein analysis strategies.
The application of spatial indexing to protein structure modeling has not received significant attention. Srinivasa and Kumar [12] propose a platform for modeling three-dimensional structures in a database. They also present strategies for search and retrieval. One limitation of their work is that their use of spatial indexing is limited to a strategy for point data only. Yan, Yu and Han [13] propose a strategy for indexing a graph representation of a biomolecular structure. Although their approach is promising, it is applicable only to substructures.
This paper investigates the application of the approximation spatial index to the modeling of protein structures. Given the properties of existing approximation strategies, a preliminary framework for protein structure modeling is proposed. This approach works across all descriptions of a protein structure, and also with different granularities of a model. For example, one can model at the atomic level, at the substructure level, or anywhere in between.
The remainder of this paper proceeds as follows. Section 2 presents some required background information on protein structure modeling and spatial indexing. Section 3 presents an application of existing spatial indexing structures for the modeling and retrieval of protein structures. Section 4 concludes with future research directions that need to be considered for this modeling strategy.
2. Preliminaries
This section presents the background information that is required for this investigation. First, the hierarchical structure description, which is the common technique for describing proteins at multiple levels, is summarized with an example. Second, spatial indexing is summarized, with a focus on two existing strategies.
2.1. Hierarchical Structure Description
A protein structure can be described using a hierarchical method [1]. This layered approach is composed of four levels: primary, secondary, tertiary and quaternary. The primary level consists of the sequence of amino acids that make up the protein. The secondary level consists of the substructures that the protein is composed of. Two common protein substructures are the helix (α) and the beta sheet (β-sheet). The tertiary level consists of the final protein structure, which is composed of all substructures from the secondary level. Finally, the quaternary level contains a structure of multiple proteins, each from a different tertiary level.
21st International Conference on
Advanced Information Networking and Applications Workshops (AINAW'07)
0769528473/07 $20.00 © 2007
MSQPIFNDKQFQEALSRQWQRYGLNSAAEMTPRQWWLAVSEALAEMLRAQPFAKPV
ANQRHVNYISMEFLIGRLTGNNLLNLGWYQDVQDSLKAYDINLTDLLEEEIDPALGA
GGLGRLAACFLDSMATVGQSATGYGLNYQYGLFRQSFVDGKQVEAPDDWHRSNYP
WFRHNEALDVQVGIGGAVTKDGRWEPEFTITGQAWDLPVVGYRNGVAQPLRLWQAT
HAHPFDLTKFNDGDFLRAEQQGINAEKLTKVLYPNDNHTAGKKLRLMQQYFQCACS
VADILRRHHLAGRELHELADYEVIQLNDTHPTIAIPELLRVLIDEHQMSWDDAWAITS
KTFAYTNHTLMPEALERWDVKLVKGLLPRHMQIINEINTRFKTLVEKTWPGDEKVWA
....................
Figure 1. Hierarchical Description for 1AHP
Figure 1 depicts an example of a hierarchical description at the primary, secondary and tertiary levels for the protein 1AHP, which is retrieved from the Protein Data Bank [4] and viewed using RasMol [2]. The primary level contains a portion of the amino acid sequence for 1AHP. The secondary level contains the secondary structures, each of which is a helix or a β-sheet. The tertiary level contains the protein structure in its entirety.
2.2. Spatial Indexing
A spatial index (or spatial access method) [5, 11] provides access to data based on its location in multidimensional space. Data that is indexed based on location usually consists of points, lines and objects of arbitrary shape. In addition, data with non-spatial information can be stored along with spatial data. For example, a map consists of towns (points), roads (lines) and cities (arbitrarily-shaped objects). Each datum on a map can have non-spatial data associated with it, such as a city name or a road number.
A spatial index supports many types of searches [5]. One common search type is the region search. Given an object that represents a region of space (usually a rectangle), the goal is to find all data that overlap the region.
Many spatial indices are proposed in the literature. A comprehensive survey is available in [5]. One particular class of spatial indices that is of interest for this proposed framework is approximation strategies. A general overview of an approximation spatial index is given next. Following this, two different approaches that adopt the approximation strategy, the R-tree [7] and the 2DR-tree [10, 9], are summarized.
An approximation spatial index maintains a hierarchy of approximations for objects and the subregions containing one or more objects. Most approximation strategies are based on the B+-tree [3], so they are height balanced (that is, every path from the root node to a leaf node is the same length).
An approximation for both objects and subregions is usually represented in the form of a minimum bounding rectangle (MBR). A minimum bounding rectangle defines the extent of an object along each dimension in space.
The hierarchy is maintained using nodes. Each node can contain a minimum of m records and a maximum of M records, where M is the total number of records allowed in the node, and m is a user-defined value. Usually, m = M/2. The exception is the root node, which can contain a minimum of two records. Each record maintains:
(MBR(i,j), ptr(i,j))
where MBR(i,j) is an approximation and ptr(i,j) is a pointer. In a leaf node, MBR(i,j) approximates an object and ptr(i,j) references the actual object on secondary storage. In a non-leaf node, MBR(i,j) encloses all approximations in the subtree referenced by ptr(i,j).
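As an illustration of the definitions above, the following minimal Python sketch represents an MBR by its lower and upper extent along each dimension. The class name MBR and the methods of_points, overlaps and union are hypothetical names chosen for this sketch; they are not taken from [7] or [10].

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MBR:
    """Minimum bounding rectangle: the extent of an object along each dimension."""
    lo: Tuple[float, ...]  # smallest coordinate in each dimension
    hi: Tuple[float, ...]  # largest coordinate in each dimension

    @staticmethod
    def of_points(points: Sequence[Sequence[float]]) -> "MBR":
        """Approximate an object given as a set of points."""
        dims = range(len(points[0]))
        return MBR(tuple(min(p[d] for p in points) for d in dims),
                   tuple(max(p[d] for p in points) for d in dims))

    def overlaps(self, other: "MBR") -> bool:
        """True when the two rectangles intersect in every dimension."""
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

    def union(self, other: "MBR") -> "MBR":
        """Smallest rectangle enclosing both, as stored in a non-leaf record."""
        return MBR(tuple(map(min, self.lo, other.lo)),
                   tuple(map(max, self.hi, other.hi)))


# A record (MBR(i,j), ptr(i,j)) can then be held as a plain tuple.
record = (MBR.of_points([(0, 0), (2, 3)]), "object-on-disk")
```

The same representation works unchanged in three dimensions, which is the case of interest for protein coordinates.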
2.3. R-tree
The R-tree [7] is the first approximation spatial index proposed in the literature. As with the B+-tree, the R-tree uses linear nodes to organize approximations into a hierarchy. An example is described first, followed by a brief description of the region search, insertion and deletion strategies.
Figure 2. The R-tree
Figure 2 depicts an example of an R-tree that is indexing a spatial data set (the data set comes from [5]). The approximations for actual objects are represented by points p5 and p9, and rectangles m1 to m5 and m7 to m9. In addition, there are three approximations that represent subregions: one that contains m1, m2, m4, and p9; one that contains m3, m5, m8 and p5; and one that contains m7 and m9. The leaf level contains the object approximations, while the non-leaf level (also the root level) contains the subregion approximations.
The R-tree region search begins at the root by examining all approximations to find the ones that overlap the query region. For all qualifying approximations, the search continues in the corresponding subtrees until it reaches zero or more leaf nodes. The approximations for all qualifying leaf nodes are retrieved and tested for overlap with the query region.
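The region search can be sketched as a short recursive procedure. The sketch below is an illustration only: rectangles are (xmin, ymin, xmax, ymax) tuples, the Node class is a hypothetical in-memory stand-in for a disk page, and the coordinates are invented (only the labels m1, m2, m7 and m9 are borrowed from Figure 2).

```python
def overlaps(a, b):
    """True when rectangles a and b, given as (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf        # True for leaf nodes
        self.entries = entries  # list of (mbr, ptr) records


def region_search(node, query):
    """Return the objects whose approximations overlap the query region."""
    hits = []
    for mbr, ptr in node.entries:
        if not overlaps(mbr, query):
            continue  # prune this record and everything below it
        if node.leaf:
            hits.append(ptr)                          # ptr references an object
        else:
            hits.extend(region_search(ptr, query))    # ptr references a subtree
    return hits


# Invented coordinates for a two-level tree.
leaf1 = Node(True, [((0, 0, 2, 2), "m1"), ((3, 0, 5, 2), "m2")])
leaf2 = Node(True, [((10, 10, 12, 12), "m7"), ((13, 11, 15, 14), "m9")])
root = Node(False, [((0, 0, 5, 2), leaf1), ((10, 10, 15, 14), leaf2)])

print(region_search(root, (1, 1, 4, 3)))  # ['m1', 'm2']
```

Because subregion approximations may overlap, more than one subtree can qualify for the same query, which is why the search may visit several leaf nodes.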
The insertion algorithm first identifies the appropriate leaf node for storing the new entry. The insertion path contains minimum bounding rectangles that require the least enlargement to include the new object. After inserting the new approximation into the chosen leaf node, the approximations along the insertion path are updated.
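The least-enlargement criterion can be illustrated for a single node as follows. This is a minimal sketch with hypothetical function names; breaking ties in favour of the smaller rectangle follows the description in [7] and is not stated in the text above.

```python
def area(m):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (m[2] - m[0]) * (m[3] - m[1])


def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr, new):
    """Extra area mbr would need in order to include new."""
    return area(union(mbr, new)) - area(mbr)


def choose_entry(mbrs, new):
    """Index of the record needing the least enlargement (ties: smaller area)."""
    return min(range(len(mbrs)),
               key=lambda i: (enlargement(mbrs[i], new), area(mbrs[i])))


subregions = [(0, 0, 5, 2), (10, 10, 15, 14)]
print(choose_entry(subregions, (4, 1, 6, 3)))  # 0
```

Repeating this choice at each level from the root down yields the insertion path; the rectangles on that path are then replaced by their union with the new approximation.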
If overflow of the leaf node occurs, it is handled by splitting the node into two new nodes. The node is split so that the new nodes cover the smallest amount of area and so that both nodes need not be checked for the same query. Three splitting strategies are proposed: exhaustive, quadratic, and linear. The exhaustive approach finds all candidate splits and chooses the best one. The quadratic and linear approaches identify the two objects that are the farthest apart and cluster the remaining objects into two nodes. The quadratic and linear approaches differ in how the two objects are identified. If a split causes an overflow in the parent node, the split is propagated up the tree as far as necessary. If it propagates to the root, a new root node is created.
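A simplified sketch of a quadratic-style split is given below. It assumes the seed choice described in [7] (the pair of entries that would waste the most area if kept together) and then assigns each remaining entry to the group needing the least enlargement. The function names are hypothetical, and the sketch deliberately omits the minimum fill requirement m, so it should be read as an illustration of the idea rather than as the full algorithm.

```python
from itertools import combinations


def area(m):
    return (m[2] - m[0]) * (m[3] - m[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def pick_seeds(mbrs):
    """Pair of entries whose combined rectangle wastes the most area."""
    def waste(pair):
        i, j = pair
        return area(union(mbrs[i], mbrs[j])) - area(mbrs[i]) - area(mbrs[j])
    return max(combinations(range(len(mbrs)), 2), key=waste)


def split(mbrs):
    """Split entries into two groups; returns (groups, covering rectangles)."""
    i, j = pick_seeds(mbrs)
    groups = ([i], [j])
    covers = [mbrs[i], mbrs[j]]
    for k in range(len(mbrs)):
        if k in (i, j):
            continue
        # Assign entry k to the group whose cover grows the least.
        grow = [area(union(covers[g], mbrs[k])) - area(covers[g]) for g in (0, 1)]
        g = 0 if grow[0] <= grow[1] else 1
        groups[g].append(k)
        covers[g] = union(covers[g], mbrs[k])
    return groups, covers


overfull = [(0, 0, 1, 1), (1, 0, 2, 1), (10, 10, 11, 11), (11, 10, 12, 11)]
groups, covers = split(overfull)
```

For the four invented rectangles above, the two left-hand entries end up in one node and the two right-hand entries in the other, so the two covering rectangles do not overlap.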
The R-tree deletion strategy removes the approximation of the object to be deleted and adjusts the minimum bounding rectangles along the deletion path. Underflow is handled using a forced reinsertion strategy.
2.4. 2DR-tree
The 2DR-tree [10, 9] extends the one-dimensional structure of the R-tree [7] to two dimensions. Its structure and organization are summarized first, followed by an example and a description of the search, insertion and deletion strategies.
The nodes used in the 2DR-tree to organize approximations are two-dimensional in nature. The M locations in a node are organized in the following manner. For each node N, X is the number of locations along the x-axis of the node, and Y is the number of locations along the y-axis.
An approximation is stored in an appropriate location with respect to all other approximations in the node. Using two-dimensional nodes allows spatial relationships between objects to be preserved. The spatial relationships that can be maintained in the 2DR-tree are north, northeast, east, southeast, south, southwest, west and northwest. A spatial relationship is defined between two objects using the centroids of their approximations. For example, approximation MBR 1 is northeast of MBR 2 if the centroid of MBR 1 is northeast of the centroid of MBR 2.
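The centroid-based definition can be made concrete with a few lines of Python. This sketch assumes that x increases eastward and y increases northward; the function names, and the label "same" for coincident centroids, are conventions of the sketch rather than part of [10].

```python
def centroid(m):
    """Centre point of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return ((m[0] + m[2]) / 2, (m[1] + m[3]) / 2)


def direction(a, b):
    """Compass direction of a relative to b, based on their centroids."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    ns = "north" if ay > by else "south" if ay < by else ""
    ew = "east" if ax > bx else "west" if ax < bx else ""
    return (ns + ew) or "same"


mbr1 = (4, 4, 6, 6)
mbr2 = (0, 0, 2, 2)
print(direction(mbr1, mbr2))  # northeast
```

Only the centroids matter here: two rectangles may overlap and still stand in a well-defined relationship.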
Organization of approximations is dictated by the following validity rules [10]. For each approximation MBR:
• All approximations located directly north of MBR in the node have a centroid that is north, northwest, or west of the centroid for MBR in space,
• All approximations located northeast of MBR in the node have a centroid that is northeast of the centroid for MBR in space, and
• All approximations located directly east of MBR in the node have a centroid that is east, southeast or south of the centroid for MBR in space.
Figure 3. The 2DR-tree
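The three validity rules can be checked mechanically. The sketch below assumes a hypothetical node layout, a dictionary that maps a (column, row) location to a rectangle, with columns growing eastward and rows growing northward; this layout and all names are assumptions of the sketch and are not taken from [10].

```python
def centroid(m):
    return ((m[0] + m[2]) / 2, (m[1] + m[3]) / 2)


def sign(v):
    return (v > 0) - (v < 0)


def offset(a, b):
    """(east-west, north-south) signs of a's centroid relative to b's."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    return (sign(ax - bx), sign(ay - by))


# Centroid offsets permitted for each placement inside the node.
NORTH_OK = {(0, 1), (-1, 1), (-1, 0)}   # north, northwest, west
NORTHEAST_OK = {(1, 1)}                 # northeast
EAST_OK = {(1, 0), (1, -1), (0, -1)}    # east, southeast, south


def is_valid(node):
    """Check every pair of occupied locations against the three rules."""
    for (xa, ya), a in node.items():
        for (xb, yb), b in node.items():
            rel = offset(b, a)
            if xb == xa and yb > ya and rel not in NORTH_OK:
                return False  # b is directly north of a in the node
            if xb > xa and yb > ya and rel not in NORTHEAST_OK:
                return False  # b is northeast of a in the node
            if yb == ya and xb > xa and rel not in EAST_OK:
                return False  # b is directly east of a in the node
    return True


good = {(0, 0): (0, 0, 2, 2), (1, 1): (4, 4, 6, 6)}
bad = {(0, 0): (4, 4, 6, 6), (1, 1): (0, 0, 2, 2)}
print(is_valid(good), is_valid(bad))  # True False
```

An insertion routine could use such a check to reject candidate locations, although the strategies in [9, 10] may locate a valid position more directly.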
Figure 3 shows an example of a 2DR-tree that is indexing the same spatial data set from [5]. As with Figure 2, the leaf level contains the approximations for all points and objects, while the subregions are maintained in the non-leaf level. The difference here is how the spatial relationships are maintained. If we look at the subregion enclosing m4 and m3, we observe that the centroid of m3 is southeast of the centroid for m4, so they are placed in appropriate locations with respect to each other in the leaf node. Similarly, the spatial relationships are also maintained between subregions in the non-leaf level.
Since approximations are organized, a binary search
strategy [9] can be applied in the following manner. The
search region is applied recursively to half of the objects in
the node until all overlapping approximations are found.
The insertion strategy employs a greedy search for locating an insertion path that requires a minimal (if not the absolute minimal) area increase to insert a new approximation. At the leaf level, a location is found for the approximation that obeys all validity rules. Any overflow that occurs is handled using one of the splitting strategies proposed in [9, 10]. Deletion simply requires the removal of the object and the updating of the approximations along the insertion path.
3. Structure Modeling Using Spatial Indices
This section presents the proposed framework for protein structure modeling. The framework can utilize either the R-tree or the 2DR-tree for modeling a protein structure. The hierarchical model described earlier consists of four levels: primary, secondary, tertiary, and quaternary. We discuss how a spatial index can be applied to each of the secondary, tertiary and quaternary levels. In addition, we discuss how a spatial index can be applied in different ways within the same level using different amounts of detail.
3.1. Secondary Level Modeling
Protein secondary structures can be modeled in different ways. One approach is to group helices and β-sheets that are adjacent to each other into substructures. Each substructure can be represented with an approximation. This approximation in turn can be placed in a spatial index structure. These substructures can be formed by inserting helices and β-sheets by location into the spatial index structure. This can potentially lead to the discovery of new substructures that serve an important purpose in the function of a protein. An entire substructure can be discovered independently, and can be inserted as a single unit into the spatial index structure.
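To illustrate how a substructure could be given an approximation, the following sketch computes a three-dimensional bounding box for each secondary structure element from its atom coordinates and then encloses adjacent elements in one box. The coordinates, the element names helix_a and sheet_b, and the function names are invented for this example and do not come from 1AHP or any other PDB entry.

```python
def mbr3d(points):
    """Bounding box (xmin, ymin, zmin, xmax, ymax, zmax) of 3-D atom coordinates."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs), max(xs), max(ys), max(zs))


def enclose(boxes):
    """Smallest box enclosing all given boxes, as stored one level up in the index."""
    lows = tuple(min(b[d] for b in boxes) for d in range(3))
    highs = tuple(max(b[d] for b in boxes) for d in range(3, 6))
    return lows + highs


# Invented atom coordinates for two adjacent secondary structure elements.
helix_a = [(1.0, 2.0, 0.5), (2.5, 3.0, 1.5), (3.0, 1.0, 2.0)]
sheet_b = [(3.5, 2.0, 1.0), (5.0, 4.0, 2.5)]

elements = {"helix_a": mbr3d(helix_a), "sheet_b": mbr3d(sheet_b)}
substructure = enclose(list(elements.values()))
print(substructure)  # (1.0, 1.0, 0.5, 5.0, 4.0, 2.5)
```

The element boxes would sit in a leaf node and the enclosing box in its parent, which mirrors the grouping of adjacent helices and β-sheets described above.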
Figure 4 depicts an example of substructure modeling using the R-tree, while Figure 5 depicts an example of substructure modeling using the 2DR-tree. In both cases, each substructure is enclosed with an approximation that can be accessed from the index structure. The difference between the R-tree and 2DR-tree approaches is the following. Although the R-tree is a one-dimensional structure, it can be applied to data residing in any dimension. However, in its current form, the 2DR-tree represents a two-dimensional topological modeling of the protein structure, since it is designed to work in two-dimensional space.
A protein structure can also be indexed at the atomic level, where atoms and bonds are grouped into leaf nodes. If a 2DR-tree is applied at the atomic level, the bonds between atoms can be implicitly represented by maintaining the spatial relationships between the atoms, therefore eliminating the need to store them. A protein structure can also be indexed at the helix/β-sheet level. In this configuration, leaf nodes can contain helices, β-sheets, and other known and unknown secondary structures.
Figure 4. R-tree Substructure Modeling
Figure 5. 2DR-tree Substructure Modeling
In addition, multiple protein secondary structures can be modeled within the same spatial index. This allows for protein structure homology comparisons to be carried out with ease, since potentially related substructures are stored together. Alternatively, secondary structures can be modeled with different spatial indices (but of the same index type) but still compared for structural homology.
3.2. Tertiary Modeling
Similarly to secondary modeling, the tertiary representation of a protein structure can be modeled at the substructure level and the atomic level. Also, it is possible to apply the same spatial index to multiple tertiary structures for the purposes of structure homology comparison.
3.3. Quaternary Modeling
One more option is added to the existing levels of modeling resolution given above. The quaternary description of multiple tertiary structures can be modeled with one spatial index structure. This will provide support for modeling across proteins. Also, if one is interested, it is possible to model multiple quaternary descriptions with one index to provide further opportunities for the analysis of protein structures.
3.4. Searching and Analysis Strategies
One can use existing spatial searching strategies, such as the region search, or the binary or greedy search techniques of the 2DR-tree [9, 10], on protein structures. In addition, one can apply other search and analysis techniques as the basis for data retrieval. For example, strategies for structure homology comparison such as VAST [8, 6] can be incorporated into a spatial index model. Structure homology comparisons can be applied not only between existing structures, but also for the prediction of a newly-discovered protein structure based on existing structures.
3.5. Inclusion of Non-Spatial Data
Descriptive non-spatial data can be stored alongside an approximation. This is useful for storing data from the primary level of the description hierarchy. A corresponding amino acid subsequence can be stored with the approximation representing its substructure or secondary structure. Information on the number of secondary structures and the α/β ratio can be stored with a substructure or an entire protein. Also, one can simply provide a reference to the appropriate lower level of the hierarchy.
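One way to picture a record that carries non-spatial data is sketched below. The class name Record and its field names are hypothetical, and the α/β ratio is computed here as the number of helices divided by the number of β-sheets, which is an assumption of this sketch. The subsequence shown is the first ten residues of the sequence in Figure 1; the bounding box and counts are invented.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Record:
    """An index record (MBR, ptr) extended with descriptive non-spatial data."""
    mbr: Tuple[float, ...]   # approximation of the substructure
    ptr: Any                 # reference to the object or to a lower level
    subsequence: str = ""    # amino acid subsequence from the primary level
    n_helices: int = 0       # number of helices enclosed
    n_sheets: int = 0        # number of beta sheets enclosed

    def alpha_beta_ratio(self) -> Optional[float]:
        """Helices per beta sheet, or None when there are no sheets."""
        return self.n_helices / self.n_sheets if self.n_sheets else None


rec = Record(mbr=(0, 0, 0, 1, 1, 1), ptr="substructure-1",
             subsequence="MSQPIFNDKQ", n_helices=3, n_sheets=2)
print(rec.alpha_beta_ratio())  # 1.5
```

Because the extra fields travel with the approximation, a region search can return the sequence and summary statistics together with the spatial match.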
4. Conclusions
This paper investigates the feasibility of applying a spatial index to the problem of modeling protein structures. A preliminary model is proposed, which has the advantage of flexibility in representation and manipulation of structures. Also, additional functionality to facilitate analysis can easily be incorporated. It is expected that this approach will lead to new approaches to the study of protein structures.
Research will continue in the following directions. First, the 2DR-tree can be extended to three dimensions (i.e. a 3DR-tree). The 2DR-tree has specific properties, such as its support for maintaining spatial relationships between objects, which make it a desirable option for protein structure modeling. However, in its current form, the 2DR-tree indexes only a topological view of a structure. A similar three-dimensional structure will alleviate this problem.
Second, the coordination between a spatial index and strategies for protein structure homology comparison, such as VAST [8, 6], will be investigated further. This will lead to new strategies for comparison and prediction of new structures based on existing structures.
Third, other strategies for searching, insertion and deletion will be explored. Although most work with protein structures requires access only [1], it is believed that by providing a framework to manipulate proteins by adding and removing features such as secondary structure elements, atoms and substructures, future work will lead to exploration of new comparison and prediction techniques.
Finally, strategies for visualizing a protein structure that is modeled using a spatial index will be explored. The images presented in this paper display the spatial index. Currently, the spatial index provides a modeling technique that resides in the lower levels of a protein structure repository, and therefore is not displayed when viewing a protein model. However, visualization of the spatial index structure with the protein structure, including a rotatable view, is also an important direction of future research.
Acknowledgments
The author wishes to thank the reviewers for their helpful comments and suggestions, in particular the suggestion concerning model visualization.
References
[1] A. Baxevanis and B. Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Wiley & Sons, Hoboken, New Jersey, 2005.
[2] H. J. Bernstein. Openrasmol: Molecular graphics visualisation tool. Website, last visited January 2007. http://www.openrasmol.org/.
[3] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121–137, 1979.
[4] R. C. for Structural Bioinformatics. Protein data bank. Website, last visited January 2007. http://www.rscb.org/pdf.
[5] V. Gaede and O. Guenther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, 1998.
[6] J. Gibrat, T. Madej, and S. Bryant. Surprising similarities in structure comparison. Current Opinion in Structural Biology, 6(3):377–85, 1996.
[7] A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proceedings of SIGMOD’84, pages 47–57, 1984.
[8] T. Madej, J. Gibrat, and S. Bryant. Threading a database of protein cores. Proteins, 23(3):356–69, 1995.
[9] W. Osborn and K. Barker. Searching through spatial relationships using the 2DR-tree. In Proceedings of the 10th International Conference on Internet and Multimedia Systems and Applications (IMSA 2006), pages 71–76, 2006.
[10] W. K. Osborn. The 2DR-tree: a Two-dimensional Spatial Access Method. PhD thesis, University of Calgary, June 2005.
[11] S. Shekhar and S. Chawla. Spatial Databases: A Tour. Prentice-Hall, New Jersey, 2003.
[12] S. Srinivasa and S. Kumar. A platform based on the multidimensional data model for analysis of biomolecular structures. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[13] X. Yan, P. Yu, and J. Han. Graph indexing based on discriminative frequent structure analysis. ACM Transactions on Database Systems, 30(4):960–993, 2005.