Indexing and Retrieval for Genomic Databases
Hugh E. Williams Justin Zobel
Department of Computer Science, RMIT University
GPO Box 2476V, Melbourne 3001, Australia
{hugh,jz}@cs.rmit.edu.au
Abstract
Genomic sequence databases are widely used by molecular biologists for homology
searching. Amino-acid and nucleotide databases are increasing in size exponentially,
and mean sequence lengths are also increasing. In searching such databases, it is
desirable to use heuristics to perform computationally intensive local alignments on
selected sequences only and to reduce the costs of the alignments that are attempted.
We present an index-based approach for both selecting sequences that display broad
similarity to a query and for fast local alignment. We show experimentally that the
indexed approach results in significant savings in computationally intensive local
alignments, and that index-based searching is as accurate as existing exhaustive
search schemes.
Keywords
Homology search, local alignment, indexing, genomic and scientific databases.
1 Introduction
Genomic databases assist molecular biologists in understanding the biochemical
function, chemical structure, and evolutionary history of organisms. Popular systems
for searching genomic databases match queries to answers by comparing a query to
each of the sequences in the database. Efficiency in such exhaustive systems is
crucial, since some servers process over 40,000 queries per day [26] and resolution
of each query requires comparison to over one gigabyte of genomic sequence data.
While exhaustive systems are practical at present, they are becoming prohibitively
expensive: genomic databases are now doubling in size every 15 or 16 months, and
user numbers and query rates are growing.
In this paper we investigate and propose new techniques for efficient, fast searching
of genomic databases. In considering new approaches, we have developed several
criteria for a successful implementation. First, a new system should support the
same query types as existing exhaustive systems and be able to search the same
databases. Second, the system must be fast and, importantly, scalable on
general-purpose hardware in the presence of increasing user numbers, query rates, and
database size. Third, the system should be comparable in accuracy to existing
popular systems in identifying answers. Last, the system should have reasonable
requirements for memory and disk space. We have previously described an initial
implementation that addresses the second and last requirements [50,47]. We propose
in this paper new indexing and search techniques that successfully address all
of the requirements.
Williams and Zobel: Indexing and Retrieval for Genomic Databases 2
GGGAATTCATGAACTCCGACTCCGAATGTCCATTGTCCCACGACGGTTACTGTTTGCAC
GACGGTGTTTGTATGTACATCGAAGCTTTGGACAAGTACGCTTGTAACTGTGTTGTTGG
TTACATCGGTGAAAGATGTCAATACAGAGACTTGAAGTGGTGGGAATTGAGATGATAAG
AATTCC
Figure 1: Nucleotide structure of a synthesised human epidermal growth factor gene.
Our indexing and retrieval techniques for querying genomic databases are embodied
in a full-scale prototype retrieval system, CAFE [47,48,50,51]. The disadvantages
of indexing genomic databases are the need for time to build an index and
for space to store the index on disk. The advantage of indexing is that searching
is more scalable and much faster than exhaustive approaches. CAFE is based on
techniques used in text retrieval and in approximate string matching for databases
of names. The principal features of CAFE are the incorporation of new, efficient data
structures for query resolution and the demonstration that, despite earlier negative
results, indexing can be successfully applied to genomic databases.
We show experimentally that query evaluation using our new techniques has
the requisite properties of speed, scalability, accuracy, and efficiency. With careful
selection of parameters, CAFE can be used for the same search tasks as the popular
FASTA [25, 30, 33] and BLAST [2, 6] search systems; FASTA has been shown to be
the most sensitive rapid exhaustive search system, while BLAST is faster and more
popular, but less accurate. In a direct comparison of CAFE to FASTA, we have found
that CAFE is 50 to 100 times faster in searching the GenBank DNA database and
has comparable accuracy. Moreover, the sensitive CAFE index is practical in size and
the system remains efficient with increasing database size. BLAST is much faster
than FASTA but is still eight times slower than CAFE on the largest collection tested.
2 Background
Understanding the relationship of a query DNA or protein genomic sequence to well-
understood sequences in a genomic database allows molecular biologists to assign
function to poorly understood sequences. Indeed, one of the goals of sequence
analysis is to determine sequence function, structure, and role from inspection and
querying with a character string representation, or linear sequence, of a genomic
sequence. In this section, we present a background of molecular biology, genomic
databases, and techniques for practical linear sequence comparison.
2.1 Genomics
Genetic material, or DNA, stores complete instructions for all the cellular functions
of an organism. The primary structure of DNA is represented as strings, or linear
sequences, of a four-character alphabet, known as the nucleotide bases, represented
by A, C, G, and T; a typical example is shown in Figure 1 [44]. In addition to the
nucleotide bases, there are eleven standard wildcard characters used to represent
different possible substitutions in a nucleotide sequence [24]. For example, B is
used to represent the permitted substitution of either C, G, or T, but not A, into
a sequence. The most common wildcard is N, which represents any base; some
sequences contain thousands of occurrences of N, representing poorly understood
regions of arbitrary length.
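The wildcard scheme can be made concrete with a small lookup table. The sketch below assumes that the eleven wildcards are the standard IUPAC ambiguity codes; the names `IUPAC_WILDCARDS` and `permits` are illustrative and do not come from any system described in this paper.

```python
# Assumption: the eleven wildcards are the IUPAC nucleotide ambiguity codes.
# Each wildcard maps to the set of bases it may stand for.
IUPAC_WILDCARDS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = {"A", "C", "G", "T"}


def permits(symbol: str, base: str) -> bool:
    """Return True if `symbol` (a base or a wildcard) can stand for `base`."""
    if symbol in BASES:
        return symbol == base
    return base in IUPAC_WILDCARDS[symbol]


# B permits C, G, or T but not A; N permits any base.
print(permits("B", "A"), permits("B", "T"), permits("N", "G"))
```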
Many nucleotide sequences are precursors to the synthesis of proteins. Such coding
sequences are used to transcribe RNA molecules, which are structurally similar
to DNA molecules; the significant difference is that, in RNA, Uracil (U) replaces
the DNA base T. One of the RNA molecules created with a coding sequence, messenger
RNA (mRNA), is based on the DNA template. Three-base combinations
of the nucleotide bases from the mRNA, known as codons, are translated into amino-acids.
Amino-acids in turn can be combined to create proteins. A gene is a region in a
nucleotide sequence that codes a protein that performs a cellular function.
Interestingly, in genomes such as those of the higher eukaryotes only a few percent
of the DNA is coding, while the remainder is so-called "junk DNA" that is often
repetitive in structure; other species, such as many bacteria, have much higher gene
densities. Because coding regions are generally of more interest to molecular
biologists, most genomic databases contain disproportionately large numbers of coding
sequences.
MMNFFNFRCIHCRGNLHIAKNGLCSGCQKQIKSFPYCGHCGSELQYYAQHCGNCLKQEP
SWDKMVIIGHYIEPLSILIQRFKFQNQFWIDRTLARLLYLAVRDAKRTHQLKLPEAIIP
VPLYHFRQWRRGYNQADLLSQQLSRWLDIPNLNNIVKRVKHTYTQRGLSAKDRRQNLKN
AFSLAVSKNEFPYRRVALVFFVITTGSTLNEISKLLRKLGVEEIQVWGLARA
Figure 2: Gene product from H. influenzae, a completely sequenced bacterial
genome.
Related nucleotide sequences from different species can have varied structure,
where the distance in structure is proportional to the evolutionary distance of the
two species. These regions are homologous, that is, derive from a common ancestor
sequence, and identification of the existence of homology through sequence comparison
of these regions can shed light on the evolutionary history, biochemical function,
and chemical structure of these molecules.
There are twenty amino-acids, each of which is coded by between one and six
codons. Proteins are polypeptide chains that typically contain tens or hundreds
of amino-acids, while some consist of more than a thousand. A typical protein
sequence, in this case a gene product from part of the Haemophilus influenzae
bacterium [23], is shown in Figure 2.
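The range of one to six codons per amino-acid can be checked directly against the standard genetic code. The sketch below is illustrative only: it assumes the standard code (written over DNA bases in the conventional T, C, A, G ordering, with `*` marking stop codons), and the names `CODON_TABLE` and `codons_per_amino_acid` are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumption: the standard genetic code, listed with the first base varying
# slowest over the order T, C, A, G; '*' marks the three stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons_per_amino_acid() -> Counter:
    """Count how many codons code for each amino-acid (stop codons excluded)."""
    counts = Counter(CODON_TABLE.values())
    counts.pop("*", None)
    return counts


counts = codons_per_amino_acid()
# Twenty amino-acids, each coded by between one and six codons.
print(len(counts), min(counts.values()), max(counts.values()))
```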
Proteins have a complex structure dictated by the characteristics displayed by
individual amino-acids, and amino-acids can be grouped according to characteristics
including charge, hydrophobicity, and acidity. This classification of amino-acids
allows prediction of relationships between sequences that are not easily seen by
comparing nucleotide sequences. Moreover, because of the redundant mapping of
codons to amino-acids, amino-acid sequences are much richer in information content.
Because of this, in almost all cases, if an amino-acid sequence is available a molecular
biologist uses it as a query to a genomic database in preference to the corresponding
nucleotide sequence. However, occasionally nucleotide sequences are preferred as a
query [7] and, often, an amino-acid sequence is not available and querying with a
nucleotide sequence is the only possibility.
2.2 Genomic Databases
There are several public nucleotide sequence databases. Three of the larger
repositories are GenBank [11], the DNA Databank of Japan, and the European Molecular
Biology Laboratory database [36]. We use GenBank as the source of nucleotide test
data for our experiments; the three databases are cross-updated daily and the three
database structures are similar.
GenBank stores sequence data generated through the US human genome initiative,
which focuses not only on the human genome but also on model organisms such
as the bacterium E. coli, the fruit-fly D. melanogaster, the nematode C. elegans,
and the yeast S. cerevisiae [14]. The aim of the human genome initiative is to
determine the complete human sequence by 2003, with an intermediate goal of a
"working draft" of the genome by 2001 [15]. The largest GenBank database (release
108.0, 15 August 1998) used in our experiments contains around 1,776 million
nucleotide bases. Historically, the database has roughly doubled in size every 21
months since 1984; however, GenBank is now doubling in size every 15 or 16 months: in
August 1997 GenBank Release 102.0 contained 1,053 million bases, while the latest
release 110.0 from December 1998 has over 2,162 million bases. The average sequence
length is around 700 bases, with sequences ranging from a few bases to 300,000 bases
in length; several sequences are longer than 300,000 bases, but have been stored as
separate records according to GenBank guidelines. Data within GenBank is, in some
cases, duplicated through the submission of identical and, rather more frequently,
overlapping sequences. Additionally, there is a small but significant error rate, both
as a result of sequence determination errors and of data entry errors [1].
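The doubling period quoted above can be recovered from the two release sizes. The following sketch assumes purely exponential growth between August 1997 and December 1998 (taken here as 16 months); the function name `doubling_time_months` is illustrative.

```python
import math


def doubling_time_months(size_start: float, size_end: float, months: float) -> float:
    """Doubling time under the assumption of exponential growth."""
    return months * math.log(2) / math.log(size_end / size_start)


# Release 102.0 (August 1997): 1,053 million bases.
# Release 110.0 (December 1998): 2,162 million bases, 16 months later.
t = doubling_time_months(1_053e6, 2_162e6, 16)
print(f"estimated doubling time: {t:.1f} months")
```

Under these assumptions the estimate falls between 15 and 16 months, consistent with the figure given in the text.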
GenBank contains amino-acid translations for many coding nucleotide sequences;
however, several solely protein databases also exist. Protein databases are typically
well-managed and less redundant than nucleotide databases, commonly including
classification of sequences into related families and, in some cases, superfamilies of
families. Such databases include Swiss-PROT [8], which contains cross-references
and data from around twenty smaller databases that investigate specific organisms
and protein types. A typical specialist database is the Portable Mouse Genome
Database [53].
In our experiments, we evaluate the accuracy of homology search systems using
the Protein Identification Resource-International Protein Sequence Database, or
PIR [20], a database of well-classified amino-acid sequences. Sequences in PIR are
first classified by homeomorphic domains, that is, into families where the member
sequences exceed a high threshold of sequence similarity, thereby inferring homology.
Homologous families, with the same domains, in the same order, are then classified
further into superfamilies, where the similarity threshold for grouping families into
superfamilies is less stringent than that for originally grouping sequences [9].
Similarity scores, thresholds, and algorithms for determining similarity are discussed
later.
2.3 Practical Sequence Comparison
With the widespread availability of practical sequence comparison techniques,
molecular biologists have changed their approach to characterising sequences.
Fundamental to understanding the function of genomic sequences is finding homology
between two sequences. By comparing sequences and finding homology between
two sequences, one of which has known function, structure, origin, or product, the
biochemical role, evolutionary history, and chemical structure of the second unknown
sequence may be inferred. Homologous sequences always share common elements
of three-dimensional folding and secondary structure. Sometimes, however,
homologous sequences do not share common function.
The most common method for analysing an unknown sequence is large-scale
sequence comparison to characterised and annotated sequences in a large genomic
database. Sequence comparison techniques have aided in the discovery of many
useful homologous relationships between sequences. For example, recently, a gene
that suppresses tumour growth in humans was found to be related to enzymes in
the bacterium E. coli and in S. cerevisiae, a well-studied yeast genome.
To illustrate the use of sequence comparison, consider a simplistic example of
a mutation in the bacterium E. coli. This mutation, or change in the nucleotide
sequence, causes an individual to be unable to reproduce. The mutated sequence
can be compared to other sequences to try to identify homology. If homologous
sequences are found, and there has been additional work on a sequence from a
different species, the product, for example a hormone, may be identified. This
would allow the research on reproduction in E. coli to focus on that hormone.
Sequence comparison techniques measure statistical similarity of regions common
to two sequences and, where statistical similarity exceeds a confidence value,
homology is inferred. A common benchmark is that if more than 30% of two
amino-acid sequences are identical, then the sequences are most likely homologous [17].
Lack of statistical similarity does not imply non-homology; for example, two
sequences that do not share significant statistical similarity are homologous if both
are related to a third sequence.
Generally, homology between sequences is measured locally, as similarity of
regions, rather than measured globally as overall similarity of complete sequences. For
example, by comparing two nucleotide sequences whose overall primary structure
is dissimilar, local similarity measures may find homology between exons (coding
regions) that are separated by differently composed and varying length introns
(non-coding regions). Typically, only closely related amino-acid sequences are of the
same length and overall composition. Nucleotide sequences, which are sections of a much
longer strand and, therefore, have no notion of "ends", generally do not have overall
similarity.
Estimation of evolutionary distance requires a measurement of the number of
point mutations, or elementary changes, to transform one sequence into another.
Generally, evolutionary distance estimations use string comparison algorithms to
find the least number of mutations, that is, an optimal alignment, between two
sequences. There may be many such optimal alignments that are equally
evolutionarily plausible and, indeed, equally interesting. This model of using specialised
string comparison algorithms for genomic sequences has been shown to be an effective
model of the evolutionary process [34].
To find an optimal alignment at a given evolutionary distance, scoring functions
are used to measure and score each possible evolutionary pathway between
two sequences, with the goal of finding the alignment, or set of alignments, with
the highest similarity score. Similarity scores are calculated through pairwise
alignment between two identical nucleotide bases or amino-acid residues in each of the
sequences, or by scoring a mutation event. The measurement of local similarity
using Smith-Waterman local alignment [42] is exhaustive and guarantees finding
the optimal alignment, requiring l × m calculations of similarity for two sequences
of lengths l and m. For a comparison of a sequence of length l to a database of
N nucleotides, a total of l × N comparisons are required.
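The l × m cost is visible in a direct implementation of the recurrence. The sketch below is a minimal score-only version of Smith-Waterman local alignment; the match, mismatch, and linear gap scores are illustrative assumptions (practical systems use a mutation data matrix and separate gap opening and extension penalties), and the function name is hypothetical.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b; fills len(a) * len(b) cells."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + s,      # identity or substitution
                prev[j] + gap,        # residue of a aligned to a gap
                curr[j - 1] + gap,    # residue of b aligned to a gap
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows of the matrix are kept because the score alone is reported; recovering the alignment itself would require the full matrix or a traceback structure.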
There are three general classes of mutation: substitution, insertion, and deletion.
Substitution is the mutation, in the pairwise alignment of two sequences, of one
residue into another residue, which may or may not be similar. Deletion is the
non-alignment of a residue in the first sequence with any residue in a second sequence;
deletion signifies that a particular residue is to be removed in the scoring of a
particular evolutionary pathway between the two sequences. Insertion is the same
as deletion, but from the perspective of the second sequence; if a residue is deleted
from the first sequence, an alignment is achieved by inserting a null residue in the
second sequence. Insertion and deletion are generally referred to collectively as an
indel, where more than one consecutive indel is a gap.
An optimal local alignment of two globin sequences, human α-chain and β-chain
hemoglobin, is shown in Figure 3. This optimal local alignment extending over 145
amino-acids is shown in the format typically returned to the user. Parameterisation
of local alignment, that is, the choice of mutation data matrix for substitution and
identity scores, and gap model costs are discussed elsewhere [3,4,10,16,40,45]. An
extract of the results of a database search of the PIR database with human α-chain
hemoglobin is shown in Figure 4. In this extract, we show 3 of 695 ranked answers
returned by our CAFE system, in a typical format for answers from a homology search
system. The first answer shown is identical to human α-chain hemoglobin (and was
ranked as the first response) but is an α-chain hemoglobin from a chimpanzee, the
10 20 30 40 50
HAHU VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGS
||- | -|-|-|||| -|-| |||-| - |-| - |--|-||| |
HBHU MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
10 20 30 40 50
60 70 80 90 100 110
HAHU AQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
-|| |||||--| || | - || ||--|| |||-|| || -| -||-|
HBHU PKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH
60 70 80 90 100 110
120 130 140
HAHU LPAEFTPAVHASLDKFLASVSTVLTSKYR
--||||-| | - |- | | | -||
HBHU FGKEFTPPVQAAYQKVVAGVANALAHKYH
120 130 140
Figure 3: Local alignment of human β-chain hemoglobin (PIR code HBHU, 147
amino-acids) and α-chain hemoglobin (PIR code HAHU, 141 amino-acids). A
BLOSUM-50 mutation matrix is used, with gap opening penalty of -12 and
extension penalty of -2. The local alignment score is 381, with 43.4% identity in
a 145 amino-acid overlap. A ` ' (space) indicates a conservative substitution, a
`-' indicates both an indel and alignment of two dissimilar amino-acids, and a `|'
indicates an identity (match).
Ranking: 1 Scored : 558
Query 0 :LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQ VKGHG KKV 61
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||
Subj 0 :LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG KKV 61
Query 61 :ADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA EFTPA VHA 122
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||
Subj 61 :ADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAA HLPA EFTPA VHA 122
Query 122 :SLDKFLASVSTVLTSK 138
||||||||||||||||
Subj 122 :SLDKFLASVSTVLTSK 138
Ranking: 282 Scored : 273
Query 4 :DKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DL---SH--GS AQVKG HGK 65
|||-|-| |||| |||| |||||-|||| || || |||-| || | ||||||||||
Subj 6 :EKATVSGLWGKV--NADNVGAEALGRLLVVYPWTQRYFSKFGDLSSASAIMGNPQVKA HGK 67
Query 65 :KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL PAEFT PAV 126
|| |||-||| ||||| -||| ||||| ||||||| ||||||| |||-|| ||- |||| -
Subj 67 :KVINAFNDGLKHLDNLKGTFAHLSELHCDKLHVDPENFRLLGNMIVIVL GHHL GKEFT PCA 128
Query 126 :HASLDKFLASVSTVLTSK 144
|||||| ||||||-|| |
Subj 128 :QAAFQKVVAGVASALAHK 146
Ranking: 695 Scored : 53
Query 5 :KTNVKAAWGKVG-AHAGEYGAEALERMFLSFPTTKTYFPHFD-LSHGSAQVKG HGKKV ADA 66
||| - --| - |-|-| ||- | | | ||| -|||- | |||-|- -| -- ||-||
Subj 156 :KTNKPVIFTKSNLAKSPELDAKMYDICY-STAAAPIYFPPHHFVTHTSNGAR- YEFNL VDG 217
Query 66 :LTNAVAHVDDMPNALSA-LSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAE 118
--|||| - || |||| ||| || -|-||||| | ||||||| -||
Subj 217 :AVATVGDPALLSLSVATRLAQEDPAFSSIKSLDYKQM---LLLSLGTGTNSE 269
Figure 4: Extract of the 695 ranked results of a CAFE search with human α-chain
hemoglobin (PIR code HAHU) on the PIR database. The three results shown,
ranked 1, 282, and 695, are, respectively, an identical, homologous α-chain
hemoglobin from the chimpanzee, a homologous β-2-chain hemoglobin from the
common rat, and an unrelated protein from a potato.
second answer shown is ranked 282 and is a β-2 chain hemoglobin from a rat, and
the third answer is ranked last and is a probably non-homologous precursor to a
potato storage protein.
Fast exhaustive search techniques retrieve and process each sequence in a genomic
database in response to a query. However, such systems typically use heuristics
to reduce the number of sequences that require a local alignment to a small
fraction of the database size. Given a set of database sequences, the well-known
Wilbur-Lipman approach is to first pre-process, through hashing, each interval in
the query sequence [46]. An interval in this context is a fixed-length overlapping
subsequence from a sequence, such that there are l - n + 1 intervals, for a sequence
of length l and interval length n. By first preprocessing the query into a hash
table structure, each database sequence can be processed by hashing, and a
single-step lookup used to locate each interval in the query hash table. If an interval is
present in the search structure, the one or more matching offsets in the query can
be retrieved, yielding considerable computational saving over scanning the query
sequence for each database interval. With the offsets, scores for promising alignments
without gaps can be calculated by taking the difference between the offset
of the interval in the query and offset of the interval in the database sequence. An
accumulator can then be used for each alignment without gaps to record a score for
the matches.
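The interval hashing and accumulator steps described above can be sketched as follows. This is a minimal illustration rather than the published algorithm: a Python dictionary stands in for the hash table, each matching interval simply adds one to its accumulator, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def intervals(seq: str, n: int):
    """All l - n + 1 overlapping intervals of length n, with their offsets."""
    return [(i, seq[i:i + n]) for i in range(len(seq) - n + 1)]


def build_query_table(query: str, n: int):
    """Hash each query interval to the list of offsets where it occurs."""
    table = defaultdict(list)
    for offset, interval in intervals(query, n):
        table[interval].append(offset)
    return table


def diagonal_scores(query_table, subject: str, n: int) -> Counter:
    """One accumulator per ungapped alignment, keyed by offset difference."""
    accumulators = Counter()
    for s_offset, interval in intervals(subject, n):
        for q_offset in query_table.get(interval, ()):
            accumulators[q_offset - s_offset] += 1
    return accumulators


table = build_query_table("ACGTACGT", 3)
print(diagonal_scores(table, "TTACGTAA", 3))
```

Each key in the result identifies one ungapped alignment of the query against the database sequence; high counts mark the alignments worth examining further.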
There are three popular exhaustive search systems, all of which use variations
of the Wilbur-Lipman interval approach for practical exhaustive searching. All
systems have the limitation that an alignment is only possible where a common interval
exists between a query and database sequence; in the case of BLAST, neighbourhood
intervals (those that differ by one or two characters) may also form the basis of
matches. We describe these systems, FASTA, BLAST1, and BLAST2, briefly.
The FASTA search system [25,30,33] uses a four-stage approach to alignment
where the first stage is an application of the Wilbur-Lipman technique. After scoring
matching intervals, the second stage re-scores the top ten accumulators for each
sequence allowing gaps and the third stage attempts to join high-scoring regions
represented by the accumulators. The final stage locally aligns the sequences using
a memory-efficient implementation of local alignment [13] based on the regions
joined in stage three. Three scores are reported for each sequence, along with a
graph of the distribution of mean similarity scores and standard deviation of the
scores for each sequence in the database. Our CAFE system, which we describe in
detail in Section 3, uses similar heuristics to FASTA in the final stage of alignment.
Altschul et al. [2] propose additional heuristics in BLAST1 to improve the
database search times of the FASTA tool, with the goal of maintaining a comparable
level of retrieval effectiveness. BLAST1 deals with the high computational
overhead associated with both an exhaustive database search and local alignment
by using heuristics that find "ungapped, locally optimised ranked sequence
alignments" [2]. As a consequence, BLAST1 does not allow for the insertion or deletion
of residues, but only for the substitution of one residue for another. However, by
building on the well-understood theory of ungapped local alignment [22,5], BLAST1
is able to effectively filter irrelevant answers.
BLAST1 has an underlying assumption that indels are a less significant class
of evolutionary event. This assumption has some merit in nucleotide comparison,
since both single deletion and insertion events cause the meaning of codons to be
completely lost. As such, it may be a reasonable compromise to ignore gapped
alignments in coding regions of nucleotide sequences. However, FASTA has been shown
to be significantly more sensitive than BLAST1 at detecting distant homologous
relationships, which typically contain more indel events [32,41]. Shpaer et al. [41]
have shown that there is on average one indel per fourteen aligned amino-acid
residues, suggesting that while the well-developed ungapped alignment statistics
may favourably filter false matches, BLAST1 may also fail to detect many homologous
sequences.
A new version of BLAST, known as BLAST2 [6], aims to improve both the accuracy
and speed of BLAST1. BLAST2 permits the limited use of indels in forming
alignments, thus requiring more computation to evaluate each local alignment. To
ensure that BLAST2 is on average faster than BLAST1, the criteria for attempting
a local alignment are more stringent, with local alignment only permitted when
two intervals (or neighbourhood intervals) match between a query and a database
sequence. Altschul et al. have stated that BLAST2 has improved accuracy over
the previous versions [6]; however, the results presented later in this paper do not
support this conclusion.
A significant problem with all exhaustive systems is that they may become
prohibitively expensive because of the volume of data to be processed for each
query. Conventional databases use indexing to provide fast access to relevant data.
In particular, indexing has been shown to work well for information retrieval [39,
38], which has marked similarities to genomic retrieval: in both domains a typical
query returns a large set of responses, many entries in the database exhibit some
degree of similarity to the query, and matches are approximate rather than exact.
Previous genomic indexing efforts have, however, been largely unsuccessful and
indexed systems are not in widespread use. We discuss existing indexed approaches
in the next section.
2.3.1 Indexed Genomic Searching
A general method for reducing searching costs is to store an abstraction or index
that can be used to assess broad similarity to a query. The cost is the need to store
the index; the potential saving is that fetching a limited volume of information
should enable identification of a small number of sequences as likely answers, thus
reducing both disk traffic and the computation required to resolve a query.
Interval-based indexing of genomic databases was first proposed by Orcutt and
Barker [29]. Specifically, Orcutt and Barker proposed their algorithm as a method
of identifying amino-acid sequences in the PIR database. No detail is given of an
implementation of the approach, but Barton [10] notes that an implementation,
SCAN, was available with the PIR database for use in exact matching.
The SCAN approach was not highly successful and has not been developed for
several reasons. First, simple measurements of matching intervals are used as a
ranking technique, limiting the sensitivity and selectivity of the indexed approach;
we have found that, without novel ranking techniques, ranking detects similarity in
composition but not necessarily homology. Second, non-overlapping intervals are
extracted from the database and query sequences; all current exhaustive and indexed
approaches, including BLAST and FASTA, require overlapping intervals to achieve
acceptable retrieval effectiveness. Third, the algorithms of Orcutt and Barker do not
use current approaches to managing large document collections that make indexing
practical; for example, queries are limited to 25 residues in length and search
structures are unlikely to have been compressed, resulting in a large index and increased
disk transfer times.
Altschul et al. [2] have implemented a similar approach to SCAN that uses a
table of all database intervals. It was found, possibly because of limitations similar
to those in SCAN and because GenBank was around 45 times smaller in 1990 than
it is now, that this approach was somewhat slower than exhaustively searching the
database.
The most recent indexed scheme is the Rapid Access Motif database (RAMDB)
system for finding short patterns in genomic databases [19]; such patterns, or motifs,
are typically around ten bases in length. In the approach of Fondrat and Dessen,
each genomic sequence is indexed by its constituent overlapping intervals in a hash
table structure. For each interval in the collection, an associated list of sequence
numbers and offsets is stored, allowing rapid location of any motif matching a
query motif.
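A structure of this kind can be sketched as a mapping from each interval to a list of (sequence number, offset) pairs. The code below is only an illustration of the idea, not the RAMDB implementation: a Python dictionary replaces the hash table, only exact motif matching is shown, and the function names are hypothetical.

```python
from collections import defaultdict


def build_interval_index(sequences, n: int):
    """Map every overlapping interval of length n to its (sequence, offset) list."""
    index = defaultdict(list)
    for seq_no, seq in enumerate(sequences):
        for offset in range(len(seq) - n + 1):
            index[seq[offset:offset + n]].append((seq_no, offset))
    return index


def find_motif(index, sequences, motif: str, n: int):
    """Locate exact occurrences of a motif whose length is at least n."""
    hits = []
    # Look up the first n characters, then verify the rest of the motif in place.
    for seq_no, offset in index.get(motif[:n], ()):
        if sequences[seq_no].startswith(motif, offset):
            hits.append((seq_no, offset))
    return hits


db = ["ACGTACGT", "TTACGA"]
index = build_interval_index(db, 3)
print(find_motif(index, db, "ACGT", 3))
```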
The primary application of RAMDB is the location of motifs either equal or
slightly longer in length than the indexed interval length. The indexed approach of
RAMDB is shown to result in a 0 to 800-fold speedup in search times over comparable
exhaustive approximate pattern matching approaches.
The FLASH search tool redundantly indexes genomic data based on a probabilistic
scheme [12]. For each interval of length n, the FLASH search structure stores,
in a hash-table, all possible similarly-ordered contiguous and non-contiguous
subsequences of length m that begin with the first base in the interval, where m < n.
As an example, for a nucleotide sequence acctgatt the index terms for the first
n = 5 bases, where m = 3, would be acc, act, acg, act, acg, and atg; each
of the permuted strings begins with the base a, the first base in the interval of
length n = 5. The hash-table then stores each permuted m-length subsequence,
the sequences that contain the permuted subsequences, and the offsets within each
sequence of the permuted subsequence. The permuted scheme gives an accurate
model that approximates a reasonable number of insertions, deletions, and
substitutions in genomic sequences.
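The index terms in this example can be generated mechanically by fixing the first base of the interval and choosing the remaining m - 1 bases, in order, from the rest of the interval. The sketch below reproduces the example only; it is not the FLASH implementation, and the function name `flash_terms` is hypothetical.

```python
from itertools import combinations


def flash_terms(interval: str, m: int):
    """Ordered subsequences of length m that begin with the interval's first base."""
    first, rest = interval[0], interval[1:]
    return [first + "".join(choice) for choice in combinations(rest, m - 1)]


# First n = 5 bases of acctgatt, with m = 3.
print(flash_terms("acctgatt"[:5], 3))
# ['acc', 'act', 'acg', 'act', 'acg', 'atg']
```

Duplicates such as act arise because the repeated base c can be chosen from two different positions, which is one source of the redundancy in the index.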
Califano and Rigoutsos found that FLASH was of the order of ten times faster
for a small test collection than BLAST and was clearly superior in accurately and
sensitively determining homologies in database searching. In addition, given
adequate system resources, scaling-up of the system to larger databases suggested that
the computational time saving would remain at a similar order for the complete
GenBank collection.
However, the redundant index, which is stored in a hash-table and is uncompressed,
is impractically large. Rigoutsos and Califano report that, for a nucleotide
collection of around 100 Mb, the index requires 18 Gb on disk, around 180 times
the collection size. Barton [10] reports that the index for Swiss-PROT Release 25, a
collection of around 10 Mb, requires almost 2.8 Gb of disk space, around 280 times
the size of the Swiss-PROT collection.
3 Indexed Genomic Retrieval with Cafe
Inverted files have been shown to be a successful tool for large text database
retrieval [27, 39,54]. In such environments, indexes are used to selectively retrieve
relevant records without exhaustive scanning of the database. Indexing trades space
against time: for the cost of storing the index, retrieval is typically many orders of
magnitude faster than an exhaustive search.
To address the problems with indexing encountered in other attempts, wepro-
pose a two-component partitioned search pro cess embodied in a research prototype
system,
cafe
. The rst componentofour approach, a
coarse search
,uses an in-
verted index to select a subset of sequences that display broad similarity to the
query sequence. The second comp onent, a computationally more expensive
ne
search
mechanism, ranks the resultant sequences from the coarse search in order of
relevance to the query, presenting the ranked results to the user. The partitioning
of searching into coarse and ne mechanisms has, for example, been successfully
used for pattern matching in databases of names 57,58]. To ensure ecient query
evaluation, we use a query evaluation technique adapted from such methods.
A significant feature of CAFE is our method of addressing the problem of index size, where
we have adapted compression techniques developed for text indexing [27]. In text indexing,
index size is reduced with careful compression by a factor of three to six, while in
genomic databases we have found that more than three-fold reductions are possible. We have
previously discussed the compression of CAFE indexes in a preliminary description of our
approach [50]; in addition, we have developed a method for compressing genomic nucleotide
sequences that reduces the query evaluation costs in our CAFE system by over 20% [51].
In this section we explore coarse searching, that is, using an index to assess broad
sequence similarity prior to retrieving and fine searching the sequence data. Our novel
fine searching approach is not described in detail in this paper, but is a gapped scheme
that is similar in sensitivity and technique to the FASTA approach. We present details of
our fine searching local alignment algorithm elsewhere [48].
To evaluate the CAFE approach, we present in Section 4 a framework to enable comparison of
aspects of rapid homology search tools, such as retrieval effectiveness, speed, and space.
We use this evaluation framework to compare CAFE to BLAST and FASTA.
3.1 Indexing with Cafe
To achieve efficient and effective retrieval from genomic databases, we propose several
modifications to the methods used for general strings. Improvements are required for the
following reasons. First, to reduce the computational overhead of fine searching, it is
preferable for the coarse search phase to provide a framework for subsequent local
alignment; fine searching using a preliminary version of CAFE [50] consumes between 40% and
90% of the total query evaluation cost. Second, because of the length of the stored
sequences, simply identifying which sequences are likely matches requires the often
inefficient retrieval of complete genomic sequences. Typical nucleotide queries are
hundreds of base-pairs in length, while some sequences in GenBank are around 300,000
base-pairs. By incorporating extra information in the index, it is possible to identify
where in the sequence a similar region can be found and allow partial sequence retrieval of
only the matching region.
For indexing genomic data we suggest that an appropriate choice of index term is the
intervals occurring in each sequence, where the intervals are overlapping substrings of
some fixed length $n$; choice of $n$ is discussed later. For example, if $n = 3$ then the
intervals of acctgtc are acc, cct, ctg, tgt, and gtc. Fixed-length overlapping intervals
have been shown to be practical in other indexed [19] and exhaustive search systems
[2, 25, 30, 33]. In addition, fixed-length overlapping intervals work well in other areas
of genomics, including query filtering [49, 56], fragment assembly [28], consensus
alignment [55], sequence classification [37], and pattern detection [35]. Indexing on
fixed-length substrings has also been successfully used in pattern-matching for large
lexicons [57, 58], a domain that is however rather different: strings in lexicons are
typically around ten characters, not hundreds of characters, and similarity is global
rather than local.
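The extraction of overlapping fixed-length intervals can be sketched as follows. The
function name intervals is hypothetical, and the 1-based offsets are an assumption chosen
to match the offsets quoted in the examples of this section.

```python
def intervals(sequence: str, n: int) -> list[tuple[int, str]]:
    """Return (offset, interval) pairs for every overlapping substring of
    length n, with offsets counted from 1."""
    return [(i + 1, sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


for offset, interval in intervals("acctgtc", 3):
    print(offset, interval)
```

A sequence of length $L$ contributes $L - n + 1$ intervals, so the number of index entries
grows roughly linearly with collection size.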
An inverted index has two components: a search structure and postings lists. The search
structure consists of the set of unique searchable terms, in this case the set of
intervals, while associated with each term in the search structure is a postings list. The
postings list contains the ordinal numbers of the documents containing the associated
search term. The CAFE inverted file indexing scheme is extended so that within each
postings list is stored not only the ordinal sequence number that contains the interval,
but also offset information. For example, consider the following postings list

    accc 12,(3:144,154,962),38,(2:47,1045)

in which the indexed sequences, the 12th and 38th, contain the interval accc. The interval
occurs 3 times in the 12th sequence, at offsets 144, 154, and 962, and twice in the 38th
sequence at offsets 47 and 1,045.
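A minimal in-memory sketch of such an offset-bearing inverted index is given below. It is
an illustration only: the names build_index and format_postings are hypothetical, the
index is an uncompressed Python dictionary rather than the on-disk compressed structure
used by CAFE, and sequence numbers and offsets are assumed to count from 1.

```python
from collections import defaultdict


def build_index(sequences: list[str], n: int) -> dict[str, dict[int, list[int]]]:
    """Map each interval to {sequence number: [offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for seq_no, seq in enumerate(sequences, start=1):
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]][seq_no].append(i + 1)
    return {term: dict(postings) for term, postings in index.items()}


def format_postings(term: str, postings: dict[int, list[int]]) -> str:
    """Render one postings list in the notation used in the text."""
    parts = []
    for seq_no in sorted(postings):
        offsets = postings[seq_no]
        parts.append(f"{seq_no},({len(offsets)}:{','.join(map(str, offsets))})")
    return f"{term} " + ",".join(parts)


index = build_index(["acccaccc", "ggaccc"], n=4)
print(format_postings("accc", index["accc"]))
```

Storing the offsets is what later allows matching regions to be located without retrieving
whole sequences.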
With postings lists typically on disk, an inverted index is intensive in disk usage:
frequently a large number of postings lists are retrieved and the average length of
postings lists is high. To reduce the overhead of using an index for retrieval, compression
techniques used for text database indexes [27] and string indexing [57, 58] are used to
reduce index size. The benefits of compression are two-fold: there is a saving in space
used by the index and often a saving in query evaluation time, if retrieval of compressed
lists and subsequent decompression is faster than retrieving uncompressed lists [52, 54].
3.2 Coarse Searching with Cafe
To achieve efficient and effective retrieval from genomic databases, we propose several
ranking techniques that use our index structure. We introduce in this section a novel
ranking structure, frames, that addresses the deficiencies of simple ranking schemes. In
particular, frames reduces the number of sequences that need to be fine searched, by
supporting accurate, inexpensive metrics for coarse ranking with a fine-grain index.
Frame-based metrics incorporate the relative positioning of matching intervals, as well as
other calculated metrics, to give a model of likely homologous alignments. We have
previously described a preliminary implementation of frames for nucleotide searching [47].
By using the offsets stored in the fine-grain index, a ranking structure can be constructed
to selectively and accurately detect homologous sequences. We refer to this structure as
frames. A frame is a set of one or more matching intervals between a database sequence and
a query sequence that are at the same relative offset. There can be many frames created and
each frame is treated independently, regardless of whether the frame represents intervals
from different matching offsets from the same database sequence, or is from a different
database sequence.
To illustrate frames, consider Figure 5, which shows three matching frames between two
sequences. Each of Figures 5(A), 5(B), and 5(C) represents a single frame. In this example,
with an interval length of $n = 3$, we have identified four distinct regions of at least
the interval length that match between the sequences; two regions are shown in
Figure 5(A) and one each in Figures 5(B) and 5(C). There are other intervals and regions
that match between the two sequences that have been omitted for clarity in the figure.
To show how the four regions form three frames, consider the first region shown in
Figure 5(A), which is a match between gttt at offset 9 in the first sequence and offset 7
in the second. The relative offset of the match is $9 - 7 = 2$, creating the frame $F_2$,
which contains the offsets of each of the two interval matches

$$\{(9, 7), (10, 8)\}$$

The offsets within a frame can be read as an interval-length match between two sequences
beginning at the respective offsets in each sequence. The second region in Figure 5(A)
contains six interval matches and adds to frame $F_2$, which then contains eight tuples:

$$\{(9, 7), (10, 8), (27, 25), (28, 26), (29, 27), (30, 28), (31, 29), (32, 30)\}$$

Figure 5(B) shows a matching region containing two intervals tgg and ggg, beginning at
offsets 17 and 16 of the two sequences respectively. This creates a second frame $F_1$,
since $17 - 16 = 1$, containing the two interval matches $\{(17, 16), (18, 17)\}$.
Figure 5(C) shows two matching intervals that create a third new frame $F_{26}$ containing
$\{(53, 27), (54, 28)\}$.
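The grouping of interval matches into frames can be sketched directly from the definition:
every pair of identical intervals is filed under the difference of its two offsets. The
sketch below is an assumption-laden illustration, not the CAFE implementation. It compares
two in-memory strings instead of reading postings lists, the function name build_frames is
hypothetical, and it reports every interval match, including those omitted from Figure 5
for clarity.

```python
from collections import defaultdict


def build_frames(first: str, second: str, n: int) -> dict[int, list[tuple[int, int]]]:
    """Group matching length-n intervals by relative offset.

    Returns {relative offset: [(offset in first, offset in second), ...]}
    using 1-based offsets."""
    # Record where each interval occurs in the second sequence.
    positions = defaultdict(list)
    for j in range(len(second) - n + 1):
        positions[second[j:j + n]].append(j + 1)

    frames = defaultdict(list)
    for i in range(len(first) - n + 1):
        for j in positions.get(first[i:i + n], ()):
            frames[(i + 1) - j].append((i + 1, j))
    return dict(frames)


first = "ACCCTGAGGTTTTTTTTGGGAGAGCTTTCTTCTTAGAGAGGAGGCTAGCTAGCTTCG"
second = "GTGTGTGTTTGTGTGTGGGGTAAGTTCTTCTTCTT"
frames = build_frames(first, second, n=3)
print(frames[2])
```

For the two sequences of Figure 5, the frame at relative offset 2 holds the eight tuples
listed above, and the tuples given for $F_1$ and $F_{26}$ appear in the frames at relative
offsets 1 and 26.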
A simple ranking scheme would rank sequence similarity based on the interval matches
between the two sequences. For example, an obvious simple ranking scheme would be to score
matches based on the count of interval matches between the two sequences over all regions.
In our example, the total score would be twelve, as there are twelve intervals in the four
regions identified. This coarse-grain ranking overestimates the similarity between the two
sequences, since it is not possible to present a local alignment based on all of the
identified interval matches.

A
        10        20        30        40        50
ACCCTGAGGTTTTTTTTGGGAGAGCTTTCTTCTTAGAGAGGAGGCTAGCTAGCTTCG
        ::::              ::::::::
  GTGTGTGTTTGTGTGTGGGGTAAGTTCTTCTTCTT
          10        20        30
B
        10        20        30        40        50
ACCCTGAGGTTTTTTTTGGGAGAGCTTTCTTCTTAGAGAGGAGGCTAGCTAGCTTCG
                ::::
 GTGTGTGTTTGTGTGTGGGGTAAGTTCTTCTTCTT
         10        20        30
C
        10        20        30        40        50
ACCCTGAGGTTTTTTTTGGGAGAGCTTTCTTCTTAGAGAGGAGGCTAGCTAGCTTCG
                                                    ::::
                          GTGTGTGTTTGTGTGTGGGGTAAGTTCTTCTTCTT
                                  10        20        30

Figure 5: Alignment of two sequences showing three frames, where each frame consists of
matching intervals of length $n = 3$ at the same relative offset. (A) The first frame,
frame $F_2$, consists of two regions: the first begins at offset 9 in the first sequence
and offset 7 in the second sequence; the second begins at offsets 27 and 25. (B) The second
frame, frame $F_1$, contains one matching region, beginning at offset 17 in the first
sequence and offset 16 in the second. (C) The third frame, frame $F_{26}$, contains one
region.
By using frames, coarse ranking can provide a notion of order and arrangement of common
regions. The frames $F_1$ and $F_2$ cannot be combined with $F_{26}$ to form a plausible
alignment and should be treated independently in coarse ranking for subsequent fine
searching. This independence of regions is reflected by using the frames structure, where
a coarse search of the two sequences identifies three separate frames, of which only one is
likely to result in an interesting alignment. The two frames with shorter matching regions
will be ranked equally with each other and, assuming ranking uses a simple count of frame
offset tuples, equal with any other frames from other database sequences that contain only
two interval tuples.
Often, however, related frames may be able to be combined to form a single alignment with
gaps. For example, frames $F_1$ and $F_2$ may be able to be combined with a single indel to
form a higher-scoring alignment than the individual ungapped alignments. We discuss
heuristics for forming frames into weighted coarse ranked neighbourhoods later.
The frames approach is similar to the heuristic method for local alignment of two
sequences proposed by Wilbur and Lipman [46] and applied in the FASTA exhaustive search
system. However, the significant advantage of frames is that it can be applied to inverted
lists, making it possible to rapidly rank both regions within sequences and amongst
sequences. Moreover, accurate scoring metrics that measure overall similarity of ungapped
alignments [22] or gapped alignment statistics can be calculated. There are several such
possible scoring metrics.
A simple scoring metric that can be calculated using frames is to rank frames based on the
number of intervals in each frame for two sequences $s$ and $t$, so that

$$\mathrm{framecount}(s, t) = \max\left(|F(I(s) \cap I(t))|\right)$$

where $I(s)$ is the set of intervals in sequence $s$, $I(t)$ the intervals in sequence $t$,
and $F()$ the frame function that returns one or more sets of intervals that are at the
same relative offset. For Figure 5, the framecount measure would identify frame $F_2$ as
the best coarse match, since the cardinality of frame $F_2$ is 8, and frames $F_1$ and
$F_{26}$ have a cardinality of 2.

A
        10        20        30        40        50
ACCCTGAGGATTTTTTTGGGAGAGCTTTCTTCTTCGAGAGGAGGCTAGCTAGCTTCG
                          :::::::::
ACGTGTGTGTTTGTGTGTGGGGTAAGTTCTTCTTCTTCTCTTTCTCTTTCTTTCCTC
        10        20        30        40        50
B
        10        20        30        40        50
ACCCTGAGGATTTTTTAAAGAGAGCTCCCTTAGGAGAGAGGAGGCTAGCTAGCTTCG
:::   :::       :::         :::          ::: :::    :::
ACCTGTAGGTTTGTGCAAAAGGTAAGTTCTTCTTCTTCTCTAGGATAGTTCTCTTAT
        10        20        30        40        50

Figure 6: An illustration of coverage in a single frame, with an interval length of
$n = 3$. Other frames exist between the sequences shown, but these are omitted in this
example. (A) Two sequences of the same length are shown, with a single frame containing 7
overlapping interval matches. In this case, the coverage is 9, since there are 9 base
identities. (B) Two similar sequences of the same length are shown, with a single frame
containing 7 non-overlapping interval matches. In this example, the coverage is 21, as
there are 21 base identities in the frame.
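As a small illustration of the framecount measure, the sketch below takes frames that have
already been built and reports the size of the largest one. The dictionary layout (relative
offset mapped to a list of offset tuples) and the function name are assumptions made for
this example; the tuples are those given in the text for Figure 5.

```python
def framecount(frames: dict[int, list[tuple[int, int]]]) -> int:
    """Cardinality of the largest frame, or 0 if there are no frames."""
    return max((len(matches) for matches in frames.values()), default=0)


figure5_frames = {
    2: [(9, 7), (10, 8), (27, 25), (28, 26), (29, 27), (30, 28), (31, 29), (32, 30)],
    1: [(17, 16), (18, 17)],
    26: [(53, 27), (54, 28)],
}

# The relative offset of the frame that achieves the maximum.
best_frame = max(figure5_frames, key=lambda offset: len(figure5_frames[offset]))
print(best_frame, framecount(figure5_frames))
```

With these three frames the score is 8, obtained from the frame at relative offset 2.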
This simple framecount measure can be refined to incorporate more information than the
cardinality of frames. For example, by considering the relative positioning of intervals
in a single frame, we can calculate two computationally inexpensive metrics for ranking:
coverage and length.
Frame coverage is a count of the actual number of residues or bases that match between two
sequences. Given that a frame may consist of multiple overlapping intervals, a new interval
of length $n$ may contribute between 0 and $n$ new base or residue identities to the frame.
We propose scoring a frame using the coverage, where frames with higher numbers of
distinct non-overlapping intervals are ranked higher.
Figure 6 shows an example that contrasts the framecount scheme with coverage. In the case
of framecount, both Figures 6(A) and 6(B) score 7, since in both cases there are 7
matching intervals of length $n = 3$ in the frames shown; we omit other frames that exist
between the sequences in this example. Although framecount and coverage scores cannot be
directly compared, in the case of coverage, the frame shown in Figure 6(A) scores 9, since
9 base identities are contributed by the alignment of the 7 matching intervals. In
contrast, the coverage contribution of the 7 non-overlapping intervals in Figure 6(B) is
$7 \times 3 = 21$.
Although coverage is reasonably cheap to evaluate, it is simplistic and relies on the
assumption that a frame with higher coverage is statistically more likely to result in a
high-scoring local alignment. However, we have found that coverage works well in
identifying homologous sequences [48].
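Coverage can be computed from the interval offsets of a single frame by counting each
matched position once. The sketch below assumes that a frame is represented only by the
1-based offsets of its intervals in one of the two sequences, which is sufficient because
all intervals in a frame share the same relative offset. The function name is
hypothetical.

```python
def coverage(offsets: list[int], n: int) -> int:
    """Number of distinct positions covered by length-n intervals that
    start at the given offsets."""
    covered = set()
    for start in offsets:
        covered.update(range(start, start + n))
    return len(covered)


# Figure 6(A): seven overlapping intervals starting at offsets 27 to 33.
print(coverage([27, 28, 29, 30, 31, 32, 33], n=3))
# Figure 6(B): seven non-overlapping intervals.
print(coverage([1, 7, 17, 29, 42, 46, 53], n=3))
```

The two calls reproduce the scores of 9 and 21 given for Figures 6(A) and 6(B).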
The length of a frame match is the total number of bases that lie between the two
intervals that have the smallest and largest offsets. In Figure 7, extending Figure 6, we
again illustrate one frame for each of three sequence pairs containing seven interval
matches of length $n = 3$. The length is the difference between the minimum and maximum
offsets covered by the frame intervals, in the case of Figure 7(A) a length of 21 and for
Figure 7(B) a length of 55. This scheme weights a frame highly if it covers a larger region
of the query sequence.
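The length measure can be sketched in the same way as coverage, again assuming a frame is
given as the 1-based interval offsets in one sequence. The function name frame_length is
hypothetical. The span is counted inclusively, from the first base of the leftmost interval
to the last base of the rightmost, which is the reading that reproduces the values quoted
for Figure 7.

```python
def frame_length(offsets: list[int], n: int) -> int:
    """Number of bases spanned by the frame, from the start of the first
    length-n interval to the end of the last one."""
    if not offsets:
        return 0
    return max(offsets) + n - min(offsets)


# Figure 7(A): interval matches packed into the first 21 bases.
print(frame_length([1, 5, 9, 10, 14, 18, 19], n=3))
# Figure 7(B): the same number of interval matches spread over 55 bases.
print(frame_length([1, 7, 17, 29, 42, 46, 53], n=3))
```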
As in coverage, the length scheme does not consider all aspects of the interval matches in
a particular frame. For example, Figure 7(C) shows a frame that has the same length of 55
as the frame in Figure 7(B). Figure 7(C) has two interval matches and no obvious homology
between the matches, while in Figure 7(B) there are seven interval matches and a much
higher overall similarity between the two sequences.

A
        10        20        30        40        50
ACCATGATGATTTTGTACAGAAAGCTCCTTTAGGAGAGAGGCGGCTCGCTAGCATCG
::: ::: :::: ::: ::::
ACCCTGAGGATTGTGCCCAGAGTAAGTTCTTCTTCTTCTCTAGGATAGTTCTCTTAT
        10        20        30        40        50
B
        10        20        30        40        50
ACCCTGAGGATTTTTTAAAGAGAGCTCCCTTAGGATACACGAGGCTAGCTAGCTTCG
:::   :::       :::         :::          ::: :::    :::
ACCTGTAGGTTTGTGCAAAAGGTAAGTTCTTCTTCTTCTCTAGGATAGTTCTCTTAT
        10        20        30        40        50
C
        10        20        30        40        50
ACCCTGCTTATTTTTTTTTGAGAGCTCCTCCAGGATACACGGAGCTACCTAGCTTCG
:::                                                 :::
ACCTGTAGGTTTGTGCAAAAGGTAAGTTCTTCTTCTTCTCTAGGATAGTTCTCTTAT
        10        20        30        40        50

Figure 7: An illustration of length in a single frame, with an interval length of $n = 3$.
Other frames exist between the sequences shown, but these are omitted in this example.
(A) Two sequences of the same length are shown, with a single frame containing 7 interval
matches. In this case, the length is 21, since the total length of the region in the frame
is 21 bases. (B) Two similar sequences of the same length are shown, with a single frame
again containing 7 interval matches. In this example, the length is 55, since the total
region length for the frame is 55 bases. (C) Two sequences are shown with a frame that has
the same length of 55 as the sequences shown in (B), but has no obvious homology between
the matching intervals.
Without a combined consideration of coverage, the length scheme does not factor in the
probability of a homologous local alignment in the absence or presence of matching
intervals between the extremities. In particular, the probability of a successful local
alignment of distant intervals decreases with increasing distance. Despite this, the
length scheme is particularly attractive since it ranks highly regions that are longer and,
therefore, will rank long homologous alignments ahead of shorter alignments. We have found
that length is reasonable in identifying homologous sequences, but not as accurate as the
coverage approach [48].
We propose also a combined coverage and length scheme that addresses some of the problems
in coverage and length. This new approach, which we call combined, factors in both
coverage and length in a frame, so that

$$\mathrm{combined} = \mathrm{coverage} - k\,(\mathrm{length} - \mathrm{coverage})$$

The calculation of $(\mathrm{length} - \mathrm{coverage})$ is the count of residues in the
initial match region that are not part of interval matches identified in the coarse search
phase. Typically, we choose a constant $k$ where $k \le 1$, since interval matches that
contribute to the coverage score are indicators of possible homology, while regions
containing substitutions or less than interval-length identities are not the opposite; that
is, regions not containing interval matches may still have high similarity through
conservative substitutions. We have tested values of $k$ through empirical observation of
ranked results and we use a value of $k$ optimised for amino-acid searching in Section 4.
By using this approach, frames containing interval identities spanning a reasonable portion
of sequences being compared are ranked highly. If the result of the calculation is
negative, where the matches are sparse over a long region, then the matching regions are
recursively divided and treated independently until each portion is positive. We show in
Section 4 that the combined scheme is excellent for coarse ranking with frames and we have
found it has better accuracy than either length or coverage.
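The combined score can be sketched by computing coverage and length for one frame and
applying the penalty term. Several things here are assumptions made purely for
illustration: the function name, the representation of a frame as 1-based interval
offsets, and above all the value $k = 0.5$, which is not taken from the paper. The
recursive division of frames with negative scores is not implemented.

```python
def combined(offsets: list[int], n: int, k: float) -> float:
    """coverage - k * (length - coverage) for one frame of length-n
    intervals starting at the given offsets."""
    covered = set()
    for start in offsets:
        covered.update(range(start, start + n))
    if not covered:
        return 0.0
    cov = len(covered)
    length = max(covered) - min(covered) + 1
    return cov - k * (length - cov)


K = 0.5  # illustrative value only, not the constant used in the paper
print(combined([1, 7, 17, 29, 42, 46, 53], n=3, k=K))  # frame of Figure 7(B)
print(combined([1, 53], n=3, k=K))                     # frame of Figure 7(C)
```

Under this assumed value of $k$ the frame of Figure 7(B) keeps a positive score while the
sparse frame of Figure 7(C), which has the same length, becomes negative and would be a
candidate for division.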
3.3 Applications of Frames
In our discussion of frames so far, we have neglected variation in the scores of each
matching interval. In most nucleotide sequence searching, this assumption has no impact;
an identity between two nucleotide bases is usually scored as +5, so a matching interval of
length $n$ will always score $5n$. How