JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 14, Number 5, 2007
© Mary Ann Liebert, Inc.
Clustered Sequence Representation for
Fast Homology Search
MICHAEL CAMERON,1 YANIV BERNSTEIN,1 and HUGH E. WILLIAMS2
We present a novel approach to managing redundancy in sequence databanks such as
GenBank. We store clusters of near-identical sequences as a representative union-sequence
and a set of corresponding edits to that sequence. During search, the query is compared
to only the union-sequences representing each cluster; cluster members are then only re-
constructed and aligned if the union-sequence achieves a sufﬁciently high score. Using this
approach with BLAST results in a 27% reduction in collection size and a corresponding 22%
decrease in search time with no signiﬁcant change in accuracy. We also describe our method
for clustering that uses ﬁngerprinting, an approach that has been successfully applied to col-
lections of text and web documents in Information Retrieval. Our clustering approach is ten
times faster on the GenBank nonredundant protein database than the fastest existing ap-
proach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source
version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice
as fast as NCBI-BLAST with no significant change in accuracy.
Key words: BLAST, clustering, homology search, near duplicate detection, sequence alignment.
1. INTRODUCTION

Comprehensive genomic databases such as the GenBank non-redundant protein database contain
a large amount of internal redundancy. Although exact duplicates are removed from the collection,
there remain a large number of near-identical sequences. Such near duplicate sequences can appear in
protein databases for a variety of reasons, including the existence of closely-related homologues or partial
sequences, sequences with expression tags, fusion proteins, post-translational modiﬁcations, and sequencing
errors. These minor sequence variations lead to the over-representation of protein domains, particularly
those that are under intensive research. For example, the GenBank database contains several thousand
protein sequences from the human immunodeﬁciency virus.
Database redundancy can lead to a number of negative consequences in the context of sequence ho-
mology search. First, a larger database takes longer to query; as sequencing efforts continue to outpace
1School of Computer Science and Information Technology, RMIT University, Melbourne, Australia.
2Microsoft Corporation, Redmond, Washington.
CLUSTERED-SEQUENCE REPRESENTATION FOR FAST HOMOLOGY SEARCH 595
speed improvements in computer hardware, the problem of slow query response is one that will continue
to become more urgent. Second, redundancy can lead to highly repetitive search results for any query
that matches closely with an over-represented sequence. Third, large-scale redundancy has the effect of
skewing the statistics used for determining alignment signiﬁcance, ultimately leading to decreased search
effectiveness. Fourth, proﬁle-based algorithms such as PSI-BLAST (Altschul et al., 1997) can be misled
by redundant matches during iteration, causing them to bias the proﬁle towards over-represented domains;
this can result in a less sensitive search or even proﬁle corruption (Li et al., 2002; Park et al., 2000).
Attempts to manage redundancy in genomic databases have in the past focused on the creation of
representative-sequence databases (RSDBs), culled collections in which no two sequences share more
than a given level of identity. Such databases have been shown to signiﬁcantly improve proﬁle training
in iterative search tools such as PSI-BLAST by reducing the incidence of proﬁle corruption caused by
over-represented domains. However, they are less suitable for regular search algorithms such as BLAST
(Altschul et al., 1990, 1997) and FASTA (Pearson and Lipman, 1985, 1988) because, by deﬁnition, RSDBs
are not comprehensive. This leads to search results that are both less accurate—the representative sequence
for a cluster may not be the one that aligns best with a given query—and less authoritative, because the
user is only shown one representative sequence from a family of similar sequences. Furthermore, existing
clustering techniques for creating RSDBs either exhibit O(n²) time complexity in the size of the database
being clustered, or consume large quantities of memory resources.
In this work we present a redundancy-detection technique for sequence databases that is based on
document fingerprinting (Bernstein and Zobel, 2004). We describe how the fingerprinting approach has
been successfully adapted to the domain of sequence data, and present slotted SPEX, a new chunk-selection
algorithm for ﬁngerprinting that is particularly suited to the sequence domain. Fingerprinting suffers from
neither quadratic time complexity nor heavy memory requirements, leading to a signiﬁcant improvement
to the efﬁciency with which we are able to cluster large sequence databases; we are able to process the
entire GenBank collection in one hour on a commodity workstation. By contrast, the most advanced and
efﬁcient existing technique, CD-HIT (Li et al., 2001b), takes almost ten hours on the same machine.
We then introduce a novel sequence clustering methodology, and corresponding modiﬁed search algo-
rithm, that when combined allow for the efﬁcient and effective management of redundancy in genomic
databases. Importantly, our approach lacks the drawbacks of previous redundancy-management strategies.
Whereas earlier approaches choose one sequence from each cluster as the representative to include in the database
and delete the other sequences, we generate for each cluster a special union-sequence that—through use of
wildcard characters—represents all of the sequences in the cluster simultaneously. Through careful choice
of wildcards, we are able to achieve near-optimal alignments while still substantially reducing the number
of sequences against which queries need be matched. Further, we store all sequences in a cluster as a set
of edits against the union-sequence. This achieves a form of compression and allows us to retrieve cluster
members for more precise alignment against a query should the union-sequence achieve a good alignment
score. Thus, both space and time are saved with no signiﬁcant loss in accuracy or sensitivity.
Our method supports two modes of operation: users can choose to see all alignments or only the best
alignment from each cluster. In the former mode, the clustering is transparent and the output comparable
to that of searches on the unclustered collection. In the latter mode, the search output is similar to the
result of searching a culled representative database, except that our approach is guaranteed to display the
best alignment from each cluster and is also able to report the number of similar alignments that have been suppressed.
To investigate the effectiveness of our clustering approach we have integrated it with our freely available
open-source software package, FSA-BLAST. When applied to the GenBank non-redundant (NR) database,
our method reduces the size of sequence data in the NR database by 27% and improves search times by
22% with no signiﬁcant effect on accuracy.
2. EXISTING APPROACHES
Reducing redundancy in a sequence database is essentially a two-stage process: ﬁrst, redundancy between
sequences in the database must be identiﬁed; then, the redundancy must be managed in some way. In this
section we describe past approaches to these two stages.
2.1. Redundancy identiﬁcation
The ﬁrst stage of most redundancy management algorithms involves identifying pairs of highly-similar
sequences. An obvious approach to this task is to align each sequence with each other sequence in
the collection using a pairwise alignment scheme such as Smith-Waterman local alignment (Smith and
Waterman, 1981). This is the approach taken by several existing clustering algorithms, including d2_cluster
(Burke et al., 1999), OWL (Bleasby and Wootton, 1990), and KIND (Kallberg and Persson, 1999) and Itoh
et al. (2004). However, this approach is impractical for any collection of signiﬁcant size; each pairwise
comparison is computationally intensive and the number of pairs grows quadratically in the number of sequences.
Several schemes, including CLEANUP (Grillo et al., 1996), NRDB90 (Holm and Sander, 1998), RSDB
(Park et al., 2000), CD-HI (Li et al., 2001a), and CD-HIT (Li et al., 2001b), use a range of BLAST-like
heuristics to quickly identify high-scoring pairwise matches. The CLEANUP (Grillo et al., 1996) algorithm
builds a rich inverted index of short substrings or words in the collection and uses this structure to score
similarity between sequence pairs. NRDB90 (Holm and Sander, 1998) and RSDB (Park et al., 2000) use
in-memory hashtables of decapeptides and pentapeptides for fast identiﬁcation of possible high-scoring
sequence pairs before proceeding with an alignment. CD-HI (Li et al., 2001a) and CD-HIT (Li et al.,
2001b) use lookup arrays of very short subsequences to more efﬁciently identify similar sequences. Such
methods can significantly reduce the per-pair comparison time, but do nothing to alter the O(n²) time
complexity of the algorithms.
Many of the above schemes attempt to reduce the number of pairwise sequence comparisons by using a
greedy incremental clustering approach, in which the similarity detection and the cluster-based redundancy
management approach (see Section 2.2) are combined. In general, this method proceeds as follows. To
begin, the collection sequences are sorted by decreasing order of length. Then, each sequence is considered
in turn and used as a query to search an initially-empty representative database for high-scoring matches.
If a similar sequence is found, the query sequence is discarded; otherwise, it is added to the database as the
representative of a new cluster. When the algorithm terminates, the database consists of the representative
(longest) sequence of each cluster. While this technique can signiﬁcantly reduce the number of pairwise
sequence comparisons that need to be made, it still retains the unfavorable O(n²) time complexity that
afﬂicts all direct pairwise-comparison methods. We show in Section 5 that CD-HIT—the fastest of the
greedy incremental algorithms mentioned and most successful existing approach—scales poorly, with
superlinear complexity in the size of the collection.
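The greedy incremental scheme described above can be sketched in a few lines. The following is an illustrative Python sketch, not the CD-HIT implementation: the `naive_identity` similarity test is a hypothetical placeholder standing in for the BLAST-like heuristics that real tools use.

```python
def naive_identity(a: str, b: str) -> float:
    """Fraction of positions at which the shorter sequence matches a
    same-length prefix of the longer one (placeholder similarity only)."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0.0
    matches = sum(1 for x, y in zip(short, long_) if x == y)
    return matches / len(short)

def greedy_incremental_cluster(sequences, threshold=0.9):
    """Greedy incremental clustering: process sequences in decreasing
    length order; each either joins the cluster of a sufficiently
    similar existing representative or founds a new cluster."""
    representatives = []   # longest member of each cluster
    clusters = {}          # representative -> list of other members
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in representatives:
            if naive_identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            representatives.append(seq)
            clusters[seq] = []
    return clusters

db = ["MKVLAAGICQR", "MKVLAAGICQ", "WWPDTSAQ"]
print(greedy_incremental_cluster(db))
```

Note that every new sequence is still compared against every existing representative, which is the source of the quadratic worst-case behavior discussed above.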
ICAass (Parsons, 1995) and Itoh et al. (2004) reduce the number of pairwise comparisons by partitioning
the collection according to phylogenetic classiﬁcations and clustering only sequences within each partition.
This reduces the number of pairwise comparisons; however, the approach assumes that the database has
been pre-classiﬁed and ignores possible matches between taxonomically distant species. Further, the number
of phylogenetic divisions is growing at a far slower rate than database size. Therefore, a quadratic growth
rate in computation time remains a limitation.
One way to avoid an all-against-all comparison is to pre-process the collection using an index or sufﬁx
structure that can be used to efﬁciently identify high-scoring candidate pairs. Malde et al. (2003) and Gracy
and Argos (1998) investigated the use of sufﬁx structures such as sufﬁx trees (Gusﬁeld, 1997) and sufﬁx
arrays (Manber and Myers, 1993) to identify groupings of similar sequences in linear time. However, sufﬁx
structures also require large main-memories and are not suitable for processing large sequence collections
such as GenBank on desktop workstations. Malde et al. (2003) report results for only a few thousand
EST sequences. The algorithm described by Gracy and Argos (1998) requires several days to process a
collection of around 60,000 sequences. External sufﬁx structures, which record information on disk, are
also unsuitable; they use a large amount of disk space, are extremely slow for searching, or have slow
construction times (Cheung et al., 2005).
2.2. Redundancy management
The common practice in existing work is to use the information on inter-sequence redundancy to build
representative-sequence databases (RSDBs). In this approach, highly similar sequences are grouped
into clusters, or, in the case of greedy incremental clustering, the clusters are built contemporane-
ously with the pairwise similarity comparison. Once a set of clusters has been identified, one sequence
from each cluster is selected as the representative of that cluster to be inserted into the RSDB; the rest are
discarded (Holm and Sander, 1998; Park et al., 2000; Li et al., 2001a, 2001b). The result is a representative
database with fewer sequences and less redundancy. However, purging near-duplicate sequences can sig-
nificantly reduce the quality of results returned by search tools such as BLAST. There is no guarantee that
the representative sequence from a cluster is the sequence that best aligns with a given query. Therefore,
some queries will fail to return matches against a cluster that contains sequences of interest, which reduces
sensitivity. Further, results of a search lack authority because they do not show the best alignment from
each cluster. Also, the existence of highly similar alignments, even if strongly mutually redundant, may
be of interest to a researcher.
Itoh et al. (2004) describe an alternative redundancy-management technique that, in contrast with the
RSDB approach, retains all members of each cluster. This approach calculates an upper bound on the
difference in score between aligning a query to any sequence in a cluster and aligning the same query to
a chosen representative. During search, the query is compared to the representative and the upper bound
is added to the resulting alignment score; if the increased score exceeds the scoring cutoff, all sequences
in that cluster are loaded from an auxiliary database and individually aligned to the query. While this
approach ensures there is no loss in sensitivity, it comes at a substantial cost: unless a high scoring cutoff
is used during search—Itoh et al. use a nominal score cutoff of 150 in their experiments—there will be
numerous false positives, causing search to be slowed. They report experiments using Smith-Waterman
(1981) alignment and it is unclear if their approach would work well if applied to a heuristic search tool
such as BLAST. In addition, all sequences are retained by their method, and so no space is saved.
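The bound-based filtering of Itoh et al. (2004) can be sketched as follows. This is an illustrative reconstruction: the function names and the toy scoring function are ours, and the real method derives each cluster's bound from alignments within the cluster rather than taking it as given.

```python
def search_with_bound(score, clusters, cutoff):
    """Score the query against each cluster representative, add the
    cluster's precomputed upper bound on the score difference, and
    expand the cluster only when the optimistic total reaches the
    cutoff. `clusters` is a list of (representative, members, bound)."""
    hits = []
    for representative, members, bound in clusters:
        if score(representative) + bound >= cutoff:
            # False positives are possible here: every member of a
            # passing cluster must be individually aligned.
            hits.extend((m, score(m)) for m in members)
    return [(m, s) for m, s in hits if s >= cutoff]

# Toy scorer: count positions matching the query "ABCD".
query_score = lambda s: sum(a == b for a, b in zip(s, "ABCD"))
clusters = [("ABCD", ["ABCE", "ABCD"], 1), ("WXYZ", ["WXYA"], 0)]
print(search_with_bound(query_score, clusters, 4))
```

The sketch makes the cost trade-off visible: a loose bound or a low cutoff causes many clusters to pass the optimistic test and be fully expanded.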
3. A FINGERPRINTING APPROACH TO REDUNDANCY IDENTIFICATION
Document ﬁngerprinting (Manber, 1994; Brin et al., 1995; Heintze, 1996; Broder et al., 1997; Shivaku-
mar and García-Molina, 1999) is an effective and scalable technique for identifying pairs of documents
within large text collections that share portions of identical text. Fingerprinting has been used for several
applications, including copyright protection (Brin et al., 1995), document management (Manber, 1994),
and web search optimization (Broder et al., 1997; Fetterly et al., 2003; Bernstein and Zobel, 2005).
In this section we describe how a modiﬁed form of document ﬁngerprinting can be applied to the
task of identifying near-identical sequences in a sequence database. In contrast to the existing approaches
discussed in Section 2.1, ﬁngerprinting allows us to cluster the database rapidly and in linear time, with
only modest main-memory requirements.
3.1. Document ﬁngerprinting
The fundamental unit of document ﬁngerprinting techniques is the chunk, a ﬁxed-length unit of text
such as a series of consecutive words or a sentence. The full set of chunks for a given document is formed
by passing a sliding window of appropriate length over the document; this is illustrated in Figure 1 for a
chunk length of six words.
FIG. 1. Set of chunks of six words in length for a document containing the text “the quick brown fox jumped over
the lazy dog.”
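A sliding-window chunk extractor of this kind takes only a few lines. The sketch below (Python, names ours) reproduces the Figure 1 example for a chunk length of six words.

```python
def chunks(words, length):
    """All overlapping chunks of `length` words, formed by sliding a
    window of that length over the document one word at a time."""
    return [tuple(words[i:i + length]) for i in range(len(words) - length + 1)]

doc = "the quick brown fox jumped over the lazy dog".split()
for c in chunks(doc, 6):
    print(" ".join(c))
```

A nine-word document with a six-word window yields four overlapping chunks.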
The set of all chunks in a collection can be stored in an inverted index (Witten et al., 1999), and the
index can be used to calculate the number of shared chunks between pairs of documents in a collection.
Two identical documents will naturally have an identical set of chunks. As the documents begin to diverge,
the proportion of chunks they share will gradually degrade. Thus, the number of common chunks is a
good estimator of the amount of common text shared by a pair of documents. The quality of this estimate
is optimized by choosing a chunk length that is long enough so that two identical chunks are unlikely
to coincidentally occur, but not so long that the system becomes too sensitive to minor changes. In the
duplicate detection Deco package, the default chunk length is eight words (Bernstein and Zobel, 2004).
In theory, one could simply store the full set of chunks for each document in a collection, and directly
compute the degree of text reuse between documents as above. However, such an approach is highly
inefﬁcient. Thus, some sort of selection heuristic is normally applied so that only a subset of chunks from
each document are selected for storage. The choice of selection heuristic has a very signiﬁcant impact on
the general effectiveness of the ﬁngerprinting algorithm. Most ﬁngerprinting algorithms have used simple
feature-based selection heuristics, such as selecting chunks only if their hash is divisible by a certain
number, or selecting chunks that begin with certain letter-combinations. For an overview of existing chunk
selection methods, we refer the reader to Hoad and Zobel (2003) or Bernstein and Zobel (2004).
A major drawback of the chunk selection schemes of the type described above is that they are lossy:
if two documents share chunks, but none of them happen to satisfy the criteria of the selection heuristic,
the ﬁngerprinting algorithm will not identify these documents as sharing text. For example, consider the
processing of two documents with a chunk length of one word and a chunk selection heuristic that only
selects words starting with a vowel. Given a pair of highly similar documents with the following text:
Document 1 The yellow cat ran between the tall trees.
Document 2 The brown cat ran between the tall towers.
Not a single chunk would be selected from either of these two documents because all of the words
contained in them start with a consonant. As a result, the similarity between these two documents would
be overlooked. The more selective the heuristic, the more likely such incidents are to occur.
3.2. The SPEX chunk selection algorithm
Bernstein and Zobel (2004) introduced the SPEX chunk selection algorithm, which allows for lossless
selection of chunks. This is possible because SPEX discards only singleton chunks (chunks that occur just
once, and which form a large majority in most text collections); such chunks do not contribute to the identification of
text reuse between documents. The SPEX algorithm is based on the following observation: if any subchunk
(subsequence) of a chunk is unique, the chunk as a whole is unique. For example, an occurrence of the
chunk two brown cats must be unique in the collection if the subchunk brown cats is also unique. Using a
memory-efﬁcient iterative hashing technique, SPEX is able to select only those chunks that occur multiple
times in the collection. Using SPEX can yield significant savings over selecting every chunk without any
degradation in the quality of results (Bernstein and Zobel, 2004, 2005).
The SPEX algorithm works as follows. In the first iteration, each chunk that is one word in length in the
document is processed. A hashtable records for each chunk whether it appears multiple times, just once,
or not at all. In the second iteration, the entire document is processed again, this time considering chunks
that are two words in length. Each two-word chunk is only inserted into the hashtable if both one-word
subchunks appear multiple times in the collection. For example, the two-word chunk brown cats is only
processed if both brown and cats are non-unique. A second hashtable records whether each two-word
chunk appears multiple times, just once, or not at all. The process is then repeated for increasing chunk
lengths until the ﬁnal desired length is reached.
Figure 2 provides a pseudocode sketch of how SPEX identifies duplicate chunks of length finalLength
within a collection of documents S. The algorithm iterates over chunk lengths from 1 to finalLength,
the final chunk length desired. At each iteration, SPEX maintains two hashtables (referred to as lookup
in the figure): one recording the number of occurrences of each chunk for the previous iteration, and one
for the current iteration. As we are only interested in knowing whether a chunk occurs multiple times or
not, each entry in lookup takes one of only three values: zero, one, or more than one (2+). This allows
five hashtable entries to be fit into a single byte, significantly reducing the size of the table. A chunk is
FIG. 2. The SPEX algorithm.
only inserted into lookup if its two subchunks of length chunkLength-1 both appear multiple times in
the hashtable from the previous iteration.
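The packed three-state table can be sketched as follows. Because each entry takes one of three values, five entries fit in one byte (3⁵ = 243 ≤ 256). This is our illustrative reconstruction, not the published implementation; as in the original, hash collisions are simply not resolved, so distinct chunks may share a slot.

```python
class PackedCounter:
    """Three-state chunk counter (0, 1, '2+') packing five base-3
    entries per byte, in the style of the SPEX lookup tables."""

    def __init__(self, slots):
        self.slots = slots
        self.data = bytearray((slots + 4) // 5)

    def _locate(self, index):
        byte, pos = divmod(index, 5)
        return byte, 3 ** pos        # weight of this entry within the byte

    def get(self, index):
        byte, weight = self._locate(index)
        return (self.data[byte] // weight) % 3

    def increment(self, index):
        byte, weight = self._locate(index)
        if self.get(index) < 2:      # saturate at 2, meaning "2+"
            self.data[byte] += weight

counter = PackedCounter(1000)
for _ in range(5):
    counter.increment(42)
print(counter.get(42))  # saturates at 2 ("2+")
```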
Figure 3 illustrates how the SPEX algorithm works when applied to the pair of documents shown at
the top of the ﬁgure. First, one-word chunks are counted as shown in the far left column. The words
yellow, brown, trees, and towers appear only once in the collection and the remaining words appear
at least twice. In the second iteration, two-word chunks are extracted from the collection and inserted into
the hashtable. However, chunks such as The yellow that contain at least one unique subchunk from the
previous iteration are not processed, nor are they inserted into the hashtable, as indicated by the letters NP.
As a result, fewer entries are made into the hashtable and the number of collisions is minimized. Finally,
three-word chunks are extracted in the third iteration, and any chunk that contains a unique subchunk of
length two is dismissed. The end result after three iterations is that all three-word chunks that are shared
between the two documents (that is, cat ran between,ran between the, and between the tall) are
Although collisions caused by different chunks hashing to the same value are not resolved, Bernstein and
Zobel (2004) report that this has minimal impact on the performance of the algorithm. Collisions introduce
false positives to the process, that is, cause unique words to be considered as non-unique. However, the
iterative process tends to remove such false positives and helps prevent the hashtables from being ﬂooded.
As a result, the SPEX algorithm is able to rapidly process large text collections while consuming a relatively
modest amount of memory (Bernstein and Zobel, 2004).
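Putting the pieces together, the iterative subchunk filter can be sketched as below. This illustrative version uses exact Python dictionaries in place of the memory-efficient packed hashtables of the real algorithm, so it exhibits no collisions or false positives; it reproduces the result of the worked example above.

```python
def spex(documents, final_length):
    """Sketch of SPEX (Bernstein and Zobel, 2004): grow the chunk length
    one word at a time, keeping a chunk only if both of its
    (length - 1)-word subchunks occurred more than once in the
    previous pass. Counts saturate at 2, meaning '2+'."""
    prev = None
    for length in range(1, final_length + 1):
        counts = {}
        for doc in documents:
            words = doc.split()
            for i in range(len(words) - length + 1):
                chunk = tuple(words[i:i + length])
                if prev is not None:
                    # both overlapping subchunks must be non-unique
                    if prev.get(chunk[:-1], 0) < 2 or prev.get(chunk[1:], 0) < 2:
                        continue
                counts[chunk] = min(counts.get(chunk, 0) + 1, 2)
        prev = counts
    return {c for c, n in prev.items() if n >= 2}

docs = ["The yellow cat ran between the tall trees",
        "The brown cat ran between the tall towers"]
print(spex(docs, 3))
```

Running this on the two example documents yields exactly the three shared three-word chunks identified in Figure 3.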
3.3. Fingerprinting for genomic sequences
The SPEX algorithm (and, indeed, any fingerprinting algorithm) can be trivially adapted for use with
genomic sequences by simply substituting documents with sequences. However, the properties of a genomic
sequence are quite different from those of a natural language document and these differences present a
number of practical challenges to the SPEX algorithm.
The most signiﬁcant difference is the lack of any unit in genomic data analogous to natural language
words, given that a genomic sequence is represented by an undifferentiated string of characters with no
natural delimiters such as whitespace, commas or other punctuation marks. The lack of words in genomic
sequences has a number of immediate impacts on the operation and performance of the SPEX algorithm.
FIG. 3. Three iterations of the SPEX algorithm. The pair of documents at the top are processed to identify duplicate
chunks of one, two, and three words in length. For each chunk, 1 denotes that it appears once only in the collection,
2+ denotes that it appears multiple times, and NP indicates that the chunk was not processed because it contains a
unique subchunk.
First, the granularity of the sliding window must be increased from word-level to character-level. An
increased granularity results in the existence of a far greater number of chunks in a genomic sequence
than in a natural-language document of similar size. This increase becomes apparent when one considers
that—assuming an average word length of five—the SPEX sliding window increments over a total of
six bytes of a natural-language document at each increment. By contrast, the base-pairs in nucleotide
sequence data each represent only two bits, or 0.25 of a byte, of data. Thus, a nucleotide sequence can be
expected to produce roughly 24 times as many chunks as a natural language document of the same size.
As a result, the SPEX algorithm is less efficient and scalable for genomic data than for natural language text.
Processing genomic sequences also involves performing more iterations as part of the SPEX algorithm.
To identify chunks that are eight words in length SPEX must perform eight iterations; for a similar-length
chunk containing 30 characters of sequence data, SPEX would need to perform 30 iterations. This obviously
slows down the algorithm signiﬁcantly.
The distribution of subsequences within genomic data is also less skewed than the distribution of words
in natural-language text. Given a collection of natural language documents, we expect some words (such
as “and” and “or”) to occur extremely frequently, while other words (such as perhaps “alphamegamia” and
“nudiustertian”) will be hapax legomena: words that occur only once. This permits the SPEX algorithm
to be effectual from the ﬁrst iteration by removing word-pairs such as “nudiustertian news.” In contrast,
given a short string of characters using the amino acid alphabet of size 20, we expect that it is far less
likely that the word will occur only once in any collection of nontrivial size. Thus, the ﬁrst few iterations
of SPEX are likely to be entirely ineffectual.
One simple solution to these problems is to introduce “pseudo-words,” effectively segmenting each
sequence by moving the sliding window several characters at a time. However, this approach relies on
sequences being aligned along segment boundaries. This assumption is not generally valid and makes the
algorithm highly sensitive to insertions and deletions. Consider, for example, the following sequences given
a chunk length of four and a window increment of four:
Sequence 1 ABCDEFGHIJKLMNOP → ABCD EFGH IJKL MNOP
Sequence 2 AABCDEFGHIJKLMNOP → AABC DEFG HIJK LMNO
Sequence 3 GHAACDEFGHIJKLMQ → GHAA CDEF GHIJ KLMQ
Despite all three of these sequences containing an identical subsequence of length 11 (CDEFGHIJKLM), they
do not share a single common chunk. This strong correspondence between the three sequences will thus
be overlooked by the algorithm.
We propose a hybrid of regular SPEX and the pseudo-word based approach described above that we
call slotted SPEX. Slotted SPEX uses a window increment greater than one but is able to “synchronize”
the windows between sequences so that two highly similar sequences are not entirely overlooked as a
result of a misalignment between them. Although slotted SPEX is technically a lossy algorithm, it offers a
performance guarantee that we shall discuss later in this section.
Figure 4 describes the slotted SPEX algorithm. As in standard SPEX, we pass a fixed-size window over
each sequence with an increment of one. However, unlike SPEX, slotted SPEX does not consider inserting
every chunk into the hashtable. In addition to decomposing the chunk into subchunks and checking that the
FIG. 4. The slotted SPEX algorithm.
subchunks are non-unique, slotted SPEX also requires that one of two initial conditions be met. First, that
it has been at least Q window increments since the last insertion; or second, that the current chunk already
appears in the hashcounter. The parameter Q is the quantum, which can be thought of as the window
increment used by the algorithm. Slotted SPEX guarantees that at least every Qth overlapping substring
from a sequence is inserted into the hashtable. The second precondition—that the chunk already appears
in the hashcounter—provides the synchronization that is required for the algorithm to work reliably.
The operation of slotted SPEX is best illustrated with an example. Using the same set of sequences as
above, a quantum Q = 4 and a chunk length of four, slotted SPEX produces the following set of chunks:
Sequence 1 ABCDEFGHIJKLMNOP → ABCD EFGH IJKL MNOP
Sequence 2 AABCDEFGHIJKLMNOP → AABC ABCD EFGH IJKL MNOP
Sequence 3 GHAACDEFGHIJKLMQ → GHAA CDEF EFGH IJKL
For the ﬁrst sequence, the set of chunks produced does not differ from the naive pseudo-word technique.
Let us now follow the process for the second sequence. The ﬁrst chunk—AABC—is inserted as before.
When processing the second chunk, ABCD, the number of chunks processed since the last insertion is one,
fewer than the quantum Q. However, the condition lookup[chunk] ≠ 0 on line 5 of Figure 4 is met: the
chunk has been previously inserted. The hashcounter is therefore incremented, effectively synchronizing
the window of the sequence with that of the earlier, matching sequence. As a result, every Qth identical
chunk will be identified across the matching region between the two sequences. In this example, the slotted
SPEX algorithm selects two chunks of length four that are common to all sequences.
Unlike the original SPEX algorithm, which incremented the chunk length by one for each iteration, the
slotted SPEX algorithm uses an increment equal to the quantum Q. In the slotted SPEX approach, a match
of length chunkLength + Q between two sequences is guaranteed to contain at least two matches of length
chunkLength identified during the previous iteration. As a result, slotted SPEX must increase the chunk
length by at least the quantum Q between iterations to ensure the scheme is lossless.
In comparison to the ordinary SPEX algorithm, slotted SPEX requires fewer iterations, consumes less
memory and builds smaller indexes. This makes it suitable for the higher chunk density of genomic
data. While slotted SPEX is a lossy algorithm, it does offer the following guarantee: for a window
size finalLength and a quantum Q, any pair of sequences with a matching subsequence of length
finalLength + Q - 1 or greater will have at least one identical chunk selected. As the length of the
match grows, so will the guaranteed number of common chunks selected. Thus, despite the lossiness of
the algorithm, slotted SPEX is still able to offer strong assurance that it will reliably detect highly similar
pairs of sequences.
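The slotted selection rule (insert after Q window positions, or immediately on re-encountering a known chunk) can be sketched for a single pass as below. This is an illustrative reconstruction that omits the subchunk filtering and iterative length growth of the full algorithm; it reproduces the chunk sets of the worked example above.

```python
def slotted_select(sequences, chunk_len, q):
    """One pass of slotted-SPEX-style chunk selection. A chunk is
    recorded if at least `q` window positions have elapsed since the
    last insertion for this sequence, or if the chunk was recorded
    earlier; the latter rule re-synchronizes the window across
    near-identical sequences."""
    seen = {}
    selected = []
    for seq in sequences:
        picked = []
        since_last = q                       # force insertion at position 0
        for i in range(len(seq) - chunk_len + 1):
            chunk = seq[i:i + chunk_len]
            if since_last >= q or chunk in seen:
                seen[chunk] = seen.get(chunk, 0) + 1
                picked.append(chunk)
                since_last = 0
            since_last += 1
        selected.append(picked)
    return selected

seqs = ["ABCDEFGHIJKLMNOP", "AABCDEFGHIJKLMNOP", "GHAACDEFGHIJKLMQ"]
for s, cs in zip(seqs, slotted_select(seqs, 4, 4)):
    print(s, cs)
```

For the three example sequences with Q = 4, the synchronization rule causes EFGH and IJKL to be selected from all three, exactly as in the example.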
4. REDUNDANCY MANAGEMENT USING WILDCARDS
In this section, we describe our novel approach to the management of redundancy in genomic sequence
databases. Rather than choosing a single representative sequence from each cluster, we use a set of wildcard
characters to create a single union-sequence that is simultaneously representative of all the sequences in
the cluster. A small amount of auxiliary data stored with each cluster allows for the original sequences to
During search, the query sequence is aligned with to the union-sequence of each cluster; for those
union-sequences that produce a statistically signiﬁcant alignment, the members of the cluster are restored
from their compressed representations and aligned to the query. This ensures that precise alignment scores
are calculated. Our approach supports two modes of operation: users can choose to see all high-scoring
alignments, or only the best alignment from each cluster. The latter mode reduces redundancy in the results.
The effectiveness of our technique depends upon the careful construction of clusters, the availability of a
sensitive but efﬁcient way of scoring the query sequence against the union-sequence, and the selection of a
good set of wildcards. We describe our approach to these matters in Sections 4.2, 4.3, and 4.4, respectively.
4.1. Cluster representation
Let us define E = {e_1, ..., e_n} as the set of sequences in a collection, where each sequence is a string
of residues e_i = r_1, ..., r_n with r ∈ R. Our approach represents the collection as a set of clusters C,
where each cluster contains a union-sequence U and edit information for each member of the cluster.
The union-sequence is a string of residues and wildcards U = {u_1, ..., u_n | u_i ∈ R ∪ W}, where
W = {w_1, ..., w_n | w_i ⊆ R} is the set of available wildcards. Each wildcard represents a set of residues
and is able to act as a substitute for any of these residues. By convention, w_n is assumed to be the default
wildcard w_d that can represent any residue; that is, w_n = R.
Figure 5 shows an example cluster constructed using our approach. The union-sequence is shown at
the top and the cluster members are aligned below. Columns where the member sequences differ from one
another and a wildcard has been inserted are shown in bold face. In this example, W = {w_d} (i.e., only
the default wildcard is used, and it is represented by an asterisk).
When a cluster is written to disk, the union-sequence—shown at the top of Figure 5—is stored in its
complete form, and each member of the cluster is recorded using edit information. The edit information for
each member sequence includes start and end offsets that specify a range within the union-sequence, and a
set of residues that replace the wildcards in that range. For example, the ﬁrst member of the cluster with GI
accession 156103 would be represented by the tuple (8,44,PI); the member sequence can be reconstructed
by copying the substring between positions 8 and 44 of the union-sequence and replacing the wildcards at
union-sequence positions 8 and 40 with characters P and I, respectively. Note that our clustering approach
does not permit gaps; this is because insertions and deletions are heavily penalized during alignment and
any scheme that includes gaps in clusters is likely to reduce search accuracy.
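The edit-tuple representation can be sketched as follows. The union-sequence and member tuples below are invented for illustration (they are not the Figure 5 sequences), with '*' standing for the default wildcard:

```python
# Sketch of cluster-member reconstruction from edit information: each member
# is stored as (start, end, replacements), where start/end are 1-based
# inclusive offsets into the union-sequence and replacements supplies, in
# order, the residue for each wildcard falling inside that range.

def reconstruct(union_seq, start, end, replacements):
    window = union_seq[start - 1:end]
    out, repl = [], iter(replacements)
    for ch in window:
        # Substitute the next recorded residue at each wildcard position.
        out.append(next(repl) if ch == '*' else ch)
    return ''.join(out)

union = "MDLTK*NQRS*AE"                   # hypothetical union-sequence
print(reconstruct(union, 1, 13, "PI"))    # a full-length member
print(reconstruct(union, 4, 11, "GV"))    # a partial (fragment) member
```

Note that a fragment member needs only the replacements for wildcards inside its own range, which is why the start and end offsets are stored alongside the substituted residues.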
4.2. Clustering algorithm
In this section, we describe our approach to efficiently clustering large sequence collections. It is based
on the slotted SPEX fingerprinting algorithm described in Section 3.3, and has linear-time performance and
low main-memory overheads.
The fingerprinting process identifies chunks that occur in the collection more than once. In the context of
sequence data we use subsequences, or words, of length W as our chunks. For each word, the slotted SPEX
algorithm outputs a postings list of the sequences that contain the word and the offset into each sequence
at which the word occurs. Our clustering algorithm uses these lists to identify candidate pairs: pairs of
sequences that share at least one chunk.

FIG. 5. Example cluster of heat shock proteins from the GenBank NR database. The union-sequence is shown at the
top, followed by ten member sequences with GI accession numbers shown in brackets.
Given the list of candidate pairs, we use a variation on single-linkage hierarchical clustering (Johnson,
1967) to build clusters, as follows. Initially, each sequence is considered to constitute a cluster with
one member. Candidate pairs are then processed in decreasing order of similarity score, from most to least
similar, and the pair of clusters containing the two candidate sequences is potentially merged.
In general, given two candidate clusters C_X and C_Y with union-sequences X and Y, respectively, the
following process is used to determine whether the clusters should be merged:
1. X and Y are aligned and the sequence space is partitioned into a prefix, an overlap region, and a suffix.
2. The union-sequence candidate U for the new cluster is created by replacing each mismatched residue
in the overlap region with a suitable wildcard w.
3. The union-sequence candidate U is accepted if the mean alignment score increase Q̄ in the overlap
region is below a specified threshold T; this prevents union-sequences from containing too many
wildcards and reducing search performance.
If the clusters are merged, a new cluster C_U is created consisting of all members of C_X and C_Y. This
process is repeated for all candidate pairs. When inserting wildcards into the union-sequence, if more
than one wildcard is suitable then the one with the lowest expected match score e(w) = Σ_{r∈R} s(w, r) p(r)
is selected, where p(r) is the background probability of residue r (Robinson and Robinson, 1991) and
s(w, r) is the alignment score for matching wildcard w to residue r. Calculation of the wildcard alignment
vectors s(w, ·) is discussed in Section 4.3, and the selection of the pool of wildcards W is discussed in
Section 4.4.
The alignment score increase Q for a wildcard w inserted in place of a residue r_1 is calculated as

Q(w, r_1) = Σ_{r∈R} s(w, r) p(r) − Σ_{r∈R} s(r_1, r) p(r)

where s(r_1, r_2) is the score for matching a pair of residues as defined by a scoring matrix such as
BLOSUM62 (Henikoff and Henikoff, 1992). This value estimates the typical increase in alignment score
one can expect against arbitrary query residues by aligning against w instead of against the actual residue
at that position.
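The wildcard-choice rule and the score increase Q can be made concrete on a toy alphabet. The four-letter alphabet, substitution scores, background probabilities, and wildcard pool below are all invented, and the vector s(w, ·) is approximated by the optimistic maximum of Section 4.3 purely to keep the sketch self-contained:

```python
# Toy illustration of wildcard choice during a merge: the expected match
# score e(w) = sum_r s(w,r) p(r), and the score increase Q relative to the
# residue being replaced. All values here are invented for the example.

R = "ACDE"
p = {"A": 0.4, "C": 0.1, "D": 0.3, "E": 0.2}      # background p(r)
s = {  # symmetric toy substitution scores s(r1, r2)
    ("A","A"): 4, ("A","C"): 0, ("A","D"): -2, ("A","E"): -1,
    ("C","C"): 9, ("C","D"): -3, ("C","E"): -4,
    ("D","D"): 6, ("D","E"): 2, ("E","E"): 5,
}
def score(r1, r2):
    return s.get((r1, r2), s.get((r2, r1)))

def s_w(w, r):                    # stand-in wildcard vector (optimistic max)
    return max(score(f, r) for f in w)

def e(w):                         # expected match score of wildcard w
    return sum(s_w(w, r) * p[r] for r in R)

def Q(w, r1):                     # expected score increase over residue r1
    return e(w) - sum(score(r1, r) * p[r] for r in R)

# Among wildcards able to represent the mismatched residues {A, D}, choose
# the one with the lowest expected match score:
pool = [frozenset("AD"), frozenset("ADE"), frozenset("ACDE")]  # last = default
suitable = [w for w in pool if {"A", "D"} <= w]
best = min(suitable, key=e)
print(sorted(best), round(e(best), 2), round(Q(best, "A"), 2))
```

As expected, the smallest suitable residue set wins: larger sets carry higher expected match scores and would produce more false-positive union-sequence hits.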
Figure 6 illustrates the process of merging clusters. In this example, cluster X (which contains one
sequence) and cluster Y (which contains two sequences) are merged. A new cluster containing the members
of both X and Y is created, with a new union-sequence that contains wildcards at residue positions where
the three sequences differ.
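A gapless merge of two union-sequences can be sketched as follows, under the assumption that step 1 has already produced the relative offset of Y within X; only the default wildcard '*' is used, and the example strings are hypothetical:

```python
# Sketch of a gapless union-sequence merge: the overlap region keeps
# matching characters and replaces each mismatch with the default wildcard
# '*'; the prefix and suffix are carried over unchanged from whichever
# sequence extends past the other.

def merge(x, y, offset):
    """Merge y into x, with y starting `offset` characters into x."""
    prefix = x[:offset]
    overlap = ''.join(cx if cx == cy else '*'
                      for cx, cy in zip(x[offset:], y))
    # Suffix comes from x if x extends past y, otherwise from y.
    suffix = x[offset + len(y):] or y[len(x) - offset:]
    return prefix + overlap + suffix

print(merge("MDLTKQNQRS", "KQNQRAAE", 4))
```

A real implementation would additionally evaluate the mean score increase Q̄ over the overlap region and reject the merge when Q̄ ≥ T.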
FIG. 6. Merge of two example clusters. Cluster X contains a single sequence and cluster Y contains two sequences.
A new cluster is created that contains the members of both clusters and has a new union-sequence to represent all three
sequences.
FIG. 7. Illustration of top-down clustering where the sequences l = {S_1, S_2, S_3, S_4, S_5} contain the chunk RTMCS. Each
sequence is compared to every other sequence in the list and the sequence with the highest average percentage identity
(S_1) is selected as the first member of a new cluster. Sequences S_2 and S_4 are highly similar to S_1 and are also
included in the new cluster. The remaining sequences l = {S_3, S_5} are used to perform another iteration of top-down
clustering if |l| ≥ M.
The above approach works extremely well for relatively small databases; however, some words will
occur quite frequently in a typical database, resulting in a small number of long postings lists in larger
collections. These in turn consume a disproportionate amount of execution time. We process these long
postings lists—those with more than M entries, where we use M = 100 by default—in a different,
top-down manner before proceeding to the standard hierarchical clustering approach discussed above.
The top-down approach identifies clusters from a list of sequences l that contain a frequently-occurring
chunk w as follows:
1. All sequences in l are loaded into main memory and aligned with each other.
2. An exemplar sequence is selected; this is the sequence with the highest average percentage identity to
the other sequences in l.
3. A new cluster C is created with the exemplar sequence as its first member.
4. Each sequence in l is compared to the union-sequence of the new cluster. Sequences for which Q̄ < T are
added to the cluster in order from most to least similar using the approach we describe above.
5. All of the members of the new cluster C are removed from l and the process is repeated from step 1
until |l| < M.
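The five steps above can be sketched as follows. The percentage-identity function and the `similar` predicate standing in for the Q̄ < T merge test are simplifications (gapless, position-by-position comparison), and M and the 0.75 threshold are toy values chosen so the example forms two clusters:

```python
# Sketch of the top-down pass for a long postings list. identity() is a
# gapless stand-in for the paper's alignment, and `similar` stands in for
# the Q-bar < T test.

def identity(a, b):
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n

def top_down(l, M, similar):
    clusters = []
    while len(l) >= M:
        # Step 2: exemplar = highest average identity to the rest of l.
        exemplar = max(l, key=lambda s: sum(identity(s, t)
                                            for t in l if t is not s))
        # Steps 3-4: seed the cluster, then add similar sequences,
        # most similar first.
        members = [exemplar] + sorted(
            (s for s in l if s is not exemplar and similar(exemplar, s)),
            key=lambda s: identity(exemplar, s), reverse=True)
        clusters.append(members)
        # Step 5: remove the new cluster's members and repeat.
        l = [s for s in l if s not in members]
    return clusters, l        # remainder goes to hierarchical clustering

seqs = ["AAAAA", "AAAAC", "GGGGG", "AAACA", "GGGGT"]
clusters, rest = top_down(seqs, M=2,
                          similar=lambda a, b: identity(a, b) >= 0.75)
print(clusters, rest)
```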
The top-down clustering is illustrated in Figure 7. In this example, a list of five sequences that contain the
word RTMCS is processed using the top-down method. The sequence S_1 has the highest average percentage
identity to the other sequences in l and is selected as the exemplar. A new cluster is created with S_1 as
the first member, and sequences S_2 and S_4 are subsequently added. The three members of the new cluster
are removed from l, and the process is repeated until |l| < M.
Once the long postings lists have been processed by the top-down method, the shortened lists are processed
using the hierarchical clustering method described above. While the top-down process is laborious, it is
performed for fewer than 0.2% of postings lists when clustering the August 2005 release of the GenBank
non-redundant database with default parameters.
4.3. Scoring with wildcards
We have modified BLAST to work with our clustering algorithm as follows. Instead of comparing the
query sequence to each member of the database, our approach compares the query only to the union-
sequence representing each cluster, where the union-sequence may contain wildcard characters. If a high-
scoring alignment between the union-sequence and query is identified, the members of the cluster are
reconstructed and aligned with the query. In this section we discuss how, given a set of wildcards W, we
determine the scoring vectors s(w_i, ·) for each w_i ∈ W that are used during search.
Ideally, we would like the score between a query sequence Q and a union-sequence U to be precisely
the highest score that would result from aligning Q against any of the sequences in cluster C_U. This
would result in no loss in sensitivity as well as no false positives. Unfortunately, such a scoring scheme is
not likely to be achievable without aligning against each sequence in every cluster, defeating much of the
purpose of clustering in the first place.
To maintain the speed of our approach, scoring of wildcards against residues must be on the basis of a
standard scoring vector s(w, ·) and cannot take into consideration any data about the sequences represented
by the cluster. Thus, scoring will involve a compromise between sensitivity (few false negatives) and speed
(few false positives). We describe two such compromises below, and ﬁnally show how to combine them
to achieve a good balance of sensitivity and speed.
During clustering, wildcards are inserted into the union-sequence to denote residue positions where
the cluster members differ. Let us define S = s_1, ..., s_x with s_i ∈ W, where S is the ordered sequence
of x wildcards substituted into union-sequences during clustering of a collection. Each occurrence of a
wildcard is used to represent a set of residues that appear in its position in the members of the cluster.
We define o ⊆ R as the set of residues represented by an occurrence of a wildcard in the collection and
O = o_1, ..., o_x with o_i ⊆ R as the ordered sequence of substituted residue sets. The kth wildcard s_k that is
used to represent the set of residues o_k must be chosen such that o_k ⊆ s_k.
Our first scoring scheme, s_exp, builds the scoring vector for each wildcard by considering the actual
occurrence pattern of residues represented by that wildcard in the collection. Formally, we calculate the
expected best score s_exp as:

s_exp(w_i, r) = (1 / |P_i|) Σ_{j∈P_i} max_{f∈o_j} s(r, f)

where P_i is the set of ordinal numbers of all substitutions using the wildcard w_i:

P_i = {j | j ∈ N, j ≤ x, s_j = w_i}

This score can be interpreted as the mean score that would result from aligning residue r against the
actual residues represented by the wildcard w. This score has the potential to reduce search accuracy;
however, it distributes the scores well, and provides an excellent tradeoff between accuracy and speed.
The second scoring scheme, s_opt, calculates the optimistic alignment score of the wildcard w against each
residue. The optimistic score is the highest score for aligning residue r to any of the residues represented
by wildcard w. This is calculated as follows:

s_opt(w, r) = max_{f∈w} s(r, f)
The optimistic score guarantees no loss in sensitivity: the score for aligning against a union-sequence
U using this scoring scheme is at least as high as the score for any of the sequences represented by U.
The problem is that in many cases the score for U is significantly higher, leading to false positives where
the union-sequence is flagged as a match despite none of the cluster members being sufficiently close to
the query. This results in substantially slower search.
The expected and optimistic scoring schemes represent two different compromises between sensitivity
and speed. We can adjust this balance by combining the two approaches using a mixture model. We define
a mixture parameter λ such that 0 ≤ λ ≤ 1. The mixture-model score for aligning wildcard w to residue r
is defined as:

s(w, r) = λ · s_opt(w, r) + (1 − λ) · s_exp(w, r)
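The three vectors can be computed side by side for a toy wildcard. The substitution scores and the recorded occurrence sets o_j below are invented for the example; the mixture parameter is set to the paper's default of 0.2:

```python
# Sketch of the optimistic, expected, and mixture-model scoring vectors for
# a toy wildcard w = {A, D}. Scores and occurrence sets are invented.

LAMBDA = 0.2                                      # default mixture parameter
occurrences = [{"A"}, {"A", "D"}, {"D"}, {"A"}]   # recorded o_j for w

s = {("A","A"): 4, ("A","D"): -2, ("D","D"): 6,
     ("A","C"): 0, ("C","D"): -3, ("C","C"): 9}
def score(r1, r2):
    return s.get((r1, r2), s.get((r2, r1)))

def s_opt(w, r):                # optimistic: best residue the wildcard hides
    return max(score(f, r) for f in w)

def s_exp(occ, r):              # expected best score over actual occurrences
    return sum(max(score(f, r) for f in o) for o in occ) / len(occ)

def s_mix(w, occ, r, lam=LAMBDA):
    return lam * s_opt(w, r) + (1 - lam) * s_exp(occ, r)

w = {"A", "D"}
for r in "ACD":
    print(r, s_opt(w, r), s_exp(occurrences, r),
          round(s_mix(w, occurrences, r), 2))
```

The mixture vector sits between the two extremes: close to s_exp (good speed) while retaining a fraction of the optimistic score (good sensitivity).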
The score s(w, r) for each (w, r) pair is calculated when the collection is being clustered and then
recorded on disk in association with that collection. During a BLAST search, the wildcard scoring vectors
are loaded from disk and used to perform the search. We report experiments with varying values of λ in
Section 5. An example set of scoring vectors s(w_i, ·) derived using our approach is shown in Figure 8.
FIG. 8. Scoring vectors for the wildcards from Table 1. The set of residues represented by each wildcard is given
in the left-hand column. The scoring vector provides an alignment score between each of the twenty-four amino acid
symbols and that wildcard.
4.4. Wildcard selection
Having defined a system for assigning a scoring vector to an arbitrary wildcard, we now describe
a method for selecting a set of wildcards to be used during the clustering process. Each wildcard w
represents a set of residues w ⊆ R and can only be used in place of the residues it represents when
inserted into a union-sequence. Our wildcard scoring scheme described in Section 4.3 is dependent on the
set of residues represented by w, so that each wildcard has a unique scoring vector. A set of wildcards
W = {w_1, ..., w_n} is used during clustering. We assume that one of these wildcards, w_n, is the default
wildcard that can be used to represent any of the 24 residue and ambiguous codes; that is, w_n = R.
The remaining wildcards must be selected carefully; large residue sets can be used more frequently but
provide poor discrimination, with higher average alignment scores and more false positives. Similarly, small
residue sets can be used less frequently, thereby increasing the use of larger residue sets such as the default
wildcard.
The first aspect of choosing a set of wildcards to use for substitution is to decide on the size of this set. It
would be ideal to use as many wildcards as necessary, so that for each substitution s_i = o_i. However, each
wildcard must be encoded as a different character and this approach would lead to a very large alphabet.
An enlarged alphabet would in turn lead to inefficiencies in BLAST due to larger lookup and scoring data
structures. Thus, a compromise is required. BLAST uses a set of 20 character codes to represent residues,
as well as 4 IUPAC-IUBMB ambiguous residue codes and an end-of-sequence sentinel code, resulting in a
total of 25 distinct codes. Each code is represented using 5 bits, permitting a total of 32 codes; this leaves
7 unused character codes. We have therefore chosen to use |W| = 7 wildcards.
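The code-budget arithmetic is simply:

```python
# 20 residue codes + 4 IUPAC-IUBMB ambiguity codes + 1 sentinel occupy 25 of
# the 2^5 = 32 values available in a 5-bit encoding, leaving 7 spare codes
# that can be assigned to wildcard characters.
codes_in_use = 20 + 4 + 1
spare_codes = 2 ** 5 - codes_in_use
print(spare_codes)
```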
We have investigated two different approaches to selecting a good set of wildcards. The first approach
treats the problem as an optimization scenario, and works as follows. We first cluster the collection as
described in Section 4.2 using only the default wildcard, i.e., W = {w_d}. We use the residue-substitution
sequence O from this clustering to create a set of candidate wildcards. Our goal can then be defined
as follows: we wish to select from among the candidates the set of wildcards W such that the total average
alignment score A = Σ_{w∈S} Σ_{r∈R} s(w, r) p(r) over all substitutions S is minimized. A lower A implies a reduction in the
number of high-scoring matches between a typical query sequence and the union-sequences in the collection,
thereby reducing the number of false positives in which cluster members are fruitlessly recreated and
aligned to the query.
TABLE 1. TWO DIFFERENT SETS OF WILDCARDS TO BE USED FOR CLUSTERING

Minimum alignment score      Physico-chemical classifications
L,V,I,F,M                    L,V,I                          Aliphatic
G,E,K,R,Q,H                  F,Y,H,W                        Aromatic
A,V,T,I,X                    E,K,D,R,H                      Charged
S,E,T,K,D,N                  L,A,G,V,K,I,F,Y,M,H,C,W        Hydrophobic
L,V,T,P,R,F,Y,M,H,C,W        S,E,T,K,D,R,N,Q,Y,H,C,W        Polar
A,G,S,D,P,H                  A,G,S,V,T,D,P,N,C              Small
All residues                 All residues                   Default wildcard

Each list is sorted in order from lowest to highest average alignment score A and contains seven
entries, including the default wildcard. The left-hand list is selected to minimize the average alignment
score A using a hill-climbing strategy, and the right-hand list is based on the amino acid classifications
described in Taylor (1986).
In selecting the wildcard set W that minimizes A we use the following greedy approach: first, we
initialize W to contain only the default wildcard w_d. We then scan through the candidate wildcards and
select the one that leads to the greatest overall reduction in A. This process is repeated until the set W is
filled, at each iteration considering the wildcards already in W in the calculation of A. Once W is full we
employ a hill-climbing strategy where we consider replacing each wildcard in W with a candidate set of
residues, with the aim of further reducing A.
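The greedy phase can be sketched as follows. The alphabet, scores, candidate sets, and recorded substitutions are toy values, the wildcard vector is approximated by the optimistic maximum to keep the sketch self-contained, and the hill-climbing refinement is omitted:

```python
# Greedy sketch of wildcard-set selection: starting from the default
# wildcard, repeatedly add the candidate that most reduces the total
# alignment score A, assigning each recorded substitution o_j the
# lowest-scoring suitable wildcard currently in W.

R = "ACDE"
p = {"A": 0.4, "C": 0.1, "D": 0.3, "E": 0.2}
s = {("A","A"): 4, ("A","C"): 0, ("A","D"): -2, ("A","E"): -1,
     ("C","C"): 9, ("C","D"): -3, ("C","E"): -4,
     ("D","D"): 6, ("D","E"): 2, ("E","E"): 5}
def score(a, b):
    return s.get((a, b), s.get((b, a)))
def e(w):   # expected match score, using the optimistic vector as s(w, .)
    return sum(max(score(f, r) for f in w) * p[r] for r in R)

O = [frozenset("AD"), frozenset("AD"), frozenset("CE"), frozenset("ADE")]
default = frozenset(R)
candidates = {frozenset("AD"), frozenset("CE"), frozenset("ADE")}

def total_A(W):
    # Each substitution uses the cheapest wildcard in W that can cover it.
    return sum(min(e(w) for w in W if o <= w) for o in O)

W = {default}
budget = 3                       # |W| including the default wildcard
while len(W) < budget:
    best = min(candidates - W, key=lambda w: total_A(W | {w}))
    W |= {best}
print(sorted(map(sorted, W)), round(total_A(W), 2))
```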
A set of wildcards was chosen by applying this strategy to the GenBank NR database (described in
Section 5). The left-hand column of Table 1 lists the wildcards that were identified using this approach
and used by default for the experiments reported in this paper.
We also consider deﬁning wildcards based on groups of amino acids with similar physico-chemical
properties. We used the amino acid classiﬁcations described in Taylor (1986) to deﬁne the set of seven
wildcards shown in the right-hand column of Table 1. In addition to the default wildcard, six wildcards
were deﬁned to represent the aliphatic, aromatic, charged, hydrophobic, polar, and small classes of amino
acids. We present experimental results for this alternative set of wildcards in the following section.
5. RESULTS

In this section we analyse the effect of our clustering strategy on collection size and search times. For
our assessments, we used version 1.65 of the ASTRAL Compendium (Chandonia et al., 2004), which uses
information from the SCOP database (Murzin et al., 1995; Andreeva et al., 2004) to classify sequences
with fold, superfamily, and family information. The database contains a total of 67,210 sequences classified
into 1,538 superfamilies.
A set of 8,759 test queries was extracted from the ASTRAL database such that no two of the queries
shared more than 90% identity. To measure search accuracy, each query was searched against the ASTRAL
database, and the commonly used Receiver Operating Characteristic (ROC) score was employed (Gribskov
and Robinson, 1996). A match between two sequences is considered positive if they are from the same
superfamily, otherwise it is considered negative. The ROC50 score provides a measure between 0 and 1,
where a higher score represents better sensitivity (detection of true positives) and selectivity (ranking true
positives ahead of false positives).
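The ROC-n measure (here n = 50) can be sketched on a ranked result list; the definition below — the mean number of true positives ranked ahead of each of the first n false positives, normalised by the total number of positives — is the standard formulation and is written here as an illustrative assumption rather than a transcription of the cited paper:

```python
# Sketch of the ROC-n score on a ranked result list, where labels are True
# for same-superfamily (positive) matches and False otherwise.

def roc_n(ranked_labels, n=50):
    total_pos = sum(ranked_labels)
    tp = fp = area = 0
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp            # true positives ranked above this FP
            if fp == n:
                break
    return area / (n * total_pos)

# A perfect ranking scores 1.0; a fully inverted ranking scores 0.0.
print(roc_n([True] * 3 + [False] * 60, n=50))
print(roc_n([False] * 60 + [True] * 3, n=50))
```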
The SCOP database is too small to provide an accurate measure of search time, so we use the GenBank
non-redundant (NR) protein database to measure search speed. The GenBank collection was downloaded
August 18, 2005 and contains 2,739,666 sequences in around 900 megabytes of sequence data. Performance
was measured using 50 queries randomly selected from GenBank NR. Each query was searched against
the entire collection three times, with the best of the three runtimes recorded for each query and these
runtimes averaged across queries. Experiments were
conducted on a Pentium-4 2.8-GHz machine with 2 gigabytes of main memory.
We used FSA-BLAST—our own version of BLAST—with default parameters as a baseline. To assess the
clustering scheme, the GenBank and ASTRAL databases were clustered and FSA-BLAST was configured
to report all high-scoring alignments, rather than only the best alignment from each cluster. All reported
collection sizes include sequence data and edit information but exclude sequence descriptions. CD-HIT
version 2.0.4 beta was used for experiments, with a 90% clustering threshold and maximum memory set
to 1.5 Gb. We also report results for NCBI-BLAST version 2.2.11 and our own implementation of Smith-
Waterman that uses the exact same scoring functions and statistics as BLAST (Karlin and Altschul, 1990;
Altschul and Gish, 1996). The Smith-Waterman results represent the highest possible degree of sensitivity
that could be achieved by BLAST and provide a meaningful reference point. No sequence filtering was
performed for the experiments in this paper.
The overall results for our clustering method are shown in Table 2. When used with the default settings
of λ = 0.2 and T = 0.25, and the set of wildcards selected to minimize alignment score in Table 1,
our clustering approach reduces the overall size of the NR database by 27% and improves search times
by 22%. Importantly, the ROC score indicates that there is no significant effect on search accuracy, with
the highly redundant SCOP database reducing in size by 80% when clustered. If users are willing to
accept a small loss in accuracy, then the parameters λ = 0 and T = 0.3 improve search times by 27%
and reduce the size of the sequence collection by 28%, with a decrease of 0.001 in ROC score when
compared to our baseline. Since we are interested in improving performance with no loss in accuracy, we
do not consider these non-default settings further. Overall, our clustering approach with default parameters,
combined with improvements to the gapped alignment (Cameron et al., 2004) and hit detection (Cameron
et al., 2006) stages of BLAST, allows the speed of FSA-BLAST to be double that of NCBI-BLAST with
no significant effect on accuracy. Both versions of BLAST produce ROC scores 0.017 below the optimal
Smith-Waterman score.
The results in Table 2 also show that our scheme is an effective means of compressing protein sequences,
a task that has been deemed difﬁcult by previous researchers (Nevill-Manning and Witten, 1999; Weiss
et al., 2000). Assuming a uniform, independent distribution of amino acids, protein sequence data can be
represented with 4.322 bits per symbol (Nevill-Manning and Witten, 1999). Our clustering scheme is able
to reduce the space required to store protein sequence data in the GenBank non-redundant database to
around 3.15 bits per symbol; to our knowledge, this is signiﬁcantly less than the current best compression
rate of 4.051 bits per symbol (Nevill-Manning and Witten, 1999).
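The bits-per-symbol figure follows directly from the size reduction:

```python
# Arithmetic behind the compression claim: a uniform 20-letter amino acid
# alphabet needs log2(20) ≈ 4.322 bits per symbol, and clustering stores
# 655 Mb in place of 900 Mb of sequence data.
import math

baseline = math.log2(20)              # ≈ 4.322 bits per symbol
clustered = baseline * 655 / 900      # ≈ 3.15 bits per symbol
print(round(baseline, 3), round(clustered, 2))
```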
In Table 3, we compare search accuracy and performance for varying wildcard sets. The set of wildcards
that were selected to minimize the average alignment score using the approach described in Section 4.4
provide the fastest search times and smallest collection size. The set of wildcards based on the physico-
chemical classiﬁcations of Taylor (1986) do not perform as well, with 3% slower search times. This is
not a surprising result; treating the selection of wildcards as an optimization problem allows us to choose
those that have the greatest direct impact on search performance. Nonetheless, it is interesting to note that
the wildcards selected by this process bear little resemblance to any of the traditional physico-chemical
amino-acid taxonomies. Finally, search performance was worse still when the collections were clustered
using only the default wildcard; this supports our approach of using multiple wildcards to construct clusters.
Figure 9 shows a comparison of clustering times between CD-HIT and our novel clustering approach
that uses union-sequences and wildcards for four different releases of the GenBank NR database; details
of the collections used are given in Table 4. The results show that the clustering time of our approach is
TABLE 2. AVERAGE RUNTIME FOR 50 QUERIES SEARCHED AGAINST THE GENBANK
NR DATABASE, AND SCOP ROC50 SCORES FOR THE ASTRAL COLLECTION

Scheme                        Runtime, sec     Sequence data, Mb    ROC50
No clustering (baseline)      28.75 (100%)     900 (100%)           0.398
Cluster λ = 0.2, T = 0.25     22.54 (78%)      655 (73%)            0.398
Cluster λ = 0, T = 0.3        20.97 (73%)      650 (72%)            0.397
NCBI-BLAST                    45.75 (159%)     898 (100%)           0.398
Smith-Waterman                —                —                    0.415
TABLE 3. AVERAGE RUNTIME AND SCOP ROC50 SCORES FOR
VARYING SETS OF WILDCARDS

Wildcard set                       Runtime, sec     Sequence data, Mb    ROC50
Minimum alignment score            22.54 (78%)      655 (73%)            0.398
Physico-chemical classifications   23.25 (81%)      656 (73%)            0.398
Default wildcard only              23.49 (82%)      663 (76%)            0.398

The first two rows contain results for the wildcard sets defined in Table 1. The third row contains
results for clustering with only the default wildcard, W = {w_d}.
FIG. 9. Clustering performance for GenBank NR databases of varying sizes.
TABLE 4. REDUNDANCY IN GENBANK NR DATABASE OVER TIME

Release date      Sequences     Size, Mb    Reduction, Mb    Reduction
16 July 2000      521,662       157         45               28.9%
22 May 2003       1,436,591     443         124              28.1%
30 June 2004      1,873,745     597         165              27.4%
18 August 2005    2,739,666     900         245              27.3%
linear with the collection size while the CD-HIT approach is superlinear (Fig. 9). On the recent GenBank
non-redundant collection, CD-HIT is around 9 times slower than our approach and we expect this ratio to
increase further with collection size.
Table 4 shows the amount of redundancy in the GenBank NR database as it has grown over time,
measured using our clustering approach. We observe that redundancy is increasing at a rate roughly
proportional to collection size with the percentage reduction through clustering remaining almost constant
at 27–29% across versions of the collection tested. This suggests that redundancy will continue to affect
genomic data banks as they grow further in size.
Figure 10 shows the effect on accuracy for varying values of λ and T. We have chosen λ = 0.2 as a
default value because smaller values of λ result in a larger decrease in search accuracy, and larger values
reduce search speed. We observe that for λ = 0.2 there is little variation in search accuracy for values
of T between 0.05 and 0.3.
FIG. 10. Search accuracy for collections clustered with varying values of λ and T. Default values of λ = 0.2,
T = 0.25 are highlighted.
FIG. 11. Average BLAST search time using λ = 0.2 and varying values of T.
Figure 11 shows the effect on search times for varying values of T where λ = 0.2. As T increases
the clustered collection becomes smaller, leading to faster search times. However, if T is too large then
union-sequences with a high percentage of wildcards are permitted, leading to an increase in the number
of cluster members that are recreated and a corresponding reduction in search speed. We have chosen the
value T = 0.25 that maximizes search speed.
Figure 12 shows the distribution of cluster sizes on log-log axes for the GenBank NR database; the linear
distribution of the data points on these axes suggests that the sizes follow a power-law distribution. Around 55%
of clusters contain just two members, and the largest cluster contains 488 members. Of the ten largest
clusters identified by our approach, five relate to human immunodeficiency virus proteins, three relate to
cytochrome b, one relates to elongation factor 1α, and one relates to cytochrome oxidase subunit I. This
supports our previous observation that cluster size is proportional to the interest in a research area.
6. CONCLUSION

Sequence databanks such as GenBank contain a large number of redundant sequences. Such redundancy
has several negative effects, including larger collection size, slower search, and difficult-to-interpret results.
Redundancy within a collection can lead to over-representation of alignments within particular protein
domains, distracting the user from other potentially important hits.
FIG. 12. Distribution of sizes for clusters identiﬁed in the GenBank NR database.
In this paper we have proposed novel techniques for both the detection and management of redundancy
in genomic sequence databanks.
For the detection of redundancy, we have explained how the successful document fingerprinting approach
can be adapted to genomic data and described slotted SPEX, a new chunk-selection heuristic for document
fingerprinting that is especially suitable for the genomic sequence domain. We have shown how, with the
use of document fingerprinting, we can cluster the GenBank nonredundant protein database nearly nine
times faster than the next-fastest approach.
We have also described a new approach to the management of redundancy. Instead of discarding
near-duplicate sequences, our approach identifies clusters of redundant sequences and constructs a special
union-sequence that represents all members of the cluster through the careful use of wildcard characters.
We present a new approach for searching clusters that, when combined with a well-chosen set of wildcards
and a system for scoring matches between wildcards and query residues, leads to faster search times
without a significant loss in accuracy. Moreover, by recording the differences between the union-sequence
and each cluster member as edit information, our approach compresses the collection. Our scheme is
general and can be adapted to most homology search tools.
We have integrated our clustering scheme into FSA-BLAST, an alternative implementation of BLAST that
is substantially faster than NCBI-BLAST and freely available for download (at www.fsa-blast.org/). Our
results show that our clustering scheme reduces BLAST search times against the GenBank non-redundant
database by 22% and compresses sequence data by 27% with no significant effect on accuracy. We have
also described a new system for identifying clusters that uses fingerprinting, a technique that has been
successfully and extensively applied to the domain of redundant-document detection. Our implementation
can cluster the entire GenBank NR protein database in one hour on a standard workstation and scales
linearly in the size of the collection. We propose that pre-clustered copies of the GenBank collection be
made publicly available for download.
We have conﬁned our experimental work to protein sequences and plan to investigate the effect of our
clustering scheme on nucleotide data as future work. We also plan to investigate the effect of our approach
on iterative search algorithms such as PSI-BLAST, and how our scheme can be used to improve the current
measure of the statistical significance of BLAST alignments.
ACKNOWLEDGMENTS

We thank Peter Smooker and Michelle Chow for valuable suggestions. This work was supported by the
Australian Research Council.
REFERENCES

Altschul, S., and Gish, W. 1996. Local alignment statistics. Methods Enzymol. 266, 460–480.
Altschul, S., Gish, W., Miller, W., et al. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Altschul, S., Madden, T., Schaffer, A., et al. 1997. Gapped BLAST and PSI–BLAST: a new generation of protein
database search programs. Nucleic Acids Res. 25, 3389–3402.
Andreeva, A., Howorth, D., Brenner, S., et al. 2004. SCOP database in 2004: reﬁnements integrate structure and
sequence family data. Nucleic Acids Res. 32, D226–D229.
Bernstein, Y., and Zobel, J. 2004. A scalable system for identifying co-derivative documents. In Apostolico, A., and
Melucci, M., eds., Proc. String Processing and Information Retrieval Symposium (SPIRE), 55–67. Springer, Padova, Italy.
Bernstein, Y., and Zobel, J. 2005. Redundant documents and search effectiveness. Proc. 14th ACM Int. Conf. Inform.
Knowledge Manag., 736–743.
Bleasby, A.J., and Wootton, J.C. 1990. Construction of validated, non-redundant composite protein sequence databases.
Protein Eng. 3, 153–159.
Brin, S., Davis, J., and García-Molina, H. 1995. Copy detection mechanisms for digital documents. Proc. ACM
SIGMOD Annu. Conf., 398–409.
Broder, A.Z., Glassman, S.C., Manasse, M.S., et al. 1997. Syntactic clustering of the web. Comput. Networks ISDN
Syst. 29, 1157–1166.
Burke, J., Davison, D., and Hide, W. 1999. d2_cluster: a validated method for clustering EST and full-length DNA
sequences. Genome Res. 9, 1135–1142.
Cameron, M., Williams, H.E., and Cannane, A. 2004. Improved gapped alignment in BLAST. IEEE Trans. Comput.
Biol. Bioinform. 1, 116–129.
Cameron, M., Williams, H.E., and Cannane, A. 2006. A deterministic finite automaton for faster protein hit detection
in BLAST. J. Comput. Biol. 13, 965–978.
Chandonia, J., Hon, G., Walker, N., et al. 2004. The ASTRAL compendium in 2004. Nucleic Acids Res. 32, D189–D192.
Cheung, C.-F., Yu, J.X., and Lu, H. 2005. Constructing suffix tree for gigabyte sequences with megabyte memory.
IEEE Trans. Knowledge Data Eng. 17, 90–105.
Fetterly, D., Manasse, M., and Najork, M. 2003. On the evolution of clusters of near-duplicate web pages. Proc. 1st
Latin Am. Web Congress, 37–45.
Gracy, J., and Argos, P. 1998. Automated protein sequence database classification. I. Integration of compositional
similarity search, local similarity search, and multiple sequence alignment. Bioinformatics 14, 164–173.
Gribskov, M., and Robinson, N. 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence
matching. Comput. Chem. 20, 25–33.
Grillo, G., Attimonelli, M., Liuni, S., et al. 1996. CLEANUP: a fast computer program for removing redundancies
from nucleotide sequence databases. Comput. Appl. Biosci. 12, 1–8.
Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge, UK.
Heintze, N. 1996. Scalable document fingerprinting. 1996 USENIX Workshop Electron. Commerce, 191–200.
Henikoff, S., and Henikoff, J. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919.
Hoad, T.C., and Zobel, J. 2003. Methods for identifying versioned and plagiarised documents. J. Am. Soc. Inform. Sci.
Technol. 54, 203–215.
Holm, L., and Sander, C. 1998. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14, 423–429.
Itoh, M., Akutsu, T., and Kanehisa, M. 2004. Clustering of database sequences for fast homology search using upper
bounds on alignment score. Genome Inform. 15, 93–104.
Johnson, S. 1967. Hierarchical clustering schemes. Psychometrika 32, 241–254.
Kallberg, Y., and Persson, B. 1999. KIND—a non-redundant protein database. Bioinformatics 15, 260–261.
Karlin, S., and Altschul, S. 1990. Methods for assessing the statistical significance of molecular sequence features by
using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.
Li, W., Jaroszewski, L., and Godzik, A. 2001a. Clustering of highly homologous sequences to reduce the size of large
protein databases. Bioinformatics 17, 282–283.
Li, W., Jaroszewski, L., and Godzik, A. 2001b. Tolerating some redundancy significantly speeds up clustering of large
protein databases. Bioinformatics 18, 77–82.
Li, W., Jaroszewski, L., and Godzik, A. 2002. Sequence clustering strategies improve remote homology recognitions
while reducing search times. Protein Eng. 15, 643–649.
Malde, K., Coward, E., and Jonassen, I. 2003. Fast sequence clustering using a suffix array algorithm. Bioinformatics 19, 1221–1226.
Manber, U. 1994. Finding similar files in a large file system. Proc. USENIX Winter 1994 Techn. Conf., 1–10.
Manber, U., and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935–948.
Murzin, A., Brenner, S., Hubbard, T., et al. 1995. SCOP: a structural classification of proteins database for the
investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Nevill-Manning, C.G., and Witten, I.H. 1999. Protein is incompressible. DCC ’99 Proc. Conf. Data Compression, 257.
Park, J., Holm, L., Heger, A., et al. 2000. RSDB: representative sequence databases have high information content.
Bioinformatics 16, 458–464.
Parsons, J.D. 1995. Improved tools for DNA comparison and clustering. Comput. Appl. Biosci. 11, 603–613.
Pearson, W., and Lipman, D. 1985. Rapid and sensitive protein similarity searches. Science 227, 1435–1441.
Pearson, W., and Lipman, D. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.
Robinson, A., and Robinson, L. 1991. Distribution of glutamine and asparagine residues and their near neighbors in
peptides and proteins. Proc. Natl. Acad. Sci. USA 88, 8880–8884.
Shivakumar, N., and García-Molina, H. 1999. Finding near-replicas of documents on the web. WEBDB: Int. Workshop
World Wide Web Databases.
Smith, T., and Waterman, M. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Taylor, W. 1986. The classification of amino-acid conservation. J. Theoret. Biol. 119, 205–218.
Weiss, O., Jimenez-Montano, M., and Herzel, H. 2000. Information content of protein sequences. J. Theoret. Biol. 206, 379–386.
Witten, I.H., Moffat, A., and Bell, T.C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images.
Morgan Kaufmann, New York.
Address reprint requests to:
Dr. Michael Cameron
School of Computer Science and IT
GPO Box 2476V
Melbourne, Australia, 3001