
Clustered Sequence Representation for Fast Homology Search


Abstract

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new open-source version of BLAST; as a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
Volume 14, Number 5, 2007
© Mary Ann Liebert, Inc.
Pp. 594–614
DOI: 10.1089/cmb.2007.R005
Key words: BLAST, clustering, homology search, near duplicate detection, sequence alignment.
Comprehensive genomic databases such as the GenBank non-redundant protein database contain
a large amount of internal redundancy. Although exact duplicates are removed from the collection,
there remain a large number of near-identical sequences. Such near duplicate sequences can appear in
protein databases for a variety of reasons, including the existence of closely-related homologues or partial
sequences, sequences with expression tags, fusion proteins, post-translational modifications, and sequencing
errors. These minor sequence variations lead to the over-representation of protein domains, particularly
those that are under intensive research. For example, the GenBank database contains several thousand
protein sequences from the human immunodeficiency virus.
Database redundancy can lead to a number of negative consequences in the context of sequence ho-
mology search. First, a larger database takes longer to query; as sequencing efforts continue to outpace
1 School of Computer Science and Information Technology, RMIT University, Melbourne, Australia.
2 Microsoft Corporation, Redmond, Washington.
speed improvements in computer hardware, the problem of slow query response is one that will continue
to become more urgent. Second, redundancy can lead to highly repetitive search results for any query
that matches closely with an over-represented sequence. Third, large-scale redundancy has the effect of
skewing the statistics used for determining alignment significance, ultimately leading to decreased search
effectiveness. Fourth, profile-based algorithms such as PSI-BLAST (Altschul et al., 1997) can be misled
by redundant matches during iteration, causing them to bias the profile towards over-represented domains;
this can result in a less sensitive search or even profile corruption (Li et al., 2002; Park et al., 2000).
Attempts to manage redundancy in genomic databases have in the past focused on the creation of
representative-sequence databases (RSDBs), culled collections in which no two sequences share more
than a given level of identity. Such databases have been shown to significantly improve profile training
in iterative search tools such as PSI-BLAST by reducing the incidence of profile corruption caused by
over-represented domains. However, they are less suitable for regular search algorithms such as BLAST
(Altschul et al., 1990, 1997) and FASTA (Pearson and Lipman, 1985, 1988) because, by definition, RSDBs
are not comprehensive. This leads to search results that are both less accurate—the representative sequence
for a cluster may not be the one that aligns best with a given query—and less authoritative, because the
user is only shown one representative sequence from a family of similar sequences. Furthermore, existing
clustering techniques for creating RSDBs either exhibit O(n²) time complexity in the size of the database
being clustered, or consume large quantities of memory.
In this work we present a redundancy-detection technique for sequence databases that is based on
document fingerprinting (Bernstein and Zobel, 2004). We describe how the fingerprinting approach has
been successfully adapted to the domain of sequence data, and present slotted SPEX, a new chunk-selection
algorithm for fingerprinting that is particularly suited to the sequence domain. Fingerprinting suffers from
neither quadratic time complexity nor heavy memory requirements, leading to a significant improvement
to the efficiency with which we are able to cluster large sequence databases; we are able to process the
entire GenBank collection in one hour on a commodity workstation. By contrast, the most advanced and
efficient existing technique, CD-HIT (Li et al., 2001b), takes almost ten hours on the same machine.
We then introduce a novel sequence clustering methodology, and corresponding modified search algo-
rithm, that when combined allow for the efficient and effective management of redundancy in genomic
databases. Importantly, our approach lacks the drawbacks of previous redundancy-management strategies.
Whereas earlier approaches choose one sequence from each cluster as its representative in the database
and delete the other sequences, we generate for each cluster a special union-sequence that—through use of
wildcard characters—represents all of the sequences in the cluster simultaneously. Through careful choice
of wildcards, we are able to achieve near-optimal alignments while still substantially reducing the number
of sequences against which queries need be matched. Further, we store all sequences in a cluster as a set
of edits against the union-sequence. This achieves a form of compression and allows us to retrieve cluster
members for more precise alignment against a query should the union-sequence achieve a good alignment
score. Thus, both space and time are saved with no significant loss in accuracy or sensitivity.
Our method supports two modes of operation: users can choose to see all alignments or only the best
alignment from each cluster. In the former mode, the clustering is transparent and the output comparable
to that of searches on the unclustered collection. In the latter mode, the search output is similar to the
result of searching a culled representative database, except that our approach is guaranteed to display the
best alignment from each cluster and is also able to report the number of similar alignments that have been omitted.
To investigate the effectiveness of our clustering approach we have integrated it with our freely available
open-source software package, FSA-BLAST. When applied to the GenBank non-redundant (NR) database,
our method reduces the size of sequence data in the NR database by 27% and improves search times by
22% with no significant effect on accuracy.
Reducing redundancy in a sequence database is essentially a two-stage process: first, redundancy between
sequences in the database must be identified; then, the redundancy must be managed in some way. In this
section we describe past approaches to these two stages.
2.1. Redundancy identification
The first stage of most redundancy management algorithms involves identifying pairs of highly-similar
sequences. An obvious approach to this task is to align each sequence with each other sequence in
the collection using a pairwise alignment scheme such as Smith-Waterman local alignment (Smith and
Waterman, 1981). This is the approach taken by several existing clustering algorithms, including d2_cluster
(Burke et al., 1999), OWL (Bleasby and Wootton, 1990), KIND (Kallberg and Persson, 1999), and the
method of Itoh et al. (2004). However, this approach is impractical for any collection of significant size; each pairwise
comparison is computationally intensive and the number of pairs grows quadratically in the number of sequences.
Several schemes, including CLEANUP (Grillo et al., 1996), NRDB90 (Holm and Sander, 1998), RSDB
(Park et al., 2000), CD-HI (Li et al., 2001a), and CD-HIT (Li et al., 2001b), use a range of BLAST-like
heuristics to quickly identify high-scoring pairwise matches. The CLEANUP (Grillo et al., 1996) algorithm
builds a rich inverted index of short substrings or words in the collection and uses this structure to score
similarity between sequence pairs. NRDB90 (Holm and Sander, 1998) and RSDB (Park et al., 2000) use
in-memory hashtables of decapeptides and pentapeptides for fast identification of possible high-scoring
sequence pairs before proceeding with an alignment. CD-HI (Li et al., 2001a) and CD-HIT (Li et al.,
2001b) use lookup arrays of very short subsequences to more efficiently identify similar sequences. Such
methods can significantly reduce the per-pair comparison time, but do nothing to alter the O(n²) time
complexity of the algorithms.
Many of the above schemes attempt to reduce the number of pairwise sequence comparisons by using a
greedy incremental clustering approach, in which the similarity detection and the cluster-based redundancy
management approach (see Section 2.2) are combined. In general, this method proceeds as follows. To
begin, the collection sequences are sorted by decreasing order of length. Then, each sequence is considered
in turn and used as a query to search an initially-empty representative database for high-scoring matches.
If a similar sequence is found, the query sequence is discarded; otherwise, it is added to the database as the
representative of a new cluster. When the algorithm terminates, the database consists of the representative
(longest) sequence of each cluster. While this technique can significantly reduce the number of pairwise
sequence comparisons that need to be made, it still retains the unfavorable O(n²) time complexity that
afflicts all direct pairwise-comparison methods. We show in Section 5 that CD-HIT—the fastest of the
greedy incremental algorithms mentioned and most successful existing approach—scales poorly, with
superlinear complexity in the size of the collection.
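The greedy incremental procedure described above can be sketched as follows. The k-mer overlap test and the 90% threshold here are illustrative assumptions of ours; the published tools use BLAST-like heuristics and alignment-based identity instead.

```python
def kmer_similarity(rep, seq, k=5):
    """Toy stand-in for a BLAST-like similarity test: the fraction of
    seq's k-mers that also occur somewhere in rep."""
    if len(seq) < k:
        return 1.0 if seq in rep else 0.0
    rep_kmers = {rep[i:i + k] for i in range(len(rep) - k + 1)}
    seq_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return sum(m in rep_kmers for m in seq_kmers) / len(seq_kmers)

def greedy_incremental_cluster(sequences, threshold=0.9, k=5):
    """Greedy incremental clustering: consider sequences longest-first;
    each is either absorbed by an existing representative or becomes
    the representative of a new cluster."""
    reps = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(kmer_similarity(rep, seq, k) >= threshold for rep in reps):
            reps.append(seq)
    return reps
```

Note that every new sequence is tested against every existing representative, which is the source of the quadratic worst-case behavior discussed in the text.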
ICAass (Parsons, 1995) and Itoh et al. (2004) reduce the number of pairwise comparisons by partitioning
the collection according to phylogenetic classifications and clustering only sequences within each partition.
This reduces the number of pairwise comparisons; however, the approach assumes that the database has
been pre-classified and ignores possible matches between taxonomically distant species. Further, the number
of phylogenetic divisions is growing at a far slower rate than database size. Therefore, a quadratic growth
rate in computation time remains a limitation.
One way to avoid an all-against-all comparison is to pre-process the collection using an index or suffix
structure that can be used to efficiently identify high-scoring candidate pairs. Malde et al. (2003) and Gracy
and Argos (1998) investigated the use of suffix structures such as suffix trees (Gusfield, 1997) and suffix
arrays (Manber and Myers, 1993) to identify groupings of similar sequences in linear time. However, suffix
structures require large amounts of main memory and are not suitable for processing large sequence collections
such as GenBank on desktop workstations. Malde et al. (2003) report results for only a few thousand
EST sequences. The algorithm described by Gracy and Argos (1998) requires several days to process a
collection of around 60,000 sequences. External suffix structures, which record information on disk, are
also unsuitable; they use a large amount of disk space, are extremely slow for searching, or have slow
construction times (Cheung et al., 2005).
2.2. Redundancy management
The common practice in existing work is to use the information on inter-sequence redundancy to build
representative-sequence databases (RSDBs). In this approach, highly similar sequences are grouped into
clusters, or, in the case of greedy incremental clustering, the clusters are built contemporaneously
with the pairwise similarity comparison. Once a set of clusters has been identified, one sequence
from each cluster is selected as the representative of that cluster to be inserted into the RSDB; the rest are
discarded (Holm and Sander, 1998; Park et al., 2000; Li et al., 2001a, 2001b). The result is a representative
database with fewer sequences and less redundancy. However, purging near-duplicate sequences can sig-
nificantly reduce the quality of results returned by search tools such as BLAST. There is no guarantee that
the representative sequence from a cluster is the sequence that best aligns with a given query. Therefore,
some queries will fail to return matches against a cluster that contains sequences of interest, which reduces
sensitivity. Further, results of a search lack authority because they do not show the best alignment from
each cluster. Also, the existence of highly similar alignments, even if strongly mutually redundant, may
be of interest to a researcher.
Itoh et al. (2004) describe an alternative redundancy-management technique that, in contrast with the
RSDB approach, retains all members of each cluster. This approach calculates an upper bound on the
difference in score between aligning a query to any sequence in a cluster and aligning the same query to
a chosen representative. During search, the query is compared to the representative and the upper bound
is added to the resulting alignment score; if the increased score exceeds the scoring cutoff, all sequences
in that cluster are loaded from an auxiliary database and individually aligned to the query. While this
approach ensures there is no loss in sensitivity, it comes at a substantial cost: unless a high scoring cutoff
is used during search—Itoh et al. use a nominal score cutoff of 150 in their experiments—there will be
numerous false positives, causing search to be slowed. They report experiments using Smith-Waterman
(1981) alignment and it is unclear if their approach would work well if applied to a heuristic search tool
such as BLAST. In addition, all sequences are retained by their method, and so no space is saved.
Document fingerprinting (Manber, 1994; Brin et al., 1995; Heintze, 1996; Broder et al., 1997; Shivaku-
mar and García-Molina, 1999) is an effective and scalable technique for identifying pairs of documents
within large text collections that share portions of identical text. Fingerprinting has been used for several
applications, including copyright protection (Brin et al., 1995), document management (Manber, 1994),
and web search optimization (Broder et al., 1997; Fetterly et al., 2003; Bernstein and Zobel, 2005).
In this section we describe how a modified form of document fingerprinting can be applied to the
task of identifying near-identical sequences in a sequence database. In contrast to the existing approaches
discussed in Section 2.1, fingerprinting allows us to cluster the database rapidly and in linear time, with
only modest main-memory requirements.
3.1. Document fingerprinting
The fundamental unit of document fingerprinting techniques is the chunk, a fixed-length unit of text
such as a series of consecutive words or a sentence. The full set of chunks for a given document is formed
by passing a sliding window of appropriate length over the document; this is illustrated in Figure 1 for a
chunk length of six words.
FIG. 1. Set of chunks of six words in length for a document containing the text “the quick brown fox jumped over
the lazy dog.”
The set of all chunks in a collection can be stored in an inverted index (Witten et al., 1999), and the
index can be used to calculate the number of shared chunks between pairs of documents in a collection.
Two identical documents will naturally have an identical set of chunks. As the documents begin to diverge,
the proportion of chunks they share will gradually degrade. Thus, the number of common chunks is a
good estimator of the amount of common text shared by a pair of documents. The quality of this estimate
is optimized by choosing a chunk length that is long enough so that two identical chunks are unlikely
to coincidentally occur, but not so long that the system becomes too sensitive to minor changes. In the
duplicate detection DECO package, the default chunk length is eight words (Bernstein and Zobel, 2004).
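The sliding-window chunking of Figure 1, together with an inverted index used to count chunks shared between document pairs, can be sketched as follows. This is a simplified illustration under our own naming, not the DECO implementation.

```python
from collections import defaultdict

def chunks(text, length=6):
    """All overlapping chunks of `length` consecutive words,
    produced by a sliding window as in Figure 1."""
    words = text.split()
    return [" ".join(words[i:i + length])
            for i in range(len(words) - length + 1)]

def shared_chunk_counts(docs, length=6):
    """Build an inverted index from chunk to the documents containing it,
    then count the chunks shared by each pair of documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for c in chunks(text, length):
            index[c].add(doc_id)
    shared = defaultdict(int)
    for ids in index.values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                shared[(ids[i], ids[j])] += 1
    return dict(shared)
```

A nine-word sentence yields four six-word chunks, and near-identical documents retain most chunks in common while divergent ones share few or none.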
In theory, one could simply store the full set of chunks for each document in a collection, and directly
compute the degree of text reuse between documents as above. However, such an approach is highly
inefficient. Thus, some sort of selection heuristic is normally applied so that only a subset of chunks from
each document are selected for storage. The choice of selection heuristic has a very significant impact on
the general effectiveness of the fingerprinting algorithm. Most fingerprinting algorithms have used simple
feature-based selection heuristics, such as selecting chunks only if their hash is divisible by a certain
number, or selecting chunks that begin with certain letter-combinations. For an overview of existing chunk
selection methods, we refer the reader to Hoad and Zobel (2003) or Bernstein and Zobel (2004).
A major drawback of the chunk selection schemes of the type described above is that they are lossy:
if two documents share chunks, but none of them happen to satisfy the criteria of the selection heuristic,
the fingerprinting algorithm will not identify these documents as sharing text. For example, consider the
processing of two documents with a chunk length of one word and a chunk selection heuristic that only
selects words starting with a vowel. Given a pair of highly similar documents with the following text:
Document 1 The yellow cat ran between the tall trees.
Document 2 The brown cat ran between the tall towers.
Not a single chunk would be selected from either of these two documents because all of the words
contained in them start with a consonant. As a result, the similarity between these two documents would
be overlooked. The more selective the heuristic, the more likely such incidents are to occur.
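This failure mode is easy to reproduce. The sketch below applies the one-word-chunk, starts-with-a-vowel heuristic from the example to both documents; nothing at all is selected from either one.

```python
def select_chunks(text, keep):
    """One-word chunks filtered by a selection heuristic."""
    return [word for word in text.split() if keep(word)]

def starts_with_vowel(word):
    return word[0].lower() in "aeiou"

doc1 = "The yellow cat ran between the tall trees"
doc2 = "The brown cat ran between the tall towers"
# Every word in both documents begins with a consonant, so no chunk is
# selected from either document and their similarity goes undetected.
```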
3.2. The SPEX chunk selection algorithm
Bernstein and Zobel (2004) introduced the SPEX chunk selection algorithm, which allows for lossless
selection of chunks. This is possible because SPEX discards only singleton chunks (chunks that occur
just once, and that form a large majority in most text collections), which do not contribute to the identification of
text reuse between documents. The SPEX algorithm is based on the following observation: if any subchunk
(subsequence) of a chunk is unique, the chunk as a whole is unique. For example, an occurrence of the
chunk two brown cats must be unique in the collection if the subchunk brown cats is also unique. Using a
memory-efficient iterative hashing technique, SPEX is able to select only those chunks that occur multiple
times in the collection. Using SPEX can yield significant savings over selecting every chunk without any
degradation in the quality of results (Bernstein and Zobel, 2004, 2005).
The SPEX algorithm works as follows. In the first iteration, each chunk that is one word in length in the
document is processed. A hashtable records for each chunk whether it appears multiple times, just once,
or not at all. In the second iteration, the entire document is processed again, this time considering chunks
that are two words in length. Each two-word chunk is only inserted into the hashtable if both one-word
subchunks appear multiple times in the collection. For example, the two-word chunk brown cats is only
processed if both brown and cats are non-unique. A second hashtable records whether each two-word
chunk appears multiple times, just once, or not at all. The process is then repeated for increasing chunk
lengths until the final desired length is reached.
Figure 2 provides a pseudocode sketch of how SPEX identifies duplicate chunks of length finalLength
within a collection of documents S. The algorithm iterates over chunk lengths from 1 to finalLength,
the final chunk length desired. At each iteration, SPEX maintains two hashtables (referred to as lookup
in the figure): one recording the number of occurrences of each chunk for the previous iteration, and one
for the current iteration. As we are only interested in knowing whether a chunk occurs multiple times or
not, each entry in lookup takes one of only three values: zero, one, or more than one (2+). This allows
five hashtable entries to fit into a single byte, significantly reducing the size of the table. A chunk is
FIG. 2. The SPEX algorithm.
only inserted into lookup if its two subchunks of length chunkLength-1 both appear multiple times in
the hashtable from the previous iteration.
Figure 3 illustrates how the SPEX algorithm works when applied to the pair of documents shown at
the top of the figure. First, one-word chunks are counted as shown in the far left column. The words
yellow, brown, trees, and towers appear only once in the collection and the remaining words appear
at least twice. In the second iteration, two-word chunks are extracted from the collection and inserted into
the hashtable. However, chunks such as The yellow that contain at least one unique subchunk from the
previous iteration are not processed, nor are they inserted into the hashtable, as indicated by the letters NP.
As a result, fewer entries are made into the hashtable and the number of collisions is minimized. Finally,
three-word chunks are extracted in the third iteration, and any chunk that contains a unique subchunk of
length two is dismissed. The end result after three iterations is that all three-word chunks that are shared
between the two documents (that is, cat ran between, ran between the, and between the tall) are selected.
Although collisions caused by different chunks hashing to the same value are not resolved, Bernstein and
Zobel (2004) report that this has minimal impact on the performance of the algorithm. Collisions introduce
false positives to the process, that is, cause unique words to be considered as non-unique. However, the
iterative process tends to remove such false positives and helps prevent the hashtables from being flooded.
As a result, the SPEX algorithm is able to rapidly process large text collections while consuming a relatively
modest amount of memory (Bernstein and Zobel, 2004).
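The iterative process above can be sketched as follows. Exact Python dictionaries stand in for the fixed-size, collision-prone hashtables of the real algorithm, and counts are capped at 2 to mirror the zero/one/2+ entries of Figure 3.

```python
from collections import defaultdict

def spex(documents, final_length):
    """SPEX sketch: grow the chunk length one word at a time; a chunk is
    only counted if both of its (length - 1) subchunks occurred multiple
    times in the previous iteration."""
    docs = [doc.split() for doc in documents]
    prev = defaultdict(int)
    for n in range(1, final_length + 1):
        counts = defaultdict(int)
        for words in docs:
            for i in range(len(words) - n + 1):
                chunk = tuple(words[i:i + n])
                if n == 1 or (prev[chunk[:-1]] >= 2 and prev[chunk[1:]] >= 2):
                    counts[chunk] = min(counts[chunk] + 1, 2)  # cap at 2+
                # otherwise the chunk is not processed (NP in Figure 3)
        prev = counts
    # return the chunks of the final length that occur multiple times
    return {" ".join(c) for c, v in prev.items() if v >= 2}
```

Run on the pair of documents from Figure 3, this selects exactly the three shared three-word chunks identified in the text.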
3.3. Fingerprinting for genomic sequences
The SPEX algorithm (and, indeed, any fingerprinting algorithm) can be trivially adapted for use with
genomic sequences by simply substituting documents with sequences. However, the properties of a genomic
sequence are quite different from those of a natural language document, and these differences present a
number of practical challenges to the SPEX algorithm.
The most significant difference is the lack of any unit in genomic data analogous to natural language
words, given that a genomic sequence is represented by an undifferentiated string of characters with no
natural delimiters such as whitespace, commas or other punctuation marks. The lack of words in genomic
sequences has a number of immediate impacts on the operation and performance of the SPEX algorithm.
FIG. 3. Three iterations of the SPEX algorithm. The pair of documents at the top are processed to identify duplicate
chunks of one, two, and three words in length. For each chunk, 1 denotes that it appears once only in the collection,
2+ denotes that it appears multiple times, and NP indicates that the chunk was not processed because it contains a
unique subchunk.
First, the granularity of the sliding window must be increased from word-level to character-level. An
increased granularity results in the existence of a far greater number of chunks in a genomic sequence
than in a natural-language document of similar size. This increase becomes apparent when one considers
that—assuming an average word length of five—the SPEX sliding window advances over a total of
six bytes of a natural-language document at each increment. By contrast, the base-pairs in nucleotide
sequence data each represent only two bits, or 0.25 of a byte, of data. Thus, a nucleotide sequence can be
expected to produce roughly 24 times as many chunks as a natural language document of the same size.
As a result, the SPEX algorithm is less efficient and scalable for genomic data than for natural language text.
Processing genomic sequences also involves performing more iterations as part of the SPEX algorithm.
To identify chunks that are eight words in length, SPEX must perform eight iterations; for a similar-length
chunk containing 30 characters of sequence data, SPEX would need to perform 30 iterations. This obviously
slows down the algorithm significantly.
The distribution of subsequences within genomic data is also less skewed than the distribution of words
in natural-language text. Given a collection of natural language documents, we expect some words (such
as “and” and “or”) to occur extremely frequently, while other words (such as perhaps “alphamegamia” and
“nudiustertian”) will be hapax legomena: words that occur only once. This permits the SPEX algorithm
to be effectual from the first iteration by removing word-pairs such as “nudiustertian news.” In contrast,
given a short string of characters over the amino acid alphabet of size 20, it is far less likely that the
string will occur only once in any collection of nontrivial size. Thus, the first few iterations
of SPEX are likely to be entirely ineffectual.
One simple solution to these problems is to introduce “pseudo-words,” effectively segmenting each
sequence by moving the sliding window several characters at a time. However, this approach relies on
sequences being aligned along segment boundaries. This assumption is not generally valid and makes the
algorithm highly sensitive to insertions and deletions. Consider, for example, the following sequences given
a chunk length of four and a window increment of four:
[Table: the three example sequences and the chunks selected from each; content not recoverable from the source.]
Despite all three of these sequences containing an identical subsequence of length 11, they
do not share a single common chunk. This strong correspondence between the three sequences will thus
be overlooked by the algorithm.
We propose a hybrid of regular SPEX and the pseudo-word based approach described above that we
call slotted SPEX. Slotted SPEX uses a window increment greater than one but is able to “synchronize”
the windows between sequences so that two highly similar sequences are not entirely overlooked as a
result of a misalignment between them. Although slotted SPEX is technically a lossy algorithm, it offers a
performance guarantee that we shall discuss later in this section.
Figure 4 describes the slotted SPEX algorithm. As in standard SPEX, we pass a fixed-size window over
each sequence with an increment of one. However, unlike SPEX, slotted SPEX does not consider inserting
every chunk into the hashtable. In addition to decomposing the chunk into subchunks and checking that the
FIG. 4. The slotted SPEX algorithm.
subchunks are non-unique, slotted SPEX also requires that one of two initial conditions be met: first, that
it has been at least Q window increments since the last insertion; or second, that the current chunk already
appears in the hashcounter. The parameter Q is the quantum, which can be thought of as the window
increment used by the algorithm. Slotted SPEX guarantees that at least every Qth overlapping substring
from a sequence is inserted into the hashtable. The second precondition—that the chunk already appears
in the hashcounter—provides the synchronization that is required for the algorithm to work reliably.
The operation of slotted SPEX is best illustrated with an example. Using the same set of sequences as
above, a quantum Q = 4, and a chunk length of four, slotted SPEX produces the following set of chunks:
[Table: the chunks selected from each sequence; content not recoverable from the source.]
For the first sequence, the set of chunks produced does not differ from the naive pseudo-word technique.
Let us now follow the process for the second sequence. The first chunk—AABC—is inserted as before.
When processing the second chunk, ABCD, the number of chunks processed since the last insertion is one,
fewer than the quantum Q. However, the condition lookup[chunk] ≠ 0 on line 5 of Figure 4 is met: the
chunk has been previously inserted. The hashcounter is therefore incremented, effectively synchronizing
the window of the sequence with that of the earlier, matching sequence. As a result, every Qth identical
chunk will be identified across the matching region between the two sequences. In this example, the slotted
SPEX algorithm selects two chunks of length four that are common to all sequences.
Unlike the original SPEX algorithm, which incremented the chunk length by one for each iteration, the
slotted SPEX algorithm uses an increment equal to the quantum Q. In the slotted SPEX approach, a match
of length chunkLength + Q between two sequences is guaranteed to contain at least two matches of length
chunkLength identified during the previous iteration. As a result, slotted SPEX can increase the chunk
length by the quantum Q between iterations without overlooking matches identified in the previous iteration.
In comparison to the ordinary SPEX algorithm, slotted SPEX requires fewer iterations, consumes less
memory, and builds smaller indexes. This makes it suitable for the higher chunk density of genomic
data. While slotted SPEX is a lossy algorithm, it does offer the following guarantee: for a window
size finalLength and a quantum Q, any pair of sequences with a matching subsequence of length
finalLength + Q - 1 or greater will have at least one identical chunk selected. As the length of the
match grows, so will the guaranteed number of common chunks selected. Thus, despite the lossiness of
the algorithm, slotted SPEX is still able to offer strong assurance that it will reliably detect highly similar
pairs of sequences.
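A single selection pass of slotted SPEX can be sketched as follows. The subchunk checks of Figure 4 are omitted for brevity, and the example sequences in the usage note are our own; only the insertion and synchronization conditions described above are modeled.

```python
def slotted_pass(sequences, chunk_len=4, quantum=4):
    """Slide a window one character at a time, but insert a chunk only if
    at least `quantum` windows have passed since the last insertion, or the
    chunk is already in the table (which re-synchronizes the window with an
    earlier, matching sequence)."""
    lookup = {}
    for seq in sequences:
        last = -quantum  # forces an insertion at the first window
        for i in range(len(seq) - chunk_len + 1):
            chunk = seq[i:i + chunk_len]
            if i - last >= quantum or chunk in lookup:
                lookup[chunk] = lookup.get(chunk, 0) + 1
                last = i
    return lookup
```

For two sequences whose shared subsequence is at least chunk_len + quantum - 1 characters long, the synchronization condition guarantees at least one common selected chunk despite the differing offsets.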
In this section, we describe our novel approach to the management of redundancy in genomic sequence
databases. Rather than choosing a single representative sequence from each cluster, we use a set of wildcard
characters to create a single union-sequence that is simultaneously representative of all the sequences in
the cluster. A small amount of auxiliary data stored with each cluster allows for the original sequences to
be recreated.
During search, the query sequence is aligned to the union-sequence of each cluster; for those
union-sequences that produce a statistically significant alignment, the members of the cluster are restored
from their compressed representations and aligned to the query. This ensures that precise alignment scores
are calculated. Our approach supports two modes of operation: users can choose to see all high-scoring
alignments, or only the best alignment from each cluster. The latter mode reduces redundancy in the results.
The effectiveness of our technique depends upon the careful construction of clusters, the availability of a
sensitive but efficient way of scoring the query sequence against the union-sequence, and the selection of a
good set of wildcards. We describe our approach to these matters in Sections 4.2, 4.3, and 4.4, respectively.
4.1. Cluster representation
Let us define E = {e_1, …, e_n} as the set of sequences in a collection, where each sequence is a string
of residues e_i = r_1, …, r_n with r_j ∈ R. Our approach represents the collection as a set of clusters C,
where each cluster contains a union-sequence U and edit information for each member of the cluster.
The union-sequence is a string of residues and wildcards U = u_1, …, u_n with u_i ∈ R ∪ W, where
W = {w_1, …, w_n | w_i ⊆ R} is the set of available wildcards. Each wildcard represents a set of residues
and is able to act as a substitute for any of these residues. By convention, w_n is assumed to be the default
wildcard w_d that can represent any residue; that is, w_n = R.
Figure 5 shows an example cluster constructed using our approach. The union-sequence is shown at
the top and cluster members are aligned below. Columns where the member sequences differ from one
another and a wildcard has been inserted are shown in bold face. In this example, W = {w_d} (i.e., only
the default wildcard is used, and it is represented by an asterisk).
When a cluster is written to disk, the union-sequence—shown at the top of Figure 5—is stored in its
complete form, and each member of the cluster is recorded using edit information. The edit information for
each member sequence includes start and end offsets that specify a range within the union-sequence, and a
set of residues that replace the wildcards in that range. For example, the first member of the cluster with GI
accession 156103 would be represented by the tuple (8,44,PI); the member sequence can be reconstructed
by copying the substring between positions 8 and 44 of the union-sequence and replacing the wildcards at
union-sequence positions 8 and 40 with characters P and I, respectively. Note that our clustering approach
does not permit gaps; this is because insertions and deletions are heavily penalized during alignment and
any scheme that includes gaps in clusters is likely to reduce search accuracy.
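The reconstruction step can be illustrated with a short sketch. This is a hypothetical helper, not code from FSA-BLAST, and for simplicity it uses 0-based, end-exclusive offsets rather than the 1-based positions quoted in the example above; the union-sequence and edit tuples below are invented.

```python
def reconstruct(union_seq, start, end, replacements, wildcard="*"):
    """Rebuild one cluster member from the union-sequence plus its edit
    tuple: a range within the union-sequence and the residues that
    replace each wildcard inside that range, in order."""
    out, residues = [], iter(replacements)
    for ch in union_seq[start:end]:
        out.append(next(residues) if ch == wildcard else ch)
    return "".join(out)

# A toy cluster: three members stored against one union-sequence.
union = "MK*QV*TA"
members = [(0, 8, "PL"), (0, 8, "SL"), (2, 8, "PI")]
```

Each member costs only two offsets and a few residues of edit information, which is where the compression comes from.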
4.2. Clustering algorithm
In this section, we describe our approach to efficiently clustering large sequence collections. It is based
on the slotted SPEX fingerprinting algorithm described in Section 3.3, and has linear-time performance and
low main-memory overheads.
The fingerprinting process identifies chunks that occur in the collection more than once. In the context of
sequence data, we use subsequences or words of length W as our chunks. For each word, the slotted SPEX
algorithm outputs a postings list of the sequences that contain the word and the offset into each sequence where
the word occurs. Our clustering algorithm uses these lists to identify candidate pairs: pairs of sequences
that share at least one chunk.
FIG. 5. Example cluster of heat shock proteins from the GenBank NR database. The union-sequence is shown at the
top, followed by ten member sequences with GI accession numbers shown in brackets.
Given the list of candidate pairs, we use a variation on single-linkage hierarchical clustering (Johnson,
1967) to build clusters, as follows. Initially, each sequence is considered to constitute a cluster with
one member. Candidate pairs are then processed in decreasing order of similarity score, from most to least
similar, and the pair of clusters that contain the candidate sequences are potentially merged.
In general, given two candidate clusters C_X and C_Y with union-sequences X and Y, respectively, the
following process is used to determine whether the clusters should be merged:
1. X and Y are aligned and the sequence space partitioned into a prefix, an overlap region, and a suffix.
2. The union-sequence candidate U for the new cluster is created by replacing each mismatched residue
in the overlap region with a suitable wildcard w.
3. The union-sequence candidate U is accepted if the mean alignment score increase Q̄ in the overlap
region is below a specified threshold T; this prevents union-sequences from containing too many
wildcards and reducing search performance.
If the clusters are merged, a new cluster C_U is created consisting of all members of C_X and C_Y. This
process is repeated for all candidate pairs. When inserting wildcards into the union-sequence, if more
than one wildcard is suitable then the one with the lowest expected match score e(w) = Σ_{r∈R} s(w, r)·p(r)
is selected, where p(r) is the background probability of residue r (Robinson and Robinson, 1991) and
s(w, r) is the alignment score for matching wildcard w to residue r. Calculation of wildcard alignment
vectors s(w, ·) is discussed in Section 4.3, and the selection of the pool of wildcards W is discussed in
Section 4.4.
The alignment score increase Q(w) for a wildcard w is calculated as

    Q(w) = Σ_{r∈R} s(w, r)·p(r) − Σ_{r1,r2∈R} s(r1, r2)·p(r1)·p(r2)

where s(r1, r2) is the score for matching a pair of residues as defined by a scoring matrix such as
BLOSUM62 (Henikoff and Henikoff, 1992). This value estimates the typical increase in alignment score
one can expect against arbitrary query residues by aligning against w instead of against the actual residue
at that position.
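As a concrete illustration, the two quantities used when inserting wildcards, e(w) and Q(w), can be computed directly from a scoring matrix and background probabilities. The sketch below substitutes a toy two-letter alphabet for BLOSUM62; the dictionaries are assumptions for illustration only.

```python
def expected_match_score(w, s_wild, p):
    """e(w): expected score of aligning wildcard w against a residue
    drawn from the background distribution p."""
    return sum(s_wild[(w, r)] * p[r] for r in p)

def score_increase(w, s_wild, s_res, p):
    """Q(w): expected gain in alignment score from scoring against
    wildcard w rather than against an actual background residue."""
    background = sum(s_res[(r1, r2)] * p[r1] * p[r2]
                     for r1 in p for r2 in p)
    return expected_match_score(w, s_wild, p) - background

# Toy alphabet {A, B} with a symmetric scoring matrix.
p = {"A": 0.5, "B": 0.5}
s_res = {("A", "A"): 2, ("B", "B"): 2, ("A", "B"): -1, ("B", "A"): -1}
s_wild = {("*", "A"): 2, ("*", "B"): 2}   # an optimistic default wildcard
```

A union-sequence candidate would be accepted when the mean of Q over its overlap region falls below the threshold T.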
Figure 6 illustrates the process of merging clusters. In this example, cluster X (which contains one
sequence) and cluster Y (which contains two sequences) are merged. A new cluster containing the members
of both X and Y is created, with a new union-sequence that contains wildcards at residue positions where
the three sequences differ.
FIG. 6. Merge of two example clusters. Cluster X contains a single sequence and cluster Y contains two sequences.
A new cluster is created that contains members of both clusters and has a new union-sequence to represent all three
member sequences.
FIG. 7. Illustration of top-down clustering where sequences l = {S1, S2, S3, S4, S5} contain the chunk RTMCS. Each
sequence is compared to every other sequence in the list and the sequence with the highest average percentage identity
(S1) is selected as the first member of a new cluster. Sequences S2 and S4 are highly similar to S1 and are also
included in the new cluster. The remaining sequences l = {S3, S5} are used to perform another iteration of top-down
clustering if |l| ≥ M.
The above approach works extremely well for relatively small databases; however, some words will
occur quite frequently in a typical database, resulting in a small number of long postings lists in larger
collections. These in turn consume a disproportionate amount of execution time. We process these long
postings lists (those with more than M entries, where we use M = 100 by default) in a different,
top-down manner before proceeding to the standard hierarchical clustering approach discussed above.
The top-down approach identifies clusters from a list of sequences l that contain a frequently occurring
chunk w as follows:
1. All sequences in l are loaded into main memory and aligned with each other.
2. An exemplar sequence is selected; this is the sequence with the highest average percentage identity to
the other sequences in l.
3. A new cluster C is created with the exemplar sequence as its first member.
4. Each sequence in l is compared to the union-sequence of the new cluster. Sequences where Q̄ < T are
added to the cluster in order from most to least similar using the approach we describe above.
5. All of the members of the new cluster C are removed from l and the process is repeated from step 1
until |l| < M.
The top-down clustering is illustrated in Figure 7. In this example, a list of five sequences that contain the
word RTMCS is processed using the top-down method. The sequence S1 has the highest average percentage
identity to the other sequences in l and is selected as the exemplar. A new cluster is created with S1 as
the first member, and sequences S2 and S4 are subsequently added. The three members of the new cluster
are removed from l, and the process is repeated until |l| < M.
Once the long postings lists have been processed by the top-down method, the shortened lists are processed
using the hierarchical clustering method described above. While the top-down process is laborious, it is
performed for fewer than 0.2% of postings lists when clustering the August 2005 release of
the GenBank non-redundant database with default parameters.
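The five steps above can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions: ungapped percentage identity stands in for BLAST alignment, a fixed identity threshold `admit` stands in for the Q̄ < T test, and no union-sequence is built.

```python
def identity(a, b):
    """Ungapped percentage identity between two sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def top_down_cluster(seqs, m=3, admit=0.9):
    """Peel clusters off a long postings list until fewer than m
    sequences remain; the remainder is handed back to the standard
    hierarchical clustering stage."""
    pool, clusters = list(range(len(seqs))), []
    while len(pool) >= m:
        # Step 2: exemplar = highest mean identity to the rest of the pool.
        ex = max(pool, key=lambda i: sum(identity(seqs[i], seqs[j])
                                         for j in pool if j != i))
        # Steps 3-4: admit sufficiently similar sequences, most similar first.
        members = [ex] + sorted(
            (i for i in pool if i != ex
             and identity(seqs[ex], seqs[i]) >= admit),
            key=lambda i: identity(seqs[ex], seqs[i]), reverse=True)
        clusters.append([seqs[i] for i in members])
        # Step 5: remove the new cluster's members and repeat.
        pool = [i for i in pool if i not in members]
    return clusters, [seqs[i] for i in pool]
```

Working over indices rather than the strings themselves keeps duplicate sequences distinct in the pool.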
4.3. Scoring with wildcards
We have modified BLAST to work with our clustering algorithm as follows. Instead of comparing the
query sequence to each member of the database, our approach compares the query only to the union-
sequence representing each cluster, where the union-sequence may contain wildcard characters. If a high-
scoring alignment between the union-sequence and query is identified, the members of the cluster are
reconstructed and aligned with the query. In this section we discuss how, given a set of wildcards W, we
determine the scoring vectors s(w_i, ·) for each w_i ∈ W that are used during search.
Ideally, we would like the score between a query sequence Q and a union-sequence U to be precisely
the highest score that would result from aligning Q against any of the sequences in cluster C_U. This
would result in no loss in sensitivity as well as no false positives. Unfortunately, such a scoring scheme is
not likely to be achievable without aligning against each sequence in every cluster, defeating much of the
purpose of clustering in the first place.
To maintain the speed of our approach, scoring of wildcards against residues must be on the basis of a
standard scoring vector s(w, ·) and cannot take into consideration any data about the sequences represented
by the cluster. Thus, scoring will involve a compromise between sensitivity (few false negatives) and speed
(few false positives). We describe two such compromises below, and finally show how to combine them
to achieve a good balance of sensitivity and speed.
During clustering, wildcards are inserted into the union-sequence to denote residue positions where
the cluster members differ. Let us define S = s_1, …, s_x with s_i ∈ W as the ordered sequence
of x wildcards substituted into union-sequences during clustering of a collection. Each occurrence of a
wildcard is used to represent a set of residues that appear in its position in the members of the cluster.
We define o ⊆ R as the set of residues represented by an occurrence of a wildcard in the collection and
O = o_1, …, o_x with o_i ⊆ R as the ordered sequence of substituted residue sets. The kth wildcard s_k that is
used to represent the set of residues o_k must be chosen such that o_k ⊆ s_k.
Our first scoring scheme, s_exp, builds the scoring vector for each wildcard by considering the actual
occurrence pattern of residues represented by that wildcard in the collection. Formally, we calculate the
expected best score s_exp as the mean of s(r, f) over all residues f actually represented by occurrences
of the wildcard:

    s_exp(w_i, r) = ( Σ_{j∈P_i} Σ_{f∈o_j} s(r, f) ) / ( Σ_{j∈P_i} |o_j| )

where P_i is the set of ordinal numbers of all substitutions using the wildcard w_i:

    P_i = {j | j ∈ N, j ≤ x, s_j = w_i}

This score can be interpreted as the mean score that would result from aligning residue r against the
actual residues represented by the wildcard w. This score has the potential to reduce search accuracy;
however, it distributes the scores well, and provides an excellent tradeoff between accuracy and speed.
The second scoring scheme, s_opt, calculates the optimistic alignment score of the wildcard w against each
residue. The optimistic score is the highest score for aligning residue r to any of the residues represented
by wildcard w. This is calculated as follows:

    s_opt(w, r) = max_{f∈w} s(r, f)

The optimistic score guarantees no loss in sensitivity: the score for aligning against a union-sequence
U using this scoring scheme is at least as high as the score for any of the sequences represented by U.
The problem is that in many cases the score for U is significantly higher, leading to false positives where
the union-sequence is flagged as a match despite none of the cluster members being sufficiently close to
the query. This results in substantially slower search.
The expected and optimistic scoring schemes represent two different compromises between sensitivity
and speed. We can adjust this balance by combining the two approaches using a mixture model. We define
a mixture parameter λ such that 0 ≤ λ ≤ 1. The mixture-model score for aligning wildcard w to residue r
is defined as:

    s(w, r) = λ · s_opt(w, r) + (1 − λ) · s_exp(w, r)

The score s(w, r) for each (w, r) pair is calculated when the collection is being clustered and then
recorded on disk in association with that collection. During a BLAST search, the wildcard scoring vectors
are loaded from disk and used to perform the search. We report experiments with varying values of λ in
Section 5. An example set of scoring vectors s(w_i, ·) derived using our approach is shown in
Figure 8.
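Given the definitions above, building a wildcard's scoring vector is mechanical. The sketch below is an assumption-laden illustration (a toy two-residue matrix in place of BLOSUM62); `occurrences` plays the role of the substituted residue sets o_j for one wildcard, and `lam` is the mixture parameter λ.

```python
def scoring_vector(occurrences, residues, s, lam):
    """Combine the optimistic and expected scores for one wildcard:
    s(w, r) = lam * s_opt(w, r) + (1 - lam) * s_exp(w, r)."""
    pool = [f for o in occurrences for f in o]   # residues actually replaced
    represented = set(pool)
    vector = {}
    for r in residues:
        # Optimistic: best score over the residues the wildcard stands for
        # (here approximated by those observed during clustering).
        s_opt = max(s[(r, f)] for f in represented)
        # Expected: mean score over all observed substituted residues.
        s_exp = sum(s[(r, f)] for f in pool) / len(pool)
        vector[r] = lam * s_opt + (1 - lam) * s_exp
    return vector

# Toy matrix over residues {A, B}; the wildcard replaced {A} once
# and {A, B} once during clustering.
s = {("A", "A"): 2, ("A", "B"): -1, ("B", "A"): -1, ("B", "B"): 2}
vec = scoring_vector([{"A"}, {"A", "B"}], "AB", s, lam=0.2)
```

With lam = 1 the vector reduces to the lossless optimistic scheme; with lam = 0 it is the expected scheme.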
FIG. 8. Scoring vectors for the wildcards from Table 1. The set of residues represented by each wildcard is given
in the left-hand column. The scoring vector provides an alignment score between each of the twenty-four amino acid
symbols and that wildcard.
4.4. Wildcard selection
Having defined a system for assigning a scoring vector to an arbitrary wildcard, we now describe
a method for selecting the set of wildcards to be used during the clustering process. Each wildcard w
represents a set of residues w ⊆ R and can only be used in place of the residues it represents when
inserted into a union-sequence. Our wildcard scoring scheme described in Section 4.3 depends on the
set of residues represented by w, so each wildcard has a unique scoring vector. A set of wildcards
W = {w_1, …, w_n} is used during clustering. We assume that one of these wildcards, w_n, is the default
wildcard that can represent any of the 24 residue and ambiguous codes; that is, w_n = R.
The remaining wildcards must be selected carefully: large residue sets can be used more frequently but
provide poor discrimination, with higher average alignment scores and more false positives. Conversely,
small residue sets discriminate well but can be used less frequently, thereby increasing the use of larger
residue sets such as the default wildcard.
The first aspect of choosing a set of wildcards to use for substitution is to decide on the size of this set. It
would be ideal to use as many wildcards as necessary, so that s_i = o_i for each substitution. However, each
wildcard must be encoded as a different character, and this approach would lead to a very large alphabet.
An enlarged alphabet would in turn lead to inefficiencies in BLAST due to larger lookup and scoring data
structures. Thus, a compromise is required. BLAST uses a set of 20 character codes to represent residues,
as well as 4 IUPAC-IUBMB ambiguous residue codes and an end-of-sequence sentinel code, resulting in a
total of 25 distinct codes. Each code is represented using 5 bits, permitting a total of 32 codes; this leaves
7 unused character codes. We have therefore chosen to use |W| = 7 wildcards.
We have investigated two different approaches to selecting a good set of wildcards. The first treats the
problem as an optimization scenario, and works as follows. We first cluster the collection as
described in Section 4.2 using only the default wildcard, i.e., W = {w_d}. We use the residue-substitution
sequence O from this clustering to create a set of candidate wildcards. Our goal can then be defined
as follows: we wish to select the set of wildcards W from the candidates such that the total average alignment
score A = Σ_{w∈S} Σ_{r∈R} s(w, r)·p(r) over all substitutions S is minimized. A lower A implies a reduction in the
number of high-scoring matches between a typical query sequence and union-sequences in the collection,
thereby reducing the number of false positives in which cluster members are fruitlessly recreated and
aligned to the query.
TABLE 1.

  Minimum alignment score      Physico-chemical classifications
  L,V,I,F,M                    L,V,I                      (Aliphatic)
  G,E,K,R,Q,H                  F,Y,H,W                    (Aromatic)
  A,V,T,I,X                    E,K,D,R,H                  (Charged)
  S,E,T,K,D,N                  L,A,G,V,K,I,F,Y,M,H,C,W    (Hydrophobic)
  L,V,T,P,R,F,Y,M,H,C,W        S,E,T,K,D,R,N,Q,Y,H,C,W    (Polar)
  A,G,S,D,P,H                  A,G,S,V,T,D,P,N,C          (Small)
  All residues                 All residues               (Default wildcard)

Each list is sorted in order from lowest to highest average alignment score A and contains seven
entries, including the default wildcard. The left-hand list is selected to minimize the average alignment
score A using a hill-climbing strategy; the right-hand list is based on the amino acid classifications
described in Taylor (1986).
In selecting the wildcard set W that minimizes A, we use the following greedy approach. First, we
initialize W to contain only the default wildcard w_d. We then scan through the candidate set and select
the wildcard that leads to the greatest overall reduction in A. This process is repeated until the set W is
filled, at each iteration considering the wildcards already in W in the calculation of A. Once W is full, we
employ a hill-climbing strategy in which we consider replacing each wildcard in W with a set of residues
from the candidates, with the aim of further reducing A.
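The greedy stage of this procedure can be sketched as below (the hill-climbing refinement is omitted). Everything here is illustrative: wildcards are modelled as frozensets of residues, each substituted residue set is charged to the cheapest chosen wildcard that covers it, and the toy scoring assigns +2 to residues a wildcard represents and -1 otherwise.

```python
def select_wildcards(candidates, substitutions, s, p, k):
    """Greedily grow W from the default wildcard, at each step adding
    the candidate that most reduces the total average alignment score
    A = sum over substitutions of e(w) for the covering wildcard."""
    def e(w):
        return sum(s[(w, r)] * p[r] for r in p)

    def total_A(ws):
        # Each substitution uses the lowest-scoring wildcard covering it;
        # the default wildcard always qualifies.
        return sum(min(e(w) for w in ws if o <= w) for o in substitutions)

    default = frozenset(p)            # w_n = R
    chosen = {default}
    while len(chosen) < k:
        remaining = [w for w in candidates if w not in chosen]
        if not remaining:
            break
        chosen.add(min(remaining, key=lambda w: total_A(chosen | {w})))
    return chosen

residues = "ABC"
p = {r: 1 / 3 for r in residues}
candidates = [frozenset("A"), frozenset("AB"), frozenset("B")]
s = {(w, r): (2 if r in w else -1)
     for w in candidates + [frozenset(residues)] for r in residues}
subs = [frozenset("A"), frozenset("A"), frozenset("AB")]
chosen = select_wildcards(candidates, subs, s, p, k=2)
```

Because {A} covers the two most frequent substitutions at the lowest expected score, the greedy step prefers it over the broader {A, B}.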
A set of wildcards was chosen by applying this strategy to the GenBank NR database (described in
Section 5). The left-hand column of Table 1 lists the wildcards that were identified using this approach
and used by default for the experiments reported in this paper.
We also consider defining wildcards based on groups of amino acids with similar physico-chemical
properties. We used the amino acid classifications described in Taylor (1986) to define the set of seven
wildcards shown in the right-hand column of Table 1. In addition to the default wildcard, six wildcards
were defined to represent the aliphatic, aromatic, charged, hydrophobic, polar, and small classes of amino
acids. We present experimental results for this alternative set of wildcards in the following section.
In this section we analyse the effect of our clustering strategy on collection size and search times. For
our assessments, we used version 1.65 of the ASTRAL Compendium (Chandonia et al., 2004) that uses
information from the SCOP database (Murzin et al., 1995; Andreeva et al., 2004) to classify sequences
with fold, superfamily, and family information. The database contains a total of 67,210 sequences classified
into 1,538 superfamilies.
A set of 8,759 test queries were extracted from the ASTRAL database such that no two of the queries
shared more than 90% identity. To measure search accuracy, each query was searched against the ASTRAL
database, and the commonly used Receiver Operating Characteristic (ROC) score was employed (Gribskov
and Robinson, 1996). A match between two sequences is considered positive if they are from the same
superfamily, otherwise it is considered negative. The ROC50 score provides a measure between 0 and 1,
where a higher score represents better sensitivity (detection of true positives) and selectivity (ranking true
positives ahead of false positives).
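For readers unfamiliar with the metric, a common formulation of the ROC_n score (following Gribskov and Robinson, 1996) can be computed as below; this is a generic sketch, not the paper's evaluation code.

```python
def roc_n(ranked_labels, total_positives, n=50):
    """ROC_n: mean number of true positives ranked ahead of each of the
    first n false positives, normalized by the total number of
    positives, giving a value in [0, 1].

    `ranked_labels` is True for a same-superfamily (positive) match and
    False for a negative one, ordered best score first."""
    tp, fp, area = 0, 0, 0
    for positive in ranked_labels:
        if positive:
            tp += 1
        else:
            fp += 1
            area += tp      # true positives ranked above this false positive
            if fp == n:
                break
    return area / (n * total_positives)
```

A score of 1.0 means every positive outranks the first n false positives; interleaving positives and negatives lowers the score toward 0.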
The SCOP database is too small to provide an accurate measure of search time, so we use the GenBank
non-redundant (NR) protein database to measure search speed. The GenBank collection was downloaded
on August 18, 2005 and contains 2,739,666 sequences in around 900 megabytes of sequence data. Performance
was measured using 50 queries randomly selected from GenBank NR. Each query was searched against
the entire collection three times, with the best runtime for each query recorded and the results averaged
across queries. Experiments were conducted on a Pentium 4 2.8-GHz machine with 2 gigabytes of main memory.
We used FSA-BLAST (our own version of BLAST) with default parameters as a baseline. To assess the
clustering scheme, the GenBank and ASTRAL databases were clustered and FSA-BLAST was configured
to report all high-scoring alignments, rather than only the best alignment from each cluster. All reported
collection sizes include sequence data and edit information but exclude sequence descriptions. CD-HIT
version 2.0.4 beta was used for experiments, with a 90% clustering threshold and maximum memory set
to 1.5 Gb. We also report results for NCBI-BLAST version 2.2.11 and our own implementation of Smith-
Waterman that uses the exact same scoring functions and statistics as BLAST (Karlin and Altschul, 1990;
Altschul and Gish, 1996). The Smith-Waterman results represent the highest possible degree of sensitivity
that could be achieved by BLAST and provide a meaningful reference point. No sequence filtering was
performed for our experiments in this paper.
The overall results for our clustering method are shown in Table 2. When used with the default settings
of λ = 0.2 and T = 0.25, and the set of wildcards selected to minimize alignment score in Table 1,
our clustering approach reduces the overall size of the NR database by 27% and improves search times
by 22%. Importantly, the ROC score indicates that there is no significant effect on search accuracy. The
highly redundant SCOP database reduces in size by 80% when clustered. If users are willing to
accept a small loss in accuracy, then the parameters λ = 0 and T = 0.3 improve search times by 27%
and reduce the size of the sequence collection by 28%, with a decrease of 0.001 in ROC score when
compared to our baseline. Since we are interested in improving performance with no loss in accuracy, we
do not consider these non-default settings further. Overall, our clustering approach with default parameters,
combined with improvements to the gapped alignment (Cameron et al., 2004) and hit detection (Cameron
et al., 2006) stages of BLAST, allows FSA-BLAST to be twice as fast as NCBI-BLAST with
no significant effect on accuracy. Both versions of BLAST produce ROC scores 0.017 below the optimal
Smith-Waterman algorithm.
The results in Table 2 also show that our scheme is an effective means of compressing protein sequences,
a task that has been deemed difficult by previous researchers (Nevill-Manning and Witten, 1999; Weiss
et al., 2000). Assuming a uniform, independent distribution of amino acids, protein sequence data can be
represented in 4.322 bits per symbol (Nevill-Manning and Witten, 1999). Our clustering scheme is able
to reduce the space required to store protein sequence data in the GenBank non-redundant database to
around 3.15 bits per symbol; to our knowledge, this is significantly better than the best previously reported
compression rate of 4.051 bits per symbol (Nevill-Manning and Witten, 1999).
In Table 3, we compare search accuracy and performance for varying wildcard sets. The set of wildcards
selected to minimize the average alignment score using the approach described in Section 4.4
provides the fastest search times and smallest collection size. The set of wildcards based on the physico-
chemical classifications of Taylor (1986) does not perform as well, with 3% slower search times. This is
not a surprising result; treating the selection of wildcards as an optimization problem allows us to choose
those that have the greatest direct impact on search performance. Nonetheless, it is interesting to note that
the wildcards selected by this process bear little resemblance to any of the traditional physico-chemical
amino-acid taxonomies. Finally, search performance was worse still when the collections were clustered
using only the default wildcard; this supports our approach of using multiple wildcards to construct clusters.
Figure 9 shows a comparison of clustering times between CD-HIT and our novel clustering approach
that uses union-sequences and wildcards for four different releases of the GenBank NR database; details
of the collections used are given in Table 4. The results show that the clustering time of our approach is
linear in the collection size, while the CD-HIT approach is superlinear (Fig. 9). On the most recent GenBank
non-redundant collection, CD-HIT is around 9 times slower than our approach, and we expect this ratio to
increase further with collection size.

TABLE 2.

  GenBank NR                   Time, sec (% baseline)   Sequence data, Mb (% baseline)   ROC50
  No clustering (baseline)     28.75 (100%)             900 (100%)                       0.398
  Cluster, λ = 0.2, T = 0.25   22.54 (78%)              655 (73%)                        0.398
  Cluster, λ = 0, T = 0.3      20.97 (73%)              650 (72%)                        0.397
  NCBI-BLAST                   45.75 (159%)             898 (100%)                       0.398
  Smith-Waterman               —                        —                                0.415

TABLE 3.

  Wildcard set                       Time, sec (% baseline)   Sequence data, Mb (% baseline)   ROC50
  Minimum alignment score            22.54 (78%)              655 (73%)                        0.398
  Physico-chemical classifications   23.25 (81%)              656 (73%)                        0.398
  Default wildcard only              23.49 (82%)              663 (76%)                        0.398

The first two rows of Table 3 contain results for the wildcard sets defined in Table 1. The third row contains
results for clustering with only the default wildcard W = {w_d}.

FIG. 9. Clustering performance for GenBank NR databases of varying sizes.

TABLE 4.

  Release date     Number of sequences   Collection size (Mb)   Size reduction (Mb)   % of collection
  16 July 2000     521,662               157                    45                    28.9%
  22 May 2003      1,436,591             443                    124                   28.1%
  30 June 2004     1,873,745             597                    165                   27.4%
  18 August 2005   2,739,666             900                    245                   27.3%
Table 4 shows the amount of redundancy in the GenBank NR database as it has grown over time,
measured using our clustering approach. We observe that redundancy is increasing at a rate roughly
proportional to collection size with the percentage reduction through clustering remaining almost constant
at 27–29% across versions of the collection tested. This suggests that redundancy will continue to affect
genomic data banks as they grow further in size.
Figure 10 shows the effect on accuracy of varying values of λ and T. We have chosen λ = 0.2 as a
default value because smaller values of λ result in a larger decrease in search accuracy, and larger values
reduce search speed. We observe that for λ = 0.2 there is little variation in search accuracy for values
of T between 0.05 and 0.3.
FIG. 10. Search accuracy for collections clustered with varying values of λ and T. Default values of λ = 0.2,
T = 0.25 are highlighted.
FIG. 11. Average BLAST search time using λ = 0.2 and varying values of T.
Figure 11 shows the effect on search times of varying values of T where λ = 0.2. As T increases,
the clustered collection becomes smaller, leading to faster search times. However, if T is too large then
union-sequences with a high percentage of wildcards are permitted, leading to an increase in the number
of cluster members that are recreated and a corresponding reduction in search speed. We have chosen the
value T = 0.25 that maximizes search speed.
Figure 12 shows the distribution of cluster sizes on log-log axes for the GenBank NR database; the linear
distribution of the data points on these axes suggests the sizes follow a power-law distribution. Around 55%
of clusters contain just two members, and the largest cluster contains 488 members. Of the ten largest
clusters identified by our approach, five relate to human immunodeficiency virus proteins, three relate to
cytochrome b, one relates to elongation factor 1α, and one relates to cytochrome oxidase subunit I. This
supports our previous observation that cluster size is proportional to the interest in a research area.
Sequence databanks such as GenBank contain a large number of redundant sequences. Such redundancy
has several negative effects including larger collection size, slower search, and difficult-to-interpret results.
Redundancy within a collection can lead to over-representation of alignments within particular protein
domains, distracting the user from other potentially important hits.
FIG. 12. Distribution of sizes for clusters identified in the GenBank NR database.
In this paper we have proposed novel techniques for both the detection and management of redundancy
in genomic sequence databanks.
For the detection of redundancy, we have explained how the successful document fingerprinting approach
can be adapted to genomic data and described slotted SPEX, a new chunk-selection heuristic for document
fingerprinting that is especially suitable for the genomic sequence domain. We have shown how, with the
use of document fingerprinting, we can cluster the GenBank nonredundant protein database nearly nine
times faster than the next-fastest approach.
We have also described a new approach to the management of redundancy. Instead of discarding
near-duplicate sequences, our approach identifies clusters of redundant sequences and constructs a special
union-sequence that represents all members of the cluster through the careful use of wildcard characters.
We present a new approach for searching clusters that, when combined with a well-chosen set of wildcards
and a system for scoring matches between wildcards and query residues, leads to faster search times
without a significant loss in accuracy. Moreover, by recording the differences between the union-sequence
and each cluster member as edit information, our approach compresses the collection. Our scheme is
general and can be adapted to most homology search tools.
We have integrated our clustering scheme into FSA-BLAST, an alternative implementation of BLAST that
is substantially faster than NCBI-BLAST and freely available for download. Our
results show that our clustering scheme reduces BLAST search times against the GenBank non-redundant
database by 22% and compresses sequence data by 27% with no significant effect on accuracy. We have
also described a new system for identifying clusters that uses fingerprinting, a technique that has been
successfully and extensively applied to the domain of redundant-document detection. Our implementation
can cluster the entire GenBank NR protein database in one hour on a standard workstation and scales
linearly in the size of the collection. We propose that pre-clustered copies of the GenBank collection be
made publicly available for download.
We have confined our experimental work to protein sequences and plan to investigate the effect of our
clustering scheme on nucleotide data as future work. We also plan to investigate the effect of our approach
on iterative search algorithms such as PSI-BLAST, and how our scheme can be used to improve the current
measure of the statistical significance of BLAST alignments.
We thank Peter Smooker and Michelle Chow for valuable suggestions. This work was supported by the
Australian Research Council.
Altschul, S., and Gish, W. 1996. Local alignment statistics. Methods Enzymol. 266, 460–480.
Altschul, S., Gish, W., Miller, W., et al. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Altschul, S., Madden, T., Schaffer, A., et al. 1997. Gapped BLAST and PSI–BLAST: a new generation of protein
database search programs. Nucleic Acids Res. 25, 3389–3402.
Andreeva, A., Howorth, D., Brenner, S., et al. 2004. SCOP database in 2004: refinements integrate structure and
sequence family data. Nucleic Acids Res. 32, D226–D229.
Bernstein, Y., and Zobel, J. 2004. A scalable system for identifying co-derivative documents. In Apostolico, A., and
Melucci, M., eds., Proc. String Processing and Information Retrieval Symposium (SPIRE), 55–67. Springer, Padova,
Bernstein, Y., and Zobel, J. 2005. Redundant documents and search effectiveness. Proc. 14th ACM Int. Conf. Inform.
Knowledge Manag., 736–743.
Bleasby, A.J., and Wootton, J.C. 1990. Construction of validated, non-redundant composite protein sequence databases.
Protein Eng. 3, 153–159.
Brin, S., Davis, J., and García-Molina, H. 1995. Copy detection mechanisms for digital documents. Proc. ACM
SIGMOD Annu. Conf., 398–409.
Broder, A.Z., Glassman, S.C., Manasse, M.S., et al. 1997. Syntactic clustering of the web. Comput. Networks ISDN
Syst. 29, 1157–1166.
Burke, J., Davison, D., and Hide, W. 1999. d2_cluster: a validated method for clustering EST and full-length DNA
sequences. Genome Res. 9, 1135–1142.
Cameron, M., Williams, H.E., and Cannane, A. 2004. Improved gapped alignment in BLAST. IEEE Trans. Comput.
Biol. Bioinform. 1, 116–129.
Cameron, M., Williams, H.E., and Cannane, A. 2006. A deterministic finite automaton for faster protein hit detection
in BLAST. J. Comput. Biol. 13, 965–978.
Chandonia, J., Hon, G., Walker, N., et al. 2004. The ASTRAL compendium in 2004. Nucleic Acids Res. 32, D189–
Cheung, C.-F., Yu, J.X., and Lu, H. 2005. Constructing suffix tree for gigabyte sequences with megabyte memory.
IEEE Trans. Knowledge Data Eng. 17, 90–105.
Fetterly, D., Manasse, M., and Najork, M. 2003. On the evolution of clusters of near-duplicate web pages. Proc. 1st
Latin Am. Web Congress, 37–45.
Gracy, J., and Argos, P. 1998. Automated protein sequence database classification. I. Integration of compositional
similarity search, local similarity search, and multiple sequence alignment. Bioinformatics 14, 164–173.
Gribskov, M., and Robinson, N. 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence
matching. Comput. Chem. 20, 25–33.
Grillo, G., Attimonelli, M., Liuni, S., et al. 1996. CLEANUP: a fast computer program for removing redundancies
from nucleotide sequence databases. Comput. Appl. Biosci. 12, 1–8.
Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge, UK.
Heintze, N. 1996. Scalable document fingerprinting. 1996 USENIX Workshop Electron. Commerce, 191–200.
Henikoff, S., and Henikoff, J. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA
89, 10915–10919.
Hoad, T.C., and Zobel, J. 2003. Methods for identifying versioned and plagiarised documents. J. Am. Soc. Inform. Sci.
Technol. 54, 203–215.
Holm, L., and Sander, C. 1998. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14, 423–429.
Itoh, M., Akutsu, T., and Kanehisa, M. 2004. Clustering of database sequences for fast homology search using upper
bounds on alignment score. Genome Inform. 15, 93–104.
Johnson, S. 1967. Hierarchical clustering schemes. Psychometrika 32, 241–254.
Kallberg, Y., and Persson, B. 1999. KIND—a non-redundant protein database. Bioinformatics 15, 260–261.
Karlin, S., and Altschul, S. 1990. Methods for assessing the statistical significance of molecular sequence features by
using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.
Li, W., Jaroszewski, L., and Godzik, A. 2001a. Clustering of highly homologous sequences to reduce the size of large
protein databases. Bioinformatics 17, 282–283.
Li, W., Jaroszewski, L., and Godzik, A. 2001b. Tolerating some redundancy significantly speeds up clustering of large
protein databases. Bioinformatics 18, 77–82.
Li, W., Jaroszewski, L., and Godzik, A. 2002. Sequence clustering strategies improve remote homology recognitions
while reducing search times. Protein Eng. 15, 643–649.
Malde, K., Coward, E., and Jonassen, I. 2003. Fast sequence clustering using a suffix array algorithm. Bioinformatics
19, 1221–1226.
Manber, U. 1994. Finding similar files in a large file system. Proc. USENIX Winter 1994 Techn. Conf., 1–10.
Manber, U., and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22,
Murzin, A., Brenner, S., Hubbard, T., et al. 1995. SCOP: a structural classification of proteins database for the
investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Nevill-Manning, C.G., and Witten, I.H. 1999. Protein is incompressible. DCC ’99 Proc. Conf. Data Compression, 257.
Park, J., Holm, L., Heger, A., et al. 2000. RSDB: representative sequence databases have high information content.
Bioinformatics 16, 458–464.
Parsons, J.D. 1995. Improved tools for DNA comparison and clustering. Comput. Appl. Biosci. 11, 603–613.
Pearson, W., and Lipman, D. 1985. Rapid and sensitive protein similarity searches. Science 227, 1435–1441.
Pearson, W., and Lipman, D. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA
85, 2444–2448.
Robinson, A., and Robinson, L. 1991. Distribution of glutamine and asparagine residues and their near neighbors in
peptides and proteins. Proc. Natl. Acad. Sci. USA 88, 8880–8884.
Shivakumar, N., and García-Molina, H. 1999. Finding near-replicas of documents on the web. WEBDB: Int. Workshop
World Wide Web Databases.
Smith, T., and Waterman, M. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Taylor, W. 1986. The classification of amino-acid conservation. J. Theoret. Biol. 119, 205–218.
Weiss, O., Jimenez-Montano, M., and Herzel, H. 2000. Information content of protein sequences. J. Theoret. Biol.
206, 379–386.
Witten, I.H., Moffat, A., and Bell, T.C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images.
Morgan Kaufmann, New York.
Address reprint requests to:
Dr. Michael Cameron
School of Computer Science and IT
RMIT University
GPO Box 2476V
Melbourne, Australia, 3001