CABIOS Vol. 13 no. 5 1997, Pages 549-554

Compression of nucleotide databases for fast searching

Hugh Williams and Justin Zobel

Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia

Received on April 29, 1997; accepted on May 22, 1997
Abstract

Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching.

Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools.

Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino.

Contact: E-mail: hugh@cs.rmit.edu.au
Introduction

Sequencing initiatives are contributing exponentially increasing quantities of nucleotide data to databases such as GenBank (Benson et al., 1993). We propose a new direct coding compression scheme for use in homology search applications such as FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990) and CAFE (Williams and Zobel, 1996a). This scheme yields compact storage, is lossless (nucleotide bases and wildcards are represented), and has extremely fast decompression.

Prior to proposing our scheme, we investigate benchmarks for practical compression and high-speed decompression of nucleotide data. We compare our scheme with the entropy, with Huffman coding, with the utilities gzip and compress, and with uncompressed data retrieval. All the compression methods closely approach the entropy, but direct coding is over nine times faster than Huffman coding and requires much
less memory; direct coding is also several times faster than the standard compression utilities. Direct coding requires ~25% of the space required to store uncompressed data and, due to savings in disk costs, has significantly lower retrieval times.
Database compression

Compression consists of two activities, modelling and coding (Rissanen and Langdon, 1981). A model for data to be compressed is a representation of the distinct symbols in the data and includes information, such as frequency, about each symbol. Coding is the process of producing a compressed representation of data, using the model to determine a code for each symbol. An efficient coding scheme assigns short codes to common symbols and long codes to rare symbols, optimizing code length overall.

Adaptive models (which evolve during coding) are currently favoured for general-purpose compression (Bell et al., 1990; Lelewer and Hirschberg, 1987), and are the basis of utilities such as compress. However, because databases are divided into records that must be independently decompressible (Zobel and Moffat, 1995), adaptive techniques are generally not effective. Similarly, arithmetic coding is in general the preferred coding technique, but it is slow for database applications (Bell et al., 1993).

For text, Huffman coding with a semi-static model (where modelling and coding are in separate phases) is preferable because it is faster and allows order-independent decompression. Such compression schemes can allow retrieval of data to be faster than with uncompressed data, since the computational cost of decompressing data can be offset by reductions in transfer costs from disk.
The compression efficiency of a technique can, for a given data set, be measured by comparison to the information content of the data, as represented by the entropy determined by Shannon's coding theorem (Shannon, 1951). Entropy is the compression that would be achievable with an ideal coding method using a simple semi-static model. For a set S of symbols in which each symbol t has probability of occurrence p_t, the entropy is:

    E(S) = \sum_t (-p_t \log_2 p_t)

bits per symbol.
Table 1. Probabilities of each base in GenBank (%)

Base   Probability     Base   Probability
A      27.483          N      0.737
B      ≈0               R      0.002
C      22.270          S      0.003
D      ≈0               T      26.508
G      22.985          V      ≈0
H      ≈0               W      0.001
K      0.002           Y      0.004
M      0.003
Implicit in this definition is the representation of the data as a set of symbol occurrences, i.e. modelling of the data using simple tokens. In some domains, different choices of tokens give vastly varying entropy; for example, in English text compression, choosing characters as tokens gives an entropy of ~5 bits/character, whereas choosing words as tokens gives an entropy of ~2 bits/character (Bell et al., 1990). The cost of having words as tokens is that more distinct tokens must be stored in the model, but for sufficiently large data sets the net size is still much less than with a model based on characters.
Entropy of nucleotide data

We now consider the entropy of nucleotide data. We first describe our test data.

In this paper, we measure the volume of DNA in megabases, i.e. units of 2^20 bases. In our nucleotide compression experiments, we have extracted sequences from GenBank (Flat-file Release 97.0, October 1996) to give two test collections: VERTE, a collection of 121 624 rodent, mammal, primate, vertebrate and invertebrate sequences containing 168.88 megabases; and GENBANK, the full database of 1 021 211 sequences containing 621.77 megabases. All the experiments in this paper were carried out on a Sun SPARC 20, with the machine otherwise largely idle.

A possible choice of symbol for nucleotide data is the distinct non-overlapping intervals in the data, where an interval is a string of bases of a fixed length n. While this token model may only capture simple patterns and not any semantics of genomic nucleotide data, this simple model is practical for comparison to high-speed compression schemes where complex structure determination is prohibitively computationally expensive.
For sequences divided into intervals of length n, the entropy is:

    E'(S) = \frac{1}{n} \sum_i (-p_i \log_2 p_i)

bits per base, where p_i is the probability of the occurrence of interval i. Note that one would expect a low entropy for short samples and long intervals; it is not a sign of pattern. Long intervals also imply a large model, since the number of distinct symbols to be stored will approach 4^n (or exceed it if there are occurrences of wildcards).
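To make the estimate concrete, the following Python sketch computes this per-base interval entropy and the number of distinct intervals for a small in-memory collection. The function name and the toy sequences are ours, not part of the published cino tool; the figures in Table 2 were of course computed over the full VERTE and GENBANK collections.

    import math
    from collections import Counter

    def interval_entropy(sequences, n):
        # Count non-overlapping intervals of length n; shorter strings from the
        # ends of sequences are counted as symbols in their own right.
        counts = Counter()
        total_bases = 0
        for seq in sequences:
            total_bases += len(seq)
            counts.update(seq[i:i + n] for i in range(0, len(seq), n))
        total_intervals = sum(counts.values())
        # Total information content in bits, then normalized per base.
        bits = -sum(c * math.log2(c / total_intervals) for c in counts.values())
        return bits / total_bases, len(counts)

    # Toy usage on two made-up fragments.
    entropy_per_base, distinct = interval_entropy(["ACGTACGTACGTNACGT", "TTGCATTGCATTGCAA"], 5)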
Table 2. Properties of GenBank, with sequences divided into intervals (entropy in bits per base, distinct intervals in model)

Interval length          1       5         8          10
VERTE     E'             1.98    1.97      1.96       1.94
          Intervals      15      7487      123 036    1 117 579
GENBANK   E'             2.04    2.02      2.00       1.98
          Intervals      15      25 981    462 422    2 928 638
Now we consider the entropy of our test collections. The results are shown in Table 2, giving the entropy E' and the number of distinct intervals for each collection and interval length. The entropy is almost exactly as expected for random data. [We further discuss estimation of entropy for these data elsewhere (Williams and Zobel, 1996b).] As another estimate of compressibility, we tested PPM predictive compression (Bell et al., 1990), currently the most effective general-purpose lossless compression technique, and found that even with a large model PPM was only able to compress to 2.06 bits per base on the GENBANK collection. (Note that PPM is adaptive, and rather slow, and hence unsuited to nucleotide data.) We therefore conclude that, as is commonly believed for genomic nucleotide sequences, there is little discernible pattern when compressing using simple token-based models and that compression to ~2 bits per base is a good result.

Other approaches to modelling can, however, yield better compression. Techniques that use more complex secondary structure to achieve additional compression, such as the palindromic repeats in DNA, are discussed in the section on structure-based coding.
Huffman coding

Huffman coding is a well-known technique for making an optimal assignment of variable-length codes to a set of symbols of known probabilities (Witten et al., 1994). Although not the best general-purpose coding method, Huffman coding is preferred for text databases in which records need not necessarily be decompressed in the order they were stored (Zobel and Moffat, 1995). We have experimentally applied an array-based efficient implementation of Huffman coding, known as canonical Huffman coding, to our test collections. [The implementation of canonical Huffman coding used is
incorporated into the MG text database system and is due to Moffat (Witten et al., 1994; Bell et al., 1995).] As symbols we used non-overlapping intervals of fixed length n for several choices of n. As sequence length is not always an exact multiple of n bases, the model includes not just strings of length n, but also shorter strings from the ends of sequences.

Results of the Huffman coding scheme for a range of interval lengths are shown in Table 3, with the compression rates including model size. A length of 1 was included to show that direct coding of individual bases is not very efficient; predictably, the scheme of allocating a fixed code to each base and wildcard did not work well. We have also experimented with larger values such as n = 10, but performance was poor, presumably due to constraints of hardware and the large model size.
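As an illustration of the approach, the Python sketch below builds a semi-static canonical Huffman code over interval counts. It is only a sketch of the idea under our own helper names: the experiments above used Moffat's array-based canonical Huffman implementation from the MG system, not this code.

    import heapq
    from collections import Counter

    def interval_counts(seq, n):
        # Non-overlapping intervals of length n; the final interval may be shorter,
        # as described above.
        return Counter(seq[i:i + n] for i in range(0, len(seq), n))

    def huffman_code_lengths(counts):
        # Standard heap-based Huffman construction, tracking only code lengths.
        if len(counts) == 1:
            return {next(iter(counts)): 1}
        heap = [(c, i, [sym]) for i, (sym, c) in enumerate(counts.items())]
        heapq.heapify(heap)
        lengths = dict.fromkeys(counts, 0)
        tie = len(heap)
        while len(heap) > 1:
            c1, _, group1 = heapq.heappop(heap)
            c2, _, group2 = heapq.heappop(heap)
            for sym in group1 + group2:
                lengths[sym] += 1
            heapq.heappush(heap, (c1 + c2, tie, group1 + group2))
            tie += 1
        return lengths

    def canonical_codes(lengths):
        # Canonical assignment: symbols sorted by (code length, symbol),
        # codes allocated in increasing numeric order.
        codes, code, prev_len = {}, 0, 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= (length - prev_len)
            codes[sym] = format(code, '0{}b'.format(length))
            code += 1
            prev_len = length
        return codes

    counts = interval_counts("ACGTACGTACGGGGTTTTAACCACGT", 5)
    codes = canonical_codes(huffman_code_lengths(counts))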
Table 3. Performance of Huffman coding

Property                       n    VERTE    GENBANK
Compression rate (Mb/s)        1    0.08     0.08
                               5    0.11     0.11
                               8    0.07     0.04
Decompression rate (Mb/s)      1    0.53     0.52
                               5    1.05     1.03
                               8    1.00     0.99
Compression (bits/base)        1    2.22     2.24
                               5    1.99     2.04
                               8    1.97     2.03
Overall, n = 5 has worked best: the model is fairly small and, on our hardware, tends to remain resident in the CPU cache, so that access to the intervals being decoded is as fast as possible. The actual decoding process is slightly more efficient for n = 8, but decompression is slower overall, again because of hardware cache constraints and a large model size.
Direct coding
As Table 1 shows, the frequency of wildcards in our test collections is extremely low; >99% of all characters are one of the four nucleotides and >97.8% of the wildcard occurrences are N. Because the data are highly skewed, we investigate a lossless compression scheme where the four nucleotide bases are encoded using two-bit representations and wildcards are stored compactly in a separate structure.

In the encoded sequence, we eliminate each wildcard occurrence by replacing it with a random nucleotide chosen from those represented by the wildcard. First, during decoding, it is less computationally expensive to recreate the original string by replacing the randomly chosen nucleotides with the original wildcards than to insert the wildcards into the sequence. Second, as wildcards are often not needed or used in searching of genomic databases, the random substitution of a base is more appropriate than deleting the wildcard to make a compression saving, as a deletion completely removes any semantic meaning from a sequence. This is an acceptable solution for some practical applications, and indeed it is an option in GenBank search software such as BLAST (Altschul et al., 1990). Having replaced all occurrences of wildcards, we code the sequence using two bits for each nucleotide base.
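A minimal Python sketch of this packing step follows. The helper names, the choice to place the first base of each group in the high-order bits, and the in-memory dictionary of wildcard positions are our illustrative assumptions; the cino tool itself is a separate C implementation and stores the wildcard information in the bit-level structure described next.

    import random

    CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    # IUPAC wildcard -> set of bases it can stand for.
    WILDCARDS = {'N': 'ACGT', 'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT',
                 'M': 'AC', 'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG'}

    def pack(seq):
        # Replace each wildcard by a random base it represents, remembering its
        # position, then pack four bases per byte (2 bits each).
        wildcard_positions = {}
        codes = []
        for pos, ch in enumerate(seq):
            if ch in WILDCARDS:
                wildcard_positions[pos] = ch
                ch = random.choice(WILDCARDS[ch])
            codes.append(CODE[ch])
        packed = bytearray()
        for i in range(0, len(codes), 4):
            group = codes[i:i + 4]
            byte = 0
            for c in group:
                byte = (byte << 2) | c
            byte <<= 2 * (4 - len(group))   # the last byte may be only half-full
            packed.append(byte)
        return len(seq), bytes(packed), wildcard_positions

    length, packed, wildcards = pack("ACGTNACGTB")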
Sequence length varies from ~10 bases to >400 000, with an average of ~650 bases. Therefore, the use of a fixed-length integer representation of sequence length would be space inefficient. We chose to use a variable-byte representation in which seven bits in each byte are used to code an integer, with the least significant bit set to 0 if this is the last byte, or to 1 if further bytes follow. In this way, we represent small integers compactly; for example, we represent 135, which lies in the range [2^7, 2^14), in two bytes as 00000011 00001110; this is read as 00000010000111 by removing the least significant bit from each byte and concatenating the remaining 14 bits.
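The Python sketch below reproduces this variable-byte code as we read the description: seven data bits per byte, most significant chunk first, with the least significant bit of each byte used as the continuation flag. The function names are ours, and the asserts check the worked example above.

    def vbyte_encode(n):
        # Seven data bits per byte, most significant chunk first; the least
        # significant bit of each byte is 1 if further bytes follow, 0 in the last.
        chunks = []
        while True:
            chunks.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunks.reverse()
        out = bytearray((c << 1) | 1 for c in chunks[:-1])
        out.append(chunks[-1] << 1)
        return bytes(out)

    def vbyte_decode(data):
        n = 0
        for byte in data:
            n = (n << 7) | (byte >> 1)
            if byte & 1 == 0:
                break
        return n

    # The worked example above: 135 is coded as 00000011 00001110.
    assert vbyte_encode(135) == bytes([0b00000011, 0b00001110])
    assert vbyte_decode(vbyte_encode(135)) == 135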
We then stored wildcard data independently, in a separate structure. First, we store in unary the count of different wildcards that occur in the sequence, where a unary integer n is a string of (n - 1) 0-bits terminated with a single 1-bit; in most sequences with wildcards, this is a single bit representing the occurrence of N. Second, for each different wildcard we stored a Huffman-coded representation of the wildcard (ranging from a single bit for N to 6 bits for the most uncommon wildcards), followed by a count of the number of occurrences, then a series of positions or offsets within the sequence.
Using this encoding scheme, there are at most 11 tuples of the form

    (w, count_w : [pos_1, ..., pos_p])

where w is the Huffman-coded representation of a wildcard, count_w is the number of occurrences and pos_1, ..., pos_p are the offsets at which w occurs.
As offsets may be of the order of 10^6 and counts of occurrences typically small, we must be careful to ensure that storing wildcard information does not waste space; variable-byte codes, for example, would be highly inefficient. The solution is to use variable-bit integer codings such as the Elias codes (Elias, 1975) and the Golomb codes (Golomb, 1966). We have used the Elias gamma codes to encode each count_w and Golomb codes to represent each sequence of offsets. These techniques are a variation on techniques used for inverted file compression, which has been successfully applied to large text databases (Bell et al., 1993) and to genomic databases (Williams and Zobel, 1996a,b).
Compression with Golomb codes, given the appropriate choice of a pre-calculated parameter, is better than with Elias coding. In particular, using Golomb codes, the maximum space required to store a list of positions for a given wildcard arises when that wildcard occupies every position; in this worst case, the storage requirement is 1 bit per position. Instead of storing absolute offsets, we store the differences between the offsets, which with Golomb codes can be represented in fewer bits. Thus each tuple is stored in the form:

    (w, count_w : [pos_1, (pos_2 - pos_1), ..., (pos_p - pos_{p-1})])
To illustrate wildcard storage, consider an example where the wildcard N occurs three times in a sequence, at offsets 253, 496 and 497, and the wildcard B occurs once, at offset 931. The other nine wildcards do not occur. Illustrating our example with the data as integers, the wildcard structure would be:

    [2 : (n, 3 : [253, 496, 497]), (b, 1 : [931])]

After taking differences, we have:

    [2 : (n, 3 : [253, 243, 1]), (b, 1 : [931])]
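The Python sketch below shows how such a structure can be laid out bit by bit on the worked example above. For brevity it uses the power-of-two (Rice) special case of the Golomb code with an arbitrary parameter rather than the pre-calculated parameter discussed above, substitutes placeholder codes for the Huffman codes of the wildcard symbols, and accumulates bits in a string instead of a packed buffer; the function names are ours.

    def unary(n):
        # n >= 1 as (n - 1) 0-bits followed by a single 1-bit, as defined above.
        return '0' * (n - 1) + '1'

    def elias_gamma(n):
        # n >= 1: (binary length of n minus 1) 0-bits, then n in binary.
        b = bin(n)[2:]
        return '0' * (len(b) - 1) + b

    def golomb_rice(n, k):
        # Rice code, the power-of-two special case of a Golomb code, for n >= 1
        # with parameter m = 2**k: quotient in unary, remainder in k bits.
        q = (n - 1) >> k
        r = (n - 1) & ((1 << k) - 1)
        tail = format(r, '0{}b'.format(k)) if k > 0 else ''
        return unary(q + 1) + tail

    # Placeholder codes standing in for the Huffman codes of the paper
    # (a single bit for N, a longer code for the rare wildcard B).
    WILDCARD_CODE = {'N': '1', 'B': '000001'}

    def encode_wildcard_structure(tuples, k=7):
        # tuples: [(wildcard, [absolute offsets])]; counts are Elias gamma coded
        # and the gaps between offsets are Golomb/Rice coded.
        bits = [unary(len(tuples))]
        for w, offsets in tuples:
            bits.append(WILDCARD_CODE[w])
            bits.append(elias_gamma(len(offsets)))
            prev = 0
            for off in offsets:
                bits.append(golomb_rice(off - prev, k))
                prev = off
        return ''.join(bits)

    # The worked example: N at 253, 496, 497 (gaps 253, 243, 1) and B at 931.
    encoded = encode_wildcard_structure([('N', [253, 496, 497]), ('B', [931])])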
To simplify sequence processing when wildcard information is not to be decoded, we store the length of the compressed wildcard data, again using the variable-byte coding scheme. A benefit of this scheme is that, for sequences with no wildcards, a length of zero is stored without any accompanying data structure, an overhead of a single byte.
With this representation of sequences, decoding has two phases. In the first phase, the bytes representing the sequence, each byte holding four 2-bit values, are mapped to four nucleotides through an array. This process is extremely fast; it is an insignificant fraction of disk fetch costs, for example. In the second phase, the tuples of wildcard information are decoded, and wildcard characters are overwritten on nucleotides at the indicated offsets.
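A Python sketch of this two-phase decode is given below, matching the layout assumed in the earlier packing sketch (first base of each byte in the high-order bits); the 256-entry table and the dictionary of wildcard positions are again our illustrative stand-ins for the structures described above.

    BASES = 'ACGT'

    # Phase 1 support: a 256-entry table mapping every possible byte value
    # to the four bases it encodes.
    BYTE_TO_BASES = [''.join(BASES[(b >> shift) & 3] for shift in (6, 4, 2, 0))
                     for b in range(256)]

    def unpack(length, packed, wildcard_positions):
        # Phase 1: map each byte to four nucleotides through the table.
        chars = [BYTE_TO_BASES[b] for b in packed]
        seq = list(''.join(chars)[:length])   # discard padding in the half-full last byte
        # Phase 2: overwrite wildcard characters at their recorded offsets.
        for pos, w in wildcard_positions.items():
            seq[pos] = w
        return ''.join(seq)

    # Example: "ACGTN" packed as two bytes (N was substituted by A during packing).
    assert unpack(5, bytes([0b00011011, 0b00000000]), {4: 'N'}) == "ACGTN"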
The first block of Table 4 shows results for this direct coding scheme. For VERTE, compression is ~0.05 bits per base higher than the entropy, and slightly higher again in the GENBANK collection, because the proportion of sequences containing wildcards increases from ~16% in the VERTE collection to 58% in GENBANK; this also results in a reduction in decompression speed from ~14 Mb/s for VERTE to ~11 Mb/s for GENBANK.

Overall, decompression speed is excellent, between 10 and 14 times faster than that given by Huffman coding. We have also shown, in the second block of Table 4, decompression rates without decoding of wildcards (as discussed above, some search tools are used without them); as can be seen, the impact of wildcards on time is small.
Table 4. Performance of direct coding

Property                            VERTE    GENBANK
With wildcards
  Compression (Mb/s)                0.36     0.51
  Decompression (Mb/s)              13.67    10.81
  Compression (bits/base)           2.02     2.09
Without decoding of wildcards
  Decompression (Mb/s)              14.07    13.44
Without wildcards
  Compression (Mb/s)                0.36     0.54
  Decompression (Mb/s)              14.75    14.27
  Compression (bits/base)           2.01     2.03
Retrieval of direct-coded data
  Sequential (Mb/s)                 13.67    10.81
  Random 10% (Mb/s)                 1.43     2.96
Retrieval of uncompressed data
  Sequential (Mb/s)                 4.12     2.97
  Random 10% (Mb/s)                 0.38     0.59
The third block of Table 4 shows compression performance with wildcards replaced by random matching nucleotides. This achieves compression of ~2.02 bits per base, as shown. The compressed data occupy slightly more than 2 bits per base because for each sequence we must store the sequence length and, since we store sequences byte-aligned, the last byte in the compressed sequence is on average only half-full. Note that in GENBANK the wildcards contribute disproportionately to decompression costs: they are 0.6% of the compressed data, but account for ~25% of the decompression time.

The last two blocks of Table 4 compare retrieval times for uncompressed data to those for direct-coded data. The first line in each block is the speed of sequential retrieval of all sequences: by using direct coding, the reduction in disk costs results in a 4-fold improvement in overall retrieval time. The second line in each block illustrates the further available improvement when retrieving only a fraction of the sequences: in this case, we retrieved a random 10% of the sequences and averaged the results over 10 such runs. In the case of random access, retrieval of direct-coded data is again over four times faster than with uncompressed data. We therefore expect that the use of direct coding in a retrieval system would significantly reduce retrieval times overall.

To test this hypothesis further, we incorporated the scheme into cafe, our genomic database retrieval engine (Williams and Zobel, 1996a), and found that retrieval times fell by >20%.
In BLAST (Altschul et al., 1990), a simple approach is taken to nucleotide compression. All occurrences of wildcards are replaced by a random choice of any of the four nucleotides. In addition to a count indicating sequence length, there is an indication of whether the sequence originally contained wildcards. BLAST achieves compression of 2.03 bits/base on the GENBANK collection using this scheme; this is a saving of 0.06 bits/base over our direct-coding scheme, but is lossy because wildcard data are discarded. To allow processing of sequences with wildcards, each sequence is also stored uncompressed, giving a total storage requirement of 10.03 bits per base.

With BLAST, a user preference during retrieval is optional wildcard matching, achieved by retrieving the original uncompressed data file for sequences with wildcards. As our results show, fetching these data will have a serious impact on query evaluation time because retrieval of uncompressed data is extremely slow.

Tools like BLAST inspect all the sequences in a database in response to a query, either decompressing them or processing them directly in compressed form. We have investigated alternatives based on indexing (Williams and Zobel, 1996a), but even with indexing a significant fraction of the database must be inspected during query evaluation. Fast decompression, or a format that can be processed directly, is thus crucial to efficient query processing.

Table 5 shows the results of using the compression tools gzip and compress on the VERTE and GENBANK collections. Both are relatively slow in compression and decompression, and require more bits per character than the direct coding scheme. Note that both methods are unsuitable for database compression, as both allow only sequential access to sequences.
Structure-based coding

A special-purpose compression algorithm for nucleotide data could take advantage of any secondary structure known to be present (Griffiths et al., 1993). For example, Grumbach and Tahi (1993) have used the palindromes that are known to occur commonly in DNA strings (without wildcards) to compress to <2 bits per base, typically saving 0.2 bits per base and in some cases rather more. The difficulty with such approaches is the cost of recognizing the structure: identification of palindromes is an expensive operation, and is complicated by the presence of wildcards. However, palindrome compression would be easy to integrate with our direct coding scheme, as the structure of wildcard information would not be affected.

Another possibility is vertical compression (Grumbach and Tahi, 1993): since sequences in GenBank are grouped, to some extent, by similarity, adjacent sequences may differ in only a few bases, and more frequently may share long common substrings. This similarity could be exploited by a compression technique, and again could easily be integrated with the direct coding, but would violate our principle that records be independently decodable.
Table 5. Performance of standard compression utilities

Scheme                              VERTE    GENBANK
gzip
  Compression (Mb/s)                0.23     0.41
  Decompression (Mb/s)              4.12     3.84
  Compression (bits/base)           2.07     2.14
compress
  Compression (Mb/s)                0.97     1.17
  Decompression (Mb/s)              2.30     2.1
  Compression (bits/base)           2.13     2.19
Conclusions

We have considered the problem of practical compression of databases of nucleotide sequences with wildcards, and have identified two lossless compression schemes that work well in practice. Our experimental evaluation of canonical Huffman coding with a semi-static model of fixed-length intervals showed that it gives excellent compression, but with the overhead of a large in-memory model and, at decompression rates of ~1 Mb/s, is somewhat slow.

Our compression method, a direct coding designed specifically for nucleotide sequences with wildcard characters, performs rather better. While the compression performance is slightly worse, by ~0.03 bits/base, than for Huffman coding, memory requirements are slight and sequences can be decompressed at up to 14 Mb/s. Such speed is vital to good searching performance, since current searching tools for nucleotide databases inspect a substantial fraction of the database in response to every query. We have shown that compression not only reduces space requirements, but that direct coding results in a 4-fold improvement in retrieval time compared with fetching of uncompressed data.

Acknowledgements

We are grateful to Alistair Moffat for his implementation of canonical Huffman coding. This work was supported by the Australian Research Council, the Centre for Intelligent Decision Systems, and the Multimedia Database Systems group at RMIT. A preliminary version of this work was presented in 'Practical Compression of Nucleotide Databases', Proceedings of the Australian Computer Science Conference, Melbourne, Australia, 1996, pp. 184-192.
References
Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Bell,T., Cleary,J. and Witten,I. (1990) Text Compression. Prentice-Hall, Englewood Cliffs, NJ.
Bell,T., Moffat,A., Nevill-Manning,C., Witten,I. and Zobel,J. (1993) Data compression in full-text retrieval systems. J. Am. Soc. Inf. Sci., 44, 508-531.
Bell,T., Moffat,A., Witten,I. and Zobel,J. (1995) The MG retrieval system: compressing for space and speed. Commun. ACM, 38, 41-42.
Benson,D., Lipman,D. and Ostell,J. (1993) GenBank. Nucleic Acids Res., 21, 2963-2965.
Elias,P. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, IT-21, 194-203.
Golomb,S. (1966) Run-length encodings. IEEE Trans. Inf. Theory, IT-12, 399-401.
Griffiths,A., Miller,J., Suzuki,D., Lewontin,R. and Gelbart,W. (1993) An Introduction to Genetic Analysis, 5th edn. Freeman, New York.
Grumbach,S. and Tahi,F. (1993) Compression of DNA sequences. In Storer,J. and Cohn,M. (eds), Proceedings of the IEEE Data Compression Conference, Snowbird, UT, pp. 340-350.
Lelewer,D. and Hirschberg,D. (1987) Data compression. Comput. Surv., 19, 261-296.
Pearson,W. and Lipman,D. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
Rissanen,J. and Langdon,G. (1981) Universal modeling and coding. IEEE Trans. Inf. Theory, IT-27, 12-23.
Shannon,C. (1951) Prediction and entropy of printed English. Bell Syst. Tech. J., 30, 55.
Williams,H. and Zobel,J. (1996a) Indexing nucleotide databases for fast query evaluation. In Proceedings of an International Conference on Advances in Database Technology (EDBT), Avignon, France. Lecture Notes in Computer Science 1057, Springer-Verlag, pp. 275-288.
Williams,H. and Zobel,J. (1996b) Practical compression of nucleotide databases. In Proceedings of the Australasian Computer Science Conference, Melbourne, Australia, pp. 184-193.
Witten,I., Moffat,A. and Bell,T. (1994) Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York.
Zobel,J. and Moffat,A. (1995) Adding compression to a full-text retrieval system. Software: Practice and Experience, 25, 891-903.