
CABIOS Vol. 13 no. 5 1997, Pages 549-554

Compression of nucleotide databases for fast searching

Hugh Williams and Justin Zobel

Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia

Received on April 29, 1997; accepted on May 22, 1997

Abstract

Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching.

Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools.

Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino.

Contact: E-mail: hugh@cs.rmit.edu.au

Introduction

Sequencing initiatives are contributing exponentially increasing quantities of nucleotide data to databases such as GenBank (Benson et al., 1993). We propose a new direct coding compression scheme for use in homology search applications such as FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990) and CAFE (Williams and Zobel, 1996a). This scheme yields compact storage, is lossless (nucleotide bases and wildcards are represented), and has extremely fast decompression.

Prior to proposing our scheme, we investigate benchmarks for practical compression and high-speed decompression of nucleotide data. We compare our scheme with the entropy, with Huffman coding, with the utilities gzip and compress, and with uncompressed data retrieval. All the compression methods closely approach the entropy, but direct coding is over nine times faster than Huffman coding and requires much less memory; direct coding is also several times faster than the standard compression utilities. Direct coding requires ~25% of the space required to store uncompressed data and, due to savings in disk costs, has significantly lower retrieval times.

Database compression

Compression consists of two activities, modelling and coding (Rissanen and Langdon, 1981). A model for data to be compressed is a representation of the distinct symbols in the data and includes information, such as frequency, about each symbol. Coding is the process of producing a compressed representation of data, using the model to determine a code for each symbol. An efficient coding scheme assigns short codes to common symbols and long codes to rare symbols, optimizing code length overall.

Adaptive models (which evolve during coding) are currently favoured for general-purpose compression (Bell et al., 1990; Lelewer and Hirschberg, 1987), and are the basis of utilities such as compress. However, because databases are divided into records that must be independently decompressible (Zobel and Moffat, 1995), adaptive techniques are generally not effective. Similarly, arithmetic coding is in general the preferred coding technique, but it is slow for database applications (Bell et al., 1993).

For text, Huffman coding with a semi-static model (where modelling and coding are in separate phases) is preferable because it is faster and allows order-independent decompression. Such compression schemes can allow retrieval of data to be faster than with uncompressed data, since the computational cost of decompressing data can be offset by reductions in transfer costs from disk.

The compression efficiency of a technique can, for a given data set, be measured by comparison to the information content of the data, as represented by the entropy determined by Shannon's coding theorem (Shannon, 1951). Entropy is the compression that would be achievable with an ideal coding method using a simple semi-static model. For a set S of symbols in which each symbol t has probability of occurrence p_t, the entropy is:

    E(S) = \sum_{t \in S} (-p_t \cdot \log_2 p_t)

bits per symbol.
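As an illustration of this calculation, the following minimal Python sketch (our addition, not part of any software described here) computes the entropy of a symbol distribution directly from symbol counts:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(seq):
    """Shannon entropy E(S) = sum over distinct symbols t of
    -p_t * log2(p_t), in bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

# Four equiprobable bases give the expected 2 bits per symbol:
print(entropy_bits_per_symbol("ACGT" * 1000))  # 2.0
```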


Table 1. Probabilities of each base in GenBank (%)

Base    Probability        Base    Probability
A       27.483             N       0.737
B       ~0                 R       0.002
C       22.270             S       0.003
D       ~0                 T       26.508
G       22.985             V       ~0
H       ~0                 W       0.001
K       0.002              Y       0.004
M       0.003

Implicit in this definition is the representation of the data as a set of symbol occurrences, i.e. modelling of the data using simple tokens. In some domains, different choices of tokens give vastly varying entropy; for example, in English text compression, choosing characters as tokens gives an entropy of ~5 bits/character, whereas choosing words as tokens gives an entropy of ~2 bits/character (Bell et al., 1990). The cost of having words as tokens is that more distinct tokens must be stored in the model, but for sufficiently large data sets the net size is still much less than with a model based on characters.

Entropy of nucleotide data

We now consider the entropy of nucleotide data. We first describe our test data.

In this paper, we measure the volume of DNA in megabases, i.e. units of 2^20 bases. In our nucleotide compression experiments, we have extracted sequences from GenBank (Flat-file Release 97.0, October 1996) to give two test collections: VERTE, a collection of 121 624 rodent, mammal, primate, vertebrate and invertebrate sequences containing 168.88 megabases; and GENBANK, the full database of 1 021 211 sequences containing 621.77 megabases. All the experiments in this paper were carried out on a Sun SPARC 20, with the machine otherwise largely idle.

A possible choice of symbol for nucleotide data is the distinct non-overlapping intervals in the data, where an interval is a string of bases of a fixed length n. While this token model may only capture simple patterns and not any semantics of genomic nucleotide data, this simple model is practical for comparison to high-speed compression schemes, where complex structure determination is prohibitively computationally expensive.

For sequences divided into intervals, the entropy is:

    E^n(S) = \frac{1}{n} \sum_i (-p_i \cdot \log_2 p_i)

bits per base, where p_i is the probability of the occurrence of interval i and the division by n normalizes from bits per interval to bits per base. Note that one would expect a low entropy for short samples and long intervals; this is not a sign of pattern. Long intervals also imply a large model, since the number of distinct symbols to be stored will approach 4^n (or exceed it if there are occurrences of wildcards).
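The same calculation extends directly to intervals. The sketch below is our illustration under one simplifying assumption: a short trailing string at the end of the sequence is counted as a symbol of its own, as in the Huffman model described later.

```python
import math
from collections import Counter

def interval_entropy(seq, n):
    """Per-base entropy of non-overlapping length-n intervals:
    E^n = (1/n) * sum over intervals i of -p_i * log2(p_i).
    Returns (bits per base, number of distinct intervals)."""
    intervals = [seq[k:k + n] for k in range(0, len(seq), n)]
    counts = Counter(intervals)
    total = len(intervals)
    bits_per_interval = -sum((c / total) * math.log2(c / total)
                             for c in counts.values())
    return bits_per_interval / n, len(counts)
```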

Table 2. Properties of the test collections, with sequences divided into intervals (entropy E^n in bits per base, distinct intervals in model)

                                Interval length n
                                1        5         8          10
VERTE      E^n                  1.98     1.97      1.96       1.94
           Distinct intervals   15       7 487     123 036    1 117 579
GENBANK    E^n                  2.04     2.02      2.00       1.98
           Distinct intervals   15       25 981    462 422    2 928 638

Now we consider the entropy of our test collections. The results are shown in Table 2, giving the entropy E^n and the number of distinct intervals for each collection and interval length. The entropy is almost exactly as expected for random data. [We further discuss estimation of entropy for these data elsewhere (Williams and Zobel, 1996b).] As another estimate of compressibility, we tested PPM predictive compression (Bell et al., 1990), currently the most effective general-purpose lossless compression technique, and found that even with a large model PPM was only able to compress to 2.06 bits per base on the GENBANK collection. (Note that PPM is adaptive, and rather slow, and hence unsuited to nucleotide data.) We therefore conclude that, as is commonly believed for genomic nucleotide sequences, there is little discernible pattern when compressing using simple token-based models, and that compression to ~2 bits per base is a good result.

Other approaches to modelling can, however, yield better compression. Techniques that use more complex secondary structure to achieve additional compression, such as the palindromic repeats in DNA, are discussed in the section on structure-based coding.

Huffman coding

Huffman coding is a well-known technique for making an optimal assignment of variable-length codes to a set of symbols of known probabilities (Witten et al., 1994). Although not the best general-purpose coding method, Huffman coding is preferred for text databases in which records need not necessarily be decompressed in the order they were stored (Zobel and Moffat, 1995). We have experimentally applied an efficient array-based implementation of Huffman coding, known as canonical Huffman coding, to our test collections. [The implementation of canonical Huffman coding used is incorporated into the MG text database system and is due to Moffat (Witten et al., 1994; Bell et al., 1995).] As symbols we used non-overlapping intervals of fixed length n for several choices of n. As sequence length is not always an exact multiple of n bases, the model includes not just strings of length n, but also shorter strings from the ends of sequences.
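For concreteness, the following compact sketch shows canonical code assignment; it is our own illustration, not the MG implementation. Optimal code lengths are obtained by the usual Huffman tree construction, and codewords are then handed out in order of (length, symbol), so a decoder needs only the sorted symbol list and the number of codes of each length:

```python
import heapq

def canonical_huffman(freqs):
    """Return a {symbol: codeword-string} map of canonical Huffman
    codes for a dict of symbol frequencies (at least two symbols)."""
    # Heap items: (weight, tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees deepens every leaf in them by one.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    lengths = heap[0][2]
    # Canonical assignment: consecutive integers within each length.
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, '0{}b'.format(length))
        code += 1
        prev_len = length
    return codes

print(canonical_huffman({'A': 45, 'C': 13, 'G': 12, 'T': 30}))
# {'A': '0', 'T': '10', 'C': '110', 'G': '111'}
```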

Results of the Huffman coding scheme for a range of interval lengths are shown in Table 3, with the compression rates including model size. A length of 1 was included to show that direct coding of individual bases is not very efficient; predictably, the scheme of allocating a fixed code to each base and wildcard did not work well. We have also experimented with larger values such as n = 10, but performance was poor, presumably due to constraints of hardware and the large model size.

Table 3. Performance of Huffman coding

Property                      n    VERTE    GENBANK
Compression rate (Mb/s)       1    0.08     0.08
                              5    0.11     0.11
                              8    0.07     0.04
Decompression rate (Mb/s)     1    0.53     0.52
                              5    1.05     1.03
                              8    1.00     0.99
Compression (bits/base)       1    2.22     2.24
                              5    1.99     2.04
                              8    1.97     2.03

Overall, n = 5 has worked best: the model is fairly small and, on our hardware, tends to remain resident in the CPU cache, so that access to intervals to be decoded is as fast as possible. The actual decoding process is slightly more efficient for n = 8, but decompression is slower overall because of hardware cache constraints and the larger model size.

Direct coding

As Table 1 shows, the frequency of wildcards in our test collections is extremely low; >99% of all characters are one of the four nucleotides, and >97.8% of the wildcard occurrences are N. Because the data are highly skewed, we investigate a lossless compression scheme where the four nucleotide bases are encoded using two-bit representations and wildcards are stored compactly in a separate structure.

In the encoded sequence, we eliminate each wildcard occurrence by replacing it with a random nucleotide chosen from those represented by the wildcard. There are two reasons for this approach. First, during decoding, it is more computationally efficient to recreate the original string by overwriting the randomly chosen nucleotides with the original wildcards than it is to insert the wildcards into the sequence. Second, as wildcards are often not needed or used in searching of genomic databases, the random substitution of a base is more appropriate than deleting the wildcard to make a compression saving, since a deletion completely removes any semantic meaning from a sequence. Random substitution is an acceptable solution for some practical applications; indeed, it is an option in GenBank search software such as BLAST (Altschul et al., 1990). Having replaced all occurrences of wildcards, we code the sequence using two bits for each nucleotide base.
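A minimal sketch of this encoding follows. It is our illustration: the IUPAC wildcard table and the plain Python list of (offset, wildcard) pairs are simplifying assumptions, since in the actual scheme the wildcard information is itself compressed, as described below.

```python
import random

BASE_CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
CODE_BASE = 'ACGT'
# IUPAC wildcard -> the set of bases it can stand for.
WILDCARD = {'N': 'ACGT', 'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT',
            'K': 'GT', 'M': 'AC', 'B': 'CGT', 'D': 'AGT',
            'H': 'ACT', 'V': 'ACG'}

def direct_encode(seq):
    """Replace each wildcard with a random base it matches, then pack
    four 2-bit base codes into each byte (final byte zero-padded)."""
    wildcards, bases = [], []
    for offset, ch in enumerate(seq):
        if ch in WILDCARD:
            wildcards.append((offset, ch))   # kept in a side structure
            ch = random.choice(WILDCARD[ch])
        bases.append(BASE_CODE[ch])
    packed = bytearray()
    for i in range(0, len(bases), 4):
        chunk = bases[i:i + 4]
        b = 0
        for code in chunk:
            b = (b << 2) | code
        packed.append(b << (2 * (4 - len(chunk))))
    return bytes(packed), len(seq), wildcards

def direct_decode(packed, length, wildcards):
    """Unpack the 2-bit codes, then overwrite the wildcard offsets."""
    out = []
    for b in packed:
        for shift in (6, 4, 2, 0):
            out.append(CODE_BASE[(b >> shift) & 3])
    out = out[:length]
    for offset, w in wildcards:
        out[offset] = w
    return ''.join(out)

assert direct_decode(*direct_encode('ACGTNACGT')) == 'ACGTNACGT'
```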

Sequence length varies from ~10 bases to >400 000, with an average of ~650 bases. Therefore, the use of a fixed-length integer representation of sequence length would be space inefficient. We chose to use a variable-byte representation in which seven bits in each byte are used to code an integer, with the least significant bit set to 0 if this is the last byte, or to 1 if further bytes follow. In this way, we represent small integers compactly; for example, we represent 135 in two bytes, since it lies in the range [2^7, 2^14), as 00000011 00001110; this is read as 00000010000111 by removing the least significant bit from each byte and concatenating the remaining 14 bits.
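A sketch of this variable-byte convention follows (our illustration; note that the continuation flag sits in the least significant bit of each byte, unlike the more common high-bit convention):

```python
def vbyte_encode(n):
    """Seven data bits per byte, most significant group first; the
    least significant bit is 1 if another byte follows, 0 in the last."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()
    return bytes((g << 1) | (1 if i < len(groups) - 1 else 0)
                 for i, g in enumerate(groups))

def vbyte_decode(data):
    n = 0
    for byte in data:
        n = (n << 7) | (byte >> 1)
        if byte & 1 == 0:        # least significant bit 0: last byte
            break
    return n

# The example from the text: 135 occupies two bytes.
assert vbyte_encode(135) == bytes([0b00000011, 0b00001110])
assert vbyte_decode(vbyte_encode(135)) == 135
```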

We then store wildcard data independently, in a separate structure. First, we store in unary the count of different wildcards that occur in the sequence, where a unary integer n is a string of (n - 1) 0-bits terminated with a single 1-bit; in most sequences with wildcards, this is a single bit representing the occurrence of N. Second, for each different wildcard we store a Huffman-coded representation of the wildcard (ranging from a single bit for N to 6 bits for the most uncommon wildcards), followed by a count of the number of occurrences, then a series of positions or offsets within the sequence.

Using this encoding scheme, there are at most 11 tuples of the form

    (w, count_w : [pos_1, ..., pos_p])

where w is the Huffman-coded representation of a wildcard, count_w is the number of occurrences, and pos_1, ..., pos_p are the offsets at which w occurs.

As offsets may be of the order of 10^6 and counts of occurrences typically small, we must be careful to ensure that storing wildcard information does not waste space; variable-byte codes, for example, would be highly inefficient. The solution is to use variable-bit integer codings such as the Elias codes (Elias, 1975) and the Golomb codes (Golomb, 1966). We have used the Elias gamma codes to encode each count_w and Golomb codes to represent each sequence of offsets. These techniques are a variation on techniques used for inverted file compression, which has been successfully applied to large text databases (Bell et al., 1993) and to genomic databases (Williams and Zobel, 1996a,b).
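The sketches below show these codes at the bit level, as strings of '0' and '1' for clarity. They follow the paper's unary convention, and the parameter b = 64 in the demonstration line is an arbitrary illustrative choice; in practice the Golomb parameter is pre-calculated from the data.

```python
def unary(n):
    """The paper's unary code: (n - 1) 0-bits terminated by a 1-bit."""
    return '0' * (n - 1) + '1'

def elias_gamma(x):
    """Elias gamma code for x >= 1: the number of binary digits of x
    in unary, then the digits of x after its leading 1-bit."""
    return unary(x.bit_length()) + bin(x)[3:]

def golomb(x, b):
    """Golomb code for x >= 1 with parameter b: quotient in unary,
    remainder in truncated binary."""
    q, r = divmod(x - 1, b)
    bits = unary(q + 1)
    if b > 1:
        c = (b - 1).bit_length()        # ceil(log2(b)) for b >= 2
        cutoff = (1 << c) - b
        if r < cutoff:
            bits += format(r, '0{}b'.format(c - 1))
        else:
            bits += format(r + cutoff, '0{}b'.format(c))
    return bits

# Coding the d-gaps of the worked example below, (N, 3 : [253, 243, 1]):
bits = elias_gamma(3) + ''.join(golomb(gap, 64) for gap in (253, 243, 1))
```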


Compression with Golomb codes, given the appropriate choice of a pre-calculated parameter, is better than with Elias coding. In particular, using Golomb codes, the maximum space required to store a list of positions for a given wildcard arises when that wildcard occupies every position; in this worst case, the storage requirement is 1 bit per position. Instead of storing absolute offsets, we store the differences between the offsets, which with Golomb codes can be represented in fewer bits. Thus each tuple is stored in the form:

    (w, count_w : [pos_1, (pos_2 - pos_1), ..., (pos_p - pos_{p-1})])

To illustrate wildcard storage, consider an example where the wildcard N occurs three times in a sequence, at offsets 253, 496 and 497, and the wildcard B occurs once, at offset 931. The other nine wildcards do not occur. Illustrating our example with the data as integers, the wildcard structure would be:

    [2 : (N, 3 : [253, 496, 497]), (B, 1 : [931])]

After taking differences, we have:

    [2 : (N, 3 : [253, 243, 1]), (B, 1 : [931])]

To simplify sequence processing when wildcard information is not to be decoded, we store the length of the compressed wildcard data, again using the variable-byte coding scheme. A benefit of this scheme is that, for sequences with no wildcards, a length of zero is stored without any accompanying data structure; the overhead is a single byte.

With this representation of sequences, decoding has two phases. In the first phase, the bytes representing the sequence, each byte holding four 2-bit values, are mapped to four nucleotides each through a lookup array. This process is extremely fast; it is an insignificant fraction of disk fetch costs, for example. In the second phase, the tuples of wildcard information are decoded, and wildcard characters are overwritten on the nucleotides at the indicated offsets.
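The first phase can be implemented with a 256-entry table built once at start-up, so decoding costs a single lookup per byte; a sketch consistent with the 2-bit packing shown earlier:

```python
CODE_BASE = 'ACGT'
# Map every possible byte value to its four bases, computed once.
BYTE_TO_BASES = [
    ''.join(CODE_BASE[(b >> shift) & 3] for shift in (6, 4, 2, 0))
    for b in range(256)
]

def fast_unpack(packed, length):
    """Phase one of decoding: expand each byte to four nucleotides
    and trim the padding in the final byte; wildcard tuples, if any,
    are applied afterwards in phase two."""
    return ''.join(BYTE_TO_BASES[b] for b in packed)[:length]
```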

The first block of Table 4 shows results for this direct coding scheme. For VERTE, compression is ~0.05 bits per base higher than the entropy, and slightly higher again in the GENBANK collection, because the proportion of sequences containing wildcards increases from ~16% in the VERTE collection to 58% in GENBANK; this also results in a reduction in decompression speed from ~14 Mb/s for VERTE to ~11 Mb/s for GENBANK.

Overall, decompression speed is excellent, between 10 and 14 times faster than that given by Huffman coding. We have also shown, in the second block of Table 4, decompression rates without decoding of wildcards (as discussed above, some search tools are used without them); as can be seen, the impact of wildcards on decompression time is small.

Table 4. Performance of direct coding

Property                              VERTE    GENBANK
With wildcards
  Compression (Mb/s)                  0.36     0.51
  Decompression (Mb/s)                13.67    10.81
  Compression (bits/base)             2.02     2.09
Without decoding of wildcards
  Decompression (Mb/s)                14.07    13.44
Without wildcards
  Compression (Mb/s)                  0.36     0.54
  Decompression (Mb/s)                14.75    14.27
  Compression (bits/base)             2.01     2.03
Retrieval of direct-coded data
  Sequential (Mb/s)                   13.67    10.81
  Random 10% (Mb/s)                   1.43     2.96
Retrieval of uncompressed data
  Sequential (Mb/s)                   4.12     2.97
  Random 10% (Mb/s)                   0.38     0.59

The third block of Table 4 shows compression performance with wildcards replaced by random matching nucleotides. This achieves compression of ~2.02 bits per base, as shown. The compressed data occupy slightly more than 2 bits per base because for each sequence we must store the sequence length and, since we store sequences byte-aligned, the last byte in the compressed sequence is on average only half-full. Note that in GENBANK the wildcards contribute disproportionately to decompression costs: they are 0.6% of the compressed data, but account for ~25% of the decompression time.

The last two blocks of Table 4 compare retrieval times for uncompressed data to those for direct-coded data. The first line in each block is the speed of sequential retrieval of all sequences: by using direct coding, the reduction in disk costs results in a 4-fold improvement in overall retrieval time. The second line in each block illustrates the further available improvement when retrieving only a fraction of the sequences: in this case, we retrieved a random 10% of the sequences and averaged the results over 10 such runs. In the case of random access, retrieval of direct-coded data is again over four times faster than with uncompressed data. We therefore expect that the use of direct coding in a retrieval system would significantly reduce retrieval times overall. To test this hypothesis further, we incorporated the scheme into cafe, our genomic database retrieval engine (Williams and Zobel, 1996a), and found that retrieval times fell by >20%.


In BLAST (Altschul et al., 1990), a simple approach is taken to nucleotide compression. All occurrences of wildcards are replaced by a random choice of any of the four nucleotides. In addition to a count indicating sequence length, there is an indication of whether the sequence originally contained wildcards. BLAST achieves compression of 2.03 bits/base on the GENBANK collection using this scheme; this is a saving of 0.06 bits/base over our direct-coding scheme, but is lossy because wildcard data are discarded. To allow processing of sequences with wildcards, each sequence is also stored uncompressed, giving a total storage requirement of 10.03 bits per base.

With BLAST, wildcard matching is a user preference during retrieval, implemented by retrieving the original uncompressed data file for sequences with wildcards. As our results show, fetching these data will have a serious impact on query evaluation time, because retrieval of uncompressed data is extremely slow.

Tools like BLAST inspect all the sequences in a database in response to a query, either decompressing them or processing them directly in compressed form. We have investigated alternatives based on indexing (Williams and Zobel, 1996a), but even with indexing a significant fraction of the database must be inspected during query evaluation. Fast decompression, or a format that can be processed directly, is thus crucial to efficient query processing.

Table 5 shows the results of using the compression tools gzip and compress on the VERTE and GENBANK collections. Both are relatively slow in compression and decompression, and require more bits per character than the direct coding scheme. Note that both methods are unsuitable for database compression, as both allow only sequential access to sequences.

Structure-based coding

A special-purpose compression algorithm for nucleotide data could take advantage of any secondary structure known to be present (Griffiths et al., 1993). For example, Grumbach and Tahi (1993) have used the palindromes that are known to occur commonly in DNA strings (without wildcards) to compress to <2 bits per base, typically saving 0.2 bits per base and in some cases rather more. The difficulty with such approaches is the cost of recognizing the structure: identification of palindromes is an expensive operation, and is complicated by the presence of wildcards. However, palindrome compression would be easy to integrate with our direct coding scheme, as the structure of wildcard information would not be affected.

Another possibility is vertical compression (Grumbach and Tahi, 1993): since sequences in GenBank are grouped, to some extent, by similarity, adjacent sequences may differ in only a few bases, and more frequently may share long common substrings. This similarity could be exploited by a compression technique, and again could easily be integrated with the direct coding, but would violate our principle that records be independently decodable.

Table 5. Performance of standard compression utilities

Scheme                          VERTE    GENBANK
gzip
  Compression (Mb/s)            0.23     0.41
  Decompression (Mb/s)          4.12     3.84
  Compression (bits/base)       2.07     2.14
compress
  Compression (Mb/s)            0.97     1.17
  Decompression (Mb/s)          2.30     2.1
  Compression (bits/base)       2.13     2.19

Conclusions

We have considered the problem of practical compression of databases of nucleotide sequences with wildcards, and have identified two lossless compression schemes that work well in practice. Our experimental evaluation of canonical Huffman coding with a semi-static model of fixed-length intervals showed that it gives excellent compression, but with the overhead of a large in-memory model and, at decompression rates of ~1 Mb/s, is somewhat slow.

Our compression method, a direct coding designed specifically for nucleotide sequences with wildcard characters, performs rather better. While the compression performance is slightly worse, by ~0.03 bits/base, than for Huffman coding, memory requirements are slight and sequences can be decompressed at up to 14 Mb/s. Such speed is vital to good searching performance, since current searching tools for nucleotide databases inspect a substantial fraction of the database in response to every query. We have shown that compression not only reduces space requirements, but that direct coding results in a 4-fold improvement in retrieval time compared with fetching of uncompressed data.

Acknowledgements

We are grateful to Alistair Moffat for his implementation of canonical Huffman coding. This work was supported by the Australian Research Council, the Centre for Intelligent Decision Systems, and the Multimedia Database Systems group at RMIT. A preliminary version of this work was presented in 'Practical Compression of Nucleotide Databases', Proceedings of the Australasian Computer Science Conference, Melbourne, Australia, 1996, pp. 184-193.


References

Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.

Bell,T., Cleary,J. and Witten,I. (1990) Text Compression. Prentice-Hall, Englewood Cliffs, NJ.

Bell,T., Moffat,A., Nevill-Manning,C., Witten,I. and Zobel,J. (1993) Data compression in full-text retrieval systems. J. Am. Soc. Inf. Sci., 44, 508-531.

Bell,T., Moffat,A., Witten,I. and Zobel,J. (1995) The MG retrieval system: compressing for space and speed. Commun. ACM, 38, 41-42.

Benson,D., Lipman,D. and Ostell,J. (1993) GenBank. Nucleic Acids Res., 21, 2963-2965.

Elias,P. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, IT-21, 194-203.

Golomb,S. (1966) Run-length encodings. IEEE Trans. Inf. Theory, IT-12, 399-401.

Griffiths,A., Miller,J., Suzuki,D., Lewontin,R. and Gelbart,W. (1993) An Introduction to Genetic Analysis, 5th edn. Freeman, New York.

Grumbach,S. and Tahi,F. (1993) Compression of DNA sequences. In Storer,J. and Cohn,M. (eds), Proceedings of the IEEE Data Compression Conference, Snowbird, UT, pp. 340-350.

Lelewer,D. and Hirschberg,D. (1987) Data compression. Comput. Surv., 19, 261-296.

Pearson,W. and Lipman,D. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.

Rissanen,J. and Langdon,G. (1981) Universal modeling and coding. IEEE Trans. Inf. Theory, IT-27, 12-23.

Shannon,C. (1951) Prediction and entropy of printed English. Bell Syst. Tech. J., 30, 50-64.

Williams,H. and Zobel,J. (1996a) Indexing nucleotide databases for fast query evaluation. In Proceedings of the International Conference on Advances in Database Technology (EDBT), Avignon, France. Lecture Notes in Computer Science 1057. Springer-Verlag, pp. 275-288.

Williams,H. and Zobel,J. (1996b) Practical compression of nucleotide databases. In Proceedings of the Australasian Computer Science Conference, Melbourne, Australia, pp. 184-193.

Witten,I., Moffat,A. and Bell,T. (1994) Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York.

Zobel,J. and Moffat,A. (1995) Adding compression to a full-text retrieval system. Software: Practice and Experience, 25, 891-903.
