Article

Practical compression of nucleotide databases

... While modern hardware can provide vast amounts of inexpensive storage, the compression of biological sequence data is still of paramount concern in order to facilitate fast search and retrieval operations, primarily by reducing the number of required I/O operations. Data compression requires two fundamental processes: modeling and coding (Williams and Zobel, 1996). Modeling involves constructing a representation of the distinct symbols in the data, along with any associated data, like the relative frequencies of the symbols (Williams and Zobel, 1996). Coding involves applying the model to each symbol in the data to produce a compressed representation of the data, preferably by assigning short codes to frequently occurring symbols and long codes to infrequently occurring symbols (El Naqa et al., 2018). In the case of DNA sequences, the finite set of nucleotide symbols {A, C, G, T} can be efficiently modeled as a corresponding set of binary values {00, 01, 10, 11} (Williams and Zobel, 1996). This model constitutes an effective binary representation where each nucleotide base is directly coded by two bits. ...

... Data compression requires two fundamental processes: modeling and coding (4). Modeling involves constructing a representation of the distinct symbols in the data, along with any associated data, like the relative frequencies of the symbols (4). Coding involves applying the model to each symbol in the data to produce a compressed representation of the data, preferably by assigning short codes to frequently occurring symbols and long codes to infrequently occurring symbols (4). A variety of dictionary methods, such as the Ziv-Lempel algorithms (5,6), can be employed to achieve this (7). ...
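To make the two-bit model quoted above concrete, here is a minimal sketch in Python that packs a nucleotide string into bytes and unpacks it again. It assumes the input contains only the four expected bases; the function names, the left-alignment of a short final group, and the separate length argument are illustrative choices, not part of the published scheme.

# A minimal sketch of two-bit direct coding for {A, C, G, T},
# assuming the input contains only these four bases. Function
# names and the padding convention are illustrative choices.

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack 4 bases per byte; the original length is stored separately."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        group = seq[i:i + 4]
        for base in group:
            byte = (byte << 2) | ENCODE[base]
        # Left-align a short final group so unpacking is uniform.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:n])

seq = "GATTACA"
assert unpack(pack(seq), len(seq)) == seq  # 7 bases fit in 2 bytes

Packing four bases per byte realizes the 2 bits per base of the direct model; any symbol outside {A, C, G, T} would need the auxiliary-symbol handling discussed elsewhere on this page.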
Article
Full-text available
While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. In this article, I propose the Differential Direct Coding algorithm, a general-purpose nucleotide compression protocol that can differentiate between sequence data and auxiliary data by supporting the inclusion of supplementary symbols that are not members of the set of expected nucleotide bases, thereby offering reconciliation between sequence-specific and general-purpose compression strategies. This algorithm permits a sequence to contain a rich lexicon of auxiliary symbols that can represent wildcards, annotation data and special subsequences, such as functional domains or special repeats. In particular, the representation of special subsequences can be incorporated to provide structure-based coding that increases the overall degree of compression. Moreover, supporting a robust set of symbols removes the requirement of wildcard elimination and restoration phases, resulting in a complexity of O(n) for execution time, making this algorithm suitable for very large data sets. Because this algorithm compresses data on the basis of triplets, it is highly amenable to interpretation as a polypeptide at decompression time. Also, an encoded sequence may be further compressed using other existing algorithms, like gzip, thereby maximizing the final degree of compression. Overall, the Differential Direct Coding algorithm can offer a beneficial impact on disk traffic for database queries and other disk-intensive operations.
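The abstract describes Differential Direct Coding only at a high level, so the following is a speculative sketch of how triplet-based coding can leave room for auxiliary symbols: the 4^3 = 64 possible base triplets occupy 64 byte values, and the remaining values are free for wildcards and annotation codes. The byte layout, the AUX table, and every name below are assumptions made for illustration, not the published format.

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

# Assumed byte-code layout (illustrative, not the published format):
#   0..63   full triplet   16*b0 + 4*b1 + b2
#   64..79  trailing pair  64 + 4*b0 + b1
#   80..83  trailing base  80 + b0
#   84..    auxiliary symbols (wildcards, annotations, ...)
AUX = {"N": 84, "-": 85}
AUX_INV = {v: k for k, v in AUX.items()}

def encode(seq: str) -> bytes:
    out, buf = bytearray(), []

    def flush():
        while len(buf) >= 3:
            b0, b1, b2 = buf[:3]
            del buf[:3]
            out.append(16 * b0 + 4 * b1 + b2)
        if len(buf) == 2:
            out.append(64 + 4 * buf[0] + buf[1]); buf.clear()
        elif len(buf) == 1:
            out.append(80 + buf[0]); buf.clear()

    for sym in seq:
        if sym in IDX:
            buf.append(IDX[sym])
            if len(buf) == 3:
                flush()
        else:
            flush()            # an auxiliary symbol interrupts the triplet
            out.append(AUX[sym])
    flush()
    return bytes(out)

def decode(data: bytes) -> str:
    parts = []
    for c in data:
        if c < 64:
            parts += [BASES[c >> 4], BASES[(c >> 2) & 3], BASES[c & 3]]
        elif c < 80:
            c -= 64; parts += [BASES[c >> 2], BASES[c & 3]]
        elif c < 84:
            parts.append(BASES[c - 80])
        else:
            parts.append(AUX_INV[c])
    return "".join(parts)

s = "ACGTNACG-T"
assert decode(encode(s)) == s

Because each full triplet decodes as a unit, such a byte stream is easy to reinterpret as a polypeptide at decompression time, and it remains a valid input for a general-purpose compressor such as gzip, both properties the abstract claims for the published algorithm.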
... The entropy is almost exactly as expected for random data. (We further discuss estimation of entropy for this data elsewhere (Williams and Zobel, 1996b).) As another estimate of compressibility, we tested PPM predictive compression (Bell et al., 1990), currently the most effective general-purpose lossless compression technique, and found that even with a large model PPM was only able to compress to 2.06 bits per base on the GenBank collection. ...
... We have used the Elias gamma codes to encode each count w and Golomb codes to represent each sequence of offsets. These techniques are a variation on techniques used for inverted file compression, which has been successfully applied to large text databases (Bell et al., 1993) and to genomic databases (Williams and Zobel, 1996a; Williams and Zobel, 1996b). Compression with Golomb codes, given the appropriate choice of a pre-calculated parameter, is better than with Elias coding. ...
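For readers unfamiliar with the two integer codes mentioned in this excerpt, here is a minimal sketch of Elias gamma and Golomb encoding, emitting bits as strings for readability; the gap-encoding usage and the parameter m = 4 are illustrative assumptions.

import math

def elias_gamma(n: int) -> str:
    """Elias gamma code for n >= 1: a unary length prefix, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def golomb(n: int, m: int) -> str:
    """Golomb code for n >= 0 with parameter m: unary quotient,
    then the remainder in truncated binary."""
    q, r = divmod(n, m)
    code = "1" * q + "0"                  # quotient in unary, 0-terminated
    b = max(1, math.ceil(math.log2(m)))   # remainder width in bits
    t = (1 << b) - m                      # truncated-binary threshold
    if r < t:
        return code + (format(r, "0{}b".format(b - 1)) if b > 1 else "")
    return code + format(r + t, "0{}b".format(b))

# Gap-encoding a sorted list of offsets, as in inverted file compression;
# the count is gamma-coded, each gap Golomb-coded with illustrative m = 4.
offsets = [3, 9, 11, 18]
gaps = [offsets[0]] + [y - x for x, y in zip(offsets, offsets[1:])]
print(elias_gamma(len(offsets)) + "".join(golomb(g, 4) for g in gaps))

The excerpt's observation holds here as well: with a well-chosen m, the Golomb code adapts to the gap distribution, which the parameter-free Elias gamma code cannot do.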
Article
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
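The abstract emphasizes independent record access and fast retrieval but does not specify a container format here. Purely as an illustration of that property, the sketch below lays packed records back-to-back and keeps a small table of offsets so that any sequence can be fetched with one seek; the table layout and all names are assumptions, not the actual cino format, and the payload bytes are the packed GATTACA example from earlier on this page.

import io

def write_records(f, packed_records):
    """packed_records: iterable of (packed_bytes, n_bases) pairs.
    Records are written back-to-back; returns a lookup table of
    (offset, byte_length, base_count) entries."""
    table = []
    for data, n_bases in packed_records:
        table.append((f.tell(), len(data), n_bases))
        f.write(data)
    return table

def read_record(f, entry):
    offset, n_bytes, n_bases = entry
    f.seek(offset)                 # one seek per record, regardless of order
    return f.read(n_bytes), n_bases

buf = io.BytesIO()
table = write_records(buf, [(b"\x8f\x10", 7), (b"\x1b", 4)])
assert read_record(buf, table[1]) == (b"\x1b", 4)

Because compressed records are smaller, each such seek also pulls fewer disk blocks, which is where the abstract's reported reduction in disk traffic and search time comes from.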
Conference Paper
A query to a nucleotide database is a DNA sequence. Answers are similar sequences, that is, sequences with a high-quality local alignment. Existing techniques for finding answers use exhaustive search, but it is likely that, with increasing database size, these algorithms will become prohibitively expensive. We have developed a partitioned search approach, in which local alignment string matching techniques are used in tandem with an index. We show that fixed-length substrings, or intervals, are a suitable basis for indexing in conjunction with local alignment on likely answers. By use of suitable compression techniques, the index size is held to an acceptable level, and queries can be evaluated several times more quickly than with exhaustive search techniques.
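As a rough illustration of the partitioned search idea, the sketch below indexes fixed-length intervals and uses them to shortlist candidate sequences before any expensive local alignment. The interval length K = 4, the minimum-hit threshold, and all names are illustrative assumptions, not the parameters used in the paper.

from collections import defaultdict

K = 4  # illustrative interval length

def build_index(records):
    """Map each non-overlapping fixed-length interval to the records
    that contain it."""
    index = defaultdict(set)
    for rec_id, seq in records.items():
        for i in range(0, len(seq) - K + 1, K):
            index[seq[i:i + K]].add(rec_id)
    return index

def candidates(index, query, min_hits=1):
    """Shortlist records sharing enough intervals with the query;
    only these would go on to local alignment."""
    hits = defaultdict(int)
    for i in range(len(query) - K + 1):   # slide over every query k-mer
        for rec_id in index.get(query[i:i + K], ()):
            hits[rec_id] += 1
    return [r for r, h in hits.items() if h >= min_hits]

db = {"s1": "ACGTACGTGGCC", "s2": "TTTTCCCCGGGG"}
idx = build_index(db)
print(candidates(idx, "ACGTACGT"))  # ['s1'] is shortlisted for alignment

Indexing non-overlapping intervals keeps the index small, while sliding the query window by one position guarantees that any sufficiently long exact match still shares at least one interval with the indexed partitioning.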