Article

A general-purpose compression scheme for large collections

Authors: Cannane and Williams

Abstract

Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically only slower than gzip in decompression. Moreover, the query evaluation costs of retrieving documents from a large collection with our search engine are improved by more than 30% when xray is incorporated, compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of the data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single-pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
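To make the general shape of such a scheme concrete, the sketch below (in Python) separates an offline training step, which selects frequent phrases from a sample, from a single online pass that greedily encodes each document against the fixed model so that documents remain individually decompressable. The phrase-selection heuristic and the function names (train_model, compress_doc) are illustrative assumptions, not the actual xray algorithm.

    from collections import Counter

    def train_model(sample: bytes, lengths=(8, 4, 2), max_phrases=4096):
        # Offline step: keep the most frequent substrings of a few fixed lengths
        # from the training sample (a stand-in heuristic, not xray's own).
        counts = Counter()
        for n in lengths:
            for i in range(len(sample) - n + 1):
                counts[sample[i:i + n]] += 1
        phrases = [p for p, c in counts.most_common(max_phrases) if c > 1]
        return {p: code for code, p in enumerate(phrases)}

    def compress_doc(doc: bytes, model):
        # Online step: one pass, greedy longest match against the fixed model;
        # bytes not covered by any phrase are emitted as literals.
        out, i = [], 0
        max_len = max((len(p) for p in model), default=0)
        while i < len(doc):
            match = None
            for n in range(min(max_len, len(doc) - i), 1, -1):
                if doc[i:i + n] in model:
                    match = doc[i:i + n]
                    break
            if match is not None:
                out.append(('P', model[match]))
                i += len(match)
            else:
                out.append(('L', doc[i]))
                i += 1
        return out

    def decompress_doc(tokens, model):
        inverse = {code: phrase for phrase, code in model.items()}
        return b''.join(inverse[v] if kind == 'P' else bytes([v]) for kind, v in tokens)

    if __name__ == '__main__':
        sample = b'the quick brown fox jumps over the lazy dog ' * 50
        model = train_model(sample)
        doc = b'the lazy dog jumps over the quick brown fox'
        assert decompress_doc(compress_doc(doc, model), model) == doc

Because the model never changes after training, documents added later can still be compressed and decompressed individually, which is the property the abstract emphasises.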

... Cannane and Williams proposed a semi-static phrase-based scheme called XRAY [7]. An offline model is first built from training samples selected from the data collection. ...
... We build a performance model to quantify CPU and disk utilization as well as data transmission time. With this information in place, the computing energy cost can be computed by a computing model (see (7)). Moreover, the cooling cost can be calculated by integrating a thermal model and the coefficient of performance (COP) model. ...
Conference Paper
Full-text available
This paper presents a novel Predictive Energy-Aware Management (PEAM) system that is able to reduce the energy costs of storage systems by appropriately selecting data transmission methods. In particular, we evaluate the energy costs of three methods (1. transfer data without archiving and compression, 2. archive and transfer data, 3. compress and transfer data) in preliminary experiments. According to the results, we observe that the energy consumption of data transmission greatly varies case by case. We cannot simply apply one method in all cases. Therefore, we design an energy prediction model that can estimate the total energy cost of data transmission by using particular transmission methods. Based on the model, our predictive energy-aware management system can automatically select the most energy efficient method for data transmission. Our experimental results show that our system performs better than simply selecting any one among the three methods for data transmission in terms of energy efficiency.
... Fortunately algorithms have been developed that substantially address this issue, by combining explicit description of redundancies with an effective means of accessing random records. Two such algorithms are SEQUITUR(Nevill-Manning and Witten 1997) and XRAY (Cannane and Williams 2002). ...
... Finally, it would have been possible to use an existing explicit recurrence encoding algorithm that supports synchronisation points, such as XRAY (Cannane and Williams 2002), however that algorithm was not known to this author when the work was carried out. Also, XRAY requires a collection-wide memory-resident model. ...
... Although PPM provides the best compression ratio, performing retrieval is difficult, whether directly or through indexing on the compressed file. On the other hand, word-based Huffman coding schemes (Moffat, Sacks-Davis, Wilkinson and Zobel 1993; Witten, Moffat and Bell 1999; Ziviani, Moura, Navarro and Baeza-Yates 2000; Cannane and Williams 2002) provide a better balance between compression ability and performance in indexing and searching. Several researchers (Adjeroh, Mukherjee, Bell, Powell and Zhang 2002; Amir, Benson and Farach 1996; Bunke and Csirik 1993, 1995; Manber 1997; Navarro and Raffinot 1999) have proposed matching algorithms that search for patterns directly in the compressed file, with or without preprocessing. ...
Conference Paper
For storing and transmitting data from one end to another, the data should be reduced in size to make better use of bandwidth and to increase device speed. Online compression is in great demand as more and more work moves to online platforms. Numerous static compression methods exist, but they are time consuming and require the complete data before compression; they need two passes, whereas online compression uses a single pass for encoding and decoding simultaneously. A dictionary-based compression technique can be applied to reduce the data size for better bandwidth utilization and hence faster transmission of data. In this paper we generate a dictionary of incoming data on a first-come, first-served basis. Initially the first block of some size is compressed using an adaptive tree method; a dictionary is generated simultaneously during processing, and a particular 9-bit code is assigned to each string generated with the LZW method on a forward-move basis. The efficiency is enhanced if strings repeat more often in the document. We use an intermediate store to hold incoming data during the transmission of previously compressed data. In this way we create a block of data of a particular size and apply compression on the block.
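For reference, the following minimal Python sketch shows the classic single-pass LZW behaviour that the paper builds on: the dictionary is grown while encoding, and the decoder reconstructs the same dictionary on the fly. The fixed 9-bit code width, the adaptive-tree step, and the forward-move policy described above are not reproduced here.

    def lzw_encode(data: bytes):
        # Single-pass LZW: the phrase dictionary grows during encoding.
        dictionary = {bytes([i]): i for i in range(256)}
        w, out = b'', []
        for b in data:
            wb = w + bytes([b])
            if wb in dictionary:
                w = wb
            else:
                out.append(dictionary[w])
                dictionary[wb] = len(dictionary)   # next free code
                w = bytes([b])
        if w:
            out.append(dictionary[w])
        return out

    def lzw_decode(codes):
        # The decoder rebuilds the same dictionary one step behind the encoder.
        dictionary = {i: bytes([i]) for i in range(256)}
        w = dictionary[codes[0]]
        out = [w]
        for code in codes[1:]:
            entry = dictionary[code] if code in dictionary else w + w[:1]
            out.append(entry)
            dictionary[len(dictionary)] = w + entry[:1]
            w = entry
        return b''.join(out)

    if __name__ == '__main__':
        text = b'abracadabra abracadabra abracadabra'
        assert lzw_decode(lzw_encode(text)) == text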
... During the decoding stage, the proposed method requires only storage for the public trie and the space for the text in the stream. We need 64 KB space for the 16 bit trie which compares favorably with, for example, the XRAY system, which has a 30 MB space requirement (Cannane and Williams 2002). We implemented the retrieval system based on the MG system. ...
Conference Paper
Full-text available
Summary form only given. With an increasing amount of text data being stored in compressed format, being able to access the compressed data randomly and decode it partially is highly desirable for efficient retrieval in many applications. The efficiency of these operations depends on the compression method used. We present a modified LZW algorithm that supports efficient indexing and searching on compressed files. Our method performs in a sublinear complexity, since we only decode a small portion of the file. The proposed approach not only provides the flexibility for dynamic indexing in different text granularities, but also provides the possibility for parallel processing in both encoding and decoding sides, independent of the number of processors available. It also provides good error resilience. The compression ratio is improved using the proposed modified LZW algorithm. Test results show that our public trie method has a compression ratio of 0.34 for the TREC corpus and 0.32 with text preprocessing using a star transform with an optimal static dictionary; this is very close to the efficient word Huffman and phrase based word Huffman schemes, but has a more flexible random access ability.
... There are many other proposed variants of the LZ78 family, such as the LZMW [242], the LZAP [322], RAY and XRAY [74,75] methods. (See [294] for more variants of the LZ77 and LZ78 compression family). ...
Article
TR-COSC 07/01 This paper provides a survey of techniques for pattern matching in compressed text and images. Normally compressed data needs to be decompressed before it is processed, but if the compression has been done in the right way, it is often possible to search the data without having to decompress it, or at least only partially decompress it. The problem can be divided into lossless and lossy compression methods, and then in each of these cases the pattern matching can be either exact or inexact. Much work has been reported in the literature on techniques for all of these cases, including algorithms that are suitable for pattern matching for various compression methods, and compression methods designed specifically for pattern matching. This work is surveyed in this paper. The paper also exposes the important relationship between pattern matching and compression, and proposes some performance measures for compressed pattern matching algorithms. Ideas and directions for future work are also described.
... This approach is not viable for large document collections as a suffix array and other auxiliary data structures of the complete collection are required for encoding. Another set of related techniques are grammar compressors, such as ray [10], xray [11] and re-pair [21]. Grammar compressors can achieve powerful compression but have enormous construction requirements, limiting their application to smaller collections. ...
Article
Full-text available
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in a LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections, that using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.
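The core idea, encoding the collection relative to a small sampled dictionary so that any document can be decoded independently, can be sketched as below. The brute-force longest-match search stands in for the suffix-array or automaton machinery a real implementation would use; names and cost constants are assumptions, not the authors' code.

    def rlz_encode(doc: bytes, dictionary: bytes):
        # Factor doc into (offset, length) copies from a fixed dictionary,
        # falling back to literals for bytes the dictionary cannot cover.
        factors, i = [], 0
        while i < len(doc):
            best_off, best_len, n = -1, 0, 1
            while n <= len(doc) - i:
                off = dictionary.find(doc[i:i + n])   # brute force for clarity
                if off < 0:
                    break
                best_off, best_len = off, n
                n += 1
            if best_len > 0:
                factors.append((best_off, best_len, None))
                i += best_len
            else:
                factors.append((0, 0, doc[i]))        # literal byte
                i += 1
        return factors

    def rlz_decode(factors, dictionary: bytes):
        return b''.join(dictionary[o:o + l] if lit is None else bytes([lit])
                        for o, l, lit in factors)

    if __name__ == '__main__':
        dictionary = b'compression of large document collections with a sampled dictionary '
        doc = b'large collections: sampled compression of documents!'
        assert rlz_decode(rlz_encode(doc, dictionary), dictionary) == doc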
Article
External sorting of large files of records involves use of disk space to store temporary files, processing time for sorting, and transfer time between CPU, cache, memory, and disk. Compression can reduce disk and transfer costs, and, in the case of external sorts, cut merge costs by reducing the number of runs. It is therefore plausible that overall costs of external sorting could be reduced through use of compression. In this paper, we propose new compression techniques for data consisting of sets of records. The best of these techniques, based on building a trie of variable-length common strings, provides fast compression and decompression and allows random access to individual records. We show experimentally that our trie-based compression leads to significant reduction in sorting costs; that is, it is faster to compress the data, sort it, and then decompress it than to sort the uncompressed data. While the degree of compression is not quite as great as can be obtained with adaptive techniques such as Lempel-Ziv methods, these cannot be applied to sorting. Our experiments show that, in comparison to approaches such as Huffman coding of fixed-length substrings, our novel trie-based method is faster and provides greater size reductions.
... On the one hand, this problem has been addressed in several compression schemes based on Manber's proposal. Algorithms such as Re-Pair [10] or XRay [7] present character-oriented approaches that use different statistical criteria to select the pairs of symbols. A word-oriented adaptation of Re-Pair [16] notably improves the compression ratios obtained. ...
... A cursory survey of recent conference proceedings and journal issues reveals that the resources created by the VLC/Web Track are being used quite routinely in studies reported outside TREC. For example, in the years 2000-2002, eight SIGIR papers [34, 2, 44, 15, 32, 3, 4, 39] and four TOIS articles [12, 11, 10, 9] made use of VLC/Web data and several others referred to the track or its methodology. A glance at the same forums for 2003 suggests that usage of VLC/Web Track resources and results is increasing still further. ...
... These are discussed below. Another choice would be to use XRAY [3], in which an initial block of data is used to train up a model. Each symbol, including all unique characters, is then allocated a bitwise code. ...
Conference Paper
Evaluating a query can involve manipulation of large volumes of temporary data. When the volume of data becomes too great, activities such as joins and sorting must use disk, and cost minimisation involves complex trade-offs. In this paper, we explore the effect of compression on the cost of external sorting. Reduction in the volume of data potentially allows costs to be reduced (through reductions in disk traffic and numbers of temporary files), but on-the-fly compression can be slow and many compression methods do not allow random access to individual records. We investigate a range of compression techniques for this problem, and develop successful methods based on common letter sequences. Our experiments show that, for a given memory limit, the overheads of compression outweigh the benefits for smaller data volumes, but for large files compression can yield substantial gains, of one-third of costs in the best case tested. Even when the data is stored uncompressed, our results show that incorporation of compression can significantly accelerate query processing.
... In particular, they are limited to cases where the characteristics of the data (for example, that it consists of ASCII text that can be parsed into words) are known in advance; and compression performance is relatively poor. Another family is based on dictionary inference [3,4,18,24]. These methods use the data (or a large part of it) to infer a dictionary represented as a simple hierarchical grammar, and then replace the bytes or words with references to tokens in the dictionary. ...
Conference Paper
Full-text available
Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency, through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios, while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact, yet similarly effective, dictionary. In previous work it was observed that, since the dictionary is generated by sampling, some of it (selected substrings) may be discarded with little loss in compression. Unfortunately, simple dictionary pruning approaches are ineffective. We develop a formal model of our approach, based on generating an optimal dictionary for a given collection within a memory bound. We generate measures for identification of low-value substrings in the dictionary, and show on a variety of sizes of text collection that halving the dictionary size leads to only marginal loss in compression ratio. This is a dramatic improvement on previous approaches.
... XRay [Cannane and Williams, 2002]. The XRay algorithm creates a dictionary using a portion of the message as training data. ...
... But [11] has not been demonstrated to scale up to multi-gigabyte genomic data [6]. Grammar-based compressors, such as XRAY [2] and RE-PAIR [22], and the LZ77-based compressor LZ-End [20], have enormous construction requirements, limiting their application to small collections. Other work addresses related problems over highly similar sequence collections, e.g., document listing [12] and top-k retrieval [27]. ...
Conference Paper
Full-text available
Efficiently storing and searching collections of similar strings, such as large populations of genomes or long change histories of documents from Wikis, is a timely and challenging problem. Several recent proposals could drastically reduce space requirements by exploiting the similarity between strings in so-called referencebased compression. However, these indexes are usually not searchable any more, i.e., in these methods search efficiency is sacrificed for storage efficiency. We propose Multi-Reference Compressed Search Indexes (MRCSI) as a framework for efficiently compressing dissimilar string collections. In contrast to previous works which can use only a single reference for compression, MRCSI (a) uses multiple references for achieving increased compression rates, where the reference set need not be specified by the user but is determined automatically, and (b) supports efficient approximate string searching with edit distance constraints. We prove that finding the smallest MRCSI is NP-hard. We then propose three heuristics for computing MRCSIs achieving increasing compression ratios. Compared to state-of-the-art competitors, our methods target an interesting and novel sweet-spot between high compression ratio versus search efficiency.
... Compression of large text collections has been extensively studied for decades, but new approaches continue to appear [5,9,13,15,26]. Amongst these, relative Lempel-Ziv (RLZ) [15] is one of the most effective compression techniques for large repositories. RLZ offers both good compression effectiveness and extremely fast atomic retrieval of individual documents. ...
Conference Paper
Full-text available
Compression is widely exploited in retrieval systems, such as search engines and text databases, to lower both retrieval costs and system latency. In particular, compression of repositories can reduce storage requirements and fetch times, while improving caching. One of the most effective techniques is relative Lempel-Ziv, RLZ, in which a RAM-resident dictionary encodes the collection. With RLZ, a specified document can be decoded independently and extremely fast, while maintaining a high compression ratio. For terabyte-scale collections, this dictionary need only be a fraction of a per cent of the original data size. However, as originally described, RLZ uses a static dictionary, against which encoding of new data may be inefficient. An obvious alternative is to generate a new dictionary solely from the new data. However, this approach may not be scalable because the combined RAM-resident dictionary will grow in proportion to the collection. In this paper, we describe effective techniques for extending the original dictionary to manage new data. With these techniques, a new auxiliary dictionary, relatively limited in size, is created by interrogating the original dictionary with the new data. Then, to compress this new data, we combine the auxiliary dictionary with some parts of the original dictionary (the latter in fact encoded as pointers into that original dictionary) to form a second dictionary. Our results show that excellent compression is available with only small auxiliary dictionaries, so that RLZ can feasibly transmit and store large, growing collections.
Article
A new electronic watermarking method is proposed in which the computer-generated hologram (CGH) technique is applied. Since the CGH used here is based on binary data, the original CGH data can be recovered by a binary process under some conditions, even if noise is added by media transformation. This advantage of the present method is demonstrated by a computer simulation. Further, when a random-phase CGH is used, the CGH pattern is distributed in a random fashion. Therefore, when embedding a signature image, the method is suitable for encoding secret information, such as the embedding location into the image, and for pattern decomposition of the signature image. As examples of this encoding, a method for interchanging the CGH regions and a method for combining the different CGHs are proposed. The effectiveness of the method is demonstrated by a simulation experiment. © 2000 Scripta Technica, Electron Comm Jpn Pt 3, 84(1): 21–31, 2001
Conference Paper
RDBMS, OODBMS and ORDBMS are used in daily life to hold large amounts of data. These systems use various techniques to retrieve data, and one of the important elements is sorting. Various sorting techniques have been devised over the years; one of the most popular is external sorting. This paper is concerned with analyzing and improving the algorithms in terms of time and space complexity, and discusses the various factors on which the time and space complexity of external sorting depend. Results were obtained using data sets of various sizes consisting of 2-byte integer values. Semi-compression and SIMD techniques are also applied to the algorithm to generate random testing scenarios.
Article
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once—usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
Conference Paper
Full-text available
Low storage capacity and slow access times are the main problems of the Database Management System (DBMS). In this paper, we compare the storage and access time of the columnar multi-block vector structure (CMBVS) with the Oracle 9i server. The experimental results show that CMBVS is about 31 times more efficient in storage cost and 21-70 times faster in retrieval time than the Oracle 9i server.
Article
In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. To be able to store, manage, organize and transport the data efficiently, text compression is necessary. We also need efficient search engines to speedily find the information from this huge mass of data, especially when it is compressed. In this chapter, we present a review of text compression algorithms, with particular emphasis on the LZ family algorithms, and present our current research on the family of Star compression algorithms. We discuss ways to search the information from its compressed format and introduce some recent work on compressed domain pattern matching, with a focus on a new two-pass compression algorithm based on LZW algorithm. We present the architecture of a compressed domain search and retrieval system for archival information and indicate its suitability for implementation in a parallel and distributed environment using random access property of the two-pass LZW algorithm.
Article
Efficiently storing and searching collections of similar strings, such as large populations of genomes or long change histories of documents from Wikis, is a timely and challenging problem. Several recent proposals could drastically reduce space requirements by exploiting the similarity between strings in so-called reference-based compression. However, these indexes are usually not searchable any more, i.e., in these methods search efficiency is sacrificed for storage efficiency. We propose Multi-Reference Compressed Search Indexes (MRCSI) as a framework for efficiently compressing dissimilar string collections. In contrast to previous works which can use only a single reference for compression, MRCSI (a) uses multiple references for achieving increased compression rates, where the reference set need not be specified by the user but is determined automatically, and (b) supports efficient approximate string searching with edit distance constraints. We prove that finding the smallest MRCSI is NP-hard. We then propose three heuristics for computing MRCSIs achieving increasing compression ratios. Compared to state-of-the-art competitors, our methods target an interesting and novel sweet-spot between high compression ratio versus search efficiency.
Thesis
Full-text available
The demand for information has multiplied in recent years, mainly due to globalized access to the WWW. This has led to a substantial increase in the size of text collections available in electronic form, whose compression not only yields space savings but also improves the efficiency of their input/output and network transmission processes. Text compression deals with information expressed in natural language. Identifying the redundancy underlying such texts therefore requires a word-oriented perspective, the word being the minimal unit of information used in communication between people. This thesis studies this context from three complementary perspectives, whose results yield a set of specific text compressors. Natural language has particular properties, both in the size of the word vocabulary identified in the text and in the frequency distribution of each word. Universal compression techniques, however, cannot specifically exploit these properties because they do not restrict the type of messages they take as input. The first proposal of this thesis focuses on building a preprocessing scheme (called Word-Codeword Improved Mapping: WCIM) that transforms the original text into a more redundant representation that favours its compression with classical techniques. Despite its simplicity and effectiveness, this proposal does not handle a relevant aspect of natural language: the relationship between words. The Edge-Guided (E-G) family of techniques uses the adjacency relationship between symbols as the basis for representing the text. The E-G1 compressor builds a word-oriented order-1 model whose representation is materialized in the edges of a directed graph. E-Gk, in turn, extends the original vocabulary with a set of significant word sequences (phrases) represented through a context-free grammar. The original graph model evolves into a phrase-oriented order-1 model in which the hierarchical relationship between the constituent words can be exploited through the information stored in the grammar. Both E-G1 and E-Gk use the information stored in the graph edges to build their coding scheme based on a Huffman code. Bilingual parallel corpora (bitexts) consist of two natural-language texts that express the same information in two different languages. This property adds a type of redundancy not addressed in the previous cases: semantic redundancy. Our proposals in this context focus on the representation of aligned bitexts, whose use is essential in numerous translation-related applications. To this end we introduce the concept of the biword as the symbolic unit of representation, and we propose techniques based on its structural properties (Translation Relationship-based Compressor: TRC) and its semantic properties (Two-Level Compressor for Aligned Bitexts: 2LCAB). Both proposals analyze the effect on compression of using different bitext alignment strategies.
In addition, 2LCAB provides a pattern-matching-based search mechanism that allows different kinds of operations to be carried out on the compressed text. Experiments on reference corpora in each of these contexts demonstrate the competitiveness of each of the proposed compressors. The results obtained with 2LCAB are especially significant, since they support the first known proposal that enables monolingual and cross-lingual querying over a compressed bitext. This property decouples the language in which results are retrieved from the language used in the query, making 2LCAB a competitive alternative as a search engine in different translation tools.
Article
Current compression solutions either use a limited-size locality-based context or the entire input, to which the compressors adapt. This results in suboptimal compression effectiveness, due to missing similarities that lie further apart in the former case, or due to overly generic adaptation in the latter. There are many deduplication and near-deduplication systems that search for similarity across the entire input. Although most of these systems excel in their simplicity and speed, none of them goes deeper in terms of full-scale redundancy removal. We propose a novel compression and archival system called ICBCS. Our system goes beyond standard measures for similarity detection, using extended similarity hashes and incremental clustering techniques to determine groups of sufficiently similar chunks designated for compression. ICBCS outperforms conventional file compression solutions on datasets consisting of at least mildly redundant files. It also shows that selective application of a weak compressor results in a better compression ratio and speed than conventional application of a strong compressor.
Article
Full-text available
Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule. While such schemes usually operate on-line, making a replacement as soon as repetition is detected, off-line operation permits greater freedom in choosing the order of replacement. In this paper, we compare the on-line method with three off-line heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the grammar locally. Surprisingly, two of the off-line techniques, like the on-line method, run in time linear in the size of the input. We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the on-line technique, and, lagging by some distance, the longest-first technique.
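A toy version of the greedy off-line strategy, in Python, is shown below: at each step a repeated substring is chosen by an estimated-saving score and replaced everywhere by a fresh nonterminal, yielding a small hierarchical grammar. The (occurrences - 1) * (length - 1) score is a rough proxy for the locally-most-compressive idea and is an assumption of this sketch, not the paper's exact measure.

    from collections import Counter

    def best_substring(s, max_len=20):
        # Pick the repeated substring with the highest estimated saving,
        # scored as (occurrences - 1) * (length - 1); a rough proxy only.
        best, best_score = None, 0
        for n in range(2, min(max_len, len(s) // 2) + 1):
            for sub, c in Counter(s[i:i + n] for i in range(len(s) - n + 1)).items():
                score = (c - 1) * (n - 1)
                if c > 1 and score > best_score:
                    best, best_score = sub, score
        return best

    def offline_grammar(s, max_rules=26):
        # Greedy off-line substitution: replace the chosen substring everywhere
        # with a fresh nonterminal (uppercase letters, purely for illustration).
        rules = {}
        for k in range(max_rules):
            sub = best_substring(s)
            if sub is None:
                break
            nonterminal = chr(ord('A') + k)
            rules[nonterminal] = sub
            s = s.replace(sub, nonterminal)
        return s, rules

    if __name__ == '__main__':
        start, rules = offline_grammar('how much wood would a woodchuck chuck '
                                       'if a woodchuck could chuck wood')
        print(start)
        print(rules)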
Article
Full-text available
The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the ∼120-megabase euchromatic portion of theDrosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes ∼13,600 genes, somewhat fewer than the smaller Caenorhabditis elegansgenome, but with comparable functional diversity.
Article
Full-text available
PIR-International is an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. A major objective of PIR-International is to continue the development of the Protein Sequence Database as an essential public resource for protein sequence information. This paper briefly describes the architecture of the Protein Sequence Database and how it and associated data sets are distributed and can be accessed electronically.
Article
Full-text available
The GenBank® sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (Web) or Sequin programs to format and send sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are included as an additional source of biological annotation through the PubMed search system. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, Email, and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the URL: http://www.ncbi.nlm.nih.gov
Article
Full-text available
... When searching complex or approximate patterns, our algorithms are up to 8 times faster than searching the uncompressed text. We also discuss the impact of our technique on inverted files pointing to logical blocks, and argue for the possibility of keeping the text compressed at all times, decompressing only for display purposes.
Article
Full-text available
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols.
Article
From its origin the Protein Sequence Database has been designed to support research and has focused on comprehensive coverage, quality control and organization of the data in accordance with biological principles. Since 1988 the database has been maintained collaboratively within the framework of PIR-International, an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. The database is widely distributed and is available on the World Wide Web, via ftp, email server, on CD-ROM and magnetic media. It is widely redistributed and incorporated into many other protein sequence data compilations, including SWISS-PROT and the Entrez system of the NCBI.
Article
This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.
Article
The paper describes the advantages of using variable length equifrequent fragments as the language elements for compression coding of large data bases. The problem of finding the minimum space representation of the fragment compressed data base is equivalent to solving a shortest path problem. Three alternate algorithms for compression are described and their performance is tested on both word and text fragments. It is found that text fragments selected by a longest match algorithm produce the best results with regard to compression and use of processing time.
Article
The development of efficient algorithms to support arithmetic coding has meant that powerful models of text can now be used for data compression. Here the implementation of models based on recognizing and recording words is considered. Move-to-front and several variable-order Markov models have been tested with a number of different data structures; first the decisions that went into the implementations are discussed, and then experimental results are given that show English text being represented in under 2.2 bits per character. Moreover, the programs run at speeds comparable to other compression techniques, and are suited for practical use.
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Article
Substantial degrees of compression of bibliographical data bases result from the application to them of a modified form of run-length coding. The method involves attenuation of the zero:one bit ratio of the data base. This can be achieved by substituting codes with the highest zero:one ratios for the most frequent symbols, or by substituting 2-byte codes for digrams. A form of run-length coding in which the run length is represented as a fixed-length binary number is then applied.
Article
A computer program, developed as a psychological model of speech segmentation, is presented as a method of recoding natural language for economical storage or transmission. The program builds a dictionary of frequently occurring letter strings. Where these strings occur in a text they may be replaced by a short code, thus effecting a compression of up to 49%. The strings may also be used as key 'words' in a document retrieval system. The method has the particular merit of simplicity in building the dictionary and efficiency in encoding data.
Article
Data bases always contain some sequences of characters which occur more frequently than others. This paper provides a technique for data base compression which treats the more frequent sequences as 'common factors'. The common factors are recoded in condensed form and a typical data base may then occupy less than sixty percent of its original storage space. In addition to storage economy, the technique provides for reduced data transmission time and has certain advantages from the security angle.
Article
Compression of databases not only reduces space requirements but can also reduce overall retrieval times. In text databases, compression of documents based on semistatic modeling with words has been shown to be both practical and fast. Similarly, for specific applications—such as databases of integers or scientific databases—specially designed semistatic compression schemes work well. We propose a scheme for general-purpose compression that can be applied to all types of data stored in large collections. We describe our approach—which we call RAY—in detail, and show experimentally the compression available, compression and decompression costs, and performance as a stream and random-access technique. We show that, in many cases, RAY achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques, and that it can be used as an efficient general-purpose compression scheme.
Article
A general model for data compression which includes most data compression systems in the literature as special cases is presented. Macro schemes are based on the principle of finding redundant strings or patterns and replacing them by pointers to a common copy. Different varieties of macro schemes may be defined by specifying the meaning of a pointer; that is, a pointer may indicate a substring of the compressed string, a substring of the original string, or a substring of some other string such as an external dictionary. Other varieties of macro schemes may be defined by restricting the type of overlapping or recursion that may be used. Trade-offs between different varieties of macro schemes, exact lower bounds on the amount of compression obtainable, and the complexity of encoding and decoding are discussed, as well as how the work of other authors relates to this model.
Article
A method for saving storage space for text strings, such as compiler diagnostic messages, is described. The method relies on hand selection of a set of text strings which are common to one or more messages. These phrases are then stored only once. The storage technique gives rise to a mathematical optimization problem: determine how each message should use the available phrases to minimize its storage requirement. This problem is nontrivial when phrases which overlap exist. However, a dynamic programming algorithm is presented which solves the problem in time which grows linearly with the number of characters in the text. Algorithm 444 applies to this paper.
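The dynamic program can be sketched as follows: the minimum cost of encoding the message suffix starting at position i is the cheaper of emitting one literal character or emitting a pointer to any phrase that matches at i, which handles overlapping phrases naturally and runs in time linear in the text length for a fixed phrase set. The literal and pointer costs below are placeholder constants, not the paper's actual storage model.

    def min_cost_parse(message, phrases, literal_cost=1, pointer_cost=2):
        # cost[i] = cheapest encoding of message[i:], choosing at each position
        # between one literal character and any phrase matching there.
        n = len(message)
        cost = [0] * (n + 1)
        choice = [''] * (n + 1)
        for i in range(n - 1, -1, -1):
            cost[i] = literal_cost + cost[i + 1]          # option (a): literal
            choice[i] = message[i]
            for p in phrases:                             # option (b): a phrase
                if message.startswith(p, i) and pointer_cost + cost[i + len(p)] < cost[i]:
                    cost[i] = pointer_cost + cost[i + len(p)]
                    choice[i] = p
        parse, i = [], 0                                  # recover the chosen parse
        while i < n:
            parse.append(choice[i])
            i += len(choice[i])
        return cost[0], parse

    if __name__ == '__main__':
        phrases = ['error: ', 'file not found', 'not found', 'found']
        print(min_cost_parse('error: file not found', phrases))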
Article
A system for the compression of data files, viewed as strings of characters, is presented. The method is general, and applies equally well to English, to PL/I, or to digital data. The system consists of an encoder, an analysis program, and a decoder. Two algorithms for encoding a string differ slightly from earlier proposals. The analysis program attempts to find an optimal set of codes for representing substrings of the file. Four new algorithms for this operation are described and compared. Various parameters in the algorithms are optimized to obtain a high degree of compression for sample texts.
Article
Lempel–Ziv schemes compress data by encoding repeated strings that occur in a small sliding window. We propose a scheme that succinctly encodes long strings that appear far apart in the input text. Such long strings are rare in most documents, but occur frequently in data such as large software systems, subroutine libraries, news articles, and other corpora of real documents. Analysis shows that our scheme is computationally efficient, and experiments show that it effectively compresses some classes of input.
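A heavily simplified sketch of the fingerprinting idea follows: block-aligned chunks are indexed by their content, and a window sliding one byte at a time reports any region that matches an earlier chunk. Real implementations use rolling fingerprints and extend matches in both directions; the block size and helper names here are assumptions.

    def long_common_strings(data: bytes, block=32):
        # Index every block-aligned chunk, then slide a one-byte-at-a-time window
        # and report windows that repeat an earlier chunk.
        index = {}
        for i in range(0, len(data) - block + 1, block):
            index.setdefault(data[i:i + block], i)
        matches, i = [], 0
        while i <= len(data) - block:
            j = index.get(data[i:i + block], -1)
            if 0 <= j < i:                     # genuine earlier occurrence
                matches.append((j, i, block))
                i += block
            else:
                i += 1
        return matches

    if __name__ == '__main__':
        unit = b'a long common string that recurs verbatim much later in the file. '
        data = b'header ' + unit + b'unrelated filler text ' + unit + b'trailer'
        print(long_common_strings(data))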
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed, if compression efficiency is to be maintained on dynamic collections. Here we show that with careful management, the impact of both of these drawbacks can be kept small. Experiments with a word-based model and 500 Mb of text show that excellent compression rates can be retained even in the presence of severe memory limitations on the decoder, and after significant expansion in the amount of stored text.
Article
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available. This document is available online at ACM Transactions on Information Systems.
Conference Paper
For compression of text databases, semi-static word based models are a pragmatic choice. Previous experiments have shown that, where there is not sufficient memory to store a full word based model, encoding rare words as sequences of characters can still allow good compression, while a pure character based model is poor. We propose a further kind of model that reduces main memory costs: approximate models, in which rare words are represented by similarly spelt common words and a sequence of edits. We investigate the compression available with different models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can improve the compression available in limited memory and greatly reduce overall memory requirements
Conference Paper
Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al. (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple-pass approach of RAY uses statistics on character pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text.
Conference Paper
We describe a precompression algorithm that effectively represents any long common strings that appear in a file. The algorithm interacts well with standard compression algorithms that represent shorter strings occurring near each other in the input text. Our experiments show that some real data sets do indeed contain many long common strings. We extend the fingerprint mechanisms of our algorithm to a program that identifies long common strings in an input file. This program gives interesting insights into the structure of real data files that contain long common strings.
Conference Paper
Dictionary-based modelling is the mechanism used in many practical compression schemes. We use the full message (or a large block of it) to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. Intuitively, the advantage of this offline approach is that with the benefit of having access to all of the message, it should be possible to optimize the choice of phrases so as to maximize compression performance. Indeed, we demonstrate that very good compression can be attained by an offline method without compromising the fast decoding that is a distinguishing characteristic of dictionary-based techniques. Several nontrivial sources of overhead, in terms of both computation resources required to perform the compression, and bits generated into the compressed message, have to be carefully managed as part of the offline process. To meet this challenge, we have developed a novel phrase derivation method and a compact dictionary encoding. In combination these two techniques produce the compression scheme RE-PAIR, which is highly efficient, particularly in decompression
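A minimal pair-replacement loop in Python illustrates the phrase-derivation side of this family of methods: the most frequent adjacent pair of symbols is repeatedly replaced by a new nonterminal until no pair repeats. This is only a sketch of the general idea; RE-PAIR itself achieves linear time with linked structures and priority queues, and its compact dictionary encoding is not shown.

    from collections import Counter

    def repair(data: bytes, min_freq=2):
        # Repeatedly replace the most frequent adjacent symbol pair with a new
        # nonterminal; byte values 0-255 are terminals, codes >= 256 are rules.
        seq, rules, next_symbol = list(data), {}, 256
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            pair, freq = pairs.most_common(1)[0]
            if freq < min_freq:
                break
            rules[next_symbol] = pair
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(next_symbol)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq, next_symbol = out, next_symbol + 1
        return seq, rules

    def expand(symbol, rules):
        if symbol < 256:
            return bytes([symbol])
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)

    if __name__ == '__main__':
        data = b'singing do wah diddy diddy dum diddy do'
        seq, rules = repair(data)
        assert b''.join(expand(s, rules) for s in seq) == data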
Conference Paper
Greedy off-line textual substitution refers to the following steepest descent approach to compression or structural inference. Given a long text string x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted text string, until substrings capable of producing contractions can no longer be found. This paper examines the computational issues and performance resulting from implementations of this paradigm in preliminary applications and experiments. Apart from intrinsic interest, these methods may find use in the compression of massively disseminated data, and lend themselves to efficient parallel implementation, perhaps on dedicated architectures
Conference Paper
During its long gestation in the 1970s and early 1980s, arithmetic coding was widely regarded more as an academic curiosity than a practical coding technique. One factor that helped it gain the popularity it enjoys today was the publication in 1987 of source code for a multi symbol arithmetic coder in Communications of the ACM. Now (1995), our understanding of arithmetic coding has further matured, and it is timely to review the components of that implementation and summarise the improvements that we and other authors have developed since then. We also describe a novel method for performing the underlying calculation needed for arithmetic coding. Accompanying the paper is a “Mark II” implementation that incorporates the improvements we suggest. The areas examined include: changes to the coding procedure that reduce the number of multiplications and divisions and permit them to be done to low precision; the increased range of probability approximations and alphabet sizes that can be supported using limited precision calculation; data structures for support of arithmetic coding on large alphabets; the interface between the modelling and coding subsystems; the use of enhanced models to allow high performance compression. For each of these areas, we consider how the new implementation differs from the CACM package
Conference Paper
The purpose of this paper is to show that the difference between the best machine models and human models is smaller than might be indicated by the previous results. This follows from a number of observations: firstly, the original human experiments used only 27-character English (letters plus space) against full 128-character ASCII text for most computer experiments; secondly, using large amounts of priming text substantially improves PPM's performance; and thirdly, the PPM algorithm can be modified to perform better for English text. The result of this is a machine performance down to 1.46 bits per character. The problem of estimating the entropy of English is discussed. The importance of training text for PPM is demonstrated, showing that its performance can be improved by "adjusting" the alphabet used. The results based on these improvements are then given, with compression down to 1.46 bpc.
Conference Paper
Digitized images are known to be extremely space consuming. However, regularities in the images can often be exploited to reduce the necessary storage area. Thus, many systems store images in a compressed form. The authors propose that compression be used as a time saving tool, in addition to its traditional role of space saving. They introduce a new pattern matching paradigm, compressed matching. A text array T and pattern array P are given in compressed forms c(T) and c(P). They seek all appearances of P in T, without decompressing T. This achieves a search time that is sublinear in the size of the uncompressed text |T|. They show that for the two-dimensional run-length compression there is an O(|c(T)| log |P| + |P|), i.e. almost optimal, algorithm. The algorithm uses a novel multidimensional pattern matching technique, two-dimensional periodicity analysis.
Article
Compressibility of individual sequences by the class of generalized finite-state information-lossless encoders is investigated. These encoders can operate in a variable-rate mode as well as a fixed-rate one, and they allow for any finite-state scheme of variable-length-to-variable-length coding. For every individual infinite sequence x a quantity ρ(x) is defined, called the compressibility of x, which is shown to be the asymptotically attainable lower bound on the compression ratio that can be achieved for x by any finite-state encoder. This is demonstrated by means of a constructive coding theorem and its converse that, apart from their asymptotic significance, also provide useful performance criteria for finite and practical data-compression tasks. The proposed concept of compressibility is also shown to play a role analogous to that of entropy in classical information theory where one deals with probabilistic ensembles of sequences rather than with individual sequences. While the definition of ρ(x) allows a different machine for each different sequence to be compressed, the constructive coding theorem leads to a universal algorithm that is asymptotically optimal for all sequences.
Article
A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
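The proposed universal code is the sliding-window (Ziv-Lempel 1977) scheme. A deliberately naive Python sketch of that family, assuming a simple (offset, length, next character) triple format and an invented window parameter, illustrates the parsing idea; practical coders use much faster match-finding structures and tighter output encodings.

    def lz77_compress(data: str, window: int = 4096):
        """Naive LZ77 parse: emit (offset, length, next_char) triples."""
        i, out = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data) - 1
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decompress(triples):
        out = []
        for off, length, ch in triples:
            for _ in range(length):     # copies may overlap the region being written
                out.append(out[-off])
            out.append(ch)
        return "".join(out)

    s = "abracadabra abracadabra"
    assert lz77_decompress(lz77_compress(s)) == s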
Article
Dictionary-based modeling is a mechanism used in many practical compression schemes. In most implementations of dictionary-based compression the encoder operates on-line, incrementally inferring its dictionary of available phrases from previous parts of the message. An alternative approach is to use the full message to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. In this investigation, we develop a compression scheme that is a combination of a simple but powerful phrase derivation method and a compact dictionary encoding. The scheme is highly efficient, particularly in decompression, and has characteristics that make it a favorable choice when compressed data is to be searched directly. We describe data structures and algorithms that allow our mechanism to operate in linear time and space.
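One concrete way to infer a complete dictionary offline is pair replacement: repeatedly promote the most frequent adjacent symbol pair to a new dictionary entry. The Python sketch below is a generic, unoptimised stand-in for that style of phrase derivation (it may not match the paper's exact rule, and the linear-time data structures are omitted).

    from collections import Counter

    def derive_dictionary(tokens):
        """Replace the most frequent adjacent pair with a new nonterminal, repeatedly."""
        seq, rules, next_id = list(tokens), {}, 0
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            pair, freq = pairs.most_common(1)[0]
            if freq < 2:
                break
            name = ("R", next_id)               # invented nonterminal naming
            rules[name] = pair
            next_id += 1
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(name)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return rules, seq

    def expand(symbol, rules):
        if symbol in rules:
            left, right = rules[symbol]
            return expand(left, rules) + expand(right, rules)
        return [symbol]

    rules, seq = derive_dictionary("singing and swinging in the spring")
    restored = [c for s in seq for c in expand(s, rules)]
    assert "".join(restored) == "singing and swinging in the spring"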
Article
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text retrieval system. Using a semi-static word-based compression model, the space needed to store the text is under 30 per cent of the original requirement. The model is used in conjunction with canonical Huffman coding and together these two paradigms provide fast decompression. Experiments with 500 Mb of newspaper articles show that in full-text retrieval environments compression not only saves space, it can also yield faster query processing - a win-win situation.
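The combination of a semi-static word-based model with Huffman coding can be illustrated with a short Python sketch. For brevity it builds an ordinary (rather than canonical) Huffman code over word and non-word tokens; the helper names are invented and the bit string is not packed into bytes.

    import heapq, itertools, re
    from collections import Counter

    def huffman_code(freqs):
        """Build a prefix code (symbol -> bit string) from symbol frequencies."""
        tiebreak = itertools.count()
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        if len(heap) == 1:                      # degenerate one-symbol vocabulary
            return {heap[0][2]: "0"}
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        code = {}
        def walk(node, prefix):
            if isinstance(node, str):           # leaf: an actual token
                code[node] = prefix
            else:                               # internal node: (left, right) pair
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2], "")
        return code

    def encode(tokens, code):
        return "".join(code[t] for t in tokens)

    def decode(bits, code):
        rev = {v: k for k, v in code.items()}   # prefix-free, so greedy decoding works
        out, cur = [], ""
        for b in bits:
            cur += b
            if cur in rev:
                out.append(rev[cur])
                cur = ""
        return "".join(out)

    text = "the cat sat on the mat and the cat slept"
    tokens = re.findall(r"\w+|\W+", text)       # words and separators as model symbols
    code = huffman_code(Counter(tokens))
    assert decode(encode(tokens, code), code) == text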
Article
Greedy off-line textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted textstring, until substrings capable of producing contractions can no longer be found. The paper examines computational issues arising in the implementation of this paradigm and describes some applications and experiments.
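A toy version of the selection step can be written under an assumed cost model in which a pointer costs about two symbols; the paper's actual gain function, iteration, and data structures are considerably more sophisticated, and best_substring with its pointer_cost parameter is illustrative only.

    def count_nonoverlapping(s, w):
        count = start = 0
        while True:
            pos = s.find(w, start)
            if pos < 0:
                return count
            count += 1
            start = pos + len(w)

    def best_substring(s, max_len=20, pointer_cost=2):
        """Pick the substring whose replacement gives the largest estimated saving."""
        best, best_gain, seen = None, 0, set()
        for i in range(len(s)):
            for length in range(2, min(max_len, len(s) - i) + 1):
                w = s[i:i + length]
                if w in seen:
                    continue
                seen.add(w)
                f = count_nonoverlapping(s, w)
                if f < 2:
                    break        # longer substrings starting here occur no more often
                gain = (f - 1) * (length - pointer_cost)
                if gain > best_gain:
                    best, best_gain = w, gain
        return best, best_gain

    text = "the quick brown fox jumps over the quick brown dog"
    print(best_substring(text))   # ('the quick brown ', 14)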
Article
The most effective compression algorithms are, unfortunately, computationally expensive. Unless a special-purpose hardware implementation is available or unless the application has no immediate deadline to meet (as in overnight archiving of files), most users would prefer to trade some compression performance for faster rates of compression and decompression. At present, the de facto method of choice is the Lempel-Ziv-Welch algorithm (LZW) [2]. LZW was originally designed for implementation by special hardware, but it turned out to be highly suitable for efficient software implementations too. An enhanced variant is available on UNIX systems and many other systems as the compress command. We refer to this variant as LZC. The speed and compression performance of LZC are a result of careful data structure design for hash table look-ups, some tuning, and the addition of logic for restarting the algorithm when the source file changes its characteristics enough to worsen compression.
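For reference, the textbook LZW procedure that compress (LZC) builds on looks roughly as follows in Python; this sketch omits LZC's hashed dictionary, code-width growth, and restart logic described above.

    def lzw_compress(data: str):
        """Textbook LZW: start from single-byte phrases and grow the dictionary."""
        dictionary = {chr(i): i for i in range(256)}
        phrase, codes = "", []
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch
            else:
                codes.append(dictionary[phrase])
                dictionary[phrase + ch] = len(dictionary)
                phrase = ch
        if phrase:
            codes.append(dictionary[phrase])
        return codes

    def lzw_decompress(codes):
        if not codes:
            return ""
        dictionary = {i: chr(i) for i in range(256)}
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dictionary:
                entry = dictionary[code]
            else:                       # the "phrase defined by its own code" special case
                entry = prev + prev[0]
            dictionary[len(dictionary)] = prev + entry[0]
            out.append(entry)
            prev = entry
        return "".join(out)

    s = "TOBEORNOTTOBEORTOBEORNOT"
    assert lzw_decompress(lzw_compress(s)) == s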
Article
In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%-40% slower than a table with around one string per slot (a table without move-to-front is perhaps 40% slower again), and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
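A minimal Python sketch of a chained hash table with move-to-front chains, of the kind evaluated here, is shown below; the table size, hash function, and per-word counters are simplified assumptions for illustration.

    class MTFHashTable:
        """Chained hash table for a string vocabulary; each successful lookup
        moves the accessed node to the front of its chain (move-to-front)."""

        def __init__(self, slots=1024):
            self.slots = slots
            self.chains = [[] for _ in range(slots)]   # each chain: list of [word, count]

        def _chain(self, word):
            return self.chains[hash(word) % self.slots]

        def add(self, word):
            chain = self._chain(word)
            for i, entry in enumerate(chain):
                if entry[0] == word:
                    entry[1] += 1
                    if i:                              # move the hit to the chain front
                        chain.insert(0, chain.pop(i))
                    return
            chain.insert(0, [word, 1])                 # new words also go to the front

        def count(self, word):
            chain = self._chain(word)
            for i, entry in enumerate(chain):
                if entry[0] == word:
                    if i:
                        chain.insert(0, chain.pop(i))
                    return entry[1]
            return 0

    table = MTFHashTable(slots=8)                      # deliberately small: long chains
    for w in "to be or not to be that is the question".split():
        table.add(w)
    print(table.count("to"), table.count("question"))  # 2 1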
Article
Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding such a representation is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
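Two of the compared codes are easy to sketch in Python: variable-byte coding and the Elias gamma code. The byte layout and function names below are illustrative, not necessarily the exact formats used in the experiments.

    def vbyte_encode(numbers):
        """Variable-byte coding: 7 data bits per byte, top bit flags the last byte."""
        out = bytearray()
        for n in numbers:
            while n >= 128:
                out.append(n & 0x7F)
                n >>= 7
            out.append(n | 0x80)        # terminator byte has the high bit set
        return bytes(out)

    def vbyte_decode(data):
        numbers, n, shift = [], 0, 0
        for byte in data:
            if byte & 0x80:
                numbers.append(n | ((byte & 0x7F) << shift))
                n, shift = 0, 0
            else:
                n |= byte << shift
                shift += 7
        return numbers

    def elias_gamma(n):
        """Elias gamma code for n >= 1: unary length prefix, then the binary value."""
        binary = bin(n)[2:]
        return "0" * (len(binary) - 1) + binary

    gaps = [7, 130, 16385, 1]
    assert vbyte_decode(vbyte_encode(gaps)) == gaps
    print([elias_gamma(n) for n in [1, 2, 5, 9]])   # ['1', '010', '00101', '0001001']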
Gailly J. 1993. Gzip program and documentation. Available by anonymous ftp from prep.ai.mit.edu:/pub/gnu/gzip-*.tar.
Hawking D., Creswell N., and Thistlewaite P. Overview of TREC-7 very large collection track.
Nakamura H. and Murashima S. Data compression by concatenations of symbol pairs.
Seward J. 2000. The bzip2 and libbzip2 home page. Available by anonymous ftp from sourceware.cygnus.com:/pub/bzip2/v100/bzip2-*.tar.gz.
Thomas S. and Orost J. 1985. Compress (version 4.0) program and documentation.
Turpin A. 1999. Efficient prefix coding. PhD thesis, The University of Melbourne.