Conference Paper

Compact In-Memory Models for Compression of Large Text Databases


Abstract

For compression of text databases, semi-static word-based models are a pragmatic choice. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model is poor. We propose a further kind of model that reduces main memory costs: approximate models, in which rare words are represented by similarly spelt common words and a sequence of edits. We investigate the compression available with different models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can improve the compression available in limited memory and greatly reduce overall memory requirements.
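The approximate-model idea can be illustrated with a minimal sketch (not the authors' implementation; the vocabulary, the similarity search and the edit encoding below are simplified assumptions): a rare word is stored as a reference to a similarly spelt common word plus an edit script that reconstructs it.

    # Sketch of an "approximate model": encode a rare word as a reference to a
    # similarly spelt common word plus an edit script, and decode by replaying
    # the edits. Illustrative only; the common-word vocabulary and the use of
    # difflib as a stand-in for an edit-distance search are assumptions.
    import difflib

    common_words = ["compression", "complete", "computer", "database"]  # assumed in-memory vocabulary

    def encode_rare(word):
        # pick the most similarly spelt common word
        best = max(common_words, key=lambda w: difflib.SequenceMatcher(None, w, word).ratio())
        ops = difflib.SequenceMatcher(None, best, word).get_opcodes()
        # keep only the edits; 'equal' spans are implicit in the decoder
        edits = [(tag, i1, i2, word[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]
        return common_words.index(best), edits

    def decode_rare(base_index, edits):
        base = common_words[base_index]
        out, pos = [], 0
        for tag, i1, i2, text in edits:
            out.append(base[pos:i1])   # copy the untouched stretch of the common word
            out.append(text)           # apply inserted or replacing characters
            pos = i2                   # skip deleted or replaced source characters
        out.append(base[pos:])
        return "".join(out)

    base, edits = encode_rare("compressor")
    assert decode_rare(base, edits) == "compressor"

In a real model the common-word index and the edit script would themselves be entropy coded, so the saving comes from not having to hold the rare word's spelling in the in-memory vocabulary.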


... A limitation of this approach is that the key size can be too small, say O(log m), where m is the number of codewords. Furthermore, several schemes have been proposed to reduce the maximum codeword length in order to increase decoding speed [20], [13], [11]. Klein et al. [4] analysed the cryptographic aspects of Huffman codes used to encode a large natural language on CD-ROM. ...
... For instance, if the secret key K is dependent on the plaintext T, say k_i = 1/f_i, then we can have lower data expansion, but the impact of this modification on its secrecy properties is still an open problem. Also, other related techniques can be used to optimize the overall algorithm, like skeleton trees [6] and dictionary reduction schemes to deal with large-scale texts [20]. ...
Article
Full-text available
Data compression and ciphering are essential features when digital data is stored or transmitted over insecure channels. Prefix codes are widely used to obtain high-performance data compression algorithms. Given any prefix code for the symbols of a plaintext, we propose to add security using a multiple substitution function and a key. We show that breaking the code when we are given the ciphertext, dictionary, frequencies and code lengths is an NP-complete problem.
... It appears that only one article deals primarily with front coding, and it concerns applying the algorithm to compress dictionaries by repeated front coding of alternating prefixes and suffixes [BF92]. A number of articles use front coding in connection with compression of search systems and databases, for example [MZS97,ZW99]. In addition, [NMWO96] uses front coding to compress a static dictionary for data compression, and the source distribution Den Store Danske Ordliste [SSL05] also uses it. However, these articles add nothing new compared to [WMB99]. ...
... Prior research has examined how to efficiently index text documents and resolve text queries: for example, with inverted indices [3], signature files [8], or sparse matrices [9]. Further improvements to these index structures have been made for handling special query types [10], [11], [12] and reducing I/O overhead [13], [14], [15]. While much work addresses this index-level view of search performance, little work addresses performance at the architectural level of a complete search service. ...
Conference Paper
Prior research into search system scalability has primarily addressed query processing efficiency [1, 2, 3] or indexing efficiency [3], or has presented some arbitrary system architecture [4]. Little work has introduced any formal theoretical framework for evaluating architectures with regard to specific operational requirements, or for comparing architectures beyond simple timings [5] or basic simulations [6, 7]. In this paper, we present a framework based upon queuing network theory for analyzing search systems in terms of operational requirements. We use response time, throughput, and utilization as the key operational characteristics for evaluating performance. Within this framework, we present a scalability strategy that combines index partitioning and index replication to satisfy a given set of requirements.
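The flavour of the calculations such a framework supports can be sketched with a simple M/M/1 approximation; the assumptions below (Poisson arrivals, exponential service, perfectly even load split, linear speed-up from partitioning) are illustrative simplifications, not the paper's queuing network model.

    # Back-of-the-envelope M/M/1 sketch of utilization and response time for a
    # partitioned, replicated index. All modelling assumptions are illustrative.

    def replica_metrics(query_rate, service_rate_full_index, partitions, replicas):
        # query_rate: total queries/sec offered to the whole system
        # service_rate_full_index: queries/sec one server sustains on the full index
        # partitions: each partition server holds 1/partitions of the index
        # replicas: identical copies of each partition, sharing the query load
        arrival = query_rate / replicas                   # load seen by one replica of a partition
        service = service_rate_full_index * partitions    # smaller partition -> proportionally faster (assumed)
        utilization = arrival / service
        if utilization >= 1.0:
            return utilization, float("inf")              # overloaded: queue grows without bound
        return utilization, 1.0 / (service - arrival)     # M/M/1 mean response time

    for partitions, replicas in [(1, 1), (4, 1), (4, 2)]:
        u, r = replica_metrics(query_rate=80, service_rate_full_index=25,
                               partitions=partitions, replicas=replicas)
        print(f"partitions={partitions} replicas={replicas}  utilization={u:.2f}  response={r:.3f}s")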
... And as text file size increases, the model file size slowly grows. Model reduction techniques are useful in these cases [14]. A simple alternative is to include a model compression step using Huffman coding for characters. ...
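A minimal sketch of that last alternative, compressing the model itself with a character-level Huffman code, is given below; the toy vocabulary and the omission of frequencies and other model fields are assumptions for illustration.

    # Huffman-code the characters of the word list that makes up a semi-static
    # model, as a model-compression step. A real model file also stores
    # frequencies and codebook information; this only measures the word list.
    import heapq
    from collections import Counter

    def huffman_code_lengths(freqs):
        # standard heap-based Huffman construction; returns {symbol: code length}
        heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    model_words = ["the", "of", "and", "retrieval", "compression"]   # toy vocabulary
    listing = "\n".join(model_words)
    lengths = huffman_code_lengths(Counter(listing))
    packed_bits = sum(lengths[c] for c in listing)
    print(f"{len(listing) * 8} bits raw -> {packed_bits} bits Huffman-coded")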
Article
Full-text available
Minimum-redundancy prefix-free codes are widely used to obtain high-performance compression schemes. Given a prefix-free encoding for the symbols of a plain text, we propose a security enhancement by adding a multiple substitution algorithm with a key: HSPC2 (Homophonic Substitution Prefix-free Codes with 2 homophones). Breaking the key when we are given a ciphertext, the dictionary, frequencies and codeword lengths is an NP-complete problem. In order to introduce security, some compression loss is generated. The compression loss is analysed and the data expansion per character is asymptotically smaller than 5% under usual parsing and coding assumptions. We also present some analytical results on the security impact of adding simple strategies to protect prefix-free encoded data.
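One simple way to picture homophonic substitution over a prefix code (an illustration of the general idea, not the HSPC2 construction) is to give selected symbols two codewords by extending their original codeword with a key-driven bit, which keeps the code prefix-free at the price of a small expansion, as in the sketch below.

    # Toy homophonic substitution over a prefix code: protected symbols get two
    # codewords (homophones), chosen by a keyed random stream. Illustration of
    # the idea only, not the HSPC2 algorithm.
    import random

    prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed Huffman-style code

    def homophonic_encoder(code, protected, key):
        rng = random.Random(key)                       # keyed choice of homophone
        table = dict(code)
        for sym in protected:
            base = table.pop(sym)
            table[sym] = (base + "0", base + "1")      # two homophones, still prefix-free
        def encode(text):
            out = []
            for ch in text:
                cw = table[ch]
                out.append(rng.choice(cw) if isinstance(cw, tuple) else cw)
            return "".join(out)
        return encode, table

    encode, table = homophonic_encoder(prefix_code, protected={"a", "b"}, key=42)
    print(encode("abcd"))   # the same plaintext maps to different bit strings under different keys

A decoder that holds the table, and hence both homophones of each protected symbol, can still decode unambiguously because the extended codewords remain prefix-free; the cost is one extra bit per protected occurrence, which is the kind of compression loss the abstract quantifies.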
Article
Full-text available
Huffman codes have widespread use in information retrieval systems. Besides their compressing power, they also enable the implementation of both indexing and searching schemes in the compressed file. In this work, we propose a randomized variant of the Huffman data compression algorithm. The algorithm results in a small loss in coding, decoding and compression performance. It uses homophonic substitution and canonical Huffman codes.
Article
Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically only slower than gzip in decompression. Moreover, the query evaluation costs of retrieval of documents from a large collection with our search engine are improved by more than 30% when xray is incorporated, compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
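A minimal sketch in the spirit of an offline-trained, semistatic phrase model is shown below; it is not the xray algorithm, and the phrase selection (frequent word pairs from a tiny sample) and token format are assumptions made for illustration.

    # Semistatic phrase-based compression sketch: select frequent word pairs
    # from a training sample offline, then compress in a single pass by
    # replacing those pairs with phrase numbers; unseen words are escaped as
    # literals. Not the xray scheme, just the general shape of the approach.
    from collections import Counter

    def train(sample_text, max_phrases=8):
        words = sample_text.split()
        pairs = Counter(zip(words, words[1:]))
        return [pair for pair, _ in pairs.most_common(max_phrases)]

    def compress(text, phrases):
        table = {pair: i for i, pair in enumerate(phrases)}
        words, out, i = text.split(), [], 0
        while i < len(words):
            pair = tuple(words[i:i + 2])
            if len(pair) == 2 and pair in table:
                out.append(("P", table[pair]))      # phrase reference
                i += 2
            else:
                out.append(("L", words[i]))         # literal word (escape)
                i += 1
        return out

    def decompress(tokens, phrases):
        return " ".join(w for kind, value in tokens
                        for w in (phrases[value] if kind == "P" else (value,)))

    sample = "to be or not to be that is the question"
    phrases = train(sample)
    tokens = compress("to be or not to be", phrases)
    assert decompress(tokens, phrases) == "to be or not to be"

Because the phrase table is fixed after training, newly added documents can be compressed with the same model, which matches the append-without-recompression property described above.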
Article
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.
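The way guessing data is turned into entropy estimates can be sketched as follows; the counts are invented for illustration, and the two expressions are the usual upper and lower bounds derived from the guess distribution, not figures from the experiments reported in the article.

    # q[i] is the fraction of positions at which the subject's (i+1)-th guess
    # was the correct next letter. The counts below are made-up illustrative
    # numbers, not experimental data.
    import math

    guess_counts = [62, 14, 8, 6, 4, 3, 2, 1]   # hypothetical: correct on 1st, 2nd, ... guess
    total = sum(guess_counts)
    q = [c / total for c in guess_counts]

    upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
    lower = sum(i * (q[i - 1] - (q[i] if i < len(q) else 0.0)) * math.log2(i)
                for i in range(1, len(q) + 1))

    print(f"estimated entropy per letter: between {lower:.2f} and {upper:.2f} bits")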
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
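The self-indexing idea can be pictured with the sketch below: the compressed list is held as blocks of document gaps together with a small internal index of the last document number in each block, so a conjunctive merge decodes only the blocks it needs. The block size and layout here are assumptions for illustration, not the representation used in the article.

    # Compressed inverted list as blocks of d-gaps plus an internal index of
    # the last docid per block, so membership tests can skip whole blocks.
    import bisect

    BLOCK = 4   # postings per block; tiny, for illustration

    def build(postings):
        blocks, index = [], []
        for start in range(0, len(postings), BLOCK):
            chunk = postings[start:start + BLOCK]
            first_gap = chunk[0] if start == 0 else chunk[0] - postings[start - 1]
            gaps = [first_gap] + [b - a for a, b in zip(chunk, chunk[1:])]
            blocks.append(gaps)            # would be bit-compressed (e.g. Golomb coded) on disk
            index.append(chunk[-1])        # internal index entry: last docid in the block
        return index, blocks

    def contains(index, blocks, target):
        b = bisect.bisect_left(index, target)      # first block whose last docid >= target
        if b == len(index):
            return False
        docid = index[b - 1] if b > 0 else 0       # docid just before this block
        for gap in blocks[b]:
            docid += gap
            if docid >= target:
                return docid == target
        return False

    index, blocks = build([3, 7, 11, 20, 31, 44, 59, 80, 92])
    assert contains(index, blocks, 44) and not contains(index, blocks, 45)

During Boolean or ranked query evaluation the candidate document numbers come from the shorter lists, so most blocks of the longer lists are skipped rather than decompressed, which is the intuition behind the reported reduction in processing time.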
Article
Present text retrieval systems are generally built on the reductionist basis that words in texts (keywords) are used as indexing terms to represent the texts. A necessary precursor to these systems is word extraction which, for English texts, can be achieved automatically by using spaces and punctuations as word delimiters. This cannot be readily applied to Chinese texts because they do not have obvious word boundaries. A Chinese text consists of a linear sequence of nonspaced or equally spaced ideographic characters, which are similar to morphemes in English. Researchers of Chinese text retrieval have been seeking methods of text segmentation to divide Chinese texts automatically into words. First, a review of these methods is provided in which the various different approaches to Chinese text segmentation have been classified in order to provide a general picture of the research activity in this area. Some of the most important work is described. There follows a discussion of the problems of Chinese text segmentation with examples to illustrate. These problems include morphological complexities, segmentation ambiguity, and parsing problems, and demonstrate that text segmentation remains one of the most challenging and interesting areas for Chinese text retrieval. © 1993 John Wiley & Sons, Inc.
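Forward maximum matching, one of the dictionary-based methods covered in this line of work, and the segmentation ambiguity the article discusses can both be seen in a few lines; the tiny dictionary below is an assumption for illustration.

    # Forward maximum matching: at each position take the longest dictionary
    # word that matches, falling back to a single character. The dictionary is
    # a toy; real segmenters use large lexicons and further disambiguation.

    def forward_max_match(text, dictionary, max_len=4):
        words, i = [], 0
        while i < len(text):
            for length in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + length]
                if length == 1 or piece in dictionary:    # single characters are always accepted
                    words.append(piece)
                    i += length
                    break
        return words

    toy_dict = {"中国", "人民", "中国人", "银行"}
    print(forward_max_match("中国人民银行", toy_dict))
    # ['中国人', '民', '银行'] rather than ['中国', '人民', '银行']: a classic segmentation ambiguity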
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed, if compression efficiency is to be maintained on dynamic collections. Here we show that with careful management, the impact of both of these drawbacks can be kept small. Experiments with a word-based model and 500 Mb of text show that excellent compression rates can be retained even in the presence of severe memory limitations on the decoder, and after significant expansion in the amount of stored text. Index Terms: Document databases, text compression, dynamic databases, word-based.
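The kind of memory limitation studied above can be pictured with the sketch below, which restricts the decode-time model to the most frequent words and escapes everything else to character coding; the cut-off and token format are assumptions, not the scheme evaluated in the article.

    # Keep only the most frequent words in the semi-static model; other words
    # are sent behind an escape symbol, character by character. Illustrative
    # only; entropy coding of the tokens is omitted.
    from collections import Counter

    def build_model(training_text, max_words=1000):
        counts = Counter(training_text.split())
        return {w for w, _ in counts.most_common(max_words)}

    def tokenize(text, model):
        out = []
        for word in text.split():
            if word in model:
                out.append(("word", word))           # coded from the word model
            else:
                out.append(("escape", None))         # signals a spelt-out word
                out.extend(("char", c) for c in word)
                out.append(("end", None))
        return out

    model = build_model("the cat sat on the mat the end", max_words=3)
    print(tokenize("the cat sat on the zygote", model))

Shrinking max_words trades compression effectiveness for decoder memory, which is the trade-off the experiments described above quantify.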
Article
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
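The flavour of direct coding can be seen in a two-bits-per-base packing sketch; a real scheme also has to handle wildcard codes and record boundaries, which are omitted here.

    # Pack each nucleotide into two bits so sequences stay compact and can be
    # decoded without any model. Wildcards (e.g. N) and record framing, which
    # a real scheme must handle, are left out of this sketch.

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASES = "ACGT"

    def pack(seq):
        out = bytearray()
        for i in range(0, len(seq), 4):
            chunk = seq[i:i + 4]
            byte = 0
            for ch in chunk:
                byte = (byte << 2) | CODE[ch]
            byte <<= 2 * (4 - len(chunk))      # left-align a partial final byte
            out.append(byte)
        return bytes(out), len(seq)

    def unpack(packed, n):
        bases = []
        for i in range(n):
            shift = 2 * (3 - i % 4)
            bases.append(BASES[(packed[i // 4] >> shift) & 3])
        return "".join(bases)

    data, n = pack("GATTACA")
    assert unpack(data, n) == "GATTACA" and len(data) == 2   # 7 bases in 2 bytes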
Conference Paper
Document databases contain large volumes of text, and currently have typical sizes into the gigabyte range. In order to efficiently query these text collections some form of index is required, since without an index even the fastest of pattern matching techniques results in unacceptable response times. One pervasive indexing method is the use of inverted files, also sometimes known as concordances or postings files. There have been a number of efforts made to capture the "clustering" effect, and to design index compression methods that condition their probability predictions according to context. In these methods information as to whether or not the most recent (or second most recent, and so on) document contained term t is used to bias the prediction that the next document will contain term t. We further extend this notion of context-based index compression, and describe a surprisingly simple index representation that gives excellent performance on all of our test databases; allows fast decoding; and is, even in the worst case, only slightly inferior to Golomb (1966) coding.
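For reference, the Golomb coding that the new representation is compared against can be sketched as follows; the parameter rule of thumb b of roughly 0.69 * N / f_t (mean gap times ln 2) and the example numbers are the usual textbook choices, not values from the paper.

    # Golomb coding of document gaps: gap x >= 1 becomes a unary quotient plus
    # a truncated-binary remainder with respect to parameter b.
    def golomb_encode(gap, b):
        q, r = divmod(gap - 1, b)
        bits = "1" * q + "0"                        # unary quotient, terminated by a 0
        if b > 1:
            k = (b - 1).bit_length()                # ceil(log2 b)
            cutoff = (1 << k) - b
            if r < cutoff:
                bits += format(r, "0{}b".format(k - 1))
            else:
                bits += format(r + cutoff, "0{}b".format(k))
        return bits

    N, f_t = 1_000_000, 2_000                       # documents in collection / documents containing t
    b = max(1, int(0.69 * N / f_t))                 # rule-of-thumb parameter for the Bernoulli model
    print(b, [golomb_encode(g, b) for g in [3, 120, 512, 900, 47]])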
Conference Paper
Witten, Bell and Nevill (see ibid., p.23, 1991) have described compression models for use in full-text retrieval systems. The authors discuss other coding methods for use with the same models, and give results that show their scheme yielding virtually identical compression, and decoding more than forty times faster. One of the main features of their implementation is the complete absence of arithmetic coding; this, in part, is the reason for the high speed. The implementation is also particularly suited to slow devices such as CD-ROM, in that the answering of a query requires one disk access for each term in the query and one disk access for each answer. All words and numbers are indexed, and there are no stop words. They have built two compressed databases.
Article
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text retrieval system. Using a semi-static word-based compression model, the space needed to store the text is under 30 per cent of the original requirement. The model is used in conjunction with canonical Huffman coding and together these two paradigms provide fast decompression. Experiments with 500 Mb of newspaper articles show that in full-text retrieval environments compression not only saves space, it can also yield faster query processing - a win-win situation.
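The canonical Huffman part of that pairing can be sketched as below: once ordinary Huffman construction has produced a code length for each word, canonical assignment orders the codes numerically so the decoder needs only a small table per code length. The toy code lengths are assumptions for illustration.

    # Canonical Huffman code assignment from code lengths. Decoding then only
    # needs the first code and symbol offset of each length, which is what
    # keeps the decode tables small and decompression fast.

    def canonical_codes(lengths):
        # lengths: {symbol: code length in bits}, e.g. produced by Huffman's algorithm
        codes, code, prev_len = {}, 0, 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= (length - prev_len)            # step down to the next code length
            codes[sym] = format(code, "0{}b".format(length))
            code += 1
            prev_len = length
        return codes

    print(canonical_codes({"the": 1, "of": 3, "and": 3, "compression": 4, "retrieval": 4}))
    # {'the': '0', 'and': '100', 'of': '101', 'compression': '1100', 'retrieval': '1101'}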
Article
A new text compression scheme is presented in this paper. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box, and any such program can be used. Instead, the pattern is modified; only the outcome of the matching of the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and processing time than a search in the original file. For typical text files, we achieve about 30% reduction of space and slightly less of search time. A 70% compression is not competitive with good text compression schemes, and thus should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not ...
Text Compression
  • T. C. Bell
  • J. G. Cleary
  • I. H. Witten
T.C. Bell, J.G. Cleary, and I.H. Witten. Text Compression. Prentice-Hall, Englewood Cliffs, New Jersey, 1990.

Fast searching on compressed text allowing errors
  • E. S. de Moura
  • G. Navarro
  • N. Ziviani
  • R. Baeza-Yates
E. Silva de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast searching on compressed text allowing errors. In R. Wilkinson, B. Croft, K. van Rijsbergen, A. Moffat, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 298-306, Melbourne, Australia, July 1998.