ABSTRACT: Data compression has been widely applied in many data processing areas. Compression methods use variable-length codes, with shorter codes assigned to symbols or groups of symbols that appear frequently in the data. There exist many coding algorithms, e.g. Elias-delta codes, Fibonacci codes and other variable-length codes, which are often applied to the encoding of numbers. Although the time consumed by compression and decompression algorithms is often not considered, there are cases where decompression time is a critical issue. One example is real-time compression of data structures in the physical implementation of database management systems. In this case, pages of a data structure are decompressed on every read from secondary storage into main memory, or items of a page are decompressed on every access to the page. The efficiency of the decompression algorithm is therefore extremely important. Since fast decoding algorithms were not known until recently, variable-length codes have not been used in the data processing area. In this article, we introduce fast decoding algorithms for Elias-delta, Fibonacci of order 2 and Fibonacci of order 3 codes. We provide the theoretical background that makes these fast algorithms possible. Moreover, we introduce a new code, called the Elias–Fibonacci code, with a lower compression ratio than the Fibonacci of order 3 code for lower numbers; however, this new code provides a faster decoding time than the other tested codes. Elias–Fibonacci codes are shorter than the other compared codes for numbers longer than 26 bits. All these algorithms are suitable for data processing tasks with special emphasis on decompression time.
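For readers unfamiliar with the codes named in the abstract above, the following is a minimal reference sketch of Fibonacci (order 2) and Elias-delta coding of positive integers. It works bit by bit on strings of `'0'`/`'1'` for clarity and is not the fast decoding algorithm the article introduces; the function names are illustrative assumptions, not taken from the article.

```python
def fib_encode(n: int) -> str:
    """Fibonacci (order 2) code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number greater than n
    bits = []
    for f in reversed(fibs):  # greedy Zeckendorf decomposition
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Least-significant term first, then a terminating '1' (codes end in "11").
    return "".join(reversed(bits)) + "1"


def fib_decode(code: str) -> int:
    """Decode a single Fibonacci (order 2) codeword."""
    if not code.endswith("11"):
        raise ValueError("a Fibonacci codeword must end with '11'")
    total, a, b = 0, 1, 2
    for bit in code[:-1]:  # skip the terminating '1'
        if bit == "1":
            total += a
        a, b = b, a + b
    return total


def elias_delta_encode(n: int) -> str:
    """Elias-delta code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    binary = format(n, "b")            # N bits, leading bit is 1
    length = format(len(binary), "b")  # N in binary, L + 1 bits
    return "0" * (len(length) - 1) + length + binary[1:]


def elias_delta_decode(code: str) -> int:
    """Decode a single Elias-delta codeword."""
    zeros = code.index("1")                         # L leading zeros
    n_bits = int(code[zeros:2 * zeros + 1], 2)      # N, the bit length of n
    rest = code[2 * zeros + 1:2 * zeros + n_bits]   # the N - 1 low bits of n
    return int("1" + rest, 2)
```

For example, `fib_encode(4)` yields `"1011"` (4 = 1 + 3, followed by the terminating bit) and `elias_delta_encode(10)` yields `"00100010"`.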
ABSTRACT: N-grams are used in some applications that search text documents, especially when one must work with phrases, e.g. in plagiarism detection. An n-gram is a sequence of n terms (or, more generally, tokens) from a document. We get a set of n-grams by moving a sliding window from the beginning to the end of the document. During the extraction we must remove duplicate n-grams and store additional values for each n-gram type, e.g. the n-gram type frequency for each document; which values are stored depends on the query model used. Previous works use a sorting algorithm to compute the n-gram frequency. These approaches must handle a high number of identical n-grams, resulting in high time and space overhead. Moreover, these techniques are often main-memory only, which means they can be applied only to small or medium-sized collections. In this paper, we show an index-based method for n-gram extraction from large collections. This method utilizes common data structures such as the B+-tree and the hash table. We show the scalability of our method by presenting experiments with a gigabyte-scale collection.
Sixth IEEE International Conference on Digital Information Management, ICDIM 2011, Melbourne, Australia, September 26-28, 2011; 01/2011
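The sliding-window extraction described in the abstract above can be sketched as follows. This is a simplified, main-memory illustration that uses a hash table (`collections.Counter`) to merge duplicate n-grams and count their frequency; it does not reproduce the paper's disk-based index with a B+-tree, and the function name `extract_ngrams` is an assumption made for this example.

```python
from collections import Counter
from typing import Sequence


def extract_ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count each distinct n-gram produced by a window of n tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    # Slide the window one token at a time from the start to the end.
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1  # duplicates merge in the hash table
    return counts


tokens = "to be or not to be".split()
bigrams = extract_ngrams(tokens, 2)
```

Here `bigrams` holds four distinct bigrams, with `("to", "be")` counted twice. Counting in a hash table avoids sorting the full list of extracted n-grams, which is the overhead the abstract attributes to earlier approaches.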
ABSTRACT: The R-tree is one of the most popular multidimensional data structures. This data structure bounds spatially near points in multidimensional rectangles and supports various types of queries, e.g. point and range queries. When compression of the data structure is considered, we follow two objectives. The first objective is a smaller index file and the second one is a reduction of the query processing time. In this paper, we introduce a lossless R-tree compression using variable-length codes. Although variable-length codes are well known in the area of data compression, they have not yet been successfully applied to the compression of data structures. The main reason is inefficient decoding and encoding algorithms. In this paper, we apply recently introduced fast decoding algorithms and show that these codes provide a more efficient query processing time than lossless RLE or lossy quantization compression. Moreover, we can utilize some features of variable-length codes for the compression. The proposed compression method saves 84% of the index file's size compared to the uncompressed R-tree.
ABSTRACT: Multi-dimensional data structures have been widely applied in many data management fields. Spatial data indexing is their natural application; however, there are many applications in other domains. When compression of these data structures is considered, we follow two objectives. The first objective is a smaller index file; the second one is a reduction of the query processing time. In this paper, we apply a compression scheme to meet these objectives. This compression scheme handles compressed nodes in secondary storage. If a page must be retrieved, it is decompressed into the tree cache. Since this compression scheme is transparent from the point of view of the tree operations, we can apply various compression algorithms to the pages of a tree. Different compression algorithms suit different data collections; therefore, this issue is very important. In our paper, we compare the performance of Golomb, Elias-delta and Elias-gamma coding with the previously introduced Fast Fibonacci algorithm.
Proceedings of the Dateso 2009 Annual International Workshop on DAtabases, TExts, Specifications and Objects, Spindleruv Mlyn, Czech Republic, April 15-17, 2009; 01/2009
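As background for the codes compared in the abstract above, here is a minimal sketch of Elias-gamma and Golomb coding on bit strings. It only illustrates the code definitions, not the paper's page-compression scheme or its measured implementations. The function names are assumptions, and the Golomb variant shown (quotient in unary as ones terminated by a zero, remainder in truncated binary) is one common convention.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias-gamma code: (N - 1) zeros, then the N-bit binary form of n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(code: str) -> int:
    zeros = code.index("1")  # number of leading zeros = N - 1
    return int(code[zeros:2 * zeros + 1], 2)


def golomb_encode(n: int, m: int) -> str:
    """Golomb code of n >= 0 with divisor m >= 1."""
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    q, r = divmod(n, m)
    b = (m - 1).bit_length()   # ceil(log2(m)); 0 when m == 1
    cutoff = (1 << b) - m      # remainders below this use b - 1 bits
    if b == 0:
        tail = ""              # m == 1: the remainder is always 0
    elif r < cutoff:
        tail = format(r, "b").zfill(b - 1)
    else:
        tail = format(r + cutoff, "b").zfill(b)
    return "1" * q + "0" + tail


def golomb_decode(code: str, m: int) -> int:
    q = code.index("0")        # unary quotient
    b = (m - 1).bit_length()
    if b == 0:
        return q               # m == 1: pure unary code
    cutoff = (1 << b) - m
    pos = q + 1
    x = int(code[pos:pos + b - 1], 2) if b > 1 else 0
    if x < cutoff:
        r = x                  # short (b - 1)-bit remainder
    else:
        x = 2 * x + int(code[pos + b - 1])  # read one more bit
        r = x - cutoff
    return q * m + r
```

For example, `elias_gamma_encode(10)` yields `"0001010"` and `golomb_encode(4, 3)` yields `"1010"`. Golomb coding needs the parameter `m` chosen to match the data distribution, whereas the Elias codes are parameter-free.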