Conference Paper

Permutation Based XML Compression


Abstract

An XML document D often has a regular structure, i.e., it is composed of many similarly named and structured subtrees. Therefore, the entropy of the tree's structure should be relatively low, and thus the tree should be highly compressible after being transformed to a suitable intermediate form. In general, this idea is used in permutation-based XML-conscious compressors. An example of such a compressor is XSAQCT, where the compressible form is called an annotated tree. While XSAQCT proved to be useful for various applications, it was never shown to be a lossless compressor. This paper provides the formal background for the definition of an annotated tree and a formal proof that the compression is lossless. It also shows properties of annotated trees that are useful for various applications, and discusses a measure of compressibility under this approach, followed by experimental results showing the compressibility of annotated trees.
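To make the idea concrete, here is a minimal sketch of building an annotated tree for a toy document. It is a reconstruction of the annotated-tree idea from the abstract, not necessarily the paper's exact definition, and it assumes same-named siblings are consecutive and the relative order of differently named children is the same under every instance of a parent (no full mixed content). Elements are (tag, children) tuples; text content is ignored.

```python
def annotate(tag, instances, counts):
    """Build the annotated-tree node for one root-to-node path, given all
    instances of that path in document order. `counts` lists how many
    instances appear under each instance of the parent path."""
    node = {"tag": tag, "counts": counts, "children": []}
    order = []                                  # child tags by first appearance
    for _, children in instances:
        for ctag, _ in children:
            if ctag not in order:
                order.append(ctag)
    for ctag in order:
        child_counts, child_insts = [], []
        for _, children in instances:
            hits = [c for c in children if c[0] == ctag]
            child_counts.append(len(hits))      # ctag-children per instance
            child_insts.extend(hits)
        node["children"].append(annotate(ctag, child_insts, child_counts))
    return node

# A regular document: many similarly named and structured subtrees.
doc = ("library", [("book", [("title", []), ("author", [])]),
                   ("book", [("title", []), ("author", [])])])
ann = annotate("library", [doc], [1])
# ann has one node per distinct path; e.g. the "book" node carries counts [2],
# and the "title" node carries counts [1, 1], one entry per book instance.
```

Because the repeated subtrees collapse onto a single path each, the annotation sequences are short and highly regular, which is exactly why the transformed form compresses well.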


References
Conference Paper
Full-text available
XML (Extensible Markup Language) is a meta-language (developed by the W3C, World Wide Web Consortium, in 1996), which represents semi-structured data using markup. While the use of XML facilitates the interchange and access of data, its verbose nature tends to considerably increase the size of a data file. This increase in size limits applications of XML, in particular because of the time efficiency of storing large data files and the space constraints of storage on mobile devices. Besides storing (possibly compressed) XML data, one is also interested in being able to query them in order to obtain specific information, such as the information pertaining to all patients who visited the emergency room of a specific hospital in the last year. There are two reasons for querying a compressed XML file: querying a compressed file is generally faster than completely decompressing it and then querying the result, and portable devices may not have the disk space available for a complete decompression.
There are many known XML-aware compressors, i.e., compressors that can take advantage of XML syntax. Some of these are grammar-free; in other words, the information available to the compressor is limited to the XML document itself. Others are grammar-based, i.e., the compressor is aware of the grammar against which the input document is valid. Grammar-based compressors may produce better results, in terms of both compression rate and time, than grammar-free compressors because they can take advantage of information available in the grammar, but in many applications the grammar is not known, so this approach is not always practical. In the case of the widely used Wratislavia corpus [Skibiński et al., 2007], out of seven XML documents only two provide an XML Schema (enwikibooks and enwikinews), two reference a DTD (shakespeare and dblp), and the others use no schema. Finally, even if an XML Schema is provided, it may define elements that never actually appear in the XML document to be compressed.
In this paper, we describe a queryable, grammar-free XML compressor, called XSAQCT (pronounced "exact"). Our technique borrows from other XML compressors in that it separates the document structure from the text values and attribute values (collectively called data values), which make up the content of the document. What is new in our technique is that we first encode the document to succinctly store information about the input document. Next, we apply appropriate back-end data compressors to the container that stores the document structure and to the containers storing the data values (the type of the data, derived from the containers, may be used to guide the choice of back-end compressor for each container). It is well known that, on average, the structure of an XML document represents between 10 and 20 percent of the size of the entire document, while the remaining 80 to 90 percent consists of text and attribute values. Since the main focus of our work is on queryable compression, our encoding of the document structure supports lazy decompression, i.e., while querying the compressed document we decompress "as little as possible". Well-known XML compressors differ in their container granularity: some use a single container, while others tend to create many separate containers for related values.
The former approach is based on the premise that standard data compressors achieve better results when given large data sets, but it requires complete decompression in order to perform a query. The latter approach may suffer from poor compression ratios, but it requires the decompression of only a few (possibly just one) containers. In our approach, we attempt to strike a balance between these two extremes, using containers that are large enough to be compressed effectively, but whose structure does not require a full decompression to answer a query. In addition, while our design supports lazy decompression, it is also intended to support future extensions that perform operations directly on compressed data, without any decompression. In what follows, we provide a more detailed description of XSAQCT.
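As a rough illustration of the container scheme described above, the sketch below separates structure from data values, keeps one text container per root-to-node path, and compresses each container independently. It is a simplification under my own assumptions, not XSAQCT's actual encoding: zlib stands in for the back-end compressors, attributes are ignored, and the structure record is just the list of paths.

```python
import zlib
from xml.etree import ElementTree as ET

def split_and_compress(xml_text):
    """Separate document structure from data values and compress each
    container independently, enabling lazy decompression of single paths."""
    root = ET.fromstring(xml_text)
    structure, containers = [], {}
    def walk(elem, path):
        path = path + "/" + elem.tag
        structure.append(path)                      # simplified structure record
        if elem.text and elem.text.strip():
            containers.setdefault(path, []).append(elem.text.strip())
        for child in elem:
            walk(child, path)
    walk(root, "")
    packed = {p: zlib.compress("\x00".join(v).encode("utf-8"))
              for p, v in containers.items()}
    packed["(structure)"] = zlib.compress("\n".join(structure).encode("utf-8"))
    return packed

packed = split_and_compress("<a><b>x</b><b>y</b><c>z</c></a>")
# A query touching only /a/b decompresses just that one container:
texts = zlib.decompress(packed["/a/b"]).decode("utf-8").split("\x00")  # ['x', 'y']
```

Grouping all values of one path in one container is the middle ground the abstract argues for: each container is homogeneous enough to compress well, yet a path query touches only the containers it needs.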
Article
Full-text available
The mission of the Universal Protein Resource (UniProt) (http://www.uniprot.org) is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase. It integrates, interprets and standardizes data from numerous resources to achieve the most comprehensive catalogue of protein sequences and functional annotation. UniProt comprises four major components, each optimized for different uses, the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is produced by the UniProt Consortium, which consists of groups from the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is updated and distributed every 4 weeks and can be accessed online for searches or downloads.
Conference Paper
Full-text available
Implementations that load XML documents and give access to them via, e.g., the DOM suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document itself. A considerable amount of memory is needed to store the tree structure of the XML document. Here, a technique is presented that makes it possible to represent the tree structure of an XML document in an efficient way. The representation exploits the high regularity in XML documents by "compressing" their tree structure, that is, by detecting and removing repetitions of tree patterns. The functionality of basic tree operations, like traversal along edges, is preserved in the compressed representation. This allows queries (and, in particular, bulk operations) to be executed directly, without prior decompression. For certain tasks, like validation against an XML type or checking equality of documents, the representation allows for provably more efficient algorithms than those running on conventional representations.
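A minimal way to "detect and remove repetitions of tree patterns" is hash-consing, which stores each structurally distinct subtree once and turns the tree into a DAG. This is my illustration of the simplest such sharing, not the paper's grammar-based construction (which can share more than whole subtrees):

```python
def to_dag(node, pool):
    """Hash-consing: return a shared id for the subtree rooted at `node`,
    so structurally identical subtrees are stored exactly once."""
    tag, children = node
    key = (tag, tuple(to_dag(c, pool) for c in children))
    if key not in pool:
        pool[key] = len(pool)           # first occurrence gets a fresh id
    return pool[key]

doc = ("library", [("book", [("title", []), ("author", [])]),
                   ("book", [("title", []), ("author", [])])])
pool = {}
to_dag(doc, pool)
print(len(pool))   # 4 distinct subtrees, although the tree has 7 nodes
```

Navigation still works on the shared form because edges are preserved; only the storage of repeated patterns is merged.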
Article
Full-text available
Consider an ordered, static tree T where each node has a label from an alphabet Σ. Tree T may be of arbitrary degree and shape. Our goal is to design a compressed storage scheme for T that supports basic navigational operations among the immediate neighbors of a node (i.e., parent, i-th child, or any child with some label, ...) as well as more sophisticated path-based search operations over its labeled structure. We present a novel approach to this problem by designing what we call the XBW-transform of the tree, in the spirit of the well-known Burrows-Wheeler transform for strings [1994]. The XBW-transform uses path-sorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBW-transform, our compressed indexes go beyond the information-theoretic lower bound, and support navigational and path-search operations over labeled trees within (near-)optimal time bounds and entropy-bounded space. Our XBW-transform is simple and likely to spur new results in the theory of tree compression and indexing, as well as in interesting application contexts. As an example, we use the XBW-transform to design and implement a compressed index for XML documents whose compression ratio is significantly better than that achievable by state-of-the-art tools, and whose query time performance is orders of magnitude faster.
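The path-sorting idea fits in a few lines. This toy version (my sketch; it omits the rank/select machinery real indexes need for navigation) produces the two coordinated arrays by sorting nodes on the label path from their parent up to the root:

```python
def xbw(root):
    """Toy XBW-transform: S_last flags last siblings, S_alpha holds labels;
    both arrays are ordered by each node's upward ancestor-label path."""
    rows = []
    def walk(node, pi, last):
        tag, children = node
        rows.append((pi, last, tag))
        for i, child in enumerate(children):
            walk(child, (tag,) + pi, i == len(children) - 1)
    walk(root, (), True)
    rows.sort(key=lambda r: r[0])           # stable sort by upward path
    return [r[1] for r in rows], [r[2] for r in rows]

s_last, s_alpha = xbw(("a", [("b", [("c", [])]), ("b", [])]))
# s_alpha == ['a', 'b', 'b', 'c']: nodes with equal upward paths (the two
# b's) end up adjacent, which makes the arrays compressible and lets
# path queries be answered by range searches, as in the BWT for strings.
```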
Article
Full-text available
XML compression has gained prominence recently because it counters the disadvantage of the "verbose" representation XML gives to data. In many applications, such as data exchange and data archiving, entirely compressing and decompressing a document is acceptable. In other applications, where queries must be run over compressed documents, compression may not be beneficial, since the performance penalty of running the query processor over compressed data outweighs the benefits of compression. While balancing the interests of compression and query processing has received significant attention in the domain of relational databases, these results do not immediately translate to XML data. In this paper, we address the problem of embedding compression into XML databases without degrading query performance. Since the setting is rather different from relational databases, the choice of compression granularity and compression algorithms must be revisited. Query execution in the compressed domain must also be rethought in the framework of XML query processing, due to the richer structure of XML data. Indeed, a proper storage design for the compressed data plays a crucial role here. The XQueC system (standing for XQuery Processor and Compressor) covers a wide set of XQuery queries in the compressed domain, and relies on a workload-based cost model to choose the compression granules and their corresponding compression algorithms. As a consequence, XQueC provides efficient query processing on compressed XML data. An extensive experimental assessment is presented, showing the effectiveness of the cost model, the compression ratios, and the query execution times.
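The flavor of per-container algorithm selection can be mimicked with a trivial heuristic: try each candidate back end on a sample of the container and keep the one with the smallest output. This is only a crude stand-in for XQueC's workload-based cost model, which also weighs expected query costs, not just size:

```python
import bz2
import zlib

CANDIDATES = {"zlib": zlib.compress, "bz2": bz2.compress}

def choose_codec(sample: bytes) -> str:
    """Pick the back end producing the smallest output on a container sample."""
    return min(CANDIDATES, key=lambda name: len(CANDIDATES[name](sample)))

# Highly repetitive, date-like container values:
print(choose_codec(b"2004-01-01 2004-01-02 2004-01-03 " * 100))
```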
Article
Full-text available
Our experimental analysis of several popular XPath processors reveals a striking fact: Query evaluation in each of the systems requires time exponential in the size of queries in the worst case. We show that XPath can be processed much more efficiently, and propose main-memory algorithms for this problem with polynomial-time combined query evaluation complexity. Moreover, we show how the main ideas of our algorithm can be profitably integrated into existing XPath processors. Finally, we present two fragments of XPath for which linear-time query processing algorithms exist and another fragment with linear-space/quadratic-time query processing.
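The exponential blowup the authors observe comes from re-evaluating subexpressions once per context node. Evaluating each step set-at-a-time, as sketched below, makes a k-step query cost k passes over the tree. This is a simplification in the spirit of the paper's polynomial-time algorithms, not their actual context-value-table construction, and the Node class is hypothetical:

```python
class Node:
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, list(children)

def eval_path(roots, steps):
    """Set-at-a-time evaluation: each (axis, tag) step maps a node set to
    a node set, so no subexpression is evaluated once per context node."""
    context = set(roots)
    for axis, tag in steps:
        result = set()
        for n in context:
            frontier = list(n.children)
            while frontier:
                c = frontier.pop()
                if tag in ("*", c.tag):
                    result.add(c)
                if axis == "descendant":        # the child axis stops at depth 1
                    frontier.extend(c.children)
        context = result
    return context

tree = Node("a", [Node("b", [Node("c")]), Node("b")])
print({n.tag for n in eval_path([tree], [("descendant", "b"), ("child", "c")])})
# {'c'}  -- i.e. the toy equivalent of //b/c
```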
Conference Paper
Permutation-based XML-conscious compressors permute the input document to improve the compression ratio and to support efficient operations, such as queries or updates. One such compressor, XSAQCT, uses the properties of the permuted document, called an annotated tree, to support these operations. This paper provides the formal background for the definition of an annotated tree and a proof that the mapping from a tree to its annotated tree is injective, and therefore that, for an XML document D, the annotated tree provides a faithful representation of D. It also provides an algorithm for creating the annotated tree of an XML document, together with its inverse, and discusses a measure of compressibility based on annotated trees. The theoretical and algorithmic contributions are followed by experimental results showing the compressibility of annotated trees and a general analysis of semi-structured data and XML compression.
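Continuing the annotate() sketch given after the abstract above (and under the same no-full-mixed-content assumption), the inverse mapping distributes each child path's rebuilt instances back to its parent's instances according to the stored counts; the round trip returning the original tree is exactly the injectivity claim on that example:

```python
def rebuild(node):
    """Inverse of annotate(): reconstruct all instances of node's path."""
    n = sum(node["counts"])                 # number of instances of this path
    parts = []
    for child in node["children"]:
        insts = rebuild(child)
        pos, sliced = 0, []
        for k in child["counts"]:           # k children under the i-th instance
            sliced.append(insts[pos:pos + k])
            pos += k
        parts.append(sliced)
    return [(node["tag"], [e for p in parts for e in p[i]]) for i in range(n)]

doc = ("library", [("book", [("title", []), ("author", [])]),
                   ("book", [("title", []), ("author", [])])])
assert rebuild(annotate("library", [doc], [1])) == [doc]   # lossless round trip
```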
Article
Implementations that load XML documents and give access to them via, e.g., the DOM suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document itself. A considerable amount of memory is needed to store the tree structure of the XML document. In this paper, a technique is presented that makes it possible to represent the tree structure of an XML document in an efficient way. The representation exploits the high regularity in XML documents by compressing their tree structure, that is, by detecting and removing repetitions of tree patterns. Formally, context-free tree grammars that generate only a single tree are used for tree compression. The functionality of basic tree operations, like traversal along edges, is preserved under this compressed representation. This allows queries (and, in particular, bulk operations) to be executed directly, without prior decompression. The complexity of certain computational problems, like validation against XML types or testing equality, is investigated for compressed input trees.
Article
A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
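For reference, the scheme's core loop fits in a few lines. This textbook LZ77-style parser is an illustration of the sliding-window idea, not the paper's exact formulation: it greedily emits (offset, length, next-symbol) triples, where a match may overlap the position being encoded:

```python
def lz77(data, window=4096):
    """Greedy LZ77-style parse: emit (offset, length, next_symbol) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):      # candidate match starts
            l = 0
            # overlapping matches (j + l >= i) are allowed, as in LZ77
            while i + l < len(data) - 1 and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_off, best_len = i - j, l
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

print(lz77("abababab"))
# [(0, 0, 'a'), (0, 0, 'b'), (2, 5, 'b')] -- the third triple copies five
# symbols starting two positions back, spanning the repetition.
```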
Article
We describe a block-sorting, lossless data compression algorithm, and our implementation of that algorithm. We compare the performance of our implementation with widely available data compressors running on the same hardware. The algorithm works by applying a reversible transformation to a block of input text. The transformation does not itself compress the data, but reorders it to make it easy to compress with simple algorithms such as move-to-front encoding. Our algorithm achieves speed comparable to algorithms based on the techniques of Lempel and Ziv, but obtains compression close to the best statistical modelling techniques. The size of the input block must be large (a few kilobytes) to achieve good compression.
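The two stages described above can be sketched naively as follows; this is a quadratic teaching version (real implementations build the transform via suffix arrays rather than materializing rotations):

```python
def bwt(s):
    """Naive Burrows-Wheeler transform: last column of sorted rotations."""
    s = s + "\x00"                          # unique end marker for reversibility
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s):
    """Move-to-front encoding of the transformed block."""
    table = sorted(set(s))
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)                       # recently seen symbols get small codes
        table.insert(0, table.pop(i))
    return out

print(bwt("banana"))        # 'annb\x00aa' -- like symbols cluster together
print(mtf(bwt("banana")))   # runs of a symbol become runs of small integers,
                            # which simple entropy coders compress well
```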
• T. Müldner, T. Corbin, J. Miziołek, C. Fry: Design and implementation of an online XML compressor for large XML files
• Gzip: The gzip home page (2013). http://www.gzip.org
• T. Müldner, C. Fry, J. Miziołek, S. Durno: XSAQCT: XML queryable compressor. In: Balisage: The Markup Conference
• XML: Extensible Markup Language (XML) 1.0 (Fifth Edition) (2013)
• M. Mahoney: Large Text Compression Benchmark