XML (Extensible Markup Language) is a meta-language, developed by the W3C (World Wide Web Consortium) in 1996, that represents semi-structured data using markup. While XML facilitates the interchange of and access to data, its verbose nature tends to increase the size of a data file considerably. This increase in size limits the applications of XML, in particular because of the time cost of storing large data files and because of the limited storage space on mobile devices. Besides storing (possibly compressed) XML data, one is also interested in querying it to obtain specific information, such as the information pertaining to all patients who visited the emergency room of a specific hospital in the last year.
There are two main reasons for querying an XML file in its compressed form:
Querying a compressed XML file is generally faster than completely decompressing the compressed file and then querying it.
Portable devices may not have disk space available for a complete decompression of the XML file.
There are many known XML-aware compressors, i.e. compressors that can take advantage of XML syntax. Some of these are grammar-free; in other words, the information available to the compressor is limited to the XML document itself. Others are grammar-based, i.e. the compressor is aware of the grammar against which the input document is valid. Grammar-based compressors may produce better results, in terms of both compression rate and time, than grammar-free compressors because they can take advantage of information available in the grammar; however, in many applications the grammar is not known, so this approach is not always practical. In the case of the widely used Wratislava corpus [Skibinski et al, 2007], out of seven XML documents, only two provide an XML Schema (enwikibooks and enwikinews) and two reference a DTD (shakespeare and dblp), while the others use no schema. Finally, even if an XML Schema is provided, it may define elements that never actually appear in the XML document to be compressed.
In this paper, we describe a queryable, grammar-free XML compressor called XSAQCT (pronounced "exact"). Our technique borrows from other XML compressors in that it separates the document structure from the text values and attribute values (collectively called data values), which make up the content of the document. What is new in our technique is that we first encode the document to succinctly store information about its structure. Next, we apply appropriate back-end data compressors to the container that stores the document structure and to the containers that store the data values (the type of the data, derived from the containers, may be used to guide the choice of back-end compressor for each container). It is well known that, on average, the structure of an XML document represents between 10 and 20 percent of the size of the entire document, and the remaining 80 to 90 percent represents text and attribute values. Since the main focus of our work is on queryable compression, our encoding of the document structure supports lazy decompression, i.e. while querying the compressed document we decompress "as little as possible". Well-known XML compressors differ in their container granularity: some use a single container, while others create many separate containers for related values. The former approach is based on the premise that standard data compressors achieve better results on large data sets, but it requires complete decompression in order to perform a query. The latter approach, on the other hand, may suffer from poor compression ratios, but it requires the decompression of only a few (possibly just one) containers. In our approach, we attempt to strike a balance between these two extremes, using containers that are large enough to be compressed effectively while keeping a container structure that does not require full decompression to answer a query.
In addition, while our design currently supports lazy decompression, it is also intended to accommodate future extensions that perform operations directly on compressed data, without any decompression. In what follows, we provide a more detailed description of XSAQCT.
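Before turning to that description, the general idea of separating structure from data values and decompressing lazily can be illustrated with a minimal Python sketch. This is not the XSAQCT encoding: the function names (compress_xml, query_values), the use of one container per element path, and the choice of zlib as the only back-end compressor are assumptions made purely for illustration, and text that follows a child element (mixed content) is ignored for brevity.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

SEP = "\0"  # NUL cannot occur in XML character data, so it is a safe separator


def compress_xml(xml_text):
    """Split a document into a structure container and per-path data containers.

    Hypothetical helper: every distinct element (or attribute) path gets its
    own container, and each container is compressed independently with zlib.
    """
    root = ET.fromstring(xml_text)
    structure = []               # element paths in document (pre-order) order
    values = defaultdict(list)   # path -> data values in document order

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        structure.append(path)
        for name, value in elem.attrib.items():
            values[path + "/@" + name].append(value)
        text = (elem.text or "").strip()
        if text:
            values[path].append(text)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return {
        "structure": zlib.compress("\n".join(structure).encode("utf-8")),
        "containers": {
            path: zlib.compress(SEP.join(vals).encode("utf-8"))
            for path, vals in values.items()
        },
    }


def query_values(archive, path):
    """Return the values stored under `path`, decompressing only that container."""
    blob = archive["containers"].get(path)
    if blob is None:
        return []
    return zlib.decompress(blob).decode("utf-8").split(SEP)


doc = (
    '<hospital>'
    '<patient id="p1"><name>Ann</name><ward>ER</ward></patient>'
    '<patient id="p2"><name>Bob</name><ward>ICU</ward></patient>'
    '</hospital>'
)
archive = compress_xml(doc)
print(query_values(archive, "/hospital/patient/name"))  # ['Ann', 'Bob']
```

In this sketch, a query for patient names touches a single compressed container and leaves the structure container and all other data containers untouched. With one container per path the containers can become very small, which is the granularity trade-off discussed above.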