Cost-based optimization in DB2 XML

Article in IBM Systems Journal 45(2):299-320 · January 2006
DOI: 10.1147/sj.452.0299 · Source: DBLP
Abstract
DB2 XML is a hybrid database system that combines the relational capabilities of DB2 Universal Database™ (UDB) with comprehensive native XML support. DB2 XML augments DB2® UDB with a native XML store, XML indexes, and query processing capabilities for both XQuery and SQL/XML that are integrated with those of SQL. This paper presents the extensions made to the DB2 UDB compiler, and especially its cost-based query optimizer, to support XQuery and SQL/XML queries, using much of the same infrastructure developed for relational data queried by SQL. It describes the challenges to the relational infrastructure that supporting XQuery and SQL/XML poses and provides the rationale for the extensions that were made to the three main parts of the optimizer: the plan operators, the cardinality and cost model, and statistics collection.
    • Much research on improving XML traversal patterns or structural join methods to optimize the performance of XML queries has been done in [4][5][17]. Estimating the answer size and cost of queries has also been explored in [3][15].
    Abstract: As XML is playing a crucial role in web services, databases, and document processing, efficient processing of XML queries has become an important issue. At the same time, the increasing number of users demands high throughput for XML queries, with tens of thousands of queries executed in a short time. Given the great success of GPGPU (general-purpose computation on graphics processors), we propose a parallel XML query model based on the GPU, consisting mainly of two efficient task-distribution strategies, to improve the efficiency and throughput of XML queries. We have developed a parallel, simplified XPath language using the Compute Unified Device Architecture (CUDA) on the GPU, and we evaluate our model on a recent NVIDIA GPU against its counterpart on an eight-core CPU. The experimental results show that our model achieves both higher throughput and higher efficiency than the CPU-based approach.
    Full-text · Article · Dec 2011
    • In this section, we show how the Bloom filter [2] can be redesigned to take advantage of a memory hierarchy consisting of main memory and SCM [3]. Database systems have used Bloom filters for index ANDing [1], join processing [10], selectivity estimation [20], and statistics collection [1, 20]. A traditional Bloom filter (TBF) consists of a vector of β bits, initially all set to 0. To update the filter, k independent hash functions h_1, h_2, ..., h_k, each with range {1, ..., β}, are used (see the sketch after this entry).
    Abstract: Storage Class Memory (SCM) is here to stay. It has characteristics that place it in a class apart from both main memory and hard disk drives. Software and systems, architectures and algorithms need to be revisited to extract the maximum benefit from SCM. In this paper, we describe work being done at IBM in the area of SCM-aware data management. We specifically cover the challenges of placing objects in storage and memory systems that have NAND flash (one kind of SCM) in a hierarchy or at the same level as other storage devices. We also focus on the challenges of adapting data structures that are inherently main-memory based to work out of a memory hierarchy consisting of DRAM and flash, and we describe how these challenges can be addressed for a popular main-memory data structure, namely the Bloom filter.
    Full-text · Article · Jan 2010
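    The traditional Bloom filter described in the entry above is straightforward to sketch in code. The following is a minimal illustration, not taken from any of the cited papers; the double-hashing construction and the parameter names beta and k are our own choices.

```python
import hashlib

class TraditionalBloomFilter:
    """Minimal sketch of a traditional Bloom filter (TBF): a vector of
    beta bits, initially all 0, updated by k hash positions derived
    per item, each with range {0, ..., beta - 1}."""

    def __init__(self, beta: int, k: int):
        self.beta = beta
        self.k = k
        self.bits = bytearray(beta)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Double hashing: derive k positions from two base digests.
        h1 = int.from_bytes(hashlib.md5(item.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(item.encode()).digest(), "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.beta

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = TraditionalBloomFilter(beta=8192, k=4)
bf.add("customer:42")
assert bf.might_contain("customer:42")
```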
    • Unfortunately, they do not cover set-oriented SJ and HTJ operators. Balmin et al. [3] sketch the development of a hybrid cost-based optimizer for SQL and XQuery that is part of DB2 XML. Compared to our approach, they evaluate every path expression using an HTJ operator and cannot decide at a fine-granular level whether to use SJ operators or not.
    Abstract: Even though an effective cost-based query optimizer is of utmost importance for the efficient evaluation of XQuery expressions in native XML database systems, such a component is currently out of sight, because former approaches do not pay attention to the latest advances in the area of physical operators (e.g., Holistic Twig Joins and advanced indexes) or focus on only some of them. To support the development of native XML query optimizers, we introduce an extensible cost-based optimization framework that integrates the cutting-edge XML query evaluation operators into a single system. Using the well-known plan-generation techniques from the relational world and a novel set of plan equivalences, which allows for the generation of alternative query plans consisting of Structural Joins, Holistic Twig Joins, and numerous indexes (especially path indexes and content-and-structure indexes), our optimizer can now benefit from the knowledge on native XML query evaluation to speed up query execution significantly.
    Full-text · Conference Paper · Jan 2010
    • Let Q_i denote the sub-expression of Q up to step t_i. Then, the cardinality of Q_i is estimated by the recurrence relation

      card(Q_i) = 1,                                 if i = 0
      card(Q_i) = f(t_i | t_{i-1}) · card(Q_{i-1}),  otherwise        (1)

    Cardinality, as defined here, is similar to the definition in [2] (a small sketch evaluating this recurrence follows this entry).
    Abstract: The wide availability of commodity multi-core systems presents an opportunity to address the latency issues that have plagued XML query processing. However, simply executing multiple XML queries over multiple cores merely addresses the throughput issue: intra-query parallelization is needed to exploit multiple processing cores for better latency. Toward this effort, this paper investigates the parallelization of individual XPath queries over shared-address-space multi-core processors. Much previous work on parallelizing XPath in a distributed setting failed to exploit the shared-memory parallelism of multi-core systems. We propose a novel, end-to-end parallelization framework that determines the optimal way of parallelizing an XML query. This decision is based on a statistics-based approach that relies on both the query specifics and the data statistics. At each stage of the parallelization process, we evaluate three alternative approaches, namely data-, query-, and hybrid-partitioning. For a given XPath query, our parallelization algorithm uses XML statistics to estimate the relative efficiencies of these alternatives and find an optimal parallel XPath processing plan. Our experiments using well-known XML documents validate our parallel cost model and optimization framework, and demonstrate that it is possible to accelerate XPath processing using commodity multi-core systems.
    Full-text · Conference Paper · Jan 2010
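    Equation (1) above is mechanical to evaluate once per-step fanout factors are available. The following is a minimal illustrative sketch, not taken from the cited paper; the fanout values and the example path are hypothetical placeholders for the statistics f(t_i | t_{i-1}) that a real optimizer would look up.

```python
def estimate_cardinality(fanouts):
    """Evaluate the recurrence card(Q_i) = 1 if i = 0,
    card(Q_i) = f(t_i | t_{i-1}) * card(Q_{i-1}) otherwise.

    fanouts: the values f(t_i | t_{i-1}) for steps t_1..t_n, i.e.,
    the expected number of matches of step t_i per match of the
    preceding step t_{i-1}."""
    card = 1.0  # card(Q_0) = 1
    for f in fanouts:
        card *= f
    return card

# Hypothetical statistics for a path such as /catalog/book/author:
# one catalog root, ~500 books per catalog, ~1.7 authors per book.
print(estimate_cardinality([1.0, 500.0, 1.7]))  # -> 850.0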
    • Thus, in order to lower the false-positive rate, a larger amount of memory is required. Bloom filters are used in a wide variety of application areas, such as databases [1], distributed information retrieval [20], network computing [5], and bioinformatics [15]. Some of these applications require large Bloom filters to reduce the false-positive rate (the standard estimate of this rate is sketched after this entry).
    Abstract: Bloom filters are widely used in many applications, including database management systems. With a certain allowable error rate, this data structure provides an efficient solution for membership queries. The error rate is inversely proportional to the size of the Bloom filter. Currently, Bloom filters are stored in main memory because the low locality of their operations makes them impractical on secondary storage. In multi-user database management systems, where there is high contention for the shared memory heap, the limited memory available for allocating a Bloom filter may cause a high rate of false positives. In this paper we propose a technique to reduce the memory requirement of Bloom filters with the help of solid-state storage devices (SSDs). By using a limited memory space for buffering the read/write requests, we can afford a larger SSD space for the actual Bloom filter bit vector. In our experiments we show that, with a significantly smaller memory requirement and fewer hash functions, the proposed technique reduces the false-positive rate effectively. In addition, the proposed data structure runs faster than traditional Bloom filters by grouping the inserted records with respect to their locality on the filter.
    Full-text · Article · Jan 2010
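    The inverse relationship between filter size and error rate mentioned above follows from the standard false-positive estimate for a Bloom filter with m bits, k hash functions, and n inserted items: p ≈ (1 − e^(−kn/m))^k. This formula is textbook material rather than something stated in the excerpt; the sketch below simply evaluates it for a few filter sizes.

```python
import math

def false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Standard estimate p = (1 - e^(-kn/m))^k for a Bloom filter
    with m bits, k hash functions, and n inserted items."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Doubling the bit vector sharply lowers the error rate
# for the same number of items and hash functions.
for m in (1 << 20, 1 << 21, 1 << 22):  # 1, 2, 4 Mbit filters
    print(m, false_positive_rate(m, k_hashes=4, n_items=100_000))
```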
    • There is little work so far on cost estimation for XPath plans or operators. In IBM DB2 [16], an XQuery is translated into a tree consisting of operators in relational algebra extended with three XML-specific operators, and is optimized by the relational optimizer; the XML navigating operator (XSCAN) is very coarse and its cost models are not formally presented. The work presented in [13] deals with a single holistic operator, XNAV, tightly integrated with the storage engine.
    Abstract: The creation of a generic and modular query optimization and processing infrastructure can provide significant benefits to XML data management. Key pieces of such an infrastructure are the physical operators available to the execution engine for turning queries into execution plans. To be efficient, such operators need to implement sophisticated algorithms for logical XPath or XQuery operations. Moreover, to enable a cost-based optimizer to choose among them correctly, it is also necessary to provide cost models for these operator implementations. In this paper we present two novel families of algorithms for XPath physical operators, called LookUp (LU) and Sort-Merge-based (SM), along with detailed cost models. Our algorithms have significantly better performance than existing techniques over any of a variety of XML storage systems that provide a set of common primitive access methods. To substantiate the robustness and efficiency of our physical operators, we evaluate their individual performance over four different XML storage engines against operators that implement existing XPath processing techniques. We also demonstrate the performance gains for twig processing of plans consisting of our operators compared to a state-of-the-art holistic technique, specifically Twig2Stack. Additionally, we evaluate the precision of our cost models and analyze the sensitivity of our algorithms and cost models to a variety of parameters.
    Full-text · Conference Paper · Jan 2010