Conference Paper

Efficient Data Distribution for DWS.

DOI: 10.1007/978-3-540-85836-2_8 Conference: Data Warehousing and Knowledge Discovery, 10th International Conference, DaWaK 2008, Turin, Italy, September 2-5, 2008, Proceedings
Source: DBLP

ABSTRACT

The DWS (Data Warehouse Striping) technique is a data partitioning approach especially designed for distributed data warehousing environments. In DWS, the fact tables are distributed over an arbitrary number of low-cost computers and the queries are executed in parallel by all the computers, guaranteeing nearly optimal speed-up and scale-up. Data loading in data warehouses is typically a heavy process that becomes even more complex in distributed environments. Data partitioning brings the need for new loading algorithms that reconcile a balanced distribution of data among the nodes with an efficient data allocation, both vital to achieve low and uniform response times and, consequently, high performance during query execution. This paper evaluates several alternative algorithms and proposes a generic approach for evaluating data distribution algorithms in the context of DWS. The experimental results show that effective loading of the nodes in a DWS system must consider two complementary effects: minimizing the number of distinct keys of any large dimension in the fact tables at each node, and splitting correlated rows among the nodes.
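
To make the two complementary effects concrete, the sketch below is a minimal illustration and not one of the algorithms evaluated in the paper: the row field name, the per-key node groups and the group-size heuristic are assumptions. It assigns each distinct key of a large dimension to a small group of nodes (bounding the distinct keys per node) and then round-robins rows inside that group (keeping row counts balanced and splitting correlated rows across nodes).

    from collections import defaultdict

    def distribute_rows(rows, num_nodes):
        """Illustrative DWS-style loader sketch (not the paper's algorithm).

        Goals taken from the abstract:
          * few distinct keys of the large dimension on each node,
          * correlated rows (same key) still split across several nodes,
          * roughly balanced row counts per node.
        """
        nodes = [[] for _ in range(num_nodes)]            # fact rows per node
        keys_on_node = [set() for _ in range(num_nodes)]  # distinct keys per node
        rr_cursor = defaultdict(int)                      # round-robin position per key
        key_groups = {}                                   # key -> subset of nodes serving it
        group_size = max(1, num_nodes // 4)               # assumed heuristic

        for row in rows:
            key = row["big_dim_key"]                      # assumed column name
            if key not in key_groups:
                # new key: serve it from the currently least-loaded nodes
                least_loaded = sorted(range(num_nodes), key=lambda n: len(nodes[n]))
                key_groups[key] = least_loaded[:group_size]
            group = key_groups[key]
            # round-robin inside the group: correlated rows land on different nodes
            node = group[rr_cursor[key] % len(group)]
            rr_cursor[key] += 1
            nodes[node].append(row)
            keys_on_node[node].add(key)

        return nodes, keys_on_node

Restricting each key to a small group of nodes keeps the per-node distinct-key count low, while the round-robin step prevents all rows of a frequent key from piling up on a single node.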

Full text available from: Raquel Almeida

CITATIONS

  • "This paper extends a preliminary work done by the authors in [10] and proposes a generic methodology to evaluate and compare data distribution algorithms. The approach is based on a set of metrics that characterize the efficiency of the algorithms, considering three key aspects: data distribution time, coefficient of variation of the number of rows placed in each node, and queries response time."
    ABSTRACT: The Data Warehouse Striping (DWS) technique is a data partitioning approach especially designed for distributed data warehousing environments. In DWS, the fact tables are distributed over an arbitrary number of low-cost computers and each query is executed in parallel by all the computers, guaranteeing nearly optimal speed-up and scale-up. Data loading in distributed data warehouses is typically a heavy process and brings the need for loading algorithms that reconcile a balanced distribution of data among the nodes with an efficient data allocation. These are fundamental aspects to achieve low and uniform response times and, consequently, high performance during query execution. This paper proposes a generic approach for the evaluation of data distribution algorithms and assesses several alternative algorithms in the context of DWS. The experimental results show that effective loading of the nodes must consider two complementary effects: minimizing the number of distinct keys of any large dimension in the fact tables at each node, and splitting correlated rows among the nodes.
    Full-text · Article · Oct 2012
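
    The second of the three metrics quoted above is simple to pin down; the helper below (an illustrative snippet, not code from either paper) computes the coefficient of variation of the per-node row counts, where values near zero indicate a well-balanced distribution.

      import statistics

      def rows_per_node_cv(row_counts):
          """Coefficient of variation (std. dev. / mean) of rows placed on each node."""
          mean = statistics.mean(row_counts)
          return statistics.pstdev(row_counts) / mean if mean else 0.0

      # Example: four nodes with nearly balanced loads -> CV of roughly 0.0008
      print(rows_per_node_cv([250_300, 249_800, 250_100, 249_800]))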
  • "Avatara stores a cube's data in one location so that retrieval only requires a single disk fetch. Avatara also leverages some performance optimizations [1] [2], particularly pushing down predicates to the storage layer, to improve the response time of queries. MR-Cube [8] efficiently materializes cubes for holistic measures while fully using the parallelism of a MapReduce [4] framework."
    [Figure 2 caption: the Avatara architecture consists of an offline batch engine, which runs Hadoop jobs to transform the data into cubes, and an online query engine, which fetches results from a key-value store.]
    ABSTRACT: Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like "Who's Viewed My Profile?" and "Who's Viewed This Job?" The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this many, small cubes problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.
    Preview · Article · Aug 2012 · Proceedings of the VLDB Endowment
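
    The "many, small cubes" idea is that each cube fits entirely on one shard of the key-value store, so answering a query needs a single fetch. A hypothetical mapping in that spirit (the key format, function name and shard count are made up, not Avatara's actual scheme) could look like this:

      import hashlib

      def shard_for_cube(cube_key: str, num_shards: int) -> int:
          """Hash a small, self-contained cube to one shard of the key-value store."""
          digest = hashlib.md5(cube_key.encode("utf-8")).hexdigest()
          return int(digest, 16) % num_shards

      # e.g. the "Who's Viewed My Profile?" cube of one member
      print(shard_for_cube("profile_views:184290", num_shards=64))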
  • ABSTRACT: In this paper we propose an ontological OLAP (on-line analytical processing) framework to integrate distributed energy sensor data. The OLAP data cube in the framework, annotated with semantics and supported by other mechanisms, can deal with the issues of schema inconsistency that may result from the integration of heterogeneous data sources. The proposed approach provides a way of storing, reusing and composing OLAP cubes in order to increase system usability. A prototype of the proposed framework, based on a number of existing tools such as Protégé, Jess, and FuzzyJ, has been developed to demonstrate its feasibility.
    No preview · Article · Sep 2009 · IETE Technical Review