-
ABSTRACT: A partial-sum query obtains the summation over a set of specified cells of a data cube. We establish a connection between the covering problem in the theory of error-correcting codes and the partial-sum problem and use this connection to devise algorithms for the partial-sum problem with efficient space-time trade-offs. For example, using our algorithms, with 44% additional storage, the query response time can be improved by about 12%; by roughly doubling the storage requirement, the query response time can be improved by about 34%. Index Terms: Partial-sum query, covering code, error-correcting code, on-line analytical processing, data cube, multidimensional database, precomputation, query algorithm. 1 Introduction On-Line Analytical Processing (OLAP) [Cod93] allows companies to analyze aggregate databases built from their data warehouses. An increasingly popular data model for OLAP applications is the multidimensional database (MDDB) [OLA96], also known as data cube [GBLP96]. To b...
02/1999;
-
Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, March 23-26, 1999; 01/1999
-
ABSTRACT: This paper presents fast scalable decision-tree-based classification algorithms targeting shared-memory systems. The algorithms are based on the sequential SPRINT classifier and span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This is extended with task pipelining and dynamic load balancing to yield more efficient schemes. The task parallel approach uses dynamic subtree partitioning among processors. These schemes are disk based and achieve excellent speedup, making them ideally suited for data mining in very large databases. 1 Introduction An important task of data mining is to assign objects to predefined categories or classes -- a process called Classification. The input to the classification system consists of a set of example records, called a training set, over several fields or attributes. Attributes are either continuous, coming from an ordered domain, or categorical, coming from an unordered domain. One of the...
04/1998;
-
Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 12-14, 1997, Tucson, Arizona; 01/1997
-
ABSTRACT: We give nearly optimal algorithms for matrix transpose on meshes
with wormhole and XY routing and with a 1-port or 2-port communication
model. For an N×N mesh, where N=3·2^n and each
mesh node has a submatrix of size m to be transposed, our algorithms
take Nm/2 time steps for the 1-port model, and about Nm/3.27 time steps for
the 2-port model. The lower bound is Nm/3.414. While there is no previously
known algorithm for matrix transpose on meshes with wormhole and XY
routing, a naive algorithm, which is naturally adapted from the
well-known Recursive Exchange Algorithm, has a complexity of about Nm.
That is, our best algorithm improves over the naive algorithm by about a
factor of 3.27, and is within about a factor of 3.414/3.27 of the lower bound.
Parallel and Distributed Processing, 1994. Proceedings. Sixth IEEE Symposium on; 11/1994
-
ABSTRACT: We present parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task parallel approach uses dynamic subtree partitioning among processors. We evaluate the performance of these algorithms on two machine configurations: one in which data is too large to fit in memory and must be paged from a local disk as needed and the other in which memory is sufficiently large to cache the whole data. This performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup. For the local disk configuration, the speedup ranged from 2.97 to 3.86 for the build phase and from 2.20 to 3.67 for the total time on a ...
02/1970;
-
ABSTRACT: A range-max query obtains the maximum over all selected cells of a data cube where the selection is specified by providing ranges of values for numeric dimensions. Our general approach to speeding up range-max queries is to precompute and store certain key information of the data cube. In [HAMS97], we gave a tree algorithm based on precomputed max over balanced hierarchical tree structures; a branch-and-bound-like procedure [Mit70] was used to prune unnecessary search. In this paper, we propose three orthogonal techniques with the objective of improving the average response time of the range-max queries. First, rather than keeping only the index of the largest value at each internal node of the tree, we keep the indices of the t largest values with each internal node and use them to decrease the probability of scanning lower level nodes. Second, we further partition each sibling set of internal nodes into smaller groups and sort the precomputed indices within each group accor...
02/1970;
-
ABSTRACT: A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run-time. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information which is 1/b^d of the size of the d-dimensional data cube. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary...
02/1970;
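The prefix-sum idea in the range-sum abstract above can be illustrated with a minimal two-dimensional sketch. This is not the paper's implementation: the function names `build_prefix_sums` and `range_sum` are hypothetical, the cube is assumed to be a rectangular list of lists, and the prefix array carries one extra zero row and column purely to simplify the boundary cases.

```python
def build_prefix_sums(cube):
    """Return P where P[i][j] is the sum of cube[0..i-1][0..j-1].

    Assumption for this sketch: P has one padding row and column of
    zeros, so it is slightly larger than the cube itself.
    """
    rows, cols = len(cube), len(cube[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (
                cube[i][j]
                + prefix[i][j + 1]   # everything above
                + prefix[i + 1][j]   # everything to the left
                - prefix[i][j]       # overlap counted twice
            )
    return prefix


def range_sum(prefix, r0, c0, r1, c1):
    """Sum of cube[r0..r1][c0..c1] (inclusive) from four prefix lookups."""
    return (
        prefix[r1 + 1][c1 + 1]
        - prefix[r0][c1 + 1]
        - prefix[r1 + 1][c0]
        + prefix[r0][c0]
    )


cube = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
prefix = build_prefix_sums(cube)
total = range_sum(prefix, 0, 0, 2, 2)      # whole cube
lower_right = range_sum(prefix, 1, 1, 2, 2)  # cells 5, 6, 8, 9
```

Each query touches four entries of the prefix array regardless of the size of the selected sub-cube, which is the constant-time behaviour the abstract describes; in d dimensions the same inclusion-exclusion pattern uses 2^d lookups.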