Page 1

Relative Prefix Sums:

An Efficient Approach for Querying Dynamic OLAP Data Cubes

S. Geffner D. Agrawal A. El Abbadi T. Smith

Department of Computer Science

University of California

Santa Barbara, CA 93106

{sgeffner, agrawal, amr, smithtr}@cs.ucsb.edu

Abstract

Range sum queries on data cubes are a powerful tool for

analysis. A range sum query applies an aggregation

operation (e.g., SUM) over all selected cells in a data

cube, where the selection is specified by providing ranges

of values for numeric dimensions. Many application

domains require that information provided by analysis

tools be current or "near-current." Existing techniques for

range sum queries on data cubes, however, can incur

update costs on the order of the size of the data cube.

Since the size of a data cube is exponential in the number

of its dimensions, rebuilding the entire data cube can be

very costly. We present an approach that achieves

constant time range sum queries while constraining update

costs. Our method reduces the overall complexity of the

range sum problem.

1 Introduction

The data cube [4], also known in the OLAP

community as the multidimensional database [1,8], is

designed to provide aggregate information that can be used

to analyze the contents of databases and data warehouses.

A data cube is constructed from a subset of attributes in

the database. Certain attributes are chosen to be measure

attributes, i.e., the attributes whose values are of interest.

Other attributes are selected as dimensions or functional

attributes. The measure attributes are aggregated according

to the dimensions. For example, consider a hypothetical

database maintained by an insurance company. One may

construct a data cube from the database with SALES as a

measure attribute, and CUSTOMER_AGE and

DATE_OF_SALE as dimensions; such a data cube

provides aggregated total sales figures for all combinations

of region and date.

Range sum queries are useful analysis tools when

applied to data cubes. A range sum query sums the

measure attribute within the range of the query; an

example is find the total sales for customers with an age

from 37 to 52, over the past three months. Queries of

This research is partially supported by NSF grant IRI94-11330.

this form can be very useful in finding trends and in

discovering relationships between attributes in the

database. Range sum queries over data cubes thus provide

a useful tool for analysis. Efficient range-sum querying is

becoming more important with the growing interest in

database analysis, particularly in On-Line Analytical

Processing (OLAP)[2]. Since the introduction of the data

cube, there has been considerable research in the database

community regarding, for example, choosing subsets of

the data cube to precompute [6] and for the processing of

aggregation queries [5].

Ho, Agrawal, Megiddo and Srikant [7] have presented

an elegant algorithm for computing range queries in data

cubes which we call the prefix sum approach. The

essential idea of the prefix sum approach is to precompute

many prefix sums of the data cube, which can then be used

to answer ad hoc queries at run-time. The prefix sum

method permits the evaluation of any range-sum query on

a data cube in constant time. The approach is only

hampered by its update cost, which in the worst case

requires rebuilding an array of the same size as the entire

data cube.

In some problem instances, update cost is not a

significant consideration. This is the case, for example,

when data is static or is rarely updated. There are,

however, applications for which reasonable update cost is

important. Many companies are interested in tracking

current sales data, for which new information may arrive

on a daily basis. As competition increases in the global

marketplace, managers demand that their analysis tools

provide current or "near-current" information. For such

applications, the data cube will fail as a useful analysis

tool if it cannot accommodate new information at a

reasonable update cost. We note that, as the number of

dimensions in a data cube grows, the size of the data cube

grows exponentially. Update costs on the order of the size

of the data cube may not be practical in these applications,

particularly for large data cubes having many dimensions.

A method is required that achieves range sum queries in

constant time, but that does not require rebuilding the

entire data cube during updates.

In this paper, we present a new technique for evaluating

range sum queries in data cubes: the relative prefix sum

approach. This approach achieves constant time

Page 2

performance for queries, but constrains update costs. The

relative prefix sum approach reduces the overall

complexity of the range sum query problem.

Paper Organization The remainder of the paper is

organized as follows. In Section 2, we present the model

of the range sum problem. In Section 3, we develop the

general method used in our algorithms. We introduce the

concepts of overlays and the RP array, and describe their

operation. In Section 4, we discuss algorithms for

querying and updating these structures. We analyze the

performance characteristics of our method, and show that

their use reduces the overall complexity of the range-sum

problem. We also consider tunable parameters, such as

the overlay box size, and discuss other issues relating to

performance. Section 5 concludes the paper.

2 The Model

We employ the same model of the range sum problem

as presented in [7], which introduces the prefix sum

method. Assume the data cube has one measure attribute

and d dimensions. Let D={1,2,...,d} denote the set of

dimensions. For example, let the measure attribute be

SALES, and the dimensions be CUSTOMER_AGE and

DATE_OF_SALE. Each dimension has a size ni, which

represents the number of distinct values in the dimension.

This size is known a priori, as the number of days in a

year, for example, can be assumed to be static. Thus, we

can represent the d-dimensional data cube by a d-

dimensional array A of size n1 × n2 × ... × nd, where

ni≥2, i∈D. In Figure 1, d=2. For clarity, and without

loss of generality, our cost model will assume each

dimension has the same size; this allows us to present

many of the formulas more concisely. Thus, let the size

of each dimension be n, i.e. n=n1=n2=...=nd. Our

subsequent formulae and discussions will refer to n, rather

than the total size of the data cube N=nd; in this manner,

the impact of the dimensionality of the data cube on

performance will be revealed. We will call each array

element a cell. The total size of array A is nd cells. We

assume the array has starting index 0 in each dimension.

For notational convenience, in the two-dimensional

examples we will refer to cells in array A as A[i,j], where

i is the vertical coordinate and j is the horizontal

coordinate. Similarly, we refer to cells in array P as

P[i,j], cells in the overlay as O[i,j], and cells in RP as

RP[i,j].

Each cell in array A contains the aggregate value of the

measure attribute (e.g., total SALES) corresponding to a

given point in the d-dimensional space formed by the

dimensions. For example, given the measure attribute

SALES and the dimensions CUSTOMER_AGE and

DATE_OF_SALE, the cell at A[37, 25] contains the total

sales to 37-year-old customers on day 25. A range-sum

query on array A is defined as the sum of all the cells that

fall within the specified range. For example, a range-sum

query asking for the total sales to 37-year-old customers

from days 20 to 22 would be answered by summing the

cells A[37, 20], A[37, 21], and A[37, 22]. We will refer

to range-sum queries simply as range queries throughout

the rest of this paper. As Ho et. al. point out, the

techniques presented here can also be applied to obtain

COUNT, AVERAGE, ROLLING SUM, ROLLING

AVERAGE, and any binary operator + for which there

exists an inverse binary operator - such that a + b - b = a.

We observe the following characteristics of array A.

Array A can be used by itself to solve range sum queries;

we will refer to this as the naive method. Arbitrary range

queries on array A can cost O(nd): a range query over the

range of the entire array will require summing every cell in

the array. Updates to array A take O(1): given any new

value for a cell, an update can be achieved simply by

changing the cell's value in the array. Assuming that

queries and updates are equally likely, we may represent

the overall complexity of a method by taking the product

of its query and update costs. For the naive method, this

product of query and update costs is O(nd) * O(1) = O(nd).

Array A

Index

0

1

2

3

4

5

6

7

8

Figure 1. A two-dimensional data cube represented

as a two-dimensional array A.

012345678

3

7

2

3

4

2

4

2

5

5

3

4

2

2

3

5

4

4

1

2

2

1

1

3

2

2

3

2

6

3

5

3

6

7

2

1

2

8

3

3

3

1

1

3

3

4

7

3

5

4

8

9

1

2

6

1

4

2

7

5

3

9

1

3

2

5

8

1

1

3

1

9

3

4

7

2

3

1

4

3

6

Array P

Index

0

1

2

3

4

5

6

7

8

Figure 2. Array P used in the prefix sum method.

0123

11

29

40

51

61

75

93

103

116

4

13

39

53

67

80

95

114

127

143

5

17

50

67

86

103

126

154

168

186

6

23

57

78

99

123

151

182

205

224

7

26

62

88

117

142

171

205

229

257

8

29

69

102

133

161

191

229

256

290

389

10

12

15

19

21

25

27

32

18

24

29

35

40

49

55

64

21

29

35

42

50

61

69

81

In the prefix sum method, an array P, of the same size

as array A, stores various precomputed prefix sums of A

(Figure 2). Each cell contains the sum of all cells up to

and including itself in array A. For example, cell P[4,0]

contains the sum of all cells in the range A[0,0] to A[4,0],

or 19, while cell P[2,1] contains the sum of all cells in

the range A[0,0] to A[2,1] in array A, or 24. The sum of

the entire A array is found in the last cell, P[8,8].

Formally, for all 0≤xi≤n and i∈D,

P[x1, x2, ..., xd] = Sum(A[0, 0, ..., 0]:A[x1, x2, ..., xd])

x1

=0∑

i2

=∑

i1

x2

=0∑

id

xd

=0

...

A[i1, i2, ..., id]

Page 3

Using P, a range query on d dimensions can be

answered with a constant (2d) cell lookups, or O(1).

Figure 3 presents the basic idea of the prefix sum method:

the sum corresponding to a range query's region can be

determined by adding and subtracting the sums of various

other regions, until we have isolated the region of interest.

We note that all such regions begin at cell A[0,0] and

extend to some cell in A; thus, the sum of each of these

regions can be found directly by reading the value of a

single cell in P. The prefix sum method has thus reduced

range sum querying to the problem of reading a single

individual cell in the P array.

Area_EArea_AArea_BArea_C Area_D

=--+

Figure 3. A geometric illustration of the two

dimensional case: Sum(Area_E) = Sum(Area_A) -

Sum(Area_B) - Sum(Area_C) + Sum(Area_D).

Array A

Index

0

1

2

3

4

5

6

7

8

012345678

3

7

2

3

4

2

4

2

5

51

2

2

1

1

3

2

2

3

2

6

3

5

3

6

7

2

1

2

8

3

3

3

1

1

3

3

4

7

3

5

4

8

9

1

2

6

1

4

2

7

5

3

9

1

3

2

5

8

1

1

3

1

9

3

4

7

2

3

1

4

3

6

* 4

4

2

2

3

5

4

4

Array P

Index

0

1

2

3

4

5

6

7

8

Figure 4. Update example, prefix sum method.

0123

11

30

41

52

62

76

94

104

117

4

13

40

54

68

81

96

115

128

144

5

17

51

68

87

104

127

155

169

187

6

23

58

79

100

124

152

183

206

225

7

26

63

89

118

143

172

206

230

258

8

29

70

103

134

162

192

230

257

291

389

10

12

15

19

21

25

27

32

* 19

25

30

36

41

50

56

65

22

30

36

43

51

62

70

82

While this method provides constant time queries, in

the worst case it incurs update costs proportional to the

entire data space. This update cost results from the very

dependencies in the data that allow the method to work.

As noted, the values of cells in array P are cumulative, in

that they contain the sums of all cells in array A that

precede them. Thus, when a cell in A is updated, all cells

in P that follow it will also require updating. Figure 4

shows the array P as the cell A[1,1] is about to be

updated. The value of cell A[1,1] is a component of every

cell in the shaded region; thus, updating A[1,1] requires

updating every P cell in the shaded region. In the worst

case, when cell A[0,0] is updated, this cascading update

property will require that every cell in the data cube be

updated. Since the size of the data cube is nd cells, the

resulting update complexity is O(nd). The prefix sum

solution has thus traded update and lookup characteristics

with the naive method. The product of the query and

update costs for the prefix sum method is O(1) * O(nd) =

O(nd), which is identical to the naive method.

3 The Relative Prefix Sum Method

The prefix sum approach is very powerful. It provides

range sum queries in constant time, regardless of the size

of the data cube. On the other hand, updates to its data

structure are very expensive, due to the cascading updates

effect. As noted earlier, some application domains require

updates on a regular basis; when data cubes are very large,

and have many dimensions, an update cost of O(nd) can

become impractical. We seek a method that has the query

power of the prefix sum approach, but improves its update

performance.

The expensive update cost in P is a result of the

dependencies between values in its cells. These

dependencies in turn cause the cascading update effect. We

cannot, however, completely eliminate these dependencies.

The prefix sum method works precisely because of the

dependencies between its cells. In general, when we

eliminate dependencies we pay for it with an increased

query cost; if we eliminate all dependencies, we will be

left with the naive approach, whose query cost is O(nd).

Since we cannot completely eliminate dependencies, we

also cannot completely eliminate cascading updates.

However, perhaps we can create boundaries that limit

cascading updates to distinguished cells. In doing so, we

must be careful not to introduce query overhead that results

in the loss of the constant-time query characteristic.

The relative prefix sum approach is a new method for

answering range sum queries on data cubes. The method

adds and subtracts region sums to obtain the complete sum

of the query region, as illustrated in Figure 3. The method

makes use of new data structures that provide constant

time queries while controlling cascading updates. By

creating boundaries that limit cascading updates to

distinguished cells, the method reduces the overall

complexity of the range sum problem.

Our method makes use of two components: an overlay

and a relative-prefix (RP) array. An overlay partitions

array A into fixed size regions called overlay boxes.

Overlay boxes store information regarding the sums of

regions of array A preceding them. RP is an array of the

same size as array A; it contains relative prefix sums

within regions defined by the overlay. Using the two

components in concert, we will construct prefix sums "on

the fly". Together, the components limit cascading

updates to distinguished cells. We first describe overlays,

then describe RP.

3.1 Overlays

We define an overlay as a set of disjoint hyperrectangles

(hereafter called "boxes") of equal size that completely

partition array A into equal sized regions of cells (Figure

Page 4

5). In the figure, array A has been partitioned into overlay

boxes of size 3×3. For clarity, and without loss of

generality, let the length of the overlay box in each

dimension be k. The size of array A is nd, thus the total

number of overlay boxes is

We define several terms for use later in the paper. We

say that an overlay box is anchored at (a1, a2, ..., ad) if

the box corresponds to the region of array A where the

first cell (lowest cell index in each dimension) is (a1, a2,

..., ad); we denote this overlay box as B[a1, a2, ..., ad].

The first overlay box is anchored at (0, 0, ..., 0). An

overlay box B[a1, a2, ..., ad] is said to cover a cell (x1,

x2, ..., xd) in array A if the cell falls within the

boundaries of the overlay

∀i((ai≤xi)

V

(ai+k≥xi)).

n

k .

d

box, i.e., if

Overlay boxes

Index

0

1

2

3

4

5

6

7

8

Figure 5. Array A partitioned into boxes of size 3x3.

012345678

Refer to Figure 5. In the figure, k=3; i.e., each overlay

box in the figure is of size 3×3. The total number of

overlay boxes is

at cells (0,0), (0,3), (0,6), (3,0), (3,3), (3,6), (6,0), (6,3),

and (6,6). Each overlay box corresponds to an area of

array A of size kd cells; thus, in this example each overlay

box covers 32 = 9 cells of array A.

n

k = (9/3)2 = 9. The boxes are anchored

d

V

Y1

Y2

X1X2

Figure 6. Values stored in an overlay box of size 3x3.

Figure 6 shows the values stored in an overlay box.

Each overlay box stores an anchor value, plus (kd - (k-

1)d)-1 border values. V is the anchor cell, while X1, X2

and Y1, Y2 are border cells. The other cells covered by

the overlay box are not needed in the overlay, and would

not be stored.

Index012345678

0

1

2

3

4

5

6

7

8

V

Figure 7. Array A showing calculation of overlay box

anchor.

Index012345678

0

1

2

3

4

5

6

7

8

X1

Y1

Index012345678

0

1

2

3

4

5

6

7

8

X2

Y2

Figure 8. Array A showing calculation of overlay box

border values.

Values stored in an overlay box provide sums of

regions outside the box. The anchor value is the sum of

all cells in A up to, but not including, the cell under V.

Figure 7 shows an overlay box superimposed on array A.

The anchor value of the overlay box is equal to the sum of

the shaded region in array A. More formally, an overlay

box anchored at (a1, a2, ..., ad) has an anchor value that is

equal to SUM(A[0, 0, ..., 0]:A[a1, a2, ..., ad]) - A[a1, a2,

..., ad]. Figure 8 shows the calculation of border values.

The border values are equal to the sum of the associated

shaded regions of array A. Border value X1 is the sum of

all cells in the column above the cell containing X1.

Border value X2 is the sum of all cells in the column

above its cell, plus the cells above X1. Border value Y1

is the sum of all cells in the row to the left of the cell

containing Y1. Border value Y2 is the sum of all cells in

the row to the left of its cell, plus the cells to the left of

Y1. Note that border values are cumulative (i.e., X2

includes the value of X1, and in general Xn includes the

values of X1..Xn-1). Border values for other dimensions

are calculated in the same manner. Formally, the border

value contained in cell [i1, i2, ..., id] is equal to

SUM(A[0, 0, ..., 0]:A[i1, i2, ..., id]) - RP[i1, i2, ..., id] -

the anchor value of the overlay box.

As noted earlier, the regions used in Figure 3 all begin

at A[0,0] and end at an arbitrary cell in A (the target cell).

Figure 9 shows such a target cell, marked as *; for

reference, the overlay box covering the cell has been

superimposed on array A. In this example, the cell is

located at (7,5). We wish to determine the sum of the

region that begins at A[0,0] and ends at A[7,5]. We will

calculate this region sum in two steps. The first step uses

the overlay; the second step uses RP. Using values in the

overlay, we first calculate the sum of the shaded area in

Figure 9. The shaded area represents the portion of the

desired region that lies outside the overlay box. To

determine the sum of this area, we add border values Y1

Page 5

and X2 to the anchor value. The anchor value provides the

sum of all cells from A[0,0]:A[6,3], minus cell A[6,3].

Border value Y1 provides the sum of the cells in

A[7,0]:A[7,2]. Border value X2 provides the sum of cells

in A[0,3]:A[5,5]. These respective regions are outlined in

the figure. Adding the anchor value and two border values

thus provides the sum of all cells in the shaded region.

Index012345678

0

1

2

3

4

5

6

7

8

V

Y1

X2

*

Figure 9. Addition of border values Y1 and X2 with

anchor value V.

3.2 Relative Prefix Array (RP)

The relative prefix array (RP) is the same size as array

A. It is partitioned into regions of cells that correspond to

overlay boxes. Figure 10 shows RP with overlay boxes

drawn for reference. Each region in RP contains prefix

sums that are relative to the area enclosed by the box.

Each region of RP is independent of other regions. Figure

11 shows an area of RP corresponding to one overlay box.

In the figure, * denotes the location of a cell within the

box. The value of cell * is the sum of the cells in array A

that fall within the shaded region. More formally, given a

cell RP[i1, i2, ..., id] and the anchor cell location (v1, v2,

..., v3) of the overlay box covering this cell, the value

stored in RP[i1, i2, ..., id] is SUM(A[v1, v2, ...,

v3]:A[i1, i2, ..., id]).

Relative Prefix (RP) array

Index0

0

3

1

10

2

12

3

3

4

7

5

9

6

4

7

6

8

11

Figure 10. RP array with overlay boxes drawn for

reference.

12345678

12

19

35

12

23

30

10

23

39

892

8

486

7

9

18

24

21

29

18

24

29

38

13

23

38

17

23

29

12

21

10

18

24

11 11

565

8

82

9 11

16

13

21

11

19

31

14

2114 14

97

9

836

15

24

13

17

12

13

16

26 10

As noted earlier, the overlay values provide a portion of

the complete region sum. RP provides the final portion

of that sum. Figure 12 shows the construction of a

complete region sum using the overlay and RP. In the

figure, * denotes an arbitrary cell in A; here, the cell is

A[7,5]. For reference, the overlay box covering this cell

has been superimposed on array A. The anchor value and

border values from the overlay box provide the sum of the

portion of the shaded region outside the overlay box as

noted above. The cell * in RP provides the sum of the

portion of the shaded region within the overlay box.

Adding the two sums together yields the sum of all cells

in array A that fall within the shaded region. In this

manner, the overlay and RP can be used to generate any

region sum rooted at A[0,0, ..., 0], and they are therefore

sufficient to provide the region sums required by the

method illustrated in Figure 3. This process requires one

anchor value, d border values, and one value from RP.

*

Figure 11. Area of RP corresponding to one overlay

box.

Index012345678

0

1

2

3

4

5

6

7

8

V

Y1

X2

*

Figure 12. Construction of complete region sum

from overlay and RP.

3.3 Examples Using the Overlay and RP

We now present several concrete examples using an

overlay and RP. Figure 13 shows array A, the overlay and

RP. For reference, the overlay partitions have been drawn

on RP. The following examples refer to this figure. We

first provide examples of calculating anchor and border

values. Finally, we find a complete region sum.

Calculating Anchor and Border Values We will

calculate the anchor and border values for the overlay box

anchored at cell (3,3). The anchor value in overlay cell

O[3,3] is equal to SUM(A[0,0]..A[3,3]) - A[3,3] = 51-5 =

46. This is the sum of all cells in A up to, but not

including, cell A[3,3]. The border value in overlay cell

[4,3] = SUM(A[0,0]..A[4,3]) - RP[4,3] - anchor[3,3], or

61-8-46=7. The border value in overlay cell [5,3] =

SUM(A[0,0]..A[5,3]) - RP[5,3] - anchor[3,3], or 75-14-

46=15. Similarly, the border value in overlay cell [3,4] =

SUM(A[0,0]..A[3,4]) - RP[3,4] - anchor[3,3], or 67-8-

46=13. The border value in overlay cell [3,5] =

SUM(A[0,0]..A[3,5]) - RP[3,5] - anchor[3,3], or 86-13-

46=27.

Calculating a Complete Region Sum Suppose

we wish to determine the sum of the cells in array A from

cell A[0,0] to cell A[7,5]. Cell (7,5) is covered by the

overlay box anchored at cell (6,3). Let us first find the

sum of the portion of the shaded area that falls outside the

overlay box; this requires an anchor value plus two border

values. The anchor value of the overlay box is 86. We

now require two border values. Cell (7,5) is one cell down

from, and two cells to the right of, the anchor cell; thus,