Page 1

A Data Mining Approach for Selecting Bitmap Join

Indices

Ladjel Bellatreche

LISI/ENSMA - Poitiers University - France

bellatreche@ensma.fr

Rokia Missaoui

University of Quebec in Outaouais - Canada

Rokia.Missaoui@uqo.ca

Hamid Necir, Habiba Drias

Institut National d’Informatique - Algeria

{h necir, h drias}@ini.dz

Index selection is one of the most important decisions to take in the physical design of relational

data warehouses. Indices reduce significantly the cost of processing complex OLAP queries, but

require storage cost and induce maintenance overhead. Two main types of indices are available:

mono-attribute indices (e.g., B-tree, bitmap, hash, etc.) and multi-attribute indices (join indices,

bitmap join indices). To optimize star join queries characterized by joins between a large fact

table and multiple dimension tables and selections on dimension tables, bitmap join indices are

well adapted. They require less storage cost due to their binary representation. However, selecting

these indices is a difficult task due to the exponential number of candidate attributes to be indexed.

Most of approaches for index selection follow two main steps: (1) pruning the search space (i.e.,

reducing the number of candidate attributes) and (2) selecting indices using the pruned search

space. In this paper, we first propose a data mining driven approach to prune the search space

of bitmap join index selection problem. As opposed to an existing our technique that only uses

frequency of attributes in queries as a pruning metric, our technique uses not only frequencies,

but also other parameters such as the size of dimension tables involved in the indexing process,

size of each dimension tuple, and page size on disk. We then define a greedy algorithm to select

bitmap join indices that minimize processing cost and verify storage constraint. Finally, in order

to evaluate the efficiency of our approach, we compare it with some existing techniques.

Categories and Subject Descriptors: Abstract Interpretation [Programming Language]:

General Terms:

Additional Key Words and Phrases: Bitmap Join Indices, Frequent Itemset, Data Mining, Star

Join Queries, Relational Data Warehouses

Copyright c ?2007 by The Korean Institute of Information Scientists and Engineers(KIISE). Per-

mission to make digital or hard copies of part or all of this work for personal or classroom use

is granted without fee provided that copies are not made or distributed for profit or commercial

advantage and that copies bear this notice and the full citation on the first page. Copyrights for

components of this work owned by others than KIISE must be honored. Abstracting with credit is

permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires

prior specific permission and/or a fee. Request permission to republish from: Publicity Office,

KIISE. FAX +82-2-521-1352 or email office@kiise.org.

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007, Pages 177-194.

Page 2

178L. Bellatreche et al.

1.

A data warehouse stores large volumes of aggregated and non volatile data across

entire organization. It is usually accessed through complex queries for key business

operations. Data warehouses are frequently modelled using relational schemas like

star and snowflake schemas. A star schema consists of a single large fact table

that is related to multiple dimension tables via foreign keys. Dimension tables

are relatively small compared to the fact table and rarely updated.

typically non normalized so that the number of needed join operations are reduced.

Operations in data warehouse applications are mostly read ones and are dominated

by large and complex queries. The typical queries on the star schema are called star

join queries. They are characterized by: (i) a multi-table join among a large fact

table and dimension tables, (ii) each one of the dimension tables involved in the join

operation has multiple selection predicates1on its descriptive attributes, and (iii)

there is no join operation between dimension tables. Due to the interactive nature

of decision support applications, having a fast query response time is a critical

performance goal.

Without efficient optimization techniques, queries addressed to data warehouses

may take hours or days, which is unacceptable in most cases [Chatziantoniou and

Ross 2007]. In order to cope with complex and time-consuming decision support

queries, there is an urgent need for efficient and sophisticated physical design tech-

niques [Chaudhuri and Narasayya 2007]. The major bottleneck in evaluating such

queries is the join between a large fact table and the surrounding dimension tables

[St¨ ohr et al. 2000]. To optimize join operations in both OLTP and OLAP envi-

ronments, many techniques have been proposed that we can classify into two main

categories: (1) non redundant structures and (2) redundant structures. The first

category concerns different implementations of the join operation: nested loop, sort

merge join and hash join. These structures are efficient when (a) the size of joined

tables is reasonable, which is not the case of tables in relational data warehouses

and (b) join concerns two tables [Golfarelli et al. 2002]. Redundant structures, like

materialized views [Gupta 1999; Rizzi and Saltarelli 2003] and join indices [Val-

duriez 1987], are more efficient to speed up join operation involving many tables

[Oneil 1995]. Their main drawbacks are the extra storage requirement and the

maintenance overhead. However, such optimization techniques are inevitable in

data warehouse environments. The main peculiarity of indices is the fact that they

may be combined with other optimization techniques such as materialized views,

horizontal and vertical partitioning. In other words, any access path with a table

structure can be indexed.

This paper focuses on the selection of join bitmap indices (BJIs) and is organized

as follows. Section 2 provides a background on indices and data mining. Section

3 presents our formalization of the problem of selecting BJIs as an optimization

problem, and provides a description of the unique and most related work by showing

its limitations using a running example. Section 4 presents our two-step approach

for first generating BJI candidates using data mining based pruning algorithms, and

INTRODUCTION

They are

1A selection predicate has the following form: Di.Ajθ value, where Ajis an attribute of dimension

table Di and θ is one of the six comparison operators {=,<,>,≤,≥}, and value is a constant

belonging to the domain of attribute Aj).

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 3

A Data Mining Approach for Selecting Bitmap Join Indices179

then selecting the final BJIs using a greedy algorithm. Section 5 validates our work

through experimentations. Finally, Section 6 concludes the paper by summarizing

the main results and suggesting new directions.

2.

In the following, we review three key topics related to our proposal: (i) indexing

techniques, (ii) index selection problem, and (iii) frequent closed itemset computa-

tion.

BACKGROUND

2.1

A number of indexing strategies have been suggested for data warehouses. They can

be classified into two categories: single-table indices and (b) multiple table indices.

In the first category, indices are defined either on one or several attributes of a

single table. An index may be either clustered or non-clustered. In the second one,

indices are defined on two or more tables. A number of indexing strategies belonging

to these two categories have been suggested for data warehouses: value-list index,

projection index [O’Neil and Quass 1997], bitmap index [Chan and Ioannidis 1998;

O’Neil and Quass 1997], bit-sliced index [Chan and Ioannidis 1998], data index

[Datta et al. 1999], join index [Valduriez 1987], star join Index [Systems 1997] and

bitmap join indices [O’Neil and Graefe 1995].

Indexing Techniques

—Value-list index and Bitmap index. A value-list index consists of two parts. The

first part is a balanced tree structure and the second part is a mapping scheme.

The mapping scheme is attached to the leaf nodes of the tree structure and points

to the tuples in the table being indexed. The tree is generally a B-tree with

varying percentages of utilization. Oracle provides a B*-tree whose utilization

may go to 100%. Two different types of mapping schemes are in use. First, one

consists of a RowID (row identifier) list, which is associated with each unique

search-key value. This list is partitioned into a number of disk blocks chained

together. The second scheme uses bitmaps. A bitmap is a vector of bits whose

value depend on predicate values. A bitmap B lists all rows with a given predicate

P such that for each row r with ordinal number j that satisfies the predicate

P, the jthbit in B is set to 1. Bitmaps represent efficiently low-cardinality

data. However to make this indexing scheme practical for high-cardinality data,

compression techniques must be used. Value-list indices have been shown in

[O’Neil and Quass 1997] to outperform other access methods in queries involving

MIN or MAX aggregate functions, as well as queries that compute percentile

values of a given column. Bitmap indices can substantially improve performance

of queries with the following characteristics [Chee-Yong 1999]:

—The WHERE clause contains multiple predicates on low-or-medium-cardinality

columns (e.g., a predicate on Gender that has two possible values: female or

male or a predicate on city with three possible values as shown in Figure 1).

—Bitmap indices have been created on some or all of these low-or-medium-

cardinality columns.

Besides disk saving (due to the binary representation and possible compression

[Johnson 1999]), bitmap indices speed up queries having Boolean operations (such

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 4

180 L. Bellatreche et al.

Customer

Pascal

515

17

18

52

21

42

20

Age

0

0

1

1

0

0

Tokyo

0

0

0

0

1

0

Paris

1

1

0

0

0

1

LA

6

5

4

3

2

1

RID

LA

42

1116

LA

20

Eric

616

2125

Tokyo

Age

Didier

CID

Figure 2.

313

RIDC

4

Tokyo

Patrick4143

Paris

Yves 5152

LA

Gilles6161

City NameCID

Figure 1.

RIDC

17

18

52

21

42

20

Age

0

0

1

1

0

0

Tokyo

0

0

0

0

1

0

Paris

1

1

0

0

0

1

LA

6

5

4

3

2

1

RID

LA

42

Pascal

515

1116

LA

20

Eric

616

2125

Tokyo

Age

Didier

CID

313

RIDC

4

Tokyo

Patrick4143

Paris

Yves 5152

LA

Gilles6161

CityName CIDRIDC

Bitmap index on City

An Example of a bitmap index defined on City.

Customer

Pascal17

18

52

21

17

18

52

21

42

20

Age

LA 1116

LA

Eric 2125

Tokyo

Didier 3134

Tokyo

Patrick4143

ParisYves2

LAGilles1

CityName

17

18

52

21

17

18

52

21

42

20

Age

LA Pascal 1116

LA

Eric 2125

Tokyo

Didier3134

Tokyo

Patrick4143

ParisYves2

LAGilles1

CityName

Projected

Index on Age

An Example of a projection index.

as AND, OR and NOT) and COUNT operations. They are supported by most

of commercial DBMSs (Oracle, SQL Server, etc.).

—Projection Index. A projection index is equivalent to the column being indexed.

If C is the column being indexed, then the projection index on C consists of a

stored sequence of column values from C in the same order as the ordinal row

number in the table from where the values are extracted (see Figure 2). It has

been shown in [O’Neil and Quass 1997] that projection indices outperform other

indexing schemes for the execution of queries that involve computation on two

or more column values, and seem to perform acceptably well in GROUP-BY

queries.

—Bit-sliced Index A bit sliced index represents the key values of the column to

be indexed as binary numbers and projects a set of bitmap slices, which are

orthogonal to the data, held in the projection index. This index has been shown in

[Chee-Yong 1999] to particularly perform well for computing sums and averages.

Also, it outperforms other indexing approaches for percentile queries when the

underlying data is clustered, and for range queries whose range is large.

—Join Index. A join index pre-computes a join operation between two relations

[Valduriez 1987]. It is the result of joining two tables on join attributes and

projecting the keys (or tuple IDs) of the two tables. A join index is used to find

the matching tuples from the tables to be joined. This index has been proposed

in the context of traditional databases in medium of 80’s, and was later extended

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 5

A Data Mining Approach for Selecting Bitmap Join Indices181

00124

10025

10026

10027

00123

00122

00121

001 20

001 19

01018

0

1

0

0

1

0

17

27

0

1

0

0

1

0

16

26

0

1

0

0

1

0

15

25

0

0

1

0

0

1

14

24

0

0

1

0

0

1

13

23

1

0

0

0

0

1

12

22

1

0

0

0

0

1

11

21

1

0

0

0

0

1

10

20

0

0

0

0

1

1

9

19

0

0

0

1

1

0

8

18

0

0

0

0

1

1

7

17

0

0

0

0

1

1

6

16

1

0

0

0

0

1

5

15

0

0

1

1

0

0

4

14

0

0

0

1

1

0

3

13

0

1

0

0

1

0

2

12

0

1

0

0

1

0

1

11

Tokyo

1

Paris

0

LA

0

RIDS

10

0019

0018

0017

0016

1005

0104

0013

0012

0011

TokyoParisLARIDS

Customer

Product

17

Gardering

2003

18

52

Multimedia

2003

21

Beauty

2003

42

Beauty

Year

20

Range

Age

LAPascal

Manure

Apr

111

102

44

6

5

4

by Red Brick Systems [Systems 1997] to a multi-table join index in relational

data warehouses by concatenating columns from different dimension tables and

listing RowIDs in the fact table from each concatenated value.

—Data Index.A data index, like the projection index, exploits the positional

indexing strategy [Datta et al. 1999]. The Data index avoids duplication of data

by storing only the index and not the column being indexed. The data index

can be of two specific types: basic data index and join data index (for more

information, see [Datta et al. 1999]).

—Bitmap Join Index. It is a multi-table join index and a variant of join index. It

is defined as a bitmap index on a table R based on a single column of a table

S, where S commonly joins with R in a specific way. Bitmap join indices are

supported by most of commercial database/warehouse systems (e.g., Oracle, SQL

Server, etc.).

LA

Eric

Barbie

Mar

212

103

33

5

4

3

TokyoDidier

WebCam

Feb

313

104

22

4

3

2

TokyoPatrick

Clarins

Jan

414

105

11

3

2

1

ParisYves

Sonoflore

Month

515

106

TID

2

1

LA

Gilles

Name

616

PID

1

1

CityName

Name

CID

PID

RIDC

RIDP

17

Gardering

2003

18

52

Multimedia

2003

21

Beauty

2003

42

Beauty

Year

20

Range

Age

LAPascal

Manure

Apr

111

102

44

6

5

4

LA

Eric

Barbie

Mar

212

103

33

5

4

3

TokyoDidier

WebCam

Feb

313

104

22

4

3

2

TokyoPatrick

Clarins

Jan

414

105

11

3

2

1

ParisYves

Sonoflore

Month

515

106

TID

2

LA

Gilles6161

CityNameCIDRIDC

RIDP

Fitness

2003

SlimForm

May

101

55

6

5

Toys

Fitness

2003

SlimForm

May

101

55

6

5

Toys

2003Jun666

2003

RIDT

2003Jun666

2003

RIDT

Time

181110621224

1966105313 25

17 2210531326

151110631327

185510521223

104410521222

1011 10521221

205510461620

20 22104616 19

20 2210451518

40

15

66

11

104

106

212

313

17

27

44

17

66

22

105

105

111

313

16

26

45

19

44

66

103

105

212

313

15

25

17

18

55

11

103

106

515

212

14

24

100

18

66

55

102

105

515

212

13

23

103

10

55

44

102

105

414

212

12

22

102

10

11

11

102

105

414

212

11

21

200

20

11

55

102

104

313

616

10

20

100

20

11

22

101

104

212

616

9

19

27

20

33

22

101

104

111

515

8

18

20

40

44

66

101

104

111

212

7

17

14

44

55

66

106

105

212

111

6

16

14

45

66

44

105

103

414

212

5

15

10

17

11

55

104

103

515

515

4

14

50

100

33

66

104

102

616

515

3

13

28

103

66

55

106

102

616

414

2

12

25

102

11

11

106

102

616

414

1

11

Amount

200

TID

11

PID

102

CID

313

RIDS

10

100111012129

27331011118

20441011117

1455106

2126

14661054145

10111045154

50331046163

28661066162

25111066161

AmountTIDPIDCIDRIDS

Sales

Bitmap Join Index on City

Figure 3.An Example of a bitmap join index.

Example 1. To illustrate the use and interest of bitmap join indices, let us

consider the following data (see Figure 3) that will serve as a running example

throughout the paper. Suppose we have a relational data warehouse represented

by three dimension tables (TIME, CUSTOMERS and PRODUCTS) and one fact

table (SALES). Let Q be the query that the administrator would like to optimize:

SELECT Count(*)

FROM CUSTOMERS C, SALES S

WHEERE C.City=’Poitiers’

AND C.CID=S.CID

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 6

182 L. Bellatreche et al.

To do so, he/she creates a bitmap join index SC IDX between the fact table

SALES and the dimension table CUSTOMERS based on City attribute as follows:

CREATE BITMAP INDEX SC_IDX

ON SALES(CUSTOMERS.City)

FROM SALES S, CUSTOMERS C

WHERE S.CID= C.CID

To execute the above query, the query optimizer just accesses the bitmap corre-

sponding to the column representing Poitiers, without joining SALES and CUS-

TOMERS tables.

This example shows the efficiency of bitmap join indices for executing this kind of

queries.

2.2

Index selection problem (ISP) has been studied since the early 70’s and its im-

portance is well recognized. It is one of the most important issues in physical

design of advanced database applications. The task of index selection consists in

automatically selecting an appropriate set of indices for a data warehouse and a

workload under resource constraints (storage, maintenance, etc.). It is a challeng-

ing problem for the following reasons [Chaudhuri 2004]: The size of a relational

data warehouse schema may be large (many tables with several columns) for real

applications, and indices can be defined on a set of columns from different tables

(multiple table indices). A large spectrum of research studies has been proposed

to deal with this problem. By exploring the state of the art, we believe that most

studies concentrate on single table index selection. To the best of our knowledge,

only two studies have been proposed for dealing with multiple table index selection

problem [Aouiche et al. 2005; Bellatreche et al. 2007]. On the industry side, several

index selection tools were developed (e.g., AutoAdmin [Chaudhuri and Narasayya

1998] for self-tuning and self-administering databases and data warehouses).

In single table index selection problem, most of existing approaches use two main

phases: (1) generation of attribute candidates and (2) selection of a final config-

uration. The first phase prunes the search space of index selection problem by

eliminating some non relevant attributes. In the second phase, the final indices are

selected using heuristics like greedy algorithms [Chaudhuri and Narasayya 1997],

linear programming algorithms [Chaudhuri 2004], etc. The quality of the final set

of indices depends essentially on the pruning phase. To prune the search space

of index candidates, many approaches were proposed [Chaudhuri and Narasayya

1997; Valentin et al. 2000; Chaudhuri 2004; Labio et al. 1997; Aouiche et al. 2005]

using enumeration-driven heuristics. For instance, [Chaudhuri and Narasayya 1997]

propose a greedy algorithm which uses the optimizer cost estimate in SQL Server

to decide the goodness of a given configuration of indices. The weakness of this

work is that it imposes the number of generated candidates. DB2 Advisor is another

example belonging to this category [Valentin et al. 2000], where the query parser is

used to pick up selection attributes used in workload queries. The generated can-

didates are obtained by a few simple combinations of selection attributes [Valentin

et al. 2000].

Index Selection

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 7

A Data Mining Approach for Selecting Bitmap Join Indices183

2.3

Let I = {i1,i2,···,im} be a set of m distinct items (table attributes). A transaction

(tuple) T contains a set of items in I, and has an associated unique identifier called

TID. A subset X of I where k = |X| is referred to as a k−itemset (or simply an

itemset), and k is called the length of X. A transaction database (TDB), say D, is

a set of transactions. The number of transactions in D that contain an itemset X

is called the absolute support of X whereas the fraction of transactions is called its

relative support (both denoted by sup(X)). Thus, an itemset is frequent (or large)

when sup(X) reaches at least a user-specified minimum threshold called minsup.

One of the most frequently used technique in data mining is rule mining which

is conducted in Apriori-like algorithms [Agrawal and Srikant 1994] in two steps:

detection of all frequent itemsets, and utilization of frequent itemsets to generate

association rules (e.g., X ⇒ Y ) that have a confidence ≥ minconf. While the second

step is relatively easy and cost-effective, the first one presents a great challenge

because the set of frequent itemsets (FIs) may grow exponentially with the whole

set of items. To reduce the size of the FI set, some studies were conducted on

frequent closed itemsets FCIs and maximal frequent itemsets (i.e., itemsets for

which every superset is infrequent [Burdick et al. 2001]).

The Close algorithm [Pasquier et al. 1999] is one of the first procedures for FCI

generation. Like Apriori, it performs a level-wise computation within the powerset

lattice. However, it exploits the notion of generators of FCIs2to compute closed

itemsets. ChARM [Zaki and Hsiao 2002] is another procedure which generates

FCIs in a tree organized according to inclusion. The computation of the closure

and the support is based on a efficient storage and manipulation of TID-sets (i.e.,

the set of transactions per item). Closure computation is accelerated using diffsets,

the set difference on the TID-sets of a given node and its unique parent node in the

tree.

Closet [Han et al. 2000] and its variant Closet+ [Wang et al. 2003] both

generate FCIs as maximal branches of a FP-tree, a structure that is basically a

prefix tree (or trie) augmented with transversal lists of pointers.

In the next section, we follow the above mentioned steps to select bitmap join

indices: generation of attribute candidates and selection of a final configuration.

The first step will be based on the computation of frequent closed itemsets using

either Close or ChARM.

Frequent Closed Itemset Computation

3.

To ease the understanding of our proposed approach, we provide a motivation and

a formulation of the BJI selection problem. Then, we present and discuss the most

related work.

As indicated earlier, a BJI is defined on one or several non key dimension attributes

with low cardinality3(called indexable columns) for joining dimension tables with

the fact table. An indexable attribute Ajof a given dimension table Difor a BJI

is a column Di.Ajsuch that there is a condition of the form Di.Aj θ Expression

BITMAP JOIN INDEX SELECTION PROBLEM

2A generator of an FCI Y is a subset of that itemset such that its closure is equal to Y .

3The domain of this attribute should be an enumerated domain like gender, color, etc.

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 8

184L. Bellatreche et al.

in the WHERE clause. The operator θ must be among {=,<,>,≤,≥}.

Let A = {A1,A2,˙ ,AK} be the set of indexed candidate attributes for BJIs. Then,

the possible number of BJIs with only one group of attributes, grows exponentially

with K, and is given by:

?K

For K = 4, this number is 15. The possible number of BJIs with any combination

of attribute groups is given by:

?

For K = 4, the number of possible cases is (215).

efficiently finding the set of BJIs that minimizes the total query processing cost

while satisfying a storage constraint cannot be handled by first enumerating all

possible BJIs and then computing the query cost for each candidate BJI. As a

consequence, BJI selection problem can be formulated as follows:

Given a data warehouse with a set of dimension tables D = {D1,D2,...,Dd} and

a fact table F, a workload Q of queries Q = {Q1,Q2,...,Qm}, where each query

Qi (1 ≤ i ≤ m) has an access frequency, and a storage constraint S, the aim of

BJI selection problem is to find a set of BJIs among a pre-computed subset of all

possible candidates, which minimizes the query processing cost and satisfies the

storage requirements S.

The single work dealing with this problem using a data mining approach is the

one proposed by Aouiche et al. [Aouiche et al. 2005]. Intuitively, selecting bitmap

join indices means partitioning the set of indexable attribute A into disjoint groups.

This motivates the authors in [Aouiche et al. 2005] to propose an approach based

on data mining (and Close algorithm [Pasquier et al. 1999]) to prune the search

space by computing the set of frequent closed attribute groups rather than all the

possible combinations indicated above. The groups generated in the pruning step

are then used in the second step to select the final configuration of indices.

Aouiche et al.’s work has two main limitations: (1) no formalization is proposed,

(2) the proposed approach only uses frequencies of attributes to generate frequent

closed itemsets, where each one represents an attribute set to be indexed. The

frequency parameter is not sufficient to be a pruning metric. Since in the vertical

partitioning of databases4the frequency parameter is not sufficient in getting a

good fragmentation schema [Fung et al. 2003], we believe that a similar observa-

tion holds for the bitmap join index selection problem. Indeed, Fun et al. [Fung

et al. 2003] showed the weakness of affinity-based algorithms (that use only query

frequencies) in reducing the query processing cost. To get a better result, they

recommend the use of other parameters, like the size of tables, the length of each

tuple, the size of disk page, etc.

To overcome these limitations, we propose new pruning algorithms, called Dy-

naClose and DynaCharm that take into account the above parameters as part of

the pruning metric, since the cost of join operations depends heavily on the size of

joined tables [Getoor et al. 2001]. Once the pruning phase is processed, a greedy

1

?+?K

2

?+ ... +?K

K

?= 2K− 1(1)

2K−1

1

?

+

?

2K−1

2

?

+ ... +

?

2K−1

2K−1

?

= 22K−1

(2)

Therefore, the problem of

4Each fragment contains a subset of attributes that are frequently used together.

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 9

A Data Mining Approach for Selecting Bitmap Join Indices185

Table I.Query description.

(1) select S.channel id, sum(S.quantity sold) from S, C

where S.channel id=C.channel id and C.channel desc=’Internet’ group by S.channel id

(2) select S.channel id, sum(S.quantity sold), sum(S.amount sold) from S, C

where S.channel id=C.channel id and C.channel desc =’Catalog’ group by S.channel id

(3) select S.channel id, sum(S.quantity sold),sum(S.amount sold) from S, C

where S.channel id=C.channel id and C.channel desc =’Partners’ group by S.channel id

(4) select S.cust id, avg(quantity sold) from S, C

where S.cust id=C.cust id and C.cust gender=’M’ group by S.cust id

(5) select S.cust id, avg(quantity sold) from S, C

where S.cust id=C.cust id and C.cust gender=’F’ group by S.cust id

algorithm is executed to select a set of BJSs that reduces the query processing cost

and satisfies the storage constraint.

3.1

To show the limitations of Aouiche’s approach in pruning search space of bitmap

join indices, we consider the following example with a part of a star schema (used

in our experimental study) containing two dimension tables CHANNELS (denoted

by Ch) and CUSTOMERS (denoted by C) and a fact table SALES (denoted

by S). The cardinalities of these tables (number of instances) are: ||SALES|| =

16260336,||CHANNELS|| = 5 and ||CUSTOMERS|| = 50000. Let us assume

that five queries are most frequently executed on the corresponding data cube (see

Table I). In the above queries, two main join operations are used: one between

SALES and CUSTOMERS (J1 : SALES 1 CUSTOMERS), and another one

between SALES and CHANNELS (J2: SALES 1 CHANNELS). Basically, the

cost of J1is higher than J2since the size of CUSTOMERS (50 000 instances) is

larger than the size of CHANNELS (5 instances).

With minsup = 3 (in absolute value), the solution returned by [Aouiche et al.

2005] is a bitmap join index (that we call sales desc bjix) defined on CHANNELS

and SALES using channel desc attribute. This is due to the fact that there are

three occurrences of the same selection predicate defined on that attribute in the

five queries. However, no bitmap join index is proposed between SALES and CUS-

TOMERS since the cust gender attribute is not so frequent as minsup.

consequence, only the join J2will be optimized, but not the global query set.

To overcome this limitation, we enrich the pruning function by considering other

parameters like the size of tables, the length of an instance of tables, and page size.

Illustration of Weakness of the Existing Pruning Approach

As a

4.

Given a star schema with a set of dimension tables {D1,D2,...,Dl} and a fact table

F. Let ||Ti|| and LCibe the cardinality of a (dimension or fact) table Tiand the

length of an instance of Di, respectively. The number of disk pages occupied by a

table T, denoted by |T| is calculated as follows:

?||T|| × LT

where PS represents the page size (in bytes) on disk.

THE PROPOSED APPROACH

|T| =

PS

?

(3)

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 10

186L. Bellatreche et al.

In order to select a set of bitmap join indices minimizing the overall query pro-

cessing cost and satisfying the storage constraint S, we consider an approach with

two steps commonly found in the classical index selection approaches: (1) genera-

tion of candidate attributes, and (2) selection of the final configuration and BJIs.

These steps will be described in the following subsections.

4.1

The input of this step is the context matrix. It is constructed using the set of

queries Q = {Q1,Q2,...,Qm} and a set of indexable attributes A = {A1,A2,...,Al}.

The matrix has rows and columns that represent queries and indexable attributes,

respectively. A value of this matrix is given by:

?

Example 2. Recall that the size of CUSTOMERS, CHANNELS and SALES is:

50 000, 5 and 16 260 33 instances, respectively. Instance length of CUSTOMERS,

CHANNELS and SALES is: 24, 24 and 36, respectively, and page size PS =

65536. We consider the same five queries as defined in Table 1. To facilitate the

construction of the context matrix, we rename the indexable attributes as follows:

Sales.cust id = A1, Customers.cust id = A2; Customers.cust gender = A3,

Channels.channel id = A4, Sales.channel id = A5; Channels.channel desc =

A6. The matrix is given below.

Generation of Candidate Attributes

Uses(Qi,Aj) =

1 if query Qiuses a selection predicate defined on Aj

0 otherwise

A1 A2 A3 A4 A5 A6

000

000

000

111

111

2

555

Q1

Q2

Q3

Q4

Q5

1

1

1

0

0

3

5

1

1

1

0

0

3

5

1

1

1

0

0

3

5

Support

22

To prune the search space of the BJI selection problem, we propose two algorithms:

DynaClose and DynaCharm which are an adaptation of Close [Pasquier et al.

1999] and ChARM [Zaki and Hsiao 2002], respectively. We used Close because it

has been exploited by Aouiche et al. [Bellatreche and Boukhalfa 2005] for the same

purpose (selecting BJIs), and hence will allow us to better compare the performance

of our approach against Aouiche’s approach (see Section 5 for more details). We

also used ChARM because it is a well-known and efficient algorithm for frequent

closed itemset generation. Our adaptation concerns especially the pruning metric

which is different from the one used in Close and ChARM, where only frequencies

of attributes are used. In order to prune the search space, we propose a new metric

Fitness(X), called fitness metric on a FCI X, that penalizes the FCIs defined on

small dimension tables.

Fitness(X) =1

n× (

n

?

i=1

supi× αi)(4)

where n represents the number of non-key attributes Aiin X, αiis equal to

where |Dj| and |F| represent the number of pages needed by the dimension table

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

|Dj|

|F|,

Page 11

A Data Mining Approach for Selecting Bitmap Join Indices187

Dj that includes Aiand the fact table F, respectively. The elements supiis the

support of Ai.

Given a minsup, minfit (minimal fitness value) can be computed as follows:

minfit =minsup

|F|

×

(

d

?

j=1

|Dj|

d

)

(5)

Example 3. From our trivial example, two closed itemsets may be generated:

CFI1= {A1,A2,A3} with a support equal to 0.4 and CFI2= {A3,A5,A6} with 0.6

support. With minsup = 0.6, Aouiche’s approach will retain CFI2= {A3,A5,A6}

and hence suggest a BJI on A6 to match the facts in Sales with records in Channels

according to the values of channel desc. To illustrate our pruning technique, we

first compute the fitness function for each one of the generated CFIs.

Fitness(CFI1) = sup(A3) × α3

=

2

5×

?50000×24

65536

65536

?

?

894

||1626033||×36

?

=

2

5×19

= 0.0085

One may notice that CFI1 contains only one non-key attribute which is A3, and

hence Fitness(CFI1) will take into account A3only. Fitness(CFI2) is computed

based on A6 only and is equal to 0.00067.

0.6

894×19+1

4.2Elimination of Dirty FCIs

The set of the FCIs generated by DynaClose and DynaCharm must be purified to

avoid the erroneous bitmap join indices. Let us recall that a bitmap join index

is built between a fact table and dimension table(s) based on non-key attributes.

Therefore, in its definition, we should find key-attributes and non-key attributes.

There are two main requirements of bitmap join indices: (1) if a bitmap join index

is defined on k (k > 1) attributes of the same dimension table, it shall have only

one join operation between this table and the fact table. (2) if a bitmap join index

is defined on k (k > 1) attributes of distinct dimension tables, it shall have k join

operations between this table and the fact table. Based on these requirements, we

distinguish three scenarios that lead to a purification (i.e., the elimination of the

concerned FCI):

The value of minfit is 0.0067 (=

2). Our pruning metric will then select CFI1rather than CFI2.

(1) The FCI contains only key attributes (of dimension tables) or only foreign keys

of the fact table.

(2) The FCI has a number of key attributes significantly higher than the number

of non-key attributes. For instance, the case of a FCI (customers.cust gender,

sales.cust id, customers.cust id, sales.prod id, products.prod id) having three

keys and one non-key attribute. For one non-key attribute, we only need two

key attributes for building BJIs. Generally, for N selection attributes, 2 × N

key attributes are required.

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 12

188L. Bellatreche et al.

(3) The FCI is composed of only non-key attributes.

After this elimination, we get a set of candidate dimension attributes for selecting

bitmap join indices, denoted by CA (CA ⊆ A), where A is the initial set of indexable

attributes. CA is the union of all attributes belonging to the purified FCIs. To select

the final configuration of bitmap join indices, we consider only CA as described in

the following section.

4.3

Now, we have the set of all candidate attributes needed to build the final config-

uration of BJIs using a greedy algorithm. The input of this algorithm covers: (a)

a star schema with a fact table and a set of dimension tables, (b) a set of queries:

Q = {Q1,Q2,···,Qp}, (c) CA and (d) a storage constraint S. Our algorithm starts

with a configuration having a bitmap index defined on an attribute of CA. Since

the cardinality of CA may be large, we choose the first attribute as the one with

the lowest cardinality, let say, Imin. This choice assumes a sort of elements in CA

according to their cardinality. Then, the procedure iteratively improves the initial

configuration by considering other attributes of CA while the constraint on S is

maintained and a further reduction in the total query processing cost is possible.

To measure efficiency of a configuration, we use a mathematical cost model, an

adaptation of an existing one [Aouiche et al. 2005] that computes the number of

disk page accesses (IO costs) when executing the set of queries. In order to estimate

storage cost required for a bitmap join index, we adapt our cost model developed

in [Bellatreche et al. 2000]. The main steps of our approach are given in Algorithm

1.

Selection of the Final Configuration

Algorithm 1 Greedy Algorithm for BJIs Selection

Input :

CA, Q, S(storage bound)

Output:

Config: set of selected BJIs.

Notations:

BJIj: a bitmap join index defined on attribute Aj.

Size(BJIj): storage cost required for BJIj

begin

SCA = SORT(CA); /* a sequence of attributes */

Config = BJImin;

S := S − Size(BJImin);

SCA := SCA − Amin;

WHILE (Size(Config) ≤ S) DO

FOR each next element Ajin SCA DO

IF (COST[Q,(Config ∪ BJIj))] < COST[Q,Config])

AND ((Size(Config ∪ BJIj) ≤ S)) THEN

Config := Config ∪ BJIj;

Size(Config) := Size(Config) + Size(BJIj);

SCA := SCA − Aj; /* remove Ajfrom SCA */

end

Aminis the attribute used to defined BJImin

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 13

A Data Mining Approach for Selecting Bitmap Join Indices189

5.

To evaluate our approach, we first implemented all algorithms (Close, DynaClose,

Charm, DynaCharm and the greedy algorithm) using Java language and a Pentium

IV with 512 MB of memory. We conducted several experimentations using the same

dataset and the forty OLAP queries as in [Aouiche et al. 2005]. The star schema

of the data warehouse has one fact table SALES and five dimension tables: TIME,

CUSTOMERS, PRODUCTS, PROMOTIONS and CHANNELS (see Tables II).

EXPERIMENTAL STUDY

Table II.Table cardinalities used in the experiments.

TablesNumber of rows

SALES16 260 336

50 000

10 000

1 461

501

5

CUSTOMERS

PRODUCTS

TIME

PROMOTIONS

CHANNELS

The experiments were conducted according to four scenarios: (1) identification

of the value of minsup that gives an important number of FCIs, (2) evaluation

of different approaches (Close, DynaClose, Charm and DynaCharm) by executing

the forty queries on non indexed tables without considering storage constraint, (3)

evaluation of different approaches by considering the storage constraint, and (4)

computation of CPU bound of different approaches.

First, we carried out experiments to set the appropriate value of minsup that

allows the generation of a large set of FCIs. The results show that the appropriate

minsup value should be set to 0.05.

5.1

Figure 4 shows how different indexing approaches reduce the cost of executing the

forty queries with an increasing number of minimum support. The main result is

that DynaClose outperforms approaches for almost all values of minsup. However,

its performance deteriorates (in the sense that no candidate indices can be gen-

erated) when the minsup value becomes high. We notice that for minsup values

exceeding 0.475, Close, Charm and DynaCharm stop generating new FCIs, and

hence the query processing cost remains stable.

A comparison between DynaCharm and DynaClose shows that they have a simi-

lar performance for small minsup values (ranging between 0.05 and 0.0175). These

results coincide with the experimental study of Zaki et al. [Zaki and Hsiao 2002].

However, as we increase minsup, the performance gap between DynaClose and

DynaCharm becomes larger. This is due to the fact that DynaCharm processes

branches in a depth-first fashion, and FCIs are formed only at the end of an itera-

tion.

Evaluation without Storage Constraint

5.2

The set of BJIs generated by DynaClose and Dynacharm for a minsup=0.05 requires

storage of 146,88 MB (see Figure 6). This value is very high if we compare it with

Evaluation of DynaClose et Dynacharm with Storage Constraint

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 14

190L. Bellatreche et al.

0

2000000

4000000

6000000

8000000

10000000

12000000

0,05 0,075 0,175 0,225 0,25 0,275 0,375 0,45 0,475 0,5 0,525 0,55 0,575 0,6 0,625 0,65 0,675 0,7 0,725

MINSUP

C

O

S

T

DynaClose

Close

Charm

DynaCharm

Sans indexation

Figure 4.Quality of our approaches.

0

10

20

30

40

50

60

70

80

8 1620 324484 100 108 124 139 147 160

Storage in Mo

%

DynaClose

Close

Charm

Dynacharm

Figure 5.Behavior of our approaches according to storage constraint.

the size of the fact table which is 372,17 MB. Consequently, we execute our greedy

algorithm for selecting BJIs by considering various storage values with a fixed value

of minsup equal to 0.05. This value allows the generation of a large number of

index candidates. Figure 5 shows that Dynaclose and Dynacharm improve the

performance with a gain of 43% (compared to the solution without indexing) for

44 MB of storage (almost 3 times smaller than the initial space (146,88 MB)).

With the same storage, Charm and Close give a 33,56% gain.

proposed variants Dynaclose and Dynacharm provide a better performance than

the traditional approach for all the values of the considered storage, except for the

storage space 84 MB (where all approaches provide the same gain of 58,19%).

Therefore, our

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007

Page 15

A Data Mining Approach for Selecting Bitmap Join Indices191

0

20

40

60

80

100

120

140

160

0,05 0,08 0,18 0,23 0,25 0,28 0,38 0,45 0,48 0,5 0,53 0,55 0,58 0,6 0,63 0,65 0,68 0,7 0,73

naCharm and DynaClose need more execution time to prune the search space com-

pared to Charm and Close. This result was foreseeable since contrary to traditional

approaches (Closed and Charm) which prune according to the minsup, our two ap-

proaches take more time since the pruning phase involves the computation of the

fitness function for each generated FCI. The results show that DynaCharm is the

approach which requires the highest time for almost all minsup values that we con-

sider, but it remains stable for high minsup. This is due to the pruning phase of

DynaCharm since it is carried out only when we get maximized closed itemsets.

MINSUP

DynaClose

Close

Charm

DynaCharm

S

T

O

R

A

G

E

Figure 6.Storage vs. Minsup.

5.3

We have conducted experiments about the execution time of each algorithm (in

microseconds) according to a varying value of minsup. Table 3 shows that Dy-

Evaluation of Different Approaches based on CPU Bound

Table III.Execution Time (in milliseconds) of Different Approaches.

Minsup

0,04

0,075

0,175

0,225

0,25

0,275

0,375

0,45

0,475

0,5

0,7

DynaClose

2794

2473

2403

2158

2210

2220

2113

2195

1985

1910

1945

Close

3079

2518

2491

1574

1680

2063

1737

1730

1319

1294

1294

Charm

3424

3121

2708

1540

1504

1043

1046

683

288

303

303

DynaCharm

2984

2961

3109

3865

3457

3830

4181

4168

4000

3758

3758

Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007