A Data Mining Approach for selecting Bitmap Join Indices.
ABSTRACT Index selection is one of the most important decisions to take in the physical design of relational data warehouses. Indices reduce significantly the cost of processing complex OLAP queries, but require storage cost and induce maintenance overhead. Two main types of indices are available: monoattribute indices (e.g., Btree, bitmap, hash, etc.) and multiattribute indices (join indices, bitmap join indices). To optimize star join queries characterized by joins between a large fact table and multiple dimension tables and selections on dimension tables, bitmap join indices are well adapted. They require less storage cost due to their binary representation. However, selecting these indices is a difficult task due to the exponential number of candidate attributes to be indexed. Most of approaches for index selection follow two main steps: (1) pruning the search space (i.e., reducing the number of candidate attributes) and (2) selecting indices using the pruned search space. In this paper, we first propose a data mining driven approach to prune the search space of bitmap join index selection problem. As opposed to an existing our technique that only uses frequency of attributes in queries as a pruning metric, our technique uses not only frequencies, but also other parameters such as the size of dimension tables involved in the indexing process, size of each dimension tuple, and page size on disk. We then define a greedy algorithm to select bitmap join indices that minimize processing cost and verify storage constraint. Finally, in order to evaluate the efficiency of our approach, we compare it with some existing techniques.

Conference Paper: On Simplifying Integrated Physical Database Design.
[Show abstract] [Hide abstract]
ABSTRACT: This paper deals with the problem of integrated physical database design involving two optimization techniques: horizontal data partitioning (HDP) and bitmap join indexes (BJI). These techniques compete for the same resource representing selection attributes. This competition incurs attribute interchangeability phenomena, where same attribute(s) may be used to select either HDP or BJI schemes. Existing studies dealing with integrated physical database design problem not consider this competition. We propose to study its contribution on simplifying the complexity of our problem. Instead of tackling it in an integrated way, we propose to start by assigning to each technique its own attributes and then it launches its own selection algorithm. This assignment is done using the KMeans method. Our design is compared with the state of the art work using APB1 benchmark. The results show that an interchangeability attributeaware database designer can improve significantly query performance within the less space budget.Advances in Databases and Information Systems  15th International Conference, ADBIS 2011, Vienna, Austria, September 2023, 2011. Proceedings; 01/2011  SourceAvailable from: Habiba Drias
Conference Paper: ACO Based Approach and Integrating Information Retrieval Technologies in Selecting Bitmap Join Indexes.
[Show abstract] [Hide abstract]
ABSTRACT: Unlike existing studies dealing with the selection of Bitmap Join Indexes for star join queries optimization, this paper presents three original features. The first one consists in addressing the problem with ant based approach that is more robust than the simple heuristic algorithms, which are usually used in the related works. The second interesting novelty resides in the metric used to prune the search space. The fitness function designed in the ant approach is brought from information retrieval technologies and is more refined than the frequency measure usually used. Finally, the third efficient aspect is in the data structure used to manage dynamically the storage in order to select the best promising indexes.2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010, Toronto, Canada, August 31  September 3, 2010, Main Conference Proceedings; 01/2010  [Show abstract] [Hide abstract]
ABSTRACT: Afin de réduire le temps d’exécution des requêtes décisionnelles, l’administrateur a la possibilité de sélectionner des index de jointure binaires (). Cette sélection demeure une tâche difficile vue la complexité de l’espace de recherche à parcourir. De ce fait, un grand intérêt est porté à la mise en oeuvre d’algorithmes de sélection. Cependant, ces algorithmes sont statiques. Dans cet article, nous centrons nos travaux sur la sélection des index de jointures binaires définis sur plusieurs attributs appartenant à des tables de dimension en utilisant des algorithmes génétiques. Nous présentons deux types d’algorithmes: des algorithmes de sélection statiques et des algorithmes de sélection incrémentales qui prévoient l’adaptation des index sélectionnés à l’arrivée de nouvelles requêtes. Nous concluons nos travaux par une étude expérimentale démontrant l’apport de notre sélection des index de jointure binaires en comparaison avec les travaux de sélection statiques et incrémentales existants.Bitmap join indexes (BJI) have been widely advocated by administrators as a solution to optimize complex queries. Their selection remains hard, since it needs to explore a large search space. Only a few classes of algorithms were proposed to deal with the problem of BJI selection. These algorithms are static and do not take into account the changes of data warehouses in terms of query arrival. In this paper, we propose a genetic algorithm to select BJI defined on multiple attributes belonging to various dimension tables in the static way. This algorithm is extended to deal with the incremental aspect. An intensive experiment was conducted to show the efficiency of our proposal and to compare it with the most important existing studies.Journal of Decision Systems. 01/2012; 21(1):5170.
Page 1
A Data Mining Approach for Selecting Bitmap Join
Indices
Ladjel Bellatreche
LISI/ENSMA  Poitiers University  France
bellatreche@ensma.fr
Rokia Missaoui
University of Quebec in Outaouais  Canada
Rokia.Missaoui@uqo.ca
Hamid Necir, Habiba Drias
Institut National d’Informatique  Algeria
{h necir, h drias}@ini.dz
Index selection is one of the most important decisions to take in the physical design of relational
data warehouses. Indices reduce significantly the cost of processing complex OLAP queries, but
require storage cost and induce maintenance overhead. Two main types of indices are available:
monoattribute indices (e.g., Btree, bitmap, hash, etc.) and multiattribute indices (join indices,
bitmap join indices). To optimize star join queries characterized by joins between a large fact
table and multiple dimension tables and selections on dimension tables, bitmap join indices are
well adapted. They require less storage cost due to their binary representation. However, selecting
these indices is a difficult task due to the exponential number of candidate attributes to be indexed.
Most of approaches for index selection follow two main steps: (1) pruning the search space (i.e.,
reducing the number of candidate attributes) and (2) selecting indices using the pruned search
space. In this paper, we first propose a data mining driven approach to prune the search space
of bitmap join index selection problem. As opposed to an existing our technique that only uses
frequency of attributes in queries as a pruning metric, our technique uses not only frequencies,
but also other parameters such as the size of dimension tables involved in the indexing process,
size of each dimension tuple, and page size on disk. We then define a greedy algorithm to select
bitmap join indices that minimize processing cost and verify storage constraint. Finally, in order
to evaluate the efficiency of our approach, we compare it with some existing techniques.
Categories and Subject Descriptors: Abstract Interpretation [Programming Language]:
General Terms:
Additional Key Words and Phrases: Bitmap Join Indices, Frequent Itemset, Data Mining, Star
Join Queries, Relational Data Warehouses
Copyright c ?2007 by The Korean Institute of Information Scientists and Engineers(KIISE). Per
mission to make digital or hard copies of part or all of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the first page. Copyrights for
components of this work owned by others than KIISE must be honored. Abstracting with credit is
permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires
prior specific permission and/or a fee. Request permission to republish from: Publicity Office,
KIISE. FAX +8225211352 or email office@kiise.org.
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007, Pages 177194.
Page 2
178L. Bellatreche et al.
1.
A data warehouse stores large volumes of aggregated and non volatile data across
entire organization. It is usually accessed through complex queries for key business
operations. Data warehouses are frequently modelled using relational schemas like
star and snowflake schemas. A star schema consists of a single large fact table
that is related to multiple dimension tables via foreign keys. Dimension tables
are relatively small compared to the fact table and rarely updated.
typically non normalized so that the number of needed join operations are reduced.
Operations in data warehouse applications are mostly read ones and are dominated
by large and complex queries. The typical queries on the star schema are called star
join queries. They are characterized by: (i) a multitable join among a large fact
table and dimension tables, (ii) each one of the dimension tables involved in the join
operation has multiple selection predicates1on its descriptive attributes, and (iii)
there is no join operation between dimension tables. Due to the interactive nature
of decision support applications, having a fast query response time is a critical
performance goal.
Without efficient optimization techniques, queries addressed to data warehouses
may take hours or days, which is unacceptable in most cases [Chatziantoniou and
Ross 2007]. In order to cope with complex and timeconsuming decision support
queries, there is an urgent need for efficient and sophisticated physical design tech
niques [Chaudhuri and Narasayya 2007]. The major bottleneck in evaluating such
queries is the join between a large fact table and the surrounding dimension tables
[St¨ ohr et al. 2000]. To optimize join operations in both OLTP and OLAP envi
ronments, many techniques have been proposed that we can classify into two main
categories: (1) non redundant structures and (2) redundant structures. The first
category concerns different implementations of the join operation: nested loop, sort
merge join and hash join. These structures are efficient when (a) the size of joined
tables is reasonable, which is not the case of tables in relational data warehouses
and (b) join concerns two tables [Golfarelli et al. 2002]. Redundant structures, like
materialized views [Gupta 1999; Rizzi and Saltarelli 2003] and join indices [Val
duriez 1987], are more efficient to speed up join operation involving many tables
[Oneil 1995]. Their main drawbacks are the extra storage requirement and the
maintenance overhead. However, such optimization techniques are inevitable in
data warehouse environments. The main peculiarity of indices is the fact that they
may be combined with other optimization techniques such as materialized views,
horizontal and vertical partitioning. In other words, any access path with a table
structure can be indexed.
This paper focuses on the selection of join bitmap indices (BJIs) and is organized
as follows. Section 2 provides a background on indices and data mining. Section
3 presents our formalization of the problem of selecting BJIs as an optimization
problem, and provides a description of the unique and most related work by showing
its limitations using a running example. Section 4 presents our twostep approach
for first generating BJI candidates using data mining based pruning algorithms, and
INTRODUCTION
They are
1A selection predicate has the following form: Di.Ajθ value, where Ajis an attribute of dimension
table Di and θ is one of the six comparison operators {=,<,>,≤,≥}, and value is a constant
belonging to the domain of attribute Aj).
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 3
A Data Mining Approach for Selecting Bitmap Join Indices179
then selecting the final BJIs using a greedy algorithm. Section 5 validates our work
through experimentations. Finally, Section 6 concludes the paper by summarizing
the main results and suggesting new directions.
2.
In the following, we review three key topics related to our proposal: (i) indexing
techniques, (ii) index selection problem, and (iii) frequent closed itemset computa
tion.
BACKGROUND
2.1
A number of indexing strategies have been suggested for data warehouses. They can
be classified into two categories: singletable indices and (b) multiple table indices.
In the first category, indices are defined either on one or several attributes of a
single table. An index may be either clustered or nonclustered. In the second one,
indices are defined on two or more tables. A number of indexing strategies belonging
to these two categories have been suggested for data warehouses: valuelist index,
projection index [O’Neil and Quass 1997], bitmap index [Chan and Ioannidis 1998;
O’Neil and Quass 1997], bitsliced index [Chan and Ioannidis 1998], data index
[Datta et al. 1999], join index [Valduriez 1987], star join Index [Systems 1997] and
bitmap join indices [O’Neil and Graefe 1995].
Indexing Techniques
—Valuelist index and Bitmap index. A valuelist index consists of two parts. The
first part is a balanced tree structure and the second part is a mapping scheme.
The mapping scheme is attached to the leaf nodes of the tree structure and points
to the tuples in the table being indexed. The tree is generally a Btree with
varying percentages of utilization. Oracle provides a B*tree whose utilization
may go to 100%. Two different types of mapping schemes are in use. First, one
consists of a RowID (row identifier) list, which is associated with each unique
searchkey value. This list is partitioned into a number of disk blocks chained
together. The second scheme uses bitmaps. A bitmap is a vector of bits whose
value depend on predicate values. A bitmap B lists all rows with a given predicate
P such that for each row r with ordinal number j that satisfies the predicate
P, the jthbit in B is set to 1. Bitmaps represent efficiently lowcardinality
data. However to make this indexing scheme practical for highcardinality data,
compression techniques must be used. Valuelist indices have been shown in
[O’Neil and Quass 1997] to outperform other access methods in queries involving
MIN or MAX aggregate functions, as well as queries that compute percentile
values of a given column. Bitmap indices can substantially improve performance
of queries with the following characteristics [CheeYong 1999]:
—The WHERE clause contains multiple predicates on lowormediumcardinality
columns (e.g., a predicate on Gender that has two possible values: female or
male or a predicate on city with three possible values as shown in Figure 1).
—Bitmap indices have been created on some or all of these lowormedium
cardinality columns.
Besides disk saving (due to the binary representation and possible compression
[Johnson 1999]), bitmap indices speed up queries having Boolean operations (such
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 4
180 L. Bellatreche et al.
Customer
Pascal
515
17
18
52
21
42
20
Age
0
0
1
1
0
0
Tokyo
0
0
0
0
1
0
Paris
1
1
0
0
0
1
LA
6
5
4
3
2
1
RID
LA
42
1116
LA
20
Eric
616
2125
Tokyo
Age
Didier
CID
Figure 2.
313
RIDC
4
Tokyo
Patrick4143
Paris
Yves 5152
LA
Gilles6161
City NameCID
Figure 1.
RIDC
17
18
52
21
42
20
Age
0
0
1
1
0
0
Tokyo
0
0
0
0
1
0
Paris
1
1
0
0
0
1
LA
6
5
4
3
2
1
RID
LA
42
Pascal
515
1116
LA
20
Eric
616
2125
Tokyo
Age
Didier
CID
313
RIDC
4
Tokyo
Patrick4143
Paris
Yves 5152
LA
Gilles6161
CityName CIDRIDC
Bitmap index on City
An Example of a bitmap index defined on City.
Customer
Pascal17
18
52
21
17
18
52
21
42
20
Age
LA 1116
LA
Eric 2125
Tokyo
Didier 3134
Tokyo
Patrick4143
ParisYves2
LAGilles1
CityName
17
18
52
21
17
18
52
21
42
20
Age
LA Pascal 1116
LA
Eric 2125
Tokyo
Didier3134
Tokyo
Patrick4143
ParisYves2
LAGilles1
CityName
Projected
Index on Age
An Example of a projection index.
as AND, OR and NOT) and COUNT operations. They are supported by most
of commercial DBMSs (Oracle, SQL Server, etc.).
—Projection Index. A projection index is equivalent to the column being indexed.
If C is the column being indexed, then the projection index on C consists of a
stored sequence of column values from C in the same order as the ordinal row
number in the table from where the values are extracted (see Figure 2). It has
been shown in [O’Neil and Quass 1997] that projection indices outperform other
indexing schemes for the execution of queries that involve computation on two
or more column values, and seem to perform acceptably well in GROUPBY
queries.
—Bitsliced Index A bit sliced index represents the key values of the column to
be indexed as binary numbers and projects a set of bitmap slices, which are
orthogonal to the data, held in the projection index. This index has been shown in
[CheeYong 1999] to particularly perform well for computing sums and averages.
Also, it outperforms other indexing approaches for percentile queries when the
underlying data is clustered, and for range queries whose range is large.
—Join Index. A join index precomputes a join operation between two relations
[Valduriez 1987]. It is the result of joining two tables on join attributes and
projecting the keys (or tuple IDs) of the two tables. A join index is used to find
the matching tuples from the tables to be joined. This index has been proposed
in the context of traditional databases in medium of 80’s, and was later extended
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 5
A Data Mining Approach for Selecting Bitmap Join Indices181
00124
10025
10026
10027
00123
00122
00121
001 20
001 19
01018
0
1
0
0
1
0
17
27
0
1
0
0
1
0
16
26
0
1
0
0
1
0
15
25
0
0
1
0
0
1
14
24
0
0
1
0
0
1
13
23
1
0
0
0
0
1
12
22
1
0
0
0
0
1
11
21
1
0
0
0
0
1
10
20
0
0
0
0
1
1
9
19
0
0
0
1
1
0
8
18
0
0
0
0
1
1
7
17
0
0
0
0
1
1
6
16
1
0
0
0
0
1
5
15
0
0
1
1
0
0
4
14
0
0
0
1
1
0
3
13
0
1
0
0
1
0
2
12
0
1
0
0
1
0
1
11
Tokyo
1
Paris
0
LA
0
RIDS
10
0019
0018
0017
0016
1005
0104
0013
0012
0011
TokyoParisLARIDS
Customer
Product
17
Gardering
2003
18
52
Multimedia
2003
21
Beauty
2003
42
Beauty
Year
20
Range
Age
LAPascal
Manure
Apr
111
102
44
6
5
4
by Red Brick Systems [Systems 1997] to a multitable join index in relational
data warehouses by concatenating columns from different dimension tables and
listing RowIDs in the fact table from each concatenated value.
—Data Index.A data index, like the projection index, exploits the positional
indexing strategy [Datta et al. 1999]. The Data index avoids duplication of data
by storing only the index and not the column being indexed. The data index
can be of two specific types: basic data index and join data index (for more
information, see [Datta et al. 1999]).
—Bitmap Join Index. It is a multitable join index and a variant of join index. It
is defined as a bitmap index on a table R based on a single column of a table
S, where S commonly joins with R in a specific way. Bitmap join indices are
supported by most of commercial database/warehouse systems (e.g., Oracle, SQL
Server, etc.).
LA
Eric
Barbie
Mar
212
103
33
5
4
3
TokyoDidier
WebCam
Feb
313
104
22
4
3
2
TokyoPatrick
Clarins
Jan
414
105
11
3
2
1
ParisYves
Sonoflore
Month
515
106
TID
2
1
LA
Gilles
Name
616
PID
1
1
CityName
Name
CID
PID
RIDC
RIDP
17
Gardering
2003
18
52
Multimedia
2003
21
Beauty
2003
42
Beauty
Year
20
Range
Age
LAPascal
Manure
Apr
111
102
44
6
5
4
LA
Eric
Barbie
Mar
212
103
33
5
4
3
TokyoDidier
WebCam
Feb
313
104
22
4
3
2
TokyoPatrick
Clarins
Jan
414
105
11
3
2
1
ParisYves
Sonoflore
Month
515
106
TID
2
LA
Gilles6161
CityNameCIDRIDC
RIDP
Fitness
2003
SlimForm
May
101
55
6
5
Toys
Fitness
2003
SlimForm
May
101
55
6
5
Toys
2003Jun666
2003
RIDT
2003Jun666
2003
RIDT
Time
181110621224
1966105313 25
17 2210531326
151110631327
185510521223
104410521222
1011 10521221
205510461620
20 22104616 19
20 2210451518
40
15
66
11
104
106
212
313
17
27
44
17
66
22
105
105
111
313
16
26
45
19
44
66
103
105
212
313
15
25
17
18
55
11
103
106
515
212
14
24
100
18
66
55
102
105
515
212
13
23
103
10
55
44
102
105
414
212
12
22
102
10
11
11
102
105
414
212
11
21
200
20
11
55
102
104
313
616
10
20
100
20
11
22
101
104
212
616
9
19
27
20
33
22
101
104
111
515
8
18
20
40
44
66
101
104
111
212
7
17
14
44
55
66
106
105
212
111
6
16
14
45
66
44
105
103
414
212
5
15
10
17
11
55
104
103
515
515
4
14
50
100
33
66
104
102
616
515
3
13
28
103
66
55
106
102
616
414
2
12
25
102
11
11
106
102
616
414
1
11
Amount
200
TID
11
PID
102
CID
313
RIDS
10
100111012129
27331011118
20441011117
1455106
2126
14661054145
10111045154
50331046163
28661066162
25111066161
AmountTIDPIDCIDRIDS
Sales
Bitmap Join Index on City
Figure 3.An Example of a bitmap join index.
Example 1. To illustrate the use and interest of bitmap join indices, let us
consider the following data (see Figure 3) that will serve as a running example
throughout the paper. Suppose we have a relational data warehouse represented
by three dimension tables (TIME, CUSTOMERS and PRODUCTS) and one fact
table (SALES). Let Q be the query that the administrator would like to optimize:
SELECT Count(*)
FROM CUSTOMERS C, SALES S
WHEERE C.City=’Poitiers’
AND C.CID=S.CID
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 6
182 L. Bellatreche et al.
To do so, he/she creates a bitmap join index SC IDX between the fact table
SALES and the dimension table CUSTOMERS based on City attribute as follows:
CREATE BITMAP INDEX SC_IDX
ON SALES(CUSTOMERS.City)
FROM SALES S, CUSTOMERS C
WHERE S.CID= C.CID
To execute the above query, the query optimizer just accesses the bitmap corre
sponding to the column representing Poitiers, without joining SALES and CUS
TOMERS tables.
This example shows the efficiency of bitmap join indices for executing this kind of
queries.
2.2
Index selection problem (ISP) has been studied since the early 70’s and its im
portance is well recognized. It is one of the most important issues in physical
design of advanced database applications. The task of index selection consists in
automatically selecting an appropriate set of indices for a data warehouse and a
workload under resource constraints (storage, maintenance, etc.). It is a challeng
ing problem for the following reasons [Chaudhuri 2004]: The size of a relational
data warehouse schema may be large (many tables with several columns) for real
applications, and indices can be defined on a set of columns from different tables
(multiple table indices). A large spectrum of research studies has been proposed
to deal with this problem. By exploring the state of the art, we believe that most
studies concentrate on single table index selection. To the best of our knowledge,
only two studies have been proposed for dealing with multiple table index selection
problem [Aouiche et al. 2005; Bellatreche et al. 2007]. On the industry side, several
index selection tools were developed (e.g., AutoAdmin [Chaudhuri and Narasayya
1998] for selftuning and selfadministering databases and data warehouses).
In single table index selection problem, most of existing approaches use two main
phases: (1) generation of attribute candidates and (2) selection of a final config
uration. The first phase prunes the search space of index selection problem by
eliminating some non relevant attributes. In the second phase, the final indices are
selected using heuristics like greedy algorithms [Chaudhuri and Narasayya 1997],
linear programming algorithms [Chaudhuri 2004], etc. The quality of the final set
of indices depends essentially on the pruning phase. To prune the search space
of index candidates, many approaches were proposed [Chaudhuri and Narasayya
1997; Valentin et al. 2000; Chaudhuri 2004; Labio et al. 1997; Aouiche et al. 2005]
using enumerationdriven heuristics. For instance, [Chaudhuri and Narasayya 1997]
propose a greedy algorithm which uses the optimizer cost estimate in SQL Server
to decide the goodness of a given configuration of indices. The weakness of this
work is that it imposes the number of generated candidates. DB2 Advisor is another
example belonging to this category [Valentin et al. 2000], where the query parser is
used to pick up selection attributes used in workload queries. The generated can
didates are obtained by a few simple combinations of selection attributes [Valentin
et al. 2000].
Index Selection
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 7
A Data Mining Approach for Selecting Bitmap Join Indices183
2.3
Let I = {i1,i2,···,im} be a set of m distinct items (table attributes). A transaction
(tuple) T contains a set of items in I, and has an associated unique identifier called
TID. A subset X of I where k = X is referred to as a k−itemset (or simply an
itemset), and k is called the length of X. A transaction database (TDB), say D, is
a set of transactions. The number of transactions in D that contain an itemset X
is called the absolute support of X whereas the fraction of transactions is called its
relative support (both denoted by sup(X)). Thus, an itemset is frequent (or large)
when sup(X) reaches at least a userspecified minimum threshold called minsup.
One of the most frequently used technique in data mining is rule mining which
is conducted in Apriorilike algorithms [Agrawal and Srikant 1994] in two steps:
detection of all frequent itemsets, and utilization of frequent itemsets to generate
association rules (e.g., X ⇒ Y ) that have a confidence ≥ minconf. While the second
step is relatively easy and costeffective, the first one presents a great challenge
because the set of frequent itemsets (FIs) may grow exponentially with the whole
set of items. To reduce the size of the FI set, some studies were conducted on
frequent closed itemsets FCIs and maximal frequent itemsets (i.e., itemsets for
which every superset is infrequent [Burdick et al. 2001]).
The Close algorithm [Pasquier et al. 1999] is one of the first procedures for FCI
generation. Like Apriori, it performs a levelwise computation within the powerset
lattice. However, it exploits the notion of generators of FCIs2to compute closed
itemsets. ChARM [Zaki and Hsiao 2002] is another procedure which generates
FCIs in a tree organized according to inclusion. The computation of the closure
and the support is based on a efficient storage and manipulation of TIDsets (i.e.,
the set of transactions per item). Closure computation is accelerated using diffsets,
the set difference on the TIDsets of a given node and its unique parent node in the
tree.
Closet [Han et al. 2000] and its variant Closet+ [Wang et al. 2003] both
generate FCIs as maximal branches of a FPtree, a structure that is basically a
prefix tree (or trie) augmented with transversal lists of pointers.
In the next section, we follow the above mentioned steps to select bitmap join
indices: generation of attribute candidates and selection of a final configuration.
The first step will be based on the computation of frequent closed itemsets using
either Close or ChARM.
Frequent Closed Itemset Computation
3.
To ease the understanding of our proposed approach, we provide a motivation and
a formulation of the BJI selection problem. Then, we present and discuss the most
related work.
As indicated earlier, a BJI is defined on one or several non key dimension attributes
with low cardinality3(called indexable columns) for joining dimension tables with
the fact table. An indexable attribute Ajof a given dimension table Difor a BJI
is a column Di.Ajsuch that there is a condition of the form Di.Aj θ Expression
BITMAP JOIN INDEX SELECTION PROBLEM
2A generator of an FCI Y is a subset of that itemset such that its closure is equal to Y .
3The domain of this attribute should be an enumerated domain like gender, color, etc.
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 8
184L. Bellatreche et al.
in the WHERE clause. The operator θ must be among {=,<,>,≤,≥}.
Let A = {A1,A2,˙ ,AK} be the set of indexed candidate attributes for BJIs. Then,
the possible number of BJIs with only one group of attributes, grows exponentially
with K, and is given by:
?K
For K = 4, this number is 15. The possible number of BJIs with any combination
of attribute groups is given by:
?
For K = 4, the number of possible cases is (215).
efficiently finding the set of BJIs that minimizes the total query processing cost
while satisfying a storage constraint cannot be handled by first enumerating all
possible BJIs and then computing the query cost for each candidate BJI. As a
consequence, BJI selection problem can be formulated as follows:
Given a data warehouse with a set of dimension tables D = {D1,D2,...,Dd} and
a fact table F, a workload Q of queries Q = {Q1,Q2,...,Qm}, where each query
Qi (1 ≤ i ≤ m) has an access frequency, and a storage constraint S, the aim of
BJI selection problem is to find a set of BJIs among a precomputed subset of all
possible candidates, which minimizes the query processing cost and satisfies the
storage requirements S.
The single work dealing with this problem using a data mining approach is the
one proposed by Aouiche et al. [Aouiche et al. 2005]. Intuitively, selecting bitmap
join indices means partitioning the set of indexable attribute A into disjoint groups.
This motivates the authors in [Aouiche et al. 2005] to propose an approach based
on data mining (and Close algorithm [Pasquier et al. 1999]) to prune the search
space by computing the set of frequent closed attribute groups rather than all the
possible combinations indicated above. The groups generated in the pruning step
are then used in the second step to select the final configuration of indices.
Aouiche et al.’s work has two main limitations: (1) no formalization is proposed,
(2) the proposed approach only uses frequencies of attributes to generate frequent
closed itemsets, where each one represents an attribute set to be indexed. The
frequency parameter is not sufficient to be a pruning metric. Since in the vertical
partitioning of databases4the frequency parameter is not sufficient in getting a
good fragmentation schema [Fung et al. 2003], we believe that a similar observa
tion holds for the bitmap join index selection problem. Indeed, Fun et al. [Fung
et al. 2003] showed the weakness of affinitybased algorithms (that use only query
frequencies) in reducing the query processing cost. To get a better result, they
recommend the use of other parameters, like the size of tables, the length of each
tuple, the size of disk page, etc.
To overcome these limitations, we propose new pruning algorithms, called Dy
naClose and DynaCharm that take into account the above parameters as part of
the pruning metric, since the cost of join operations depends heavily on the size of
joined tables [Getoor et al. 2001]. Once the pruning phase is processed, a greedy
1
?+?K
2
?+ ... +?K
K
?= 2K− 1(1)
2K−1
1
?
+
?
2K−1
2
?
+ ... +
?
2K−1
2K−1
?
= 22K−1
(2)
Therefore, the problem of
4Each fragment contains a subset of attributes that are frequently used together.
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 9
A Data Mining Approach for Selecting Bitmap Join Indices185
Table I.Query description.
(1) select S.channel id, sum(S.quantity sold) from S, C
where S.channel id=C.channel id and C.channel desc=’Internet’ group by S.channel id
(2) select S.channel id, sum(S.quantity sold), sum(S.amount sold) from S, C
where S.channel id=C.channel id and C.channel desc =’Catalog’ group by S.channel id
(3) select S.channel id, sum(S.quantity sold),sum(S.amount sold) from S, C
where S.channel id=C.channel id and C.channel desc =’Partners’ group by S.channel id
(4) select S.cust id, avg(quantity sold) from S, C
where S.cust id=C.cust id and C.cust gender=’M’ group by S.cust id
(5) select S.cust id, avg(quantity sold) from S, C
where S.cust id=C.cust id and C.cust gender=’F’ group by S.cust id
algorithm is executed to select a set of BJSs that reduces the query processing cost
and satisfies the storage constraint.
3.1
To show the limitations of Aouiche’s approach in pruning search space of bitmap
join indices, we consider the following example with a part of a star schema (used
in our experimental study) containing two dimension tables CHANNELS (denoted
by Ch) and CUSTOMERS (denoted by C) and a fact table SALES (denoted
by S). The cardinalities of these tables (number of instances) are: SALES =
16260336,CHANNELS = 5 and CUSTOMERS = 50000. Let us assume
that five queries are most frequently executed on the corresponding data cube (see
Table I). In the above queries, two main join operations are used: one between
SALES and CUSTOMERS (J1 : SALES 1 CUSTOMERS), and another one
between SALES and CHANNELS (J2: SALES 1 CHANNELS). Basically, the
cost of J1is higher than J2since the size of CUSTOMERS (50 000 instances) is
larger than the size of CHANNELS (5 instances).
With minsup = 3 (in absolute value), the solution returned by [Aouiche et al.
2005] is a bitmap join index (that we call sales desc bjix) defined on CHANNELS
and SALES using channel desc attribute. This is due to the fact that there are
three occurrences of the same selection predicate defined on that attribute in the
five queries. However, no bitmap join index is proposed between SALES and CUS
TOMERS since the cust gender attribute is not so frequent as minsup.
consequence, only the join J2will be optimized, but not the global query set.
To overcome this limitation, we enrich the pruning function by considering other
parameters like the size of tables, the length of an instance of tables, and page size.
Illustration of Weakness of the Existing Pruning Approach
As a
4.
Given a star schema with a set of dimension tables {D1,D2,...,Dl} and a fact table
F. Let Ti and LCibe the cardinality of a (dimension or fact) table Tiand the
length of an instance of Di, respectively. The number of disk pages occupied by a
table T, denoted by T is calculated as follows:
?T × LT
where PS represents the page size (in bytes) on disk.
THE PROPOSED APPROACH
T =
PS
?
(3)
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 10
186L. Bellatreche et al.
In order to select a set of bitmap join indices minimizing the overall query pro
cessing cost and satisfying the storage constraint S, we consider an approach with
two steps commonly found in the classical index selection approaches: (1) genera
tion of candidate attributes, and (2) selection of the final configuration and BJIs.
These steps will be described in the following subsections.
4.1
The input of this step is the context matrix. It is constructed using the set of
queries Q = {Q1,Q2,...,Qm} and a set of indexable attributes A = {A1,A2,...,Al}.
The matrix has rows and columns that represent queries and indexable attributes,
respectively. A value of this matrix is given by:
?
Example 2. Recall that the size of CUSTOMERS, CHANNELS and SALES is:
50 000, 5 and 16 260 33 instances, respectively. Instance length of CUSTOMERS,
CHANNELS and SALES is: 24, 24 and 36, respectively, and page size PS =
65536. We consider the same five queries as defined in Table 1. To facilitate the
construction of the context matrix, we rename the indexable attributes as follows:
Sales.cust id = A1, Customers.cust id = A2; Customers.cust gender = A3,
Channels.channel id = A4, Sales.channel id = A5; Channels.channel desc =
A6. The matrix is given below.
Generation of Candidate Attributes
Uses(Qi,Aj) =
1 if query Qiuses a selection predicate defined on Aj
0 otherwise
A1 A2 A3 A4 A5 A6
000
000
000
111
111
2
555
Q1
Q2
Q3
Q4
Q5
1
1
1
0
0
3
5
1
1
1
0
0
3
5
1
1
1
0
0
3
5
Support
22
To prune the search space of the BJI selection problem, we propose two algorithms:
DynaClose and DynaCharm which are an adaptation of Close [Pasquier et al.
1999] and ChARM [Zaki and Hsiao 2002], respectively. We used Close because it
has been exploited by Aouiche et al. [Bellatreche and Boukhalfa 2005] for the same
purpose (selecting BJIs), and hence will allow us to better compare the performance
of our approach against Aouiche’s approach (see Section 5 for more details). We
also used ChARM because it is a wellknown and efficient algorithm for frequent
closed itemset generation. Our adaptation concerns especially the pruning metric
which is different from the one used in Close and ChARM, where only frequencies
of attributes are used. In order to prune the search space, we propose a new metric
Fitness(X), called fitness metric on a FCI X, that penalizes the FCIs defined on
small dimension tables.
Fitness(X) =1
n× (
n
?
i=1
supi× αi)(4)
where n represents the number of nonkey attributes Aiin X, αiis equal to
where Dj and F represent the number of pages needed by the dimension table
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Dj
F,
Page 11
A Data Mining Approach for Selecting Bitmap Join Indices187
Dj that includes Aiand the fact table F, respectively. The elements supiis the
support of Ai.
Given a minsup, minfit (minimal fitness value) can be computed as follows:
minfit =minsup
F
×
(
d
?
j=1
Dj
d
)
(5)
Example 3. From our trivial example, two closed itemsets may be generated:
CFI1= {A1,A2,A3} with a support equal to 0.4 and CFI2= {A3,A5,A6} with 0.6
support. With minsup = 0.6, Aouiche’s approach will retain CFI2= {A3,A5,A6}
and hence suggest a BJI on A6 to match the facts in Sales with records in Channels
according to the values of channel desc. To illustrate our pruning technique, we
first compute the fitness function for each one of the generated CFIs.
Fitness(CFI1) = sup(A3) × α3
=
2
5×
?50000×24
65536
65536
?
?
894
1626033×36
?
=
2
5×19
= 0.0085
One may notice that CFI1 contains only one nonkey attribute which is A3, and
hence Fitness(CFI1) will take into account A3only. Fitness(CFI2) is computed
based on A6 only and is equal to 0.00067.
0.6
894×19+1
4.2Elimination of Dirty FCIs
The set of the FCIs generated by DynaClose and DynaCharm must be purified to
avoid the erroneous bitmap join indices. Let us recall that a bitmap join index
is built between a fact table and dimension table(s) based on nonkey attributes.
Therefore, in its definition, we should find keyattributes and nonkey attributes.
There are two main requirements of bitmap join indices: (1) if a bitmap join index
is defined on k (k > 1) attributes of the same dimension table, it shall have only
one join operation between this table and the fact table. (2) if a bitmap join index
is defined on k (k > 1) attributes of distinct dimension tables, it shall have k join
operations between this table and the fact table. Based on these requirements, we
distinguish three scenarios that lead to a purification (i.e., the elimination of the
concerned FCI):
The value of minfit is 0.0067 (=
2). Our pruning metric will then select CFI1rather than CFI2.
(1) The FCI contains only key attributes (of dimension tables) or only foreign keys
of the fact table.
(2) The FCI has a number of key attributes significantly higher than the number
of nonkey attributes. For instance, the case of a FCI (customers.cust gender,
sales.cust id, customers.cust id, sales.prod id, products.prod id) having three
keys and one nonkey attribute. For one nonkey attribute, we only need two
key attributes for building BJIs. Generally, for N selection attributes, 2 × N
key attributes are required.
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 12
188L. Bellatreche et al.
(3) The FCI is composed of only nonkey attributes.
After this elimination, we get a set of candidate dimension attributes for selecting
bitmap join indices, denoted by CA (CA ⊆ A), where A is the initial set of indexable
attributes. CA is the union of all attributes belonging to the purified FCIs. To select
the final configuration of bitmap join indices, we consider only CA as described in
the following section.
4.3
Now, we have the set of all candidate attributes needed to build the final config
uration of BJIs using a greedy algorithm. The input of this algorithm covers: (a)
a star schema with a fact table and a set of dimension tables, (b) a set of queries:
Q = {Q1,Q2,···,Qp}, (c) CA and (d) a storage constraint S. Our algorithm starts
with a configuration having a bitmap index defined on an attribute of CA. Since
the cardinality of CA may be large, we choose the first attribute as the one with
the lowest cardinality, let say, Imin. This choice assumes a sort of elements in CA
according to their cardinality. Then, the procedure iteratively improves the initial
configuration by considering other attributes of CA while the constraint on S is
maintained and a further reduction in the total query processing cost is possible.
To measure efficiency of a configuration, we use a mathematical cost model, an
adaptation of an existing one [Aouiche et al. 2005] that computes the number of
disk page accesses (IO costs) when executing the set of queries. In order to estimate
storage cost required for a bitmap join index, we adapt our cost model developed
in [Bellatreche et al. 2000]. The main steps of our approach are given in Algorithm
1.
Selection of the Final Configuration
Algorithm 1 Greedy Algorithm for BJIs Selection
Input :
CA, Q, S(storage bound)
Output:
Config: set of selected BJIs.
Notations:
BJIj: a bitmap join index defined on attribute Aj.
Size(BJIj): storage cost required for BJIj
begin
SCA = SORT(CA); /* a sequence of attributes */
Config = BJImin;
S := S − Size(BJImin);
SCA := SCA − Amin;
WHILE (Size(Config) ≤ S) DO
FOR each next element Ajin SCA DO
IF (COST[Q,(Config ∪ BJIj))] < COST[Q,Config])
AND ((Size(Config ∪ BJIj) ≤ S)) THEN
Config := Config ∪ BJIj;
Size(Config) := Size(Config) + Size(BJIj);
SCA := SCA − Aj; /* remove Ajfrom SCA */
end
Aminis the attribute used to defined BJImin
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 13
A Data Mining Approach for Selecting Bitmap Join Indices189
5.
To evaluate our approach, we first implemented all algorithms (Close, DynaClose,
Charm, DynaCharm and the greedy algorithm) using Java language and a Pentium
IV with 512 MB of memory. We conducted several experimentations using the same
dataset and the forty OLAP queries as in [Aouiche et al. 2005]. The star schema
of the data warehouse has one fact table SALES and five dimension tables: TIME,
CUSTOMERS, PRODUCTS, PROMOTIONS and CHANNELS (see Tables II).
EXPERIMENTAL STUDY
Table II.Table cardinalities used in the experiments.
TablesNumber of rows
SALES16 260 336
50 000
10 000
1 461
501
5
CUSTOMERS
PRODUCTS
TIME
PROMOTIONS
CHANNELS
The experiments were conducted according to four scenarios: (1) identification
of the value of minsup that gives an important number of FCIs, (2) evaluation
of different approaches (Close, DynaClose, Charm and DynaCharm) by executing
the forty queries on non indexed tables without considering storage constraint, (3)
evaluation of different approaches by considering the storage constraint, and (4)
computation of CPU bound of different approaches.
First, we carried out experiments to set the appropriate value of minsup that
allows the generation of a large set of FCIs. The results show that the appropriate
minsup value should be set to 0.05.
5.1
Figure 4 shows how different indexing approaches reduce the cost of executing the
forty queries with an increasing number of minimum support. The main result is
that DynaClose outperforms approaches for almost all values of minsup. However,
its performance deteriorates (in the sense that no candidate indices can be gen
erated) when the minsup value becomes high. We notice that for minsup values
exceeding 0.475, Close, Charm and DynaCharm stop generating new FCIs, and
hence the query processing cost remains stable.
A comparison between DynaCharm and DynaClose shows that they have a simi
lar performance for small minsup values (ranging between 0.05 and 0.0175). These
results coincide with the experimental study of Zaki et al. [Zaki and Hsiao 2002].
However, as we increase minsup, the performance gap between DynaClose and
DynaCharm becomes larger. This is due to the fact that DynaCharm processes
branches in a depthfirst fashion, and FCIs are formed only at the end of an itera
tion.
Evaluation without Storage Constraint
5.2
The set of BJIs generated by DynaClose and Dynacharm for a minsup=0.05 requires
storage of 146,88 MB (see Figure 6). This value is very high if we compare it with
Evaluation of DynaClose et Dynacharm with Storage Constraint
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 14
190L. Bellatreche et al.
0
2000000
4000000
6000000
8000000
10000000
12000000
0,05 0,075 0,175 0,225 0,25 0,275 0,375 0,45 0,475 0,5 0,525 0,55 0,575 0,6 0,625 0,65 0,675 0,7 0,725
MINSUP
C
O
S
T
DynaClose
Close
Charm
DynaCharm
Sans indexation
Figure 4.Quality of our approaches.
0
10
20
30
40
50
60
70
80
8 1620 324484 100 108 124 139 147 160
Storage in Mo
%
DynaClose
Close
Charm
Dynacharm
Figure 5.Behavior of our approaches according to storage constraint.
the size of the fact table which is 372,17 MB. Consequently, we execute our greedy
algorithm for selecting BJIs by considering various storage values with a fixed value
of minsup equal to 0.05. This value allows the generation of a large number of
index candidates. Figure 5 shows that Dynaclose and Dynacharm improve the
performance with a gain of 43% (compared to the solution without indexing) for
44 MB of storage (almost 3 times smaller than the initial space (146,88 MB)).
With the same storage, Charm and Close give a 33,56% gain.
proposed variants Dynaclose and Dynacharm provide a better performance than
the traditional approach for all the values of the considered storage, except for the
storage space 84 MB (where all approaches provide the same gain of 58,19%).
Therefore, our
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
Page 15
A Data Mining Approach for Selecting Bitmap Join Indices191
0
20
40
60
80
100
120
140
160
0,05 0,08 0,18 0,23 0,25 0,28 0,38 0,45 0,48 0,5 0,53 0,55 0,58 0,6 0,63 0,65 0,68 0,7 0,73
naCharm and DynaClose need more execution time to prune the search space com
pared to Charm and Close. This result was foreseeable since contrary to traditional
approaches (Closed and Charm) which prune according to the minsup, our two ap
proaches take more time since the pruning phase involves the computation of the
fitness function for each generated FCI. The results show that DynaCharm is the
approach which requires the highest time for almost all minsup values that we con
sider, but it remains stable for high minsup. This is due to the pruning phase of
DynaCharm since it is carried out only when we get maximized closed itemsets.
MINSUP
DynaClose
Close
Charm
DynaCharm
S
T
O
R
A
G
E
Figure 6.Storage vs. Minsup.
5.3
We have conducted experiments about the execution time of each algorithm (in
microseconds) according to a varying value of minsup. Table 3 shows that Dy
Evaluation of Different Approaches based on CPU Bound
Table III.Execution Time (in milliseconds) of Different Approaches.
Minsup
0,04
0,075
0,175
0,225
0,25
0,275
0,375
0,45
0,475
0,5
0,7
DynaClose
2794
2473
2403
2158
2210
2220
2113
2195
1985
1910
1945
Close
3079
2518
2491
1574
1680
2063
1737
1730
1319
1294
1294
Charm
3424
3121
2708
1540
1504
1043
1046
683
288
303
303
DynaCharm
2984
2961
3109
3865
3457
3830
4181
4168
4000
3758
3758
Journal of Computing Science and Engineering, Vol. 1, No. 2, December 2007
View other sources
Hide other sources
 Available from Habiba Drias · May 27, 2014
 Available from free.fr