Mining Market Basket Data Using
Share Measures and Characterized Itemsets
Robert J. Hilderman, Colin L. Carter, Howard J. Hamilton, and Nick Cercone
Department of Computer Science
University of Regina
Regina, Saskatchewan, Canada, S4S 0A2
{hilder,carter,hamilton,nick}@cs.uregina.ca
Abstract. We propose the share-confidence framework for knowledge discovery from databases which addresses the problem of mining itemsets from market basket data. Our goal is two-fold: (1) to present new itemset measures which are practical and useful alternatives to the commonly used support measure; (2) to not only discover the buying patterns of customers, but also to discover customer profiles by partitioning customers into distinct classes. We present a new algorithm for classifying itemsets based upon characteristic attributes extracted from census or lifestyle data. Our algorithm combines the Apriori algorithm for discovering association rules between items in large databases, and the AOG algorithm for attribute-oriented generalization in large databases. We suggest how characterized itemsets can be generalized according to concept hierarchies associated with the characteristic attributes. Finally, we present experimental results that demonstrate the utility of the share-confidence framework.
1 Introduction
Consider a retail sales operation with a large inventory consisting of many distinct products. The operation is situated in a location where the customer base is socio-economically diverse, with annual household incomes ranging from very low to very high, and demographically ranging from young families to the elderly. The sales manager has used data mining to determine those products that are typically purchased together and those that are most likely to be purchased given that particular products have already been selected (called itemsets [2, 14]). Analysis of the itemsets has enabled him to strategically arrange store displays and plan advertising campaigns to increase sales. He now wonders whether there are any more subtle socio-economic buying patterns that could be helpful in guiding the distribution of flyers during the next advertising campaign. For example, he would like to know which itemsets are more likely to be purchased by those with specific incomes or by those with children. He would also like to know which itemsets are more likely to be purchased by those living in particular neighborhoods. He believes that characterizing itemsets with classificatory information available from credit card or cheque transactions will allow him to answer queries of this kind.
In this paper, we propose the share-confidence framework, which looks beyond the simple frequency with which two or more items are bought together. We introduce a new algorithm, called CI, which integrates the Apriori algorithm for discovering association rules between items in large databases [2, 1], and the AOG algorithm for attribute-oriented generalization in large databases [9, 11]. We also show how market basket data can be mined using share measures and characterized itemsets which have been generalized according to concept hierarchies associated with characteristic attributes. It should be noted, however, that our methods are not limited to the discovery of customer profiles from market basket data; the approach is applicable to any problem where taxonomic hierarchies can be associated with characterized data.
The remainder of this paper is organized as follows. In Section 2, we present a formal description of the market basket analysis problem and introduce the share-confidence framework. In Section 3, we describe characterized itemsets and an algorithm for generating characterized itemsets from market basket data. In Section 4, we present experimental results obtained using the share-confidence framework on a database supplied by a commercial partner. We conclude in Section 5 with a summary of our work.
2 The Share-Confidence Framework
The problem of discovering association rules from market basket data has been formally defined as follows [2]. Let I = {i_1, i_2, ..., i_m} be a set of literals, called items. Let D be a set of transactions, where each transaction T is an itemset such that T ⊆ I. Transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The association rule X ⇒ Y holds in transaction set D with confidence c, if c% of transactions in D that contain X also contain Y. The association rule X ⇒ Y has support s in transaction set D, if s% of transactions in D contain X ∪ Y. This formalism is the support-confidence framework [4].
The most studied and analyzed algorithm for generating itemsets in the support-confidence framework is Apriori, described in detail in [1, 2, 3]. This algorithm extracts the set of frequent itemsets from the set of candidate itemsets generated. A frequent itemset is an itemset whose support is greater than some user-specified minimum, and a candidate itemset is an itemset whose support has yet to be determined. Apriori combines the frequent itemsets from pass k-1 to create the candidate itemsets in pass k. It has the important property that if any subset of a candidate itemset is not a frequent itemset, then the candidate itemset is also not a frequent itemset.
In the support-confidence framework, the purchase of an item is indicated by a binary flag (i.e., the item is either purchased or not purchased). From this binary flag, we can determine the number of transactions containing an itemset, but not the number of items in the itemset. If we know the number of items, we may find that an itemset is actually more frequent than support indicates, allowing for more accurate financial analysis, comparisons, and projections. Since support does not consider quantity and value, its use is limited as a practical indicator for determining the financial implications of an itemset.

We will now extend the formalization of the market basket problem. The problem definition is identical to that for the support-confidence framework, except that we introduce the notion of share for itemsets, and redefine the notions of frequent itemsets and confidence. We refer to this extended formalism as the share-confidence framework, introduced in [8] as share measures.
In the sections that follow, we define the functions upon which the share-confidence framework is based. For the examples, refer to the transaction database shown in Table 1 and the item database shown in Table 2. In Table 1, the TID column describes the transaction identifier and columns A to F describe the items (products) being sold. Note that binary values are not used to indicate the purchase of an item; instead, the actual number of items purchased in the corresponding transaction (i.e., the counts) is used. In Table 2, the Item column describes the valid items and the Retail Price column describes the retailer's selling price to the customer.
Table 1. An example transaction database with counts

TID     A    B    C    D    E    F
T1      1    0    2    2    0    0
T2      0    3    0    0    1    0
T3      4    1    2    0    0    0
T4      0    0    3    0    1    2
T5      0    0    0    4    0    1
T6      0    3    2    0    1    0
T7      3    0    0    1    2    4
T8      2    0    0    4    0    2
T9      0    1    0    0    2    1
T10     0    4    1    0    1    0
T11     0    0    3    0    0    2
Total  10   12   13   11    8   12
Table 2. The item database

Item   Retail Price
A       1.50
B       2.25
C       5.00
D       4.75
E      10.00
F       7.50
2.1 Preliminary Definitions

The definitions in this section were implemented in a data mining system for analyzing market basket data. This system is an extension of DB-Discover, a software tool for knowledge discovery from databases [7, 6]. Definitions 1 to 6 are used to query summary views containing discovered frequent itemsets.
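To make the definitions that follow easy to check against Tables 1 and 2, here is a small Python sketch (the dictionary names are ours, not the paper's) that encodes the two tables as plain dictionaries; the other sketches in this section assume these structures.

    # Example transaction database (Table 1): the count of each item in each transaction.
    transactions = {
        "T1":  {"A": 1, "B": 0, "C": 2, "D": 2, "E": 0, "F": 0},
        "T2":  {"A": 0, "B": 3, "C": 0, "D": 0, "E": 1, "F": 0},
        "T3":  {"A": 4, "B": 1, "C": 2, "D": 0, "E": 0, "F": 0},
        "T4":  {"A": 0, "B": 0, "C": 3, "D": 0, "E": 1, "F": 2},
        "T5":  {"A": 0, "B": 0, "C": 0, "D": 4, "E": 0, "F": 1},
        "T6":  {"A": 0, "B": 3, "C": 2, "D": 0, "E": 1, "F": 0},
        "T7":  {"A": 3, "B": 0, "C": 0, "D": 1, "E": 2, "F": 4},
        "T8":  {"A": 2, "B": 0, "C": 0, "D": 4, "E": 0, "F": 2},
        "T9":  {"A": 0, "B": 1, "C": 0, "D": 0, "E": 2, "F": 1},
        "T10": {"A": 0, "B": 4, "C": 1, "D": 0, "E": 1, "F": 0},
        "T11": {"A": 0, "B": 0, "C": 3, "D": 0, "E": 0, "F": 2},
    }

    # Example item database (Table 2): the retail price of each item.
    retail_price = {"A": 1.50, "B": 2.25, "C": 5.00, "D": 4.75, "E": 10.00, "F": 7.50}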
Definition 1. The local itemset count is the sum of the local item counts (i.e., the quantity of a particular item purchased in a particular transaction) for all transactions which contain a particular item in a particular itemset, denoted as lisc(i, x), where lisc(i, x) = Σ_k lic(i, t_k), lic(i, t) is the value at the intersection of row t and column i, i ∈ I, x ⊆ I, x ⊆ t_k, and t_k ∈ D.

Query. "Give the quantity of item C in itemset {B, C}."

Result. The local itemset count for item C in itemset {B, C} is lisc(C, {B, C}) = lic(C, T3) + lic(C, T6) + lic(C, T10) = 5.
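A minimal sketch of Definition 1, assuming the transactions dictionary from the sketch above (the helper names lic, contains, and lisc mirror the notation but are otherwise ours):

    def lic(i, t):
        """Local item count: quantity of item i purchased in transaction t."""
        return transactions[t][i]

    def contains(t, x):
        """True if transaction t contains every item of itemset x (nonzero count)."""
        return all(transactions[t][j] > 0 for j in x)

    def lisc(i, x):
        """Local itemset count: quantity of item i summed over transactions containing itemset x."""
        return sum(lic(i, t) for t in transactions if contains(t, x))

    print(lisc("C", {"B", "C"}))  # -> 5, as in the example above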
Definition 2. The local itemset amount is the sum of the local item amounts (i.e., the product of the local item count for a particular item purchased in a particular transaction and the item retail price) for all transactions which contain a particular item in a particular itemset, denoted as lisa_1(i, x), where lisa_1(i, x) = Σ_k lia(i, t_k), lia(i, t) is the value at the intersection of row t and column i multiplied by the retail price of item i, i ∈ I, x ⊆ I, x ⊆ t_k, and t_k ∈ D. Alternatively, the local itemset amount is the product of the local itemset count for a particular item in a particular itemset and the item retail price, denoted as lisa_2(i, x), where lisa_2(i, x) = lisc(i, x) × irp(i), irp(i) is the item retail price, i ∈ I, and x ⊆ I.

Query. "Give the value of item C in itemset {B, C}."

Result. The local itemset amount for item C in itemset {B, C} is lisa_1(C, {B, C}) = lia(C, T3) + lia(C, T6) + lia(C, T10) = 25.00.
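A corresponding sketch of Definition 2, assuming the transactions, retail_price, lic, contains, and lisc names from the sketches above:

    def lia(i, t):
        """Local item amount: quantity of item i in transaction t times its retail price."""
        return lic(i, t) * retail_price[i]

    def lisa1(i, x):
        """Local itemset amount, accumulated transaction by transaction."""
        return sum(lia(i, t) for t in transactions if contains(t, x))

    def lisa2(i, x):
        """Local itemset amount via the local itemset count; equal to lisa1 by definition."""
        return lisc(i, x) * retail_price[i]

    print(lisa1("C", {"B", "C"}))  # -> 25.0, as in the example above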
Definition 3. The global itemset count is the sum of the local itemset counts for all items in a particular itemset, denoted as gisc(x), where gisc(x) = Σ_k lisc(i_k, x), x ⊆ I, and i_k ∈ x, for all k.

Query. "Give the quantity of all items in itemset {B, C}."

Result. The global itemset count for itemset {B, C} is gisc({B, C}) = lisc(B, {B, C}) + lisc(C, {B, C}) = 13.
Definition 4. The global itemset amount is the sum of the local itemset amounts for all items in a particular itemset, denoted as gisa(x), where gisa(x) = Σ_k lisa_1(i_k, x), x ⊆ I, and i_k ∈ x, for all k, or alternatively, gisa(x) = Σ_k lisa_2(i_k, x), x ⊆ I, and i_k ∈ x, for all k.

Query. "Give the value of all items in itemset {B, C}."

Result. The global itemset amount for itemset {B, C} is gisa({B, C}) = lisa_2(B, {B, C}) + lisa_2(C, {B, C}) = 43.00.
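Definitions 3 and 4 aggregate the local measures over the items of an itemset; a sketch, assuming lisc and lisa2 from the sketches above:

    def gisc(x):
        """Global itemset count: sum of local itemset counts over all items in itemset x."""
        return sum(lisc(i, x) for i in x)

    def gisa(x):
        """Global itemset amount: sum of local itemset amounts over all items in itemset x."""
        return sum(lisa2(i, x) for i in x)

    print(gisc({"B", "C"}))  # -> 13
    print(gisa({"B", "C"}))  # -> 43.0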
Definition 5. The total itemset count is the sum of the global item counts (i.e., the sum of the local item counts for a particular item purchased in all transactions) for all items in a particular itemset, denoted as tisc(x), where tisc(x) = Σ_k gic(i_k), gic(i) is the sum of all values in column i, x ⊆ I, and i_k ∈ x.

Query. "Give the quantity of all items in the transaction database that are in itemset {B, C}."

Result. The total itemset count for itemset {B, C} is tisc({B, C}) = gic(B) + gic(C) = 25.
Definition 6. The total itemset amount is the sum of the global item amounts (i.e., the sum of the local item amounts for a particular item purchased in all transactions) for all items in a particular itemset, denoted as tisa(x), where tisa(x) = Σ_k gia(i_k), gia(i) is the value of item i in all transactions, x ⊆ I, and i_k ∈ x.

Query. "Give the value of all items in the transaction database that are in itemset {B, C}."

Result. The total itemset amount for itemset {B, C} is tisa({B, C}) = gia(B) + gia(C) = 92.00.
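Definitions 5 and 6 ignore which transactions contain the itemset and simply total the global item counts and amounts; a sketch, assuming transactions, retail_price, and lic from the sketches above:

    def gic(i):
        """Global item count: total quantity of item i over all transactions."""
        return sum(lic(i, t) for t in transactions)

    def gia(i):
        """Global item amount: total value of item i over all transactions."""
        return gic(i) * retail_price[i]

    def tisc(x):
        """Total itemset count: sum of global item counts for the items in itemset x."""
        return sum(gic(i) for i in x)

    def tisa(x):
        """Total itemset amount: sum of global item amounts for the items in itemset x."""
        return sum(gia(i) for i in x)

    print(tisc({"B", "C"}))  # -> 25
    print(tisa({"B", "C"}))  # -> 92.0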
2.2 Share

We now introduce and define the notion of share in terms of the definitions from the previous section.
Definition 7. The total item count local share for a particular item in a particular itemset is the ratio of the local itemset count to the total item count (i.e., the sum of the global item counts for all items purchased in all transactions), expressed as a percentage, denoted as ticls(i, x), where ticls(i, x) = (lisc(i, x)/tic) × 100, tic is the quantity of all items in the transaction database, i ∈ I, and x ⊆ I.

Query. "Give the share of the quantity of item F in itemset {D, F} in relation to the quantity of all items in the transaction database."

Result. The total item count local share for item F in itemset {D, F} is ticls(F, {D, F}) = (lisc(F, {D, F})/tic) × 100 = 10.6%.
Definition 8. The total item amount local share for a particular item in a particular itemset is the ratio of the local itemset amount to the total item amount (i.e., the sum of the global item amounts for all items purchased in all transactions), expressed as a percentage, denoted as tials(i, x), where tials(i, x) = (lisa_v(i, x)/tia) × 100, tia is the total value of all items in the transaction database, i ∈ I, x ⊆ I, and v ∈ {1, 2}.

Query. "Give the share of the value of item F in itemset {D, F} in relation to the value of all items in the transaction database."

Result. The total item amount local share for item F in itemset {D, F} is tials(F, {D, F}) = (lisa_1(F, {D, F})/tia) × 100 = 15.9%.
Definition 9. The total item count global share for a particular itemset is the ratio of the global itemset count to the total item count, expressed as a percentage, denoted as ticgs(x), where ticgs(x) = (gisc(x)/tic) × 100, x ⊆ I.

Query. "Give the share of the quantity of all items in itemset {D, F} in relation to the quantity of all items in the transaction database."

Result. The total item count global share for itemset {D, F} is ticgs({D, F}) = (gisc({D, F})/tic) × 100 = 24.2%.
Definition 10. The total item amount global share for a particular itemset is the ratio of the global itemset amount to the total item amount, expressed as a percentage, denoted as tiags(x), where tiags(x) = (gisa(x)/tia) × 100, x ⊆ I.

Query. "Give the share of the value of all items in itemset {D, F} in relation to the value of all items in the transaction database."

Result. The total item amount global share for itemset {D, F} is tiags({D, F}) = (gisa({D, F})/tia) × 100 = 28.9%.
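A sketch of Definitions 7 to 10, assuming the helper functions from the earlier sketches (the module-level tic and tia totals are ours):

    tic = sum(gic(i) for i in retail_price)  # total item count for Table 1: 66
    tia = sum(gia(i) for i in retail_price)  # total item amount for Table 1: 329.25

    def ticls(i, x):
        """Total item count local share (%) of item i within itemset x (Definition 7)."""
        return 100.0 * lisc(i, x) / tic

    def tials(i, x):
        """Total item amount local share (%) of item i within itemset x (Definition 8)."""
        return 100.0 * lisa1(i, x) / tia

    def ticgs(x):
        """Total item count global share (%) of itemset x (Definition 9)."""
        return 100.0 * gisc(x) / tic

    def tiags(x):
        """Total item amount global share (%) of itemset x (Definition 10)."""
        return 100.0 * gisa(x) / tia

    print(round(ticls("F", {"D", "F"}), 1))  # -> 10.6
    print(round(tiags({"D", "F"}), 1))       # -> 28.9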
Definition 11. The global itemset count local share for a particular item in a particular itemset is the ratio of the local itemset count to the global itemset count, expressed as a percentage, denoted as giscls(i, x), where giscls(i, x) = (lisc(i, x)/gisc(x)) × 100, i ∈ I, and x ⊆ I.

Query. "Give the share of the quantity of item A in itemset {A, D} in relation to the quantity of all items in the itemset."

Result. The global itemset count local share for item A in itemset {A, D} is giscls(A, {A, D}) = (lisc(A, {A, D})/gisc({A, D})) × 100 = 46.2%.
Definition 12. The global itemset amount local share for a particular item in a particular itemset is the ratio of the local itemset amount to the global itemset amount, expressed as a percentage, denoted as gisals(i, x), where gisals(i, x) = (lisa_v(i, x)/gisa(x)) × 100, i ∈ I, x ⊆ I, and v ∈ {1, 2}.

Query. "Give the share of the value of item A in itemset {A, D} in relation to the value of all items in the itemset."

Result. The global itemset amount local share for item A in itemset {A, D} is gisals(A, {A, D}) = (lisa_1(A, {A, D})/gisa({A, D})) × 100 = 21.3%.
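A sketch of Definitions 11 and 12, again assuming the helper functions from the earlier sketches:

    def giscls(i, x):
        """Global itemset count local share (%): item i's fraction of the quantity in itemset x."""
        return 100.0 * lisc(i, x) / gisc(x)

    def gisals(i, x):
        """Global itemset amount local share (%): item i's fraction of the value in itemset x."""
        return 100.0 * lisa1(i, x) / gisa(x)

    print(round(giscls("A", {"A", "D"}), 1))  # -> 46.2
    print(round(gisals("A", {"A", "D"}), 1))  # -> 21.3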
2.3 Frequent Itemsets

A frequent itemset was previously defined as an itemset whose support is greater than some user-specified minimum [2]. We now define frequent itemsets as used in the share-confidence framework.
Definition 13. An itemset is locally frequent if there is an item in the itemset such that at least one of the following conditions holds:

1. The total item count local share is greater than some user-specified minimum. That is, ticls(i_k, x) ≥ minshare_1, where x ⊆ I, i_k ∈ x, for some k, and minshare_1 is the user-specified minimum share.

2. The total item amount local share is greater than some user-specified minimum. That is, tials(i_k, x) ≥ minshare_2, where x ⊆ I, i_k ∈ x, for some k, and minshare_2 is the user-specified minimum share.
Query. "Give the frequent 2-itemsets whose local share for at least one item is at least 8%."

Result. The locally frequent 2-itemsets are shown in Table 3. In Table 3, the Itemset column describes the items in the itemset, the TIDs column describes the transaction identifiers that contain the corresponding itemset, the ticls(i_1, x) and ticls(i_2, x) columns describe the total item count local share for items one and two, respectively, and the tials(i_1, x) and tials(i_2, x) columns describe the total item amount local share for items one and two, respectively.
Table 3. Locally frequent 2-itemsets

Itemset   TIDs               ticls(i_1,x) (%)   ticls(i_2,x) (%)   tials(i_1,x) (%)   tials(i_2,x) (%)
{A, D}    T1, T7, T8          9.09              10.6                2.73              10.1
{B, E}    T2, T6, T9, T10    16.67               7.58               7.52              15.19
{B, C}    T3, T6, T10        12.12               7.58               5.47               7.59
{C, E}    T4, T6, T10         9.09               4.55               9.11               9.11
{C, F}    T4, T11             9.09               6.06               9.11               9.11
{E, F}    T4, T7, T9          7.58              10.6               15.19              15.95
{D, F}    T5, T7, T8         13.64              10.6               12.98              15.95
{A, F}    T7, T8              7.58               9.09               2.27              13.67
{D, E}    T7                  1.52               6.06               1.44              12.15
Definition 14. An itemset is globally frequent if every item in the itemset is locally frequent.

Query. "Give the frequent 2-itemsets whose local share for all items is at least 8%."

Result. The globally frequent 2-itemsets are shown in Table 4. The columns in Table 4 have the same meaning as in Table 3.
Table 4. Globally frequent 2-itemsets

Itemset   TIDs            ticls(i_1,x) (%)   ticls(i_2,x) (%)   tials(i_1,x) (%)   tials(i_2,x) (%)
{A, D}    T1, T7, T8       9.09              10.6                2.73              10.1
{C, E}    T4, T6, T10      9.09               4.55               9.11               9.11
{C, F}    T4, T11          9.09               6.06               9.11               9.11
{E, F}    T4, T7, T9       7.58              10.6               15.19              15.95
{D, F}    T5, T7, T8      13.64              10.6               12.98              15.95
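Definitions 13 and 14 differ only in the quantifier over the items of the itemset; a sketch, assuming ticls and tials from the earlier sketches:

    def locally_frequent(x, minshare1, minshare2):
        """Definition 13: some item in x meets the count-share or amount-share minimum."""
        return any(ticls(i, x) >= minshare1 or tials(i, x) >= minshare2 for i in x)

    def globally_frequent(x, minshare1, minshare2):
        """Definition 14: every item in x meets the count-share or amount-share minimum."""
        return all(ticls(i, x) >= minshare1 or tials(i, x) >= minshare2 for i in x)

    print(locally_frequent({"A", "F"}, 8.0, 8.0))   # -> True  ({A, F} appears in Table 3)
    print(globally_frequent({"A", "F"}, 8.0, 8.0))  # -> False ({A, F} is absent from Table 4)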
2.4 Confidence

Confidence in an association rule X ⇒ Y was previously defined as the ratio of the number of transactions containing itemset X ∪ Y to the number of transactions containing itemset X [2]. We now define confidence as used in the share-confidence framework.
Definition 15. The count confidence in an association rule X ⇒ Y is the ratio of the sum of the local itemset counts for all items in itemset X contained in X ∪ Y to the global itemset count for itemset X, expressed as a percentage, denoted as cc(x, x ∪ y), where cc(x, x ∪ y) = (Σ_k lisc(i_k, x ∪ y)/gisc(x)) × 100, x ⊆ I, x ∪ y ⊆ I, and i_k ∈ x, for all k.

Query. "Give the count confidence for the association rule {B, C} ⇒ {E}."

Result. The count confidence for the association rule {B, C} ⇒ {E} is cc({B, C}, {B, C, E}) = ((lisc(B, {B, C, E}) + lisc(C, {B, C, E}))/gisc({B, C})) × 100 = 76.9%.
Definition 16. The amount confidence in an association rule X ⇒ Y is the ratio of the sum of the local itemset amounts for all items in itemset X contained in X ∪ Y to the global itemset amount for itemset X, expressed as a percentage, denoted as ac(x, x ∪ y), where ac(x, x ∪ y) = (Σ_k lisa_v(i_k, x ∪ y)/gisa(x)) × 100, x ⊆ I, x ∪ y ⊆ I, i_k ∈ x, for all k, and v ∈ {1, 2}.

Query. "Give the amount confidence for the association rule {B, C} ⇒ {E}."

Result. The amount confidence for the association rule {B, C} ⇒ {E} is ac({B, C}, {B, C, E}) = ((lisa_2(B, {B, C, E}) + lisa_2(C, {B, C, E}))/gisa({B, C})) × 100 = 59.9%.
3 Characterized Itemsets

3.1 Example

We now present an example to demonstrate the CI algorithm and describe the primary data structures. In this example, let L_k and C_k denote the set of frequent itemsets from pass k and the set of candidate itemsets from pass k, respectively, and let R denote the relation containing the characterized itemsets. Each element of L_k and C_k contains three attributes: the itemset, the total item count local share, and the total item amount local share. Each element of R contains one attribute for each characteristic of interest and an attribute containing a list of all frequent itemsets sharing the corresponding characteristic attributes.

Assume we are given the transaction database shown in Table 5. Also assume the user-specified minimum share is 15%. In Table 5, the column descriptions have the same meaning as the like-named columns in Table 1. Our task is to trace through the first three passes of CI to generate and store the characterized itemsets in R. For this example, we consider only the total item count local share to determine whether an itemset is frequent.
Table 5. A smaller example transaction database with counts

TID     A    B    C    D    E
T1      1    2    5    0    0
T2      4    1    1    3    2
T3      3    0    2    1    0
T4      5    0    4    2    1
T5      2    3    3    4    0
Total  15    6   15   10    3
After the first pass, CI generates L_1 and R as shown in Tables 6 and 7, respectively. In Table 6, the Itemset column describes the items in each itemset and the Share column describes the total item count local share. In Table 7, the Char. 1 and Char. 2 columns describe the characteristics retrieved from the external database(s), and the TIDs column describes the transactions that share the corresponding characteristics (the TIDs are not actually stored in R and are merely shown here for reader convenience). The domains of the first and second characteristics are {R, S} and {X, Y, Z}, respectively.
Table 6. Frequent itemsets contained in L_1

Itemset   Share (%)
{A}       30.6
{C}       30.6
{D}       20.4

Table 7. R after the first pass

Char. 1   Char. 2   TIDs
R         X         T1, T4
S         Y         T2, T5
S         Z         T3

After the second pass, CI generates L_2 and updates R as shown in Tables 8 and 9, respectively. In Tables 8 and 9, the column descriptions have the same meaning as the like-named columns in Tables 6 and 7, respectively. In addition, in Table 9, the Itemsets column describes the frequent itemsets from the previous pass that share the identified characteristics.
Table 8. Frequent itemsets contained in L_2

Itemset   Share (%)
{A, C}    61.2
{A, D}    49.0
{A, E}    24.5

Table 9. R after the second pass

Char. 1   Char. 2   TIDs      Itemsets
R         X         T1, T4    ⟨{A}, 6⟩, ⟨{C}, 9⟩, ⟨{D}, 2⟩
S         Y         T2, T5    ⟨{A}, 6⟩, ⟨{C}, 4⟩, ⟨{D}, 7⟩
S         Z         T3        ⟨{A}, 3⟩, ⟨{C}, 2⟩, ⟨{D}, 1⟩
After the third pass, CI generates L_3 and updates R as shown in Tables 10 and 11, respectively. In Tables 10 and 11, the column descriptions have the same meaning as the like-named columns in Table 8 and Table 9, respectively.

Table 10. Frequent itemsets contained in L_3

Itemset      Share (%)
{A, C, D}    69.4
{A, C, E}    34.7

Table 11. R after the third pass

Char. 1   Char. 2   TIDs      Itemsets
R         X         T1, T4    ⟨{A}, 6⟩, ⟨{C}, 9⟩, ⟨{D}, 2⟩, ⟨{A, C}, 15⟩, ⟨{A, D}, 7⟩, ⟨{A, E}, 6⟩
S         Y         T2, T5    ⟨{A}, 6⟩, ⟨{C}, 4⟩, ⟨{D}, 7⟩, ⟨{A, C}, 10⟩, ⟨{A, D}, 13⟩, ⟨{A, E}, 6⟩
S         Z         T3        ⟨{A}, 3⟩, ⟨{C}, 2⟩, ⟨{D}, 1⟩, ⟨{A, C}, 5⟩, ⟨{A, D}, 4⟩
The characterized itemsets in R, generated by CI, form a relation. In a relation, transforming a specific data description into a more general one is called generalization. Several algorithms have been proposed for finding generalized itemsets where concept hierarchies are used to classify items [10, 1]. Our approach differs from these in that we use concept hierarchies to classify the characteristic attributes. Fast and efficient implementations of AOG [7, 6, 13] are used to generate summaries where the characteristic attributes are generalized according to the concept hierarchies. If the concept hierarchies have relatively few levels (i.e., fewer than 10), and if multiple hierarchies are available for some attributes, the AllGen algorithm [12] is used to generate all possible summaries.
3.2 The CI Algorithm

In the description of CI that follows, L_k, C_k, and R have the same meaning as in the example of the previous section. The k-th pass of the algorithm works as follows:
1. Repeat steps 2 to 5 until no new candidate itemsets are generated in pass (k-1).
2. Generate the candidate k-itemsets in C_k from the frequent (k-1)-itemsets in L_{k-1} using the Apriori method described in [2, 5].
3. Partition the frequent (k-1)-itemsets in L_{k-1} and update the candidate itemsets in C_k.
   a. Repeat steps 3-b to 3-f until there are no more transactions to be retrieved from the database.
   b. Retrieve the next transaction from the database.
   c. Retrieve the corresponding characteristic tuple from R.
   d. For each (k-1)-itemset in the transaction, if it is contained in L_{k-1}, update the characteristic tuple.
      (i) If itemset summary attributes already exist for this (k-1)-itemset in the characteristic tuple, go to step 3-d-ii. Otherwise, create new itemset summary attributes in the characteristic tuple.
      (ii) Increment the total quantity and total value attributes for this (k-1)-itemset in the characteristic tuple.
   e. If the characteristic tuple has been updated, save it in R.
   f. For each k-itemset in the transaction, if it is contained in C_k, increment the associated total quantity and total value attributes.
4. Save the frequent k-itemsets in L_k.
   a. Repeat steps 4-b and 4-c until there are no more itemset tuples in C_k.
   b. Retrieve the next itemset tuple from C_k.
   c. If the share of this itemset tuple is greater than the minimum specified, copy the itemset tuple to L_k.
5. Delete C_k.
6. Save R.
The first pass of the algorithm is a special pass which generates the frequent 1-itemsets and the characteristic relation, as follows (a simplified sketch of this pass in code is given at the end of this section):

1. Generate the candidate 1-itemsets in C_1 and the characteristic relation R.
   a. Repeat steps 1-b to 1-f until there are no more transactions to be retrieved from the database.
   b. Retrieve the next transaction from the database.
   c. For each 1-itemset in the transaction, if an itemset tuple already exists in C_1, go to step 1-d. Otherwise, create a new itemset tuple in C_1.
   d. For each 1-itemset in the transaction, increment the total quantity and total value attributes of the associated itemset tuple in C_1.
   e. Using the appropriate key(s), retrieve the characterizing attributes for this transaction from the external database(s).
   f. If a characteristic tuple containing these characteristics already exists in R, go to step 1-b. Otherwise, create a new characteristic tuple in R.
2. Save the frequent 1-itemsets in L_1.
   a. Repeat steps 2-b and 2-c until there are no more itemset tuples in C_1.
   b. Retrieve the next itemset tuple from C_1.
   c. If the share of this itemset tuple is greater than the minimum specified, copy the itemset tuple to L_1.
3. Delete C_1.
4. Save R.
The running time and space requirements of CI are O(|c| × |t|) and O(|s|), respectively, where |c| is the number of candidate itemsets in all iterations of the algorithm, |t| is the number of transactions, and |s| is the size of the largest candidate itemset in any pass.
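As an illustration only, the following Python sketch outlines the structure of the first pass. The dictionary-based data structures and names are our own simplification, not the authors' implementation, and the itemset summary attributes added to R during later passes are omitted.

    def ci_first_pass(transactions, retail_price, characteristics, minshare):
        """Simplified sketch of CI's first pass (data structures and names are ours).

        transactions:    {tid: {item: count}}
        retail_price:    {item: price}
        characteristics: {tid: (char1, char2)}, e.g. retrieved from an external database
        minshare:        minimum total item count local share, in percent
        """
        C1 = {}  # candidate 1-itemsets: item -> [total quantity, total value]
        R = {}   # characteristic relation: (char1, char2) -> itemset summaries (filled in later passes)
        for tid, counts in transactions.items():
            for item, qty in counts.items():
                if qty > 0:
                    tot = C1.setdefault(item, [0, 0.0])     # steps 1-c and 1-d
                    tot[0] += qty
                    tot[1] += qty * retail_price[item]
            R.setdefault(characteristics[tid], [])          # steps 1-e and 1-f
        tic = sum(q for counts in transactions.values() for q in counts.values())
        # Step 2: keep the 1-itemsets whose total item count local share meets the minimum.
        L1 = {item: tot for item, tot in C1.items() if 100.0 * tot[0] / tic >= minshare}
        return L1, R

With the data of Table 5, the characteristics {T1: (R, X), T2: (S, Y), T3: (S, Z), T4: (R, X), T5: (S, Y)}, and a 15% minimum share, this sketch yields L_1 = {A, C, D} and the three characteristic tuples of Table 7.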
4 Experimental Results

We ran all of our experiments on an IBM AT-compatible personal computer with a Pentium P166 processor and 64 MB of memory, running Windows NT Workstation version 4.0. Input data was from a large database supplied by a commercial partner in the telecommunications industry. The database contained approximately 3.3 million tuples representing account activity for over 500 thousand customer accounts and 2200 unique items (identified by integers in the range [1 ... 2200]). Each tuple is either an equipment rental or service transaction containing the number of items and the cost of each item. An itemset was considered to be frequent if at least one of the following three conditions held: the minimum support was greater than 0.25%, the total item count global share was greater than 0.25%, or the total item amount global share was greater than 0.25%.
The 20 most frequent 1-itemsets ranked by support, total item count global share, and total item amount global share are shown in Figures 1, 2, and 3, respectively. In Figures 1 to 3, the first row of bars (i.e., those at the front of the graph) corresponds to the total item amount global share (i.e., value), the second row corresponds to the total item count global share (i.e., quantity), and the third row corresponds to the support. The height of each bar corresponds to the percentage of share or support for the associated 1-itemset. There were 109 frequent 1-itemsets discovered.
Figure 1 shows that support over-represents the actual frequency with which a 1-itemset is purchased, in terms of both the quantity and value of the purchases. The support for the most frequent 1-itemset is approximately 25%, yet this itemset represents only approximately 5% of the total quantity of items purchased and only approximately 2% of the total value of items purchased. The ranking of these same 1-itemsets by total item count global share is similar to that of support, but the ranking by total item amount global share shows significant variation from both support and total item count global share.
Fig. 1. 20 most frequent 1-itemsets ranked by support (bars show global share by value, global share by quantity, and support, as percentages of the total, per assigned itemset ID)
Figure 2 shows that 14 of the frequent 1-itemsets that were ranked highest by support (i.e., those identified by integers less than or equal to 20) also appear in the 20 most frequent 1-itemsets ranked by total item count global share. The remaining six 1-itemsets (i.e., 101, 81, 25, 107, 100, 34) are shown to have a higher ranking when ranked by total item count global share. The 1-itemsets that include items 100, 101, and 107 are especially noteworthy since there were only 109 frequent 1-itemsets ranked. The support measure considers these items to be among the least important, yet when ranked by total item count global share, they are ranked eleventh, first, and eighth, respectively.

Fig. 2. 20 most frequent 1-itemsets ranked by total item count global share (bars show global share by value, global share by quantity, and support, as percentages of the total, per assigned itemset ID)
Figure 3 shows that nine of the frequent 1-itemsets that were ranked highest by support also appear in the 20 most frequent 1-itemsets ranked by total item amount global share. It also shows that nine of the 1-itemsets that were ranked in the bottom 50% by support are among the 20 most frequent when ranked by total item amount global share.

Fig. 3. 20 most frequent 1-itemsets ranked by total item amount global share (bars show global share by value, global share by quantity, and support, as percentages of the total, per assigned itemset ID)
Similar results to those shown in Figures 1 to 3 were obtained when ranking k-itemsets. We present the results for 2-itemsets, shown in Table 12. Table 12 shows three sets of rankings for 2-itemsets, where each set contains three columns. In Table 12, the Support, Share (Quantity), and Share (Value) columns
describe 10 itemsets ranked by support, total item count global share, and total item amount global share, respectively. In the first set, the first column shows the 10 most frequent 2-itemsets ranked by support. The second and third columns show the corresponding rank for these itemsets ranked by total item count and total item amount global share, respectively. In the second set, the second column shows the 10 most frequent 2-itemsets ranked by total item count global share. The first and third columns show the corresponding rank for these itemsets ranked by support and total item amount global share, respectively. In the third set, the third column shows the 10 most frequent 2-itemsets ranked by total item amount global share. The first and second columns show the corresponding rank for these itemsets ranked by support and total item count global share. There were 351 frequent 2-itemsets.

Table 12. 2-itemsets ranked by support and share

Set 1 Rankings                       Set 2 Rankings                       Set 3 Rankings
Support  Share (Qty)  Share (Val)    Support  Share (Qty)  Share (Val)    Support  Share (Qty)  Share (Val)
1        4            1              306      1            18             1        4            1
2        13           3              341      2            38             293      8            2
3        17           9              324      3            27             2        13           3
4        19           12             1        4            1              305      45           4
5        20           5              294      5            23             5        20           5
6        22           11             316      6            32             288      121          6
7        27           28             291      7            24             75       80           7
8        35           33             293      8            2              287      206          8
9        47           59             307      9            29             3        17           9
10       41           109            301      10           31             336      350          10

The 2-itemset ranked as most frequent by support (refer to the first set) and total item amount global share was ranked fourth by total item count global share. While this itemset does not represent the most frequent itemset sold in terms of the quantity of items, it was purchased in the greatest number of transactions and had the highest gross income of all 2-itemsets. In contrast, the
2-itemset ranked tenth by support, for instance, was ranked 41st by total item count global share and 109th by total item amount global share. This itemset is ranked highly by support, yet its contribution to gross income is comparatively low.
The 2-itemset ranked as most frequent by total item count global share (refer to the second set) was ranked 306th by support. This is an itemset where the items are typically purchased in multiples. Consequently, it is purchased more frequently than support seems to indicate. Similarly, 13 of the 15 2-itemsets ranked as most frequent by total item count global share are ranked below 291 by support.
The 2-itemset ranked tenth by total item amount global share (refer to the third set) was ranked 336th by support and 350th by total item count global share. The items in this itemset are relatively expensive. Consequently, although it was not purchased as frequently as many other itemsets, its contribution to gross income is comparatively high.
5 Conclusion

We have introduced the share-confidence framework for knowledge discovery from databases, which classifies itemsets based upon characteristic attributes extracted from external databases. We suggested how characterized itemsets can be generalized according to concept hierarchies associated with the characteristic attributes. Experimental results demonstrated that the share-confidence framework can give more informative feedback than the support-confidence framework.
References

1. R. Agrawal, K. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Databases (VLDB'95), Zurich, Switzerland, September 1995.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307-328, Menlo Park, CA, 1996. AAAI Press/MIT Press.
3. R. Agrawal and J.C. Schafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962-969, December 1996.
4. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), pages 265-276, May 1997.
5. S. Brin, R. Motwani, J.D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), pages 255-264, May 1997.
6. C.L. Carter and H.J. Hamilton. Efficient attribute-oriented algorithms for knowledge discovery from large databases. IEEE Transactions on Knowledge and Data Engineering. To appear.
7. C.L. Carter and H.J. Hamilton. Performance evaluation of attribute-oriented algorithms for knowledge discovery from databases. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (ICTAI'95), pages 486-489, Washington, D.C., November 1995.
8. C.L. Carter, H.J. Hamilton, and N. Cercone. Share-based measures for itemsets. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 14-24, Trondheim, Norway, June 1997.
9. D.W. Cheung, A.W. Fu, and J. Han. Knowledge discovery in databases: a rule-based attribute-oriented approach. In Lecture Notes in Artificial Intelligence, The 8th International Symposium on Methodologies for Intelligent Systems (ISMIS'94), pages 164-173, Charlotte, North Carolina, 1994.
10. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on Very Large Data Bases (VLDB'95), pages 420-431, September 1995.
11. J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.
12. R.J. Hilderman, H.J. Hamilton, R.J. Kowalchuk, and N. Cercone. Parallel knowledge discovery using domain generalization graphs. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 25-35, Trondheim, Norway, June 1997.
13. H.-Y. Hwang and W.-C. Fu. Efficient algorithms for attribute-oriented induction. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 168-173, Montreal, August 1995.
14. J.S. Park, M.-S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'95), pages 175-186, May 1995.