Mining Market Basket Data Using
Share Measures and Characterized Itemsets
Robert J. Hilderman, Colin L. Carter, Howard J. Hamilton, and Nick Cercone
Department of Computer Science
University of Regina
Regina, Saskatchewan, Canada, S4S 0A2
{hilder,carter,hamilton,nick}@cs.uregina.ca
Abstract. We propose the share-confidence framework for knowledge discovery from databases which addresses the problem of mining itemsets from market basket data. Our goal is two-fold: (1) to present new itemset measures which are practical and useful alternatives to the commonly used support measure; (2) to not only discover the buying patterns of customers, but also to discover customer profiles by partitioning customers into distinct classes. We present a new algorithm for classifying itemsets based upon characteristic attributes extracted from census or lifestyle data. Our algorithm combines the Apriori algorithm for discovering association rules between items in large databases, and the AOG algorithm for attribute-oriented generalization in large databases. We suggest how characterized itemsets can be generalized according to concept hierarchies associated with the characteristic attributes. Finally, we present experimental results that demonstrate the utility of the share-confidence framework.
1 Introduction
Consider a retail sales operation with a large inventory consisting of many distinct products. The operation is situated in a location where the customer base is socio-economically diverse, with annual household incomes ranging from very low to very high, and demographically ranging from young families to the elderly. The sales manager has used data mining to determine those products that are typically purchased together and those that are most likely to be purchased given that particular products have already been selected (called itemsets [2, 14]). Analysis of the itemsets has enabled him to strategically arrange store displays and plan advertising campaigns to increase sales. He now wonders whether there are any more subtle socio-economic buying patterns that could be helpful in guiding the distribution of flyers during the next advertising campaign. For example, he would like to know which itemsets are more likely to be purchased by those with specific incomes or by those with children. He would also like to know which itemsets are more likely to be purchased by those living in particular neighborhoods. He believes that characterizing itemsets with classificatory information available from credit card or cheque transactions will allow him to answer queries of this kind.
In this paper, we propose the share-confidence framework, which looks beyond the simple frequency with which two or more items are bought together. We introduce a new algorithm, called CI, which integrates the Apriori algorithm for discovering association rules between items in large databases [2, 1] and the AOG algorithm for attribute-oriented generalization in large databases [9, 11]. We also show how market basket data can be mined using share measures and characterized itemsets which have been generalized according to concept hierarchies associated with characteristic attributes. It should be noted, however, that our method is not limited to the discovery of customer profiles from market basket data; it is more widely applicable to any problem where taxonomic hierarchies can be associated with characterized data.
The remainder of this paper is organized as follows. In Section 2, we present a formal description of the market basket analysis problem and introduce the share-confidence framework. In Section 3, we describe characterized itemsets and an algorithm for generating characterized itemsets from market basket data. In Section 4, we present experimental results obtained using the share-confidence framework on a database supplied by a commercial partner. We conclude in Section 5 with a summary of our work.
2 The Share-Confidence Framework
The problem of discovering association rules from market basket data has been formally defined as follows [2]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is an itemset such that $T \subseteq I$. Transaction $T$ contains $X$, a set of some items in $I$, if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. The association rule $X \Rightarrow Y$ holds in transaction set $D$ with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The association rule $X \Rightarrow Y$ has support $s$ in transaction set $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$. This formalism is the support-confidence framework [4].
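For concreteness, these two measures can be sketched in a few lines of Python. This is an illustrative sketch added for this presentation, not part of the original paper; the toy transactions are made up.

```python
# Support and confidence in the support-confidence framework.
# Transactions are sets of items; X and Y are itemsets.

def support(D, X):
    """Percentage of transactions in D that contain every item of X."""
    return 100.0 * sum(1 for T in D if X <= T) / len(D)

def confidence(D, X, Y):
    """Of the transactions containing X, the percentage that also contain Y."""
    containing_x = [T for T in D if X <= T]
    return 100.0 * sum(1 for T in containing_x if Y <= T) / len(containing_x)

D = [{"A", "C", "D"}, {"B", "E"}, {"A", "B", "C"}, {"C", "E", "F"}]
print(support(D, {"A", "C"}))        # 50.0
print(confidence(D, {"A"}, {"C"}))   # 100.0
```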
The most studied and analyzed algorithm for generating itemsets in the support-confidence framework is Apriori, described in detail in [1, 2, 3]. This algorithm extracts the set of frequent itemsets from the set of candidate itemsets generated. A frequent itemset is an itemset whose support is greater than some user-specified minimum, and a candidate itemset is an itemset whose support has yet to be determined. Apriori combines the frequent itemsets from pass $k-1$ to create the candidate itemsets in pass $k$. It has the important property that if any subset of a candidate itemset is not a frequent itemset, then the candidate itemset is also not a frequent itemset.
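The join and prune phases of this candidate-generation step can be sketched as follows. This is a simplified illustration of the Apriori property, not the authors' implementation; the helper name apriori_gen is ours.

```python
from itertools import combinations

def apriori_gen(frequent_prev, k):
    """Join frequent (k-1)-itemsets that agree on their first k-2 items,
    then prune any candidate having an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:                       # join step
            cand = tuple(sorted(set(a) | set(b)))
            # prune step: every (k-1)-subset must itself be frequent
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    return candidates

print(apriori_gen([{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}], 3))
# {('A', 'B', 'C')} -- ('B','C','D') is pruned because {'C','D'} is not frequent
```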
In the support-confidence framework, the purchase of an item is indicated by a binary flag (i.e., the item is either purchased or not purchased). From this binary flag, we can determine the number of transactions containing an itemset, but not the number of items in the itemset. If we know the number of items, we may find that an itemset is actually more frequent than support indicates, allowing for more accurate financial analysis, comparisons, and projections. Since support does not consider quantity and value, its use is limited as a practical indicator for determining the financial implications of an itemset.

We will now extend the formalization of the market basket problem. The problem definition is identical to that for the support-confidence framework, except that we introduce the notion of share for itemsets, and redefine the notions of frequent itemsets and confidence. We refer to this extended formalism as the share-confidence framework, introduced in [8] as share measures.
In the sections that follow, we define the functions upon which the share-confidence framework is based. For the examples, refer to the transaction database shown in Table 1 and the item database shown in Table 2. In Table 1, the TID column describes the transaction identifier and columns A to F describe the items (products) being sold. Note that binary values are not used to indicate the purchase of an item; instead, the actual number of items purchased in the corresponding transaction (i.e., the counts) is used. In Table 2, the Item column describes the valid items and the Retail Price column describes the retailer's selling price to the customer.
Table 1. An example transaction database with counts

TID     A   B   C   D   E   F
T1      1   0   2   2   0   0
T2      0   3   0   0   1   0
T3      4   1   2   0   0   0
T4      0   0   3   0   1   2
T5      0   0   0   4   0   1
T6      0   3   2   0   1   0
T7      3   0   0   1   2   4
T8      2   0   0   4   0   2
T9      0   1   0   0   2   1
T10     0   4   1   0   1   0
T11     0   0   3   0   0   2
Total  10  12  13  11   8  12
Table 2. The item database

Item  Retail Price
A      1.50
B      2.25
C      5.00
D      4.75
E     10.00
F      7.50
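To make the definitions that follow concrete, the worked examples below are accompanied by short Python sketches over Tables 1 and 2. These sketches are our own illustrative additions, not the authors' implementation; the names counts, price, and contains are ours. This first block sets up the data, and the later blocks build on it.

```python
# Transaction database of Table 1: per-transaction item counts (the lic values).
counts = {
    "T1":  {"A": 1, "B": 0, "C": 2, "D": 2, "E": 0, "F": 0},
    "T2":  {"A": 0, "B": 3, "C": 0, "D": 0, "E": 1, "F": 0},
    "T3":  {"A": 4, "B": 1, "C": 2, "D": 0, "E": 0, "F": 0},
    "T4":  {"A": 0, "B": 0, "C": 3, "D": 0, "E": 1, "F": 2},
    "T5":  {"A": 0, "B": 0, "C": 0, "D": 4, "E": 0, "F": 1},
    "T6":  {"A": 0, "B": 3, "C": 2, "D": 0, "E": 1, "F": 0},
    "T7":  {"A": 3, "B": 0, "C": 0, "D": 1, "E": 2, "F": 4},
    "T8":  {"A": 2, "B": 0, "C": 0, "D": 4, "E": 0, "F": 2},
    "T9":  {"A": 0, "B": 1, "C": 0, "D": 0, "E": 2, "F": 1},
    "T10": {"A": 0, "B": 4, "C": 1, "D": 0, "E": 1, "F": 0},
    "T11": {"A": 0, "B": 0, "C": 3, "D": 0, "E": 0, "F": 2},
}

# Item database of Table 2: retail prices (the irp values).
price = {"A": 1.50, "B": 2.25, "C": 5.00, "D": 4.75, "E": 10.00, "F": 7.50}

def contains(t, x):
    """True if transaction t contains every item of itemset x."""
    return all(counts[t][i] > 0 for i in x)
```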
2.1 Preliminary Definitions

The definitions in this section were implemented in a data mining system for analyzing market basket data. This system is an extension of DB-Discover, a software tool for knowledge discovery from databases [7, 6]. Definitions 1 to 6 are used to query summary views containing discovered frequent itemsets.
Denition 1.
The
local itemset count
is the sum of the local item counts (i.e.,
the quantity of a particular item purchased in a particular transaction) for all
transactions which contain a particular item in a particular itemset, denoted as
lisc
(
i; x
), where
lisc
(
i; x
) =
P
lic
(
i; t
k
),
lic
(
i; t
) is the value at the intersection
of row
t
and column
i
,
i
2
I
,
x
I
,
x
2
t
k
, and
t
k
2
D
.
Query.
\Give the quantity of item
C
in itemset
f
B; C
g
."
Result.
The local itemset count for item
C
in itemset
f
B; C
g
is
lisc
(
C;
f
B; C
g
) =
lic
(
C; T
3
) +
lic
(
C; T
6
) +
lic
(
C; T
10
) = 5.
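Continuing the Python sketch above, Definition 1 translates directly:

```python
def lisc(i, x):
    """Local itemset count: units of item i summed over the transactions
    that contain every item of itemset x."""
    return sum(counts[t][i] for t in counts if contains(t, x))

print(lisc("C", {"B", "C"}))  # 5, matching the query result above
```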
Denition 2.
The
local itemset amount
is the sum of the local item amounts
(i.e., the pro duct of the lo cal item count for a particular item purchased in
a particular transaction and the item retail price) for all transactions which
contain a particular item in a particular itemset, denoted as
lisa
1
(
i; x
), where
lisa
1
(
i; x
) =
P
lia
(
i; t
k
),
lia
(
i; t
) is the value at the intersection of row
t
and
column
i
multiplied by the item retail price of item
i
,
i
2
I
,
x
I
,
x
2
t
k
,
and
t
k
2
D
. Alternatively, the lo cal itemset amount is the product of the lo cal
itemset count for a particular item in a particular itemset and the item retail
price, denoted as
lisa
2
(
i; x
), where
lisa
2
(
i; x
) =
lisc
(
i; x
)
irp
(
i
),
irp
(
i
) is the
item retail price,
i
2
I
, and
x
I
.
Query.
\Give the value of item
C
in itemset
f
B; C
g
."
Result.
The local itemset amount for item
C
in itemset
f
B; C
g
is
lisa
1
(
C;
f
B;
C
g
) =
lia
(
C; T
3
) +
lia
(
C; T
6
) +
lia
(
C; T
10
) = 25
:
00.
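In the running sketch, a single helper covers both forms of Definition 2, using the $lisa_2$ equivalence:

```python
def lisa(i, x):
    """Local itemset amount: value of item i over the transactions containing x.
    Uses the lisa_2 form of Definition 2: lisc(i, x) * irp(i)."""
    return lisc(i, x) * price[i]

print(lisa("C", {"B", "C"}))  # 25.0
```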
Denition 3.
The
global itemset count
is the sum of the local itemset counts for
all items in a particular itemset, denoted as
gisc
(
x
), where
gisc
(
x
) =
P
lisc
(
i
k
; x
),
x
I
, and
i
k
2
x
, for all
k
.
Query.
\Give the quantity of all items in itemset
f
B; C
g
."
Result.
The global itemset count for itemset
f
B; C
g
is
gisc
(
f
B; C
g
) =
lisc
(
B;
f
B; C
g
) +
lisc
(
C;
f
B; C
g
) = 13.
Denition 4.
The
global itemset amount
is the sum of the local itemset amounts
for all items in a particular itemset, denoted as
gisa
(
x
), where
gisa
(
x
) =
P
lisa
1
(
i
k
; x
),
x
I
, and
i
k
2
x
, for all
k
, or alternatively,
gisa
(
x
) =
P
lisa
2
(
i
k
; x
),
x
I
, and
i
k
2
x
, for all
k
.
Query.
\Give the value of all items in itemset
f
B; C
g
."
Result.
The global itemset amount for itemset
f
B; C
g
is
gisa
(
f
B; C
g
) =
lisa
2
(
B;
f
B; C
g
) +
lisa
2
(
B;
f
B; C
g
) = 43
:
00.
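Definitions 3 and 4 simply aggregate the local measures over the items of an itemset (continuing the sketch):

```python
def gisc(x):
    """Global itemset count: total units of all items of x, over the
    transactions that contain x (Definition 3)."""
    return sum(lisc(i, x) for i in x)

def gisa(x):
    """Global itemset amount: total value of all items of x, over the
    transactions that contain x (Definition 4)."""
    return sum(lisa(i, x) for i in x)

print(gisc({"B", "C"}), gisa({"B", "C"}))  # 13 43.0
```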
Denition 5.
The
total itemset count
is the sum of the global item counts
(i.e., the sum of the local item counts for a particular item purchased in all
transactions) for all items in a particular itemset, denoted as
tisc
(
x
), where
tisc
(
x
) =
P
gic
(
i
k
),
gic
(
i
) is the sum of all values in column
i
,
x
I
, and
i
k
2
x
.
Query.
\Give the quantity of all items in the transaction database that are in
itemset
f
B; C
g
."
Result.
The total itemset count for itemset
f
B; C
g
is
tisc
(
f
B; C
g
) =
gic
(
B
) +
gic
(
C
) = 25.
Denition 6.
The
total itemset amount
is the sum of the global item amounts
(i.e., the sum of the local item amounts for a particular item purchased in all
transactions) for all items in a particular itemset, denoted as
tisa
(
x
), where
tisa
(
x
) =
P
gia
(
i
k
),
gia
(
i
) is the value of item
i
in all transactions,
x
I
, and
i
k
2
x
.
Query.
\Give the value of all items in the transaction database that are in
itemset
f
B; C
g
."
Result.
The total itemset amount for itemset
f
B; C
g
is
tisa
(
f
B; C
g
=
gia
(
B
) +
gia
(
C
) = 92
:
00.
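Unlike Definitions 3 and 4, Definitions 5 and 6 sum over the whole database rather than only over the transactions containing the itemset (continuing the sketch):

```python
def gic(i):
    """Global item count: total units of item i over the whole database."""
    return sum(counts[t][i] for t in counts)

def tisc(x):
    """Total itemset count: database-wide units of the items of x (Definition 5)."""
    return sum(gic(i) for i in x)

def tisa(x):
    """Total itemset amount: database-wide value of the items of x (Definition 6)."""
    return sum(gic(i) * price[i] for i in x)

print(tisc({"B", "C"}), tisa({"B", "C"}))  # 25 92.0
```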
2.2 Share

We now introduce and define the notion of share in terms of the definitions from the previous section.
Denition 7.
The
total item count local share
for a particular item in a par-
ticular itemset is the ratio of the local itemset count to the total item count
(i.e., the sum of the global item counts for all items purchased in all trans-
actions), expressed as a percentage, denoted as
ticls
(
i; x
), where
ticls
(
i; x
) =
(
lisc
(
i; x
)
=tic
)
100,
tic
is the quantity of all items in the transaction database,
i
2
I
, and
x
I
.
Query.
\Give the share of the quantity of item
F
in itemset
f
D; F
g
in relation
to the quantity of all items in the transaction database."
Result.
The total item count local share for item
F
in itemset
f
D; F
g
is
ticls
(
F;
f
D; F
g
) = (
lisc
(
F;
f
D; F
g
)
=tic
)
100 = 10
:
6%.
Denition 8.
The
total item amount local share
for a particular item in a par-
ticular itemset is the ratio of the local itemset amount to the total item amount
(i.e., the sum of the global item amounts for all items purchased in all trans-
actions), expressed as a percentage, denoted as
tials
(
i; x
), where
tials
(
i; x
) =
(
lisa
v
(
i; x
)
=tia
)
100,
tia
is the total value of all items in the transaction database,
i
2
I
,
x
I
, and
v
2 f
1
;
2
g
.
Query.
\Give the share of the value of item
F
in itemset
f
D; F
g
in relation to
the value of al l items in the transaction database."
Result.
The total item amount local share for item
F
in itemset
f
D; F
g
is
tials
(
F;
f
D; F
g
) = (
lisa
1
(
F;
f
D; F
g
)
=tia
)
100 = 15
:
9%.
Denition 9.
The
total item count global share
for a particular itemset is the ra-
tio of the global itemset count to the total item count, expressed as a percentage,
denoted as
ticgs
(
x
), where
ticgs
(
x
) = (
gisc
(
x
)
=tic
)
100,
x
I
.
Query.
\Give the share of the quantity of al l items in itemset
f
D; F
g
in relation
to the quantity of all items in the transaction database."
Result.
The total item count global share for itemset
f
D; F
g
is
tigcs
(
f
D; F
g
) =
(
gisc
(
f
D; F
g
)
=tic
)
100 = 24
:
2%.
Denition 10.
The
total item amount global share
for a particular itemset is
the ratio of the global itemset amount to the total item amount, expressed as a
percentage, denoted as
tiags
(
x
), where
tiags
(
x
) = (
gisa
(
x
)
=tia
)
100,
x
I
.
Query.
\Give the share of the value of all items in itemset
f
D; F
g
in relation
to the value of all items in the transaction transaction database."
Result.
The total item amount global share for itemset
f
D,F
g
is
tiags
(
f
D; F
g
) =
(
gisa
(
f
D; F
g
)
=tia
)
100 = 28
:
9%.
Denition 11.
The
global itemset count local share
for a particular item in a
particular itemset is the ratio of the local itemset count to the global itemset
count, expressed as a percentage, denoted as
giscls
(
i; x
), where
giscls
(
i; x
) =
(
lisc
(
i; x
)
=gisc
(
x
))
100,
i
2
I
, and
x
I
.
Query.
\Give the share of the quantity of item
A
in itemset
f
A; D
g
in relation
to the quantity of all items in the itemset."
Result.
The global itemset count lo cal share for item
A
in itemset
f
A; D
g
is
giscls
(
A;
f
A; D
g
) = (
lisc
(
A;
f
A; D
g
)
=gisc
(
f
A; D
g
))
100 = 46
:
2%.
Denition 12.
The
global itemset amount local share
for a particular item in a
particular itemset is the ratio of the local itemset amount to the global itemset
amount, expressed as a percentage, denoted as
gisals
(
i; x
), where
gisals
(
i; x
) =
(
lisa
v
(
i; x
)
=gisa
(
x
))
100,
i
2
I
,
x
I
, and
v
2 f
1
;
2
g
.
Query.
\Give the share of the value of item
A
in itemset
f
A; D
g
in relation to
the value of al l items in the itemset."
Result.
The global itemset amount local share for item
A
in itemset
f
A; D
g
is
gisals
(
A;
f
A; D
g
) = (
lisa
1
(
A;
f
A; D
g
)
=gisa
(
f
A; D
g
))
100 = 21
:
3%.
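The six share measures of Definitions 7 to 12 reduce to ratios over the quantities already defined. Continuing the sketch (the printed values match the query results above):

```python
# Database-wide totals used as denominators.
tic = sum(gic(i) for i in price)             # 66 units in Table 1
tia = sum(gic(i) * price[i] for i in price)  # 329.25 in value

def ticls(i, x):  return 100.0 * lisc(i, x) / tic    # Definition 7
def tials(i, x):  return 100.0 * lisa(i, x) / tia    # Definition 8
def ticgs(x):     return 100.0 * gisc(x) / tic       # Definition 9
def tiags(x):     return 100.0 * gisa(x) / tia       # Definition 10
def giscls(i, x): return 100.0 * lisc(i, x) / gisc(x)  # Definition 11
def gisals(i, x): return 100.0 * lisa(i, x) / gisa(x)  # Definition 12

print(round(ticls("F", {"D", "F"}), 1))   # 10.6
print(round(ticgs({"D", "F"}), 1))        # 24.2
print(round(giscls("A", {"A", "D"}), 1))  # 46.2
```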
2.3 Frequent Itemsets

A frequent itemset was previously defined as an itemset whose support is greater than some user-specified minimum [2]. We now define frequent itemsets as used in the share-confidence framework.
Denition 13.
An itemset is
locally frequent
if there is an item in the itemset
such that at least one of the following conditions holds:
1. The total item count lo cal share is greater than some user-sp ecied mini-
mum. That is,
ticls
(
i
k
; x
)
minshare
1
, where
x
I
,
i
k
2
x
, for some
k
, and
minshare
1
is the user-specied minimum share.
2. The total item amount local share is greater than some user-specied min-
imum. That is,
tials
(
i
k
; x
)
minshare
2
, where
x
I
,
i
k
2
x
, for some
k
,
and
minshare
2
is the user-specied minimum share.
Query.
\Give the frequent 2-itemsets whose local share for at least one item is
at least 8%."
Result. The locally frequent 2-itemsets are shown in Table 3. In Table 3, the Itemset column describes the items in the itemset, the TIDs column describes the transaction identifiers that contain the corresponding itemset, the $ticls(i_1, x)$ and $ticls(i_2, x)$ columns describe the total item count local share for items one and two, respectively, and the $tials(i_1, x)$ and $tials(i_2, x)$ columns describe the total item amount local share for items one and two, respectively.
Table 3. Locally frequent 2-itemsets

Itemset  TIDs              ticls(i1,x) (%)  ticls(i2,x) (%)  tials(i1,x) (%)  tials(i2,x) (%)
{A, D}   T1, T7, T8         9.09            10.6              2.73            10.1
{B, E}   T2, T6, T9, T10   16.67             7.58             7.52            15.19
{B, C}   T3, T6, T10       12.12             7.58             5.47             7.59
{C, E}   T4, T6, T10        9.09             4.55             9.11             9.11
{C, F}   T4, T11            9.09             6.06             9.11             9.11
{E, F}   T4, T7, T9         7.58            10.6             15.19            15.95
{D, F}   T5, T7, T8        13.64            10.6             12.98            15.95
{A, F}   T7, T8             7.58             9.09             2.27            13.67
{D, E}   T7                 1.52             6.06             1.44            12.15
Denition 14.
An itemset is
globally frequent
if every item in the itemset is
locally frequent.
Query.
\Give the frequent 2-itemsets whose local share for all items is at least
8%."
Result.
The globally frequent 2-itemsets are shown in Table 4. The columns in
Table 4 have the same meaning as in Table 3.
Table 4. Globally frequent 2-itemsets

Itemset  TIDs           ticls(i1,x) (%)  ticls(i2,x) (%)  tials(i1,x) (%)  tials(i2,x) (%)
{A, D}   T1, T7, T8      9.09            10.6              2.73            10.1
{C, E}   T4, T6, T10     9.09             4.55             9.11             9.11
{C, F}   T4, T11         9.09             6.06             9.11             9.11
{E, F}   T4, T7, T9      7.58            10.6             15.19            15.95
{D, F}   T5, T7, T8     13.64            10.6             12.98            15.95
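Definitions 13 and 14 translate into simple predicates over the share measures. The sketch below continues the running example; note that the globally_frequent predicate encodes our reading of Definition 14, chosen because it reproduces Table 4 (for instance, it excludes {B, E}, where neither the count shares of both items nor the amount shares of both items clear the 8% threshold).

```python
def item_locally_frequent(i, x, minshare1=8.0, minshare2=8.0):
    """Item i of itemset x clears the count or the amount threshold (Definition 13)."""
    return ticls(i, x) >= minshare1 or tials(i, x) >= minshare2

def locally_frequent(x, **kw):
    """Some item of x is locally frequent (Definition 13)."""
    return any(item_locally_frequent(i, x, **kw) for i in x)

def globally_frequent(x, minshare1=8.0, minshare2=8.0):
    """Our reading of Definition 14, matching Table 4: every item clears the
    count threshold, or every item clears the amount threshold."""
    return (all(ticls(i, x) >= minshare1 for i in x)
            or all(tials(i, x) >= minshare2 for i in x))

print(locally_frequent({"B", "E"}))   # True  -- {B, E} appears in Table 3
print(globally_frequent({"B", "E"}))  # False -- {B, E} is absent from Table 4
```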
2.4 Confidence

Confidence in an association rule $X \Rightarrow Y$ was previously defined as the ratio of the number of transactions containing itemset $X \cup Y$ to the number of transactions containing itemset $X$ [2]. We now define confidence as used in the share-confidence framework.
Denition 15.
The
count condence
in an association rule
X
)
Y
is the ratio
of the sum of the local itemset counts for all items in itemset
X
contained in
X
[
Y
to the global itemset count for itemset
X
, expressed as a percentage,
denoted as
cc
(
x; x
[
y
), where
cc
(
x; x
[
y
) = (
P
lisc
(
i
k
; x
[
y
)
=gisc
(
x
))
100,
x
I
,
x
[
y
I
, and
i
k
2
x
, for all
k
.
Query.
\Give the count condence for the association rule
f
B; C
g ) f
E
g
."
Result.
The count condence for the association rule
f
B; C
g ) f
E
g
is
cc
(
f
B;
C
g
;
f
B; C; E
g
) = ((
lisc
(
B;
f
B; C; E
g
)+
lisc
(
C;
f
B; C; E
g
))
=gisc
(
f
B; C
g
))
100 =
76
:
9%.
Denition 16.
The
amount condence
in an association rule
X
)
Y
is the ratio
of the sum of the lo cal itemset amounts for all items in itemset
X
contained in
X
[
Y
to the global itemset amount for itemset
X
, expressed as a percentage,
denoted as
ac
(
x; x
[
y
), where
ac
(
x; x
[
y
) = (
P
lisa
v
(
i
k
; x
[
y
)
=gisa
(
x
))
100,
x
I
,
x
[
y
I
,
i
k
2
x
, for all
k
, and
v
2 f
1
;
2
g
.
Query.
\Give the amount condence for the association rule
f
B; C
g ) f
E
g
."
Result.
The amount condence for the association rule
f
B; C
g ) f
E
g
is
ac
(
f
B; C
g
;
f
B; C; E
g
) = ((
lisa
2
(
B;
f
B; C; E
g
) +
lisa
2
(
C;
f
B; C; E
g
))
=gisa
(
f
B;
C
g
))
100 = 59
:
9%.
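Definitions 15 and 16 follow the same pattern in the running sketch; only the count-confidence value is checked against the worked example above.

```python
def cc(x, y):
    """Count confidence of the rule x => y (Definition 15)."""
    xy = frozenset(x) | frozenset(y)
    return 100.0 * sum(lisc(i, xy) for i in x) / gisc(x)

def ac(x, y):
    """Amount confidence of the rule x => y (Definition 16), using lisa_2-style amounts."""
    xy = frozenset(x) | frozenset(y)
    return 100.0 * sum(lisa(i, xy) for i in x) / gisa(x)

print(round(cc({"B", "C"}, {"E"}), 1))  # 76.9, matching the worked example
```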
3 Characterized Itemsets

3.1 Example

We now present an example to demonstrate the CI algorithm and describe the primary data structures. In this example, let $L_k$ and $C_k$ denote the set of frequent itemsets from pass $k$ and the set of candidate itemsets from pass $k$, respectively, and let $R$ denote the relation containing the characterized itemsets. Each element of $L_k$ and $C_k$ contains three attributes: the itemset, the total item count local share, and the total item amount local share. Each element of $R$ contains one attribute for each characteristic of interest and an attribute containing a list of all frequent itemsets sharing the corresponding characteristic attributes.

Assume we are given the transaction database shown in Table 5. Also assume the user-specified minimum share is 15%. In Table 5, the column descriptions have the same meaning as the like-named columns in Table 1. Our task is to trace through the first three passes of CI to generate and store the characterized itemsets in $R$. For this example, we consider only the total item count local share to determine whether an itemset is frequent.
Table 5. A smaller example transaction database with counts

TID     A  B  C  D  E
T1      1  2  5  0  0
T2      4  1  1  3  2
T3      3  0  2  1  0
T4      5  0  4  2  1
T5      2  3  3  4  0
Total  15  6 15 10  3
After the rst pass,
CI
generates
L
1
and
R
as shown in Tables 6 and 7,
respectively. In Table 6, the
Itemset
column describ es the items in each itemset
and the
Share
column describes the total item count local share. In Table 7,
the
Char. 1
and
Char. 2
columns describ e the characteristics retrieved from the
external database(s), and the
TIDs
column describes the transactions that share
the corresponding characteristics (the TIDs are not actually stored in
R
and are
merely shown here for reader convenience). The domain of the rst and second
characteristic is
f
R; S
g
and
f
X; Y; Z
g
, respectively.
After the second pass, CI generates $L_2$ and updates $R$ as shown in Tables 8 and 9, respectively. In Tables 8 and 9, the column descriptions have the same meaning as the like-named columns in Table 6 and Table 7, respectively.
Table 6. Frequent itemsets contained in L1

Itemset  Share (%)
{A}      30.6
{C}      30.6
{D}      20.4
Table 7. R after the first pass

Char. 1  Char. 2  TIDs
R        X        T1, T4
S        Y        T2, T5
S        Z        T3
Also, in Table 9, the Itemsets column describes the frequent itemsets from the previous pass that share the identified characteristics.
Table 8. Frequent itemsets contained in L2

Itemset  Share (%)
{A, C}   61.2
{A, D}   49.0
{A, E}   24.5
After the third pass, CI generates $L_3$ and updates $R$ as shown in Tables 10 and 11, respectively. In Tables 10 and 11, the column descriptions have the same meaning as the like-named columns in Table 8 and Table 9, respectively.
The characterized itemsets in $R$, generated by CI, form a relation. In a relation, transforming a specific data description into a more general one is called generalization. Several algorithms have been proposed for finding generalized itemsets where concept hierarchies are used to classify items [10, 1]. Our approach differs from these in that we use concept hierarchies to classify the characteristic attributes. Fast and efficient implementations of AOG [7, 6, 13] are used to generate summaries where the characteristic attributes are generalized according to the concept hierarchies. If the concept hierarchies have relatively few levels (i.e., fewer than 10), and if multiple hierarchies are available for some attributes, the AllGen algorithm [12] is used to generate all possible summaries.
3.2 The CI Algorithm

In the description of CI that follows, $L_k$, $C_k$, and $R$ have the same meaning as in the example of the previous section. The $k$-th pass of the algorithm works as follows:

1. Repeat steps 2 to 5 until no new candidate itemsets are generated in pass $k-1$.
Table 9. R after the second pass

Char. 1  Char. 2  TIDs    Itemsets
R        X        T1, T4  ⟨{A}, 6⟩, ⟨{C}, 9⟩, ⟨{D}, 2⟩
S        Y        T2, T5  ⟨{A}, 6⟩, ⟨{C}, 4⟩, ⟨{D}, 7⟩
S        Z        T3      ⟨{A}, 3⟩, ⟨{C}, 2⟩, ⟨{D}, 1⟩
Table 10. Frequent itemsets contained in L3

Itemset    Share (%)
{A, C, D}  69.4
{A, C, E}  34.7
2. Generate the candidate $k$-itemsets in $C_k$ from the frequent $(k-1)$-itemsets in $L_{k-1}$ using the Apriori method described in [2, 5].
3. Partition the frequent $(k-1)$-itemsets in $L_{k-1}$ and update the candidate itemsets in $C_k$.
   a. Repeat steps 3-b to 3-f until there are no more transactions to be retrieved from the database.
   b. Retrieve the next transaction from the database.
   c. Retrieve the corresponding characteristic tuple from $R$.
   d. For each $(k-1)$-itemset in the transaction, if it is contained in $L_{k-1}$, update the characteristic tuple.
      (i) If itemset summary attributes already exist for this $(k-1)$-itemset in the characteristic tuple, go to step 3-d-ii. Otherwise, create new itemset summary attributes in the characteristic tuple.
      (ii) Increment the total quantity and total value attributes for this $(k-1)$-itemset in the characteristic tuple.
   e. If the characteristic tuple has been updated, save it in $R$.
   f. For each $k$-itemset in the transaction, if it is contained in $C_k$, increment the associated total quantity and total value attributes.
4. Save the frequent $k$-itemsets in $L_k$.
   a. Repeat steps 4-b and 4-c until there are no more itemset tuples in $C_k$.
   b. Retrieve the next itemset tuple from $C_k$.
   c. If the share of this itemset tuple is greater than the minimum specified, copy the itemset tuple to $L_k$.
5. Delete $C_k$.
6. Save $R$.
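A minimal Python sketch of one such pass (steps 2 to 4) is given below. This is a schematic outline under our own assumptions, not the authors' implementation: it reuses apriori_gen and price from the earlier sketches, represents itemsets as sorted tuples, and uses the hypothetical mapping characteristics[tid] to stand in for the lookup of the characteristic tuple in the external database.

```python
def ci_pass(k, L_prev, transactions, characteristics, R, minshare):
    """One schematic pass of CI: generate candidate k-itemsets from the
    frequent (k-1)-itemsets, partition the (k-1)-itemsets by characteristic
    tuple while scanning, and keep candidates whose count share clears
    the user-specified minimum."""
    C_k = {c: [0, 0.0] for c in apriori_gen(L_prev, k)}      # step 2
    total_units = sum(sum(b.values()) for b in transactions.values())
    for tid, basket in transactions.items():                 # steps 3-a, 3-b
        char = characteristics[tid]                          # step 3-c
        for itemset in L_prev:                               # step 3-d
            if all(basket.get(i, 0) > 0 for i in itemset):
                entry = R.setdefault(char, {}).setdefault(itemset, [0, 0.0])
                entry[0] += sum(basket[i] for i in itemset)             # quantity
                entry[1] += sum(basket[i] * price[i] for i in itemset)  # value
        for cand in C_k:                                     # step 3-f
            if all(basket.get(i, 0) > 0 for i in cand):
                C_k[cand][0] += sum(basket[i] for i in cand)
                C_k[cand][1] += sum(basket[i] * price[i] for i in cand)
    # step 4: keep the candidates whose count share clears minshare
    return {c for c, (qty, _) in C_k.items()
            if 100.0 * qty / total_units >= minshare}
```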
The rst pass of the algorithm is a special pass which generates the frequent
1-itemsets and the characteristic relation, as follows:
1. Generate the candidate 1-itemsets in
C
1
and the characteristic relation
R
.
a.
Repeat steps
1-b
to
1-f
until there are no more transactions to be retrieved
from the database.
b.
Retrieve the next transaction from the database.
Table 11. R after the third pass

Char. 1  Char. 2  TIDs    Itemsets
R        X        T1, T4  ⟨{A}, 6⟩, ⟨{C}, 9⟩, ⟨{D}, 2⟩, ⟨{A, C}, 15⟩, ⟨{A, D}, 7⟩, ⟨{A, E}, 6⟩
S        Y        T2, T5  ⟨{A}, 6⟩, ⟨{C}, 4⟩, ⟨{D}, 7⟩, ⟨{A, C}, 10⟩, ⟨{A, D}, 13⟩, ⟨{A, E}, 6⟩
S        Z        T3      ⟨{A}, 3⟩, ⟨{C}, 2⟩, ⟨{D}, 1⟩, ⟨{A, C}, 5⟩, ⟨{A, D}, 4⟩
   c. For each 1-itemset in the transaction, if an itemset tuple already exists in $C_1$, go to step 1-d. Otherwise, create a new itemset tuple in $C_1$.
   d. For each 1-itemset in the transaction, increment the total quantity and total value attributes of the associated itemset tuple in $C_1$.
   e. Using the appropriate key(s), retrieve the characterizing attributes for this transaction from the external database(s).
   f. If a characteristic tuple containing these characteristics already exists in $R$, go to step 1-b. Otherwise, create a new characteristic tuple in $R$.
2. Save the frequent 1-itemsets in $L_1$.
   a. Repeat steps 2-b and 2-c until there are no more itemset tuples in $C_1$.
   b. Retrieve the next itemset tuple from $C_1$.
   c. If the share of this itemset tuple is greater than the minimum specified, copy the itemset tuple to $L_1$.
3. Delete $C_1$.
4. Save $R$.
The running time and space requirements of CI are $O(|c| \cdot |t|)$ and $O(|s|)$, respectively, where $|c|$ is the number of candidate itemsets in all iterations of the algorithm, $|t|$ is the number of transactions, and $|s|$ is the size of the largest candidate itemset in any pass.
4 Experimental Results

We ran all of our experiments on an IBM AT-compatible personal computer with a Pentium P166 processor and 64 MB of memory, running Windows NT Workstation version 4.0. Input data was from a large database supplied by a commercial partner in the telecommunications industry. The database contained approximately 3.3 million tuples representing account activity for over 500 thousand customer accounts and 2200 unique items (identified by integers in the range [1...2200]). Each tuple is either an equipment rental or service transaction containing the number of items and the cost of each item. An itemset was considered to be frequent if at least one of the following three conditions held: the minimum support was greater than 0.25%, the total item count global share was greater than 0.25%, or the total item amount global share was greater than 0.25%.
The 20 most frequent 1-itemsets ranked by support, total item count global share, and total item amount global share are shown in Figures 1, 2, and 3, respectively. In Figures 1 to 3, the first row of bars (i.e., those at the front of the graph) corresponds to the total item amount global share (i.e., value), the second row corresponds to the total item count global share (i.e., quantity), and the third row corresponds to the support. The height of each bar corresponds to the percentage of share or support for the associated 1-itemset. There were 109 frequent 1-itemsets discovered.

Figure 1 shows that support over-represents the actual frequency with which a 1-itemset is purchased, in terms of both the quantity and value of the purchases. The support for the most frequent 1-itemset is approximately 25%, yet this itemset represents only approximately 5% of the total quantity of items purchased and only approximately 2% of the total value of items purchased. The ranking of these same 1-itemsets by total item count global share is similar to that of support, but the ranking by total item amount global share shows significant variation from both support and total item count global share.
[Figure omitted: bar chart of percentage of total versus assigned itemset IDs, with series for global share (value), global share (quantity), and support.]

Fig. 1. 20 most frequent 1-itemsets ranked by support
Figure 2 shows that 14 of the frequent 1-itemsets that were ranked highest by support (i.e., those identified by integers less than or equal to 20) also appear in the 20 most frequent 1-itemsets ranked by total item count global share. The remaining six 1-itemsets (i.e., 101, 81, 25, 107, 100, 34) are shown to have a higher ranking when ranked by total item count global share. The 1-itemsets that include items 100, 101, and 107 are especially noteworthy since there were only 109 frequent 1-itemsets ranked. The support measure considers these items to be among the least important, yet when ranked by total item count global share, they are ranked eleventh, first, and eighth, respectively.

Figure 3 shows that nine of the frequent 1-itemsets that were ranked highest by support also appear in the 20 most frequent 1-itemsets ranked by total item amount global share. It also shows that nine of the most frequent 1-itemsets which were ranked in the bottom 50% by support are shown to be among the 20 most frequent when ranked by total item amount global share.
Similar results to those shown in Figures 1 to 3 were obtained when ranking k-itemsets. We present the results for 2-itemsets, shown in Table 12. Table 12 shows three sets of rankings for 2-itemsets, where each set contains three columns.
[Figure omitted: bar chart of percentage of total versus assigned itemset IDs, with series for global share (value), global share (quantity), and support.]

Fig. 2. 20 most frequent 1-itemsets ranked by total item count global share
[Figure omitted: bar chart of percentage of total versus assigned itemset IDs, with series for global share (value), global share (quantity), and support.]

Fig. 3. 20 most frequent 1-itemsets ranked by total item amount global share
In Table 12, the Support, Share (Quantity), and Share (Value) columns describe 10 itemsets ranked by support, total item count global share, and total item amount global share, respectively. In the first set, the first column shows the 10 most frequent 2-itemsets ranked by support; the second and third columns show the corresponding ranks for these itemsets when ranked by total item count and total item amount global share, respectively. In the second set, the second column shows the 10 most frequent 2-itemsets ranked by total item count global share; the first and third columns show the corresponding ranks by support and total item amount global share, respectively. In the third set, the third column shows the 10 most frequent 2-itemsets ranked by total item amount global share; the first and second columns show the corresponding ranks by support and total item count global share. There were 351 frequent 2-itemsets.

The 2-itemset ranked as most frequent by support (refer to the first set) and by total item amount global share was ranked fourth by total item count global share. While this itemset does not represent the most frequent itemset sold in terms of the quantity of items, it was purchased in the greatest number of transactions and had the highest gross income of all 2-itemsets.
Table 12. 2-itemsets ranked by support and share

        Set 1 Rankings                 Set 2 Rankings                 Set 3 Rankings
Support  Share     Share      Support  Share     Share      Support  Share     Share
         (Quantity) (Value)            (Quantity) (Value)            (Quantity) (Value)
   1        4         1         306       1        18          1        4         1
   2       13         3         341       2        38        293        8         2
   3       17         9         324       3        27          2       13         3
   4       19        12           1       4         1        305       45         4
   5       20         5         294       5        23          5       20         5
   6       22        11         316       6        32        288      121         6
   7       27        28         291       7        24         75       80         7
   8       35        33         293       8         2        287      206         8
   9       47        59         307       9        29          3       17         9
  10       41       109         301      10        31        336      350        10
In contrast, the 2-itemset ranked tenth by support was ranked 41st by total item count global share and 109th by total item amount global share. This itemset is ranked highly by support, yet its contribution to gross income is comparatively low.

The 2-itemset ranked as most frequent by total item count global share (refer to the second set) was ranked 306th by support. This is an itemset whose items are typically purchased in multiples; consequently, it is purchased more frequently than support seems to indicate. Similarly, 13 of the 15 2-itemsets ranked highest by total item count global share are ranked below 291 by support.

The 2-itemset ranked tenth by total item amount global share (refer to the third set) was ranked 336th by support and 350th by total item count global share. The items in this itemset are relatively expensive; consequently, although not purchased as frequently as many other items, its contribution to gross income is comparatively high.
5 Conclusion

We have introduced the share-confidence framework for knowledge discovery from databases, which classifies itemsets based upon characteristic attributes extracted from external databases. We suggested how characterized itemsets can be generalized according to concept hierarchies associated with the characteristic attributes. Experimental results demonstrated that the share-confidence framework can give more informative feedback than the support-confidence framework.
References

1. R. Agrawal, K. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Databases (VLDB'95), Zurich, Switzerland, September 1995.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307-328, Menlo Park, CA, 1996. AAAI Press/MIT Press.
3. R. Agrawal and J.C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962-969, December 1996.
4. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), pages 265-276, May 1997.
5. S. Brin, R. Motwani, J.D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), pages 255-264, May 1997.
6. C.L. Carter and H.J. Hamilton. Efficient attribute-oriented algorithms for knowledge discovery from large databases. IEEE Transactions on Knowledge and Data Engineering. To appear.
7. C.L. Carter and H.J. Hamilton. Performance evaluation of attribute-oriented algorithms for knowledge discovery from databases. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (ICTAI'95), pages 486-489, Washington, D.C., November 1995.
8. C.L. Carter, H.J. Hamilton, and N. Cercone. Share-based measures for itemsets. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 14-24, Trondheim, Norway, June 1997.
9. D.W. Cheung, A.W. Fu, and J. Han. Knowledge discovery in databases: a rule-based attribute-oriented approach. In Lecture Notes in Artificial Intelligence, The 8th International Symposium on Methodologies for Intelligent Systems (ISMIS'94), pages 164-173, Charlotte, North Carolina, 1994.
10. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on Very Large Data Bases (VLDB'95), pages 420-431, September 1995.
11. J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.
12. R.J. Hilderman, H.J. Hamilton, R.J. Kowalchuk, and N. Cercone. Parallel knowledge discovery using domain generalization graphs. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 25-35, Trondheim, Norway, June 1997.
13. H.-Y. Hwang and W.-C. Fu. Efficient algorithms for attribute-oriented induction. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 168-173, Montreal, August 1995.
14. J.S. Park, M.-S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'95), pages 175-186, May 1995.
... A high-utility itemset mining model was defined by Yao, Hamilton and Butz [13]. It is a generalization of the share-mining model [3, 4]. The goal of high utility itemset mining process is to find all itemsets that give utility greater or equal to the user specified threshold. ...
... We start with conservative (high) thresholds and lower them as long as we are not contended with the number of itemsets found. Algorithm 2P-UF Input: -database DB -constraints minUtil and minSup Output: -all utility-frequent itemsets /* Phase 1: find all quasi utility-frequent itemsets */ [1] CandidateSet = QUF-APriori(DB, minUtil, minSup) /* Phase 2: prune utility-infrequent itemsets */ [2] foreach c in CandidateSet: [3] foreach T in DB: [4] if c in T and u(c,T) >= minUtil: [5] c.count += 1 [6] return {c in CandidateSet | c.count >= minSup} 3 A Fast Algorithm for Mining Utility-Frequent Itemsets 2P-UF utility-frequent itemset mining algorithm described in Section 2 is proven to find all utility-frequent itemsets. However, due to the monotone property of quasi support measure it has a few disadvantages which render it unusable for mining of large datasets. ...
... transactions) for all candidates and, with this information, utilities of all candidates.Figure 2 shows pseudo code of our FUFM algorithm. Algorithm FUFM Input: -database DB -constraints minUtil and minSup Output: -all utility-frequent itemsets [1] L = 1 [2] find the set of candidates of length L with support >= minSup [3] compute exteded support for all candidates and output utilityfrequent itemsets [4] L += 1 [5] use the frequent itemset mining algorithm to obtain new set of frequent candidates of length L from the old set of frequent candidates [6] stop if the new set is empty otherwise go to [3] Clearly, FUFM algorithm does not have disadvantages and inefficiencies of the 2P-UF algorithm as its generation phase (step 5 onFig. 2) is based on frequent itemset mining methods. ...
Article
Full-text available
Utility-based data mining is a new research area interested in all types of utility factors in data mining processes and targeted at in-corporating utility considerations in both predictive and descriptive data mining tasks. High utility itemset mining is a research area of utility-based descriptive data mining, aimed at finding itemsets that contribute most to the total utility. A specialized form of high utility itemset min-ing is utility-frequent itemset mining, which – in addition to subjectively defined utility – also takes into account itemset frequencies. This paper presents a novel efficient algorithm FUFM (Fast Utility-Frequent Min-ing) which finds all utility-frequent itemsets within the given utility and support constraints threshold. It is faster and simpler than the original 2P-UF algorithm (2 Phase Utility-Frequent), as it is based on efficient methods for frequent itemset mining. Experimental evaluation on artifi-cial datasets show that, in contrast with 2P-UF, our algorithm can also be applied to mine large databases.
... Utility mining [41] emerged recently to address the limitation of frequent pattern mining by considering the user's expectation or goal as well as the raw data. Utility mining with the itemset share framework [19], [39], [40], for example, discovering combinations of products with high profits or revenues, is much harder than other categories of utility mining problems, for example, weighted itemset mining [10], [25], [30] and objective-oriented utility-based association mining [11], [35]. Concretely, the interestingness measures in the latter categories observe an anti-monotonicity property, that is, a superset of an uninteresting pattern is also uninteresting. ...
... Below, we discuss three categories in detail. Hilderman et al. [19] proposed the itemset share framework that takes into account the weights both on attributes, for example, the price of a product, and on attribute-value pairs, for example, the quantity of a product in a shopping basket. Then, support and confidence measures can be generalized based on count-shares as well as on amount-shares. ...
... r namely knowledge from the data set of huge amount of data. Data gathered from practical environment is often incomplete, noisy, ambiguous, which requires a data preprocessing step to make data well-structured and normalized, that is easily to be processed by formularized algorithm. There are many successful applications of data mining technology. [6] Applied a association rule based data mining algorithm to find interesting pattern in market transaction data. [7] Proposed a data mining framework in field of expert system design. They applied different data mining algorithm to score each subscriber in some concerned properties and made a comparison. [8] Investigated different cluster ...
... A data preprocessing procedure is launched to get a clean and normalized data set for model construction. Data preprocessing [6, 13] stands for processing large volume of incomplete, noisy and inconsistent data gathering from practical environment, including data summarization, data transformation, data integration, data reduction and data cleaning. Data cleaning can be used to remove noise in the data set, fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies; data integration aims at merging multiple sources to form a unified database; data transformation is for normalization of data; data reduction can reduce the overall size of data set by gathering, removing redundant features or clustering data samples. ...
Article
Smaller time loss and smoother communication pattern is the urgent pursuit in the software development enterprise. However, communication is difficult to control and manage and demands on technical support, due to the uncertainty and complex structure of data appeared in communication. Data mining is a well established framework aiming at intelligently discovering knowledge and principles hidden in massive amounts of original data. Data mining technology together with shared repositories results in an intelligent way to analyze data of communication in software development environment. We propose a data mining based algorithm to tackle the problem, adopting a co-training styled algorithm to discover pattern in software development environment. Decision tree is trained as based learners and a majority voting procedure is then launched to determine labels of unlabeled data. Based learners are then trained again with newly labeled data and such iteration stops when a consistent state is reached. Our method is naturally semi-supervised which can improve generalization ability by making use of unlabeled data. Experimental results on data set gathered from productive environment indicate that the proposed algorithm is effective and outperforms traditional supervised algorithms.
... However, a user's interest may relate to many factors that can not be expressed in terms of the occurrence frequency . Utility mining [19] emerged recently to address the limitation of frequent itemset mining by considering the user's expectation or goal as well as the raw data, which can be further categorized into utility mining with the itemset share framework [10][18], weighted itemset mining [14], and objective-oriented utility-based association mining [15][6] . Utility mining with the itemset share framework can be explained by the following example. ...
... Yao et al. [18] and Hilderman et al. [10] proposed the itemset share framework that takes into account weights on both attributes and attribute-value pairs. This paper falls into the same category where no anti-monotone property holds with the interestingness measure. ...
Conference Paper
Utility mining emerged recently to address the limitation of frequent itemset mining by introducing interestingness measures that reflect both the statistical significance and the user's expectation. Among utility mining problems, utility mining with the itemset share framework is a hard one as no anti-monotone property holds with the interestingness measure. The state-of-the-art works on this problem all employ a two-phase, candidate generation approach, which suffers from the scalability issue due to the huge number of candidates. This paper proposes a high utility itemset growth approach that works in a single phase without generating candidates. Our basic approach is to enumerate itemsets by prefix extensions, to prune search space by utility upper bounding, and to maintain original utility information in the mining process by a novel data structure. Such a data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility itemsets in an efficient and scalable way. We further enhance the efficiency significantly by introducing recursive irrelevant item filtering with sparse data, and a lookahead strategy with dense data. Extensive experiments on sparse and dense, synthetic and real data suggest that our algorithm outperforms the state-of-the-art algorithms over one order of magnitude.
... The concept of HUIM was based on the problem of frequent itemset mining (FIM) and was first presented in [9]. An itemset is a HUI if it has utility value not less than a userspecified threshold. ...
Article
Full-text available
High-utility itemset mining (HUIM) has become a key phase of the pattern mining process, which has wide applications, related to both quantities and profits of items. Many algorithms have been proposed to mine high-utility itemsets (HUIs). Since these algorithms often return a large number of discovered patterns, a more compact and lossless representation has been proposed. The recently proposed closed high utility itemset mining (CHUIM) algorithms were designed to work with certain types of databases (e.g., those without probabilities). In fact, real-world databases might contain items or itemsets associated with probability values. To effectively mine frequent patterns from uncertain databases, several techniques have been developed, but there does not exist any method for mining CHUIs from this type of databases. This work presents a novel and efficient method without generating candidates, named CPHUI-List, to mine closed potential high-utility itemsets (CPHUIs) from uncertain databases. The proposed algorithm is DFS-based and utilizes the downward closure property of high transaction-weighted probabilistic mining to prune non-CPHUIs. It can be seen from the experiment evaluations that the proposed algorithm has better execution time and memory usage than the CHUI-Miner.
... In their paper they introduced a writing audit of the present situation with research and different algorithms for high utility mining., Vol.10, No.3, 2017DOI: 10.22266/ijies2017.0630.21 R. J. Hilderman et al.[17]proposed the sharecondense structure for information disclosure from databases that approaches the issue of mining object sets from market basket information and introduced another method for arranging objectsets according to characteristics features separated from registration or life-style information.At last they displayed experimental comes that exhibit the utility of the sharecon dence system. From above mentioned literature survey we can make observation that High Utility Pattern mining problem is closely related to frequent pattern mining, including constraint-based mining. ...
Article
Proficient Algorithms for Mining High Utility Itemset in One Phase utilizing Direct Discovery of High Utility Patterns refers to the disclosure of object/item sets with high utility. The situation may end up being more terrible when the database contains long high utility itemsets.The information of high utility itemsets is kept up in a tree-based data structure UP-Tree to such a degree, to the point that required itemsets can be created productivity with only two yields of database. The execution of existing algorithm issue all utilize a two-stage, applicant era approach with an exemption which is however neither efficient nor versatile in case of huge databases. This twostage approach experiences adaptability issue because of the enormous number of matching itemsets. This paper proposes a novel calculation that finds high utility examples in a solitary stage without creating candidates. Our straight information structure empowers us to process a tight technique for pruning and to specifically distinguish high utility examples in an efficient and versatile way, which focuses on the main driver with earlier technique. Test comes to conclusion that the proposed computations, especially Direct Discovery of High Utility Pattern reduce the amount of hopefuls satisfactorily and additionally beat diverse figuring extensively to the extent runtime, especially when databases contain clusters of long trades.
... An itemset with utility value greater than the minimum threshold utility as specified by the user depending upon his context of usage is called as the high utility itemset [11]. A well known model for mining such high-utility itemset was defined by Yao et al which is a generalization of the share-mining model [2, 3]. The following is the set of definitions given by Yao et al and is illustrated by a simple example. ...
Article
Full-text available
The main goals of Association Rule Mining (ARM) are to find all frequent itemsets and to build rules based of frequent itemsets. But a frequent itemset only reproduces the statistical correlation between items, and it does not reflect the semantic importance of the items. To overcome this limitation we go for a utility based itemset mining approach. Utility-based data mining is a broad topic that covers all aspects of economic utility in data mining. It takes in predictive and descriptive methods for data mining. High utility itemset mining is a research area of utility based descriptive data mining, aimed at finding itemsets that contribute most to the total utility. The well known faster and simpler algorithm for mining high utility itemsets from large transaction databases is Fast Utility Mining (FUM). In this proposed system we made a significant improvement in FUM algorithm to make the system faster than FUM. The algorithm is evaluated by applying it to IBM synthetic database. Experimental results show that the proposed algorithm is effective on the databases tested.
... Firstly, these schemes still consider the support of an itemset to measure their importance and secondly, these models do not employ the quantities or prices of items purchased. Several researchers have also proposed itemset share measure, which is the fraction of some numerical value in order to overcome these shortcomings [5,6,8,17,18,19,24,25]. Carter et al. [8] proposed a share-confidence model to discover association rule among numerical attributes which are associated with items in a transaction. ...
Article
Full-text available
Traditional association rule mining based on the support-confidence framework provides the objective measure of the rules that are of interest to users. However, it does not reflect the utility of the rules. To extract non-redundant association rules in support-confidence framework frequent closed itemsets and their generators play an important role. To extract non-redundant association rules among high utility itemsets, high utility closed itemsets (HUCI) and their generators should be extracted in order to apply traditional support-confidence framework. However, no efficient method exists at present for mining HUCIs with their generators. This paper addresses this issue. A post-processing algorithm, called the HUCI-Miner, is proposed to mine HUCIs with their generators. The proposed algorithm is implemented using both synthetic and real datasets.
Chapter
Full-text available
Traditional sequential patterns do not take into account contextual infor- mation associated with sequential data. For instance, when studying purchases of customers in a shop, a sequential pattern could be “frequently, customers buy prod- ucts A and B at the same time, and then buy product C”. Such a pattern does not consider the age, the gender or the socio-professional category of customers. However, by taking into account contextual information, a decision expert can adapt his/her strategy according to the type of customers. In this paper, we focus on the analysis of a given context (e.g., a category of customers) by extracting context-dependent sequential patterns within this context. For instance, given the context correspond- ing to young customers, we propose to mine patterns of the form “buying products A and B then product C is a general behavior in this population” or “buying products B and D is frequent for young customers only”. We formally define such context-dependent sequential patterns and highlight relevant properties that lead to an efficient extraction algorithm. We conduct our experimental evaluation on real-world data and demonstrate performance issues.
Conference Paper
Full-text available
An attribute-oriented induction has been developed in the previous study of knowledge discovery in databases. A concept tree ascension technique is applied in concept generalization. In this paper, we extend the background knowledge representation from an unconditional non-rule-based concept hierarchy to a rule-based concept hierarchy, which enhances greatly its representation power. An efficient rule-based attribute-oriented induction algorithm is developed to facilitate learning with a rule-based concept graph. An information loss problem which is special to rule-based induction is described together with a solution suggested.
Conference Paper
Full-text available
Practical tools for knowledge discovery from databases must be efficient enough to handle large data sets found in commercial environments. Attribute-oriented induction has proved to be a useful method for knowledge discovery. Three algorithms are AOI, LCHR and GDBR. We have implemented efficient versions of each algorithm and empirically compared them on large commercial data sets. These tests show that GDBR is consistently faster than AOI and LCHR. GDBR's times increase linearly with increased input size, while times for AOI and LCHR increase non-linearly when memory is exceeded. Through better memory management, however, AOI can be improved to provide some advantages
Article
Full-text available
Attribute-oriented induction is a set-oriented database mining method which generalizes the task-relevant subset of data attribute-by-attribute, compresses it into a generalized relation, and extracts from it the general features of data. In this chapter, the power of attribute-oriented induction is explored for the extraction from relational databases of different kinds of patterns, including characteristic rules, discriminant rules, cluster description rules, and multiple-level association rules. Furthermore, it is shown that the method is efficient, robust, with wide applications, and extensible to knowledge discovery in advanced database systems, including object-oriented, deductive, and spatial database systems. The implementation status of DBMiner, a system prototype which applies the method, is also reported here. 16.1 Introduction With an upsurge of the application demands and research activities on knowledge discovery in databases (Matheus, Chan and Piatetsky-Shapiro 1993; Pi...
Conference Paper
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the low-level efficiency of the algorithm. Second, we present a new way of generating "implication rules," which are normalized based on both teh antecedent and the consequent and are truly implications (not simply a measure of co-occurence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results.
Conference Paper
Knowledge discovery in databases, or data mining, is an important issue in the development of data- and knowledge-base systems. An attribute-oriented induction method has been developed for knowledge discovery in databases. The method integrates a machine learning paradigm, especially learning-from-examples techniques, with set-oriented database operations and extracts generalized data from actual data in databases. An attribute-oriented concept tree ascension technique is applied in generalization, which substantially reduces the computational complexity of database learning processes. Different kinds of knowledge rules, including characteristic rules, discrimination rules, quantitative rules, and data evolution regularities can be discovered efficiently using the attribute-oriented approach. In addition to learning in relational databases, the approach can be applied to knowledge discovery in nested relational and deductive databases. Learning can also be performed with databases containing noisy data and exceptional cases using database statistics. Furthermore, the rules discovered can be used to query database knowledge, answer cooperative queries and facilitate semantic query optimization. Based upon these principles, a prototyped database learning system, DBLEARN, has been constructed for experimentation.
Conference Paper
We introduce the measures share, coincidence and dominance as alternatives to the standard itemset methodology measure of support. An itemset is a group of items bought together in a transaction. The support of an itemset is the ratio of transactions containing the itemset to the total number of transactions. The share of an itemset is the ratio of the count of items purchased together to the total count of items in all transactions. The coincidence of an itemset is the ratio of the count of items in that itemset to the total of those same items in the database. The dominance of an item in an itemset specifies the extent to which that item dominates the total of all items in the itemset. Share based measures have the advantage over support of reflecting accurately how many units are being moved by a business. The share measure can be extended to quantify the financial impact of an itemset on the business.
Article
We consider the problem of mining association rules on a shared nothing multiprocessor. We present three algorithms that explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem specific information. The best algorithm exhibits near perfect scaleup behavior, yet requires only minimal overhead compared to the current best serial algorithm