Indonesian Journal of Electrical Engineering and Computer Science
Vol. 3, No. 3, September 2016, pp. 546-553
DOI: 10.11591/ijeecs.v3.i2.pp546-553
Received April 2, 2016; Revised July 25, 2016; Accepted August 10, 2016
Mining Association Rules: A Case Study on Benchmark
Dense Data
Mustafa Man1, Wan Aezwani Wan Abu Bakar2, Zailani Abdullah3, Masila Abd Jalil4,
Tutut Herawan5
1,2,4School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu,
21030 Kuala Terengganu, Terengganu, Malaysia
3Faculty of Entepreneur and Business, Universiti Malaysia Kelantan,
16100 Kota Bharu, Kelantan, Malaysia
5Department of Information Systems, Faculty of Computer Science and Information Technology,
University of Malaya, Lembah Pantai, 50603 Kuala Lumpur, Malaysia
Corresponding author, e-mail: mustafaman@umt.edu.my, beny2194@yahoo.com, zailania@umk.edu.my,
masita@umt.edu.my, tutut@um.edu.my
Abstract
Data mining is the process of discovering knowledge and previously unknown patterns from large amounts of data. Association rule mining (ARM) has become a trend in which newly discovered patterns can be projected into important predictions about an issue. Since the first introduction of frequent itemset mining, it has received major attention among researchers, and various efficient and sophisticated algorithms have been proposed to perform it. Among the best-known algorithms are Apriori and FP-Growth. In this paper, we explore these algorithms and compare their results in generating association rules on benchmark dense datasets. The datasets are taken from the Frequent Itemset Mining (FIMI) data repository. The two algorithms are implemented in Rapid Miner 5.3.007 and their performance results are compared. FP-Growth is found to be the better algorithm under the support-confidence framework.
Keywords: data mining, association rule mining (ARM), frequent pattern mining (FPM), Rapid Miner, Apriori, FP-Growth
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Data mining is the research area in which huge datasets in databases and data repositories are scoured and mined to find novel and useful patterns. Association analysis is one of the four (4) core data mining tasks, besides cluster analysis, predictive modeling and anomaly detection [1]. The task of Association Rule Mining (ARM) is to discover whether frequent itemsets or patterns exist in a database; if they do, the interesting relationships between these frequent itemsets can reveal new patterns for the next step of decision making.
Finding frequent itemsets or patterns (as shown in Figure 1) is a big challenge with a strong and long-standing tradition in data mining. It is a fundamental part of many data mining applications, including market basket analysis, web link analysis, genome analysis and molecular fragment mining [2]. The idea of mining association rules originates from the analysis of market basket data [3]. An example of a simple rule is "A customer who buys bread and butter will also tend to buy milk, with support s% and confidence c%". The applicability of such rules to business problems has made association rules a popular mining method.
The branch of ARM that relates to frequent patterns is called Frequent Pattern Mining (FPM). The state-of-the-art algorithms in FPM are based on either the horizontal or the vertical data format. Most previous frequent mining techniques deal with the horizontal format of their data repositories but suffer from requiring many database scans. However, an emerging trend exists in which some research works focus on the vertical data format, and the rule mining results are quite promising. Apriori [3, 4], which relies on the horizontal format, and FP-Growth [5], which compresses the database into a prefix-tree (FP-tree) structure, are among the best-known algorithms in FPM. Whichever data format is used, both still suffer from huge memory consumption [3-5] on larger datasets.
Figure 1. Frequent Itemset and Its Subset [2]
In this paper, we evaluate the performance and scalability of these two algorithms and compare their results in generating association rules on benchmark dense datasets. The datasets are taken from the Frequent Itemset Mining (FIMI) data repository.
The rest of this paper is organized as follows. Section 2 describes the rudiments of association rules. Section 3 describes the Apriori and FP-Growth algorithms. Section 4 presents the experimental results. Finally, Section 5 concludes this work.
2. Association Rules
Following is the formal definition of the problem in [3]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. An association rule is an implication of the form $X \Rightarrow Y$, where $X$ represents the antecedent part of the rule and $Y$ represents the consequent part, with $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. An itemset that satisfies the minimum support is called a frequent itemset. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$. The support-confidence notions are illustrated below:
a) The support of rule $X \Rightarrow Y$ is the fraction of transactions in $D$ containing both $X$ and $Y$:

$$\mathrm{supp}(X \Rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|D|}$$

where $|D|$ is the total number of records in the database.
b) The confidence of rule $X \Rightarrow Y$ is the fraction of transactions in $D$ containing $X$ that also contain $Y$:

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$
The rules that satisfy both a minimum support threshold (min_supp) and a minimum confidence threshold (min_conf) are called strong rules, where min_supp and min_conf are user-specified values.
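As a worked illustration of the definitions above (a toy example of our own, not one of the benchmark datasets used later), the following Python snippet computes the support and confidence of the rule {bread, butter} => {milk}:

    # Toy transaction database: five market baskets.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"beer", "bread"},
        {"beer", "milk"},
    ]

    def support(itemset, db):
        """Fraction of transactions in `db` containing every item of `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    X, Y = {"bread", "butter"}, {"milk"}
    s = support(X | Y, transactions)        # supp(X => Y) = 1/5 = 0.2
    c = s / support(X, transactions)        # conf(X => Y) = 0.2/0.4 = 0.5
    print(f"support = {s}, confidence = {c}")

With min_supp = 0.2 and min_conf = 0.5, this rule would count as strong; raising either threshold would discard it.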
3. Apriori and FP-Growth Algorithms
Since the introduction of frequent itemset mining by [3], it has received major attention among researchers, and various efficient and sophisticated algorithms have been proposed for it. Among the best-known algorithms are Apriori and FP-Growth.
3.1. Apriori Algorithm
The Apriori algorithm [3, 4] uses a breadth-first search and the downward closure property, in which any superset of an infrequent itemset is also infrequent, to prune the search tree. Apriori usually adopts a horizontal layout to represent the transaction database, and the frequency of an itemset is computed by counting its occurrences across the transactions. Apriori uses a "bottom-up" approach, where frequent itemsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The key idea is the Apriori property (downward closure property), which states that all subsets of a frequent itemset are also frequent. The algorithm involves two steps:
1) Step 1: Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
2) Step 2: Use the frequent itemsets to generate rules (a sketch of this step is given after Section 3.3 below).
3.2. Apriori Pseudocode
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count >= min_support;
end
return ∪k Lk;
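As a concrete illustration of this pseudocode, the following is a minimal, runnable Python sketch of the level-wise procedure (our own simplified version with assumed names, not the Weka W-Apriori operator used later in the paper). It uses absolute support counts (min_count) rather than percentages:

    from collections import defaultdict
    from itertools import combinations

    def apriori(transactions, min_count):
        """Level-wise Apriori: returns {frozenset: support count} for every
        frequent itemset in `transactions` (a list of item sets)."""
        db = [frozenset(t) for t in transactions]
        counts = defaultdict(int)
        for t in db:                                   # first scan: 1-itemsets
            for i in t:
                counts[frozenset([i])] += 1
        Lk = {c: s for c, s in counts.items() if s >= min_count}
        frequent, k = dict(Lk), 1
        while Lk:
            # Candidate generation: self-join L_k, keep (k+1)-itemsets only...
            keys = list(Lk)
            candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}
            # ...then prune every candidate that has an infrequent k-subset.
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k))}
            counts = defaultdict(int)
            for t in db:                               # one database scan per level
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            Lk = {c: s for c, s in counts.items() if s >= min_count}
            frequent.update(Lk)
            k += 1
        return frequent

    print(apriori([{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}], min_count=2))

On this tiny database the pairs {A, B}, {A, C} and {B, C} each reach the threshold of 2, while {A, B, C} (count 1) is pruned.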
3.3. Apriori Principle
1) If an itemset is frequent, then all of its subsets must also be frequent. The Apriori principle holds due to the following property of the support measure:
2) The support of an itemset never exceeds the support of its subsets, i.e. $\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$. For example, $s(\{bread, butter\})$ can never exceed $s(\{bread\})$.
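To illustrate Step 2 of the two-step procedure in Section 3.1, the following sketch derives strong rules from the output of the apriori() sketch above (a mapping from frozensets to support counts); the names are our own, not part of any library:

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        """Derive strong rules X => Y from frequent itemsets.
        `frequent` maps frozenset -> support count (e.g. output of apriori())."""
        rules = []
        for itemset, supp in frequent.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for X in map(frozenset, combinations(itemset, r)):
                    # Every subset X is itself frequent (Apriori principle),
                    # so its support count is available in `frequent`.
                    conf = supp / frequent[X]
                    if conf >= min_conf:
                        rules.append((set(X), set(itemset - X), conf))
        return rules

    freq = apriori([{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}], min_count=2)
    for X, Y, c in generate_rules(freq, min_conf=0.6):
        print(X, "=>", Y, f"(conf = {c:.2f})")

With min_conf = 0.6, each frequent pair yields rules in both directions with confidence 2/3, approximately 0.67.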
3.4. FP-Growth Algorithm
FP-Growth [5-7] employs a divide-and-conquer strategy and an FP-tree data structure to achieve a condensed representation of the transaction database. It has become one of the fastest algorithms for frequent pattern mining. For large databases, it is not always possible to hold the FP-tree in main memory. A strategy to cope with this problem is to first partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases. The steps are as follows:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in descending order of frequency, giving the f-list.
3. Scan the DB again and construct the FP-tree.
The FP-Growth algorithm is given below, and Figures 2, 3 and 4 show the steps in constructing an FP-tree from a transaction database.
3.5. FP-Growth(Tree, α) Pseudocode
if Tree contains a single path P then
    for each combination (denoted as β) of the nodes in the path P do
        generate pattern β ∪ α with support = minimum support of the nodes in β;
else for each αi in the header table of Tree do {
    generate pattern β = αi ∪ α with support = αi.support;
    construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
    if Treeβ ≠ ∅ then
        call FP-Growth(Treeβ, β); }
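As a concrete illustration of this pseudocode, the following is a minimal Python sketch of FP-Growth (our own teaching version with assumed names, not the Weka W-FPGrowth operator used in the experiments). It builds the FP-tree with a header table of node-links, then recursively mines each item's conditional pattern base, using absolute support counts (min_count) rather than percentages:

    from collections import defaultdict

    class FPNode:
        """One FP-tree node: an item, its count, a parent link and child nodes."""
        def __init__(self, item, parent):
            self.item, self.parent, self.count, self.children = item, parent, 0, {}

    def _build_tree(transactions, min_count):
        """transactions: list of (items, count) pairs. Returns the header table,
        the f-list order and per-item supports for this (conditional) database."""
        support = defaultdict(int)
        for items, cnt in transactions:
            for i in set(items):
                support[i] += cnt
        frequent = {i: s for i, s in support.items() if s >= min_count}
        order = sorted(frequent, key=frequent.get, reverse=True)   # the f-list
        rank = {i: r for r, i in enumerate(order)}
        root, header = FPNode(None, None), defaultdict(list)
        for items, cnt in transactions:
            node = root
            for i in sorted((i for i in set(items) if i in frequent), key=rank.get):
                child = node.children.get(i)
                if child is None:
                    child = FPNode(i, node)
                    node.children[i] = child
                    header[i].append(child)        # node-link in the header table
                child.count += cnt
                node = child
        return header, order, frequent

    def fp_growth(transactions, min_count, suffix=()):
        """Yield (pattern, support) for every frequent itemset."""
        header, order, frequent = _build_tree(transactions, min_count)
        for item in reversed(order):               # least frequent item first
            pattern = (item,) + suffix
            yield pattern, frequent[item]
            # Conditional pattern base: prefix path of every node holding `item`.
            cond = []
            for node in header[item]:
                path, p = [], node.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                if path:
                    cond.append((path, node.count))
            yield from fp_growth(cond, min_count, pattern)   # divide and conquer

    db = [("f a c d g i m p".split(), 1), ("a b c f l m o".split(), 1),
          ("b f h j o".split(), 1), ("b c k s p".split(), 1),
          ("a f c e l p m n".split(), 1)]
    for pattern, s in fp_growth(db, min_count=3):
        print(pattern, s)

On these five toy transactions with min_count = 3, the sketch reports, for example, the pattern {c, p} with support 3.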
Figure 2. Construct FP-tree from a Transaction Database [5]
Figure 3. Finding Patterns Having P from P-conditional Database [5]
Figure 4. From Conditional Pattern-bases to Conditional FP-trees [5]
4. Experimental Results
4.1. Experimental Platform and Datasets
All experiments are performed on a DELL Inspiron 620 with an Intel Pentium CPU G630 @ 2.70 GHz and 4 GB RAM, running 64-bit Windows 7. The tool used is Rapid Miner (RM) 5.3.007. The raw benchmark data are retrieved from http://fimi.ua.ac.be/data/ in *.dat file format. For ease of use in RM, we convert them to Comma-Separated Values (CSV) format. For experimentation purposes, some datasets have been modified by removing instances with incomplete data and removing attributes that have only one categorical value. Within RM itself, the data must be transformed before being processed by the specified algorithm; when importing data into RM, we have to specify which columns are set as ID, label, or regular attributes.
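As an aside, the *.dat-to-CSV conversion mentioned above can be sketched in a few lines of Python (the file names here are placeholders; the exact preprocessing used in our experiments may differ). Each line of a FIMI *.dat file lists the space-separated item IDs of one transaction:

    import csv

    with open("chess.dat") as src, open("chess.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            items = line.split()      # item IDs of one transaction
            if items:                 # skip blank lines
                writer.writerow(items)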
Five (5) datasets are used: chess, connect, mushroom, pumb_star and T40I10D100K. The selected datasets all differ from one another in size, horizontally or vertically, so as to analyze the performance of the selected algorithms on a huge number of records as well as a very high number of attributes. Table 1 shows the characteristics of the datasets.
Table 1. Dataset Characteristics
Datasets       Size (KB)   Average Length (Attributes)
Chess          335         37
Connect        9039        43
Mushroom       558         23
Pumb_star      11028       50
T40I10D100K    15116       32
4.2. RM Development and Results
The results of the experiments are summarized in three (3) tables and six (6) figures. Figure 5 illustrates the processes involved in deploying the Apriori algorithm.
Figure 5. Rapid Miner Processes by W-Apriori Algorithm
The W-Apriori process is an extension of the Weka Apriori implementation within the RM tool. First, the benchmark data (in CSV) are retrieved by calling the Retrieve() process. Data transformation is then performed through the DiscretizeByFrequency() process. This operator converts the selected numerical attributes into nominal attributes by discretizing each numerical attribute into a user-specified number of bins; bins of equal frequency are generated automatically, so the ranges of the bins may vary. The data are then passed through NominalToNumerical() followed by NumericalToPolynominal(). The NominalToNumerical() process changes nominal attributes to numerical attributes, while NumericalToPolynominal() changes numerical attributes to polynominal attributes, the type accepted by the Apriori algorithm. Finally, we call the Weka extension, W-Apriori(), to generate the best rules. The parameters are set to their default values.
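Equal-frequency discretization of this kind can be sketched in Python with pandas (a hypothetical attribute of our own; this is not RapidMiner's internal implementation):

    import pandas as pd

    # qcut produces bins of (roughly) equal frequency, mirroring the
    # Discretize by Frequency operator with two user-specified bins.
    values = pd.Series([1, 3, 3, 4, 7, 8, 12, 15, 18, 20])
    bins = pd.qcut(values, q=2, labels=["range1", "range2"])
    print(bins.value_counts())   # five values fall into each bin

Note that the bins hold (roughly) equal numbers of values, while their value ranges differ, exactly as described above.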
Figure 6 depicts the processes involved in deploying the Weka extension W-FPGrowth algorithm.
Figure 6. Rapid Miner Processes by W-FPGrowth Algorithm
The root process starts by retrieving the CSV dataset. The DiscretizeByFrequency() process is then selected to change the real-valued attributes to nominal. Next, the NominalToBinominal()
process is called to change the nominal attributes to binominal attributes, the type accepted by the FPGrowth algorithm. Lastly, the W-FPGrowth() process is called to find frequent patterns and generate the rules.
The performance of the Apriori and FP-Growth algorithms is measured in terms of total execution time and total number of generated rules. The running time is subject to factors such as the different search methods of the two algorithms and the size of the dataset itself.
Figures 7-12 illustrate the results obtained. From these figures, which represent three different values of min_conf (0.9, 0.5 and 0.1), it can be seen that the plotted patterns are very similar. The graphs show that W-FPGrowth outperforms W-Apriori, generating more rules in less execution time. From the detailed results in RM, W-Apriori and W-FPGrowth largely agree on which attributes are interpreted as antecedents and consequents. With W-FPGrowth, however, more attributes are found to take part in interesting rules than with W-Apriori. In principle, any mining algorithm should find the same set of rules, although computational efficiency and memory requirements may differ [5].
Figure 7. W-Apriori vs. W-FPGrowth: execution time (in seconds) when min_conf = 0.9
Figure 8. W-Apriori vs. W-FPGrowth: rules generated when min_conf = 0.9
Figure 9. W-Apriori vs. W-FPGrowth: execution time (in seconds) when min_conf = 0.5
Figure 10. W-Apriori vs. W-FPGrowth: rules generated when min_conf = 0.5
Figure 11. W-Apriori vs. W-FPGrowth: execution time (in seconds) when min_conf = 0.1
Figure 12. W-Apriori vs. W-FPGrowth: rules generated when min_conf = 0.1
The results of the scalability measurements for the Apriori and FP-Growth algorithms are depicted in Figures 13-16, taking min_conf as 0.9. Scalability is measured in two ways: first, how each algorithm's execution time scales with dataset size, and second, how the number of generated rules scales with dataset size. From the plotted graphs, it can be observed that the scalability of both algorithms with respect to execution time is very similar and tends to be non-linear, but there is a slight difference in the patterns plotted for the number of rules generated. For Apriori, the pattern plotted for rules generated is similar to the pattern for execution time; for FP-Growth, however, the pattern is slightly different: on pumb_star (11028 KB, per Table 1), the number of rules generated is considerably larger, at 18860 rules.
In reviewing the scalability of Apriori and FP-Growth over the five (5) datasets, there is only a small variation in execution time and number of rules generated as the dataset size varies. On the whole, however, these algorithms scale well with data size, as can be seen in Figures 13-16; good scalability to data size is highly desirable in real data mining applications [8].
Figure 13. Scalability of Apriori vs. data size on execution time when min_conf = 0.9
Figure 14. Scalability of FP-Growth vs. data size on execution time when min_conf = 0.9
Figure 15. Scalability of Apriori vs. data size on rules generated when min_conf = 0.9
Figure 16. Scalability of FP-Growth vs. data size on rules generated when min_conf = 0.9
5. Conclusion
In this paper, we have explored two well-known benchmark association rule mining algorithms. The experiments conducted in this paper show a comparison between the Apriori and FP-Growth algorithms on benchmark dense datasets. It is evident from the graphs that W-FPGrowth outperforms W-Apriori in terms of less execution time and more rules generated. W-FPGrowth is found to be the better algorithm under the support-confidence framework. W-FPGrowth performs only two passes over the dataset, and the dataset is already "compressed" by the construction of the FP-tree, which reduces irrelevant information by removing infrequent items. W-Apriori, in contrast, requires multiple database scans and a candidate generation step that self-joins itemsets (generating all possible candidate itemsets) before pruning (removing those candidates that are not frequent). It therefore requires more execution time, and the number of rules it generates is always 10 because the number of rules to report is bound to 10 by default. Longer execution time also implies higher memory consumption on the machine.
There are many other interestingness measures that could be imposed on the algorithms to see whether the performance results of Apriori and FP-Growth remain the same or not. For future analysis, there are a few alternatives we may want to tackle: either the same interestingness measure with the vertical data format approach of the Eclat algorithm [2], or a different interestingness measure with the same ARM rules, and then compare the outcomes.
Acknowledgements
We wish to thank Prof. Dr. Mohd Yazid Mohd Saman for his insightful comments and suggestions. Credit is also due to the MyPhD scholarship under MyBrain15 of Kementerian Pendidikan Malaysia (KPM) for the financial support of this work.
References
[1] Tan PN, Steinbach M, Kumar V. Introduction to Data Mining. First Edition. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. 2005.
[2] Trieu TA, Kunieda Y. An improvement for dEclat algorithm. Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication (ICUIMC'12). 2012; 54: 1-6.
[3] Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). 1994; 1215: 487-499.
[4] Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. SIGMOD Record. 1993; 22: 207-216.
[5] Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. Morgan Kaufmann. 2006.
[6] Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. ACM SIGMOD Record. 2000; 29(2): 1-12.
[7] Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery. 2004; 8(1): 53-87.
[8] Mamat R, Herawan T, Deris MM. MAR: Maximum Attribute Relative of soft set for clustering attribute selection. Knowledge-Based Systems. 2013; 52: 11-20.