Frequent Itemset Mining on Graphics Processors
Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He (1), Qiong Luo
Hong Kong University of Science and Technology; (1) Microsoft Research Asia
{wenbin, lumian, xiaoxy, luo}@cse.ust.hk, savenhe@microsoft.com
ABSTRACT
We present two efficient Apriori implementations of Fre-
quent Itemset Mining (FIM) that utilize new-generation graph-
ics processing units (GPUs). Our implementations take ad-
vantage of the GPU’s massively multi-threaded SIMD (Sin-
gle Instruction, Multiple Data) architecture. Both imple-
mentations employ a bitmap data structure to exploit the
GPU’s SIMD parallelism and to accelerate the frequency
counting operation. One implementation runs entirely on
the GPU and eliminates intermediate data transfer between
the GPU memory and the CPU memory. The other im-
plementation employs both the GPU and the CPU for pro-
cessing. It represents itemsets in a trie, and uses the CPU
for trie traversing and incremental maintenance. Our pre-
liminary results show that both implementations achieve a
speedup of up to two orders of magnitude over optimized
CPU Apriori implementations on a PC with an NVIDIA
GTX 280 GPU and a quad-core CPU.
1. INTRODUCTION
Frequent itemset mining (FIM) aims at finding interesting
patterns in databases, called transaction databases.
Each database transaction contains a set of items, such as
grocery items purchased in a basket. A FIM algorithm
scans the database, possibly multiple times, and finds item-
sets that occur in transactions more frequently than a given
threshold. The number of occurrences is called support, and
the threshold the minimum support.
Two representative FIM algorithms are Apriori [3] and
FP-growth [16]. Apriori iteratively generates candidate item-
sets of K+1 items, or (K+1)-itemsets, from K-itemsets, and
scans all transactions to check whether the candidate item-
sets are frequent. In comparison, FP-growth recursively
builds pattern trees to represent frequent itemsets, with-
out candidate generation. According to a report from the
first Workshop on Frequent Itemset Mining Implementations
(FIMI’03) [12], FP-growth implementations were generally
an order of magnitude faster than Apriori; however, on several
datasets, an Apriori implementation, apriori borgelt,
was slightly faster when the support was high.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Proceedings of the Fifth International Workshop on Data Management on
New Hardware (DaMoN 2009) June 28, 2009, Providence, Rhode-Island
Copyright 2009 ACM 978-1-60558-701-1 ...$10.00.
Utilizing parallel architectures has been a viable means for
improving data mining performance [4, 7, 9, 32]. In this pa-
per, we study whether we can adapt the existing CPU-based
FIM algorithms to new-generation graphics processing units
(GPUs). GPUs can be regarded as massively multi-threaded
many-core processors. Different from multi-core CPUs, the
cores on the GPU are virtualized, and GPU threads are ex-
ecuted in SIMD (Single Instruction, Multiple Data) and are
managed by the hardware. Such a design simplifies GPU
programming and improves program scalability and portability,
since programs are oblivious to physical cores and
rely on hardware for thread management. Nevertheless, it
also makes the implementation of algorithms with complex
control flows a challenging task on the GPU, even though
the GPU has an order of magnitude higher computation
capability as well as memory bandwidth than a multi-core
CPU.
Taking advantage of the massive computation power and
the high memory bandwidth of the GPU, previous work has
accelerated database operations [13, 14, 19], approximate
stream mining of quantiles and frequencies [15], MapReduce
[17] and k-means clustering [8]. To the best of our knowl-
edge, there has been no prior work that focuses on study-
ing the GPU acceleration for FIM algorithms, even though
parallel FIM has been studied on simultaneous multithread-
ing (SMT) processors [11], shared-memory systems [28], and
most recently multi-core CPUs [25].
As a first step, we consider the GPU implementation of
Apriori, with the intention of extending it to FP-growth. The Apriori
algorithm is applied not only in frequent itemset mining
or association mining, but also in other data mining tasks,
such as clustering [27] and functional dependency discovery [22].
Existing Apriori FIM algorithms are optimized for data locality;
however, the data structures in use, e.g., tries, are
non-aligned and the access patterns are largely irregular,
e.g., pointer-chasing. These characteristics may hurt the ef-
ficiency on the GPU since SIMD operations favor aligned
and sequential data accesses [34].
Addressing the challenge in implementing Apriori on the
GPU, we adopt a bitmap data structure to represent transactions
in our two GPU-based FIM implementations. Specifically,
the bitmap stores the occurrences of items in transactions,
and can be partitioned efficiently across SIMD processors.
Furthermore, we utilize a lookup table to facilitate support
counting, which is usually the most time-consuming compo-
nent in the Apriori algorithm. One implementation of ours
uses another bitmap to represent itemsets, which enables
the entire algorithm to run on the GPU. We denote this im-
plementation as PBI (Pure Bitmap-based Implementation).
PBI features regular data access patterns, which suit
the GPU well; however, it may cause redundant computation
and data access between frequent itemsets of different
sizes. To reduce the redundancy, the other implementation
of ours adopts a trie structure to represent itemsets, and
utilizes the CPU for trie traversal and incremental mainte-
nance. We denote this Trie-based Implementation as TBI.
We have evaluated our implementations using both synthetic
and real-world datasets. Both of our implementations are up
to two orders of magnitude faster than optimized CPU-based
Apriori implementations on three experimental datasets.
Organization: The remainder of the paper is organized as
follows. We give a brief overview of prior work on GPGPU
and frequent itemset mining in Section 2. We present the
details of our two implementations in Section 3. In Section
4, we present our experimental results. Finally, we conclude
in Section 5.
2. BACKGROUND AND RELATED WORK
In this section, we briefly review related work on GPGPU
(General-Purpose Computation on GPUs), and frequent item-
set mining algorithms.
2.1 General Purpose GPU Computing
The GPU is an integral component in commodity ma-
chines. It was previously designed to be a co-processor to
the CPU for games and other graphics applications. Re-
cently, the GPU has been used as a hardware accelerator for
various non-graphics applications, such as matrix multipli-
cation [23], databases [13, 14, 19], and distributed comput-
ing projects including Folding@home and Seti@home. For
additional information on the state-of-the-art GPGPU tech-
niques, we refer the reader to a recent survey by Owens et
al. [26].
Recently, GPGPU programming frameworks such as NVIDIA
CUDA allow the developer to write the code for the GPU
with familiar interfaces similar to C/C++. Such frameworks
model the GPU as a many-core architecture (as shown in
Figure 1) exposing hardware features for general-purpose
computation. In particular, CUDA exposes a hierarchical
multi-threaded model for NVIDIA’s latest GPUs, with
hardware features including the fast on-chip local memory
(which NVIDIA terms shared memory). CUDA groups lightweight
GPU threads into thread blocks. Threads within the same
thread block are divided into SIMD groups, called warps,
each of which contains 32 threads. The GPU has an on-board
device memory, which has a high bandwidth but also a
high access latency. A warp of threads can combine accesses
to consecutive data items in one device memory segment
into a single memory access transaction, called a coalesced
access.
While GPGPU programming frameworks greatly reduce
the complexity of GPGPU computing, developers must carefully
design and implement their algorithms in order to fully
utilize the GPU architectural features. In particular, GPUs
were originally designed for graphics rendering rather than
general-purpose computing. Therefore, GPUs are specialized
for compute-intensive and highly parallel applications,
especially those with SIMD-style parallelism. Furthermore, as a
co-processor, the GPU relies on the CPU for memory allocation.
As such, the common practice for efficiency is to
allocate the GPU memory statically before initiating the
GPU computation kernel and to avoid dynamic allocation
or reallocation during the GPU kernel execution. Additionally,
due to the limited bus bandwidth between the GPU
memory and the CPU memory, it is best to eliminate frequent,
small-sized data transfers between the CPU and the
GPU.
Figure 1: The many-core architecture model of the GPU
Recently, GPU-based primitives as the building blocks for
higher-level applications [18, 19, 30] have been proposed to
further reduce the complexity of GPU programming. The
parallel primitives [19] are a small set of common operations
exploiting the architectural features of GPUs. We utilize
the map, reduce, and prefix sum primitives in our two FIM
implementations. Following the previous studies [19], we
improve our implementation using memory optimizations,
including the local memory optimization for temporal local-
ity, the coalesced access optimization of device memory for
spatial locality, and the built-in vector data type to reduce
the number of memory accesses. Different from the previous
work, we study the GPU acceleration of Apriori for FIM,
which incurs much more complex control flows and memory
accesses than performing database joins [19] or maintaining
quantiles from data streams [15].
2.2 Frequent Itemset Mining
The Frequent Itemset Mining (FIM) problem was introduced
by Agrawal et al. [2] as the first step in mining association
rules from market basket data. Let $I = \{I_1, I_2, \ldots, I_m\}$
be a set of $m$ items, and $T = \{T_1, T_2, \ldots, T_n\}$ the transaction
database, where $T_i$ is a transaction containing a set of
items from $I$. A $k$-itemset, which consists of $k$ items from $I$,
is frequent if it occurs in $T$ not less than $s$ times, where $s$
is a user-specified minimum support threshold, and $s \le n$.
We denote $s/n$ as minsup. The FIM problem is to find all
itemsets in a given transaction database whose relative
frequency is at least minsup.
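As a concrete illustration of these definitions, the following minimal Python sketch counts the support of an itemset by brute force. The function names and the toy database (the four transactions later used in Figure 2) are chosen here for illustration only and are not part of our implementations.

```python
# Toy transaction database: the four transactions of Figure 2.
transactions = [set("ABCD"), set("ABD"), set("ACD"), set("BCD")]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if support / n >= minsup, i.e. support >= s."""
    return support(itemset, transactions) >= minsup * len(transactions)
```

With minsup = 0.5 and n = 4, the threshold is s = 2, so ABD (support 2) is frequent while ABCD (support 1) is not.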
There are two representative algorithms for mining fre-
quent itemsets, namely, Apriori [3] and FP-growth [16].
Apriori iteratively mines frequent 1-itemsets, 2-itemsets, ...,
up to K-itemsets, where K is the maximum number of items
in a frequent itemset. In each iteration, the algorithm
generates candidate itemsets, or candidates, and counts the
support for each candidate by scanning all transactions. In
comparison, FP-growth works through divide-and-conquer.
It recursively constructs a conditional database and a con-
ditional FP-tree, and mines the FP-tree in a pattern growth
method, by the concatenation of the suffix pattern to the
frequent patterns generated from the precedent conditional
FP-tree. The advantage of FP-growth is that it avoids gen-
erating a number of candidates as well as repeated scanning
of the transaction database.
FIM has been widely studied in distributed systems [4, 7,
10, 24]. Aouad et al. [4] designed a distributed Apriori in
heterogeneous computer cluster and grid environments us-
ing dynamic workload management to tackle memory con-
straints, achieve balanced workloads, and reduce communi-
cation costs. Buehrer [7] and El-Hajj [10] proposed variants
of FP-growth on computer clusters, lowering communication
costs and improving cache, memory, and I/O utilization.
Most recently, Li et al. [24] demonstrated a linear speedup
of FP-growth on thousands of distributed machines using
Google’s MapReduce infrastructure.
Researchers have also studied FIM problems on modern
CPUs. The key issue is how to fully exploit the instruction-
level parallelism (ILP) and thread-level parallelism (TLP)
on the multi-core CPU. Ghoting et al. [11] improved FP-growth
[16] through a cache-conscious prefix tree for spatial
locality and ILP, and a tiling strategy for temporal locality.
Liu et al. [25] proposed a cache-conscious FP-array
obtained by compacting the FP-tree [16], and a lock-free, dataset-tiling
tree construction algorithm for TLP. Ye et al. [31]
explored the parallelization of Bodon’s trie-based Apriori al-
gorithm [6] with a database partitioning method. Recently,
two benchmarks for mining on multi-core processors, includ-
ing the PARSEC Benchmark Suite [5] and NU-MineBench
[29], have been proposed to facilitate architectural studies.
In comparison to previous parallel CPU-based FIM al-
gorithms, our algorithms are designed for the GPU with
massive SIMD parallelism, instead of distributed systems
and multi-core CPUs. In the literature, other parallel Apri-
ori algorithms focus on I/O performance, while our GPU-
based algorithms are in-memory, exploiting the SIMD archi-
tectural feature provided by GPUs.
3. IMPLEMENTATION
In this section, we present the design and implementation
of our two GPU-based Apriori algorithms: the Pure Bitmap-
based Implementation (PBI) and the Trie-based Implemen-
tation (TBI). Both implementations exploit the bitmap rep-
resentation of transactions, which facilitates fast set inter-
section to obtain transactions containing a particular item-
set. Furthermore, together with a lookup table, the bitmap
representation also accelerates support counting, which is
a time-consuming component in Apriori. PBI uses a bitmap
data structure to represent itemsets, while TBI uses a
trie. In particular, we keep the trie on the CPU, which performs
trie traversal and incremental maintenance for efficiency. We
implemented PBI and TBI with NVIDIA CUDA.
3.1 Overview
Both of our PBI and TBI implementations follow the
workflow of the original Apriori algorithm, as shown in Al-
gorithm 1. In the algorithm, we first generate all frequent
items, or 1-itemsets. Next, we iteratively invoke Candidate
Generation to generate candidate K-itemsets, and
then perform support counting in Freq Itemset Generation
to generate frequent K-itemsets, where K > 1. K is incremented
after each iteration. Both Candidate Generation
and Freq Itemset Generation can have different implementations.
Algorithm 1 Apriori
// C_K: Candidate K-itemsets.
// L_K: Frequent K-itemsets.
// T: Transaction database.
Generate all frequent items L_1
K = 2
while L_{K-1} ≠ ∅ do
    // Generate candidate K-itemsets
    C_K = Candidate Generation(L_{K-1})
    // Count supports and generate frequent K-itemsets
    L_K = Freq Itemset Generation(C_K, T, minsup)
    K = K + 1
end while
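The workflow of Algorithm 1 can be sketched sequentially in Python as follows. This is only a CPU-side illustration under simplifying assumptions: the helper names are hypothetical, the threshold is given as an absolute count (`min_count`), and support counting is a plain scan of the transactions rather than the bitmap-based GPU procedure described later.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori; returns {frozenset(itemset): support}."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Support counting: scan all transactions for each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: s for c, s in counts.items() if s >= min_count}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)                      # L_1
    frequent = dict(level)
    k = 2
    while level:                              # while L_{K-1} is not empty
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:              # join: same prefix, different last item
                cand = frozenset(a) | frozenset(b)
                # prune: every (K-1)-subset must be frequent
                if all(frozenset(sub) in level
                       for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        level = count(candidates)             # L_K
        frequent.update(level)
        k += 1
    return frequent

result = apriori(["ABCD", "ABC", "ACD", "BCD"], min_count=2)
```

On this four-transaction database, ABD is generated as a candidate 3-itemset but is dropped by support counting, since it occurs only once.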
In the Apriori algorithm, there are two major data struc-
tures. One represents transactions, and the other represents
itemsets. Both of our GPU-based implementations adopt a
bitmap data structure to represent transactions, and both
invoke the Freq Itemset Generation procedure in Algo-
rithm 1 entirely on the GPU. The PBI implementation rep-
resents itemsets in another bitmap, and executes Candi-
date Generation on the GPU. In comparison, the TBI
implementation represents itemsets in a trie, and utilizes
the CPU to help traverse and build the trie.
3.2 Bitmap and Support Counting
Horizontal data layout:
Transaction ID | Item IDs
1              | ABCD
2              | ABD
3              | ACD
4              | BCD
Vertical data layout:
Itemset ID | Transaction IDs
ABD        | 1, 2
ACD        | 1, 3
BCD        | 1, 4
Bitmap representation:
      T1  T2  T3  T4
ABD   1   1   0   0
ACD   1   0   1   0
BCD   1   0   0   1
Figure 2: Horizontal data layout (left), vertical data
layout (top right), and bitmap representation (bottom right).
There are two choices to represent the transactions, namely,
horizontal and vertical data layouts [33]. In the horizontal
layout, each transaction has a transaction identifier, followed
by a list of items in a predefined order. In the vertical layout,
each itemset has an itemset identifier, followed by a list of
transactions containing that itemset. We denote the trans-
action list of a K-itemset as a K-tranlist. Figure 2 shows an
example of the horizontal and vertical data layouts, together
with the corresponding bitmap structure.
Traditionally, a CPU-based Apriori implementation adopts
the horizontal data layout. However, such a layout requires scanning
all transactions to perform support counting, which
limits the data parallelism available to GPUs. Therefore, we adopt
the vertical data layout instead. We intersect two (K-1)-tranlists
to obtain a K-tranlist, TID LIST, for a particular
K-itemset $I_K$. Next, we count the number of transactions
in TID LIST as the support of $I_K$. The intersection-and-counting
processes for different K-itemsets are independent
of one another, so we can easily parallelize the
procedure for generating different frequent itemsets.
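The intersection-and-counting step can be illustrated with ordinary Python sets before any bitmap is introduced. The variable names below are illustrative, and the 1-tranlists are those implied by the database of Figure 2.

```python
# Vertical layout of the Figure 2 database: item -> set of transaction IDs.
tranlists = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
    "C": {1, 3, 4},
    "D": {1, 2, 3, 4},
}

def intersect(tranlist_a, tranlist_b):
    """Transactions that appear in both tranlists."""
    return tranlist_a & tranlist_b

ab = intersect(tranlists["A"], tranlists["B"])   # 2-tranlist of AB
ad = intersect(tranlists["A"], tranlists["D"])   # 2-tranlist of AD
abd = intersect(ab, ad)                          # 3-tranlist of ABD
support_abd = len(abd)                           # support of ABD
```

Each candidate needs only its two parent tranlists, which is why different candidates can be processed independently.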
To further improve the intersection operation and support
counting, we store the vertical data layout in a bitmap,
which is an array of bits. We refer to it as a transaction-bitmap.
In an m × n transaction-bitmap, where m is the number of
items and n is the number of transactions, bit (i, j) is set to
1 if item i occurs in transaction j. We store a transaction-bitmap
in the built-in vector data type int4 (a structure
containing four 32-bit integers), which is of size 16 bytes,
because the GPU can read up to 16 bytes of data from the
device memory to registers in one instruction. This way, we
can reduce the number of device memory accesses by a factor
of four, compared with reading data at the granularity of
32-bit integers. Each row of a transaction-bitmap is rounded
up to a multiple of 16 bytes: if the last group of bits is shorter
than 128 bits, it is padded with 0s. Thus, a row vector in a transaction-bitmap
is of size $\lceil n/128 \rceil \times 16$ bytes. We transform support
counting into the intersection of row vectors of the transaction-bitmap,
followed by counting the number of 1’s in the
intersection result.
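A minimal sketch of building such a transaction-bitmap on the host is given below. The function names are hypothetical, and the bit order within a byte (most significant bit first) is an assumption made for the example; the layout only needs to be consistent between the two rows being intersected.

```python
ROW_ALIGN_BITS = 128  # one int4 = four 32-bit integers = 16 bytes

def row_bytes(n_transactions):
    """Row size in bytes: ceil(n / 128) * 16."""
    return -(-n_transactions // ROW_ALIGN_BITS) * 16

def build_bitmap(transactions, items):
    """Return {item: bytes}; bit j of a row is 1 iff the item occurs in transaction j."""
    size = row_bytes(len(transactions))
    bitmap = {}
    for item in items:
        row = bytearray(size)                    # zero-filled, so padding bits stay 0
        for j, t in enumerate(transactions):
            if item in t:
                row[j // 8] |= 0x80 >> (j % 8)   # assumed order: MSB first
        bitmap[item] = bytes(row)
    return bitmap

bitmap = build_bitmap(["ABCD", "ABD", "ACD", "BCD"], "ABCD")
```

For the four transactions of Figure 2, each row occupies a single 16-byte unit, of which only the first four bits carry data.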
We construct a lookup table that maps an
integer to the number of 1’s in its binary representation.
For example, the number of 1’s in 00000001 11000000 (448 in
decimal form) is 3. This lookup table is read-only. Since it is
accessed frequently, we put it in the read-only, cacheable
constant memory on the GPU. The constant memory can
achieve a memory access latency as low as one cycle. In comparison,
accessing device memory incurs hundreds of cycles.
The size of constant memory, 64 KB, constrains the size of
our lookup table to $2^{16}$ entries $\times$ 1 byte/entry = 65536
bytes. For each lookup, we can obtain the number of 1’s of
a 16-bit integer.
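The lookup table itself is simple to construct. The sketch below builds it on the host with a standard recurrence (the count for v equals the count for v >> 1 plus the lowest bit of v); the names `POPCOUNT16` and `count_ones` are illustrative, and the big-endian grouping of bytes into 16-bit words is an assumption of the example.

```python
# POPCOUNT16[v] = number of 1 bits in the 16-bit integer v (2**16 one-byte entries).
POPCOUNT16 = bytearray(1 << 16)
for v in range(1, 1 << 16):
    POPCOUNT16[v] = POPCOUNT16[v >> 1] + (v & 1)

def count_ones(row):
    """Count the 1 bits of a byte string, 16 bits per table lookup."""
    total = 0
    for i in range(0, len(row), 2):
        total += POPCOUNT16[int.from_bytes(row[i:i + 2], "big")]
    return total
```

The table occupies exactly 65536 bytes, matching the 64 KB of constant memory.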
Algorithm 2 shows Freq Itemset Generation, which runs
entirely on the GPU, and is invoked by both PBI and TBI.
The GPU thread blocks run in parallel, each processing one
candidate K-itemset. Threads within the same thread block intersect
two (K-1)-tranlists, count the number of 1’s for every 16
bits in the K-tranlist, and add up all counts using a parallel
reduce.
Figure 3 illustrates an example for support counting within
a particular thread block. In this example, there are two
threads in the thread block. For ease of representation, we
assume that the data type int is of size 8 bits, so the vector
int4 is of size 32 bits. At the beginning, each thread reads
two int4 vectors from two (K-1)-tranlists respectively, and
performs bitwise AND operation on these two int4 vectors.
Next, each thread queries the lookup table to obtain the
counts of 1’s for every 16 bits of the intersection result. Fi-
nally, we synchronize all threads in the same thread block,
and perform parallel reduce to add up the counts as the
support for the K-itemset.
Algorithm 2 Freq Itemset Generation
for each candidate K-itemset in parallel do
    Intersect two (K-1)-tranlists in parallel
    Query the lookup table to count the number of 1’s for every 16 bits in K-tranlist in parallel
    Perform parallel reduce to add up the counts of every 16 bits and obtain the support for K-itemset
    if support of K-itemset ≥ minimum support then
        Output K-itemset
    end if
end for
Thread block K (two threads, each handling one built-in int4 vector, shown here as 32 bits):
                        Thread 1                              Thread 2
(K-1)-tranlist          00100011 11001001 10110100 00001011   00100000 11100001 11100000 00000000
(K-1)-tranlist          11100010 11100000 10001000 01100001   00001001 11001100 01100000 00000000
K-tranlist (AND)        00100010 11000000 10000000 00000001   00000000 11000000 01100000 00000000
# of 1’s per 16 bits    4                 2                   2                 2
(from the lookup table)
Parallel reduce (sum): support of K-itemset = 10
Figure 3: Support counting within a thread block.
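The example of Figure 3 can be replayed sequentially as a sanity check. In the sketch below, which is an illustration and not our CUDA kernel, each loop iteration stands in for one GPU thread, the final sum stands in for the parallel reduce, and `popcount16` is a stand-in for the constant-memory lookup table. As in the figure, a chunk is taken to be 32 bits (4 bytes) for ease of presentation.

```python
def popcount16(v):
    """Stand-in for the 16-bit lookup table."""
    return bin(v).count("1")

def support_count(row_a, row_b, chunk_bytes=4):
    """AND two (K-1)-tranlists chunk by chunk, count 1s per 16 bits, then sum."""
    per_thread = []                                   # partial counts of each "thread"
    for start in range(0, len(row_a), chunk_bytes):
        a = row_a[start:start + chunk_bytes]
        b = row_b[start:start + chunk_bytes]
        k_chunk = bytes(x & y for x, y in zip(a, b))  # bitwise AND
        counts = [popcount16(int.from_bytes(k_chunk[i:i + 2], "big"))
                  for i in range(0, len(k_chunk), 2)]
        per_thread.append(counts)
    return sum(sum(c) for c in per_thread), per_thread

# The two (K-1)-tranlists shown in Figure 3.
row_a = bytes([0b00100011, 0b11001001, 0b10110100, 0b00001011,
               0b00100000, 0b11100001, 0b11100000, 0b00000000])
row_b = bytes([0b11100010, 0b11100000, 0b10001000, 0b01100001,
               0b00001001, 0b11001100, 0b01100000, 0b00000000])
support, per_thread = support_count(row_a, row_b)
```

The per-thread counts are 4 and 2 for the first chunk and 2 and 2 for the second, which sum to the support of 10 shown in the figure.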
3.3 Pure Bitmap Implementation
Original transaction database in horizontal data layout:
T1: ABCD | T2: ABC | T3: ACD | T4: BCD
2-itemsets (bitmap over items) and 2-tranlists (bitmap over transactions):
      A  B  C  D  |  T1 T2 T3 T4
AB    1  1  0  0  |  1  1  0  0
AC    1  0  1  0  |  1  1  1  0
AD    1  0  0  1  |  1  0  1  0
BC    0  1  1  0  |  1  1  0  1
BD    0  1  0  1  |  1  0  0  1
CD    0  0  1  1  |  1  0  1  1
3-itemsets (bitwise OR of two 2-itemset rows) and 3-tranlists (bitwise AND of two 2-tranlist rows):
      A  B  C  D  |  T1 T2 T3 T4
ABC   1  1  1  0  |  1  1  0  0
ABD   1  1  0  1  |  1  0  0  0
ACD   1  0  1  1  |  1  0  1  0
BCD   0  1  1  1  |  1  0  0  1
Figure 4: Generating candidate 3-itemsets from frequent
2-itemsets in PBI.
In the Pure Bitmap Implementation (PBI), we represent
itemsets in a bitmap. In an m × n bitmap representing K-itemsets,
where m is the number of K-itemsets and n is the
number of all items, bit (i, j) is set to 1 if itemset i contains
item j. Each row is also rounded up to a multiple of 16 bytes. We impose a
lexicographical order among all K-itemsets.
The Candidate Generation procedure consists of two
steps, namely, a join to generate a candidate K-itemset from
two (K-1)-itemsets, and a pruning to select the candidate
K-itemsets whose (K-1)-subsets are all frequent. Algorithm
3 shows the Candidate Generation procedure for PBI.
We denote the i-th (K-1)-itemset as $L_i$, and the j-th (K-1)-itemset
as $L_j$, where i < j. The k-th item in $L_i$ is denoted
as $L_i[k]$. Each GPU thread handles an $L_i$, and joins it with
$L_j$. The join predicate is
$(L_i[0] = L_j[0]) \wedge (L_i[1] = L_j[1]) \wedge \cdots \wedge (L_i[K-2] = L_j[K-2]) \wedge (L_i[K-1] < L_j[K-1])$.
In pruning, we check whether all (K-1)-subsets of a
generated candidate K-itemset are frequent. We perform a
binary search on the (K-1)-itemsets to determine whether a (K-1)-subset
of the candidate K-itemset is frequent. Figure 4 depicts
an example of generating candidate 3-itemsets from
frequent 2-itemsets in PBI. For example, in order to gener-
ate the candidate itemset ABC, we join two 2-itemsets AB
and AC by performing a bitwise OR operation on the cor-
responding vectors in the bitmap of 2-itemsets. In the fol-
lowing Freq Itemset Generation procedure, we perform
a bitwise AND operation to obtain the transaction list for
candidate itemset ABC.
Candidate Generation uses a bitmap to represent itemsets,
which allows uniform and efficient bitwise operations to
perform joins on the GPU, and avoids the overhead of frequent
data transfer between GPU memory and CPU memory.
However, when the number of items is large, it also incurs
excessive non-coalesced device memory accesses. Suppose there are
m frequent (K-1)-itemsets and n items. In order to check
whether one (K-1)-subset is frequent, we need to access
$\log m \times \lceil n/128 \rceil \times 16$ bytes of data, where $\log m$ is the cost
of performing a binary search, and $\lceil n/128 \rceil \times 16$ is the size of
a row (in bytes) in the bitmap of (K-1)-itemsets. For example,
if m = 10000 and n = 10000, we need to access about 16 KB
to check only one (K-1)-subset. This problem in our
pure bitmap-based solution motivates us to consider adopting
another data structure in the Candidate Generation
procedure in the presence of a large number of items.
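The 16 KB estimate can be checked with a few lines of arithmetic. The sketch below assumes a base-2 logarithm for the binary search; the function name is illustrative.

```python
import math

def pruning_bytes(m, n):
    """Bytes read by one binary search over m rows of ceil(n/128)*16 bytes each."""
    row = math.ceil(n / 128) * 16
    return math.log2(m) * row

row_size = math.ceil(10_000 / 128) * 16   # 1264 bytes per row
cost = pruning_bytes(10_000, 10_000)      # roughly 16.8e3 bytes, i.e. about 16 KB
```

With n = 10000 items, each row takes 79 int4 units, and a binary search over 10000 rows touches about 13 of them.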
Algorithm 3 PBI Candidate Generation
// L_x represents the x-th (K-1)-itemset, that is, the
// x-th row vector in the bitmap for (K-1)-itemsets.
for each L_i in parallel do
    for each L_j where j = i + 1 to m do
        if L_i and L_j are joinable then
            // Join
            Union on L_i and L_j to obtain a candidate K-itemset by performing a bitwise OR operation
            // Pruning
            (K-1)-subset test on the candidate K-itemset by a binary search in the (K-1)-itemset bitmap
        else
            break
        end if
    end for
end for
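A sequential Python sketch of this join-and-prune procedure is given below, using the 2-itemsets of Figure 4. It is an illustration under simplifying assumptions rather than our GPU code: itemset rows are Python integers used as bit masks, the outer loop stands in for the per-thread parallelism, and the binary search runs over the sorted itemset strings instead of over bitmap rows. All names are hypothetical.

```python
from bisect import bisect_left

ITEMS = "ABCD"

def to_mask(itemset):
    """Bitmap row of an itemset: bit i is set iff ITEMS[i] is in the itemset."""
    return sum(1 << ITEMS.index(x) for x in itemset)

def to_items(mask):
    """Inverse of to_mask, with items listed in lexicographical order."""
    return "".join(x for i, x in enumerate(ITEMS) if (mask >> i) & 1)

def is_member(keys, s):
    """Binary search for itemset s in the sorted list keys."""
    pos = bisect_left(keys, s)
    return pos < len(keys) and keys[pos] == s

def pbi_candidates(frequent):
    """Join and prune; `frequent` holds the (K-1)-itemsets as sorted strings."""
    keys = sorted(frequent)                        # lexicographical order
    rows = [to_mask(s) for s in keys]
    out = []
    for i in range(len(rows)):                     # one GPU thread per L_i
        for j in range(i + 1, len(rows)):
            if keys[i][:-1] != keys[j][:-1]:       # not joinable: prefixes differ
                break
            cand = rows[i] | rows[j]               # join = bitwise OR
            # Pruning: clear one item bit at a time and look the subset up.
            subsets = (to_items(cand & ~(1 << b))
                       for b in range(len(ITEMS)) if (cand >> b) & 1)
            if all(is_member(keys, s) for s in subsets):
                out.append(to_items(cand))
    return out

candidates = pbi_candidates(["AB", "AC", "AD", "BC", "BD", "CD"])
```

Because the itemsets are kept in lexicographical order, all join partners of an itemset are contiguous, which is what makes the early `break` safe.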
3.4 Trie-Based Implementation
Original transaction database in horizontal data layout:
T1: ABCD | T2: ABC | T3: ACD | T4: BCD
Trie of 2-itemsets:
Depth 0: Root
Depth 1: A, B, C
Depth 2: B, C, D (children of A); C, D (children of B); D (child of C)
2-tranlists:
      T1 T2 T3 T4
AB    1  1  0  0
AC    1  1  1  0
AD    1  0  1  0
BC    1  1  0  1
BD    1  0  0  1
CD    1  0  1  1
Trie of 3-itemsets (the trie above extended by one level):
Depth 3: C, D (children of AB); D (child of AC); D (child of BC)
3-tranlists (bitwise AND of two 2-tranlists):
      T1 T2 T3 T4
ABC   1  1  0  0
ABD   1  0  0  0
ACD   1  0  1  0
BCD   1  0  0  1
Figure 5: Generating candidate 3-itemsets from frequent
2-itemsets in TBI.
Instead of using a bitmap to represent itemsets, we adopt
the trie data structure, which is also used in the state-of-the-art
Apriori implementation [6]. A trie is a rooted, directed
prefix tree. The root is defined to be at depth 0. If a node is
at depth K, then its children are at depth K+1. Each node
stores an item id. A node at depth K, concatenated with all its
ancestors, represents a K-itemset. The trie-based Apriori
implementation on the CPU [6] stores the support in each
node, and counts support by scanning the transactions in
the horizontal data layout. For each transaction, it finds
paths from the root to the leaves in the trie corresponding
to candidate itemsets contained in the transaction, and the
support values of these leaves are all increased by one. Dif-
ferent from the CPU trie-based implementation, we repre-
sent transactions in a bitmap, and perform support counting
on the GPU, as described in Section 3.2.
The candidate generation based on trie traversal is implemented
on the CPU. This decision is based on the fact
that the trie is an irregular structure and is difficult to share
among SIMD threads. Thus, we store the trie representing
itemsets in the CPU memory, and the bitmap representation
of transactions in the GPU device memory.
We incrementally construct the trie level by level, which
matches the iterative process of Apriori. By growing the
trie to depth K, we generate all the frequent K-itemsets.
Algorithm 4 shows the Candidate Generation procedure
in TBI. We perform a join for every node at depth K-1
with each of its right siblings. We keep all children of a
node sorted in lexicographical order on item id, so that we
can efficiently check whether a (K-1)-subset is frequent by
performing a series of binary searches to follow a path with
the same prefix as the (K-1)-subset. In each iteration,
after generating all candidate K-itemsets on the CPU, we
transfer the bookkeeping data to the GPU memory for support
counting. The bookkeeping data include a set of triples
of the form $(I^i_{K-1}, I^j_{K-1}, I_K)$, where $I^i_{K-1}$ and $I^j_{K-1}$ are
the two (K-1)-itemsets generating the candidate K-itemset
$I_K$. After the support counting on the GPU, we
transfer the bookkeeping data back to eliminate infrequent candidate
K-itemsets. If the K-itemset $I_K$ is a false candidate, then
its triple is set to (null, null, $I_K$). At this point, we have
all the frequent K-itemsets in the trie, and start the next
iteration to generate (K+1)-itemsets, if any.
Figure 5 shows the same example as Figure 4, generating
candidate 3-itemsets from 2-itemsets. In order to generate
the candidate ABC, we need to join AB and AC, which
are represented by the leftmost node B and the second left-
most node C at depth 2 respectively. Next, we test all 2-
subsets other than AB and AC for ABC, which include only
BC. We follow the path with prefix BC, and find this 2-
itemset. Finally, we keep the candidate ABC for further
support counting in the Freq Itemset Generation procedure.
Note that each leaf node in the trie is associated with
a row vector in the bitmap representing transactions, so that
we can easily perform a bitwise AND operation on two 2-
tranlists to obtain a 3-tranlist, and then count the support.
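The CPU-side candidate generation on the trie can be sketched as follows. This is a simplified illustration, not our implementation: the trie is a nested Python dictionary keyed by item id, dictionary lookups replace the binary searches among sorted children, and itemsets are strings of single-letter items. The function names are hypothetical. The function returns the bookkeeping triples that would be shipped to the GPU for support counting.

```python
def insert(trie, itemset):
    """Add the path for `itemset` (items in lexicographical order) to the trie."""
    node = trie
    for item in itemset:
        node = node.setdefault(item, {})

def contains(trie, itemset):
    """Follow the path for `itemset` from the root; True if every node exists."""
    node = trie
    for item in itemset:
        if item not in node:
            return False
        node = node[item]
    return True

def tbi_candidates(trie, k):
    """Return triples (I_i, I_j, I_K) for all candidate K-itemsets."""
    triples = []

    def visit(node, prefix):
        if len(prefix) == k - 2:              # children of `node` are at depth K-1
            siblings = sorted(node)
            for a in range(len(siblings)):
                for b in range(a + 1, len(siblings)):   # right siblings only
                    u = prefix + siblings[a]
                    w = prefix + siblings[b]
                    cand = u + siblings[b]               # join
                    subsets = [cand[:p] + cand[p + 1:] for p in range(k)]
                    if all(contains(trie, s) for s in subsets):  # pruning
                        triples.append((u, w, cand))
            return
        for item in sorted(node):
            visit(node[item], prefix + item)

    visit(trie, "")
    return triples

trie = {}
for itemset in ["AB", "AC", "AD", "BC", "BD", "CD"]:
    insert(trie, itemset)
triples = tbi_candidates(trie, 3)
```

For the 2-itemsets of Figure 5 this yields four triples, one per candidate 3-itemset; after support counting, only the frequent candidates would be inserted into the trie as depth-3 nodes.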
Algorithm 4 TBI Candidate Generation
// u represents a node at depth K-1 in the trie.
for each u at depth K-1 do
    for each w that is a right sibling of u do
        // Join
        Union on the two (K-1)-itemsets represented by u and w to obtain a candidate K-itemset
        // Pruning
        (K-1)-subset test on the candidate K-itemset by following the path of the trie with the same prefix
    end for
end for
4. EVALUATION
In this section, we present experimental results evaluating
our two GPU-based Apriori implementations.
4.1 Experimental Setup
Our experiments were performed on a PC with an NVIDIA
GTX 280 GPU and an Intel Core2 quad-core CPU, running
Microsoft Windows XP SP3. The GPU consists of 30
SIMD multi-processors, each of which has eight processors
running at 1.29 GHz. The GPU memory is of size 1GB with
the peak bandwidth of 141.7 GB/sec. The CPU has four
cores running at 2.4 GHz. The main memory is 2 GB with
the peak bandwidth of 5.6 GB/sec. The GPU uses a PCI-E
bus to transfer data between the GPU memory and the main
memory with a theoretical bandwidth of 4 GB/sec. The PC
has a 160 GB SATA magnetic hard disk.
All source code was written and compiled using Visual
Studio 2005 with the optimization level /O2. The version of
CUDA is 2.0.
Comparison. We compared our GPU-based algorithms
with three CPU-based Apriori implementations and one CPU-based
FP-growth implementation, since no GPU-based Apriori or FP-growth
implementation is publicly available. The single-threaded CPU-based
implementation is from the repository of the Workshop on
Frequent Itemset Mining Implementations (FIMI’03) [20]
and is the best Apriori implementation; we denote it as BORGELT.
No multi-threaded Apriori implementation is
publicly available, so we parallelized one ourselves.
BORGELT uses a trie to represent transactions,
and performs support counting recursively, which makes it
quite tricky to parallelize. Instead, we parallelized another
well-known single-threaded CPU-based Apriori implementation,
from Bart Goethals [21], which stores transactions in the
horizontal data layout. For this implementation, we parallelized
the support counting step using OpenMP, and the
parallelized version running on a quad-core CPU is more
than three times faster than the serial version. We denote
the parallelized implementation from Goethals as GOETHALS.
Furthermore, we ported our TBI implementation to the CPU,
and parallelized the for-loop in the support counting part
(Algorithm 2) using OpenMP, and we denote it as TBI-
CPU. Our two GPU-based Apriori implementations are de-
noted as PBI-GPU and TBI-GPU respectively. Table 1
summarizes the characteristics of the two GPU-based, and
the three CPU-based Apriori implementations. The CPU-
based FP-growth implementation is from PARSEC bench-
mark [5], which is implemented in OpenMP, denoted as FP-
GROWTH. All multi-threaded CPU-based algorithms ran
on four CPU threads.
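To illustrate why the support counting loop parallelizes well, the following minimal sketch distributes independent per-candidate counts over four workers. It is not the OpenMP code used in the experiments: the names `count_support`, `parallel_support_counting`, and `item_bitmaps`, as well as the toy data, are hypothetical, and Python threads are used only to show the structure of the loop (they do not give the speedup of native threads).

```python
from concurrent.futures import ThreadPoolExecutor

def count_support(candidate, item_bitmaps):
    """AND the transaction bitmaps of all items in a non-empty candidate, then count set bits."""
    items = iter(candidate)
    acc = item_bitmaps[next(items)]
    for item in items:
        acc &= item_bitmaps[item]
    return bin(acc).count("1")

def parallel_support_counting(candidates, item_bitmaps, workers=4):
    """Each candidate is counted independently, so the loop can be split across workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        supports = list(pool.map(lambda c: count_support(c, item_bitmaps), candidates))
    return dict(zip(candidates, supports))

# Toy vertical bitmaps (hypothetical data): bit t is set if transaction t contains the item.
item_bitmaps = {"a": 0b10111, "b": 0b00111, "c": 0b11010}
candidates = [("a", "b"), ("a", "c"), ("b", "c")]
supports = parallel_support_counting(candidates, item_bitmaps)
```

No candidate reads or writes another candidate's state, so no synchronization is needed beyond collecting the results.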
Datasets. We used three representative datasets from
the FIMI’03 repository [20] to evaluate the five Apriori
implementations: T40I10D100K, Chess, and Retail.
The three datasets have distinct characteristics,
which are summarized in Table 2. T40I10D100K
is a synthetic dataset simulating market basket data. Chess
and Retail are real-world datasets. The density of a dataset
is defined as the average transaction length divided
by the number of items. Chess represents dense data; its
density of 49% is the highest among all datasets
in the FIMI’03 repository. Retail represents sparse data,
with a density lower than 1%. The implementations PBI-GPU,
TBI-GPU, and TBI-CPU require the transaction data to be
in a bitmap data structure. For sparse data, the bitmap
representation of transactions in vertical layout is larger
than the original data in horizontal layout (180 MB vs 4 MB
for Retail). For dense data, however, the bitmap
representation compresses the transaction data (30 KB vs
335 KB for Chess). We refer the reader to the FIMI’03
report [12] for the complete experimental results of
various FIM implementations on all datasets.
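The density definition and the reported bitmap sizes can be checked with a short calculation. The sketch below assumes that the vertical bitmap stores exactly one bit per (item, transaction) pair with no padding or compression; that size formula and the function names are assumptions made for illustration, while the dataset figures are those listed in Table 2.

```python
def density(avg_length, num_items):
    """Density as defined in the text: average transaction length / number of items."""
    return avg_length / num_items

def bitmap_bytes(num_items, num_transactions):
    """Assumed uncompressed bitmap size: one bit per (item, transaction) pair."""
    return num_items * num_transactions / 8

datasets = {
    # name: (number of items, average length, number of transactions)
    "T40I10D100K": (1_000, 40, 100_000),
    "Retail": (16_469, 10.3, 88_162),
    "Chess": (75, 37, 3_196),
}

for name, (n_items, avg_len, n_trans) in datasets.items():
    print(f"{name}: density {density(avg_len, n_items):.2%}, "
          f"bitmap about {bitmap_bytes(n_items, n_trans) / 1e6:.2f} MB")
```

Under this assumption the bitmaps come to about 12.5 MB, 181 MB, and 30 KB, close to the sizes quoted for the three datasets. Applying the density definition to Retail gives roughly 0.06%, in line with the statement that its density is lower than 1%.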
Metric. We measured the total elapsed time to evaluate
the efficiency of all implementations. Since we
focus on in-memory performance, we excluded the
initial file input and final result output from the total time
measurement. We also excluded the time for converting
the transaction database from the horizontal data layout
into the bitmap representation, since the conversion can be
performed offline, or the source data can be collected and
stored in bitmap form initially. We ran each experiment three
times and calculated the mean value. The variance among
different runs of the same experiment was smaller than 10%.
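The measurement protocol can be sketched as a small timing harness. This is an illustration only, not the harness used for the experiments: `time_mining` and `relative_spread` are hypothetical names, the workload is a stand-in, and the spread measure (max minus min, divided by the mean) is an assumed way to quantify run-to-run variation.

```python
import time
from statistics import mean

def time_mining(mine, prepared_input, runs=3):
    """Time only the in-memory mining call; file I/O and bitmap conversion stay outside."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        mine(prepared_input)
        elapsed.append(time.perf_counter() - start)
    return mean(elapsed), elapsed

def relative_spread(samples):
    """Assumed variation measure: (max - min) / mean of the measured times."""
    return (max(samples) - min(samples)) / mean(samples)

# Stand-in workload; a real experiment would pass the mining routine
# and a transaction bitmap that was built before the timer starts.
avg, samples = time_mining(lambda data: sum(x * x for x in data), list(range(100_000)))
```

Keeping data loading and layout conversion outside the timed region mirrors the exclusions described above.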
4.2 Results
4.2.1 Comparison to CPU-based Apriori
Figure 6(a) depicts the running time of the five Apriori
implementations on the synthetic dataset T40I10D100K,
Figure 7(a) on the dense dataset Chess, and Figure 8(a) on
the sparse dataset Retail. On these three datasets, our
GPU-based implementations outperform the parallelized
GOETHALS by a factor of 2.7 to 130, and the best Apriori
implementation, BORGELT, by a factor of 1.2 to 24, as
minsup varies. The GPU-based implementations achieve a
larger speedup over the CPU-based ones on the dense
dataset than on the sparse dataset.
The time for data transfer between the GPU memory
and the CPU memory (TRANSFER), candidate generation
(CANDIDATE), and support counting (COUNTING) dominates
the total running time. Thus, we break the total running
time into three parts (TRANSFER, CANDIDATE, and
COUNTING) and present the time breakdown in
Figure 6(b) for the synthetic dataset T40I10D100K, Figure
7(b) for the dense dataset Chess, and Figure 8(b) for the
sparse dataset Retail.
We now analyze the performance differences among the five
Apriori implementations, based on the running time and
time breakdown results on the three datasets in Figure 6,
Figure 7, and Figure 8.
TBI-CPU vs GOETHALS. This comparison shows the
impact of the bitmap representation of the transaction database.
Both implementations adopt a trie to represent itemsets,
so they have roughly the same performance in the candidate
generation step. However, the vertical layout of the
transaction database allows TBI-CPU to perform independent
support counting on different candidate itemsets, which
maximizes the available multi-threaded parallelism. In
contrast, GOETHALS uses a horizontal layout to store
transactions, so it must repeatedly scan the whole
transaction database to do support counting. From the
time breakdown result, we can see that GOETHALS always
has a larger ratio of support counting time. However,
on the sparse dataset Retail, TBI-CPU outperforms
GOETHALS by a factor of only 1.28, due to the large size of
the bitmap representation of the sparse transaction database.
Even though TBI-CPU does not need to scan the whole
bitmap for support counting, accessing part of a large bitmap
(180 MB for Retail) may be as costly as scanning the whole
transaction database of small size (4 MB for Retail).
Implementation | Platform | Candidate Generation | Support Counting | Itemsets | Transactions
PBI-GPU | GPU | Multi-threaded on the GPU | Multi-threaded on the GPU | Bitmap | Bitmap
TBI-GPU | GPU+CPU | Single-threaded on the CPU | Multi-threaded on the GPU | Trie | Bitmap
TBI-CPU | CPU | Single-threaded on the CPU | Multi-threaded on the CPU | Trie | Bitmap
GOETHALS | CPU | Single-threaded on the CPU | Multi-threaded on the CPU | Trie | Horizontal layout
BORGELT | CPU | Single-threaded on the CPU | Single-threaded on the CPU | Trie | Trie
Table 1: Five Apriori Implementations
Dataset | #Items | Avg. Length | #Transactions | Density | Characteristics | Data size | Bitmap size
T40I10D100K | 1,000 | 40 | 100,000 | 4% | Synthetic | 15 MB | 12 MB
Retail | 16,469 | 10.3 | 88,162 | 0.06% | Sparse/Real | 4 MB | 180 MB
Chess | 75 | 37 | 3,196 | 49% | Dense/Real | 335 KB | 30 KB
Table 2: Three experimental datasets
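The difference between the two transaction layouts can be made concrete with a small sketch. It is an illustration under simplifying assumptions, not the code of TBI-CPU or GOETHALS: Python integers stand in for bitmaps, candidates are assumed to be non-empty, and all function names and data are hypothetical.

```python
def support_horizontal(transactions, candidate):
    """Horizontal layout: scan every transaction and test whether it contains the candidate."""
    wanted = set(candidate)
    return sum(1 for t in transactions if wanted.issubset(t))

def to_vertical_bitmaps(transactions):
    """Vertical layout: one integer bitmap per item; bit t is set if transaction t has the item."""
    bitmaps = {}
    for tid, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps

def support_vertical(bitmaps, candidate):
    """Vertical layout: AND only the bitmaps of the candidate's items, then count set bits."""
    items = list(candidate)
    acc = bitmaps.get(items[0], 0)
    for item in items[1:]:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")

transactions = [{1, 2, 3}, {1, 3}, {2, 3}, {1, 2, 3, 4}]
bitmaps = to_vertical_bitmaps(transactions)
```

The horizontal version touches every transaction for every candidate, whereas the vertical version reads only the rows of the items in the candidate. Each of those rows has one bit per transaction, which is why a sparse dataset with many transactions can still make the vertical access expensive.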
TBI-GPU vs TBI-CPU. This comparison investigates
the impact of GPU acceleration on support counting.
TBI-GPU differs from TBI-CPU only in the support counting
step. Although TBI-GPU suffers from intermediate
data transfer between the GPU memory and the CPU memory,
it gains significant performance from the massive SIMD
parallelism provided by the GPU. In particular, on the sparse
dataset Retail, TBI-GPU achieves a 7.8x speedup over TBI-CPU.
The huge bitmap representation of the sparse dataset requires
more memory accesses than that of the dense dataset. In this
case, TBI-GPU is able to hide the large memory latency by
exploiting the massive SIMD parallelism of the GPU. Since our
study focuses on the GPU-based implementation, we have not
exploited data locality in the CPU cache for TBI-CPU.
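The data-parallel pattern behind bitmap-based support counting can be sketched as follows. This is a sequential Python model, not a GPU kernel and not the implementation evaluated here: the 32-bit word width, the function names, and the toy bitmaps are assumptions. Each list element models one parallel lane that ANDs one pair of words and counts its bits, and the final sum models the reduction of the partial counts.

```python
WORD_BITS = 32  # assumed word width, chosen for illustration only

def to_words(bitmap, num_transactions):
    """Split an integer bitmap into fixed-width words, least significant word first."""
    num_words = (num_transactions + WORD_BITS - 1) // WORD_BITS
    mask = (1 << WORD_BITS) - 1
    return [(bitmap >> (i * WORD_BITS)) & mask for i in range(num_words)]

def popcount(word):
    """Number of set bits in one word."""
    return bin(word).count("1")

def support_wordwise(words_a, words_b):
    """One lane per word pair: AND the words and count bits, then reduce the partial counts."""
    partial = [popcount(a & b) for a, b in zip(words_a, words_b)]
    return sum(partial)

# Two toy item bitmaps over 70 transactions (hypothetical data).
num_transactions = 70
bitmap_x = (1 << num_transactions) - 1                        # item x is in every transaction
bitmap_y = sum(1 << t for t in range(0, num_transactions, 2))  # item y is in even-numbered ones
words_x = to_words(bitmap_x, num_transactions)
words_y = to_words(bitmap_y, num_transactions)
```

Because every lane performs the same AND-and-count on a different word, the work maps naturally onto SIMD hardware, and a longer bitmap simply means more independent lanes.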
PBI-GPU vs TBI-GPU. This comparison shows the effect
of the two itemset representations, bitmap-based and
trie-based. PBI-GPU and TBI-GPU invoke exactly the
same support counting procedure on the GPU, so the
performance difference comes only from candidate generation.
The dense dataset Chess has very few items (75 in total),
hence the bitmap representation of itemsets for PBI-GPU is
small. In contrast, the sparse dataset Retail
has many items (16,469 in total), so PBI-GPU must process
a large bitmap of itemsets. Thus, the number of items
determines the performance of PBI-GPU’s candidate
generation. As a result, PBI-GPU outperforms TBI-GPU on the
dense dataset, owing to the smaller bitmap representation
of itemsets, while TBI-GPU is better on the sparse dataset.
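A rough calculation shows how the number of items drives the cost of bitmap-based itemsets. The sketch below assumes one bit per item in each itemset bitmap. The join it shows (OR two k-itemsets and keep unions with k + 1 bits, without the subsequent Apriori pruning) is only one plausible way to generate candidates on bitmaps and is not necessarily the procedure used by PBI-GPU; all names are hypothetical.

```python
def itemset_bitmap_bytes(num_items):
    """Bytes needed for one itemset when each item takes one bit (rounded up)."""
    return (num_items + 7) // 8

def to_itemset_bitmap(itemset):
    """Encode an itemset as an integer with bit i set for item i."""
    bitmap = 0
    for item in itemset:
        bitmap |= 1 << item
    return bitmap

def join_candidates(frequent_k, k):
    """Assumed join: OR every pair of frequent k-itemsets and keep unions of size k + 1."""
    candidates = set()
    for i, a in enumerate(frequent_k):
        for b in frequent_k[i + 1:]:
            union = a | b
            if bin(union).count("1") == k + 1:
                candidates.add(union)
    return candidates

frequent_2 = [to_itemset_bitmap(s) for s in ({0, 1}, {0, 2}, {1, 2}, {2, 3})]
candidates_3 = join_candidates(frequent_2, 2)
```

With one bit per item, an itemset takes 10 bytes for Chess (75 items) but 2,059 bytes for Retail (16,469 items), so every OR and bit count in such a join processes about 200 times more data on Retail.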
PBI-GPU/TBI-GPU vs BORGELT. On all datasets
and minsup values, both GPU-based implementations outperform
the best CPU-based Apriori implementation of FIMI’03,
except that PBI-GPU is 20% slower than BORGELT on the
sparse dataset Retail with minsup 0.01%. The bitmap
structure for representing transactions suits the GPU’s SIMD
parallelism and boosts the performance of both GPU-based
implementations.
To sum up, both of our GPU-based implementations
outperform the CPU-based implementations by up to an order of
magnitude on the sparse dataset, and by up to two orders of
magnitude on the dense dataset. The two GPU-based
implementations gain performance from the bitmap
representation of transactions. PBI-GPU outperforms TBI-GPU on
the dense dataset, while TBI-GPU is better on the sparse
dataset.
4.2.2 Comparison to CPU-based FP-growth
Figure 9: Execution time (sec) of PBI-GPU, TBI-GPU, and
CPU-based FP-growth on T40I10D100K, Chess, and Retail
with minsup 1%, 60%, and 0.01%, respectively.
Figure 9 illustrates the running time of FP-GROWTH,
PBI-GPU, and TBI-GPU on T40I10D100K, Chess, and Retail
with minsup 1%, 60%, and 0.01%, respectively. We can
see that CPU-based FP-growth is faster than both of our
GPU-based implementations by a factor of 4 to 16. This
leaves ample room to explore more efficient GPU-based FIM
algorithms.
Figure 6: Experiments on the synthetic dataset T40I10D100K.
(a) Running time (sec) with various minsup (2.0%, 1.5%, 1.0%);
(b) time breakdown (% of total) into TRANSFER, CANDIDATE, and
COUNTING with minsup 1%. Both panels cover PBI-GPU, TBI-GPU,
TBI-CPU, GOETHALS, and BORGELT.
Figure 7: Experiments on the dense dataset Chess.
(a) Running time (sec) with various minsup (70%, 65%, 60%);
(b) time breakdown (% of total) into TRANSFER, CANDIDATE, and
COUNTING with minsup 60%. Both panels cover PBI-GPU, TBI-GPU,
TBI-CPU, GOETHALS, and BORGELT.
5. CONCLUSION AND FUTURE WORK
We have presented two GPU-based implementations of
the Apriori algorithm for frequent itemset mining. Both
implementations employ a bitmap data structure to encode
the transaction database on the GPU and utilize the GPU’s
SIMD parallelism for support counting. One implementation
stores the itemsets in a bitmap, and runs entirely on
the GPU. The other one utilizes a trie to store the
itemsets, and adopts a GPU-CPU co-processing scheme. The
preliminary evaluation results show that both of our GPU
implementations are up to two orders of magnitude faster
than optimized CPU-based implementations.
We are considering improvements to our current
implementations. For example, our bitmap representation of
transactions is space-inefficient for sparse datasets, so we
are investigating data compression techniques [1]. Moreover,
we are developing a buffering mechanism between the GPU
memory and the CPU memory for memory ping-pong.
We also plan to explore other mining algorithms with
GPU acceleration, for instance, FP-growth and classifica-
tion. In particular, FP-growth and its CPU-based variants
have shown a superior performance; nevertheless, their irreg-
ular data structures and complex algorithmic control pose
great challenges for GPU acceleration.
Finally, it would be desirable to enhance the interactive
features of the mining process, for example, adjusting
support thresholds while mining is in progress. Such
interaction can greatly improve the mining quality.
Figure 8: Experiments on the sparse dataset Retail.
(a) Running time (sec) with various minsup (1.0%, 0.1%, 0.01%);
(b) time breakdown (% of total) into TRANSFER, CANDIDATE, and
COUNTING with minsup 0.01%. Both panels cover PBI-GPU, TBI-GPU,
TBI-CPU, GOETHALS, and BORGELT.
Acknowledgments
The authors thank the anonymous reviewers for their
insightful suggestions. This work was supported by grant
617208 from the Hong Kong Research Grants Council.
6. REFERENCES
[1] Daniel Abadi, Samuel Madden, and Miguel Ferreira.
Integrating compression and execution in
column-oriented database systems. SIGMOD, 2006.
[2] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami.
Mining association rules between sets of items in large
databases. SIGMOD, 1993.
[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast
algorithms for mining association rules. VLDB, 1994.
[4] Lamine M. Aouad, Nhien-An Le-Khac, and Tahar M.
Kechadi. Distributed frequent itemsets mining in
heterogeneous platforms. Journal of Engineering,
Computing and Architecture, 2007.
[5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal
Singh, and Kai Li. The PARSEC benchmark suite:
Characterization and architectural implications.
PACT, 2008.
[6] Ferenc Bodon. A fast apriori implementation. FIMI,
2003.
[7] Gregory Buehrer, Srinivasan Parthasarathy, Shirish
Tatikonda, Tahsin Kurc, and Joel Saltz. Toward
terabyte pattern mining: an architecture-conscious
solution. PPoPP, 2007.
[8] Shuai Che, Michael Boyer, Jiayuan Meng, David
Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A
performance study of general-purpose applications on
graphics processors using CUDA. Journal of Parallel and
Distributed Computing, 2008.
[9] Shengnan Cong, Jiawei Han, Jay Hoeflinger, and
David Padua. A sampling-based framework for
parallel data mining. PPoPP, 2005.
[10] Mohammad El-Hajj and Osmar R. Zaiane. Parallel
leap: Large-scale maximal pattern mining in a
distributed environment. ICPADS, 2006.
[11] Amol Ghoting, Gregory Buehrer, Srinivasan
Parthasarathy, Daehyun Kim, Anthony Nguyen,
Yen-Kuang Chen, and Pradeep Dubey.
Cache-conscious frequent pattern mining on a modern
processor. VLDB, 2005.
[12] Bart Goethals and Mohammed Javeed Zaki. Advances
in frequent itemset mining implementations:
Introduction to fimi’03. FIMI, 2003.
[13] Naga Govindaraju, Jim Gray, Ritesh Kumar, and
Dinesh Manocha. Gputerasort: high performance
graphics co-processor sorting for large database
management. SIGMOD, 2006.
[14] Naga K. Govindaraju, Brandon Lloyd, Wei Wang,
Ming Lin, and Dinesh Manocha. Fast computation of
database operations using graphics processors.
SIGMOD, 2004.
[15] Naga K. Govindaraju, Nikunj Raghuvanshi, and
Dinesh Manocha. Fast and approximate stream
mining of quantiles and frequencies using graphics
processors. SIGMOD, 2005.
[16] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao.
Mining frequent patterns without candidate
generation: A frequent-pattern tree approach. Data
Mining and Knowledge Discovery, 2004.
[17] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K.
Govindaraju, and Tuyong Wang. Mars: a mapreduce
framework on graphics processors. PACT, 2008.
[18] Bingsheng He, Naga K. Govindaraju, Qiong Luo, and
Burton Smith. Efficient gather and scatter operations
on graphics processors. Supercomputing, 2007.
[19] Bingsheng He, Ke Yang, Rui Fang, Mian Lu, Naga K.
Govindaraju, Qiong Luo, and Pedro V. Sander.
Relational joins on graphics processors. SIGMOD,
2008.
[20] http://fimi.cs.helsinki.fi/. FIMI repository.
[21] http://www.adrem.ua.ac.be/
goethals/software/files/apriori.tgz. Apriori
implementation from Bart Goethals.
[22] Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and
Hannu Toivonen. Tane: An efficient algorithm for
discovering functional and approximate dependencies.
The Computer Journal, 1999.
[23] E. Scott Larsen and David McAllister. Fast matrix
multiplies using graphics hardware. Supercomputing,
2001.
[24] Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and
Edward Y. Chang. Pfp: Parallel fp-growth for query
recommendation. ACM Recommender Systems, 2008.
[25] Li Liu, Eric Li, Yimin Zhang, and Zhizhong Tang.
Optimization of frequent itemset mining on
multiple-core processor. VLDB, 2007.
[26] John D. Owens, David Luebke, Naga Govindaraju,
Mark Harris, Jens Krüger, Aaron E. Lefohn, and
Timothy J. Purcell. A survey of general-purpose
computation on graphics hardware. In Computer
Graphics Forum, 2007.
[27] Lance Parsons, Ehtesham Haque, and Huan Liu.
Evaluating subspace clustering algorithms. SDM, 2004.
[28] S. Parthasarathy, M. J. Zaki, M. Ogihara, and W. Li.
Parallel data mining for association rules on shared
memory systems. In Knowledge and Information
Systems, 2001.
[29] Jayaprakash Pisharath, Ying Liu, Wei keng Liao, Alok
Choudhary, Gokhan Memik, and Janaki Parhi.
Nu-minebench 2.0. Technical report, Northwestern
University, 2005.
[30] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and
John D. Owens. Scan primitives for gpu computing. In
Graphics Hardware, 2007.
[31] Yanbin Ye and Chia-Chu Chiang. A parallel apriori
algorithm for frequent itemsets mining. SERA, 2006.
[32] Mohammed J. Zaki. Parallel and distributed
association mining: A survey. IEEE Concurrency,
1999.
[33] Mohammed J Zaki, Srinivasan Parthasarathy,
Mitsunori Ogihara, and Wei Li. New algorithms for
fast discovery of association rules. KDD, 1997.
[34] Jingren Zhou and Kenneth A. Ross. Implementing
database operations using simd instructions.
SIGMOD, 2002.
... Using the bitmap data representation, the support counting is performed using bitwise Ands and Ors operations, recursively. Later in 2009, Fang et al. implemented Apriori in two different ways [51]: One implementation, named PBI, represents transactions using a bitmaplike data structure. The second implementation uses tries and is named TBI. ...
... However, some of its functions can be easily realizable in a parallel fashion. In the reviewed literature, two approaches have been adopted: those algorithms that fully implement Apriori in hardware (with proper modifications) [1, 9-11, 94, 115] and those algorithms that use a heterogeneous target architecture (SW-HW) [28,31,50,51,64,65,69,70,82,91,116,124,130,133]. As a rule, in both approaches, the main acceleration source is based on hardware-friendly data structures, hardware-efficient transaction representations, and efficient data buffering techniques. ...
... As a rule, in both approaches, the main acceleration source is based on hardware-friendly data structures, hardware-efficient transaction representations, and efficient data buffering techniques. In the case of GPU, they were often used as co-processors to accelerate certain functions of Apriori (e.g., candidates generation, vector intersections, and frequency counting) [28,31,50,51,64,65,69,70,91,116,130]. On the contrary, although FPGA was used as co-processors in several architectures, they were usually used to fully implement Apriori using systolic arrays [9,10,82,115,124]. ...
Article
Full-text available
In data mining, Frequent Itemsets Mining is a technique used in several domains with notable results. However, the large volume of data in modern datasets increases the processing time of Frequent Itemset Mining algorithms, making them unsuitable for many real-world applications. Accordingly, proposing new methods for Frequent Itemset Mining to obtain frequent itemsets in a realistic amount of time is still an open problem. A successful alternative is to employ hardware acceleration using Graphics Processing Units (GPU) and Field Programmable Gates Arrays (FPGA). In this article, a comprehensive review of the state of the art of Frequent Itemsets Mining hardware acceleration is presented. Several approaches (FPGA and GPU based) were contrasted to show their weaknesses and strengths. This survey gathers the most relevant and the latest research efforts for improving the performance of Frequent Itemsets Mining regarding algorithms advances and modern development platforms. Furthermore, this survey organizes the current research on Frequent Itemsets Mining from the hardware perspective considering the source of the data, the development platform, and the baseline algorithm.
... Another class of methods contains the hybrid ones as HBSO-TB or are based on other metaheuristics as gravitational system. Concerning parallel class, the exact algorithms are divided in parallel exact algorithms with distributed memory as DD for data distribution, CD for count distribution (Agrawal et al. [5]), etc., shared memory as CCPD, PPCD, etc. ( see Parthasarathy et al. [48] for more details) or those executed on GPU (PBI, TBI (Fang et al. [20], etc.). Mostly, the parallelism concerns the first phase of APRIORI. ...
... Several other approaches have been proposed: a method based on the physical principles of gravity and motion (Khademolghorani, Baraani et al. 2011), another based on particle swarms PSOARM (Particles Swarm Optimization [20] Content courtesy of Springer Nature, terms of use apply. Rights reserved. ...
... One of the first attempts in 2009 included two propositions to parallelize the construction phase of frequent itemsets [20]. The first technique is called PBI (Pure Bitmap Implementation). ...
Article
Full-text available
Association rules mining (ARM) is an unsupervised learning task. It is used to generate significant and relevant association rules among items in a database. APRIORI and FP-GROWTH are the most popular and used algorithms nowadays for extracting such rules. They are exact methods that consist of two phases. First, frequent itemsets are generated. Then, the latter are used to generate rules. The main drawback of both algorithms is their high execution time. To overcome this drawback, metaheuristics have been proposed. Moreover, to optimize the execution time, since the amount of data is in continuous growth, some parallel architectures can be found in the literature. In this paper, we present an overview of existing literature that investigate the ARM problem using both metaheuristics and parallelism. We will focus on the recent algorithms that tackle these problems using approximate methods and GPU. We will present a non-exhaustive classification of different algorithms according to the type of execution (sequential or parallel) and type of method (exact or approximate).
... Fang et al. [21] presented two distinct implementations of the Apriori method for extracting association rules on nextgeneration GPUs. Their method leverages SIMD (single instruction, multiple data) architectures in GPUs. ...
Article
Full-text available
In the domain of data mining, the extraction of frequent patterns from expansive datasets remains a daunting task, compounded by the intricacies of temporal and spatial dimensions. While the Apriori algorithm is seminal in this area, its constraints are accentuated when navigating larger datasets. In response, we introduce an avant-garde solution that leverages parallel network topologies and GPUs. At the heart of our method are two salient features: (1) the use of parallel processing to expedite the realization of optimal results and (2) the integration of the cat and mouse-based optimizer (CMBO) algorithm, an astute algorithm mirroring the instinctual dynamics between predatory cats and evasive mice. This optimizer is structured around a biphasic model: an initial aggressive pursuit by the cats and a subsequent calculated evasion by the mice. This structure is enriched by classifying agents using their objective function scores. Complementing this, our architectural blueprint seamlessly amalgamates dual Nvidia graphics cards in a parallel configuration, establishing a marked ascendancy over conventional CPUs. In amalgamation, our approach not only rectifies the inherent shortfalls of the Apriori algorithm but also accentuates the extraction of association rules, pinpointing frequent patterns with enhanced precision. A comprehensive evaluation across a spectrum of network topologies explains their respective merits and demerits. Set against the benchmark of the Apriori algorithm, our method conspicuously outperforms in terms of speed and effectiveness, heralding a significant stride forward in data mining research.
... Then, they efficiently perform support calculations of candidate itemsets through bitwise AND operations. Fang et al. [48] suggested a pure bitmap implementation (PBI) and a Trie-based variant (TBI). Fang et al. encode the data to an j × k binary matrix where j and k is the number of itemsets and the number of transactions, respectively. ...
Article
Full-text available
Frequent itemset mining is extensively employed as an essential data mining technique. Nevertheless, as the data size grows, the applicability of this method decreases owing to the relatively poor performance of the existing methods. Though numerous efficient sequential frequent itemset mining methods have been developed, the performance that can be achieved is clearly limited by the fact that they exploit only one thread. To overcome these limitations, a number of parallel methods using multi-core central processing units (CPUs), multiple machines or many-core graphic processing units (GPU) have been proposed. However, these methods are relatively slow in performance and have low scalability, mainly owing to large memory requirements for intermediate data, significant disk I/Os, and heavy computation. In this study, to resolve the aforementioned problems, we propose SGMiner{\mathsf {SGMiner}} , which is a new, fast, and scalable GPU- and disk-based method on a single machine equipped with multiple graphic processing units (GPUs) and multiple solid-state drives (SSDs) for extracting frequent patterns. It is based on an algorithm similar to the Apriori algorithm and neither has intermediate data nor large disk I/O overheads owing to its exploitation of SSDs. Moreover, we propose storing transaction databases, namely bitmap transaction chunks, in SSDs, streaming the chunks to GPU device memory via the main memory with reduced I/O overhead, and performing fast support counting with GPUs based on the chunks. In addition, when exploiting multiple GPUs and SSDs, it proposes a concept of replicating bitmap transaction chunks stored in SSDs to GPUs in a streaming fashion. This could allow an almost equal workload to be distributed evenly across multiple GPUs with reduced I/O overheads. The experiments we conducted demonstrate that SGMiner{\mathsf {SGMiner}} outperforms the existing methods in terms of scalability and performance with enhanced robustness.
... W. Fang et al. [7] have introduced two implementations for Apriori using GPUs with Single Instruction, Multiple Data (SIMD) architectures. Both methods use a bitmap data structure. ...
Article
In the modern digital world, online shopping becomes essential in human lives. Online shopping stores like Amazon show up the "Frequently Bought Together" for their customers in their portal to increase sales. Discovering frequent patterns is a fundamental task in Data Mining that find the frequently bought items together. Many transactional data were collected every day, and finding frequent itemsets from the massive datasets using the classical algorithms requires more processing time and I/O cost. A GPU accelerated Novel algorithm for finding the frequent patterns using Vertical Data Format (GNVDF) has been introduced in this research article. It uses a novel pattern formation. In this, the candidate i-itemsets is divided into two buckets viz., Bucket-1 and Bucket-2. Bucket-1 contain all the possible items to form candidate-(i+1) itemsets. Bucket-2 has the items that cannot include in the candidate-(i+1) itemsets. It compactly employs a jagged array to minimize the memory requirement and remove common transactions among the frequent 1-itemsets. It also utilizes a vertical representation of data for efficiently extracting the frequent itemsets by scanning the database only once. Further, it is GPU-accelerated for speeding up the execution of the algorithm. The proposed algorithm was implemented with and without GPU usage and compared. The comparison result revealed that GNVDF with GPU acceleration is faster by 90 to 135 times than the method without GPU.
... GPUs have radically different characteristics than CPUs, including the SIMT model and the need of coalesced memory access, which add additional difficulty in parallelization. As a result, most GPU-based methods are based on Apriori: [12] represents a transaction database as an n ×m binary matrix, where n is the number of itemsets and m is the number of transactions, so that intersection operations on rows can be conducted with a GPU to count support. GPApriori [57] generates a static bitmap that represents all the distinct 1-itemsets and their transaction ID sets. ...
Article
Full-text available
A frequent pattern is a substructure that appears in a database with frequency (aka. support) no less than a user-specified threshold, while a closed pattern is one that has no super-pattern that has the same support. Here, a substructure can refer to different structural forms, such as itemsets, subsequences, subtrees, and subgraphs, and mining such substructures is important in many real applications such as product recommendation and feature extraction. Currently, there lacks a general programming framework that can be easily customized to mine different types of patterns, and existing parallel and distributed solutions are IO-bound rendering CPU cores underutilized. Since mining frequent and/or closed patterns are NP-hard, it is important to fully utilize the available CPU cores. This paper presents such a general-purpose framework called PrefixFPM. The framework is based on the idea of prefix projection which allows a divide-and-conquer mining paradigm. PrefixFPM exposes a unified programming interface to users who can readily customize it to mine their desired patterns. We have adapted the state-of-the-art serial algorithms for mining patterns including subsequences, subtrees, and subgraphs on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of CPU cores.
Article
Full-text available
Association Rule mining (ARM) is well studied and famous optimization problem which finds useful rules from given transactional databases. Many algorithms already proposed in literature which shows their efficiency when dealing with different sizes of datasets. Unfortunately, their efficiency is not enough for handling large scale datasets. In this case, Bees swarm optimization algorithm for association rule mining is more efficient. These kinds of problems need more powerful processors and are time expensive. For such issues solution can be provided by graphics processing units (GPUs) and are massively multithreaded processors. In this case GPUs can be used to increase speed of the computation. Bees swarm optimization algorithm for association rule mining can be designed using GPUs in multithreaded environment which will efficient for given datasets.
Article
Full-text available
Huge amounts of datasets with different sizes are naturally distributed over the network. In this paper we propose a distributed algorithm for frequent itemsets generation on heterogeneous clus-ters and grid environments. In addition to the disparity in the performance and the workload capacity in these environments, other constraints are related to the datasets distribution and their nature, and the middleware structure and overheads. The proposed approach uses a dynamic workload manage-ment through a block-based partitioning, and takes into account inherent characteristics of the Apriori algorithm related to the candidate sets generation. The proposed technique greatly enhances the per-formance and achieves high scalability compared to the existing distributed Apriori-based approaches. This approach is evaluated on large scale datasets distributed over a heterogeneous cluster.
Article
Full-text available
In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm. A lot of data-mining tasks (e.g. association rules, sequential patterns) use complex pointer-based data structures (e.g. hash trees) that typically suffer from suboptimal data locality. In the multiprocessor case shared access to these data structures may also result in false sharing. For these tasks it is commonly observed that the recursive data structure is built once and accessed multiple times during each iteration. Furthermore, the access patterns after the build phase are highly ordered. In such cases locality and false sharing sensitive memory placement of these structures can enhance performance significantly. We evaluate a set of placement policies for parallel association discovery, and show that simple placement schemes can improve execution time by more than a factor of two. More complex schemes yield additional gains.
Conference Paper
Full-text available
We present a strategy for mining frequent itemsets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious data mining, taking into account the capabilities of the proces- sor, the memory hierarchy and the available network in- terconnects. Optimizations have been designed for lower- ing communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory and I/O utilization using pruning and tiling tech- niques, and smart data placement strategies are also em- ployed. We leverage the extended memory space and com- putational resources of a distributed message-passing cluster to design a scalable solution, where each node can extend its meta structures beyond main memory by leveraging 64- bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy result in near-linear scaleup on up to 48 nodes.
Article
The discovery of functional dependencies from relations is an important database analysis technique. We present TANE, an efficient algorithm for finding functional dependencies from large databases. TANE is based on partitioning the set of rows with respect to their attribute values, which makes testing the validity of functional dependencies fast even for a large number of tuples. The use of partitions also makes the discovery of approximate functional dependencies easy and efficient and the erroneous or exceptional rows can be identified easily. Experiments show that TANE is fast in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.
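The partition-based validity test described in this abstract can be illustrated with a minimal Python sketch. It is not TANE itself: it omits the algorithm's levelwise search and its compact partition representation, and the names `partition`, `fd_holds`, `exceptional_rows`, and the sample `rows` are hypothetical, introduced only for illustration. The sketch assumes rows are dictionaries and relies on the fact that a dependency X -> A holds exactly when refining the partition on X by A creates no additional equivalence classes.

```python
from collections import Counter, defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows.

    The dependency holds iff partitioning on lhs and on lhs + [rhs]
    yields the same number of equivalence classes.
    """
    return len(partition(rows, lhs)) == len(partition(rows, list(lhs) + [rhs]))


def exceptional_rows(rows, lhs, rhs):
    """Return indices of rows that disagree with the majority rhs value
    inside their lhs class; removing them makes lhs -> rhs hold."""
    bad = []
    for cls in partition(rows, lhs):
        majority, _ = Counter(rows[i][rhs] for i in cls).most_common(1)[0]
        bad.extend(i for i in cls if rows[i][rhs] != majority)
    return sorted(bad)


# Hypothetical sample relation (illustrative data only).
rows = [
    {"zip": "10001", "city": "NYC", "name": "a"},
    {"zip": "10001", "city": "NYC", "name": "b"},
    {"zip": "94105", "city": "SF", "name": "c"},
    {"zip": "94105", "city": "Oakland", "name": "d"},
    {"zip": "94105", "city": "SF", "name": "e"},
]
```

On this sample, `fd_holds(rows, ["zip"], "city")` fails because one `zip` class splits in two once `city` is added, and `exceptional_rows` points at the single row responsible. That is the sense in which partitions make approximate dependencies and their exceptional rows easy to identify.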
Conference Paper
We present a novel external sorting algorithm using graphics processors (GPUs) on large databases composed of billions of records and wide keys. Our algorithm uses the data parallelism within a GPU along with task parallelism by scheduling some of the memory-intensive and compute-intensive threads on the GPU. Our new sorting architecture provides multiple memory interfaces on the same PC -- a fast and dedicated memory interface on the GPU along with the main memory interface for CPU computations. As a result, we achieve higher memory bandwidth as compared to CPU-based algorithms running on commodity PCs. Our approach takes into account the limited communication bandwidth between the CPU and the GPU, and reduces the data communication between the two processors. Our algorithm also improves the performance of disk transfers and achieves close to peak I/O performance. We have tested the performance of our algorithm on the SortBenchmark and applied it to large databases composed of a few hundred gigabytes of data. Our results on a 3 GHz Pentium IV PC with a $300 NVIDIA 7800 GT GPU indicate a significant performance improvement over optimized CPU-based algorithms on high-end PCs with 3.6 GHz Dual Xeon processors. Our implementation is able to outperform the current high-end PennySort benchmark and results in a higher performance-to-price ratio. Overall, our results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.
Article
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been widely studied in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, the FP-tree, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
Article
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general‐purpose computation to graphics hardware. We begin with the technical motivations that underlie general‐purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general‐purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general‐purpose application development on graphics hardware.