Information Sciences 439–440 (2018) 19–38
Contents lists available at ScienceDirect
Information Sciences
journal homepage: www.elsevier.com/locate/ins
GMiner: A fast GPU-based frequent itemset mining method
for large-scale data
Kang-Wook Chon, Sang-Hyun Hwang, Min-Soo Kim
DGIST (Daegu Gyeongbuk Institute of Science and Technology), Daegu, Republic of Korea
ARTICLE INFO
Article history:
Received 19 March 2017
Revised 16 January 2018
Accepted 25 January 2018
Available online 31 January 2018
Keywords:
Frequent itemset mining
Graphics processing unit
Parallel algorithm
Workload skewness
ABSTRACT
Frequent itemset mining is widely used as a fundamental data mining technique. However,
as the data size increases, the relatively slow performances of the existing methods hinder
its applicability. Although many sequential frequent itemset mining methods have been
proposed, there is a clear limit to the performance that can be achieved using a single
thread. To overcome this limitation, various parallel methods using multi-core CPU, multi-
ple machine, or many-core graphic processing unit (GPU) approaches have been proposed.
However, these methods still have drawbacks, including relatively slow performance, data
size limitations, and poor scalability due to workload skewness. In this paper, we pro-
pose a fast GPU-based frequent itemset mining method called GMiner for large-scale data.
GMiner achieves very fast performance by fully exploiting the computational power of
GPUs and is suitable for large-scale data. The method performs mining tasks in a coun-
terintuitive way: it mines the patterns from the first level of the enumeration tree rather
than storing and utilizing the patterns at the intermediate levels of the tree. This approach
is quite effective in terms of both performance and memory use in the GPU architecture.
In addition, GMiner solves the workload skewness problem from which the existing par-
allel methods suffer; as a result, its performance increases almost linearly as the number
of GPUs increases. Through extensive experiments, we demonstrate that GMiner signifi-
cantly outperforms other representative sequential and parallel methods in most cases, by
orders of magnitude on the tested benchmarks.
©2018 The Authors. Published by Elsevier Inc.
This is an open access article under the CC BY-NC-ND license.
( http://creativecommons.org/licenses/by-nc-nd/4.0/ )
1. Introduction
As a fundamental data mining technique, frequent itemset mining is applied in a wide range of disciplines such as
market basket analysis, web usage mining, social network analysis, intrusion detection, bioinformatics, and recommendation
systems. However, the deluge of data generated by automated systems for diagnostic or analysis purposes makes it difficult
or even impossible to apply mining techniques in many real-world applications. The existing methods often fail to find
frequent itemsets in such big data within a reasonable amount of time. Thus, in terms of computational time, itemset
mining is still a challenging problem that has not yet been completely solved.
Corresponding author.
E-mail addresses: kw.chon@dgist.ac.kr (K.-W. Chon), sanghyun@dgist.ac.kr (S.-H. Hwang), mskim@dgist.ac.kr (M.-S. Kim).
https://doi.org/10.1016/j.ins.2018.01.046
0020-0255/© 2018 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license.
( http://creativecommons.org/licenses/by-nc-nd/4.0/ )
Many sequential frequent itemset mining methods such as Apriori [2] , Eclat [35] , FP-Growth [14] , and LCM [30] use a sin-
gle CPU thread. However, these single-threaded applications all have a fundamental mining performance limit because CPU
clock speed is generally no longer increasing. To overcome the single-thread performance limit, multiple parallel frequent
itemset mining methods have been proposed. These methods can be categorized into three main groups: (1) (CPU-based)
multi-threaded methods, (2) distributed methods, and (3) graphic processing unit (GPU)-based methods. We omit the term
"multi-thread" from the GPU-based methods because they are obviously multi-threaded. The first group focuses on accel-
erating the performance of the single-threaded methods by exploiting multi-core CPUs [20,24,26,27,29] , while the second
group tries to accelerate the performance by exploiting multiple machines [12,17,19] . Details about these methods are avail-
able in recent survey studies [9,31] .
The third group, namely, the GPU-based methods, focuses on accelerating the performance by exploiting many-core GPUs
[7,15,18,28,37–39] . Due to the higher theoretical computing performance of GPUs for certain types of tasks compared with
CPUs, it has become increasingly important to exploit the capabilities of GPUs in a wide range of problems, including fre-
quent pattern mining. However, existing GPU-based methods all suffer from data size limitations due to limited GPU mem-
ory. GPU memory tends to be much smaller than main memory. Most of the methods can only find frequent patterns in
data loaded into GPU memory, which includes the input transaction data and intermediate data generated at the interme-
diate levels of the pattern space. To the best of our knowledge, Frontier Expansion [38] is the only method in this group
that can handle larger input transaction data than GPU memory while simultaneously exploiting multiple GPUs. However, it
still cannot address the same data sizes as CPU-based methods, because it cannot store sufficiently large amounts of data at
intermediate levels of the pattern space in GPU memory.
Most existing parallel methods of the above three groups also suffer from the problem of workload skewness. Workload
skewness is extremely common and significantly affects parallel computing performance. The existing parallel methods usu-
ally divide the search space of the patterns to be explored into multiple chunks (e.g., equivalence classes) and assign each
chunk to a processor (or machine). Each subtree of the enumeration tree tends to have a different workload size. As a result,
these methods are not particularly scalable in terms of the number of CPUs, machines, or GPUs. That is, their performance
does not increase proportionally as the number of processors increases.
In this paper, we propose a fast GPU-based frequent itemset mining method called GMiner for large-scale data. Our GMiner method achieves high speed by fully exploiting the computational power of GPUs. It can also address the same data sizes as CPU-based methods; that is, it solves the main drawback of the existing GPU-based methods. GMiner achieves this by mining the patterns from the first level of the enumeration tree rather than storing and utilizing the patterns at intermediate levels of the tree. This strategy might look simple, but it is quite effective in terms of performance and memory usage for GPU-based methods. We call this strategy the Traversal from the First Level (TFL) strategy. The TFL strategy does not store any projected database or frequent itemsets from the intermediate levels of the enumeration tree in GPU memory; instead, it finds all the frequent itemsets using only the frequent itemsets from the first level, denoted as F_1. This strategy reduces the amount of GPU memory used and simultaneously, and somewhat paradoxically, improves the performance. This result seems counterintuitive but makes sense in a GPU architecture, where the gap between processor speed and memory speed is quite large. In most cases, mining the frequent n-itemsets by performing a large amount of computation based on a small F_1 set is faster than mining the same result by performing a smaller amount of computation based on a large set of frequent (n−1)-itemsets under the GPU architecture. Using the TFL strategy, GMiner improves upon the performance of the representative parallel methods, including multi-threaded, distributed, and GPU-based methods, by orders of magnitude. In addition to the TFL strategy, we also propose a strategy called Hopping from the Intermediate Level (HIL) to further improve the performance on datasets that contain long patterns. Intuitively, the HIL strategy reduces the required computation by utilizing more GPU memory, thereby improving the performance for long patterns. In addition to fast mining with efficient memory usage, GMiner solves the workload skewness problem of the existing parallel methods. As a result, GMiner's performance increases almost linearly as the number of GPUs increases. To solve the workload skewness problem, we propose the concepts of a transaction block and a relative memory address. The former is a fixed-size chunk of bitwise representations for transactions, while the latter is an array representation for candidate itemsets. For parallel processing, GMiner does not divide the search space of the enumeration tree into sub-trees; instead, it divides an array of relative memory addresses into multiple subarrays, all of which have the same size. Then, GMiner stores a subarray in each GPU and performs mining by streaming transaction blocks to all the GPUs so that each GPU is assigned almost the same workload. The main contributions of this paper are as follows:
• We propose a new, fast GPU-based frequent itemset mining method named GMiner that fully exploits the GPU architecture by performing a large amount of computation on a small amount of data (i.e., frequent 1-itemsets).
• We propose a strategy called HIL that can further improve the performance on datasets that contain long patterns by performing a moderate amount of computation based on a moderate amount of data.
• We propose a method to solve the workload skewness problem by splitting an array of relative memory addresses for candidate itemsets among GPUs and streaming transaction blocks to all GPUs.
• Through experiments, we demonstrate that GMiner significantly outperforms most of the state-of-the-art methods that have been addressed in recent studies [4,9,25,31,33] on two kinds of benchmarks.
The source code for GMiner is available at https://infolab.dgist.ac.kr/GMiner . The remainder of this paper is organized
as follows. Section 2 discusses the related work. We propose the TFL strategy in Section 3 , and in Section 4 , we propose
Table 1
Categorization of the existing frequent itemset mining methods.

                                    Sequential (CPU)            Parallel
                                                                Multi-threaded (CPU)   Distributed (CPU)   GPU
  Relative computational power      Low                         Medium                 High                High
  Difficulty of workload balancing  N/A                         Medium                 High                High
  Network communication overhead    X                           X                      O                   X
  Processor memory limit            X                           X                      X                   O
  Representative methods (used in   Apriori (Borgelt) [6],      FP-Array [20],         MLlib [3]           TBI [7], GPApriori [37],
  experimental study)               Eclat (Borgelt) [6],        ShaFEM [32],                               Frontier Expansion [38]
                                    Eclat (Goethals) [10],      MC-Eclat [26]
                                    LCM [30], FP-Growth* [11]
the HIL strategy. In Section 5, we present a method that exploits multiple GPUs and the cost model of GMiner. Section 6 presents the results of experimental evaluations, and Section 7 summarizes and concludes this paper.
2. Related work
The frequent itemset mining problem is usually defined as the problem of determining all itemsets F that occur as a subset of at least a pre-defined fraction minsup of the transactions in a given transaction database D = {t_1, t_2, ..., t_n}, where each transaction t_i is a subset of items from I [1,13]. In this paper, we mainly use the number of occurrences, instead of a fraction, as the support of an itemset. Many sequential and parallel frequent itemset mining methods have been proposed. We categorize the parallel methods into three groups: (1) (CPU-based) multi-threaded methods, (2) distributed methods, and (3) GPU-based methods. Their characteristics and representative methods are summarized in Table 1 and are explained in detail in Sections 2.1–2.4.
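To make the problem definition concrete, the following minimal sketch enumerates the frequent itemsets of a toy transaction database by brute force. It is illustrative only: the database, the item names, and the brute-force search are invented for this example and bear no relation to the algorithms discussed below.

```python
from itertools import combinations

# Toy transaction database D; each transaction t_i is a set of items from I.
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
minsup = 2  # support threshold as an absolute count, as used in this paper

def support(itemset, D):
    """Number of transactions that contain the itemset as a subset."""
    return sum(1 for t in D if itemset <= t)

items = sorted(set().union(*D))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand), D)
        if s >= minsup:
            frequent[cand] = s

print(frequent)  # e.g., ('a',): 3 and ('a', 'c'): 2 are frequent
```

Brute-force enumeration is exponential in |I|; the methods surveyed below exist precisely to avoid this.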
2.1. Sequential methods
Many sequential methods have been proposed for frequent pattern mining. The representative methods include Apriori
[2], Eclat [35], LCM [30], and FP-Growth [14]. Apriori is based on the anti-monotone property: if a k-itemset is not frequent, then its supersets can never become frequent. Apriori repeatedly generates candidate (k+1)-itemsets C_{k+1} from the frequent k-itemsets F_k (where k ≥ 1) and computes the support of C_{k+1} over the database D for testing. Borgelt [6] is a well-known
implementation of Apriori that exploits a prefix tree to represent the transaction database and finds frequent itemsets di-
rectly with the prefix tree to calculate support efficiently. Eclat [35] uses the equivalence class concept to partition the
search space into multiple independent subspaces (i.e., subproblems). Its vertical data format makes it possible to perform
support counting efficiently by set intersection. Goethals et al. [10] and Borgelt [6] are well-known implementations of Eclat
that optimize it using the diffset [34] representation for candidate itemsets and transactions. The superiority of both meth-
ods to other vertical methods has been demonstrated on the Frequent Itemset Mining Implementations (FIMI) competitions
(i.e., FIMI03 and FIMI04) [8] . LCM is a variation of Eclat that combines various techniques such as a bitmapped database,
prefix tree, and the occurrence deliver technique. As a result, LCM achieved the overall best performance among sequen-
tial methods in the FIMI04 competition. FP-Growth [14] builds an FP-Tree from the database and recursively finds frequent
itemsets by traversing the FP-Tree without explicit candidate generation. It outperforms the Apriori-based methods in many
cases. FP-Growth
is a well-known implementation of FP-Growth that reduces the number of tree traversals by exploiting
additional array data structures. FP-Growth
’s superiority was demonstrated in the FIMI03 competition.
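The Apriori candidate-generation step described above can be sketched as follows. This is a minimal illustration of the join-and-prune idea under the anti-monotone property, not Borgelt's optimized prefix-tree implementation; the function name and the example F_2 are invented for this sketch.

```python
from itertools import combinations

def apriori_gen(F_k):
    """Generate candidate (k+1)-itemsets C_{k+1} from frequent k-itemsets F_k,
    pruning any candidate that has an infrequent k-subset (anti-monotone property)."""
    F_k = set(F_k)
    k = len(next(iter(F_k)))
    candidates = set()
    for a in F_k:
        for b in F_k:
            u = a | b
            if len(u) == k + 1:
                # prune: every k-subset of u must itself be frequent
                if all(frozenset(s) in F_k for s in combinations(u, k)):
                    candidates.add(u)
    return candidates

F2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
C3 = apriori_gen(F2)
print(C3)  # only {'a','b','c'} survives; {'b','c','d'} is pruned since {'c','d'} is not in F2
```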
2.2. Multi-threaded methods
Many efforts have been made to parallelize sequential methods using multiple threads to improve the performance
[20,26,32] . FP-Array [20] , based on FP-Growth, utilizes a cache-conscious FP-Array built from a compact FP-Tree and a lock-
free tree construction algorithm. In an experimental study, FP-Array improved the performance by up to six times on eight
CPU cores. MC-Eclat [26] is a parallel method based on Eclat. MC-Eclat utilizes three parallel mining approaches, namely, in-
dependent, shared, and hybrid mining, and it greatly improves the performance on relatively small datasets. ShaFEM [32] is
a parallel method that dynamically chooses mining strategies based on dataset density. In detail, it switches between FP-
Growth and Eclat based on dataset characteristics. In many cases, multi-threaded methods greatly improve the performance
compared to sequential methods. However, they tend to require more memory than the sequential methods, owing to the large amounts of memory used by the independent threads; consequently, they can fail with out-of-memory errors on some datasets that sequential methods handle successfully.
2.3. Distributed methods
In theory, distributed methods that exploit many machines can address large-scale data. Several distributed methods
[3,19,22] have been proposed, all of which are based on a shared-nothing framework such as Hadoop or Spark. Lin et al.
[19] proposed parallel methods based on Hadoop for the Apriori approach. Moens et al. [22] proposed Dist-Eclat and Big-
FIM. Dist-Eclat is based on the Eclat approach and BigFIM is a hybrid approach between Apriori and Eclat. MLlib of Spark
[3] includes a parallel version of FP-growth called PFP. PFP is an in-memory distributed method that runs on a cluster of
machines. It builds independent FP-Trees and then performs frequent itemset mining independently on each FP-Tree in each
machine. Although the distributed methods should be able to handle larger data, or greatly improve the performance by
adding more machines, they do not show such results in many cases due to workload skewness. According to the experi-
mental results (which will be presented in Section 6), distributed methods can perform even worse than multi-threaded methods that use a single machine due to the large amount of network communication overhead.
2.4. GPU-Based methods
Modern GPUs have many computing cores that allow multiple simultaneous executions of a kernel, which is a user-
defined function. In addition, using GPUs in a single machine does not involve network communication overhead. GPUs
have radically different characteristics than CPUs, including the Single Instruction, Multiple Threads (SIMT) model and the
importance of coalesced memory access. These differences make it difficult to apply most parallel methods using complex
data structures (e.g., FP-Array) to GPUs directly and efficiently. Thus, most GPU-based methods have been proposed based
on Apriori [7,15,18,28,31,37] .
Fang et al. [7] presented two GPU-based methods: Pure Bitmap Implementation (PBI) and Trie-Based Implementation
(TBI). These methods represent a transaction database as an n × m binary matrix, where n is the number of items and
m is the number of transactions, thereby making it suitable for the GPU architecture. These methods perform intersection
operations on rows of the binary matrix using a GPU to count support. PBI and TBI outperform the existing sequential
Apriori methods, such as the Apriori implementation written by Borgelt [6] , by factors of 2–10. However, according to Fang
et al. [7], these methods are outperformed by the existing parallel FP-Growth methods by factors of 4–16 on the PARSEC benchmark [5]. TBI is superior to PBI in terms of the number of candidate itemsets that can be handled simultaneously;
therefore, we compare TBI with our method in Section 6 .
Zhang et al. [37] presented GPApriori, which generates a so-called static bitmap that represents all the distinct 1-itemsets
and their tidsets. Similar to other GPU-based Apriori methods, GPApriori uses a GPU only to parallelize the support counting
step. The candidate generation step is performed using CPUs. GPApriori adopts multiple optimizations, such as pre-loading
candidate itemsets into the shared GPU memory and using hand-tuned GPU block sizes. Consequently, it shows a speed-
up of up to 80 times on a small dataset that can fit into GPU memory compared with some sequential Apriori methods
(e.g., that of Borgelt [6] ). However, according to Zhang et al. [37] , GPApriori could not outperform state-of-the-art sequential
methods such as FP-Growth* [11], Eclat [6], and LCM [30].
In [28] , the authors proposed a parallel version of the Dynamic Counting Itemset algorithm (DCI) [23] , a variation of
Apriori in which two major DCI operations, namely, intersection and computation, are parallelized using a GPU. They pro-
posed two strategies: a transaction-wise approach (called tw ) and a candidate-wise approach (called cw ). The tw strategy
uses all GPU cores for the same candidate simultaneously, and each thread oversees a part of the data, while the cw strategy
handles many candidate itemsets simultaneously. We omit these methods in Table 1 and in our experiments, because the
tw strategy is almost the same as TBI, and the cw strategy works only for very small datasets [28].
The above three Apriori-based methods, which use GPUs, have a common serious drawback: they cannot handle datasets
larger than GPU memory. Therefore, using them for real large-scale datasets is difficult because GPU memory is quite limited
(e.g., to a few GB). In addition, the above methods did not outperform the representative sequential methods (e.g., LCM) or the representative multi-threaded methods (e.g., FP-Array) [7,28,37].
According to the recent survey papers on frequent itemset mining [9,31] , Frontier Expansion [38] is the only GPU-based
method that can handle datasets larger than GPU memory. Frontier Expansion is based on Eclat rather than Apriori, and
it utilizes multiple GPUs. The authors showed that it outperforms the sequential Eclat and FP-Growth methods [38] , which
were previously known to be the fastest methods in their categories. However, it fails to outperform some state-of-the-
art multi-threaded methods such as FP-Array (as shown by our experimental results in Section 6 ). We found that Frontier
Expansion’s failure is due to three major drawbacks: (1) it stores a large amount of intermediate-level data in GPU memory
(wasting GPU clock cycles); (2) it has a large data transfer overhead between main memory and GPU memory; and (3) it is
not scalable in terms of the number of GPUs. We will explain how the proposed GMiner method addresses these drawbacks in Sections 3–5.
3. TFL strategy
For fast frequent itemset mining, even for large-scale data, GMiner uses the Traversal from the First Level (TFL) strategy
of mining the patterns from the first level, i.e., F_1, of the enumeration tree. The TFL strategy does not store any projected
database or frequent itemsets from the intermediate levels of the enumeration tree in GPU memory; instead, it finds the entire set of frequent itemsets using only F_1. This approach significantly reduces GPU memory usage; thus, it can address large-scale
data without encountering out-of-memory problems. In addition, to eliminate the data transfer overhead between main
memory and GPU memory, GMiner performs pattern mining while streaming transaction databases from main memory to
GPU memory. Here, GMiner splits the transaction database into blocks and streams them to GPUs. This block-based stream-
ing approach allows us to solve the workload skewness problem, as explained in Section 5 . Sections 3.1 and 3.2 explain the
transaction blocks and the block-based streaming approach, respectively. Section 3.3 presents the algorithm that implements
the TFL strategy.
3.1. Transaction blocks
It is important that the data structures are simple and use a regular memory access pattern to fully exploit the com-
putational power of GPUs in terms of workload balance among thousands of GPU cores and coalesced memory access. In
general, compared with CPUs, the arithmetic and logic units (ALUs) and memory scheme of GPUs are not efficient for han-
dling complex or variable-sized data structures, including sets, lists, maps, and their combinations. Furthermore, GPUs have
only limited memory, which is a major obstacle for frequent itemset mining on large-scale and/or dense datasets using
GPUs.
For computational efficiency, GMiner adopts a vertical bitmap layout for data representation. The horizontal layout and
vertical tidset layout are too complex and irregular to maximize GPU computational efficiency. Frequent itemset mining
using the vertical bitmap layout relies heavily on bitwise AND operations among large-scale bitmaps, where GPUs have an
overwhelming advantage over CPUs.
Moreover, the vertical bitmap layout allows us to easily partition the input database vertically into subdatabases, each of
which can fit in main memory or GPU memory. Hereafter, we denote an input database D in the vertical bitmap layout as a
transaction bitmap . We define the vertical partitioning of a transaction bitmap in Definition 1 .
Definition 1 (Transaction bitmap partition). We vertically divide the transaction bitmap TB into R non-overlapping partitions of the same width and denote them by TB_{1:R}, where TB_k denotes the k-th transaction bitmap partition (1 ≤ k ≤ R).
As in other frequent itemset mining methods, GMiner begins by mining the frequent 1-itemsets F_1; therefore, the size of TB is |F_1| × |D| in bits, where |D| is the total number of transactions. When we denote the width of a single partition of the transaction bitmap as W, the size of TB_k becomes |F_1| × W. If the number of transactions of the last partition TB_R is less than W, GMiner pads the partition with 0 values to guarantee the width of W.
The parameter W should be set to a sufficiently small value to fit each TB_k into GPU memory. For instance, we typically set W to 262,144 transactions in our experimental evaluation, which equals 262,144/8 = 32 KB for each 1-itemset. We consider each TB_k of size |F_1| × W as a transaction block. The transaction blocks are allocated consecutively in main memory (or stored as chunks in secondary storage, similar to disk pages).
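A CPU-side sketch of this partitioning, in Python for brevity: the function name, the list-of-bits representation, and the toy database are illustrative assumptions, not the actual bit-packed CUDA data structures.

```python
# Build a vertical transaction bitmap for the frequent 1-itemsets F_1 and split
# it vertically into R transaction blocks of width W, zero-padding the last
# block so that every TB_k has the same fixed size.
def build_transaction_blocks(D, F1, W):
    blocks = []
    n = len(D)
    R = (n + W - 1) // W  # number of partitions TB_{1:R}
    for k in range(R):
        lo, hi = k * W, min((k + 1) * W, n)
        tb = {}
        for item in F1:
            bits = [1 if item in D[t] else 0 for t in range(lo, hi)]
            bits += [0] * (W - len(bits))  # pad TB_R with 0s up to width W
            tb[item] = bits
        blocks.append(tb)
    return blocks

D = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}, {"a"}]
blocks = build_transaction_blocks(D, F1=["a", "b"], W=4)
print(len(blocks))     # 2 blocks for 5 transactions with W = 4
print(blocks[1]["a"])  # [1, 0, 0, 0] -- the last block is padded with zeros
```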
A frequent 1-itemset x (x ∈ F_1) has a bit vector of length |D| in TB, which is subdivided into R bit vectors of length W. We denote the bit vector of x within TB_k as TB_k(x). As mentioned above, TB contains only the bit vectors for frequent 1-itemsets. Thus, if x is a frequent n-itemset, x has n bit vectors in TB_k, i.e., {TB_k(i) | i ∈ x}. We define a set of physical pointers to the bit vectors for a frequent itemset x in the transaction bitmap in Definition 2.
Definition 2 (Relative memory address). We define the relative memory address of an item i, denoted as RA(i), as the distance in bytes from the starting memory address of TB_k to that of TB_k(i), for a transaction block TB_k. Then, we define the set of relative memory addresses of a frequent itemset x, denoted as RA(x), as {RA(i) | i ∈ x}.
This concept facilitates fast access to the memory locations of an itemset within a single transaction block in main memory or GPU memory. RA(x) is used as an identifier for an itemset x in GMiner. We denote the number of items in x as |x| and the number of distinct memory addresses of RA(x) as |RA(x)|. Then, |x| = |RA(x)|, because each item i ∈ x has its own unique memory address in TB_k. We note that RA(x) for a frequent itemset x does not change across all TB_k (1 ≤ k ≤ R); that is, it always has the same relative memory addresses because the size of TB_k is fixed.
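Because every bit vector in a block has the same fixed byte width, relative addresses can be derived purely from each item's row index, which is why RA(x) is identical in every TB_k. The following sketch illustrates this; the row ordering and function names are assumptions for the example.

```python
# Each frequent 1-itemset occupies one row of W/8 bytes inside a transaction
# block, so RA(i) is simply row_index(i) * (W // 8), identical for every TB_k.
def relative_addresses(F1, W):
    row_bytes = W // 8
    return {item: idx * row_bytes for idx, item in enumerate(F1)}

def RA(x, ra_table):
    """RA(x) = {RA(i) | i in x}; used as the identifier of itemset x."""
    return {ra_table[i] for i in x}

ra = relative_addresses(F1=["a", "b", "c"], W=262144)  # 32 KB per row, as in the paper
print(RA({"a", "c"}, ra))  # e.g., {0, 65536}
```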
3.2. Nested-Loop streaming
GMiner finds frequent itemsets using the candidate generation and testing approach with breadth-first search (BFS), as
in Apriori, which repeats two major steps, namely, candidate generation and testing (support counting), at each level of
an itemset lattice. Generally, the testing step is more computationally intensive than the candidate generation step. Thus,
GMiner focuses on accelerating the testing step by exploiting GPUs. The candidate generation step is performed using CPUs.
GMiner uses BFS traversal rather than DFS traversal (e.g., equivalence classes) to fully exploit the massive parallelism
of GPUs and achieve better workload balance. When using BFS traversal, the number of frequent itemsets at a certain level
could become too large to be stored in the limited GPU memory and used for support counting of the candidate itemsets of
the next level. The increase in the number of transactions makes the problem more difficult. Therefore, existing GPU-based
methods for mining large-scale datasets (such as Frontier Expansion [38] ) use a DFS approach that tests only the frequent
and candidate itemsets of an equivalence class within GPU memory. However, the use of this DFS approach on GPUs could
degrade the performance of itemset mining due to lack of parallelism and workload skewness, which will be shown in
Section 6 .
Our proposed TFL strategy solves the issue of mining frequent itemsets in large-scale datasets without degrading the performance within limited GPU memory. We call the entire set of frequent 1-itemsets the first level in the itemset lattice and call the other levels in the itemset lattice intermediate levels. Most of the existing frequent itemset mining methods materialize frequent itemsets in intermediate levels to reduce computational overhead, but this approach greatly increases the space overhead. For example, AprioriTid materializes n-itemsets when finding (n+1)-itemsets, and Eclat materializes the itemsets that have the same prefix. However, this approach can suffer from a lack of main memory due to the large amount of intermediate data. Moreover, this tendency is more marked when exploiting GPUs, because GPU memory is limited compared to main memory. The proposed TFL strategy tests all the candidate itemsets of intermediate levels using only the first level, i.e., F_1. This feature is based on the observation that GPUs have high computational power, especially for massive bitwise operations, but relatively small device memory. Our observation indicates that, in the GPU architecture, testing the candidate (n+1)-itemsets using frequent 1-itemsets tends to be much faster than testing the candidate (n+1)-itemsets using frequent n-itemsets (i.e., F_n). This speed difference occurs because copying F_n to GPU memory incurs a much larger data transfer overhead than does copying only F_1, and simultaneously, accessing F_n in GPU memory incurs more non-coalesced memory accesses than does accessing F_1.
For mining large-scale databases, we also propose a new itemset mining technique on GPUs called nested-loop streaming. Here, a single series of candidate generation and testing steps constitutes an iteration. GMiner performs nested-loop streaming at each iteration. This technique copies the candidate itemsets to GPUs as the outer operand. Specifically, it copies only the relative memory addresses of the itemsets to the GPUs rather than the itemsets themselves. We denote the candidate itemsets at level L as C_L. The proposed technique copies RA(C_L) = {RA(x) | x ∈ C_L} to the GPUs (hereafter, when there is no ambiguity, we simply denote RA(C_L) as RA). The technique also copies transaction blocks of the first level (i.e., TB_{1:R}) to GPUs as the inner operand. We note that the outer operand, RA, or the inner operand, TB_{1:R}, or both, might not fit in GPU memory. Thus, the proposed technique partitions the outer operand RA into RA_{1:Q} and copies each RA_j to the GPUs individually (1 ≤ j ≤ Q). Then, for each RA_j, it streams each piece of the inner operand, i.e., transaction block TB_k, to the GPUs (1 ≤ k ≤ R). In most intermediate levels, the outer operand, RA, is much smaller than the inner operand, TB. In particular, when the entire RA can be kept in GPU memory (i.e., Q = 1), streaming TB_k to the GPUs becomes the major operation of this technique.
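The partition-and-stream pattern described above can be sketched sequentially in Python. This is a minimal model, not GMiner's implementation: the buffer size and the `partial_support` callback (standing in for the GPU kernel) are illustrative assumptions, and bit vectors are modeled later in the section.

```python
def nested_loop_streaming(RA, TB, ra_buf_size, partial_support):
    """Sketch of nested-loop streaming: partition the outer operand RA
    into RA_1..RA_Q, then for each partition stream every transaction
    block TB_k (inner operand). partial_support stands in for the GPU
    kernel that computes sigma_x(TB_k)."""
    # Q partitions of RA, each small enough for the (hypothetical) RABuf.
    partitions = [RA[i:i + ra_buf_size] for i in range(0, len(RA), ra_buf_size)]
    support = {x: 0 for x in RA}
    for RA_j in partitions:          # copy RA_j into RABuf
        for TB_k in TB:              # stream TB_k into TBBuf
            for x in RA_j:           # one GPU block per candidate x
                support[x] += partial_support(x, TB_k)  # PS_{j,k}
    return support
```

On the GPU, the three loop bodies overlap via asynchronous streams; the sketch only captures the iteration order and the accumulation of per-block partial supports into full supports.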
For each pair (RA_j, TB_k), GMiner calculates the partial supports of x ∈ RA_j within TB_k. We denote the partial supports for (RA_j, TB_k) as PS_{j,k}. We formally define the partial support of itemset x in Definition 3.

Definition 3 (Partial support). We define σ_x(TB_k) as the partial support of an itemset x within a given transaction block TB_k. The full support of x on the entire transaction bitmap TB_{1:R} becomes σ(x) = Σ_{k=1}^{R} σ_x(TB_k).
To calculate the partial support σ_x(TB_k) for an itemset x = {i_1, ..., i_n}, GMiner simply performs the bitwise AND operation n − 1 times among the bit vectors {TB_k(i) | i ∈ x} and counts the number of 1s in the resultant bit vector. GMiner can efficiently access the locations of the bit vectors TB_k(x) because RA(x) contains the relative memory addresses of x in TB_k in GPU memory. We denote the function that applies a series of n − 1 bitwise AND operations for the itemset x as ∧{TB_k(x)}. We also denote the function that counts the number of 1s in a given bit vector by count(·). Then, σ_x(TB_k) = count(∧{TB_k(x)}).
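The two definitions above, σ_x(TB_k) = count(∧{TB_k(x)}) and σ(x) = Σ_k σ_x(TB_k), translate directly into code. In this sketch, each transaction block is modeled as a mapping from item to a Python integer bit vector (an assumption for illustration; GMiner operates on 32-bit words on the GPU).

```python
def partial_support(itemset, TB_k):
    """sigma_x(TB_k): apply n-1 bitwise AND operations over the item
    bit vectors of x, then count the 1s (popcount) in the result."""
    bitV = TB_k[itemset[0]]
    for item in itemset[1:]:           # n - 1 AND operations
        bitV &= TB_k[item]
    return bin(bitV).count('1')        # count(AND{TB_k(x)})

def full_support(itemset, TB_blocks):
    """sigma(x) = sum over k = 1..R of sigma_x(TB_k)."""
    return sum(partial_support(itemset, TB_k) for TB_k in TB_blocks)
```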
Fig. 1 shows the basic data flow of GMiner with the nested-loop streaming technique. In Fig. 1, the outer operand RA_{1:Q} and inner operand TB_{1:R} are stored in main memory (Q = 1). The buffer for RA_j, called RABuf, and the buffer for TB_k, called TBBuf, are stored in GPU memory. Here, we allocate RABuf and TBBuf in GPU global memory. GMiner copies each TB_k to TBBuf in a streaming fashion via the PCI-E bus, after copying RA_j to RABuf. To store the partial support values of all candidate itemsets in RA for each transaction block TB_k, GMiner maintains a two-dimensional array of size |RA| × R in main memory, denoted as PSArray, where |RA| is the number of candidate itemsets. GMiner also allocates a buffer for partial supports in GPU global memory, denoted as PSBuf. The partial support values calculated on the GPU cores are first stored in PSBuf in GPU global memory and then copied back to PSArray in main memory.
3.3. TFL Algorithm
In this section, we present the algorithm that implements the TFL strategy. We first explain the overall procedure of the algorithm using the example shown in Fig. 1. The TFL strategy performs a total of seven steps. We denote the set of candidate itemsets at the current level in the itemset lattice as C_L. In Step 1, the TFL strategy converts C_L to RA by mapping each itemset x in C_L to its relative memory address RA(x) using dict. Here, dict is a dictionary that maps a frequent 1-itemset x ∈ F_1 to RA(x) within a transaction block TB_k. If the size of RA is larger than that of RABuf in GPU memory, then RA is logically divided into Q partitions, i.e., RA_{1:Q}, such that each partition can fit in RABuf. In Step 2, it copies a partition RA_j to RABuf in GPU memory. In Step 3, it copies each transaction block TB_k to TBBuf in GPU memory in a streaming fashion. In Steps 4–5, the GPU kernel function for the bitwise AND operations, denoted as K_TFL, calculates the partial supports of candidate itemsets in RABuf and stores the values in PSBuf. In Step 6, the TFL strategy copies the partial supports in PSBuf back to PSArray in main memory. Here, it copies the values of TB_k to the k-th column of PSArray. In Step 7, it aggregates the partial supports of each itemset x in PSArray to obtain σ(x). After Step 7, GMiner finds the frequent L-itemsets F_L for which the support values are greater than or equal to a given threshold minsup, as in the existing frequent itemset mining methods.

Fig. 1. Example of the TFL strategy.
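Steps 6–7, aggregating each candidate's per-block partial supports and thresholding against minsup, amount to a row sum over PSArray followed by a filter. A minimal sketch, with PSArray modeled as a dict from candidate to its list of R partial supports (the values below are hypothetical):

```python
def find_frequent(PSArray, minsup):
    """Step 7: sigma(x) = sum of x's row of partial supports across the
    R transaction blocks; keep the itemsets with sigma(x) >= minsup."""
    support = {x: sum(row) for x, row in PSArray.items()}
    return {x: s for x, s in support.items() if s >= minsup}
```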
Algorithm 1 shows the pseudo code for the algorithm. During initialization, the algorithm loads a transaction database D into main memory (MM) and allocates PSArray in MM. Then, it allocates three buffers, namely, TBBuf, RABuf, and PSBuf, in GPU global memory (DM) (Lines 1–3). Next, it converts D into a set of transaction blocks TB_{1:R} using F_1 such that each transaction block can fit in TBBuf (Lines 4–5). After the dictionary dict used to map x to RA(x) has been constructed (Line 6), it remains fixed during itemset mining because the TFL strategy uses only TB_{1:R} for F_1 as input data. The main loop consists of a generating step (Lines 10–11) and a testing step (Lines 12–20), as in the Apriori algorithm; however, compared to the Apriori algorithm, our algorithm significantly improves the testing step performance by streaming the transaction blocks of F_1 to overcome the limitations imposed by GPU memory, while simultaneously exploiting GPU computing for fast and massively parallel calculation of partial supports (Lines 12–18).
We note that the kernel function K_TFL is usually called multiple times instead of a single time (Line 16). This is due to a limit on the number of GPU blocks, which we can specify when calling K_TFL. The K_TFL function calculates the partial support of a single itemset using a single GPU block. If we set the maximum number of GPU blocks, denoted as maxBlk, to 16 K, a single call to K_TFL can simultaneously calculate partial supports for 16 K itemsets. Thus, if |RA_j| = 100 M, we must call the K_TFL function ⌈100 M/16 K⌉ = 6,250 times. That is, for the same transaction block TB_k in TBBuf, GMiner executes the kernel function repeatedly while changing the affected portions of RA_j. When copying data, RA is the outer operand, and TB is the inner operand. However, when calling the kernel function, TB_k is the outer operand, and RA_j is the inner operand.
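The number of kernel launches per (RA_j, TB_k) pair is a simple ceiling division. The sketch below reproduces the figures from the text, taking 16 K to mean 16,000 (an assumption consistent with the stated 6,250 launches):

```python
import math

def kernel_calls(num_candidates, max_blk):
    """Number of K_TFL launches needed to cover all candidates in RA_j
    when each launch processes at most maxBlk candidates, one candidate
    per GPU block: ceil(|RA_j| / maxBlk)."""
    return math.ceil(num_candidates / max_blk)

# |RA_j| = 100 M candidates, maxBlk = 16 K (16,000) -> 6,250 launches
calls = kernel_calls(100_000_000, 16_000)
```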
Next, we present the pseudo code for the GPU kernel function of GMiner in Algorithm 2. This function is used not only in the TFL strategy but also in the HIL strategy in Section 4. It takes a pair of RA_j and TB_k, along with doneIdx and maxThr, as inputs. Here, doneIdx is the index of the last candidate that was processed in RA_j. This value is required to identify the portion of RA_j that the current call of K_TFL should process. For example, if |RA_j| = 10,000 and maxBlk = 1,000, doneIdx in the second call of K_TFL becomes 1,000. The input maxThr is the maximum number of threads in a single GPU block, which we can specify when calling K_TFL, as with maxBlk. BID and TID are the IDs of the current GPU block and GPU thread, respectively, which are automatically determined system variables. Because many GPU blocks execute concurrently, some might have no corresponding candidate itemsets to test. For instance, when |RA_j| = 100 and maxBlk = 200, 100 GPU blocks should not execute the kernel function because those blocks have no itemsets. Thus, when the current GPU block has no itemset, the kernel function returns immediately (Lines 1–2). The kernel function prepares two frequently accessed variables, namely, can and sup, in the shared memory of the GPUs to improve performance. The variable can contains the itemset for which the current GPU block BID will calculate the partial support, and the vector sup is initialized to zero.
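The doneIdx bookkeeping across successive launches, and the early return of surplus blocks, can be modeled as follows (a sketch; the function names are ours, and the sequential loop stands in for concurrently executing GPU blocks):

```python
def launch_done_indices(num_candidates, max_blk):
    """doneIdx values passed to the successive K_TFL calls that together
    cover all |RA_j| candidates, maxBlk candidates per launch."""
    return list(range(0, num_candidates, max_blk))

def block_has_work(done_idx, bid, num_candidates):
    """Lines 1-2 of Algorithm 2: a GPU block whose candidate index
    doneIdx + BID falls past |RA_j| returns immediately."""
    return done_idx + bid < num_candidates
```

For |RA_j| = 10,000 and maxBlk = 1,000 this yields ten launches with doneIdx = 0, 1,000, ..., 9,000; for |RA_j| = 100 and maxBlk = 200, blocks 100–199 of the single launch return immediately.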
The main loop of K_TFL performs bitwise AND operations simultaneously and repeatedly (Lines 5–8). Under current GPU architectures, a single GPU thread can efficiently perform bitwise AND operations on single-precision widths (i.e., 32 bits). That is, a single GPU block can perform bitwise AND operations on up to maxThr × 32 bits simultaneously. However, the width of a transaction block W might be considerably larger than maxThr × 32 bits.
Fig. 2 shows an example of K_TFL when maxThr = 2 and can = {0, 1, 3}. Here, for simplicity, we assume that a GPU thread performs a bitwise AND on 4 bits. Because the length of the candidate itemset is 3, threads 1 and 2 perform bitwise AND operations twice over {TB(0), TB(1), TB(3)} and store the resultant bits in bitV. The kernel repeats this process ⌈W/(maxThr × 32)⌉ times. The number of 1s in bitV can easily be counted using the popCount function and stored in the sup vector. In CUDA, the popCount function is denoted as __popc(). The partial support values are accumulated in the sup vector ⌈W/(maxThr × 32)⌉ times, as shown in Fig. 2. Finally, the kernel function aggregates the values in sup into a single partial support value in TB_k for the candidate itemset can using a parallelReduction function (Line 9).
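The per-block work of K_TFL can be mimicked sequentially: each of the maxThr "threads" ANDs one 32-bit word per item of the candidate, popcounts the result into its slot of sup, and the slots are reduced at the end. This is a sequential stand-in for the parallel kernel, with the word layout an assumption for illustration:

```python
def k_tfl_block(can, TB_k, W, max_thr):
    """Sequential model of one GPU block of K_TFL. TB_k[i] is the bit
    vector of item i stored as a list of 32-bit words; W is the
    transaction-block width in bits; can holds the item indices of the
    candidate itemset."""
    sup = [0] * max_thr                       # per-thread counters (shared memory)
    iters = -(-W // (max_thr * 32))           # ceil(W / (maxThr * 32))
    for w in range(iters):
        for tid in range(max_thr):            # these iterations run in parallel on a GPU
            word = w * max_thr + tid
            if word >= len(TB_k[can[0]]):
                continue                      # thread past the end of the block
            bitV = TB_k[can[0]][word]
            for i in can[1:]:                 # n - 1 bitwise ANDs per word
                bitV &= TB_k[i][word]
            sup[tid] += bin(bitV).count('1')  # popCount
    return sum(sup)                           # parallelReduction over sup[]
```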
3.4. Exploiting GPUs
In this section, we present the details of the GMiner implementation that exploits GPUs. First, we discuss how to allocate and utilize GPU memory; in particular, we consider a method to avoid the out-of-memory issue when handling large-scale data on GPUs. Second, we explain how to configure GPU threads and improve the GPU kernel function performance. Third, we explain the details of the nested-loop streaming process, including both synchronization and host-device transfer.
First, we allocate three types of buffers in GPU memory only once; subsequently, we use them repeatedly for the entire mining task. The out-of-memory issue of existing GPU-based methods refers to the failure of the overall mining task when the data size grows beyond the capacity of GPU global memory. To avoid this issue, we allocate the buffers (i.e., TBBuf, RABuf, and PSBuf) in GPU global memory once, taking the GPU memory capacity into account, and then reuse them. When the data (i.e., the relative addresses and the transaction bitmap) are larger than the corresponding buffers, our method divides the data into multiple partitions, each of which fits into the corresponding buffer, and copies each partition to the buffer individually. Consequently, our method avoids the out-of-memory issue and simultaneously reduces the buffer allocation overhead in GPU memory. In contrast, other GPU-based methods repeatedly allocate buffers in GPU memory during the mining task.
Second, we exploit the shared memory of GPUs. Each GPU follows the single instruction multiple thread (SIMT) model and handles threads in a warp, which is a group of 32 threads. Multiple warps form a GPU block, and threads in the same GPU block can quickly communicate with one another using shared memory and built-in primitives. Frequently accessing GPU global memory to update variables is generally prohibitively expensive. To avoid this cost, our method uses shared memory to store the number of 1s in the bit vectors corresponding to the candidate itemset x. After computing the partial support for the corresponding transaction block, our method stores the partial support of x in the corresponding location of PSBuf. As a result, our method improves performance by accessing GPU global memory only once. We also consider the number of GPU threads for the GPU kernel function. As discussed in Section 3.3, our GPU kernel function includes the parallelReduction function. However, this function uses multiple branch operations, which degrade performance when using GPUs. This performance degradation becomes more marked as the number of GPU threads increases. Therefore, by default, we set the number of GPU threads to 32, because 32 threads are scheduled together as a warp in the GPU architecture.

Algorithm 1: The TFL strategy.
Input: D; /* transaction database */
Input: minsup; /* minimum support */
Output: F; /* frequent itemsets */
1  Load D into MM;
2  Allocate PSArray on MM;
3  Allocate {TBBuf, RABuf, PSBuf} on DM;
4  F_1 ← find all frequent 1-itemsets;
5  Build TB_{1:R} using D and F_1 on MM;
6  dict ← dictionary mapping x to RA(x) (∀x ∈ F_1);
7  L ← 1;
8  while |F_L| > 0 do
9      L ← L + 1;
       /* Generating candidates using CPUs */
10     C_L ← generate candidate itemsets using F_{L−1};
11     Convert C_L to RA_{1:Q} using dict;
       /* Testing using GPUs */
12     for j ← 1 to Q do
13         Copy RA_j into RABuf of DM;
14         for k ← 1 to R do
15             Copy TB_k into TBBuf of DM;
16             Call K_TFL(RA_j, TB_k); /* ⌈|RA_j|/maxBlk⌉ times */
17             Copy PSBuf into PSArray of MM;
18         Thread synchronization of GPUs;
19     σ(c) ← Σ_{k=1}^{R} PSArray[c][k], for ∀c ∈ C_L;
20     F_L ← {c | c ∈ C_L ∧ σ(c) ≥ minsup};
21     F ← F ∪ F_L;
22 Return F;

Algorithm 2: K_TFL: Kernel function for partial supports.
Input: RA_j; /* j-th partition of RA */
Input: TB_k; /* k-th transaction block */
Input: doneIdx; /* index of the last candidate done in RA_j */
Input: maxThr; /* max number of threads in a GPU block */
Variable: can; /* shared variable for a candidate */
Variable: sup; /* shared variable for a partial support */
1  if doneIdx + BID ≥ |RA_j| then
2      return;
3  can ← RA_j[doneIdx + BID];
4  sup[0 : maxThr] ← 0;
5  for w ← 0; w < ⌈W/(maxThr × 32)⌉; w ← w + 1 do
6      bitV ← ∧_{i ∈ can} TB_k[i][w × maxThr + TID];
7      sup[TID] ← sup[TID] + popCount(bitV);
8  syncthreads();
9  PSBuf[doneIdx + BID] ← parallelReduction(sup[]);

Fig. 2. The GPU kernel function K_TFL (a GPU block takes an itemset {A, B, D}).

Fig. 3. Multiple asynchronous GPU streams of GMiner.
Third, we exploit multiple asynchronous GPU streams. This approach reduces the data transmission overhead between main memory and GPU memory. Fig. 3 shows the timeline of the copy operations of the transaction blocks. A CPU thread first transfers RA_j to RABuf. Then, it starts multiple GPU streams, each of which performs the following series of operations repeatedly while incrementing k: (1) copying TB_k to TBBuf, (2) executing the GPU kernel function, denoted as K, to calculate PS_{j,k}, and (3) copying PS_{j,k} back to main memory. We denote the number of GPU streams as m. Then, this scheme requires the size of TBBuf to equal m transaction blocks and the size of PSBuf to be m × |RA_j|, where |RA_j| denotes the number of candidate itemsets in RA_j. In general, the above three kinds of operations, namely, copying to GPU memory, kernel execution, and copying to main memory, can overlap with one another in the current GPU architecture [16]; thus, a large portion of the copying time between GPU memory and main memory becomes hidden. After processing m streams, all the GPU threads are synchronized by calling the cudaStreamSynchronize function to compute the exact partial supports for the