SeqScout: Using a Bandit Model to Discover Interesting Subgroups in Labeled
Sequences
Romain Mathonat
1. Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621 Villeurbanne, France
2. ATOS, F-69100 Villeurbanne, France
romain.mathonat@insa-lyon.fr

Diana Nurbakova
Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621 Villeurbanne, France
diana.nurbakova@insa-lyon.fr

Jean-François Boulicaut
Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621 Villeurbanne, France
jfboulicaut@gmail.com

Mehdi Kaytoue
1. Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621 Villeurbanne, France
2. Infologic, F-69007 Lyon, France
mehdi.kaytoue@gmail.com
Abstract—It is extremely useful to exploit labeled datasets not only to learn models but also to improve our understanding of a domain and its available targeted classes. The so-called subgroup discovery task has been considered for a long time. It concerns the discovery of patterns or descriptions, the set of supporting objects of which have interesting properties, e.g., they characterize or discriminate a given target class. Though many subgroup discovery algorithms have been proposed for transactional data, discovering subgroups within labeled sequential data and thus searching for descriptions as sequential patterns has been much less studied. In that context, exhaustive exploration strategies cannot be used for real-life applications and we have to look for heuristic approaches. We propose the algorithm SeqScout to discover interesting subgroups (w.r.t. a chosen quality measure) from labeled sequences of itemsets. This is a new sampling algorithm that mines discriminant sequential patterns using a multi-armed bandit model. It is an anytime algorithm that, for a given budget, finds a collection of local optima in the search space of descriptions and thus subgroups. It requires a light configuration and it is independent from the quality measure used for pattern scoring. Furthermore, it is fairly simple to implement. We provide qualitative and quantitative experiments on several datasets to illustrate its added-value.
Keywords-Pattern Mining, Subgroup Discovery, Sequences,
Upper Confidence Bound.
I. INTRODUCTION
In many data science projects we have to process la-
beled data and it is often valuable to discover patterns that
discriminate the class values. This can be used to support
various machine learning techniques the goals of which
are to predict class values for unseen objects. It is also
interesting per se since the language for descriptions is,
by design, readable and interpretable by analysts and data
owners. Among others, it can support the construction of
new relevant features. The search for such patterns has been
given different names like subgroup discovery, emerging
pattern or contrast set mining [1]. We use hereafter the
terminology of the subgroup discovery framework [2].
Labeled sequential data are ubiquitous and subgroup
discovery can be used throughout many application domains
like, for instance, text or video data analysis, industrial
process supervision, DNA sequence analysis, web usage
mining, video game analytics, etc. Let us consider an exam-
ple scenario about the maintenance of a cloud environment.
Data generated from such a system are sequences of events.
Applying classification techniques helps answer “will an event occur?” (See, e.g., [3]), while applying sequential event prediction helps determine “what event will occur?” (See, e.g., [4]). Nevertheless, another need is to explain, or at least, to provide hypotheses on the why. Addressing such a descriptive analytics issue is the focus of our work. Given
data sequences, labeled with classes, we aim at automatically
finding discriminative patterns for these classes. Given our
example, sequences of events could be labeled by breakdown
presence or absence: the goal would be to compute patterns
“obviously occurring with breakdowns”. Such patterns pro-
vide valuable hypotheses for a better understanding of the
targeted system and it can support its maintenance.
Let us now discuss research issues when considering
subgroup discovery on labeled data. Given a dataset of object
descriptions (e.g., objects described by discrete sequences),
a sequential pattern is a generalization of a description that
covers a set of objects. An unusual class distribution among
the covered objects makes a pattern interesting. For exam-
ple, when a dataset has a 99%-1% distribution of normal-
abnormal objects, a pattern covering objects among which
50% are abnormal is highly relevant (w.r.t. a quality measure
like the Weighted Relative Accuracy measure WRAcc [5]).
However, such patterns generally cover a small number of objects. They are difficult to identify with most of the
algorithms that exploit an exhaustive exploration of the
search space with the help of a minimal frequency constraint
(See, e.g., SD-MAP [6]). Therefore, heuristic approaches
that are often based on beam search or sampling techniques
are used to find only subsets of the interesting patterns. An open problem concerns the computation of non-redundant patterns, i.e., avoiding the return of thousands of variations of Boolean and/or numerical patterns [7]. Heuristic methods
for sequential data have not yet attracted much attention.
[8] introduced a sampling approach that is dedicated to the frequency measure only. A promising recent proposal is [9]: the sampling method misère can be used for any quality measure when exploiting sequences of events. Their key idea is to draw patterns as strict generalizations of object descriptions while a time budget enables it, and to keep a pool of the best non-redundant patterns found so far. Such a generic approach has been the starting point of our research, though we were looking for further quality assessment of discovered subgroups.
Our contribution provides a search space exploration for
labeled sequences of itemsets (and not just items) called
SeqScout. It is based on sampling guided by a multi-
armed bandit model followed by a generalization step,
and a phase of local optimization. We show that it gives
better results than a simple adaptation of misère with
the same budget when considering huge search spaces. Our
algorithm has several advantages: it gives results anytime
and it benefits from random search to limit redundancy of
results and to increase subgroup diversity. It is rather easy
to implement and can be parallelized. It is agnostic w.r.t. the
used quality measure and it finds local optima only (w.r.t.
the considered quality measure).
Section II discusses related work. We formally define the
problem in Section III. Section IV describes SeqScout.
Section V presents an empirical study on several datasets,
including applications on Starcraft II data (game analytics)
and abstracts of articles published in the Journal of Machine Learning Research.
II. RELATED WORK
Sequential pattern mining [10] is a classical data mining task. It remains challenging due to the size of the search space. For instance, Raïssi and Pei [11] have shown that the number of sequences of length k is w_k = Σ_{i=0}^{k−1} w_i · C(|I|, k − i), where C(·, ·) denotes the binomial coefficient and |I| is the number of possible items. As an example, if we consider the dataset promoters [12], |I| = 4 and k = 57 (see Table II), so the size of the search space is approximately 10^41. Various methods have
been proposed to mine interesting sequential patterns. We
review them briefly and we discuss their relevancy when
considering our need for discriminative patterns.
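The recurrence above can be checked numerically. Below is a short Python sketch (function names ours); it assumes w_0 = 1 for the empty sequence and that the reported search-space sizes count all sequences up to the maximal length:

```python
from math import comb

def counts_by_length(num_items: int, k: int) -> list:
    """w_0..w_k, where w_j is the number of sequences of itemsets of
    length exactly j over num_items items, via the recurrence
    w_j = sum_{i=0}^{j-1} w_i * C(num_items, j - i)."""
    w = [1]  # w_0 = 1: the empty sequence
    for j in range(1, k + 1):
        w.append(sum(w[i] * comb(num_items, j - i) for i in range(j)))
    return w

def search_space_size(num_items: int, max_len: int) -> int:
    """All sequences of length at most max_len."""
    return sum(counts_by_length(num_items, max_len))

# Sanity check: length 2 over 4 items = one 2-itemset (6 ways)
# or two singletons (4 * 4 = 16 ways), so w_2 = 22.
assert counts_by_length(4, 2)[2] == 22

# promoters: |I| = 4, maximal length 57 -> about 1.6e41.
print(f"{search_space_size(4, 57):.2e}")
```

Evaluating the same sketch for blocks (|I| = 8, maximal length 12) gives roughly 2.7 × 10^12, matching the order of magnitude reported in Table II.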
A. Enumeration-based methods
Many enumeration techniques make it possible to mine patterns from Boolean, sequential and graph data [13]. They can be
adapted for the case of discriminative pattern mining. For
instance, the SPADE algorithm [14] has been adapted for
sequence classification based on frequent patterns [15]. The
main idea of such methods is to visit each candidate pattern
only once while pruning large parts of the search space.
Indeed, we know how to exploit formal properties (e.g.,
monotonicity) of many user-defined constraints (and not
only the minimal frequency constraint). The quality measure
is thus computed either during or after the discovery of all
frequent patterns [16]. This is inefficient for the discovery of
the best discriminative patterns only. To overcome this lim-
itation and to help pruning the search space, upper bounds
on the quality measure can be used. However, they remain
generally too optimistic and are specific to a particular
measure (See, e.g., [17]). Moreover, enumeration techniques
coupled to upper bounds aim at solving the problem of
finding the best pattern in the search space, and not the best
pattern set.
B. Heuristic methods
An interesting current trend to support pattern discovery
is to avoid exhaustive search and to provide anytime sets
of high quality patterns, ideally with some guarantees on
their quality. Examples of such guarantees are the distance
to the best solution pattern [18] or the guarantee that the
best solution can be found given a sufficient budget [7]. Let
us discuss some of the heuristic approaches that have been
proposed so far.
Beam search is a widely used heuristic algorithm. It
traverses the search space level-wise from the most general
to the most specific patterns and it restricts each level to a
subset of non redundant patterns of high quality [19]. The
greedy nature of beam search is its major drawback. Boley
et al. proposed a two-step sampling approach giving the
guarantee to sample patterns proportionally to different mea-
sures: frequency, squared frequency, area, or discriminativity
[20]. However, this method only works with those measures.
Diop et al. proposed an approach which guarantees that the
probability of sampling a sequential pattern is proportional
to its frequency [8]. It focuses on the frequency measure
only. Egho et al. have proposed the measure-agnostic method misère [9]. Given a time budget, their idea is to generate
random sequential patterns covering at least one object while
keeping a pool of the best patterns obtained so far. It
provides a result anytime, empirically improving over time.
Sequential pattern mining can be modeled as a multi-
armed bandit problem enabling an exploitation/exploration
trade-off [21]. Each candidate sequence is an “arm” of a
bandit. For instance, Bosc et al. have developed such a
game theory framework using Monte Carlo Tree Search
to support subgroup discovery from labeled itemsets [7].
They proposed an approach based on sampling, where each
draw improves the knowledge about the search space. Such
a drawn object guides the search to achieve an exploita-
tion/exploration trade-off.
To the best of our knowledge, the problem of mining dis-
criminative sequences of itemsets with sampling approaches
has not been addressed yet in the literature. We introduce our
SeqScout method that computes top-k non-redundant discriminative patterns, i.e., it tackles an NP-hard problem [22]. For that purpose, we want hereafter to maximize the well-known quality measure WRAcc [5]. Even though we focus on it, our method is generic enough to use any quality measure, without requiring specific properties. SeqScout does not require parameter tuning, unlike Beam Search. In contrast to misère, we have the guarantee of finding local optima thanks to a hill-climbing search.
III. NON-REDUNDANT SUBGROUP DISCOVERY: PRELIMINARIES AND PROBLEM STATEMENT
Let I be a set of items. Each subset X ⊆ I is called an itemset. A sequence s = ⟨X_1 ... X_n⟩ is an ordered list of n > 0 itemsets. The size of a sequence s is denoted n, and l = Σ_{i=1}^{n} |X_i| is its length. A database D is a set of |D| sequences (see an example in Table I). Given a set of classes C, we denote by D_c ⊆ D the set of sequences in D that are labeled by c ∈ C.
Definition 1 (Subsequence). A sequence s = ⟨X_1 ... X_{n_s}⟩ is a subsequence of a sequence s′ = ⟨X′_1 ... X′_{n_{s′}}⟩, denoted s ⊑ s′, iff there exist indices 1 ≤ j_1 < ... < j_{n_s} ≤ n_{s′} such that X_1 ⊆ X′_{j_1}, ..., X_{n_s} ⊆ X′_{j_{n_s}}. In Table I, ⟨{a}{b, c}⟩ is a subsequence of s_1 and s_2.
Definition 2 (Extent, support and frequency). The extent of a sequence s is ext(s) = {s′ ∈ D | s ⊑ s′}. The support of a sequence s is supp(s) = |ext(s)|. Its frequency is freq(s) = supp(s)/|D|. In Table I, ext(⟨{a}{b, c}⟩) = {s_1, s_2}.
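Definition 1 admits a simple greedy membership test: embed each itemset of s into a strictly later itemset of s′ containing it. A minimal Python sketch (representation and function names ours), checked against Table I:

```python
def is_subsequence(s, s_prime):
    """Check s ⊑ s': greedily match each itemset of s against a
    strictly later itemset of s' that contains it."""
    j = 0
    for X in s:
        while j < len(s_prime) and not X <= s_prime[j]:
            j += 1
        if j == len(s_prime):
            return False
        j += 1  # the next itemset must be matched strictly later
    return True

def extent(s, db):
    return [sid for sid, seq in db.items() if is_subsequence(s, seq)]

# The database of Table I.
db = {
    "s1": [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    "s2": [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    "s3": [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    "s4": [{"e"}, {"g"}, {"a", "b", "f"}, {"c"}, {"c"}],
}
print(extent([{"a"}, {"b", "c"}], db))  # -> ['s1', 's2']
```

Greedy matching at the earliest feasible position is safe here: it leaves the longest possible suffix available for the remaining itemsets.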
Definition 3 (Set-extension). A sequence s_b is a set-extension of s_a = ⟨X_1 X_2 ... X_n⟩ by x if ∃i, 1 ≤ i ≤ n + 1, such that s_b = ⟨X_1 ... X_{i−1} {x} X_i ... X_n⟩. In other words, the itemset {x} is inserted at the i-th position of s_a. ⟨{a}{c}{b}⟩ is a set-extension of ⟨{a}{b}⟩.
Definition 4 (Item-extension). A sequence s_b is an item-extension of s_a = ⟨X_1 X_2 ... X_n⟩ by x if ∃i, 1 ≤ i ≤ n, such that s_b = ⟨X_1, ..., X_i ∪ {x}, ..., X_n⟩. ⟨{a, b}{b}⟩ is an item-extension of ⟨{a}{b}⟩.
Definition 5 (Reduction). A sequence s_b is a reduction of s_a if s_a is a set-extension or item-extension of s_b.
Definition 6 (Quality measure). Let S be the set of all subsequences. A quality measure φ is a function φ : S → ℝ that maps every sequence s ∈ S to a real number reflecting its interestingness (quality score). For instance, Precision, defined by P(s → c) = supp(s, D_c)/supp(s, D), is a quality measure.

Table I: An example database D.

id   s ∈ D                                   c
s1   ⟨{a}{a, b, c}{a, c}{d}{c, f}⟩           +
s2   ⟨{a, d}{c}{b, c}{a, e}⟩                 +
s3   ⟨{e, f}{a, b}{d, f}{c}{b}⟩              −
s4   ⟨{e}{g}{a, b, f}{c}{c}⟩                 −
Definition 7 (Local optimum). Let N(s) be the neighborhood of s, i.e., the set of all item-extensions, set-extensions and reductions of s. r* is a local optimum of S w.r.t. the quality measure φ iff ∀r ∈ N(r*), φ(r*) ≥ φ(r).
Definition 8 (Non θ-redundant patterns). A set of patterns (also called subgroups or subsequences here) S_p ⊆ S is non θ-redundant if, given θ ∈ [0; 1], for all s_1, s_2 ∈ S_p with s_1 ≠ s_2, we have sim(s_1, s_2) ≤ θ. In the following, we use the Jaccard index as a similarity measure:

sim(s_1, s_2) = |ext(s_1) ∩ ext(s_2)| / |ext(s_1) ∪ ext(s_2)|.
Problem Statement (Non-redundant subgroup set discovery). For a database D, an integer k, a real number θ, a similarity measure sim, and a target class c ∈ C, the non-redundant subgroup discovery task consists in computing the set S_p of the best non θ-redundant patterns of size |S_p| ≤ k, mined w.r.t. the quality measure φ.
IV. METHOD DESCRIPTION
SeqScout is a sampling approach that exploits generalizations of database sequences and searches for local optima w.r.t. the chosen quality measure. Fig. 1 provides a global illustration of the method.¹
A. General overview
The main idea of the SeqScout approach is to consider
each sequence of the labeled data as an arm of a multi-armed
bandit when selecting the sequences for further generaliza-
tion using the Upper Confidence Bound (UCB) principle
(See Algorithm 1). The idea of the UCB is to give a score
to each sequence that quantifies an exploitation/exploration
trade-off, and to choose the sequence having the best one.
First (Lines 2-4), the priority queues π and scores are created. π stores encountered patterns with their quality, and scores keeps in memory the UCB score of each sequence of the dataset, computed using Equation 1 (See Section IV-B). data+ contains the list of all sequences of the dataset labeled with the target class. Indeed, taking sequences of the target class leads to generalizations covering at least one positive element. Then, the main procedure is launched as long as some computational budget is available. The best sequence w.r.t. UCB is chosen (Line 9). This sequence is ‘played’ (Line 10), meaning that it is generalized (See Section IV-C) and its quality is computed (See Section IV-G). The created pattern is added to π (Line 11). Finally, the UCB score is updated (Line 12). As post-processing steps, the top-k best non-redundant patterns are extracted from π using the filtering step (See Section IV-D). Finally, these patterns are processed by a local optimization procedure (See Section IV-E). Moreover, SeqScout needs other modules that concern the selection of the quality measure (See Section IV-F) and the quality score computation (See Section IV-G).

¹ In the context of sequential pattern mining, the search space is a priori infinite. However, we can define the border of the search space (the bottom border in Fig. 1) by excluding patterns having a null support. We can easily prove that each element of this border is a sequence within the database. Therefore, the search space shape depends on the data.
Figure 1: Illustration of SeqScout.
B. Sequence selection
We propose to model each sequence of the dataset as an
arm of a multi-armed bandit slot machine. The action of
playing an arm corresponds to generalizing this sequence
to obtain a pattern, and the reward then corresponds to the
quality of this pattern. Following an exploitation/exploration
trade-off, sequences leading to bad quality patterns will be
avoided, leading to the discovery of better ones.
The multi-armed bandit problem is well known in the Game Theory literature [23]. We consider a multi-armed bandit slot machine with k arms, each arm having its own reward distribution. Our problem is then formulated as follows: given a number N of plays, what is the best strategy to maximize the reward? The more an arm is played, the more information about its reward distribution we get. However, to what extent is it worth exploiting a promising arm, instead of trying others that could be more interesting in the long term (exploring)? Auer et al. proposed a strategy called UCB1 [21]. The idea is to give each arm a score, and to choose the one that maximizes it:
UCB1(i) = x̄_i + √(2 ln(N) / N_i),    (1)

where x̄_i is the empirical mean reward of the i-th arm, N_i is the number of plays of the i-th arm, and N is the total number of plays. The first term encourages the exploitation
Algorithm 1 SeqScout
1: function SEQSCOUT(budget)
2:     π ← PriorityQueue()
3:     scores ← PriorityQueue()
4:     data+ ← FilterData()
5:     for all sequence in data+ do
6:         scores.add(sequence, ∞)
7:     end for
8:     while budget do
9:         seq, qual, N_i ← scores.bestUCB()
10:        seq_p, qual_p ← PlayArm(seq)
11:        π.add(seq_p, qual_p)
12:        scores.update(seq, (N_i · qual + qual_p) / (N_i + 1), N_i + 1)
13:    end while
14:    π.add(Optimize(π))
15:    return π.topKNonRedundant()
16: end function
17:
18: function OPTIMIZE(π)
19:    topK ← π.topKNonRedundant()
20:    for all pattern in topK do
21:        while pattern is not a local optimum do
22:            pattern, qual ← BestNeighbor(pattern)
23:        end while
24:    end for
25:    return topK
26: end function
of arms with good reward, while the second encourages the
exploration of less played arms by giving less credit to the
ones that have been frequently played.
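Equation 1 translates directly into code. A small sketch (function names and the toy arm statistics are ours; unplayed arms get an infinite score so that every arm is tried at least once):

```python
import math

def ucb1(mean_reward: float, plays: int, total_plays: int) -> float:
    # exploitation term + exploration bonus (Equation 1)
    return mean_reward + math.sqrt(2 * math.log(total_plays) / plays)

def best_arm(arms):
    """arms: {arm_id: (mean_reward, plays)}."""
    total = sum(plays for _, plays in arms.values())
    def score(item):
        _, (mean, plays) = item
        return float("inf") if plays == 0 else ucb1(mean, plays, total)
    return max(arms.items(), key=score)[0]

# A well-explored arm (mean 0.5, 10 plays) vs a barely tried one
# (mean 0.4, 1 play): the exploration bonus makes the second one win.
arms = {"s1": (0.5, 10), "s2": (0.4, 1)}
print(best_arm(arms))  # -> s2
```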
Performing an exploitation/exploration trade-off for pat-
tern mining has already been applied successfully to itemsets
and numerical vectors by means of Monte Carlo Tree Search
[7]. When dealing with a huge search space, using sampling
guided by such a trade-off can give good results. However,
contrary to [7], we consider the search space of extents, not
intents. Exploring the extent search space guarantees to find
patterns with non null support while exploring the intent
search space leads towards multiple patterns with a null
support. This is a crucial issue when dealing with sequences.
C. Pattern generalization
After the best sequence w.r.t. UCB1 is chosen, it is
generalized, meaning that a new more general pattern is
built. It enables to build a pattern with at least one positive
element. Indeed, most of the patterns in the search space
have a null support [11]. SeqScout generalizes a sequence
sin the following way. It iterates through each item within
each itemset Xis, and it removes it randomly according
to the following rule:
remain, if z < 0.5
remove, if z0.5, where z∼ U (0,1).
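The rule above can be sketched as follows (function names ours; we also drop itemsets that become empty, which we assume is the intended behavior since an empty itemset imposes no constraint):

```python
import random

def generalize(sequence, rng=random.random):
    """Draw z ~ U(0,1) per item; keep it if z < 0.5, remove it otherwise."""
    pattern = []
    for itemset in sequence:
        kept = {item for item in itemset if rng() < 0.5}
        if kept:  # assumption: emptied itemsets are dropped
            pattern.append(kept)
    return pattern

random.seed(0)
print(generalize([{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]))
```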
D. Filtering Step
To limit the redundancy of found patterns, a filtering
process is needed. We adopt a well-described set covering
principle from the literature (See, e.g., [7], [24]), summa-
rized as follows. First, we take the best element, and then
we remove those that are similar within our priority queue π.
Then, we take the second best, and continue this procedure
until the kbest non-redundant elements are extracted.
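This greedy filtering can be sketched as follows (function names and toy extents ours), with the Jaccard index of Definition 8 as the similarity measure:

```python
def jaccard(e1, e2):
    return len(e1 & e2) / len(e1 | e2)

def top_k_non_redundant(patterns, k, theta):
    """patterns: list of (quality, extent_as_frozenset).
    Greedily keep the best pattern whose extent is not too similar
    (Jaccard > theta) to an already kept one."""
    kept = []
    for qual, ext in sorted(patterns, key=lambda p: p[0], reverse=True):
        if all(jaccard(ext, e) <= theta for _, e in kept):
            kept.append((qual, ext))
        if len(kept) == k:
            break
    return kept

patterns = [
    (0.30, frozenset({1, 2})),
    (0.25, frozenset({1, 2, 3})),  # redundant with the first (Jaccard 2/3)
    (0.20, frozenset({4, 5})),
]
print(top_k_non_redundant(patterns, k=5, theta=0.5))
# -> [(0.3, frozenset({1, 2})), (0.2, frozenset({4, 5}))]
```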
E. Local optimum search
Finally, a local optimum search is launched w.r.t. Definition 7. Various strategies can be used. The first possible strategy is Steepest Ascent Hill Climbing. It computes the neighborhood of the generalized pattern, i.e., all its item-extensions, set-extensions and reductions. Then, it selects the pattern of the neighborhood maximizing the quality measure. This is repeated until there are no more patterns in the neighborhood having a better quality. Another possible strategy is Stochastic Hill Climbing: a neighbor is selected at random and kept if its improvement over the current pattern is large “enough”. Notice however that it introduces a new parameter. Depending on the dataset, the branching factor can be very large. Indeed, for m items and n itemsets in the sequence, there are m(2n + 1) patterns in its neighborhood (See Theorem 1). To tackle this issue, we use First-Choice Hill Climbing [25]. We compute the neighborhood until a better pattern is created, and then we directly select it without enumerating all neighbors.
Theorem 1. For a sequence s, let n be its size, l its length, and m the number of possible items. The number of neighbors of s, denoted |N(s)|, is m(2n + 1).

Proof: The number of item-extensions is given by:

|I_ext| = Σ_{i=1}^{n} (|I| − |X_i|) = nm − Σ_{i=1}^{n} |X_i| = nm − l.

We now sum the numbers of reductions, set-extensions and item-extensions:

|N(s)| = l + m(n + 1) + |I_ext| = m(2n + 1).
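Theorem 1 can be checked by brute-force enumeration on a toy example (sketch, function names ours; neighbors are counted with multiplicity, as in the proof):

```python
def neighbors(s, items):
    """Enumerate reductions, set-extensions and item-extensions of s
    (a list of sets) over the item universe `items`."""
    out = []
    # reductions: remove one item occurrence (l of them)
    for i, X in enumerate(s):
        for x in X:
            reduced = [set(Y) for Y in s]
            reduced[i].discard(x)
            out.append([Y for Y in reduced if Y])  # drop emptied itemsets
    # set-extensions: insert {x} at one of the n + 1 positions (m(n+1))
    for i in range(len(s) + 1):
        for x in items:
            out.append([set(Y) for Y in s[:i]] + [{x}] + [set(Y) for Y in s[i:]])
    # item-extensions: add a missing item to an existing itemset (nm - l)
    for i, X in enumerate(s):
        for x in items - X:
            extended = [set(Y) for Y in s]
            extended[i].add(x)
            out.append(extended)
    return out

items = {"a", "b", "c"}   # m = 3
s = [{"a"}, {"b", "c"}]   # n = 2, l = 3
assert len(neighbors(s, items)) == 3 * (2 * 2 + 1)  # m(2n + 1) = 15
```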
F. Quality measure selection
The choice of the quality measure φ is application-dependent. Our approach can deal with any known measure that supports class characterization, such as, e.g., the F1 score, Informedness or the Weighted Relative Accuracy [5]. The latter, WRAcc, is commonly used for discriminant pattern mining and subgroup discovery. It compares the proportion of positive elements (i.e., sequences labeled with the target class) among the sequences covered by a pattern to the proportion of positive elements in the whole database. Let c ∈ C be a class value and s be a sequence:

WRAcc(s, c) = freq(s) × (supp(s, D_c)/supp(s, D) − |D_c|/|D|).

It is a weighted difference between the precisions P(s → c) and P(⟨⟩ → c). The weight freq(s) avoids the extraction of infrequent subgroups: finding a very specific subgroup covering a single positive element would yield a perfect precision but a useless pattern. The WRAcc value ranges in [−0.25, 0.25] for a perfectly balanced dataset, i.e., one containing 50% of positive elements.
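As a concrete check (function name ours): the pattern ⟨{a}{b, c}⟩ of Table I covers s_1 and s_2, both positive, in a database of 4 sequences with 2 positives, and reaches the maximal value for a balanced dataset:

```python
def wracc(supp_pos: int, supp_total: int, n_pos: int, n_total: int) -> float:
    """WRAcc(s, c) = freq(s) * (supp(s, Dc)/supp(s, D) - |Dc|/|D|)."""
    freq = supp_total / n_total
    return freq * (supp_pos / supp_total - n_pos / n_total)

# <{a}{b,c}> in Table I: supp = 2, both covered sequences are positive.
print(wracc(supp_pos=2, supp_total=2, n_pos=2, n_total=4))  # -> 0.25
```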
We consider objective quality measures that are based on pattern support in databases (the whole dataset, or its restriction to a class). Using random draws makes their computation particularly costly, as each draw is independent: we cannot benefit from the data structures that make classical exhaustive pattern mining algorithms so efficient (See, e.g., [26]).
G. Efficient computation of quality scores
To improve the time efficiency of support computation, bitset representations have been proposed. For instance, SPAM [26] uses a bitset representation of a pattern when computing an item- or set-extension at the end of a sequence. In our case, we consider that an element can be inserted anywhere. Therefore, we propose a bitset representation that is independent of the insertion position. Its main idea lies in keeping the bitset representations of all encountered itemsets in a hash table (memoization), and then combining them to create the representation of the desired sequence. The main idea is given in Fig. 2. Assume we are looking for the bitset representation of ⟨{a, b}, {c}⟩. Let ⟨{c}⟩ be an already encountered pattern (i.e., its representation is known) while ⟨{a, b}⟩ is not. This cannot be handled by the SPAM technique, as a new element has to be added before a known sequence. Our algorithm will first try to find the bitset representation of ⟨{a, b}⟩. As it does not exist yet, it will be generated and added to the memoization structure. Then, position options for the next itemset are computed (Line 2 in Fig. 2). These are then combined with the bitset representation of ⟨{c}⟩ using a bitwise AND (Line 4). The support of the generated sequence can then be computed.
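A simplified version of this position-independent computation can be sketched with Python integers as bitsets (function names ours; the per-sequence memoization table stands in for the hash table described above, and in the real algorithm it would be kept across calls):

```python
def positions_mask(seq, itemset, cache):
    """Bit i is set iff itemset is contained in the i-th itemset of seq.
    Masks of already-seen itemsets are memoized in `cache`."""
    key = frozenset(itemset)
    if key not in cache:
        mask = 0
        for i, X in enumerate(seq):
            if itemset <= X:
                mask |= 1 << i
        cache[key] = mask
    return cache[key]

def occurs_in(seq, pattern, cache):
    allowed = -1  # all positions allowed (bitmask of ones)
    for X in pattern:
        m = positions_mask(seq, X, cache) & allowed
        if m == 0:
            return False
        lowest = m & -m                 # earliest feasible position
        allowed = ~((lowest << 1) - 1)  # only strictly later positions
    return True

def support(db, pattern):
    return sum(occurs_in(seq, pattern, {}) for seq in db)

# The database of Table I.
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],  # s1
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],              # s2
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],       # s3
    [{"e"}, {"g"}, {"a", "b", "f"}, {"c"}, {"c"}],            # s4
]
print(support(db, [{"a"}, {"b", "c"}]))  # -> 2
```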
V. EXPERIMENTS
We report an extensive empirical evaluation of
SeqScout using multiple datasets. All the experiments
were performed on a machine equipped with Intel Core
i7-8750H CPU with 16GB RAM. Algorithms mentioned
hereafter are implemented in Python 3.6.
Datasets: We perform the evaluation using some
benchmark datasets that are well known in the pattern
mining community, namely aslbu [27], promoters [12],
blocks [12], context [12], splice [12], and skating [28].
Figure 2: Bitset representation and support computing.
We also apply our algorithm on the real-life dataset sc2 that has been used in [16]. It was extracted from Starcraft II, a famous Real Time Strategy (RTS) game. The goal of
each Starcraft match is to destroy units and buildings of the
opponent. Three different factions exist, each with its own
features (i.e., combat units, buildings, strategies). Roughly
speaking, there is a duality between economy and military
forces. Investing in military forces is important to defend or
attack an opponent, while building economy is important to
have more resources to invest into military forces. Sequences
in sc2 correspond to the buildings constructed during a
game, and the class corresponds to the winner’s faction.
Once the target class (i.e., the winning faction) has been
chosen, our algorithm will look for patterns of construction
that characterize the victory of this faction.
Moreover, we use jmlr, a dataset consisting of abstracts
of articles published in the Journal of Machine Learning
Research [29]. Table II summarizes the statistics of datasets.
Baselines: We are not aware of available algorithms
that address the problem of discriminative pattern mining
in sequences of itemsets but several exist for processing
sequences of events. Therefore, we compare SeqScout to
two algorithms that we have modified to tackle sequences of itemsets, namely misère [9] and BeamSearch [24]. First, we implemented a simple extension of misère [9]. Second, we implemented a sequence-oriented version of
a beam-search algorithm, denoted BeamSearch. To deal
with sequences of itemsets, we consider item-extensions
and set-extensions at each given depth. Moreover, for the
sake of non-redundancy in the returned patterns, we modify
its best-first search nature so that the expanded nodes get
diverse, as defined in [24]. Indeed, without this modification,
BeamSearch selects directly the best patterns at each level,
that are often redundant. It may lead to a situation where
almost all patterns are removed in the non-redundancy post-
processing step, giving rise to solutions with less than k
elements.
Settings: If not stated otherwise, we use the follow-
ing settings. Each algorithm has been launched 5 times,
and the reported results are averaged over these runs. For
BeamSearch, we empirically set the parameter width = 50. For all algorithms, we set θ = 0.5, time budget = ∞, iteration_num = 10,000, and top-k = 5. Note that instead of giving a fixed time budget for running an algorithm on each dataset, we chose to limit the number of iterations iteration_num, one iteration corresponding to a single computation of the quality measure. Indeed, this computation is the most time-consuming one: it requires computing the extent of a pattern w.r.t. the whole dataset. Therefore, using the same time budget on different datasets would not provide a fair comparison: having 50,000 iterations on a small dataset versus 50 on a complex one within the same time budget is not meaningful.

Figure 3: WRAcc of the top-5 best patterns (10K iterations).
A. Performance evaluation using WRAcc
To assess the performance of the algorithms, let us first use the mean WRAcc of the top-k non-redundant patterns given by the algorithms misère, BeamSearch and SeqScout. Fig. 3 provides the comparative results. Our approach gives better results on the majority of the datasets. Moreover, it remains quite stable, contrary to BeamSearch, which can sometimes be fairly inefficient (see splice).
Table II: Datasets

Dataset          |D|    |I|    l_max   Search Space Size
aslbu [27]       441    124    27      4.45 × 10^60
promoters [12]   106    4      57      1.58 × 10^41
blocks [12]      210    8      12      2.74 × 10^12
context [12]     240    47     123     5.24 × 10^224
splice [12]      3,190  8      60      3.28 × 10^62
sc2 [16]         5,000  30     30      6.48 × 10^48
skating [28]     530    41     120     1.15 × 10^212
jmlr [29]        788    3,836  228     1.84 × 10^853
Table III: Mean values of measures for top-5 patterns.

Dataset     Informedness (misère / BeamS / SeqScout)   F1 (misère / BeamS / SeqScout)
aslbu       0.195 / 0.198 / 0.198                      0.505 / 0.505 / 0.505
promoters   0.039 / 0.089 / 0.113                      0.512 / 0.545 / 0.580
blocks      0.399 / 0.394 / 0.393                      0.381 / 0.370 / 0.382
context     0.436 / 0.437 / 0.469                      0.528 / 0.532 / 0.584
splice      0.342 / 0.006 / 0.356                      0.394 / 0.082 / 0.471
sc2         0.004 / 0.004 / 0.010                      0.524 / 0.519 / 0.526
skating     0.396 / 0.385 / 0.415                      0.361 / 0.380 / 0.389
jmlr        0.224 / 0.403 / 0.355                      0.143 / 0.236 / 0.243
B. Quality with respect to the number of iterations
We discuss the result quality in terms of W RAcc over
the number of iterations. Fig. 4, 5, 6 depict the results for
the top-5 non-redundant patterns on datasets promoters,
skating and sc2. Note that for the same data, the results may vary from run to run, due to the random components of misère and SeqScout. This explains some fluctuations of the quality. Nevertheless, for each iteration_num setting, SeqScout has shown better results.
C. Using other quality measures
To empirically illustrate the measure agnostic characteris-
tic of SeqScout, we have used other quality measures: F1-
score and Informedness. The results are shown in Table
III. Our algorithm gives generally better results.
D. Performance study under varying θ
We evaluate the performance of the algorithms when vary-
ing the similarity threshold θ. Fig. 7 shows the performance
on the dataset promoters. We did not include the results for θ = 0 because it would mean finding patterns with totally disjoint extents. This results in finding fewer than k patterns for all algorithms, so that the mean would be misleading. We can see from the plot that SeqScout outperforms the other algorithms for all θ values.
E. Performance study under varying k (top-k)
We investigate the performance of the search for top-k patterns when varying the k parameter. Fig. 8 shows the results on the context dataset. SeqScout gives better results. Note that the mean WRAcc decreases for all algorithms, as increasing k leads to the selection of lower-quality patterns.
F. Quality versus search space size
We evaluate the average WRAcc of the top-5 patterns w.r.t. the maximum length of sequences (see Fig. 9). To do so, we truncated the dataset to control the maximum length of its sequences. We demonstrate this on the skating dataset. The plot shows that SeqScout gives better average WRAcc values whatever the search space size. We also note a general decrease in quality when the search space size increases. Indeed, some patterns that are good for a
Table IV: Number of local optima found - 10 s budget

Dataset     Integer set   Bitset   Variation (%)
aslbu            16,453   17,330        5
promoters         7,185    8,858       24
blocks           30,725   45,289       51
context           4,651    2,667      -47
splice              289      254      -12
sc2                 943      605      -36
skating           4,001    1,283      -68
jmlr                704       31      -95
smaller maximum length can also start covering negative sequences when the maximum length grows, resulting in a decreasing quality of patterns. Note that the opposite phenomenon can also occur.
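The truncation used above to control the search space can be sketched in one line; the toy data below is hypothetical:

```python
def truncate_dataset(sequences, max_len):
    """Keep only the first max_len itemsets of each sequence, shrinking the
    search space of candidate subsequence patterns."""
    return [seq[:max_len] for seq in sequences]

data = [[{"a"}, {"b"}, {"a", "c"}, {"d"}], [{"b"}, {"c"}]]
print(truncate_dataset(data, 2))  # [[{'a'}, {'b'}], [{'b'}, {'c'}]]
```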
G. Comparison to ground truth
We now compute the ratio of the WRAcc result quality of SeqScout to that of the ground truth. To obtain the ground truth, we ran an exhaustive algorithm (breadth-first search) on the sc2 dataset. However, the search space size makes the problem intractable in a reasonable time. To tackle this issue, we truncated the dataset so that sequences have a maximum length of 10. Note that even with this severe simplification, the exhaustive algorithm took 35 hours to complete and found more than 40 billion patterns. As Fig. 10 shows, SeqScout improves over time and reaches more than 90% of the quality of the best pattern set in less than 2,000 iterations (35 seconds on our machine).
H. Sequence length
The pattern lengths on all datasets are reported in Fig. 11. Let us consider the splice dataset: BeamSearch gives short patterns (at most 8 elements), significantly shorter than those of SeqScout and misère. This may explain why the BeamSearch result quality is poor on this dataset (see Fig. 3). One hypothesis is that it does not have enough iterations to go deep enough into the search space. Another is that BeamSearch cannot find good patterns having bad parents w.r.t. WRAcc: its good patterns are short ones.
I. Bitset vs. integer set representation
We investigate the usefulness of the bitset representation by comparing it against an integer set representation, where each item is represented by an integer. To do so, we compared the number of iterations SeqScout performed on each dataset within a fixed time budget of 10 seconds. The results are summarized in Table IV. The bitset representation gives a performance gain on the datasets with a smaller upper bound on the search space size. Indeed, context, skating, and jmlr contain long sequences, leading to large bitset representations. Those bitsets are then split into several parts to be processed by the CPU. If the number of parts is too large, the bitset representation becomes inefficient, and the classical integer set representation is a better option.
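The trade-off can be illustrated by encoding itemsets as integer bitmasks, where the subset test reduces to a couple of bitwise operations; on long sequences the masks span many machine words, which is where the integer set representation catches up. A sketch under our own encoding (the item vocabulary is hypothetical), not the paper's implementation:

```python
def to_bitset(itemset, item_index):
    """Encode an itemset as an integer bitmask: item -> bit item_index[item]."""
    mask = 0
    for item in itemset:
        mask |= 1 << item_index[item]
    return mask

def is_subset_bitset(a, b):
    """a is a subset of b on bitmasks: a has no bit set outside b."""
    return (a & ~b) == 0

index = {"Supply": 0, "Factory": 1, "Hatchery": 2}
a = to_bitset({"Supply"}, index)
b = to_bitset({"Supply", "Factory"}, index)
print(is_subset_bitset(a, b), is_subset_bitset(b, a))  # True False
```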
Figure 4: WRAcc for top-5 patterns w.r.t. iterations (skating)
Figure 5: WRAcc for top-5 patterns w.r.t. iterations (promoters)
Figure 6: WRAcc for top-5 patterns w.r.t. iterations (sc2)
Figure 7: WRAcc of top-5 patterns versus similarity threshold θ (promoters)
Figure 8: WRAcc of top-k patterns vs. k (context)
Figure 9: WRAcc of top-5 patterns versus max length (skating)
Figure 10: Ratio of WRAcc of SeqScout to ground truth (sc2)
Figure 11: Length of top-5 best patterns - 10K iterations
Figure 12: Additional cost of local optima search
J. Number of iterations during the local optima search
During the local optima search step, additional iterations are performed to improve the quality of the best patterns. In Fig. 12, we plot the ratio of these additional iterations to the number of iterations given to the main search. The more iterations we have, the more negligible this ratio becomes. Based on that analysis, we consider 10,000 iterations a reasonable setting for our experiments. Note, however, that we did not plot the additional cost for jmlr. In the particular case of text data, the number of possible items is quite large, leading to a fairly long local optima search. Consequently, SeqScout may not be the most relevant choice for this kind of dataset.
K. Quality evaluation on sc2
Given a faction, what are the best strategies to win? SeqScout looks for construction patterns that are characteristic of the victory of this faction. Let us consider the “Terran” faction as an example. For this faction, one of the best patterns found is ⟨{Hatchery}, {Supply}, {Factory}, {Supply}⟩. We use colors as indicators of factions: blue for “Terran” and purple for “Zerg”. We can see that the “Terran” player is investing in military units ({Supply} and {Factory}). The “Zerg” player chooses to invest in their economy by creating a {Hatchery}: they foster the so-called late game, sacrificing military forces in the so-called early game. In such a scenario, the “Terran” player strikes in the early game, knowing their military advantage, and tends to win the match. This example shows that our algorithm can provide relevant patterns that may be useful, e.g., to identify unbalanced strategies [16].
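Checking that such a pattern occurs in a game trace amounts to the standard containment test for sequences of itemsets: each itemset of the pattern must be included, in order, in some itemset of the trace. A sketch with a hypothetical trace (a greedy left-to-right scan is sufficient to decide existence):

```python
def is_subsequence(pattern, sequence):
    """True iff pattern = <p1, ..., pm> occurs in sequence = <s1, ..., sn>,
    i.e., there exist indices j1 < ... < jm with each pi a subset of s_ji."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:  # set subset test
            i += 1
    return i == len(pattern)

trace = [{"Hatchery"}, {"Supply", "Drone"}, {"Factory"}, {"Supply"}]
pattern = [{"Hatchery"}, {"Supply"}, {"Factory"}, {"Supply"}]
print(is_subsequence(pattern, trace))  # True
```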
L. Quality evaluation on jmlr
Let us now use a dataset consisting of abstracts of articles from the JMLR journal [29]. Sentences have been tokenized and stemmed. Each sequence is then a sequence of items, where an item corresponds to a stemmed word. To label the sequences, we add a class ‘+’ to sequences containing a target word and a class ‘-’ to the others. We also remove the target word from the sequences, as keeping it would lead to the discovery of patterns containing the target word itself, which is not interesting. The goal is then to find sequences of items (words) that are characteristic of the target word. The results for two target words (svm and classif) are presented in Table V. As we can see, these patterns are indeed characteristic of the target words.
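The labeling scheme described above can be sketched as follows (the toy documents are hypothetical, and stemming is assumed to have been done beforehand):

```python
def label_by_target(sequences, target):
    """Label a stemmed-word sequence '+' if it contains the target word,
    '-' otherwise, and remove the target word from the sequence."""
    labeled = []
    for seq in sequences:
        label = "+" if target in seq else "-"
        labeled.append(([w for w in seq if w != target], label))
    return labeled

docs = [["support", "vector", "svm", "train"], ["cluster", "unsupervis"]]
print(label_by_target(docs, "svm"))
# [(['support', 'vector', 'train'], '+'), (['cluster', 'unsupervis'], '-')]
```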
Table V: Top-10 non-singleton patterns given by SeqScout on jmlr for target words: “svm” and “classif”
svm classif
effici method classifi set
requir support classifi value
propos paper gener real
formul classif problem condit
individu show learn error
scale vector support gener
base number support paper propos
support machin present show problem classification set
machin techniqu problem appli
solut number data gener
VI. CONCLUSION
We presented the algorithm SeqScout to discover relevant subgroups in sequences of itemsets. Though we are not aware of available algorithms solving the same problem, we implemented adaptations of two other algorithms to sequences of itemsets, namely misère and BeamSearch. Our experiments showed that SeqScout gives better results, without the need for additional parameter tuning as in the case of beam search. Note, however, that due to the nature of the bandit approach, the algorithm performs better on datasets of reasonable size. Parallelizing the algorithm in an efficient programming language could be an interesting way to tackle this issue in future work. We also plan to further improve our method by incorporating Monte Carlo Tree Search, a logical evolution of bandit-based methods for sampling the search space. Indeed, instead of repeatedly generalizing promising dataset sequences, it could be more interesting to generalize promising patterns that occur in such promising sequences, identifying the best parts of the search space as the number of iterations increases.

Acknowledgment: This research has been partially funded by the French National project FUI DUF 4.0 2017-2021.
REFERENCES
[1] P. K. Novak, N. Lavrač, and G. I. Webb, “Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining,” Journal of Machine Learning Research, vol. 10, pp. 377–403, Jun. 2009.
[2] S. Wrobel, “An algorithm for multi-relational discovery of subgroups,” in Proceedings PKDD 1997, pp. 78–87.
[3] Z. Xing, J. Pei, and E. Keogh, “A brief survey on sequence classification,” SIGKDD Explor. Newsl., vol. 12, no. 1, pp. 40–48, Nov. 2010.
[4] B. Letham, C. Rudin, and D. Madigan, “Sequential event prediction,” Machine Learning, vol. 93, no. 2, pp. 357–380, Nov. 2013.
[5] N. Lavrac, P. A. Flach, and B. Zupan, “Rule evaluation measures: A unifying view,” in Proceedings ILP 1999, pp. 174–185.
[6] M. Atzmüller and F. Puppe, “SD-Map – a fast algorithm for exhaustive subgroup discovery,” in Proceedings PKDD 2006, pp. 6–17.
[7] G. Bosc, J.-F. Boulicaut, C. Raïssi, and M. Kaytoue, “Anytime discovery of a diverse set of patterns with Monte Carlo tree search,” Data Min. Knowl. Discov., vol. 32, no. 3, pp. 604–650, May 2018.
[8] L. Diop, C. T. Diop, A. Giacometti, D. Li Haoyuan, and A. Soulet, “Sequential pattern sampling with norm constraints,” in Proceedings IEEE ICDM 2018, pp. 89–98.
[9] E. Egho, D. Gay, M. Boullé, N. Voisine, and F. Clérot, “A user parameter-free approach for mining robust sequential classification rules,” Knowl. Inf. Syst., vol. 52, no. 1, pp. 53–81, Jul. 2017.
[10] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proceedings IEEE ICDE 1995, pp. 3–14.
[11] C. Raïssi and J. Pei, “Towards bounding sequential patterns,” in Proceedings ACM SIGKDD 2011, pp. 1379–1387.
[12] D. Dua and E. Karra Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[13] A. Giacometti, D. H. Li, P. Marcel, and A. Soulet, “20 years of pattern mining: a bibliometric survey,” SIGKDD Explor. Newsl., vol. 15, no. 1, pp. 41–50, 2013.
[14] M. J. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Machine Learning, vol. 42, no. 1, pp. 31–60, Jan. 2001.
[15] C. Zhou, B. Cule, and B. Goethals, “Pattern based sequence classification,” IEEE Trans. Knowl. Data Eng., vol. 28, pp. 1285–1298, 2016.
[16] G. Bosc, P. Tan, J.-F. Boulicaut, C. Raïssi, and M. Kaytoue, “A pattern mining approach to study strategy balance in RTS games,” IEEE Trans. Comput. Intellig. and AI in Games, vol. 9, no. 2, pp. 123–132, Jun. 2017.
[17] S. Nowozin, G. Bakir, and K. Tsuda, “Discriminative subsequence mining for action classification,” in Proceedings IEEE ICCV 2007, Oct., pp. 1–8.
[18] A. Belfodil, A. Belfodil, and M. Kaytoue, “Anytime subgroup discovery in numerical domains with guarantees,” in Proceedings ECML/PKDD 2018, Part 2, pp. 500–516.
[19] W. Duivesteijn, A. J. Feelders, and A. Knobbe, “Exceptional model mining,” Data Min. Knowl. Discov., vol. 30, no. 1, pp. 47–98, Jan. 2016.
[20] M. Boley, C. Lucchese, D. Paurat, and T. Gärtner, “Direct local pattern sampling by efficient two-step random procedures,” in Proceedings ACM SIGKDD 2011, pp. 582–590.
[21] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, May 2002.
[22] L. Qin, J. X. Yu, and L. Chang, “Diversifying top-k results,” Proceedings VLDB Endow., vol. 5, no. 11, pp. 1124–1135, Jul. 2012.
[23] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” CoRR, vol. abs/1204.5721, 2012.
[24] M. van Leeuwen and A. J. Knobbe, “Diverse subgroup set discovery,” Data Min. Knowl. Discov., vol. 25, no. 2, pp. 208–242, 2012.
[25] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2009.
[26] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential pattern mining using a bitmap representation,” in Proceedings ACM SIGKDD 2002, pp. 429–435.
[27] P. Papapetrou, G. Kollios, S. Sclaroff, and D. Gunopulos, “Discovering frequent arrangements of temporal intervals,” in Proceedings IEEE ICDM 2005, pp. 354–361.
[28] F. Mörchen and A. Ultsch, “Efficient mining of understandable patterns from multivariate interval time series,” Data Min. Knowl. Discov., vol. 15, no. 2, pp. 181–215, Oct. 2007.
[29] N. Tatti and J. Vreeken, “The long and the short of it: Summarising event sequences with serial episodes,” CoRR, vol. abs/1902.02834, 2019.
... We propose to adapt the UCB strategy and the MCTS to this problem, exploring the search space of extents of rules in a bottom-up way, and present an empirical study on several datasets to assess the validity of proposed methods. This contribution has been partially published in the proceedings of the French conference Extraction et Gestion de Connaissances EGC'19 [87], a more mature approach has been published in the proceedings of the 6th IEEE International Conference on Data Science and Advanced Analytics DSAA 2019 [89], and an extended version is currently under review for the journal Knowledge and Information Systems KAIS. Chapter 5 proposes a solution to address the problem of supervised rule discovery for high dimensional numerical data and time series. ...
... The SeqScout algorithm [89] presented in Chapter. 4 can mine discriminative patterns in sequences of itemsets. In the following, present a slight adaptation of SeqScout to mine behavioral patterns. ...
... The quality of this pattern, i.e., its discriminating power, is then computed with the chosen quality measure, the W RAcc in our case. Once the time budget has been reached, patterns are filtered to make sure they are non-redundant following Jaccard index, using a parameter θ (see [89], and the top-k are returned. To adapt this algorithm to our problem, we need to reconsider the generalisation step as complex event sequences also contains vectors of intervals. ...
Thesis
Full-text available
It is extremely useful to exploit labeled datasets not only to learn models and perform predictive analytics but also to improve our understanding of a domain and its available targeted classes. The subgroup discovery task has been considered for more than two decades. It concerns the discovery of rules covering sets of objects having interesting properties, e.g., they characterize a given target class. Though many subgroup discovery algorithms have been proposed for both transactional and numerical data, discovering rules within labeled sequential data has been much less studied. In that context, exhaustive exploration strategies can not be used for real-life applications and we have to look for heuristic approaches. In this thesis, we propose to apply bandit models and Monte Carlo Tree Search to explore the search space of possible rules using an exploration-exploitation trade-off, on different data types such as sequences of itemset or time series. For a given budget, they find a collection of top-k best rules in the search space w.r.t chosen quality measure. They require a light configuration and are independent from the quality measure used for pattern scoring. To the best of our knowledge, this is the first time that the Monte Carlo Tree Search framework has been exploited in a sequential data mining setting. We have conducted thorough and comprehensive evaluations of our algorithms on several datasets to illustrate their added-value, and we discuss their qualitative and quantitative results. To assess the added-value of one or our algorithms, we propose a use case of game analytics, more precisely Rocket League match analysis. Discovering interesting rules in sequences of actions performed by players and using them in a supervised classification model shows the efficiency and the relevance of our approach in the difficult and realistic context of high dimensional data. 
It supports the automatic discovery of skills and it can be used to create new game modes, to improve the ranking system, to help e-sport commentators, or to better analyse opponent teams, for example.
... There exists no mining algorithm for behavioral patterns as defined in this article. However, we recently introduced the SeqScout algorithm [10] that can mine discriminative patterns in sequences of itemsets. In the following, we first present the original version of SeqScout and then its slight adaptation to mine behavioral patterns. ...
... The quality of this pattern, i.e., its discriminating power, is then computed with the chosen quality measure, the W RAcc in our case. Once the time budget has been reached, patterns are filtered to make sure they are nonredundant following Jaccard index, using a parameter θ (see [10], and the top-k are returned. To adapt this algorithm to our problem, we need to reconsider the generalisation step as complex event sequences also contain vectors of intervals. ...
... In Fig. 6, we tested the influence of the parameter α from the generalisation step on the quality of patterns. Note that the W RAcc takes its values in [−0.25, 0.25] (see [10] for more information). Here we can see that we have an optimum for α = 0.8. ...
Conference Paper
Full-text available
Competitive gaming, or esports, is now well-established and brought the game industry in a novel era. It comes with many challenges among which evaluating the level of a player, given the strategies and skills she masters. We are interested in automatically identifying the so called skillshots from game traces of Rocket League, a "soccer with rocket-powered cars" game. From a pure data point of view, each skill execution is unique and standard pattern matching may be insufficient. We propose a non trivial data-centric approach based on pattern mining and supervised learning techniques. We show through an extensive set of experiments that most of Rocket League skillshots can be efficiently detected and used for player modelling. It unveils applications for match making, supporting game commentators and learning systems among others.
... Anytime subgroup discovery combines some of the strength of the previous strategies: (i) it provides subgroups instantly if needed, (ii) a set of high quality and highly diverse subgroups can be retrieved at anytime, (iii) the quality of subgroups increases as time goes on, (iv) the discovery goes from heuristic to exhaustive if the search is left to run until complete, though it is not possible in most of the real cases. The use of an anytime algorithm for subgroup discovery in labeled sequential data was also investigated in (Mathonat et al., 2019(Mathonat et al., , 2021. ...
... A method is developed to mine for pairs of nodes (subgroups) whose edge density is significantly different (higher or lower) from that of the overall graph. Finally, subgroup discovery in sequential data is proposed in (Mathonat et al., 2019) and has been extended in (Mathonat et al., 2021). The anytime sampling algorithm -called SeqScout -exploits a multi-armed bandit model to mine interesting sequential patterns. ...
Thesis
In today's society, information is becoming ever more pervasive. With the advent of the digital age, collecting and storing these near-infinite quantities of data is becoming increasingly easier. In this context, designing new Pattern Discovery methods, that allow for the semi-automatic discovery of relevant information and knowledge, is crucial. We consider data made of a set of descriptive attributes, where one or several of these attributes can be considered as target label(s). When a unique target label is considered, the Subgroup Discovery task aims at discovering subsets of objects -- subgroups -- whose target label distribution significantly deviates from that of the overall data. Exceptional Model Mining is a generalization of Subgroup Discovery. It is a recent framework that enables the discovery of significant local deviations in complex interactions between several target labels. In a world where everything has to be optimized, Multi-objective Optimization methods, which find the optimal trade-offs between numerous competing objectives, are of the essence. Although these research fields have given an extensive literature, their cross-fertilization has been considered only sparsely.Given collected data about a process of interest, we investigate the design of methods for the discovery of relevant parameter values driving the its optimization. Our first contribution is OSMIND, a Subgroup Discovery algorithm that returns an optimal pattern in purely numerical data. OSMIND leverages advanced techniques for search space reduction that guarantee the optimality of the discovery. Our second contribution consists of a generic iterative framework that leverages the actionability of Subgroup Discovery to solve optimization problems.Our third and main contribution is Exceptional Pareto Front Mining, a new class of models for Exceptional Model Mining that involves cross-fertilization between Pattern Discovery and Multi-objective Optimization. 
In-depth empirical studies have been carried out on each contribution to illustrate their relevance. Our methods are generic and can be applied to many application domains. To assess the actionability of our contributions in real life, we consider the problem of plant growth recipe optimization in controlled environments such as urban farms, the application scenario that has motivated our work. It is an intrinsic Multi-objective Optimization problem. We want to apply our pattern discovery methods to discover parameter values that lead to an optimized growth. Indeed, finding optimal settings could have tremendous repercussions on the profitability of urban farms. On synthetic and real-life data, we show that our methods allow for the discovery of parameter values that optimize the yield-cost trade-off of growth recipes.
... b) For the sequences N 2 , we first check whether the order between past actions matters via a simple distributional test. We use sequences as the description language and the algorithm SeqScout [18] to identify the patterns that lead to the recommendation of i at the first position. 3) For both, the best subgroup allows us to identify conditions on the user's actions maximizing the score of i * . ...
... To the best of our knowledge, there does not exist any subgroup discovery approach considering sequences with numerical target. Therefore, we propose to use SeqScout [18] designed to identify discriminating sequences when the target is categorical. For this, we evaluate the quality of a pattern d with the precision measure that is the proportion of sequences selected by d that lead to the recommendation of i at the first position: ...
... Our research concerns search space exploration methods for labeled sequences of itemsets and not just sequences of items. We first describe the algorithm SeqScout that has been introduced in our conference paper [26]. ...
... To the best of our knowledge, the problem of mining discriminative sequences of itemsets agnostic of the chosen quality measure with sampling approaches has not been addressed yet in the literature, except in our recent conference paper [26]. Hereafter, we describe our methods SeqScout and MCTSExtent that compute top-k non-redundant discriminative patterns. ...
Article
Full-text available
It is extremely useful to exploit labeled datasets not only to learn models and perform predictive analytics but also to improve our understanding of a domain and its available targeted classes. The subgroup discovery task has been considered for more than two decades. It concerns the discovery of patterns covering sets of objects having interesting properties, e.g., they characterize or discriminate a given target class. Though many subgroup discovery algorithms have been proposed for both transactional and numerical data, discovering subgroups within labeled sequential data has been much less studied. First, we propose an anytime algorithm SeqScout that discovers interesting subgroups w.r.t. a chosen quality measure. This is a sampling algorithm that mines discriminant sequential patterns using a multi-armed bandit model. For a given budget, it finds a collection of local optima in the search space of descriptions and thus, subgroups. It requires a light configuration and is independent from the quality measure used for pattern scoring. We also introduce a second anytime algorithm MCTSExtent that pushes further the idea of a better trade-off between exploration and exploitation of a sampling strategy over the search space. 2 Romain Mathonat et al. sequential data mining setting. We have conducted a thorough and comprehensive evaluation of our algorithms on several datasets to illustrate their added-value, and we discuss their qualitative and quantitative results.
... It can be seen as a selection query on the underlying database (Siebes, 1995) using the descriptive attributes. The literature abounds of possible descriptions language: itemsets (Agrawal, Imielinski, and Swami, 1993), hyper-rectangles (Grosskreutz and Rüping, 2009;Mampaey et al., 2012) , polygones (Belfodil et al., 2017b), sequences (Agrawal and Srikant, 1995;Grosskreutz, Lang, and Trabold, 2013;Mathonat et al., 2019), graphs (Kaytoue et al., 2017;Yan and Han, 2002) which define the space (set) of possible descriptions defining, by extent, the set of possible subsets of records that one can consider in the analysis task. In the scope of this thesis, we confine ourselves to propositional languages which are the most commonly used languages for attribute-value data (Kralj Novak, Lavrač, and Webb, 2009). ...
Thesis
Full-text available
With the rapid proliferation of data platforms collecting and curating data related to various domains such as governments data, education data, environment data or product ratings, more and more data are available online. This offers an unparalleled opportunity to study the behavior of individuals and the interactions between them. In the political sphere, being able to query datasets of voting records provides interesting insights for data journalists and political analysts. In particular, such data can be leveraged for the investigation of exceptionally consensual/controversial topics. Consider data describing the voting behavior in the European Parliament (EP). Such a dataset records the votes of each member (MEP) in voting sessions held in the parliament, as well as information on the parliamentarians (e.g., gender, national party, European party alliance) and the sessions (e.g., topic, date). This dataset offers opportunities to study the agreement or disagreement of coherent subgroups, especially to highlight unexpected behavior. It is to be expected that on the majority of voting sessions, MEPs will vote along the lines of their European party alliance. However, when matters are of interest to a specific nation within Europe, alignments may change and agreements can be formed or dissolved. For instance, when a legislative procedure on fishing rights is put before the MEPs, the island nation of the UK can be expected to agree on a specific course of action regardless of their party alliance, fostering an exceptional agreement where strong polarization exists otherwise. In this thesis, we aim to discover such exceptional (dis)agreement patterns not only in voting data but also in more generic data, called behavioral data, which involves individuals performing observable actions on entities. We devise two novel methods which offer complementary angles of exceptional (dis)agreement in behavioral data: within and between groups. 
These two approaches called Debunk and Deviant, ideally, enables the implementation of a sufficiently comprehensive tool to highlight, summarize and analyze exceptional comportments in behavioral data. We thoroughly investigate the qualitative and quantitative performances of the devised methods. Furthermore, we motivate their usage in the context of computational journalism.
Thesis
Recommender systems have received a lot of attention over the past decades with the proposal of many models that take advantage of the most advanced models of Deep Learning and Machine Learning. With the automation of the collect of user actions such as purchasing of items, watching movies, clicking on hyperlinks, the data available for recommender systems is becoming more and more abundant. These data, called implicit feedback, keeps the sequential order of actions. It is in this context that sequence-aware recommender systems have emerged. Their goal is to combine user preference (long-term users' profiles) and sequential dynamics (short-term tendencies) in order to recommend next actions to a user.In this thesis, we investigate sequential recommendation that aims to predict the user's next item/action from implicit feedback. Our main contribution is REBUS, a new metric embedding model, where only items are projected to integrate and unify user preferences and sequential dynamics. To capture sequential dynamics, REBUS uses frequent sequences in order to provide personalized order Markov chains. We have carried out extensive experiments and demonstrate that our method outperforms state-of-the-art models, especially on sparse datasets. Moreover we share our experience on the implementation and the integration of REBUS in myCADservices, a collaborative platform of the French company Visiativ. We also propose methods to explain the recommendations provided by recommender systems in the research line of explainable AI that has received a lot of attention recently. Despite the ubiquity of recommender systems only few researchers have attempted to explain the recommendations according to user input. However, being able to explain a recommendation would help increase the confidence that a user can have in a recommendation system. 
Hence, we propose a method based on subgroup discovery that provides interpretable explanations of a recommendation for models that use implicit feedback.
Conference Paper
Full-text available
In recent years, the field of pattern mining has shifted to user-centered methods. In such a context, it is necessary to have a tight coupling between the system and the user where mining techniques provide results at any time or within a short response time of only few seconds. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns that ensures a good interactivity while providing strong statistical guarantees due to its random nature. Curiously, such an approach investigated for itemsets and subgraphs has not yet been applied to sequential patterns, which are useful for a wide range of mining tasks and application fields. In this paper, we propose the first method for sequential pattern sampling. In addition to address sequential data, the originality of our approach is to introduce a constraint on the norm to control the length of the drawn patterns and to avoid the pitfall of the "long tail" where the rarest patterns flood the user. We propose a new constrained two-step random procedure, named CSSAMPLING, that randomly draws sequential patterns according to frequency with an interval constraint on the norm. We demonstrate that this method performs an exact sampling. Moreover, despite the use of rejection sampling, the experimental study shows that CSSAMPLING remains efficient and the constraint helps to draw general patterns of the "head". We also illustrate how to benefit from these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state of the art proposals showing the interest of constrained sequential pattern sampling.
Article
Full-text available
The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that enables to elicit such interesting patterns from labeled data. A question remains fairly open: How to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam-search, sampling, and genetic algorithms for discovering a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack of diversity: Only few patterns of high quality, and different enough, are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.
Article
Sequential data are generated in many domains of science and technology. Although many studies have been carried out on sequence classification in the past decade, the problem is still a challenge, particularly for pattern-based methods. We identify two important issues related to pattern-based sequence classification, which motivate the present work: the curse of parameter tuning and the instability of common interestingness measures. To alleviate these issues, we suggest a new approach and framework for mining sequential rule patterns for classification purposes. We introduce a space of rule pattern models and a prior distribution defined on this model space. From this model space, we define a Bayesian criterion for evaluating the interest of sequential patterns. We also develop a user-parameter-free algorithm to efficiently mine sequential patterns from the model space. Extensive experiments show that (i) the new criterion identifies interesting and robust patterns, and (ii) directly using the mined rules as new features in a classification process yields higher inductive performance than state-of-the-art sequential pattern-based classifiers.
Article
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier. The first classifier is based on an improved version of the existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern feature based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well, and prove effective for the classification task.
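The abstract above scores a pattern by combining its cohesion and its support. A toy illustration follows, assuming the common definition of cohesion as the pattern's size divided by the mean length of the shortest window containing all its items (the paper's exact measure may differ); function names and the multiplicative combination are illustrative.

```python
def shortest_window(seq, items):
    """Length of the shortest contiguous window of seq containing every item
    in `items`, or None if some item never occurs in seq."""
    items = set(items)
    best = None
    for i in range(len(seq)):
        need = set(items)
        for j in range(i, len(seq)):
            need.discard(seq[j])
            if not need:
                w = j - i + 1
                best = w if best is None else min(best, w)
                break
    return best

def interestingness(db, items):
    """Support * cohesion. Support: fraction of sequences containing all items.
    Cohesion: |items| divided by the mean shortest-window length over the
    sequences that contain the pattern (1.0 means the items always appear
    tightly together)."""
    windows = [w for s in db if (w := shortest_window(s, items)) is not None]
    if not windows:
        return 0.0
    support = len(windows) / len(db)
    cohesion = len(set(items)) / (sum(windows) / len(windows))
    return support * cohesion
```

For example, a pattern occurring in many sequences but with its items scattered far apart gets a high support yet a low cohesion, and thus a moderate overall score.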
Article
Whereas the purest strategy games, such as Go and Chess, seem timeless, the lifetime of a video game is short, influenced by popular culture, trends, boredom and technological innovations. Even the large budgets and development efforts allocated by publishers cannot guarantee timeless success. Instead, novelties and corrections are proposed to extend an inevitably bounded lifetime. Novelties can unexpectedly break the balance of a game, as players may discover unbalanced strategies that developers did not take into account. In the new context of electronic sports, an important challenge is to be able to detect game balance issues. In this article, we consider real-time strategy (RTS) games and present an efficient pattern mining algorithm as a basic tool for game balance designers, enabling them to search for unbalanced strategies in historical data through a Knowledge Discovery in Databases (KDD) process. We experiment with our algorithm on historical data from StarCraft II, which is played professionally as an electronic sport.
Article
In 1993, Rakesh Agrawal, Tomasz Imielinski and Arun N. Swami published one of the founding papers of pattern mining: "Mining Association Rules between Sets of Items in Large Databases". Beyond introducing a new problem, it established a new methodology in terms of resolution and evaluation. For two decades, pattern mining has been one of the most active fields in Knowledge Discovery in Databases. This paper provides a bibliometric survey of the literature relying on 1,087 publications from five major international conferences: KDD, PKDD, PAKDD, ICDM and SDM. We first measure a slowdown of research dedicated to pattern mining while the KDD field continues to grow. Then, we quantify the main contributions with respect to languages, constraints and condensed representations to outline the current directions. We observe a sophistication of languages over the last 20 years, although association rules and itemsets remain by far the most studied. As expected, the minimal support constraint predominates in pattern extraction, appearing in approximately 50% of the publications. Finally, condensed representations, used in 10% of the papers, enjoyed relative success, particularly between 2005 and 2008.
Article
Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
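The weighted-covering idea behind diverse subgroup set selection can be sketched as a greedy step that discounts a candidate's quality for every record already covered by the beam. The following is a simplified illustration of that cover-based redundancy penalty, not the DSSD implementation; the data layout and the parameter `alpha` are illustrative.

```python
def select_diverse_beam(candidates, beam_width, alpha=0.5):
    """Greedily build a beam of subgroups. At each step, pick the candidate
    whose quality, multiplied by the mean discount alpha**c over its cover
    (c = how often a record is already covered), is highest. candidates is a
    list of (quality, cover) pairs where cover is a frozenset of record ids."""
    counts = {}  # record id -> number of selected subgroups covering it
    beam = []
    pool = list(candidates)
    for _ in range(min(beam_width, len(pool))):
        def discounted(cand):
            q, cover = cand
            if not cover:
                return 0.0
            return q * sum(alpha ** counts.get(r, 0) for r in cover) / len(cover)
        best = max(pool, key=discounted)
        pool.remove(best)
        beam.append(best)
        for r in best[1]:
            counts[r] = counts.get(r, 0) + 1
    return beam
```

With this rule, a slightly weaker subgroup covering fresh records can beat a near-duplicate of an already selected one, which is exactly the exploration/exploitation trade-off the abstract describes.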
Article
Finding subsets of a dataset that somehow deviate from the norm, i.e. where something interesting is going on, is a classical Data Mining task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining), or an unusual distribution for one designated target attribute (common use of subgroup discovery). These, however, do not encompass all forms of “interesting”. To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected, and a model over these targets is chosen to be the target concept. Then, we strive to find subgroups: subsets of the dataset that can be described by a few conditions on single attributes. Such subgroups are deemed interesting when the model over the targets on the subgroup is substantially different from the model on the whole dataset. For instance, we can find subgroups where two target attributes have an unusual correlation, a classifier has a deviating predictive performance, or a Bayesian network fitted on several target attributes has an exceptional structure. We give an algorithmic solution for the EMM framework, and analyze its computational complexity. We also discuss some illustrative applications of EMM instances, including using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand.
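For the unusual-correlation instance of EMM mentioned above, a subgroup's quality can be scored as the gap between the correlation of two target attributes inside the subgroup and on the whole dataset. A minimal sketch under that assumption (the framework's exact quality measures vary; names here are illustrative):

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlation_quality(data, subgroup):
    """EMM-style quality: |corr(targets) on the subgroup - corr on all data|.
    data: list of (t1, t2) target pairs; subgroup: indices of covered rows."""
    sub = [data[i] for i in subgroup]
    x, y = zip(*data)
    sx, sy = zip(*sub)
    return abs(pearson(sx, sy) - pearson(x, y))
```

A subgroup whose targets are perfectly correlated inside a dataset with no overall correlation gets the maximal score, i.e. it is "exceptional" in the sense of the framework.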
Article
In sequential event prediction, we are given a “sequence database” of past event sequences to learn from, and we aim to predict the next event within a current event sequence. We focus on applications where the set of the past events has predictive power and not the specific order of those past events. Such applications arise in recommender systems, equipment maintenance, medical informatics, and in other domains. Our formalization of sequential event prediction draws on ideas from supervised ranking. We show how specific choices within this approach lead to different sequential event prediction problems and algorithms. In recommender system applications, the observed sequence of events depends on user choices, which may be influenced by the recommendations, which are themselves tailored to the user’s choices. This leads to sequential event prediction algorithms involving a non-convex optimization problem. We apply our approach to an online grocery store recommender system, email recipient recommendation, and a novel application in the health event prediction domain.