Mining episode rules in STULONG dataset *
Nicolas Méger¹, Claire Leschi¹, Noël Lucas² and Christophe Rigotti¹
¹ Bâtiment 501, INSA Lyon, F-69621 Villeurbanne Cedex, France
² Université d'Orsay, F-91405 Orsay
Abstract. This paper is concerned with the STULONG¹ dataset, one of the
data collections available for the PKDD Discovery Challenge 2004. This
dataset is the result of a twenty-year longitudinal study of risk factors
related to atherosclerosis in a population of 1417 middle-aged men. What
are the relations between risk factors and clinical demonstration of
atherosclerosis? And what are the time intervals over which these relations
are valid? To handle such issues, only one approach has been proposed so
far in the data mining literature: to perform several extractions of
sequential patterns in order to test the various possible time intervals.
This solution is clearly time consuming, and does not provide any
information about the reliability of time windows w.r.t. the extracted
patterns. In this paper, we propose to mine the STULONG dataset with a new
algorithm, WinMiner, that was recently proposed in [10]. WinMiner provides
a single optimized way to find sequential patterns along with their optimal
time intervals. The results we obtained are encouraging and provide
valuable additional information for physicians.
* This work has been partially funded by the European Project A.E.G.I.S. (IST-2000-
¹ The study (STULONG) was realized at the 2nd Department of Medicine, 1st Faculty
of Medicine of Charles University and Charles University Hospital, U nemocnice 2,
Prague 2 (head. Prof. M. Aschermann, MD, SDr, FESC), under the supervision of
Prof. F. Boudík, MD, ScD, with collaboration of M. Tomeckova, MD, PhD and
Ass. Prof. J. Bultas, MD, PhD. The data were transferred to the electronic form by
the European Centre of Medical Informatics, Statistics and Epidemiology of Charles
University and Academy of Sciences (head. Prof. RNDr. J. Zvarova, DrSc). The data
resource is on the web pages [...]. At present time
the data analysis is supported by the grant of the Ministry of Education CR Nr LN
00B 107.
1 Introduction
Atherosclerosis is the main cause of cardio-vascular diseases, and therefore
of death, in the developed world. It is also predicted to be the main cause
of death in the developing world within the first quarter of the next
century. Basically, atherosclerosis is a process that leads to a buildup,
namely "the plaque", that is made of cholesterol, cellular debris and
calcium, and that is located within the large and medium muscular arteries.
These plaques, when large enough, can reduce the blood flow in arteries.
This can cause claudication or even gangrene in the legs. But the worst case
scenario happens with the rupture of these plaques. In this case, debris of
the plaque can block blood vessels, causing heart attacks, strokes or
gangrene in the legs. Various factors, such as elevated levels of
cholesterol and triglycerides or high blood pressure, have been identified
as risk factors. Unfortunately, no treatment for atherosclerosis is
available. Physicians know that cardio-vascular diseases result from all
factors together, thus they should consider a global risk [3] instead of
concentrating prevention efforts on individual ones. So, as relations
between factors are not well known, preventive action is not as effective
as it could be, and should be significantly improved by a better
understanding of such dependencies.
Several studies have been started to collect a wide range of data about
atherosclerosis in order to discover factors and to exhibit relations
between them. The PKDD 2004 Discovery Challenge proposes, among others, the
STULONG data that results from such a study. More precisely, this data
concerns a twenty-year longitudinal study of atherosclerosis risk factors
in a population of 1417 middle-aged men. Within the framework of the PKDD
Discovery Challenge, run since 2002, some encouraging results have surfaced
[1, 2] but few of them deal with long-term observation of patients.
However, searching for dependencies between risk factors and clinical
demonstration of atherosclerosis, while analyzing their respective
relations to time, appears to be of real interest as it can provide crucial
information. Indeed, a risk without relation to time is not enough for a
good medical assessment. For example, the relative risk of dying, without
time information, is 1 for everybody at any time. More precisely, to
compare medication efficiencies or relative risk factor levels, the time
period over which the predicted risk holds is needed. Moreover, if the
physician knows both when the risk appears and when its impact is the
highest, he has valuable information. Considering lung cancer [5], which is
a relatively simple example in medicine, smoking dangerously increases
risks after about 10 years. If the patient stops smoking, the relative risk
of lung cancer progressively decreases over 15 years. This information is
useful because the physician knows the patient and his/her risk. Thus, he
knows which lab exams or medication prescriptions are recommended or not.
If we now consider cardio-vascular diseases, the risk depends on too many
factors: physicians do not know precisely how to deal with them w.r.t. time.
In our data mining effort on STULONG data, we mainly focused on file
CONTROL, which contains results of the observation of 66 attributes (risk
factors and clinical demonstration of atherosclerosis) recorded during
control examinations of the studied population over 20 years. We also used
a part of file ENTRY, which contains values of 64 attributes stored for
each patient and obtained from entry examinations done at the start of the
study. Then, we tried to exhibit dependencies between risk factors and/or
clinical demonstrations of atherosclerosis, as well as their relations to
time, by using the algorithm WinMiner [10]. The dependencies that are
searched for by WinMiner are episode rules. To do so, it processes a time
line over which event types are spread, an event type being for example
"high blood pressure" or "high level of cholesterol". Then, informally, an
episode rule and its associated measures (e.g. confidence and support, see
Section 2.1) reflect how often and how strongly a particular group G1 of
event types tends to appear close to another group G2. All previously
existing techniques have been designed to be run using a maximum window
size constraint that specifies the maximum elapsed time between the first
and the last event types, which clearly gives no information about the
"best" time intervals of the episode rules. On the contrary, WinMiner
checks all the episode rules that satisfy frequency and confidence
thresholds and outputs only episode rules for which there exists a time
interval maximizing the confidence measure. Thus if an episode rule is
found, one knows that it is "more valid" on a given time interval, also
provided by WinMiner. Applying WinMiner on STULONG data, we obtained
encouraging results describing well known phenomena, along with their
optimal time intervals. That gives valuable additional information to
physicians.
This paper is organized as follows. Section 2 presents in detail the work
related to episode rule extraction as well as the algorithm WinMiner and
the patterns it extracts. Section 3 gives an overview of the STULONG
dataset and describes the way we preprocess data to be able to perform
extractions. Section 4 reports and discusses the results of these
extractions. We finally conclude in Section 5.
2 Discovering episode rules and their optimal time
2.1 Context and related work
An event sequence is a sequence of events where each event is described by
a date of occurrence and an event type. Many datasets are composed of event
sequences. They can be divided into two categories. The first one concerns
datasets composed of a single but large event sequence (e.g. in [9]) while
the second one concerns datasets composed of many short event sequences
(e.g. in [14]). This second type is referred to as a base of sequences.
One way to exhibit relations between event types that are spread over event
sequences is to extract episodes. Informally, an episode is a sequential
pattern composed of event types, e.g. an ordered set of event types. If one
looks at how many times the episode occurs in the event sequences, one can
define the support property. Basically, the standard episode mining problem
is then to find all the episodes satisfying a given minimal support
constraint. The way the support is established depends on the dataset type.
In a base of sequences, the support of an episode (notice that, in this
context, episodes are referred to as sequential patterns) corresponds to
the number of sequences in which this pattern occurs at least once [14],
and several occurrences of the pattern in the same sequence have no impact
on its frequency; while in a large event sequence, the support of an
episode represents the number of its occurrences in this sequence [9].
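To make the base-of-sequences definition of support concrete, here is a
minimal Python sketch. It assumes that each sequence is given as a list of
event-type labels already sorted by date (the dates themselves are ignored),
and the names occurs_in and support_in_base are illustrative only; they are
not taken from [14] or [9].

```python
from typing import Iterable, Sequence


def occurs_in(pattern: Sequence[str], sequence: Iterable[str]) -> bool:
    """Return True if the event types of `pattern` appear in `sequence`
    in the same order (other events may be interleaved)."""
    remaining = iter(sequence)
    # `p in remaining` consumes the iterator up to the first match,
    # so each pattern element is searched for after the previous one.
    return all(p in remaining for p in pattern)


def support_in_base(pattern: Sequence[str],
                    base: Iterable[Sequence[str]]) -> int:
    """Support in a base of sequences: the number of sequences that
    contain the pattern at least once (repeated occurrences inside one
    sequence are not counted again)."""
    return sum(1 for sequence in base if occurs_in(pattern, sequence))


# Hypothetical toy base of three short sequences.
base = [
    ["A", "B", "A", "B"],  # A then B occurs twice, still counted once
    ["A", "C"],            # no B after A
    ["B", "A"],            # B occurs before A: wrong order
]
print(support_in_base(["A", "B"], base))
```

In a single large event sequence, the same pattern would instead be counted
once per occurrence, the exact notion of occurrence depending on the
algorithm used.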
In the case of large event sequences, a new type of pattern, namely the
episode rule, has been defined along with a new property, the confidence.
Informally, an episode rule is a pattern that states that, if an episode α
is observed, then another episode β occurs close to the first one. The
strength of this relation is expressed by the confidence, which has the
same meaning as for association rules. Confidence can be interpreted as
"when α occurs, what is the probability for β to appear close to α?". Then,
the standard episode rule mining problem is to extract all episode rules
that satisfy given support and confidence thresholds. To extract such
rules, two main approaches have been adopted. The first one, proposed and
used by [9, 8] in the Winepi algorithm, is based on the occurrences of
patterns in a sliding window along the sequence. The second one, introduced
in [7, 8] and supported by the Minepi algorithm, relies on the notion of
minimal occurrences of patterns. Both techniques have been designed to be
run using a maximum window size constraint that specifies the maximum
elapsed time between the first and the last events of pattern occurrences.
Indeed, when considering large event sequences, two problems appear. First,
if no window size constraint is given, then the search space is too large
to be handled. The second problem has to do with the significance of the
extracted rules. For example, can we consider that an episode rule is valid
if it states that if one takes aspirin during a week then one dies within
60 years?
2.2 Episode rules and local maximum of confidence
One drawback of specifying a maximal window size is that there exist several
application domains, e.g. seismology, where the window size is itself a
piece of information to extract, along with episode rules. When speaking
about window size, it can be the maximal window size over which a pattern
can spread, but it can also be the "best" one, i.e. the window size for
which the pattern maximizes a given property. Moreover, this window size
can be different from one episode rule to another.
Searching for an optimal window size. To handle this issue, [10] proposed an
algorithm, WinMiner, that uses a different constraint, namely the maximum
gap constraint. This constraint imposes a maximal elapsed time between two
consecutive events in the occurrences of an episode when mining large event
sequences. This maximum gap constraint is different from the maximum window
size one, since it allows occurrences of larger patterns to spread over
larger intervals of time, not directly bounded by a maximum window size. It
is similar to the maximum gap constraint handled by algorithms that find
frequent sequential patterns in a base of sequences (e.g., [12, 13, 6]).
Using the maximum gap constraint, algorithm WinMiner is able to mine large
event sequences, checking all the rules that satisfy some given support and
confidence thresholds, and this for all possible window sizes. For each one
of these rules, WinMiner checks if there exists a window size w maximizing
the confidence measure and such that, for at least one w' strictly greater
than w, the value of confidence is lower, by at least a given percentage
decRate, than the confidence obtained for w. If so, the lowest possible
window size maximizing the confidence, such that the rule holds for the
support and confidence thresholds, is selected as the first local maximum
(FLM) of confidence. This FLM is considered to be the optimal window size
of the rule. Then, all the rules having a FLM, namely the FLM-rules, are
output to form the entire set of FLM-rules. This set is the result of
WinMiner. In other words, all the FLM-rules, and only them, are extracted
along with their respective optimal window sizes. Due to space
restrictions, the reader is referred to [10] for more formal definitions.
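The selection of a first local maximum can be sketched as follows. This is
only a simplified reading of the informal description above, not the formal
definition of [10]: it assumes that the support and confidence of one rule
are already known for each window size, that decRate is expressed as a
percentage, and the function name first_local_maximum and the toy values
are hypothetical.

```python
from typing import Dict, Optional, Tuple


def first_local_maximum(
    curve: Dict[int, Tuple[int, float]],
    sigma: int,
    gamma: float,
    dec_rate: float,
) -> Optional[Tuple[int, float]]:
    """Scan window sizes in increasing order and return (w, confidence)
    for the first confidence peak that is followed by a drop of at least
    `dec_rate` percent. Return None if no such peak exists."""
    best: Optional[Tuple[int, float]] = None
    for w in sorted(curve):
        support, confidence = curve[w]
        # A sufficient drop after the current peak confirms it as the FLM.
        if best is not None and confidence <= best[1] * (1 - dec_rate / 100):
            return best
        # Only frequent and confident widths can become the peak; a strict
        # comparison keeps the lowest width in case of equal confidence.
        if support >= sigma and confidence >= gamma and (
            best is None or confidence > best[1]
        ):
            best = (w, confidence)
    return None


# Hypothetical curve for one rule: window size -> (support, confidence).
curve = {
    10: (120, 0.70),
    20: (150, 0.85),
    30: (160, 0.90),
    40: (170, 0.88),
    50: (180, 0.75),
}
print(first_local_maximum(curve, sigma=100, gamma=0.8, dec_rate=10))
```

With these toy values the peak is reached at w = 30, and the confidence at
w = 50 is more than 10% below it, so the rule would be kept with 30 as its
optimal window size. A rule whose confidence never drops that much after its
peak yields no FLM and would not be output.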
Algorithm WinMiner: principles and experimental validation. Algorithm
WinMiner uses a depth-first strategy (similar to the one used in [14]) to
minimize the needed amount of memory. It also relies on the notion of
occurrence lists and temporal joins [9] as well as on the notion of minimal
prefix occurrence [10]. The parameters to be supplied to WinMiner are: a
support threshold σ, a confidence threshold γ, a maximum time gap
constraint gapmax and a decrease threshold decRate. For more details, the
reader can refer to [10].
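The difference between the maximum gap constraint (parameter gapmax) and a
maximum window size constraint can be illustrated on the dates of a single
occurrence. The sketch below only illustrates the two definitions given
earlier; the function names and the toy dates are assumptions, and it is not
part of WinMiner itself.

```python
from typing import Sequence


def satisfies_max_gap(dates: Sequence[int], gap_max: int) -> bool:
    """Every pair of consecutive events is at most `gap_max` apart."""
    return all(later - earlier <= gap_max
               for earlier, later in zip(dates, dates[1:]))


def satisfies_max_window(dates: Sequence[int], window_max: int) -> bool:
    """The first and the last events are at most `window_max` apart."""
    return dates[-1] - dates[0] <= window_max


# Hypothetical occurrence of a three-event episode (dates sorted).
occurrence = [0, 400, 800]
print(satisfies_max_gap(occurrence, 500))     # gaps are 400 and 400
print(satisfies_max_window(occurrence, 500))  # total span is 800
```

The occurrence is accepted under a maximum gap of 500 but rejected under a
maximum window size of 500, which is why longer patterns can spread over
longer intervals under the gap constraint.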
Running WinMiner, no FLM-rule was found on random datasets, even for
experiments using rather weak constraints, which is a good result as such
datasets are supposed to be free of specific dependencies. FLM-rules were
found in other experiments on a real seismic dataset (this dataset is a
subset of the ANSS Composite Earthquake Catalog² that contains a series of
earthquakes described by their locations, occurrence times and magnitudes).
The FLM-rules obtained have suggested to geophysicists some possible
dependencies that had not been published in the geophysics literature at
that time. Another good point is that these experiments show that the
collection of rules that are frequent and confident (for some width) is too
large to be handled by an expert, while the size of the collection of
FLM-rules is several orders of magnitude smaller.
3 Goals and data preparation
The data is available at [...]. It consists of 4 tables respectively named
ENTRY, CONTROL, LETTER and DEATH. We first downloaded these tables as text
files, and then made some simple transformations using the relational
database management system Microsoft Access 97. Furthermore, we transformed
the selected information into a large event sequence to be processed by
WinMiner.
² Public world-wide earthquake catalog available at
3.1 Overview of provided tables
Table ENTRY is concerned with data related to the entry examinations of 1417
middle-aged men. Each record contains 64 attributes, either categorical or
numeric, divided into subgroups according to their meaning: identification
data, social characteristics, physical activity, smoking, drinking of
alcohol, physical examination, risk factors, and so on.
Table CONTROL deals with risk factors and clinical demonstration of
atherosclerosis that have been followed over a long-term observation of
patients. Values of 66 attributes were recorded at each control. As for the
previous table, attributes are either categorical or numeric, and are
divided into subgroups: identification data, changes since the last
examination, sickness leave, A2 questionnaire, physical examination and
biochemical examination. Only 1226 patients from the initial population are
concerned by the table CONTROL. Each patient has at least 1 control
examination and at most 21. The 10572 resulting examinations were performed
between 1976 and 1999.
Data collected in table LETTER is derived from a postal questionnaire filled
in by 403 patients. The values of 62 attributes, all categorical, are
stored in this table. They give additional information about the health
status of the concerned individuals. Finally, table DEATH contains
information about the death of 389 patients involved in the longitudinal
study. The 5 attributes stored in that table are related to the patients'
identification, date and cause of death.
As will be described in detail in Section 3.3, the data mining effort
presented in this paper mainly focuses on the content of table CONTROL.
According to the medical expertise, some attributes were imported by a join
operation from table ENTRY, either to be directly used in the mining
process, or to allow computing some important information for
atherosclerosis assessment. That resulted in a new table, named
Contr_mod_2, from which we derived the event types used to build a large
event sequence.
3.2 Aim of experiments
According to medical experts, the most crucial factors for atherosclerosis
are cholesterol, hypertension, smoking and physical activity. Other
aggravating ones are age, diabetes, alcohol consumption, BMI (Body Mass
Index) and family anamnesis. Furthermore, information concerning the level
of education should be important. However, as reported in the next
subsection, among these factors, we only took into account in our
experiments information related to the ones for which it makes sense to
appear in an event sequence. Then, using WinMiner to find episode rules
along with their optimal time window, given support and confidence
thresholds, we aim to give the medical expert a means to follow both the
evolution of risk factors and: (1) the impact of medical interventions such
as prescription of medicines, suggestion of diet, etc., (2) modifications
in patients' behaviour such as changes in physical activity, in smoking
habits, etc. In addition, physicians will be provided with an idea of the
periods of time during which the interactions found could be significantly
observed, along with the frequency and the probability of these
interactions.
3.3 Selection and discretization of attributes
First, we built up a new table, named Contr_mod_2, by performing a join
between tables ENTRY and CONTROL, in order to import some attributes
belonging to table ENTRY: ROKNAR (patient's year of birth), VYSKA
(patient's height), VZDELANI (patient's education level) and RARISK (family
anamnesis). The first 2 attributes were respectively used to compute
attributes Age (patient's age) and BMI (patient's Body Mass Index) at each
control. The last 2 were directly used in the experiments w.r.t. experts'
recommendations. We did not import the numerous attributes describing the
patients' alcohol consumption because they only gave an indication of
patients' behaviour at their entry in the study; so, such information could
not be taken into account for the successive control examinations. All
attributes of table CONTROL were kept at that step of data preparation. In
a second step we decided which attributes would actually be used for
episode rule extraction by WinMiner. We present below our choice through
the different subgroups of attributes as defined in tables ENTRY and
CONTROL.
Identification data and social characteristics. Attribute VZDELANI was used
as it stood. Attributes ICO (patient's identification number), ROKVYS (year
of the control examination), MESVYS (month of the control examination) and
PORADK (sequence of the control examination for individual patients) were
combined in a specific way to build up the large event sequence, as we will
see later. Furthermore, attribute ROKNAR was used together with ROKVYS to
set the values of a new attribute named Age, simply computed as the
difference between ROKVYS and ROKNAR. Then, Age was discretized into a new
attribute, termed CATAge, with 2 categories: if Age < 50, CATAge is set to
1, else 2.
Changes since the last examination. Attribute ZMCHARZA (changes in job) was
not selected because it does not seem relevant to our study. Attribute
ZMKOUR (changes in cigarette consumption per day) was not used because it
does not appear reliable enough. However, to deal with the "smoking"
factor, attribute POCCIG was replaced by 2 attributes, namely SMOKE_bin and
CATCig, respectively indicating whether the patient was smoking at a given
control examination or not, and if so, what smoker category he belonged to.
These 2 attributes are defined in Table 1. Because of its too numerous
original categories, attribute LEKTLAK (taking medicines for reduction of
blood pressure) was replaced by the attribute CATMED. If LEKTLAK states
that the patient does not take medicine for blood pressure, we set CATMED
to 0; if LEKTLAK indicates that the patient takes 1 (kind of) medicine for
blood pressure, CATMED is set to 1; if LEKTLAK specifies that the patient
takes more than 1 medicine for blood pressure, we set CATMED to 2; if
LEKTLAK is not stated, CATMED is set to 3. Other attributes of that
subgroup were all selected.

Value  CATBMI           SMOKE_bin    CATCig            CATBohlr       CATBoldk     CATDusn
0      -                POCCIG = 0   POCCIG ≤ 10       -              -            -
1      BMI < 25         POCCIG > 0   10 < POCCIG ≤ 20  BOLHR = 1,2,4  BOLDK = 1,2  DUSN = 1
2      25 ≤ BMI < 30    POCCIG n.s.  POCCIG > 20       BOLHR = 3,5    BOLDK = 3    DUSN = 2,5
3      BMI ≥ 30         -            POCCIG n.s.       BOLHR n.s.     BOLDK n.s.   DUSN n.s.
Table 1. Discretization of attributes ('-' = not used, 'n.s.' = not stated)
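The discretization of Table 1 for the smoking and BMI attributes can be
written as small functions. This is a sketch under stated assumptions: the
category bounds are taken as POCCIG ≤ 10, 10 < POCCIG ≤ 20, POCCIG > 20 and
BMI < 25, 25 ≤ BMI < 30, BMI ≥ 30, which is how Table 1 is read here; a "not
stated" value is represented by None; and the function names are
hypothetical.

```python
from typing import Optional


def smoke_bin(poccig: Optional[int]) -> int:
    """SMOKE_bin: 0 = non smoker, 1 = smoker, 2 = not stated."""
    if poccig is None:  # assumption: None encodes "not stated"
        return 2
    return 0 if poccig == 0 else 1


def cat_cig(poccig: Optional[int]) -> int:
    """CATCig: smoker category from the number of cigarettes per day."""
    if poccig is None:
        return 3
    if poccig <= 10:
        return 0
    if poccig <= 20:
        return 1
    return 2


def cat_bmi(bmi: float) -> int:
    """CATBMI: 1 if BMI < 25, 2 if 25 <= BMI < 30, 3 if BMI >= 30."""
    if bmi < 25:
        return 1
    if bmi < 30:
        return 2
    return 3


print(smoke_bin(15), cat_cig(15), cat_bmi(27.3))
```

Keeping one function per derived attribute mirrors the column layout of
Table 1 and makes each bound easy to check against it.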
Sickness leave. Attributes PRACNES (been in sickness leave since the last
visit) and JINAONE (other reasons for sickness leave) were ignored because
they do not appear relevant for our data mining effort. Other attributes of
that subgroup were all used.
A2 questionnaire. Attributes HODNSK (valued group of patients), HODN0
(health condition at the control examination) and ROK0 to ROK23 (year of
finding HODN0 to HODN23) were not selected because they do not appear
relevant in the context of the present study. Attributes BOLHR (chest
pain), BOLDK (lower limbs pain) and DUSN (dyspnea) were categorized as
reported in Table 1 (see respectively CATBohlr, CATBoldk and CATDusn).
Other attributes were all retained.
Physical and biochemical examinations. Attributes TRIC (skinfold above
musculus triceps), SUBSC (skinfold above musculus subscapularis), GLYCEMIE
(glycemia) and KYSMOC (uric acid) were not selected. Attribute HMOT
(patient's weight in kg) was used together with VYSKA (imported from ENTRY)
to compute the BMI (Body Mass Index) of a patient at each control, using
the formula: BMI = (HMOT × 10000)/(VYSKA)^2, i.e. the weight in kg divided
by the squared height in metres, VYSKA being given in cm. Furthermore, the
patients' BMI was categorized, resulting in a new attribute CATBMI,
specified in Table 1. Both BMI and CATBMI were used in the experiments.
Concerning blood pressure, we did not use numeric attributes SYST (systolic
blood pressure) and DIAST (diastolic blood pressure) for the following
reason: while mining episode rules, WinMiner searches for rules with a
given support and so it makes no sense to use widely spread numerical
values to represent blood pressure. Instead, we chose to use the attribute
HYPERSD, which directly states whether the patient has systolic-diastolic
hypertension or not. Furthermore, we did not use attributes HYPERS and
HYPERD, respectively meaning that the patient has systolic or diastolic
hypertension or not, because they appear to us as redundant information
w.r.t. HYPERSD. Concerning cholesterol, we only selected attribute HYPCHL,
indicating whether a patient has hypercholesterolemia or not. We ignored
numeric values CHLST, CHLSTMG, HDL, HDLMG, LDL for the same reason as for
SYST and DIAST. For triglycerides, we acted in the same way, selecting only
HYPTGL. Finally, MOC was used as it stood.
Building up the event sequence. Table Contr_mod_2 was exported as a text
file and then preprocessed in order to build up a large event sequence to
be mined by WinMiner. Events of that large sequence were designed as
follows. For each patient, we built a subsequence containing all control
examinations he was concerned with. An event of this sequence is a couple
(date of examination, value of a selected attribute/event type). The date
of an event was calculated by:
date = n × 10^3 + number of months from the first control,
where n corresponds to the patient number (1 ≤ n ≤ 1226). Thus, attributes
ICO and PORADK were indirectly used to respectively discriminate between
patients, and between control examinations of the same patient. The
information given by attributes ROKVYS and MESVYS was used to compute the
number of months from the first control (associated with month 0) for the
patient under consideration. Because a patient was associated with at most
21 control examinations within the data set, the former coding guaranteed
that, with a maximum gap value of 500, it was not possible to associate, in
the same episode rule, events corresponding to 2 different patients. For
each selected categorical variable, the event type associated with a given
date was labeled as the "concatenation" of the attribute number (in the
table) and its actual value. For the unique numeric variable that was
selected, BMI, we decided not to use its value directly but to consider
variations of BMI between 2 consecutive controls for the same patient. We
then discretized BMI as follows: let x be a parameter fixed by the user,
and C_i and C_{i+1} 2 consecutive controls (for the same patient); if
|BMI(C_{i+1}) − BMI(C_i)| ≤ x%, then BMI was coded as BMI_stab; else if
BMI(C_{i+1}) − BMI(C_i) > 0, then BMI was coded as BMI_pos; else if
BMI(C_{i+1}) − BMI(C_i) < 0, then BMI was coded as BMI_neg. Then, BMI_stab
was associated with value 0, BMI_pos with 1 and BMI_neg with 2. Finally,
the large event sequence was obtained as the concatenation of all
subsequences constructed for each patient having at least 3 control
examinations in table CONTROL. An extract of this sequence is depicted in
Figure 1.

//1st control examination of 1st patient
1000 602
1000 701
1000 902
1000 6701
//2nd control of 1st patient
1016 501
1016 7100
//last control examination of 1st patient (18th)
1230 502
1230 602
1230 6703
1230 7100
//1st control examination of 2nd patient (Gap > 500)
2000 502
2000 6701
Fig. 1. Extract of the event sequence built from the STULONG dataset
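The encoding described above can be sketched in Python as follows. Several
points are assumptions made for illustration and are not stated in the
text: the attribute value is zero-padded to two digits (inferred from codes
such as 6701 and 7100 in Figure 1), the x% tolerance is taken relative to
the BMI of the earlier control, and all function names are hypothetical.

```python
def months_since_first_control(first_year: int, first_month: int,
                               year: int, month: int) -> int:
    """Number of months elapsed since the first control (month 0),
    computed from year/month pairs such as (ROKVYS, MESVYS)."""
    return (year - first_year) * 12 + (month - first_month)


def event_date(patient_number: int, months: int) -> int:
    """date = n * 10^3 + number of months from the first control."""
    return patient_number * 1000 + months


def event_type(attribute_number: int, value: int) -> str:
    """Concatenation of the attribute number and its value
    (two-digit value: an assumption inferred from Figure 1)."""
    return f"{attribute_number}{value:02d}"


def bmi_variation(bmi_prev: float, bmi_next: float, x_percent: float) -> int:
    """0 = BMI_stab, 1 = BMI_pos, 2 = BMI_neg between two consecutive
    controls (tolerance relative to the earlier BMI: an assumption)."""
    delta = bmi_next - bmi_prev
    if abs(delta) <= bmi_prev * x_percent / 100:
        return 0
    return 1 if delta > 0 else 2


# Second control of patient 1, 16 months after the first one.
months = months_since_first_control(1976, 3, 1977, 7)
print(event_date(1, months), event_type(67, 1))

# The last event of patient 1 in Figure 1 (1230) and the first event of
# patient 2 (2000) are more than gapmax = 500 apart.
print(event_date(2, 0) - event_date(1, 230) > 500)
```

Since the controls were performed between 1976 and 1999, the month offset of
a patient stays well below 500, so two consecutive patients are always
separated by a gap larger than 500 and cannot be combined in one occurrence.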
4 Results
We ran a series of experiments³ on the event sequence resulting from our
data preparation and pre-processing (see Section 3). We present here the
results we got by performing an extraction task in which parameter values
were set as follows: σ = 100, γ = 0.8, gapmax = 500 and decRate = 10. In
this task, we limited the size of extracted episodes to 3 (i.e. extracted
episodes have a size that ranges from 1 to 3). The experiment was performed
within 316.87 seconds. 42262 episode rules satisfying the support threshold
were found, of which 9248 satisfy the confidence threshold, while only 6
FLM-rules were finally selected by WinMiner. This clearly shows that the
definition of FLM-rules does not lead to overly large outputs. This is
important as an expert cannot browse thousands of rules. The 6 episode
rules found are the following:
HYPCHL_2 chDiet_ZMDIET_3 HYPCHL_2 : w = 40 : cw = 0.800797 : sw = 201
HYPTGL_2 bmi_BMI_neg HYPTGL_2 : w = 43 : cw = 0.807595 : sw = 319
CATAge_1 chDiet_ZMDIET_3 c : w = 94 : cw = 0.87234 : sw = 123
CATAge_1 chDiet_ZMDIET_3 dyspnea_CATDusn_1 : w = 86 : cw = 0.93617 : sw = 132
CATAge_1 CATMED_1 urine_MOC_1 : w = 116 : cw = 0.898305 : sw = 106
CATAge_1 CATMED_1 limbpain_CATBoldk_1 : w = 116 : cw = 0.915254 : sw = 108
To interpret this output, one has to know that: the field separator
character is ':', the first field is the discovered episode rule itself,
field w indicates its optimal time window, field cw gives its confidence
for window w, field sw gives its support for window w, event type HYPCHL_2
means "no hypercholesterolemia", event type HYPTGL_2 means "no
hypertriglyceridemia", event type chDiet_ZMDIET_3 means "patient sometimes
eats according to the recommended diet", event type bmi_BMI_neg means "BMI
of the patient has decreased since the last examination", event type
dyspnea_CATDusn_1 means "no dyspnea/respiratory problem found", event type
urine_MOC_1 means "urine is normal", event type limbpain_CATBoldk_1 means
"no limb pain", and event type CATMED_1 means "patient takes one single
medicine for blood pressure". Then, the first line of the output can be
read as "if the patient has no hypercholesterolemia and if he sometimes
follows his diet, then the patient has no hypercholesterolemia with a
probability of 0.8 and this, within 40 months, which is the optimal window
size for this rule. This rule is supported by 201 examples in the event
sequence". It is to be noticed that each rule discovered in this experiment
expresses knowledge that is well known. The additional information is all
contained in the field w, i.e. the optimal window size. Moreover, the fact
that we do exhibit known phenomena gives us an indication about the
correctness of the whole process we followed, from data preparation to the
use of WinMiner.
³ All the experiments were performed using an implementation of WinMiner in C++,
on an Intel Pentium IV 2 GHz, under a 2.4 Linux kernel with 1 GB of memory (all
the experiments were run using between 0.5 and 300 MB of RAM).
Another noticeable aspect is that none of the discovered rules ends with an
event type related to a health problem. Indeed, this kind of event type is
not frequent. We then ran another experiment using the same parameters as
in the previous one, except for the support threshold, which was set to 20.
This time, 80604 rules satisfying the support threshold were found, of
which 11466 also satisfy the confidence threshold. Then, 217 rules were
found to be FLM-rules. Among these rules, we found the following one:
chDiet_ZMDIET_6 claudication_HODN12_12 claudication_HODN12_12 : w = 31 : cw = 0.913043 : sw = 21
This rule states that "if one eats less fat and carbohydrates and if he has
claudication observed some time later, then this claudication does not
disappear over about 30 months (optimal window size) with a probability of
about 0.9". Once again, the discovered rule expresses an expected
phenomenon, while giving a new piece of information, the optimal window
size. The whole set of obtained results is therefore encouraging. Indeed,
cardio-vascular diseases concerned by the STULONG dataset are quite well
known, so we rediscovered knowledge already stated by epidemiology in
medicine while suggesting information concerning its temporal aspects. It
can be expected that, with new data and new risk factors highlighted in the
last decade (intima-media thickness [11], pulse wave velocity [4]), it
could be possible to discover new phenomena along with their optimal window
sizes.
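The output format described in this section (fields separated by ':', the
rule first, then w, cw and sw) can be read with a few lines of Python. This
is a minimal sketch: it assumes that the event types of a rule are separated
by whitespace and written with underscores, which is how the rules are
transcribed here, and the function name parse_rule_line is hypothetical.

```python
from typing import Dict, List, Union


def parse_rule_line(line: str) -> Dict[str, Union[List[str], int, float]]:
    """Split one output line into the rule and its measures."""
    rule, *measures = (field.strip() for field in line.split(":"))
    values = {}
    for measure in measures:
        # Each measure looks like "w = 40", "cw = 0.800797" or "sw = 201".
        name, value = (part.strip() for part in measure.split("="))
        values[name] = value
    return {
        "rule": rule.split(),        # event types of the rule, in order
        "w": int(values["w"]),       # optimal window size
        "cw": float(values["cw"]),   # confidence for window w
        "sw": int(values["sw"]),     # support for window w
    }


line = "HYPCHL_2 chDiet_ZMDIET_3 HYPCHL_2 : w= 40 : cw = 0.800797 : sw = 201"
parsed = parse_rule_line(line)
print(parsed["rule"], parsed["w"], parsed["cw"], parsed["sw"])
```

Once parsed, the few FLM-rules returned by an experiment can be sorted by w
or cw before being handed to a medical expert.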
5 Conclusion and perspectives
In this paper, we have presented a complementary way to mine the STULONG
dataset, searching for temporal dependencies between atherosclerosis risk
factors and clinical demonstration of atherosclerosis that have an optimal
time interval/window size. In our experiments, we extracted relations,
namely episode rules, that all express known phenomena. The added value
consists in supplying the optimal window sizes of the discovered relations.
This kind of approach is interesting from a medical point of view. Indeed,
with STULONG, we know that risky behaviours dramatically increase the
emergence of cardio-vascular diseases. But we do not know when. So, finding
out the time interval between a behaviour change and its consequences
brings out which particular behaviour is involved; thus it offers the
medical expert a possibility to make explicit the impact of a risk factor
and to refine its part in comparison with other ones within a time
interval. This has obvious advantages. It allows an earlier disease
detection for a given patient and therefore better medical care. Then, it
helps with secondary prevention (when atherosclerosis is already developed)
for large groups of patients who can be targeted (i.e. who should have a
test, when and how often this test should be performed). To our knowledge,
no result of this type has been proposed so far in the various PKDD
discovery challenges that concern the STULONG dataset. Another positive
point is that few episode rules are obtained, which allows experts to
manually analyze the outputs. Moreover, the approach adopted in this paper
is also interesting in the sense that it could be applied to other medical
datasets, either concerning atherosclerosis and containing recently
identified risk factors, or related to another kind of disease, to help in
finding unknown phenomena. That opens new perspectives both for data miners
and physicians.
Acknowledgements
We want to thank the CNRS specific action AS Discovery Challenge, managed by
Jean-François Boulicaut and Bruno Crémilleux, and in particular all participants
of the Atherosclerosis Workgroup for fruitful and friendly discussions.
References
1. Discovery Challenge ECML-PKDD 2002.
2. Discovery Challenge ECML-PKDD 2003.
3. K. Anderson, P. Odell, P. Wilson, and W. Kannel. Cardiovascular disease risk
profiles. Am Heart J, 121: p. 312–318, 1990.
4. P. Boutouyrie, T. Krummel, et al. In Pourquoi et comment mesurer le risque
cardiovasculaire ? La revue du praticien, 54: p. 618, 2004.
5. C. Hill, F. Doyon, and H. Sancho-Garnier. Épidémiologie des cancers. Paris,
Médecine-Sciences Flammarion, 1997.
6. M. Leleu, C. Rigotti, J.-F. Boulicaut, and G. Euvrard. Constraint-based mining
of sequential patterns over datasets with consecutive repetitions. In Proc.
of the Int. Conf. on Principles of Data Mining and Knowledge Discovery in
Databases (PKDD'03), pages 303–314, Cavtat-Dubrovnik, Croatia, September
2003. Springer-Verlag LNCS 2838.
7. H. Mannila and H. Toivonen. Discovery of generalized episodes using minimal
occurrences. In Proc. of the 2nd International Conference on Knowledge Discovery
and Data Mining (KDD'96), pages 146–151, Portland, Oregon, August 1996.
8. H. Mannila, H. Toivonen, and A. Verkamo. Discovery of frequent episodes in event
sequences. Data Mining and Knowledge Discovery, 1(3):259–298, November 1997.
9. H. Mannila, H. Toivonen, and I. Verkamo. Discovering frequent episodes in
sequences. In Proc. of the 1st International Conference on Knowledge Discovery and
Data Mining (KDD'95), pages 210–215, Montreal, Canada, August 1995. AAAI
10. N. Méger and C. Rigotti. Constraint-based mining of episode rules and optimal
window sizes. In Proc. of the Int. Conf. on Principles of Data Mining and
Knowledge Discovery in Databases (PKDD'04). Springer-Verlag LNCS. To appear.
11. A. Simon, J. Gariepy, G. Chironi, et al. Intima-media thickness: a new tool for
diagnosis and treatment of cardiovascular risk. J Hypertens, 20: p. 159–169, 2002.
12. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proc. of the 5th International Conference on Extending
Database Technology (EDBT'96), pages 3–17, Avignon, France, September 1996.
13. M. Zaki. Sequence mining in categorical domains: incorporating constraints. In
Proc. of the 9th International Conference on Information and Knowledge
Management (CIKM'00), pages 422–429, Washington, DC, USA, November 2000.
14. M. Zaki. SPADE: an efficient algorithm for mining frequent sequences. Machine
Learning, Special issue on Unsupervised Learning, 42(1/2):31–60, Jan/Feb 2001.