Mining Strong Associations and Exceptions in
the STULONG Data Set
Eduardo Corrêa Gonçalves and Alexandre Plastino*
Universidade Federal Fluminense, Department of Computer Science,
Rua Passo da Pátria, 156 - Bloco E - 3º andar - Boa Viagem
24210-240, Niterói, RJ, Brazil
{egoncalves, plastino}@ic.uff.br
http://www.ic.uff.br
Abstract. Multidimensional association rules represent an important
type of knowledge that can be mined from large relational databases or
data warehouses. These rules describe combinations of attribute values
that often occur together in a database and can reveal hidden and useful
patterns. This paper presents both strong multidimensional association
rules and exceptions mined from the STULONG data set, prepared for
the Discovery Challenge of ECML/PKDD-2004. The STULONG data
set keeps information about risk factors of atherosclerosis in patients
from the Czech Republic. We adopted an approach that aims at finding
exceptions, which are represented by association rules that become much
weaker in some specific subsets of the database. The results found are
reported and commented.
1 Introduction
The STULONG¹ data set is a real database that keeps information about the
study of the development of atherosclerosis risk factors in a population of
middle-aged men. This study lasted for more than 20 years. As a first step,
entry examinations were performed on 1417 patients from 1975 to 1979. These
patients were requested to fill in a form with their personal information and
general habits. They also underwent physical and biochemical examinations.
The following aspects were defined by the specialists as risk factors:
arterial hypertension, high level of total or LDL cholesterol, low level of
HDL cholesterol, glycemy, high level of uric acid, hypertriglyceridemy,
obesity, positive family case history and the habit of smoking many
cigarettes. According to these risk factors and to the results of the entry
examinations, the patients were classified into three groups:
A. Normal Group. Men without the presence of any risk factor.
B. Risk Group. Men with the presence of one or more risk factors, but without
the manifestation of any cardiovascular disease.
C. Pathologic Group. Men with either an identified cardiovascular disease or
other serious disease.
* Work sponsored by CNPq research grant 300879/00-8.
¹ The study (STULONG) was realized at the 2nd Department of Medicine, 1st Faculty
of Medicine of Charles University and Charles University Hospital, U nemocnice 2,
Prague 2 (head. Prof. M. Aschermann, MD, SDr, FESC), under the supervision
of Prof. F. Boudík, MD, ScD, with collaboration of M. Tomečková, MD, PhD and
Ass. Prof. J. Bultas, MD, PhD. The data were transferred to the electronic form by
the European Centre of Medical Informatics, Statistics and Epidemiology of Charles
University and Academy of Sciences (head. Prof. RNDr. J. Zvárová, DrSc). The
data resource is on the web pages http://euromise.vse.cz/STULONG. At the present
time the data analysis is supported by the grant of the Ministry of Education CR
Nr LN 00B 107.
The data collected by the STULONG project were prepared for the Discovery
Challenge of ECML/PKDD-2004. Four tables were made available:
1. Entry: stores data related to the entry examinations.
2. Control: stores data related to long-term observations performed on patients.
3. Letter: stores additional information about the health status of 403 men.
4. Death: stores data related to the patients who died.
This paper aims at presenting strong association rules and exceptions mined
from the Entry table. We focused the mining process on finding answers to some
of the proposed analytic questions [3]. The rest of this paper is organized as
follows. An overview of multidimensional association rules and their interest
measures is given in Sect. 2. In Sect. 3 we present the adopted approach to mine
exceptions in databases. The data preparation process is described in Sect. 4
and the associations and exceptions mined from the STULONG data set are
presented in Sect. 5. Some concluding remarks are made in Sect. 6.
2 Multidimensional Association Rules
Multidimensional association rules [4] represent combinations of attribute values
that often occur together in a database, revealing hidden and useful patterns.
These rules can be mined from data warehouses or relational databases, where
attributes can be categorical or quantitative. An example of a multidimensional
association rule mined from the Entry table is: (DailyBeerCons = “>1l”) ⇒
(Smoking = “>20 cig/day”). This rule indicates that men who are heavy beer
consumers (those who drink more than a liter of beer per day) are more likely
to also be heavy smokers (they smoke more than 20 cigarettes per day). This
example involves two attributes (or dimensions, following the terminology used
in multidimensional databases): DailyBeerCons and Smoking.
A multidimensional association rule can be formally defined as follows:
$$A_1 = a_1, \ldots, A_n = a_n \Rightarrow B_1 = b_1, \ldots, B_m = b_m,$$
where $A_i$ ($1 \le i \le n$) and $B_j$ ($1 \le j \le m$) represent distinct database
attributes and $a_i$ and $b_j$ are values from the domains of $A_i$ and $B_j$,
respectively. To simplify the notation, in the remainder of this section we will
represent a generic multidimensional association rule as an expression of the
form $A \Rightarrow B$, where $A$ and $B$ are sets of conditions over different
attributes. We say that $A$ is the antecedent and $B$ is the consequent of the
rule. A multidimensional association rule can involve several attributes in
both the antecedent and the consequent.
The support of a rule $A \Rightarrow B$ in a relational database is the probability
that a tuple matches all conditions in $A \cup B$. The confidence of $A \Rightarrow B$
is the probability that a tuple matches $B$, given that it matches $A$. Typically,
the problem of mining association rules from databases consists of finding all
rules that satisfy user-provided minimum support and minimum confidence
thresholds. However, this model presents some problems, as pointed out in [2].
The “support/confidence framework” often generates a huge number of association
rules that are obvious or even misleading. The following example demonstrates
this fact. Consider two association rules extracted from the Entry table, which
are shown in Table 1. The values in the third and fourth columns (SupA and SupB)
represent the probability that a tuple matches all conditions in the antecedent
and in the consequent, respectively. The values in the fifth and sixth columns
(Sup and Conf) represent the support and the confidence values of each
association rule, respectively.
Table 1. Example of support and confidence indices
Id   Association Rule                    SupA    SupB    Sup     Conf
R1   (DailyBeerCons = “>1l”) ⇒           0.1193  0.2602  0.0448  0.3758
     (Smoking = “>20 cig/day”)
R2   (DailyBeerCons = “>1l”) ⇒           0.1193  0.8487  0.0905  0.7584
     (Married = “yes”)
Rule R2 seems to imply that men who are heavy beer consumers tend to
be married. The support and confidence values of R2 are higher than those of
R1, which could lead to the conclusion that R2 is more interesting than R1.
However, note that the confidence of R2 indicates that 75.84% of heavy
beer consumers are married, whereas the SupB column shows that 84.87%
of all men in the Entry table are married. Therefore, we can conclude that
married men are less likely to be heavy beer consumers: there is a negative
dependence between being married and being a heavy beer consumer. On the other
hand, the confidence value of R1, which represents the probability that a man
is a heavy smoker given that he is a heavy beer consumer, is 37.58%, while
the SupB column shows that only 26.02% of all men are heavy smokers. Heavy beer
consumers are therefore more likely to smoke a lot: there is a positive
dependence between these attributes.
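The definitions of support and confidence given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuples below are hypothetical toy data (not STULONG records), and the helper names `matches`, `support` and `confidence` are introduced here for the example.

```python
from typing import Dict, List

Row = Dict[str, str]  # one tuple of the relation: attribute name -> value


def matches(row: Row, conditions: Row) -> bool:
    """True if the tuple satisfies every attribute = value condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def support(rows: List[Row], conditions: Row) -> float:
    """Fraction of tuples that match all the given conditions."""
    return sum(matches(r, conditions) for r in rows) / len(rows)


def confidence(rows: List[Row], antecedent: Row, consequent: Row) -> float:
    """Probability that a tuple matches B, given that it matches A.

    Assumes at least one tuple matches the antecedent.
    """
    both = {**antecedent, **consequent}
    return support(rows, both) / support(rows, antecedent)


# Hypothetical toy relation with five tuples (not real STULONG data).
rows = [
    {"DailyBeerCons": ">1l", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": ">1l", "Smoking": "no"},
    {"DailyBeerCons": "<=1l", "Smoking": "no"},
    {"DailyBeerCons": "<=1l", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": "<=1l", "Smoking": "no"},
]
A = {"DailyBeerCons": ">1l"}
B = {"Smoking": ">20 cig/day"}

sup_ab = support(rows, {**A, **B})  # one tuple out of five matches A and B
conf_ab = confidence(rows, A, B)    # one of the two heavy beer drinkers
```

Representing a set of conditions as a dictionary keeps the antecedent and the consequent over distinct attributes, as the definition requires.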
In order to find interesting relationships, we consider that support and
confidence measures should be used along with other statistical indices that are
able to capture the type of dependence between the antecedent and the consequent
of the rules. We consider that a rule is interesting if it holds with a support
value greater than its expected support value. This expected support is computed
based on the support of the conditions that compose the rule:
$$\mathrm{ExpSup}(A \Rightarrow B) = \mathrm{ExpSup}(A \cup B) = \mathrm{Sup}(A) \times \mathrm{Sup}(B). \tag{1}$$
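The lift index [2] (also known as interest) can be used to evaluate
dependencies. Given an association rule $A \Rightarrow B$, this measure computes
how much more frequent $B$ is when $A$ occurs:
$$\mathrm{Lift}(A \Rightarrow B) = \frac{\mathrm{Conf}(A \Rightarrow B)}{\mathrm{Sup}(B)} = \frac{\mathrm{Sup}(A \cup B)}{\mathrm{Sup}(A) \times \mathrm{Sup}(B)} = \frac{\mathrm{Sup}(A \cup B)}{\mathrm{ExpSup}(A \cup B)}. \tag{2}$$
If $\mathrm{Lift}(A \Rightarrow B) = 1$, then $A$ and $B$ are independent. If
$\mathrm{Lift}(A \Rightarrow B) > 1$, then $A$ and $B$ are positively dependent.
Otherwise, $A$ and $B$ are negatively dependent.
The rule interest index (RI) [5] computes the fraction of tuples matched by an
association rule in excess of the expected fraction:
$$\mathrm{RI}(A \Rightarrow B) = \mathrm{Sup}(A \Rightarrow B) - \mathrm{ExpSup}(A \Rightarrow B). \tag{3}$$
If $\mathrm{RI}(A \Rightarrow B) = 0$, we say that $A$ and $B$ are independent. If
$\mathrm{RI}(A \Rightarrow B) > 0$, then $A$ and $B$ are positively dependent.
Otherwise, $A$ and $B$ are negatively dependent.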
Returning to the example shown in Table 1, the lift and RI values for R1
are given by 0.3758 ÷ 0.2602 = 1.44 and 0.0448 − (0.1193 × 0.2602) = 0.014,
respectively. Therefore, R1 is an interesting association rule. The lift and RI
values for R2 are given by 0.7584 ÷ 0.8487 = 0.89 and 0.0905 − (0.1193 × 0.8487)
= −0.011, respectively. Both indices reveal a negative dependence, so R2 is,
indeed, an uninteresting association rule.
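The arithmetic of this example can be reproduced with a short Python sketch. The function names are illustrative choices made here, and the inputs are the SupA, SupB and Sup values listed for R1 and R2 in Table 1.

```python
def expected_support(sup_a: float, sup_b: float) -> float:
    """Expected support of A => B if A and B were independent (Eq. 1)."""
    return sup_a * sup_b


def lift(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """Ratio between the actual and the expected support (Eq. 2)."""
    return sup_ab / expected_support(sup_a, sup_b)


def rule_interest(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """Difference between the actual and the expected support (Eq. 3)."""
    return sup_ab - expected_support(sup_a, sup_b)


# (SupA, SupB, Sup) taken from Table 1.
r1 = (0.1193, 0.2602, 0.0448)
r2 = (0.1193, 0.8487, 0.0905)

lift_r1, ri_r1 = lift(*r1), rule_interest(*r1)  # lift above 1, RI above 0
lift_r2, ri_r2 = lift(*r2), rule_interest(*r2)  # lift below 1, RI below 0
```

Both measures agree on the sign of the dependence; they differ in scale, since lift is a ratio and RI is an absolute difference in support.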
We believe that the use of different interest measures provides alternative
analyses of the same data, giving a better understanding of the associations.
Section 5 presents strong rules mined from the Entry table.
3 Mining Exceptions in the STULONG Data Set
In this section we present the adopted approach to mine exceptions in the
STULONG data set. The following example motivates our approach. Consider, again,
the rule R1: (DailyBeerCons = “>1l”) ⇒ (Smoking = “>20 cig/day”).
Suppose we are interested in discovering whether this rule becomes weaker in some
sub-population of men stratified by the attribute Group. A strategy to
mine exceptions would then be able to find the rule:
R3: (DailyBeerCons = “>1l”) ∧ (Group = “A”) ⇏ (Smoking = “>20 cig/day”)
This negative pattern indicates that, among the men who belong to group
A, the support value of the association between being a heavy beer consumer
and being a heavy smoker is surprisingly smaller than expected. This
situation evidences an exception associated with a previously mined rule. The
exception was obtained because the association (DailyBeerCons = “>1l”) ∧
(Group = “A”) ⇒ (Smoking = “>20 cig/day”) did not achieve its expected
support. The expected support is evaluated from the support of the original rule
R1 and the support of the condition (Group = “A”).
Let $D$ be a relational database. Let $R: A \Rightarrow B$ be an association rule
extracted from $D$. Let $Z = \{Z_1 = z_1, \ldots, Z_k = z_k\}$ be a set of
conditions defined over attributes from $D$, where
$\{Z_1 = z_1, \ldots, Z_k = z_k\} \cap \{A_1 = a_1, \ldots, A_n = a_n, B_1 = b_1, \ldots, B_m = b_m\} = \emptyset$.
$Z$ is named the probe set. An exception related to the positive rule $R$ is an
implication of the form $A \wedge Z \not\Rightarrow B$.
Exceptions are extracted only if they do not achieve an expected support.
This expectation is evaluated based on the support of the original rule
$A \Rightarrow B$ and the support of the conditions in the probe set $Z$. The
expected support for $A \wedge Z \Rightarrow B$ can be computed as:
$$\mathrm{ExpSup}(A \wedge Z \Rightarrow B) = \mathrm{Sup}(A \Rightarrow B) \times \mathrm{Sup}(Z). \tag{4}$$
To evaluate whether an exception is interesting, we use two interest measures
based on the lift measure. The first one, called IM (interest measure),
considers that an exception $E: A \wedge Z \not\Rightarrow B$ is potentially
interesting if the actual support value for the rule $A \wedge Z \Rightarrow B$ is
much lower than its expected support value:
$$\mathrm{IM}(E) = 1 - \frac{\mathrm{Sup}(A \wedge Z \Rightarrow B)}{\mathrm{ExpSup}(A \wedge Z \Rightarrow B)}. \tag{5}$$
This measure captures the type of dependence between $Z$ and the conditions
that form $A \Rightarrow B$. It grows as the actual support value falls further
below the expected support value, indicating a negative dependence.
The closer the value is to 1 (the highest value for this measure),
the stronger the negative dependence. Consider the example presented at the
beginning of this section. The rule C1: (DailyBeerCons = “>1l”) ∧ (Group =
“A”) ⇒ (Smoking = “>20 cig/day”) was generated by combining the rule R1
with the probe set Z = {(Group = “A”)}. The support of Z in the Entry table is
22.10%. The support of the rule R1 is 4.48% (as shown in Table 1). The expected
support for C1 can be computed as 22.10% × 4.48% = 0.99%. The actual support
of C1 in the Entry table is equal to 0.08%. We say that the exception E1:
(DailyBeerCons = “>1l”) ∧ (Group = “A”) ⇏ (Smoking = “>20 cig/day”)
is potentially interesting because IM(E1) = 1 − (0.08 ÷ 0.99) = 0.92.
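As a quick check of this computation, the expected support of C1 and the IM value of E1 can be recomputed from the three supports quoted in the text. The sketch below is illustrative only; the function and variable names are assumptions made for the example.

```python
def expected_support_with_probe(sup_rule: float, sup_probe: float) -> float:
    """Expected support of A ^ Z => B: Sup(A => B) x Sup(Z) (Eq. 4)."""
    return sup_rule * sup_probe


def interest_measure(actual_sup: float, expected_sup: float) -> float:
    """IM: how far the actual support falls below the expected one (Eq. 5)."""
    return 1.0 - actual_sup / expected_sup


sup_r1 = 0.0448  # support of R1 (Table 1)
sup_z = 0.2210   # support of the probe set (Group = "A")
sup_c1 = 0.0008  # actual support of C1 in the Entry table

exp_c1 = expected_support_with_probe(sup_r1, sup_z)  # about 0.0099
im_e1 = interest_measure(sup_c1, exp_c1)             # about 0.92
```

An IM value close to 1 means that almost none of the expected tuples were actually observed.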
However, observe that the IM measure does not take into consideration the
type of dependence between $Z$ and $A$, and between $Z$ and $B$. The measure DU
(degree of unexpectedness) addresses this issue:
$$\mathrm{DU}(E) = \mathrm{IM}(E) - \max\left(1 - \frac{\mathrm{Sup}(A \cup Z)}{\mathrm{ExpSup}(A \cup Z)},\; 1 - \frac{\mathrm{Sup}(B \cup Z)}{\mathrm{ExpSup}(B \cup Z)}\right). \tag{6}$$
This measure captures how much higher the negative dependence between a probe
set $Z$ and a rule $A \Rightarrow B$ is than the negative dependence between $Z$
and either $A$ or $B$. The further the value is above 0, the more interesting the
exception will be. If $\mathrm{DU}(E) \le 0$, the exception is uninteresting.
Returning to the previous example, the support of the condition
A = {(DailyBeerCons = “>1l”)} is 11.93%. The support value of A ∪ Z is 1.52%.
The negative dependence between A and Z can be computed as
1 − (1.52% ÷ (11.93% × 22.10%)) = 0.42.
The support of the condition B = {(Smoking = “>20 cig/day”)} is 26.02%.
The actual support value of B ∪ Z is 1.52%. The negative dependence
between B and Z can be computed as 1 − (1.52% ÷ (26.02% × 22.10%)) =
0.73. The exception E1: (DailyBeerCons = “>1l”) ∧ (Group = “A”) ⇏
(Smoking = “>20 cig/day”) is, in fact, interesting because DU(E1) = 0.92 −
max(0.42, 0.73) = 0.19.
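The DU computation for E1 can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the function names are hypothetical, and the inputs are the rounded percentages quoted in the text, so the result matches the reported value only up to rounding.

```python
def deviation(actual_sup: float, expected_sup: float) -> float:
    """1 - actual/expected: positive when the actual support is below expectation."""
    return 1.0 - actual_sup / expected_sup


def degree_of_unexpectedness(sup_a: float, sup_b: float, sup_z: float,
                             sup_ab: float, sup_az: float, sup_bz: float,
                             sup_azb: float) -> float:
    """DU(E) = IM(E) - max(dependence of A and Z, dependence of B and Z) (Eq. 6)."""
    im = deviation(sup_azb, sup_ab * sup_z)    # IM(E), Eq. 5
    dep_az = deviation(sup_az, sup_a * sup_z)  # negative dependence between A and Z
    dep_bz = deviation(sup_bz, sup_b * sup_z)  # negative dependence between B and Z
    return im - max(dep_az, dep_bz)


# Supports quoted in the text for the exception E1.
du_e1 = degree_of_unexpectedness(
    sup_a=0.1193,    # DailyBeerCons = ">1l"
    sup_b=0.2602,    # Smoking = ">20 cig/day"
    sup_z=0.2210,    # Group = "A"
    sup_ab=0.0448,   # rule R1
    sup_az=0.0152,   # A and Z together
    sup_bz=0.0152,   # B and Z together
    sup_azb=0.0008,  # rule C1
)
```

Subtracting the larger of the two pairwise dependences ensures that an exception is not reported merely because the probe set is already rare among tuples matching A or B alone.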
The adopted approach to mine exceptions was motivated by the concept of
negative association rules presented in [6], where a negative pattern represents a
large deviation between the expected support and the actual support of a rule.
In [7] a proposal for representing and extracting different categories of exceptions
can be found. However, we adopted an alternative approach, which allowed us to
characterize an exception as a rule that, unexpectedly, becomes much weaker in
some specific subsets of the database. We consider that exceptions are interesting
if they hold with high IM and DU values. Exceptions mined from the Entry
table are presented in Sect. 5.
4 Data Preparation
Some data transformations were necessary before the mining process. All field
names and values were translated into English. We enriched the data with
new fields derived from the original ones, such as the field Age, which was
derived from rokvstup (year of the examination) and roknar (year of birth).
Numeric fields, such as Cholesterol, were discretized into ranges. Table 2 shows
a summary of the data preparation process. Additional explanation is needed
for the fields Skin Folds and Blood Pressure. The Skin Folds field is the
sum of the fields tric and subsc. To generate the Blood Pressure field, we first
picked the minimum value between the fields syst1 and syst2, denoted as s. Then
we picked the minimum value between the fields diast1 and diast2, denoted as
d. If s ≤ 129 and d ≤ 84, the blood pressure was categorized as “normal”. If
s > 139 or d > 89, the category was denoted as “high”. Otherwise, the category
was denoted as “normal/high”.
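The derivation of the Blood Pressure field can be written as a small Python function. This is a sketch of the rule as described above, not the authors' code; the function name is hypothetical and the parameter names follow the field names used in Table 2.

```python
def blood_pressure_category(syst1: float, syst2: float,
                            diast1: float, diast2: float) -> str:
    """Categorize blood pressure from two systolic and two diastolic readings."""
    s = min(syst1, syst2)    # lower of the two systolic readings
    d = min(diast1, diast2)  # lower of the two diastolic readings
    if s <= 129 and d <= 84:
        return "normal"
    if s > 139 or d > 89:
        return "high"
    return "normal/high"


examples = [
    blood_pressure_category(120, 125, 80, 82),  # both minima in the normal range
    blood_pressure_category(135, 138, 80, 82),  # systolic minimum between 130 and 139
    blood_pressure_category(128, 150, 90, 95),  # diastolic minimum above 89
]
```

Taking the minimum of the two readings makes the classification conservative: a patient is labelled "high" only if both readings of the same kind exceed the limit.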
We developed two programs in C++, which were compiled with the g++
compiler. The first one is an implementation based on the classical Apriori
algorithm [1], which is used to mine strong associations. The second program is
an implementation of the adopted approach to mine exceptions. Both require the
data set in the ARFF format, specified in [8]. We generated a relation in the
ARFF format that we named ENTRY_TOT. This relation contains 1249 tuples,
corresponding to the men classified into the groups A, B, and C. We excluded
from this table the men who, originally, were not allocated to any of these
groups (attribute konskup = 6). From the ENTRY_TOT relation, we also generated
three separate relations with the same attributes. We denoted these three
derived relations as Entry_A (276 tuples, containing only patients from group
A), Entry_B (859 tuples, patients from group B), and Entry_C (114 tuples,
patients from group C). We mined rules in these three tables to compare the
associations between the characteristics of men in the respective groups.
Table 2. Data transformation
Field           Original Field / Derivation   Possible Values
Group           konskup                       “A”, “B”, “C”.
Cholesterol     chlst                         “desirable” (chlst < 200),
                                              “bordering” (200 ≤ chlst < 240),
                                              “high” (chlst ≥ 240).
Triglycerides   trigl                         “desirable” (trigl < 150),
                                              “bordering” (150 ≤ trigl ≤ 200),
                                              “high” (201 ≤ trigl ≤ 499),
                                              “very high” (trigl ≥ 500).
Age             (rokvstup − roknar)           “38-39”, “40-44”, “45-49”, “≥50”.
BMI             (vaha) ÷ (vyska²)             “underweight” (BMI < 20),
                                              “normal” (20 ≤ BMI < 25),
                                              “overweight” (25 ≤ BMI < 30),
                                              “obese” (30 ≤ BMI < 40),
                                              “morbidly obese” (BMI ≥ 40).
Blood Pressure  min(syst1, syst2),            “normal”, “normal/high”, “high”.
                min(diast1, diast2)
Skin Folds      (tric + subsc)                “8-20”, “21-30”, “31-40”, “>40”.
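The discretization in Table 2 amounts to mapping each numeric value to a labelled range. The Python sketch below illustrates this for Cholesterol and BMI. It rests on two assumptions that are not stated in the table: that vaha is recorded in kilograms and vyska in centimetres (hence the conversion to metres), and that the “underweight” range is BMI < 20. The function names are hypothetical.

```python
def cholesterol_category(chlst: float) -> str:
    """Map a total cholesterol value to the ranges of Table 2."""
    if chlst < 200:
        return "desirable"
    if chlst < 240:
        return "bordering"
    return "high"


def bmi_category(vaha_kg: float, vyska_cm: float) -> str:
    """Compute BMI = weight / height^2 and map it to the ranges of Table 2.

    Assumption: weight in kilograms and height in centimetres.
    """
    height_m = vyska_cm / 100.0
    bmi = vaha_kg / (height_m ** 2)
    if bmi < 20:  # assumed lower boundary for "normal"
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    return "morbidly obese"


chol_labels = [cholesterol_category(v) for v in (199, 200, 240)]
bmi_labels = [bmi_category(80, 180), bmi_category(100, 180)]
```

Checking the boundaries in increasing order keeps the ranges mutually exclusive, so every value receives exactly one label.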
5 Results
At first, the relation ENTRY_TOT was mined for interesting associations
regarding the basic groups, with the minimum support threshold set to 1%.
Hundreds of associations were obtained. We selected some of them to illustrate
what we learned from the ENTRY_TOT table. These selected results are shown in
Table 3 along with different interest measures. From the rule R4 we were able to
observe that there is a strong correlation (R4.lift = 1.430) between belonging
to the Normal Group (group A) and having reached university education. In fact,
group A is the only one that is predominantly composed of men with a university
degree (R4.conf = 39.49%). In contrast, the Pathologic Group (group C) is
predominantly formed by men with apprentice school education (R6.conf
= 35.09%). R7 shows a strong correlation (R7.lift = 1.692) between belonging
to group A and practicing physical activities intensely in free time. R8
indicates that the percentage of heavy beer consumers is slightly greater than
expected in the Risk Group (group B, R8.RI = 0.0116). Finally, from the rule
R9, we could discover a strong positive dependence between being 50 years old
or above and belonging to group C (R9.lift = 1.768).
The results shown in Tables 4, 5, and 6 present rules regarding alcohol
consumption, regarding social factors, and relating skin folds and BMI,
respectively. These rules were mined from tables Entry_A, Entry_B, and Entry_C.
The objective is to observe the differences of the interest measures in the
respective groups. The third column (G) specifies the mined table (Entry_A,
Entry_B, or Entry_C). Columns 4 to 9 show the values of the interest measures.
The minimum support threshold was set to 1%. Due to lack of space, we do not
comment on all results.
Table 3. Association rules - [ENTRY_TOT]
Id   Association Rule                    SupA    SupB    Sup     Conf    Lift   RI
R4   (Group = “A”) ⇒                     0.2210  0.2762  0.0873  0.3949  1.430  0.0262
     (Education = “university”)
R5   (Group = “B”) ⇒                     0.6877  0.2866  0.2090  0.3038  1.060  0.0118
     (Education = “apprentice sch.”)
R6   (Group = “C”) ⇒                     0.0913  0.2866  0.0320  0.3509  1.224  0.0058
     (Education = “apprentice sch.”)
R7   (Group = “A”) ⇒                     0.2210  0.0857  0.0320  0.1449  1.692  0.0131
     (PhysActAfterJob = “great activity”)
R8   (Group = “B”) ⇒                     0.6877  0.1193  0.0937  0.1362  1.142  0.0116
     (DailyBeerCons = “>1l”)
R9   (Group = “C”) ⇒ (Age = “≥50”)       0.0913  0.2282  0.0368  0.4035  1.768  0.0160
Table 4 shows strong associations regarding alcohol consumption in the
respective basic groups. Rules R10 to R13 show that both heavy beer consumers
and heavy liquor consumers tend to smoke more, independently of the group
(see the lift and RI values of these rules). However, it is important to observe
that there are far fewer smokers in group A (observe the SupB column). It is
also noticeable that men from group B tend to smoke and drink more (observe the
SupA, SupB, and Sup columns). The rule R10 has a support value below 1%
(the minimum value) in the Entry_A table. Therefore, it could not be extracted
from this table. Rule R14 indicates that men who do not drink alcohol are more
likely to have a BMI in the normal range in the three groups. Rule R17 shows
that drinking wine moderately and having normal blood pressure are positively
dependent in groups A and B and negatively dependent in group C (observe the
lift and RI values). Rules R18, R19, R21, and R22 indicate positive correlations
found in the three groups. Rule R20 indicates that patients who are heavy liquor
consumers are more likely to have a high level of total cholesterol in groups B
and C, but not in group A (observe the lift and RI values).
Table 5 shows associations regarding social factors. We found that people with
a higher educational level tend to smoke and drink less, independently of the
group (R23 and R26). On the other hand, people with a lower educational level
tend to smoke more (R24) and are more likely to be heavy beer consumers (R27).
The percentage of men who drink alcohol occasionally is almost the same in the
three groups (see the SupB column, R26). Rule R25 evidences that there is a
strong positive dependence between being an ex-smoker and being 50 years old or
above, especially in group C (R25.RI = 0.0326). Rules R29 and R30 examine the
correlation between the education and the BMI of men. Rules R34 and R35 indicate
that blood pressure is dependent on the age of the patient, independently of the
group.
Table 6 shows the relations between skin folds and BMI in the individual basic
groups. Note that some rules in group A could not be extracted, because
there are no obese men in this group (since obesity is a risk factor).
Exceptions mined from the ENTRY_TOT table are shown in Table 7. Let
us explain the intuitive meaning of exceptions using the rule R42. One of the
strongest correlations in the database is that patients whose education
level is “apprentice school” tend to smoke a lot (15-20 cig/day). Rule R24 in
Table 5 shows that this rule is valid for the three groups of patients. We use
the exception illustrated in rule R42 to indicate that the presence of the
condition (PhysActAfterJob = “great activity”) reduces the probability of
occurrence of the rule (Education = “apprentice sch.”) ⇒ (Smoking = “15-20
cig/day”). The exception illustrated in rule R42 can be interpreted as “among
the men who practice physical activities intensely in free time, the rule
(Education = “apprentice sch.”) ⇒ (Smoking = “15-20 cig/day”) is much weaker”.
The value in the IM column indicates that the actual support of the rule
(Education = “apprentice sch.”) ∧ (PhysActAfterJob = “great activity”) ⇒
(Smoking = “15-20 cig/day”) is 47.55% below the expected support. The value
shown in the DU column indicates the strength (degree of unexpectedness) of the
exception. The same interpretation can be given to the remainder of the rules
presented in Table 7. In each exception, the probe set is the last condition of
the antecedent. We used the following thresholds in the experiments: minimum
IM = 0.30 and minimum DU = 0.05.
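The role of the two thresholds can be illustrated with a short Python sketch that filters candidate exceptions by their IM and DU values. The (IM, DU) pairs for R42 to R50 are copied from Table 7; the entry labelled "hypothetical" is an invented candidate added only to show a rejection, and the variable names are assumptions made for the example.

```python
MIN_IM = 0.30  # minimum interest measure used in the experiments
MIN_DU = 0.05  # minimum degree of unexpectedness used in the experiments

# (IM, DU) pairs from Table 7, plus one invented candidate.
candidates = {
    "R42": (0.4755, 0.2069),
    "R43": (0.5035, 0.1689),
    "R44": (0.8011, 0.1515),
    "R45": (0.3586, 0.1054),
    "R46": (0.9192, 0.1837),
    "R47": (0.5304, 0.2358),
    "R48": (0.7018, 0.3052),
    "R49": (0.4017, 0.1770),
    "R50": (0.3442, 0.0577),
    "hypothetical": (0.2500, 0.1000),  # not from the paper: IM below the minimum
}

# Keep only the exceptions that reach both thresholds.
kept = [rule_id for rule_id, (im, du) in candidates.items()
        if im >= MIN_IM and du >= MIN_DU]
```

Requiring both thresholds means that an exception must show a large drop in support (IM) that is not explained by the pairwise dependences with the probe set (DU).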
6 Conclusions
In this work we presented 50 strong association rules and exceptions mined from
the STULONG data set, concerning the entry examinations. Strong association
rules were used to analyze how the correlations among the characteristics of
the patients differ across the three basic groups. Exceptions were used
to illustrate negative patterns associated with previously known strong positive
rules. As future work, we intend to apply the same approach to the
Control, Letter, and Death tables.
References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In 20th
VLDB Intl. Conf. (1994).
2. Brin, S., Motwani, R., Ullman, J. D., Tsur, S.: Dynamic Itemset Counting and
Implication Rules for Market Basket Data. In ACM SIGMOD Intl. Conf. (1997).
3. ECML/PKDD2004 Discovery Challenge homepage
[http://lisp.vse.cz/challenge/ecmlpkdd2004/] (2004).
4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. 2nd edn. Morgan
Kaufmann (2001).
5. Piatetsky-Shapiro, G.: Discovery, Analysis and Presentation of Strong Rules.
Knowledge Discovery in Databases. AAAI/MIT Press. (1991).
6. Savasere, A., Omiecinski, E., Navathe, S.: Mining for Strong Negative Associations
in a Large Database of Customer Transactions. In 14th ICDE Intl. Conf. (1998).
7. Suzuki, E., Zytkow, J. M.: Unified Algorithm for Undirected Discovery of Exception
Rules. In 4th PKDD Intl. Conf. (2000).
8. Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann (2000).
Table 4. Association rules in the basic groups - [alcohol consumption]
Id   Association Rule                    G  SupA    SupB    Sup     Conf    Lift   RI
R10  (DailyBeerCons = “>1l”) ⇒           A  -       -       -       -       -      -
     (Smoking = “>20 cig/day”)           B  0.1362  0.3190  0.0559  0.4103  1.286  0.0124
                                         C  0.1140  0.2807  0.0614  0.5385  1.918  0.0294
R11  (DailyBeerCons = “>1l”) ⇒           A  0.0688  0.1667  0.0145  0.2105  1.263  0.0030
     (SmokingDuration = “>20 years”)     B  0.1362  0.5751  0.0908  0.6667  1.159  0.0125
                                         C  0.1140  0.4737  0.0789  0.6923  1.461  0.0249
R12  (DailyLiquorCons = “>100cc”) ⇒      A  0.0471  0.1667  0.0181  0.3846  2.308  0.0103
     (SmokingDuration = “>20 years”)     B  0.0652  0.5751  0.0419  0.6429  1.118  0.0044
                                         C  0.0351  0.4737  0.0263  0.7500  1.583  0.0097
R13  (Liquor = “no”) ⇒                   A  0.5507  0.5109  0.3043  0.5526  1.082  0.0230
     (Smoking = “no”)                    B  0.5204  0.1793  0.0990  0.1902  1.061  0.0057
                                         C  0.5439  0.2018  0.1316  0.2419  1.199  0.0218
R14  (Alcohol = “no”) ⇒                  A  0.0870  0.5326  0.0543  0.6250  1.173  0.0080
     (BMI = “normal”)                    B  0.0861  0.3586  0.0384  0.4459  1.244  0.0075
                                         C  0.1316  0.2632  0.0526  0.4000  1.520  0.0180
R15  (DailyBeerCons = “≤1l”) ⇒           A  0.5616  0.4601  0.2609  0.4645  1.001  0.0024
     (BMI = “overweight”)                B  0.5576  0.5157  0.3108  0.5574  1.081  0.0232
                                         C  0.4211  0.5439  0.2456  0.5833  1.073  0.0166
R16  (DailyBeerCons = “>1l”) ⇒           A  -       -       -       -       -      -
     (BMI = “obese”)                     B  0.1362  0.1071  0.0210  0.1538  1.436  0.0064
                                         C  0.1140  0.1667  0.0351  0.3077  1.846  0.0161
R17  (DailyWineCons = “≤500ml”) ⇒        A  0.5181  0.5761  0.3188  0.6154  1.068  0.0204
     (Blood Pressure = “normal”)         B  0.4936  0.3958  0.2200  0.4458  1.126  0.0246
                                         C  0.4386  0.3333  0.1404  0.3200  0.960  -0.0058
R18  (DailyBeerCons = “>1l”) ⇒           A  0.0688  0.1812  0.0145  0.2105  1.162  0.1186
     (Blood Pressure = “high”)           B  0.1362  0.4342  0.0710  0.5214  1.201  0.0119
                                         C  0.1140  0.5263  0.0702  0.6154  1.169  0.0101
R19  (Alcohol = “no”) ⇒                  A  0.0870  0.3370  0.0507  0.5833  1.731  0.0214
     (Cholesterol = “desirable”)         B  0.0861  0.1828  0.0186  0.2162  1.183  0.0029
                                         C  0.1316  0.1316  0.0263  0.2000  1.520  0.0090
R20  (DailyLiquorCons = “>100cc”) ⇒      A  0.0471  0.2065  0.0072  0.1538  0.745  -0.0025
     (Cholesterol = “high”)              B  0.0652  0.4854  0.0361  0.5536  1.140  0.0044
                                         C  0.0351  0.5175  0.0263  0.7500  1.449  0.0082
R21  (DailyBeerCons = “>1l”) ⇒           A  0.0688  0.1159  0.0109  0.1579  1.362  0.0029
     (Triglycerides = “high”)            B  0.1362  0.1839  0.0338  0.2479  1.348  0.0087
                                         C  0.1140  0.2193  0.0351  0.3077  1.403  0.0101
R22  (Alcohol = “occasionally”) ⇒        A  0.5217  0.1812  0.1123  0.2153  1.188  0.0178
     (Triglycerides = “bordering”)       B  0.5378  0.2002  0.1106  0.2056  1.027  0.0029
                                         C  0.5175  0.1491  0.0789  0.1525  1.023  0.0018
Table 5. Association rules in the basic groups - [social factors]
Id   Association Rule                    G  SupA    SupB    Sup     Conf    Lift   RI
R23  (Education = “university”) ⇒        A  0.3949  0.5109  0.2210  0.5596  1.095  0.0193
     (Smoking = “no”)                    B  0.2526  0.1793  0.0664  0.2627  1.465  0.0211
                                         C  0.1667  0.2018  0.0877  0.5263  2.608  0.0541
R24  (Education = “apprentice sch.”) ⇒   A  0.2065  0.0580  0.0217  0.1053  1.816  0.0098
     (Smoking = “15-20 cig/day”)         B  0.3038  0.3655  0.1211  0.3985  1.090  0.0100
                                         C  0.3509  0.2719  0.1140  0.3250  1.119  0.0186
R25  (Age = “≥50”) ⇒                     A  0.2246  0.2138  0.0543  0.2419  1.132  0.0063
     (Ex-Smoker = “yes, >1 year”)        B  0.2061  0.0920  0.0268  0.1299  1.413  0.0078
                                         C  0.4035  0.2018  0.1140  0.2826  1.401  0.0326
R26  (Education = “university”) ⇒        A  0.3949  0.5217  0.2319  0.5872  1.125  0.0258
     (Alcohol = “occasionally”)          B  0.2526  0.5378  0.1583  0.6267  1.165  0.0225
                                         C  0.1667  0.5175  0.1053  0.6316  1.220  0.0190
R27  (Education = “basic school”) ⇒      A  0.0580  0.0688  0.0109  0.1875  2.724  0.0689
     (DailyBeerCons = “>1l”)             B  0.1234  0.1362  0.0384  0.3113  2.286  0.0216
                                         C  0.1316  0.1140  0.0263  0.2000  1.754  0.0113
R28  (JobRespons. = “managerial”) ⇒      A  0.1920  0.4493  0.1014  0.5283  1.176  0.0152
     (Liquor = “yes”)                    B  0.2177  0.4796  0.1199  0.5508  1.148  0.0155
                                         C  0.1667  0.4561  0.0877  0.5263  1.154  0.0117
R29  (Education = “apprentice sch.”) ⇒   A  -       -       -       -       -      -
     (BMI = “obese”)                     B  0.3038  0.1071  0.0501  0.1648  1.538  0.0175
                                         C  0.3509  0.1667  0.0789  0.2250  1.350  0.0205
R30  (Education = “university”) ⇒        A  0.3949  0.5326  0.2101  0.5321  0.999  -0.0002
     (BMI = “normal”)                    B  0.2526  0.3586  0.0990  0.3917  1.092  0.0084
                                         C  -       -       -       -       -      -
R31  (Education = “university”) ⇒        A  0.3949  0.5797  0.3116  0.7890  1.361  0.0826
     (PhysActInJob = “mainly sits”)      B  0.2526  0.5122  0.2084  0.8249  1.610  0.0790
                                         C  0.1667  0.4211  0.0877  0.5263  1.250  0.0175
R32  (Education = “university”) ⇒        A  0.3949  0.6957  0.2826  0.7156  1.028  0.0078
     (PhysActAfterJob = “moderate”)      B  0.2526  0.7276  0.1863  0.7373  1.013  0.0025
                                         C  0.1667  0.7456  0.1053  0.6316  0.847  -0.0190
R33  (Education = “apprentice sch.”) ⇒   A  0.2065  0.1594  0.0399  0.1930  1.210  0.0069
     (PhysActAfterJob = “mainly sits”)   B  0.3038  0.1956  0.0722  0.2375  1.215  0.0127
                                         C  0.3509  0.2193  0.0702  0.2000  0.912  -0.0068
R34  (Age = “40-44”) ⇒                   A  0.3297  0.5761  0.1920  0.5824  1.011  0.0021
     (Blood Pressure = “normal”)         B  0.3015  0.3958  0.1292  0.4286  1.083  0.0099
                                         C  0.1667  0.3333  0.0965  0.5789  1.737  0.0409
R35  (Age = “≥50”) ⇒                     A  0.2246  0.1812  0.0435  0.1935  1.068  0.0028
     (Blood Pressure = “high”)           B  0.2061  0.4342  0.1048  0.5085  1.171  0.0153
                                         C  0.4035  0.5263  0.2281  0.5652  1.074  0.0157
Table 6. Association rules in the basic groups - [skin folds] x [BMI]
Id   Association Rule                    G  SupA    SupB    Sup     Conf    Lift   RI
R36  (Skin Folds = “≤20”) ⇒              A  0.2319  0.5326  0.1558  0.6719  1.261  0.0323
     (BMI = “normal”)                    B  0.2154  0.3586  0.1478  0.6865  1.914  0.0706
                                         C  0.1140  0.2632  0.0789  0.6923  2.631  0.0489
R37  (Skin Folds = “21-30”) ⇒            A  0.4565  0.4601  0.2355  0.5159  1.121  0.0254
     (BMI = “overweight”)                B  0.3632  0.5157  0.2095  0.5769  1.119  0.0222
                                         C  0.3421  0.5439  0.2368  0.6923  1.273  0.0501
R38  (Skin Folds = “31-40”) ⇒            A  0.1159  0.4601  0.0471  0.4063  0.883  -0.0062
     (BMI = “overweight”)                B  0.2305  0.5157  0.1362  0.5909  1.146  0.0173
                                         C  0.1842  0.5439  0.1140  0.6190  1.138  0.0138
R39  (Skin Folds = “31-40”) ⇒            A  -       -       -       -       -      -
     (BMI = “obese”)                     B  0.2305  0.1071  0.0442  0.1919  1.792  0.0195
                                         C  0.1842  0.1667  0.0351  0.1905  1.143  0.0044
R40  (Skin Folds = “>40”) ⇒              A  0.0507  0.4601  0.0362  0.7143  1.552  0.0130
     (BMI = “overweight”)                B  0.1362  0.5157  0.0827  0.6068  1.177  0.0124
                                         C  0.1667  0.5439  0.0702  0.4211  0.774  -0.0205
R41  (Skin Folds = “>40”) ⇒              A  -       -       -       -       -      -
     (BMI = “obese”)                     B  0.1362  0.1071  0.0349  0.2564  2.394  0.0203
                                         C  0.1667  0.1667  0.0877  0.5263  3.159  0.0599
Table 7. Exceptions
Id   Exception                                     IM      DU
R42  (Education = “apprentice sch.”) ∧             0.4755  0.2069
     (PhysActAfterJob = “great activity”) ⇏
     (Smoking = “15-20 cig/day”)
R43  (Education = “apprentice sch.”) ∧             0.5035  0.1689
     (DailyBeerCons = “does not drink beer”) ⇏
     (Smoking = “15-20 cig/day”)
R44  (DailyBeerCons = “>1l”) ∧ (Group = “A”) ⇏     0.8011  0.1515
     (SmokingDuration = “>20 years”)
R45  (DailyBeerCons = “>1l”) ∧                     0.3586  0.1054
     (PhysActAfterJob = “great activity”) ⇏
     (SmokingDuration = “>20 years”)
R46  (DailyBeerCons = “>1l”) ∧ (Group = “A”) ⇏     0.9192  0.1837
     (Smoking = “>20 cig/day”)
R47  (DailyBeerCons = “>1l”) ∧ (Age = “≥50”) ⇏     0.5304  0.2358
     (Smoking = “>20 cig/day”)
R48  (Education = “university”) ∧ (Group = “C”) ⇏  0.7018  0.3052
     (BMI = “normal”)
R49  (DailyWineCons = “≤500ml”) ∧ (Group = “C”) ⇏  0.4017  0.1770
     (Blood Pressure = “normal”)
R50  (Age = “≥50”) ∧ (Group = “B”) ⇏               0.3442  0.0577
     (Ex-Smoker = “yes, >1 year”)
... However, it is also possible to mine association rules in data repositories other than transactional databases, such as data warehouses and relational databases. In this case, the association rules can be composed of diverse categorical and numeric attributes, being referred to as multidimensional association rules [5], [6], [7]. To demonstrate this concept, suppose a relational table that stores demographic data and other statistics of cities in a given country. ...
... In order to avoid the generation of uninteresting rules, a hybrid-dimensional rule should be mined only if the change in the strength has been significant. Our approach corresponds to an extension of the technique for mining multidimensional exception rules introduced in [4], [5]. ...
... Nevertheless, over the last two decades, the data mining literature [4], [5], [6], [8], [9], [10], [12], [15], [16] has shown that there is a major drawback associated with the support/confidence framework: it often leads to the generation of a huge number of association rules, many of which are obvious and irrelevant, making it difficult for users to identify the rules that are indeed interesting to them. In order to cope with this problem, some proposals (such as [4], [5], [8], [10], [15]) suggest modifying the support/confidence framework by allowing users to guide the mining process towards finding only unexpected rules, instead of enumerating all possible association rules. ...
Conference Paper
This work presents a new method to mine hybrid-dimensional association rules in databases originated from multiple sources. We adopted an approach in which hybrid associations represent transactional rules that become either exceptionally weaker or exceptionally stronger in some subsets of an integrated database, which satisfy specific conditions over selected attributes. We propose new interest measures to evaluate hybrid-dimensional rules, as well as an algorithm to mine these patterns. This algorithm was applied to a real dataset that keeps information about purchases made by families residing in different Brazilian cities. The obtained results show that the proposed technique provides valuable information for decision making.
... Negative exceptions (upper table) and positive exceptions (lower table) are presented separately. The results of this subsection can also be found in [16], along with a series of strong association rules mined from the atherosclerosis database. Chapter 5 ...
... As a result, several data mining techniques were applied to the database, and the experts of the STULONG project were able to obtain new knowledge. In [16], the exception mining technique proposed in this dissertation was used to obtain a series of exceptions from the table that contains the data on the initial examinations. In the same work, one can find strong association rules, which were mined by applying the lift and RI interest measures discussed in Chapter 2. 4.2.4 Washington Census Database. This database, also known as the adult database or census income data set, is available in [5]. ...
Thesis
Multidimensional association rules represent an important type of knowledge that can be mined from relational databases or data warehouses. These rules describe combinations of attribute values that frequently occur together in the database and can reveal hidden and useful patterns. The main contribution of this dissertation is a method for mining negative and positive exceptions in multidimensional databases. The goal of this method is to find associations that become weaker (negative exceptions) or stronger (positive exceptions) in subsets of the database that satisfy specific conditions over selected attributes. Candidate exceptions are generated by combining previously discovered association rules with a set of attributes specified by the user. An exception is actually mined only when the actual support of a candidate exception differs greatly from its expected support. A method to estimate this expectation is proposed, as well as interest measures to evaluate the exceptions. An algorithm to mine these exceptions and experimental results are also presented.
... However, once again, the IM values are high. The exception 100 is less interesting due to the high negative dependence between Z and A. The adopted approach for mining exceptions was also applied to a real medical data set (the results can be found in [3]). ...
Conference Paper
This paper addresses the problem of mining exceptions from multidimensional databases. The goal of our proposed model is to find association rules that become weaker in some specific subsets of the database. The candidates for exceptions are generated combining previously discovered multidimensional association rules with a set of significant attributes specified by the user. The exceptions are mined only if the candidates do not achieve an expected support. We describe a method to estimate these expectations and propose an algorithm that finds exceptions. Experimental results are also presented.
... The entity-relationship diagram of the CONTROL data matrix is shown in Fig. 2. The goal of the discovery challenge is to extract knowledge or patterns from the STULONG data stored in the CONTROL matrix. Not many studies have focused on mining the CONTROL data, and most data mining on the STULONG dataset has been based on the ENTRY table [8,9]. These patterns can help physicians to know if a patient is at risk of having cardiovascular disease. ...
Conference Paper
This paper helps the understanding and development of a data summarisation approach that summarises structured data stored in a non-target table that has many-to-one relations with the target table. In this paper, the feasibility of data summarisation techniques, borrowed from the Information Retrieval Theory, to summarise patterns obtained from data stored across multiple tables with one-to-many relations is demonstrated. The paper describes the Dynamic Aggregation of Relational Attributes (DARA) framework, which summarises data stored in non-target tables in order to facilitate data modelling efforts in a multi-relational setting. The application of the DARA algorithm involving structured data is presented in order to show the adaptability of this algorithm to real world problems.
Article
Introduction Nowadays, a significant part of goods and passengers are transported on suburban highways with mainly high-speed vehicles. Hence, these highways are very prone to collisions with different injuries. For this reason, road collisions have become one of the largest international health issues in the world recently. Due to the high fatality or severe physical/mental injury rates caused by car collisions and the complex interactions between the factors affecting them, analyzing these collision-prone areas, identifying the factors affecting their occurrences, and discovering knowledge in the form of key rules is crucial, as the purposes for this study. Methods Three supervised algorithms including Artificial Neural Network (ANN), Support Vector Machine (SVM), and Random Forests (RF) were used to build up classification models for the fatality severity of 2355 fatal collision data records during 2007–2009 occurred in the roadways of 8 states in the USA with different driver-related, environmental, and road factors. Predicted risk maps were generated for each classifier and the importance of contributing factors was evaluated based on the mean decrease in accuracy and the mean decrease in Gini Index. Finally, association rule mining was performed by the Apriori algorithm to extract collision rules. Results RF outperformed the other methods in terms of the highest overall accuracy and kappa rates, which were 94% and 92% respectively. The risk maps revealed that collisions with the most fatalities were mainly concentrated in the northern states. The number of travel lanes, speed limit, roadway profile, and light conditions were the most influencing factors in the collisions. 68 association rules were mined by the Apriori algorithm. Then, important rules were specified to extract hidden information from the collision data. Conclusions The RF model was more accurate in collision analysis, along with comprehensive collision factors. 
Great importance should be given to road factors in route design, especially in critical locations with severe fatality risk.
Conference Paper
The LO-DL (Learning Objects Digital Library) Project is being developed at PUC-Rio in the Database Technology Lab (TECBD). This project aims at integrating LO repositories through their metadata in a uniform catalog or Digital Library (DL), making their locations and characteristics transparent to users. The process of digital library development includes issues such as the integration of several databases. Moreover, access to the DL must be assisted by the use of content hierarchies that guide the user in the discovery and filtering of information of his/her interest. In this work we propose a new approach for developing Digital Libraries of Learning Objects using a Data Warehousing Architecture, which is a method that addresses both issues mentioned above. We make comparisons between the components and services of both the Data Warehousing and the "Digital Librarying" Architectures. Furthermore, we suggest the use of Data Mining Techniques in some steps of the building and utilization of the DL. In particular, we detail the user profile analysis process, which uses the library access log and a data mining tool for the automatic refresh of the library. We propose the use of association rules for detecting user needs, so that the system can then search for new LOs and make them available in the next loading (refresh) of the library. The supporting database, which contains the Ontology (in our case a Taxonomy) of the LOs of interest, is also described in the paper.
Conference Paper
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the low-level efficiency of the algorithm. Second, we present a new way of generating "implication rules," which are normalized based on both the antecedent and the consequent and are truly implications (not simply a measure of co-occurrence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results.
Conference Paper
This paper presents an algorithm that seeks every possible exception rule which violates a common sense rule and satisfies several assumptions of simplicity. Exception rules, which represent systematic deviation from common sense rules, are often found interesting. Discovery of pairs that consist of a common sense rule and an exception rule, resulting from undirected search for unexpected exception rules, was successful in various domains. In the past, however, an exception rule was represented by a change of conclusion caused by adding an extra condition to the premise of a common sense rule. That approach formalized only one type of exceptions, and failed to represent other types. In order to provide a systematic treatment of exceptions, we categorize exception rules into eleven categories, and we propose a unified algorithm for discovering all of them. Preliminary results on fifteen real-world data sets provide an empirical proof of effectiveness of our algorithm in discovering interesting knowledge. The empirical results also match our theoretical analysis of exceptions, showing that the eleven types can be partitioned in three classes according to the frequency with which they occur in data. Keywords: Exception/Deviation Detection, Rule discovery, Exception
Article
This article presents an algorithm that seeks every possible exception rule that violates a commonsense rule and satisfies several assumptions of simplicity. Exception rules, which represent systematic deviation from commonsense rules, are often found interesting. Discovery of pairs that consist of a commonsense rule and an exception rule, resulting from undirected search for unexpected exception rules, was successful in various domains. In the past, however, an exception rule represented a change of conclusion caused by adding an extra condition to the premise of a commonsense rule. That approach formalized only one type of exception and failed to represent other types. To provide a systematic treatment of exceptions, we categorize exception rules into 11 categories, and we propose a unified algorithm for discovering all of them. Preliminary results on 15 real-world datasets provide an empirical proof of effectiveness of our algorithm in discovering interesting knowledge. The empirical results also match our theoretical analysis of exceptions, showing that the 11 types can be partitioned in three classes according to the frequency with which they occur in data. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 673–691, 2005.
Conference Paper
Mining for association rules is considered an important data mining problem. Many different variations of this problem have been described in the literature. We introduce the problem of mining for negative associations. A naive approach to finding negative associations leads to a very large number of rules with low interest measures. We address this problem by combining previously discovered positive associations with domain knowledge to constrain the search space such that fewer but more interesting negative rules are mined. We describe an algorithm that efficiently finds all such negative associations and present the experimental results.
Article
We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Experiments with synthetic as well as real-life data show that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.
1 Introduction
Database mining is motivated by the decision support problem faced by most large retail organizations [S+93]. Progress in bar-code technology has made it possible for retail ...
Brin, S., Motwani, R., Ullman, J. D., Tsur, S.: Dynamic Itemset Counting and Implication Rules for Market Basket Data. In ACM SIGMOD Intl. Conf. (1997).