uRule: A Rule-Based Classification System for Uncertain Data.
Biao Qin, Yuni Xia, Rakesh Sathyesh
Department of Computer Science
Indiana University -
Purdue University Indianapolis, USA
{biaoqin,yxia,sathyesr}@cs.iupui.edu
Sunil Prabhakar
Department of Computer Science
Purdue University
sunil@cs.purdue.edu
Yicheng Tu
Department of Computer Science
and Engineering
University of South Florida
ytu@cse.usf.edu
Abstract—Data uncertainty is common in real-world appli-
cations. Various reasons lead to data uncertainty, including
imprecise measurements, network latency, outdated sources and
sampling errors. These kinds of uncertainties have to be handled
cautiously, or else the data mining results could be unreliable or
wrong. In this demo, we will show uRule, a new rule-based
classification and prediction system for uncertain data. This system
uses new measures for generating, pruning and optimizing clas-
sification rules. These new measures are computed considering
uncertain data intervals and probability distribution functions.
Based on the new measures, the optimal splitting attributes and
splitting values can be identified and used in classification rules.
uRule can process uncertainty in both numerical and categorical
data. It has satisfactory classification performance even when
data is highly uncertain.
I. INTRODUCTION
In many applications, data contains inherent uncertainty. A
number of factors contribute to the uncertainty, such as the
random nature of the physical data generation and collection
process, measurement and decision errors, and data staleness. For
example, in location-based services, moving objects are assumed
to carry locators, and their location information is
periodically updated and streamed to the control center. Based
on the location information, various services can be provided
including real time traffic-based routing, public transportation
planning and accident prediction. In these applications, loca-
tion data is typically inaccurate due to measurement error,
locator energy constraint, network bandwidth constraint and
network transmission latency. Similarly, in sensor network
applications, measurement data such as temperature, humidity,
pressure and moisture can be inaccurate.
Uncertainty can also arise in categorical data. For example,
a tumor is typically classified as benign or malignant in cancer
diagnosis and treatment. In practice, it is often extremely
difficult to accurately classify a tumor at the initial diagnosis
because of the limited precision of the tests. Lab results often
contain false positives and false negatives. Doctors therefore
often diagnose a tumor as benign or malignant only with a
certain probability or confidence [1].
Since data uncertainty is ubiquitous, it is important to
develop data mining models for uncertain data. We study the
problem of data classification with uncertainty and develop a
rule-based classification algorithm for uncertain data. Rule-
based data mining algorithms have a number of desirable
properties. Rule sets are relatively easy for people to under-
stand [2], and rule learning systems outperform decision tree
learners on many problems [3], [4]. Rule sets have a natural
and familiar first order version, namely Prolog predicates,
and techniques for learning propositional rule sets can often
be extended to the first-order case [5], [6]. However, when
data contains uncertainty (for example, when a numerical
value is given not as a precise value but as an interval with a probability
distribution function), these algorithms need to be extended to
handle the uncertainty.
In the demo, we will show a new rule-based system for
classifying and predicting both certain and uncertain data. We
integrate the uncertain data model into the rule-based mining
algorithm. We propose a new measure called probabilistic
information gain for generating rules. We also extend the
rule pruning measure for handling data uncertainty. We run
uRule on uncertain datasets with both uniform and Gaussian
distributions, and the results demonstrate its effectiveness.
II. DATA UNCERTAINTY MODEL
TABLE I
TRAINING SET FOR PREDICTING BORROWERS WHO WILL DEFAULT ON LOAN PAYMENTS

Rowid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1      Yes         Single          110-120        No
2      No          Single          60-85          No
3      Yes         Married         110-145        No
4      No          Divorced        110-120        Yes
5      No          Married         50-80          No
6      Yes         Divorced        170-250        No
7      No          Single          85-100         Yes
8      No          Married         80-100         No
9      No          Single          120-145        Yes
10     No          Divorced        95-115         Yes
11     No          Divorced        80-95          No

We first discuss the uncertainty model for numerical and
categorical attributes, which are the most common types of
attributes encountered in data mining applications.
Fig. 1. Uncertain Data Viewer
When the value of a numerical attribute is uncertain,
the attribute is called an uncertain numerical attribute (UNA),
denoted by $A^{un}$. The value of $A^{un}$ is represented as a range
or interval together with a probability distribution function (pdf) over
this range. Note that $A^{un}$ is treated as a continuous random
variable.
Table I shows an example of a UNA. The data in this table
are used to predict whether borrowers will default on loan
payments. Among the attributes, Annual Income is a
UNA whose precise value is not available. We only know
the range of each person's Annual Income and the pdf
f(x) over that range. The pdf of Annual Income is assumed
to be a normal distribution.
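To make the role of the pdf concrete, the following minimal sketch computes the probability that an uncertain numerical value lies at or above a split point, once for a uniform pdf and once for a normal pdf truncated to the interval. The function names are hypothetical, and placing the normal mean at the interval midpoint with a standard deviation of one quarter of the interval width is an illustrative assumption, not something stated for uRule.

```python
import math

def uniform_mass_above(lo, hi, split):
    """P(x >= split) for x uniformly distributed on [lo, hi]."""
    if split <= lo:
        return 1.0
    if split >= hi:
        return 0.0
    return (hi - split) / (hi - lo)

def _normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def truncated_normal_mass_above(lo, hi, split):
    """P(x >= split) for a normal pdf truncated to [lo, hi].

    Assumption (hypothetical): mean at the midpoint, std = (hi - lo) / 4.
    """
    if split <= lo:
        return 1.0
    if split >= hi:
        return 0.0
    mean = (lo + hi) / 2.0
    std = (hi - lo) / 4.0
    total = _normal_cdf(hi, mean, std) - _normal_cdf(lo, mean, std)
    above = _normal_cdf(hi, mean, std) - _normal_cdf(split, mean, std)
    return above / total

# Annual Income of row 7 in Table I is the interval 85-100.
print(uniform_mass_above(85, 100, 95))
print(truncated_normal_mass_above(85, 100, 95))
```

Under these assumptions the two pdfs assign different mass to the same sub-interval, which is why the choice of pdf affects how much of an instance a condition such as "annual income ≥ 95" covers.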
Under the categorical uncertainty model, a dataset can have
attributes that are allowed to take uncertain values. We call
such an attribute an uncertain categorical attribute (UCA),
denoted by $A^{uc}$. Further, we use $A^{uc}_i$ to denote the attribute
value of the ith record in a database. The notion of UCA has
been proposed in previous work such as [7].
$A^{uc}_i$ takes values from a categorical domain Dom with
cardinality |Dom| = n. For a certain dataset, the value of an
attribute A is a single value $d_k$ in Dom, $Pr(A = d_k) = 1$. In
the case of an uncertain dataset, we record the information by
a probability distribution over Dom instead of a single value.
Let Dom = $\{d_1,d_2,\ldots,d_n\}$, then $A^{uc}_j$ is the probability
distribution $Pr(A^{uc}_j = d_i)$ for all values of i in {1,...,n}.
Thus, $A^{uc}_j$ can be represented by a probability vector $A^{uc}_j =
(p_1,p_2,\ldots,p_n)$ such that $\sum_{i=1}^{n} p_i = 1$.

TABLE II
A DATASET WITH AN UNCERTAIN CATEGORICAL ATTRIBUTE

Tuple  ...  A_j                ...  Class
1      ...  (V1:0.3; V2:0.7)   ...  1
2      ...  (V1:0.2; V2:0.8)   ...  0
3      ...  (V1:0.9; V2:0.1)   ...  1
4      ...  (V1:0.3; V2:0.7)   ...  1
5      ...  (V1:0.8; V2:0.2)   ...  1
6      ...  (V1:0.4; V2:0.6)   ...  1
7      ...  (V1:0.1; V2:0.9)   ...  0

Table II shows a dataset with a UCA whose
exact value is unobtainable. It may be either V1 or V2, each
with an associated probability. In practice, a dataset can have
both uncertain numerical attributes and uncertain categorical
attributes.
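As a small illustration of the probability-vector representation, the sketch below stores the rows of Table II as (probability vector, class) pairs and sums the probability mass per (value, class) pair. Treating this sum as the "probabilistic cardinality" mentioned later in the paper is an assumption, and the names `table_ii` and `probabilistic_cardinality` are hypothetical.

```python
from collections import defaultdict

# Rows of Table II: (probability vector over {V1, V2}, class label).
table_ii = [
    ({"V1": 0.3, "V2": 0.7}, 1),
    ({"V1": 0.2, "V2": 0.8}, 0),
    ({"V1": 0.9, "V2": 0.1}, 1),
    ({"V1": 0.3, "V2": 0.7}, 1),
    ({"V1": 0.8, "V2": 0.2}, 1),
    ({"V1": 0.4, "V2": 0.6}, 1),
    ({"V1": 0.1, "V2": 0.9}, 0),
]

def probabilistic_cardinality(rows):
    """Sum, per (value, class) pair, the probability mass of each tuple."""
    counts = defaultdict(float)
    for dist, label in rows:
        # Each probability vector must sum to 1.
        assert abs(sum(dist.values()) - 1.0) < 1e-9
        for value, p in dist.items():
            counts[(value, label)] += p
    return dict(counts)

counts = probabilistic_cardinality(table_ii)
for key in sorted(counts):
    print(key, round(counts[key], 2))
```

For Table II this gives 2.7 expected tuples with value V1 in class 1 and 0.3 in class 0, so the counts are real-valued even though there are seven whole tuples.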
III. DEMONSTRATION
uRule is implemented on top of the open-source data mining
tool WEKA. We first extend the Arff Viewer in WEKA so that
it can display uncertain data in a proper tabular format. Figure
1 shows the use of the Viewer to see the uncertain intervals
of the attributes of an uncertain numerical dataset.
A. Generating Classification Rules for Uncertain Data
To build a rule-based classifier, we need to extract a set
of rules that show the relationships between the attributes
of a dataset and the class label. Each classification rule has
the form R : Condition => y. Here Condition is called
the rule antecedent, which is a conjunction of attribute
test conditions, and y is called the rule consequent, which is the
class label. A rule set can consist of multiple rules, RS =
{R1,R2,...,Rn}.
Fig. 2. Sample Classification Rules for Uncertain Data
A rule R covers an instance Ij if the attributes of the
instance satisfy the condition of the rule. The Coverage of
a rule is the number of instances that satisfy the antecedent of
a rule. The Accuracy of a rule is the fraction of instances
that satisfy both the antecedent and consequent of a rule,
normalized by those satisfying the antecedent. Ideal rules
should have both high coverage and high accuracy rates.
The uRule algorithm uses the sequential covering approach
to extract rules from the data. The algorithm extracts the
rules one class at a time for a data set. Let (y1,y2,...,yn)
be the ordered classes according to their frequencies, where
y1 is the least frequent class and yn is the most frequent
class. During the ith iteration, instances that belong to yi
are labeled as positive examples, while those that belong
to other classes are labeled as negative examples. A rule is
desirable if it covers most of the positive examples and none
of the negative examples. Our uRule algorithm is based on the
RIPPER algorithm [8], which was introduced by Cohen and
is considered to be one of the most commonly used rule-based
algorithms in practice.
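The class ordering and the positive/negative labeling used by sequential covering can be sketched as follows. This is only an illustration of the ordering step described above, using the class column of Table I; the helper names `class_order` and `one_vs_rest` are hypothetical.

```python
from collections import Counter

def class_order(labels):
    """Return the classes sorted from least to most frequent."""
    counts = Counter(labels)
    # Ties are broken by class name so the order is deterministic.
    return sorted(counts, key=lambda c: (counts[c], str(c)))

def one_vs_rest(labels, target):
    """Mark instances of `target` as positive (True), all others as negative."""
    return [label == target for label in labels]

# "Defaulted Borrower" column of Table I, rows 1-11.
defaulted = ["No", "No", "No", "Yes", "No", "No",
             "Yes", "No", "Yes", "Yes", "No"]

order = class_order(defaulted)
print(order)                                   # ['Yes', 'No']
print(sum(one_vs_rest(defaulted, order[0])))   # 4
```

For Table I, rules are therefore learned for class "Yes" first (4 positive examples), and "No" is left as the most frequent class.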
The rule learning procedure is the key function of the uRule
algorithm. It generates the best rule for the current class,
given the current set of uncertain training tuples. The rule
learning procedure includes two phases: growing rules and
pruning rules. After generating a rule, all the positive and
negative examples covered by the rule are eliminated. The rule
is then added into the rule set as long as it does not violate
the stopping condition, which is based on the minimum de-
scription length (DL) principle. uRule also performs additional
optimization steps to determine whether some of the existing
rules in the rule set can be replaced by better alternative rules.
The basic strategy for growing rules is as follows:
1. It starts with an initial rule {} => y, where the left-hand
side is an empty set and the right-hand side contains the
target class. The rule has poor quality because it covers all the
examples in the training set. New conjuncts are subsequently
added to improve the quality of the rule.
2. The probabilistic information gain is used as the measure to
identify the best conjunct to add to the rule antecedent.
The algorithm selects the attribute and split point with
the highest probabilistic information gain and adds them
as an antecedent of the rule. The details of computing
probabilistic information gain for uncertain data can be found
in our previous work [9] and will be shown in the demo.
3. If an instance is covered by the rule, the function
splitUncertain() is invoked to find the part of the instance that
is covered by the rule. That part is then removed from the
dataset, and the rule growing process continues until either all
the data are covered or all the attributes have been used as antecedents.
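Step 3 can be illustrated with a minimal sketch of what a function like splitUncertain() might do for a single numerical condition of the form x ≥ threshold. This is not the uRule implementation: the class `IntervalInstance`, the function `split_uncertain`, and the assumption of a uniform pdf (so that weight is divided in proportion to interval length) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class IntervalInstance:
    lo: float
    hi: float
    weight: float = 1.0

def split_uncertain(inst, threshold):
    """Split `inst` for a condition of the form x >= threshold.

    Returns (covered, remainder); either may be None. Assumes a uniform
    pdf, so the weight is divided in proportion to interval length.
    """
    if threshold <= inst.lo:
        return inst, None          # fully covered
    if threshold >= inst.hi:
        return None, inst          # not covered at all
    frac = (inst.hi - threshold) / (inst.hi - inst.lo)
    covered = IntervalInstance(threshold, inst.hi, inst.weight * frac)
    remainder = IntervalInstance(inst.lo, threshold, inst.weight * (1.0 - frac))
    return covered, remainder

# Row 7 of Table I (annual income 85-100) against "annual income >= 95".
covered, remainder = split_uncertain(IntervalInstance(85, 100), 95)
print(covered)
print(remainder)
```

In this sketch one third of the instance's weight goes to the covered part [95, 100], and the remaining two thirds stay in the dataset as [85, 95] for later rules.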
Example: Refer to the dataset shown in Table I. Using the uRule
algorithm, after rule growing and pruning, the following rule
set is generated:
(annual income ≥ 95) and (home owner = No) => class=Yes (3.58/0.25)
{} => class=No (7.42/0.67)
The first is a regular rule whose accuracy is around 93%:
it covers a total weight of 3.58 instances, of which 0.25 belongs
to negative instances, so (3.58 − 0.25)/3.58 ≈ 0.93. Note that for
uncertain data a rule may cover instances only partly; the
numbers of instances covered by a rule are therefore no longer
integers but real values. The second rule is a default rule. Like
a traditional rule-based classifier, uRule generates a default rule when no
more rules of sufficient quality can be found. The default rule is applied
to instances that do not match any other rule in the rule set. In
the example, the default rule has an accuracy of around 91%, since
(7.42 − 0.67)/7.42 ≈ 0.91.
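The fractional counts in the example can be reproduced from Table I with a short sketch. It assumes a uniform pdf over each Annual Income interval, which is an assumption made here because it happens to reproduce the quoted numbers; the names `table_i`, `degree_covered` and `summarize` are hypothetical. In each pair, the first number comes out as the total covered weight and the second as the part of that weight belonging to the other class.

```python
# Table I rows: (home_owner, income_lo, income_hi, defaulted).
table_i = [
    ("Yes", 110, 120, "No"),
    ("No", 60, 85, "No"),
    ("Yes", 110, 145, "No"),
    ("No", 110, 120, "Yes"),
    ("No", 50, 80, "No"),
    ("Yes", 170, 250, "No"),
    ("No", 85, 100, "Yes"),
    ("No", 80, 100, "No"),
    ("No", 120, 145, "Yes"),
    ("No", 95, 115, "Yes"),
    ("No", 80, 95, "No"),
]

def degree_covered(row, threshold=95):
    """Degree to which (income >= threshold) and (home owner = No) covers row."""
    home_owner, lo, hi, _ = row
    if home_owner != "No":
        return 0.0
    # Uniform pdf assumed: fraction of [lo, hi] at or above the threshold.
    return min(1.0, max(0.0, (hi - threshold) / (hi - lo)))

def summarize(rows):
    """Return covered/misclassified weight for the rule and the default rule."""
    covered = sum(degree_covered(r) for r in rows)
    wrong = sum(degree_covered(r) for r in rows if r[3] != "Yes")
    rest = len(rows) - covered
    rest_wrong = sum(1.0 - degree_covered(r) for r in rows if r[3] != "No")
    return covered, wrong, rest, rest_wrong

covered, wrong, rest, rest_wrong = summarize(table_i)
print(f"rule 1:  ({covered:.2f}/{wrong:.2f})")      # rule 1:  (3.58/0.25)
print(f"default: ({rest:.2f}/{rest_wrong:.2f})")    # default: (7.42/0.67)
print(round((covered - wrong) / covered, 4))        # 0.9302
```

The partial contributions come from rows 7 and 8, whose income intervals straddle the split point 95.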
For the data shown in Figure 1, the uncertain classification
rules generated by uRule are shown in Figure 2. The red shaded
area highlights all the uncertain classification rules.
B. Prediction with uRule
Once the rules are learned from a dataset, they can be
used for predicting the class of previously unseen
data. Like a traditional rule classifier, each rule of uRule
has the form "IF Conditions THEN Class = $C_i$". Because
each instance $I_i$ can be covered by several rules,
a vector $P(I_i,C) = (P(I_i,C_1), P(I_i,C_2), \ldots, P(I_i,C_n))^T$
can be generated for each instance, in which $P(I_i,C_j)$
denotes the probability for the instance to be in class $C_j$. We
call this vector a Class Probability Vector (CPV).
As an uncertain data instance can be partly covered by a
rule, we denote the degree to which an instance $I_i$ is covered by a rule $R_j$
by $P(I_i,R_j)$. When $P(I_i,R_j) = 1$, $R_j$ fully covers instance
$I_i$; when $P(I_i,R_j) = 0$, $R_j$ does not cover $I_i$; and when
$0 < P(I_i,R_j) < 1$, $R_j$ partially covers $I_i$.
An uncertain instance may be covered or partially covered
by more than one rule. We allow a test instance to trigger
all relevant rules. We use $w(I_i,R_k)$ to denote the weight of
instance $I_i$ covered by the kth rule $R_k$. The weights of an
instance $I_i$ covered by different rules are as follows:
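$$
\begin{aligned}
w(I_i,R_1) &= I_i.w \ast P(I_i,R_1) \\
w(I_i,R_2) &= (I_i.w - w(I_i,R_1)) \ast P(I_i,R_2) \\
&\;\;\vdots \\
w(I_i,R_n) &= \Bigl(I_i.w - \sum_{k=1}^{n-1} w(I_i,R_k)\Bigr) \ast P(I_i,R_n)
\end{aligned}
$$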
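The recurrence above can be written as a short loop that hands each rule a share of whatever weight earlier rules left unassigned. The function name `rule_weights` is hypothetical, and the coverage degrees 0.75 and 1.0 in the usage line are illustrative values for a rule followed by a default rule.

```python
def rule_weights(instance_weight, coverage_degrees):
    """Distribute an instance's weight over an ordered list of rules.

    coverage_degrees[k] plays the role of P(I, R_{k+1}); each rule receives
    that fraction of the weight not yet assigned to earlier rules.
    """
    weights = []
    remaining = instance_weight
    for degree in coverage_degrees:
        w = remaining * degree
        weights.append(w)
        remaining -= w
    return weights

print(rule_weights(1.0, [0.75, 1.0]))   # [0.75, 0.25]
```

Because a default rule covers every instance with degree 1, it absorbs all remaining weight, so the weights of an instance sum to its original weight whenever the rule list ends with a default rule.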
Suppose an instance $I_i$ is covered by m rules; then its class
probability vector (CPV) $CPV(I_i,C)$ is computed as follows:
$$
CPV(I_i,C) = \sum_{k=1}^{m} P(R_k,C) \ast w(I_i,R_k)
$$
where $P(R_k,C)$ is the vector
$P(R_k,C) = (P(R_k,C_1), P(R_k,C_2), \ldots, P(R_k,C_n))^T$ and denotes
the class distribution of the instances covered by rule $R_k$.
$P(R_k,C_i)$ is computed as the fraction of the probabilistic
cardinality of instances in class $C_i$ covered by the rule over
the overall probabilistic cardinality of instances covered by
the rule.
After we compute the CPV for instance Ii, the instance
will be predicted to be of the class which has the largest
probability in the class probability vector. This prediction
procedure is different from a traditional rule based classifier.
When predicting the class type for an instance, a traditional
rule based classifier such as RIPPER usually predicts with
the first rule in the rule set that covers the instance. This is
different for uncertain data. As an uncertain data instance can
be fully or partially covered by multiple rules, the first rule
in the rule set may not be the rule that best covers it. uRule
uses all the relevant rules to compute the probability for the
instance to be in each class and predicts the instance to be the
class with the highest probability.
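The prediction procedure can be sketched as a weighted sum of per-rule class distributions followed by an arg-max. The function names and the numeric values below are hypothetical and chosen only to show the mechanics; the weights are assumed to have been computed already from the coverage degrees.

```python
def class_probability_vector(rule_class_dists, weights):
    """Sum P(R_k, C) * w(I, R_k) over the rules that cover the instance."""
    n_classes = len(rule_class_dists[0])
    cpv = [0.0] * n_classes
    for dist, w in zip(rule_class_dists, weights):
        for j, p in enumerate(dist):
            cpv[j] += p * w
    return cpv

def predict(cpv, class_names):
    """Return the class with the largest entry in the CPV."""
    best = max(range(len(cpv)), key=lambda j: cpv[j])
    return class_names[best]

# Hypothetical values: two rules, two classes.
dists = [[0.9, 0.1], [0.2, 0.8]]   # class distribution of each rule
weights = [0.6, 0.4]               # weight of the instance under each rule

cpv = class_probability_vector(dists, weights)
print([round(p, 2) for p in cpv])      # [0.62, 0.38]
print(predict(cpv, ["Yes", "No"]))     # Yes
```

A first-match classifier would consult only the first rule here; the weighted sum lets the second rule pull the estimate toward its own class distribution in proportion to the weight it receives.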
Example: Suppose we have a test instance {No, Married,
90-110}. When the rules generated from Table I in the example above
are applied to the test instance, the class probability vector is
$$
\begin{aligned}
P(I,C) &= P(R_1,C) \ast w(I,R_1) + P(R_2,C) \ast w(I,R_2) \\
&= \begin{pmatrix} 0.93017 \\ 0.06983 \end{pmatrix} \times 0.75
 + \begin{pmatrix} 0.0903 \\ 0.9097 \end{pmatrix} \times 0.25 \\
&= \begin{pmatrix} 0.6976 \\ 0.0524 \end{pmatrix}
 + \begin{pmatrix} 0.0226 \\ 0.2274 \end{pmatrix}
 = \begin{pmatrix} 0.7202 \\ 0.2798 \end{pmatrix}
\end{aligned}
$$
This shows that the instance is in class "Yes" with probability
0.7202 and in class "No" with probability 0.2798. Thus
the test instance will be predicted to be in class "Yes".
During the demonstration, the audience will see how uRule
generates uncertain classification rules in real time based on
uncertain training datasets and how the uncertain classification
rules are used for prediction. We will demonstrate uRule on
uncertain datasets with both uniform and Gaussian distributions,
and show that uRule has satisfactory classification and
prediction accuracy on highly uncertain data.
REFERENCES
[1] G. Bodner, M. F. H. Schocke, F. Rachbauer, K. Seppi, S. Peer, A. Fierlinger,
T. Sununu, and W. R. Jaschke, “Differentiation of malignant and
benign musculoskeletal tumors: Combined color and power Doppler US
and spectral wave analysis,” vol. 223, pp. 410–416, 2002.
[2] J. Catlett, “Megainduction: A test flight,” in ML, 1991, pp. 596–599.
[3] G. Pagallo and D. Haussler, “Boolean feature discovery in empirical
learning,” Machine Learning, vol. 5, pp. 71–99, 1990.
[4] S. M. Weiss and N. Indurkhya, “Reduced complexity rule induction,” in
IJCAI, 1991, pp. 678–684.
[5] J. R. Quinlan, “Learning logical definitions from relations,” Machine
Learning, vol. 5, pp. 239–266, 1990.
[6] J. R. Quinlan and R. M. Cameron-Jones, “Induction of logic programs:
Foil and related systems,” New Generation Comput., vol. 13, no. 3&4,
pp. 287–312, 1995.
[7] S. Singh, C. Mayfield, S. Prabhakar, R. Shah, and S. Hambrusch,
“Indexing categorical data with uncertainty,” in ICDE, 2007, pp. 616–
625.
[8] W. W. Cohen, “Fast effective rule induction,” in Proc. of the 12th Intl.
Conf. on Machine Learning, 1995, pp. 115–123.
[9] B. Qin, Y. Xia, S. Prabhakar, and Y. Tu, “A rule-based classification
algorithm for uncertain data,” in the IEEE workshop on Management and
Mining of Uncertain Data(MOUND), in conjunction with ICDE, 2009.