Quality Evaluation of an Anonymized Dataset
Sam Fletcher and Md Zahidul Islam
Center for Research in Complex Systems (CRiCS)
School of Computing and Mathematics
Charles Sturt University
Bathurst NSW 2795, Australia
Email: {safletcher;zislam}@csu.edu.au
Abstract—In this study we argue that the traditional approach
of evaluating the information quality of an anonymized (or
otherwise modified) dataset is questionable. We propose a novel
and simple approach to evaluate the information quality of a
modified dataset, and thereby the quality of techniques that
modify data. We carry out experiments on eleven datasets and the
empirical results strongly support our arguments. We also present
some supplementary measures to our approach that provide
additional insight into the information quality of modified data.
Keywords—data mining, privacy preserving data mining, data
quality, information quality, noise addition, anonymization.
I. INTRODUCTION
Since the late 20th century, technological advances have
led to rapid increases in data collection, mining and analysis.
While clearly advantageous, this growth is not without its drawbacks, and one such drawback is the questions it raises about an individual's right to privacy. Privacy is considered by many to be a basic human right, and so in an ideal scenario each individual has the right to refuse the use of their data for analysis. The benefits of data mining and analysis are wide-ranging, extending beyond the analysts to society as a whole, and so there is great value in persuading individuals to set aside their privacy concerns.
One way to do so is to assure them that their anonymity will stay intact. Research in "Privacy Preserving Data Publishing" (PPDP) or "Privacy Preserving Data Mining" (PPDM) aims to distort the data in such a way as to keep the information accurate, without necessarily needing to keep the data accurate.1 The main purpose of Data Mining is to discover patterns in data [2], and so in this context the "information" is the patterns that can be discerned from the data. To preserve the patterns is to preserve the information. Additionally, the distortion of data needs to occur in such a way as to make it impossible (or near-impossible) for any one individual to be identified [3]. The process of distorting data to preserve privacy is often called "anonymization".
In this paper our focus is on how to evaluate the information quality of data after anonymization. Without the ability to evaluate the information quality of anonymized data, it is all but impossible to empirically evaluate the anonymization technique. This paper aims to provide researchers with tools to evaluate the information quality of anonymized datasets.2 These tools are also useful in other scenarios involving modified data, such as when collected data with inaccurate values are improved using data cleansing techniques in order to produce more accurate information [4].

1 The difference between "data" and "information" is the level of abstraction, and the meaning that can be extracted from it. "Data" is merely the raw facts, unorganized and unprocessed. "Data" can become "information" once it has been analyzed, given a context, organized, or otherwise gains meaning [1].

2 A "dataset" is a two-dimensional table where rows represent independent records (tuples) and columns represent various attributes that describe the records and distinguish them from each other. See Fig. 1a for an example.
A common way to evaluate the information quality of
a dataset is to build a classifier from the dataset, and test
how well the classifier can predict information about future
data [5]–[13]. A classifier is a framework by which future
records (tuples) can be classified into categories. Classifiers
are built by machine learning algorithms that find patterns that
describe the data. The categories used to classify records are
defined by the class attribute. Like other attributes, the class
attribute describes a quality possessed by a record, but has been
selected by the user as being important enough for the machine
learning algorithm to use it to classify records. Depending on
the quality of the information discoverable in the dataset, the
classifier will more or less accurately classify future records
using the patterns it discovered. The accuracy of a classifier in predicting the class value of future records is called "prediction accuracy".
Example 1: A classifier might be built from a medical dataset in order to accurately predict the diagnoses of future patients. The attribute "Diagnosis" would be selected as the class attribute, and all other attributes would be analyzed by the machine learning algorithm in order to find patterns that predict the class attribute. Fig. 1a and Fig. 1d represent a toy dataset D and a classifier Z_D (built from D), respectively. Rather than waiting for future patient records in order to test the prediction accuracy of the classifier, some (currently possessed) records are typically withheld from D and used as simulated future records. This is known as "testing data", and an example is dataset T in Fig. 1c. Since we know the class values for our testing data, we can estimate that Z_D will be 100% accurate (in this example) in predicting the (unknown) class values of real future patient records. In this case the type of classifier chosen is a decision tree, and we will continue to use decision trees for demonstrative purposes. It should be noted that our proposed measure (introduced below) should be applicable in any scenario where prediction accuracy is also applicable. We only use decision trees as a way of collecting realistic patterns for our experiments, as a proof of concept.
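To make this workflow concrete, the following is a minimal sketch of Example 1, assuming pandas and scikit-learn are available. It uses CART (scikit-learn's DecisionTreeClassifier) as a stand-in for the paper's C4.5 trees, so it may not reproduce the exact tree of Fig. 1d; the encoding step and variable names are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy training data D (Fig. 1a) and simulated "future" records T (Fig. 1c).
D = pd.DataFrame({
    "Age": [11, 68, 20, 6, 34, 55],
    "Symptom": ["Muscle aches", "Muscle aches", "Cough",
                "Lump on body", "Lump on body", "Cough"],
    "Diagnosis": ["Flu", "Flu", "Flu", "Flu", "Cancer", "Cancer"]})
T = pd.DataFrame({
    "Age": [30, 29, 16, 83],
    "Symptom": ["Muscle aches", "Cough", "Lump on body", "Cough"],
    "Diagnosis": ["Flu", "Flu", "Flu", "Cancer"]})

# One-hot encode the categorical attribute so the tree can use it;
# T must be encoded with the same columns as D.
X_D = pd.get_dummies(D.drop(columns="Diagnosis"), columns=["Symptom"])
X_T = (pd.get_dummies(T.drop(columns="Diagnosis"), columns=["Symptom"])
         .reindex(columns=X_D.columns, fill_value=0))

# Z_D: a decision tree built from D.
Z_D = DecisionTreeClassifier(random_state=0).fit(X_D, D["Diagnosis"])

# Prediction accuracy A(Z_D|T): the proportion of T classified correctly.
print("A(Z_D|T) =", accuracy_score(T["Diagnosis"], Z_D.predict(X_T)))
```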
Record   Age   Symptom        Diagnosis
D1       11    Muscle aches   Flu
D2       68    Muscle aches   Flu
D3       20    Cough          Flu
D4       6     Lump on body   Flu
D5       34    Lump on body   Cancer
D6       55    Cough          Cancer
(a) Dataset D

Record   Age   Symptom        Diagnosis
D′1      11    Cough          Flu
D′2      47    Muscle aches   Flu
D′3      99    Cough          Flu
D′4      6     Lump on body   Flu
D′5      34    Lump on body   Cancer
D′6      55    Cough          Cancer
(b) Dataset D′

Record   Age   Symptom        Diagnosis
T1       30    Muscle aches   Flu
T2       29    Cough          Flu
T3       16    Lump on body   Flu
T4       83    Cough          Cancer
(c) Dataset T
(d) Decision tree Z_D [tree diagram: the root node tests Symptom (Muscle aches / Cough / Lump on body), with Age tests at thresholds 38 and 20 below it; leaves are labelled Flu and Cancer]

(e) Decision tree Z_D′ [tree diagram: similar to (d), with an additional Age test at threshold 77 leading to Cancer and Flu leaves]
Fig. 1. (d) and (e) are decision trees built from the toy datasets in (a) and (b), respectively. (b) is a distorted version of (a). The accuracy of both decision trees is tested on the data in (c). A decision tree is a graphical representation of some of the patterns (logic rules) existing in a dataset, where the patterns show the relationships between the non-class attributes and the class attribute. A decision tree "filters" records into stronger and more specific patterns the further the tree extends. These "filters" are defined by the nodes in the tree, where records are split into mutually exclusive and collectively exhaustive subsets of the dataset based on the attribute values tested in each node. The filtering process starts at the top (root) node and follows a path to a bottom (leaf) node. For each leaf node, the most common class value possessed by the records in the leaf is called the "majority class value". The class value of a future record is predicted to be the majority class value of the leaf node that the record was filtered into.
When anonymizing a dataset, a common approach for evaluating the change in information quality is to compare the prediction accuracy of two classifiers: one built from the original dataset and one from the anonymized dataset [5]–[12]. Note that the use of a classifier for the purpose of discovering information in data is independent of the anonymization (or data cleansing) process. Measuring information quality with a classifier is applicable even when the data was modified without a classifier.
Example 1 (ongoing): Dataset D′ in Fig. 1b is an example of a distorted version of D. The classifier in Fig. 1e, Z_D′, is built from D′, and its prediction accuracy is tested on T and found to be 75% (just as Z_D was tested on T). Thus the change in information quality when anonymizing D is 100% − 75% = 25%.
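A short sketch of this traditional comparison, reusing the encoded attributes X_D and X_T from the earlier snippet and assuming X_D_prime is D′ encoded the same way (all names illustrative):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Traditional evaluation: compare the two trees' prediction accuracy on T.
Z_D       = DecisionTreeClassifier(random_state=0).fit(X_D, D["Diagnosis"])
Z_D_prime = DecisionTreeClassifier(random_state=0).fit(X_D_prime, D_prime["Diagnosis"])
change = (accuracy_score(T["Diagnosis"], Z_D.predict(X_T))
          - accuracy_score(T["Diagnosis"], Z_D_prime.predict(X_T)))
print("drop in prediction accuracy:", change)  # e.g. 1.00 - 0.75 = 0.25
```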
Unfortunately a comparison of prediction accuracies as introduced above can have quite erratic results, with questionable statistical significance. This is explored in depth in Section III. In this study we therefore propose a different approach that evaluates the information quality of an anonymized dataset by examining how valid the patterns discovered in an original dataset (i.e. the logic rules of a decision tree) are in the anonymized dataset. A simple method for doing so: rather than measuring the accuracy of Z_D′ on T and comparing it to the accuracy of Z_D on T, we compute the accuracy of Z_D on D′. That is, we see how well the distorted records fit the patterns discovered by the original classifier. This can be compared to how well the original dataset D fits the patterns in Z_D, to give us a sense of how much the patterns have deteriorated. To differentiate this approach from the traditional "prediction accuracy", we refer to it as "pattern accuracy". A full description and analysis can be found in Section IV.
To test our proposal, we artificially distort real-life datasets gathered from the UCI Machine Learning Repository [14]. In this context, distorting data is often described as adding "noise" to the data. We apply two different kinds of noise, and explore the effects of each on prediction accuracy and other metrics. The types of noise used are introduced in Section II-B.
A. Notation
The following notation will be used throughout the paper. D represents a dataset and D′ is a distorted version of D. T is the testing dataset, made by extracting a random selection of records from D before building any trees. D_i is the i-th record in D. Z_D is a classifier built from D. A(Z_D|T) is the accuracy of Z_D when predicting the class values of the records in T. It represents the proportion of correct predictions, 0 ≤ A(Z_D|T) ≤ 1. Z_D′ follows the same notation, as does A(Z_D|D′) and any other combination of classifiers and datasets.
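Under this notation, every evaluation is simply a proportion of correct predictions. A small helper (a sketch assuming scikit-learn-style classifiers and pandas DataFrames whose non-class attributes are already encoded the same way as the classifier's training data; the parameter names are ours):

```python
from sklearn.metrics import accuracy_score

def A(Z, dataset, class_attribute):
    """A(Z|X): the proportion of records in `dataset` whose class value the
    classifier Z predicts correctly, so 0 <= A(Z|X) <= 1. Assumes the
    non-class attributes are encoded identically to Z's training data."""
    X = dataset.drop(columns=class_attribute)
    y = dataset[class_attribute]
    return accuracy_score(y, Z.predict(X))
```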
B. Our contributions
We logically and empirically argue that a comparison of A(Z_D|T) and A(Z_D′|T) is often inconclusive when evaluating the change in information quality of D and D′. Therefore, we argue that the widespread use of A(Z_D′|T) [5]–[11] to evaluate the information quality of D′ (and therefore the anonymization technique that produces D′) is questionable. We propose that measuring the pattern accuracy (that is, A(Z_D|D′)) of an anonymized dataset can be more informative, both theoretically and empirically. We also propose that a careful analysis of A(Z_D′|D) and A(Z_D′|D′) can provide additional insight into the information quality of D′.

Section II provides background information on relevant previous work, the types of noise we implement for our tests, and our experiment methodology. In Section III we discuss the weaknesses of the traditional approach that uses prediction accuracy (i.e. A(Z_D′|T)), and in Section IV we explore the usefulness of pattern accuracy (i.e. A(Z_D|D′)). Section V provides concluding remarks.
TABLE I. Details of the datasets used in our experiments. "Majority class value percentage" refers to the percentage of records that have the most common class value. "Average C4.5 prediction accuracy" is the average prediction accuracy of the trees produced by C4.5 using 10-fold cross validation.

Name         Records   Numerical Attributes   Categorical Attributes   Class Values   Majority Class Value %   Avg. C4.5 Prediction Accuracy
WBC          683       9                      0                        2              65%                      95%
Vehicle      846       18                     0                        4              26%                      69%
RedWine      1599      11                     0                        6              43%                      58%
PageBlocks   5473      10                     0                        5              90%                      97%
Credit       653       6                      9                        2              55%                      85%
Statlog      1000      7                      13                       2              70%                      73%
CMC          1473      8                      1                        3              43%                      53%
Yeast        1484      7                      1                        10             31%                      56%
Adult        30162     6                      5                        2              75%                      86%
Mushroom     5644      0                      22                       6              62%                      99%
Chess        28056     0                      6                        18             16%                      44%
II. BACKGROUND INFORMATION
A. Previous work
Previous work has raised concerns about using prediction accuracy as a metric for making comparisons. Lim, Loh and Shih [12] found that when testing 33 classification algorithms on 16 datasets, the addition of noise (distortions in the data) made no statistically significant difference in the prediction accuracy (i.e. A(Z_D′|T)) of the algorithms.

Further research [15] supported the earlier findings: the addition of randomly generated noise had no significant effect on the prediction accuracy of the classifier built from the distorted data. However, an analysis of the structure of the decision trees showed drastic changes, in both the sizes of the trees (the number of nodes) and the attributes chosen in the trees [15].

Despite these earlier concerning results, prediction accuracy remains a popular tool for measuring the effect of anonymization on information quality [5]–[11]. In some cases, prediction accuracy is the only metric used [7], [8], [10], [11].
B. Types of noise implemented
Anonymization is often achieved through the addition of noise to the data, whereby values are distorted in a planned way. This is usually done with the intent of keeping the information quality as high as possible while simultaneously meeting privacy requirements. An aim of this study is to investigate the ability of prediction accuracy (i.e. A(Z_D′|T)) to represent the information quality of a distorted dataset D′, and thereby evaluate the underlying noise addition (i.e. data distortion) techniques. Hence, we implement two very simple types of noise that distort the data in different ways:

1) Random Noise (RN): A user-defined percentage of all values in the dataset are changed. A value of an attribute is changed to any other valid value of the attribute (but not the value it already had). This applies to both numerical and categorical attributes. Clearly this is quite extreme noise, and our tests focus on percentages below 30%. The user-defined percentage of noise is denoted as a two-digit suffix, such as "RN-02" for 2% noise, where there is a 2% chance that any given attribute value will be changed.
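A possible implementation of RN, as a sketch under our reading of the description above; treating an attribute's domain as the set of values observed in D is our own simplifying assumption.

```python
import numpy as np

def add_random_noise(D, noise_rate, class_attribute, seed=None):
    """RN sketch: with probability `noise_rate`, replace each non-class value
    with a different value drawn uniformly from that attribute's domain
    (approximated here by the values observed in the DataFrame D)."""
    rng = np.random.default_rng(seed)
    D_prime = D.copy()
    for attr in D.columns:
        if attr == class_attribute:
            continue  # the class attribute is never distorted
        domain = D[attr].unique()
        for i in D.index:
            if rng.random() < noise_rate:
                alternatives = domain[domain != D.at[i, attr]]
                if len(alternatives) > 0:
                    D_prime.at[i, attr] = alternatives[rng.integers(len(alternatives))]
    return D_prime

# Example: D_prime = add_random_noise(D, 0.02, "Diagnosis", seed=0)  # RN-02
```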
2) Domain-restricted Random Noise (DRRN): DRRN, inspired by LINFA [16], applies noise in a similar way to RN. It differs in that, for each record, the changed values remain within the domains defined by the rule the record obeys in the classifier. In the context of classification, a "rule" is the formal description of a pattern. In a decision tree, any shortest path from the root node to a leaf node represents a rule.

Example 1 (ongoing): In Fig. 1a, D3's value for "Age" will always remain between 0 and 38 after applying DRRN, since that is the domain it must stay within in order to continue obeying the same rule in Fig. 1d (that is, the third leaf counting from left to right).

Categorical attributes are not modified by DRRN, so we do not apply DRRN to datasets with only categorical attributes. If an attribute is not used in a record's rule, the distorted values for that attribute can be anywhere within the attribute's total domain, just as is the case for RN.
Both DRRN and RN exclude the class attribute from the
noise addition process. At no point is the testing data T
distorted. RN and DRRN are only used for demonstration
purposes in our experiments, and are not recommended for
any purposes beyond that.
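For completeness, here is a sketch of one way DRRN's rule-restricted domains could be computed for numerical non-class attributes with a fitted scikit-learn decision tree. This is our own reconstruction of the idea, not the authors' implementation, and it omits some details (for instance, it does not forbid redrawing the value the record already had).

```python
import numpy as np

def rule_intervals(clf, X):
    """For each record, find the [low, high] interval each numerical attribute
    must stay within so the record keeps obeying the same rule
    (root-to-leaf path) in the fitted decision tree `clf`."""
    n, d = X.shape
    low = np.full((n, d), -np.inf)
    high = np.full((n, d), np.inf)
    paths = clf.decision_path(X)  # sparse node-indicator matrix
    feature, threshold = clf.tree_.feature, clf.tree_.threshold
    for i in range(n):
        for node in paths.indices[paths.indptr[i]:paths.indptr[i + 1]]:
            f = feature[node]
            if f < 0:                          # leaf node: no test to record
                continue
            if X[i, f] <= threshold[node]:     # record went left: stay <= t
                high[i, f] = min(high[i, f], threshold[node])
            else:                              # record went right: stay > t
                low[i, f] = max(low[i, f], threshold[node])
    return low, high

def add_drrn(X, clf, noise_rate, seed=None):
    """DRRN sketch for numerical non-class attributes (as a NumPy array X):
    distorted values are drawn uniformly inside the record's rule interval,
    falling back to the attribute's full observed domain when the rule
    does not test that attribute."""
    rng = np.random.default_rng(seed)
    low, high = rule_intervals(clf, X)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    X_prime = X.astype(float).copy()
    for i in range(X.shape[0]):
        for f in range(X.shape[1]):
            if rng.random() < noise_rate:
                lo = low[i, f] if np.isfinite(low[i, f]) else col_min[f]
                hi = high[i, f] if np.isfinite(high[i, f]) else col_max[f]
                X_prime[i, f] = rng.uniform(lo, hi)
    return X_prime
```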
C. Experiment methodology
The experiments in this paper are conducted using real-life
datasets obtained from the UCI Machine Learning Repository
[14]. Their details can be found in Table I. In scenarios involving classifiers (such as measuring the prediction accuracy of a classifier), we use 10-fold cross validation. In scenarios involving the distortion of a dataset (such as evaluating how distorted a dataset is), we repeat the test 5 times to negate the unpredictability of random noise. Thus in scenarios where a classifier is built from noisy data, the test is repeated 5 times on each fold, for a total of 50 tests per dataset. All reported measurements are the averages of these tests.
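A sketch of this protocol is shown below; `distort` and `run_test` are hypothetical caller-supplied callbacks, since the concrete measurement differs from experiment to experiment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated(D, class_attribute, distort, run_test, folds=10, repeats=5):
    """Protocol sketch: 10-fold cross validation, with the random distortion
    of each training fold repeated `repeats` times (50 tests in total).
    `distort(D_train, seed)` returns a noisy copy of the training fold;
    `run_test(D_train, D_prime, T)` returns whatever accuracy is measured."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    y = D[class_attribute]
    scores = []
    for train_idx, test_idx in skf.split(D, y):
        D_train, T = D.iloc[train_idx], D.iloc[test_idx]
        for r in range(repeats):
            D_prime = distort(D_train, seed=r)
            scores.append(run_test(D_train, D_prime, T))
    return float(np.mean(scores))
```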
We run all tests on the C4.5 classification algorithm [13]. We use the default parameters for C4.5: m = 2 and c = 25%, where m is the minimum number of records that at least 2 of each node's child nodes must have, and c is the confidence level for pruning.3 We also set the minimum gain ratio required to split a node at 0.01. It is possible that, with some experimentation, parameters could be tailored to each dataset to elicit "better" trees (whether that is based on tree size, prediction accuracy or some other user-defined criterion). However, in these experiments we focus on simplicity and consistency, and so we apply the same parameters to all datasets.

3 Pruning is a post-processing technique applied to decision trees in order to minimize over-fitting. It involves the deletion of unnecessary leaf nodes, thus making the tree smaller [2], [13].
Fig. 2. The difference between A(Z_D′|T) and A(Z_D|T) when applying noise with (a) RN and (b) DRRN. Negative values represent how many percentage points worse A(Z_D′|T) is. Calculating A(Z_D′|T) − A(Z_D|T) has the effect of standardizing all the datasets around zero on the y axis. Z_D is unaffected by RN and DRRN, and so A(Z_D|T) always reports the same result. Each line represents the results from a dataset. The color of the lines and the marker shapes are random. A dotted line signifies a dataset where all the attributes are numerical; a solid line signifies a dataset with both numerical and categorical attributes; and a dashed line signifies a dataset with only categorical attributes.
III. LIMITATIONS OF THE CURRENT APPROACH

In Fig. 2a we present A(Z_D′|T) − A(Z_D|T) for different levels of Random Noise (RN). If A(Z_D′|T) really represents the information quality of D′, then an obvious expectation is that A(Z_D′|T) should be less than A(Z_D|T) (i.e. A(Z_D′|T) − A(Z_D|T) should be negative) when RN is high, since the information quality of D′ is almost certainly inferior to that of D when a large percentage of D has been substituted with random values to make D′. Surprisingly, a few datasets even experience increased prediction accuracy, or remain relatively constant. Datasets such as WBC, PageBlocks and Mushroom report very similar results at RN-30 (see Fig. 2a) to what they do at RN-00 (zero distortion). Note that RN-30 guarantees that nearly a third of the dataset is distorted with random values. Clearly this indicates a serious flaw in the use of A(Z_D′|T) as an evaluation of the information quality of D′ (and therefore as an evaluation of any underlying anonymization technique).
In Fig. 2a and Fig. 2b, we see that the prediction accuracy of Z_D′ (i.e. A(Z_D′|T)) is quite erratic when compared to the accuracy of Z_D, regardless of the percentage of noise incurred by RN or DRRN. There is no predictable pattern between any two points in Fig. 2a or Fig. 2b for most datasets. That is, A(Z_D′|T) is often as likely to increase as it is to decrease when the noise level is incremented higher. The variety of possible results when encountering very simple forms of noise demonstrates that it would be difficult to ascertain the cause of changes in prediction accuracy in real-life scenarios. The erratic nature of the changes in prediction accuracy as noise increases also makes it difficult to extrapolate the results beyond what was empirically tested.
Fig. 3a and Fig. 3b support the findings of Fig. 2a and Fig. 2b. These figures present the win/draw/loss results for the number of times the difference between A(Z_D|T) and A(Z_D′|T) is positive/zero/negative. Each dataset is tested 50 times for three different noise levels. The results of a two-tailed sign test are provided in each bar, representing the probability of the reported results occurring by chance. In Fig. 3a we can see that once the level of distortion reaches RN-14, A(Z_D|T) > A(Z_D′|T) for most datasets. However, the results are still statistically inconclusive (at the 0.05 level) for WBC, Vehicle and Credit even when 14% of the values are randomly changed, and become worse as the noise decreases. In Fig. 3b, prediction accuracy remains inconclusive for the majority of datasets at DRRN-02, DRRN-08 and DRRN-14.

We argue that when an anonymization technique is proposed, the ability of the technique to preserve the original information should not be solely evaluated with A(Z_D′|T).
Fig. 3. Sign test results comparing A(Z_D|T) to A(Z_D′|T) at three noise levels using (a) RN and (b) DRRN. The number of wins, draws and losses from the 50 tests conducted on each dataset are plotted on the x axis. The probability of the reported results occurring by chance if one assumes the null hypothesis is correct (i.e. A(Z_D|T) = A(Z_D′|T)) is provided in each bar.
A(Z_D′|T) can only report whether any patterns found in D′ explain T, not whether the same patterns found in D are also in D′. This weakness applies to any set of patterns, regardless of how they were discovered. Our experiments further demonstrate that A(Z_D′|T) can have very erratic behavior when the patterns are discovered with a decision tree algorithm. Our findings are consistent with previous work [12], [15], and are concerning given how often prediction accuracy is used to confirm the preservation of information quality after anonymization, particularly with decision trees [5]–[11]. If prediction accuracy is to be used to compare classifiers, we recommend careful consideration of the results, as well as being explicit about the limitations of prediction accuracy: even when statistically significant, A(Z_D′|T) is only useful in assessing whether the classifier can predict future records. Making any broader claims about the quality of the dataset is misleading. However, using additional metrics can provide a more robust evaluation of the information quality of the anonymized dataset. Our proposed evaluation technique for measuring pattern accuracy (explained in the next section) is one such metric.
IV. OUR PROPOSED APPROACH

Anonymization techniques aim to distort a dataset in such a way that each individual's privacy is preserved while the information quality of the data remains high. In the context of Data Mining, the information sought after in the data is the patterns that explain the data. The ability of an anonymization technique to maintain information quality is often evaluated through A(Z_D′|T) [5]–[12]. However, Section III discusses the limitations of A(Z_D′|T) in evaluating the techniques. In this section we propose A(Z_D|D′) for evaluating the presence in D′ of the patterns discovered by Z_D. If we find that A(Z_D|D) and A(Z_D|D′) are similar, then we can conclude that the patterns in D (discovered by Z_D) are present in D′ even after anonymization. This is therefore a direct approach to measuring the level of pattern preservation in an anonymized dataset. Note that A(Z_D|D′) is a variation on prediction accuracy in which, rather than evaluating how well Z_D predicts the class values in T (i.e. A(Z_D|T)), we instead evaluate how well it predicts class values in D′.
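In code, pattern accuracy is a one-line change from the traditional evaluation: the tree built from the original data is simply scored on the distorted records. A sketch reusing the A(...) helper from Section I-A, with illustrative variable and class-attribute names:

```python
# Z_D is the classifier built from the original D; D_prime is the anonymized data.
pattern_accuracy = A(Z_D, D_prime, "Diagnosis")   # A(Z_D | D')
baseline         = A(Z_D, D,       "Diagnosis")   # A(Z_D | D)
print("pattern deterioration:", baseline - pattern_accuracy)
```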
Fig. 4. The difference between A(Z_D|D′) and A(Z_D|D) when applying noise with RN. See Fig. 2's description for information.

Taking this approach, we produce Fig. 4. We can see that in the case of RN, A(Z_D|D′) (i.e. pattern accuracy) decreases smoothly as the level of noise increases. For every pair of noise levels, for every dataset, we find that A(Z_D|D′) decreases consistently as noise increases. Additionally, when testing RN, we find that A(Z_D|D) is higher than (that is, wins against) A(Z_D|D′) in 49 to 50 tests out of 50, for all datasets, at all noise levels. This gives us strong confidence in the ability of A(Z_D|D′) to evaluate the preservation in D′ of the patterns of D (discovered by Z_D). Clearly the current approach of using A(Z_D′|T) is much less reliable than our proposed approach.
We find that in the case of DRRN, A(Z_D|D′) is always equal to A(Z_D|D) (that is, they draw 50 out of 50 times for all datasets at all noise levels). This is the expected result, since DRRN was designed to guarantee (by definition) that all records follow the same pattern (logic rule) before and after anonymization.

These results are much more definitive than the results for A(Z_D′|T) presented in Fig. 2a, Fig. 2b, Fig. 3a and Fig. 3b. A(Z_D|D′) is neither too erratic nor at risk of being statistically insignificant.
A. Supplementary measures
We also present two more supplementary variations of prediction accuracy to aid the evaluation of the information quality of D′: A(Z_D′|D) and A(Z_D′|D′). From A(Z_D′|D) we learn how well the patterns discovered by Z_D′ hold in D. Generally, a high A(Z_D′|D) value should indicate high information quality in D′, since Z_D′ was built from D′. However, we also need to check whether the patterns in Z_D′ explain D′ well. We can check this through A(Z_D′|D′). A high A(Z_D′|D′) suggests that the patterns in Z_D′ are a true reflection of the patterns in D′. Therefore, a high A(Z_D′|D) and a high A(Z_D′|D′) together indicate that the patterns in D′ exist in D, and thus that the information quality of D′ is high.
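These supplementary measures use the same helper, now applied to the tree built from the distorted data; again the variable and class-attribute names are illustrative:

```python
# Z_D_prime is a classifier built from D_prime (the anonymized dataset).
a_on_original  = A(Z_D_prime, D,       "Diagnosis")  # A(Z_D' | D)
a_on_distorted = A(Z_D_prime, D_prime, "Diagnosis")  # A(Z_D' | D')
# A high A(Z_D'|D') paired with a low A(Z_D'|D) warns that D' contains
# convincing-looking patterns that are not actually present in D.
```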
Upon testing A(Z_D′|D′) and A(Z_D′|D) with RN, we find that both evaluations decrease for most datasets, meaning that Z_D′ is not only decreasingly representative of the patterns in D, but also of those in D′. This is a clear indication that the machine learning algorithm is unable to find any good patterns in the data when the RN noise level is high, as expected.

In some cases (specifically Vehicle, Statlog and CMC), A(Z_D′|D′) increases or remains high while A(Z_D′|D) decreases. From this, we can deduce that the patterns found in D′ explain D′ well, but are not present in D. If an anonymized dataset with this characteristic (high A(Z_D′|D′) and low A(Z_D′|D)) were used to replace D in a real-life scenario, it could be very misleading due to the high accuracy reported by A(Z_D′|D′). If D′ replaces D, the invalidity of the patterns in Z_D (previously Z_D′) would be impossible to detect. If the decision to replace D with D′ were based solely on the result of A(Z_D′|T), Vehicle, Statlog and CMC could all appear acceptable (see Fig. 2a). However, evaluating A(Z_D′|D′), A(Z_D′|D) and A(Z_D|D′) clearly shows them to have very poor information quality at high RN noise levels.
Some of these evaluations may be more useful than others
depending on the user’s needs, but their similarity to the
traditional prediction accuracy makes them straightforward
to implement and computationally simple (O(n)). Together,
these measurements provide a more complete picture of the
information quality of data than any single measurement is
capable of.
V. CONCLUSION
In this study we present an approach to evaluate the information quality of an anonymized dataset. Our first suggestion is to use A(Z_D|D′), since it clearly indicates whether or not the patterns in D (discovered by Z_D) exist in D′. A(Z_D′|D) and A(Z_D′|D′) can also provide a user with additional insight. Our empirical analysis strongly supports this approach. However, we also understand that there can be a number of patterns in D other than those discovered by Z_D. It is important to check whether those patterns exist in D′ before coming to a conclusion about the information quality of D′. The other patterns in D can be extracted through the use of a decision forest F_D (instead of a single tree) [8], [9], which is a set of decision trees {Z_D1, Z_D2, ..., Z_Df}. One then needs to calculate A(Z_D1|D′), A(Z_D2|D′), ..., A(Z_Df|D′) to check whether all the patterns in D exist in D′. Techniques such as Frequent Pattern Analysis could also be applied first (if they are not already being applied) to filter out the uninteresting or less useful patterns that might otherwise act as noise when calculating A(Z_D|D′) [17].
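One simple way to realize this idea is sketched below: a bagged ensemble of CART trees (our own illustrative construction, not the specific forest technique the authors cite), where each tree's pattern accuracy is reported on the anonymized data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_pattern_accuracies(D, D_prime, class_attribute, f=10, seed=0):
    """Grow a simple bagged forest F_D = {Z_D1, ..., Z_Df} on D and return
    A(Z_Di | D') for every tree, so that patterns missed by any single tree
    are still checked against the anonymized data. Assumes the non-class
    attributes are already numerically encoded; `f` is an illustrative size."""
    rng = np.random.default_rng(seed)
    X, y = D.drop(columns=class_attribute), D[class_attribute]
    X_p, y_p = D_prime.drop(columns=class_attribute), D_prime[class_attribute]
    accuracies = []
    for _ in range(f):
        sample = rng.integers(0, len(D), size=len(D))  # bootstrap sample of D
        Z_i = DecisionTreeClassifier(random_state=0).fit(X.iloc[sample], y.iloc[sample])
        accuracies.append(float((Z_i.predict(X_p) == y_p).mean()))
    return accuracies
```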
Our logical and empirical analysis of prediction accuracy (i.e. A(Z_D′|T)) raises serious doubts about its traditional usage for information quality analysis. Having demonstrated its capacity for fallibility, we intend to research the performance of prediction accuracy in specific scenarios, such as on datasets made k-anonymous and l-diverse. Another important question is whether classifiers known to be more robust to noise than decision trees (such as the k-Nearest Neighbor classifier) report good prediction accuracy results because of properties possessed by the classifiers, or rather because of properties possessed by prediction accuracy.
REFERENCES
[1] N. L. Henry, “Knowledge Management: A New Concern for Public
Administration,” Public Administration Review, vol. 34, no. 3, pp. 189–
196, 1974.
[2] J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques.
Morgan Kaufmann Publishers, 2006.
[3] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[4] M. G. Rahman and M. Z. Islam, “Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques,” Knowledge-Based Systems, vol. 53, pp. 51–65, Sep. 2013.
[5] C. C. Aggarwal, “On Unifying Privacy and Uncertain Data Models,”
2008 IEEE 24th International Conference on Data Engineering, pp.
386–395, Apr. 2008.
[6] C. C. Aggarwal and P. S. Yu, “On static and dynamic methods for
condensation-based privacy-preserving data mining,” ACM Transactions
on Database Systems, vol. 33, no. 1, pp. 1–39, Mar. 2008.
[7] B. Fung, K. Wang, and P. Yu, “Top-down specialization for information
and privacy preservation,” in Proceedings of the 21st International
Conference on Data Engineering. IEEE, 2005, pp. 205–216.
[8] ——, “Anonymizing classification data for privacy preservation,” IEEE
Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp.
711–725, 2007.
[9] M. Nergiz and C. Clifton, “Thoughts on k-anonymization,” Data &
Knowledge Engineering, vol. 63, no. 3, pp. 622–645, 2007.
[10] K. Wang, P. Yu, and S. Chakraborty, “Bottom-Up Generalization: A Data Mining Solution to Privacy Protection,” in Fourth IEEE International Conference on Data Mining. IEEE, 2004, pp. 249–256.
[11] K. Wang, B. Fung, and P. Yu, “Template-based privacy preservation
in classification problems,” in Fifth IEEE International Conference on
Data Mining. IEEE, 2005, p. 8.
[12] T. Lim, W. Loh, and Y. Shih, “A Comparison of Prediction Accuracy, Complexity and Training Time of Thirty-three Old and New Classification Algorithms,” Machine Learning, vol. 40, no. 3, pp. 203–228, 2000.
[13] J. R. Quinlan, C4.5: programs for machine learning, 1st ed. Morgan Kaufmann, 1993.
[14] K. Bache and M. Lichman, “UCI Machine Learning Repository,”
Irvine, CA, 2013. [Online]. Available: http://archive.ics.uci.edu/ml/
[15] M. Z. Islam, P. Barnaghi, and L. Brankovic, “Measuring Data Quality: Predictive Accuracy vs. Similarity of Decision Trees,” in Proceedings of the 6th International Conference on Computer & Information Technology, vol. 2, Dhaka, Bangladesh, 2003, pp. 457–462.
[16] M. Z. Islam and L. Brankovic, “Privacy preserving data mining: A noise
addition framework using a novel clustering technique,” Knowledge-
Based Systems, vol. 24, no. 8, pp. 1214–1223, 2011.
[17] H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent
Pattern Analysis for Effective Classification,” in 2007 IEEE 23rd
International Conference on Data Engineering. IEEE, 2007, pp. 716–
725.