Quality Evaluation of an Anonymized Dataset
Sam Fletcher and Md Zahidul Islam
Center for Research in Complex Systems (CRiCS)
School of Computing and Mathematics
Charles Sturt University
Bathurst NSW 2795, Australia
Email: {safletcher;zislam}@csu.edu.au
Abstract—In this study we argue that the traditional approach
of evaluating the information quality of an anonymized (or
otherwise modified) dataset is questionable. We propose a novel
and simple approach to evaluate the information quality of a
modified dataset, and thereby the quality of techniques that
modify data. We carry out experiments on eleven datasets and the
empirical results strongly support our arguments. We also present
some supplementary measures to our approach that provide
additional insight into the information quality of modified data.
Keywords—data mining, privacy preserving data mining, data quality, information quality, noise addition, anonymization.
I. INTRODUCTION
Since the late 20th century, technological advances have
led to rapid increases in data collection, mining and analysis.
While clearly advantageous, it is not without its drawbacks,
and one such drawback is the questions it raises about an
individual’s right to privacy. Privacy is considered by many
to be a basic human right, and so in an ideal scenario each
individual has the right to refuse the use of their data for
analysis. The benefits of data mining and analysis are wide-ranging, often extending beyond the analysts to society as a whole, so there is considerable value in persuading individuals to set aside their privacy concerns.
One way to do so is to assure them that their anonymity will stay intact. Research in "Privacy Preserving Data Publishing" (PPDP) or "Privacy Preserving Data Mining" (PPDM) aims to distort the data in such a way as to keep the information accurate, without necessarily needing to keep the data accurate¹. The main purpose of Data Mining is to discover patterns in data [2], and so in this context the "information" is the patterns that can be discerned from the data. To preserve the patterns is to preserve the information. Additionally, the distortion of data needs to occur in such a way as to make it impossible (or near-impossible) for any one individual to be identified [3]. The process of distorting data to preserve privacy is often called "anonymization".
In this paper our focus is on how to evaluate the informa-
tion quality of data after anonymization. Without the ability
to evaluate the information quality of anonymized data, it is
all but impossible to empirically evaluate the anonymization
technique. This paper aims to provide researchers with tools
to evaluate the information quality of anonymized datasets².
¹The difference between "data" and "information" is the level of abstraction, and the meaning that can be extracted from it. "Data" is merely the raw facts, unorganized and unprocessed. "Data" can become "information" once it has been analyzed, given a context, organized, or otherwise gains meaning [1].
These tools are also useful in other scenarios involving modi-
fied data, such as when collected data with inaccurate values
are improved using data cleansing techniques in order to
produce more accurate information [4].
A common way to evaluate the information quality of
a dataset is to build a classifier from the dataset, and test
how well the classifier can predict information about future
data [5]–[13]. A classifier is a framework by which future
records (tuples) can be classified into categories. Classifiers
are built by machine learning algorithms that find patterns that
describe the data. The categories used to classify records are
defined by the class attribute. Like other attributes, the class
attribute describes a quality possessed by a record, but has been
selected by the user as being important enough for the machine
learning algorithm to use it to classify records. Depending on
the quality of the information discoverable in the dataset, the
classifier will more or less accurately classify future records
using the patterns it discovered. The accuracy of a classifier in
predicting the class value of future records is called ”prediction
accuracy”.
Example 1: A classifier might be built from a medical
dataset in order to accurately predict the diagnoses of future
patients. The attribute ”Diagnosis” would be selected as the
class attribute, and all other attributes would be analyzed by the
machine learning algorithm in order to find patterns that predict
the class attribute. Fig. 1a and Fig. 1d represent a toy dataset D and classifier Z_D (built from D), respectively. Rather than waiting for future patient records in order to test the prediction accuracy of the classifier, some (currently possessed) records are typically withheld from D and used as simulated future records. This is known as "testing data" and an example is dataset T in Fig. 1c. Since we know the class values for our testing data we can estimate that Z_D will be 100% accurate (in this example) in predicting the (unknown) class values of real future patient records. In this case the type of classifier
chosen is a decision tree, and we will continue to use decision
trees for demonstrative purposes. It should be noted that our
proposed measure (introduced below) should be applicable in
any scenario where prediction accuracy is also applicable. We
only use decision trees as a way of collecting realistic patterns
for our experiments as a proof of concept.
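As a concrete illustration of Example 1, the following is a minimal sketch that builds a decision tree from the toy dataset D of Fig. 1a and measures its prediction accuracy on the testing dataset T of Fig. 1c. It assumes scikit-learn's CART implementation as a stand-in for the classifier (so its binary splits need not reproduce Fig. 1d exactly), and the integer encoding of the Symptom attribute is purely illustrative.

    from sklearn.tree import DecisionTreeClassifier

    # Symptom encoded as: Muscle aches = 0, Cough = 1, Lump on body = 2 (illustrative).
    # Dataset D (Fig. 1a): features are [Age, Symptom]; the class attribute is Diagnosis.
    X_D = [[11, 0], [68, 0], [20, 1], [6, 2], [34, 2], [55, 1]]
    y_D = ["Flu", "Flu", "Flu", "Flu", "Cancer", "Cancer"]

    # Testing dataset T (Fig. 1c).
    X_T = [[30, 0], [29, 1], [16, 2], [83, 1]]
    y_T = ["Flu", "Flu", "Flu", "Cancer"]

    Z_D = DecisionTreeClassifier(random_state=0).fit(X_D, y_D)  # classifier built from D
    print("A(Z_D|T) =", Z_D.score(X_T, y_T))  # proportion of T classified correctly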
When anonymizing a dataset, a common approach for
evaluating the change in information quality is to compare
²A "dataset" is a two-dimensional table where rows represent independent records (tuples) and columns represent various attributes that describe the records and distinguish them from each other. See Fig. 1a for an example.
Record  Age  Symptom       Diagnosis
D1      11   Muscle aches  Flu
D2      68   Muscle aches  Flu
D3      20   Cough         Flu
D4      6    Lump on body  Flu
D5      34   Lump on body  Cancer
D6      55   Cough         Cancer
(a) Dataset D
Record  Age  Symptom       Diagnosis
D'1     11   Cough         Flu
D'2     47   Muscle aches  Flu
D'3     99   Cough         Flu
D'4     6    Lump on body  Flu
D'5     34   Lump on body  Cancer
D'6     55   Cough         Cancer
(b) Dataset D'
Record  Age  Symptom       Diagnosis
T1      30   Muscle aches  Flu
T2      29   Cough         Flu
T3      16   Lump on body  Flu
T4      83   Cough         Cancer
(c) Dataset T
(d) Decision tree Z_D: the root node tests Symptom. The "Muscle aches" branch ends in a Flu leaf; the "Cough" branch tests Age (≤38 → Flu, >38 → Cancer); the "Lump on body" branch tests Age (≤20 → Flu, >20 → Cancer).
(e) Decision tree Z_D': the same structure as Z_D, except that the Cough / Age > 38 branch is split further on Age (≤77 → Cancer, >77 → Flu).
Fig. 1. (d) and (e) are decision trees built from the toy datasets in (a) and (b), respectively. (b) is a distorted version of (a). The accuracy of both decision trees is tested on the data in (c). A decision tree is a graphical representation of some of the patterns (logic rules) existing in a dataset, where the patterns show the relationships between the non-class attributes and the class attribute. A decision tree "filters" records into stronger and more specific patterns the further the tree extends. These "filters" are defined by the nodes in the tree, where records are split into mutually exclusive and collectively exhaustive subsets of the dataset based on the attribute values tested in each node. The filtering process starts at the top (root) node and follows a path to a bottom (leaf) node. For each leaf node, the most common class value possessed by the records in the leaf is called the "majority class value". The class value of a future record is predicted to be the majority class value of the leaf node that the record was filtered into.
the prediction accuracy of two classifiers; one built from the
original dataset and one from the anonymized dataset [5]–[12].
Note that the use of a classifier for the purpose of discovering
information in data is independent from the anonymization (or
data cleansing) process. Measuring information quality with
a classifier is applicable even when the data was modified
without a classifier.
Example 1 (ongoing): Dataset D' in Fig. 1b is an example of a distorted version of D. The classifier in Fig. 1e, Z_D', is built from D', and its prediction accuracy is tested on T and found to be 75% (just as Z_D was tested on T). Thus the change in information quality when anonymizing D is 100% − 75% = 25%.
Unfortunately a comparison of prediction accuracies as in-
troduced above can have quite erratic results, with questionable
statistical significance. This is explored in depth in Section III.
In this study we therefore propose a different approach that
evaluates the information quality of an anonymized dataset by
examining how valid the patterns discovered in an original
dataset (i.e. the logic rules of a decision tree) are in the
anonymized dataset. A simple method for doing so: rather than measuring the accuracy of Z_D' on T and comparing it to the accuracy of Z_D on T, we compute the accuracy of Z_D on D'. That is, we see how well the distorted records fit the patterns discovered by the original classifier. This can be compared to how well the original dataset D fits the patterns in Z_D to give us a sense of how much the patterns have deteriorated. To differentiate this approach from the traditional "prediction accuracy", we refer to it as "pattern accuracy". A
full description and analysis can be found in Section IV.
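As a rough sketch of the two approaches being contrasted (assuming scikit-learn style classifiers; the names X_D, y_D, X_Dp, y_Dp and X_T, y_T for the original data, the distorted data and the testing data are illustrative):

    from sklearn.tree import DecisionTreeClassifier

    def prediction_accuracy(X_train, y_train, X_T, y_T):
        # Traditional approach: build a classifier from the (possibly distorted) data
        # and report its accuracy A(Z|T) on the withheld testing data.
        Z = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        return Z.score(X_T, y_T)

    def pattern_accuracy(X_D, y_D, X_Dp, y_Dp):
        # Proposed approach: build Z_D from the original data D only, then test how well
        # the distorted records D' still fit its patterns (A(Z_D|D')), alongside the
        # baseline A(Z_D|D).
        Z_D = DecisionTreeClassifier(random_state=0).fit(X_D, y_D)
        return Z_D.score(X_D, y_D), Z_D.score(X_Dp, y_Dp)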
To test our proposal, we artificially distort real-life datasets
gathered from the UCI Machine Learning Repository [14]. In
this context, distorting data is often described as adding "noise"
to the data. We apply two different kinds of noise, and explore
the effects of each on prediction accuracy and other metrics.
The types of noise used are introduced in Section II-B.
A. Notation
The following notation will be used throughout the paper.
D represents a dataset and D' is a distorted version of D. T is the testing dataset, made by extracting a random selection of records from D before building any trees. Di is the ith record in D. Z_D is a classifier built from D. A(Z_D|T) is the accuracy of Z_D when predicting the class values of the records in T. It represents the proportion of correct predictions, 0 ≤ A(Z_D|T) ≤ 1. Z_D' follows the same notation, as does A(Z_D|D') and any other combination of classifiers and datasets.
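In code, this notation corresponds to a simple helper of the following form (a minimal sketch, assuming a scikit-learn style classifier with a predict method and a (features, labels) representation of a dataset):

    def A(Z, features, labels):
        # A(Z|X): the proportion of records in X whose class value Z predicts correctly.
        predictions = Z.predict(features)
        correct = sum(p == c for p, c in zip(predictions, labels))
        return correct / len(labels)  # always between 0 and 1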
B. Our contributions
We logically and empirically argue that a comparison of A(Z_D|T) and A(Z_D'|T) is often inconclusive when evaluating the change in information quality of D and D'. Therefore, we argue that the widespread use of A(Z_D'|T) [5]–[11] to evaluate the information quality of D' (and therefore the anonymization technique that produces D') is questionable. We propose that measuring the pattern accuracy (that is, A(Z_D|D')) of an anonymized dataset can be more informative, both theoretically and empirically. We also propose that a careful analysis of A(Z_D'|D') and A(Z_D'|D) can provide additional insight into the information quality of D'.
Section II provides background information on relevant
previous work, the types of noise we implement for our tests,
and our experiment methodology. In Section III we discuss the weaknesses of the traditional approach that uses prediction accuracy (i.e. A(Z_D'|T)), and in Section IV we explore the usefulness of pattern accuracy (i.e. A(Z_D|D')). Section V provides concluding remarks.
TABLE I. DETAILS OF THE DATASETS USED IN OUR EXPERIMENTS. "MAJORITY CLASS VALUE PERCENTAGE" REFERS TO THE PERCENTAGE OF RECORDS THAT HAVE THE MOST COMMON CLASS VALUE. "AVERAGE C4.5 PREDICTION ACCURACY" IS THE AVERAGE PREDICTION ACCURACY OF THE TREES PRODUCED BY C4.5 USING 10-FOLD CROSS VALIDATION.

Name         Number of   Number of             Number of               Number of      Majority Class     Average C4.5
             Records     Numerical Attributes  Categorical Attributes  Class Values   Value Percentage   Prediction Accuracy
WBC          683         9                     0                       2              65%                95%
Vehicle      846         18                    0                       4              26%                69%
RedWine      1599        11                    0                       6              43%                58%
PageBlocks   5473        10                    0                       5              90%                97%
Credit       653         6                     9                       2              55%                85%
Statlog      1000        7                     13                      2              70%                73%
CMC          1473        8                     1                       3              43%                53%
Yeast        1484        7                     1                       10             31%                56%
Adult        30162       6                     5                       2              75%                86%
Mushroom     5644        0                     22                      6              62%                99%
Chess        28056       0                     6                       18             16%                44%
II. BACKGROUND INFORMATION
A. Previous work
Previous work has raised concerns about using prediction
accuracy as a metric for making comparisons: Lim, Loh and
Shih [12] found that when testing 33 classification algorithms
on 16 datasets, the addition of noise (distortions in the data)
made no statistically significant difference in the prediction accuracy (i.e. A(Z_D'|T)) of the algorithms.
Further research [15] supported the earlier findings: the addition of randomly generated noise had no significant effect on the prediction accuracy of the classifier built from the distorted data. However, an analysis of the structure of the
decision trees showed drastic changes, in both the sizes of
the trees (the number of nodes) and the attributes chosen in
the trees [15].
Despite these earlier concerning results, prediction accu-
racy remains a popular tool when measuring the effect of
anonymization on information quality [5]–[11]. In some cases,
prediction accuracy is the only metric used [7], [8], [10], [11].
B. Types of noise implemented
Anonymization is often achieved through the addition of
noise to the data, whereby values are distorted in a planned
way. This is usually done with the intent to keep the informa-
tion quality as high as possible while simultaneously meeting
privacy requirements. An aim of this study is to investigate
the ability of prediction accuracy (i.e. A(Z_D'|T)) to represent the information quality of a distorted dataset D', and thereby
evaluate the underlying noise addition (i.e. data distortion)
techniques. Hence, we implement two very simple types of
noise that distort the data in different ways:
1) Random Noise (RN): A user-defined percentage of all
values in the dataset are changed. A value of an attribute
is changed to any other valid value of the attribute (but not
the value it already had). This applies to both numerical and
categorical attributes. Clearly this is quite extreme noise, and our tests focus on percentages no higher than 30%. The user-defined percentage of noise is denoted as a two-digit suffix, such as "RN-02" for 2% noise, where there is a 2% chance that any given attribute value will be changed.
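A minimal sketch of RN is given below. It assumes the dataset is held as a list of records (lists of values) with the class attribute in a known column; drawing replacement values from the set of observed values of an attribute is a simplification of "any other valid value", particularly for numerical attributes.

    import random

    def random_noise(records, noise_rate, class_index=-1, rng=None):
        # Return a copy of `records` with RN applied at the given rate (e.g. 0.02 for RN-02).
        rng = rng or random.Random()
        n_attrs = len(records[0])
        class_index = class_index % n_attrs
        # Observed domain of each attribute, used as the pool of replacement values.
        domains = [sorted(set(r[a] for r in records)) for a in range(n_attrs)]
        noisy = [list(r) for r in records]
        for record in noisy:
            for a in range(n_attrs):
                if a == class_index:
                    continue  # the class attribute is never distorted
                if rng.random() < noise_rate:
                    choices = [v for v in domains[a] if v != record[a]]
                    if choices:
                        record[a] = rng.choice(choices)  # any other value of the attribute
        return noisy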
2) Domain-restricted Random Noise (DRRN): DRRN, in-
spired by LINFA [16], applies noise in a similar way to RN.
It differs in that for each record, the changed values remain
within the domains defined by the rule the record obeys in the
classifier. In the context of classification a "rule" is the formal
description of a pattern. In a decision tree, any shortest path
from the root node to a leaf node represents a rule.
Example 1 (ongoing): In Fig. 1a, D3's value for "Age" will always remain between 0 and 38 after applying DRRN, since that is the domain it must stay within in order to continue obeying the same rule in Fig. 1d (that is, the leaf reached via Symptom = Cough and Age ≤ 38).
Categorical attributes are not modified by DRRN so we do
not apply DRRN to datasets with only categorical attributes. If
an attribute is not used for a record’s rule, the distorted values
for that attribute can be anywhere within the attribute’s total
domain, just as is the case for RN.
Both DRRN and RN exclude the class attribute from the
noise addition process. At no point is the testing data T
distorted. RN and DRRN are only used for demonstration
purposes in our experiments, and are not recommended for
any purposes beyond that.
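A simplified sketch of DRRN for numerical attributes follows. It assumes the caller has already extracted, for each record, the interval each attribute must stay within to keep obeying its rule (e.g. Age in (0, 38] for D3 in Fig. 1d); extracting those intervals from the decision tree is omitted here.

    import random

    def drrn_record(record, rule_domains, attr_ranges, noise_rate, rng=None):
        # record: the numerical, non-class values of one record.
        # rule_domains: {attribute index: (low, high)} for attributes tested in the record's rule.
        # attr_ranges: {attribute index: (low, high)} total domains, used when the rule does
        # not constrain an attribute (in which case the behavior matches RN).
        rng = rng or random.Random()
        noisy = list(record)
        for a, value in enumerate(record):
            if rng.random() < noise_rate:
                low, high = rule_domains.get(a, attr_ranges[a])
                noisy[a] = rng.uniform(low, high)  # the new value stays inside the rule's domain
        return noisy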
C. Experiment methodology
The experiments in this paper are conducted using real-life
datasets obtained from the UCI Machine Learning Repository
[14]. Their details can be found in Table I. In scenarios in-
volving classifiers (such as measuring the prediction accuracy
of a classifier), we use 10-fold cross validation. In scenarios
involving the distortion of a dataset (such as evaluating how
distorted a dataset is), we repeat the test 5 times to negate the
unpredictability of random noise. Thus in scenarios where a
classifier is built from noisy data, the test is repeated 5 times
on each fold for a total of 50 tests per dataset. All reported
measurements are the average from these tests.
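A sketch of this protocol is given below, assuming numerically encoded attributes, scikit-learn's CART decision tree as a stand-in for the classifier used in the experiments, and an add_noise function (such as the RN sketch earlier) that returns a randomly distorted copy of the non-class values.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    def run_tests(X, y, add_noise, noise_rate, folds=10, repeats=5, seed=0):
        # 10-fold cross validation, with 5 noisy repetitions per fold (50 tests in total).
        X, y = np.asarray(X), np.asarray(y)
        results = []
        for train_idx, test_idx in KFold(folds, shuffle=True, random_state=seed).split(X):
            # Z_D is built from the clean training fold and is unaffected by the noise level.
            Z_D = DecisionTreeClassifier(min_samples_leaf=2, random_state=seed).fit(X[train_idx], y[train_idx])
            for _ in range(repeats):
                X_Dp = np.asarray(add_noise(X[train_idx], noise_rate))  # distorted copy D' (class values untouched)
                Z_Dp = DecisionTreeClassifier(min_samples_leaf=2, random_state=seed).fit(X_Dp, y[train_idx])
                results.append({
                    "A(Z_D|T)": Z_D.score(X[test_idx], y[test_idx]),    # traditional baseline
                    "A(Z_D'|T)": Z_Dp.score(X[test_idx], y[test_idx]),  # traditional comparison
                    "A(Z_D|D')": Z_D.score(X_Dp, y[train_idx]),         # proposed pattern accuracy
                })
        return results  # all reported measurements are averages over these tests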
We run all tests on the C4.5 classification algorithm [13].
We use the default parameters for C4.5: m = 2, c = 25%, where m is the minimum number of records that at least 2 of each node's child nodes must have, and c is the confidence level for pruning³. We also set the minimum gain ratio required to split a node at 0.01. It was possible that with some experimentation, parameters could be tailored to each dataset to elicit "better" trees (whether that is based on tree size, prediction accuracy or some other user-defined criteria). However, in these experiments we focus on simplicity and consistency, and so we apply the same parameters to all datasets.
³Pruning is a post-processing technique applied to decision trees in order to minimize over-fitting. It involves the deletion of unnecessary leaf nodes, thus making the tree smaller [2], [13].
Fig. 2. The difference between A(Z_D'|T) and A(Z_D|T) when applying noise with (a) RN and (b) DRRN. Negative values represent how many percentage points worse A(Z_D'|T) is. Calculating A(Z_D'|T) − A(Z_D|T) has the effect of standardizing all the datasets around zero on the y axis. Z_D is unaffected by RN and DRRN and so A(Z_D|T) always reports the same result. Each line represents the results from a dataset. The color of the lines and the marker shapes are random. A dotted line signifies a dataset where all the attributes are numerical; a solid line signifies a dataset with both numerical and categorical attributes; and a dashed line signifies a dataset with only categorical attributes.
III. LIMITATIONS OF THE CURRENT APPROACH
In Fig. 2a we present A(Z_D'|T) − A(Z_D|T) for different levels of Random Noise (RN). If A(Z_D'|T) really represents the information quality of D' then an obvious expectation is that A(Z_D'|T) should be less than A(Z_D|T) (i.e. A(Z_D'|T) − A(Z_D|T) should be negative) when RN is high, since the information quality of D' is almost certainly inferior to that of D when a large percentage of D has been substituted with random values to make D'. Surprisingly, a few datasets even experience increased prediction accuracy, or remain relatively constant. Datasets such as WBC, PageBlocks and Mushroom report very similar results at RN-30 (see Fig. 2a) to what they do at RN-00 (zero distortion). Note that RN-30 guarantees that nearly a third of the dataset is distorted with random values. Clearly this indicates a serious flaw in the use of A(Z_D'|T) as an evaluation of the information quality of D' (and therefore as an evaluation of any underlying anonymization technique).
In Fig. 2a and Fig. 2b, we see that the prediction accuracy of Z_D' (i.e. A(Z_D'|T)) is quite erratic when compared to the accuracy of Z_D, regardless of the percentage of noise incurred by RN or DRRN. There is no predictable pattern between any two points in Fig. 2a or Fig. 2b for most datasets. That is, A(Z_D'|T) is often as likely to increase as it is to decrease
when the noise level is incremented higher. The variety of
possible results when encountering very simple forms of noise
demonstrates that it would be difficult to ascertain the cause
for changes in prediction accuracy in real-life scenarios. The
erratic nature of the changes in prediction accuracy as noise
increases also makes it difficult to extrapolate the results
beyond what was empirically tested.
Fig. 3a and Fig. 3b support the findings of Fig. 2a and
Fig. 2b. These figures present the win/draw/loss results for
the number of times the difference between A(Z_D|T) and A(Z_D'|T) is positive/zero/negative. Each dataset is tested 50 times for three different noise levels. The results of a two-tailed sign test are provided in each bar, representing the probability of the reported results occurring by chance. In Fig. 3a we can see that once the level of distortion reaches RN-14, A(Z_D|T) > A(Z_D'|T) for most datasets. However, the results are still statistically inconclusive (at the 0.05 level) for WBC, Vehicle and Credit even when 14% of the values are randomly changed, and they become even less conclusive as the noise level decreases. In Fig. 3b,
prediction accuracy remains inconclusive for the majority of
datasets at DRRN-02, DRRN-08 and DRRN-14.
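For reference, the two-tailed sign test can be computed as in the following sketch (assuming SciPy; discarding draws before computing the binomial probability is the usual convention for a sign test and is an assumption about the exact procedure used here).

    from scipy.stats import binomtest

    def sign_test(accuracies_D, accuracies_Dp):
        # Wins: tests where A(Z_D|T) > A(Z_D'|T); losses: the reverse; draws are discarded.
        wins = sum(a > b for a, b in zip(accuracies_D, accuracies_Dp))
        losses = sum(a < b for a, b in zip(accuracies_D, accuracies_Dp))
        draws = len(accuracies_D) - wins - losses
        n = wins + losses
        p_value = binomtest(wins, n, p=0.5, alternative="two-sided").pvalue if n else 1.0
        return wins, draws, losses, p_value  # p_value: chance of a result at least this extreme under H0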
We argue that when an anonymization technique is pro-
posed, the ability of the technique to preserve the original
information should not be solely evaluated with A(Z_D'|T).
Fig. 3. Sign test results comparing A(Z_D|T) to A(Z_D'|T) at three noise levels using (a) RN and (b) DRRN. The number of wins, draws and losses from the 50 tests conducted on each dataset are plotted on the X axis. The probability of the reported results occurring by chance if one assumes the null hypothesis is correct (i.e. A(Z_D|T) = A(Z_D'|T)) is provided in each bar.
A(Z_D'|T) can only report whether any patterns found in D' explain T, not whether the same patterns found in D are also in D'. This weakness applies to any set of patterns, regardless of how they were discovered. Our experiments further demonstrate that A(Z_D'|T) can have very erratic behavior when the patterns are discovered with a decision tree algorithm.
Our findings are consistent with previous work [12], [15],
and are concerning given how often prediction accuracy is
used to confirm the preservation of information quality after
anonymization, particularly with decision trees [5]–[11]. If
prediction accuracy is to be used to compare classifiers, we
recommend careful consideration of the results, as well as
being explicit about the limitations of prediction accuracy:
even when statistically significant, A(Z_D'|T) is only useful
in assessing if the classifier can predict future records. To
make any broader claims about the quality of the dataset
is misleading. However using additional metrics can provide
a more robust evaluation of the information quality of the
anonymized dataset. Our proposed evaluation technique (as
explained in the next section) for measuring pattern accuracy
is one such metric.
IV. OUR PROPOSED APPROACH
Anonymization techniques aim to distort a dataset in such a way that each individual's privacy is preserved, while the information quality of the data remains high. In the context of Data Mining, the information sought after in the data is patterns that explain the data. The ability of an anonymization technique to maintain information quality is often evaluated through A(Z_D'|T) [5]–[12]. However, Section III discusses the limitations of A(Z_D'|T) in evaluating the techniques. In this section we propose A(Z_D|D') for evaluating the presence of the patterns discovered by Z_D in D'. If we find that A(Z_D|D) and A(Z_D|D') are similar then we can conclude that the patterns in D (discovered by Z_D) are present in D' even after anonymization. This is therefore a direct approach to measure the level of pattern preservation in an anonymized dataset. Note that A(Z_D|D') is a variation on prediction accuracy in which, rather than evaluating how well Z_D predicts the class values in T (i.e. A(Z_D|T)), we instead evaluate how well it predicts class values in D'.
Fig. 4. The difference between A(Z_D|D') and A(Z_D|D) when applying noise with RN. See Fig. 2's description for information.
Taking this approach, we produce Fig. 4. We can see that in the case of RN, A(Z_D|D') (i.e. pattern accuracy) decreases smoothly as the level of noise increases. For every pair of noise levels, for every dataset, we find that A(Z_D|D') decreases consistently as noise increases. Additionally, when testing RN, we find that A(Z_D|D) is higher than (that is, wins against) A(Z_D|D') in 49 to 50 tests out of 50, for all datasets, at all noise levels. This gives us strong confidence in the ability of A(Z_D|D') to evaluate the preservation of the patterns of D (discovered by Z_D) in D'. Clearly the current approach of using A(Z_D'|T) is much less reliable than our proposed approach.
We find that in the case of DRRN, A(Z_D|D') is always equal to A(Z_D|D) (that is, they draw 50 out of 50 times for all datasets at all noise levels). This is the expected result
since DRRN was designed to guarantee (by definition) that all
records follow the same pattern (logic rule) before and after
anonymization.
These results are much more definitive than the results for A(Z_D'|T) as presented in Fig. 2a, Fig. 2b, Fig. 3a and Fig. 3b. A(Z_D|D') is neither too erratic nor at risk of being statistically insignificant.
A. Supplementary measures
We also present two more supplementary variations of prediction accuracy to aid the evaluation of the information quality of D': A(Z_D'|D') and A(Z_D'|D). From A(Z_D'|D') we learn how well the patterns discovered by Z_D' exist in D'. Generally, a high A(Z_D'|D') value should indicate high information quality in D', since Z_D' was built from D'. However, we also need to check whether the patterns in Z_D' explain D well. We can check this through A(Z_D'|D). A high A(Z_D'|D) suggests that the patterns in Z_D' are a true reflection of the patterns in D. Therefore, a high A(Z_D'|D') and A(Z_D'|D) together indicate that the patterns in D exist in D', and thus that the information quality of D' is high.
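A minimal sketch of these two supplementary measures (assuming scikit-learn style classifiers; the names X_D, y_D for D and X_Dp, y_Dp for D' are illustrative):

    from sklearn.tree import DecisionTreeClassifier

    def supplementary_measures(X_D, y_D, X_Dp, y_Dp):
        Z_Dp = DecisionTreeClassifier(random_state=0).fit(X_Dp, y_Dp)  # classifier built from D'
        a_dp_dp = Z_Dp.score(X_Dp, y_Dp)  # A(Z_D'|D'): do the patterns of Z_D' explain D'?
        a_dp_d = Z_Dp.score(X_D, y_D)     # A(Z_D'|D): are those patterns a true reflection of D?
        # A high A(Z_D'|D') combined with a low A(Z_D'|D) is the misleading case discussed
        # below: D' supports patterns of its own that do not exist in the original data D.
        return a_dp_dp, a_dp_d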
Upon testing A(Z_D'|D') and A(Z_D'|D) with RN, we find that both evaluations decrease for most datasets, meaning that Z_D' is decreasingly representative of the patterns not only in D, but also in D'. This is a clear indication that the machine learning algorithm is unable to find any good patterns in the data when the RN noise level is high, as expected.
In some cases (specifically Vehicle, Statlog and CMC), A(Z_D'|D') increases or remains high while A(Z_D'|D) decreases. From this, we can deduce that the patterns found in D' explain D' well, but are not present in D. If an anonymized dataset with this characteristic (high A(Z_D'|D') and low A(Z_D'|D)) was used to replace D in a real-life scenario, it could be very misleading due to the high accuracy reported by A(Z_D'|D'). If D' replaces D, the invalidity of the patterns in Z_D' (previously Z_D) would be impossible to detect. If the decision to replace D with D' was based solely on the result of A(Z_D'|T), Vehicle, Statlog and CMC could all appear acceptable (see Fig. 2a). However, evaluating A(Z_D|D'), A(Z_D'|D') and A(Z_D'|D) clearly shows them to have very poor information quality at high RN noise levels.
Some of these evaluations may be more useful than others
depending on the user’s needs, but their similarity to the
traditional prediction accuracy makes them straightforward
to implement and computationally simple (O(n)). Together,
these measurements provide a more complete picture of the
information quality of data than any single measurement is
capable of.
V. CONCLUSION
In this study we present an approach to evaluate the information quality of an anonymized dataset. Our first suggestion is to use A(Z_D|D') since it clearly indicates whether or not the patterns in D (discovered by Z_D) exist in D'. A(Z_D'|D') and A(Z_D'|D) can also provide a user with additional insight. Our empirical analysis strongly supports this approach. However, we also understand that there can be a number of patterns in D other than those discovered by Z_D. It is important to check whether those patterns exist in D' before coming to a conclusion about the information quality of D'. The other patterns in D can be extracted through the use of a decision forest F_D (instead of a single tree) [8], [9], which is a set of decision trees {Z_D^1, Z_D^2, ..., Z_D^f}. One then needs to calculate A(Z_D^1|D'), A(Z_D^2|D'), ..., A(Z_D^f|D') to check whether all the patterns in D exist in D'. Techniques such as Frequent Pattern Analysis could also be applied first (if they are not already being applied) to filter out the uninteresting or less useful patterns that might otherwise act as noise when calculating A(Z_D|D') [17].
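A rough sketch of this forest-based extension, assuming a simple bagged ensemble of scikit-learn CART trees as a stand-in for the decision forest F_D (the variable names are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def forest_pattern_accuracies(X_D, y_D, X_Dp, y_Dp, f=10, seed=0):
        # Build f trees on bootstrap samples of D, then score each one on the anonymized D'.
        rng = np.random.default_rng(seed)
        X_D, y_D = np.asarray(X_D), np.asarray(y_D)
        accuracies = []
        for i in range(f):
            sample = rng.integers(0, len(X_D), size=len(X_D))  # bootstrap sample of D
            Z_Di = DecisionTreeClassifier(random_state=i).fit(X_D[sample], y_D[sample])
            accuracies.append(Z_Di.score(X_Dp, y_Dp))  # A(Z_D^i|D')
        return accuracies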
Our logical and empirical analysis of prediction accuracy (i.e. A(Z_D'|T)) raises serious doubts about its traditional usage for information quality analysis. Having demonstrated its capacity for fallibility, we intend to research the performance of prediction accuracy in specific scenarios, such as on datasets
made k-anonymous and l-diverse. Another important question
is whether classifiers known to be more robust to noise
than decision trees (such as k-Nearest Neighbor Classifier)
report good prediction accuracy results because of properties
possessed by the classifiers, or rather properties possessed by
prediction accuracy.
REFERENCES
[1] N. L. Henry, "Knowledge Management: A New Concern for Public Administration," Public Administration Review, vol. 34, no. 3, pp. 189–196, 1974.
[2] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006.
[3] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.
[4] M. G. Rahman and M. Z. Islam, "Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques," Knowledge-Based Systems, vol. 53, pp. 51–65, Sep. 2013.
[5] C. C. Aggarwal, "On Unifying Privacy and Uncertain Data Models," in 2008 IEEE 24th International Conference on Data Engineering, Apr. 2008, pp. 386–395.
[6] C. C. Aggarwal and P. S. Yu, "On static and dynamic methods for condensation-based privacy-preserving data mining," ACM Transactions on Database Systems, vol. 33, no. 1, pp. 1–39, Mar. 2008.
[7] B. Fung, K. Wang, and P. Yu, "Top-down specialization for information and privacy preservation," in Proceedings of the 21st International Conference on Data Engineering. IEEE, 2005, pp. 205–216.
[8] B. Fung, K. Wang, and P. Yu, "Anonymizing classification data for privacy preservation," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711–725, 2007.
[9] M. Nergiz and C. Clifton, "Thoughts on k-anonymization," Data & Knowledge Engineering, vol. 63, no. 3, pp. 622–645, 2007.
[10] K. Wang, P. Yu, and S. Chakraborty, "Bottom-Up Generalization: A Data Mining Solution to Privacy Protection," in Fourth IEEE International Conference on Data Mining. IEEE, 2004, pp. 249–256.
[11] K. Wang, B. Fung, and P. Yu, "Template-based privacy preservation in classification problems," in Fifth IEEE International Conference on Data Mining. IEEE, 2005, p. 8.
[12] T. Lim, W. Loh, and Y. Shih, "A Comparison of Prediction Accuracy, Complexity and Training Time of Thirty-three Old and New Classification Algorithms," Machine Learning, vol. 40, no. 3, pp. 203–228, 2000.
[13] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. Morgan Kaufmann, 1993.
[14] K. Bache and M. Lichman, "UCI Machine Learning Repository," Irvine, CA, 2013. [Online]. Available: http://archive.ics.uci.edu/ml/
[15] M. Z. Islam, P. Barnaghi, and L. Brankovic, "Measuring Data Quality: Predictive Accuracy vs. Similarity of Decision Trees," in Proceedings of the 6th International Conference on Computer & Information Technology, vol. 2, Dhaka, Bangladesh, 2003, pp. 457–462.
[16] M. Z. Islam and L. Brankovic, "Privacy preserving data mining: A noise addition framework using a novel clustering technique," Knowledge-Based Systems, vol. 24, no. 8, pp. 1214–1223, 2011.
[17] H. Cheng, X. Yan, J. Han, and C.-W. Hsu, "Discriminative Frequent Pattern Analysis for Effective Classification," in 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007, pp. 716–725.
k-Anonymity is a method for providing privacy protection by ensuring that data cannot be traced to an individual. In a k-anonymous dataset, any identifying information occurs in at least k tuples. To achieve optimal and practical k-anonymity, recently, many different kinds of algorithms with various assumptions and restrictions have been proposed with different metrics to measure quality. This paper evaluates a family of clustering-based algorithms that are more flexible and even attempts to improve precision by ignoring the restrictions of user-defined Domain Generalization Hierarchies. The evaluation of the new approaches with respect to cost metrics shows that metrics may behave differently with different algorithms and may not correlate with some applications’ accuracy on output data.