
Targeted Data Swapping and K-Means Clustering for Healthcare Data Privacy and Usability

Author: Kato Mivule, kmivule@nsu.edu, Department of Computer Science, Norfolk State University
Published in: Int'l Conf. Health Informatics and Medical Systems | HIMS'17 | ISBN: 1-60132-459-6, CSREA Press ©

Abstract

Healthcare data regulations and laws require that entities protect patient privacy when managing healthcare records that contain personally sensitive and revealing information. However, during the process of implementing data confidentiality on healthcare records, the usability of the privatized healthcare data is reduced as sensitive information is removed or transformed. Realizing equilibrium between the need for healthcare records privacy and the usability of the privatized dataset is problematic. In this study, an analysis of healthcare records privacy and usability using targeted data swapping and K-Means clustering is done. Focus is placed on healthcare data, with the main problem being how such data can be transacted securely and confidentially while retaining data usability. The question probed is whether healthcare data, such as the heart disease dataset, can be securely and confidentially shared while maintaining data usability.
Keywords: Healthcare Data; Privacy; Usability; K-Means
I. INTRODUCTION
Healthcare data policies necessitate that organizations
handling healthcare related data safeguard patient privacy
when transacting in healthcare records. The basic feature in
such a data privacy process is to strip the information of any
sensitive or revealing patient data. However, during this
privacy procedure the usability of the privatized healthcare
data decreases as useful information that might be sensitive
is suppressed or altered. Further compounding this problem
is that realizing equilibrium between the need for healthcare
records privacy and the usability of the privatized dataset is
intractable [1]. Attention in this study is placed on healthcare data. The key problem examined is how healthcare data can be transacted securely and confidentially between a patient and a healthcare entity while maintaining the usability of such data. In this study, data confidentiality and usability are investigated using targeted data swapping, with K-Means clustering, an unsupervised learning method, used as the assessment [2]. Rather than applying data swapping to every record in the dataset, attributes with sensitive information are selected for swapping; the records of a particular patient x are then swapped with those of patient y, making it difficult to reconstruct the original data.
II. BACKGROUND
Data swapping is a data privacy algorithm proposed by Dalenius and Reiss (1978) that consists of a swapping of items in a variable within the same dataset while conserving the original statistical traits of the data in that variable [3][4]. Data swapping techniques maintain the original statistical traits of the data and have been employed as a data privacy technique by the US Census Bureau [5].
The 2k data swapping model: The 2k basic data swapping procedure is defined as follows [6]. Given an N x V data matrix, where N is the total number of records, V is the total number of variables, Xj is the jth variable in the data matrix, and the ith record is xi = (xi1, xi2, ..., xiV), let k represent the number of elementary swaps. A True Swap is the complete exchange of every value in a variable. 2k data swapping can therefore be described as an exchange of 2k values in relation to the k elementary swaps, that is, a random selection of two records i and j from a variable that are then swapped [6]. The Swap Rate is the fraction of the N records to be swapped, 2k/N. The Swap Pair is the pair of values to be swapped, (i, j). The post-2k-swapping scenario is best shown as follows [7]: if the values i and j of a variable X1 are swapped, then the post-swap condition for the ith and jth records will be (xj1, xi2, ..., xiV) and (xi1, xj2, ..., xjV).
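As an illustration, the 2k swapping procedure above can be sketched in Python (a hypothetical helper, not the paper's implementation):

```python
import random

def two_k_swap(column, k, seed=None):
    """Perform k elementary swaps (2k values exchanged) on a copy of
    `column`. The swap rate is 2k/N for a column of N records."""
    rng = random.Random(seed)
    swapped = list(column)
    for _ in range(k):
        i, j = rng.sample(range(len(swapped)), 2)   # pick a swap pair (i, j)
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

ages = [63, 67, 67, 37, 41, 56, 62, 57]
privatized = two_k_swap(ages, k=3, seed=1)
# The multiset of values -- and hence the variable's univariate
# statistics -- is preserved, while record linkage is perturbed.
assert sorted(privatized) == sorted(ages)
```

Because values only change position within the same variable, means, variances, and frequencies of that variable are unchanged, which is the statistical-preservation property noted above.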
K-Means: an unsupervised machine learning method that groups similar values together into clusters by computing the distance between values and a central point k, and grouping all items that are close to that central point. The k value is tunable; in other words, the user can select the number of central points around which they expect values to accumulate. K-Means employs the Euclidean distance in computing the distances between values and a central mean value, and is formally articulated in the following formula [1]:
$\mathrm{distance}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$    (1)
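Equation (1) is the familiar Euclidean distance; a minimal illustration (hypothetical helper name, not from the paper):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two records x and y, per Equation (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

euclidean((0, 0), (3, 4))  # -> 5.0
```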
The Davies Bouldin index (DBI): The DBI is a metric used to gauge how well a clustering algorithm performs, and is formally noted as follows [2]:

$DBI = \frac{1}{n} \sum_{i=1}^{n} D_i$    (2)

where

$D_i = \max_{j \neq i} R_{i,j}$    (3)
and

$R_{i,j} = \frac{S_i + S_j}{M_{i,j}}$    (4)

The symbol $R_{i,j}$ is a measure of how good the clustering is. The symbols $S_i$ and $S_j$ are the observed within-cluster distances of clusters i and j. The symbol $M_{i,j}$ is the distance between clusters i and j.
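For concreteness, Equations (2)-(4) can be computed directly. The sketch below assumes one-dimensional points already grouped into clusters, with the within-cluster scatter taken as the average absolute distance to the centroid; scikit-learn's `davies_bouldin_score` offers a library implementation:

```python
def davies_bouldin(clusters):
    """DBI per Equations (2)-(4); `clusters` is a list of lists,
    one inner list of 1-D values per cluster."""
    centroids = [sum(c) / len(c) for c in clusters]
    # S_i: average distance of points to their own centroid
    scatter = [sum(abs(p - m) for p in c) / len(c)
               for c, m in zip(clusters, centroids)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # D_i = max over j != i of R_ij = (S_i + S_j) / M_ij  (Eqs. 3, 4)
        total += max((scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
                     for j in range(n) if j != i)
    return total / n   # Equation (2)

# Tight, well-separated clusters give a low (good) DBI:
davies_bouldin([[0.0, 1.0], [10.0, 11.0]])  # -> 0.1
```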
III. METHODOLOGY
The following steps are followed in the implemented
heuristic. (i) In the first phase of the heuristic, personally identifiable information (PII) and any sensitive patient data are removed from the dataset. (ii) In the second phase, attributes with sensitive data are selected for the data swapping. (iii) In the third phase, 2k random swapping is done until a true swap is achieved. Random swapping involves arbitrarily selecting values for swapping, with the goal of concealing all records in the dataset [8].
However, in this study, swapping is targeted at specific
attributes. (iv) In the fourth phase, K-Means clustering is
used to gauge how well the privatized dataset compares to
the original. In this case, the Davies Bouldin Index (DBI) is
used to examine the performance of the clustering.
Figure 1: The Targeted Data Swapping Model
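The four phases above can be sketched end-to-end as follows. The column names and the `privatize` helper are hypothetical, and the dataset is represented as a plain dict of column lists rather than the RapidMiner workflow used in the study:

```python
import random

def privatize(records, pii_cols, swap_cols, k, seed=0):
    """records: dict mapping attribute name -> list of values (one per patient)."""
    rng = random.Random(seed)
    # (i) remove PII and sensitive identifier columns
    out = {col: list(vals) for col, vals in records.items() if col not in pii_cols}
    # (ii) target only the selected attributes for swapping
    for col in swap_cols:
        vals = out[col]
        # (iii) 2k random swapping on the targeted attribute
        for _ in range(k):
            i, j = rng.sample(range(len(vals)), 2)
            vals[i], vals[j] = vals[j], vals[i]
    # (iv) the caller then clusters `out` and the original with K-Means
    #      and compares the Davies Bouldin Index of the two runs
    return out

records = {"patient_id": [1, 2, 3, 4],
           "age": [63, 37, 41, 56],
           "chol": [233, 250, 204, 236]}
private = privatize(records, pii_cols=["patient_id"],
                    swap_cols=["age", "chol"], k=2)
```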
IV. EXPERIMENT
In the experiment done in this study, the heart disease
dataset from the UCI repository is used [9]. The dataset
contains 14 attributes and 303 instances. The 14 attributes
include features such as the age, sex, chest pain type, resting
electrocardiographic results, serum cholesterol, fasting blood
sugar, and resting blood pressure, among others [9], used to capture the heart state of the patient and to predict the diagnosis of heart disease using the number label attribute. The class values range from 0 to 4, with 0 indicating no presence of heart disease and 1 through 4 indicating the presence and level of risk, 4 being the highest. The first step was to search for and remove any PII. The next
step was to select the attributes for the data swap. In this
case, 13 attributes were selected for data swapping. The
Patient ID was not selected for swapping. The notion here is that the records of patient x are swapped with those of patient y. Another factor is that data swapping has to be applied on a case-by-case basis: in this study, all 13 features are necessary in determining whether or not the patient has an indication of heart disease. A privatized dataset was then generated after the data swap. The next phase of the experiment was to test for data usability, that is, how usable such a dataset would be to any other medical or research entity. The K-Means algorithm was implemented in the RapidMiner environment [10]. In the next section, an
exploration of preliminary results from the study is given.
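Although the clustering was run inside RapidMiner [10], the usability check, clustering one attribute such as age and then comparing per-cluster item counts before and after swapping, can be sketched with a small standalone Lloyd's-iteration routine (illustrative only, not the paper's setup):

```python
from collections import Counter

def kmeans_1d(points, k, iters=50):
    """Plain Lloyd's K-Means on 1-D values; returns a cluster label per point."""
    step = max(1, len(points) // k)
    centroids = sorted(points)[::step][:k]          # spread initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                              # keep old centroid if empty
                centroids[c] = sum(members) / len(members)
    return labels

ages = [29, 34, 35, 51, 52, 54, 63, 67, 70, 71]
counts = Counter(kmeans_1d(ages, k=3))               # items per cluster
```

Swapping ages between records changes which points fall in each cluster, which is exactly the kind of shift in per-cluster counts reported in Table 1.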
V. RESULTS
The results shown in Figure 1 depict the K-Means clustering results of the original heart disease dataset before applying the data-swapping algorithm for privacy.
Figure 1: K-Means clustering results of the original data
Figure 2: K-Means clustering before data swapping 12 attributes
The x-axis in Figure 1 represents the four clusters that are
unlabeled and would correspond to the four classes (1 to 4)
in the number label attribute, indicating the presence or
absence of heart disease. The y-axis in this case indicates the
age used as a basis to cluster and show which patients might
or might not have heart disease. Each dot in the cluster
represents the value or number of items in that cluster as
shown more elaborately in Table 1. Figure 2 corresponds to
Figure 1 and shows the clustering results before data
swapping is implemented.
Figure 3: K-Means clustering results after data swapping
Figure 4: K-Means clustering results after data swapping 12 attributes
Each dotted color represents an attribute value used in the
determination of the clustering process. Figures 3 and 4, respectively, present the clustering results of the privatized data after data swapping. It can be seen clearly that the number of data items in cluster 3 of the privatized data is reduced, as further elaborated in Table 1. Only five dots are
shown in cluster 3. This corresponds to Figure 4 that shows
cluster 3 with fewer items than in the original data as shown
in Figure 2. Table 1 shows the cluster model performance
results.
Figure 5: Distribution of items in clusters before privacy
Figure 6: Distribution of items in clusters after privacy
Figure 5 and Figure 6 further highlight this point by showing
the distribution of items in the original clusters as compared
with the clusters from the privatized dataset. For instance,
cluster 1 has 96 items in the original data, while cluster 1 has
only 5 items in the privatized data. This further demonstrates
that data swapping distorts the statistical distribution of data.
Additionally, the study was interested in how efficiently K-Means clusters the data. It is clear in Table 1 that the clusters contain 96, 96, 55, and 56 items for clusters 0, 1, 2, and 3, respectively, before data swapping. However, after applying data privacy using the data swapping algorithm, the number of items in each cluster changes to 78, 5, 93, and 127 for clusters 0, 1, 2, and 3, respectively. This would be an
indication that while data swapping might have provided a layer of privacy protection for the patient records, distortion is noticeable in the privatized result. In this case, cluster 1 originally had 96 items but is reduced to 5. This could be problematic for a researcher or medical practitioner using the privatized dataset: the original value could have indicated a group of patients with an indication of heart disease, while the privatized data could indicate fewer. Furthermore, the performance vector shown in Table 1 shows the variation between the average within-centroid distances in the clusters. For instance, cluster 1 shows that this value is 61.071 in the original dataset, while in the privatized dataset, that is, after data swapping, it grows to 115.56. Yet again the results indicate the intractability between privacy needs and usability. In such cases, trade-offs would need to be considered. As data privacy researchers have noted, perfect privacy can only be attained by disseminating nothing at all, which would indicate no usability; on the other hand, perfect data usability can be achieved by disseminating data exactly as it is received, yet this would indicate no privacy [11].
Table 1: Cluster Performance Results
Furthermore, the results shown in Table 1 include the Davies Bouldin Index (DBI), which shows how well K-Means performed in clustering. A lower DBI indicates better clustering, while a higher DBI value indicates poorer clustering; the lowest possible value is 0. In Table 1, it is shown that the DBI result for the original dataset, before applying data swapping, is 0.092, which interestingly is higher than the DBI result for the privatized dataset after applying data swapping, at a low of 0.082. Given the distortion of the data after data swapping, it was expected that the DBI value of the privatized data might be higher. However, the results show that it is important that other factors be considered when analyzing the performance of the datasets.
VI. CONCLUSION
Preliminary results from this study indicate that data privacy could be implemented on healthcare data using targeted data swapping. With the exponential rise of big data, data swapping becomes a suitable data privacy mechanism for implementing confidentiality. However, it should be combined with initial data privacy sanitization, such as the removal of PII and other sensitive data, to reduce the chances of reconstruction attacks. Moreover, data swapping of items in big datasets makes it difficult for an attacker to reconstruct the full identity of a patient record without prior information.
Nevertheless, as the results indicated, data swapping distorts
the values in the privatized dataset. Finding equilibrium between data privacy needs and data usability requirements necessitates trade-offs and remains a challenge in need of further study and research. Future work includes pursuing
investigative studies in various healthcare records and how
such data can be privately shared while maintaining a
satisfactory level of usability.
ACKNOWLEDGMENT
A special thanks to the Department of Computer Science
at Norfolk State University for assistance with this research.
REFERENCES
[1] K. Mivule, "An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge," Ph.D. dissertation, Bowie State University, 2014.
[2] K. Mivule and B. Anderson, "A study of usability-aware network trace anonymization," in 2015 Science and Information Conference (SAI), 2015, pp. 1293-1304.
[3] T. Dalenius and S. P. Reiss, "Data-swapping: A technique for disclosure control," J. Stat. Plan. Inference, vol. 6, no. 1, pp. 73-85, 1982.
[4] T. Dalenius and S. P. Reiss, "Data-swapping: A technique for disclosure control (extended abstract)," in American Statistical Association, Proceedings of the Section on Survey Research Methods, 1978, pp. 191-194.
[5] S. E. Fienberg and J. McIntyre, "Data swapping: Variations on a theme by Dalenius and Reiss," in Privacy in Statistical Databases, 2004, pp. 14-29.
[6] W. E. Winkler, "Masking and re-identification methods for public-use microdata: Overview and research problems," in Privacy in Statistical Databases, 2004, pp. 231-246.
[7] S. P. Reiss, "Practical data-swapping: the first steps," ACM Trans. Database Syst., vol. 9, no. 1, pp. 20-37, 1984.
[8] P. J. Lavrakas, Encyclopedia of Survey Research Methods. Thousand Oaks, CA: SAGE, 2008.
[9] A. Janosi, W. Steinbrunn, M. Pfisterer, and R. Detrano, "Heart Disease Dataset," UCI Machine Learning Repository, [http://archive.ics.uci.edu/ml], 2013.
[10] M. Hofmann and R. Klinkenberg, Eds., RapidMiner: Data Mining Use Cases and Business Analytics Applications. Chapman & Hall/CRC, 2013.
[11] C. Dwork, "Differential Privacy," in Automata, Languages and Programming (ICALP), LNCS vol. 4052, pp. 1-12, 2006.