Abstract – Healthcare data regulations and laws require
that entities protect patient privacy when managing
healthcare records that contain personally sensitive and
revealing information. However, in the process of
implementing data confidentiality on healthcare records,
the usability of the privatized data decreases as
sensitive information is removed or transformed. Striking a
balance between the need for healthcare records privacy
and the usability of the privatized dataset is difficult.
This study analyzes healthcare records privacy and
usability using targeted data swapping and K-Means
clustering. The focus is on healthcare data, and the
main problem is how such data can be
transacted securely and confidentially while retaining its
usability. The question being probed is whether healthcare
data such as the heart disease dataset can be securely and
confidentially shared while maintaining data usability.
Keywords: Healthcare Data; Privacy; Usability; K-Means
I. INTRODUCTION
Healthcare data policies require that organizations
handling healthcare-related data safeguard patient privacy
when transacting in healthcare records. The basic step in
such a data privacy process is to strip the records of any
sensitive or revealing patient data. However, during this
privacy procedure the usability of the privatized healthcare
data decreases as useful information that might be sensitive
is suppressed or altered. Compounding the problem,
achieving equilibrium between the need for healthcare
records privacy and the usability of the privatized dataset is
intractable [1]. This study focuses on healthcare
data. The key problem examined is how
healthcare data can be transacted securely and
confidentially between a patient and a healthcare entity
while maintaining the usability of such data. The study
investigates data confidentiality and usability using
targeted data swapping, with K-Means clustering, an
unsupervised learning method, as the assessment [2]. Rather
than swapping every record in the dataset,
attributes with sensitive information are selected for data
swapping, and the values of those attributes for patient x
are swapped with those of patient y, making it difficult to
reconstruct the original data.
II. BACKGROUND
Data swapping is a data privacy technique proposed by
Dalenius and Reiss (1978) that swaps
items within a variable of the same dataset while conserving
the original statistical traits of the data in that variable [3]
[4]. Because data swapping maintains the original statistical
traits of the data, it is employed as a data privacy
technique by the US Census Bureau [5].
The 2k data swapping model: The following is the
definition of the basic 2k data swapping procedure [6].
Given an N x V data matrix, where N is the total number of
records and V is the total number of variables, let
Xj be the jth variable in the data matrix, let xi be the ith
record such that xi = (xi1, xi2,…,xiv), and let k
be the number of elementary swaps. The True Swap is
the complete exchange of every value in the variable.
2k data swapping can then be described as an
exchange of 2k values through k elementary swaps,
where each elementary swap randomly selects two records i
and j from a variable and exchanges their values [6]. Swap
Rate: the swap rate is the fraction of the N records to be
swapped, 2k/N. Swap Pair: the swap pair is the pair of
records whose values are to be swapped, (i, j). The post-swap
state is best shown as follows [7]: if the values of variable
X1 for records i and j are swapped, then the ith and
jth records become (xj1, xi2,…,xiv) and (xi1, xj2,…,xjv).
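The 2k procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name swap_2k, the seed, and the sample ages are hypothetical, and the 2k records are drawn without replacement so that each record takes part in at most one elementary swap.

```python
import random

def swap_2k(column, k, rng):
    """Apply k elementary swaps to one variable.

    Returns the swapped column, the list of swap pairs (i, j),
    and the swap rate 2k/N.
    """
    values = list(column)                 # copy; the original stays untouched
    n = len(values)
    if 2 * k > n:
        raise ValueError("2k cannot exceed the number of records N")
    # Draw 2k distinct record indices and pair them up as (i, j).
    chosen = rng.sample(range(n), 2 * k)
    pairs = list(zip(chosen[:k], chosen[k:]))
    for i, j in pairs:
        values[i], values[j] = values[j], values[i]
    return values, pairs, 2 * k / n

rng = random.Random(0)                    # fixed seed, for repeatability only
ages = [63, 67, 37, 41, 56, 62, 57, 53, 44, 52]   # illustrative values
swapped, pairs, rate = swap_2k(ages, k=3, rng=rng)

print(pairs)                              # the three swap pairs (i, j)
print(rate)                               # 2k/N = 6/10
print(sorted(swapped) == sorted(ages))    # the set of values is preserved
```

Because values are only exchanged within the variable, its marginal distribution (mean, variance, frequencies) is unchanged; what changes is the link between a value and the record that holds it.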
K-Means: an unsupervised machine learning
method that groups similar values together in clusters by
computing the distance between each value and a set of
central points, and assigning each item to the central
point closest to it. The number of central points, k, is
tunable; in other words, the user can
select the number of central points around which they expect
values to accumulate. K-Means employs the
Euclidean distance in computing the distances between
values and a central mean value, formally articulated
in the following formula [1]:
$$\mathrm{distance}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$
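To make the role of the Euclidean distance concrete, the following is a minimal sketch of K-Means in plain Python. It is not the RapidMiner operator used in the study; the function names, the choice of starting centroids (k randomly chosen data points), the iteration cap, and the sample points are all assumptions made for illustration.

```python
import math
import random

def euclidean(x, y):
    """Equation (1): square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, k, rng, iterations=100):
    """Basic K-Means; returns (centroids, labels)."""
    centroids = rng.sample(points, k)     # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:                   # an empty cluster keeps its centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels

rng = random.Random(1)                    # fixed seed, for repeatability only
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),     # illustrative group A
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]     # illustrative group B
centroids, labels = k_means(points, k=2, rng=rng)
print(labels)       # first three points share one label, last three the other
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful, which is why cluster sizes rather than cluster labels are compared before and after privatization.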
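The Davies Bouldin Index (DBI): The DBI is a metric used to
gauge how well a clustering algorithm performs. The DBI
is formally noted as follows [2], with N here denoting the
number of clusters:
$$DBI = \frac{1}{N} \sum_{i=1}^{N} D_i \qquad (2)$$
where
$$D_i = \max_{j : i \neq j} R_{i,j} \qquad (3)$$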
Kato Mivule
kmivule@nsu.edu
Department of Computer Science
Norfolk State University
Targeted Data Swapping and K-Means Clustering for
Healthcare Data Privacy and Usability
Int'l Conf. Health Informatics and Medical Systems | HIMS'17 |
77
ISBN: 1-60132-459-6, CSREA Press ©
and
$$R_{i,j} = \frac{S_i + S_j}{M_{i,j}} \qquad (4)$$
The symbol Ri,j measures how similar clusters i and j are,
so smaller values indicate better clustering. The symbols Si
and Sj are the observed distances inside clusters i and j.
The symbol Mi,j is the distance between clusters i and j.
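Equations (2) to (4) can be checked on a small example. The sketch below is an illustration rather than the RapidMiner computation: it assumes that Si is the mean Euclidean distance from the members of cluster i to their centroid and that Mi,j is the Euclidean distance between the two centroids, which is a common reading of the definitions. The function names and the toy clusters are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def davies_bouldin(clusters):
    """clusters: a list of at least two non-empty lists of points."""
    centers = [centroid(c) for c in clusters]
    # S_i (assumed): mean distance from each member to its own centroid.
    scatter = [sum(euclidean(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # D_i = max over j != i of R_ij = (S_i + S_j) / M_ij   (Eqs. 3, 4)
        total += max((scatter[i] + scatter[j])
                     / euclidean(centers[i], centers[j])
                     for j in range(n) if j != i)
    return total / n                      # Equation (2)

far_apart = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]    # M = 10, S = 1 each
close = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]          # M = 4,  S = 1 each
print(davies_bouldin(far_apart))   # (1 + 1) / 10 = 0.2
print(davies_bouldin(close))       # (1 + 1) / 4  = 0.5
```

Moving the same two clusters closer together raises the index from 0.2 to 0.5, which matches the reading that a lower DBI indicates better separated clusters.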
III. METHODOLOGY
The implemented heuristic follows four phases.
(i) In the first phase, personally
identifiable information (PII) and any sensitive patient data
are removed from the dataset. (ii) In the second phase,
attributes with sensitive data are selected for
data swapping. (iii) In the third phase, 2k random
swapping is done until a true swap is achieved. Random
swapping involves arbitrarily selecting values for swapping,
with the goal of concealing all records in the dataset [8].
In this study, however, swapping is targeted at specific
attributes. (iv) In the fourth phase, K-Means clustering is
used to gauge how well the privatized dataset compares to
the original, with the Davies Bouldin Index (DBI)
used to examine the performance of the clustering.
Figure 1: The Targeted Data Swapping Model
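Phases (i) to (iii) of the heuristic can be sketched as follows. This is one possible reading of "2k random swapping until a true swap is achieved", assuming that a true swap means no record keeps its own values for the targeted attributes; it is not the author's implementation. The function name targeted_swap, the field names (name, id, age, chol), and the sample records are hypothetical.

```python
import random

def targeted_swap(records, pii_fields, target_fields, rng):
    """Return a privatized copy of `records` (a list of dicts)."""
    # Phase (i): drop the PII fields from every record.
    cleaned = [{k: v for k, v in r.items() if k not in pii_fields}
               for r in records]
    n = len(cleaned)
    if n < 2:
        return cleaned                    # nothing to swap with
    # source[i]: index of the record whose target values row i now holds.
    source = list(range(n))
    # Phase (iii): elementary swaps (i, j) until no row keeps its own values.
    while True:
        unswapped = [i for i in range(n) if source[i] == i]
        if not unswapped:
            break
        i = unswapped[0]
        j = rng.choice([idx for idx in range(n) if idx != i])
        source[i], source[j] = source[j], source[i]
    # Phase (ii) fixes which fields move: only the targeted attributes.
    privatized = []
    for i, row in enumerate(cleaned):
        new_row = dict(row)
        for field in target_fields:
            new_row[field] = cleaned[source[i]][field]
        privatized.append(new_row)
    return privatized

rng = random.Random(7)                    # fixed seed, for repeatability only
patients = [                              # made-up records for illustration
    {"name": "A", "id": 1, "age": 41, "chol": 204},
    {"name": "B", "id": 2, "age": 52, "chol": 231},
    {"name": "C", "id": 3, "age": 63, "chol": 286},
    {"name": "D", "id": 4, "age": 37, "chol": 250},
    {"name": "E", "id": 5, "age": 58, "chol": 199},
]
private = targeted_swap(patients, {"name"}, ["age", "chol"], rng)
for row in private:
    print(row)
```

In this sketch the targeted attributes move together from one patient to another, so each column keeps its original set of values while the non-targeted identifier stays in place. Phase (iv) would then cluster both the original and the privatized tables and compare the results.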
IV. EXPERIMENT
In the experiment, the heart disease
dataset from the UCI repository is used [9]. The dataset
contains 14 attributes and 303 instances. The 14 attributes
include features such as age, sex, chest pain type, resting
electrocardiographic results, serum cholesterol, fasting blood
sugar, and resting blood pressure, among others [9], used to
capture the heart state of the patient and predict the
diagnosis of heart disease through the number label attribute.
The class values are 0, 1, 2, 3, and 4, with 0 indicating no
presence of heart disease and 1, 2, 3, and 4 indicating
presence and level of risk, 4 being the highest. The first step
was to search for and remove any PII. The next
step was to select the attributes for the data swap. In this
case, 13 attributes were selected for data swapping. The
Patient ID was not selected for swapping, the notion
being that the records of patient x are swapped with those of
patient y. Data swapping also has to be
applied on a case-by-case basis. In this study, all 13
features are necessary in determining whether the
patient has an indication of heart disease. A privatized
dataset was then generated after the data swap. The next
phase of the experiment was to test for data usability, that
is, how usable such datasets would be to another medical or
research entity. The K-Means algorithm was implemented using
the RapidMiner environment [10]. The next section
explores preliminary results from the study.
V. RESULTS
Figure 1 depicts the K-Means
clustering results of the original heart disease dataset before
the data swapping algorithm is applied for privacy.
Figure 1: K-Means clustering results of the original data
Figure 2: K-Means clustering before data swapping – 12 attributes
The x-axis in Figure 1 represents the four clusters, which are
unlabeled and would correspond to the four classes (1 to 4)
in the number label attribute, indicating the presence or
absence of heart disease. The y-axis indicates the
age, used as a basis to cluster and show which patients might
or might not have heart disease. Each dot in a cluster
represents the value or number of items in that cluster, as
detailed in Table 1. Figure 2 corresponds to
Figure 1 and shows the clustering results before data
swapping is implemented.
Figure 3: K-Means clustering results after data swapping
Figure 4: K-Means clustering results after data swapping – 12 attributes
Each dotted color represents an attribute value used in the
determination of the clustering process. Figures 3 and 4
present the clustering results of the privatized data
after data swapping. The number of data items in cluster 3
of the privatized data is clearly
reduced, as further elaborated in Table 1. Only five dots are
shown in cluster 3. This corresponds to Figure 4, which shows
cluster 3 with fewer items than in the original data shown
in Figure 2. Table 1 shows the cluster model performance
results.
Figure 5: Distribution of items in clusters before privacy
Figure 6: Distribution of items in clusters after privacy
Figure 5 and Figure 6 further highlight this point by showing
the distribution of items in the original clusters as compared
with the clusters from the privatized dataset. For instance,
cluster 1 has 96 items in the original data, while cluster 1 has
only 5 items in the privatized data. This further demonstrates
that data swapping distorts the statistical distribution of data.
Additionally, the study was interested in how effectively
K-Means clusters the data. Table 1 shows that the clusters
contain 96, 96, 55, and 56 items for clusters 0, 1, 2, and
3, respectively, before data swapping. However, after applying
data privacy using the data swapping algorithm, the number of
items in each cluster changes to 78, 5, 93, and 127 for
clusters 0, 1, 2, and 3, respectively. This would be an
indication that while data swapping might have provided a
layer of privacy protection for the patient record,
distortion is noticeable in the privatized result. In this case,
cluster 1 originally had 96 items but is reduced to 5. This
could be problematic for a researcher or medical practitioner
using the privatized dataset. The original value could have
indicated a group of patients with an indication of heart
disease while the privatized data could indicate fewer.
Furthermore, the performance vector shown in Table 1 shows
the variation in the average within-centroid distance in the
clusters. For instance, for cluster 1 this value is 61.071 in
the original dataset, while in the privatized dataset, that
is, after data swapping, it grows to 115.56. Yet again the
results indicate the intractability of balancing privacy
needs and usability. In such cases, trade-offs would need to be
considered. As data privacy researchers have noted, perfect
privacy can only be attained by disseminating nothing at all,
which would indicate no usability; on the other hand, perfect
data usability can be achieved by disseminating data exactly
as it is received, yet this would indicate no privacy [11].
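The size of the shift reported above can be summarized with a short calculation on the cluster counts and within-centroid distances quoted in the text. The total variation distance used here is an illustrative summary chosen for this sketch, not a measure reported in the study, and the variable names are hypothetical.

```python
before = [96, 96, 55, 56]     # cluster sizes 0..3, original data
after = [78, 5, 93, 127]      # cluster sizes 0..3, after data swapping

# Both partitions cover all 303 instances of the dataset.
total_before, total_after = sum(before), sum(after)

# Share of the records that falls in each cluster.
shares_before = [c / total_before for c in before]
shares_after = [c / total_after for c in after]

# Total variation distance between the two cluster-size distributions:
# 0 means identical shares, 1 means completely disjoint.
tvd = 0.5 * sum(abs(a - b) for a, b in zip(shares_before, shares_after))

# Growth of the average within-centroid distance for cluster 1.
ratio = 115.56 / 61.071

print(total_before, total_after)   # 303 303
print(round(tvd, 2))               # 0.36
print(round(ratio, 2))             # 1.89
```

Under this summary, roughly a third of the records would have to change cluster for the two size distributions to match, and the within-centroid distance of cluster 1 nearly doubles. Since K-Means cluster numbers are arbitrary, the comparison follows the cluster numbering used in Table 1.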
Table 1: Cluster Performance Results
Furthermore, the results shown in Table 1 include the
Davies Bouldin Index (DBI), which shows how well
K-Means performed in clustering. A lower DBI indicates
better clustering, while a higher DBI value indicates poor
clustering. The DBI value is normalized between 0 and 1. In
Table 1, the DBI result for the original
dataset before applying data swapping is 0.092, which
interestingly is higher than the DBI result for the privatized
dataset after applying data swapping, at a low of 0.082. Due
to the distortion of the data after data swapping, it was
expected that the DBI value of the privatized data might be
higher. The results therefore show that it is important that
other factors be considered when analyzing the performance
of the datasets.
VI. CONCLUSION
Preliminary results from this study indicate that data
privacy could be implemented on healthcare data using
targeted data swapping. With the exponential rise of big
data, data swapping becomes a suitable data privacy
mechanism for implementing confidentiality. However, this
should be combined with initial data privacy sanitization,
such as the removal of PII and other sensitive data, to reduce
the chances of reconstruction attacks. Data swapping of
items in big datasets is still advantageous in that it makes
it difficult for an attacker to reconstruct the full identity
of a patient record in such datasets without prior information.
Nevertheless, as the results indicated, data swapping distorts
the values in the privatized dataset. Finding equilibrium
between data privacy needs and data usability requirements
necessitates trade-offs, and remains a challenge in need of
further study and research. Future work includes pursuing
investigative studies on various healthcare records and how
such data can be privately shared while maintaining a
satisfactory level of usability.
ACKNOWLEDGMENT
A special thanks to the Department of Computer Science
at Norfolk State University for assistance with this research.
REFERENCES
[1] K. Mivule, “An Investigation Of Data Privacy And Utility Using
Machine Learning As A Gauge,” Bowie State University, 2014.
[2] K. Mivule and B. Anderson, “A study of usability-aware network
trace anonymization,” in 2015 Science and Information
Conference (SAI), 2015, no. February, pp. 1293–1304.
[3] T. Dalenius and S. P. Reiss, “Data-swapping: A technique for
disclosure control,” J. Stat. Plan. Inference, vol. 6, no. 1, pp. 73–
85, 1982.
[4] T. Dalenius and S. P. Reiss, “Data-swapping: A technique
for disclosure control (extended abstract),” in American
Statistical Association, Proceedings of the Section on Survey
Research Methods, 1978, pp. 191–194.
[5] S. E. Fienberg and J. McIntyre, “Data swapping: Variations on a
theme by Dalenius and Reiss,” in Privacy in Statistical
Databases, 2004, pp. 14–29.
[6] W. E. Winkler, “Masking and re-identification methods for
public-use microdata: Overview and research problems,” in
Privacy in Statistical Databases, 2004, pp. 231–246.
[7] S. P. Reiss, “Practical data-swapping: The first steps,” ACM
Trans. Database Syst., vol. 9, no. 1, pp. 20–37, 1984.
[8] P. J. Lavrakas, Encyclopedia of Survey Research Methods.
Thousand Oaks, California: SAGE, 2008.
[9] M. Lichman, A. Janosi, W. Steinbrunn, M. Pfisterer, and R.
Detrano, “Heart Disease Dataset - UCI Machine
Learning Repository,” UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml, 2013.
[10] M. Hofmann and R. Klinkenberg, RapidMiner: Data Mining
Use Cases and Business Analytics Applications. Chapman &
Hall/CRC, 2013.
[11] C. Dwork, “Differential Privacy,” Autom. Lang. Program., vol.
4052, pp. 1–12, 2006.