Non-interactive Privacy-preserving k-NN Classifier

Hilder V. L. Pereira¹ and Diego F. Aranha¹
¹Institute of Computing, University of Campinas, Campinas, SP, Brazil
{hilder.vitor, dfaranha}@gmail.com
Keywords: Privacy-preserving classification, Order-preserving encryption, Homomorphic encryption.
Abstract: Machine learning tasks typically require large amounts of sensitive data to be shared, which is notoriously
intrusive in terms of privacy. Outsourcing this computation to the cloud requires the server to be trusted,
introducing a non-realistic security assumption and high risk of abuse or data breaches. In this paper, we pro-
pose privacy-preserving versions of the k-NN classifier which operate over encrypted data, combining order-
preserving encryption and homomorphic encryption. According to our experiments, the privacy-preserving
variant achieves the same accuracy as the conventional k-NN classifier, but considerably impacts the original
performance. However, the performance penalty is still viable for practical use in sensitive applications when
the additional security properties provided by the approach are considered. In particular, the cloud server does
not need to be trusted beyond correct execution of the protocol and computes the algorithm over encrypted
data and encrypted classes. As a result, the cloud server never learns the real dataset values, the number of
classes, the query vectors or their classification.
1 INTRODUCTION
With corporations and governments becoming more
intrusive in their data collection and surveillance ef-
forts, and the recurrent data breaches observed in the
last years, the cloud paradigm faces multiple chal-
lenges to remain as the computing model of choice
for privacy-sensitive applications. The low operat-
ing costs and high availability of storage capacity and
computational power may not look as attractive after
the risks of outsourcing computation and data storage
are considered. There are no formal guarantees that
the cloud provider is not behaving in abusive or intru-
sive ways, or even that the infrastructure is protected
against external attacks. Different legal regimes and
governmental influence introduce further complica-
tions to the problem and may shift responsibilities
in unclear ways. After the Snowden revelations, the
long-term financial impact from the current crisis in
cloud provider trust is estimated between 35 and 180
billion dollars in 2016 in the US only (Miller, 2014).
A solution proposed to reconcile these issues consists in computing over encrypted data. In this model, data is encrypted with a property-preserving transformation (originally called a privacy homomorphism (Rivest et al., 1978)) that still allows some operations to be performed in the encrypted domain. Constructions which provide this feature and support an arbitrary number and type of operations (additions and multiplications) are called fully homomorphic encryption schemes and usually introduce a massive performance penalty. The more restricted partial or somewhat homomorphic encryption schemes impose an upper bound on the number and type of operations, with much improved performance figures. However, they require a redesign of the high-level algorithms to satisfy the restrictions and conserve most of the original performance and effectiveness, when compared to distributed approaches based on secure multiparty computation (Xiong et al., 2007).
Classically, computing over encrypted data was
applied to tallying secret votes in electronic elec-
tions (Hirt and Sako, 2000), but modern homomor-
phic encryption schemes may soon enable a host
of interesting privacy-preserving applications in the
fields of genomics, healthcare and intelligence gath-
ering (Naehrig et al., 2011). Natural applications ex-
tend to data mining (Lindell and Pinkas, 2009) and
machine learning (Gilad-Bachrach et al., 2016) which
involve considerable amounts of data to be shared
and manipulated in untrusted platforms. Following
this trend, some previous works in the literature have experimented with performing the task of classification over encrypted data. The problem is fundamental in itself and as a building block for several machine learning algorithms, and amounts to identifying to which of a set of categories a new sample belongs, based on previous observations whose membership is known (called the training set). Graepel et al. adapted simple
classification algorithms to work on encrypted data
using somewhat homomorphic encryption (Graepel
et al., 2012), while Bost et al. considered more com-
plex classifiers such as decision trees and Bayesian
inference for medical applications (Bost et al., 2015).
Other authors designed protocols for clustering en-
crypted data in the two-party setting (Jha et al., 2005).
The main contributions of this paper are privacy-
preserving variants of the k-NN classifier, both
unweighted and distance-weighted, combining two
main building blocks: homomorphic encryption and
order-preserving encryption. The protocols are non-
interactive and ideally suited to cloud computing en-
vironments, where storage and computation are del-
egated from a low-power device to powerful cloud
servers. Experimental results demonstrate that there
is no substantial accuracy loss in performing the clas-
sification task in a privacy-preserving way. Our secu-
rity analysis claims security in the semi-honest (also
called honest-but-curious) threat model, although the
drawbacks from adopting order-preserving encryp-
tion for efficiency restrict the application scenarios to
computing over private encrypted databases with no
publicly available knowledge about the distribution of
data. To the best of our knowledge, this is the first
proposal in the scientific literature for non-interactive
cloud-based k-NN classification over encrypted data.
The paper is organized as follows. In Section 2, we recall the conventional k-NN classifier and define the problem of privacy-preserving classification. Section 3 presents the basic building blocks of our approach: order-preserving encryption (OPE) and homomorphic encryption (HE). The proposed algorithms are discussed in Section 4, through descriptions of the initialization, querying, processing and response procedures. Section 5 presents the experimental results of evaluating the plain k-NN and privacy-preserving versions over 6 different datasets. In Section 6, we briefly enumerate the security properties offered by the scheme and the corresponding security assumptions. Related work is discussed in Section 7 and conclusions can be found in the final section.
2 PROBLEM STATEMENT
In this section, we define the classification prob-
lem the k-NN algorithm was designed to solve and
discuss a privacy-preserving variant of the problem,
compatible with a cloud-based centralized processing
model.
2.1 The k-NN classifier
The k-Nearest Neighbor (k-NN) classifier is a well-known non-parametric supervised method to classify an instance based on the classes of its nearest neighbors (Altman, 1992; Alpaydin, 2004). Each instance is represented by a vector in $\mathbb{R}^p$ and an associated label, called the instance’s class. A query vector is an unlabeled instance to be classified.
The k-NN classifier has a positive integer parameter $k$, which is the number of neighbors taken into account when classifying a new instance, and a function $d$ from $\mathbb{R}^p \times \mathbb{R}^p$ to $\mathbb{R}$, which determines the distance between two instances (the Euclidean distance is often used).
In order to classify a query vector $x \in \mathbb{R}^p$, the k-NN algorithm works as follows:
1. Find the $k$ nearest neighbors: among all the classified instances $u_1, u_2, \dots, u_n$, select the $k$ instances whose distances $d(u_i, x)$ are the smallest.
2. Assign a class: select the most frequent class among the $k$ nearest neighbors and assign it to $x$.
There is also a k-NN variant known as distance-weighted k-NN, or simply weighted k-NN. In this version, instead of assigning the most frequent class among the $k$ nearest neighbors, the inverse of the distance to the query vector is used as the vote’s weight for each of the $k$ neighbors.
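As an illustration only, the two conventional variants described above can be sketched in Python as follows. This is not the implementation evaluated in Section 5; the function name `knn_classify`, its parameters, and the tie-breaking rule (the first class to reach the maximum wins) are assumptions made for the sketch.

```python
import math
from collections import defaultdict

def knn_classify(train, labels, x, k, weighted=False):
    """Classify x from its k nearest training vectors (Euclidean distance)."""
    # Distance from x to every classified instance, paired with its index.
    dists = sorted((math.dist(u, x), i) for i, u in enumerate(train))
    votes = defaultdict(float)
    for d, i in dists[:k]:
        if weighted:
            # Distance-weighted vote: the inverse of the distance.
            # A neighbor at distance zero gets an infinite weight.
            votes[labels[i]] += 1.0 / d if d > 0 else math.inf
        else:
            # Unweighted vote: every neighbor counts once.
            votes[labels[i]] += 1.0
    # Class with the largest accumulated vote.
    return max(votes, key=votes.get)

train = [(0.0, 0.0), (3.0, 0.0), (3.1, 0.0)]
labels = [0, 1, 1]
query = (0.1, 0.0)
print(knn_classify(train, labels, query, k=3))                 # majority vote
print(knn_classify(train, labels, query, k=3, weighted=True))  # inverse-distance vote
```

With this toy data the two variants disagree: two of the three neighbors belong to class 1, but the single class-0 neighbor is much closer to the query and dominates the weighted vote.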
2.2 Privacy-preserving k-NN in the cloud
The privacy-preserving k-NN problem can be defined
as the problem of computing the k-NN on an untrusted
platform without revealing sensitive information such
as the data instances, their classes, the query vectors,
and the classification results. By untrusted platform
we mean a third party that holds the data in some form
but is not trusted by the data owner. In our scenario,
we assume a client-server model, in which the data
owner is the client and the server is the cloud service.
The client intends to store data in the cloud and pro-
cess it in a non-interactive way, which means that the
cloud will interact with the client only to receive the
data and the query vectors to be classified, but will
not communicate with the client during the process-
ing. In some applications, it may be the case that data
is already collected in the cloud in encrypted form, on
behalf of the client. It is also assumed that the client
is constrained in terms of computation and storage ca-
pabilities, but is capable of managing cryptographic
keys in a secure way.
The other possible scenario is distributed process-
ing: the client collaborates with the cloud (or the
other parties involved) by receiving and processing
data during the training phase or the classification.
We stress that the centralized model is more conve-
nient to the client and is the expected model when
referring to cloud services.
3 BASIC CONCEPTS
In this section, we present the two encryption schemes used as building blocks to guarantee the privacy-preserving property of our k-NN classifier: order-preserving encryption and homomorphic encryption.
3.1 Order-preserving encryption
“Order-preserving symmetric encryption (OPE) is a deterministic encryption scheme whose encryption function preserves numerical ordering of the plaintexts” (Boldyreva et al., 2011). In other words, given two plaintexts $m_1$ and $m_2$, and their corresponding ciphertexts $c_1$ and $c_2$ encrypted with an OPE scheme, the following implication is true:
$$m_1 \le m_2 \Rightarrow c_1 \le c_2.$$
Because this class of schemes works over finite sets of numerical values, it is sufficient to describe it using the set $D = \{1, 2, \dots, M\}$ as the message space (or domain) and the set $R = \{1, 2, \dots, N\}$ as the ciphertext space (also called range). OPE schemes are thus parametrized by two positive integer values $M$ and $N$, which represent the number of possible plaintexts and the number of possible ciphertexts, respectively. We define the OPE scheme as follows:
• OPE.KEYGEN(M, N, s) is a probabilistic algorithm that receives a secret string $s$ and the parameters $M, N \in \mathbb{N}$ such that $1 \le M \le N$, and returns the secret key $K$.
• OPE.ENC(K, m) is an algorithm that encrypts a plaintext $m \in D$ using the key $K$ and returns a ciphertext $c \in R$. When it is clear from the context, we may omit the encryption key and write simply OPE.ENC(m) for short.
• OPE.DEC(K, c) is an algorithm that decrypts a ciphertext $c \in R$ using the key $K$ and returns the corresponding plaintext $m \in D$. We may omit the decryption key and write simply OPE.DEC(c).
For a vector $w = (w_1, \dots, w_p)$, we define the component-wise encryption as a vector in which each component is encrypted with the same key:
$$\mathrm{OPE.ENC}(w) = (\mathrm{OPE.ENC}(w_1), \dots, \mathrm{OPE.ENC}(w_p)).$$
We define component-wise decryption similarly,
and refer to the space formed by the encrypted vectors
as encrypted space. If OPE is used to encrypt vectors,
then the order is maintained for each axis. Therefore,
it is very likely that each vector will have the same
neighborhood in the encrypted space. This fact is ex-
ploited to make it possible for the cloud to find the
nearest neighbors of a given encrypted vector.
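To make the interface above concrete, the following Python sketch builds a toy order-preserving mapping from $D$ to $R$. It is emphatically not the scheme of (Boldyreva et al., 2011) and offers no security: the key is simply a sorted table of $M$ distinct values drawn from $R$ with a non-cryptographic generator, which is only feasible for small $M$. All function names (`toy_ope_keygen`, `toy_ope_enc`, `toy_ope_dec`, `toy_ope_enc_vector`) are hypothetical.

```python
import bisect
import random

def toy_ope_keygen(M, N, s):
    """Return a toy key: a sorted table of M distinct ciphertexts from {1..N}."""
    if not 1 <= M <= N:
        raise ValueError("parameters must satisfy 1 <= M <= N")
    rng = random.Random(s)  # NOT cryptographically secure, illustration only
    return sorted(rng.sample(range(1, N + 1), M))

def toy_ope_enc(K, m):
    """Encrypt plaintext m in {1..M}; a larger m always maps to a larger ciphertext."""
    return K[m - 1]

def toy_ope_dec(K, c):
    """Decrypt by locating c in the sorted table."""
    i = bisect.bisect_left(K, c)
    if i == len(K) or K[i] != c:
        raise ValueError("not a valid ciphertext for this key")
    return i + 1

def toy_ope_enc_vector(K, w):
    """Component-wise encryption: every component uses the same key."""
    return tuple(toy_ope_enc(K, wi) for wi in w)

K = toy_ope_keygen(M=100, N=10_000, s="secret string")
print(toy_ope_enc_vector(K, (3, 50, 99)))  # ordering is kept on each axis
```

Because the table is sorted and its entries are distinct, the mapping is strictly increasing, so the implication in the definition of OPE holds by construction.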
3.2 Homomorphic Encryption
A cryptographic scheme is homomorphic for an operation if performing this operation over plaintexts is equivalent to performing it over ciphertexts. For instance, considering the multiplication operation, the product of two ciphertexts encrypted by a cryptosystem with homomorphic properties generates a third ciphertext that decrypts to the product of the two corresponding plaintexts.
In this work, we employed the Paillier cryptosystem (Paillier, 1999). The plaintext message space of this scheme is $\mathbb{Z}_n$, where $n$ is the product of two large primes $p$ and $q$. The Paillier cryptosystem is additively homomorphic, which means that given ciphertexts $c_0$ and $c_1$, corresponding to encryptions of two messages $m_0$ and $m_1$, it is possible to calculate a third ciphertext $c$ that decrypts to $m_0 + m_1$. Furthermore, the cryptosystem offers efficient multiplication between ciphertext and plaintext. More specifically, given a ciphertext $c_0$, corresponding to the encryption of $m_0$, and a plaintext message $m_1$, it is possible to efficiently compute a ciphertext $c$ that decrypts to $m_0 \cdot m_1 \in \mathbb{Z}_n$.
The scheme can be described using the following procedures:
• HE.KEYGEN(λ): choose two random primes $p$ and $q$ with equivalent bit lengths such that they ensure $\lambda$ bits of security. Set $n = pq$, $g = n + 1$, $\varphi = (p - 1)(q - 1)$, and $y = \varphi^{-1} \in \mathbb{Z}_n$. Return the private key $SK = (\varphi, y)$ and the public key $PK = (n, g)$.
• HE.ENC(PK, m): to encrypt $m \in \mathbb{Z}_n$, sample a random $r$ from $\mathbb{Z}_n^*$, compute $c = g^m r^n \bmod n^2$, and return $c$.
• HE.DEC(SK, c): to decrypt $c \in \mathbb{Z}_{n^2}$, calculate $x = c^{\varphi} \bmod n^2$, divide $(x - 1)$ by $n$, obtaining $x'$, and then return $(x' \cdot y \in \mathbb{Z}_n)$.
• HE.ADD(c_1, c_2): to add two ciphertexts homomorphically, return $(c_1 \cdot c_2 \bmod n^2)$.
• HE.PROD(c_1, m_2): to multiply a ciphertext $c_1$ by a plaintext $m_2$ homomorphically, return $(c_1^{m_2} \bmod n^2)$.
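The five procedures above translate almost line by line into Python. The sketch below is a toy under explicit assumptions: the primes are passed in directly and are far too small for any security, the private key also carries $n$ for convenience, the homomorphic operations take the public key as an extra argument, and all function names are hypothetical. It relies on the three-argument `pow` with a negative exponent, available from Python 3.8.

```python
import math
import random

def he_keygen(p, q):
    """Toy key generation from two given primes (insecure sizes, illustration only)."""
    n = p * q
    g = n + 1
    phi = (p - 1) * (q - 1)
    y = pow(phi, -1, n)          # inverse of phi modulo n
    return (phi, y, n), (n, g)   # SK carries n only for convenience

def he_enc(pk, m):
    """c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def he_dec(sk, c):
    """x = c^phi mod n^2, then ((x - 1) / n) * y mod n."""
    phi, y, n = sk
    x = pow(c, phi, n * n)
    return ((x - 1) // n * y) % n

def he_add(pk, c1, c2):
    """Ciphertext product decrypts to the sum of the plaintexts."""
    n, _ = pk
    return (c1 * c2) % (n * n)

def he_prod(pk, c1, m2):
    """Ciphertext raised to a plaintext decrypts to the product."""
    n, _ = pk
    return pow(c1, m2, n * n)

sk, pk = he_keygen(1009, 1013)   # small primes, for demonstration only
a, b = he_enc(pk, 5), he_enc(pk, 7)
print(he_dec(sk, he_add(pk, a, b)), he_dec(sk, he_prod(pk, a, 3)))
```

Encryption is probabilistic, so two encryptions of the same message almost always differ as integers while still decrypting to the same value.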
4 NON-INTERACTIVE k-NN OVER ENCRYPTED DATA
In this section we present our constructions for the privacy-preserving k-NN classifier operating over encrypted data stored in the cloud. We start by proposing our unweighted privacy-preserving classifier, which is divided into five subroutines: Encoding, Initialization, Querying, Processing and Response.
4.1 Unweighted k-NN
The unweighted privacy-preserving k-NN classifier can be described in terms of the following subroutines:
• Encoding: this procedure takes the class $\ell \in \mathbb{N}$ of an instance and a positive integer $\gamma$, and returns the integer $2^{\ell \cdot \gamma}$. For instance, fixing $\gamma = 64$, class 0 is mapped to $2^0 = 1$, class 1 is mapped to $2^{64}$, class 2 is mapped to $2^{128}$, and so on.
• Initialization: the client has $n$ vectors $(u_1, \dots, u_n) \in \mathbb{R}^{n \times p}$ classified with the corresponding classes $\ell_1, \dots, \ell_n \in \mathbb{N}$. Thus, each vector $u_i$ is labeled with a non-negative integer $\ell_i$. This subroutine encrypts the vectors $u_i$ using the component-wise encryption of OPE and encodes each class for encryption using the HE scheme. Note that this initialization step is executed by the client, as shown in Algorithm 1.
Algorithm 1 Initialization
1: Input: $(u_1, \dots, u_n) \in \mathbb{R}^{n \times p}$, $(\ell_1, \dots, \ell_n) \in \mathbb{N}^n$
2: $K$ = OPE.KEYGEN($M$, $N$, $s$)
3: $(SK, PK)$ = HE.KEYGEN($\lambda$)
4: for $i \leftarrow 1$ to $n$ do
5:     $v_i \leftarrow$ OPE.ENC($K$, $u_i$)
6:     ▷ Encoding and encryption.
7:     $c_i \leftarrow$ HE.ENC($PK$, $2^{\ell_i \cdot \gamma}$)
8: end for
9: Send $(v_1, \dots, v_n)$ and $(c_1, \dots, c_n)$ to the cloud.
• Querying: the client has a query vector $w \in \mathbb{R}^p$ to be classified. This vector is encrypted using the OPE scheme and then submitted to the cloud.
Algorithm 2 Querying
1: Input: $w \in \mathbb{R}^p$
2: $y \leftarrow$ OPE.ENC($w$)
3: Choose number of neighbors $k \in \mathbb{N}$ and submit $(y, k)$ to the cloud.
• Processing: the cloud classifies the encrypted query vector $y$ using the encrypted vectors $v_1, \dots, v_n$ and the encrypted classes $c_1, \dots, c_n$ by running Algorithm 3.
Algorithm 3 Processing
1: Input: Encrypted query vector $y$, $k \in \mathbb{N}$
2: for $j \leftarrow 1$ to $n$ do
3:     $d_j \leftarrow \| v_j - y \|$
4: end for
5: Compute the indexes $(i_1, \dots, i_k)$ of the $k$ smallest distances among $(d_1, \dots, d_n)$.
6: $class_y \leftarrow c_{i_1}$
7: for $j \leftarrow 2$ to $k$ do
8:     $class_y \leftarrow$ HE.ADD($class_y$, $c_{i_j}$)
9: end for
10: Return $class_y$ to the client.
Line 5 of the algorithm returns the indexes of the $k$ smallest distances. For instance, if the 1st, the 3rd and the 7th vectors were the three nearest vectors of $y$, then $d_1$, $d_3$, and $d_7$ would be the smallest distances and this function would return $(1, 3, 7)$.
• Response: the client receives a ciphertext $class_y$, decrypts it and extracts the class of the query vector $w$, as in Algorithm 4.
Algorithm 4 Response
1: Input: Encrypted class $class_y$
2: $c \leftarrow$ HE.DEC($SK$, $class_y$)
3: Compute the maximum coefficient $a_\ell$ from $c$ as a polynomial in base $2^{\gamma}$.
4: Assign class $\ell$ to the query vector.
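The cloud-side Processing step can be sketched in Python as below. This is a minimal illustration rather than the C++ implementation evaluated later: the function names are hypothetical, the homomorphic addition is passed in as a callable `he_add` (any function taking two ciphertexts), and squared Euclidean distances are used because they yield the same ordering as Euclidean distances.

```python
import heapq

def k_smallest_indices(distances, k):
    """Return the 1-based indexes of the k smallest distances, in increasing order."""
    idx = heapq.nsmallest(k, range(len(distances)), key=distances.__getitem__)
    return tuple(sorted(i + 1 for i in idx))

def process(enc_vectors, enc_classes, y, k, he_add):
    """Accumulate the encrypted classes of the k vectors nearest to y."""
    # Distances are computed directly on OPE ciphertexts (squared Euclidean).
    dists = [sum((a - b) ** 2 for a, b in zip(v, y)) for v in enc_vectors]
    idx = k_smallest_indices(dists, k)
    class_y = enc_classes[idx[0] - 1]
    for i in idx[1:]:
        class_y = he_add(class_y, enc_classes[i - 1])
    return class_y

# The 1st, 3rd and 7th distances are the smallest ones.
print(k_smallest_indices([0.1, 5.0, 0.2, 9.0, 8.0, 7.0, 0.3, 6.0], 3))
```

In a test setting, plain integer addition can stand in for `he_add`, which makes it easy to check the accumulation logic separately from the cryptography.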
The algorithm works correctly because when the server adds the encrypted classes in Algorithm 3, it is in fact accumulating how many times each class appeared among the $k$ nearest vectors. Since the $i$-th class is represented by the integer $2^{i \cdot \gamma}$, where $\gamma$ is the positive integer fixed in the Encoding procedure, the sum of the classes results in an integer $a_0 + a_1 2^{\gamma} + a_2 2^{2\gamma} + \dots + a_s 2^{s\gamma}$, where each coefficient $a_i$ represents the number of times that the $i$-th class appeared. Furthermore, since the classes are encrypted with an HE scheme, the cloud can add the ciphertexts of those classes and the resulting ciphertext will decrypt to the expected sum of the classes. To extract the class, only shifting and reducing the decrypted sum $c$ modulo $2^{\gamma}$ are required to obtain the largest coefficient.
A problem might arise if some of the coefficients $a_i$ were $2^{\gamma}$ or larger, because in this case the value of $a_i 2^{i \cdot \gamma}$ would be mixed with the other coefficients. However, notice that the number of neighbours $k$ is an upper bound to each $a_i$, and $k$ is typically small. Therefore, it is sufficient to choose a value for $\gamma$ such that $2^{\gamma} > k$.
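The packing argument can be checked on plain integers, leaving encryption aside. The sketch below writes `gamma` for the bit width of each counter and assumes the client knows the number of classes when decoding; the function names are hypothetical.

```python
def encode_class(label, gamma):
    """Map class `label` to the integer 2^(label * gamma)."""
    return 1 << (label * gamma)

def decode_votes(total, gamma, num_classes):
    """Recover each coefficient a_i by shifting and reducing modulo 2^gamma."""
    mask = (1 << gamma) - 1
    return [(total >> (i * gamma)) & mask for i in range(num_classes)]

def winning_class(total, gamma, num_classes):
    """Index of the largest coefficient (lowest index wins a tie)."""
    votes = decode_votes(total, gamma, num_classes)
    return max(range(num_classes), key=votes.__getitem__)

gamma = 4                            # 2^4 = 16 > k = 5, so counters cannot overflow
neighbor_classes = [2, 0, 2, 1, 2]   # classes of the k = 5 nearest neighbors
total = sum(encode_class(c, gamma) for c in neighbor_classes)
print(decode_votes(total, gamma, 3), winning_class(total, gamma, 3))
```

Here the sum equals $1 + 1 \cdot 2^{4} + 3 \cdot 2^{8}$, so the decoded counters are one vote each for classes 0 and 1 and three votes for class 2.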
4.2 Distance-weighted k-NN
Our proposal also allows a distance-weighted version of the k-NN classifier over encrypted data. The main difference between this version and the previous unweighted one is the way the encrypted classes are accumulated in the cloud.
• Initialization and Querying: identical to the unweighted version.
• Processing: to classify an encrypted query vector $y$ using the encrypted vectors $v_1, \dots, v_n$ and the encrypted classes $c_1, \dots, c_n$, the cloud finds the $k$ nearest neighbours, encodes the inverses of their distances into valid plaintexts and combines them to generate the encrypted assigned class. This procedure is shown in Algorithm 5.
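Algorithm 5 (Weighted) Processing
1: Input: Encrypted query vector $y$, $k \in \mathbb{N}$
2: for $j \leftarrow 1$ to $n$ do
3:     $d_j \leftarrow \| v_j - y \|$
4: end for
5: Compute the indexes $(i_1, \dots, i_k)$ of the $k$ smallest distances among $(d_1, \dots, d_n)$.
6: $W \leftarrow \frac{1}{d_{i_1}} + \frac{1}{d_{i_2}} + \dots + \frac{1}{d_{i_k}}$
7: $class_y \leftarrow$ HE.ENC($PK$, 0)
8: for $j \leftarrow 1$ to $k$ do
9:     $l \leftarrow i_j$
10:     $w \leftarrow \frac{1}{d_l \cdot W}$
11:     ▷ Accumulation of encoded class $w$.
12:     $z \leftarrow$ HE.PROD($c_l$, $2^{w \cdot \gamma}$)
13:     $class_y$ = HE.ADD($class_y$, $z$)
14: end for
15: Return $class_y$ to the client.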
• Response: the client receives the encrypted assigned class $class_y$, decrypts it and extracts the class, just like in the unweighted version.
This procedure works because each encrypted class $c_l$ is the encryption of some integer $2^{i \cdot \gamma}$, with $\gamma$ the positive integer used to encode the classes. In Algorithm 5, when we multiply homomorphically by the weight $w$ of the neighbor, we obtain an encryption of $w \cdot 2^{i \cdot \gamma}$. Since we add all those $k$ classes homomorphically, we have an encryption of some integer of the form $a_0 + a_1 2^{\gamma} + a_2 2^{2\gamma} + \dots + a_s 2^{s\gamma}$, where each $a_i$ represents the sum of the weights of all the neighbors with class $i$ among the $k$ nearest neighbors.
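The weighted accumulation can also be checked on plain integers. The sketch below is an assumption-laden illustration, not the encoding used by Algorithm 5: it turns each normalised weight into an integer by rounding `w * scale`, where `scale` is a hypothetical fixed-point factor, and it requires $2^{\gamma}$ to exceed the largest possible per-class sum so that the slots do not overlap. Multiplying and adding plaintexts stands in for HE.PROD and HE.ADD.

```python
def weighted_vote_total(neighbor_classes, neighbor_dists, gamma, scale):
    """Plaintext analogue of the weighted accumulation over the k neighbors."""
    W = sum(1.0 / d for d in neighbor_dists)
    total = 0
    for label, d in zip(neighbor_classes, neighbor_dists):
        w = 1.0 / (d * W)            # normalised weight; the k weights sum to 1
        w_int = round(w * scale)     # hypothetical integer encoding of the weight
        total += w_int * (1 << (label * gamma))  # stands in for HE.PROD then HE.ADD
    return total

def slot(total, i, gamma):
    """Accumulated integer weight of class i."""
    return (total >> (i * gamma)) & ((1 << gamma) - 1)

# One close neighbor of class 0 and two distant neighbors of class 1.
total = weighted_vote_total([0, 1, 1], [1.0, 4.0, 4.0], gamma=11, scale=1000)
print(slot(total, 0, 11), slot(total, 1, 11))
```

With these values class 0 accumulates a larger weight than class 1, even though class 1 holds the majority of the three neighbors.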
5 EXPERIMENTAL RESULTS
We implemented our versions of k-NN using the stateless OPE scheme presented in (Boldyreva et al., 2011) and the Paillier homomorphic cryptosystem (Paillier, 1999)¹. The Paillier cryptosystem was instantiated at the 80-bit security level by choosing $n$ to have 1024 bits (Giry, 2015). The approach and implementation were evaluated using datasets from the UCI Machine Learning Repository². The datasets are described in Table 1.
We executed all the tests on a machine equipped with a 2.6GHz Intel Xeon CPU, 30GB of RAM and the GNU/Linux Debian 8.0 operating system. We remark that memory consumption was below 1GB during the entire collection of experimental results. Our k-NN version was implemented in C++ and compiled using GCC 4.9.2 provided by the operating system with the -O3 optimization flag. For comparison, we employed the k-NN classifier implemented in the Python Scikit Learn library³ as the conventional k-NN implementation. We used the parameter algorithm=brute to select a compatible approach for computing distances.
Table 1: Datasets used in the evaluation. The dataset WFR refers to WALL-FOLLOWING ROBOT.
DATASET          INSTANCES  ATTRIBUTES
IRIS                   150           4
WINE                   178          13
CLIMATE MODEL          540          18
CREDIT APPROVAL        690          15
ABALONE               4177           8
WFR                   5456          24
The experiments consisted in processing the
datasets using the conventional plaintext k-NN classi-
fier and our non-interactive privacy-preserving k-NN
over encrypted data, and collecting the results of two
metrics: comparison of the resulting accuracies (rate
of query vectors correctly classified) and the compat-
ibility of the privacy-preserving version compared to
the plaintext one (rate of query vectors that our k-NN
over encrypted data classified with the same class as
the conventional k-NN).
The results are summarized in Tables 2, 3, and 4. The tables have five pairs of columns, one for each number of nearest neighbors considered ($k \in \{1, 3, 5, 7, 9\}$), to take into account how this number affects the accuracy of the classifier. In order to verify how the OPE instantiation parameters might influence the accuracy of the privacy-preserving k-NN, we encrypted the datasets using several pairs of values $(M, N)$ (recall that $M$ and $N$ determine the size of the plaintext and the ciphertext spaces, respectively). Since no significant differences were observed for the several combinations of parameters, we only present the results of the executions for $(M, N) = (2^{32}, 2^{40})$. As expected, the privacy-preserving versions conserved the original classification accuracy for almost all of the samples.
¹The source code is available on the repository https://github.com/hilder-vitor/encrypted-k-NN.
²UCI: https://archive.ics.uci.edu/ml/
³Python Scikit Learn lib: http://scikit-learn.org
A comparison of the running times to classify a single instance, using $k = 3$, is shown in Table 5. We stress that changing the value of $k$ has little effect on the execution times. We split each dataset into a training set containing 2/3 of the data and a testing set with the remaining 1/3. Afterwards, the plaintext version was executed 10 times and the average time to classify the whole set was computed. We performed the same experiment with the privacy-preserving versions. The plaintext version of k-NN was up to about 15 times faster than the versions over encrypted data. Nevertheless, our proposal may still be considered viable because, in the cloud scenario, the client spends most of the time performing requests to the server and sending data to it, and this communication time will probably dominate the time the cloud takes to perform the classification.
In order to study how the classification time relates to the dataset size, we ran the experiments using reduced versions of the WFR dataset. First, we considered only subsets of the dataset by limiting the number of instances. As shown in Figure 1, the execution time grows linearly with the number of instances. Then we performed the same experiment, but limiting the number of variables. Figure 2 shows that the execution time also grows linearly in this scenario. This is expected, because execution time is dominated by the computation of distances between the query vector and the instances in the dataset. The time for computing each distance also grows linearly with the dimension of the involved vectors.
Table 6 presents the execution times to encrypt the
entire datasets, including the training and the testing
data. It corresponds to the execution of the Initializa-
tion and the Querying procedures, without consider-
ing the cost of submitting the instance and query vec-
tors to the cloud. Encrypting the testing set is many
times faster because in this step we do not need to use
the HE scheme, which is slower than OPE.
[Figure 1 plot: classification time (ms) on the vertical axis against the number of instances on the horizontal axis.]
Figure 1: Execution time for query processing as the number of instances grows (subsets of WFR dataset).
[Figure 2 plot: classification time (ms) on the vertical axis against the number of variables on the horizontal axis.]
Figure 2: Execution time for query processing as the number of variables grows (subsets of WFR dataset).
The encrypted vectors are represented by vectors of integers in which the bit lengths of the components are up to $\log_2(N)$ and, due to our choice of parameters, the encrypted classes are represented by integers of 2048 bits. The bit lengths of the plaintext components are $\log_2(M)$ and the classes are represented by 32-bit integers. Therefore, considering a data set consisting of $n$ vectors with $p$ dimensions each, the data expansion, defined as the maximum size in bits of the encrypted data over the maximum size in bits of the plaintext data, is:
$$\frac{np \log_2 N + 2048n}{np \log_2 M + 32n} = \frac{p \log_2 N + 2048}{p \log_2 M + 32}.$$
Notice that the number of instances does not affect the data expansion and the values $M$ and $N$ have little impact on that expansion because they contribute on a logarithmic scale. As the number of dimensions grows, the quotient approaches $\log_2 N / \log_2 M$, which is close to one, the best possible value. A high data expansion means that the client has to upload and download much more data than would be necessary if plaintext data were sent to the cloud. Figure 3 shows the effect of varying $p$ after fixing $(M, N) = (2^{32}, 2^{40})$.

Table 2: Comparison of the accuracies between the conventional unweighted k-NN (PLAIN) and the privacy-preserving one (ENC) instantiated using the OPE parameters $(M, N) = (2^{32}, 2^{40})$. Values in bold represent a difference of at least 0.01. No significant classification accuracy is lost with our privacy-preserving approach.
                      k=1           k=3           k=5           k=7           k=9
DATASET            PLAIN    ENC  PLAIN    ENC  PLAIN    ENC  PLAIN    ENC  PLAIN    ENC
IRIS               0.960  0.960  0.980  0.980  0.960  0.961  0.960  0.960  0.960  0.970
WINE               0.847  0.830  0.796  0.796  0.779  0.779  0.796  0.780  0.745  0.730
CLIMATE MODEL      0.895  0.896  0.928  0.928  0.934  0.934  0.923  0.923  0.917  0.913
CREDIT APPROVAL    0.633  0.619  0.685  0.685  0.680  0.680  0.746  0.746  0.746  0.737
ABALONE            0.591  0.583  0.612  0.618  0.621  0.620  0.628  0.630  0.628  0.627
WFR                0.883  0.882  0.875  0.875  0.868  0.869  0.855  0.855  0.837  0.837

Table 3: Comparison of the accuracies between the conventional weighted k-NN (PLAIN) and the privacy-preserving one (ENC) instantiated using the OPE parameters $(M, N) = (2^{32}, 2^{40})$. Values in bold represent a difference of at least 0.01. Again, no significant classification accuracy is lost with our privacy-preserving approach.
                      k=1           k=3           k=5           k=7           k=9
DATASET            PLAIN    ENC  PLAIN    ENC  PLAIN    ENC  PLAIN    ENC  PLAIN    ENC
IRIS               0.960  0.940  0.980  0.970  0.960  0.960  0.960  0.960  0.960  0.960
WINE               0.847  0.847  0.830  0.813  0.830  0.823  0.830  0.823  0.796  0.800
CLIMATE MODEL      0.895  0.891  0.928  0.928  0.934  0.934  0.923  0.923  0.917  0.918
CREDIT APPROVAL    0.633  0.630  0.671  0.671  0.690  0.680  0.710  0.706  0.737  0.721
ABALONE            0.590  0.590  0.618  0.618  0.627  0.622  0.634  0.629  0.629  0.618
WFR                0.883  0.885  0.882  0.882  0.887  0.888  0.882  0.881  0.880  0.887

Table 4: Compatibility of the unweighted (UNW) and distance-weighted (WEI) versions of privacy-preserving k-NN with the reference implementation from Python Scikit Learn. Compatibility numbers are computed as the rate of query vectors classified identically to the classification from the reference implementation.
                      k=1           k=3           k=5           k=7           k=9
DATASET              UNW    WEI    UNW    WEI    UNW    WEI    UNW    WEI    UNW    WEI
IRIS                0.98   0.96   0.96   0.96   0.94   0.94   0.94   0.96   0.94   0.96
WINE               0.983   1.00   1.00  0.983   1.00  0.983  0.983  0.983  0.983  0.932
CLIMATE MODEL       1.00  0.994   1.00  0.994   1.00   1.00   1.00   1.00   1.00   1.00
CREDIT APPROVAL    0.978  0.982  0.956  0.964  0.939  0.926  0.917  0.893  0.963  0.900
ABALONE            0.962  0.975  0.964  0.948  0.977  0.948  0.968  0.953  0.970  0.965
WFR                0.998  0.998  0.999  0.987  0.995  0.983  0.994  0.984  0.997  0.972

Table 5: Comparison of running times in milliseconds to classify a single instance using $k = 3$.
DATASET     PLAIN  ENCRYPTED (UNWEIGHTED)  ENCRYPTED (WEIGHTED)
IRIS        0.009                  0.0196                0.1351
WINE        0.012                  0.0315                0.1427
CLIMATE     0.033                  0.0924                0.2497
CREDIT      0.044                  0.1105                0.2328
ABALONE     0.168                  0.8001                0.9142
WFR         0.180                  1.1586                1.2313

Table 6: Comparison of running times in seconds to encrypt the test and training datasets.
DATASET      TEST  TRAINING (UNWEIGHTED)  TRAINING (WEIGHTED)
IRIS        0.028                  0.177                0.183
WINE        0.129                  0.540                0.493
CLIMATE     0.565                  2.137                1.812
CREDIT      0.538                  1.768                1.856
ABALONE     1.310                  8.329                7.971
WFR         6.383                 16.780               17.315
Figure 3: Data expansion as a function of the number of dimensions for the OPE parameters $(M, N) = (2^{32}, 2^{40})$.
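The data expansion quotient from this section is easy to evaluate numerically. The short Python sketch below uses the 2048-bit encrypted classes and 32-bit plaintext classes stated in the text as defaults; the function name `data_expansion` and its parameter names are hypothetical.

```python
import math

def data_expansion(p, M, N, enc_class_bits=2048, plain_class_bits=32):
    """(p*log2(N) + 2048) / (p*log2(M) + 32); independent of the number of instances."""
    return (p * math.log2(N) + enc_class_bits) / (p * math.log2(M) + plain_class_bits)

M, N = 2 ** 32, 2 ** 40
for p in (4, 24, 1000):   # 4 and 24 are the attribute counts of IRIS and WFR
    print(p, round(data_expansion(p, M, N), 3))
```

For these parameters the quotient is 13.8 at $p = 4$ and 3.76 at $p = 24$, and it decreases towards $40/32 = 1.25$ as $p$ grows.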
6 SECURITY ANALYSIS
We assume that the cloud satisfies the threat-
model commonly called honest-but-curious (Graepel
et al., 2012), which means that the cloud will follow
the protocol and execute the k-NN procedure as ex-
pected, returning the right answer, although it may try
to learn information during the execution or later ex-
tract information from the encrypted data stored.
In our approach, the server does not learn what
are the values of any component of any of the vec-
tors it receives, including the query vectors. It also
does not learn the classes associated with the vec-
tors. And since the homomorphic encryption scheme
used to encrypt the classes is probabilistic, the server
cannot know even how many different classes there
are among the encrypted classes. Since adding ci-
phertexts results in another well-formed ciphertext,
the class assigned to the query vector cannot be dis-
covered by the cloud server. On the other hand, the
cloud can always discover the number of vectors in
the database and the number of components that each
vector has by simply examining the ciphertext sizes.
Moreover, the OPE building block is deterministic, which introduces a drawback. If the cloud has information about the semantics of the dimensions (what variable is represented by each vector component), and if it also possesses a dataset that is strongly correlated to the encrypted data, the scheme may be vulnerable to inference attacks based on frequency analysis (Naveed et al., 2015). Hence, our scheme is appropriate only for running k-NN queries over private databases, with no public information about the data distribution that can be correlated with ciphertexts.
7 RELATED WORKS
Several works in the literature already studied problems related to privacy-preserving k-NN classification. However, solutions were provided for different scenarios involving distributed servers with equivalent computing power; or simpler versions of the problem, only requiring computation of the $k$ nearest neighbors and ignoring the classification step. As a result, our proposal qualitatively improves on these works by providing additional functionality and corresponding privacy guarantees.
The authors of (Zhan et al., 2005) considered a scenario known as vertically partitioned data, where each of several parties holds a set of attributes of the same instances and they want to perform the k-NN classification on the concatenation of their datasets. An interactive privacy-preserving protocol was proposed in which the parties have to compute the distances between the instances in their own partition and a query vector; and combine those distances using an additively homomorphic encryption scheme with random perturbation techniques to find the $k$ nearest neighbours. The classification step is finally performed locally by each party.
In (Xiong et al., 2006; Xiong et al., 2007), the
authors assume that several data owners, each with a
private database, collaborate by executing a
distributed protocol to perform privacy-preserving
k-NN classification. The classification of a new
instance is performed by each user on his or her own
database, and then a secure distributed protocol is
used to classify the instance based on the k nearest
neighbors of each database, without revealing those
neighbors to the other data owners. This means that
the query vector is revealed and the process is
interactive, with a heavy processing load for each
involved party.
In (Choi et al., 2014), the authors present three
methods to find the k nearest neighbors while
preserving the privacy of the data, but they do not
address the classification problem. Furthermore, the
three methods are interactive. It is worth noting that,
although finding the k nearest neighbors is the main
step of k-NN classification, returning only the
neighbors is not compatible with a cloud computing
scenario: the client has to store at least a table relating
the vectors in the dataset to their classes, and the
query vector must be classified locally after the client
receives the k nearest neighbors.
The authors of (Zhu et al., 2013) propose a
scenario in which the data owner encrypts the data and
sends them to the cloud, where other users can submit
query vectors to obtain the nearest neighbors in a
privacy-preserving way. The scheme ensures privacy
through an interactive protocol executed between the
data owner and any trusted user who wants to send
query vectors: this protocol generates a key that is
used to encrypt the query vectors but cannot be used
to decrypt the encrypted instances stored in the cloud.
The data owner must participate in the processing,
even if only to generate keys, so this protocol cannot
be classified as non-interactive. Moreover, the
protocol only finds the nearest neighbors and the
classification step is not performed.
The work (Elmehdwi et al., 2014) considers a
different scenario: the data owner encrypts the data
and submits them to a first server, sending the secret
key to a second server. Any authorized person is then
able to send a query vector to the first server, which
runs a distributed interactive protocol with the second
server (this server may decrypt some data in the
process), and finally the first server returns the k
nearest neighbors. Even though the client does not
have to process the data, that method requires a
trusted server to store the private key, and this trusted
server acts as the client in the distributed processing
scenario. Relying on a trusted third party naturally
introduces substantial additional risk. Later, the same
authors extended the idea to the classification problem
(Samanthula et al., 2015), but the same risk of
collusion remains.
Another approach is proposed in (Wong et al.,
2009), which introduces a new cryptographic scheme
called asymmetric scalar-product-preserving
encryption (ASPE). The scheme preserves a special
type of scalar product, allowing the k nearest vectors
to be found without requiring an interactive process.
The scheme allows the server to calculate inner
products between dataset vectors by calculating the
inner product of encrypted vectors, determining the
vectors that are closer to the query vector. However,
the authors were again only concerned with the task
of finding the nearest neighbors, not with the
classification problem. Also, a cryptographic scheme
created ad hoc for this task lacks the extensive
security analysis that more general and
well-established cryptographic schemes already have.
In comparison, the building blocks in our proposal
have well-known properties and limitations.
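As a rough illustration of what preserving a scalar
product can mean, the sketch below encrypts one vector
with the transpose of a secret invertible matrix and the
other with its inverse, so that the product of the two
encrypted vectors equals the product of the plaintexts.
This is a simplified reading of the idea and not the full
ASPE construction; the fixed 2x2 integer key and all
function names are hypothetical.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def transpose(M):
    return [list(row) for row in zip(*M)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy secret key: an integer matrix with determinant 1, so its inverse is integral.
M = [[2, 1], [1, 1]]
M_INV = [[1, -1], [-1, 2]]


def encrypt_point(p):
    """Database side: p is mapped to M^T p."""
    return mat_vec(transpose(M), p)


def encrypt_query(q):
    """Query side: q is mapped to M^{-1} q."""
    return mat_vec(M_INV, q)


if __name__ == "__main__":
    p, q = [3, 5], [4, -2]
    # (M^T p) . (M^{-1} q) = p^T M M^{-1} q = p . q
    print(dot(p, q), dot(encrypt_point(p), encrypt_query(q)))
```

The two sides use different transformations, which is why
such a scheme is called asymmetric.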
8 CONCLUSIONS
We presented non-interactive privacy-preserving
variants of the k-NN classifier for both the unweighted
and the weighted versions, and established by
extensive experiments that they are sufficiently
efficient and accurate to be viable in practice. The
proposed protocol combines homomorphic encryption
and order-preserving encryption and is applicable to
running queries against private databases stored in
the cloud. To the best of our knowledge, this is the
first proposal for performing k-NN classification over
encrypted data in a non-interactive way.
If a client and a cloud already employ any joint
protocol to find nearest neighbors (for instance, by
using other cryptographic primitives instead of OPE,
or by running some interactive algorithm), then they
can use an HE scheme and the techniques presented
here to derive the class of the query from the
encrypted classes of its neighbors.
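One way such a derivation could work is sketched below. It
assumes that each class c is encoded as the integer
2^(c * DELTA), so that adding the encodings of the
neighbors packs one vote counter per class into a single
number. The additions are done here on plaintext integers
as a stand-in for homomorphic additions on ciphertexts;
DELTA, the function names and the tie-breaking rule are
hypothetical choices for this illustration.

```python
DELTA = 16  # bits reserved per class counter (assumption); requires k < 2**DELTA


def encode_class(c):
    """Place a single vote in the DELTA-bit slot that belongs to class c."""
    return 1 << (c * DELTA)


def decode_votes(total, num_classes):
    """Read back the vote counter of every class from the packed sum."""
    mask = (1 << DELTA) - 1
    return [(total >> (c * DELTA)) & mask for c in range(num_classes)]


def assign_class(neighbor_classes, num_classes):
    # Server side: in the encrypted setting this sum would be a chain of
    # homomorphic additions over the encrypted classes of the neighbors.
    total = sum(encode_class(c) for c in neighbor_classes)
    # Client side: after decrypting the sum, pick the most voted class
    # (ties go to the smallest class index).
    votes = decode_votes(total, num_classes)
    return max(range(num_classes), key=votes.__getitem__)


if __name__ == "__main__":
    neighbors = [0, 2, 2, 1, 2]  # classes of the k = 5 nearest neighbors
    print(assign_class(neighbors, num_classes=3))
```

Only additions are needed on the server, which is why an
additively homomorphic scheme is enough for this step.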
As future work, possible improvements to the k-NN
classifier presented here might involve data
obfuscation and perturbation techniques to achieve
stronger security properties against inference attacks,
while preserving accuracy and efficiency.
Acknowledgments
We thank Google Inc. for the financial support
through the Latin America Research Awards “Ma-
chine learning over encrypted data using Homomor-
phic Encryption” and “Efficient homomorphic en-
cryption for private computation in the cloud”.
REFERENCES
Alpaydin, E. (2004). Introduction to Machine
Learning. The MIT Press.
Altman, N. S. (1992). An introduction to kernel and
nearest-neighbor nonparametric regression. The
American Statistician, 46(3):175–185.
Boldyreva, A., Chenette, N., and O’Neill, A.
(2011). Order-preserving encryption revisited:
Improved security analysis and alternative
solutions. In CRYPTO, volume 6841 of Lecture
Notes in Computer Science, pages 578–595.
Springer.
Bost, R., Popa, R. A., Tu, S., and Goldwasser, S.
(2015). Machine learning classification over
encrypted data. In NDSS. The Internet Society.
Choi, S., Ghinita, G., Lim, H., and Bertino, E. (2014).
Secure knn query processing in untrusted cloud
environments. IEEE Trans. Knowl. Data Eng.,
26(11):2818–2831.
Elmehdwi, Y., Samanthula, B. K., and Jiang, W.
(2014). Secure k-nearest neighbor query over
encrypted data in outsourced environments. In
ICDE, pages 664–675. IEEE Computer Society.
Gilad-Bachrach, R., Dowlin, N., Laine, K., Lauter,
K. E., Naehrig, M., and Wernsing, J. (2016).
Cryptonets: Applying neural networks to
encrypted data with high throughput and
accuracy. In ICML, volume 48 of JMLR Workshop
and Conference Proceedings, pages 201–210.
JMLR.org.
Giry, D. (2015). Cryptographic key length
recommendation. https://www.keylength.com/
(Accessed December 16, 2016).
Graepel, T., Lauter, K. E., and Naehrig, M. (2012).
ML confidential: Machine learning on encrypted
data. In ICISC, volume 7839 of Lecture Notes in
Computer Science, pages 1–21. Springer.
Hirt, M. and Sako, K. (2000). Efficient receipt-free
voting based on homomorphic encryption. In
EUROCRYPT, volume 1807 of Lecture Notes in
Computer Science, pages 539–556. Springer.
Jha, S., Kruger, L., and McDaniel, P. D. (2005).
Privacy preserving clustering. In ESORICS,
volume 3679 of Lecture Notes in Computer Science,
pages 397–417. Springer.
Lindell, Y. and Pinkas, B. (2009). Secure multiparty
computation for privacy-preserving data mining.
Journal of Privacy and Confidentiality, 1(1):5.
Miller, C. C. (2014). Revelations of N.S.A. spying
cost U.S. tech companies.
http://www.nytimes.com/2014/03/22/business/fallout-from-snowden-hurting-bottom-line-of-tech-companies.html
(Accessed December 16, 2016).
Naehrig, M., Lauter, K. E., and Vaikuntanathan, V.
(2011). Can homomorphic encryption be
practical? In CCSW, pages 113–124. ACM.
Naveed, M., Kamara, S., and Wright, C. V. (2015).
Inference attacks on property-preserving
encrypted databases. In ACM Conference on
Computer and Communications Security, pages
644–655. ACM.
Paillier, P. (1999). Public-key cryptosystems based on
composite degree residuosity classes. In
EUROCRYPT, volume 1592 of Lecture Notes in
Computer Science, pages 223–238. Springer.
Rivest, R. L., Adleman, L., and Dertouzos, M. L.
(1978). On data banks and privacy
homomorphisms. Foundations of secure
computation, 4(11):169–180.
Samanthula, B. K., Elmehdwi, Y., and Jiang, W.
(2015). k-nearest neighbor classification over
semantically secure encrypted relational data.
IEEE Trans. Knowl. Data Eng., 27(5):1261–1273.
Wong, W. K., Cheung, D. W., Kao, B., and Mamoulis,
N. (2009). Secure knn computation on encrypted
databases. In SIGMOD Conference, pages
139–152. ACM.
Xiong, L., Chitti, S., and Liu, L. (2006). k nearest
neighbor classification across multiple private
databases. In CIKM, pages 840–841. ACM.
Xiong, L., Chitti, S., and Liu, L. (2007). Mining
multiple private databases using a knn classifier. In
SAC, pages 435–440. ACM.
Zhan, J. Z., Chang, L., and Matwin, S. (2005).
Privacy preserving k-nearest neighbor
classification. I. J. Network Security, 1(1):46–51.
Zhu, Y., Xu, R., and Takagi, T. (2013). Secure k-nn
query on encrypted cloud database without
key-sharing. IJESDF, 5(3/4):201–217.