
All content in this area was uploaded by Diego F. Aranha on Dec 22, 2016


Non-interactive privacy-preserving k-NN classiﬁer

Hilder V. L. Pereira¹ and Diego F. Aranha¹

¹ Institute of Computing, University of Campinas, Campinas, SP, Brazil

{hilder.vitor, dfaranha}@gmail.com

Keywords: Privacy-preserving classiﬁcation, Order-preserving encryption, Homomorphic encryption.

Abstract: Machine learning tasks typically require large amounts of sensitive data to be shared, which is notoriously intrusive in terms of privacy. Outsourcing this computation to the cloud requires the server to be trusted, introducing a non-realistic security assumption and high risk of abuse or data breaches. In this paper, we propose privacy-preserving versions of the k-NN classifier which operate over encrypted data, combining order-preserving encryption and homomorphic encryption. According to our experiments, the privacy-preserving variant achieves the same accuracy as the conventional k-NN classifier, but considerably impacts the original performance. However, the performance penalty is still viable for practical use in sensitive applications when the additional security properties provided by the approach are considered. In particular, the cloud server does not need to be trusted beyond correct execution of the protocol and computes the algorithm over encrypted data and encrypted classes. As a result, the cloud server never learns the real dataset values, the number of classes, the query vectors or their classification.

1 INTRODUCTION

With corporations and governments becoming more

intrusive in their data collection and surveillance ef-

forts, and the recurrent data breaches observed in the

last years, the cloud paradigm faces multiple chal-

lenges to remain as the computing model of choice

for privacy-sensitive applications. The low operat-

ing costs and high availability of storage capacity and

computational power may not look as attractive after

the risks of outsourcing computation and data storage

are considered. There are no formal guarantees that

the cloud provider is not behaving in abusive or intru-

sive ways, or even that the infrastructure is protected

against external attacks. Different legal regimes and

governmental inﬂuence introduce further complica-

tions to the problem and may shift responsibilities

in unclear ways. After the Snowden revelations, the

long-term ﬁnancial impact from the current crisis in

cloud provider trust is estimated between 35 and 180

billion dollars in 2016 in the US only (Miller, 2014).

A solution proposed to reconcile these issues con-

sists in computing over encrypted data. In this

model, data is encrypted with a property-preserving

transformation (originally called a privacy homomorphism (Rivest et al., 1978)) that still allows some operations to be performed in the encrypted domain.

Constructions which provide this feature and support

an arbitrary number and type of operations (additions

and multiplications) are called fully homomorphic en-

cryption schemes and usually introduce a massive

performance penalty. The more restricted partial or

somewhat homomorphic encryption schemes impose

an upper bound in the number and type of operations,

with much improved performance ﬁgures. However,

they require a redesign of the high-level algorithms to

satisfy the restrictions and conserve most of the orig-

inal performance and effectiveness, when compared

to distributed approaches based on secure multiparty

computation (Xiong et al., 2007).

Classically, computing over encrypted data was

applied to tallying secret votes in electronic elec-

tions (Hirt and Sako, 2000), but modern homomor-

phic encryption schemes may soon enable a host

of interesting privacy-preserving applications in the

ﬁelds of genomics, healthcare and intelligence gath-

ering (Naehrig et al., 2011). Natural applications ex-

tend to data mining (Lindell and Pinkas, 2009) and

machine learning (Gilad-Bachrach et al., 2016) which

involve considerable amounts of data to be shared

and manipulated in untrusted platforms. Following

this trend, some previous works in the literature have

experimented with performing the task of classification over encrypted data. The problem is fundamental both in itself and as a building block for several machine learning algorithms, and amounts to identifying to which of a set of categories a new sample belongs, based on previous observations whose membership is known (called the training set). Graepel et al. adapted simple

classiﬁcation algorithms to work on encrypted data

using somewhat homomorphic encryption (Graepel

et al., 2012), while Bost et al. considered more com-

plex classiﬁers such as decision trees and Bayesian

inference for medical applications (Bost et al., 2015).

Other authors designed protocols for clustering en-

crypted data in the two-party setting (Jha et al., 2005).

The main contributions of this paper are privacy-

preserving variants of the k-NN classiﬁer, both

unweighted and distance-weighted, combining two

main building blocks: homomorphic encryption and

order-preserving encryption. The protocols are non-

interactive and ideally suited to cloud computing en-

vironments, where storage and computation are del-

egated from a low-power device to powerful cloud

servers. Experimental results demonstrate that there

is no substantial accuracy loss in performing the clas-

siﬁcation task in a privacy-preserving way. Our secu-

rity analysis claims security in the semi-honest (also

called honest-but-curious) threat model, although the

drawbacks from adopting order-preserving encryp-

tion for efﬁciency restrict the application scenarios to

computing over private encrypted databases with no

publicly available knowledge about the distribution of

data. To the best of our knowledge, this is the ﬁrst

proposal in the scientiﬁc literature for non-interactive

cloud-based k-NN classiﬁcation over encrypted data.

The paper is organized as follows. In Section 2,

we recall the conventional k-NN classiﬁer and de-

ﬁne the problem of privacy-preserving classiﬁcation.

Section 3 presents the basic building blocks of our

approach: order-preserving encryption (OPE) and

homomorphic encryption (HE). The proposed algo-

rithms are discussed in Section 4, through descrip-

tions of the initialization, querying, processing and re-

sponse procedures. Section 5 presents the experimen-

tal results of evaluating the plain k-NN and privacy-

preserving versions over 6 different datasets. In Sec-

tion 6, we brieﬂy enumerate the security properties

offered by the scheme and the corresponding security assumptions. Related work is discussed in Section 7 and conclusions can be found in the final section.

2 PROBLEM STATEMENT

In this section, we deﬁne the classiﬁcation prob-

lem the k-NN algorithm was designed to solve and

discuss a privacy-preserving variant of the problem,

compatible with a cloud-based centralized processing

model.

2.1 The k-NN classiﬁer

The k-Nearest Neighbor (k-NN) classiﬁer is a well-

known non-parametric supervised method to classify

an instance based on the classes of its nearest neighbors (Altman, 1992; Alpaydin, 2004). Each instance is represented by a vector in $\mathbb{R}^p$ and an associated label, called the instance’s class. A query vector is an unlabeled instance to be classified.

The k-NN classifier has a positive integer parameter $k$, which is the number of neighbors taken into account when classifying a new instance, and a function $d$ from $\mathbb{R}^p \times \mathbb{R}^p$ to $\mathbb{R}$, which determines the distance between two instances (the Euclidean distance is often used).

In order to classify a query vector $x \in \mathbb{R}^p$, the k-NN algorithm works as follows:

1. Find the $k$ nearest neighbors: among all the classified instances $u_1, u_2, \dots, u_n$, select the $k$ instances whose distances $d(u_i, x)$ are the smallest.

2. Assign a class: select the most frequent class among the $k$ nearest neighbors and assign it to $x$.

There is also a k-NN variant known as distance-weighted k-NN, or simply weighted k-NN. In this version, instead of assigning the most frequent class among the $k$ nearest neighbors, the inverse of the distance to the query vector is used as the vote’s weight for each of the $k$ neighbors.
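The two plaintext variants described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_classify`, the tie-breaking behavior (first class encountered wins) and the handling of a zero distance are choices made for this sketch and are not specified in the text.

```python
import math
from collections import defaultdict

def knn_classify(instances, labels, x, k, weighted=False):
    """Classify x by a vote among its k nearest labeled instances."""
    # Euclidean distance from every labeled instance to the query vector.
    dists = [math.dist(u, x) for u in instances]
    # Indexes of the k smallest distances.
    nearest = sorted(range(len(instances)), key=dists.__getitem__)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if weighted:
            # Inverse-distance vote; a zero distance is treated as an
            # infinite weight (an assumption of this sketch).
            votes[labels[i]] += math.inf if dists[i] == 0 else 1.0 / dists[i]
        else:
            votes[labels[i]] += 1.0
    return max(votes, key=votes.get)

# Hypothetical one-dimensional data: the two variants can disagree.
points = [(0,), (3,), (4,)]
classes = ["a", "b", "b"]
print(knn_classify(points, classes, (1,), k=3))                 # b
print(knn_classify(points, classes, (1,), k=3, weighted=True))  # a
```

In the example, class "b" has two of the three neighbors, but the single "a" neighbor is close enough that its inverse-distance vote (1.0) outweighs the combined "b" votes (1/2 + 1/3).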

2.2 Privacy-preserving k-NN in the

cloud

The privacy-preserving k-NN problem can be deﬁned

as the problem of computing the k-NN on an untrusted

platform without revealing sensitive information such

as the data instances, their classes, the query vectors,

and the classiﬁcation results. By untrusted platform

we mean a third party that holds the data in some form

but is not trusted by the data owner. In our scenario,

we assume a client-server model, in which the data

owner is the client and the server is the cloud service.

The client intends to store data in the cloud and pro-

cess it in a non-interactive way, which means that the

cloud will interact with the client only to receive the

data and the query vectors to be classiﬁed, but will

not communicate with the client during the process-

ing. In some applications, it may be the case that data

is already collected in the cloud in encrypted form, on

behalf of the client. It is also assumed that the client

is constrained in terms of computation and storage ca-

pabilities, but is capable of managing cryptographic

keys in a secure way.

The other possible scenario is distributed process-

ing: the client collaborates with the cloud (or the

other parties involved) by receiving and processing

data during the training phase or the classiﬁcation.

We stress that the centralized model is more conve-

nient to the client and is the expected model when

referring to cloud services.

3 BASIC CONCEPTS

In this section, we present the two encryption

schemes used as building blocks to guarantee the

privacy-preserving property of our k-NN classiﬁer:

3.1 Order-preserving encryption

“Order-preserving symmetric encryption (OPE) is a deterministic encryption scheme which encryption function preserves numerical ordering of the plaintexts.” (Boldyreva et al., 2011) In other words, given two plaintexts $m_1$ and $m_2$, and their corresponding ciphertexts $c_1$ and $c_2$ encrypted with an OPE scheme, the following implication is true:

$$m_1 \le m_2 \Rightarrow c_1 \le c_2.$$

Because this class of schemes works over finite sets of numerical values, it is sufficient to describe it using the set $D = \{1, 2, \dots, M\}$ as the message space (or domain) and the set $R = \{1, 2, \dots, N\}$ as the ciphertext space (also called range). OPE schemes are thus parametrized by two positive integer values $M$ and $N$, which represent the number of possible plaintexts and the number of possible ciphertexts, respectively. We define the OPE scheme as follows:

OPE.KEYGEN($M, N, s$) is a probabilistic algorithm that receives a secret string $s$ and the parameters $M, N \in \mathbb{N}$ such that $1 \le M \le N$, and returns the secret key $K$.

OPE.ENC($K, m$) is an algorithm that encrypts a plaintext $m \in D$ using the key $K$ and returns a ciphertext $c \in R$. When it is clear from the context, we may omit the encryption key and write simply OPE.ENC($m$) for short.

OPE.DEC($K, c$) is an algorithm that decrypts a ciphertext $c \in R$ using the key $K$ and returns the corresponding plaintext $m \in D$. We may omit the decryption key and write simply OPE.DEC($c$).

For a vector $w = (w_1, \dots, w_p)$, we define the component-wise encryption as a vector in which each component is encrypted with the same key:

$$\mathrm{OPE.ENC}(w) = (\mathrm{OPE.ENC}(w_1), \dots, \mathrm{OPE.ENC}(w_p)).$$

We deﬁne component-wise decryption similarly,

and refer to the space formed by the encrypted vectors

as encrypted space. If OPE is used to encrypt vectors,

then the order is maintained for each axis. Therefore,

it is very likely that each vector will have the same

neighborhood in the encrypted space. This fact is ex-

ploited to make it possible for the cloud to ﬁnd the

nearest neighbors of a given encrypted vector.
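To make the order-preserving property concrete, the sketch below builds a toy OPE as a random strictly increasing lookup table from $\{1,\dots,M\}$ to $\{1,\dots,N\}$. This is not the scheme of (Boldyreva et al., 2011): the table-based construction, the function names and the use of an integer seed in place of the secret string $s$ are all assumptions made for illustration, and the approach is only feasible for very small $M$.

```python
import bisect
import random

def ope_keygen(M, N, seed):
    """Toy key: a random strictly increasing map {1..M} -> {1..N} (needs M <= N)."""
    rng = random.Random(seed)
    # M distinct ciphertexts, sorted so that position m-1 holds ENC(m).
    return sorted(rng.sample(range(1, N + 1), M))

def ope_enc(key, m):
    return key[m - 1]

def ope_dec(key, c):
    # c is assumed to be a valid ciphertext, so it appears in the table.
    return bisect.bisect_left(key, c) + 1

def ope_enc_vec(key, w):
    """Component-wise encryption with the same key."""
    return tuple(ope_enc(key, wi) for wi in w)

key = ope_keygen(M=100, N=10_000, seed=7)
a, b = (3, 50), (10, 20)
ea, eb = ope_enc_vec(key, a), ope_enc_vec(key, b)
# The order along each axis is the same before and after encryption.
print(a[0] < b[0], ea[0] < eb[0])
print(a[1] > b[1], ea[1] > eb[1])
```

Because the map is monotone on each axis but not linear, distances between encrypted vectors are distorted; this is why the text speaks of neighborhoods being very likely, rather than certain, to be preserved.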

3.2 Homomorphic Encryption

A cryptographic scheme is homomorphic for an operation if it is equivalent to perform this operation over plaintexts or over ciphertexts. For instance, considering the multiplication operation, the product of two ciphertexts encrypted by a cryptosystem with homomorphic properties generates a third ciphertext that decrypts to the product of the two corresponding plaintexts.

In this work, we employed the Paillier cryptosystem (Paillier, 1999). The plaintext message space of this scheme is $\mathbb{Z}_n$, where $n$ is the product of two large primes $p$ and $q$. The Paillier cryptosystem is additively homomorphic, which means that given ciphertexts $c_0$ and $c_1$, corresponding to encryptions of two messages $m_0$ and $m_1$, it is possible to calculate a third ciphertext $c$ that decrypts to $m_0 + m_1$. Furthermore, the cryptosystem offers efficient multiplication between ciphertext and plaintext. More specifically, given a ciphertext $c_0$, corresponding to the encryption of $m_0$, and a plaintext message $m_1$, it is possible to efficiently compute a ciphertext $c$ that decrypts to $m_0 \cdot m_1 \in \mathbb{Z}_n$.

The scheme can be described using the following procedures:

HE.KEYGEN($\lambda$): choose two random primes $p$ and $q$ with equivalent bit lengths such that they ensure $\lambda$ bits of security. Set $n = pq$, $g = n + 1$, $\varphi = (p-1)(q-1)$, and $y = \varphi^{-1} \in \mathbb{Z}_n$. Return the private key $SK = (\varphi, y)$ and the public key $PK = (n, g)$.

HE.ENC($PK, m$): to encrypt $m \in \mathbb{Z}_n$, sample a random $r$ from $\mathbb{Z}_n^*$, compute $c = g^m r^n \bmod n^2$, and return $c$.

HE.DEC($SK, c$): to decrypt $c \in \mathbb{Z}_{n^2}$, calculate $x = c^{\varphi} \bmod n^2$, divide $(x - 1)$ by $n$, obtaining $x'$, and then return $x' \cdot y \in \mathbb{Z}_n$.

HE.ADD($c_1, c_2$): to add two ciphertexts homomorphically, return $c_1 \cdot c_2 \bmod n^2$.

HE.PROD($c_1, m_2$): to multiply a ciphertext $c_1$ by a plaintext $m_2$ homomorphically, return $c_1^{m_2} \bmod n^2$.
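The five procedures can be exercised with a minimal Python sketch. It is a toy under explicit assumptions: the primes are passed in directly and are far too small for any real security, the function names are illustrative, and $n$ is carried inside the private key purely for convenience. The decryption follows the standard Paillier rule for $g = n + 1$, raising the ciphertext to $(p-1)(q-1)$ before dividing by $n$.

```python
import math
import secrets

def he_keygen(p, q):
    """Build a key pair from two primes p and q (toy sizes in this sketch)."""
    n = p * q
    g = n + 1
    phi = (p - 1) * (q - 1)
    y = pow(phi, -1, n)          # inverse of phi in Z_n (Python 3.8+)
    return (phi, y, n), (n, g)   # n is kept in SK only for convenience

def he_enc(pk, m):
    n, g = pk
    n2 = n * n
    while True:                  # sample r from Z_n^*
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def he_dec(sk, c):
    phi, y, n = sk
    x = pow(c, phi, n * n)       # x = 1 + (m * phi mod n) * n
    return ((x - 1) // n) * y % n

def he_add(pk, c1, c2):
    n, _ = pk
    return (c1 * c2) % (n * n)

def he_prod(pk, c1, m2):
    n, _ = pk
    return pow(c1, m2, n * n)

sk, pk = he_keygen(10007, 10009)         # toy primes, not secure
c0, c1 = he_enc(pk, 5), he_enc(pk, 7)
print(he_dec(sk, he_add(pk, c0, c1)))    # 12
print(he_dec(sk, he_prod(pk, c0, 3)))    # 15
```

Encrypting the same message twice gives different ciphertexts because of the fresh random $r$, which is the probabilistic behavior the security analysis relies on.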

4 NON-INTERACTIVE k-NN

OVER ENCRYPTED DATA

In this section we present our constructions for

the privacy-preserving k-NN classiﬁer operating over

encrypted data stored in the cloud. We start by

proposing our unweighted privacy-preserving classi-

ﬁer, which is divided in ﬁve subroutines: Encoding,

Initialization, Querying, Processing and Response.

4.1 Unweighted k-NN

The privacy-preserving unweighted k-NN classifier can be described in terms of the following subroutines:

• Encoding: this procedure takes the class $\ell \in \mathbb{N}$ of an instance and a positive integer $\Delta$, and returns the integer $2^{\ell \cdot \Delta}$. For instance, fixing $\Delta = 64$, class 0 is mapped to $2^0 = 1$, class 1 is mapped to $2^{64}$, class 2 is mapped to $2^{128}$, and so on.

• Initialization: the client has $n$ vectors $(u_1, \dots, u_n) \in \mathbb{R}^{n \times p}$ classified with the corresponding classes $\ell_1, \dots, \ell_n \in \mathbb{N}$. Thus, each vector $u_i$ is labeled with a non-negative integer $\ell_i$. This subroutine encrypts the vectors $u_i$ using the component-wise encryption of OPE and encodes each class for encryption using the HE scheme. Note that this initialization step is executed by the client, as shown in Algorithm 1.

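Algorithm 1 Initialization
    Input: $(u_1, \dots, u_n) \in \mathbb{R}^{n \times p}$, $(\ell_1, \dots, \ell_n) \in \mathbb{N}^n$
    $K \leftarrow$ OPE.KEYGEN($M, N, s$)
    $(SK, PK) \leftarrow$ HE.KEYGEN($\lambda$)
    for $i \leftarrow 1$ to $n$ do
        $v_i \leftarrow$ OPE.ENC($K, u_i$)
        // Encoding and encryption.
        $c_i \leftarrow$ HE.ENC($PK, 2^{\ell_i \cdot \Delta}$)
    end for
    Send $(v_1, \dots, v_n)$ and $(c_1, \dots, c_n)$ to the cloud.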

• Querying: the client has a query vector $w \in \mathbb{R}^p$ to be classified. This vector is encrypted using the OPE scheme and then submitted to the cloud.

Algorithm 2 Querying
    Input: $w \in \mathbb{R}^p$
    $y \leftarrow$ OPE.ENC($w$)
    Choose number of neighbors $k \in \mathbb{N}^*$ and submit $(y, k)$ to the cloud.

• Processing: the cloud classifies the encrypted query vector $y$ using the encrypted vectors $v_1, \dots, v_n$ and the encrypted classes $c_1, \dots, c_n$ by running Algorithm 3.

Algorithm 3 Processing
    1: Input: Encrypted query vector $y$, $k \in \mathbb{N}^*$
    2: for $j \leftarrow 1$ to $n$ do
    3:     $d_j \leftarrow \lVert v_j - y \rVert$
    4: end for
    5: Compute the indexes $(i_1, \dots, i_k)$ of the $k$ smallest distances among $(d_1, \dots, d_n)$.
    6: $class_y \leftarrow c_{i_1}$
    7: for $j \leftarrow 2$ to $k$ do
    8:     $class_y \leftarrow$ HE.ADD($class_y, c_{i_j}$)
    9: end for
    10: Return $class_y$ to the client.

Line 5 of the algorithm returns the indexes of the $k$ smallest distances. For instance, if the 1st, the 3rd and the 7th vectors were the three nearest vectors of $y$, then $d_1$, $d_3$, and $d_7$ would be the smallest distances and this function would return $(1, 3, 7)$.

• Response: the client receives a ciphertext $class_y$, decrypts it and extracts the class of the query vector $w$, as in Algorithm 4.

Algorithm 4 Response
    Input: Encrypted class $class_y$
    $c \leftarrow$ HE.DEC($SK, class_y$)
    Compute the index $\ell$ of the maximum coefficient $a_\ell$ of $c$, viewed as a polynomial in base $2^\Delta$.
    Assign class $\ell$ to the query vector.
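The cloud-side Processing step (Algorithm 3) can be sketched as follows. The homomorphic addition is passed in as a callback, here named `he_add`, so that the sketch stays independent of any particular HE implementation; in the usage example ordinary integer addition over encoded but unencrypted classes stands in for HE.ADD, which is an assumption made purely to keep the example self-contained. Indexes are 0-based, unlike the 1-based example in the text.

```python
import heapq
import math

def k_smallest_indexes(dists, k):
    """Indexes of the k smallest distances, ordered by increasing distance."""
    return heapq.nsmallest(k, range(len(dists)), key=dists.__getitem__)

def process(enc_vectors, enc_classes, y, k, he_add):
    """Accumulate the encrypted classes of the k vectors nearest to y."""
    # Distances are computed directly on the OPE-encrypted vectors.
    dists = [math.dist(v, y) for v in enc_vectors]
    idx = k_smallest_indexes(dists, k)
    class_y = enc_classes[idx[0]]
    for j in idx[1:]:
        class_y = he_add(class_y, enc_classes[j])
    return class_y

# The 1st, 3rd and 7th distances are the smallest (0-based: 0, 2, 6).
print(k_smallest_indexes([0.1, 5, 0.2, 9, 8, 7, 0.3], 3))   # [0, 2, 6]

# Stand-in data: classes encoded as 2**(label * 8), plain addition as "HE.ADD".
vectors = [(1, 1), (2, 2), (9, 9), (10, 10)]
classes = [1, 1, 2**8, 2**8]
print(process(vectors, classes, (1, 2), 3, lambda a, b: a + b))  # 258
```

The result 258 equals $2 \cdot 2^0 + 1 \cdot 2^8$, that is, two neighbors of class 0 and one neighbor of class 1.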

The algorithm works correctly because when the server adds the encrypted classes in Algorithm 3, it is in fact accumulating how many times each class appeared among the $k$ nearest vectors. Since the $i$-th class is represented by an integer $2^{i \cdot \Delta}$, the sum of the classes results in an integer $a_0 + a_1 2^{\Delta} + a_2 2^{2 \cdot \Delta} + \dots + a_s 2^{s \cdot \Delta}$, where each coefficient $a_i$ represents the number of times that the $i$-th class appeared. Furthermore, since the classes are encrypted with an HE scheme, the cloud can add the ciphertexts of those classes and the resulting ciphertext will be decrypted to the expected addition of the classes. To extract the class, only shifting and reducing the decrypted sum $c$ modulo $2^\Delta$ are required to obtain the largest coefficient.

A problem might arise if some of the $a_i$ coefficients were $2^\Delta$ or larger, because in this case, the value of $a_i 2^{i \cdot \Delta}$ would be mixed with the other coefficients. However, notice that the number of neighbours $k$ is an upper bound to each $a_i$, and $k$ is typically small. Therefore, it is sufficient to choose a value for $\Delta$ such that $2^\Delta > k$.
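The counting argument can be checked directly on plaintext integers, leaving encryption aside. The sketch below encodes classes as powers of $2^\Delta$, sums them, and recovers the most frequent class by shifting and masking. The helper names are hypothetical, and breaking ties in favor of the lowest class index is an assumption of this sketch.

```python
def encode_class(label, delta):
    """Map class `label` to the integer 2**(label * delta)."""
    return 1 << (label * delta)

def extract_class(c, delta):
    """Return the class whose base-2**delta digit of c is the largest."""
    mask = (1 << delta) - 1
    best_label, best_count, label = 0, -1, 0
    while c > 0:
        count = c & mask          # reduce modulo 2**delta
        if count > best_count:    # strict: ties go to the lowest label
            best_label, best_count = label, count
        c >>= delta               # shift to the next coefficient
        label += 1
    return best_label

def min_delta(k):
    """Smallest delta such that 2**delta > k."""
    return k.bit_length()

delta = 64
neighbor_classes = [2, 0, 2]      # classes of the k = 3 nearest neighbors
total = sum(encode_class(l, delta) for l in neighbor_classes)
print(total == 2 * 2**128 + 1)    # True: a_0 = 1, a_1 = 0, a_2 = 2
print(extract_class(total, delta))  # 2
print(min_delta(3))               # 2, since 2**2 = 4 > 3
```

With $\Delta = 64$ the bound $2^\Delta > k$ holds with a very wide margin for any realistic $k$.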

4.2 Distance-weighted k-NN

Our proposal also allows a distance-weighted version of the k-NN classifier over encrypted data. The main difference between this version and the previous unweighted one is the way the encrypted classes are accumulated in the cloud.

•Initialization and Querying: Identical to the un-

weighted version.

• Processing: To classify an encrypted query vector $y$, using the encrypted vectors $v_1, \dots, v_n$ and the encrypted classes $c_1, \dots, c_n$, the cloud finds the $k$ nearest neighbours, encodes the inverse of their distances into valid plaintexts and combines them to generate the encrypted assigned class. This procedure is shown in Algorithm 5.

Algorithm 5 (Weighted) Processing
    Input: Encrypted query vector $y$, $k \in \mathbb{N}^*$
    for $j \leftarrow 1$ to $n$ do
        $d_j \leftarrow \lVert v_j - y \rVert$
    end for
    Compute the indexes $(i_1, \dots, i_k)$ of the $k$ smallest distances among $(d_1, \dots, d_n)$.
    $W \leftarrow \frac{1}{d_{i_1}} + \frac{1}{d_{i_2}} + \dots + \frac{1}{d_{i_k}}$
    $class_y \leftarrow$ HE.ENC($PK, 0$)
    for $j \leftarrow 1$ to $k$ do
        $l \leftarrow i_j$
        $w \leftarrow \frac{1}{d_l \cdot W}$
        // Accumulation of encoded class w.
        $z \leftarrow$ HE.PROD($c_l, 2^{w \cdot \Delta}$)
        $class_y \leftarrow$ HE.ADD($class_y, z$)
    end for
    Return $class_y$ to the client.

• Response: The client receives the encrypted assigned class $class_y$, decrypts it and extracts the class, just like in the unweighted version.

This procedure works because each encrypted class $c_l$ is the encryption of some integer $2^{i \cdot \Delta}$. In Algorithm 5, when we multiply homomorphically by the weight $w$ of the neighbor, we obtain an encryption of $w \cdot 2^{i \cdot \Delta}$. Since we add all those $k$ classes homomorphically, we have an encryption of some integer with the format $a_0 + a_1 2^{\Delta} + a_2 2^{2 \cdot \Delta} + \dots + a_s 2^{s \cdot \Delta}$, where each $a_i$ represents the sum of the weights of all the neighbors with class $i$ among the $k$ nearest neighbors.
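The weighted accumulation can be illustrated on plaintext integers as well. Since the multiplier given to HE.PROD must be an integer, this sketch turns each normalized weight into a fixed-point integer with `scale_bits` fractional bits. That fixed-point encoding is an assumption of the sketch and not necessarily the encoding used in Algorithm 5; the function names are hypothetical, distances are assumed to be strictly positive, and `scale_bits` is kept below `delta` so that the per-class sums cannot overflow into the next coefficient.

```python
def weighted_votes(labels, dists, delta, scale_bits):
    """Pack scaled inverse-distance weights per class into one integer."""
    assert scale_bits < delta, "weights must fit in one base-2**delta digit"
    W = sum(1.0 / d for d in dists)
    acc = 0
    for label, d in zip(labels, dists):
        w = 1.0 / (d * W)                       # normalized weight, sums to 1
        w_int = int(w * (1 << scale_bits))      # hypothetical fixed-point weight
        acc += w_int * (1 << (label * delta))   # mirrors HE.PROD, then HE.ADD
    return acc

def largest_digit(c, delta):
    """Index of the largest base-2**delta digit of c."""
    mask = (1 << delta) - 1
    digits = []
    while c > 0:
        digits.append(c & mask)
        c >>= delta
    return max(range(len(digits)), key=digits.__getitem__)

# One close neighbor of class 0 against two farther neighbors of class 1.
packed = weighted_votes(labels=[0, 1, 1], dists=[1.0, 2.0, 4.0],
                        delta=64, scale_bits=32)
print(largest_digit(packed, 64))   # 0
```

Here class 0 receives weight 4/7 and class 1 receives 3/7, so the weighted vote selects class 0 even though class 1 holds the majority of the neighbors.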

5 EXPERIMENTAL RESULTS

We implemented our versions of k-NN using the stateless OPE scheme presented in (Boldyreva et al., 2011) and the Paillier homomorphic cryptosystem (Paillier, 1999).¹ The Paillier cryptosystem was instantiated at the 80-bit security level by choosing $n$ to have 1024 bits (Giry, 2015). The approach and implementation were evaluated using datasets from the UCI Machine Learning Repository². The datasets are described in Table 1.

We executed all the tests in a machine equipped

with a 2.6GHz Intel Xeon CPU, 30GB of RAM and

the GNU/Linux Debian 8.0 operating system. We

remark that memory consumption was below 1GB

during the entire collection of experimental results.

Our k-NN version was implemented in C++ and compiled using GCC 4.9.2 provided by the operating system with the -O3 optimization flag. For comparison, we employed the k-NN classifier implemented in the Python Scikit Learn lib³ as the conventional k-NN implementation. We used the parameter algorithm=brute to select a compatible approach for computing distances.

Table 1: Datasets used in the evaluation. The dataset WFR refers to WALL-FOLLOWING ROBOT.

DATASET            INSTANCES   ATTRIBUTES
IRIS                     150            4
WINE                     178           13
CLIMATE MODEL            540           18
CREDIT APPROVAL          690           15
ABALONE                 4177            8
WFR                     5456           24

The experiments consisted in processing the

datasets using the conventional plaintext k-NN classi-

ﬁer and our non-interactive privacy-preserving k-NN

over encrypted data, and collecting the results of two

metrics: comparison of the resulting accuracies (rate

of query vectors correctly classiﬁed) and the compat-

ibility of the privacy-preserving version compared to

the plaintext one (rate of query vectors that our k-NN

over encrypted data classiﬁed with the same class as

the conventional k-NN).
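Both metrics are simple agreement rates and can be computed with the same helper. The sketch below uses made-up label lists, not data from the experiments, to show that two classifiers can reach the same accuracy while still disagreeing with each other on individual queries.

```python
def agreement_rate(a, b):
    """Fraction of positions at which the two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical labels for four query vectors.
truth      = [0, 1, 1, 0]
pred_plain = [0, 1, 0, 0]   # conventional k-NN
pred_enc   = [0, 1, 1, 1]   # k-NN over encrypted data

print(agreement_rate(pred_plain, truth))     # accuracy of the plaintext version: 0.75
print(agreement_rate(pred_enc, truth))       # accuracy of the encrypted version: 0.75
print(agreement_rate(pred_enc, pred_plain))  # compatibility: 0.5
```

This is why compatibility is reported separately from accuracy: equal accuracy does not imply that the same query vectors were classified identically.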

The results are summarized in Tables 2, 3, and 4. The tables have five columns representing the number of nearest neighbors considered ($k \in \{1, 3, 5, 7, 9\}$) to take into account how this number affects the accuracy of the classifier. In order to verify how the OPE instantiation parameters might influence the accuracy of the privacy-preserving k-NN, we encrypted the datasets using several pairs of values $(M, N)$ (recall that $M$ and $N$ determine the size of the plaintext and the ciphertext spaces, respectively). Since no significant differences were observed for the several combinations of parameters, we only present the results of the executions for $(M, N) = (2^{32}, 2^{40})$. As expected, the privacy-preserving versions conserved the original classification accuracy for almost all of the samples.

¹ The source code is available on the repository https://github.com/hilder-vitor/encrypted-k-NN.
² UCI: https://archive.ics.uci.edu/ml/
³ Python Scikit Learn lib: http://scikit-learn.org

A comparison of the running times to classify a single instance, using $k = 3$, is shown in Table 5. We stress that changing the value of $k$ has little effect on the execution times. We split each dataset into a training set containing 2/3 of the data and a testing set with the remaining 1/3. Afterwards, the plaintext version

was executed 10 times and the average time to classify

the whole set was computed. We performed the same

experiment in the privacy-preserving versions. The

plaintext version of k-NN was about 15 times faster

than the versions over encrypted data. Nevertheless,

our proposal may still be considered viable because,

in the cloud scenario, the client spends most of the

time performing requests to the server and sending

data to it, and this communication time will proba-

bly dominate the time the cloud takes to perform the

classiﬁcation.

In order to study how the classification time relates to the dataset size, we ran the experiments using reduced versions of the WFR dataset. First, we considered only subsets of the dataset by limiting the number of instances. As shown in Figure 1, the execution time grows linearly with the number of instances. Then we performed the same experiment, but limiting the number of variables. Figure 2 shows that the execution time also grows linearly in this scenario. This is expected, because execution time is dominated by the computation of distances between the query vector and the other vectors in the dataset. The time for computing each distance also grows linearly with the dimension of the involved vectors.

Table 6 presents the execution times to encrypt the

entire datasets, including the training and the testing

data. It corresponds to the execution of the Initializa-

tion and the Querying procedures, without consider-

ing the cost of submitting the instance and query vec-

tors to the cloud. Encrypting the testing set is many

times faster because in this step we do not need to use

the HE scheme, which is slower than OPE.

Figure 1: Execution time for query processing as the number of instances grows (subsets of WFR dataset). Axes: number of instances versus classification time (ms).

Figure 2: Execution time for query processing as the number of variables grows (subsets of WFR dataset). Axes: number of variables versus classification time (ms).

The encrypted vectors are represented by vectors of integers in which the bit lengths of the components are up to $\log_2(N)$ and, due to our choice of parameters, the encrypted classes are represented by integers of 2048 bits. The bit lengths of the plaintext components are $\log_2(M)$ and the classes are represented by 32-bit integers. Therefore, considering a data set consisting of $n$ vectors with $p$ dimensions each, the data expansion, defined as the maximum size in bits of the encrypted data over the maximum size in bits of the plaintext data, is:

$$\frac{np \log_2 N + 2048n}{np \log_2 M + 32n} = \frac{p \log_2 N + 2048}{p \log_2 M + 32}.$$
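The ratio above is easy to evaluate numerically. The short sketch below plugs in the parameters used in the experiments, $(M, N) = (2^{32}, 2^{40})$, for the smallest and largest numbers of attributes in Table 1; the function name and keyword arguments are illustrative only.

```python
def data_expansion(p, log2_M=32, log2_N=40, he_bits=2048, label_bits=32):
    """Encrypted size over plaintext size for vectors with p dimensions."""
    return (p * log2_N + he_bits) / (p * log2_M + label_bits)

print(data_expansion(4))       # IRIS, p = 4:  2208 / 160 = 13.8
print(data_expansion(24))      # WFR,  p = 24: 3008 / 800 = 3.76
print(data_expansion(10**6))   # approaches 40 / 32 = 1.25 for large p
```

For low-dimensional data the 2048-bit encrypted class dominates the expansion, while for high-dimensional data the ratio of the OPE ciphertext and plaintext bit lengths dominates.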

Notice that the number of instances does not affect the data expansion and the values $M$ and $N$ have little impact on that expansion because they contribute on a logarithmic scale. As the number of dimensions grows, the quotient approaches $\log_2 N / \log_2 M$ (1.25 for our parameters), which is close to one, the best possible value. A high value for the data expansion means that the client has to upload and download much more data than would be necessary if plaintext data were sent to the cloud. Figure 3 shows the effect of varying $p$ after fixing $(M, N) = (2^{32}, 2^{40})$.

Table 2: Comparison of the accuracies between the conventional unweighted k-NN (PLAIN) and privacy-preserving one (ENC) instantiated using the OPE parameters $(M, N) = (2^{32}, 2^{40})$. Values in bold represent a difference of at least 0.01. No significant classification accuracy is lost with our privacy-preserving approach.

                   k=1            k=3            k=5            k=7            k=9
DATASET            PLAIN  ENC     PLAIN  ENC     PLAIN  ENC     PLAIN  ENC     PLAIN  ENC
IRIS               0.960  0.960   0.980  0.980   0.960  0.961   0.960  0.960   0.960  0.970
WINE               0.847  0.830   0.796  0.796   0.779  0.779   0.796  0.780   0.745  0.730
CLIMATE MODEL      0.895  0.896   0.928  0.928   0.934  0.934   0.923  0.923   0.917  0.913
CREDIT APPROVAL    0.633  0.619   0.685  0.685   0.680  0.680   0.746  0.746   0.746  0.737
ABALONE            0.591  0.583   0.612  0.618   0.621  0.620   0.628  0.630   0.628  0.627
WFR                0.883  0.882   0.875  0.875   0.868  0.869   0.855  0.855   0.837  0.837

Table 3: Comparison of the accuracies between the conventional weighted k-NN (PLAIN) and privacy-preserving one (ENC) instantiated using the OPE parameters $(M, N) = (2^{32}, 2^{40})$. Values in bold represent a difference of at least 0.01. Again, no significant classification accuracy is lost with our privacy-preserving approach.

                   k=1            k=3            k=5            k=7            k=9
DATASET            PLAIN  ENC     PLAIN  ENC     PLAIN  ENC     PLAIN  ENC     PLAIN  ENC
IRIS               0.960  0.940   0.980  0.970   0.960  0.960   0.960  0.960   0.960  0.960
WINE               0.847  0.847   0.830  0.813   0.830  0.823   0.830  0.823   0.796  0.800
CLIMATE MODEL      0.895  0.891   0.928  0.928   0.934  0.934   0.923  0.923   0.917  0.918
CREDIT APPROVAL    0.633  0.630   0.671  0.671   0.690  0.680   0.710  0.706   0.737  0.721
ABALONE            0.590  0.590   0.618  0.618   0.627  0.622   0.634  0.629   0.629  0.618
WFR                0.883  0.885   0.882  0.882   0.887  0.888   0.882  0.881   0.880  0.887

Table 4: Compatibility of unweighted (UNW) and distance-weighted (WEI) versions of privacy-preserving k-NN with reference implementation from Python Scikit. Compatibility numbers are computed as the rate of query vectors classified identically to the classification from the reference implementation.

                   k=1            k=3            k=5            k=7            k=9
DATASET            UNW    WEI     UNW    WEI     UNW    WEI     UNW    WEI     UNW    WEI
IRIS               0.98   0.96    0.96   0.96    0.94   0.94    0.94   0.96    0.94   0.96
WINE               0.983  1.00    1.00   0.983   1.00   0.983   0.983  0.983   0.983  0.932
CLIMATE MODEL      1.00   0.994   1.00   0.994   1.00   1.00    1.00   1.00    1.00   1.00
CREDIT APPROVAL    0.978  0.982   0.956  0.964   0.939  0.926   0.917  0.893   0.963  0.900
ABALONE            0.962  0.975   0.964  0.948   0.977  0.948   0.968  0.953   0.970  0.965
WFR                0.998  0.998   0.999  0.987   0.995  0.983   0.994  0.984   0.997  0.972

Table 5: Comparison of running times in milliseconds to classify a single instance using $k = 3$.

DATASET    PLAIN    ENCRYPTED (UNWEIGHTED)   ENCRYPTED (WEIGHTED)
IRIS       0.009    0.0196                   0.1351
WINE       0.012    0.0315                   0.1427
CLIMATE    0.033    0.0924                   0.2497
CREDIT     0.044    0.1105                   0.2328
ABALONE    0.168    0.8001                   0.9142
WFR        0.180    1.1586                   1.2313

Table 6: Comparison of running times in seconds to encrypt the test and training datasets.

DATASET    TEST     TRAINING (UNWEIGHTED)    TRAINING (WEIGHTED)
IRIS       0.028    0.177                    0.183
WINE       0.129    0.540                    0.493
CLIMATE    0.565    2.137                    1.812
CREDIT     0.538    1.768                    1.856
ABALONE    1.310    8.329                    7.971
WFR        6.383    16.780                   17.315

Figure 3: Data expansion as a function of the number of dimensions for the OPE parameters $(M, N) = (2^{32}, 2^{40})$.

6 SECURITY ANALYSIS

We assume that the cloud follows the threat model commonly called honest-but-curious (Graepel

et al., 2012), which means that the cloud will follow

the protocol and execute the k-NN procedure as ex-

pected, returning the right answer, although it may try

to learn information during the execution or later ex-

tract information from the encrypted data stored.

In our approach, the server does not learn what

are the values of any component of any of the vec-

tors it receives, including the query vectors. It also

does not learn the classes associated with the vec-

tors. And since the homomorphic encryption scheme

used to encrypt the classes is probabilistic, the server

cannot know even how many different classes there

are among the encrypted classes. Since adding ci-

phertexts results in another well-formed ciphertext,

the class assigned to the query vector cannot be dis-

covered by the cloud server. On the other hand, the

cloud can always discover the number of vectors in

the database and the number of components that each

vector has by simply examining the ciphertext sizes.

Moreover, the OPE building block is deterministic

and introduces a drawback. If the cloud has information about the semantics of the dimensions (what variable is represented by each vector component), and

if it also possesses a dataset that is strongly corre-

lated to the encrypted data, the scheme may be vul-

nerable to inference attacks based on frequency anal-

ysis (Naveed et al., 2015). Hence, our scheme is ap-

propriate only for running k-NN queries over private

databases, with no public information about the data

distribution that can be correlated with ciphertexts.

7 RELATED WORKS

Several works in the literature already studied

problems related to privacy-preserving k-NN classification. However, solutions were provided for different scenarios involving distributed servers with equivalent computing power, or for simpler versions of the problem, only requiring computation of the $k$ nearest neighbors and ignoring the classification step. As

a result, our proposal qualitatively improves on these

works by providing additional functionality and cor-

responding privacy guarantees.

The authors of (Zhan et al., 2005) considered a

scenario known as vertically partitioned data, where

each of several parties holds a set of attributes of the

same instances and they want to perform the k-NN

classiﬁcation on the concatenation of their datasets.

An interactive privacy-preserving protocol was pro-

posed in which the parties have to compute the dis-

tances between the instances in their own partition

and a query vector; and combine those distances using

an additively homomorphic encryption scheme with

random perturbation techniques to find the $k$ nearest neighbours. The classification step is finally performed locally by each party.

In (Xiong et al., 2006; Xiong et al., 2007) the au-

thors assume that several data owners, each one with

a private database, will collaborate by executing a dis-

tributed protocol to perform privacy-preserving k-NN

classiﬁcation. The classiﬁcation of a new instance is

performed by each user in his or her own database

and then a secure distributed protocol is used to classify

the instance based on the k nearest neighbors of

each database, without revealing those neighbors to

the other data owners. This means that the query vector

is revealed and the process is interactive, with a heavy

processing load for each involved party.

In (Choi et al., 2014), the authors

present three methods to find the k nearest neighbors

while preserving the privacy of the data, but they do not

address the classification problem. Furthermore, the

three methods are interactive. It is worth noting that,

although finding the k nearest neighbors is the main step

of k-NN classification, returning only the neighbors is not

well suited to a cloud computing scenario: the

client has to store at least a table relating the vectors

of the dataset to their classes, and the

query vector must be classified locally after the client

receives the k nearest neighbors.

The authors of (Zhu et al., 2013) propose a sce-

nario in which the data owner encrypts the data and

sends them to the cloud, where other users can sub-

mit query vectors to obtain the nearest neighbors in

a privacy-preserving way. The scheme ensures the

privacy-preserving property thanks to an interactive

protocol executed between any trusted user that wants

to send query vectors and the data owner: this proto-

col generates a key that is used to encrypt the query

vectors and it is not possible to use this key to decrypt

the encrypted instances stored in the cloud. The data

owner must participate in the processing, even if

only to generate keys; therefore, this protocol cannot

be classified as non-interactive. Moreover,

the protocol only finds the nearest neighbors and the

classification step is not performed.

The work of (Elmehdwi et al., 2014) considers a different

scenario: the data owner encrypts the data and

submits them to a first server, sending the secret key

to a second server. Thereby, any authorized person is

able to send a query vector to the first server, which

runs a distributed interactive protocol with the second

server (this server may decrypt some data in the process),

and finally the first server returns the k nearest

neighbors. Even if the client does not have to process

the data, that method requires a trusted server to

store the private key, and this trusted server acts as the

client in the distributed processing scenario. Relying

on a trusted third party naturally introduces substantial

additional risk. Later, the same authors extended the

idea to the classification problem (Samanthula et al.,

2015), but the same risk of collusion remains.

Another approach is proposed in (Wong et al.,

2009), which introduces a new cryptographic scheme called

asymmetric scalar-product-preserving encryption

(ASPE). The scheme preserves a

special type of scalar product, allowing the k nearest

vectors to be found without requiring an interactive

process. The scheme allows the server to calculate

inner products between dataset vectors by calculating

the inner product of encrypted vectors, determining

the vectors that are closer to the query vector. How-

ever, the authors were again only concerned with the

task of ﬁnding the nearest neighbors, not with the

classification problem. Also, a cryptographic scheme

created ad hoc for this task lacks the extensive security

analysis that more general and well-established

cryptographic schemes already have. In comparison,

the building blocks in our proposal have well-known

properties and limitations.
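To make the idea of a scalar-product-preserving encryption more concrete, the sketch below follows our understanding of the basic construction behind ASPE: dataset points are multiplied by the transpose of a secret invertible matrix and queries by its inverse, so that the product of an encrypted point and an encrypted query equals the product of the corresponding plaintexts. It is a simplified illustration, not the exact scheme of (Wong et al., 2009); the matrix `M`, the scaling factor `r`, and the sample points are hypothetical.

```python
from fractions import Fraction

def invert(matrix):
    """Gauss-Jordan inverse over exact rationals (matrix must be invertible)."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

M = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]   # hypothetical secret key (invertible)
M_INV = invert(M)

def encrypt_point(p):
    """Extend p with -||p||^2 / 2, then multiply by M transposed."""
    extended = list(p) + [Fraction(-1, 2) * dot(p, p)]
    return mat_vec(transpose(M), extended)

def encrypt_query(q, r=3):
    """Extend q with 1, scale by a positive factor r, multiply by M inverse."""
    extended = [r * x for x in list(q) + [1]]
    return mat_vec(M_INV, extended)

points = [(1, 1), (5, 4), (2, 3)]
query = (2, 2)
enc_points = [encrypt_point(p) for p in points]
enc_query = encrypt_query(query)

# Server side: a larger score means a smaller distance to the query.
scores = [dot(ep, enc_query) for ep in enc_points]
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Each score equals r(p·q − ‖p‖²/2), and since ‖p − q‖² = ‖q‖² − 2(p·q − ‖p‖²/2), ordering by decreasing score is the same as ordering by increasing distance to the query. The server obtains the ranking without interaction, but it never touches the classes, which is the gap noted above.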

8 CONCLUSIONS

We presented non-interactive privacy-preserving

variants of the k-NN classifier for both the unweighted

and the weighted versions, and established through

extensive experiments that they are sufficiently efficient

and accurate to be viable in practice. The proposed

protocol combines homomorphic encryption

and order-preserving encryption and is applicable to

running queries against private databases stored in

the cloud. To the best of our knowledge, this is the

first proposal for performing k-NN classification over

encrypted data in a non-interactive way.

If a client and a cloud already employ any joint

protocol to find nearest neighbors (for instance, by

using other cryptographic primitives instead of OPE,

or by running some interactive algorithm), then they

can use an HE scheme and the techniques presented

here to derive the class of the query from the classes of its neighbors.
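As a rough illustration of how an additively homomorphic scheme can combine encrypted classes once the neighbors are known, consider the sketch below. It uses a toy Paillier instance with tiny fixed primes (insecure, for illustration only) and assumes a particular encoding in which class i is represented as BASE^i with BASE = k + 1; this encoding is an assumption of the sketch and may differ from the one used earlier in the paper. The function names and the list `neighbor_classes` are hypothetical.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                                # requires Python 3.8+
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Enc(m) = (1 + n)^m * r^n mod n^2, using (1 + n)^m = 1 + m*n mod n^2."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) * pow(r, n, n2) % n2

def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    return c1 * c2 % (n * n)

def decrypt(n, secret, c):
    lam, mu = secret
    return (pow(c, lam, n * n) - 1) // n * mu % n

K = 5            # number of neighbors
BASE = K + 1     # assumed encoding: class i is stored as BASE**i

def decode_votes(total, num_classes):
    """Read the per-class vote counts as base-BASE digits of the sum."""
    votes = []
    for _ in range(num_classes):
        votes.append(total % BASE)
        total //= BASE
    return votes

rng = random.Random(0)
n, secret = keygen()

# Client side (at upload time): classes are encrypted before being stored.
neighbor_classes = [2, 0, 2, 1, 2]   # classes of the k neighbors found
enc_classes = [encrypt(n, BASE ** c, rng) for c in neighbor_classes]

# Cloud side: sums the encrypted classes without decrypting them.
encrypted_sum = enc_classes[0]
for c in enc_classes[1:]:
    encrypted_sum = add(n, encrypted_sum, c)

# Client side: decrypts one ciphertext and reads off the majority class.
votes = decode_votes(decrypt(n, secret, encrypted_sum), num_classes=3)
predicted = max(range(len(votes)), key=votes.__getitem__)
print(votes, predicted)
```

With this encoding no digit can overflow, because at most k neighbors vote for any class and BASE is k + 1. The cloud returns a single ciphertext regardless of how the neighbors were found, which is why the aggregation step is independent of the neighbor-search protocol.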

As future work, possible improvements to the k-NN

classifiers presented here might involve data obfuscation

and perturbation techniques to achieve stronger se-

curity properties against inference attacks, while pre-

serving accuracy and efﬁciency.

Acknowledgments

We thank Google Inc. for the ﬁnancial support

through the Latin America Research Awards “Ma-

chine learning over encrypted data using Homomor-

phic Encryption” and “Efﬁcient homomorphic en-

cryption for private computation in the cloud”.

REFERENCES

Alpaydin, E. (2004). Introduction to Machine Learn-

ing. The MIT Press.

Altman, N. S. (1992). An introduction to kernel and

nearest-neighbor nonparametric regression. The

American Statistician, 46(3):175–185.

Boldyreva, A., Chenette, N., and O’Neill, A.

(2011). Order-preserving encryption revisited:

Improved security analysis and alternative so-

lutions. In CRYPTO, volume 6841 of Lec-

ture Notes in Computer Science, pages 578–595.

Springer.

Bost, R., Popa, R. A., Tu, S., and Goldwasser, S.

(2015). Machine learning classiﬁcation over en-

crypted data. In NDSS. The Internet Society.

Choi, S., Ghinita, G., Lim, H., and Bertino, E. (2014).

Secure knn query processing in untrusted cloud

environments. IEEE Trans. Knowl. Data Eng.,

26(11):2818–2831.

Elmehdwi, Y., Samanthula, B. K., and Jiang, W.

(2014). Secure k-nearest neighbor query over

encrypted data in outsourced environments. In

ICDE, pages 664–675. IEEE Computer Society.

Gilad-Bachrach, R., Dowlin, N., Laine, K., Lauter,

K. E., Naehrig, M., and Wernsing, J. (2016).

Cryptonets: Applying neural networks to en-

crypted data with high throughput and accu-

racy. In ICML, volume 48 of JMLR Workshop

and Conference Proceedings, pages 201–210.

JMLR.org.

Giry, D. (2015). Cryptographic key length recommendation.

https://www.keylength.com/

(Accessed December 16, 2016).

Graepel, T., Lauter, K. E., and Naehrig, M. (2012).

ML conﬁdential: Machine learning on encrypted

data. In ICISC, volume 7839 of Lecture Notes in

Computer Science, pages 1–21. Springer.

Hirt, M. and Sako, K. (2000). Efﬁcient receipt-free

voting based on homomorphic encryption. In

EUROCRYPT, volume 1807 of Lecture Notes in

Computer Science, pages 539–556. Springer.

Jha, S., Kruger, L., and McDaniel, P. D. (2005). Pri-

vacy preserving clustering. In ESORICS, vol-

ume 3679 of Lecture Notes in Computer Science,

pages 397–417. Springer.

Lindell, Y. and Pinkas, B. (2009). Secure multiparty

computation for privacy-preserving data mining.

Journal of Privacy and Conﬁdentiality, 1(1):5.

Miller, C. C. (2014). Revelations of N.S.A. spying

cost U.S. tech companies.

http://www.nytimes.com/2014/03/22/business/fallout-from-snowden-hurting-bottom-line-of-tech-companies.html

(Accessed December 16, 2016).

Naehrig, M., Lauter, K. E., and Vaikuntanathan, V.

(2011). Can homomorphic encryption be practi-

cal? In CCSW, pages 113–124. ACM.

Naveed, M., Kamara, S., and Wright, C. V. (2015).

Inference attacks on property-preserving en-

crypted databases. In ACM Conference on Com-

puter and Communications Security, pages 644–

655. ACM.

Paillier, P. (1999). Public-key cryptosystems based on

composite degree residuosity classes. In EURO-

CRYPT, volume 1592 of Lecture Notes in Com-

puter Science, pages 223–238. Springer.

Rivest, R. L., Adleman, L., and Dertouzos, M. L.

(1978). On data banks and privacy homomor-

phisms. Foundations of secure computation,

4(11):169–180.

Samanthula, B. K., Elmehdwi, Y., and Jiang, W.

(2015). k-nearest neighbor classiﬁcation over

semantically secure encrypted relational data.

IEEE Trans. Knowl. Data Eng., 27(5):1261–

1273.

Wong, W. K., Cheung, D. W., Kao, B., and Mamoulis,

N. (2009). Secure knn computation on encrypted

databases. In SIGMOD Conference, pages 139–

152. ACM.

Xiong, L., Chitti, S., and Liu, L. (2006). k near-

est neighbor classiﬁcation across multiple pri-

vate databases. In CIKM, pages 840–841. ACM.

Xiong, L., Chitti, S., and Liu, L. (2007). Mining mul-

tiple private databases using a knn classiﬁer. In

SAC, pages 435–440. ACM.

Zhan, J. Z., Chang, L., and Matwin, S. (2005). Pri-

vacy preserving k-nearest neighbor classiﬁca-

tion. I. J. Network Security, 1(1):46–51.

Zhu, Y., Xu, R., and Takagi, T. (2013). Secure k-nn

query on encrypted cloud database without key-

sharing. IJESDF, 5(3/4):201–217.