Principal Component Analysis over encrypted data using
homomorphic encryption
Hilder V. L. Pereira1, Diego F. Aranha1
1Institute of Computing (UNICAMP)
Av. Albert Einstein, 1251, 13083-852, Campinas-SP, Brazil
hilder@lasca.ic.unicamp.br, dfaranha@ic.unicamp.br
Abstract. We describe an algorithm to perform Principal Component Analysis
(PCA) over encrypted data using homomorphic encryption. PCA is a fundamen-
tal tool for exploratory data analysis and dimensionality reduction, and thus a
useful application for privacy-preserving computation in the cloud.
1. Introduction
The increasingly intrusive behavior of governments and corporations, together with recently observed leaks of sensitive information, calls into question the long-term viability of cloud computing as the prominent industry paradigm. Although this risk has been inherent since the introduction of cloud computing, the security and privacy issues associated with delegating computation to a third party became self-evident only recently.
A possible solution to accommodate these conflicting requirements is computing
over encrypted data. In this model, data is encrypted by a transformation which preserves part of its structure and allows certain operations to still be executed. Because of the practical difficulties with fully homomorphic encryption, which supports computation that is arbitrary in both the type and number of operations, a growing research area is dedicated to studying partially homomorphic schemes and adapting algorithms to work correctly in the encrypted domain. In this work, we propose an algorithm for performing PCA over encrypted data stored in the cloud. PCA is a fundamental step in data analysis and machine learning, and thus a promising application for privacy-preserving computing. The proposed algorithm is non-interactive and compatible with somewhat homomorphic encryption schemes.
2. Preliminaries
In this section, we recall basic Linear Algebra results, stated without proof due to space constraints.
Definition 2.1 (Eigenvector and eigenvalue). Let X be a real matrix in R^{n×n}. We say that a scalar λ ∈ R is an eigenvalue of X if there exists a non-zero vector v ∈ R^n such that Xv = λv. We also say that v is an eigenvector associated with λ and that (λ, v) is an eigenpair of X. Eigenvectors are invariant under multiplication by a scalar, and the dominant eigenvalue of X is the one with the largest absolute value.
Definition 2.2 (Shifting eigenpairs). We say that a procedure shifts the eigenvalues of a matrix X if it returns a matrix B such that the dominant eigenvalue of B is equal to the second dominant eigenvalue of X and their associated eigenvectors are the same. More formally, given X with dominant eigenpair (λ_i, v_i), a function f shifts the eigenvalues of X if f(X) = B ∈ R^{n×n} has dominant eigenpair (λ_{i+1}, v_{i+1}).
Theorem 2.3 (Spectral Theorem [Watkins 2005]). Suppose A ∈ R^{n×n} is symmetric. Then it can be written as A = U D U^T, where U is an orthogonal matrix whose columns are normalized eigenvectors and D is a diagonal matrix with the eigenvalues on the principal diagonal, in an order corresponding to the columns of U. In other words, for i ∈ {1, 2, ..., n}, the pair (D_{ii}, U_i) is an eigenpair, where U_i is the i-th column of U.
Corollary 2.4 (Symmetric matrix as a sum). Let A ∈ R^{n×n} be symmetric with eigenpairs (λ_1, v_1), (λ_2, v_2), ..., (λ_n, v_n), where ||v_i|| = 1 for i ∈ {1, 2, ..., n}. Then A may be written as

A = Σ_{i=1}^{n} λ_i v_i v_i^T.
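The decomposition in Corollary 2.4 can be checked numerically with NumPy; the small symmetric matrix below is an arbitrary example, not taken from the paper.

```python
import numpy as np

# A small symmetric matrix; np.linalg.eigh returns the eigenvalues and
# an orthogonal matrix U whose columns are normalized eigenvectors,
# exactly as in Theorem 2.3.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigenvalues, U = np.linalg.eigh(A)

# Corollary 2.4: A equals the sum of lambda_i * v_i v_i^T.
A_reconstructed = sum(lam * np.outer(v, v)
                      for lam, v in zip(eigenvalues, U.T))
assert np.allclose(A, A_reconstructed)
```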
3. Principal Component Analysis
The problem of finding the principal components of a data matrix X is equivalent to the problem of finding the eigenvectors of its covariance matrix. In general, the i-th principal component is the i-th dominant eigenvector [Jolliffe 2002]. Hence, to project the data into a K-dimensional space, we have to find the K dominant eigenvectors.
3.1. Power Method
The Power Method is a simple iterative algorithm to find a dominant eigenvector of a given matrix. Let A ∈ R^{n×n} be a real matrix. We sample a random initial vector u ∈ R^n and multiply A by u repeatedly, generating the sequence Au, A^2 u, A^3 u, ..., which converges to a dominant eigenvector. If we write the initial vector u as a linear combination of the eigenvectors v_1, v_2, ..., v_n, we have:

A^k u = A^k (α_1 v_1 + α_2 v_2 + α_3 v_3 + ... + α_n v_n) = λ_1^k α_1 v_1 + λ_2^k α_2 v_2 + ... + λ_n^k α_n v_n.
Assuming that v_1 is a dominant eigenvector, we have |λ_1^k| > |λ_i^k| for i ∈ {2, 3, ..., n}. Therefore, if we divide both sides by λ_1^k, the sequence converges to a multiple of v_1:

A^k u / λ_1^k = α_1 v_1 + (λ_2^k / λ_1^k) α_2 v_2 + ... + (λ_n^k / λ_1^k) α_n v_n.
In order to avoid underflow and overflow in practice, it is common to divide the sequence by a scaling factor θ_k at each step. The resulting algorithm for the Power Method can be found below:

powerMethod(A)
    N = A.lines
    u = randomVector(N)
    for k = 1 to STEPS
        u = Au / θ_k
    return u
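As a plaintext sanity check, the Power Method above can be sketched in NumPy. Here the scaling factor θ_k is taken to be the Euclidean norm of the current iterate, one common choice; the function name and test matrix are illustrative.

```python
import numpy as np

def power_method(A, steps=200, rng=None):
    """Approximate a dominant eigenvector of A by repeated multiplication.

    Each step divides by theta_k = ||A u|| to avoid overflow/underflow.
    """
    g = np.random.default_rng(rng)
    u = g.standard_normal(A.shape[0])
    for _ in range(steps):
        u = A @ u
        u = u / np.linalg.norm(u)
    return u

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
v = power_method(A, rng=0)

# Check the eigenpair relation A v ≈ lambda v, with lambda estimated
# by the Rayleigh quotient (v is unit length after the last step).
lam = v @ A @ v
assert np.allclose(A @ v, lam * v, atol=1e-6)
```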
3.2. Finding K principal components
Our strategy to find the principal components is to calculate the covariance matrix of the data and find the K dominant eigenvectors by repeatedly using the Power Method and a shifting procedure. Since the covariance matrix is symmetric, the following function works as a shifting procedure:
eigenShift(A, dominant eigenvector v)
    u = v / ||v||
    return B = A − A u u^T
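This shifting procedure can be sketched in NumPy over plaintext; the function name and the 2×2 test matrix are illustrative.

```python
import numpy as np

def eigen_shift(A, v):
    """Shift the eigenvalues of symmetric A given its dominant eigenvector v."""
    u = v / np.linalg.norm(v)        # normalize v
    return A - A @ np.outer(u, u)    # B = A - A u u^T

A = np.array([[5.0, 2.0],
              [2.0, 2.0]])           # eigenvalues 6 and 1
eigenvalues, U = np.linalg.eigh(A)   # ascending order
v1 = U[:, -1]                        # dominant eigenvector
B = eigen_shift(A, v1)

# The dominant eigenvector of A now has eigenvalue 0 in B, and the
# second dominant eigenvalue of A becomes the dominant one of B.
assert np.allclose(B @ v1, 0.0, atol=1e-8)
assert np.isclose(np.max(np.abs(np.linalg.eigvalsh(B))), eigenvalues[0])
```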
Theorem 3.1. Let A be an n×n real symmetric matrix. Then the function eigenShift shifts the eigenvalues of A.
Proof. Since the first operation of eigenShift is normalizing v, the vector u = v/||v|| is equal to v_1, the normalized dominant eigenvector. Since v_1^T v_1 = ||v_1||^2 = 1, we have

B v_1 = A v_1 − (A v_1 v_1^T) v_1 = A v_1 − A v_1 (v_1^T v_1) = A v_1 − A v_1 = λ_1 v_1 − λ_1 v_1 = 0 · v_1,

which proves that v_1 is also an eigenvector of B, but now associated with a new eigenvalue λ_new = 0. By Corollary 2.4, the matrix A may be written as A = Σ_{i=1}^{n} λ_i v_i v_i^T, and thus

B = λ_2 v_2 v_2^T + λ_3 v_3 v_3^T + ... + λ_n v_n v_n^T.
For all i ∈ {2, 3, ..., n}, we have B v_i = λ_2 v_2 v_2^T v_i + ... + λ_i v_i v_i^T v_i + ... + λ_n v_n v_n^T v_i. By Theorem 2.3, all the eigenvectors are orthogonal, so for j ≠ i the product v_j^T v_i is equal to 0, and the product v_i^T v_i is equal to 1. Then B v_i = 0 + 0 + ... + 0 + λ_i v_i · 1 + 0 + ... + 0 = λ_i v_i, which proves that all the other eigenpairs of A are also eigenpairs of B. Therefore, all the eigenvectors of A are also eigenvectors of B, the dominant eigenvector of A is now associated with the eigenvalue λ_new = 0, and the second dominant eigenvalue of A is the dominant eigenvalue of B.
In order to calculate the covariance matrix, we just have to set the mean of each variable (column of the data matrix) to zero and then perform a matrix multiplication.
covarianceMatrix(X)
    N = X.lines
    P = X.columns
    for j = 1 to P
        µ = 0
        for i = 1 to N
            µ = µ + X[i][j]
        µ = µ / N
        /* Subtract the mean. */
        for i = 1 to N
            X[i][j] = X[i][j] − µ
    C = X^T * X
    return C
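A plaintext NumPy sketch of this computation follows. Note that C = X^T X omits the usual 1/(N−1) factor; this only rescales the eigenvalues and leaves the eigenvectors, and hence the principal components, unchanged.

```python
import numpy as np

def covariance_matrix(X):
    """Center each column of X and return X^T X (a scaled covariance)."""
    X = X - X.mean(axis=0)   # subtract the per-column mean
    return X.T @ X

X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])
C = covariance_matrix(X)

assert np.allclose(C, C.T)                           # symmetric, as PCA requires
assert np.allclose(C, 2 * np.cov(X, rowvar=False))   # (N-1) = 2 rescaling of np.cov
```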
Our proposal for solving the PCA problem is the following:

PCA(X, new dimension K)
    C = covarianceMatrix(X)
    pcs = ∅
    for i = 1 to K
        pc_i = powerMethod(C)
        C = eigenShift(C, pc_i)
        pcs = {pc_i} ∪ pcs
    return pcs
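Putting the pieces together, the full procedure can be sketched over plaintext in NumPy and checked against a direct eigendecomposition. The helper names mirror the pseudocode above but are otherwise illustrative, as is the synthetic data.

```python
import numpy as np

# Plaintext helpers mirroring the pseudocode above.
def covariance_matrix(X):
    X = X - X.mean(axis=0)
    return X.T @ X

def power_method(A, steps=200, rng=None):
    g = np.random.default_rng(rng)
    u = g.standard_normal(A.shape[0])
    for _ in range(steps):
        u = A @ u
        u = u / np.linalg.norm(u)
    return u

def eigen_shift(A, v):
    u = v / np.linalg.norm(v)
    return A - A @ np.outer(u, u)

def pca(X, K):
    """Return the K dominant eigenvectors of the covariance matrix of X."""
    C = covariance_matrix(X)
    pcs = []
    for _ in range(K):
        pc = power_method(C, rng=0)
        C = eigen_shift(C, pc)     # shift so the next eigenvector dominates
        pcs.append(pc)
    return pcs

# Synthetic data with clearly separated variances per column.
X = np.random.default_rng(42).standard_normal((50, 3)) @ np.diag([5.0, 2.0, 0.5])
pcs = pca(X, 2)

# Compare against eigenvectors from np.linalg.eigh (equal up to sign).
w, U = np.linalg.eigh(covariance_matrix(X))
assert np.isclose(abs(pcs[0] @ U[:, -1]), 1.0, atol=1e-5)
assert np.isclose(abs(pcs[1] @ U[:, -2]), 1.0, atol=1e-5)
```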
4. Homomorphic version
Employing a Somewhat Homomorphic Encryption (SHE) scheme such as [Bos et al. 2013] for privacy-preserving computation imposes some restrictions on the operations that can be performed on the data. Usually, we can only perform additions and a few multiplications over the ciphertexts, and general divisions are not viable. Since encoding real numbers is possible, as in [Aono et al. 2015], we can also divide the ciphertexts by constants or any other known values (the number n of elements submitted by the client, for example). Because of these restrictions, we have to modify the algorithms to remove divisions between ciphertexts and to minimize the number of consecutive multiplications.
For the Power Method, the value θ_k can be chosen as a constant, and the computation of the covariance matrix only divides by a value known a priori, so these divisions can be performed between ciphertexts and plaintexts. The remaining obstacle is the eigenShift procedure. Since the components of the vectors are encrypted, we cannot normalize v by dividing it by its norm (the first operation of eigenShift) to compute B = A − A (v/||v||)(v^T/||v||).
However, Definition 2.1 tells us that B and ||v||^2 B have the same eigenvectors. Hence:

||v||^2 B = ||v||^2 (A − A (v/||v||)(v^T/||v||)) = ||v||^2 A − A v v^T,

which means that we can compute ||v||^2 B from A and v without divisions between ciphertexts. Finally, using the relation between the inner product and the Euclidean norm, namely v^T v = ||v||^2, the homomorphic version of the eigenShift function can be described as follows:
homomorphicShift(A, v)
    α = innerProduct(v, v)
    B = α A − A v v^T
    return B
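The division-free shift can be checked against the original eigenShift in plaintext NumPy; in the actual protocol each addition and multiplication would operate on ciphertexts, but the algebra is the same. The function names and test matrix are illustrative.

```python
import numpy as np

def homomorphic_shift(A, v):
    """Division-free shift: returns ||v||^2 * B = ||v||^2 A - A v v^T.

    Only additions and multiplications are used, so each step could in
    principle be evaluated over ciphertexts; here we run it on plaintext.
    """
    alpha = v @ v                        # innerProduct(v, v) = ||v||^2
    return alpha * A - A @ np.outer(v, v)

A = np.array([[5.0, 2.0],
              [2.0, 2.0]])
_, U = np.linalg.eigh(A)
v = 3.0 * U[:, -1]                       # unnormalized dominant eigenvector

B_scaled = homomorphic_shift(A, v)

# The original eigenShift, with the division by ||v||.
u = v / np.linalg.norm(v)
B = A - A @ np.outer(u, u)

# homomorphicShift computes exactly ||v||^2 * B, so eigenvectors agree.
assert np.allclose(B_scaled, (v @ v) * B)
```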
This way, the entire Power Method can be computed over encrypted data.
5. Conclusion
Principal Component Analysis can be computed in a privacy-preserving way, by adapting
all of the required steps in the Power Method to remove expensive divisions and employ-
ing a Somewhat Homomorphic Encryption scheme with a bounded number of multipli-
cations. As far as we know, this is the first non-interactive proposal for performing PCA
over encrypted data in the cloud.
References
Aono, Y., Hayashi, T., Phong, L. T., and Wang, L. (2015). Fast and secure linear regres-
sion and biometric authentication with security update. Cryptology ePrint Archive,
Report 2015/692. http://eprint.iacr.org/.
Bos, J. W., Lauter, K., Loftus, J., and Naehrig, M. (2013). Improved security for a ring-
based fully homomorphic encryption scheme. In Cryptography and Coding (IMACC),
pages 45–64. Springer.
Jolliffe, I. T. (2002). Principal Component Analysis. Springer Series in Statistics.
Watkins, D. S. (2005). Fundamentals of Matrix Computations. Wiley, 2nd edition.