
Redesign of Gaussian Mixture Model for Efficient and Privacy-preserving Speaker Recognition

Authors: S Rahulamathavan, X Yao, R Yogachandran, K Cumanan, and M Rajarajan
Abstract—This paper proposes an algorithm to perform privacy-preserving (PP) speaker recognition using Gaussian mixture models (GMM). We consider a scenario where users have to enrol their voice biometric with third-party service providers to access different services (e.g., banking). Once the enrolment is done, the users can authenticate themselves to the system using their voice instead of passwords. Since the voice is unique to an individual, storing the users' voice features at a third-party server raises privacy concerns. Hence, in this paper we propose a novel technique using randomization to perform voice authentication, which allows users to enrol and authenticate their voice in the encrypted domain, hence privacy is preserved. To achieve this, we redesign the GMM to work in the encrypted domain. The proposed algorithm is validated using the widely used TIMIT speech corpus. Experimental results demonstrate that the proposed PP algorithm does not degrade the performance compared to the non-PP method and achieves a 96.16% true positive rate and a 1.77% false positive rate. A demonstration on an Android smartphone shows that the algorithm can be executed within two seconds using only 30% of the CPU.

Index Terms—Privacy, security, speech, GMM-MFCC, encrypted domain.
I. INTRODUCTION
Traditional authentication methods such as passwords, PINs, and memorable words can be easily forgotten, lost, guessed, stolen, or shared. Authentication using anatomical traits such as fingerprint, face, palm print, iris and voice, by contrast, is very difficult to forge, since these traits are physically linked to the user. However, various security and privacy challenges deter public confidence in adopting biometric-based authentication systems.
Speech, being a unique characteristic of an individual, is widely used in speaker verification and identification tasks in applications such as authentication and surveillance, respectively [1]. In authentication, two parties are involved: (1) the user and (2) the service provider. Initially, the users need to enrol their voice with a service provider. Later, the service provider matches the enrolled voice template against new voice samples from the users for authentication [2]. This matching can be performed using a probabilistic representation such as Gaussian mixture models (GMMs) [2].
In general, the users' raw voice data is passed through a number of filters to extract unique features (see Section III). These voice templates, similar to the users' fingerprints, are unique to individuals, and storing them at third-party servers raises privacy concerns. In order to tackle this problem, we propose a privacy-preserving scheme where the server responsible for user authentication does not have access to the user's voice template in the plain domain.
This work was supported by the EU Horizon2020 programme under EU
Grant H2020-EU.3.7 (Project ID: 653586).
In order to perform voice authentication in a privacy-preserving manner, we redesign the GMM and extend it to work in the encrypted domain. Recently, Pathak et al. attempted to solve this problem using homomorphic encryption techniques in [1]. However, the pioneering work in [1] is computationally infeasible, i.e., it requires several hours to perform the encryption and voice matching due to the use of very long integers (2048 bits) for exponentiation.
In order to mitigate the computational complexity, we propose a novel scheme using a randomization technique which has proven to be lightweight in other biometric solutions [4]. We validate the proposed scheme using the TIMIT speech corpus [5] and show that our scheme performs much faster than the existing solution without compromising the accuracy. On top of this, we analyse the security and privacy aspects of the proposed scheme.
II. RELATED WORKS
The field of signal processing in the encrypted domain has witnessed several machine learning algorithms being redesigned to process data in the encrypted domain ([1], [4], [10]–[13] and references therein). The ultimate goal of all of these works is the same: protecting the privacy of the input data. However, these works redesign different machine learning algorithms, e.g., face recognition based on principal component analysis in [10], facial expression recognition based on linear discriminant analysis in [13], and multi-class classification based on support vector machines in [4], [12], to mention a few.
This paper proposes a new technique to perform voice data processing (i.e., speaker recognition) in the encrypted domain. A few works have been proposed in this direction [1], [14], [18]. Pathak et al. redesigned the GMM-MFCC based speaker recognition model in [1] to achieve a similar privacy goal. The work in [1] relies on homomorphic cryptosystems such as BGN and Paillier encryption. This work has shown a proof-of-concept of privacy-preserving speaker recognition without compromising the accuracy. However, the shortcoming of these cryptographic approaches [1] is that too much time is spent on encryption, which makes them impractical in real-life applications, e.g., [1] requires a few minutes for authentication.
In order to mitigate the heavy computation involved with the homomorphic encryption schemes [1], string-matching frameworks were proposed in [14], [18]. The schemes in [14], [18] convert the speech input, represented by supervectors, to bit strings using locality sensitive hashing (LSH) and count exact matches. Since it is easy to perform string comparison with privacy, this method proves to be more efficient; however, it lacks accuracy, with EER = 11.86%.
Hence, in this paper, we propose a new, lightweight technique based on randomisation to achieve the privacy goal without compromising accuracy or speed. We redesign GMM-based speaker recognition to work in the randomised domain. We use [1] as the baseline model to compare the performance. Section V shows that the proposed model achieves a 96.16% true positive rate and a 1.77% false positive rate on the TIMIT speech corpus. A demonstration on an Android smartphone shows that the algorithm can be executed within two seconds with only 30% of CPU power.
It should be noted that the field of speaker recognition has advanced from the GMM model to current state-of-the-art techniques such as i-vector and probabilistic linear discriminant analysis (PLDA) based speaker recognition [16]. These techniques are built on sophisticated mathematical frameworks, and it is infeasible to redesign them to work in a privacy-preserving manner without compromising the speed.
III. SPEAKER RECOGNITION WITHOUT PRIVACY
The speaker recognition considered in this paper comprises training and testing/authentication phases. For the training, analog speech samples from a person of interest are collected to build a speaker model. The speaker model is analogous to a mechanical lock: the lock can only be unlocked by the same person's voice. The training phase involves the combination of two techniques: (1) Mel-frequency cepstral coefficient (MFCC) extraction and (2) Gaussian mixture models (GMMs).
A. Notations and definitions
We use $\vec{(\cdot)}$ to denote vectors; $(\cdot)'$ denotes the transpose operator; $\|\cdot\|_2$ the Euclidean norm; $\lfloor\cdot\rceil$ the nearest integer approximation; and $\otimes$ denotes the Kronecker product.
B. Training phase
1) Mel-frequency cepstral coefficients (MFCC): MFCC extraction is the process of speech parameterization, which consists of transforming the speech signal into a set of feature vectors [3], i.e., identifying the components of the audio signal that are beneficial for identifying the linguistic content while discarding all other elements that carry information such as background noise, emotion, etc. [6]. The MFCC pipeline contains a number of components, i.e., pre-emphasis, windowing, FFT, Mel-scale filterbank, and discrete cosine transformation (refer to [3] for more details).
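The paper's own extraction pipeline is implemented in Java (see Section V-B); purely as an illustrative sketch, a comparable 12-coefficient MFCC front end can be reproduced with the open-source librosa library, whose mfcc routine internally applies the windowing, FFT, Mel filterbank and DCT steps listed above. The silence trimming, frame size and window choice below are assumptions, not the paper's exact settings.

```python
# Illustrative sketch of the MFCC front end (not the paper's Java code).
import librosa

def extract_mfcc(wav_path, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=None)       # keep the native sample rate
    y, _ = librosa.effects.trim(y, top_db=30)     # crude stand-in for the
                                                  # energy-based silence removal
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                    # 25 ms frames (assumed)
        hop_length=int(0.0125 * sr),              # 50% overlap, as in Sec. V-B
        window='hamming')                         # Hamming window, as in Sec. V-B
    return mfcc.T                                 # shape (T, D) with D = n_mfcc
```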
2) Gaussian mixture model (GMM): After the feature vectors are collected through MFCC, we use them to build speaker models to facilitate speaker authentication (we use speaker recognition and authentication interchangeably). The choice of this model is largely dependent on the features as well as the specifics of the application [7]. The MFCC features are denoted as $X = [\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T]$, where $\vec{x}_t$, $t = 1, \ldots, T$, denotes the $D$-dimensional $t$th feature vector extracted from the voice sample using MFCC. Using this definition, a Gaussian mixture density is a weighted sum of $M$ component densities, given by [2]

$$p(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T | \lambda) = \prod_{t=1}^{T} p(\vec{x}_t | \lambda), \qquad (1)$$
where

$$p(\vec{x}_t | \lambda) = \sum_{i=1}^{M} p_i\, b_i(\vec{x}_t), \qquad (2)$$

where $b_i(\vec{x}_t)$, $i = 1, \ldots, M$, are the component densities and $p_i$, $i = 1, \ldots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function of the form

$$b_i(\vec{x}_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2} (\vec{x}_t - \vec{\mu}_i)' \Sigma_i^{-1} (\vec{x}_t - \vec{\mu}_i)}, \qquad (3)$$
with mean vector $\vec{\mu}_i$ and covariance matrix $\Sigma_i$. The mixture weights satisfy the constraint $\sum_{i=1}^{M} p_i = 1$. Details on the selection of $M$ can be found in [2].
The complete Gaussian mixture density is parametrized by the mean vectors, covariance matrices and mixture weights of all component densities. These parameters collectively represent the speaker model, denoted by $\lambda_s = \{p_i, \vec{\mu}_i, \Sigma_i\}$, $i = 1, \ldots, M$. We learn these parameters from the enrolment data using the expectation-maximization (EM) algorithm.
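As a minimal sketch of this training step, assuming scikit-learn's GaussianMixture as the EM implementation (the paper does not specify its EM code or covariance structure; diagonal covariances are a common choice for GMM-based speaker models):

```python
# Sketch: learn lambda_s = {p_i, mu_i, Sigma_i} from enrolment features X
# of shape (T, D) via EM. Diagonal covariances are an assumption here.
from sklearn.mixture import GaussianMixture

def train_speaker_model(X, M=32):
    gmm = GaussianMixture(n_components=M, covariance_type='diag',
                          max_iter=200, random_state=0)
    gmm.fit(X)          # EM estimation of the mixture parameters
    # gmm.weights_ ~ p_i, gmm.means_ ~ mu_i, gmm.covariances_ ~ Sigma_i
    return gmm
```

The UBM $\lambda_U$ discussed next can be trained the same way on the pooled background speech.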
While the speaker model can be learned from the enrolment samples for the speaker, learning the GMM for the adversary class is less obvious, as an adversary can come from a very large set of speakers. In the literature, this adversary class has been represented by a universal background model (UBM) that is trained on a large and diverse set of speakers. Let us assume that the server generates the UBM parameters using the EM algorithm and UBM data, i.e., aggregated raw speech data from a large number of users. Let us denote the UBM speaker model parameters as $\lambda_U = \{p_i^U, \vec{\mu}_i^U, \Sigma_i^U\}$, $i = 1, \ldots, M$. In the next subsection we show how these two speaker models can be used to authenticate a user.
C. Testing/Authentication phase
In the previous section we explained the enrolment process and how to obtain the speaker model $\lambda_s$ and the UBM model $\lambda_U$. Once these models are built, the user should be able to use their voice for authentication.
Let us assume that the speaker model resides at an authentication server belonging to a particular company. The user is equipped with a device, possibly a smartphone, which captures the user's voice and sends the features to the server. Let us denote this new speech feature set (test template) as $\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T$. Now the authentication server will perform the following computations using the speaker model $\lambda_s$ and (1), (2), and (3):
$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T | \lambda_s) = \prod_{j=1}^{T} p(\vec{t}_j | \lambda_s) = \prod_{j=1}^{T} \sum_{i=1}^{M} p_i\, b_i(\vec{t}_j) = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{p_i}{(2\pi)^{D/2} |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2} (\vec{t}_j - \vec{\mu}_i)' \Sigma_i^{-1} (\vec{t}_j - \vec{\mu}_i)}. \qquad (4)$$
Similarly, using the UBM model $\lambda_U$ with (1), (2), and (3), the server obtains

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T | \lambda_U) = \prod_{j=1}^{T} p(\vec{t}_j | \lambda_U) = \prod_{j=1}^{T} \sum_{i=1}^{M} p_i^U\, b_i(\vec{t}_j) = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{p_i^U}{(2\pi)^{D/2} |\Sigma_i^U|^{1/2}}\, e^{-\frac{1}{2} (\vec{t}_j - \vec{\mu}_i^U)' (\Sigma_i^U)^{-1} (\vec{t}_j - \vec{\mu}_i^U)}. \qquad (5)$$
Then the server performs the verification using the following likelihood ratio test with respect to a pre-calibrated threshold $\theta$:

$$\text{if } \frac{p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T | \lambda_s)}{p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T | \lambda_U)} \geq \theta, \text{ then accept the speaker.} \qquad (6)$$
If the above likelihood ratio is higher than the predefined threshold, then the authentication server allows the user to access the service. The threshold value is determined using empirical analysis, which is studied later in this paper.
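As a sketch of this plaintext decision rule (assuming models trained as in the earlier sketch), the frame-wise products in (4) and (5) are computed in the log domain for numerical stability, so the ratio test in (6) becomes a difference compared against $\log\theta$:

```python
# Sketch: likelihood-ratio test of (6), evaluated in the log domain.
import numpy as np

def authenticate(test_features, gmm_speaker, gmm_ubm, theta):
    log_ps = np.sum(gmm_speaker.score_samples(test_features))  # log p(.|lambda_s)
    log_pu = np.sum(gmm_ubm.score_samples(test_features))      # log p(.|lambda_U)
    return (log_ps - log_pu) >= np.log(theta)                  # accept / reject
```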
It is obvious from the above description that the server knows both the speaker model (from enrolment) and the user's voice features (during authentication) in plain. As motivated in the introduction, these variables are unique to the user and should not be known to anyone except the user.
IV. PP SPEAKER RECOGNITION
In this section, we propose a novel protocol which protects the privacy of the speaker model residing in the third-party server as well as the speech test features used during the authentication process. Before proceeding to the technical details, the following two privacy goals are defined: (1) Privacy of Model Parameters: the server should be prevented from learning the speaker model parameters; and (2) Privacy of Test Speech Features: the user's speech biometrics should be protected from any unauthorised access.
In order to achieve these goals, we mathematically modify the GMM algorithm in the following sections. Pathak et al. used homomorphic encryption techniques to hide the speaker model or the voice feature vector from the server in [1]. However, due to the inherent properties of homomorphic encryption, the solution proposed in [1] is computationally infeasible. In this paper, we solve this problem by using a randomization technique which has proved to be very efficient in other biometric solutions [4].
Moreover, the existing works including [1] tried to achieve only one of the above privacy goals at a time, i.e., either the server knows the speaker model in plain, or the server needs to rely on the user to complete the matching process, which may add additional complexity. In the proposed scheme, the server does all of the matching without having the speaker models or voice features in the plain domain, as described in the following subsections.
A. Algorithmic Approach
In order to achieve the above privacy goals, the server should be prevented from learning the speaker model parameters and the user's speech samples used for authentication. In this paper, we use the randomization technique to mask these parameters before passing them to the server. If we randomize those parameters, then the server cannot compute (4), (5), and (6) directly. Since the parameters obtained from the users' speech need to be randomised, (4) and (5) have to be modified to accommodate the randomised parameters. Please note that the UBM speaker model's parameters are generated by the server and reside at the server side.
In order to achieve the above privacy goals, we need to mathematically transform (4) into a friendlier form such as a linear equation. The reason is that the speaker model parameters stored at the server and the test features are coupled in (4). In order to decouple the two sets of parameters, we adopt the construction from [1], which is explained below. First of all, let us define the elements of the vector $\vec{t}_j$ as $\vec{t}_j = [t_{j,1}\ t_{j,2}\ \ldots\ t_{j,D}]' \in \mathbb{R}^{D \times 1}$. A new vector $\hat{t}_j$ can be formed by inserting a 1 at the bottom of the vector $\vec{t}_j$ as follows: $\hat{t}_j = [\vec{t}_j'\ 1]' = [t_{j,1}\ t_{j,2}\ \ldots\ t_{j,D}\ 1]' \in \mathbb{R}^{(D+1) \times 1}$. A new, longer vector $\tilde{t}_j$, $j = 1, 2, \ldots, T$, is defined as shown in (7):
$$\tilde{t}_j = \hat{t}_j \otimes \hat{t}_j = \begin{bmatrix} t_{j1} \\ t_{j2} \\ \vdots \\ t_{j(D+1)^2} \end{bmatrix} = [\,t_{j,1} t_{j,1},\ t_{j,1} t_{j,2},\ \ldots,\ t_{j,1} t_{j,D},\ t_{j,1},\ t_{j,2} t_{j,1},\ t_{j,2} t_{j,2},\ \ldots,\ t_{j,2} t_{j,D},\ t_{j,2},\ \ldots,\ t_{j,D} t_{j,1},\ t_{j,D} t_{j,2},\ \ldots,\ t_{j,D} t_{j,D},\ t_{j,D},\ t_{j,1},\ t_{j,2},\ \ldots,\ t_{j,D},\ 1\,]' \in \mathbb{R}^{(D+1)^2 \times 1}. \qquad (7)$$
Define a matrix

$$W_i = \begin{bmatrix} -\frac{1}{2}\Sigma_i^{-1} & \Sigma_i^{-1}\vec{\mu}_i \\[2pt] \vec{0}' & \log \dfrac{p_i}{(2\pi)^{\frac{D}{2}} |\Sigma_i|^{\frac{1}{2}}} - \frac{1}{2}\vec{\mu}_i' \Sigma_i^{-1} \vec{\mu}_i \end{bmatrix} \in \mathbb{R}^{(D+1) \times (D+1)}, \qquad (8)$$
where $i = 1, \ldots, M$. Note that the matrix $W_i$ is generated using the speaker model parameters corresponding to the $i$th Gaussian density. Using (8) and the vectorization technique (i.e., the vectorization of an $m \times n$ matrix $A$, denoted by $\mathrm{vec}(A)$, is the $mn \times 1$ column vector obtained by stacking the columns of the matrix $A$ on top of one another), we can obtain

$$\tilde{w}_i = \mathrm{vec}(W_i) = \begin{bmatrix} w_{i1} \\ w_{i2} \\ \vdots \\ w_{i(D+1)^2} \end{bmatrix} = [\,\{W_i\}_{1,1},\ \{W_i\}_{2,1},\ \ldots,\ \{W_i\}_{D+1,1},\ \ldots,\ \{W_i\}_{1,D+1},\ \{W_i\}_{2,D+1},\ \ldots,\ \{W_i\}_{D+1,D+1}\,]' \in \mathbb{R}^{(D+1)^2 \times 1}, \qquad (9)$$
where $\{W_i\}_{n,m}$ denotes the element of matrix $W_i$ located at the $n$th row and $m$th column. Now we can use (7) and (9) to replace the exponential part in (4) as follows:

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T | \lambda) = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}, \qquad (10)$$

where $\tilde{w}_i$ is obtained only from the speaker model parameters and $\tilde{t}_j$ only from the test features.
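The following numerical sketch (NumPy, illustrative sizes) checks the linearisation: with $W_i$ built as in (8), $\tilde{w}_i = \mathrm{vec}(W_i)$ as in (9), and $\tilde{t}_j = \hat{t}_j \otimes \hat{t}_j$ as in (7), the quantity $e^{\tilde{t}_j'\tilde{w}_i}$ reproduces the weighted Gaussian term $p_i\, b_i(\vec{t}_j)$ of (4).

```python
# Numerical check of (7)-(10): exp(t_tilde' w_tilde) == p_i * b_i(t).
import numpy as np

rng = np.random.default_rng(0)
D = 3
t = rng.normal(size=D)
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)        # a valid (SPD) covariance matrix
Sinv = np.linalg.inv(Sigma)
p_i = 0.25                             # an arbitrary mixture weight

# W_i as in (8)
W = np.zeros((D + 1, D + 1))
W[:D, :D] = -0.5 * Sinv
W[:D, D] = Sinv @ mu
W[D, D] = (np.log(p_i / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5))
           - 0.5 * mu @ Sinv @ mu)

t_hat = np.append(t, 1.0)              # [t' 1]'
t_tilde = np.kron(t_hat, t_hat)        # (7)
w_tilde = W.flatten(order='F')         # vec(W_i), column stacking as in (9)

lhs = np.exp(t_tilde @ w_tilde)
rhs = (p_i / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)
       * np.exp(-0.5 * (t - mu) @ Sinv @ (t - mu)))
assert np.isclose(lhs, rhs)            # linearised form matches p_i * b_i(t)
```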
B. Novel Privacy Approach
We approach this by dividing the traditional framework into three parts: (1) protect the speaker models during enrolment, (2) protect the test features during authentication, and (3) protect the privacy during the GMM computation.
1) Privacy during enrolment: In order to protect the speaker model parameters (i.e., $\tilde{w}_i$, $i = 1, \ldots, M$), the user generates random vectors $\tilde{z}_i = [z_{i1}, \ldots, z_{i(D+1)^2}]' \in \mathbb{R}^{(D+1)^2 \times 1}$, where $|z_{in}| \geq |w_{in}|$ for all $i = 1, \ldots, M$ and $n = 1, \ldots, (D+1)^2$. The user uses the random vectors to randomise the speaker model parameters as follows:

$$[\tilde{w}_i] = \tilde{w}_i + \tilde{z}_i, \quad \forall i, \qquad (11)$$

where $[x]$ denotes the masked value of $x$ and

$$[w_{in}] = w_{in} + z_{in}, \quad \forall i, n. \qquad (12)$$

Now the user enrols only $[\tilde{w}_i]$, $\forall i$, in the authentication server at the end of the enrolment procedure, but keeps $\tilde{z}_i$, $\forall i$, secret.
2) Privacy during authentication: For the authentication, the user device extracts the speech features and forms the long vectors $\tilde{t}_j$ for all $j$ as shown in (7). In order to protect these vectors, the mobile device generates random vectors $\tilde{r}_j = [r_{j1}, \ldots, r_{j(D+1)^2}]' \in \mathbb{R}^{(D+1)^2 \times 1}$, where $|r_{jn}| \geq |t_{jn}|$ for all $j = 1, \ldots, T$ and $n = 1, \ldots, (D+1)^2$. The user uses the random vectors to randomise the feature vectors as follows:

$$[\tilde{t}_j] = \tilde{t}_j + \tilde{r}_j, \quad \forall j, \qquad (13)$$

where

$$[t_{jn}] = t_{jn} + r_{jn}, \quad \forall j, n. \qquad (14)$$

Now the user sends $[\tilde{t}_j]$, $\forall j$, to the authentication server while keeping the random vectors $\tilde{r}_j$, $\forall j$, secret.
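A minimal sketch of the client-side masking in (11)-(14) follows; the helper guarantees the magnitude condition elementwise, while the concrete random magnitudes are illustrative assumptions, not the paper's settings:

```python
# Sketch: additive masking for enrolment (11)-(12) and authentication (13)-(14).
import numpy as np

rng = np.random.default_rng()

def mask(vec, scale=1e6):
    """Return ([vec], secret mask) with |mask_n| >= |vec_n| for every n."""
    magnitude = np.abs(vec) + rng.uniform(1.0, scale, size=vec.shape)
    noise = rng.choice([-1.0, 1.0], size=vec.shape) * magnitude
    return vec + noise, noise

# Enrolment: send [w_i] = w_i + z_i to the server; keep z_i secret.
# Authentication: send [t_j] = t_j + r_j; keep a fresh r_j secret each session.
```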
3) Privacy during GMM computation: Now the server substitutes $[\tilde{t}_j]$, $\forall j$, and $[\tilde{w}_i]$, $\forall i$, into (10) and obtains

$$p_{\text{masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{[\tilde{t}_j]' [\tilde{w}_i]} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{(\tilde{t}_j + \tilde{r}_j)' (\tilde{w}_i + \tilde{z}_i)} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}\, e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i}. \qquad (15)$$
In (15), every $e^{\tilde{t}_j' \tilde{w}_i}$ (the speech data) is multiplied by a noise term $e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i}$ for all $i, j$. These noise terms are introduced by the randomization and must be removed at the server side in order to get the true result. Yet, the server does not know the random vectors $\tilde{r}_j$ and $\tilde{z}_i$; hence it cannot remove the noise by itself without any input from the user.
In order to assist the server to get the right result, the client generates random values $a_j \in \mathbb{R}$ for all $j$ such that $|a_j| \geq |\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i|$, and sends $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j$ to the server for all $i, j$. Using this new input, the server computes the following from (15):
$$p_{\text{less-masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{e^{[\tilde{t}_j]' [\tilde{w}_i]}}{e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j}} = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{e^{\tilde{t}_j' \tilde{w}_i}\, e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i}}{e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{a_j}\, e^{\tilde{t}_j' \tilde{w}_i} \qquad (16)$$

$$= \prod_{j=1}^{T} e^{a_j} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i} \qquad (17)$$

$$= e^{\sum_{j=1}^{T} a_j} \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}. \qquad (18)$$
Since $e^{a_j}$ is common to all $i$ in (16), it can be taken outside the summation as shown in (17). Again, by the rule of the product operation, $\prod_{j=1}^{T} e^{a_j} = e^{\sum_{j=1}^{T} a_j}$ can be made the common factor as shown in (18).
Let us compare the less-masked result in (18) with the true result in (10). In (18), the true result (i.e., $\prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}$) is multiplied by $e^{\sum_{j=1}^{T} a_j}$. Again, the server needs the user's input to remove $e^{\sum_{j=1}^{T} a_j}$ from (18) in order to arrive at the true result. Hence the user sends $\sum_{j=1}^{T} a_j$ to the server, and the server obtains the true result as follows:

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T | \lambda) = \frac{p_{\text{less-masked}}}{e^{\sum_{j=1}^{T} a_j}} = \frac{e^{\sum_{j=1}^{T} a_j} \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}}{e^{\sum_{j=1}^{T} a_j}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}. \qquad (19)$$
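To make the cancellation in (15)-(19) concrete, the following self-contained sketch runs the whole masked exchange on random toy data (tiny illustrative sizes; `dim` stands in for $(D+1)^2$) and checks that the unmasked result equals (10):

```python
# Sketch: end-to-end numerical walk-through of (15)-(19).
import numpy as np

rng = np.random.default_rng(1)
M, T, dim = 4, 3, 5
w = rng.normal(size=(M, dim))                 # speaker-model vectors w_i
t = rng.normal(size=(T, dim))                 # test vectors t_j

# Client: masks kept secret (z_i, r_j, a_j); |z| >= |w| and |r| >= |t| hold.
z = rng.choice([-1.0, 1.0], (M, dim)) * (np.abs(w) + rng.uniform(1, 5, (M, dim)))
r = rng.choice([-1.0, 1.0], (T, dim)) * (np.abs(t) + rng.uniform(1, 5, (T, dim)))
w_masked, t_masked = w + z, t + r             # (11) and (13), sent to the server
noise = t @ z.T + r @ w.T + r @ z.T           # noise[j, i] = t_j'z_i + r_j'w_i + r_j'z_i
a = np.abs(noise).max(axis=1) + rng.uniform(1, 10, T)   # |a_j| >= |noise[j, i]|
hints = noise - a[:, None]                    # also sent to the server

# Server: (16)-(18); dividing out the hints leaves e^{a_j} e^{t_j'w_i}.
terms = np.exp(t_masked @ w_masked.T - hints)
p_less_masked = np.prod(terms.sum(axis=1))    # (18)

# Server unmasks with sum(a_j) supplied by the client, (19).
p_true = np.prod(np.exp(t @ w.T).sum(axis=1))   # the true result (10)
assert np.isclose(p_less_masked / np.exp(a.sum()), p_true)
```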
C. Novel Privacy Approach for UBM
The previous subsections demonstrated how to compute (4) when the speaker model parameters and the speech features for authentication are randomised. Since the UBM model parameters are known to the server in the plain domain (i.e., only the speech features for authentication are randomised), we cannot use the above approach to compute (5).
Let us assume that the server obtains a modified speaker model vector similar to (9) for the UBM speaker model parameters. Hence, let us denote the vectorised UBM speaker model parameters as $\tilde{w}_i^U$ for all $i = 1, \ldots, M$. Similar to (15), the server now computes the following using $[\tilde{t}_j]$, $\forall j$, and $\tilde{w}_i^U$ for all $i = 1, \ldots, M$:
$$p^U_{\text{masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{[\tilde{t}_j]' \tilde{w}_i^U} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{(\tilde{t}_j + \tilde{r}_j)' \tilde{w}_i^U} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i^U}\, e^{\tilde{r}_j' \tilde{w}_i^U}. \qquad (20)$$
In (20), every $e^{\tilde{t}_j' \tilde{w}_i^U}$ is multiplied by a noise term $e^{\tilde{r}_j' \tilde{w}_i^U}$ for all $i, j$. These noise terms are introduced by the randomization and must be removed at the server side in order to get the true result. Yet, the server does not know the random vectors $\tilde{r}_j$; hence it cannot remove the noise by itself without any input from the user.
The exponent of the noise term is just a scalar product $\tilde{r}_j' \tilde{w}_i^U$, where $\tilde{r}_j$ is known to the user and $\tilde{w}_i^U$ is known to the server. This can be calculated by the PP scalar multiplication algorithm proposed in [20]. Using the PP scalar multiplication algorithm in [20], the server can obtain $\tilde{r}_j' \tilde{w}_i^U + \gamma_j$ for all $i$ and $j$. Using these values and (20), the server computes the following:
$$p^U_{\text{less-masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{e^{\tilde{t}_j' \tilde{w}_i^U}\, e^{\tilde{r}_j' \tilde{w}_i^U}}{e^{\tilde{r}_j' \tilde{w}_i^U + \gamma_j}} = \prod_{j=1}^{T} \frac{1}{e^{\gamma_j}} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i^U} = e^{-\sum_{j=1}^{T} \gamma_j} \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i^U}. \qquad (21)$$
At last, the client sends $e^{\sum_{j=1}^{T} \gamma_j}$ to the server, hence the server can obtain the correct result from (21).
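A matching sketch for the UBM branch (20)-(21): here the output of the PP scalar multiplication of [20] is simulated directly (in the real protocol the server would obtain $\tilde{r}_j'\tilde{w}_i^U + \gamma_j$ from that interactive protocol, not compute it locally), so only the cancellation is being checked:

```python
# Sketch: numerical walk-through of (20)-(21) for the UBM branch.
import numpy as np

rng = np.random.default_rng(2)
M, T, dim = 4, 3, 5
w_u = rng.normal(size=(M, dim))            # UBM vectors, known to the server
t = rng.normal(size=(T, dim))              # test vectors, known to the client
r = rng.choice([-1.0, 1.0], (T, dim)) * (np.abs(t) + rng.uniform(1, 5, (T, dim)))
gamma = rng.uniform(1.0, 10.0, size=T)     # client-side secrets of [20]

t_masked = t + r                           # sent to the server, (13)
s = r @ w_u.T + gamma[:, None]             # simulated PP scalar product output [20]

terms = np.exp(t_masked @ w_u.T - s)       # = e^{t_j'w_u_i} / e^{gamma_j}, per (21)
p_less_masked = np.prod(terms.sum(axis=1)) # = e^{-sum gamma_j} * true result
p_true = np.prod(np.exp(t @ w_u.T).sum(axis=1))
assert np.isclose(p_less_masked * np.exp(gamma.sum()), p_true)  # final unmasking
```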
D. Privacy Analysis
In this section, we analyse whether our algorithm is vulnerable to any privacy leakage. Our algorithm is based on two-party computation, and the only chance of a privacy leakage is during the interaction between the two parties. We use the following Goldreich's privacy definition and information-theoretic security to prove that our method does not leak any unintended information to the client or the server:
Privacy definition for secure two-party computation: A secure two-party protocol should not reveal more information to a semi-honest party than the information that can be induced by looking at that party's input(s) and output(s). The formal treatment of this definition can be found in [15].
Information-theoretic security: A security algorithm is information-theoretically secure if its security derives purely from information theory. The concept of information-theoretically secure communication was introduced in 1949 by the American mathematician Claude Shannon, who used it to prove that the one-time pad system achieves perfect security subject to the following two conditions [17]:
1. the key which randomizes the data should be random and should be used only once;
2. the key length should be at least as long as the length of the data.
If an algorithm randomizes its parameters and satisfies the above conditions, the parameters cannot be unmasked by an adversary even when the adversary has unlimited computing power. For example, if the message space and random space are both 1024 bits, the prior probability (the probability of a particular message out of $2^{1024}$ possible messages) and the posterior probability (the probability of inferring/mapping a message in the random domain to one in the message domain) are equal, i.e., there is no advantage for an adversary in achieving a higher posterior probability than prior probability.
Proof: Let us verify whether the proposed algorithm satisfies the privacy definition. As described above, the proposed algorithm is composed of three parts (Subsections IV-B1, IV-B2, and IV-B3). In the following, we show what the inputs and outputs to and from the user and server are, respectively. This clearly highlights what is already known to the user and the server. Hence, if we can prove that nothing else can be inferred, other than the known inputs and outputs, with higher posterior probability than prior probability, then the proposed algorithm satisfies the privacy definition.
The ultimate aim of the user is to keep the enrolled features and the test features away from the server, while the server wants to keep the intermediate results away from the user. Initially (Subsection IV-B1), the user just sends the randomised vectors $[\tilde{w}_i]$ (i.e., $\tilde{w}_i + \tilde{z}_i$) to the server instead of $\tilde{w}_i$. From these inputs, the server only learns the dimension of the vectors, which does not violate the privacy. According to the information-theoretic security definition, if $|z_{in}| \geq |w_{in}|$, $\forall i, n$, the server will not be able to infer the original values from $[\tilde{w}_i]$.
Using the same argument, we can claim that there is no privacy leakage during the authentication phase (Subsection IV-B2) as long as $|r_{jn}| \geq |t_{jn}|$, $\forall j, n$. Since this phase is repeated every time the user wants to authenticate to the server, the user generates fresh random vectors $\tilde{r}_j$, $\forall j$, each time.
In order to assist the server to compute the correct GMM output, the user also sends $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j$ and $\sum_{j=1}^{T} a_j$ to the server, where $|a_j| \geq |\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i|$ and the $a_j$ are generated freshly for each authentication. It is impossible for the server to infer $a_j$, $\forall j$, from $\sum_{j=1}^{T} a_j$, or $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i$ from $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j$, $\forall i, j$. We chose only one random value $a_j$ for all $(\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i)$, $i = 1, 2, \ldots, M$, in order to reduce the complexity. Consider the value $(\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i)$, where $\tilde{r}_j$ and $\tilde{z}_i$ are random: for different $i$, the combined value is random, hence the masked value $(\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i) - a_j$ is also random for all $i$.
The same analogy can be used to prove that the server cannot learn the speech features during the UBM calculation.
V. PERFORMANCE ANALYSIS
In this section, we first describe the dataset used for the experiments and the evaluation environment. We then perform tests using both the traditional approach, i.e., without privacy, and the proposed approach, followed by a complexity comparison.
A. Usage of TIMIT dataset
We use the TIMIT speech corpus [5] to evaluate the accuracy and reliability of the proposed algorithm. The TIMIT speech corpus contains broadband recordings (each lasting around 3 seconds) of 630 speakers of eight major dialects of American English. Each speaker has 10 speech samples; out of these, 9 were used to build the speaker model.
The TIMIT data corpus divides the first four dialects into two sets: Set 1 contains 258 speakers, and Set 2 contains 95 speakers. To validate the proposed algorithms, we use all the speakers from Set 1 to build 258 speaker models. Hence, we enrol the 258 speakers' voice features in the server. We use the speakers in Set 2 as adversaries who try to impersonate the genuine users to get unauthorised access to the service. The last four dialects are used to construct the UBM, i.e., we aggregated all the voice data into a single audio file and then extracted the UBM model $\lambda_U$.
In order to explain the testing phase, let us assume a service and name it Service A. We randomly select 156 speakers from Set 1 and assume those 156 speakers registered for Service A.
B. Evaluation Environment
To measure the performance of the proposed scheme in a real environment, we implemented the scheme on a smartphone and a computer. Specifically, a smartphone with a quad-core 1.56 GHz Cortex-A53 CPU running Android 4.4.2 and a computer with an Intel i5 at 2.40 GHz, 8 GB RAM, and Windows 7 were chosen to evaluate the user and server sides, respectively; they are connected through an 802.11g WLAN. Based on the proposed scheme, an application built in Java, named speechxrays.apk, is installed on the smartphone, and the simulator for the server is deployed on the computer.
MFCC features of speech samples are extracted through a number of filters built in Java. As a first step, the digitized speech signal, sampled at 44,100 Hz, is sent through an energy detection filter to remove the silent parts of the signal. The silence-removed signal is then normalised, framed with 50% overlap, and passed through a Hamming window. The output of the Hamming window is sent through the MFCC filter, which contains the Mel filter bank, a non-linear transformation bank, DCT, and FFT. More details of MFCC feature extraction can be found in [3]. We extracted 100 MFCC feature vectors per second of speech, and each feature vector contains twelve MFCC coefficients.
C. Experiments on the TIMIT Database without privacy
To validate the proposed method, we first obtain the accuracy of the GMM algorithm on the TIMIT dataset using the pre-divided speech samples. To start with, we obtain the decision-making output of (6) on the 156 speech samples from the subset of Set 1. The probability distribution of these calculations is depicted in Fig. 1 (the distribution on the right), where each user is tested against their own speaker model. Using the results from Fig. 1, Table I shows the true positive and false negative values for different threshold values.
However, deciding the threshold based only on genuine attempts may be misleading. It is important to cross-check the genuine speaker models using speech samples belonging to other users, i.e., adversaries. We can use the remaining users who are not part of the subset of Set 1 as adversaries. For example, set DR1 contains 49 speakers, but only 38 (i.e., Set 1) speaker models were built. For this experiment, for DR1, we assume that these 38 speaker models were attacked by 15 speakers (i.e., a subset of Set 1 in DR1) and the speakers in DR1 of Set 2. Even those 15 subset speakers try to authenticate themselves against all 38 speaker models. The same applies to DR2, DR3, and DR4.
Hence, for DR1, (15 × 38) + (11 × 38) = 988 authentication attempts were made. Only 15 attempts out of 988 are legitimate and all others are malicious. Similarly, for DR2, (58 × 76) + (26 × 76) = 6384; for DR3, (43 × 76) + (26 × 76) = 5244; and for DR4, (40 × 68) + (32 × 68) = 4896. In total, 17512 authentication attempts were made. Out of these 17512, only 156 attempts were legitimate. Hence, similar to Fig. 1, we plotted the decision-making output of (6) for the 17512 − 156 = 17356 malicious authentication attempts in Fig. 1 (the distribution on the left).

Figure 1. Combination of genuine and malicious authentication attempts. The distribution on the right-hand side shows the outcome for 156 legitimate authentication attempts (in numbers). The distribution on the left-hand side shows the outcome for 17356 malicious attempts (in percentages).
From both distributions in Fig. 1, the values of the decision-making output of (6) are generally larger than 0.5 for legitimate attempts, while those values are usually smaller than 0.5 for malicious attempts. However, the important fact is the overlap between them, i.e., the minimum value for the legitimate users was 0.2444 while the maximum value for the malicious users was 1.1312; hence, the choice of the threshold will impact not only the true positive accuracy but also the false positive attempts.
For example, if the threshold $\theta$ is set to 0.40, and using the false negative rate and false positive rate to quantify the accuracy, we obtain:

False negative rate = (6 / 156) × 100 = 3.85%;
False positive rate = (308 / 17356) × 100 = 1.77%.
D. Performance of the proposed solution
We repeat the same experimental procedure of Section V-C to test the performance of the proposed algorithm, i.e., with privacy. Let us briefly explain the randomization procedure used in our analysis. First, we obtain the range of values of the 156 speaker models' parameters and the 353 speakers' speech features. We observed that these values contain up to twelve digits after the decimal point and five before it. Keeping this in mind, the client generates random numbers with twelve digits after and six digits before the decimal point. Table II shows a few examples of the test features in (7), together with the corresponding 18-digit-long random numbers and the randomised feature variables.
For a five-second-long speech sample, we need $5 \times 100 \times (12+1)^2 = 84{,}500$ random numbers. In total, 100,000 eighteen-digit-long random numbers require approximately 8 MB of memory. Using these random numbers, we repeated the experiments carried out in Section V-C and obtained the same results when the scaling parameter of the PP scalar multiplication equals $10^6$ [20].
Table I
TRUE POSITIVE AND FALSE NEGATIVE VALUES FOR DIFFERENT THRESHOLD VALUES

θ    | True Positive | False Negative
0    | 100%          | 0%
0.4  | 96.15%        | 3.84%
0.5  | 89.75%        | 10.25%
0.8  | 66.42%        | 43.58%
2    | 1.28%         | 91.78%
Table II
RANDOMISATION EXAMPLES

Test speech samples t_{j1}, t_{j2}, ... | Corresponding random values r_{j1}, r_{j2}, ... | Randomised speech features [t_{j1}], [t_{j2}], ...
16.725081716161    | 420223.336542782260 | 420240.061624498421
0.0001274641       | 225758.663235026265 | 225758.663362490365
1.171659034624     | 206555.735876851157 | 206556.907535885781
46.027486334736    | 236074.104503441653 | 236120.131989776389
1910.031836695921  | 125628.018508620454 | 127538.050345316375
Figure 2. Number of multiplications required for the proposed and traditional (without privacy) speech authentication algorithms (complexity comparison when M = 32; x-axis: length of speech sample, y-axis: time complexity in total multiplications; curves: server without privacy, server with privacy, client with privacy).
Figure 3. Number of multiplications required for the proposed and traditional (without privacy) schemes for different numbers of Gaussian models (M = 32–128) when T = 600 (x-axis: number of Gaussians; y-axis: time complexity in total multiplications, ×10^9; curves: server without privacy, server with privacy).
Figure 4. Number of multiplications required for the proposed scheme at the client side for different numbers of Gaussian models (M = 32–128) when T = 600 (x-axis: number of Gaussians; y-axis: time complexity in total multiplications, ×10^6).
E. Computational complexity comparison
Without privacy considerations, the server computes (6) using the test speech feature vectors obtained from the client. The decision-making equation in (6) is composed of (10), i.e., calculating (10) using the speech features and the speaker model, followed by calculating (10) using the speech features and the UBM model. Basically, the server needs to execute (10) twice.
Denote the time complexity of a multiplication and an exponentiation as $t_m$ and $t_e$, respectively. Also denote the total complexity for the server to perform the authentication without privacy as $T^{\text{w/out privacy}}_{\text{server}}$; then the total complexity without privacy is given by

$$T^{\text{w/out privacy}}_{\text{server}} = 2MT(D+1)^4 t_m + MT t_e + (T-1) t_m + t_m. \qquad (22)$$
Let us now obtain the computational complexity of the proposed PP approach. For the speaker model computation, the server is required to compute (15) to (19). For the UBM model, the server needs to follow the steps described in Section IV-C. This involves executing the PP scalar multiplication algorithm [20], for which the server needs to compute $(D+1)^2 + 1$ multiplications and the client needs to compute $(D+1)^2 + 1$ multiplications. If we denote the total complexity for the server to perform the PP authentication as $T^{\text{privacy}}_{\text{server}}$, then it is given by

$$T^{\text{privacy}}_{\text{server}} = [MT(D+1)^4 + 1] t_m + (MT+1) t_e + MT[(D+1)^2 + 1 + (D+1)^4] t_m + (MT+1) t_e, \qquad (23)$$

and for the client

$$T^{\text{privacy}}_{\text{client-serial}} = MT[(D+1)^2 + 1] t_m. \qquad (24)$$
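As a small sketch, the operation counts in (22)-(24) can be evaluated directly (taking $t_m = 1$ unit and $t_e \approx 240\, t_m$, the approximation assumed for Figures 2-4):

```python
# Sketch: evaluating the complexity expressions (22)-(24).
def complexity(M, T, D=12, tm=1.0, te=240.0):
    server_plain = 2 * M * T * (D + 1) ** 4 * tm + M * T * te \
                   + (T - 1) * tm + tm                              # (22)
    server_pp = ((M * T * (D + 1) ** 4 + 1) * tm + (M * T + 1) * te
                 + M * T * ((D + 1) ** 2 + 1 + (D + 1) ** 4) * tm
                 + (M * T + 1) * te)                                # (23)
    client_pp = M * T * ((D + 1) ** 2 + 1) * tm                     # (24)
    return server_plain, server_pp, client_pp

for M in (32, 64, 96, 128):                  # the range shown in Figs. 3 and 4
    print(M, complexity(M, T=600))
```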
To visualise the computational complexity for different numbers of Gaussian models ($M$) and feature vectors ($T$), we simulated (22), (23), and (24) in Figures 2, 3, and 4. For these simulations we used the approximation $t_e \approx 240\, t_m$ [20]. We also assumed that the random numbers can be generated off-line and that the complexity of additions is negligible compared to multiplications and exponentiations. Fig. 2 shows the time complexity when $M = 32$ in terms of the number of multiplications required for the proposed scheme and the traditional scheme. We can clearly observe that there is no significant difference between the two schemes at the server side. The client is required to perform additional multiplications, but these are negligible compared to the server-side complexity. As expected, the complexity of both schemes increases linearly with the number of speech features (i.e., the length of the input speech). However, this linear increase can be flattened if the server is equipped with multi-core processors. Nevertheless, Fig. 2 demonstrates that the proposed scheme does not increase the complexity substantially compared to the traditional approach. Hence, our scheme is practical compared to the homomorphic encryption scheme [1], which consumes hours to complete the authentication.
Fig. 3 compares the impact of the number of Gaussian models when $T = 600$. When $M$ increases, the complexity increases linearly for both schemes. A similar trend, but at the client side, is observed in Fig. 4. At the server side, both schemes perform equally regardless of the number of Gaussian models used. However, the proposed scheme introduces complexity at the client, i.e., approximately $4 \times 10^6$ multiplications when $M = 32$ and $T = 600$. It should be noted that a single-core 3 GHz processor can compute nearly $10^9$ multiplications per second. Based on these facts, we can claim that the proposed scheme is practical while preserving the privacy of users' speech samples and speaker models.
We observed in our implementation that the proposed algorithm takes nearly two seconds for enrolment and less than two seconds for authentication in the Android smartphone environment. These times are comparable to the time required to type passwords for online applications on mobile devices. Fig. 5 shows the Android Monitor readings of CPU and memory usage while the app is being used on the smartphone. For a five-second speech input, the smartphone uses nearly 30% of the phone's CPU and 36 MB of memory. These numbers clearly demonstrate the practicality of the proposed approach in a real environment.
Figure 5. Android Monitor reading of smartphone memory and CPU usage
during the enrolment and authentication phase of the proposed scheme.
VI. CONCLUSIONS AND FUTURE WORK
An efficient privacy-preserving speaker authentication protocol has been proposed. To achieve efficiency and privacy, the proposed solution algorithmically redesigns the GMM to incorporate randomness without affecting the final outcome. The proposed protocol is based on a randomization technique and relies only on multiplications and additions. The two parties in this scheme, the client and the server, perform the authentication interactively. It is proved using Goldreich's privacy definition and information-theoretic security that the algorithm is secure. It is shown empirically that our scheme does not degrade accuracy and does not add much computational overhead.
REFERENCES
[1] Pathak, M.A.; Raj, B., ”Privacy-Preserving Speaker Verification and
Identification Using Gaussian Mixture Models,” IEEE Trans. Audio,
Speech, and Language Processing , vol.21, no.2, pp.397-406, Feb. 2013
[2] Reynolds, D.A.; Rose, R.C., "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
[3] Bimbot, F.; Bonastre, J.-F.; Fredouille, C.; Gravier, G.; Magrin-Chagnolleau, I.; Meignier, S.; Merlin, T.; Ortega-Garcia, J.; Petrovska-Delacrétaz, D.; Reynolds, D. A., "A tutorial on text-independent speaker verification," EURASIP J. Appl. Signal Process., vol. 4, pp. 430-451, 2004.
[4] Rahulamathavan, Y., Rajarajan, M. ”Efficient Privacy-preserving Facial
Expression Classification,” IEEE Trans. Dependable and Secure Com-
puting , in press.
[5] Garofolo, John, et al. ”TIMIT Acoustic-Phonetic Continuous Speech
Corpus LDC93S1,” Web Download. Philadelphia: Linguistic Data Con-
sortium, 1993.
[6] Lyons, J., “Mel Frequency Cepstral Coefficient (MFCC)
tutorial,” Practicalcryptography.com ,[Online]. Available:
http://practicalcryptography.com/miscellaneous/machine-learning/guide-
mel-frequency-cepstral-coefficients-mfccs/. [Accessed: 22- Jul- 2015].
[7] Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B., ”Speaker Verification Using
Adapted Gaussian Mixture Models,” Digital Signal Processing, Vol.10,
Issues 1–3, pp. 19-41, Jan 2000
[8] Dempster, A.P., Laird, N.M. and Rubin, D.B., 1977. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the royal statistical
society. Series B (methodological), pp.1-38.
[9] R. Duda and P. Hart, Pattern classification and scene analysis. New York:
Wiley, 1973.
[10] Z. Erkin, M. Franz, J. Guajardo, S. Katzenbeisser, I. Lagendijk, and
T. Toft, “Privacy-preserving face recognition,” in Proc. 9th International
Symposium on Privacy Enhancing Technologies, PETS ’09, pp. 235–253,
2009.
[11] Y. Rahulamathavan, R. Phan, S. Veluru, K. Cumanan, and M. Rajarajan,
“Privacy-preserving multi-class support vector machine for outsourcing
the data classification in cloud,” IEEE Trans. Dependable Secure Com-
puting, vol. 11, no. 5, pp. 467–479, Sept. 2014.
[12] Y. Rahulamathavan, S. Veluru, R. Phan, J. Chambers, and M. Rajara-
jan, “Privacy-preserving clinical decision support system using gaussian
kernel based classification,” IEEE Journal of Biomedical and Health
Informatics, vol. 18, no. 1, pp. 56–66, Jan. 2014.
[13] Y. Rahulamathavan, R. Phan, J. Chambers, and D. Parish, “Facial
expression recognition in the encrypted domain based on local fisher
discriminant analysis,” IEEE Trans. Affective Computing, vol. 4, no. 1,
pp. 83–92, Jan.-Mar. 2012.
[14] M. A. Pathak, B. Raj, S. D. Rane and P. Smaragdis, ”Privacy-preserving
speech processing: cryptographic and string-matching frameworks show
promise,” in IEEE Signal Processing Magazine, vol. 30, no. 2, pp. 62-74,
March 2013.
[15] O. Goldreich, "Secure multiparty computation," working draft, available: http://www.wisdom.weizmann.ac.il/~oded/pp.html, Sep. 1998.
[16] Dehak, N., et al., "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[17] C. E. Shannon, “Communication theory of secrecy systems*,” Bell
system technical journal, vol. 28, no. 4, pp. 656–715, 1949.
[18] M. A. Pathak and B. Raj, ”Privacy-preserving speaker verification as
password matching,” Acoustics, Speech and Signal Processing (ICASSP),
2012 IEEE International Conference on, Kyoto, 2012, pp. 1849-1852.
[19] B. K. Sy, ”Secure Computation for Biometric Data Secu-
rity—Application to Speaker Verification,” in IEEE Systems Journal,
vol. 3, no. 4, pp. 451-460, Dec. 2009.
[20] R. Lu, X. Lin, and X. Shen, “SPOC: A secure and privacy-preserving
opportunistic computing framework for mobile-healthcare emergency,”
IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 3, pp. 614–
624, 2013.