Content uploaded by Yogachandran Rahulamathavan on May 14, 2019. Content may be subject to copyright.

Redesign of Gaussian Mixture Model for Efficient and Privacy-preserving Speaker Recognition

S Rahulamathavan, X Yao, R Yogachandran, K Cumanan, and M Rajarajan

Abstract—This paper proposes an algorithm to perform privacy-preserving (PP) speaker recognition using Gaussian mixture models (GMM). We consider a scenario where users have to enrol their voice biometric with third-party service providers to access different services (e.g., banking). Once enrolment is done, the users can authenticate themselves to the system using their voice instead of passwords. Since the voice is unique to an individual, storing the users' voice features at a third-party server raises privacy concerns. Hence, in this paper we propose a novel technique using randomization to perform voice authentication, which allows users to enrol and authenticate their voice in the encrypted domain, so privacy is preserved. To achieve this, we redesign the GMM to work in the encrypted domain. The proposed algorithm is validated using the widely used TIMIT speech corpus. Experimental results demonstrate that the proposed PP algorithm does not degrade the performance compared to the non-PP method, achieving a 96.16% true positive rate and a 1.77% false positive rate. A demonstration on an Android smartphone shows that the algorithm can be executed within two seconds using only 30% of CPU power.

Index Terms—Privacy, security, speech, GMM-MFCC, encrypted domain.

I. INTRODUCTION

Traditional authentication methods such as passwords, PINs, and memorable words can be easily forgotten, lost, guessed, stolen, or shared. In contrast, authentication using anatomical traits such as fingerprint, face, palm print, iris, and voice is very difficult to forge, since these traits are physically linked to the user. However, various security and privacy challenges deter public confidence in adopting biometric-based authentication systems.

Speech is a unique characteristic of an individual and is widely used in speaker verification and identification tasks, in applications such as authentication and surveillance respectively [1]. In authentication, two parties are involved: (1) the user and (2) the service provider. Initially, the users need to enrol their voice with a service provider. Later, the service provider uses the enrolled voice template to match new voice samples from the users for authentication [2]. This matching can be performed using a probabilistic representation such as Gaussian mixture models (GMMs) [2].

In general, the users' raw voice data is passed through a number of filters to extract unique features (see Section III). These voice templates, similar to the users' fingerprints, are unique to individuals, and storing them at third-party servers raises privacy concerns. In order to tackle this problem, we propose a privacy-preserving scheme where the server responsible for user authentication does not have access to the user's voice template in the plain domain.

This work was supported by the EU Horizon 2020 programme under EU Grant H2020-EU.3.7 (Project ID: 653586).

In order to perform voice authentication in a privacy-preserving manner, we redesign the GMM model and extend it to work in the encrypted domain. Recently, Pathak et al. attempted to solve this problem using homomorphic encryption techniques in [1]. However, the pioneering work in [1] is computationally impractical, i.e., it requires several hours to perform the encryption and voice matching, due to the use of very long integers (2048 bits) for exponentiation.

In order to mitigate the computational complexity, we propose a novel scheme using a randomization technique which has proven lightweight in other biometric solutions [4]. We validate the proposed scheme using the TIMIT speech corpus [5] and show that our scheme performs much faster than the existing solution without compromising accuracy. On top of this, we analyse the security and privacy aspects of the proposed scheme.

II. RELATED WORKS

The field of signal processing in the encrypted domain has witnessed several machine learning algorithms being redesigned to process data in the encrypted domain ([1], [4], [10]–[13] and references therein). The ultimate goal of all of these works is the same: protecting the privacy of the input data. However, they redesign different machine learning algorithms, e.g., face recognition based on principal component analysis [10], facial expression recognition based on linear discriminant analysis [13], and multi-class classification based on support vector machines [4], [12], to mention a few.

This paper proposes a new technique to perform voice-data processing (i.e., speaker recognition) in the encrypted domain. A few works have been proposed in this direction [1], [14], [18]. Pathak et al. redesigned the GMM-MFCC based speaker recognition model in [1] to achieve a similar privacy goal. The work in [1] relies on homomorphic cryptosystems such as BGN and Paillier encryption, and has shown a proof-of-concept of privacy-preserving speaker recognition without compromising accuracy. However, the shortcoming of these cryptographic approaches [1] is that too much time is spent on encryption, which makes them impractical in real-life applications, e.g., [1] requires a few minutes for authentication.

In order to mitigate the heavy computation involved in homomorphic encryption schemes [1], string-matching frameworks were proposed in [14], [18]. These schemes convert the speech input, represented by super-vectors, to bit strings using locality sensitive hashing (LSH) and count exact matches. Since it is easy to perform string comparison with privacy, this method proves more efficient; however, it lacks accuracy, with an equal error rate (EER) of 11.86%.

Hence, in this paper we propose a new, lightweight technique based on randomisation to achieve the privacy goal without compromising accuracy or speed. We redesign GMM-based speaker recognition to work in the randomised domain. We use [1] as the baseline model for performance comparison. Section V shows that the proposed model achieves a 96.16% true positive rate and a 1.77% false positive rate on the TIMIT speech corpus. A demonstration on an Android smartphone shows that the algorithm can be executed within two seconds using only 30% of CPU power.

It should be noted that the field of speaker recognition has advanced from the GMM model to current state-of-the-art techniques such as i-vector and probabilistic linear discriminant analysis (PLDA) [16] based speaker recognition. These techniques are built on sophisticated mathematical frameworks, and it is infeasible to redesign them to work in a privacy-preserving manner without compromising speed.

III. SPEAKER RECOGNITION WITHOUT PRIVACY

The speaker recognition considered in this paper comprises training and testing/authentication phases. For training, analog speech samples from a person of interest are collected to build a speaker model. The speaker model is analogous to a mechanical lock: the lock can only be unlocked by the same person's voice. The training phase combines two techniques: (1) Mel-frequency cepstral coefficient (MFCC) extraction and (2) Gaussian mixture models (GMMs).

A. Notations and definitions

We use $\vec{(\cdot)}$ to denote vectors; $(\cdot)'$ denotes the transpose operator; $\|\cdot\|_2$ the Euclidean norm; $\lfloor \cdot \rceil$ the nearest integer approximation; and $\otimes$ the Kronecker product.

B. Training phase

1) Mel-frequency cepstral coefficients (MFCC): MFCC extraction is the process of speech parameterization, which consists in transforming the speech signal into a set of feature vectors [3], i.e., identifying the components of the audio signal that are beneficial for identifying the linguistic content while discarding the other elements, which carry information such as background noise and emotion [6]. MFCC extraction may contain a number of components, e.g., pre-emphasis, windowing, FFT, Mel-scale filterbank, and discrete cosine transform (refer to [3] for more details).

2) Gaussian mixture model (GMM): After the feature vectors are collected through MFCC, we use them to build speaker models to facilitate speaker authentication^a. The choice of this model is largely dependent on the features as well as the specifics of the application [7]. The MFCC features are denoted as $X = [\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T]$, where $\vec{x}_t$, $t = 1, \ldots, T$, denotes the $D$-dimensional $t$-th feature vector extracted from the voice sample using MFCC. Using this definition, a Gaussian mixture density is a weighted sum of $M$ component densities given by [2]

$$p(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T \mid \lambda) = \prod_{t=1}^{T} p(\vec{x}_t \mid \lambda), \quad (1)$$

where

$$p(\vec{x}_t \mid \lambda) = \sum_{i=1}^{M} p_i\, b_i(\vec{x}_t), \quad (2)$$

^a We use speaker recognition and authentication interchangeably.

where $b_i(\vec{x}_t)$, $i = 1, \ldots, M$, are the component densities and $p_i$, $i = 1, \ldots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function of the form

$$b_i(\vec{x}_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2} (\vec{x}_t - \vec{\mu}_i)' \Sigma_i^{-1} (\vec{x}_t - \vec{\mu}_i)}, \quad (3)$$

with mean vector $\vec{\mu}_i$ and covariance matrix $\Sigma_i$. The mixture weights satisfy the constraint $\sum_{i=1}^{M} p_i = 1$. For the selection of $M$, refer to [2].

The complete Gaussian mixture density is parametrized by the mean vectors, covariance matrices, and mixture weights of all component densities. These parameters are collectively referred to as the speaker model, denoted $\lambda_s = \{p_i, \vec{\mu}_i, \Sigma_i\}$, $i = 1, \ldots, M$. We learn these parameters from the enrolment data using the expectation-maximization (EM) algorithm.
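For concreteness, the mixture density in (1)–(3) can be evaluated directly once the model parameters are known. The sketch below (NumPy, with illustrative parameter values; not the paper's implementation) computes $p(X \mid \lambda)$ for a given speaker model:

```python
import numpy as np

def gmm_likelihood(X, weights, means, covs):
    """p(x_1..x_T | lambda) per Eqs. (1)-(3): product over frames of a
    weighted sum of M D-variate Gaussian densities."""
    total = 1.0
    for x in X:                                  # Eq. (1): product over T frames
        p_x = 0.0
        for p_i, mu, cov in zip(weights, means, covs):
            d = x - mu
            inv = np.linalg.inv(cov)
            norm = (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(cov))
            p_x += p_i * np.exp(-0.5 * d @ inv @ d) / norm   # Eqs. (2)-(3)
        total *= p_x
    return total
```

For a single 1-D component with zero mean and unit variance, the result at $x = 0$ reduces to the familiar $1/\sqrt{2\pi}$, which gives a quick correctness check.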

While the speaker model can be learned from the enrolment samples for the speaker, learning the GMM for the adversary class is less obvious, as an adversary can come from a very large set of speakers. In the literature, this adversary class is represented by a universal background model (UBM) trained on a large and diverse set of speakers. Let us assume that the server generates the UBM parameters using the EM algorithm and UBM data, i.e., aggregated raw speech data from a large number of users. Let us denote the UBM speaker model parameters as $\lambda_U = \{p_i^U, \vec{\mu}_i^U, \Sigma_i^U\}$, $i = 1, \ldots, M$. In the next subsection we show how these two speaker models can be used to authenticate a user.

C. Testing/Authentication phase

In the previous section we explained the enrolment process and how to obtain the speaker model $\lambda_s$ and the UBM model $\lambda_U$. Once these models are obtained, the user should be able to use their voice for authentication.

Let us assume that the speaker model resides at an authentication server belonging to a particular company. The user is equipped with a device, possibly a smartphone, which captures the user's voice and sends the features to the server. Let us denote this new speech feature set (test template) as $\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T$. The authentication server then performs the following computation using the speaker model $\lambda_s$ and (1), (2), and (3):

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T \mid \lambda_s) = \prod_{j=1}^{T} p(\vec{t}_j \mid \lambda_s) = \prod_{j=1}^{T} \sum_{i=1}^{M} p_i\, b_i(\vec{t}_j)$$

$$= \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{p_i}{(2\pi)^{D/2} |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2} (\vec{t}_j - \vec{\mu}_i)' \Sigma_i^{-1} (\vec{t}_j - \vec{\mu}_i)}. \quad (4)$$

Similarly, using the UBM model $\lambda_U$ and (1), (2), and (3), the server obtains

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T \mid \lambda_U) = \prod_{j=1}^{T} p(\vec{t}_j \mid \lambda_U) = \prod_{j=1}^{T} \sum_{i=1}^{M} p_i^U\, b_i(\vec{t}_j)$$

$$= \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{p_i^U}{(2\pi)^{D/2} |\Sigma_i^U|^{1/2}}\, e^{-\frac{1}{2} (\vec{t}_j - \vec{\mu}_i^U)' (\Sigma_i^U)^{-1} (\vec{t}_j - \vec{\mu}_i^U)}. \quad (5)$$

Then the server performs verification using the following likelihood ratio test with respect to a pre-calibrated threshold $\theta$:

$$\text{if } \frac{p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T \mid \lambda_s)}{p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T \mid \lambda_U)} \geq \theta, \text{ then accept the speaker.} \quad (6)$$

If the likelihood ratio is higher than the predefined threshold, the authentication server allows the user to access the service. The threshold value is determined through empirical analysis, which is studied later in this paper.

It is obvious from the above description that the server knows both the speaker model (from enrolment) and the user's voice features (during authentication) in plain. As motivated in the introduction, these variables are unique to the user and should not be known to anybody except the user.

IV. PP SPEAKER RECOGNITION

In this section, we propose a novel protocol which protects the privacy of the speaker model residing in the third-party server as well as the speech test features used during the authentication process. Before proceeding to the technical details, the following two privacy goals are defined: (1) Privacy of Model Parameters: the server should be prevented from learning the speaker model parameters; and (2) Privacy of Test Speech Features: the user's speech biometrics should be protected from any unauthorised access.

In order to achieve these goals, we mathematically modify the GMM algorithm in the following sections. Pathak et al. used homomorphic encryption techniques to hide the speaker model or the voice feature vector from the server in [1]. However, due to the inherent properties of homomorphic encryption, the solution proposed in [1] is computationally impractical. In this paper, we solve this problem using a randomization technique which has proved very efficient in other biometric solutions [4].

Moreover, existing works including [1] achieve only one of the above privacy goals at a time, i.e., either the server knows the speaker model in plain, or the server needs to rely on the user to complete the matching process, which may add complexity. In the proposed scheme, the server performs the entire matching process without having the speaker models or voice features in the plain domain, as described in the following subsections.

A. Algorithmic Approach

In order to achieve the above privacy goals, the server should be prohibited from learning the speaker model $\lambda_s$ parameters and the user's speech samples used for authentication. In this paper, we use a randomization technique to mask these parameters before passing them to the server. However, if we simply randomize those parameters, the server cannot compute (4), (5), and (6). Since the parameters obtained from the users' speech need to be randomised, (4) and (5) have to be modified to accommodate the randomised parameters. Please note that the UBM speaker model's parameters are generated by the server and reside at the server side.

In order to achieve the above privacy goals, we need to mathematically transform (4) into a more tractable form, such as linear equations, because the speaker model parameters stored at the server and the test features are coupled in (4). To decouple the two sets of parameters, we adopt the construction from [1], explained below. First of all, let us define the elements of vector $\vec{t}_j$ as $\vec{t}_j = [t_{j,1}\ t_{j,2}\ \ldots\ t_{j,D}]' \in \mathbb{R}^{D \times 1}$. A new vector $\hat{t}_j$ can be formed by inserting a 1 at the bottom of the vector $\vec{t}_j$ as follows: $\hat{t}_j = [\vec{t}_j'\ 1]' = [t_{j,1}\ t_{j,2}\ \ldots\ t_{j,D}\ 1]' \in \mathbb{R}^{(D+1) \times 1}$.

A new, longer vector $\tilde{t}_j$, $j = 1, 2, \ldots, T$, is defined as shown in (7):

$$\tilde{t}_j = \hat{t}_j \otimes \hat{t}_j = [t_{j1},\, t_{j2},\, \ldots,\, t_{j(D+1)^2}]' = [t_{j,1}t_{j,1},\ t_{j,1}t_{j,2},\ \ldots,\ t_{j,1}t_{j,D},\ t_{j,1},\ \ t_{j,2}t_{j,1},\ t_{j,2}t_{j,2},\ \ldots,\ t_{j,2}t_{j,D},\ t_{j,2},\ \ \ldots,\ \ t_{j,D}t_{j,1},\ t_{j,D}t_{j,2},\ \ldots,\ t_{j,D}t_{j,D},\ t_{j,D},\ \ t_{j,1},\ t_{j,2},\ \ldots,\ t_{j,D},\ 1]' \in \mathbb{R}^{(D+1)^2 \times 1}. \quad (7)$$
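The expansion in (7) is exactly the Kronecker product of $\hat{t}_j$ with itself, which can be sketched in NumPy as follows (illustrative dimensions, not the paper's implementation):

```python
import numpy as np

D = 3
t = np.random.default_rng(0).normal(size=D)   # a test feature vector t_j
t_hat = np.append(t, 1.0)                     # append 1 -> (D+1)-vector, per the text
t_tilde = np.kron(t_hat, t_hat)               # Eq. (7): all pairwise products, (D+1)^2 entries
assert t_tilde.shape == ((D + 1) ** 2,)
assert t_tilde[-1] == 1.0                     # the final 1 x 1 entry
```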

Define a matrix

$$W_i = \begin{bmatrix} -\frac{1}{2} \Sigma_i^{-1} & \Sigma_i^{-1} \vec{\mu}_i \\ \vec{0}' & \log \frac{p_i}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} - \frac{1}{2} \vec{\mu}_i' \Sigma_i^{-1} \vec{\mu}_i \end{bmatrix} \in \mathbb{R}^{(D+1) \times (D+1)}, \quad (8)$$

where $i = 1, \ldots, M$. Note that the matrix $W_i$ is generated using the speaker model parameters corresponding to the $i$-th Gaussian density. Using (8) and the vectorization operation (i.e., the vectorization of an $m \times n$ matrix $A$, denoted $\mathrm{vec}(A)$, is the $mn \times 1$ column vector obtained by stacking the columns of $A$ on top of one another), we obtain

$$\tilde{w}_i = \mathrm{vec}(W_i) = [w_{i1},\, w_{i2},\, \ldots,\, w_{i(D+1)^2}]' = [\{W_i\}_{1,1},\ \{W_i\}_{2,1},\ \ldots,\ \{W_i\}_{D+1,1},\ \ldots,\ \{W_i\}_{1,D+1},\ \{W_i\}_{2,D+1},\ \ldots,\ \{W_i\}_{D+1,D+1}]' \in \mathbb{R}^{(D+1)^2 \times 1}, \quad (9)$$

where $\{W_i\}_{n,m}$ denotes the element of matrix $W_i$ located at the $n$-th row and $m$-th column. Now we can use (7) and (9) to replace the exponential part in (4) as follows:

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T \mid \lambda) = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}, \quad (10)$$

where $\tilde{w}_i$ is obtained only from the speaker model parameters and $\tilde{t}_j$ only from the test features.
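The identity behind (10) — that $e^{\tilde{t}_j' \tilde{w}_i}$ equals the weighted Gaussian density $p_i\, b_i(\vec{t}_j)$ — can be checked numerically. The sketch below (NumPy, with arbitrary illustrative parameters) builds $W_i$ per (8), vectorises it per (9), and compares the two evaluations:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
cov = A @ A.T + D * np.eye(D)        # a valid (positive definite) covariance
p_i = 0.3                            # mixture weight
t = rng.normal(size=D)

# Direct evaluation of p_i * b_i(t) per Eq. (3)
inv = np.linalg.inv(cov)
norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(cov))
d = t - mu
direct = p_i * np.exp(-0.5 * d @ inv @ d) / norm

# Linearised form: W_i per Eq. (8), vec(W_i) per Eq. (9), exponent per Eq. (10)
W = np.zeros((D + 1, D + 1))
W[:D, :D] = -0.5 * inv
W[:D, D] = inv @ mu
W[D, D] = np.log(p_i / norm) - 0.5 * mu @ inv @ mu
w_tilde = W.flatten(order="F")       # column stacking = vec(W_i)
t_hat = np.append(t, 1.0)
t_tilde = np.kron(t_hat, t_hat)      # Eq. (7)
linearised = np.exp(t_tilde @ w_tilde)

assert np.isclose(direct, linearised)
```

Note that $\tilde{t}_j' \tilde{w}_i$ reproduces the quadratic form $\hat{t}_j' W_i \hat{t}_j$, so the decoupling is exact rather than an approximation.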

B. Novel Privacy Approach

We approach this by dividing the traditional framework into three parts: (1) protect the speaker models during enrolment, (2) protect the test features during authentication, and (3) protect privacy during the GMM computation.

1) Privacy during enrolment: In order to protect the speaker model parameters (i.e., $\tilde{w}_i$, $i = 1, \ldots, M$), the user generates random vectors $\tilde{z}_i = [z_{i1}, \ldots, z_{i(D+1)^2}]' \in \mathbb{R}^{(D+1)^2 \times 1}$, where $|z_{in}| \geq |w_{in}|$ for all $i = 1, \ldots, M$ and $n = 1, \ldots, (D+1)^2$. The user uses the random vectors to randomise the speaker model parameters as follows:

$$[\tilde{w}_i] = \tilde{w}_i + \tilde{z}_i, \quad \forall i, \quad (11)$$

where $[x]$ denotes the masked value of $x$ and

$$[w_{in}] = w_{in} + z_{in}, \quad \forall i, n. \quad (12)$$

Now the user enrols only $[\tilde{w}_i]$, $\forall i$, in the authentication server at the end of the enrolment procedure, but keeps $\tilde{z}_i$, $\forall i$, secret.
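The enrolment masking in (11)–(12) can be sketched as follows (NumPy, with illustrative dimensions; the magnitude condition $|z_{in}| \geq |w_{in}|$ is enforced by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=25)                        # w_tilde_i, (D+1)^2 entries with D = 4
# Additive one-time mask with |z_in| >= |w_in| for every entry (Eqs. 11-12)
z = np.sign(rng.normal(size=w.size)) * (np.abs(w) + rng.uniform(1.0, 10.0, w.size))
masked = w + z                                 # [w_tilde_i], the value enrolled at the server
assert np.all(np.abs(z) >= np.abs(w))          # masking condition holds
assert np.allclose(masked - z, w)              # only the mask holder can unmask
```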

2) Privacy during authentication: For authentication, the user device extracts the speech features and forms the long vectors $\tilde{t}_j$ for all $j$ as shown in (7). In order to protect these vectors, the mobile device generates random vectors $\tilde{r}_j = [r_{j1}, \ldots, r_{j(D+1)^2}]' \in \mathbb{R}^{(D+1)^2 \times 1}$, where $|r_{jn}| \geq |t_{jn}|$ for all $j = 1, \ldots, T$ and $n = 1, \ldots, (D+1)^2$. The user uses the random vectors to randomise the feature vectors as follows:

$$[\tilde{t}_j] = \tilde{t}_j + \tilde{r}_j, \quad \forall j, \quad (13)$$

where

$$[t_{jn}] = t_{jn} + r_{jn}, \quad \forall j, n. \quad (14)$$

Now the user sends $[\tilde{t}_j]$, $\forall j$, to the authentication server while keeping the random vectors $\tilde{r}_j$, $\forall j$, secret.

3) Privacy during GMM computation: Now the server substitutes $[\tilde{t}_j]$, $\forall j$, and $[\tilde{w}_i]$, $\forall i$, into (10) and obtains

$$p_{\text{masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{[\tilde{t}_j]' [\tilde{w}_i]} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{(\tilde{t}_j + \tilde{r}_j)' (\tilde{w}_i + \tilde{z}_i)} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}\, e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i}. \quad (15)$$

In (15), every $e^{\tilde{t}_j' \tilde{w}_i}$ (speech data) is multiplied by a noise term $e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i}$ for all $i, j$. These noise terms are introduced by the randomization and must be removed at the server side in order to obtain the true result. Yet the server does not know the random vectors $\tilde{r}_j$ and $\tilde{z}_i$; hence it cannot remove the noise by itself without any input from the user.

In order to assist the server to get the right result, the client generates random values $a_j \in \mathbb{R}$ for all $j$ such that $|a_j| \geq |\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i|$, and sends $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j$ to the server for all $i, j$. Using the new input, the server computes the following from (15):

$$p_{\text{less-masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{e^{[\tilde{t}_j]' [\tilde{w}_i]}}{e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j}} = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{e^{\tilde{t}_j' \tilde{w}_i}\, e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i}}{e^{\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j}}$$

$$= \prod_{j=1}^{T} \sum_{i=1}^{M} e^{a_j}\, e^{\tilde{t}_j' \tilde{w}_i} \quad (16)$$

$$= \prod_{j=1}^{T} e^{a_j} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i} \quad (17)$$

$$= e^{\sum_{j=1}^{T} a_j} \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}. \quad (18)$$

Since $e^{a_j}$ is common to all $i$ in (16), it can be taken outside the summation, as shown in (17). By the rule of the product operation, $\prod_{j=1}^{T} e^{a_j} = e^{\sum_{j=1}^{T} a_j}$ can then be made the common factor, as shown in (18).

Let us compare the less-masked result in (18) with the true result in (10). In (18), the true result (i.e., $\prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}$) is multiplied by $e^{\sum_{j=1}^{T} a_j}$. The server again needs the user's input to unmask $e^{\sum_{j=1}^{T} a_j}$ from (18) in order to arrive at the true result. Hence the user sends $\sum_{j=1}^{T} a_j$ to the server, and the server obtains the true result as follows:

$$p(\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_T \mid \lambda) = \frac{p_{\text{less-masked}}}{e^{\sum_{j=1}^{T} a_j}} = \frac{e^{\sum_{j=1}^{T} a_j} \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}}{e^{\sum_{j=1}^{T} a_j}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i}. \quad (19)$$
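The full chain (11), (13), (15)–(19) can be exercised end to end. The sketch below (NumPy, with small illustrative dimensions; the masks here are generic random values that check only the algebraic cancellation, not the magnitude conditions required for security) confirms that the server's unmasked result matches (10):

```python
import numpy as np

rng = np.random.default_rng(3)
T, M, L = 3, 2, 9                     # frames, mixtures, (D+1)^2 with D = 2
t = rng.normal(size=(T, L)) * 0.1     # t_tilde_j (small values keep exp finite)
w = rng.normal(size=(M, L)) * 0.1     # w_tilde_i

# --- client-side masking (Eqs. 11 and 13) ---
z = rng.normal(size=(M, L))           # masks for the model parameters
r = rng.normal(size=(T, L))           # fresh masks for the test features
w_m, t_m = w + z, t + r

# --- correction terms t'z + r'w + r'z - a_j (one a_j per frame) ---
noise = t @ z.T + r @ w.T + r @ z.T   # (T, M) matrix of noise exponents
a = rng.normal(size=T)                # random a_j, kept by the client
corr = noise - a[:, None]             # sent to the server

# --- server side: Eqs. (15)-(18) ---
p_less_masked = np.prod(np.sum(np.exp(t_m @ w_m.T) / np.exp(corr), axis=1))
# --- final unmasking with sum(a_j): Eq. (19) ---
p_recovered = p_less_masked / np.exp(a.sum())

p_true = np.prod(np.sum(np.exp(t @ w.T), axis=1))   # Eq. (10), plain domain
assert np.isclose(p_recovered, p_true)
```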

C. Novel Privacy Approach for UBM

The previous subsections demonstrated how to compute (4) when both the speaker model parameters and the speech features for authentication are randomised. Since the UBM model parameters are known to the server in the plain domain (i.e., only the speech features for authentication are randomised), we cannot use the above approach to compute (5).

Let us assume that the server obtains a modified speaker model vector similar to (9) for the UBM speaker model parameters, denoted $\tilde{w}_i^U$, $i = 1, \ldots, M$. Similar to (15), the server now computes the following using $[\tilde{t}_j]$, $\forall j$, and $\tilde{w}_i^U$, $i = 1, \ldots, M$:

$$p^U_{\text{masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{[\tilde{t}_j]' \tilde{w}_i^U} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{(\tilde{t}_j + \tilde{r}_j)' \tilde{w}_i^U} = \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i^U}\, e^{\tilde{r}_j' \tilde{w}_i^U}. \quad (20)$$

In (20), every $e^{\tilde{t}_j' \tilde{w}_i^U}$ is multiplied by a noise term $e^{\tilde{r}_j' \tilde{w}_i^U}$ for all $i, j$. These noise terms are introduced by the randomization and must be removed at the server side to obtain the true result. Yet the server does not know the random vectors $\tilde{r}_j$; hence it cannot remove the noise by itself without input from the user.

The exponent of each noise term is just a scalar product $\tilde{r}_j' \tilde{w}_i^U$, where $\tilde{r}_j$ is known to the user and $\tilde{w}_i^U$ is known to the server. It can be calculated by the PP scalar multiplication algorithm proposed in [20]. Using this algorithm, the server can obtain $\tilde{r}_j' \tilde{w}_i^U + \gamma_j$ for all $i$ and $j$. Using these values and (20), the server computes the following:

$$p^U_{\text{less-masked}} = \prod_{j=1}^{T} \sum_{i=1}^{M} \frac{e^{\tilde{t}_j' \tilde{w}_i^U}\, e^{\tilde{r}_j' \tilde{w}_i^U}}{e^{\tilde{r}_j' \tilde{w}_i^U + \gamma_j}} = \prod_{j=1}^{T} \frac{1}{e^{\gamma_j}} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i^U} = e^{-\sum_{j=1}^{T} \gamma_j} \prod_{j=1}^{T} \sum_{i=1}^{M} e^{\tilde{t}_j' \tilde{w}_i^U}. \quad (21)$$

Finally, the client sends $e^{\sum_{j=1}^{T} \gamma_j}$ to the server, so that the server can obtain the correct result from (21).
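The UBM-side computation in (20)–(21) can be checked with the same kind of toy sketch (NumPy, illustrative dimensions; the PP scalar products of [20] are simulated here by directly forming $\tilde{r}_j' \tilde{w}_i^U + \gamma_j$, since only the algebraic cancellation is being verified):

```python
import numpy as np

rng = np.random.default_rng(4)
T, M, L = 3, 2, 9
t = rng.normal(size=(T, L)) * 0.1     # t_tilde_j
wU = rng.normal(size=(M, L)) * 0.1    # UBM parameters w_tilde_i^U (plain at the server)
r = rng.normal(size=(T, L))           # client's masks on the test features

t_m = t + r                           # [t_tilde_j], sent to the server
# Simulated output of the PP scalar multiplication of [20]: r_j' wU_i + gamma_j
gamma = rng.normal(size=T)
pp_products = r @ wU.T + gamma[:, None]          # (T, M)

# Server side: Eqs. (20)-(21)
p_less_masked = np.prod(np.sum(np.exp(t_m @ wU.T) / np.exp(pp_products), axis=1))
# Client finally reveals e^{sum gamma_j}, unmasking the result
p_recovered = p_less_masked * np.exp(gamma.sum())

p_true = np.prod(np.sum(np.exp(t @ wU.T), axis=1))
assert np.isclose(p_recovered, p_true)
```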

D. Privacy Analysis

In this section, we analyse whether our algorithm is vulnerable to any privacy leakage. Our algorithm is based on two-party computation, and the only chance of a privacy leakage is during the interaction between the two parties. We use the following privacy definition of Goldreich and information-theoretic security to show that our method does not leak any unintended information to the client or server.

Privacy definition for secure two-party computation: A secure two-party protocol should not reveal more information to a semi-honest party than the information that can be induced by looking at that party's input(s) and output(s). The formal treatment of this definition can be found in [15].

Information-theoretic security: An algorithm is information-theoretically secure if its security derives purely from information theory. The concept of information-theoretically secure communication was introduced in 1949 by the American mathematician Claude Shannon, who used it to prove that the one-time pad system achieves perfect security subject to the following two conditions [17]:

1. the key which randomizes the data should be random and should be used only once;

2. the key should be at least as long as the data.

If an algorithm randomizes its parameters and satisfies the above conditions, the parameters cannot be unmasked by an adversary even with unlimited computing power. E.g., if the message space and random space both equal 1024 bits, the prior probability (the probability of a particular message out of $2^{1024}$ possible messages) and the posterior probability (the probability of inferring/mapping a message in the random domain to a message in the message domain) are equal, i.e., an adversary gains no advantage in posterior probability over prior probability.

Proof: Let us verify whether the proposed algorithm satisfies the privacy definition. As described above, the proposed algorithm is composed of three parts (Subsections IV-B1, IV-B2, and IV-B3). In the following, we show what the inputs and outputs to and from the user and the server are, respectively. This clearly highlights what is already known to the user and the server. Hence, if we can prove that nothing else can be inferred beyond the known inputs and outputs with higher posterior probability than prior probability, then the proposed algorithm satisfies the privacy definition.

The ultimate aim of the user is to keep the enrolled features and the test features away from the server, while the server wants to keep the intermediate results away from the user. Initially (Subsection IV-B1), the user sends only the randomised vectors $[\tilde{w}_i]$ (i.e., $\tilde{w}_i + \tilde{z}_i$) to the server instead of $\tilde{w}_i$. From these inputs, the server learns only the dimension of the vectors, which does not violate privacy. According to the information-theoretic security definition, if $|z_{in}| \geq |w_{in}|$, $\forall i, n$, the server will not be able to infer the original values from $[\tilde{w}_i]$.

Using the same argument, we can claim that there is no privacy leakage during the authentication phase (Subsection IV-B2) as long as $|r_{jn}| \geq |t_{jn}|$, $\forall j, n$. Since this phase is repeated every time the user authenticates to the server, the user generates fresh random vectors $\tilde{r}_j$, $\forall j$, each time.

In order to assist the server to compute the correct GMM output, the user also sends $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j$ and $\sum_{j=1}^{T} a_j$ to the server, where $|a_j| \geq |\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i|$ and the $a_j$ are generated freshly for each authentication. It is impossible for the server to infer $a_j$, $\forall j$, from $\sum_{j=1}^{T} a_j$, or $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i$ from $\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i - a_j$, $\forall i, j$. We chose a single random value $a_j$ for all $(\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i)$, $i = 1, 2, \ldots, M$, in order to reduce the complexity. Since $\tilde{r}_j$ and $\tilde{z}_i$ are random, the combined value $(\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i)$ is random for different $i$; hence the masked value $(\tilde{t}_j' \tilde{z}_i + \tilde{r}_j' \tilde{w}_i + \tilde{r}_j' \tilde{z}_i) - a_j$ is also random for all $i$.

The same analogy can be used to prove that the server cannot learn about the speech features during the UBM calculation.

V. PERFORMANCE ANALYSIS

In this section, we first describe the dataset used for the experiments and the evaluation environment. Then we perform tests using both the traditional approach (i.e., without privacy) and the proposed approach, followed by a complexity comparison.

A. Usage of TIMIT dataset

We use the TIMIT speech corpus [5] to evaluate the accuracy and reliability of the proposed algorithm. The TIMIT corpus contains broadband recordings (each lasting around 3 seconds) of 630 speakers of eight major dialects of American English. Each speaker has 10 speech samples, of which 9 were used to build the speaker model.

The TIMIT corpus divides the first four dialects into two sets: Set 1 contains 258 speakers and Set 2 contains 95 speakers. To validate the proposed algorithms, we use all the speakers from Set 1 to build 258 speaker models; hence, we enrol the 258 speakers' voice features in the server. We use the speakers in Set 2 as adversaries who try to impersonate the genuine users to get unauthorised access to the service. The last 4 dialects are used to construct the UBM, i.e., we aggregated all the voice data into a single audio file and then extracted the speaker model $\lambda_U$.

In order to explain the testing phase, let us assume a service, named Service A. We randomly select 156 speakers from Set 1 and assume those 156 speakers registered for Service A.

B. Evaluation Environment

To measure the performance of the proposed scheme in a real environment, we implemented it on a smartphone and a computer. Specifically, a smartphone with a quad-core 1.56 GHz Cortex-A53 CPU running Android 4.4.2 and a computer with an Intel i5 at 2.40 GHz, 8 GB RAM, and Windows 7 were chosen to act as the user and the server, respectively, connected through an 802.11g WLAN. Based on the proposed scheme, a Java application named speechxrays.apk is installed on the smartphone, and the server simulator is deployed on the computer.

MFCC features of the speech samples are extracted through a number of filters built in Java. As a first step, the speech signal, digitised at 44,100 Hz, is sent through an energy-detection filter to remove the silent parts of the signal. The silence-removed signal is then normalised, framed with 50% overlap, and passed through a Hamming window. The output of the Hamming window is sent through the MFCC filter, which contains the Mel filter bank, a non-linear transformation, DCT, and FFT. More details of MFCC feature extraction can be found in [3]. We extracted 100 MFCC feature vectors per second of speech, and each feature vector contains twelve MFCC coefficients.
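The framing-and-windowing step described above can be sketched as follows (NumPy; frame length and signal are illustrative values, not the paper's exact configuration):

```python
import numpy as np

def frame_signal(signal, frame_len, overlap=0.5):
    """Split a 1-D signal into frames with the stated overlap and apply
    a Hamming window, mirroring the pre-processing described above."""
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[k * hop : k * hop + frame_len] * window
                     for k in range(n_frames)])

# e.g. 1 s of 44.1 kHz audio, 882-sample frames, 50% overlap -> 99 frames
frames = frame_signal(np.random.default_rng(5).normal(size=44100), 882)
```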

C. Experiments on the TIMIT Database without privacy

To validate the proposed method, we first obtain the accuracy of the GMM algorithm on the TIMIT dataset using the pre-divided speech samples. To start, we obtain the decision-making output of (6) on the 156 speech samples from the subset of Set 1. The probability distribution of these calculations is depicted in Fig. 1 (the distribution on the right), where each user is tested against their own speaker model. Using the results from Fig. 1, Table I shows the true positive and false negative rates for different threshold values.

However, deciding the threshold based only on genuine attempts may be misleading. It is important to cross-check the genuine speaker models using speech samples belonging to other users, i.e., adversaries. We can use the remaining users who are not part of the subset of Set 1 as adversaries. For example, set DR1 contains 49 speakers, but only 38 (Set 1) speaker models were built. For this experiment, for DR1, we assume these 38 speaker models were attacked by 15 speakers (the subset of Set 1 in DR1) and by the speakers in DR1 of Set 2; even the 15 subset speakers try to authenticate themselves against all 38 speaker models. The same applies for DR2, DR3, and DR4.

Figure 1. Combination of genuine and malicious authentication attempts. The distribution on the right-hand side shows the outcome for 156 legitimate authentication attempts (in numbers). The distribution on the left-hand side shows the outcome for 17,356 malicious attempts (in percentages).

Hence, for DR1, $(15 \times 38) + (11 \times 38) = 988$ authentication attempts were made. Only 15 attempts out of 988 are legitimate; all others are malicious. Similarly, for DR2, $(58 \times 76) + (26 \times 76) = 6384$; for DR3, $(43 \times 76) + (26 \times 76) = 5244$; and for DR4, $(40 \times 68) + (32 \times 68) = 4896$. In total, 17,512 authentication attempts were made, of which only 156 were legitimate. Hence, similar to the right-hand distribution, we plotted the decision-making output of (6) for the $17512 - 156 = 17356$ malicious authentication attempts in Fig. 1 (the distribution on the left).
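As a quick sanity check (illustrative Python, not part of the paper), the attempt counts above tally as follows:

```python
# Attempts per dialect region: (attackers + Set 2 adversaries) x models
attempts = {
    "DR1": (15 + 11) * 38,   # 988
    "DR2": (58 + 26) * 76,   # 6384
    "DR3": (43 + 26) * 76,   # 5244
    "DR4": (40 + 32) * 68,   # 4896
}
total = sum(attempts.values())
assert total == 17512               # all authentication attempts
assert total - 156 == 17356         # malicious attempts only
```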

From both distributions in Fig. 1, the decision-making output of (6) is generally larger than 0.5 for legitimate attempts, while it is usually smaller than 0.5 for malicious attempts. However, the important fact is the overlap between them: the minimum value for legitimate users was 0.2444, while the maximum value for malicious users was 1.1312. Hence the choice of threshold impacts not only the true positive rate but also the false positive rate.

For example, if the threshold θ is set to 0.40 and we use the false negative rate and false positive rate to quantify the accuracy, we obtain:

• False negative rate = (6/156) × 100 = 3.85%;

• False positive rate = (308/17356) × 100 = 1.77%.
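Both rates can be computed directly from the per-attempt decision scores produced by (6). A minimal sketch, assuming the scores are available as plain lists (the function name and score lists are illustrative, not from the paper):

```python
def error_rates(legit_scores, malicious_scores, theta):
    """False negative and false positive rates (in %) for threshold theta.

    A score of at least theta is accepted, otherwise rejected; this
    mirrors thresholding the decision-making output of (6)."""
    fn = sum(s < theta for s in legit_scores)        # legitimate but rejected
    fp = sum(s >= theta for s in malicious_scores)   # malicious but accepted
    return (100.0 * fn / len(legit_scores),
            100.0 * fp / len(malicious_scores))
```

With the paper's counts (6 of 156 legitimate scores below θ = 0.40 and 308 of 17356 malicious scores above it), this yields 3.85% and 1.77%.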

D. Performance of the proposed solution

We repeat the same experimental procedure as in Section V-C to test the performance of the proposed algorithm, i.e., with privacy. Let us briefly explain the randomization procedure used in our analysis. First, we obtain the range of values from the 156 speaker models' parameters and the 353 speakers' speech features. We observed that these values contain up to twelve digits after the decimal point and five before it. With this in mind, the client generates random numbers with twelve digits after and six digits before the decimal point. Table II shows a few examples of the test features in (7), the corresponding 18-digit-long random numbers, and the randomised feature variables.

For a five-second-long speech sample, we need 5 × 100 × (12 + 1)^2 = 84500 random numbers. In total, 100000 eighteen-digit-long random numbers require approximately 8 MB of memory. Using these random numbers, we repeated the experiments carried out in Section V-C and obtained the same results when the scaling parameter of the PP scalar multiplication equals 10^6 [20].
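The additive masking illustrated in Table II can be sketched as follows. Exact decimal arithmetic is used because the 18-digit operands exceed double-precision accuracy; the generator choice and function names are our assumptions, not the paper's specification:

```python
from decimal import Decimal, getcontext
from secrets import randbelow

getcontext().prec = 28  # headroom beyond the 18 significant digits used

def random_mask():
    # Six digits before and twelve digits after the decimal point,
    # matching the observed range of the model/feature values.
    return Decimal(randbelow(10 ** 18)) / Decimal(10 ** 12)

def randomise(feature):
    """Hide a speech feature t by additive masking: [t] = t + r."""
    r = random_mask()
    return feature + r, r

# First row of Table II, reproduced with exact decimal arithmetic.
t = Decimal("16.725081716161")
r = Decimal("420223.336542782260")
masked = t + r
```

The server only ever sees the masked value; the client can remove the mask later, since masked − r = t.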

Table I
TRUE POSITIVE AND FALSE NEGATIVE RATES FOR DIFFERENT THRESHOLD VALUES

θ      True Positive   False Negative
0      100%            0%
0.4    96.15%          3.84%
0.5    89.75%          10.25%
0.8    66.42%          43.58%
2      1.28%           91.78%

Table II
RANDOMISATION EXAMPLES

Test speech samples     Corresponding random        Randomised speech
t_j1, t_j2, ...         values r_j1, r_j2, ...      features [t_j1], [t_j2], ...
16.725081716161         420223.336542782260         420240.061624498421
0.0001274641            225758.663235026265         225758.663362490365
1.171659034624          206555.735876851157         206556.907535885781
46.027486334736         236074.104503441653         236120.131989776389
1910.031836695921       125628.018508620454         127538.050345316375

[Figure: time complexity (total multiplications) versus length of speech sample for M = 32, with curves for the server without privacy, the server with privacy, and the client with privacy.]
Figure 2. Number of multiplications required for the proposed and traditional (without privacy) speech authentication algorithms.

[Figure: time complexity (total multiplications, ×10^9) versus number of Gaussians (32–128), comparing the server without privacy and the server with privacy.]
Figure 3. Number of multiplications required for the proposed and traditional (without privacy) schemes for different numbers of Gaussian models (M = 32 ∼ 128) when T = 600.

[Figure: time complexity (total multiplications, ×10^6) at the client versus number of Gaussians (32–128).]
Figure 4. Number of multiplications required for the proposed scheme at the client side for different numbers of Gaussian models (M = 32 ∼ 128) when T = 600.

E. Computational complexity comparison

Without privacy consideration, the server computes (6) using the test speech feature vectors obtained from the client. The decision-making equation (6) is composed of (10): the server calculates (10) once using the speech features and the speaker model, and once using the speech features and the UBM model. In other words, the server needs to execute (10) twice.

Denote the time complexity of a multiplication and an exponentiation as t_m and t_e, respectively, and denote the total complexity for the server to perform authentication without privacy as T_server^{w/out-privacy}. Then the total complexity without privacy is given by

T_server^{w/out-privacy} = 2MT(D+1)^4 t_m + MT t_e + (T − 1) t_m + t_m.   (22)

Let us now obtain the computational complexity of the proposed PP approach. For the speaker-model computation, the server is required to compute (15) to (19). For the UBM model, the server needs to follow the steps described in Section IV-C. This involves executing the PP scalar multiplication algorithm [20], in which the server computes (D+1)^2 + 1 multiplications and the client computes (D+1)^2 + 1 multiplications. If we denote the total complexity for the server to perform the PP authentication as T_server^{privacy}, then it is given by

T_server^{privacy} = [MT(D+1)^4 + 1] t_m + (MT + 1) t_e + MT[(D+1)^2 + 1 + (D+1)^4] t_m + (MT + 1) t_e,   (23)

and for the client

T_client-serial^{privacy} = MT[(D+1)^2 + 1] t_m.   (24)
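The three cost expressions can be evaluated numerically. A sketch in units of t_m, using the t_e ≈ 240 t_m approximation adopted in the simulations and assuming D = 12 (inferred from the (12 + 1)^2 count in Section V-D, not stated here):

```python
def server_plain(M, T, D, tm=1.0, te=240.0):
    """Eq. (22): server cost without privacy (in units of t_m)."""
    return 2 * M * T * (D + 1) ** 4 * tm + M * T * te + (T - 1) * tm + tm

def server_pp(M, T, D, tm=1.0, te=240.0):
    """Eq. (23): server cost with privacy."""
    return ((M * T * (D + 1) ** 4 + 1) * tm + (M * T + 1) * te
            + M * T * ((D + 1) ** 2 + 1 + (D + 1) ** 4) * tm
            + (M * T + 1) * te)

def client_pp(M, T, D, tm=1.0):
    """Eq. (24): client cost with privacy."""
    return M * T * ((D + 1) ** 2 + 1) * tm

# M = 32 Gaussians, T = 600 feature vectors, D = 12 feature dimensions.
plain = server_plain(32, 600, 12)
pp = server_pp(32, 600, 12)
client = client_pp(32, 600, 12)
```

For M = 32 and T = 600 this gives a server-side ratio of about 1.01 between the private and plain schemes and roughly 3.3 × 10^6 client multiplications, consistent with the trends in Figs. 2-4.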

To visualise the computational complexity for different numbers of Gaussian models (M) and feature vectors (T), we simulated (22), (23), and (24) in Figures 2, 3, and 4. For these simulations we used the approximation t_e ≈ 240 t_m [20]. We also assumed that the random numbers can be generated off-line and that the complexity of addition is negligible compared to multiplication and exponentiation. Fig. 2 shows the time complexity when M = 32 in terms of the number of multiplications required for the proposed scheme and the traditional scheme. We can clearly observe that there is no significant difference between the two schemes at the server side. The client is required to perform additional multiplications, but these are negligible compared to the server-side complexity. As expected, the complexity of both schemes increases linearly with the number of speech features (i.e., the length of the input speech). However, this linear increase can be flattened if the server is equipped with multi-core processors. Nevertheless, Fig. 2 demonstrates that the proposed scheme does not increase the complexity substantially compared to the traditional approach. Hence, our scheme is practical compared to the homomorphic encryption scheme [1], which consumes hours to complete the authentication.

Fig. 3 compares the impact of the number of Gaussian models when T = 600. When M increases, the complexity also increases linearly for both schemes. A similar trend, but at the client side, is observed in Fig. 4. At the server side, both schemes perform equally regardless of the number of Gaussian models used. However, the proposed scheme introduces complexity at the client, i.e., approximately 4 × 10^6 multiplications when M = 32 and T = 600. It should be noted that a single-core 3 GHz processor can compute nearly 10^9 multiplications per second. Based on these facts, we can claim that the proposed scheme is practical while preserving the privacy of users' speech samples and speaker models.

We observed in our implementation that the proposed algorithm takes nearly two seconds for enrolment and less than two seconds for authentication in an Android smartphone environment. These times are comparable to the time required to enter a password for a typical online application on a mobile device. Fig. 5 shows Android Monitor's readings of CPU and memory usage while the app is being used on the smartphone. For a five-second speech input, the smartphone uses nearly 30% of the CPU and 36 MB of memory. These numbers clearly demonstrate the practicality of the proposed approach in a real environment.

Figure 5. Android Monitor reading of smartphone memory and CPU usage

during the enrolment and authentication phase of the proposed scheme.

VI. CONCLUSIONS AND FUTURE WORKS

An efficient privacy-preserving speaker authentication protocol is proposed. To achieve efficiency and privacy, the proposed solution algorithmically redesigns the GMM to incorporate randomness without affecting the final outcome. The proposed protocol is based on a randomization technique and relies only on multiplication and addition. The two parties in this scheme, the client and the server, perform authentication interactively. Using Goldreich's privacy definition and information-theoretic security, the algorithm is proved to be secure. It is shown empirically that our scheme does not degrade accuracy and incurs little computational overhead.

REFERENCES

[1] M. A. Pathak and B. Raj, "Privacy-preserving speaker verification and identification using Gaussian mixture models," IEEE Trans. Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 397–406, Feb. 2013.

[2] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995.

[3] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP J. Appl. Signal Process., vol. 4, pp. 430–451, 2004.

[4] Y. Rahulamathavan and M. Rajarajan, "Efficient privacy-preserving facial expression classification," IEEE Trans. Dependable and Secure Computing, in press.

[5] J. Garofolo et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1," Web Download. Philadelphia: Linguistic Data Consortium, 1993.

[6] J. Lyons, "Mel Frequency Cepstral Coefficient (MFCC) tutorial," Practicalcryptography.com. [Online]. Available: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/. [Accessed: 22-Jul-2015].

[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1–3, pp. 19–41, Jan. 2000.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

[9] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

[10] Z. Erkin, M. Franz, J. Guajardo, S. Katzenbeisser, I. Lagendijk, and T. Toft, "Privacy-preserving face recognition," in Proc. 9th International Symposium on Privacy Enhancing Technologies (PETS '09), pp. 235–253, 2009.

[11] Y. Rahulamathavan, R. Phan, S. Veluru, K. Cumanan, and M. Rajarajan, "Privacy-preserving multi-class support vector machine for outsourcing the data classification in cloud," IEEE Trans. Dependable and Secure Computing, vol. 11, no. 5, pp. 467–479, Sept. 2014.

[12] Y. Rahulamathavan, S. Veluru, R. Phan, J. Chambers, and M. Rajarajan, "Privacy-preserving clinical decision support system using Gaussian kernel based classification," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 1, pp. 56–66, Jan. 2014.

[13] Y. Rahulamathavan, R. Phan, J. Chambers, and D. Parish, "Facial expression recognition in the encrypted domain based on local Fisher discriminant analysis," IEEE Trans. Affective Computing, vol. 4, no. 1, pp. 83–92, Jan.–Mar. 2012.

[14] M. A. Pathak, B. Raj, S. D. Rane, and P. Smaragdis, "Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise," IEEE Signal Processing Magazine, vol. 30, no. 2, pp. 62–74, March 2013.

[15] O. Goldreich, "Secure multiparty computation," working draft, available: http://www.wisdom.weizmann.ac.il/~oded/pp.html, Sep. 1998.

[16] N. Dehak et al., "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[17] C. E. Shannon, "Communication theory of secrecy systems," Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, 1949.

[18] M. A. Pathak and B. Raj, "Privacy-preserving speaker verification as password matching," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 2012, pp. 1849–1852.

[19] B. K. Sy, "Secure computation for biometric data security—application to speaker verification," IEEE Systems Journal, vol. 3, no. 4, pp. 451–460, Dec. 2009.

[20] R. Lu, X. Lin, and X. Shen, "SPOC: A secure and privacy-preserving opportunistic computing framework for mobile-healthcare emergency," IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 3, pp. 614–624, 2013.