
Prof. M.N.Eshwarappa & Prof. (Dr.) Mrityunjaya V. Latte

International Journal of Biometrics and Bioinformatics, (IJBB), Volume (4): Issue (4) 147

Bimodal Biometric Person Authentication System

Using Speech and Signature Features

Prof. M.N. Eshwarappa jenutc@rediffmail.com

Assistant professor Department of Telecommunication

Engineering, Sri Siddhartha Institute of Technology,

Tumkur-572101, Karnataka, India

Prof. (Dr.) Mrityunjaya V. Latte

Principal and Professor Department of Electronics

and Communication Engineering, JSS

Academy of Technical Education,

Bangalore-560060, Karnataka, India

mvlatte@rediffmail.com

Abstract

Biometrics offers greater security and convenience than traditional methods of person authentication. Multi-biometrics has recently emerged as a means of building more robust and efficient person authentication schemes: exploiting information from multiple biometric features improves both the performance and the robustness of person authentication. The objective of this paper is to develop a robust bimodal biometric person authentication system using speech and signature biometric features. A speaker-based unimodal system is developed by extracting Mel Frequency Cepstral Coefficients (MFCC) and Wavelet Octave Coefficients of Residues (WOCOR) as feature vectors. The MFCCs and WOCORs from the training data are modeled using Vector Quantization (VQ) and Gaussian Mixture Modeling (GMM) techniques. A signature-based unimodal system is developed using Vertical Projection Profile (VPP), Horizontal Projection Profile (HPP) and Discrete Cosine Transform (DCT) features. A bimodal biometric person authentication system is then built from these two unimodal systems. Experimental results show that the bimodal person authentication system provides higher performance than the unimodal systems. The bimodal system is finally evaluated for robustness using noisy data and data collected in real environments, and proves more robust than the unimodal person authentication systems.

Keywords: Biometrics, Speaker recognition, Signature verification, Multimodal biometrics.

1. INTRODUCTION

Biometrics grew out of the statistical and mathematical methods developed for data-analysis problems in the biological sciences, and the introduction of this technology brings new security approaches to computer systems. Identification and verification are the two modes of using biometrics for person authentication. Biometrics refers to the use of physical, physiological, biological or behavioral characteristics to establish the identity of an individual. These


characteristics are unique to each individual and remain largely unaltered during the individual's lifetime [1]. Biometric security systems are therefore a powerful tool compared with purely electronic security [2]. Any physiological and/or behavioral human characteristic can be used as a biometric feature, provided it possesses the following properties: universality, distinctiveness, permanence, collectability, circumvention, acceptability and performance [3]. Physiological biometrics relate to the shape of the body. The oldest such trait, in use for more than 100 years, is the fingerprint; other examples are face, hand geometry, iris, DNA, palm prints and so on. Behavioral biometrics relate to the behavior of a person. The first characteristic of this kind to be used, and one still widely used today, is the signature; others are keystroke dynamics, gait (way of walking), handwriting and so on. Speech is the unique biometric feature that falls under both categories [9]. Selecting the right biometric for a given application is the crucial part. A unimodal biometric system, which operates on any single biometric characteristic, is affected by problems such as noisy sensor data, non-universality, lack of individuality of the chosen biometric trait, and the absence of an invariant representation for the trait. For instance, speech is a biometric feature whose characteristics vary significantly if the person has a cold or is in a different emotional state. Some of these problems can be relieved by using a multimodal biometric system that consolidates evidence from multiple biometric sources. Multimodal or multi-biometric systems utilize more than one physiological or behavioral biometric for enrolment and identification. This work presents such a multimodal biometric person recognition system, and the results obtained are compared with those of the unimodal biometric systems.

Several multimodal biometric person authentication systems have been developed in the literature [3-7]. In 2004, A. K. Jain et al. proposed a framework for multimodal biometric person authentication [3]. Even though some traits offer good performance in terms of reliability and accuracy, no biometric is 100% accurate. With the increasing global need for security, the demand for robust automatic person recognition systems is evident. For applications involving the flow of confidential information, the authentication accuracy of the system is always the prime concern, and for this basic reason the use of multimodal biometrics is encouraged. Multi-biometrics is an integrated prototype system embedding different types of biometrics [35]. Multimodal biometric fusion and identity authentication techniques help to increase the performance of identity authentication systems [8]. Multimodal biometrics can reduce the probability of denial of access without sacrificing False Acceptance Rate (FAR) performance, by increasing the discrimination between the genuine and impostor classes. Applications of multi-biometrics are widely spread throughout the world. A wide variety of systems require reliable personal recognition schemes to either confirm or determine the identity of an individual requesting their services. The purpose of such schemes is to ensure that the rendered services are accessed only by a legitimate user, and not anyone else. Examples of such applications include secure access to buildings, computer systems, laptops, cellular phones and ATMs. In the absence of robust personal recognition schemes, these systems are vulnerable to the wiles of an impostor. Authentication systems built upon only one modality may not fulfill all the requirements, due to the limitations of unimodal systems. This has motivated the current interest in multimodal biometrics, in which several biometric traits are used simultaneously in order to make an identification decision. The objective of the present work is to develop a bimodal biometric system using speech and signature features to mitigate the effect of some of the limitations of unimodal biometric systems.

The present work mainly deals with the implementation of a bimodal biometric system employing speech and signature as the biometric modalities, including the feature extraction and modeling techniques used in the biometric system. The organization of the paper is as follows: Section 2 describes the bimodal databases used in the bimodal person authentication system. Section 3 deals with the unimodal speech-based person authentication system, and Section 4 with the unimodal signature-based person authentication system. The bimodal biometric system obtained by combining the speaker and signature recognition systems is explained, along with different fusion techniques, in Section 5. Section 6 concludes the paper by summarizing the present work and adding a few points regarding future work.


2. BIMODAL DATABASES FOR PERSON AUTHENTICATION

IITG Speech database (standard)
Number of speakers: 30 (20 male, 10 female)
Sampling frequency: 8000 Hz
Sentences considered for each speaker: 4
Number of utterances of each sentence for each speaker: 24
Training session: first 16 utterances
Testing session: remaining 8 utterances of each sentence of each speaker

IITG Signature database (standard)
Number of writers: 30 (20 male, 10 female)
Scanner: HP ScanJet 5300C
Resolution: 300 dpi (dots per inch)
Data storage: 8-bit grayscale image
Saved format: BMP (bitmap)
Number of sample signatures of each writer: 24
Training session: first 16 signatures of all the writers
Testing session: remaining 8 signatures of all the writers

SSIT Speech database
Number of speakers: 30 (20 male, 10 female)
Sampling frequency: 8000 Hz
Sentences considered for each speaker: 4
Number of utterances of each sentence for each speaker: 24
Training session: first 16 utterances
Testing session: remaining 8 utterances of each sentence of each speaker

SSIT Signature database
Number of writers: 30 (20 male, 10 female)
Scanner: HP ScanJet 5300C
Resolution: 300 dpi (dots per inch)
Data storage: 8-bit grayscale image
Saved format: BMP (bitmap)
Number of sample signatures of each writer: 24
Training session: first 16 signatures of all the writers
Testing session: remaining 8 signatures of all the writers
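The train/test protocol common to all four databases can be sketched as a minimal helper (the function name is illustrative, not from the paper):

```python
def split_sessions(samples, n_train=16):
    """Per-subject protocol from Section 2: the first 16 samples of each
    subject form the training session and the remaining 8 form the
    testing session, for both the speech and signature databases."""
    return samples[:n_train], samples[n_train:]
```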

3. UNIMODAL SPEECH BASED PERSON AUTHENTICATION SYSTEM

As with any other pattern recognition system, a speech-based person authentication system consists of three components: (1) feature extraction, which transforms the speech waveform into a set of parameters carrying salient speaker information; (2) pattern generation, which generates from the feature parameters a pattern representing the individual speaker; and (3) pattern matching and classification, which compares the similarity between the extracted features and one or more pre-stored patterns, giving the speaker identity accordingly. There are two stages in a speaker recognition system: training and testing. In the training stage, speaker models (or patterns) are generated from the speech samples with some feature extraction and modeling techniques. In the testing stage, feature vectors are generated from the speech signal with the same extraction procedure as in training, and a classification decision is then made with some matching technique. Person authentication is a binary classification task [22]: the features from the test signal are compared with the claimed speaker's pattern and a decision is made to accept or reject the claim [10]. Depending on the mode of operation, speaker recognition can be classified as text-dependent or text-independent recognition. Text-dependent recognition requires the speaker to produce speech for the same text during both training and testing, whereas text-independent recognition does not rely on a specific text being


spoken [11]. The present work follows the text-dependent speaker recognition approach. This work uses feature extraction techniques based on (1) Mel Frequency Cepstral Coefficients (MFCC), derived from cepstral analysis of the speech signal, and (2) Wavelet Octave Coefficients of Residues (WOCOR), derived from the Linear Prediction (LP) residual. Time-frequency analysis of the LP residual signal is performed to obtain WOCOR [14]: WOCOR are generated by applying a pitch-synchronous wavelet transform to the residual signal. Experimental results show that the WOCOR parameters provide information complementary to the conventional MFCC features for speaker recognition [14]. Vector Quantization (VQ) and Gaussian Mixture Modeling (GMM) are used for modeling the person information from these MFCC and WOCOR features [13-15]. The state-of-the-art system uses MFCC derived from speech as feature vectors and GMM as the modeling technique [13].
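As a sketch of the VQ modeling step, a plain k-means codebook trainer and a distortion-based matcher could look as follows (function names, the Euclidean distortion measure and the codebook size are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def train_vq_codebook(features, codebook_size=8, iters=20, seed=0):
    """Train a VQ codebook from feature vectors with plain k-means.

    features: (num_vectors, dim) array of e.g. MFCC or WOCOR vectors.
    Returns a (codebook_size, dim) codebook of speaker codewords.
    """
    rng = np.random.default_rng(seed)
    # Initialise codewords from randomly chosen training vectors.
    idx = rng.choice(len(features), size=codebook_size, replace=False)
    codebook = features[idx].copy()
    for _ in range(iters):
        # Assign each vector to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each codeword to the centroid of its assigned vectors.
        for k in range(codebook_size):
            members = features[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Average distance of test vectors to the claimed speaker's codebook;
    a low distortion supports accepting the identity claim."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

At verification time, the distortion of the test features against the claimed speaker's codebook is compared with a threshold to accept or reject the claim.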

Feature Extraction from Speech Information

Speaker information is present in both the vocal tract and the excitation parameters [12]. Vocal tract characterization corresponds to processing speech in short (10-30 ms), overlapped (5-15 ms shift) windows. The vocal tract system is assumed to be stationary within the window and can be modeled as an all-pole filter using LP analysis [21]. The most widely used representation of the speech signal for feature extraction is the cepstrum. Different forms of cepstral representation include Complex Cepstral Coefficients (CCC), Real Cepstral Coefficients (RCC), Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC); among these, MFCC are the most commonly used. In all cepstral analysis techniques, the vocal tract information is obtained by taking the logarithm of the spectrum of the speech signal. The LP residual signal, though not giving the true glottal pulse, is regarded as a good representative of the excitation source. The Haar and wavelet transforms are applied for multi-resolution analysis of the residual signal, and the derived feature vectors are termed Wavelet Octave Coefficients of Residues (WOCOR). WOCOR are believed to effectively capture the speaker-specific spectro-temporal characteristics of the LP residual signal.

Extraction of MFCC Feature Vectors

The state-of-the-art system builds a unimodal system by analyzing speech in blocks of 10-30 ms with a shift of half the block size. The MFCC are the feature vectors extracted from each block. The MFCCs from the training or enrolment data are modeled using Vector Quantization (VQ) and Gaussian Mixture Modeling (GMM) techniques [12]. The MFCCs from the testing or verification data are compared with the respective model to validate the identity claim of the speaker. The MFCCs mainly represent the vocal tract aspect of speaker information and hence capture only the physiological aspect of the speech biometric. Another important physiological aspect contributing significantly to speaker characteristics is the excitation source [13]. A speech signal is obtained by the convolution of the vocal tract parameters v(n) and the excitation parameters x(n), as given by equation (3.1). These parameters cannot be separated in the time domain, so we move to the cepstral domain: cepstral analysis is used to separate the vocal tract parameters v(n) and the excitation parameters x(n) from the speech signal s(n).

s(n) = v(n) * x(n) (3.1)

Cepstral analysis exploits the fundamental property of convolution, namely that convolution in the time domain becomes addition in the cepstral domain, which allows the vocal tract and excitation parameters to be separated [27]. The cepstral coefficients C, of length M, can be obtained using equation (3.2).

C = real(IFFT(log |FFT(s(n))|)) (3.2)
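Equation (3.2) can be computed directly with FFT routines; in the sketch below a small floor is added inside the logarithm to guard against zero-magnitude bins (the eps guard is our addition, not part of the paper):

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum of one speech frame, per C = real(IFFT(log|FFT(s)|)).

    eps guards against log(0) for spectral bins with no energy.
    """
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + eps)
    return np.real(np.fft.ifft(log_mag))
```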

The nonlinear relation between the Mel frequency (fMel) and the physical frequency (fHz) is used for extracting spectral information from the speech signal via cepstral analysis:

fMel = 2595 log10(1 + fHz / 700) (3.3)
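The mapping of equation (3.3), and its inverse (needed when placing the triangular filter edges), can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """Map physical frequency (Hz) to the Mel scale, eq. (3.3)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of eq. (3.3), used when positioning filter-bank edges."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```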

Using equation (3.3), we construct a spectrum with critical bands formed by overlapping triangular filter banks, i.e., we map the linearly spaced frequency spectrum (fHz) onto a nonlinearly spaced frequency spectrum (fMel). This mimics the human auditory system, and on this basis the MFCC feature vectors are derived. Windowing eliminates the Gibbs oscillations that occur when the speech signal is truncated. Using equation (3.4), Hamming window coefficients are generated, with which the corresponding speech frame is scaled.


w(n) = 0.54 − 0.46 cos(2πn / (N − 1)) (3.4)

Due to Hamming windowing, samples at the edges of the window are weighted with lower values. To compensate for this, the frames are overlapped by 50%. After windowing, the log magnitude spectrum of each frame is computed to find the energy coefficients using equation (3.5).
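The framing and Hamming windowing described above can be sketched as follows (the default frame length of 240 samples corresponds to 30 ms at the 8 kHz sampling rate of Section 2; the function name is illustrative):

```python
import numpy as np

def frame_signal(signal, frame_len=240, overlap=0.5):
    """Split a speech signal into Hamming-windowed frames with 50% overlap.

    frame_len=240 corresponds to 30 ms at 8 kHz.
    Returns an array of shape (num_frames, frame_len).
    """
    step = int(frame_len * (1.0 - overlap))
    window = np.hamming(frame_len)  # 0.54 - 0.46 cos(2*pi*n/(N-1)), eq. (3.4)
    n_frames = 1 + (len(signal) - frame_len) // step
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = signal[i * step : i * step + frame_len] * window
    return frames
```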

Y(i, m) = log [ Σ_{k=0}^{N−1} |S(k, m)|² Hi(2πk/N) ] (3.5)

where Hi(2πk/N) is the i-th Mel critical band spectrum and N is the number of points used to compute the discrete Fourier transform (DFT). The M Mel frequency coefficients are computed using the discrete cosine transform (DCT), equation (3.6), which is simply the real IDFT of the log energy outputs of the critical band filters.

C(n, m) = Σ_{k=1}^{N} Y(k, m) cos(πn(2k − 1) / (2N)) (3.6)

where n = 1, 2, 3, …, M.
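A direct implementation of the DCT step over the log filter-bank energies might look as below (a sketch assuming N Mel bands and M retained coefficients; the typical choice of 13 coefficients is our assumption):

```python
import numpy as np

def mel_cepstral_coeffs(log_energies, num_coeffs=13):
    """DCT of log Mel filter-bank energies for one frame, following the
    form C(n) = sum_k Y(k) cos(pi*n*(2k-1)/(2N)).

    log_energies: Y(1..N), log energy of each of the N Mel critical bands.
    Returns C(1..num_coeffs), the MFCC vector for the frame.
    """
    N = len(log_energies)
    n = np.arange(1, num_coeffs + 1)[:, None]  # coefficient index
    k = np.arange(1, N + 1)[None, :]           # filter-bank index
    basis = np.cos(np.pi * n * (2 * k - 1) / (2 * N))
    return basis @ log_energies
```

A constant filter-bank output carries no spectral shape, so all coefficients with n ≥ 1 vanish, which is a quick sanity check on the basis.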

The present work also takes care of channel mismatch by using Cepstral Mean Subtraction (CMS), and of the effect of the different roll-offs of different channels on the cepstral coefficients by a liftering procedure [21].
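The CMS step amounts to removing the per-utterance mean from every cepstral dimension; a fixed linear channel adds a constant offset to each frame's cepstrum, so subtracting the mean cancels it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(mfccs):
    """Remove the per-utterance cepstral mean to reduce channel mismatch.

    mfccs: (num_frames, num_coeffs) array of cepstral vectors.
    A stationary channel contributes the same additive term to every
    frame in the log-spectral (hence cepstral) domain, so the mean over
    frames estimates and removes it.
    """
    return mfccs - mfccs.mean(axis=0, keepdims=True)
```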

Extraction of WOCOR Feature Vectors

The Linear Prediction (LP) residual signal is adopted as a good representative of the vocal source excitation, in which speaker-specific information resides in both the time and frequency domains. The resulting vocal source feature, WOCOR, can effectively capture the speaker-specific spectro-temporal characteristics of the LP residual signal. In particular, with the pitch-synchronous wavelet transform, the WOCOR feature set is capable of capturing pitch-related low-frequency properties. Only voiced speech is kept for subsequent processing: in the source-filter model, the excitation signal for unvoiced speech is approximated as random noise [22, 26], and we believe that such a noise-like signal carries little speaker-specific information in the time-frequency domain [28]. Voicing decisions and pitch extraction are made with the robust algorithm for pitch tracking [32]. For each voiced speech portion, a sequence of LP residual signals, each 30 ms long, is obtained by inverse filtering the speech signal, i.e.,

e(n) = s(n) − Σ_{k=1}^{12} a_k s(n − k) (3.7)

where the filter coefficients ak are computed on Hamming-windowed speech frames using the autocorrelation method [22]. The e(n)'s of neighboring frames are concatenated to form the residual signal, whose amplitude is normalized to [−1, 1] to reduce intra-speaker variation. Once the pitch periods are estimated, the pitch pulses in the residual signal are located. For each pitch pulse, pitch-synchronous wavelet analysis is applied with a Hamming window two pitch periods long. The windowed residual signal is denoted eh(n), and its wavelet transform is computed as

w(a, b) = (1/√a) Σ_n eh(n) Ψ*((n − b)/a) (3.8)

where a = {2^k | k = 1, 2, …, K} and b = 1, 2, …, N, with N the window length. Ψ*(n) is the conjugate of the fourth-order Daubechies wavelet basis function Ψ(n); a and b are the scaling and translation parameters, respectively [33].

The wavelet coefficients are collected into four octave groups, i.e.,

Wk = {w(2^k, b) | b = 1, 2, …, N}, where k = 1, 2, 3, 4. (3.9)

Each octave group of coefficients is divided evenly into M subgroups, i.e.,