Prof. M.N.Eshwarappa & Prof. (Dr.) Mrityunjaya V. Latte
International Journal of Biometrics and Bioinformatics, (IJBB), Volume (4): Issue (4) 147
Bimodal Biometric Person Authentication System
Using Speech and Signature Features
Prof. M.N. Eshwarappa jenutc@rediffmail.com
Assistant Professor, Department of Telecommunication
Engineering, Sri Siddhartha Institute of Technology,
Tumkur-572101, Karnataka, India
Prof. (Dr.) Mrityunjaya V. Latte
Principal and Professor, Department of Electronics
and Communication Engineering, JSS
Academy of Technical Education,
Bangalore-560060, Karnataka, India
mvlatte@rediffmail.com
Abstract
Biometrics offers greater security and convenience than traditional methods of
person authentication. Multi-biometrics has recently emerged as a more robust
and efficient person authentication scheme: exploiting information from multiple
biometric features improves both the performance and the robustness of person
authentication. The objective of this paper is to develop a robust bimodal
biometric person authentication system using speech and signature biometric
features. A speaker-based unimodal system is developed by extracting Mel
Frequency Cepstral Coefficients (MFCC) and Wavelet Octave Coefficients of
Residues (WOCOR) as feature vectors. The MFCCs and WOCORs from the
training data are modeled using Vector Quantization (VQ) and Gaussian Mixture
Modeling (GMM) techniques. A signature-based unimodal system is developed
using the Vertical Projection Profile (VPP), Horizontal Projection Profile (HPP)
and Discrete Cosine Transform (DCT) as features. A bimodal biometric person
authentication system is then built from these two unimodal systems.
Experimental results show that the bimodal person authentication system
outperforms the unimodal systems. The bimodal system is finally evaluated for
robustness using noisy data as well as data collected in real environments, and
is found to be more robust than the unimodal person authentication systems.
Keywords: Biometrics, Speaker recognition, Signature verification, Multimodal biometrics.
1. INTRODUCTION
Biometrics is the development of statistical and mathematical methods applicable to data analysis
problems in the biological sciences, and its introduction brings new security approaches to
computer systems. Identification and verification are the two modes of using biometrics for
person authentication. Biometrics refers to the use of physical, physiological, biological or
behavioral characteristics to establish the identity of an individual. These
characteristics are unique to each individual and remain largely unaltered during the individual's
lifetime [1]. Biometric security systems are a powerful tool compared with electronics-based
security [2]. Any physiological and/or behavioral characteristic of a human can be used as a
biometric feature, provided it possesses the following properties: universality, distinctiveness,
permanence, collectability, circumvention, acceptability and performance [3]. Physiological
biometrics relate to the shape of the body. The oldest such traits, used for more than 100 years,
are fingerprints. Other examples are face, hand geometry, iris, DNA, palmprints and so on.
Behavioral biometrics relate to the behavior of a person. The first characteristic used, and still
widely used today, is the signature. Others are keystroke dynamics, gait (way of walking),
handwriting and so on. Speech is the unique biometric feature that falls under both categories [9].
Selecting the right biometric for a given application is crucial. A unimodal biometric system,
which operates on a single biometric characteristic, is affected by problems such as noisy sensor
data, non-universality, lack of individuality of the chosen biometric trait, and the absence of an
invariant representation for the biometric trait. For instance, speech is a biometric feature whose
characteristics vary significantly if the person has a cold or is in a different emotional state. Some
of these problems can be relieved by using a multimodal biometric system that consolidates
evidence from multiple biometric sources. Multimodal (or multi-biometric) systems utilize more
than one physiological or behavioral biometric for enrolment and identification. This work
presents such a multimodal biometric person recognition system, and the results obtained are
compared with the unimodal biometric systems.
There are several multimodal biometric person authentication systems developed in the literature
[3-7]. In 2004, A. K. Jain et al. proposed a framework for multimodal biometric person
authentication [3]. Even though some traits offer good performance in terms of reliability and
accuracy, no biometric is 100% accurate. With the increasing global need for security, the
demand for robust automatic person recognition systems is evident. For applications involving
the flow of confidential information, the authentication accuracy of the system is always the prime
concern, and for this basic reason the use of multimodal biometrics is encouraged. Multi-
biometrics is an integrated prototype system embedding different types of biometrics [3-5].
Multimodal biometric fusion and identity authentication techniques help to increase the
performance of an identity authentication system [8]. Multimodal biometrics can reduce the
probability of denial of access without sacrificing the False Acceptance Rate (FAR) performance,
by increasing the discrimination between the genuine and impostor classes. Applications
of multi-biometrics are widely spread throughout the world. A wide variety of systems require
reliable personal recognition schemes to either confirm or determine the identity of an individual
requesting their services. The purpose of such schemes is to ensure that the rendered services
are accessed only by a legitimate user, and not anyone else. Examples of such applications
include secure access to buildings, computer systems, laptops, cellular phones and ATMs. In the
absence of robust personal recognition schemes, these systems are vulnerable to the wiles of an
impostor. Authentication systems built upon only one modality may not fulfill all the requirements,
due to the limitations of unimodal systems. This has motivated the current interest in multimodal
biometrics, in which several biometric traits are used simultaneously in order to make an
identification decision. The objective of the present work is to develop a bimodal biometric system
using speech and signature features to mitigate some of the limitations of unimodal biometric
systems.
The present work mainly deals with the implementation of a bimodal biometric system employing
speech and signature as the biometric modalities, including the feature extraction and modeling
techniques used in the biometric system. The organization of the paper is as follows:
Section 2 describes the bimodal databases used in the bimodal person authentication system.
Section 3 deals with the unimodal speech-based person authentication system and Section 4
with the unimodal signature-based person authentication system. The bimodal biometric system
obtained by combining the speaker and signature recognition systems is explained with different
fusion techniques in Section 5. Section 6 concludes the paper by summarizing the present work
and adding a few points regarding future work.
2. BIMODAL DATABASES FOR PERSON AUTHENTICATION
IITG Speech database (standard)
Number of speakers: 30 (20 male, 10 female)
Sampling frequency: 8000 Hz
Sentences considered for each speaker: 4
Number of utterances of each sentence for each speaker: 24
Training session: first 16 utterances of each sentence of each speaker
Testing session: remaining 8 utterances of each sentence of each speaker
IITG Signature database (standard)
Number of writers: 30 (20 male, 10 female)
Scanner: HP ScanJet 5300C
Resolution: 300 dpi (dots per inch)
Data storage: 8-bit grayscale image
Saved format: bmp (bitmap)
Number of sample signatures of each writer: 24
Training session: first 16 signatures of all the writers
Testing session: remaining 8 signatures of all the writers
SSIT Speech database
Number of speakers: 30 (20 male, 10 female)
Sampling frequency: 8000 Hz
Sentences considered for each speaker: 4
Number of utterances of each sentence for each speaker: 24
Training session: first 16 utterances of each sentence of each speaker
Testing session: remaining 8 utterances of each sentence of each speaker
SSIT Signature database
Number of writers: 30 (20 male, 10 female)
Scanner: HP ScanJet 5300C
Resolution: 300 dpi (dots per inch)
Data storage: 8-bit grayscale image
Saved format: bmp (bitmap)
Number of sample signatures of each writer: 24
Training session: first 16 signatures of all the writers
Testing session: remaining 8 signatures of all the writers
3. UNIMODAL SPEECH BASED PERSON AUTHENTICATION SYSTEM
As in any other pattern recognition system, a speech-based person authentication system
consists of three components: (1) feature extraction, which transforms the speech waveform into
a set of parameters carrying salient speaker information; (2) pattern generation, which generates
from the feature parameters a pattern representing the individual speaker; and (3) pattern
matching and classification, which compares the similarity between the extracted features and
one or more pre-stored patterns, giving the speaker identity accordingly.
There are two stages in a speaker recognition system: training and testing. In the training stage,
speaker models (or patterns) are generated from the speech samples using some feature
extraction and modeling techniques. In the testing stage, feature vectors are generated from the
speech signal with the same extraction procedure as in training, and a classification decision is
made with some matching technique. Person authentication is a binary classification task [22]:
the features from the test signal are compared with the claimed speaker's pattern and a
decision is made to accept or reject the claim [10]. Depending on the mode of operation, speaker
recognition can be classified as text-dependent or text-independent. Text-dependent recognition
requires the speaker to produce speech for the same text during both training and testing,
whereas text-independent recognition does not rely on a specific text being
spoken [11]. The present work follows the text-dependent speaker recognition approach. This work
uses feature extraction techniques based on (1) Mel Frequency Cepstral Coefficients (MFCC),
derived from cepstral analysis of the speech signal, and (2) Wavelet Octave Coefficients of
Residues (WOCOR), derived from the Linear Prediction (LP) residual. A time-frequency
analysis of the LP residual signal is performed to obtain WOCOR [14]. WOCOR are generated by
applying a pitch-synchronous wavelet transform to the residual signal. Experimental results show
that the WOCOR parameters provide information complementary to the conventional MFCC
features for speaker recognition [14]. Vector Quantization (VQ) and Gaussian Mixture
Modeling (GMM) are used for modeling the person information from the MFCC and WOCOR
features [13-15]. The state-of-the-art system uses MFCC derived from speech as feature vectors
and GMM as the modeling technique [13].
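As a concrete illustration of the VQ modeling and matching mentioned above, the following numpy sketch trains a codebook on a speaker's feature vectors with a simple k-means loop and verifies an identity claim by average quantization distortion. The feature vectors, codebook size, and threshold are synthetic stand-ins, not the settings used in this paper:

```python
import numpy as np

def train_codebook(features, codebook_size=16, iterations=20, seed=0):
    """Train a VQ codebook with a basic k-means (LBG-style) loop."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign each feature vector to its nearest codeword.
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        # Move each codeword to the centroid of its assigned vectors.
        for c in range(codebook_size):
            members = features[nearest == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

def distortion(codebook, features):
    """Average distance from each test vector to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
enrol = rng.normal(0.0, 1.0, size=(500, 13))     # stand-in for MFCC vectors
genuine = rng.normal(0.0, 1.0, size=(100, 13))   # test data, same "speaker"
impostor = rng.normal(3.0, 1.0, size=(100, 13))  # test data, different "speaker"

codebook = train_codebook(enrol)
threshold = 1.5 * distortion(codebook, enrol)    # illustrative threshold
print(distortion(codebook, genuine) <= threshold)   # True: claim accepted
print(distortion(codebook, impostor) <= threshold)  # False: claim rejected
```

A GMM-based system follows the same pattern, with per-frame log-likelihood under the claimed speaker's mixture model replacing quantization distortion as the matching score.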
Feature Extraction from Speech Information
The speaker information is present in both the vocal tract and the excitation parameters [12]. The
vocal tract system corresponds to processing of speech in short (10-30 ms), overlapped
(5-15 ms) windows. The vocal tract system is assumed to be stationary within the window and can
be modeled as an all-pole filter using LP analysis [21]. The most widely used representation of the
speech signal for feature extraction is the cepstrum. Different forms of cepstral representation
include Complex Cepstral Coefficients (CCC), Real Cepstral Coefficients (RCC), Mel Frequency
Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC); among these,
MFCC is the most commonly used. In all cepstral analysis techniques, the vocal tract information
is obtained by taking the log of the spectrum of the speech signal. The LP residual signal,
though not giving the true glottal pulse, is regarded as a good representative of the excitation
source. The Haar transform and wavelet transform are applied for multiresolution analysis of
the residual signal, and the derived feature vectors are termed Wavelet Octave Coefficients of
Residues (WOCOR). WOCOR are believed to effectively capture the speaker-specific
spectro-temporal characteristics of the LP residual signal.
Extraction of MFCC Feature Vectors
The state-of-the-art system builds a unimodal system by analyzing speech in blocks of 10-30 ms
with a shift of half the block size. The MFCC are used as feature vectors extracted from each of
the blocks. The MFCCs from the training or enrolment data are modeled using the Vector
Quantization (VQ) and Gaussian Mixture Modeling (GMM) techniques [12]. The MFCCs from the
testing or verification data are compared with the respective model to validate the identity claim
of the speaker. The MFCCs represent mainly the vocal tract aspect of speaker information and
hence capture only the physiological aspect of the speech biometric feature. Another important
physiological aspect contributing significantly to speaker characteristics is the excitation source
[13]. A speech signal s(n) is obtained by the convolution of the vocal tract parameters v(n) and
the excitation parameters x(n), as given by equation (3.1). These parameters cannot be
separated in the time domain; hence we move to the cepstral domain. Cepstral analysis is used
for separating the vocal tract parameters v(n) and the excitation parameters x(n) from the
speech signal s(n):
s(n) = v(n) * x(n) (3.1)
Cepstral analysis exploits this fundamental property of convolution to separate the vocal
tract and excitation parameters [27]. The cepstral coefficients C of length M can be
obtained using equation (3.2):
C = real(IFFT(log |FFT(s(n))|)) (3.2)
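Equation (3.2) can be illustrated directly with numpy. The signal below is a synthetic stand-in for a windowed speech frame, and the small epsilon inside the log is a numerical safeguard, not part of the paper's formulation:

```python
import numpy as np

n = np.arange(256)
# Synthetic stand-in for a windowed speech frame.
s = np.sin(2 * np.pi * 50 * n / 256) + 0.5 * np.sin(2 * np.pi * 120 * n / 256)

spectrum = np.fft.fft(s)
log_mag = np.log(np.abs(spectrum) + 1e-12)     # epsilon avoids log(0)
cepstrum = np.real(np.fft.ifft(log_mag))       # real cepstrum, eq. (3.2)

print(cepstrum.shape)  # (256,)
```

Low-order cepstral coefficients capture the slowly varying spectral envelope (the vocal tract), while higher-order coefficients reflect the excitation; this is what makes the cepstrum useful for the deconvolution in equation (3.1).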
The nonlinear Mel scale, i.e., the relation between the Mel frequency (fMel) and the physical
frequency (fHz), is used for extracting spectral information from the speech signal in the cepstral
analysis:
fMel = 2595 log10 (1 + fHz / 700) (3.3)
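As a quick numerical check of equation (3.3) (a well-known property of the Mel scale, not a result from this paper), 1000 Hz maps to approximately 1000 Mel:

```python
import math

def hz_to_mel(f_hz):
    # Equation (3.3): fMel = 2595 log10(1 + fHz / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(0.0))       # 0.0
print(hz_to_mel(1000.0))    # approximately 1000
```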
Using equation (3.3), we construct a spectrum with critical bands, which are overlapping
triangular filter banks; i.e., we map the linearly spaced frequency spectrum (fHz) into a
nonlinearly spaced frequency spectrum (fMel). In this way we mimic the human auditory system,
and based on this concept the MFCC feature vectors are derived. Windowing eliminates the
Gibbs oscillations that occur when truncating the speech signal. Using equation (3.4), Hamming
window coefficients are generated, with which the corresponding speech frame is scaled.
w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)) (3.4)
However, due to Hamming windowing, samples at the edges of the window are weighted with
lower values. To compensate for this, the frames are overlapped by 50%. After windowing, the
log magnitude spectrum of each frame is computed to find the energy coefficients using
equation (3.5):
Y(i) = log [ sum_{k=0}^{N-1} |S(k)|^2 H_i(2*pi*k/N) ] (3.5)
where H_i(2*pi*k/N) is the i-th Mel critical band filter, S(k) is the DFT of the windowed frame,
and N is the number of points used to compute the discrete Fourier transform (DFT). The M
Mel frequency coefficients are then computed using the discrete cosine transform (DCT),
equation (3.6), which is simply the real IDFT of the log energy outputs of the critical band filters:
C(n) = sum_{k=1}^{M} Y(k) cos( n (k - 1/2) pi / M ) (3.6)
where n = 1, 2, 3, ..., M.
The present work also handles channel mismatch by using Cepstral Mean Subtraction (CMS),
and the effect of the different roll-offs of different channels on the cepstral coefficients is
handled by a liftering procedure [21].
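The framing, Hamming windowing, Mel filter-bank, DCT, and CMS steps described above can be sketched end-to-end as follows. The frame size, filter count, and coefficient count are illustrative choices, not necessarily the exact settings used in this paper:

```python
import numpy as np

def mel_filterbank(num_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale (eq. 3.3)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, fs=8000, frame_len=240, hop=120, num_filters=20, num_ceps=13):
    window = np.hamming(frame_len)                 # eq. (3.4)
    fb = mel_filterbank(num_filters, frame_len, fs)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):  # 50% overlap
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        log_energy = np.log(fb @ power + 1e-12)               # eq. (3.5)
        # DCT of the log filter-bank energies, keep num_ceps (eq. 3.6).
        n = np.arange(num_ceps)[:, None]
        k = np.arange(num_filters)[None, :]
        dct = np.cos(np.pi * n * (k + 0.5) / num_filters)
        frames.append(dct @ log_energy)
    feats = np.array(frames)
    return feats - feats.mean(axis=0)              # Cepstral Mean Subtraction

rng = np.random.default_rng(1)
x = rng.normal(size=8000)          # 1 s of noise as a stand-in for speech
feats = mfcc(x)
print(feats.shape)                              # (65, 13)
print(np.allclose(feats.mean(axis=0), 0.0))     # True: CMS zeros the mean
```

Subtracting the per-utterance cepstral mean removes any constant (channel-dependent) offset from every coefficient, which is exactly the channel-mismatch compensation CMS is meant to provide.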
Extraction of WOCOR Feature Vectors
The Linear Prediction (LP) residual signal is adopted as a good representative of the vocal
source excitation, in which speaker-specific information resides in both the time and frequency
domains. The resulting vocal source feature, the WOCOR feature, can effectively capture the
speaker-specific spectro-temporal characteristics of the LP residual signal. In particular, with the
pitch-synchronous wavelet transform, the WOCOR feature set is capable of capturing pitch-
related low-frequency properties. Only voiced speech is kept for subsequent processing: in the
source-filter model, the excitation signal for unvoiced speech is approximated as random noise
[22, 26], and such a noise-like signal is believed to carry little speaker-specific information in the
time-frequency domain [28]. Voicing decisions and pitch extraction are made using the robust
algorithm for pitch tracking [32]. For each voiced speech portion, a sequence of LP residual
signals of 30 ms length is obtained by inverse filtering the speech signal, i.e.,
e(n) = s(n) - sum_{k=1}^{12} a_k s(n - k) (3.7)
where the filter coefficients a_k are computed on Hamming-windowed speech frames using the
autocorrelation method [22]. The e(n)'s of neighboring frames are concatenated to obtain the
residual signal, and its amplitude is normalized within [-1, 1] to reduce intra-speaker variation.
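The inverse filtering of equation (3.7) and the amplitude normalization can be sketched as follows. The autocorrelation method is implemented here with the Levinson-Durbin recursion, and the AR test signal is a synthetic stand-in for voiced speech:

```python
import numpy as np

def lp_coefficients(frame, order=12):
    """LP analysis by the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a  # A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order

# Synthetic AR(2) "speech": strongly correlated, driven by white noise.
rng = np.random.default_rng(2)
e_true = rng.normal(size=2000)
s = np.zeros(2000)
for n in range(2, 2000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e_true[n]

frame = s[1000:1240] * np.hamming(240)          # one 30 ms frame at 8 kHz
a = lp_coefficients(frame, order=12)
# Inverse filtering: e(n) = frame(n) + sum_k a[k] frame(n-k), i.e. eq. (3.7)
# with the predictor coefficients carrying the opposite sign convention.
residual = np.convolve(frame, a)[:len(frame)]
gain = np.var(frame) / np.var(residual)         # prediction gain
residual /= np.max(np.abs(residual))            # normalize into [-1, 1]

print(gain > 1.0)   # True: inverse filtering removes the predictable part
```

The residual has much less energy than the frame because the all-pole model absorbs the vocal-tract (here, AR) structure, leaving an excitation-like error signal behind.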
Once the pitch periods are estimated, the pitch pulses in the residual signal are located. For each
pitch pulse, pitch-synchronous wavelet analysis is applied with a Hamming window two pitch
periods long. The windowed residual signal is denoted e_h(n). The wavelet transform of e_h(n)
is computed as
w(a, b) = (1/sqrt(a)) sum_n e_h(n) psi*((n - b)/a) (3.8)
where a = {2^k | k = 1, 2, ..., K} and b = 1, 2, ..., N, with N the window length. psi*(n) is the
conjugate of the fourth-order Daubechies wavelet basis function psi(n); a and b are the scaling
and translation parameters, respectively [33].
The wavelet coefficients are collected into four octave groups, i.e.,
Wk = {w(2^k, b) | b = 1, 2, ..., N}, where k = 1, 2, 3, 4. (3.9)
Each octave group of coefficients is divided evenly into M subgroups, i.e.,