Article

Gender identification of the speaker using DTW method


Abstract

This paper aims to recognize a speaker's gender automatically from the speech signal, independently of the speaker. The gender is decided by evaluating the distance between MFCC feature vectors. The DTW method is used to warp the time series while that distance is evaluated. The test data include both single-word recordings and sentences of more than one word, spoken in different languages such as Turkish, English and German. The experiments show an accuracy of 100% for single-word utterances and 98% for utterances of more than one word.
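The decision rule described in the abstract can be sketched as a nearest-template search under DTW. This is a minimal illustration, not the authors' implementation: the function names `dtw_distance` and `classify_gender`, the Euclidean frame distance, the unconstrained step pattern, and the template format are all assumptions, and the MFCC vectors are taken as already computed.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature sequences of shape (frames, coeffs)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Euclidean distance between every frame of a and every frame of b.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # acc[i, j] is the cost of the best warping path aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]

def classify_gender(test_seq, templates):
    """Return the label of the (label, sequence) template nearest to test_seq."""
    return min(templates, key=lambda item: dtw_distance(test_seq, item[1]))[0]
```

Because the warping path may repeat frames, a slower or faster rendition of the same utterance stays close to its template, which is why DTW suits recordings of different durations.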


... MFCC extracts the spectral components of the signal at a 10 ms rate using the fast Fourier transform and carries out further filtering based on the perceptually motivated Mel scale. In [12], the authors decide the gender of the speaker by evaluating the distance between MFCC feature vectors and report an identification accuracy of about 98%. However, using MFCC also has several limitations. ...
Article
In this paper, we address the speech-based gender identification problem. Mel-Frequency Cepstral Coefficients (MFCC) of voice samples are typically used as the features for gender identification. However, MFCC-based classification incurs high complexity. This paper proposes a novel pitch-based gender identification system with a two-stage classifier to ensure accurate identification and low complexity. The first stage of the classifier identifies and labels all the speakers whose pitch clearly indicates their gender; the complexity of this stage is very low, since only a threshold-based decision rule on a scalar (i.e., pitch) is used. The ambiguous voice samples from all the other speakers (which cannot be classified with high accuracy by the first stage, and can be regarded as suspicious speakers or difficult cases) are forwarded to the second stage for finer examination; the second stage of our classifier uses a Gaussian Mixture Model to accurately separate voice samples by gender. Experimental results show that our system is independent of speech language and content, independent of the microphone, and robust against noisy recording conditions. Our system is extremely accurate, with a probability of correct classification of 98.65%, and very efficient, with about 5 s required for feature extraction and classification. Copyright © 2011 John Wiley & Sons, Ltd.
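The two-stage cascade in this abstract can be sketched as follows. This is only an illustration of the control flow: the function name `two_stage_gender`, the threshold values `low_hz` and `high_hz`, and the `second_stage` callable standing in for the Gaussian Mixture Model are hypothetical and are not taken from the paper.

```python
def two_stage_gender(pitch_hz, second_stage, low_hz=140.0, high_hz=190.0):
    """Cheap pitch thresholds first; defer ambiguous cases to a finer classifier.

    low_hz and high_hz are placeholder thresholds chosen for illustration only.
    second_stage is any zero-argument callable returning "male" or "female".
    """
    if pitch_hz < low_hz:
        return "male"       # pitch clearly in the lower range
    if pitch_hz > high_hz:
        return "female"     # pitch clearly in the upper range
    return second_stage()   # ambiguous: run the more expensive model
```

The design point is that the expensive model runs only for speakers whose pitch falls between the two thresholds, which keeps the average cost low.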
Article
In this study, the effect of the Short-time Mean and Variance Normalization (STMVN), Short-time Cepstral Mean and Scale Normalization (STMSN), Min-Max Normalization, Z-Score Normalization and Standard Deviation Normalization techniques on classification performance was investigated for determining speakers’ gender. Voice recordings of 192 male and 192 female speakers from the TIMIT data set were used. Features were extracted from the recordings with the Mel Frequency Cepstral Coefficients (MFCC) technique, their dimension was reduced with Principal Component Analysis (PCA), and they were then normalized with the different techniques. A Support Vector Machine (SVM) was used as the classifier. The highest accuracy in gender estimation, 98.18%, was obtained from features normalized with the Standard Deviation Normalization technique; the other normalization techniques reduced accuracy.
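Two of the normalization techniques named in this abstract have standard textbook definitions and can be sketched per feature column as below. The function names are assumptions, and the guard for constant columns is an added convenience; the short-time variants and the study's exact Standard Deviation Normalization are not reproduced here because the abstract does not define them.

```python
import numpy as np

def min_max_normalize(x):
    """Scale each feature column of x (samples, features) to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    # Constant columns would divide by zero; leave them at 0 instead.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

def z_score_normalize(x):
    """Shift each feature column to zero mean and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # same guard for constant columns
    return (x - x.mean(axis=0)) / std
```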
Conference Paper
A VQ (vector quantization)-distortion-based speaker recognition method and discrete/continuous ergodic HMM (hidden Markov model)-based ones are compared, especially from the viewpoint of robustness against utterance variations. It is shown that a continuous ergodic HMM is far superior to a discrete ergodic HMM. It is also shown that the information on transitions between different states is ineffective for text-independent speaker recognition. Therefore, the speaker identification rates using a continuous ergodic HMM are strongly correlated with the total number of mixtures irrespective of the number of states. It is also found that, for continuous ergodic HMM-based speaker recognition, the distortion-intersection measure (DIM), which was introduced as a VQ-distortion measure to increase the robustness against utterance variations, is effective.
Article
This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterance from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task
Article
Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly h...
Article
Techniques in Speech Acoustics provides an introduction to the acoustic analysis and characteristics of speech sounds. The first part of the book covers aspects of the source-filter decomposition of speech, spectrographic analysis, the acoustic theory of speech production and acoustic phonetic cues. The second part is based on computational techniques for analysing the acoustic speech signal, including digital time and frequency analyses, formant synthesis, and the linear predictive coding of speech. There is also an introductory chapter on the classification of acoustic speech signals, which is relevant to aspects of automatic speech and talker recognition. The book is intended for use as teaching material on undergraduate and postgraduate speech acoustics and experimental phonetics courses; it is also aimed at researchers from phonetics, linguistics, computer science, psychology and engineering who wish to gain an understanding of the basis of speech acoustics and its application to fields such as speech synthesis and automatic speech recognition.
Conference Paper
A vector quantization based talker recognition system is described and evaluated. The system is based on constructing highly efficient short-term spectral representations of individual talkers using vector quantization codebook construction techniques. Although the approach is intrinsically text-independent, the system can be easily extended to text-dependent operation for improved performance and security by encoding specified training word utterances to form word prototypes. The system has been evaluated using a 100-talker database of 20,000 spoken digits. In a talker verification mode, average equal-error rate performance of 2.2% for text-independent operation and 0.3% for text-dependent operation is obtained for 7-digit long test utterances.
Conference Paper
This paper describes a novel approach which combines acoustic analysis using MFCC with the speaker's mean pitch to improve gender recognition performance. In the acoustic analysis, two sets of Gaussian mixture models (GMMs), male and female, are trained from the speech, and the most likely sequence of models with corresponding likelihood scores is produced. In the pitch estimation approach, a threshold is specified to differentiate the two sets. The information provided by the MFCC-based acoustic analysis and by pitch estimation is combined using a linear normalization fusion method. The system was tested on the SRMC databases, giving a recognition error rate of at most 3.3%.
Book
Electrical Engineering. Discrete-Time Processing of Speech Signals. Commercial applications of speech processing and recognition are fast becoming a growth industry that will shape the next decade. Now students and practicing engineers of signal processing can find in a single volume the fundamentals essential to understanding this rapidly developing field. IEEE Press is pleased to publish a classic reissue of Discrete-Time Processing of Speech Signals. Specially featured in this reissue is the addition of valuable World Wide Web links to the latest speech data references. This landmark book offers a balanced discussion of both the mathematical theory of digital speech signal processing and critical contemporary applications. The authors provide a comprehensive view of all major modern speech processing areas: speech production physiology and modeling, signal analysis techniques, coding, enhancement, quality assessment, and recognition. You will learn the principles needed to understand advanced technologies in speech processing, from speech coding for communications systems to biomedical applications of speech analysis and recognition. Ideal for self-study or as a course text, this far-reaching reference book offers an extensive historical context for concepts under discussion, end-of-chapter problems, and practical algorithms. Discrete-Time Processing of Speech Signals is the definitive resource for students, engineers, and scientists in the speech processing field.
Conference Paper
The authors present a new automatic male/female classification method based on the location in the frequency domain of the first two formants. This classification is based on a new automatic formant extraction which is faster than a peak picking technique. Gender-dependent acoustic-phonetic models stemming from this classification are used in the INRS continuous speech recognition system with the ATIS corpora. An improvement of 14% is obtained with these models in comparison to the baseline speaker-independent system.
Conference Paper
This paper describes a novel technique specifically developed for gender identification which combines acoustic analysis and pitch. Two sets of hidden Markov models, male and female, are matched to the speech using the Viterbi algorithm, and the most likely sequence of models with corresponding likelihood scores is produced. Linear discriminant analysis is used to normalise the models and reduce bias towards a particular gender. An enhanced version of the pitch estimation algorithm used for IMBE speech coding gives an average pitch estimate for the speaker. The information provided by the acoustic analysis and pitch estimation is combined using a linear classifier to identify the gender of the speech. The system was tested on three British English databases, giving less than 1% identification error rate with two seconds of speech. Further tests without optimisation on eleven languages of the OGI database gave error rates less than 5.2% and an average of 2.0%.
Article
The Principle of Optimality and one method for its application, dynamic programming, were popularized by Bellman in the early 1950s. Dynamic programming was soon proposed for speech recognition and applied to the problem as soon as digital computers with sufficient memory were available, around 1962. Today, most commercially available recognizers and many of the systems being developed in research laboratories use dynamic programming, typically to address the problem of the time alignment between a speech segment and some template or synthesized speech artifact. In this tutorial paper, the application of dynamic programming to connected-speech recognition is introduced and discussed. The deterministic form, used for template matching for connected speech, is described in detail. The stochastic form, ordinarily called the Viterbi algorithm, is also introduced.
Article
The task of speaker verification, a subset of the general problem of speaker recognition, is defined. The feature selection and pattern matching steps of the recognition procedure are examined. Speaker verification system design and performance are discussed, and databases for evaluating them are briefly considered. An example of a speaker verification system is described. An overview of industry research in this area is given.
Article
The usefulness of identifying a person from the characteristics of his voice is increasing with the growing importance of automatic information processing and telecommunications. This paper reviews the voice characteristics and identification techniques used in recognizing people by their voices. A discussion of inherent performance limitations, along with a review of the performance achieved by listening, visual examination of spectrograms, and automatic computer techniques, attempts to provide a perspective with which to evaluate the potential of speaker recognition and productive directions for research into and application of speaker recognition technology.
Article
A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.
Article
A tutorial on signal processing in state-of-the-art speech recognition systems is presented, reviewing those techniques most commonly used. The four basic operations of signal modeling, i.e. spectral shaping, spectral analysis, parametric transformation, and statistical modeling, are discussed. Three important trends that have developed in the last five years in speech recognition are examined. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information, have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal's spectrum can be estimated in a closed-loop manner. The signal processing components of these algorithms are reviewed
Article
In this paper we address both these problems by introducing a modification of DTW. The crucial difference is in the features we consider when attempting to find the correct warping. Rather than use the raw data, we consider only the (estimated) local derivatives of the data.
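The idea of warping on estimated local derivatives instead of raw values can be sketched for one-dimensional series as below. The derivative estimate (the average of the slope to the left neighbour and the slope across both neighbours), the endpoint handling, the squared-difference cost, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def estimate_derivative(q):
    """Estimate the local derivative of a 1-D series with at least 3 points."""
    q = np.asarray(q, dtype=float)
    d = np.empty_like(q)
    # Average of the left slope and the slope through both neighbours.
    d[1:-1] = ((q[1:-1] - q[:-2]) + (q[2:] - q[:-2]) / 2.0) / 2.0
    # Endpoints have only one neighbour, so reuse the adjacent estimate.
    d[0], d[-1] = d[1], d[-2]
    return d

def derivative_dtw(q, c):
    """DTW cost computed on estimated derivatives rather than raw values."""
    dq, dc = estimate_derivative(q), estimate_derivative(c)
    n, m = len(dq), len(dc)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (dq[i - 1] - dc[j - 1]) ** 2
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]
```

One consequence of working on derivatives is that a constant vertical offset between two series does not affect the result, since both series then have the same local slopes.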
Konuşmacı Cinsiyetinin Temel Frekansa Göre Belirlenmesi, Çankaya Üniversitesi 1. Mühendislik ve Teknoloji Sempozyumu, pp. 33-41, April 2008
  • V V Nabiyev
  • E Yücesoy
Identification of Non-Linguistic Speech Features, Proc. ARPA Human Language Technology, Morgan Kaufmann
  • J-L Gauvain
  • L F Lamel