A Voice Identification System using Hidden Markov Model
Abstract
Background/Objectives: A Voice Identification System is a system comprising hardware and software that is used to identify a voice for several applications. The aim of this research is to develop a small-scale system that incorporates both speaker recognition and speech recognition and can show specific visual information to a user. Methods: To this end, we have developed a system based on the Hidden Markov Model technique. The Hidden Markov Model is a stochastic approach that models the algorithm as a doubly stochastic process, in which the observed data are thought to be the result of passing a hidden process through a second process; both processes can be characterized only through the one that is observed. A database of voice information is created. To extract features from the voice signals, the Mel-Frequency Cepstral Coefficients (MFCC) technique is applied, producing a set of feature vectors. Subsequently, the system uses Vector Quantization (VQ) for feature training and classification. Findings: The designed system has been tested with multiple speakers as reference. Speech recognition based on the Hidden Markov Model is achieved successfully for the conversion of speech to text; in this research, speech recognition is achieved with an accuracy of about 90%. Applications: The system has the potential to be used in the music industry, crime investigation, personal assistants, and hi-tech devices.
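As a rough illustration of the MFCC extraction step described above, the following numpy sketch frames the signal, applies a Hamming window, maps the power spectrum through a triangular mel filterbank, and takes a DCT of the log filter energies. The frame sizes, filter count, and 13 coefficients are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filt=26, n_ceps=13):
    # Pre-emphasis boosts high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum of each frame
    n_fft = frame_len
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-spaced filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT-II to decorrelate -> cepstral coefficients
    log_e = np.log(pspec @ fbank.T + 1e-10)
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return log_e @ dct.T   # shape: (n_frames, n_ceps)
```

Each row of the result is one feature vector of the kind fed to the VQ/HMM stages.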
... The MFCC parameters are usually preferred over others because they are less susceptible to the speaker-dependent variations likely to be present in speech signals [7,8]. Several matching techniques have been proposed for speech and speaker recognition, such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM), which are very frequently used in speech recognition [9,10], and Artificial Neural Networks (ANN). Gaussian Mixture Models (GMM) and Vector Quantization (VQ) are generative models that estimate the feature distribution within each speaker. The Gaussian mixture model is widely used for speaker modelling in speaker identification systems [11]. ...
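Of the matching techniques listed above, DTW is the simplest to sketch. The following is a minimal, generic implementation (not taken from any of the cited papers) that aligns two feature sequences of different lengths by dynamic programming:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two feature sequences.

    a, b: arrays of shape (length, dim), e.g. MFCC frames.
    Returns the minimal cumulative Euclidean frame distance
    along a monotonic alignment path.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # step pattern: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

In a template-matching recognizer, the spoken word is assigned to the reference template with the smallest DTW cost.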
This paper presents automatic speaker identification and speech recognition for Arabic digits in a noisy environment. In this work, the proposed system is able to identify a speaker after saving his voice in the database and adding noise. Mel Frequency Cepstral Coefficients (MFCC) are used to build the program on the Matlab platform, and Vector Quantization is used to generate the codebooks. Gaussian Mixture Modelling (GMM) algorithms are used for template generation and feature matching, and their performance is compared with speaker modelling schemes based on Vector Quantization (VQ) for speaker identification. White Gaussian noise is added to the clean speech at several signal-to-noise ratio (SNR) levels. The proposed system achieves a good recognition rate.
... T. K. DAS et al. [5] designed a speech information system based on HMMs and mel-frequency cepstral coefficients (MFCCs). Their best-achieved result is approximately 90%. ...
This paper is part of our contribution to research on enhancing the performance of network automatic speech recognition systems. We built a highly configurable platform using hidden Markov models, Gaussian mixture models, and Mel frequency spectral coefficients, in addition to VoIP G.711-u and GSM codecs. To determine the optimal values for maximum performance, different acoustic models are prepared by varying the number of hidden Markov model states (from 3 to 5) and Gaussian mixtures (8, 16, 32), with 13 feature extraction coefficients. Additionally, the generated acoustic models are tested with unencoded and encoded speech data based on the G.711 and GSM codecs. The best parameterization performance is obtained with 3 HMM states, 8-16 GMMs, and the G.711 codec.
... HMM, VQ, and DNN are frequently used for classifying voice features [7,11,12]. The building blocks of artificial neural networks are neurons; neurons combine to form layers, and layers combine to form the network. ...
Blindfold chess is of particular interest in research on the memory structure and limits of the human brain. In blindfold chess, the player cannot see the board, so the player visualizes the situation and makes his moves aloud. This study focuses on the first step of recognizing voice commands for blindfold chess: word-based detection and classification of chess-piece vocalizations. Mel frequency coefficients and mel spectrograms are used as the feature vectors for the audio data. These vectors are classified using artificial neural networks. In the tests, 99% success has been obtained in noisy environments.
Speech recognition technology is widely used for voice-enabled form filling. The manual process of filling out forms by typing has become increasingly challenging and time-consuming, an issue particularly evident in settings such as job and internship applications. To address this problem, we propose a system that automates the form-filling process using speech recognition technology. The ability to operate anything with voice commands is a crucial factor in today's environment. The proposed system fills out forms automatically: it analyses the user's unique voice, identifies the user's speech, and then transcribes the speech into text. This paper proposes a machine-learning model built on the Hidden Markov Model, trained and tested on this system, with Mel Frequency Cepstral Coefficients as the pre-processing methodology; MFCC is widely used for automatic voice recognition. The results demonstrate that the system accurately transcribes user speech into text, simplifying the form-filling process significantly. With these results, we hope to demonstrate how this technology has the potential to revolutionize data entry and accessibility, while also establishing a strong case for speech recognition as a convenient way to speed up form completion.
Detecting when the entirety of a drillstring is moving—referred to as breakover—is necessary for automating several tasks in the drilling process. This paper provides an overview of how cross-industry application of machine learning (ML) technology helped solve challenges related to real-time pattern recognition of breakover and how this solution assisted with providing immediate metrics and control of the drilling process.
This project leveraged Hidden Markov Models (HMMs), used frequently in other industries for speech recognition and pattern recognition over time series data, to create a statistical classifier that detects drillstring breakover in real time. Although these techniques have not seen widespread adoption in the oil and gas industry, they provide a flexible solution to many automation problems. Model features correlated with string stiffness were constructed that allowed for accurate classification of pre-breakover and post-breakover states. Subject matter experts were enlisted to label 500+ examples of breakover, which were used to train and test models for both ascending and descending drillstrings. The models were then deployed and integrated into the drilling control system to provide monitoring capabilities and control certain processes.
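The paper does not publish its model, but the core mechanism (classifying pre-breakover vs. post-breakover states from a time series) can be sketched generically with a two-state Gaussian-emission HMM decoded by the Viterbi algorithm. All parameter values below are made up for illustration and are not the authors' features or model:

```python
import numpy as np

def viterbi_two_state(obs, means, stds, log_trans, log_init):
    """Most likely state path for a 1-D observation series under a
    2-state HMM with Gaussian emissions (e.g. pre-/post-breakover)."""
    def log_gauss(x, mu, sd):
        return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)

    T, S = len(obs), 2
    delta = np.zeros((T, S))          # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int) # backpointers
    for s in range(S):
        delta[0, s] = log_init[s] + log_gauss(obs[0], means[s], stds[s])
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            psi[t, s] = np.argmax(scores)
            delta[t, s] = scores[psi[t, s]] + log_gauss(obs[t], means[s], stds[s])
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):    # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

With sticky transition probabilities, a step change in a hookload-like signal is decoded as a single clean state switch rather than flickering detections.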
The models detected breakover accurately more than 90% of the time when measured against the several wells studied for this project, and provided hookload values for both breakover and general pickup and slackoff operations. This high accuracy allows the model to be applied broadly to several use cases, including reducing an operator's 20-ft standard pickup distance (thereby reducing overall connection times) and improving the quality of tares in deep lateral sections. The model also provided additional benefits, including automated drag monitoring for rotary drilling and tripping, as well as hole-condition monitoring during cleanup cycles. Both offer opportunities to optimize flat time and are discussed in detail.
In this paper, the effects of speech coders on speech recognition performance are presented. We evaluate our Amazigh speech recognition system over a wireless network based on a configurable platform created by combining automatic speech recognition and IVR technologies. Different parameters are used, such as VoIP audio codecs, hidden Markov models (HMMs), and Gaussian mixture models (GMMs). The system is trained and tested on the first ten digits by collecting data from 24 native speakers of Tarifit. The VoIP codecs used in this work are G.711, GSM, and Speex, based on the SIP protocol. Our results show that the best performance is 84.14%, achieved by using the GSM audio codec.
Speech recognition is an important research topic in pattern recognition. In this study, the authors give a detailed assessment of voice recognition strategies for several majority languages. Over the last several decades, many researchers have contributed to the field of voice processing and recognition. Although there are several frameworks for speech processing and recognition, only a few ASR systems are available for language recognition throughout the world. The data gathered for this research reveals that most of the effort has gone into constructing ASR systems for majority languages, whereas minority languages suffer from a lack of standard speech corpora. We also examine some of the key issues for voice recognition in various languages, and explore the kinds of hybrid acoustic modeling methods required for efficient results. Because the success of a classifier depends on the information extracted during the feature extraction phase, it is critical to carefully select feature extraction techniques and classifiers.
In this paper, a novel architecture is proposed using a convolutional neural network (CNN) and mel frequency cepstral coefficients (MFCC) to identify the speaker in a noisy environment. This architecture is used in a text-independent setting. The most important task in any text-independent speaker identification is the system's capability to learn features that are useful for classification. We use a hybrid feature extraction technique with a CNN as a feature extractor combined with MFCC as a single feature set. For classification, we use a deep neural network, which shows very promising results in classifying speakers. We built our own dataset containing 60 speakers, each with 4 voice samples. Our best hybrid model achieved an accuracy of 87.5%. To verify the effectiveness of this hybrid architecture, we use metrics such as accuracy and precision.
A voice recognition system is designed to identify an administrator's voice. By coding the voice recognition in MATLAB, the administrator's voice can be authenticated. The key is to convert the speech waveform to a parametric representation for further analysis and processing. A wide range of possibilities exists for parametrically representing the speech signal for voice recognition, such as Mel-Frequency Cepstrum Coefficients (MFCC). The input voice signal is recorded, and the computer compares it with the signal stored in the database using the MFCC method. The voice-based biometric system is based on single-word recognition. An administrator utters the password once in the training session so that it can be trained and stored. In the testing session, users utter the password again; recognition is achieved if there is a match. Through MATLAB simulation, the output shows whether the user is recognized or rejected. In testing, the system successfully recognized the specific user's voice and rejected other users' voices. In conclusion, the system successfully recognizes the user's voice and provides a medium level of security.
This paper presents the design, implementation, and evaluation of a research work for developing an automatic person identification system using voice biometric. The developed automatic person identification system mainly used toolboxes provided by MATLAB environment. To extract features from voice signals, Mel-Frequency Cepstral Coefficients (MFCC) technique was applied producing a set of feature vectors. Subsequently, the system uses the Vector Quantization (VQ) for features training and classification. In order to train and test the developed automatic person identification system, an in-house voice database is created, which contains recordings of 100 persons' usernames (50 males and 50 females) each of which is repeated 30 times. Therefore, a total of 3000 utterances are collected. This paper also investigates the effect of the persons' gender on the overall performance of the system. The voice data collected from female persons outperformed those collected from the male persons, whereby the system obtained average recognition rates of 94.20% and 91.00% for female and male persons, respectively. Overall, the voice based system obtained an average recognition rate of 92.60% for all persons.
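The VQ matching stage such systems use can be sketched as follows: each enrolled person has a codebook of MFCC centroids, and a test utterance is assigned to whichever codebook yields the smallest average quantization distortion. This is a generic sketch, not the authors' code; the codebooks here would come from a training step such as the LBG algorithm, and the speaker names are hypothetical.

```python
import numpy as np

def avg_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(features, codebooks):
    """Return the speaker whose codebook gives minimal average distortion.

    codebooks: dict mapping speaker name -> array of shape (n_codes, dim)
    """
    return min(codebooks, key=lambda spk: avg_distortion(features, codebooks[spk]))
```

The same scoring can be thresholded for verification (accept/reject) instead of closed-set identification.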
For languages in which tone plays a key role in distinguishing meaning (such as Thai and Chinese), a tone detection function is required to guarantee correct word recognition. This paper proposes an architecture for HMM-based isolated-word speech recognition with a tone detection function. In this architecture, the tone detection function is added into each computation process of the series architecture and each parallel computation of the scalable architecture. To evaluate the performance of the proposed method, an experiment is performed with 29 Thai words selected from TV remote-control commands and 10 Thai words with indistinct tone classification. The results reveal a 4.94% improvement in accuracy rate for the remote-control commands and 10.75% for the indistinct tone classification words compared with the conventional architecture.
This letter studies feature selection in speaker recognition from an information-theoretic view. We closely tie the performance, in terms of the expected classification error probability, to the mutual information between speaker identity and features. Information theory can then help us to make qualitative statements about feature selection and performance. We study various common features used for speaker recognition, such as mel-warped cepstrum coefficients and various parameterizations of linear prediction coefficients. The theory and experiments give valuable insights in feature selection and performance of speaker-recognition applications.
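The kind of bound that ties classification error to mutual information is typified by Fano's inequality. The sketch below is a standard information-theoretic result, not the letter's exact statement: it shows how the error probability of identifying speaker S from features X is lower-bounded when the mutual information I(S;X) is small.

```latex
% Fano's inequality: S = speaker identity, X = feature vector,
% |S| = number of enrolled speakers, entropies in bits.
H(P_e) + P_e \log_2\bigl(|\mathcal{S}| - 1\bigr) \;\ge\; H(S \mid X) \;=\; H(S) - I(S;X)
% Since the binary entropy satisfies H(P_e) \le 1 bit:
P_e \;\ge\; \frac{H(S) - I(S;X) - 1}{\log_2\bigl(|\mathcal{S}| - 1\bigr)}
```

Intuitively, features that carry more mutual information about the speaker identity loosen this lower bound and permit a smaller classification error.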
Background: Speaker recognition systems play a pivotal role in forensics, security, and biometric authentication, verifying or identifying a speaker within a group of speakers. Methods: This paper introduces a hardware-based speaker recognition system using Mel Frequency Cepstral Coefficients (MFCC), extracted from the input speech signal to linearize the frequency scale at higher frequencies, and perceptron neural networks, which provide layer weights for verifying the speaker's identity against a database of stored speaker identities. Findings: The input speech features are extracted using frame blocking and windowing to reduce noise; the audio samples are stored in RAM, where the sampled data is converted to the frequency domain using an FFT to obtain the cepstral coefficients, which are normalized and fed to the neural network toolbox in MATLAB to obtain layer weights for the given data; the output is then compared with the saved speaker identities to find a match. The decision-making logic runs on the NIOS II processor of an FPGA, where the input features are compared against the existing database of speaker identities using the perceptron layer weights, yielding the nearest match among the group of speakers. The designed system has been tested using two reference speakers, whose spoken vowels are compared with the database of speakers already stored in the FPGA. Conclusion/Improvements: The probability of detecting the speakers is 80%, and speaker verification is more accurate in hardware-based systems than in software-based systems, where the performance factor is lower. Performance can be increased by retraining the neural networks, which can raise speaker detection to nearly 90%.
The credit card payment system is in widespread use and provides customers with the easiest way to pay, but some people misuse another individual's credit card for personal reasons. To detect credit card fraud, a Multiple Semi-Hidden Markov Model is suggested, which gathers multiple observations before the detection phase is executed. Computing good model parameters is important because they affect the detection performance of the Multiple Semi-Hidden Markov Model. This manuscript therefore introduces an innovative technique called the Optimized Multiple Semi-Hidden Markov Model (OMSHMM), which optimizes the model parameters: the Multiple Semi-Hidden Markov Model is used to detect fraudulent users, and the Cuckoo Search algorithm is proposed to optimize the training values. The main intent of this research is to automate the use of the Multiple Semi-Hidden Markov Model, liberating customers from the need for statistical knowledge. The number of states and the model parameters are decided by the Cuckoo Search algorithm. Experimental results show higher accuracy than existing research.
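For readers unfamiliar with Cuckoo Search, the following is a minimal generic sketch of the metaheuristic (Levy-flight moves around the best nest plus abandonment of the worst nests), shown here minimizing a toy sphere function rather than HMM parameters. It is not the paper's implementation, and all constants are illustrative:

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    # Mantegna's algorithm for Levy-distributed step lengths
    if rng is None:
        rng = np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, dim)
    v = rng.normal(0, 1, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(f, dim=2, n_nests=15, n_iter=200, pa=0.25, lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    nests = rng.uniform(lo, hi, (n_nests, dim))
    fit = np.array([f(x) for x in nests])
    for _ in range(n_iter):
        best = nests[np.argmin(fit)]
        # Levy-flight move scaled by distance to the current best nest
        for i in range(n_nests):
            step = 0.01 * levy_step(dim, rng=rng) * (nests[i] - best)
            cand = np.clip(nests[i] + step, lo, hi)
            j = rng.integers(n_nests)          # compare with a random nest
            fc = f(cand)
            if fc < fit[j]:
                nests[j], fit[j] = cand, fc
        # abandon a fraction pa of the worst nests and resample them
        n_drop = int(pa * n_nests)
        worst = np.argsort(fit)[-n_drop:]
        nests[worst] = rng.uniform(lo, hi, (n_drop, dim))
        fit[worst] = [f(x) for x in nests[worst]]
    b = np.argmin(fit)
    return nests[b], fit[b]
```

In the paper's setting, the "position" of a nest would encode the HMM parameter values (and state count) and f would be the model's training objective.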
The performance of speaker identification systems has improved due to recent advances in speech processing techniques, but there is still a need for improvement in text-independent speaker identification and in suitable modelling techniques for voice feature vectors. It becomes difficult to recognize a voice when uncontrollable noise is added to it. In this paper, feature vectors are extracted from speech using Mel-Frequency Cepstral Coefficients, and Vector Quantization is implemented through the Linde-Buzo-Gray algorithm. Two purpose-built speech databases with added noise, recorded at sampling frequencies of 8000 Hz and 11025 Hz, are used to check the accuracy of the developed speaker identification system in non-ideal conditions. Experiments on the databases also show that the number of vectors in the VQ codebook and the sampling frequency significantly influence identification accuracy.
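The Linde-Buzo-Gray training mentioned above can be sketched in a few lines: start from the global centroid, repeatedly split each codeword into a perturbed pair, and refine with Lloyd (k-means) iterations. This is a generic textbook sketch with illustrative codebook sizes, not the authors' implementation:

```python
import numpy as np

def lbg_codebook(features, n_codes=8, eps=0.01, n_iter=20):
    """Linde-Buzo-Gray: grow a VQ codebook by repeated splitting + refinement.

    features: array of shape (n_frames, dim), e.g. MFCC vectors.
    Returns a codebook of shape (n_codes, dim); n_codes should be a power of 2.
    """
    codebook = features.mean(axis=0, keepdims=True)   # start with one centroid
    while len(codebook) < n_codes:
        # split every codeword into a perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                       # Lloyd refinement
            d = np.linalg.norm(features[:, None] - codebook[None, :], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                pts = features[nearest == k]
                if len(pts):                          # keep empty cells unchanged
                    codebook[k] = pts.mean(axis=0)
    return codebook
```

Doubling the codebook at each split is why LBG codebook sizes are conventionally powers of two (8, 16, 32, ...), which is also the parameter the abstract reports as influencing accuracy.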
This paper presents a speaker identification system using cepstral speech features with a discrete hidden Markov model (DHMM). The speaker characteristics carried by the speech signal are captured by the cepstral coefficients. Three commonly used cepstral features, the mel-frequency cepstral coefficient (MFCC), the linear predictive cepstral coefficient (LPCC), and the real cepstral coefficient (RCC), are each employed with the DHMM in the speaker identification system. The performance of the proposed method is compared across the three feature spaces. The experimental results show that the identification accuracy with MFCC is superior to both LPCC and RCC.
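Identification with a DHMM of this kind typically scores each speaker's model on the observed codeword sequence and picks the highest likelihood. Below is a minimal scaled forward-algorithm sketch (generic, with made-up dimensions; not the paper's code):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under a DHMM.

    obs: sequence of symbol indices (e.g. VQ codeword indices of cepstral frames)
    pi:  initial state probabilities, shape (S,)
    A:   state transition matrix, shape (S, S)
    B:   emission matrix, shape (S, V): symbol probabilities per state
    Uses per-step scaling to avoid numerical underflow on long sequences.
    """
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s                      # rescale; accumulate log of the scale
    return loglik
```

Given one trained (pi, A, B) per speaker, the identification decision is simply the argmax over speakers of forward_loglik on the test utterance.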