Article

Speech Analysis for Automatic Speech Recognition

... Several digital signal processing stages were applied to the speech database in order to finally extract several Mel Frequency Cepstral Coefficients (MFCC). MFCC is currently regarded as the best set of speech features, since its representation contains more perceptually relevant information than other features [4], [5]. The natural logarithm of Frame Energy (lnFE) was also used in this research as an additional feature to improve the recognition accuracy [8]. ...
... Human speech is a non-stationary signal that changes over time, whereas speech processing requires a stationary signal in order to obtain good MFCC features [5]. Therefore, frame blocking was used to separate a long speech signal into several frames, each containing a number of speech samples. ...
... Therefore, the Discrete Fourier Transform (DFT) was used to convert the time-domain signal of each frame into the frequency domain, in order to obtain information about the frequencies present in the signal. The DFT formula is given by (5). ...
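The frame blocking and DFT steps described in these excerpts can be sketched as below. This is a minimal NumPy illustration, not the cited paper's code: the 16 kHz rate, 25 ms / 10 ms framing, and the test signal are illustrative assumptions, and the cited Eq. (5) is assumed to be the standard DFT, X[k] = sum_n x[n] exp(-j*2*pi*k*n/N).

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Frame blocking: split a long 1-D signal into overlapping frames.

    frame_len=400 and hop=160 correspond to 25 ms windows with a 10 ms
    shift at 16 kHz; these values are illustrative assumptions.
    """
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def dft(frame):
    """Naive DFT, X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(frame)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return (frame * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

x = np.random.randn(16000)          # stand-in for one second of speech
frames = frame_signal(x)
spectrum = dft(frames[0])           # in practice, np.fft.fft(frames[0])
assert np.allclose(spectrum, np.fft.fft(frames[0]))
```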
Conference Paper
Full-text available
Automatic Speech Recognition (ASR) has been a popular research topic for several years; systems have been created to recognize speech in many languages such as English, Mandarin, Arabic, and Malay. Unfortunately, only a small number of ASR studies have been conducted for Indonesian, especially for isolated digit recognition. This paper presents a simple implementation of ASR for the Indonesian language to recognize Indonesian spoken digits using an Elman Recurrent Neural Network (ERNN). The speech database consists of 1000 digit utterances collected from 20 native speakers. The system is an isolated word recognizer that was trained using 400 utterances, only two fifths of the whole data. From each utterance, 11 Mel Frequency Cepstral Coefficients (MFCC) combined with the natural logarithm of Frame Energy (lnFE) were extracted as the speech features and used as the input to the ERNN. The recognizer was tested in two modes, namely Multi Speaker mode and Speaker Independent mode, and it successfully achieved high accuracies of 99.30% and 95.17%, respectively.
... In order to achieve this goal, the software extracts the Mel Frequency Cepstral Coefficients (MFCC), which contain the power spectrum of the voice to be analyzed. In order to predict the alphanumeric tokens that the individual is saying, the algorithms require a set of features that carries the information of the voice [19]. To obtain the MFCC, the following steps were performed. ...
... Once the microphone has captured the voice of an individual, a filter is used to eliminate the noise produced by the environment where the speaker was talking. This is done using a high-pass Finite Impulse Response (FIR) filter of 3 or 4 kHz [19], [20]. After the application of the FIR filter, a method called windowing is used, which separates signals from different sources so they can be analyzed independently. ...
... After that, each part of the signal transformed to the frequency domain is mapped to the logarithmic domain using a log-based transform, for an analysis of the signal energies. Finally, the MFCC are obtained using the Discrete Cosine Transform (DCT) [19]. ...
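The closing steps of the MFCC pipeline described above, log compression followed by the DCT, can be sketched as follows. This is a minimal illustration that assumes the preceding steps (filtering, windowing, FFT, mel filterbank) are done elsewhere; the array shapes are placeholders, not values from the cited work.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_energies(mel_energies, n_coeffs=13):
    """Final two MFCC steps: log compression, then a type-II DCT.

    `mel_energies` is assumed to be a (n_frames, n_mel_filters) array
    of mel filterbank outputs.
    """
    log_energies = np.log(mel_energies + 1e-10)   # avoid log(0)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]

mel = np.abs(np.random.randn(100, 26))            # placeholder filterbank output
features = mfcc_from_mel_energies(mel)            # shape (100, 13)
```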
Conference Paper
This paper describes a project to create a novel design of a communication tool for individuals with hearing disabilities and speech disorders. It provides a detailed analysis of the engineering and scientific aspects of the system, and the fundamentals taken into account for the social inclusion of such individuals. It also describes a comprehensive study of present and future applications of this technology to provide an enhanced tool for individuals to further improve their communication skills. Morse code is the base on which this new technology is proposed; feedback has been gathered from specialists and individuals with disabilities so that it can be developed in the near future into a new communication tool with robust functionality and an ergonomic design.
... Cutting a signal frame out of a lengthy signal is known as frame blocking [23]. The objective of frame blocking is to decrease the amount of signal data. ...
... Windowing smooths the discontinuities at the edges of a data signal [23]. These discontinuities are a result of the signal being cut during the preceding frame blocking. ...
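A minimal sketch of the windowing step these excerpts describe, using a Hamming taper to suppress the edge discontinuities introduced by frame blocking; the frame here is a placeholder.

```python
import numpy as np

frame = np.random.randn(400)        # one frame produced by frame blocking
window = np.hamming(len(frame))     # tapers both edges toward (near) zero
smoothed = frame * window           # suppresses the edge discontinuities
# Without the taper, the abrupt cut introduced by frame blocking leaks
# spurious high-frequency content (spectral leakage) into the FFT/DFT.
```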
Article
Full-text available
The conducted research proposes a feature extraction and classification combination for use in a tone recognition system for musical instruments. It is expected that by implementing this combination, the tone recognition system will require fewer feature extraction coefficients than previously investigated methods. The proposed combination comprises feature extraction using the discrete cosine transform (DCT) and classification using a support vector machine (SVM). Bellyra, clarinet, and pianica tones were used in the experiment, each representing a tone with one, several, or many major local peaks in the transform domain. Based on the test results, the proposed combination is efficient enough to be used in a tone recognition system for musical instruments: to recognize a tone, it needs as few as eight feature extraction coefficients.
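A rough sketch of the DCT-plus-SVM combination the abstract describes, using scikit-learn. The signals, labels, and preprocessing are illustrative stand-ins for the paper's actual tone data; only the eight-coefficient cut follows the reported result.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import SVC

def dct_features(signal, n_coeffs=8):
    """Keep only the first few DCT coefficients as the feature vector;
    the abstract reports that as few as eight suffice per tone."""
    return dct(signal, type=2, norm='ortho')[:n_coeffs]

# Placeholder data standing in for bellyra/clarinet/pianica recordings.
rng = np.random.default_rng(0)
signals = rng.standard_normal((60, 2048))
labels = np.repeat([0, 1, 2], 20)             # three instrument classes

X = np.array([dct_features(s) for s in signals])
clf = SVC(kernel='rbf').fit(X, labels)
print(clf.predict(X[:3]))
```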
... Frame blocking is a process of getting a frame of the data signal from a long data signal [10]. The purpose of using frame blocking is to reduce the amount of signal data to be processed. ...
... Windowing is a process of reducing the discontinuities that appear at the edges of the signal [10]. This reduction is necessary to reduce the emergence of harmonic signals that appear after the FFT process. ...
Article
This paper proposes a feature extraction method for chord recognition which yields fewer feature extraction coefficients than previous works. The proposed feature extraction method is segment averaging with SHPS (Simplified Harmonic Product Spectrum) and logarithmic scaling. The chords used in developing the proposed feature extraction were guitar chords. In more detail, the proposed method basically works as follows. Firstly, the input signal is transformed using the FFT (Fast Fourier Transform). Secondly, the left portion of the transformed signal is processed in succession using SHPS, logarithmic scaling, and segment averaging. The output of the segment averaging is the result of the proposed feature extraction. Based on the test results, the proposed feature extraction is quite efficient for use in chord recognition, since it requires as few as eight coefficients to represent each chord.
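The pipeline in this abstract (FFT, left half, SHPS, logarithmic scaling, segment averaging) might look roughly like the sketch below. Note that a plain harmonic product spectrum is used as a stand-in, since the paper's exact simplification (SHPS) is not given here, and the signal is a placeholder.

```python
import numpy as np

def hps(mag_spec, n_harmonics=3):
    """Harmonic product spectrum: multiply the spectrum by its
    downsampled copies. A stand-in for the paper's simplified
    variant (SHPS), whose exact definition is not given here."""
    out = mag_spec.copy()
    for h in range(2, n_harmonics + 1):
        ds = mag_spec[::h]
        out[:len(ds)] *= ds
    return out

def segment_average(v, n_segments=8):
    """Average the vector over equal segments: one coefficient each."""
    return np.array([seg.mean() for seg in np.array_split(v, n_segments)])

x = np.random.randn(4096)                 # placeholder guitar-chord frame
left = np.abs(np.fft.fft(x))[:len(x)//2]  # left half of the FFT magnitude
features = segment_average(np.log(hps(left) + 1e-10))   # 8 coefficients
```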
... Frame blocking is a process of taking a frame of the signal from a long series of signals [13]. The purpose of frame blocking is to reduce the amount of signal data to be processed. ...
... Windowing is the process of reducing discontinuities at the edges of the signal [13]. This reduction is necessary in order to reduce the appearance of harmonic signals after the signal is transformed using FFT. ...
Article
A feature extraction method for musical instrument tones based on a transform domain approach is proposed in this paper. The aim of the proposed feature extraction is to obtain fewer feature extraction coefficients. In general, the proposed feature extraction was carried out as follows. Firstly, the input signal was transformed using the FFT (Fast Fourier Transform). Secondly, the left half of the transformed signal was divided into a number of segments. Finally, the averages of those segments formed the feature extraction of the input signal. Based on the test results, the proposed feature extraction was highly efficient for tones which have many significant local peaks in the Fourier transform domain, because it required as few as four feature extraction coefficients to represent every tone.
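A minimal sketch of the segment-averaging feature extraction this abstract describes, paired with a nearest-template classifier of the kind commonly used alongside it. The four-segment setting follows the "at least four coefficients" result; the template construction and tone names are assumptions, not the paper's protocol.

```python
import numpy as np

def segment_average_features(signal, n_segments=4):
    """FFT, keep the left half of the magnitude spectrum, split it
    into segments, and average each segment into one coefficient."""
    half = np.abs(np.fft.fft(signal))[: len(signal) // 2]
    return np.array([seg.mean() for seg in np.array_split(half, n_segments)])

def nearest_template(feature, templates):
    """Assign the label of the closest stored template (illustrative
    companion classifier; the paper's scheme may differ)."""
    return min(templates, key=lambda name: np.linalg.norm(feature - templates[name]))

# Placeholder templates: per-class mean features from some training tones.
templates = {name: np.abs(np.random.randn(4)) for name in ("c4", "d4", "e4")}
print(nearest_template(segment_average_features(np.random.randn(2048)), templates))
```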
... Cepstral Analysis and Mel-frequency Cepstral Coefficients (MFCC) are well-known methods widely used for speech recognition. They depend on obtaining the PSD and extracting the envelope of the speech features using the DCT [2]. ...
... It can be expressed mathematically as in equation (2), where y_normalized is the normalized value of the acoustic signal. ...
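The normalization the snippet refers to is not spelled out here; a common peak-amplitude normalization of this form is sketched below as an assumption, not the paper's definition.

```python
import numpy as np

def normalize(y):
    """Peak-amplitude normalization, y_normalized = y / max(|y|).
    Assumed form; the cited equation itself is not shown here."""
    return y / np.max(np.abs(y))
```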
Article
Full-text available
Detection and classification of air targets (ATs) is commonly performed using radar, and ATs have become an essential target for detection and classification. However, acoustic signal processing is one of the well-known techniques used to passively detect low-flying ATs. The peak of the Power Spectral Density (PSD) and its corresponding frequency are considered the main features for AT recognition. The presented method for AT detection and classification uses the Discrete Cosine Transform (DCT) to extract AT features. It has many advantages over traditional radar techniques: it is characterized by simplicity, low cost, and low CPU processing load. It has been tested using recordings of real AT acoustic signals, so that the test reflects real conditions. These real ATs are characterized by PSD, frequency, mean, median, entropy, and variance, and all of these features have been extracted using the DCT method. The DCT proved to be more effective than the Discrete Fourier Transform for AT classification.
... Figure 6 shows the steps for converting a time-domain signal to a frequency-domain signal. The conversion starts with the frame blocking step [11]. At this step, a number of data points (also called a data frame) are taken from a sequence of incoming data. ...
... The windowing step in Figure 6 is carried out in order to reduce the discontinuity at the edges of the signal. This reduction is necessary in order to reduce the appearance of harmonic signals in the FFT step [11]. The windowing step used a Hamming window [12], which is a kind of simple bell-shaped window. ...
Article
Full-text available
When an electrical machine suffers a mechanical fault, it generally emits certain sounds. These sounds come from vibration. Therefore, based on the vibration, it can be detected whether there is a mechanical fault in an electrical machine. This paper discusses the graphical display of the vibration of electrical machines, in the form of household water pumps which were in good condition or had a faulty bearing, faulty impeller, or faulty foot valve. Vibration can be displayed in the time domain or in the frequency domain, using the three axes X, Y, and Z. In the frequency domain, the vibration can be displayed at various frequency resolutions. Based on the observations, the higher the frequency resolution, the less detail is shown in the graphical display of the frequency domain. Despite this lower detail, at a frequency resolution of 11.7 Hz on the X axis, water pumps in good condition or with a faulty bearing, faulty impeller, or faulty foot valve could be visually distinguished.
... 1) Pre-emphasis: To flatten the speech spectrum, a pre-emphasis filter is used before spectral analysis. As a result, the high-frequency part of the speech signal, which is suppressed by the human voice production mechanism, is compensated [9]. Pre-emphasis is performed with Equation (1). ...
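A minimal sketch of the pre-emphasis filter described above. The first-order form and the coefficient 0.97 are the conventional choices and are assumed here, since the snippet's Equation (1) is not shown.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis filter, y[n] = x[n] - alpha * x[n-1].

    Boosts the high-frequency part of the spectrum suppressed by the
    human voice production mechanism; alpha = 0.97 is a typical choice.
    """
    return np.append(x[0], x[1:] - alpha * x[:-1])
```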
Conference Paper
Full-text available
Blindfold chess is of particular interest in research on the memory structure and limits of the human brain. In blindfold chess, the player cannot see the board, so the player visualizes the situation and makes his moves aloud. In this study, the first step of recognizing voice commands for blindfold chess, word-based detection and classification of chess piece vocalizations, is emphasized. Mel frequency coefficients and mel spectrograms have been used as the feature vectors for the audio data. The classification of these vectors has been done using artificial neural networks. As a result of the tests, 99% success has been obtained in noisy environments.
... Some key stages of the MFCC process can be seen in Figure 1. The MFCC feature extraction process starts with pre-emphasis, which compensates for the speech signal at high frequencies [15]. The mathematical equation for the pre-emphasis process can be seen in equation (1). ...
Article
Full-text available
The research on speech recognition systems currently focuses on the analysis of robust speech recognition systems. When speech signals are combined with noise, the recognition system becomes distracted and struggles to identify the speech sounds. Therefore, the development of robust speech recognition systems continues. The principle of a robust speech recognition system is to eliminate noise from the speech signals and restore the original information signal. In this paper, the researchers conducted a frequency-domain analysis of one stage of the Mel Frequency Cepstral Coefficients (MFCC) process, the Fast Fourier Transform (FFT), in a children's speech recognition system. The FFT analysis in the feature extraction process determined the effect of the frequency characteristics utilized in the FFT output on noise disruption. The analysis method was designed as three scenarios based on the number of FFT points employed. The differences between the scenarios were based on how the FFT points were divided: all FFT points were divided into four, three, and two parts in the first, second, and third scenarios, respectively. This study utilized children's speech data from the isolated TIDIGIT English digit corpus. As comparative data, noise was added manually to simulate real-world conditions. The results showed that using a particular frequency portion, following the scenario designed for MFCC, affected the recognition system performance, with a relatively significant effect on the noisy speech data. The method designed in the scenario 3 (C1) version generated the highest accuracy, exceeding the accuracy of the conventional MFCC method. The average accuracy of the scenario 3 (C1) method increased by 1% on average over all the tested noise types. Using various noise intensity values (SNR), the testing process indicated that scenario 3 (C1) generates higher accuracy than conventional MFCC at all tested SNR values. This proves that the selection of the specific frequencies utilized in MFCC feature extraction significantly affects recognition accuracy for noisy speech.
... The filters have linearly spaced frequencies and a fixed frequency range on the Mel scale, as shown in Eq. (5) [6,11]. ...
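The mel-scale mapping behind such filterbanks is conventionally the formula below; this standard form is assumed for the cited Eq. (5), and the 26-filter count and 0-8000 Hz range are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale mapping (assumed form of the cited Eq. (5))."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filter center frequencies linearly spaced on the mel scale, as the
# snippet describes: equal mel spacing over a fixed frequency range.
centers_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26)
centers_hz = mel_to_hz(centers_mel)
```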
Conference Paper
Blind people cannot easily communicate with mute people, because a blind person cannot see gestures, while mute people can only express their thoughts through visual gestures. If the voices of blind people can be converted into gestures, communication between blind and mute people becomes easy. In this paper, a system is proposed to convert the voices of Persian letters into gestures (pictures corresponding to Persian letters). The proposed system gathered voices of Persian letters from five different people. This data contains five voices for each letter (160 voices in total), since there are 32 letters in the Persian language. The proposed system is divided into two sections: the first part is for training, and the second part is for testing. The features were extracted using MFCC and classified using Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). The final accuracy of the suggested system is 96.875%.
... Frame blocking is the process of cutting a small portion of a signal out of a long signal [13]. Basically, accurate tone information can already be obtained from a small portion of the tone signal. ...
... Frame blocking is the process of taking a short signal from a long signal [19]. This process is needed in order to reduce the amount of signal data from the input. ...
Article
The sampling frequency for musical instrument tone recognition generally follows the Shannon sampling theorem. This paper explores the influence of sampling frequencies that do not follow the Shannon sampling theorem in a tone recognition system using segment averaging for feature extraction and template matching for classification. The musical instruments used were bellyra, flute, and pianica, each representing an instrument that has one, a few, or many significant local peaks in the Discrete Fourier Transform (DFT) domain. Based on our experiments, down to a sampling frequency as low as 312 Hz, the recognition rate for bellyra and flute tones was only slightly affected, dropping in the range of 5%, while the recognition rate for pianica tones was not affected at all. Therefore, if that kind of reduced recognition rate is acceptable, a sampling frequency as low as 312 Hz can be used for tone recognition of musical instruments.
... In order to get into the domain of speech recognition, a brief introduction to how the speech signal is generated and recognized by the human system can be considered a starting point. Figure 1 shows the process from human speech production to human speech recognition, between the speaker and the listener [3]. The translation of spoken words into text is called speech recognition in electronics engineering; it is also called "computer speech recognition", "automatic speech recognition" (ASR), or sometimes just "speech to text" (STT). ...
... So, the mel scale can model the sensitivity of the human ear more closely than a purely linear scale, and provides greater discriminatory capability between speech segments [16]. ...
Article
A new feature extraction method for speech recognition is presented in this paper, using a combination of the discrete wavelet transform (DWT) and mel frequency cepstral coefficients (MFCCs). The objective of this method is to enhance performance by introducing more features from the signal. The performance of the wavelet-based MFCC method is compared to the MFCC-based method for feature extraction. The wavelet transform is applied to the speech signal, where the input speech signal is decomposed into various frequency channels using the properties of the wavelet transform; then the MFCCs of the wavelet channels are calculated. A new set of features can be generated by concatenating both feature sets. The speech signals are sampled directly from the microphone. Neural networks (NN) are used in the proposed methods for classification. The proposed method is implemented for 15 male speakers uttering 10 isolated words each, the digits from zero to nine; each digit is repeated 15 times.
... Although there is a correlation, it is still unclear why these parameters were chosen. The researchers did not state the basis for performing that particular statistical analysis, nor which parameters constituted the changes in the heart rate [23]. ...
Article
Full-text available
A non-invasive method for monitoring heart activity can help to reduce the deaths caused by heart disorders such as stroke, arrhythmia, and heart attack. The human voice can be considered biometric data that can be used for estimating the heart rate. In this paper, we propose a method for dynamically estimating the heart rate from human speech using voice signal analysis and the development of an empirical linear predictor model. The correlation between the voice signal and the heart rate is established by classifiers, and prediction of heart rates with or without emotions is done using linear models. The prediction accuracy was tested using data collected from 15 subjects, about 4050 samples of speech signals and corresponding electrocardiogram samples. The proposed approach can be used for early non-invasive detection of heart rate changes that can be correlated to the emotional state of the individual, and also as a tool for diagnosing heart conditions in real-time situations.
... MFCC is one of the most effective feature parameters in speech recognition. Moreover, it is based on the human ear's nonlinear frequency characteristics and has a high recognition rate in practical applications [5], [7]. We use the MFCC feature extraction method to compute 39 features consisting of 13 mel-scaled cepstral, 13 delta, and 13 delta-delta features from each frame. ...
... The MFCC feature extraction technique [4] was inspired by voice recognition systems, which also process analogue signals [7], [2], and consists of the following steps: ...
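The 39-dimensional vector mentioned in the first excerpt (13 cepstral + 13 delta + 13 delta-delta features) is typically built with the standard delta regression formula; a minimal sketch with placeholder MFCC input follows.

```python
import numpy as np

def deltas(feat, N=2):
    """Delta features via the standard regression formula
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2),
    with edge frames padded by repetition."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    T = feat.shape[0]
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

mfcc = np.random.randn(100, 13)        # placeholder 13 cepstral features
d1 = deltas(mfcc)                      # 13 delta features
d2 = deltas(d1)                        # 13 delta-delta features
features = np.hstack([mfcc, d1, d2])   # the 39-dimensional vector
```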
Conference Paper
This paper presents a new application for existing classification techniques. A robotic worm device being developed for human endoscopy, fitted with a 3-axis accelerometer, was driven over a variety of surfaces, and the accelerometer data was used to identify which surface the robot worm found itself on. Within the Weka environment, three available classifiers, J48, LIBSVM, and Perceptron, were tested with both Fast Fourier Transform (FFT) and Mel-Frequency Cepstral Coefficient (MFCC) extraction techniques, with frame sizes of 0.5 and 2 seconds. The highest testing accuracy demonstrated for this surface classification was 83%. It is hoped that this machine learning will improve the operational use of the robot, with the system identifying surface types and, later, surface properties of hard-to-reach anatomical regions, both for locomotive efficiency and for medical information.
... from the FalaBrasil group at UFPA, with 200 sentences, 2,062 words, a vocabulary of 1,064 distinct words, and 15 minutes and 51 seconds of recordings made in an uncontrolled environment of the voices of 10 women, each of approximately the same duration. In this case, material from the VoxForge site was also added, with 180 sentences, 855 words, a vocabulary of 351 distinct words, and 9 minutes and 30 seconds of recordings made in an uncontrolled environment of the voices of 6 women of variable duration, after removing the recordings in European Portuguese and the recordings that were unintelligible (excessively low audio level, or excessively high noise or distortion). Most of the acoustic model training settings were kept at the CMU Sphinx defaults, except for the following item: LDA/MLLT. By default, CMU Sphinx's digital voice signal processing uses a vector of MFCC (Mel-Frequency Cepstral Coefficients) parameters, which takes the first 12 coefficients of the DCT of the logarithm of the power spectrum on the Mel frequency scale (a subjectively linear frequency scale), plus one coefficient representing the average energy of the signal, in addition to the first and second derivatives of these 13 coefficients, called dynamic coefficients, "delta", or velocity and acceleration vectors, which help characterize coarticulatory effects, forming a parameter or feature vector with 39 coefficients [11], [14], [15], [16]. The parameter or feature vector is used in the acoustic model to recognize phonetic patterns. These parameters can be optimized using a linear transformation that improves the separability between the patterns to be recognized, which has a positive impact on the accuracy of the system. ...
... Figure 12 compares the overall accuracies. Figure 13 presents the result of generated speech for the sentences using the Matlab code mfcc2spectrum [10]. ...
Article
Full-text available
This paper describes a Hidden Markov Model-based Punjabi text-to-speech synthesis system (HTS), in which the speech waveform is generated from Hidden Markov Models themselves, and applies it to Punjabi speech synthesis using the general speech synthesis architecture of HTK (HMM Tool Kit). This Hidden Markov Model-based TTS can be used in mobile phones for the stored phone directory or messages. Text messages and the caller's identity in the English language are mapped to tokens in the Punjabi language, which are further concatenated to form speech with certain rules and procedures. To build the synthesizer we recorded the speech database and phonetically segmented it, first extracting context-independent monophones and then context-dependent triphones. For example, for the word bharat the monophones are a, bh, t, etc., and the triphones are bh-a+r. These speech utterances and their phone-level transcriptions (monophones and triphones) are the inputs to the speech synthesis system. The system outputs the sequence of phonemes after resolving various ambiguities regarding the selection of phonemes using word
... Dynamic Time Warping is an algorithm for pattern matching, and it also has a non-linear time normalization effect [8]. The basic concept of DTW is derived from Bellman's principle of optimality. ...
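A minimal sketch of dynamic time warping as characterized in the excerpt: each cell of the cumulative-cost table extends the cheapest of the allowed predecessor paths, per Bellman's principle. The feature sequences here are placeholders.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences (rows = frames)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Bellman recursion: cheapest of the three predecessors.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref = np.random.randn(40, 13)     # template utterance (e.g., MFCC frames)
test = np.random.randn(55, 13)    # input utterance of different length
print(dtw_distance(ref, test))
```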
Article
Full-text available
Underwater sound classification presents a unique challenge due to the complex propagation characteristics of sound in water, including absorption, scattering, and refraction. These complexities can distort and alter spectral features, hindering the effectiveness of traditional feature extraction methods for vessel classification. To address this challenge, this study proposes a novel feature extraction method that combines Mel-frequency cepstral coefficients (MFCCs) with a spectral dynamic feature (SDF) vector. MFCCs capture the spectral content of the audio signal, whereas SDF provides information on the temporal dynamics of spectral features. This combined approach aims to achieve a more comprehensive representation of underwater vessel sounds, potentially leading to improved classification accuracy. Validation with real-world underwater audio recordings demonstrated the effectiveness of the proposed method. Results indicated an improvement of up to 94.68% in classification accuracy when combining SDF with several classical extractors evaluated. This finding highlights the potential of SDF in overcoming the challenges associated with underwater sound classification.
Chapter
Creating a system that can hear and respond accurately like a human is one of the most critical issues in human-computer interaction. This inspired the creation of the automatic speech recognition system, which uses efficient feature extraction and selection techniques to distinguish between different classes of speech signals. In order to improve ASR (automatic speech recognition), the authors present a new feature extraction method in this study, based on modified MFCC (mel frequency cepstral coefficients) using the lifting wavelet transform (LWT). The effectiveness of the proposed approach is verified using the datasets of the ATSSEE Research Unit "Analysis and Processing of Electrical and Energy Signals and Systems." Experimental investigations have been carried out to demonstrate the practical viability of the proposed approach. Numerical and experimental studies concluded that the proposed approach is capable of detecting and localizing multiple under varying environmental conditions with noise-contaminated measurements.
Article
The ability to speak lucidly plays a key role in social relations. Consequently, the role of the larynx is quite important, and timely diagnosis of laryngeal diseases has proved to be crucial. In this study, a simple computational model for the inverse of the speech production model is employed to extract the glottal waveform from the speech signal. This waveform has useful information about vocal fold performance, providing evidence for distinguishing pathological disorders. Furthermore, obtaining the significance of classification results is important, because it leads to reliable inferences. This study utilizes the sustained vowel sound /a/ and a well-referenced database, namely MEEI. In this work, after extraction of six discriminating features using appropriate signal modeling and processing methods, and after a change of the feature space using Kernel Principal Component Analysis (KPCA), a classifier consisting of Naïve Bayes and Fisher Linear Discriminant is applied to the feature sets. Regarding voice pathology detection, the proposed approach achieved a significant balanced classification accuracy of 93.6% ± 0.03 with p-value < 0.01 for normal/abnormal classification, using the Beta distribution model for the posterior distribution of the average of the cross-validation results. The proposed features are also compared with some conventional features in this field. The results show significantly improved performance for the proposed features in discriminating different types of pathological voices.
Article
Full-text available
Filled pauses and elongations are two types of speech disfluencies that need more suitable acoustic features to be classified correctly, since they are often misclassified. This work concentrates on developing accurate and robust energy feature extraction for modelling filled pauses and elongations by investigating different energy features using the local maxima points of the speech energy. Method: In this paper, we extracted peak values from each frame of a voiced signal, implementing different thresholding techniques, to classify filled pauses and elongations. These energy features are evaluated using a statistical naïve Bayes classifier to assess their contribution to the classification process. Various samples of sustained syllables and filled pauses of spontaneous speech were extracted from the Malaysian Parliamentary Debate Database of the year 2008. A naïve Bayes classifier was used. We performed an F-measure evaluation to investigate significant differences in the means of the filled pause and elongation samples. Results: The results revealed that our proposed LM-E increased classification performance up to 71% and 75% F-measure for elongation and filled pause, respectively. Conclusion: The best accuracies achieved in both filled pause and elongation classification varied depending on the type of thresholding technique applied during the extraction of the local maxima of the speech energy. The most effective thresholding technique is our proposed one, which uses an adaptive height threshold to extract the local maxima of the speech energy (LM-E).
Article
Full-text available
This paper addresses the problem of early diagnosis of PD (Parkinson's disease) by the classification of characteristic features of a person's voice, given that 90% of people with PD suffer from speech disorders. We collected 375 voice samples from healthy people and people suffering from PD. From each voice sample we extracted features using the MFCC and PLP cepstral techniques. All the features were analyzed and selected by feature selection algorithms to classify the subjects into 4 classes according to the UPDRS (Unified Parkinson's Disease Rating Scale) score. The advantages of our approach are its results and the simplicity of the technique used, so it could also be extended to other voice pathologies. As the classifier we used discriminant analysis, based on the results obtained in previous multiclass classification works. We obtained accuracy of up to 87.6% for discrimination between PD patients in 3 different stages and healthy controls using MFCC along with the LLBFS algorithm.
Conference Paper
Full-text available
The conversion of a speech signal into a useful message (its corresponding text) is called automatic speech recognition. As a pattern recognition application, automatic speech recognition requires the extraction of useful features from its input signal. In this paper, several acoustic analyses for extracting the features are described. Speech production and perception are reviewed. The classical front-end analysis in speech recognition is a spectral analysis which parameterizes the speech signal into feature vectors; the most popular set of them is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate, which is first subjected to a log-based transform of the frequency axis (the mel frequency scale) and then decorrelated using a modified discrete cosine transform. Other speech analysis techniques, such as the short-time Fourier transform (STFT), linear predictive analysis, and dynamic feature extraction, are also discussed. In linear predictive analysis, the speech sample is approximated as a linear combination of past speech samples. MFCC is the most widely used.
Conference Paper
In this paper, we introduce a new approach to recognizing discrete speech, specifically pre-assumed words. Our approach is mainly based on Principal Component Analysis (PCA) and Neural Networks (NN). To do so, we initially build a database provided by 20 speakers who uttered each predefined word 5 times, covering 10 Persian words overall. We then apply Voice Activity Detection (VAD) to eliminate the useless portions of each frame, compute the Mel Frequency Cepstral Coefficients (MFCCs), which are the useful features in the recognition process, and apply PCA to reduce the size of our data set, successfully providing the inputs for the NN block. Using PCA enables us to feed lower-dimensional inputs to our recognition system, an important feature of our approach, speeding up the training procedure while keeping the accuracy as high as possible. In other words, PCA decreases the amount of computation usually required in most recognition systems. We use 90% of our data set to train our algorithm and the remaining 10% to test it and measure the accuracy of the recognition process.
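The PCA-based input reduction the abstract describes could look roughly like this scikit-learn sketch; the matrix sizes and component count are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder MFCC feature matrix: 1000 utterance-level vectors of
# dimension 260 (e.g., 20 frames x 13 coefficients, flattened).
X = np.random.randn(1000, 260)

pca = PCA(n_components=40)        # keep 40 principal components
X_reduced = pca.fit_transform(X)  # smaller inputs for the NN stage
print(pca.explained_variance_ratio_.sum())
```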
Article
Full-text available
The Arabic language has a slightly different pronunciation from Indonesian, so learning it takes a long time. In Arabia itself, there are variants in the pronunciation of the Arabic language, or dialects. A dialect is the language and letters used by a particular group of people within a community, which creates differences between readings and even greetings. Indonesian speakers have a different dialect from native speakers. This study analyzed the suitability of Arabic recitation by Indonesian speakers using Linear Predictive Coding extraction techniques. A text produces different speech patterns, and this also happens when the text is spoken by a speaker whose mother tongue it is not. The training data in this study used the voices of Arabic speakers. The extracted features were classified using a Hidden Markov Model. In the classification using the Hidden Markov Model, the voice signal is analyzed and the maximum possible value that can be recognized is sought. The parameters obtained from the modeling are used for comparison with the sounds of Arabic speakers. From the test results, classification with Hidden Markov Models and Linear Predictive Coding extraction achieved an average accuracy of 78.6% for test data with a sampling frequency of 8,000 Hz, 80.2% at 22,050 Hz, and 79% at 44,100 Hz.
Article
Full-text available
Speech recognition is a system for transforming spoken words into text. Human voice signals have very high variability. Speech signals from different pronunciations of a text also result in distinctive speech patterns. This happens even more when the text is spoken by a speaker whose mother tongue it is not, for example, Arabic words spoken by an Indonesian speaker. In this study, Mel Frequency Cepstral Coefficient (MFCC) feature extraction techniques were explored for voice recognition of Arabic words spoken by Indonesian speakers, with training data from native Arabic speakers. Furthermore, the extracted features were classified using a Hidden Markov Model (HMM). HMM is a form of sound modeling in which the voice signal is analyzed and the maximum probability value that can be recognized is sought; from the modeling results, parameters are obtained that are then used in the word recognition process. The recognized word is the word with the maximum suitability. The system produces an average accuracy of 83.1% for test data with a sampling frequency of 8,000 Hz, 82.3% at 22,050 Hz, and 82.2% at 44,100 Hz.
Conference Paper
The objective of this work is to study the issues involved in building an automatic query word retrieval system for broadcast news in an unsupervised framework, i.e., without using any labelled speech data. In the absence of labelled data, the sequence of feature vectors extracted from the query word has to be matched with those extracted from the test utterance. This is a non-trivial task, as typical feature vectors like Mel-frequency cepstral coefficients (MFCC) carry both speech-specific and speaker-specific information. In this work, we have employed Gaussian mixture models (GMM) to extract speaker-independent features from the speech signal. A Gaussian mixture model, trained on a large amount of speech data, is used to derive posterior features for each frame of the speech signal. The sequences of posterior features are matched using a dynamic time warping algorithm to detect the presence of the query word in the test utterance. The performance of the proposed method is evaluated on a Telugu broadcast news database. It is observed that the posterior features extracted from the GMM are better suited for query word retrieval than the MFCC features.
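A rough sketch of the GMM posterior ("posteriorgram") features the abstract describes, using scikit-learn; the frame counts, dimensionalities, and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train a GMM on unlabelled speech frames, then represent each frame
# by its vector of component posteriors (speaker-independent features).
frames = np.random.randn(5000, 39)              # MFCC+deltas, unlabelled
gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(frames)

query = np.random.randn(40, 39)                 # frames of the query word
post = gmm.predict_proba(query)                 # (40, 64) posterior features
# These posterior sequences would then be compared with dynamic time warping.
```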
Article
Full-text available
Automatic recognition of isolated spoken digits is one of the most challenging tasks in the area of Automatic Speech Recognition. In this paper, database development and automatic speech recognition of isolated Pashto spoken digits from Sefer (0) to Naha (9) are presented. 50 individual Pashto native speakers (25 male and 25 female) of different ages, ranging from 18 to 60 years, were asked to utter the digits from Sefer (0) to Naha (9) separately. A Sony PCM-M10 linear recorder was used for recording, in offices and homes in a noise-free environment. Adobe Audition version 1.0 was used to split the audio into individual digits, and the result was saved in .wav format. Mel frequency cepstral coefficients were used to extract the speech features. A k-nearest-neighbor classifier is used, for the first time in the Pashto language to the authors' knowledge, to classify the speech features, and its accuracy is compared with linear discriminant analysis. The experimental results were evaluated, and an overall average recognition accuracy of 76.8% was obtained.
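An illustrative scikit-learn comparison of the two classifiers named in the abstract, on placeholder feature vectors; the shapes and train/test split are assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder MFCC vectors, one flattened vector per digit utterance.
X = np.random.randn(500, 130)                   # 50 speakers x 10 digits
y = np.tile(np.arange(10), 50)                  # digit labels 0..9

knn = KNeighborsClassifier(n_neighbors=5).fit(X[:400], y[:400])
lda = LinearDiscriminantAnalysis().fit(X[:400], y[:400])
print(knn.score(X[400:], y[400:]), lda.score(X[400:], y[400:]))
```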
Conference Paper
Full-text available
We present a method to derive mel-frequency cepstral coefficients directly from the power spectrum of a speech signal. We show that omitting the filterbank in signal analysis does not affect the word error rate. The presented approach simplifies the speech recognizer's front end by merging subsequent signal analysis steps into a single one. It avoids possible interpolation and discretization problems and results in a compact implementation. We show that frequency warping schemes like vocal tract normalization can be integrated easily into our concept without additional computational effort. Recognition test results obtained with the RWTH large-vocabulary speech recognition system are presented for two different corpora: the German VerbMobil II dev99 corpus and the English North American Business News 94 20k development corpus.
Article
Full-text available
The future commercialization of speaker- and speech-recognition technology is impeded by the large degradation in system performance due to environmental differences between training and testing conditions, known as the "mismatched condition." Studies have shown [1] that most contemporary systems achieve good recognition performance if the conditions during training are similar to those during operation (matched conditions). Frequently, mismatched conditions are present, in which the performance is dramatically degraded as compared to the ideal matched conditions. A common example of this mismatch is when training is done on clean speech and testing is performed on noise- or channel-corrupted speech. Robust speech techniques [2] attempt to maintain the performance of a speech processing system under such diverse conditions of operation. This article presents an overview of current speaker-recognition systems and the problems encountered in operation, and it focuses on the front-end feature extraction process of robust speech techniques as a method of improvement. Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described. Also described is the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
Article
Almost all present-day continuous speech recognition (CSR) systems are based on hidden Markov models (HMMs). Although the fundamentals of HMM-based CSR have been understood for several decades, there has been steady progress in refining the technology both in terms of reducing the impact of the inherent assumptions, and in adapting the models for specific applications and environments. The aim of this chapter is to review the core architecture of an HMM-based CSR system and then outline the major areas of refinement incorporated into modern systems.
Article
This work proposes a novel method of predicting formant frequencies from a stream of mel-frequency cepstral coefficient (MFCC) feature vectors. Prediction is based on modelling the joint density of MFCC vectors and formant vectors using a Gaussian mixture model (GMM). Using this GMM and an input MFCC vector, two maximum a posteriori (MAP) prediction methods are developed. The first method predicts formants from the closest, in some sense, cluster to the input MFCC vector, while the second method takes a weighted contribution of formants from all clusters. Experimental results are presented using the ETSI Aurora connected digit database and show that the predicted formant frequency is within 3.25% of the reference formant frequency, as measured from hand-corrected formant tracks.
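The second (weighted, all-clusters) prediction method can be sketched from the stated GMM joint-density model: the formant estimate is a posterior-weighted sum of per-cluster conditional means. A minimal NumPy/scikit-learn illustration with placeholder data and sizes; the paper's exact formulation may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

dx, dy = 13, 3                                   # MFCC dim, formant dim
joint = np.random.randn(5000, dx + dy)           # placeholder [x; y] data
gmm = GaussianMixture(n_components=8, covariance_type='full').fit(joint)

def predict_formants(x):
    """Weighted MAP-style prediction: E[y|x] under the joint GMM."""
    resp = np.zeros(gmm.n_components)
    cond = np.zeros((gmm.n_components, dy))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :dx], gmm.means_[k, dx:]
        S = gmm.covariances_[k]
        Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
        diff = x - mu_x
        # Unnormalized Gaussian weight of cluster k given x alone.
        resp[k] = gmm.weights_[k] * np.exp(
            -0.5 * diff @ np.linalg.solve(Sxx, diff)
        ) / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        # Conditional mean E[y | x, cluster k] (linear regression per cluster).
        cond[k] = mu_y + Syx @ np.linalg.solve(Sxx, diff)
    resp /= resp.sum()
    return resp @ cond                            # weighted over all clusters

print(predict_formants(np.random.randn(dx)))
```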
Article
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.
Conference Paper
The most popular set of parameters used in recognition systems is the mel frequency cepstral coefficients. While giving generally good results, it remains that the filtering process, as used in the evaluation of these parameters, reduces the signal resolution in the frequency domain, which can have some impact in discriminating between phonemes. This paper presents a new parameterization approach that preserves most of the characteristics of mel frequency cepstral coefficients while maintaining the initial frequency resolution obtained from the fast Fourier transform. It is shown, by the results obtained, that this technique can significantly increase the performance of a recognition system
Article
The focus of a continuous speech recognition process is to match an input signal with a set of words or sentences according to some optimality criteria. The first step of this process is parameterization, whose major task is data reduction by converting the input signal into parameters while preserving virtually all of the speech signal information dealing with the text message. This contribution presents a detailed analysis of a widely used set of parameters, the mel frequency cepstral coefficients (MFCCs), and suggests a new parameterization approach taking into account the whole energy zone in the spectrum. Results obtained with the proposed new coefficients give a confidence interval about their use in a large-vocabulary speaker-independent continuous-speech recognition system
Article
The properties and interrelationships among four measures of distance in speech processing are theoretically and experimentally discussed. The root mean square (rms) log spectral distance, cepstral distance, likelihood ratio (minimum residual principle or delta coding (DELCO) algorithm), and a cosh measure (based upon two nonsymmetrical likelihood ratios) are considered. It is shown that the cepstral measure bounds the rms log spectral measure from below, while the cosh measure bounds it from above. A simple nonlinear transformation of the likelihood ratio is shown to be highly correlated with the rms log spectral measure over expected ranges. Relationships between distance measure values and perception are also considered. The likelihood ratio, cepstral measure, and cosh measure are easily evaluated recursively from linear prediction filter coefficients, and each has a meaningful and interrelated frequency domain interpretation. Fortran programs are presented for computing the recursively evaluated distance measures.
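A small sketch of the recursive evaluation the abstract mentions: cepstral coefficients obtained directly from LP filter coefficients, plus the truncated cepstral distance. The predictor sign convention (x[n] approximated by sum_k a_k x[n-k]) is assumed.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of an all-pole LP model via the standard recursion
    c_n = a_n + sum_{k=1..n-1} (k/n) * c_k * a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

def cepstral_distance(c1, c2):
    """Truncated cepstral distance, which (up to a scale factor) bounds
    the rms log spectral distance from below, per the article."""
    return np.sqrt(np.sum((c1 - c2) ** 2))
```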
McCree, A. V. & Barnwell III, T. P. (1991). A New Mixed Excitation LPC Vocoder. Georgia Institute of Technology, Atlanta: 593-596.
Nilsson, M. & Ejnarsson, M. (2002). Speech Recognition Using Hidden Markov Model: Performance Evaluation in Noisy Environment. Blekinge Institute of Technology, Sweden.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. & Woodland, P. (2006). The HTK Book Version 3.4. Cambridge University, Cambridge. Available: http://htk.eng.cam.ac.uk.