Article

An automatic speech recognition system in Indian and foreign languages: A state-of-the-art review analysis

Abstract

Speech recognition is one of the prominent research topics in the field of Natural Language Processing (NLP). Speech recognition techniques remove communication barriers and make it easier for human beings and devices to interact. The aim of this study is to analyze the Automatic Speech Recognition Systems (ASRS) proposed by different researchers using machine learning and deep learning techniques. In this work, speech recognition systems for Indian and foreign languages such as Hindi, Marathi, Malayalam, Urdu, Sanskrit, Nepali, Kannada, Chinese, Japanese, Arabic, Italian, Turkish, French, and German are considered. An integrated framework is presented and elaborated with recent advancements. The various platforms used for building speech recognition models, such as the Hidden Markov Model Toolkit (HMM Toolkit), CMU Sphinx, and the Kaldi toolkit, are explained. Further, some applications that depict the uses of ASRS are elaborated.

Article
Full-text available
Communication is the major means of conveying information, and speech is its most natural mode. Information can be exchanged between humans through a particular language, but interaction between humans and machines remains a major challenge, which is addressed by ASR (Automatic Speech Recognition). This research recognizes speaker-independent data and obtains good results using the T-DSCC (Teager energy operator delta spectral cepstral coefficients) feature extraction technique and DNN (Deep Neural Network) classification. The paper also uses the CASA technique for pre-processing the speech signals. The research is carried out by creating a database of the 10 most commonly spoken isolated words in Telugu.
Article
Full-text available
Hand signs are an effective form of human-to-human communication that has a number of possible applications. Being a natural means of interaction, they are commonly used for communication purposes by speech impaired people worldwide. In fact, about one percent of the Indian population belongs to this category. This is the key reason why it would have a huge beneficial effect on these individuals to incorporate a framework that would understand Indian Sign Language. In this paper, we present a technique that uses the Bag of Visual Words model (BOVW) to recognize Indian sign language alphabets (A-Z) and digits (0–9) in a live video stream and output the predicted labels in the form of text as well as speech. Segmentation is done based on skin colour as well as background subtraction. SURF (Speeded Up Robust Features) features have been extracted from the images and histograms are generated to map the signs with corresponding labels. The Support Vector Machine (SVM) and Convolutional Neural Networks (CNN) are used for classification. An interactive Graphical User Interface (GUI) is also developed for easy access.
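As a rough sketch of the Bag of Visual Words pipeline described above (clustering local descriptors into a codebook, encoding each image as a word histogram, then classifying with an SVM), assuming precomputed descriptor arrays `train_descriptors` and labels `train_labels` rather than the authors' actual data:

```python
# Minimal Bag-of-Visual-Words sketch (illustrative, not the authors' code).
# Assumes `train_descriptors` is a list of (n_i x 64) local descriptor arrays
# (e.g. SURF) and `train_labels` holds the sign-class label of each image.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(descriptor_list, n_words=200, seed=0):
    """Cluster all local descriptors into a visual vocabulary."""
    stacked = np.vstack(descriptor_list)
    return KMeans(n_clusters=n_words, random_state=seed).fit(stacked)

def bovw_histogram(descriptors, codebook):
    """Encode one image as a normalized histogram of visual-word counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)

codebook = build_codebook(train_descriptors)
X_train = np.array([bovw_histogram(d, codebook) for d in train_descriptors])
clf = SVC(kernel="rbf", C=10.0).fit(X_train, train_labels)
```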
Article
Full-text available
Semi-supervised training and language adversarial transfer learning are two different techniques to improve Automatic Speech Recognition (ASR) performance in limited resource conditions. In this paper, we combined these two techniques and presented a common framework for the Hindi ASR system. For acoustic modeling, we proposed a hybrid architecture of SincNet-Convolutional Neural Network (CNN)-Light Gated Recurrent Unit (LiGRU), which shows increased interpretability, high accuracy, and a smaller parameter size. We investigate the impact of the proposed hybrid model on monolingual Hindi ASR with semi-supervised training, and on multilingual Hindi ASR with language adversarial transfer learning. In this work, we have chosen three Indian languages (Hindi, Marathi, Bengali) of the same Indo-Aryan family for multilingual training. All experiments were conducted using the Kaldi and PyTorch-Kaldi toolkits. The proposed model with combined learning strategies achieves the lowest Word Error Rate (WER) of 5.5% for Hindi ASR.
Article
Full-text available
Speech recognition systems play an essential role in everyday life. A speech recognition system is software that allows users to interact with their mobile phones through speech. Speech recognition software splits the audio of a speech signal into sound waveforms, analyzes each waveform, uses various algorithms to find the most appropriate word in that language, and transcribes those sounds into text. This paper illustrates the popular existing systems, namely SIRI, CORTANA, GOOGLE ASSISTANT, ALEXA, and BIXBY. Apart from that, the paper also analyzes the concept of NLP (Natural Language Processing) in relation to speech recognition. In addition, the main aim is to find the most accurate speech recognition technique so that the best results can be achieved. A comparative analysis indicates the differences and shortcomings of the various speech recognition systems.
Article
Full-text available
In this article, the authors present the design and development of automatic spontaneous speech recognition for the Punjabi language. To scale up to a natural speech recognizer, a very large vocabulary Punjabi text corpus has been built from a Punjabi speech corpus of interviews, presentations, etc. The Punjabi text corpus has then been cleaned using the proposed corpus optimization algorithm. The proposed automatic spontaneous speech model has been trained on 13,218 Punjabi words and more than 200 minutes of recorded speech. The work also confirmed that the 2,073,456 unique in-word Punjabi tri-phoneme combinations present in the dictionary comprise 131 phonemes. The performance of the proposed model has reached 87.10% sentence-level accuracy for 2381 trained Punjabi sentences and a word-level accuracy of 94.19% for 13,218 Punjabi words. Simultaneously, the word error rate has been reduced to 5.8% for the 13,218 Punjabi words. The performance of the proposed system has also been tested using other parameters, such as overall likelihood per frame and convergence ratio, over various iterations for different Gaussian mixtures.
Article
Full-text available
Speech recognition is the process of converting an acoustic waveform into text that matches the information conveyed by the speaker. Multilingual speech recognition has proved difficult to improve over separately trained systems. We experiment with different evaluation approaches to multilingual speech recognition in which the phone sets are entirely distinct, using a model whose parameters are not tied to specific states but are shared across languages. Text-to-speech synthesis systems also require accurate prosody labels to generate natural-sounding speech. This paper builds a speech recognition system for the Bodo language using the Hidden Markov Model Toolkit (HTK). The system is trained on continuous Bodo speech collected from male Bodo speakers.
Article
Full-text available
Through this paper we recognize spoken alphabets and digits in an automatic speech recognition task for the Bodo language, which is needed in many different fields, by taking alphadigit input. Bodo is a Tibeto-Burman language that differs from languages such as Assamese and English; the major differences lie in how the alphabets and the ten digits are pronounced. In this research, the pronunciation of spoken Bodo digits and alphabets is examined from the speech recognition point of view. The system is designed to recognize isolated whole-word speech based on Hidden Markov Models built on phoneme recognition. The GU_Bodo corpus is used in the training and testing phases of the system, and ten different experiments are performed on it. The first three are trained and tested using each individual digit subset. The fourth is conducted on these three subsets, i.e., trained using all three training subsets and tested using all three testing subsets. In the next three experiments, the training subset is the same as in the fourth experiment but the testing subsets are the same as in the first three experiments. The eighth experiment is on the Bodo alphabets, and the ninth and tenth are applied to the digits and the alphabets, respectively. The complete experiment is done in three phases: the first is the design of the system using Bodo digits, the second evaluates, analyzes and recognizes the Bodo alphabets, and the third combines the Bodo digits and alphabets together. Through our experiments we achieved 90.60% correct digit recognition in a noisy environment using mixed training and testing subsets. For the alphabets, the system performance is about 70.17%. For mixed alphabets and digits, the system accuracy is about 81.20%, which is better than the alphabet experiments but much lower than the digit experiments.
Conference Paper
Full-text available
Speech disorders affect many people around the world and introduce a negative impact on their quality of life. Dysarthria is a neural-motor speech disorder that obstructs the normal production of speech. Current automatic speech recognition (ASR) systems are developed for normal speech. They are not suitable for accurate recognition of disordered speech. To the best of our knowledge, the majority of disordered speech recognition systems developed to date are for English. In this paper, we present two disordered speech recognition systems for both English and Cantonese. Both systems demonstrate competitive performance when compared with the Google speech recognition API and human recognition results.
Conference Paper
Full-text available
Pitch features have long been known to be useful for recognition of normal speech. However, for disordered speech, the significant degradation of voice quality renders the prosodic features, such as pitch, not always useful, particularly when the underlying conditions, for example, damages to the cerebellum, introduce a large effect on prosody control. Hence, both acoustic and prosodic information can be distorted. To the best of our knowledge, there has been very limited research on using pitch features for disordered speech recognition. In this paper, a comparative study of multiple approaches designed to incorporate pitch features is conducted to improve the performance of two disordered speech recognition tasks: English UASpeech, and Cantonese CUDYS. A novel gated neural network (GNN) based approach is used to improve acoustic and pitch feature integration over a conventional concatenation between the two. Bayesian estimation of GNNs is also investigated to further improve their robustness.
Article
Full-text available
This paper presents a neural network based Nepali speech recognition model. An RNN (Recurrent Neural Network) is used for processing sequential audio data, and the CTC (Connectionist Temporal Classification) technique is applied to allow the RNN to train over audio data. CTC is a probabilistic approach for maximizing the probability of the desired label sequence given the RNN output. After processing through the RNN and CTC layers, Nepali text is obtained as output. The paper also defines a character set of 67 Nepali characters required for transcribing Nepali speech to text.
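A minimal sketch of an RNN trained with the CTC loss, as described above, using PyTorch; the layer sizes, the 68-symbol output (67 Nepali characters plus a CTC blank) and the dummy batch are illustrative assumptions, not the authors' configuration:

```python
# Hedged sketch of an RNN + CTC setup in PyTorch (not the authors' exact model).
# Assumes 67 Nepali characters plus one CTC blank symbol at index 0.
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_feats=40, n_hidden=256, n_classes=68):  # 67 chars + blank
        super().__init__()
        self.rnn = nn.GRU(n_feats, n_hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, n_feats)
        out, _ = self.rnn(x)
        return self.fc(out).log_softmax(dim=-1)

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 300, 40)           # dummy batch of 4 utterances
targets = torch.randint(1, 68, (4, 50))   # dummy label sequences (no blanks)
log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (time, batch, classes)
loss = ctc(log_probs,
           targets,
           input_lengths=torch.full((4,), 300, dtype=torch.long),
           target_lengths=torch.full((4,), 50, dtype=torch.long))
loss.backward()
```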
Article
Full-text available
In this paper, a continuous Kannada speech recognition system is developed under different noisy conditions. The continuous Kannada speech sentences are collected from 2400 speakers across different dialect regions of Karnataka state (a state in the southwestern region of India where Kannada is the principal language). The word-level transcription and validation of the speech data are done using the Indic transliteration tool (IT3:UTF-8). The Kaldi toolkit is used for the development of automatic speech recognition (ASR) models at different phoneme levels. The lexicon and phoneme set are created afresh for the continuous Kannada speech sentences. 80% of the validated speech data is used for system training and 20% for testing with Kaldi. The performance of the system is measured using the word error rate (WER). The acoustic models were built using techniques such as monophone, triphone1, triphone2, triphone3, subspace Gaussian mixture models (SGMM), a combination of deep neural network (DNN) and hidden Markov model (HMM), a combination of DNN and SGMM, and a combination of SGMM and maximum mutual information. The experiment is conducted to determine the WER using these different modeling techniques. The results show that the recognition rate obtained with the combination of DNN and HMM outperforms the conventional ASR modeling techniques. An interactive voice response system is developed to build an end-to-end ASR system that recognizes continuous Kannada speech sentences. The developed ASR system is tested by 300 speakers of Karnataka state under an uncontrolled environment.
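For reference, the word error rate (WER) reported above is the standard edit-distance measure; a minimal re-implementation (not Kaldi's scoring script) looks like this:

```python
# Word error rate, WER = (S + D + I) / N, computed with a standard
# edit-distance dynamic program; an illustrative re-implementation.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("ನಮಸ್ಕಾರ ಹೇಗಿದ್ದೀರಾ", "ನಮಸ್ಕಾರ ಹೇಗಿದೆ"))  # 0.5 (one substitution in two words)
```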
Article
Full-text available
Devanagari and Bengali scripts are two of the most popular scripts in India. Most of the existing word recognition studies in these two scripts have relied upon the widely used Hidden Markov Model (HMM), in spite of its familiar shortcomings. The existing works performed well on their chosen metrics, but the existing word recognition systems in these two scripts could not achieve more than 90% recognition accuracy. This article proposes a novel approach for online handwritten cursive and non-cursive word recognition in Devanagari and Bengali scripts based on two recently developed models of the Recurrent Neural Network (RNN): Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BLSTM). The proposed approach divides each word horizontally into three zones (upper, middle, and lower) to reduce the variations in basic stroke order within a word. Next, the word portions from the middle zone are re-segmented into their basic strokes. Various structural and directional features are then extracted from each basic stroke of the word separately for each zone. These zone-wise basic stroke features are then studied using both the LSTM and BLSTM versions of RNN. Most of the existing word recognition systems in these two scripts have followed a word-based class labelling approach, whereas the proposed system follows a basic-stroke-based class labelling approach. An exhaustive experiment on large datasets has been performed to evaluate the performance of the proposed approach using both RNN and HMM for a comparative performance analysis. Experimental results show that the proposed RNN based system is superior to HMM, achieving 99.50% and 95.24% accuracies in Devanagari and Bengali scripts respectively, and outperforms existing HMM based systems in the literature as well.
Article
Full-text available
Named Entity Recognition is the process of identifying the entities in a text document and categorizing them into predefined categories such as Person, Location, Organisation, etc. It is an important step in the processing of natural language text. Named entity recognition systems aim at extracting relevant information from the text. Various methods have been applied for NER in Malayalam. In this paper, we propose an NER system for Malayalam using neural networks. Neural networks are exceptionally powerful tools for learning representations of data with multiple levels of abstraction. The proposed system utilizes different features such as POS information of the word, embedded representations of words and suffixes, POS information of preceding words, etc. We have used a corpus of 20,615 sentences for training and testing. With a smaller number of features, the system was able to obtain state-of-the-art performance in NER for Malayalam.
Article
Full-text available
Numeral recognition is one of the most indispensable applications in pattern recognition. Recognizing numerals written in Indian languages is a demanding problem. Devanagari Marathi is one such popular Indian language script, and recognizing Marathi numerals written in different patterns is a challenging task. Depending on the type of feature extraction, varied approaches to numeral recognition have been suggested and practiced on smaller data-sets. However, no standard large data-set is available for handwritten Marathi numerals. Therefore, a data-set with 80,000 samples has been prepared for the proposed work. This paper proposes a Customized Convolutional Neural Network (CCNN) that has the ability to learn features automatically and predict the class of numerals from a wide-ranging data-set. Additionally, visualization of the intermediate CCNN layers is presented, which explains the dynamics of the presented network. Out of 80,000 Marathi numeral samples, 70,000 are used for training and 10,000 for testing. The CCNN, verified using K-fold cross-validation, achieved an average accuracy of 94.93% on the testing data-set.
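A minimal convolutional-network sketch for 10-class numeral images in Keras; the layer sizes and the assumed 32x32 grayscale inputs are illustrative, since the paper's CCNN architecture is not reproduced here:

```python
# Illustrative CNN for 10-class numeral images (not the paper's CCNN).
# Assumes x_train holds 32x32x1 grayscale images and y_train integer labels.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),   # ten numeral classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10, validation_split=0.1)
```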
Article
Full-text available
In this paper, a new set of visual speech features based on the Histogram of Oriented Gradients (HOG) is proposed to improve the robustness of bimodal Hindi speech recognition. For extracting the visual features, the energy per block using HOG is calculated by finding the gradient magnitude of the pixels in each cell in both the x and y directions from the region of interest (ROI). The advantage of the proposed scheme is that it reduces the dimensionality of the visual feature vectors while retaining the full information of the lip region. For a comparative study, four sets of visual features are extracted from the AMUAV corpus: Set A (Two-Dimensional Discrete Cosine Transform features (2DDCT)), Set B (Two-Dimensional Discrete Wavelet Transform followed by DCT (2D-DWT-DCT)), Set C (Static-HOG) and Set D (Dynamic-HOG). Standard Mel Frequency Cepstral Coefficients (MFCC), followed by static and dynamic MFCC features, were used as baseline features. A maximum improvement in word recognition accuracy (WRA) of 12.73% over the baseline features is reported using the proposed feature sets.
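A hedged sketch of HOG feature extraction from a lip region of interest using scikit-image; the cell, block and orientation settings are assumptions rather than the paper's exact configuration:

```python
# HOG descriptor extraction from a lip ROI with scikit-image (illustrative
# parameter choices; the paper's exact block/cell settings are assumptions).
import numpy as np
from skimage.feature import hog

def hog_visual_features(roi_gray: np.ndarray) -> np.ndarray:
    """roi_gray: 2-D grayscale lip ROI; returns a 1-D HOG descriptor."""
    return hog(roi_gray,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm="L2-Hys",
               feature_vector=True)

roi = np.random.rand(64, 64)      # stand-in for a real lip ROI frame
features = hog_visual_features(roi)
print(features.shape)             # (1764,) for a 64x64 ROI with these settings
```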
Article
Full-text available
Speaker recognition for different languages is still a big challenge for researchers. The identification rate (IR) becomes a serious issue when the speech utterance is short. This paper aims to implement speaker recognition for Hindi speech samples using Mel frequency cepstral coefficient–vector quantization (MFCC-VQ) and Mel frequency cepstral coefficient–Gaussian mixture model (MFCC-GMM) approaches for text-dependent and text-independent phrases. The accuracy of text-independent recognition with MFCC-VQ and MFCC-GMM for Hindi speech samples is 77.64% and 86.27% respectively. However, the accuracy increases significantly for text-dependent recognition: 85.49% and 94.12% using the MFCC-VQ and MFCC-GMM approaches, respectively. We tested 15 speakers, consisting of 10 male and 5 female speakers, with 17 trials per speaker.
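An illustrative MFCC-GMM speaker identification sketch with librosa and scikit-learn; the mixture size, sampling rate and the assumed `train_utterances` dictionary are placeholders, not the authors' setup:

```python
# MFCC-GMM speaker identification sketch (illustrative, not the authors' code).
# Assumes `train_utterances` maps each speaker name to a list of WAV paths.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, 13)

# Train one GMM per enrolled speaker on that speaker's training utterances.
speaker_models = {}
for speaker, paths in train_utterances.items():
    X = np.vstack([mfcc_frames(p) for p in paths])
    speaker_models[speaker] = GaussianMixture(n_components=16,
                                              covariance_type="diag").fit(X)

def identify(wav_path):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    X = mfcc_frames(wav_path)
    return max(speaker_models, key=lambda s: speaker_models[s].score(X))
```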
Article
Full-text available
Phonological features are the most basic unit of a speech knowledge hierarchy. This paper reports on the detection and classification of phonological features from continuous Bengali speech. The phonological features are based on the place and manner of articulation. All the experiments are performed with a deep neural network based framework, and two different models are designed for the detection and classification tasks. The deep-structured models are pre-trained with a stacked autoencoder. The C-DAC speech corpus is used for continuous spoken Bengali speech data. A frame-wise cepstral representation is provided to the input layer of the deep-structured model. Speech data from multiple speakers has been used to confirm speaker independence. In the detection task, the system achieved 86.19% average overall accuracy. In the classification task, accuracy for the classification of place of articulation remains low at 50.2%, while in manner-based classification the system delivered an improved performance with 98.9% accuracy.
Article
Full-text available
Understanding the meaning of an utterance depends mostly on extracting its concepts. In this paper, we propose a method for spontaneous Arabic speech understanding, in particular a conceptual segmentation of spontaneous Arabic oral utterances. It takes a transcribed utterance as input and provides conceptual labels as output in the form of a set of Conceptual Segments (CSs). This method belongs to the numerical approach and is based on a supervised machine learning (ML) technique. The originality of our work lies in the processing of Out-Of-Vocabulary (OOV) words before and/or after the utterance segmentation task. Furthermore, this work is part of the improvement of the understanding module of the SARF system [2]. Indeed, we aim to compare our numerical method with the symbolic one proposed by [2] and the hybrid one proposed by [1].
Article
Full-text available
This paper presents a novel method for the recognition of Malayalam vowel phonemes using nonlinear speech parameters such as maximal Lyapunov exponent and Phase Space Anti-diagonal Point Distribution. Reconstructed phase space is used as a base for extracting these features. The results show that the proposed nonlinear features have significant discriminatory power. The combined feature vectors of nonlinear feature and MFCC features yield increased recognition accuracy for five Malayalam vowel phonemes.
Article
Full-text available
This paper addresses different configurations of two-layer and three-layer neural network approaches for a low-resource language, Gujarati. The speech data are collected with an in-ear microphone as well as a conventional microphone system, and the results are compared. Different end-point detection algorithms are also tested to remove unwanted silence portions, where the chance of noise is greatest. Word boundary detection is used to separate the different words from sentences. The detected words are then passed to the feature extraction block. Feature extraction is done using Mel-Frequency Cepstral Coefficients (MFCCs) and Real Cepstral Coefficients (RC), and the results are tested and compared. Two-layer and three-layer neural networks are used for classification.
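As a simple illustration of energy-based end-point detection (one of several possible algorithms; the paper's specific methods and thresholds are not given here):

```python
# Short-time-energy endpoint detection sketch; frame sizes and the threshold
# ratio are illustrative assumptions, not the paper's algorithm.
import numpy as np

def energy_endpoints(signal, frame_len=400, hop=160, threshold_ratio=0.1):
    """Return (start, end) sample indices of the voiced region."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    if not frames:
        return 0, len(signal)
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    threshold = threshold_ratio * energy.max()
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return 0, len(signal)
    return voiced[0] * hop, voiced[-1] * hop + frame_len
```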
Article
Full-text available
Feature fusion plays an important role in speech emotion recognition, improving classification accuracy by combining the most popular acoustic features for speech emotion recognition, such as energy, pitch and mel frequency cepstral coefficients. However, the performance of the system is not optimal because of its computational complexity, which is due to the high-dimensional, correlated feature set obtained after feature fusion. In this paper, a two-stage feature selection method is proposed. In the first stage, appropriate features are selected and fused together for speech emotion recognition. In the second stage, optimal feature subset selection techniques [sequential forward selection (SFS) and sequential floating forward selection (SFFS)] are used to eliminate the curse-of-dimensionality problem caused by the high-dimensional feature vector after feature fusion. Finally, the emotions are classified using several classifiers: Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), Support Vector Machine (SVM) and K Nearest Neighbor (KNN). The performance of the overall emotion recognition system is validated over the Berlin and Spanish databases by considering the classification rate. An optimal uncorrelated feature set is obtained by using SFS and SFFS individually. Results reveal that SFFS is a better choice as a feature subset selection method because SFS suffers from the nesting problem, i.e., it is difficult to discard a feature after it is retained in the set. SFFS eliminates this nesting problem by not fixing the set at any stage but letting it float up and down during selection based on the objective function. Experimental results showed that the efficiency of the classifier is improved by 15–20% with the two-stage feature selection method when compared with the performance of the classifier with feature fusion alone.
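A hedged sketch of sequential floating forward selection (SFFS) over a fused feature matrix using mlxtend; the classifier, target subset size and the assumed `X`, `y` arrays are illustrative choices, not the paper's configuration:

```python
# SFFS over a fused acoustic feature matrix (illustrative settings).
# Assumes X: (n_utterances, n_fused_features) and y: emotion labels.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.svm import SVC

sffs = SFS(SVC(kernel="linear"),
           k_features=30,        # target size of the reduced feature subset
           forward=True,
           floating=True,        # floating=True gives SFFS; False gives plain SFS
           scoring="accuracy",
           cv=5)
sffs = sffs.fit(X, y)
X_reduced = sffs.transform(X)
print(sffs.k_feature_idx_)       # indices of the selected features
```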
Article
Full-text available
Malayalam is a South Indian language spoken predominantly in the state of Kerala. In this paper, a comparative study of different classifiers for recognizing Malayalam language dialects is carried out and presented. Thrissur and Kozhikode are the two dialect corpora used for the recognition task. Mel Frequency Cepstral Coefficients (MFCC), energy and pitch are the features extracted. The feature vector sets obtained are then classified using Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Naive Bayes classifiers. Dialect recognition is treated as a multiclass classification problem: the classifiers are trained on feature vectors from known patterns and then evaluated on the test data set. Based on recognition accuracy, the performance of the ANN, SVM, and Naive Bayes classifiers is evaluated. ANN produced a recognition accuracy of 90.2%, SVM produced 88.2% and Naive Bayes produced 84.1%. Among the three classifiers, ANN is found to be the best.
Article
Full-text available
Automatic recognition of emotions from speech by machines has been one of the most challenging areas of research in the field of human–machine interaction. Automatic emotion recognition from speech means monitoring and identifying the emotional or physiological state of an individual from their utterances. Speech emotion recognition has a wide range of applications ranging from clinical studies to robotics. In this paper, a speech emotion database for the Malayalam language (one of the South Indian languages) and a system for recognizing the emotions are developed. The system uses Mel Frequency Cepstral Coefficients (MFCCs), Short Time Energy (STE) and pitch as feature extraction techniques. Two classifiers, namely Artificial Neural Network (ANN) and Support Vector Machine (SVM), are used for pattern classification. Experiments show that this method provides a high accuracy of 88.4% in the case of ANN and 78.2% in the case of SVM.
Article
Full-text available
Speech processing is a very important research area, covering speaker recognition, speech synthesis, speech codecs and speech noise reduction, among others. Many languages have different speaking styles called accents or dialects. Identifying the accent before speech recognition can improve the performance of speech recognition systems, and when a language has many accents, accent recognition becomes crucial. Telugu is an Indian language widely spoken in the southern part of India, with coastal Andhra, Telangana and Rayalaseema as its main accents. In this work, speech samples are collected from native speakers of the different Telugu accents for both training and testing. Mel frequency cepstral coefficient (MFCC) features are extracted from each training and test sample, and a Gaussian mixture model (GMM) is then used to classify the speech by accent. The overall accuracy of the proposed system in recognizing the region a speaker belongs to, based on accent, is 91%.
Article
Full-text available
An ideal Automatic Speech Recognition system has to accurately and efficiently convert a speech signal into a text transcription of the spoken words, independent of the device used to record the speech (i.e., the transducer or microphone), the speaker, or the environment. There are three approaches to speech recognition: the acoustic-phonetic approach, the pattern recognition approach, and the artificial intelligence approach; statistical methods are used in the pattern recognition approach. We have developed an Isolated Word Recognition (IWR) system for identifying spoken words from a database created by recording words in the Kannada language. The developed system is tested and evaluated, achieving 79% accuracy.
Article
Speech-in-noise tests are an important tool for assessing hearing impairment, the successful fitting of hearing aids, as well as for research in psychoacoustics. An important drawback of many speech-based tests is the requirement of an expert to be present during the measurement, in order to assess the listener’s performance. This drawback may be largely overcome through the use of automatic speech recognition (ASR), which utilizes automatic response logging. However, such an unsupervised system may reduce the accuracy due to the introduction of potential errors. In this study, two different ASR systems are compared for automated testing: A system with a feed-forward deep neural network (DNN) from a previous study (Ooster et al., 2018), as well as a state-of-the-art system utilizing a time-delay neural network (TDNN). The dynamic measurement procedure of the speech intelligibility test was simulated considering the subjects’ hearing loss and selecting from real recordings of test participants. The ASR systems’ performance is investigated based on responses of 73 listeners, ranging from normal-hearing to severely hearing-impaired as well as read speech from cochlear implant listeners. The feed-forward DNN produced accurate testing results for NH and unaided HI listeners but a decreased measurement accuracy was found in the simulation of the adaptive measurement procedure when considering aided severely HI listeners, recorded in noisy environments with a loudspeaker setup. The TDNN system produces error rates of 0.6% and 3.0% for deletion and insertion errors, respectively. We estimate that the SRT deviation with this system is below 1.38 dB for 95% of the users. This result indicates that a robust unsupervised conduction of the matrix sentence test is possible with a similar accuracy as with a human supervisor even when considering noisy conditions and altered or disordered speech from elderly severely HI listeners and listeners with a CI.
Article
In this paper, the performance of an automatic speech recognition (ASR) system is improved with the help of pitch-dependent features and probability-of-voicing features. Pitch-dependent features are useful for tonal-language ASR systems, and Punjabi is a highly tonal language, so an ASR system for Punjabi is built using these features. The word error rate (WER), which measures system performance, improves considerably with the pitch-dependent and probability-of-voicing features. Yin, SAcC, Fundamental Frequency Variation (FFV) and Kaldi pitch features are compared in terms of WER, and the Kaldi pitch tracker of the Kaldi toolkit gives the best-performing ASR system among the compared feature sets. The performance of the ASR system is evaluated for the Punjabi language.
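As an illustration of one of the pitch trackers compared above, a minimal YIN-based pitch extraction sketch with librosa (the file name and frequency range are assumptions):

```python
# YIN pitch extraction with librosa (illustrative; not the Kaldi pitch tracker).
import librosa
import numpy as np

y, sr = librosa.load("punjabi_utterance.wav", sr=16000)   # assumed file name
f0 = librosa.yin(y,
                 fmin=librosa.note_to_hz("C2"),            # ~65 Hz
                 fmax=librosa.note_to_hz("C7"),            # ~2093 Hz
                 sr=sr)
# A log-pitch track like this could be appended to frame-level acoustic
# features in a tonal-language ASR front end.
log_f0 = np.log(f0)
print(f0.shape, np.mean(f0))
```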
Article
With globalization and emerging voice-activated systems, accent recognition systems have gained more importance. Foreign-accented English shows different acoustical characteristics from native English pronunciation. It varies based on the native language of speakers. This work investigated the similarities and differences between spectral and time-domain characteristics of vowel production for English /hvd/ words spoken by native Mandarin, Hindi, and American English speakers. Fundamental frequency, the first four formants, mel-cepstral coefficients, linear predictive coefficients, harmonicity, spectral centroid, spectral spread, tonality, spectral flatness, pitch range coefficients, and zero crossing rate were examined for male and female speakers of English. One-way ANOVA was performed to find the significant differences. Lindblom’s perceptual distance of the corner vowels was calculated for further analysis. Our preliminary findings indicated that the vowel spaces of the male L2 speakers were smaller than the male L1 speakers’ vowel space. The F1-F2 space of the corner vowel /ɑ/ presented the most variance among L1 and L2 speakers. The vowel space of the female Hindi-accented English was found to be nearly triangular shaped. For the vowels /i/ and /ɪ/, Mandarin- and Hindi-accented English showed similar characteristics. Short-time energy showed significantly different means between L1 and L2 speakers (p < 0.001) for the vowels /i/ as in ‘heed’ and /ɪ/ as in ‘hid’. Higher F1 frequencies were calculated for the vowels /ӕ/ as in ‘had’, /ε/ as in ‘head’ and /ʌ/ as in ‘hud’ spoken by Mandarin-accented speakers. Time-domain based features provided noteworthy differences between the female L1 and L2 speakers of English. The baseline classification system showed that spectral features had higher impacts on recognizing Mandarin- and Hindi-accented English.
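A minimal sketch of the one-way ANOVA step mentioned above, using SciPy; the formant values are illustrative placeholders, not data from the study:

```python
# One-way ANOVA across speaker groups (illustrative numbers only).
from scipy.stats import f_oneway

# F1 values (Hz) of a vowel for native English, Mandarin-accented and
# Hindi-accented speakers; these are made-up placeholders.
f1_english  = [730, 710, 745, 760, 725]
f1_mandarin = [790, 805, 770, 815, 800]
f1_hindi    = [750, 765, 740, 775, 760]

stat, p_value = f_oneway(f1_english, f1_mandarin, f1_hindi)
print(f"F = {stat:.2f}, p = {p_value:.4f}")   # p < 0.05 suggests the group means differ
```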
Article
Speech is the most natural way of expressing ourselves as humans. It is only natural then to extend this communication medium to computer applications. We define speech emotion recognition (SER) systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions. SER is not a new field, it has been around for over two decades, and has regained attention thanks to the recent advancements. These novel studies make use of the advances in all fields of computing and technology, making it necessary to have an update on the current methodologies and techniques that make SER possible. We have identified and discussed distinct areas of SER, provided a detailed survey of current literature of each, and also listed the current challenges.
Conference Paper
This paper presents our work on building a speaker independent, large vocabulary continuous speech recognition system for Sanskrit using HMM Toolkit (HTK). To our knowledge, this is the maiden attempt on a Sanskrit automatic speech recognizer. A Sanskrit speech corpus with a vocabulary size of 8370 words is built. The corpus contains orthographic, phoneme and word level transcriptions of 1360 sentences. The speech data were collected from 3 sources: All India Radio website, Indian Heritage Group under C-DAC and Vyoma Linguistic Labs Foundation. Mel Frequency Cepstral Coefficients together with 0th order coefficient and delta and acceleration parameters are used as features. Triphone HMMs, trained using HTK, are used as acoustic model. Bigram probabilities with back-off smoothing are used as language model. Both phoneme and word level recognizers were developed on the Sanskrit corpus. The system provides a word level accuracy of 89.64% and a sentence level correctness of 58.76% on the test set of 274 sentences. A graphical user interface for the speech recognizer is built using Java Swings.
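As a simplified illustration of a bigram language model with back-off (HTK's actual Katz back-off with discounting is more involved), assuming a toy list of transliterated Sanskrit sentences:

```python
# Toy bigram language model with back-off to unigrams; a rough illustration
# of the idea, not HTK's Katz back-off implementation.
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(prev, word, alpha=0.4):
        if bigrams[(prev, word)] > 0:
            return bigrams[(prev, word)] / unigrams[prev]
        return alpha * unigrams[word] / total     # back off to the unigram
    return prob

prob = train_bigram_lm(["rāmaḥ vanam gacchati", "rāmaḥ gacchati"])
print(prob("rāmaḥ", "vanam"), prob("vanam", "rāmaḥ"))
```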
Article
Speech emotion recognition involves analyzing vocal changes caused by emotions with acoustic analysis and determining the features to be used for emotion recognition. The number of features obtained by acoustic analysis can become very large depending on the number of acoustic parameters used and the statistical variations of these parameters. Not all of these features are effective for emotion recognition; in addition, different emotions may affect different vocal features. For this reason, feature selection methods are used to increase emotion recognition success and reduce the workload with fewer features. There is no certainty that existing feature selection methods increase or decrease emotion recognition success; some of these methods increase the total workload. In this study, a new statistical feature selection method is proposed based on the effect of emotions on acoustic features. The success of the proposed method is compared with other methods commonly used in the literature. The comparison was made based on the number of features and emotion recognition success. According to the results obtained, the proposed method provides a significant reduction in the number of features, as well as increasing the classification success.
Article
An Automatic Speech Recognition (ASR) system implementation uses a conventional pattern recognition technique that stores a set of training patterns in classes and compares test patterns with the training patterns to place them in the best-matched pattern class. Most state-of-the-art ASR systems use the Mel Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Prediction (PLP) to extract features in the training phase of the ASR system. However, the sensitivity of MFCC and PLP to background noise has led to the use of the noise-robust features Gammatone Frequency Cepstral Coefficient (GFCC) and Basilar-membrane Frequency-band Cepstral Coefficient (BFCC). But many issues associated with these feature extraction methods, such as the accepted bandwidth and the standard number of filters, remain unresolved to date. This paper proposes a novel approach that uses the Differential Evolution (DE) algorithm to optimize the number and spacing of filters used in the MFCC, GFCC and BFCC techniques. It also evaluates the performance of these feature extraction methods with and without DE optimization in clean as well as noisy environments. The results show that BFCC based ASR systems perform 0.4% to 1.0% better than GFCC and 7% to 10% better than MFCC in different conditions.
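A hedged sketch of tuning filter-bank parameters with SciPy's differential evolution; the objective function below is a placeholder for the actual ASR evaluation performed in the paper:

```python
# Differential evolution over filter-bank parameters (illustrative only).
import numpy as np
from scipy.optimize import differential_evolution

def asr_objective(params):
    """params = [number_of_filters, low_cutoff_hz, high_cutoff_hz].
    In a real setup this would re-extract features (e.g. MFCC/GFCC/BFCC with
    these settings, rounding the filter count to an integer), retrain or
    rescore the ASR system, and return the word error rate to be minimized.
    Here a dummy penalty is returned as a stand-in."""
    n_filters, f_low, f_high = params
    if f_low >= f_high:
        return 1.0                      # infeasible band edges: penalize
    return abs(n_filters - 26) / 100 + abs(f_high - 8000) / 1e5  # placeholder

bounds = [(10, 60),       # number of filters
          (0, 1000),      # lower band edge in Hz
          (4000, 8000)]   # upper band edge in Hz
result = differential_evolution(asr_objective, bounds, maxiter=50, seed=1)
print(result.x, result.fun)
```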
Article
Nowadays, recognition of emotion from the speech signal is a widespread research topic, since speech is the quickest and most natural way for humans to communicate. A number of investigations have been carried out on this topic. Building on these models, this paper aims to recognize emotions from the speech signal precisely. To accomplish this, an adaptive learning architecture is proposed for an artificial neural network to learn a multimodal fusion of speech features, resulting in a hybrid PSO-FF algorithm that combines the features of both PSO and FF for training the network. The performance of the proposed recognition model is analyzed by comparing it with conventional methods using varied performance measures such as accuracy, sensitivity, specificity, precision, FPR, FNR, NPV, FDR, F1-score and MCC. The experimental analysis revealed that the proposed model is 10.85% better than the conventional models with respect to accuracy for both the Marathi database and the benchmark database.
Article
This paper aims at determining the best way to exploit the phonological properties of the Arabic language in order to improve the performance of the speech recognition system. One of the main challenges facing the processing of Arabic is the effect of the local context, which induces changes in the phonetic representation of a given text, thereby causing the recognition engine to misclassify it. The proposed solution is to develop a set of language-dependent grapheme-to-allophone rules that can predict such allophonic variations and hence provide a phonetic transcription that is sensitive to the local context for the automatic speech recognition system. The novel aspect of this method is that the pronunciation of each word is extracted directly from a context-sensitive phonetic transcription rather than a predefined dictionary that typically does not reflect the actual pronunciation of the word. The paper also aims at employing the stress feature as one of the supra-segmental characteristics of speech to enhance the acoustic modelling. The effectiveness of applying the proposed rules has been tested by comparing the performance of a dictionary based system against one using the automatically generated phonetic transcription. The research reported an average of 9.3% improvement in the system's performance by eliminating the fixed dictionary and using the generated phonetic transcription to learn the phone probabilities. Marking the stressed vowels with separate stress markers leads to a further improvement of 1.7%.
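As a toy illustration of context-sensitive grapheme-to-allophone rewriting (the rules and symbols below are invented placeholders, not the paper's Arabic rule set):

```python
# Toy context-sensitive grapheme-to-allophone rewriting; invented rules only.
import re

# Each rule: (regex over the grapheme string, replacement allophone string).
rules = [
    (r"n(?=\s*[bm])", "m"),   # e.g. nasal assimilation before a labial
    (r"aa", "A:"),            # long vowel symbol
]

def graphemes_to_allophones(text: str) -> str:
    """Apply the ordered rewrite rules to a transliterated word sequence."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

print(graphemes_to_allophones("min baad"))   # -> "mim bA:d" under these toy rules
```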
Article
In the presence of environmental noise, speakers tend to increase their vocal effort to improve the audibility of their voice. This involuntary adjustment is known as the Lombard effect (LE). Due to LE the signal-to-noise ratio of speech increases, but at the same time the loudness, pitch and duration of phonemes change. Hence, the accuracy of automatic speech recognition systems degrades. In this paper, the effect of unsupervised equalization of the Lombard effect is investigated for a Hindi vowel classification task using the Hindi database designed at TIFR Mumbai, India. The proposed Quantile-based Dynamic Cepstral Normalization MFCC (QCN-MFCC) features, along with baseline MFCC features, have been used for vowel classification. A Hidden Markov Model (HMM) is used as the classifier. It is observed that QCN-MFCC features give a maximum improvement of 5.97% and 5% over MFCC features for the context-dependent and context-independent cases respectively. It is also observed that QCN-MFCC features give improvements of 13% and 11.5% over MFCC features for context-dependent and context-independent classification of mid vowels.
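A rough NumPy illustration of quantile-based normalization of cepstral features; this is a generic sketch of the idea, not the exact QCN-MFCC formulation used in the paper:

```python
# Quantile-based re-centring and scaling of MFCC frames (generic sketch,
# not the paper's QCN formulation).
import numpy as np

def quantile_normalize_cepstra(mfcc, low_q=25, high_q=75):
    """mfcc: (frames, coefficients). Returns quantile-normalized features."""
    q_low = np.percentile(mfcc, low_q, axis=0)
    q_high = np.percentile(mfcc, high_q, axis=0)
    centre = (q_low + q_high) / 2.0
    scale = np.maximum(q_high - q_low, 1e-6)
    return (mfcc - centre) / scale

frames = np.random.randn(200, 13) * 5 + 2    # stand-in for real MFCC frames
normalized = quantile_normalize_cepstra(frames)
```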
Article
Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from ASR research communities, owing to the fact that they possess a natural ability to mimic biological behavior and thereby aid ASR modeling and processing. Current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. In this paper, we report certain approaches based on machine learning (ML) used for extracting relevant samples from a big data space and apply them to ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features and frequency-domain forms, is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, relevant samples are selected and assimilated from a large storage. Next, a few conventional methods are used for feature extraction of a few selected types; the features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focussed Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time. It is found that the proposed ML-based sentence extraction techniques and the composite feature set used with an RNN classifier outperform all other approaches. By using the ANN in FF form as a feature extractor, the performance of the system is evaluated and a comparison is made. Experimental results show that the application of big data samples has enhanced the learning of the ASR system. Further, the ANN-based sample and feature extraction techniques are found to be efficient enough to enable the application of ML techniques to big data aspects as part of ASR systems.
Article
We build an automatic phoneme recognition system based on Hidden Markov Modeling (HMM), a dynamic modeling scheme. Models were built using stochastic pattern recognition and acoustic-phonetic schemes to recognize phonemes. Since our native language is Kannada, a rich South Indian language, we used 15 Kannada phonemes to train and test these models. Since Mel-Frequency Cepstral Coefficients (MFCC) are well-known acoustic features of speech [1,2], we used them for speech feature extraction. Finally, a performance analysis of the models in terms of Phoneme Error Rate (PER) shows that dynamic modeling yields good results and can be used in developing automatic speech recognition systems.
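A minimal per-phoneme HMM training sketch with hmmlearn; the three-state topology and the assumed `phoneme_examples` mapping (phoneme to a list of MFCC matrices) are illustrative, not the authors' setup:

```python
# Per-phoneme HMM training and scoring sketch (illustrative only).
# Assumes `phoneme_examples` maps each of the 15 Kannada phonemes to a list
# of MFCC matrices of shape (frames, 13).
import numpy as np
from hmmlearn.hmm import GaussianHMM

phoneme_models = {}
for phoneme, examples in phoneme_examples.items():
    X = np.vstack(examples)
    lengths = [len(e) for e in examples]           # frame count per example
    model = GaussianHMM(n_components=3,            # 3 emitting states per phoneme
                        covariance_type="diag",
                        n_iter=20)
    phoneme_models[phoneme] = model.fit(X, lengths)

def recognize(mfcc):
    """Score an unknown segment against every phoneme HMM and pick the best."""
    return max(phoneme_models, key=lambda p: phoneme_models[p].score(mfcc))
```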
Conference Paper
Parts-of-speech disambiguation of corpora is one of the most challenging tasks in Natural Language Processing (NLP). Some work has been done in the past to overcome the problem of bilingual corpora disambiguation for Hindi. In this paper, an Artificial Neural Network based Hindi parts-of-speech tagger is used. To analyze the effectiveness of the proposed approach, 2600 news sentences containing 11,500 words from various newspapers have been evaluated. During simulation and evaluation, an accuracy of up to 91.30% is achieved, which is significantly better than other existing approaches for Hindi parts-of-speech tagging.