Preprint

Mixed ASR System for Amazigh and Arabic Under-Resourced Dialects in Maghreb Region

Authors:

Abstract

Automatic Speech Recognition (ASR) technology plays an essential role in human-machine interaction. In this paper, we describe the speech experiments conducted to develop and adapt a mixed automatic speech recognition system based on spoken digits for the Amazigh and Arabic dialects, which are considered under-resourced dialects in the Maghreb region. Our database includes speech samples collected from 24 Moroccan and Algerian speakers, including both males and females. The designed system is implemented using a combination of hidden Markov models and Gaussian mixture models, together with the Mel frequency cepstral coefficients (MFCCs) feature extraction method.
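As background for the MFCC front end named in the abstract, the following is a minimal, illustrative NumPy sketch of MFCC extraction (framing, Hamming window, power spectrum, mel filterbank, log compression, DCT). It is not the authors' implementation; the 16 kHz rate, 26 filters, and 13 coefficients are common defaults assumed here.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_ceps=13, n_fft=512):
    # Frame the signal (25 ms frames, 10 ms hop at 16 kHz), window each frame,
    # take the power spectrum, apply the mel filterbank, log-compress, and DCT.
    frames = np.array([signal[s:s + frame_len] * np.hamming(frame_len)
                       for s in range(0, len(signal) - frame_len + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    energies = np.log(spec @ mel_filterbank(n_fft=n_fft, sr=sr).T + 1e-10)
    # Type-II DCT decorrelates the log filterbank energies.
    n = energies.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_ceps)[:, None])
    return energies @ basis.T
```

In a full HMM-GMM recognizer these 13-dimensional vectors (usually extended with delta and delta-delta coefficients) form the acoustic observation sequence.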


Article
Full-text available
Automatic Speech Recognition (ASR) for Amazigh speech, particularly Moroccan Tarifit accented speech, is a little-researched area. This paper focuses on the analysis and evaluation of the first ten Amazigh digits under noisy conditions from an ASR perspective, based on the signal-to-noise ratio (SNR). Our testing experiments were performed under two types of noise and repeated with added environmental noise at various SNR values for each kind, ranging from 5 to 45 dB. Different formalisms, such as hidden Markov models (HMMs) and Gaussian mixture models (GMMs), are used to develop a speaker-independent Amazigh speech recognizer. The experimental results under noisy conditions show performance degradation for all digits to different degrees; the rates under the car noise environment decrease less than under grinder conditions, with differences of 2.84% and 8.42% at SNR 5 dB and 25 dB, respectively. We also observed that the most affected digits are those containing the letter "S".
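The noisy-condition setup described above mixes environmental noise into clean recordings at a chosen SNR. A minimal NumPy sketch of that mixing step (an illustration, not the authors' code) scales the noise so the mixture hits the requested ratio:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR in dB."""
    # Tile or trim the noise to match the length of the clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10*log10(p_clean / (gain^2 * p_noise)); solve for the gain.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

Sweeping `snr_db` over 5 to 45 dB, as in the paper, produces one corrupted copy of the test set per noise level.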
Conference Paper
Full-text available
The main aim of an Automatic Speech Recognition (ASR) system is to produce a system able to simulate the human listener, based on a learning approach and speech data for the studied language. In this paper, we describe a Darija (Moroccan dialect) speech recognition system implemented to recognize the first ten Arabic digits spoken in Moroccan Darija, collected from 20 speakers including both males and females. The system is designed with the CMU Sphinx tools using the hidden Markov model approach on small data, with Mel frequency cepstral coefficients (MFCCs) used in the feature extraction phase. Our best obtained accuracy is 96.27%, found with 8 GMMs.
Chapter
Full-text available
In this paper, we apply two acoustic models to build an Amazigh speech recognition system: the first based on hidden Markov models (HMMs) using the open-source CMU Sphinx-4 from Carnegie Mellon University, and the second based on a convolutional neural network (CNN), a particular form of neural network implemented in TensorFlow with GPU computation. Both systems use Mel frequency cepstral coefficients (MFCC) for feature extraction. The corpus consists of 9900 audio files. The system obtained its best results when trained using the CNN, which produced 92% accuracy.
Article
Full-text available
Speech feature extraction and likelihood evaluation are considered the main issues in a speech recognition system. Although both techniques have been developed and improved, they remain among the most active areas of research. This paper investigates the performance of conventional and hybrid speech feature extraction algorithms, namely Mel frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP), and RASTA-PLP, using a multivariate hidden Markov model (HMM) classifier. The performance of the speech recognition system is evaluated based on the word error rate (WER), which is given for different datasets of human voice using the isolated-speech TIDIGITS corpus sampled at 8 kHz. This data includes the pronunciation of eleven words (zero to nine plus "oh") recorded from 208 different adult speakers (men and women), each of whom uttered each word twice. Keywords: feature extraction, likelihood evaluation, speech recognition, Mel frequency cepstral coefficients, linear predictive coding, perceptual linear prediction, RASTA-PLP, hidden Markov model, word error rate.
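The word error rate used as the evaluation metric above is the word-level Levenshtein distance between the reference and the hypothesis, normalized by the reference length. A short sketch (not from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                    # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                    # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a four-word digit string yields a WER of 0.25.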
Article
Full-text available
This paper describes the performance of Amazigh speech recognition via an interactive voice response system in noisy conditions. The experiments were first conducted on uncoded speech and then repeated on decoded speech in a noisy environment for different signal-to-noise ratios (SNR). In this study, we analyze the effect of noise at different SNR levels on the first ten Amazigh digits, which were collected from 22 Moroccan native speakers including both males and females. Our experimental results show that accuracy degraded for all studied words to different degrees, due to word components or the speech coding.
Article
Full-text available
This research adapted and implemented an algorithm for voice commanding using speech recognition in the Arabic language in addition to English, with the ability to train the system on other languages. The recognition is based on discrete wavelet transform coefficients. An intelligent recognizer is built with two models: the first uses neural networks, and the second is a fuzzy logic recognizer. The proposed speech recognition system consists of three phases: a preprocessing phase (two processes are performed on the sound, DC level removal and resizing each sample to 2000 samples), a feature extraction phase (extracting the features that distinguish each sound from another, namely the wavelet transform coefficients), and a recognition phase (many classifiers could be used; in this research, supervised neural networks (MLP) and fuzzy logic classifiers are used). This research is also concerned with studying the recognition ability of the MLP neural network and Sugeno-type fuzzy logic systems for the recognition of the Arabic and English languages. The neural networks are trained with features extracted from the discrete wavelet transform, which enables exact features to be extracted from the speech. The research illustrates the effect of using the two different intelligent approaches in MATLAB, applying the voice commands directly to an automated wheeled vehicle.
Article
Full-text available
In this paper we examined the human voices of adult male Moroccan speakers (20 smokers and 20 non-smokers) to determine the effects of cigarette smoking on formant frequencies, pitch, shimmer, and jitter, based on 3 Amazigh language vowels (A, I, U). The statistical data parameters are collected from speakers aged between 26 and 50 years old. Our results show that the pitch values of smokers are lower compared to those of non-smokers. Also, smokers' formant frequencies F1 and F2 are close to those of non-smokers for the three considered vowels, whereas F3 and F4 are lower in the case of smokers. Shimmer and jitter analysis showed higher values for these parameters among smokers.
Article
Full-text available
Conventional Hidden Markov Model (HMM) based Automatic Speech Recognition (ASR) systems generally utilize cepstral features as acoustic observations and phonemes as basic linguistic units. Some of the most powerful features currently used in ASR systems are Mel-Frequency Cepstral Coefficients (MFCCs). Speech recognition is inherently complicated due to the variability in the speech signal, which includes within- and across-speaker variability. This leads to several kinds of mismatch between acoustic features and acoustic models and hence degrades system performance. The sensitivity of MFCCs to speech signal variability motivates many researchers to investigate new sets of speech feature parameters in order to make the acoustic models more robust to this variability and thus improve system performance. The combination of diverse acoustic feature sets has great potential to enhance the performance of ASR systems. This paper is part of ongoing research efforts aspiring to build an accurate Arabic ASR system for teaching and learning purposes. It addresses the integration of complementary features into standard HMMs in order to make them more robust and thus improve their recognition accuracies. The complementary features investigated in this work are voiced formants and pitch, in combination with conventional MFCC features. A series of experiments under various combination strategies was performed to determine which of these integrated features can significantly improve system performance. The Cambridge HTK tools were used as the development environment, and experimental results showed that the error rate was successfully decreased; the achieved results seem very promising, even without using language models.
Article
Full-text available
Automatic speech recognition is a technology that allows a computer to transcribe spoken words into readable text in real time. In this work, an HMM-based automatic speech recognition system was created to detect smoker speakers. This research project is carried out using the Amazigh language to compare the voices of non-smokers with those of smokers. To achieve this goal, two experiments were performed: the first tests the performance of the system for non-smokers with different parameters; the second concerns smoker speakers. The corpus used in this system is collected from two groups of speakers, non-smokers and smokers, all native Moroccan Tarifit speakers aged between 25 and 55 years. Our experimental results show that our system can be used as a diagnostic aid for smoking and can confirm that a speaker is a smoker when the observed recognition rate is below 50%.
Article
Full-text available
Building a large vocabulary continuous speech recognition (LVCSR) system requires many hours of segmented and labelled speech data. Arabic, like many other low-resourced languages, lacks such data, but the use of automatic segmentation has proved to be a good alternative for making these resources available. In this paper, we suggest the combination of hidden Markov models (HMMs) and support vector machines (SVMs) to segment and label the speech waveform into phoneme units. The HMMs generate the sequence of phonemes and their frontiers; the SVM refines the frontiers and corrects the labels. The obtained segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores were compared to those provided by the manual segmentation and by the well-known embedded learning algorithm. The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms, in terms of WER, the one built upon the embedded-learning segmentation by about 0.05%, even in noisy background.
Article
Full-text available
The aim of this paper is to describe the development of a speaker-independent continuous automatic Amazigh speech recognition system. The designed system is based on the Carnegie Mellon University Sphinx tools. In the training and testing phases, an in-house Amazigh_Alphadigits corpus was used. This corpus was collected in the framework of this work and consists of the speech, with transcriptions, of 60 Berber Moroccan speakers (30 males and 30 females), native speakers of Tarifit Berber. The system obtained its best performance of 92.89% when trained using 16 Gaussian mixture models.
Conference Paper
In this paper, we describe a novel technique: a combined automatic speech recognition and language identification system that uses both ASR and LI technologies to recognize spoken digits after identifying their language. An in-house corpus was used for both the speech-based multilingual identification and speech recognition tasks, made up of bilingual digit sounds containing ten digits spoken in two languages: Modern Standard Arabic (MSA) and the Amazigh Moroccan dialect. First, we develop the language identification stage, the basis of our hybrid system, which behaves as its front end and serves for spoken-language detection. This facilitates the recognition task by allocating the output to the appropriate hidden Markov model (HMM) based recognition system (Arabic or Amazigh), which efficiently improves the recognition of a bilingual spoken digit. For this purpose, a set of parameters was adjusted in our CLIASR system to achieve good results, including classifier parameters, the feature vector, and HMM-GMM parameters. The results show that our proposed LI-ASR system performs 33% better than an ordinary ASR system on a given bilingual mixed speech corpus.
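The language-identification front end described above scores an utterance under per-language acoustic models and routes it to the matching recognizer. The toy sketch below uses a single diagonal Gaussian per language as a stand-in for GMM scoring; the class name and structure are illustrative assumptions, not the CLIASR implementation.

```python
import numpy as np

class DiagonalGaussianLID:
    """Toy language-ID front end: one diagonal Gaussian per language over
    frame-level features; an utterance is routed to the language whose
    model yields the highest total log-likelihood."""

    def __init__(self):
        self.models = {}  # language name -> (mean vector, variance vector)

    def fit(self, language, frames):
        frames = np.asarray(frames, dtype=float)
        # Floor the variance to keep the log-likelihood finite.
        self.models[language] = (frames.mean(axis=0), frames.var(axis=0) + 1e-6)

    def _log_likelihood(self, frames, mean, var):
        frames = np.asarray(frames, dtype=float)
        return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                    + (frames - mean) ** 2 / var)))

    def identify(self, frames):
        # Pick the language whose model best explains the frames.
        scores = {lang: self._log_likelihood(frames, m, v)
                  for lang, (m, v) in self.models.items()}
        return max(scores, key=scores.get)
```

In a combined system, `identify` would select which HMM-based digit recognizer (Arabic or Amazigh) decodes the utterance.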
Conference Paper
This paper investigates Amazigh speech recognition and its use for controlling external devices. We describe our experience designing a speech system founded on hidden Markov models (HMMs), Gaussian mixture models (GMMs), and Mel frequency cepstral coefficients (MFCCs), with parameter optimization to allow portability to resource-limited embedded systems. Our objective is to develop an Amazigh speech-controlled system on a Raspberry Pi board, and to achieve the best automatic speech recognition parametrization for low-cost minicomputers using a speaker-independent approach. The designed speech system was implemented on the open-source platform. The system achieves its best performance of 90.43% when trained using 3 HMMs and 16 GMMs.
Conference Paper
This study aims to investigate the possible use of speech rhythm metrics as a new feature for speech emotion recognition, gender identification, and regional accent identification. Further, it aims to evaluate a new Arabic speech emotion corpus. The King Saud University Emotions (KSUEmotions) speech corpus contains five emotions: neutral, sadness, happiness, surprise, and anger. For this study, speech acoustic features are extracted and used to classify the speakers' emotions. All classification results were obtained using multilayer perceptron (MLP) neural network and support vector machine (SVM) classifiers. Results demonstrate that the rhythm metrics alone are not sufficient for speech emotion classification; nevertheless, they can improve classifier accuracy when combined with other speech acoustic features. The results also show an average performance accuracy of 54.07% for KSUEmotions Phase 1 and 84.14% for Phase 2, with sadness achieving the best classification accuracy among the emotions.
Chapter
In the last decade, automatic speech pathology detection systems based on voice production theory have continued to evolve. Overall, there have not been many speech technology studies on voice disorders centered on the Amazigh language. This research project focuses on building an automatic speech recognition system based on Sphinx-4 that can detect the differences between normal and pathological voices from the produced speech. The performance of our system was measured using combinations of different hidden Markov models and Gaussian mixture distributions. Results show that the maximum accuracy with normal voices is greater than the maximum accuracy obtained from the pathological speakers.
Chapter
Automatic speech recognition (ASR) for Amazigh speech, particularly Moroccan Tarifit accented speech, is a little-researched area. Some efforts to develop ASR for Moroccan-accented Amazigh speech in a clean environment have been studied in our previous works. In this paper, we analyze the effect of car noise at different signal-to-noise ratios (SNR) on the first ten Amazigh-accented digits at different decibel values. Various techniques are used for isolated speech recognition, such as hidden Markov models (HMMs) and Mel-frequency cepstral coefficients (MFCC). The experimental results show a recognition rate of 88.22% in a clean environment, and rates of 59.26% and 33.83% in noisy conditions at SNR 10 dB and 20 dB, respectively.
Chapter
This paper aims to build an interactive speaker-independent automatic Amazigh speech recognition system. The proposed system offers a methodology to extract data from a remote database by combining interactive voice response (IVR) and automatic speech recognition (ASR) technologies. We describe our experience designing an interactive speech system based on hidden Markov models (HMMs), Gaussian mixture models (GMMs), and Mel frequency cepstral coefficients (MFCCs), using the first ten Amazigh digits and six Amazigh words. The best obtained performance is 89.64%, using 3 HMMs and 16 GMMs.
Article
Arabic is the native language of over 300 million speakers and one of the official languages of the United Nations. It has a unique set of diacritics that can alter a word's meaning. Arabic automatic speech recognition (ASR) has received little attention compared to other languages, and research has in most cases been oblivious to the diacritics. Omitting diacritics circumscribes the Arabic ASR system's usability for several applications such as voice-enabled translation, text to speech, and speech-to-speech. In this paper, we study the effect of diacritics on Arabic ASR systems. Our approach is based on building and comparing diacritized and nondiacritized models for different corpus sizes. In particular, we build Arabic ASR models using state-of-the-art technologies for 1, 2, 5, 10, and 23 h. Each of these models was trained once with a diacritized corpus and another time with a nondiacritized version of the same corpus. The KALDI toolkit and SRILM were used to build eight models for each corpus: GMM-SI, GMM-SAT, GMM-MPE, GMM-MMI, SGMM, SGMM-bMMI, DNN, and DNN-MPE. Eighty different models were created using this experimental setup. Our results show that word error rates (WERs) ranged from 4.68%, and that adding diacritics increased the WER by 0.59%. Although diacritics increased WERs, it is recommended to include diacritics in ASR systems when they are integrated with other systems such as voice-enabled translation. We believe that the benefit to the overall accuracy of the integrated system (e.g., translation) outweighs the WER increase for the Arabic ASR system.
Article
This paper relates to the speech coding problems that occur in VoIP-based automatic speech recognition, where speech is coded for transmission from the user to the recognition server. We evaluate the influence of the G.711 and GSM audio codecs on speech recognition performance. In our approach, Mel-frequency cepstral coefficients are used as the feature extraction technique, and Gaussian mixture models and hidden Markov models are exploited for feature modelling. Our vocabulary includes the Amazigh letters. Our findings indicate that the best system performance was found with the G.711 codec, 3 HMMs, and 16 GMMs.
Conference Paper
This paper describes our experience creating a secure telephony Amazigh spoken system over the network. The designed system combines HMM-based automatic speech recognition and IVR technologies. It allows the user to control the security network and backup system from a distance, using his voice for network administration tasks and his biological voiceprint for security identification. This research project is carried out using the Amazigh language for the interaction system. The designed system was implemented on the open-source platform. Our experimental results show that the administrator recognition rates were all above 80%, whereas the non-admin recognition rate is less than 6%, which demonstrates the security aspect of our system.
Article
This paper aims at determining the best way to exploit the phonological properties of the Arabic language in order to improve the performance of the speech recognition system. One of the main challenges facing the processing of Arabic is the effect of the local context, which induces changes in the phonetic representation of a given text, thereby causing the recognition engine to misclassify it. The proposed solution is to develop a set of language-dependent grapheme-to-allophone rules that can predict such allophonic variations and hence provide a phonetic transcription that is sensitive to the local context for the automatic speech recognition system. The novel aspect of this method is that the pronunciation of each word is extracted directly from a context-sensitive phonetic transcription rather than a predefined dictionary that typically does not reflect the actual pronunciation of the word. The paper also aims at employing the stress feature as one of the supra-segmental characteristics of speech to enhance the acoustic modelling. The effectiveness of applying the proposed rules has been tested by comparing the performance of a dictionary based system against one using the automatically generated phonetic transcription. The research reported an average of 9.3% improvement in the system's performance by eliminating the fixed dictionary and using the generated phonetic transcription to learn the phone probabilities. Marking the stressed vowels with separate stress markers leads to a further improvement of 1.7%.
Conference Paper
This research paper aims to develop an isolated-word automatic speech recognition (IWASR) system based on vector quantization (VQ). The system receives, analyzes, searches, and matches an input speech signal against the trained set of speech signals stored in the database/codebook, and returns matching results to users. IWASR is meant to assist customers calling a university's telephone operator by responding to their enquiries conveniently using natural speech; callers are assisted in selecting the language, faculty, and staff name they wish to contact. To extract features from the speech signals, the Mel-frequency cepstral coefficients (MFCC) algorithm was applied, and vector quantization was then applied to all feature vectors generated from the MFCC. A codebook resulted from training the initial VQ codebook, and experimental results showed that the recognition rate improved as the codebook size increased; a codebook size of 81 feature vectors achieved a recognition rate exceeding 85%.
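The VQ approach above trains a codebook of representative feature vectors and matches an input by its quantization distortion. A minimal sketch under common assumptions (plain k-means as the codebook trainer, Euclidean distortion for matching; not the paper's exact procedure):

```python
import numpy as np

def train_codebook(features, codebook_size=8, iterations=20, seed=0):
    """Plain k-means over feature vectors; the centroids form the VQ codebook."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    codebook = features[rng.choice(len(features), codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign each vector to its nearest codeword (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for k in range(codebook_size):
            members = features[nearest == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def distortion(features, codebook):
    """Average distance from each vector to its nearest codeword; matching
    picks the word/speaker codebook with the lowest distortion."""
    features = np.asarray(features, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

At recognition time, the input utterance's MFCC vectors are scored against every trained codebook and the lowest-distortion entry wins.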
Article
This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source for acquiring the background required to pursue this area of research further. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin tossing and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
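The first of the tutorial's three fundamental problems, computing the probability of an observation sequence given a model, is solved by the forward algorithm. A minimal NumPy sketch for a discrete-output HMM:

```python
import numpy as np

def forward(pi, A, B, observations):
    """Total likelihood P(O | model) for a discrete-output HMM.
    pi: initial state probabilities, shape (N,)
    A:  state transition matrix, shape (N, N)
    B:  emission probabilities, shape (N, M) over M symbols
    observations: sequence of symbol indices."""
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
        alpha = (alpha @ A) * B[:, o]
    # Termination: sum over final states.
    return float(alpha.sum())
```

In practice the recursion is run in the log domain (or with per-frame scaling, as Rabiner describes) to avoid underflow on long sequences.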
Satori, H., Hiyassat, H., Haiti, M., & Chenfour, N. (2009). Investigation Arabic Speech Recognition Using CMU Sphinx System. International Arab Journal of Information Technology (IAJIT), 6(2).