Source publication
In this paper, we present a set of experiments aiming to improve the recognition of spoken digits for under-resourced dialects of the Maghrebi region, using a hybrid system. Integrating a Dialect Identification module into an Automatic Speech Recognition (ASR) system has proven effective in previous work. In order to make the ASR syste...
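As a rough illustration of the hybrid idea in this abstract, the sketch below routes each utterance through a dialect identification (DID) front-end before dispatching it to a dialect-matched digit recognizer. The class names, labels, and stub logic are invented placeholders, not the authors' implementation.

```python
class StubDID:
    """Placeholder dialect identifier; a real system would be a trained classifier."""
    def predict(self, audio):
        return "MAD"

class StubASR:
    """Placeholder dialect-specific digit recognizer."""
    def __init__(self, dialect):
        self.dialect = dialect
    def transcribe(self, audio):
        return f"<digits decoded with the {self.dialect} model>"

def recognize(audio, did, asr_by_dialect):
    dialect = did.predict(audio)                      # step 1: identify the dialect
    return asr_by_dialect[dialect].transcribe(audio)  # step 2: dialect-matched ASR

asr_models = {d: StubASR(d) for d in ("MAD", "AAD", "MSA")}
print(recognize(b"...", StubDID(), asr_models))
```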
Similar publications
Selective state space models (SSMs), represented by Mamba, have demonstrated computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). Mamba has been applied to the ASR task within the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remain...
Automatic speech recognition (ASR) for electrolaryngeal speakers has remained relatively unexplored due to small datasets. When training data is lacking in ASR, a large-scale pretraining and fine-tuning framework is often sufficient to achieve high recognition rates; however, in electrolaryngeal speech, the domain shift between the...
Automatic Speech Recognition (ASR) is a constantly evolving area of research, driven by important advancements such as machine learning and deep learning techniques. Its applications are wide-ranging, touching fields like healthcare, public services, and human-machine interfaces. Particularly noteworthy is the pressing nee...
In Quebec and Canadian courts, the transcription of court proceedings is a critical task for appeal purposes and must be certified by an official court reporter. The limited availability of qualified reporters and the high costs associated with manual transcription underscore the need for more efficient solutions. This paper examines the potential...
This paper proposes a self-regularised minimum latency training (SR-MLT) method for streaming Transformer-based automatic speech recognition (ASR) systems. In previous works, latency was optimised by truncating the online attention weights based on the hard alignments obtained from conventional ASR models, without taking into account the potential...
Citations
... Notably, performance degradation is less pronounced under car noise compared to grinder noise environments. Lounnas et al. (2022) utilized a hybrid methodology to enhance spoken digit recognition for under-resourced languages in the Maghreb region. They incorporated a dialect identification module into the Automatic Speech Recognition (ASR) system, trained on Moroccan Arabic Dialect (MAD), Algerian Arabic Dialect (AAD), Moroccan Amazigh (Berber) Dialect (MBD), and Modern Standard Arabic. ...
In the field of speech recognition, enhancing accuracy is paramount for diverse linguistic communities. Our study addresses this necessity, focusing on improving Amazigh speech recognition through the implementation of three distinct data augmentation methods: Audio Augmentation, FilterBank Augmentation, and SpecAugment. Leveraging Convolutional Neural Networks (CNNs) for speech recognition, we utilize Mel Spectrograms extracted from audio files. The study specifically targets the recognition of the initial ten Amazigh digits. We conducted experiments with a speaker-independent approach involving 42 participants. A total of 27 experiments were conducted, utilizing both original and augmented data. Among the different CNN models employed, the VGG19 model showcased significant promise. Our results demonstrate a maximum accuracy of 95.66%. Furthermore, the most notable improvement achieved through data augmentation was 4.67%. These findings signify a substantial enhancement in speech recognition accuracy, indicating the efficacy of the proposed methods.
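To make the SpecAugment step described above concrete, here is a minimal NumPy sketch of the frequency- and time-masking idea applied to a Mel spectrogram; the mask sizes, the mean fill value, and the array shapes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spec_augment(mel, n_freq_masks=1, n_time_masks=1, max_f=8, max_t=16, seed=0):
    """Mask random frequency bands and time spans of a (n_mels, n_frames) array."""
    rng = np.random.default_rng(seed)
    out, fill = mel.copy(), mel.mean()
    n_mels, n_frames = mel.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(1, max_f + 1))      # mask height in mel bins
        f0 = int(rng.integers(0, n_mels - f))
        out[f0:f0 + f, :] = fill                 # blank out a frequency band
    for _ in range(n_time_masks):
        t = int(rng.integers(1, max_t + 1))      # mask width in frames
        t0 = int(rng.integers(0, n_frames - t))
        out[:, t0:t0 + t] = fill                 # blank out a time span
    return out

# Example: augment a dummy 128-band Mel spectrogram of 200 frames.
augmented = spec_augment(np.random.rand(128, 200))
```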
... To tackle these challenges, this study focuses on developing effective DI systems for Maghrebi dialects by creating a spoken dataset (AAD, ABD, MAD, MBD, and MSA) (for more detail, please refer to Section 3) [32,33] and leveraging spectrogram-based features to evaluate multiple transfer learning approaches. By comparing individual transfer learning models with a stacking generalization strategy, which merges the output of retrained models, this study seeks to enhance system performance in distinguishing these closely related dialects. ...
As dialects are widely used in many countries, there is growing interest in incorporating them into various applications, including conversational systems. Processing spoken dialects is an important module in such systems, yet it remains a challenging task due to the lack of resources and the inherent ambiguity and complexity of dialects. This paper presents a comparison of two approaches for identifying spoken Maghrebi dialects, tested on an in-house corpus composed of four dialects: Algerian Arabic Dialect (AAD), Algerian Berber Dialect (ABD), Moroccan Arabic Dialect (MAD), and Moroccan Berber Dialect (MBD), as well as two variants of Modern Standard Arabic (MSA): MSA_ALG and MSA_MAR. The first method uses a fully connected neural network (NN2) to retrain several Transfer Learning (TL) models with varying layer numbers, including Residual Networks (ResNet50, ResNet101), Visual Geometric Group networks (VGG16, VGG19), Dense Convolutional Networks (DenseNet121, DenseNet169), and Efficient Convolutional Neural Networks for Mobile Vision Applications (MobileNet, MobileNetV2). These models were chosen based on their proven ability to capture different levels of feature abstraction: deeper models like ResNet and DenseNet are capable of capturing more complex and nuanced patterns, which is critical for distinguishing subtle differences in dialects, while VGG and MobileNet models offer computational efficiency, making them suitable for applications with limited resources. The second approach employs a “stacked generalization” strategy, which merges predictions from the previously trained models to enhance the final classification performance. Our results show that this cascade strategy improves the overall performance of the Language/Dialect Identification system, with an accuracy increase of up to 5% for specific dialect pairs. Notably, the best performance was achieved with DenseNet and ResNet models, reaching an accuracy of 99.11% for distinguishing between Algerian Berber Dialect and Moroccan Berber Dialect. These findings indicate that despite the limited size of the employed dataset, the cascade strategy and the selection of robust TL models significantly enhance the system’s performance in dialect identification. By leveraging the unique strengths of each model, our approach demonstrates a robust and efficient solution to the challenge of spoken dialect processing.
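A minimal sketch of the "stacked generalization" step described above: class-probability outputs of several base models (simulated here with random arrays standing in for ResNet/VGG/DenseNet predictions) are concatenated and used to train a meta-classifier. The choice of logistic regression as the meta-learner and all shapes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 6    # e.g. AAD, ABD, MAD, MBD, MSA_ALG, MSA_MAR
y = rng.integers(0, n_classes, n_samples)

# Stand-ins for predict_proba() outputs of three retrained base models.
base_probs = [rng.dirichlet(np.ones(n_classes), n_samples) for _ in range(3)]
meta_features = np.hstack(base_probs)            # (n_samples, 3 * n_classes)

# The meta-learner learns to merge the base models' predictions.
meta = LogisticRegression(max_iter=1000).fit(meta_features, y)
print("meta-learner train accuracy:", meta.score(meta_features, y))
```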
... Hindi [17] (2017), Isarn [18] (2017), Odia [19] (2019), Khasi [20] (2022), Algerian [21] (2022), Moroccan [21] (2022), Swahili [22] (2023), Kui (2024) ...
Speech digit recognition research is growing rapidly, yet most digit recognition work targets European and a few Asian languages. Kui is a low-resourced tribal language spoken in several states of India. Despite its significance, there is little research on Kui speech. This research presents an in-depth analysis of Kui digit recognition using established machine learning (ML) techniques. For this purpose, we first gathered the spoken numbers 0 to 9 from eight different speakers, for a total of 200 words. The numbers are: ଶୂନ (zero), ଏକ (one), ଦୁଇ (two), ତିନି (three), ସାରି (four), ପାସ (five), ସଅ (six), ସାତ (seven), ଆଟ (eight), ନଅ (nine). We then built nine different ML models to recognize Kui digits, using Mel-frequency cepstral coefficients (MFCCs) to extract the relevant features for model prediction. Finally, we compared the performance of the ML models on both augmented and non-augmented Kui data. The results show that the SVM-with-augmentation method obtained the highest accuracy, 83%, outperforming the other methods. The difficulties and potential prospects of Kui digit recognition are also highlighted in this work.
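The MFCC + SVM recipe in this abstract can be sketched in a few lines with librosa and scikit-learn. Synthetic tones stand in for the Kui recordings, and the 13-coefficient setting and train/test split are assumptions.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(y, sr, n_mfcc=13):
    """Average MFCCs over time to get one fixed-size vector per utterance."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Demo on synthetic tones standing in for the 200 recorded Kui words:
# each "digit" k gets a distinct pitch so the classes are separable.
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
X = np.stack([mfcc_features(np.sin(2 * np.pi * (200 + 30 * (k % 10)) * t), sr)
              for k in range(40)])
y = np.array([k % 10 for k in range(40)])        # digit labels 0-9

clf = SVC(kernel="rbf").fit(X[:30], y[:30])      # illustrative train/test split
print("held-out accuracy:", clf.score(X[30:], y[30:]))
```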
... Using a hybrid approach, Lounnas et al. (2022) reported a series of studies aimed at enhancing spoken digit recognition for under-resourced languages of the Maghreb area based on two deep learning models and five machine learning classifiers. Prior studies have demonstrated the effectiveness of adding a dialect identification module to an ASR system. ...
The field of speech recognition makes it simpler for humans and machines to interact through speech. Number-oriented communication, such as giving a registration code, mobile number, score, or account number, can benefit from speech recognition for digits. This paper presents our Amazigh automatic speech recognition (ASR) experience based on a deep learning approach. As part of this strategy, audio samples are converted to Mel spectrograms, which are then evaluated by a convolutional neural network (CNN). To recognize the Amazigh numerals, we use a database of digits from zero to nine collected from 42 native speakers in total, men and women between the ages of 20 and 40. Our experimental results show that spoken digits in Amazigh can be identified with a maximum accuracy of 93.62%, 94% precision, and 94% recall.
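In the spirit of the CNN-over-Mel-spectrogram approach above, here is a minimal PyTorch sketch of a 10-way digit classifier; the layer sizes and input resolution are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                            # input: (batch, 1, 128 mels, 128 frames)
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 64), nn.ReLU(),       # 32 channels x 32 x 32 after pooling
    nn.Linear(64, 10),                            # one logit per spoken digit 0-9
)

logits = model(torch.randn(4, 1, 128, 128))       # dummy batch of 4 spectrograms
print(logits.shape)                               # torch.Size([4, 10])
```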
... Lounnas K. et al. [14] reported a series of studies using a hybrid methodology to improve spoken digit recognition for under-resourced languages of the Maghreb region. The value of including a dialect identification module in an Automatic Speech Recognition (ASR) system has been shown in earlier research. ...
The field of speech recognition has made human-machine voice interaction more convenient. Recognizing spoken digits is particularly useful for communication that involves numbers, such as providing a registration code, cellphone number, score, or account number. This article discusses our experience with Amazigh Automatic Speech Recognition (ASR) using a deep learning-based approach. Our method uses a convolutional neural network (CNN) with Mel-Frequency Cepstral Coefficients (MFCC) to analyze audio samples and generate spectrograms. To recognize Amazigh numerals, we gathered a database of the digits zero through nine spoken by 42 native Amazigh speakers, men and women between the ages of 20 and 40. Our experimental results demonstrate that spoken digits in Amazigh can be recognized with an accuracy of 91.75%, 93% precision, and 92% recall. These preliminary results are satisfying given the size of the training database, and they motivate us to further enhance the system's performance to reach a higher recognition rate. Our findings align with those reported in the existing literature.
... Automatic speech recognition (ASR) is defined as the independent, computer-driven transcription of spoken language into readable text [6]. Figure 1 shows a typical ASR architecture. Recently, our lab's researchers have targeted applications of automatic speech recognition for the Moroccan Amazigh language [13][14][15][16][17][18][19]. ...
This paper is part of our contribution to research on enhancing the performance of network automatic speech recognition systems. We built a highly configurable platform using hidden Markov models (HMMs), Gaussian mixture models (GMMs), and Mel frequency spectral coefficients, together with the VoIP G.711-u and GSM codecs. To determine the optimal values for maximum performance, different acoustic models were prepared by varying the number of HMM states (from 3 to 5) and Gaussian mixtures (8, 16, 32) with 13 feature extraction coefficients. Additionally, the generated acoustic models were tested with unencoded speech data and with speech encoded by the G.711 and GSM codecs. The best performance is obtained with 3 HMM states, 8-16 GMMs, and the G.711 codec.
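The parameter sweep described above (3-5 HMM states, 8/16/32 Gaussian mixtures, 13 coefficients) can be sketched with hmmlearn as a stand-in for the authors' toolchain; the random training data, single-sequence fit, and iteration count are placeholders just to make the sketch run.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 13))     # stand-in for 13 MFCC-style coefficients
X_test = rng.standard_normal((100, 13))

for n_states in (3, 4, 5):                   # HMM sizes tried in the paper
    for n_mix in (8, 16, 32):                # GMM mixture counts tried
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=5, random_state=0)
        model.fit(X_train)                   # train one candidate acoustic model
        print(n_states, n_mix, model.score(X_test))  # held-out log-likelihood
```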
... For the ASI, AGR, AID, and SLU tasks, we will adopt the same models used in our previous works [26], [27]. In the first stage, each audio file in the AVC dataset is converted to a concatenation of 5 features using a feature extraction method based on the librosa tool [28], producing an array of 193 coefficients: 12 chroma coefficients, 128 Mel spectrogram coefficients, 7 contrast coefficients, and 6 tonnetz coefficients, in addition to 40 Mel Frequency Cepstral Coefficients (MFCC). ...
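A sketch of the 193-coefficient vector described in this snippet, built with librosa (40 MFCC + 12 chroma + 128 Mel + 7 contrast + 6 tonnetz, each averaged over time); the concatenation order, loading parameters, and demo signal are assumptions.

```python
import numpy as np
import librosa

def extract_193(y, sr):
    """Concatenate five time-averaged librosa features into one 193-dim vector."""
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)           # 40
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)          # 12
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)             # 128
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)  # 7
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y),
                                              sr=sr), axis=1)                     # 6
    return np.hstack([mfcc, chroma, mel, contrast, tonnetz])

# Demo on a synthetic tone; in the paper this is applied to each AVC audio file.
sr = 22050
y = np.sin(2 * np.pi * 440 * np.linspace(0, 1, sr, endpoint=False))
print(extract_193(y, sr).shape)              # (193,)
```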
Expanding Internet connectivity has had a tremendous influence on people's everyday lives, since they do everything on their phones and laptops [1]. Researchers have developed a variety of products to improve people's lives, notably for the elderly and disabled, while remaining technologically advanced. Voice-command-enabled technologies, such as SIRI and Google voice commands, are among the most useful. These systems are built on the speech recognition module, one of the most important modules for making human-machine communication easier. Automatic Speech Recognition (ASR) has made significant progress toward human-like performance by employing data-driven methods [2]. In this paper, we created an Arabic voice command dataset which includes 10 commands spoken by 10 speakers, each repeated 10 times. Despite its size, the dataset was evaluated on four speech processing tasks, achieving an accuracy of 95.9% in ASR and macro F1-scores of 99.67%, 100%, 100%, and 97.98% in speaker identification, gender recognition, accent recognition, and spoken language understanding, respectively.
... The ASR component of our Pipeline SLU is based on the CMU-SPHINX tool and aims to develop a monolingual speech recognition system for each language (Arabic, English, and French). In the first phase of the ASR module, we employed Mel frequency spectral coefficients for feature extraction and a GMM-HMM combination approach for training [29], [30]. Secondly, we built three acoustic models, one for each language, using the SphinxTrain tools based on the dictionary, language model, and speech data. ...
This work conducts a comparative investigation of two architectures in the domain of Spoken Language Understanding (SLU), evaluated on a synthesized corpus of three languages: Modern Standard Arabic (MSA), French, and English. The first architecture is a simple SLU system based on classical machine learning algorithms (E2E SLU), whereas the second (Pipeline SLU) feeds the textual output of a speech recognition (ASR) system to a "Natural Language Understanding" (NLU) text classification model, allowing us to compare the predictions of the two systems. The results were encouraging: the Pipeline approach gave better results than the E2E approach.
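To illustrate the Pipeline SLU idea, the sketch below feeds ASR hypotheses (hard-coded strings standing in for CMU-Sphinx output) to a small text-classification NLU model; the intents and example utterances are invented placeholders, not the paper's synthesized corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical ASR transcripts and their intent labels (placeholders).
asr_hypotheses = ["turn on the light", "turn off the light",
                  "what time is it", "tell me the time"]
intents = ["light_on", "light_off", "ask_time", "ask_time"]

# NLU = a text classifier applied to the ASR module's textual output.
nlu = make_pipeline(TfidfVectorizer(), LogisticRegression())
nlu.fit(asr_hypotheses, intents)
print(nlu.predict(["please turn on the light"]))   # expected: ['light_on']
```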