International Journal of Speech Technology (2025) 28:53–65
https://doi.org/10.1007/s10772-024-10164-y
Exploring data augmentation for Amazigh speech recognition with convolutional neural networks
Hossam Boulal1 · Farida Bouroumane1 · Mohamed Hamidi2 · Jamal Barkani1 · Mustapha Abarkan1
Received: 10 September 2024 / Accepted: 26 October 2024 / Published online: 14 November 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
In the field of speech recognition, enhancing accuracy is paramount for diverse linguistic communities. Our study addresses this necessity, focusing on improving Amazigh speech recognition through the implementation of three distinct data augmentation methods: Audio Augmentation, FilterBank Augmentation, and SpecAugment. Leveraging Convolutional Neural Networks (CNNs) for speech recognition, we utilize Mel Spectrograms extracted from audio files. The study specifically targets the recognition of the initial ten Amazigh digits. We conducted experiments with a speaker-independent approach involving 42 participants. A total of 27 experiments were conducted, utilizing both original and augmented data. Among the different CNN models employed, the VGG19 model showcased significant promise. Our results demonstrate a maximum accuracy of 95.66%. Furthermore, the most notable improvement achieved through data augmentation was 4.67%. These findings signify a substantial enhancement in speech recognition accuracy, indicating the efficacy of the proposed methods.
Keywords Speech recognition · Data augmentation · Deep learning · Feature extraction · Amazigh digits
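To make the pipeline summarized in the abstract concrete, the following is a minimal sketch, assuming librosa and NumPy, of how log-Mel spectrogram inputs for a CNN can be produced and how waveform-level augmentation and SpecAugment-style time/frequency masking can be applied. The sampling rate, number of Mel bands, mask widths, and file name are illustrative assumptions, not the configuration reported in the paper.

```python
import numpy as np
import librosa

SR = 16000      # assumed sampling rate
N_MELS = 64     # assumed number of Mel bands
rng = np.random.default_rng(0)

def mel_spectrogram(y, sr=SR):
    """Log-Mel spectrogram used as the CNN input image."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS,
                                       n_fft=400, hop_length=160)
    return librosa.power_to_db(S, ref=np.max)

def audio_augment(y, sr=SR):
    """Waveform-level augmentation: speed/pitch perturbation plus light noise."""
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=int(rng.integers(-2, 3)))
    return y + 0.005 * rng.standard_normal(len(y))

def spec_augment(mel, max_f=8, max_t=10):
    """SpecAugment-style masking: blank one frequency band and one time span."""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    f = int(rng.integers(1, max_f + 1))          # frequency mask width
    f0 = int(rng.integers(0, n_mels - f))
    mel[f0:f0 + f, :] = mel.mean()
    t = int(rng.integers(1, max_t + 1))          # time mask width
    t0 = int(rng.integers(0, max(1, n_frames - t)))
    mel[:, t0:t0 + t] = mel.mean()
    return mel

# Hypothetical usage on a single recording (file name is illustrative):
# y, _ = librosa.load("amazigh_digit_yan.wav", sr=SR)
# features = spec_augment(mel_spectrogram(audio_augment(y)))
```

In a setup of this kind, each augmented spectrogram is simply added to the training set as an extra example alongside the original recording.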
1 Introduction
The domain of speech recognition (Jean Louis et al., 2022; Huang & Deng, 2010) wields significant influence in reshaping our interactions with computers (Yadava & Jayanna, 2017), not only catering to the needs of individuals with specific requirements (Mayer, 2018) but also extending its reach to a broader audience (Besacier et al., 2014). Advancing the computer's comprehension of speech not only serves users but also facilitates technology in adapting to the inherently speech-centric nature of human communication. While numerous researchers primarily concentrate on enhancing the computer's ability to understand widely spoken languages such as English, our aspiration is to address the gap in the identification of lesser-known languages, particularly those with limited resources. An exemplary instance of such a language is Amazigh. Despite recent attention drawn towards these languages, there remains a scarcity of resources dedicated to their recognition. Our aim is to contribute towards filling this void, thereby enriching the broader landscape of speech recognition research and fostering inclusivity and diversity in technology. In our current framework, our focus lies on the Amazigh language, spoken by millions across North Africa, predominantly in Morocco, Algeria, Tunisia, and Libya. This language, characterized by its diverse dialects and rich oral heritage, remains largely underserved by speech recognition technology (Table 1).
The Berber language, also referred to as Amazigh (Chaker, 1984; Ouakrim, 1995; Ridouane, 2003; Boukous, 2014), holds the distinction of being considered the indigenous language of the Maghreb region by many. According to numerous linguistic studies, it is believed to have been in use for approximately 5000 years (Boukous, 1995). Spanning across a vast expanse of North Africa, including countries such as Algeria, Morocco, and Tunisia, as well as certain areas in neighboring Sub-Saharan nations, Amazigh serves as a means of communication (Idhssaine & El Kirat, 2021). Notably, Morocco has granted official status to Amazigh due to a significant proportion of its population being proficient in the language. The standardization process for the language commenced in 2001 with the establishment of the Royal Institute of Amazigh Culture (IRCAM). This institute has played a pivotal role in various endeavors aimed at
* Mohamed Hamidi
m.hamidi@ump.ac.ma
1 LSI Laboratory, FP Taza, USMBA University, Taza, Morocco
2 Team of Modeling and Scientific Computing, FPN, UMP, Nador, Morocco