Conference Paper

Russian-Language Speech Recognition System Based on Deepspeech

Abstract

The paper examines practical issues in developing a speech-to-text system using deep neural networks. We describe the development of a Russian-language speech recognition system based on the DeepSpeech architecture, using Mozilla's open-source implementation of DeepSpeech for English as the starting point. The system was trained in a containerized environment using Docker, which allowed us to document the entire process of building the components from source code, including a number of optimization techniques for CPU and GPU, and makes it easy to reproduce the computation optimization tests on alternative infrastructures. We examined the use of TensorFlow XLA, a technology that optimizes linear algebra computations during neural network training. The number of nodes in the internal layers of the neural network was tuned to minimize the word error rate (WER) obtained on a test data set, subject to GPU memory limitations. We studied probabilistic language models with various maximum word-sequence lengths and selected the model showing the best WER. The study produced a Russian-language acoustic model trained on a data set comprising audio and subtitles from YouTube video clips. The language model was built from the subtitle texts and a publicly available Russian-language corpus of popular Wikipedia articles. The resulting system was tested on a data set of audio recordings of Russian literature available on voxforge.com; the best WER demonstrated by the system was 18%.
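For reference, the XLA optimization mentioned above is typically switched on at session level in the TensorFlow 1.x API that the Mozilla DeepSpeech code base used at the time. The sketch below is illustrative only: the feature dimension and hidden-layer width are stand-ins for the values actually tuned in the paper.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, contemporary with the DeepSpeech code base

# Enable XLA just-in-time compilation so linear-algebra-heavy ops get fused and optimized.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

# Toy stand-in for the real training graph: one dense hidden layer on random features.
x = tf.placeholder(tf.float32, [None, 494])      # illustrative input size (MFCCs with context)
w = tf.Variable(tf.random_normal([494, 2048]))   # 2048 hidden nodes, one candidate layer width
y = tf.nn.relu(tf.matmul(x, w))

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.random.rand(16, 494).astype(np.float32)})
    print(out.shape)   # (16, 2048)
```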
... The top model achieved a WER of 15.1%. In [25], a Deep Speech architecture was used for the Russian speech model, which achieved a WER of 18%. Both studies used language-specific audio corpora available at voxforge.org. ...
Article
Owing to the linguistic richness of the Arabic language, which contains more than 6000 roots, building a reliable Arabic language model for Arabic speech recognition systems faces many challenges. This paper introduces a language-model-free Arabic automatic speech recognition system for Modern Standard Arabic based on the end-to-end Deep Speech architecture developed by Mozilla. The proposed model uses a character-level sequence-to-sequence model to map the character alignment produced by the recognizer model onto the corresponding words. The developed system outperformed recent studies on single-speaker and multi-speaker Arabic speech recognition using two different state-of-the-art datasets. The first was the Arabic Multi-Genre Broadcast (MGB2) corpus with 1200 h of audio data from multiple speakers. The system achieved a new milestone in the MGB2 challenge with a word error rate (WER) of 3.2, outperforming related work on the same corpus with a word error reduction of 17%. An additional experiment with the 7-hour Saudi Accent Single Speaker Corpus (SASSC) was used to build an additional model for single-male-speaker Arabic speech recognition using the same network architecture. The single-speaker model outperformed related experiments with a WER of 4.25, a relative improvement of 33.8%.
... To facilitate the process of using it, a group of technologies appeared in multiple fields in DL. Several researchers tested the DeepSpeech model to build an SR system for Russian and German languages (Agarwal & Zesch, 2019; Panaite et al., 2019; Iakushkin et al., 2018). ...
Article
This work is an effort towards building neural speech recognizer (NSR) systems for Quranic recitations that can be used effectively by anyone regardless of gender and age. Despite the many recitations available online, most are recorded by professional adult male reciters, which means that an ASR system trained on such datasets would not work for female or child reciters. We address this gap by adopting a benchmark dataset of audio records of Quranic recitations that consists of recitations by both genders and from different ages. Using this dataset, we build several speaker-independent NSR systems based on the DeepSpeech model and use the word error rate (WER) to evaluate them. The goal is to show how an NSR system trained and tuned on a dataset of a certain gender performs on a test set from the other gender. Unfortunately, the number of female recitations in our dataset is rather small, while the number of male recitations is much larger. In the first set of experiments, we avoid the imbalance between the two genders and down-sample the male part to match the female part. For this small subset of our dataset, the results are interesting, with a WER of 0.968 when the system is trained on male recitations and tested on female recitations. The same system gives a WER of 0.406 when tested on male recitations. On the other hand, training the system on female recitations and testing it on male recitations gives a WER of 0.966, while testing it on female recitations gives a WER of 0.608.
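Since WER is the metric used throughout these studies (including the 18% figure reported in the reviewed paper), here is a minimal, self-contained sketch of how it is commonly computed: word-level edit distance divided by the number of reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of four reference words -> WER = 0.25
print(wer("мама мыла раму вчера", "мама мыла рамы вчера"))
```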
... The use of suprasegmental features' normalization strategies shows enhancement in performance over benchmark normalization approaches. Iakushkin et al. [125] developed a Russian-language speech recognition system based on DeepSpeech. They analyzed the utility of TensorFlow technology for optimizing linear algebra computations in neural network training. ...
Article
Speech recognition of a language is a key area in the field of pattern recognition. This paper presents a comprehensive survey of speech recognition techniques for non-Indian and Indian languages and compiles some of the computational models used for processing speech acoustics. An immense number of frameworks are available for speech processing and recognition of the languages spoken around the globe. However, only a limited number of automatic speech recognition systems are available for commercial use, and the technical support available for most of these languages remains very limited. This paper examines the major challenges of speech recognition for different languages. Analysis of the literature shows that the lack of standard databases for minority languages hinders recognition research across the globe. Compared with non-Indian languages, research on speech recognition of Indian languages (except Hindi) has not yet achieved the expected milestones. The combination of MFCC features and a DNN-HMM classifier is the most commonly used system for developing ASR for minority languages, whereas for some of the majority languages researchers use more advanced DNN algorithms. It has also been observed that research in this field is quite thin and more work needs to be carried out, particularly for minority languages.
... In this way, almost 20 hours of data have been collected, and the amount continues to grow. Secondly, the existing speech recognition model [13] for the Russian language, based on the VoxForge dataset with over 100 hours of speech data and corresponding text transcriptions, was taken, and the weights of the first two LSTM layers with 128 neurons were copied into our neural network with the same number of layers and neurons. The weights of the last layer were discarded because its size does not match, since the number of characters in the Russian alphabet is smaller than in the Kazakh alphabet. ...
Article
Development of an automatic speech recognition system for the Kazakh language is a challenging task due to the lack of audio data and the specificity and complexity of the language itself. In this paper, we propose a new method which takes a pre-trained model of the Russian language and uses its weight values in the proposed neural network. The main reasons for choosing the Russian language model are that the pronunciation of Kazakh and Russian is very similar in many respects, the two alphabets share 78% of their letters, and a rather large corpus of Russian speech data exists. The dataset of Kazakh speech with transcriptions was formed by the university's faculty. In total, 50 native speakers were involved, who generated about 400 sentences. A special technology has been created for the automatic expansion of the database. The data was extracted from well-known Kazakh books such as "Abai zholy", "Kara sozder", etc.
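The weight-transfer step described above, copying the two lower 128-unit LSTM layers of a pre-trained Russian model and discarding the mismatched output layer, can be sketched roughly as follows. The Keras API, the layer names, the checkpoint file, and the alphabet sizes are all assumptions for illustration, not the authors' actual code.

```python
from tensorflow import keras

def build_model(num_classes: int) -> keras.Model:
    """Two 128-unit LSTM layers followed by a per-frame softmax over the alphabet."""
    return keras.Sequential([
        keras.layers.LSTM(128, return_sequences=True, input_shape=(None, 26), name="lstm_1"),
        keras.layers.LSTM(128, return_sequences=True, name="lstm_2"),
        keras.layers.Dense(num_classes, activation="softmax", name="output"),
    ])

russian_model = build_model(num_classes=35)   # illustrative: Russian characters + blank
# russian_model.load_weights("russian_voxforge.h5")   # assumed pre-trained checkpoint

kazakh_model = build_model(num_classes=44)    # illustrative: larger Kazakh alphabet + blank
# Copy only the two LSTM layers; the output layer shapes differ, so it is trained from scratch.
for name in ("lstm_1", "lstm_2"):
    kazakh_model.get_layer(name).set_weights(russian_model.get_layer(name).get_weights())
```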
... Iakushkin et al. [10] considered speech recognition in large-resource conditions. Their system is built on top of the Mozilla DeepSpeech framework and trained with nearly 1650 hours of YouTube-crawled data. ...
Preprint
This paper presents an exploration of end-to-end automatic speech recognition systems (ASR) for the largest open-source Russian language data set, OpenSTT. We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model. For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves a word error rate (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system demonstrates 33.5%, 20.9%, and 18.6% WER.
Chapter
Automatic speech recognition systems are of two types: monolingual and multilingual. Owing to its ability to use transfer learning techniques and create better SR models for resource-scarce languages, multilingual speech recognition has recently become more prevalent. Generally, multilingual speech recognition models use specific parameters and modules which activate depending on the language given to the model. These models lack efficiency if the language identity is not specified and also lack the ability to recognize code-switched speech. In this work, we propose a multilingual model for English and Indian languages that can convert speech to text without specifying the language identity and can recognize code-switched speech, using transliterated and English text as the input transcription. The transliterated text helps the model learn to map sounds to the appropriate English characters in the case of multilingual and code-switched speech. This multilingual model uses the DeepSpeech architecture by Baidu. The lowest word error rate (WER) and character error rate (CER) for the best model were 30.5% and 11.69%, respectively.
Conference Paper
Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
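As a rough, hedged illustration of how the CTC objective from this paper is wired up in practice, the TensorFlow 1.x fragment below feeds per-frame logits into tf.nn.ctc_loss and a beam-search decoder; the feature and class counts are illustrative, and the single dense layer stands in for a full acoustic model.

```python
import tensorflow as tf  # TensorFlow 1.x API

num_features, num_classes = 26, 29   # illustrative: MFCCs in, alphabet size + CTC blank out

# Time-major inputs from the feature pipeline: [max_time, batch_size, num_features].
inputs = tf.placeholder(tf.float32, [None, None, num_features])
seq_len = tf.placeholder(tf.int32, [None])     # number of frames per utterance
labels = tf.sparse_placeholder(tf.int32)       # target character indices, with no alignment given

# Tiny stand-in for the acoustic model: a single per-frame projection to class logits.
logits = tf.layers.dense(inputs, num_classes)

# CTC marginalizes over all alignments, so no pre-segmented training data is needed.
loss = tf.reduce_mean(tf.nn.ctc_loss(labels=labels, inputs=logits, sequence_length=seq_len))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

# At inference time, beam search collapses repeated labels and removes blanks.
decoded, log_probs = tf.nn.ctc_beam_search_decoder(logits, seq_len)
```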
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
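A minimal NumPy sketch of the mechanism described above, using the equivalent "inverted" formulation in which surviving activations are rescaled during training so the full, unthinned network can be used unchanged at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations: np.ndarray, keep_prob: float, training: bool) -> np.ndarray:
    """Inverted dropout: randomly zero units during training and rescale the survivors."""
    if not training:
        return activations                      # test time: use the full network as-is
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob       # rescaling keeps the expected activation constant

hidden = rng.standard_normal((4, 8))            # a toy batch of hidden-layer activations
print(dropout(hidden, keep_prob=0.5, training=True))
```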
Conference Paper
We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
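Assuming the KenLM Python bindings and a model already estimated and binarized with the toolkit (the file name below is a placeholder), scoring Russian text against such a modified Kneser-Ney model looks roughly like this:

```python
import kenlm  # Python bindings distributed with the KenLM toolkit

# Placeholder path: a model previously built with KenLM's lmplz / build_binary tools.
model = kenlm.Model("ru_subtitles_wiki.binary")

sentence = "распознавание русской речи"
print(model.score(sentence, bos=True, eos=True))   # log10 probability with sentence boundaries
print(model.perplexity(sentence))                  # per-word perplexity under the model
```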
Article
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called DeepSpeech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.5% error on the full test set. DeepSpeech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
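The "data synthesis" referred to here is, at its core, superimposing recorded noise onto clean utterances to multiply the amount of varied training audio. A minimal NumPy sketch of additive noise mixing at a target signal-to-noise ratio follows; the exact augmentation pipeline in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_with_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech signal at the requested signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)      # loop or trim the noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy example with synthetic signals; a real pipeline would read audio files instead.
speech = rng.standard_normal(16000)   # one second of "speech" at 16 kHz
noise = rng.standard_normal(8000)     # a shorter "noise" clip
augmented = mix_with_noise(speech, noise, snr_db=10.0)
```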
Book
Recurrent neural networks (RNNs) are powerful sequence learners. They can deal flexibly with temporally extended contexts and are robust to local input distortions, which makes them well suited to sequence labelling problems in which input streams must be mapped onto sequences of symbolic labels. The goal of this work is to advance the state of the art in supervised sequence labelling with RNNs. It makes two key contributions: (1) the introduction of a novel output layer makes it possible to train RNNs directly to label sequences even when only the order, but not the exact positions, of the labels is known; (2) the Long Short-Term Memory network architecture is extended to multidimensional data such as images and videos.
Hannun A. et al. Deep Speech: Scaling up end-to-end speech recognition // arXiv preprint arXiv:1412.5567, 2014.
Kingma D. P., Ba J. Adam: A method for stochastic optimization // arXiv preprint arXiv:1412.6980, 2014.