Article

AUDIO SIGNAL PREPROCESSING FOR THE SPEECH RECOGNITION TASK


Abstract

Speech is the most natural form of human communication, so implementing an interface based on the analysis of speech information is a promising direction for the development of intelligent control systems. An automatic speech recognition system is an information system that converts an input speech signal into a recognized message. Speech recognition is a complex and resource-intensive task because of the high variability of speech, which depends on the speaker's age, gender, and physiological characteristics. The article presents a generalized description of the speech recognition task, consisting of the following stages: resampling, framing and windowing, feature extraction, vocal tract length normalization, and noise suppression. Preprocessing of the speech signal is the first and key stage of automatic speech recognition, since the quality of the input signal substantially affects the recognition quality and the final result of the process. Speech preprocessing consists of cleaning the input signal of external and unwanted noise, detecting speech activity, and normalizing the vocal tract length. The goal of speech signal preprocessing is to increase the computational efficiency of speech recognition systems and of control systems with a natural-language interface. The article proposes using the fast Fourier transform to describe the input audio signal and Hamming windows to segment the audio signal, with subsequent feature extraction by means of Mel-Frequency Cepstral Coefficients. The use of the dynamic time warping algorithm for vocal tract length normalization and of a recurrent neural network for noise suppression is described. Experimental results are presented on preprocessing the audio signal of voice commands for controlling applications on a mobile phone running the Android operating system.
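The framing, windowing, and spectral-analysis steps named in the abstract can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the frame length, hop size, and 440 Hz test tone are illustrative assumptions, and a naive DFT stands in for the FFT.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames, zero-padding the tail."""
    frames = []
    for start in range(0, max(len(signal) - frame_len, 0) + 1, hop):
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames

def hamming(n):
    """Hamming window of length n, used to taper each frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Magnitude-squared spectrum of one windowed frame (naive O(n^2) DFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spec.append((re * re + im * im) / n)
    return spec

# Window each frame before the transform to reduce spectral leakage.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1024)]
frames = frame_signal(signal, frame_len=256, hop=128)
win = hamming(256)
spectra = [power_spectrum([s * w for s, w in zip(f, win)]) for f in frames]
```

For a 440 Hz tone at 16 kHz with 256-point frames, the energy concentrates near DFT bin 440 * 256 / 16000 ≈ 7, which is where a Mel filterbank would then pick it up.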


Article
Full-text available
Keyword spotting is an important part of modern speech recognition pipelines. Typical contemporary keyword-spotting systems are based on Mel-Frequency Cepstral Coefficient (MFCC) audio features, which are relatively complex to compute. Considering the always-on nature of many keyword-spotting systems, it is prudent to optimize this part of the detection pipeline. We explore simplifications of the MFCC audio features and derive a simplified version that can be more easily used in embedded applications. Additionally, we implement a hardware generator that produces an appropriate hardware pipeline for the simplified audio feature extraction. Using the Chisel4ml framework, we integrate the hardware generators into the Python-based Keras framework, which facilitates training machine learning models with our simplified audio features.
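The mel scale underlying MFCC features has a simple closed form; a minimal sketch follows. The 2595/700 constants are the common HTK-style convention, not taken from this paper, whose simplified variant may replace them.

```python
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert a mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# MFCC filterbanks are spaced uniformly in mel, i.e. non-uniformly in Hz,
# which mimics the ear's finer resolution at low frequencies.
edges_mel = [hz_to_mel(0.0) + i * (hz_to_mel(8000.0) - hz_to_mel(0.0)) / 10
             for i in range(11)]
edges_hz = [mel_to_hz(m) for m in edges_mel]
```

Note how the resulting Hz edges cluster at the low end of the spectrum, which is exactly the behavior a uniform (linear) filterbank lacks.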
Article
Full-text available
Speech recognition is a natural language processing task that involves the computerized transcription of spoken language in real time. Numerous studies have been conducted on the utilization of deep learning (DL) models for speech recognition. However, this field is advancing rapidly. This systematic review provides an in-depth and comprehensive examination of studies published from 2019 to 2022 on speech recognition utilizing DL techniques. Initially, 575 studies were retrieved and examined. After filtration and application of the inclusion and exclusion criteria, 94 were retained for further analysis. A literature survey revealed that 17% of the studies used stand-alone models, whereas 52% used hybrid models. This indicates a shift towards the adoption of hybrid models, which were proven to achieve better results. Furthermore, most of the studies employed public datasets (56%) and used the English language (46%), whereas their environments were neutral (81%). The word error rate was the most frequently used method of evaluation, while Mel-frequency cepstral coefficients were the most frequently employed method of feature extraction. Another observation was the lack of studies utilizing transformers, which were demonstrated to be powerful models that can facilitate fast learning speeds, allow parallelization and improve the performance of low-resource languages. The results also revealed potential and interesting areas of future research that had received scant attention in earlier studies.
Article
Full-text available
The subject matter of the study is the analysis of the influence of audio pre-processing stages on the accuracy of speaker language recognition. The importance of audio pre-processing has grown significantly in recent years due to its key role in a variety of applications such as data reduction, filtering, and denoising. Given the growing demand for accuracy and efficiency in audio information classification methods, the evaluation and comparison of different audio pre-processing methods becomes an important part of determining optimal solutions. The goal of this study is to select the best sequence of audio pre-processing stages for subsequent training of a neural network under various ways of converting signals into features, namely spectrograms and mel-cepstral coefficients. To achieve this goal, the following tasks were solved: an analysis of ways of transforming the signal into characteristic features, and an analysis of mathematical models for analyzing the audio series by the selected features. After that, a generalized model of real-time translation of the speaker's speech was developed, and the experiment was planned depending on the selected audio pre-processing stages. Finally, the neural network was trained and tested on the planned experiments. The following methods were used: mel-cepstral coefficients, spectrogram, time mask, frequency mask, and normalization. The following results were obtained: depending on the selected stages of voice-information pre-processing and the various ways of converting the signal into features, speech recognition accuracy of up to 93% can be achieved. The practical significance of this work is to increase the accuracy of further transcription of audio information and of translation of the resulting text into the chosen language, including artificial languages.
Conclusions: In the course of the work, the best sequence of audio pre-processing stages was selected for use in further training of the neural network for different ways of converting signals into features. Mel-cepstral coefficients are better suited for solving the problem at hand. Since the neural network strongly depends on its structure, the results may change as the volume of input data and the number of languages grow, but at this stage it was decided to use only mel-cepstral coefficients with normalization.
Article
Full-text available
Speech recognition is the foundation of human-computer interaction technology and an important aspect of speech signal processing, with broad application prospects. At present, speech recognition suffers from problems such as low recognition rates, slow recognition speed, and severe interference from other factors. This paper studied speech recognition based on the dynamic time warping (DTW) algorithm. An introduction to speech recognition lays out its specific steps. Before recognition, the speech to be recognized must be converted into a speech sequence using an acoustic model. The DTW algorithm was then used in preprocessing, mainly by sampling and windowing the speech. After preprocessing, speech features were extracted, and recognition was carried out. Experiments showed that the recognition rate of DTW-based speech recognition was very high. In a quiet environment, the recognition rate was above 93.85%, and the average recognition rate of the 10 selected testers was 95.8%. In a noisy environment, the recognition rate was above 91.4%, and the average recognition rate of the 10 selected testers was 93%. In addition to a high recognition rate, DTW-based speech recognition also recognized vocabulary very quickly.
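The DTW alignment at the core of this approach can be sketched with the standard textbook recurrence. This minimal version compares 1-D sequences of scalars; real recognizers compare per-frame feature vectors such as MFCCs.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimum cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A time-stretched copy of a template stays close under DTW,
# even though a sample-by-sample comparison would misalign it.
template = [0, 1, 2, 3, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
print(dtw_distance(template, stretched))  # → 0.0
```

This invariance to local tempo variation is what makes DTW useful both for template matching and, as in the article above, for vocal tract length normalization.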
Article
Full-text available
Audio signals are representations of sounds. They reveal more information in the frequency domain than in the time domain, so it is more appropriate to evaluate them in the frequency domain. Using different transforms such as the DFT, DST, DCT, MDCT, and integer MDCT, a time-domain audio signal can be converted into a frequency-domain signal. The signal is then reconstructed to analyze features such as the mean square error, signal-to-noise ratio, and peak signal-to-noise ratio between the original and reconstructed signals. Other features such as energy, entropy, and zero-crossing rate (ZCR) were also considered in the evaluation. In this paper, different audio file formats were examined: WAVE, MP3, M4A, and AAC, where WAVE is an uncompressed format and MP3, M4A, and AAC are compressed formats under lossy compression. The above-mentioned features are used for applications such as music information retrieval (MIR), which includes onset detection, pitch detection, and measuring the noise and loudness of music.
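The signal-level measures mentioned above (energy, ZCR, and SNR between an original and a reconstructed signal) are straightforward to compute. A small sketch follows; the 5-cycle test tone and the constant offset used as "reconstruction error" are illustrative assumptions.

```python
import math

def energy(x):
    """Total energy of a signal (sum of squared samples)."""
    return sum(s * s for s in x)

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)

def snr_db(original, reconstructed):
    """SNR in dB of a reconstruction, treating the difference as noise."""
    noise = [o - r for o, r in zip(original, reconstructed)]
    return 10 * math.log10(energy(original) / energy(noise))

# 5 cycles of a sine over 100 samples, plus a small constant error.
x = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
reconstructed = [s + 0.01 for s in x]
```

For this tone, energy is about 50, ZCR is roughly 0.1 (two crossings per cycle), and the 0.01 offset yields an SNR near 37 dB.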
Conference Paper
Full-text available
Speech signal preprocessing is the first and most important step in the automatic speech recognition process. The preprocessing of speech consists of cleaning the speech signal of ambient and undesirable noise, detecting speech activity, and normalizing the length of the vocal tract. The objective of preprocessing a speech signal is to make speech recognition systems computationally more efficient through the application of several preprocessing techniques, such as speech pre-emphasis, vocal tract length normalization, voice activity detection, noise removal, framing, and windowing. This paper gives an overview of the fundamentals of speech signal preprocessing techniques, highlighting the specifics and requirements of each technique. We also explore all aspects that can improve the results of each technique. We hope the content of this paper will help researchers improve the quality of their speech recognition systems by identifying appropriate speech preprocessing techniques to use in their experimental settings.
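Of the techniques listed, pre-emphasis is the simplest to illustrate: a first-order high-pass filter applied before framing. A minimal sketch, where the coefficient 0.97 is the conventional choice rather than a value taken from this paper:

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[t] = x[t] - alpha * x[t-1].
    Boosts high frequencies before framing and feature extraction."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

# A constant (DC) signal is almost entirely suppressed after the first
# sample, while rapid sample-to-sample changes pass through.
emphasized = pre_emphasis([1.0, 1.0, 1.0, 1.0, 1.0])
```

Each output sample after the first is 1 - 0.97 = 0.03, showing how the filter attenuates slowly varying content while leaving transients intact.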
Article
Full-text available
This paper briefly describes the basic principles and related theories of speech recognition systems and points out the existing problems of deep-learning-based speech recognition technology in artificial intelligence. It analyzes deep-learning speech recognition methods and proposes optimizations such as strengthening targeted feature recognition in the speech system, repeatedly carrying out simulation training of speech recognition, and combining acoustic features with sports features. Finally, it forecasts the future prospects of deep-learning-based speech recognition methods.
Article
Full-text available
In this digital world, there are many applications that secure and legalize their data by various techniques, and many algorithms and methods to process that data. One extensively used method is biometric authentication, among which voice recognition stands out, since it is convenient for the user and merely requires capturing the user's voice. Background noise, however, is a problem for the Mel Frequency Cepstrum Coefficient (MFCC) recognition algorithm, which can be mitigated by other tools such as smoothing filters. The main focus of this project is to investigate the feature extraction scheme.
Article
Full-text available
In this paper, we propose a preprocessing strategy for denoising speech data based on speech segment detection. A computationally efficient speech denoising design is necessary to develop a method that scales to large data sets. This has become even more important with the development of deep-learning-based methods, which generally show high performance but incur significant costs. The basic idea of the proposed method is to use speech segment detection to exclude non-speech segments before denoising. Speech segment detection can exclude, at negligible cost, the non-speech segments that would otherwise be processed by the much more expensive denoising step, while maintaining the accuracy of denoising. First, we devise a framework to choose the best preprocessing method for denoising based on speech segment detection for a target environment. For this, we examine the denoising environments using different levels of signal-to-noise ratio (SNR) and multiple evaluation metrics. The framework finds the best speech segment detection method tailored to a target environment according to the performance evaluation of speech segment detection methods. Next, we investigate the accuracy of the speech segment detection methods extensively. We evaluate five speech segment detection methods with different SNR levels and evaluation metrics. In particular, we show that the trade-off between the precision and recall of each method can be adjusted by controlling a parameter. Finally, we incorporate the best speech segment detection method for a target environment into a denoising process.
Through extensive experiments, we show that the accuracy of the proposed scheme is comparable to or even better than that of WaveNet-based denoising, one of the recent advanced denoising methods based on deep neural networks, in terms of multiple denoising evaluation metrics, i.e., SNR, STOI, and PESQ, while reducing the denoising time of WaveNet-based denoising by approximately 40–50%, depending on the speech segment detection method used.
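The core idea of gating an expensive denoiser behind cheap segment detection can be sketched in a few lines. The frame list, flags, and identity "denoiser" below are illustrative stand-ins, not the paper's components.

```python
def denoise_speech_only(frames, is_speech, denoise):
    """Apply the (expensive) denoiser only to frames flagged as speech;
    non-speech frames are excluded beforehand, saving the denoiser's cost."""
    return [denoise(f) for f, keep in zip(frames, is_speech) if keep]

# Toy example: 5 frames, of which only 2 are flagged as speech,
# so the denoiser runs on 2 frames instead of 5.
frames = [[0.0], [0.1], [0.9], [0.8], [0.0]]
flags = [False, False, True, True, False]
kept = denoise_speech_only(frames, flags, denoise=lambda f: f)
```

The cost saving scales directly with the fraction of non-speech frames, which is the 40-50% speedup mechanism the abstract reports.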
Article
Full-text available
Speech recognition systems play an essential role in everyone's life. Such a system is software that allows users to interact with their mobile phones through speech. Speech recognition software splits the audio of a speech into various sound waveforms, analyzes each form, uses various algorithms to find the most appropriate word in that language, and transcribes those sounds into text. This paper illustrates popular existing systems, namely Siri, Cortana, Google Assistant, Alexa, and Bixby. It also analyzes the concept of natural language processing (NLP) in relation to speech recognition. In addition, our main goal is to find the most accurate speech recognition technique so that we can achieve the best results. A comparative analysis indicates the differences and shortcomings of the various speech recognition systems.
Conference Paper
Full-text available
An adversarial attack is an exploitative process in which minute changes are made to a natural input, causing that input to be misclassified by a neural model. Due to recent trends in speech processing, this has become a noticeable issue in speech recognition models. In late 2017, an attack was shown to be quite effective against the Speech Commands classification model. Limited-vocabulary classifiers, such as the Speech Commands model, are used quite frequently for managing automated attendants in traditional telephony and voice over IP (VoIP) contexts. As such, this research examines the effectiveness of VoIP speech coding in mitigating audio adversarial attacks when compared to more primitive forms of audio preprocessing, and shows that an ensemble defense in tandem with speech coding is more robust than other forms of preprocessing defenses in mitigating adversarial examples. This research also proposes a new metric for evaluating preprocessing defenses against adversarial attacks. Additionally, this research explores using speech coding and various other forms of preprocessing for detecting adversarial examples.
Article
Full-text available
Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the utilization of the 3σ edit rule and the Hampel identifier, which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution based methods. The results obtained after applying the proposed strategies to some test voice signals are encouraging.
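Short-time energy and the 3σ edit rule can be combined into a small voiced-frame detector. This is a rough sketch under stated assumptions: the frame length and synthetic test signal are illustrative, and the rule is applied to all frame energies at once, whereas the paper's methods are more careful about estimating the background distribution.

```python
import math
import statistics

def short_time_energy(signal, frame_len=160):
    """Energy of consecutive non-overlapping frames."""
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal), frame_len)]

def voiced_frames_3sigma(energies):
    """Flag frames whose energy lies more than 3 standard deviations
    above the mean frame energy (the 3-sigma edit rule for outliers)."""
    mu = statistics.mean(energies)
    sigma = statistics.pstdev(energies)
    return [e > mu + 3 * sigma for e in energies]

# Synthetic signal: 20 frames of silence with one loud tone frame.
signal = [0.0] * 3200
for t in range(160):
    signal[10 * 160 + t] = math.sin(2 * math.pi * 8 * t / 160)

flags = voiced_frames_3sigma(short_time_energy(signal))
```

Here only the tone frame is flagged as an outlier against the silent background, which is the segregation the 3σ rule is meant to provide.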
Boyko N., Hrynyshyn A. Using recurrent neural network to noise absorption from audio files. Proceedings of the 2nd International Workshop on Computational & Information Technologies for Risk-Informed Systems (CITRisk'2021), Kherson, Ukraine, 2021, pp. 1-19.