Conference Paper

Environment Detection Methods using Speech Signals - A Review


Abstract

Over the last few decades, a large body of computer science research has addressed speech applications, particularly speech recognition. By detecting and interpreting spoken input, speech recognition enables a system to turn received speech signals into instructions. Early systems were far less capable than today's, and sustained research effort has made speech recognition one of the field's distinguishing capabilities. Within automatic speech recognition, substantial progress has been made on a range of problems, one of which is determining the speaker's surroundings from the audio captured around the speaker. The core contribution of this work is to explore how established speech recognition techniques can be used to infer the speaker's surrounding environment. The paper also gives a brief review of speech recognition techniques and applications since the field's inception.


... Here the product's structure can be assessed first, and then the business process design can be considered. The plethora method redesigns the project to guide the ontological process, which takes place within a black-box process (Singh & Rathor, 2022). The form of procedural guidance, the process design technique, and the consistency that frames the research problem can then be promoted for the research method. ...
Article
Full-text available
Current research shows substantial room for improvement in business marketing processes, because problems such as long-term bond yields and macroeconomic data are handled only at a minimal level. Any industry can transform its business marketing processes if it adopts a specific business marketing model that eliminates such deficiencies, examines the various elements involved, and determines how to fix them. Accordingly, several factors have to be controlled for in this research. Innovations and advancements in signal processing techniques arise naturally in business processes that are managed in this way. Among the business improvement success factors considered here (scoring, learning from the best, future creation, people involvement, goal setting, and planning and development), planning and development is the most essential; neglecting these factors disrupts business marketing processes and creates gaps in science- and technology-based marketing. One problem driven by these factors is the heavy use of two-wheelers, which generates a growing vehicular crisis, and the smart helmet business is presented as the response to this automotive crisis, even though the marketing of this business is affected by the problems listed above. Taking the factors of production (land, labour, capital, and entrepreneurship) as the starting point, and considering the sensors needed to build smart helmets, this research introduces signal processing technology into helmet design; the signal-processing-based helmet proposed here is constructive for business. This study therefore suggests a smart helmet business built on signal processing technology and highlights how adopting the smart helmet product might improve corporate marketing procedures. According to our results, how well people can use the product determines whether the associated business rises to the top or falls to the bottom: these marketing methods operate at a higher level and remain profitable as capacity increases, whereas the effectiveness of the smart helmet declines as accidents increase and corporate marketing procedures degrade. Production now depends on whether the intelligent helmet proves beneficial or harmful, and the analytical technique described below can be used to assess this.
Article
Full-text available
Speech recognition systems play an essential role in everyday life. They are software that allows users to interact with their mobile phones through speech. Speech recognition software splits the audio of an utterance into sound segments, analyses each segment, uses various algorithms to find the most likely word in that language, and transcribes those sounds into text. This paper illustrates popular existing systems, namely SIRI, CORTANA, GOOGLE ASSISTANT, ALEXA, and BIXBY. It also analyses the role of Natural Language Processing (NLP) in speech recognition. In addition, our main goal is to identify the most accurate speech recognition technique so that the best results can be achieved. A comparative analysis indicates the differences and shortcomings of the various speech recognition systems.
Article
Full-text available
We propose using derivative features for sound event detection based on deep neural networks. As input to the networks, we used log-mel-filterbank features and their first and second derivatives for each frame of the audio signal. Two deep neural networks were used to evaluate the effectiveness of these derivative features. Specifically, a convolutional recurrent neural network (CRNN) was constructed by combining a convolutional neural network and a recurrent neural network (RNN) followed by a feed-forward neural network (FNN) acting as a classification layer. In addition, a mean-teacher model based on an attention CRNN was used. Both models had an average pooling layer at the output so that weakly labeled and unlabeled audio data could be used during model training. Under the various training conditions, depending on the neural network architecture and training set, the use of derivative features resulted in a consistent performance improvement. Experiments on audio data from the Detection and Classification of Acoustic Scenes and Events 2018 and 2019 challenges indicated that a maximum relative improvement of 16.9% was obtained in terms of the F-score.
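As a rough illustration of this kind of front end (not the paper's exact configuration), the sketch below computes log-mel-filterbank features and their first and second derivatives per frame with librosa; the file name, sampling rate, and frame parameters are illustrative assumptions.

```python
import numpy as np
import librosa

# Illustrative parameters (assumptions, not the paper's exact setup).
AUDIO_PATH = "example.wav"   # hypothetical input file
SR = 16000
N_FFT = 1024
HOP = 512
N_MELS = 64

y, sr = librosa.load(AUDIO_PATH, sr=SR)

# Log-mel-filterbank energies, shape (n_mels, n_frames).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT,
                                     hop_length=HOP, n_mels=N_MELS)
logmel = librosa.power_to_db(mel)

# First and second derivative (delta and delta-delta) features.
d1 = librosa.feature.delta(logmel, order=1)
d2 = librosa.feature.delta(logmel, order=2)

# Stack into a 3-channel input, e.g. for a CRNN: (3, n_mels, n_frames).
features = np.stack([logmel, d1, d2], axis=0)
print(features.shape)
```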
Article
Full-text available
Emotion is now recognised as an essential part of human behaviour, and it therefore ought to be incorporated into the analysis when an intelligent agent or a robot intends to imitate human reactions. Accordingly, current AI research shows increasing interest in artificial emotion for developing human-like agents. Based on emotion psychology and artificial emotion, this paper presents a behaviour decision model for an intelligent agent; the model consists of emotions, motivations, and behaviour decisions. The mapping between external stimuli and emotion is built using Dempster-Shafer (D-S) evidence theory, and the model applies a Markov decision process to map emotional states to behaviours. The model provides a sound methodology for emotional agent modelling and effective decision organisation.
Article
Full-text available
Speech is an easy and natural means of communication between humans, but nowadays humans are no longer limited to communicating with each other; they also interact with the various machines in our lives, most importantly the computer. This communication technique can therefore be used between computers and humans. The interaction takes place through interfaces, an area called Human Computer Interaction (HCI). This paper gives an overview of the main definitions of Automatic Speech Recognition (ASR), an important domain of artificial intelligence, which should be taken into account in any related research (type of speech, vocabulary size, etc.). It also summarises important research on speech processing from the last few years, gives a general idea of our proposal as a contribution to this area of research, and concludes with certain enhancements that could be pursued in future work.
Article
Full-text available
To make the best use of speech recognition, it is important that a system can recognise not just the speech or the speaker but also the domain of communication. This paper proposes an approach to acoustic domain recognition that uses an ensemble-based 3-level architecture instead of a single classifier for training and testing. The predictions of various classifiers are estimated, a set of three classifiers is selected such that the target predictions are covered by at least one of them, and these predictions are then used to train a further random forest classifier, which yields the final classification of the test data set. Experimental results indicate that the proposed method performs consistently even as the data size increases, with an acceptable accuracy of 76.36%.
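The exact 3-level selection procedure is specific to the paper, but the general idea of feeding base-classifier predictions into a random forest meta-classifier can be sketched with scikit-learn; the base models, synthetic features, and class count below are purely illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for acoustic-domain feature vectors (assumption).
X, y = make_classification(n_samples=500, n_features=40, n_classes=3,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three base classifiers whose predictions feed a random forest meta-classifier.
base = [("svm", SVC(probability=True)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier())]
model = StackingClassifier(estimators=base,
                           final_estimator=RandomForestClassifier(n_estimators=100))
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```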
Conference Paper
Full-text available
This paper describes the model and training framework from our submission for DCASE 2017 task 3: sound event detection in real life audio. Extending the basic convolutional neural network architecture, we use both short- and long-term audio signals simultaneously as input data. In the training stage, we calculated validation errors more frequently than once per epoch, with adaptive thresholds, and used a class-wise early-stopping strategy to find the best model for each class. The proposed model showed meaningful improvements in cross-validation experiments compared to the baseline system.
Article
Full-text available
Automatic classification of environmental sounds, such as dog barking and glass breaking, is becoming increasingly interesting, especially for mobile devices. Most mobile devices contain both cameras and microphones, and companies that develop mobile devices would like to provide functionality for classifying both videos/images and sounds. In order to reduce the development costs one would like to use the same technology for both of these classification tasks. One way of achieving this is to represent environmental sounds as images, and use an image classification neural network when classifying images as well as sounds. In this paper we consider the classification accuracy for different image representations (Spectrogram, MFCC, and CRP) of environmental sounds. We evaluate the accuracy for environmental sounds in three publicly available datasets, using two well-known convolutional deep neural networks for image recognition (AlexNet and GoogLeNet). Our experiments show that we obtain good classification accuracy for the three datasets.
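As a rough sketch of this image-based route (not the paper's exact setup), one can render a mel spectrogram as a fixed-size 3-channel image and train a standard image CNN such as AlexNet on it; the class count, clip length, and placeholder data below are assumptions, and the network is built without pretrained weights.

```python
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import alexnet

NUM_CLASSES = 10          # e.g. a 10-class environmental sound set (assumption)
SR = 22050

spec = torchaudio.transforms.MelSpectrogram(sample_rate=SR, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_image(waveform: torch.Tensor) -> torch.Tensor:
    """Turn a mono waveform (1, n_samples) into a 3 x 224 x 224 'image'."""
    s = to_db(spec(waveform))                       # (1, n_mels, n_frames)
    s = torch.nn.functional.interpolate(
        s.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False
    ).squeeze(0)                                    # (1, 224, 224)
    return s.repeat(3, 1, 1)                        # replicate to 3 channels

model = alexnet(num_classes=NUM_CLASSES)            # trained from scratch here
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch (placeholder data).
waveforms = torch.randn(8, 1, SR * 4)                # 8 clips of 4 seconds
images = torch.stack([audio_to_image(w) for w in waveforms])
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = criterion(model(images), labels)
loss.backward()
```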
Conference Paper
Full-text available
This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that instead of concatenating the features of each channel into a single feature vector the network learns sound events in multichannel audio better when they are presented as separate layers of a volume. Using the proposed spatial features over monaural features on the same network gives an absolute F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and 2.7% on the TUT-SED 2009 dataset that is fifteen times larger.
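The contrast between concatenating channel features and presenting them as separate layers of a volume can be illustrated with array shapes; the feature dimensions below are arbitrary assumptions.

```python
import numpy as np

n_channels, n_frames, n_bins = 4, 500, 40

# Per-channel time-frequency features, e.g. log-mel energies per microphone.
per_channel = [np.random.randn(n_frames, n_bins) for _ in range(n_channels)]

# Option 1: concatenate channels into one long feature vector per frame.
concatenated = np.concatenate(per_channel, axis=1)   # (500, 160)

# Option 2: keep channels as separate layers of a volume, letting a
# convolutional front end learn per-channel patterns before fusing them.
volume = np.stack(per_channel, axis=0)               # (4, 500, 40)

print(concatenated.shape, volume.shape)
```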
Conference Paper
Full-text available
The acoustic-phonetic approach is used by researchers to understand phonetic rules and to employ these rules in speech recognition systems. Apart from developing new feature sets and modelling techniques, the acoustic-phonetic approach concentrates on understanding phonetic units and their relationships in different contexts of use. This review paper presents the different ways in which researchers seek to understand phonetic rules in order to develop speech recognition systems. We explore related areas such as accent classification, speech activity detection, and vocal melody extraction, and present the work reported in these areas. Recent phonetic studies of Asian languages such as Chinese and Tibetan are described, and studies in Indian languages such as Assamese, Marathi, and Malayalam are also considered and summarised.
Article
Full-text available
The automatic recognition of speech, enabling a natural and easy-to-use method of communication between human and machine, is an active area of research. Speech processing has vast applications in voice dialing, telephone communication, call routing, domestic appliance control, speech-to-text conversion, text-to-speech conversion, lip synchronisation, automation systems, etc. Nowadays, speech processing has also evolved into a novel approach to security: feature vectors of authorised users are stored in a database, and speech features extracted from the recorded speech of a male or female speaker are compared with the templates available in the database. Speech can be parameterised by Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), Mel Frequency Cepstral Coefficients (MFCC), PLP-RASTA (PLP-Relative Spectra), etc. Parameters such as PLP and MFCC take the nature of speech into account when extracting features, whereas LPC predicts future features from previous ones. Training models such as neural networks are trained on the feature vectors to predict unknown samples. Techniques like Vector Quantization (VQ), Dynamic Time Warping (DTW), Support Vector Machines (SVM), and Hidden Markov Models (HMM) can be used for classification and recognition. We describe a neural network with LPC, PLP, and MFCC parameters in our paper.
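A minimal sketch of this kind of pipeline (MFCC features fed to a small neural network classifier) could look as follows with librosa and scikit-learn; the file names, labels, and network size are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_vector(path, sr=16000, n_mfcc=13):
    """Load a clip and summarise it as mean MFCCs over time (one vector per clip)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (13, n_frames)
    return mfcc.mean(axis=1)

# Hypothetical training clips and speaker/command labels.
train_files = ["clip01.wav", "clip02.wav", "clip03.wav"]
train_labels = [0, 1, 0]

X = np.array([mfcc_vector(f) for f in train_files])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X, train_labels)

print(clf.predict([mfcc_vector("unknown.wav")]))
```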
Article
Full-text available
The time-domain waveform of a speech signal carries all of the auditory information, yet from a phonological point of view little can be said on the basis of the waveform itself. However, past research in mathematics, acoustics, and speech technology has provided many methods for converting raw data into information, provided it is interpreted correctly. In order to derive statistically relevant information from incoming data, it is important to have mechanisms for reducing each segment of the audio signal to a relatively small number of parameters, or features. These features should describe each segment in such a characteristic way that similar segments can be grouped together by comparing their features. There are numerous ways to describe the speech signal in terms of parameters, each with its strengths and weaknesses; we present some of the most widely used methods and their importance.
Article
Full-text available
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate sound event detection performance at various levels of polyphony. It combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
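The first stage (context recognition with Gaussian mixture models) can be sketched roughly as below; the contexts, feature arrays, and per-context event lists are made-up placeholders, and the second stage with context-dependent HMMs is only indicated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical MFCC frame matrices per context, shape (n_frames, n_features).
training_data = {
    "street": np.random.randn(2000, 13),
    "office": np.random.randn(2000, 13) + 1.0,
}
# Context-specific sets of sound event classes (placeholders).
context_events = {"street": ["car", "siren"], "office": ["speech", "keyboard"]}

# Stage 1: one GMM per context.
gmms = {c: GaussianMixture(n_components=8).fit(X) for c, X in training_data.items()}

def recognise_context(frames: np.ndarray) -> str:
    """Pick the context whose GMM gives the highest average log-likelihood."""
    return max(gmms, key=lambda c: gmms[c].score(frames))

# Stage 2 (sketch): restrict the event detector to the recognised context's classes.
test_frames = np.random.randn(500, 13)
context = recognise_context(test_frames)
candidate_events = context_events[context]
print(context, candidate_events)
```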
Conference Paper
Full-text available
With the increasing use of audio sensors in surveillance and monitoring applications, event detection using audio streams has emerged as an important research problem. This paper presents a hierarchical approach to audio-based event detection for surveillance. The proposed approach first classifies a given audio frame into vocal and non-vocal events, and then performs further classification into normal and excited events. We model the events using a Gaussian mixture model and optimise the parameters for four different audio features: ZCR, LPC, LPCC, and LFCC. Experiments have been performed to evaluate the effectiveness of the features for detecting various normal and excited-state human activities. The results show that the proposed top-down event detection approach works significantly better than the single-level approach.
Article
Full-text available
The paper considers the task of recognizing environmental sounds for the understanding of a scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cepstral coefficients (MFCCs) which describe the audio spectral shape. Environmental sounds, such as chirpings of insects and sounds of rain which are typically noise-like with a broad flat spectrum, may include strong temporal domain signatures. However, only a few temporal-domain features have previously been developed to characterize such diverse audio signals. Here, we perform an empirical feature analysis for audio environment characterization and propose to use the matching pursuit (MP) algorithm to obtain effective time-frequency features. The MP-based method utilizes a dictionary of atoms for feature selection, resulting in a flexible, intuitive and physically interpretable set of features. The MP-based features are adopted to supplement the MFCC features to yield higher recognition accuracy for environmental sounds. Extensive experiments are conducted to demonstrate the effectiveness of these joint features for unstructured environmental sound classification, including listening tests to study human recognition capabilities. Our recognition system has been shown to produce performance comparable to that of human listeners.
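A bare-bones version of matching pursuit over a Gabor-like atom dictionary, which is one way to obtain the kind of time-frequency features described here, might look as follows; the dictionary parameters and placeholder signal are arbitrary assumptions, not the paper's configuration.

```python
import numpy as np

def gabor_atom(n, center, freq, width, sr):
    """Unit-norm Gabor-like atom: a Gaussian-windowed cosine."""
    t = (np.arange(n) - center) / sr
    atom = np.exp(-0.5 * (t / width) ** 2) * np.cos(2 * np.pi * freq * t)
    return atom / np.linalg.norm(atom)

def matching_pursuit(x, dictionary, n_atoms=10):
    """Greedy MP: repeatedly pick the atom most correlated with the residual."""
    residual = x.astype(float).copy()
    selected = []
    for _ in range(n_atoms):
        corr = dictionary @ residual
        k = int(np.argmax(np.abs(corr)))
        selected.append((k, corr[k]))            # (atom index, coefficient)
        residual -= corr[k] * dictionary[k]
    return selected, residual

# Build a small illustrative dictionary over a 1-second signal at 8 kHz.
sr, n = 8000, 8000
dictionary = np.array([
    gabor_atom(n, center, freq, width, sr)
    for center in range(0, n, 1000)
    for freq in (200, 500, 1000, 2000)
    for width in (0.01, 0.05)
])

signal = np.random.randn(n)                      # placeholder for an audio clip
features, _ = matching_pursuit(signal, dictionary, n_atoms=5)
print(features)
```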
Conference Paper
Full-text available
This paper addresses the problem of extracting vocal melodies from polyphonic audio. In short-term processing, a timbral distance between each pitch contour and the space of human voice is measured, so as to isolate any vocal pitch contour. Computation of the timbral distance is based on an acoustic-phonetic parametrization of human voiced sound. Long-term processing organizes short-term procedures in such a manner that relatively reliable melody segments are determined first. Tested on vocal excerpts from the ADC 2004 dataset, the proposed system achieves an overall transcription accuracy of 77%.
Article
Full-text available
Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs. Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication. The aim of this review is first to present the core architecture of a HMM-based LVCSR system and then describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination.
Article
Impulsive audio events such as gunshots, explosions or glass shattering are commonly associated with security threats, so they are of particular interest for automated acoustic surveillance. Even though impulsive audio events are greatly influenced by their propagation path, little work has been done on multichannel detection, and most precedents available in the literature deal with single-channel detection systems. Unfortunately, the spatial dependence of impulsive sound recordings proves to be a problem for robust performance under real conditions. It is possible, however, to take advantage of the spatial diversity provided by a wireless sensor network to counteract this problem. In this paper, we show how an ensemble of spatially diverse detectors can greatly improve the performance of the system. We propose an efficient multichannel detection system for impulsive audio events intended for low-cost Wireless Acoustic Sensor Networks. Our proposal is based on a low-complexity classification algorithm and an efficient method for including temporal context in the feature vector. The obtained results show that the proposed detection system is capable of achieving a more than adequate performance without incurring large computational loads.
Article
Among various Sound Event Detection (SED) systems, Recurrent Neural Networks (RNNs) such as the long short-term memory unit and the gated recurrent unit are used to capture temporal dependencies, but the length of the dependencies they can model is limited, so they fail to model sound events with long durations. Moreover, RNNs cannot process a sequence in parallel, leading to low efficiency and limited industrial value. Given these shortcomings, we propose to use dilated convolution (and causal dilated convolution) to capture temporal dependencies, owing to its ability to maintain high time resolution and obtain longer temporal dependencies while keeping the filter size and network depth unchanged. In addition, dilated convolution can be parallelised, so it has higher efficiency and industrial value. Based on this, we propose Single-Scale Fully Convolutional Networks (SS-FCN) composed of convolutional neural networks and dilated convolutional networks, with the former providing frequency invariance and the latter capturing temporal dependencies. With dilated convolution controlling the length of temporal dependencies, we observe that SS-FCN, which models a single length of temporal dependency, achieves superior detection performance for a limited set of event types. For better performance, we propose Multi-Scale Fully Convolutional Networks (MS-FCN), in which a feature fusion module is introduced to capture long- and short-term dependencies by fusing features with different lengths of temporal dependency. The proposed methods achieve competitive performance on three main datasets with higher efficiency. The results show that SED systems based on fully convolutional networks have further research value and potential.
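How dilation extends temporal context without changing the filter size can be sketched with a small PyTorch stack; the channel counts and dilation schedule below are arbitrary assumptions, not the SS-FCN/MS-FCN architecture itself.

```python
import torch
import torch.nn as nn

class DilatedTemporalBlock(nn.Module):
    """A stack of 1-D convolutions with exponentially growing dilation.

    With kernel size 3 and dilations 1, 2, 4, 8 the receptive field spans
    1 + 2*(1+2+4+8) = 31 frames, yet every layer keeps the same filter size.
    """
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [nn.Conv1d(channels, channels, kernel_size=3,
                                 dilation=d, padding=d),   # output keeps the frame count
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, channels, frames)
        return self.net(x)

block = DilatedTemporalBlock()
frames = torch.randn(2, 64, 500)   # e.g. 500 frames of 64-dim spectral features
print(block(frames).shape)         # torch.Size([2, 64, 500])
```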
Book
Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs. Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication. The Application of Hidden Markov Models in Speech Recognition presents the core architecture of a HMM-based LVCSR system and proceeds to describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination. It concludes with a case study of LVCSR for Broadcast News and Conversation transcription in order to illustrate the techniques described. The Application of Hidden Markov Models in Speech Recognition is an invaluable resource for anybody with an interest in speech recognition technology.
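As a tiny, generic illustration of HMMs over spectral feature sequences (not the book's LVCSR architecture), one can fit a Gaussian HMM with hmmlearn; the feature sequences here are random placeholders.

```python
import numpy as np
from hmmlearn import hmm

# Placeholder "spectral" feature sequences: two utterances of MFCC-like vectors.
utterance1 = np.random.randn(120, 13)
utterance2 = np.random.randn(90, 13)
X = np.vstack([utterance1, utterance2])
lengths = [len(utterance1), len(utterance2)]

# A 5-state HMM with diagonal-covariance Gaussian emissions.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(X, lengths)

# Log-likelihood and most likely state sequence for a new sequence.
test = np.random.randn(100, 13)
print(model.score(test))
print(model.predict(test))
```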
Article
The ability of deep convolutional neural networks (CNN) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.
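Typical audio augmentations of the kind discussed (time stretching, pitch shifting, added background noise) can be sketched with librosa and numpy; the parameter ranges below are illustrative assumptions, not the deformations or magnitudes used in the paper.

```python
import numpy as np
import librosa

def augment(y, sr, rng=np.random.default_rng(0)):
    """Return a few deformed copies of a clip for training-set augmentation."""
    out = []
    # Time stretching: speed up or slow down without changing pitch.
    out.append(librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.2)))
    # Pitch shifting: move the clip up or down by up to two semitones.
    out.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2)))
    # Additive background noise at a modest level.
    noise = rng.normal(scale=0.005 * np.abs(y).max(), size=y.shape)
    out.append(y + noise)
    return out

y, sr = librosa.load(librosa.example("trumpet"))   # any clip; bundled example for demo
augmented_clips = augment(y, sr)
print([len(a) for a in augmented_clips])
```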
Article
In this paper we propose a novel method for the detection of audio events for surveillance applications. The method is based on the bag of words approach, adapted to deal with the specific issues of audio surveillance: the need to recognize both short and long sounds, the presence of a significant noise level and of superimposed background sounds of intensity comparable to the audio events to be detected. In order to test the proposed method in complex, realistic scenarios, we have built a large, publicly available dataset of audio events. The dataset has allowed us to evaluate the robustness of our method with respect to varying levels of the Signal-to-Noise Ratio; the experimentation has confirmed its applicability in real world conditions, and has shown a significant performance improvement with respect to other methods from the literature.
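The bag-of-words idea carried over to audio (cluster frame-level features into a codebook, then represent a clip as a histogram of codeword counts) can be sketched as follows; the codebook size, features, file names, and labels are assumptions, and the paper's specific adaptations for noise and event length are not reproduced.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.svm import SVC

N_WORDS = 64   # codebook size (assumption)

def frame_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T      # (n_frames, 13)

# Hypothetical training clips and event labels.
train_paths = ["scream.wav", "glass.wav", "background.wav"]
train_labels = [1, 1, 0]

all_frames = np.vstack([frame_features(p) for p in train_paths])
codebook = KMeans(n_clusters=N_WORDS, n_init=10).fit(all_frames)

def bag_of_words(path):
    """Histogram of codeword assignments, normalised to sum to 1."""
    words = codebook.predict(frame_features(path))
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / hist.sum()

X = np.array([bag_of_words(p) for p in train_paths])
clf = SVC().fit(X, train_labels)
print(clf.predict([bag_of_words("unknown.wav")]))
```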
Article
The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of a single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in C_avg on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion, leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.
Article
The automatic recognition of sound events by computers is an important aspect of emerging applications such as automated surveillance, machine hearing and auditory scene understanding. Recent advances in machine learning, as well as in computational models of the human auditory system, have contributed to advances in this increasingly popular research field. Robust sound event classification, the ability to recognise sounds under real-world noisy conditions, is an especially challenging task. Classification methods translated from the speech recognition domain, using features such as mel-frequency cepstral coefficients, have been shown to perform reasonably well for the sound event classification task, although spectrogram-based or auditory image analysis techniques reportedly achieve superior performance in noise. This paper outlines a sound event classification framework that compares auditory image front end features with spectrogram image-based front end features, using support vector machine and deep neural network classifiers. Performance is evaluated on a standard robust classification task in different levels of corrupting noise, and with several system enhancements, and shown to compare very well with current state-of-the-art classification techniques.
Conference Paper
The problem of acoustic detection and recognition is of particular interest in surveillance applications, especially in noisy environments with sound sources of different nature. Therefore, we present a multiple energy detector (MED) structure which is used to extract a new set of features for classification, called frequency MED (FMED) and combined MED (CMED). The focus of this paper is to compare these two novel feature sets with the commonly used MFCC and to evaluate their performance in a general sound classification task with different acoustic sources and adverse noise conditions. The promising results obtained show that, in low SNR, the proposed CMED features work significantly better than the MFCC.
Article
This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition. First, a general principle of time-normalization is given using a time-warping function. Then, two time-normalized distance definitions, called symmetric and asymmetric forms, are derived from the principle. These two forms are compared with each other through theoretical discussions and experimental studies. The superiority of the symmetric form algorithm is established. A new technique, called slope constraint, is successfully introduced, in which the slope of the warping function is restricted so as to improve discrimination between words in different categories. The effective slope constraint characteristic is qualitatively analyzed, and the optimum slope constraint condition is determined through experiments. The optimized algorithm is then extensively subjected to experimental comparison with various DP algorithms previously applied to spoken word recognition by different research groups. The experiment shows that the present algorithm gives no more than about two-thirds the errors, even compared to the best conventional algorithm.
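A generic dynamic-time-warping distance of the kind this classic algorithm builds on can be sketched in a few lines of Python; this is the basic symmetric DP recursion without the paper's slope constraint, and the toy sequences are placeholders.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Basic DTW between two feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])    # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

# Two toy "utterances" as short MFCC-like sequences.
x = np.random.randn(40, 13)
y = np.random.randn(55, 13)
print(dtw_distance(x, y))
```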
Article
This paper briefly describes the major features of the DRAGON speech understanding system. DRAGON makes systematic use of a general abstract model to represent each of the knowledge sources necessary for automatic recognition of continuous speech. The model--that of a probabilistic function of a Markov process--is very flexible and leads to features which allow DRAGON to function despite high error rates from individual knowledge sources. Repeated use of a simple abstract model produces a system which is simple in structure, but powerful in capabilities.