Multi-stream ASR with combination performed either at the output of the CNN acoustic models or at the output of the decoding. S is the number of streams.


Source publication
Conference Paper
Full-text available
This paper proposes a new method for weighting two dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). Auditory saliency is estimated via 2D auditory saliency maps which model the mechanism for allocating human auditory attention. These maps are used to weight T...
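A minimal sketch of the weighting idea described in this abstract, assuming a (frames x bins) T-F representation X(n, ω) and a saliency map S of the same shape, combined with the exponential weighting X(n, ω) exp(S) that appears in the context excerpt below; the function and parameter names (saliency_weighted_tf, alpha) are illustrative, not from the paper.

```python
import numpy as np

def saliency_weighted_tf(x_mag, saliency, alpha=1.0):
    """Weight a T-F representation by an auditory saliency map.

    x_mag    : (frames, bins) magnitude or FBANK representation X(n, w)
    saliency : (frames, bins) saliency map S, typically normalized to [0, 1]
    alpha    : scaling of the saliency contribution (illustrative parameter)
    """
    assert x_mag.shape == saliency.shape
    # Exponential weighting emphasizes salient T-F regions, as in X(n, w) * exp(S).
    return x_mag * np.exp(alpha * saliency)

# Example with random placeholders standing in for real features.
x = np.abs(np.random.randn(200, 40))   # e.g. 40-band FBANK over 200 frames
s = np.random.rand(200, 40)            # saliency map in [0, 1]
fbank_weighted = saliency_weighted_tf(x, s)
```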

Context in source publication

Context 1
... ASR combines information from different speech recognition streams to improve ASR performance [15]. The combination of different ASR streams can exploit the particular strengths of each technique, for instance the acoustic features used in each stream. The combination can be performed at the feature level, at the output of the acoustic models, or at the lattice level [34]. In this work, the single-stream CNN-HMM ASR systems are combined together in the multi-stream CNN-HMM ASR framework. Each single-stream ASR system uses FBANK features computed either from X(n, ω), X(n, ω) exp(S_GI), X(n, ω) exp(S_GF), X(n, ω) exp(S_GT), or from X(n, ω) exp(S_O) (see section 4.1). The single-stream system applying the EBM technique is also used. The combination is performed either at the output of the CNN acoustic models or at the output of the decoding (see Fig. 4). Posterior probabilities of tied triphone HMM states [31] produced by the CNN acoustic models can be combined using a number of methods. Here we apply inverse entropy combination [35], which is one of the effective methods for combining posterior probabilities. In this method, the weight allocated to the posterior probabilities produced by a CNN acoustic model is proportional to the inverse entropy of that acoustic model, which characterizes its discriminative capacity [35]. Prior probabilities of tied triphone HMM states are subtracted in the log domain from the combined posterior probabilities to obtain the scaled log-likelihoods [30], which are subsequently used for decoding. The combination can also be performed on the lattices obtained after decoding. In this work, lattices are combined based on Bayes risk minimization [36], which is an efficient method for lattice combination [37]. Equal weights are allocated to the systems used in lattices ...
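A minimal sketch of the two acoustic-model-level steps described above (inverse entropy combination of per-stream posteriors, then subtraction of log priors to obtain scaled log-likelihoods), assuming each stream provides a (frames x states) posterior matrix over tied triphone HMM states; the function names and the eps smoothing are illustrative, not from the paper.

```python
import numpy as np

def inverse_entropy_combine(posterior_streams, eps=1e-10):
    """Combine per-stream posteriors with weights proportional to inverse entropy.

    posterior_streams : list of S arrays, each (frames, states), rows summing to 1.
    Returns the combined (frames, states) posteriors.
    """
    streams = [np.clip(p, eps, 1.0) for p in posterior_streams]
    # Per-frame entropy of each stream; low entropy -> more confident -> larger weight.
    entropies = np.stack([-(p * np.log(p)).sum(axis=1) for p in streams])  # (S, frames)
    weights = 1.0 / (entropies + eps)
    weights /= weights.sum(axis=0, keepdims=True)                          # normalize over streams
    combined = sum(w[:, None] * p for w, p in zip(weights, streams))
    return combined / combined.sum(axis=1, keepdims=True)

def scaled_log_likelihoods(posteriors, state_priors, eps=1e-10):
    """Subtract log priors from log posteriors to get scaled log-likelihoods for decoding."""
    return np.log(np.clip(posteriors, eps, 1.0)) - np.log(np.clip(state_priors, eps, 1.0))
```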

Similar publications

Article
Full-text available
We argue that improved performance of automatic speech recognition (ASR) systems in mobile communication systems can be achieved through two modules: a front-end (feature extractor) and a back-end (recognizer). In the front-end we use Gabor features (GF-MFCC), chosen for their ability to extract discriminative internal representati...

Citations

... The top-down attention mechanism commonly used in deep learning can be seen as task-based saliency (e.g., derived discriminatively for an emotion recognition task) but not as signal-based (bottom-up) saliency [6]. Some research has integrated bottom-up auditory saliency for speech tasks, e.g., using an auditory saliency spectral mask for noise-robust speech recognition [8], and improving cognitive load classification by pooling a saliency mask over time [9]. While integrating saliency into spectral representations has been useful, few studies have utilized signal-level saliency for emotion tasks. ...
... In our work, three individual G(n, w) are applied to capture intensity, frequency contrast, and temporal contrast, respectively. The 2D representations R_k are up-sampled to R̃_k using interpolation [8]. After the interpolation, the surround (s) is subtracted from the center (c): ...
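A minimal sketch of the center-surround step in the excerpt above, assuming the multi-scale maps R_k come from a pyramid of progressively smoothed and downsampled T-F representations; the bilinear up-sampling via scipy.ndimage.zoom and the rectification are illustrative choices, not necessarily those of [8].

```python
import numpy as np
from scipy.ndimage import zoom

def center_surround(pyramid, center_level, surround_level):
    """Across-scale difference: up-sample the coarser surround map to the
    center map's resolution, then subtract it from the center map."""
    center = pyramid[center_level]
    surround = pyramid[surround_level]
    # Up-sample the surround map R_k to the center's shape (interpolation step).
    factors = (center.shape[0] / surround.shape[0],
               center.shape[1] / surround.shape[1])
    surround_up = zoom(surround, factors, order=1)       # bilinear interpolation
    surround_up = surround_up[:center.shape[0], :center.shape[1]]
    # Center minus surround; rectify to keep positive contrast only (illustrative).
    return np.maximum(center - surround_up, 0.0)
```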
... The results showed that the individual components of the DCS algorithm were highly accurate for reducing both additive noise and reverberation. The aim of [67] was to establish a new method for weighting a two-dimensional time-frequency representation of speech. The weighting was done using auditory saliency maps. ...
Article
Full-text available
A huge amount of research has been done in the field of speech signal processing in recent years. In particular, there has been increasing interest in the automatic speech recognition (ASR) technology field. ASR began with simple systems that responded to a limited number of sounds and has evolved into sophisticated systems that respond fluently to natural language. This systematic review of automatic speech recognition is provided to help other researchers with the most significant topics published in the last six years. This research will also help in identifying recent major ASR challenges in real-world environments. In addition, it discusses current research gaps in ASR. This review covers articles available in five research databases that were completed according to the preferred reporting items for systematic reviews and meta-analyses (PRISMA) protocol. The search strategy yielded 45 articles related to the study’s scope for the period 2015–2020. The results presented in this review shed light on research trends in the area of ASR and also suggest new research directions.
... -research of pseudorandom sequences with specified parameters of resistance to re-engineering for generators of «white» noise; -research of the influence of frequency reverberation on the possibility of restoring speech information [22][23][24]; -research of the possibility of distinguishing a «specific» speaker from a conversation with the simultaneous speech of two or more speakers [25,26]; -development of methods for identifying features of a speech signal in noisy phonograms [27,28]. ...
... Further development of research on the possibility of restoring speech information is given in [23]. That paper proposes a method of weighting the time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). ...
Article
Full-text available
Assessment of the level of speech information protection from leakage through acoustic and vibration channels is carried out according to international and national standards and in compliance with regulatory documents. To assess the security level, regulatory documents in many countries specify the use of the signal-to-noise ratio. However, this method has a series of significant shortcomings, which make it impossible to determine the real level of security. An improved objective evaluation method is proposed, based on determining the coefficient of residual intelligibility for a test signal after its recovery by methods of mathematical analysis (adaptive filtering, correlation and spectral analyses, wavelet transformation, etc.). The coefficient of residual intelligibility is determined for each word included in a short phrase, the test signal. An analysis of the frequency of phoneme use in Ukrainian speech was performed. It was shown that, given the definition of the term "allophone" and the number of native speakers, the total number of allophones can be assumed to tend to infinity. To reduce the computational complexity, a formalized approach was proposed based on a simplified linguistic model: a phoneme (one letter), a diphone (two letters), and a triphone (three letters). Text documents can be used as a source of information. Analytical dependences are proposed for calculating the coefficient of residual speech intelligibility and its components: the coefficients of the frequency of allophone use in Ukrainian words and of the importance of allophone recognition for word recognition. The interrelations of the SPC (speech privacy class) and word intelligibility W are shown. On this basis, a scale for objective estimation of the degree of speech information privacy at the boundary of the controlled zone, by the criterion of residual speech intelligibility, is proposed.
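A small illustrative sketch of how the word-level quantities mentioned in this abstract might be combined, assuming per-allophone frequency-of-use and recognition-importance coefficients are available as lookup tables; the aggregation used here (an importance-weighted average per word, averaged over the phrase) is an assumption for illustration, not the paper's exact analytical dependence.

```python
def residual_intelligibility(words, freq_coeff, importance_coeff):
    """Illustrative residual-intelligibility score for a test phrase.

    words            : list of words, each a list of allophone labels
    freq_coeff       : dict allophone -> frequency-of-use coefficient
    importance_coeff : dict allophone -> importance of recognizing it for word recognition
    Returns the mean per-word score (assumed aggregation, for illustration only).
    """
    word_scores = []
    for word in words:
        # Assumed: a word's score is the importance-weighted share of its
        # allophones that remain recognizable after recovery processing.
        num = sum(freq_coeff.get(a, 0.0) * importance_coeff.get(a, 0.0) for a in word)
        den = sum(importance_coeff.get(a, 0.0) for a in word) or 1.0
        word_scores.append(num / den)
    return sum(word_scores) / len(word_scores)
```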
... -research of pseudorandom sequences with specified parameters of resistance to re-engineering for generators of «white» noise; -research of the influence of frequency reverberation on the possibility of restoring speech information [22][23][24]; -research of the possibility of distinguishing a «specific» speaker from a conversation with the simultaneous speech of two or more speakers [25,26]; -development of methods for identifying features of a speech signal in noisy phonograms [27,28]. ...
... Further development of research on the possibility of restoring speech information is given in [23]. That paper proposes a method of weighting the time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). ...
Article
Full-text available
The protection of speech information is one of the main tasks of information protection and is a sign of a responsible attitude of an organization (company) both to its own information resources and to its partners. The object of research is the process of protecting speech information from leakage through acoustic and vibration technical channels at objects of information activity. A distinguishing feature of such facilities is the circulation, processing, and discussion of issues containing information of limited access, including state secrets. A peculiarity of Ukraine is the requirement to use at such facilities only technical means that have passed the relevant certification. The basis of an active noise jamming system is a noise generator. At the same time, one of the most problematic issues is that in Ukraine only noise interference generators of the «white» noise type and its clones are allowed to be used. Such systems have a number of significant drawbacks: a low level of protection of intercepted speech signals against noise filtering, a significant noise level in the premises to be protected, and others. A block diagram of an interference generator is proposed, and its mathematical model is developed and studied in Matlab. In the course of the research, a comparative analysis of the input signals and the signals synthesized by the generator was carried out, and their temporal and spectral characteristics were investigated. The obtained results indicate the high efficiency of the proposed method of protecting speech information. This is because the method of forming speech-like interference has a number of features that provide a significant destructive effect on speech information, namely the use of a combined scrambler model with time and frequency transforms. The method takes into account the use of dynamic keys for the coding system, the connection of third-party sources of speech signals, and ringing (mixing of the input and output signals) at the input of the scrambling unit. This design precludes re-engineering. The results are confirmed by studies of an experimental sample. The destructive effects of typical noise interference («white» noise and its clones) and of the noise interference created by the proposed method are compared by the criterion of residual intelligibility of the speaker's speech. The studies showed that, provided no more than a 10 % level of residual intelligibility is ensured, the output volume level of the noise interference generator can be reduced by almost 6 dBA.
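A heavily simplified, illustrative sketch of the kind of speech-like interference described above, assuming keyed time-segment permutation and frequency-bin permutation stand in for the scrambler's time and frequency transforms, plus mixing with a third-party speech source and feedback of the output into the scrambler input ("ringing"); the specific transforms, parameters, and mixing ratios are assumptions, not the certified generator's design.

```python
import numpy as np

def speech_like_interference(speech, other_speech, key, seg_len=1024, feedback=0.3):
    """Toy scrambler-style interference from float waveform arrays: permute time
    segments, permute FFT bins per segment, feed part of the output back to the
    input ("ringing"), and mix in a third-party speech source, all keyed by `key`."""
    rng = np.random.default_rng(key)                  # dynamic key drives the permutations
    n_seg = len(speech) // seg_len
    segs = speech[:n_seg * seg_len].reshape(n_seg, seg_len).copy()
    segs = segs[rng.permutation(n_seg)]               # time-domain segment permutation
    out = np.zeros_like(segs)
    prev = np.zeros(seg_len)
    for i, seg in enumerate(segs):
        spec = np.fft.rfft(seg + feedback * prev)     # mix previous output back into the input
        spec = spec[rng.permutation(len(spec))]       # frequency-domain bin permutation
        prev = np.fft.irfft(spec, n=seg_len)
        out[i] = prev
    mixed = out.reshape(-1)
    other = other_speech[:len(mixed)]
    return 0.7 * mixed + 0.3 * other                  # add a third-party speech source
```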