
Performance measures of noise reduction algorithms in voice control channels of unmanned aerial vehicles

3rd IEEE International Conference «Actual Problems of Unmanned Aerial Vehicles Developments», October 13-15, 2015, Kyiv, Ukraine
Arkadiy Prodeus
Acoustics and Electroacoustics Department
Faculty of Electronics, NTUU KPI
Kyiv, Ukraine
aprodeus@gmail.com
Abstract—In this paper, six noise reduction algorithms are
compared using a set of indicators. Among them are popular
noise reduction algorithms such as spectral subtraction, Wiener
filtering, MMSE and logMMSE, and two less well-known
algorithms, Wiener-TSNR and Wiener-HRNR. It is shown that
when the noise reduction system is used as a preprocessor of
an automatic speech recognition (ASR) system, only a few
speech quality indicators are in satisfactory agreement with
the recognition accuracy. In particular, these include the
Log-Likelihood Ratio (LLR) and Signal Composite Index (SCI)
indicators. In addition, it is shown that no single algorithm
among those considered is the best in terms of maximum
recognition accuracy over a wide range of input signal-to-noise
ratios from −10 dB to +30 dB.
Keywords—noise reduction algorithm; speech quality indicator;
recognition accuracy; speech signal; noise interference
I. INTRODUCTION
A number of new aviation systems, unmanned aerial
vehicles (UAVs) among them, are beginning to utilize
speech recognition technology. In particular, it is widely
believed that voice control would enable air battle managers to
control their UAVs using voice commands rather than mouse,
keyboard, and function key inputs (Fig. 1).
Fig. 1. ASR system incorporation into UAV control channel
The block diagram shown in Fig. 1 is a schematic diagram
of a control channel that incorporates natural language
processing. A human controller is present to issue directives
based on a UAV’s current state and the controller’s intentions.
Once these verbal commands are processed by the ASR
system, they are translated into a set of high-level goals and
constraints that are then passed on to the UAV’s planning
algorithms. These planning algorithms then generate a
sequence of maneuvers for the UAV.
Ensuring acceptable speech quality and intelligibility, as
well as increasing the robustness of automatic speech
recognition (ASR) systems to noise interference through the
use of noise reduction preprocessors, is a topical issue for
UAV voice control channels (Fig. 2).
Fig. 2. Noise reduction system as ASR preprocessor
The additive mixture $y(t) = x(t) + n(t)$ of signal $x(t)$ and
noise $n(t)$ is the most common model of speech distortion.
A noise reduction algorithm recovers the signal $x(t)$ from
the mixture $y(t)$:
$$\hat{x}(t) = A\{y(t)\},$$
where $\hat{x}(t)$ and $A\{\cdot\}$ are the result and the operator of
speech enhancement, respectively.
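The additive model can be illustrated with a minimal Python sketch that builds a mixture at a prescribed input SNR. The function name `mix_at_snr`, the sinusoidal stand-in for speech, and the use of whole-signal power to define the SNR are assumptions made here for illustration; the paper does not state how its test mixtures were scaled.

```python
import numpy as np

def mix_at_snr(x, n, snr_db):
    """Return y = x + g*n, with g chosen so that 10*lg(Px / P(g*n)) = snr_db.

    Powers are averaged over the whole signal (an assumption of this sketch).
    The noise record n must be at least as long as x.
    """
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)[: len(x)]
    p_x = np.mean(x ** 2)          # signal power
    p_n = np.mean(n ** 2)          # noise power before scaling
    g = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return x + g * n

# Hypothetical example: a 1 s tone at 22050 Hz mixed with white noise at -10 dB
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
n = rng.standard_normal(22050)
y = mix_at_snr(x, n, snr_db=-10.0)
```

Sweeping `snr_db` from -10 to 30 would produce the kind of input range examined in this paper.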
Three groups of indicators are used to assess the
performance of noise reduction algorithms: 1) speech quality
indicators; 2) speech intelligibility indicators; 3) speech
recognition accuracy. While this assessment is a fairly typical
task, the choice of indicators depends largely on the
preferences of researchers [1,2,3,4]. This can be explained by
the fact that the problem of such a choice has not been
sufficiently investigated [5,6,7,8,9,10]. Therefore, the aim of
this paper, in addition to comparing a set of noise suppression
algorithms with one another, is to study the agreement between
the various performance indicators of noise suppression
algorithms.
[Fig. 1 block labels: Human command; Automatic speech recognition system; text; Planning algorithms and low-level controls; UAV action]
[Fig. 2 block labels: $y(t)$; Noise reduction preprocessor; $\hat{x}(t)$; Speech quality and intelligibility; Automatic speech recognition system; ASR accuracy]
II. NOISE REDUCTION ALGORITHMS
The algorithms analyzed in this paper perform speech
enhancement in the frequency domain. This technique is one of
the most widely used approaches to noise suppression.
Analytically it is described as
$$\hat{\lambda}_x^{1/2}(l,k) = G(l,k)\,\lambda_y^{1/2}(l,k),$$
where $\lambda_y(l,k)$ is the power spectrum of the $l$-th frame of
signal $y(t)$ at frequency $f_k = k F_s / N_{fft}$; $F_s$ is the sampling
rate; $N_{fft}$ is the FFT size; $k$ is the frequency sample number;
$\hat{\lambda}_x(l,k)$ is the power spectrum estimate of the $l$-th frame
of signal $\hat{x}(t)$; $G(l,k)$ is the gain of the correction filter.
In this paper the algorithms of spectral subtraction, Wiener
filtering, MMSE, logMMSE, Wiener-TSNR and Wiener-HRNR are
considered. All these algorithms are well known except for the
Wiener-TSNR and Wiener-HRNR algorithms, which were proposed
relatively recently [2]. Interest in the latter two algorithms
stems from their strong noise suppression ability. However,
the degree of speech signal distortion, which always
accompanies noise cancellation, has not been sufficiently
studied for these algorithms.
Since there are several variants of the spectral subtraction
algorithm, it should be noted that the algorithm used in this
paper subtracts amplitude spectra. Note also that the phase
of the distorted signal $y(t)$ is used as the phase of the
enhanced signal $\hat{x}(t)$.
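A minimal sketch of amplitude spectral subtraction with the noisy phase retained is given below. It is not the implementation used in this study: the noise estimate taken from the first few frames (`noise_frames`), the spectral floor `floor`, and the overlap-add normalization are assumptions of this sketch, while the Hamming window, 50% overlap and 32 ms frames follow Section IV.

```python
import numpy as np

def spectral_subtraction(y, fs, frame_ms=32.0, noise_frames=6, floor=0.02):
    """Amplitude spectral subtraction; the phase of y is reused for the output.

    Assumes len(y) is at least one frame and that the first `noise_frames`
    frames contain noise only (an assumption of this sketch).
    """
    y = np.asarray(y, dtype=float)
    n_win = int(round(fs * frame_ms / 1000.0))
    n_win += n_win % 2                 # even length, so the hop is half a frame
    hop = n_win // 2                   # 50% overlap
    win = np.hamming(n_win)
    n_frames = 1 + (len(y) - n_win) // hop

    frames = np.stack([y[i * hop: i * hop + n_win] * win
                       for i in range(n_frames)])
    Y = np.fft.rfft(frames, axis=1)    # complex spectra of the noisy frames
    mag = np.abs(Y)

    # Noise amplitude spectrum: average over the leading frames
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Gain G(l,k) = max(1 - |N(k)| / |Y(l,k)|, floor), applied to amplitudes
    gain = np.maximum(1.0 - noise_mag / np.maximum(mag, 1e-12), floor)
    X_hat = gain * Y                   # real-valued gain keeps the phase of Y

    # Overlap-add, normalized by the summed analysis window
    x_hat = np.zeros(len(y))
    norm = np.zeros(len(y))
    for i, frame in enumerate(np.fft.irfft(X_hat, n=n_win, axis=1)):
        sl = slice(i * hop, i * hop + n_win)
        x_hat[sl] += frame
        norm[sl] += win
    return x_hat / np.maximum(norm, 1e-8)
```

Because the gain is real and non-negative, multiplying the complex spectrum by it changes only the amplitude, which is how the phase of the distorted signal carries over to the enhanced one.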
III. QUALITY MEASURES
When a noise reduction system is used as a preprocessor of
ASR, its performance can be evaluated by means of an end-to-end
quality indicator named “ASR accuracy” [4]:
$$Acc\% = \frac{N - D - S - I}{N} \times 100\%,$$
where $N$ is the total number of labels in the reference
transcriptions; $D$ is the number of deletion errors; $S$ is the
number of substitution errors; $I$ is the number of insertion
errors.
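As a worked illustration, the accuracy indicator can be computed from the four counts as follows. The function name `asr_accuracy` and the error counts in the example are hypothetical and are not results of this study.

```python
def asr_accuracy(n_labels, deletions, substitutions, insertions):
    """Acc% = (N - D - S - I) / N * 100."""
    if n_labels <= 0:
        raise ValueError("N must be positive")
    errors = deletions + substitutions + insertions
    return 100.0 * (n_labels - errors) / n_labels

# Hypothetical counts: 27 reference labels, 1 deletion, 2 substitutions,
# 3 insertions -> (27 - 6) / 27 * 100, i.e. about 77.8
acc = asr_accuracy(27, 1, 2, 3)
```

Since insertions are counted as errors, Acc% becomes negative when the total number of errors exceeds N.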
The weak point of Acc% is the need to simulate an ASR
system. Since this is a difficult task in its own right, it seems
advisable to explore the possibility of replacing the Acc%
indicator with speech quality and speech intelligibility
indicators. Of course, when no ASR system is involved, the
indicators of quality and intelligibility are paramount. The
following such indicators are considered in this paper:
Segmental Signal-to-Noise Ratio (SSNR), Log-Spectral Distortion
(LSD), Log-Likelihood Ratio (LLR), Weighted Spectral Slope
(WSS), Itakura-Saito distance (IS), cepstral distance (CEP),
the composite indices Signal Composite Index, Noise Composite
Index and Overall Composite Index (SCI, NCI, OCI), and the
perceptual indicators Bark Spectral Distortion (BSD) and
Perceptual Evaluation of Speech Quality (PESQ).
Analytically, the parameters SSNR, LSD and BSD are
described as follows:
$$SSNR = \frac{1}{L}\sum_{l=1}^{L} 10\lg\frac{\sum_{n=Rl}^{Rl+N-1} x^2(l,n)}{\sum_{n=Rl}^{Rl+N-1}\left[x(l,n)-\hat{x}(l,n)\right]^2},$$
$$LSD = \frac{1}{L}\sum_{l=1}^{L}\sqrt{\frac{2}{R}\sum_{r=0}^{R/2-1}\left[G\{X(l,r)\}-G\{\hat{X}(l,r)\}\right]^2},$$
$$G\{X(l,r)\} = \max\{20\lg|X(l,r)|,\ \delta\}, \qquad \delta = \max_{l,r}\{20\lg|X(l,r)|\} - 50,$$
$$BSD = \frac{\sum_{l=1}^{L}\sum_{k=1}^{K}\left[B\{X(l,k)\}-B\{\hat{X}(l,k)\}\right]^2}{\sum_{l=1}^{L}\sum_{k=1}^{K}\left[B\{X(l,k)\}\right]^2},$$
where $x(l,n)$ and $\hat{x}(l,n)$ are the $n$-th samples of the $l$-th frame of
the anechoic speech signal $x(n)$ and the enhanced signal $\hat{x}(n)$,
respectively; $X(l,k)$ and $\hat{X}(l,k)$ are the spectrograms of signals
$x(n)$ and $\hat{x}(n)$, respectively; $B\{X(l,k)\}$ and $B\{\hat{X}(l,k)\}$ are
the Bark spectra of the $l$-th frame of signals $x(n)$ and $\hat{x}(n)$,
respectively.
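The two computationally simple indicators can be sketched in Python as shown below. This is an illustration under stated assumptions, not the code used in the study: the helper names, the default frame length of 706 samples (32 ms at 22050 Hz, rounded) with a 353-sample hop, the small constants guarding against division by zero, and the root-mean-square form of the per-frame log-spectral difference over the one-sided spectrum are choices of this sketch. The 50 dB limit on the dynamic range of the log spectrum follows the definition of $\delta$.

```python
import numpy as np

def frame_signal(s, n_win, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    s = np.asarray(s, dtype=float)
    n_frames = 1 + (len(s) - n_win) // hop
    idx = np.arange(n_win)[None, :] + hop * np.arange(n_frames)[:, None]
    return s[idx]

def ssnr_db(x, x_hat, n_win=706, hop=353, eps=1e-12):
    """Segmental SNR: mean over frames of 10*lg(signal energy / error energy)."""
    fx = frame_signal(x, n_win, hop)
    fe = fx - frame_signal(x_hat, n_win, hop)
    ratio = (np.sum(fx ** 2, axis=1) + eps) / (np.sum(fe ** 2, axis=1) + eps)
    return float(np.mean(10.0 * np.log10(ratio)))

def lsd_db(x, x_hat, n_win=706, hop=353, dyn_range_db=50.0):
    """Log-spectral distortion with the log spectra clipped at delta."""
    win = np.hamming(n_win)
    X = np.abs(np.fft.rfft(frame_signal(x, n_win, hop) * win, axis=1))
    Xh = np.abs(np.fft.rfft(frame_signal(x_hat, n_win, hop) * win, axis=1))
    LX = 20.0 * np.log10(X + 1e-12)
    LXh = 20.0 * np.log10(Xh + 1e-12)
    delta = LX.max() - dyn_range_db        # max over all frames and bins - 50 dB
    GX = np.maximum(LX, delta)
    GXh = np.maximum(LXh, delta)
    per_frame = np.sqrt(np.mean((GX - GXh) ** 2, axis=1))
    return float(np.mean(per_frame))
```

Both functions expect the clean and the enhanced signal to be time-aligned and of equal length. Implementations of SSNR often limit the per-frame values to a fixed range before averaging; no such limiting is applied here.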
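The indicators LLR, IS and CEP are computed for each
frame and then averaged over all frames:
$$d_{LLR}(\vec{a}_p,\vec{a}_c) = \ln\frac{\vec{a}_p \mathbf{R}_c \vec{a}_p^T}{\vec{a}_c \mathbf{R}_c \vec{a}_c^T},$$
$$d_{IS}(\vec{a}_p,\vec{a}_c) = \frac{\sigma_c^2}{\sigma_p^2}\cdot\frac{\vec{a}_p \mathbf{R}_c \vec{a}_p^T}{\vec{a}_c \mathbf{R}_c \vec{a}_c^T} + \ln\frac{\sigma_c^2}{\sigma_p^2} - 1,$$
$$d_{CEP}(\vec{c}_c,\vec{c}_p) = \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{p}\left[c_c(k)-c_p(k)\right]^2},$$
$$c(m) = a_m + \sum_{k=1}^{m-1}\frac{k}{m}\,c(k)\,a_{m-k}, \qquad 1 \le m \le p,$$
where $\vec{a}_c$ and $\vec{a}_p$ are the linear prediction coefficients of the
clean and enhanced signals, respectively; $\mathbf{R}_c$ is the
autocorrelation matrix of the clean signal; $\sigma_c^2$ and $\sigma_p^2$ are
the variances of the clean and enhanced signals, respectively;
$c(k)$ are the cepstral coefficients; $p$ is the order of the
prediction filter.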
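The per-frame LLR can be sketched in Python as follows. The helper names `lpc` and `llr`, the autocorrelation method for estimating the prediction coefficients, the default order of 10 and the small diagonal regularization are assumptions of this sketch; the paper does not specify these details. Frames are assumed to be non-silent, so that the autocorrelation matrix is not zero.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method linear prediction.

    Returns the prediction-error filter a = [1, a_1, ..., a_p] and the
    (p+1) x (p+1) Toeplitz autocorrelation matrix R of the frame.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order + 1)]
                  for i in range(order + 1)])
    # Normal equations R[1:, 1:] a[1:] = -r[1:], lightly regularized
    coeffs = np.linalg.solve(R[1:, 1:] + 1e-9 * r[0] * np.eye(order), -r[1:])
    return np.concatenate(([1.0], coeffs)), R

def llr(clean_frame, proc_frame, order=10):
    """d_LLR = ln( a_p R_c a_p^T / a_c R_c a_c^T ) for one pair of frames."""
    a_c, R_c = lpc(clean_frame, order)
    a_p, _ = lpc(proc_frame, order)
    return float(np.log((a_p @ R_c @ a_p) / (a_c @ R_c @ a_c)))
```

The denominator is the prediction error energy of the clean frame under its own predictor, which is the smallest value the quadratic form can take, so the distance is zero for identical frames and grows as the spectral envelope of the processed frame departs from that of the clean one. Averaging `llr` over all frame pairs gives the indicator value for an utterance.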
The indicator WSS is calculated as follows:
$$d_{WSS} = \frac{1}{M}\sum_{m=1}^{M}\frac{\sum_{j=1}^{K}W(j,m)\left(S_c(j,m)-S_p(j,m)\right)^2}{\sum_{j=1}^{K}W(j,m)},$$
where $W(j,m)$ is the weight for the $j$-th spectral sample of the $m$-th
frame; $K$ is the number of spectral samples; $M$ is the number of
frames; $S_c(j,m)$ and $S_p(j,m)$ are the spectral slopes of the
clean and processed speech signals, respectively. The spectral
slope is obtained as the difference between adjacent spectral
magnitudes in decibels. In our implementation, the number of
bands was set to $K = 25$.
PESQ is an effective indicator of speech quality, but its
analytical description is very cumbersome. A brief description
can be found in [3]. We note only that the wideband version of
the indicator, WB-PESQ, designed for speech signal analysis
over a 7 kHz bandwidth, was used in our study.
A description of the composite indices can also be found in [3].
IV. EXPERIMENTAL RESULTS
Clean speech signals (single words) were recorded in an
anechoic room and were used for ASR system training. The
parameters of the digitized sounds were: sampling rate 22050 Hz,
16-bit linear quantization. The signal-to-noise ratio (SNR) of
the saved clean speech signals was about 35 dB.
Signal frames with 50% overlap and a Hamming window were
used for signal processing. The frame duration was 32 ms.
The HTK toolkit [4] was used for ASR system simulation.
The ASR system was trained on 269 samples of 27 words of
clean speech recorded by two female speakers. Noisy discrete
speech signals (with 0.2…0.5 s pauses between single words)
were used as test signals, and all 27 words used in training
were present in testing. The phoneme vocabulary contained 27
phonemes of the Ukrainian language, and 39 MFCC_0_D_A
coefficients were used in the ASR simulation.
It should be taken into account that there is no generally
accepted standard ASR system model, so Acc% values will
depend on the kind of ASR model.
The experimental results showed, first, that the
indicators Acc% and PESQ do not agree very well with each
other (Fig. 3). Among the other indicators studied, only
two, LLR and SCI, were in good agreement with the Acc%
indicator (Fig. 4). At the same time, an essential disadvantage
of the LLR and SCI indicators is their inability to reveal the
fairly substantial difference in performance between the MMSE,
logMMSE and spectral subtraction algorithms.
Analysis of the behavior of the Acc% indicator showed that
there is no single noise reduction algorithm that would be the
best in terms of maximum Acc% over a broad range of
signal-to-noise ratios from −10 dB to +30 dB.
Fig. 3. Acc%(SNR) (a) and PESQ(SNR) (b)
Fig. 4. LLR(SNR) (a) and SCI(SNR) (b)
Second, an unexpectedly low efficiency of the Wiener-TSNR
and Wiener-HRNR algorithms was revealed. Indeed, according
to Fig. 3, the use of the Wiener-TSNR and Wiener-HRNR
algorithms for SNR > 3 dB leads to the lowest Acc% values
compared to the other algorithms. Moreover, for SNR > 8 dB the
situation was even worse than with noise reduction disabled
(curve “no enhance”). The LLR and SCI graphs confirm this
fact (Fig. 4), although in a somewhat softened manner: the
situation is worse than with noise reduction disabled only
when SNR > 15 dB. This result is not consistent with the
results of the algorithms' authors [2], so it is advisable to
investigate the cause of this discrepancy in the future. At the
same time, these algorithms showed the best results for all
indicators when the SNR was below 0 dB.
V. CONCLUSION
Comparison of six noise reduction algorithms has shown
that only two of the nine indicators examined, the
log-likelihood ratio and the signal composite index, are in
agreement with the speech recognition accuracy Acc% when the
noise reduction system is used as a preprocessor of an
automatic speech recognition system.
An unexpectedly low efficiency of the Wiener-TSNR and
Wiener-HRNR algorithms was revealed: when SNR > 8 dB, the
speech recognition accuracy Acc% is worse than with noise
reduction disabled. The LLR and SCI indicators confirmed this
fact, although in a somewhat softened manner: the situation is
worse than with noise reduction disabled only when
SNR > 15 dB. This result is not consistent with the results
obtained by the authors of the Wiener-TSNR and Wiener-HRNR
algorithms, so it is advisable to investigate the cause of
this discrepancy in the future.
It was shown that no single algorithm among those
considered is the best in terms of maximum recognition
accuracy Acc% over a wide range of input signal-to-noise
ratios from −10 dB to +30 dB. It follows that the choice of a
noise reduction algorithm for engineering applications should
take into account the signal-to-noise ratio of the distorted
signal.
It should also be taken into account that there is no
generally accepted standard ASR system model, so Acc% values
will depend on the kind of ASR model. However, it is hoped
that the results obtained in this paper will remain
qualitatively correct for other models of automatic speech
recognition systems.
REFERENCES
[1] J. Benesty, M. M. Sondhi, Y. Huang (eds.), Springer Handbook of Speech
Processing. Berlin Heidelberg: Springer, 2007.
[2] C. Plapous, C. Marro, P. Scalart, “Improved signal-to-noise ratio
estimation for speech enhancement,” IEEE Transactions on Audio,
Speech, and Language Processing, vol.14, pp.2098-2108, November
2006.
[3] Y. Hu, P. Loizou, “Evaluation of objective quality measures for speech
enhancement,” IEEE Transactions on Speech and Audio Processing,
vol.16, pp. 229-238, 2008.
[4] S. Young, G. Evermann, M. Gales (eds.), The HTK Book. Cambridge:
University Engineering Department, 2009.
[5] N. Bogdanova, A. Prodeus, “Objective quality evaluation of speech
band-limited signals,” Electronics and Communications, Vol.19, #6(83),
pp.58-65, 2014.
[6] A. Prodeus, “Parameter Optimization of the Single Channel Late
Reverberation Suppression Technique,” Proc. 35th International
Conference on Electronics and Nanotechnology (ELNANO-2015),
Kyiv, Ukraine, pp. 269-274, 2015.
[7] A. Prodeus, “Speech Recognition Performance as Measure of Speech
Dereverberation Quality,” Computational and Applied Mathematics,
Vol.1, No.3, pp. 60-66, 2015. [Online] Available:
http://article.aascit.org/file/html/9280738.html
[8] A. Prodeus, V. P. Ovsianyk, “Estimation of late reverberation spectrum:
Optimization of parameters,” Radioelectronics and Communications
Systems, Vol. 58, Is. 7, pp.322-328, July 2015.
[9] Vitaliy S. Didkovskyi, S.A. Naida, O.A. Zubchenko, “Technique for
rigidity determination of the materials for ossicles prostheses of human
middle ear,” Radioelectronics and Communications Systems, Vol. 58,
No. 3, pp. 134-138, 2015.
[10] K. Pylypenko, A. Prodeus, “Noise Impact Assessment on the Accuracy
of the Determination of Speaker’s Gender by Using Method of the
Cumulant Coefficients,” XIth International Conference “Perspective
Technologies and Methods in MEMS Design” (MEMSTECH 2015),
Lviv–Polyana, Ukraine, pp. 102-106, 2–6 September 2015.