All content in this area was uploaded by Arkadiy Prodeus on Oct 20, 2015
3rd IEEE International Conference «Actual Problems of Unmanned Aerial Vehicles Developments», October 13-15, 2015, Kyiv, Ukraine
Performance measures of noise reduction algorithms
in voice control channels of unmanned aerial vehicles
Arkadiy Prodeus
Acoustics and Electroacoustics Department
Faculty of Electronics, NTUU KPI
Kyiv, Ukraine
aprodeus@gmail.com
Abstract—In this paper, six noise reduction algorithms are
compared using a set of indicators. They include the popular
spectral subtraction, Wiener filtering, MMSE and logMMSE
algorithms and the two less well-known Wiener-TSNR and
Wiener-HRNR algorithms. It is shown that when the noise
reduction system is used as a preprocessor of an automatic
speech recognition (ASR) system, only a few speech quality
indicators are in satisfactory agreement with the recognition
accuracy, namely the Log-Likelihood Ratio (LLR) and the Signal
Composite Index (SCI). In addition, it is shown that none of the
considered noise reduction algorithms provides the maximum
recognition accuracy over the whole range of input
signal-to-noise ratios from -10 dB to +30 dB.
Keywords—noise reduction algorithm; speech quality indicator;
recognition accuracy; speech signal; noise interference
I. INTRODUCTION
A number of new aviation systems, unmanned aerial
vehicles (UAVs) among them, are beginning to utilize
speech recognition technology. In particular, it is widely
believed that voice control would enable air battle managers to
control their UAVs using voice commands rather than mouse,
keyboard, and function key inputs (Fig. 1).
Fig. 1. ASR system incorporation into UAV control channel
The block diagram shown in Fig. 1 is a schematic diagram
of a control channel that incorporates natural language
processing. A human controller is present to issue directives
based on a UAV’s current state and the controller’s intentions.
Once these verbal commands are processed by the ASR
system, they are translated into a set of high-level goals and
constraints that are then passed on to the UAV’s planning
algorithms. These planning algorithms then generate a
sequence of maneuvers for the UAV.
Ensuring acceptable speech quality and intelligibility, as
well as making automatic speech recognition (ASR) systems
more robust to noise interference by means of noise reduction
preprocessors, is a pressing issue for airborne voice control
channels (Fig. 2).
Fig. 2. Noise reduction system as ASR preprocessor
The additive mixture $y(t) = x(t) + n(t)$ of a signal $x(t)$ and
noise $n(t)$ is the most common model of speech distortion.
A noise reduction algorithm recovers the signal $x(t)$
from the mixture $y(t)$:
$$\hat{x}(t) = A\{y(t)\},$$
where $\hat{x}(t)$ and $A\{\cdot\}$ are the result and the operator of
speech enhancement, respectively.
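As an illustration of the additive model, the following Python sketch scales a noise record so that the mixture $y = x + n$ has a prescribed SNR. The function name `mix_at_snr`, the global (whole-signal) SNR definition and the sinusoidal stand-in for speech are assumptions made for this example only; the paper does not specify how the test mixtures were generated.

```python
import numpy as np

def mix_at_snr(x, n, snr_db):
    """Return y = x + g*n, with the gain g chosen so that
    10*log10(P_x / P_gn) equals snr_db (powers taken over the whole record)."""
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)[: len(x)]
    if len(n) < len(x):
        raise ValueError("noise record is shorter than the speech record")
    p_x = np.mean(x ** 2)          # signal power
    p_n = np.mean(n ** 2)          # noise power before scaling
    g = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return x + g * n

# Hypothetical test data: a tone stands in for the speech signal x(t).
fs = 22050
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
n = rng.standard_normal(fs)
y = mix_at_snr(x, n, snr_db=5.0)

# SNR actually obtained, measured from the mixture itself.
achieved_snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
```

Sweeping `snr_db` over the range studied in the paper would produce one test mixture per SNR value.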
Three groups of indicators are used to assess the
performance of noise reduction algorithms: 1) speech quality
indicators; 2) speech intelligibility indicators; 3) speech
recognition accuracy. Although such an assessment is a fairly
typical task, the choice of indicators depends largely on the
preferences of the researchers [1,2,3,4], because the problem of
this choice has not been sufficiently investigated
[5,6,7,8,9,10]. Therefore, the aim of this paper, in addition to
comparing a set of noise suppression algorithms with one
another, is to study the agreement between the various
indicators of noise suppression performance.
[Fig. 1 block labels: Human command; Automatic speech recognition system; Planning algorithms and low-level controls; UAV action]
[Fig. 2 block labels: $y(t)$; Noise reduction preprocessor; $\hat{x}(t)$; Automatic speech recognition system; text; Speech quality and intelligibility; ASR accuracy]
II. NOISE REDUCTION ALGORITHMS
The algorithms analyzed in this paper perform speech
enhancement in the frequency domain, which is one of the
most widely used approaches to noise suppression.
Analytically, it is described as
$$\hat{\lambda}_x^{1/2}(l,k) = G(l,k)\,\lambda_y^{1/2}(l,k),$$
where $\lambda_y(l,k)$ is the power spectrum of the l-th frame of signal $y(t)$ at
frequency $f_k = k F_s / N_{fft}$; $F_s$ is the sampling rate; $N_{fft}$ is the FFT
size; $k$ is the frequency sample number; $\hat{\lambda}_x(l,k)$ is the
power spectrum estimate of the l-th frame of signal $\hat{x}(t)$; $G(l,k)$ is
the correction filter gain.
In this paper, the spectral subtraction, Wiener filtering,
MMSE, logMMSE, Wiener-TSNR and Wiener-HRNR algorithms
are considered. All of them are well known except for the
recently proposed Wiener-TSNR and Wiener-HRNR algorithms
[2]. Interest in the latter two is due to their high noise
suppression ability. However, the degree of speech signal
distortion, which always accompanies noise reduction, has not
been sufficiently studied for these algorithms.
Since there are several variants of the spectral subtraction
algorithm, it should be noted that the one used in this
paper subtracts amplitude spectra. Note also that the phase of
the distorted signal $y(t)$ is used as the phase of the enhanced
signal $\hat{x}(t)$.
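A minimal Python sketch of amplitude spectral subtraction with reuse of the noisy phase is given below. It is not the implementation used in the experiments: the noise amplitude spectrum is assumed to be estimated from the first `noise_frames` frames (taken to be speech-free), negative differences are simply clipped, and the names `spectral_subtraction`, `noise_frames` and `floor` are introduced for this example. The 32 ms frames, 50% overlap and Hamming window follow the experimental setup described in Section IV.

```python
import numpy as np

def spectral_subtraction(y, fs, noise_frames=6, frame_ms=32.0, floor=0.0):
    """Amplitude spectral subtraction; the phase of y is kept (sketch)."""
    y = np.asarray(y, dtype=float)
    n_win = int(round(fs * frame_ms / 1000.0))
    n_win += n_win % 2                     # even length for an exact 50% overlap
    hop = n_win // 2
    n_frames = 1 + (len(y) - n_win) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    window = np.hamming(n_win)

    # Split into overlapping windowed frames and go to the frequency domain.
    frames = np.stack([y[i * hop: i * hop + n_win] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumption: the leading frames contain noise only.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract amplitude spectra; clip to a (possibly zero) spectral floor.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Resynthesize with the noisy phase, then overlap-add. Dividing by the
    # summed analysis windows restores unit gain where frames overlap.
    out = np.fft.irfft(clean_mag * np.exp(1j * phase), n=n_win, axis=1)
    x_hat = np.zeros(len(y))
    norm = np.zeros(len(y))
    for i in range(n_frames):
        s = i * hop
        x_hat[s: s + n_win] += out[i]
        norm[s: s + n_win] += window
    covered = norm > 1e-8                  # tail samples outside any frame stay 0
    x_hat[covered] /= norm[covered]
    return x_hat

# Hypothetical test signal: a tone burst in white noise with leading silence.
fs = 22050
rng = np.random.default_rng(1)
t = np.arange(fs) / fs
x = np.where((t > 0.25) & (t < 0.75), np.sin(2 * np.pi * 440 * t), 0.0)
y = x + 0.1 * rng.standard_normal(fs)
x_hat = spectral_subtraction(y, fs)
```

The other algorithms considered here differ mainly in how the gain applied to the noisy amplitude spectrum is computed; the framing and resynthesis steps can stay the same.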
III. QUALITY MEASURES
When a noise reduction system is used as a preprocessor of
an ASR system, its performance can be evaluated by means of an
end-to-end quality indicator called “ASR accuracy” [4]:
$$\mathrm{Acc\%} = \frac{N - D - S - I}{N} \times 100\%,$$
where $N$ is the total number of labels in the reference
transcriptions; $D$ is the number of deletion errors; $S$ is the
number of substitution errors; $I$ is the number of insertion
errors.
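The accuracy formula can be checked with a few lines of Python. The function name and the error counts below are hypothetical and serve only to show the arithmetic; in practice the counts come from aligning the recognizer output with the reference transcription.

```python
def asr_accuracy_percent(n_labels, deletions, substitutions, insertions):
    """Acc% = (N - D - S - I) / N * 100."""
    if n_labels <= 0:
        raise ValueError("the reference transcription must contain labels")
    return 100.0 * (n_labels - deletions - substitutions - insertions) / n_labels

# Hypothetical counts for a 27-word test utterance.
acc = asr_accuracy_percent(n_labels=27, deletions=2, substitutions=3, insertions=1)
print(f"Acc% = {acc:.1f}")          # 21 of 27 labels remain, about 77.8

# Insertions are penalized, so Acc% can become negative at very low SNR.
acc_low = asr_accuracy_percent(n_labels=10, deletions=0, substitutions=0, insertions=12)
```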
The weak point of Acc% is that it requires simulating an
ASR system. Since this is a difficult task in its own right, it
seems advisable to explore whether the Acc% indicator can be
replaced by speech quality and speech intelligibility indicators.
Of course, when no ASR system is involved, the quality and
intelligibility indicators are of primary importance. The
following indicators are considered in this paper: Segmental
Signal-to-Noise Ratio (SSNR), Log-Spectral Distortion (LSD),
Log-Likelihood Ratio (LLR), Weighted Spectral Slope (WSS),
Itakura-Saito distance (IS), cepstral distance (CEP), the
composite indices Signal Composite Index (SCI), Noise
Composite Index (NCI) and Overall Composite Index (OCI), and
the perceptual indicators Bark-Spectral Distortion (BSD) and
Perceptual Evaluation of Speech Quality (PESQ).
Analytically, the SSNR, LSD and BSD indicators are
described as follows:
$$\mathrm{SSNR} = \frac{1}{L} \sum_{l=1}^{L} 10 \lg \frac{\sum_{n=Rl}^{Rl+N-1} x^2(l,n)}{\sum_{n=Rl}^{Rl+N-1} [x(l,n) - \hat{x}(l,n)]^2},$$
$$\mathrm{LSD} = \frac{2}{RL} \sum_{l=1}^{L} \sum_{r=0}^{R/2-1} \left| G\{X(l,r)\} - G\{\hat{X}(l,r)\} \right|,$$
$$G\{X(l,r)\} = \max\{20 \lg |X(l,r)|, \delta\}, \quad \delta = \max_{l,r}\{20 \lg |X(l,r)|\} - 50,$$
$$\mathrm{BSD} = \frac{\sum_{l=1}^{L} \sum_{k=1}^{K} [B_x(l,k) - B_{\hat{x}}(l,k)]^2}{\sum_{l=1}^{L} \sum_{k=1}^{K} [B_x(l,k)]^2},$$
where $x(l,n)$ and $\hat{x}(l,n)$ are the n-th samples of the l-th frame of
the anechoic speech signal $x(n)$ and the enhanced signal $\hat{x}(n)$,
respectively; $X(l,r)$ and $\hat{X}(l,r)$ are the spectrograms of signals
$x(n)$ and $\hat{x}(n)$, respectively; $B_x(l,k)$ and $B_{\hat{x}}(l,k)$ are the
Bark spectra of the l-th frame of signals $x(n)$ and $\hat{x}(n)$,
respectively; $L$ is the number of frames.
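The SSNR and LSD definitions can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the frame length `n_win` and frame shift `hop` are free parameters, a Hamming window is applied before the FFT for LSD, the first `n_win // 2` FFT bins play the role of the frequency samples, and a small `eps` guards against division by zero and the logarithm of zero. No per-frame clamping of SSNR is applied, since the formula above contains none.

```python
import numpy as np

def frame_signal(x, n_win, hop):
    """Return an array of shape (frames, n_win) of overlapping frames."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - n_win) // hop
    idx = np.arange(n_win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def ssnr_db(x, x_hat, n_win, hop, eps=1e-12):
    """Segmental SNR: mean over frames of 10*lg(signal energy / error energy)."""
    fx = frame_signal(x, n_win, hop)
    err = fx - frame_signal(x_hat, n_win, hop)
    ratio = (np.sum(fx ** 2, axis=1) + eps) / (np.sum(err ** 2, axis=1) + eps)
    return float(np.mean(10.0 * np.log10(ratio)))

def lsd_db(x, x_hat, n_win, hop, dyn_range_db=50.0, eps=1e-12):
    """Log-spectral distortion with the log spectra clipped at delta."""
    w = np.hamming(n_win)
    half = n_win // 2
    X = np.abs(np.fft.rfft(frame_signal(x, n_win, hop) * w, axis=1))[:, :half]
    Xh = np.abs(np.fft.rfft(frame_signal(x_hat, n_win, hop) * w, axis=1))[:, :half]
    log_x = 20.0 * np.log10(X + eps)
    log_xh = 20.0 * np.log10(Xh + eps)
    delta = np.max(log_x) - dyn_range_db   # threshold relative to the clean peak
    g_x = np.maximum(log_x, delta)
    g_xh = np.maximum(log_xh, delta)
    return float(np.mean(np.abs(g_x - g_xh)))

# Hypothetical check: a copy scaled by 0.9 leaves an error of 0.1*x,
# so every frame has an energy ratio of 100, i.e. 20 dB.
rng = np.random.default_rng(0)
x = rng.standard_normal(22050)
ssnr_scaled = ssnr_db(x, 0.9 * x, n_win=706, hop=353)
lsd_same = lsd_db(x, x, n_win=706, hop=353)
```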
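The LLR, IS and CEP indicators are computed for each
frame and then averaged over all frames:
$$d_{LLR}(\vec{a}_p, \vec{a}_c) = \ln \frac{\vec{a}_p \mathbf{R}_c \vec{a}_p^T}{\vec{a}_c \mathbf{R}_c \vec{a}_c^T},$$
$$d_{IS}(\vec{a}_p, \vec{a}_c) = \frac{\sigma_c^2}{\sigma_p^2} \cdot \frac{\vec{a}_p \mathbf{R}_c \vec{a}_p^T}{\vec{a}_c \mathbf{R}_c \vec{a}_c^T} + \ln \frac{\sigma_c^2}{\sigma_p^2} - 1,$$
$$d_{CEP}(\vec{c}_c, \vec{c}_p) = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{p} [c_c(k) - c_p(k)]^2},$$
$$c(m) = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c(k)\, a_{m-k}, \quad 1 \le m \le p,$$
where $\vec{a}_c$ and $\vec{a}_p$ are the linear prediction coefficients of the clean
and enhanced signals, respectively; $\mathbf{R}_c$ is the autocorrelation
matrix of the clean signal; $\sigma_c^2$ and $\sigma_p^2$ are the variances of the clean
and enhanced signals, respectively; $c(k)$ are the cepstral
coefficients; $p$ is the order of the prediction filter.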
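A hedged Python sketch of the per-frame LLR and cepstral distance follows. Several choices here are assumptions not stated in the paper: linear prediction uses the autocorrelation method, the prediction order defaults to 10, the vector entering the LLR is taken as [1, -alpha_1, ..., -alpha_p], and the recursion for c(m) is applied to the predictor coefficients alpha_m. All function names are introduced for this example.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation lags 0..order of one (already windowed) frame."""
    frame = np.asarray(frame, dtype=float)
    return np.array([np.dot(frame[: len(frame) - k], frame[k:])
                     for k in range(order + 1)])

def toeplitz_from(r):
    """Symmetric Toeplitz matrix whose first row is r."""
    idx = np.abs(np.arange(len(r))[:, None] - np.arange(len(r))[None, :])
    return r[idx]

def lpc(frame, order):
    """Predictor coefficients alpha_1..alpha_p from the normal equations."""
    r = autocorr(frame, order)
    return np.linalg.solve(toeplitz_from(r[:-1]), r[1:])

def llr(clean_frame, proc_frame, order=10):
    """Log-likelihood ratio of one frame pair."""
    r_c = toeplitz_from(autocorr(clean_frame, order))
    a_c = np.concatenate(([1.0], -lpc(clean_frame, order)))
    a_p = np.concatenate(([1.0], -lpc(proc_frame, order)))
    return float(np.log((a_p @ r_c @ a_p) / (a_c @ r_c @ a_c)))

def lpc_to_cepstrum(alpha):
    """c(m) = a_m + sum_{k=1}^{m-1} (k/m) c(k) a_{m-k}, for 1 <= m <= p."""
    p = len(alpha)
    c = np.zeros(p)
    for m in range(1, p + 1):
        acc = alpha[m - 1]
        for k in range(1, m):
            acc += (k / m) * c[k - 1] * alpha[m - k - 1]
        c[m - 1] = acc
    return c

def cep_distance(c_clean, c_proc):
    """Cepstral distance in dB between two cepstral coefficient vectors."""
    diff = np.asarray(c_clean) - np.asarray(c_proc)
    return float(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2)))

# Hypothetical frames: windowed noise and a perturbed copy of it.
rng = np.random.default_rng(2)
win = np.hamming(706)
f = win * rng.standard_normal(706)
g = f + 0.5 * win * rng.standard_normal(706)
d_llr = llr(f, g)
d_cep = cep_distance(lpc_to_cepstrum(lpc(f, 10)), lpc_to_cepstrum(lpc(g, 10)))
```

Because the clean-signal coefficients minimize the quadratic form in the denominator, the LLR of a frame pair is non-negative and equals zero when both frames yield the same coefficients.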
The indicator WSS is calculated as follows:
$$d_{WSS} = \frac{1}{M} \sum_{m=1}^{M} \frac{\sum_{j=1}^{K} W(j,m) \left( S_c(j,m) - S_p(j,m) \right)^2}{\sum_{j=1}^{K} W(j,m)},$$
where $W(j,m)$ is the weight for the j-th spectral sample of the m-th
frame; $K$ is the number of spectral samples; $M$ is the number of
frames; $S_c(j,m)$ and $S_p(j,m)$ are the spectral slopes of the
clean and processed speech signals, respectively. The spectral
slope is obtained as the difference between adjacent spectral
magnitudes in decibels. In our implementation, the number of
bands was set to $K = 25$.
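The averaging in the WSS formula can be sketched in Python as below. The sketch takes the band levels in decibels and the weights as given inputs; how the band levels and the weights $W(j,m)$ are obtained is not reproduced here, and the array layout (bands along the first axis, frames along the second) and the function names are assumptions of this example.

```python
import numpy as np

def spectral_slopes(band_db):
    """Differences between adjacent band levels (dB), taken along the band axis."""
    return np.diff(np.asarray(band_db, dtype=float), axis=0)

def wss(slopes_clean, slopes_proc, weights):
    """Weighted squared slope difference per frame, averaged over frames."""
    num = np.sum(weights * (slopes_clean - slopes_proc) ** 2, axis=0)
    den = np.sum(weights, axis=0)
    return float(np.mean(num / den))

# Hypothetical data: 26 band levels give 25 slopes for each of 3 frames.
band_clean = np.zeros((26, 3))
band_proc = band_clean + 2.0 * np.arange(26)[:, None]   # slope of 2 dB per band
s_c = spectral_slopes(band_clean)
s_p = spectral_slopes(band_proc)
w_uniform = np.ones_like(s_c)                            # placeholder weights
d_wss = wss(s_c, s_p, w_uniform)                         # every term equals 2**2
```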
PESQ is an effective indicator of speech quality, but its
analytical description is very cumbersome; a brief description
can be found in [3]. We note only that the wideband version of
the indicator, WB-PESQ, designed for speech signal analysis
over a 7 kHz bandwidth, was used in our study.
A description of the composite indices can also be found in [3].
IV. EXPERIMENTAL RESULTS
Clean speech signals (single words) were recorded in an
anechoic room and used for ASR system training. The digitized
sounds had a sampling rate of 22050 Hz and 16-bit linear
quantization. The signal-to-noise ratio (SNR) of the stored
clean speech signals was about 35 dB.
Signal processing used 32 ms frames with 50% overlap and
a Hamming window.
The HTK toolkit [4] was used for ASR system simulation.
The ASR system was trained on 269 samples of 27 words of
clean speech recorded by two female speakers. Noisy discrete
speech signals (with 0.2…0.5 s pauses between single words)
were used as test signals, and all 27 words used in training
were present in testing. The phoneme vocabulary contained
27 phonemes of the Ukrainian language, and 39 MFCC_0_D_A
coefficients were used in the ASR simulation.
It should be taken into account that there is no generally
accepted standard ASR system model, so Acc% values depend
on the kind of ASR model used.
The experimental results showed, first, that the Acc% and
PESQ indicators do not agree well with each other (Fig. 3).
Among the other indicators studied, only two, LLR and SCI,
were in good agreement with the Acc% indicator (Fig. 4). At
the same time, an essential disadvantage of the LLR and SCI
indicators is their inability to reflect the fairly substantial
difference in performance between the MMSE, logMMSE and
spectral subtraction algorithms.
Analysis of the Acc% indicator behavior showed that no
single noise reduction algorithm is the best in terms of
maximum Acc% over the broad range of signal-to-noise ratios
from -10 dB to +30 dB.
Fig. 3. Acc%(SNR) (a) and PESQ(SNR) (b)
Fig. 4. LLR(SNR) (a) and SCI(SNR) (b)
Second, the efficiency of the Wiener-TSNR and
Wiener-HRNR algorithms turned out to be unexpectedly low.
Indeed, according to Fig. 3, for SNR > 3 dB these algorithms
give the lowest Acc% values of all the algorithms. Moreover,
for SNR > 8 dB the accuracy is even lower than with noise
reduction disabled (curve “no enhance”). The LLR and SCI
graphs confirm this (Fig. 4), although in a somewhat softened
manner: the result is worse than with noise reduction disabled
only when SNR > 15 dB. This result is not consistent with the
results of the authors of these algorithms [2], so the cause of
the discrepancy should be investigated in the future. At the
same time, these algorithms showed the best results on all
indicators when the SNR was below 0 dB.
V. CONCLUSION
A comparison of six noise reduction algorithms has shown
that only two of the nine indicators examined, the
log-likelihood ratio and the signal composite index, are in
agreement with the speech recognition accuracy Acc% when the
noise reduction system is used as a preprocessor of an
automatic speech recognition system.
The efficiency of the Wiener-TSNR and Wiener-HRNR
algorithms turned out to be unexpectedly low: when SNR > 8
dB, the speech recognition accuracy Acc% is worse than with
noise reduction disabled. The LLR and SCI indicators confirmed
this, although in a somewhat softened manner: the result is
worse than with noise reduction disabled only when SNR > 15
dB. This result is not consistent with the results obtained by
the authors of the Wiener-TSNR and Wiener-HRNR algorithms,
so the cause of the discrepancy should be investigated in the
future.
It was shown that none of the considered noise reduction
algorithms is the best in terms of maximum recognition
accuracy Acc% over the whole range of input signal-to-noise
ratios from -10 dB to +30 dB. It follows that the choice of a
noise reduction algorithm for engineering applications should
take into account the signal-to-noise ratio of the distorted
signal.
It should also be taken into account that there is no
generally accepted standard ASR system model, so Acc%
values depend on the kind of ASR model used. However, it is
hoped that the results obtained in this paper will remain
qualitatively correct for other models of automatic speech
recognition systems.
REFERENCES
[1] J. Benesty, M. M. Sondhi, Y. Huang (eds.), Springer Handbook of Speech
Processing. Berlin Heidelberg: Springer, 2007.
[2] C. Plapous, C. Marro, P. Scalart, “Improved signal-to-noise ratio
estimation for speech enhancement,” IEEE Transactions on Audio,
Speech, and Language Processing, vol.14, pp.2098-2108, November
2006.
[3] Y. Hu, P. Loizou, “Evaluation of objective quality measures for speech
enhancement,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 16, pp. 229-238, 2008.
[4] S. Young, G. Evermann, M. Gales (eds.), The HTK Book. Cambridge:
University Engineering Department, 2009.
[5] N. Bogdanova, A. Prodeus, “Objective quality evaluation of speech
band-limited signals,” Electronics and Communications, Vol.19, #6(83),
pp.58-65, 2014.
[6] A. Prodeus, “Parameter Optimization of the Single Channel Late
Reverberation Suppression Technique,” Proc. 35th International
Conference on Electronics and Nanotechnology (ELNANO-2015),
Kyiv, Ukraine, pp. 269-274, 2015.
[7] A. Prodeus, “Speech Recognition Performance as Measure of Speech
Dereverberation Quality,” Computational and Applied Mathematics,
Vol.1, No.3, pp. 60-66, 2015. [Online] Available:
http://article.aascit.org/file/html/9280738.html
[8] A. Prodeus, V. P. Ovsianyk, “Estimation of late reverberation spectrum:
Optimization of parameters,” Radioelectronics and Communications
Systems, Vol. 58, Is. 7, pp.322-328, July 2015.
[9] Vitaliy S. Didkovskyi, S.A. Naida, O.A. Zubchenko, “Technique for
rigidity determination of the materials for ossicles prostheses of human
middle ear,” Radioelectronics and Communications Systems, Vol. 58,
No. 3, pp. 134-138, 2015.
[10] K. Pylypenko, A. Prodeus, “Noise Impact Assessment on the Accuracy
of the Determination of Speaker’s Gender by Using Method of the
Cumulant Coefficients,” XIth International Conference “Perspective
Technologies and Methods in MEMS Design” (MEMSTECH 2015),
Lviv–Polyana, Ukraine, pp. 102-106, 2–6 September 2015.