Subjective intelligibility of deep neural network-based speech enhancement
Femke B. Gelderblom¹, Tron V. Tronstad¹, Erlend Magnus Viggen¹
¹Acoustics Research Centre, SINTEF Digital, Trondheim, Norway
femke.gelderblom@sintef.no, tronvedul.tronstad@sintef.no, erlendmagnus.viggen@sintef.no
Abstract
Recent literature indicates increasing interest in deep neural
networks for use in speech enhancement systems. Currently,
these systems are mostly evaluated through objective measures
of speech quality and/or intelligibility. Subjective intelligibility
evaluations of these systems have so far not been reported. In
this paper we report the results of a speech recognition test with
15 participants, where the participants were asked to pick out
words in background noise before and after enhancement using
a common deep neural network approach. We found that, al-
though the objective measure STOI predicts that intelligibility
should improve or at the very least stay the same, the speech
recognition threshold, which is a measure of intelligibility, de-
teriorated by 4 dB. These results indicate that STOI is not a
good predictor for the subjective intelligibility of deep neural
network-based speech enhancement systems. We also found
that the postprocessing technique of global variance normalisa-
tion does not significantly affect subjective intelligibility.
Index Terms: speech enhancement, deep neural network, sub-
jective evaluation, speech intelligibility
1. Introduction
The field of speech enhancement (SE) aims to improve the qual-
ity and/or intelligibility of speech that has been degraded [1]. In
the past few years, deep neural networks (DNNs) [2, 3] have
emerged as a promising approach for SE, outperforming ear-
lier approaches. SE has been proven useful as a preprocessing
step for automatic speech recognition systems to decrease their
word error rates [4, 5, 6], but the field also aims to make de-
graded speech easier to understand and/or more comfortable to
listen to for humans [5, 7, 8].
The performance of each of these SE approaches with
respect to intelligibility improvement is typically evaluated
through objective measures. Especially popular measures are
STOI [9], PESQ [10], and the word error rates of speech recog-
nition systems. PESQ was originally designed as a measure
for speech quality rather than intelligibility, but was then found
to also correlate reasonably well with subjective intelligibil-
ity [11]. None of today’s objective measures of intelligibility
can perfectly predict intelligibility to humans, and their correla-
tion depends on the type of speech degradation present [9, 12].
Thus, listening tests are necessary to quantify the benefit of
DNN-based SE for human listeners. Listening tests for speech
quality have previously been reported in the literature with posi-
tive results [5, 7, 8]. Quality is however highly subjective, since
whether a signal sounds ‘good’ or ‘poor’ is based on listeners’
preferences. Intelligibility tests are more objective in nature as
these allow for quantitative scoring of how much information
the listener actually understood. To our knowledge, and despite
its popularity, no one has tested the predictive power of STOI
for DNN-based SE against subjective listening tests.
In this work we report the results of a series of listening
tests for intelligibility, where our test subjects attempted to com-
prehend speech in background noise, before and after DNN-
based speech enhancement. Here, we evaluate whether STOI
correctly predicts change in subjective intelligibility for a rea-
sonably common DNN setup. Additionally, we analyse the ef-
fect of the ‘global variance normalisation’ postprocessing step
(described in sec. 2.1.3) on intelligibility.
2. Methods
2.1. DNN system overview
The speech enhancement system is loosely based on the system
Xu et al. proposed in [8], but omits pre-training with restricted
Boltzmann machines as their results indicate that the effect of
pre-training was negligible. The DNN was implemented using
Keras 1.0.5 [13].
2.1.1. Speech and noise preparation
For training, clean speech was combined with noise to ob-
tain noisy speech. The clean speech was obtained from the
Norwegian-language library ‘Språkbanken’ [14], to ensure that
the DNN trained on the same language as used during subjec-
tive evaluation. The setup of Språkbanken is similar to that of
the more widely used TIMIT. The clean speech database was
divided into a training set, a validation set, and a test set (not
used for this article). Care was taken to ensure that each set was
balanced with respect to gender and dialect, and that no spe-
cific speakers or sentences occurred in more than one set. The
final training set consisted of 1932 sentences from 137 unique
speakers, while the validation set contained 816 sentences from
48 speakers.
Periods of silence lasting longer than 75 ms were trimmed
to 75 ms where their levels were 40 dB or more below the peak
of the given sentence, to capture the average dynamic range of
speech [11]. The 75 ms length was arbitrarily chosen as a com-
promise between minimising the number of quiet training sam-
ples, and maintaining a clear separation between words.
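Purely as an illustration of this trimming step, the sketch below shortens low-level stretches in a sentence; the 5 ms analysis frames, the RMS-based level estimate, and the function name are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def trim_silences(x, fs, max_sil=0.075, rel_thresh_db=40.0, frame_len=0.005):
    """Shorten stretches whose level is rel_thresh_db or more below the sentence
    peak and that last longer than max_sil seconds down to max_sil seconds."""
    n = max(1, int(frame_len * fs))
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    peak_db = 20 * np.log10(np.max(np.abs(x)) + 1e-12)
    silent = [20 * np.log10(np.sqrt(np.mean(f ** 2)) + 1e-12) <= peak_db - rel_thresh_db
              for f in frames]
    max_frames = int(round(max_sil / frame_len))   # 75 ms worth of frames
    out, run = [], []
    for frame, is_silent in zip(frames, silent):
        if is_silent:
            run.append(frame)
        else:
            out.extend(run[:max_frames])           # keep at most 75 ms of the pause
            run = []
            out.append(frame)
    out.extend(run[:max_frames])
    return np.concatenate(out) if out else x
```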
Noisy speech was obtained by combining the clean speech
with the same 104 noises Xu et al. used in [8], all obtained from
either the Aurora database [15] or Guoning Hu’s collection [16].
Six different signal-to-noise ratios (SNRs) ranging from −5 dB
to 20 dB, with SNRs applied at sentence level, were used for
training. This range was chosen, despite the need for lower
SNRs during speech intelligibility testing, as a DNN trained
with a more suitable SNR range, but otherwise equal hyper-
parameters, actually performed worse in terms of STOI values
at all SNRs.
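As an illustration of the mixing step, a minimal sketch is given below. Scaling the noise while leaving the speech untouched is our own assumption for the training mixtures (the subjective test in sec. 2.3 instead scales the speech); the RMS-based SNR definition follows sec. 2.3.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to clean speech at a sentence-level SNR by scaling the noise."""
    noise = noise[:len(speech)]                    # assumes noise is at least as long
    rms_speech = np.sqrt(np.mean(speech ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2)) + 1e-12
    gain = rms_speech / (rms_noise * 10 ** (snr_db / 20.0))
    return speech + gain * noise                   # mixture at the requested SNR
```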
The noisy speech, along with clean speech (with ‘infinite
SNR’), was used as input for the DNN. This led to a total of
1984 hours of training data. Noisy speech for validation was
obtained by combining the clean validation speech with the 15
unseen noises Xu et al. specified in [8], obtained from either the
Aurora or NOISEX-92 databases [15, 17]. This resulted in 98
hours of validation data. Both the noisy and clean speech sig-
Figure 1: Diagram of training procedure. The clean and noisy
phases output by the preprocessing steps are discarded.
Figure 2: Diagram of preprocessing steps
nals were down-sampled to 8 kHz, as this was the lowest sam-
pling rate of any of the original signals.
2.1.2. Training
Figure 1 shows a block diagram of the training procedure. The
model learns in a supervised manner, with the standard mean
squared error (MSE) loss function
MSE = \frac{1}{n} \sum_k (\hat{y}_k - y_k)^2 ,    (1)
where ŷ_k and y_k represent the kth frequency bins of the en-
hanced and clean log-power spectral features, respectively. The
features were obtained through the preprocessing steps shown
in Figure 2. During preprocessing, the signal is first separated
into windows that overlap by 50%. The windows consist of
256 samples, and thus represent a timeframe of 32 ms at 8 kHz.
The Hann window function is then applied to each window be-
fore the result is Fourier transformed. Redundant information
above the Nyquist frequency is discarded from the resulting
magnitude spectrum to obtain a single-sided output. Finally,
log-power spectrum features are calculated for each window.
After preprocessing, the input vector is obtained by stacking
21 sequential 50 % overlapping windows that contain the log-
power spectral features. This provides the DNN with 160 ms
historic and 160 ms future context. The phase of both clean and
noisy speech is ignored during training. No normalisation of
input or output was applied.
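For concreteness, a minimal NumPy sketch of these preprocessing steps as we read them is given below; the small epsilon inside the logarithm, the edge padding for the first and last frames, and the function names are our own assumptions.

```python
import numpy as np

def log_power_features(x, n_fft=256):
    """256-sample Hann windows with 50% overlap -> single-sided log-power spectra."""
    hop = n_fft // 2
    win = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    spec = np.fft.rfft(frames, axis=1)        # discard bins above the Nyquist frequency
    return np.log(np.abs(spec) ** 2 + 1e-12)  # log-power per frequency bin

def stack_context(feats, context=10):
    """Stack 21 consecutive frames (10 past + current + 10 future) per input vector,
    i.e. 160 ms of historic and 160 ms of future context at 8 kHz."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.stack([padded[i:i + 2 * context + 1].ravel()
                     for i in range(len(feats))])
```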
The DNN model is a multi-layer perceptron, a feedforward
neural network with fully connected layers. It has three hidden
layers, each with 2048 nodes and LeakyReLU activation func-
tions. The model is trained with 50% dropout on the hidden
layers using the Adam optimiser with a learning rate of 10⁻⁵.
The activation function of the output layer is linear.
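A minimal sketch of this architecture is given below, written against the current Keras API rather than the Keras 1.0.5 version used in the paper; the input dimension of 21 frames × 129 bins is inferred from the preprocessing described above.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_BINS = 129             # single-sided bins of a 256-point FFT
CONTEXT_FRAMES = 21      # 10 past + current + 10 future frames

model = keras.Sequential()
model.add(layers.Dense(2048, input_shape=(CONTEXT_FRAMES * N_BINS,)))
model.add(layers.LeakyReLU())
model.add(layers.Dropout(0.5))           # 50% dropout on the hidden layers
for _ in range(2):                       # two more hidden layers of 2048 units
    model.add(layers.Dense(2048))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(N_BINS))          # linear output: one enhanced log-power frame

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss='mse')                # the MSE loss of Eq. (1)
```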
Training continued until the STOI value reached a maxi-
mum for the validation set at the 8th epoch. The model’s state at
this epoch was used for enhancement. We also trained a number
of different models with different hyperparameters; the model
described here was selected due to its better STOI performance.
Figure 3: Diagram of enhancement procedure
Figure 4: Diagram of postprocessing steps
2.1.3. Enhancement
After training, the model could be used to enhance noisy
speech. Figure 3 shows the enhancement procedure, and Fig-
ure 4 shows the postprocessing steps.
Postprocessing mainly consists of reversing the steps that
were taken during preprocessing, using the noisy phase for
waveform reconstruction. The first step, global variance nor-
malisation (GVN), is the exception to this reversal. This step
aims to prevent over-smoothing by enforcing the variance of
the enhanced speech to be equal to the variance of actual clean
speech. During GVN, the DNN’s output features are multiplied
with a frequency bin independent factor calculated as
\beta = \sqrt{\operatorname{var}_{m,k}[y_k(m)] \,/\, \operatorname{var}_{m,k}[\hat{y}_k(m)]} ,    (2)
where var_{m,k} represents the variance over all values of m and
k, with m indexing examples in the training set and k indexing
frequency bins. Furthermore, from the law of total variance we
can calculate this variance as
\operatorname{var}_{m,k}[a_k(m)] = \frac{1}{K} \sum_k \operatorname{var}_m[a_k(m)] + \operatorname{var}_k\big[\operatorname{mean}_m[a_k(m)]\big] ,    (3)
where K equals the total number of frequency bins and a_k(m)
represents either y_k(m) or ŷ_k(m). This specific method for
the calculation of the global variance combines readily with
Welford’s online algorithm for variance computation, which is
well suited to working with large data sets [18]. Two systems
were tested for this work; one with, and one without the GVN
step.
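As a concrete illustration, a small NumPy sketch of the global variances and the GVN factor of Eqs. (2) and (3) is given below. This is our own reconstruction, not the authors' code; in practice the per-bin means and variances can be accumulated with Welford's online algorithm [18] instead of being computed from arrays held in memory.

```python
import numpy as np

def global_variance(feats):
    """Global variance over all frames m and bins k of a (frames x bins) array,
    computed as in Eq. (3): the mean over k of the per-bin variances plus the
    variance over k of the per-bin means (law of total variance)."""
    return feats.var(axis=0).mean() + feats.mean(axis=0).var()

def gvn_factor(clean_train_feats, enhanced_train_feats):
    """Beta of Eq. (2): square root of the ratio between the clean and enhanced
    global variances, both estimated over the training set."""
    return np.sqrt(global_variance(clean_train_feats) /
                   global_variance(enhanced_train_feats))

# At enhancement time, the DNN's output log-power features are simply scaled
# by this single, frequency-bin-independent factor:
#   enhanced_feats = gvn_factor(clean_train_feats, enhanced_train_feats) * enhanced_feats
```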
2.2. Objective evaluation
The short-term objective intelligibility (STOI) measure [9] was
used to test the model’s performance. The advantages of STOI
include a documented strong correlation with subjective speech
intelligibility [9] and the possibility to compare obtained results
with earlier publications [8]. Additionally, unlike with some
other popular objective measures like PESQ, use of STOI is not
restricted by licensing.
Objective evaluation results were obtained both for the val-
idation set and for the signals used during subjective testing.
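For readers wishing to reproduce such an evaluation, the sketch below computes STOI with the third-party pystoi package; this package is not mentioned in the paper, and the signals used here are random placeholders.

```python
import numpy as np
from pystoi import stoi   # pip install pystoi

fs = 8000                                           # signals in this work are at 8 kHz
clean = np.random.randn(3 * fs)                     # placeholder clean reference
processed = clean + 0.3 * np.random.randn(3 * fs)   # placeholder enhanced/noisy signal

score = stoi(clean, processed, fs, extended=False)  # roughly in [0, 1]; higher is better
print(f"STOI: {score:.3f}")
```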
2.3. Subjective evaluation
The subjective evaluation of intelligibility was performed us-
ing a speech recognition test. Figure 5 shows the user interface
implemented in MATLAB [19].
Figure 5: The GUI of the Norwegian-language subjective test
Random five-word sentences,
all uttered by the same male speaker, were presented at differ-
ent SNRs to determine the speech recognition threshold (SRT).
All sentences were in Norwegian and structured the same way:
[Name], [Verb], [Numeral], [Adjective], [Noun], with 10 op-
tions for each. The subjects’ task was to pick out which word in
each category was in the sentence they just heard. The speech
material has been taken from Øygarden’s hearing in noise test,
which is based on Hagerman sentences [20].
To keep the subjective test to a manageable length, only one
noise file was used: a road traffic recording from a crossroad
in central Trondheim, a common type of background noise in
cities. Each sentence was mixed with a random section of this
noise file at the desired SNR. The SNR was calculated from the
root-mean-square (RMS) value for the sentence without noise
and the RMS value for the selected section of the noise signal.
The background noise was kept constant at a comfortable level
while the speech was varied to achieve the correct SNR. The
speaker, utterances, and noise used in this test had not been in-
cluded during DNN training nor during validation.
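A small sketch of how such a stimulus could be constructed is shown below; it is our own illustration, with the noise level held fixed and the speech scaled to reach the target SNR, as described above.

```python
import numpy as np

rng = np.random.default_rng()

def make_stimulus(sentence, noise, snr_db):
    """Mix a sentence with a random section of the (fixed-level) noise recording,
    scaling the speech so that the sentence/noise RMS ratio gives the target SNR."""
    start = rng.integers(0, len(noise) - len(sentence))   # assumes noise > sentence
    noise_seg = noise[start:start + len(sentence)]
    rms_speech = np.sqrt(np.mean(sentence ** 2)) + 1e-12
    rms_noise = np.sqrt(np.mean(noise_seg ** 2))
    gain = (rms_noise / rms_speech) * 10 ** (snr_db / 20.0)  # scale the speech only
    return gain * sentence + noise_seg
```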
Each subject completed three tests. For each test case, all
material was first down-sampled to 8 kHz. One test set was
left otherwise untreated (‘Noisy’), while for the other cases
the speech was enhanced according to the method described in
sec. 2.1.3 (‘DNN with/without GVN’), where the GVN step was
only included for one of these cases. The material of each test
set was subsequently up-sampled to 44.1 kHz before being pre-
sented to the subject. All sentences were presented binaurally
with Sennheiser HDA-200 headphones via an external sound
card (Roland Edirol UA-101).
An adaptive procedure called the Ψ method [21] was used
to determine the presentation levels during testing. The method
uses the entropy of the posterior probability distribution in the
determination of the next stimulus level. The Palamedes MAT-
LAB toolbox [22] was used for the realisation of the Ψ method.
The test was not forced choice, but the test subjects were
encouraged to guess whenever they thought they (partly) recog-
nised a word. Both the guess and lapse rate were set to 0.01 in
the method. The threshold and slope value were allowed to vary
in the estimation of the psychometric function. The stimulation
range of the SNRs was from -36 dB to 10 dB, in 2 dB steps.
15 persons, with ages from 39 to 65 (Mean = 54.2, SD =
9.5), participated. The only selection criterion was that
all participants had to have Norwegian as their first language.
All test subjects were given a training session before the three
situations (Noisy, DNN with GVN, and DNN without GVN)
were tested and the test sequence was randomised between each
individual to reduce any further training effect that could occur
during the session. The test subjects were also allowed to take
a break during the test if they desired.
Table 1: STOI results for the validation set. Results are aver-
aged over the 15 unseen noise types and stated together with
their sample standard deviation.
SNR [dB] Noisy DNN without GVN DNN with GVN
20 0.95 (0.01) 0.92 (0.01) 0.91 (0.01)
15 0.91 (0.02) 0.90 (0.01) 0.89 (0.01)
10 0.85 (0.03) 0.86 (0.02) 0.85 (0.02)
5 0.76 (0.04) 0.80 (0.02) 0.79 (0.02)
0 0.65 (0.04) 0.71 (0.03) 0.71 (0.03)
-5 0.55 (0.04) 0.61 (0.04) 0.60 (0.04)
3. Results
3.1. Objective evaluation
Table 1 shows the STOI results for the validation set. The GVN
step shows no significant effect on the STOI results. DNN pro-
cessing leads to improved scores as compared to the baseline
for all SNRs under 10 dB. Looking at our unprocessed ‘noisy’
baseline, our STOI results at low SNRs are lower by 0.05 than
what Xu et al. [8] found using the TIMIT speech library. As we
use the same noise types, and we were able to reproduce their
‘noisy’ STOI scores using TIMIT, this discrepancy shows that
STOI predicts different intelligibility for the two libraries under
equal noise conditions.
Figure 6 shows a plot of the average STOI scores obtained
for the files processed for subjective evaluation. As with the
validation set results, the use of GVN did not significantly af-
fect model performance. At higher SNRs, DNN processing per-
forms worse than the noisy baseline. However, for low SNRs
STOI scores suggest improvement even outside the training
range. According to the objective evaluation, DNN processing
ought to be beneficial for all SNRs between −14 dB and 4 dB.
3.2. Subjective evaluation
Figure 7 shows the results from the subjective tests. Specif-
ically, it shows the differences between the reference and the
two DNN models, both for the SRT and the slope of the psy-
chometric function at SRT. All test subjects performed worse
on the SRT, while the slope values are more mixed.
To assess the normality of the data, we performed an
Anderson-Darling test on all the differences. The SRT differ-
ences for the DNN without GVN failed the normality test. The
non-normality is presumably a consequence of the small sam-
ple size. To cope with this, we performed a Wilcoxon signed
rank test to compare the models with the reference. The tests
Figure 6: STOI results for the subjective evaluation set (STOI versus SNR [dB] for the Noisy, DNN with GVN, and DNN without GVN conditions)
Figure 7: Comparison between unenhanced reference data and DNN data. Upper: Speech recognition thresholds (SRT) [dB]. Lower: Slope of the psychometric function at SRT [1/dB].
showed a significant difference (W = 120, p = .0001 for both)
between the models and the reference; not surprisingly, since
all the test subjects performed worse on the DNN models (see
Figure 7). The differences in median SRT values were (using
Hodges-Lehman estimators) 3.8 [3.2, 4.4] and 3.9 [3.2, 4.8] for
DNN with GVN and without GVN, respectively. The numbers
in brackets are the 95 % confidence intervals.
The slopes of the psychometric functions were compared using
a two-sample F-test. Neither DNN with GVN (F_{14,14} = 0.91,
NS) nor DNN without GVN (F_{14,14} = 0.69, NS) showed any
significant difference from the reference.
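For illustration, comparable tests can be run with SciPy as sketched below; this is our own sketch rather than the authors' MATLAB analysis, and the confidence intervals of the Hodges–Lehmann estimates are omitted.

```python
import numpy as np
from scipy import stats

def hodges_lehmann(diff):
    """Hodges-Lehmann estimate of a paired difference: the median of all
    pairwise (Walsh) averages of the per-subject differences."""
    d = np.asarray(diff, dtype=float)
    i, j = np.triu_indices(len(d))
    return np.median((d[i] + d[j]) / 2.0)

def compare(srt_ref, srt_dnn, slope_ref, slope_dnn):
    d = np.asarray(srt_dnn) - np.asarray(srt_ref)
    print(stats.anderson(d))                    # Anderson-Darling normality check
    w, p = stats.wilcoxon(srt_dnn, srt_ref)     # paired Wilcoxon signed-rank test
    print(f"Wilcoxon: W = {w:.0f}, p = {p:.4f}, "
          f"HL shift = {hodges_lehmann(d):.1f} dB")
    # Two-sample F-test on the psychometric slopes: ratio of sample variances
    f = np.var(slope_dnn, ddof=1) / np.var(slope_ref, ddof=1)
    dof = (len(slope_dnn) - 1, len(slope_ref) - 1)
    p_f = 2 * min(stats.f.cdf(f, *dof), stats.f.sf(f, *dof))
    print(f"F({dof[0]},{dof[1]}) = {f:.2f}, p = {p_f:.3f}")
```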
4. Discussion
The STOI results for unprocessed noisy validation files from the
Norwegian database (Table 1) differ from those obtained for the
TIMIT database by Xu et al. [8]. This complicates comparing
model performance directly. However, the results are similar
to those of Xu et al. in the sense that STOI improvement is
arguably insignificant for SNRs of 10 dB and above. For lower
SNRs, STOI predicts our system will achieve improvements of
up to 6 percent on the subjective scale. This is less than Xu et
al. achieved, but significant enough to predict that subjective
SRTs ought to decrease, or at the very least, stay the same.
The DNN model was not trained at SNRs below -5 dB, but
surprisingly, the STOI results shown in Figure 6 indicate that
the model enhances noisy speech with SNRs up to 9 dB be-
low its training range. This means that during subjective test-
ing, 93.8 % of sentences presented to the listener had an SNR
that fell in the functional range of the model (from -14 dB to
4 dB). All test subjects also achieved SRT values within this
range. Nonetheless, the results from the subjective testing
showed that the DNN models performed significantly worse
(SRTs increased by approx. 4 dB) than the unprocessed sen-
tences. Even from a conservative perspective where we could
say that the changes the model attains in STOI are insignificant,
the SRTs should not have increased this much. Thus, STOI sig-
nificantly overestimates the speech intelligibility of our DNN-
based speech enhancement system.
On the other hand, STOI correctly predicts that GVN has no
significant effect on speech intelligibility. According to Xu et
al. [8], PESQ results are, in contrast, significantly affected when
GVN is used during postprocessing of a DNN-based speech en-
hancement system. This may indicate that GVN matters more
to speech quality, but we did not investigate this further.
Our DNN model was selected because it obtained better
STOI scores than similar networks trained for a larger range of
SNRs or with different hyperparameters. Our results however
indicate that STOI fails to predict the intelligibility of a DNN-
based speech enhancement system. This directly undermines
our model selection criterion. It is therefore possible that one of
our other models would have led to better subjective scores.
All test sentences were uttered by the same male speaker;
it is likely that the DNN model will perform differently for dif-
ferent speakers. Similarly, the results are presumably affected
by the choice of background noise. We expect the system to
perform better with the traffic noise used here than with, for
example, noise that consists mainly of human speech (babble),
since the DNN models might try to enhance some of the speak-
ers in the noise as well. Conversely, other types of noise may be
easier for the system to handle. A more comprehensive study of the suitability of
STOI as an objective evaluation measure for DNN-based speech
enhancement would need to include a variety of speakers and
noises. Such a comprehensive study will be time-consuming
and the material for the speech-in-noise tests will need to be
carefully constructed for unbiased results.
The choice of sampling frequency (8 kHz) might also have
affected the results. Increasing the sampling frequency to
16 kHz, or higher, would probably have improved the speech
recognition for all the tests [23], but it is not clear if this would
have changed the results of this study.
Another possible bias in this study is the effect of hearing
loss. As the analysis of the subjective testing looked at the dif-
ference between a reference and the DNN models, we assumed
that a hearing loss would not alter the results. Only one test
subject had a hearing aid, but this was not used during the sub-
jective test. Since the test subjects’ ages were relatively high
(mean = 54.2) it can be assumed that several of the test sub-
jects were affected by presbycusis. Even if the intra-subject
change in SRTs should be independent of hearing impairment,
this may have affected results.
Our analysis is limited to speech intelligibility, and does not
consider the effect of DNNs on speech quality. The relation-
ship between these two parameters is not fully understood. For
many communication systems, intelligibility may be approach-
ing 100 %, while user satisfaction is still limited. Here, listening
effort tests, where a speech intelligibility test is combined with
another task, may offer a useful middle ground, providing objec-
tive results for the more quality-related question of how comfort-
able or easy it is to listen to the enhanced speech.
5. Conclusion
We have tested a DNN-based speech enhancement system with
listening tests to determine the subjective intelligibility of pro-
cessed noisy speech. Our results show a significant degrada-
tion in intelligibility, even though STOI scores predicted other-
wise. Therefore we advise against solely relying on STOI when
designing DNN-based speech enhancement systems for human
listeners. Our results further show that the postprocessing tech-
nique of global variance normalisation does not significantly af-
fect subjective intelligibility.
6. Acknowledgements
We thank Tor Andre Myrvoll for his guidance in setting up the
speech enhancement system and valuable insights during dis-
cussions. We also thank our volunteer test subjects.
7. References
[1] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed.
CRC Press, 2013.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT
Press, 2016.
[3] M. Nielsen, Neural Networks and Deep Learning. Determination
Press, 2015.
[4] Z.-Q. Wang and D. Wang, “A Joint Training Framework for Ro-
bust Automatic Speech Recognition,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 24, pp. 1–11, 2016.
[5] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-
Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani,
B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB
challenge: State-of-the-art and remaining challenges in reverber-
ant speech processing research,” EURASIP Journal on Advances
in Signal Processing, vol. 2016, no. 1, 2016.
[6] J. Du, Q. Wang, T. Gao, Y. Xu, L.-R. Dai, and C.-H. Lee, “Robust
speech recognition with speech enhanced deep neural networks,”
in INTERSPEECH, Singapore, 2014, pp. 616–620.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An Experimental Study
on Speech Enhancement Based on Deep Neural Networks,” IEEE
Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[8] ——, “A Regression Approach to Speech Enhancement Based
on Deep Neural Networks,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[9] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algo-
rithm for Intelligibility Prediction of Time-Frequency Weighted
Noisy Speech,” IEEE Transactions on Audio, Speech, and Lan-
guage Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[10] “Perceptual evaluation of speech quality (PESQ): An objective
method for end-to-end speech quality assessment of narrow-band
telephone networks and speech codecs,” International Telecom-
munication Union, ITU-T Recommendation P.862, 2001.
[11] J. Ma, Y. Hu, and P. C. Loizou, “Objective measures for predict-
ing speech intelligibility in noisy conditions based on new band-
importance functions,” The Journal of the Acoustical Society of
America, vol. 125, no. 5, p. 3387, 2009.
[12] J. Jensen and C. H. Taal, “An Algorithm for Predicting the
Intelligibility of Speech Masked by Modulated Noise Maskers,”
IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 24, no. 11, pp. 2009–2022, 2016. [Online].
Available: http://ieeexplore.ieee.org/document/7539284/
[13] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015, last
accessed on 2017-06-01.
[14] Nasjonalbiblioteket, “NB Tale - a basic acous-
tic phonetic speech database for Norwegian,”
http://www.nb.no/sprakbanken/show?serial=sbr-31, 2015,
last accessed on 2017-06-01.
[15] D. Pearce, H.-G. Hirsch, and others, “The aurora experimental
framework for the performance evaluation of speech recognition
systems under noisy conditions.” in Interspeech, 2000, pp. 29–32.
[16] Guoning Hu, “100 Nonspeech Sounds,” http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html, last accessed on 2017-03-14.
[17] A. Varga and H. J. M. Steeneken, “Assessment for automatic
speech recognition: II. NOISEX-92: A database and an experi-
ment to study the effect of additive noise on speech recognition
systems,” Speech Communication, vol. 12, no. 3, pp. 247–251,
1993.
[18] T. F. Chan, G. H. Golub, and R. J. LeVeque, “Algorithms for com-
puting the sample variance: Analysis and recommendations,” The
American Statistician, vol. 37, no. 3, pp. 242–247, 1983. [Online].
Available: http://www.jstor.org/stable/2683386?origin=crossref
[19] The MathWorks, Inc., MATLAB R2016a. Natick, Massachusetts,
United States, 2016.
[20] J. Øygarden, “Norwegian speech audiometry,” Ph.D. dissertation,
Norwegian University of Science and Technology (NTNU), Fac-
ulty of Art, Department of Language and Communication Studies,
2009.
[21] L. L. Kontsevich and C. W. Tyler, “Bayesian adaptive estimation
of psychometric slope and threshold,” Vision research, vol. 39,
no. 16, pp. 2729–2737, 1999.
[22] N. Prins and F. Kingdom, “Palamedes: Matlab routines for an-
alyzing psychophysical data.” http://www.palamedestoolbox.org,
2009, last accessed on 2017-03-14.
[23] A. B. Silberer, “Importance of high frequency audibility on speech
recognition with and without visual cues in listeners with normal
hearing,” Ph.D. dissertation, University of Iowa, 2014.