
Poster: Subjective intelligibility of deep neural network-based speech enhancement

Authors:
The intelligibility of DNN-based speech enhancement systems is evaluated through objective measures such as STOI (Taal et al., 2011).
However, STOI does not always correctly predict intelligibility (Jensen & Taal, 2016).
Does STOI correctly predict the intelligibility of DNN-based speech enhancement systems? We performed a subjective evaluation test to find out.
Setup
Closely based on Xu et al., 2015
Mullayer feed-forward network
Input/output: log-frequency spectra with
frames of 256 samples (32 ms at 8 kHz)
Input: Stacked frames of noisy speech
Target: One frame of clean speech
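The input/target layout described above can be sketched as follows. This is a minimal illustration and not the authors' code: the Hann window, the 50 % frame overlap, the use of log-power spectra, and a context of 3 frames on each side are all assumptions that the poster does not state, and the function names are hypothetical. Random noise stands in for real speech.

```python
import numpy as np

FRAME_LEN = 256      # samples per frame, as stated on the poster
SAMPLE_RATE = 8000   # Hz, so one frame spans 256 / 8000 = 32 ms
HOP = 128            # assumed 50 % overlap (not stated on the poster)
CONTEXT = 3          # assumed number of neighbouring frames on each side


def log_power_frames(signal):
    """Split a signal into windowed frames and return their log-power spectra."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    window = np.hanning(FRAME_LEN)
    frames = np.stack([signal[i * HOP:i * HOP + FRAME_LEN] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    return np.log(np.abs(spectrum) ** 2 + 1e-10)  # small floor avoids log(0)


def stack_context(features, context=CONTEXT):
    """Concatenate each frame with its neighbours; edges are padded by repetition."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].ravel()
                     for i in range(features.shape[0])])


rng = np.random.default_rng(0)
clean = rng.standard_normal(SAMPLE_RATE)                 # 1 s placeholder for clean speech
noisy = clean + 0.5 * rng.standard_normal(SAMPLE_RATE)   # placeholder for noisy speech

inputs = stack_context(log_power_frames(noisy))   # network input: stacked noisy frames
targets = log_power_frames(clean)                 # network target: one clean frame
print(inputs.shape, targets.shape)
```

Each row of `inputs` holds 2 × `CONTEXT` + 1 consecutive noisy frames, and the matching row of `targets` holds the single clean frame at the centre of that window.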
Training
Trained and validated on the Norwegian-
language speech corpus Språkbanken
Loss funcon: Mean squared error
Trained for SNR {-5, 0, 5, 10, 15, 20} dB
Manually opmised hyperparameters to
improve STOI on validaon set
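Training at a set of SNRs requires scaling the noise relative to the speech before mixing. The sketch below shows one common way to do this, assuming the SNR is defined as the ratio of average powers over the whole clip; the poster does not describe how the mixtures were made, and `mix_at_snr` and `mse` are hypothetical helper names. Random noise stands in for real recordings.

```python
import numpy as np

TRAINING_SNRS_DB = [-5, 0, 5, 10, 15, 20]  # SNRs listed on the poster


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise, gain


def mse(prediction, target):
    """Mean squared error, the loss function named on the poster."""
    return np.mean((prediction - target) ** 2)


rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)   # placeholder for a clean utterance
noise = rng.standard_normal(8000)    # placeholder for a noise recording

achieved_snrs = []
for snr_db in TRAINING_SNRS_DB:
    noisy, gain = mix_at_snr(speech, noise, snr_db)
    achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
    achieved_snrs.append(achieved)
    print(snr_db, round(float(achieved), 6))
```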
Enhancement
Converted DNN output into samples using
the phase from the noisy DNN input
Tested with and without global variance
normalisation (GVN) post-processing step
Speech in noise test
Speech: Male voice, random Hagerman
sentences in Norwegian
(Øygarden, 2009)
Noise: Traffic from a crossroads in
Trondheim
Subjects had to pick out words:
Adjusted SNR dynamically using the
Ψ method to efficiently determine
participants’ psychometric functions
Goal: Find the speech recognition
threshold (lowest SNR at which 50 %
of words are understood)
Test was run for baseline clips and
DNN-enhanced clips
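The relation between a psychometric function and the speech recognition threshold can be sketched as below. The logistic form, the parameter values, and the function names are assumptions for illustration only; the poster does not give the functional form used, and this sketch does not implement the Ψ method itself, which adaptively selects the next SNR to test.

```python
import math


def psychometric(snr_db, srt_db, slope_per_db):
    """Logistic psychometric function: fraction of words understood at snr_db.

    srt_db is the SNR giving 50 % correct; slope_per_db is the slope at that point.
    """
    return 1.0 / (1.0 + math.exp(-4.0 * slope_per_db * (snr_db - srt_db)))


def find_srt(prob_fn, lo=-30.0, hi=30.0, tol=1e-6):
    """Bisection search for the SNR where an increasing prob_fn crosses 0.5."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_fn(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical listener: SRT of -3 dB and a slope of 0.15 per dB (placeholder values).
def listener(snr_db):
    return psychometric(snr_db, srt_db=-3.0, slope_per_db=0.15)


estimated_srt = find_srt(listener)
print(round(estimated_srt, 3))
```

In this parameterisation, a higher SRT shifts the whole curve to the right, meaning the listener needs a more favourable SNR to understand half of the words.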
Parcipants
15 nave Norwegians, aged 39–65
All were naive listeners given a trai-
ning session before the test started
Sound examples
bit.ly/2uhLWcL
Tron V. Tronstad
Erlend M. Viggen
Femke B. Gelderblom
SINTEF Digital, Trondheim, Norway
Subjecve Intelligibility of
Deep Neural Network-Based
Speech Enhancement
Objecve evaluaon
STOI on the subjecve evaluaon set:
Predicts that this DNN improves
intelligibility for SNR [-14, 4] dB
Subjecve evaluaon
Speech recognion threshold (SRT)
signicantly degrades (4 dB median)
Slope of psychometric funcon does
not show signicant dierences
GVN makes no signicant dierence
Shows that this DNN reduces intelli-
gibility
Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, “A regression
approach to speech enhancement based on
deep neural networks,” IEEE/ACM Trans.
Audio Speech Lang. Proc., vol. 23, 2015.
C. Taal, R. C. Hendriks, R. Heusdens, J. Jensen,
“An algorithm for intelligibility prediction of
time-frequency weighted noisy speech,” IEEE
Trans. Audio Speech Lang. Proc., vol. 19,
2011.
J. Jensen, C. Taal, “An algorithm for predicting
the intelligibility of speech masked by
modulated noise maskers,” IEEE/ACM Trans. Audio
Speech Lang. Proc., vol. 24, 2016.
J. Øygarden, Norwegian speech audiometry,
Ph.D. thesis, Norwegian University of Science
and Technology, 2009.
Our results show a significant degradation
in intelligibility, even though
STOI scores predicted otherwise.
Therefore, we advise against solely
relying on STOI when designing DNN-based
speech enhancement systems
for human listeners.
Introducon DNN-based speech enhancement
Subjecve evaluaon
Results Main references
Conclusion

Intelligibility listening tests are necessary during development and evaluation of speech processing algorithms, despite the fact that they are expensive and time consuming. In this paper, we propose a monaural intelligibility prediction algorithm, which has the potential of replacing some of these listening tests. The proposed algorithm shows similarities to the short-time objective intelligibility (STOI) algorithm, but works for a larger range of input signals. In contrast to STOI, extended STOI (ESTOI) does not assume mutual independence between frequency bands. ESTOI also incorporates spectral correlation by comparing complete 400-ms length spectrograms of the noisy/processed speech and the clean speech signals. As a consequence, ESTOI is also able to accurately predict the intelligibility of speech contaminated by temporally highly modulated noise sources in addition to noisy signals processed with time-frequency weighting. We show that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility. A free MATLAB implementation of the algorithm is available for noncommercial use at http://kom.aau.dk/∼jje/.