All content in this area was uploaded by Femke Berre Gelderblom on Aug 23, 2017
• The intelligibility of DNN-based
speech enhancement systems is evaluated
through objective measures
such as STOI (Taal et al., 2011)
• However, STOI does not always
correctly predict intelligibility
(Jensen & Taal, 2016)
Does STOI correctly predict the
intelligibility of DNN-based speech
enhancement systems? We performed a
subjective evaluation test to find out.
Setup
• Closely based on Xu et al., 2015
• Multilayer feed-forward network
• Input/output: log-frequency spectra with
frames of 256 samples (32 ms at 8 kHz)
• Input: Stacked frames of noisy speech
• Target: One frame of clean speech
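The input features described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the poster gives the frame length (256 samples) but not the window, hop size, or number of stacked frames, so `hop`, the Hann window, and `context` are assumptions, and the log-power spectrum is used here as one plausible reading of the log spectra.

```python
import numpy as np

FRAME_LEN = 256  # samples per frame (32 ms at 8 kHz, as stated on the poster)

def log_spectra(signal, hop=128, eps=1e-10):
    """Return log-power spectra, one row per frame.

    `hop` and the Hann window are assumptions; the poster does not state them.
    """
    n_frames = 1 + (len(signal) - FRAME_LEN) // hop
    window = np.hanning(FRAME_LEN)
    frames = np.stack([signal[i * hop : i * hop + FRAME_LEN] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)        # 129 bins per frame
    return np.log(np.abs(spectrum) ** 2 + eps)    # eps avoids log(0)

def stack_context(features, context=3):
    """Concatenate each frame with `context` neighbours on either side.

    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    n = features.shape[0]
    return np.stack([padded[i : i + 2 * context + 1].ravel() for i in range(n)])
```

With these assumed settings, the network input for frame `t` is row `t` of `stack_context(log_spectra(noisy))`, and the target is row `t` of `log_spectra(clean)`.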
Training
• Trained and validated on the Norwegian-
language speech corpus Språkbanken
• Loss function: Mean squared error
• Trained for SNR ∈ {-5, 0, 5, 10, 15, 20} dB
• Manually optimised hyperparameters to
improve STOI on validation set
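A minimal sketch of how noisy training examples at the listed SNRs could be generated, together with the mean squared error used as the loss. The helper names `mix_at_snr` and `mse_loss` are hypothetical, and scaling the noise (rather than the speech) is an assumption; the poster does not describe the mixing procedure.

```python
import numpy as np

TRAINING_SNRS_DB = (-5, 0, 5, 10, 15, 20)  # SNRs listed on the poster

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is snr_db.

    Assumes `noise` is at least as long as `speech`.
    """
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

def mse_loss(predicted, target):
    """Mean squared error between predicted and clean feature frames."""
    return np.mean((predicted - target) ** 2)
```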
Enhancement
• Converted DNN output into samples using
the phase from the noisy DNN input
• Tested with and without global variance
normalisation (GVN) post-processing step
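The enhancement step can be illustrated as below, assuming the DNN outputs log-power spectra and that frames are recombined by overlap-add with an assumed hop of 128 samples. The `global_variance_normalise` function is a simplified stand-in that rescales the output with a single global factor so its variance matches a reference value; the exact GVN formulation used on the poster is not given, so treat this as an assumption.

```python
import numpy as np

def frames_from_log_power(log_power, noisy_phase, frame_len=256):
    """Combine enhanced log-power spectra with the noisy phase and invert each frame."""
    magnitude = np.exp(0.5 * log_power)              # log-power -> magnitude
    spectrum = magnitude * np.exp(1j * noisy_phase)  # reuse phase of the noisy input
    return np.fft.irfft(spectrum, n=frame_len, axis=1)

def overlap_add(frames, hop=128):
    """Sum overlapping time-domain frames into one waveform (hop is assumed)."""
    frame_len = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out

def global_variance_normalise(enhanced, reference_variance):
    """Rescale features about their mean so their global variance equals
    reference_variance (e.g. the variance of clean training features)."""
    mean = enhanced.mean()
    scale = np.sqrt(reference_variance / enhanced.var())
    return mean + scale * (enhanced - mean)
```

In this sketch, GVN would be applied to the DNN output before `frames_from_log_power`, and `noisy_phase` would come from `np.angle` of the noisy input spectrum.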
Speech in noise test
• Speech: Male voice, random Hagerman
sentences in Norwegian
(Øygarden, 2009)
• Noise: Traffic from a crossroads in
Trondheim
• Subjects had to pick out words:
• Adjusted SNR dynamically using the
Ψ method to efficiently determine
participants’ psychometric functions
• Goal: Find the speech recognition
threshold (lowest SNR at which 50 %
of words are understood)
• Test was run for baseline clips and
DNN-enhanced clips
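The adaptive procedure can be sketched as a grid-based Bayesian tracker in the spirit of the Ψ method: keep a posterior over threshold and slope, and present the SNR expected to reduce posterior entropy the most. This is a simplified illustration, not the test software used here. The logistic shape (with no guess or lapse rate), the parameter grids, and all function names are assumptions.

```python
import numpy as np

SRT_GRID = np.linspace(-20.0, 10.0, 61)    # candidate SRTs in dB (assumed range)
SLOPE_GRID = np.linspace(0.05, 0.30, 6)    # candidate slopes at the SRT, per dB (assumed)
SNR_GRID = np.linspace(-20.0, 10.0, 31)    # SNRs that may be presented (assumed)

def psychometric(snr_db, srt_db, slope):
    """Logistic probability that a word is understood; equals 0.5 at the SRT."""
    return 1.0 / (1.0 + np.exp(-4.0 * slope * (snr_db - srt_db)))

def update(posterior, snr_db, correct):
    """Bayes update of the (SRT, slope) posterior after one word response."""
    p = psychometric(snr_db, SRT_GRID[:, None], SLOPE_GRID[None, :])
    posterior = posterior * (p if correct else 1.0 - p)
    return posterior / posterior.sum()

def entropy(posterior):
    nonzero = posterior[posterior > 0]
    return -np.sum(nonzero * np.log(nonzero))

def next_snr(posterior):
    """Choose the SNR that minimises the expected posterior entropy."""
    expected = []
    for snr in SNR_GRID:
        p = psychometric(snr, SRT_GRID[:, None], SLOPE_GRID[None, :])
        p_correct = np.sum(posterior * p)
        expected.append(p_correct * entropy(update(posterior, snr, True))
                        + (1.0 - p_correct) * entropy(update(posterior, snr, False)))
    return SNR_GRID[int(np.argmin(expected))]

def estimate_srt(posterior):
    """Posterior mean of the SRT, marginalised over slope."""
    return float(np.sum(posterior.sum(axis=1) * SRT_GRID))
```

Starting from a uniform posterior, each trial calls `next_snr`, presents a sentence at that SNR, and feeds each word response to `update`. Running this once for baseline clips and once for enhanced clips gives two SRT estimates to compare.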
Participants
• 15 native Norwegians, aged 39–65
• All were naive listeners given a
training session before the test started
Sound examples
bit.ly/2uhLWcL
Tron V. Tronstad
Erlend M. Viggen
Femke B. Gelderblom
SINTEF Digital, Trondheim, Norway
Subjective Intelligibility of
Deep Neural Network-Based
Speech Enhancement
Objective evaluation
• STOI on the subjective evaluation set:
• Predicts that this DNN improves
intelligibility for SNR ∈ [-14, 4] dB
Subjective evaluation
• Speech recognition threshold (SRT)
significantly degrades (4 dB median)
• Slope of psychometric function does
not show significant differences
• GVN makes no significant difference
• Shows that this DNN reduces
intelligibility
• Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, “A regression
approach to speech enhancement based on
deep neural networks,” IEEE/ACM Trans.
Audio Speech Lang. Proc., vol. 23, 2015.
• C. Taal, R. C. Hendriks, R. Heusdens, J. Jensen,
“An algorithm for intelligibility prediction of
time-frequency weighted noisy speech,” IEEE
Trans. Audio Speech Lang. Proc., vol. 19,
2011.
• J. Jensen, C. Taal, “An algorithm for predicting
the intelligibility of speech masked by modulated
noise maskers,” IEEE/ACM Trans. Audio
Speech Lang. Proc., vol. 24, 2016.
• J. Øygarden, Norwegian speech audiometry,
Ph.D. thesis, Norwegian University of Science
and Technology, 2009.
Our results show a significant degradation
in intelligibility, even though
STOI scores predicted otherwise.
Therefore, we advise against solely
relying on STOI when designing DNN-
based speech enhancement systems
for human listeners.
Introduction
DNN-based speech enhancement
Subjective evaluation
Results
Main references
Conclusion