An Algorithm for Predicting the Intelligibility of
Speech Masked by Modulated Noise Maskers
Jesper Jensen and Cees H. Taal
Abstract—Intelligibility listening tests are necessary during development and evaluation of speech processing algorithms, despite the fact that they are expensive and time-consuming. In this paper, we propose a monaural intelligibility prediction algorithm, which has the potential of replacing some of these listening tests. The proposed algorithm shows similarities to the Short-Time Objective Intelligibility (STOI) algorithm but works for a larger range of input signals. In contrast to STOI, Extended STOI (ESTOI) does not assume mutual independence between frequency bands. ESTOI also incorporates spectral correlation by comparing complete 400-ms length spectrograms of the noisy/processed speech and the clean speech signals. As a consequence, ESTOI is also able to accurately predict the intelligibility of speech contaminated by temporally highly modulated noise sources, in addition to noisy signals processed with time-frequency weighting. We show that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility. A free Matlab implementation of the algorithm is available for non-commercial use at http://kom.aau.dk/~jje/.
EDICS: SPE-ANLS, SPE-ENHA, SPE-CODI.
Jesper Jensen is with Aalborg University, Aalborg, Denmark, email:
jje@es.aau.dk and with Oticon A/S, 2765 Smørum, Denmark, email:
jesj@oticon.com.
Cees Taal is with Quby Labs, Joan Muyskenweg 22, 1096 CJ Amsterdam,
The Netherlands, email: chtaal@gmail.com.
I. INTRODUCTION
When developing speech communication systems for human
receivers, listening tests play a major role both for monitoring
progress in the development phase and for verifying the
performance of the final system. Often, listening tests are used
to quantify aspects of speech quality and speech intelligibility.
Although listening tests constitute the only tool available
for measuring ground-truth end-user impact, they are time-
consuming, they may require special auditory stimuli data and
test equipment, and they require the availability of a group of
typical end-users. For these reasons, listening tests are costly
and can typically not be employed many times during the
development phase of a speech communication system. Hence,
cheaper alternatives or supplements are of interest.
In this paper, we focus on intrusive, monaural intelligibil-
ity prediction models, i.e., algorithms which – rather than
conducting an actual listening test – predict the outcome of
the listening test based on the auditory stimuli of the test.
Historically, two lines of research serve as the foundation for
existing intelligibility prediction models: i) the Articulation
Index (AI) [1] by French and Steinberg [2], which was later
refined and standardized as the Speech Intelligibility Index
(SII) [3], and ii) the Speech Transmission Index (STI) [4] by
Steeneken and Houtgast [5].
AI and SII were developed with simple linear signal degra-
dations, e.g., additive noise, in mind. To estimate intelligibility,
the methods divide the signal under analysis into frequency
subbands and assume that each subband contributes indepen-
dently to intelligibility. The contribution of a subband is found
by estimating the long-term speech and noise power within
the subband to arrive at the long-term subband signal-to-noise
ratio (SNR). Then, subband SNRs are limited to the range
from -15 to +15 dB, normalized to a value between 0 and 1,
and combined as a perceptually weighted average.
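To make this procedure concrete, the following minimal Python sketch illustrates the clipping/normalization step just described; it is not the full standardized SII procedure, and the inputs (per-subband long-term powers and perceptual weights) are hypothetical quantities assumed to be given:

```python
import numpy as np

def ai_style_index(speech_power, noise_power, band_importance):
    """AI/SII-style mapping from long-term subband powers to an index.

    speech_power, noise_power: per-subband long-term powers (linear scale).
    band_importance: hypothetical perceptual weights, assumed to sum to 1.
    """
    snr_db = 10.0 * np.log10(speech_power / noise_power)  # long-term subband SNRs
    snr_db = np.clip(snr_db, -15.0, 15.0)                 # limit to [-15, +15] dB
    audibility = (snr_db + 15.0) / 30.0                   # normalize to [0, 1]
    return float(np.sum(band_importance * audibility))    # weighted average
```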
STI extends the range of distortions to include convolutive
noise, e.g., reverberant speech and effects of room acoustics.
STI is based on the observation that reverberation and/or
additive noise tend to reduce the depth of temporal signal
modulations compared to the clean, undistorted reference sig-
nal. To measure changes in the modulation transfer function,
STI generates bandpass filtered noise probe signals at different
center-frequencies, and amplitude-modulates each such signal
at different modulation frequencies relevant to speech intelli-
gibility. Each modulated probe signal is then passed through
the communication channel in question (e.g. characterised by
a room impulse response), and the reduction in modulation
depth is finally translated into an intelligibility index.
DRAFT
2
Despite the importance of AI and SII, the methods have
a number of limitations. First, they require the long-term
spectrum of the additive noise signal to be known in advance.
Secondly, since the methods rely on long-term statistics, they
cannot discern modulated noise signals from un-modulated
ones, when their long-term spectra are identical. In other
words, the intelligibility of speech contaminated by modulated
and un-modulated noise is judged to be identical, although it
is well-known that this is generally not the case, e.g. [6], [7].
Finally, AI and SII are not directly applicable to signals which
have been passed through some non-linear processing stage
before presentation to the listener, because in this case it is no
longer clear which noise spectrum to use.
Various methods have been proposed to reduce the lim-
itations mentioned above and extend the range of acoustic
situations for which intelligibility prediction can be made. In
[6], Rhebergen et al. proposed the Extended SII (ESII), which
avoids the use of long-term noise spectra in SII. Specifically,
ESII divides the masker signal into short time frames (9–20
ms) and averages the SII computed for each frame individually,
to predict intelligibility for fluctuating noise sources. In a
somewhat similar manner, the Glimpse Model by Cooke [8]
uses realizations of speech and additive noise signals to
estimate the glimpse percentage, i.e., the fraction of time-
frequency tiles whose SNR exceeds a certain threshold, which
is then translated to an intelligibility estimate. The Coherence
SII (CSII) by Kates et al. [9] extends SII to better take into
account various non-linear distortions, including center- and
peak-clipping. Similarly, Goldsworthy et al. [10] proposed
a modified STI approach, speech STI (sSTI), which replaces
the traditional noise probe signals with actual speech signals.
This was done to better take into account the effect of non-
linear distortions such as envelope clipping [11], and dynamic
amplitude compression [12].
More recently, methods have emerged which are inspired by
both original lines of research. For example, Jørgensen et al.
[13] decompose the speech signal and an additive noise masker
in a modulation filter bank. Intelligibility prediction is then
based on the envelope power SNR at the output of this filter
bank. Taal et al. [14] proposed the Short-Time Objective Intel-
ligibility (STOI) measure, which extracts temporal envelopes
of undistorted and noisy/processed speech signals in frequency
subbands. The envelopes are then subject to a clipping proce-
dure, compared using short-term linear correlation coefficients,
and a final intelligibility prediction is constructed simply as an
average of the correlation coefficients. STOI has proven to be
able to predict quite accurately the intelligibility of speech
in many acoustic situations, including the speech output of
mobile phones [15], noisy speech processed by ideal time-
frequency masking and single-channel speech enhancement
algorithms [14], speech processed by cochlear implants [16],
and STOI appears robust to different language types, including
Danish [14], Dutch [17], and Mandarin [18]. Although STOI
performs well in many cases, some of the algorithmic choices,
e.g. the use of a linear correlation coefficient as a basic
distance measure, are less well motivated from a theoretical
point of view. However, in [17] Jensen et al. proposed the
Speech Intelligibility prediction based on Mutual Information
(SIMI) method, which suggests that characteristics of STOI
may be explained using information theoretic arguments.
STOI, and several of the methods described above, show
only weak links to the properties of the auditory system, and
much more elaborate models have been proposed, e.g. [19],
[20]. For example, the Hearing-Aid Speech Perception Index
(HASPI) by Kates et al. [20] employs level- and hearing-
profile dependent auditory filterbanks to compute quantities
resembling mel-frequency cepstral coefficients (MFCCs) for
the clean reference signal and the noisy/processed signal,
respectively. Then, long-term correlations between clean and
noisy/processed MFCCs are computed, before the average
across the cepstral dimension is found. Finally, this cepstral
correlation average is combined with estimates of the auditory
signal coherence for low-, mid-, and high-intensity signal
regions [9], to form an intelligibility index.
Existing intelligibility prediction methods may be divided
into two classes: 1) methods which require that the target
speech signal and the distorting component (e.g., the additive
noise) are available in separation, e.g., [3], [4], [6], [8], [13]
and 2) methods which do not impose this requirement, e.g.,
[9], [10], [14], [17]. Class-1 methods have the advantage that
they can use the access to speech and noise realizations to
compute SNR realizations in different time-frequency regions
and find an intelligibility index based on these. Class-2 meth-
ods, on the other hand, cannot observe SNRs directly, but must
rely on features estimated from, generally, limited data, e.g.,
short-time correlations estimated from the noisy/processed and
clean speech signal. The disadvantage of Class-1 methods is
that they are not applicable to non-linearly processed noisy
speech signals, because in this situation noise and speech
signals are not readily available in separation. Class-2 methods
are more generally applicable, i.e., also to noisy signals, which
have been non-linearly processed.
As reported in [21], STOI – and, as we show in Sec. IV, other Class-2 methods – have limitations for target speech signals in additive noise sources with strong temporal modulations, such as, e.g., a single competing speaker. To demonstrate this point, Fig. 1 shows an example of intelligibility predicted by STOI vs. actual intelligibility, measured in listening tests with speech signals degraded by 10 highly modulated masker signals at different SNRs (details are given in Table I and will be discussed later). Clearly, STOI performs less well in this situation: the linear correlation coefficient between predicted and measured intelligibility is as low as ρ = 0.47.
In this paper, we propose a new Class-2 intelligibility predictor – ESTOI (Extended Short-Time Objective Intelligibility) – which, unlike many existing Class-2 methods, works well for highly modulated noise sources such as the example above¹. Importantly, ESTOI also works well in situations where existing Class-2 methods work well.

As the name suggests, ESTOI is inspired by STOI [14]. As with STOI, ESTOI operates within a 384 ms analysis window on amplitude envelopes of subband signals. This analysis window is used in order to include important temporal modulation frequencies relevant for speech intelligibility [14].

¹ The algorithm name ESTOI follows the terminology introduced by Rhebergen et al., who used the name Extended SII (ESII) for their algorithm to improve the performance of SII for modulated noise maskers [6].
Figure 1. Measured vs. predicted intelligibility (STOI [14]) for speech in ten different additive, modulated noise sources (6 SNRs each): icra1, icra4, icra6, icra7, snam2, snam4, snam8, snam16, macgun, and destop. Axes: Estimated Words Correct (%) vs. Words Correct (%). The linear correlation coefficient between measured and predicted intelligibility is ρ = 0.47. For more information on the noise sources, see Table I.
To understand the differences between STOI and ESTOI, let us first interpret how STOI computes a correlation coefficient for a 384 ms analysis window. STOI first computes linear correlation coefficients for each subband between undistorted and noisy/processed signals²; this is equivalent to computing inner products between mean- and variance-normalized envelope signals. To find a correlation coefficient for the 384 ms analysis window, STOI then averages these temporal correlation coefficients across frequency, an operation which implies independent frequency band contributions to intelligibility, and which is not in line with the literature, e.g., [22]. ESTOI shares the first step with STOI: mean- and variance-normalization is applied to subband envelopes. However, rather than computing the average inner products between these normalized envelopes, and relying on the additive-intelligibility-across-frequency assumption as is done in STOI, ESTOI instead computes spectral correlation coefficients, which are finally averaged across time within the 384 ms analysis segment. This allows ESTOI to better capture the effect of time-modulated noise maskers, where spectral correlation 'in the dips' is often preserved [23]. We show that ESTOI may be interpreted in terms of an orthogonal decomposition of energy-normalized spectrograms, i.e., a decomposition into "intelligibility subspaces" which are each ranked according to their (estimated) contribution to intelligibility. This decomposition is important in understanding ESTOI and in linking its performance to perceptual studies with human listeners. Specifically, analysis of the spectrograms related to each intelligibility subspace shows that the spectro-temporal modulation frequencies which are judged by ESTOI to be important for speech intelligibility agree with the results of experimental studies of human sensitivity to spectro-temporal modulations [24].
The paper is structured as follows. In Sec. II we describe the
ESTOI predictor. In Sec. III we interpret the predictor in terms
of orthogonal intelligibility subspaces. Sec. IV evaluates the
performance of ESTOI, and compares it to a range of existing
algorithms. Finally, Sec. V concludes the work.
II. PROPOSED MODEL

The overall structure of the proposed intelligibility predictor, ESTOI, is outlined in Fig. 2. ESTOI is a function of the noisy/processed signal under study x(n), and the clean, undistorted speech signal s(n). The goal of ESTOI is to produce a scalar output d, which is monotonically related to the intelligibility of x(n).

² We ignore the clipping procedure used in STOI in this description.
A. Time-Frequency Normalized Spectrograms
Let us assume that s(n) and x(n) are perfectly time-aligned, and that regions where s(n) shows no speech activity (e.g., pauses between sentences) have been removed from both signals. In the following, we present expressions related to the clean signal s(n); similar expressions hold for the noisy/processed signal x(n). Let S(k, m) denote the short-time Fourier transform (STFT) of s(n), that is

$$S(k, m) = \sum_{n=0}^{N-1} s(mD + n)\, w(n)\, e^{-j 2\pi k n / N},$$

where k and m denote the frequency bin index and the frame index, respectively, and D and N denote the frame shift in samples and the FFT order, respectively. Finally, w(n) is an analysis window.
To model crudely the signal transduction in the cochlear inner hair cells, a one-third octave band analysis is approximated by summing STFT coefficient energies,

$$S_j(m) = \sqrt{\sum_{k \in CB_j} |S(k, m)|^2}, \quad j = 1, \ldots, J,$$

where j is the one-third octave band index, CB_j denotes the index set of STFT coefficients related to the jth one-third octave frequency band, and J denotes the number of subbands.
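For illustration, a minimal Python sketch of this grouping might look as follows (the construction of the one-third octave index sets CB_j from band center frequencies is assumed to be given and is not shown):

```python
import numpy as np

def third_octave_envelopes(stft, band_index_sets):
    """Group STFT bins into one-third octave band envelopes S_j(m).

    stft: complex array of shape (num_bins, num_frames) holding S(k, m).
    band_index_sets: list of J integer arrays; entry j holds the bin
        indices CB_j of the jth one-third octave band.
    Returns a real array of shape (J, num_frames).
    """
    power = np.abs(stft) ** 2
    return np.vstack([np.sqrt(power[idx, :].sum(axis=0))
                      for idx in band_index_sets])
```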
Let us collect the spectral values S_j(m) for each frequency band j = 1, ..., J, and across a time segment of N spectral samples, and arrange these in a short-time spectrogram matrix

$$S_m = \begin{bmatrix} S_1(m - N + 1) & \cdots & S_1(m) \\ \vdots & & \vdots \\ S_J(m - N + 1) & \cdots & S_J(m) \end{bmatrix}.$$

Hence, the jth row of S_m represents the temporal envelope of the signal in subband j. Typical parameter choices are J = 15 and N = 30 (corresponding to 384 ms) [14]. The noisy/processed short-time spectrogram matrix X_m is defined analogously.
ESTOI operates on mean- and variance-normalized rows and columns of S_m (and X_m) as follows. Let

$$s_{j,m} = \left[ S_j(m - N + 1) \;\; S_j(m - N + 2) \;\; \cdots \;\; S_j(m) \right]^T$$

denote the jth row of the spectrogram matrix S_m. The jth mean- and variance-normalized row of S_m is given by

$$\bar{s}_{j,m} = \frac{s_{j,m} - \mu_{s_{j,m}} \mathbf{1}}{\left\| s_{j,m} - \mu_{s_{j,m}} \mathbf{1} \right\|}, \tag{1}$$

where $\|y\| = \sqrt{y^T y}$ is the vector 2-norm, $\mathbf{1}$ is an all-one vector, and $\mu_{s_{j,m}}$ is the sample mean given by

$$\mu_{s_{j,m}} = \frac{1}{N} \sum_{m'=0}^{N-1} S_j(m - m'). \tag{2}$$
Figure 2. The proposed intelligibility predictor, ESTOI, is a function of the noisy/processed signal x(n) and the clean speech signal s(n). First, the signals are passed through a one-third octave filter bank, and the temporal envelopes of each subband signal are extracted. The resulting clean and noisy/processed short-time envelope spectrograms are time- and frequency-normalized before the "distance" between them is computed, resulting in intermediate, short-time intelligibility indices d_m. Finally, the intermediate indices are averaged to form the final intelligibility index d. More details are given in Sec. II. Signal examples of the various stages are shown in Fig. 3.
Note that the sample mean and variance of the elements in vector $\bar{s}_{j,m}$ are zero and one, respectively. The mean- and variance-normalized rows $\bar{x}_{j,m}$ of the noisy/processed signal are defined similarly.

As mentioned, this row-normalization procedure is similar to the one used in STOI. Specifically, STOI uses an intermediate temporal correlation coefficient for the jth subband in the mth time segment, which can be expressed as the inner product of normalized vectors,

$$\bar{s}_{j,m}^T \bar{x}_{j,m}. \tag{3}$$
However, as mentioned, in ESTOI we do not use Eq. (3) directly, but introduce a spectral normalization as follows. Let us first define the row-normalized spectrogram matrix

$$\bar{S}_m = \begin{bmatrix} \bar{s}_{1,m}^T \\ \vdots \\ \bar{s}_{J,m}^T \end{bmatrix}.$$

Then, let $\check{s}_{n,m}$ denote the mean- and variance-normalized nth column, n = 1, ..., N, of matrix $\bar{S}_m$, where the normalization is carried out analogously to Eqs. (1) and (2). We finally define the row- and column-normalized matrix $\check{S}_m$ as

$$\check{S}_m = \left[ \check{s}_{1,m} \cdots \check{s}_{N,m} \right].$$

Hence, the columns of $\check{S}_m$ represent unit-norm, zero-mean normalized spectra (which themselves are computed from normalized temporal envelopes). The row- and column-normalized matrix $\check{X}_m$ of the noisy/processed signal in time segment m is defined in a similar manner. Fig. 3 demonstrates the effect of the various normalizations on example clean and noisy spectrograms.
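A minimal sketch of this two-stage normalization, assuming numpy and a J x N segment as input, could read as follows (the small constant guarding against division by zero is our addition, not part of the text):

```python
import numpy as np

def normalize_rows_and_columns(segment):
    """Row- and column-normalize a J x N spectrogram segment S_m.

    Each row (temporal envelope) is first normalized to zero mean and
    unit norm, per Eqs. (1)-(2); the same normalization is then applied
    to each column of the result, yielding the check-accented matrix.
    """
    def normalize(v):
        v = v - v.mean()
        return v / (np.linalg.norm(v) + 1e-12)   # epsilon guard: our addition

    row_normalized = np.apply_along_axis(normalize, 1, segment)   # bar-S_m
    return np.apply_along_axis(normalize, 0, row_normalized)      # check-S_m
```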
B. Intelligibility index

The row- and column-normalized matrices $\check{S}_m$ and $\check{X}_m$ serve as the basis for the proposed intelligibility predictor. In particular, we define an intermediate intelligibility index, related to time segment m, simply as

$$d_m = \frac{1}{N} \sum_{n=1}^{N} \check{s}_{n,m}^T \check{x}_{n,m}. \tag{4}$$

Since $\check{s}_{n,m}$ and $\check{x}_{n,m}$, n = 1, ..., N, are unit-norm vectors, each term in the sum may be recognized as the (signed) length of the orthogonal projection of the noisy/processed vector $\check{x}_{n,m}$ onto the clean vector $\check{s}_{n,m}$, or vice versa. It follows that $-1 \leq \check{s}_{n,m}^T \check{x}_{n,m} \leq 1$. Similarly, $d_m$ may be interpreted as the (signed) length of these projections, averaged across time within a time segment. In low-noise situations where $\check{x}_{n,m} \approx \check{s}_{n,m}$, $d_m$ will be close to its maximum average projection length of 1, whereas if the elements of $\check{x}_{n,m}$ and $\check{s}_{n,m}$ are uncorrelated, then $d_m \approx 0$, i.e., the vectors are approximately orthogonal. Also, from the definitions of $\check{s}_{n,m}$ and $\check{x}_{n,m}$, $d_m$ may be interpreted simply as sample correlation coefficients of the columns of $\bar{S}_m$ and $\bar{X}_m$ (i.e., spectra which have been normalized according to their subband envelopes), averaged across the N frames within a segment.

For simplicity, the intelligibility index related to the entire noisy/processed signal of interest is then defined as the temporal average of the intermediate intelligibility indices,

$$d = \frac{1}{M} \sum_{m=1}^{M} d_m, \tag{5}$$

where M is the number of time segments in the signal of interest. Since $-1 \leq d_m \leq 1$, it follows that $-1 \leq d \leq 1$.
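Assuming the normalized segments from the previous sketch, Eqs. (4) and (5) amount to a few lines (illustrative Python, not the authors' MATLAB implementation):

```python
import numpy as np

def estoi_index(segments_clean, segments_noisy):
    """Compute d per Eqs. (4)-(5) from lists of row- and column-
    normalized J x N segments (see the previous sketch)."""
    d_per_segment = [
        np.mean(np.sum(s_chk * x_chk, axis=0))  # (1/N) sum_n s_n^T x_n, Eq. (4)
        for s_chk, x_chk in zip(segments_clean, segments_noisy)
    ]
    return float(np.mean(d_per_segment))         # Eq. (5)
```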
C. Implementation

ESTOI operates at a sampling frequency of 10 kHz to ensure that the frequency region relevant for speech intelligibility is covered [2]; all signals are resampled to this frequency before applying the method. Then, the signals are divided into frames of 256 samples, using a frame shift of D = 128; the frames are windowed with a Hann window, and an FFT of order N = 512 is applied. Before computing the intelligibility index, frames with no speech content are discarded. These are identified as the frames of the reference speech signal s(n) with energy more than 40 dB below that of the signal frame with maximum energy. DFT coefficients of speech-active frames are grouped into J = 15 one-third octave bands, with center frequencies of 150 Hz and approximately 4.3 kHz for the lowest and highest band, respectively. Finally, time segments of length N = 30 (corresponding to 384 ms) are used (for further details on this choice, we refer to Sec. IV-C).
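The energy-based removal of silent frames could be sketched as follows (illustrative Python; the frame matrix layout is an assumption):

```python
import numpy as np

def active_frame_mask(clean_frames, dyn_range_db=40.0):
    """Boolean mask of speech-active frames: frames whose energy lies
    within dyn_range_db of the most energetic clean-reference frame.

    clean_frames: array of shape (num_frames, frame_len); the mask is
    applied to the frames of both s(n) and x(n)."""
    energy_db = 10.0 * np.log10(np.sum(clean_frames ** 2, axis=1) + 1e-12)
    return energy_db > energy_db.max() - dyn_range_db
```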
Figure 3. Short-time spectrograms for a clean speech time segment (left column) and a noisy time segment (right column) for additive, speech-shaped, sinusoidally amplitude-modulated Gaussian noise (modulation frequency of 5 Hz, SNR = -10 dB). a), b) Time-domain segments. c), d) DFT short-time spectrograms |S(k, m)|, |X(k, m)| (dB scale), computed by applying an N = 512 point FFT to zero-padded, Hann-windowed, time-domain frames of 256 samples (25.6 ms) with an overlap of D = 128 samples. e), f) One-third octave filterbank spectrograms S_m, X_m (dB scale). g), h) Spectrograms with mean- and variance-normalized rows, bar-S_m, bar-X_m (linear scale). i), j) Spectrograms with mean- and variance-normalized rows and columns, check-S_m, check-X_m (linear scale).
III. ORTHOGONAL INTELLIGIBILITY SUBSPACES

In this section we present interpretations of Eqs. (4) and (5) which provide insight into ESTOI. Specifically, we show that ESTOI can be interpreted in terms of a decomposition of (row- and column-normalized) noisy/processed short-time spectrograms into orthogonal one-dimensional subspaces. The decomposition assigns an intelligibility score to each such subspace, so that the sum of all the subspace intelligibilities equals the total intermediate intelligibility d_m of the noisy/processed short-time spectrogram. The decomposition therefore allows us to rank each subspace according to its (predicted) contribution to intelligibility, revealing which spectro-temporal features are predicted to be important to intelligibility.
A. Preliminaries

To focus our exposition, we re-write the expression for d_m (Eq. (4)) using the columns of the row- and column-normalized matrices $\check{X}_m$ and $\check{S}_m$. Let us concatenate the N columns of $\check{S}_m$ into a supervector

$$\check{s}_m = \left[ \check{s}_{1,m}^T \; \ldots \; \check{s}_{N,m}^T \right]^T,$$

where $\check{s}_m \in \Re^{NJ \times 1}$. A similar definition holds for the noisy/processed supervector $\check{x}_m$. Furthermore, let us collect the supervectors for each segment m as columns in super matrices. The clean speech super matrix $\check{S} \in \Re^{NJ \times M}$ is given by

$$\check{S} = \left[ \check{s}_1, \ldots, \check{s}_M \right].$$

The noisy/processed matrix $\check{X}$ is defined similarly.

The intermediate intelligibility index d_m (Eq. (4)) may then be written as

$$d_m = \frac{1}{N} \check{s}_m^T \check{x}_m, \tag{6}$$

and inserting this into Eq. (5) leads to

$$d = \frac{1}{MN} \operatorname{Tr}\left( \check{S}^T \check{X} \right), \tag{7}$$

where Tr(·) denotes the matrix trace operator.
Let us introduce the following orthogonal decomposition of the noisy/processed supervectors $\check{x}_m$,

$$\check{x}_m = \sum_{l=1}^{NJ} e_l e_l^T \check{x}_m = \sum_{l=1}^{NJ} P_l \check{x}_m, \tag{8}$$

where the $e_l$, satisfying $e_i^T e_j = \delta(i, j)$, are orthonormal vectors, and $P_l = e_l e_l^T$ is an orthogonal projection matrix onto the one-dimensional subspace spanned by $e_l$. Since each noisy/processed supervector $\check{x}_m$ describes a time-frequency region of N × J one-third octave spectral values, Eq. (8) provides - when the basis vectors $e_l$ are specified - a decomposition of a noisy/processed time-frequency region into mutually orthogonal one-dimensional subspaces.
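The supervector and super matrix construction, and the trace form of Eq. (7), can be sketched as follows (illustrative Python; `segments` is assumed to be a list of the M row- and column-normalized J x N matrices):

```python
import numpy as np

def supermatrix(segments):
    """Concatenate the N columns of each J x N segment into an NJ-long
    supervector, and collect the M supervectors as columns of an
    NJ x M super matrix (S-check or X-check in the text)."""
    return np.column_stack([seg.T.reshape(-1) for seg in segments])

# With clean/noisy super matrices S_chk, X_chk of shape (N*J, M),
# Eq. (7) reads: d = np.trace(S_chk.T @ X_chk) / (M * N).
```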
B. Intelligibility Subspace Decomposition

Our goal is to determine the orthogonal basis vectors $e_l$ in Eq. (8), ordered according to their (estimated) impact on intelligibility: first, we find the basis vector $e_1$ which carries most intelligibility on average across noisy/processed spectrograms. Next, we find the basis vector $e_2$, orthogonal to $e_1$, which carries most intelligibility. This procedure is repeated for the remaining dimensions, leading to an orthogonal subspace decomposition in terms of intelligibility.

To do this, insert Eq. (8) into Eq. (7),

$$\begin{aligned}
d &= \frac{1}{NM} \operatorname{Tr}\left( \check{S}^T \check{X} \right) \\
  &= \frac{1}{2NM} \operatorname{Tr}\left( \check{S}^T \check{X} + \check{X}^T \check{S} \right) \\
  &= \frac{1}{2NM} \operatorname{Tr}\left( \check{S}^T \sum_l P_l \check{X} + \check{X}^T \Big( \sum_l P_l \Big)^T \check{S} \right) \\
  &= \frac{1}{2NM} \operatorname{Tr}\left( \left( \check{X}\check{S}^T + \check{S}\check{X}^T \right) \sum_l P_l \right) \\
  &= \frac{1}{2NM} \sum_{l=1}^{NJ} e_l^T \left( \check{X}\check{S}^T + \check{S}\check{X}^T \right) e_l,
\end{aligned} \tag{9}$$

where we used that $\operatorname{Tr} A = \operatorname{Tr} A^T$, that summation and trace are linear operators whose order can be interchanged, that $\sum_l P_l = \big( \sum_l P_l \big)^T$ is a symmetric matrix by definition, and that $\operatorname{Tr} ABC = \operatorname{Tr} CAB = \operatorname{Tr} BCA$.
We can now perform the orthogonal intelligibility subspace decomposition described above by solving the following sequence of problems.

Step 1:

$$\max_{e_1} \; \frac{1}{2NM}\, e_1^T \left( \check{X}\check{S}^T + \check{S}\check{X}^T \right) e_1, \quad \text{such that } e_1^T e_1 = 1.$$

Steps $l = 2, \ldots, NJ$:

$$\max_{e_l} \; \frac{1}{2NM}\, e_l^T \left( \check{X}\check{S}^T + \check{S}\check{X}^T \right) e_l, \quad \text{such that } e_l^T e_l = 1 \text{ and } e_l \perp e_1, \ldots, e_{l-1}.$$

It may be recognized that the solution vectors $e_l$ are the eigenvectors of the symmetric matrix $\frac{1}{2NM}\left( \check{X}\check{S}^T + \check{S}\check{X}^T \right)$. The symmetry of this matrix ensures that a) the eigenvectors are mutually orthogonal, and b) the eigenvalues are real-valued, which allows a simple ranking of subspaces according to their contribution to intelligibility.
Note that inserting Eq. (8) with the found vectors $e_l$ into Eq. (6) allows us to express the intermediate intelligibility index d_m in terms of a sum of orthogonal intelligibility subspaces. Note also that since the lth eigenvalue $\lambda_l$ of the matrix $\frac{1}{2NM}\left( \check{X}\check{S}^T + \check{S}\check{X}^T \right)$ satisfies

$$\frac{1}{2NM}\left( \check{X}\check{S}^T + \check{S}\check{X}^T \right) e_l = \lambda_l e_l,$$

it follows that

$$\lambda_l = \frac{1}{2NM}\, e_l^T \left( \check{X}\check{S}^T + \check{S}\check{X}^T \right) e_l.$$

Comparison to Eq. (9) shows that

$$d = \sum_{l=1}^{NJ} \lambda_l.$$

In other words, the total estimated intelligibility of the signal in question is completely determined by the eigenvalues of the sample cross-correlation matrix $\frac{1}{2NM}\left( \check{X}\check{S}^T + \check{S}\check{X}^T \right)$. Specifically, the contribution to intelligibility by the lth subspace is given by the corresponding eigenvalue $\lambda_l$, and the sum of all eigenvalues equals the total intelligibility index.
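In practice, the decomposition therefore reduces to a symmetric eigenvalue problem. A minimal sketch, reusing the super matrices from the earlier sketch (illustrative Python; the assumption J = 15 is only used to recover N from the matrix shape):

```python
import numpy as np

def intelligibility_subspaces(S_chk, X_chk, J=15):
    """Eigen-decompose (1/2NM)(X S^T + S X^T) for NJ x M super matrices.

    Returns eigenvalues (per-subspace intelligibility contributions,
    summing to d) and eigenvectors (basis vectors e_l), sorted in
    descending eigenvalue order."""
    NJ, M = S_chk.shape
    N = NJ // J
    C = (X_chk @ S_chk.T + S_chk @ X_chk.T) / (2.0 * N * M)
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric by construction
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```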
C. Intelligibility Subspaces - Example

To demonstrate the intelligibility subspace decomposition, we construct noisy (but in this example unprocessed) speech signals by adding noise to clean speech signals. Specifically, we study the impact of adding a 100% intensity-modulated, lowpass filtered noise sequence to 1680 signals from the TIMIT [25] database. The noise is a Gaussian white noise sequence (sampling rate f_s = 16000 Hz), filtered through a first-order IIR low-pass filter with a 3 dB cut-off frequency at approximately 80 Hz (pole location p = 0.97). Then, this lowpass filtered noise is amplitude-modulated by the sequence

$$a(n) = 1 + \sin\left( 2\pi f_{\mathrm{mod}} / f_s \, n + \phi \right), \quad n = 0, \ldots, N_s,$$

with a modulation frequency of f_mod = 5 Hz, where N_s is the sequence length corresponding to the duration of the speech signal in question, and φ is a uniformly distributed random phase value, drawn independently for each sentence. The noise is scaled to form an SNR of -10 dB for each sentence.
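A sketch of this noise generator (illustrative Python; the per-sentence SNR scaling is not shown):

```python
import numpy as np
from scipy.signal import lfilter

def modulated_lowpass_noise(num_samples, fs=16000, f_mod=5.0, pole=0.97,
                            seed=None):
    """Generate the masker of this example: white Gaussian noise,
    first-order IIR low-pass filtered (pole at 0.97, i.e., ~80 Hz 3 dB
    cut-off at fs = 16 kHz), then 100% sinusoidally intensity-modulated
    at f_mod with a random phase."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(num_samples)
    lowpass = lfilter([1.0 - pole], [1.0, -pole], white)  # H(z) = (1-p)/(1-p z^-1)
    phase = rng.uniform(-np.pi, np.pi)
    n = np.arange(num_samples)
    return (1.0 + np.sin(2.0 * np.pi * f_mod / fs * n + phase)) * lowpass
```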
Based on this set of noisy signals, we apply the intelligibility subspace decomposition described above. Fig. 4 (lower right) shows the NJ = 450 eigenvalues $\lambda_l$ in descending order for the decomposition of d. In this example, the 11 dominant subspaces carry 46% of the total estimated intelligibility, while 115 dimensions carry 90%. Fig. 4 shows the basis vectors of the 11 dominant subspaces. The subspaces are characterized by regular spectro-temporal patterns, apparently with low spectro-temporal modulation frequencies.

Frequency analysis across the temporal dimension of each of the subfigures reveals temporal modulation frequencies - averaged across the acoustic frequency axis - ranging from 2.1 Hz to 5.8 Hz. This modulation frequency range is well known to be particularly important for intelligibility. Specifically, Drullman et al. showed that intelligibility can be degraded significantly if modulations in the frequency range of approximately 3–8 Hz are not preserved [26], [27]. Similarly, Elliott and Theunissen found temporal modulation frequencies in the range 1–7 Hz to be most important for speech intelligibility [24], while Kates and Arehart found frequencies less than 12.5 Hz to carry most information about intelligibility [28].

Applying Fourier transforms to the columns of each subspace spectrogram in Fig. 4 and computing the average magnitude spectrum shows maximum spectral modulations in the range 0.2–0.7 cycles/kHz. As for the temporal modulation content, these numbers are quantitatively well in line with the results in [24], which report spectral modulation frequencies < 1 cycle/kHz to be most important for speech intelligibility³.
While the eigenvalues $\lambda_l$ of the sample correlation matrix $\frac{1}{2NM}\left( \check{X}\check{S}^T + \check{S}\check{X}^T \right)$ tend to be positive, a small subset of the lowest eigenvalues can be negative. In other words, the signal components represented by the corresponding subspaces degrade intelligibility as estimated by the model. For many practical situations, however, the impact of these negative intelligibility subspaces is small. For example, in Fig. 4, where the global SNR is -10 dB, the 15 smallest eigenvalues are negative; their sum is approximately -0.001, which in magnitude is 0.3% of the total estimated intelligibility index. Generally speaking, the number and the impact of these negative intelligibility subspaces increase with decreasing SNR. For simple additive and stationary noise maskers, e.g., a single constant-frequency masker tone which occupies the same time-frequency region in all time segments, it can be verified that the time-frequency pattern of this masker may be represented well using the negative subspaces as basis functions. For non-stationary noise sources, on the other hand, e.g., the modulated low-pass noise used in Fig. 4, the negative subspaces do not, in general, represent the spectro-temporal noise pattern in a particular segment. Rather, the negative subspaces represent the average spectro-temporal pattern of the noise within many time segments, across which the noise does not necessarily occupy the same time-frequency region in each time segment.

³ Note that an accurate comparison is difficult: the results in [24] are based on spectro-temporal analyses of log-magnitude spectra computed in a uniform-frequency filter bank, whereas the proposed method operates on linear, but energy-normalized, one-third octave band magnitude spectra.
Finally, Fig. 5 shows the intelligibility subspace decomposition for speech in natural noise, namely noise recorded in a busy office cafeteria [29]. While details obviously differ from the decomposition in Fig. 4, the main features of the dominant subspaces are the same: temporal modulation frequencies are in the range 2.0–5.7 Hz, while spectral modulations are in the range 0.2–0.8 cycles/kHz.
IV. SIMULATION RESULTS

In this section, we present a number of intelligibility listening tests for evaluating the proposed method. Furthermore, we study the performance as a function of the segment length N and the test signal duration. Finally, we compare the performance of ESTOI to a range of existing speech intelligibility predictors.
A. Signals and Processing Conditions
We study the performance of ESTOI using the results
of five intelligibility tests with speech signals subjected to
various noise sources and processing conditions. The first two
tests used various additive noise sources with strong temporal
modulations; we include these in the study to verify the ability
of ESTOI to operate in this domain, and to verify the results
reported in [21] that established intelligibility predictors work
less well here. The third test used stationary and non-stationary additive noise sources with less temporal modulation; several existing methods work well for this common class of noise sources, and it is important to establish that ESTOI does so too. The fourth and fifth intelligibility tests used processed
noisy speech signals for which STOI works exceptionally well,
while many other methods fail. As before, it is important to
establish the performance of ESTOI in this situation.
1) Additive Noise Set I: The first set of signals consists of ten mainly non-stationary noise sources with significant modulation content, cf. Table I. The icra signals are synthetic speech signals constructed by filtering Gaussian noise sequences through bandpass filters with time-varying gain, to construct signals with speech-like spectro-temporal properties [30]. The snam signals are 100% sinusoidally intensity-modulated speech-shaped noise signals. The signals are constructed by point-wise multiplication of the unmodulated speech-shaped noise signal icra1 with the modulation sequence

$$a(n) = 1 + \sin(2\pi \omega n + \phi),$$

where ω denotes the angular modulation frequency, and $\phi \in [-\pi; \pi[$ is a random phase value, drawn independently for each signal generation. To construct machine gun noise macgun with sufficient masking power, the original machine gun noise signal from the Noisex database [31] was divided into successive 20 ms frames, and frames with energy more than 40 dB below the maximum frame energy were removed.

Speech signals from the Dantale II sentence test [32] were added to randomly selected sections of each of the ten noise sources at six different SNRs (cf. Table I). The SNRs were chosen so that, for each noise source, some noisy signals were almost perfectly intelligible, whereas others were essentially unintelligible. The total number of conditions was therefore 10 noise types × 6 SNRs = 60 conditions. Each condition was repeated 3 times (with different speech and noise realizations), leading to a total of 180 sentences to be judged per subject. The presentation order of noise types, SNRs, and repetitions was randomized. The sample rate was 20 kHz.
We conducted a closed Danish speech-in-noise intelligibility test, cf. [33]. The Dantale II sentences consist of five words with a correct grammatical structure. Candidate words were arranged in a 10-by-5 matrix on a computer screen, such that each of the five columns encompassed exactly the 10 possible alternatives for the corresponding word. Each column was extended with one entry, which allowed the subject to answer "Don't know". For each 5-word sentence, the subject had to select, via a graphical user interface, the words that she heard. Subjects were seated in a sound-treated room, where signals were presented diotically through headphones (Sennheiser HD 280 Pro). The icra1 noise at an SNR of -8 dB was used to calibrate the presentation level to 65 dB (A). The subjects were allowed to adjust this level during a training session prior to the actual test. Twelve native Danish-speaking subjects (normal-hearing, age range 26–44 years, 2 females, 10 males) participated in the test. The subjects volunteered for the experiments and were not paid for their participation.
2) Additive Noise Set II: The second data set, consisting of speech in additive fluctuating noise sources, is described in [7]. We use Maskers 1–13 of [7], which include low-pass filtered unmodulated Gaussian noise and various amplitude-modulated Gaussian noise signals, including sinusoidally amplitude-modulated signals (modulation frequencies: 2.1, 4.9, 10.2, and 19.9 Hz; modulation depths of ±6 dB, ±12 dB, and ±100%), and three irregularly modulated noise signals obtained by adding the sinusoidal modulators with random initial phases. The speech material used was the Swedish version of the Hagerman material [34], which is similar in structure to the Dantale II set used above. For each noise source, noisy speech signals were generated with an SNR of -15 dB, and the corresponding speech intelligibility was recorded ([7, Fig. 5]). Hence, the number of conditions equalled 11 noise sources × 1 SNR = 11 conditions. Intelligibility tests were conducted with i) eleven young (17–33 years) normal-hearing listeners, and ii) twenty elderly (54–69 years) normal-hearing listeners (the study also included elderly, hearing-impaired listeners, but the results of these tests are not used in this paper). The sample rate used was 20 kHz. For more details, we refer to [7].
3) Additive Noise Set III: We include a third additive noise set, for which many existing intelligibility predictors work well, see e.g. [14], [17] and the references therein. The data set encompasses Dantale II speech sentences contaminated by four additive noise sources: i) speech-shaped Gaussian noise, ii) car cabin noise recorded when driving on the highway, iii) bottling hall noise, and iv) cafeteria noise consisting of a conversation between a female and a male speaker, i.e., two-talker speech babble [35]. The noisy signals were generated with SNRs from -20 dB to 5 dB in steps of 2.5 dB, so the total number of conditions equals 4 noise types × 11 SNRs = 44 conditions. Fifteen listeners participated in the test. For more details, we refer to [35].
Figure 4. Decomposition of d for speech in additive, speech-shaped, sinusoidally amplitude-modulated Gaussian noise (f_mod = 5 Hz, SNR = -10 dB). The panels show the basis functions e_l of the 11 dominant intelligibility subspaces (time [ms] vs. frequency [Hz]), and the lower-right panel shows the decomposition of d in terms of the eigenvalues λ_l (d = 0.33).
Noise Name   Description                                                                              SNR [dB]
icra1        Unmodulated speech-shaped (male) Gaussian noise from the ICRA corpus (Track 1) [30].     -17:3:-2
icra4        1-person babble (female) from the ICRA corpus (Track 4).                                 -29:3:-14
icra6        2-person babble (1 male and 1 female) from the ICRA corpus (Track 6).                    -24:3:-9
icra7        6-person babble from the ICRA corpus (Track 7).                                          -19:3:-4
snam2        100% intensity-modulated version of icra1. Modulation frequency 2 Hz.                    -27:3:-12
snam4        As above, with modulation frequency 4 Hz.                                                -23:3:-8
snam8        As above, with modulation frequency 8 Hz.                                                -25:3:-10
snam16       As above, with modulation frequency 16 Hz.                                               -22:3:-7
macgun       (Modified) machine gun noise from the Noisex corpus [31].                                -37:3:-22
destop       Destroyer operations room noise from the Noisex corpus [31].                             -14:3:1

Table I. Noise sources and SNR ranges used for the intelligibility test with Additive Noise Set I. The notation x:y:z indicates SNRs from x to z (both included) in steps of y dB.
4) Ideal Time-Frequency Segregation: The fourth data set consists of the noisy signals from Additive Noise Set III, processed using the ideal time-frequency segregation (ITFS) technique [36]. Kjems [35] processed noisy signals with two different ITFS algorithms called ideal binary mask (IBM) and target binary mask (TBM), and used eight different variants of each algorithm (reflected by the LC parameter, i.e., the threshold which determines whether the algorithm suppresses a given time-frequency tile or not). Three different SNRs were used, leading to a total number of (4 noise types (IBM) + 3 noise types (TBM)⁴) × 8 LC values × 3 SNRs = 168 test conditions. Fifteen normal-hearing subjects participated in the test. The sample rate was 20 kHz. More details are available in [35].

⁴ The IBM and TBM algorithms are identical for speech-shaped noise.

5) Single-Channel Noise Reduction: The last data set consists of noisy speech signals processed with three single-microphone noise reduction algorithms [37]. The three algorithms are all non-linear and aim at finding binary or soft minimum mean-square error (MMSE) estimates of the short-time spectral amplitude (STSA). We include this data set, because an obvious use of the proposed algorithm is for development/tuning of noise reduction algorithms.
Figure 5. Decomposition of d for speech in cafeteria noise [29], SNR = -10 dB. The panels show the basis functions e_l of the 11 dominant intelligibility subspaces (time [ms] vs. frequency [Hz]), and the lower-right panel shows the decomposition of d in terms of the eigenvalues λ_l (d = 0.2275).
Speech-shaped (unmodulated) noise signals were added to speech signals (female speaker) from the Dutch version of the Hagerman test [33], [34] at fixed SNRs of -8, -6, -4, -2, and 0 dB. The noisy and processed speech signals were presented diotically via headphones, and the order of presenting the different algorithms and SNRs was randomized. The signals were evaluated in a closed Dutch speech-in-noise intelligibility test [33]. Each processing condition was repeated five times, leading to 4 conditions × 5 SNRs = 20 test conditions. Thirteen subjects participated in the test. The sample rate was 8 kHz.
B. Prediction of Absolute Intelligibility and Figures of Merit

Most intelligibility prediction methods (including ESTOI) do not predict intelligibility, i.e., the fraction of words understood, per se. Instead, they output a scalar $\tilde{I}$ which, ideally, is monotonically related to absolute intelligibility I⁵. The monotonic mapping between predictor output and absolute intelligibility is generally hard to derive analytically, but in [14], [38] it was proposed to use the following logistic map,

$$\hat{I} = \frac{100}{1 + \exp\left( a \tilde{I} + b \right)}, \tag{10}$$

where $a, b \in \Re$ are constants that depend on the test material, test paradigm, etc., and which are estimated to fit the intelligibility data at hand.

⁵ We use the symbol $\tilde{I}$ to represent the output of any intelligibility predictor (including ESTOI), while we reserve the symbol d for the particular scalar output produced by ESTOI.
To quantify the performance of intelligibility predictors, we use four figures of merit (see [17] for exact definitions): i) the linear correlation coefficient ρ_pre between average intelligibility scores obtained in listening tests and the outcomes $\tilde{I}$ of the intelligibility predictors before applying the logistic map (Eq. (10)); ii) the linear correlation coefficient ρ between average intelligibility scores and the outcomes $\hat{I}$ of the intelligibility predictors, i.e., after the logistic map; iii) the root mean-square prediction error σ between measured and predicted intelligibility; and iv) Kendall's rank correlation coefficient τ.
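A sketch of the logistic fit and the four figures of merit (illustrative Python using scipy; the initial guess p0 for the fit is our assumption):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, kendalltau

def logistic_map(I_tilde, a, b):
    """Eq. (10): map raw predictor outputs to percent intelligibility."""
    return 100.0 / (1.0 + np.exp(a * I_tilde + b))

def figures_of_merit(I_tilde, I_measured):
    """Return (rho_pre, rho, sigma, tau) for one data set."""
    I_tilde = np.asarray(I_tilde, dtype=float)
    I_measured = np.asarray(I_measured, dtype=float)
    rho_pre = pearsonr(I_tilde, I_measured)[0]
    (a, b), _ = curve_fit(logistic_map, I_tilde, I_measured, p0=(-10.0, 5.0))
    I_hat = logistic_map(I_tilde, a, b)
    rho = pearsonr(I_hat, I_measured)[0]
    sigma = float(np.sqrt(np.mean((I_hat - I_measured) ** 2)))
    tau = kendalltau(I_tilde, I_measured)[0]
    return rho_pre, rho, sigma, tau
```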
C. Impact of Segment Length and Signal Duration

1) Sensitivity to segment length N: ESTOI was developed with simplicity in mind and has few free parameters. This section studies the performance of ESTOI as a function of the segment length N for the various noise/processing situations in the five data sets described above. In particular, for a given data set, and for a given choice of the segment length N, the proposed method was applied to compute an intelligibility index for each test condition in that data set. Then, the free parameters a, b of the logistic function, Eq. (10), were fitted to map the predicted intelligibility indices to absolute intelligibility as measured in a listening test. Finally, the performance in terms of ρ, σ, and τ was computed.

Fig. 6 shows the performance in terms of ρ as a function of segment length N (in the range N = 5, ..., 60, corresponding to durations of 64–768 ms). Clearly, the proposed method is fairly insensitive to the exact choice of the segment length N. In fact, for 20 ≤ N ≤ 50 (corresponding to durations of 256–640 ms), the proposed method gives excellent performance with values of ρ > 0.9. The lower performance for N < 20 may be explained by the fact that with these short segment lengths, the method is less able to capture low-frequency temporal modulations, which are important for speech intelligibility [14]. While the discussion above focused on prediction performance in terms of ρ, similar conclusions may be drawn from performance analyses based on σ and τ (not shown). Based on these observations, a value of N = 30 (384 ms) is used in the remainder of the simulation experiments.
Figure 6. Speech intelligibility prediction performance (in terms of ρ) as a function of segment length N for the various noise/processing conditions (curves: Add Noise I, Add Noise II, Add Noise III, ITFS, SC-NR). For Additive Noise Set II, we used the signals and listening test results for the young, normal-hearing listeners to avoid a cluttered plot. N = 30 corresponds to a segment duration of 384 ms.
2) Sensitivity to duration of test signals: For highly modulated, additive noise sources, the instantaneous SNR can vary significantly across a short time span. For example, a sinusoidally amplitude-modulated noise source could completely mask the target signal at one instant, while leaving it essentially unmasked half a period later (cf. Fig. 3). Hence, the speech intelligibility for a particular short speech signal is highly dependent on the (random) location of the high-SNR regions with respect to the speech signal. Since our goal is to estimate the average speech intelligibility, we would therefore expect it to be necessary to average across many noise realizations, or, equivalently, to use longer test speech signals, than would, e.g., be necessary for unmodulated noise sources. In this section we therefore study the sensitivity of the proposed method with respect to the test signal duration t_sig.

To do so, we generated speech signals contaminated by two additive noise sources: speech-shaped stationary noise, and synthetic 1-person babble (the icra4 noise source from [30]). The noise sources were scaled to achieve SNRs of -10 dB and -23 dB, respectively, corresponding approximately to the 50% speech reception threshold (SRT) (the SNR needed to achieve a recognition rate of 50%) for these noise sources. Noise and clean signals were generated in corresponding pairs with various durations in the range 1–80 s. Noisy signals were generated by adding the clean and noise signals. Clean and noisy signals were then passed through ESTOI. For comparison, the clean and noise signals were passed through the Extended Speech Intelligibility Index (ESII) algorithm [6] (see Sec. IV-D for implementation details). For each signal duration, n_real = 100 different realizations of the clean/noisy signal pairs were evaluated.
It is of interest to study to what extent an intelligibility prediction $\tilde{I}_n(t_{\mathrm{sig}})$, based on a single (the nth) test signal realization of duration $t_{\mathrm{sig}}$, lies in the neighborhood of the ensemble average, i.e., the average predictor value $\mu_{\tilde{I}}(t_{\mathrm{sig}}) = \frac{1}{n_{\mathrm{real}}} \sum_n \tilde{I}_n(t_{\mathrm{sig}})$ across many realizations $n_{\mathrm{real}}$ of test signal pairs. To do so, let us define the sample standard deviation

$$\sigma_{\tilde{I}}(t_{\mathrm{sig}}) = \sqrt{ \frac{1}{n_{\mathrm{real}}} \sum_n \left( \tilde{I}_n(t_{\mathrm{sig}}) - \mu_{\tilde{I}}(t_{\mathrm{sig}}) \right)^2 },$$

and let us define the relative standard deviation as

$$\epsilon(t_{\mathrm{sig}}) = \frac{\sigma_{\tilde{I}}(t_{\mathrm{sig}})}{\mu_{\tilde{I}}(t_{\mathrm{sig}})} \times 100 \; [\%].$$
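For completeness, ε(t_sig) is straightforward to compute from the n_real predictor outputs obtained at a given duration (illustrative Python):

```python
import numpy as np

def relative_std_percent(predictions):
    """epsilon(t_sig) in percent, from the n_real predictor outputs
    obtained at a single test signal duration."""
    predictions = np.asarray(predictions, dtype=float)
    return float(predictions.std() / predictions.mean() * 100.0)
```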
Fig. 7a) shows ε(t_sig) for ESII and ESTOI for unmodulated speech-shaped noise, while Fig. 7b) shows the results for babble noise. From Fig. 7, three conclusions can be drawn. First, as expected, the relative standard deviation declines with signal duration. Secondly, for a given test signal duration, ESII has a lower relative standard deviation than ESTOI. This is because ESII makes explicit use of its access to the clean speech signal and the noise signal in separation to accurately compute SNRs in different time-frequency regions, and subsequently computes an intelligibility index based on these (i.e., ESII is a Class-1 method as discussed in the Introduction). ESTOI, on the other hand, does not make use of access to the separated clean and noise components and is therefore more generally applicable, e.g., to non-linearly processed signals (i.e., it is a Class-2 method). This generality comes at the price of an increase in the estimation standard deviation. Thirdly, as expected, for a given test signal duration, the estimation standard deviation is higher for modulated than for unmodulated noise, both for ESII and ESTOI.

It is hard to decide a priori on a sufficient test signal duration, because a) it depends on the noise signal statistics in a non-trivial manner, and b) the noise statistics are unavailable to the proposed method. Hence, the test signal duration should simply be chosen as long as practically possible, and generally no less than some tens of seconds. Note that long test signals can be generated by concatenating several of the, potentially short, speech sentences used in the intelligibility test.
Figure 7. Relative standard deviation ε(t_sig) of the speech intelligibility predictors ESII and ESTOI, as a function of test signal duration t_sig. a) Speech-shaped stationary noise (SNR = -10 dB), b) icra4 noise (synthetic 1-person babble) [30] (SNR = -23 dB).
D. Comparison to Existing Methods

We compare the proposed intelligibility prediction method to reference methods from the literature. The methods are outlined in Table II. The methods CSII-BIF and STI-NCM-BIF are referred to as CSII_mid, W4, p = 1 and NCM, W_i^(1), p = 1.5, respectively, in [39, Table IV]. We implemented the GLIMPSE method [8] using the one-third octave filter bank used in ESTOI. The speech glimpse percentage was defined here as the percentage of time-frequency units with a local SNR exceeding -8 dB (this threshold was chosen because it led to the best performance in terms of ρ, σ, and τ). Our implementation of the ESII algorithm computes the per-frame SII based on one-third octave filtering, and outputs the average of the per-frame SIIs. The implementation uses stationary speech-shaped Gaussian noise instead of undistorted real speech signals as input (as specified in [3], [6]), but excludes the upward-spread-of-masking functionality as defined in [3], because this appears to degrade performance. As in [6], we use the band importance functions derived for the test stimuli of the Speech in the Presence of Noise (SPIN) test ([3, Table B.1]) and [40].
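Our glimpse-percentage definition can be sketched as follows (illustrative Python; `speech_env` and `noise_env` are assumed to be one-third octave band magnitude envelopes of the separated speech and noise signals):

```python
import numpy as np

def glimpse_percentage(speech_env, noise_env, threshold_db=-8.0):
    """Percentage of time-frequency units whose local SNR exceeds the
    threshold, computed on J x M magnitude envelopes of the separated
    speech and noise signals (a Class-1 computation)."""
    local_snr_db = 20.0 * np.log10(speech_env / (noise_env + 1e-12))
    return float(100.0 * np.mean(local_snr_db > threshold_db))
```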
Tables III–VI summarize performance in terms of ρ_pre, ρ, σ, and τ, respectively, for the intelligibility predictors for the various additive noise and processing conditions. The ρ, σ, and τ values are found by fitting a, b to the data sets in question. To identify statistically significant differences between ρ-values (Table IV), pairwise comparisons using the Williams t-test [41]–[43] were performed within each data set between the predictor with the largest ρ and the others (with Bonferroni correction for multiple comparisons). Methods which do not perform statistically significantly worse than the method with the highest ρ (p < 0.05) are indicated with (*) in Table IV. In addition, a statistical analysis of significance was applied to the root mean-square prediction errors σ (Table V) as follows (see [44] for a brief outline of this approach). For each prediction method and for each of the five listening tests, the free parameters a, b in the logistic function were fitted to n − 1 data points, where n denotes the number of conditions for a specific data set. Then this logistic function was applied to the left-out data point $\tilde{I}_i$, where i is the index of the left-out data point, to find a prediction $\hat{I}_i$ of the left-out subject result $I_i$. The procedure was repeated for all data points, resulting in prediction errors $e_i = I_i - \hat{I}_i$, i = 1, ..., n. Our goal is to compare the magnitude of these prediction errors across prediction methods. The data $e_i^2$ for each intelligibility predictor and for each data set did not pass a chi-square goodness-of-fit test for normality (p < 0.05). Hence, a Kruskal-Wallis test was performed, rejecting for each data set the hypothesis that the median of $e_i^2$ is identical for all prediction methods (p < 10⁻⁵). A multiple pairwise comparison test (Tukey HSD) was applied to identify prediction methods which, for a particular data set, performed statistically significantly worse than the method with the lowest σ (p < 0.05). The result of this comparison is indicated with (*) in Table V.
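The leave-one-out procedure for the prediction errors e_i can be sketched as follows (illustrative Python, reusing `logistic_map` from the sketch in Sec. IV-B; the initial guess is again our assumption):

```python
import numpy as np
from scipy.optimize import curve_fit

def loo_prediction_errors(I_tilde, I_measured):
    """Leave-one-out errors e_i = I_i - I_hat_i: for each condition i,
    fit the logistic map, Eq. (10), to the remaining n - 1 points and
    predict the held-out one."""
    I_tilde = np.asarray(I_tilde, dtype=float)
    I_measured = np.asarray(I_measured, dtype=float)
    n = len(I_tilde)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        (a, b), _ = curve_fit(logistic_map, I_tilde[keep], I_measured[keep],
                              p0=(-10.0, 5.0))
        errors[i] = I_measured[i] - logistic_map(I_tilde[i], a, b)
    return errors
```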
From these tables, a number of observations can be made. First, focusing on the highly non-stationary noise conditions, i.e., Additive Noise Sets I and II, it is clear that ESTOI, GLIMPSE, and ESII appear to work quite well. The fact that GLIMPSE and ESII can work well in these conditions is well in line with the results reported in [8] and [6], respectively. On the other hand, existing methods such as STOI and SIMI, which are known to work well for other, less non-stationary noise sources and for various processing conditions, do not work well for the highly fluctuating noise sources: SIMI shows correlation values ρ ≈ 0, while STOI shows large but negative correlations for these data sets in Table III. This indicates that the STOI output $\tilde{I}$ decreases for increasing measured intelligibility (the fact that the same entry in Table IV is positive is because the logistic map from $\tilde{I}$ to $\hat{I}$ in this situation maps low $\tilde{I}$ values to high $\hat{I}$ values, and vice versa).
It is interesting to note that ESTOI (as well as ESII and GLIMPSE) performs well for Additive Noise Set II, both for young and for elderly normal-hearing subjects. While the basic intelligibility predictors are unchanged, each intelligibility predictor employs a different logistic map (i.e., constants a and b in Eq. (10)) for the different subject groups, because the 30% speech reception threshold was 6 dB higher for the elderly compared to the young subjects. It appears that the SRT differences between these normal-hearing subject groups (e.g., differences in higher auditory stages, which are not captured by a standard listening test used to establish whether a subject is normal-hearing or not) are well modeled simply by changing the logistic map.
Second, for the less fluctuating (but still non-stationary) noise sources in Additive Noise Set III, most methods work well. In fact, for this data set, several intelligibility predictors, including ESTOI, show values of ρ > 0.95 and σ values at or below 10%. Note that SII, which relies on long-term noise spectra, also works well in this situation.
For noisy signals processed with ideal time-frequency seg-
regation and single-channel noise reduction, ESTOI, SIMI, and STOI work well with ρ > 0.94. It is interesting to note that
for single-channel noise reduced signals, STI-NCM-BIF works
exceptionally well (ρ > 0.97): an explanation is that STI-
NCM-BIF was developed with this particular processing type
in mind; also note that STI-NCM-BIF does not show this level
of performance for any other noise/processing condition.
Acronym: Short Description
ESTOI: Extended Short-Time Objective Intelligibility; the proposed method.
SIMI: Speech Intelligibility prediction based on Mutual Information [17].
STOI: The short-time objective intelligibility measure [14].
CSII-MID: The mid-level coherence speech intelligibility index (SII) [9].
CSII-BIF: The coherence SII with signal-dependent band-importance functions [39].
STI-NCM: The normalized covariance speech transmission index (STI) [10].
STI-NCM-BIF: The normalized covariance STI with signal-dependent band-importance functions [39].
NSEC: The normalized subband envelope correlation method [45].
MIKNN: Intelligibility prediction based on a k-nearest neighbor estimate of mutual information (MIKNN) [46].
GLIMPSE†: Implementation of Cooke's glimpse method [8].
SII†: The Critical-Band SII with SPIN band-importance functions [3, Table B.1].
ESII†: Implementation of Extended SII [6] ([3, Table B.1]).
Table II: Intelligibility predictors for comparison. Note that predictors marked with † require that speech and noise realizations are available in separation, and that noise is additive, i.e., these methods cannot be directly used to predict the intelligibility of (non-linearly) processed speech signals.

Add. Noise Add. Noise Add. Noise Add. Noise ITFS SC-NR
Set I Set II (Young) Set II (Elderly) Set III
ESTOI 0.846 0.877 0.900 0.864 0.919 0.955
SIMI 0.483 -0.028 -0.253 0.931 0.934 0.974
STOI 0.477 -0.797 -0.789 0.887 0.931 0.983
CSII-MID 0.671 0.002 -0.020 0.784 0.440 0.794
CSII-BIF 0.717 0.578 0.696 0.883 0.541 0.843
STI-NCM 0.480 -0.354 -0.657 0.880 0.731 0.844
STI-NCM-BIF 0.519 -0.033 -0.153 0.784 0.568 0.974
NSEC 0.572 -0.309 -0.349 0.924 0.871 0.638
MIKNN 0.552 0.755 0.770 0.732 0.824 0.847
GLIMPSE 0.850 0.851 0.850 0.845 – –
SII 0.541 -0.101 -0.112 0.723 – –
ESII 0.807 0.816 0.849 0.701 – –
Table III: Performance of intelligibility predictors in terms of ρ, i.e., the linear correlation coefficient between measured intelligibility and predictor outputs Ĩ, cf. (10).

Add. Noise Add. Noise Add. Noise Add. Noise ITFS SC-NR
Set I Set II (Young) Set II (Elderly) Set III
ESTOI 0.915 0.895 0.916 0.960 0.948 0.981
SIMI 0.514 0.027 0.238 0.975 0.958 0.970
STOI 0.477 0.809 0.799 0.967 0.961 0.987
CSII-MID 0.766 0.000 0.018 0.948 0.454 0.799
CSII-BIF 0.745 0.607 0.791 0.978 0.539 0.850
STI-NCM 0.525 0.364 0.660 0.935 0.727 0.844
STI-NCM-BIF 0.524 0.003 0.147 0.813 0.586 0.976
NSEC 0.619 0.315 0.356 0.953 0.868 0.625
MIKNN 0.722 0.780 0.808 0.902 0.881 0.870
GLIMPSE 0.872 0.872 0.875 0.912 – –
SII 0.674 0.098 0.107 0.964 – –
ESII 0.845 0.844 0.872 0.818 – –
Table IV: Performance of intelligibility predictors Î in terms of the linear correlation coefficient ρ. Superscripts * indicate methods which do not perform statistically significantly worse than the method with the highest ρ for a given data set (see text for details).

Add. Noise Add. Noise Add. Noise Add. Noise ITFS SC-NR
Set I Set II (Young) Set II (Elderly) Set III
ESTOI 11.49 7.63 7.03 10.26 9.93 3.76
SIMI 24.92 17.09 17.01 8.13 8.91 4.72
STOI 25.53 10.06 10.54 9.24 8.61 3.09
CSII-MID 18.69 17.10 17.50 11.47 27.72 11.65
CSII-BIF 19.39 13.62 11.03 7.54 26.23 10.20
STI-NCM 24.78 15.93 13.18 12.75 21.39 10.38
STI-NCM-BIF 24.74 17.09 17.32 21.08 25.23 4.22
NSEC 22.89 16.23 16.36 10.89 15.51 15.12
MIKNN 20.19 10.74 10.37 15.92 14.68 9.55
GLIMPSE 14.22 8.37 8.51 14.85 – –
SII 21.56 17.02 17.40 9.76 – –
ESII 15.59 9.20 8.59 20.82 – –
Table V: Performance of intelligibility predictors Î in terms of root mean-square prediction error σ. Superscripts * indicate methods which do not perform statistically significantly worse than the method with the lowest σ for a given data set (see text for details).

Add. Noise Add. Noise Add. Noise Add. Noise ITFS SC-NR
Set I Set II (Young) Set II (Elderly) Set III
ESTOI 0.748 0.590 0.667 0.816 0.775 0.842
SIMI 0.354 0.077 0.256 0.825 0.767 0.884
STOI 0.376 0.436 0.436 0.818 0.798 0.905
CSII-MID 0.609 0.026 0.026 0.863 0.335 0.600
CSII-BIF 0.580 0.410 0.487 0.846 0.408 0.684
STI-NCM 0.328 0.180 0.590 0.818 0.538 0.684
STI-NCM-BIF 0.276 0.085 0.077 0.643 0.385 0.853
NSEC 0.514 0.077 0.128 0.854 0.695 0.505
MIKNN 0.554 0.436 0.410 0.751 0.689 0.684
GLIMPSE 0.673 0.461 0.385 0.742 – –
SII 0.260 0.330 0.128 0.822 – –
ESII 0.653 0.564 0.359 0.685 – –
Table VI: Performance of intelligibility predictors Î in terms of Kendall's rank correlation coefficient τ.
In summary, for highly fluctuating additive noise sources, where STOI and SIMI fail, ESTOI performs at the level of
established methods such as GLIMPSE and ESII, without
requiring access to the speech and noise signals in isolation.
For less fluctuating noise sources, ESTOI performs as well as the best existing methods, such as SII, STOI, and SIMI. For non-linearly processed noisy signals, where methods such as SII, ESII, and GLIMPSE cannot operate, the proposed method still performs at the level of STOI and SIMI.
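For reference, the figures of merit reported in Tables III–VI are straightforward to compute once measured scores and predictor outputs are available. The sketch below uses hypothetical vectors; ρ is computed against the mapped predictions Î as in Table IV (for Table III the same formula is applied to the raw outputs Ĩ), and the Kendall τ computation assumes the corr function of the MATLAB Statistics Toolbox rather than the paper's own scripts:

    % Sketch: figures of merit for one data set, given measured scores I and
    % mapped predictions Ihat (both hypothetical here).
    I    = [10 25 50 55 75 90];                      % measured scores (%)
    Ihat = [12 28 47 58 72 88];                      % mapped predictions (%)
    r = corrcoef(Ihat, I);
    rho   = r(1, 2);                                 % linear correlation (Tables III-IV)
    sigma = sqrt(mean((I - Ihat).^2));               % RMS prediction error (Table V)
    tau   = corr(Ihat(:), I(:), 'type', 'Kendall');  % rank correlation (Table VI)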
V. CONCLUSION
We presented an algorithm for monaural, intrusive intel-
ligibility prediction: given undistorted reference speech sig-
nals and their noisy, and potentially non-linearly processed,
counterparts, the algorithm estimates the average intelligibility
of the latter, across a group of normal-hearing listeners. The
proposed algorithm, which is called ESTOI (Extended Short-
Time Objective Intelligibility), may be interpreted in terms of an orthogonal decomposition of energy-normalized short-time spectrograms into “intelligibility subspaces”, i.e., one-dimensional subspaces which are ranked according to their importance with respect to intelligibility. This intelligibility subspace decomposition indicates that the proposed algorithm favors spectro-temporal modulation patterns which are known from the literature to be important for intelligibility. The proposed intelligibility predictor has only one free parameter, the segment length N, i.e., the duration across which the short-time spectrograms are computed. We show, via simulation experiments, that performance is fairly insensitive to the exact choice of this parameter, and that durations in the range 256–640 ms lead to the best performance; this allows the algorithm to capture relatively low-frequency modulation content, while still being able to adapt to changing signal characteristics. We
study the performance of ESTOI in predicting the results of
five different intelligibility listening tests: two with tempo-
rally highly modulated additive noise sources, one with more
moderately modulated additive noise sources, and two with
noisy signals processed by ideal time-frequency masking and
single-channel non-linear noise reduction algorithms, respec-
tively. Compared to a range of existing speech intelligibility
prediction algorithms, ESTOI performs well across all listening
tests.
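As a final illustration of the segment-based comparison underlying ESTOI, the sketch below correlates doubly normalized short-time spectrogram segments: rows (frequency bands) and then columns (time frames) of corresponding clean and noisy/processed segments are normalized to zero mean and unit norm, and the column-wise inner products are averaged, so that correlation across frequency bands is retained. This is a simplified sketch with random placeholder matrices and assumed dimensions (e.g., 15 one-third octave bands and N = 30 frames, roughly 400 ms); it is not the reference implementation, which is available at the URL below.

    % Simplified sketch (not the reference ESTOI implementation) of comparing
    % one pair of K x N spectrogram segments. X and Y are placeholder band
    % envelope segments of clean and noisy/processed speech, respectively.
    % Requires MATLAB R2016b+ (implicit expansion).
    K = 15; N = 30;                  % assumed: 15 bands, ~400-ms segment
    X = rand(K, N); Y = rand(K, N);  % hypothetical segments
    % Normalize each row (band) to zero mean and unit norm:
    rowNorm = @(A) (A - mean(A, 2)) ./ sqrt(sum((A - mean(A, 2)).^2, 2));
    X = rowNorm(X); Y = rowNorm(Y);
    % Normalize each column (frame) to zero mean and unit norm:
    colNorm = @(A) (A - mean(A, 1)) ./ sqrt(sum((A - mean(A, 1)).^2, 1));
    X = colNorm(X); Y = colNorm(Y);
    % Segment score: average of column-wise inner products, in [-1, 1]:
    d = sum(sum(X .* Y)) / N;

In the full algorithm, such segment scores are averaged over the duration of the signal.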
The present study has focused on speech intelligibility prediction performance within data sets that each contain signals with similar distortions or processing types (e.g., additive, modulated noise, or noisy speech processed by ideal time-frequency segregation (ITFS) algorithms). It is a topic for future research to study the performance of the proposed intelligibility predictor across data sets with different distortion and processing types. Compared to the present study, this would require conducting larger intelligibility tests, in which these different distortion or processing types are included in the same listening test.
A Matlab implementation of the proposed algorithm is
available for non-commercial use at http://kom.aau.dk/~jje/.
ACKNOWLEDGEMENT
The authors would like to thank four anonymous reviewers
whose constructive comments helped improve the presentation
of this work. Discussions with Dr. Gaston Hilkhuysen, Dr.
Thomas Ulrich Christiansen, and Asger Heidemann Andersen
are gratefully acknowledged. Finally, the authors wish to thank
Dr. Jalal Taghia for making the Matlab code of his intelligi-
bility predictor publicly available.
REFERENCES
[1] “ANSI S3.5, American National Standard Methods for the Calculation
of the Articulation Index,” American National Standards Institute, New
York, 1969.
[2] N. R. French and J. C. Steinberg, “Factors governing the intelligibility
of speech sounds,” J. Acoust. Soc. Am., vol. 19, no. 1, pp. 90–119, 1947.
[3] American National Standards Institute, “ANSI S3.5, Methods for the
Calculation of the Speech Intelligibility Index,” American National
Standards Institute, New York, 1995.
[4] “IEC60268-16, Sound System Equipment – Part 16: Objective Rating
of Speech Intelligibility by Speech Transmission Index,” International
Electrotechnical Commission, Geneva, 2003.
[5] H. J. M. Steeneken and T. Houtgast, “A physical method for measuring
speech-transmission quality,” J. Acoust. Soc. Am., vol. 67, pp. 318–326,
1980.
[6] K. S. Rhebergen and N. J. Versfeld, “A speech intelligibility index
based approach to predict the speech reception threshold for sentences
in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am.,
vol. 117, no. 4, pp. 2181–2192, 2005.
[7] H. Å. Gustafsson and S. D. Arlinger, “Masking of speech by amplitude-
modulated noise,” J. Acoust. Soc. Am., vol. 95, no. 1, pp. 518–529, 1994.
[8] M. Cooke, “A glimpsing model of speech perception in noise,” J. Acoust.
Soc. Am., vol. 119, no. 3, pp. 1562–1573, March 2006.
[9] J. M. Kates and K. H. Arehart, “Coherence and the speech intelligibility
index,” J. Acoust. Soc. Am., vol. 117, no. 4, pp. 2224–2237, 2005.
[10] R. L. Goldsworthy and J. E. Greenberg, “Analysis of speech-based
speech transmission index methods with implications for nonlinear
operations,” J. Acoust. Soc. Am., vol. 116, no. 6, pp. 3679–3689,
December 2004.
[11] R. Drullman, “Temporal envelope and fine structure cues for speech
intelligibility,” J. Acoust. Soc. Am., vol. 97, no. 1, pp. 585–592, January
1995.
[12] V. Hohmann and B. Kollmeier, “The effect of multichannel dynamic
compression on speech intelligibility,” J. Acoust. Soc. Am., vol. 97, pp.
1191–1195, 1995.
[13] S. Jørgensen and T. Dau, “Predicting speech intelligibility based on
the signal-to-noise envelope power ratio after modulation-frequency
selective processing,” J. Acoust. Soc. Am., vol. 130, no. 3, pp. 1475–
1487, September 2011.
[14] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algo-
rithm for Intelligibility Prediction of Time-Frequency Weighted Noisy
Speech,” IEEE Trans. Audio, Speech, Language Processing, vol. 19,
no. 7, pp. 2125–2136, September 2011.
[15] S. Jørgensen, S. Cubick, and T. Dau, “Speech intelligibility evaluation
for mobile phones,” Acta Acustica United With Acustica, vol. 101, pp.
1016–1025, 2015.
[16] T. H. Falk, V. Parsa, J. F. Santos, K. Arehart, O. Hazrati, R. Huber, J. M.
Kates, and S. Scollie, “Objective Quality and Intelligibility Prediction
for Users of Assistive Listening Devices,” IEEE SP Mag., no. 32, pp.
114–124, March 2015.
[17] J. Jensen and C. H. Taal, “Speech intelligibility prediction based on
Mutual Information,” IEEE Trans. Audio, Speech, Language Processing,
vol. 22, no. 2, pp. 430–440, February 2014.
[18] R. Xia, J. Li, M. Akagi, and Y. Yan, “Evaluation of objective intel-
ligibility prediction measures for noise-reduced signals in Mandarin,”
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2012, pp.
4465–4469.
[19] I. Holube and B. Kollmeier, “Speech intelligibility predictions in
hearing-impaired listeners based on a psychoacoustically motivated
perception model,” J. Acoust. Soc. Am., vol. 100, no. 3, pp. 1703–1716,
1996.
[20] J. M. Kates and K. H. Arehart, “The Hearing-Aid Speech Perception
Index (HASPI),” Speech Commun., vol. 65, pp. 75–93, Nov.-Dec. 2014.
[21] S. Jørgensen, R. Decorsière, and T. Dau, “Effects of manipulating the
signal-to-noise envelope power ratio on speech intelligibility,” J. Acoust.
Soc. Am., vol. 137, no. 3, pp. 1401–1410, March 2015.
[22] H. J. Steeneken and T. Houtgast, “Mutual dependence of the octave-band
weights in predicting speech intelligibility,” Speech communication,
vol. 28, no. 2, pp. 109–123, 1999.
[23] R. W. Peters, B. C. Moore, and T. Baer, “Speech reception thresholds in
noise with and without spectral and temporal dips for hearing-impaired
and normally hearing people,” J. Acoust. Soc. Am., vol. 103, no. 1, pp.
577–587, 1998.
[24] T. M. Elliott and F. E. Theunissen, “The Modulation Transfer Function
for Speech Intelligibility,” PLoS Computational Biology, vol. 5, no. 3,
pp. 1–14, March 2009.
[25] DARPA, “TIMIT, Acoustic-Phonetic Continuous Speech Corpus,” October 1990, NIST Speech Disc 1-1.1.
[26] R. Drullman, J. M. Festen, and R. Plomp, “Effect of temporal envelope
smearing on speech reception,” J. Acoust. Soc. Am., vol. 95, pp. 1053–1064, February 1994.
[27] ——, “Effect of reducing slow temporal modulation on speech recep-
tion,” J. Acoust. Soc. Am., vol. 95, pp. 2670–2680, February 1994.
[28] J. M. Kates and K. H. Arehart, “Comparing the information conveyed
by envelope modulation for speech intelligibility, speech quality, and
music quality,” J. Acoust. Soc. Am., no. 4, pp. 2470–2482, 2015.
[29] J. Thiemann, N. Ito, and E. Vincent, “DEMAND: Diverse environments
multichannel acoustic noise database,” http://parole.loria.fr/DEMAND/.
[30] W. A. Dreschler, H. Verschuure, C. Ludvigsen, and S. Westermann,
“ICRA noises: Artificial Noise Signals with Speech-like Spectral and
Temporal Properties for Hearing Instrument Assessment,” Audiology,
vol. 40, no. 3, pp. 148–157, 2001.
[31] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech
recognition: II. NOISEX-92: A database and an experiment to study
the effect of additive noise on speech recognition systems,” Speech
Commun., vol. 12, no. 3, pp. 247–251, 1993.
[32] K. Wagener, J. L. Josvassen, and R. Ardenkjær, “Design, optimization
and evaluation of a Danish sentence test in noise,” Int. J. Audiol., vol. 42,
no. 1, pp. 10–17, 2003.
[33] J. Koopman, R. Houben, W. A. Dreschler, and J. Verschuure, “Devel-
opment of a speech in noise test (matrix),” in 8th EFAS Congress, 10th
DGA Congress, Heidelberg, Germany, June 2007.
[34] B. Hagerman, “Sentences for testing speech intelligibility in noise,”
Scand. Audiol., vol. 11, pp. 79–87, 1982.
[35] U. Kjems, J. B. Boldt, M. S. Pedersen, T. Lunner, and D. L. Wang, “Role
of mask pattern in intelligibility of ideal binary-masked noisy speech,”
J. Acoust. Soc. Am., vol. 126, no. 3, pp. 1415–1426, September 2009.
[36] D. Brungart, P. S. Chang, B. D. Simpson, and D. L. Wang, “Isolating
the energetic component of speech-on-speech masking with ideal time-
frequency segregation,” J. Acoust. Soc. Am., vol. 120, no. 6, pp. 4007–
4018, December 2006.
[37] J. Jensen and R. Hendriks, “Spectral Magnitude Minimum Mean-Square
Error Estimation Using Binary and Continuous Gain Functions,” IEEE
Trans. Audio, Speech, Language Processing, vol. 20, no. 1, pp. 92–102,
January 2012.
[38] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, and U. Kjems,
“An Evaluation of Objective Quality Measures for Speech Intelligibility
Prediction,” in Proc. Interspeech. Brighton, UK: ISCA, September 6-10
2009, pp. 1947–1950.
[39] J. Ma, Y. Hu, and P. C. Loizou, “Objective measures for predicting
speech intelligibility in noisy conditions based on new band-importance
functions,” J. Acoust. Soc. Am., vol. 125, no. 5, pp. 3387–3405, 2009.
[40] R. C. Bilger et al., “Standardization of a test of speech perception in
noise,” J. Speech Hear. Res., vol. 27, pp. 32–48, 1984.
[41] E. J. Williams, “The comparison of regression variables,” J. Royal Stat.
Society, Ser. B, vol. 21, no. 2, pp. 396–399, 1959.
[42] J. H. Steiger, “Tests for Comparing Elements of a Correlation Matrix,”
Psychological Bulletin, vol. 87, no. 2, pp. 245–251, 1980.
[43] R. R. Wilcox and T. Tian, “Comparing Dependent Correlations,” The
Journal of General Psychology, vol. 135, no. 1, pp. 105–112, 2008.
[44] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John
Wiley and Sons, Inc., 2001.
[45] J. B. Boldt and D. P. W. Ellis, “A simple correlation-based model of
intelligibility for nonlinear speech enhancement and separation,” in Proc.
17th European Signal Processing Conference (EUSIPCO), 2009, pp.
1849–1853.
[46] J. Taghia and R. Martin, “Objective intelligibility measures based
on mutual information for speech subjected to speech enhancement
processing,” IEEE Trans. Audio, Speech, Language Processing, vol. 22,
no. 1, pp. 6–16, January 2014.
Jesper Jensen received the M.Sc.
degree in electrical engineering and the Ph.D. de-
gree in signal processing from Aalborg University,
Aalborg, Denmark, in 1996 and 2000, respectively.
From 1996 to 2000, he was with the Center for
Person Kommunikation (CPK), Aalborg University,
as a Ph.D. student and Assistant Research Profes-
sor. From 2000 to 2007, he was a Post-Doctoral
Researcher and Assistant Professor with Delft Uni-
versity of Technology, Delft, The Netherlands, and
an External Associate Professor with Aalborg Uni-
versity. Currently, he is a Senior Researcher with Oticon A/S, Copenhagen,
Denmark, where his main responsibility is scouting and development of new
signal processing concepts for hearing aid applications. He is also a Professor
with the Section for Information Processing (SIP), Department of Electronic
Systems, at Aalborg University. His main interests are in the area of acoustic
signal processing, including signal retrieval from noisy observations, coding,
speech and audio modification and synthesis, intelligibility enhancement of
speech signals, signal processing for hearing aid applications, and perceptual
aspects of signal processing.
Cornelis (Cees) H. Taal received the
B.S. and M.A. degrees in arts and technology from
the Utrecht School of Arts, Utrecht, The Nether-
lands, in 2004 and his M.Sc. and Ph.D. degree in
computer science from the Delft University of Tech-
nology, Delft, The Netherlands, in 2007 and 2012,
respectively. From 2012 to 2013 he held Postdoc
positions at the Sound and Image Processing Lab,
Royal Institute of Technology (KTH), Stockholm,
Sweden and the Leiden University Medical Center,
Leiden, the Netherlands. From 2013, he worked in industry at Philips Research, Eindhoven, the Netherlands, as a research scientist in the field of biomedical signal processing. Currently, he is at Quby, Amsterdam, the Netherlands, performing R&D as a DSP expert, applying signal processing to smart-home applications, e.g., thermostats and power monitoring.