Proceedings of Meetings on Acoustics
Volume 19, 2013 http://acousticalsociety.org/
ICA 2013 Montreal
Montreal, Canada
2 - 7 June 2013
Psychological and Physiological Acoustics
Session 4pPP: Computational Modeling of Sensorineural Hearing Loss: Models and
Applications
4pPP9. An auditory model for intelligibility and quality predictions
James Kates*
*Corresponding author's address: Speech Language and Hearing Sciences, University of Colorado Boulder, 409 UCB, Boulder, CO
80309, James.Kates@colorado.edu
The perceptual effects of audio processing in devices such as hearing aids can be predicted by comparing auditory model outputs for the
processed signal to the model outputs for a clean reference signal. This paper presents an improved auditory model that can be used for both
intelligibility and quality predictions. The model starts with a middle-ear filter, followed by a gammatone auditory filter bank. Two-tone
suppression is provided by setting the bandwidth of the control filters wider than that of the associated analysis filters. The analysis filter
bandwidths are increased in response to increasing signal intensity, and compensation is provided for the variation in group delay across the
auditory filter bank. Temporal alignment is also built into the model to facilitate the comparison of the unprocessed reference with the hearing-
aid processed signals. The amplitude of the analysis filter outputs is modified by outer hair-cell dynamic-range compression and inner-hair cell
firing-rate adaptation. Hearing loss is incorporated into the model as a shift in auditory threshold, an increase in the analysis filter bandwidths,
and a reduction in the dynamic-range compression ratio. The model outputs include both the signal envelope and scaled basilar-membrane
vibration in each auditory filter band.
Published by the Acoustical Society of America through the American Institute of Physics
© 2013 Acoustical Society of America [DOI: 10.1121/1.4799223]
Received 15 Jan 2013; published 2 Jun 2013
Proceedings of Meetings on Acoustics, Vol. 19, 050184 (2013)
INTRODUCTION
Auditory models form the basis of many procedures for predicting speech intelligibility and quality. The underlying
assumption is that intelligibility and quality predictions become more accurate when the metric embeds a
representation of peripheral auditory processing. The objective of the model in these applications is not to
reproduce every aspect of auditory signal processing, but rather to reproduce important aspects of peripheral signal
processing while maintaining a reasonable degree of computational efficiency. If the application involves hearing
aids or impaired hearing, then peripheral hearing loss must also be an integral part of the model.
The simplest auditory model is a filter bank representing the frequency analysis of the human ear. Additional
complexity can be added depending on the purpose of the model. The speech intelligibility index (SII), for example,
incorporates an auditory filter bank, to which corrections are applied to account for frequency-domain masking,
signal intensity, and shifts in the auditory threshold (French and Steinberg, 1947; ANSI S3.5, 1997).
An alternative to applying corrections to a filter-bank model is to base the model more directly on auditory
physiology. Models based all or in part on physiology have been used for predicting intelligibility (Holube and
Kollmeier, 1996; Elhilali et al., 2003; Zilany and Bruce, 2007; Christiansen et al., 2010; Taal et al., 2011) and for
quality (Tan et al., 2004; Huber and Kollmeier, 2006; Tan and Moore, 2008; Kates and Arehart, 2010). Of the
quality models that incorporate hearing loss and satisfy the requirement of computational efficiency, the Kates and
Arehart (2010) model appears to be the most accurate.
The model presented in this paper is an extension of the Kates and Arehart (2010) auditory model. That model
has been shown to give outputs that can be used to produce accurate predictions of speech quality for normal-
hearing and hearing-impaired listeners under a wide variety of noise, nonlinear distortion, and linear filtering
conditions. The improvements in the new model include ensuring that the filter characteristics are independent of the
signal sampling rate, increasing the model bandwidth to better analyze music signals, adjusting the auditory filter
bandwidth in response to the signal intensity, a more accurate representation of cochlear dynamic-range
compression, the inclusion of inner hair cell firing-rate adaptation, and compensation for the group delay of the
auditory filter bank.
AUDITORY MODEL
Model Overview
The model inputs are the unprocessed reference and processed signals that are to be compared. The processing
can include linear filtering, nonlinear signal manipulations, nonlinear distortion, and background noise. The model
outputs are the envelope and basilar membrane vibration of the reference and processed signals in auditory
frequency bands. The overall model block diagram is presented in Fig 1. The model operates at 24 kHz, so the first
processing step is the sample-rate conversion of the signals. The comparison of the processed and reference signals
generally requires that they be temporally aligned, so part of the model is the temporal alignment of the signals. The
first alignment step is a broadband signal alignment. Each signal then goes through the middle ear and cochlear
mechanics models, after which the delay of the processed signal in each frequency band is adjusted to maximize the
cross-correlation with the reference signal in that band. The separate signals then go through the inner hair-cell
(IHC) model, followed by compensation for the group delays of the auditory filters. In the final processing step the
auditory model outputs are converted into signal features for comparing the processed signal with the reference
signal. This step is part of the performance index rather than an inherent aspect of the auditory model and is not
described in this paper.
The processing for one signal is shown in the block diagram of Fig 2. The auditory model starts with sample rate
conversion to 24 kHz, followed by the middle ear filter. The next stage is a linear auditory filter bank, with the filter
bandwidths adjusted to reflect the input signal intensity and the effects of hearing loss due to outer hair-cell (OHC)
damage. Dynamic-range compression is then provided, with the compression controlled by a separate control filter
bank. The amount of compression is a function of the amount of OHC damage. Hearing loss due to IHC damage is
represented as a subsequent attenuation stage, and IHC firing-rate adaptation is also included in the model. The
envelope output in each frequency band comprises the compressed envelope signal after conversion to dB above
auditory threshold. The basilar membrane vibration signal in each frequency band is compressed using the same
control function as for the envelope in that band, so the envelope of the vibration tracks the envelope output. The
auditory threshold for the vibration signal is represented as a low-level additive white noise.
FIGURE 1. Block diagram showing the reference and
processed signal comparison.
FIGURE 2. Block diagram showing the auditory model for
one signal.
Sample Rate Conversion
The middle-ear filter and the filters in the auditory filter bank are all infinite impulse response (IIR) designs. The
magnitude and phase response of an IIR filter depend on the sampling rate, so filters having the same design
specifications will differ if the sampling rates differ. Therefore, to ensure identical filter behavior for all input
signals, the signals are resampled at 24 kHz. This sampling rate was chosen to minimize computational requirements
while still providing adequate bandwidth for the highest frequency band (8 kHz) used for the auditory analysis.
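As a minimal sketch of this stage in Python (the method of rate conversion is not specified in the paper, so the polyphase resampler used here is an assumption):

```python
# Resample an input signal to the 24-kHz model rate. The use of
# scipy.signal.resample_poly is an assumption; the paper specifies only
# the target rate, not the resampling method.
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

FS_MODEL = 24000  # model sampling rate in Hz


def to_model_rate(x: np.ndarray, fs_in: int) -> np.ndarray:
    """Resample a signal to the rate used by all IIR filter stages."""
    if fs_in == FS_MODEL:
        return x
    ratio = Fraction(FS_MODEL, fs_in)  # e.g., 80/147 for 44.1-kHz input
    return resample_poly(x, ratio.numerator, ratio.denominator)
```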
Middle Ear
The primary purpose of the middle ear model is to reproduce the low-frequency and high-frequency attenuation
observed in the equal-loudness contours at low signal levels (Suzuki and Takeshima, 2004). The low-frequency
attenuation is represented by a 2-pole IIR high-pass filter having a cutoff frequency of 350 Hz. The high-frequency
attenuation is represented by a 1-pole IIR low-pass filter having a cutoff frequency of 5 kHz (Kates, 1991).
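A minimal sketch of the middle-ear stage, assuming Butterworth responses (the paper specifies only the filter orders and cutoff frequencies):

```python
# Middle-ear model: a 2-pole high-pass at 350 Hz cascaded with a 1-pole
# low-pass at 5 kHz. Butterworth prototypes are an assumption.
import numpy as np
from scipy.signal import butter, lfilter

FS_MODEL = 24000


def middle_ear(x: np.ndarray, fs: int = FS_MODEL) -> np.ndarray:
    b_hp, a_hp = butter(2, 350.0 / (fs / 2), btype="high")   # low-freq attenuation
    b_lp, a_lp = butter(1, 5000.0 / (fs / 2), btype="low")   # high-freq attenuation
    return lfilter(b_lp, a_lp, lfilter(b_hp, a_hp, x))
```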
Analysis Filter Bank
The parallel filter bank used for the auditory analysis consists of fourth-order gammatone filters (Patterson et al.,
1995). The digital gammatone filter bank is implemented using the base-band impulse-invariant method (Cooke,
1991; Immerseel and Peeters, 2003). A total of 32 filters are used to cover the frequency range of 80 to 8000 Hz.
This frequency range is greater than the 150 to 8000 Hz used in the SII standard (ANSI S3.5, 1997), and was chosen
to accommodate music as well as speech signals. A linear filter bank is used for computational efficiency. The linear
filter bank leaves out the dynamic interaction between the instantaneous signal level and the cochlear filter shape
(Zhang et al., 2001), but the 24-kHz sampling rate gives a significant computational savings over the 500-kHz
sampling rate required for the adaptive-filter model (Zhang et al., 2001).
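A sketch of one band of the base-band impulse-invariant gammatone implementation: the band is demodulated to DC, filtered by four one-pole low-pass sections, and remodulated. The 1.019 x ERB bandwidth scaling and the ERB formula are standard published values whose use here is an assumption; the paper cites Moore and Glasberg (1983) for its bandwidths.

```python
# One 4th-order gammatone band via complex demodulation (base-band
# impulse invariance). Returns the basilar-membrane signal and its envelope.
import numpy as np
from scipy.signal import lfilter


def erb_hz(fc: float) -> float:
    """Equivalent rectangular bandwidth of the normal auditory filter at fc."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_band(x: np.ndarray, fc: float, bw_ratio: float = 1.0,
                   fs: int = 24000):
    """bw_ratio > 1 broadens the band for OHC damage or high signal level."""
    b = 1.019 * bw_ratio * erb_hz(fc)         # gammatone bandwidth parameter
    a = np.exp(-2.0 * np.pi * b / fs)         # pole radius of each one-pole stage
    n = np.arange(len(x))
    carrier = np.exp(-2j * np.pi * fc * n / fs)
    z = x * carrier                           # shift the band center down to DC
    for _ in range(4):                        # 4th order = four one-pole sections
        z = lfilter([1.0 - a], [1.0, -a], z)  # unity gain at DC, i.e., at fc
    bm = 2.0 * np.real(z * np.conj(carrier))  # band-pass signal, 0-dB gain at fc
    env = 2.0 * np.abs(z)                     # envelope of the band signal
    return bm, env
```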
Outer Hair-Cell Damage
Hearing loss can be caused by both damage to the outer hair cells that control the cochlear filters and by damage
to the inner hair cells that perform the mechanical-to-neural transduction (Liberman and Dodds, 1984). OHC
damage is modeled as a reduction in the quality factor (Q) of the cochlear filters, resulting in increased filter
bandwidth and reduced gain (Kates, 1991). IHC damage is modeled as a reduction in the sensitivity of the neural
transduction mechanism. For moderate hearing losses, approximately 80 percent of the total loss given by the
audiogram can be ascribed to OHC damage (Moore et al., 1999), with the remainder ascribed to IHC damage.
Hearing loss is incorporated into the gammatone filter bank as an increase in filter bandwidth. The data of Moore
et al. (1999) indicate that over approximately the first 50 dB of the total hearing loss there is a strong correlation of
the auditory filter bandwidth with loss; the filter bandwidth increases by about a factor of two over this range. As the
total loss increases beyond 50 dB, however, the filter bandwidth increases rapidly. As an approximation to this
behavior, the filter bandwidth relative to that of a normal ear is given by BW = 1 + (attn/50) + 2(attn/50)^6, where
attn is the hearing loss in dB ascribed to the OHC damage, with a maximum attenuation of 50 dB.
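A direct transcription of this bandwidth rule:

```python
# Bandwidth of the impaired filter relative to normal, from the formula
# above; attn is the OHC portion of the loss, capped at 50 dB.
def ohc_bandwidth_ratio(attn_ohc_db: float) -> float:
    a = min(attn_ohc_db, 50.0) / 50.0
    return 1.0 + a + 2.0 * a ** 6             # equals 4.0 at the 50-dB maximum
```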
The responses of the gammatone filters are normalized so that the gain for a signal at the filter center frequency
is 0 dB. The reduction of the filter gain caused by the OHC loss is implemented in the compression stage that
follows the filter bank. The filter transfer functions for normal hearing are plotted in Fig 3 for the filter center
frequency range of 80 Hz to 8 kHz. The filter shapes for the maximum OHC loss allowed in the model as a function
of frequency are plotted in Fig 4. The OHC damage causes a broadening of the filter around its center frequency as
well as an increase in the filter response at low frequencies (Liberman and Dodds, 1984).
FIGURE 3. Gammatone filter magnitude transfer functions
for normal hearing.
FIGURE 4. Gammatone filter magnitude transfer functions
for maximum hearing loss and the control filters.
Signal Intensity
The shape of the auditory filters depends on the intensity of the input signal as well as on the degree of hearing
loss. Measurements of the basilar-membrane vibration in animals (Rhode, 1971; Ruggero et al., 1997) show that the
auditory filters become broader as the signal level increases. Behavioral measurements of auditory filter nonlinearity
have also been made in humans (Glasberg and Moore, 2000; Baker and Rosen, 2002; Baker and Rosen, 2006). In
contrast to the animal studies, the human data tend to show nearly constant filter bandwidths for intensities below 50
dB SPL for both normal-hearing (Baker and Rosen, 2006) and hearing-impaired (Baker and Rosen, 2002) listeners.
The bandwidths increase with increasing intensity above 50 dB SPL, and a linear function is used in this paper to
approximate the increase in bandwidth with intensity.
An example of the linear approximation is shown schematically in Fig 5. For normal hearing, the filter
bandwidth is set to the ERB (Moore and Glasberg, 1983) for intensities below 50 dB SPL. For impaired hearing, the
bandwidth at and below 50 dB SPL is set to the bandwidth computed for the amount of OHC damage related to the
hearing loss. For both normal and impaired hearing, the bandwidth is set to the bandwidth corresponding to
maximum OHC damage for intensities at or above 100 dB SPL. Linear interpolation is used for intensities between
50 and 100 dB SPL. In the limiting case of a hearing loss giving the maximum amount of OHC damage, the
bandwidth stays at the maximum value at all signal levels. The filter bandwidth is determined by the signal intensity
in the control filter bank outputs and remains constant throughout the auditory analysis filtering operation.
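A sketch of this level-dependent rule; the value BW_MAX = 4.0 follows from the bandwidth formula above at the 50-dB maximum OHC attenuation:

```python
# Level-dependent bandwidth ratio: constant at or below 50 dB SPL, the
# maximum-damage bandwidth at or above 100 dB SPL, linear in between.
BW_MAX = 4.0  # bandwidth ratio for maximum OHC damage


def level_dependent_bw(bw_loss: float, control_db_spl: float) -> float:
    """bw_loss is the quiescent ratio set by the hearing loss (>= 1)."""
    if control_db_spl <= 50.0:
        return bw_loss
    if control_db_spl >= 100.0:
        return BW_MAX
    frac = (control_db_spl - 50.0) / 50.0     # linear interpolation region
    return bw_loss + frac * (BW_MAX - bw_loss)
```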
Control Filter Bank
Cochlear mechanics provides nearly instantaneous dynamic-range compression. In the cochlea the gain changes
are combined with dynamic changes in the filter bandwidth, e.g. the filter Q is dynamically varied in response to the
signal level (Zhang et al., 2001). In the simplified cochlear model used here, the compression is a separate stage that
follows the linear filter bank. The compression gain is computed using the envelope in each band of the control filter
bank, and the compression gain multiplies the signal in each auditory analysis filter band.
The control filter bank, like the analysis filter bank, uses fourth-order gammatone filters. The filter bandwidths
for the control filters are set to correspond to the maximum bandwidth allowed in the model, as shown in Fig 4. The
control filter bandwidths thus match the auditory analysis bandwidths for the maximum hearing loss, and are wider
than the auditory analysis filters for reduced hearing loss and normal hearing. The wide control filters provide two-
tone suppression in the auditory model (Zhang et al., 2001; Heinz et al., 2001; Bruce et al., 2003). The center
frequency of each control filter is shifted higher in frequency relative to the corresponding auditory analysis filter.
The frequency shift corresponds to a fractional basal shift of 0.02 of the length of the cochlear partition using a
human frequency-position function (Greenwood, 1990). This frequency shift is less than the basal shift used by
Zhang et al. (2001) to model the cat cochlea; however, the shift produces two-tone suppression results that are
consistent with the human psychophysical measurements recorded by Duifhuis (1980) for a probe at 50 dB SPL.
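A sketch of the control-filter frequency shift, using Greenwood's published human map constants; their use here is an inference, since the paper states only the 0.02 basal shift:

```python
# Shift an analysis center frequency 0.02 of the cochlear length toward
# the base using Greenwood's (1990) human frequency-position function
# f = 165.4*(10**(2.1*x) - 0.88), x = fractional distance from the apex.
import math


def control_center_freq(fc_analysis: float, basal_shift: float = 0.02) -> float:
    x = math.log10(fc_analysis / 165.4 + 0.88) / 2.1   # place of the band
    return 165.4 * (10.0 ** (2.1 * (x + basal_shift)) - 0.88)
```

For example, the 2080-Hz analysis band used later in Fig 7 maps to a control filter centered near 2.3 kHz under these constants.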
FIGURE 5. Example of the adjustment of the auditory filter
bandwidth with increasing signal intensity.
FIGURE 6. Input/output relationship showing the
dynamic-range compression due to OHC function.
Dynamic-Range Compression
The control signal envelope is the input to the compression rule. The compression gain is then passed through an
800-Hz low-pass filter to approximate the compression time delay observed in the cochlea (Zhang et al., 2001). The
compression rule for normal hearing is modeled by three line segments as shown by the bold lines in Fig 6. Inputs
within 30 dB of normal auditory threshold (0 dB SPL) receive linear gain. Inputs between 30 and 100 dB SPL are
compressed. The system reverts to linear gain for inputs above 100 dB SPL. The compression ratio in the model for
normal hearing increases linearly with ERB number from a compression ratio of 1.25:1 at 80 Hz to a compression
ratio of 3.5:1 at 8 kHz. This compression behavior is consistent with physiological measurements of compression in
the cochlea (Cooper and Rhode, 1997; Lopez-Poveda and Alves-Pinto, 2008) and with psychophysical estimates of
compression in the human ear (Hicks and Bacon, 1999; Plack and Oxenham, 2000).
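A sketch of the band-dependent compression ratio, assuming the standard ERB-number (Cam) scale for the linear interpolation:

```python
# Compression ratio per band: linear in ERB number between 1.25:1 at
# 80 Hz and 3.5:1 at 8 kHz. The ERB-number formula is the standard
# Glasberg-Moore expression; its use here is an assumption.
import math


def erb_number(f_hz: float) -> float:
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def band_compression_ratio(fc_hz: float) -> float:
    e_lo, e_hi = erb_number(80.0), erb_number(8000.0)
    frac = (erb_number(fc_hz) - e_lo) / (e_hi - e_lo)
    return 1.25 + frac * (3.5 - 1.25)
```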
OHC damage shifts the auditory threshold and reduces the compression ratio, as shown by the thin lines in Fig 6.
The dependence of the compression behavior on OHC damage reproduces the changes in the auditory-nerve firing
rate measured in damaged cochleas (Heinz et al., 2005; Neely et al., 2009) and the loudness recruitment found in
hearing-impaired listeners (Kiessling, 1993). The shifted curves are constructed so that an input of 100 dB SPL in a
given frequency band always produces the same output level independent of the amount of OHC damage. The
maximum gain reduction D shown in the figure due to OHC damage is a function of the normal-hearing
compression ratio in the frequency band. The maximum gain reduction is 14 dB for the compression ratio of 1.25:1
at 80 Hz, and increases to 50 dB for the compression ratio of 3.5:1 at 8 kHz.
In each frequency band, the OHC threshold is set to 1.25D. If the total hearing loss given by the audiogram is greater than this
threshold, the OHC loss is set to D and the remaining loss is ascribed to IHC damage. For this condition the
compression system is reduced to linear amplification as shown by the line having the x-axis intercept at D in Fig 6.
If the total hearing loss is less than the threshold, 80 percent of the loss is ascribed to OHC damage and 20 percent to
IHC damage. This condition results in a reduction in the OHC gain of less than D combined with a compression
ratio partway between 1:1 and maximum compression. This behavior is shown in Fig 6 for the line having the x-axis
intercept at d. The lower compression kneepoint is always set to 30 dB above the auditory threshold, which is a
change from the Kates and Arehart (2010) model.
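The quoted maxima pin down the gain structure: 14 dB at a 1.25:1 ratio and 50 dB at 3.5:1 both equal 70(1 - 1/CR), the compressive gain accumulated between the 30- and 100-dB kneepoints, so D can be computed from the band compression ratio. The sketch below builds on that inference (it is not an equation stated in the paper) and assumes a linear reduction of the compression ratio with OHC attenuation; conveniently, that interpolation makes the output for a 100-dB SPL input exactly invariant, as the construction above requires. The 800-Hz low-pass smoothing of the gain is omitted here.

```python
# Hedged sketch of the Fig 6 compression rule and the OHC/IHC loss split.
def max_gain_reduction_db(cr: float) -> float:
    """Maximum OHC gain reduction D for a band with normal ratio cr > 1."""
    return 70.0 * (1.0 - 1.0 / cr)            # 14 dB at 1.25:1, 50 dB at 3.5:1


def split_hearing_loss(total_loss_db: float, cr: float):
    """Allocate the audiometric loss in a band to (OHC, IHC) damage in dB."""
    d = max_gain_reduction_db(cr)
    if total_loss_db > 1.25 * d:
        attn_ohc = d                          # OHC contribution saturates at D
    else:
        attn_ohc = 0.8 * total_loss_db        # 80/20 split below saturation
    return attn_ohc, total_loss_db - attn_ohc


def compression_gain_db(control_db_spl: float, cr: float,
                        attn_ohc_db: float = 0.0) -> float:
    """Gain (dB) applied to the analysis band, from the control-band level."""
    d = max_gain_reduction_db(cr)
    cr_eff = 1.0 + (cr - 1.0) * (1.0 - attn_ohc_db / d)  # 1:1 at full damage
    knee = 30.0 + attn_ohc_db                 # knee 30 dB above shifted threshold
    gain_knee = d - attn_ohc_db               # linear gain below the kneepoint
    level = min(max(control_db_spl, knee), 100.0)  # linear again above 100 dB
    return gain_knee - (level - knee) * (1.0 - 1.0 / cr_eff)
```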
Two-Tone Suppression
In two-tone suppression, the presence of a signal outside the bandwidth of the analysis filter reduces the response
to a probe signal near the center frequency of the analysis filter (Sachs and Kiang, 1968; Duifhuis, 1980; Delgutte,
1990). Two-tone suppression is greatest in the normal ear, and is substantially reduced or eliminated in the impaired
ear (Schmiedt, 1984). The compression input-output function, when combined with the control filter bank, produces
two-tone suppression in the cochlear model. The control filter is wider than the corresponding analysis filter, which
allows the presence of a signal outside the bandwidth of the analysis filter but still within the bandwidth of the
control filter to reduce the gain for a signal within the analysis filter passband.
Two-tone suppression in the cochlear model is illustrated in Fig 7 for normal hearing. The probe in this example
is a 50-dB SPL sinusoid at a frequency of 2080 Hz, which is the center frequency for band 20 of the 32 analysis
bands. Suppressor tones outside the set of contours reduce the compressed signal output by less than 1 dB compared
to the output for the probe alone. A tone at 90 dB SPL at a frequency of 900 Hz, on the other hand, will reduce the
output within the probe frequency band by an additional 9 dB. Suppression is reduced in impaired hearing because
the analysis filter bandwidth is increased and the compression ratio is reduced with increasing hearing loss. In the
limit of maximum OHC damage, the analysis and control filters have equal bandwidths and the auditory
compression becomes a linear system, thus completely eliminating the two-tone suppression in the model.
Temporal Alignment
The model contains three stages of temporal alignment of the processed signal with the reference signal as
shown in Fig 1. The first stage occurs at the input to the model; in this stage the signals are approximately aligned
and the signal durations matched. The alignment is based on the maximum of the broadband signal cross-
correlation. A second, band-by-band alignment occurs after the gammatone filter bank frequency analysis and the
dynamic-range compression. The envelope and basilar-membrane vibration outputs for the processed signal are
separately matched to the reference signal. The match is based on the maximum of the cross-correlation of the
signals. As a result of this temporal alignment, the processed signal has the same delay as a function of frequency as
the reference signal, and the group delay associated with the hearing-aid or other audio processing is removed.
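A sketch of the band alignment step; the maximum search lag (here 100 ms at 24 kHz) is an assumption, and equal-length inputs are assumed:

```python
# Delay or advance the processed band so its cross-correlation with the
# reference band peaks at zero lag.
import numpy as np


def align_band(ref: np.ndarray, proc: np.ndarray, max_lag: int = 2400):
    lags = list(range(-max_lag, max_lag + 1))
    xcorr = [np.dot(ref[max(0, -k):len(ref) - max(0, k)],
                    proc[max(0, k):len(proc) - max(0, -k)]) for k in lags]
    shift = lags[int(np.argmax(xcorr))]       # proc[n + shift] best matches ref[n]
    if shift >= 0:                            # processed band lags: advance it
        return np.concatenate((proc[shift:], np.zeros(shift))), shift
    return np.concatenate((np.zeros(-shift), proc[:shift])), shift
```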
The group delay of the gammatone filters is a good match to the latency measured in human ears (Don et al.,
1998). However, there is evidence that compensation for the frequency-dependent group delay occurs higher in the
auditory pathway (Uppenkamp et al., 2001; Wojtczak et al., 2012). Delay compensation has therefore been provided
for the auditory model output. The group delay for the reference signal at the center frequency of each band is
computed, and delay is added to each band of the reference and processed signals so that the total group delay in
each band matches that of the lowest-frequency band.
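A sketch of this equalization, reusing erb_hz() from the filter-bank sketch above; estimating the group delay at the center frequency from the gammatone envelope peak time, (order - 1)/(2*pi*b), is an assumption:

```python
# Pad every band so its total delay matches the most-delayed (lowest
# center frequency) band.
import numpy as np


def equalize_group_delay(bands, center_freqs, bw_ratios, fs=24000):
    delays = [round(fs * 3.0 / (2.0 * np.pi * 1.019 * r * erb_hz(f)))
              for f, r in zip(center_freqs, bw_ratios)]
    d_ref = max(delays)                       # the lowest band is the slowest
    out = []
    for y, d in zip(bands, delays):
        pad = d_ref - d                       # delay added to this band
        out.append(np.concatenate((np.zeros(pad), y[:len(y) - pad])))
    return out
```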
FIGURE 7. Two-tone suppression contours for a 50-dB
SPL tone at 2080 Hz. The contour line separation is 1 dB.
FIGURE 8. Rapid and short-time IHC adaptation for an
increase of 20 dB in the signal level.
dB Conversion
The envelope signal, after dynamic-range compression, is converted to dB above auditory threshold. Normal
threshold is used since attenuation due to OHC damage has already been applied to the signals. The hearing loss due
to IHC damage is applied as an additional attenuation after the dB SL conversion. The basilar membrane vibration
signal is multiplied by the same gain factor as computed for the envelope dB conversion so that the vibration signal
amplitude tracks the dB envelope. The compressed average outputs in dB SL correspond to firing rates in the
auditory nerve (Sachs and Abbas, 1974; Yates et al., 1990) averaged over the population of inner hair-cell synapses.
Inner Hair-Cell Synapse
The IHC synapse provides the rapid and short-term adaptation observed in the neural firing rate (Harris and
Dallos, 1979; Gorga and Abbas, 1981). The synapse is a simplified two-reservoir model based on the models of
Westerman and Smith (1987) and Kates (1991). The rapid adaptation time constant is 2 ms and the short-term time
constant is 60 ms. The adaptation emphasizes sudden changes in the signal level, such as occurs at the onset of a
stop consonant. The model is adjusted so that an instantaneous jump of 20 dB in the input signal level produces a
peak output 20 dB above the steady-state response. The adaptation is computed for the envelope signal in dB SL,
and then the basilar-membrane vibration signal is multiplied by the same gain-versus-time function. Hearing loss
due to IHC damage is implemented as an attenuation of the signal at the input to the synapse model. The differential
equations that describe the analog circuit were transformed into the digital domain using first-order backwards
differences in a state-space representation at the 24-kHz sampling rate.
An example of the synapse response is shown in Fig 8. The compressed signal envelope in the frequency band is
initially at 40 dB SL. The level jumps to 60 dB SL at 100 ms, and returns to 40 dB SL at 600 ms. Raised-cosine
5-ms windows are applied to both level changes, which reduces the overshoot in comparison with instantaneous
transitions. The peak of the transient response at 100 ms is 71.4 dB, and the minimum at 600 ms is 28.6 dB SL.
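A simplified stand-in for the synapse stage, applied to the dB SL envelope. This is not the two-reservoir circuit of Westerman and Smith (1987); it is a minimal sketch that preserves the stated time constants and the calibration that a 20-dB step overshoots the steady-state response by 20 dB (weights summing to 1), while leaving the steady-state level unchanged:

```python
# Rapid (2 ms) and short-term (60 ms) firing-rate adaptation applied to
# the envelope in dB SL. Leaky integrators track the level; the output
# emphasizes changes relative to both tracked levels.
import numpy as np


def ihc_adaptation(env_db_sl: np.ndarray, fs: int = 24000) -> np.ndarray:
    a1 = np.exp(-1.0 / (0.002 * fs))          # rapid time constant, 2 ms
    a2 = np.exp(-1.0 / (0.060 * fs))          # short-term time constant, 60 ms
    s1 = s2 = float(env_db_sl[0])             # start fully adapted
    out = np.empty(len(env_db_sl))
    for n, x in enumerate(env_db_sl):
        s1 = a1 * s1 + (1.0 - a1) * x
        s2 = a2 * s2 + (1.0 - a2) * x
        out[n] = x + 0.5 * (x - s1) + 0.5 * (x - s2)  # weights sum to 1
    return np.maximum(out, 0.0)               # firing rate floor at 0 dB SL
```

With an instantaneous 20-dB step this sketch peaks 20 dB above the new steady state; the smaller 71.4-dB peak in Fig 8 reflects the 5-ms raised-cosine transitions used there.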
Long-Term Average
Some models of intelligibility and quality make use of the signal long-term average spectrum (French and
Steinberg, 1947; Thiede et al., 2000; Beerends et al., 2002; Moore and Tan, 2004; Kates and Arehart, 2010), as do
models of loudness (Moore and Glasberg, 2004; Chen et al., 2011). A set of long-term average signals is therefore
provided as an additional auditory model output.
The root-mean-squared (RMS) average output is computed for the envelopes in each of the auditory analysis
filters after the filter bandwidths have been adjusted for OHC damage and signal intensity. The RMS average
outputs are also computed for the control filter bank signals. The average control signal is converted to dB above
threshold, and the input/output compression rule as shown in Fig 6 is used to compute compression gains for the
averaged signals as a function of frequency. The compression gain in dB is then added to the dB levels in each of the
analysis filter bands, after which the attenuation due to IHC damage is applied to the average signals in each band.
As for the envelope described in the dB Conversion section above, the compressed average outputs in dB SL correspond to average
firing rates in the auditory nerve (Sachs and Abbas, 1974; Yates et al., 1990). The cited loudness models, on the
other hand, use the specific loudness in each frequency band. The specific loudness is proportional to the mean-
square level in each band raised to the 0.2 power (Moore and Glasberg, 2004). In the auditory model, taking the
RMS signal level raises the mean-square level to a power of 0.5, and the dynamic-range compression of 2.5:1 at mid
frequencies further reduces the power to the vicinity of 0.2. Thus specific loudness, to within a scale factor, can be
approximated from the average auditory model outputs by converting the dB SL values to linear amplitude.
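As a one-line consequence of this exponent argument (a hedged approximation, valid only to within a scale factor):

```python
# Approximate specific loudness from the averaged model output in dB SL:
# converting back to linear amplitude yields roughly (mean-square)**0.2.
def approx_specific_loudness(avg_db_sl: float) -> float:
    return 10.0 ** (avg_db_sl / 20.0)
```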
DISCUSSION AND CONCLUSIONS
The auditory model presented in this paper is designed to be the initial processing stage for intelligibility and
quality indices. Because it is intended for practical applications, computational efficiency is an important aspect of
the implementation. The goal is to efficiently approximate the salient auditory behavior rather than provide an exact
but potentially time-consuming model.
A significant computational savings, for example, is realized by using linear filters for the auditory analysis
rather than trying to duplicate the instantaneous filter gain and bandwidth changes mediated by the outer hair cells.
The auditory filter bandwidth is determined by the hearing loss and the average signal intensity in the control filters
prior to sending the signal through the analysis filters, thus allowing for efficient analysis filters having constant
bandwidth. A further justification for this approach is that intelligibility and quality predictions generally use short
speech sequences presented at conversational levels, so large variations in signal intensity would not be expected.
The IHC synapse model is also greatly simplified in comparison with the ear. The auditory model does not
produce a neural spike-train output; it merely approximates the rapid and short-term firing-rate adaptation that
shapes the neural firing patterns. The outputs are the envelope and a vibration signal in each analysis band that has
the same envelope but also contains the temporal fine structure. The vibration signal has not been rectified; it
remains zero-mean to facilitate computing the band-by-band cross-correlations between the reference and processed
signals that are used in some intelligibility (Kates and Arehart, 2005) and quality (Tan et al., 2004) indices.
The auditory model processes the clean reference and the hearing-aid output signals as a set of combined inputs.
The reason for the combined processing is the temporal alignment of the two signals, which is needed for the
subsequent intelligibility or quality index calculations. The processing delay associated with the hearing-aid output
is removed, and compensation is also provided for the group delay associated with the auditory analysis filters.
The choice of reference signal processing in the auditory model is tied to whether the model outputs will be used
for intelligibility or quality. Intelligibility is measured on an absolute scale, typically phonemes, words, or sentences
correct. A general assumption is that the hearing-impaired ear cannot produce higher recognition scores than a
normal-hearing ear. Therefore, if the auditory model output is intended for intelligibility, the reference is the clean
signal processed using a model of normal hearing while the hearing-aid output is processed through a model of the
impaired ear. Quality ratings, on the other hand, are assumed to be based on comparing the hearing-aid output to the
best signal quality that can be perceived by the hearing-impaired listener (Kates and Arehart, 2010). The reference
signal for a quality index should therefore be the clean input to the hearing aid combined with linear amplification
(e.g. Byrne and Dillon, 1986) to compensate for the hearing loss, and this amplified reference is processed through a
model of the impaired ear.
REFERENCES
ANSI S3.5-1997. American National Standard: Methods for the Calculation of the Speech Intelligibility Index, American
National Standards Institute, New York.
Baker, R.J., and Rosen S. (2002). “Auditory filter nonlinearity in mild/moderate hearing impairment,” J. Acoust. Soc. Am. 111,
1330-1339.
Baker, R.J., and Rosen S. (2006). “Auditory filter nonlinearity across frequency using simultaneous notched-noise masking,” J.
Acoust. Soc. Am. 119, 454-462.
Beerends, J. G., Hekstra, A. P., Rix, A. W., and Hollier, M. P. (2002). “Perceptual Evaluation of Speech Quality (PESQ), the new
ITU standard for end-to-end speech quality assessment, Part II: Psychoacoustic model,” J. Audio Eng. Soc. 50, 765-778.
Bruce, I.C., Sachs, M.B., and Young, E.D. (2003). “An auditory-periphery model of the effects of acoustic trauma on auditory
nerve responses,” J. Acoust. Soc. Am. 113, 369-388.
Byrne, D., and Dillon, H. (1986). “The National Acoustic Laboratories (NAL) new procedure for selecting the gain and frequency-
response of a hearing-aid,” Ear and Hear. 7, 257-265.
Chen, Z., Hu, G., Glasberg, B.R., and Moore, B.C.J. (2011). “A new method of calculating auditory excitation patterns and
loudness for steady sounds,” Hear. Res. 282, 204-215.
Christiansen, C., Pedersen, M.S., and Dau, T. (2010). “Prediction of speech intelligibility based on an auditory preprocessing
model,” Speech Comm. 52, 678-692.
Cooke, M. (1991). “Modeling auditory processing and organization,” PhD Thesis, U. Sheffield, May, 1991.
Cooper, N.P., and Rhode, W.S. (1997). “Mechanical responses to two-tone distortion products in the apical and basal turns of the
mammalian cochlea,” J. Neurophysiol. 78, 261-270.
Delgutte, B. (1990), “Two-tone rate suppression in auditory-nerve fibers: Dependence on suppressor frequency and level,” Hear.
Res. 49, 225-246.
Don, M., Ponton, C.W., Eggermont, J.J., and Kwong, B. (1998). “The effects of sensory hearing loss on cochlear filter times
estimated from auditory brainstem latencies,” J. Acoust. Soc. Am. 104, 2280-2289.
Duifhuis, H. (1980). “Level effects in psychophysical two-tone suppression,” J. Acoust. Soc. Am. 67, 914-927.
Elhilali, M., Chi, T., and Shamma, S.A. (2003). “A spectro-temporal modulation index (STMI) for assessment of speech
intelligibility,” Speech Comm. 41, 331-348.
French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90-119.
Glasberg, B.R., and Moore, B.C.J. (2000). “Frequency selectivity as a function of level and frequency measured with uniformly
excited notched noise,” J. Acoust. Soc. Am. 108, 2318-2328.
Greenwood, D.D. (1990). “A cochlear frequency-position function for several species 29 years later,” J. Acoust. Soc. Am. 87,
2592-2605.
Gorga, M.P., and Abbas, P.J. (1981). “AP measurements of short-term adaptation in normal and acoustically traumatized ears,” J.
Acoust. Soc. Am. 70, 1310-1321.
Harris, D.M., and Dallos, P. (1979). “Forward masking of auditory nerve fiber responses,” J. Neurophys. 42, 1083-1107.
Heinz, M.G., Scepanovic, D., Issa, J., Sachs, M.B., and Young, E.D. (2005). “Normal and impaired level encoding: Effects of
noise-induced hearing loss on auditory-nerve responses,” In D. Pressnitzer, A. de Cheveigné, S. McAdams, and L. Collet
(Eds), Auditory Signal Processing: Physiology, Psychoacoustics and Models. New York: Springer, 2005.
Heinz, M.G., Zhang, X., Bruce, I.C., and Carney, L.H. (2001). “Auditory nerve model for predicting performance limits of
normal and impaired listeners,” Acoust. Res. Letters Online 2, 91-96.
Hicks, M.L., and Bacon, S.P. (1999). “Psychophysical measures of auditory nonlinearities as a function of frequency in
individuals with normal hearing,” J. Acoust. Soc. Am. 105, 326-338.
Holube, I., and Kollmeier, B. (1996). “Speech intelligibility prediction in hearing-impaired listeners based on a
psychoacoustically motivated perception model,” J. Acoust. Soc. Am. 100, 1703-1716.
Huber, R., and Kollmeier, B. (2006). “PEMO-Q: A new method for objective audio quality assessment using a model of
auditory perception,” IEEE Trans. Audio, Speech, and Lang. Proc. 14, 1902-1911.
Immerseel, L.V., and Peeters, S. (2003). “Digital implementation of linear gammatone filters: Comparison of design methods,”
Acoust. Res. Letters Online 4, 59-64.
Kates, J.M. (1991). “A time domain digital cochlear model,” IEEE Trans. Sig. Proc. 39, 2573-2592.
Kates, J.M., and Arehart, K.H. (2005). “Coherence and the speech intelligibility index,” J. Acoust. Soc. Am. 117, 2224-2237.
Kates, J.M., and Arehart, K.H. (2010). “The hearing aid speech quality index (HASQI),” J. Audio Eng. Soc. 58, 363-381.
Kiessling, J. (1993). "Current approaches to hearing aid evaluation," J. Speech-Lang. Path. and Audiol. Monogr. Suppl. 1, 39-49.
Liberman, M.C., and Dodds, L.W. (1984). “Single neuron labeling and chronic cochlear pathology. III. Stereocilia damage and
alterations in threshold tuning curves,” Hearing Res. 16, 54-74.
Lopez-Poveda, E.A., and Alves-Pinto, A. (2008). “A variant temporal-masking-curve method for inferring peripheral auditory
compression,” J. Acoust. Soc. Am. 123, 1544-1554.
Moore, B.C.J., and Glasberg, B.R. (1983). “Suggested formulae for calculating auditory-filter bandwidths and excitation
patterns,” J. Acoust. Soc. Am. 74, 750-753.
Moore, B.C.J., and Glasberg, B.R. (2004). “A revised model of loudness perception applied to cochlear hearing loss,” Hear. Res.
188, 70-88.
Moore, B.C.J., and Tan, C.-T. (2004). “Development and validation of a method for predicting the perceived naturalness of
sounds subjected to spectral distortion”, J. Audio Eng. Soc. 52, 900-914.
Moore, B.C.J., Vickers, D.A., Plack, C.J., and Oxenham, A.J. (1999). “Inter-relationship between different psychoacoustic
measures assumed to be related to the cochlear active mechanism,” J. Acoust. Soc. Am. 106, 2761-2778.
Neely, S.T., Johnson, T.A., Kopun, J., Dierking, D.M., and Gorga, M.P. (2009). “Distortion product otoacoustic emission
input/output characteristics in normal-hearing and hearing-impaired human ears,” J. Acoust. Soc. Am. 126, 728-738.
Patterson, R.D., Allerhand, M.H., and Giguère, C. (1995). "Time-domain modeling of peripheral auditory processing: A modular
architecture and a software platform," J. Acoust. Soc. Am. 98, 1890-1894.
Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., and Allerhand, M.H. (1992). “Complex sounds and
auditory images,” In Auditory Physiology and Perception, (Eds.) Y Cazals, L. Demany, K.Horner, Pergamon, Oxford, 1992,
429-446.
Plack, C.J., and Oxenham, A.J. (2000). “Basilar-membrane nonlinearity estimated by pulsation threshold,” J. Acoust. Soc. Am.
107, 501-507.
Rhode, W.S. (1971). “Observations on the vibration of the basilar membrane in squirrel monkeys using the Mössbauer technique,”
J. Acoust. Soc. Am. 49, 1218-1231.
Ruggero, M.A., Rich, N.C., Recio, A., and Narayan, S. (1997). “Basilar-membrane responses to tones at the base of the
chinchilla cochlea,” J. Acoust. Soc. Am. 101, 2151-2163.
Sachs, M.B., and Abbas, P.J. (1974). “Rate versus level functions for auditory-nerve fibers in cats: Tone-burst stimuli,” J.
Acoust. Soc. Am. 56, 1835-1847.
Sachs, M.B., and Kiang, N.Y.S. (1968). “Two-tone suppression in auditory-nerve fibers,” J. Acoust. Soc. Am. 43, 1120-1128.
Schmiedt, R.A. (1984). “Acoustic injury and the physiology of hearing,” J. Acoust. Soc. Am. 76, 1293-1317.
Slaney, M. (1993). “An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank,” Apple Computer Technical
Report #35.
Suzuki, Y., and Takeshima, H. (2004). “Equal-loudness-level contours for pure tones,” J. Acoust. Soc. Am. 116, 918-933.
Taal, C.H., Hendriks, R.C., Heusdens, R., and Jensen, J. (2011). “An algorithm for intelligibility prediction of time-frequency
weighted noisy speech,” IEEE Trans. Audio Speech and Sig. Proc. 19, 2125-2136.
Tan, C.-T., and Moore, B.C.J. (2008). “Perception of nonlinear distortion by hearing-impaired people,” Int. J. Aud. 47, 246-256.
Tan, C.-T., Moore, B.C.J., Zacharov, N., and Matilla, V.-V. (2004). “Predicting the perceived quality of nonlinearly distorted
music and speech signals,” J. Audio Eng. Soc. 52, 699-711.
Thiede, T., Treurniet, W.C., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J.G., Colomes, C., Keyhl, M., Stoll, G.,
Brandenburg, K., and Feiten, B. (2000). “PEAQ – The ITU standard for objective measurement of perceived audio quality,”
J. Audio Eng. Soc. 48, 3-29.
Uppenkamp, S., Fobel, S., and Patterson, R. D. (2001). “The effects of temporal asymmetry on the detection and perception of
short chirps,” Hear. Res. 158, 71-83.
Westerman, L.A., and Smith, R.L. (1987). “Conservation of adapting components in auditory-nerve responses,” J. Acoust. Soc.
Am. 81, 680-691.
Wojtczak, M., Biem, J.A., Micheyl, C., and Oxenham, A.J. (2012). “Perception of across-frequency asynchrony and the role of
cochlear delay,” J. Acoust. Soc. Am. 131, 363-377.
Yates, G.K., Winter, I.M., and Robertson, D. (1990). “Basilar membrane nonlinearity determines auditory nerve rate-intensity
functions and cochlear dynamic range,” Hear. Res. 45, 203-220.
Zhang, X., Heinz, M.G., Bruce, I.C., and Carney, L.H. (2001). “A phenomenological model for the response of auditory nerve
fibers: I. Nonlinear tuning with compression and suppression,” J. Acoust. Soc. Am. 109, 648-670.
Zilany, M.S.A., and Bruce, I.C. (2007). “Predictions of speech intelligibility with a model of the normal and impaired auditory periphery,” Third International IEEE/EMBS Conference on Neural Engineering, 481-485.
... The purpose of this paper is to present a new intelligibility index that (1) combines measurements of coherence with measurements of envelope fidelity to give improved accuracy for a wide range of processing conditions, and (2) is accurate for hearing-impaired as well as normalhearing listeners. The new index, the Hearing Aid Speech Perception Index (HASPI), uses an auditory model that incorporates aspects of normal and impaired peripheral auditory function (Kates, 2013). The auditory coherence is computed from the modeled basilar-membrane vibration output in each frequency band, and provides a measurement sensitive to the changes in the speech temporal fine structure. ...
... The approach to predicting speech intelligibility used in HASPI is to compare the output of an auditory model for a degraded test signal with the output for an unprocessed input signal. A detailed description of the auditory model is presented in Kates (2013) and is summarized here. The model is an extension of the Kates and Arehart (2010) auditory model; that model has been shown to give outputs that can be used to produce accurate predictions of speech quality for a wide variety of hearing losses and processing conditions. ...
... The summed values are converted back to dB, and segments having an intensity less than 2.5 dB re:threshold are removed from the correlation calculation. A justification for this approach to silence detection is that the linear values in each frequency band correspond to specific loudness (Moore and Glasberg, 2004;Kates, 2013) and the sum across frequency is thus related to the loudness of the signal (Moore and Glasberg, 2004). Thus segments having a loudness near or below auditory threshold are removed from the calculation. ...
Article
This paper presents a revised version of the Hearing-Aid Speech Perception Index (HASPI). The index is based on a model of the auditory periphery that incorporates changes due to hearing loss and is valid for both normal-hearing and hearing-impaired listeners. It is an intrusive metric that compares the time-frequency envelope and temporal fine structure (TFS) of a degraded signal to an unprocessed reference. The first modification to HASPI is an extension to the range of envelope modulation rates considered in the metric. HASPI applies a lowpass filter to the time-frequency envelope modulation, and in the new version this single filter is replaced by a modulation filterbank. The temporal fine structure (TFS) analysis in the original version of HASPI is replaced by the filterbank outputs at higher modulation rates that represent auditory roughness and periodicity. The second modification is replacing the parametric model combining envelope and TFS measurements used in the original version with an ensemble of neural networks. The improved version of HASPI is compared to the original version for datasets from five experiments that encompass noise and nonlinear distortion, frequency compression, ideal binary mask noise suppression, speech modified using a noise vocoder, and speech in reverberation. The new version of HASPI is shown to have a statistically-significant reduction in RMS error compared to the original version for most of the data considered, and to be significantly more accurate for speech in reverberation.
... Therefore, rather than evaluating the hearing aid algorithms in terms of individual processing parameters, the present study focuses on listener responses to the combined signal modification created by two core hearing aid features: wide dynamic range compression (WDRC) and directionality. Signal modification is quantified using an acoustic metric (cepstral correlation; Kates & Arehart 2014) that incorporates the effects of an impaired peripheral auditory system (Kates 2013) and compares the time-frequency modulation patterns of a hearing aid-processed signal in noise to a linearly amplified reference signal in quiet. Thus, signal modification in this study refers to the cumulative temporal envelope changes as a result of the hearing aid processing parameters, the individual audiogram, and the listening environment, including relative speech and noise levels. ...
... The reference condition was obtained from the unaided recording in quiet and processed with NAL-R. Both the hearing aid processed and reference signals were passed through a model of the individual's damaged auditory periphery that accounts for the frequency-specific threshold shift and associated broadened auditory filters (Kates 2013). Specifically, the model consists of an auditory filterbank (32-channel gammatone filterbank with center frequencies between 80 and 8000 Hz), followed by dynamic-range compression of the outer hair cells, firing-rate adaptation of the neural response, and the auditory threshold. ...
Article
Objectives: Previous research has shown that the association between hearing aid-processed speech recognition and individual working memory ability becomes stronger in more challenging conditions (e.g., higher background noise levels) and with stronger hearing aid processing (e.g., fast-acting wide dynamic range compression, WDRC). To date, studies have assumed omnidirectional microphone settings and collocated speech and noise conditions to study such relationships. Such conditions fail to recognize that most hearing aids are fit with directional processing that may improve the signal to noise ratio (SNR) and speech recognition in spatially separated speech and noise conditions. Here, we considered the possibility that directional processing may reduce the signal distortion arising from fast-acting WDRC and in turn influence the relationship between working memory ability and speech recognition with WDRC processing. The combined effects of hearing aid processing (WDRC and directionality) and SNR were quantified using a signal modification metric (cepstral correlation), which measures temporal envelope changes in the processed signal with respect to a linearly amplified reference. It was hypothesized that there will be a weaker association between working memory ability and speech recognition for hearing aid processing conditions that result in overall less signal modification (i.e., fewer changes to the processed envelope). Design: Twenty-three individuals with bilateral, mild to moderately severe sensorineural hearing loss participated in the study. Participants were fit with a commercially available hearing aid, and signal processing was varied in two dimensions: (1) Directionality (omnidirectional [OMNI] versus fixed-directional [DIR]), and (2) WDRC speed (fast-acting [FAST] versus slow-acting [SLOW]). Sentence recognition in spatially separated multi-talker babble was measured across a range of SNRs: 0 dB, 5 dB, 10 dB, and quiet. Cumulative signal modification was measured with individualized hearing aid settings, for all experimental conditions. A linear mixed-effects model was used to determine the relationship between speech recognition, working memory ability, and cumulative signal modification. Results: Signal modification results showed a complex relationship between directionality and WDRC speed, which varied by SNR. At 0 and 5 dB SNRs, signal modification was lower for SLOW than FAST regardless of directionality. However, at 10 dB SNR and in the DIR listening condition, there was no signal modification difference between FAST and SLOW. Consistent with previous studies, the association of speech recognition in noise with working memory ability depended on the level of signal modification. Contrary to the hypothesis above, however, there was a significant association of speech recognition with working memory only at lower levels of signal modification, and speech recognition increased at a faster rate for individuals with better working memory as signal modification decreased with DIR and SLOW. Conclusions: This research suggests that working memory ability remains a significant predictor of speech recognition when WDRC and directionality are applied. Our findings revealed that directional processing can reduce the detrimental effect of fast-acting WDRC on speech cues at higher SNRs, which affects speech recognition ability. 
Contrary to some previous research, this study showed that individuals with better working memory ability benefitted more from a decrease in signal modification than individuals with poorer working memory ability.
... These algorithms provide an objective index that can be mapped into behavioural responses from humans. Complexity of SI predictor algorithms comes in different levels, but in general, these consist of a front-end where the sound waves are processed to obtain representative features, and a back-end where those features are translated into a SI index [1,2,3,4]. ...
Conference Paper
Full-text available
The spike activity mutual information index (SAMII) is presented as a new intrusive objective metric to predict speech intelligibility. A target speech signal and speech-in-noise signal are processed by a state-of-the-art computational model of the peripheral auditory system. It simulates the neural activity in a population of auditory nerve fibers (ANFs), which are grouped into critical bands covering the speech frequency range. The mutual information between the neural activity of both signals is calculated using analysis windows of 20 ms. Then, the mutual information is averaged along these analysis windows to obtain SAMII. SAMII is also extended to binaural scenarios by calculating the index for the left ear, right ear, and both ears, choosing the best case for predicting intelligibility. SAMII was developed based on the first clarity prediction challenge training dataset and compared to the modified binaural short-time objective intelligibility (MBSTOI) as baseline. Scores are reported in root mean squared error (RMSE) between measured and predicted data using the clarity challenge test dataset. SAMII scored 35.16%, slightly better than the MBSTOI which obtained 36.52%. This work leads to the conclusion that SAMII is a reliable objective metric when ``low-level" representations of the speech, such as spike activity, are used.
... HAAQI [58] was designed to predict music quality for individuals listening through hearing aids. The index is based on a model of the auditory periphery [59], extended to potentially include the effects of hearing loss. This is fitted to a dataset of quality ratings made by listeners having normal or impaired hearing. ...
Preprint
Full-text available
Over the past few decades, computational methods have been developed to estimate perceptual audio quality. These methods, also referred to as objective quality measures, are usually developed and intended for a specific application domain. Because of their convenience, they are often used outside their original intended domain, even if it is unclear whether they provide reliable quality estimates in this case. This work studies the correlation of well-known state-of-the-art objective measures with human perceptual scores in two different domains: audio coding and source separation. The following objective measures are considered: fwSNRseg, dLLR, PESQ, PEAQ, POLQA, PEMO-Q, ViSQOLAudio, (SI-)BSSEval, PEASS, LKR-PI, 2f-model, and HAAQI. Additionally, a novel measure (SI-SA2f) is presented, based on the 2f-model and a BSSEval-based signal decomposition. We use perceptual scores from 7 listening tests about audio coding and 7 listening tests about source separation as ground-truth data for the correlation analysis. The results show that one method (2f-model) performs significantly better than the others on both domains and indicate that the dataset for training the method and a robust underlying auditory model are crucial factors towards a universal, domain-independent objective measure.
... The most commonly used speech cue is the signal envelope obtained using Hilbert transform. However, more sophisticated envelope extraction methods such as the computational models simulating the auditory system could improve the quality of synthesized EEG signals (Kates, 2013;Verhulst et al., 2018). It must be noted that the data augmentation techniques must only be used to train the network. ...
Article
Full-text available
Human brain performs remarkably well in segregating a particular speaker from interfering ones in a multispeaker scenario. We can quantitatively evaluate the segregation capability by modeling a relationship between the speech signals present in an auditory scene, and the listener's cortical signals measured using electroencephalography (EEG). This has opened up avenues to integrate neuro-feedback into hearing aids where the device can infer user's attention and enhance the attended speaker. Commonly used algorithms to infer the auditory attention are based on linear systems theory where cues such as speech envelopes are mapped on to the EEG signals. Here, we present a joint convolutional neural network (CNN)—long short-term memory (LSTM) model to infer the auditory attention. Our joint CNN-LSTM model takes the EEG signals and the spectrogram of the multiple speakers as inputs and classifies the attention to one of the speakers. We evaluated the reliability of our network using three different datasets comprising of 61 subjects, where each subject undertook a dual-speaker experiment. The three datasets analyzed corresponded to speech stimuli presented in three different languages namely German, Danish, and Dutch. Using the proposed joint CNN-LSTM model, we obtained a median decoding accuracy of 77.2% at a trial duration of 3 s. Furthermore, we evaluated the amount of sparsity that the model can tolerate by means of magnitude pruning and found a tolerance of up to 50% sparsity without substantial loss of decoding accuracy.
... N. Hearing-Aid Audio Quality Index (HAAQI) HAAQI [58] was designed to predict music quality for individuals listening through hearing aids. The index is based on a model of the auditory periphery [59], extended to potentially include the effects of hearing loss. This is fitted to a dataset of quality ratings made by listeners having normal or impaired hearing. ...
Article
Full-text available
Over the past few decades, computational methods have been developed to estimate perceptual audio quality. These methods, also referred to as objective quality measures, are usually developed and intended for a specific application domain. Because of their convenience, they are often used outside their original intended domain, even if it is unclear whether they provide reliable quality estimates in this case. This work studies the correlation of well-known state-of-the-art objective measures with human perceptual scores in two different domains: audio coding and source separation. The following objective measures are considered: fwSNRseg, dLLR, PESQ, PEAQ, POLQA, PEMO-Q, ViSQOLAudio, (SI-)BSSEval, PEASS, LKR-PI, 2f-model, and HAAQI. Additionally, a novel measure (SI-SA2f) is presented, based on the 2f-model and a BSSEval-based signal decomposition. We use perceptual scores from 7 listening tests about audio coding and 7 listening tests about source separation as ground-truth data for the correlation analysis. The results show that one method (2f-model) performs significantly better than the others on both domains and indicate that the dataset for training the method and a robust underlying auditory model are crucial factors towards a universal, domain-independent objective measure.
... Both the reference and test conditions were processed through an auditory model that considered the user's audiogram. The metric was calculated by first processing the reference and test conditions through an auditory model of the impaired auditory system (Kates, 2013) that was customized for each listener based on their audiogram and that took into account changes that hearing loss has on auditory filtering and nonlinearities. The model produced output envelope signals that were expressed in dB above the normal or impaired auditory threshold. ...
Article
Objectives: Previous work has suggested that individual characteristics, including amount of hearing loss, age, and working memory ability, may affect response to hearing aid signal processing. The present study aims to extend work using metrics to quantify cumulative signal modifications under simulated conditions to real hearing aids worn in everyday listening environments. Specifically, the goal was to determine whether individual factors such as working memory, age, and degree of hearing loss play a role in explaining how listeners respond to signal modifications caused by signal processing in real hearing aids, worn in the listener's everyday environment, over a period of time. Design: Participants were older adults (age range 54-90 years) with symmetrical mild-to-moderate sensorineural hearing loss. We contrasted two distinct hearing aid fittings: one designated as mild signal processing and one as strong signal processing. Forty-nine older adults were enrolled in the study and 35 participants had valid outcome data for both hearing aid fittings. The difference between the two settings related to the wide dynamic range compression and frequency compression features. Order of fittings was randomly assigned for each participant. Each fitting was worn in the listener's everyday environments for approximately 5 weeks before outcome measurements. The trial was double blind, with neither the participant nor the tester aware of the specific fitting at the time of the outcome testing. Baseline measures included a full audiometric evaluation as well as working memory and spectral and temporal resolution. The outcome was aided speech recognition in noise. Results: The two hearing aid fittings resulted in different amounts of signal modification, with significantly less modification for the mild signal processing fitting. The effect of signal processing on speech intelligibility depended on an individual's age, working memory capacity, and degree of hearing loss. Speech recognition with the strong signal processing decreased with increasing age. Working memory interacted with signal processing, with individuals with lower working memory demonstrating low speech intelligibility in noise with both processing conditions, and individuals with higher working memory demonstrating better speech intelligibility in noise with the mild signal processing fitting. Amount of hearing loss interacted with signal processing, but the effects were small. Individual spectral and temporal resolution did not contribute significantly to the variance in the speech intelligibility score. Conclusions: When the consequences of a specific set of hearing aid signal processing characteristics were quantified in terms of overall signal modification, there was a relationship between participant characteristics and recognition of speech at different levels of signal modification. Because the hearing aid fittings used were constrained to specific fitting parameters that represent the extremes of the signal modification that might occur in clinical fittings, future work should focus on similar relationships with more diverse types of signal processing parameters.
Conference Paper
The performance of a deep-learning-based speech enhancement (SE) technology for hearing aid users, called a deep denoising autoencoder (DDAE), was investigated. The hearing-aid speech perception index (HASPI) and the hearing-aid sound quality index (HASQI), two well-known evaluation metrics for speech intelligibility and quality, were used to evaluate the performance of the DDAE SE approach for two typical high-frequency hearing loss (HFHL) audiograms. Our experimental results show that the DDAE SE approach yields higher intelligibility and quality scores than two classical SE approaches. These results suggest that a deep-learning-based SE method could be used to improve speech intelligibility and quality for hearing aid users in noisy environments.
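For orientation, a minimal sketch of a deep denoising autoencoder for spectral speech enhancement is shown below (PyTorch). The layer sizes, optimizer settings, and random stand-in data are assumptions for illustration and do not reproduce the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class DDAE(nn.Module):
    """Minimal deep denoising autoencoder: maps noisy log-magnitude
    spectral frames to clean ones. Layer sizes are illustrative."""
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, noisy_frames):
        return self.net(noisy_frames)

# Training step (sketch): minimize MSE between enhanced and clean frames.
model = DDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.randn(32, 257)   # stand-in batch of noisy log-spectra
clean = torch.randn(32, 257)   # corresponding clean log-spectra
loss = nn.functional.mse_loss(model(noisy), clean)
opt.zero_grad()
loss.backward()
opt.step()
```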
Article
Signal modifications in audio devices such as hearing aids include both nonlinear and linear processing. An index is developed for predicting the effects of noise, nonlinear distortion, and linear filtering on speech quality. The index is designed for both normal-hearing and hearing-impaired listeners. It starts with a representation of the auditory periphery that incorporates aspects of impaired hearing. The cochlear model is followed by the extraction of signal features related to the quality judgments. One set of features measures the effects of noise and nonlinear distortion on speech quality, whereas a second set of features measures the effects of linear filtering. The hearing-aid speech quality index (HASQI) is the product of the subindices computed for each of the two sets of features. The models are evaluated by comparing the model predictions with quality judgments made by normal-hearing and hearing-impaired listeners for speech stimuli containing noise, nonlinear distortion, linear processing, and combinations of these signal degradations.
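The final combination step is simple enough to state directly. The snippet below is a sketch of only that multiplication, with placeholder subindex values; the feature extraction that produces the subindices is not shown.

```python
def hasqi_combined(q_nonlin, q_lin):
    """Combine the two HASQI subindices multiplicatively, as the abstract
    describes: one subindex captures noise/nonlinear distortion effects,
    the other captures linear filtering effects."""
    return q_nonlin * q_lin

# Example with placeholder values: strong nonlinear degradation
# dominates the combined score.
print(hasqi_combined(0.4, 0.9))  # -> 0.36
```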
Article
A previous study explored how the perceived naturalness of music and speech signals was affected by various forms of linear filtering. In the present paper a model is introduced to account for those results. The model is based on the assumption that changes in perceived naturalness produced by linear filtering can be characterized in terms of the changes in the excitation pattern produced by the filtering. The model takes into account both the magnitude of the changes in the excitation pattern and the rapidity with which the excitation pattern changes as a function of frequency. It also includes frequency-weighting functions to capture the fact that naturalness is affected little by changes in amplitude response at very low and very high frequencies. The model accounts very well for the data from the earlier study. Two validation experiments were conducted in which naturalness ratings were obtained for speech and music stimuli passed through new sets of linear filters, including filters based on the measured frequency responses of real transducers. The model predicted the results of these experiments well.
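As a rough sketch of the kind of computation the abstract describes, the function below combines the magnitude of excitation-pattern changes with their rate of change across frequency, under a frequency weighting. The combination rule, array conventions, and function name are illustrative assumptions, not the published model.

```python
import numpy as np

def naturalness_change_metric(exc_ref_db, exc_filt_db, freqs_hz, weight):
    """Illustrative measure in the spirit of the abstract.

    exc_ref_db, exc_filt_db: excitation patterns in dB sampled at freqs_hz.
    weight: per-frequency weights, small at very low and very high
    frequencies. Larger return values indicate a greater departure from
    the original (placeholder combination rule).
    """
    diff = exc_filt_db - exc_ref_db                # excitation-pattern change
    magnitude = np.sum(weight * np.abs(diff))      # size of the change
    slope = np.abs(np.diff(diff) / np.diff(freqs_hz))  # rapidity across frequency
    rapidity = np.sum(weight[:-1] * slope)
    return magnitude + rapidity
```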
Article
In a previous study, perceptual experiments were reported in which subjects rated the perceived quality of speech and music that had been subjected to various forms of nonlinear distortion. The subjective ratings were compared to a physical measure of distortion, DS, based on the output spectrum of each nonlinear system in response to a 10-component multitone test signal with logarithmically spaced components. The values of DS were highly negatively correlated with the subjective ratings for stimuli subjected to "artificial" distortions such as peak clipping and zero clipping. However, for stimuli subjected to nonlinear distortion produced by real transducers, the correlation between the DS values and the subjective ratings was only moderately negative. A new method predicts the perceived quality of nonlinearly distorted signals from the outputs of an array of gammatone filters in response to the original signal and the distorted signal. For each filter, the cross-correlation is calculated between the outputs in response to the original and distorted signals for a series of brief samples (frames). The maximum value of the cross-correlation for each filter in each frame is determined, and these maxima are summed across filters, with a weighting that depends on the magnitude of the output of each filter in response to the distorted signal. The resulting weighted cross-correlation gives a perceptually relevant measure of distortion called Rnonlin, which can be used to predict subjective ratings. The predicted ratings correlated highly with the subjective ratings obtained previously, and these correlations were greater than those obtained using the DS measure. A new perceptual experiment, using a mixture of artificial and real distortions, confirmed the validity of the new measure.
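The frame-based procedure described above lends itself to a compact sketch. The following code is illustrative only: the frame length, lag search range, and energy-based weighting are placeholder assumptions, not the published Rnonlin parameters.

```python
import numpy as np

def rnonlin_like(orig_bands, dist_bands, frame_len=1024, max_lag=8):
    """Frame-based cross-correlation measure in the spirit of Rnonlin.

    orig_bands, dist_bands: gammatone filter outputs for the original and
    distorted signals, shape (n_filters, n_samples). For each filter and
    frame, take the maximum normalized cross-correlation over a small lag
    range, average across frames, then combine across filters with weights
    given by distorted-band energy (all details are placeholders).
    """
    n_filters, n_samples = orig_bands.shape
    n_frames = n_samples // frame_len
    filter_scores = np.zeros(n_filters)
    weights = np.zeros(n_filters)
    for k in range(n_filters):
        frame_vals = []
        for f in range(n_frames):
            x = orig_bands[k, f * frame_len:(f + 1) * frame_len]
            y = dist_bands[k, f * frame_len:(f + 1) * frame_len]
            best = 0.0
            for lag in range(-max_lag, max_lag + 1):
                xs = x[max(0, lag):frame_len + min(0, lag)]
                ys = y[max(0, -lag):frame_len + min(0, -lag)]
                denom = np.sqrt(np.sum(xs**2) * np.sum(ys**2))
                if denom > 0:
                    best = max(best, np.sum(xs * ys) / denom)
            frame_vals.append(best)
        filter_scores[k] = np.mean(frame_vals)       # per-filter average
        weights[k] = np.sum(dist_bands[k]**2)        # distorted-band energy
    return np.sum(weights * filter_scores) / np.sum(weights)
```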
Article
A new model for the perceptual evaluation of speech quality (PESQ) was recently standardized by the International Telecommunications Union as Recommendation P.862. Unlike previous codec assessment models, such as PSQM and MNB (ITU-T P.861), PESQ is able to predict subjective quality with good correlation in a very wide range of conditions, which may include coding distortions, errors, noise, filtering, delay, and variable delay. The psychoacoustic model used in PESQ is introduced. Part I describes the time-delay identification technique that is used in combination with the PESQ psychoacoustic model to predict the end-to-end perceived speech quality.
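As a toy illustration of delay identification, the snippet below estimates a single fixed delay from the cross-correlation peak. PESQ's actual procedure is far more elaborate (it handles variable delay and combines coarse and fine alignment stages), so this function is only an assumed simplification of the basic idea.

```python
import numpy as np

def estimate_delay(reference, degraded):
    """Coarse delay estimate between a reference and a degraded signal,
    taken as the lag of the peak of their cross-correlation.
    Returns a positive value when the degraded signal lags the reference."""
    corr = np.correlate(degraded, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)
```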
Article
Perceptual coding of audio signals is increasingly used in the transmission and storage of high-quality digital audio, and there is a strong demand for an acceptable objective method to measure the quality of such signals. A new measurement method is described that combines ideas from several earlier methods. The method should meet the requirements of the user community, and it has been recommended by ITU Radiocommunication study groups.