Article

Derivation of auditory filter shapes from notched-noise data

Authors: Brian R. Glasberg and Brian C. J. Moore

Abstract

A well-established method for estimating the shape of the auditory filter is based on the measurement of the threshold of a sinusoidal signal in a notched-noise masker, as a function of notch width. To measure the asymmetry of the filter, the notch has to be placed both symmetrically and asymmetrically about the signal frequency. In previous work, several simplifying assumptions and approximations were made in deriving auditory filter shapes from the data. In this paper we describe modifications to the fitting procedure which allow more accurate derivations. These include: 1) taking into account changes in filter bandwidth with centre frequency when allowing for the effects of off-frequency listening; 2) correcting for the non-flat frequency response of the earphone; 3) correcting for the transmission characteristics of the outer and middle ear; 4) limiting the amount by which the centre frequency of the filter can shift in order to maximise the signal-to-masker ratio. In many cases, these modifications result in only small changes to the derived filter shape. However, at very high and very low centre frequencies and for hearing-impaired subjects, the differences can be substantial. It is also shown that filter shapes derived from data where the notch is always placed symmetrically about the signal frequency can be seriously in error when the underlying filter is markedly asymmetric. New formulae are suggested describing the variation of the auditory filter with frequency and level. The implications of the results for the calculation of excitation patterns are discussed, and a modified procedure is proposed. The appendix lists FORTRAN computer programs for deriving auditory filter shapes from notched-noise data and for calculating excitation patterns. The first program can readily be modified so as to derive auditory filter shapes from data obtained with other types of maskers, such as rippled noise.
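The best-known of these formulae, in the form usually cited from this paper, expresses the equivalent rectangular bandwidth (ERB) of the auditory filter and the corresponding ERB-rate scale as functions of centre frequency. A minimal sketch in Python (the function names are illustrative, not taken from the paper's FORTRAN listings):

```python
import numpy as np

def erb_bandwidth(f_hz):
    """ERB of the auditory filter (Hz) at centre frequency f_hz,
    in the form usually cited from Glasberg & Moore (1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_number(f_hz):
    """ERB-rate scale value (number of ERBs below f_hz, in Cams)."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

print(erb_bandwidth(1000.0))  # ~132.6 Hz at 1 kHz
print(erb_number(1000.0))     # ~15.6 Cams
```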


... It is thus possible that cross-talker variability in vowel pronunciations is reduced when formants are represented in Bark, rather than Hz. Similar arguments have been made about other psycho-acoustic transformations [e.g., equivalent rectangular bandwidth (ERB), Glasberg and Moore, 1990; Mel, Stevens and Volkmann, 1940; or semitones, Fant et al., 2002], most of which share that they log-transform acoustic frequencies, in line with neurophysiological evidence that auditory representations in the brain seem to follow a roughly logarithmic organization, so that auditory perception is (up to a point) more sensitive to differences between lower frequencies than to the same difference between higher frequencies (e.g., Merzenich et al., 1975; for review, see Saenz and Langers, 2014). While each of these transformations was developed with different applications in mind (e.g., ERB and Bark to explain frequency selectivity, Glasberg and Moore, 1990; or semitones for the perception of musical pitch, Balzano, 1982), psychoacoustic transformations might suffice for effective formant normalization. If so, this would offer a particularly parsimonious account of vowel perception, as listeners would not have to infer talker-specific properties. ...
... Our broad-coverage approach complements previous studies, which have typically compared a small number of accounts (up to 3) and focused on parts of the vowel inventory, and thus parts of the formant space (typically 2-4 vowels, Barreda, 2021; Barreda and Nearey, 2012; Nearey, 1989; Richter et al., 2017). The accounts we consider include the most influential examples of psychoacoustic transformations (Fant et al., 2002; Glasberg and Moore, 1990; Stevens and Volkmann, 1940; Traunmüller, 1981), intrinsic (Syrdal and Gopal, 1986), extrinsic (Gerstman, 1968; Johnson, 2020; Lobanov, 1971; McMurray and Jongman, 2011; Nearey, 1978; Nordström and Lindblom, 1975), and hybrid accounts that contain intrinsic and extrinsic components (Miller, 1989). This broad-coverage approach allows us to assess, for example, whether the preference for computationally simple accounts observed in Barreda (2021) replicates with new data that span the entire vowel space. ...
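For orientation, the psychoacoustic scales named in these excerpts all have compact closed forms. The sketch below uses standard published versions; the mel expression is a common analytic fit rather than the original Stevens and Volkmann (1940) table, and the semitone reference frequency is an illustrative choice, not one taken from the cited studies:

```python
import numpy as np

def hz_to_erb_rate(f):
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)  # Glasberg & Moore (1990)

def hz_to_bark(f):
    return 26.81 * f / (1960.0 + f) - 0.53           # Traunmüller (1981)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)        # common analytic mel fit

def hz_to_semitones(f, f_ref=50.0):                  # f_ref is an arbitrary reference
    return 12.0 * np.log2(f / f_ref)

# all four scales compress high frequencies relative to low ones (roughly log-like)
for f in (500.0, 1000.0, 2000.0):
    print(f, hz_to_erb_rate(f), hz_to_bark(f), hz_to_mel(f), hz_to_semitones(f))
```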
Article
Full-text available
Human speech recognition tends to be robust, despite substantial cross-talker variability. Believed to be critical to this ability are auditory normalization mechanisms whereby listeners adapt to individual differences in vocal tract physiology. This study investigates the computations involved in such normalization. Two 8-way alternative forced-choice experiments assessed L1 listeners' categorizations across the entire US English vowel space—both for unaltered and synthesized stimuli. Listeners' responses in these experiments were compared against the predictions of 20 influential normalization accounts that differ starkly in the inference and memory capacities they imply for speech perception. This includes variants of estimation-free transformations into psycho-acoustic spaces, intrinsic normalizations relative to concurrent acoustic properties, and extrinsic normalizations relative to talker-specific statistics. Listeners' responses were best explained by extrinsic normalization, suggesting that listeners learn and store distributional properties of talkers' speech. Specifically, computationally simple (single-parameter) extrinsic normalization best fit listeners' responses. This simple extrinsic normalization also clearly outperformed Lobanov normalization—a computationally more complex account that remains popular in research on phonetics and phonology, sociolinguistics, typology, and language acquisition.
... Room acoustic rendering can be based on computer simulation or acoustic measurement of room impulse responses (RIRs). Generally, simulation-based techniques can be divided into geometrical methods and wave-based methods, which are sometimes combined in mixed approaches [62,63,64,65]. Geometrical methods such as the image-source method and ray tracing assume 'rays of sound', which can serve as an approximation for sound propagation at high frequencies. ...
... (1)-(8) are evaluated for 320 gammatone magnitude windows w_b(ω) [61]. The band-pass windows are spaced on an equivalent rectangular bandwidth (ERB) frequency scale (1/8 ERB spacing), and each window covers one ERB [63]. ...
Thesis
Full-text available
Spatial audio systems are designed to control the perceived direction (and distance) of sounds. Depending on the sound scene, sensations can range from localized auditory events to listener envelopment ('being surrounded by sound') and engulfment ('being covered by sound'). This thesis consists of publications on localization, envelopment, and engulfment in real and virtualized loudspeaker environments. Concerning real, surrounding loudspeaker arrangements, the first part of the thesis investigates the effects of the spatio-temporal density of sound events on envelopment and engulfment. Listening experiments and auditory models reveal how envelopment can be preserved at off-center listening positions: when horizontally surrounding loudspeakers each provide a 3 dB sound pressure level (SPL) decay per doubling of distance (line sources), the interaural level difference is minimized across the entire listening area, which preserves the perceived directional balance. Regarding temporal density, experiments using a spatial granular synthesis approach suggested that surrounding sound events at random intervals of Δt < 20 milliseconds create a diffusely enveloping auditory event, which can be explained by the temporal integration of localization cues in the auditory system with time constants of 50 to 200 milliseconds. Moreover, experiments using a hemispherical loudspeaker arrangement demonstrate that envelopment and engulfment are perceptually distinct spatial attributes. Whether localization, envelopment, and engulfment can be reproduced in a binaurally virtualized loudspeaker environment is investigated in the second part of the thesis. The dynamic virtualization developed for the experiments uses a six-degrees-of-freedom direct-sound rendering and a measurement-based three-degrees-of-freedom room acoustic auralization. Experiments were conducted in a studio environment using acoustically transparent headphones. This 'in situ' methodology allows for a direct comparison between the virtualized and the real loudspeaker environment, revealing spatial mapping errors caused by the binaural rendering. While the high-level sensations of envelopment and engulfment could be reproduced well using non-individual head-related transfer functions (HRTFs), vertical localization errors in the frontal area were found to notably distort the directional mapping in the virtualized loudspeaker environment.
... The most common models of auditory filters use rounded exponential (roex) filters, gammatones, or cascaded recursive filters (Lyon et al., 2010). For example, a popular model uses gammatone filters equally spaced on the Equivalent Rectangular Bandwidth (ERB) scale and with bandwidths either equal to the ERB (Glasberg & Moore, 1990) or proportionate to it (e.g., 1.019 times the ERB; Patterson et al., 1992;Slaney, 1993). The objective of such models is to approximate the frequency response of actual cochlear neurons, but numerous complications arise. ...
... Bandwidths bw are either set to a particular number of semitones in constant-Q filters or calculated as a function of the filter's central frequency cf, as follows (Glasberg & Moore, 1990; Slaney, 1993): ...
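The truncated expression presumably refers to the bandwidth formula of Glasberg and Moore (1990), optionally scaled by the gammatone correction factor used by Patterson et al. (1992) and Slaney (1993):

\[
\mathrm{bw}(cf) = 24.7\left(\frac{4.37\,cf}{1000} + 1\right), \qquad
\mathrm{bw}_{\text{gammatone}}(cf) = 1.019\,\mathrm{bw}(cf)
\]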
Article
Full-text available
Roughness is a perceptual characteristic of sound that was first applied to musical consonance and dissonance, but it is increasingly recognized as a central aspect of voice quality in human and animal communication. It may be particularly important for asserting social dominance or attracting attention in urgent signals such as screams. To ensure that the results of roughness research are valid and consistent across studies, we need standard methodology for measuring it. I review the literature on roughness estimation, from classic psychoacoustics to more recent approaches, and present two collections of 602 human vocal samples whose roughness was rated by 162 listeners in perceptual experiments. Two algorithms for estimating roughness acoustically from modulation spectra are then presented and optimized to match the human ratings. One uses a bank of gammatone or Butterworth filters to obtain an auditory spectrogram, and a faster algorithm begins with a conventional spectrogram obtained with Short-Time Fourier transform; both explain ~50% of variance in average human ratings per stimulus. The range of modulation frequencies most relevant to roughness perception is [50, 200] Hz; this range can be selected with simple cutoff points or with a lognormal weighting function. Modulation and roughness spectrograms are proposed as visual aids for studying the dynamics of roughness in longer recordings. The described algorithms are implemented in the function modulationSpectrum() from the open-source R library soundgen. The audio recordings and their ratings are freely available from https://osf.io/gvcpx/ and can be used for benchmarking other algorithms.
... This way, a link between subjective perception and objective measurements can be established, potentially enabling detection of advanced wear to the tools using acoustic emission data. Note that, for easier pitch comparison between measures in the final results/plots, pitch is always specified in Hz, converted from Bark [29] (specific roughness) or ERB [30] (specific loudness and pitch centroid). In the first analysis, the focus was on evaluating parameters that would yield one single value per lane, choosing loudness, pitch centroid, sharpness and roughness, as well as the dominant modulation frequency f_mod as a secondary roughness parameter [23]. ...
... This is akin to computing a weighted average of the specific loudness N′[z]. For the visualization in Fig. 13, the pitch is then converted from ERB to Hz as described in [30]. Sharpness is computed as specified in [31] and roughness as described in [23]. ...
Article
Full-text available
This paper presents a thorough analysis and evaluation of condition monitoring (CM) of a precision CNC mill based on acoustic emissions (AE). Two separate techniques are used to analyze the data: the first computes metrics after representing the data in the phase space, in contrast to classical approaches that operate in the time and frequency domains. The second approach uses psychoacoustic metrics. Through the analysis of data in the phase domain, which is linked to the angular position of the milling tool, high angular resolution for the acoustic tool monitoring is achieved, enabling the evaluation of each cutting edge of the milling tool individually. The obtained results are consistent with those of direct measurement methods, in this case based on microscopic photographs. On the other hand, since the AE data is audible, the use of psychoacoustic metrics in CM is also investigated, with the goal of creating a link to human perception of anomalies based on hearing the acoustic emissions. In order to evaluate the AE data, two relative metrics for wear and accumulated tool damage are defined in the phase space; additionally, several psychoacoustic metrics are investigated. The methods were evaluated on large data sets acquired during the production of a series of parts and by using different tools. The relative metrics in the phase space provide robust and meaningful results, and perform better than the psychoacoustic metrics since they effectively embed a priori knowledge about the tool geometry into the evaluation.
... Speech filter banks have been extensively studied, with numerous models proposed to elucidate the characteristics of the human hearing system. Common examples include Mel filters [2], Lyon filters [3], PLP [4], gammatone filters [5,6], Seneff filters [7], and Meddis hair cell models [8]. More recent research has placed greater emphasis on human cochlea modeling, with notable works in this area including those by Bruce [9], Hohmann [10], King [11], Relaño [12], Jepsen [13], Verhulst [14], and Zilany [15]. ...
... The outputs of gammatone filters can be used to predict the frequency response of specific locations on the basilar membrane, which corresponds to the center frequency of the related gammatone filter. Gammatone filters distribute the center frequencies along the frequency spectrum according to the ERB (Equivalent Rectangular Bandwidth) [6] scale proportional to the bandwidths of each filter. The Meddis hair cell model [8] is another popular model for describing the transduction of mechanical waves into neural impulses. ...
Article
Full-text available
Feature extraction is a crucial stage in speech emotion recognition applications, and filter banks with their related statistical functions are widely used for this purpose. Although Mel filters and MFCCs achieve outstanding results, they do not perfectly model the structure of the human ear, as they use a simplified mechanism to simulate the functioning of human cochlear structures. The Mel filter system is not a perfect representation of human hearing, but merely an engineering shortcut to suppress the pitch and low-frequency components, which have little use in traditional speech recognition applications. However, speech emotion recognition classification is heavily related to pitch and low-frequency component features. The newly tailored CARFAC 24 model is a sophisticated system for analyzing human speech and is designed to best simulate the functionalities of the human cochlea. In this study, we use the CARFAC 24 system for speech emotion recognition and compare it with state-of-the-art systems using speaker-independent studies conducted with Time-Distributed Convolutional LSTM networks and Support Vector Machines, with the use of the ASED and NEMO emotional speech datasets. The results demonstrate that CARFAC 24 is a valuable alternative to Mel and MFCC features in speech emotion recognition applications.
... This integration aims to boost the SR system's performance in noisy settings. Key concepts of the PWCC method include a "Normalized Wavelet FilterBank" (NWFB) that utilizes the Morlet wavelet transform on an ERB-rate scale [33], [40], [41], [42] to simulate human auditory frequency selectivity, and a "Noise Suppression Module" (NSM) which employs power analysis over medium durations, asymmetric noise-suppression and temporal masking modules, and a spectral smoothing module to compensate for the effects of background noise. In order to assess the accuracy of our proposed method, we conducted a comparative analysis with two alternative feature extraction methods, namely MFCC and PNCC. ...
Article
Full-text available
Human capability for Speaker Recognition (SR) exceeds recent machine learning approaches, even in noisy environments. To bridge this gap, researchers investigate the human auditory system to support machine learning algorithm performance. The paper introduces a novel feature extraction method, named "Power Wavelet Cepstral Coefficients" (PWCC), for enhancing SR accuracy. This method is derived from the "Normalized Wavelet FilterBank" (NWFB), which utilizes an "Equivalent Rectangular Bandwidth" rate (ERB-rate) scale and additionally integrates a "Noise Suppression Module" (NSM). The NWFB imitates the cochlea's frequency selectivity using "Morlet Wavelet filters" alongside an ERB-rate scale. The NSM applies a medium-duration power analysis, an asymmetrical noise-suppression module incorporating a temporal masking component, and a spectral smoothing module to reduce the impact of noise. To assess the performance of the proposed PWCC method, experiments were conducted using clean speech signals from the TIMIT database, corrupted with various noises from the AURORA dataset. Using a "Gaussian Mixture Model-Universal Background Model" (GMM-UBM) classifier, the PWCC method demonstrated superior SR accuracy in noisy environments compared to traditional methods such as PNCC and MFCC. Furthermore, PWCC maintained higher precision, recall, and F1-scores than PNCC and MFCC under the noise conditions overall. For instance, with babble noise at 15 dB SNR, PWCC achieved a recognition rate of 92.06%, compared to 75.24% for PNCC and 68.33% for MFCC.
... We predicted the EEG responses from the spectro-temporal characteristics of the stimulus by band-pass filtering the audio material between 20 and 9000 Hz, dividing it into log-spaced bands [23] and computing the absolute Hilbert envelope for each spectral band, all using the soundlab Python package [24]. To test how the number of parameters and the degree of spectral detail affect model accuracy, we represented the stimulus with 1, 8 and 16 spectral bands and repeated the modeling procedure for each representation. ...
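A minimal sketch of the described envelope extraction, using generic scipy filters rather than the soundlab package (whose API is not shown in the excerpt); the filter order and the log spacing of the band edges are assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelopes(x, fs, n_bands=8, f_lo=20.0, f_hi=9000.0):
    """Split x into log-spaced bands and return the absolute Hilbert
    envelope of each band (one row per band)."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envs.append(np.abs(hilbert(band)))          # absolute Hilbert envelope
    return np.stack(envs)
```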
Article
Full-text available
Background: In recent decades, studies modeling the neural processing of continuous, naturalistic speech provided new insights into how speech and language are represented in the brain. However, the linear encoder models commonly used in such studies assume that the underlying data are stationary, varying to a fixed degree around a constant mean. Long, continuous neural recordings may violate this assumption, leading to impaired model performance.
Methods: We used temporal response functions (TRFs) to predict continuous neural responses to speech while splitting the data into segments of varying length prior to model fitting. Our hypothesis was that if the data were non-stationary, segmentation should improve model performance by making individual segments approximately stationary. We simulated and predicted stationary and non-stationary recordings to test our hypothesis under a known ground truth, and predicted the brain activity of participants who listened to a narrated story to test our hypothesis on actual neural recordings.
Results: Simulations showed that, for stationary data, increasing segmentation steadily decreased model performance. For non-stationary data, however, segmentation initially improved model performance. Modeling of neural recordings yielded similar results: segments of intermediate length (5–15 s) led to improved model performance compared to very short (1–2 s) and very long (30–120 s) segments.
Conclusions: We showed that data segmentation improves the performance of encoding models for both simulated and real neural data, and that this can be explained by the fact that shorter segments approximate stationarity more closely. Thus, the common practice of applying encoding models to long continuous segments of data is suboptimal, and recordings should be segmented prior to modeling.
... We recall the main idea: Given a (non-linear) auditory scale function F_S : [0, f_s/2] → S that maps positive frequencies (in Hz) to an auditory scale S (in auditory units), and a function B_S : [0, f_s/2] → R that gives the associated bandwidths. For the commonly used equivalent rectangular bandwidth (ERB) auditory scale [13], [18], the functions are given by ...
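In the form usually cited from Glasberg and Moore (1990), the truncated definitions read:

\[
F_S(f) = 21.4\,\log_{10}\!\left(1 + \frac{4.37\,f}{1000}\right), \qquad
B_S(f) = 24.7\left(1 + \frac{4.37\,f}{1000}\right)
\]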
Preprint
Full-text available
This paper introduces ISAC, an invertible and stable, perceptually-motivated filter bank that is specifically designed to be integrated into machine learning paradigms. More precisely, the center frequencies and bandwidths of the filters are chosen to follow a non-linear, auditory frequency scale, the filter kernels have user-defined maximum temporal support and may serve as learnable convolutional kernels, and there exists a corresponding filter bank such that both form a perfect reconstruction pair. ISAC provides a powerful and user-friendly audio front-end suitable for any application, including analysis-synthesis schemes.
... For the amplitude modulation vocoder, the input signal was first filtered into eight frequency bands ranging from 80 to 8,000 Hz, with each band equally spaced on an equivalent rectangular bandwidth scale (Glasberg and Moore, 1990). ...
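One plausible reading of "equally spaced on an equivalent rectangular bandwidth scale", sketched in Python: place the nine band edges of the eight channels at equal intervals of ERB-number between 80 and 8000 Hz (the edge placement is an assumption; the excerpt does not specify it):

```python
import numpy as np

def erb_rate(f):      # Hz -> ERB-number (Glasberg & Moore, 1990)
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def inv_erb_rate(e):  # ERB-number -> Hz
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

# 9 edges -> 8 bands, equally spaced in ERB-number between 80 and 8000 Hz
edges = inv_erb_rate(np.linspace(erb_rate(80.0), erb_rate(8000.0), 9))
print(np.round(edges))
```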
Article
Full-text available
In our previous study, early-blind individuals had better speech recognition than sighted individuals, even when the spectral cue was degraded using noise-vocoders. Therefore, this study investigated the impact of temporal envelope degradation and temporal fine structure (TFS) degradation on vocoded speech recognition and cortical auditory response in early-blind individuals compared to sighted individuals. The study included 20 early-blind subjects (31.20 ± 42.5 years, M:F = 11:9), and 20 age- and sex-matched sighted subjects. Monosyllabic words were processed using the Hilbert transform to separate the envelope and TFS, generating vocoders that included only one of these components. The amplitude modulation (AM) vocoder, which contained only the envelope component, had the low-pass filter's cutoff frequency for AM extraction set at 16, 50, and 500 Hz to control the amount of AM cue. The frequency modulation (FM) vocoders, which contained only the TFS component, were adjusted to include FM cues at 50%, 75%, and 100% by modulating the noise level. A two-way repeated measures ANOVA revealed that early-blind subjects outperformed sighted subjects across almost all AM- or FM-vocoded conditions (p < 0.01). Speech recognition in early-blind subjects declined more with increasing TFS degradation, as evidenced by a significant interaction between group and the degree of TFS degradation (p = 0.016). We also analyzed neural responses based on the semantic oddball paradigm using the N2 and P3b components, which occur 200–300 ms and 250–800 ms after stimulus onset, respectively. Significant correlations were observed between N2 and P3b amplitude/latency and behavioral accuracy (p < 0.05). This suggests that early-blind subjects may develop enhanced neural processing strategies for temporal cues. In particular, preserving TFS cues is considered important for the auditory rehabilitation of individuals with visual or auditory impairments.
... The DirAC parameters are computed for non-overlapping and non-uniform frequency bands following roughly a multiple of the Equivalent Rectangular Bandwidth (ERB) [21]. A scale of about 8 times ERB is used to obtain from 5-6 parameter bands depending on the bitrate (cf. ...
Preprint
Directional Audio Coding (DirAC) is a proven method for parametrically representing a 3D audio scene in B-format and is capable of reproducing it on arbitrary loudspeaker layouts. Although such a method seems well suited for low bitrate Ambisonic transmission, little work has been done on the feasibility of building a real system upon it. In this paper, we present a DirAC-based coding for Higher-Order Ambisonics (HOA), developed as part of a standardisation effort to extend the 3GPP EVS codec to immersive communications. Starting from the first-order DirAC model, we show how to reduce algorithmic delay, the bitrate required for the parameters, and complexity by bringing the full synthesis in the spherical harmonic domain. The evaluation of the proposed technique for coding 3rd-order Ambisonics at bitrates from 32 to 128 kbps shows the relevance of the parametric approach compared with existing solutions.
... For each clear story (i.e., without background babble or noise), a cochleogram was calculated using a simple auditory-periphery model with 30 auditory filters (McDermott and Simoncelli, 2011; cutoffs evenly spaced on the ERB scale; Glasberg and Moore, 1990). The resulting amplitude envelope for each auditory filter was compressed by 0.6 to simulate inner ear compression (McDermott and Simoncelli, 2011). ...
Article
Full-text available
Neural activity in auditory cortex tracks the amplitude-onset envelope of continuous speech, but recent work counterintuitively suggests that neural tracking increases when speech is masked by background noise, despite reduced speech intelligibility. Noise-related amplification could indicate that stochastic resonance – the response facilitation through noise – supports neural speech tracking, but a comprehensive account is lacking. In five human electroencephalography experiments, the current study demonstrates a generalized enhancement of neural speech tracking due to minimal background noise. Results show that (1) neural speech tracking is enhanced for speech masked by background noise at very high signal-to-noise ratios (~30 dB SNR) where speech is highly intelligible; (2) this enhancement is independent of attention; (3) it generalizes across different stationary background maskers, but is strongest for 12-talker babble; and (4) it is present for headphone and free-field listening, suggesting that the neural-tracking enhancement generalizes to real-life listening. The work paints a clear picture that minimal background noise enhances the neural representation of the speech onset-envelope, suggesting that stochastic resonance contributes to neural speech tracking. The work further highlights non-linearities of neural tracking induced by background noise that make its use as a biological marker for speech processing challenging.
... Stimuli were similar to those used in the psychoacoustic study of Brennan et al. (2023), except that here the masker level was set to 70 dB SPL, and the probe level was fixed at 60 dB SPL, which would be above detection threshold for listeners with normal hearing. GN maskers had a center frequency matched to the CF of a neuron, a bandwidth equal to 1/3 of the equivalent rectangular bandwidth of the auditory filter (ERB_N; Glasberg & Moore, 1990) at the center frequency, and a duration of 400 ms, including 5-ms cos² on/off ramps. Masker waveforms varied across trials. ...
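For concreteness, at a CF of 1 kHz the Glasberg and Moore (1990) formula gives ERB_N = 24.7 × (4.37 × 1 + 1) ≈ 132.6 Hz, so the 1/3-ERB_N masker described here would be roughly 44 Hz wide at that CF.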
Preprint
Full-text available
In forward masking the detection threshold for a target sound (probe) is elevated due to the presence of a preceding sound (masker). Although many factors are known to influence the probe response following a masker, the current work focused on the temporal separation (delay) between the masker and probe and the inter-trial interval (ITI). Human probe thresholds recover from forward masking within 150 to 300 ms, similar to neural threshold recovery in the IC within 300 ms after tone maskers. Our study focused on recovery of discharge rate of IC neurons in response to probe tones after narrowband Gaussian noise (GN) forward maskers, with varying time delays. Additionally, we examined how prior masker trials influenced IC rates by varying ITI. Our findings showed that previous masker trials impacted probe-evoked discharge rates, with full recovery requiring ITIs over 1.5 s after 70 dB SPL narrowband GN maskers. Neural thresholds in the IC for probes preceded by noise maskers were in the range observed in psychoacoustical studies. Two proposed mechanisms for forward masking, persistence and efferent gain control, were tested using rate analyses or computational modeling. A physiological model with efferent feedback gain control had responses consistent with trends in the physiological recordings.
... The speech was tone-vocoded, using an approach similar to that of Oxenham and Kreft (2014). The vocoder utilized 12 channels, with center frequencies ranging from 9 to 31 Cams in steps of 2 on the ERB N -number scale (Glasberg and Moore, 1990), corresponding to center frequencies ranging from 375 to 6237 Hz. The channel spacing of 2 Cams was chosen to limit envelope fluctuations generated by beating between the center tones of neighboring channels. ...
... They measured the SRT in four different types of non-stationary noise: fluctuating speech noise (FSN) and three types of filtered FSN with spectral gaps in several frequency regions. The filtering of the latter noises was based on the equivalent-rectangular-bandwidth (ERB) scale (Glasberg and Moore, 1990), with a bandwidth of two, three, or four ERBs. For more details on these noise types, see Peters et al. (1998, Fig. 3, p. 579). ...
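Inverting the ERB-number (Cam) scale recovers the channel centre frequencies quoted above; a short sketch (the small discrepancy at the top channel stems from rounding of the scale constants):

```python
import numpy as np

def cams_to_hz(e):
    """Invert the ERB-number (Cam) scale of Glasberg & Moore (1990)."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

centers = cams_to_hz(np.arange(9, 32, 2))  # 12 channels: 9, 11, ..., 31 Cams
print(np.round(centers))                   # first ~374 Hz, last ~6200 Hz
```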
Article
The speech reception threshold (SRT) model of Plomp [J. Acoust. Soc. Am. 63(2), 533–549 (1978)] can be used to describe the SRT (dB signal-to-noise ratio) for 50% of sentences correct in stationary noise in normal-hearing (NH) and hearing-impaired (HI) listeners. The extended speech reception threshold (ESRT) model [Rhebergen et al., J. Acoust. Soc. Am. 117, 2181–2192 (2010)] was introduced to describe the SRT in non-stationary noises. With the ESRT model, they showed that the SRT in non-stationary noise is, in contrast to the SRT in stationary noise, dependent on the noise type and noise level. We examine, with SRT data from the literature, whether the ESRT model can also be used to predict the SRT in individual NH and HI listeners in different types of non-stationary noise based on a single SRT measurement in quiet, stationary, and non-stationary noise. The predicted speech reception thresholds (SRTs) in non-stationary noises in NH and HI listeners correspond well with the observed SRTs, independent of the non-stationary spectral or temporal masking or noise masking levels used. The ESRT model can not only be used to describe the SRT within a non-stationary noise but can also be used to predict the SRTs in other non-stationary noise types as a function of noise level in NH and HI listeners.
... We first describe these predictors and then discuss the statistical modeling approach adopted. Fischer et al. (2021) validated the perceptual significance of the equivalent rectangular bandwidth (ERB) auditory processing model (Glasberg & Moore, 1990;Moore, 1986;Moore & Glasberg, 1983) over the purely acoustic short-time Fourier transform (STFT) model as an input representation for computing acoustic descriptors. They showed that it can account for more timbral variance in relation to perceptual segregation. ...
Article
Full-text available
The stratification of layers of differing prominence (foreground/background) is a common technique in orchestration. Musicians heard 23 excerpts containing foreground and background layers as previously determined by music analysts. A given layer comprised either a single auditory stream of one or more blended instruments or a harmonic or rhythmic background. Two-layer excerpts had either the same, overlapping, or different instrument families (timbre class). First, musicians rated the perceived degree of segregation of musical materials in two-layer and single-stream excerpts. Second, they heard each of the two layers in isolation and then together and rated the relative prominence of the layers. Heterogeneous instrument combinations yielded the greatest difference in relative prominence, followed by overlapping and then homogeneous combinations. Acoustic and score-based descriptors were extracted to quantify their relative contribution to perceptual stratification. Timbre class and between-layer differences in timbre and dynamics played a role, providing evidence of how timbral differences enhance relative prominence in orchestral music. Perceptual segregation was positively but very weakly related to relative prominence, supporting findings that although segregation is necessary to form layers, this mechanism is separable from that which places the streams into the same representational space to allow for the assessment of their relative prominence.
... These incorporate a variant of Mel frequency cepstral coefficients (MFCCs). More precisely, band filters are used with equidistant spacing on an equivalent rectangular bandwidth (ERB) scale [18]. The filters are 3 ERB wide with 50% overlap. ...
Article
Full-text available
Coughing is a symptom of many respiratory diseases. An increased amount of coughs may signal an (upcoming) health issue, while a decreasing amount of coughs may indicate an improved health status. The presence of a cough can be identified by a cough classifier. The cough density fluctuates considerably over the course of a day with a pattern that is highly subject-dependent. This paper provides a case study of cough patterns from Chronic Obstructive Pulmonary Disease (COPD) patients as determined by a stationary semi-automated cough monitor. It clearly demonstrates the variability of cough density over the observation time, its patient specificity and dependence on health status. Furthermore, an earlier established empirical finding of a linear relation between mean and standard deviation of a session’s cough count is validated. An alert mechanism incorporating these findings is described.
... The noise ranged from 14.3 to 18.0 kHz and hence did not spectrally overlap with the missing portion of the tone. 4. Considering that the differences in equal-loudness contours between 0.5 and 6 kHz at the considered level were negligible, we computed the difference in level required to approximately equate the specific loudness at these two frequencies as 10·log10(AFB_500) − 10·log10(AFB_6000), where AFB_f represents the auditory filter bandwidth at center frequency f as defined by Glasberg and Moore (1990). ...
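Working through that expression with the Glasberg and Moore (1990) bandwidth formula gives roughly −9 dB, i.e. the filter at 500 Hz is about nine decibels narrower than the one at 6 kHz:

```python
import math

def afb(f_hz):  # auditory filter bandwidth per Glasberg & Moore (1990)
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

delta_db = 10 * math.log10(afb(500.0)) - 10 * math.log10(afb(6000.0))
print(round(delta_db, 1))  # ~ -9.3 dB
```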
Article
Full-text available
A sound turned off for a short moment can be perceived as continuous if the silent gap is filled with noise. The neural mechanisms underlying this “continuity illusion” were investigated using the mismatch negativity (MMN), an event-related potential reflecting the perception of a sudden change in an otherwise regular stimulus sequence. The MMN was recorded in four conditions using an oddball paradigm. The standards consisted of 500-Hz, 120-msec tone pips that were either physically continuous (Condition 1) or were interrupted by a 40-msec silent gap (Condition 2). The deviants consisted of the interrupted tone, but with the silent gap filled by a burst of bandpass-filtered noise. The noise either occupied the same frequency region as the tone and elicited the continuity illusion (Conditions 1a and 2a), or occupied a remote frequency region and did not elicit the illusion (Conditions 1b and 2b). We predicted that, if the continuity illusion is determined before MMN generation, then, other things being equal, the MMN should be larger in conditions where the deviants are perceived as continuous and the standards as interrupted or vice versa, than when both were perceived as continuous or both interrupted. Consistent with this prediction, we observed an interaction between standard type and noise frequency region, with the MMN being larger in Condition 1a than in Condition 1b, but smaller in Condition 2a than in Condition 2b. Because the subjects were instructed to ignore the tones and watch a silent movie during the recordings, the results indicate that the continuity illusion can occur outside the focus of attention. Furthermore, the latency of the MMN (less than approximately 200 msec post-deviance onset) places an upper limit on the stage of neural processing responsible for the illusion.
... First, to replicate the frequency analysis occurring in the cochlea, the input sounds were filtered into subbands using a bank of bandpass filters with varying center frequencies and bandwidths. The model employed 4th-order gammatone filters consisting of 32 zero-phase bandpass filters with center frequencies equally spaced on an equivalent rectangular bandwidth (ERB; Glasberg & Moore, 1990) scale between 20 and 10,000 Hz. The filters had a bandwidth of 3 dB. ...
... [9] shows that this modification gives a significant result. The improvement in agreement between the results of objective and subjective evaluation of audio signal quality depends on the ERB parameter settings proposed by other authors [3], [4], [10]. ...
Chapter
Full-text available
This document presents research on objective methods that use psychoacoustic knowledge to estimate the quality of audio signals. Software written especially for this research allows implementation of the different published methods for evaluating the quality of perceptually coded audio signals. All of the algorithms used simulate the auditory system. Many experiments were carried out whose most important goal was to improve earlier published objective protocols. This goal was achieved, e.g., by changing the pitch scale, FFT settings, or other parameters of the hearing model. Suggested changes to the internal parameters of signal processing that improve the results of objective evaluation are presented in this text. The criterion of optimization is the difference between the results of subjective (taken as a reference) and objective evaluation. A comparison between the original objective scores and the new ones obtained after model parameter tuning is presented.
... The approximated impulse response is based on the exact gammatone implementation as described by Slaney (1997). The parameter values for computation of the ERB bandwidth stem from Glasberg and Moore (1990). To include further nonlinearities of cochlear processing, specifically the transduction properties of inner hair cells, we then applied half-wave rectification and compression with a 6/10 power law to the output of the gammatone filterbank. ...
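The described inner-hair-cell stage reduces to two operations; a minimal sketch, applied sample-wise to each gammatone channel:

```python
import numpy as np

def ihc_transduction(x):
    """Half-wave rectification followed by 6/10 power-law compression."""
    return np.maximum(x, 0.0) ** 0.6
```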
Article
Full-text available
Models of phonology posit a hierarchy of prosodic units that is relatively independent from syntactic structure, requiring its own parsing. It remains unexplored how this prosodic hierarchy is represented in the brain. We investigated this foundational question by means of an electroencephalography (EEG) study. Thirty young adults listened to German sentences containing manipulations at different levels of the prosodic hierarchy. Evaluating speech-to-brain cortical entrainment and phase-amplitude coupling revealed that prosody’s hierarchical structure is maintained at the neural level during spoken language comprehension. The faithfulness of this tracking varied as a function of the hierarchy’s degree of intactness as well as systematic interindividual differences in audio-motor synchronization abilities. The results underscore the role of complex oscillatory mechanisms in configuring the continuous and hierarchical nature of the speech signal and situate prosody as a structure indispensable from theoretical perspectives on spoken language comprehension in the brain.
... The EPs are normally calculated and plotted with the gain of each auditory filter set equal to 0 dB at its CF. For example, a tone at 1 kHz with a sound pressure level (SPL) of 60 dB will cause an excitation level of 60 dB at 1 kHz [13,15,16]. ...
Preprint
Noise-induced hearing loss (NIHL), one of the major avoidable occupation-related health issues, has been studied for decades. To assess NIHL, the excitation pattern (EP) has been considered one of the mechanisms for estimating movements of the basilar membrane (BM) in the cochlea. In this study, two auditory filters, the dual resonance nonlinear (DRNL) filter and the rounded-exponential (ROEX) filter, have been applied to create two EPs, referred to as the velocity EP and the loudness EP, respectively. Two noise hazard metrics are also proposed based on the developed EPs to evaluate hazardous levels caused by different types of noise. Moreover, Gaussian noise and pure-tone noise have been simulated to evaluate the performance of the developed EPs and noise metrics. The results show that both developed EPs can reflect the responses of the BM to different types of noise. For Gaussian noise, there is a frequency shift between the velocity EP and the loudness EP. For pure-tone noise, both EPs can reflect the frequencies of the input noise accurately. The results suggest that both EPs can potentially be used for the assessment of NIHL.
... A gammatone filterbank [36] that includes J = 28 filters linearly spaced on the ERB-rate scale [37] between 100 Hz and 6500 Hz is used to obtain X t and Y t according to (2). A sequence of stacked vectors for the clean speech is then formed by stacking K = 15 consecutive vectors: ...
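A sketch of the stacking step, assuming X holds the J = 28 filter outputs frame by frame (hop and padding conventions are not given in the excerpt):

```python
import numpy as np

def stack_frames(X, K=15):
    """Stack K consecutive feature vectors (columns of X) into single vectors."""
    J, T = X.shape  # J filter channels x T time frames
    return np.stack([X[:, t:t + K].reshape(-1) for t in range(T - K + 1)], axis=1)
```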
Preprint
We propose a monaural intrusive instrumental intelligibility metric called speech intelligibility in bits (SIIB). SIIB is an estimate of the amount of information shared between a talker and a listener in bits per second. Unlike existing information theoretic intelligibility metrics, SIIB accounts for talker variability and statistical dependencies between time-frequency units. Our evaluation shows that relative to state-of-the-art intelligibility metrics, SIIB is highly correlated with the intelligibility of speech that has been degraded by noise and processed by speech enhancement algorithms.
Article
Full-text available
Many elderly listeners have difficulties with speech-in-noise perception, even if auditory thresholds in quiet are normal. The mechanisms underlying this compromised speech perception with age are still not understood. For identifying the physiological causes of these age-related speech perception difficulties, an appropriate animal model is needed enabling the use of invasive methods. In a comparative behavioral study, we used young-adult and quiet-aged Mongolian gerbils as well as young and elderly human subjects to investigate age-related changes in the discrimination of speech sounds in background noise, evaluating whether gerbils are an appropriate animal model for the age-related decline in speech-in-noise processing of human listeners. Gerbils and human subjects had to report a deviant consonant-vowel-consonant combination (CVC) or vowel-consonant-vowel combination (VCV) in a sequence of CVC or VCV standards, respectively. The logatomes were spoken by different speakers and masked by a steady-state speech-shaped noise. Response latencies were measured to generate perceptual maps employing multidimensional scaling, visualizing the subjects' internal representation of the sounds. By analyzing response latencies for different types of vowels and consonants, we investigated whether aging had similar effects on the discrimination of speech sounds in background noise in gerbils compared to humans. For evaluating peripheral auditory function, auditory brainstem responses and audiograms were measured in gerbils and human subjects, respectively. We found that the overall phoneme discriminability in gerbils was independent of age, whereas consonant discriminability declined with age in humans. Response latencies were generally longer in aged than in young gerbils and humans, respectively. Response latency patterns for the discrimination of different vowel or consonant types were different between species, but both gerbils and humans made use of the same articulatory features for phoneme discrimination. The species-specific response latency patterns were mostly unaffected by age across vowel types, while there were differential aging effects on the species-specific response latency patterns of different consonant types.
Article
Aim: Anthropogenic noise is a global pollutant that threatens biodiversity. However, we currently lack effective methods to assess and compare the impacts of anthropogenic noise on extended terrestrial species. This can be critical for the majority of species that lack conservation attention and empirical measurements.
Location: Global. Time Period: 1963–2023. Major Taxa Studied: Bats.
Methods: We leverage the conserved mechanisms of how the vertebrate brain processes sound in noise to propose a simple sensation metric, the masking potential. To illustrate its usage, we assessed the effects of highway traffic noise on bats, which are a species-rich, important, yet under-represented mammalian lineage vulnerable to human disturbances. We first applied masking potential to a global dataset of bats to test whether auditory masking is an important explanation for bats' vulnerability to highway traffic noise. We calculated the impact ranges of highway traffic noise on bat species with audiograms. Then, we compared the predicted impact ranges with empirical measurements reported in the literature.
Results: We show that auditory masking of both target echoes and social communication calls represents an important explanation for bats' sensitivity to highway traffic noise. The masking potential predicted maximum impact ranges (i.e., the distance beyond which animals are not impacted) of a median of 40 m for 71 species of bats, 614 m for the common marmoset, 1118 m for the great tit, and 1430 m for the budgerigar. The maximum impact ranges predicted by masking potential were supported by empirical measurements which yet remain scarce, stressing the value of masking potential for applied wildlife conservation.
Main Conclusions: We propose that masking potential is a simple sensation metric that can help assess noise effects on diverse terrestrial species. This metric bears implications for real-world conservation practice and can be particularly useful to most wildlife species that lack conservation attention.
Article
In forward masking, the detection threshold for a target sound (probe) is elevated due to the presence of a preceding sound (masker). Although many factors are known to influence the probe response following a masker, the current work focused on the temporal separation (delay) between the masker and probe and the inter-trial interval (ITI). Human probe thresholds recover from forward masking within 150–300 ms, similar to neural threshold recovery in the inferior colliculus (IC) within 300 ms after tone maskers. Our study focused on the recovery of discharge rate of IC neurons in response to probe tones after narrowband Gaussian noise (GN) forward maskers, with varying time delays. Additionally, we examined how prior masker trials influenced IC rates by varying ITI. Previous masker trials affected probe-evoked discharge rates, with full recovery requiring ITIs over 1.5 s after 70 dB SPL narrowband GN maskers. Neural thresholds in the IC for probes preceded by noise maskers were in the range observed in psychoacoustical studies. Two proposed mechanisms for forward masking, persistence, and efferent gain control, were tested using rate analyses or computational modeling. A physiological model with efferent feedback gain control had responses consistent with trends in the physiological recordings.
Article
This study integrates a non-linear inner hair cell model (IHC) into the computational auditory signal processing and perception (CASP) model [Jepsen, Ewert, and Dau (2008). J. Acoust. Soc. Am. 124(1), 422–438]. The integration addresses limitations of its more simplistic predecessor that did not reflect the saturation of the IHC transduction process towards high sound pressure levels. While exhibiting distinct processing mechanisms compared to the original model, the revised model maintains predictive power across conditions of intensity discrimination, simultaneous and forward masking, and modulation detection, effectively accounting for data from normal-hearing listeners. Additional updates and refinements to the model are introduced in response to the changes produced by the additional compressive non-linearity and to improve its usability. Overall, the revised CASP model offers a more accurate and intuitive framework for simulating auditory processing and perception under diverse conditions and tasks. This enhanced version may be particularly valuable for studying the influence of the ear's nonlinear response properties on internal auditory representations, including the effects of sensorineural hearing loss on auditory perception.
Preprint
Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-intensive manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception-guided MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, to simulate cochlear functions, which encodes acoustic signals into biologically aligned electrochemical representations. Secondly, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention architecture, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. Our code and checkpoint will be available in a public repository upon publication.
Article
Visual-to-auditory substitution devices convert visual images into soundscapes. They are intended for use by blind people in everyday situations with various obstacles that need to be localized simultaneously, as well as irrelevant objects that must be ignored. It is therefore important to establish the extent to which substitution devices make it possible to localize obstacles in complex scenes. In this study, we used a substitution device that combines spatial acoustic cues and pitch modulation to convey spatial information. Nineteen blindfolded sighted participants had to point at a virtual target that was displayed alone or among distractors to evaluate their ability to perform a localization task in minimalist and complex virtual scenes. The spatial configuration of the scene was manipulated by varying the number of distractors and their spatial arrangement relative to the target. While elevation localization abilities were not impaired by the presence of distractors, the ability to localize the azimuth of the target was modulated when a large number of distractors were displayed at the same elevation as the target. The elevation localization performance tends to confirm that pitch modulation is effective to convey elevation information with the device in various spatial configurations. Conversely, the impairment to azimuth localization seems to result from segregation difficulties that arise when the spatial configuration of the objects does not allow pitch segregation. This must be considered in the design of substitution devices in order to help blind people correctly evaluate the risks posed by different situations.
Article
Animals have evolved complex auditory systems to extract acoustic information from natural environmental noise, yet they are challenged by rising levels of novel anthropogenic noise. Songbirds adjust their vocal production in response to increasing noise, but auditory processing of signals in noise remains understudied. Auditory processing characteristics, including auditory filter bandwidth, filter efficiency, and critical ratios (level-independent signal-to-noise ratios at threshold) likely influence auditory and behavioral responses to noise. Here, we investigated the effects of noise on auditory processing in three songbird species (black-capped chickadees, tufted titmice, and white-breasted nuthatches) that live in mixed-species flocks and rely on heterospecific communication to coordinate mobbing behaviors. We determined masked thresholds and critical ratios from 1-4 kHz using auditory evoked potentials. We predicted that nuthatches would have the lowest critical ratios given that they have narrowest filters, followed by titmice and then chickadees. We found that nuthatches had the greatest sensitivity in quiet conditions, but the highest critical ratios, suggesting their auditory sensitivity is highly susceptible to noise. Titmice had the lowest critical ratios, suggesting relatively minor impacts of noise on their auditory processing. This is not consistent with predictions based on auditory filter bandwidth, but is consistent with both recent behavioral findings and predictions made by auditory filter efficiency measures. Detrimental effects of noise were most prevalent in the 2-4 kHz range, frequencies produced in vocalizations. Our results using the critical ratio as a measure of processing in noise suggest that low levels of anthropogenic noise may influence these three species differently.
Preprint
Full-text available
Navigating complex sensory environments is critical to survival, and brain mechanisms have evolved to cope with the wide range of surroundings we encounter. To determine how listeners learn the statistical properties of acoustic spaces, we assessed their ability to perceive speech in a range of noisy and reverberant rooms. Listeners were also exposed to repetitive transcranial magnetic stimulation (rTMS) to disrupt activity in the dorsolateral prefrontal cortex (dlPFC), a region believed to play a role in statistical learning. Our data suggest listeners rapidly adapt to statistical characteristics of an environment to improve speech understanding. This ability is impaired when rTMS is applied bilaterally to the dlPFC. The data demonstrate that speech understanding in noise is best when exposed to a room with reverberant characteristics common to human-built environments, with performance declining for higher and lower reverberation times, including fully anechoic (non-reverberant) environments. Our findings provide evidence for a reverberation sweet spot and the presence of brain mechanisms that might have evolved to cope with the acoustic characteristics of listening environments encountered every day.
Article
Full-text available
This study investigates the use of intonation patterns to communicate narrow and broad focus in the absolute interrogative (yes-no) questions of Montevideo Spanish. To our knowledge, it is the first study to explicitly and empirically examine focal intonation in this variety of Spanish. The six participants produced four absolute interrogative questions, each containing three lexically stressed words; the questions differed in which word was focused. Pitch accents and boundary tones were analyzed with the Sp_ToBI system following Estebas & Prieto (2008). The results showed variation within each speaker. Peak alignment differs from that of Buenos Aires Spanish, a variety similar to that of Montevideo. Likewise, the statistical analysis of peak and valley heights indicates a tendency, although not a statistically significant one, toward a longer pitch accent on the word with narrow focus.
Article
Multilingual phone recognition models can learn language-independent pronunciation patterns from large volumes of spoken data and recognize them across languages. This potential can be harnessed to improve speech technologies for underresourced languages. However, these models are typically trained on phonological representations of speech sounds, which do not necessarily reflect the phonetic realization of speech. A mismatch between a phonological symbol and its phonetic realizations can lead to phone confusions and reduce performance. This work introduces formant-based vowel categorization aimed at improving cross-lingual vowel recognition by uncovering a vowel's phonetic quality from its formant frequencies, and reorganizing the vowel categories in a multilingual speech corpus to increase their consistency across languages. The work investigates vowel categories obtained from a trilingual multi-dialect speech corpus of Danish, Norwegian, and Swedish using three categorization techniques. Cross-lingual phone recognition experiments reveal that uniting vowel categories of different languages into a set of shared formant-based categories improves cross-lingual recognition of the shared vowels, but also interferes with recognition of vowels not present in one or more training languages. Cross-lingual evaluation on regional dialects provides inconclusive results. Nevertheless, improved recognition of individual vowels can translate to improvements in overall phone recognition on languages unseen during training.
Article
Pitch perception affects children's ability to perceive speech, appreciate music, and learn in noisy environments, such as their classrooms. Here, we investigated pitch perception for pure tones as well as resolved and unresolved complex tones with a fundamental frequency of 400 Hz in 8- to 11-year-old children and adults. Pitch perception in children was better for resolved relative to unresolved complex tones, consistent with adults. The younger 8- to 9-year-old children had elevated thresholds across all conditions, while the 10- to 11-year-old children had comparable thresholds to adults.
Conference Paper
Full-text available
This paper presents research on objective methods that use psychoacoustic knowledge to estimate the quality of audio signals. Software written specifically for this research is presented. The program allows different methods for evaluating the quality of perceptually coded audio signals to be implemented; the PAQM, PSQM, NMR, PEAQ, and PESQ protocols are ready to use. All of these algorithms simulate the auditory system. The software is open to the addition of further protocols as plug-ins, and earlier protocols can be changed and improved. Suggested changes that improve the results of objective evaluation are presented. The optimization criterion is the difference between the results of subjective and objective evaluation tests.
Conference Paper
Full-text available
This paper presents further results from continuing research on objective methods that use psychoacoustic knowledge to estimate the quality of audio signals. The software written specifically for this research is presented. The program allows different published methods for evaluating the quality of perceptually coded audio signals to be implemented; the PAQM, PSQM, NMR, PEAQ, and PESQ protocols have been implemented. All of these algorithms simulate the auditory system. The software is open to the addition of further protocols as plug-ins, and previously published protocols can be changed and improved. In earlier work the authors proposed ways to improve objective protocols, e.g., by changing the pitch scale. A suggested adjustment of the internal signal-processing parameters, which improves the results of objective evaluation, is presented. The optimization criterion is the difference between the results of subjective and objective evaluation.
Article
Profile-analysis experiments measure the ability to discriminate complex sounds based on patterns, or profiles, in their amplitude spectra. Studies of profile analysis have focused on normal-hearing listeners and target frequencies near 1 kHz. To provide more insight into underlying mechanisms, we studied profile analysis over a large target frequency range (0.5–4 kHz) and in listeners with both normal and elevated audiometric thresholds. We found that profile analysis degrades at high frequencies and that the effect of spacing between nearby frequency components differs with frequency. Consistent with prior reports, elevated audiometric thresholds were not associated with impaired performance when stimuli consisted of few distantly spaced frequency components. However, elevated audiometric thresholds were associated with elevated profile-analysis thresholds for stimuli composed of many closely spaced frequency components. Behavioral thresholds from listeners with and without hearing loss were predicted by decoding firing rates from simulated auditory-nerve fibers or simulated modulation-sensitive inferior-colliculus neurons. Although responses from both model stages informed some aspects of the behavioral data, only population decoding of inferior-colliculus responses accounted for the worsening of profile-analysis thresholds at high target frequencies. Collectively, these results suggest that profile analysis involves multiple non-peripheral factors, including multichannel comparisons and midbrain tuning to amplitude modulation.
Article
Full-text available
Purpose Auditory perceptual and cognitive tasks can be useful as a long-term goal in guiding rehabilitation and intervention strategies in audiology clinics that mostly operate at a faster pace and on strict timelines. The rationale of this study was to assess test–retest reliability of an abbreviated test battery and evaluate age-related auditory perceptual and cognitive effects on these measures. Method Experiment 1 evaluated the test–retest repeatability of an abbreviated test battery and its use in an adverse listening environment. Ten participants performed two visits, each including four conditions: quiet, background noise, external noise, and background mixed with external noise. In Experiment 2, both auditory perceptual and cognitive assessments were collected from younger adults with normal hearing and older adults with and without hearing loss. The full test battery included measures of frequency selectivity, temporal fine structure and envelope processing, spectrotemporal and spatial processing and cognition, and an external measure of tolerance to background noise. Results Results from Experiment 1 showed good test–retest repeatability and nonsignificant effects from background or external noise. In Experiment 2, effects of age and hearing loss were shown across auditory perceptual and cognitive measures, except in measures of temporal envelope perception and tolerance to background noise. Conclusions These data support the use of an abbreviated test battery in relatively uncontrolled listening environments such as clinic waiting rooms. With an efficient test battery, perceptual and cognitive deficits can be assessed with minimal resources and little clinician involvement due to the automated nature of the test and the use of consumer-grade technology. Supplemental Material https://doi.org/10.23641/asha.28021070
Conference Paper
Automated cough sound segmentation is important for the objective analysis of cough sounds. While various cough sound segmentation algorithms have been proposed over the years, it is not clear how these algorithms perform in the presence of background noise, which can vary in intensity across different environments. Therefore, in this study, we evaluate the performance of cough sound segmentation algorithms in the presence of background noise. Specifically, we examine algorithms employing conventional feature engineering and machine learning methods, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and a combination of CNNs and RNNs. These algorithms are developed using relatively clean cough signals but evaluated under both clean and noisy conditions. The results indicate that, while the performance of all algorithms declined in the presence of background noise, the combination of CNNs and RNNs yielded the best cough segmentation results under both clean and noisy conditions. These findings can contribute to the development of noise-robust cough sound segmentation algorithms for objective cough sound analysis in noisy conditions.
Preprint
Neural activity in auditory cortex tracks the amplitude envelope of continuous speech, but recent work counter-intuitively suggests that neural tracking increases when speech is masked by background noise, despite reduced speech intelligibility. Noise-related amplification could indicate that stochastic resonance – the response facilitation through noise – supports neural speech tracking. However, a comprehensive account of the sensitivity of neural tracking to background noise, and of the role of cognitive investment, is lacking. In five electroencephalography (EEG) experiments (N = 109; both sexes), the current study demonstrates a generalized enhancement of neural speech tracking due to minimal background noise. Results show that a) neural speech tracking is enhanced for speech masked by background noise at very high SNRs (∼30 dB SNR), where speech is highly intelligible; b) this enhancement is independent of attention; c) it generalizes across different stationary background maskers, but is strongest for 12-talker babble; and d) it is present for headphone and free-field listening, suggesting that the neural-tracking enhancement generalizes to real-life listening. The work paints a clear picture that minimal background noise enhances the neural representation of the speech envelope, suggesting that stochastic resonance contributes to neural speech tracking. The work further highlights non-linearities of neural tracking induced by background noise that make its use as a biological marker for speech processing challenging.
Article
Full-text available
One of the most important features of the auditory system is its action as a frequency analyser. The frequency analysis appears to have its basis in the mechanical patterns of vibration on the basilar membrane (BM). Its properties can be measured psychophysically using masking experiments and the results explained using the concept of the auditory filter (critical bandwidth). A method of measuring the auditory filter shape at a particular centre frequency is described. This method is based upon the power-spectrum model of masking which assumes: 1) when detecting a signal in a masker the observer uses the single filter giving the highest signal-to-masker ratio; 2) threshold corresponds to a fixed signal-to-masker ratio at the output of that filter. The variation of the auditory filter bandwidth with centre frequency is described and related to measurements of the frequency-position map on the BM in man. The equivalent rectangular bandwidth (ERB) of the auditory filter corresponds approximately to a constant distance of 0.9 mm on the BM. Changes in the auditory filter shape with level are described and are shown to correspond, at least qualitatively, to input-output functions measured on the BM and in single neurones of the auditory nerve. Finally, a method is described for deriving the excitation pattern of a sound from its power spectrum, using the results of auditory-filter measurements. The excitation pattern derived in this way probably corresponds to the distribution of excitation along the BM.
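The two assumptions of the power-spectrum model translate directly into a small computation: the predicted threshold is a fixed ratio K above the masker power passed by a single auditory filter. A minimal sketch in Python; the Gaussian filter shape, masker spectrum, and value of K are illustrative placeholders, not values from this work:

```python
import numpy as np

# Power-spectrum model sketch: threshold = K (dB) + 10*log10 of the masker
# power at the filter output. All stimulus and filter values are toy numbers.

def predicted_threshold_db(freqs_hz, masker_density, filter_weighting, k_db=3.0):
    df = freqs_hz[1] - freqs_hz[0]
    # Integral of N(f) * W(f) df over frequency
    noise_power = np.sum(masker_density * filter_weighting) * df
    return k_db + 10.0 * np.log10(noise_power)

freqs = np.linspace(100.0, 4000.0, 3901)               # 1-Hz spacing
notch = (freqs > 800.0) & (freqs < 1200.0)             # notch around 1 kHz
masker = np.where(notch, 0.0, 1e-6)                    # flat density outside the notch
filt = np.exp(-0.5 * ((freqs - 1000.0) / 120.0) ** 2)  # toy filter at 1 kHz
print(predicted_threshold_db(freqs, masker, filt))     # threshold falls as the notch widens
```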
Chapter
From experiments in psycho-acoustics several basic questions relating stimulation and sensation can be answered. These questions, or the problems they deal with, can be grouped into four categories: 1. threshold (absolute or differential, the latter also called the just-noticeable difference, JND); 2. equality of some aspect of perceived sounds; 3. order or similarity; 4. equality of intervals or ratios.
Article
Auditory filter shapes were derived from notched-noise masking data at center frequencies of 8 kHz (for three spectrum levels, N0 = 20, 35, and 50 dB) and 10 kHz (N0 = 50 dB). In order to minimize variability due to earphone placement, insert earphones (Etymotic Research ER2) were used and individual earmolds were made for each subject. These earphones were designed to give a flat frequency response at the eardrum for frequencies up to 14 kHz. The filter shapes were derived under the assumption that a frequency-dependent attenuation was applied to all stimuli before reaching the filter; this attenuation function was estimated from the variation of absolute threshold with frequency for the three youngest normally hearing subjects in our experiments. At 8 kHz, the mean equivalent rectangular bandwidths (ERBs) of the filters derived from the individual data for three subjects were 677, 637, and 1011 Hz for N0 = 20, 35, and 50 dB, respectively. The filters at N0 = 50 dB were roughly symmetrical, while, at the lower spectrum levels, the low-frequency skirt was steeper than the high-frequency skirt. The mean ERB at 10 kHz was 957 Hz. At this frequency, the filters for two subjects were steeper on the high-frequency side than the low-frequency side, while the third subject showed a slight asymmetry in the opposite direction.
Article
Tones were delivered directly to the stapes in anesthetized cats after removal of the tympanic membrane, malleus, and incus. Measurements were made of the complex amplitudes of the sound pressure on the stapes (P_S), stapes velocity (V_S), and sound pressure in the vestibule (P_V). From these data, the acoustic impedance of the stapes and cochlea, Z_SC = P_S/U_S, and of the cochlea alone, Z_C = P_V/U_S, were computed, where U_S is the volume velocity of the stapes (U_S = V_S × area of the stapes footplate). Some measurements were made on modified preparations in which (1) holes were drilled into the vestibule and scala tympani, (2) the basal end of the basilar membrane was destroyed, (3) cochlear fluid was removed, or (4) static pressure was applied to the stapes. For frequencies between 0.5 and 5 kHz, Z_SC ≈ Z_C; this impedance is primarily resistive (|Z_C| ≈ 1.2×10^6 dyn·s/cm^5) and is determined by the basilar membrane and cochlear fluids. For frequencies below 0.3 kHz, |Z_SC| ≳ |Z_C| and Z_SC is primarily determined by the stiffness of the annular ligament; drying of the ligament or changes in the static pressure difference across the footplate can produce large changes in |Z_SC|. For frequencies below 30 Hz, Z_C is apparently controlled by the stiffness of the round-window membrane. All of the results can be represented by a network of eight lumped elements in which some of the elements can be associated with specific anatomical structures. Computations indicate that for the cat the sound pressure at the input to the cochlea at behavioral threshold is constant between 1 and 8 kHz, but increases as frequency is decreased below 1 kHz. Apparently, mechanisms within the cochlea (or more centrally) have an important influence on the frequency dependence of behavioral threshold at low frequencies.
Article
Masked audiograms were used to measure critical bandwidth. On the assumption that critical bands represent equal distances on the basilar membrane and that critical bandwidth increases exponentially with distance from the helicotrema, functions were derived which (1) relate critical bandwidth to frequency and to position on the basilar membrane and (2) relate position of maximum amplitude to frequency. The functions are consistent with Békésy's optical observations and Mayer's psychophysical data. The frequency-position function is f = A(10^(ax) − 1). The coefficient a is numerically identical with the coefficient in the exponential function fitting Békésy's elasticity data. Functions of this form fit data from seven other species, and the values of the coefficient a seem related to their respective elasticity functions. The interpretation of critical bandwidth as the frequency interval over which the cochlea sums power is supported by data of Mayer, and the hypothesis that critical bands represent equal distances on the basilar membrane is strengthened, one critical band corresponding to one millimeter. Problems for cochlear theory are posed (1) by the apparent equivalence of critical bandwidth, the derivative of the frequency-position function, and the frequency interval over which spatial integration takes place, and (2) by the proportionality of these three at a given point to the compliance at that point.
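For concreteness, the frequency-position function can be evaluated directly. A small Python sketch; the constants are commonly used human-like values (A ≈ 165.4 Hz, a ≈ 0.06 per mm of basilar membrane) and are assumptions for illustration, not parameters reported here:

```python
import math

# Frequency-position function f = A * (10**(a*x) - 1), with x the distance
# in mm from the helicotrema (apex). A and a are assumed human-like values.
A_HZ = 165.4
A_PER_MM = 0.06

def frequency_at_position(x_mm: float) -> float:
    """Characteristic frequency (Hz) at distance x_mm from the apex."""
    return A_HZ * (10.0 ** (A_PER_MM * x_mm) - 1.0)

def position_at_frequency(f_hz: float) -> float:
    """Inverse map: place (mm from apex) for a given frequency."""
    return math.log10(f_hz / A_HZ + 1.0) / A_PER_MM

print(frequency_at_position(35.0))    # ~20.7 kHz near the basal end
print(position_at_frequency(1000.0))  # ~14.1 mm from the apex
```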
Article
The auditory filter may be considered as a weighting function representing frequency selectivity at a particular centre frequency. Its shape can be derived using the power-spectrum model of masking which assumes: (1) in detecting a signal in a masker the observer uses the single auditory filter giving the highest signal-to-masker ratio; (2) threshold corresponds to a fixed signal-to-masker ratio at the output of that filter. Factors influencing the choice of a masker to measure the auditory filter shape are discussed. Narrowband maskers are unsuitable for this purpose, since they violate the assumptions of the power-spectrum model. A method using a notched-noise masker is recommended, and typical results using that method are presented. The variation of the auditory filter shape with centre frequency and with level, and the relationship of the auditory filter shape and the excitation pattern are described. A method of calculating the excitation pattern of any sound as a function of level is presented, and examples and applications are given. The appendix gives a Fortran program for calculating excitation patterns.
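The excitation-pattern method described here amounts to evaluating the output of the auditory filter as a function of its centre frequency for a given stimulus power spectrum. A minimal Python sketch, substituting a toy Gaussian filter with a bandwidth of 10% of the centre frequency for the derived filter shapes:

```python
import numpy as np

# Excitation pattern sketch: for each centre frequency, sum the stimulus
# power weighted by a filter centred there. Filter shape and bandwidth are
# placeholders, not the shapes derived from notched-noise data.

def excitation_pattern_db(stim_freqs_hz, stim_power, centre_freqs_hz):
    levels = []
    for fc in centre_freqs_hz:
        bw = 0.1 * fc                                        # toy bandwidth
        w = np.exp(-0.5 * ((stim_freqs_hz - fc) / bw) ** 2)  # toy filter
        levels.append(10.0 * np.log10(np.sum(w * stim_power) + 1e-12))
    return np.array(levels)

# A single 1-kHz partial seen through filters centred from 0.5 to 2 kHz:
fcs = np.linspace(500.0, 2000.0, 7)
print(excitation_pattern_db(np.array([1000.0]), np.array([1.0]), fcs))
```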
Article
The critical‐band rate as well as the critical bandwidth are functions of frequency. These dependencies have been given in table form. For effective use of these values in computers, relatively simple equations are given to express the dependence of critical‐band rate on frequency with an accuracy better than 0.2 Bark and that of critical bandwidth on frequency with an accuracy better than 10% over the whole auditory range.
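The analytical expressions in question are commonly quoted in the following form (f in kHz); the constants below are the widely cited versions and should be checked against the paper itself:

```python
import math

# Commonly quoted analytic approximations for critical-band rate (Bark)
# and critical bandwidth (Hz) as functions of frequency in kHz.

def critical_band_rate_bark(f_khz: float) -> float:
    """Critical-band rate z in Bark as a function of frequency."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def critical_bandwidth_hz(f_khz: float) -> float:
    """Critical bandwidth in Hz as a function of frequency."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

print(critical_band_rate_bark(1.0))  # ~8.5 Bark at 1 kHz
print(critical_bandwidth_hz(1.0))    # ~162 Hz at 1 kHz
```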
Article
Eardrum pressures at hearing threshold have been calculated from both earphone data (ISO R389-1964 and ANSI S3.6-1969) and free-field data (ISO R226-1961). When head diffraction, external-ear resonance, and an apparent flaw in ISO R226 are accounted for in the free-field data, and real-ear versus coupler differences and physiological noise are accounted for in the earphone data, the agreement between the two derivations is good. At the audiometric frequencies of 125, 250, 500, 1000, 2000, 4000, and 8000 Hz, the estimated eardrum pressures at absolute threshold are 30, 19, 12, 9, 15, 13, and 14 dB SPL, respectively. Except for the effects of physiological noise at low frequencies, no evidence of the "missing 6 dB" is seen, an observation consistent with the experimental results of several recent studies.
Article
The threshold for a sinusoidal signal (1, 2, and 4 kHz) centered in the 'notch' of a broadband masker was determined as a function of notch width for five noise spectrum levels (10, 20, 30, 40, and 50 dB SPL). For narrow notch widths the signal-to-noise ratio at threshold remains constant as a function of level, which according to the critical-ratio hypothesis implies an auditory filter of constant bandwidth. For wide notch widths the signal-to-noise ratio at threshold increases as a function of level and this implies an auditory filter of increasing bandwidth. If the estimates of the filter bandwidth obtained for wide notch widths are used to predict thresholds for broadband noise, the corresponding signal-to-noise ratios will increase as a function of noise spectrum level. The predicted increase in signal-to-noise ratio is very small, however, and provides a good description of most available data. In fact, the predicted increase equals that observed by Reed and Bilger. The increase in filter bandwidth has significant consequences only when the signal and masker are widely separated in frequency; for other conditions, the assumption of a constant filter permits accurate predictions of performance.
Article
The masker was 'rippled noise', with a power spectrum (intensity on a linear frequency scale) shaped according to a sinusoidal function. The test signal was a pure tone. The masking effectiveness of the rippled noise depends on the position of its peaks and troughs relative to the test-tone frequency, but this dependence decreases for high-ripple densities (thus, for ripples with small peak-to-peak distances along the frequency scale). The results of masking experiments as a function of the position and the density of the ripple around the test-tone frequency allow an estimation of the degree of auditory frequency resolution in terms of a filter characteristic. Several masking paradigms were applied: direct masking, forward masking, and pulsation threshold. It was found that the degree of frequency resolution estimated both from pulsation-threshold data and from forward-masking data is substantially higher than that estimated from direct-masking data. The difference is about a factor of two when expressed in terms of the bandwidth of the auditory filter. This difference is interpreted as reflecting spectral sharpening by lateral suppression which apparently manifests itself only in threshold measurements where masker and test tone are presented nonsimultaneously.
Article
A wide-band noise having a deep notch with sharp edges was used to mask a tone. The notch was centered on the tone, and threshold was measured as the width of the notch was increased from 0.0 to 0.8 times the tone frequency (0.5, 1.0, or 2.0 kHz). The spectrum level of the noise was 40 dB SPL. If it is assumed that the auditory filter is reasonably symmetric at these intensities, then the shape of the filter centered on the tone can be estimated from the first derivative of the curve relating tone threshold to the width of the notch in the noise. The 3-dB bandwidths of the filters obtained were about 0.13 of their center frequency. In the region of the passband, the Gaussian curve provides a good approximation to the shape of the derived filters. The equivalent rectangular bandwidths of the Gaussian approximations are about 0.20 of their center frequency, which is comparable to the critical-band estimates of E. Zwicker, G. Flottorp, and S. S. Stevens ["Critical bandwidth in loudness summation," J. Acoust. Soc. Am. 29, 548-557 (1957)]. The Gaussian approximation cannot be used outside the passband, because the tails of the derived filters do not fall as fast as the Gaussian curve.
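The derivative method described above can be illustrated numerically: converting thresholds to linear power units and differentiating with respect to notch width yields the filter's relative power response at each deviation from the tone. The threshold values below are invented solely to show the computation:

```python
import numpy as np

# Derivative method sketch: with a symmetric filter and a symmetric notch,
# the filter's power response at a given deviation from the tone is
# proportional to -d(threshold power)/d(notch width). Thresholds are made up.

notch_half_widths_hz = np.array([0.0, 50.0, 100.0, 200.0, 400.0, 800.0])
thresholds_db = np.array([60.0, 57.0, 52.0, 44.0, 36.0, 33.0])  # illustrative

threshold_power = 10.0 ** (thresholds_db / 10.0)
# Finite-difference estimate of the derivative, evaluated mid-interval:
filter_response = -np.diff(threshold_power) / np.diff(notch_half_widths_hz)
filter_response /= filter_response.max()   # normalize peak to 1
print(10.0 * np.log10(filter_response))    # filter response in dB re peak
```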
Article
Auditory-filter shapes were estimated in normally hearing subjects for signal frequencies (fs) of 100, 200, 400, and 800 Hz using the notched-noise method [R. D. Patterson and I. Nimmo-Smith, J. Acoust. Soc. Am. 67, 229-245 (1980)]. Two noise bands, each 0.4fs wide, were used; they were placed both symmetrically and asymmetrically about the signal frequency to allow the measurement of filter shape and asymmetry. Two overall noise levels were used: 77 and 87 dB SPL. In deriving the shapes of the auditory filters, account was taken of the nonflat frequency response of the Sennheiser HD424 earphone, and also of the frequency-dependent attenuation produced by the middle ear. The auditory filters were asymmetric; the upper skirt was steeper than the lower skirt. The asymmetry tended to be greater at the higher noise level. The equivalent rectangular bandwidths (ERBs) of the filters at the lower noise level had average values of 36, 47, 87, and 147 Hz for values of fs of 100, 200, 400, and 800 Hz, respectively. The standard deviations of the ERBs across subjects were typically about 10% of the ERB values. The signal-to-masker ratio at the output of the auditory filter required to achieve threshold increased markedly with decreasing fs.
Article
To examine the association between frequency resolution and speech recognition, auditory filter parameters and stop-consonant recognition were determined for 9 normal-hearing and 24 hearing-impaired subjects. In an earlier investigation, the relationship between stop-consonant recognition and the articulation index (AI) had been established on normal-hearing listeners. Based on AI predictions, speech-presentation levels for each subject in this experiment were selected to obtain a wide range of recognition scores. This strategy provides a method of interpreting speech-recognition performance among listeners who vary in magnitude and configuration of hearing loss by assuming that conditions which yield equal audible spectra will result in equivalent performance. It was reasoned that an association between frequency resolution and consonant recognition may be more appropriately estimated if hearing-impaired listeners' performance was measured under conditions that assured equivalent audibility of the speech stimuli. Derived auditory filter parameters indicated that filter widths and dynamic ranges were strongly associated with threshold. Stop-consonant recognition scores for most hearing-impaired listeners were not significantly poorer than predicted by the AI model. Furthermore, differences between observed recognition scores and those predicted by the AI were not associated with auditory filter characteristics, suggesting that frequency resolution and speech recognition may appear to be associated primarily because both are degraded by threshold elevation.
Article
The hearing thresholds of 37 young adults (18-26 years) were measured at 13 frequencies (8, 9, 10, ..., 20 kHz) using a newly developed high-frequency audiometer. All subjects were screened at 15 dB HL at the low audiometric frequencies, had tympanometry within normal limits, and had no history of significant hearing problems. The audiometer delivers sound from a driver unit to the ear canal through a lossy tube and earpiece providing a source impedance essentially equal to the characteristic impedance of the tube. A small microphone located within the earpiece is used to measure the response of the ear canal when an impulse is applied at the driver unit. From this response, a gain function is calculated relating the equivalent sound-pressure level of the source to the SPL at the medial end of the ear canal. For the subjects tested, this gain function showed a gradual increase from 2 to 12 dB over the frequency range. The standard deviation of the gain function was about 2.5 dB across subjects in the lower frequency region (8-14 kHz) and about 4 dB at the higher frequencies. Cross modes and poor fit of the earpiece to the ear canal prevented accurate calibration for some subjects at the highest frequencies. The average SPL at threshold was 23 dB at 8 kHz, 30 dB at 12 kHz, and 87 dB at 18 kHz. Despite the homogeneous nature of the sample, the younger subjects in the sample had reliably better thresholds than the older subjects. Repeated measurements of threshold over an interval as long as 1 month showed a standard deviation of 2.5 dB at the lower frequencies (8-14 kHz) and 4.5 dB at the higher frequencies.
Article
The shape of the auditory filter was estimated at three center frequencies, 0.5, 1.0, and 2.0 kHz, for five subjects with unilateral cochlear impairments. Additional measurements were made at 1.0 kHz using one subject with a unilateral impairment and six subjects with bilateral impairments. Subjects were chosen who had thresholds in the impaired ears which were relatively flat as a function of frequency and ranged from 15 to 70 dB HL. The filter shapes were estimated by measuring thresholds for sinusoidal signals (frequency f) in the presence of two bands of noise, 0.4 f wide, one above and one below f. The spectrum level of the noise was 50 dB (re 20 μPa) and the noise bands were placed both symmetrically and asymmetrically about the signal frequency. The deviation of the nearer edge of each noise band from f varied from 0.0 to 0.8 f. For the normal ears, the filters were markedly asymmetric for center frequencies of 1.0 and 2.0 kHz, the high-frequency branch being steeper. At 0.5 kHz, the filters were more symmetric. For the impaired ears, the filter shapes varied considerably from one subject to another. For most subjects, the lower branch of the filter was much less steep than normal. The upper branch was often less steep than normal, but a few subjects showed a near normal upper branch. For the subjects with unilateral impairments, the equivalent rectangular bandwidth of the filter was always greater for the impaired ear than for the normal ear at each center frequency. For three subjects at 0.5 kHz and one subject at 1.0 kHz, the filter had too little selectivity for its shape to be determined.
Article
This study quantified the effects of a physiological masker, the heartbeat, on the detectability of a low-frequency tone (100 Hz, 100-msec duration). Using a YES-NO signal detection paradigm, binaural sensitivity (d′) was examined as a function of the temporal location of the signal onset within the cardiac cycle. The independent variable was the temporal delay in signal onset following the subject's own EKG R wave. Signal delays were selected (0.0 to 0.8 sec in 0.1-sec steps) and presented to four subjects. Results reveal depressed sensitivity with signal delays of 0.0, 0.3, and 0.7 sec following the EKG R wave. In contrast, maximum sensitivity occurred near 0.5 sec following the R wave. The results were discussed in terms of physiological masking produced by valve closures within the heart.
Article
Threshold for a pulsed tone was measured as a function of its distance in frequency from the edge of a broad band of noise with very sharp skirts. Tone frequency was held constant at 0.5, 1.0, 2.0, 4.0, or 8.0 kHz while the position of the noise edge was varied about the frequency of the tone. The spectrum level of the noise was 40 dB. As expected, tone threshold decreased as the distance between the tone and the noise edge increased, and the rate of decrease was inversely related to tone frequency. The data were used in conjunction with a simple model of masking to derive an estimate of the shape of the auditory filter. A mathematical expression was found to describe the filter, and subsequently, this expression was used to predict the results reported by several other investigators.
Article
Recent estimates of auditory-filter shape are used to derive a simple formula relating the equivalent rectangular bandwidth (ERB) of the auditory filter to center frequency. The value of the auditory-filter bandwidth continues to decrease as center frequency decreases below 500 Hz. A formula is also given relating ERB-rate to frequency. Finally, a method is described for calculating excitation patterns from filter shapes.
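The formulas referred to here are commonly quoted in the following form (f in kHz); the constants below are the widely cited versions rather than a verbatim transcription of the paper:

```python
import math

# Commonly quoted formulas relating ERB and ERB-rate to frequency (f in kHz).

def erb_hz(f_khz: float) -> float:
    """Equivalent rectangular bandwidth of the auditory filter, in Hz."""
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52

def erb_rate(f_khz: float) -> float:
    """ERB-rate: number of ERBs below the given frequency."""
    return 11.17 * math.log((f_khz + 0.312) / (f_khz + 14.675)) + 43.0

print(erb_hz(1.0))    # ~128 Hz at 1 kHz
print(erb_rate(1.0))  # ~15.3 ERB units at 1 kHz
```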
Article
The free-field response of TDH 39 earphones, mounted in MX 41/AR cushions, is determined by loudness comparisons in an anechoic chamber. Based on these data, a passive equalizing network with two resonances at 720 and 6000 Hz is developed and realized. When used with this free-field equalizer, the earphone TDH 39 produces a free-field equivalent level independent of frequency within +/- 2.5 dB in the frequency range 100 Hz to 10 kHz. Because of the small differences between TDH 39 and TDH 49 earphones it is expected that the equalizer can also be successfully applied with the TDH 49.
Article
The frequency selectivity of the auditory system was measured by masking a sinusoidal signal (0.5, 2.0, or 4.0 kHz) or a filtered-speech signal with a wideband noise having a notch, or stopband, centered on the signal. As the notch was widened performance improved for both types of signal but the rate of improvement decreased as the age of the 16 listeners increased from 23 to 75 years, indicating a loss in frequency selectivity with age. Auditory filter shapes derived from the tone-in-noise data show (a) that the passband of the filter broadens progressively with age, and (b) that the dynamic range of the filter ages like the audiogram. That is, the range changes little with age before 55, but beyond this point there is an accelerating rate of loss. The speech experiment shows comparable but smaller effects. The filter-width measurements show that the critical ratio is a poor estimator of frequency selectivity because it confounds the tuning of the system with the efficiency of the signal-detection and speech-processing mechanisms that follow the filter. An alternative, one-point measure of frequency selectivity, which is both sensitive and reliable, is developed via the filter-shape model of masking.
Article
The phenomenon of off-frequency listening, and the asymmetry of the auditory filter, were investigated by performing a masking experiment in which a 2.0-kHz tonal signal (0.4 sec in duration) was masked by a pair of noise bands, one below and the other above the tone. The noise bands were 0.8-kHz wide. The edges of the bands were very sharp, the spectrum level in the band was 40 dB SPL, and the masker was on continuously throughout the experiment. Tone threshold was measured as a function of the distances from the tone to the nearer edge of each noise band. It was assumed that conditions in which one noise band was near the tone and the other remote from the tone would encourage the observer to listen off frequency, that is, to center his auditory filter, not at the tone frequency, but at the frequency that optimizes the signal-to-noise ratio at the output of the filter. The threshold data were analysed with a power spectrum model of masking in which it was assumed that the general form of the filter shape was a rounded exponential (a pair of back-to-back, negative exponentials with the peak smoothed and the tails raised). The specific filter shape obtained by applying this model to the threshold data has a broad passband (a 200-Hz, 3-dB bandwidth), steep skirts (slopes of 100 dB/octave) and shallower tails (slopes of 30-50 dB/octave) that take over 30-35 dB down from the peak of sensitivity. The filter is asymmetric, with the lower branch slightly broader than the upper. The filter is shifted off frequency by more than half its bandwidth in some cases, and the shift can improve the signal-to-noise ratio by up to 5.0 dB.
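The rounded-exponential family described above is often written, in its simplest one-parameter-per-side form, as W(g) = (1 + pg)exp(−pg), where g is the deviation from the centre frequency as a proportion of that frequency. A Python sketch with illustrative slope parameters, not the fitted values from the paper:

```python
import math

# Rounded-exponential ("roex") filter sketch: W(g) = (1 + p*g) * exp(-p*g),
# with separate slope parameters for the lower and upper branches to allow
# asymmetry. Slope values below are illustrative only.

def roex_weight(f_hz: float, fc_hz: float, p_lower: float, p_upper: float) -> float:
    """Power weighting of the filter at frequency f for centre frequency fc."""
    g = abs(f_hz - fc_hz) / fc_hz
    p = p_lower if f_hz < fc_hz else p_upper
    return (1.0 + p * g) * math.exp(-p * g)

# A filter at 2 kHz with a slightly shallower lower branch, as reported above:
for f in (1600.0, 1800.0, 2000.0, 2200.0, 2400.0):
    print(f, 10.0 * math.log10(roex_weight(f, 2000.0, p_lower=22.0, p_upper=28.0)))
```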
Article
Techniques were developed for measuring sound pressure in the cochlea with calibrated, liquid-filled, piezoelectric probe microphones. Sound pressures were measured in scala vestibuli and scala tympani in the basal turn in 25 cats for tones from 20–10 000 Hz. Control experiments indicated that intracochlear pressures were essentially uninfluenced by the measuring technique, and were conducted to the cochlea via the ossicular chain. Intracochlear pressures are linearly related to pressure at the tympanic membrane for tone levels at least as high as 105 dB SPL, and are relatively independent of depth of probe insertion in the scalae. The transfer ratio of sound pressure in scala vestibuli to that at the tympanic membrane increases in magnitude over the frequency range 50–1000 Hz to reach a maximum value of 15–30 dB, and decreases at higher frequencies, thus demonstrating that the middle ear provides a frequency-dependent pressure gain. At frequencies below 40 Hz, the pressures in scala vestibuli and scala tympani are approximately equal and are both determined by the round-window membrane compliance. At frequencies above 100 Hz, the round-window membrane impedance is small compared to the acoustic input impedance of the cochlea, and the pressure in scala vestibuli considerably exceeds that in scala tympani; consequently, the pressure difference across the cochlear partition is approximately equal to the pressure in scala vestibuli.